1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
|
[[analysis-common-grams-tokenfilter]]
=== Common Grams Token Filter
Token filter that generates bigrams for frequently occuring terms.
Single terms are still indexed. It can be used as an alternative to the
<<analysis-stop-tokenfilter,Stop
Token Filter>> when we don't want to completely ignore common terms.
For example, the text "the quick brown is a fox" will be tokenized as
"the", "the_quick", "quick", "brown", "brown_is", "is_a", "a_fox",
"fox". Assuming "the", "is" and "a" are common words.
When `query_mode` is enabled, the token filter removes common words and
single terms followed by a common word. This parameter should be enabled
in the search analyzer.
For example, the query "the quick brown is a fox" will be tokenized as
"the_quick", "quick", "brown_is", "is_a", "a_fox", "fox".
The following are settings that can be set:
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`common_words` |A list of common words to use.
|`common_words_path` |A path (either relative to `config` location, or
absolute) to a list of common words. Each word should be in its own
"line" (separated by a line break). The file must be UTF-8 encoded.
|`ignore_case` |If true, common words matching will be case insensitive
(defaults to `false`).
|`query_mode` |Generates bigrams then removes common words and single
terms followed by a common word (defaults to `false`).
|=======================================================================
Note, `common_words` or `common_words_path` field is required.
Here is an example:
[source,js]
--------------------------------------------------
index :
analysis :
analyzer :
index_grams :
tokenizer : whitespace
filter : [common_grams]
search_grams :
tokenizer : whitespace
filter : [common_grams_query]
filter :
common_grams :
type : common_grams
common_words: [a, an, the]
common_grams_query :
type : common_grams
query_mode: true
common_words: [a, an, the]
--------------------------------------------------
|