[[analysis-icu-plugin]]
== ICU Analysis Plugin

The http://icu-project.org/[ICU] analysis plugin provides Unicode
normalization, collation, folding, and tokenization. The plugin is called
https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu].

The plugin includes the following analysis components:

[float]
[[icu-normalization]]
=== ICU Normalization

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It
registers itself by default under `icu_normalizer` or `icuNormalizer`
using the default settings. The `name` parameter selects the normalization
form and accepts the following values: `nfc`, `nfkc`, and `nfkc_cf`.
Here is a sample setting:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
--------------------------------------------------
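
To select a specific normalization form, the `name` parameter can be set on a
custom filter of type `icu_normalizer`. The following is a minimal sketch,
modeled on the custom filter pattern used elsewhere on this page. The analyzer
name `nfkc_normalization` and the filter name `my_nfkc_normalizer` are
illustrative and are not registered by the plugin.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "nfkc_normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_nfkc_normalizer"]
                }
            },
            "filter" : {
                "my_nfkc_normalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfkc"
                }
            }
        }
    }
}
--------------------------------------------------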

[float]
[[icu-folding]]
=== ICU Folding

Folds Unicode characters based on `UTR#30`. It registers itself under the
names `icu_folding` and `icuFolding`.
The filter also lowercases, which means the `lowercase` filter can
normally be left out. Sample setting:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[icu-filtering]]
==== Filtering

The folding can be restricted to a set of Unicode characters with the
`unicodeSetFilter` parameter. This is useful for a non-internationalized
search engine that needs to retain a set of national characters which are
primary letters in a specific language. The UnicodeSet syntax is described
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].

The following example exempts Swedish characters from the folding. Note
that the exempted characters are NOT lowercased, which is why the
`lowercase` filter is added after the folding filter.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[icu-collation]]
=== ICU Collation

Uses the collation token filter. The collation can be configured either by
specifying custom rules (defined
http://www.icu-project.org/userguide/Collate_Customization.html[here])
with the `rules` parameter, which can hold the rules inline or point to a
file location (optionally relative to the config location), or by selecting a
locale with the `language` parameter (further specialized by country and
variant). By default it registers under `icu_collation` or `icuCollation`
and uses the default locale.

Here is a sample setting:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
--------------------------------------------------

And here is a sample of custom collation:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Options

[horizontal]
`strength`::
    The strength property determines the minimum level of difference considered significant during comparison.
     The default strength for the Collator is `tertiary`, unless specified otherwise by the locale used to create the Collator.
     Possible values: `primary`, `secondary`, `tertiary`, `quaternary` or `identical`.
 +
 See http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation] documentation for a more detailed
 explanation for the specific values.

`decomposition`::
    Possible values: `no` or `canonical`. Defaults to `no`. Setting this decomposition property with
    `canonical` allows the Collator to handle un-normalized text properly, producing the same results as if the text were
    normalized. If `no` is set, it is the user's responsibility to ensure that all text is already in the appropriate form
    before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between
    faster and more complete collation behavior. Since a great many of the world's languages do not require text
    normalization, most locales set `no` as the default decomposition mode.
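
The options are set on a custom filter of type `icu_collation`. The following
is a minimal sketch that extends the custom collation sample with `strength`
and `decomposition`. The values chosen here are only an illustration, not a
recommendation.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en",
                    "strength" : "primary",
                    "decomposition" : "canonical"
                }
            }
        }
    }
}
--------------------------------------------------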

[float]
==== Expert options

[horizontal]
`alternate`::
     Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for strength `quaternary`
     to be either shifted or non-ignorable, which boils down to ignoring punctuation and whitespace.

`caseLevel`::
    Possible values: `true` or `false`. Default is `false`. Whether case level sorting is required. When
     strength is set to `primary` this will ignore accent differences.

`caseFirst`::
    Possible values: `lower` or `upper`. Useful to control which case is sorted first when case is not ignored
    for strength `tertiary`.

`numeric`::
    Possible values: `true` or `false`. Defaults to `false`. Whether digits are sorted according to their numeric
    representation. For example, when enabled, the value `egg-9` is sorted before the value `egg-21`.

`variableTop`::
    Single character or contraction. Controls what is variable for `alternate`.

`hiraganaQuaternaryMode`::
    Possible values: `true` or `false`. Defaults to `false`. Whether to distinguish between Katakana and
    Hiragana characters at `quaternary` strength.
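
Expert options are set in the same place as the other collation options. As a
minimal sketch, the settings below enable `numeric` so that digit sequences
are compared by value. The analyzer name `numeric_collation` and the filter
name `numericCollator` are illustrative.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "numeric_collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["numericCollator"]
                }
            },
            "filter" : {
                "numericCollator" : {
                    "type" : "icu_collation",
                    "language" : "en",
                    "numeric" : true
                }
            }
        }
    }
}
--------------------------------------------------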

[float]
=== ICU Tokenizer

Breaks text into words according to http://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "tokenized" : {
                    "tokenizer" : "icu_tokenizer"
                }
            }
        }
    }
}
--------------------------------------------------
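
The tokenizer can be combined with the token filters described in the earlier
sections. The following is a minimal sketch; the analyzer name `icu_text` is
illustrative. Because `icu_folding` also lowercases, no separate `lowercase`
filter is listed.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "icu_text" : {
                    "tokenizer" : "icu_tokenizer",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------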