1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
|
[[analysis-hunspell-tokenfilter]]
=== Hunspell Token Filter
Basic support for hunspell stemming. Hunspell dictionaries will be
picked up from a dedicated hunspell directory on the filesystem
(defaults to `<path.conf>/hunspell`). Each dictionary is expected to
have its own directory named after its associated locale (language).
This dictionary directory is expected to hold both the \*.aff and \*.dic
files (all of which will automatically be picked up). For example,
assuming the default hunspell location is used, the following directory
layout will define the `en_US` dictionary:
[source,js]
--------------------------------------------------
- conf
|-- hunspell
| |-- en_US
| | |-- en_US.dic
| | |-- en_US.aff
--------------------------------------------------
The location of the hunspell directory can be configured using the
`indices.analysis.hunspell.dictionary.location` settings in
_elasticsearch.yml_.
Each dictionary can be configured with two settings:
`ignore_case`::
If true, dictionary matching will be case insensitive
(defaults to `false`)
`strict_affix_parsing`::
Determines whether errors while reading a
affix rules file will cause exception or simple be ignored (defaults to
`true`)
These settings can be configured globally in `elasticsearch.yml` using
* `indices.analysis.hunspell.dictionary.ignore_case` and
* `indices.analysis.hunspell.dictionary.strict_affix_parsing`
or for specific dictionaries:
* `indices.analysis.hunspell.dictionary.en_US.ignore_case` and
* `indices.analysis.hunspell.dictionary.en_US.strict_affix_parsing`.
It is also possible to add `settings.yml` file under the dictionary
directory which holds these settings (this will override any other
settings defined in the `elasticsearch.yml`).
One can use the hunspell stem filter by configuring it the analysis
settings:
[source,js]
--------------------------------------------------
{
"analysis" : {
"analyzer" : {
"en" : {
"tokenizer" : "standard",
"filter" : [ "lowercase", "en_US" ]
}
},
"filter" : {
"en_US" : {
"type" : "hunspell",
"locale" : "en_US",
"dedup" : true
}
}
}
}
--------------------------------------------------
The hunspell token filter accepts four options:
`locale`::
A locale for this filter. If this is unset, the `lang` or
`language` are used instead - so one of these has to be set.
`dictionary`::
The name of a dictionary. The path to your hunspell
dictionaries should be configured via
`indices.analysis.hunspell.dictionary.location` before.
`dedup`::
If only unique terms should be returned, this needs to be
set to `true`. Defaults to `true`.
`recursion_level`::
Configures the recursion level a
stemmer can go into. Defaults to `2`. Some languages (for example czech)
give better results when set to `1` or `0`, so you should test it out.
NOTE: As opposed to the snowball stemmers (which are algorithm based)
this is a dictionary lookup based stemmer and therefore the quality of
the stemming is determined by the quality of the dictionary.
[float]
==== References
Hunspell is a spell checker and morphological analyzer designed for
languages with rich morphology and complex word compounding and
character encoding.
1. Wikipedia, http://en.wikipedia.org/wiki/Hunspell
2. Source code, http://hunspell.sourceforge.net/
3. Open Office Hunspell dictionaries, http://wiki.openoffice.org/wiki/Dictionaries
4. Mozilla Hunspell dictionaries, https://addons.mozilla.org/en-US/firefox/language-tools/
5. Chromium Hunspell dictionaries,
http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/
|