summaryrefslogtreecommitdiff
path: root/docs/reference/analysis/tokenfilters/hunspell-tokenfilter.asciidoc
blob: 9a235dd6de8fc18976c08bc66f7f817331ad885c (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
[[analysis-hunspell-tokenfilter]]
=== Hunspell Token Filter

Basic support for hunspell stemming. Hunspell dictionaries will be
picked up from a dedicated hunspell directory on the filesystem
(defaults to `<path.conf>/hunspell`). Each dictionary is expected to
have its own directory named after its associated locale (language).
This dictionary directory is expected to hold both the \*.aff and \*.dic
files (all of which will automatically be picked up). For example,
assuming the default hunspell location is used, the following directory
layout will define the `en_US` dictionary:

[source,js]
--------------------------------------------------
- conf
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
    |    |    |-- en_US.aff
--------------------------------------------------

The location of the hunspell directory can be configured using the
`indices.analysis.hunspell.dictionary.location` settings in
_elasticsearch.yml_.

Each dictionary can be configured with two settings:

`ignore_case`:: 
    If true, dictionary matching will be case insensitive
    (defaults to `false`)

`strict_affix_parsing`::
    Determines whether errors while reading a
    affix rules file will cause exception or simple be ignored (defaults to
    `true`)

These settings can be configured globally in `elasticsearch.yml` using

* `indices.analysis.hunspell.dictionary.ignore_case` and
* `indices.analysis.hunspell.dictionary.strict_affix_parsing`

or for specific dictionaries:

* `indices.analysis.hunspell.dictionary.en_US.ignore_case` and
* `indices.analysis.hunspell.dictionary.en_US.strict_affix_parsing`.

It is also possible to add `settings.yml` file under the dictionary
directory which holds these settings (this will override any other
settings defined in the `elasticsearch.yml`).

One can use the hunspell stem filter by configuring it the analysis
settings:

[source,js]
--------------------------------------------------
{
    "analysis" : {
        "analyzer" : {
            "en" : {
                "tokenizer" : "standard",
                "filter" : [ "lowercase", "en_US" ]
            }
        },
        "filter" : {
            "en_US" : {
                "type" : "hunspell",
                "locale" : "en_US",
                "dedup" : true
            }
        }
    }
}
--------------------------------------------------

The hunspell token filter accepts four options:

`locale`:: 
    A locale for this filter. If this is unset, the `lang` or
    `language` are used instead - so one of these has to be set.

`dictionary`:: 
    The name of a dictionary. The path to your hunspell
    dictionaries should be configured via
    `indices.analysis.hunspell.dictionary.location` before.

`dedup`:: 
    If only unique terms should be returned, this needs to be
    set to `true`. Defaults to `true`.

`recursion_level`:: 
    Configures the recursion level a
    stemmer can go into. Defaults to `2`. Some languages (for example czech)
    give better results when set to `1` or `0`, so you should test it out.

NOTE: As opposed to the snowball stemmers (which are algorithm based)
this is a dictionary lookup based stemmer and therefore the quality of
the stemming is determined by the quality of the dictionary.

[float]
==== References

Hunspell is a spell checker and morphological analyzer designed for
languages with rich morphology and complex word compounding and
character encoding.

1. Wikipedia, http://en.wikipedia.org/wiki/Hunspell

2. Source code, http://hunspell.sourceforge.net/

3. Open Office Hunspell dictionaries, http://wiki.openoffice.org/wiki/Dictionaries

4.  Mozilla Hunspell dictionaries, https://addons.mozilla.org/en-US/firefox/language-tools/

5. Chromium Hunspell dictionaries,
   http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/