[[analysis-edgengram-tokenizer]]
=== Edge NGram Tokenizer

A tokenizer of type `edgeNGram`.

This tokenizer is very similar to `nGram` but only keeps n-grams that
start at the beginning of a token.
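Conceptually, the gram generation can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the actual Lucene implementation; `edge_ngrams` is a hypothetical helper name:

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Return the n-grams anchored at the start of the token,
    from min_gram up to max_gram characters long (defaults match
    the tokenizer's default settings)."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# With min_gram=2 and max_gram=5, as in the example below:
print(edge_ngrams("Schalke", min_gram=2, max_gram=5))
# ['Sc', 'Sch', 'Scha', 'Schal']
```

Note that only prefixes are emitted; a plain `nGram` tokenizer would also produce grams starting mid-token, such as `ch` and `hal`.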

The following settings can be configured for an `edgeNGram` tokenizer
type:

[cols="<,<,<",options="header",]
|=======================================================================
|Setting |Description |Default value
|`min_gram` |Minimum size in codepoints of a single n-gram |`1`.

|`max_gram` |Maximum size in codepoints of a single n-gram |`2`.

|`token_chars` | Character classes to keep in the
tokens. Elasticsearch will split on characters that don't belong to any
of these classes. |`[]` (Keep all characters)
|=======================================================================


`token_chars` accepts the following character classes:

[horizontal]
`letter`::      for example `a`, `b`, `ï` or `京`
`digit`::       for example `3` or `7`
`whitespace`::  for example `" "` or `"\n"`
`punctuation`:: for example `!` or `"`
`symbol`::      for example `$` or `√`
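The splitting step can be approximated in Python. The sketch below mimics `"token_chars": [ "letter", "digit" ]` using `str.isalnum`, which is only a rough stand-in for the tokenizer's character classes (it does not model the `whitespace`, `punctuation`, and `symbol` classes separately); `split_tokens` is a hypothetical helper name:

```python
from itertools import groupby

def split_tokens(text):
    # Keep maximal runs of letters/digits; split on (and drop)
    # everything else, as token_chars: ["letter", "digit"] would.
    return ["".join(group) for is_kept, group in groupby(text, key=str.isalnum) if is_kept]

print(split_tokens("FC Schalke 04"))
# ['FC', 'Schalke', '04']
```

Each of these tokens is then fed to the gram generation step, which is why `Schalke` and `04` produce separate sets of edge n-grams in the example below.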

[float]
==== Example

[source,js]
--------------------------------------------------
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings" : {
            "analysis" : {
                "analyzer" : {
                    "my_edge_ngram_analyzer" : {
                        "tokenizer" : "my_edge_ngram_tokenizer"
                    }
                },
                "tokenizer" : {
                    "my_edge_ngram_tokenizer" : {
                        "type" : "edgeNGram",
                        "min_gram" : "2",
                        "max_gram" : "5",
                        "token_chars": [ "letter", "digit" ]
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_edge_ngram_analyzer' -d 'FC Schalke 04'
    # FC, Sc, Sch, Scha, Schal, 04
--------------------------------------------------

[float]
==== `side` deprecated

There used to be a `side` parameter up to `0.90.1`, but it is now deprecated.
To emulate the behavior of `"side" : "BACK"`, a
<<analysis-reverse-tokenfilter,`reverse` token filter>> should be used together
with the <<analysis-edgengram-tokenfilter,`edgeNGram` token filter>>. The
`edgeNGram` filter must be enclosed in `reverse` filters like this:

[source,js]
--------------------------------------------------
    "filter" : ["reverse", "edgeNGram", "reverse"]
--------------------------------------------------

which essentially reverses the token, builds front `EdgeNGrams`, and reverses
the n-grams again. This has the same effect as the previous `"side" : "BACK"` setting.
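The effect of the `reverse`/`edgeNGram`/`reverse` chain can be sketched in Python. This is an illustration of the filter composition, not the Lucene implementation; `edge_ngrams` and `back_ngrams` are hypothetical helper names:

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    # Front edge n-grams, as produced by the edgeNGram filter.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def back_ngrams(token, min_gram=1, max_gram=2):
    # reverse -> front edge n-grams -> reverse each gram back,
    # i.e. the ["reverse", "edgeNGram", "reverse"] filter chain.
    return [gram[::-1] for gram in edge_ngrams(token[::-1], min_gram, max_gram)]

print(back_ngrams("Schalke", min_gram=2, max_gram=5))
# ['ke', 'lke', 'alke', 'halke']
```

The result is the grams anchored at the *end* of the token, which is what the old `"side" : "BACK"` setting produced.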