[[analysis-ngram-tokenizer]]
=== NGram Tokenizer

A tokenizer of type `nGram`.

The following settings can be configured for a `nGram` tokenizer:

[cols="<,<,<",options="header",]
|=======================================================================
|Setting |Description |Default value
|`min_gram` |Minimum size in codepoints of a single n-gram |`1`

|`max_gram` |Maximum size in codepoints of a single n-gram |`2`

|`token_chars` |Character classes to keep in the
tokens. Elasticsearch will split on characters that don't belong to any
of these classes. |`[]` (Keep all characters)
|=======================================================================
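For instance, with the default `min_gram` of `1` and `max_gram` of `2`, the
word `Quick` yields every 1- and 2-character gram. A minimal illustration
using the `_analyze` API, assuming a node running on `localhost:9200`:

[source,js]
--------------------------------------------------
    curl 'localhost:9200/_analyze?pretty=1&tokenizer=nGram' -d 'Quick'
    # Q, Qu, u, ui, i, ic, c, ck, k
--------------------------------------------------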

`token_chars` accepts the following character classes:

[horizontal]
`letter`::      for example `a`, `b`, `ï` or `京`
`digit`::       for example `3` or `7`
`whitespace`::  for example `" "` or `"\n"`
`punctuation`:: for example `!` or `"`
`symbol`::      for example `$` or `√`
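With the default empty `token_chars` list, no splitting happens at all, so
whitespace and punctuation end up inside the grams themselves. A quick sketch,
again assuming a local node and the default `min_gram`/`max_gram` of 1/2:

[source,js]
--------------------------------------------------
    curl 'localhost:9200/_analyze?pretty=1&tokenizer=nGram' -d 'a b'
    # a, "a ", " ", " b", b   <-- the space itself appears inside grams
--------------------------------------------------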

[float]
==== Example

[source,js]
--------------------------------------------------
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings" : {
            "analysis" : {
                "analyzer" : {
                    "my_ngram_analyzer" : {
                        "tokenizer" : "my_ngram_tokenizer"
                    }
                },
                "tokenizer" : {
                    "my_ngram_tokenizer" : {
                        "type" : "nGram",
                        "min_gram" : "2",
                        "max_gram" : "3",
                        "token_chars": [ "letter", "digit" ]
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
    # FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
--------------------------------------------------
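Because the space characters are neither letters nor digits, the input is
first split into `FC`, `Schalke` and `04`, and each chunk is then turned into
2- and 3-character grams independently. This is why no gram in the output
spans a space.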