[[analysis-shingle-tokenfilter]]
=== Shingle Token Filter

A token filter of type `shingle` that constructs shingles (token n-grams)
from a token stream. In other words, it combines adjacent tokens into a
single token. For example, the sentence "please divide this sentence into
shingles" might be tokenized into the shingles "please divide", "divide
this", "this sentence", "sentence into", and "into shingles".

This filter handles position increments > 1 by inserting filler tokens
(tokens with the term text `_`). It does not handle a position increment
of 0.
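
As a rough illustration, one common way such position gaps arise is from a
preceding `stop` filter: removing stop words leaves gaps in the token
positions, which the shingle filter fills with `_` tokens. The filter chain
below is a sketch (again assuming the JSON form of the `_analyze` API), not
output taken from this documentation:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "stop", "shingle" ],
  "text": "please divide this sentence into shingles"
}
--------------------------------------------------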

The following settings can be configured for a token filter of type
`shingle`:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`max_shingle_size` |The maximum shingle size. Defaults to `2`.
|`min_shingle_size` |The minimum shingle size. Defaults to `2`.
|`output_unigrams` |If `true` the output will contain the input tokens
(unigrams) as well as the shingles. Defaults to `true`.
|`output_unigrams_if_no_shingles` |If `output_unigrams` is `false` the
output will contain the input tokens (unigrams) if no shingles are
available. Note if `output_unigrams` is set to `true` this setting has
no effect. Defaults to `false`.
|`token_separator` |The string to use when joining adjacent tokens to
form a shingle. Defaults to `" "`.
|=======================================================================
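
As a sketch of how these settings fit together, the following request
defines a custom analyzer that uses a configured shingle filter. The index
name `my_index`, analyzer name `my_shingle_analyzer`, and filter name
`my_shingle_filter` are placeholders chosen for this example:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": false,
          "token_separator": " "
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_shingle_filter"
          ]
        }
      }
    }
  }
}
--------------------------------------------------

With `output_unigrams` set to `false`, only the two- and three-word
shingles are emitted, which is a common choice for phrase-oriented fields.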