author     Hilko Bengen <bengen@debian.org>    2014-06-07 12:02:12 +0200
committer  Hilko Bengen <bengen@debian.org>    2014-06-07 12:02:12 +0200
commit     d5ed89b946297270ec28abf44bef2371a06f1f4f (patch)
tree       ce2d945e4dde69af90bd9905a70d8d27f4936776 /docs/reference/analysis/tokenizers
download   elasticsearch-d5ed89b946297270ec28abf44bef2371a06f1f4f.tar.gz

    Imported Upstream version 1.0.3 (upstream/1.0.3)
Diffstat (limited to 'docs/reference/analysis/tokenizers')
10 files changed, 273 insertions, 0 deletions
diff --git a/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
new file mode 100644
index 0000000..41cc233
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
@@ -0,0 +1,80 @@
+[[analysis-edgengram-tokenizer]]
+=== Edge NGram Tokenizer
+
+A tokenizer of type `edgeNGram`.
+
+This tokenizer is very similar to `nGram` but only keeps n-grams which
+start at the beginning of a token.
+
+The following are settings that can be set for an `edgeNGram` tokenizer
+type:
+
+[cols="<,<,<",options="header",]
+|=======================================================================
+|Setting |Description |Default value
+|`min_gram` |Minimum size in codepoints of a single n-gram |`1`.
+
+|`max_gram` |Maximum size in codepoints of a single n-gram |`2`.
+
+|`token_chars` |Character classes to keep in the
+tokens. Elasticsearch will split on characters that don't belong to any
+of these classes. |`[]` (Keep all characters)
+|=======================================================================
+
+
+`token_chars` accepts the following character classes:
+
+[horizontal]
+`letter`::      for example `a`, `b`, `ï` or `京`
+`digit`::       for example `3` or `7`
+`whitespace`::  for example `" "` or `"\n"`
+`punctuation`:: for example `!` or `"`
+`symbol`::      for example `$` or `√`
+
+[float]
+==== Example
+
+[source,js]
+--------------------------------------------------
+    curl -XPUT 'localhost:9200/test' -d '
+    {
+        "settings" : {
+            "analysis" : {
+                "analyzer" : {
+                    "my_edge_ngram_analyzer" : {
+                        "tokenizer" : "my_edge_ngram_tokenizer"
+                    }
+                },
+                "tokenizer" : {
+                    "my_edge_ngram_tokenizer" : {
+                        "type" : "edgeNGram",
+                        "min_gram" : "2",
+                        "max_gram" : "5",
+                        "token_chars": [ "letter", "digit" ]
+                    }
+                }
+            }
+        }
+    }'
+
+    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_edge_ngram_analyzer' -d 'FC Schalke 04'
+    # FC, Sc, Sch, Scha, Schal, 04
+--------------------------------------------------
+
+[float]
+==== `side` deprecated
+
+There used to be a `side` parameter up to `0.90.1`, but it is now deprecated. In
+order to emulate the behavior of `"side" : "BACK"`, a
+<<analysis-reverse-tokenfilter,`reverse` token filter>> should be used together
+with the <<analysis-edgengram-tokenfilter,`edgeNGram` token filter>>. The
+`edgeNGram` filter must be enclosed in `reverse` filters like this:
+
+[source,js]
+--------------------------------------------------
+    "filter" : ["reverse", "edgeNGram", "reverse"]
+--------------------------------------------------
+
+which essentially reverses the token, builds front `EdgeNGrams`, and reverses
+the ngram again. This has the same effect as the previous `"side" : "BACK"` setting.
+
diff --git a/docs/reference/analysis/tokenizers/keyword-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/keyword-tokenizer.asciidoc
new file mode 100644
index 0000000..be75f3d
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/keyword-tokenizer.asciidoc
@@ -0,0 +1,15 @@
+[[analysis-keyword-tokenizer]]
+=== Keyword Tokenizer
+
+A tokenizer of type `keyword` that emits the entire input as a single
+token.
+
+The following are settings that can be set for a `keyword` tokenizer
+type:
+
+[cols="<,<",options="header",]
+|=======================================================
+|Setting |Description
+|`buffer_size` |The term buffer size. Defaults to `256`.
+|=======================================================
+
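The `keyword` tokenizer can be tried directly through the `_analyze` API by naming it in the `tokenizer` parameter. This is only an illustrative sketch, assuming a 1.x node listening on `localhost:9200`; the sample text is arbitrary:

[source,js]
--------------------------------------------------
    # the whole input comes back as one token
    curl 'localhost:9200/_analyze?tokenizer=keyword&pretty=1' -d 'New York Giants'
    # New York Giants
--------------------------------------------------
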
diff --git a/docs/reference/analysis/tokenizers/letter-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/letter-tokenizer.asciidoc
new file mode 100644
index 0000000..03025cc
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/letter-tokenizer.asciidoc
@@ -0,0 +1,7 @@
+[[analysis-letter-tokenizer]]
+=== Letter Tokenizer
+
+A tokenizer of type `letter` that divides text at non-letters. That is to
+say, it defines tokens as maximal strings of adjacent letters. Note,
+this does a decent job for most European languages, but does a terrible
+job for some Asian languages, where words are not separated by spaces.
diff --git a/docs/reference/analysis/tokenizers/lowercase-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/lowercase-tokenizer.asciidoc
new file mode 100644
index 0000000..0cdbbc3
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/lowercase-tokenizer.asciidoc
@@ -0,0 +1,15 @@
+[[analysis-lowercase-tokenizer]]
+=== Lowercase Tokenizer
+
+A tokenizer of type `lowercase` that performs the function of
+<<analysis-letter-tokenizer,Letter
+Tokenizer>> and
+<<analysis-lowercase-tokenfilter,Lower
+Case Token Filter>> together. It divides text at non-letters and converts
+the tokens to lower case. While it is functionally equivalent to the
+combination of
+<<analysis-letter-tokenizer,Letter
+Tokenizer>> and
+<<analysis-lowercase-tokenfilter,Lower
+Case Token Filter>>, there is a performance advantage to doing the two
+tasks at once, hence this (redundant) implementation.
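The difference between the `letter` and `lowercase` tokenizers is easiest to see by feeding the same text to the `_analyze` API with each tokenizer named explicitly. A minimal sketch, assuming a node on `localhost:9200` and an arbitrary sample sentence:

[source,js]
--------------------------------------------------
    curl 'localhost:9200/_analyze?tokenizer=letter&pretty=1' -d "Don't stop at 42"
    # Don, t, stop, at

    curl 'localhost:9200/_analyze?tokenizer=lowercase&pretty=1' -d "Don't stop at 42"
    # don, t, stop, at
--------------------------------------------------

Note that both drop `42`, since digits are non-letters.
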
diff --git a/docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc
new file mode 100644
index 0000000..23e6bc5
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc
@@ -0,0 +1,57 @@
+[[analysis-ngram-tokenizer]]
+=== NGram Tokenizer
+
+A tokenizer of type `nGram`.
+
+The following are settings that can be set for an `nGram` tokenizer type:
+
+[cols="<,<,<",options="header",]
+|=======================================================================
+|Setting |Description |Default value
+|`min_gram` |Minimum size in codepoints of a single n-gram |`1`.
+
+|`max_gram` |Maximum size in codepoints of a single n-gram |`2`.
+
+|`token_chars` |Character classes to keep in the
+tokens. Elasticsearch will split on characters that don't belong to any
+of these classes. |`[]` (Keep all characters)
+|=======================================================================
+
+`token_chars` accepts the following character classes:
+
+[horizontal]
+`letter`::      for example `a`, `b`, `ï` or `京`
+`digit`::       for example `3` or `7`
+`whitespace`::  for example `" "` or `"\n"`
+`punctuation`:: for example `!` or `"`
+`symbol`::      for example `$` or `√`
+
+[float]
+==== Example
+
+[source,js]
+--------------------------------------------------
+    curl -XPUT 'localhost:9200/test' -d '
+    {
+        "settings" : {
+            "analysis" : {
+                "analyzer" : {
+                    "my_ngram_analyzer" : {
+                        "tokenizer" : "my_ngram_tokenizer"
+                    }
+                },
+                "tokenizer" : {
+                    "my_ngram_tokenizer" : {
+                        "type" : "nGram",
+                        "min_gram" : "2",
+                        "max_gram" : "3",
+                        "token_chars": [ "letter", "digit" ]
+                    }
+                }
+            }
+        }
+    }'
+
+    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
+    # FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
+--------------------------------------------------
diff --git a/docs/reference/analysis/tokenizers/pathhierarchy-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/pathhierarchy-tokenizer.asciidoc
new file mode 100644
index 0000000..e6876f5
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/pathhierarchy-tokenizer.asciidoc
@@ -0,0 +1,32 @@
+[[analysis-pathhierarchy-tokenizer]]
+=== Path Hierarchy Tokenizer
+
+The `path_hierarchy` tokenizer takes something like this:
+
+-------------------------
+/something/something/else
+-------------------------
+
+And produces tokens:
+
+-------------------------
+/something
+/something/something
+/something/something/else
+-------------------------
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`delimiter` |The character delimiter to use, defaults to `/`.
+
+|`replacement` |An optional replacement character to use. Defaults to
+the `delimiter`.
+
+|`buffer_size` |The buffer size to use, defaults to `1024`.
+
+|`reverse` |Generates tokens in reverse order, defaults to `false`.
+
+|`skip` |Controls initial tokens to skip, defaults to `0`.
+|=======================================================================
+
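A `path_hierarchy` tokenizer is typically wired into a custom analyzer via index settings, in the same way as the `nGram` example above. The index name `test` and the `my_path_*` names below are placeholders, and the sketch assumes a node on `localhost:9200`:

[source,js]
--------------------------------------------------
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings" : {
            "analysis" : {
                "analyzer" : {
                    "my_path_analyzer" : {
                        "tokenizer" : "my_path_tokenizer"
                    }
                },
                "tokenizer" : {
                    "my_path_tokenizer" : {
                        "type" : "path_hierarchy",
                        "delimiter" : "/"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_path_analyzer' -d '/var/log/elasticsearch'
    # /var, /var/log, /var/log/elasticsearch
--------------------------------------------------

The `delimiter` shown is the default; it is spelled out only to show where the settings from the table above go.
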
diff --git a/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
new file mode 100644
index 0000000..72ca604
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
@@ -0,0 +1,29 @@
+[[analysis-pattern-tokenizer]]
+=== Pattern Tokenizer
+
+A tokenizer of type `pattern` that can flexibly separate text into terms
+via a regular expression. Accepts the following settings:
+
+[cols="<,<",options="header",]
+|======================================================================
+|Setting |Description
+|`pattern` |The regular expression pattern, defaults to `\\W+`.
+|`flags` |The regular expression flags.
+|`group` |Which group to extract into tokens. Defaults to `-1` (split).
+|======================================================================
+
+*IMPORTANT*: The regular expression should match the *token separators*,
+not the tokens themselves.
+
+`group` set to `-1` (the default) is equivalent to "split". Using group
+>= 0 selects the matching group as the token. For example, if you have:
+
+------------------------
+pattern = \\'([^\']+)\\'
+group   = 0
+input   = aaa 'bbb' 'ccc'
+------------------------
+
+the output will be two tokens: 'bbb' and 'ccc' (including the ' marks).
+With the same input but using group=1, the output would be: bbb and ccc
+(no ' marks).
diff --git a/docs/reference/analysis/tokenizers/standard-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/standard-tokenizer.asciidoc
new file mode 100644
index 0000000..c8b405b
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/standard-tokenizer.asciidoc
@@ -0,0 +1,18 @@
+[[analysis-standard-tokenizer]]
+=== Standard Tokenizer
+
+A tokenizer of type `standard` providing a grammar-based tokenizer that
+works well for most European language documents. The tokenizer
+implements the Unicode Text Segmentation algorithm, as specified in
+http://unicode.org/reports/tr29/[Unicode Standard Annex #29].
+
+The following are settings that can be set for a `standard` tokenizer
+type:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`max_token_length` |The maximum token length. If a token is seen that
+exceeds this length then it is discarded. Defaults to `255`.
+|=======================================================================
+
diff --git a/docs/reference/analysis/tokenizers/uaxurlemail-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/uaxurlemail-tokenizer.asciidoc
new file mode 100644
index 0000000..9ed28e6
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/uaxurlemail-tokenizer.asciidoc
@@ -0,0 +1,16 @@
+[[analysis-uaxurlemail-tokenizer]]
+=== UAX Email URL Tokenizer
+
+A tokenizer of type `uax_url_email` which works exactly like the
+`standard` tokenizer, but tokenizes emails and URLs as single tokens.
+
+The following are settings that can be set for a `uax_url_email`
+tokenizer type:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`max_token_length` |The maximum token length. If a token is seen that
+exceeds this length then it is discarded. Defaults to `255`.
+|=======================================================================
+
diff --git a/docs/reference/analysis/tokenizers/whitespace-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/whitespace-tokenizer.asciidoc
new file mode 100644
index 0000000..f0e1ce2
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/whitespace-tokenizer.asciidoc
@@ -0,0 +1,4 @@
+[[analysis-whitespace-tokenizer]]
+=== Whitespace Tokenizer
+
+A tokenizer of type `whitespace` that divides text at whitespace.
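Since the last three tokenizers above differ mainly in how they treat punctuation, URLs and email addresses, a side-by-side `_analyze` call makes the contrast concrete. A minimal sketch, assuming a node on `localhost:9200`; the address is a made-up example:

[source,js]
--------------------------------------------------
    curl 'localhost:9200/_analyze?tokenizer=whitespace&pretty=1' -d 'Mail info@example.com today.'
    # Mail, info@example.com, today.

    curl 'localhost:9200/_analyze?tokenizer=standard&pretty=1' -d 'Mail info@example.com today.'
    # Mail, info, example.com, today

    curl 'localhost:9200/_analyze?tokenizer=uax_url_email&pretty=1' -d 'Mail info@example.com today.'
    # Mail, info@example.com, today
--------------------------------------------------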