author     Hilko Bengen <bengen@debian.org>    2014-06-07 12:02:12 +0200
committer  Hilko Bengen <bengen@debian.org>    2014-06-07 12:02:12 +0200
commit     d5ed89b946297270ec28abf44bef2371a06f1f4f (patch)
tree       ce2d945e4dde69af90bd9905a70d8d27f4936776 /docs/reference/analysis/tokenizers
download   elasticsearch-d5ed89b946297270ec28abf44bef2371a06f1f4f.tar.gz

Imported Upstream version 1.0.3 (tag: upstream/1.0.3)
Diffstat (limited to 'docs/reference/analysis/tokenizers')
-rw-r--r--  docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc       80
-rw-r--r--  docs/reference/analysis/tokenizers/keyword-tokenizer.asciidoc         15
-rw-r--r--  docs/reference/analysis/tokenizers/letter-tokenizer.asciidoc           7
-rw-r--r--  docs/reference/analysis/tokenizers/lowercase-tokenizer.asciidoc       15
-rw-r--r--  docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc           57
-rw-r--r--  docs/reference/analysis/tokenizers/pathhierarchy-tokenizer.asciidoc   32
-rw-r--r--  docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc         29
-rw-r--r--  docs/reference/analysis/tokenizers/standard-tokenizer.asciidoc        18
-rw-r--r--  docs/reference/analysis/tokenizers/uaxurlemail-tokenizer.asciidoc     16
-rw-r--r--  docs/reference/analysis/tokenizers/whitespace-tokenizer.asciidoc       4
10 files changed, 273 insertions, 0 deletions
diff --git a/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
new file mode 100644
index 0000000..41cc233
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/edgengram-tokenizer.asciidoc
@@ -0,0 +1,80 @@
+[[analysis-edgengram-tokenizer]]
+=== Edge NGram Tokenizer
+
+A tokenizer of type `edgeNGram`.
+
+This tokenizer is very similar to `nGram` but only keeps n-grams which
+start at the beginning of a token.
+
+The following are settings that can be set for an `edgeNGram` tokenizer
+type:
+
+[cols="<,<,<",options="header",]
+|=======================================================================
+|Setting |Description |Default value
+|`min_gram` |Minimum size in codepoints of a single n-gram |`1`.
+
+|`max_gram` |Maximum size in codepoints of a single n-gram |`2`.
+
+|`token_chars` |Character classes to keep in the tokens. Elasticsearch
+will split on characters that don't belong to any of these classes.
+|`[]` (Keep all characters)
+|=======================================================================
+
+
+`token_chars` accepts the following character classes:
+
+[horizontal]
+`letter`:: for example `a`, `b`, `ï` or `京`
+`digit`:: for example `3` or `7`
+`whitespace`:: for example `" "` or `"\n"`
+`punctuation`:: for example `!` or `"`
+`symbol`:: for example `$` or `√`
+
+[float]
+==== Example
+
+[source,js]
+--------------------------------------------------
+ curl -XPUT 'localhost:9200/test' -d '
+ {
+ "settings" : {
+ "analysis" : {
+ "analyzer" : {
+ "my_edge_ngram_analyzer" : {
+ "tokenizer" : "my_edge_ngram_tokenizer"
+ }
+ },
+ "tokenizer" : {
+ "my_edge_ngram_tokenizer" : {
+ "type" : "edgeNGram",
+ "min_gram" : "2",
+ "max_gram" : "5",
+ "token_chars": [ "letter", "digit" ]
+ }
+ }
+ }
+ }
+ }'
+
+ curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_edge_ngram_analyzer' -d 'FC Schalke 04'
+ # FC, Sc, Sch, Scha, Schal, 04
+--------------------------------------------------
+
+[float]
+==== `side` deprecated
+
+A `side` parameter was available up to version `0.90.1`, but it is now
+deprecated. In order to emulate the behavior of `"side" : "BACK"`, a
+<<analysis-reverse-tokenfilter,`reverse` token filter>> should be used
+together with the <<analysis-edgengram-tokenfilter,`edgeNGram` token filter>>.
+The `edgeNGram` filter must be enclosed in `reverse` filters like this:
+
+[source,js]
+--------------------------------------------------
+ "filter" : ["reverse", "edgeNGram", "reverse"]
+--------------------------------------------------
+
+which essentially reverses each token, builds front edge n-grams, and
+reverses the n-grams again. This has the same effect as the previous
+`"side" : "BACK"` setting.
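+
+As a rough sketch of this emulation (the index, analyzer and filter names
+below are placeholders, not part of the upstream documentation), assuming a
+`whitespace` tokenizer feeding the filter chain:
+
+[source,js]
+--------------------------------------------------
+    curl -XPUT 'localhost:9200/test' -d '
+    {
+        "settings" : {
+            "analysis" : {
+                "analyzer" : {
+                    "my_back_edge_analyzer" : {
+                        "tokenizer" : "whitespace",
+                        "filter" : ["reverse", "my_back_edge_ngram", "reverse"]
+                    }
+                },
+                "filter" : {
+                    "my_back_edge_ngram" : {
+                        "type" : "edgeNGram",
+                        "min_gram" : "2",
+                        "max_gram" : "5"
+                    }
+                }
+            }
+        }
+    }'
+
+    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_back_edge_analyzer' -d 'Schalke'
+    # ke, lke, alke, halke
+--------------------------------------------------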
+
diff --git a/docs/reference/analysis/tokenizers/keyword-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/keyword-tokenizer.asciidoc
new file mode 100644
index 0000000..be75f3d
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/keyword-tokenizer.asciidoc
@@ -0,0 +1,15 @@
+[[analysis-keyword-tokenizer]]
+=== Keyword Tokenizer
+
+A tokenizer of type `keyword` that emits the entire input as a single
+token.
+
+The following are settings that can be set for a `keyword` tokenizer
+type:
+
+[cols="<,<",options="header",]
+|=======================================================
+|Setting |Description
+|`buffer_size` |The term buffer size. Defaults to `256`.
+|=======================================================
+
diff --git a/docs/reference/analysis/tokenizers/letter-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/letter-tokenizer.asciidoc
new file mode 100644
index 0000000..03025cc
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/letter-tokenizer.asciidoc
@@ -0,0 +1,7 @@
+[[analysis-letter-tokenizer]]
+=== Letter Tokenizer
+
+A tokenizer of type `letter` that divides text at non-letters. That is
+to say, it defines tokens as maximal strings of adjacent letters. Note
+that this does a decent job for most European languages, but a terrible
+job for some Asian languages, where words are not separated by spaces.
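+
+[float]
+==== Example
+
+For illustration, a minimal sketch against the index-less `_analyze` API;
+note how the hyphen, the `@`, the dots and the digits all act as token
+boundaries:
+
+[source,js]
+--------------------------------------------------
+    curl 'localhost:9200/_analyze?pretty=1&tokenizer=letter' -d 'my-email@domain.com 42'
+    # my, email, domain, com
+--------------------------------------------------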
diff --git a/docs/reference/analysis/tokenizers/lowercase-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/lowercase-tokenizer.asciidoc
new file mode 100644
index 0000000..0cdbbc3
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/lowercase-tokenizer.asciidoc
@@ -0,0 +1,15 @@
+[[analysis-lowercase-tokenizer]]
+=== Lowercase Tokenizer
+
+A tokenizer of type `lowercase` that performs the function of the
+<<analysis-letter-tokenizer,Letter Tokenizer>> and the
+<<analysis-lowercase-tokenfilter,Lower Case Token Filter>> together. It
+divides text at non-letters and converts the resulting tokens to lower
+case. While it is functionally equivalent to the combination of the
+<<analysis-letter-tokenizer,Letter Tokenizer>> and the
+<<analysis-lowercase-tokenfilter,Lower Case Token Filter>>, there is a
+performance advantage to doing the two tasks at once, hence this
+(redundant) implementation.
diff --git a/docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc
new file mode 100644
index 0000000..23e6bc5
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/ngram-tokenizer.asciidoc
@@ -0,0 +1,57 @@
+[[analysis-ngram-tokenizer]]
+=== NGram Tokenizer
+
+A tokenizer of type `nGram`.
+
+The following are settings that can be set for an `nGram` tokenizer type:
+
+[cols="<,<,<",options="header",]
+|=======================================================================
+|Setting |Description |Default value
+|`min_gram` |Minimum size in codepoints of a single n-gram |`1`.
+
+|`max_gram` |Maximum size in codepoints of a single n-gram |`2`.
+
+|`token_chars` |Character classes to keep in the tokens. Elasticsearch
+will split on characters that don't belong to any of these classes.
+|`[]` (Keep all characters)
+|=======================================================================
+
+`token_chars` accepts the following character classes:
+
+[horizontal]
+`letter`:: for example `a`, `b`, `ï` or `京`
+`digit`:: for example `3` or `7`
+`whitespace`:: for example `" "` or `"\n"`
+`punctuation`:: for example `!` or `"`
+`symbol`:: for example `$` or `√`
+
+[float]
+==== Example
+
+[source,js]
+--------------------------------------------------
+ curl -XPUT 'localhost:9200/test' -d '
+ {
+ "settings" : {
+ "analysis" : {
+ "analyzer" : {
+ "my_ngram_analyzer" : {
+ "tokenizer" : "my_ngram_tokenizer"
+ }
+ },
+ "tokenizer" : {
+ "my_ngram_tokenizer" : {
+ "type" : "nGram",
+ "min_gram" : "2",
+ "max_gram" : "3",
+ "token_chars": [ "letter", "digit" ]
+ }
+ }
+ }
+ }
+ }'
+
+ curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
+ # FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
+--------------------------------------------------
diff --git a/docs/reference/analysis/tokenizers/pathhierarchy-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/pathhierarchy-tokenizer.asciidoc
new file mode 100644
index 0000000..e6876f5
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/pathhierarchy-tokenizer.asciidoc
@@ -0,0 +1,32 @@
+[[analysis-pathhierarchy-tokenizer]]
+=== Path Hierarchy Tokenizer
+
+The `path_hierarchy` tokenizer takes something like this:
+
+-------------------------
+/something/something/else
+-------------------------
+
+And produces tokens:
+
+-------------------------
+/something
+/something/something
+/something/something/else
+-------------------------
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`delimiter` |The character delimiter to use, defaults to `/`.
+
+|`replacement` |An optional replacement character to use. Defaults to
+the `delimiter`.
+
+|`buffer_size` |The buffer size to use, defaults to `1024`.
+
+|`reverse` |Generates tokens in reverse order, defaults to `false`.
+
+|`skip` |Controls initial tokens to skip, defaults to `0`.
+|=======================================================================
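+
+[float]
+==== Example
+
+For illustration, a minimal sketch (the index, analyzer and tokenizer names
+below are placeholders) that registers a `path_hierarchy` tokenizer with its
+default delimiter and runs a path through the `_analyze` API:
+
+[source,js]
+--------------------------------------------------
+    curl -XPUT 'localhost:9200/test' -d '
+    {
+        "settings" : {
+            "analysis" : {
+                "analyzer" : {
+                    "my_path_analyzer" : {
+                        "tokenizer" : "my_path_tokenizer"
+                    }
+                },
+                "tokenizer" : {
+                    "my_path_tokenizer" : {
+                        "type" : "path_hierarchy",
+                        "delimiter" : "/"
+                    }
+                }
+            }
+        }
+    }'
+
+    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_path_analyzer' -d '/usr/local/bin'
+    # /usr, /usr/local, /usr/local/bin
+--------------------------------------------------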
+
diff --git a/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
new file mode 100644
index 0000000..72ca604
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
@@ -0,0 +1,29 @@
+[[analysis-pattern-tokenizer]]
+=== Pattern Tokenizer
+
+A tokenizer of type `pattern` that can flexibly separate text into terms
+via a regular expression. Accepts the following settings:
+
+[cols="<,<",options="header",]
+|======================================================================
+|Setting |Description
+|`pattern` |The regular expression pattern, defaults to `\\W+`.
+|`flags` |The regular expression flags.
+|`group` |Which group to extract into tokens. Defaults to `-1` (split).
+|======================================================================
+
+*IMPORTANT*: The regular expression should match the *token separators*,
+not the tokens themselves.
+
+`group` set to `-1` (the default) is equivalent to "split". Using group
+>= 0 selects the matching group as the token. For example, if you have:
+
+------------------------
+pattern = \\'([^\']+)\\'
+group = 0
+input = aaa 'bbb' 'ccc'
+------------------------
+
+the output will be two tokens: 'bbb' and 'ccc' (including the ' marks).
+With the same input but using group=1, the output would be: bbb and ccc
+(no ' marks).
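+
+[float]
+==== Example
+
+For illustration, a minimal sketch (the index, analyzer and tokenizer names
+below are placeholders) that splits comma-separated values by setting
+`pattern` to a literal comma:
+
+[source,js]
+--------------------------------------------------
+    curl -XPUT 'localhost:9200/test' -d '
+    {
+        "settings" : {
+            "analysis" : {
+                "analyzer" : {
+                    "my_csv_analyzer" : {
+                        "tokenizer" : "my_csv_tokenizer"
+                    }
+                },
+                "tokenizer" : {
+                    "my_csv_tokenizer" : {
+                        "type" : "pattern",
+                        "pattern" : ","
+                    }
+                }
+            }
+        }
+    }'
+
+    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_csv_analyzer' -d 'aaa,bbb,ccc'
+    # aaa, bbb, ccc
+--------------------------------------------------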
diff --git a/docs/reference/analysis/tokenizers/standard-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/standard-tokenizer.asciidoc
new file mode 100644
index 0000000..c8b405b
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/standard-tokenizer.asciidoc
@@ -0,0 +1,18 @@
+[[analysis-standard-tokenizer]]
+=== Standard Tokenizer
+
+A tokenizer of type `standard`, providing a grammar-based tokenizer that
+works well for most European-language documents. The tokenizer
+implements the Unicode Text Segmentation algorithm, as specified in
+http://unicode.org/reports/tr29/[Unicode Standard Annex #29].
+
+The following are settings that can be set for a `standard` tokenizer
+type:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`max_token_length` |The maximum token length. If a token is seen that
+exceeds this length then it is discarded. Defaults to `255`.
+|=======================================================================
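+
+[float]
+==== Example
+
+For illustration, a minimal sketch against the index-less `_analyze` API;
+the hyphen and the trailing period are treated as separators, while case is
+left untouched:
+
+[source,js]
+--------------------------------------------------
+    curl 'localhost:9200/_analyze?pretty=1&tokenizer=standard' -d 'The 2 QUICK Brown-Foxes.'
+    # The, 2, QUICK, Brown, Foxes
+--------------------------------------------------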
+
diff --git a/docs/reference/analysis/tokenizers/uaxurlemail-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/uaxurlemail-tokenizer.asciidoc
new file mode 100644
index 0000000..9ed28e6
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/uaxurlemail-tokenizer.asciidoc
@@ -0,0 +1,16 @@
+[[analysis-uaxurlemail-tokenizer]]
+=== UAX Email URL Tokenizer
+
+A tokenizer of type `uax_url_email` which works exactly like the
+`standard` tokenizer, but keeps email addresses and URLs as single tokens.
+
+The following are settings that can be set for a `uax_url_email`
+tokenizer type:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`max_token_length` |The maximum token length. If a token is seen that
+exceeds this length then it is discarded. Defaults to `255`.
+|=======================================================================
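+
+[float]
+==== Example
+
+For illustration, a minimal sketch against the index-less `_analyze` API;
+unlike the `standard` tokenizer, the email address and the URL each survive
+as a single token:
+
+[source,js]
+--------------------------------------------------
+    curl 'localhost:9200/_analyze?pretty=1&tokenizer=uax_url_email' -d 'Mail john.smith@example.com or see http://example.com/faq'
+    # Mail, john.smith@example.com, or, see, http://example.com/faq
+--------------------------------------------------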
+
diff --git a/docs/reference/analysis/tokenizers/whitespace-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/whitespace-tokenizer.asciidoc
new file mode 100644
index 0000000..f0e1ce2
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/whitespace-tokenizer.asciidoc
@@ -0,0 +1,4 @@
+[[analysis-whitespace-tokenizer]]
+=== Whitespace Tokenizer
+
+A tokenizer of type `whitespace` that divides text at whitespace.
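+
+[float]
+==== Example
+
+For illustration, a minimal sketch against the index-less `_analyze` API;
+only whitespace splits the text, so the underscore and the trailing period
+are kept inside their tokens:
+
+[source,js]
+--------------------------------------------------
+    curl 'localhost:9200/_analyze?pretty=1&tokenizer=whitespace' -d 'The quick_brown fox.'
+    # The, quick_brown, fox.
+--------------------------------------------------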