Diffstat (limited to 'docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc')
-rw-r--r-- | docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc | 29 |
1 files changed, 29 insertions, 0 deletions
diff --git a/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
new file mode 100644
index 0000000..72ca604
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
@@ -0,0 +1,29 @@
+[[analysis-pattern-tokenizer]]
+=== Pattern Tokenizer
+
+A tokenizer of type `pattern` that can flexibly separate text into terms
+via a regular expression. Accepts the following settings:
+
+[cols="<,<",options="header",]
+|======================================================================
+|Setting |Description
+|`pattern` |The regular expression pattern, defaults to `\\W+`.
+|`flags` |The regular expression flags.
+|`group` |Which group to extract into tokens. Defaults to `-1` (split).
+|======================================================================
+
+*IMPORTANT*: The regular expression should match the *token separators*,
+not the tokens themselves.
+
+`group` set to `-1` (the default) is equivalent to "split". Using group
+>= 0 selects the matching group as the token. For example, if you have:
+
+------------------------
+pattern = \\'([^\']+)\\'
+group = 0
+input = aaa 'bbb' 'ccc'
+------------------------
+
+the output will be two tokens: 'bbb' and 'ccc' (including the ' marks).
+With the same input but using group=1, the output would be: bbb and ccc
+(no ' marks).
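
The `group` semantics described in the added page can be illustrated with a small sketch. This is not the tokenizer's actual implementation: it uses Python's `re` module rather than Java regular expressions, and the function name `pattern_tokenize` is hypothetical. It assumes that `group = -1` emits the text between matches (dropping empty pieces) and that `group >= 0` emits the selected group of each match. The pattern is written without the extra backslash escaping shown in the settings example.

```python
import re


def pattern_tokenize(text, pattern=r"\W+", group=-1):
    """Hypothetical sketch of the `pattern` tokenizer's `group` setting."""
    regex = re.compile(pattern)

    if group >= 0:
        # group >= 0: the selected group of each match becomes a token.
        return [m.group(group) for m in regex.finditer(text) if m.group(group)]

    # group == -1 ("split"): matches are separators; emit the text between them.
    tokens, start = [], 0
    for m in regex.finditer(text):
        if m.start() > start:
            tokens.append(text[start:m.start()])
        start = m.end()
    if start < len(text):
        tokens.append(text[start:])
    return tokens


quoted = r"'([^']+)'"
print(pattern_tokenize("aaa 'bbb' 'ccc'", quoted, group=0))  # tokens keep the ' marks
print(pattern_tokenize("aaa 'bbb' 'ccc'", quoted, group=1))  # tokens without ' marks
print(pattern_tokenize("The quick-brown fox."))               # default split on \W+
```

Under these assumptions, the guidance that the pattern should match separators applies to the default split mode; with `group >= 0` the pattern matches the tokens themselves, as in the quoted-string example.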