Diffstat (limited to 'docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc')
-rw-r--r-- docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc | 29
1 files changed, 29 insertions, 0 deletions
diff --git a/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
new file mode 100644
index 0000000..72ca604
--- /dev/null
+++ b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
@@ -0,0 +1,29 @@
+[[analysis-pattern-tokenizer]]
+=== Pattern Tokenizer
+
+A tokenizer of type `pattern` that can flexibly separate text into terms
+via a regular expression. Accepts the following settings:
+
+[cols="<,<",options="header",]
+|======================================================================
+|Setting |Description
+|`pattern` |The regular expression pattern. Defaults to `\W+` (written `\\W+` inside a JSON string).
+|`flags` |Java regular expression flags, such as `CASE_INSENSITIVE`. Separate multiple flags with a pipe character.
+|`group` |Which group to extract into tokens. Defaults to `-1` (split).
+|======================================================================
+
+*IMPORTANT*: With the default `group` of `-1`, the regular expression
+should match the *token separators*, not the tokens themselves.
+
+`group` set to `-1` (the default) is equivalent to "split". Setting it to `0` or
+higher emits that capture group of each match as a token (`0` is the whole match). For example, if you have:
+
+------------------------
+pattern = '([^']+)'
+group = 0
+input = aaa 'bbb' 'ccc'
+------------------------
+
+the output will be two tokens: `'bbb'` and `'ccc'` (including the `'` marks).
+With the same input but `group` set to `1`, the output would be `bbb` and `ccc`
+(no `'` marks).
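
The split and group behaviour described in the added page can be approximated outside the tokenizer with a few lines of Python. This is only a rough sketch: `pattern_tokenize` is a hypothetical helper, not part of any library, and it uses Python's `re` module, whose `\W` handling and flag names differ from Java's regular expressions. It also assumes that empty tokens are dropped.

```python
import re


def pattern_tokenize(text, pattern=r"\W+", group=-1):
    """Approximate the `pattern` tokenizer (hypothetical helper, Python regex semantics)."""
    regex = re.compile(pattern)
    tokens = []
    if group == -1:
        # Split mode: every match is a separator; keep the text between matches.
        last = 0
        for match in regex.finditer(text):
            if match.start() > last:
                tokens.append(text[last:match.start()])
            last = match.end()
        if last < len(text):
            tokens.append(text[last:])
        return tokens
    # Group mode: each match yields one token taken from the requested group.
    for match in regex.finditer(text):
        token = match.group(group)
        if token:  # assumption: skip empty or non-participating groups
            tokens.append(token)
    return tokens


text = "aaa 'bbb' 'ccc'"
split_tokens = pattern_tokenize(text)                    # default: split on \W+
group0_tokens = pattern_tokenize(text, r"'([^']+)'", 0)  # whole match
group1_tokens = pattern_tokenize(text, r"'([^']+)'", 1)  # first capture group
```

With the example input from the page, the default settings split on the non-word characters, while `group` values of `0` and `1` keep or drop the surrounding quote marks respectively.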