[[analysis-pattern-tokenizer]]
=== Pattern Tokenizer

A tokenizer of type `pattern` that uses a regular expression either to
split text into terms or to capture the terms directly. It accepts the
following settings:

[cols="<,<",options="header",]
|======================================================================
|Setting |Description
|`pattern` |The regular expression pattern. Defaults to `\\W+`.
|`flags` |The Java regular expression flags, such as `CASE_INSENSITIVE`. Separate multiple flags with a pipe character.
|`group` |Which group to extract into tokens. Defaults to `-1` (split).
|======================================================================

*IMPORTANT*: In the default split mode, the regular expression should
match the *token separators*, not the tokens themselves.
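
As a rough illustration of split mode, the sketch below emulates it with
Python's `re` module rather than the Java regular expression engine the
tokenizer actually uses. The helper name `split_tokens` is hypothetical,
and dropping empty tokens is an assumption about the tokenizer's behavior.

```python
import re

def split_tokens(text, pattern=r"\W+"):
    """Hypothetical helper: treat every match of `pattern` as a separator
    and return the text between matches, skipping empty pieces."""
    tokens = []
    start = 0
    for match in re.finditer(pattern, text):
        if match.start() > start:          # non-empty text before this separator
            tokens.append(text[start:match.start()])
        start = match.end()                # continue after the separator
    if start < len(text):                  # trailing text after the last separator
        tokens.append(text[start:])
    return tokens

print(split_tokens("The foo_bar_size's default is 5."))
```

Because `\W+` matches runs of non-word characters, the apostrophe and the
spaces act as separators, while the underscores in `foo_bar_size` do not.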

Setting `group` to `-1` (the default) splits the input on every match.
Setting `group` to a value >= 0 instead emits the text captured by that
group in each match as a token, where group 0 is the whole match. For
example, given:

------------------------
pattern = '([^']+)'
group   = 0
input   = aaa 'bbb' 'ccc'
------------------------

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks).
With the same input but using group=1, the output would be: bbb and ccc
(no ' marks).
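
The group behavior described above can be sketched in Python. This is only
an approximation: it uses Python's `re` module in place of the Java regular
expression engine, the helper name `group_tokens` is hypothetical, and the
pattern is written as a plain regular expression without any string
escaping.

```python
import re

def group_tokens(text, pattern, group=0):
    """Hypothetical helper: emit the chosen capture group of every match
    as a token, skipping matches where that group is empty or unset."""
    return [
        match.group(group)
        for match in re.finditer(pattern, text)
        if match.group(group)
    ]

pattern = r"'([^']+)'"
text = "aaa 'bbb' 'ccc'"

print(group_tokens(text, pattern, group=0))  # whole match, quotes included
print(group_tokens(text, pattern, group=1))  # first capture group, quotes stripped
```

Note that `aaa` produces no token in either case, because it is never part
of a match.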