[[regexp-syntax]]
==== Regular expression syntax

Regular expression queries are supported by the `regexp` and the `query_string`
queries.  The Lucene regular expression engine is not Perl-compatible; it
supports a smaller range of operators.

[NOTE]
====
This section does not attempt to explain regular expressions in general;
it only describes the supported operators.
====

===== Standard operators

Anchoring::
+
--

Most regular expression engines allow you to match any part of a string.
If you want the regexp pattern to start at the beginning of the string or
finish at the end of the string, then you have to _anchor_ it specifically,
using `^` to indicate the beginning or `$` to indicate the end.

Lucene's patterns are always anchored.  The pattern provided must match
the entire string. For string `"abcde"`:

    ab.*     # match
    abcd     # no match
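
The difference between anchored and unanchored matching can be sketched in
Python. This is only an illustration: Python's `re` module is not the Lucene
engine, and the helper names `lucene_style_match` and `unanchored_match` are
hypothetical. `re.fullmatch` stands in for Lucene's implicit anchoring, while
`re.search` shows the behaviour of engines that match any part of a string.

```python
import re


def lucene_style_match(pattern: str, text: str) -> bool:
    # Approximates Lucene's implicit anchoring: the pattern must
    # cover the entire string, as if wrapped in ^...$.
    return re.fullmatch(pattern, text) is not None


def unanchored_match(pattern: str, text: str) -> bool:
    # Typical behaviour elsewhere: a match anywhere in the string counts.
    return re.search(pattern, text) is not None


print(lucene_style_match("ab.*", "abcde"))
print(lucene_style_match("abcd", "abcde"))
print(unanchored_match("abcd", "abcde"))
```

Under these assumptions only the unanchored helper accepts `abcd` for the
string `"abcde"`, which mirrors the "no match" result listed above.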

--

Allowed characters::
+
--

Any Unicode characters may be used in the pattern, but certain characters
are reserved and must be escaped.  The standard reserved characters are:

....
. ? + * | { } [ ] ( ) " \
....

If you enable optional features (see below) then these characters may
also be reserved:

    # @ & < >  ~

Any reserved character can be escaped with a backslash `"\*"` including
a literal backslash character: `"\\"`

Additionally, any characters (except double quotes) are interpreted literally
when surrounded by double quotes:

    john"@smith.com"
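
When a pattern is built from user input, the reserved characters listed above
have to be escaped first. The following is a minimal sketch of such a helper;
the function name `escape_lucene_regexp` and its `optional_features` parameter
are hypothetical and not part of any client library. It escapes with a
backslash, as described above.

```python
# Reserved characters as listed in this section.
STANDARD_RESERVED = set('.?+*|{}[]()"\\')
OPTIONAL_RESERVED = set('#@&<>~')


def escape_lucene_regexp(text: str, optional_features: bool = True) -> str:
    # Hypothetical helper: prefix every reserved character with a backslash.
    # With optional_features=True the optional operators are escaped as well,
    # which is the safer choice when the enabled flags are not known.
    if optional_features:
        reserved = STANDARD_RESERVED | OPTIONAL_RESERVED
    else:
        reserved = STANDARD_RESERVED
    return "".join("\\" + ch if ch in reserved else ch for ch in text)


print(escape_lucene_regexp("john@smith.com"))
```

Escaping the optional characters even when the corresponding flags are disabled
is assumed to be harmless here, since an escaped character is always
interpreted literally.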


--

Match any character::
+
--

The period `"."` can be used to represent any character.  For string `"abcde"`:

    ab...   # match
    a.c.e   # match

--

One-or-more::
+
--

The plus sign `"+"` can be used to repeat the preceding shortest pattern
one or more times. For string `"aaabbb"`:

    a+b+        # match
    aa+bb+      # match
    a+.+        # match
    aa+bbb+     # match

--

Zero-or-more::
+
--

The asterisk `"*"` can be used to match the preceding shortest pattern
zero or more times.  For string `"aaabbb"`:

    a*b*        # match
    a*b*c*      # match
    .*bbb.*     # match
    aaa*bbb*    # match

--

Zero-or-one::
+
--

The question mark `"?"` makes the preceding shortest pattern optional. It
matches zero or one times.  For string `"aaabbb"`:

    aaa?bbb?    # match
    aaaa?bbbb?  # match
    .....?.?    # match
    aa?bb?      # no match

--

Min-to-max::
+
--

Curly brackets `"{}"` can be used to specify a minimum and (optionally)
a maximum number of times the preceding shortest pattern can repeat.  The
allowed forms are:

    {5}     # repeat exactly 5 times
    {2,5}   # repeat at least twice and at most 5 times
    {2,}    # repeat at least twice

For string `"aaabbb"`:

    a{3}b{3}        # match
    a{2,4}b{2,4}    # match
    a{2,}b{2,}      # match
    .{3}.{3}        # match
    a{4}b{4}        # no match
    a{4,6}b{4,6}    # no match
    a{4,}b{4,}      # no match
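
The results above can be checked mechanically. The following is a minimal
sketch that uses Python's `re.fullmatch` as a stand-in for Lucene's
always-anchored matching. Python's engine is not Lucene's; the assumption is
only that both agree on these simple `{}` patterns. The helper name
`fully_matches` is hypothetical.

```python
import re

# Expected results copied from the examples above for the string "aaabbb".
EXPECTED = {
    "a{3}b{3}": True,
    "a{2,4}b{2,4}": True,
    "a{2,}b{2,}": True,
    ".{3}.{3}": True,
    "a{4}b{4}": False,
    "a{4,6}b{4,6}": False,
    "a{4,}b{4,}": False,
}


def fully_matches(pattern: str, text: str) -> bool:
    # The whole string must match, as with Lucene's implicit anchoring.
    return re.fullmatch(pattern, text) is not None


results = {pattern: fully_matches(pattern, "aaabbb") for pattern in EXPECTED}
print(results == EXPECTED)
```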

--

Grouping::
+
--

Parentheses `"()"` can be used to form sub-patterns. The quantity operators
listed above operate on the shortest previous pattern, which can be a group.
For string `"ababab"`:

    (ab)+       # match
    ab(ab)+     # match
    (..)+       # match
    (...)+      # match
    (ab)*       # match
    abab(ab)?   # match
    ab(ab)?     # no match
    (ab){3}     # match
    (ab){1,2}   # no match

--

Alternation::
+
--

The pipe symbol `"|"` acts as an OR operator. The match will succeed if
the pattern on either the left-hand side OR the right-hand side matches.
The alternation applies to the _longest pattern_, not the shortest.
For string `"aabb"`:

    aabb|bbaa   # match
    aacc|bb     # no match
    aa(cc|bb)   # match
    a+|b+       # no match
    a+b+|b+a+   # match
    a+(b|c)+    # match
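
The scope of the alternation is easy to get wrong, so a small sketch may help.
It uses Python's `re.fullmatch` to imitate Lucene's anchored matching, on the
assumption that both engines treat `|` and `()` the same way in these simple
cases. The helper name `fully_matches` is hypothetical.

```python
import re


def fully_matches(pattern: str, text: str) -> bool:
    # The whole string must match, as with Lucene's implicit anchoring.
    return re.fullmatch(pattern, text) is not None


# Without a group the alternation splits the whole pattern:
# either "aacc" or "bb" must match the entire string.
print(fully_matches("aacc|bb", "aabb"))

# A group restricts the alternation to "cc" or "bb" after the prefix "aa".
print(fully_matches("aa(cc|bb)", "aabb"))

# Each side must match the entire string, so neither a+ nor b+ is enough.
print(fully_matches("a+|b+", "aabb"))
```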

--

Character classes::
+
--

Ranges of potential characters may be represented as character classes
by enclosing them in square brackets `"[]"`. A leading `^`
negates the character class. The allowed forms are:

    [abc]    # 'a' or 'b' or 'c'
    [a-c]    # 'a' or 'b' or 'c'
    [-abc]   # '-' or 'a' or 'b' or 'c'
    [abc\-]  # '-' or 'a' or 'b' or 'c'
    [^abc]   # any character except 'a' or 'b' or 'c'
    [^a-c]   # any character except 'a' or 'b' or 'c'
    [^-abc]  # any character except '-' or 'a' or 'b' or 'c'
    [^abc\-] # any character except '-' or 'a' or 'b' or 'c'

Note that the dash `"-"` indicates a range of characters, unless it is
the first character or it is escaped with a backslash.

For string `"abcd"`:

    ab[cd]+     # match
    [a-d]+      # match
    [^a-d]+     # no match

--

===== Optional operators

These operators are only available when they are explicitly enabled, by
passing `flags` to the query.

Multiple flags can be enabled either using the `ALL` flag, or by
concatenating flags with a pipe `"|"`:

    {
        "regexp": {
            "username": {
                "value": "john~athon<1-5>",
                "flags": "COMPLEMENT|INTERVAL"
            }
        }
    }
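
A client can assemble the same request body programmatically. The following is
a minimal sketch in Python; the helper `regexp_query` is hypothetical and only
builds the JSON structure shown above, joining the flag names with a pipe.

```python
import json


def regexp_query(field, value, flags=None):
    # Hypothetical helper: build the body of a regexp query.
    body = {"value": value}
    if flags:
        # Multiple flags are concatenated with a pipe, e.g. "COMPLEMENT|INTERVAL".
        body["flags"] = "|".join(flags)
    return {"regexp": {field: body}}


query = regexp_query("username", "john~athon<1-5>", ["COMPLEMENT", "INTERVAL"])
print(json.dumps(query, indent=4))
```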

Complement::
+
--

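The complement is probably the most useful option. The shortest pattern that
follows a tilde `"~"` is negated: in that position it matches any string
other than the ones the pattern itself matches.  For the string `"abcdef"`:

    ab~df     # match
    ab~cf     # match
    ab~cdef   # no match
    a~(cb)def # match
    a~(bc)def # no match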

Enabled with the `COMPLEMENT` or `ALL` flags.

--

Interval::
+
--

The interval option enables the use of numeric ranges, enclosed by angle
brackets `"<>"`. For string `"foo80"`:

    foo<1-100>     # match
    foo<01-100>    # match
    foo<001-100>   # no match

Enabled with the `INTERVAL` or `ALL` flags.


--

Intersection::
+
--

The ampersand `"&"` joins two patterns in a way that both of them have to
match. For string `"aaabbb"`:

    aaa.+&.+bbb     # match
    aaa&bbb         # no match

Using this feature usually means that you should rewrite your regular
expression.

Enabled with the `INTERSECTION` or `ALL` flags.
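
The semantics of the intersection can be imitated by requiring two separate
anchored matches. This is a sketch in Python, not the Lucene implementation;
the helper name `intersection_match` is hypothetical and `re.fullmatch` stands
in for Lucene's always-anchored matching.

```python
import re


def intersection_match(left: str, right: str, text: str) -> bool:
    # Both sub-patterns must match the entire string.
    return (re.fullmatch(left, text) is not None
            and re.fullmatch(right, text) is not None)


print(intersection_match("aaa.+", ".+bbb", "aaabbb"))
print(intersection_match("aaa", "bbb", "aaabbb"))
```

The second call fails because neither `aaa` nor `bbb` covers the whole string
on its own.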

--

Any string::
+
--

The at sign `"@"` matches any string in its entirety.  This could be combined
with the intersection and complement above to express ``everything except''.
For instance:

    @&~(foo.+)      # anything except "foo" followed by one or more characters

Enabled with the `ANYSTRING` or `ALL` flags.
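
Because `@` accepts every string, intersecting it with a complement reduces to
"does not match the negated pattern". The following Python sketch imitates
that reading with `re.fullmatch`. It is an illustration under that assumption,
not the Lucene implementation, and the helper name `anything_except` is
hypothetical.

```python
import re


def anything_except(negated_pattern: str, text: str) -> bool:
    # '@' matches any string, so '@&~(p)' accepts exactly the strings
    # that p does not match in their entirety.
    return re.fullmatch(negated_pattern, text) is None


print(anything_except("foo.+", "foobar"))
print(anything_except("foo.+", "bar"))
print(anything_except("foo.+", "foo"))
```

Note that the bare string `"foo"` is still accepted, because `foo.+` requires
at least one character after `foo`.
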
--