summaryrefslogtreecommitdiff
path: root/docs/reference/search/percolate.asciidoc
blob: c214d8016e2dd34c1f2f1b04b5ad2e6ff9a36a51 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
[[search-percolate]]
== Percolator

added[1.0.0.Beta1]

Traditionally you design documents based on your data and store them into an index and then define queries via the search api
in order to retrieve these documents. The percolator works in the opposite direction, first you store queries into an
index and then via the percolate api you define documents in order to retrieve these queries.

The reason that queries can be stored comes from the fact that in Elasticsearch both documents and queries are defined in
JSON. This allows you to embed queries into documents via the index api. Elasticsearch can extract the query from a
document and make it available to the percolate api. Since documents are also defines as json, you can define a document
in a request to the percolate api.

The percolator and most of its features work in realtime, so once a percolate query is indexed it can immediately be used
in the percolate api.

[float]
=== Sample usage

Adding a query to the percolator:

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my-index/.percolator/1' -d '{
    "query" : {
        "match" : {
            "message" : "bonsai tree"
        }
    }
}'
--------------------------------------------------

Matching documents to the added queries:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/my-index/message/_percolate' -d '{
    "doc" : {
        "message" : "A new bonsai tree in the office"
    }
}'
--------------------------------------------------

The above request will yield the following response:

[source,js]
--------------------------------------------------
{
    "took" : 19,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "total" : 1,
    "matches" : [
    	{
    	     "_index" : "my-index",
    	     "_id" : "1"
    	}
    ]
}
--------------------------------------------------

The percolate api returns matches that refer to percolate queries that have matched with the document defined in the percolate api.

[float]
=== Indexing percolator queries

Percolate queries are stored as documents in a specific format and in an arbitrary index under a reserved type with the
name `.percolator`. The query itself is placed as is in a json objects under the top level field `query`.

[source,js]
--------------------------------------------------
{
    "query" : {
		"match" : {
			"field" : "value"
		}
	}
}
--------------------------------------------------

Since this is just an ordinary document, any field can be added to this document. This can be useful later on to only
percolate documents by specific queries.

[source,js]
--------------------------------------------------
{
	"query" : {
		"match" : {
			"field" : "value"
		}
	},
	"priority" : "high"
}
--------------------------------------------------

On top of this also a mapping type can be associated with the this query. This allows to control how certain queries
like range queries, shape filters and other query & filters that rely on mapping settings get constructed. This is
important since the percolate queries are indexed into the `.percolator` type, and the queries / filters that rely on
mapping settings would yield unexpected behaviour. Note by default field names do get resolved in a smart manner,
but in certain cases with multiple types this can lead to unexpected behaviour, so being explicit about it will help.

[source,js]
--------------------------------------------------
{
	"query" : {
		"range" : {
			"created_at" : {
				"gte" : "2010-01-01T00:00:00",
				"lte" : "2011-01-01T00:00:00"
			}
		}
	},
	"type" : "tweet",
	"priority" : "high"
}
--------------------------------------------------

In the above example the range query gets really parsed into a Lucene numeric range query, based on the settings for
the field `created_at` in the type `tweet`.

Just as with any other type, the `.percolator` type has a mapping, which you can configure via the mappings apis.
The default percolate mapping doesn't index the query field and only stores it.

Because `.percolate` is a type it also has a mapping. By default the following mapping is active:

[source,js]
--------------------------------------------------
{
	".percolator" : {
		"properties" : {
			"query" : {
				"type" : "object",
				"enabled" : false
			}
		}
	}
}
--------------------------------------------------

If needed this mapping can be modified wit the update mapping api.

In order to un-register a percolate query the delete api can be used. So if the previous added query needs to be deleted
the following delete requests needs to be executed:

[source,js]
--------------------------------------------------
curl -XDELETE localhost:9200/my-index/.percolator/1
--------------------------------------------------

[float]
=== Percolate api

The percolate api executes in a distributed manner, meaning it executes on all shards an index points to.

.Required options
* `index` - The index that contains the `.percolator` type. This can also be an alias.
* `type` - The type of the document to be percolated. The mapping of that type is used to parse document.
* `doc` - The actual document to percolate. Unlike the other two options this needs to be specified in the request body. Note this isn't required when percolating an existing document.

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/twitter/tweet/_percolate' -d '{
	"doc" : {
		"created_at" : "2010-10-10T00:00:00",
		"message" : "some text"
	}
}'
--------------------------------------------------

.Additional supported query string options
* `routing` - In the case the percolate queries are partitioned by a custom routing value, that routing option make sure
that the percolate request only gets executed on the shard where the routing value is partitioned to. This means that
the percolate request only gets executed on one shard instead of all shards. Multiple values can be specified as a
comma separated string, in that case the request can be be executed on more than one shard.
* `preference` - Controls which shard replicas are preferred to execute the request on. Works the same as in the search api.
* `ignore_unavailable` - Controls if missing concrete indices should silently be ignored. Same as is in the search api.
* `percolate_format` - If `ids` is specified then the matches array in the percolate response will contain a string
array of the matching ids instead of an array of objects. This can be useful the reduce the amount of data being send
back to the client. Obviously if there are to percolator queries with same id from different indices there is no way
the find out which percolator query belongs to what index. Any other value to `percolate_format` will be ignored.

.Additional request body options
* `filter` - Reduces the number queries to execute during percolating. Only the percolator queries that match with the
filter will be included in the percolate execution. The filter option works in near realtime, so a refresh needs to have
occurred for the filter to included the latest percolate queries.
* `query` - Same as the `filter` option, but also the score is computed. The computed scores can then be used by the
`track_scores` and `sort` option.
* `size` - Defines to maximum number of matches (percolate queries) to be returned. Defaults to unlimited.
* `track_scores` - Whether the `_score` is included for each match. The is based on the query and represents how the query matched
to the percolate query's metadata and *not* how the document being percolated matched to the query. The `query` option
is required for this option. Defaults to `false`.
* `sort` - Define a sort specification like in the search api. Currently only sorting `_score` reverse (default relevancy)
is supported. Other sort fields will throw an exception. The `size` and `query` option are required for this setting. Like
`track_score` the score is based on the query and represents how the query matched to the percolate query's metadata
and *not* how the document being percolated matched to the query.
* `facets` - Allows facet definitions to be included. The facets are based on the matching percolator queries. See facet
documentation how to define facets.
* `aggs` - Allows aggregation definitions to be included. The aggregations are based on the matching percolator queries,
look at the aggregation documentation on how to define aggregations.
* `highlight` - Allows highlight definitions to be included. The document being percolated is being highlight for each
matching query. This allows you to see how each match is highlighting the document being percolated. See highlight
documentation on how to define highlights. The `size` option is required for highlighting, the performance of highlighting
 in the percolate api depends of how many matches are being highlighted.

[float]
=== Dedicated percolator index

Percolate queries can be added to any index. Instead of adding percolate queries to the index the data resides in,
these queries can also be added to an dedicated index. The advantage of this is that this dedicated percolator index
can have its own index settings (For example the number of primary and replicas shards). If you choose to have a dedicated
percolate index, you need to make sure that the mappings from the normal index are also available on the percolate index.
Otherwise percolate queries can be parsed incorrectly.

[float]
=== Filtering Executed Queries

Filtering allows to reduce the number of queries, any filter that the search api supports, (expect the ones mentioned in important notes)
can also be used in the percolate api. The filter only works on the metadata fields. The `query` field isn't indexed by
default. Based on the query we indexed before the following filter can be defined:

[source,js]
--------------------------------------------------
curl -XGET localhost:9200/test/type1/_percolate -d '{
    "doc" : {
        "field" : "value"
    },
    "filter" : {
        "term" : {
            "priority" : "high"
        }
    }
}'
--------------------------------------------------

[float]
=== Percolator count api

The count percolate api, only keeps track of the number of matches and doesn't keep track of the actual matches
Example:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/my-index/my-type/_percolate/count' -d '{
   "doc" : {
       "message" : "some message"
   }
}'
--------------------------------------------------

Response:

[source,js]
--------------------------------------------------
{
   ... // header
   "total" : 3
}
--------------------------------------------------


[float]
=== Percolating an existing document

In order to percolate in newly indexed document, the percolate existing document can be used. Based on the response
from an index request the `_id` and other meta information can be used to the immediately percolate the newly added
document.

.Supported options for percolating an existing document on top of existing percolator options:
* `id` - The id of the document to retrieve the source for.
* `percolate_index` - The index containing the percolate queries. Defaults to the `index` defined in the url.
* `percolate_type` - The percolate type (used for parsing the document). Default to `type` defined in the url.
* `routing` - The routing value to use when retrieving the document to percolate.
* `preference` - Which shard to prefer when retrieving the existing document.
* `percolate_routing` - The routing value to use when percolating the existing document.
* `percolate_preference` - Which shard to prefer when executing the percolate request.
* `version` - Enables a version check. If the fetched document's version isn't equal to the specified version then the request fails with a version conflict and the percolation request is aborted.
* `version_type` - Whether internal or external versioning is used. Defaults to internal versioning.

Internally the percolate api will issue a get request for fetching the`_source` of the document to percolate.
For this feature to work the `_source` for documents to be percolated need to be stored.

[float]
==== Example

Index response:

[source,js]
--------------------------------------------------
{
	"_index" : "my-index",
	"_type" : "message",
	"_id" : "1",
	"_version" : 1,
	"created" : true
}
--------------------------------------------------

Percolating an existing document:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/my-index1/message/1/_percolate'
--------------------------------------------------

The response is the same as with the regular percolate api.

[float]
=== Multi percolate api

The multi percolate api allows to bundle multiple percolate requests into a single request, similar to what the multi
search api does to search requests. The request body format is line based. Each percolate request item takes two lines,
the first line is the header and the second line is the body.

The header can contain any parameter that normally would be set via the request path or query string parameters. T
here are several percolate actions, because there are multiple types of percolate requests.

.Supported actions:
* `percolate` - Action for defining a regular percolate request.
* `count` - Action for defining a count percolate request.

Depending on the percolate action different parameters can be specified. For example the percolate and percolate existing
document actions support different parameters.

.The following endpoints are supported
* GET|POST /[index]/[type]/_mpercolate
* GET}POST /[index]/_mpercolate
* GET|POST /_mpercolate

The `index` and `type` defined in the url path are the default index and type.

[float]
==== Example

Request:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/twitter/tweet/_mpercolate' --data-binary @requests.txt; echo
--------------------------------------------------

The index twitter is the default index and the type tweet is the default type and will be used in the case a header
doesn't specify an index or type.

requests.txt:

[source,js]
--------------------------------------------------
{"percolate" : {"index" : twitter", "type" : "tweet"}}
{"doc" : {"message" : "some text"}}
{"percolate" : {"index" : twitter", "type" : "tweet", "id" : "1"}}
{}
{"percolate" : {"index" : users", "type" : "user", "id" : "3", "percolate_index" : "users_2012" }}
{"size" : 10}
{"count" : {"index" : twitter", "type" : "tweet"}}
{"doc" : {"message" : "some other text"}}
{"count" : {"index" : twitter", "type" : "tweet", "id" : "1"}}
{}
--------------------------------------------------

For a percolate existing document item (headers with the `id` field), the response can be an empty json object.
All the required options are set in the header.

Response:

[source,js]
--------------------------------------------------
{
    "items" : [
        {
            "took" : 24,
            "_shards" : {
                "total" : 5,
                "successful" : 5,
                "failed" : 0,
            },
            "total" : 3,
            "matches" : ["1", "2", "3"]
        },
        {
            "took" : 12,
            "_shards" : {
                "total" : 5,
                "successful" : 5,
                "failed" : 0,
            },
            "total" : 3,
            "matches" : ["4", "5", "6"]
        },
        {
            "error" : "[user][3]document missing"
        },
        {
            "took" : 12,
            "_shards" : {
                "total" : 5,
                "successful" : 5,
                "failed" : 0,
            },
            "total" : 3
        },
        {
            "took" : 14,
            "_shards" : {
                "total" : 5,
                "successful" : 5,
                "failed" : 0,
            },
            "total" : 3
        }
    ]
}
--------------------------------------------------

Each item represents a percolate response, the order of the items maps to the order in where the percolate requests
were specified. In case a percolate request failed, the item response is substituted with an error message.

[float]
=== How it works under the hood

When indexing a document that contains a query in an index and the `.percolator` type the query part of the documents gets
parsed into a Lucene query and is kept in memory until that percolator document is removed or the index containing the
`.percolator` type get removed. So all the active percolator queries are kept in memory.

At percolate time the document specified in the request gets parsed into a Lucene document and is stored in a in-memory
Lucene index. This in-memory index can just hold this one document and it is optimized for that. Then all the queries
that are registered to the index that the percolate request is targeted for are going to be executed on this single document
in-memory index. This happens on each shard the percolate request need to execute.

By using `routing`, `filter` or `query` features the amount of queries that need to be executed can be reduced and thus
the time the percolate api needs to run can be decreased.

[float]
=== Important notes

Because the percolator API is processing one document at a time, it doesn't support queries and filters that run
against child and nested documents such as `has_child`, `has_parent`, `top_children`, and `nested`.

The `wildcard` and `regexp` query natively use a lot of memory and because the percolator keeps the queries into memory
this can easily take up the available memory in the heap space. If possible try to use a `prefix` query or ngramming to
achieve the same result (with way less memory being used).

The delete-by-query api doesn't work to unregister a query, it only deletes the percolate documents from disk. In order
to update the registered queries in memory the index needs be closed and opened.