author/committer: Hilko Bengen <bengen@debian.org>, 2014-06-07 12:02:12 +0200
commit d5ed89b946297270ec28abf44bef2371a06f1f4f
Imported Upstream version 1.0.3 (tag: upstream/1.0.3)
Diffstat (limited to 'docs/reference/mapping')
28 files changed, 2923 insertions, 0 deletions
diff --git a/docs/reference/mapping/conf-mappings.asciidoc b/docs/reference/mapping/conf-mappings.asciidoc new file mode 100644 index 0000000..e9bb3f9 --- /dev/null +++ b/docs/reference/mapping/conf-mappings.asciidoc @@ -0,0 +1,19 @@

[[mapping-conf-mappings]]
== Config Mappings

New mappings can be created using the <<indices-put-mapping,Put Mapping>> API. When a document is indexed into an index with no mapping associated with it, the <<mapping-dynamic-mapping,dynamic / default mapping>> feature kicks in and automatically creates a mapping definition for it.

Mappings can also be provided at the node level, meaning that each index created will automatically start with all the mappings defined within a certain location.

Mappings can be defined in files called `[mapping_name].json` and placed either under the `config/mappings/_default` location, or under `config/mappings/[index_name]` (for mappings that should be associated only with a specific index).

diff --git a/docs/reference/mapping/date-format.asciidoc b/docs/reference/mapping/date-format.asciidoc new file mode 100644 index 0000000..eada734 --- /dev/null +++ b/docs/reference/mapping/date-format.asciidoc @@ -0,0 +1,206 @@

[[mapping-date-format]]
== Date Format

In JSON documents, dates are represented as strings. Elasticsearch uses a set of pre-configured formats to recognize and convert them, but you can change the defaults by specifying the `format` option when defining a `date` type, or by specifying `dynamic_date_formats` in the `root object` mapping (which will be used unless explicitly overridden by a `date` type). Built-in formats are supported, as well as completely custom ones.

Date parsing uses http://joda-time.sourceforge.net/[Joda]. If no format is specified, the default parser is
http://joda-time.sourceforge.net/api-release/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser()[ISODateTimeFormat.dateOptionalTimeParser].

An extension to the format allows defining several formats using the `||` separator. This makes it possible to accept less strict input: for example, the `yyyy/MM/dd HH:mm:ss||yyyy/MM/dd` format will parse both `yyyy/MM/dd HH:mm:ss` and `yyyy/MM/dd`. The first format also acts as the one used to convert back from milliseconds to a string representation.

[float]
[[date-math]]
=== Date Math

The `date` type supports date math expressions in queries and filters (mainly useful in a `range` query/filter).

The expression starts with an "anchor" date, which can be either `now` or a date string (in the applicable format) ending with `||`. It can then be followed by a math expression, supporting `+`, `-` and `/` (rounding). The supported units are `y` (year), `M` (month), `w` (week), `d` (day), `h` (hour), `m` (minute), and `s` (second).

Here are some samples: `now+1h`, `now+1h+1m`, `now+1h/d`, `2012-01-01||+1M/d`.

Note, when doing `range` type searches with an inclusive upper bound, the rounding will properly round to the ceiling instead of flooring.

To change this behavior, set `"mapping.date.round_ceil": false`.
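Putting this together, here is a minimal sketch of a `range` filter using date math (the `postDate` field name is illustrative):

[source,js]
--------------------------------------------------
{
    "query" : {
        "filtered" : {
            "query" : { "match_all" : {} },
            "filter" : {
                "range" : {
                    "postDate" : {
                        "gte" : "now-1M/d",
                        "lt" : "now/d"
                    }
                }
            }
        }
    }
}
--------------------------------------------------

Here `now-1M/d` means "one month ago, rounded down to the start of that day", so the filter matches the last month of documents up to, but not including, today.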
+ + +[float] +[[built-in]] +=== Built In Formats + +The following tables lists all the defaults ISO formats supported: + +[cols="<,<",options="header",] +|======================================================================= +|Name |Description +|`basic_date`|A basic formatter for a full date as four digit year, two +digit month of year, and two digit day of month (yyyyMMdd). + +|`basic_date_time`|A basic formatter that combines a basic date and time, +separated by a 'T' (yyyyMMdd'T'HHmmss.SSSZ). + +|`basic_date_time_no_millis`|A basic formatter that combines a basic date +and time without millis, separated by a 'T' (yyyyMMdd'T'HHmmssZ). + +|`basic_ordinal_date`|A formatter for a full ordinal date, using a four +digit year and three digit dayOfYear (yyyyDDD). + +|`basic_ordinal_date_time`|A formatter for a full ordinal date and time, +using a four digit year and three digit dayOfYear +(yyyyDDD'T'HHmmss.SSSZ). + +|`basic_ordinal_date_time_no_millis`|A formatter for a full ordinal date +and time without millis, using a four digit year and three digit +dayOfYear (yyyyDDD'T'HHmmssZ). + +|`basic_time`|A basic formatter for a two digit hour of day, two digit +minute of hour, two digit second of minute, three digit millis, and time +zone offset (HHmmss.SSSZ). + +|`basic_time_no_millis`|A basic formatter for a two digit hour of day, +two digit minute of hour, two digit second of minute, and time zone +offset (HHmmssZ). + +|`basic_t_time`|A basic formatter for a two digit hour of day, two digit +minute of hour, two digit second of minute, three digit millis, and time +zone off set prefixed by 'T' ('T'HHmmss.SSSZ). + +|`basic_t_time_no_millis`|A basic formatter for a two digit hour of day, +two digit minute of hour, two digit second of minute, and time zone +offset prefixed by 'T' ('T'HHmmssZ). + +|`basic_week_date`|A basic formatter for a full date as four digit +weekyear, two digit week of weekyear, and one digit day of week +(xxxx'W'wwe). + +|`basic_week_date_time`|A basic formatter that combines a basic weekyear +date and time, separated by a 'T' (xxxx'W'wwe'T'HHmmss.SSSZ). + +|`basic_week_date_time_no_millis`|A basic formatter that combines a basic +weekyear date and time without millis, separated by a 'T' +(xxxx'W'wwe'T'HHmmssZ). + +|`date`|A formatter for a full date as four digit year, two digit month +of year, and two digit day of month (yyyy-MM-dd). + +|`date_hour`|A formatter that combines a full date and two digit hour of +day. + +|`date_hour_minute`|A formatter that combines a full date, two digit hour +of day, and two digit minute of hour. + +|`date_hour_minute_second`|A formatter that combines a full date, two +digit hour of day, two digit minute of hour, and two digit second of +minute. + +|`date_hour_minute_second_fraction`|A formatter that combines a full +date, two digit hour of day, two digit minute of hour, two digit second +of minute, and three digit fraction of second +(yyyy-MM-dd'T'HH:mm:ss.SSS). + +|`date_hour_minute_second_millis`|A formatter that combines a full date, +two digit hour of day, two digit minute of hour, two digit second of +minute, and three digit fraction of second (yyyy-MM-dd'T'HH:mm:ss.SSS). + +|`date_optional_time`|a generic ISO datetime parser where the date is +mandatory and the time is optional. + +|`date_time`|A formatter that combines a full date and time, separated by +a 'T' (yyyy-MM-dd'T'HH:mm:ss.SSSZZ). + +|`date_time_no_millis`|A formatter that combines a full date and time +without millis, separated by a 'T' (yyyy-MM-dd'T'HH:mm:ssZZ). 
+ +|`hour`|A formatter for a two digit hour of day. + +|`hour_minute`|A formatter for a two digit hour of day and two digit +minute of hour. + +|`hour_minute_second`|A formatter for a two digit hour of day, two digit +minute of hour, and two digit second of minute. + +|`hour_minute_second_fraction`|A formatter for a two digit hour of day, +two digit minute of hour, two digit second of minute, and three digit +fraction of second (HH:mm:ss.SSS). + +|`hour_minute_second_millis`|A formatter for a two digit hour of day, two +digit minute of hour, two digit second of minute, and three digit +fraction of second (HH:mm:ss.SSS). + +|`ordinal_date`|A formatter for a full ordinal date, using a four digit +year and three digit dayOfYear (yyyy-DDD). + +|`ordinal_date_time`|A formatter for a full ordinal date and time, using +a four digit year and three digit dayOfYear (yyyy-DDD'T'HH:mm:ss.SSSZZ). + +|`ordinal_date_time_no_millis`|A formatter for a full ordinal date and +time without millis, using a four digit year and three digit dayOfYear +(yyyy-DDD'T'HH:mm:ssZZ). + +|`time`|A formatter for a two digit hour of day, two digit minute of +hour, two digit second of minute, three digit fraction of second, and +time zone offset (HH:mm:ss.SSSZZ). + +|`time_no_millis`|A formatter for a two digit hour of day, two digit +minute of hour, two digit second of minute, and time zone offset +(HH:mm:ssZZ). + +|`t_time`|A formatter for a two digit hour of day, two digit minute of +hour, two digit second of minute, three digit fraction of second, and +time zone offset prefixed by 'T' ('T'HH:mm:ss.SSSZZ). + +|`t_time_no_millis`|A formatter for a two digit hour of day, two digit +minute of hour, two digit second of minute, and time zone offset +prefixed by 'T' ('T'HH:mm:ssZZ). + +|`week_date`|A formatter for a full date as four digit weekyear, two +digit week of weekyear, and one digit day of week (xxxx-'W'ww-e). + +|`week_date_time`|A formatter that combines a full weekyear date and +time, separated by a 'T' (xxxx-'W'ww-e'T'HH:mm:ss.SSSZZ). + +|`weekDateTimeNoMillis`|A formatter that combines a full weekyear date +and time without millis, separated by a 'T' (xxxx-'W'ww-e'T'HH:mm:ssZZ). + +|`week_year`|A formatter for a four digit weekyear. + +|`weekyearWeek`|A formatter for a four digit weekyear and two digit week +of weekyear. + +|`weekyearWeekDay`|A formatter for a four digit weekyear, two digit week +of weekyear, and one digit day of week. + +|`year`|A formatter for a four digit year. + +|`year_month`|A formatter for a four digit year and two digit month of +year. + +|`year_month_day`|A formatter for a four digit year, two digit month of +year, and two digit day of month. +|======================================================================= + +[float] +[[custom]] +=== Custom Format + +Allows for a completely customizable date format explained +http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html[here]. diff --git a/docs/reference/mapping/dynamic-mapping.asciidoc b/docs/reference/mapping/dynamic-mapping.asciidoc new file mode 100644 index 0000000..b10bced --- /dev/null +++ b/docs/reference/mapping/dynamic-mapping.asciidoc @@ -0,0 +1,65 @@ +[[mapping-dynamic-mapping]] +== Dynamic Mapping + +Default mappings allow to automatically apply generic mapping definition +to types that do not have mapping pre defined. 
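For example (a sketch; the index, type and field names are illustrative), indexing the following document into an index that has no mapping for the `tweet` type will create one automatically:

[source,js]
--------------------------------------------------
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '
{
    "message" : "trying out dynamic mapping"
}
'
--------------------------------------------------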
This works because the <<mapping-object-type,object mapping>>, and in particular the <<mapping-root-object-type,root object mapping>>, allow for schema-less dynamic addition of unmapped fields.

The default mapping definition is a plain mapping definition embedded within the distribution:

[source,js]
--------------------------------------------------
{
    "_default_" : {
    }
}
--------------------------------------------------

Pretty short, no? Basically, everything is defaulted, especially the dynamic nature of the root object mapping. The default mapping definition can be overridden in several ways. The simplest is to define a file called `default-mapping.json` and place it under the `config` directory (which can be configured to exist in a different location). It can also be set explicitly using the `index.mapper.default_mapping_location` setting.

The dynamic creation of mappings for unmapped types can be completely disabled by setting `index.mapper.dynamic` to `false`.

The dynamic creation of fields within a type can be completely disabled by setting the `dynamic` property of the type to `strict`.

Here is a <<indices-put-mapping,Put Mapping>> example that disables dynamic field creation for a `tweet`:

[source,js]
--------------------------------------------------
$ curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '
{
    "tweet" : {
        "dynamic": "strict",
        "properties" : {
            "message" : {"type" : "string", "store" : true }
        }
    }
}
'
--------------------------------------------------

Here is how the default <<mapping-date-format,date_formats>> used in the root and inner object types can be changed:

[source,js]
--------------------------------------------------
{
    "_default_" : {
        "date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy", "date_optional_time"]
    }
}
--------------------------------------------------

diff --git a/docs/reference/mapping/fields.asciidoc b/docs/reference/mapping/fields.asciidoc new file mode 100644 index 0000000..a1f7e98 --- /dev/null +++ b/docs/reference/mapping/fields.asciidoc @@ -0,0 +1,33 @@

[[mapping-fields]]
== Fields

Each mapping has a number of fields associated with it which can be used to control how the document metadata (eg <<mapping-all-field>>) is indexed.

include::fields/uid-field.asciidoc[]

include::fields/id-field.asciidoc[]

include::fields/type-field.asciidoc[]

include::fields/source-field.asciidoc[]

include::fields/all-field.asciidoc[]

include::fields/analyzer-field.asciidoc[]

include::fields/boost-field.asciidoc[]

include::fields/parent-field.asciidoc[]

include::fields/routing-field.asciidoc[]

include::fields/index-field.asciidoc[]

include::fields/size-field.asciidoc[]

include::fields/timestamp-field.asciidoc[]

include::fields/ttl-field.asciidoc[]

diff --git a/docs/reference/mapping/fields/all-field.asciidoc b/docs/reference/mapping/fields/all-field.asciidoc new file mode 100644 index 0000000..65453ef --- /dev/null +++ b/docs/reference/mapping/fields/all-field.asciidoc @@ -0,0 +1,78 @@

[[mapping-all-field]]
=== `_all`

The idea of the `_all` field is that it includes the text of one or more other fields within the indexed document. It comes in especially handy for search requests, where we want to run a query against the content of a document without knowing which fields to search on. This comes at the expense of CPU cycles and index size.
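For example, a query can target the `_all` field directly instead of naming specific fields (a minimal sketch):

[source,js]
--------------------------------------------------
{
    "query" : {
        "match" : {
            "_all" : "quick brown fox"
        }
    }
}
--------------------------------------------------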
+ +The `_all` fields can be completely disabled. Explicit field mapping and +object mapping can be excluded / included in the `_all` field. By +default, it is enabled and all fields are included in it for ease of +use. + +When disabling the `_all` field, it is a good practice to set +`index.query.default_field` to a different value (for example, if you +have a main "message" field in your data, set it to `message`). + +One of the nice features of the `_all` field is that it takes into +account specific fields boost levels. Meaning that if a title field is +boosted more than content, the title (part) in the `_all` field will +mean more than the content (part) in the `_all` field. + +Here is a sample mapping: + +[source,js] +-------------------------------------------------- +{ + "person" : { + "_all" : {"enabled" : true}, + "properties" : { + "name" : { + "type" : "object", + "dynamic" : false, + "properties" : { + "first" : {"type" : "string", "store" : true , "include_in_all" : false}, + "last" : {"type" : "string", "index" : "not_analyzed"} + } + }, + "address" : { + "type" : "object", + "include_in_all" : false, + "properties" : { + "first" : { + "properties" : { + "location" : {"type" : "string", "store" : true, "index_name" : "firstLocation"} + } + }, + "last" : { + "properties" : { + "location" : {"type" : "string"} + } + } + } + }, + "simple1" : {"type" : "long", "include_in_all" : true}, + "simple2" : {"type" : "long", "include_in_all" : false} + } + } +} +-------------------------------------------------- + +The `_all` fields allows for `store`, `term_vector` and `analyzer` (with +specific `index_analyzer` and `search_analyzer`) to be set. + +[float] +[[highlighting]] +==== Highlighting + +For any field to allow +<<search-request-highlighting,highlighting>> it has +to be either stored or part of the `_source` field. By default `_all` +field does not qualify for either, so highlighting for it does not yield +any data. + +Although it is possible to `store` the `_all` field, it is basically an +aggregation of all fields, which means more data will be stored, and +highlighting it might produce strange results. diff --git a/docs/reference/mapping/fields/analyzer-field.asciidoc b/docs/reference/mapping/fields/analyzer-field.asciidoc new file mode 100644 index 0000000..30bb072 --- /dev/null +++ b/docs/reference/mapping/fields/analyzer-field.asciidoc @@ -0,0 +1,41 @@ +[[mapping-analyzer-field]] +=== `_analyzer` + +The `_analyzer` mapping allows to use a document field property as the +name of the analyzer that will be used to index the document. The +analyzer will be used for any field that does not explicitly defines an +`analyzer` or `index_analyzer` when indexing. + +Here is a simple mapping: + +[source,js] +-------------------------------------------------- +{ + "type1" : { + "_analyzer" : { + "path" : "my_field" + } + } +} +-------------------------------------------------- + +The above will use the value of the `my_field` to lookup an analyzer +registered under it. For example, indexing a the following doc: + +[source,js] +-------------------------------------------------- +{ + "my_field" : "whitespace" +} +-------------------------------------------------- + +Will cause the `whitespace` analyzer to be used as the index analyzer +for all fields without explicit analyzer setting. + +The default path value is `_analyzer`, so the analyzer can be driven for +a specific document by setting `_analyzer` field in it. 
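For example, with the default path, indexing a document like the following (a sketch) would use the built-in `keyword` analyzer for all of its fields that have no explicit analyzer setting:

[source,js]
--------------------------------------------------
{
    "_analyzer" : "keyword",
    "message" : "indexed with the keyword analyzer"
}
--------------------------------------------------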
If custom json +field name is needed, an explicit mapping with a different path should +be set. + +By default, the `_analyzer` field is indexed, it can be disabled by +settings `index` to `no` in the mapping. diff --git a/docs/reference/mapping/fields/boost-field.asciidoc b/docs/reference/mapping/fields/boost-field.asciidoc new file mode 100644 index 0000000..1d00845 --- /dev/null +++ b/docs/reference/mapping/fields/boost-field.asciidoc @@ -0,0 +1,72 @@ +[[mapping-boost-field]] +=== `_boost` + +deprecated[1.0.0.RC1,See <<function-score-instead-of-boost>>] + +Boosting is the process of enhancing the relevancy of a document or +field. Field level mapping allows to define explicit boost level on a +specific field. The boost field mapping (applied on the +<<mapping-root-object-type,root object>>) allows +to define a boost field mapping where *its content will control the +boost level of the document*. For example, consider the following +mapping: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_boost" : {"name" : "my_boost", "null_value" : 1.0} + } +} +-------------------------------------------------- + +The above mapping defines mapping for a field named `my_boost`. If the +`my_boost` field exists within the JSON document indexed, its value will +control the boost level of the document indexed. For example, the +following JSON document will be indexed with a boost value of `2.2`: + +[source,js] +-------------------------------------------------- +{ + "my_boost" : 2.2, + "message" : "This is a tweet!" +} +-------------------------------------------------- + +[[function-score-instead-of-boost]] +==== Function score instead of boost + +Support for document boosting via the `_boost` field has been removed +from Lucene and is deprecated in Elasticsearch as of v1.0.0.RC1. The +implementation in Lucene resulted in unpredictable result when +used with multiple fields or multi-value fields. + +Instead, the <<query-dsl-function-score-query>> can be used to achieve +the desired functionality by boosting each document by the value in +any field the document: + +[source,js] +-------------------------------------------------- +{ + "query": { + "function_score": { + "query": { <1> + "match": { + "title": "your main query" + } + }, + "functions": [{ + "script_score": { <2> + "script": "doc['my_boost_field'].value" + } + }], + "score_mode": "multiply" + } + } +} +-------------------------------------------------- +<1> The original query, now wrapped in a `function_score` query. +<2> This script returns the value in `my_boost_field`, which is then + multiplied by the query `_score` for each document. + + diff --git a/docs/reference/mapping/fields/id-field.asciidoc b/docs/reference/mapping/fields/id-field.asciidoc new file mode 100644 index 0000000..1adab49 --- /dev/null +++ b/docs/reference/mapping/fields/id-field.asciidoc @@ -0,0 +1,52 @@ +[[mapping-id-field]] +=== `_id` + +Each document indexed is associated with an id and a type. The `_id` +field can be used to index just the id, and possible also store it. By +default it is not indexed and not stored (thus, not created). + +Note, even though the `_id` is not indexed, all the APIs still work +(since they work with the `_uid` field), as well as fetching by ids +using `term`, `terms` or `prefix` queries/filters (including the +specific `ids` query/filter). 
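For example, even with `_id` neither indexed nor stored, a lookup like the following `ids` query keeps working (a minimal sketch; the type and values are illustrative):

[source,js]
--------------------------------------------------
{
    "query" : {
        "ids" : {
            "type" : "tweet",
            "values" : ["1", "4", "100"]
        }
    }
}
--------------------------------------------------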
+ +The `_id` field can be enabled to be indexed, and possibly stored, +using: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_id" : {"index": "not_analyzed", "store" : false } + } +} +-------------------------------------------------- + +The `_id` mapping can also be associated with a `path` that will be used +to extract the id from a different location in the source document. For +example, having the following mapping: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_id" : { + "path" : "post_id" + } + } +} +-------------------------------------------------- + +Will cause `1` to be used as the id for: + +[source,js] +-------------------------------------------------- +{ + "message" : "You know, for Search", + "post_id" : "1" +} +-------------------------------------------------- + +This does require an additional lightweight parsing step while indexing, +in order to extract the id to decide which shard the index operation +will be executed on. diff --git a/docs/reference/mapping/fields/index-field.asciidoc b/docs/reference/mapping/fields/index-field.asciidoc new file mode 100644 index 0000000..96a320b --- /dev/null +++ b/docs/reference/mapping/fields/index-field.asciidoc @@ -0,0 +1,15 @@ +[[mapping-index-field]] +=== `_index` + +The ability to store in a document the index it belongs to. By default +it is disabled, in order to enable it, the following mapping should be +defined: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_index" : { "enabled" : true } + } +} +-------------------------------------------------- diff --git a/docs/reference/mapping/fields/parent-field.asciidoc b/docs/reference/mapping/fields/parent-field.asciidoc new file mode 100644 index 0000000..3225b53 --- /dev/null +++ b/docs/reference/mapping/fields/parent-field.asciidoc @@ -0,0 +1,21 @@ +[[mapping-parent-field]] +=== `_parent` + +The parent field mapping is defined on a child mapping, and points to +the parent type this child relates to. For example, in case of a `blog` +type and a `blog_tag` type child document, the mapping for `blog_tag` +should be: + +[source,js] +-------------------------------------------------- +{ + "blog_tag" : { + "_parent" : { + "type" : "blog" + } + } +} +-------------------------------------------------- + +The mapping is automatically stored and indexed (meaning it can be +searched on using the `_parent` field notation). diff --git a/docs/reference/mapping/fields/routing-field.asciidoc b/docs/reference/mapping/fields/routing-field.asciidoc new file mode 100644 index 0000000..8ca2286 --- /dev/null +++ b/docs/reference/mapping/fields/routing-field.asciidoc @@ -0,0 +1,69 @@ +[[mapping-routing-field]] +=== `_routing` + +The routing field allows to control the `_routing` aspect when indexing +data and explicit routing control is required. + +[float] +==== store / index + +The first thing the `_routing` mapping does is to store the routing +value provided (`store` set to `false`) and index it (`index` set to +`not_analyzed`). The reason why the routing is stored by default is so +reindexing data will be possible if the routing value is completely +external and not part of the docs. + +[float] +==== required + +Another aspect of the `_routing` mapping is the ability to define it as +required by setting `required` to `true`. This is very important to set +when using routing features, as it allows different APIs to make use of +it. 
For example, an index operation will be rejected if no routing value +has been provided (or derived from the doc). A delete operation will be +broadcasted to all shards if no routing value is provided and `_routing` +is required. + +[float] +==== path + +The routing value can be provided as an external value when indexing +(and still stored as part of the document, in much the same way +`_source` is stored). But, it can also be automatically extracted from +the index doc based on a `path`. For example, having the following +mapping: + +[source,js] +-------------------------------------------------- +{ + "comment" : { + "_routing" : { + "required" : true, + "path" : "blog.post_id" + } + } +} +-------------------------------------------------- + +Will cause the following doc to be routed based on the `111222` value: + +[source,js] +-------------------------------------------------- +{ + "text" : "the comment text" + "blog" : { + "post_id" : "111222" + } +} +-------------------------------------------------- + +Note, using `path` without explicit routing value provided required an +additional (though quite fast) parsing phase. + +[float] +==== id uniqueness + +When indexing documents specifying a custom `_routing`, the uniqueness +of the `_id` is not guaranteed throughout all the shards that the index +is composed of. In fact, documents with the same `_id` might end up in +different shards if indexed with different `_routing` values. diff --git a/docs/reference/mapping/fields/size-field.asciidoc b/docs/reference/mapping/fields/size-field.asciidoc new file mode 100644 index 0000000..7abfd40 --- /dev/null +++ b/docs/reference/mapping/fields/size-field.asciidoc @@ -0,0 +1,26 @@ +[[mapping-size-field]] +=== `_size` + +The `_size` field allows to automatically index the size of the original +`_source` indexed. By default, it's disabled. In order to enable it, set +the mapping to: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_size" : {"enabled" : true} + } +} +-------------------------------------------------- + +In order to also store it, use: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_size" : {"enabled" : true, "store" : true } + } +} +-------------------------------------------------- diff --git a/docs/reference/mapping/fields/source-field.asciidoc b/docs/reference/mapping/fields/source-field.asciidoc new file mode 100644 index 0000000..22bb963 --- /dev/null +++ b/docs/reference/mapping/fields/source-field.asciidoc @@ -0,0 +1,41 @@ +[[mapping-source-field]] +=== `_source` + +The `_source` field is an automatically generated field that stores the +actual JSON that was used as the indexed document. It is not indexed +(searchable), just stored. When executing "fetch" requests, like +<<docs-get,get>> or +<<search-search,search>>, the `_source` field is +returned by default. + +Though very handy to have around, the source field does incur storage +overhead within the index. For this reason, it can be disabled. For +example: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_source" : {"enabled" : false} + } +} +-------------------------------------------------- + +[float] +[[include-exclude]] +==== Includes / Excludes + +Allow to specify paths in the source that would be included / excluded +when it's stored, supporting `*` as wildcard annotation. 
For example: + +[source,js] +-------------------------------------------------- +{ + "my_type" : { + "_source" : { + "includes" : ["path1.*", "path2.*"], + "excludes" : ["pat3.*"] + } + } +} +-------------------------------------------------- diff --git a/docs/reference/mapping/fields/timestamp-field.asciidoc b/docs/reference/mapping/fields/timestamp-field.asciidoc new file mode 100644 index 0000000..97bca8d --- /dev/null +++ b/docs/reference/mapping/fields/timestamp-field.asciidoc @@ -0,0 +1,82 @@ +[[mapping-timestamp-field]] +=== `_timestamp` + +The `_timestamp` field allows to automatically index the timestamp of a +document. It can be provided externally via the index request or in the +`_source`. If it is not provided externally it will be automatically set +to the date the document was processed by the indexing chain. + +[float] +==== enabled + +By default it is disabled, in order to enable it, the following mapping +should be defined: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_timestamp" : { "enabled" : true } + } +} +-------------------------------------------------- + +[float] +==== store / index + +By default the `_timestamp` field has `store` set to `false` and `index` +set to `not_analyzed`. It can be queried as a standard date field. + +[float] +==== path + +The `_timestamp` value can be provided as an external value when +indexing. But, it can also be automatically extracted from the document +to index based on a `path`. For example, having the following mapping: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_timestamp" : { + "enabled" : true, + "path" : "post_date" + } + } +} +-------------------------------------------------- + +Will cause `2009-11-15T14:12:12` to be used as the timestamp value for: + +[source,js] +-------------------------------------------------- +{ + "message" : "You know, for Search", + "post_date" : "2009-11-15T14:12:12" +} +-------------------------------------------------- + +Note, using `path` without explicit timestamp value provided require an +additional (though quite fast) parsing phase. + +[float] +==== format + +You can define the <<mapping-date-format,date +format>> used to parse the provided timestamp value. For example: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_timestamp" : { + "enabled" : true, + "path" : "post_date", + "format" : "YYYY-MM-dd" + } + } +} +-------------------------------------------------- + +Note, the default format is `dateOptionalTime`. The timestamp value will +first be parsed as a number and if it fails the format will be tried. diff --git a/docs/reference/mapping/fields/ttl-field.asciidoc b/docs/reference/mapping/fields/ttl-field.asciidoc new file mode 100644 index 0000000..d47aaca --- /dev/null +++ b/docs/reference/mapping/fields/ttl-field.asciidoc @@ -0,0 +1,70 @@ +[[mapping-ttl-field]] +=== `_ttl` + +A lot of documents naturally come with an expiration date. Documents can +therefore have a `_ttl` (time to live), which will cause the expired +documents to be deleted automatically. 
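For example, once `_ttl` is enabled (see below), a per-document expiration can be provided on the index request; here via the `ttl` URL parameter (a sketch; the index, type, id, and the assumption that the parameter accepts a duration like `1d` are illustrative):

[source,js]
--------------------------------------------------
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1?ttl=1d' -d '
{
    "message" : "this tweet expires in one day"
}
'
--------------------------------------------------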
+ +[float] +==== enabled + +By default it is disabled, in order to enable it, the following mapping +should be defined: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_ttl" : { "enabled" : true } + } +} +-------------------------------------------------- + +[float] +==== store / index + +By default the `_ttl` field has `store` set to `true` and `index` set to +`not_analyzed`. Note that `index` property has to be set to +`not_analyzed` in order for the purge process to work. + +[float] +==== default + +You can provide a per index/type default `_ttl` value as follows: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_ttl" : { "enabled" : true, "default" : "1d" } + } +} +-------------------------------------------------- + +In this case, if you don't provide a `_ttl` value in your query or in +the `_source` all tweets will have a `_ttl` of one day. + +In case you do not specify a time unit like `d` (days), `m` (minutes), +`h` (hours), `ms` (milliseconds) or `w` (weeks), milliseconds is used as +default unit. + +If no `default` is set and no `_ttl` value is given then the document +has an infinite `_ttl` and will not expire. + +You can dynamically update the `default` value using the put mapping +API. It won't change the `_ttl` of already indexed documents but will be +used for future documents. + +[float] +==== Note on documents expiration + +Expired documents will be automatically deleted regularly. You can +dynamically set the `indices.ttl.interval` to fit your needs. The +default value is `60s`. + +The deletion orders are processed by bulk. You can set +`indices.ttl.bulk_size` to fit your needs. The default value is `10000`. + +Note that the expiration procedure handle versioning properly so if a +document is updated between the collection of documents to expire and +the delete order, the document won't be deleted. diff --git a/docs/reference/mapping/fields/type-field.asciidoc b/docs/reference/mapping/fields/type-field.asciidoc new file mode 100644 index 0000000..bac7457 --- /dev/null +++ b/docs/reference/mapping/fields/type-field.asciidoc @@ -0,0 +1,31 @@ +[[mapping-type-field]] +=== Type Field + +Each document indexed is associated with an id and a type. The type, +when indexing, is automatically indexed into a `_type` field. By +default, the `_type` field is indexed (but *not* analyzed) and not +stored. This means that the `_type` field can be queried. + +The `_type` field can be stored as well, for example: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_type" : {"store" : true} + } +} +-------------------------------------------------- + +The `_type` field can also not be indexed, and all the APIs will still +work except for specific queries (term queries / filters) or faceting +done on the `_type` field. 
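For example, a search can be restricted to a single type with a `term` filter on `_type` (a minimal sketch):

[source,js]
--------------------------------------------------
{
    "query" : {
        "filtered" : {
            "query" : { "match_all" : {} },
            "filter" : { "term" : { "_type" : "tweet" } }
        }
    }
}
--------------------------------------------------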
+ +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_type" : {"index" : "no"} + } +} +-------------------------------------------------- diff --git a/docs/reference/mapping/fields/uid-field.asciidoc b/docs/reference/mapping/fields/uid-field.asciidoc new file mode 100644 index 0000000..f9ce245 --- /dev/null +++ b/docs/reference/mapping/fields/uid-field.asciidoc @@ -0,0 +1,11 @@ +[[mapping-uid-field]] +=== `_uid` + +Each document indexed is associated with an id and a type, the internal +`_uid` field is the unique identifier of a document within an index and +is composed of the type and the id (meaning that different types can +have the same id and still maintain uniqueness). + +The `_uid` field is automatically used when `_type` is not indexed to +perform type based filtering, and does not require the `_id` to be +indexed. diff --git a/docs/reference/mapping/meta.asciidoc b/docs/reference/mapping/meta.asciidoc new file mode 100644 index 0000000..5cb0c14 --- /dev/null +++ b/docs/reference/mapping/meta.asciidoc @@ -0,0 +1,25 @@ +[[mapping-meta]] +== Meta + +Each mapping can have custom meta data associated with it. These are +simple storage elements that are simply persisted along with the mapping +and can be retrieved when fetching the mapping definition. The meta is +defined under the `_meta` element, for example: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "_meta" : { + "attr1" : "value1", + "attr2" : { + "attr3" : "value3" + } + } + } +} +-------------------------------------------------- + +Meta can be handy for example for client libraries that perform +serialization and deserialization to store its meta model (for example, +the class the document maps to). diff --git a/docs/reference/mapping/types.asciidoc b/docs/reference/mapping/types.asciidoc new file mode 100644 index 0000000..0cc967e --- /dev/null +++ b/docs/reference/mapping/types.asciidoc @@ -0,0 +1,24 @@ +[[mapping-types]] +== Types + +The datatype for each field in a document (eg strings, numbers, +objects etc) can be controlled via the type mapping. + +include::types/core-types.asciidoc[] + +include::types/array-type.asciidoc[] + +include::types/object-type.asciidoc[] + +include::types/root-object-type.asciidoc[] + +include::types/nested-type.asciidoc[] + +include::types/ip-type.asciidoc[] + +include::types/geo-point-type.asciidoc[] + +include::types/geo-shape-type.asciidoc[] + +include::types/attachment-type.asciidoc[] + diff --git a/docs/reference/mapping/types/array-type.asciidoc b/docs/reference/mapping/types/array-type.asciidoc new file mode 100644 index 0000000..3f887b1 --- /dev/null +++ b/docs/reference/mapping/types/array-type.asciidoc @@ -0,0 +1,74 @@ +[[mapping-array-type]] +=== Array Type + +JSON documents allow to define an array (list) of fields or objects. +Mapping array types could not be simpler since arrays gets automatically +detected and mapping them can be done either with +<<mapping-core-types,Core Types>> or +<<mapping-object-type,Object Type>> mappings. 
+For example, the following JSON defines several arrays: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "message" : "some arrays in this tweet...", + "tags" : ["elasticsearch", "wow"], + "lists" : [ + { + "name" : "prog_list", + "description" : "programming list" + }, + { + "name" : "cool_list", + "description" : "cool stuff list" + } + ] + } +} +-------------------------------------------------- + +The above JSON has the `tags` property defining a list of a simple +`string` type, and the `lists` property is an `object` type array. Here +is a sample explicit mapping: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "properties" : { + "message" : {"type" : "string"}, + "tags" : {"type" : "string", "index_name" : "tag"}, + "lists" : { + "properties" : { + "name" : {"type" : "string"}, + "description" : {"type" : "string"} + } + } + } + } +} +-------------------------------------------------- + +The fact that array types are automatically supported can be shown by +the fact that the following JSON document is perfectly fine: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "message" : "some arrays in this tweet...", + "tags" : "elasticsearch", + "lists" : { + "name" : "prog_list", + "description" : "programming list" + } + } +} +-------------------------------------------------- + +Note also, that thanks to the fact that we used the `index_name` to use +the non plural form (`tag` instead of `tags`), we can actually refer to +the field using the `index_name` as well. For example, we can execute a +query using `tweet.tags:wow` or `tweet.tag:wow`. We could, of course, +name the field as `tag` and skip the `index_name` all together). diff --git a/docs/reference/mapping/types/attachment-type.asciidoc b/docs/reference/mapping/types/attachment-type.asciidoc new file mode 100644 index 0000000..54f9701 --- /dev/null +++ b/docs/reference/mapping/types/attachment-type.asciidoc @@ -0,0 +1,90 @@ +[[mapping-attachment-type]] +=== Attachment Type + +The `attachment` type allows to index different "attachment" type field +(encoded as `base64`), for example, Microsoft Office formats, open +document formats, ePub, HTML, and so on (full list can be found +http://lucene.apache.org/tika/0.10/formats.html[here]). + +The `attachment` type is provided as a +https://github.com/elasticsearch/elasticsearch-mapper-attachments[plugin +extension]. The plugin is a simple zip file that can be downloaded and +placed under `$ES_HOME/plugins` location. It will be automatically +detected and the `attachment` type will be added. + +Note, the `attachment` type is experimental. + +Using the attachment type is simple, in your mapping JSON, simply set a +certain JSON element as attachment, for example: + +[source,js] +-------------------------------------------------- +{ + "person" : { + "properties" : { + "my_attachment" : { "type" : "attachment" } + } + } +} +-------------------------------------------------- + +In this case, the JSON to index can be: + +[source,js] +-------------------------------------------------- +{ + "my_attachment" : "... base64 encoded attachment ..." 
}
--------------------------------------------------

Or it is possible to use a more elaborate JSON object if the content type or resource name needs to be set explicitly:

[source,js]
--------------------------------------------------
{
    "my_attachment" : {
        "_content_type" : "application/pdf",
        "_name" : "resource/name/of/my.pdf",
        "content" : "... base64 encoded attachment ..."
    }
}
--------------------------------------------------

The `attachment` type not only indexes the content of the doc, but also automatically adds metadata on the attachment (when available). The supported metadata fields are `date`, `title`, `author`, and `keywords`. They can be queried using "dot notation", for example: `my_attachment.author`.

Both the metadata and the actual content are simple core type mappers (string, date, ...), so they can be controlled in the mappings. For example:

[source,js]
--------------------------------------------------
{
    "person" : {
        "properties" : {
            "file" : {
                "type" : "attachment",
                "fields" : {
                    "file" : {"index" : "no"},
                    "date" : {"store" : true},
                    "author" : {"analyzer" : "myAnalyzer"}
                }
            }
        }
    }
}
--------------------------------------------------

In the above example, the actual content is mapped under the `fields` name `file`, and we decide not to index it, so it will only be available in the `_all` field. The other fields map to their respective metadata names, but there is no need to specify the `type` (like `string` or `date`) since it is already known.

The plugin uses http://lucene.apache.org/tika/[Apache Tika] to parse attachments, so many formats are supported, listed http://lucene.apache.org/tika/0.10/formats.html[here].

diff --git a/docs/reference/mapping/types/core-types.asciidoc b/docs/reference/mapping/types/core-types.asciidoc new file mode 100644 index 0000000..90ec792 --- /dev/null +++ b/docs/reference/mapping/types/core-types.asciidoc @@ -0,0 +1,754 @@

[[mapping-core-types]]
=== Core Types

Each JSON field can be mapped to a specific core type. JSON itself already provides us with some typing, with its support for `string`, `integer`/`long`, `float`/`double`, `boolean`, and `null`.

The following sample tweet JSON document will be used to explain the core types:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "user" : "kimchy",
        "message" : "This is a tweet!",
        "postDate" : "2009-11-15T14:12:12",
        "priority" : 4,
        "rank" : 12.3
    }
}
--------------------------------------------------

An explicit mapping for the above JSON tweet can be:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "properties" : {
            "user" : {"type" : "string", "index" : "not_analyzed"},
            "message" : {"type" : "string", "null_value" : "na"},
            "postDate" : {"type" : "date"},
            "priority" : {"type" : "integer"},
            "rank" : {"type" : "float"}
        }
    }
}
--------------------------------------------------

[float]
[[string]]
==== String

The text-based `string` type is the most basic type, and contains one or more characters.
An example mapping can be: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "properties" : { + "message" : { + "type" : "string", + "store" : true, + "index" : "analyzed", + "null_value" : "na" + }, + "user" : { + "type" : "string", + "index" : "not_analyzed", + "norms" : { + "enabled" : false + } + } + } + } +} +-------------------------------------------------- + +The above mapping defines a `string` `message` property/field within the +`tweet` type. The field is stored in the index (so it can later be +retrieved using selective loading when searching), and it gets analyzed +(broken down into searchable terms). If the message has a `null` value, +then the value that will be stored is `na`. There is also a `string` `user` +which is indexed as-is (not broken down into tokens) and has norms +disabled (so that matching this field is a binary decision, no match is +better than another one). + +The following table lists all the attributes that can be used with the +`string` type: + +[cols="<,<",options="header",] +|======================================================================= +|Attribute |Description +|`index_name` |The name of the field that will be stored in the index. +Defaults to the property/field name. + +|`store` |Set to `true` to store actual field in the index, `false` to not +store it. Defaults to `false` (note, the JSON document itself is stored, +and it can be retrieved from it). + +|`index` |Set to `analyzed` for the field to be indexed and searchable +after being broken down into token using an analyzer. `not_analyzed` +means that its still searchable, but does not go through any analysis +process or broken down into tokens. `no` means that it won't be +searchable at all (as an individual field; it may still be included in +`_all`). Setting to `no` disables `include_in_all`. Defaults to +`analyzed`. + +|`doc_values` |Set to `true` to store field values in a column-stride fashion. +Automatically set to `true` when the fielddata format is `doc_values`. + +|`term_vector` |Possible values are `no`, `yes`, `with_offsets`, +`with_positions`, `with_positions_offsets`. Defaults to `no`. + +|`boost` |The boost value. Defaults to `1.0`. + +|`null_value` |When there is a (JSON) null value for the field, use the +`null_value` as the field value. Defaults to not adding the field at +all. + +|`norms.enabled` |Boolean value if norms should be enabled or not. Defaults +to `true` for `analyzed` fields, and to `false` for `not_analyzed` fields. + +|`norms.loading` |Describes how norms should be loaded, possible values are +`eager` and `lazy` (default). It is possible to change the default value to +eager for all fields by configuring the index setting `index.norms.loading` +to `eager`. + +|`index_options` | Allows to set the indexing +options, possible values are `docs` (only doc numbers are indexed), +`freqs` (doc numbers and term frequencies), and `positions` (doc +numbers, term frequencies and positions). Defaults to `positions` for +`analyzed` fields, and to `docs` for `not_analyzed` fields. It +is also possible to set it to `offsets` (doc numbers, term +frequencies, positions and offsets). + +|`analyzer` |The analyzer used to analyze the text contents when +`analyzed` during indexing and when searching using a query string. +Defaults to the globally configured analyzer. + +|`index_analyzer` |The analyzer used to analyze the text contents when +`analyzed` during indexing. 
+ +|`search_analyzer` |The analyzer used to analyze the field when part of +a query string. Can be updated on an existing field. + +|`include_in_all` |Should the field be included in the `_all` field (if +enabled). If `index` is set to `no` this defaults to `false`, otherwise, +defaults to `true` or to the parent `object` type setting. + +|`ignore_above` |The analyzer will ignore strings larger than this size. +Useful for generic `not_analyzed` fields that should ignore long text. + +|`position_offset_gap` |Position increment gap between field instances +with the same field name. Defaults to 0. +|======================================================================= + +The `string` type also support custom indexing parameters associated +with the indexed value. For example: + +[source,js] +-------------------------------------------------- +{ + "message" : { + "_value": "boosted value", + "_boost": 2.0 + } +} +-------------------------------------------------- + +The mapping is required to disambiguate the meaning of the document. +Otherwise, the structure would interpret "message" as a value of type +"object". The key `_value` (or `value`) in the inner document specifies +the real string content that should eventually be indexed. The `_boost` +(or `boost`) key specifies the per field document boost (here 2.0). + +[float] +[[number]] +==== Number + +A number based type supporting `float`, `double`, `byte`, `short`, +`integer`, and `long`. It uses specific constructs within Lucene in +order to support numeric values. The number types have the same ranges +as corresponding +http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html[Java +types]. An example mapping can be: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "properties" : { + "rank" : { + "type" : "float", + "null_value" : 1.0 + } + } + } +} +-------------------------------------------------- + +The following table lists all the attributes that can be used with a +numbered type: + +[cols="<,<",options="header",] +|======================================================================= +|Attribute |Description +|`type` |The type of the number. Can be `float`, `double`, `integer`, +`long`, `short`, `byte`. Required. + +|`index_name` |The name of the field that will be stored in the index. +Defaults to the property/field name. + +|`store` |Set to `true` to store actual field in the index, `false` to not +store it. Defaults to `false` (note, the JSON document itself is stored, +and it can be retrieved from it). + +|`index` |Set to `no` if the value should not be indexed. Setting to +`no` disables `include_in_all`. If set to `no` the field can be stored +in `_source`, have `include_in_all` enabled, or `store` should be set to +`true` for this to be useful. + +|`doc_values` |Set to `true` to store field values in a column-stride fashion. +Automatically set to `true` when the fielddata format is `doc_values`. + +|`precision_step` |The precision step (number of terms generated for +each number value). Defaults to `4`. + +|`boost` |The boost value. Defaults to `1.0`. + +|`null_value` |When there is a (JSON) null value for the field, use the +`null_value` as the field value. Defaults to not adding the field at +all. + +|`include_in_all` |Should the field be included in the `_all` field (if +enabled). If `index` is set to `no` this defaults to `false`, otherwise, +defaults to `true` or to the parent `object` type setting. + +|`ignore_malformed` |Ignored a malformed number. 
Defaults to `false`. + +|`coerce` |Try convert strings to numbers and truncate fractions for integers. Defaults to `true`. + +|======================================================================= + +[float] +[[token_count]] +==== Token Count +The `token_count` type maps to the JSON string type but indexes and stores +the number of tokens in the string rather than the string itself. For +example: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "properties" : { + "name" : { + "type" : "string", + "fields" : { + "word_count": { + "type" : "token_count", + "store" : "yes", + "analyzer" : "standard" + } + } + } + } + } +} +-------------------------------------------------- + +All the configuration that can be specified for a number can be specified +for a token_count. The only extra configuration is the required +`analyzer` field which specifies which analyzer to use to break the string +into tokens. For best performance, use an analyzer with no token filters. + +[NOTE] +=================================================================== +Technically the `token_count` type sums position increments rather than +counting tokens. This means that even if the analyzer filters out stop +words they are included in the count. +=================================================================== + +[float] +[[date]] +==== Date + +The date type is a special type which maps to JSON string type. It +follows a specific format that can be explicitly set. All dates are +`UTC`. Internally, a date maps to a number type `long`, with the added +parsing stage from string to long and from long to string. An example +mapping: + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "properties" : { + "postDate" : { + "type" : "date", + "format" : "YYYY-MM-dd" + } + } + } +} +-------------------------------------------------- + +The date type will also accept a long number representing UTC +milliseconds since the epoch, regardless of the format it can handle. + +The following table lists all the attributes that can be used with a +date type: + +[cols="<,<",options="header",] +|======================================================================= +|Attribute |Description +|`index_name` |The name of the field that will be stored in the index. +Defaults to the property/field name. + +|`format` |The <<mapping-date-format,date +format>>. Defaults to `dateOptionalTime`. + +|`store` |Set to `true` to store actual field in the index, `false` to not +store it. Defaults to `false` (note, the JSON document itself is stored, +and it can be retrieved from it). + +|`index` |Set to `no` if the value should not be indexed. Setting to +`no` disables `include_in_all`. If set to `no` the field can be stored +in `_source`, have `include_in_all` enabled, or `store` should be set to +`true` for this to be useful. + +|`doc_values` |Set to `true` to store field values in a column-stride fashion. +Automatically set to `true` when the fielddata format is `doc_values`. + +|`precision_step` |The precision step (number of terms generated for +each number value). Defaults to `4`. + +|`boost` |The boost value. Defaults to `1.0`. + +|`null_value` |When there is a (JSON) null value for the field, use the +`null_value` as the field value. Defaults to not adding the field at +all. + +|`include_in_all` |Should the field be included in the `_all` field (if +enabled). 
If `index` is set to `no` this defaults to `false`, otherwise, +defaults to `true` or to the parent `object` type setting. + +|`ignore_malformed` |Ignored a malformed number. Defaults to `false`. + +|======================================================================= + +[float] +[[boolean]] +==== Boolean + +The boolean type Maps to the JSON boolean type. It ends up storing +within the index either `T` or `F`, with automatic translation to `true` +and `false` respectively. + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "properties" : { + "hes_my_special_tweet" : { + "type" : "boolean", + } + } + } +} +-------------------------------------------------- + +The boolean type also supports passing the value as a number (in this +case `0` is `false`, all other values are `true`). + +The following table lists all the attributes that can be used with the +boolean type: + +[cols="<,<",options="header",] +|======================================================================= +|Attribute |Description +|`index_name` |The name of the field that will be stored in the index. +Defaults to the property/field name. + +|`store` |Set to `true` to store actual field in the index, `false` to not +store it. Defaults to `false` (note, the JSON document itself is stored, +and it can be retrieved from it). + +|`index` |Set to `no` if the value should not be indexed. Setting to +`no` disables `include_in_all`. If set to `no` the field can be stored +in `_source`, have `include_in_all` enabled, or `store` should be set to +`true` for this to be useful. + +|`boost` |The boost value. Defaults to `1.0`. + +|`null_value` |When there is a (JSON) null value for the field, use the +`null_value` as the field value. Defaults to not adding the field at +all. + +|`include_in_all` |Should the field be included in the `_all` field (if +enabled). If `index` is set to `no` this defaults to `false`, otherwise, +defaults to `true` or to the parent `object` type setting. +|======================================================================= + +[float] +[[binary]] +==== Binary + +The binary type is a base64 representation of binary data that can be +stored in the index. The field is not stored by default and not indexed at +all. + +[source,js] +-------------------------------------------------- +{ + "tweet" : { + "properties" : { + "image" : { + "type" : "binary", + } + } + } +} +-------------------------------------------------- + +The following table lists all the attributes that can be used with the +binary type: + +[cols="<,<",options="header",] +|======================================================================= +|Attribute |Description +|`index_name` |The name of the field that will be stored in the index. +Defaults to the property/field name. +|`store` |Set to `true` to store actual field in the index, `false` to not +store it. Defaults to `false` (note, the JSON document itself is stored, +and it can be retrieved from it). +|======================================================================= + +[float] +[[fielddata-filters]] +==== Fielddata filters + +It is possible to control which field values are loaded into memory, +which is particularly useful for faceting on string fields, using +fielddata filters, which are explained in detail in the +<<index-modules-fielddata,Fielddata>> section. 
[float]
[[fielddata-filters]]
==== Fielddata filters

It is possible to control which field values are loaded into memory,
which is particularly useful for faceting on string fields, using
fielddata filters, which are explained in detail in the
<<index-modules-fielddata,Fielddata>> section.

Fielddata filters can exclude terms which do not match a regex, or which
don't fall between a `min` and `max` frequency range:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "type" : "string",
        "analyzer" : "whitespace",
        "fielddata" : {
            "filter" : {
                "regex" : {
                    "pattern" : "^#.*"
                },
                "frequency" : {
                    "min" : 0.001,
                    "max" : 0.1,
                    "min_segment_size" : 500
                }
            }
        }
    }
}
--------------------------------------------------

These filters can be updated on an existing field mapping and will take
effect the next time the fielddata for a segment is loaded. Use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.

[float]
[[postings]]
==== Postings format

Postings formats define how fields are written into the index and how
they are represented in memory. Postings formats can be defined per
field via the `postings_format` option. Postings formats are configurable;
Elasticsearch has several built-in formats:

`direct`::
    A postings format that uses disk-based storage but loads
    its terms and postings directly into memory. Note this postings format
    is very memory intensive and has a limitation that doesn't allow
    segments to grow beyond 2.1GB; see the Lucene `DirectPostingsFormat`
    documentation for details.

`memory`::
    A postings format that stores its entire terms, postings,
    positions and payloads in a finite state transducer. This format should
    only be used for primary keys or with fields where each term is
    contained in a very low number of documents.

`pulsing`::
    A postings format that in-lines the posting lists for very low
    frequency terms in the term dictionary. This is useful to improve
    lookup performance for low-frequency terms.

`bloom_default`::
    A postings format that uses a bloom filter to
    improve term lookup performance. This is useful for primary keys or
    fields that are used as a delete key.

`bloom_pulsing`::
    A postings format that combines the advantages of
    *bloom* and *pulsing* to further improve lookup performance.

`default`::
    The default Elasticsearch postings format, offering the best
    general purpose performance. This format is used if no postings format
    is specified in the field mapping.

[float]
===== Postings format example

On all field types it is possible to configure a `postings_format`
attribute:

[source,js]
--------------------------------------------------
{
    "person" : {
        "properties" : {
            "second_person_id" : {"type" : "string", "postings_format" : "pulsing"}
        }
    }
}
--------------------------------------------------

On top of using the built-in postings formats it is possible to define
custom postings formats. See the
<<index-modules-codec,codec module>> for more
information.
[float]
==== Doc values format

Doc values formats define how fields are written into column-stride storage
in the index for the purpose of sorting or faceting. Fields that have doc
values enabled will have special field data instances, which will not be
uninverted from the inverted index, but read directly from disk. This makes
`_refresh` faster and ultimately allows for having field data stored on
disk, depending on the configured doc values format.

Doc values formats are configurable. Elasticsearch has several built-in
formats:

`memory`::
    A doc values format which stores data in memory. Compared to the
    default field data implementations, using doc values with this format
    will have similar performance but will be faster to load, making
    `_refresh` less time-consuming.

`disk`::
    A doc values format which stores all data on disk, requiring almost no
    memory from the JVM at the cost of a slight performance degradation.

`default`::
    The default Elasticsearch doc values format, offering good performance
    with low memory usage. This format is used if no format is specified in
    the field mapping.

[float]
===== Doc values format example

On all field types, it is possible to configure a `doc_values_format`
attribute:

[source,js]
--------------------------------------------------
{
    "product" : {
        "properties" : {
            "price" : {"type" : "integer", "doc_values_format" : "memory"}
        }
    }
}
--------------------------------------------------

On top of using the built-in doc values formats it is possible to define
custom doc values formats. See the
<<index-modules-codec,codec module>> for more information.

[float]
==== Similarity

Elasticsearch allows you to configure a similarity (scoring algorithm) per
field, giving users a simple way to go beyond the usual TF/IDF algorithm.
As part of this, new algorithms have been added, including BM25, and it is
now possible to define a similarity per field, giving even greater control
over scoring.

You can configure similarities via the
<<index-modules-similarity,similarity module>>.

[float]
===== Configuring Similarity per Field

Defining the similarity for a field is done via the `similarity` mapping
property, as this example shows:

[source,js]
--------------------------------------------------
{
    "book" : {
        "properties" : {
            "title" : { "type" : "string", "similarity" : "BM25" }
        }
    }
}
--------------------------------------------------

The following similarities are configured out-of-the-box:

`default`::
    The default TF/IDF algorithm used by Elasticsearch and
    Lucene in previous versions.

`BM25`::
    The BM25 algorithm.
    http://en.wikipedia.org/wiki/Okapi_BM25[See Okapi_BM25] for more
    details.


[[copy-to]]
[float]
===== Copy to field

added[1.0.0.RC2]

Adding the `copy_to` parameter to any field mapping will cause all values
of this field to be copied to the fields specified in the parameter. In the
following example all values from the fields `title` and `abstract` will be
copied to the field `meta_data`.


[source,js]
--------------------------------------------------
{
    "book" : {
        "properties" : {
            "title" : { "type" : "string", "copy_to" : "meta_data" },
            "abstract" : { "type" : "string", "copy_to" : "meta_data" },
            "meta_data" : { "type" : "string" }
        }
    }
}
--------------------------------------------------

Multiple destination fields are also supported:

[source,js]
--------------------------------------------------
{
    "book" : {
        "properties" : {
            "title" : { "type" : "string", "copy_to" : ["meta_data", "article_info"] }
        }
    }
}
--------------------------------------------------
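Values copied this way become searchable on the destination field like any
other indexed value. A minimal search sketch against the `meta_data` field
from the mapping above (the query term is hypothetical):

[source,js]
--------------------------------------------------
{
    "query" : {
        "match" : {
            "meta_data" : "elasticsearch"
        }
    }
}
--------------------------------------------------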
[float]
===== Multi fields

added[1.0.0.RC1]

The `fields` option allows mapping several core type fields onto a single
JSON source field. This can be useful if a single field needs to be
used in different ways, for example when a single field is to be used for
both free text search and sorting.

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "properties" : {
            "name" : {
                "type" : "string",
                "index" : "analyzed",
                "fields" : {
                    "raw" : {"type" : "string", "index" : "not_analyzed"}
                }
            }
        }
    }
}
--------------------------------------------------

In the above example the field `name` gets processed twice. The first time
it gets processed as an analyzed string, and this version is accessible
under the field name `name`; this is the main field and is in fact just
like any other field. The second time it gets processed as a not analyzed
string and is accessible under the name `name.raw`.

[float]
==== Include in All

The `include_in_all` setting is ignored on any field that is defined in
the `fields` option. Setting `include_in_all` only makes sense on
the main field, since only the raw field value is copied to the `_all`
field; the tokens aren't copied.

[float]
==== Updating a field

In essence, a field can't be updated. However, multi fields can be
added to existing fields. This allows, for example, having a different
`index_analyzer` configuration in addition to the already configured
`index_analyzer` configuration specified in the main and other multi
fields.

Note that a new multi field will only be applied to documents added after
the multi field was added; the new multi field simply doesn't exist in
existing documents.

Another important note is that new multi fields will be merged into the
list of existing multi fields, so when adding new multi fields for a field,
previously added multi fields don't need to be specified.

[float]
==== Accessing Fields

deprecated[1.0.0,Use <<copy-to,`copy_to`>> instead]

The multi fields defined in `fields` are prefixed with the
name of the main field and can be accessed by their full path using the
navigation notation: `name.raw`, or using the typed navigation notation
`tweet.name.raw`. The `path` option allows control over how fields are
accessed. If the `path` option is set to `full`, then the full path of the
main field is prefixed, but if the `path` option is set to `just_name` the
actual multi field name without any prefix is used. The default value for
the `path` option is `full`.

The `just_name` setting, among other things, allows indexing content of
multiple fields under the same name. In the example below the content of
both fields `first_name` and `last_name` can be accessed by using
`any_name` or `tweet.any_name`.

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "properties": {
            "first_name": {
                "type": "string",
                "index": "analyzed",
                "path": "just_name",
                "fields": {
                    "any_name": {"type": "string","index": "analyzed"}
                }
            },
            "last_name": {
                "type": "string",
                "index": "analyzed",
                "path": "just_name",
                "fields": {
                    "any_name": {"type": "string","index": "analyzed"}
                }
            }
        }
    }
}
--------------------------------------------------
diff --git a/docs/reference/mapping/types/geo-point-type.asciidoc b/docs/reference/mapping/types/geo-point-type.asciidoc
new file mode 100644
index 0000000..19b38e5
--- /dev/null
+++ b/docs/reference/mapping/types/geo-point-type.asciidoc
@@ -0,0 +1,207 @@

[[mapping-geo-point-type]]
=== Geo Point Type

The mapper type `geo_point` supports geo-based points. The
declaration looks as follows:

[source,js]
--------------------------------------------------
{
    "pin" : {
        "properties" : {
            "location" : {
                "type" : "geo_point"
            }
        }
    }
}
--------------------------------------------------

[float]
==== Indexed Fields

The `geo_point` mapping will index a single field with the format of
`lat,lon`. The `lat_lon` option can be set to also index the `.lat` and
`.lon` as numeric fields, and `geohash` can be set to `true` to also
index the `.geohash` value.

It is good practice to enable indexing of `lat_lon` as well, since both
the geo distance and bounding box filters can be executed using either in
memory checks or the indexed lat/lon values, and which one performs better
really depends on the data set. Note, though, that indexed lat/lon values
only make sense when there is a single geo point value for the field, not
multiple values.

[float]
==== Geohashes

Geohashes are a form of lat/lon encoding which divides the earth up into
a grid. Each cell in this grid is represented by a geohash string. Each
cell in turn can be further subdivided into smaller cells which are
represented by a longer string. So the longer the geohash, the smaller
(and thus more accurate) the cell is.

Because geohashes are just strings, they can be stored in an inverted
index like any other string, which makes querying them very efficient.

If you enable the `geohash` option, a `geohash` ``sub-field'' will be
indexed as, eg, `pin.geohash`. The length of the geohash is controlled by
the `geohash_precision` parameter, which can either be set to an absolute
length (eg `12`, the default) or to a distance (eg `1km`).

More usefully, set the `geohash_prefix` option to `true` to index not only
the geohash value, but all the enclosing cells as well. For instance, a
geohash of `u30` will be indexed as `[u,u3,u30]`. This option can be used
by the <<query-dsl-geohash-cell-filter>> to find geopoints within a
particular cell very efficiently.

[float]
==== Input Structure

The above mapping defines a `geo_point`, which accepts different
formats. The following formats are supported:

[float]
===== Lat Lon as Properties

[source,js]
--------------------------------------------------
{
    "pin" : {
        "location" : {
            "lat" : 41.12,
            "lon" : -71.34
        }
    }
}
--------------------------------------------------

[float]
===== Lat Lon as String

Format in `lat,lon`.

[source,js]
--------------------------------------------------
{
    "pin" : {
        "location" : "41.12,-71.34"
    }
}
--------------------------------------------------

[float]
===== Geohash

[source,js]
--------------------------------------------------
{
    "pin" : {
        "location" : "drm3btev3e86"
    }
}
--------------------------------------------------

[float]
===== Lat Lon as Array

Format in `[lon, lat]`. Note the order of lon/lat here, which conforms
with http://geojson.org/[GeoJSON].

[source,js]
--------------------------------------------------
{
    "pin" : {
        "location" : [-71.34, 41.12]
    }
}
--------------------------------------------------
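Once indexed, such points can be used with the geo filters. A minimal
sketch of a `geo_distance` filter against the `pin.location` field above
(the distance and coordinates are hypothetical):

[source,js]
--------------------------------------------------
{
    "query" : {
        "filtered" : {
            "query" : { "match_all" : {} },
            "filter" : {
                "geo_distance" : {
                    "distance" : "200km",
                    "pin.location" : {
                        "lat" : 40.0,
                        "lon" : -70.0
                    }
                }
            }
        }
    }
}
--------------------------------------------------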
[float]
==== Mapping Options

[cols="<,<",options="header",]
|=======================================================================
|Option |Description
|`lat_lon` |Set to `true` to also index the `.lat` and `.lon` as fields.
Defaults to `false`.

|`geohash` |Set to `true` to also index the `.geohash` as a field.
Defaults to `false`.

|`geohash_precision` |Sets the geohash precision. It can be set to an
absolute geohash length or a distance value (eg `1km`, `1m`, `1mi`)
defining the size of the smallest cell. Defaults to an absolute length
of 12.

|`geohash_prefix` |If this option is set to `true`, not only the geohash
but also all its parent cells (true prefixes) will be indexed as well. The
number of terms that will be indexed depends on the `geohash_precision`.
Defaults to `false`. *Note*: This option implicitly enables `geohash`.

|`validate` |Set to `true` to reject geo points with invalid latitude or
longitude (default is `false`). *Note*: Validation only works when
normalization has been disabled.

|`validate_lat` |Set to `true` to reject geo points with an invalid
latitude.

|`validate_lon` |Set to `true` to reject geo points with an invalid
longitude.

|`normalize` |Set to `true` to normalize latitude and longitude (default
is `true`).

|`normalize_lat` |Set to `true` to normalize latitude.

|`normalize_lon` |Set to `true` to normalize longitude.
|=======================================================================

[float]
==== Field data

By default, geo points use the `array` format which loads geo points into
two parallel double arrays, making sure there is no precision loss.
However, this can require a non-negligible amount of memory (16 bytes per
document), which is why Elasticsearch also provides a field data
implementation with lossy compression called `compressed`:

[source,js]
--------------------------------------------------
{
    "pin" : {
        "properties" : {
            "location" : {
                "type" : "geo_point",
                "fielddata" : {
                    "format" : "compressed",
                    "precision" : "1cm"
                }
            }
        }
    }
}
--------------------------------------------------

This field data format comes with a `precision` option which allows
trading precision for memory. The default value is `1cm`. The following
table presents the memory savings for various precisions:

[cols="<,<,<",options="header",]
|=============================================
|Precision |Bytes per point |Size reduction
|1km |4 |75%
|3m |6 |62.5%
|1cm |8 |50%
|1mm |10 |37.5%
|=============================================

Precision can be changed on a live index by using the update mapping API.

[float]
==== Usage in Scripts

When using `doc[geo_field_name]` (in the above mapping,
`doc['location']`), the `doc[...].value` returns a `GeoPoint`, which
then allows access to `lat` and `lon` (for example,
`doc[...].value.lat`). For performance, it is better to access the `lat`
and `lon` directly using `doc[...].lat` and `doc[...].lon`.

diff --git a/docs/reference/mapping/types/geo-shape-type.asciidoc b/docs/reference/mapping/types/geo-shape-type.asciidoc
new file mode 100644
index 0000000..600900a
--- /dev/null
+++ b/docs/reference/mapping/types/geo-shape-type.asciidoc
@@ -0,0 +1,232 @@

[[mapping-geo-shape-type]]
=== Geo Shape Type

The `geo_shape` mapping type facilitates the indexing of and searching
with arbitrary geo shapes such as rectangles and polygons. It should be
used when either the data being indexed or the queries being executed
contain shapes other than just points.

You can query documents using this type using the
<<query-dsl-geo-shape-filter,geo_shape Filter>>
or the <<query-dsl-geo-shape-query,geo_shape
Query>>.
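As a quick illustration, here is a minimal sketch of a `geo_shape` filter
that finds documents whose `location` shape intersects a given envelope
(the field name and coordinates are hypothetical):

[source,js]
--------------------------------------------------
{
    "query" : {
        "filtered" : {
            "query" : { "match_all" : {} },
            "filter" : {
                "geo_shape" : {
                    "location" : {
                        "shape" : {
                            "type" : "envelope",
                            "coordinates" : [[-45.0, 45.0], [45.0, -45.0]]
                        }
                    }
                }
            }
        }
    }
}
--------------------------------------------------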
Note, the `geo_shape` type uses
https://github.com/spatial4j/spatial4j[Spatial4J] and
http://www.vividsolutions.com/jts/jtshome.htm[JTS], both of which are
optional dependencies. Consequently you must add Spatial4J v0.3 and JTS
v1.12 to your classpath in order to use this type.

[float]
==== Mapping Options

The geo_shape mapping maps GeoJSON geometry objects to the geo_shape
type. To enable it, users must explicitly map fields to the geo_shape
type.

[cols="<,<",options="header",]
|=======================================================================
|Option |Description

|`tree` |Name of the PrefixTree implementation to be used: `geohash` for
GeohashPrefixTree and `quadtree` for QuadPrefixTree. Defaults to
`geohash`.

|`precision` |This parameter may be used instead of `tree_levels` to set
an appropriate value for the `tree_levels` parameter. The value
specifies the desired precision and Elasticsearch will calculate the
best tree_levels value to honor this precision. The value should be a
number followed by an optional distance unit. Valid distance units
include: `in`, `inch`, `yd`, `yard`, `mi`, `miles`, `km`, `kilometers`,
`m`, `meters` (default), `cm`, `centimeters`, `mm`, `millimeters`.

|`tree_levels` |Maximum number of layers to be used by the PrefixTree.
This can be used to control the precision of shape representations and
therefore how many terms are indexed. Defaults to the default value of
the chosen PrefixTree implementation. Since this parameter requires a
certain level of understanding of the underlying implementation, users
may use the `precision` parameter instead. However, Elasticsearch only
uses the tree_levels parameter internally and this is what is returned
via the mapping API even if you use the precision parameter.

|`distance_error_pct` |Used as a hint to the PrefixTree about how
precise it should be. Defaults to 0.025 (2.5%) with 0.5 as the maximum
supported value.
|=======================================================================

[float]
==== Prefix trees

To efficiently represent shapes in the index, shapes are converted into
a series of hashes representing grid squares using implementations of a
PrefixTree. The tree notion comes from the fact that the PrefixTree uses
multiple grid layers, each with an increasing level of precision to
represent the Earth.

Multiple PrefixTree implementations are provided:

* GeohashPrefixTree - Uses
http://en.wikipedia.org/wiki/Geohash[geohashes] for grid squares.
Geohashes are base32 encoded strings of the bits of the latitude and
longitude interleaved. So the longer the hash, the more precise it is.
Each character added to the geohash represents another tree level and
adds 5 bits of precision to the geohash. A geohash represents a
rectangular area and has 32 sub rectangles. The maximum number of levels
in Elasticsearch is 24.
* QuadPrefixTree - Uses a
http://en.wikipedia.org/wiki/Quadtree[quadtree] for grid squares.
Similar to geohash, quad trees interleave the bits of the latitude and
longitude, and the resulting hash is a bit set. A tree level in a quad
tree represents 2 bits in this bit set, one for each coordinate. The
maximum number of levels for the quad trees in Elasticsearch is 50.
[float]
===== Accuracy

`geo_shape` does not provide 100% accuracy; depending on how it is
configured it may return some false positives or false negatives for
certain queries. To mitigate this, it is important to select an
appropriate value for the tree_levels parameter and to adjust
expectations accordingly. For example, a point may be near the border of
a particular grid cell and may not match a query that only matches the
cell right next to it, even though the shape is very close to the point.

[float]
===== Example

[source,js]
--------------------------------------------------
{
    "properties": {
        "location": {
            "type": "geo_shape",
            "tree": "quadtree",
            "precision": "1m"
        }
    }
}
--------------------------------------------------

This mapping maps the location field to the geo_shape type using the
quadtree implementation and a precision of 1m. Elasticsearch translates
this into a tree_levels setting of 26.

[float]
===== Performance considerations

Elasticsearch uses the paths in the prefix tree as terms in the index
and in queries. The higher the level (and thus the precision), the
more terms are generated. Calculating the terms, keeping them in
memory, and storing them all have a price, of course. Especially with
higher tree levels, indices can become extremely large even with a modest
amount of data. Additionally, the size of the features also matters.
Big, complex polygons can take up a lot of space at higher tree levels.
Which setting is right depends on the use case. Generally one trades off
accuracy against index size and query performance.

The defaults in Elasticsearch for both implementations are a compromise
between index size and a reasonable level of precision of 50m at the
equator. This allows for indexing tens of millions of shapes without
overly bloating the resulting index relative to the input size.

[float]
==== Input Structure

The http://www.geojson.org[GeoJSON] format is used to represent shapes
as input, as follows:

[source,js]
--------------------------------------------------
{
    "location" : {
        "type" : "point",
        "coordinates" : [45.0, -45.0]
    }
}
--------------------------------------------------

Note, both the `type` and `coordinates` fields are required.

The supported `types` are `point`, `linestring`, `polygon`, `multipoint`
and `multipolygon`.

Note, in GeoJSON the correct coordinate order is longitude, latitude.
This differs from some APIs, such as Google Maps, that generally use
latitude, longitude.

[float]
===== Envelope

Elasticsearch supports an `envelope` type which consists of coordinates
for the upper left and lower right points of the shape:

[source,js]
--------------------------------------------------
{
    "location" : {
        "type" : "envelope",
        "coordinates" : [[-45.0, 45.0], [45.0, -45.0]]
    }
}
--------------------------------------------------
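The simpler GeoJSON geometries follow the same pattern. For instance, a
`linestring` (listed among the supported types above) is just an ordered
list of points; the coordinates below are hypothetical:

[source,js]
--------------------------------------------------
{
    "location" : {
        "type" : "linestring",
        "coordinates" : [[-77.03653, 38.897676], [-77.009051, 38.889939]]
    }
}
--------------------------------------------------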
[float]
===== http://www.geojson.org/geojson-spec.html#id4[Polygon]

A polygon is defined by a list of lists of points. The first and last
points in each list must be the same (the polygon must be closed).

[source,js]
--------------------------------------------------
{
    "location" : {
        "type" : "polygon",
        "coordinates" : [
            [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ]
        ]
    }
}
--------------------------------------------------

The first array represents the outer boundary of the polygon; the other
arrays represent the interior shapes ("holes"):

[source,js]
--------------------------------------------------
{
    "location" : {
        "type" : "polygon",
        "coordinates" : [
            [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ],
            [ [100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2] ]
        ]
    }
}
--------------------------------------------------

[float]
===== http://www.geojson.org/geojson-spec.html#id7[MultiPolygon]

A list of GeoJSON polygons:

[source,js]
--------------------------------------------------
{
    "location" : {
        "type" : "multipolygon",
        "coordinates" : [
            [[[102.0, 2.0], [103.0, 2.0], [103.0, 3.0], [102.0, 3.0], [102.0, 2.0]]],
            [[[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]],
             [[100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2]]]
        ]
    }
}
--------------------------------------------------

[float]
==== Sorting and Retrieving Index Shapes

Due to the complex input structure and index representation of shapes,
it is not currently possible to sort shapes or retrieve their fields
directly. The geo_shape value is only retrievable through the `_source`
field.

diff --git a/docs/reference/mapping/types/ip-type.asciidoc b/docs/reference/mapping/types/ip-type.asciidoc
new file mode 100644
index 0000000..51f3c5a
--- /dev/null
+++ b/docs/reference/mapping/types/ip-type.asciidoc
@@ -0,0 +1,36 @@

[[mapping-ip-type]]
=== IP Type

An `ip` mapping type allows storing _ipv4_ addresses in a numeric form
that makes it easy to sort by and range query on IP values.
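A minimal mapping sketch (the index and field names are hypothetical):

[source,js]
--------------------------------------------------
{
    "logs" : {
        "properties" : {
            "client_ip" : { "type" : "ip" }
        }
    }
}
--------------------------------------------------

A `range` query on `client_ip` then compares addresses by their numeric
value rather than lexicographically.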
The following table lists all the attributes that can be used with an ip
type:

[cols="<,<",options="header",]
|=======================================================================
|Attribute |Description
|`index_name` |The name of the field that will be stored in the index.
Defaults to the property/field name.

|`store` |Set to `true` to store the actual field in the index, `false` to
not store it. Defaults to `false` (note, the JSON document itself is
stored, and the field can be retrieved from it).

|`index` |Set to `no` if the value should not be indexed. In this case,
`store` should be set to `true`, since if it's not indexed and not
stored, there is nothing to do with it.

|`precision_step` |The precision step (number of terms generated for
each number value). Defaults to `4`.

|`boost` |The boost value. Defaults to `1.0`.

|`null_value` |When there is a (JSON) null value for the field, use the
`null_value` as the field value. Defaults to not adding the field at
all.

|`include_in_all` |Should the field be included in the `_all` field (if
enabled). Defaults to `true` or to the parent `object` type setting.
|=======================================================================

diff --git a/docs/reference/mapping/types/nested-type.asciidoc b/docs/reference/mapping/types/nested-type.asciidoc
new file mode 100644
index 0000000..17d8a13
--- /dev/null
+++ b/docs/reference/mapping/types/nested-type.asciidoc
@@ -0,0 +1,81 @@

[[mapping-nested-type]]
=== Nested Type

Nested objects/documents allow certain sections of a document to be
indexed as nested and then queried as if they were separate documents
joined with the owning parent document.

One of the problems when indexing inner objects that occur several times
in a doc is that "cross object" search matches will occur. For example:

[source,js]
--------------------------------------------------
{
    "obj1" : [
        {
            "name" : "blue",
            "count" : 4
        },
        {
            "name" : "green",
            "count" : 6
        }
    ]
}
--------------------------------------------------

Searching for a name set to blue and a count higher than 5 will match the
doc, because in the first element the name matches blue, and in the
second element, count matches "higher than 5".

Nested mapping allows mapping certain inner objects (usually
multi-instance ones), for example:

[source,js]
--------------------------------------------------
{
    "type1" : {
        "properties" : {
            "obj1" : {
                "type" : "nested"
            }
        }
    }
}
--------------------------------------------------

The above will cause all `obj1` objects to be indexed as nested docs. The
mapping is similar in nature to setting `type` to `object`, except that
it's `nested`.

Note: changing an object type to nested type requires reindexing.

The `nested` object fields can also be automatically added to the
immediate parent by setting `include_in_parent` to true, and also
included in the root object by setting `include_in_root` to true.

Nested docs will also automatically use the root doc `_all` field.

Searching on nested docs can be done using either the
<<query-dsl-nested-query,nested query>> or the
<<query-dsl-nested-filter,nested filter>>.

[float]
==== Internal Implementation

Internally, nested objects are indexed as additional documents, but,
since they are guaranteed to be indexed within the same "block", this
allows for extremely fast joining with parent docs.

Those internal nested documents are automatically masked away when doing
operations against the index (like searching with a match_all query),
and they bubble out when using the nested query.

Because nested docs are always masked from the parent doc, the nested
docs can never be accessed outside the scope of the `nested` query. For
example, stored fields can be enabled on fields inside nested objects,
but there is no way of retrieving them, since stored fields are fetched
outside of the `nested` query scope.

The `_source` field is always associated with the parent document, and
because of that, field values for nested objects can be fetched via the
source.
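To illustrate, here is a sketch of a nested query that resolves the
cross-object problem above: with `obj1` mapped as `nested`, the following
matches only documents where a single inner object has both a name of
blue and a count greater than 5, so the earlier example document no
longer matches:

[source,js]
--------------------------------------------------
{
    "query" : {
        "nested" : {
            "path" : "obj1",
            "query" : {
                "bool" : {
                    "must" : [
                        { "match" : { "obj1.name" : "blue" } },
                        { "range" : { "obj1.count" : { "gt" : 5 } } }
                    ]
                }
            }
        }
    }
}
--------------------------------------------------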
diff --git a/docs/reference/mapping/types/object-type.asciidoc b/docs/reference/mapping/types/object-type.asciidoc
new file mode 100644
index 0000000..ce28239
--- /dev/null
+++ b/docs/reference/mapping/types/object-type.asciidoc
@@ -0,0 +1,244 @@

[[mapping-object-type]]
=== Object Type

JSON documents are hierarchical in nature, allowing them to define inner
"objects" within the actual JSON. Elasticsearch completely understands
the nature of these inner objects and can map them easily, providing
query support for their inner fields. Because each document can have
objects with different fields each time, objects mapped this way are
known as "dynamic". Dynamic mapping is enabled by default. Let's take
the following JSON as an example:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "person" : {
            "name" : {
                "first_name" : "Shay",
                "last_name" : "Banon"
            },
            "sid" : "12345"
        },
        "message" : "This is a tweet!"
    }
}
--------------------------------------------------

The above shows an example where a tweet includes the actual `person`
details. A `person` is an object, with a `sid`, and a `name` object
which has `first_name` and `last_name`. It's important to note that
`tweet` is also an object, although it is a special
<<mapping-root-object-type,root object type>>
which allows for additional mapping definitions.

The following is an example of explicit mapping for the above JSON:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "properties" : {
            "person" : {
                "type" : "object",
                "properties" : {
                    "name" : {
                        "properties" : {
                            "first_name" : {"type" : "string"},
                            "last_name" : {"type" : "string"}
                        }
                    },
                    "sid" : {"type" : "string", "index" : "not_analyzed"}
                }
            },
            "message" : {"type" : "string"}
        }
    }
}
--------------------------------------------------

In order to mark a mapping of type `object`, set the `type` to `object`.
This is an optional step, since if there are `properties` defined for
it, it will automatically be identified as an `object` mapping.

[float]
==== properties

An object mapping can optionally define one or more properties using the
`properties` tag for a field. Each property can be either another
`object`, or one of the
<<mapping-core-types,core_types>>.

[float]
==== dynamic

One of the most important features of Elasticsearch is its ability to be
schema-less. This means that, in our example above, the `person` object
can be indexed later with a new property -- `age`, for example -- and it
will automatically be added to the mapping definitions. The same goes for
the `tweet` root object.

This feature is turned on by default; every mapped object is dynamic,
though dynamic mapping can be explicitly turned off:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "properties" : {
            "person" : {
                "type" : "object",
                "properties" : {
                    "name" : {
                        "dynamic" : false,
                        "properties" : {
                            "first_name" : {"type" : "string"},
                            "last_name" : {"type" : "string"}
                        }
                    },
                    "sid" : {"type" : "string", "index" : "not_analyzed"}
                }
            },
            "message" : {"type" : "string"}
        }
    }
}
--------------------------------------------------

In the above example, the `name` object mapped is not dynamic, meaning
that if, in the future, we try to index JSON with a `middle_name` within
the `name` object, it will get discarded and not added.

There is no performance overhead if an `object` is dynamic; the ability
to turn it off is provided as a safety mechanism so "malformed" objects
won't, by mistake, index data that we do not wish to be indexed.

If a dynamic object contains yet another inner `object`, it will be
automatically added to the index and mapped as well.
When processing dynamic new fields, their type is automatically derived.
For example, if it is a `number`, it will automatically be treated as a
number <<mapping-core-types,core_type>>. Dynamic
fields use the default attributes; for example, they are not stored and
they are always indexed.

Date fields are special since they are represented as a `string`. Date
fields are detected if they can be parsed as a date when they are first
introduced into the system. The set of date formats that are tested
against can be configured using the `dynamic_date_formats` on the root
object, which is explained later.

Note, once a field has been added, *its type can not change*. For
example, if we added age and its value is a number, then it can't be
treated as a string.

The `dynamic` parameter can also be set to `strict`, meaning that not
only will new fields not be introduced into the mapping, but parsing
(indexing) docs with such new fields will fail.

[float]
==== enabled

The `enabled` flag allows disabling parsing and indexing a named object
completely. This is handy when a portion of the JSON document contains
arbitrary JSON which should not be indexed, nor added to the mapping.
For example:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "properties" : {
            "person" : {
                "type" : "object",
                "properties" : {
                    "name" : {
                        "type" : "object",
                        "enabled" : false
                    },
                    "sid" : {"type" : "string", "index" : "not_analyzed"}
                }
            },
            "message" : {"type" : "string"}
        }
    }
}
--------------------------------------------------

In the above, `name` and its content will not be indexed at all.


[float]
==== include_in_all

`include_in_all` can be set on the `object` type level. When set, it
propagates down to all the inner mappings defined within the `object`
that do not explicitly set it.

[float]
==== path

deprecated[1.0.0,Use <<copy-to,`copy_to`>> instead]

In the <<mapping-core-types,core_types>>
section, a field can have an `index_name` associated with it in order to
control the name of the field that will be stored within the index. When
that field exists within an object (or objects) that is not the root
object, the name of the field in the index can either include the full
"path" to the field with its `index_name`, or just the `index_name`. For
example (under a mapping of _type_ `person`, with the tweet type removed
for clarity):

[source,js]
--------------------------------------------------
{
    "person" : {
        "properties" : {
            "name1" : {
                "type" : "object",
                "path" : "just_name",
                "properties" : {
                    "first1" : {"type" : "string"},
                    "last1" : {"type" : "string", "index_name" : "i_last_1"}
                }
            },
            "name2" : {
                "type" : "object",
                "path" : "full",
                "properties" : {
                    "first2" : {"type" : "string"},
                    "last2" : {"type" : "string", "index_name" : "i_last_2"}
                }
            }
        }
    }
}
--------------------------------------------------

In the above example, the `name1` and `name2` objects within the
`person` object have different combinations of `path` and `index_name`.
The document fields that will be stored in the index as a result are:

[cols="<,<",options="header",]
|=================================
|JSON Name |Document Field Name
|`name1`/`first1` |`first1`
|`name1`/`last1` |`i_last_1`
|`name2`/`first2` |`name2.first2`
|`name2`/`last2` |`name2.i_last_2`
|=================================

Note, when querying or using a field name in any of the APIs provided
(search, query, selective loading, ...), there is automatic detection
from the logical full path to the `index_name` and vice versa.
For
example, even though `name1`/`last1` defines that it is stored with
`just_name` and a different `index_name`, it can either be referred to
using `name1.last1` (the logical name), or by its actual indexed name of
`i_last_1`.

Moreover, where applicable (in queries, for example), the full path
including the type can be used, such as `person.name.last1`; in this
case, the actual indexed name will be resolved to match against the
index, and an automatic query filter will be added to only match
`person` types.

diff --git a/docs/reference/mapping/types/root-object-type.asciidoc b/docs/reference/mapping/types/root-object-type.asciidoc
new file mode 100644
index 0000000..ac368c4
--- /dev/null
+++ b/docs/reference/mapping/types/root-object-type.asciidoc
@@ -0,0 +1,224 @@

[[mapping-root-object-type]]
=== Root Object Type

The root object mapping is an
<<mapping-object-type,object type mapping>> that
maps the root object (the type itself). On top of all the different
mappings that can be set using the
<<mapping-object-type,object type mapping>>, it
allows for additional, type-level mapping definitions.

The root object mapping allows indexing a JSON document that either
starts with the actual mapping type, or only contains its fields. For
example, the following `tweet` JSON can be indexed:

[source,js]
--------------------------------------------------
{
    "message" : "This is a tweet!"
}
--------------------------------------------------

But the following JSON can also be indexed:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "message" : "This is a tweet!"
    }
}
--------------------------------------------------

Of the two, it is preferable to use the document *without* the type
explicitly set.

[float]
==== Index / Search Analyzers

The root object allows defining type-mapping-level analyzers for index
and search that will be used with all the fields that do not explicitly
set analyzers of their own. Here is an example:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "index_analyzer" : "standard",
        "search_analyzer" : "standard"
    }
}
--------------------------------------------------

The above simply explicitly defines both the `index_analyzer` and
`search_analyzer` that will be used. There is also an option to use the
`analyzer` attribute to set both the `search_analyzer` and
`index_analyzer`.

[float]
==== dynamic_date_formats

`dynamic_date_formats` (the old setting name `date_formats` still works)
sets one or more date formats that will be used to detect `date` fields.
For example:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "dynamic_date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy"],
        "properties" : {
            "message" : {"type" : "string"}
        }
    }
}
--------------------------------------------------

With the above mapping, if a new JSON field of type string is detected,
the date formats specified will be used to check if it's a date.
If it passes parsing, the field will be declared with the `date` type,
and will use the matching format as its format attribute. The date
format itself is explained
<<mapping-date-format,here>>.

The default formats are: `dateOptionalTime` (ISO) and
`yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z`.

*Note:* `dynamic_date_formats` are used *only* for dynamically added
date fields, not for `date` fields that you specify in your mapping.
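To make the behavior concrete, here is a sketch of a document (the field
names and values are hypothetical) that, indexed against the mapping
above, would cause `postDate` to be dynamically mapped as a `date` with
format `yyyy-MM-dd`, because the string value parses under that format:

[source,js]
--------------------------------------------------
{
    "message" : "This is a tweet!",
    "postDate" : "2014-01-01"
}
--------------------------------------------------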
[float]
==== date_detection

Allows disabling automatic date type detection (where a new field is
introduced and matches one of the provided formats), for example:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "date_detection" : false,
        "properties" : {
            "message" : {"type" : "string"}
        }
    }
}
--------------------------------------------------

[float]
==== numeric_detection

Sometimes, even though JSON has support for native numeric types,
numeric values are still provided as strings. In order to try to
automatically detect numeric values from strings, `numeric_detection`
can be set to `true`. For example:

[source,js]
--------------------------------------------------
{
    "tweet" : {
        "numeric_detection" : true,
        "properties" : {
            "message" : {"type" : "string"}
        }
    }
}
--------------------------------------------------

[float]
==== dynamic_templates

Dynamic templates allow defining mapping templates that will be applied
when dynamic introduction of fields / objects happens.

For example, we might want all fields to be stored by default,
or all `string` fields to be stored, or `string` fields to always
be indexed with multi fields syntax, once analyzed and once not_analyzed.
Here is a simple example:

[source,js]
--------------------------------------------------
{
    "person" : {
        "dynamic_templates" : [
            {
                "template_1" : {
                    "match" : "multi*",
                    "mapping" : {
                        "type" : "{dynamic_type}",
                        "index" : "analyzed",
                        "fields" : {
                            "org" : {"type": "{dynamic_type}", "index" : "not_analyzed"}
                        }
                    }
                }
            },
            {
                "template_2" : {
                    "match" : "*",
                    "match_mapping_type" : "string",
                    "mapping" : {
                        "type" : "string",
                        "index" : "not_analyzed"
                    }
                }
            }
        ]
    }
}
--------------------------------------------------

The above mapping will create a field with multi fields for all field
names starting with multi, and will map all `string` types to be
`not_analyzed`.

Dynamic templates are named to allow for simple merge behavior. A new
mapping, with just a new template, can be "put", and that template will
be added; if a template with the same name already exists, it will be
replaced.

The `match` option allows matching on the field name. An `unmatch`
option is also available to exclude fields that do match on `match`.
The `match_mapping_type` controls whether this template will be applied
only to dynamic fields of the specified type (as guessed by the JSON
format).

Another option is to use `path_match`, which allows matching the dynamic
template against the "full" dot notation name of the field (for example
`obj1.*.value` or `obj1.obj2.*`), with the corresponding `path_unmatch`.

All matching uses a simple wildcard format, with `*` as the matching
element, supporting simple patterns such as `xxx*`, `*xxx` and `xxx*yyy`
(with an arbitrary number of pattern types), as well as direct equality.
The `match_pattern` option can be set to `regex` to allow
regular-expression based matching.

The `mapping` element provides the actual mapping definition. The
`{name}` keyword can be used and will be replaced with the actual
dynamic field name being introduced. The `{dynamic_type}` (or
`{dynamicType}`) keyword can be used and will be replaced with the
mapping derived from the field type (or the derived type, like `date`).
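As a sketch of the `{name}` keyword (assuming analyzers named after the
fields actually exist in the index settings), the following template
would analyze every dynamically added string field with an analyzer of
the same name as the field:

[source,js]
--------------------------------------------------
{
    "person" : {
        "dynamic_templates" : [
            {
                "analyzer_per_field" : {
                    "match" : "*",
                    "match_mapping_type" : "string",
                    "mapping" : {
                        "type" : "string",
                        "analyzer" : "{name}"
                    }
                }
            }
        ]
    }
}
--------------------------------------------------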
Complete generic settings can also be applied. For example, to have all
mappings be stored, just set:

[source,js]
--------------------------------------------------
{
    "person" : {
        "dynamic_templates" : [
            {
                "store_generic" : {
                    "match" : "*",
                    "mapping" : {
                        "store" : true
                    }
                }
            }
        ]
    }
}
--------------------------------------------------

Such generic templates should be placed at the end of the
`dynamic_templates` list because when two or more dynamic templates
match a field, only the first matching one from the list is used.