author:    Hilko Bengen <bengen@debian.org>  2014-06-07 12:02:12 +0200
committer: Hilko Bengen <bengen@debian.org>  2014-06-07 12:02:12 +0200
commit:    d5ed89b946297270ec28abf44bef2371a06f1f4f (patch)
tree:      ce2d945e4dde69af90bd9905a70d8d27f4936776 /docs/reference/index-modules
Imported Upstream version 1.0.3 (tag: upstream/1.0.3)
Diffstat (limited to 'docs/reference/index-modules')
-rw-r--r--  docs/reference/index-modules/allocation.asciidoc   136
-rw-r--r--  docs/reference/index-modules/analysis.asciidoc      18
-rw-r--r--  docs/reference/index-modules/cache.asciidoc         56
-rw-r--r--  docs/reference/index-modules/codec.asciidoc        278
-rw-r--r--  docs/reference/index-modules/fielddata.asciidoc    270
-rw-r--r--  docs/reference/index-modules/mapper.asciidoc        39
-rw-r--r--  docs/reference/index-modules/merge.asciidoc        215
-rw-r--r--  docs/reference/index-modules/similarity.asciidoc   140
-rw-r--r--  docs/reference/index-modules/slowlog.asciidoc       87
-rw-r--r--  docs/reference/index-modules/store.asciidoc        122
-rw-r--r--  docs/reference/index-modules/translog.asciidoc      28
11 files changed, 1389 insertions, 0 deletions
diff --git a/docs/reference/index-modules/allocation.asciidoc b/docs/reference/index-modules/allocation.asciidoc
new file mode 100644
index 0000000..c1a1618
--- /dev/null
+++ b/docs/reference/index-modules/allocation.asciidoc
@@ -0,0 +1,136 @@
+[[index-modules-allocation]]
+== Index Shard Allocation
+
+[float]
+[[shard-allocation-filtering]]
+=== Shard Allocation Filtering
+
+Shard allocation filtering allows one to control the allocation of indices on
+nodes based on include/exclude filters. The filters can be set both on the
+index level and on the cluster level. Let's start with an example of setting
+it on the index level:
+
+Let's say we have 4 nodes, each with a specific attribute called `tag`
+associated with it (the name of the attribute can be any name). Each
+node has a specific value associated with `tag`. Node 1 has the setting
+`node.tag: value1`, Node 2 a setting of `node.tag: value2`, and so on.
+
+We can create an index that will only deploy on nodes that have `tag`
+set to `value1` and `value2` by setting
+`index.routing.allocation.include.tag` to `value1,value2`. For example:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/test/_settings -d '{
+ "index.routing.allocation.include.tag" : "value1,value2"
+}'
+--------------------------------------------------
+
+On the other hand, we can create an index that will be deployed on all
+nodes except for nodes with a `tag` of value `value3` by setting
+`index.routing.allocation.exclude.tag` to `value3`. For example:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/test/_settings -d '{
+ "index.routing.allocation.exclude.tag" : "value3"
+}'
+--------------------------------------------------
+
+`index.routing.allocation.require.*` can be used to
+specify a number of rules, all of which MUST match in order for a shard
+to be allocated to a node. This is in contrast to `include` which will
+include a node if ANY rule matches.
+
+The `include`, `exclude` and `require` values can have generic simple
+matching wildcards, for example, `value1*`. A special attribute name
+called `_ip` can be used to match on node ip values.
+
+Obviously a node can have several attributes associated with it, and
+both the attribute name and value are controlled in the setting. For
+example, here is a sample of several node configurations:
+
+[source,js]
+--------------------------------------------------
+node.group1: group1_value1
+node.group2: group2_value4
+--------------------------------------------------
+
+In the same manner, `include`, `exclude` and `require` can work against
+several attributes, for example:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/test/_settings -d '{
+    "index.routing.allocation.include.group1" : "xxx",
+    "index.routing.allocation.include.group2" : "yyy",
+    "index.routing.allocation.exclude.group3" : "zzz",
+    "index.routing.allocation.require.group4" : "aaa"
+}'
+--------------------------------------------------
+
+The provided settings can also be updated in real time using the update
+settings API, allowing one to "move" indices (shards) around in real time.
+
+Cluster wide filtering can also be defined, and be updated in real time
+using the cluster update settings API. This setting can come in handy
+for things like decommissioning nodes (even if the replica count is set
+to 0). Here is a sample of how to decommission a node based on `_ip`
+address:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/_cluster/settings -d '{
+ "transient" : {
+ "cluster.routing.allocation.exclude._ip" : "10.0.0.1"
+ }
+}'
+--------------------------------------------------
+
+[float]
+=== Total Shards Per Node
+
+The `index.routing.allocation.total_shards_per_node` setting controls the
+maximum number of shards of a single index that will be allocated per node.
+It can be dynamically set on a live index using the update index
+settings API.
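+
+As a sketch, this is how the limit might be applied to a live index (the value
+of `2` here is purely illustrative):
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/test/_settings -d '{
+    "index.routing.allocation.total_shards_per_node" : 2
+}'
+--------------------------------------------------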
+
+[float]
+[[disk]]
+=== Disk-based Shard Allocation
+
+Elasticsearch can be configured to prevent shard
+allocation on nodes depending on disk usage for the node. This
+functionality is disabled by default, and can be changed either in the
+configuration file, or dynamically using:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/_cluster/settings -d '{
+ "transient" : {
+ "cluster.routing.allocation.disk.threshold_enabled" : true
+ }
+}'
+--------------------------------------------------
+
+Once enabled, Elasticsearch uses two watermarks to decide whether
+shards should be allocated or can remain on the node.
+
+`cluster.routing.allocation.disk.watermark.low` controls the low
+watermark for disk usage. It defaults to 0.70, meaning ES will not
+allocate new shards to nodes once they have more than 70% disk
+used. It can also be set to an absolute byte value (like 500mb) to
+prevent ES from allocating shards if less than the configured amount
+of space is available.
+
+`cluster.routing.allocation.disk.watermark.high` controls the high
+watermark. It defaults to 0.85, meaning ES will attempt to relocate
+shards to another node if the node disk usage rises above 85%. It can
+also be set to an absolute byte value (similar to the low watermark)
+to relocate shards once less than the configured amount of space is
+available on the node.
+
+Both watermark settings can be changed dynamically using the cluster
+settings API. By default, Elasticsearch will retrieve information
+about the disk usage of the nodes every 30 seconds. This can also be
+changed by setting the `cluster.info.update.interval` setting.
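+
+For example, a sketch of adjusting both watermarks and the polling interval
+dynamically (the values shown are purely illustrative):
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/_cluster/settings -d '{
+    "transient" : {
+        "cluster.routing.allocation.disk.watermark.low" : "0.80",
+        "cluster.routing.allocation.disk.watermark.high" : "0.90",
+        "cluster.info.update.interval" : "60s"
+    }
+}'
+--------------------------------------------------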
diff --git a/docs/reference/index-modules/analysis.asciidoc b/docs/reference/index-modules/analysis.asciidoc
new file mode 100644
index 0000000..1cf33e8
--- /dev/null
+++ b/docs/reference/index-modules/analysis.asciidoc
@@ -0,0 +1,18 @@
+[[index-modules-analysis]]
+== Analysis
+
+The index analysis module acts as a configurable registry of Analyzers
+that can be used both to break indexed (analyzed) fields into terms when a
+document is indexed and to process query strings. It maps to the Lucene
+`Analyzer`.
+
+Analyzers are (generally) composed of a single `Tokenizer` and zero or
+more `TokenFilters`. A set of `CharFilters` can be associated with an
+analyzer to process the characters prior to other analysis steps. The
+analysis module allows one to register `TokenFilters`, `Tokenizers` and
+`Analyzers` under logical names that can then be referenced either in
+mapping definitions or in certain APIs. The Analysis module
+automatically registers (*if not explicitly defined*) built in
+analyzers, token filters, and tokenizers.
+
+See <<analysis>> for configuration details.
\ No newline at end of file
diff --git a/docs/reference/index-modules/cache.asciidoc b/docs/reference/index-modules/cache.asciidoc
new file mode 100644
index 0000000..829091e
--- /dev/null
+++ b/docs/reference/index-modules/cache.asciidoc
@@ -0,0 +1,56 @@
+[[index-modules-cache]]
+== Cache
+
+There are different caching inner modules associated with an index. They
+include `filter` and others.
+
+[float]
+[[filter]]
+=== Filter Cache
+
+The filter cache is responsible for caching the results of filters (used
+in the query). The default implementation of a filter cache (and the one
+recommended to use in almost all cases) is the `node` filter cache type.
+
+[float]
+[[node-filter]]
+==== Node Filter Cache
+
+The `node` filter cache may be configured to use either a percentage of
+the total memory allocated to the process or a specific amount of
+memory. All shards present on a node share a single node cache (that's
+why it's called `node`). The cache implements an LRU eviction policy:
+when a cache becomes full, the least recently used data is evicted to
+make way for new data.
+
+The setting that allows one to control the memory size for the filter
+cache is `indices.cache.filter.size`, which defaults to `20%`. *Note*,
+this is *not* an index level setting but a node level setting (can be
+configured in the node configuration).
+
+`indices.cache.filter.size` can accept either a percentage value, like
+`30%`, or an exact value, like `512mb`.
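+
+As a sketch, the setting would go into the node configuration file,
+`elasticsearch.yml` (the value below is illustrative):
+
+[source,js]
+--------------------------------------------------
+indices.cache.filter.size: 30%
+--------------------------------------------------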
+
+[float]
+[[index-filter]]
+==== Index Filter Cache
+
+A filter cache that exists on the index level (on each node). It is
+generally not recommended for use, since its memory usage depends on which
+shards are allocated on each node and is hard to predict. The types are:
+`resident`, `soft` and `weak`.
+
+All types support the following settings:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`index.cache.filter.max_size` |The max size (count, not byte size) of
+the cache (per search segment in a shard). Defaults to not set (`-1`),
+which is usually fine with `soft` cache and proper cacheable filters.
+
+|`index.cache.filter.expire` |A time based setting that expires filters
+after a certain time of inactivity. Defaults to `-1`. For example, can
+be set to `5m` for a 5 minute expiry.
+|=======================================================================
+
diff --git a/docs/reference/index-modules/codec.asciidoc b/docs/reference/index-modules/codec.asciidoc
new file mode 100644
index 0000000..f53c18f
--- /dev/null
+++ b/docs/reference/index-modules/codec.asciidoc
@@ -0,0 +1,278 @@
+[[index-modules-codec]]
+== Codec module
+
+Codecs define how documents are written to disk and read from disk. The
+postings format is the part of the codec that is responsible for reading
+and writing the term dictionary, postings lists and positions, payloads
+and offsets stored in the postings list. The doc values format is
+responsible for reading column-stride storage for a field and is typically
+used for sorting or faceting. When a field doesn't have doc values enabled,
+it is still possible to sort or facet by loading field values from the
+inverted index into main memory.
+
+Configuring custom postings or doc values formats is an expert feature and
+the built-in formats will most likely suit your needs, as described
+in the <<mapping-core-types,mapping section>>.
+
+**********************************
+Only the default codec, postings format and doc values format are supported:
+other formats may break backward compatibility between minor versions of
+Elasticsearch, requiring data to be reindexed.
+**********************************
+
+
+[float]
+[[custom-postings]]
+=== Configuring a custom postings format
+
+A custom postings format can be defined in the index settings in the
+`codec` part. The `codec` part can be configured when creating an index
+or updating index settings. An example of how to define a custom
+postings format:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT 'http://localhost:9200/twitter/' -d '{
+ "settings" : {
+ "index" : {
+ "codec" : {
+ "postings_format" : {
+ "my_format" : {
+ "type" : "pulsing",
+ "freq_cut_off" : "5"
+ }
+ }
+ }
+ }
+ }
+}'
+--------------------------------------------------
+
+Then, when defining your mapping, you can use the `my_format` name in the
+`postings_format` option as the example below illustrates:
+
+[source,js]
+--------------------------------------------------
+{
+ "person" : {
+ "properties" : {
+ "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
+ }
+ }
+}
+--------------------------------------------------
+
+[float]
+=== Available postings formats
+
+[float]
+[[direct-postings]]
+==== Direct postings format
+
+Wraps the default postings format for on-disk storage, but then at read
+time loads and stores all terms & postings directly in RAM. This
+postings format makes no effort to compress the terms and posting list
+and therefore is memory intensive, but because of this it gives a
+substantial increase in search performance. Because this holds all term
+bytes as a single byte[], you cannot have more than 2.1GB worth of terms
+in a single segment.
+
+This postings format offers the following parameters:
+
+`min_skip_count`::
+    The minimum number of terms with a shared prefix to
+    allow a skip pointer to be written. The default is *8*.
+
+`low_freq_cutoff`::
+ Terms with a lower document frequency use a
+ single array object representation for postings and positions. The
+ default is *32*.
+
+Type name: `direct`
+
+[float]
+[[memory-postings]]
+==== Memory postings format
+
+A postings format that stores terms & postings (docs, positions,
+payloads) in RAM, using an FST. This postings format does write to disk,
+but loads everything into memory. The memory postings format has the
+following options:
+
+`pack_fst`::
+    A boolean option that defines if the in memory structure
+    should be packed once it is built. Packing will reduce the size of the
+    data-structure in memory but requires more memory during building.
+    Default is *false*.
+
+`acceptable_overhead_ratio`::
+ The compression ratio specified as a
+ float, that is used to compress internal structures. Example ratios `0`
+ (Compact, no memory overhead at all, but the returned implementation may
+ be slow), `0.5` (Fast, at most 50% memory overhead, always select a
+ reasonably fast implementation), `7` (Fastest, at most 700% memory
+ overhead, no compression). Default is `0.2`.
+
+Type name: `memory`
+
+[float]
+[[bloom-postings]]
+==== Bloom filter posting format
+
+The bloom filter postings format wraps a delegate postings format and on
+top of this creates a bloom filter that is written to disk. During
+opening this bloom filter is loaded into memory and used to offer
+"fast-fail" reads. This postings format is useful for low doc-frequency
+fields such as primary keys. The bloom filter postings format has the
+following options:
+
+`delegate`::
+ The name of the configured postings format that the
+ bloom filter postings format will wrap.
+
+`fpp`::
+    The desired false positive probability, specified as a
+    floating point number between 0 and 1.0. The `fpp` can be configured for
+    multiple expected insertions. Example expression: *10k=0.01,1m=0.03*. If the
+    number of docs per index segment is larger than *1m* then *0.03* is used as
+    fpp, and if the number of docs per segment is larger than *10k* then *0.01*
+    is used as fpp. The last fallback value is always *0.03*. This example
+    expression is also the default.
+
+Type name: `bloom`
+
+[[codec-bloom-load]]
+[TIP]
+==================================================
+
+It can sometimes make sense to disable bloom filters. For instance, if you are
+logging into a new index per day, and you have thousands of indices, the bloom
+filters can take up a sizable amount of memory. For most queries you are only
+interested in recent indices, so you don't mind CRUD operations on older
+indices taking slightly longer.
+
+In these cases you can disable loading of the bloom filter on a per-index
+basis by updating the index settings:
+
+[source,js]
+--------------------------------------------------
+PUT /old_index/_settings?index.codec.bloom.load=false
+--------------------------------------------------
+
+This setting, which defaults to `true`, can be updated on a live index. Note,
+however, that changing the value will cause the index to be reopened, which
+will invalidate any existing caches.
+
+==================================================
+
+[float]
+[[pulsing-postings]]
+==== Pulsing postings format
+
+The pulsing implementation in-lines the posting lists for very low-frequency
+terms in the term dictionary. This is useful to improve lookup
+performance for low-frequency terms. This postings format offers the
+following parameters:
+
+`min_block_size`::
+ The minimum block size the default Lucene term
+ dictionary uses to encode on-disk blocks. Defaults to *25*.
+
+`max_block_size`::
+ The maximum block size the default Lucene term
+ dictionary uses to encode on-disk blocks. Defaults to *48*.
+
+`freq_cut_off`::
+ The document frequency cut off where pulsing
+ in-lines posting lists into the term dictionary. Terms with a document
+ frequency less or equal to the cutoff will be in-lined. The default is
+ *1*.
+
+Type name: `pulsing`
+
+[float]
+[[default-postings]]
+==== Default postings format
+
+The default postings format has the following options:
+
+`min_block_size`::
+ The minimum block size the default Lucene term
+ dictionary uses to encode on-disk blocks. Defaults to *25*.
+
+`max_block_size`::
+ The maximum block size the default Lucene term
+ dictionary uses to encode on-disk blocks. Defaults to *48*.
+
+Type name: `default`
+
+[float]
+=== Configuring a custom doc values format
+
+A custom doc values format can be defined in the index settings in the
+`codec` part. The `codec` part can be configured when creating an index
+or updating index settings. An example of how to define a custom
+doc values format:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT 'http://localhost:9200/twitter/' -d '{
+ "settings" : {
+ "index" : {
+ "codec" : {
+ "doc_values_format" : {
+ "my_format" : {
+ "type" : "disk"
+ }
+ }
+ }
+ }
+ }
+}'
+--------------------------------------------------
+
+Then, when defining your mapping, you can use the `my_format` name in the
+`doc_values_format` option as the example below illustrates:
+
+[source,js]
+--------------------------------------------------
+{
+ "product" : {
+ "properties" : {
+ "price" : {"type" : "integer", "doc_values_format" : "my_format"}
+ }
+ }
+}
+--------------------------------------------------
+
+[float]
+=== Available doc values formats
+
+[float]
+==== Memory doc values format
+
+A doc values format that stores all values in an FST in RAM. This format does
+write to disk but the whole data-structure is loaded into memory when reading
+the index. The memory doc values format has no options.
+
+Type name: `memory`
+
+[float]
+==== Disk doc values format
+
+A doc values format that stores and reads everything from disk. Although it may
+be slightly slower than the default doc values format, this doc values format
+will require almost no memory from the JVM. The disk doc values format has no
+options.
+
+Type name: `disk`
+
+[float]
+==== Default doc values format
+
+The default doc values format tries to make a good compromise between speed and
+memory usage by only loading into memory data-structures that matter for
+performance. This makes this doc values format a good fit for most use-cases.
+The default doc values format has no options.
+
+Type name: `default`
diff --git a/docs/reference/index-modules/fielddata.asciidoc b/docs/reference/index-modules/fielddata.asciidoc
new file mode 100644
index 0000000..c958dcb
--- /dev/null
+++ b/docs/reference/index-modules/fielddata.asciidoc
@@ -0,0 +1,270 @@
+[[index-modules-fielddata]]
+== Field data
+
+The field data cache is used mainly when sorting on or faceting on a
+field. It loads all the field values to memory in order to provide fast
+document based access to those values. The field data cache can be
+expensive to build for a field, so it is recommended to have enough memory
+to allocate it, and to keep it loaded.
+
+The amount of memory used for the field
+data cache can be controlled using `indices.fielddata.cache.size`. Note that
+reloading field data which does not fit into your cache will be expensive
+and perform poorly.
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`indices.fielddata.cache.size` |The max size of the field data cache,
+eg `30%` of node heap space, or an absolute value, eg `12GB`. Defaults
+to unbounded.
+
+|`indices.fielddata.cache.expire` |A time based setting that expires
+field data after a certain time of inactivity. Defaults to `-1`. For
+example, can be set to `5m` for a 5 minute expiry.
+|=======================================================================
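+
+For example, a sketch of how these settings might look in `elasticsearch.yml`
+(the values are illustrative, not recommendations):
+
+[source,js]
+--------------------------------------------------
+indices.fielddata.cache.size: 30%
+indices.fielddata.cache.expire: 10m
+--------------------------------------------------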
+
+[float]
+[[fielddata-circuit-breaker]]
+=== Field data circuit breaker
+The field data circuit breaker allows Elasticsearch to estimate the amount of
+memory a field will require in order to be loaded into memory. It can then
+prevent the field data from being loaded by raising an exception. By default
+the limit is configured to 80% of the maximum JVM heap. It can be configured
+with the following parameters:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`indices.fielddata.breaker.limit` |Maximum size of estimated field data
+to allow loading. Defaults to 80% of the maximum JVM heap.
+|`indices.fielddata.breaker.overhead` |A constant that all field data
+estimations are multiplied with to determine a final estimation. Defaults to
+1.03
+|=======================================================================
+
+Both the `indices.fielddata.breaker.limit` and
+`indices.fielddata.breaker.overhead` can be changed dynamically using the
+cluster update settings API.
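+
+For example, a sketch of lowering the limit on a running cluster (the `70%`
+value is illustrative):
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/_cluster/settings -d '{
+    "transient" : {
+        "indices.fielddata.breaker.limit" : "70%"
+    }
+}'
+--------------------------------------------------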
+
+[float]
+[[fielddata-monitoring]]
+=== Monitoring field data
+
+You can monitor memory usage for field data as well as the field data circuit
+breaker using the
+<<cluster-nodes-stats,Nodes Stats API>>.
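+
+As a sketch, per-node field data memory usage (along with circuit breaker
+information) can be read from the `indices` section of the nodes stats
+response; the `pretty` flag is only for readability:
+
+[source,js]
+--------------------------------------------------
+curl -XGET 'http://localhost:9200/_nodes/stats?pretty'
+--------------------------------------------------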
+
+[[fielddata-formats]]
+== Field data formats
+
+The field data format controls how field data should be stored.
+
+Depending on the field type, there might be several field data types
+available. In particular, string and numeric types support the `doc_values`
+format which allows for computing the field data data-structures at indexing
+time and storing them on disk. Although it will make the index larger and may
+be slightly slower, this implementation will be more near-realtime-friendly
+and will require much less memory from the JVM than other implementations.
+
+Here is an example of how to configure the `tag` field to use the `fst` field
+data format.
+
+[source,js]
+--------------------------------------------------
+{
+ tag: {
+ type: "string",
+ fielddata: {
+ format: "fst"
+ }
+ }
+}
+--------------------------------------------------
+
+It is possible to change the field data format (and the field data settings
+in general) on a live index by using the update mapping API. When doing so,
+field data which had already been loaded for existing segments will remain
+alive while new segments will use the new field data configuration. Thanks to
+the background merging process, all segments will eventually use the new
+field data format.
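+
+For example, a sketch of applying such a change to a live index through the
+update mapping API (the `my_index` and `my_type` names are placeholders):
+
+[source,js]
+--------------------------------------------------
+curl -XPUT 'localhost:9200/my_index/my_type/_mapping' -d '{
+    "my_type" : {
+        "properties" : {
+            "tag" : {
+                "type" : "string",
+                "fielddata" : {
+                    "format" : "fst"
+                }
+            }
+        }
+    }
+}'
+--------------------------------------------------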
+
+[float]
+==== String field data types
+
+`paged_bytes` (default)::
+ Stores unique terms sequentially in a large buffer and maps documents to
+ the indices of the terms they contain in this large buffer.
+
+`fst`::
+ Stores terms in a FST. Slower to build than `paged_bytes` but can help lower
+ memory usage if many terms share common prefixes and/or suffixes.
+
+`doc_values`::
+ Computes and stores field data data-structures on disk at indexing time.
+ Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
+ `not_analyzed`) and doesn't support filtering.
+
+[float]
+==== Numeric field data types
+
+`array` (default)::
+ Stores field values in memory using arrays.
+
+`doc_values`::
+ Computes and stores field data data-structures on disk at indexing time.
+ Doesn't support filtering.
+
+[float]
+==== Geo point field data types
+
+`array` (default)::
+ Stores latitudes and longitudes in arrays.
+
+`doc_values`::
+ Computes and stores field data data-structures on disk at indexing time.
+
+[float]
+=== Fielddata loading
+
+By default, field data is loaded lazily, i.e. the first time that a query that
+requires it is executed. However, this can make the first requests that
+follow a merge operation quite slow, since fielddata loading is a heavy
+operation.
+
+It is possible to force field data to be loaded and cached eagerly through the
+`loading` setting of fielddata:
+
+[source,js]
+--------------------------------------------------
+{
+ category: {
+ type: "string",
+ fielddata: {
+ loading: "eager"
+ }
+ }
+}
+--------------------------------------------------
+
+[float]
+==== Disabling field data loading
+
+Field data can take a lot of RAM so it makes sense to disable field data
+loading on the fields that don't need field data, for example those that are
+used for full-text search only. In order to disable field data loading, just
+change the field data format to `disabled`. When disabled, all requests that
+try to load field data, e.g. those that include aggregations and/or sorting,
+will return an error.
+
+[source,js]
+--------------------------------------------------
+{
+ text: {
+ type: "string",
+ fielddata: {
+ format: "disabled"
+ }
+ }
+}
+--------------------------------------------------
+
+The `disabled` format is supported by all field types.
+
+[float]
+[[field-data-filtering]]
+=== Filtering fielddata
+
+It is possible to control which field values are loaded into memory,
+which is particularly useful for string fields. When specifying the
+<<mapping-core-types,mapping>> for a field, you
+can also specify a fielddata filter.
+
+Fielddata filters can be changed using the
+<<indices-put-mapping,PUT mapping>>
+API. After changing the filters, use the
+<<indices-clearcache,Clear Cache>> API
+to reload the fielddata using the new filters.
+
+[float]
+==== Filtering by frequency:
+
+The frequency filter allows you to only load terms whose frequency falls
+between a `min` and `max` value, which can be expressed as an absolute
+number or as a percentage (eg `0.01` is `1%`). Frequency is calculated
+*per segment*. Percentages are based on the number of docs which have a
+value for the field, as opposed to all docs in the segment.
+
+Small segments can be excluded completely by specifying the minimum
+number of docs that the segment should contain with `min_segment_size`:
+
+[source,js]
+--------------------------------------------------
+{
+ tag: {
+ type: "string",
+ fielddata: {
+ filter: {
+ frequency: {
+ min: 0.001,
+ max: 0.1,
+ min_segment_size: 500
+ }
+ }
+ }
+ }
+}
+--------------------------------------------------
+
+[float]
+==== Filtering by regex
+
+Terms can also be filtered by regular expression - only values which
+match the regular expression are loaded. Note: the regular expression is
+applied to each term in the field, not to the whole field value. For
+instance, to only load hashtags from a tweet, we can use a regular
+expression which matches terms beginning with `#`:
+
+[source,js]
+--------------------------------------------------
+{
+    tweet: {
+        type: "string",
+        analyzer: "whitespace",
+        fielddata: {
+            filter: {
+                regex: {
+                    pattern: "^#.*"
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+
+[float]
+==== Combining filters
+
+The `frequency` and `regex` filters can be combined:
+
+[source,js]
+--------------------------------------------------
+{
+    tweet: {
+        type: "string",
+        analyzer: "whitespace",
+        fielddata: {
+            filter: {
+                regex: {
+                    pattern: "^#.*"
+                },
+                frequency: {
+                    min: 0.001,
+                    max: 0.1,
+                    min_segment_size: 500
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
diff --git a/docs/reference/index-modules/mapper.asciidoc b/docs/reference/index-modules/mapper.asciidoc
new file mode 100644
index 0000000..1728969
--- /dev/null
+++ b/docs/reference/index-modules/mapper.asciidoc
@@ -0,0 +1,39 @@
+[[index-modules-mapper]]
+== Mapper
+
+The mapper module acts as a registry for the type mapping definitions
+added to an index either when creating it or by using the put mapping
+api. It also handles the dynamic mapping support for types that have no
+explicit mappings pre defined. For more information about mapping
+definitions, check out the <<mapping,mapping section>>.
+
+[float]
+=== Dynamic / Default Mappings
+
+Dynamic mappings allow generic mapping definitions to be automatically applied
+to types that do not have a mapping predefined, or to be applied to new mapping
+definitions (overridden). This is mainly possible because the `object` type,
+and namely the root `object` type, allows for schema-less, dynamic addition
+of unmapped fields.
+
+The default mapping definition is a plain mapping definition that is
+embedded within Elasticsearch:
+
+[source,js]
+--------------------------------------------------
+{
+ _default_ : {
+ }
+}
+--------------------------------------------------
+
+Pretty short, no? Basically, everything is defaulted, especially the
+dynamic nature of the root object mapping. The default mapping
+definition can be overridden in several ways. The simplest way is
+to define a file called `default-mapping.json` and place it
+under the `config` directory (which can be configured to exist in a
+different location). It can also be explicitly set using the
+`index.mapper.default_mapping_location` setting.
+
+Dynamic creation of mappings for unmapped types can be completely
+disabled by setting `index.mapper.dynamic` to `false`.
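+
+For example, a sketch of how both settings might look in `elasticsearch.yml`
+(the custom location path is a placeholder):
+
+[source,js]
+--------------------------------------------------
+index.mapper.default_mapping_location: /path/to/default-mapping.json
+index.mapper.dynamic: false
+--------------------------------------------------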
diff --git a/docs/reference/index-modules/merge.asciidoc b/docs/reference/index-modules/merge.asciidoc
new file mode 100644
index 0000000..84d2675
--- /dev/null
+++ b/docs/reference/index-modules/merge.asciidoc
@@ -0,0 +1,215 @@
+[[index-modules-merge]]
+== Merge
+
+A shard in elasticsearch is a Lucene index, and a Lucene index is broken
+down into segments. Segments are internal storage elements in the index
+where the index data is stored, and are immutable up to delete markers.
+Segments are, periodically, merged into larger segments to keep the
+index size at bay and expunge deletes.
+
+The more segments a Lucene index has, the slower searches become and the
+more memory is used. Segment merging is used to reduce the number of segments,
+however merges can be expensive to perform, especially on low IO environments.
+Merges can be throttled using <<store-throttling,store level throttling>>.
+
+
+[float]
+[[policy]]
+=== Policy
+
+The index merge policy module allows one to control which segments of a
+shard index are to be merged. There are several types of policies with
+the default set to `tiered`.
+
+[float]
+[[tiered]]
+==== tiered
+
+Merges segments of approximately equal size, subject to an allowed
+number of segments per tier. This is similar to the `log_byte_size` merge
+policy, except this merge policy is able to merge non-adjacent segments,
+and separates how many segments are merged at once from how many
+segments are allowed per tier. This merge policy also does not
+over-merge (i.e., cascade merges).
+
+This policy has the following settings:
+
+`index.merge.policy.expunge_deletes_allowed`::
+
+ When expungeDeletes is called, we only merge away a segment if its delete
+ percentage is over this threshold. Default is `10`.
+
+`index.merge.policy.floor_segment`::
+
+ Segments smaller than this are "rounded up" to this size, i.e. treated as
+ equal (floor) size for merge selection. This is to prevent frequent
+ flushing of tiny segments from allowing a long tail in the index. Default
+ is `2mb`.
+
+`index.merge.policy.max_merge_at_once`::
+
+ Maximum number of segments to be merged at a time during "normal" merging.
+ Default is `10`.
+
+`index.merge.policy.max_merge_at_once_explicit`::
+
+ Maximum number of segments to be merged at a time, during optimize or
+ expungeDeletes. Default is `30`.
+
+`index.merge.policy.max_merged_segment`::
+
+ Maximum sized segment to produce during normal merging (not explicit
+ optimize). This setting is approximate: the estimate of the merged segment
+ size is made by summing sizes of to-be-merged segments (compensating for
+ percent deleted docs). Default is `5gb`.
+
+`index.merge.policy.segments_per_tier`::
+
+    Sets the allowed number of segments per tier. Smaller values mean more
+    merging but fewer segments. Default is `10`. Note, this value needs to be
+    >= `max_merge_at_once`, otherwise you will force too many merges to
+    occur.
+
+`index.reclaim_deletes_weight`::
+
+ Controls how aggressively merges that reclaim more deletions are favored.
+ Higher values favor selecting merges that reclaim deletions. A value of
+ `0.0` means deletions don't impact merge selection. Defaults to `2.0`.
+
+`index.compound_format`::
+
+ Should the index be stored in compound format or not. Defaults to `false`.
+ See <<index-compound-format,`index.compound_format`>> in
+ <<index-modules-settings>>.
+
+For normal merging, this policy first computes a "budget" of how many
+segments are allowed to be in the index. If the index is over-budget,
+then the policy sorts segments by decreasing size (pro-rating by percent
+deletes), and then finds the least-cost merge. Merge cost is measured by
+a combination of the "skew" of the merge (size of largest seg divided by
+smallest seg), total merge size and pct deletes reclaimed, so that
+merges with lower skew, smaller size and those reclaiming more deletes,
+are favored.
+
+If a merge will produce a segment that's larger than
+`max_merged_segment` then the policy will merge fewer segments (down to
+1 at once, if that one has deletions) to keep the segment size under
+budget.
+
+Note, this can mean that for large shards that hold many gigabytes of
+data, the default `max_merged_segment` of `5gb` can result in many
+segments remaining in the index, causing searches to be slower. Use the
+indices segments API to see the segments that an index has, and
+possibly either increase the `max_merged_segment` or issue an optimize
+call for the index (try to issue it at a low traffic time).
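+
+As a sketch, both steps could look like this on a live index (the `10gb` and
+`max_num_segments` values are illustrative):
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/test/_settings -d '{
+    "index.merge.policy.max_merged_segment" : "10gb"
+}'
+
+curl -XPOST 'localhost:9200/test/_optimize?max_num_segments=5'
+--------------------------------------------------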
+
+[float]
+[[log-byte-size]]
+==== log_byte_size
+
+A merge policy that merges segments into levels of exponentially
+increasing *byte size*, where each level has fewer segments than the
+value of the merge factor. Whenever extra segments (beyond the merge
+factor upper bound) are encountered, all segments within the level are
+merged.
+
+This policy has the following settings:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|index.merge.policy.merge_factor |Determines how often segment indices
+are merged by index operation. With smaller values, less RAM is used
+while indexing, and searches on unoptimized indices are faster, but
+indexing speed is slower. With larger values, more RAM is used during
+indexing, and while searches on unoptimized indices are slower, indexing
+is faster. Thus larger values (greater than 10) are best for batch index
+creation, and smaller values (lower than 10) for indices that are
+interactively maintained. Defaults to `10`.
+
+|index.merge.policy.min_merge_size |A size setting type which sets the
+minimum size for the lowest level segments. Any segments below this size
+are considered to be on the same level (even if they vary drastically in
+size) and will be merged whenever there are mergeFactor of them. This
+effectively truncates the "long tail" of small segments that would
+otherwise be created into a single level. If you set this too large, it
+could greatly increase the merging cost during indexing (if you flush
+many small segments). Defaults to `1.6mb`
+
+|index.merge.policy.max_merge_size |A size setting type which sets the
+largest segment (measured by total byte size of the segment's files)
+that may be merged with other segments. Defaults to unbounded.
+
+|index.merge.policy.max_merge_docs |Determines the largest segment
+(measured by document count) that may be merged with other segments.
+Defaults to unbounded.
+|=======================================================================
+
+[float]
+[[log-doc]]
+==== log_doc
+
+A merge policy that tries to merge segments into levels of exponentially
+increasing *document count*, where each level has fewer segments than
+the value of the merge factor. Whenever extra segments (beyond the merge
+factor upper bound) are encountered, all segments within the level are
+merged.
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|index.merge.policy.merge_factor |Determines how often segment indices
+are merged by index operation. With smaller values, less RAM is used
+while indexing, and searches on unoptimized indices are faster, but
+indexing speed is slower. With larger values, more RAM is used during
+indexing, and while searches on unoptimized indices are slower, indexing
+is faster. Thus larger values (greater than 10) are best for batch index
+creation, and smaller values (lower than 10) for indices that are
+interactively maintained. Defaults to `10`.
+
+|index.merge.policy.min_merge_docs |Sets the minimum size for the lowest
+level segments. Any segments below this size are considered to be on the
+same level (even if they vary drastically in size) and will be merged
+whenever there are mergeFactor of them. This effectively truncates the
+"long tail" of small segments that would otherwise be created into a
+single level. If you set this too large, it could greatly increase the
+merging cost during indexing (if you flush many small segments).
+Defaults to `1000`.
+
+|index.merge.policy.max_merge_docs |Determines the largest segment
+(measured by document count) that may be merged with other segments.
+Defaults to unbounded.
+|=======================================================================
+
+[float]
+[[scheduling]]
+=== Scheduling
+
+The merge schedule controls the execution of merge operations once they
+are needed (according to the merge policy). The following types are
+supported, with the default being the `ConcurrentMergeScheduler`.
+
+[float]
+==== ConcurrentMergeScheduler
+
+A merge scheduler that runs merges using separate threads, up to the
+maximum number of threads. When that maximum is reached and a merge is
+needed, the thread(s) that are updating the index will pause until one
+or more merges complete.
+
+The scheduler supports the following settings:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|index.merge.scheduler.max_thread_count |The maximum number of threads
+to perform the merge operation. Defaults to
+`Math.max(1, Math.min(3, Runtime.getRuntime().availableProcessors() / 2))`.
+|=======================================================================
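+
+As a sketch, the thread count could be pinned in `elasticsearch.yml` (the
+value of `1` is illustrative, sometimes used for spinning disks):
+
+[source,js]
+--------------------------------------------------
+index.merge.scheduler.max_thread_count: 1
+--------------------------------------------------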
+
+[float]
+==== SerialMergeScheduler
+
+A merge scheduler that simply does each merge sequentially using the
+calling thread (blocking the operations that triggered the merge, the
+index operation).
diff --git a/docs/reference/index-modules/similarity.asciidoc b/docs/reference/index-modules/similarity.asciidoc
new file mode 100644
index 0000000..ae9f368
--- /dev/null
+++ b/docs/reference/index-modules/similarity.asciidoc
@@ -0,0 +1,140 @@
+[[index-modules-similarity]]
+== Similarity module
+
+A similarity (scoring / ranking model) defines how matching documents
+are scored. Similarity is per field, meaning that via the mapping one
+can define a different similarity per field.
+
+Configuring a custom similarity is considered an expert feature and the
+built-in similarities are most likely sufficient, as described in the
+<<mapping-core-types,mapping section>>.
+
+[float]
+[[configuration]]
+=== Configuring a similarity
+
+Most existing or custom Similarities have configuration options which
+can be configured via the index settings as shown below. The index
+options can be provided when creating an index or updating index
+settings.
+
+[source,js]
+--------------------------------------------------
+"similarity" : {
+ "my_similarity" : {
+ "type" : "DFR",
+ "basic_model" : "g",
+ "after_effect" : "l",
+ "normalization" : "h2",
+ "normalization.h2.c" : "3.0"
+ }
+}
+--------------------------------------------------
+
+Here we configure the DFRSimilarity so it can be referenced as
+`my_similarity` in mappings, as illustrated in the example below:
+
+[source,js]
+--------------------------------------------------
+{
+    "book" : {
+        "properties" : {
+            "title" : { "type" : "string", "similarity" : "my_similarity" }
+        }
+    }
+}
+--------------------------------------------------
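+
+Putting the two together, a sketch of creating an index with this similarity
+defined at creation time (the index name `test` is a placeholder):
+
+[source,js]
+--------------------------------------------------
+curl -XPUT 'localhost:9200/test' -d '{
+    "settings" : {
+        "index" : {
+            "similarity" : {
+                "my_similarity" : {
+                    "type" : "DFR",
+                    "basic_model" : "g",
+                    "after_effect" : "l",
+                    "normalization" : "h2",
+                    "normalization.h2.c" : "3.0"
+                }
+            }
+        }
+    },
+    "mappings" : {
+        "book" : {
+            "properties" : {
+                "title" : { "type" : "string", "similarity" : "my_similarity" }
+            }
+        }
+    }
+}'
+--------------------------------------------------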
+
+[float]
+=== Available similarities
+
+[float]
+[[default-similarity]]
+==== Default similarity
+
+The default similarity is based on the TF/IDF model. This
+similarity has the following option:
+
+`discount_overlaps`::
+ Determines whether overlap tokens (Tokens with
+ 0 position increment) are ignored when computing norm. By default this
+ is true, meaning overlap tokens do not count when computing norms.
+
+Type name: `default`
+
+[float]
+[[bm25]]
+==== BM25 similarity
+
+Another TF/IDF based similarity that has built-in tf normalization and
+is supposed to work better for short fields (like names). See
+http://en.wikipedia.org/wiki/Okapi_BM25[Okapi_BM25] for more details.
+This similarity has the following options:
+
+[horizontal]
+`k1`::
+ Controls non-linear term frequency normalization
+ (saturation).
+
+`b`::
+ Controls to what degree document length normalizes tf values.
+
+`discount_overlaps`::
+ Determines whether overlap tokens (Tokens with
+ 0 position increment) are ignored when computing norm. By default this
+ is true, meaning overlap tokens do not count when computing norms.
+
+Type name: `BM25`
+
+[float]
+[[drf]]
+==== DFR similarity
+
+Similarity that implements the
+http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
+from randomness] framework. This similarity has the following options:
+
+[horizontal]
+`basic_model`::
+ Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.
+
+`after_effect`::
+ Possible values: `no`, `b` and `l`.
+
+`normalization`::
+ Possible values: `no`, `h1`, `h2`, `h3` and `z`.
+
+All options but the first option need a normalization value.
+
+Type name: `DFR`
+
+[float]
+[[ib]]
+==== IB similarity
+
+http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
+based model]. This similarity has the following options:
+
+[horizontal]
+`distribution`:: Possible values: `ll` and `spl`.
+`lambda`:: Possible values: `df` and `ttf`.
+`normalization`:: Same as in `DFR` similarity.
+
+Type name: `IB`
+
+[float]
+[[default-base]]
+==== Default and Base Similarities
+
+By default, Elasticsearch will use whatever similarity is configured as
+`default`. However, the similarity functions `queryNorm()` and `coord()`
+are not per-field. Consequently, for expert users wanting to change the
+implementation used for these two methods, while not changing the
+`default`, it is possible to configure a similarity with the name
+`base`. This similarity will then be used for the two methods.
+
+You can change the default similarity for all fields like this:
+
+[source,js]
+--------------------------------------------------
+index.similarity.default.type: BM25
+--------------------------------------------------
diff --git a/docs/reference/index-modules/slowlog.asciidoc b/docs/reference/index-modules/slowlog.asciidoc
new file mode 100644
index 0000000..00029cf
--- /dev/null
+++ b/docs/reference/index-modules/slowlog.asciidoc
@@ -0,0 +1,87 @@
+[[index-modules-slowlog]]
+== Index Slow Log
+
+[float]
+[[search-slow-log]]
+=== Search Slow Log
+
+The shard level slow search log allows slow searches (query and fetch
+executions) to be logged into a dedicated log file.
+
+Thresholds can be set for both the query phase and the fetch phase of the
+execution; here is a sample:
+
+[source,js]
+--------------------------------------------------
+#index.search.slowlog.threshold.query.warn: 10s
+#index.search.slowlog.threshold.query.info: 5s
+#index.search.slowlog.threshold.query.debug: 2s
+#index.search.slowlog.threshold.query.trace: 500ms
+
+#index.search.slowlog.threshold.fetch.warn: 1s
+#index.search.slowlog.threshold.fetch.info: 800ms
+#index.search.slowlog.threshold.fetch.debug: 500ms
+#index.search.slowlog.threshold.fetch.trace: 200ms
+--------------------------------------------------
+
+By default, none are enabled (set to `-1`). Levels (`warn`, `info`,
+`debug`, `trace`) allow one to control under which logging level the entry
+will be logged. Not all are required to be configured (for example, only
+the `warn` threshold can be set). The benefit of several levels is the
+ability to quickly "grep" for specific thresholds breached.
+
+The logging is done on the shard level scope, meaning the execution of a
+search request within a specific shard. It does not encompass the whole
+search request, which can be broadcast to several shards in order to
+execute. One of the benefits of shard level logging is that the actual
+execution can be associated with the specific machine, which request level
+logging does not provide.
+
+All settings are index level settings (and each index can have different
+values for them), and can be changed at runtime using the index update
+settings API.
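+
+For example, a sketch of changing one threshold on a live index (the `10s`
+value is illustrative):
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/test/_settings -d '{
+    "index.search.slowlog.threshold.query.warn" : "10s"
+}'
+--------------------------------------------------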
+
+The logging file is configured by default using the following
+configuration (found in `logging.yml`):
+
+[source,js]
+--------------------------------------------------
+index_search_slow_log_file:
+ type: dailyRollingFile
+ file: ${path.logs}/${cluster.name}_index_search_slowlog.log
+ datePattern: "'.'yyyy-MM-dd"
+ layout:
+ type: pattern
+ conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
+--------------------------------------------------
+
+[float]
+[[index-slow-log]]
+=== Index Slow log
+
+The indexing slow log is similar in functionality to the search slow
+log. The log file name ends with `_index_indexing_slowlog.log`. The log and
+the thresholds are configured in the `elasticsearch.yml` file in the same
+way as the search slowlog. Index slowlog sample:
+
+[source,js]
+--------------------------------------------------
+#index.indexing.slowlog.threshold.index.warn: 10s
+#index.indexing.slowlog.threshold.index.info: 5s
+#index.indexing.slowlog.threshold.index.debug: 2s
+#index.indexing.slowlog.threshold.index.trace: 500ms
+--------------------------------------------------
+
+The index slow log file is configured by default in the `logging.yml`
+file:
+
+[source,js]
+--------------------------------------------------
+index_indexing_slow_log_file:
+ type: dailyRollingFile
+ file: ${path.logs}/${cluster.name}_index_indexing_slowlog.log
+ datePattern: "'.'yyyy-MM-dd"
+ layout:
+ type: pattern
+ conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
+--------------------------------------------------
diff --git a/docs/reference/index-modules/store.asciidoc b/docs/reference/index-modules/store.asciidoc
new file mode 100644
index 0000000..8388ee2
--- /dev/null
+++ b/docs/reference/index-modules/store.asciidoc
@@ -0,0 +1,122 @@
+[[index-modules-store]]
+== Store
+
+The store module allows you to control how index data is stored.
+
+The index can either be stored in-memory (no persistence) or on-disk
+(the default). In-memory indices provide better performance at the cost
+of limiting the index size to the amount of available physical memory.
+
+When using a local gateway (the default), file system storage with *no*
+in memory storage is required to maintain index consistency. This is
+required since the local gateway constructs its state from the local
+index state of each node.
+
+Another important aspect of memory based storage is the fact that
+Elasticsearch supports storing the index in memory *outside of the JVM
+heap space*, using the "Memory" (see below) storage type. This means that
+there is no need for extra-large JVM heaps (with their own consequences)
+in order to store the index in memory.
+
+
+[float]
+[[store-throttling]]
+=== Store Level Throttling
+
+Lucene, the IR library Elasticsearch uses under the covers, works by
+creating immutable segments (up to deletes) and constantly merging them
+(the merge policy settings control how those merges happen). The merge
+process runs asynchronously without affecting indexing / search speed.
+The problem, though, especially on systems with low IO, is that the merge
+process can be expensive and affect search / index operations simply
+because the box is now taxed with more IO.
+
+The store module allows throttling to be configured for merges (or for
+all operations), either on the node level or on the index level. Node
+level throttling ensures that, across all the shards allocated on that
+node, the merge process won't exceed the configured bytes per
+second. It can be enabled by setting `indices.store.throttle.type` to
+`merge`, and setting `indices.store.throttle.max_bytes_per_sec` to
+something like `5mb`. The node level settings can be changed dynamically
+using the cluster update settings API. The default is
+`20mb` with type `merge`.
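+
+As a sketch, the node level throttling could be changed on a running cluster
+like this (the `5mb` value is illustrative):
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/_cluster/settings -d '{
+    "transient" : {
+        "indices.store.throttle.type" : "merge",
+        "indices.store.throttle.max_bytes_per_sec" : "5mb"
+    }
+}'
+--------------------------------------------------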
+
+If specific index level configuration is needed, regardless of the node
+level settings, it can be set as well using the
+`index.store.throttle.type`, and
+`index.store.throttle.max_bytes_per_sec`. The default value for the type
+is `node`, meaning it will throttle based on the node level settings and
+participate in the global throttling happening. Both settings can be set
+using the index update settings API dynamically.
+
+The following sections list all the different storage types supported.
+
+[float]
+[[file-system]]
+=== File System
+
+File system based storage is the default storage used. There are
+different implementations or storage types. The best one for the
+operating environment will be automatically chosen: `mmapfs` on
+Solaris/Linux/Windows 64bit, `simplefs` on Windows 32bit, and
+`niofs` for the rest.
+
+The following are the different file system based storage types:
+
+[float]
+==== Simple FS
+
+The `simplefs` type is a straightforward implementation of file system
+storage (maps to Lucene `SimpleFsDirectory`) using a random access file.
+This implementation has poor concurrent performance (multiple threads
+will bottleneck). It is usually better to use the `niofs` when you need
+index persistence.
+
+[float]
+==== NIO FS
+
+The `niofs` type stores the shard index on the file system (maps to
+Lucene `NIOFSDirectory`) using NIO. It allows multiple threads to read
+from the same file concurrently. It is not recommended on Windows
+because of a bug in the SUN Java implementation.
+
+[[mmapfs]]
+[float]
+==== MMap FS
+
+The `mmapfs` type stores the shard index on the file system (maps to
+Lucene `MMapDirectory`) by mapping a file into memory (mmap). Memory
+mapping uses up a portion of the virtual memory address space in your
+process equal to the size of the file being mapped. Before using this
+class, be sure you have plenty of virtual address space.
+
+[float]
+[[store-memory]]
+=== Memory
+
+The `memory` type stores the index in main memory.
+
+There are also *node* level settings that control the caching of buffers
+(important when using direct buffers):
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`cache.memory.direct` |Should the memory be allocated outside of the
+JVM heap. Defaults to `true`.
+
+|`cache.memory.small_buffer_size` |The small buffer size, defaults to
+`1kb`.
+
+|`cache.memory.large_buffer_size` |The large buffer size, defaults to
+`1mb`.
+
+|`cache.memory.small_cache_size` |The small buffer cache size, defaults
+to `10mb`.
+
+|`cache.memory.large_cache_size` |The large buffer cache size, defaults
+to `500mb`.
+|=======================================================================
+
diff --git a/docs/reference/index-modules/translog.asciidoc b/docs/reference/index-modules/translog.asciidoc
new file mode 100644
index 0000000..e5215fe
--- /dev/null
+++ b/docs/reference/index-modules/translog.asciidoc
@@ -0,0 +1,28 @@
+[[index-modules-translog]]
+== Translog
+
+Each shard has a transaction log or write ahead log associated with it.
+It guarantees that when an index/delete operation occurs, it is applied
+atomically, without "committing" the internal Lucene index for each
+request. A flush ("commit") still happens based on several
+parameters:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|index.translog.flush_threshold_ops |After how many operations to flush.
+Defaults to `5000`.
+
+|index.translog.flush_threshold_size |Once the translog hits this size,
+a flush will happen. Defaults to `200mb`.
+
+|index.translog.flush_threshold_period |The period with no flush
+happening to force a flush. Defaults to `30m`.
+
+|index.translog.interval |How often to check if a flush is needed, randomized
+between the interval value and 2x the interval value. Defaults to `5s`.
+|=======================================================================
+
+Note: these parameters can be updated at runtime using the Index
+Settings Update API (for example, these numbers can be increased when
+executing bulk updates to support higher TPS).
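+
+For example, a sketch of raising the flush thresholds on a live index (the
+values are illustrative):
+
+[source,js]
+--------------------------------------------------
+curl -XPUT localhost:9200/test/_settings -d '{
+    "index.translog.flush_threshold_ops" : 20000,
+    "index.translog.flush_threshold_size" : "500mb"
+}'
+--------------------------------------------------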