summaryrefslogtreecommitdiff
path: root/docs/reference/index-modules/codec.asciidoc
diff options
context:
space:
mode:
Diffstat (limited to 'docs/reference/index-modules/codec.asciidoc')
-rw-r--r--docs/reference/index-modules/codec.asciidoc278
1 files changed, 278 insertions, 0 deletions
diff --git a/docs/reference/index-modules/codec.asciidoc b/docs/reference/index-modules/codec.asciidoc
new file mode 100644
index 0000000..f53c18f
--- /dev/null
+++ b/docs/reference/index-modules/codec.asciidoc
@@ -0,0 +1,278 @@
+[[index-modules-codec]]
+== Codec module
+
+Codecs define how documents are written to disk and read from disk. The
+postings format is the part of the codec that responsible for reading
+and writing the term dictionary, postings lists and positions, payloads
+and offsets stored in the postings list. The doc values format is
+responsible for reading column-stride storage for a field and is typically
+used for sorting or faceting. When a field doesn't have doc values enabled,
+it is still possible to sort or facet by loading field values from the
+inverted index into main memory.
+
+Configuring custom postings or doc values formats is an expert feature and
+most likely using the builtin formats will suit your needs as is described
+in the <<mapping-core-types,mapping section>>.
+
+**********************************
+Only the default codec, postings format and doc values format are supported:
+other formats may break backward compatibility between minor versions of
+Elasticsearch, requiring data to be reindexed.
+**********************************
+
+
+[float]
+[[custom-postings]]
+=== Configuring a custom postings format
+
+Custom postings format can be defined in the index settings in the
+`codec` part. The `codec` part can be configured when creating an index
+or updating index settings. An example on how to define your custom
+postings format:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT 'http://localhost:9200/twitter/' -d '{
+ "settings" : {
+ "index" : {
+ "codec" : {
+ "postings_format" : {
+ "my_format" : {
+ "type" : "pulsing",
+ "freq_cut_off" : "5"
+ }
+ }
+ }
+ }
+ }
+}'
+--------------------------------------------------
+
+Then we defining your mapping your can use the `my_format` name in the
+`postings_format` option as the example below illustrates:
+
+[source,js]
+--------------------------------------------------
+{
+ "person" : {
+ "properties" : {
+ "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
+ }
+ }
+}
+--------------------------------------------------
+
+[float]
+=== Available postings formats
+
+[float]
+[[direct-postings]]
+==== Direct postings format
+
+Wraps the default postings format for on-disk storage, but then at read
+time loads and stores all terms & postings directly in RAM. This
+postings format makes no effort to compress the terms and posting list
+and therefore is memory intensive, but because of this it gives a
+substantial increase in search performance. Because this holds all term
+bytes as a single byte[], you cannot have more than 2.1GB worth of terms
+in a single segment.
+
+This postings format offers the following parameters:
+
+`min_skip_count`::
+ The minimum number terms with a shared prefix to
+ allow a skip pointer to be written. The default is *8*.
+
+`low_freq_cutoff`::
+ Terms with a lower document frequency use a
+ single array object representation for postings and positions. The
+ default is *32*.
+
+Type name: `direct`
+
+[float]
+[[memory-postings]]
+==== Memory postings format
+
+A postings format that stores terms & postings (docs, positions,
+payloads) in RAM, using an FST. This postings format does write to disk,
+but loads everything into memory. The memory postings format has the
+following options:
+
+`pack_fst`::
+ A boolean option that defines if the in memory structure
+ should be packed once its build. Packed will reduce the size for the
+ data-structure in memory but requires more memory during building.
+ Default is *false*.
+
+`acceptable_overhead_ratio`::
+ The compression ratio specified as a
+ float, that is used to compress internal structures. Example ratios `0`
+ (Compact, no memory overhead at all, but the returned implementation may
+ be slow), `0.5` (Fast, at most 50% memory overhead, always select a
+ reasonably fast implementation), `7` (Fastest, at most 700% memory
+ overhead, no compression). Default is `0.2`.
+
+Type name: `memory`
+
+[float]
+[[bloom-postings]]
+==== Bloom filter posting format
+
+The bloom filter postings format wraps a delegate postings format and on
+top of this creates a bloom filter that is written to disk. During
+opening this bloom filter is loaded into memory and used to offer
+"fast-fail" reads. This postings format is useful for low doc-frequency
+fields such as primary keys. The bloom filter postings format has the
+following options:
+
+`delegate`::
+ The name of the configured postings format that the
+ bloom filter postings format will wrap.
+
+`fpp`::
+ The desired false positive probability specified as a
+ floating point number between 0 and 1.0. The `fpp` can be configured for
+ multiple expected insertions. Example expression: *10k=0.01,1m=0.03*. If
+ number docs per index segment is larger than *1m* then use *0.03* as fpp
+ and if number of docs per segment is larger than *10k* use *0.01* as
+ fpp. The last fallback value is always *0.03*. This example expression
+ is also the default.
+
+Type name: `bloom`
+
+[[codec-bloom-load]]
+[TIP]
+==================================================
+
+It can sometime make sense to disable bloom filters. For instance, if you are
+logging into an index per day, and you have thousands of indices, the bloom
+filters can take up a sizable amount of memory. For most queries you are only
+interested in recent indices, so you don't mind CRUD operations on older
+indices taking slightly longer.
+
+In these cases you can disable loading of the bloom filter on a per-index
+basis by updating the index settings:
+
+[source,js]
+--------------------------------------------------
+PUT /old_index/_settings?index.codec.bloom.load=false
+--------------------------------------------------
+
+This setting, which defaults to `true`, can be updated on a live index. Note,
+however, that changing the value will cause the index to be reopened, which
+will invalidate any existing caches.
+
+==================================================
+
+[float]
+[[pulsing-postings]]
+==== Pulsing postings format
+
+The pulsing implementation in-lines the posting lists for very low
+frequent terms in the term dictionary. This is useful to improve lookup
+performance for low-frequent terms. This postings format offers the
+following parameters:
+
+`min_block_size`::
+ The minimum block size the default Lucene term
+ dictionary uses to encode on-disk blocks. Defaults to *25*.
+
+`max_block_size`::
+ The maximum block size the default Lucene term
+ dictionary uses to encode on-disk blocks. Defaults to *48*.
+
+`freq_cut_off`::
+ The document frequency cut off where pulsing
+ in-lines posting lists into the term dictionary. Terms with a document
+ frequency less or equal to the cutoff will be in-lined. The default is
+ *1*.
+
+Type name: `pulsing`
+
+[float]
+[[default-postings]]
+==== Default postings format
+
+The default postings format has the following options:
+
+`min_block_size`::
+ The minimum block size the default Lucene term
+ dictionary uses to encode on-disk blocks. Defaults to *25*.
+
+`max_block_size`::
+ The maximum block size the default Lucene term
+ dictionary uses to encode on-disk blocks. Defaults to *48*.
+
+Type name: `default`
+
+[float]
+=== Configuring a custom doc values format
+
+Custom doc values format can be defined in the index settings in the
+`codec` part. The `codec` part can be configured when creating an index
+or updating index settings. An example on how to define your custom
+doc values format:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT 'http://localhost:9200/twitter/' -d '{
+ "settings" : {
+ "index" : {
+ "codec" : {
+ "doc_values_format" : {
+ "my_format" : {
+ "type" : "disk"
+ }
+ }
+ }
+ }
+ }
+}'
+--------------------------------------------------
+
+Then we defining your mapping your can use the `my_format` name in the
+`doc_values_format` option as the example below illustrates:
+
+[source,js]
+--------------------------------------------------
+{
+ "product" : {
+ "properties" : {
+ "price" : {"type" : "integer", "doc_values_format" : "my_format"}
+ }
+ }
+}
+--------------------------------------------------
+
+[float]
+=== Available doc values formats
+
+[float]
+==== Memory doc values format
+
+A doc values format that stores all values in a FST in RAM. This format does
+write to disk but the whole data-structure is loaded into memory when reading
+the index. The memory postings format has no options.
+
+Type name: `memory`
+
+[float]
+==== Disk doc values format
+
+A doc values format that stores and reads everything from disk. Although it may
+be slightly slower than the default doc values format, this doc values format
+will require almost no memory from the JVM. The disk doc values format has no
+options.
+
+Type name: `disk`
+
+[float]
+==== Default doc values format
+
+The default doc values format tries to make a good compromise between speed and
+memory usage by only loading into memory data-structures that matter for
+performance. This makes this doc values format a good fit for most use-cases.
+The default doc values format has no options.
+
+Type name: `default`