Diffstat (limited to 'docs/reference/index-modules/codec.asciidoc')
-rw-r--r-- | docs/reference/index-modules/codec.asciidoc | 278 |
1 files changed, 278 insertions, 0 deletions
diff --git a/docs/reference/index-modules/codec.asciidoc b/docs/reference/index-modules/codec.asciidoc
new file mode 100644
index 0000000..f53c18f
--- /dev/null
+++ b/docs/reference/index-modules/codec.asciidoc
@@ -0,0 +1,278 @@
+[[index-modules-codec]]
+== Codec module
+
+Codecs define how documents are written to disk and read from disk. The
+postings format is the part of the codec that is responsible for reading
+and writing the term dictionary, postings lists and positions, payloads
+and offsets stored in the postings list. The doc values format is
+responsible for reading column-stride storage for a field and is typically
+used for sorting or faceting. When a field doesn't have doc values enabled,
+it is still possible to sort or facet by loading field values from the
+inverted index into main memory.
+
+Configuring custom postings or doc values formats is an expert feature, and
+the built-in formats will most likely suit your needs, as described
+in the <<mapping-core-types,mapping section>>.
+
+**********************************
+Only the default codec, postings format and doc values format are supported:
+other formats may break backward compatibility between minor versions of
+Elasticsearch, requiring data to be reindexed.
+**********************************
+
+
+[float]
+[[custom-postings]]
+=== Configuring a custom postings format
+
+A custom postings format can be defined in the index settings in the
+`codec` part. The `codec` part can be configured when creating an index
+or updating index settings. An example of how to define your custom
+postings format:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT 'http://localhost:9200/twitter/' -d '{
+    "settings" : {
+        "index" : {
+            "codec" : {
+                "postings_format" : {
+                    "my_format" : {
+                        "type" : "pulsing",
+                        "freq_cut_off" : "5"
+                    }
+                }
+            }
+        }
+    }
+}'
+--------------------------------------------------
+
+Then, when defining your mapping, you can use the `my_format` name in the
+`postings_format` option as the example below illustrates:
+
+[source,js]
+--------------------------------------------------
+{
+    "person" : {
+        "properties" : {
+            "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
+        }
+    }
+}
+--------------------------------------------------
+
+[float]
+=== Available postings formats
+
+[float]
+[[direct-postings]]
+==== Direct postings format
+
+Wraps the default postings format for on-disk storage, but then at read
+time loads and stores all terms & postings directly in RAM. This
+postings format makes no effort to compress the terms and posting list
+and therefore is memory intensive, but because of this it gives a
+substantial increase in search performance. Because this holds all term
+bytes as a single byte[], you cannot have more than 2.1GB worth of terms
+in a single segment.
+
+This postings format offers the following parameters:
+
+`min_skip_count`::
+    The minimum number of terms with a shared prefix to
+    allow a skip pointer to be written. The default is *8*.
+
+`low_freq_cutoff`::
+    Terms with a lower document frequency use a
+    single array object representation for postings and positions. The
+    default is *32*.
+
+Type name: `direct`
+
+[float]
+[[memory-postings]]
+==== Memory postings format
+
+A postings format that stores terms & postings (docs, positions,
+payloads) in RAM, using an FST. This postings format does write to disk,
+but loads everything into memory. The memory postings format has the
+following options:
+
+`pack_fst`::
+    A boolean option that defines if the in-memory structure
+    should be packed once it is built. Packing reduces the size of the
+    data-structure in memory but requires more memory during building.
+    Default is *false*.
+
+`acceptable_overhead_ratio`::
+    The compression ratio, specified as a
+    float, that is used to compress internal structures. Example ratios: `0`
+    (Compact, no memory overhead at all, but the returned implementation may
+    be slow), `0.5` (Fast, at most 50% memory overhead, always select a
+    reasonably fast implementation), `7` (Fastest, at most 700% memory
+    overhead, no compression). Default is `0.2`.
+
+Type name: `memory`
+
+[float]
+[[bloom-postings]]
+==== Bloom filter postings format
+
+The bloom filter postings format wraps a delegate postings format and on
+top of this creates a bloom filter that is written to disk. During
+opening this bloom filter is loaded into memory and used to offer
+"fast-fail" reads. This postings format is useful for low doc-frequency
+fields such as primary keys. The bloom filter postings format has the
+following options:
+
+`delegate`::
+    The name of the configured postings format that the
+    bloom filter postings format will wrap.
+
+`fpp`::
+    The desired false positive probability specified as a
+    floating point number between 0 and 1.0. The `fpp` can be configured for
+    multiple expected insertions. Example expression: *10k=0.01,1m=0.03*. If
+    the number of docs per index segment is larger than *1m* then use *0.03* as fpp
+    and if the number of docs per segment is larger than *10k* use *0.01* as
+    fpp. The last fallback value is always *0.03*. This example expression
+    is also the default.
+
+Type name: `bloom`
+
+[[codec-bloom-load]]
+[TIP]
+==================================================
+
+It can sometimes make sense to disable bloom filters. For instance, if you are
+logging into an index per day, and you have thousands of indices, the bloom
+filters can take up a sizable amount of memory. For most queries you are only
+interested in recent indices, so you don't mind CRUD operations on older
+indices taking slightly longer.
+
+In these cases you can disable loading of the bloom filter on a per-index
+basis by updating the index settings:
+
+[source,js]
+--------------------------------------------------
+PUT /old_index/_settings?index.codec.bloom.load=false
+--------------------------------------------------
+
+This setting, which defaults to `true`, can be updated on a live index. Note,
+however, that changing the value will cause the index to be reopened, which
+will invalidate any existing caches.
+
+==================================================
+
+[float]
+[[pulsing-postings]]
+==== Pulsing postings format
+
+The pulsing implementation in-lines the posting lists for very
+low-frequency terms in the term dictionary. This is useful to improve lookup
+performance for low-frequency terms. This postings format offers the
+following parameters:
+
+`min_block_size`::
+    The minimum block size the default Lucene term
+    dictionary uses to encode on-disk blocks. Defaults to *25*.
+
+`max_block_size`::
+    The maximum block size the default Lucene term
+    dictionary uses to encode on-disk blocks. Defaults to *48*.
+
+`freq_cut_off`::
+    The document frequency cut off where pulsing
+    in-lines posting lists into the term dictionary. Terms with a document
+    frequency less than or equal to the cutoff will be in-lined. The default is
+    *1*.
+
+Type name: `pulsing`
+
+[float]
+[[default-postings]]
+==== Default postings format
+
+The default postings format has the following options:
+
+`min_block_size`::
+    The minimum block size the default Lucene term
+    dictionary uses to encode on-disk blocks. Defaults to *25*.
+
+`max_block_size`::
+    The maximum block size the default Lucene term
+    dictionary uses to encode on-disk blocks. Defaults to *48*.
+
+Type name: `default`
+
+[float]
+=== Configuring a custom doc values format
+
+A custom doc values format can be defined in the index settings in the
+`codec` part. The `codec` part can be configured when creating an index
+or updating index settings. An example of how to define your custom
+doc values format:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT 'http://localhost:9200/twitter/' -d '{
+    "settings" : {
+        "index" : {
+            "codec" : {
+                "doc_values_format" : {
+                    "my_format" : {
+                        "type" : "disk"
+                    }
+                }
+            }
+        }
+    }
+}'
+--------------------------------------------------
+
+Then, when defining your mapping, you can use the `my_format` name in the
+`doc_values_format` option as the example below illustrates:
+
+[source,js]
+--------------------------------------------------
+{
+    "product" : {
+        "properties" : {
+            "price" : {"type" : "integer", "doc_values_format" : "my_format"}
+        }
+    }
+}
+--------------------------------------------------
+
+[float]
+=== Available doc values formats
+
+[float]
+==== Memory doc values format
+
+A doc values format that stores all values in an FST in RAM. This format does
+write to disk but the whole data-structure is loaded into memory when reading
+the index. The memory doc values format has no options.
+
+Type name: `memory`
+
+[float]
+==== Disk doc values format
+
+A doc values format that stores and reads everything from disk. Although it may
+be slightly slower than the default doc values format, this doc values format
+will require almost no memory from the JVM. The disk doc values format has no
+options.
+
+Type name: `disk`
+
+[float]
+==== Default doc values format
+
+The default doc values format tries to make a good compromise between speed and
+memory usage by only loading into memory data-structures that matter for
+performance. This makes this doc values format a good fit for most use-cases.
+The default doc values format has no options.
+
+Type name: `default`
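
Reviewer note (not part of the patch): the bloom filter section lists the `delegate` and `fpp` options but, unlike the pulsing section, gives no settings example. The sketch below builds a settings body for it by analogy with the pulsing example. It assumes that a bloom format is registered under the same `codec.postings_format` block and that `default` is accepted as a delegate name; the diff confirms neither. The format name `my_bloom` and the helper `bloom_settings` are hypothetical.

```python
import json


def bloom_settings(name, delegate, fpp="10k=0.01,1m=0.03"):
    """Build an index-settings body registering a bloom postings format.

    Assumption: the layout mirrors the pulsing example in the diff
    (settings -> index -> codec -> postings_format -> <name>).
    """
    return {
        "settings": {
            "index": {
                "codec": {
                    "postings_format": {
                        # "bloom" is the type name given in the diff;
                        # "delegate" and "fpp" are the two documented options.
                        name: {"type": "bloom", "delegate": delegate, "fpp": fpp}
                    }
                }
            }
        }
    }


# Hypothetical format name; the fpp default is the expression quoted in the diff.
body = bloom_settings("my_bloom", "default")
print(json.dumps(body, indent=4))
```

The printed JSON could be sent with the same `curl -XPUT` call the document uses for the pulsing format. Whether the server accepts it depends on the assumptions above.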