Diffstat (limited to 'docs/reference/modules')
24 files changed, 2262 insertions, 0 deletions
diff --git a/docs/reference/modules/advanced-scripting.asciidoc b/docs/reference/modules/advanced-scripting.asciidoc new file mode 100644 index 0000000..d215661 --- /dev/null +++ b/docs/reference/modules/advanced-scripting.asciidoc @@ -0,0 +1,184 @@ +[[modules-advanced-scripting]] +== Text scoring in scripts + + +Text features, such as term or document frequency for a specific term, can be accessed in scripts (see <<modules-scripting, scripting documentation>>) with the `_index` variable. This can be useful if, for example, you want to implement your own scoring model using a script inside a <<query-dsl-function-score-query,function score query>>. +Statistics over the document collection are computed *per shard*, not per +index.
 + +[float] +=== Nomenclature: + + +[horizontal] +`df`:: + + document frequency. The number of documents a term appears in. Computed + per field. + + +`tf`:: + + term frequency. The number of times a term appears in a field in one specific + document. + +`ttf`:: + + total term frequency. The number of times this term appears in all + documents, that is, the sum of `tf` over all documents. Computed per + field. + +`df` and `ttf` are computed per shard and therefore these numbers can vary +depending on the shard the current document resides in. + + +[float] +=== Shard statistics: + +`_index.numDocs()`:: + + Number of documents in the shard. + +`_index.maxDoc()`:: + + Maximal document number in the shard. + +`_index.numDeletedDocs()`:: + + Number of deleted documents in the shard. + + +[float] +=== Field statistics: + +Field statistics can be accessed with a subscript operator like this: +`_index['FIELD']`. + + +`_index['FIELD'].docCount()`:: + + Number of documents containing the field `FIELD`. Does not take deleted documents into account. + +`_index['FIELD'].sumttf()`:: + + Sum of `ttf` over all terms that appear in field `FIELD` in all documents. + +`_index['FIELD'].sumdf()`:: + + The sum of `df` over all terms that appear in field `FIELD` in all + documents. + + +Field statistics are computed per shard and therefore these numbers can vary +depending on the shard the current document resides in. +The number of terms in a field cannot be accessed using the `_index` variable. See <<mapping-core-types, word count mapping type>> for how to do that. + +[float] +=== Term statistics: + +Term statistics for a field can be accessed with a subscript operator like +this: `_index['FIELD']['TERM']`. This will never return null, even if the term or field does not exist. +If you do not need the term frequency, call `_index['FIELD'].get('TERM', 0)` +to avoid unnecessary initialization of the frequencies. The flag only takes +effect if you set the `index_options` to `docs` (see <<mapping-core-types, mapping documentation>>). + + +`_index['FIELD']['TERM'].df()`:: + + `df` of term `TERM` in field `FIELD`. Will be returned even if the term + is not present in the current document. + +`_index['FIELD']['TERM'].ttf()`:: + + The sum of term frequencies of term `TERM` in field `FIELD` over all + documents. Will be returned even if the term is not present in the + current document. + +`_index['FIELD']['TERM'].tf()`:: + + `tf` of term `TERM` in field `FIELD`. Will be 0 if the term is not present + in the current document.
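As a worked example, here is a minimal sketch of a tf-idf style weight built from the statistics above. The field name `my_field` and the term `foo` are placeholders, and the weighting formula is purely illustrative, not the formula Elasticsearch uses internally:

[source,mvel]
---------------------------------------------------------
// term frequency in the current document and document frequency on this shard
tf = _index['my_field']['foo'].tf();
df = _index['my_field']['foo'].df();
// shard level document count; remember that statistics are per shard, not per index
numDocs = _index.numDocs();
if (df == 0) {
    return 0;
}
// scale the term frequency by a simple inverse document frequency ratio
return tf * (1.0 * numDocs / df);
---------------------------------------------------------

A script like this could be used, for instance, as a `script_score` inside a <<query-dsl-function-score-query,function score query>>; treat it as a starting point rather than a complete scoring model.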
+ + +[float] +=== Term positions, offsets and payloads: + +If you need information on the positions of terms in a field, call +`_index['FIELD'].get('TERM', flag)` where flag can be + +[horizontal] +`_POSITIONS`:: if you need the positions of the term +`_OFFSETS`:: if you need the offests of the term +`_PAYLOADS`:: if you need the payloads of the term +`_CACHE`:: if you need to iterate over all positions several times + +The iterator uses the underlying lucene classes to iterate over positions. For efficiency reasons, you can only iterate over positions once. If you need to iterate over the positions several times, set the `_CACHE` flag. + +You can combine the operators with a `|` if you need more than one info. For +example, the following will return an object holding the positions and payloads, +as well as all statistics: + + + `_index['FIELD'].get('TERM', _POSITIONS | _PAYLOADS)` + + +Positions can be accessed with an iterator that returns an object +(`POS_OBJECT`) holding position, offsets and payload for each term position. + +`POS_OBJECT.position`:: + + The position of the term. + +`POS_OBJECT.startOffset`:: + + The start offset of the term. + +`POS_OBJECT.endOffset`:: + + The end offset of the term. + +`POS_OBJECT.payload`:: + + The payload of the term. + +`POS_OBJECT.payloadAsInt(missingValue)`:: + + The payload of the term converted to integer. If the current position has + no payload, the `missingValue` will be returned. Call this only if you + know that your payloads are integers. + +`POS_OBJECT.payloadAsFloat(missingValue)`:: + + The payload of the term converted to float. If the current position has no + payload, the `missingValue` will be returned. Call this only if you know + that your payloads are floats. + +`POS_OBJECT.payloadAsString()`:: + + The payload of the term converted to string. If the current position has + no payload, `null` will be returned. Call this only if you know that your + payloads are strings. + + +Example: sums up all payloads for the term `foo`. + +[source,mvel] +--------------------------------------------------------- +termInfo = _index['my_field'].get('foo',_PAYLOADS); +score = 0; +for (pos : termInfo) { + score = score + pos.payloadAsInt(0); +} +return score; +--------------------------------------------------------- + + +[float] +=== Term vectors: + +The `_index` variable can only be used to gather statistics for single terms. If you want to use information on all terms in a field, you must store the term vectors (set `term_vector` in the mapping as described in the <<mapping-core-types,mapping documentation>>). To access them, call +`_index.getTermVectors()` to get a +https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/Fields.html[Fields] +instance. This object can then be used as described in https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/Fields.html[lucene doc] to iterate over fields and then for each field iterate over each term in the field. +The method will return null if the term vectors were not stored. + diff --git a/docs/reference/modules/cluster.asciidoc b/docs/reference/modules/cluster.asciidoc new file mode 100644 index 0000000..e61fbf6 --- /dev/null +++ b/docs/reference/modules/cluster.asciidoc @@ -0,0 +1,239 @@ +[[modules-cluster]] +== Cluster + +[float] +[[shards-allocation]] +=== Shards Allocation + +Shards allocation is the process of allocating shards to nodes. This can +happen during initial recovery, replica allocation, rebalancing, or +handling nodes being added or removed. 
+ +The following settings may be used: + +`cluster.routing.allocation.allow_rebalance`:: + Allow to control when rebalancing will happen based on the total + state of all the indices shards in the cluster. `always`, + `indices_primaries_active`, and `indices_all_active` are allowed, + defaulting to `indices_all_active` to reduce chatter during + initial recovery. + + +`cluster.routing.allocation.cluster_concurrent_rebalance`:: + Allow to control how many concurrent rebalancing of shards are + allowed cluster wide, and default it to `2`. + + +`cluster.routing.allocation.node_initial_primaries_recoveries`:: + Allow to control specifically the number of initial recoveries + of primaries that are allowed per node. Since most times local + gateway is used, those should be fast and we can handle more of + those per node without creating load. + + +`cluster.routing.allocation.node_concurrent_recoveries`:: + How many concurrent recoveries are allowed to happen on a node. + Defaults to `2`. + +`cluster.routing.allocation.enable`:: + Controls shard allocation for all indices, by allowing specific + kinds of shard to be allocated. + added[1.0.0.RC1,Replaces `cluster.routing.allocation.disable*`] + Can be set to: + * `all` (default) - Allows shard allocation for all kinds of shards. + * `primaries` - Allows shard allocation only for primary shards. + * `new_primaries` - Allows shard allocation only for primary shards for new indices. + * `none` - No shard allocations of any kind are allowed for all indices. + +`cluster.routing.allocation.disable_new_allocation`:: + deprecated[1.0.0.RC1,Replaced by `cluster.routing.allocation.enable`] + +`cluster.routing.allocation.disable_allocation`:: + deprecated[1.0.0.RC1,Replaced by `cluster.routing.allocation.enable`] + + +`cluster.routing.allocation.disable_replica_allocation`:: + deprecated[1.0.0.RC1,Replaced by `cluster.routing.allocation.enable`] + +`cluster.routing.allocation.same_shard.host`:: + Allows to perform a check to prevent allocation of multiple instances + of the same shard on a single host, based on host name and host address. + Defaults to `false`, meaning that no check is performed by default. This + setting only applies if multiple nodes are started on the same machine. + +`indices.recovery.concurrent_streams`:: + The number of streams to open (on a *node* level) to recover a + shard from a peer shard. Defaults to `3`. + +[float] +[[allocation-awareness]] +=== Shard Allocation Awareness + +Cluster allocation awareness allows to configure shard and replicas +allocation across generic attributes associated the nodes. Lets explain +it through an example: + +Assume we have several racks. When we start a node, we can configure an +attribute called `rack_id` (any attribute name works), for example, here +is a sample config: + +---------------------- +node.rack_id: rack_one +---------------------- + +The above sets an attribute called `rack_id` for the relevant node with +a value of `rack_one`. Now, we need to configure the `rack_id` attribute +as one of the awareness allocation attributes (set it on *all* (master +eligible) nodes config): + +-------------------------------------------------------- +cluster.routing.allocation.awareness.attributes: rack_id +-------------------------------------------------------- + +The above will mean that the `rack_id` attribute will be used to do +awareness based allocation of shard and its replicas. 
For example, let's +say we start 2 nodes with `node.rack_id` set to `rack_one`, and deploy a +single index with 5 shards and 1 replica. The index will be fully +deployed on the current nodes (5 shards and 1 replica each, a total of 10 +shards).
+ +Now, if we start two more nodes, with `node.rack_id` set to `rack_two`, +shards will relocate to even out the number of shards across the nodes, but +a shard and its replica will not be allocated in the same `rack_id` +value.
+ +The awareness attributes can hold several values, for example: + +------------------------------------------------------------- +cluster.routing.allocation.awareness.attributes: rack_id,zone +-------------------------------------------------------------
+ +*NOTE*: When using awareness attributes, shards will not be allocated to +nodes that don't have values set for those attributes.
+ +[float] +[[forced-awareness]] +=== Forced Awareness + +Sometimes, we know in advance the number of values an awareness +attribute can have, and moreover, we would like never to have more +replicas than needed allocated on a specific group of nodes with the +same awareness attribute value. For that, we can force awareness on +specific attributes.
+ +For example, let's say we have an awareness attribute called `zone`, and +we know we are going to have two zones, `zone1` and `zone2`. Here is how +we can force awareness on a node: + +[source,js] +------------------------------------------------------------------- +cluster.routing.allocation.awareness.force.zone.values: zone1,zone2 +cluster.routing.allocation.awareness.attributes: zone +-------------------------------------------------------------------
+ +Now, let's say we start 2 nodes with `node.zone` set to `zone1` and +create an index with 5 shards and 1 replica. The index will be created, +but only 5 shards will be allocated (with no replicas). Only when we +start more nodes with `node.zone` set to `zone2` will the replicas be +allocated.
+ +[float] +==== Automatic Preference When Searching / GETing + +When executing a search, or doing a get, the node receiving the request +will prefer to execute the request on shards that exist on nodes that +have the same attribute values as the executing node.
+ +[float] +==== Realtime Settings Update + +The settings can be updated using the <<cluster-update-settings,cluster update settings API>> on a live cluster.
+ +[float] +[[allocation-filtering]] +=== Shard Allocation Filtering + +Allows control over the allocation of indices on nodes based on include/exclude +filters. The filters can be set both on the index level and on the +cluster level. Let's start with an example of setting it on the cluster +level:
+ +Let's say we have 4 nodes, each with a specific attribute called `tag` +associated with it (the name of the attribute can be any name). Each +node has a specific value associated with `tag`. Node 1 has a setting +`node.tag: value1`, Node 2 a setting of `node.tag: value2`, and so on.
+ +We can create an index that will only deploy on nodes that have `tag` +set to `value1` and `value2` by setting +`index.routing.allocation.include.tag` to `value1,value2`.
For example: + +[source,js] +-------------------------------------------------- +curl -XPUT localhost:9200/test/_settings -d '{ + "index.routing.allocation.include.tag" : "value1,value2" +}' +-------------------------------------------------- + +On the other hand, we can create an index that will be deployed on all +nodes except for nodes with a `tag` of value `value3` by setting +`index.routing.allocation.exclude.tag` to `value3`. For example: + +[source,js] +-------------------------------------------------- +curl -XPUT localhost:9200/test/_settings -d '{ + "index.routing.allocation.exclude.tag" : "value3" +}' +-------------------------------------------------- + +`index.routing.allocation.require.*` can be used to +specify a number of rules, all of which MUST match in order for a shard +to be allocated to a node. This is in contrast to `include` which will +include a node if ANY rule matches. + +The `include`, `exclude` and `require` values can have generic simple +matching wildcards, for example, `value1*`. A special attribute name +called `_ip` can be used to match on node ip values. In addition `_host` +attribute can be used to match on either the node's hostname or its ip +address. Similarly `_name` and `_id` attributes can be used to match on +node name and node id accordingly. + +Obviously a node can have several attributes associated with it, and +both the attribute name and value are controlled in the setting. For +example, here is a sample of several node configurations: + +[source,js] +-------------------------------------------------- +node.group1: group1_value1 +node.group2: group2_value4 +-------------------------------------------------- + +In the same manner, `include`, `exclude` and `require` can work against +several attributes, for example: + +[source,js] +-------------------------------------------------- +curl -XPUT localhost:9200/test/_settings -d '{ + "index.routing.allocation.include.group1" : "xxx" + "index.routing.allocation.include.group2" : "yyy", + "index.routing.allocation.exclude.group3" : "zzz", + "index.routing.allocation.require.group4" : "aaa" +}' +-------------------------------------------------- + +The provided settings can also be updated in real time using the update +settings API, allowing to "move" indices (shards) around in realtime. + +Cluster wide filtering can also be defined, and be updated in real time +using the cluster update settings API. This setting can come in handy +for things like decommissioning nodes (even if the replica count is set +to 0). Here is a sample of how to decommission a node based on `_ip` +address: + +[source,js] +-------------------------------------------------- +curl -XPUT localhost:9200/_cluster/settings -d '{ + "transient" : { + "cluster.routing.allocation.exclude._ip" : "10.0.0.1" + } +}' +-------------------------------------------------- diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc new file mode 100644 index 0000000..292748d --- /dev/null +++ b/docs/reference/modules/discovery.asciidoc @@ -0,0 +1,30 @@ +[[modules-discovery]] +== Discovery + +The discovery module is responsible for discovering nodes within a +cluster, as well as electing a master node. + +Note, Elasticsearch is a peer to peer based system, nodes communicate +with one another directly if operations are delegated / broadcast. All +the main APIs (index, delete, search) do not communicate with the master +node. 
The responsibility of the master node is to maintain the global +cluster state, and act if nodes join or leave the cluster by reassigning +shards. Each time a cluster state is changed, the state is made known to +the other nodes in the cluster (the manner depends on the actual +discovery implementation). + +[float] +=== Settings + +The `cluster.name` allows to create separated clusters from one another. +The default value for the cluster name is `elasticsearch`, though it is +recommended to change this to reflect the logical group name of the +cluster running. + +include::discovery/azure.asciidoc[] + +include::discovery/ec2.asciidoc[] + +include::discovery/gce.asciidoc[] + +include::discovery/zen.asciidoc[] diff --git a/docs/reference/modules/discovery/azure.asciidoc b/docs/reference/modules/discovery/azure.asciidoc new file mode 100644 index 0000000..bb6fdc8 --- /dev/null +++ b/docs/reference/modules/discovery/azure.asciidoc @@ -0,0 +1,6 @@ +[[modules-discovery-azure]] +=== Azure Discovery + +Azure discovery allows to use the Azure APIs to perform automatic discovery (similar to multicast). +Please check the https://github.com/elasticsearch/elasticsearch-cloud-azure[plugin website] +to find the full documentation. diff --git a/docs/reference/modules/discovery/ec2.asciidoc b/docs/reference/modules/discovery/ec2.asciidoc new file mode 100644 index 0000000..9d0fa3f --- /dev/null +++ b/docs/reference/modules/discovery/ec2.asciidoc @@ -0,0 +1,6 @@ +[[modules-discovery-ec2]] +=== EC2 Discovery + +EC2 discovery allows to use the EC2 APIs to perform automatic discovery (similar to multicast). +Please check the https://github.com/elasticsearch/elasticsearch-cloud-aws[plugin website] +to find the full documentation. diff --git a/docs/reference/modules/discovery/gce.asciidoc b/docs/reference/modules/discovery/gce.asciidoc new file mode 100644 index 0000000..bb9c89f --- /dev/null +++ b/docs/reference/modules/discovery/gce.asciidoc @@ -0,0 +1,6 @@ +[[modules-discovery-gce]] +=== Google Compute Engine Discovery + +Google Compute Engine (GCE) discovery allows to use the GCE APIs to perform automatic discovery (similar to multicast). +Please check the https://github.com/elasticsearch/elasticsearch-cloud-gce[plugin website] +to find the full documentation. diff --git a/docs/reference/modules/discovery/zen.asciidoc b/docs/reference/modules/discovery/zen.asciidoc new file mode 100644 index 0000000..64281d9 --- /dev/null +++ b/docs/reference/modules/discovery/zen.asciidoc @@ -0,0 +1,161 @@ +[[modules-discovery-zen]] +=== Zen Discovery + +The zen discovery is the built in discovery module for elasticsearch and +the default. It provides both multicast and unicast discovery as well +being easily extended to support cloud environments. + +The zen discovery is integrated with other modules, for example, all +communication between nodes is done using the +<<modules-transport,transport>> module. + +It is separated into several sub modules, which are explained below: + +[float] +[[ping]] +==== Ping + +This is the process where a node uses the discovery mechanisms to find +other nodes. There is support for both multicast and unicast based +discovery (can be used in conjunction as well). + +[float] +[[multicast]] +===== Multicast + +Multicast ping discovery of other nodes is done by sending one or more +multicast requests where existing nodes that exists will receive and +respond to. 
It provides the following settings with the +`discovery.zen.ping.multicast` prefix: + +[cols="<,<",options="header",] +|======================================================================= +|Setting |Description +|`group` |The group address to use. Defaults to `224.2.2.4`. + +|`port` |The port to use. Defaults to `54328`. + +|`ttl` |The ttl of the multicast message. Defaults to `3`. + +|`address` |The address to bind to, defaults to `null` which means it +will bind to all available network interfaces. + +|`enabled` |Whether multicast ping discovery is enabled. Defaults to `true`. +|======================================================================= + +[float] +[[unicast]] +===== Unicast + +The unicast discovery allows to perform the discovery when multicast is +not enabled. It basically requires a list of hosts to use that will act +as gossip routers. It provides the following settings with the +`discovery.zen.ping.unicast` prefix: + +[cols="<,<",options="header",] +|======================================================================= +|Setting |Description +|`hosts` |Either an array setting or a comma delimited setting. Each +value is either in the form of `host:port`, or in the form of +`host[port1-port2]`. +|======================================================================= + +The unicast discovery uses the +<<modules-transport,transport>> module to +perform the discovery. + +[float] +[[master-election]] +==== Master Election + +As part of the initial ping process a master of the cluster is either +elected or joined to. This is done automatically. The +`discovery.zen.ping_timeout` (which defaults to `3s`) allows to +configure the election to handle cases of slow or congested networks +(higher values assure less chance of failure). + +Nodes can be excluded from becoming a master by setting `node.master` to +`false`. Note, once a node is a client node (`node.client` set to +`true`), it will not be allowed to become a master (`node.master` is +automatically set to `false`). + +The `discovery.zen.minimum_master_nodes` allows to control the minimum +number of master eligible nodes a node should "see" in order to operate +within the cluster. Its recommended to set it to a higher value than 1 +when running more than 2 nodes in the cluster. + +[float] +[[fault-detection]] +==== Fault Detection + +There are two fault detection processes running. The first is by the +master, to ping all the other nodes in the cluster and verify that they +are alive. And on the other end, each node pings to master to verify if +its still alive or an election process needs to be initiated. + +The following settings control the fault detection process using the +`discovery.zen.fd` prefix: + +[cols="<,<",options="header",] +|======================================================================= +|Setting |Description +|`ping_interval` |How often a node gets pinged. Defaults to `1s`. + +|`ping_timeout` |How long to wait for a ping response, defaults to +`30s`. + +|`ping_retries` |How many ping failures / timeouts cause a node to be +considered failed. Defaults to `3`. +|======================================================================= + +[float] +==== External Multicast + +The multicast discovery also supports external multicast requests to +discover nodes. 
The external client can send a request to the multicast +IP/group and port, in the form of: + +[source,js] +-------------------------------------------------- +{ + "request" : { + "cluster_name": "test_cluster" + } +} +-------------------------------------------------- + +And the response will be similar to node info response (with node level +information only, including transport/http addresses, and node +attributes): + +[source,js] +-------------------------------------------------- +{ + "response" : { + "cluster_name" : "test_cluster", + "transport_address" : "...", + "http_address" : "...", + "attributes" : { + "..." + } + } +} +-------------------------------------------------- + +Note, it can still be enabled, with disabled internal multicast +discovery, but still have external discovery working by keeping +`discovery.zen.ping.multicast.enabled` set to `true` (the default), but, +setting `discovery.zen.ping.multicast.ping.enabled` to `false`. + +[float] +==== Cluster state updates + +The master node is the only node in a cluster that can make changes to the +cluster state. The master node processes one cluster state update at a time, +applies the required changes and publishes the updated cluster state to all +the other nodes in the cluster. Each node receives the publish message, +updates its own cluster state and replies to the master node, which waits for +all nodes to respond, up to a timeout, before going ahead processing the next +updates in the queue. The `discovery.zen.publish_timeout` is set by default +to 30 seconds and can be changed dynamically through the +<<cluster-update-settings,cluster update settings api>>. diff --git a/docs/reference/modules/gateway.asciidoc b/docs/reference/modules/gateway.asciidoc new file mode 100644 index 0000000..539a57d --- /dev/null +++ b/docs/reference/modules/gateway.asciidoc @@ -0,0 +1,75 @@ +[[modules-gateway]] +== Gateway + +The gateway module allows one to store the state of the cluster meta +data across full cluster restarts. The cluster meta data mainly holds +all the indices created with their respective (index level) settings and +explicit type mappings. + +Each time the cluster meta data changes (for example, when an index is +added or deleted), those changes will be persisted using the gateway. +When the cluster first starts up, the state will be read from the +gateway and applied. + +The gateway set on the node level will automatically control the index +gateway that will be used. For example, if the `fs` gateway is used, +then automatically, each index created on the node will also use its own +respective index level `fs` gateway. In this case, if an index should +not persist its state, it should be explicitly set to `none` (which is +the only other value it can be set to). + +The default gateway used is the +<<modules-gateway-local,local>> gateway. + +[float] +[[recover-after]] +=== Recovery After Nodes / Time + +In many cases, the actual cluster meta data should only be recovered +after specific nodes have started in the cluster, or a timeout has +passed. This is handy when restarting the cluster, and each node local +index storage still exists to be reused and not recovered from the +gateway (which reduces the time it takes to recover from the gateway). + +The `gateway.recover_after_nodes` setting (which accepts a number) +controls after how many data and master eligible nodes within the +cluster recovery will start. 
The `gateway.recover_after_data_nodes` and +`gateway.recover_after_master_nodes` settings work in a similar fashion, +except they consider only the number of data nodes and only the number +of master nodes respectively. The `gateway.recover_after_time` setting +(which accepts a time value) sets the time to wait until recovery happens +once all `gateway.recover_after...nodes` conditions are met.
+ +The `gateway.expected_nodes` allows to set how many data and master +eligible nodes are expected to be in the cluster, and once met, the +`recover_after_time` is ignored and recovery starts. The +`gateway.expected_data_nodes` and `gateway.expected_master_nodes` +settings are also supported. For example, setting: + +[source,js] +-------------------------------------------------- +gateway: + recover_after_nodes: 1 + recover_after_time: 5m + expected_nodes: 2 +--------------------------------------------------
+ +in a cluster expected to have 2 nodes will cause recovery to start 5 minutes +after the first node is up, but once there are 2 nodes in the cluster, +recovery will begin immediately (without waiting).
+ +Note, once the meta data has been recovered from the gateway (which +indices to create, mappings and so on), this setting is no longer +effective until the next full restart of the cluster.
+ +Operations are blocked while the cluster meta data has not been +recovered, in order not to mix changes with the actual cluster meta data that +will be recovered once these conditions have been met.
+ +include::gateway/local.asciidoc[] + +include::gateway/fs.asciidoc[] + +include::gateway/hadoop.asciidoc[] + +include::gateway/s3.asciidoc[] diff --git a/docs/reference/modules/gateway/fs.asciidoc b/docs/reference/modules/gateway/fs.asciidoc new file mode 100644 index 0000000..8d765d3 --- /dev/null +++ b/docs/reference/modules/gateway/fs.asciidoc @@ -0,0 +1,39 @@ +[[modules-gateway-fs]] +=== Shared FS Gateway + +*The shared FS gateway is deprecated and will be removed in a future +version. Please use the +<<modules-gateway-local,local gateway>> +instead.*
+ +The file system based gateway stores the cluster meta data and indices +in a *shared* file system. Note, since it is a distributed system, the +file system should be shared between all the different nodes. Here is an +example config to enable it: + +[source,js] +-------------------------------------------------- +gateway: + type: fs +--------------------------------------------------
+ +[float] +==== location + +The location where the gateway stores the cluster state can be set using +the `gateway.fs.location` setting. By default, it will be stored under +the `work` directory. Note, the `work` directory is considered a +temporary directory by Elasticsearch (meaning it is safe to `rm -rf` +it), so the default location is not suitable for a persistent gateway and +*it should be changed*.
+ +When explicitly specifying the `gateway.fs.location`, each node will +append its `cluster.name` to the provided location. This means that the +location provided can safely support several clusters.
+ +[float] +==== concurrent_streams + +The `gateway.fs.concurrent_streams` allows to throttle the number of +streams (per node) opened against the shared gateway performing the +snapshot operation. It defaults to `5`.
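Putting these settings together, a minimal sketch of a config fragment for the shared FS gateway could look like the following; the `/mnt/shared/es-gateway` path is only a placeholder for a mount point that is actually shared by all nodes:

[source,js]
--------------------------------------------------
gateway:
    type: fs
    fs:
        # placeholder path; must point to a file system shared by all nodes
        location: /mnt/shared/es-gateway
        # snapshot streams opened per node (the default is 5)
        concurrent_streams: 5
--------------------------------------------------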
diff --git a/docs/reference/modules/gateway/hadoop.asciidoc b/docs/reference/modules/gateway/hadoop.asciidoc new file mode 100644 index 0000000..b55a4be --- /dev/null +++ b/docs/reference/modules/gateway/hadoop.asciidoc @@ -0,0 +1,36 @@ +[[modules-gateway-hadoop]] +=== Hadoop Gateway + +*The hadoop gateway is deprecated and will be removed in a future +version. Please use the +<<modules-gateway-local,local gateway>> +instead.* + +The hadoop (HDFS) based gateway stores the cluster meta and indices data +in hadoop. Hadoop support is provided as a plugin and installing is +explained https://github.com/elasticsearch/elasticsearch-hadoop[here] or +downloading the hadoop plugin and placing it under the `plugins` +directory. Here is an example config to enable it: + +[source,js] +-------------------------------------------------- +gateway: + type: hdfs + hdfs: + uri: hdfs://myhost:8022 +-------------------------------------------------- + +[float] +==== Settings + +The hadoop gateway requires two simple settings. The `gateway.hdfs.uri` +controls the URI to connect to the hadoop cluster, for example: +`hdfs://myhost:8022`. The `gateway.hdfs.path` controls the path under +which the gateway will store the data. + +[float] +==== concurrent_streams + +The `gateway.hdfs.concurrent_streams` allow to throttle the number of +streams (per node) opened against the shared gateway performing the +snapshot operation. It defaults to `5`. diff --git a/docs/reference/modules/gateway/local.asciidoc b/docs/reference/modules/gateway/local.asciidoc new file mode 100644 index 0000000..eb4c4f3 --- /dev/null +++ b/docs/reference/modules/gateway/local.asciidoc @@ -0,0 +1,57 @@ +[[modules-gateway-local]] +=== Local Gateway + +The local gateway allows for recovery of the full cluster state and +indices from the local storage of each node, and does not require a +common node level shared storage. + +Note, different from shared gateway types, the persistency to the local +gateway is *not* done in an async manner. Once an operation is +performed, the data is there for the local gateway to recover it in case +of full cluster failure. + +It is important to configure the `gateway.recover_after_nodes` setting +to include most of the expected nodes to be started after a full cluster +restart. This will insure that the latest cluster state is recovered. +For example: + +[source,js] +-------------------------------------------------- +gateway: + recover_after_nodes: 1 + recover_after_time: 5m + expected_nodes: 2 +-------------------------------------------------- + +[float] +==== Dangling indices + +When a node joins the cluster, any shards/indices stored in its local `data/` +directory which do not already exist in the cluster will be imported into the +cluster by default. This functionality has two purposes: + +1. If a new master node is started which is unaware of the other indices in + the cluster, adding the old nodes will cause the old indices to be + imported, instead of being deleted. + +2. An old index can be added to an existing cluster by copying it to the + `data/` directory of a new node, starting the node and letting it join + the cluster. Once the index has been replicated to other nodes in the + cluster, the new node can be shut down and removed. + +The import of dangling indices can be controlled with the +`gateway.local.auto_import_dangled` which accepts: + +[horizontal] +`yes`:: + + Import dangling indices into the cluster (default). 
+ +`close`:: + + Import dangling indices into the cluster state, but leave them closed. + +`no`:: + + Delete dangling indices after `gateway.local.dangling_timeout`, which + defaults to 2 hours. diff --git a/docs/reference/modules/gateway/s3.asciidoc b/docs/reference/modules/gateway/s3.asciidoc new file mode 100644 index 0000000..8f2f5d9 --- /dev/null +++ b/docs/reference/modules/gateway/s3.asciidoc @@ -0,0 +1,51 @@ +[[modules-gateway-s3]] +=== S3 Gateway + +*The S3 gateway is deprecated and will be removed in a future version. +Please use the <<modules-gateway-local,local +gateway>> instead.* + +S3 based gateway allows to do long term reliable async persistency of +the cluster state and indices directly to Amazon S3. Here is how it can +be configured: + +[source,js] +-------------------------------------------------- +cloud: + aws: + access_key: AKVAIQBF2RECL7FJWGJQ + secret_key: vExyMThREXeRMm/b/LRzEB8jWwvzQeXgjqMX+6br + + +gateway: + type: s3 + s3: + bucket: bucket_name +-------------------------------------------------- + +You’ll need to install the `cloud-aws` plugin, by running +`bin/plugin install cloud-aws` before (re)starting elasticsearch. + +The following are a list of settings (prefixed with `gateway.s3`) that +can further control the S3 gateway: + +[cols="<,<",options="header",] +|======================================================================= +|Setting |Description +|`chunk_size` |Big files are broken down into chunks (to overcome AWS 5g +limit and use concurrent snapshotting). Default set to `100m`. +|======================================================================= + +[float] +==== concurrent_streams + +The `gateway.s3.concurrent_streams` allow to throttle the number of +streams (per node) opened against the shared gateway performing the +snapshot operation. It defaults to `5`. + +[float] +==== Region + +The `cloud.aws.region` can be set to a region and will automatically use +the relevant settings for both `ec2` and `s3`. The available values are: +`us-east-1`, `us-west-1`, `ap-southeast-1`, `eu-west-1`. diff --git a/docs/reference/modules/http.asciidoc b/docs/reference/modules/http.asciidoc new file mode 100644 index 0000000..fbc153b --- /dev/null +++ b/docs/reference/modules/http.asciidoc @@ -0,0 +1,51 @@ +[[modules-http]] +== HTTP + +The http module allows to expose *elasticsearch* APIs +over HTTP. + +The http mechanism is completely asynchronous in nature, meaning that +there is no blocking thread waiting for a response. The benefit of using +asynchronous communication for HTTP is solving the +http://en.wikipedia.org/wiki/C10k_problem[C10k problem]. + +When possible, consider using +http://en.wikipedia.org/wiki/Keepalive#HTTP_Keepalive[HTTP keep alive] +when connecting for better performance and try to get your favorite +client not to do +http://en.wikipedia.org/wiki/Chunked_transfer_encoding[HTTP chunking]. + +[float] +=== Settings + +The following are the settings the can be configured for HTTP: + +[cols="<,<",options="header",] +|======================================================================= +|Setting |Description +|`http.port` |A bind port range. Defaults to `9200-9300`. + +|`http.max_content_length` |The max content of an HTTP request. Defaults +to `100mb` + +|`http.max_initial_line_length` |The max length of an HTTP URL. Defaults +to `4kb` + +|`http.compression` |Support for compression when possible (with +Accept-Encoding). Defaults to `false`. + +|`http.compression_level` |Defines the compression level to use. +Defaults to `6`. 
+|======================================================================= + +It also shares the uses the common +<<modules-network,network settings>>. + +[float] +=== Disable HTTP + +The http module can be completely disabled and not started by setting +`http.enabled` to `false`. This make sense when creating non +<<modules-node,data nodes>> which accept HTTP +requests, and communicate with data nodes using the internal +<<modules-transport,transport>>. diff --git a/docs/reference/modules/indices.asciidoc b/docs/reference/modules/indices.asciidoc new file mode 100644 index 0000000..75fd8f5 --- /dev/null +++ b/docs/reference/modules/indices.asciidoc @@ -0,0 +1,76 @@ +[[modules-indices]] +== Indices + +The indices module allow to control settings that are globally managed +for all indices. + +[float] +[[buffer]] +=== Indexing Buffer + +The indexing buffer setting allows to control how much memory will be +allocated for the indexing process. It is a global setting that bubbles +down to all the different shards allocated on a specific node. + +The `indices.memory.index_buffer_size` accepts either a percentage or a +byte size value. It defaults to `10%`, meaning that `10%` of the total +memory allocated to a node will be used as the indexing buffer size. +This amount is then divided between all the different shards. Also, if +percentage is used, allow to set `min_index_buffer_size` (defaults to +`48mb`) and `max_index_buffer_size` which by default is unbounded. + +The `indices.memory.min_shard_index_buffer_size` allows to set a hard +lower limit for the memory allocated per shard for its own indexing +buffer. It defaults to `4mb`. + +[float] +[[indices-ttl]] +=== TTL interval + +You can dynamically set the `indices.ttl.interval` allows to set how +often expired documents will be automatically deleted. The default value +is 60s. + +The deletion orders are processed by bulk. You can set +`indices.ttl.bulk_size` to fit your needs. The default value is 10000. + +See also <<mapping-ttl-field>>. + +[float] +[[recovery]] +=== Recovery + +The following settings can be set to manage recovery policy: + +[horizontal] +`indices.recovery.concurrent_streams`:: + defaults to `3`. + +`indices.recovery.file_chunk_size`:: + defaults to `512kb`. + +`indices.recovery.translog_ops`:: + defaults to `1000`. + +`indices.recovery.translog_size`:: + defaults to `512kb`. + +`indices.recovery.compress`:: + defaults to `true`. + +`indices.recovery.max_bytes_per_sec`:: + defaults to `20mb`. + +[float] +[[throttling]] +=== Store level throttling + +The following settings can be set to control store throttling: + +[horizontal] +`indices.store.throttle.type`:: + could be `merge` (default), `not` or `all`. See <<index-modules-store>>. + +`indices.store.throttle.max_bytes_per_sec`:: + defaults to `20mb`. + diff --git a/docs/reference/modules/memcached.asciidoc b/docs/reference/modules/memcached.asciidoc new file mode 100644 index 0000000..20276d0 --- /dev/null +++ b/docs/reference/modules/memcached.asciidoc @@ -0,0 +1,69 @@ +[[modules-memcached]] +== memcached + +The memcached module allows to expose *elasticsearch* +APIs over the memcached protocol (as closely +as possible). + +It is provided as a plugin called `transport-memcached` and installing +is explained +https://github.com/elasticsearch/elasticsearch-transport-memcached[here] +. Another option is to download the memcached plugin and placing it +under the `plugins` directory. 
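For instance, installing it with the `plugin` script follows the usual `<org>/<user/component>/<version>` form described in <<modules-plugins,Plugins>>; the version below is a placeholder, so pick the release that matches your Elasticsearch version:

[source,shell]
-----------------------------------
# substitute the transport-memcached release that matches your Elasticsearch version
bin/plugin --install elasticsearch/elasticsearch-transport-memcached/<version>
-----------------------------------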
+ +The memcached protocol supports both the binary and the text protocol, +automatically detecting the correct one to use. + +[float] +=== Mapping REST to Memcached Protocol + +Memcached commands are mapped to REST and handled by the same generic +REST layer in elasticsearch. Here is a list of the memcached commands +supported: + +[float] +==== GET + +The memcached `GET` command maps to a REST `GET`. The key used is the +URI (with parameters). The main downside is the fact that the memcached +`GET` does not allow body in the request (and `SET` does not allow to +return a result...). For this reason, most REST APIs (like search) allow +to accept the "source" as a URI parameter as well. + +[float] +==== SET + +The memcached `SET` command maps to a REST `POST`. The key used is the +URI (with parameters), and the body maps to the REST body. + +[float] +==== DELETE + +The memcached `DELETE` command maps to a REST `DELETE`. The key used is +the URI (with parameters). + +[float] +==== QUIT + +The memcached `QUIT` command is supported and disconnects the client. + +[float] +=== Settings + +The following are the settings the can be configured for memcached: + +[cols="<,<",options="header",] +|=============================================================== +|Setting |Description +|`memcached.port` |A bind port range. Defaults to `11211-11311`. +|=============================================================== + +It also shares the uses the common +<<modules-network,network settings>>. + +[float] +=== Disable memcached + +The memcached module can be completely disabled and not started using by +setting `memcached.enabled` to `false`. By default it is enabled once it +is detected as a plugin. diff --git a/docs/reference/modules/network.asciidoc b/docs/reference/modules/network.asciidoc new file mode 100644 index 0000000..835b6e4 --- /dev/null +++ b/docs/reference/modules/network.asciidoc @@ -0,0 +1,89 @@ +[[modules-network]] +== Network Settings + +There are several modules within a Node that use network based +configuration, for example, the +<<modules-transport,transport>> and +<<modules-http,http>> modules. Node level +network settings allows to set common settings that will be shared among +all network based modules (unless explicitly overridden in each module). + +The `network.bind_host` setting allows to control the host different +network components will bind on. By default, the bind host will be +`anyLocalAddress` (typically `0.0.0.0` or `::0`). + +The `network.publish_host` setting allows to control the host the node +will publish itself within the cluster so other nodes will be able to +connect to it. Of course, this can't be the `anyLocalAddress`, and by +default, it will be the first non loopback address (if possible), or the +local address. + +The `network.host` setting is a simple setting to automatically set both +`network.bind_host` and `network.publish_host` to the same host value. + +Both settings allows to be configured with either explicit host address +or host name. The settings also accept logical setting values explained +in the following table: + +[cols="<,<",options="header",] +|======================================================================= +|Logical Host Setting Value |Description +|`_local_` |Will be resolved to the local ip address. + +|`_non_loopback_` |The first non loopback address. + +|`_non_loopback:ipv4_` |The first non loopback IPv4 address. + +|`_non_loopback:ipv6_` |The first non loopback IPv6 address. 
+ +|`_[networkInterface]_` |Resolves to the ip address of the provided +network interface. For example `_en0_`. + +|`_[networkInterface]:ipv4_` |Resolves to the ipv4 address of the +provided network interface. For example `_en0:ipv4_`. + +|`_[networkInterface]:ipv6_` |Resolves to the ipv6 address of the +provided network interface. For example `_en0:ipv6_`. +|======================================================================= + +When the `cloud-aws` plugin is installed, the following are also allowed +as valid network host settings: + +[cols="<,<",options="header",] +|================================================================== +|EC2 Host Value |Description +|`_ec2:privateIpv4_` |The private IP address (ipv4) of the machine. +|`_ec2:privateDns_` |The private host of the machine. +|`_ec2:publicIpv4_` |The public IP address (ipv4) of the machine. +|`_ec2:publicDns_` |The public host of the machine. +|`_ec2_` |Less verbose option for the private ip address. +|`_ec2:privateIp_` |Less verbose option for the private ip address. +|`_ec2:publicIp_` |Less verbose option for the public ip address. +|================================================================== + +[float] +[[tcp-settings]] +=== TCP Settings + +Any component that uses TCP (like the HTTP, Transport and Memcached) +share the following allowed settings: + +[cols="<,<",options="header",] +|======================================================================= +|Setting |Description +|`network.tcp.no_delay` |Enable or disable tcp no delay setting. +Defaults to `true`. + +|`network.tcp.keep_alive` |Enable or disable tcp keep alive. By default +not explicitly set. + +|`network.tcp.reuse_address` |Should an address be reused or not. +Defaults to `true` on none windows machines. + +|`network.tcp.send_buffer_size` |The size of the tcp send buffer size +(in size setting format). By default not explicitly set. + +|`network.tcp.receive_buffer_size` |The size of the tcp receive buffer +size (in size setting format). By default not explicitly set. +|======================================================================= + diff --git a/docs/reference/modules/node.asciidoc b/docs/reference/modules/node.asciidoc new file mode 100644 index 0000000..5e40522 --- /dev/null +++ b/docs/reference/modules/node.asciidoc @@ -0,0 +1,32 @@ +[[modules-node]] +== Node + +*elasticsearch* allows to configure a node to either be allowed to store +data locally or not. Storing data locally basically means that shards of +different indices are allowed to be allocated on that node. By default, +each node is considered to be a data node, and it can be turned off by +setting `node.data` to `false`. + +This is a powerful setting allowing to simply create smart load +balancers that take part in some of different API processing. Lets take +an example: + +We can start a whole cluster of data nodes which do not even start an +HTTP transport by setting `http.enabled` to `false`. Such nodes will +communicate with one another using the +<<modules-transport,transport>> module. In front +of the cluster we can start one or more "non data" nodes which will +start with HTTP enabled. All HTTP communication will be performed +through these "non data" nodes. + +The benefit of using that is first the ability to create smart load +balancers. These "non data" nodes are still part of the cluster, and +they redirect operations exactly to the node that holds the relevant +data. 
The other benefit is the fact that for scatter / gather based +operations (such as search), these nodes will take part of the +processing since they will start the scatter process, and perform the +actual gather processing. + +This relieves the data nodes to do the heavy duty of indexing and +searching, without needing to process HTTP requests (parsing), overload +the network, or perform the gather processing. diff --git a/docs/reference/modules/plugins.asciidoc b/docs/reference/modules/plugins.asciidoc new file mode 100644 index 0000000..6c39d72 --- /dev/null +++ b/docs/reference/modules/plugins.asciidoc @@ -0,0 +1,283 @@ +[[modules-plugins]] +== Plugins + +[float] +=== Plugins + +Plugins are a way to enhance the basic elasticsearch functionality in a +custom manner. They range from adding custom mapping types, custom +analyzers (in a more built in fashion), native scripts, custom discovery +and more. + +[float] +[[installing]] +==== Installing plugins + +Installing plugins can either be done manually by placing them under the +`plugins` directory, or using the `plugin` script. Several plugins can +be found under the https://github.com/elasticsearch[elasticsearch] +organization in GitHub, starting with `elasticsearch-`. + +Installing plugins typically take the following form: + +[source,shell] +----------------------------------- +plugin --install <org>/<user/component>/<version> +----------------------------------- + +The plugins will be +automatically downloaded in this case from `download.elasticsearch.org`, +and in case they don't exist there, from maven (central and sonatype). + +Note that when the plugin is located in maven central or sonatype +repository, `<org>` is the artifact `groupId` and `<user/component>` is +the `artifactId`. + +A plugin can also be installed directly by specifying the URL for it, +for example: + +[source,shell] +----------------------------------- +bin/plugin --url file:///path/to/plugin --install plugin-name +----------------------------------- + + +You can run `bin/plugin -h`. + +[float] +[[site-plugins]] +==== Site Plugins + +Plugins can have "sites" in them, any plugin that exists under the +`plugins` directory with a `_site` directory, its content will be +statically served when hitting `/_plugin/[plugin_name]/` url. Those can +be added even after the process has started. + +Installed plugins that do not contain any java related content, will +automatically be detected as site plugins, and their content will be +moved under `_site`. + +The ability to install plugins from Github allows to easily install site +plugins hosted there by downloading the actual repo, for example, +running: + +[source,js] +-------------------------------------------------- +bin/plugin --install mobz/elasticsearch-head +bin/plugin --install lukas-vlcek/bigdesk +-------------------------------------------------- + +Will install both of those site plugins, with `elasticsearch-head` +available under `http://localhost:9200/_plugin/head/` and `bigdesk` +available under `http://localhost:9200/_plugin/bigdesk/`. + +[float] +==== Mandatory Plugins + +If you rely on some plugins, you can define mandatory plugins using the +`plugin.mandatory` attribute, for example, here is a sample config: + +[source,js] +-------------------------------------------------- +plugin.mandatory: mapper-attachments,lang-groovy +-------------------------------------------------- + +For safety reasons, if a mandatory plugin is not installed, the node +will not start. 
+ +[float] +==== Installed Plugins + +A list of the currently loaded plugins can be retrieved using the +<<cluster-nodes-info,Node Info API>>. + +[float] +==== Removing plugins + +Removing plugins can either be done manually by removing them under the +`plugins` directory, or using the `plugin` script. + +Removing plugins typically take the following form: + +[source,shell] +----------------------------------- +plugin --remove <pluginname> +----------------------------------- + +[float] +==== Silent/Verbose mode + +When running the `plugin` script, you can get more information (debug mode) using `--verbose`. +On the opposite, if you want `plugin` script to be silent, use `--silent` option. + +Note that exit codes could be: + +* `0`: everything was OK +* `64`: unknown command or incorrect option parameter +* `74`: IO error +* `70`: other errors + +[source,shell] +----------------------------------- +bin/plugin --install mobz/elasticsearch-head --verbose +plugin --remove head --silent +----------------------------------- + +[float] +==== Timeout settings + +By default, the `plugin` script will wait indefinitely when downloading before failing. +The timeout parameter can be used to explicitly specify how long it waits. Here is some examples of setting it to +different values: + +[source,shell] +----------------------------------- +# Wait for 30 seconds before failing +bin/plugin --install mobz/elasticsearch-head --timeout 30s + +# Wait for 1 minute before failing +bin/plugin --install mobz/elasticsearch-head --timeout 1m + +# Wait forever (default) +bin/plugin --install mobz/elasticsearch-head --timeout 0 +----------------------------------- + +[float] +[[known-plugins]] +=== Known Plugins + +[float] +[[analysis-plugins]] +==== Analysis Plugins + +.Supported by Elasticsearch +* https://github.com/elasticsearch/elasticsearch-analysis-icu[ICU Analysis plugin] +* https://github.com/elasticsearch/elasticsearch-analysis-kuromoji[Japanese (Kuromoji) Analysis plugin]. +* https://github.com/elasticsearch/elasticsearch-analysis-smartcn[Smart Chinese Analysis Plugin] +* https://github.com/elasticsearch/elasticsearch-analysis-stempel[Stempel (Polish) Analysis plugin] + +.Supported by the community +* https://github.com/barminator/elasticsearch-analysis-annotation[Annotation Analysis Plugin] (by Michal Samek) +* https://github.com/yakaz/elasticsearch-analysis-combo/[Combo Analysis Plugin] (by Olivier Favre, Yakaz) +* https://github.com/jprante/elasticsearch-analysis-hunspell[Hunspell Analysis Plugin] (by Jörg Prante) +* https://github.com/medcl/elasticsearch-analysis-ik[IK Analysis Plugin] (by Medcl) +* https://github.com/suguru/elasticsearch-analysis-japanese[Japanese Analysis plugin] (by suguru). 
+* https://github.com/medcl/elasticsearch-analysis-mmseg[Mmseg Analysis Plugin] (by Medcl) +* https://github.com/chytreg/elasticsearch-analysis-morfologik[Morfologik (Polish) Analysis plugin] (by chytreg) +* https://github.com/imotov/elasticsearch-analysis-morphology[Russian and English Morphological Analysis Plugin] (by Igor Motov) +* https://github.com/medcl/elasticsearch-analysis-pinyin[Pinyin Analysis Plugin] (by Medcl) +* https://github.com/medcl/elasticsearch-analysis-string2int[String2Integer Analysis Plugin] (by Medcl) + +[float] +[[discovery-plugins]] +==== Discovery Plugins + +.Supported by Elasticsearch +* https://github.com/elasticsearch/elasticsearch-cloud-aws[AWS Cloud Plugin] - EC2 discovery and S3 Repository +* https://github.com/elasticsearch/elasticsearch-cloud-azure[Azure Cloud Plugin] - Azure discovery +* https://github.com/elasticsearch/elasticsearch-cloud-gce[Google Compute Engine Cloud Plugin] - GCE discovery + +[float] +[[river]] +==== River Plugins + +.Supported by Elasticsearch +* https://github.com/elasticsearch/elasticsearch-river-couchdb[CouchDB River Plugin] +* https://github.com/elasticsearch/elasticsearch-river-rabbitmq[RabbitMQ River Plugin] +* https://github.com/elasticsearch/elasticsearch-river-twitter[Twitter River Plugin] +* https://github.com/elasticsearch/elasticsearch-river-wikipedia[Wikipedia River Plugin] + +.Supported by the community +* https://github.com/domdorn/elasticsearch-river-activemq/[ActiveMQ River Plugin] (by Dominik Dorn) +* https://github.com/albogdano/elasticsearch-river-amazonsqs[Amazon SQS River Plugin] (by Alex Bogdanovski) +* https://github.com/xxBedy/elasticsearch-river-csv[CSV River Plugin] (by Martin Bednar) +* http://www.pilato.fr/dropbox/[Dropbox River Plugin] (by David Pilato) +* http://www.pilato.fr/fsriver/[FileSystem River Plugin] (by David Pilato) +* https://github.com/obazoud/elasticsearch-river-git[Git River Plugin] (by Olivier Bazoud) +* https://github.com/uberVU/elasticsearch-river-github[GitHub River Plugin] (by uberVU) +* https://github.com/sksamuel/elasticsearch-river-hazelcast[Hazelcast River Plugin] (by Steve Samuel) +* https://github.com/jprante/elasticsearch-river-jdbc[JDBC River Plugin] (by Jörg Prante) +* https://github.com/qotho/elasticsearch-river-jms[JMS River Plugin] (by Steve Sarandos) +* https://github.com/endgameinc/elasticsearch-river-kafka[Kafka River Plugin] (by Endgame Inc.) 
+* https://github.com/tlrx/elasticsearch-river-ldap[LDAP River Plugin] (by Tanguy Leroux) +* https://github.com/richardwilly98/elasticsearch-river-mongodb/[MongoDB River Plugin] (by Richard Louapre) +* https://github.com/sksamuel/elasticsearch-river-neo4j[Neo4j River Plugin] (by Steve Samuel) +* https://github.com/jprante/elasticsearch-river-oai/[Open Archives Initiative (OAI) River Plugin] (by Jörg Prante) +* https://github.com/sksamuel/elasticsearch-river-redis[Redis River Plugin] (by Steve Samuel) +* http://dadoonet.github.com/rssriver/[RSS River Plugin] (by David Pilato) +* https://github.com/adamlofts/elasticsearch-river-sofa[Sofa River Plugin] (by adamlofts) +* https://github.com/javanna/elasticsearch-river-solr/[Solr River Plugin] (by Luca Cavanna) +* https://github.com/sunnygleason/elasticsearch-river-st9[St9 River Plugin] (by Sunny Gleason) +* https://github.com/plombard/SubversionRiver[Subversion River Plugin] (by Pascal Lombard) +* https://github.com/kzwang/elasticsearch-river-dynamodb[DynamoDB River Plugin] (by Kevin Wang) + +[float] +[[transport]] +==== Transport Plugins + +.Supported by Elasticsearch +* https://github.com/elasticsearch/elasticsearch-transport-memcached[Memcached transport plugin] +* https://github.com/elasticsearch/elasticsearch-transport-thrift[Thrift Transport] +* https://github.com/elasticsearch/elasticsearch-transport-wares[Servlet transport] + +.Supported by the community +* https://github.com/tlrx/transport-zeromq[ZeroMQ transport layer plugin] (by Tanguy Leroux) +* https://github.com/sonian/elasticsearch-jetty[Jetty HTTP transport plugin] (by Sonian Inc.) +* https://github.com/kzwang/elasticsearch-transport-redis[Redis transport plugin] (by Kevin Wang) + +[float] +[[scripting]] +==== Scripting Plugins + +.Supported by Elasticsearch +* https://github.com/hiredman/elasticsearch-lang-clojure[Clojure Language Plugin] (by Kevin Downey) +* https://github.com/elasticsearch/elasticsearch-lang-groovy[Groovy lang Plugin] +* https://github.com/elasticsearch/elasticsearch-lang-javascript[JavaScript language Plugin] +* https://github.com/elasticsearch/elasticsearch-lang-python[Python language Plugin] + +[float] +[[site]] +==== Site Plugins + +.Supported by the community +* https://github.com/lukas-vlcek/bigdesk[BigDesk Plugin] (by Lukáš Vlček) +* https://github.com/mobz/elasticsearch-head[Elasticsearch Head Plugin] (by Ben Birch) +* https://github.com/royrusso/elasticsearch-HQ[Elasticsearch HQ] (by Roy Russo) +* https://github.com/andrewvc/elastic-hammer[Hammer Plugin] (by Andrew Cholakian) +* https://github.com/polyfractal/elasticsearch-inquisitor[Inquisitor Plugin] (by Zachary Tong) +* https://github.com/karmi/elasticsearch-paramedic[Paramedic Plugin] (by Karel Minařík) +* https://github.com/polyfractal/elasticsearch-segmentspy[SegmentSpy Plugin] (by Zachary Tong) + +[float] +[[repository-plugins]] +==== Snapshot/Restore Repository Plugins + +.Supported by Elasticsearch + +* https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs[Hadoop HDFS] Repository +* https://github.com/elasticsearch/elasticsearch-cloud-aws#s3-repository[AWS S3] Repository + +.Supported by the community + +* https://github.com/kzwang/elasticsearch-repository-gridfs[GridFS] Repository (by Kevin Wang) + +[float] +[[misc]] +==== Misc Plugins + +.Supported by Elasticsearch +* https://github.com/elasticsearch/elasticsearch-mapper-attachments[Mapper Attachments Type plugin] + +.Supported by the community +* https://github.com/carrot2/elasticsearch-carrot2[carrot2 Plugin]: 
Results clustering with carrot2 (by Dawid Weiss) +* https://github.com/derryx/elasticsearch-changes-plugin[Elasticsearch Changes Plugin] (by Thomas Peuss) +* https://github.com/johtani/elasticsearch-extended-analyze[Extended Analyze Plugin] (by Jun Ohtani) +* https://github.com/spinscale/elasticsearch-graphite-plugin[Elasticsearch Graphite Plugin] (by Alexander Reelsen) +* https://github.com/mattweber/elasticsearch-mocksolrplugin[Elasticsearch Mock Solr Plugin] (by Matt Weber) +* https://github.com/viniciusccarvalho/elasticsearch-newrelic[Elasticsearch New Relic Plugin] (by Vinicius Carvalho) +* https://github.com/swoop-inc/elasticsearch-statsd-plugin[Elasticsearch Statsd Plugin] (by Swoop Inc.) +* https://github.com/endgameinc/elasticsearch-term-plugin[Terms Component Plugin] (by Endgame Inc.) +* http://tlrx.github.com/elasticsearch-view-plugin[Elasticsearch View Plugin] (by Tanguy Leroux) +* https://github.com/sonian/elasticsearch-zookeeper[ZooKeeper Discovery Plugin] (by Sonian Inc.) + + diff --git a/docs/reference/modules/scripting.asciidoc b/docs/reference/modules/scripting.asciidoc new file mode 100644 index 0000000..166a030 --- /dev/null +++ b/docs/reference/modules/scripting.asciidoc @@ -0,0 +1,316 @@ +[[modules-scripting]] +== Scripting + +The scripting module allows to use scripts in order to evaluate custom +expressions. For example, scripts can be used to return "script fields" +as part of a search request, or can be used to evaluate a custom score +for a query and so on. + +The scripting module uses by default http://mvel.codehaus.org/[mvel] as +the scripting language with some extensions. mvel is used since it is +extremely fast and very simple to use, and in most cases, simple +expressions are needed (for example, mathematical equations). + +Additional `lang` plugins are provided to allow to execute scripts in +different languages. Currently supported plugins are `lang-javascript` +for JavaScript, `lang-groovy` for Groovy, and `lang-python` for Python. +All places where a `script` parameter can be used, a `lang` parameter +(on the same level) can be provided to define the language of the +script. The `lang` options are `mvel`, `js`, `groovy`, `python`, and +`native`. + +[float] +=== Default Scripting Language + +The default scripting language (assuming no `lang` parameter is +provided) is `mvel`. In order to change it set the `script.default_lang` +to the appropriate language. + +[float] +=== Preloaded Scripts + +Scripts can always be provided as part of the relevant API, but they can +also be preloaded by placing them under `config/scripts` and then +referencing them by the script name (instead of providing the full +script). This helps reduce the amount of data passed between the client +and the nodes. + +The name of the script is derived from the hierarchy of directories it +exists under, and the file name without the lang extension. For example, +a script placed under `config/scripts/group1/group2/test.py` will be +named `group1_group2_test`. + +[float] +=== Disabling dynamic scripts + +We recommend running Elasticsearch behind an application or proxy, +which protects Elasticsearch from the outside world. If users are +allowed to run dynamic scripts (even in a search request), then they +have the same access to your box as the user that Elasticsearch is +running as. + +First, you should not run Elasticsearch as the `root` user, as this +would allow a script to access or do *anything* on your server, without +limitations. 
Second, you should not expose Elasticsearch directly to
+users, but instead have a proxy application in between. If you *do*
+intend to expose Elasticsearch directly to your users, then you have
+to decide whether you trust them enough to run scripts on your box or
+not. If not, then even if you have a proxy which only allows `GET`
+requests, you should disable dynamic scripting by adding the following
+setting to the `config/elasticsearch.yml` file on every node:
+
+[source,yaml]
+-----------------------------------
+script.disable_dynamic: true
+-----------------------------------
+
+This will still allow execution of named scripts provided in the config, or
+_native_ Java scripts registered through plugins; however, it will prevent
+users from running arbitrary scripts via the API.
+
+[float]
+=== Automatic Script Reloading
+
+The `config/scripts` directory is scanned periodically for changes.
+New and changed scripts are reloaded and deleted scripts are removed
+from the preloaded scripts cache. The reload frequency can be specified
+using the `watcher.interval` setting, which defaults to `60s`.
+To disable script reloading completely, set `script.auto_reload_enabled`
+to `false`.
+
+[float]
+=== Native (Java) Scripts
+
+Even though `mvel` is pretty fast, you can register native Java-based
+scripts for faster execution.
+
+To provide a native script, implement a `NativeScriptFactory` that
+constructs the script to be executed. There are
+two main types, one that extends `AbstractExecutableScript` and one that
+extends `AbstractSearchScript` (probably the one most users will extend,
+with additional helper classes in `AbstractLongSearchScript`,
+`AbstractDoubleSearchScript`, and `AbstractFloatSearchScript`).
+
+Scripts can be registered either through settings, for example, setting
+`script.native.my.type` to `sample.MyNativeScriptFactory` will
+register a script named `my`, or from a plugin, by accessing the
+`ScriptModule` and calling `registerScript` on it.
+
+The script is executed by specifying the `lang` as `native`, and
+the name of the script as the `script`.
+
+Note that the scripts need to be on the Elasticsearch classpath. One
+simple way to do this is to create a directory under `plugins` (choose a
+descriptive name) and place the jar / class files there; they will be
+loaded automatically.
+
+[float]
+=== Score
+
+All scripts that can be used in facets can access the current
+document's score using `doc.score`.
+
+[float]
+=== Computing scores based on terms in scripts
+
+See <<modules-advanced-scripting, advanced scripting documentation>>.
+
+[float]
+=== Document Fields
+
+Most scripting revolves around the use of specific document field data.
+The `doc['field_name']` syntax can be used to access specific field data within
+a document (the document in question is usually derived from the context in
+which the script is used). Document fields are very fast to access since they
+end up being loaded into memory (all the relevant field values/tokens
+are loaded into memory).
+
+The following data can be extracted from a field:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Expression |Description
+|`doc['field_name'].value` |The native value of the field. For example,
+if it's a short type, it will be a short.
+
+|`doc['field_name'].values` |The native array values of the field. For
+example, if it's a short type, it will be a short[]. Remember, a field can
+have several values within a single doc.
Returns an empty array if the +field has no values. + +|`doc['field_name'].empty` |A boolean indicating if the field has no +values within the doc. + +|`doc['field_name'].multiValued` |A boolean indicating that the field +has several values within the corpus. + +|`doc['field_name'].lat` |The latitude of a geo point type. + +|`doc['field_name'].lon` |The longitude of a geo point type. + +|`doc['field_name'].lats` |The latitudes of a geo point type. + +|`doc['field_name'].lons` |The longitudes of a geo point type. + +|`doc['field_name'].distance(lat, lon)` |The `plane` distance (in meters) +of this geo point field from the provided lat/lon. + +|`doc['field_name'].distanceWithDefault(lat, lon, default)` |The `plane` distance (in meters) +of this geo point field from the provided lat/lon with a default value. + +|`doc['field_name'].distanceInMiles(lat, lon)` |The `plane` distance (in +miles) of this geo point field from the provided lat/lon. + +|`doc['field_name'].distanceInMilesWithDefault(lat, lon, default)` |The `plane` distance (in +miles) of this geo point field from the provided lat/lon with a default value. + +|`doc['field_name'].distanceInKm(lat, lon)` |The `plane` distance (in +km) of this geo point field from the provided lat/lon. + +|`doc['field_name'].distanceInKmWithDefault(lat, lon, default)` |The `plane` distance (in +km) of this geo point field from the provided lat/lon with a default value. + +|`doc['field_name'].arcDistance(lat, lon)` |The `arc` distance (in +meters) of this geo point field from the provided lat/lon. + +|`doc['field_name'].arcDistanceWithDefault(lat, lon, default)` |The `arc` distance (in +meters) of this geo point field from the provided lat/lon with a default value. + +|`doc['field_name'].arcDistanceInMiles(lat, lon)` |The `arc` distance (in +miles) of this geo point field from the provided lat/lon. + +|`doc['field_name'].arcDistanceInMilesWithDefault(lat, lon, default)` |The `arc` distance (in +miles) of this geo point field from the provided lat/lon with a default value. + +|`doc['field_name'].arcDistanceInKm(lat, lon)` |The `arc` distance (in +km) of this geo point field from the provided lat/lon. + +|`doc['field_name'].arcDistanceInKmWithDefault(lat, lon, default)` |The `arc` distance (in +km) of this geo point field from the provided lat/lon with a default value. + +|`doc['field_name'].factorDistance(lat, lon)` |The distance factor of this geo point field from the provided lat/lon. + +|`doc['field_name'].factorDistance(lat, lon, default)` |The distance factor of this geo point field from the provided lat/lon with a default value. + + +|======================================================================= + +[float] +=== Stored Fields + +Stored fields can also be accessed when executed a script. Note, they +are much slower to access compared with document fields, but are not +loaded into memory. They can be simply accessed using +`_fields['my_field_name'].value` or `_fields['my_field_name'].values`. + +[float] +=== Source Field + +The source field can also be accessed when executing a script. The +source field is loaded per doc, parsed, and then provided to the script +for evaluation. The `_source` forms the context under which the source +field can be accessed, for example `_source.obj2.obj1.field3`. + +Accessing `_source` is much slower compared to using `_doc` +but the data is not loaded into memory. For a single field access `_fields` may be +faster than using `_source` due to the extra overhead of potentially parsing large documents. 
+However, `_source` may be faster if you access multiple fields or if the source has already been +loaded for other purposes. + + +[float] +=== mvel Built In Functions + +There are several built in functions that can be used within scripts. +They include: + +[cols="<,<",options="header",] +|======================================================================= +|Function |Description +|`time()` |The current time in milliseconds. + +|`sin(a)` |Returns the trigonometric sine of an angle. + +|`cos(a)` |Returns the trigonometric cosine of an angle. + +|`tan(a)` |Returns the trigonometric tangent of an angle. + +|`asin(a)` |Returns the arc sine of a value. + +|`acos(a)` |Returns the arc cosine of a value. + +|`atan(a)` |Returns the arc tangent of a value. + +|`toRadians(angdeg)` |Converts an angle measured in degrees to an +approximately equivalent angle measured in radians + +|`toDegrees(angrad)` |Converts an angle measured in radians to an +approximately equivalent angle measured in degrees. + +|`exp(a)` |Returns Euler's number _e_ raised to the power of value. + +|`log(a)` |Returns the natural logarithm (base _e_) of a value. + +|`log10(a)` |Returns the base 10 logarithm of a value. + +|`sqrt(a)` |Returns the correctly rounded positive square root of a +value. + +|`cbrt(a)` |Returns the cube root of a double value. + +|`IEEEremainder(f1, f2)` |Computes the remainder operation on two +arguments as prescribed by the IEEE 754 standard. + +|`ceil(a)` |Returns the smallest (closest to negative infinity) value +that is greater than or equal to the argument and is equal to a +mathematical integer. + +|`floor(a)` |Returns the largest (closest to positive infinity) value +that is less than or equal to the argument and is equal to a +mathematical integer. + +|`rint(a)` |Returns the value that is closest in value to the argument +and is equal to a mathematical integer. + +|`atan2(y, x)` |Returns the angle _theta_ from the conversion of +rectangular coordinates (_x_, _y_) to polar coordinates (r,_theta_). + +|`pow(a, b)` |Returns the value of the first argument raised to the +power of the second argument. + +|`round(a)` |Returns the closest _int_ to the argument. + +|`random()` |Returns a random _double_ value. + +|`abs(a)` |Returns the absolute value of a value. + +|`max(a, b)` |Returns the greater of two values. + +|`min(a, b)` |Returns the smaller of two values. + +|`ulp(d)` |Returns the size of an ulp of the argument. + +|`signum(d)` |Returns the signum function of the argument. + +|`sinh(x)` |Returns the hyperbolic sine of a value. + +|`cosh(x)` |Returns the hyperbolic cosine of a value. + +|`tanh(x)` |Returns the hyperbolic tangent of a value. + +|`hypot(x, y)` |Returns sqrt(_x2_ + _y2_) without intermediate overflow +or underflow. +|======================================================================= + +[float] +=== Arithmetic precision in MVEL + +When dividing two numbers using MVEL based scripts, the engine tries to +be smart and adheres to the default behaviour of java. This means if you +divide two integers (you might have configured the fields as integer in +the mapping), the result will also be an integer. This means, if a +calculation like `1/num` is happening in your scripts and `num` is an +integer with the value of `8`, the result is `0` even though you were +expecting it to be `0.125`. You may need to enforce precision by +explicitly using a double like `1.0/num` in order to get the expected +result. 
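+
+As a minimal sketch of the point above (it assumes an integer field named
+`num`, which is purely illustrative, holding the value `8`), the following
+script shows the difference between the two divisions:
+
+[source,mvel]
+---------------------------------------------------------
+num = doc['num'].value;    // integer value, e.g. 8
+intResult = 1 / num;       // integer division, yields 0
+doubleResult = 1.0 / num;  // forces floating point arithmetic, yields 0.125
+return doubleResult;
+---------------------------------------------------------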
diff --git a/docs/reference/modules/snapshots.asciidoc b/docs/reference/modules/snapshots.asciidoc
new file mode 100644
index 0000000..309b5f8
--- /dev/null
+++ b/docs/reference/modules/snapshots.asciidoc
@@ -0,0 +1,207 @@
+[[modules-snapshots]]
+== Snapshot And Restore
+
+The snapshot and restore module allows you to create snapshots of individual indices or of an entire cluster in a remote
+repository. At the time of the initial release, only the shared file system repository is supported.
+
+[float]
+=== Repositories
+
+Before any snapshot or restore operation can be performed, a snapshot repository should be registered in
+Elasticsearch. The following command registers a shared file system repository with the name `my_backup` that
+will use the location `/mount/backups/my_backup` to store snapshots.
+
+[source,js]
+-----------------------------------
+$ curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
+    "type": "fs",
+    "settings": {
+        "location": "/mount/backups/my_backup",
+        "compress": true
+    }
+}'
+-----------------------------------
+
+Once the repository is registered, its information can be obtained using the following command:
+
+[source,js]
+-----------------------------------
+$ curl -XGET 'http://localhost:9200/_snapshot/my_backup?pretty'
+-----------------------------------
+[source,js]
+-----------------------------------
+{
+  "my_backup" : {
+    "type" : "fs",
+    "settings" : {
+      "compress" : "true",
+      "location" : "/mount/backups/my_backup"
+    }
+  }
+}
+-----------------------------------
+
+If a repository name is not specified, or `_all` is used as the repository name, Elasticsearch will return information about
+all repositories currently registered in the cluster:
+
+[source,js]
+-----------------------------------
+$ curl -XGET 'http://localhost:9200/_snapshot'
+-----------------------------------
+
+or
+
+[source,js]
+-----------------------------------
+$ curl -XGET 'http://localhost:9200/_snapshot/_all'
+-----------------------------------
+
+[float]
+===== Shared File System Repository
+
+The shared file system repository (`"type": "fs"`) uses the shared file system to store snapshots. The path
+specified in the `location` parameter should point to the same location in the shared filesystem and be accessible
+on all data and master nodes. The following settings are supported:
+
+[horizontal]
+`location`:: Location of the snapshots. Mandatory.
+`compress`:: Turns on compression of the snapshot files. Defaults to `true`.
+`concurrent_streams`:: Throttles the number of streams (per node) performing the snapshot operation. Defaults to `5`.
+`chunk_size`:: Big files can be broken down into chunks during snapshotting if needed. The chunk size can be specified in bytes or by
+    using size value notation, e.g. `1g`, `10m`, `5k`. Defaults to `null` (unlimited chunk size).
+`max_restore_bytes_per_sec`:: Throttles the per-node restore rate. Defaults to `20mb` per second.
+`max_snapshot_bytes_per_sec`:: Throttles the per-node snapshot rate. Defaults to `20mb` per second.
+
+
+[float]
+===== Read-only URL Repository
+
+The URL repository (`"type": "url"`) can be used as an alternative, read-only way to access data created by the shared file
+system repository. The URL specified in the `url` parameter should
+point to the root of the shared filesystem repository. The following settings are supported:
+
+[horizontal]
+`url`:: Location of the snapshots. Mandatory.
+`concurrent_streams`:: Throttles the number of streams (per node) performing the snapshot operation.
Defaults to `5`.
+
+
+[float]
+===== Repository plugins
+
+Other repository backends are available in these official plugins:
+
+* https://github.com/elasticsearch/elasticsearch-cloud-aws#s3-repository[AWS Cloud Plugin] for S3 repositories
+* https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs[HDFS Plugin] for Hadoop environments
+* https://github.com/elasticsearch/elasticsearch-cloud-azure#azure-repository[Azure Cloud Plugin] for Azure storage repositories
+
+[float]
+=== Snapshot
+
+A repository can contain multiple snapshots of the same cluster. Snapshots are identified by unique names within the
+cluster. A snapshot with the name `snapshot_1` in the repository `my_backup` can be created by executing the following
+command:
+
+[source,js]
+-----------------------------------
+$ curl -XPUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
+-----------------------------------
+
+The `wait_for_completion` parameter specifies whether the request should return immediately or wait for snapshot
+completion. By default, a snapshot of all open and started indices in the cluster is created. This behavior can be changed
+by specifying the list of indices in the body of the snapshot request.
+
+[source,js]
+-----------------------------------
+$ curl -XPUT "localhost:9200/_snapshot/my_backup/snapshot_1" -d '{
+    "indices": "index_1,index_2",
+    "ignore_unavailable": "true",
+    "include_global_state": false
+}'
+-----------------------------------
+
+The list of indices that should be included in the snapshot can be specified using the `indices` parameter, which
+supports <<search-multi-index-type,multi index syntax>>. The snapshot request also supports the
+`ignore_unavailable` option. Setting it to `true` will cause indices that do not exist to be ignored during snapshot
+creation. By default, when the `ignore_unavailable` option is not set and an index is missing, the snapshot request will fail.
+By setting `include_global_state` to `false` it's possible to prevent the cluster global state from being stored as part of
+the snapshot. By default, the entire snapshot will fail if one or more indices participating in the snapshot don't have
+all primary shards available. This behavior can be changed by setting `partial` to `true`.
+
+The index snapshot process is incremental. In the process of making the index snapshot, Elasticsearch analyses
+the list of index files that are already stored in the repository and copies only files that were created or
+changed since the last snapshot. That allows multiple snapshots to be preserved in the repository in a compact form.
+The snapshotting process is executed in a non-blocking fashion. All indexing and searching operations can continue to be
+executed against the index that is being snapshotted. However, a snapshot represents a point-in-time view of the index
+at the moment the snapshot was created, so no records that were added to the index after the snapshot process started
+will be present in the snapshot.
+
+Besides creating a copy of each index, the snapshot process can also store global cluster metadata, which includes persistent
+cluster settings and templates. The transient settings and registered snapshot repositories are not stored as part of
+the snapshot.
+
+Only one snapshot process can be executed in the cluster at any time. While the snapshot of a particular shard is being
+created, this shard cannot be moved to another node, which can interfere with the rebalancing process and allocation
+filtering.
Once the snapshot of the shard is finished, Elasticsearch will be able to move the shard to another node according
+to the current allocation filtering settings and rebalancing algorithm.
+
+Once a snapshot is created, information about this snapshot can be obtained using the following command:
+
+[source,shell]
+-----------------------------------
+$ curl -XGET "localhost:9200/_snapshot/my_backup/snapshot_1"
+-----------------------------------
+
+All snapshots currently stored in the repository can be listed using the following command:
+
+[source,shell]
+-----------------------------------
+$ curl -XGET "localhost:9200/_snapshot/my_backup/_all"
+-----------------------------------
+
+A snapshot can be deleted from the repository using the following command:
+
+[source,shell]
+-----------------------------------
+$ curl -XDELETE "localhost:9200/_snapshot/my_backup/snapshot_1"
+-----------------------------------
+
+When a snapshot is deleted from a repository, Elasticsearch deletes all files that are associated with the deleted
+snapshot and not used by any other snapshots. If the delete snapshot operation is executed while the snapshot is being
+created, the snapshotting process will be aborted and all files created as part of it will be
+cleaned up. Therefore, the delete snapshot operation can be used to cancel long-running snapshot operations that were
+started by mistake.
+
+
+[float]
+=== Restore
+
+A snapshot can be restored using the following command:
+
+[source,shell]
+-----------------------------------
+$ curl -XPOST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore"
+-----------------------------------
+
+By default, all indices in the snapshot as well as the cluster state are restored. It's possible to select the indices that
+should be restored, as well as to prevent the global cluster state from being restored, by using the `indices` and
+`include_global_state` options in the restore request body. The list of indices supports
+<<search-multi-index-type,multi index syntax>>. The `rename_pattern` and `rename_replacement` options can also be used to
+rename indices on restore using a regular expression that supports referencing the original text as explained
+http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement(java.lang.StringBuffer,%20java.lang.String)[here].
+
+[source,js]
+-----------------------------------
+$ curl -XPOST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -d '{
+    "indices": "index_1,index_2",
+    "ignore_unavailable": "true",
+    "include_global_state": false,
+    "rename_pattern": "index_(.+)",
+    "rename_replacement": "restored_index_$1"
+}'
+-----------------------------------
+
+The restore operation can be performed on a functioning cluster. However, an existing index can only be restored if it is
+closed. The restore operation automatically opens restored indices if they were closed, and creates new indices if they
+didn't exist in the cluster. If the cluster state is restored, the restored templates that don't currently exist in the
+cluster are added and existing templates with the same name are replaced by the restored templates. The restored
+persistent settings are added to the existing persistent settings.
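+
+For example, since an existing index can only be restored while it is closed, a typical sequence (a minimal sketch;
+the index name `index_1` is just the one used in the examples above) is to close the index first and then restore it:
+
+[source,shell]
+-----------------------------------
+# close the existing index so it can be restored over
+$ curl -XPOST "localhost:9200/index_1/_close"
+
+# restore only that index from the snapshot
+$ curl -XPOST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -d '{
+    "indices": "index_1"
+}'
+-----------------------------------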
diff --git a/docs/reference/modules/threadpool.asciidoc b/docs/reference/modules/threadpool.asciidoc new file mode 100644 index 0000000..50d7a92 --- /dev/null +++ b/docs/reference/modules/threadpool.asciidoc @@ -0,0 +1,117 @@ +[[modules-threadpool]] +== Thread Pool + +A node holds several thread pools in order to improve how threads are +managed and memory consumption within a node. There are several thread +pools, but the important ones include: + +[horizontal] +`index`:: + For index/delete operations, defaults to `fixed`, + size `# of available processors`. + queue_size `200`. + +`search`:: + For count/search operations, defaults to `fixed`, + size `3x # of available processors`. + queue_size `1000`. + +`suggest`:: + For suggest operations, defaults to `fixed`, + size `# of available processors`. + queue_size `1000`. + +`get`:: + For get operations, defaults to `fixed` + size `# of available processors`. + queue_size `1000`. + +`bulk`:: + For bulk operations, defaults to `fixed` + size `# of available processors`. + queue_size `50`. + +`percolate`:: + For percolate operations, defaults to `fixed` + size `# of available processors`. + queue_size `1000`. + +`warmer`:: + For segment warm-up operations, defaults to `scaling` + with a `5m` keep-alive. + +`refresh`:: + For refresh operations, defaults to `scaling` + with a `5m` keep-alive. + +Changing a specific thread pool can be done by setting its type and +specific type parameters, for example, changing the `index` thread pool +to have more threads: + +[source,js] +-------------------------------------------------- +threadpool: + index: + type: fixed + size: 30 +-------------------------------------------------- + +NOTE: you can update threadpool settings live using + <<cluster-update-settings>>. + + +[float] +[[types]] +=== Thread pool types + +The following are the types of thread pools that can be used and their +respective parameters: + +[float] +==== `cache` + +The `cache` thread pool is an unbounded thread pool that will spawn a +thread if there are pending requests. Here is an example of how to set +it: + +[source,js] +-------------------------------------------------- +threadpool: + index: + type: cached +-------------------------------------------------- + +[float] +==== `fixed` + +The `fixed` thread pool holds a fixed size of threads to handle the +requests with a queue (optionally bounded) for pending requests that +have no threads to service them. + +The `size` parameter controls the number of threads, and defaults to the +number of cores times 5. + +The `queue_size` allows to control the size of the queue of pending +requests that have no threads to execute them. By default, it is set to +`-1` which means its unbounded. When a request comes in and the queue is +full, it will abort the request. + +[source,js] +-------------------------------------------------- +threadpool: + index: + type: fixed + size: 30 + queue_size: 1000 +-------------------------------------------------- + +[float] +[[processors]] +=== Processors setting +The number of processors is automatically detected, and the thread pool +settings are automatically set based on it. Sometimes, the number of processors +are wrongly detected, in such cases, the number of processors can be +explicitly set using the `processors` setting. + +In order to check the number of processors detected, use the nodes info +API with the `os` flag. 
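+
+As an illustrative sketch, the detected value can be overridden in `elasticsearch.yml` (the value `8` below is just an
+example, not a recommendation), after which the thread pool defaults are derived from it:
+
+[source,yaml]
+--------------------------------------------------
+# size the thread pools as if the machine had 8 cores,
+# regardless of what is auto-detected
+processors: 8
+--------------------------------------------------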
diff --git a/docs/reference/modules/thrift.asciidoc b/docs/reference/modules/thrift.asciidoc
new file mode 100644
index 0000000..85e229f
--- /dev/null
+++ b/docs/reference/modules/thrift.asciidoc
@@ -0,0 +1,25 @@
+[[modules-thrift]]
+== Thrift
+
+The thrift transport module exposes the REST interface of
+Elasticsearch over thrift. Thrift should provide better performance
+than HTTP. Since thrift provides both the wire protocol and the
+transport, it should also be simpler to use (though it is lacking in
+documentation).
+
+Using thrift requires installing the `transport-thrift` plugin, located
+https://github.com/elasticsearch/elasticsearch-transport-thrift[here].
+
+The thrift
+https://github.com/elasticsearch/elasticsearch-transport-thrift/blob/master/elasticsearch.thrift[schema]
+can be used to generate thrift clients.
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`thrift.port` |The port to bind to. Defaults to `9500-9600`.
+
+|`thrift.frame` |Defaults to `-1`, which means no framing. Set to a
+higher value to specify the frame size (like `15mb`).
+|=======================================================================
+
diff --git a/docs/reference/modules/transport.asciidoc b/docs/reference/modules/transport.asciidoc
new file mode 100644
index 0000000..62fe6d0
--- /dev/null
+++ b/docs/reference/modules/transport.asciidoc
@@ -0,0 +1,49 @@
+[[modules-transport]]
+== Transport
+
+The transport module is used for internal communication between nodes
+within the cluster. Each call that goes from one node to another uses
+the transport module (for example, when an HTTP GET request is processed
+by one node, but should actually be processed by another node that holds
+the data).
+
+The transport mechanism is completely asynchronous in nature, meaning
+that there is no blocking thread waiting for a response. The benefits of
+using asynchronous communication are solving the
+http://en.wikipedia.org/wiki/C10k_problem[C10k problem], as well as
+being the ideal solution for scatter (broadcast) / gather operations such
+as search in Elasticsearch.
+
+[float]
+=== TCP Transport
+
+The TCP transport is an implementation of the transport module using
+TCP. It allows for the following settings:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`transport.tcp.port` |A bind port range. Defaults to `9300-9400`.
+
+|`transport.publish_port` |The port that other nodes in the cluster
+should use when communicating with this node. Useful when a cluster node
+is behind a proxy or firewall and the `transport.tcp.port` is not directly
+addressable from the outside. Defaults to the actual port assigned via
+`transport.tcp.port`.
+
+|`transport.tcp.connect_timeout` |The socket connect timeout setting (in
+time setting format). Defaults to `30s`.
+
+|`transport.tcp.compress` |Set to `true` to enable compression (LZF)
+between all nodes. Defaults to `false`.
+|=======================================================================
+
+It also uses the common
+<<modules-network,network settings>>.
+
+[float]
+=== Local Transport
+
+This is a handy transport to use when running integration tests within
+the JVM. It is automatically enabled when using
+`NodeBuilder#local(true)`.
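+
+Returning to the TCP transport settings above, a minimal sketch of how they might appear in `elasticsearch.yml`
+(the values are illustrative, not recommendations):
+
+[source,yaml]
+--------------------------------------------------
+transport.tcp.port: 9300-9400
+transport.tcp.connect_timeout: 30s
+transport.tcp.compress: true
+--------------------------------------------------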
diff --git a/docs/reference/modules/tribe.asciidoc b/docs/reference/modules/tribe.asciidoc
new file mode 100644
index 0000000..fb998bf
--- /dev/null
+++ b/docs/reference/modules/tribe.asciidoc
@@ -0,0 +1,58 @@
+[[modules-tribe]]
+== Tribe node
+
+The _tribes_ feature allows a _tribe node_ to act as a federated client across
+multiple clusters.
+
+The tribe node works by retrieving the cluster state from all connected
+clusters and merging them into a global cluster state. With this information
+at hand, it is able to perform read and write operations against the nodes in
+all clusters as if they were local.
+
+The `elasticsearch.yml` config file for a tribe node just needs to list the
+clusters that should be joined, for instance:
+
+[source,yaml]
+--------------------------------
+tribe:
+    t1: <1>
+        cluster.name:   cluster_one
+    t2: <1>
+        cluster.name:   cluster_two
+--------------------------------
+<1> `t1` and `t2` are arbitrary names representing the connection to each
+    cluster.
+
+The example above configures connections to two clusters, named `t1` and `t2`
+respectively. The tribe node will create a <<modules-node,node client>> to
+connect to each cluster using <<multicast,multicast discovery>> by default. Any
+other settings for the connection can be configured under `tribe.{name}`, just
+like the `cluster.name` in the example.
+
+The merged global cluster state means that almost all operations work in the
+same way as in a single cluster: distributed search, suggest, percolation,
+indexing, etc.
+
+However, there are a few exceptions:
+
+* The merged view cannot handle indices with the same name in multiple
+  clusters. It will pick one of them and discard the other.
+
+* Master level read operations (eg <<cluster-state>>, <<cluster-health>>)
+  will automatically execute with a local flag set to true since there is
+  no master.
+
+* Master level write operations (eg <<indices-create-index>>) are not
+  allowed. These should be performed on a single cluster.
+
+The tribe node can be configured to block all write operations and all
+metadata operations with:
+
+[source,yaml]
+--------------------------------
+tribe:
+    blocks:
+        write:    true
+        metadata: true
+--------------------------------
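+
+As an illustrative sketch, the two snippets above can be combined into a single `elasticsearch.yml` for a tribe node
+that joins both clusters but rejects writes and metadata changes:
+
+[source,yaml]
+--------------------------------
+tribe:
+    t1:
+        cluster.name:   cluster_one
+    t2:
+        cluster.name:   cluster_two
+    blocks:
+        write:    true
+        metadata: true
+--------------------------------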