Elasticsearch Tutorial

Elasticsearch Modules

Elasticsearch is composed of many modules, which are responsible for its functions. These modules have two types of settings, as shown below:

Static settings: These settings must be configured in the config file (elasticsearch.yml) before Elasticsearch starts. To change them, you need to update every affected node in the cluster.
Dynamic settings: These settings can be changed on a live Elasticsearch cluster.
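
As a rough illustration of a dynamic setting, the sketch below builds the request that the cluster update settings API (PUT /_cluster/settings) expects. The helper name build_settings_request is hypothetical, and the block only constructs the request instead of sending it, since the host and port depend on your cluster.

```python
import json


def build_settings_request(settings, persistent=False):
    """Return (method, path, body) for a cluster settings update.

    `settings` maps dynamic setting names to their new values.
    Transient settings are lost on a full cluster restart;
    persistent settings survive it.
    """
    scope = "persistent" if persistent else "transient"
    body = json.dumps({scope: settings}, sort_keys=True)
    return "PUT", "/_cluster/settings", body


# Example: restrict shard allocation to primary shards on a running cluster.
method, path, body = build_settings_request(
    {"cluster.routing.allocation.enable": "primaries"}
)
print(method, path, body)
```

The returned triple can be passed to any HTTP client. Static settings cannot be changed this way; they have to be edited in elasticsearch.yml on each node.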

We will discuss the different modules of Elasticsearch in the following sections of this chapter.

Cluster-level routing and shard allocation

Cluster-level settings determine how shards are allocated to nodes and how they are redistributed to rebalance the cluster. The following settings control shard allocation.

Cluster-level shard allocation

Settings	Possible values	Description
cluster.routing.allocation.enable
	all	This default value allows shard allocation for all kinds of shards.
	primaries	This allows shard allocation only for primary shards.
	new_primaries	This allows shard allocation only for primary shards of new indices.
	none	This does not allow any shard allocation.
cluster.routing.allocation.node_concurrent_recoveries	Numeric value (default is 2)	This limits the number of concurrent shard recoveries on a node.
cluster.routing.allocation.node_initial_primaries_recoveries	Numeric value (default is 4)	This limits the number of parallel initial primary recoveries.
cluster.routing.allocation.same_shard.host	Boolean value (default is false)	When enabled, this prevents multiple copies of the same shard from being allocated on the same physical host.
indices.recovery.concurrent_streams	Numeric value (default is 3)	This controls the number of network streams opened per node when recovering shards from peers.
indices.recovery.concurrent_small_file_streams	Numeric value (default is 2)	This controls the number of streams opened per node for small files (under 5 MB) during shard recovery.
cluster.routing.rebalance.enable
	all	This default value allows balancing all types of shards.
	primaries	This allows shard balancing only for primary shards.
	replicas	This allows shard balancing only for replica shards.
	none	This does not allow any form of shard balancing.
cluster.routing.allocation.allow_rebalance
	always	This default value always allows rebalancing.
	indices_primaries_active	This allows rebalancing when all primary shards in the cluster are allocated.
	indices_all_active	This allows rebalancing when all primary and replica shards are allocated.
cluster.routing.allocation.cluster_concurrent_rebalance	Numeric value (default is 2)	This limits the number of concurrent shard rebalances in the cluster.
cluster.routing.allocation.balance.shard	Floating-point value (default is 0.45f)	This defines the weight factor for the number of shards allocated on each node.
cluster.routing.allocation.balance.index	Floating-point value (default is 0.55f)	This defines the weight factor for the number of shards per index allocated on a specific node.
cluster.routing.allocation.balance.threshold	Non-negative floating-point value (default is 1.0f)	This is the minimal optimization value of operations that should be performed.
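
The three values of the allow_rebalance setting can be read as a simple decision rule. The following is a minimal sketch of that rule as the table describes it, not Elasticsearch's actual implementation; the function name rebalance_allowed and its boolean inputs are assumptions made for illustration, and the value names use the Elasticsearch spellings indices_primaries_active and indices_all_active.

```python
def rebalance_allowed(mode, primaries_active, replicas_active):
    """Model cluster.routing.allocation.allow_rebalance (simplified).

    primaries_active: True when every primary shard is allocated.
    replicas_active:  True when every replica shard is allocated.
    """
    if mode == "always":
        return True
    if mode == "indices_primaries_active":
        return primaries_active
    if mode == "indices_all_active":
        return primaries_active and replicas_active
    raise ValueError(f"unknown allow_rebalance mode: {mode!r}")


# A cluster whose replicas are still initializing:
for mode in ("always", "indices_primaries_active", "indices_all_active"):
    allowed = rebalance_allowed(mode, primaries_active=True, replicas_active=False)
    print(mode, allowed)
```

In this example only indices_all_active blocks rebalancing, because it waits for the replicas as well as the primaries.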

Disk-based shard allocation

Settings	Possible values	Description
cluster.routing.allocation.disk.threshold_enabled	Boolean value (default is true)	This enables and disables the disk allocation decision-making process.
cluster.routing.allocation.disk.watermark.low	String value (default is 85%)	This is the disk usage above which no further shards are allocated to that disk.
cluster.routing.allocation.disk.watermark.high	String value (default is 90%)	This is the maximum disk usage during allocation; once this point is reached, Elasticsearch allocates the shard to another disk.
cluster.info.update.interval	String value (default is 30s)	This is the interval between disk usage checks.
cluster.routing.allocation.disk.include_relocations	Boolean value (default is true)	This determines whether shards that are currently being relocated are counted when calculating disk usage.
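
To make the two watermarks concrete, here is a small sketch that classifies a node by its disk usage, using the default thresholds from the table. It is a simplified model under stated assumptions: the function name disk_decision and the returned labels are hypothetical, and the real allocator also accepts absolute byte values and considers relocating shards.

```python
def disk_decision(used_percent, low=85.0, high=90.0):
    """Classify a node by disk usage against the low and high watermarks."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("used_percent must be between 0 and 100")
    if used_percent > high:
        # Above the high watermark: shards should move to another disk.
        return "relocate_shards_away"
    if used_percent > low:
        # Above the low watermark: keep existing shards, accept no new ones.
        return "no_new_shards"
    return "can_allocate"


for usage in (50.0, 87.5, 95.0):
    print(usage, disk_decision(usage))
```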

Discovery

This module helps a cluster discover and track the state of all its nodes. The cluster state changes whenever a node joins or leaves the cluster. The cluster name setting creates a logical separation between different clusters. The available discovery modules, some of which use the APIs provided by cloud vendors, are listed below:

Azure discovery
EC2 discovery
Google Compute Engine discovery
Zen discovery

Gateway

This module preserves the cluster state and shard data across full cluster restarts. The following are the static settings of this module:

Settings	Possible values	Description
gateway.expected_nodes	Numeric value (default is 0)	The number of nodes expected to be in the cluster before local shards are recovered.
gateway.expected_master_nodes	Numeric value (default is 0)	The expected number of master nodes in the cluster before starting recovery.
gateway.expected_data_nodes	Numeric value (default is 0)	The expected number of data nodes in the cluster before starting recovery.
gateway.recover_after_time	String value (default is 5m)	This specifies how long the recovery process waits before starting, regardless of the number of nodes that have joined the cluster. Related settings: gateway.recover_after_nodes, gateway.recover_after_master_nodes, gateway.recover_after_data_nodes.

HTTP

This module manages communication between the HTTP client and the Elasticsearch API. This module can be disabled by changing the value of http.enabled to false.

The following are the settings used to control this module (configured in elasticsearch.yml):

Serial number	Settings and descriptions
1	http.port This is the port for accessing Elasticsearch; it ranges from 9200 to 9300.
2	http.publish_port This is the port that HTTP clients use to reach the node; it is useful when the node sits behind a firewall.
3	http.bind_host This is the host address that the HTTP service binds to.
4	http.publish_host This is the host address published for HTTP clients to connect to.
5	http.max_content_length This is the maximum size of the content in an HTTP request. Its default value is 100mb.
6	http.max_initial_line_length This is the maximum size of the URL; its default value is 4kb.
7	http.max_header_size This is the maximum size of the HTTP headers; its default value is 8kb.
8	http.compression This enables or disables support for compression; its default value is false.
9	http.pipelining This enables or disables HTTP pipelining.
10	http.pipelining.max_events This limits the number of events queued before an HTTP request is closed.
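
The size limits above are written as strings such as 100mb or 4kb. The sketch below shows one way a client could check a request body against http.max_content_length before sending it. The helpers parse_size and request_fits are hypothetical names, only a few units are handled, and the units are assumed to be 1024-based.

```python
import re

# Assumed 1024-based units; only the units used in the table plus gb.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '100mb' into a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*(b|kb|mb|gb)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def request_fits(content_length, max_content_length="100mb"):
    """Return True when a body of content_length bytes is within the limit."""
    return content_length <= parse_size(max_content_length)


print(parse_size("100mb"))          # 104857600
print(request_fits(5 * 1024 ** 2))  # True
```

A client that sends large bulk requests could use a check like this to split a payload before the server rejects it.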

Index

This module maintains the settings that are set globally for every index. The following settings are mainly related to memory usage:

Circuit breaker

This is used to prevent operations from causing an OutOfMemoryError. The circuit breaker settings mainly limit how much of the JVM heap operations may use. For example, the indices.breaker.total.limit setting defaults to 70% of the JVM heap.
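
The idea behind a circuit breaker can be sketched in a few lines: estimate the memory an operation needs and refuse it up front if the total would exceed the limit. This is an illustrative model only, assuming the 70% default mentioned above; the names CircuitBreakingError and check_breaker are hypothetical and do not mirror Elasticsearch's internal classes.

```python
class CircuitBreakingError(Exception):
    """Raised when an operation would push memory use past the limit."""


def check_breaker(heap_bytes, used_bytes, requested_bytes, limit_percent=70):
    """Return the new usage estimate, or raise if it exceeds the limit.

    The limit is limit_percent of the heap, computed with integer math.
    """
    limit = heap_bytes * limit_percent // 100
    new_total = used_bytes + requested_bytes
    if new_total > limit:
        raise CircuitBreakingError(
            f"would use {new_total} bytes, limit is {limit} bytes"
        )
    return new_total


heap = 4 * 1024 ** 3  # a 4 GB heap, chosen only for the example
print(check_breaker(heap, used_bytes=1024 ** 3, requested_bytes=512 * 1024 ** 2))
```

Failing the single request early is the point of the design: it is cheaper than letting the whole node run out of memory.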

Field data cache

This is mainly used when aggregating on a field. It is recommended to have enough memory to allocate to it. The indices.fielddata.cache.size setting controls the amount of memory used for the field data cache.

Node query cache

This memory is used to cache query results. The cache uses a least recently used (LRU) eviction policy. The indices.queries.cache.size setting controls the memory size of this cache.
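
For readers unfamiliar with LRU eviction, here is a minimal sketch of the policy. It is not the Elasticsearch implementation: the class name LRUCache is hypothetical, and this version is bounded by entry count, whereas the node query cache is bounded by memory size.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache bounded by number of entries."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # Reading an entry makes it the most recently used one.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            # Evict the entry that has gone unused the longest.
            self._items.popitem(last=False)

    def __contains__(self, key):
        return key in self._items


cache = LRUCache(capacity=2)
cache.put("query_a", "result_a")
cache.put("query_b", "result_b")
cache.get("query_a")              # query_a is now the most recently used
cache.put("query_c", "result_c")  # evicts query_b
print("query_b" in cache)         # False
```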

Index buffer

This buffer stores newly created documents in the index and flushes them when the buffer is full. Settings such as indices.memory.index_buffer_size control the amount of heap allocated to this buffer.

Shard request cache

This cache stores the local search results of each shard. It can be enabled when the index is created and disabled per request with a URL parameter.

Disable cache: ?request_cache=false
Enable cache: "index.requests.cache.enable": true
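
The two switches can be sketched as follows: an index setting that enables the cache when the index is created, and the request_cache URL parameter on a search request, where false skips the cache for that request. The helper names and the index name my_index are hypothetical, and the block only builds the path and body without contacting a cluster.

```python
import json
from urllib.parse import urlencode


def create_index_body(enable_cache=True):
    """JSON body for index creation that sets the request cache flag."""
    return json.dumps({"settings": {"index.requests.cache.enable": enable_cache}})


def search_path(index, request_cache=None):
    """Search path for an index, optionally overriding the cache per request."""
    path = f"/{index}/_search"
    if request_cache is None:
        return path
    # URL parameters are text, so the boolean becomes "true" or "false".
    return path + "?" + urlencode({"request_cache": str(request_cache).lower()})


print(create_index_body(True))
print(search_path("my_index", request_cache=False))
```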

Index recovery

This controls the resources used during the recovery process. The following settings are provided:

Settings	Default Value
indices.recovery.concurrent_streams	3
indices.recovery.concurrent_small_file_streams	2
indices.recovery.file_chunk_size	512kb
indices.recovery.translog_ops	1000
indices.recovery.translog_size	512kb
indices.recovery.compress	true
indices.recovery.max_bytes_per_sec	40mb

TTL Interval

The time to live (TTL) of a document defines how long the document exists before it is deleted. The following dynamic settings control this process:

Settings	Default Value
indices.ttl.interval	60s
indices.ttl.bulk_size	1000

Node

Each node can be configured to be a data node or not. This is controlled by the node.data setting; setting it to false means the node does not act as a data node.
