
Elasticsearch Analysis

When a query is processed during a search operation, the analysis module analyzes the content of an index. The module consists of analyzers, tokenizers, token filters, and character filters. If no analyzer is defined, the built-in analyzers, tokenizers, token filters, and character filters are registered with the analysis module by default.

In the following example, we use the standard analyzer, which is applied when no other analyzer is specified. It analyzes the sentence based on grammar and produces the words used in it.

POST _analyze
{
   "analyzer": "standard",
   "text": "Today's weather is beautiful"
}

After running the above code, we get the following response:

{
   "tokens": [
      {
         "token": "today's",
         "start_offset": 0,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "weather",
         "start_offset": 8,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "is",
         "start_offset": 16,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "beautiful",
         "start_offset": 19,
         "end_offset": 28,
         "type": "<ALPHANUM>",
         "position": 3
      }
   ]
}

Configuring the Standard Analyzer

We can use various parameters to configure the standard analyzer to meet our custom requirements.

In the following example, we configure the standard analyzer with a max_token_length of 5.

To this end, we first create an index using an analyzer with the max_token_length parameter.

PUT index_4_analysis
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_english_analyzer": {
               "type": "standard",
               "max_token_length": 5,
               "stopwords": "_english_"
            }
         }
      }
   }
}

Next, we run the analyzer on the text shown below. Note that the token "is" does not appear in the output: because we configured the analyzer with the _english_ stopwords list, "is" is removed as an English stopword. The remaining words are cut into pieces of at most five characters, as set by max_token_length.

POST index_4_analysis/_analyze
{
   "analyzer": "my_english_analyzer",
   "text": "Today's weather is beautiful"
}

After running the above code, we get the following response:

{
   "tokens": [
      {
         "token": "today",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "s",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "weath",
         "start_offset": 8,
         "end_offset": 13,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "er",
         "start_offset": 13,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "beaut",
         "start_offset": 19,
         "end_offset": 24,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "iful",
         "start_offset": 24,
         "end_offset": 28,
         "type": "<ALPHANUM>",
         "position": 6
      }
   ]
}

The following list describes various built-in analyzers:

1. Standard Analyzer (standard)

   The stopwords and max_token_length settings can be used to configure this analyzer. By default, the stopwords list is empty and max_token_length is 255.

2. Simple Analyzer (simple)

   This analyzer is composed of a lowercase tokenizer.

3. Whitespace Analyzer (whitespace)

   This analyzer is composed of a whitespace tokenizer.

4. Stop Analyzer (stop)

   stopwords and stopwords_path can be configured. By default, stopwords is initialized to the English stop words, and stopwords_path points to a text file containing the stop words. A short usage example follows this list.
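
To illustrate the built-in analyzers above, the two requests below run the whitespace and stop analyzers against the same sample sentence used earlier (the text is only illustrative). The whitespace analyzer splits only on spaces, while the stop analyzer lowercases the terms and also drops English stop words such as "is":

POST _analyze
{
   "analyzer": "whitespace",
   "text": "Today's weather is beautiful"
}

POST _analyze
{
   "analyzer": "stop",
   "text": "Today's weather is beautiful"
}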

Tokenizer

A tokenizer is used to generate tokens from text in Elasticsearch. Text is broken into tokens by taking whitespace or other punctuation into account. Elasticsearch has many built-in tokenizers, which can be used in custom analyzers.

The following is an example of a tokenizer that breaks text into terms whenever it encounters a character that is not a letter, and that also lowercases all terms, as shown below:

POST _analyze
{
   "tokenizer": "lowercase",
   "text": "It Was a Beautiful Weather 5 Days ago.
{}

After running the above code, we get the following response:

{
   "tokens": [
      {
         "token": "it",
         "start_offset": 0,
         "end_offset": 2,
         "type": "word",
         "position": 0
      },
      {
         "token": "was",
         "start_offset": 3,
         "end_offset": 6,
         "type": "word",
         "position": 1
      },
      {
         "token": "a",
         "start_offset": 7,
         "end_offset": 8,
         "type": "word",
         "position": 2
      },
      {
         "token": "beautiful",
         "start_offset": 9,
         "end_offset": 18,
         "type": "word",
         "position": 3
      },
      {
         "token": "weather",
         "start_offset": 19,
         "end_offset": 26,
         "type": "word",
         "position": 4
      },
      {
         "token": "days",
         "start_offset": 29,
         "end_offset": 33,
         "type": "word",
         "position": 5
      },
      {
         "token": "ago",
         "start_offset": 34,
         "end_offset": 37,
         "type": "word",
         "position": 6
      }
   ]
}

A list of tokenizers and their descriptions is given below:

1. Standard Tokenizer (standard)

   This is built on a grammar-based tokenizer, and max_token_length can be configured for it.

2. Edge NGram Tokenizer (edgeNGram)

   Settings such as min_gram, max_gram and token_chars can be set for this tokenizer (see the sketch after this list).

3. Keyword Tokenizer (keyword)

   This emits the entire input as a single token; buffer_size can be set for it.

4. Letter Tokenizer (letter)

   This captures a whole word until a non-letter character is encountered.
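
As a minimal sketch of configuring one of these tokenizers, the requests below define an edge_ngram tokenizer in the settings of an index and then exercise it through the _analyze API. The index name my_ngram_index, the tokenizer name my_edge_ngram_tokenizer, and the gram sizes are only illustrative; with these settings the tokenizer should emit the leading grams of each word, such as we, wea and weat for weather.

PUT my_ngram_index
{
   "settings": {
      "analysis": {
         "tokenizer": {
            "my_edge_ngram_tokenizer": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 4,
               "token_chars": [ "letter" ]
            }
         }
      }
   }
}

POST my_ngram_index/_analyze
{
   "tokenizer": "my_edge_ngram_tokenizer",
   "text": "weather"
}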