[ES Official Docs] Text analysis

sisi237 · May 11, 2023

elastic-search


Edge n-gram token filter

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html

Forms an n-gram of a specified length from the beginning of a token.
ㄴ For example, you can use the edge_ngram token filter to change quick to qu.
When not customized, the filter creates 1-character edge n-grams by default.

// example
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 2
    }
  ],
  "text": "the quick brown fox jumps"
}
// input: the quick brown fox jumps
// output: [ t, th, q, qu, b, br, f, fo, j, ju ]
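To make the output above concrete, here is a minimal Python sketch that imitates the filter. It is not Elasticsearch's implementation: the function names edge_ngrams and analyze are made up for illustration, and str.split() is only a rough stand-in for the standard tokenizer.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 1) -> List[str]:
    """Return the prefixes of `token` with length min_gram..max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text: str, min_gram: int = 1, max_gram: int = 1) -> List[str]:
    """Split on whitespace, then expand every token into its edge n-grams."""
    grams = []
    for token in text.split():
        grams.extend(edge_ngrams(token, min_gram, max_gram))
    return grams


print(analyze("the quick brown fox jumps", min_gram=1, max_gram=2))
# ['t', 'th', 'q', 'qu', 'b', 'br', 'f', 'fo', 'j', 'ju']
```

With the defaults (min_gram=1, max_gram=1) the same function returns only the first character of each token, which matches the "1-character edge n-grams by default" note above.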

// Add to an analyzer
PUT edge_ngram_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_edge_ngram": {
          "tokenizer": "standard",
          "filter": [ "edge_ngram" ]
        }
      }
    }
  }
}
// parameters: max_gram, min_gram (each defaults to 1 for the built-in edge_ngram filter)

How to customize

PUT edge_ngram_custom_example
{
  "settings": {
    "analysis": {
      "filter": {
        "3_5_edgegrams": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}
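As a rough illustration of what a filter configured with min_gram 3 and max_gram 5 emits, here is a small Python sketch. The function name edge_ngrams_3_5 is hypothetical, and the handling of tokens shorter than min_gram is my assumption about the filter's behavior when preserve_original is left at its default of false.

```python
from typing import List


def edge_ngrams_3_5(token: str) -> List[str]:
    """Prefixes of length 3 to 5, imitating the 3_5_edgegrams filter."""
    min_gram, max_gram = 3, 5
    # Assumption: a token shorter than min_gram produces no output
    # (preserve_original is assumed to stay at its default, false).
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


for token in ["to", "fox", "quick", "elasticsearch"]:
    print(token, "->", edge_ngrams_3_5(token))
# to -> []
# fox -> ['fox']
# quick -> ['qui', 'quic', 'quick']
# elasticsearch -> ['ela', 'elas', 'elast']
```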

Limitations of the max_gram parameter

The edge_ngram filter’s max_gram value
limits the character length of tokens.

When the edge_ngram filter is used with an index analyzer,
this means search terms longer than the max_gram length
may not match any indexed terms.

For example,
if the max_gram is 3,
searches for apple won’t match the indexed term app.
ㄴ This seems to mean that a search for apple will not match the indexed term app.
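The limitation can be reproduced with a small Python sketch. It assumes the edge n-grams are produced only at index time and that the search term is looked up as-is. The names indexed_terms, matches, and truncate are made up for illustration. The last part imitates the workaround the Elasticsearch docs mention (using the truncate filter in a search analyzer), which can also return irrelevant results.

```python
from typing import List

MAX_GRAM = 3


def edge_ngrams(token: str, min_gram: int, max_gram: int) -> List[str]:
    """Return the prefixes of `token` with length min_gram..max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


# Index time: "apple" is stored only as its edge n-grams up to MAX_GRAM.
indexed_terms = set(edge_ngrams("apple", 1, MAX_GRAM))  # {'a', 'ap', 'app'}


def matches(search_term: str) -> bool:
    """Search time: the term is compared as-is (no n-gram expansion)."""
    return search_term in indexed_terms


def truncate(search_term: str, length: int = MAX_GRAM) -> str:
    """Shorten the search term, in the spirit of the truncate token filter."""
    return search_term[:length]


print(matches("app"))              # True
print(matches("apple"))            # False: longer than MAX_GRAM, nothing to match
print(matches(truncate("apple")))  # True: "apple" is cut down to "app"
print(matches(truncate("apply")))  # True as well, which is the irrelevant-result risk
```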

Lowercase token filter

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenfilter.html

Changes token text to lowercase.
For example, you can use the lowercase filter
to change THE Lazy DoG to the lazy dog.

example

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}
// input: THE Quick FoX JUMPs
// output: [ the, quick, fox, jumps ]

Add to an analyzer

PUT lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "tokenizer": "whitespace",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
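For simple ASCII text, the whitespace_lowercase analyzer above can be approximated in plain Python. This is only a sketch: str.split() stands in for the whitespace tokenizer and str.lower() for the lowercase filter, and the function name is made up for illustration.

```python
from typing import List


def whitespace_lowercase(text: str) -> List[str]:
    """Split on whitespace, then lowercase every token."""
    return [token.lower() for token in text.split()]


print(whitespace_lowercase("THE Quick FoX JUMPs"))
# ['the', 'quick', 'fox', 'jumps']
```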

N-gram token filter

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenfilter.html

Forms n-grams of specified lengths from a token.

For example,
you can use the ngram token filter
to change fox to [ f, fo, o, ox, x ].

example

GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "ngram" ],
  "text": "Quick fox"
}
// input: Quick fox
// output: [ Q, Qu, u, ui, i, ic, c, ck, k, f, fo, o, ox, x ]
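The output order above (all grams starting at the first character, then the second, and so on) can be reproduced with a short Python sketch. It is not Elasticsearch's implementation; the function name ngrams is made up, and str.split() is only a rough stand-in for the standard tokenizer.

```python
from typing import List


def ngrams(token: str, min_gram: int = 1, max_gram: int = 2) -> List[str]:
    """Return every substring of `token` with length min_gram..max_gram,
    ordered by start position and then by length."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


print(ngrams("fox"))
# ['f', 'fo', 'o', 'ox', 'x']

print([g for token in "Quick fox".split() for g in ngrams(token)])
# ['Q', 'Qu', 'u', 'ui', 'i', 'ic', 'c', 'ck', 'k', 'f', 'fo', 'o', 'ox', 'x']
```

Unlike the edge n-gram sketch, the start position moves across the whole token, so grams are taken from the middle and end of the token as well as from its beginning.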

Add to an analyzer

PUT ngram_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_ngram": {
          "tokenizer": "standard",
          "filter": [ "ngram" ]
        }
      }
    }
  }
}
// parameters: max_gram (default 2), min_gram (default 1)

*You can use the index.max_ngram_diff index-level setting
to control the maximum allowed difference
between the max_gram and min_gram values.
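The arithmetic behind this setting is simple enough to sketch in Python. The function name check_ngram_range is hypothetical, and the default of 1 is taken from the Elasticsearch documentation for index.max_ngram_diff. Elasticsearch performs its own validation when the index is created; this only mirrors the comparison.

```python
DEFAULT_MAX_NGRAM_DIFF = 1  # documented default of index.max_ngram_diff


def check_ngram_range(min_gram: int, max_gram: int,
                      max_ngram_diff: int = DEFAULT_MAX_NGRAM_DIFF) -> int:
    """Return max_gram - min_gram, or raise if it exceeds the allowed difference."""
    diff = max_gram - min_gram
    if diff > max_ngram_diff:
        raise ValueError(
            f"max_gram - min_gram is {diff}, "
            f"but index.max_ngram_diff allows at most {max_ngram_diff}"
        )
    return diff


print(check_ngram_range(1, 2))                    # 1, fits the default limit
print(check_ngram_range(3, 5, max_ngram_diff=2))  # 2, fits once the limit is raised

try:
    check_ngram_range(3, 5)  # difference of 2 against the default limit of 1
except ValueError as err:
    print("rejected:", err)
```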
