[Elasticsearch] Basic Analyzer Usage

Yoon Yeoung-jin · August 25, 2021

Elasticsearch analyzer

Elasticsearch provides tokenizers, which split text into individual tokens, and token filters, which process those tokens so that they can be searched.

  • Tokenizer: splits the input text into tokens
  • Token Filter: processes the resulting tokens so that they can be searched

Both can be tried out directly through Elasticsearch's _analyze API.

Tokenizer

from elasticsearch import Elasticsearch
es = Elasticsearch("http://<ip>:<port>")  # explicit scheme; newer clients require it

The basic form of a tokenizer request is as follows.

GET _analyze
{
  "tokenizer": "<tokenizer name>",
  "text": ["<text to tokenize>"]
}

The most commonly used tokenizers are the following.

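  • whitespace : splits on whitespace (similar to Python's .split() with no arguments)
  • standard : splits text into words based on the Unicode Text Segmentation algorithm. According to the official documentation it works well for most languages, and it is the default tokenizer.
  • letter : splits whenever it meets a character that is not a letter, so digits, whitespace and special characters all act as separators.
  • lowercase : splits on non-letter characters just like letter, and also converts every token to lowercase.
  • uax_url_email : identical to standard, except that URLs and email addresses are each kept as a single token.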
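The differences between whitespace, letter and lowercase can be approximated without a cluster. The sketch below is only a rough pure-Python imitation, not how Elasticsearch implements these tokenizers; the function names are hypothetical and the letter pattern is an approximation of "letter characters".

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace; leading/trailing whitespace yields no empty tokens.
    return text.split()

def letter_tokenize(text):
    # Keep maximal runs of letters. Digits, underscores, whitespace and
    # punctuation all act as separators (approximation of the letter tokenizer).
    return re.findall(r"[^\W\d_]+", text)

def lowercase_tokenize(text):
    # Same splitting rule as letter_tokenize, followed by lowercasing.
    return [token.lower() for token in letter_tokenize(text)]

print(whitespace_tokenize("i love python "))
print(letter_tokenize("Around&the*World#in^Eighty@Days"))
print(lowercase_tokenize("I LOVE PYTHON"))
```

Note that a digit inside a word also splits it under the letter rule, which is the main practical difference from splitting on special characters only.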

The remaining tokenizers are listed at https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-tokenizers.html.

es.indices.analyze(
    body={
      "tokenizer": "whitespace",
      "text" : ["i love python "]
    }
)
C:\Users\User\Anaconda3\lib\site-packages\elasticsearch\connection\base.py:200: ElasticsearchWarning: Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.13/security-minimal-setup.html to enable security.
  warnings.warn(message, category=ElasticsearchWarning)

{'tokens': [{'token': 'i',
   'start_offset': 0,
   'end_offset': 1,
   'type': 'word',
   'position': 0},
  {'token': 'love',
   'start_offset': 2,
   'end_offset': 6,
   'type': 'word',
   'position': 1},
  {'token': 'python',
   'start_offset': 7,
   'end_offset': 13,
   'type': 'word',
   'position': 2}]}
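The full response is verbose when only the token text matters. A small helper along these lines can reduce it to a list of strings; token_strings is a hypothetical name, and the sample dict simply repeats the whitespace result shown above.

```python
def token_strings(response):
    """Return only the token text from an _analyze-style response dict."""
    return [entry["token"] for entry in response.get("tokens", [])]

# Sample response copied from the whitespace example above.
sample = {"tokens": [
    {"token": "i", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
    {"token": "love", "start_offset": 2, "end_offset": 6, "type": "word", "position": 1},
    {"token": "python", "start_offset": 7, "end_offset": 13, "type": "word", "position": 2},
]}

print(token_strings(sample))  # ['i', 'love', 'python']
```

With a live client, the same helper could be applied to the return value of es.indices.analyze(...).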
es.indices.analyze(
    body={
      "tokenizer": "standard",
      "text" : ["i love python "]
    }
)

{'tokens': [{'token': 'i',
   'start_offset': 0,
   'end_offset': 1,
   'type': '<ALPHANUM>',
   'position': 0},
  {'token': 'love',
   'start_offset': 2,
   'end_offset': 6,
   'type': '<ALPHANUM>',
   'position': 1},
  {'token': 'python',
   'start_offset': 7,
   'end_offset': 13,
   'type': '<ALPHANUM>',
   'position': 2}]}
es.indices.analyze(
    body={
      "tokenizer": "letter",
      "text" : ["Around&the*World#in^Eighty@Days"]
    }
)

{'tokens': [{'token': 'Around',
   'start_offset': 0,
   'end_offset': 6,
   'type': 'word',
   'position': 0},
  {'token': 'the',
   'start_offset': 7,
   'end_offset': 10,
   'type': 'word',
   'position': 1},
  {'token': 'World',
   'start_offset': 11,
   'end_offset': 16,
   'type': 'word',
   'position': 2},
  {'token': 'in',
   'start_offset': 17,
   'end_offset': 19,
   'type': 'word',
   'position': 3},
  {'token': 'Eighty',
   'start_offset': 20,
   'end_offset': 26,
   'type': 'word',
   'position': 4},
  {'token': 'Days',
   'start_offset': 27,
   'end_offset': 31,
   'type': 'word',
   'position': 5}]}
es.indices.analyze(
    body={
      "tokenizer": "lowercase",
      "text" : ["I LOVE PYTHON"]
    }
)

{'tokens': [{'token': 'i',
   'start_offset': 0,
   'end_offset': 1,
   'type': 'word',
   'position': 0},
  {'token': 'love',
   'start_offset': 2,
   'end_offset': 6,
   'type': 'word',
   'position': 1},
  {'token': 'python',
   'start_offset': 7,
   'end_offset': 13,
   'type': 'word',
   'position': 2}]}
es.indices.analyze(
    body = {
        "tokenizer" : "uax_url_email",
        "text" : ["alwns28@naver.com test"]
    }
)

{'tokens': [{'token': 'alwns28@naver.com',
   'start_offset': 0,
   'end_offset': 17,
   'type': '<EMAIL>',
   'position': 0},
  {'token': 'test',
   'start_offset': 18,
   'end_offset': 22,
   'type': '<ALPHANUM>',
   'position': 1}]}
es.indices.analyze(
    body = {
        "tokenizer" : "uax_url_email",
        "text" : ["this is a test", "the second text"]     # several texts can be passed as a list
    }
)

{'tokens': [{'token': 'this',
   'start_offset': 0,
   'end_offset': 4,
   'type': '<ALPHANUM>',
   'position': 0},
  {'token': 'is',
   'start_offset': 5,
   'end_offset': 7,
   'type': '<ALPHANUM>',
   'position': 1},
  {'token': 'a',
   'start_offset': 8,
   'end_offset': 9,
   'type': '<ALPHANUM>',
   'position': 2},
  {'token': 'test',
   'start_offset': 10,
   'end_offset': 14,
   'type': '<ALPHANUM>',
   'position': 3},
  {'token': 'the',
   'start_offset': 15,
   'end_offset': 18,
   'type': '<ALPHANUM>',
   'position': 104},
  {'token': 'second',
   'start_offset': 19,
   'end_offset': 25,
   'type': '<ALPHANUM>',
   'position': 105},
  {'token': 'text',
   'start_offset': 26,
   'end_offset': 30,
   'type': '<ALPHANUM>',
   'position': 106}]}
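In the output above the second text starts at position 104 rather than 4, and its first offset is 15 rather than 14. This is consistent with a position gap of 100 between list entries (the default position_increment_gap for text fields) and an offset gap of 1. The sketch below reproduces that numbering under those assumptions; analyze_values is a hypothetical helper that splits on whitespace only.

```python
import re

def analyze_values(values, position_gap=100, offset_gap=1):
    """Number tokens across several texts the way the output above is numbered."""
    tokens = []
    position = -1       # position of the most recently emitted token
    offset_base = 0     # offset at which the current text starts
    for index, value in enumerate(values):
        if index > 0:
            position += position_gap    # gap inserted between list entries
        for match in re.finditer(r"\S+", value):
            position += 1
            tokens.append({
                "token": match.group(),
                "start_offset": offset_base + match.start(),
                "end_offset": offset_base + match.end(),
                "position": position,
            })
        offset_base += len(value) + offset_gap
    return tokens

tokens = analyze_values(["this is a test", "the second text"])
print([t["position"] for t in tokens])  # [0, 1, 2, 3, 104, 105, 106]
```

The gap keeps phrase queries from matching across the boundary between two separate values.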

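Additional options, such as token filters (filter) and character filters (char_filter), can be passed along with the tokenizer. Some representative examples follow.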

es.indices.analyze(
    body = {
      "tokenizer" : "keyword",
      "filter" : ["lowercase"],        # convert tokens to lowercase
      "text" : "this is a TEST"
    }
)

{'tokens': [{'token': 'this is a test',
   'start_offset': 0,
   'end_offset': 14,
   'type': 'word',
   'position': 0}]}
es.indices.analyze(
    body = {
      "tokenizer" : "keyword",
      "filter" : ["lowercase"],
      "char_filter" : ["html_strip"],       # strip HTML tags before tokenizing
      "text" : "this is a <b>test</b>"
    }
)

{'tokens': [{'token': 'this is a test',
   'start_offset': 0,
   'end_offset': 21,
   'type': 'word',
   'position': 0}]}
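Note that end_offset is 21 in the output above even though the token is only 14 characters long: the offsets refer to the original text, tags included. The following sketch imitates this with a regular expression. It is a simplification (the real html_strip filter also handles entities and other cases), and both function names are hypothetical.

```python
import re

def strip_tags(text):
    # Very rough stand-in for the html_strip character filter: drop anything in <...>.
    return re.sub(r"<[^>]+>", "", text)

def keyword_lowercase(text):
    # The keyword tokenizer emits the whole (filtered) input as one token;
    # offsets still point into the original, unfiltered text.
    return {
        "token": strip_tags(text).lower(),
        "start_offset": 0,
        "end_offset": len(text),
    }

print(keyword_lowercase("this is a <b>test</b>"))
# {'token': 'this is a test', 'start_offset': 0, 'end_offset': 21}
```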
es.indices.analyze(body ={
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}], # filters can be chained and can take their own options
  "text" : "this is a test"
})

{'tokens': [{'token': 'test',
   'start_offset': 10,
   'end_offset': 14,
   'type': 'word',
   'position': 3}]}
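The surviving token keeps position 3 and offsets 10 to 14: the stop filter removes tokens but does not renumber the ones that remain. A minimal sketch of that behaviour, assuming whitespace splitting and a hypothetical helper name:

```python
import re

def analyze_with_stopwords(text, stopwords):
    """Whitespace-split, lowercase, then drop stopwords while keeping positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        term = match.group().lower()
        if term in stopwords:
            continue  # the token is dropped, but its position stays consumed
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

print(analyze_with_stopwords("this is a test", {"a", "is", "this"}))
# [{'token': 'test', 'start_offset': 10, 'end_offset': 14, 'position': 3}]
```

Filter order matters here: because lowercase runs first, "This" or "IS" in the input would also be removed by the lowercase stopword list.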

Custom Analyzer

The tokenizers and filters shown above can be combined into a custom analyzer. The analyzer is declared in the index settings when the index is created, as follows.

{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "<analyzer_name>" : {
               "tokenizer" : "<tokenizer_name>",
               "filter" : ["<filter1_name>", "<filter2_name>", ... ]
            }
         }
       }
    }
}
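When several indices need similar analyzers, the template above can be generated instead of written by hand. The helper below is a hypothetical convenience, not part of the Elasticsearch client; it only builds the settings dictionary in the shape shown above.

```python
def build_analysis_settings(analyzer_name, tokenizer, filters, custom_filters=None):
    """Build an index-settings dict containing one custom analyzer."""
    analysis = {
        "analyzer": {
            analyzer_name: {
                "tokenizer": tokenizer,
                "filter": list(filters),
            }
        }
    }
    if custom_filters:
        # Definitions for user-defined token filters referenced in `filters`.
        analysis["filter"] = dict(custom_filters)
    return {"settings": {"analysis": analysis}}

settings = build_analysis_settings("my_analyzer", "whitespace", ["lowercase"])
print(settings)
```

The resulting dictionary could then be passed as the body of es.indices.create.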
mapping = {
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "my_analyzer" : {
               "tokenizer" : "whitespace",
               "filter" : ["snowball", "lowercase", "my_filter"]   # my_filter is the user-defined token filter declared below
            }
         },
         "filter" : {
            "my_filter" : {
               "type" : "synonym", # synonym filter
               "synonyms" : ["quick, fast", "jump, hop => hop"]
            }
         }
      }
   }
}

es.indices.create(index = 'test_analyzer', body = mapping)


es.indices.analyze(index = 'test_analyzer', body = {
    "analyzer" : 'my_analyzer',
    'text' : 'The Quick Rabbit Jumped'
})

{'tokens': [{'token': 'the',
   'start_offset': 0,
   'end_offset': 3,
   'type': 'word',
   'position': 0},
  {'token': 'quick',
   'start_offset': 4,
   'end_offset': 9,
   'type': 'word',
   'position': 1},
  {'token': 'fast',
   'start_offset': 4,
   'end_offset': 9,
   'type': 'SYNONYM',
   'position': 1},
  {'token': 'rabbit',
   'start_offset': 10,
   'end_offset': 16,
   'type': 'word',
   'position': 2},
  {'token': 'hop',
   'start_offset': 17,
   'end_offset': 23,
   'type': 'SYNONYM',
   'position': 3}]}
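The result can be read filter by filter: snowball stems "Jumped" to "Jump", lowercase turns it into "jump", and the synonym filter then replaces "jump" with "hop" while expanding "quick" to both "quick" and "fast". The sketch below imitates only the two synonym rule forms used here ("a, b" and "a, b => c"); the function names are hypothetical and the input tokens are assumed to be already stemmed and lowercased.

```python
def parse_synonym_rules(rules):
    """Turn rules like 'quick, fast' or 'jump, hop => hop' into a lookup table."""
    table = {}
    for rule in rules:
        if "=>" in rule:
            left, right = rule.split("=>")
            sources = [s.strip() for s in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalent synonyms: every term expands to the whole group.
            sources = [s.strip() for s in rule.split(",")]
            targets = sources
        for source in sources:
            expansions = table.setdefault(source, [])
            for target in targets:
                if target not in expansions:
                    expansions.append(target)
    return table

def apply_synonyms(tokens, table):
    # One list per position; tokens without a rule pass through unchanged.
    return [table.get(token, [token]) for token in tokens]

table = parse_synonym_rules(["quick, fast", "jump, hop => hop"])
print(apply_synonyms(["the", "quick", "rabbit", "jump"], table))
# [['the'], ['quick', 'fast'], ['rabbit'], ['hop']]
```

Because the synonym filter runs last, its rules have to be written against the stemmed, lowercased form of each word.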
