Elasticsearch provides tokenizers, which split text into individual tokens, and token filters, which process those tokens so that they can be searched.
Both can be tried out through the _analyze API.
from elasticsearch import Elasticsearch
es = Elasticsearch("<ip>:<port>")
The basic form of a tokenizer request is as follows.
GET _analyze
{
"tokenizer": "<tokenizer name>",
"text": ["<text to tokenize>"]
}
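The same request body can be assembled in Python before it is handed to es.indices.analyze(body=...). The helper below is only a sketch: build_analyze_body is a hypothetical name, not part of the elasticsearch client, and it just builds a dict, so it runs without a cluster. The optional "filter" and "char_filter" keys are the ones used later in this post.

```python
def build_analyze_body(tokenizer, texts, filters=None, char_filters=None):
    """Build a request body for the _analyze API (hypothetical helper)."""
    # Accept a single string as well as a list of strings.
    if isinstance(texts, str):
        texts = [texts]
    body = {"tokenizer": tokenizer, "text": list(texts)}
    # Only add the optional keys when they are actually requested.
    if filters:
        body["filter"] = list(filters)
    if char_filters:
        body["char_filter"] = list(char_filters)
    return body


body = build_analyze_body("whitespace", ["i love python "])
print(body)
```

The resulting dict has the same shape as the bodies passed to es.indices.analyze in the examples of this post.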
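Representative tokenizer options are listed below.
whitespace : splits on whitespace (similar to Python's .split()).
standard : splits words based on the Unicode Text Segmentation algorithm. According to the official documentation, it is the best general-purpose choice.
letter : splits whenever it meets a character that is not a letter, such as a special character, a digit, or whitespace.
lowercase : splits on non-letter characters exactly like letter, and additionally converts each token to lowercase.
uax_url_email : identical to standard, except that URLs and email addresses are kept as single tokens.
Other options are listed at https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-tokenizers.html.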
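To make the differences between the simpler tokenizers concrete, here is a rough pure-Python approximation of whitespace, letter, and lowercase. It is a sketch under the assumption that a regular expression is a close enough stand-in; it is not how Elasticsearch implements them, and the function names are made up for illustration. The standard and uax_url_email tokenizers follow Unicode segmentation rules and are not reproduced here.

```python
import re


def whitespace_tokenize(text):
    # Split on runs of whitespace; punctuation stays attached to the word.
    return text.split()


def letter_tokenize(text):
    # Keep maximal runs of letters; anything else acts as a separator.
    # [^\W\d_] means "a word character that is neither a digit nor '_'".
    return re.findall(r"[^\W\d_]+", text)


def lowercase_tokenize(text):
    # Same splitting as letter_tokenize, then lowercase every token.
    return [token.lower() for token in letter_tokenize(text)]


print(whitespace_tokenize("i love python "))
print(letter_tokenize("Around&the*World#in^Eighty@Days"))
print(lowercase_tokenize("I LOVE PYTHON"))
```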
es.indices.analyze(
body={
"tokenizer": "whitespace",
"text" : ["i love python "]
}
)
C:\Users\User\Anaconda3\lib\site-packages\elasticsearch\connection\base.py:200: ElasticsearchWarning: Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.13/security-minimal-setup.html to enable security.
warnings.warn(message, category=ElasticsearchWarning)
{'tokens': [{'token': 'i',
'start_offset': 0,
'end_offset': 1,
'type': 'word',
'position': 0},
{'token': 'love',
'start_offset': 2,
'end_offset': 6,
'type': 'word',
'position': 1},
{'token': 'python',
'start_offset': 7,
'end_offset': 13,
'type': 'word',
'position': 2}]}
es.indices.analyze(
body={
"tokenizer": "standard",
"text" : ["i love python "]
}
)
{'tokens': [{'token': 'i',
'start_offset': 0,
'end_offset': 1,
'type': '<ALPHANUM>',
'position': 0},
{'token': 'love',
'start_offset': 2,
'end_offset': 6,
'type': '<ALPHANUM>',
'position': 1},
{'token': 'python',
'start_offset': 7,
'end_offset': 13,
'type': '<ALPHANUM>',
'position': 2}]}
es.indices.analyze(
body={
"tokenizer": "letter",
"text" : ["Around&the*World#in^Eighty@Days"]
}
)
{'tokens': [{'token': 'Around',
'start_offset': 0,
'end_offset': 6,
'type': 'word',
'position': 0},
{'token': 'the',
'start_offset': 7,
'end_offset': 10,
'type': 'word',
'position': 1},
{'token': 'World',
'start_offset': 11,
'end_offset': 16,
'type': 'word',
'position': 2},
{'token': 'in',
'start_offset': 17,
'end_offset': 19,
'type': 'word',
'position': 3},
{'token': 'Eighty',
'start_offset': 20,
'end_offset': 26,
'type': 'word',
'position': 4},
{'token': 'Days',
'start_offset': 27,
'end_offset': 31,
'type': 'word',
'position': 5}]}
es.indices.analyze(
body={
"tokenizer": "lowercase",
"text" : ["I LOVE PYTHON"]
}
)
{'tokens': [{'token': 'i',
'start_offset': 0,
'end_offset': 1,
'type': 'word',
'position': 0},
{'token': 'love',
'start_offset': 2,
'end_offset': 6,
'type': 'word',
'position': 1},
{'token': 'python',
'start_offset': 7,
'end_offset': 13,
'type': 'word',
'position': 2}]}
es.indices.analyze(
body = {
"tokenizer" : "uax_url_email",
"text" : ["alwns28@naver.com test"]
}
)
{'tokens': [{'token': 'alwns28@naver.com',
'start_offset': 0,
'end_offset': 17,
'type': '<EMAIL>',
'position': 0},
{'token': 'test',
'start_offset': 18,
'end_offset': 22,
'type': '<ALPHANUM>',
'position': 1}]}
es.indices.analyze(
body = {
"tokenizer" : "uax_url_email",
"text" : ["this is a test", "the second text"] # several texts can be analyzed in one request
}
)
{'tokens': [{'token': 'this',
'start_offset': 0,
'end_offset': 4,
'type': '<ALPHANUM>',
'position': 0},
{'token': 'is',
'start_offset': 5,
'end_offset': 7,
'type': '<ALPHANUM>',
'position': 1},
{'token': 'a',
'start_offset': 8,
'end_offset': 9,
'type': '<ALPHANUM>',
'position': 2},
{'token': 'test',
'start_offset': 10,
'end_offset': 14,
'type': '<ALPHANUM>',
'position': 3},
{'token': 'the',
'start_offset': 15,
'end_offset': 18,
'type': '<ALPHANUM>',
'position': 104},
{'token': 'second',
'start_offset': 19,
'end_offset': 25,
'type': '<ALPHANUM>',
'position': 105},
{'token': 'text',
'start_offset': 26,
'end_offset': 30,
'type': '<ALPHANUM>',
'position': 106}]}
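Note the jump from position 3 to position 104 in the output above. It appears to come from a position gap that Elasticsearch inserts between the elements of a text array. The sketch below uses that gap to regroup the flat token list by input text. The gap value of 100 is inferred from this output rather than taken from the documentation, so treat it as an assumption; tokens_per_text is a hypothetical helper, and the sample response is an abbreviated copy of the output above.

```python
def tokens_per_text(response, gap=100):
    """Group tokens by input text, using a large position jump as the boundary."""
    groups = [[]]
    previous_position = None
    for token in response["tokens"]:
        # A jump of at least `gap` positions marks the start of the next text.
        if previous_position is not None and token["position"] - previous_position >= gap:
            groups.append([])
        groups[-1].append(token["token"])
        previous_position = token["position"]
    return groups


# Abbreviated copy of the response shown above (offsets and types omitted).
response = {"tokens": [
    {"token": "this", "position": 0},
    {"token": "is", "position": 1},
    {"token": "a", "position": 2},
    {"token": "test", "position": 3},
    {"token": "the", "position": 104},
    {"token": "second", "position": 105},
    {"token": "text", "position": 106},
]}
print(tokens_per_text(response))
```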
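A tokenizer can be combined with additional options such as token filters ("filter") and character filters ("char_filter"). Examples of the most common options follow.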
es.indices.analyze(
body = {
"tokenizer" : "keyword",
"filter" : ["lowercase"], # convert tokens to lowercase
"text" : "this is a TEST"
}
)
{'tokens': [{'token': 'this is a test',
'start_offset': 0,
'end_offset': 14,
'type': 'word',
'position': 0}]}
es.indices.analyze(
body = {
"tokenizer" : "keyword",
"filter" : ["lowercase"],
"char_filter" : ["html_strip"], # strip HTML tags
"text" : "this is a <b>test</b>"
}
)
{'tokens': [{'token': 'this is a test',
'start_offset': 0,
'end_offset': 21,
'type': 'word',
'position': 0}]}
es.indices.analyze(body ={
"tokenizer" : "whitespace",
"filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}], # several filters, each with its own options, can be chained
"text" : "this is a test"
})
{'tokens': [{'token': 'test',
'start_offset': 10,
'end_offset': 14,
'type': 'word',
'position': 3}]}
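The last result keeps position 3 for "test" even though the three tokens before it were removed. The following pure-Python sketch mimics that whitespace + lowercase + stop chain to show where the offsets and the position come from. It is an illustration under simplifying assumptions, not Elasticsearch's implementation, and analyze_sketch is a made-up name.

```python
import re


def analyze_sketch(text, stopwords):
    """Mimic whitespace tokenizer -> lowercase filter -> stop filter."""
    tokens = []
    # Each run of non-whitespace characters is one token; enumerate() gives
    # its position in the token stream before any filtering happens.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        token = match.group().lower()
        if token in stopwords:
            # The token is dropped, but the position counter still advances.
            continue
        tokens.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


print(analyze_sketch("this is a test", {"a", "is", "this"}))
# [{'token': 'test', 'start_offset': 10, 'end_offset': 14, 'position': 3}]
```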
Tokenizers and filters like the ones above can be combined into a custom analyzer. An index that defines such an analyzer is created as follows.
{
"settings" : {
"analysis" : {
"analyzer" : {
"<analyzer_name>" : {
"tokenizer" : "<tokenizer_name>",
"filter" : ["<filter1_name>", "<filter2_name>", ... ]
}
}
}
}
}
mapping = {
"settings" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "whitespace",
"filter" : ["snowball", "lowercase", "my_filter"] # my_filter is the custom filter defined below
}
},
"filter" : {
"my_filter" : {
"type" : "synonym", # synonym filter
"synonyms" : ["quick, fast", "jump, hop => hop"]
}
}
}
}
}
es.indices.create(index = 'test_analyzer', body = mapping)
es.indices.analyze(index = 'test_analyzer', body = {
"analyzer" : 'my_analyzer',
'text' : 'The Quick Rabbit Jumped'
})
{'tokens': [{'token': 'the',
'start_offset': 0,
'end_offset': 3,
'type': 'word',
'position': 0},
{'token': 'quick',
'start_offset': 4,
'end_offset': 9,
'type': 'word',
'position': 1},
{'token': 'fast',
'start_offset': 4,
'end_offset': 9,
'type': 'SYNONYM',
'position': 1},
{'token': 'rabbit',
'start_offset': 10,
'end_offset': 16,
'type': 'word',
'position': 2},
{'token': 'hop',
'start_offset': 17,
'end_offset': 23,
'type': 'SYNONYM',
'position': 3}]}
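The two synonym rules behave differently: "quick, fast" lists equivalent words, so "quick" comes out as both "quick" and "fast", while "jump, hop => hop" replaces either word with "hop". The sketch below imitates this with plain Python. It is a simplified reading of the rule syntax, not the filter's real implementation, and both function names are hypothetical. Stemming is not reproduced: the input tokens are written the way they are assumed to look after snowball and lowercase have run ("Jumped" becomes "jump"), which is consistent with the output above.

```python
def parse_synonym_rules(rules):
    """Turn rules like 'a, b' or 'a, b => c' into a token -> outputs mapping."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit mapping: every word on the left is replaced by the right side.
            left, right = rule.split("=>")
            targets = [word.strip() for word in right.split(",")]
            for source in left.split(","):
                mapping[source.strip()] = targets
        else:
            # Equivalent words: each word expands to itself plus the others.
            group = [word.strip() for word in rule.split(",")]
            for word in group:
                mapping[word] = [word] + [other for other in group if other != word]
    return mapping


def apply_synonyms(tokens, mapping):
    output = []
    for token in tokens:
        # Tokens without a rule pass through unchanged.
        output.extend(mapping.get(token, [token]))
    return output


mapping = parse_synonym_rules(["quick, fast", "jump, hop => hop"])
print(apply_synonyms(["the", "quick", "rabbit", "jump"], mapping))
# ['the', 'quick', 'fast', 'rabbit', 'hop']
```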