Implementing Autocomplete with Elasticsearch

쭈 · December 18, 2022

Series: Applying Elasticsearch (3/3)

Implementing autocomplete

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html

Autocomplete with the Completion Suggester API
1. Autocomplete with prefix queries
2. Search against terms analyzed at index time

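To use the Suggest API that Elasticsearch provides, the field type has to be changed from text to completion. https://www.elastic.co/guide/en/elasticsearch/reference/current/completion.html
Autocompleted terms have to appear in real time while the user is typing, so speed matters. The completion suggester is optimized for exactly that, but it keeps every indexed term in an in-memory structure, so the speed is paid for in memory.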
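
As a rough illustration of what the completion type involves, here is a minimal sketch of a mapping and a suggest request body, written as Python dictionaries. The field name `suggest` and the label `shop-suggest` are placeholders I made up for the example, not names from this project.

```python
import json

# Hypothetical names: the `suggest` field and the `shop-suggest` label are
# placeholders chosen for this sketch.
INDEX_BODY = {
    "mappings": {
        "properties": {
            # The completion suggester needs this dedicated field type.
            "suggest": {"type": "completion"},
        }
    }
}


def build_suggest_request(prefix: str, size: int = 5) -> dict:
    """Build the body of a _search request that uses the completion suggester."""
    return {
        "suggest": {
            "shop-suggest": {
                "prefix": prefix,
                "completion": {"field": "suggest", "size": size},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(INDEX_BODY, indent=2))
    print(json.dumps(build_suggest_request("spr"), indent=2))
```

The first body would be sent when creating the index and the second to the index's _search endpoint. The rest of this post takes a different route (a text field with custom analyzers), so this is only for comparison.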

If I were building search over English text, the standard analyzer plus the Suggest API would probably be enough to implement autocomplete.

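Korean is different. Each Hangul syllable is composed of consonants and vowels (jamo), so to make suggestions in real time while the user types, we have to analyze the individual jamo rather than completed syllables and recommend indexed words from them. When searching for 스프링, for example, words starting with ㅅ should show up as soon as 'ㅅ' is typed, not only after the whole '스' has been typed.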
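
To make the idea concrete, here is a minimal Python sketch of jamo decomposition based on the Unicode layout of precomposed Hangul syllables. It is not the plugin's code. The `split_double` option is my own assumption, added so that the result has the same shape as the tokens in the _analyze output later in this post, where ㄸ appears as ㄷㄷ. Compound vowels and compound final consonants are left as they are.

```python
# Jamo tables in Unicode order for precomposed syllables (U+AC00..U+D7A3).
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = [
    "", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
    "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ",
]

# Assumption for this sketch: tense consonants are written as two plain ones.
SPLIT_DOUBLE = {"ㄲ": "ㄱㄱ", "ㄸ": "ㄷㄷ", "ㅃ": "ㅂㅂ", "ㅆ": "ㅅㅅ", "ㅉ": "ㅈㅈ"}


def decompose(text: str, split_double: bool = True) -> str:
    """Decompose Hangul syllables into jamo; other characters pass through."""
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 11172:  # 19 * 21 * 28 precomposed syllables
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            jamo = CHOSEONG[lead] + JUNGSEONG[vowel] + JONGSEONG[tail]
        else:
            jamo = ch  # already a bare jamo, Latin letter, digit, ...
        if split_double:
            jamo = "".join(SPLIT_DOUBLE.get(j, j) for j in jamo)
        out.append(jamo)
    return "".join(out)


if __name__ == "__main__":
    print(decompose("스프링"))  # ㅅㅡㅍㅡㄹㅣㅇ
    print(decompose("떡볶이"))  # ㄷㄷㅓㄱㅂㅗㄱㄱㅇㅣ
```

Because a bare 'ㅅ' passes through unchanged and '스' becomes 'ㅅㅡ', a prefix comparison on the decomposed strings can match from the very first keystroke.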

So to implement autocomplete we need to split syllables into their jamo (jaso decomposition)..

https://github.com/javacafe-project/elasticsearch-plugin

At the moment only the nori morphological analyzer is configured.

{
  "analysis":{
    "tokenizer":{
      "nori_user_dict":{
        "type":"nori_tokenizer",
        "decompound_mode":"mixed"
      }
    },
    "analyzer":{
      "nori_analyzer":{
        "type":"custom",
        "tokenizer":"nori_user_dict",
        "filter":["nori_posfilter"]
      }
    },
    "filter": {
      "nori_posfilter": {
        "type": "nori_part_of_speech",
        "stoptags": [
          "E",
          "IC",
          "J",
          "MAG",
          "MM",
          "NA",
          "NR",
          "SC",
          "SE",
          "SF",
          "SH",
          "SL",
          "SN",
          "SP",
          "SSC",
          "SSO",
          "SY",
          "UNA",
          "UNKNOWN",
          "VA",
          "VCN",
          "VCP",
          "VSV",
          "VV",
          "VX",
          "XPN",
          "XR",
          "XSA",
          "XSN",
          "XSV"
        ]
      }
    }
  }
}

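Without a custom analyzer, text fields are indexed with the standard analyzer by default. It splits text with the standard tokenizer (Unicode word boundaries), lowercases the tokens, and includes a stop-word filter that is disabled by default.

I installed nori and additionally defined a part-of-speech stop filter, so the only processing in place is morphological analysis and removal of the listed tags.

On top of this I want to use an ngram filter to implement the autocomplete keyword feature.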

GET /slowdelivery/_termvectors/1?fields=shopName

{
    "_index": "slowdelivery",
    "_type": "_doc",
    "_id": "1",
    "_version": 9,
    "found": true,
    "took": 71,
    "term_vectors": {
        "shopName": {
            "field_statistics": {
                "sum_doc_freq": 34,
                "doc_count": 8,
                "sum_ttf": 34
            },
            "terms": {
                "떡": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 1,
                            "start_offset": 2,
                            "end_offset": 3
                        }
                    ]
                },
                "떡볶이": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 1,
                            "start_offset": 2,
                            "end_offset": 5
                        }
                    ]
                },
                "볶이": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 2,
                            "start_offset": 3,
                            "end_offset": 5
                        }
                    ]
                },
                "신전": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 2
                        }
                    ]
                }
            }
        }
    }
}

Installing the jaso analyzer

Add the jaso analyzer plugin to the Dockerfile.

RUN elasticsearch-plugin install https://github.com/skyer9/elasticsearch-jaso-analyzer/releases/download/v7.15.1/jaso-analyzer-plugin-7.15.1-plugin.zip

I use Spring Data Elasticsearch, so I was running Elasticsearch 7. The jaso analyzer has to be the same version as Elasticsearch itself, otherwise errors occur.

Build

$ docker build -t [image-name]:[tag] .

Configuring the jaso analyzer

Index settings (analysis)

{
  "analysis": {
    "filter": {
      "suggest_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 50
      },
      "nori_posfilter": {
        "type": "nori_part_of_speech",
        "stoptags": [
          "E",
          "IC",
          "J",
          "MAG",
          "MM",
          "NA",
          "NR",
          "SC",
          "SE",
          "SF",
          "SH",
          "SL",
          "SN",
          "SP",
          "SSC",
          "SSO",
          "SY",
          "UNA",
          "UNKNOWN",
          "VA",
          "VCN",
          "VCP",
          "VSV",
          "VV",
          "VX",
          "XPN",
          "XR",
          "XSA",
          "XSN",
          "XSV"
        ]
      }
    },
    "analyzer": {
      "suggest_search_analyzer": {
        "type": "custom",
        "tokenizer": "jaso_tokenizer",
        "filter": [
          "nori_posfilter"
        ]
      },
      "suggest_index_analyzer": {
        "type": "custom",
        "tokenizer": "jaso_tokenizer",
        "filter": [
          "suggest_filter"
        ]
      }
    }
  }
}
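
The suggest_filter above is an edge_ngram filter with min_gram 1 and max_gram 50, so every token coming out of the tokenizer is expanded into all of its prefixes. A minimal Python sketch of that expansion (my own illustration, not Elasticsearch code):

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 50) -> list[str]:
    """Return the prefixes of `token` whose length is in [min_gram, max_gram]."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


if __name__ == "__main__":
    # A jamo string such as the tokenizer might produce for "떡".
    print(edge_ngrams("ㄷㄷㅓㄱ"))  # ['ㄷ', 'ㄷㄷ', 'ㄷㄷㅓ', 'ㄷㄷㅓㄱ']
```

Only the index analyzer uses this filter. The search analyzer keeps the typed text as a single decomposed token, which then only has to equal one of the prefixes stored at index time.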

Field mapping

{
  "properties": {
    "id": {
      "type": "long"
    },
    "shopId": {
      "type": "long"
    },
    "shopName": {
      "type": "text",
      "analyzer": "suggest_index_analyzer",
      "search_analyzer": "suggest_search_analyzer"
    },
    "minOrderPrice": {
      "type": "integer"
    },
    "category": {
      "type": "keyword"
    },
    "rating": {
      "type": "float"
    },
    "menu": {
      "properties": {
        "menuId": {"type":  "long"},
        "menuName": {
          "type":  "text",
          "analyzer": "suggest_index_analyzer",
          "search_analyzer": "suggest_search_analyzer"
        }
      }
    }
  }
}

Result

http://localhost:9200/slowdelivery/_analyze

{
    "text" :"떡볶이",
    "analyzer":"suggest_index_analyzer" 
}

{
    "tokens": [
        {
            "token": "ㄷ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        },
        {
            "token": "ㄷㄷ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        },
        {
            "token": "ㄷㄷㅓ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        },
        {
            "token": "ㄷㄷㅓㄱ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        },
        {
            "token": "ㄷㄷㅓㄱㅂ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        },
        {
            "token": "ㄷㄷㅓㄱㅂㅗ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        },
        {
            "token": "ㄷㄷㅓㄱㅂㅗㄱ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        },
        {
            "token": "ㄷㄷㅓㄱㅂㅗㄱㄱ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        },
        {
            "token": "ㄷㄷㅓㄱㅂㅗㄱㄱㅇ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        },
        {
            "token": "ㄷㄷㅓㄱㅂㅗㄱㄱㅇㅣ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        }
    ]
}

{
    "text" :"떡볶이",
    "analyzer":"suggest_search_analyzer" 
}

{
    "tokens": [
        {
            "token": "ㄷㄷㅓㄱㅂㅗㄱㄱㅇㅣ",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        }
    ]
}
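
With the two analyzers behaving as shown above, an ordinary match query on shopName should be enough to drive autocomplete: the typed text is decomposed into one jamo token at search time and compared with the prefixes stored at index time. Below is a minimal sketch of such a request body in Python. The helper name `build_autocomplete_query` and the default size are my own choices, and I am assuming the body would be sent to the slowdelivery index's _search endpoint.

```python
import json


def build_autocomplete_query(text: str, field: str = "shopName", size: int = 10) -> dict:
    """Build a _search body that matches `text` against an autocomplete field.

    No analyzer is named here: the field's search_analyzer from the mapping
    is applied to `text` by Elasticsearch.
    """
    return {
        "size": size,
        "query": {"match": {field: {"query": text}}},
    }


if __name__ == "__main__":
    # Hypothetical partial input typed by a user.
    print(json.dumps(build_autocomplete_query("떡"), ensure_ascii=False, indent=2))
```

If the search analyzer turns the input "떡" into ㄷㄷㅓㄱ, that token is one of the prefixes listed in the index-time output above, so a document whose shopName contains 떡볶이 would be expected to match.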

https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
