Building an Autocomplete Program with Elasticsearch - Mapping

허석진 · January 7, 2024

Series: Elasticsearch (3/5)

This post uses Elasticsearch and Kibana 8.11.1.
Other 8.x versions should behave much the same, but if something fails on a different version, be sure to look up what changed.

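In the earlier posts of this series, I started Elasticsearch with Docker and sent requests to it with the Elasticsearch Java Low Level REST Client.
This time I want to use that setup to build a book search program with autocomplete.

It won't be a complete product: the cluster and node configuration is kept to a minimum, and the search target is limited to book titles to begin with.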

Mapping Query

PUT /books
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "max_ngram_diff": 30
    },
    "analysis": {
      "analyzer": {       
        "ngram-book": {
          "type": "custom",
          "tokenizer": "partial",
          "filter": [
            "lowercase"
          ]
        },
        "edge-book": {
          "type": "custom",
          "tokenizer": "edge",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "partial": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 30,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "edge": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 30,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "normalizer": {
        "normalizer-book": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_source": {
      "excludes": [
        "title_chosung",
        "title_jamo",
        "title_engtokor"
      ]
    },
    "properties": {
      "isbn": {
        "type": "keyword"
      },
      "title": {
        "type": "keyword",
        "normalizer": "normalizer-book",
        "fields": {
          "kor": {
            "type": "text",
            "analyzer": "nori"
          },
          "en": {
            "type": "text",
            "analyzer": "standard"
          },
          "edge": {
            "type": "text",
            "analyzer": "edge-book"
          },
          "partial": {
            "type": "text",
            "analyzer": "ngram-book"
          }
        }
      },
      "title_chosung": {
        "type": "keyword",
        "normalizer": "normalizer-book",
        "fields" : {
          "edge": {
            "type": "text",
            "analyzer": "edge-book"
          },
          "partial": {
            "type": "text",
            "analyzer": "ngram-book"
          }
        }
      },
      "title_jamo": {
        "type": "keyword",
        "normalizer": "normalizer-book",
        "fields" : {
          "edge": {
            "type": "text",
            "analyzer": "edge-book"
          },
          "partial": {
            "type": "text",
            "analyzer": "ngram-book"
          }
        }
      },
      "title_engtokor": {
        "type": "keyword",
        "normalizer": "normalizer-book",
        "fields" : {
          "edge": {
            "type": "text",
            "analyzer": "edge-book"
          },
          "partial": {
            "type": "text",
            "analyzer": "ngram-book"
          }
        }
      },
      "author": {
        "type": "keyword",
        "normalizer": "normalizer-book",
        "fields" : {
          "kor": {
            "type": "text",
            "analyzer": "nori"
          },
          "en": {
            "type": "text",
            "analyzer": "standard"
          },
          "edge": {
            "type": "text",
            "analyzer": "edge-book"
          },
          "partial": {
            "type": "text",
            "analyzer": "ngram-book"
          }
        }
      },
      "published_year": {
        "type": "date"
      }
    }
  }
}

The settings are admittedly rough, but I still have more to learn, so I will improve them later.
Let's start with an explanation of each part of analysis.

Analysis Explained

Before diving in: if you are not familiar with how Elasticsearch works, namely tokenization and the inverted index, please refer to this link ([Elastic Search] 기본 개념과 특징(장단점)).

Below is just the analysis part of the Mapping Query above.

"analysis": {
  "analyzer": {       
    "ngram-book": {
      "type": "custom",
      "tokenizer": "partial",
      "filter": [
        "lowercase"
      ]
    },
    "edge-book": {
      "type": "custom",
      "tokenizer": "edge",
      "filter": [
        "lowercase"
      ]
    }
  },
  "tokenizer": {
    "partial": {
      "type": "ngram",
      "min_gram": 2,
      "max_gram": 30,
      "token_chars": [
        "letter",
        "digit"
      ]
    },
    "edge": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 30,
      "token_chars": [
        "letter",
        "digit"
      ]
    }
  },
  "normalizer": {
    "normalizer-book": {
      "type": "custom",
      "filter": [
        "lowercase"
      ]
    }
  }
}

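An analyzer is made up of the following three components.

Character Filters: applied before the input text is tokenized; they modify the input text or remove specific characters.
Tokenizer: splits the text that has passed through the character filters into tokens, typically on whitespace, punctuation, and so on.
Token Filters: applied to each token produced by the tokenizer; they modify tokens or perform additional processing.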
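The order of the three stages can be sketched in a few lines of Python. This is only a toy model to show the flow, not how Elasticsearch is implemented; strip_html, whitespace_tokenizer, and lowercase are hypothetical stand-ins I made up for the example.

```python
import re
from typing import Callable, Iterable, List


def analyze(
    text: str,
    char_filters: Iterable[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: Iterable[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run text through character filters, then the tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)          # stage 1: rewrite the raw text
    tokens = tokenizer(text)              # stage 2: split the text into tokens
    for token_filter in token_filters:
        tokens = token_filter(tokens)     # stage 3: rewrite the token list
    return tokens


# Hypothetical stand-ins for each stage (not Elasticsearch code).
def strip_html(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


print(analyze("<b>Harry</b> Potter", [strip_html], whitespace_tokenizer, [lowercase]))
# prints ['harry', 'potter']
```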

My analyzers above don't have any character filters for now, so I'll start with the tokenizers.

Tokenizer Explained

Two custom tokenizers are defined above: partial and edge.

partial tokenizer

"partial": {
  "type": "ngram",
  "min_gram": 2,
  "max_gram": 30,
  "token_chars": [
    "letter",
    "digit"
  ]
}

The partial tokenizer uses ngram.
This tokenizer cuts its input into tokens of n adjacent characters.
The query above sets min_gram: 2 and max_gram: 30, so it emits every run of 2 to 30 adjacent characters as a token. The example below should make this easier to understand.

"This is a" -> "Th", "Thi", "This", "hi", "his", "is", "is"

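"a" has no adjacent character within its word, which makes it shorter than min_gram, so it produces no token. Every other run of two or more adjacent characters within a word becomes a token. Whitespace is not listed in token_chars, so tokens never span two words.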
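To double-check the example above, here is a small Python emulation of the partial tokenizer. It is a sketch of the behaviour as I understand it, not Elasticsearch code: the function name ngram_tokens is my own, and the regular expression only approximates token_chars: ["letter", "digit"].

```python
import re
from typing import List


def ngram_tokens(text: str, min_gram: int = 2, max_gram: int = 30) -> List[str]:
    """Emulate an ngram tokenizer that keeps only letters and digits."""
    tokens: List[str] = []
    # Anything that is not a letter or digit splits the text into separate words.
    for word in re.findall(r"[^\W_]+", text):
        for start in range(len(word)):
            for length in range(min_gram, max_gram + 1):
                if start + length > len(word):
                    break  # the gram would run past the end of the word
                tokens.append(word[start:start + length])
    return tokens


print(ngram_tokens("This is a"))
# prints ['Th', 'Thi', 'This', 'hi', 'his', 'is', 'is']
```

The same function also shows why a fragment from the middle of a title can be found: "카반" is among the tokens of "해리포터와 아즈카반의 죄수".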

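This tokenizer is useful when you can't remember the full title of the book you want. For example, you are looking for 해리포터와 아즈카반의 죄수 but only remember 카반.

However, as the example above shows, it generates the largest number of token combinations, so fields that use this tokenizer should be given a lower boost value.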

edge tokenizer

"edge": {
  "type": "edge_ngram",
  "min_gram": 1,
  "max_gram": 30,
  "token_chars": [
    "letter",
    "digit"
  ]
}

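The edge tokenizer uses edge_ngram.
This tokenizer cuts tokens of n adjacent characters anchored at the beginning of each word, so every token is a prefix of that word.
Everything else works the same as the ngram tokenizer described above, so let's go straight to an example.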

"This is a" -> "T", "Th", "Thi", "This", "i", "is", "a"

Because min_gram is set to 1, one-character tokens such as "T", "i", and "a" are added, but overall it produces fewer tokens than ngram.
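The edge tokenizer can be emulated the same way. Again, this is only a sketch under my own assumptions (the name edge_ngram_tokens is made up, and the regular expression approximates token_chars), meant to make the prefix behaviour visible.

```python
import re
from typing import List


def edge_ngram_tokens(text: str, min_gram: int = 1, max_gram: int = 30) -> List[str]:
    """Emulate an edge_ngram tokenizer: only prefixes of each word are emitted."""
    tokens: List[str] = []
    # Anything that is not a letter or digit splits the text into separate words.
    for word in re.findall(r"[^\W_]+", text):
        longest = min(max_gram, len(word))
        for length in range(min_gram, longest + 1):
            tokens.append(word[:length])  # always anchored at the start of the word
    return tokens


print(edge_ngram_tokens("This is a"))
# prints ['T', 'Th', 'Thi', 'This', 'i', 'is', 'a']
```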

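So what is the advantage of this tokenizer? In my view, the advantage is precisely that it produces fewer tokens.

Suppose you want to find a book called "housekeeper" and type "housek" as the search term. The ngram tokenizer breaks this into unwanted tokens such as "us" and "use" and searches on them as well. The edge_ngram tokenizer, on the other hand, only produces tokens that start at the beginning of the word, so it cuts down on the unwanted matches that the ngram tokenizer brings in.

Of course, search becomes less flexible to the same degree, so I can't claim that "edge_ngram is always better than ngram!". Still, when searching, people tend to remember the beginning of what they are looking for, which makes this trade-off worth accepting.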
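The difference is easy to see by comparing which tokens the query "housek" shares with an unrelated title. In the sketch below, "museum" is a hypothetical title I picked for illustration, and the two helper functions are simplified single-word emulations rather than Elasticsearch code.

```python
from typing import Set


def ngrams(word: str, min_gram: int, max_gram: int) -> Set[str]:
    """All substrings of the word whose length is between min_gram and max_gram."""
    return {
        word[start:start + length]
        for start in range(len(word))
        for length in range(min_gram, max_gram + 1)
        if start + length <= len(word)
    }


def edge_ngrams(word: str, min_gram: int, max_gram: int) -> Set[str]:
    """All prefixes of the word whose length is between min_gram and max_gram."""
    return {word[:length] for length in range(min_gram, min(max_gram, len(word)) + 1)}


query, unrelated_title = "housek", "museum"  # "museum" is a made-up example title

# Tokens the query has in common with the unrelated title under each tokenizer.
shared_ngram = ngrams(query, 2, 30) & ngrams(unrelated_title, 2, 30)
shared_edge = edge_ngrams(query, 1, 30) & edge_ngrams(unrelated_title, 1, 30)

print(sorted(shared_ngram))  # prints ['se', 'us', 'use']
print(sorted(shared_edge))   # prints []
```

With edge_ngram the unrelated title shares no token with the query, although any title containing a word that starts with "h" would still overlap on the one-character token.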

Analyzer

Now that we have seen the custom tokenizers, the analyzer definitions below turn out to be simple: each one just adds lowercase, a filter that converts uppercase to lowercase, to its tokenizer.

"ngram-book": {
  "type": "custom",
  "tokenizer": "partial",
  "filter": [
    "lowercase"
  ]
},
"edge-book": {
  "type": "custom",
  "tokenizer": "edge",
  "filter": [
    "lowercase"
  ]
}

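In addition to these, I use two more analyzers, nori and standard, for Korean and English text respectively. nori is provided by the analysis-nori plugin, which has to be installed separately, and standard is Elasticsearch's default general-purpose analyzer rather than an English-only one.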

"kor": {
  "type": "text",
  "analyzer": "nori"
},
"en": {
  "type": "text",
  "analyzer": "standard"
},

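Put simply, these analyze text as "language" rather than as individual "characters": standard splits text on word boundaries, while nori performs Korean morphological analysis.
For example, with nori, "한국어 분석기 테스트를 해볼까요?" is tokenized into something like "한국", "어", "분석", "기", "테스트", "하", "보".
Each token also carries part-of-speech information. The token "하" above was extracted from the verb "해볼까요?", so it is tagged as a verb.

Choosing an analyzer that fits each language in this way further improves search quality.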

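Mapping Explained

For title, the main search target, I split the data into base, chosung (initial consonant), jamo, and English-keyboard (engtokor) fields and added a suitable analyzer to each as a multi-field to make search more flexible.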

"mappings": {
  "_source": {
    "excludes": [
      "title_chosung",
      "title_jamo",
      "title_engtokor"
    ]
  },
  "properties": {
    "isbn": {
      "type": "keyword"
    },
    "title": {
      "type": "keyword",
      "normalizer": "normalizer-book",
      "fields": {
        "kor": {
          "type": "text",
          "analyzer": "nori"
        },
        "en": {
          "type": "text",
          "analyzer": "standard"
        },
        "edge": {
          "type": "text",
          "analyzer": "edge-book"
        },
        "partial": {
          "type": "text",
          "analyzer": "ngram-book"
        }
      }
    },
    "title_chosung": {
      "type": "keyword",
      "normalizer": "normalizer-book",
      "fields" : {
        "edge": {
          "type": "text",
          "analyzer": "edge-book"
        },
        "partial": {
          "type": "text",
          "analyzer": "ngram-book"
        }
      }
    },
    "title_jamo": {
      "type": "keyword",
      "normalizer": "normalizer-book",
      "fields" : {
        "edge": {
          "type": "text",
          "analyzer": "edge-book"
        },
        "partial": {
          "type": "text",
          "analyzer": "ngram-book"
        }
      }
    },
    "title_engtokor": {
      "type": "keyword",
      "normalizer": "normalizer-book",
      "fields" : {
        "edge": {
          "type": "text",
          "analyzer": "edge-book"
        },
        "partial": {
          "type": "text",
          "analyzer": "ngram-book"
        }
      }
    },
    "author": {
      "type": "keyword",
      "normalizer": "normalizer-book",
      "fields" : {
        "kor": {
          "type": "text",
          "analyzer": "nori"
        },
        "en": {
          "type": "text",
          "analyzer": "standard"
        },
        "edge": {
          "type": "text",
          "analyzer": "edge-book"
        },
        "partial": {
          "type": "text",
          "analyzer": "ngram-book"
        }
      }
    },
    "published_year": {
      "type": "date"
    }
  }
}

_source Explained

"_source": {
  "excludes": [
    "title_chosung",
    "title_jamo",
    "title_engtokor"
  ]
},

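First, these three fields are excluded from _source.
The reason: when a document with title: 해리포터 is stored, these fields become title_chosung: ㅎㄹㅍㅌ, title_jamo: ㅎㅐㄹㅣㅍㅗㅌㅓ, and title_engtokor: goflvhxj. Such values only take up storage space and are not needed in search results. Nobody who types ㅎㄹㅍㅌ into the search box expecting 해리포터 wants to get ㅎㄹㅍㅌ back as the result! The fields are still indexed, so they remain searchable even though they are not returned.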
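These three values have to be computed by the application before indexing. The sketch below shows one way to derive them in Python using Unicode arithmetic on precomposed Hangul syllables. It is my own illustration, not code from this project: derive_fields is a hypothetical helper, the key table assumes the standard two-set (dubeolsik) keyboard layout, and compound vowels and final consonants are kept as single jamo.

```python
from typing import Dict

HANGUL_BASE = 0xAC00      # code point of "가", the first precomposed syllable
SYLLABLE_COUNT = 11172    # 19 initials * 21 vowels * 28 finals

CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Keys pressed on a QWERTY keyboard for each jamo (two-set layout, an assumption).
CHO_KEYS = "r R s e E f a q Q t T d w W c z x v g".split()
JUNG_KEYS = "k o i O j p u P h hk ho hl y n nj np nl b m ml l".split()
JONG_KEYS = [""] + "r R rt s sw sg e f fr fa fq ft fx fv fg a q qt t T d w c z x v g".split()


def derive_fields(title: str) -> Dict[str, str]:
    """Build the chosung, jamo, and English-keyboard variants of a title."""
    chosung, jamo, keys = [], [], []
    for ch in title:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < SYLLABLE_COUNT:
            cho, rest = divmod(offset, 21 * 28)   # index of the initial consonant
            jung, jong = divmod(rest, 28)         # indexes of the vowel and the final
            chosung.append(CHO[cho])
            jamo.append(CHO[cho] + JUNG[jung] + JONG[jong])
            keys.append(CHO_KEYS[cho] + JUNG_KEYS[jung] + JONG_KEYS[jong])
        else:
            # Non-Hangul characters (spaces, digits, Latin letters) pass through.
            chosung.append(ch)
            jamo.append(ch)
            keys.append(ch)
    return {
        "title": title,
        "title_chosung": "".join(chosung),
        "title_jamo": "".join(jamo),
        "title_engtokor": "".join(keys),
    }


print(derive_fields("해리포터"))
```

For 해리포터 this produces the same three values as described above: ㅎㄹㅍㅌ, ㅎㅐㄹㅣㅍㅗㅌㅓ, and goflvhxj.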

properties Explained

There are currently seven fields:
isbn, title, title_chosung, title_jamo, title_engtokor, author, published_year
Apart from the title-related fields and author, none of them should be difficult, so I will only explain those.

"properties": {
  "isbn": {
    "type": "keyword"
  },
  "title": {
    "type": "keyword",
    "normalizer": "normalizer-book",
    "fields": {
      "kor": {
        "type": "text",
        "analyzer": "nori"
      },
      "en": {
        "type": "text",
        "analyzer": "standard"
      },
      "edge": {
        "type": "text",
        "analyzer": "edge-book"
      },
      "partial": {
        "type": "text",
        "analyzer": "ngram-book"
      }
    }
  },
  "title_chosung": {
    "type": "keyword",
    "normalizer": "normalizer-book",
    "fields" : {
      "edge": {
        "type": "text",
        "analyzer": "edge-book"
      },
      "partial": {
        "type": "text",
        "analyzer": "ngram-book"
      }
    }
  },
  "title_jamo": {
    "type": "keyword",
    "normalizer": "normalizer-book",
    "fields" : {
      "edge": {
        "type": "text",
        "analyzer": "edge-book"
      },
      "partial": {
        "type": "text",
        "analyzer": "ngram-book"
      }
    }
  },
  "title_engtokor": {
    "type": "keyword",
    "normalizer": "normalizer-book",
    "fields" : {
      "edge": {
        "type": "text",
        "analyzer": "edge-book"
      },
      "partial": {
        "type": "text",
        "analyzer": "ngram-book"
      }
    }
  },
  "author": {
    "type": "keyword",
    "normalizer": "normalizer-book",
    "fields" : {
      "kor": {
        "type": "text",
        "analyzer": "nori"
      },
      "en": {
        "type": "text",
        "analyzer": "standard"
      },
      "edge": {
        "type": "text",
        "analyzer": "edge-book"
      },
      "partial": {
        "type": "text",
        "analyzer": "ngram-book"
      }
    }
  },
  "published_year": {
    "type": "date"
  }
}

The title and author fields

"title": {
  "type": "keyword",
  "normalizer": "normalizer-book",
  "fields": {
    "kor": {
      "type": "text",
      "analyzer": "nori"
    },
    "en": {
      "type": "text",
      "analyzer": "standard"
    },
    "edge": {
      "type": "text",
      "analyzer": "edge-book"
    },
    "partial": {
      "type": "text",
      "analyzer": "ngram-book"
    }
  }
}

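The title field itself uses the keyword type so that it matches only when the whole title matches exactly; the normalizer-book normalizer lowercases the value, so the match is case-insensitive.
The title.kor field uses the nori analyzer so that searches follow the characteristics of Korean.
The title.en field uses the standard analyzer so that searches follow the characteristics of English.
The title.edge field uses the custom edge-book analyzer so that a book is found when the user remembers only the beginning of a word in the title and types just that leading part.
The title.partial field uses the custom ngram-book analyzer so that a book is found when the user remembers only some part of a word in the title and types just that fragment.
The author field is configured the same way.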
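To show how these sub-fields might be combined at search time, with the partial field boosted lower as suggested earlier, here is a sketch that builds a multi_match request body in Python. The boost numbers and the helper name build_title_query are purely my assumptions for illustration and are not tuned values from this project.

```python
import json
from typing import Dict

# Hypothetical boosts: exact match highest, ngram-based partial match lowest.
TITLE_FIELD_BOOSTS: Dict[str, int] = {
    "title": 10,
    "title.kor": 5,
    "title.en": 5,
    "title.edge": 3,
    "title.partial": 1,
}


def build_title_query(text: str) -> dict:
    """Build a multi_match body that searches every title sub-field with a boost."""
    # "field^boost" is the Query DSL syntax for a per-field boost.
    fields = [f"{name}^{boost}" for name, boost in TITLE_FIELD_BOOSTS.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# The resulting JSON can be sent as the body of GET /books/_search.
print(json.dumps(build_title_query("카반"), ensure_ascii=False, indent=2))
```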

Other title-related fields

"title_chosung": {
  "type": "keyword",
  "normalizer": "normalizer-book",
  "fields" : {
    "edge": {
      "type": "text",
      "analyzer": "edge-book"
    },
    "partial": {
      "type": "text",
      "analyzer": "ngram-book"
    }
  }
},

title_chosung, title_jamo, and title_engtokor don't need the language-aware nori or standard analyzers that title uses, so they only get the remaining sub-fields, edge and partial.

Wrapping Up

I lost count of how many times I tore up and rewrote this index configuration. It reminded me how important it is to test over and over with GET _analyze requests in the Kibana console while writing it...
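The same _analyze check can also be scripted instead of typed into the Kibana console. The sketch below uses only the Python standard library and makes several assumptions: the node is reachable at http://localhost:9200 over plain HTTP without authentication (8.x enables security by default, so the URL and credentials will likely need adjusting), and the helper names are my own.

```python
import json
import urllib.request
from typing import List

ES_URL = "http://localhost:9200"  # assumption: local node with security disabled


def build_analyze_request(index: str, analyzer: str, text: str) -> urllib.request.Request:
    """Build a POST /<index>/_analyze request for one of the index's analyzers."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_tokens(index: str, analyzer: str, text: str) -> List[str]:
    """Send the request and return only the token strings from the response."""
    with urllib.request.urlopen(build_analyze_request(index, analyzer, text)) as response:
        return [token["token"] for token in json.load(response)["tokens"]]


# Example call (requires a running cluster with the books index created):
# analyze_tokens("books", "edge-book", "This is a")
```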

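When I first defined these fields, I copied them straight from an online course, but they didn't behave the way I wanted and everything turned into a mess. In the end, taking the time to study and write them myself worked out much better.

Starting from the course material was certainly far easier than starting from scratch, but to adapt it to my own needs I think it's best to drop the habit of sticking to the original.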
