Elasticsearch 형태소분석기

Simba_b·2022년 10월 28일

elasticsearch

Elatsticsearch

목록 보기

5/5

📖 Elasticsearch 한글 형태소 분석기

nori tokenizer

사용시 analysis-nori 플러그인 설치 필수
Elastic 가이드북 : elasticsearch-nori

Standard 토크나이저는 공백 외에 아무런 분리를 하지 못했지만, nori_tokenizer는 한국어 사전 정보를 이용해 아래와 같이 분리 가능

standard tokenizer	nori tokenizer
"token" : "동해물과"	"token" : "동해"
"token" : "백두산이	"token" : "물"
	"token" : "과"
	"token" : "백두"
	"token" : "산"
	"token" : "이"

nori tokenizer 옵션

user_dictionary : 사용자 사전이 저장된 파일의 경로를 입력합니다.
user_dictionary_rules : 사용자 정의 사전을 배열로 입력합니다.
decompound_mode : 합성어의 저장 방식을 결정합니다. 다음 3개의 값을 사용 가능합니다.
- none : 어근을 분리하지 않고 완성된 합성어만 저장합니다.
- discard (디폴트) : 합성어를 분리하여 각 어근만 저장합니다.
- mixed : 어근과 합성어를 모두 저장합니다.

💻 nori plugin 설치

cd /usr/share/elasticsearch
sudo bin/elasticsearch-plugin install analysis-nori

💻 elasticsearch 재시작

sudo service elasticsearch restart

📖 nori_tokenizer 적용

Kibana 사용

1. analysis

"analysis": {
      "tokenizer": {
        "nori_none": {
          "type": "nori_tokenizer",
          "decompound_mode": "none"
        },
        "nori_discard": {
          "type": "nori_tokenizer",
          "decompound_mode": "discard"
        },
        "nori_mixed": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed"
        }
      }
    }
  }

2. analyzer

nori_readingform : 한자로 된 단어를 한글로 바꾸어 저장

"analyzer": {
        "nori_none": {
          "type": "custom",
          "tokenizer": "nori_none",
          "filter": ["lowercase", "nori_readingform"]
        },
        "nori_mixed": {
          "type": "custom",
          "tokenizer": "nori_mixed",
          "filter": ["lowercase", "nori_readingform"]
        },
        "nori_discard": {
          "type": "custom",
          "tokenizer": "nori_discard",
          "filter": ["lowercase", "nori_readingform"]
        }
      }
    }

3. settings & mapping

PUT index_name
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "nori_none": {
          "type": "nori_tokenizer",
          "decompound_mode": "none"
        },
        "nori_mixed": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed"
        },
        "nori_discard": {
          "type": "nori_tokenizer",
          "decompound_mode": "discard"
        }
      },
      "analyzer": {
        "nori_none": {
          "type": "custom",
          "tokenizer": "nori_none",
          "filter": ["lowercase", "nori_readingform"]
        },
        "nori_mixed": {
          "type": "custom",
          "tokenizer": "nori_mixed",
          "filter": ["lowercase", "nori_readingform"]
        },
        "nori_discard": {
          "type": "custom",
          "tokenizer": "nori_discard",
          "filter": ["lowercase", "nori_readingform"]
        }
      }
    }
  },
  "mappings": {
		"properties": {
			"title" : {
            	type" : "text",
        		"analyzer": "nori_mixed",
        	"fields" : {
             	"keyword" : {
               		"type" : "keyword"
              	}
        	}
		}

✔️결과

원래는 기본 토크나이저로 뛰어쓰기를 기준으로 검색 되었지만,
nori_tokenizer 적용 후 형태소 분석으로 포함된 단어에 다양한 한글 어미가 붙어있어도 검색이 가능하다.

✔️_anlayze API

Elastic 가이드북 : _anlayze API

kibana 사용
mixed 옵션 사용 : 어근과 합성어를 모두 저장
explain 옵션 미사용 : 분석된 한글 형태소들의 품사 정보

GET index_name/_analyze
{
  "analyzer": "nori_mixed", 
  "text": "원래는이거안돼",
  "explain": false
}

✔️_anlayze 확인결과

Simba_b

이전 포스트

Elasticsearch 형태소분석기

Elatsticsearch

📖 Elasticsearch 한글 형태소 분석기

nori tokenizer

nori tokenizer 옵션

💻 nori plugin 설치

💻 elasticsearch 재시작

📖 nori_tokenizer 적용

1. analysis

2. analyzer

3. settings & mapping

✔️결과

✔️_anlayze API

✔️_anlayze 확인결과

Elasticsearch (python)

0개의 댓글