[Elasticsearch] Token Filter Summary

Dahea Moon · May 19, 2020

Token analyzer elasticsearch token filter

apostrophe: Strips the apostrophe and every character that follows it in the token.

asciifolding: Converts non-ASCII characters to their ASCII equivalents, where one exists. Setting preserve_original: true also keeps the original token.

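cjk_bigram: Analyzes Korean, Chinese, and Japanese text. It does not do morphological analysis; the text is first split into words on whitespace, and each word is then broken into overlapping 2-character pieces (bigrams).
ex) 우리나라 -> 우리, 리나, 나라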
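
The bigram step can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements it; the function name cjk_bigrams is made up here, and emitting a lone character as a unigram is an assumption of this sketch.

```python
def cjk_bigrams(text):
    """Split on whitespace, then emit overlapping 2-character pieces per word."""
    out = []
    for word in text.split():
        if len(word) < 2:
            # Assumption of this sketch: a lone character is kept as-is.
            out.append(word)
        else:
            out.extend(word[i:i + 2] for i in range(len(word) - 1))
    return out


print(cjk_bigrams("우리나라"))  # ['우리', '리나', '나라']
```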

classic: Removes the English possessive ('s) from the end of words and the dots from acronyms. Used together with the classic tokenizer.

common grams: Generates bigrams for a configured set of common words. It can be used in place of the stop token filter when you do not want to ignore the common words completely.
ex) the quick fox is brown -> the, the_quick, quick, fox, fox_is, is, is_brown, brown (common_words = is, the)
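
A rough Python sketch of the rule behind that output: every token is kept, and a bigram is added whenever the token or its successor is a common word. The function name common_grams and the "_" separator argument are illustrative choices, not Elasticsearch API.

```python
def common_grams(tokens, common_words, separator="_"):
    """Keep every token; add a bigram when either side is a common word."""
    common = set(common_words)
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if token in common or nxt in common:
                out.append(token + separator + nxt)
    return out


print(common_grams("the quick fox is brown".split(), ["is", "the"]))
# ['the', 'the_quick', 'quick', 'fox', 'fox_is', 'is', 'is_brown', 'brown']
```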

conditional: Applies a set of token filters only to the tokens that match a given condition. The condition is supplied as a script.

decimal digit: Converts the digits of every script in Unicode (the Decimal_Number general category) to the Latin digits 0-9.

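delimited payload: Splits each token into a token and a payload at the configured delimiter.
The delimiter option changes the delimiter (default = |), and the encoding option sets the payload data type (float, int, or identity; default = float).
Payload data is stored in the field's term vectors, which requires mapping the field with term_vector: with_positions_payloads. When read back through the term vectors API, payload values appear base64-encoded.
Payload data cannot be used for scoring.
Open question: when, and for what purpose, are payloads actually used? --> needs revision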

PUT delimited_payload_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_plus_delimited": {
          "tokenizer": "whitespace",
          "filter": [ "plus_delimited" ]
        }
      },
      "filter": {
        "plus_delimited": {
          "type": "delimited_payload",
          "delimiter": "+",
          "encoding": "int"
        }
      }
    }
  }
}
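
To see what the filter above would do to a token such as fox+3, here is a small Python sketch. It assumes the int encoding stores the payload as a 4-byte big-endian integer (my understanding of Lucene's integer encoder) and that the split happens at the first delimiter; split_payload is a made-up helper name.

```python
import base64
import struct


def split_payload(token, delimiter="|"):
    """Return (term, base64 payload), or (term, None) if no delimiter is present."""
    term, sep, raw = token.partition(delimiter)
    if not sep:
        return term, None
    # Assumption: "int" encoding = 4-byte big-endian signed integer.
    encoded = struct.pack(">i", int(raw))
    return term, base64.b64encode(encoded).decode("ascii")


print(split_payload("fox+3", delimiter="+"))    # ('fox', 'AAAAAw==')
print(split_payload("quick", delimiter="+"))    # ('quick', None)
```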

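dictionary decompounder: Searches the text being analyzed by brute force for the words in a configured word list, and adds each word it finds to the token output.
The hyphenation decompounder is faster than the dictionary decompounder, so it is usually the better choice.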

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "meer", "schiff"]
    }
  ],
  "text": "Donaudampfschiff"
}

[Donaudampfschiff, Donau, dampf, schiff]
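
The brute-force search can be sketched in Python as follows. This is a simplified illustration, not the Lucene implementation; the size limits (min_word_size=5, min_subword_size=2, max_subword_size=15) are what I believe the filter's defaults to be, so treat them as assumptions.

```python
def dictionary_decompound(token, word_list,
                          min_word_size=5, min_subword_size=2, max_subword_size=15):
    """Emit the token, then every substring found in the word list (case-insensitive)."""
    dictionary = {word.lower() for word in word_list}
    out = [token]
    if len(token) < min_word_size:
        return out
    # Try every start offset and every allowed length: this is the brute-force part.
    for start in range(len(token) - min_subword_size + 1):
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(token):
                break
            candidate = token[start:start + size]
            if candidate.lower() in dictionary:
                out.append(candidate)
    return out


print(dictionary_decompound("Donaudampfschiff", ["Donau", "dampf", "meer", "schiff"]))
# ['Donaudampfschiff', 'Donau', 'dampf', 'schiff']
```

The nested loop is why this approach is slow on long tokens with large word lists.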

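hyphenation decompounder: Uses XML-based hyphenation patterns to find candidate subwords in the text being analyzed, checks them against the configured word list, and adds the matches to the token output.
The inline word_list can be replaced by word_list_path, which takes the path to a text file.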

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
      "word_list": ["Kaffee", "zucker", "tasse"]
    }
  ],
  "text": "Kaffeetasse"
}

[Kaffeetasse, Kaffee, tasse]

N-gram: A filter that breaks each token into substrings (n-grams) of the configured lengths.
Configured with min_gram (default = 1) and max_gram (default = 2).

GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "ngram" ],
  "text": "Quick fox"
}

[ Q, Qu, u, ui, i, ic, c, ck, k, f, fo, o, ox, x ]
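
The same output can be reproduced with a short Python sketch. It only illustrates the slicing order (by start position, then by length); ngram_filter is a made-up name.

```python
def ngram_filter(tokens, min_gram=1, max_gram=2):
    """For each token, emit every substring of length min_gram..max_gram."""
    out = []
    for token in tokens:
        for start in range(len(token)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(token):
                    out.append(token[start:start + size])
    return out


print(ngram_filter(["Quick", "fox"]))
# ['Q', 'Qu', 'u', 'ui', 'i', 'ic', 'c', 'ck', 'k', 'f', 'fo', 'o', 'ox', 'x']
```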

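Edge n-gram: An n-gram variant that only cuts grams starting from the beginning of each token.
Configured with min_gram and max_gram. min_gram defaults to 1; max_gram defaults to 1 for the built-in edge_ngram filter and to 2 for a custom filter.
When the filter is used in an index analyzer, search terms longer than max_gram may not match any indexed terms. For example, with max_gram = 3 a search for apple does not match the indexed term app. This is a limitation of edge n-grams, so use the filter with care.
Note that, unlike n-gram, only the grams anchored at the start of each token are stored.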

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 2
    }
  ],
  "text": "the quick brown fox jumps"
}

[ t, th, q, qu, b, br, f, fo, j, ju ]
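
A minimal Python sketch of the prefix slicing, which also makes the max_gram limitation visible. The function name edge_ngram_filter is illustrative only.

```python
def edge_ngram_filter(tokens, min_gram=1, max_gram=2):
    """For each token, emit its prefixes of length min_gram..max_gram."""
    out = []
    for token in tokens:
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            out.append(token[:size])
    return out


print(edge_ngram_filter("the quick brown fox jumps".split()))
# ['t', 'th', 'q', 'qu', 'b', 'br', 'f', 'fo', 'j', 'ju']

# With max_gram = 3, "apple" is indexed only as a, ap, app,
# so the full search term "apple" is not among the indexed terms.
indexed = edge_ngram_filter(["apple"], min_gram=1, max_gram=3)
print("apple" in indexed)  # False
```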

Elision: Removes elisions from the beginning of tokens in languages that use them (Catalan, French, Irish, Italian), e.g. l'avion -> avion.

Fingerprint: Sorts the tokens, removes duplicates, and concatenates what remains into a single output token.
The separator is configurable (default = a single space).
For example, for the token stream [ the, fox, was, very, very, quick ], the filter:
- Sorts the tokens alphabetically to [ fox, quick, the, very, very, was ]
- Removes the duplicate instance of the very token.
- Concatenates the token stream into a single output token: [ fox quick the very was ]
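
The three steps above fit in one line of Python. This is a sketch of the idea only; fingerprint is a made-up function name, and the input stream is the one the sorted list above implies.

```python
def fingerprint(tokens, separator=" "):
    """Sort, de-duplicate, and join the tokens into a single output token."""
    return [separator.join(sorted(set(tokens)))]


print(fingerprint(["the", "fox", "was", "very", "very", "quick"]))
# ['fox quick the very was']
```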

Synonym: A filter that handles synonyms during analysis.
Set synonyms_path to the path of the text file that holds the synonym rules.

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "whitespace",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt"
                    }
                }
            }
        }
    }
}

# synonym rules in synonym.txt

# replacement
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit

# equivalent synonyms
ipod, i-pod, i pod
foozball , foosball
universe , cosmos
lol, laughing out loud

# merging multiple synonym rules for the same term
foo => foo bar
foo => baz
# is equivalent to
foo => foo bar, baz

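# expand option: how an equivalent-synonym line such as "ipod, i-pod, i pod" is interpreted
# expand = true (default): equivalent to
ipod, i-pod, i pod => ipod, i-pod, i pod
# expand = false: equivalent to
ipod, i-pod, i pod => ipod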

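Synonym graph: Handles synonyms by building a token graph, so that multi-word synonyms are represented correctly. Usage and effect are the same as the synonym filter.
It should be used only in a search analyzer, not in an index analyzer.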

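Word delimiter: Splits tokens according to a set of rules:
- splits at non-alphanumeric characters
- splits at letter case transitions
- splits at letter-number transitions
- removes the English possessive ('s)
Using word delimiter graph instead of word delimiter is recommended.
The filter has many configuration parameters, so it pays to tune them carefully.
Word delimiter graph: Applies the same rules as word delimiter, but also assigns a positionLength attribute to multi-position tokens so that the output is a valid token graph.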
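
The splitting rules listed above can be approximated with a regular expression. This Python sketch handles ASCII only and ignores all of the filter's configuration parameters, so it is an approximation of the rules rather than a reimplementation; word_delimiter is a made-up name.

```python
import re

# A subword is: an optional capital followed by lowercase letters,
# a run of capitals not followed by a lowercase letter, or a run of digits.
_SUBWORD = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+")


def word_delimiter(token):
    """Approximate the default splitting rules for a single ASCII token."""
    if token.lower().endswith("'s"):
        token = token[:-2]  # drop the English possessive
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):  # non-alphanumeric delimiters
        parts.extend(_SUBWORD.findall(chunk))        # case and letter-number transitions
    return parts


print(word_delimiter("PowerShot500"))  # ['Power', 'Shot', '500']
print(word_delimiter("Wi-Fi"))         # ['Wi', 'Fi']
print(word_delimiter("Neil's"))        # ['Neil']
```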

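Flatten graph: Flattens a token graph produced by a graph token filter, such as synonym graph or word delimiter graph, into a flat form suitable for indexing. Indexing does not support token graphs that contain multi-position tokens, which is why graph filters need this step when used in an index analyzer.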

Hunspell: A filter that provides Hunspell stemming, i.e. dictionary-based stemming driven by Hunspell dictionaries.

keep types: Keeps only tokens of the specified types. Token types are assigned by the tokenizer.
The mode option controls whether the listed types are kept (include) or removed (exclude). (default = include)

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ]
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}

[ 1, 2 ]

keep words: Keeps only the tokens that are contained in the keep_words list.

keyword marker: Marks specified tokens as keywords so that they are stored as-is and are not stemmed by later stemmer filters.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": [ "jumping" ]
    },
    "stemmer"
  ],
  "text": "fox running and jumping"
}

[ fox, run, and, jumping ]
# jumping is stored in its original form instead of being stemmed to jump

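keyword repeat: Emits an unstemmed (keyword) copy of every token, so that the original form is stored alongside the stemmed one.
For a token such as and, whose stemmed form is identical to the original, this produces duplicates. Add the remove_duplicates token filter after the stemmer to drop them.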

KStem: KStem-based stemming for English.

length: Removes tokens that are shorter or longer than the configured lengths.
Set min and max to drop every token whose length falls outside that range.

limit token count: Limits the number of output tokens.

lowercase: Converts all tokens to lowercase.

uppercase: Converts all tokens to uppercase.

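min hash: Keeps only the smallest hash values of the token stream. Based on the MinHash technique for comparing similarity.
multiplexer: Runs several filters over each token and emits the output of all of them. preserve_original defaults to true, so the original token is emitted as well.
pattern capture: Emits the subwords that match the configured Java regular expression patterns. preserve_original defaults to true, so the original token is kept.
When highlighting is used, the original token is what gets highlighted.
pattern replace: Replaces every match of the configured Java regular expression with a configured replacement string.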
porter stem: Algorithmic stemming for English. It needs the lowercase filter before it to work properly.
predicate script: Removes the tokens that do not match the configured predicate script.
The predicate is written in the source field of the filter's script object.
remove duplicates: Removes duplicate tokens that share the same position.
Useful for removing duplicates when a stemmer is combined with a filter that supports preserve_original.
reverse: Reverses each token.
shingle: Token n-grams. Concatenates a configured number of adjacent tokens into a single token.
For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
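
A small Python sketch of the shingle example above. It produces only the shingles of one fixed size; as far as I know the real filter also emits the single tokens by default (output_unigrams), which this sketch leaves out. The name shingles and its parameters are illustrative.

```python
def shingles(tokens, size=2, separator=" "):
    """Join every run of `size` adjacent tokens into one shingle."""
    return [separator.join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]


print(shingles("please divide this sentence into shingles".split()))
# ['please divide', 'divide this', 'this sentence', 'sentence into', 'into shingles']
```
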
snowball: Snowball-generated stemmer.
stemmer: Algorithmic stemming for various languages.
stemmer override: Overrides stemming algorithms with custom rules. Set rules_path to the path of a text file containing the rules, and tokens are stemmed according to them. Useful for protecting the words covered by the rules from being stemmed.
stop: Removes the configured stop words from the token stream.
Without customization, the default (English) stop words are removed.
trim: Removes leading and trailing whitespace from each token in a stream.
truncate: Cuts tokens that exceed the configured length down to that length.
For example, you can use the truncate filter to shorten all tokens to 3 characters or fewer, changing jumping fox to jum fox.
unique: Removes all duplicate tokens regardless of position. With only_on_same_position set to true, it behaves the same as the remove_duplicates filter.
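
The difference between the two modes can be sketched in Python by modelling each token as a (term, position) pair. This representation and the function name unique are assumptions made for illustration only.

```python
def unique(tokens, only_on_same_position=False):
    """Drop repeated terms; optionally only when they share a position."""
    seen = set()
    out = []
    for term, position in tokens:
        key = (term, position) if only_on_same_position else term
        if key not in seen:
            seen.add(key)
            out.append((term, position))
    return out


stream = [("jumping", 0), ("jump", 0), ("and", 1), ("and", 1), ("jumping", 2)]
print(unique(stream))
# [('jumping', 0), ('jump', 0), ('and', 1)]
print(unique(stream, only_on_same_position=True))
# [('jumping', 0), ('jump', 0), ('and', 1), ('jumping', 2)]
```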
