ROUGE Score (an Evaluation Metric for Text Summarization)

민정 · November 11, 2022

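What is the ROUGE Score?

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

A metric mainly used to evaluate text summarization models. It scores performance by comparing the label (a human-written reference summary) with the summary (the model's inference output).

Several variants exist, including ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S.
For each variant, it is best to compute both recall and precision.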

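Recall

: how many of the words that make up the label also appear in the inference

Checks first whether all the necessary information is included!

Precision

: how many of the words that make up the inference also appear in the label

Checks how much of the summary consists only of necessary information!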

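ROUGE-N

Based on the number of overlapping N-grams
ROUGE-1 uses unigrams, ROUGE-2 uses bigrams, …
Recall: number of N-grams overlapping with the output / number of N-grams in the label
Precision: number of N-grams overlapping with the label / number of N-grams in the output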
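
Written out as formulas, the two ratios above look roughly like this. This is a sketch under one assumption that the post does not spell out: repeated N-grams are counted with clipping, so an N-gram is matched at most as many times as it occurs in both texts. The symbols G_N (the set of distinct N-grams of the label) and c (an occurrence count) are notation introduced here for illustration.

```latex
\mathrm{overlap}_N = \sum_{g \in G_N(\mathrm{label})}
    \min\bigl(c_{\mathrm{label}}(g),\; c_{\mathrm{output}}(g)\bigr)

\mathrm{Recall}_N = \frac{\mathrm{overlap}_N}{\text{number of N-grams in the label}}
\qquad
\mathrm{Precision}_N = \frac{\mathrm{overlap}_N}{\text{number of N-grams in the output}}
```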

Summary (by model):
the cat was found under the bed
Reference:
the cat was under the bed
------------------------------
Summary (by model) (bigrams):
the cat, cat was, was found, found under, under the, the bed
Reference (bigrams):
the cat, cat was, was under, under the, the bed
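
To make the bigram example concrete, here is a minimal Python sketch of ROUGE-N recall and precision. It assumes whitespace tokenization and clipped N-gram counts; the helper names `ngrams` and `rouge_n` are made up for this illustration and do not come from any library. Four bigrams overlap (the cat, cat was, under the, the bed), so recall is 4/5 and precision is 4/6.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(summary, reference, n=2):
    """Return (recall, precision) for ROUGE-N with clipped n-gram counts."""
    summary_counts = Counter(ngrams(summary.split(), n))
    reference_counts = Counter(ngrams(reference.split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((summary_counts & reference_counts).values())
    # max(..., 1) avoids division by zero when a text is shorter than n tokens.
    recall = overlap / max(sum(reference_counts.values()), 1)
    precision = overlap / max(sum(summary_counts.values()), 1)
    return recall, precision


recall, precision = rouge_n(
    "the cat was found under the bed",  # summary (by model)
    "the cat was under the bed",        # reference
    n=2,
)
print(f"ROUGE-2 recall={recall:.2f}, precision={precision:.2f}")
```

Calling the same function with `n=1` gives the ROUGE-1 values for this pair.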

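ROUGE-L

Based on the LCS (Longest Common Subsequence) between the model output and the reference
Matches the longest of the common subsequences
Unlike N-grams, the algorithm takes word order into account, and the matched words do not need to be consecutive
Recall: LCS length / number of words in the label
Precision: LCS length / number of words in the output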

Reference: police killed the gunman
Summary-1: police kill the gunman
Summary-2: the gunman kill police
------------------------------
ROUGE-N:
Summary-1 = Summary-2 (ROUGE-1: both match "police", "the", "gunman"; ROUGE-2: both match only "the gunman")
------------------------------
ROUGE-L:
Summary-1 = 3/4 ("police the gunman")
Summary-2 = 2/4 ("the gunman")
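
The ROUGE-L numbers in this example can be checked with a short Python sketch. It assumes whitespace tokenization and uses the standard dynamic-programming table for the longest common subsequence; `lcs_length` and `rouge_l` are hypothetical helper names chosen for this illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(summary, reference):
    """Return (recall, precision) for ROUGE-L at the word level."""
    summary_tokens = summary.split()
    reference_tokens = reference.split()
    lcs = lcs_length(summary_tokens, reference_tokens)
    # max(..., 1) avoids division by zero for empty inputs.
    recall = lcs / max(len(reference_tokens), 1)
    precision = lcs / max(len(summary_tokens), 1)
    return recall, precision


reference = "police killed the gunman"
print(rouge_l("police kill the gunman", reference))  # Summary-1
print(rouge_l("the gunman kill police", reference))  # Summary-2
```

Because both summaries and the reference have four words, recall and precision coincide here: 3/4 for Summary-1 and 2/4 for Summary-2, even though the two summaries tie under ROUGE-N.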