[NMT] COMET: A Neural Metric for Translation Quality Evaluation

Judy · June 8, 2025


Subtitle: Understanding COMET in 10 minutes with ChatGPT

COMET?

Crosslingual Optimized Metric for Evaluation of Translation

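A neural metric for translation quality evaluation that has drawn a lot of attention in recent years.
Unlike BLEU, METEOR, and TER, it does not rely on surface-level matching; the model is designed to capture semantic similarity.
It is also flexible, since it can be used for both reference-based and reference-less (QE-style) evaluation.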
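To make "surface-level matching" concrete, here is a minimal sketch of clipped n-gram precision, the quantity at the core of BLEU (no brevity penalty, no smoothing). The helper names `ngrams` and `ngram_precision` are made up for this illustration and are not part of any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

reference = "the dog is sleeping on the couch"
paraphrase = "the dog sleeps on the sofa"        # same meaning, different words
scrambled = "the couch is sleeping on the dog"   # wrong meaning, same words

for name, hyp in [("paraphrase", paraphrase), ("scrambled", scrambled)]:
    p1 = ngram_precision(hyp, reference, 1)
    p2 = ngram_precision(hyp, reference, 2)
    print(f"{name}: unigram={p1:.2f}, bigram={p2:.2f}")
# paraphrase: unigram=0.67, bigram=0.40
# scrambled: unigram=1.00, bigram=0.83
```

The scrambled sentence gets higher overlap than the faithful paraphrase, which is the failure mode a meaning-based metric is meant to avoid.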

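🌟 Overview

COMET comes in three main variants.

COMET (reference-based)
→ Predicts quality from three inputs: source, MT, and reference.

COMET-QE (Quality Estimation)
→ Predicts quality from source and MT only, without a reference.

COMETKiwi (reference-free QE)
→ Combines the COMET framework with OpenKiwi-style QE. The WMT22 system also covered word-level quality estimation, while the released checkpoints are mainly used for sentence-level scores.

Like OpenKiwi, COMET is trained as a regression task, and it is judged by how well its predictions agree with human quality judgments (e.g., DA scores, MQM).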

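🧠 Model architecture

COMET uses a Transformer-based multilingual encoder (typically XLM-RoBERTa) as its backbone to encode the following inputs:

Input format

source (the original sentence)
mt (the machine translation output)
reference (the gold translation) ← only for the reference-based model

In the reference-based estimator model of Rei et al. (2020), the three sentences are not concatenated into one tagged sequence (with prefixes such as <SRC>, <MT>, <REF>). Each sentence is passed through the same shared encoder on its own:

E(source_sentence), E(mt_sentence), E(ref_sentence)

The encoder produces contextual token embeddings for each sentence, and pooling them gives the sentence-level representations: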

$\mathbf{h}_{\mathrm{src}}=\operatorname{Pool}(E(\mathrm{source}))$
$\mathbf{h}_{\mathrm{mt}}=\operatorname{Pool}(E(\mathrm{mt}))$
$\mathbf{h}_{\mathrm{ref}}=\operatorname{Pool}(E(\mathrm{reference}))$

Here $\operatorname{Pool}$ is typically the [CLS] token or mean pooling, and $E$ is the encoder output.
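The two pooling options can be sketched in plain Python on a toy encoder output. This is only an illustration of the idea: the function names `cls_pool` and `mean_pool` and the numbers are made up, and a real encoder returns tensors rather than nested lists.

```python
def cls_pool(token_embeddings):
    """Use the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy encoder output E(sentence): 4 positions x 3 dimensions.
E = [
    [1.0, 0.0, 2.0],  # [CLS]
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
    [9.0, 9.0, 9.0],  # padding, must not affect the mean
]
mask = [1, 1, 1, 0]

print(cls_pool(E))         # [1.0, 0.0, 2.0]
print(mean_pool(E, mask))  # [2.0, 2.0, 2.0]
```

The mask matters with mean pooling: without it, padding vectors would leak into the sentence embedding of shorter sentences in a batch.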

🔧 Regression head

Next, an MLP regression model takes the relationships between the sentence embeddings as input and predicts the DA score (e.g., 0~1).
The input features are typically the following.

$\mathbf{x}=\left[\mathbf{h}_{\mathrm{mt}} ; \mathbf{h}_{\mathrm{ref}} ; \mathbf{h}_{\mathrm{mt}} \odot \mathbf{h}_{\mathrm{src}} ; \mathbf{h}_{\mathrm{mt}} \odot \mathbf{h}_{\mathrm{ref}} ; \left|\mathbf{h}_{\mathrm{mt}}-\mathbf{h}_{\mathrm{src}}\right| ; \left|\mathbf{h}_{\mathrm{mt}}-\mathbf{h}_{\mathrm{ref}}\right|\right]$

Here $\odot$ is element-wise multiplication, $|\cdot|$ is the element-wise absolute value, and $[;]$ is concatenation. This is the layout given in Rei et al. (2020): the differences are absolute, and the source embedding enters only through its interactions with the MT embedding.
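One concrete layout, the one given in Rei et al. (2020), is $[\mathbf{h}_{\mathrm{mt}}; \mathbf{h}_{\mathrm{ref}}; \mathbf{h}_{\mathrm{mt}} \odot \mathbf{h}_{\mathrm{src}}; \mathbf{h}_{\mathrm{mt}} \odot \mathbf{h}_{\mathrm{ref}}; |\mathbf{h}_{\mathrm{mt}}-\mathbf{h}_{\mathrm{src}}|; |\mathbf{h}_{\mathrm{mt}}-\mathbf{h}_{\mathrm{ref}}|]$. A minimal sketch of building it with plain lists follows; `build_features` is a hypothetical helper name and the two-dimensional embeddings are toy values.

```python
def build_features(h_mt, h_ref, h_src):
    """Concatenate [mt; ref; mt*src; mt*ref; |mt-src|; |mt-ref|]."""
    prod_src = [m * s for m, s in zip(h_mt, h_src)]
    prod_ref = [m * r for m, r in zip(h_mt, h_ref)]
    diff_src = [abs(m - s) for m, s in zip(h_mt, h_src)]
    diff_ref = [abs(m - r) for m, r in zip(h_mt, h_ref)]
    return h_mt + h_ref + prod_src + prod_ref + diff_src + diff_ref

# Toy 2-dimensional sentence embeddings.
h_src = [1.0, 2.0]
h_mt = [0.5, -1.0]
h_ref = [0.5, 1.0]

x = build_features(h_mt, h_ref, h_src)
print(len(x))  # 12, i.e. 6 blocks of the embedding size
print(x)       # [0.5, -1.0, 0.5, 1.0, 0.5, -2.0, 0.25, -1.0, 0.5, 3.0, 0.0, 2.0]
```

With an encoder of hidden size d, the MLP input therefore has size 6d.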

This feature vector is fed to the MLP, and the output $\hat{y}$ is used as the predicted score:
$\hat{y}=\operatorname{MLP}(\mathbf{x})$

The loss is usually MSE:
$\mathcal{L}=\frac{1}{N} \sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}$
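As a worked example of the MSE formula, the sketch below compares three human scores with three predictions. The numbers are invented for illustration and do not come from any real model.

```python
def mse(targets, predictions):
    """Mean squared error between human scores y and predictions y_hat."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)) / len(targets)

human_scores = [0.9, 0.4, 0.7]  # hypothetical human DA scores
predicted = [0.8, 0.6, 0.7]     # hypothetical model outputs

# (0.1^2 + 0.2^2 + 0.0^2) / 3 = 0.05 / 3
print(f"{mse(human_scores, predicted):.4f}")  # 0.0167
```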

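📊 Training and evaluation data

WMT DA (Direct Assessment): quality scores assigned by human annotators. COMET is trained as a regressor to match them.
It can be trained on many language pairs and domains.
The training data is sometimes augmented, and in many cases only fine-tuning is done on top of multilingual pretraining.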

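💡 Features and advantages

Meaning-based evaluation → a good translation can score high even when lexical overlap is low.
Evaluation without a reference is possible (COMET-QE).
Can be fine-tuned → easy to customize for a specific domain or language pair.
High correlation with human judgments → among the top-performing metrics in the WMT Metrics shared task.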
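Agreement with human judgments is usually reported as a correlation coefficient such as Pearson's r or Kendall's tau. Below is a minimal sketch of Kendall's tau (the tau-a form, with ties counted as neither concordant nor discordant); the score lists are hypothetical and `kendall_tau` is an illustrative helper, not a library function.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both rank the pair in the same order
        elif sign < 0:
            discordant += 1   # the two rankings disagree on this pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

human = [0.9, 0.2, 0.6, 0.4]         # hypothetical human scores
metric_a = [0.85, 0.30, 0.70, 0.50]  # ranks segments exactly like the humans
metric_b = [0.40, 0.50, 0.90, 0.30]  # agrees on half of the pairs only

print(kendall_tau(human, metric_a))  # 1.0
print(kendall_tau(human, metric_b))  # 0.0
```

A metric with "high human correlation" is one whose tau (or r) against human scores is closer to 1 than that of competing metrics on the same data.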

Example quality levels by COMET score range

A COMET score is not a normalized probability; it is a real number predicted by a regression model.
In theory it can take any value from -∞ to +∞.

💙 BLEU score vs COMET

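Metric	Range	Meaning
BLEU	0.0 to 1.0 (or 0 to 100)	How much the n-grams overlap with the reference
COMET	Unbounded in theory (usually between -1 and +1)	Meaning-based quality prediction, close to human judgment scores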

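📐 What does the actual distribution look like?

In practice, COMET scores usually fall between -1 and +1.

As a rough guide (these are not calibrated thresholds):
a good translation scores 0.5 or higher,
a perfect translation, or one nearly identical to the reference, scores 0.9 or higher,
a bad translation, or one whose meaning is off, scores 0.0 or lower,
and complete garbage can go down to around -1.0.

However, this depends on the range the model was trained on. For example:

wmt22-comet-da was trained to fit DA (Direct Assessment) scores, so most of its outputs fall between 0 and 1.
Even so, the score is not a normalized probability, and it should be interpreted relatively.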

👀 Example

"ref": "The dog is sleeping on the couch."

"mt_1": "The dog sleeps on the sofa." → COMET score: **0.91**
"mt_2": "The sofa is sleeping on the dog." → COMET score: **-0.3**
"mt_3": "asdf qwerty uiop." → COMET score: **-0.9**

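🧪 Interpretation & summary

Relative comparison matters more than the absolute number.
COMET is used to score the outputs of several MT systems and compare which system is better.
Example: system A average COMET = 0.76, system B average COMET = 0.81 → B is better.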
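A comparison like the one above can be sketched as follows. The segment scores are hypothetical values chosen so that the averages come out to 0.76 and 0.81. The paired bootstrap check is an addition of this sketch, not something from the COMET API: it asks how often B still beats A when the test set is resampled, which helps judge whether a small gap is stable.

```python
import random

def system_score(segment_scores):
    """System-level score: the mean of the segment-level scores."""
    return sum(segment_scores) / len(segment_scores)

def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system B's mean beats system A's."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample segment indices with replacement, using the same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b > mean_a:
            wins += 1
    return wins / n_resamples

# Hypothetical segment-level COMET scores for the same 8 source sentences.
scores_a = [0.71, 0.80, 0.65, 0.90, 0.78, 0.74, 0.69, 0.81]
scores_b = [0.79, 0.83, 0.72, 0.88, 0.85, 0.80, 0.77, 0.84]

print(f"A: {system_score(scores_a):.2f}, B: {system_score(scores_b):.2f}")  # A: 0.76, B: 0.81
print(f"B wins in {paired_bootstrap_win_rate(scores_a, scores_b):.1%} of resamples")
```

A win rate close to 100% suggests the ranking does not hinge on a handful of segments; a rate near 50% means the two systems are hard to separate on this test set.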

Code

COMET checkpoints are hosted on the Hugging Face Hub and can be used right away!

✅ Installing the package

pip install unbabel-comet

Or install the latest version with Hugging Face support directly from GitHub:

pip install git+https://github.com/Unbabel/COMET@master

🤖 (With a reference) Evaluating Korean → English translation

from comet import download_model, load_from_checkpoint

# Download a recent COMET checkpoint (wmt22-comet-da, trained on DA scores)
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Sentences to evaluate (Korean source, MT output, reference translation)
data = [
    {
        "src": "나는 오늘 아침에 커피를 마셨다.",  # Korean source
        "mt":  "I drank coffee this morning.",  # machine translation output (MT)
        "ref": "I had coffee this morning."     # reference (gold) translation
    },
    {
        "src": "그녀는 책을 읽는 것을 좋아한다.",
        "mt":  "She likes reading books.",
        "ref": "She enjoys reading books."
    },
    {
        "src": "날씨가 좋기 때문에 우리는 소풍을 갔다.",
        "mt":  "Because the weather was nice, we went on a picnic.",
        "ref": "We went for a picnic because the weather was nice."
    }
]

# Run quality prediction
results = model.predict(data, batch_size=8, gpus=0)  # set gpus=1 to use a GPU

# Print the results
for i, result in enumerate(results["scores"]):
    print(f"Example {i+1} → COMET Score: {result:.4f}")

Output

Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 100342.20it/s]
Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.5.1.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/2760a223ac957f30acfb18c8aa649b01cf1d75f2/checkpoints/model.ckpt`
Encoder model frozen.
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Predicting DataLoader 0: 100%|██████████| 1/1 [00:04<00:00,  4.45s/it]
Example 1 → COMET Score: 0.9587
Example 2 → COMET Score: 0.9689
Example 3 → COMET Score: 0.9509

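With `wmt22-comet-da`, scores generally fall between 0 and 1, and values closer to 1 indicate better quality.
Because the metric correlates well with human judgments, it is widely used to compare MT systems in practice.

🤖 (Without a reference: COMET-QE) Evaluating Korean → English translation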

from comet import download_model, load_from_checkpoint

# Download COMET-QE, the reference-free variant
model_path = download_model("Unbabel/wmt20-comet-qe-da")  # or a newer QE model
model = load_from_checkpoint(model_path)

# Data to evaluate: only source and MT are provided (no reference)
data = [
    {
        "src": "나는 오늘 아침에 커피를 마셨다.",
        "mt":  "I drank coffee this morning."
    },
    {
        "src": "그녀는 책을 읽는 것을 좋아한다.",
        "mt":  "She likes reading books."
    },
    {
        "src": "날씨가 좋기 때문에 우리는 소풍을 갔다.",
        "mt":  "Because the weather was nice, we went on a picnic."
    }
]

# Run quality prediction (without a reference)
results = model.predict(data, batch_size=8, gpus=0)

# Print the results
for i, score in enumerate(results["scores"]):
    print(f"Example {i+1} → COMET-QE Score: {score:.4f}")

Output

Fetching 5 files: 100%|██████████| 5/5 [02:32<00:00, 30.56s/it]
Lightning automatically upgraded your loaded checkpoint from v1.3.5 to v2.5.1.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../.cache/huggingface/hub/models--Unbabel--wmt20-comet-qe-da/snapshots/2e7ffc84fb67d99cf92506611766463bb9230cfb/checkpoints/model.ckpt`
Encoder model frozen.
/Users/judy/Downloads/myenv/lib/python3.12/site-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00,  1.01it/s]
Example 1 → COMET-QE Score: 0.5950
Example 2 → COMET-QE Score: 0.6835
Example 3 → COMET-QE Score: 0.2709

❗ Summary of differences

Item	Reference-based COMET	COMET-QE
Input	`src`, `mt`, `ref`	`src`, `mt` (❌ `ref`)
Model used	`"Unbabel/wmt22-comet-da"`	`"Unbabel/wmt20-comet-qe-da"`
Task type	Translation evaluation (DA-based)	Quality Estimation (QE)
Notes	Most accurate when a reference is available	Can evaluate quickly without a reference

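🎁 TIP: Model list

Purpose	Model name
General evaluation (with reference)	`Unbabel/wmt22-comet-da`
QE (without reference)	`Unbabel/wmt20-comet-qe-da`
QE (without reference, COMETKiwi)	`Unbabel/wmt22-cometkiwi-da`
QE (without reference, larger COMETKiwi)	`Unbabel/wmt23-cometkiwi-da-xl`

All of these are regression models that output a quality score, not classifiers. Check the model cards on the Hugging Face Hub for exact names and access conditions.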

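💬 Evaluating more data, faster (batching, system-level score)

For large datasets, loading the data from a .csv file with pandas is an efficient workflow.
Prediction is much faster on a GPU (set gpus=1).
`predict` returns segment-level scores together with a corpus-level `system_score`.

outputs = model.predict(data, batch_size=4, gpus=1)

# Segment-level scores
for score in outputs["scores"]:
    print(f"Score: {score:.4f}")

# Corpus-level score (mean of the segment-level scores)
print(f"System score: {outputs['system_score']:.4f}")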
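The CSV workflow mentioned above can be sketched like this. To keep the example dependency-free it uses the standard-library `csv` module instead of pandas, and the column names `src`, `mt`, `ref` as well as the helper `rows_to_samples` are assumptions made for this illustration. The actual call to `model.predict` is left as a comment, since it needs a downloaded checkpoint.

```python
import csv
import io

def rows_to_samples(csv_text, use_reference=True):
    """Turn CSV rows with src/mt(/ref) columns into the list of dicts COMET expects."""
    keys = ["src", "mt"] + (["ref"] if use_reference else [])
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [k for k in keys if k not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [{k: row[k] for k in keys} for row in reader]

# Inline CSV standing in for a file on disk; fields containing commas are quoted.
csv_text = '''src,mt,ref
그녀는 책을 읽는 것을 좋아한다.,She likes reading books.,She enjoys reading books.
"날씨가 좋기 때문에 우리는 소풍을 갔다.","Because the weather was nice, we went on a picnic.",We went for a picnic because the weather was nice.
'''

samples = rows_to_samples(csv_text)
qe_samples = rows_to_samples(csv_text, use_reference=False)  # for a QE model

print(len(samples), samples[0]["mt"])  # 2 She likes reading books.
print(sorted(qe_samples[0]))           # ['mt', 'src']

# The samples can then be scored as in the examples above, e.g.:
# outputs = model.predict(samples, batch_size=8, gpus=1)
```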

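Notes

🚀 Recent COMET variants

COMET-22: Among the best-performing metrics at WMT22. Based on XLM-RoBERTa.
COMETKiwi: Reference-free QE; the WMT22 submission also addressed word-level quality prediction.
COMET-LLM: (discussed since 2023) LLM-based evaluation is also being explored.

🔍 References

Rei et al. (2020), "COMET: A Neural Framework for MT Evaluation"
Unbabel (2021-2023), WMT Metrics Shared Task submission reports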
