Day 47: Natural Language Processing 1, BERT

차지예 · August 2, 2025

BERT: KorNLI / Sentiment Classification / Named Entity Recognition

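✅ KorNLI (Natural Language Inference) with BERT

📌 Problem definition

KorNLI is a Korean NLI benchmark dataset released by Kakao Brain.
The task is multi-class classification: given two sentences, decide how the second relates to the first.
- entailment, contradiction, neutral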

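📌 Input format

The two sentences are packed into a single sequence:
```
[CLS] sentence 1 [SEP] sentence 2 [SEP] [PAD]...
```
Each input produces three arrays:
- `input_ids`: integer-encoded token IDs
- `token_type_ids`: 0 for sentence 1, 1 for sentence 2
- `attention_mask`: 1 for real tokens, 0 for padding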

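📌 Model architecture (custom model)

import tensorflow as tf
from transformers import TFBertModel


class TFBertForSequenceClassification(tf.keras.Model):
    def __init__(self, model_name, num_labels):
        super().__init__()
        # from_pt=True loads the PyTorch checkpoint and converts it to TensorFlow weights
        self.bert = TFBertModel.from_pretrained(model_name, from_pt=True)
        # Softmax output, so the loss should use from_logits=False (the Keras default)
        self.classifier = tf.keras.layers.Dense(num_labels, activation='softmax')

    def call(self, inputs):
        input_ids, attention_mask, token_type_ids = inputs
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        cls_output = outputs[1]  # pooled [CLS] representation (pooler_output)
        return self.classifier(cls_output)

- Pretrained model: klue/bert-base
- Number of output classes: 3 (entailment, contradiction, neutral)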

📌 Performance

- Loss function: SparseCategoricalCrossentropy
- Test accuracy: about 78%
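
To make the loss concrete: for a single example, sparse categorical cross-entropy is the negative log of the probability the model assigns to the true class. The sketch below computes it by hand. The probabilities and the class order (entailment, contradiction, neutral) are assumptions for illustration, not values from the experiment.

```python
import math


def sparse_categorical_crossentropy(probs, label):
    """Cross-entropy for one example: -log(probability of the true class)."""
    return -math.log(probs[label])


# Hypothetical softmax output over (entailment, contradiction, neutral).
probs = [0.7, 0.2, 0.1]
loss_confident = sparse_categorical_crossentropy(probs, 0)  # true class got 0.7
loss_mistaken = sparse_categorical_crossentropy(probs, 2)   # true class got 0.1
print(round(loss_confident, 4), round(loss_mistaken, 4))    # 0.3567 2.3026
```

The label is an integer class index rather than a one-hot vector, which is what "sparse" refers to.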

✅ Naver movie review classification with BERT

📌 Problem definition

Binary classification of Naver movie reviews (positive vs. negative).

📌 Input preprocessing

Each input is a single sentence, so the sequence looks like this:
```
[CLS] review sentence [SEP] [PAD]...
```
- `token_type_ids`: all 0
- `attention_mask`: 1 for real tokens, 0 for padding

📌 Model

- Uses the same `TFBertForSequenceClassification` as the KorNLI section
- Number of output classes: 2 (positive/negative)

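✅ Named entity recognition (NER) with BERT

📌 Problem definition

A sequence labeling (token classification) problem: assign an entity tag to each word in the sentence.
It is a many-to-many problem, with one output tag per input token.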

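📌 Input preprocessing

Split the sentence into eojeol (space-delimited words), then run each one through the BERT subword tokenizer.
Only the first subword of each word gets the word's label; the remaining subwords get -100 (ignored by the loss).

Example

Word: '쿠마리' → ['쿠', '##마리']
Label: 'PER-B' → [1, -100]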
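
The first-subword rule can be sketched as a small alignment function. This is an illustration under assumptions: `align_labels`, `toy_pieces`, and `toy_label2id` are hypothetical, the dictionary stands in for a real subword tokenizer, and the second word is invented to show an `O` label.

```python
IGNORE_INDEX = -100


def align_labels(words, word_labels, tokenize, label2id):
    """Expand word-level labels to subword level, labeling only first subwords."""
    tokens, label_ids = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        if not pieces:
            continue  # tokenizer produced nothing for this word
        tokens.extend(pieces)
        # First subword carries the word's label; the rest are ignored by the loss.
        label_ids.extend([label2id[label]] + [IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, label_ids


# Hypothetical stand-ins for a real subword tokenizer and label map.
toy_pieces = {"쿠마리": ["쿠", "##마리"], "여신": ["여신"]}
toy_label2id = {"O": 0, "PER-B": 1}

tokens, label_ids = align_labels(
    ["쿠마리", "여신"], ["PER-B", "O"], lambda w: toy_pieces[w], toy_label2id
)
print(tokens)     # ['쿠', '##마리', '여신']
print(label_ids)  # [1, -100, 0]
```

In a full pipeline the [CLS], [SEP], and [PAD] positions would typically also be set to -100 so they never contribute to the loss.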

📌 Label set

29 BIO tags in total, covering entity types such as PER, ORG, LOC, DAT, and TIM.
Examples: PER-B, PER-I, O, ...
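
To show how BIO tags are read back into entities, here is a minimal decoding sketch. It assumes the `TYPE-B` / `TYPE-I` / `O` tag spelling used above; `decode_bio` is a hypothetical helper and the example sentence is invented.

```python
def decode_bio(words, tags):
    """Group words into (entity_type, text) spans from TYPE-B / TYPE-I / O tags."""
    spans, current_type, current_words = [], None, []
    for word, tag in zip(words, tags):
        entity_type, _, position = tag.rpartition("-")
        if position == "B":
            if current_words:  # close the span that was open
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = entity_type, [word]
        elif position == "I" and entity_type == current_type:
            current_words.append(word)  # continue the open span
        else:  # "O", or an I tag that does not continue the open span
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


words = ["Alice", "Smith", "visited", "Seoul"]
tags = ["PER-B", "PER-I", "O", "LOC-B"]
entities = decode_bio(words, tags)
print(entities)  # [('PER', 'Alice Smith'), ('LOC', 'Seoul')]
```

A `B` tag opens a new entity, `I` tags of the same type extend it, and anything else closes it.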

📌 Model architecture (token classification model)

class TFBertForTokenClassification(tf.keras.Model):
    def __init__(self, model_name, num_labels):
        super().__init__()
        self.bert = TFBertModel.from_pretrained(model_name, from_pt=True)
        # No activation: the layer returns raw logits for each token
        self.classifier = tf.keras.layers.Dense(num_labels)

    def call(self, inputs):
        input_ids, attention_mask, token_type_ids = inputs
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # outputs[0] is the last hidden state, shape (batch, seq_len, hidden)
        return self.classifier(outputs[0])  # predict over the whole sequence

- Output: one logit vector per token
- Positions labeled -100 are ignored in the loss function

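📌 Loss function example

# logits: (batch, seq_len, num_labels); labels: (batch, seq_len) with -100 at ignored positions
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
active_loss = tf.reshape(labels, (-1,)) != -100  # True where a real label exists
reduced_logits = tf.boolean_mask(tf.reshape(logits, (-1, logits.shape[-1])), active_loss)
reduced_labels = tf.boolean_mask(tf.reshape(labels, (-1,)), active_loss)
# reduction='none' returns one loss per kept token; use tf.reduce_mean(loss) for a scalar
loss = loss_fn(reduced_labels, reduced_logits)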
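
The same masking idea can be checked by hand in plain Python. This is a sketch, not the post's training code: `masked_token_loss` is a hypothetical helper, the inputs are already flattened to one row of logits per token, and it averages the per-token losses, which is one common choice.

```python
import math

IGNORE_INDEX = -100


def masked_token_loss(logits, labels):
    """Mean cross-entropy over tokens whose label is not IGNORE_INDEX."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # subword continuation or special token: no loss
        # Numerically stable log-sum-exp, then -log softmax of the true class.
        peak = max(token_logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        losses.append(log_sum_exp - token_logits[label])
    return sum(losses) / len(losses) if losses else 0.0


# Three tokens, two classes; the middle token is masked out.
logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 5.0]]
labels = [0, IGNORE_INDEX, 1]
loss = masked_token_loss(logits, labels)

# Changing the masked token's logits leaves the loss unchanged.
perturbed = [[2.0, 0.0], [9.0, -9.0], [0.0, 5.0]]
print(loss == masked_token_loss(perturbed, labels))  # True
```

Because masked positions are dropped before averaging, the number of ignored subwords in a batch does not dilute the loss.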

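📝 Summary comparison table

| Task | Input | Output | Model | Notes |
| --- | --- | --- | --- | --- |
| KorNLI | Sentence 1 + sentence 2 | 3-class | CLS-based sequence classification | `token_type_ids` separate the two sentences |
| Sentiment classification | A single review sentence | 2-class (positive/negative) | CLS-based sequence classification | Single-sentence input; `token_type_ids` all 0 |
| Named entity recognition | Sequence of words in a sentence | One tag per token | Token classification (`outputs[0]`) | Subword positions labeled `-100` are ignored |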
