BERT - Sentiment Analysis (2)

Ann Jongmin · March 17, 2025

BERT Sentiment Analysis

The earlier version stored three short sentences in a list and ran sentiment analysis on that text. I changed it to load the IMDB dataset and run sentiment analysis on N sample reviews instead.

First, I used the code below to check how the IMDB dataset is structured.

from datasets import load_dataset

# Load the IMDB dataset
dataset = load_dataset("imdb")

# Print the overall structure of the dataset
print(dataset)
"""
===OUTPUT===
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
"""

# Print the first 3 samples of the training split
print("\nsamples:")
for i in range(3):
    print(dataset["train"][i])
"""
samples:
{'text': 'I rented I AM CURIOUS-YELLOW from my video ...', 'label': 0}
{'text': '"I Am Curious: Yellow" is a risible and ...', 'label': 0}
{'text': "If only to avoid making this type of film ...", 'label': 0}
"""

# Print only the text of the first 3 training samples
print("\nOutput some sample texts")
for i in range(3):
    print(dataset["train"][i]["text"])  
"""
Output some sample texts
I rented I AM CURIOUS-YELLOW from my video store ...
"I Am Curious: Yellow" is a risible and pretentious ...
If only to avoid making this type of film in the ...
"""

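DatasetDict type:
Like a regular dictionary, it has a "key → value" structure with keys such as train, test, and unsupervised. Each value is a Dataset object, though, so it supports more than a plain dictionary does (applying a tokenizer, preprocessing, batching, and so on).

DatasetDict itself is a container that groups several Datasets, much like a Python dictionary.
A Dataset holds the actual data (each sample's 'text', 'label', and so on) and provides operations such as preprocessing (map), transformation, and splitting.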

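The three splits in the IMDB dataset

train: 25,000 reviews (labeled)

test: 25,000 reviews (labeled)

unsupervised: 50,000 unlabeled reviews (the label column is still present in this split, filled with -1 as a placeholder)

IMDB is a binary classification dataset: sentiment is labeled only as positive (1) or negative (0). The model used here classifies into 5 categories instead. For this sentiment analysis I extract only the text from IMDB and do not use the labels (0/1).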
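This post does not use the IMDB labels, but the gap between the two label schemes can be made concrete. Below is a minimal sketch of one possible mapping from the model's star labels to IMDB's 0/1 labels. The `stars_to_binary` helper and the choice to treat 3 stars as "no binary counterpart" are assumptions for illustration, not something the model or the dataset defines.

```python
from typing import Optional


def stars_to_binary(label: str) -> Optional[int]:
    """Map a label such as '2 stars' to 0 (negative), 1 (positive), or None (neutral)."""
    stars = int(label.split()[0])  # '1 star' -> 1, '5 stars' -> 5
    if stars <= 2:
        return 0
    if stars >= 4:
        return 1
    return None  # assumption: 3 stars has no counterpart in IMDB's binary labels


for label in ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]:
    print(label, "->", stars_to_binary(label))
```

A different cut-off (for example, counting 3 stars as negative) would be just as defensible. The point is only that a 5-class output needs some such rule before it can be compared with binary labels.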

Now let's take reviews from IMDB and run sentiment analysis (5 categories) on them with a BERT model.

The code below loads a pre-trained sentiment analysis model from Hugging Face and runs it on IMDB review data. The pipeline handles tokenization automatically, so no separate tokenization step is needed.

from datasets import load_dataset
from transformers import pipeline  # the only transformers import this script uses
"""
    Notes
    - Labels of the `nlptown/bert-base-multilingual-uncased-sentiment` model
        1 → very negative review
        2 → negative review
        3 → neutral review
        4 → positive review
        5 → very positive review
    - IMDB labels are only 0 (negative) / 1 (positive).
      IMDB is used here only as a source of input text for the model,
      so the label mismatch does not matter.
"""

def sentiment_pipeline():
    """
        Sentiment analysis pipeline demo
        - Uses the nlptown/bert-base-multilingual-uncased-sentiment model
    """
    sentiment_analyzer = pipeline(
        "sentiment-analysis",
        model="nlptown/bert-base-multilingual-uncased-sentiment",
        tokenizer="nlptown/bert-base-multilingual-uncased-sentiment",
        truncation=True  # truncate inputs that exceed the maximum length
    )
    dataset = load_dataset("imdb")
    # Slice the rows first, then pick the column, so only 5 reviews are read
    examples = dataset["train"][:5]["text"]

    print("Sentiment Analysis Pipeline")
    print("----------------------------------------------------------")
    print(examples)
    for text in examples:
        result = sentiment_analyzer(text)
        print(f"Text: {text}")
        print(f"Result: {result}\n")


if __name__ == "__main__":
    sentiment_pipeline()

Running it prints the output below. The model predicts negative sentiment for these negative reviews, and reading them confirms that they really are negative.

1 → very negative review
2 → negative review

Device set to use cpu
Sentiment Analysis Pipeline
----------------------------------------------------------
['I rented I AM CURIOUS-YELLOW from my video ...
Result: [{'label': '2 stars', 'score': 0.4092690944671631}]

Text: "I Am Curious: Yellow" is a risible and ...
Result: [{'label': '2 stars', 'score': 0.45104217529296875}]

Text: If only to avoid making this type of ...
Result: [{'label': '2 stars', 'score': 0.48990464210510254}]

Text: This film was probably inspired ...
Result: [{'label': '2 stars', 'score': 0.7566274404525757}]

Text: Oh, brother...after hearing about ... 
Result: [{'label': '1 star', 'score': 0.6260119080543518}]
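The `score` values above are fairly low (around 0.4 to 0.75) even though every prediction is negative. A minimal sketch of why, assuming the pipeline applies a softmax over the model's five output logits and reports only the top class, which is its default behaviour for a single-label model like this one. The logits below are invented for illustration, not produced by the model, and `softmax` and `top_label` are hypothetical helper names.

```python
import math

# Label names follow the format seen in the output above.
LABELS = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits):
    """Return the most probable label and its probability, shaped like one pipeline result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


# Hypothetical logits, invented for illustration (not produced by the model).
example_logits = [1.2, 1.5, 0.4, -0.8, -1.6]
result = top_label(example_logits)
print(result)
print([round(p, 3) for p in softmax(example_logits)])
```

With five classes, the top label can win with well under half of the probability mass when a neighbouring class ("1 star" next to "2 stars") takes a large share. In recent transformers versions, passing `top_k=None` when calling the pipeline should return the scores for all five labels, which makes this spread visible.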
