Text Preprocessing

그니 · July 24, 2024


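In natural language processing (NLP), text preprocessing is the step that raw text goes through before it is analyzed or fed to a model. Good preprocessing improves data quality, and that usually translates into better model performance. This post walks through each preprocessing step in turn, with a short example for each.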

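What is text preprocessing?

Text preprocessing is the process of cleaning and structuring raw text so that it is in a form suitable for analysis and model training. The main steps are:

Case conversion
Spelling correction
Stop word removal
Noise removal
Tokenization
Normalization and lemmatization
Stemming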

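1. Case conversion

Case conversion maps every character in the text to lowercase (or uppercase) so that the text is consistent. Without it, "Test" and "test" are counted as different words; converting case removes these spurious variants and makes word frequency counts more accurate.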

text = "Hello World! This is a Test."
normalized_text = text.lower()
print(normalized_text)  # hello world! this is a test.
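
To see the effect on word frequencies, here is a small sketch using only the standard library. The sample sentence is made up for illustration, and punctuation is stripped by hand so that case is the only difference between the variants.

```python
from collections import Counter

text = "The cat saw the other cat. THE END."

# Strip surrounding punctuation so only case differs between variants.
tokens = [t.strip(".,!?") for t in text.split()]

raw_counts = Counter(tokens)
lower_counts = Counter(t.lower() for t in tokens)

# Without case conversion, "the" is split across three separate entries.
print(raw_counts["the"], raw_counts["The"], raw_counts["THE"])  # 1 1 1
# After lowercasing, all three collapse into one.
print(lower_counts["the"])  # 3
```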

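2. Spelling correction

Spelling correction replaces misspelled words in the text with their correct forms. This keeps typos from being treated as separate, rare words and makes the analysis more accurate.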

from autocorrect import Speller

spell = Speller()
text = "I recieve your message"
corrected_text = spell(text)
print(corrected_text)  # I receive your message
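
If you want to see roughly how a corrector works without installing anything, the idea can be sketched with the standard library's difflib: pick the known word most similar to the input. This is not how autocorrect is implemented; KNOWN_WORDS is a hypothetical mini-dictionary and the 0.8 cutoff is an arbitrary choice for this example.

```python
import difflib

# Hypothetical mini-dictionary; a real corrector would use a large word list.
KNOWN_WORDS = ["i", "receive", "your", "message", "believe", "friend"]

def correct_word(word, cutoff=0.8):
    """Return the closest known word, or the word unchanged if none is close."""
    if word.lower() in KNOWN_WORDS:
        return word  # already spelled correctly; keep the original casing
    matches = difflib.get_close_matches(word.lower(), KNOWN_WORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else word

corrected = " ".join(correct_word(w) for w in "I recieve your mesage".split())
print(corrected)  # I receive your message
```

Similarity-based lookup like this ignores context and word frequency, so treat it as a way to understand the idea rather than a replacement for a real spell checker.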

3. Stop word removal

Stop words are words that carry little meaning for text analysis, such as "is", "the", and "a". Removing them reduces noise in the data and lets the analysis focus on the words that matter.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
text = "This is a sample sentence"
words = word_tokenize(text)
filtered_text = [word for word in words if word.lower() not in stop_words]
print(filtered_text)  # ['sample', 'sentence']
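
Note that the filter compares the lowercased token against the stop list, so capitalized stop words such as "This" are dropped as well. The same logic can be sketched without NLTK; STOP_WORDS here is a tiny hand-picked list assumed for illustration, and tokens come from a plain split().

```python
# Tiny illustrative stop list (an assumption); NLTK's English list is much longer.
STOP_WORDS = {"this", "is", "a", "the", "of", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "This is a sample sentence".split()
print(remove_stop_words(tokens))  # ['sample', 'sentence']
```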

4. Noise removal

Noise removal strips characters and symbols that are not needed for the analysis, such as special characters, digits, URLs, and HTML tags.

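import re

text = "Hello!!! How are you? Visit us at https://example.com 123"
text = re.sub(r'https?://\S+', '', text)  # drop URLs before stripping symbols
cleaned_text = re.sub(r'[^A-Za-z\s]', '', text)  # keep only letters and whitespace
cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # collapse leftover spaces
print(cleaned_text)  # Hello How are you Visit us at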
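
HTML tags can be handled in a similar way. The sketch below uses a regular expression plus html.unescape from the standard library; strip_html is a hypothetical helper name, and the regex approach is only reasonable for simple markup. For messy real-world pages a proper HTML parser is the safer choice.

```python
import html
import re

def strip_html(raw):
    """Remove tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)  # replace each tag with a space
    decoded = html.unescape(no_tags)        # e.g. &amp; becomes &
    return re.sub(r"\s+", " ", decoded).strip()

raw = "<p>Fish &amp; chips<br>for <b>two</b></p>"
print(strip_html(raw))  # Fish & chips for two
```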

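5. Tokenization

Tokenization splits text into smaller units such as words, sentences, or subwords. It is the foundation of text analysis, and the choice of tokenizer affects every step that follows.

- Word tokenization

Word tokenization splits text into individual words.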

from nltk.tokenize import word_tokenize

text = "Hello world"
words = word_tokenize(text)
print(words)  # ['Hello', 'world']

- Sentence tokenization

Sentence tokenization splits text into individual sentences.

from nltk.tokenize import sent_tokenize

text = "Hello world. How are you?"
sentences = sent_tokenize(text)
print(sentences)  # ['Hello world.', 'How are you?']
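
For intuition, both kinds of tokenization can be approximated with regular expressions. The two function names below are made up for this sketch, and the rules are deliberately naive: the sentence splitter breaks on abbreviations such as "Dr.", which is one reason to prefer NLTK's tokenizers in practice.

```python
import re

def simple_word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Hello world. How are you?"
print(simple_word_tokenize(text))  # ['Hello', 'world', '.', 'How', 'are', 'you', '?']
print(simple_sent_tokenize(text))  # ['Hello world.', 'How are you?']
```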

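- Subword tokenization

Subword tokenization splits words into smaller pieces, which lets a model handle rare or unseen words with a fixed-size vocabulary. BPE (Byte Pair Encoding) and WordPiece are the best-known techniques.

BPE (Byte Pair Encoding)

BPE starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols into a new symbol. The learned merges are then used to break words into subword units.

# BPE illustration (toy example: the splits are hard-coded, not learned)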
vocab = {"l": 5, "o": 6, "w": 4, "e": 8, "r": 7, "n": 3, "lo": 2, "wer": 3}
bpe_text = "lower newer"
bpe_text = bpe_text.replace("lower", "lo wer")
bpe_text = bpe_text.replace("newer", "ne wer")
print(bpe_text)  # lo wer ne wer
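
The snippet above only shows what the result of BPE looks like. A minimal sketch of the merge-learning loop itself might look as follows. The word frequencies are hypothetical, learn_bpe is a name chosen for this example, and details found in real implementations (end-of-word markers, byte-level fallback, a specific tie-breaking rule) are left out.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical word frequencies chosen for illustration.
merges, corpus = learn_bpe({"lower": 2, "newer": 3, "low": 4}, num_merges=3)
print(merges)        # [('l', 'o'), ('lo', 'w'), ('e', 'r')]
print(list(corpus))  # [('low', 'er'), ('n', 'e', 'w', 'er'), ('low',)]
```

After three merges, "lower" is already segmented as "low" + "er", while the less frequent pieces of "newer" are still single characters.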

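WordPiece

WordPiece is the subword tokenization technique used by BERT and related models. It splits a word into pieces drawn from a vocabulary of frequent subwords, and marks pieces that continue a word with a "##" prefix.

# WordPiece illustration (toy example: the split is hard-coded)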
vocab = {"play": 10, "##ing": 5}
wordpiece_text = "playing"
wordpiece_text = wordpiece_text.replace("playing", "play ##ing")
print(wordpiece_text)  # play ##ing
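
At tokenization time, WordPiece segments each word greedily: it takes the longest vocabulary entry that matches at the current position, then continues from there. A minimal sketch of that segmentation step, assuming a small hypothetical vocabulary and covering only segmentation (not how the vocabulary is trained), could be:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary for illustration.
vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
print(wordpiece_tokenize("playing", vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", vocab))         # ['[UNK]']
```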

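6. Normalization and lemmatization

Normalization converts the words in a text to a standard form. Lemmatization is one such conversion: it maps each word to its lemma, that is, its dictionary form. NLTK's WordNetLemmatizer treats every word as a noun unless a part of speech is passed through its pos argument, which is why "running" comes back unchanged in the example below.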

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["running", "jumps", "easily", "faster"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)  # ['running', 'jump', 'easily', 'faster']
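
Why the part of speech matters is easier to see with a toy dictionary-based lemmatizer. Everything in this sketch is hypothetical (the table, its entries, and the toy_lemmatize name); it only mimics the idea of looking up a lemma by word and part of speech, with noun as the default.

```python
# Hypothetical lookup table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("running"))           # running  (looked up as a noun)
print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("better", pos="a"))   # good
```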

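7. Stemming

Stemming reduces a word to its stem by stripping suffixes with heuristic rules. Unlike lemmatization it does not consult a dictionary, so the result is not always a real word (for example, "easily" becomes "easili"), but it is fast and maps related word forms to the same string.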

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "jumps", "easily", "faster"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)  # ['run', 'jump', 'easili', 'faster']
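
To wrap up, several of the steps above can be chained into one small function. This is a dependency-free sketch rather than a recommended pipeline: the stop list is a small hand-picked set, tokenization is a plain whitespace split, and preprocess is a name chosen for this example.

```python
import re

# Small illustrative stop list (an assumption, not NLTK's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "at", "us", "this"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip URLs and non-letters, tokenize, drop stop words."""
    text = text.lower()                         # case conversion
    text = re.sub(r"https?://\S+", " ", text)   # noise removal: URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # noise removal: digits, symbols
    tokens = text.split()                       # whitespace tokenization
    return [t for t in tokens if t not in stop_words]

print(preprocess("Hello!!! This is a Test. Visit us at https://example.com 123"))
# ['hello', 'test', 'visit']
```

The order of the steps matters: URLs are removed before the symbol filter so that fragments like "httpsexamplecom" do not end up as tokens.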
