Training Week 6 (1)

Taixi · October 7, 2024

9oormthon DEEP DIVE IN Goorm NLP

Generative AI Training

Natural Language Processing

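1. Text analysis process

- Crawling

- Data preprocessing

- Natural language processing (tokenization, normalization, stemming)

- Text analysis

Frequency analysis
Clustering
Association analysis
Sentiment analysis
Classification
Keyword extraction
Modeling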
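
As a rough illustration of the first analysis type in the list above, frequency analysis, here is a minimal sketch that uses only the Python standard library. The sample sentences and the regex-based `tokenize` helper are assumptions made for illustration; a real pipeline would start from crawled text and a proper tokenizer.

```python
from collections import Counter
import re

# Hypothetical sample documents standing in for crawled text.
docs = [
    "The sushi was delicious and the service was great.",
    "The service was slow but the sushi was fresh.",
]

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

# Count how often each token appears across all documents.
freq = Counter(token for doc in docs for token in tokenize(doc))

print(freq.most_common(3))
```

The most frequent tokens here are function words such as "the" and "was", which is one reason stop word removal usually comes before frequency analysis.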

2. NLTK and KoNLPy

Example

import nltk
nltk.download('punkt')  # tokenizer model; recent NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize

text = "Today's sushi was delicious."
tokens = word_tokenize(text)

print("NLTK result:", tokens)

Part-of-speech tagging

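import nltk
nltk.download('punkt')  # tokenizer model
nltk.download('averaged_perceptron_tagger')  # POS tagger model; recent NLTK releases may ask for 'averaged_perceptron_tagger_eng'
from nltk.tokenize import word_tokenize

text = "Today's sushi was delicious."
tokens = word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)  # list of (token, tag) pairs

print("POS tagging result:", pos_tags)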

Korean example

from konlpy.tag import Okt

# Create the analyzer object
okt = Okt()

# Sample text
text = "오늘 날씨가 흐립니다."
morphs = okt.morphs(text)

print(morphs)

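Key functions

morphs(text, norm=False, stem=False)
- Splits the text into morphemes.
- norm=True: normalizes the text before analysis.
- stem=True: reduces each token to its stem.
nouns(text)
- Extracts only the nouns from the text.
phrases(text)
- Extracts phrases (multi-word chunks) from the text.
pos(text, norm=False, stem=False, join=False)
- Performs morphological analysis and part-of-speech tagging.
- norm=True: normalizes the text before analysis.
- stem=True: reduces each token to its stem.
- join=True: returns each result as a single 'morpheme/tag' string.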

# morphs(): split the text into morphemes.
print("morphs() result:")
print(okt.morphs(text))

# With stemming enabled
print("morphs() with stem=True:")
print(okt.morphs(text, stem=True))

# nouns(): extract only the nouns.
print("nouns() result:")
print(okt.nouns(text))

# phrases(): extract phrases.
print("phrases() result:")
print(okt.phrases(text))

# pos(): morphological analysis with part-of-speech tags.
print("pos() result:")
print(okt.pos(text))
print("pos() with join=True:")
print(okt.pos(text, join=True))

Text preprocessing

Tokenization

Word tokenization

from konlpy.tag import Okt

okt = Okt()

word_tokens = okt.morphs("오늘 자연어처리 수업 재밌다!")
print(word_tokens)

Sentence tokenization

from nltk.tokenize import sent_tokenize

text = "오늘 자연어처리 수업 재밌다! 내일도 계속됩니다."
sentences = sent_tokenize(text)
print(sentences)
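
sent_tokenize relies on NLTK's Punkt model, which defaults to English. For simple, well-punctuated text, a regular-expression split is a dependency-free alternative. The helper name `split_sentences` below is hypothetical, and its rule (split on whitespace that follows `.`, `!`, or `?`) is a simplifying assumption that will mis-split abbreviations such as "Dr. Kim".

```python
import re

def split_sentences(text):
    """Split text wherever sentence-final punctuation is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "오늘 자연어처리 수업 재밌다! 내일도 계속됩니다."
print(split_sentences(text))
```

The lookbehind keeps the punctuation attached to the sentence it ends, so the pieces can be rejoined without losing characters.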

Normalization
- Unify letter case
- Remove punctuation
- Handle numbers
- Handle special patterns such as e-mail addresses and URLs

# Unify letter case
text_en = "Today's NLP class is Fun!"
normalized_text = text_en.lower()
print(normalized_text)

# Remove punctuation
import re
text = "오늘 자연어처리 수업 재밌다!"
text_without_punct = re.sub(r'[^\w\s]', '', text)
print(text_without_punct)

# Handle numbers (here: remove any digits)
text_with_num = "오늘 2024년도 자연어처리 수업 재밌다!"
text_num_removed = re.sub(r'\d+', '', text_with_num)
print(text_num_removed)

# Handle special patterns (here: mask e-mail addresses)
text_with_email = "오늘 자연어처리 수업 재밌다! 문의: nlp@groom.io"
text_email_removed = re.sub(r'\S+@\S+', 'EMAIL', text_with_email)
print(text_email_removed)
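
When these steps are combined, their order matters: stripping punctuation first would destroy the `@` and `://` that the pattern rules depend on. The sketch below chains the steps into one hypothetical `normalize` helper. The URL pattern, the `URL`/`EMAIL` placeholders, and the sample string are assumptions for illustration, not part of the course material.

```python
import re

def normalize(text):
    """Apply the normalization steps in an order that keeps patterns intact."""
    text = text.lower()                            # unify case
    text = re.sub(r"https?://\S+", "URL", text)    # mask URLs before touching punctuation
    text = re.sub(r"\S+@\S+", "EMAIL", text)       # mask e-mail addresses
    text = re.sub(r"\d+", "", text)                # drop digits
    text = re.sub(r"[^\w\s]", "", text)            # strip remaining punctuation
    return re.sub(r"\s+", " ", text).strip()       # collapse leftover whitespace

sample = "Class 2024 was Fun! Contact: nlp@example.com or https://example.com/nlp"
print(normalize(sample))
```

Lowercasing runs before masking so that the uppercase placeholders stay distinguishable from ordinary words.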

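Stop word removal

What counts as a stop word can differ by domain. For example, when analyzing documents about natural language processing, you might add '자연어' (natural language) or '처리' (processing) to the stop word list.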

from konlpy.tag import Okt

# Sample text containing stop words
text = "나는 오늘 자연어처리 수업에서 많은 것을 배웠습니다. 이것은 정말로 흥미로운 주제입니다."

# Create the Okt object
okt = Okt()

# Morphological analysis
tokens = okt.morphs(text)

# Define the stop word list
stop_words = {'은', '는', '이', '가', '을', '를', '에서', '에', '의', '으로', '로', '것', '들', '등', '이것'}

# Remove stop words
tokens_without_stopwords = [word for word in tokens if word not in stop_words]

# Print the results
print("Original text:")
print(text)
print("\nMorphological analysis result:")
print(tokens)
print("\nAfter stop word removal:")
print(tokens_without_stopwords)
print("\nText after stop word removal:")
print(' '.join(tokens_without_stopwords))
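
The domain-specific case mentioned above can be sketched by merging a general stop word set with a domain set. The token list here is hypothetical and typed in by hand so the snippet runs without KoNLPy; in practice it would come from okt.morphs().

```python
# General stop words (a subset of the list above) plus hypothetical domain terms.
base_stop_words = {"은", "는", "이", "가", "을", "를", "에서"}
domain_stop_words = {"자연어", "처리"}
stop_words = base_stop_words | domain_stop_words

# Hypothetical, pre-tokenized input standing in for okt.morphs() output.
tokens = ["자연어", "처리", "수업", "에서", "토큰", "화", "를", "배웠습니다"]

filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```

Keeping the two sets separate makes it easy to reuse the general list across projects and swap only the domain list.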

Text encoding (Bag-of-Words)

from sklearn.feature_extraction.text import CountVectorizer
from konlpy.tag import Okt  # requires the konlpy package

okt = Okt()

# Text to analyze
text = "나는 오늘 자연어처리 수업에서 많은 것을 배웠습니다. 이것은 정말로 흥미로운 주제입니다."

# Tokenize and remove stop words, as in the previous section
tokens = okt.morphs(text)
stop_words = {'은', '는', '이', '가', '을', '를', '에서', '에', '의', '으로', '로', '것', '들', '등', '이것'}
tokens_without_stopwords = [word for word in tokens if word not in stop_words]

# Lemmatization placeholder: the stop-word-filtered tokens are used as-is.
# Substitute real lemmatization logic here if your task needs it.
lemmatized_tokens = tokens_without_stopwords

# CountVectorizer expects strings, so join the tokens back into one document.
# Its default token pattern keeps only tokens of two or more characters.
vectorizer = CountVectorizer()
bow_vector = vectorizer.fit_transform([' '.join(lemmatized_tokens)])

print("Bag-of-Words result:")
print(vectorizer.get_feature_names_out())
print(bow_vector.toarray())
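
To make the encoding itself visible, the following sketch builds a Bag-of-Words matrix by hand for two documents using only the standard library. The token lists are hypothetical stand-ins for morpheme-analyzed text, and this is an illustration of the idea rather than a reproduction of CountVectorizer's internals.

```python
from collections import Counter

# Hypothetical, already-tokenized documents.
docs = [
    ["오늘", "자연어", "처리", "수업", "재밌다"],
    ["내일", "자연어", "처리", "수업", "수업"],
]

# Vocabulary: sorted unique tokens; each position is one column of the matrix.
vocab = sorted({tok for doc in docs for tok in doc})

def to_bow(tokens):
    """Return a count vector aligned with vocab."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocab]

bow = [to_bow(doc) for doc in docs]

print(vocab)
for row in bow:
    print(row)
```

Each row has one entry per vocabulary term, so word order is discarded and only counts remain.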

TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses lemmatized_tokens from the Bag-of-Words example
tfidf_vectorizer = TfidfVectorizer()
tfidf_vector = tfidf_vectorizer.fit_transform([' '.join(lemmatized_tokens)])

print("TF-IDF result:")
print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_vector.toarray())
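
With a single input document, as in the example above, every term gets the same IDF, so the output is just the length-normalized counts. The weighting only becomes informative with several documents. Here is a minimal sketch, assuming the smoothed formula that scikit-learn documents as the TfidfVectorizer default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The token lists are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents.
docs = [
    ["자연어", "처리", "수업", "재밌다"],
    ["자연어", "처리", "주제", "흥미롭다"],
]

n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: number of documents that contain each term.
df = Counter(tok for doc in docs for tok in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

def tfidf(tokens):
    """Return the L2-normalized TF-IDF vector for one document."""
    tf = Counter(tokens)
    raw = [tf[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

for doc in docs:
    print([round(v, 3) for v in tfidf(doc)])
```

Terms that appear in both documents ("자연어", "처리") get the minimum IDF of 1, while terms unique to one document are weighted higher.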