[Week 2_Data Analysis] Dev Log (Classifying Movie Genres from Plot Summaries - 5)

Coastby · July 6, 2022

Tags: stopwords, special character removal

Series: [Sparta] Data Analysis Comprehensive Course


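[Classifying Movie Genres from Plot Summaries]
1. What is machine learning?
2. Data preprocessing
3. Vectorization
4. Machine learning
👉5. Using the model and removing stopwords👈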

5. Using the model and removing stopwords

○ Predicting a genre

Logistic regression had the highest accuracy, so I use it to make a prediction.

# Convert the plot of "Don't Look Up" into a TF-IDF matrix
# (dtmvector, tfidf_transformer, and lr are the objects fitted in the earlier posts)
x_test_dtm = dtmvector.transform(['Kate Dibiasky Jennifer Lawrence an astronomy grad student and her professor Dr Randall Mindy make an astounding discovery of a comet orbiting within the solar system The problem its on a direct collision course with Earth The other problem No one really seems to care Turns out warning mankind about a planetkiller the size of Mount Everest is an inconvenient fact to navigate With the help of Dr Oglethorpe Rob Morgan Kate and Randall embark on a media tour that takes them from the office of an indifferent President Orlean Meryl Streep and her sycophantic son and Chief of Staff Jason Jonah Hill to the airwaves of The Daily Rip an upbeat morning show hosted by Brie Cate Blanchett and Jack Tyler Perry With only six months until the comet makes impact managing the 24-hour news cycle and gaining the attention of the social media obsessed public before its too late proves shockingly comical what will it take to get the world to just look up'])  # convert the test data to a DTM
tfidfv_test = tfidf_transformer.transform(x_test_dtm)  # convert the DTM to a TF-IDF matrix

# Predict on the test data
predicted = lr.predict(tfidfv_test)
print(predicted)

#Result
[5]

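✋ Removing special characters from a string

I copied the plot from the IMDb page and fed it to the model. Pasted as is, the special characters caused an error; after I removed them one by one, the error went away. Looking back at the dataset, I noticed that the plots there also had all special characters removed.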
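The post does not show the error message, so the cause is uncertain. One plausible explanation (an assumption, not something the post confirms) is that an apostrophe in the plot, as in "it's", ended a single-quoted string literal early. The sketch below checks that idea with two small hypothetical snippets; `compiles` is a helper made up for this illustration.

```python
# Assumption: the error came from quoting, not from the vectorizer itself.
snippet_single = "text = 'it's too late'"   # the apostrophe closes the literal early
snippet_double = 'text = "it\'s too late"'  # double quotes keep the apostrophe intact


def compiles(source: str) -> bool:
    """Return True if the source parses as valid Python."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles(snippet_single))  # False
print(compiles(snippet_double))  # True
```

If that was the cause, wrapping the plot in double quotes (as the `story` variable does) would also avoid the error without editing the text by hand.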

story = "Kate Dibiasky (Jennifer Lawrence), an astronomy grad student, and her professor Dr. Randall Mindy (Leonardo DiCaprio) make an astounding discovery of a comet orbiting within the solar system. The problem - it's on a direct collision course with Earth. The other problem? No one really seems to care. Turns out warning mankind about a planet-killer the size of Mount Everest is an inconvenient fact to navigate. With the help of Dr. Oglethorpe (Rob Morgan), Kate and Randall embark on a media tour that takes them from the office of an indifferent President Orlean (Meryl Streep) and her sycophantic son and Chief of Staff, Jason (Jonah Hill), to the airwaves of The Daily Rip, an upbeat morning show hosted by Brie (Cate Blanchett) and Jack (Tyler Perry). With only six months until the comet makes impact, managing the 24-hour news cycle and gaining the attention of the social media obsessed public before it's too late proves shockingly comical - what will it take to get the world to just look up?"

Method 1) Using str.translate()

import string
output = story.translate(str.maketrans('', '', string.punctuation))
print(output)

#Result
Kate Dibiasky Jennifer Lawrence an astronomy grad student and her professor Dr Randall Mindy Leonardo DiCaprio make an astounding discovery of a comet orbiting within the solar system The problem  its on a direct collision course with Earth The other problem No one really seems to care Turns out warning mankind about a planetkiller the size of Mount Everest is an inconvenient fact to navigate With the help of Dr Oglethorpe Rob Morgan Kate and Randall embark on a media tour that takes them from the office of an indifferent President Orlean Meryl Streep and her sycophantic son and Chief of Staff Jason Jonah Hill to the airwaves of The Daily Rip an upbeat morning show hosted by Brie Cate Blanchett and Jack Tyler Perry With only six months until the comet makes impact managing the 24hour news cycle and gaining the attention of the social media obsessed public before its too late proves shockingly comical  what will it take to get the world to just look up
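The result above contains double spaces wherever a standalone " - " was removed. A minimal sketch of one way to tidy that up, using a shortened sample sentence from the plot (the variable names are illustrative):

```python
import string

sample = "The problem - it's on a direct collision course with Earth."

# Delete every ASCII punctuation character; the dash leaves two spaces behind
stripped = sample.translate(str.maketrans("", "", string.punctuation))
print(repr(stripped))

# split() with no arguments splits on any run of whitespace,
# so joining with a single space collapses the doubled spaces
normalized = " ".join(stripped.split())
print(repr(normalized))
```

The cleanup is mostly cosmetic: tokenizers that split on whitespace or word boundaries generally treat one space and two spaces the same way.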

Method 2) Using a regular expression

import re
output2 = re.sub(r'[^\w\s]', '', story)
print(output2)

# When the same pattern is reused many times, compiling it once can be a bit faster.
pattern_punctuation = re.compile(r'[^\w\s]')
output3 = pattern_punctuation.sub('', story)
print(output3)
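Methods 1 and 2 do not remove exactly the same characters. `string.punctuation` covers only ASCII punctuation, including the underscore, while `[^\w\s]` keeps underscores (they count as word characters) but also removes non-ASCII symbols such as curly quotes. A small sketch with a made-up sample string:

```python
import re
import string

sample = "snake_case “quoted” it's 24-hour"

by_table = sample.translate(str.maketrans("", "", string.punctuation))
by_regex = re.sub(r"[^\w\s]", "", sample)

print(by_table)  # snakecase “quoted” its 24hour
print(by_regex)  # snake_case quoted its 24hour
```

For text copied from a web page, which may contain typographic quotes or dashes, the regular expression is the more thorough of the two.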

Method 3) Using str.replace()

import string
output4 = story  # work on a copy so the original story stays unchanged
for character in string.punctuation:
    output4 = output4.replace(character, '')
print(output4)
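To compare the three methods, they can be wrapped in functions and timed with `timeit`. This is only a rough sketch: the function names and the sample text are made up for illustration, and the timings depend on the machine, so no expected numbers are shown.

```python
import re
import string
import timeit

# Illustrative sample: a short ASCII sentence repeated to get a longer text
sample = "Dr. Mindy (Leonardo DiCaprio) - it's a 24-hour news cycle! " * 100

_table = str.maketrans("", "", string.punctuation)
_pattern = re.compile(r"[^\w\s]")


def strip_translate(text: str) -> str:
    return text.translate(_table)


def strip_regex(text: str) -> str:
    return _pattern.sub("", text)


def strip_replace(text: str) -> str:
    for character in string.punctuation:
        text = text.replace(character, "")
    return text


for func in (strip_translate, strip_regex, strip_replace):
    seconds = timeit.timeit(lambda: func(sample), number=200)
    print(f"{func.__name__}: {seconds:.4f}s")
```

On this sample (ASCII only, no underscores) all three functions return the same string, so the choice comes down to speed and readability.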

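Reference: https://euriion.com/?p=413175

○ Removing stopwords

Earlier, I wrote a fairly complicated list comprehension to remove stopwords. With scikit-learn that is unnecessary: specifying an option is enough to have stopwords removed automatically.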
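For comparison, the manual list-comprehension approach looks roughly like the sketch below. The stopword set here is a tiny hypothetical one chosen for illustration, not the list used earlier in the series.

```python
# Hypothetical, deliberately tiny stopword set for illustration only
stop_words = {"a", "an", "the", "of", "and", "is", "to"}

plot = "an astronomy grad student and her professor make an astounding discovery of a comet"

# Keep only the tokens that are not in the stopword set
tokens = plot.lower().split()
filtered = [word for word in tokens if word not in stop_words]

print(" ".join(filtered))
# astronomy grad student her professor make astounding discovery comet
```

With scikit-learn, the same kind of filtering happens inside the vectorizer through the `stop_words` option, as in the code that follows.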

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Create the objects that build the DTM and TF-IDF matrices
dtmvector = CountVectorizer(stop_words="english")  # remove scikit-learn's built-in English stopwords
tfidf_transformer = TfidfTransformer()

# Build the DTM and TF-IDF matrices for x_train
x_train_dtm = dtmvector.fit_transform(x_train)
tfidfv = tfidf_transformer.fit_transform(x_train_dtm)

# Train a naive Bayes classifier
mod = MultinomialNB()
mod.fit(tfidfv, y_train)

# Build the DTM and TF-IDF matrices for x_test
x_test_dtm = dtmvector.transform(x_test)  # convert the test data to a DTM
tfidfv_test = tfidf_transformer.transform(x_test_dtm)  # convert the DTM to a TF-IDF matrix

predicted = mod.predict(tfidfv_test)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels
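The separate vectorizing and transforming steps above can also be bundled into a single scikit-learn pipeline. The sketch below is an alternative arrangement, not the code from the course: `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`, and the toy plots and genre labels are invented purely so the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for x_train / y_train
toy_plots = [
    "astronauts discover a comet heading for the planet",
    "a spaceship crew explores a distant planet",
    "two strangers fall in love at a wedding",
    "a couple rediscovers love after a breakup",
]
toy_genres = ["sci-fi", "sci-fi", "romance", "romance"]

# Vectorizing (with stopword removal) and classification in one object
genre_model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    MultinomialNB(),
)
genre_model.fit(toy_plots, toy_genres)

# Raw text goes straight in; the pipeline applies the fitted vectorizer first
prediction = genre_model.predict(["a comet threatens the planet"])
print(prediction)
```

One practical benefit of this arrangement is that new text only needs a single `predict` call, so there is no risk of forgetting one of the transform steps.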
