Sentiment_Analysis_Code

매일 공부(ML) · January 24, 2022

Sentiment Analysis Model Preview

Basic Python Libraries

Pandas: data analysis and manipulation library
Matplotlib: data visualization library
Seaborn: higher-level data visualization library
WordCloud: text data visualization
re: provides regular-expression functions (used to pre-process strings)

Scikit-Learn

CountVectorizer: converts text into vectors
GridSearchCV: hyperparameter tuning
RandomForestClassifier: ML algorithm for classification

Evaluation Metrics

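Accuracy Score: correct predictions / total number of instances
Precision Score: true positives / all instances predicted as positive
Recall Score: true positives / all actual positive instances
ROC Curve: a plot of the true positive rate against the false positive rate
Classification Report: a summary of precision, recall, and F1 score per class
Confusion Matrix: a table of predicted versus actual classes that summarizes a classification model's performance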
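
As a rough illustration of the metrics listed above, the sketch below computes them with scikit-learn on a handful of made-up labels. The values in `y_true`, `y_pred`, and `y_score` are hypothetical and are not taken from the hackathon data.

```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    confusion_matrix,
    classification_report,
    roc_curve,
)

# Hypothetical labels: 1 = positive sentiment, 0 = negative sentiment
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Hypothetical predicted probabilities for the positive class
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

acc = accuracy_score(y_true, y_pred)    # correct predictions / all instances
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
cm = confusion_matrix(y_true, y_pred)   # rows: actual class, columns: predicted class

# ROC curve points are computed from scores, not from hard predictions
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(acc, prec, rec)
print(cm)
print(classification_report(y_true, y_pred))
```

With these toy labels there are 3 true positives, 1 false positive, and 1 false negative, so accuracy, precision, and recall all come out to 0.75.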

Data Pre-processing

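New function for pre-processing
- Use a regular expression to remove non-alphabetic characters
- Convert to lowercase: Good -> good
  - Otherwise words with the same meaning can end up with different vector values
- Remove stopwords
- Lemmatization
  - uses a vocabulary and morphological analysis to reduce words to their base form (Ex: running, runs -> run)
- Return the corpus of pre-processed data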
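
The steps above could be sketched as a single function. This is a minimal sketch, not the original post's implementation: the function name `preprocess`, its parameters, and the toy stopword and lemma tables are all assumptions made for illustration. In the setting of this post, `stop_words` would presumably come from `nltk.corpus.stopwords.words('english')` and `lemmatize` from `WordNetLemmatizer().lemmatize`, both of which need the corresponding NLTK data to be downloaded first.

```python
import re


def preprocess(texts, stop_words, lemmatize=lambda word: word):
    """Clean raw texts and return the corpus of pre-processed strings."""
    corpus = []
    for text in texts:
        # Replace every non-alphabetic character with a space
        text = re.sub(r"[^a-zA-Z]", " ", text)
        # Lowercase so that "Good" and "good" become the same token
        words = text.lower().split()
        # Drop stopwords, then reduce each remaining word to its lemma
        words = [lemmatize(w) for w in words if w not in stop_words]
        corpus.append(" ".join(words))
    return corpus


# Toy stand-ins (hypothetical) so the sketch runs without downloading NLTK data
toy_stop_words = {"the", "is", "a", "and"}
toy_lemmas = {"runs": "run", "movies": "movie"}

cleaned = preprocess(
    ["The movie is GOOD!!!", "He runs and watches 2 movies"],
    stop_words=toy_stop_words,
    lemmatize=lambda w: toy_lemmas.get(w, w),
)
print(cleaned)  # ['movie good', 'he run watches movie']
```

Passing the stopword set and the lemmatizer in as arguments keeps the cleaning logic separate from the NLTK resources, which makes the function easy to try on small inputs.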

Bag of Words

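A text representation in the form of a bag of words
It records how many times each word occurs
In Scikit-learn, use CountVectorizer
N-gram: a sequence of n consecutive words treated as a single token
- "social media": unigrams split it into "social" and "media", while a bigram keeps "social media" as one feature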
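
To make the bag-of-words and n-gram ideas concrete, here is a small sketch using `CountVectorizer` with `ngram_range=(1, 2)`, which counts both single words and two-word sequences. The two sentences in `docs` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus
docs = ["social media is fun", "social media is noisy"]

# ngram_range=(1, 2) extracts unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
bow = vectorizer.fit_transform(docs)  # sparse matrix: documents x features

vocab = vectorizer.vocabulary_        # maps each n-gram to its column index
print(sorted(vocab))
print(bow.toarray())
```

The vocabulary contains "social" and "media" as separate features as well as the bigram "social media", and each row of the matrix holds the counts of those features in one document.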

GridSearchCV()

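Parameter settings

estimator: the model to tune; here RandomForestClassifier
param_grid: a dictionary of hyperparameter names and the values to try
cv: the number of cross-validation folds
return_train_score: whether to also report training scores for each parameter combination
n_jobs: the number of jobs to run in parallel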
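
A minimal sketch of how these parameters fit together is shown below. The synthetic dataset from `make_classification` and the values in `param_grid` are assumptions chosen only to keep the example small. They are not the settings used in the hackathon. `n_jobs` is left at 1 so the sketch runs in any environment; setting it to -1 would use all available CPU cores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized text features and sentiment labels
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Hypothetical search space: hyperparameter names mapped to candidate values
param_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,                     # 3-fold cross-validation
    return_train_score=True,  # also record scores on the training folds
    n_jobs=1,                 # -1 would run the fits in parallel on all cores
)
grid.fit(X, y)

print(grid.best_params_)
print(grid.cv_results_["mean_train_score"])
print(grid.cv_results_["mean_test_score"])
```

Comparing `mean_train_score` with `mean_test_score` for each combination gives a quick indication of overfitting, which is the main reason to enable `return_train_score`.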

Code

Basic Python Libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import re

from matplotlib import style,rcParams
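# Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-white'; older versions use 'seaborn-white'
style.use('seaborn-v0_8-white')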
rcParams['figure.figsize'] = 10,5
import warnings
warnings.filterwarnings('ignore')

Natural Language Processing

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Scikit-Learn (Machine Learning Library for Python)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

Evaluation Metrics

from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix,roc_curve,classification_report
from scikitplot.metrics import plot_confusion_matrix