💠 AIchemist 11th Session | Text Analysis (2)

yellowsubmarine372 · December 26, 2023

AIchemist


07. Document Clustering

Document clustering

Grouping documents with similar text composition into clusters

Analysis flow
(1) Text preprocessing
(2) Vectorization
(3) Apply a clustering algorithm
(4) Extract each cluster's key words via the cluster_centers_ attribute

Opinion Review Data Practice

  • Text preprocessing
import pandas as pd
import glob, os
import warnings 
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', 700)

# This is the directory where I unzipped the archive on my machine; set it to your own directory
path = 'OpinosisDataset1.0/topics'
# Collect the file names of all .data files under the directory given by path into a list
all_files = glob.glob(os.path.join(path, "*.data"))    
filename_list = []
opinion_text = []

# Collect the individual file names into filename_list,
# and each file's contents into opinion_text, loaded as a DataFrame and converted back to a string
for file_ in all_files:
    # Read each file into a DataFrame
    df = pd.read_table(file_, index_col=None, header=0, encoding='latin1')
    
    # Process the absolute file path. On Linux, change the \\ below to /.
    # Also strip the trailing .data extension
    filename_ = file_.split('\\')[-1]
    filename = filename_.split('.')[0]

    # Append the file name and file contents to their respective lists
    filename_list.append(filename)
    opinion_text.append(df.to_string())

# Build a DataFrame from the file name list and the file contents list
document_df = pd.DataFrame({'filename':filename_list, 'opinion_text':opinion_text})
document_df.head()
  • Apply TF-IDF vectorization and perform K-means clustering

Define LemNormalize, a lemmatization (root-form conversion) function to be passed as the tokenizer argument of TfidfVectorizer.

from nltk.stem import WordNetLemmatizer
import nltk
import string

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

# Lemmatize (convert to root form) the input tokens
def LemTokens(tokens):
    return [lemmar.lemmatize(token) for token in tokens]

# Passed as the tokenizer argument when creating the TfidfVectorizer object to apply lemmatization.
# Takes a sentence as input: remove punctuation -> lowercase -> tokenize into words -> lemmatize.
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
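The function above is then handed to TfidfVectorizer to build the document-term matrix (`feature_vect`) that the clustering code uses. A minimal, self-contained sketch of that step — `str.split` stands in for LemNormalize here so the sketch runs without NLTK corpus downloads, and the toy `docs` list is illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus; the real run vectorizes document_df['opinion_text']
docs = [
    "the battery life is great and the battery charges fast",
    "hotel room was clean and the staff was friendly",
    "screen resolution is sharp with great colors",
]

# str.split stands in for LemNormalize; with NLTK data installed,
# pass tokenizer=LemNormalize instead
tfidf_vect = TfidfVectorizer(tokenizer=str.split, stop_words='english',
                             ngram_range=(1, 2))
feature_vect = tfidf_vect.fit_transform(docs)
print(feature_vect.shape)  # (number of documents, number of features)
```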

  • Apply a clustering algorithm

Check whether the clusters formed well around each topic.
Run clustering on the TF-IDF-transformed feature-vector matrix of the documents to see which documents end up grouped together.

from sklearn.cluster import KMeans

# Cluster into 5 groups; random_state=0 so the example reproduces the same clustering result
km_cluster = KMeans(n_clusters=5, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_
  • Key word extraction

Extract the key words of each cluster.
The documents belonging to each cluster are grouped around key words.
Check the key words that make up each cluster.

cluster_centers = km_cluster.cluster_centers_
print('cluster_centers shape :',cluster_centers.shape)
print(cluster_centers)
  • Use the cluster_centers_ attribute to find each cluster's key words
# Returns the top n key words per cluster, their relative centroid values, and the target file names.
def get_cluster_details(cluster_model, cluster_data, feature_names, clusters_num, top_n_features=10):
    cluster_details = {}
    
    # Return the indexes of the cluster_centers array sorted by descending value,
    # to get the word features with the largest weights per cluster centroid
    centroid_feature_ordered_ind = cluster_model.cluster_centers_.argsort()[:,::-1]
    
    # Iterate over each cluster and record its key words, their relative centroid values, and the target file names
    for cluster_num in range(clusters_num):
        # Initialize the per-cluster data container
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster'] = cluster_num
        
        # Use the indexes from cluster_centers_.argsort()[:,::-1] to get the top n feature words
        top_feature_indexes = centroid_feature_ordered_ind[cluster_num, :top_n_features]
        top_features = [ feature_names[ind] for ind in top_feature_indexes ]
        
        # Use top_feature_indexes to get the relative centroid values of those feature words
        top_feature_values = cluster_model.cluster_centers_[cluster_num, top_feature_indexes].tolist()
        
        # Record each cluster's key words, relative centroid values, and file names in the cluster_details dictionary
        cluster_details[cluster_num]['top_features'] = top_features
        cluster_details[cluster_num]['top_features_value'] = top_feature_values
        filenames = cluster_data[cluster_data['cluster_label'] == cluster_num]['filename']
        filenames = filenames.values.tolist()
        cluster_details[cluster_num]['filenames'] = filenames
        
    return cluster_details

08. Document Similarity

Measuring document similarity: cosine similarity

When comparing vectors, what matters is not their magnitude but how similar their directions are.
Cosine similarity quantifies similarity as the cosine of the angle between two vectors.

๋ฌธ์„œ ์œ ์‚ฌ๋„๋ฅผ ์ธก์ •ํ•˜๋Š” ์‚ฌ์ดํ‚ท๋Ÿฐ API

from sklearn.metrics.pairwise import cosine_similarity

Cosine_similarity()
2๊ฐœ์˜ ์ž…๋ ฅ ํŒŒ๋ผ๋ฏธํ„ฐ
ํฌ์†Œ ํ–‰๋ ฌ, ๋ฐ€์ง‘ ํ–‰๋ ฌ ๋ชจ๋‘ ๊ฐ€๋Šฅ, ํ–‰๋ ฌ ๋˜๋Š” ๋ฐฐ์—ด ๋ชจ๋‘ ๊ฐ€๋Šฅ
์Œ์œผ๋กœ ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ๊ฐ’ ์ œ๊ณต ndarray ์ œ๊ณต

import pandas as pd
import glob, os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings('ignore')

# Lemmatization tokenizer: lowercase, tokenize, drop stop words, lemmatize.
# A tokenizer passed to TfidfVectorizer must return a list of tokens, not a joined string.
def LemNormalize(text):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

path = r'OpinosisDataset1.0\topics'
all_files = glob.glob(os.path.join(path, "*.data"))     
filename_list = []
opinion_text = []

for file_ in all_files:
    df = pd.read_table(file_, index_col=None, header=0, encoding='latin1')
    filename_ = file_.split('\\')[-1]
    filename = filename_.split('.')[0]
    filename_list.append(filename)
    opinion_text.append(df.to_string())

document_df = pd.DataFrame({'filename': filename_list, 'opinion_text': opinion_text})

tfidf_vect = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english',
                             ngram_range=(1, 2), min_df=0.05, max_df=0.85)
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_
document_df['cluster_label'] = cluster_label

ํ˜ธํ…”์„ ์ฃผ์ œ๋กœ ๊ตฐ์ง‘ํ™”๋œ ๋ฌธ์„œ๋ฅผ ์ด์šฉํ•ด ํŠน์ • ๋ฌธ์„œ์™€ ๋‹ค๋ฅธ ๋ฌธ์„œ ๊ฐ„์˜ ์œ ์‚ฌ๋„ ์ธก์ •
ํ˜ธํ…”์„ ์ฃผ์ œ๋กœ ๊ตฐ์ง‘ํ™”๋œ ๋ฐ์ดํ„ฐ ๋จผ์ € ์ถ”์ถœ, ์ด ๋ฐ์ดํ„ฐ์— ํ•ด๋‹นํ•˜๋Š” TfidfVectorizer์˜ ๋ฐ์ดํ„ฐ ์ถ”์ถœ

09. Korean Text Data Processing

Why Korean NLP is hard

Korean is harder to process than Latin-alphabet languages such as English because of spacing (word segmentation) and its many particles (josa).

  • KoNLPy

Python's representative Korean morphological analysis package

  • Mecab

An open-source Korean morphological analyzer
Started in order to overcome the spacing errors of existing morphological analyzers such as Hannanum and Kkma, and the difficulty of obtaining their source code

When a desired word is not tagged by morphological analysis, you can build a user dictionary in the analyzer so that the word gets tagged.

๋„ค์ด๋ฒ„ ์˜ํ™”๋ฆฌ๋ทฐ ๋ฐ์ดํ„ฐ ์‹ค์Šต

  • ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

    Train_df์ด ์กด์žฌํ•˜๋Š” Null์„ ๊ณต๋ฐฑ์œผ๋กœ ๋ณ€ํ™˜
    ์ˆซ์ž์˜ ๊ฒฝ์šฐ ๋‹จ์–ด์ ์ธ ์˜๋ฏธ๊ฐ€ ๋ถ€์กฑํ•˜๋ฏ€๋กœ ํŒŒ์ด์ฌ ์ •๊ทœ ํ‘œํ˜„์‹ ๋ชจ๋“ˆ re๋ฅผ ์ด์šฉํ•ด ๊ณต๋ฐฑ์œผ๋กœ ๋ณ€ํ™˜

import re

# Load the training set, then replace nulls in the document column with a space
train_df = pd.read_csv('nsmc-master/ratings_train.txt', sep='\t')
train_df = train_df.fillna(" ")

# Replace digits with spaces using a regular expression (\d matches a digit)
train_df["document"] = train_df["document"].apply(lambda x: re.sub(r"\d+", " ", x))

# Same processing for the test set
test_df = pd.read_csv('nsmc-master/ratings_test.txt', sep='\t')
test_df = test_df.fillna(" ")
test_df["document"] = test_df["document"].apply(lambda x: re.sub(r"\d+", " ", x))

# Drop the id column
train_df.drop("id", axis=1, inplace=True)
test_df.drop("id", axis=1, inplace=True)

SNS ๋ถ„์„์— ์ ํ•ฉํ•œ Twitter ํด๋ž˜์Šค๋ฅผ ํ•œ๊ธ€ ํ˜•ํƒœ์†Œ ์—”์ง„์œผ๋กœ ์ด์šฉ
Twitter ๊ฐ์ฒด์˜ morphs() ๋ฉ”์„œ๋“œ๋ฅผ ์ด์šฉํ•˜๋ฉด ์ž…๋ ฅ์ธ์ž๋กœ ๋“ค์–ด์˜จ ๋ฌธ์žฅ์„ ํ˜•ํƒœ์†Œ ๋‹จ์–ด ํ˜•ํƒœ๋กœ ํ† ํฐํ™” ํ•ด list ๊ฐ์ฒด๋กœ ๋ฐ˜ํ™˜

from konlpy.tag import Twitter

twitter = Twitter()

def tw_tokenizer(text):
    # Tokenize the text into morpheme words and return them as a list
    tokens_ko = twitter.morphs(text)
    
    return tokens_ko
  • Feature vectorization

Create the TF-IDF feature model with scikit-learn's TfidfVectorizer.

  • ๊ฐ์ • ๋ถ„์„ - ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€

๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€๋ฅผ ์ด์šฉํ•ด ๋ถ„๋ฅ˜ ๊ธฐ๋ฐ˜์˜ ๊ฐ์ • ๋ถ„์„์„ ์ˆ˜ํ–‰
๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ c ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด GridSearchCV๋ฅผ ์ด์šฉ

# GridSearchCV
params = {
    "lr_clf__C": [1, 3.5, 10]
}

grid_cv = GridSearchCV(pipeline, param_grid=params, scoring="accuracy", verbose=1)
grid_cv.fit(train_df['document'], train_df['label'])

print(grid_cv.best_params_, round(grid_cv.best_score_,4))
  • Final sentiment prediction on the test data set
from sklearn.metrics import accuracy_score

# Predict and evaluate (no need to refit best_estimator; it is already trained with the optimal parameters)
best_estimator = grid_cv.best_estimator_
pred = best_estimator.predict(test_df["document"])
acc = accuracy_score(test_df["label"], pred)

print(f"Logistic Regression accuracy: {acc:.4f}")

Mercari Price Suggestion Challenge

  • ๋ฐ์ดํ„ฐ ๋กœ๋“œ
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

mercari_df = pd.read_csv('train.tsv', sep='\t')
mercari_df.head(3)

  • Check feature types and nulls

brand_name, an important factor affecting price, has many nulls.

  • Distribution of the target values

The price values are skewed toward relatively low prices.
Log-transform them so the distribution becomes closer to normal.

import numpy as np

mercari_df['price'] = np.log1p(mercari_df['price'])
mercari_df['price'].head(3)
  • ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

category_name์˜ '/'๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‹จ์–ด ํ† ํฐํ™” ํ›„ ๊ฐ๊ฐ ๋ณ„๋„ ํ”ผ์ฒ˜๋กœ ์ €์žฅ
Null์ด ์•„๋‹Œ ๊ฒฝ์šฐ split("/")์„ ์ด์šฉํ•ด ๋Œ€, ์ค‘, ์†Œ๋ถ„๋ฅ˜๋ฅผ ๋ถ„๋ฆฌ (๋ฆฌ์ŠคํŠธ๋กœ ๋ฐ˜ํ™˜)
Null ์ผ๊ฒฝ์šฐ except catch ํ•˜์—ฌ ๋Œ€, ์ค‘, ์†Œ๋ถ„๋ฅ˜ ๋ชจ๋‘ 'Other Null' ๊ฐ’์„ ๋ถ€์—ฌ

# Splitter function called from apply lambda; returns the large/middle/small category values as a list
def split_cat(category_name):
    try:
        return category_name.split('/')
    except:
        return ['Other_Null', 'Other_Null', 'Other_Null']
    
# Call split_cat() from apply lambda to create the large/middle/small category columns in mercari_df
mercari_df['cat_dae'], mercari_df['cat_jung'], mercari_df['cat_so'] = zip(*mercari_df['category_name'].apply(
    lambda x : split_cat(x)))

print("Large category values:\n", mercari_df['cat_dae'].value_counts())
print("Number of middle categories:", mercari_df['cat_jung'].nunique())
print("Number of small categories:", mercari_df['cat_so'].nunique())
  • Handle the null values of the other columns with fillna()
mercari_df['brand_name'] = mercari_df['brand_name'].fillna(value='Other_Null')
mercari_df['category_name'] = mercari_df['category_name'].fillna(value='Other_Null')
mercari_df['item_description'] = mercari_df['item_description'].fillna(value='Other_Null')

# Check the null count per column; all should be 0
mercari_df.isnull().sum()
  • ํ”ผ์ฒ˜ ์ธ์ฝ”๋”ฉ๊ณผ ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”

๋ณธ ๋Œ€ํšŒ์—์„œ ์˜ˆ์ธก ๋ชจ๋ธ์€ ์ƒํ’ˆ ๊ฐ€๊ฒฉ์„ ์˜ˆ์ธกํ•ด์•ผ ํ•˜๋ฏ€๋กœ ํšŒ๊ท€ ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•จ
์„ ํ˜• ํšŒ๊ท€์—์„œ๋Š” ์›-ํ•ซ ์ธ์ฝ”๋”ฉ ์„ ํ˜ธ

ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”์˜ ๊ฒฝ์šฐ ์งง์€ ํ…์ŠคํŠธ - Count ๊ธฐ๋ฐ˜ ๋ฒกํ„ฐํ™”
๊ธด ํ…์ŠคํŠธ - TF-IDF ๊ธฐ๋ฐ˜ ๋ฒกํ„ฐํ™” ์ ์šฉ

  • ๋ฆฟ์ง€ ํšŒ๊ท€ ๋ชจ๋ธ ๊ตฌ์ถ• ๋ฐ ํ‰๊ฐ€

์˜ˆ์ธก๋œ price๊ฐ’์„ ๋‹ค์‹œ ์ง€์ˆ˜ ๋ณ€ํ™˜์„ ํ†ตํ•ด ์›๋ณตํ•ด์•ผ ํ•จ
Evaluate_org_price(y_text, preds) ์›๋ณต๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ RMSLE๋ฅผ ์ ์šฉ
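Neither helper is defined in this post; a hedged sketch of what they plausibly look like, matching the names and usage in the snippets below (the explicit y argument of model_train_predict is my addition for self-containment — the original presumably reads the target from mercari_df['price']):

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split

def rmsle(y, y_pred):
    # Root mean squared log error on original-scale prices
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(y_pred)) ** 2))

def evaluate_org_price(y_test, preds):
    # The target was trained as log1p(price), so restore both the
    # predictions and the targets with expm1 before applying RMSLE
    preds_org = np.expm1(preds)
    y_test_org = np.expm1(y_test)
    return rmsle(y_test_org, preds_org)

def model_train_predict(model, matrix_list, y):
    # Stack the per-feature sparse matrices, split, train, and predict
    X = hstack(matrix_list).tocsr()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=156)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return preds, y_test
```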

  • Ridge regression prediction
linear_model = Ridge(solver="lsqr", fit_intercept=False)

sparse_matrix_list = (X_name, X_brand, X_item_cond_id,
                     X_shipping, X_cat_dae, X_cat_jung, X_cat_so)

linear_preds, y_test = model_train_predict(model=linear_model, matrix_list=sparse_matrix_list)
print('RMSLE without Item Description:', evaluate_org_price(y_test, linear_preds))

sparse_matrix_list = (X_descp, X_name, X_brand, X_item_cond_id, 
                     X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
linear_preds, y_test = model_train_predict(model=linear_model, matrix_list=sparse_matrix_list)
print('RMSLE with Item Description:', evaluate_org_price(y_test, linear_preds))
  • Building the LightGBM regression model and final ensemble evaluation

Run regression with LightGBM, then blend the LightGBM predictions with the earlier Ridge model predictions in a simple ensemble and evaluate the final regression predictions.

from lightgbm import LGBMRegressor

sparse_matrix_list = (X_descp, X_name, X_brand, X_item_cond_id,
                     X_shipping, X_cat_dae, X_cat_jung, X_cat_so)

lgbm_model = LGBMRegressor(n_estimators=200, learning_rate=0.5, num_leaves=125, random_state=156)
lgbm_preds, y_test = model_train_predict(model=lgbm_model, matrix_list=sparse_matrix_list)
print('LightGBM RMSLE:', evaluate_org_price(y_test, lgbm_preds))

preds = lgbm_preds*0.45 + linear_preds*0.55
print('Final RMSLE of the LightGBM + Ridge ensemble:', evaluate_org_price(y_test, preds))