💠 AIchemist 10th Session | Text Analysis (1)

yellowsubmarine372 · December 19, 2023


00. NLP and Text Analysis

NLP technology has evolved with more emphasis on machines understanding and interpreting human language, while text analysis, also called text mining, focuses on extracting meaningful information from unstructured text.

NLP is the foundational technology that advances text analysis.

  • Text classification
    Umbrella term for techniques that predict which specific class or category a document belongs to
  • Sentiment analysis
    Umbrella term for techniques that analyze subjective elements appearing in text, such as emotions, judgments, beliefs, opinions, and moods
  • Text summarization
    Extracting the important themes or central ideas within a text (topic modeling)
  • Text clustering
    Clustering documents of similar types (similarity measurement)

01. Understanding Text Analysis

  • NLP vs. text analysis

NLP: places more emphasis on machines understanding and interpreting human language
Text analysis: places more emphasis on extracting meaningful information from unstructured text

  • Feature vectorization

Extract a large number of features from text and assign each feature a weight such as its word frequency, representing the text as a vector of word-based values.

(e.g.) BoW, Word2Vec, GloVe, FastText

ํ…์ŠคํŠธ ๋ถ„์„ ํ”„๋กœ์„ธ์Šค
(1) ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ
(2) ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”
(3) ML ๋ชจ๋ธ ์ˆ˜๋ฆฝ ๋ฐ ํ•™์Šต/์˜ˆ์ธก/ํ‰๊ฐ€

02. Text Preprocessing - Text Normalization

Carrying out various preparatory operations on text data, such as cleansing, refinement, tokenization, and stemming, so the text can be used as input data for machine learning algorithms or NLP applications.

Cleansing, tokenization, filtering / stopword removal / spelling correction, Stemming, Lemmatization

ํ…์ŠคํŠธ ํ† ํฐํ™” - ๋ฌธ์žฅ ํ† ํฐํ™”

๋ฌธ์žฅ์˜ ๋งˆ์นจํ‘œ, ๊ฐœํ–‰ ๋ฌธ์ž ๋“ฑ ๋ฌธ์ž์˜ ๋งˆ์ง€๋ง‰์„ ๋œปํ•˜๋Š” ๊ธฐํ˜ธ์— ๋”ฐ๋ผ ๋ถ„๋ฆฌ
NTLK์˜ sent_tokenize๋ฅผ ์ด์šฉ

from nltk import sent_tokenize
import nltk

# dataset for periods, newline characters (\n), etc.: download only once
# nltk.download('punkt')

text_sample = 'The Matrix is everywhere its all around us, here even in this room. \
               You can see it out your window or on your television. \
               You feel it when you go to work, or go to church or pay your taxes.'

sentences = sent_tokenize(text = text_sample)

print(type(sentences),len(sentences))
print(sentences)
<class 'list'> 3
['The Matrix is everywhere its all around us, here even in this room.', 'You can see it out your window or on your television.', 'You feel it when you go to work, or go to church or pay your taxes.']

ํ…์ŠคํŠธ ํ† ํฐํ™” - ๋‹จ์–ด ํ† ํฐํ™”

๊ณต๋ฐฑ, ์ฝค๋งˆ, ๋งˆ์นจํ‘œ, ๊ฐœํ–‰๋ฌธ์ž ๋“ฑ์œผ๋กœ ๋‹จ์–ด ๋ถ€๋‹
๋‹จ์–ด์˜ ์ˆœ์„œ๊ฐ€ ์ค‘์š”ํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋ฌธ์žฅ ํ† ํฐํ™”๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  ๋‹จ์–ด ํ† ํฐํ™”๋งŒ ์‚ฌ์šฉํ•ด๋„ ์ถฉ๋ถ„
NTLK์˜ work_tokenize ์ด์šฉ

from nltk import word_tokenize

sentence = "The Matrix is everywhere its all around us, here even in this room."

words = word_tokenize(sentence)

print(type(words), len(words))
print(words)
<class 'list'> 15
['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.']

► Using sent_tokenize() and word_tokenize() together, every word of every sentence can be tokenized, as in the sketch below.
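A minimal sketch of that combination (the helper name tokenize_text is my own; it assumes the text_sample defined above):

from nltk import sent_tokenize, word_tokenize

# hypothetical helper: split the text into sentences, then each sentence into words
def tokenize_text(text):
    return [word_tokenize(sentence) for sentence in sent_tokenize(text)]

word_tokens = tokenize_text(text_sample)
print(type(word_tokens), len(word_tokens))  # <class 'list'> 3
print(word_tokens)

The stopword-removal example further down reuses this word_tokens list.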

ํ…์ŠคํŠธ ํ† ํฐํ™” - ์ •๊ทœํ‘œํ˜„์‹

  • ์ •๊ทœํ‘œํ˜„์‹
    ์ •๊ทœํ‘œํ˜„์‹ ๋ชจ๋ธ re
    ํŠน์ • ๊ทœ์น™์ด ์žˆ๋Š” ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋น ๋ฅด๊ฒŒ ์ •์ œํ•  ์ˆ˜ ์žˆ์Œ
from nltk.tokenize import RegexpTokenizer

text = "Don't be fooled by the dark sounding name, Mr.Jone's Orphanage is a cheery as cheery goes for a pastry shop"

# match runs of word characters ([\w]+): punctuation is dropped and contractions split
tokenizer1 = RegexpTokenizer("[\w]+")
# gaps=True: the pattern marks the separators, so the text is split on whitespace
tokenizer2 = RegexpTokenizer("\s+", gaps = True)

print(tokenizer1.tokenize(text))
print(tokenizer2.tokenize(text))
['Don', 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', 'Mr', 'Jone', 's', 'Orphanage', 'is', 'a', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']
["Don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name,', "Mr.Jone's", 'Orphanage', 'is', 'a', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']

Stopword Removal

  • stopword
    A word that carries little meaning for analysis (is, will, a, the, etc.)

Because of their grammatical role they appear in text very frequently, so unless removed in advance their sheer frequency can make them look like important words.
NLTK provides stopword lists for each language.

# English stopwords (download once: nltk.download('stopwords'))
stopwords = nltk.corpus.stopwords.words('english')

# word tokenization per sentence + stopword removal
# word_tokens: the nested list produced by tokenize_text() above
all_tokens = []

for sentence in word_tokens:
    filtered_words=[]
    
    # each word token in the sentence token
    for word in sentence:
        # convert to lowercase
        word = word.lower()
        # keep the word only if it is not a stopword
        if word not in stopwords:
            filtered_words.append(word)
            
    all_tokens.append(filtered_words)
    
print(all_tokens)
[['matrix', 'everywhere', 'around', 'us', ',', 'even', 'room', '.'], ['see', 'window', 'television', '.'], ['feel', 'go', 'work', ',', 'go', 'church', 'pay', 'taxes', '.']]

Stemming & Lemmatization

Finding the original form of a word that changes grammatically or semantically.

  • Stemming

Uses simplified rules, so it tends to extract stems whose spelling is partially damaged relative to the original word.
NLTK's Porter, Lancaster, and Snowball stemmers.

from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

print(stemmer.stem('working'),stemmer.stem('works'),stemmer.stem('worked'))
print(stemmer.stem('amusing'),stemmer.stem('amuses'),stemmer.stem('amused'))
print(stemmer.stem('happier'),stemmer.stem('happiest'))
print(stemmer.stem('fancier'),stemmer.stem('fanciest'))
work work work
amus amus amus
happy happiest
fant fanciest

  • Lemmatization

Extracts the accurately spelled root, taking grammatical and semantic aspects into account.
The word's part of speech must be supplied.
NLTK's WordNetLemmatizer.
More sophisticated than stemming; finds the word's original form on a semantic basis.

from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')

lemma = WordNetLemmatizer()

# verb: v, adjective: a
print(lemma.lemmatize('amusing','v'),lemma.lemmatize('amuses','v'),lemma.lemmatize('amused','v'))
print(lemma.lemmatize('happier','a'),lemma.lemmatize('happiest','a'))
print(lemma.lemmatize('fancier','a'),lemma.lemmatize('fanciest','a'))
amuse amuse amuse
happy happy
fancy fancy

Korean-Language Preprocessing

  • PyKoSpacing
    Converts a sentence without spacing into a properly spaced sentence
  • Py-Hanspell
    A spelling-correction package built on Naver's Korean spell checker; corrects spacing as well
  • SOYNLP
    Supports POS tagging, word tokenization, and more
    Tokenizes words through unsupervised learning, treating strings that appear frequently in the data as words
  • Customized KoNLPy
    Uses the Twitter morphological analyzer
    A user dictionary can be extended in the form add_dictionary('word', 'POS'), as in the sketch below
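A minimal sketch of the Customized KoNLPy dictionary API (assumes the customized_konlpy package is installed; the noun '은경이' is an arbitrary example):

from ckonlpy.tag import Twitter

twitter = Twitter()
# register a word that the base analyzer would otherwise split incorrectly
twitter.add_dictionary('은경이', 'Noun')

# expected to keep '은경이' as a single token after registration
print(twitter.morphs('은경이는 사무실로 갔습니다.'))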

03. BoW (Bag of Words)

  • BoW

Extracts feature values by ignoring the context and order of the words a document contains and uniformly assigning each word a frequency value.

  • BoW feature vectorization

Lays out every word from every document as columns,
then converts the data into a dataset model whose values are each word's count or normalized frequency in each document.

  • Count vectorization

The case where each word feature is assigned the number of times (Count) the word appears in each document.
Higher count values are treated as more important words and weighted accordingly.
Due to the nature of language, even words that are simply bound to appear often in sentences end up with high values.

TF-IDF Vectorization (Term Frequency - Inverse Document Frequency)

Assigns a high weight to words that appear frequently in an individual document, but penalizes words that appear frequently across all documents.
When each document is long and there are many documents, the TF-IDF approach tends to give better predictive performance than the count approach.
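As a sketch, the textbook form of the weight looks like the following (scikit-learn's TfidfVectorizer adds smoothing terms and L2 normalization on top of this):

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}

Here tf(t, d) is the count of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.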

Sparse Matrices for BoW Vectorization

A large matrix in which zeros make up most of the values.
Feature vectorization of BoW-style language models mostly yields sparse matrices.
The unnecessary zero values allocated in memory demand a lot of space, and computations on them also consume a lot of time.

  • COO format
    Stores only the non-zero data in a separate data array, and stores the row and column positions of that data in separate arrays
  • CSR format
    Resolves COO's problem of having to store repetitive position data to represent the row and column positions (see the sketch below)
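A minimal SciPy sketch of both formats (the matrix values are arbitrary):

import numpy as np
from scipy import sparse

# dense matrix in which most entries are zero
dense = np.array([[3, 0, 1],
                  [0, 2, 0]])

# COO: parallel arrays of non-zero values and their (row, col) positions
data = np.array([3, 1, 2])
rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])
sparse_coo = sparse.coo_matrix((data, (rows, cols)))

# CSR: compresses the repeated row indices into row pointers
sparse_csr = sparse.csr_matrix(dense)

print(sparse_coo.toarray())  # same contents as dense
print(sparse_csr.toarray())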

04. Text Classification Practice - 20 Newsgroups

Text classification: learn the classes of specific documents from training data to build a model, then use that model to predict the classes of other documents.
Converting text into feature vectors generally yields a sparse matrix.

  • Algorithms suited to sparse matrices
    Logistic regression, linear support vector machines, naive Bayes, etc.

Text Normalization

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

news_data = fetch_20newsgroups(subset='all',random_state=156)
# fetch_20newsgroups() downloads the data to the local machine and loads it

news_data.keys()
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
print("Values and distribution of the target classes")
print(pd.Series(news_data.target).value_counts().sort_index())
# train set, removing everything except the body
train_news= fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), random_state=156)
X_train = train_news.data
y_train = train_news.target
print(type(X_train))

# test set, removing everything except the body
test_news= fetch_20newsgroups(subset='test',remove=('headers', 'footers','quotes'),random_state=156)
X_test = test_news.data
y_test = test_news.target

print(f'Training data size {len(train_news.data)} , test data size {len(test_news.data)}')

Feature Vectorization and ML Model Training/Prediction/Evaluation

  • Count-based feature vectorization
from sklearn.feature_extraction.text import CountVectorizer

# Count Vectorization: train
cnt_vect = CountVectorizer()
cnt_vect.fit(X_train)
X_train_cnt_vect = cnt_vect.transform(X_train)

# Count Vectorization: test
X_test_cnt_vect = cnt_vect.transform(X_test)

print(f"ํ•™์Šต ๋ฐ์ดํ„ฐ Text์˜ CountVectorizer Shape: {X_train_cnt_vect.shape}")

When applying the CountVectorizer to the test data, you must transform it with the CountVectorizer object that was already fit() on the training data.
Never use fit_transform() when feature-vectorizing the test data.

  • Training/prediction/evaluation with logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(X_train_cnt_vect , y_train)
pred = lr_clf.predict(X_test_cnt_vect)
lr_acc = accuracy_score(y_test, pred)

print(f"CountVectorized Logistic Regression ์˜ˆ์ธก ์ •ํ™•๋„: {lr_acc:.3f}")
  • TF-IDF ๊ธฐ๋ฐ˜ ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF Vectorization: train
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)

# TF-IDF Vectorization: test
X_test_tfidf_vect = tfidf_vect.transform(X_test)

# LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(X_train_tfidf_vect , y_train)
pred = lr_clf.predict(X_test_tfidf_vect)
lr_acc = accuracy_score(y_test, pred)

print(f"TF-IDF Logistic Regression ์˜ˆ์ธก ์ •ํ™•๋„: {lr_acc:.3f}")
  • ์ตœ์  ML ์•Œ๊ณ ๋ฆฌ์ฆ˜/ ํ”ผ์ฒ˜ ์ „์ฒ˜๋ฆฌ/ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์€ GridSearchCV๋กœ ์ง„ํ–‰

# adjusting the feature-vectorization parameters
# TF-IDF Vectorization: train
tfidf_vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=300 )
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)

# TF-IDF Vectorization: test
X_test_tfidf_vect = tfidf_vect.transform(X_test)

# LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(X_train_tfidf_vect , y_train)
pred = lr_clf.predict(X_test_tfidf_vect)
lr_acc = accuracy_score(y_test, pred)

print(f"TF-IDF ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ • ํ›„ ์˜ˆ์ธก ์ •ํ™•๋„: {lr_acc:.3f}")
from sklearn.model_selection import GridSearchCV

# optimal hyperparameter: C
params = {
    "C":[0.01, 0.1, 1, 5, 10]
}

# GridSearchCV
grid_cv_lr = GridSearchCV(lr_clf ,param_grid = params , cv=3 , scoring='accuracy' , verbose=1 )
grid_cv_lr.fit(X_train_tfidf_vect , y_train)
print('Logistic Regression best C parameter :',grid_cv_lr.best_params_ )

# ์ตœ์  C ๊ฐ’์œผ๋กœ ํ•™์Šต๋œ grid_cv๋กœ ์˜ˆ์ธก/ํ‰๊ฐ€
pred = grid_cv_lr.predict(X_test_tfidf_vect)
lr_acc = accuracy_score(y_test, pred)

print(f"๋กœ์ง€์Šคํ‹ฑ ์ตœ์  ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ ์šฉ ํ›„ ์˜ˆ์ธก ์ •ํ™•๋„: {lr_acc:.3f}")

05. Sentiment Analysis

Sentiment analysis

A method for identifying the subjective sentiments, opinions, emotions, moods, and so on of a document.
Computes a sentiment score based on the various subjective words and the context expressed by the text in the document.
Consists of a positive sentiment score and a negative sentiment score; combining them determines whether the sentiment is positive or negative.

๊ฐ์ • ๋ถ„์„ ๋ฐฉ์‹
(1) ์ง€๋„ ํ•™์Šต
ํ•™์Šต๋ฐ์ดํ„ฐ์™€ ํƒ€๊นƒ ๋ ˆ์ด๋ธ” ๊ฐ’์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐ์ • ๋ถ„์„ ํ•™์Šต ์ˆ˜ํ–‰
์ด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์˜ ๊ฐ์ • ๋ถ„์„ ์˜ˆ์ธก
์ผ๋ฐ˜์ ์ธ ํ…์ŠคํŠธ ๊ธฐ๋ฐ˜์˜ ๋ถ„๋ฅ˜์™€ ๊ฑฐ์˜ ๋™์ผ
(2) ๋น„์ง€๋„ ํ•™์Šต
Lexicon์ด๋ผ๋Š” ์ผ์ข…์˜ ๊ฐ์„ฑ ์–ดํœ˜ ์‚ฌ์ „ ํ™œ์šฉ
์šฉ์–ด/๋ฌธ๋งฅ ์ •๋ณด ์ด์šฉ -> ๊ธ์ •์ , ๋ถ€์ •์  ๊ฐ์ • ์—ฌ๋ถ€ ํ™•์ธ

์ง€๋„ํ•™์Šต ๊ธฐ๋ฐ˜ ๊ฐ์„ฑ ๋ถ„์„ ์‹ค์Šต - IMDB ์˜ํ™”ํ‰

  • ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ
    HTML ํ˜•์‹์˜ ํ…์ŠคํŠธ <br/> ํƒœ๊ทธ ์‚ญ์ œ (๊ณต๋ฐฑ ์ฒ˜๋ฆฌ)
    ์ˆซ์ž/ ํŠน์ˆ˜๋ฌธ์ž ์‚ญ์ œ ์ •๊ทœํ‘œํ˜„์‹ ํ™œ์šฉ
import re
import pandas as pd

# review_df is assumed to have been loaded beforehand, e.g. from the IMDB labeledTrainData.tsv:
# review_df = pd.read_csv('labeledTrainData.tsv', header=0, sep='\t', quoting=3)

# replace <br /> HTML tags with spaces
review_df["review"] = review_df["review"].str.replace("<br />", " ")

# remove non-alphabetic characters
# re.sub(regex, new_text, old_text)
review_df["review"] = review_df["review"].apply( lambda x : re.sub("[^a-zA-Z]", " ", x) )

  • Data preparation

Extract the sentiment column, the target class, as the label dataset.
Drop the id and sentiment columns from the original dataset to form the feature dataset.
Use train_test_split to divide it into training and test datasets.

from sklearn.model_selection import train_test_split

y_target = review_df["sentiment"]
X_feature = review_df["review"]

X_train, X_test, y_train, y_test= train_test_split(X_feature, y_target, test_size=0.3, random_state=156)

X_train.shape, X_test.shape

  • Feature vectorization & measuring prediction performance (see the sketch after this list)
    Use a Pipeline object
    Apply count vectorization / TF-IDF vectorization
    Use LogisticRegression as the classifier
    Prediction performance evaluation: accuracy on the test dataset + ROC-AUC
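A minimal sketch of that pipeline (the TF-IDF variant; C=10 and the n-gram range are arbitrary choices):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# chaining the vectorizer and classifier keeps fit/transform consistent between train and test
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('lr_clf', LogisticRegression(C=10))
])

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)
pred_probs = pipeline.predict_proba(X_test)[:, 1]

print(f"Accuracy: {accuracy_score(y_test, pred):.4f}, ROC-AUC: {roc_auc_score(y_test, pred_probs):.4f}")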

๋น„์ง€๋„ํ•™์Šต ๊ธฐ๋ฐ˜ ๊ฐ์„ฑ ๋ถ„์„

๊ฒฐ์ •๋œ ๋ ˆ์ด๋ธ”๊ฐ’์„ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„

  • Lexicon ๊ธฐ๋ฐ˜

์ฃผ๋กœ ๊ฐ์„ฑ๋งŒ์„ ๋ถ„์„ํ•˜๊ธฐ ์œ„ํ•ด ์ง€์› ๊ฐ์„ฑ ์–ดํœ˜ ์‚ฌ์ „
๊ฐ์ • ์ง€์ˆ˜ ํ™œ์šฉ
๋‹จ์–ด์˜ ์œ„์น˜, ์ฃผ๋ณ€ ๋‹จ์–ด, ๋ฌธ๋งฅ, POS ๋“ฑ์„ ์ฐธ๊ณ ํ•ด ๊ฒฐ์ • (NLTK ํŒจํ‚ค์ง€ ํ™œ์šฉ)

  • WordNet

A vast English lexical dictionary module provided by NLTK.
Provides semantic information for words that are used differently in different situations even when the spelling is the same.
Represents each individual word, organized by part of speech, using the concept of a Synset.

  • Synset

Provides a word's context plus its semantic information.
Calling synsets() returns a list holding multiple Synset objects.
Can express semantic elements such as the POS, definition, and lemmas.

from nltk.corpus import wordnet as wn

term = 'present'

# create WordNet synsets for the term 'present'
synsets = wn.synsets(term)
# Synset attributes: name/POS/definition/lemmas
for i, synset in enumerate(synsets):
    print('##### Synset name : ', synset.name(),'#####')
    print('POS :', synset.lexname())
    print('Definition:', synset.definition())
    print('Lemmas:', synset.lemma_names())
    print("\n")
    
    if i == 3:
        break

06. Topic Modeling

Finding the topics hidden in a collection of documents.

  • Machine-learning-based topic models

Concisely extract the central words that effectively express the hidden topics.

  • LSA (Latent Semantic Analysis)
    The plain DTM, and the TF-IDF matrix that weights the DTM by word importance, share the drawback of not considering word meaning or context at all
    LSA basically applies truncated SVD to the DTM or TF-IDF matrix to reduce its dimensionality, and adding new data to an already computed LSA usually means recomputing everything from scratch
    Hard to update with new information

  • LDA (Latent Dirichlet Allocation)
    Complements LSA's weaknesses
    Assumes that documents are mixtures of topics and that topics generate words based on probability distributions
    Given the data, LDA traces the document-generation process backwards
    Only count-based vectorizers apply (see the sketch below)
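A minimal scikit-learn sketch of both approaches (it assumes X_train is a list of raw document strings, like the ones loaded earlier; n_components=8 and max_features=1000 are arbitrary choices):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

# LSA: truncated SVD on a TF-IDF matrix
tfidf_vect = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = tfidf_vect.fit_transform(X_train)
lsa = TruncatedSVD(n_components=8, random_state=156)
lsa_topics = lsa.fit_transform(tfidf_matrix)

# LDA: requires count-based vectorization
cnt_vect = CountVectorizer(stop_words='english', max_features=1000)
cnt_matrix = cnt_vect.fit_transform(X_train)
lda = LatentDirichletAllocation(n_components=8, random_state=156)
lda_topics = lda.fit_transform(cnt_matrix)

# top five words for each LDA topic
feature_names = cnt_vect.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {top_words}")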