[How Long Will You Keep Putting Off NLP? Just Come In!!] #3. Vectorization

Ji · May 7, 2021

Vectorization?

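Vectorization means converting natural language into numbers that a computer can work with. The set of unique tokens that get converted to vectors is called the vocabulary. The larger the vocabulary, the longer training takes.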
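To make the idea of a vocabulary concrete, here is a minimal sketch in plain Python. The two sentences are an assumption: they are reconstructed from the outputs shown later in this post. Ids are handed out in order of first appearance, so the numbering will not match a library that sorts by frequency, such as the Keras Tokenizer.

```python
# Sketch: build a vocabulary (token -> integer id) from pre-tokenized sentences.
# The sentences are assumed; they are reconstructed from the outputs later in the post.
sentences = [
    "자연어 처리 는 정말 정말 즐거워",
    "즐거운 자연어 처리 다 같이 해보자",
]

vocab = {}
for sentence in sentences:
    for token in sentence.split():
        if token not in vocab:
            # ids start at 1 so that 0 stays free (for example, for padding)
            vocab[token] = len(vocab) + 1

# Replace every token with its id
encoded = [[vocab[token] for token in sentence.split()] for sentence in sentences]

print(vocab)
print(encoded)
```

Every new unique token adds one entry to `vocab`, which is why a bigger corpus usually means a bigger vocabulary and longer training.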

One Hot Encoding

Each token becomes a vector with 1 at the position of that word and 0 everywhere else.

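from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# Space-joined, pre-tokenized sentences. The post never shows `tokens`;
# these two are reconstructed from the printed output below.
tokens = [
    "자연어 처리 는 정말 정말 즐거워",
    "즐거운 자연어 처리 다 같이 해보자",
]

t = Tokenizer()
t.fit_on_texts(tokens)
print("Unique integer assigned to each token")
print("----------------------")
print(t.word_index)
print(" ")

# Convert both sentences in one call
s1, s2 = t.texts_to_sequences(tokens)
print("Sentence 1 as integers")
print("----------------------")
print(s1)
print(" ")

print("Sentence 2 as integers")
print("----------------------")
print(s2)
print(" ")

# Tokenizer never assigns index 0, so the one-hot width is vocabulary size + 1.
# Without num_classes, to_categorical infers the width from the largest index
# in each sentence, and the two sentences end up with different widths.
num_classes = len(t.word_index) + 1

s1_one_hot = to_categorical(s1, num_classes=num_classes)
print("One-hot encoding of sentence 1")
print("----------------------")
print(s1_one_hot)
print(" ")

s2_one_hot = to_categorical(s2, num_classes=num_classes)
print("One-hot encoding of sentence 2")
print("----------------------")
print(s2_one_hot)

<Output>
Unique integer assigned to each token
----------------------
{'자연어': 1, '처리': 2, '정말': 3, '는': 4, '즐거워': 5, '즐거운': 6, '다': 7, '같이': 8, '해보자': 9}
 
Sentence 1 as integers
----------------------
[1, 2, 4, 3, 3, 5]
 
Sentence 2 as integers
----------------------
[6, 1, 2, 7, 8, 9]
 
One-hot encoding of sentence 1
----------------------
[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]]
 
One-hot encoding of sentence 2
----------------------
[[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]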

Drawback: as the vocabulary grows, the vectors take up more and more space, and they also become sparse (almost every entry is 0).
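The sparsity is easy to see with a hand-rolled version. The sketch below uses NumPy only; `one_hot` is a hypothetical helper, not a library function, and the vocabulary size of 10,000 is just an illustrative number.

```python
import numpy as np

def one_hot(ids, vocab_size):
    """Return a (len(ids), vocab_size) matrix with a single 1 in each row."""
    out = np.zeros((len(ids), vocab_size), dtype=np.float32)
    out[np.arange(len(ids)), ids] = 1.0  # row i gets a 1 at column ids[i]
    return out

ids = [1, 2, 4, 3, 3, 5]  # sentence 1 as integers, taken from the output above

dense = one_hot(ids, vocab_size=10)  # 9 tokens plus the unused index 0
print(dense.shape)  # (6, 10)

# Each row holds exactly one non-zero value, so the share of useful
# entries shrinks as 1 / vocab_size while storage grows with vocab_size.
for vocab_size in (10, 10_000):
    matrix = one_hot(ids, vocab_size)
    nonzero_share = np.count_nonzero(matrix) / matrix.size
    print(vocab_size, matrix.nbytes, nonzero_share)
```

The integer sequence itself stays six numbers long no matter how large the vocabulary gets, which is one reason integer ids are often kept as the stored form.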

Count vectorization

Using the vocabulary, each sentence is turned into a vector of counts: how many times each token appears in that sentence.

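from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# fit_transform expects a list of sentences (an iterable of strings), not a single string!!
vectors = vectorizer.fit_transform(tokens)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())
print(vectors.toarray())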

<Output>
['같이', '자연어', '정말', '즐거운', '즐거워', '처리', '해보자']
[[0 1 2 0 1 1 0]
 [1 1 0 1 0 1 1]]

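Each column follows the vocabulary index, and each value is the integer count of that token in the sentence.
Note that the one-character tokens 는 and 다 are missing: CountVectorizer's default token_pattern only keeps tokens of two or more characters.
즐거운 and 즐거워 carry the same meaning, but Okt does not treat them as the same word, so they stay separate tokens. A model may therefore learn tokens with the same meaning as if they were different.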
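The counting itself can be sketched in plain Python. The sentences are assumed (reconstructed from the outputs in this post), and one-character tokens are filtered out to imitate CountVectorizer's default behavior. This is only an illustration of the idea, not how scikit-learn implements it.

```python
from collections import Counter

# Assumed input: the two pre-tokenized sentences behind the outputs in this post
sentences = [
    "자연어 처리 는 정말 정말 즐거워",
    "즐거운 자연어 처리 다 같이 해보자",
]

# Keep tokens of 2+ characters, imitating CountVectorizer's default token_pattern
tokenized = [[tok for tok in s.split() if len(tok) >= 2] for s in sentences]

# Sorted unique tokens define the column order
vocab = sorted({tok for sent in tokenized for tok in sent})

# One row per sentence, one count per vocabulary entry
vectors = []
for sent in tokenized:
    counts = Counter(sent)
    vectors.append([counts[tok] for tok in vocab])

print(vocab)
print(vectors)
```

즐거운 and 즐거워 end up in separate columns here as well, which shows the issue described above: the counts for one meaning are split across two features.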

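TfIdf

TF (term frequency): how many times a word appears in a document.
IDF (inverse document frequency): if a word appears in only a few documents, it is treated as an important word for telling documents apart.

TF-IDF combines the two into a score, so that words that appear often in a document and also discriminate well between documents get high values. The sentence is then vectorized with these scores.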

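from sklearn.feature_extraction.text import TfidfVectorizer

# min_df is left at its default of 1 (keep every token);
# recent scikit-learn versions reject the integer 0
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(tokens)

# TF-IDF vocabulary
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_vocab = tfidf.get_feature_names_out().tolist()
print(tfidf_vocab)
print(tfidf_matrix.toarray())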

<Output>
['같이', '자연어', '정말', '즐거운', '즐거워', '처리', '해보자']
[[0.         0.29017021 0.81564821 0.         0.4078241  0.29017021
  0.        ]
 [0.49922133 0.35520009 0.         0.49922133 0.         0.35520009
  0.49922133]]
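To see where these numbers come from, the sketch below recomputes them by hand from the count vectors. It assumes TfidfVectorizer's default settings: smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The variable names are illustrative.

```python
import math

vocab = ["같이", "자연어", "정말", "즐거운", "즐거워", "처리", "해보자"]
# Count vectors from the Count vectorization section
counts = [
    [0, 1, 2, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1, 1],
]

n_docs = len(counts)

# Document frequency: in how many sentences each token appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed IDF (assumed scikit-learn default, smooth_idf=True)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))  # L2 norm of the row
    tfidf.append([x / norm for x in weighted])

for row in tfidf:
    print([round(x, 8) for x in row])
```

자연어 and 처리 appear in both sentences, so their IDF is the minimum of 1 and they score lower than tokens unique to one sentence. 정말 scores highest because it appears twice in sentence 1 and nowhere else.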

Padding

Sentences fed to a model generally need to have the same length, so shorter sentences are filled with 0s for the missing length. This is padding.
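A minimal sketch of padding in plain Python follows. `pad` is a hypothetical helper and the three sequences are made up for illustration. Keras provides `pad_sequences` for the same job; note that it pads at the front by default, while this sketch pads at the end unless told otherwise.

```python
def pad(sequences, maxlen=None, value=0, side="post"):
    """Pad (or cut) integer sequences so they all have the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)  # default: longest sequence
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]              # cut sequences that are too long
        fill = [value] * (maxlen - len(seq))  # zeros for the missing length
        padded.append(seq + fill if side == "post" else fill + seq)
    return padded

# Hypothetical integer-encoded sentences of unequal length
sequences = [[1, 2, 4, 3, 3, 5], [6, 1, 2], [7]]

print(pad(sequences))              # zeros appended at the end
print(pad(sequences, side="pre"))  # zeros inserted at the front
```

Using 0 as the fill value fits the integer encoding used earlier, where real tokens start at index 1 and 0 is never assigned to a word.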
