[Deep Learning from Scratch 2] 05. Word Embedding

권유진 · January 24, 2022


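Natural language: the language we use in everyday life
Natural language processing (NLP): the field concerned with getting computers to understand our language
Language is written with characters, and its meaning is carried by words.
$\therefore$ In NLP, getting the computer to understand the meaning of words is key.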

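Representing Word Meaning

Thesaurus (synonym dictionary)

Built by hand: people group words into sets of synonyms and near-synonyms
In NLP, thesauri also define finer relations such as "hypernym and hyponym" or "whole and part"
- Representing these relations as a graph lets the connections between words be learned (a word network)
- car = auto, automobile, machine, motorcar
- object $\ni$ motor vehicle $\ni$ car, go-kart, truck
WordNet: a well-known thesaurus
Drawbacks
- 1. Hard to keep up with changes in language over time
- 2. High labor cost
- 3. Cannot express subtle differences between words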

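Count-Based (Statistical) Methods

Corpus: a large collection of text data
$\rarr$ Because it consists of text written by people, it contains plenty of human knowledge about natural language.
ex) Wikipedia, Google News, works of great authors (Shakespeare, Natsume Soseki, etc.)
Corpus preprocessing: lowercasing, splitting on spaces while treating the period as its own token, mapping words to IDs with dictionaries, etc.
$\rarr$ Regular expressions make this easy to do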

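import numpy as np

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')  # treat the period as its own token
    words = text.split(' ')

    # Assign each previously unseen word the next free integer ID
    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    # The corpus is the text expressed as a sequence of word IDs
    corpus = np.array([word_to_id[w] for w in words])

    return corpus, word_to_id, id_to_word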

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
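
The note above mentions that regular expressions make preprocessing easier. A minimal sketch of what that could look like follows; the helper name `tokenize` and the pattern are illustrative assumptions, not part of the original code.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize('You say goodbye and I say hello.')
print(tokens)
```

Unlike replacing `'.'` by hand, this pattern also separates other punctuation such as commas or question marks without extra rules.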

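Distributed representation of a word: representing a word as a vector that captures its meaning
Distributional hypothesis: a word has no meaning by itself; its meaning is formed by its context (the surrounding words)

Co-occurrence matrix: for each word, count how many times every other word appears in its context window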

	you	say	goodbye	and	i	hello	.
you	0	1	0	0	0	0	0
say	1	0	1	0	1	1	0
goodbye	0	1	0	1	0	0	0
and	0	0	1	0	1	0	0
i	0	1	0	1	0	0	0
hello	0	1	0	0	0	0	1
.	0	0	0	0	0	1	0

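$\rarr$ Words that simply occur often appear strongly related even when they are not actually related
$\rarr$ Relatedness between words is measured as the similarity between their vectors
ex) dot product, Euclidean distance, cosine similarity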

cosine\_similarity(x,y) = \cfrac{x_1y_1 + \dots + x_ny_n}{\sqrt{x_1^2+x_2^2+\dots+x_n^2}\sqrt{y_1^2+y_2^2+\dots+y_n^2}}

It measures how similar the directions of the two vectors are (range: -1 to 1)
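
As a small illustration of the formula, the sketch below computes the cosine similarity between the rows for "you" and "i" in the co-occurrence table above. The function name `cos_similarity` and the `eps` guard against division by zero are assumptions made for this example.

```python
import numpy as np

def cos_similarity(x, y, eps=1e-8):
    """Cosine similarity; eps avoids division by zero for all-zero vectors."""
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)  # normalize x to unit length
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)  # normalize y to unit length
    return np.dot(nx, ny)

# Rows of the co-occurrence table for "you" and "i"
you = np.array([0, 1, 0, 0, 0, 0, 0])
i = np.array([0, 1, 0, 1, 0, 0, 0])

sim = cos_similarity(you, i)
print(sim)  # about 0.707, i.e. 1 / sqrt(2)
```

The two words share "say" as a neighbor, which is why the similarity is fairly high even in this tiny corpus.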

# Built by hand
C = np.array([
    [0,1,0,0,0,0,0],
    [1,0,1,0,1,1,0],
    [0,1,0,1,0,0,0],
    [0,0,1,0,1,0,0],
    [0,1,0,1,0,0,0],
    [0,1,0,0,0,0,1],
    [0,0,0,0,0,1,0]
    ], dtype=np.int32)
    
# Function that builds it automatically
def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    
    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size+1):
            left_idx = idx-i
            right_idx = idx+i
            
            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1
            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1
    return co_matrix

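Pointwise Mutual Information (PMI)

Solves the problem that words such as "the" and "a" are judged to be strongly related to other words simply because they are frequent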

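PMI(x,y) = \log_2{\cfrac{P(x,y)}{P(x)P(y)}}\\ = \log_2{\cfrac{\cfrac{C(x,y)}{N}}{\cfrac{C(x)}{N}\cfrac{C(y)}{N}}}\\ = \log_2{\cfrac{C(x,y)\,N}{C(x)C(y)}}\\ \,\\ C(x): \text{number of occurrences of } x,\quad C(x,y): \text{number of co-occurrences of } x \text{ and } y,\quad N: \text{number of words in the corpus}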

When the co-occurrence count is 0, $\log_2{0}=-\infty$, so positive PMI (PPMI) is used instead

\therefore PPMI(x,y) = \max(0, PMI(x,y))
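
To make the effect of PMI concrete, here is a small worked example. All counts are hypothetical numbers chosen for illustration: a corpus of 10,000 words in which "the" appears 1,000 times, "car" 20 times, and "drive" 10 times, with "the" and "car" co-occurring 10 times and "car" and "drive" 5 times.

```python
import math

def pmi(c_xy, c_x, c_y, n):
    """PMI from raw counts: log2(C(x,y) * N / (C(x) * C(y)))."""
    return math.log2(c_xy * n / (c_x * c_y))

N = 10000  # hypothetical corpus size

# Raw co-occurrence counts favor ("the", "car"): 10 versus 5
pmi_the_car = pmi(10, 1000, 20, N)   # log2(5),   about 2.32
pmi_car_drive = pmi(5, 20, 10, N)    # log2(250), about 7.97

# PPMI clips negative values to 0
ppmi_the_car = max(0.0, pmi_the_car)

print(pmi_the_car, pmi_car_drive)
```

Although "the" co-occurs with "car" more often in raw counts, PMI ranks "drive" as far more related, because the high frequency of "the" is divided out.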

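def ppmi(C, verbose=True, eps=1e-8):
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)          # total of all co-occurrence counts
    S = np.sum(C, axis=0)  # per-word totals, used in place of C(x)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            # eps keeps log2 finite when C[i, j] is 0
            pmi = np.log2(C[i, j] * N / (S[j] * S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                # "+ 1" avoids a modulo by zero when total < 100
                if cnt % (total // 100 + 1) == 0:
                    print('%.1f%% done' % (100 * cnt / total))
    return M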
    
text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size)
W = ppmi(C)

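As the vocabulary of the corpus grows, so does the dimensionality of the word vectors
$\therefore$ Use dimensionality reduction: shrink the vectors while keeping as much of the important information as possible
1. SVD (singular value decomposition): factorizes an arbitrary matrix into a product of three matrices ($U$ and $V$ orthogonal, $S$ diagonal)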

X = U\cdot S\cdot V^T

U, S, V = np.linalg.svd(W)  # note: the third return value is already V^T
U[:,:2] # the sparse vectors in W become dense vectors in U; keep the first 2 dimensions
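
The decomposition can be checked numerically. The sketch below uses a small made-up matrix `X` (an assumption for illustration, not the PPMI matrix from the text) to show that the three factors multiply back to the original and that keeping only the leading columns of `U` yields lower-dimensional vectors.

```python
import numpy as np

# Small hypothetical matrix standing in for a PPMI matrix
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0]])

# np.linalg.svd returns U, the singular values as a 1-D array, and V^T
U, S, Vt = np.linalg.svd(X)

# Multiplying the factors back together recovers X
X_rebuilt = U @ np.diag(S) @ Vt
print(np.allclose(X, X_rebuilt))  # True

# Singular values come sorted in descending order, so the first k
# columns of U carry the most important directions
k = 2
word_vecs = U[:, :k]
print(word_vecs.shape)  # (3, 2)
```

Each row of `word_vecs` is the reduced vector for one word, which is what `U[:,:2]` extracts in the code above.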

2. Truncated SVD: keeps only the largest few singular values (the diagonal entries of S) to reduce the dimensionality
$\rarr$ Fast!

from sklearn.utils.extmath import randomized_svd

wordvec_size = 2  # target dimensionality (example value)
U, S, V = randomized_svd(W, n_components=wordvec_size,
                          n_iter=5, random_state=None)

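However, real-world corpora have enormous vocabularies, so building the full co-occurrence matrix and running SVD on it is impractical (SVD of an $n \times n$ matrix costs $O(n^3)$).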

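Inference-Based Methods

Repeatedly solve the problem of guessing which word belongs at a given position from its context, and learn word occurrence patterns in the process.
- Feed in the context and output the occurrence probability of every word
- Training the model to guess correctly yields distributed representations of the words
Because the goal is a compact representation, the hidden layer must have fewer neurons than the input layer
Different corpora yield different distributed representations
Both $W_{in}$ and $W_{out}$ encode word meaning, but using only $W_{in}$ is the most common choice
The method above is the CBOW model of word2vec
Skip-gram is CBOW reversed (it predicts the context from the center word)
$L = -\cfrac{1}{T} \sum_{t=1}^{T} \left(\log{P(w_{t-1}|w_t)}+\log{P(w_{t+1}|w_t)}\right)\\ \because P(w_{t-1},w_{t+1}|w_t)= P(w_{t-1}|w_t)\, P(w_{t+1}|w_t)$ (assuming the context words are conditionally independent given $w_t$)
Skip-gram usually yields better word vectors, while CBOW trains faster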
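
To train on "context in, center word out", the corpus first has to be turned into (contexts, target) pairs. The following is a minimal sketch of that step, assuming a window size of 1 and the word-ID corpus for 'You say goodbye and I say hello.'; the function name `create_contexts_target` is an assumption for this example, and the one-hot conversion that a MatMul-based model needs is omitted.

```python
import numpy as np

def create_contexts_target(corpus, window_size=1):
    """Build (contexts, target) pairs from a sequence of word IDs."""
    # Words at the edges lack a full window, so they are never targets
    target = corpus[window_size:-window_size]
    contexts = []

    for idx in range(window_size, len(corpus) - window_size):
        cs = []
        for t in range(-window_size, window_size + 1):
            if t == 0:
                continue  # skip the target word itself
            cs.append(corpus[idx + t])
        contexts.append(cs)

    return np.array(contexts), np.array(target)

# Word IDs for: you say goodbye and i say hello .
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
contexts, target = create_contexts_target(corpus)

print(contexts)  # each row: [left neighbor, right neighbor]
print(target)    # the center word for each row
```

For CBOW each row of `contexts` is the input and the matching entry of `target` is the label; for skip-gram the roles are swapped.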

import numpy as np
# MatMul and SoftmaxWithLoss are the layer classes from the book's companion code
from common.layers import MatMul, SoftmaxWithLoss

class SimpleCBOW:
    def __init__(self, vocab_size, hidden_size):
        V, H = vocab_size, hidden_size
        
        # Initialize the weights
        W_in = 0.01 * np.random.randn(V,H).astype('f')
        W_out = 0.01 * np.random.randn(H,V).astype('f')  # hidden -> vocabulary scores
        
        # Create the layers (both input layers share W_in)
        self.in_layer0 = MatMul(W_in)
        self.in_layer1 = MatMul(W_in)
        self.out_layer = MatMul(W_out)
        self.loss_layer = SoftmaxWithLoss()
        
        # Collect all weights and gradients into lists
        layers = [self.in_layer0, self.in_layer1, self.out_layer]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads
            
        # Store the distributed word representations in an instance variable
        self.word_vecs = W_in
    def forward(self, contexts, target):
        h0 = self.in_layer0.forward(contexts[:,0])
        h1 = self.in_layer1.forward(contexts[:,1])
        h = (h0+h1)*0.5  # average the two context vectors
        score = self.out_layer.forward(h)
        loss = self.loss_layer.forward(score, target)
        return loss
    def backward(self, dout=1):
        ds = self.loss_layer.backward(dout)
        da = self.out_layer.backward(ds)
        da *= 0.5  # backward of the 0.5 scaling in forward
        self.in_layer1.backward(da)
        self.in_layer0.backward(da)
        return None

Speeding Up Inference-Based Methods

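When computing the hidden layer, index into the weight matrix instead of computing the weighted sum over the input neurons.
$\because$ Multiplying by a one-hot vector is the same as extracting a single row of the weight matrix
If the same index appears more than once, the backward pass must add the corresponding rows of $dh$ into the matching row of $dW$ instead of overwriting it
The output layer likewise uses indexing instead of a full matrix product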
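
The indexing idea can be sketched as a small layer. This is a minimal illustration under assumptions (the class name `Embedding` and the toy weights are chosen here), showing that row lookup matches the one-hot matrix product and that `np.add.at` accumulates gradients for repeated indices.

```python
import numpy as np

class Embedding:
    """Looks up rows of W instead of multiplying by one-hot vectors."""
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.idx = None

    def forward(self, idx):
        W, = self.params
        self.idx = idx
        return W[idx]  # one row per word ID

    def backward(self, dout):
        dW, = self.grads
        dW[...] = 0
        # np.add.at accumulates when the same index appears more than once;
        # plain dW[self.idx] = dout would overwrite earlier rows
        np.add.at(dW, self.idx, dout)
        return None

W = np.arange(21, dtype=np.float64).reshape(7, 3)  # toy weights: 7 words, 3 dims
layer = Embedding(W)
idx = np.array([1, 0, 3, 0])                       # word ID 0 appears twice
h = layer.forward(idx)

# Same result as multiplying one-hot rows by W
one_hot = np.eye(7)[idx]
same = np.allclose(h, one_hot @ W)
print(same)  # True

layer.backward(np.ones_like(h))
dW = layer.grads[0]
print(dW[0])  # gradient for word 0 is summed over both occurrences
```

The lookup costs time proportional to the batch size rather than the vocabulary size, which is where the speedup comes from.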
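Speed also improves by replacing the multi-class classification problem with binary classification
Instead of scoring every word in the vocabulary, the model answers a Yes/No question about a single target word (Softmax $\rarr$ Sigmoid)
Negative sampling
: a new loss function
Negative words are sampled from a probability distribution built from per-word occurrence counts, and the losses for the positive example and the sampled negatives are summed
The base distribution is raised to the power 0.75 and renormalized, so that rare words are not ignored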
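
The sampling and loss steps can be sketched as follows. Everything here is an illustrative assumption: the three-word vocabulary, its counts, the helper names `sample_negatives` and `negative_sampling_loss`, and the random hidden vector and output weights. It is a sketch of the idea, not the book's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical occurrence counts for a 3-word vocabulary
counts = np.array([70.0, 29.0, 1.0])
p = counts / counts.sum()

# Raise to the power 0.75 and renormalize: rare words gain probability mass
p_smoothed = p ** 0.75
p_smoothed /= p_smoothed.sum()  # roughly [0.64, 0.33, 0.03]

def sample_negatives(p_dist, target, size, rng):
    """Sample word IDs from p_dist, never returning the target itself."""
    p_neg = p_dist.copy()
    p_neg[target] = 0.0
    p_neg /= p_neg.sum()
    return rng.choice(len(p_dist), size=size, replace=True, p=p_neg)

def negative_sampling_loss(h, W_out, target, negatives, eps=1e-7):
    """Sigmoid cross-entropy summed over one positive and several negatives."""
    loss = -np.log(sigmoid(W_out[target] @ h) + eps)           # label 1
    for neg in negatives:
        loss += -np.log(1.0 - sigmoid(W_out[neg] @ h) + eps)   # label 0
    return loss

rng = np.random.default_rng(0)
target = 0
negatives = sample_negatives(p_smoothed, target, size=2, rng=rng)

h = rng.standard_normal(4)           # hypothetical hidden vector
W_out = rng.standard_normal((3, 4))  # hypothetical output weights, one row per word
loss = negative_sampling_loss(h, W_out, target, negatives)
print(p_smoothed, negatives, loss)
```

Only the rows of `W_out` for the target and the sampled negatives are touched, so the cost per example no longer grows with the vocabulary size.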