[Paper Review] Recurrent Convolutional Neural Networks for Text Classification

fla1512 · July 14, 2022


`Abstract`

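Text classification is a core task in many NLP applications.
Traditional text classifiers
- rely mainly on human-designed features (dictionaries, knowledge bases, special tree kernels)

(In contrast, this paper proposes)

recurrent convolutional neural network
- A method for text classification.
- Needs no human-designed features.
1. Applies a recurrent structure.
  - Why? To capture contextual information
    - this introduces considerably less noise when learning word representations than traditional window-based neural networks
2. Applies a max-pooling layer
  - Max-pooling layer: automatically judges which words play key roles in text classification
Experiments on four datasets show that the proposed method, RCNN, gives the best results on several of them, particularly on document-level datasets.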

`Introduction`

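Text classification is an essential component in many applications.

(The key problem here is)

feature representation
- mostly based on the bag-of-words (BoW) model
  - (in BoW, unigrams, bigrams, n-grams, or some exquisitely designed patterns are typically extracted as features)
- frequency, MI, pLSA, LDA, etc. are applied to select more discriminative features
  
  → Even so, traditional feature representation methods have limitations
1. They ignore contextual information
2. They ignore word order in the text
3. As a result, they remain unsatisfactory at capturing the semantics of words.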
  
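  e.g., “A sunset stroll along the South Bank affords an array of stunning vantage points.”
  
  “Bank” (unigram) → cannot tell whether it means a financial institution or the bank of a river.
  
  “South Bank” (bigram) → someone unfamiliar with London may still read it as a financial institution.
  
  (With a larger context)
  
  “stroll along the South Bank” (5-gram) → the meaning becomes clear
  
  → In short, high-order n-grams and more complex features can capture more contextual information, but they still suffer from the data sparsity problem, which heavily affects classification accuracy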

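💡 data sparsity: most possible features (e.g., a specific high-order n-gram) appear rarely or never in the training data, so the model has too few examples of each one to generalize well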
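To make the n-gram example concrete, here is a minimal sketch that extracts the unigrams, bigrams, and 5-grams containing “Bank” from the sentence above. The helper `ngrams` and the whitespace tokenization are my own assumptions, not something the paper specifies. The point is that a longer n-gram pins down the meaning better, but the number of possible n-grams grows roughly as |V|^n, so any single one is seen very rarely.

```python
def ngrams(tokens, n):
    """Return every contiguous n-gram in `tokens` as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "A sunset stroll along the South Bank affords an array of stunning vantage points"
tokens = sentence.split()  # naive whitespace tokenization (assumption)

for n in (1, 2, 5):
    grams = ngrams(tokens, n)
    # Every n-gram that contains "Bank" is one candidate feature for that word.
    with_bank = [" ".join(g) for g in grams if "Bank" in g]
    print(f"n={n}: {len(grams)} n-grams in total, containing 'Bank': {with_bank}")
```

Each 5-gram carries more context than the unigram “Bank”, but the exact same 5-gram is unlikely to appear again in another document, which is the sparsity problem described above.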

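(Recently, advances in pre-trained word embeddings and deep neural networks have brought new inspiration to NLP tasks)

Changes in NLP tasks
- (pre-trained) Word embedding
  - distributed representation of words
  - greatly alleviates the data sparsity problem

💡 Pre-trained word embeddings
Taking word embeddings that have already been trained and using them as the embedding vectors. When training data is scarce and it is hard to learn well-optimized embedding vectors, one can instead use embedding vectors already trained with Word2Vec, GloVe, etc. on much more data, even though they are not specialized for the task at hand.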

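The Recursive Neural Network (RecursiveNN)
- proven effective at constructing sentence representations
- However, it captures the semantics of a sentence through a tree structure → its performance depends heavily on how well the textual tree is constructed → constructing the tree takes at least O(n^2) time (especially time-consuming for long sentences) → so RecursiveNN is unsuitable for long sentences or documents

RecursiveNN: groups a few of the input words at a time and analyzes them, skipping some information along the way; a model that builds the hierarchical nature of language directly into its network structure
RecursiveNN (diagram and further reading): Recursive Neural Networks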

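Recurrent Neural Network (RecurrentNN)
- has a time complexity of only O(n)
- analyzes a text word by word and stores the semantics of all the previous text in a fixed-sized hidden layer.
- (a network that takes its inputs in order and processes them one at a time): given the input ‘the country of my birth’, the first input is the word vector for ‘the’, the next is ‘country’, and then ‘of’, ‘my’, and ‘birth’ in turn. No input is skipped; the words are processed in the order they appear. This example has a single hidden layer, and the last hidden node, (2.5, 3.8), reflects all the preceding context (the, country, of, my) together with the current input (birth).
- good at capturing contextual information → good for capturing the semantics of long texts
- However, it is a biased model
  - later words are more dominant than earlier words
    - this can reduce effectiveness when capturing the semantics of a whole text (because key components can appear anywhere, not only at the end)
    
    → How can this be solved?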
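The Convolutional Neural Network (CNN) was introduced to NLP
- unbiased model
- a max-pooling layer can fairly determine the discriminative phrases in a text → CNN may capture the semantics of text better than the two models above (and its time complexity is also O(n))
  - (however, in most cases) CNNs use simple convolutional kernels (e.g., a fixed window), and choosing the window size is difficult
    - small window → may lose critical information
    - large window → results in an enormous parameter space

(To overcome the limitations of the three models above, this paper proposes)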

Recurrent Convolutional Neural Network (RCNN)
1. Applies a bi-directional recurrent structure
  - bi-directional recurrent structure: introduced because the context on both sides of a word often matters; a bidirectional RNN stores and uses the following context as well as the preceding context. e.g., when predicting 'ㅁ' in the Korean sentence '나는 오늘 ㅁ을 먹었다.' (word by word: 'I', 'today', 'ㅁ', 'ate'), the preceding words '나는' (I) and '오늘' (today) alone are not enough. Using the following word '먹었다' (ate) as well makes it much easier to predict '나는 오늘 밥을 먹었다' ('I ate a meal today'). (Source: [바람돌이/딥러닝] RNN(Recurrent Neural Network) - 순환 신경망 이론 및 개념)
  - introduces considerably less noise than traditional window-based neural networks → captures the contextual information to the greatest extent possible when learning word representations → can reserve a larger range of the word ordering when learning representations of texts.
2. Applies a max-pooling layer
  - automatically judges which features play key roles in text classification (to capture the key components in the text)
  
  → Combining the two, the model…
- uses the advantages of both recurrent neural models and convolutional neural models
- has a time complexity of O(n), which is linearly correlated with the length of the text.
Goal: compare the model with previous state-of-the-art approaches on four different types of tasks

`Related Work`

1. Text Classification

Traditional text classification
1. feature engineering
  - bag-of-words (the most widely used feature)
    - a frequency-based representation of text that ignores the order in which words appear
    - = a "bag" of words
  - part-of-speech tags
  - noun phrases
  - tree kernels
2. feature selection
  - aims to remove noisy features and improve classification performance
  - e.g., removing the stop words (e.g., “the”)
  - information gain, mutual information, and L1 regularization are also used
3. Various machine learning algorithms
  - classifiers such as logistic regression (LR), naive Bayes (NB), and support vector machine (SVM) → all suffer from the data sparsity problem
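A tiny sketch of why bag-of-words "ignores word order": two sentences with opposite meanings end up with the same representation. The function name `bag_of_words` and the lowercase/whitespace tokenization are my own choices for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each token occurs; the position of each token is discarded."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

print(a)
print(a == b)  # both sentences map to the same bag
```

Since the two bags are identical, no classifier on top of plain unigram BoW features can tell these sentences apart.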

2. Deep neural networks

Emerged as a new idea for addressing data sparsity
Word embeddings
- neural representation of a word
- real-valued vector
- semantic relatedness between words can be measured (by the distance between two embedding vectors)
- pre-trained word embeddings have proven their value in NLP, in work using
  - semi-supervised recursive autoencoder
  - paraphrase detection
  - recursive neural tensor network
  - recurrent neural network
  - novel recurrent network
  - convolutional neural network
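As a rough illustration of "measuring relatedness by the distance between two embedding vectors", the sketch below uses cosine similarity, which is one common choice (the notes above do not say which distance is meant). The three-dimensional vectors are made up for the example and are not real pre-trained embeddings.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy vectors invented for illustration only (hypothetical values).
embeddings = {
    "good": np.array([0.9, 0.1, 0.3]),
    "great": np.array([0.8, 0.2, 0.4]),
    "boring": np.array([-0.7, 0.5, 0.1]),
}

for other in ("great", "boring"):
    sim = cosine_similarity(embeddings["good"], embeddings[other])
    print(f"good vs {other}: {sim:.3f}")
```

With real embeddings (e.g., trained with Word2Vec), related words tend to score higher than unrelated ones in the same way.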

`Methodology`

1. Model

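Capture the semantics of a text with a deep neural model.
Input: D (document) = sequence of words w1, w2 . . . wn
Output: contains class elements (the labels of each of the four datasets)
- p(k|D, θ) denotes the probability that the document belongs to class k (θ: the parameters of the network)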

1.1 Word Representation Learning

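A word is represented by combining the word itself with its context!
The context helps obtain a more precise word meaning
recurrent structure
- a bi-directional recurrent neural network that captures the contexts
- cl(wi): left context of word wi, computed with Equation (1)
  - e(wi−1) is the word embedding of word wi−1
    
    💡 Word embedding: a way of representing a word as a vector, converting the word into a dense representation
  - e(wi−1): a dense vector with |e| real-valued elements
- cr(wi): right context of word wi, computed in the same manner → cl(wi) and cr(wi) are both dense vectors with |c| real-valued elements
- W^(l): a matrix that transforms the hidden layer (context) into the next hidden layer
- W^(sl): a matrix that combines the semantics of the current word with the next word's left context
- f: a non-linear activation function
- the context vectors capture the semantics of all left- and right-side contexts
  - e.g., Figure 1
    - cl(w7): encodes the semantics of the left-side context “stroll along the South”.
    - cr(w7): encodes the semantics of the right-side context “affords an . . . ”.
- the representation of word wi is the concatenation of cl(wi), e(wi), and cr(wi)
    → as a result, the contextual information makes the model better able to disambiguate the meaning of the word
  - → cl is obtained by a forward scan of the text and cr by a backward scan of the text
  - xi: the representation of word wi
  - a linear transformation followed by the tanh activation function is applied to xi, and the result is sent to the next layer
  - yi^(2): a latent semantic vector, in which each semantic factor is analyzed to determine the most useful factor for representing the text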
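For reference, the definitions above fit together roughly as follows. This is my transcription of the paper's Equations (1) to (4); W^{(r)} and W^{(sr)} are the right-context counterparts of W^{(l)} and W^{(sl)}, and W^{(2)}, b^{(2)} are the weights and bias of the tanh layer, none of which are named in the notes above.

```latex
\begin{align}
c_l(w_i) &= f\left(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1})\right) \tag{1} \\
c_r(w_i) &= f\left(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1})\right) \tag{2} \\
x_i &= \left[\, c_l(w_i);\; e(w_i);\; c_r(w_i) \,\right] \tag{3} \\
y_i^{(2)} &= \tanh\left(W^{(2)} x_i + b^{(2)}\right) \tag{4}
\end{align}
```

Equation (1) runs left to right (each left context depends on the previous one), and Equation (2) runs right to left, which is why one forward scan and one backward scan are enough.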

1.2 Text Representation Learning

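In this work the CNN is designed to represent the text (the idea: max-pooling lets the model pick out important words fairly, which a plain RNN cannot do)
From the CNN point of view, the recurrent structure described above plays the role of the convolutional layer
max-pooling layer
- applied once the representations of all words have been computed.
- max function: an element-wise function
💡 element-wise function: a function applied to each position independently; here, the k-th element of y^(3) is the maximum of the k-th elements of all yi^(2)
- pooling layer
  - converts texts of various lengths into a fixed-length vector
  - can capture information throughout the entire text
    
    (* why average pooling is not used here: only a few words and their combinations are useful for capturing the meaning of the document)
  - max-pooling tries to find the most important latent semantic factors in the document
  - takes the output of the recurrent structure as its input
  - time complexity O(n) (why? the overall model is a cascade of the recurrent structure and the max-pooling layer, each of which is O(n))
output layer
- the softmax function is applied to the output of Equation (6)
- converts the output numbers into probabilities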
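Putting 1.1 and 1.2 together, a forward pass for one document could be sketched as below. This is only a NumPy sketch under my own assumptions, not the authors' implementation: I use tanh for the activation f, the parameter names in the dictionary are hypothetical, the sizes are arbitrary, and the comments refer to equation numbers in the paper.

```python
import numpy as np


def init_params(e_dim, c_dim, hidden, n_classes, rng, scale=0.1):
    """Draw every parameter from a uniform distribution (the range is an assumption)."""
    def u(*shape):
        return rng.uniform(-scale, scale, size=shape)
    return {
        "W_l": u(c_dim, c_dim), "W_sl": u(c_dim, e_dim),   # left-context weights
        "W_r": u(c_dim, c_dim), "W_sr": u(c_dim, e_dim),   # right-context weights
        "c_l1": u(c_dim), "c_rn": u(c_dim),                # initial contexts cl(w1), cr(wn)
        "W2": u(hidden, 2 * c_dim + e_dim), "b2": u(hidden),
        "W4": u(n_classes, hidden), "b4": u(n_classes),
    }


def rcnn_forward(E, p):
    """E: (n, e_dim) word embeddings of one document. Returns (class probs, pooling winners)."""
    n = E.shape[0]
    c_dim = p["W_l"].shape[0]

    # Eq. (1): left contexts, computed in a forward scan.
    c_left = np.zeros((n, c_dim))
    c_left[0] = p["c_l1"]
    for i in range(1, n):
        c_left[i] = np.tanh(p["W_l"] @ c_left[i - 1] + p["W_sl"] @ E[i - 1])

    # Eq. (2): right contexts, computed in a backward scan.
    c_right = np.zeros((n, c_dim))
    c_right[n - 1] = p["c_rn"]
    for i in range(n - 2, -1, -1):
        c_right[i] = np.tanh(p["W_r"] @ c_right[i + 1] + p["W_sr"] @ E[i + 1])

    X = np.concatenate([c_left, E, c_right], axis=1)   # Eq. (3): x_i = [cl; e; cr]
    Y2 = np.tanh(X @ p["W2"].T + p["b2"])              # Eq. (4): latent semantic vectors
    y3 = Y2.max(axis=0)                                # Eq. (5): element-wise max over words
    winners = Y2.argmax(axis=0)                        # which word supplied each maximum
    y4 = p["W4"] @ y3 + p["b4"]                        # Eq. (6): output layer
    z = np.exp(y4 - y4.max())                          # Eq. (7): softmax (shifted for stability)
    return z / z.sum(), winners


rng = np.random.default_rng(0)
params = init_params(e_dim=8, c_dim=6, hidden=10, n_classes=4, rng=rng)
for n_words in (3, 12):
    E = rng.normal(size=(n_words, 8))   # random stand-ins for word embeddings
    probs, winners = rcnn_forward(E, params)
    print(n_words, probs.shape, round(float(probs.sum()), 6))
```

Whatever the document length, the pooled vector has a fixed size (`hidden`), and the two scans plus the pooling each touch every word once, which matches the O(n) argument above.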

2. Training

2.1 Training Network parameters

E : word embeddings
b: bias vectors
cl(w1), cr(wn): initial contexts
W : transformation matrixes
|V|: number of words in the vocabulary
H: hidden layer size
O: number of document types

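→ The training target of the network is to maximize the log-likelihood with respect to θ

log-likelihood: the log of the probability the model assigns to the observed samples = how consistent the samples are with the model's probability distribution

→ Stochastic gradient descent is applied

Why? To optimize the training target
In each step, an example is picked at random and a gradient step is taken
One commonly used trick is applied
- all parameters are initialized from a uniform distribution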
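Written out, the training target and the SGD step look roughly like this. The symbols are my notation for the sketch: \mathbb{D} is the training document set, \mathrm{class}_D is the correct class of document D, and \alpha is the learning rate.

```latex
\begin{align}
\theta &\mapsto \sum_{D \in \mathbb{D}} \log p(\mathrm{class}_D \mid D, \theta) \\
\theta &\leftarrow \theta + \alpha \, \frac{\partial \log p(\mathrm{class}_D \mid D, \theta)}{\partial \theta}
\end{align}
```

The first line is the quantity being maximized; the second is the update made for one randomly picked example D. The sign is a plus because the log-likelihood is being maximized rather than a loss being minimized.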

2.2 Pre-training Word Embedding

Word embedding: distributed representation of a word
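- distributed representations are well suited as the input of neural networks
- traditional representations (such as one-hot representations) can lead to the curse of dimensionality
With a suitable unsupervised pre-training procedure, neural networks can converge to a better local minimum

(In this work, to pre-train the word embeddings)

The Skip-gram model is used
- skip-gram model (reference: [DL] Word2Vec, CBOW, Skip-Gram, Negative Sampling)
  - a model that predicts the surrounding words from the center word (= given a reference word, predicts which context words will appear)
  - the probability is defined with the softmax function (so maximizing the log probability amounts to minimizing a cross-entropy loss)
  - Comparison with CBOW
    - both models project the input word(s) into an N-dimensional vector
    - and are then trained to predict the output word(s) from that vector with a softmax function
    - the two have similar structures, with the input and output reversed
- (a model that embeds words into a vector space of a fixed dimension)
- state-of-the-art in many NLP tasks
- trains the word embeddings by maximizing the average log probability
- |V|: the vocabulary of the unlabeled text
- e’(wi): another embedding for wi
- the embedding e is the one used afterwards, because with speed-up approaches (e.g., hierarchical softmax) e’ is not calculated in practice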
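The Skip-gram objective above can be sketched in a few lines: collect (center, context) pairs within a window c, score each pair with a full softmax over the vocabulary, and average the log probabilities. This is a naive sketch with names I made up (`skipgram_pairs`, `avg_log_prob`, `E_in` for e, `E_out` for e’); real implementations avoid the full softmax with speed-ups such as hierarchical softmax.

```python
import numpy as np


def skipgram_pairs(tokens, c):
    """All (center, context) pairs where the context is within c positions of the center."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs


def avg_log_prob(token_ids, c, E_in, E_out):
    """Average over positions of log p(context | center), using a full softmax."""
    total = 0.0
    for a, b in skipgram_pairs(token_ids, c):
        scores = E_out @ E_in[a]                 # e'(w_k)^T e(w_a) for every word k
        scores = scores - scores.max()           # shift for numerical stability
        log_softmax = scores - np.log(np.exp(scores).sum())
        total += log_softmax[b]
    return total / len(token_ids)


rng = np.random.default_rng(0)
vocab_size, dim = 5, 4                           # toy sizes (assumption)
E_in = rng.normal(scale=0.1, size=(vocab_size, dim))
E_out = rng.normal(scale=0.1, size=(vocab_size, dim))

print(skipgram_pairs(["stroll", "along", "the"], c=1))
print(avg_log_prob([0, 1, 2, 3, 4], c=2, E_in=E_in, E_out=E_out))
```

Training would adjust `E_in` and `E_out` to push this average up; afterwards only `E_in` (the embedding e) is kept as the pre-trained word embeddings.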

`Experiments`

1. Datasets

Four datasets are used

20Newsgroups
- messages from twenty newsgroups
- the bydate version is used
- four major categories (comp, politics, rec, religion) are selected
Fudan set
- Chinese document classification: 20 classes (art, education, energy, etc.)
ACL Anthology Network
- scientific documents
- labels: five languages (English, Japanese, German, Chinese, French)
Stanford Sentiment Treebank(SST)
- contains movie reviews
- labels: Very Negative, Negative, Neutral, Positive, Very Positive

2. Experiment Settings

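Dataset preprocessing
- English documents: tokenized with the Stanford Tokenizer
- Chinese documents: segmented into words with ICTCLAS
- stop words and symbols are NOT removed from the texts
- all datasets are split into train and test sets
- ACL and SST have a pre-defined training, development, and testing separation
- for 20Newsgroup and the Fudan set, 10% of the training set is held out as the development set and the remaining 90% is the actual training set.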
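The 90/10 split for 20Newsgroup and the Fudan set could look like the sketch below. The notes above do not say how the 10% was chosen, so the random shuffle, the fixed seed, and the function name `split_train_dev` are all assumptions.

```python
import random


def split_train_dev(examples, dev_fraction=0.1, seed=0):
    """Hold out `dev_fraction` of the training examples as a development set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)        # shuffle a copy, reproducibly
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]    # (real training set, development set)


train, dev = split_train_dev(range(100))
print(len(train), len(dev))
```

The development set is then used for tuning, and the test set stays untouched until the final evaluation.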
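evaluation metric
- 20Newsgroup: Macro-F1
  - F1 score: the harmonic mean of Precision and Recall; Macro-F1 averages the per-class F1 scores with equal weight
    - Precision = TP/(TP+FP): of the examples the model classifies as True, the fraction that are actually True
    - Recall = TP/(TP+FN): of the examples that are actually True, the fraction the model predicts as True
- Fudan set, ACL, SST: accuracy
  - Accuracy = (TP+TN)/(TP+FN+FP+TN)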
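A small worked example of the two metrics, computed by hand rather than with a library. The labels and predictions are invented for illustration, and `macro_f1_and_accuracy` is a hypothetical helper name.

```python
def macro_f1_and_accuracy(y_true, y_pred):
    """Compute Macro-F1 (unweighted mean of per-class F1) and accuracy."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        # Treat `label` as the positive class and everything else as negative.
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    macro_f1 = sum(f1_scores) / len(f1_scores)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return macro_f1, accuracy


# Made-up labels using the four 20Newsgroups categories.
y_true = ["comp", "comp", "rec", "rec", "politics", "religion"]
y_pred = ["comp", "rec", "rec", "rec", "politics", "religion"]

macro_f1, accuracy = macro_f1_and_accuracy(y_true, y_pred)
print(f"Macro-F1 = {macro_f1:.4f}, accuracy = {accuracy:.4f}")
```

Here one "comp" document is misclassified as "rec", which lowers the recall of "comp" and the precision of "rec". Macro-F1 weights all four classes equally, so rare classes count as much as frequent ones.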
Hyperparameter settings
- may depend on the dataset being used
- e.g., learning rate, hidden layer size, vector size

3. Comparison of Methods

Bag of Words/Bigrams + LR/SVM
- traditional machine learning methods, used as baselines
- unigrams and bigrams as features
- classified with logistic regression (LR) and SVM, respectively
Average Embedding + LR
- uses the weighted average of the word embeddings
- followed by a softmax layer
- the weight of each word: its tf-idf value
LDA
- two methods are used for comparison
  - ClassifyLDA-EM
  - Labeled-LDA
Tree Kernels
- two methods are used for comparison
  - the context-free grammar (CFG) feature set
  - the reranking feature set (C&J)
RecursiveNN
- two methods are used for comparison
  - Recursive Neural Network
  - Recursive Neural Tensor Networks(RNTNs)
CNN
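Of the baselines above, Average Embedding + LR is the easiest to sketch: a document vector is the tf-idf-weighted average of its word embeddings. The exact tf-idf formula is not given in the notes, so the smoothed idf below is one common variant chosen as an assumption, and the two-dimensional embeddings are toy values.

```python
import math

import numpy as np


def tfidf_weighted_average(doc, corpus, embeddings):
    """doc: list of tokens; corpus: list of token lists; embeddings: token -> vector."""
    n_docs = len(corpus)
    total, weight_sum = None, 0.0
    for token in set(doc):
        if token not in embeddings:
            continue                                     # skip unknown words (assumption)
        tf = doc.count(token) / len(doc)                 # term frequency in this document
        df = sum(token in other for other in corpus)     # number of documents containing it
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0    # smoothed idf (one common variant)
        weight = tf * idf
        vec = weight * embeddings[token]
        total = vec if total is None else total + vec
        weight_sum += weight
    if total is None:
        raise ValueError("no token of the document has an embedding")
    return total / weight_sum


corpus = [["good", "movie"], ["bad", "movie"]]
embeddings = {                                           # toy vectors for illustration
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}
print(tfidf_weighted_average(corpus[0], corpus, embeddings))
```

"movie" appears in every document, so it gets a lower weight than "good" and the document vector leans toward "good". In the baseline, this vector would then go into a softmax layer for classification.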

`Conclusion`

1. Results and Discussion

Comparison between the neural network approaches (RecursiveNN, CNN, RCNN) and the traditional methods (e.g., BoW+LR)
- neural network > traditional methods (on all four datasets)
- neural networks
  - can effectively compose the semantic representation of texts
  - can capture more contextual information of features
  - suffer less from the data sparsity problem
CNNs and RCNNs compared with RecursiveNNs (SST dataset)
- convolution-based (CNNs, RCNNs) > RecursiveNNs
- CNNs and RCNNs are better suited to constructing the semantic representation of texts
  - Why? A CNN can select more discriminative features (through max-pooling) and capture contextual information (through the convolutional layer)
  - RecursiveNNs (limitations)
    - can capture contextual information only through semantic composition under the constructed textual tree (so the performance of the tree construction matters a lot)
    - constructing the tree takes O(n^2) time
  - our model, RCNN, has a time complexity of O(n)
  - on the SST dataset, training the RNTN takes 3-5 hours, while the RCNN takes only a few minutes (better).
RCNN results
- RCNN performs best on every dataset except ACL and SST
- on ACL, RCNN is competitive
- reduces the error rate by 33% on the 20News dataset
- reduces the error rate by 19% on the Fudan set
RCNN compared with well-designed feature sets (ACL dataset)
- RCNN > the CFG feature set (CFG: context-free grammar, one of the Tree Kernels methods used for comparison)
- RCNN is competitive with the C&J feature set → RCNN can capture long-distance patterns → RCNN does not require hand-crafted feature sets (useful for low-resource languages)
RCNN compared with CNN
- RCNN > CNN → why? The recurrent structure in the RCNN captures contextual information better than the window-based structure in CNNs.

2. Contextual Information

Difference between CNNs and RCNNs
- they use different structures to capture contextual information
  - RCNNs: recurrent structure
  - CNNs: a fixed window of words
    - performance is influenced by the window size
      - small window: a loss of some long-distance patterns
      - large window: data sparsity; a large number of parameters is harder to train
    - odd window sizes from 1 to 19 are considered
      - e.g., 1: the CNN only uses the word embedding [e(wi)]
      - 3: the CNN uses [e(wi−1); e(wi); e(wi+1)] to represent word wi.
      - Interpreting Figure 2 (RCNN vs. CNN)
        
        Results on the 20Newsgroups dataset
        
        RCNN outperforms the CNN for all window sizes → the RCNN captures contextual information with a recurrent structure that does not rely on the window size. (Open question from my notes: how does the window size come into play in the RCNN, and where is it used?) → the recurrent structure can preserve longer contextual information and introduces less noise
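The window-based word representation used by the CNN baseline can be sketched as follows. Zero-padding at the text boundaries and the function name `window_representation` are my assumptions; the notes only give the form [e(wi−1); e(wi); e(wi+1)].

```python
import numpy as np


def window_representation(E, window):
    """E: (n, e_dim) embeddings. Returns (n, window * e_dim): each word with its neighbours."""
    assert window % 2 == 1, "only odd window sizes are considered"
    half = window // 2
    n, e_dim = E.shape
    pad = np.zeros((half, e_dim))                  # zero vectors beyond the text boundaries
    padded = np.vstack([pad, E, pad])
    # Row i is the concatenation of the embeddings at positions i-half .. i+half.
    return np.stack([padded[i:i + window].reshape(-1) for i in range(n)])


rng = np.random.default_rng(0)
E = rng.normal(size=(6, 50))                       # 6 words, toy 50-dimensional embeddings
for window in (1, 3, 19):
    X = window_representation(E, window)
    print(window, X.shape)
```

The input width grows linearly with the window size, so the layer on top needs proportionally more parameters, which is the "large window is harder to train" point above. The RCNN instead keeps a fixed-size context vector on each side.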

3. Learned keywords

(Additionally) to see how the model constructs the representations of texts, the most important words are listed
- the words most frequently selected in the max-pooling layer
- each word's representation is built together with its context, so the context may cover the entire text
- for our model, RCNN
  - unlike the RNTN, it does not rely on a syntactic parser → the presented n-grams are not typically “phrases”
  - positive sentiment: “worth”, “sweetest”, “wonderful”
  - negative sentiment: “awfully”, “bad”, and “boring”
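"Most frequently selected in the max-pooling layer" can be read as: for each dimension of the pooled vector, find which word supplied the maximum, then count how often each word wins. A minimal sketch under that reading, with a hand-made matrix standing in for the latent vectors yi^(2) (the values and the helper name `most_selected_words` are invented for illustration):

```python
from collections import Counter

import numpy as np


def most_selected_words(tokens, Y2, top_k=3):
    """tokens: n words; Y2: (n, H) latent vectors. Count max-pooling wins per word."""
    winners = Y2.argmax(axis=0)                  # for each of the H dimensions, the winning word
    counts = Counter(tokens[int(i)] for i in winners)
    return counts.most_common(top_k)


tokens = ["an", "awfully", "boring", "film"]
Y2 = np.array([                                  # toy values, not model output
    [0.1, 0.0, 0.2, 0.1, 0.0],
    [0.9, 0.2, 0.8, 0.3, 0.1],
    [0.3, 0.7, 0.1, 0.9, 0.6],
    [0.2, 0.1, 0.0, 0.2, 0.3],
])
print(most_selected_words(tokens, Y2))
```

In the real model each row also encodes the word's left and right context, so a "winning word" really stands for the word together with its surroundings.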

4. Summary

Proposes RCNN for text classification
- captures contextual information with the recurrent structure
- constructs the representation of the text using a convolutional neural network
In the experiments, it outperforms CNN and RecursiveNN
