AI Tech Day 16

Lee · February 15, 2021

Intro

Natural language processing

Low-level parsing

  • Tokenization (splitting text into word tokens), stemming (extracting word stems); a toy sketch follows below
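
A toy sketch in plain Python of what these two steps produce. The suffix rules are my own illustrative assumption; real systems use proper tokenizers and stemmers (e.g., NLTK, spaCy).

```python
# Toy illustration of tokenization and stemming: whitespace/punctuation
# tokenization plus naive suffix stripping, only to show what the steps yield.

def tokenize(text):
    """Split a sentence into lowercase word tokens."""
    return text.lower().replace(".", " ").replace(",", " ").split()

def stem(token):
    """Strip a few common English suffixes (a very rough stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The cat hunts mice.")
print(tokens)                     # ['the', 'cat', 'hunts', 'mice']
print([stem(t) for t in tokens])  # ['the', 'cat', 'hunt', 'mice']
```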

Word and phrase level

  • Named entity recognition(NER)
    Recognizing proper nouns made up of a single word or multiple words (e.g., New York Times)

  • part-of-speech(POS) tagging
    Tagging each word with its part of speech (adjective, adverb, noun, which word an adjective modifies, etc.)

  • noun-phrase chunking

  • dependency parsing

  • coreference resolution


Sentence level

  • Sentiment analysis

  • machine translation
    Translation that takes the appropriate word order of the target language into account


Multi-sentence and paragraph level

  • Entailment prediction
    Predicting the logical relationship between two sentences
    John got married yesterday. → Someone got married yesterday. (entailment)
    John got married yesterday. → No one got married yesterday. (contradiction)

  • question answering
    Searching "where did napoleon die" on Google returns the answer itself: rather than only retrieving documents containing the keywords, the system finds and shows the answer to the given question.

  • dialog systems
    Conversational systems such as chatbots

  • summarization
    Summarizing a given document in a single line



Text mining

  • Extract useful information and insights from text and document data
    e.g., analyzing the trends of AI-related keywords from massive news data

  • Document clustering (e.g., topic modeling)
    e.g., clustering news data and grouping into different subjects

  • Highly related to computational social science
    e.g., analyzing the evolution of people’s political tendency based on social media data



Information retrieval

  • This area is relatively mature and is not actively studied now

  • It has evolved into recommendation systems, which are still an active area of research



Trends of NLP


  • Word embedding
    Text data can basically be viewed as a sequence of words, and each word can be represented as a vector through a technique such as Word2Vec or GloVe.

  • RNN-family models (LSTMs and GRUs), which take the sequence of these word vectors as input, are the main architectures for NLP tasks.

  • Overall performance on NLP tasks has improved since attention modules and Transformer models, which replace RNNs with self-attention, were introduced a few years ago.

  • As was the case for the Transformer, most advanced NLP models were originally developed to improve machine translation.

  • In the early days, customized models for different NLP tasks were developed separately.

  • Since the Transformer was introduced, huge models have been released by stacking its basic module, self-attention. These models are trained on large datasets through language modeling, a self-supervised training setup that does not require additional labels for a particular task.
    e.g., BERT, GPT-3 …

  • Afterwards, these models were applied to other tasks through transfer learning, where they outperformed all other customized models on each task.

  • Currently, these models have become an essential part of numerous NLP tasks, so NLP research has become difficult with limited GPU resources, since the models are too large to train.
    (You need the capital and data of companies like Google, Facebook, and OpenAI.)



Bag-of-Words

A technique used before deep learning (a minimal code sketch follows at the end of this section).

  • Step 1. Constructing the vocabulary containing unique words
    • Example sentences
      “John really really loves this movie“, “Jane really likes this song”
    • Vocabulary
      {“John“, “really“, “loves“, “this“, “movie“, “Jane“, “likes“, “song”}

  • Step 2. Encoding unique words to one-hot vectors
    • Vocabulary
      {“John“, “really“, “loves“, “this“, “movie“, “Jane“, “likes“, “song”}

      Word     Vector
      John     [1 0 0 0 0 0 0 0]
      really   [0 1 0 0 0 0 0 0]
      loves    [0 0 1 0 0 0 0 0]
      this     [0 0 0 1 0 0 0 0]
      movie    [0 0 0 0 1 0 0 0]
      Jane     [0 0 0 0 0 1 0 0]
      likes    [0 0 0 0 0 0 1 0]
      song     [0 0 0 0 0 0 0 1]

    • For any pair of distinct words, the Euclidean distance is $\sqrt{2}$
    • For any pair of words, cosine similarity is 0

  • A sentence/document can be represented as the sum of one-hot vectors

    • Sentence 1
      “John really really loves this movie“
      John + really + really + loves + this + movie: [1 2 1 1 1 0 0 0]

    • Sentence 2
      “Jane really likes this song”
      Jane + really + likes + this + song: [0 1 0 1 0 1 1 1]
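
A minimal Python sketch of the two steps above, assuming whitespace tokenization and using the example sentences (the vocabulary follows first-appearance order, as in the example):

```python
# Minimal Bag-of-Words sketch: build a vocabulary of unique words, map each
# word to a one-hot index, and represent a sentence as the sum of its one-hot
# vectors, i.e., a word-count vector.

sentences = [
    "John really really loves this movie",
    "Jane really likes this song",
]

# Step 1: vocabulary of unique words (in order of first appearance)
vocab = []
for sent in sentences:
    for word in sent.split():
        if word not in vocab:
            vocab.append(word)
word2idx = {w: i for i, w in enumerate(vocab)}
print(vocab)  # ['John', 'really', 'loves', 'this', 'movie', 'Jane', 'likes', 'song']

# Step 2 + sentence encoding: sum of one-hot vectors = word counts
def bow_vector(sentence):
    vec = [0] * len(vocab)
    for word in sentence.split():
        vec[word2idx[word]] += 1
    return vec

print(bow_vector(sentences[0]))  # [1, 2, 1, 1, 1, 0, 0, 0]
print(bow_vector(sentences[1]))  # [0, 1, 0, 1, 0, 1, 1, 1]
```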



Naive Bayes Classifier for Document Classification

Bag-of-Words for Document Classification

  • Bayes’ Rule Applied to Documents and Classes

    • For a document d and a class c, we pick the MAP ("maximum a posteriori", i.e., most likely) class; $P(d)$ can be dropped because it is constant for a given document:
      $c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\, P(c|d) = \underset{c \in C}{\operatorname{argmax}}\, \cfrac{P(d|c)P(c)}{P(d)} = \underset{c \in C}{\operatorname{argmax}}\, P(d|c)P(c)$

    • For a document d, which consists of a sequence of words $w_1, \dots, w_n$, and a class c, the probability of the document can be represented by multiplying the probabilities of each word appearing, under the conditional independence assumption:
      $P(d|c)P(c) = P(w_1, w_2, \dots, w_n|c)P(c) \rightarrow P(c) \prod_{w_i \in W} P(w_i|c)$

Example

|          | Doc (d) | Document (words, w)                                    | Class (c) |
|----------|---------|--------------------------------------------------------|-----------|
| Training | 1       | Image recognition uses convolutional neural networks   | CV        |
|          | 2       | Transformer can be used for image classification task  | CV        |
|          | 3       | Language modeling uses transformer                     | NLP       |
|          | 4       | Document classification task is language task          | NLP       |
| Test     | 5       | Classification task uses transformer                   | ?         |

$P(c_{CV}) = \cfrac{2}{4} = \cfrac{1}{2}$
$P(c_{NLP}) = \cfrac{2}{4} = \cfrac{1}{2}$

For each word $w_k$, we can calculate the conditional probability for class $c_i$

$P(w_k|c_i) = \cfrac{n_k}{n}$, where $n_k$ is the number of occurrences of $w_k$ in documents of topic $c_i$ and $n$ is the total number of words in documents of topic $c_i$

| Word | Prob | Word | Prob |
|------|------|------|------|
| $P(w_{\text{"classification"}} \mid c_{CV})$ | $\cfrac{1}{14}$ | $P(w_{\text{"classification"}} \mid c_{NLP})$ | $\cfrac{1}{10}$ |
| $P(w_{\text{"task"}} \mid c_{CV})$ | $\cfrac{1}{14}$ | $P(w_{\text{"task"}} \mid c_{NLP})$ | $\cfrac{2}{10}$ |
| $P(w_{\text{"uses"}} \mid c_{CV})$ | $\cfrac{1}{14}$ | $P(w_{\text{"uses"}} \mid c_{NLP})$ | $\cfrac{1}{10}$ |
| $P(w_{\text{"transformer"}} \mid c_{CV})$ | $\cfrac{1}{14}$ | $P(w_{\text{"transformer"}} \mid c_{NLP})$ | $\cfrac{1}{10}$ |

(14 and 10 are the total word counts of the CV and NLP training documents, respectively; "task" appears twice in the NLP documents, hence $\cfrac{2}{10}$.)

For a test document $d_5$ = “Classification task uses transformer”
We calculate the conditional probability of the document for each class
We can choose a class that has the highest probability for the document

$P(c_{CV}|d_5) = P(c_{CV}) \prod_{w \in d_5} P(w|c_{CV}) = \cfrac{1}{2} \times \cfrac{1}{14} \times \cfrac{1}{14} \times \cfrac{1}{14} \times \cfrac{1}{14} \approx 0.000013$
$P(c_{NLP}|d_5) = P(c_{NLP}) \prod_{w \in d_5} P(w|c_{NLP}) = \cfrac{1}{2} \times \cfrac{1}{10} \times \cfrac{2}{10} \times \cfrac{1}{10} \times \cfrac{1}{10} = 0.0001$
$\Rightarrow$ $d_5$ is classified as NLP, the class with the higher probability.

When computing these probabilities for $d_5$, any word that does not appear in the training documents of a class gets probability 0, so the whole product becomes 0 no matter what the other factors are; therefore smoothing techniques (e.g., Laplace/add-one smoothing) are used.
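
A minimal Python sketch of the classifier in this example, deliberately left unsmoothed so that the numbers match the worked example above (a real implementation would add smoothing and work in log space, e.g., scikit-learn's MultinomialNB):

```python
# Minimal Naive Bayes document classifier using raw word counts (no smoothing),
# so an unseen word makes the whole product 0, as discussed above.
from collections import Counter

train = [
    ("Image recognition uses convolutional neural networks", "CV"),
    ("Transformer can be used for image classification task", "CV"),
    ("Language modeling uses transformer", "NLP"),
    ("Document classification task is language task", "NLP"),
]

classes = {c for _, c in train}
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}
counts = {c: Counter(w.lower() for d, y in train if y == c for w in d.split())
          for c in classes}
totals = {c: sum(counts[c].values()) for c in classes}

def score(document, c):
    """P(c) * prod_w P(w|c); a word unseen in class c makes the product 0."""
    p = prior[c]
    for w in document.lower().split():
        p *= counts[c][w] / totals[c]
    return p

doc5 = "Classification task uses transformer"
for c in sorted(classes):
    print(c, score(doc5, c))
# CV  ~1.3e-05  (1/2 * (1/14)^4)
# NLP 1e-04     (1/2 * 1/10 * 2/10 * 1/10 * 1/10)
```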






Word Embedding

Express a word as a vector

  • 'cat' and 'kitty' are similar words, so they have similar vector representations → short distance

  • 'hamburger' is not similar to 'cat' or 'kitty', so its vector representation is far from theirs → long distance



Word2Vec

An algorithm for training vector representation of a word from context words (adjacent words)

  • Assumption: words in similar context will have similar meanings

  • e.g.

    • The cat purrs.
    • The cat hunts mice.


Idea

Suppose we read the word “cat”. What is the probability $P(w|\text{cat})$ that we’ll read the word $w$ nearby?

Distributional Hypothesis

  • The meaning of “cat” is captured by the probability distribution $P(w|\text{cat})$

How Word2Vec Algorithm Works

Sentence : “I study math.”

  • Vocabulary: {“I”, “study”, “math”}

  • Input: “study” [0, 1, 0]

  • Output: “math” [0, 0, 1]

  • Columns of $W_1$ and rows of $W_2$ represent each word

  • E.g., the ‘study’ vector is the 2nd column of $W_1$, and the ‘math’ vector is the 3rd row of $W_2$

  • The ‘study’ vector in $W_1$ and the ‘math’ vector in $W_2$ should have a high inner-product value (see the minimal training sketch after this list)

  • https://ronxin.github.io/wevi/

    • A vector representation of ‘eat’ in $W_1$ has a similar pattern to the vectors of ‘apple’, ‘orange’, and ‘rice’ in $W_2$
    • When the input is ‘eat’, the model can predict ‘apple’, ‘orange’, or ‘rice’ as the output, because those vectors have high inner-product values with it
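
A minimal numpy sketch of skip-gram training with a full softmax on the “I study math” example, in the spirit of the wevi demo. The hyperparameters (2-dimensional embeddings, learning rate, number of steps) are arbitrary choices of mine, and embeddings are stored as rows of $W_1$ here, i.e., the transpose of the column convention above.

```python
# Minimal skip-gram Word2Vec: one-hot input -> W1 (input embeddings) -> W2
# -> softmax over the vocabulary, trained with gradient descent.
import numpy as np

sentence = ["i", "study", "math"]
vocab = sorted(set(sentence))
V, D = len(vocab), 2
idx = {w: i for i, w in enumerate(vocab)}

# (center, context) pairs with window size 1
pairs = [("i", "study"), ("study", "i"), ("study", "math"), ("math", "study")]

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, D))  # rows: input-word embeddings
W2 = rng.normal(scale=0.1, size=(D, V))  # columns: output-word embeddings
lr = 0.3

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(1000):
    for center, context in pairs:
        h = W1[idx[center]]              # hidden layer = embedding of center word
        y = softmax(h @ W2)              # predicted distribution over context words
        t = np.zeros(V)
        t[idx[context]] = 1
        grad_out = y - t                 # gradient of cross-entropy w.r.t. the logits
        grad_h = W2 @ grad_out           # gradient w.r.t. the hidden vector
        W2 -= lr * np.outer(h, grad_out)
        W1[idx[center]] -= lr * grad_h

h = W1[idx["study"]]
print(vocab, softmax(h @ W2))  # 'study' should assign high probability to 'i' and 'math'
```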

Property of Word2Vec

  • The relationships between word vectors (i.e., between points in the embedding space) reflect the relationships between the words.

  • The same relationship between words is represented by the same vector difference (see the sketch after this list).

  • e.g.,
    vec[queen] – vec[king] = vec[woman] – vec[man]

  • Korean Word2Vec : http://w.elnn.kr/search
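
A short sketch of checking this property with gensim, assuming gensim and its downloader are available; a small pretrained GloVe model is used here for convenience (the same vector arithmetic holds for Word2Vec embeddings).

```python
# Checking vec[queen] - vec[king] = vec[woman] - vec[man] with pretrained vectors.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # small pretrained KeyedVectors, downloaded on first run

# queen ≈ king - man + woman
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Similar words end up close together in the vector space
print(wv.similarity("cat", "kitty"), wv.similarity("cat", "hamburger"))
```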
