Summary of GPT
Introduction
- One of the most difficult problems in NLP is learning from raw text.
- Most of the deep learning models we usually think of are supervised: there is a correct answer (label), and the final goal is to fit the model to that answer.
- In other words, among the countless candidate functions, we are looking for the one that satisfies a specific condition, so training simply proceeds in the direction of approximating it.
- However, "finding the answer," i.e. obtaining a labeled dataset, is not easy.
- In this situation, a model that could learn the correct behavior even without labels would be revolutionary, saving both time and human effort.
- This is not the only benefit of unsupervised learning.
- The main reason most people have moved from classical machine learning to deep learning is accuracy.
- The most central criterion separating machine learning from deep learning is "who extracts the features?"
- Unlike classical ML, where features are engineered by hand and the model can therefore learn from a small amount of data, in deep learning the model itself extracts features through stacked layers with non-linear characteristics.
- However, there are two major difficulties in unsupervised learning from raw text.
- The first difficulty is the choice of objective function.
- After all, the final goal of unsupervised learning from raw text is a downstream NLP task.
- The form of the objective function changes depending on what we are trying to do.
- For example, the objective function for machine translation should consider grammar as well as how close the model's translation is to the reference sentence, and incidental aspects such as discourse coherence should also be taken into account.
- The second difficulty is finding an effective way to transfer the learned representations.
- Even if we obtain good representations from raw text, how to use them effectively in a downstream model is a separate problem.
- Before GPT, even when pre-trained word representations existed, using them required building task-specific model architectures or adding auxiliary terms to the objective function.
Goal of the paper
- This paper takes a semi-supervised approach: unsupervised pre-training of a language model on raw text, followed by a supervised fine-tuning stage.
- In particular, the goal is to learn a representation that can be used universally across tasks without much adjustment, solving the two difficulties mentioned above at the same time.
Model description
- It is based on the Transformer, which performs extremely well and can be parallelized.
- In particular, since the key is to generate a context vector from the preceding tokens, the encoder is not needed; only the decoder part, with its masked self-attention, is used (a minimal sketch of this masked attention follows below).
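As a rough illustration of the decoder-only idea, here is a minimal numpy sketch of causal (masked) self-attention, in which each position attends only to earlier positions; the function name, dimensions, and single-head simplification are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head masked self-attention: position i attends only to positions <= i."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # (T, d) queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (T, T) attention logits
    mask = np.triu(np.ones_like(scores), 1)       # 1s above the diagonal = future positions
    scores = np.where(mask == 1, -1e9, scores)    # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the allowed positions
    return weights @ v                            # (T, d) context vectors

# Toy usage: 5 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(causal_self_attention(x, Wq, Wk, Wv).shape)  # (5, 8)
```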
- Training consists of two main steps: the first is pre-training on a corpus to initialize the parameters, and the second is fine-tuning using labeled (classification) data.
- In the first step, we use the multi-head masked self-attention of the Transformer decoder, and the objective is to predict each token of the corpus from the preceding tokens within a given context window.
- In other words, the objective function is as follows:
L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, u_{i-k+1}, \dots, u_{i-1}; \Theta)
where k is the context window size and U = \{u_1, u_2, \dots, u_n\} is the corpus of tokens.
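For concreteness, a minimal sketch of this objective is shown below; model_logprob is a hypothetical stand-in for the Transformer decoder returning log P(token | context; Θ).

```python
def lm_objective(corpus_tokens, k, model_logprob):
    """L1(U): sum over i of log P(u_i | u_{i-k}, ..., u_{i-1}; Theta).

    corpus_tokens : list of token ids u_1, ..., u_n
    k             : context window size
    model_logprob : hypothetical callable (context, token) -> log-probability
    """
    total = 0.0
    for i in range(1, len(corpus_tokens)):
        context = corpus_tokens[max(0, i - k):i]   # at most k preceding tokens
        total += model_logprob(context, corpus_tokens[i])
    return total  # maximized during pre-training
```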
- After that, in the second step, the model obtained from pre-training is fine-tuned separately for each downstream task.
- When defining the loss at this stage, the language-modeling objective from the first step is kept as an auxiliary term; since the language model should be useful elsewhere as well, this improves generalization and accelerates convergence.
- In other words, the objective function is as follows:
L_3(C) = L_2(C) + \lambda \cdot L_1(C)
where L_2(C) is the supervised (classification-only) objective on the labeled dataset C.
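A minimal sketch of this combined objective, assuming hypothetical helpers task_term (the supervised term L2) and lm_term (the language-modeling term L1 evaluated on the same labeled inputs):

```python
def finetune_objective(batch, labels, task_term, lm_term, lam=0.5):
    """L3(C) = L2(C) + lambda * L1(C): supervised objective plus auxiliary LM objective.

    task_term : hypothetical callable computing the supervised term L2
    lm_term   : hypothetical callable computing the LM term L1 on the same inputs
    lam       : weighting coefficient lambda (the paper uses 0.5)
    """
    return task_term(batch, labels) + lam * lm_term(batch)
```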
- However, most NLP tasks we actually want to apply the model to involve much more complex problems than text classification, such as machine translation and textual entailment.
- In other words, these tasks are about understanding the relationship between several sentences rather than a single sentence, so the inputs need to be transformed before being fed to the model.
- For textual entailment, the two sentences to be checked are concatenated with a delimiter token and classified as a single sequence.
- Unlike textual entailment, where the two sentences are ordered, similarity has no inherent order, so the same transformation is applied twice, once for each ordering, and the two resulting representations are added element-wise (both transformations are sketched below).
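A token-level sketch of these input transformations is shown below; the start, delimiter, and extract tokens follow the paper's scheme, while transformer_rep is a hypothetical stand-in for the pre-trained model's final representation (the state at the extract token).

```python
import numpy as np

START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def entailment_input(premise, hypothesis):
    """Entailment: concatenate the ordered sentence pair around a delimiter token."""
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_representation(text_a, text_b, transformer_rep):
    """Similarity: no inherent order, so run both orderings and add the results element-wise."""
    h_ab = transformer_rep(entailment_input(text_a, text_b))  # order A, B
    h_ba = transformer_rep(entailment_input(text_b, text_a))  # order B, A
    return h_ab + h_ba  # element-wise sum, then fed to the linear output layer

# Toy usage with a dummy representation function (hypothetical):
dummy_rep = lambda tokens: np.ones(4) * len(tokens)
print(entailment_input(["a", "man", "sleeps"], ["a", "person", "rests"]))
print(similarity_representation(["sentence", "one"], ["sentence", "two"], dummy_rep))
```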