Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference (EACL, 2021) (not finished)

Minhan Cho · March 1, 2023

Schick, T., & Schütze, H. (2020). Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676.

This paper is the precursor to "It's not just size that matters: Small language models are also few-shot learners" (NAACL, 2021). It explains PET in more detail than that paper does, so I recommend reading this one first.

https://github.com/timoschick/pet

Abstract

The three steps of Pattern-Exploiting Training (PET):
- reformulate the input text as a cloze-style phrase
- assign soft labels to a large set of unlabeled examples
- run standard supervised training on the soft-labeled examples
In low-resource settings, PET outperforms both supervised training and strong semi-supervised approaches.

1 Introduction

The difficulty of few-shot learning

Few-shot learning is hard, but it gets easier when a task description is available. A task description is a textual explanation that helps the model understand what the task is. In papers such as GPT-2 or Zero Shot with Generative LM, zero-shot learning is done by appending a description to the input (I still need to look into this more, but haven't gotten around to it).

About this work: Pattern-Exploiting Training (PET)

PET combines task descriptions with standard supervised learning and applies the combination to few-shot learning. The paper defines PET as a semi-supervised training procedure that reformulates inputs as cloze-style phrases.

The three-step process of PET

1. For each pattern, a separate PLM is finetuned on a small training set $T$.
2. The ensemble of all models assigns soft labels to a large unlabeled dataset $D$.
3. A standard classifier is trained on the soft-labeled dataset.

iPET: an iterative variant of PET that repeats this process with training sets of increasing size.
Given a small to medium number of labeled examples, PET and iPET soundly beat unsupervised approaches, supervised training, and strong semi-supervised baselines.

3 Pattern-Exploiting Training

<Notation>

$M$ : masked language model (MLM)
$V$ : Vocabulary of MLM
____: masked token, ____ $\in V$
$L$ : set of labels of the target classification task $A$; e.g. for binary classification, $L = \{l_0, l_1\}$ with $l_0$ (True) and $l_1$ (False)
$x$ : input of task $A$, a sequence of phrases $x = (s_1,...,s_k)$ with each $s_i \in V^*$
$P$ : pattern (a function $x \rightarrow P(x)$); $P(x) \in V^*$ contains exactly one mask token (that is what makes it a cloze question)
$v$ : verbalizer (a function $L \rightarrow V$) that maps each label $l \in L$ to a single token in the vocabulary $V$ of the MLM $M$
pattern-verbalizer pair (PVP): $(P,v)$

The input-output pair of the original task A:

$(x, l)$

Predict the label $l$ for input $x$.

How the input and expected output of task A are transformed

The PVP $(P,v)$ comes into play:

input: $x \rightarrow P(x)$ (from separate sequences to one cloze question sequence)
output: $l \rightarrow v(l)$ (from label to token)

The input-output pair of the transformed task A':

$(P(x), v(l))$

Given the cloze-question sequence $P(x)$, the MLM $M$ predicts the token $v(l)$ for the masked token ___.

Example

task A: do the two sentences a and b contradict each other? (binary classification)
input $x$ : $x$ = ("Mia likes pie", "Mia hates pie")
output $l \in \{l_0,\; l_1\}$ : $l_0$ (contradiction) is verbalized as "Yes", $l_1$ (agreement) as "No"

task A': the task changes from assigning a label to answering for the masked position
input $P(x)$ : $P(x)$ = "Mia likes pie? ___, Mia hates pie."
output $v(l) \in \{$ "Yes", "No" $\}$
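
To make the transformation from $(x, l)$ to $(P(x), v(l))$ concrete, here is a minimal sketch in plain Python. The names `pattern`, `verbalize`, `to_cloze_task` and the label keys `"l0"`/`"l1"` are made up for this post and are not taken from the PET repository; the `___` string only stands in for the MLM's real mask token.

```python
from typing import Dict, Tuple

MASK = "___"  # placeholder for the MLM's mask token (notation of this post)

def pattern(x: Tuple[str, str]) -> str:
    """P(x): turn a sentence pair (a, b) into one cloze question 'a? ___, b.'"""
    a, b = x
    return f"{a}? {MASK}, {b}."

# v: maps each label to exactly one vocabulary token
VERBALIZER: Dict[str, str] = {"l0": "Yes", "l1": "No"}

def verbalize(label: str) -> str:
    """v(l): the token the MLM should predict at the masked position."""
    return VERBALIZER[label]

def to_cloze_task(x: Tuple[str, str], label: str) -> Tuple[str, str]:
    """(x, l) -> (P(x), v(l))"""
    return pattern(x), verbalize(label)

cloze_input, cloze_target = to_cloze_task(("Mia likes pie", "Mia hates pie"), "l0")
print(cloze_input)   # Mia likes pie? ___, Mia hates pie.
print(cloze_target)  # Yes
```

The pattern inserts exactly one mask slot, which is what lets a single MLM forward pass score all labels at once.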

3.1 PVP Training and Inference

<Notation>

$p$ : pattern-verbalizer pair (PVP), $p = (P,v)$
$T$ : small training set
$D$ : larger set of unlabeled examples
$z$ : sequence containing exactly one mask token, $z \in V^*$
$M(w|z)$ : unnormalized score that the MLM assigns to token $w \in V$ at the masked position of $z$

Given an input $x$, the score for label $l \in L$ is defined as

$s_p(l|x) = M(v(l)|P(x))$
$q_p(l|x) = \frac{e^{s_p(l|x)}}{\sum_{l' \in L}e^{s_p(l'|x)}}$ : $q_p(l|x)$ is the softmax of $s_p(l|x)$ over the labels

Loss for finetuning the MLM $M$ for PVP $p$: the cross entropy between $q_p(l|x)$ and the true (one-hot) distribution of each training example $(x,l) \in T$, summed over all of $T$
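
The scoring and loss can be sketched numerically as follows. This is only an illustration: `mlm_scores` holds made-up numbers standing in for the unnormalized scores a real MLM would output at the masked position, and the function names are my own.

```python
import math
from typing import Dict

def label_distribution(scores: Dict[str, float]) -> Dict[str, float]:
    """q_p(l|x): softmax over the per-label scores s_p(l|x)."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {l: math.exp(s - m) for l, s in scores.items()}
    z = sum(exps.values())
    return {l: e / z for l, e in exps.items()}

def cross_entropy(q: Dict[str, float], true_label: str) -> float:
    """Cross entropy against a one-hot true distribution, i.e. -log q(true_label)."""
    return -math.log(q[true_label])

# Hypothetical unnormalized MLM scores M(w | P(x)) for the verbalized tokens.
mlm_scores = {"Yes": 2.0, "No": 0.0}
verbalizer = {"l0": "Yes", "l1": "No"}

# s_p(l|x) = M(v(l) | P(x)): look up the score of each label's token.
s_p = {label: mlm_scores[token] for label, token in verbalizer.items()}
q_p = label_distribution(s_p)
loss = cross_entropy(q_p, "l0")
print(q_p, loss)
```

Note that the softmax runs only over the tokens selected by the verbalizer, not over the whole vocabulary $V$.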

3.2 Auxiliary Language Modeling

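Because only a few training examples are available, catastrophic forgetting can occur. A pretrained language model (PLM) finetuned for a PVP is still a language model at its core, so language modeling is added as an auxiliary task.

The final loss is computed as:

$L = (1 - \alpha) \times L_{CE} + \alpha \times L_{MLM}$
where $L_{CE}$ refers to the cross entropy loss and $L_{MLM}$ refers to the language modeling loss

Since $L_{MLM}$ is typically much larger than $L_{CE}$, $\alpha$ is set to a very small value; the paper uses $\alpha = 10^{-4}$ (reported to give consistently good results in preliminary experiments).

The unlabeled set $D$ is the source of sentences for language modeling. However, the model is not trained on $x \in D$ directly but on $P(x)$, and it is never asked to predict anything for the masked slot (so how is the MLM trained then..? Presumably other tokens of $P(x)$ are masked as in ordinary MLM training, and only the pattern's own slot is left out of the loss).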
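
A small sketch of the combined loss, plus my reading of how the auxiliary MLM objective can skip the pattern's own slot. Everything here is an assumption for illustration: the helper names, the whitespace tokenization, the 0.15 masking probability, and the loss values are not taken from the paper or the PET code.

```python
import random
from typing import List

MASK = "___"  # stands in for the pattern's mask slot (notation of this post)

def auxiliary_mlm_positions(tokens: List[str], mask_prob: float,
                            rng: random.Random) -> List[int]:
    """Pick positions to mask for the auxiliary LM loss.

    The pattern's own slot is never selected, so the model is never
    asked to predict anything for it (assumed interpretation).
    """
    return [i for i, tok in enumerate(tokens)
            if tok != MASK and rng.random() < mask_prob]

def combined_loss(l_ce: float, l_mlm: float, alpha: float = 1e-4) -> float:
    """L = (1 - alpha) * L_CE + alpha * L_MLM"""
    return (1 - alpha) * l_ce + alpha * l_mlm

tokens = "Mia likes pie ? ___ , Mia hates pie .".split()
positions = auxiliary_mlm_positions(tokens, mask_prob=0.15, rng=random.Random(0))

# Hypothetical loss values: the LM loss is much larger than the CE loss,
# which is why alpha has to be tiny for L_CE to stay dominant.
total = combined_loss(l_ce=0.5, l_mlm=8.0)
print(positions, total)
```

With $\alpha = 10^{-4}$ the language modeling term only nudges the total loss, which matches the idea of it being an auxiliary objective rather than the main one.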

3.3 Combining PVPs

The key challenge is that, without a large dev set, it is hard to tell which PVPs work well. To address this, the paper uses a strategy similar to knowledge distillation.

Define $P$, a set of PVPs, and use it as follows:

1. Finetune a separate language model $M_p$ for each $p \in P$ as described in 3.1. Because $T$ is small, this is cheap even for a large number of PVPs.
2. Use the ensemble $M = \{M_p|p \in P\}$ of finetuned models to annotate $D$. For each example $x \in D$, compute the unnormalized class scores as below, then apply a softmax to obtain $q$. The softmax temperature is set to $T = 2$ (not the training set $T$). All pairs $(x, q)$ together form the soft-labeled training set $T_C$:

$s_M(l|x) = \frac{1}{Z}\sum_{p \in P}w(p) \times s_p(l|x)$
where $Z = \sum_{p \in P}w(p)$ and the $w(p)$ are weighting terms
$w(p) = 1$ ("uniform") or $w(p)$ = accuracy obtained using $p$ on the training set before training ("weighted")

3. Finetune a pretrained language model (PLM) $C$ with a standard sequence classification head on $T_C$. The finetuned $C$ then serves as the classifier for task A.
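
The annotation step (weighted average of per-PVP scores, then a softmax with temperature 2) can be sketched like this. The per-PVP scores and the accuracy weights below are invented numbers, and the function names are mine, not the repository's.

```python
import math
from typing import Dict, List

Scores = Dict[str, float]

def ensemble_scores(per_pvp_scores: List[Scores], weights: List[float]) -> Scores:
    """Weighted average of the per-PVP scores: (1/Z) * sum_p w(p) * s_p(l|x)."""
    z = sum(weights)
    labels = per_pvp_scores[0].keys()
    return {l: sum(w * s[l] for w, s in zip(weights, per_pvp_scores)) / z
            for l in labels}

def softmax_with_temperature(scores: Scores, temperature: float = 2.0) -> Scores:
    """Turn ensemble scores into a soft label q; a higher temperature flattens q."""
    scaled = {l: s / temperature for l, s in scores.items()}
    m = max(scaled.values())
    exps = {l: math.exp(v - m) for l, v in scaled.items()}
    total = sum(exps.values())
    return {l: e / total for l, e in exps.items()}

# Hypothetical scores s_p(l|x) from three finetuned models for one example x.
per_pvp = [{"l0": 2.0, "l1": 0.0}, {"l0": 1.0, "l1": 1.0}, {"l0": 0.0, "l1": 1.0}]

uniform = ensemble_scores(per_pvp, [1.0, 1.0, 1.0])   # w(p) = 1
weighted = ensemble_scores(per_pvp, [0.9, 0.6, 0.5])  # w(p) = made-up accuracies
soft_label = softmax_with_temperature(uniform)        # q for the pair (x, q)
print(uniform, weighted, soft_label)
```

In the "weighted" variant, the model with the highest pre-training accuracy pulls the ensemble score toward its own prediction, while the uniform variant treats all PVPs alike.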

3.4 Iterative PET (iPET)

Limitation of PET and the core idea of iPET

When the distilled knowledge of every individual model flows into a single classifier $C$ as above, the individual models cannot learn from each other. Moreover, if some patterns perform worse than others, the soft-labeled training set $T_C$ may contain many mislabeled examples. Iterative PET (iPET) was introduced to overcome this. Its core idea is to train several generations of models on datasets of increasing size.

How iPET works

The small labeled training set $T$ is enlarged by taking some examples from the large unlabeled dataset $D$, labeling them with a random subset of the trained PET models, and adding them to $T$. A new generation of PET models is then trained on the enlarged training set. This is repeated several times.

$M^0 = \{M_1^0,...,M_n^0\}$ is the initial set of PET models finetuned on $T$ as in step 1 of 3.3. $M_i^j$ is the $j$-th generation model trained for PVP $p_i$ on its own training set $T_i^j$. In each iteration the training set size is multiplied by a fixed constant $d \in \mathbb{N}$, while the label ratio of the original dataset is maintained. So if $c_0(l)$ denotes the number of examples with label $l$ in $T$, each $T^j_i$ contains $c_j(l) = d \times c_{j-1}(l)$ examples with label $l$ (which is just stating the obvious!). $T_i^j$ is generated as follows:

$N \subset M^{j-1}$

After training $k$ generations of PET models, $M^k$ is used to create the soft-labeled training dataset $T_C$, on which the classifier $C$ is trained.
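
The growth rule $c_j(l) = d \times c_{j-1}(l)$ is easy to check with a few lines of Python. The starting counts, $d = 5$ and $k = 3$ below are values picked for this example only, and the function name is my own.

```python
from typing import Dict, List

def label_counts_per_generation(c0: Dict[str, int], d: int, k: int) -> List[Dict[str, int]]:
    """Return [c_0, c_1, ..., c_k] with c_j(l) = d * c_{j-1}(l) for every label l."""
    counts = [dict(c0)]
    for _ in range(k):
        counts.append({label: d * c for label, c in counts[-1].items()})
    return counts

# Hypothetical original training set: 6 examples of l0 and 4 of l1.
schedule = label_counts_per_generation({"l0": 6, "l1": 4}, d=5, k=3)
for j, counts in enumerate(schedule):
    print(j, counts, sum(counts.values()))
```

With these values the training set grows from 10 to 50, 250 and finally 1250 examples, and the 6:4 label ratio of the original $T$ is the same in every generation.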

iPET in the zero-shot setting: (I skipped the part above, so this is hard to follow.. damn)

iPET can also be used in a zero-shot setting. Define $M^0$ as the set of untrained models and set $c_1(l) = 10 / |L|$ for all $l \in L$, so that $M^1$ is trained on 10 examples evenly distributed across all labels. Since $T_N$ may not contain enough examples for some label $l$, 100 examples $x \in D$ are .... (ugh, damn)
