[Paper] What Does BERT Look At? An Analysis of BERT's Attention

잉송 · January 19, 2022


Clark, Kevin, et al. "What Does BERT Look At? An Analysis of BERT's Attention." arXiv preprint arXiv:1906.04341 (2019).
Paper: https://arxiv.org/pdf/1906.04341.pdf
There are also reviews of this paper written in Korean, so it is worth searching for them. One example:
http://mlgalaxy.blogspot.com/2019/12/what-does-bert-look-at-analysis-of.html

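BERT's attention heads show recurring patterns: attending to delimiter tokens, attending to specific positional offsets, or attending broadly over the whole sentence. Heads in the same layer also tend to behave similarly.
Some attention heads correspond well to linguistic notions (syntax, coreference), such as direct objects of verbs, objects of prepositions, and coreferent mentions.
The authors propose an attention-based probing classifier and use it to show that BERT's attention captures a substantial amount of syntactic information.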

This part of the paper gives a short introduction to BERT, how it works, and a brief explanation of attention. It also explains the [CLS] and [SEP] tokens. The details of BERT itself are omitted in this post.

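Looking at the attention heads, several patterns show up: heads that put most of their attention on the previous token, the next token, [CLS], [SEP], or '.' and ',', and heads that attend broadly over many tokens. The special role of [SEP], [CLS], '.', and ',' is probably because these tokens appear very commonly in the input.
To see how a change in attention would affect BERT's output, the authors look at the gradient of the loss with respect to the attention weights (gradient-based measures of feature importance). The result is that changing the attention to [SEP] has practically no effect on BERT's output, which supports the theory that attention to [SEP] acts as a "no-op".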
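The gradient-based idea can be sketched on a toy example. Everything here is an assumption made for illustration and is not taken from the paper: a single query position whose head output is the attention-weighted sum of value vectors, a hypothetical squared-error loss against a made-up target, and central finite differences instead of backpropagation through BERT. Token 0 is deliberately given a zero value vector so that its sensitivity comes out as zero; this is not a claim about how BERT represents [SEP].

```python
def head_output(attn, values):
    """Attention-weighted sum of value vectors for one query position."""
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]


def loss(attn, values, target):
    """Hypothetical squared-error loss between the head output and a target."""
    out = head_output(attn, values)
    return sum((o - t) ** 2 for o, t in zip(out, target))


def attention_sensitivity(attn, values, target, eps=1e-5):
    """Approximate |dLoss / d attn[j]| for every attended token j."""
    sensitivities = []
    for j in range(len(attn)):
        up = list(attn)
        down = list(attn)
        up[j] += eps
        down[j] -= eps
        grad = (loss(up, values, target) - loss(down, values, target)) / (2 * eps)
        sensitivities.append(abs(grad))
    return sensitivities


# Toy inputs (assumed): three attended tokens with 2-d value vectors.
attn = [0.7, 0.2, 0.1]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
target = [1.0, 1.0]

sens = attention_sensitivity(attn, values, target)
for j, s in enumerate(sens):
    print(f"token {j}: |dL/d attn| = {s:.3f}")
```

A token whose sensitivity is close to zero is one where increasing or decreasing the attention makes little difference to the loss, which is the sense in which the post calls attention to [SEP] a "no-op".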
Attention heads in the lower layers, in particular, tend to attend very broadly. The output of these heads is roughly a bag-of-vectors representation of the sentence.
In the last layer, attention from the [CLS] token is also broad. This makes sense given that the [CLS] token is used as the input for “next sentence prediction” during pre-training.
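One way to put a number on how "broad" a head is, and the measure the paper uses, is the entropy of its attention distribution: it is high when the weights are spread evenly and low when they concentrate on one token. Below is a minimal sketch on hand-made weights (illustrative values, not actual BERT outputs; the function names are my own). The second half shows why broad attention gives a bag-of-vectors output: with uniform weights the head output is simply the mean of the value vectors.

```python
import math


def attention_entropy(attn):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in attn if p > 0.0)


def weighted_sum(attn, values):
    """Attention-weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]


# Hand-made attention distributions over four tokens (assumed values).
focused = [0.97, 0.01, 0.01, 0.01]
broad = [0.25, 0.25, 0.25, 0.25]

h_focused = attention_entropy(focused)
h_broad = attention_entropy(broad)  # uniform over 4 tokens, so this is log(4)

# With uniform attention the output is the plain average of the value vectors.
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 3.0]]
bag = weighted_sum(broad, values)

print(h_focused, h_broad, bag)
```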

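Next, the individual attention heads are examined, specifically on data from a dependency parsing task.
Because the comparison should happen at the word level, BERT's byte-pair tokens are converted into words. When the tokens of a split word are merged, the attention weights to that word are summed over its tokens, and the attention weights from that word are averaged over its tokens.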
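The token-to-word conversion can be sketched as follows, using the rule described in the paper: attention to a split word is the sum over its tokens, and attention from a split word is the mean over its tokens. The function name, the `word_ids` input format, and the example matrix are assumptions made for illustration, not the authors' code.

```python
def token_to_word_attention(attn, word_ids):
    """Convert a token-token attention map into a word-word attention map.

    attn[i][j] is the attention from token i to token j.
    word_ids[i] is the index of the word that token i belongs to
    (assumed to cover every index from 0 to max(word_ids)).
    """
    n_words = max(word_ids) + 1

    # Attention TO a split word: sum the weights over its tokens (columns).
    summed = []
    for row in attn:
        word_row = [0.0] * n_words
        for j, weight in enumerate(row):
            word_row[word_ids[j]] += weight
        summed.append(word_row)

    # Attention FROM a split word: average the rows of its tokens.
    counts = [0] * n_words
    totals = [[0.0] * n_words for _ in range(n_words)]
    for i, word_row in enumerate(summed):
        w = word_ids[i]
        counts[w] += 1
        for k, weight in enumerate(word_row):
            totals[w][k] += weight
    return [[x / counts[w] for x in totals[w]] for w in range(n_words)]


# Toy example: "play", "##ing", "chess" -> words "playing", "chess".
word_ids = [0, 0, 1]
attn = [
    [0.5, 0.3, 0.2],
    [0.1, 0.5, 0.4],
    [0.2, 0.2, 0.6],
]
word_attn = token_to_word_attention(attn, word_ids)
print(word_attn)  # approximately [[0.7, 0.3], [0.4, 0.6]]
```

Summing on the "to" side and averaging on the "from" side keeps every row of the word-level map summing to 1, so each row is still a valid attention distribution.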