RAIN: Your Language Models Can Align Themselves without Finetuning

jihyelee · October 3, 2023

superficial alignment hypothesis

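The theoretical basis and inspiration for this paper
The hypothesis that a model's knowledge and capabilities are acquired mostly during pretraining
Alignment then tells the model which of its already learned knowledge and capabilities to select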
- without external supervision, self-alignment

RAIN (Rewindable Auto-regressive INference)

Mimics the human behavioral pattern of deliberating, weighing options, and thinking through the consequences
Self-evaluation
- The model evaluates and scores the text it has generated
- e.g. Determine if the assistant's response is harmful. ...
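
A minimal sketch of how such a self-evaluation call could be wired up, assuming the language model is available as a plain text-in, text-out function. Only the instruction sentence comes from the example above. The option labels, the +1 / -1 scoring, and the names `build_self_eval_prompt`, `verdict_to_score`, and `self_evaluate` are hypothetical and are not taken from the paper.

```python
from typing import Callable

# Instruction sentence quoted from the example above. The option layout that
# follows it in the prompt is a hypothetical illustration, not the paper's.
INSTRUCTION = "Determine if the assistant's response is harmful."


def build_self_eval_prompt(conversation: str) -> str:
    """Wrap a conversation in a two-option harmfulness question."""
    return (
        f"{INSTRUCTION}\n\n"
        f"{conversation}\n\n"
        "Options:\n"
        "(A) The assistant's response is harmful.\n"
        "(B) The assistant's response is harmless.\n"
        "Answer:"
    )


def verdict_to_score(verdict: str) -> float:
    """Map the model's verdict to a score (assumed scale: +1 harmless, -1 harmful)."""
    answer = verdict.strip().upper()
    if answer.startswith("(B)"):
        return 1.0
    if answer.startswith("(A)"):
        return -1.0
    return 0.0  # verdict could not be parsed


def self_evaluate(conversation: str, generate: Callable[[str], str]) -> float:
    """Ask the same model (passed in as `generate`) to judge its own output."""
    return verdict_to_score(generate(build_self_eval_prompt(conversation)))
```

Here `generate` stands in for whatever inference call is in use. The same model that produced the response is asked to judge it, which is why no external supervision is needed.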
forward generation and backward rewind
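- forward: starting from a given node (= token set), search for suitable tokens to follow it
  - The search direction is chosen using the previously recorded average score (value) and the visit count
  - This keeps exploitation and exploration in balance
    - Token sets with higher probability get priority in the search; see the paper for the exact formula
- backward: rewind to earlier tokens when the situation calls for it
  - Similarity between token sets is used so that a single evaluation can cover more tokens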
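
The forward and backward steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: the selection score is a generic PUCT-style rule (average value plus a probability-weighted exploration bonus that shrinks with the visit count), and the names `Node`, `select_child`, `backup`, and the constant `c` are hypothetical. The similarity-based update across token sets is left out.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    tokens: tuple        # the token set this node represents
    prior: float         # model probability of this token set
    value: float = 0.0   # running average of self-evaluation scores
    visits: int = 0      # how many times this node has been visited
    children: list = field(default_factory=list)


def select_child(parent: Node, c: float = 1.0) -> Node:
    """Forward step: pick the child that best trades off value and exploration.

    Assumed PUCT-style score, not the formula from the paper.
    """
    total = sum(child.visits for child in parent.children)

    def score(child: Node) -> float:
        bonus = c * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.value + bonus

    return max(parent.children, key=score)


def backup(path: list, score: float) -> None:
    """Backward step: fold a self-evaluation score into every node on the path."""
    for node in path:
        node.visits += 1
        node.value += (score - node.value) / node.visits  # incremental mean


# Usage: a poorly scored branch loses priority on the next forward pass.
root = Node(tokens=(), prior=1.0)
a = Node(tokens=("Sure,",), prior=0.7)
b = Node(tokens=("Sorry,",), prior=0.3)
root.children = [a, b]

first = select_child(root)   # unvisited nodes: the higher prior wins
backup([first], -1.0)        # self-evaluation judged this branch harmful
second = select_child(root)  # the search now rewinds and tries the other branch
```

With unvisited children the prior dominates, so the more probable token set is explored first. Once it receives a low score, its average value drops and its exploration bonus shrinks, so the search moves to the alternative.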

Tasks
- harm-free generation
- adversarial harm-free generation
- controlled sentiment generation
Datasets
- Helpful and Harmless (HH)
- AdvBench
- IMDB dataset
Models
- LLaMA, LLaMA-2, Vicuna, Alpaca 7B, GPT-neo
Evaluation
- GPT-4 and human evaluation

Graduate student at Seoul National University, majoring in Artificial Intelligence (NLP). Currently AI Researcher at LG CNS AI Lab