[Paper Review] Transferring Inductive Bias Through Knowledge Distillation - (2/3)

euisuk-chung · August 27, 2021

Hello :) Today, continuing from the last post, I'll keep working through the paper "Transferring Inductive Bias Through Knowledge Distillation". In the previous post I explained the paper's two key concepts, Knowledge Distillation and Inductive Bias. In this post, I'll walk through the first of the experimental scenarios the authors ran using those techniques.

If you're curious about the previous post, you can check it out here.

๋…ผ๋ฌธ์˜ ๋ชฉ์ (๋ณต์Šต)

This paper runs experiments under two scenarios to answer the question: "Does the Dark Knowledge that a teacher model passes to a student model during Knowledge Distillation actually carry information about the teacher's Inductive Bias?" The first scenario compares RNNs (teacher model) with Transformers (student model), and the second compares CNNs (teacher model) with MLPs (student model).

The experiments aim to show (1) how meaningful the teacher models' inductive biases really are, and (2) whether a student model that has received the teacher's knowledge actually produces learning outcomes similar to the teacher's.


This post covers the first scenario (RNNs vs. Transformers).

Scenario 1

The first scenario compares the LSTM, a representative RNN, with the Transformer. Both are widely used models in Natural Language Processing; the Transformer is the more recently published architecture and, given enough training data, shows excellent performance across a wide range of tasks.

Picture from "https://jalammar.github.io/illustrated-transformer"

๋ฐ์ดํ„ฐ๊ฐ€ ์ถฉ๋ถ„ํžˆ ๋งŽ๋‹ค๋ฉด ๋น„๊ต์  ์ตœ๊ทผ์— ๋‚˜์˜จ Transformer์ด LSTM๋ณด๋‹ค ๋” ์ข‹์€ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ๊ฐ–๋Š” ๊ฒƒ์ด ์ž๋ช…ํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ๋ฐ์ดํ„ฐ๊ฐ€ ํ•œ์ •์ (limited)์ธ ์ƒํ™ฉ์—์„œ๋Š” ํŠน์ • task์—์„œ LSTM์ด Transformer๊ฐ€ ๋” ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ์ด๋ผ๋Š” ์—ฐ๊ตฌ๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ ์˜ˆ๋กœ ๋“œ๋Š” task๊ฐ€ ๋ฐ”๋กœ "Subject-verb agreement prediction task"์ธ๋ฐ์š”. ํ•ด๋‹น task๋Š” "Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies (2016)"์ด๋ผ๋Š” ๋…ผ๋ฌธ์—์„œ ๊ตฌ๋ฌธ(syntax) ์ •๋ณด๋ฅผ ํ•™์Šตํ•˜๋Š”๋ฐ ์œ ์šฉํ•˜๋‹ค๊ณ  ์†Œ๊ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Understanding this requires a bit of English grammar. The form of an English third-person present-tense verb depends on whether the head of the syntactic subject is singular or plural, and the verb does not have to sit right next to the subject. The sentences on the left are cases where the verb is adjacent to the subject; the examples on the right are cases where they are separated.

Exploiting this grammatical property, "The Importance of Being Recurrent for Modeling Hierarchical Structure (2018)" argues that an LSTM trained with limited resources beats the Transformer (FAN) on this task. As shown in the figure below, the evaluation measures how well the model predicts (a) which word comes after the input sentence, or (b) whether that word is singular or plural. In the graph below, distance means the distance between the syntactic subject and the third-person present-tense verb, and # of attractors means the number of other nouns, besides the syntactic subject (the correct noun to agree with), that could lure the verb into the wrong agreement.

From "The Importance of Being Recurrent for Modeling Hierarchical Structure (2018)"
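To make distance and attractor count concrete, here is a toy annotated example (my own, not from the paper or its dataset; the paper's exact token-counting convention may differ slightly):

```python
# Toy annotated sentence for subject-verb agreement:
#   "The keys to the cabinet are ..."
# The subject head "keys" is plural; "cabinet" is a singular noun sitting
# between the subject and the verb, so it acts as one attractor.
tokens = ["The", "keys", "to", "the", "cabinet", "are"]
subject_idx, verb_idx = 1, 5                 # positions of "keys" and "are"
subject_number = "plural"
intervening_nouns = {"cabinet": "singular"}  # nouns between subject and verb

distance = verb_idx - subject_idx            # one simple counting convention
n_attractors = sum(1 for num in intervening_nouns.values()
                   if num != subject_number)
```

Here `distance` is 4 and `n_attractors` is 1; the graphs in the paper report accuracy as a function of exactly these two quantities.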


RNNs' Inductive Bias

์ž! ์ด์ œ ๋ฌธ๋ฒ• ๊ณต๋ถ€๊ฐ€ ๋๋‚ฌ์œผ๋‹ˆ ๋ณธ๊ฒฉ์ ์œผ๋กœ ๋ชจ๋ธ์— ๋Œ€ํ•ด ์ด์•ผ๊ธฐ ํ•ด๋ณผ๊นŒ์š”? RNN(Recurrent Neural Network)์€ ์‹œํ€€์Šค(Sequence) ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์ฆ‰, ๊ทธ ๋ง์€ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์„ ์‹œํ€€์Šค ๋‹จ์œ„๋กœ ์ฒ˜๋ฆฌ๋ฅผ ํ•œ๋‹ค๋Š” ์˜๋ฏธ์ธ๋ฐ์š”. ์—ฌ๊ธฐ์„œ ๋น„๊ต ๋ชจ๋ธ๋กœ ์‚ฌ์šฉํ•˜๋Š” LSTM ์—ญ์‹œ ์ด๋Ÿฌํ•œ RNN์„ ๊ทผ๋ณธ์œผ๋กœ ํ•˜๋Š” ๋ชจ๋ธ์ด๋ฏ€๋กœ, ๊ฐ€๋ณ๊ฒŒ RNN์— ๋Œ€ํ•œ ๊ฐœ๋…๊ณผ RNN์˜ Inductive Bias(๊ท€๋‚ฉ์  ํŽธํ–ฅ)์— ๋Œ€ํ•ด ์ด์•ผ๊ธฐํ•˜๊ณ  ๋„˜์–ด๊ฐ€๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

I'll explain with reference to the figure below. Across all the diagrams, the red boxes are inputs, the blue boxes are outputs, and the green boxes are (hidden) states.

Picture from CS231n

  1. one-to-one : Vanilla Neural Network
    The ordinary neural network we all know: a single input maps to a single output.

  2. one-to-many : Recurrent Neural Network
    One input maps to multiple outputs; the classic example is Image Captioning, where a single image goes in and a sentence describing it (a sequence of words) comes out.

  3. many-to-one : Recurrent Neural Network
    Multiple inputs map to one output; the classic example is Sentiment Analysis, where a sentence goes in and its tone (sentiment) is classified.

  4. many-to-many : Recurrent Neural Network
    Multiple inputs map to multiple outputs; the classic example is Machine Translation, where an English sentence goes in and its Korean translation comes out.

At every timestep, when a new input arrives, the RNN passes it through a function, updates its state, and returns an output. This function is identical at every step and uses the same parameters. Moreover, each state can only receive information from the immediately preceding state; it can never look directly at older information or at the future.
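The recurrence just described can be sketched in a few lines (a toy one-dimensional RNN of my own, purely to show the shared function and the single-state memory):

```python
import math

# Toy 1-D RNN: the SAME function with the SAME weights is applied at every
# timestep, and the new state depends only on the previous state and the input.
w_h, w_x, w_y = 0.5, 1.0, 2.0   # shared parameters, reused at every step

def step(h_prev, x):
    h = math.tanh(w_h * h_prev + w_x * x)  # next state sees only h_prev and x
    y = w_y * h                            # output for this timestep
    return h, y

h = 0.0                       # initial state
outputs = []
for x in [1.0, -1.0, 0.5]:    # inputs are forced in one at a time
    h, y = step(h, x)         # h must compress the entire past into one value
    outputs.append(y)
```

Note how the loop structure itself encodes the three properties discussed next: inputs arrive in order, only `h` carries the past forward, and `step` is reused unchanged.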

The RNN characteristics just introduced are precisely its inductive biases. The paper identifies three of them, as follows.

  1. Sequentiality : the data fed to the model is forced to arrive one step at a time, in order.

  2. Memory Bottleneck : the model can only receive the hidden state of the immediately preceding timestep, so that single hidden state is forced to compress everything that came before it.

  3. Recursion : the same function is forced to be applied at every step.


Transformers' Inductive Bias

As shown below, the Transformer is an encoder-decoder model: the encoder takes an input sequence (e.g. "I am a student") and the decoder emits an output sequence (e.g. "Je suis étudiant"). It was a landmark paper (and method) showing that, without feeding words in sequentially like an RNN, self-attention plus feed-forward networks alone can achieve strong performance. Covering the Transformer in full would make this post far too long, so for now it's enough to understand that it is much less constrained than an RNN.

Picture from "Introduction to Natural Language Processing with Deep Learning" (Wonjoon Yoo)

Here is why the Transformer's constraints, i.e. its inductive biases, are much weaker than the RNN's:

Picture from "https://jalammar.github.io/illustrated-transformer"

  1. The Transformer adds a positional encoding, a vector derived from simple sine and cosine functions, to embed each token's position; unlike the RNN, nothing at the model level forces the data to arrive in order, so there is no Sequentiality.

  2. The Transformer can attend to information from all tokens at once through self-attention, so there is no Memory Bottleneck like the RNN's, where only the previous timestep's hidden state is handed forward.

  3. The Transformer passes from encoder to decoder in a single shot, so there is no Recursion like the RNN's repeated application of the same function.
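For reference, the positional encoding mentioned in point 1 can be sketched as follows (a minimal, unoptimized version following the sinusoidal formula from "Attention Is All You Need", not this paper's code):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings:
       PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# Each token embedding gets its row of `pe` added, injecting order information
# without forcing the model to consume tokens one at a time.
```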


Experiment Settings

Dataset

  • Subject-verb agreement dataset of Linzen et al. (2016)

Performance Metric

  • μ-Accuracy : micro-accuracy
  • D-Accuracy : accuracy over different groups in terms of distance
  • A-Accuracy : accuracy over numbers of attractors

Learning Objective

  • Language Modelling (LM) Setup : directly predict the word that comes next
  • Classification Setup : predict whether the next word is singular or plural

๊ฐ๊ฐ์˜ Objective์— ๋”ฐ๋ฅธ ์‹คํ—˜ ๋ชจ๋ธ๊ตฐ

  • Language Modelling (LM) Setup :
    1. LSTM : Base LSTM
    2. Small LSTM : LSTM with fewer parameters
    3. Transformer : Base Transformer
    4. Small Transformer : Transformer with fewer parameters

  • Classification Setup :
    1. LSTM : Base LSTM (Sequentiality + Memory Bottleneck + Recursion)
    2. Transformer : Base Transformer (no inductive bias)
    3. Transformer-seq : Base Transformer with Sequentiality forcibly imposed (Sequentiality)
    4. Universal Transformer-seq : Transformer-seq with Recursion forcibly added on top (Sequentiality + Recursion)


Experiment Results

Through its experiments, the paper sets out to demonstrate two things:

  1. Running experiments without distillation, to show how meaningful the teacher models' inductive biases really are.
  2. Performing distillation, to show whether a student model that inherits the teacher's inductive bias really produces learning outcomes similar to the teacher's.
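For reference, the distillation itself follows the standard soft-target recipe (Hinton-style): the student matches the teacher's temperature-softened output distribution alongside the hard labels. A minimal sketch of my own (`T` and `alpha` are illustrative hyperparameters, not the paper's values):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T spreads probability mass out."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """alpha * CE(student, hard label)
       + (1 - alpha) * T^2 * CE(softened student, softened teacher)."""
    p_hard = softmax(student_logits)
    p_student = softmax(student_logits, T)
    p_teacher = softmax(teacher_logits, T)   # the teacher's "dark knowledge"
    hard_term = -math.log(p_hard[hard_label])
    soft_term = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    return alpha * hard_term + (1 - alpha) * T * T * soft_term
```

The paper's central question is whether the inductive bias "rides along" in the soft-target term of this objective.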

Without Distillation

  • The LSTM performs better than the Transformer overall (on the metrics introduced above), confirming that for learning the hierarchical structure of the input, the LSTM with its stronger inductive bias beats the Transformer with its weaker one.
  • The Transformer tends to overfit when trained on little data; accordingly, shrinking the model (the "small" variant) relieves the overfitting and actually improves performance.
  • Conversely, the LSTM's inductive bias is well suited to this task, so shrinking it lowers performance.

  • Each time an inductive bias is added, the evaluation metrics improve:
    - Accuracy increases
    - Expected Calibration Error (ECE) decreases

※ A quick aside!

(Note) Calibration means making a model's output reflect its actual confidence: if the model outputs 0.8 for Y given X, that output should mean Y is correct 80% of the time. The calibration error is computed by binning: predictions are split into M bins, the calibration gap (between average confidence and accuracy) is computed per bin, and the weighted average over bins yields the Expected Calibration Error (ECE) used as the evaluation metric.
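The binning procedure above can be sketched directly (a minimal version of mine, using the usual equal-width bins over confidence):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins B_m of (|B_m|/N) * |accuracy(B_m) - confidence(B_m)|."""
    n = len(confidences)
    ece = 0.0
    for m in range(n_bins):
        lo, hi = m / n_bins, (m + 1) / n_bins       # equal-width bin (lo, hi]
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (m == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(acc - avg_conf)
    return ece
```

A perfectly calibrated model has ECE 0: in every bin, the fraction of correct predictions matches the average confidence.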

With Distillation

  • Transferring the LSTM's knowledge to the Transformer improves performance over the original Transformer (red); see the sky-blue and blue curves.

  • We can also see that the distilled Transformer's perplexity comes close to the teacher model's.

Language Modelling(LM) Setup

Classification Setup

  • A student model distilled from a teacher with a stronger inductive bias shows improved performance (Acc ↑, ECE ↓) over the same model trained without distillation.

Finally, visualizing the representations at each model's penultimate layer via Multidimensional Scaling (MDS) gives the result below.

While the original Transformer's representations show wide variance, variance shrinks as models carry stronger inductive biases; and when distillation is performed, each student comes to resemble its teacher, with its variance decreasing as well.

์˜ค๋Š˜์€ ์‹œ๋‚˜๋ฆฌ์˜ค 1์— ๋Œ€ํ•˜์—ฌ ์ž์„ธํ•˜๊ฒŒ ๋‹ค๋ฃจ์–ด๋ณด์•˜๋Š”๋ฐ์š”. ๋น ๋ฅธ ์‹œ์ผ ๋‚ด์— ์‹œ๋‚˜๋ฆฌ์˜ค 2๊นŒ์ง€ ์—…๋กœ๋“œํ•˜๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค :D

Thanks for reading this long post!
