[DL/XAI] "Exploring Explainability for Vision Transformers"

๊ตฌ๋งยท2024๋…„ 11์›” 15์ผ

[Paper Review]

๋ชฉ๋ก ๋ณด๊ธฐ
7/8

๐Ÿ”—ย ์›๋ฌธ ๋งํฌ

https://jacobgil.github.io/deeplearning/vision-transformer-explainability

1. Background


Several studies that appeared in 2020 began to introduce the Transformer into computer vision (CV) in earnest.

Among them, the Vision Transformer (ViT) from < An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale > and the Data-efficient image Transformer (DeiT) from < Training data-efficient image transformers & distillation through attention > stand out as the most notable works.

Since many more studies applying Transformers to vision tasks are expected to follow, we may find ourselves asking questions like the following.

โ“
1. ViT์˜ ๋‚ด๋ถ€์—์„œ๋Š” ๋ฌด์—‡์ด ์ผ์–ด๋‚˜๊ณ  ์žˆ์„๊นŒ์š”?
2. ViT๋Š” ์–ด๋–ป๊ฒŒ ์ž‘๋™ํ•˜๋Š” ๊ฑธ๊นŒ์š”?
3. ์šฐ๋ฆฌ๊ฐ€ ๊ทธ ๋‚ด๋ถ€๋ฅผ ๋“ค์—ฌ๋‹ค๋ณด๊ณ  ๊ฐ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์„๊นŒ์š”?

What these questions have in common is that they demand 'explainability' from deep learning models, which are black boxes. 'Explainability' is a rather broad concept that different people interpret differently, but here we define it as follows:

💡 [ for the developers ]
1. What's going on inside when we run the Transformer on this image? → Activation Visualization
Being able to inspect the intermediate activation layers corresponds to this. In computer vision, these activations are usually rendered as images: the different channel activations can be visualized as 2D images, which makes them interpretable to some degree.

2. What did it learn? → Activation Maximization
This is about investigating what kinds of patterns the model has learned. It is usually framed as the question "What input image maximizes the response of this activation?", and variants of the 'Activation Maximization' technique can be used.

  • Activation Maximization
    • Fix one target output in a CNN and find or generate the input image that activates it maximally.

💡 [ for both the developer and the user ]
1. What did it see in this image? → Pixel Attribution

  • Being able to answer "Which parts of this image influenced the network's prediction?" is called Pixel Attribution.

Since ViTs need this kind of explainability too, this post walks through implementing it. (The full code for this process is available at: https://github.com/jacobgil/vit-explain )

As a basic setup, the model we will use is the 'DeiT Tiny' model newly released by Facebook.

model = torch.hub.load('facebookresearch/deit:main', 'deit_tiny_patch16_224', pretrained=True)

And we will work with 224x224 input images throughout.
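A fuller, minimal sketch of this setup (the normalization constants are the standard ImageNet values the DeiT repo uses, and 'plane.png' is a placeholder for any test image):

```python
import torch
from PIL import Image
from torchvision import transforms

# Load the pretrained DeiT Tiny model from torch.hub and switch to eval mode.
model = torch.hub.load('facebookresearch/deit:main',
                       'deit_tiny_patch16_224', pretrained=True)
model.eval()

# Standard ImageNet preprocessing to a 224x224 tensor.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open('plane.png').convert('RGB')  # placeholder filename
input_tensor = preprocess(img).unsqueeze(0)   # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(input_tensor)              # shape: (1, 1000)
```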

2. Q, K, V and Attention


A Vision Transformer consists of several encoder blocks. Each block contains the following.

  • Multiple attention heads: each patch representation aggregates information from the other patches of the image.
  • An MLP: transforms each patch representation into a higher-level feature representation.

⇒ Both the attention heads and the MLP have residual connections, and we will look at how these behave.

A Vision Transformer takes image data as input, tokenizes it, and learns the relationships between patches through the attention mechanism. In particular, inside each attention head, the key components Q (Query), K (Key), and V (Value) work together to capture the relationships between the image patches. We will use the DeiT Tiny model as a running example.

Deit Tiny์˜ Attention Head์™€ Q, K, V์˜ ๊ตฌ์กฐ

DeiT Tiny has 3 attention heads per layer. Each attention head processes the tokens carrying the image patch information, and their concrete shape is 3x197x64. Interpreted in detail, this means:

💡

  • There are 3 attention heads in total.
  • Each attention head receives 197 tokens: all the image patches plus the CLS token.
  • Each token carries a feature vector of length 64.

๋”ฐ๋ผ์„œ, Attention Head๋งˆ๋‹ค 197๊ฐœ์˜ ํ† ํฐ์ด ๊ฐ๊ฐ 64์ฐจ์›์˜ ํŠน์ง• ํ‘œํ˜„์„ ๊ฐ€์ง€๋ฉฐ, ์ด๋“ค์ด ๋ชจ์—ฌ 3x197x64 ๊ตฌ์กฐ๋ฅผ ํ˜•์„ฑํ•ฉ๋‹ˆ๋‹ค.

Token composition - 196 image patches + 1 CLS token

Of the 197 tokens, 196 represent the 14x14 patches of the original image, each patch expressed as an individual token. The first token is the CLS (classification) token, which serves as the aggregate information for the final prediction. The CLS token integrates representative information from the whole image and is used by the final output layer to perform the classification or prediction.

Now, let's look at what the rows of Q and K mean.

Q์™€ K์˜ ์—ญํ•  - ์ด๋ฏธ์ง€์˜ ์œ„์น˜ ๊ด€๊ณ„๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ฐฉ๋ฒ•

๊ฐ ํŒจ์น˜ ํ† ํฐ์— ๋Œ€ํ•ด Query(Q)์™€ Key(K) ํ–‰๋ ฌ์ด ์ƒ์„ฑ๋˜๋ฉฐ, ์ด๋“ค์€ ๊ฐ ์ด๋ฏธ์ง€ ํŒจ์น˜์˜ ํŠน์ • ์œ„์น˜์™€ ๊ด€๋ จ๋œ 64์ฐจ์›์˜ ํŠน์ง• ๋ฒกํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด transformer๋Š” ์ด๋ฏธ์ง€ ํŒจ์น˜๋“ค ๊ฐ„์˜ ์œ ์‚ฌ์„ฑ์„ ๊ณ„์‚ฐํ•˜๊ณ , ์–ด๋–ค ํŒจ์น˜๋“ค์ด ๋‹ค๋ฅธ ํŒจ์น˜๋“ค๋กœ๋ถ€ํ„ฐ ๋” ๋งŽ์€ ์ •๋ณด๋ฅผ ๊ฐ€์ ธ๊ฐ€์•ผ ํ•˜๋Š”์ง€ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ๊ณผ ๊ฐ™์ด Q, K, V์˜ ์—ญํ• ์„ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

💡

  • Suppose a particular patch of the image has the feature vector $q_i$.
  • The attention mechanism computes the similarity between this feature vector $q_i$ and the Key vector $k_j$ of another patch.

โ‡’ ์ด๋•Œ, ํŠน์ • ํŒจ์น˜ ์œ„์น˜์— ํ•ด๋‹นํ•˜๋Š” Key ๋ฒกํ„ฐย  kjk_j ๊ฐ€ Query ๋ฒกํ„ฐย  qiq_i ์™€ ์œ ์‚ฌํ• ์ˆ˜๋ก ํ•ด๋‹น ์œ„์น˜๋กœ๋ถ€ํ„ฐย  qiq_i ์— ๋” ๋งŽ์€ ์ •๋ณด๊ฐ€ ํ๋ฆ…๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์€ ๊ฐ ํŒจ์น˜๊ฐ€ ์„œ๋กœ ์–ด๋–ค ๊ด€๊ณ„์— ์žˆ๋Š”์ง€๋ฅผ ํŒŒ์•…ํ•˜๋ฉฐ, attention์ด ์ง‘์ค‘๋˜์–ด์•ผ ํ•  ๋ถ€๋ถ„์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.

As a result, patches with similar Q and K form strong mutual relationships, and the transformer uses this to learn the important positions in the image and the correlations between patches.
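As a toy illustration of this (not from the original post), the snippet below turns scaled dot products between one query vector and a few key vectors into attention weights:

```python
import torch
import torch.nn.functional as F

q_i = torch.tensor([1.0, -0.5, 2.0])          # query vector of patch i (toy 3-dim example)
keys = torch.tensor([[ 1.2, -0.4,  1.8],      # k_1: similar to q_i -> large positive score
                     [-1.0,  0.5, -2.0],      # k_2: opposite signs -> negative score
                     [ 0.0,  0.0,  0.1]])     # k_3: nearly orthogonal -> score near zero

scores = keys @ q_i / (3 ** 0.5)              # scaled dot products q_i . k_j / sqrt(d)
weights = F.softmax(scores, dim=0)            # how much information flows from each j to i
print(weights)                                # k_1 dominates, k_2 is suppressed
```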

*Image from http://jalammar.github.io/illustrated-transformer*

3. Visual Examples of K and Q - different patterns of information flowing


When we feed an airplane image into the Vision Transformer, the image passes through the transformer layers and various features are learned along the way. The Query (Q) and Key (K) play the key roles in this process. Here, we use the airplane image as an example to see how Q and K behave.

(Input image: a plane)

ViT์—์„œ Q์™€ K๋Š” ๊ฐ๊ฐ ์ด๋ฏธ์ง€์˜ ํŠน์ • ์œ„์น˜์—์„œ ํŠน์ง• ๋ฒกํ„ฐ๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

For example:

  • $q_{ic}$ : the Query feature value for channel c at position i in the image
  • $k_{jc}$ : the Key feature value for channel c at position j in the image

Here, i and j denote specific patch positions in the image, and c denotes a specific channel.

The key question now is how the K vectors at each position pass information to the Q vectors. This is computed via the dot product of Q and K. To understand the interactions between specific patches of the image, the ViT measures the similarity between Q and K vectors through their dot product and lets information flow according to that similarity.

There are two cases.

  1. ๊ฐ™์€ ์ฑ„๋„์—์„œ Q์™€ K ๋ฒกํ„ฐ์˜ ๋ถ€ํ˜ธ๊ฐ€ ๋™์ผํ•œ ๊ฒฝ์šฐ

For example, if the Query value $q_{ic}$ at position i and the Key value $k_{jc}$ at position j are both positive or both negative, their product is positive.

โ†’ ์ด๋•Œ, ์œ„์น˜ j์—์„œ ์œ„์น˜ i๋กœ ์ •๋ณด๊ฐ€ ํ๋ฅด๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, ํ•ด๋‹น ํŒจ์น˜๋Š” ์ •๋ณด๊ฐ€ ํ๋ฅด๋Š” ๊ฒฝ๋กœ๋กœ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

  1. ๊ฐ™์€ ์ฑ„๋„์—์„œ Q์™€ K ๋ฒกํ„ฐ์˜ ๋ถ€ํ˜ธ๊ฐ€ ๋‹ค๋ฅธ ๊ฒฝ์šฐ

Conversely, if $q_{ic}$ is positive and $k_{jc}$ is negative, or vice versa, their product is negative.

โ†’ ์ด ๊ฒฝ์šฐ์—๋Š” ์œ„์น˜ j์—์„œ i๋กœ ์ •๋ณด๊ฐ€ ํ๋ฅด์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ฆ‰, ๋‘ ํŒจ์น˜ ๊ฐ„์˜ ์ •๋ณด ์—ฐ๊ฒฐ์ด ์•ฝํ•ด์ง€๋ฉฐ, ์ด๋“ค์€ ์„œ๋กœ ๋‹ค๋ฅธ ํŠน์„ฑ์„ ํ•™์Šตํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ด์ œ ์ด๋Ÿฌํ•œ ํŠน์„ฑ์„ ์‹œ๊ฐํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ ์ด๋ฏธ์ง€๋ฅผ torch.nn.Sigmoid() ๋ ˆ์ด์–ด์— ํ†ต๊ณผ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Sigmoid ํ•จ์ˆ˜๋Š” ๊ฐ’์„ ์–‘์ˆ˜์™€ ์Œ์ˆ˜๋กœ ๋‚˜๋ˆ„์–ด์ฃผ๊ธฐ ๋•Œ๋ฌธ์—, ๋ฐ์€ ํ”ฝ์…€์€ ์–‘์ˆ˜, ์–ด๋‘์šด ํ”ฝ์…€์€ ์Œ์ˆ˜๋กœ ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด Q์™€ K์˜ ํŠน์„ฑ์— ๋”ฐ๋ผ ์ด๋ฏธ์ง€์˜ ํŠน์ • ์œ„์น˜์—์„œ ์ •๋ณด๊ฐ€ ํ๋ฅด๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ์‰ฝ๊ฒŒ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Two main patterns

Looking at the Q and K visualizations across various channels, two patterns emerge.

  1. One-directional information flow: information originating at one position flows only to a specific region.
  2. Bi-directional information flow: different positions exchange information with each other. In this case, the main regions of the image are connected to one another, and the paths along which their information flows are reinforced.

Pattern 1 - information flows in one direction

๋ ˆ์ด์–ด 8, ์ฑ„๋„ 26, ์ฒซ ๋ฒˆ์งธ ์–ดํ…์…˜ ํ—ค๋“œ:

  • Key image
    • Highlights the airplane, indicating that those patches carry airplane-related information.
  • Query image
    • Highlights the entire image.

์ด ์ƒํ™ฉ์—์„œ Query ์ด๋ฏธ์ง€์˜ ๋Œ€๋ถ€๋ถ„ ์œ„์น˜๊ฐ€ ์–‘์ˆ˜์ธ ๋ฐ˜๋ฉด, ์ •๋ณด๋Š” Key ์ด๋ฏธ์ง€์—์„œ ๋น„ํ–‰๊ธฐ ๋ถ€๋ถ„์—์„œ๋งŒ ํ๋ฆ…๋‹ˆ๋‹ค. ์ด๋Š” Query์™€ Key๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฉ”์‹œ์ง€๋ฅผ ์ „๋‹ฌํ•˜๊ณ  ์žˆ์Œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

"We found an airplane, and we want every position in the image to know about it."

In this way, the ViT finds airplane-related information through specific patches and propagates it to the whole image, so that the other patches also become aware of the airplane's presence.

Pattern 2 - information flows in both directions

๋ ˆ์ด์–ด 11, ์ฑ„๋„ 59, ์ฒซ ๋ฒˆ์งธ ์–ดํ…์…˜ ํ—ค๋“œ:

  • Query image
    • Mainly highlights the lower part of the airplane, indicating that it focuses there to take in information.
  • Key image
    • Shows negative values in the upper part of the airplane. Those positions act to spread information to the rest of the image.

์ด ์ƒํ™ฉ์—์„œ ์ •๋ณด๋Š” ๋‘ ๋ฐฉํ–ฅ์œผ๋กœ ํ˜๋Ÿฌ๊ฐ‘๋‹ˆ๋‹ค.

  1. Information spreads from the upper part of the airplane to the whole image. The airplane's upper region shows up as negative in the Key image, and this information is passed to the rest of the image, where the Query values are negative.

"We found an airplane. Now let's spread this information to the other parts of the image."

  2. Information flows from the non-airplane parts of the image into the lower part of the airplane. The surrounding information that shows up as positive in the Key image flows into the positive-valued lower part of the airplane in the Query.

"Let's tell the airplane about its surroundings."

Through this kind of bi-directional information flow, the ViT learns not only the top and bottom of the airplane but also the relationship between the airplane and its surroundings. This is how the model grasps how the main object in the image and its background are connected.

4. What do the Attention Activations look like for the class token throughout the network?


ViT์˜ ์ค‘์š”ํ•œ ํŠน์ง• ์ค‘ ํ•˜๋‚˜๋Š” class token์ด ๋„คํŠธ์›Œํฌ ๋‚ด ์—ฌ๋Ÿฌ ๋ ˆ์ด์–ด๋ฅผ ๊ฑฐ์น˜๋ฉฐ ๋‹ค์–‘ํ•œ ํŒจ์น˜๋กœ๋ถ€ํ„ฐ ์ •๋ณด๋ฅผ ํ†ตํ•ฉํ•ด ๊ฐ„๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ํด๋ž˜์Šค ํ† ํฐ์€ ์ตœ์ข… ์˜ˆ์ธก์— ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•˜๋ฉฐ, ๊ฐ ํŒจ์น˜๋“ค์ด ์–ด๋–ป๊ฒŒ ํด๋ž˜์Šค ํ† ํฐ์— ์ •๋ณด๋ฅผ ์ œ๊ณตํ•˜๋Š”์ง€๋ฅผ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์€ ๋ชจ๋ธ์˜ ์ž‘๋™ ๋ฐฉ์‹์„ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

์–ดํ…์…˜ ํ—ค๋“œ๊ฐ€ ์—ฌ๋Ÿฌ ๊ฐœ ์กด์žฌํ•˜์ง€๋งŒ, ๋‹จ์ˆœํ™”๋ฅผ ์œ„ํ•ด ์—ฌ๊ธฐ์„œ๋Š” ์ฒซ ๋ฒˆ์งธ ์–ดํ…์…˜ ํ—ค๋“œ์—์„œ ํด๋ž˜์Šค ํ† ํฐ์˜ ์–ดํ…์…˜ ํ๋ฆ„์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ฐ ์–ดํ…์…˜ ํ–‰๋ ฌ(Qโˆ—KTQ * K^T)์€ 197x197ํฌ๊ธฐ๋ฅผ ๊ฐ€์ง€๋ฉฐ, ์ด ์ค‘ ์ฒซ ๋ฒˆ์งธ ํ–‰์„ ํ™•์ธํ•˜๋ฉด ํด๋ž˜์Šค ํ† ํฐ์ด ์ด๋ฏธ์ง€ ๋‚ด ๋‹ค๋ฅธ ์œ„์น˜๋กœ๋ถ€ํ„ฐ ์ •๋ณด๋ฅผ ๋ฐ›๋Š” ์ •๋„๋ฅผ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฒซ ๋ฒˆ์งธ ๊ฐ’์„ ์ œ์™ธํ•˜๊ณ  ๋‚˜๋จธ์ง€ 196๊ฐœ์˜ ๊ฐ’์„ ๋ณด๋ฉด, ์ด๋Š” 14x14 ํฌ๊ธฐ์˜ ํŒจ์น˜๋“ค๋กœ ๊ตฌ์„ฑ๋˜์–ด ํด๋ž˜์Šค ํ† ํฐ์œผ๋กœ ์ •๋ณด๊ฐ€ ์–ด๋–ป๊ฒŒ ํ๋ฅด๋Š”์ง€ ์‹œ๊ฐ์ ์œผ๋กœ ํ‘œํ˜„ํ•ด์ค๋‹ˆ๋‹ค.

How the class token's attention activation changes across layers

์œ„ ์ด๋ฏธ์ง€๋Š” ํด๋ž˜์Šค ์–ดํ…์…˜ ํ™œ์„ฑํ™”๊ฐ€ ๋„คํŠธ์›Œํฌ์˜ ๋ ˆ์ด์–ด๋ฅผ ๊ฑฐ์น˜๋ฉด์„œ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™”ํ•˜๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

In the early layers the whole image appears blurry, but as the layers progress, the model gradually separates out the airplane more sharply. In particular, around the 7th layer the airplane is clearly distinguished from the background.

However, as the layers progress further, parts of the airplane repeatedly disappear and then reappear. This is possible because of the residual connections in the ViT.

์ž”์ฐจ ์—ฐ๊ฒฐ์˜ ์—ญํ• 

์ž”์ฐจ ์—ฐ๊ฒฐ ๋•๋ถ„์—, ํŠน์ • ๋ ˆ์ด์–ด์—์„œ ์ผ๋ถ€ ์ •๋ณด๊ฐ€ ์‚ฌ๋ผ์ ธ๋„ ์ด์ „ ๋ ˆ์ด์–ด์˜ ์ •๋ณด๋ฅผ ๋‹ค์‹œ ์ฐธ์กฐํ•˜์—ฌ ํ•„์š”ํ•œ ๋ถ€๋ถ„์„ ๋ณต์›ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์œ„ ์ด๋ฏธ์ง€์—์„œ์ฒ˜๋Ÿผ ์–ด๋–ค ๋ ˆ์ด์–ด์—์„œ ๋น„ํ–‰๊ธฐ์˜ ์ผ๋ถ€๊ฐ€ ์–ดํ…์…˜ ๋งต์—์„œ ์‚ฌ๋ผ์ง€๋”๋ผ๋„, ๋‹ค์Œ ๋ ˆ์ด์–ด์—์„œ ์ด์ „ ๋ ˆ์ด์–ด์˜ ์ž”์ฐจ ์ •๋ณด๋ฅผ ํ†ตํ•ด ๋‹ค์‹œ ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค.

์ž”์ฐจ ์—ฐ๊ฒฐ์€ ViT๊ฐ€ ์ค‘์š”ํ•œ ํŒจํ„ด์„ ์•ˆ์ •์ ์œผ๋กœ ์œ ์ง€ํ•˜๊ฒŒ ํ•ด์ฃผ๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋„คํŠธ์›Œํฌ๊ฐ€ ๊นŠ์–ด์งˆ์ˆ˜๋ก ์ค‘์š”ํ•œ ์ •๋ณด๊ฐ€ ์†์‹ค๋˜์ง€ ์•Š๊ณ  ์œ ์ง€๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์—ฌ, ์ตœ์ข…์ ์œผ๋กœ ๋ชจ๋ธ์˜ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋ฐ ๊ธฐ์—ฌํ•ฉ๋‹ˆ๋‹ค.

5. Attention Rollout


A way to visualize the attention flow of a transformer

The previous images showed what individual activations look like, but they did not show how attention flows through the transformer from start to end. A technique we can use to quantify this is Attention Rollout. It was introduced in < Quantifying Attention Flow in Transformers > by Samira Abnar and Willem Zuidema and is useful for quantifying and visualizing the attention flow of a transformer. The method is also mentioned in < An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale >, the paper famous for the ViT.

Attention Rollout ๊ธฐ๋ฒ•์˜ ์ž‘๋™ ๋ฐฉ์‹

From each block of the Transformer we obtain an attention matrix $A_{ij}$, whose entries indicate how much attention flows from token $j$ in the previous layer to token $i$ in the next layer. By multiplying the attention matrices of consecutive layers in sequence, we obtain the total attention flow accumulated across all layers from start to end.

Residual connections and the identity matrix

The Transformer architecture includes residual connections, so information from an intermediate layer is carried to the next layer rather than lost. To model this, we add the identity matrix $I$ to each layer's attention matrix $A$. That is, $A + I$ gives the attention matrix with the residual connection taken into account.

We have multiple attention heads. What do we do about them?

Handling multiple attention heads

A Transformer typically has multiple attention heads, and each head learns different information. The Attention Rollout technique recommends averaging over all heads to compute a single attention matrix, but depending on the case, you can also take the minimum or maximum, or weight the heads differently.

Attention Rollout์˜ ๊ณ„์‚ฐ

Finally, the Attention Rollout matrix at layer $L$ is computed recursively as follows:

$\text{AttentionRollout}_L = (A_L + I) \cdot \text{AttentionRollout}_{L-1}$

We also need to normalize the rows so that the total attention flow stays 1. This lets us visualize how information is passed for each token through the transformer from start to end, and along which paths it flows.
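A minimal implementation sketch of this recursion, assuming `attentions` is a list of per-layer attention matrices of shape (tokens, tokens), already fused over heads (collecting them is shown in the earlier hook sketches):

```python
import torch

def attention_rollout(attentions):
    rollout = torch.eye(attentions[0].size(-1))
    for attn in attentions:
        attn = attn + torch.eye(attn.size(-1))        # add identity for the residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows to sum to 1
        rollout = attn @ rollout                      # accumulate flow across layers
    return rollout

# rollout[0, 1:] is then the accumulated attention from the CLS token to every patch,
# which can be reshaped to 14x14 for visualization.
```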

6. Modifications to get Attention Rollout working with Vision Transformers


Applying the Attention Rollout technique to the Data-efficient image Transformer (DeiT) did not produce results as good as those in the < An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale > paper.

In particular, the attention did not focus only on the interesting parts of the image and looked noisy overall.


Various attempts were made to improve this, and two important factors were found along the way.

The way we fuse the attention heads matters

When combining the multiple attention heads in Attention Rollout, the mean is typically used, but other approaches were tried as well.

For example, we looked at how the result changes when using the minimum instead.

  • Mean fusion: the image obtained by averaging the attention heads. The information from all heads is blended evenly, but noise remains.
  • Min fusion: the image obtained by taking the minimum over the attention heads; only the information common to all heads survives, which visibly reduces the noise.

The different attention heads attend to different parts, so taking the minimum helps reduce noise while keeping the regions of common interest. That said, combining max fusion with a method that discards low attention values gave the best results.

We can focus only on the top attentions, and discard the rest

Rather than using all attention values, removing the pixels with low attention values had a large impact on the results.

์•„๋ž˜ ์ด๋ฏธ์ง€๋Š” ์ ์ง„์ ์œผ๋กœ ์–ดํ…์…˜ ๊ฐ’์ด ๋‚ฎ์€ ํ”ฝ์…€์„ ์ œ๊ฑฐํ•ด๊ฐ€๋ฉฐ ์–ดํ…์…˜ ๋งต์ด ์–ด๋–ป๊ฒŒ ๋ณ€ํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋” ๋งŽ์€ ํ”ฝ์…€์„ ์ œ๊ฑฐํ• ์ˆ˜๋ก ์ด๋ฏธ์ง€ ๋‚ด ์ฃผ์š” ๊ฐ์ฒด๊ฐ€ ๋” ๋ช…ํ™•ํžˆ ๋“œ๋Ÿฌ๋‚˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Below are the final results reflecting these modifications.

  1. Vanilla Attention Rollout

๋…ธ์ด์ฆˆ๊ฐ€ ๋งŽ๊ณ  ์ค‘์š”ํ•œ ๊ฐ์ฒด๊ฐ€ ๋ช…ํ™•ํžˆ ๋“œ๋Ÿฌ๋‚˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

  1. ๋‚ฎ์€ ํ”ฝ์…€ ์ œ๊ฑฐ + ์ตœ๋Œ€๊ฐ’ ๊ฒฐํ•ฉ

๋‚ฎ์€ ์–ดํ…์…˜ ๊ฐ’์„ ๋ฒ„๋ฆฌ๊ณ  ์ตœ๋Œ€๊ฐ’ ๊ฒฐํ•ฉ์„ ์ ์šฉํ•œ ๊ฒฐ๊ณผ๋กœ, ์ด๋ฏธ์ง€ ๋‚ด ์ฃผ์š” ๊ฐ์ฒด๊ฐ€ ๋” ๋šœ๋ ทํ•˜๊ฒŒ ๊ฐ•์กฐ๋œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Optimizing the Attention Rollout technique thus shows that how the attention heads are fused and how unnecessary pixels are removed matter a great deal. With these changes, the regions a ViT attends to can be visualized much more accurately.

7. Gradient Attention Rollout for Class Specific Explainability


Transformer์˜ ์˜ˆ์ธก ๊ณผ์ •์—์„œ ํŠน์ • ํด๋ž˜์Šค๊ฐ€ ๋†’์€ ์ ์ˆ˜๋ฅผ ๋ฐ›๋Š” ์ด์œ ๋ฅผ ์ดํ•ดํ•˜๊ณ  ์‹ถ์„ ๋•Œ, ํด๋ž˜์Šค๋ณ„ ์„ค๋ช… ๊ฐ€๋Šฅ์„ฑ(Class Specific Explainability)์„ ํ†ตํ•ด ๊ทธ ๋‹ต์„ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, โ€œ์นดํ…Œ๊ณ ๋ฆฌ 42์—์„œ ๋†’์€ ์ถœ๋ ฅ ์ ์ˆ˜๋ฅผ ์–ป๋Š” ๋ฐ ๊ธฐ์—ฌํ•˜๋Š” ์ด๋ฏธ์ง€์˜ ๋ถ€๋ถ„์€ ์–ด๋””์ธ๊ฐ€?โ€œ์™€ ๊ฐ™์€ ์งˆ๋ฌธ์„ ๋˜์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

One way to implement this is Gradient Attention Rollout. This technique visualizes the positions each layer's attention heads attend to, but weights that interest using the gradient with respect to a specific class. It is currently applied to the post-softmax attention values, but it could be applied at other points as well.

As a formula, this is expressed as:

$A_{ij} \cdot \text{grad}_{ij}$
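A hedged sketch of how this weighting could be wired up, assuming the per-layer attention maps and their gradients with respect to the target class score (shape (heads, tokens, tokens) each) have already been collected via forward and backward hooks:

```python
import torch

def grad_rollout(attentions, gradients):
    rollout = torch.eye(attentions[0].size(-1))
    for attn, grad in zip(attentions, gradients):
        # A_ij * grad_ij, clamped to positive contributions, fused over heads.
        weighted = (attn * grad).clamp(min=0).mean(dim=0)
        weighted = weighted + torch.eye(weighted.size(-1))        # residual connection
        weighted = weighted / weighted.sum(dim=-1, keepdim=True)  # row-normalize
        rollout = weighted @ rollout
    return rollout

# Usage sketch: run model(input), call logits[0, target_class].backward(), and read the
# attention maps and their gradients from hooks on each block's attention module.
```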

Where does the Transformer see a Dog (category 243), and a Cat (category 282)?

Where does the Transformer see a Musket dog (category 161) and a Parrot (category 87)?

8. What Activation Maximization Tells us


Activation Maximization, a technique that finds the input image maximally activating a specific part of a neural network, lets us see more clearly what features the model has learned. It visualizes the inputs that make the network respond strongly at a particular location.

ViT์—์„œ๋Š” ์ด๋ฏธ์ง€๋ฅผ 14x14 ๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ ํŒจ์น˜๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค. ๊ฐ ํŒจ์น˜๋Š” 16x16 ํ”ฝ์…€๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์œผ๋ฉฐ, ๋ชจ๋ธ์ด ์ด๋ฏธ์ง€๋ฅผ ์ธ์‹ํ•˜๋Š” ๊ธฐ๋ณธ ๋‹จ์œ„๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ Activation Maximization ๊ธฐ๋ฒ•์„ ViT์— ์ ์šฉํ•˜๋ฉด, ์•„๋ž˜ ๊ทธ๋ฆผ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด ์—ฐ์†๋œ ์ด๋ฏธ์ง€๊ฐ€ ์•„๋‹Œ 14x14 ํฌ๊ธฐ์˜ ๊ฐœ๋ณ„ ํŒจ์น˜๋กœ ๋‚˜๋‰œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค.

Thanks to the position embeddings, this patch structure encourages adjacent patches to produce similar outputs. In the image below, adjacent patches share similar features, but subtle discontinuities are visible between them. This shows that even though the ViT learns the relationships between patches through position embeddings, the independent patch-wise structure has its limits.

Future work could explore applying a spatial continuity constraint that reduces these discontinuities between patches and maintains more natural connectivity. This could improve the transformer so that it learns a continuous flow between patches when processing images.
