[📖 Paper Review] ON LARGE-BATCH TRAINING FOR DEEP LEARNING : GENERALIZATION GAP AND SHARP MINIMA (2017)

Becky's Study Lab · November 23, 2023

I read and worked through this important paper, an ICLR 2017 conference paper with more than 3,000 citations as of November 2023.
I still don't really know how to do a paper review well, so for now I'm treating it as a way to organize my thoughts while reading. The goal is to keep going steadily without burning out during boot camp. With that goal in mind, I also started a paper review study group at the boot camp and we are doing this together.

0. Abstract

SGD (Stochastic Gradient Descent) is widely used as a deep learning training algorithm.
⇒ With a large batch size, model quality degrades (as measured by generalization ability)
⇒ The reason is that large-batch training converges to sharp minimizers of the training and testing functions, and sharp minimizers hurt generalization
⇒ Small-batch training, by contrast, converges to flat minimizers, which is attributed to the inherent noise in the gradient estimate
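To make the "inherent noise" point concrete, here is a minimal sketch that is not from the paper. It uses a toy one-dimensional least-squares loss (my assumption) and measures how much the mini-batch gradient estimate fluctuates for a small and a large batch. The function name `minibatch_grad_std` and the data distribution are hypothetical.

```python
import numpy as np

def minibatch_grad_std(batch_size, n_samples=10_000, n_trials=500, seed=0):
    """Std. dev. of the mini-batch gradient of f(w) = mean(0.5 * (w - x_i)^2) at w = 0."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=1.0, scale=2.0, size=n_samples)  # toy data (assumption)
    w = 0.0
    grads = np.empty(n_trials)
    for t in range(n_trials):
        batch = rng.choice(x, size=batch_size, replace=False)
        # Per-sample gradient of 0.5 * (w - x_i)^2 is (w - x_i); average over the batch.
        grads[t] = np.mean(w - batch)
    return grads.std()

small = minibatch_grad_std(batch_size=16)
large = minibatch_grad_std(batch_size=1024)
print(f"std of gradient estimate, batch size 16:   {small:.4f}")
print(f"std of gradient estimate, batch size 1024: {large:.4f}")
```

The small-batch estimate is far noisier, and that noise is what the paper credits with pushing SB iterates out of sharp basins.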

1. Introduction

Gradient update equation


The paper introduces this equation as the update for the batch gradient method,
but read more precisely it expresses MGD (Mini-Batch Gradient Descent)
(because the loss gradient is divided by the batch size).
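For reference, the mini-batch update under discussion can be written as below. This is my transcription of the update rule as I recall it from the paper, so treat the notation as an assumption: $B_k$ is the batch sampled at step $k$, $\alpha_k$ the step size, and $f_i$ the loss on sample $i$.

```latex
x_{k+1} = x_k - \alpha_k \left( \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(x_k) \right)
```

The factor $1/|B_k|$ is the division by the batch size mentioned above.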

❗Finding❗ With a large batch size, generalization performance drops by as much as 5% compared with a small batch size

So, the paper says it
1) presents numerical evidence for why large-batch methods generalize worse, and
2) explores strategies for overcoming the drawback of large-batch training.

🏷️ Notation

  • Small batch size is called SB (Small Batch), and large batch size is called LB (Large Batch).
  • The optimizer is ADAM in every experiment.

2. DRAWBACKS OF LARGE-BATCH METHODS

Why does a generalization gap exist for LB? The conjectured reasons:

1) LB methods over-fit the model
2) LB methods are attracted to saddle points
3) LB methods lack the explorative properties of SB methods and tend to zoom in on the minimizer closest to the initial point
4) SB and LB methods converge to qualitatively different minimizers with differing generalization properties

The paper attributes the lack of generalization to LB converging to sharp minimizers while SB converges to flat minimizers.
In the sharp-minimizer and flat-minimizer graph below, the y-axis is the loss function value. In terms of the Hessian $\nabla^2 f(x)$, a sharp minimizer has many large positive eigenvalues, while a flat minimizer has many small ones. In other words, the "flat" minimum of the loss function is what SB (small batch size) training arrives at.

  • For reference, the concepts of sharp and flat minimizers drew attention in the earlier paper "Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997."

⇒ At a sharp minimizer the function increases rapidly even for small changes in x, while at a flat minimizer it varies slowly over a relatively large neighborhood of x. A flat minimizer can be described with low precision, whereas a sharp minimizer requires high precision. This high sensitivity of a sharp minimizer hurts the model's ability to generalize to new data

📊 Experiments : run on popular models and datasets

- In all experiments, the large-batch runs used 10% of the training data as the batch size, and the small-batch runs used 256 data points.

  • The ADAM optimizer was used for both LB and SB.

⇒ On every network, both approaches reached high training accuracy, but there is a significant difference in generalization performance.
(We emphasize that the generalization gap is not due to over-fitting or over-training as commonly observed in statistics)

✨ [Metric definition] SHARPNESS OF MINIMA

SHARPNESS (roughly, sensitivity)

The paper defines a new metric called sharpness, as above.
I don't fully understand the formula, but it gauges sensitivity by measuring the maximum of the loss function within a suitable box ($C_\epsilon$) around the solution.
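For reference, the metric can be written as below. This is my transcription of the definition as I recall it from the paper, so the exact notation should be treated as an assumption: $A$ is an $n \times p$ matrix whose columns span the subspace being explored (the identity when the whole space is used) and $A^+$ is its pseudo-inverse.

```latex
C_\epsilon = \left\{ z \in \mathbb{R}^p : -\epsilon \left( |(A^+ x)_i| + 1 \right) \le z_i \le \epsilon \left( |(A^+ x)_i| + 1 \right), \ \forall i \in \{1, \dots, p\} \right\}

\phi_{x,f}(\epsilon, A) := \frac{\left( \max_{y \in C_\epsilon} f(x + Ay) \right) - f(x)}{1 + f(x)} \times 100
```

Read this way, sharpness is the largest relative increase in the loss that can be reached inside a small box around $x$, expressed as a percentage.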

์œ„์˜ sharpness ํ‘œ๋ฅผ ๋ณด๋ฉด, ฯต\epsilon ์— ๋”ฐ๋ผ ๊ตฌํ•œ ๋ชจ๋ธ ๋ณ„ SB, LB์— ๋”ฐ๋ฅธ ๋ฏผ๊ฐ๋„๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

- SB (์ž‘์€ ๋ฐฐ์น˜) โ†’ ๋” ๋‚ฎ์€ sharpness๊ฐ’
- LB (ํฐ ๋ฐฐ์น˜) โ†’ ํฐ sharpness๊ฐ’

์‹คํ—˜ ๊ฒฐ๊ณผ์—์„œ๋„ ์•Œ์ˆ˜ ์žˆ๋“ฏ์ด, ํฐ ๋ฐฐ์น˜ ๋ฐฉ๋ฒ•(LB)์œผ๋กœ ์–ป์€ ์†”๋ฃจ์…˜์ด ํ›ˆ๋ จ ํ•จ์ˆ˜์˜ ๋” ํฐ ๋ฏผ๊ฐ๋„ ์ง€์ ์„ ์ •์˜ํ•œ๋‹ค๋Š” ๊ด€์ ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
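A minimal sketch of how such a sharpness value could be estimated on a toy problem. Everything here is my assumption: it takes A as the identity, replaces the paper's inexact optimizer for the inner maximization with plain random search, and uses two hypothetical quadratic losses, `flat` and `sharp`, that share the same minimizer.

```python
import numpy as np

def sharpness(f, x, eps=1e-3, n_samples=5_000, seed=0):
    """Estimate sharpness with A = I by random search over the box C_eps."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    half_width = eps * (np.abs(x) + 1.0)  # box bound: |y_i| <= eps * (|x_i| + 1)
    ys = rng.uniform(-half_width, half_width, size=(n_samples, x.size))
    f_max = max(f(x + y) for y in ys)     # crude stand-in for the inner maximization
    f_x = f(x)
    return (f_max - f_x) / (1.0 + f_x) * 100.0

# Two toy losses with the same minimizer (the origin) but different curvature.
flat = lambda w: 0.5 * np.sum(w ** 2)
sharp = lambda w: 50.0 * np.sum(w ** 2)

x_star = np.zeros(2)
s_flat = sharpness(flat, x_star)
s_sharp = sharpness(sharp, x_star)
print(f"sharpness at flat minimum:  {s_flat:.6f}")
print(f"sharpness at sharp minimum: {s_sharp:.6f}")
```

Both losses are zero at the minimizer, yet the high-curvature one gets a much larger sharpness value, which matches the SB versus LB pattern in the table.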

🤔 So how can the generalization gap of LB methods be overcome?

: Approaches such as data augmentation, conservative training, and adversarial training help narrow the generalization gap, but they still lead to relatively sharp minimizers and do not fully solve the problem.

3. SUCCESS OF SMALL-BATCH METHODS

Judging by sharpness and accuracy, small-batch models generalize more successfully!

Looking at the sharpness and accuracy results below as a function of batch size confirms that the small-batch models were more successful.

📈 Warm-starting experiment!

The authors also tried a "warm-starting" experiment,
and the paper calls the result a piggybacked (or warm-started) large-batch solution.

Warm-start (piggybacked) procedure

Step.1 ) Train with ADAM at batch size 256 (the small batch size) for 100 epochs
Step.2 ) Keep the iterate after each epoch in memory => i.e., save the parameters at every epoch
Step.3 ) Starting from each of these 100 iterates, train with the LB (large batch size) method for 100 epochs
Step.4) Examine the resulting 100 piggybacked (or warm-started) large-batch solutions
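The steps above can be sketched on a toy problem. This is only an illustration under my own assumptions, not the paper's setup: it uses logistic regression on synthetic data instead of a deep network, plain mini-batch SGD instead of ADAM, and 10 epochs instead of 100. All function names are hypothetical.

```python
import numpy as np

def make_data(n, d=5, seed=0):
    """Synthetic binary classification data (assumption: w_true is all ones)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = (X @ np.ones(d) + 0.5 * rng.normal(size=n) > 0).astype(float)
    return X, y

def grad(w, X, y):
    """Gradient of the mean logistic loss."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

def run_epochs(w, X, y, batch_size, epochs, lr, rng):
    """Plain mini-batch SGD; returns the iterate saved after each epoch."""
    snapshots = []
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - lr * grad(w, X[idx], y[idx])
        snapshots.append(w.copy())
    return snapshots

def accuracy(w, X, y):
    return np.mean((X @ w > 0) == (y > 0.5))

rng = np.random.default_rng(1)
X_tr, y_tr = make_data(20_000, seed=0)
X_te, y_te = make_data(5_000, seed=2)
sb_epochs, lb_epochs = 10, 10     # the procedure above uses 100 and 100
sb_batch = 256
lb_batch = len(y_tr) // 10        # LB = 10% of the training data

# Steps 1-2: small-batch run, keeping the iterate after every epoch.
sb_snapshots = run_epochs(np.zeros(X_tr.shape[1]), X_tr, y_tr, sb_batch, sb_epochs, 0.5, rng)

# Step 3: continue from each saved iterate with the large batch.
piggybacked = [
    run_epochs(w0, X_tr, y_tr, lb_batch, lb_epochs, 0.5, rng)[-1]
    for w0 in sb_snapshots
]

# Step 4: compare test accuracy of each SB iterate and its piggybacked LB solution.
for epoch, (w_sb, w_lb) in enumerate(zip(sb_snapshots, piggybacked), start=1):
    print(epoch, accuracy(w_sb, X_te, y_te), accuracy(w_lb, X_te, y_te))
```

A convex toy problem like this has no sharp versus flat basins, so it only shows the bookkeeping of the procedure, not the effect reported in the paper.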

Warm-start results

Figure 5 below shows the test accuracy and sharpness of these large-batch solutions alongside the test accuracy of the small-batch iterates. When warm-started from only the first few epochs, the LB method brings no improvement in generalization, and the sharpness also stays high.
After a certain number of warm-start epochs, however, accuracy improves and the sharpness of the large-batch iterates drops. This apparently happens once the SB method has finished its exploration phase and found a flat minimizer. The LB method can then converge toward it, which improves test accuracy.

A piggybacked (or warm-started) large-batch solution may be a better approach!
(because the SB phase has already located a flat minimizer that the LB method can then converge to)

To see the qualitative difference between SB and LB models, sharpness is measured against the loss function (cross entropy) value

When the loss value is large, that is, near the initial point, the SB and LB methods give similar sharpness values.
As the loss decreases, however, the sharpness of the LB iterates increases rapidly, whereas for the SB method the sharpness stays relatively constant at first and then decreases, suggesting an exploration phase followed by convergence to a flat minimizer.

4. DISCUSSION AND CONCLUSION

In conclusion,
✅ Convergence to sharp minimizers is what keeps large-batch (LB) models from generalizing well.
✅ Data augmentation, conservative training, and robust optimization were tried as remedies for the LB generalization problem, but these strategies do not fully solve it. They do improve the generalization of LB models, yet they still lead to sharp minimizers.
✅ Dynamic sampling (gradually increasing the batch size) was also considered as a remedy. The warm-starting experiment was actually carried out, and it looks like a reasonably viable option for LB models.
✅ Prior work shows, under certain assumptions, that the loss function of deep learning models contains many local minimizers and that many of them correspond to similar loss values. In these experiments both sharp and flat minimizers have very similar loss values, so the results are consistent with that earlier observation.
✅ In short, a small batch size gives better generalization performance
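As a small illustration of the dynamic sampling idea mentioned above, here is a hypothetical batch-size schedule. The paper only raises the idea, so the function name `batch_size_schedule` and its defaults (start at 256, double every 10 epochs, cap at 5000) are entirely my assumptions.

```python
def batch_size_schedule(epoch, start=256, growth=2, every=10, cap=5000):
    """Hypothetical schedule: multiply the batch size by `growth` every `every` epochs, up to `cap`."""
    return min(cap, start * growth ** (epoch // every))

# Batch size used at a few selected epochs.
for epoch in (0, 10, 20, 40, 100):
    print(epoch, batch_size_schedule(epoch))
```

The intent is to keep the noisy, exploratory small-batch behavior early on and switch to large batches only once training is near a flat minimizer, as in the warm-starting experiment.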


That wraps up the summary 🥹
