[📖 Paper Review] ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION (2015)

Becky's Study Lab · November 26, 2023

์ง€๊ธˆ์€ ๋ชจ๋‘๊ฐ€ Adam optimizer์„ ๋‹น์—ฐํ•˜๊ฒŒ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ๋‹ค. ์™œ ๊ทธ๋Ÿฐ์ง€ Adam์˜ ์›๋ฆฌ๋ฅผ ๋ณด๋ฉด, ๊ทธ ์ด์œ ๋ฅผ ์•Œ ์ˆ˜ ์žˆ์ง€ ์•Š์„๊นŒ ํ•˜์—ฌ ์ด๋ ‡๊ฒŒ ์ •๋ฆฌํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค
๋ฆฌ๋ทฐํ•˜๋Š” "ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION (2015)" ๋…ผ๋ฌธ์€ ICLR 2015 conference paper์ด๊ณ , ๋งˆ์ง€๋ง‰ ์ฑ•ํ„ฐ 9. ACKNOWLEDGMENTS ์—์„œ ๋งํ•˜๋“ฏ์ด Google Deepmind์˜ ์ง€์›ํ•˜์— ์—ฐ๊ตฌ๋œ ๊ฒฐ๊ณผ๋ฅผ ๋ฐœํ‘œํ•œ ๋…ผ๋ฌธ์ด๋‹ค.

0. ABSTRACT

🟡 Summing up Adam in one sentence?

An algorithm that optimizes stochastic objective functions,
based on adaptive estimates of lower-order moments,
using only first-order gradients (i.e. only first derivatives)!

lower-order moments

: low-order moments of the gradient
More precisely, two moments are used:
Adam's first moment comes from the Momentum algorithm, and its second moment from the AdaGrad/RMSProp family of algorithms.

first-order gradient

: the first-order derivative, i.e. the plain gradient

🔖 Additional explanation from other references

  • SGD, AdaGrad, RMSProp, and Adam are all first-order optimization methods
  • Only the first derivatives with respect to the weights are used in the optimization
  • However, because the update follows only the first-order (linear) direction, how much the gradient direction can be corrected is limited;
    higher-order optimization methods were introduced to compensate for this
  • But higher-order optimization requires inverting a (Hessian) matrix to compute its update, and the time complexity explodes (e.g. when the weights have millions of dimensions); a rough cost estimate is sketched below
  • For this reason, first-order optimization is still what is used in practice
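For a rough sense of why higher-order methods get expensive, consider the classic second-order (Newton) update; this is a generic illustration, not something specific to this paper:

$$\theta \leftarrow \theta - H^{-1}\nabla_\theta L, \qquad H = \nabla_\theta^2 L \in \mathbb{R}^{d\times d}$$

With $d$ parameters, just storing $H$ takes $O(d^2)$ memory and solving the linear system costs roughly $O(d^3)$, which is hopeless when $d$ is in the millions.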

stochastic objective functions

: the loss function
(a function whose value changes on every step because training samples are chosen randomly, as in mini-batch training. E.g. MSE, cross entropy...)

🟢 Adam's advantages?

  • ๊ตฌํ˜„์ด ๊ฐ„๋‹จ
  • ํšจ์œจ์ ์ธ ๊ณ„์‚ฐ
  • ์ ์€ ๋ฉ”๋ชจ๋ฆฌ ํ•„์š”
  • gradient์˜ diagonal rescaling์— invariant

์œ„๊ณผ ๊ฐ™์€ ์ƒํ™ฉ์—์„œ gradient ๊ฐ’์ด ์ปค์ง€๊ฑฐ๋‚˜ ์ž‘์•„์ ธ๋„,

์ด๋Ÿฐ ๊ณ„์‚ฐ ๊ณผ์ •์„ ๊ฑฐ์ณ์„œ, ๊ฒฐ๋ก ์ ์œผ๋กœ๋Š” ์œ„์™€ ๊ฐ™์ด Digonal ๊ณฑ์…ˆ์œผ๋กœ graident๋ฅผ ๋ณ€ํ•˜๊ฒŒ ํ•ด๋„, optimze ๊ณผ์ •์—์„œ ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.
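A minimal sketch of why this holds (with $\epsilon = 0$), using the moment definitions above: if each gradient coordinate is rescaled by a fixed positive factor $c$ (a positive diagonal matrix $\mathrm{diag}(c)$), the factor cancels in the update.

$$g_t \to c \odot g_t \;\;\Longrightarrow\;\; \hat{m}_t \to c \odot \hat{m}_t,\quad \hat{v}_t \to c^2 \odot \hat{v}_t \;\;\Longrightarrow\;\; \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}\ \text{is unchanged}$$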

  • ํฐ ์‚ฌ์ด์ฆˆ์˜ ๋ฐ์ดํ„ฐ์™€ ํŒŒ๋ผ๋ฏธํ„ฐ์ธ ์ƒํ™ฉ์—์„œ๋„ ์ ํ•ฉํ•จ
  • ๋…ธ์ด์ฆˆ๊ฐ€ ์‹ฌํ•˜๊ฑฐ๋‚˜ sparse gradient ์ƒํ™ฉ์—์„œ๋„ ์ ํ•ฉํ•ฉ
  • ์ง๊ด€์ ์ธ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ ๋‹นํ•œ ํŠœ๋‹๋งŒ์ด ํ•„์š”ํ•จ

🔵 Compared with other optimizers?

๊ธฐ์กด์˜ optimizer(AdaGrad, RMSProp, AdaDelta,,)๋“ค๊ณผ ๋น„๊ตํ–ˆ์„๋•Œ๋„ ๋งค์ฃผ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค.

⚪ (Extra) AdaMax, a variant of Adam

์ถ”๋ผ๊ณ  Adam์˜ ๋ณ€์ข…์ธ AdaMax์— ๋Œ€ํ•œ ๋‚ด์šฉ๋„ ํ›„๋ฐ˜ ์ฑ•ํ„ฐ์—์„œ ๋งํ•œ๋‹ค.

1. INTRODUCTION

🔴 Objective functions (loss functions) are stochastic!

  • ๋งŽ์€ object function์€ ๊ฐ๊ฐ์˜ mini-batch๋กœ ๋ถ„ํ• ๋œ Training Data์—์„œ ๊ณ„์‚ฐ๋œ subfunctions์˜ ํ•ฉ์œผ๋กœ ๊ตฌ์„ฑ

  • In that case the gradient steps (the direction and size of each update) can be taken per mini-batch, which makes training efficient -> SGD is the canonical example (with SGD the gradient direction and size can even be considered per individual sample)

  • However, when noise enters the objective function, optimization performance suffers
    ▶ a representative source of such noise is dropout regularization
    ▶ if the objective function is noisy, a more efficient stochastic optimization method is needed

- ๋ณธ๋ฌธ์—์„œ๋Š” higher-order optimization method ๋ณด๋‹ค๋Š” first-order methods๋กœ ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ•จ

🟡 Adam, an alternative to plain SGD

  • Adam์€ first-order gradient๋ฅผ ํ•„์š”๋กœํ•˜๊ธฐ์—, ๋ฉ”๋ชจ๋ฆฌ ํ•„์š”๋Ÿ‰์ด ์ ์Œ
  • Adam์€ gradient์˜ ์ฒซ๋ฒˆ์งธ์™€ ๋‘๋ฒˆ์งธ์˜ moment estimate๋กœ ๋‹ค๋ฅธ ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋Œ€ํ•œ ๊ฐœ๋ณ„์ ์ธ learing rate(ํ•™์Šต๋ฅ )์„ ๊ณ„์‚ฐ
    - ์ฒซ๋ฒˆ์งธ moment์˜ ์ถ”์ฒญ์ง€ : momentum optimizer
    - ๋‘๋ฒˆ์งธ moment์˜ ์ถ”์ •์น˜ : AdaGrad / RMSProp optimizer
  • Adam์€ AdaGrad์˜ ์žฅ์ (sparse gradient์—์„œ ์ž˜ ์ž‘๋™)๊ณผ PMSProp์˜ ์žฅ์ (์˜จ๋ผ์ธ๊ณผ ๊ณ ์ •๋˜์ง€ ์•Š์€ ์„ธํŒ…์— ์ž˜ ์ž‘๋™)์„ ๊ฒฐํ•ฉ
  • ๋งค๊ฐœ๋ณ€์ˆ˜ ์—…๋ฐ์ดํŠธ์˜ ํฌ๊ธฐ๊ฐ€ ๊ทธ๋ผ๋””์–ธํŠธ์˜ ํฌ๊ธฐ ์กฐ์ •์— ๋ณ€ํ•˜์ง€ ์•Š์Œ
  • step size๊ฐ€ ๋Œ€๋žต step size hyper parameter์— ์˜ํ•ด ์ œํ•œ๋˜๋ฉฐ, ๊ณ ์ • ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ํ•„์š”ํ•˜์ง€ ์•Š์Œ
  • sparse gradient์— ์ž‘๋™
  • ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ step size annealing(step size์— ๋”ฐ๋ผ ๊ฐ’์„ ํ’€์–ด์ฃผ๋Š”๊ฒƒ??) ์ˆ˜ํ–‰

์œ„์˜ ๊ทธ๋ฆผ์ด ๋ฐ”๋กœ ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•˜๋Š” Adam Optimizer์˜ pseudo ์ฝ”๋“œ๋‹ค.
์‹ค์ œ๋กœ moment 2๊ฐœ๋กœ ์ด๋ค„์กŒ์œผ๋ฉด์„œ, bias-corrected value๋ฅผ ์ตœ์ข… gradient ์‚ฐ์ถœ์— ์‚ฌ์šฉํ•œ๋‹ค๋Š” ์ ๋„ ๊ผญ ๋ณด๊ณ  ๋„˜์–ด๊ฐ€์•ผํ•œ๋‹ค.

5. RELATED WORK

๋…ผ๋ฌธ์—์„œ๋Š” RMSProp, AdaGrad์˜ ์žฅ์ ์„ ์ž˜ ์œตํ•ฉํ•ด์„œ Adam์„ ๋งŒ๋“ค์—ˆ๋‹ค๊ณ  ํ•˜์˜€๊ณ , ์‹ค์ œ๋กœ RMSProp๊ณผ AdaGrad์— ๋Œ€ํ•ด์„œ related work ์— ์ •๋ฆฌํ•ด๋‘์—ˆ๋‹ค

๐Ÿท๏ธ AdaGrad

  • gradient์˜ ์—…๋ฐ์ดํŠธ ํšŸ์ˆ˜์— ๋”ฐ๋ผ ํ•™์Šต๋ฅ (Learning rate)๋ฅผ ์กฐ์ ˆํ•˜๋Š” ์˜ต์…˜์ด ์ถ”๊ฐ€๋œ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•
  • โจ€ : ํ–‰๋ ฌ์˜ ์›์†Œ๋ณ„ ๊ณฑ์…ˆ
  • h๋Š” ๊ธฐ์กด ๊ธฐ์šธ๊ธฐ ๊ฐ’์„ ์ œ๊ณฑํ•˜์—ฌ ๊ณ„์† ๋”ํ•ด์ฃผ๊ณ , parameter W๋ฅผ ๊ฐฑ์‹ ํ•  ๋•Œ๋Š” 1/โˆšh ๋ฅผ ๊ณฑํ•ด์„œ ํ•™์Šต๋ฅ ์„ ์กฐ์ •ํ•จ
  • parameter ์ค‘์—์„œ ํฌ๊ฒŒ ๊ฐฑ์‹ ๋œ(gradient๊ฐ€ ํฐ) parameter๋Š”, ์ฆ‰ โˆฃโˆฃโˆ‚L/โˆ‚Wโˆฃโˆฃ||โˆ‚L/โˆ‚W || ๊ฐ€ ํฐ parameter๋Š” h ๊ฐ’์ด ํฌ๊ฒŒ ์ฆ๊ฐ€ํ•˜๊ณ  ฮท1hฮท\frac{1}{\sqrt{h}}๊ฐ€ ๊ฐ์†Œํ•˜๋ฉด์„œ ํ•™์Šต๋ฅ ์ด ๊ฐ์†Œํ•œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค
  • ์ฆ‰ gradient๊ฐ€ ํฌ๋ฉด ์˜คํžˆ๋ ค learning rate(step size)๋Š” ์ž‘์•„์ง€๊ณ , gradient๊ฐ€ ์ž‘์œผ๋ฉด learning rate(step size)๊ฐ€ ์ปค์ง„๋‹ค
  • AdaGrad๋Š” ๊ฐœ๋ณ„ ๋งค๊ฐœ๋ณ€์ˆ˜์— ๋งž์ถคํ˜• ๊ฐ’์„ ๋งŒ๋“ค์–ด์ค€๋‹ค
  • ๊ฐ™์€ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ย ์—ฌ๋Ÿฌ๋ฒˆ ํ•™์Šต๋˜๋Š” ํ•™์Šต๋ชจ๋ธ์— ์œ ์šฉํ•˜๊ฒŒ ์“ฐ์ด๋Š”๋ฐ ๋Œ€ํ‘œ์ ์œผ๋กœ ์–ธ์–ด์™€ ๊ด€๋ จ๋œ word2vec์ด๋‚˜ GloVe์— ์œ ์šฉํ•˜๋‹ค. ์ด๋Š” ํ•™์Šต ๋‹จ์–ด์˜ ๋“ฑ์žฅ ํ™•๋ฅ ์— ๋”ฐ๋ผ ๋ณ€์ˆ˜์˜ ์‚ฌ์šฉ ๋น„์œจ์ด ํ™•์—ฐํ•˜๊ฒŒ ์ฐจ์ด๋‚˜๊ธฐ ๋•Œ๋ฌธ์— ๋งŽ์ด ๋“ฑ์žฅํ•œ ๋‹จ์–ด๋Š” ๊ฐ€์ค‘์น˜๋ฅผ ์ ๊ฒŒ ์ˆ˜์ •ํ•˜๊ณ  ์ ๊ฒŒ ๋“ฑ์žฅํ•œ ๋‹จ์–ด๋Š” ๋งŽ์ด ์ˆ˜์ •ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ

AdaGrad implementation in Python

import numpy as np

class AdaGrad:

    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None
        
    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
            
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
            # add a small 1e-7 so we never divide by zero; this constant can be chosen freely

๐Ÿท๏ธ RMSProp

  • Because AdaGrad keeps adding up the squared past gradients, the update strength weakens as training goes on; if training ran forever, at some point the updates would become 0 and nothing would be updated at all
  • RMSProp is the technique that fixes this problem!

  • Instead of weighting all past gradients equally, RMSProp gradually forgets gradients from the distant past and gives recent gradient information more weight -> it uses a decaying factor (decay rate)
  • This is implemented with an exponential moving average (EMA), which shrinks the contribution of old gradients geometrically

RMSProp implementation in Python

import numpy as np

class RMSprop:

    def __init__(self, lr=0.01, decay_rate = 0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None
        
    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
            
        for key in params.keys():
            self.h[key] *= self.decay_rate
            self.h[key] += (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)

๐Ÿท๏ธ Momentum

  • W : the weight parameters to update
  • L : the loss function
  • η : the learning rate
  • ∂L/∂W : the gradient of the loss function with respect to W
  • the variable v??
    : in physics, momentum is p = mv, with mass m and velocity v
    => here too, v plays the role of a velocity
    => the term αv, with hyperparameter α, acts like friction: it gradually slows things down even when no force is applied
    => even when the gradient changes direction, the previous direction and magnitude still influence the update, so the weights can keep moving in a different direction

Momentum implementation in Python

import numpy as np

class Momentum:

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr #ฮท
        self.momentum = momentum #ฮฑ
        self.v = None
        
    def update(self, params, grads):
        # on the first call to update(), v is stored as a dictionary with the same structure as params
        if self.v is None:
            self.v = {}
            for key, val in params.items():                                
                self.v[key] = np.zeros_like(val)
                
        for key in params.keys():
            self.v[key] = self.momentum*self.v[key] - self.lr*grads[key] 
            params[key] += self.v[key]

2. ALGORITHM

โ—ป๏ธ Adam Pseudo code

์œ„์˜ ์ˆ˜์‹์— ๋Œ€ํ•ด์„œ ์กฐ๊ธˆ์”ฉ ๋œฏ์–ด์„œ ๋ณด๊ธฐ๋กœ ํ•œ๋‹ค.
ํ•จ์ˆ˜ ์† ์ธ์ž์— ๋Œ€ํ•œ ์„ค๋ช…์€ ์œ„์˜ ์ •์˜๋ฅผ ์ฐธ๊ณ ํ•œ๋‹ค.
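For reference, Algorithm 1 of the paper is, in essence, the following (good default settings suggested in the paper: α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8):

    Require: step size α, exponential decay rates β1, β2 ∈ [0, 1), stochastic objective f(θ), initial parameter vector θ0
    m0 <- 0, v0 <- 0, t <- 0                     # initialize 1st/2nd moments and time step
    while θt has not converged:
        t <- t + 1
        gt <- ∇θ ft(θ(t-1))                      # gradient of the stochastic objective
        mt <- β1 · m(t-1) + (1 - β1) · gt        # biased first moment estimate
        vt <- β2 · v(t-1) + (1 - β2) · gt²       # biased second raw moment estimate
        m̂t <- mt / (1 - β1^t)                    # bias-corrected first moment
        v̂t <- vt / (1 - β2^t)                    # bias-corrected second moment
        θt <- θ(t-1) - α · m̂t / (√v̂t + ε)       # parameter update
    return θt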

์‚ฌ์‹ค์ƒ hyper parameter๋ผ ํ•  ๊ฒƒ๋“ค์ด,
ฮฑ\alpha(Step size), ฮฒ1\beta_1(์ฒซ๋ฒˆ์งธ moment ์ถ”์ •์น˜๋ฅผ ์–ผ๋งˆ๋‚˜ ๋ฒ„๋ฆด์ง€), ฮฒ2\beta_2(๋‘๋ฒˆ์งธ moment ์ถ”์ •์น˜๋ฅผ ์–ผ๋งˆ๋‚˜ ๋ฒ„๋ฆด์ง€) ๊ฒฐ์ •ํ•˜๋Š” ๊ฒƒ๋“ค ๋ฐ–์— ์—†๋‹ค. (ฯต\epsilon๋„ ์žˆ์Œ)

adam์€ ํฌ๊ฒŒ moment๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ถ€๋ถ„, moment๋ฅผ bias๋กœ ์กฐ์ •ํ•˜๋Š” ๋ถ€๋ถ„์œผ๋กœ ๊ตฌ์„ฑ๋จ
moment๋Š” Momentum์ด ์ ์šฉ๋œ first moment์™€ AdaGrad, RMSProp์ด ์ ์šฉ๋œ second moment๊ฐ€ ์žˆ์Œ

โ—ป๏ธ ์ดˆ๊ธฐ์— ์„ค์ •ํ•ด์•ผํ•˜๋Š” 4๊ฐ€์ง€ ํŒŒ๋ผ๋ฏธํ„ฐ(Require)

  1. Step size $\alpha$ (the learning rate)
  2. Decay rates $\beta_1, \beta_2$
    : exponential decay rates for the moment estimates (values in [0, 1))
    -> together with $\alpha$ and $\epsilon$ these are Adam's hyperparameters; they control how quickly the gradient moment estimates decay
  3. Stochastic objective function f(θ) - the loss function
    -> (given θ (the parameters, i.e. the weights), minimizing f(θ) is Adam's goal)
  4. Initial parameter vector $\theta_0$

โ—ป๏ธ Adam optimizer ์ˆ˜ํ–‰ ๊ณผ์ •

(1) Initialize the first and second moments and the time step

(2) ํŒŒ๋ผ๋ฏธํ„ฐ ฮธ_t๊ฐ€ ์ˆ˜๋ ด(converge)ํ•  ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณต

(loop step 1) increment the time step: t <- t + 1

  • all the quantities below are updated again on every iteration.

(loop step 2) compute the gradient of the stochastic objective function at the previous time step's parameters

  • from the stochastic objective function (the loss function), compute the gradient with respect to each weight, $g_t = \nabla_\theta f_t(\theta_{t-1})$ (a first-order gradient)

(loop step 3) compute the biased first and second moment estimates

  • $m_t$ : corresponds to Momentum -> the first moment
  • $v_t$ : corresponds to AdaGrad / RMSProp -> the second moment
  • $\beta$ controls how heavily the most recent value is weighted (exponential decay)

(loop step 4) apply bias correction, to compensate for the moments being initialized at 0

  • $\hat{m}_t$ : $m_t$ divided by $(1-\beta_1^t)$ for bias correction
  • $\hat{v}_t$ : $v_t$ divided by $(1-\beta_2^t)$ for bias correction

(loop step 5) finally, update the parameters (weights)

  • $\epsilon$ is there to avoid division by zero
  • the second moment plays the AdaGrad/RMSProp role -> the learning rate differs depending on how much each parameter has been updated
  • the more a parameter has been updated, the larger its $v_t$ becomes -> overall, the effective step size shrinks for parameters that have already been updated a lot

How the step size $\alpha$ can be adjusted

  • The step size $\alpha$, normally kept fixed, can also be varied per iteration using $\beta_1$ and $\beta_2$ by folding the bias corrections into it, $\alpha_t = \alpha\,\sqrt{1-\beta_2^{\,t}}/(1-\beta_1^{\,t})$ (the paper notes this as a more efficient way to compute the same update)
  • Doing so can also pay off in practice

โ—ป๏ธ ADAMโ€™S UPDATE RULE

์œ„์˜ ์‹์ด Adam์˜ ํŒŒ๋ผ๋ฏธํ„ฐ Update ์‹์ด๋‹ค.

  • Adam์˜ ํ•ต์‹ฌ์€ step size(Learning Rate)๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์„ ํƒํ•˜๋Š” ๊ฒƒ!
  • ํšจ๊ณผ์ ์ธ step size์€ โ–ณtโ–ณt์˜ ๊ฐ’์„ ์ตœ์ ์œผ๋กœ ๋งŒ๋“ฌ
  • step size๋Š” 2๊ฐœ์˜ upper bounds(์ƒํ•œ์„ )์ด ์žˆ์Œ (if, ฯต\epsilon = 0 )

  • The first case applies under extreme sparsity (when a gradient has been zero at every time step except the current one)
    => in this situation the step size should be large so the update actually moves the parameter
    => this is the case $(1-\beta_1) > \sqrt{1-\beta_2}$, with bound $|\Delta_t| \le \alpha\,(1-\beta_1)/\sqrt{1-\beta_2}$
    => the effective step becomes larger

  • The second case is the usual one (typically $\beta_1 = 0.9$, $\beta_2 = 0.999$)
    => $(1-\beta_1) \le \sqrt{1-\beta_2}$, so $|\Delta_t| \le \alpha$
    => here the step size stays small, so the update changes the parameters only a little
    => the effective step is smaller

3. INITIALIZATION BIAS CORRECTION

์œ„์— ์ˆ˜์‹์—์„œ ์˜๋ฌธ์ด ๋“ค ์ˆ˜ ์žˆ๋‹ค.
๋„๋Œ€์ฒด ์™œ, ๊ฐ ๋ชจ๋ฉ˜ํŠธ๋ฅผ 1-ฮฒ\beta ๋กœ ๋‚˜๋ˆ„๋Š”๊ฐ€...?๐Ÿค”๐Ÿค”

  • $\hat{m}_t$ : $m_t$ divided by $(1-\beta_1^t)$ for bias correction
  • $\hat{v}_t$ : $v_t$ divided by $(1-\beta_2^t)$ for bias correction

์ผ๋‹จ ๋…ผ๋ฌธ์—์„œ๋Š” second moment vtv_t์— ๋Œ€ํ•ด์„œ vt^\hat{v_t}๋กœ ์ถ”์ •ํ•˜๋Š” ์ด์œ ๋ฅผ ์„ค๋ช…ํ–ˆ๋‹ค.
์ด ๊ณผ์ •์„ ์ง„ํ–‰ํ•˜๋‚˜ ์˜๋ฌธ์ด ๋“ค์ˆ˜ ์žˆ๋Š”๋ฐ ๊ทธ ์ด์œ ๊ฐ€ ์—ฌ๊ธฐ ์žˆ๋‹ค.

์ง์ ‘ ์ˆ˜์‹์œผ๋กœ ํ’€์–ด์„œ ์ •๋ฆฌํ•ด๋ณด์ž๋ฉด,

๊ทธ๋ฆฌ๊ณ 
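A sketch of that calculation, following the paper:

$$v_t = (1-\beta_2)\sum_{i=1}^{t}\beta_2^{\,t-i}\, g_i^2$$
$$\mathbb{E}[v_t] = \mathbb{E}[g_t^2]\,(1-\beta_2^{\,t}) + \zeta$$

where $\zeta$ collects the error that comes from $\mathbb{E}[g_i^2]$ changing over the time steps (it is small when the second moment is roughly stationary, or when old gradients are down-weighted enough by $\beta_2$).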

If $\zeta = 0$ (or negligible), what is left is exactly the true second-moment expectation $\mathbb{E}[g_t^2]$ that we actually want;
so to make $\mathbb{E}[v_t]$ approximate $\mathbb{E}[g_t^2]$, we divide by $(1-\beta_2^t)$.
For sparse gradients, a reliable second-moment estimate needs to average over many gradients, i.e. $\beta_2$ close to 1, and that is exactly the regime where the bias from the zero initialization is largest, which is why the correction matters.

4. CONVERGENCE ANALYSIS

=> The bottom line: it converges (the paper proves a regret bound that grows only as $O(\sqrt{T})$, so the average regret vanishes).

6. EXPERIMENTS

6.1 EXPERIMENT: LOGISTIC REGRESSION

[ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ์…‹ 1 : MNIST ๋ฐ์ดํ„ฐ์…‹]

  • multi-class logistic regression (with L2 regularization)
  • the step size α is annealed as 1/√t -> it decreases over time
  • logistic regression classifies the digit class from the 784-dimensional image vectors
  • Adam, SGD (with Nesterov momentum) and AdaGrad are compared with mini-batches of size 128 (left graph of Figure 1)
  • Figure 1 shows Adam converging about as fast as SGD with momentum, and faster than AdaGrad

[ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ์…‹ 2 : IMDB ๋ฐ์ดํ„ฐ์…‹]

  • to test a sparse-feature problem, Adam is also evaluated on the IMDB movie review dataset
  • sparse feature problem : data in which most feature values rarely occur (many distinct, infrequent values)
  • the reviews are preprocessed into BoW (Bag-of-Words) feature vectors (10,000 dimensions)
  • BoW feature vector : a sentence represented as a vector of numbers (word counts)
  • 50% dropout noise is applied to the BoW features to prevent overfitting
  • in the right graph of Figure 1 the loss fluctuates quite a bit; that is the dropout noise
  • in this sparse setting Adam, RMSProp and AdaGrad all perform well (right graph of Figure 1);
    Adam handles sparse features well and converges faster than SGD

  • x-axis : number of training iterations, y-axis : training loss

6.2 EXPERIMENT: MULTI-LAYER NEURAL NETWORKS

  • MNIST dataset
  • multi-layer model : two fully connected hidden layers with 1000 hidden units each (ReLU activations, mini-batch size 128)
  • objective function : cross-entropy loss (with L2 weight decay)
  • x-axis : number of training iterations, y-axis : training loss
  • comparison of the optimizers with dropout regularization applied -> Adam converges fastest

6.3 EXPERIMENT: CONVOLUTIONAL NEURAL NETWORKS

[ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ์…‹ 3 : CIFAR-10 ๋ฐ์ดํ„ฐ์…‹]

  • convolutional neural network : C64-C64-C128-1000
  • C64 : a convolution stage with 64 output channels (5×5 filters followed by 3×3 max pooling in the paper)
  • 1000 : a dense layer with 1000 outputs

  • left graph : convergence speed of each optimizer over the first 3 epochs
  • right graph : convergence speed of each optimizer over 45 epochs
  • x-axis : number of training iterations, y-axis : training loss
  • among the optimizers run without dropout, Adam converges fastest
  • among the optimizers run with dropout, Adam also converges fastest

6.4 EXPERIMENT: BIAS-CORRECTION TERM

  • training loss as a function of the step size α and the decay rates β1, β2
  • x-axis : log(step size), y-axis : loss
  • green curves : no bias correction terms -> corresponds to RMSProp
  • red curves : with the bias correction terms (the 1-β^t factors)
  • without the bias correction terms, training becomes unstable as β2 approaches 1.0
    => in short, Adam performed as well as or better than RMSProp regardless of the hyperparameter setting

7. EXTENSION

ADAMAX

Adam์—์„œ ๊ฐœ๋ณ„ weights๋ฅผ ์—…๋ฐ์ดํŠธ ํ•˜๊ธฐ ์œ„ํ•ด ๊ณผ๊ฑฐ์™€ ํ˜„์žฌ์˜ gradient์˜ L2 norm์„ ์ทจํ•จ
์ด๋•Œ L2 norm์„ L_p norm์œผ๋กœ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ์Œ

=> p ๊ฐ’์ด ์ปค์งˆ์ˆ˜๋ก ์ˆ˜์น˜์ ์œผ๋กœ ๋ถˆ์•ˆ์ •ํ•ด์ง€๋‚˜, p๋ฅผ ๋ฌดํ•œ๋Œ€๋ผ๊ณ  ๊ฐ€์ •(infinity norm)ํ•˜๋ฉด ๊ฐ„๋‹จํ•˜๊ณ  stableํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋จ

TEMPORAL AVERAGING

๋งˆ์ง€๋ง‰ ๋ฐ˜๋ณต์€ ํ™•๋ฅ ์  ๊ทผ์‚ฌ๋กœ ์ธํ•ด ์žก์Œ์ด ๋งŽ๊ธฐ ๋•Œ๋ฌธ์— ํŒŒ๋ผ๋ฏธํ„ฐ ํ‰๊ท ํ™”๋ฅผ ํ†ตํ•ด ๋” ๋‚˜์€ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค.

8. CONCLUSION

  • Adam is introduced as a simple and efficient optimization algorithm for stochastic objective functions
  • the focus is on machine learning problems with large datasets and/or high-dimensional parameter spaces
  • Adam combines the way AdaGrad handles sparse gradients (per-parameter step sizes) with the way RMSProp handles non-stationary objectives (weighting past gradients less than recent ones)
  • Adam also optimizes well on non-convex problems (problems with multiple local minima)

🔖 References
Other review posts on this paper
AdaGrad and RMSProp theory
Optimizer overview (with code)
Video review of the Adam paper
Summary of optimizer equations
