📌 These are personal notes I took while following the University of Michigan's 'Deep Learning for Computer Vision' lectures. If you spot any errors or have feedback, please let me know and I will gladly reflect it.
(The content is almost identical to Stanford's cs231n, so that course is also a helpful reference.) 📌


1. Activation Functions : Sigmoid

๐Ÿ“ 3๊ฐ€์ง€ ๋ฌธ์ œ์ : saturated gradient, not zero-centered, ์—ฐ์‚ฐ๋น„์šฉ

  • Overall picture

    • Interpretation
      • axon from a neuron: the input coming in from the previous neuron
      • f in the cell body: the activation function, which makes things non-linear
        • without it, the whole stack collapses into a single linear layer, so it is essential

1) Introduction

  • Concept
    • the most classic activation function
    • a probabilistic interpretation: presence vs. absence
    • squashes values into the range 0–1
    • "Firing rate" of a neuron
      • after receiving signals from other incoming neurons, a neuron fires at some rate

      • that rate depends non-linearly on the total incoming signal

        ⇒ sigmoid: a model of this non-linear dependence of the firing rate

  • Three problems
    a. (The worst one) Saturated neurons kill the gradient (= makes the network hard to train)

    • when x is very negative
      • the local gradient (dσ/dx) goes to 0
      • the downstream gradient (dL/dx) also goes to 0 → the weight updates are nearly 0 as well (every gradient of the loss with respect to the weight matrix becomes tiny) → learning slows down
    • when x is near 0
      • the sigmoid is in its roughly linear regime, so gradients flow fine here
    • when x is very positive
      • the gradient goes to 0 → in a very deep network the lower layers receive no gradient training signal at all (it is just 0)

    b. Sigmoid outputs are not zero-centered

    • (single-element case) setup
      • if every input to a neuron is always positive, what happens to the gradient on W?
        • the local gradient (which is just the input x) is always positive

        • the upstream gradient is a single scalar, so all the gradients of the loss with respect to the W_i share its sign (all positive or all negative)

          ⇒ every component of the gradient on W ends up with the same sign

          (= the constraint that the steps are all-positive or all-negative makes it hard for gradient descent to reach certain weight values)

        • in more detail

          • suppose W starts at the origin and the loss-minimizing weights lie down and to the right: W1 needs a positive step and W2 a negative step, but if both must share a sign there is no single step aligned toward that quadrant
            • how gradient descent still gets there
              • a zig-zag pattern, alternating all-positive and all-negative steps

          ⇒ conclusion: because the outputs are not zero-centered, every update is pushed toward one side, which makes training unstable

    • (multi-element case) with minibatches
      • the not-zero-centered problem is softened
        • averaging the gradients over a minibatch, some components can come out positive and some negative

    c. The exponential is expensive to compute

    • the exponential takes many clock cycles, so it is costly
    • cf) comparing ReLU and sigmoid, sigmoid takes far longer (a small numeric sketch of the saturation issue follows below)
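
A minimal numpy sketch (my own illustration, not from the lecture) of how the sigmoid's local gradient collapses in the saturated regimes; the function names are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # local gradient dσ/dx; its maximum is 0.25 at x = 0

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    # downstream gradient = upstream gradient * local gradient,
    # so for |x| large the local gradient ≈ 0 kills whatever arrives from above.
    print(f"x={x:+5.1f}  sigmoid={sigmoid(x):.4f}  local grad={sigmoid_grad(x):.6f}")
```

Note the outputs are always in (0, 1), never negative, which is exactly the not-zero-centered issue above.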




2. Activation Functions : Tanh

๐Ÿ“ ๋ฌธ์ œ์ : saturated gradient

  • ๊ฐœ๋…
    • Scaled & Shifted version of Sigmoid
    • [-1,1] ๋ฒ”์œ„
    • zero-centeredํ•จ
    • ์—ฌ์ „ํžˆ saturatedํ• ๋•Œ gradient๊ฐ€ ์ฃฝ์Œ
      • saturating non-linearity๋ฅผ neural network์— ์‚ฌ์šฉํ•ด์•ผ๋œ๋‹ค๋ฉด, tanh>sigmoid ์‚ฌ์šฉํ•˜๋Š”๊ฒŒ ํ•ฉ๋ฆฌ์ 
      • ๊ทธ๋ž˜๋„ saturated ๋ฌธ์ œ๋•Œ๋งค ์—„์ฒญ ์ข‹์€ ์„ ํƒX




3. Activation Functions : ReLU, Leaky ReLU

๐Ÿ“ ๋ฌธ์ œ์ : not zero-centered, -์ผ๋•Œ gradient vanishing
1) relu

  • ๊ฐœ๋…

    • + ์˜์—ญ์—์„œ saturate๋˜์ง€ ์•Š์Œ (=๊ธฐ์šธ๊ธฐ ์†Œ์‹คX, killing gradient X)
    • ์—ฐ์‚ฐ ๋น„์šฉ ํšจ์œจ์  (cheapest ๋น„์„ ํ˜•ํ•จ์ˆ˜)
      • cf) binary์™€ ๊ฐ™์ด ๊ตฌํ˜„๊ฐ€๋Šฅ, ๊ฐ„๋‹จํ•œ ์ž„๊ณ„๊ฐ’๋งŒ ๊ณ ๋ คํ•ด์„œ ๊ณ„์‚ฐ๋น„์šฉโ†“
    • sigmoid, tanh๋ณด๋‹ค ๋งค์šฐ ๋นจ๋ฆฌ ์ˆ˜๋ ด
      • cf) 5000 layer๊ฐ™์ด ๋งค์šฐ ๊นŠ์€ layer๋ฉด, sigmoid๋กœ ์ˆ˜๋ ดํ•˜๊ธฐ ๋งค์šฐ ํž˜๋“ค๊ฒƒ (batch norm ์•ˆ์“ธ๋•Œ)
  • Problems

    • Not zero-centered output (same issue as sigmoid)

      • ReLU never outputs negatives; everything is positive or 0
      • this is a real issue, but not as serious as vanishing gradients, so it is tolerable
    • The gradient problem for negative inputs

      • when x < 0 (dead ReLU: no learning at all in that regime)

        • the local gradient (dReLU/dx) is exactly 0
        • so the downstream gradient (dL/dx) is also exactly 0
          • cf) isn't that worse than sigmoid? (sigmoid only approaches 0, while here it is exactly 0) → in practice it is fine, because over the whole dataset not every input lands in the negative regime, so the gradient is not completely 0 and learning still happens
      • when x = 0
        • the kink is not differentiable, but this exact point is essentially never hit in practice
      • when x is very large

        • the local gradient (dReLU/dx) is 1
      • cf) dead ReLU vs. active ReLU

        • active ReLU
          • receives gradients and trains normally
        • dead ReLU
          • happens when the unit's input is negative for every data point; if some inputs are positive it is fine
          • a dead unit can never train (it never receives a non-zero gradient)
          • trick to avoid it
            • give the negative side a small positive slope such as 0.01 (Leaky ReLU); see the sketch after the Leaky ReLU section below

2) Leaky ReLU

  • Concept
    • adds a small positive slope for negative inputs
    • the 0.01 slope is a hyperparameter (needs to be tuned for your own network)
  • Advantages (a small forward/backward sketch follows this list)
    • does not saturate
    • computationally cheap
    • converges much faster than sigmoid or tanh
    • the gradient never vanishes (the local gradient is never 0, on either the negative or the positive side)
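
A minimal numpy sketch of the forward pass and local gradient of ReLU and Leaky ReLU (my own illustration; the 0.01 slope is the usual default, not something fixed by the lecture):

```python
import numpy as np

def relu_forward_backward(x, upstream, neg_slope=0.0):
    """neg_slope = 0.0 gives plain ReLU; 0.01 gives the usual Leaky ReLU."""
    out = np.where(x > 0, x, neg_slope * x)
    local_grad = np.where(x > 0, 1.0, neg_slope)   # 0 in the dead region for plain ReLU
    dx = upstream * local_grad                     # downstream gradient
    return out, dx

x = np.array([-3.0, -0.5, 0.5, 3.0])
upstream = np.ones_like(x)

_, dx_relu = relu_forward_backward(x, upstream)                   # [0, 0, 1, 1]
_, dx_leaky = relu_forward_backward(x, upstream, neg_slope=0.01)  # [0.01, 0.01, 1, 1]
print(dx_relu, dx_leaky)
```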

3) PReLU

  • Concept
    • a follow-up to Leaky ReLU
    • the slope α is learned (a learnable parameter) instead of hand-picked
    • so it is a non-linearity that carries its own learnable parameter
  • Backprop into α (a small sketch of this gradient follows below)
    • backpropagate into α: compute the derivative of the loss with respect to α, then take gradient descent steps on α as well
    • issue) not differentiable at 0 → fix) just pick one of the two one-sided derivatives (this point is hit so rarely that it does not matter)

4) ELU (Exponential Linear Unit)

  • Concept
    • smoother than ReLU, and its outputs are closer to zero-centered
  • Formula
    • x for x > 0, and α(exp(x) − 1) for x ≤ 0
      • (default α = 1)
      • meant to avoid ReLU's exactly-zero gradient for negative inputs
      • the negative part looks a bit like a sigmoid
  • Problems
    • still contains an exponential (compute cost)
    • α is one more thing to choose and tune

5) SELU (Scaled Exponential Linear Unit)

  • Concept
    • a scaled version of ELU (a small sketch follows this section)
    • deep SELU networks can be trained without batch norm
  • Advantage
    • deep neural network + SELU = self-normalizing property
      = as the layers get deeper, the activation statistics settle down on their own
      = the activations stay well-behaved and converge to finite values
      = so normalization such as batch norm can be left out
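
A small numpy sketch of ELU and SELU (my own illustration). The SELU constants α ≈ 1.6733 and λ ≈ 1.0507 come from the SELU paper; treat the digits here as approximate:

```python
import numpy as np

def elu(x, alpha=1.0):
    # smooth for x <= 0, saturates to -alpha instead of hard-zeroing like ReLU
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6733, scale=1.0507):
    # "scaled ELU": the (alpha, scale) pair is chosen so activations self-normalize in deep nets
    return scale * elu(x, alpha)

x = np.array([-5.0, -1.0, 0.0, 1.0])
print(elu(x))    # negative side saturates near -1
print(selu(x))
```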




4. Activation Functions : Overall Comparison

  • Just use ReLU.




5. Data Preprocessing

1) Concept

  • done so that training is more efficient

2) Two methods (a sketch of both is given after this block)

(for image data)

a. Zero-center: subtract the mean so the data sits at the origin

  • Why is this necessary?
    • recall the sigmoid problem: if the gradient is always positive (or always negative), the W updates are also always positive (or always negative)

    • similarly, if the training data is all positive or all negative, the W updates are all forced to one sign

      ⇒ updates can only happen in a restricted (zig-zag) way

b. Normalize: scale each feature so it has the same variance (divide by the standard deviation)
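
A minimal sketch of both steps on a generic data matrix (my own illustration; `X` is assumed to be shaped (num_samples, num_features)):

```python
import numpy as np

X = np.random.rand(500, 3) * 10.0 + 5.0   # hypothetical raw data: all-positive, badly scaled

mean = X.mean(axis=0)   # statistics computed on the *training* data only
std = X.std(axis=0)

X_zero_centered = X - mean             # a. zero-center
X_normalized = X_zero_centered / std   # b. normalize to unit variance per feature

# At test time, reuse the SAME train-set mean/std (see the Q&A below).
```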

(for low-dimensional, non-image input)

  • move the data to be centered at the origin → then rotate it into its principal axes (a small PCA / whitening sketch follows this list)

  • decorrelated data

    • the covariance matrix becomes diagonal
  • whitened data

    • the covariance matrix becomes the identity
  • before vs. after normalization

    • (before norm) if the data cloud sits far from the origin, even a small change in the weight matrix causes a large change in the predictions → makes the optimization process hard
      • ex. for a classifier like −2x + 1 on non-zero-centered data, nudging it to −2.1x + 1 changes how many points are classified → the classification loss changes a lot → optimization is hard
    • (after norm) the data is zero-centered, so the loss is far less sensitive to small changes in W
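
A sketch of decorrelation and whitening with PCA (my own illustration, assuming `X` is (num_samples, num_features)):

```python
import numpy as np

X = np.random.randn(1000, 2) @ np.array([[2.0, 1.0], [0.0, 0.5]])  # correlated toy data
X -= X.mean(axis=0)                                                 # zero-center first

cov = X.T @ X / X.shape[0]            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

X_decorrelated = X @ eigvecs                           # rotate: covariance becomes diagonal
X_whitened = X_decorrelated / np.sqrt(eigvals + 1e-8)  # scale: covariance becomes ~identity

print(np.round(np.cov(X_whitened, rowvar=False), 2))
```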

3) Related questions

  • Q1. Do we apply this preprocessing to train and to test data?
    A1. Always fit it on the training data; at test time apply the exact same normalization (same statistics).
  • Q2. Is preprocessing still needed when batch norm is used?
    A2. If batch norm is placed as the very first step, before everything else, you could skip it,
    but that will likely perform a bit worse than doing the preprocessing yourself.
    ⇒ In practice, do both: preprocess and use batch norm.




6. Weight Initialization

📝 Why Xavier initialization was needed, and what it is
1) Three naive approaches (all of them have problems)

a. Initialize W = 0, b = 0

  • Problems

    • every output becomes 0 and every gradient becomes identical
      = the output no longer depends on the input ⇒ the gradient is 0, so training is totally stuck
    • the symmetry is never broken (every unit keeps getting the same gradient) → learning becomes impossible

b. Initialize with small random numbers

  • Problems

    • trouble appears in deeper networks
      = the local gradients all collapse to 0 → the downstream gradients become 0 too, so nothing learns (the sketch after approach c below reproduces this)

      • Demonstration

        • Interpretation
          • (in the lecture) the hidden-unit values of each of six layers are visualized
          • each hidden state is x multiplied by a W of shape (Din, Dout) initialized with small random values
          • the activations, and with them the gradients, shrink toward 0 layer after layer
            • cf) the local gradient with respect to a weight is the previous layer's activation
        • Result
          • the deeper the layer, the closer the activations get to 0 (very bad for learning)
            ⇒ the local gradients all become 0 → the downstream gradients become 0 → no learning

c. Initialize W with somewhat larger random numbers

  • Problems

    • the local gradients all become 0 → the downstream gradients become 0 → no learning (see the sketch below)

      • Demonstration

        • Interpretation
          • the pre-activations get pushed to extreme values, so tanh saturates at ±1
        • Result
          • the local gradients all become 0 → the downstream gradients become 0 → no learning
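
A sketch of the lecture's activation-statistics experiment (my own reconstruction; the tanh non-linearity and the 4096-wide, 6-layer setup follow the lecture, the exact scale values are arbitrary). A small scale collapses the activations toward 0, a large scale saturates tanh, and Xavier's 1/sqrt(Din) keeps them well-spread:

```python
import numpy as np

def activation_stats(weight_scale, dims=(4096,) * 7, xavier=False):
    """Forward a random batch through a deep tanh MLP and report each layer's activation std."""
    x = np.random.randn(16, dims[0])
    stds = []
    for Din, Dout in zip(dims[:-1], dims[1:]):
        scale = 1.0 / np.sqrt(Din) if xavier else weight_scale
        W = np.random.randn(Din, Dout) * scale
        x = np.tanh(x @ W)
        stds.append(round(float(x.std()), 3))
    return stds

print(activation_stats(0.01))               # b. too small: stds shrink toward 0
print(activation_stats(0.5))                # c. too large: tanh saturates, stds stick near 1
print(activation_stats(None, xavier=True))  # Xavier (next section): stds stay reasonable
```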

2) The fix

a. Xavier Initialization

  • Method

    • std = 1/sqrt(Din)
    • no extra hyperparameter
  • Result

    • (figure from the lecture: per-layer activation histograms)

    • stays well-behaved even as the layers get deep
  • How to apply it to a conv layer

    • use Din = (kernel size)² × (number of input channels)
  • Derivation

    • Xavier's goal

      • make the variance of the output activations equal to the variance of the input activations
        (because the earlier initializations were problematic precisely in that the input and output variances did not match)

      • ์ฆ๋ช…

        • ๊ฐ€์ •
          • x,w๋Š” ๋ชจ๋‘ ๊ฐ€์šฐ์‹œ์•ˆ ๋ถ„ํฌ๋ฅผ ๋”ฐ๋ฅธ๋‹ค. (0 ๋ถ„์‚ฐ)
        • ๊ฒฐ๊ณผ
          • Var(wiw_i) = 1/Din ์ด๋ฉด, Var(yiy_i) = Var(xix_i)์ด๋‹ค

            โ‡’ ๋”ฐ๋ผ์„œ Xavier ์ดˆ๊ธฐํ™” = 1/sqrt(Din) ์ด ๋œ ๊ฒƒ.

  • cf) ReLU๋กœ input x์™€ W๋ฅผ ๋‚ด์ ํ•œ๋‹ค๋ฉด?

    • ๊ฒฐ๊ณผ
      • Xavier์—์„œ relu ์ž‘๋™ X
        • ์ด์œ ) Xavier๋Š” x์™€ w๊ฐ€ zero-mean์ž„์„ ๊ฐ€์ •ํ•˜๋Š”๋ฐ relu๋Š” ๊ทธ๋ ‡์ง€ ์•Š์•„์„œ ๋งž์ง€ ์•Š์Œ
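
Writing out the variance bookkeeping (my own summary of the standard argument; the factor-of-two step is the usual motivation for the Kaiming/MSRA initialization in the next section):

```latex
% y_i = \sum_{j=1}^{D_{in}} x_j w_j, with x_j, w_j independent and zero-mean:
\operatorname{Var}(y_i) = D_{in}\,\operatorname{Var}(x_j)\,\operatorname{Var}(w_j)
\;\Rightarrow\;
\operatorname{Var}(w_j) = \tfrac{1}{D_{in}} \text{ gives } \operatorname{Var}(y_i) = \operatorname{Var}(x_j).

% After a ReLU, z = \max(0, y) with y zero-mean and symmetric, so
\mathbb{E}[z^2] = \tfrac{1}{2}\operatorname{Var}(y)
\;\Rightarrow\;
\operatorname{Var}(w_j) = \tfrac{2}{D_{in}} \text{ (Kaiming/MSRA) restores the scale.}
```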




7. Weight Initialization : Kaiming / MSRA Initialization

📝 Keep ReLU as-is and change the weight initialization instead → plus the extra fix needed for ResNets
1) Method

  • keep using ReLU; change the weight initialization instead
  • (before) std = 1/sqrt(Din) → (after) std = sqrt(2/Din)
    • ReLU kills half of its inputs, so just double the variance (a correction for the fact that half the neurons will be inactive)

2) Limitations

  • cf) this initialization is what made it possible to train VGG-style networks from scratch (as noted in the lecture)

  • it is not effective for Residual Networks

    • reason: the residual connection adds the input back onto the block's output, so the variance of the activations keeps growing
      • ex. Var(F(x)) (variance of the block's internal output) = Var(x) (variance of the input) — fine so far — but Var(F(x) + x) (variance after the residual add) >> Var(x), because the input is added back in, so the variances no longer match and grow block after block
    • therefore, with Xavier or MSRA alone the variance becomes very large → bad gradients → bad optimization

3) Fix (for residual blocks)

  • initialize the first conv with MSRA
  • initialize the second conv (the last layer of the block) to zero

⇒ then Var(x + F(x)) = Var(x) can be made to hold at initialization

= so the variance cannot blow up (a minimal sketch follows below)
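
A toy sketch of the "zero-init the last layer of the block" idea (my own illustration, with fully-connected layers standing in for the convs):

```python
import numpy as np

def init_residual_block(dim):
    W1 = np.random.randn(dim, dim) * np.sqrt(2.0 / dim)  # first layer: MSRA / Kaiming
    W2 = np.zeros((dim, dim))                            # last layer: zero-init
    return W1, W2

def residual_block(x, W1, W2):
    return x + np.maximum(0.0, x @ W1) @ W2   # F(x) = 0 at init, so the block is the identity

x = np.random.randn(32, 64)
W1, W2 = init_residual_block(64)
out = residual_block(x, W1, W2)
print(np.allclose(out, x), out.var() / x.var())   # True, 1.0 — variance preserved at init
```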

4) Question

  • Q. (Purpose of weight initialization) Is the idea of initialization to reach the global minimum of the loss function?
    A. No — before training we have no idea where that minimum is. The goal is instead to start in a state where all the gradients behave well,
    = because a bad initialization can leave you with zero gradients,
    = i.e. it keeps training from starting at a flat spot of the loss landscape from which it could never walk anywhere.




8. Regularization : Dropout

1) Purpose

  • prevent overfitting

2) Methods

a. Append a regularization term λR(W) to the loss

  • L2 regularization → the most commonly used

b. Dropout

  • Method
    • on each forward pass, randomly set some of the neurons in every layer to 0
    • how much to drop is a hyperparameter; 0.5 is the common choice
  • Implementation: see the inverted-dropout sketch at the end of this section
  • Two reasons dropout helps

    • it prevents redundant, co-adapted features

      • to learn the features of x well, the network is pushed to not learn unnecessary features or redundant nodes

      ⇒ conclusion: prevents overfitting

    • it behaves like an ensemble

      • dropout trains an ensemble of neural networks that share parameters
      • the many sub-models it creates effectively vote, like an ensemble, on the final prediction
  • Dropout at test time

    • Problem

      • z = a random variable, the dropout mask (drawn before each forward pass), so the output is really y = f(x, z)
      • conclusion: if we also dropped neurons randomly at test time, every evaluation would give a different result, because each forward pass drops a different random set of neurons
    • Idea for the fix

      • average out this randomness (z)!
    • How

      • approximating that integral / expectation over z

        • (with dropout) in the single-neuron example, the 4 different random masks that can occur during training are enumerated and their outputs averaged
          → which is exactly averaging over the random variable z
    • Implementation

      • Method
        • at test time, use all of the neurons
        • but rescale each layer's output using the probability p that was used for dropping neurons during training
  • Conclusion (full recipe)

    • train: just apply dropout as usual
    • test: rescale the output with the appropriate probability p and remove all randomness
      • strictly, this rescaling matches the exact average only for each individual layer
      • it is no longer exact once multiple dropout layers are stacked
  • The common implementation (Inverted Dropout)

    • in practice, both the drop and the rescale are done at training time (e.g. drop half the neurons and multiply the survivors by 2), and at test time all the neurons are used with the normal weight matrices (see the sketch below)
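
A minimal numpy sketch of inverted dropout (my own version of the standard recipe; `p` here is the probability of keeping a neuron, 0.5 as in the notes):

```python
import numpy as np

p = 0.5  # probability of KEEPING a neuron

def train_step(x, W1, W2):
    h = np.maximum(0.0, x @ W1)
    mask = (np.random.rand(*h.shape) < p) / p  # drop at random AND rescale by 1/p right here
    h = h * mask
    return h @ W2

def test_step(x, W1, W2):
    h = np.maximum(0.0, x @ W1)                # no mask, no rescaling: plain forward pass
    return h @ W2

W1, W2 = np.random.randn(20, 50) * 0.1, np.random.randn(50, 10) * 0.1
x = np.random.randn(4, 20)
print(train_step(x, W1, W2).shape, test_step(x, W1, W2).shape)
```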




9. Dropout architectures

๐Ÿ“ Dropout์ด ์•„ํ‚คํ…์ฒ˜๋“ค์— ์–ด์ผ€ ์“ฐ์ด๋Š”์ง€

  • ๊ฒฐ๋ก 
    • AlexNet, VGG : ๋งจ ์œ—๋‹จ ๋ ˆ์ด์–ด์ธ FCLayer์—์„œ dropout์ ์šฉ
    • ์ด์™ธ ์ตœ์‹  ์•„ํ‚คํ…์ฒ˜: FCLayer๋ฅผ ์ค„์˜€๊ธฐ์—, dropout ์‚ฌ์šฉํ• ์ผ ๊ฑฐ์˜ ์—†์Œ




10. Regularization : A common pattern

  • Batch norm follows the same train/test pattern

    • train (add randomness)
      • each example is normalized with the statistics of a randomly composed minibatch
    • test (average out the randomness)
      • normalize with fixed, averaged statistics instead
  • In recent architectures…

    • dropout is rarely used → batch norm or L2 regularization provides the regularization instead




11. Regularization : Data Augmentation

๐Ÿ“ ์ขŒ์šฐ๋Œ€์นญ, ๋ฐ๊ธฐ์กฐ์ ˆ ๋“ฑ

  • ๊ฐœ๋…

    • input data์— randomness๋ฅผ ๋ถ€์—ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์˜ ์ผ์ข…
    • ๋น„์Šทํ•œ image์ด์ง€๋งŒ CNN๋ชจ๋ธ์€ ๋‹ค๋ฅธ ์ด๋ฏธ์ง€๋กœ ์ธ์‹ํ•˜๋ฉฐ, ๋ฐ๊ธฐ์กฐ์ ˆ or ์ขŒ์šฐ ๋Œ€์นญ ๋“ฑ์„ randomness์˜ ์ผ์ข…์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Œ
  • ๋ฐฉ๋ฒ•

    • Horizontal Flips (์ขŒ์šฐ ๋Œ€์นญ)

    • Random Crops & Scales (๋žœ๋ค ์ž๋ฅด๊ธฐ, ์‚ฌ์ด์ฆˆ ์กฐ์ ˆ)

      • train
        • ๋žœ๋คํ•˜๊ฒŒ ์ด๋ฏธ์ง€ ์ž˜๋ผ๋‚ด๊ณ , ์‚ฌ์ด์ฆˆ ์กฐ์ •
      • test
        • test์šฉ ์ด๋ฏธ์ง€๋ฅผ 5๊ฐœ์˜ ์Šค์ผ€์ผ๋กœ ๋งŒ๋“  ํ›„, 224*224 ์‚ฌ์ด์ฆˆ์˜ ์ด๋ฏธ์ง€๋ฅผ 10๊ฐœ๋กœ ํฌ๋กญํ•˜์—ฌ 10๊ฐœ์— ๋Œ€ํ•œ ๋ถ„๋ฅ˜ ๊ฒฐ๊ณผ ํˆฌํ‘œ์‹œํ‚ด
    • Color Jitter

      • RGB ํ”ฝ์…€์— ๋Œ€ํ•ด PCA ์ง„ํ–‰ํ•˜์—ฌ, ์กฐ๋„ ์กฐ์ ˆํ•˜๋Š” ๋ฐฉ๋ฒ•
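
A sketch of the two simplest training-time augmentations (my own illustration; `img` is assumed to be an H×W×3 array and the crop size is arbitrary):

```python
import numpy as np

def augment(img, crop_size=224):
    # horizontal flip with probability 0.5
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]
    # random crop: pick a random crop_size x crop_size window
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop_size + 1)
    left = np.random.randint(0, w - crop_size + 1)
    return img[top:top + crop_size, left:left + crop_size, :]

img = np.random.rand(256, 320, 3)   # stand-in for a real training image
print(augment(img).shape)           # (224, 224, 3)
```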




12. Regularization : Drop Connect




13. Regularization : Fractional Pooling




14. Regularization : Stochastic Depth




15. Regularization : Cutout




16. Regularization : Mixup
