😄 Lecture 06. | Training Neural Networks I


Stanford University CS231n.


This post is written in a hierarchical structure so that the overall flow of the article is easy to follow.
Also, all sources cited (referenced) in this post via CSF (Curation Service for Facilitation) are omitted.

[Summary] Stanford University CS231n. Lecture 06. | Training Neural Networks I

Although this is officially Lecture 6, Lectures 4, 5, 6, and 7 are consolidated here to give an end-to-end picture of Neural Networks.

1. CONTENTS


1.1 Table

| Velog | Lecture | Description | Video | Slide | Pages |
| --- | --- | --- | --- | --- | --- |
| In progress | Lecture 01 | Introduction to Convolutional Neural Networks for Visual Recognition | video | slide | subtitle |
| In progress | Lecture 02 | Image Classification | video | slide | subtitle |
| In progress | Lecture 03 | Loss Functions and Optimization | video | slide | subtitle |
| In progress | Lecture 04 | Introduction to Neural Networks | video | slide | subtitle |
| In progress | Lecture 05 | Convolutional Neural Networks | video | slide | subtitle |
| In progress | Lecture 06 | Training Neural Networks I | video | slide | subtitle |
| In progress | Lecture 07 | Training Neural Networks II | video | slide | subtitle |
| In progress | Lecture 08 | Deep Learning Software | video | slide | subtitle |
| In progress | Lecture 09 | CNN Architectures | video | slide | subtitle |
| In progress | Lecture 10 | Recurrent Neural Networks | video | slide | subtitle |
| In progress | Lecture 11 | Detection and Segmentation | video | slide | subtitle |
| In progress | Lecture 12 | Visualizing and Understanding | video | slide | subtitle |
| In progress | Lecture 13 | Generative Models | video | slide | subtitle |
| In progress | Lecture 14 | Deep Reinforcement Learning | video | slide | subtitle |
| In progress | Lecture 15 | Invited Talk: Song Han, Efficient Methods and Hardware for Deep Learning | video | slide | subtitle |
| In progress | Lecture 16 | Invited Talk: Ian Goodfellow, Adversarial Examples and Adversarial Training | video | slide | subtitle |

2. Flow

2.1 Understanding the Structure of a Neuron

2.2 Neuron Architecture

  • A chain of linear functions alternating with nonlinear functions
  • The Sigmoid and ReLU functions are the most commonly used nonlinearities

2.3 Fully Connected Layer

  • Every neuron in a layer is connected to every neuron in the next layer.

2.4 Convolutional Neural Networks

  • Has layers that preserve the spatial structure of the input
  • A filter is created and slides across the image, taking dot products
  • Each filter captures a single feature
  • For example
    - Sliding a 3x3 filter over a 5x5 image (stride 1) yields a 3x3 Convolved Feature!

    - cf. how the convolution works
    • Convolving a 32x32x3 image (the input) with a 5x5x3 filter yields a 28x28x1 activation map
    • Along the way:
      • The Stride sets how many pixels the filter moves at a time; the filter must never slide past the edge of the image
      • Zero-Padding is used during convolution to match the activation map's size to the input size
      • Pooling is for when we want to downsample outright: it reduces the number of parameters without affecting depth, and Maxpooling is the usual choice
      • Pooling is normally set up so that every pixel takes part in exactly one operation (just set the window size and Stride to the same value)
    • The formula that decides the Feature Map size; the result must be a natural number (see the helper sketched after this list)
      • The Feature Map's height and width should be multiples of the Pooling size
      • Input height H / width W
      • Filter height FH / width FW
      • Stride S
      • Padding size P
      • Output height OH = (H + 2P - FH) / S + 1, and likewise output width OW = (W + 2P - FW) / S + 1
    • cf. an example CNN configuration
  • An example CNN deep learning model
    - Using several Convolution layers and finishing with a Fully connected layer produces a CNN deep learning model
  • Now let's look at what has to be considered when Training such a Neural Network
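A quick checker for the output-size formula above; a minimal sketch of my own, not lecture code:

```python
# Checks OH = (H + 2P - FH)/S + 1 and flags settings that do not divide evenly.
def conv_output_size(size: int, filter_size: int, stride: int, padding: int = 0) -> int:
    out, rem = divmod(size + 2 * padding - filter_size, stride)
    if rem != 0:
        raise ValueError("filter does not tile the input evenly; adjust stride/padding")
    return out + 1

# The example above: a 32x32 input, 5x5 filter, stride 1, no padding -> 28.
print(conv_output_size(32, filter_size=5, stride=1))  # 28
```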

2.5 What to Consider When Training a Neural Network

2.5.1 What Is Neural Network Training?

  • We learned about the Gradient Descent Algorithm as a way to optimize the network parameters.
  • Applying the gradient descent Algorithm to all of the data costs too much computation, so the SGD (Stochastic Gradient Descent) Algorithm is used instead
  • SGD is the method of drawing a Sample (minibatch) and applying the Gradient Descent Algorithm to it
  • Things to think about here:

Q1. How should the model be chosen?
Q2. What should we watch out for while Training?
Q3. How should the model be evaluated?

2.5.2 Activation Function

2.5.2.1 Sigmoid Function

  • A nonlinear function that squashes its output into the range 0 to 1
  • Drawbacks
    - Saturated neurons make the Gradient 0.
    - It is not zero-centered.
    - The exponential makes it computationally expensive.
    - cf. Saturate: read as 'saturation'; when the input is very small or very large, the output stops changing and converges to 1 or 0, and these flat regions are exactly where the Gradient is 0
    • Why a Gradient of 0 is a problem
      - Think of the Chain Rule: once the local gradient is 0, everything it multiplies downstream becomes 0 as well, so no gradient ever reaches the Input side and the earlier weights cannot be updated.
    • Why not being zero-centered is a problem
      - If the output is always positive, the input to the next layer is always positive too. Then, when updating $\bar{w}$ in the next layer, every gradient component shares the same sign, so the update always moves in the same joint direction. In the lecture's figure, when the vector we want is the blue one, $\bar{w}$ can only be updated toward the first or third quadrant, so it is hard to move in the direction we actually want.
        
  • The function that came out to compensate for Sigmoid's non-zero-centered output is tanh(x)

2.5.2.2 tanh(x)

  • Still, once a neuron saturates, the gradient goes to 0.
  • So a new activation function is needed: ReLU

2.5.2.3 ReLU

  • Characteristics
    - Does not saturate in the (+) region
    - Computation is a simple element-wise operation, so it is much faster than sigmoid/tanh
  • Drawback
    - It clamps (-) values to 0, so only half of the Data gets to activate
  • To compensate for this drawback: Leaky ReLU, the Exponential Linear Unit, and Maxout (all sketched after the Maxout section below)

2.5.2.4 Leaky ReLU

2.5.2.5 Exponential Linear Unit

2.5.2.6 Maxout

  • Using this function requires twice as many parameters as the original function
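A minimal NumPy sketch of the activations from section 2.5.2; the code and the alpha defaults are my own illustration, not lecture code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # saturates toward 0/1; not zero-centered

def tanh(x):
    return np.tanh(x)                      # zero-centered, but still saturates

def relu(x):
    return np.maximum(0.0, x)              # no saturation for x > 0; zeroes x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope keeps units alive

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, w1, b1, w2, b2):
    # the max over two linear maps -- hence twice the parameters of one unit
    return np.maximum(x @ w1 + b1, x @ w2 + b2)
```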

2.5.3 Data Preprocessing

  • Data preprocessing mostly means steps like Zero-centering, Normalization, PCA, and Whitening.
  • The reason for Zero-centering or Normalization is to put every dimension in the same range so that they can all contribute equally
  • PCA and Whitening feel more like projecting onto a lower dimension; in image processing these preprocessing steps are not used.
  • Images basically go through the Zero-Centering step only

  • In a real model, the mean computed on the train data is applied identically to the test data (sketched below)
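A minimal sketch of that convention, with placeholder arrays standing in for a real dataset:

```python
import numpy as np

X_train = np.random.rand(5000, 32 * 32 * 3)   # placeholder training images
X_test = np.random.rand(500, 32 * 32 * 3)     # placeholder test images

mean_image = X_train.mean(axis=0)   # statistics come from the train split only
X_train_centered = X_train - mean_image
X_test_centered = X_test - mean_image         # the same train mean, reused at test time
```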

2.5.4 Weight Initialization

  • What initial values should we pick to arrive at the best model?
  • If the initial values are 0, every neuron does the same job: all the gradients come out identical, so doing this is meaningless.
  • So the first Idea is to 'initialize with small random numbers'.
  • The initial Weights are sampled from a standard normal distribution (scaled small).
  • This works well in a shallow network, but problems appear once the network gets deep.
  • That is because the deeper the network, the more the activations shrink toward 0, since the weights are so small.
  • What happens if we enlarge the standard deviation instead?
    - The activation values hit the extremes, and the gradients all converge to 0
  • For this initialization problem the 'Xavier initialization' paper was proposed: assuming the activation function is linear, the weights are initialized with the formula below (sketched after this list)
  • Using this formula matches the variance of the input and the output
  • But with ReLU as the activation function, half of the output distribution gets chopped off, so the variance is cut in half and the formula no longer holds
  • When the activation function is ReLU, He Initialization is normally used
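A minimal sketch of the two rules, with assumed layer sizes:

```python
import numpy as np

fan_in, fan_out = 512, 256   # assumed layer dimensions

# Xavier initialization: scales by 1/sqrt(fan_in) to match input/output variance
# (derived under a roughly linear activation assumption).
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# He initialization: the extra factor of 2 compensates for ReLU
# zeroing out half of the activations.
W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```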

2.5.5 Batch Normalization

  • If we want unit gaussian activations, let's just make them that way directly!

  • The idea is to add to the Model a step that normalizes using the mean and variance computed on the current Batch.

  • This can cancel out the Bad Scaling effect that arises as the Weights are multiplied in layer after layer.

  • But is converting to a unit gaussian unconditionally good? To attach some flexibility, a learnable scale and shift based on the variance and mean are applied, so the normalization can be relaxed.

  • The Batch Normalization algorithm as given in the paper is as follows.
  • Looking at the characteristics of Batch Normalization:

It can also play the role of Regularization (it can help prevent Overfitting).
It also reduces the problem of dependence on weight initialization.
At Test time the minibatch mean and standard deviation cannot be computed, so a fixed Mean and Std obtained from a moving average of the means gathered during Training are used.
It can improve training speed.

- cf. [Batch Normalization explained and implemented (in Korean)](https://shuuki4.wordpress.com/2016/01/13/batch-normalization-%EC%84%A4%EB%AA%85-%EB%B0%8F-%EA%B5%AC%ED%98%84/)
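A minimal sketch of the train/test behavior described above (my own code, following the paper's algorithm; the epsilon and momentum values are common defaults):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      momentum=0.9, eps=1e-5, train=True):
    if train:
        mu = x.mean(axis=0)                  # mean over the current batch
        var = x.var(axis=0)                  # variance over the current batch
        # moving averages collected here are what test time will rely on
        running_mean[:] = momentum * running_mean + (1 - momentum) * mu
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var  # fixed statistics at test time
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize toward unit gaussian
    return gamma * x_hat + beta              # learnable scale and shift
```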

2.6 How to Design the Training Process

  • The first thing to consider is the data preprocessing step.

  • Second, we have to choose which architecture to use.

  • Then we should check how the loss comes out while the weights are still small (one concrete form of this check is sketched below).

  • First, take a small slice of the training data and check whether the loss actually drops properly.

  • There are many Hyperparameters, and among them the first to consider is the Learning rate.

  • If the cost does not shrink during training, suspect that the Learning rate is too small.

  • Note that with a softmax output, the weights change gradually but the accuracy can jump suddenly; this still means the model is learning in the right direction.

  • If the cost grows so large that it diverges, suspect that the Learning rate is too big, and keep adjusting the value.
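One common form of the small-weights loss check mentioned above: with small random weights, a softmax classifier should start near the chance-level loss -log(1/C). The 10-class count here is my own example (e.g., CIFAR-10):

```python
import math

num_classes = 10
expected_initial_loss = -math.log(1.0 / num_classes)
print(expected_initial_loss)   # ~2.302; a much larger starting loss suggests a bug
```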

2.7 Hyperparameter Optimization

  • When building a deep learning model, there are truly many Hyperparameters to consider.

  • Usually we train with the training set and evaluate with the validation set.

  • If you changed a Hyperparameter and the updated cost grows to more than 3x the original cost, try a different value.

  • Settling Hyperparameter values through repeated trial and error is one method, but if time is short, searching for hyperparameters this way has its limits.

  • So two methods were proposed: Grid Search vs. Random Search.

Grid Search selects candidate hyperparameter values at regular intervals within the target search range, records the performance measured for each of them, and then picks the hyperparameter value that showed the best performance.
Random Search is broadly similar to Grid Search, but differs in that the candidate hyperparameter values within the search range are selected by random sampling. Compared to Grid Search, Random Search greatly reduces the number of redundant runs and, at the same time, can also probabilistically explore values that sit between the fixed grid points, so it is known to find the optimal hyperparameter value faster.

Hence, in practice, random search is known to be the better method.

  • In practice, Hyperparameter Optimization happens through the following process (a sampling sketch follows this list):

    1. Set ranges for the Hyperparameter values.
    2. Randomly sample parameter values within the ranges set in step 1.
    3. Evaluate using the validation data (Validation Set).
    4. Repeat a set number of times, look at the resulting accuracy, and narrow the Hyperparameter ranges.
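A minimal sketch of steps 1-4 (the names and ranges are my own; sampling on a log scale is the usual practice for scale-like hyperparameters such as the learning rate):

```python
import numpy as np

def train_and_validate(lr, reg):
    # placeholder standing in for a real train + validation run
    return -abs(np.log10(lr) + 4) - 0.1 * abs(np.log10(reg))

rng = np.random.default_rng(0)
results = []
for _ in range(20):                          # step 4: repeat, then narrow the ranges
    lr = 10 ** rng.uniform(-6, -2)           # steps 1-2: sample inside the range
    reg = 10 ** rng.uniform(-5, 1)
    val_score = train_and_validate(lr, reg)  # step 3: evaluate on the validation set
    results.append((val_score, lr, reg))
print(max(results))                          # best setting found so far
```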
  • When setting Hyperparameters, we often judge whether a hyperparameter is suitable by looking at the loss curve.

  • If the loss curve is flat early on, there is a good chance the initialization went wrong.

  • And if the gap between training accuracy and validation accuracy is large, it is very likely that overfitting has occurred.

  • If there is no gap, consider increasing the model capacity; it may also mean the dataset used for training was too small.

2.8 Various Optimization Techniques

  • Introducing various Optimization Algorithms.
    The optimization technique we have learned so far is the SGD Algorithm.
    To describe the SGD Algorithm briefly:

1. Compute the loss of the data inside the Mini batch.
2. Update in the direction opposite to the Gradient.
3. Keep repeating steps 1 and 2.

But the SGD Algorithm has its problems:

  • What happens if the Loss changes quickly in one direction and only slowly in the other?

If such an imbalance of directions exists, SGD does not behave well.

  • What happens if we fall into a local minimum or a saddle point?

Even though a lower minimum exists, we may fall into a local minimum and never get out,
or updates may barely happen in regions where the slope is gentle.

  • In Minibatches, the gradient values can be shifted around a lot by noise.

As in the figure, the gradient values can get updated along a jittery, winding path.

To solve the problems above, the concept of Momentum is introduced.

Momentum means carrying on the velocity in the direction we have been heading while performing the gradient update.

  • There is also a method that adds momentum but, unlike the usual momentum, swaps the order of the update steps; it is called Nesterov Momentum.

  • I did not fully understand the meaning of the formula, but
    the lecture explained it as adding an error correction between the current and previous velocity.

Comparing the results of the original SGD, SGD+Momentum, and Nesterov (sketched below),

we can see the momentum-based algorithms behaving more Robustly.
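A minimal sketch of the three update rules on a toy quadratic loss (the names and the loss are mine; `grad` stands in for a minibatch gradient):

```python
import numpy as np

def grad(w):
    return 2 * w                      # gradient of the toy loss ||w||^2

lr, rho = 1e-2, 0.9                   # rho: momentum (friction) coefficient
w = np.array([3.0, -2.0])

# SGD: step straight down the current gradient
w_sgd = w - lr * grad(w)

# SGD + Momentum: accumulate a velocity, then step along it
v = np.zeros_like(w)
v = rho * v - lr * grad(w)
w_momentum = w + v

# Nesterov momentum: evaluate the gradient at the looked-ahead point w + rho*v
v = np.zeros_like(w)
v = rho * v - lr * grad(w + rho * v)
w_nesterov = w + v
```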

A method was also proposed that updates the gradient
using a grad squared term in place of the Velocity term,

and this method is called AdaGrad.

AdaGrad was proposed as a way to set the learning rate effectively.

Adding the grad squared term lets the step size be tailored to each individual parameter.

If updates keep proceeding this way,
we can see progress accelerating along dimensions with small gradients
and decelerating along dimensions with large gradients.

And as time goes on, the step size keeps shrinking.

On top of this method one more thing was added: through a variable called decay_rate,
the step speed can grow or decay,

and this method is called RMSProp.

RMSProp is a method that compensates for AdaGrad's weakness.

Unlike AdaGrad, which reflects all past gradients uniformly,
RMSProp reflects new gradient information more strongly when updating (both rules are sketched below).
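A minimal one-step sketch of both per-parameter scaling rules (my own names; `grad` again stands in for a minibatch gradient):

```python
import numpy as np

def grad(w):
    return 2 * w

w = np.array([3.0, -2.0])
lr, eps, decay_rate = 1e-2, 1e-7, 0.9

# AdaGrad: accumulate *all* past squared gradients -> steps only ever shrink
grad_squared = np.zeros_like(w)
g = grad(w)
grad_squared += g * g
w_adagrad = w - lr * g / (np.sqrt(grad_squared) + eps)

# RMSProp: leaky accumulation via decay_rate -> recent gradients dominate,
# so the step size can recover instead of decaying forever
grad_squared = np.zeros_like(w)
g = grad(w)
grad_squared = decay_rate * grad_squared + (1 - decay_rate) * g * g
w_rmsprop = w - lr * g / (np.sqrt(grad_squared) + eps)
```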

Truly countless algorithms have been proposed;
now let's look at Adam, which is widely used by most people.

Adam, thought of simply, is momentum + AdaGrad.

Since the moment estimates need to be initialized well, a bias correction was added
to the design so that the early steps behave sensibly (sketched below).
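A minimal sketch of the Adam update with that bias correction (my own names; the beta/epsilon values are the paper's common defaults):

```python
import numpy as np

def grad(w):
    return 2 * w

w = np.array([3.0, -2.0])
m = np.zeros_like(w)                 # first moment: the momentum-like part
v = np.zeros_like(w)                 # second moment: the AdaGrad/RMSProp-like part
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

for t in range(1, 101):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)     # bias correction: without it, the first
    v_hat = v / (1 - beta2 ** t)     # steps would be scaled far too small
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```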

Comparing it with the earlier algorithms:

- Adam was said to be the most widely used,
but in the example shown here it does seem to swing way around before the update settles.

As for optimization techniques: the best one differs from situation to situation!

All of the optimization algorithms shown so far
have the Learning rate as a hyperparameter.

There is also Learning rate decay, but
first design the deep learning model as if it did not exist,
and consider it later.

Optimizing with a first-order (linear) approximation has the drawback that we cannot step very far.

A second-order approximation is usually built using a Taylor series.
Updating this way has the advantage that, in principle, no learning rate has to be set. (No Hyperparameters!)
But it has the drawback that the complexity is far too high.

Approximating with a second-order function is what Quasi-Newton methods do,
one family of non-linear optimization methods.

They need less computation than full Newton methods, so they are widely used.
The most widely used algorithms among them are BFGS and L-BFGS.

These algorithms show good performance in the full-batch setting,
so they are worth trying when there is little Stochastic (random) variation in the setup.

The methods we have learned so far are all
methods used to reduce the error during the Training process.
Then what should we do to raise performance on data the model has never once seen?

2.9 Regularization

Before explaining Regularization techniques, let's first go over Model Ensembles.

Model Ensembles, simply put, means training with a variety of models
and then mixing (?) them together when testing.

There is also a method of testing with a Moving average
of the parameter vectors. (Polyak averaging)

The methods so far are all techniques for making the model a bit more robust so that it performs well at Test time.

Then what method do we use to make a single-model perform well?

The answer is Regularization.

Regularization can be as simple as adding a regularization
function to the loss function when implementing it.

Another method is the technique called dropout.

Dropout is effective because predictions are made using a variety of features, which prevents the model from depending on one particular feature only.
It also lets a single model produce an ensemble effect.

When we want to average out the randomness at Test-time...

doing Dropout (in its inverted form) lets the test-time cost stay low as well.

(1) At train time, randomness is added to the network so that it does not fit the training data too tightly.
(2) At test time, the randomness is averaged out, which gives the generalization effect (sketched below).

=> This is also the reason the earlier lecture said that dropout is not needed when Batch Normalization is applied.
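A minimal sketch of inverted dropout (my own code) matching (1) and (2) above; rescaling by 1/p at train time is what lets test time run with no extra work:

```python
import numpy as np

p = 0.5  # probability of keeping a unit

def dropout_forward(x, train=True):
    if train:
        mask = (np.random.rand(*x.shape) < p) / p  # (1) random drop + 1/p rescale
        return x * mask
    return x                                       # (2) test time: the randomness is
                                                   # already averaged out, nothing extra
```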

Another regularization method is Data Augmentation.

When Training, random patches of the image can be cropped for training,

the image can be flipped and added to the train dataset,

or the brightness can be varied and the result added to the train dataset for training (a few of these are sketched after the list below).

  • Or carving chunks out of the image.. --;;;
  • Or warping it like a squashed lens.. ;;;
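A minimal NumPy sketch of the augmentations above (random crop, horizontal flip, brightness jitter); the crop size and jitter range are my own choices, and images are assumed HxWxC in [0, 255]:

```python
import numpy as np

def augment(img, crop=28):
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop + 1)    # random patch location
    left = np.random.randint(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:                  # random horizontal flip
        out = out[:, ::-1]
    scale = np.random.uniform(0.8, 1.2)         # brightness jitter
    return np.clip(out * scale, 0, 255)
```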

  • Dropconnect makes entries of a node's weight matrix zero instead
  • It is a generalized version of Dropout
  • If Dropout severs whole perceptrons, so that the weights connected before and after them disappear together, Dropconnect severs and removes only the weights themselves

Various other regularization methods exist besides these.

  • Fractional Max Pooling

cf. Pooling
- a kind of sub sampling
- sub sampling means shrinking the image data in question down to an image of smaller size.

  • Types of pooling
    • Max pooling
    • Average
    • Stochastic
    • Cross channel
  • Reasons for pooling
    - not all of the data in the output feature map is needed
    • to prevent overfitting (it reduces the parameters)
    • because a moderate amount of data is enough for inference
    • the computation shrinks, saving resources and giving a speedup

  • Stochastic depth is a method that drops parts of the layers at random, and bypasses them with the identity function. This reduced train time and also reduced test error.

  • Convolutional networks with great depth run into difficulties at train time (Vanishing Gradients, train time, ...)

  • A Network's depth is a major deciding factor in model expressiveness

  • But a very deep network triggers phenomena such as vanishing gradients and diminishing features

    • Diminishing feature: the phenomenon of Features being lost in the Forward pass as multiplication and convolution computations are repeated many times
    • Vanishing Gradients: a well-known nuisance in neural networks with many layers. As the gradient information is back-propagated, repeated multiplication or convolution with small weights renders the gradient information ineffectively small in the earlier layers (the same shrinking as in the convolution story, but on the backward pass).

2.10 Transfer Learning

Transfer learning, simply put, is the method of taking an already pretrained model and fine-tuning it to suit the purpose we have in mind.

When re-training with a Small Dataset,
training is re-run with a learning rate lower than the usual one.

When the DataSet is somewhat larger, a few more layers are trained.

Summarizing once more as a table (as on the lecture slide):
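|  | very similar dataset | very different dataset |
| --- | --- | --- |
| very little data | Use a linear classifier on the top layer | You're in trouble... try a linear classifier from different stages |
| quite a lot of data | Fine-tune a few layers | Fine-tune a larger number of layers |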

Transfer learning is a very commonly used method, so be sure to know it!
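A minimal PyTorch sketch of the small-dataset recipe above (assuming torchvision is available; the layer names follow torchvision's ResNet-18, and the class count and learning rate are example values):

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(pretrained=True)  # start from a pretrained model

for param in model.parameters():                      # freeze the pretrained layers
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)        # fresh head for our own classes

# Small dataset: train only the new head, with a lower-than-usual learning rate.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```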

