[Deep Learning Specialization] Improving Deep Neural Networks - Practical aspects of Deep Learning

Carvin · November 22, 2020

2nd course: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

🗂 Table of Contents

  • Part 1: Practical aspects of Deep Learning

  • Part 2: Optimization algorithms

  • Part 3: Hyperparameter tuning, Batch Normalization and Programming Frameworks

1. Practical aspects of Deep Learning

The 1st course covered how to implement a neural network; the 2nd course, Improving Deep Neural Networks, looks at how to get better performance out of that network. It introduces a range of approaches for optimizing and improving performance, from preparing the data sets through regularization to hyperparameter tuning.
With a reasonable understanding of how each of these approaches works, it becomes much easier to draw the best possible result out of a model given the goal of the task and the characteristics of the data.

1. Train / Dev / Test sets

  • First of all, setting up the train / dev / test sets well makes it much easier to find a neural network with good performance

  • Implementing a neural network means deciding many hyperparameters (the number of layers, the number of hidden units, the learning rate, the activation functions, and so on), so the work is inevitably iterative

  • Since this iteration is unavoidable, splitting the data appropriately into train / dev / test sets lets you go through it far more efficiently

    • If you validate on the data used for training, the network has already been fit to that data, so it will always show good (i.e., overfit) results
  • Traditionally in machine learning, the data is split 70:30 into train / test, or 60:20:20 into train / dev / test

  • In the big-data era, however, the data sets of interest often exceed millions of examples, so dev and test sets of 20% or 30% would take up an unnecessarily large share

    • The dev and test sets only need to be large enough to evaluate the algorithm trained on the train data, so with a million or more examples the dev and test sets are often set to 1% of the data or even less
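As a rough illustration (not from the course), a minimal NumPy sketch of such a split could look like the following; the function name, the examples-as-columns layout, and the 1% fractions are my own assumptions and only make sense for very large data sets:

import numpy as np

def split_train_dev_test(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the examples (columns of X) and split them into train / dev / test sets."""
    m = X.shape[1]                                   # number of examples
    perm = np.random.default_rng(seed).permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    n_train = m - n_dev - n_test
    return ((X[:, :n_train], Y[:, :n_train]),                                 # train
            (X[:, n_train:n_train + n_dev], Y[:, n_train:n_train + n_dev]),   # dev
            (X[:, n_train + n_dev:], Y[:, n_train + n_dev:]))                 # test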

2. Bias / Variance

  • ๋ฐ์ดํ„ฐ๊ฐ€ ํ•™์Šต๋˜๋Š” ๊ณผ์ • ์†์—์„œ๋Š” train ๋ฐ์ดํ„ฐ๊ฐ€ ๋ชจ์ง‘๋‹จ์„ ๋ชจ๋‘ ๋ฐ˜์˜ํ•  ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์— bias(ํŽธํ–ฅ)์™€ variance(ํŽธ์ฐจ)๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ž„

  • ๊ฐ„๋‹จํ•˜๊ฒŒ logistic regression์˜ ํ•™์Šต ๊ฒฐ๊ณผ๋กœ ์œ„์™€ ๊ฐ™์€ 3๊ฐ€์ง€ ๊ฒฐ๋ก ์ด ๋„์ถœ๋  ์ˆ˜ ์žˆ์Œ

    • ๋จผ์ €, ์ขŒ์ธก ๊ฒฐ๊ณผ๋Š” ํ•™์Šต์ด ์ œ๋Œ€๋กœ ๋˜์ง€ ์•Š์•„ underfitting๋œ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉฐ, ์ด๋ฅผ high bias(๋†’์€ ํŽธํ–ฅ)์ด๋ผ๊ณ  ํ•จ

    • ๋ฐ˜๋Œ€๋กœ, ์šฐ์ธก ๊ฒฐ๊ณผ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์™„๋ฒฝํ•˜๊ฒŒ fittingํ•จ์œผ๋กœ์จ ์ •ํ™•ํ•˜๊ฒŒ ๋ถ„๋ฅ˜ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ overfitting ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์œผ๋ฉฐ, ์ด๋ฅผ high variance๋ผ๊ณ  ํ•จ

  • deep neural network์—์„œ๋Š” train / dev / test ๋ฐ์ดํ„ฐ์˜ error ๋น„์œจ์„ ํ†ตํ•ด bias์™€ variance ๋ฌธ์ œ๋ฅผ ํ™•์ธํ•ด๋ณผ ์ˆ˜ ์žˆ์Œ

  • ์ฃผ์–ด์ง„ ์ด๋ฏธ์ง€์—์„œ ๊ณ ์–‘์ด๋ฅผ classificationํ•˜๋Š” ๋ฌธ์ œ๋กœ ๊ฐ€์ •ํ–ˆ์„ ๋•Œ, ์‚ฌ๋žŒ์ด ์ง์ ‘ ํŒ๋‹จํ•˜๋Š” ๊ฒฝ์šฐ๋Š” ๋Œ€๋ถ€๋ถ„ ์ •ํ™•ํ•˜๊ธฐ ๋•Œ๋ฌธ์— human error๋Š” 0%์— ๊ฐ€๊นŒ์›€

    • train / dev ์˜ error๊ฐ€ 1% / 11% ๋ผ๊ณ  ํ•œ๋‹ค๋ฉด train ๋ฐ์ดํ„ฐ์—๋งŒ overfitting๋œ high variance ๋ฌธ์ œ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

    • train / dev์˜ error๊ฐ€ 15% / 16% ๋ผ๋ฉด, train๊ณผ dev์— ๋ชจ๋‘ ๋งž์ง€ ์•Š๊ฒŒ ํ•™์Šต๋œ ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— underfitting๋œ high bias ๋ฌธ์ œ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

    • train / dev์˜ error๊ฐ€ 15% / 30% ๋ผ๋ฉด, ๋‘๋ฒˆ์งธ ๊ฒฝ์šฐ์ฒ˜๋Ÿผ high bias ๋ฌธ์ œ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ train์— overfitting๋œ ๋ฌธ์ œ๋„ ํฌํ•จ๋˜๋Š” high variance๊นŒ์ง€ ๋‚˜ํƒ€๋‚˜๋Š” ๊ฐ€์žฅ ์•ˆ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋งํ•จ

    • train / dev์˜ error๊ฐ€ 0.5% / 1% ๋ผ๋ฉด human error์™€ ๋น„๊ตํ–ˆ์„ ๋•Œ์—๋„ ๊ต‰์žฅํžˆ ์ •ํ™•ํ•˜๊ฒŒ ๋ถ„๋ฅ˜ํ•œ model์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Œ

  • ํ•˜์ง€๋งŒ train / dev์˜ error ๋น„์œจ์„ ํ†ตํ•œ bias์™€ variance ๋ฌธ์ œ๋Š” ๋ฐ์ดํ„ฐ์™€ ๋ชจ๋ธ์˜ ํŠน์ง•, Human error, Optimal Error์— ๋”ฐ๋ผ ์ƒ๋Œ€์ ์œผ๋กœ ๋‹ฌ๋ผ์ง€๊ธฐ ๋•Œ๋ฌธ์— ์‹ ์ค‘ํ•œ ๋ถ„์„์ด ํ•„์š”ํ•จ
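As a minimal sketch of this kind of diagnosis (the function, the thresholds, and the use of human error as a stand-in for the optimal error are my own illustrative choices, not from the course):

def diagnose(train_err, dev_err, optimal_err=0.0, bias_tol=0.05, gap_tol=0.05):
    """Rough bias/variance diagnosis from train / dev error rates."""
    issues = []
    if train_err - optimal_err > bias_tol:      # far above the optimal (human) error: underfitting
        issues.append("high bias")
    if dev_err - train_err > gap_tol:           # large train-to-dev gap: overfitting
        issues.append("high variance")
    return issues or ["looks fine"]

print(diagnose(0.01, 0.11))    # ['high variance']
print(diagnose(0.15, 0.16))    # ['high bias']
print(diagnose(0.15, 0.30))    # ['high bias', 'high variance']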

3. Basic Recipe for Machine Learning

  • Checking for bias and variance problems through the train / dev errors is the basic recipe of machine learning, and following this procedure is how the algorithm's performance gets improved (a sketch of the decision logic follows this list)

    • In the high bias case (like the left-hand result), you can

      • stack more layers onto the neural network,
      • increase the number of nodes per layer,
      • train longer by increasing the number of epochs,
      • or apply a better-suited optimization algorithm
    • In the high variance case (like the right-hand result), you can

      • use more data for training,
      • or try regularization
  • In the early days of deep learning, bias and variance were seen as something of a trade-off, but now that much deeper neural networks can be trained, it is possible to reduce both bias and variance
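As promised above, a rough sketch of the decision logic: the remedy lists are just the bullet points above written as data, the issue strings match the diagnose() sketch from the previous section, and none of the names come from the course:

REMEDIES = {
    "high bias": ["bigger network (more layers / more units)",
                  "train longer (more epochs)",
                  "better-suited optimization algorithm"],
    "high variance": ["more training data",
                      "regularization (L2, dropout, ...)"],
}

def basic_recipe(issues):
    """Map the diagnosed issues (e.g. the output of diagnose() above) to suggested remedies."""
    return {issue: REMEDIES[issue] for issue in issues if issue in REMEDIES}

print(basic_recipe(["high bias", "high variance"]))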

4. Regularization

  • ์œ„์—์„œ overfitting๋œ high variance ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ Regularization์ด ์ œ์•ˆ๋˜์—ˆ์Œ

  • logistic regression์— Regularization์„ ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” cost function ์ขŒ์ธก์— Regularization ํ•ญ์„ ์ถ”๊ฐ€ํ•˜๊ฒŒ ๋จ

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2$$
$$\|w\|_2^2 = \sum_{j=1}^{n_x}w_j^2 = w^T w$$
  • The purpose of regularization is to adjust the weights w, which drive the cost at every iteration, so that they have a smaller influence

    • The two variants most often compared are L1 regularization and L2 regularization; in practice L2 regularization is used much more often than L1
  • When regularization is applied to a neural network, the term covers every layer and affects the parameters updated during backpropagation

$$J(W^{[1]}, b^{[1]}, \cdots, W^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\|W^{[l]}\|^2$$
$$\|W^{[l]}\|^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(w_{i,j}^{[l]})^2$$
  • Because the L2 term scales each weight's original value down by a factor of $(1 - \frac{\alpha\lambda}{m})$ during every parameter update, L2 regularization is also called 'weight decay'
$$W^{[l]} := W^{[l]} - \alpha\left[(\text{from BP}) + \frac{\lambda}{m}W^{[l]}\right]$$
$$= W^{[l]} - \frac{\alpha\lambda}{m}W^{[l]} - \alpha\,(\text{from BP})$$
$$= \left(1 - \frac{\alpha\lambda}{m}\right)W^{[l]} - \alpha\,(\text{from BP})$$
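A minimal NumPy sketch of the two pieces above, the regularized cost and the decayed update (the parameter-dictionary layout with keys "W1", "W2", ... mirrors the course assignments, but the function names here are my own):

import numpy as np

def l2_cost(cross_entropy_cost, parameters, lambd, m):
    """Add the L2 term (lambda / 2m) * sum_l ||W[l]||^2 to an already-computed cost."""
    l2_term = sum(np.sum(np.square(W)) for name, W in parameters.items() if name.startswith("W"))
    return cross_entropy_cost + (lambd / (2 * m)) * l2_term

def update_with_weight_decay(W, dW_from_bp, alpha, lambd, m):
    """One gradient step including the extra (lambda / m) * W gradient, i.e. weight decay."""
    return (1 - alpha * lambd / m) * W - alpha * dW_from_bp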

5. Why regularization reduces overfitting?

  • ์œ„์—์„œ L2 Regularization์€ 'weight decay'๋ผ๊ณ  ๋ถˆ๋ฆฌ๋Š” ์ด์œ ๋กœ (1โˆ’ฮฑฮปm)(1 - \frac{\alpha \lambda}{m}) ์˜ํ–ฅ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

    • ์œ„ ์‹์—์„œ ฮป,lambda{\lambda}, lambda๋ฅผ ์„ค์ •ํ•ด์คŒ์œผ๋กœ์จ ww์— ๋Œ€ํ•œ ์˜ํ–ฅ๋ ฅ์„ ์กฐ์ •ํ•ด์ค„ ์ˆ˜ ์žˆ์Œ

    • ฮป,lambda{\lambda}, lambda๋ฅผ ํฌ๊ฒŒ ์„ค์ •ํ•˜๊ฒŒ ๋˜๋ฉด, ww์€ 0์— ๊ฐ€๊น๊ฒŒ ์ˆ˜๋ ตํ•˜๊ฒŒ ๋จ

    • ์•„๋ž˜ ์šฐ์ธก ๊ฒฐ๊ณผ, high variance ์ƒํƒœ์—์„œ L2 Regularization์„ ์ ์šฉํ•˜๊ฒŒ ํ•ด์ฃผ๋ฉด ์ขŒ์ธก ์ƒ๋‹จ์˜ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ ๊ฐ hidden unit์˜ ww์˜ ์˜ํ–ฅ๋ ฅ์ด ์ค„์–ด๋“ค๊ณ  variance๊ฐ€ ์ค„์–ด๋“œ๋Š” ํ˜„์ƒ์„ ๋ณผ ์ˆ˜ ์žˆ๊ฒŒ ๋จ(์•ฝ๊ฐ„ ๊ทธ๋ฆผ์€ dropout๊ณผ ๋น„์Šทํ•˜๋‹ค๋Š” ๋Š๋‚Œ์ด ๋“ฆ)

6. Dropout Regularization

  • Besides L1 and L2 regularization, there is another regularization technique called dropout

  • Unlike L1 and L2, which shrink the influence of the weights w, dropout omits specific nodes entirely, removing the computation those weights would otherwise contribute

  • To drop nodes at random in practice, the notion of keep-prob is introduced

    • keep-prob is the fraction of nodes that remain after nodes are randomly dropped

    • If keep-prob is 1, there is no dropout: no nodes are omitted

  • The most commonly used variant, inverted dropout, is implemented as follows

    • Generate a random matrix with the same dimensions as the layer in question

    • Set keep-prob (say 0.8); entries smaller than keep-prob become True, so a 0.8 fraction of the nodes end up True and the remaining 0.2 end up False

    • Multiplying the layer's nodes by this True/False mask shows which nodes survive and which are dropped

    • Finally, divide the surviving nodes by keep-prob so that the layer's expected value stays the same

  • The code from the assignment, briefly, looks like this

D1 = np.random.rand(A1.shape[0], A1.shape[1])     # Step 1: initialize matrix D1 = np.random.rand(..., ...)
D1 = (D1 < keep_prob).astype(int)                 # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
A1 = np.multiply(A1, D1)                          # Step 3: shut down some neurons of A1
A1 = np.divide(A1, keep_prob)                     # Step 4: scale the value of neurons that haven't been shut down

7. Understanding Dropout

  • Dropout has a regularizing effect because randomly removing nodes amounts to training a smaller neural network

  • keep-prob can be set differently for each layer when implementing dropout; lowering keep-prob for a layer (dropping more nodes) plays a role similar to raising $\lambda$ in L2 regularization

    • Dropout is a technique that was used mainly in computer vision, where training data was usually insufficient, as a way to prevent overfitting
  • On the other hand, its drawbacks are that the cost function is no longer precisely defined, and the randomness makes performance checks and debugging harder

8. Other regularization methods

  • Additional regularization techniques include Data Augmentation and Early Stopping

  • Data Augmentation is a technique used mainly in image classification

    • It transforms the data you already have (horizontal flips, zooming in and out, rotation, stretching, and so on) so the network treats the results as different examples

    • The new (fake) images generated this way are not as effective as genuinely new data, but they come at essentially no cost

  • Early Stopping monitors training as it progresses and stops earlier than the originally planned end point, which prevents overfitting (a sketch is given below)
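A minimal sketch of an early-stopping loop, assuming placeholder train_one_epoch / dev_error callables and a patience parameter of my own choosing (none of these names come from the course):

def train_with_early_stopping(params, train_one_epoch, dev_error, max_epochs=100, patience=5):
    """Stop once the dev error has not improved for `patience` consecutive epochs."""
    best_err, best_params, waited = float("inf"), params, 0
    for epoch in range(max_epochs):
        params = train_one_epoch(params)        # one pass over the training data
        err = dev_error(params)                 # evaluate on the dev set
        if err < best_err:
            best_err, best_params, waited = err, params, 0
        else:
            waited += 1
            if waited >= patience:              # dev error stopped improving: stop early
                break
    return best_params, best_err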

9. Normalizing inputs

  • Normalizing the input data is another way to make training a neural network more efficient

  • The data can be normalized the same way as in classical machine learning: subtract the mean so that the mean becomes 0, and scale using the variance

  • The reason for normalizing is that when the input features have different scales, the parameters being updated during training also live on different scales and so do their gradients, which makes the iterative process very slow

  • Once normalized, all input features share the same scale, the corresponding parameters affect the cost over comparable ranges, and gradient descent proceeds much more efficiently (see the sketch below)
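A minimal sketch of this normalization with features as rows and examples as columns (the function name and the epsilon guard are my own assumptions); note that the same mu and sigma^2 computed on the train set should also be reused for the dev/test data:

import numpy as np

def normalize_inputs(X_train, X_test, eps=1e-8):
    """Zero-mean, unit-variance scaling using statistics computed on the train set only."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma2 = np.var(X_train, axis=1, keepdims=True)
    X_train_norm = (X_train - mu) / np.sqrt(sigma2 + eps)
    X_test_norm = (X_test - mu) / np.sqrt(sigma2 + eps)   # reuse the train-set statistics
    return X_train_norm, X_test_norm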

10. Vanishing / Exploding gradients

  • When training a deep neural network, the gradients (the amount of change) can become extremely large or shrink toward zero; this is the exploding / vanishing gradients problem

    • It happens because the weights w, multiplied together as many times as there are layers, can grow exponentially or keep shrinking until they are essentially 0

    • When the gradients keep shrinking in particular, each update step is tiny, so the learning process becomes very slow
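A toy numerical illustration of my own (not from the course): repeatedly multiplying by a weight slightly above or below 1, as a deep linear network with identity activations would, shows the exponential blow-up and decay:

L = 150                          # number of layers (illustrative)
for w in (1.5, 0.5):             # a weight slightly above 1 and one below 1
    a = 1.0
    for _ in range(L):
        a *= w                   # the value behaves like w ** L
    print(f"w = {w}: value after {L} layers = {a:.3e}")
# w = 1.5 explodes to ~10^26, while w = 0.5 vanishes to ~10^-46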

11. Weight Initialization for Deep Networks

  • To keep the gradients from vanishing or exploding like this, the weight initialization can be chosen carefully

  • When many w values feed into a single computation, z inevitably grows, so each w can be scaled down in a way that resembles normalizing it

  • It is more effective to set the weight-initialization scale slightly differently depending on the activation function

    • For ReLU: $\sqrt{\frac{2}{n^{[l-1]}}}$

    • For tanh: $\sqrt{\frac{1}{n^{[l-1]}}}$ or $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$
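A minimal sketch of He initialization for ReLU layers (the dictionary layout with "W1", "b1", ... loosely follows the course assignments, but the function itself is my own); for tanh you would replace the factor 2 with 1, or use 2 / (n[l-1] + n[l]):

import numpy as np

def initialize_parameters_he(layer_dims, seed=0):
    """W[l] ~ N(0, 1) * sqrt(2 / n[l-1]), b[l] = 0, for the layer sizes in layer_dims."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        scale = np.sqrt(2.0 / layer_dims[l - 1])
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * scale
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

# e.g. a network with 2 inputs, two hidden layers of 4 units, and 1 output:
# parameters = initialize_parameters_he([2, 4, 4, 1])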

12. Numerical approximation of gradients

  • When implementing backpropagation, you can run gradient checking to verify that the whole process was carried out correctly

  • Gradient checking literally compares derivatives (amounts of change); to compute the gradient numerically, the lecture introduces the 'one-sided difference' and the 'two-sided difference'

    • one-sided difference: $f'(\theta) \approx \frac{f(\theta + \epsilon) - f(\theta)}{\epsilon}$

    • two-sided difference: $f'(\theta) = \lim_{\epsilon \rightarrow 0}\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$

  • The two-sided difference has a smaller approximation error than the one-sided one (on the order of $\epsilon^2$ rather than $\epsilon$), so it is the preferred method for gradient checking (see the sketch below)
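A minimal sketch of the two-sided difference, using f(theta) = theta**3 as a simple test function (the function name is my own):

def two_sided_difference(f, theta, epsilon=1e-7):
    """Approximate f'(theta) numerically with the two-sided difference."""
    return (f(theta + epsilon) - f(theta - epsilon)) / (2 * epsilon)

# f(theta) = theta**3 has derivative 3 * theta**2, so f'(3) should be 27
print(two_sided_difference(lambda t: t ** 3, 3.0))    # ~27.0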

13. Gradient checking

  • Debugging with gradient checking can find bugs and thereby cut the overall development time considerably

  • Gradient checking compares the derivatives produced during backpropagation with the gradient computed numerically at the same point, and the size of the discrepancy reveals bugs

    • The $d\theta$ obtained from backpropagation is compared with the $d\theta_{\text{approx}}$ obtained from the two-sided difference

    • As a rule of thumb for reading the discrepancy: if the relative difference below is smaller than about $10^{-7}$, the result is satisfactory; around $10^{-5}$ it is worth double-checking the individual components; anything larger than $10^{-3}$ most likely means there is a bug

$$\frac{\| d\theta_{\text{approx}} - d\theta \|_2}{\| d\theta_{\text{approx}} \|_2 + \| d\theta \|_2}$$
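A minimal sketch of that comparison, assuming both gradients have already been flattened into single vectors (the thresholds just follow the rule of thumb above):

import numpy as np

def gradient_check(dtheta_approx, dtheta):
    """Relative difference between the numerical and the backpropagation gradients."""
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    difference = numerator / denominator
    if difference < 1e-7:
        print(f"OK ({difference:.2e})")
    elif difference < 1e-3:
        print(f"worth double-checking ({difference:.2e})")
    else:
        print(f"probable bug in backpropagation ({difference:.2e})")
    return difference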

14. Gradient Checking Implementation Notes

  • There are a few things to keep in mind when applying gradient checking

    • Don't use in training, only to debug: computing the numerical approximation of the gradient is quite slow, so it is not used during training

    • If algorithm fails grad check, look at components to try to identify bug: if the discrepancy is large, compare the individual components to find the one causing the problem and locate the error

    • Remember regularization: if regularization is used, the lambda term is part of the cost function J, so it must be included when comparing against the numerical approximation

    • Doesn't work with dropout: with dropout, the cost function J becomes very hard to write down, so set keep-prob to 1 (turn dropout off) while running grad check

    • Run at random initialization; perhaps again after some training: when starting from random initialization, run grad check again after W and b have been updated by some training
