๐Ÿ‘ฉ๐Ÿปโ€๐Ÿ’ปStandford CS231N Deep Learning -6

๋ฐ•์ˆ˜๋นˆ · June 1, 2021


6. Training Neural Networks, part 1

โœ” Activation Function

1. Sigmoid



The sigmoid function squashes every input into the range [0, 1]. It is the simplest activation function, but it has problems.
1. Saturated neurons "kill" the gradients
   During backpropagation, when the local gradient is near 0, almost no gradient flows back through the neuron.
2. Sigmoid outputs are not zero-centered
   If the input to a neuron is always positive, the gradients on w during backpropagation are either all positive or all negative. (This is why zero-mean input data is preferable.)
3. exp() is a bit compute expensive

2. Tanh

  • squashes into [-1, 1]
  • zero-centered

3. ReLU

  • Rectified Linear Unit
  • f(x) = max(0, x)
  • ์ˆ˜๋ ด์„ ๊ธ‰์†๋„๋กœ ๊ฐ€์†ํ™” ํ•œ๋‹ค.
  • ๋‹จ์ˆœํ•ด์„œ ๊ณ„์‚ฐ์ด ํšจ์œจ์ ์ด๋‹ค.
  • positive biases ํ•œ ๊ฒฝ์šฐ ์ข‹๋‹ค. (์Œ์ˆ˜๊ฐ€ ๋‹ค ์ฃฝ์–ด๋ฒ„๋ฆฌ๋‹ˆ๊นŒ)

4. Leaky ReLU

  • Introduced to fix the problem of ReLU neurons dying for negative inputs
  • Has a small slope of 0.01 for negative inputs

5. Maxout

ReLU์˜ ์žฅ์ ์€ ์ทจํ•˜๊ณ , dying problem์€ ๊ฐ–์ง€ ์•Š์Œ. ํ•˜์ง€๋งŒ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๊ฐ€ ๋‘๋ฐฐ๋กœ ๋Š˜์–ด๋‚œ๋‹ค๋Š” ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธด๋‹ค.


โœ” Data Preprocessing

  • Zero-centering the data also helps at the activations (it avoids the all-positive-input problem above).
  • Normalization is not used much for images, since pixel values are already on a comparable scale; for other machine-learning problems, normalizing makes every feature contribute to the algorithm to a similar degree.
  • test set์—๋„ ์ ์šฉํ•œ๋‹ค. (trn์˜ mean์œผ๋กœ tst๋„ preprocess ํ•จ)
  • ์ด๋ฏธ์ง€๋Š” 3 channel์„ ์‚ฌ์šฉํ•ด RGB ๊ฐ๊ฐ์˜ mean์„ ์‚ฌ์šฉํ•ด์„œ ๋บ€๋‹ค. (batch์˜ ํ‰๊ท ์„ ์‚ฌ์šฉํ•ด๋„ ๋œ๋‹ค...?)
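
A minimal sketch of this step, assuming CIFAR-style arrays of shape (N, H, W, 3) with values in [0, 255]; the training-set mean is reused for the test set:

```python
import numpy as np

# toy stand-ins for the real train/test images
X_train = np.random.randint(0, 256, size=(500, 32, 32, 3)).astype(np.float32)
X_test  = np.random.randint(0, 256, size=(100, 32, 32, 3)).astype(np.float32)

# one mean per R, G, B channel, computed on the training set only
mean_rgb = X_train.mean(axis=(0, 1, 2))

X_train -= mean_rgb   # zero-center the training data
X_test  -= mean_rgb   # preprocess the test set with the *training* mean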

โœ” Weight Initialization

  • If every w is initialized to 0, every neuron computes the same output and receives the same gradient, so the whole network learns the same thing. Since we want each neuron to learn something different, the weights should be initialized to different values.
  • What if we use small random numbers?
    • In a deep network this becomes a problem: the activations all collapse toward 0 (the std shrinks to 0).
  • Xavier initialization
    • Scales the weights by 1/sqrt(n) (matches the variance of the input and the output)
    • w = np.random.randn(n) / np.sqrt(n)
    • ReLU zeroes out half of the outputs, so half of the variance is lost at every layer and the activations collapse again.
    • The fix for ReLU is w = np.random.randn(n) * np.sqrt(2.0 / n). (A small simulation of both cases follows below.)

โœ” Batch Normalization

  • ์—ฐ์‚ฐ์„ ๊ณ„์†ํ•˜๋‹ค๋ณด๋ฉด w์˜ scale์ด ์ปค์ง€๋Š”๋ฐ, batch norm ํ†ตํ•ด์„œ ์ผ์ •ํ•˜๊ฒŒ ์œ ์ง€์‹œ์ผœ์คŒ
  • ๋ชจ๋“  ๋ ˆ์ด์–ด๊ฐ€ Unit Gaussian(ํ‘œ์ค€์ •๊ทœ๋ถ„ํฌ) ๋”ฐ๋ฅด๊ฒŒ ํ•œ๋‹ค.
  • ๋†’์€ learning rates
  • initialization์˜ ์˜์กด๋„๋ฅผ ๋‚ฎ์ถค
  • regularization์˜ ์—ญํ• ์„ ํ•ด์คŒ

โœ” Babysitting the Learning Process

  1. Preprocess the data
  2. Choose the architecture
  3. Double check that the loss is reasonable. (Given the number of classes, is it a sensible value? See the sketch after this list.)
  4. Train
  5. Train with regularization and tune the learning rate
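
One way to do the "is the loss reasonable" check, assuming a softmax (cross-entropy) classifier over C classes: with random weights, each class should get probability around 1/C, so the initial loss should be near -log(1/C).

```python
import numpy as np

num_classes = 10  # e.g. CIFAR-10
# with an untrained classifier each class gets probability ~1/C,
# so the softmax loss should start near -log(1/C)
expected_initial_loss = -np.log(1.0 / num_classes)
print(expected_initial_loss)  # ~2.30 for 10 classes; a very different starting
                              # value usually points to a bug in the loss or data
```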

โœ” Hyperparameter Optimization

Cross-validation strategy

  • Train on the training set, evaluate on the validation set
  • Sample hyperparameter values on a log scale
  • Random Search tends to hit the important regions better than Grid Search (a small sketch follows below).
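
A minimal random-search sketch, assuming we tune the learning rate and regularization strength on a log scale; train_and_eval is a hypothetical function that trains briefly and returns validation accuracy.

```python
import numpy as np

def random_search(train_and_eval, num_trials=20, seed=0):
    """Sample lr and reg on a log scale, keep the best validation accuracy."""
    rng = np.random.default_rng(seed)
    best_params, best_acc = None, -1.0
    for _ in range(num_trials):
        lr = 10 ** rng.uniform(-6, -2)     # learning rate sampled in [1e-6, 1e-2]
        reg = 10 ** rng.uniform(-5, 0)     # regularization sampled in [1e-5, 1]
        val_acc = train_and_eval(lr, reg)  # hypothetical: short training run
        if val_acc > best_acc:
            best_params, best_acc = (lr, reg), val_acc
    return best_params, best_acc
```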

Learning rate

  • Check the loss curve to judge whether the learning rate is appropriate (a rough heuristic sketch follows below).
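
A rough, purely heuristic sketch of reading a loss curve, under my own thresholds: an exploding or NaN loss suggests the learning rate is too high, a nearly flat one suggests it is too low.

```python
import numpy as np

def diagnose_learning_rate(loss_history):
    """Very rough heuristic for reading a loss curve (loss_history: list of floats)."""
    losses = np.asarray(loss_history, dtype=float)
    if not np.isfinite(losses).all() or losses[-1] > 3 * losses[0]:
        return "loss exploded -> learning rate probably too high"
    if losses[0] - losses[-1] < 0.01 * losses[0]:
        return "loss barely moved -> learning rate probably too low"
    return "loss is decreasing -> learning rate looks reasonable"
```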

๊ผผ์ง€๋ฝ

์•ž๋ถ€๋ถ„์—๋งŒ ์ง‘์ค‘์„ ๋„ˆ๋ฌด ํ•ด์„œ ๋‚˜์ค‘์—” ์กฐ๋Š๋ผ ์ž˜ ์ดํ•ด๋„ ๋ชปํ–ˆ๋‹ค. ๋…ธํŠธ๊นŒ์ง€ ๋ณด๋ ค๊ณ  ์š•์‹ฌ๋‚ด๊ธฐ ๋ณด๋‹จ, ๊ทธ๋ƒฅ ์˜์ƒ ๋“ค์„ ๋•Œ ์ตœ๋Œ€ํ•œ์œผ๋กœ ์ดํ•ด๋ฅผ ํ•ด์•ผ์ง€ใ…œใ…œ ์˜ค๋žœ๋งŒ์— ๊ฐ•์˜๋ž€ ๊ฒƒ์„ ๋“ค์œผ๋‹ˆ ๋„ˆ๋ฌด ํž˜๋“ค๋‹ค...
