📌 These are personal notes I took while following the University of Michigan's 'Deep Learning for Computer Vision' lectures. If you spot any errors or have feedback, please let me know and I will gladly incorporate it.
(The content is almost identical to Stanford's cs231n, so that course is also a useful reference.) 📌


1. Learning Rate Schedules

1) Comparison

  • Interpretation
    • very high LR: the loss blows up
    • low LR: training progresses very slowly
    • high LR: converges very quickly, but the loss does not get as low
    • good LR: just right
  • Question
    • Q. Which LR is the best one to use?
    • A. Any of them can work! Start with a high LR and decrease it over time. = this is called an LR schedule

2) LR Decay

a. Step Schedule

  • Concept
    • Decrease the LR at fixed points during training
    • ex. ResNet → start at 0.1 and multiply by 0.1 every 30 epochs (0.1 → 0.1*0.1 → 0.1*0.1*0.1, ...)
  • Problems
    • Too much trial and error
      = adds too many new hyperparameters to model training
      = too many combinations to consider when tuning
      = you have to pick the specific points at which to decay the LR
      ⇒ you must decide how many iterations to wait between decays & what LR to decay to
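A minimal sketch of this recipe (my own illustration, not code from the lecture), assuming a base LR of 0.1 dropped by 10x every 30 epochs; PyTorch's built-in `torch.optim.lr_scheduler.StepLR` / `MultiStepLR` implement the same idea:

```python
# Step decay: start at base_lr and multiply by decay_factor every step_size epochs.
def step_lr(base_lr, epoch, decay_factor=0.1, step_size=30):
    return base_lr * (decay_factor ** (epoch // step_size))

# Epochs 0-29 -> 0.1, 30-59 -> 0.01, 60-89 -> 0.001
print([step_lr(0.1, e) for e in (0, 30, 60)])
```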

b. Cosine Schedule

  • Concept
    • Instead of choosing specific points at which to decay the LR, you only set the initial LR (and the total number of epochs)
      • far fewer hyperparameters than before, so it is easier to train
      • train longer ↑ → performance ↑
  • Interpretation
    • The LR drops around the halfway point
      = start with a high LR, and near the end of training the LR approaches 0
  • Problems
    • Slightly higher computational overhead
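A minimal sketch of the cosine rule (my own illustration), assuming alpha_0 is the initial LR and T is the total number of epochs, i.e. alpha_t = 0.5 * alpha_0 * (1 + cos(pi * t / T)):

```python
import math

# Cosine decay: starts at alpha_0 and smoothly approaches 0 at the end of training.
def cosine_lr(alpha_0, t, T):
    return 0.5 * alpha_0 * (1.0 + math.cos(math.pi * t / T))

print([round(cosine_lr(0.1, t, 100), 4) for t in (0, 50, 100)])  # [0.1, 0.05, 0.0]
```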

c. Linear Schedule

  • Concept
    • Simpler than the schedules above: the LR decreases linearly from its initial value down to 0 (see the sketch after the Inverse Sqrt section below)
  • cf) There is too little research to say whether cosine or linear is better.
    • Different domains tend to prefer different LR schedules
      • CV: cosine schedule preferred
      • NLP: linear preferred

d. Inverse Sqrt Schedule

  • Concept
    • Uses the square root: the LR decays in proportion to 1/√t
  • Problems
    • Drops off very sharply from the initial high LR, so little time is spent at high learning rates
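A minimal sketch of the linear and inverse-sqrt rules together (my own illustration), assuming alpha_0 is the initial LR and T the total number of epochs; the +1 in the inverse-sqrt rule is just to avoid dividing by zero at t = 0:

```python
import math

def linear_lr(alpha_0, t, T):
    """Linear decay: alpha_t = alpha_0 * (1 - t / T)."""
    return alpha_0 * (1.0 - t / T)

def inv_sqrt_lr(alpha_0, t):
    """Inverse sqrt decay: alpha_t = alpha_0 / sqrt(t + 1)."""
    return alpha_0 / math.sqrt(t + 1)

# Inverse sqrt falls off quickly: 0.1 -> 0.05 after 3 steps, 0.01 after 99 steps.
print([round(inv_sqrt_lr(0.1, t), 4) for t in (0, 3, 99)])
```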

e. Constant

  • Concept
    • The most common choice
    • Works better than you might expect (just using this is perfectly fine)
  • Difference from the other schedules
    • It does not make the difference between the model working and not working
    • Going from constant to a more complex schedule typically buys a few % better performance
    • If you just need the model to work, constant is a fine choice
  • cf) SGD+Momentum → the choice of LR decay schedule matters
    RMSProp or Adam → just using a constant LR is fine
  • Related question Q. Can the loss go up, come back down, and then go up again?
    A. Yes. If you do not account for cases where the gradient becomes zero, you can see bad dynamics depending on the type of task, and data corruption can make the loss explode.
    (Not a general answer; it varies case by case.)

3) Early Stopping

  • Concept
    • Stop iterating when the validation accuracy is about to decrease (before overfitting sets in)
      • Save a model snapshot at each iteration, then take the weights that worked best on the val set (a minimal sketch follows below)
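A minimal sketch of that loop (my own illustration, assumed PyTorch; `train_one_epoch` and `evaluate` are hypothetical helpers supplied by the caller):

```python
import copy

def train_with_early_stopping(model, optimizer, train_loader, val_loader,
                              train_one_epoch, evaluate, num_epochs):
    """Keep the snapshot of the weights that works best on the validation set."""
    best_acc = 0.0
    best_state = copy.deepcopy(model.state_dict())
    for _ in range(num_epochs):
        train_one_epoch(model, optimizer, train_loader)   # hypothetical helper
        val_acc = evaluate(model, val_loader)              # hypothetical helper
        if val_acc > best_acc:                             # val accuracy improved
            best_acc = val_acc
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)                      # roll back to the best checkpoint
    return best_acc
```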




2. Choosing Hyperparameters (when you have lots of GPUs)

๐Ÿ“ Grid, Random Search
1) ๋ฐฉ๋ฒ•

a. Grid Search

  • Concept
    • Try hyperparameter values from a predetermined set

b. Random Search

  • Concept
    • Sample random values within a chosen range

2) Comparison

  • Interpretation
    • Grid Search: captures the important parameters less well
    • Random Search: samples the important parameters more densely, so it is more likely to find good values
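A minimal random-search sketch (my own illustration; `run_trial` is a hypothetical helper that trains briefly and returns validation accuracy, and the ranges are just examples), sampling LR and weight decay log-uniformly:

```python
import math
import random

def log_uniform(lo, hi):
    """Sample log-uniformly between lo and hi (natural for LR and weight decay)."""
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

def random_search(run_trial, num_trials=20):
    best = None
    for _ in range(num_trials):
        cfg = {"lr": log_uniform(1e-4, 1e-1), "weight_decay": log_uniform(1e-5, 1e-2)}
        acc = run_trial(**cfg)               # hypothetical: short training run -> val accuracy
        if best is None or acc > best[0]:
            best = (acc, cfg)
    return best                              # (best val accuracy, best config)
```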

3) Random Search Weight decay




3. Choosing Hyperparameters (when you do not have lots of GPUs)

📝 A 7-step process

1) The process

a. Check the initial loss

  • With weight decay turned off, check the very first loss value
    ex. softmax → if the initial loss is not log(C), there is a bug in the network
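A quick sanity-check sketch (my own illustration): at initialization a softmax classifier over C classes should predict roughly uniformly, so the expected cross-entropy loss is about log(C):

```python
import math

num_classes = 10                       # e.g. CIFAR-10
expected_loss = math.log(num_classes)  # ~2.3026; the first measured loss should be close to this
print(expected_loss)
```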

b. Overfit a small sample

  • Check that you can reach 100% accuracy on a tiny training set (5–10 minibatches)
    • If the loss will not go down, reconsider the LR and the weight initialization (see the sketch below)
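A minimal sketch of this check (my own illustration, assumed PyTorch; `model` and `small_batches` are supplied by the caller):

```python
import torch

def overfit_small_sample(model, small_batches, lr=1e-2, steps=500):
    """Train on the same 5-10 minibatches over and over; the loss should reach ~0."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        for x, y in small_batches:           # small_batches: fixed list of (inputs, labels)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    # If this loss refuses to approach 0, revisit the LR and the weight initialization.
    return loss.item()
```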

c. Find an LR that makes the loss go down

  • Keep the architecture from Step 2 fixed, use all of the training data, and try a few LRs for ~100 iterations
    → find an LR that makes the loss decrease (a small sweep sketch follows below)
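A compact sweep sketch (my own illustration; `train_for` is a hypothetical helper that trains for a fixed number of iterations and returns the final loss):

```python
# Reasonable LRs to try are roughly 1e-1, 1e-2, 1e-3, 1e-4.
candidate_lrs = [1e-1, 1e-2, 1e-3, 1e-4]
final_losses = {lr: train_for(lr, num_iters=100) for lr in candidate_lrs}  # hypothetical helper
best_lr = min(final_losses, key=final_losses.get)
```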

d. Run 1–5 epochs, adjusting weight decay

  • You should not be seeing terribly low performance at this stage

e. Refine grid, train longer

  • Take the models picked in Step 4 and train them for more epochs
    → this can take a very long time

f. Check the learning curves

  • Plot the train loss as a moving average of the raw per-iteration values (a smoothing sketch follows after this list)

  • train loss

    • Interpretation

      • The loss stays flat at first and then suddenly drops
        = the weight initialization is bad (no progress early in training)

    • Interpretation

      • The loss decreases but then plateaus even though it looks like it could keep falling
        = the LR is set poorly (it was probably too high)

    • Interpretation

      • The LR was decayed too early
        = wait until the loss flattens out before decaying
  • train, val accuracy

    • Interpretation

      • train and val accuracy rise together and keep a modest gap
        → fix) just keep training

    • Interpretation

      • The gap between train and val keeps growing = overfitting
        → fix) increase regularization (a larger λ for L2 regularization, or data augmentation), collect more data

    • Interpretation

      • Almost no gap = underfitting
        → fix) train longer, use a bigger model
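A small smoothing sketch (my own illustration) for plotting the train loss, since the raw per-iteration loss is too noisy to read directly:

```python
def smooth(losses, beta=0.98):
    """Bias-corrected exponential moving average of the per-iteration losses."""
    out, avg = [], 0.0
    for i, x in enumerate(losses):
        avg = beta * avg + (1 - beta) * x
        out.append(avg / (1 - beta ** (i + 1)))  # correct the early-iteration bias
    return out
```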

g. GOTO step5

  • Go back and keep adjusting hyperparameters:
  • LR, LR decay schedule, update type
  • regularization (L2, dropout strength)




4. After Training: Model Ensembles (Tips & tricks: LR schedules, Polyak averaging)

📝 Ensembles, transfer learning, large-batch training

1) Model Ensembles

  • Concept

    • Train multiple independent models
    • Average their predictions at test time
    • Ensembling typically buys about 2% extra performance
  • Tips & Tricks (getting an ensemble effect from a single model)

    • Methods

      • Use the LR schedule

        : Decay the LR, then periodically jump it back up; save a snapshot of the model in each cycle and average the snapshots' predictions to implement the ensemble

      • Polyak averaging

        : Instead of the parameters at the end of training, use a moving average of the parameters taken during training at test time (see the sketch below)
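A minimal Polyak-averaging sketch (my own illustration, assumed PyTorch): keep an exponential moving average of the weights while training and evaluate the averaged copy instead of the raw model:

```python
import copy
import torch

def make_ema(model):
    """Create a frozen copy of the model that will hold the averaged weights."""
    ema_model = copy.deepcopy(model)
    for p in ema_model.parameters():
        p.requires_grad_(False)
    return ema_model

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    """Call after every optimizer step: ema <- decay * ema + (1 - decay) * current."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)
```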




5. After Training: Transfer Learning

๐Ÿ“ feature extract, fine tuning

1) Motivation

  • Offers a solution to the problem that CNNs need a lot of training data

2) Applying it to a CNN

  • Concept

    • Very effective when the target dataset is small
    • Use the CNN as a feature extractor → train a linear classifier on top
  • Usage example (a code sketch follows at the end of this section)

    • Apply the features to the small dataset we actually care about (e.g., instead of ImageNet's 1000-category classification, we only want to classify 10 classes)
      → randomly reinitialize the last matrix (ex. ImageNet: 4096×1000, new classes: 4096×10)
      → (Freeze these) fix the weights of all earlier layers
      → train a linear classifier
      → train only the last layer's parameters
      → let it converge on the new data
  • Performance comparison example

    • Interpretation

      • Applying AlexNet features to the earlier methods gives better performance

    • Interpretation

      • Take a model pretrained on ImageNet as a feature extractor → run nearest neighbors on top of the feature vectors
        = the simplest example of transfer learning (just extract feature vectors and use them)
      • Retrieves similar images with the nearest-neighbor method
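A minimal feature-extraction sketch (my own illustration, assuming a recent torchvision; I use ResNet-18 with 10 target classes for brevity, whereas the AlexNet example above would reinitialize a 4096×10 matrix):

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # pretrained on ImageNet
for p in model.parameters():
    p.requires_grad = False                      # freeze all pretrained layers
model.fc = nn.Linear(model.fc.in_features, 10)   # reinitialize the last matrix (512 x 10 here)
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)  # train only the head
```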

3) Bigger dataset: Fine-Tuning

  • Concept

    • Throw away the last layer and replace it with a new one (initialized for the categories of the new dataset)
      = retrain the whole model on the new classification dataset
    • Not a frozen feature extractor: backpropagate through the actual model and keep updating its weights → better downstream performance
  • Tricks & tips for improving downstream performance (a sketch follows at the end of this section)

    • First do feature extraction → train a linear model on top → then fine-tune the whole model
    • When fine-tuning, you may need to lower the LR a lot
    • Freeze the lower layers to save compute.
  • Performance comparison

    • Interpretation
      • Frozen feature extraction: freeze the entire network and only use it to extract features
      • Fine-tuning: keep training the entire network on the new dataset → even better performance
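A minimal fine-tuning sketch (my own illustration, same assumptions as above: recent torchvision, ResNet-18, 10 classes); the specific layer frozen and the lowered LR are illustrative choices, not values from the lecture:

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 10)    # new head for the new dataset

for p in model.conv1.parameters():                       # freeze a low layer to save compute
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)  # much lower LR than pretraining
```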




6. After Training: Transfer Learning - Architecture Matters & Feature Generalization

  • Interpretation
    • Architectures that do well on ImageNet tend to do well on other tasks too
    • Examples

1) Feature generalization in transfer learning

| | Very similar dataset | Very different dataset |
| --- | --- | --- |
| Very little data | Use a linear classifier on the topmost layer | You're in trouble; try linear classifiers from different stages |
| Quite a lot of data | Fine-tune a few layers | Fine-tune a larger number of layers |

2) Examples of transfer learning in use

a. Object detection, image captioning

  • Interpretation
    • Both start from a CNN pretrained on ImageNet + fine-tuning

b.
Still being organized . . ⚠ 🚧
