🎲 [AI] Basic Optimization Method - Gradient Descent

mandu · May 4, 2025


ํ•ด๋‹น ๊ธ€์€ FastCampus - '[skill-up] ์ฒ˜์Œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” ๋”ฅ๋Ÿฌ๋‹ ์œ ์น˜์› ๊ฐ•์˜๋ฅผ ๋“ฃ๊ณ ,
์ถ”๊ฐ€ ํ•™์Šตํ•œ ๋‚ด์šฉ์„ ๋ง๋ถ™์—ฌ ์ž‘์„ฑํ•˜์˜€์Šต๋‹ˆ๋‹ค.


1. Derivatives


1.1 Slope

  • The rate at which a function's output (y) changes with respect to a change in its input (x)

1.2 Limits and Tangent Lines

  • As two points on the curve are brought infinitely close together, the secant slope becomes the slope of the tangent line
  • Differentiation is the process of finding this tangent slope, as the limit below makes precise

1.3 The Derivative Function

  • A function that generalizes the slope over the whole domain, rather than at one specific point

1.4 Newton's Notation vs. Leibniz's Notation

  • Newton's notation: convenient when there is only one variable
  • Leibniz's notation: convenient when there are two or more variables


1.5 Differentiating Composite Functions
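
  • A composite function is differentiated with the chain rule (stated here in its standard single-variable form), which is also what backpropagation relies on:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}, \quad \text{where } y = f(u),\; u = g(x)$$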


2. Partial Derivatives


2.1 Multivariable Functions

  • A function with two or more input variables

2.2 What Is a Partial Derivative?

  • Differentiating a multivariable function with respect to one variable while treating all the other variables as constants
  • Notation: ∂ (read as "round" or "partial")
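
A small worked example (my own illustration): for f(x, y) = x²y + y, holding the other variable constant gives

$$\frac{\partial f}{\partial x} = 2xy, \qquad \frac{\partial f}{\partial y} = x^2 + 1$$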

2.3 The Gradient Vector: Differentiating High-Dimensional Functions

  • Even when the input/output is a vector or a matrix, each partial derivative can still be taken
  • Notation: ∇ (nabla, del)
  • The result is expressed as a gradient vector, which carries both a direction and a magnitude
  • In other words, the gradient vector collects the partial derivatives with respect to each variable
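
Written out (standard definition), for a scalar function f(x₁, …, xₙ):

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)$$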


2.4 Why Do We Need Gradient Vectors?

  • We want to differentiate a scalar value, the loss, with respect to a parameter matrix!
  • What if we need to differentiate a DNN's intermediate output vector with respect to a parameter (weight) matrix? (See the shape convention below.)

2.5 PyTorch's Autograd Feature

import torch

x = torch.FloatTensor([[1, 2],
                       [3, 4]])
x.requires_grad = True  # track operations on x so gradients can be computed

x1 = x + 2
x2 = x - 2
x3 = x1 * x2   # element-wise: (x + 2)(x - 2) = x^2 - 4
y = x3.sum()

y.backward()   # backward() without arguments is only possible when y is a scalar
print(x.grad)  # dy/dx = 2x  ->  [[2., 4.], [6., 8.]]

3. The Gradient Descent Algorithm


3.1 What Is the Gradient Descent Algorithm?

  • An algorithm that finds the parameters θ that minimize the loss value
  • In other words, an algorithm for finding a function whose outputs fit the given data well

3.2 A 1D Example

  • Use the derivative with respect to x to move in the direction where the function value decreases
  • Instead of the point with the lowest loss (the global minimum), we may fall into a valley (a local minimum)
  • In fact, there is no guarantee that the minimum we have found is the global minimum
  • Gradient Descent update equation (written out below this list)
  • L: loss function (scalar)
  • θ: parameters (vector)
  • η: learning rate (a hyperparameter set empirically, between 0 and 1)
  • The gradient vector of the loss with respect to each parameter is used to update all parameters simultaneously
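
The update rule in the notation above, in its standard form:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$$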

3.3 Extending to Higher Dimensions

  • The same principle applies when θ (the weights W and biases b) is multi-dimensional
  • In the 2D case, each parameter is updated with its own partial derivative (see the component-wise form after this list)

  • Deep learning models have millions of parameters → the chance of getting stuck in a local minimum is low
    • When searching for the global minimum on a loss surface with millions or even hundreds of millions of dimensions,
    • it is hard for the local-minimum condition to be satisfied in every dimension at the same time
    • So there is no need to worry too much about local minima

3.4 Implementing It in PyTorch

import torch
import torch.nn.functional as F

target = torch.FloatTensor([[.1, .2, .3],
                            [.4, .5, .6],
                            [.7, .8, .9]])

x = torch.rand_like(target)
x.requires_grad = True
loss = F.mse_loss(x, target)

### start ###
threshold = 1e-5
learning_rate = 1.0
iter_cnt = 0

while loss > threshold:
    iter_cnt += 1

    loss.backward()  # calculate gradients (fills x.grad)

    x = x - learning_rate * x.grad  # gradient descent step

    x.detach_()             # cut x off from the autograd graph
    x.requires_grad_(True)  # start tracking gradients again for the next iteration

    loss = F.mse_loss(x, target)

    print('%d-th Loss: %.4e' % (iter_cnt, loss))
    print(x)
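
The same loop is usually written with an optimizer object instead of updating x by hand. A minimal sketch using torch.optim.SGD (my own variant, not from the lecture):

import torch
import torch.nn.functional as F

target = torch.FloatTensor([[.1, .2, .3],
                            [.4, .5, .6],
                            [.7, .8, .9]])

x = torch.rand_like(target).requires_grad_(True)
optimizer = torch.optim.SGD([x], lr=1.0)  # performs x <- x - lr * x.grad on step()

for i in range(1000):
    optimizer.zero_grad()          # clear gradients from the previous iteration
    loss = F.mse_loss(x, target)
    loss.backward()                # compute x.grad
    optimizer.step()               # apply the update
    if loss.item() < 1e-5:
        break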

4. Learning Rate


4.1 The Role of the Learning Rate

  • The step size that determines how large each parameter update is during training
  • If it is not set appropriately, the loss can diverge or convergence can become very slow

4.2 Optimizing the Learning Rate

  • A learning rate that is too large → divergence (the loss becomes inf or NaN)
  • A learning rate that is too small → very slow convergence, and the model may get stuck in a local minimum (rarely an issue when there are many parameters)

4.3 Learning Rate Tips

  • The appropriate value differs from model to model; there is no absolute rule
  • You need to optimize it through experiments to find hyperparameters that suit your model
  • If a good value is hard to find, use a small value such as 1e-4 and train for longer
  • Optimizers such as Adam can also adjust the step size automatically (see the one-line change below)
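
For example, swapping the optimizer in the earlier sketch is a one-line change (my own illustration; 1e-3 is just a common starting value):

optimizer = torch.optim.Adam([x], lr=1e-3)  # Adam adapts the effective step size per parameter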

5. Summary

  1. We train a model to approximate an unknown true function
  2. We update the parameters in the direction that reduces the loss function
  3. To do this, we compute the gradient
  4. and multiply it by the learning rate to perform the update
  • Gradient Descent is the fundamental optimization algorithm for minimizing the loss.