🎲 [AI] Gradient-Based Optimizers, Summarized (feat. If in Doubt, Use Adam!)

mandu · May 6, 2025


This post is based on the FastCampus course '[skill-up] 처음부터 시작하는 딥러닝 유치원' (a deep-learning-from-scratch course), with additional material from my own study.


1. Learning Rate

1.1 Properties of the Learning Rate

Gradient descent update rule:

ฮธโ†ฮธโˆ’ฮทโˆ‡ฮธL(ฮธ)\theta โ† \theta - \eta \nabla _ \theta L(\theta)
ฮท\eta: Learning rate

  • Large LR: risk of divergence
  • Small LR: slow convergence, and a chance of getting stuck in local minima (especially when the model has relatively few parameters)
  • Ultimately, since the shape of the loss surface is unknown, the optimal learning rate is hard to pin down (the toy sketch below illustrates the trade-off)
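
To make the trade-off concrete, here is a minimal toy sketch (my own example, not from the lecture): plain gradient descent on the 1-D quadratic $L(\theta) = \theta^2$, whose gradient is $2\theta$, run with three different learning rates.

# Toy example: gradient descent on L(theta) = theta^2 (gradient = 2 * theta).
def run_gd(lr, theta=5.0, steps=20):
    for _ in range(steps):
        theta = theta - lr * 2 * theta  # theta <- theta - eta * dL/dtheta
    return theta

print(run_gd(lr=0.01))  # small LR: still far from the minimum at 0 (slow convergence)
print(run_gd(lr=0.9))   # reasonable LR: lands close to 0
print(run_gd(lr=1.1))   # too-large LR: |theta| grows every step (divergence)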

1.2 Learning Rate Scheduling

  • Use a large LR early in training and switch to a smaller LR later on
  • But what counts as a "large" LR? Where exactly do "early" and "late" split?
    → Scheduling can actually end up adding more hyperparameters (a PyTorch scheduler sketch follows below)
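
As one concrete example, PyTorch provides schedulers in torch.optim.lr_scheduler. The sketch below uses StepLR to decay the LR by a factor of 0.1 every 1000 epochs; the model, step_size, and gamma values are placeholders I chose for illustration, not values from the lecture.

import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(8, 1)                                    # placeholder model
optimizer = optim.SGD(model.parameters(), lr=1e-2)
scheduler = StepLR(optimizer, step_size=1000, gamma=0.1)   # LR *= 0.1 every 1000 epochs

for epoch in range(4000):
    # ... forward pass, loss.backward(), optimizer.step() would run here first ...
    scheduler.step()                                       # advance the LR schedule once per epoch

print(optimizer.param_groups[0]["lr"])                     # 1e-2 decayed four times -> 1e-6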

2. Optimizer Algorithms

2.1 Algorithm Overview

Optimizer | Characteristics
--- | ---
SGD | Basic and stable, but requires tuning
Momentum | Adds inertia to the update, accelerating convergence
AdaGrad | Automatically adapts the learning rate per parameter
RMSProp | Adapts the learning rate based on recent gradients
Adam | Momentum + AdaGrad-style adaptation; strong performance
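
All of these are available in torch.optim. A minimal sketch of how each is constructed (the model and the lr/momentum values are placeholders, not tuned settings):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(8, 1)  # placeholder model

sgd      = optim.SGD(model.parameters(), lr=1e-2)
momentum = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # SGD with Momentum
adagrad  = optim.Adagrad(model.parameters(), lr=1e-2)
rmsprop  = optim.RMSprop(model.parameters(), lr=1e-3)
adam     = optim.Adam(model.parameters())  # defaults: lr=1e-3, betas=(0.9, 0.999)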

2.2 SGD with Momentum

  • Blends in the previous update at a fixed ratio ($\gamma$), encouraging faster convergence
  • The accumulated inertia can also help the optimizer escape local minima (a common formulation is given below)
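
One common way to write the update, using the $\gamma$ above and $g_t = \nabla_\theta L(\theta_t)$ (notation mine):

$v_t = \gamma v_{t-1} + \eta g_t$
$\theta_{t+1} = \theta_t - v_t$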

2.3 AdaGrad

  • Each parameter gets its own learning rate
  • $g_t \circ g_t$ can simply be read as $g_t^2$ (an element-wise square)
  • $\epsilon$ is a small constant that prevents division by zero
  • The learning rate for the current $g_t$ is adjusted using the L2 norm of $g_1$ through $g_t$
  • If the gradients so far have been large, the current update is scaled down
  • In other words, the learning rate is scaled by each parameter's accumulated sum of squared gradients (see the update rule below)
    → The learning rate keeps shrinking as training progresses
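
One common way to write the AdaGrad update with the symbols above (accumulator $r_t$, element-wise product $\circ$, stabilizer $\epsilon$):

$r_t = r_{t-1} + g_t \circ g_t$
$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{r_t} + \epsilon} \circ g_t$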

2.4 RMSProp / AdaDelta

  • Addresses AdaGrad's weakness: the learning rate shrinking too quickly
  • Instead of accumulating every past gradient, it keeps an exponential moving average of recent squared gradients
    → prevents the learning rate from decaying too fast
  • This allows stable and fast convergence
  • RMSProp still uses a fixed global learning rate,
  • while AdaDelta builds on RMSProp and scales its updates on its own, without an explicit learning rate (the RMSProp accumulator is sketched below)
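
RMSProp's accumulator in the same notation, where $\rho$ (my label) is the decay rate of the moving average:

$r_t = \rho \, r_{t-1} + (1 - \rho) \, g_t \circ g_t$
$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{r_t} + \epsilon} \circ g_t$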

2.5 Adam (Adaptive Moment Estimation)

  • Combines Momentum with AdaGrad-style per-parameter learning rates
  • Has the hyperparameters $\rho_1$, $\rho_2$, and $\eta$
    → The defaults are usually fine, but veeery carefully engineered models still need tuning
  • $s_t$: the momentum (first-moment) term, $r_t$: the learning-rate scaling (second-moment) term
  • Offers fast convergence and stability
  • Currently the most widely used optimizer (the update equations are sketched below)
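
The standard Adam update written with the symbols above ($\hat{s}_t$, $\hat{r}_t$ are the bias-corrected estimates from the original paper):

$s_t = \rho_1 s_{t-1} + (1 - \rho_1) \, g_t, \qquad r_t = \rho_2 r_{t-1} + (1 - \rho_2) \, g_t \circ g_t$
$\hat{s}_t = \dfrac{s_t}{1 - \rho_1^t}, \qquad \hat{r}_t = \dfrac{r_t}{1 - \rho_2^t}$
$\theta_{t+1} = \theta_t - \eta \, \dfrac{\hat{s}_t}{\sqrt{\hat{r}_t} + \epsilon}$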

3. PyTorch Source Code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.datasets import fetch_california_housing

california = fetch_california_housing()

df = pd.DataFrame(california.data, columns=california.feature_names)
df["Target"] = california.target
# df.tail()

# sns.pairplot(df.sample(1000))
# plt.show()

scaler = StandardScaler()
scaler.fit(df.values[:, :-1])
df.iloc[:, :-1] = scaler.transform(df.values[:, :-1])  # assign via .iloc; writing into df.values may only modify a copy

# sns.pairplot(df.sample(1000))
# plt.show()

data = torch.from_numpy(df.values).float()

print(data.shape)

x = data[:, :-1]
y = data[:, -1:]

print(x.shape, y.shape)

n_epochs = 4000
batch_size = 128
print_interval = 200
# learning_rate = 1e-5  # not needed here: Adam's default learning rate is used

model = nn.Sequential(
    nn.Linear(x.size(-1), 10),
    nn.LeakyReLU(),
    nn.Linear(10, 9),
    nn.LeakyReLU(),
    nn.Linear(9, 8),
    nn.LeakyReLU(),
    nn.Linear(8, 7),
    nn.LeakyReLU(),
    nn.Linear(7, 6),
    nn.LeakyReLU(),
    nn.Linear(6, 5),
    nn.LeakyReLU(),    
    nn.Linear(5, 4),
    nn.LeakyReLU(),
    nn.Linear(4, 3),
    nn.LeakyReLU(),
    nn.Linear(3, y.size(-1)),
)

print(model)

optimizer = optim.Adam(model.parameters())  # Adam with its default learning rate (1e-3)
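# To try the other optimizers from Section 2 in this script, you could swap in, e.g.:
#   optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # SGD with Momentum
#   optimizer = optim.RMSprop(model.parameters())                     # RMSProp
# (the lr/momentum values here are illustrative, not tuned)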

# |x| = (total_size, input_dim)
# |y| = (total_size, output_dim)

for i in range(n_epochs):
    # Shuffle the sample indices before every epoch.
    indices = torch.randperm(x.size(0))
    x_ = torch.index_select(x, dim=0, index=indices)  # x and y must be shuffled with the same indices
    y_ = torch.index_select(y, dim=0, index=indices)
    
    x_ = x_.split(batch_size, dim=0)
    y_ = y_.split(batch_size, dim=0)
    # |x_[i]| = (batch_size, input_dim)
    # |y_[i]| = (batch_size, output_dim)
    
    y_hat = []
    total_loss = 0
    
    for x_i, y_i in zip(x_, y_):
        # |x_i| = |x_[i]|
        # |y_i| = |y_[i]|
        y_hat_i = model(x_i)
        loss = F.mse_loss(y_hat_i, y_i)

        optimizer.zero_grad()
        loss.backward()

        optimizer.step()
        
        total_loss += float(loss)  # cast to float so the computation graph is released (avoids a memory leak)
        y_hat += [y_hat_i]

    total_loss = total_loss / len(x_)
    if (i + 1) % print_interval == 0:
        print('Epoch %d: loss=%.4e' % (i + 1, total_loss))
    
y_hat = torch.cat(y_hat, dim=0)
y = torch.cat(y_, dim=0)
# |y_hat| = (total_size, output_dim)
# |y| = (total_size, output_dim)

df = pd.DataFrame(torch.cat([y, y_hat], dim=1).detach().numpy(),
                  columns=["y", "y_hat"])

sns.pairplot(df, height=5)
plt.show()