🎲 [AI] Stochastic Gradient Descent

mandu · May 5, 2025


ํ•ด๋‹น ๊ธ€์€ FastCampus - '[skill-up] ์ฒ˜์Œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” ๋”ฅ๋Ÿฌ๋‹ ์œ ์น˜์› ๊ฐ•์˜๋ฅผ ๋“ฃ๊ณ ,
์ถ”๊ฐ€ ํ•™์Šตํ•œ ๋‚ด์šฉ์„ ๋ง๋ถ™์—ฌ ์ž‘์„ฑํ•˜์˜€์Šต๋‹ˆ๋‹ค.


1. Gradient Descent vs Stochastic Gradient Descent

1.1 Gradient Descent (GD)

  • $\mathrm{Loss} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
  • ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์˜ Loss๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํŒŒ๋ผ๋ฏธํ„ฐ 1ํšŒ ์—…๋ฐ์ดํŠธ
  • ์ •ํ™•ํ•œ gradient ๋ฐฉํ–ฅ์ด์ง€๋งŒ ๊ณ„์‚ฐ๋Ÿ‰์ด ๋งค์šฐ ํผ
  • ์‹ค์ œ๋กœ N์˜ ํฌ๊ธฐ๊ฐ€ ๋งค์šฐ ํฌ๋ฉด OOM ์—๋Ÿฌ ๋‚  ๊ฒƒ

1.2 Stochastic Gradient Descent (SGD)

  • $\mathrm{Loss} = \frac{1}{K} \sum_{i=1}^{K} (y_i - \hat{y}_i)^2$
  • Updates the parameters many times, each time using the loss over K randomly selected samples (a mini-batch)
  • Each update is small and fast to compute, but the gradient can be noisy (see the sketch below)
  • Stochastic (probabilistic) ↔ Deterministic
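
By contrast, a minimal sketch of one SGD step on the same kind of toy data: each update sees only K randomly drawn samples. Again, all names here are illustrative:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1000, 3)                # same toy dataset as above
y = torch.randn(1000, 1)
w = torch.zeros(3, 1, requires_grad=True)

K = 64                                  # mini-batch size
idx = torch.randperm(x.size(0))[:K]     # K samples drawn at random

# One SGD step: the loss is computed on the mini-batch only,
# so the gradient is a noisy estimate of the full-batch gradient.
loss = F.mse_loss(x[idx] @ w, y[idx])
loss.backward()
with torch.no_grad():
    w -= 1e-2 * w.grad                  # many such cheap updates per epoch
    w.grad.zero_()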

1.3 SGD์˜ ํŠน์„ฑ

  • ๊ฐ ๋ฏธ๋‹ˆ๋ฐฐ์น˜ gradient๋Š” ์ „์ฒด gradient์™€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์œผ๋‚˜, ๊ธฐ๋Œ€๊ฐ’์€ ๊ฐ™์Œ
  • ๋ฏธ๋‹ˆ๋ฐฐ์น˜๊ฐ€ ์ž‘์„์ˆ˜๋ก gradient์˜ ๋ถ„์‚ฐ์ด ์ปค์ง
  • ์ด๋กœ ์ธํ•ด local minima์—์„œ ํƒˆ์ถœ ๊ฐ€๋Šฅ์„ฑ์ด ์ƒ๊ธฐ๊ธฐ๋„ ํ•จ
  • ๋ฐ˜๋ฉด batch size๊ฐ€ ํฌ๋ฉด ๋น ๋ฅธ ์ˆ˜๋ ด ๊ฐ€๋Šฅ
  • Pytorch์—์„œ optim.SGD๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์œผ๋ฉด GD์ฒ˜๋Ÿผ ์ž‘๋™ํ•˜๊ณ , ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋ฅผ ๋„ฃ์œผ๋ฉด SGD ๋˜๋Š” Mini-batch GD์ฒ˜๋Ÿผ ์ž‘๋™

2. Epochs and Iterations

2.1 Epoch

  • ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์ด ํ•œ ๋ฒˆ ํ•™์Šต๋œ ํšŸ์ˆ˜ (๋ฏธ๋‹ˆ๋ฐฐ์น˜๋ฅผ ๋น„๋ณต์›์ถ”์ถœ ์‹œ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šตํ•œ ๊ฒฝ์šฐ)
  • ๋ชจ๋“  ๋ฐ์ดํ„ฐ์…‹์˜ ์ƒ˜ํ”Œ๋“ค์ด forward & backward ๋œ ํšŸ์ˆ˜
  • Epoch์˜ ์‹œ์ž‘์— ๋ฐ์ดํ„ฐ์…‹์„ random shuffling ํ•ด์ค€ ํ›„, ๋ฏธ๋‹ˆ ๋ฐฐ์น˜๋กœ ๋‚˜๋ˆ”

2.2 Iteration

  • The number of times a single mini-batch has been trained on (one iteration = one parameter update)

2.3 ์ด ์—…๋ฐ์ดํŠธ ํšŸ์ˆ˜

  • ํŒŒ๋ผ๋ฏธํ„ฐ ์ „์ฒด ์—…๋ฐ์ดํŠธ ํšŸ์ˆ˜ = n_epochsย Xย n_iterationsn\_epochs \ X \ n\_iterations
  • This produces a nested (double) for loop:

for epoch_idx in range(n_epochs):
    for iteration_idx in range(n_iterations):
        ...  # one parameter update on one mini-batch

3. Effects of Batch Size

3.1 Full Batch (= the size of the entire dataset)

  • ์ „์ฒด ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ๋ฅผ ์ž˜ ๋ฐ˜์˜ํ•œ gradient
  • ๋งค์šฐ ์ •ํ™•ํ•˜์ง€๋งŒ ๊ณ„์‚ฐ๋Ÿ‰์ด ๋งŽ๊ณ  ์—…๋ฐ์ดํŠธ ํšŸ์ˆ˜๊ฐ€ ์ ์Œ

3.2 Small Batch

  • ํŽธํ–ฅ๋œ gradient๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์Œ
  • ์˜คํžˆ๋ ค local minima ํƒˆ์ถœ์„ ๋„์šธ ์ˆ˜ ์žˆ์Œ
  • ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด gradient๊ฐ€ noisyํ•˜์—ฌ ์ˆ˜๋ ด์ด ์–ด๋ ค์›Œ์งˆ ์ˆ˜ ์žˆ์Œ

3.3 What Is a Reasonable Batch Size?

  • Use a suitably sized power-of-two ($2^n$) batch size
    e.g. 64, 128, 256
  • That said, these days GPU hardware (memory) can support it, so much larger batch sizes are sometimes used as well
    e.g. 2048, 4096, ... (the sketch below shows how batch size changes the number of updates per epoch)
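
Since the number of updates per epoch is $\lceil N / batch\_size \rceil$, batch size directly trades update frequency against gradient quality. A quick sketch, using the California Housing dataset size (N = 20,640) from the practice code in section 4:

N = 20_640                                    # California Housing sample count
for batch_size in (64, 128, 256, 2048, 4096):
    n_iterations = (N + batch_size - 1) // batch_size   # ceil(N / batch_size)
    print(f"batch_size={batch_size:5d} -> {n_iterations:4d} updates per epoch")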

4. PyTorch Practice Code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.datasets import fetch_california_housing

california = fetch_california_housing()

df = pd.DataFrame(california.data, columns=california.feature_names)
df["Target"] = california.target
# df.tail()

# sns.pairplot(df.sample(1000))
# plt.show()

# Standardize the input features (all columns except Target).
# Assigning through df.iloc modifies the DataFrame reliably,
# unlike writing into the array returned by df.values.
scaler = StandardScaler()
df.iloc[:, :-1] = scaler.fit_transform(df.values[:, :-1])

# sns.pairplot(df.sample(1000))
# plt.show()

data = torch.from_numpy(df.values).float()

print(data.shape)

x = data[:, :-1]
y = data[:, -1:]

print(x.shape, y.shape)

n_epochs = 4000
batch_size = 128
print_interval = 200
learning_rate = 1e-5

# A small MLP for regression: input_dim -> 10 -> 9 -> ... -> 3 -> output_dim,
# with LeakyReLU activations between the linear layers.
model = nn.Sequential(
    nn.Linear(x.size(-1), 10),
    nn.LeakyReLU(),
    nn.Linear(10, 9),
    nn.LeakyReLU(),
    nn.Linear(9, 8),
    nn.LeakyReLU(),
    nn.Linear(8, 7),
    nn.LeakyReLU(),
    nn.Linear(7, 6),
    nn.LeakyReLU(),
    nn.Linear(6, 5),
    nn.LeakyReLU(),    
    nn.Linear(5, 4),
    nn.LeakyReLU(),
    nn.Linear(4, 3),
    nn.LeakyReLU(),
    nn.Linear(3, y.size(-1)),
)

print(model)

optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# |x| = (total_size, input_dim)
# |y| = (total_size, output_dim)

for i in range(n_epochs):
    # Shuffle the dataset at the start of every epoch.
    indices = torch.randperm(x.size(0))
    x_ = torch.index_select(x, dim=0, index=indices) # x and y must be indexed with the same permutation to keep pairs aligned
    y_ = torch.index_select(y, dim=0, index=indices)
    
    x_ = x_.split(batch_size, dim=0)
    y_ = y_.split(batch_size, dim=0)
    # |x_[i]| = (batch_size, input_dim)
    # |y_[i]| = (batch_size, output_dim)
    
    y_hat = []
    total_loss = 0
    
    for x_i, y_i in zip(x_, y_):
        # |x_i| = |x_[i]|
        # |y_i| = |y_[i]|
        y_hat_i = model(x_i)
        loss = F.mse_loss(y_hat_i, y_i)

        optimizer.zero_grad()
        loss.backward()

        optimizer.step()
        
        total_loss += float(loss) # float() detaches the value from the computation graph, preventing a memory leak
        y_hat += [y_hat_i]

    total_loss = total_loss / len(x_)
    if (i + 1) % print_interval == 0:
        print('Epoch %d: loss=%.4e' % (i + 1, total_loss))
    
# y_hat holds the final epoch's mini-batch predictions; y_ is the matching shuffled target.
y_hat = torch.cat(y_hat, dim=0)
y = torch.cat(y_, dim=0)
# |y_hat| = (total_size, output_dim)
# |y| = (total_size, output_dim)

df = pd.DataFrame(torch.cat([y, y_hat], dim=1).detach().numpy(),
                  columns=["y", "y_hat"])

sns.pairplot(df, height=5)
plt.show()
