๐ŸŽฒ[AI] Gradient Vanishing, ReLU

mandu · May 5, 2025


ํ•ด๋‹น ๊ธ€์€ FastCampus - '[skill-up] ์ฒ˜์Œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” ๋”ฅ๋Ÿฌ๋‹ ์œ ์น˜์› ๊ฐ•์˜๋ฅผ ๋“ฃ๊ณ ,
์ถ”๊ฐ€ ํ•™์Šตํ•œ ๋‚ด์šฉ์„ ๋ง๋ถ™์—ฌ ์ž‘์„ฑํ•˜์˜€์Šต๋‹ˆ๋‹ค.


1. Gradient Vanishing

  • In deep learning, backpropagation computes gradients via the chain rule, reusing the gradients already computed in later layers
  • In a deep network, however, the gradient keeps shrinking as it flows toward the input layer
  • Why?
    • The gradients of activation functions such as Sigmoid and Tanh are less than or equal to 1 (for Sigmoid, at most 0.25)
    • The closer a layer is to the input, the more of these gradient terms are multiplied together
    • Since each term is smaller than 1, the gradient shrinks rapidly toward the input layer
      → Gradient Vanishing
  • As a result, the parameters of the front layers are barely updated
  • e.g. 0.2 × 0.6 × 0.2 = 0.024 — a runnable sketch of this effect follows this list
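
The same shrinking effect can be reproduced directly with autograd. The sketch below is my own illustration, not part of the course code; the depths and the layer width of 16 are arbitrary choices.

import torch
import torch.nn as nn

for depth in (2, 10, 30):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(16, 16), nn.Sigmoid()]
    net = nn.Sequential(*layers)

    x = torch.randn(1, 16, requires_grad=True)
    net(x).sum().backward()
    # The deeper the sigmoid stack, the smaller the gradient that reaches the input.
    print(f"depth={depth:2d}  |grad at input| = {x.grad.abs().mean().item():.2e}")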

2. ReLU (Rectified Linear Unit)

2.1 ์ •์˜

y = \mathrm{ReLU}(x) = \max(0, x)

  • ์ž…๋ ฅ์ด 0๋ณด๋‹ค ํฌ๋ฉด ๊ทธ๋Œ€๋กœ ์ถœ๋ ฅ, 0 ์ดํ•˜์ด๋ฉด 0 ์ถœ๋ ฅ
  • ๋‘ ๊ฐœ์˜ ์„ ํ˜• ํ•จ์ˆ˜๋กœ ๊ตฌ์„ฑ๋œ ๋‹จ์ˆœํ•œ ๊ตฌ์กฐ

2.2 Characteristics

  • ์–‘์ˆ˜ ์˜์—ญ์—์„œ์˜ ๋ฏธ๋ถ„๊ฐ’์€ ํ•ญ์ƒ 1 โ†’ ๊ธฐ์กด gradient ํ๋ฆ„ ์œ ์ง€
  • ๋น„์„ ํ˜•์„ฑ๊ณผ ์„ ํ˜•์„ฑ์„ ์ ์ ˆํžˆ ์กฐํ™”์‹œํ‚ด
  • ๊ณ„์‚ฐ์ด ๋งค์šฐ ๋‹จ์ˆœํ•˜๊ณ  ๋น ๋ฆ„
  • ReLU๋Š” ๊นŠ์€ ๋„คํŠธ์›Œํฌ์—์„œ์˜ ํ•™์Šต ์•ˆ์ •์„ฑ์„ ๋†’์—ฌ์ฃผ๋Š” ํ™œ์„ฑํ™” ํ•จ์ˆ˜

3. Leaky ReLU

3.1 ์ •์˜

\mathrm{LeakyReLU}(x) = \max(\alpha \cdot x,\; x), \quad \text{where } 0 < \alpha < 1

3.2 Characteristics

  • ReLU์˜ ๋‹จ์ : ์ž…๋ ฅ์ด 0 ์ดํ•˜์ผ ๋•Œ gradient๊ฐ€ 0์ด ๋˜์–ด ํ•™์Šต๋˜์ง€ ์•Š์Œ
  • Leaky ReLU๋Š” ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์Œ์ˆ˜ ์˜์—ญ์—๋„ ์ž‘์€ ๊ธฐ์šธ๊ธฐ๋ฅผ ๋ถ€์—ฌ
  • ํ•ญ์ƒ Leaky ReLU๊ฐ€ ๋” ์ข‹์€ ๊ฑด ์•„๋‹˜!

4. Summary

  • Activation functions such as Sigmoid and Tanh cause the gradient vanishing problem
  • ReLU mitigates gradient vanishing and speeds up training
  • Leaky ReLU compensates for ReLU's drawback (no learning on negative inputs)

5. PyTorch Practice Code

Compared with the Linear model used in ๐ŸŽฒ[AI] LinearRegression

5.1 Simple shallow Linear model

  • ์ƒ๊ด€์„ฑ ๋†’์ง€ ์•Š๊ณ , ์˜ˆ์ธก์— ์‹คํŒจํ•˜๋Š” ์ผ€์ด์Šค๋„ ๋งŽ์•˜์Œ

5.2 A deeper model with activation functions

  • Standardization is also added → correlation with the targets is much higher

import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

boston = fetch_openml(name='boston', version=1, as_frame=True)
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["TARGET"] = boston.target
df = df.astype(float)  # some columns (e.g. CHAS, RAD) may load as categorical; cast everything to float

# df.head()

# Data Standardization: rescale the inputs so each feature has mean 0 and standard deviation 1
scaler = StandardScaler()
scaler.fit(df.values[:, :-1])
df.iloc[:,:-1] = scaler.transform(df.values[:, :-1]) # N(0,1)

# df.head()

data = torch.from_numpy(df.values).float()
print(data.shape)

y = data[:, -1:]
x = data[:, :-1]

print(x.shape, y.shape)

n_epochs = 100000
learning_rate = 1e-4
print_interval = 5000

# relu = nn.ReLU()
# leaky_relu = nn.LeakyReLU(0.1)


## Method 1: writing the model as a custom class
class MyModel(nn.Module):
    
    def __init__(self, input_dim, output_dim):
        self.input_dim = input_dim
        self.output_dim = output_dim
        
        super().__init__()
        
        self.linear1 = nn.Linear(input_dim, 3)
        self.linear2 = nn.Linear(3, 3)
        self.linear3 = nn.Linear(3, output_dim)
        self.act = nn.ReLU()
        
    def forward(self, x):
        # |x| = (batch_size, input_dim)
        h = self.act(self.linear1(x)) # |h| = (batch_size, 3)
        h = self.act(self.linear2(h))
        y = self.linear3(h) # in regression, activation functions go only between hidden layers, not on the output!
        # |y| = (batch_size, output_dim)
        
        return y
    
custom_model = MyModel(x.size(-1), y.size(-1))
print(custom_model)

## Method 2: building the same kind of model with nn.Sequential
sequential_model = nn.Sequential(
    nn.Linear(x.size(-1), 3),
    nn.LeakyReLU(),
    nn.Linear(3, 3),
    nn.LeakyReLU(),
    nn.Linear(3, 3),
    nn.LeakyReLU(),
    nn.Linear(3, 3),
    nn.LeakyReLU(),
    nn.Linear(3, 3),
    nn.LeakyReLU(),
    nn.Linear(3, y.size(-1)),
)

print(sequential_model)

model = sequential_model  # choose which model to train: custom_model or sequential_model

optimizer = optim.SGD(model.parameters(),
                      lr=learning_rate)

for i in range(n_epochs):
    y_hat = model(x)
    loss = F.mse_loss(y_hat, y)
    
    optimizer.zero_grad()
    loss.backward()
    
    optimizer.step()
    
    if (i + 1) % print_interval == 0:
        print('Epoch %d: loss=%.4e' % (i + 1, float(loss)))
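
To back up the "correlation is much higher" note in 5.2, one quick check after training is to correlate the final predictions with the targets. This snippet is my own addition, not part of the course code, and reuses y, y_hat, and pd from the script above.

# Compare predictions against the true TARGET after training (sketch).
result = pd.DataFrame(torch.cat([y, y_hat], dim=1).detach().numpy(),
                      columns=["y", "y_hat"])
print(result.corr())  # Pearson correlation between true and predicted values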