05. Linear Regression in the PyTorch way

Park Jong Hun · February 4, 2021

Deep Learning PyTorch PytorchZeroToAll machine learning

PytorchZeroToAll


These are my study notes based on PytorchZeroToAll, Sung Kim's YouTube lecture series.

PyTorch Rhythm

1. Design your model using class with Variables

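import torch
from torch import nn
# Variable has been a thin wrapper around Tensor since PyTorch 0.4;
# it is kept here to follow the lecture.
from torch.autograd import Variable

x_data = Variable(torch.Tensor([[1.0], [2.0], [3.0]]))
y_data = Variable(torch.Tensor([[2.0], [4.0], [6.0]]))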

Import the torch library and define the input and target Variables.

class Model(nn.Module):
    def __init__(self):
        """
        In the constructor we instantiate a single nn.Linear module
        """
        super(Model, self).__init__()
        self.linear = nn.Linear(1, 1, bias=False)  # One in and one out

    def forward(self, x):
        """
        In the forward function we accept a Variable of input data and we must return
        a Variable of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Variables.
        """
        y_pred = self.linear(x)
        return y_pred

model = Model()

Next, build the model class. It inherits from torch.nn.Module, and __init__ calls the parent initializer through super. Then define the forward function to match the model design.
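To make the role of forward concrete, here is a minimal plain-Python sketch of the pattern, without PyTorch. TinyModule and TinyLinear are hypothetical names invented for this illustration, not part of any library. The sketch only shows why calling model(x) ends up running forward(x); the real nn.Module does more work around that call.

```python
class TinyModule:
    """Hypothetical stand-in for nn.Module: calling the object runs forward()."""

    def __call__(self, x):
        return self.forward(x)


class TinyLinear(TinyModule):
    """One input, one output, no bias: y = weight * x."""

    def __init__(self, weight=0.5):
        self.weight = weight

    def forward(self, x):
        # x is a plain list of floats in this sketch, not a tensor.
        return [self.weight * xi for xi in x]


model = TinyLinear(weight=2.0)
y_pred = model([1.0, 2.0, 3.0])  # dispatches to forward()
print(y_pred)
```

This is why the training code calls model(x_data) rather than model.forward(x_data).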

2. Construct loss and optimizer(select from PyTorch API)

MSE Loss
criterion = nn.MSELoss(reduction = 'sum')
criterion = nn.MSELoss(reduction = 'mean')
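Second, choose a loss. The lecture uses MSELoss, and the optimizer's learning rate is 0.01 (the value passed to optim.SGD in this post's code). torch.nn.MSELoss behaves in two ways depending on reduction: with mean it is the MSE loss, and with sum it is the squared L2 norm of the error. Since the second assignment above overrides the first, training here uses the default, mean.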
reference : torch.nn.MSELoss
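The difference between the two reductions can be checked by hand. The sketch below uses plain Python rather than nn.MSELoss, and the prediction values are hypothetical numbers chosen for illustration; only the targets come from this post.

```python
# Hand-computed versions of the two reductions (plain Python, no PyTorch).
y_pred = [1.5, 3.5, 7.0]  # hypothetical predictions
y_true = [2.0, 4.0, 6.0]  # targets used in this post

squared_errors = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]

loss_sum = sum(squared_errors)              # what reduction='sum' adds up
loss_mean = loss_sum / len(squared_errors)  # what reduction='mean' averages

print(loss_sum, loss_mean)
```

The two losses differ only by the factor N (the number of elements), so their gradients differ by the same factor. With sum, the effective step size grows with the batch size for a given learning rate.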

Stochastic Gradient Descent
optimizer = optim.SGD(model.parameters(),
                      lr = 0.01)
Next, choose an optimizer to define how the model is trained. The lecture trains with ordinary gradient descent. Once the dataset grows and each step uses a mini-batch instead of the full batch, the method is called stochastic gradient descent (SGD).
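The distinction between a full-batch step and a stochastic step can be sketched in plain Python for the one-weight model y = w * x. The helper grad_mse and the batch size of 1 are assumptions made for this illustration, not something taken from the lecture.

```python
import random

x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]


def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


w, lr = 0.0, 0.01

# Full-batch gradient descent: one step uses every sample.
w_full = w - lr * grad_mse(w, x_data, y_data)

# Stochastic step: one step uses a random subset, here a single sample.
random.seed(0)
i = random.randrange(len(x_data))
w_sgd = w - lr * grad_mse(w, x_data[i:i + 1], y_data[i:i + 1])

print(w_full, w_sgd)
```

Note that optim.SGD does not sample anything itself. It applies the same update rule to whatever gradient is currently stored, so whether training is "stochastic" depends on how the batches are fed to the model.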

Momentum
optimizer = optim.SGD(model.parameters(),
                      lr = 0.01,
                      momentum = 0.9)
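With plain SGD, batches seen later in training carry more weight in the result. Momentum carries the gradients of earlier steps into each update, which keeps the most recently seen batches from dominating what the model learns.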
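As a rough sketch of how past gradients enter the update, the function below follows the momentum rule as I understand it from the PyTorch SGD documentation (velocity = momentum * velocity + grad, with dampening left at 0). The function name sgd_momentum_step and the constant gradient of 1.0 are hypothetical, chosen only to make the accumulation visible.

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update for a single scalar weight."""
    velocity = momentum * velocity + grad  # decayed sum of past gradients
    w = w - lr * velocity
    return w, velocity


w, v = 0.0, 0.0
for g in [1.0, 1.0, 1.0]:  # the same gradient three times in a row
    w, v = sgd_momentum_step(w, g, v)
    print(w, v)
```

With a constant gradient the velocity grows as 1.0, 1.9, 2.71, so consistent directions are amplified, while gradients that flip sign from step to step partly cancel.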

Nesterov Accelerated Gradient (NAG)
optimizer = optim.SGD(model.parameters(),
                      lr = 0.01,
                      momentum = 0.9,
                      nesterov = True)
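With momentum, the direction and size of each update are set by the vector sum of the previous gradients and the current gradient. If the momentum value is too large, the update can overshoot the optimum. NAG computes the gradient at the point reached after the momentum move and updates from there, which mitigates this problem.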
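For comparison, here is a small sketch of the classic and Nesterov updates side by side. It follows my reading of the update rule documented for PyTorch's SGD, which applies the look-ahead as a correction to the current gradient (grad + momentum * buf) instead of literally re-evaluating the gradient at a shifted point. The function sgd_step and the toy objective f(w) = w^2 are assumptions for this illustration.

```python
def sgd_step(w, grad, buf, lr=0.01, momentum=0.9, nesterov=False):
    """One scalar SGD update, with classic or Nesterov momentum."""
    buf = momentum * buf + grad
    # Nesterov adds one extra momentum term on top of the current gradient.
    update = grad + momentum * buf if nesterov else buf
    return w - lr * update, buf


def grad_f(w):
    """Gradient of the toy objective f(w) = w**2."""
    return 2 * w


w0 = 1.0
w_classic, _ = sgd_step(w0, grad_f(w0), buf=0.0)
w_nag, _ = sgd_step(w0, grad_f(w0), buf=0.0, nesterov=True)

print(w_classic, w_nag)
```

In PyTorch, nesterov=True requires a non-zero momentum, which matches the optimizer call shown above.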

L2 regularization
optimizer = optim.SGD(model.parameters(),
                      lr = 0.01,
                      weight_decay = 0.9)
Setting weight_decay in PyTorch's SGD applies L2 regularization automatically. L2 regularization keeps the learned weights from growing too large: a penalty on the squared weights, scaled by the decay coefficient, is added to the cost function so that it is minimized together with the loss. Training toward smaller weights makes the model less sensitive to local noise and outliers, which helps it generalize.
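The effect of the decay term can be seen in a single scalar update. The sketch below assumes the rule grad = grad + weight_decay * w, which is how I understand PyTorch's SGD to apply it; that is the gradient of an extra penalty (weight_decay / 2) * w^2 added to the loss. The function name sgd_weight_decay_step is hypothetical.

```python
def sgd_weight_decay_step(w, grad, lr=0.01, weight_decay=0.9):
    """One scalar SGD update with an L2 penalty folded into the gradient."""
    grad = grad + weight_decay * w  # gradient of (weight_decay / 2) * w**2
    return w - lr * grad


w = 2.0
# Even with a zero data gradient, the weight is pulled toward 0.
w_after = sgd_weight_decay_step(w, grad=0.0)
print(w_after)
```

Because the pull toward zero is applied on every step, a large coefficient such as 0.9 noticeably biases the learned weight away from the value that minimizes the data loss alone.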

3. Training cycle(forward, backward, update)

1) Forward pass: Compute predicted y by passing x to the model
y_pred = model(x_data)
2) Compute and print loss
loss = criterion(y_pred, y_data)
3) Zero gradients, perform a backward pass, and update the weights.
optimizer.zero_grad()
loss.backward()
optimizer.step()
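To show what the three steps amount to, here is a hand-written version of the same cycle in plain Python for the model y = w * x with the mean squared error. The gradient formula is derived by hand, and the 100 epochs and starting weight of 0.0 are assumptions for this sketch; the PyTorch code computes the gradient through autograd instead.

```python
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

w, lr = 0.0, 0.01
n = len(x_data)

for epoch in range(100):
    # 1) Forward pass: predictions for the current weight.
    y_pred = [w * x for x in x_data]

    # 2) Loss: mean squared error.
    loss = sum((p - y) ** 2 for p, y in zip(y_pred, y_data)) / n

    # 3) Backward pass by hand: d(loss)/dw, then the update.
    grad = sum(2 * x * (p - y) for x, p, y in zip(x_data, y_pred, y_data)) / n
    w = w - lr * grad

print(w, w * 4.0)
```

The gradient is recomputed from scratch on every iteration here. In PyTorch, gradients accumulate across backward calls, which is why optimizer.zero_grad() comes before loss.backward().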

Code

from torch import nn
from torch import optim
from torch import tensor

x_data = tensor([[1.0], [2.0], [3.0]])
y_data = tensor([[2.0], [4.0], [6.0]])

class Model(nn.Module):
    def __init__(self):
        """
        In the constructor we instantiate a single nn.Linear module
        """
        super(Model, self).__init__()
        self.linear = nn.Linear(1, 1, bias=False)  # One in and one out

    def forward(self, x):
        """
        In the forward function we accept a Variable of input data and we must return
        a Variable of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Variables.
        """
        y_pred = self.linear(x)
        return y_pred

# our model
model = Model()

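# Construct our loss function and an optimizer. The call to model.parameters()
# in the SGD constructor returns the learnable parameters of the single
# nn.Linear module that is a member of the model.

# The second assignment overrides the first, so training uses reduction='mean'.
criterion = nn.MSELoss(reduction = 'sum')
criterion = nn.MSELoss(reduction = 'mean')

# The optimizers below are alternatives. Each assignment replaces the previous
# one, so only the last is used in the training loop.
# SGD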
optimizer = optim.SGD(model.parameters(),
                      lr = 0.01)

# Momentum
optimizer = optim.SGD(model.parameters(),
                      lr = 0.01,
                      momentum = 0.9)

# Momentum ( Nesterov version )
optimizer = optim.SGD(model.parameters(),
                      lr = 0.01,
                      momentum = 0.9,
                      nesterov = True)

# L2 regularization
optimizer = optim.SGD(model.parameters(),
                      lr = 0.01,
                      weight_decay = 0.9)

# exponential weighted moving average, EWMA
optimizer = optim.SGD(model.parameters(),
                      lr = 0.01,
                      momentum = 0.1,
                      weight_decay = 0.9)

# Training loop
for epoch in range(10):
    # 1) Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x_data)

    # 2) Compute and print loss
    loss = criterion(y_pred, y_data)
    print(f'Epoch: {epoch} | Loss: {loss.item()} ')

    # 3) Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    print(f'\t | Weight : {list(model.parameters())[0].data}, Gradient: {list(model.parameters())[0].grad}')
    optimizer.step()
    
# After training
hour_var = tensor([[4.0]])
y_pred = model(hour_var)
print("Prediction (after training)", 4, y_pred.item())
...
