2.5. Automatic Differentiation

Eunjin Kim · April 22, 2022

Dive into Deep Learning

Lecture: https://d2l.ai/chapter_preliminaries/autograd.html

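As explained in Section 2.4, differentiation is a crucial step in nearly all deep learning optimization algorithms. The calculations for taking these derivatives are straightforward and require only some basic calculus, but for complex models, working out the updates by hand can be painful (and is often error-prone).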

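Deep learning frameworks expedite this work by computing derivatives automatically, i.e., automatic differentiation. In practice, based on the model we design, the system builds a computational graph that tracks which data were combined through which operations to produce the output. Automatic differentiation then enables the system to backpropagate gradients. Here, backpropagation simply means tracing back through the computational graph and filling in the partial derivatives with respect to each parameter.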
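To make the idea of a recorded computational graph more concrete, here is a minimal sketch that inspects the `grad_fn` attribute PyTorch attaches to results. The tensor names are only illustrative, and the exact printed node names may differ between PyTorch versions.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)

# Each result tensor remembers the operation that produced it.
print(y.grad_fn)                 # node for the multiplication by 2
print(y.grad_fn.next_functions)  # the nodes feeding into it, e.g. the dot product

# A leaf tensor created directly by the user has no grad_fn.
print(x.grad_fn)                 # None
```

Walking `next_functions` from the output toward the leaves is, roughly, the path that backpropagation follows.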

2.5.1. A Simple Example

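As a toy example, say that we are interested in differentiating the function $y=2x^Tx$ with respect to the column vector x. To start, we create the variable x and assign it an initial value.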

import torch

x = torch.arange(4.0)
x

# Output
tensor([0., 1., 2., 3.])

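Before we calculate the gradient of $y$ with respect to x, we need a place to store it. It is important not to allocate new memory every time we take a derivative with respect to a parameter, because we will often update the same parameters thousands or millions of times and could quickly run out of memory. Note that the gradient of a scalar-valued function with respect to a vector x is itself vector-valued and has the same shape as x.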

x.requires_grad_(True)  # Same as `x = torch.arange(4.0, requires_grad=True)`
x.grad  # The default value is None

Now let us calculate $y$.

y = 2 * torch.dot(x, x)
y

# Output
tensor(28., grad_fn=<MulBackward0>)

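Since x is a vector of length 4, the dot product of x with itself is computed, yielding the scalar that is assigned to y. Next, we can automatically compute the gradient of y with respect to each component of x by calling the function for backpropagation and printing the gradient.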

y.backward()
x.grad

# Output
tensor([ 0.,  4.,  8., 12.])

The gradient of the function $y=2x^Tx$ with respect to x should be $4x$. Let us quickly verify that the desired gradient was calculated correctly.

x.grad == 4 * x

# Output
tensor([True, True, True, True])
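As an additional sanity check that does not rely on knowing the closed form $4x$, the gradient can be compared against central finite differences. This is only a sketch: the helper `y_fn`, the step size `h`, and the switch to `float64` (to keep rounding error small) are assumptions made for illustration.

```python
import torch

def y_fn(v):
    # Same function as in the text: y = 2 * x^T x
    return 2 * torch.dot(v, v)

x = torch.arange(4.0, dtype=torch.float64, requires_grad=True)
y_fn(x).backward()

h = 1e-3  # finite-difference step (illustrative choice)
numeric = torch.zeros_like(x)
with torch.no_grad():
    for i in range(len(x)):
        step = torch.zeros_like(x)
        step[i] = h
        # Central difference along coordinate i
        numeric[i] = (y_fn(x + step) - y_fn(x - step)) / (2 * h)

print(torch.allclose(numeric, x.grad, atol=1e-6))
```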

Now let us calculate another function of x.

# PyTorch accumulates gradients by default, so we need to clear the previous
# values
x.grad.zero_()
y = x.sum()
y.backward()
x.grad

# Output
tensor([1., 1., 1., 1.])
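The need for `x.grad.zero_()` is easier to see when the call is left out. The sketch below runs the same backward pass twice without clearing, then once more after clearing; the variable names `first`, `second`, and `third` are only illustrative.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

x.sum().backward()
first = x.grad.clone()   # gradient of x.sum(): all ones

x.sum().backward()       # no zero_() in between
second = x.grad.clone()  # the new gradient was added to the old one

x.grad.zero_()           # clear the accumulated values
x.sum().backward()
third = x.grad.clone()   # back to all ones

print(first, second, third)
```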

2.5.2. Backward for Non-Scalar Variables

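Technically, when y is not a scalar, the most natural interpretation of the derivative of a vector y with respect to a vector x is a matrix. For higher-order and higher-dimensional y and x, the result of differentiation can be a high-order tensor.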

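However, although these objects do appear in advanced machine learning (including deep learning), more often when we call backward on a vector we are trying to compute the derivatives of the loss function for each constituent of a batch of training examples. Here, our intent is not to compute the differentiation matrix, but rather the sum of the partial derivatives computed individually for each example in the batch.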

# Invoking `backward` on a non-scalar requires passing in a `gradient` argument
# which specifies the gradient of the differentiated function w.r.t `self`.
# In our case, we simply want to sum the partial derivatives, so passing
# in a gradient of ones is appropriate
x.grad.zero_()
y = x * x
# y.backward(torch.ones(len(x))) is equivalent to the line below
y.sum().backward()
x.grad

# Output
tensor([0., 2., 4., 6.])
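To see how the "matrix" interpretation relates to the summed gradient, the following sketch builds the full Jacobian of `y = x * x` with `torch.autograd.functional.jacobian` and compares a vector of ones multiplied by it against the result of `y.backward(torch.ones(len(x)))`. The helper `square` is a name introduced here for illustration only.

```python
import torch
from torch.autograd.functional import jacobian

def square(v):
    # Element-wise square, as in the example above
    return v * v

x = torch.arange(4.0, requires_grad=True)

# Full Jacobian: entry (i, j) is d y_i / d x_j, here a diagonal matrix 2 * x
J = jacobian(square, x.detach())

# backward with a vector of ones computes ones^T J without forming J
y = square(x)
y.backward(torch.ones(len(x)))

vjp = torch.ones(len(x)) @ J
print(J)
print(torch.allclose(x.grad, vjp))
```

Forming the whole Jacobian is fine for four elements, but for large models only the vector-Jacobian product is usually computed.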

2.5.3. Detaching Computation

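Sometimes, we want to move some calculations outside of the recorded computational graph. For example, say that y was calculated as a function of x, and that z was subsequently calculated as a function of both y and x. Now, imagine that we want to calculate the gradient of z with respect to x, but want to treat y as a constant and only take into account the role that x played after y was calculated.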

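We can detach y to return a new variable u that has the same value as y but discards any information about how y was computed in the computational graph. In other words, the gradient will not flow backward through u to x. Thus, the following backpropagation call computes the partial derivative of $z=u*x$ with respect to x while treating u as a constant, instead of the partial derivative of $z = x*x*x$ with respect to x.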

x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u

# Output
tensor([True, True, True, True])
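For contrast, here is a minimal sketch that computes the gradient both with and without `detach()`. Without detaching, z is effectively $x^3$ and the gradient is $3x^2$; with detaching, u is a constant and the gradient is u itself. The names `grad_full` and `grad_cut` are illustrative only.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# Without detach: z = (x * x) * x, so dz/dx = 3 * x^2
z_full = (x * x) * x
z_full.sum().backward()
grad_full = x.grad.clone()

# With detach: u is treated as a constant, so dz/dx = u = x^2
x.grad.zero_()
u = (x * x).detach()
z_cut = u * x
z_cut.sum().backward()
grad_cut = x.grad.clone()

print(grad_full)  # equals 3 * x**2
print(grad_cut)   # equals x**2
```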

Since the computation of y was recorded, we can subsequently invoke backpropagation on y to get the derivative of $y=x * x$ with respect to x, which is 2 * x.

x.grad.zero_()
y.sum().backward()
x.grad == 2 * x

# Output
tensor([True, True, True, True])

2.5.4. Computing the Gradient of Python Control Flow

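One benefit of using automatic differentiation is that even if building the computational graph of a function requires passing through a maze of Python control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable. In the following snippet, note that the number of iterations of the while loop and the evaluation of the if statement both depend on the value of the input a.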

def f(a):
    b = a * 2
    # Keep doubling until the norm reaches 1000
    while b.norm() < 1000:
        b = b * 2
    # The branch taken depends on the sign of the sum
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

Let us compute the gradient.

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

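We can now analyze the f function defined above. Note that it is piecewise linear in its input a. In other words, for any a there exists some constant scalar k such that f(a) = k * a, where the value of k depends on the input a. Consequently, d / a allows us to verify that the gradient is correct.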

a.grad == d / a

# Output
tensor(True)
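The piecewise-linear behavior can be made visible by evaluating the gradient at a few fixed inputs instead of a random one. In the sketch below, `slope_at` is a hypothetical helper introduced for illustration, and the inputs 3.0 and -3.0 are arbitrary choices; f is repeated so the snippet is self-contained.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

def slope_at(value):
    # Returns k such that f(a) = k * a near the given input
    a = torch.tensor(value, requires_grad=True)
    d = f(a)
    d.backward()
    return a.grad.item(), (d / a).item()

# Positive input: 3 is doubled 9 times in total, so k = 2**9
print(slope_at(3.0))   # (512.0, 512.0)
# Negative input: same doublings, then the else branch multiplies by 100
print(slope_at(-3.0))  # (51200.0, 51200.0)
```

The slope k differs between the two inputs because they take different branches, yet in each case the gradient agrees with d / a.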

2.5.5. Summary

Deep learning frameworks can automate the calculation of derivatives. To use it, we first attach gradients to those variables with respect to which we desire partial derivatives. We then record the computation of our target value, execute its function for backpropagation, and access the resulting gradient.

2.5.6. Exercises

Why is the second derivative much more expensive to compute than the first derivative?
Because a second graph has to be built for differentiating the computational graph that the first differentiation produced.
After running the function for backpropagation, immediately run it again and see what happens.
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
In the control flow example where we calculate the derivative of d with respect to a, what would happen if we changed the variable a to a random vector or matrix. At this point, the result of the calculation f(a) is no longer a scalar. What happens to the result? How do we analyze this?
RuntimeError: grad can be implicitly created only for scalar outputs
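With a vector, operations act element by element, but the output of f is then no longer a single scalar, so backward cannot be called on it directly. As in Section 2.5.2, the output first has to be reduced to a scalar (e.g., with sum()) or a gradient argument has to be passed to backward.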
Redesign an example of finding the gradient of the control flow. Run and analyze the result.
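For the exercise about replacing a with a random vector, one possible way to avoid the "scalar outputs" error is to sum the output before calling backward, in the same way as in Section 2.5.2. This is only a sketch of one approach: the vector length 4 and the fixed seed are arbitrary choices, and f is repeated so the snippet is self-contained.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

torch.manual_seed(0)  # arbitrary seed, only for repeatability
a = torch.randn(4, requires_grad=True)
d = f(a)

# d is a vector now, so reduce it to a scalar before backpropagating
d.sum().backward()

# The same factor k multiplies every entry, so the gradient is a constant vector
print(a.grad)
print(torch.allclose(a.grad, d.detach() / a.detach()))
```

Because the loop condition uses the norm of the whole vector and the branch uses its sum, a single scalar k applies to all entries, and f stays piecewise linear in a.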
