10. PyTorch Troubleshooing

유승우·2022년 5월 11일

2 week PyTorch 부스트캠프 AI Tech 3기

PyTorch를 사용하면서 자주 발생할 수 있는 GPU에서의 Out of Memory(OOM) 에러 상황들의 예시를 보고 해결하는 법 까지 학습

OOM(Out Of Memory)

OOM이 해결하기 어려운 이유
1. 왜, 어디서 발생했는지 알기 어려움
2. Error backtracking이 이상한 곳으로 감
3. 메모리의 이전상황의 파악이 어려움

해결 방법들

보통의 해결법은 Batch size를 줄이고 GPU를 clean시키고, 다시 run

GPUUtil

nvidia-smi 처럼 GPU의 상태를 보여주는 모듈
Colab은 환경에서 GPU 상태 보여주기 편함
iter마다 메모리가 늘어나는지 확인

!pip install GPUtil

import GPUtil
GPutil.showUtilization()

torch.cuda.empty_cache()

GPU상에서 사용되지 않는 memory(cache)를 삭제해주는 명령어
del과는 구분이 필요한데, del은 메모리 주소와의 관계를 끊어주는 것
reset 대신 쓰기 좋은 함수

import torch
from GPUtil import showUtilization as gpu_usage

print("Initial GPU Usage")
gpu_usage()

tensorList = []
for x in range(10):
		tensorList.append(torch.randn(10000000,10).cuda())

print("GPU Usage after allcoating a bunch of Tensors")
gpu_usage()

del tensorList

print("GPU Usage after deleting the Tensors")
gpu_usage()

print("GPU Usage after emptying the cache")
torch.cuda.empty_cache()
gpu_usage()

training loop에 tensor로 축적 되는 변수는 확인할 것

tensor로 처리된 변수는 GPU 상에 메모리 사용
해당 변수 loop 안에 연산에 있을 때 GPU에 computational graph를 생성해 메모리를 잠식할 수 있다.

total_loss = 0
for i in range(10000):
		optimizer.zero_grad()
		output = model(input)
		loss = criterion(output)
		loss.backward()
		optimizer.step()
		total_loss += loss

loss의 값은 사실 total_loss에 추가하기 위해 한번만 필요한데, 위 코드의 loss는 계속 연산을 하기 위해서 점점 쌓이게 되고, 메모리를 잡아먹는다.
이러한 경우 item 혹은 float을 통해 python 기본 객체로 변환하여 처리하면 된다.

total_loss = 0
for x in range(10):
		iter_loss = torch.randn(3,4).mean()
		iter_loss.requires_grad = True
		total_loss += float(iter_loss)

del 명령어를 적절히 사용하기

for i in range(5):
		intermediate = f(input[i])
		result += g(intermediate)
output = h(result)
return output

위와 같은 코드는 python의 메모리 배치 특성상 loop이 끝나도 메모리를 차지하기 떄문에, 필요가 없어진 변수는 적절히 삭제하여 메모리를 줄인다.

가능 batch 사이즈 실험해보기

학습시 OOM이 발생했다면 batch 사이즈를 1로 해서 실험해보기

oom = False
try:
		run_model(batch_size)
except RuntimeError: # Out of memory
		oom = True

if oom:
		for _ in range(batch_size):
				run_model(1)

torch.no_grad()를 사용하기

이 명령어를 사용하면 Inference 시점에서는 backward propagation에서 일어나는 메모리 버퍼가 일어나지 않게 된다.

with torch.no_grad():
		for data, target in test_loader:
				output = network(data)
				test_loss += F.nll_loss(output, target, size_average=False).item()
				pred = output.data.max(1, keepdim=True)[1]
				correct += pred.eq(target.data.view_as(pred)).sum()

예상치 못한 에러 메세지

OOM 말고도, CUDNN_STATUS_NOT_INIT 이나 device_side_assert 등 에러 메세지가 발생할 수 있는데, 보통 GPU를 잘 못 깔았거나 OOM과 비슷한 에러로, 굉장히 많은 요인으로 인해 발생할 수 있다.
colab에서는 너무 큰 사이즈는 실행하지 말 것
- 특히 LSTM은 필요 이상으로 메모리를 잡아먹기 때문에 실행하지 말 것
CNN의 대부분의 에러는 크기가 안 맞아서 생기는 경우
- torchsummary 등으로 사이즈를 맞출 것
tensor의 float precision을 16bit로 줄일 수도 있다.

유승우

이전 포스트

9. Hyperparameter tuning

다음 포스트

10. PyTorch Troubleshooing

해결 방법들

9. Hyperparameter tuning

1. Data viz

0개의 댓글