[CUDA][PyTorch] CUDA error: device-side assert triggered: manual_seed_all(SEED)

윰진 · October 14, 2022


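Note: making the error message more specific

Add the code below, then reset and restart the runtime.

This makes GPU errors surface synchronously, so the stack trace can be followed to the point of failure.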

CUDA_LAUNCH_BLOCKING
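import os
# Set both before the first CUDA call so the settings take effect
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # run CUDA kernels synchronously
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0 to this process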
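Because these variables are read when CUDA is initialized, the order of operations matters. The sketch below wraps the same two assignments in a helper and reports whether torch had already been imported, which is used here as a conservative sign that a restart is needed. The helper name `enable_cuda_debug_env` and its parameters are made up for illustration and are not part of PyTorch.

```python
import os
import sys


def enable_cuda_debug_env(device_ids="0", loaded_modules=None):
    """Set the CUDA debugging variables; return False if torch is already imported."""
    # loaded_modules exists only to make the check easy to test (assumption of this sketch)
    loaded = sys.modules if loaded_modules is None else loaded_modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    # If torch is already loaded, CUDA may have been initialized with the old values
    return "torch" not in loaded


if not enable_cuda_debug_env():
    print("torch was imported first; restart the runtime for the settings to apply")
```

Calling the helper in the first cell of a notebook, before `import torch`, is the safest placement.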

Candidate causes

1. Out of memory

The workload needs more memory than the machine can provide.
- Check by varying batch_size.
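One way to run the batch_size check is to halve the value until a single step goes through. The sketch below is a minimal, framework-free version: `run_step` is a hypothetical callable that runs one training step for a given batch size, and the code assumes an out-of-memory failure shows up as a `RuntimeError` whose message contains "out of memory". `fake_step` only simulates that behavior.

```python
def find_workable_batch_size(run_step, start_batch_size, min_batch_size=1):
    """Halve the batch size until run_step succeeds; return None if nothing fits."""
    batch_size = start_batch_size
    while batch_size >= max(min_batch_size, 1):
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only out-of-memory failures are a reason to shrink the batch
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2
    return None  # even the smallest allowed batch did not fit


# Stand-in for a real training step: anything above 24 samples "runs out of memory"
def fake_step(batch_size):
    if batch_size > 24:
        raise RuntimeError("CUDA out of memory (simulated)")


print(find_workable_batch_size(fake_step, start_batch_size=128))  # prints 16
```

If the error still appears at a very small batch size, memory is probably not the cause and the other candidates are worth checking.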

2. _is_in_bad_fork

When using the manual_seed_all function, check for a bad fork first.

import torch

# seed is assumed to be an int defined earlier; skip seeding in a bad fork
if torch.cuda.is_available() and not torch.cuda._is_in_bad_fork():
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

GitHub PyTorch issue 1856: manual_seed_all error

3. Dimension mismatch

Stack Overflow answer:
"some kind of inconsistency between the number of labels in your targets and the number of classes in your model."

Stack Overflow: torch.manual_seed(seed) get RuntimeError
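The mismatch described in that answer can be checked on the CPU before the data reaches the GPU. The sketch below assumes integer class labels that must lie in 0..num_classes - 1, as class-index losses such as `nn.CrossEntropyLoss` expect when no `ignore_index` value is in use. The function name `check_label_range` is made up for illustration.

```python
def check_label_range(labels, num_classes):
    """Raise ValueError if any label falls outside 0..num_classes - 1."""
    bad = sorted({int(y) for y in labels if not 0 <= int(y) < num_classes})
    if bad:
        raise ValueError(
            f"labels {bad} are outside the valid range 0..{num_classes - 1}"
        )


# Labels 1..3 with a 3-class model: label 3 is out of range
try:
    check_label_range([1, 2, 3, 1], num_classes=3)
except ValueError as err:
    print(err)
```

For a label tensor, passing `labels.tolist()` should work with the same function. A common source of this error is labels that start at 1 instead of 0.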
