[TIL] RuntimeError - freeze_support()

YSL · October 18, 2023

PyTorch


Problem

tconf = trainer.TrainerConfig(max_epochs=650,
                              batch_size=128,
                              learning_rate=args.pretrain_lr,
                              lr_decay=True,
                              warmup_tokens=512*20,
                              final_tokens=200*len(pretrain_dataset)*block_size,
                              num_workers=4,
                              writer=writer)

trainer = trainer.Trainer(model, pretrain_dataset, None, tconf)

trainer.train()

torch.save(model.state_dict(), args.writing_params_path)

I wanted to train a model, but running the code above raised a RuntimeError.

RuntimeError: An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen.

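This error is reportedly common on Windows. It shows up when the num_workers hyperparameter is set above 0, because the data loading workers are then started as separate processes. Windows does not support fork, so multiprocessing uses spawn instead. A spawned child starts a fresh Python interpreter and re-imports the main script, so any unguarded module-level code, including the call that starts the workers, runs again inside the child. Left unchecked, that would keep launching processes recursively; Python detects the attempt and raises this RuntimeError instead.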
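To see which start method a given machine uses, a minimal sketch like the one below may help. It relies only on the standard multiprocessing module; the helper name describe_start_methods is made up for this illustration. On Windows the only available method is spawn, while Linux has traditionally defaulted to fork.

```python
import multiprocessing as mp
import sys


def describe_start_methods():
    """Collect the platform name and multiprocessing start methods (hypothetical helper)."""
    return {
        "platform": sys.platform,
        # Default start method used when new processes are created.
        "default": mp.get_start_method(),
        # Every start method this platform supports.
        "available": mp.get_all_start_methods(),
    }


if __name__ == "__main__":
    info = describe_start_methods()
    print(info["platform"], info["default"], info["available"])
```

If the default is spawn, the main script is re-imported in every worker process, so the guard discussed in this post is required.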

Solution

I changed the code as follows.

from multiprocessing import freeze_support

tconf = trainer.TrainerConfig(max_epochs=650,
                              batch_size=128,
                              learning_rate=args.pretrain_lr,
                              lr_decay=True,
                              warmup_tokens=512*20,
                              final_tokens=200*len(pretrain_dataset)*block_size,
                              num_workers=4,
                              writer=writer)

trainer = trainer.Trainer(model, pretrain_dataset, None, tconf)

if __name__ == '__main__':
    freeze_support()
    trainer.train()
    torch.save(model.state_dict(), args.writing_params_path)
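One possible refinement, sketched here under assumptions: anything left at module level is executed again in every spawned worker, so it can be tidier to move the setup into a main() function as well. The function build_trainer_settings and the dictionary it returns are hypothetical stand-ins for the TrainerConfig / Trainer setup above, not part of the original code.

```python
from multiprocessing import freeze_support


def build_trainer_settings():
    # Hypothetical stand-in for building TrainerConfig and Trainer.
    return {"max_epochs": 650, "batch_size": 128, "num_workers": 4}


def main():
    # Setup now happens only in the process that runs the script directly.
    settings = build_trainer_settings()
    # In the real script, trainer.train() and torch.save(...) would go here.
    return settings


if __name__ == "__main__":
    freeze_support()  # only relevant for frozen Windows executables
    main()
```

With this layout, a worker process that re-imports the file only defines the functions and does nothing else.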

if __name__ == '__main__':
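__name__ is set to '__main__' only when the file is run directly as a script. When the file is imported, including when a spawned child process re-imports it, the condition is false and the block is skipped. Putting trainer.train() inside the block therefore keeps the child processes from starting training (and yet more workers) all over again.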
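The effect of the guard can be reproduced without starting any processes. The sketch below writes a tiny script to a temporary file and runs it twice with runpy.run_path: once under the name __main__, as when the file is launched directly, and once under __mp_main__, the name a spawned child uses when it re-imports the main script. The file name demo_script.py and the helper run_as are made up for this illustration.

```python
import os
import runpy
import tempfile
import textwrap

# A tiny script: one module-level statement and one guarded statement.
SOURCE = textwrap.dedent("""
    events = ["module level"]
    if __name__ == "__main__":
        events.append("guarded block")
""")


def run_as(name):
    """Run SOURCE from a temp file with __name__ set to `name`; return its events list."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_script.py")
        with open(path, "w") as f:
            f.write(SOURCE)
        return runpy.run_path(path, run_name=name)["events"]


if __name__ == "__main__":
    print(run_as("__main__"))     # ['module level', 'guarded block']
    print(run_as("__mp_main__"))  # ['module level']
```

The module-level line runs in both cases, but the guarded block runs only for the directly executed script.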

freeze_support()
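On Windows, this function adds support for programs that use multiprocessing and have been frozen into a standalone executable (for example with PyInstaller). When the script is run normally by the Python interpreter it has no effect, so, as the error message says, the line can be omitted. The if __name__ == '__main__': guard is what actually fixes the error.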
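As a rough check, a script can tell whether it is running as a frozen executable by looking at sys.frozen, an attribute that freezing tools such as PyInstaller set. The helper name is_frozen below is made up for this sketch.

```python
import sys
from multiprocessing import freeze_support


def is_frozen():
    """Return True when running from a frozen executable (hypothetical helper)."""
    # Freezing tools set sys.frozen; a normal interpreter does not define it.
    return bool(getattr(sys, "frozen", False))


if __name__ == "__main__":
    # Safe to call unconditionally: a no-op unless the program is frozen.
    freeze_support()
    print("frozen executable:", is_frozen())
```

Run with a regular python command, this should report that the program is not frozen, which is the case where freeze_support() does nothing.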

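Another fix is to set num_workers=0, which loads the data in the main process without starting any worker processes. It is a very simple fix, but data loading, and with it training, becomes slower.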

