[Distributed Training Issue] RuntimeError: Invalid mt19937 state

yoonene · January 20, 2023

While running distributed training on 4 GPUs with accelerate, I hit the error below.

[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10793, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805521 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10793, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805533 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10793, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805535 milliseconds before timing out.
Traceback (most recent call last):
  File "/root/clm-train/train.py", line 489, in <module>
    main()
  File "/root/clm-train/train.py", line 485, in main
    train(args)
  File "/root/clm-train/train.py", line 314, in train
    train_loss = train_epoch(args, train_loader, model, optimizer,
  File "/root/clm-train/train.py", line 29, in train_epoch
    for step, batch in enumerate(
  File "/opt/conda/envs/accelerate/lib/python3.10/site-packages/tqdm/std.py", line 1183, in __iter__
    for obj in iterable:
  File "/opt/conda/envs/accelerate/lib/python3.10/site-packages/accelerate/data_loader.py", line 366, in __iter__
    synchronize_rng_states(self.rng_types, self.synchronized_generator)
  File "/opt/conda/envs/accelerate/lib/python3.10/site-packages/accelerate/utils/random.py", line 88, in synchronize_rng_states
    synchronize_rng_state(RNGType(rng_type), generator=generator)
  File "/opt/conda/envs/accelerate/lib/python3.10/site-packages/accelerate/utils/random.py", line 83, in synchronize_rng_state
    generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state
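The exception itself comes from `torch.Generator.set_state`, which accelerate calls on every dataloader iteration to keep each rank's mt19937 generator in sync. If the NCCL broadcast of the state tensor times out (as the watchdog messages above suggest), a rank can end up calling `set_state` on a buffer whose contents never arrived intact. A minimal single-process sketch of that failure mode (no distributed setup involved; the exact error message may vary by PyTorch version):

```python
import torch

g = torch.Generator()
state = g.get_state()   # ByteTensor holding the mt19937 state
g.set_state(state)      # a valid state round-trips fine

# A buffer of the right size but garbage contents -- roughly what a rank
# sees if the broadcast of the state tensor never completed correctly.
try:
    g.set_state(torch.zeros_like(state))
except RuntimeError as e:
    print(type(e).__name__)  # RuntimeError ("Invalid mt19937 state")
```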

The workaround applied for now:

from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

ipg_handler = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))
args.accelerator = Accelerator(cpu=args.cpu,
                               mixed_precision=args.mixed_precision,
                               log_with='wandb',
                               kwargs_handlers=[ipg_handler])

The synchronization step that merges results across GPUs uses NCCL collectives with a default timeout of 1800 seconds (the `Timeout(ms)=1800000` in the log above). I raised it to 5400 seconds by passing an `InitProcessGroupKwargs` handler through accelerate's `kwargs_handlers`.
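To see what the longer deadline buys in terms of synchronized shuffling, here is a hypothetical single-process sketch of what accelerate's `synchronize_rng_states` does across ranks: copy rank 0's generator state onto every other rank so all processes draw identical "random" numbers. In real training the `ByteTensor` is broadcast over NCCL, which is exactly the step that was timing out; here two local generators stand in for two ranks.

```python
import torch

src = torch.Generator().manual_seed(42)  # stands in for rank 0
dst = torch.Generator().manual_seed(7)   # stands in for another rank

state = src.get_state()       # ByteTensor; broadcast over NCCL in real training
dst.set_state(state.clone())  # must be a valid, fully received state

# After synchronization, both "ranks" produce identical samples.
a = torch.rand(3, generator=src)
b = torch.rand(3, generator=dst)
print(torch.equal(a, b))  # True
```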


I'm told this error normally shows up on the company server when a run ends.
The training itself had probably already finished cleanly via early stopping.
