While running distributed training across 4 GPUs with accelerate, I hit the following error.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10793, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805521 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10793, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805533 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10793, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805535 milliseconds before timing out.
Traceback (most recent call last):
  File "/root/clm-train/train.py", line 489, in <module>
    main()
  File "/root/clm-train/train.py", line 485, in main
    train(args)
  File "/root/clm-train/train.py", line 314, in train
    train_loss = train_epoch(args, train_loader, model, optimizer,
  File "/root/clm-train/train.py", line 29, in train_epoch
    for step, batch in enumerate(
  File "/opt/conda/envs/accelerate/lib/python3.10/site-packages/tqdm/std.py", line 1183, in __iter__
    for obj in iterable:
  File "/opt/conda/envs/accelerate/lib/python3.10/site-packages/accelerate/data_loader.py", line 366, in __iter__
    synchronize_rng_states(self.rng_types, self.synchronized_generator)
  File "/opt/conda/envs/accelerate/lib/python3.10/site-packages/accelerate/utils/random.py", line 88, in synchronize_rng_states
    synchronize_rng_state(RNGType(rng_type), generator=generator)
  File "/opt/conda/envs/accelerate/lib/python3.10/site-packages/accelerate/utils/random.py", line 83, in synchronize_rng_state
    generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state
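Reading the two errors together: accelerate's DataLoader broadcasts rank 0's RNG state to every rank at the start of each epoch (the synchronize_rng_states call in the traceback), and that BROADCAST is exactly the collective the watchdog killed on ranks 1-3. So the buffer handed to generator.set_state was presumably never filled with a valid state, and torch.Generator rejects any byte buffer that does not decode to a valid MT19937 engine state. A minimal local sketch of that failure mode, my own illustration rather than code from the run:

import torch

gen = torch.Generator()
state = gen.get_state()              # a valid MT19937 state as a ByteTensor
gen.set_state(state)                 # round-trips fine

# A correctly sized but garbage buffer (all zeros) fails the engine's
# validity check and raises: RuntimeError: Invalid mt19937 state
gen.set_state(torch.zeros_like(state))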
The workaround applied for now:
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the collective timeout from the default 1800 s to 5400 s
ipg_handler = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))
args.accelerator = Accelerator(cpu=args.cpu,
                               mixed_precision=args.mixed_precision,
                               log_with='wandb',
                               kwargs_handlers=[ipg_handler])
The synchronization step that merges each GPU's results was capped at a 1800-second timeout (the Timeout(ms)=1800000 in the log), so I raised it to 5400 seconds through accelerate's kwargs_handlers.
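For reference, InitProcessGroupKwargs just forwards its arguments to the underlying torch.distributed.init_process_group call. A minimal sketch of the same fix without accelerate, assuming a torchrun-style launch that provides RANK/WORLD_SIZE in the environment:

from datetime import timedelta
import torch.distributed as dist

# NCCL collectives abort once they exceed `timeout` (default: 30 minutes);
# the watchdog messages above are this limit firing.
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=5400))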
I'm told this error normally appears on the company server when a run finishes.
They think this run most likely ended cleanly via early_stopping.