The README gives this example for efficient multi-GPU training of ResNet-50 (4 GPUs, label smoothing, fast regime by fast-ai):
python -m torch.distributed.launch --nproc_per_node=4 main.py --model resnet --model-config "{'depth': 50, 'regime': 'fast'}" --eval-batch-size 512 --save resnet50_fast --label-smoothing 0.1
I changed it to run on 8 GPUs with ResNet-34 and a batch size of 256:
python -m torch.distributed.launch --nproc_per_node=8 main.py --model resnet --model-config "{'depth': 34, 'regime': 'fast'}" --batch-size 256 --eval-batch-size 512 --label-smoothing 0.1
The log shows:
TRAINING - Epoch: [15][10/625] Time 0.810 (1.640)
EVALUATING - Epoch: [15][10/98] Time 1.353 (3.035)
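Working out the expected number of steps per process on 8 GPUs:
1281167 / 256 ≈ 5004.6, 5004.6 / 8 ≈ 625.6
50000 / 512 ≈ 97.7, 97.7 / 8 ≈ 12.2
The 625 training steps match, but validation should take 12 or 13 steps per process, not 98. The 98 in the log equals 50000 / 512 rounded up, as if each process evaluates the full validation set instead of its 1/8 share.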
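The arithmetic above can be checked with a small sketch. The helper `steps_per_epoch` is hypothetical and not part of this repository; it assumes the batch sizes are per process and that the dataset is split evenly across processes, with the last partial batch either dropped or kept.

```python
import math


def steps_per_epoch(num_samples, batch_size, world_size=1, drop_last=False):
    """Number of batches one process sees per epoch.

    Assumes samples are divided evenly across `world_size` processes
    (rounding the per-process share up) and `batch_size` is per process.
    """
    per_process = math.ceil(num_samples / world_size)
    if drop_last:
        return per_process // batch_size
    return math.ceil(per_process / batch_size)


# Training: ImageNet train split, batch size 256, 8 processes.
print(steps_per_epoch(1281167, 256, world_size=8, drop_last=True))  # 625

# Validation, sharded across 8 processes: what the issue expects.
print(steps_per_epoch(50000, 512, world_size=8))  # 13

# Validation, not sharded: matches the 98 steps shown in the log.
print(steps_per_epoch(50000, 512, world_size=1))  # 98
```

Under these assumptions, the logged 98 evaluation steps correspond to the unsharded case.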