When resuming training from a checkpoint, the LR scheduler's num_steps counter is incremented twice, leaving the scheduler at double its true position.
Root Cause
In slime/backends/megatron_utils/model.py, function initialize_model_and_optimizer() (line 786):
iteration, _ = load_checkpoint(model, optimizer, opt_param_scheduler, ...)
opt_param_scheduler.step(increment=iteration * args.global_batch_size) # ← BUG
Megatron's load_checkpoint() already calls opt_param_scheduler.load_state_dict(), which internally calls self.step(increment=num_steps) with the checkpoint's saved num_steps. By the time line 786 runs, the scheduler has therefore already been advanced to the checkpointed position; the extra step() call adds iteration * args.global_batch_size on top of that.
Result: scheduler.num_steps is doubled on every resume, so the LR schedule behaves as if training had already progressed twice as far as it actually has.
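A minimal sketch of how the two code paths compound. TinyScheduler is a toy stand-in, not Megatron's actual class; it only mirrors the one behavior that matters here, namely load_state_dict() replaying saved progress via step():

```python
class TinyScheduler:
    def __init__(self):
        self.num_steps = 0

    def step(self, increment):
        self.num_steps += increment
        # ...recompute LR from self.num_steps...

    def load_state_dict(self, state):
        # Mirrors Megatron: load_state_dict replays saved progress via step().
        self.step(increment=state["num_steps"])


# Suppose a checkpoint was saved at iteration 100 with global_batch_size 32.
iteration, global_batch_size = 100, 32
checkpoint = {"num_steps": iteration * global_batch_size}  # 3200

sched = TinyScheduler()
sched.load_state_dict(checkpoint)                     # num_steps -> 3200 (correct)
sched.step(increment=iteration * global_batch_size)   # the buggy extra call
assert sched.num_steps == 6400                        # doubled: scheduler is 2x ahead
```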
Fix
Remove line 786:
# opt_param_scheduler.step(increment=iteration * args.global_batch_size)
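As a guard against regressions, a hypothetical sanity check (not part of slime or Megatron; names assume the scheduler exposes its num_steps attribute, as Megatron's OptimizerParamScheduler does) could run right after load_checkpoint() to verify the scheduler resumed at exactly the checkpointed position:

```python
def check_scheduler_resume(opt_param_scheduler, iteration, global_batch_size):
    # After a correct resume, consumed samples match the checkpoint exactly.
    expected = iteration * global_batch_size
    actual = opt_param_scheduler.num_steps
    assert actual == expected, (
        f"scheduler resumed at num_steps={actual}, expected {expected}; "
        f"a 2x value means step() was called again after load_state_dict()"
    )
```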