
[BUG] LR scheduler double-counted when resuming from checkpoint #1546

@Surya-Gunukula

Description


When resuming training from a checkpoint, the LR scheduler's num_steps is incremented twice, causing the scheduler position to be doubled.

Root Cause

In slime/backends/megatron_utils/model.py, function initialize_model_and_optimizer() (line 786):

```python
iteration, _ = load_checkpoint(model, optimizer, opt_param_scheduler, ...)

opt_param_scheduler.step(increment=iteration * args.global_batch_size)  # ← BUG
```

Megatron's load_checkpoint() already calls opt_param_scheduler.load_state_dict(), which internally calls self.step(increment=num_steps) with the checkpoint's saved num_steps. Line 786 then adds iteration * global_batch_size again.
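A minimal, self-contained illustration of the double count (a toy class with hypothetical names that mimics only the replay-via-step behavior described above; not the Megatron source):

```python
class ToyScheduler:
    """Toy stand-in for a Megatron-style scheduler that counts samples."""

    def __init__(self):
        self.num_steps = 0

    def step(self, increment):
        self.num_steps += increment

    def state_dict(self):
        return {"num_steps": self.num_steps}

    def load_state_dict(self, state_dict):
        # Megatron-style: replay the saved count through step()
        self.step(increment=state_dict["num_steps"])


global_batch_size = 32
iteration = 100  # iterations completed before the checkpoint was saved

# --- save side ---
saver = ToyScheduler()
saver.step(increment=iteration * global_batch_size)
ckpt = saver.state_dict()  # {'num_steps': 3200}

# --- resume side ---
resumed = ToyScheduler()
resumed.load_state_dict(ckpt)  # num_steps -> 3200 (already correct)
resumed.step(increment=iteration * global_batch_size)  # the buggy extra step
print(resumed.num_steps)  # 6400, i.e. doubled
```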

Result: scheduler.num_steps is doubled on every resume, so the scheduler behaves as if twice as many samples had been consumed; with a decaying schedule, the post-resume LR is lower than it should be.

Fix

Remove line 786:

```python
# opt_param_scheduler.step(increment=iteration * args.global_batch_size)
```
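As a sanity check after the fix (a sketch; it assumes the variables from the snippet above are in scope and that the scheduler tracks consumed samples in a num_steps attribute, as Megatron's OptimizerParamScheduler does):

```python
# Post-resume sanity check: the scheduler's sample count should match
# exactly what the completed iterations account for, not double it.
expected = iteration * args.global_batch_size
actual = opt_param_scheduler.num_steps
assert actual == expected, f"scheduler double-stepped: {actual} != {expected}"
```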
