
[BUG] LR scheduler double-counted when resuming from checkpoint #1546

@Surya-Gunukula

Description


When resuming training from a checkpoint, the LR scheduler's num_steps is incremented twice, causing the scheduler position to be doubled.

Root Cause

In slime/backends/megatron_utils/model.py, function initialize_model_and_optimizer() (line 786):

```python
iteration, _ = load_checkpoint(model, optimizer, opt_param_scheduler, ...)

opt_param_scheduler.step(increment=iteration * args.global_batch_size)  # ← BUG
```

Megatron's load_checkpoint() already calls opt_param_scheduler.load_state_dict(), which internally calls self.step(increment=num_steps) with the checkpoint's saved num_steps. Line 786 then adds iteration * global_batch_size again.
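A minimal, self-contained illustration of the double count (a toy class with hypothetical names that mimics only the replay-via-step behavior described above; not the Megatron source):

```python
class ToyScheduler:
    """Toy stand-in for a Megatron-style scheduler that counts samples."""

    def __init__(self):
        self.num_steps = 0

    def step(self, increment):
        self.num_steps += increment

    def state_dict(self):
        return {"num_steps": self.num_steps}

    def load_state_dict(self, state_dict):
        # Megatron-style: replay the saved count through step()
        self.step(increment=state_dict["num_steps"])


global_batch_size = 32
iteration = 100  # iterations completed before the checkpoint was saved

# --- save side ---
saver = ToyScheduler()
saver.step(increment=iteration * global_batch_size)
ckpt = saver.state_dict()  # {'num_steps': 3200}

# --- resume side ---
resumed = ToyScheduler()
resumed.load_state_dict(ckpt)  # num_steps -> 3200 (already correct)
resumed.step(increment=iteration * global_batch_size)  # the buggy extra step
print(resumed.num_steps)  # 6400, i.e. doubled
```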

Result: scheduler.num_steps is doubled on every resume, so the scheduler behaves as if twice as many samples had been consumed; with a decaying schedule, the post-resume LR is lower than it should be.

Fix

Remove line 786:

```python
# opt_param_scheduler.step(increment=iteration * args.global_batch_size)
```
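As a sanity check after the fix (a sketch; it assumes the variables from the snippet above are in scope and that the scheduler tracks consumed samples in a num_steps attribute, as Megatron's OptimizerParamScheduler does):

```python
# Post-resume sanity check: the scheduler's sample count should match
# exactly what the completed iterations account for, not double it.
expected = iteration * args.global_batch_size
actual = opt_param_scheduler.num_steps
assert actual == expected, f"scheduler double-stepped: {actual} != {expected}"
```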
