fix: remove double-counting of LR scheduler steps on checkpoint resume #1547

Open
Surya-Gunukula wants to merge 1 commit into THUDM:main from Surya-Gunukula:main
@Surya-Gunukula
Contributor

Megatron's load_checkpoint() already calls opt_param_scheduler.load_state_dict(), which internally increments num_steps. The extra step() call here advanced the scheduler a second time, doubling its position on every resume.

More info: Issue #1546
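
To illustrate the double-counting, here is a minimal sketch using a toy scheduler. `ToyScheduler` and `resume` are hypothetical names made up for this example; this is not Megatron's actual class or this repository's resume code. It assumes that `load_state_dict()` restores the position by calling `step()` with the saved count, and that the redundant call advanced by that same count.

```python
class ToyScheduler:
    """Hypothetical stand-in for an LR scheduler; not Megatron's real class."""

    def __init__(self):
        self.num_steps = 0

    def step(self, increment=1):
        # Advance the scheduler position.
        self.num_steps += increment

    def state_dict(self):
        return {"num_steps": self.num_steps}

    def load_state_dict(self, state):
        # Assumption: restoring advances the counter through step(),
        # which is the behaviour this PR describes.
        self.step(increment=state["num_steps"])


def resume(saved_state, extra_step):
    """Restore a fresh scheduler, optionally with the redundant step() call."""
    sched = ToyScheduler()
    sched.load_state_dict(saved_state)
    if extra_step:
        # The redundant call (assumed to advance by the saved count).
        sched.step(increment=saved_state["num_steps"])
    return sched.num_steps


saved = {"num_steps": 100}
buggy_position = resume(saved, extra_step=True)
fixed_position = resume(saved, extra_step=False)
print(buggy_position, fixed_position)  # prints "200 100"
```

Under these assumptions, a run checkpointed at step 100 resumes at position 200 with the extra call and at position 100 without it.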
