fix setting ft state dicts when ft checkpointing is disabled #1915

tushar00jain · 2025-10-17T21:21:05Z

Summary:

- When ft dataloader checkpointing is disabled, we also don't set the ft state.
- Make it so that when ft checkpointing is disabled, we still set the state dict, so that the model, optimizer, etc. can be recovered from a different replica.
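The behavior described in the summary can be illustrated with a minimal sketch. This is not the code from torchtitan/components/checkpoint.py: `SketchFTManager`, `SketchCheckpointer`, `recover_from_replica`, and the `ft_dataloader_checkpointing` flag are all hypothetical names, and excluding the dataloader from the recovered state is an assumption made for the example. The point it shows is only that the state dict functions are registered regardless of the flag.

```python
from typing import Any, Callable, Dict, Optional

StateDict = Dict[str, Any]


class SketchFTManager:
    """Hypothetical stand-in for a fault-tolerance manager (not a real library class)."""

    def __init__(self) -> None:
        self.save_fn: Optional[Callable[[], StateDict]] = None
        self.load_fn: Optional[Callable[[StateDict], None]] = None

    def set_state_dict_fns(
        self,
        load_fn: Callable[[StateDict], None],
        save_fn: Callable[[], StateDict],
    ) -> None:
        self.load_fn = load_fn
        self.save_fn = save_fn


class SketchCheckpointer:
    """Hypothetical checkpointer; only the registration logic matters here."""

    def __init__(
        self,
        states: Dict[str, StateDict],
        ft_manager: Optional[SketchFTManager],
        ft_dataloader_checkpointing: bool,
    ) -> None:
        self.states = states
        # The flag only controls dataloader checkpointing in this sketch.
        self.ft_dataloader_checkpointing = ft_dataloader_checkpointing
        if ft_manager is not None:
            # Register unconditionally, so a recovering replica can still
            # fetch model and optimizer state when the flag is False.
            ft_manager.set_state_dict_fns(self._ft_load, self._ft_save)

    def _ft_save(self) -> StateDict:
        # Assumption: dataloader state is per-replica and is not shared.
        return {
            name: dict(state)
            for name, state in self.states.items()
            if name != "dataloader"
        }

    def _ft_load(self, state_dict: StateDict) -> None:
        for name, state in state_dict.items():
            self.states[name] = dict(state)


def recover_from_replica(healthy: SketchFTManager, recovering: SketchFTManager) -> None:
    """Copy the shared state from a healthy replica into a recovering one."""
    if healthy.save_fn is None or recovering.load_fn is None:
        raise RuntimeError("state dict functions were never registered")
    recovering.load_fn(healthy.save_fn())
```

In this sketch, skipping the registration when the flag is False would make `recover_from_replica` raise, which mirrors the problem the summary describes.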

Stack created with Sapling. Best reviewed with ReviewStack.

torchtitan/components/checkpoint.py

fegin

ft_inner_manager is confusing. I would suggest that you just make it a boolean flag.


This was referenced Oct 17, 2025

trigger profiling on abort #1811

Draft

repro profiler bug on abort #1856

Draft

set pg names #1910

Draft

meta-cla bot added the CLA Signed label Oct 17, 2025

fegin reviewed Oct 17, 2025

torchtitan/components/checkpoint.py (outdated, resolved)

tushar00jain force-pushed the pr1915 branch from e5606c9 to 012db1c October 17, 2025 21:36

tushar00jain marked this pull request as ready for review October 17, 2025 21:43

tushar00jain requested review from tianyu-l, wconstab and wwwjn as code owners October 17, 2025 21:43

tianyu-l reviewed Oct 17, 2025

torchtitan/components/checkpoint.py (outdated, resolved)

tushar00jain force-pushed the pr1915 branch 2 times, most recently from d3b5640 to 267f201 October 24, 2025 19:01

tianyu-l reviewed Oct 24, 2025

torchtitan/components/checkpoint.py (outdated, resolved)

fegin requested changes Oct 27, 2025

torchtitan/components/checkpoint.py (outdated, resolved)

tushar00jain force-pushed the pr1915 branch from 267f201 to 1faa462 October 27, 2025 17:39

fegin approved these changes Oct 28, 2025

tushar00jain force-pushed the pr1915 branch from 1faa462 to 22a1a9a October 29, 2025 21:38

tushar00jain merged commit e150caa into pytorch:main Oct 30, 2025
8 of 13 checks passed

tushar00jain deleted the pr1915 branch October 30, 2025 17:16


3 participants
