@tushar00jain tushar00jain commented Oct 17, 2025

Summary:

  • Previously, when ft (fault-tolerance) dataloader checkpointing was disabled, the ft state was not set at all.
  • With this change, when ft dataloader checkpointing is disabled, the state dict is still set so that the model, optimizer, etc. can be recovered from a different replica.
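A minimal sketch of the intended behavior, under stated assumptions: `SketchFTManager`, `register_ft_state`, and the `ft_dataloader_checkpoints` flag are hypothetical names used for illustration only, not the actual code touched by this PR. The sketch also assumes that dataloader state is the only part left out when the flag is off; the summary above does not spell that out.

```python
from typing import Any, Callable, Dict, Optional

StateDict = Dict[str, Dict[str, Any]]


class SketchFTManager:
    """Hypothetical stand-in for a fault-tolerance manager (not a real library class)."""

    def __init__(self) -> None:
        self.load_fn: Optional[Callable[[StateDict], None]] = None
        self.save_fn: Optional[Callable[[], StateDict]] = None

    def set_state_dict_fns(
        self,
        load_fn: Callable[[StateDict], None],
        save_fn: Callable[[], StateDict],
    ) -> None:
        self.load_fn = load_fn
        self.save_fn = save_fn


def register_ft_state(
    manager: SketchFTManager,
    states: StateDict,
    ft_dataloader_checkpoints: bool,
) -> None:
    """Always register save/load hooks; the flag only gates dataloader state."""

    def save() -> StateDict:
        # Assumption: only the "dataloader" entry depends on the flag.
        return {
            name: dict(state)
            for name, state in states.items()
            if name != "dataloader" or ft_dataloader_checkpoints
        }

    def load(snapshot: StateDict) -> None:
        # Overwrite only the entries present in the snapshot.
        for name, state in snapshot.items():
            states[name] = dict(state)

    # Registered unconditionally, so recovery works even with the flag off.
    manager.set_state_dict_fns(load, save)


# A recovering replica pulls state from a healthy one with the flag disabled.
healthy = {"model": {"w": 1.0}, "optimizer": {"lr": 0.1}, "dataloader": {"offset": 42}}
recovering = {"model": {"w": 0.0}, "optimizer": {"lr": 0.0}, "dataloader": {"offset": 7}}

src, dst = SketchFTManager(), SketchFTManager()
register_ft_state(src, healthy, ft_dataloader_checkpoints=False)
register_ft_state(dst, recovering, ft_dataloader_checkpoints=False)
dst.load_fn(src.save_fn())
```

In this sketch the recovering replica ends up with the healthy replica's model and optimizer entries while keeping its own dataloader entry, which is the case the summary describes.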

Stack created with Sapling. Best reviewed with ReviewStack.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 17, 2025
@tushar00jain tushar00jain marked this pull request as ready for review October 17, 2025 21:43
@tushar00jain tushar00jain force-pushed the pr1915 branch 2 times, most recently from d3b5640 to 267f201 Compare October 24, 2025 19:01

@fegin fegin left a comment

`ft_inner_manager` is confusing. I would suggest just making it a boolean flag.

@tushar00jain tushar00jain merged commit e150caa into pytorch:main Oct 30, 2025
8 of 13 checks passed
@tushar00jain tushar00jain deleted the pr1915 branch October 30, 2025 17:16