Conversation

Contributor

@tushar00jain tushar00jain commented Oct 7, 2025

Summary:
Record the profiler trace when the training process receives SIGABRT, e.g. when the Process Group watchdog aborts the process.


Stack created with Sapling. Best reviewed with ReviewStack.

@meta-cla meta-cla bot added the CLA Signed label Oct 7, 2025
@tushar00jain tushar00jain marked this pull request as draft October 7, 2025 21:41
@tushar00jain tushar00jain force-pushed the pr1811 branch 9 times, most recently from b4e489c to e1b5016 Compare October 8, 2025 19:09
tushar00jain added a commit that referenced this pull request Oct 8, 2025
Summary:
Allow users to specify the profiler schedule.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1809).
* #1811
* #1810
* #1812
* __->__ #1809

Co-authored-by: Tushar Jain <[email protected]>
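For context on what a configurable profiler schedule controls: the profiler cycles through wait, warmup, and active phases. The sketch below re-implements the semantics of `torch.profiler.schedule` in plain Python for illustration (simplified, without `skip_first`); the real training loop would pass `torch.profiler.schedule(wait=..., warmup=..., active=...)` to the profiler instead.

```python
def profiler_schedule(wait, warmup, active, repeat=0):
    """Map a global step to a profiler action, mimicking the semantics of
    torch.profiler.schedule (simplified sketch: no skip_first)."""
    cycle_len = wait + warmup + active

    def action(step):
        if repeat and step >= cycle_len * repeat:
            return "NONE"          # all requested cycles are done
        pos = step % cycle_len
        if pos < wait:
            return "NONE"          # idle: no profiling overhead
        if pos < wait + warmup:
            return "WARMUP"        # profiler on, samples discarded
        # last step of the active window also flushes the trace to disk
        return "RECORD_AND_SAVE" if pos == cycle_len - 1 else "RECORD"

    return action
```

Exposing `wait`/`warmup`/`active` in the config lets users trade trace size against coverage without touching the training loop.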
tushar00jain added a commit that referenced this pull request Oct 10, 2025
Summary:
The script adds configuration options to run training locally with ft (torchft fault tolerance) enabled.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1812).
* #1840
* #1811
* #1810
* __->__ #1812
* #1809

---------

Co-authored-by: Tushar Jain <[email protected]>
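A local FT run along the lines the commit describes might look like the following. These commands are an assumption pieced together from torchft's lighthouse binary and torchtitan's `fault_tolerance` config section; the script's actual flags and defaults may differ.

```shell
# Hypothetical local two-replica FT run; flag names are assumptions, not
# necessarily what the script uses.

# 1. Start a torchft lighthouse for quorum coordination.
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --join_timeout_ms 10000 &

# 2. Launch each replica group against it (repeat with replica_id 1).
TORCHFT_LIGHTHOUSE=http://localhost:29510 NGPU=2 ./run_train.sh \
    --fault_tolerance.enable \
    --fault_tolerance.replica_id 0 \
    --fault_tolerance.group_size 2
```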
tianyu-l pushed a commit that referenced this pull request Oct 12, 2025
Summary:
Allows disabling the storage of torchft-related checkpoints.

Users then don't have to rely on any external storage, which reduces the setup time to get things up and running; with torchft, model checkpoints aren't really needed anyway. And if checkpoint storage has issues, this works as a killswitch to completely disable the storage so it doesn't impact training.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1810).
* #1856
* #1811
* __->__ #1810

Co-authored-by: Tushar Jain <[email protected]>
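In torchtitan's TOML config style, the killswitch might look like the fragment below. The section layout follows torchtitan's config convention, but the flag name is a hypothetical stand-in for whatever option the PR actually adds.

```toml
[fault_tolerance]
enable = true
# Hypothetical killswitch: keep torchft running but skip writing its
# checkpoints, so flaky storage cannot stall or crash training.
enable_checkpoint = false
```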
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
Summary:
Allow users to specify the profiler schedule.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1809).
* pytorch#1811
* pytorch#1810
* pytorch#1812
* __->__ pytorch#1809

Co-authored-by: Tushar Jain <[email protected]>
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
Summary:
The script adds configuration options to run training locally with ft (torchft fault tolerance) enabled.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1812).
* pytorch#1840
* pytorch#1811
* pytorch#1810
* __->__ pytorch#1812
* pytorch#1809

---------

Co-authored-by: Tushar Jain <[email protected]>
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
Summary:
Allows disabling the storage of torchft-related checkpoints.

Users then don't have to rely on any external storage, which reduces the setup time to get things up and running; with torchft, model checkpoints aren't really needed anyway. And if checkpoint storage has issues, this works as a killswitch to completely disable the storage so it doesn't impact training.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1810).
* pytorch#1856
* pytorch#1811
* __->__ pytorch#1810

Co-authored-by: Tushar Jain <[email protected]>
@tushar00jain tushar00jain mentioned this pull request Oct 17, 2025
@tushar00jain tushar00jain force-pushed the pr1811 branch 2 times, most recently from 0d7b39e to a8e24ed Compare October 17, 2025 21:21
@tushar00jain tushar00jain force-pushed the pr1811 branch 4 times, most recently from 93e8fef to e8fa6c6 Compare October 24, 2025 19:01
Summary:
- When ft dataloader checkpointing is disabled, we also don't set the ft state.
- When ft checkpointing is disabled, we still set the state dict so that the model, optimizer, etc. can be recovered from a different replica.

Summary:
Record the profiler trace when the training process receives SIGABRT, e.g. when the Process Group watchdog aborts the process.
tushar00jain added a commit that referenced this pull request Oct 30, 2025
Summary:
- When ft dataloader checkpointing is disabled, we also don't set the ft state.
- When ft checkpointing is disabled, we still set the state dict so that the model, optimizer, etc. can be recovered from a different replica.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1915).
* #1856
* #1811
* #1910
* __->__ #1915

Co-authored-by: Tushar Jain <[email protected]>
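The behavior described above can be sketched as follows. `build_ft_state` and its arguments are hypothetical stand-ins for however the trainer assembles the state it hands to the ft manager; the point is that model and optimizer state are always registered, while dataloader state is only tracked when ft checkpointing is enabled.

```python
def build_ft_state(model_parts, optimizers, dataloader, checkpoint_enabled):
    """Assemble the state dict handed to the ft manager (illustrative names).

    Model and optimizer state are always included so a recovering replica can
    fetch live weights from a healthy peer even with ft checkpointing off.
    """
    state = {
        "model": model_parts,
        "optimizer": optimizers,
    }
    if checkpoint_enabled:
        # Dataloader position is only tracked when ft checkpointing is on.
        state["dataloader"] = dataloader
    return state
```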