githubsgi
Contributor

use_deterministic_algorithms() warn_only
ac preserve_rng_state
ac debug

refer: #1736

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 26, 2025
Contributor

@tianyu-l tianyu-l left a comment

It's not clear to me what value these fields bring. In particular, #1736 doesn't try to justify anything.

In general, torchtitan should expose the most meaningful options to users, instead of becoming a config broker to host all possible options for the APIs it calls into.

We will consider adding options if you could write up a proposal.

@githubsgi
Contributor Author

githubsgi commented Sep 26, 2025

@tianyu-l, I understand your concern about inflating the number of configs. In brief, these are some knobs I needed to figure out in order to debug issues related to deterministic compute and activation checkpoint recompute discrepancies. Determinism, full or partial, is very important for debugging numerical issues associated with randomness, as you know. Activation checkpointing is an important technique for reducing memory pressure, so getting details on exactly where activation checkpointing fails is important. These knobs would make TorchTitan debugging significantly friendlier and quicker for new models and accelerators. Otherwise, every end-user developer of TorchTitan would need to hunt for where and how to add these debugging hooks.

I can add the above as proposal text (and some more details) to the issue mentioned above, if that works.
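
For readers unfamiliar with the knobs under discussion, here is a minimal sketch of where they plug in. `torch.use_deterministic_algorithms(mode, warn_only=...)` and the `preserve_rng_state`, `determinism_check`, and `debug` keyword arguments of `torch.utils.checkpoint.checkpoint` are existing PyTorch APIs. The `DebugKnobs` container and the two helper functions are hypothetical names used only for illustration, not the code in this PR.

```python
from dataclasses import dataclass


@dataclass
class DebugKnobs:
    """Hypothetical container; field names mirror the options discussed in this PR."""

    deterministic: bool = False
    deterministic_warn_only: bool = False
    preserve_rng_state: bool = False
    determinism_check: str = "default"  # PyTorch accepts "default" or "none"
    ac_debug: bool = False


def checkpoint_kwargs(knobs: DebugKnobs) -> dict:
    """Build keyword arguments to forward to torch.utils.checkpoint.checkpoint."""
    return {
        # determinism_check and debug are only supported on the non-reentrant path
        "use_reentrant": False,
        "preserve_rng_state": knobs.preserve_rng_state,
        "determinism_check": knobs.determinism_check,
        "debug": knobs.ac_debug,
    }


def apply_determinism(knobs: DebugKnobs) -> None:
    """Enable deterministic algorithms; warn_only=True warns instead of raising."""
    import torch  # imported lazily so the helper above works without PyTorch installed

    if knobs.deterministic:
        torch.use_deterministic_algorithms(
            True, warn_only=knobs.deterministic_warn_only
        )
```

Under these assumptions, a trainer would call `apply_determinism(knobs)` once at startup and wrap each checkpointed block as `checkpoint(block, x, **checkpoint_kwargs(knobs))`.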

@githubsgi githubsgi changed the title from "Adding config options for determinitic" to "Adding config options for deterministic execution" Sep 26, 2025
@tianyu-l
Contributor

OK, fine, I guess it's not that hard to convince me.

Instead of scattering debugging configs around, I wonder if we can put all debug options into one config called Debug under JobConfig, and clearly document in helper messages (1) what each config is for and (2) pointers to resources where people can read more.

Also would appreciate if you could share your understanding on AC for DSv3 training.
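
The suggestion above, one `Debug` config under `JobConfig` with help text for each option, could be sketched roughly as follows. This is an illustration only: the field names, the help strings, the use of `metadata={"help": ...}`, and the trimmed `JobConfig` stand-in are all assumptions, not TorchTitan's actual config code.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Debug:
    """Hypothetical consolidated debug section; names and help text are illustrative."""

    deterministic_warn_only: bool = field(
        default=False,
        metadata={
            "help": "Warn instead of raising when an op has no deterministic "
            "implementation. See the torch.use_deterministic_algorithms docs."
        },
    )
    ac_preserve_rng_state: bool = field(
        default=False,
        metadata={
            "help": "Stash and restore RNG state around activation checkpoint "
            "recompute. See the torch.utils.checkpoint docs."
        },
    )
    ac_debug: bool = field(
        default=False,
        metadata={
            "help": "Record extra information to locate activation checkpoint "
            "recompute mismatches. See the torch.utils.checkpoint docs."
        },
    )


@dataclass
class JobConfig:
    """Trimmed stand-in for the real JobConfig, showing only the new section."""

    debug: Debug = field(default_factory=Debug)


def help_text(section) -> str:
    """Render one help line per option: name, default, and description."""
    return "\n".join(
        f"{f.name} (default {f.default!r}): {f.metadata['help']}"
        for f in fields(section)
    )
```

Calling `help_text(Debug)` yields one line per option, which keeps the purpose of each knob and a pointer to further reading next to its definition.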

@githubsgi
Contributor Author

githubsgi commented Sep 29, 2025

@tianyu-l, it sounds like you want to pull all the debug configs into a separate section of the toml file. Would it be something like the following? That would require more changes to the code, as some function calls only pass part of the configs!

[troubleshoot]
deterministic = false
deterministic_warn_only = false
preserve_rng_state = false
determinism_check = "default"
ac_debug = false

I have not looked into DSv3 yet. Will try it out.

@tianyu-l
Contributor

Yes. I think it'd be good to put all debug related configs together, including the random seed and the option added in #1670. It will be a bigger refactor indeed.

@fegin if you have a preference

@githubsgi
Contributor Author

@fegin, what are your thoughts on this?

@githubsgi
Contributor Author

githubsgi commented Oct 3, 2025

@tianyu-l, does the following debug(?) section look ok? Whichever PR gets merged last, this or that, can expand the debug section.


[debug]
deterministic = false
deterministic_warn_only = false 
preserve_rng_state = false
determinism_check = "default"
debug = false
