Adding config options for deterministic execution #1761
base: main
It's not clear to me what value these fields bring. In particular, #1736 doesn't try to justify anything.
In general, torchtitan should expose the most meaningful options to users, instead of becoming a config broker to host all possible options for the APIs it calls into.
We will consider adding options if you could write up a proposal.
@tianyu-l, I understand your concern about inflating the number of configs. In brief, these are knobs I needed in order to debug issues related to deterministic compute and activation checkpoint recompute discrepancies. Determinism, full or partial, is very important for debugging numerical issues associated with randomness, as you know. Activation checkpointing is an important technique for reducing memory pressure, so getting details on exactly where activation checkpointing is failing matters. These knobs would make TorchTitan debugging significantly friendlier and quicker for new models and accelerators. Otherwise, every developer using TorchTitan would need to hunt for where and how to add these debugging hooks. I can add the above as proposal text (with some more details) to the issue mentioned above, if that works.
OK, fine, I guess it's not that hard to convince me. Instead of scattering debugging configs around, I wonder if we can put all debug options into one config section. I would also appreciate it if you could share your understanding of AC for DSv3 training.
@tianyu-l, it sounds like you want to pull all the debug configs into a separate section of the toml file. Would it be something like the following?

[troubleshoot]

That would require more changes to the code, since some function calls only receive part of the config. I have not looked into DSv3 yet; I will try it out.
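To make the grouping idea concrete, here is a minimal sketch of how a single debug section could be represented once the toml file is parsed. The class name `Troubleshoot`, the field names, and the `from_table` helper are all hypothetical and are not taken from this PR or from the TorchTitan code base; the defaults are chosen to mirror the PyTorch defaults for the underlying arguments.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class Troubleshoot:
    """Hypothetical container for the debug knobs discussed in this thread."""

    deterministic_warn_only: bool = False  # hypothetical name
    ac_preserve_rng_state: bool = True     # hypothetical name
    ac_debug: bool = False                 # hypothetical name

    @classmethod
    def from_table(cls, table: Mapping[str, Any]) -> "Troubleshoot":
        # Reject unknown keys so a typo in the toml file fails loudly.
        known = {f.name for f in fields(cls)}
        unknown = set(table) - known
        if unknown:
            raise ValueError(f"unknown troubleshoot option(s): {sorted(unknown)}")
        return cls(**table)


# A parsed toml file is a nested mapping; a [troubleshoot] table is one entry.
parsed = {"troubleshoot": {"deterministic_warn_only": True, "ac_debug": True}}
cfg = Troubleshoot.from_table(parsed.get("troubleshoot", {}))
```

Passing one such object to the functions that need it would avoid threading individual flags through call sites that currently receive only part of the config.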
@fegin, what are your thoughts on this?
use_deterministic_algorithms() warn_only
ac preserve_rng_state
ac debug
refer: #1736