Skip to content

Add load and run tests for checkpoints that we want to have BC #2792

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jerryzh168
Copy link
Contributor

@jerryzh168 jerryzh168 commented Aug 18, 2025

Stacked PRs:


Add load and run tests for checkpoints that we want to have BC

Summary:
Added load and run tests to make sure previously saved checkpoints can continue to load and run.

includes FP8, INT4 and INT4 + preshuffled checkpoints since these might reach larger audience

Test Plan:
python test/integration/test_load_and_run_checkpoint.py

Reviewers:

Subscribers:

Tasks:

Tags:

Copy link

pytorch-bot bot commented Aug 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2792

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job, 9 Pending

As of commit 483898d with merge base fee314b (image):

NEW FAILURE - The following job has failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 18, 2025
@jerryzh168 jerryzh168 force-pushed the jerryzh168/stack/28 branch from 80009d9 to 074593b Compare August 18, 2025 22:39
@jerryzh168 jerryzh168 added the topic: not user facing Use this tag if you don't want this PR to show up in release notes label Aug 18, 2025
@jerryzh168 jerryzh168 requested a review from vkuzo August 18, 2025 22:46
@jerryzh168 jerryzh168 force-pushed the jerryzh168/stack/28 branch 4 times, most recently from 5581e0e to 3d66000 Compare August 19, 2025 01:16
@jerryzh168 jerryzh168 force-pushed the jerryzh168/stack/28 branch from 3d66000 to db92794 Compare August 19, 2025 23:32
"Float8DynamicActivationFloat8WeightConfig" in model_name
and not is_sm_at_least_90()
):
return unittest.skip("FP8 checkpoint is produced in SM90+")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we ensure we have CI with sm90+ running this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's tested here:

pytest test/integration --verbose -s

added this skip because it fails in this test:

pytest test/integration --verbose -s
(SM89)


_DEPRECATED_SINGLE_LINEAR_MODEL_NAMES = [
# model card: https://huggingface.co/torchao-testing/single-linear-Float8DynamicActivationFloat8WeightConfig-v1-0.13-dev
"torchao-testing/single-linear-Float8DynamicActivationFloat8WeightConfig-v1-0.13-dev",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: specify what here is torchao version and what is config version? torchao-testing/model:single_linear-config:Float8DynamicActivationFloat8WeightConfig-config_version:1-torchao:0.13-dev?

Copy link
Contributor Author

@jerryzh168 jerryzh168 Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems a bit long, would changing 0.13-dev to 0.13.dev be clearer that this is a torchao version?

also just found that : is not supported in hf model naming, only -, . and _ are supported.

can also move around the 0.13.dev to make it clearer as well, please let me know

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems a bit long, would changing 0.13-dev to 0.13.dev be clearer that this is a torchao version?

I think we should go for clarity here, if it's long that's ok. Can we pick one of -, ., _ as a delimiter, and make the name be formatted with key1.val1,key2.val2,..., with whatever delimiters between pairs and key:val that we can use

Copy link
Contributor Author

@jerryzh168 jerryzh168 Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have to be in the name, how about adding these in model card descriptions? there are some restrictions to the naming, including max length of 96:

Only regular characters and '-', '_', '.' are accepted. '--' and '..' are forbidden. '-' and '.' cannot start or end the name. The name cannot end with ".git" or ".ipynb". Max length is 96.

tried Float8 and it's 111: model_single-linear-config_Float8DynamicActivationFloat8WeightConfig-config-version_v2-torchao-version_0.13.dev

@jerryzh168 jerryzh168 force-pushed the jerryzh168/stack/28 branch from db92794 to 40bf04c Compare August 20, 2025 00:55
Copy link
Contributor

@vkuzo vkuzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to make the naming exact

@jerryzh168
Copy link
Contributor Author

updated model card to include exact descriptions due to naming length restrictions

@jerryzh168 jerryzh168 force-pushed the jerryzh168/stack/28 branch from 40bf04c to fb85e52 Compare August 20, 2025 16:26
Summary:
Added load and run tests to make sure previously saved checkpoints can continue to load and run.

includes FP8, INT4 and INT4 + preshuffled checkpoints since these might reach larger audience

Test Plan:
python test/integration/test_load_and_run_checkpoint.py

Reviewers:

Subscribers:

Tasks:

Tags:

stack-info: PR: #2792, branch: jerryzh168/stack/28
@jerryzh168 jerryzh168 force-pushed the jerryzh168/stack/28 branch from fb85e52 to 483898d Compare August 20, 2025 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. topic: not user facing Use this tag if you don't want this PR to show up in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants