
Conversation


@noeyy-mino noeyy-mino commented Oct 21, 2025

What does this PR do?

Type of change: new tests

Overview:
Add vLLM/SGLang/TRT-LLM deployment tests for published checkpoints on Hugging Face.
Add an end-to-end test case for gpt-oss.

Usage

# pytest test_deploy.py -k "vllm"

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

Additional Information

Summary by CodeRabbit

  • Tests
    • Added ModelDeployer and ModelDeployerList test utilities with COMMON_PROMPTS to run multi-backend deployment and inference scenarios.
    • Introduced an end-to-end GPT-OSS QAT pipeline test (SFT → QAT → MXFP4 conversion) with optional deployment/benchmarking.
    • Added a broad parametric LLM deployment test suite covering many models/backends with readable test IDs.
    • Added automatic Hugging Face cache cleanup after tests to keep runs isolated.

Contributor

coderabbitai bot commented Oct 21, 2025

Walkthrough

Adds a test utility module for multi-backend model deployment and two new test suites: a GPT-OSS QAT end-to-end pipeline and parametric NVIDIA model deployment tests with HF cache cleanup and per-test environment teardown. Includes runtime/hardware checks, backend-specific deployment logic, and generated deployer configurations.

Changes

Cohort / File(s): Summary

  • Test Deployment Utilities (tests/_test_utils/deploy_utils.py)
    New module introducing ModelDeployer (backends: trtllm, vllm, sglang) with initialization params, run() orchestration (CUDA/SM/GPU checks, runtime skips), backend-specific _deploy_* methods, ModelDeployerList to produce Cartesian-product deployer configurations (see the sketch after this table), and COMMON_PROMPTS.
  • GPT-OSS QAT Test Suite (tests/examples/gpt_oss/test_gpt_oss_qat.py)
    New test module adding a GPTOSS class implementing SFT training, QAT training, MXFP4 conversion, and a deploy_gpt_oss_trtllm step. Adds a parameterized test_gpt_oss_complete_pipeline for openai/gpt-oss-20b and openai/gpt-oss-120b, with directory assertions and mock/dataset fallbacks.
  • NVIDIA Model Deployment Tests (tests/examples/llm_ptq/test_deploy.py)
    New test suite using ModelDeployerList to run deployments across many models/backends. Adds clear_hf_cache() to remove HF cache entries, an autouse cleanup_after_test fixture, an HF_CACHE_PATH constant, a readable test-id helper, and many parametric deployment tests.
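To make the ModelDeployerList behavior above concrete, here is a minimal, hypothetical sketch of how a Cartesian-product expansion into per-test deployer configurations with readable ids could look; the class name, fields, and helper are illustrative assumptions, not the actual code in tests/_test_utils/deploy_utils.py.

# Illustrative sketch only; field names and the helper are hypothetical,
# not the ModelDeployer/ModelDeployerList implementation from this PR.
import itertools
from dataclasses import dataclass


@dataclass
class DeployerConfig:
    model_id: str
    backend: str                 # one of "trtllm", "vllm", "sglang"
    tensor_parallel_size: int
    test_id: str = ""


def make_deployer_configs(model_ids, backends, tp_sizes):
    """Expand the Cartesian product of parameters into per-test configs."""
    configs = []
    for model_id, backend, tp in itertools.product(model_ids, backends, tp_sizes):
        cfg = DeployerConfig(model_id, backend, tp)
        # Readable pytest id, e.g. "Llama-3.1-8B-Instruct-FP8-vllm-tp2"
        cfg.test_id = f"{model_id.split('/')[-1]}-{backend}-tp{tp}"
        configs.append(cfg)
    return configs


if __name__ == "__main__":
    for cfg in make_deployer_configs(
        ["nvidia/Llama-3.1-8B-Instruct-FP8"],  # hypothetical checkpoint name
        ["vllm", "trtllm", "sglang"],
        [1, 2],
    ):
        print(cfg.test_id)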

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Function
    participant MDT as ModelDeployerList
    participant MD as ModelDeployer
    participant Backend as Backend (trtllm / vllm / sglang)
    participant GPU as GPU / nvidia-smi

    Test->>MDT: instantiate with params
    MDT->>MDT: generate Cartesian product of configs
    MDT-->>Test: yield ModelDeployer instances

    loop for each ModelDeployer
        Test->>MD: call run()
        MD->>MD: validate CUDA / SM / GPU count
        alt requirements not met
            MD-->>Test: skip test (pytest.skip)
        else
            MD->>Backend: import & initialize backend client
            Backend-->>MD: ready
            MD->>Backend: send COMMON_PROMPTS for inference
            Backend-->>MD: inference outputs
            MD->>GPU: run nvidia-smi
            GPU-->>MD: status output
            MD-->>Test: finish
        end
    end
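The gating step shown in the diagram (validate CUDA / SM / GPU count, then skip or proceed) can be pictured with the minimal sketch below; the helper name, SM threshold, and the trtllm-bench check are assumptions for illustration, not the real run() implementation.

# Minimal sketch of the pre-deployment gating shown in the diagram.
# Thresholds and the helper name are illustrative assumptions.
import shutil

import pytest
import torch


def check_runtime(backend: str, min_gpus: int = 1, min_sm: int = 80) -> None:
    if not torch.cuda.is_available():
        pytest.skip("CUDA not available")
    if torch.cuda.device_count() < min_gpus:
        pytest.skip(f"Requires at least {min_gpus} GPU(s)")
    major, minor = torch.cuda.get_device_capability(0)
    if major * 10 + minor < min_sm:
        pytest.skip(f"Requires SM >= {min_sm}, found SM {major}{minor}")
    if backend == "trtllm" and shutil.which("trtllm-bench") is None:
        pytest.skip("trtllm-bench not found in PATH")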

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through prompts and GPU steam,
rolled deployers into one neat scheme.
vllm, trtllm, sglang on the line,
caches cleared — the tests align.
A joyful rabbit — CI will beam!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 36.84%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "Add functional test cases for published checkpoints on HF" accurately reflects the main change: three new test files (deploy_utils.py, test_gpt_oss_qat.py, test_deploy.py) that implement functional tests for multi-backend deployment of published Hugging Face checkpoints. It is concise, clear, and specific enough for a teammate scanning the commit history to understand the change without additional context.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (7)
tests/examples/gpt_oss/test_gpt_oss_qat.py (2)

66-67: Isolate distributed ports to avoid CI collisions.

accelerate launch can clash on MASTER_PORT in parallel CI. Use the helper’s port setup.

-        run_example_command(cmd_parts, "gpt-oss")
+        run_example_command(cmd_parts, "gpt-oss", setup_free_port=True)
@@
-        run_example_command(cmd_parts, "gpt-oss")
+        run_example_command(cmd_parts, "gpt-oss", setup_free_port=True)

Also applies to: 115-116
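For context, a setup_free_port-style helper typically binds to port 0 so the OS assigns an unused port and then exports it as MASTER_PORT; the sketch below is a generic illustration and may differ from the actual helper in tests/_test_utils/examples/run_command.py.

# Generic illustration of free-port setup; the repo's helper may differ.
import os
import socket


def setup_free_master_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
        port = sock.getsockname()[1]
    os.environ["MASTER_PORT"] = str(port)  # read by accelerate/torchrun
    return port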


202-208: Consider marking this full E2E pipeline as slow/optional.

This test downloads models and runs training/benchmarks; it’s better opt-in (env marker) to keep CI green.

Example:

pytestmark = pytest.mark.skipif(
    os.getenv("RUN_GPT_OSS_E2E") != "1", reason="Set RUN_GPT_OSS_E2E=1 to run heavy E2E"
)
tests/examples/llm_ptq/test_deploy.py (3)

33-48: Tighten HF cache cleanup pattern and add safety.

Limit deletions to HuggingFace model dirs to avoid accidental removals of unrelated paths containing “nvidia”.

-            for item in os.listdir(HF_CACHE_PATH):
+            for item in os.listdir(HF_CACHE_PATH):
                 item_path = os.path.join(HF_CACHE_PATH, item)
-                if os.path.isdir(item_path) and "nvidia" in item:
+                if (
+                    os.path.isdir(item_path)
+                    and (item.startswith("models--nvidia--") or item.startswith("datasets--nvidia--"))
+                ):
                     shutil.rmtree(item_path, ignore_errors=True)
                     print(f"✓ Removed: {item}")

50-55: Run cache cleanup once per session to cut test time.

Per-test rmtree can add minutes across this matrix. Prefer session-scope, and also clear before the session starts.

-@pytest.fixture(autouse=True)
-def cleanup_after_test():
-    """Automatically clean up after each test."""
-    yield  # Run the test
-    clear_hf_cache()  # Clean up after test completes
+@pytest.fixture(scope="session", autouse=True)
+def cleanup_hf_cache_once():
+    """Clean HF cache before and after the test session."""
+    clear_hf_cache()
+    yield
+    clear_hf_cache()

57-96: Gate the huge test matrix behind env toggles to protect CI time.

The Cartesian products across backends/models easily create dozens of heavy tests. Add an env/mark gate or shrink defaults to smoke coverage.

Example:

if os.getenv("RUN_HF_DEPLOY_TESTS") != "1":
    pytest.skip("Set RUN_HF_DEPLOY_TESTS=1 to enable HF deployment tests", allow_module_level=True)

Also applies to: 97-171, 173-244, 246-263, 265-301, 303-336, 338-352, 354-381, 383-434

tests/_test_utils/deploy_utils.py (2)

80-86: Label says “Before Test” but it’s printed after running.

Either move it earlier or update the label.

-        print("\n=== GPU Status Before Test ===")
+        print("\n=== GPU Status After Test ===")

174-191: Add a minimal assertion for SGLang output.

Ensure the backend produces a non-empty text.

-        print(llm.generate(["What's the age of the earth? "]))
+        out = llm.generate(["What's the age of the earth? "])
+        assert isinstance(out, list) and len(out) > 0, "SGLang returned empty output"
+        print(out)
         llm.shutdown()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4476f21 and 4c8261a.

📒 Files selected for processing (3)
  • tests/_test_utils/deploy_utils.py (1 hunks)
  • tests/examples/gpt_oss/test_gpt_oss_qat.py (1 hunks)
  • tests/examples/llm_ptq/test_deploy.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/examples/gpt_oss/test_gpt_oss_qat.py (2)
tests/_test_utils/examples/run_command.py (1)
  • run_example_command (35-43)
tests/_test_utils/torch_misc.py (1)
  • minimum_gpu (51-55)
tests/examples/llm_ptq/test_deploy.py (1)
tests/_test_utils/deploy_utils.py (2)
  • ModelDeployerList (193-225)
  • run (59-86)
tests/_test_utils/deploy_utils.py (1)
modelopt/deploy/llm/generate.py (1)
  • LLM (53-291)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
tests/examples/gpt_oss/test_gpt_oss_qat.py (1)

98-115: No issues found; review comment is incorrect.

The output confirms that sft.py is the correct QAT entrypoint. The examples/gpt-oss/README.md explicitly documents "A real end-to-end example for this is in sft.py" for QAT training. The README shows the exact command structure used in the test: sft.py with --quant_cfg MXFP4_MLP_WEIGHT_ONLY_CFG. No separate QAT script exists; QAT is performed through the SFT trainer with quantization arguments passed via the parser. The test code correctly follows the documented pattern.

Likely an incorrect or invalid review comment.

Comment on lines +138 to +173
def _deploy_vllm(self):
    """Deploy a model using vLLM."""
    try:
        from vllm import LLM, SamplingParams
    except ImportError:
        pytest.skip("vllm package not available")

    quantization_method = "modelopt"
    if "FP4" in self.model_id:
        quantization_method = "modelopt_fp4"
    llm = LLM(
        model=self.model_id,
        quantization=quantization_method,
        tensor_parallel_size=self.tensor_parallel_size,
        trust_remote_code=True,
    )
    sampling_params = SamplingParams(temperature=0.8, top_p=0.9)
    outputs = llm.generate(COMMON_PROMPTS, sampling_params)

    # Assertions and output
    assert len(outputs) == len(COMMON_PROMPTS), (
        f"Expected {len(COMMON_PROMPTS)} outputs, got {len(outputs)}"
    )

    for i, output in enumerate(outputs):
        assert output.prompt == COMMON_PROMPTS[i], f"Prompt mismatch at index {i}"
        assert hasattr(output, "outputs"), f"Output {i} missing 'outputs' attribute"
        assert len(output.outputs) > 0, f"Output {i} has no generated text"
        assert hasattr(output.outputs[0], "text"), f"Output {i} missing 'text' attribute"
        assert isinstance(output.outputs[0].text, str), f"Output {i} text is not a string"
        assert len(output.outputs[0].text) > 0, f"Output {i} generated empty text"

        print(f"Model: {self.model_id}")
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
        print("-" * 50)

Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

vLLM quantization string may vary by version; confirm.

Some versions expect "modelopt" vs "modelopt_fp4" or different aliases. Please confirm the CI vLLM version supports these strings.


🌐 Web query:

What quantization strings does vLLM accept for NVIDIA Model Optimizer (ModelOpt) FP8/FP4 checkpoints?

💡 Result:

vLLM accepts the following ModelOpt quantization strings in the hf_quant_config.json "quant_algo" / "kv_cache_quant_algo" fields:

  • "FP8" (for FP8 checkpoints; also allowed for kv_cache_quant_algo). [1]
  • "NVFP4" (for ModelOpt NVFP4/FP4 checkpoints). [1][2]

Incorrect quantization string identifiers; use "FP8" and "NVFP4" instead.

vLLM accepts "FP8" for FP8 checkpoints and "NVFP4" for ModelOpt NVFP4/FP4 checkpoints (specified in hf_quant_config.json "quant_algo" fields). The code currently uses "modelopt" and "modelopt_fp4", which are not recognized by vLLM and will fail at runtime. Update lines 141–142 to use the correct identifiers:

  • Replace "modelopt" with "FP8"
  • Replace "modelopt_fp4" with "NVFP4"
🤖 Prompt for AI Agents
In tests/_test_utils/deploy_utils.py around lines 138 to 173, the quantization
string identifiers are incorrect: replace the current defaults so that
quantization_method is "FP8" instead of "modelopt", and when the model_id
contains "FP4" set quantization_method to "NVFP4" instead of "modelopt_fp4";
update the two assignment literals accordingly so vLLM recognizes the
quantization algorithm.

Comment on lines 162 to 199
def deploy_gpt_oss_trtllm(self, tmp_path):
    """Deploy GPT-OSS model with TensorRT-LLM."""
    # Prepare benchmark data
    tensorrt_llm_workspace = "/app/tensorrt_llm"
    script = os.path.join(tensorrt_llm_workspace, "benchmarks", "cpp", "prepare_dataset.py")
    model_name = self.model_path.split("/")[-1]
    benchmark_file = f"{model_name}_synthetic_128_128.txt"

    if not os.path.exists(benchmark_file) or os.path.getsize(benchmark_file) == 0:
        print(f"Creating dataset file '{benchmark_file}'...")
        with open(benchmark_file, "w") as fp:
            subprocess.run(
                f"python {script} --stdout --tokenizer={self.model_path} token-norm-dist --input-mean 128 \
                --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1400",
                shell=True,
                check=True,
                stdout=fp,
            )
    else:
        print(f"Dataset file '{benchmark_file}' already exists.")

    assert os.path.isfile(benchmark_file), f"Benchmark file '{benchmark_file}' should exist"

    cmd_parts = [
        "trtllm-bench",
        "--model",
        self.model_path,
        "throughput",
        "--backend",
        "pytorch",
        "--dataset",
        benchmark_file,
        "--kv_cache_free_gpu_mem_fraction",
        "0.9",
        "--report_json",
        str(tmp_path / "low_latency_throughput.json"),
    ]
    run_example_command(cmd_parts, "gpt-oss")

Contributor

⚠️ Potential issue | 🟠 Major

Make dataset generation and bench robust (path, existence, tool checks).

  • Hard-coded "/app/tensorrt_llm" may not exist.
  • Dataset file is created in the repo root but consumed from examples/gpt-oss (cwd mismatch).
  • Missing checks for prepare_dataset.py and trtllm-bench binaries.
@@
-        tensorrt_llm_workspace = "/app/tensorrt_llm"
-        script = os.path.join(tensorrt_llm_workspace, "benchmarks", "cpp", "prepare_dataset.py")
+        import shutil
+        script = os.getenv(
+            "TLLM_PREPARE_DATASET",
+            os.path.join(os.getenv("TLLM_WORKSPACE", ""), "benchmarks", "cpp", "prepare_dataset.py"),
+        )
@@
-        model_name = self.model_path.split("/")[-1]
-        benchmark_file = f"{model_name}_synthetic_128_128.txt"
+        model_name = self.model_path.split("/")[-1]
+        benchmark_file = tmp_path / f"{model_name}_synthetic_128_128.txt"
@@
-        if not os.path.exists(benchmark_file) or os.path.getsize(benchmark_file) == 0:
+        if not os.path.exists(script) or not os.path.isfile(script):
+            pytest.skip(f"prepare_dataset.py not found at: {script}")
+        if shutil.which("trtllm-bench") is None:
+            pytest.skip("trtllm-bench not found in PATH")
+        if not os.path.exists(benchmark_file) or os.path.getsize(benchmark_file) == 0:
@@
-            with open(benchmark_file, "w") as fp:
+            with open(benchmark_file, "w") as fp:
                 subprocess.run(
-                    f"python {script} --stdout --tokenizer={self.model_path} token-norm-dist --input-mean 128 \
+                    f"python {script} --stdout --tokenizer={self.model_path} token-norm-dist --input-mean 128 \
                     --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1400",
                     shell=True,
                     check=True,
                     stdout=fp,
                 )
@@
-        assert os.path.isfile(benchmark_file), f"Benchmark file '{benchmark_file}' should exist"
+        assert os.path.isfile(benchmark_file) and os.path.getsize(benchmark_file) > 0, (
+            f"Benchmark file '{benchmark_file}' missing or empty"
+        )
@@
-            "--dataset",
-            benchmark_file,
+            "--dataset",
+            str(benchmark_file),
🤖 Prompt for AI Agents
tests/examples/gpt_oss/test_gpt_oss_qat.py around lines 162 to 199: the
deploy_gpt_oss_trtllm function assumes hard-coded workspace and writes/reads
dataset in repo root and does not verify required tools; change it to (1)
compute tensorrt_llm_workspace relative to the test file or an env var and
verify the directory exists, (2) build absolute paths for prepare_dataset.py and
the benchmark file inside the test tmp_path (or examples/gpt-oss working dir) so
creation and consumption use the same cwd, (3) check that prepare_dataset.py is
present and executable and that trtllm-bench is available on PATH before
running, and (4) run subprocesses with explicit cwd and capture/check return
codes, raising clear assertion errors if files/tools are missing.


codecov bot commented Oct 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.40%. Comparing base (2fd67cc) to head (12943bb).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #455   +/-   ##
=======================================
  Coverage   73.40%   73.40%           
=======================================
  Files         180      180           
  Lines       18047    18047           
=======================================
  Hits        13248    13248           
  Misses       4799     4799           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: noeyy-mino <[email protected]>

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/_test_utils/deploy_utils.py (1)

147-149: Incorrect quantization string identifiers; use "FP8" and "NVFP4".

This critical issue from the previous review remains unaddressed. vLLM expects "FP8" for FP8 checkpoints and "NVFP4" for ModelOpt FP4 checkpoints. The current values "modelopt" and "modelopt_fp4" will fail at runtime.

Apply this diff:

-        quantization_method = "modelopt"
+        quantization_method = "FP8"
         if "FP4" in self.model_id:
-            quantization_method = "modelopt_fp4"
+            quantization_method = "NVFP4"
🧹 Nitpick comments (5)
tests/_test_utils/deploy_utils.py (5)

42-50: Complete the docstring with all parameters.

The docstring only documents 4 of the 7 parameters. Please add documentation for attn_backend, base_model, and eagle3_one_model to help users understand their purpose and expected values.


83-88: Misleading comment and timing of GPU status check.

The comment states "Before Test" (line 86), but this GPU status check executes after the deployment method has already run. If the intent is to capture GPU state for debugging failures, consider either:

  1. Moving this check before line 74 (before backend dispatch) and updating the comment
  2. Updating the comment to "After Test" if post-deployment state is the intent

Additionally, the || true in the command (line 84) suppresses failures silently, which may hide useful diagnostic information.


132-138: Add output validation assertions for consistency.

Unlike _deploy_vllm() (lines 160-170), this method only prints outputs without validating them. Consider adding assertions to verify:

  • Number of outputs matches number of prompts
  • Each output has the expected structure
  • Generated text is non-empty

This improves test reliability and maintains consistency across deployment methods.
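A minimal version of the suggested checks could look like the sketch below, written under the assumption that the TRT-LLM path collects one generated string per prompt; the accessors would need to be adapted to the backend's actual return type.

# Hedged sketch of the suggested output validation (assumes a list of strings,
# one generated text per prompt; adjust to the real output objects).
def assert_outputs_valid(prompts, generated_texts):
    assert len(generated_texts) == len(prompts), (
        f"Expected {len(prompts)} outputs, got {len(generated_texts)}"
    )
    for i, text in enumerate(generated_texts):
        assert isinstance(text, str) and text.strip(), f"Output {i} is empty"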


191-191: Use COMMON_PROMPTS instead of hardcoded prompt.

This method uses a hardcoded prompt while other deployment methods use the COMMON_PROMPTS constant defined at line 23. For consistency and maintainability, consider using COMMON_PROMPTS here as well.

Apply this diff:

-        print(llm.generate(["What's the age of the earth? "]))
+        outputs = llm.generate(COMMON_PROMPTS)
+        print(outputs)

207-212: Consider adding test_id to ModelDeployer.__init__.

The code dynamically assigns a test_id attribute to deployer instances (line 211), but this attribute is not defined in ModelDeployer.__init__. For clarity and to avoid confusion, consider either:

  1. Adding test_id as a parameter in ModelDeployer.__init__ with a default value
  2. Adding a docstring comment explaining that this attribute is set externally by ModelDeployerList
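Option 1 in the list above could look roughly like the sketch below; the parameter list is abbreviated and the default id format is an assumption, not the actual constructor.

# Abbreviated sketch of option 1: declare test_id in the constructor so the
# attribute is part of ModelDeployer's documented interface (other parameters
# omitted for brevity; the default id format is an assumption).
class ModelDeployer:
    def __init__(self, model_id: str, backend: str,
                 tensor_parallel_size: int = 1, test_id: str = ""):
        self.model_id = model_id
        self.backend = backend
        self.tensor_parallel_size = tensor_parallel_size
        self.test_id = test_id or f"{model_id.split('/')[-1]}-{backend}-tp{tensor_parallel_size}"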
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c8261a and f922be5.

📒 Files selected for processing (1)
  • tests/_test_utils/deploy_utils.py (1 hunks)
🔇 Additional comments (1)
tests/_test_utils/deploy_utils.py (1)

182-184: Incorrect quantization identifiers for SGLang—fix differs from suggested diff.

SGLang accepts "fp8" for FP8 checkpoints and "modelopt_fp4" for ModelOpt NV-FP4 checkpoints. The current code uses "modelopt" on line 182, which is incorrect.

The line 182 value should be changed to "fp8", but line 184 is already correct as "modelopt_fp4" and should not be changed to "NVFP4".

Apply this corrected diff:

-        quantization_method = "modelopt"
+        quantization_method = "fp8"
         if "FP4" in self.model_id:
             quantization_method = "modelopt_fp4"

Likely an incorrect or invalid review comment.


copy-pr-bot bot commented Oct 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/examples/gpt_oss/test_gpt_oss_qat.py (1)

165-202: Fix dataset/bench pathing and tool checks; current code will break (cwd mismatch, hard-coded workspace, no CLI checks).

  • Writes the dataset in the test cwd but consumes it from examples/gpt-oss (via run_example_command), so --dataset won’t be found.
  • Hard-codes /app/tensorrt_llm; may not exist.
  • No checks for prepare_dataset.py or trtllm-bench.
  • Uses shell=True with an interpolated path.

This was flagged earlier and persists; please harden as below.

     def deploy_gpt_oss_trtllm(self, tmp_path):
         """Deploy GPT-OSS model with TensorRT-LLM."""
-        # Prepare benchmark data
-        tensorrt_llm_workspace = "/app/tensorrt_llm"
-        script = os.path.join(tensorrt_llm_workspace, "benchmarks", "cpp", "prepare_dataset.py")
-        model_name = self.model_path.split("/")[-1]
-        benchmark_file = f"{model_name}_synthetic_128_128.txt"
+        # Prepare benchmark data
+        script = os.getenv(
+            "TLLM_PREPARE_DATASET",
+            os.path.join(os.getenv("TLLM_WORKSPACE", ""), "benchmarks", "cpp", "prepare_dataset.py"),
+        )
+        model_name = self.model_path.split("/")[-1]
+        benchmark_file = tmp_path / f"{model_name}_synthetic_128_128.txt"
 
-        if not os.path.exists(benchmark_file) or os.path.getsize(benchmark_file) == 0:
-            print(f"Creating dataset file '{benchmark_file}'...")
-            with open(benchmark_file, "w") as fp:
-                subprocess.run(
-                    f"python {script} --stdout --tokenizer={self.model_path} token-norm-dist --input-mean 128 \
-                    --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1400",
-                    shell=True,
-                    check=True,
-                    stdout=fp,
-                )
+        # Preflight checks
+        if not script or not os.path.isfile(script):
+            pytest.skip(f"prepare_dataset.py not found at: {script}")
+        if shutil.which("trtllm-bench") is None:
+            pytest.skip("trtllm-bench not found in PATH")
+
+        if not benchmark_file.exists() or benchmark_file.stat().st_size == 0:
+            print(f"Creating dataset file '{benchmark_file}'...")
+            with open(benchmark_file, "w") as fp:
+                subprocess.run(
+                    [
+                        sys.executable,
+                        script,
+                        "--stdout",
+                        f"--tokenizer={self.model_path}",
+                        "token-norm-dist",
+                        "--input-mean",
+                        "128",
+                        "--output-mean",
+                        "128",
+                        "--input-stdev",
+                        "0",
+                        "--output-stdev",
+                        "0",
+                        "--num-requests",
+                        "1400",
+                    ],
+                    check=True,
+                    stdout=fp,
+                )
         else:
             print(f"Dataset file '{benchmark_file}' already exists.")
 
-        assert os.path.isfile(benchmark_file), f"Benchmark file '{benchmark_file}' should exist"
+        assert benchmark_file.is_file() and benchmark_file.stat().st_size > 0, (
+            f"Benchmark file '{benchmark_file}' missing or empty"
+        )
 
         cmd_parts = [
             "trtllm-bench",
             "--model",
             self.model_path,
             "throughput",
             "--backend",
             "pytorch",
             "--dataset",
-            benchmark_file,
+            str(benchmark_file),
             "--kv_cache_free_gpu_mem_fraction",
             "0.9",
             "--report_json",
             str(tmp_path / "low_latency_throughput.json"),
         ]
         run_example_command(cmd_parts, "gpt-oss")
#!/bin/bash
# Quick preflight: verify required tools and script location.
set -euo pipefail

echo "accelerate: $(command -v accelerate || echo 'MISSING')"
echo "trtllm-bench: $(command -v trtllm-bench || echo 'MISSING')"
echo "TLLM_WORKSPACE=${TLLM_WORKSPACE:-}"
# Try to find prepare_dataset.py under TLLM_WORKSPACE or repo
if [[ -n "${TLLM_WORKSPACE:-}" ]]; then
  test -f "${TLLM_WORKSPACE}/benchmarks/cpp/prepare_dataset.py" && echo "prepare_dataset.py found in TLLM_WORKSPACE" || echo "prepare_dataset.py NOT found under TLLM_WORKSPACE"
else
  fd -H -a -t f prepare_dataset.py 2>/dev/null || true
fi
🧹 Nitpick comments (6)
tests/examples/gpt_oss/test_gpt_oss_qat.py (6)

25-41: Docstring is stale (hard-codes 20B); make it model-agnostic.

The implementation is parameterized by model_path, but the docstring still refers to 20B and fixed input/output names.

 class GPTOSS:
-    """Test GPT-OSS-20B QAT (Quantization-Aware Training) pipeline.
-
-    This test suite covers the complete GPT-OSS-20B optimization pipeline:
-
-    Step 1: test_gpt_oss_sft_training - Supervised Fine-Tuning (SFT)
-           Input: openai/gpt-oss-20b
-           Output: gpt-oss-20b-sft
-
-    Step 2: test_gpt_oss_qat_training - Quantization-Aware Training (QAT)
-           Input: gpt-oss-20b-sft (from Step 1)
-           Output: gpt-oss-20b-qat
-
-    Step 3: test_gpt_oss_mxfp4_conversion - MXFP4 Weight-Only Conversion
-           Input: gpt-oss-20b-qat (from Step 2)
-           Output: gpt-oss-20b-qat-real-mxfp4
-
-    Each step can be run independently (with mock inputs) or as part of the full pipeline.
-    """
+    """Test GPT‑OSS QAT (Quantization‑Aware Training) pipeline.
+
+    Pipeline:
+      1) SFT training
+         Input: {model_path}
+         Output: <model_name>-sft
+      2) QAT training (MXFP4 weight‑only)
+         Input: <model_name>-sft
+         Output: <model_name>-qat
+      3) MXFP4 conversion
+         Input: <model_name>-qat
+         Output: <model_name>-qat-real-mxfp4
+
+    Each step can run independently (creates minimal mock inputs if prior outputs are missing) or as part of the full pipeline.
+    """

16-22: Add missing imports used by the proposed fixes.

We’ll need sys (for sys.executable) and shutil (for CLI checks).

 import os
 import subprocess
+import shutil
+import sys
 
 import pytest
 from _test_utils.examples.run_command import run_example_command
 from _test_utils.torch_misc import minimum_gpu

67-67: Avoid port collisions: allocate a free MASTER_PORT for accelerate.

Enable setup_free_port so parallel tests don’t fight over the default port.

-        run_example_command(cmd_parts, "gpt-oss")
+        run_example_command(cmd_parts, "gpt-oss", setup_free_port=True)
-        run_example_command(cmd_parts, "gpt-oss")
+        run_example_command(cmd_parts, "gpt-oss", setup_free_port=True)

Also applies to: 117-117


151-158: Use the current Python interpreter instead of hardcoded 'python'.

Prevents environment mismatches (e.g., venvs).

-        cmd_parts = [
-            "python",
+        cmd_parts = [
+            sys.executable,
             "convert_oai_mxfp4_weight_only.py",
             "--model_path",
             str(qat_dir),
             "--output_path",
             str(conversion_output_dir),
         ]

205-211: Gate the heavy end-to-end test (optional).

To keep CI fast and predictable, consider marking as slow or guarding by an env var.

-@pytest.mark.parametrize(
+@pytest.mark.parametrize(
     "model_path",
     [
         pytest.param("openai/gpt-oss-20b", id="gpt-oss-20b", marks=minimum_gpu(2)),
         pytest.param("openai/gpt-oss-120b", id="gpt-oss-120b", marks=minimum_gpu(8)),
     ],
 )
+@pytest.mark.skipif(os.getenv("RUN_GPT_OSS_E2E") != "1", reason="Set RUN_GPT_OSS_E2E=1 to run full GPT‑OSS pipeline")

165-202: Optional: reduce dataset size or make it configurable for CI.

1400 requests can be heavy; allow overriding via env for quicker functional runs.

-                        "--num-requests",
-                        "1400",
+                        "--num-requests",
+                        os.getenv("GPT_OSS_NUM_REQUESTS", "200"),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f922be5 and 12943bb.

📒 Files selected for processing (2)
  • tests/examples/gpt_oss/test_gpt_oss_qat.py (1 hunks)
  • tests/examples/llm_ptq/test_deploy.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/examples/llm_ptq/test_deploy.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/examples/gpt_oss/test_gpt_oss_qat.py (2)
tests/_test_utils/examples/run_command.py (1)
  • run_example_command (35-43)
tests/_test_utils/torch_misc.py (1)
  • minimum_gpu (51-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality

@kevalmorabia97
Collaborator

You need to sign your commits with an SSH key. Please take a look at https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md#%EF%B8%8F-signing-your-work

@kevalmorabia97
Collaborator

@noeyy-mino can any of these tests also be made part of the nightly modelopt CICD? If we can use a dummy small model (e.g. 2 layers, smaller hidden size, etc.) and it runs in a few minutes, then we can enable these tests nightly to catch bugs sooner.
If not, we can leave that for later.

@noeyy-mino
Author

No, they are for QA release testing.

HF_CACHE_PATH = os.getenv("HF_HUB_CACHE", os.path.expanduser("~/.cache/huggingface/hub"))


def clear_hf_cache():

Collaborator

Is this cache cleanup for saving space after tests?

Collaborator
Since this is in the tests/examples/llm_ptq directory, it will automatically run when the nightly CICD runs the llm_ptq tests. Can you add some marker here so it's skipped nightly or only run during the QA cycle?
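One possible shape for such a gate, sketched under the assumption of a custom "release" marker and an env flag (both names are placeholders, not an agreed convention):

# conftest.py (hypothetical): skip "release"-marked tests unless the QA flag is set.
import os

import pytest


def pytest_collection_modifyitems(config, items):
    if os.getenv("MODELOPT_QA_TESTS") == "1":
        return  # QA cycle: run everything
    skip_qa = pytest.mark.skip(reason="QA release test; set MODELOPT_QA_TESTS=1 to run")
    for item in items:
        if "release" in item.keywords:
            item.add_marker(skip_qa)


# In tests/examples/llm_ptq/test_deploy.py, the module would then opt in with:
# pytestmark = pytest.mark.release

Registering the marker (e.g. markers = release: QA release tests in pytest.ini) would avoid unknown-marker warnings.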

Signed-off-by: noeyy-mino <[email protected]>