
Conversation

@tzulingk
Contributor

@tzulingk tzulingk commented Oct 28, 2025

Overview:

Adds fault tolerance tests for token overflow scenarios where client requests exceed max_seq_len. The tests verify that the system properly rejects oversized requests and then recovers to serve normal requests.

Details:

  • New fault injection type: TokenOverflowFailure for testing prompt length > max_seq_len scenarios
  • Two-phase testing: Sends 15 oversized requests (2x max_seq_len) followed by 15 normal requests to test rejection and recovery
  • Dynamic configuration: DeploymentSpec.add_arg_to_service() method to set max sequence length at runtime for vLLM (--max-model-len), TRT-LLM (--max-seq-len), and SGLang (--context-length)
  • Enhanced parsing: Detects mixed token tests and calculates recovery time between overflow/recovery phases using worker logs
  • Test coverage: 6 scenarios covering vLLM, TRT-LLM, and SGLang in both aggregated and disaggregated deployments
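The dynamic-configuration step above can be sketched roughly as follows. The `DeploymentSpec` shown here is a simplified stand-in for the real class in tests/utils/managed_deployment.py; only the per-backend flag names are taken from the PR description.

```python
# Hypothetical sketch of applying the per-backend max-sequence-length flag
# via DeploymentSpec.add_arg_to_service(); internals are simplified stand-ins.
import shlex

# Flag names per backend, as listed in the PR description.
MAX_LEN_FLAG = {
    "vllm": "--max-model-len",
    "trtllm": "--max-seq-len",
    "sglang": "--context-length",
}

class DeploymentSpec:
    def __init__(self, services):
        # services maps service name -> list (or string) of container args
        self.services = services

    def add_arg_to_service(self, service_name, arg_name, arg_value):
        args = self.services[service_name]
        if isinstance(args, str):
            # Normalize a single command string into a token list
            args = shlex.split(args)
            self.services[service_name] = args
        if arg_name in args:
            # Update the existing flag's value in place
            args[args.index(arg_name) + 1] = arg_value
        else:
            # Append as a new flag/value pair
            args.extend([arg_name, arg_value])

spec = DeploymentSpec({"VllmWorker": ["--model", "llama"]})
spec.add_arg_to_service("VllmWorker", MAX_LEN_FLAG["vllm"], "2048")
print(spec.services["VllmWorker"])
```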

Where should the reviewer start?

  • tests/fault_tolerance/deploy/scenarios.py - Lines 460-550: Core overflow scenario creation logic
  • tests/utils/managed_deployment.py - Lines 173-213: add_arg_to_service() implementation
  • tests/fault_tolerance/deploy/test_deployment.py - Lines 113-160: Two-phase client execution logic
  • tests/fault_tolerance/deploy/parse_results.py - Lines 174-231: Recovery time calculation for mixed tests

Related Issues:

DIS-872

Summary by CodeRabbit

  • New Features

    • Added token overflow testing support with configurable scenarios across multiple backends.
    • Introduced mixed-token test workflows enabling sequential overflow and recovery phase execution.
    • New paired result analysis to compare and summarize overflow/recovery test outcomes.
  • Improvements

    • Enhanced recovery time calculation for diverse test layouts and configurations.
    • Improved robustness of test result parsing and flexible output control options.

@tzulingk tzulingk requested review from a team as code owners October 28, 2025 02:18
@coderabbitai
Contributor

coderabbitai bot commented Oct 28, 2025

Walkthrough

This pull request introduces token overflow testing support by adding configuration fields for overflow/recovery phases, a new TokenOverflowFailure class for injection, helper functions for parsing mixed-token test directories, output control parameters, and phase-aware test execution logic that manages separate client processes for overflow and recovery phases.
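The phase-aware execution described above can be summarized in a rough sketch. The `Load` fields match those listed in the PR description; the per-phase client handling is a hypothetical stand-in for the real harness, which spawns separate client processes with distinct node suffixes.

```python
# Minimal sketch of the two-phase (overflow then recovery) client flow.
from dataclasses import dataclass

@dataclass
class Load:
    mixed_token_test: bool = False
    overflow_token_length: int = 0
    overflow_request_count: int = 0
    normal_request_count: int = 0

def run_phase(name, request_count, token_length):
    # In the real test this spawns a client process with a distinct
    # node suffix; here we just record what would run.
    return {"phase": name, "requests": request_count, "tokens": token_length}

def run_mixed_token_test(load: Load, max_seq_len: int):
    phases = []
    # Phase 1: oversized requests that should be rejected
    phases.append(run_phase("overflow", load.overflow_request_count,
                            load.overflow_token_length))
    # Phase 2: normal requests that should succeed after recovery
    phases.append(run_phase("recovery", load.normal_request_count,
                            max_seq_len // 2))
    return phases

load = Load(mixed_token_test=True, overflow_token_length=4096,
            overflow_request_count=15, normal_request_count=15)
for p in run_mixed_token_test(load, max_seq_len=2048):
    print(p["phase"], p["requests"])
```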

Changes

  • Parse infrastructure (tests/fault_tolerance/deploy/parse_factory.py, tests/fault_tolerance/deploy/parse_results.py): Added a print_output: bool = True parameter to parse_test_results() and propagated it through parser calls. Extended process_single_test() and main() with print_output for controlled output. Added extract_test_info_from_dir() and get_decode_worker_dir() helpers to parse test-directory configuration for non-standard layouts. Introduced process_overflow_recovery_test() to handle paired overflow/recovery result summaries. Enhanced AI-Perf parsing robustness for missing nested dictionaries.
  • Test scenario configuration (tests/fault_tolerance/deploy/scenarios.py): Added mixed-token test configuration fields to the Load class: mixed_token_test, overflow_token_length, overflow_request_count, normal_request_count. Introduced the TokenOverflowFailure class with overflow multiplier and token-count computation. Added add_token_overflow_scenarios() to generate and register token overflow test scenarios across backends (vllm, trtllm, sglang).
  • Test execution and deployment (tests/fault_tolerance/deploy/test_deployment.py, tests/utils/managed_deployment.py): Implemented the mixed-token test flow with overflow and recovery phases, each spawning separate client processes with distinct node suffixes. Added TokenOverflowFailure handling in failure injection to skip standard pod/process injection. Expanded results processing to compute paired log paths and invoke dedicated overflow/recovery result parsing. Added DeploymentSpec.add_arg_to_service() to configure service arguments dynamically.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • test_deployment.py: Review the mixed-token test flow logic, phase detection, and log path coordination for overflow/recovery cycles; verify proper cleanup and client process management across phases.
  • parse_results.py: Validate new helper functions for directory parsing and the extended recovery time calculation logic; check robustness of AI-Perf parsing fallbacks for missing nested dictionaries.
  • scenarios.py: Ensure token overflow scenario registration and TokenOverflowFailure initialization are correct across all backends.

Poem

🐰 Token overflow tests now flow,
With phases split—high and low.
Recovery paths parsed with care,
Output controlled with flair!
The rabbit hops through logic clear.

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: The pull request description follows the required template, with all four sections present and well populated. The Overview clearly states the purpose of adding token overflow tests; the Details section lists the changes, including the new TokenOverflowFailure type, the two-phase testing approach, the dynamic configuration method, enhanced parsing, and test coverage. The "Where should the reviewer start?" section gives specific file paths and line-number ranges, and the Related Issues section references the issue number. The description lets reviewers understand both the what and the why of the changes.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 85.00%, which meets the required threshold of 80.00%.
  • Title Check — ✅ Passed: The title "feat: Add prompt > seq_len k8 tests" corresponds to the main objective of the changeset: fault-tolerance tests for requests exceeding max_seq_len, the new TokenOverflowFailure injection type, two-phase execution (overflow then recovery), and parsing for mixed-token tests. "k8" refers to the Kubernetes deployment environment. The title is specific and meaningful rather than vague or generic.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/fault_tolerance/deploy/test_deployment.py (1)

336-349: Robust TRT‑LLM model detection without relying on deployment name

scenario.deployment.name is overwritten to "fault-tolerance-test", so agg/disagg inference by name fails. Probe services instead:

-            elif scenario.backend == "trtllm":
-                # Determine deployment type from scenario deployment name
-                if (
-                    "agg" in scenario.deployment.name
-                    and "disagg" not in scenario.deployment.name
-                ):
-                    model = scenario.deployment["TRTLLMWorker"].model
-                else:
-                    model = scenario.deployment["TRTLLMDecodeWorker"].model
+            elif scenario.backend == "trtllm":
+                try:
+                    model = scenario.deployment["TRTLLMWorker"].model  # agg
+                except KeyError:
+                    model = scenario.deployment["TRTLLMDecodeWorker"].model  # disagg

This prevents falling back to the default model unnecessarily.

🧹 Nitpick comments (2)
tests/utils/managed_deployment.py (1)

319-371: Harden arg editing: support --arg=value, drop inner import, minor validation

  • Existing logic misses equals-form tokens (e.g., --max-seq-len=2048) and may duplicate args.
  • Redundant inner import shlex (already imported at file top).
  • Optional: normalize/validate arg_name shape.

Apply:

 def add_arg_to_service(self, service_name: str, arg_name: str, arg_value: str):
@@
-        if isinstance(args_list, str):
-            import shlex
-
-            args_list = shlex.split(args_list)
-            service["extraPodSpec"]["mainContainer"]["args"] = args_list
+        if isinstance(args_list, str):
+            # Normalize single string to list of tokens
+            args_list = shlex.split(args_list)
+            service["extraPodSpec"]["mainContainer"]["args"] = args_list
+
+        # Normalize existing equals-form tokens to [arg, value]
+        normalized: list[str] = []
+        eq_prefix = f"{arg_name}="
+        for tok in args_list:
+            if tok.startswith(eq_prefix):
+                normalized.extend([arg_name, tok[len(eq_prefix):]])
+            else:
+                normalized.append(tok)
+        args_list[:] = normalized
@@
-        # Find existing argument
+        # Find existing argument
         arg_index = None
         for i, arg in enumerate(args_list):
             if arg == arg_name:
                 arg_index = i
                 break
@@
-        else:
-            # Add new argument
-            args_list.extend([arg_name, arg_value])
+        else:
+            # Add new argument
+            args_list.extend([arg_name, arg_value])

Optional: guard unusual names

+        if not arg_name.startswith("--"):
+            logging.warning("add_arg_to_service: unexpected arg_name '%s'", arg_name)

Please confirm if any of your YAMLs use --arg=value style so we can add a targeted unit test for this path.

tests/fault_tolerance/deploy/test_deployment.py (1)

176-186: Log token overflow “injection” for traceability

Currently TokenOverflowFailure path silently continues, so test.log.txt lacks an injection line.

     if isinstance(failure, TokenOverflowFailure):
-        # The actual overflow is handled by the client configuration
-        # which uses the input_token_length from the Load config
-        # This is just logging for visibility
-        continue
+        logger.info(
+            "TokenOverflowFailure active: max_seq_len=%s, overflow_multiplier=%s, tokens=%s",
+            getattr(failure, "max_seq_len", "unknown"),
+            getattr(failure, "overflow_multiplier", "unknown"),
+            getattr(failure, "overflow_token_count", "unknown"),
+        )
+        continue
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a79122c and 840ca00.

📒 Files selected for processing (5)
  • tests/fault_tolerance/deploy/parse_factory.py (4 hunks)
  • tests/fault_tolerance/deploy/parse_results.py (12 hunks)
  • tests/fault_tolerance/deploy/scenarios.py (3 hunks)
  • tests/fault_tolerance/deploy/test_deployment.py (4 hunks)
  • tests/utils/managed_deployment.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/fault_tolerance/deploy/test_deployment.py (3)
tests/fault_tolerance/deploy/parse_results.py (1)
  • process_overflow_recovery_test (675-767)
tests/fault_tolerance/deploy/scenarios.py (2)
  • Load (96-110)
  • TokenOverflowFailure (123-145)
tests/fault_tolerance/deploy/parse_factory.py (1)
  • parse_test_results (101-228)
tests/fault_tolerance/deploy/scenarios.py (1)
tests/utils/managed_deployment.py (3)
  • model (63-74)
  • model (77-105)
  • add_arg_to_service (319-370)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3930/merge) by tzulingk.
tests/fault_tolerance/deploy/test_deployment.py

[error] 290-290: Ruff: Local variable 'all_results' is assigned to but never used. (F841)

🪛 Ruff (0.14.1)
tests/utils/managed_deployment.py

330-330: Avoid specifying long messages outside the exception class

(TRY003)

tests/fault_tolerance/deploy/test_deployment.py

255-255: Do not catch blind exception: Exception

(BLE001)


256-256: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


290-290: Local variable all_results is assigned to but never used

Remove assignment to unused variable all_results

(F841)


297-297: Loop control variable base_name not used within loop body

(B007)

tests/fault_tolerance/deploy/scenarios.py

136-136: Unused method argument: duration

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: operator (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (7)
tests/fault_tolerance/deploy/scenarios.py (2)

106-111: Load extensions look good

Fields for mixed overflow/recovery are clear and self-contained.


518-653: CLI argument flags are correct; no changes required.

All three backend CLI flags have been verified:

  • vLLM uses --max-model-len
  • TensorRT-LLM uses --max_seq_len; the code specifies --max-seq-len (hyphenated), which is acceptable because argparse automatically converts dashes to underscores in optional arguments ✓
  • SGLang uses --context-length

The is_agg detection via substring matching is appropriate and requires no change.

tests/fault_tolerance/deploy/parse_results.py (4)

175-211: Helper to extract backend/deploy_type from dir name — solid

Pattern-based extraction for mixed-token tests is reasonable and isolated.


252-277: Fallback to decode worker for mixed tests — LGTM

Gracefully handles absence of failure_info by deriving component path from test dir.


406-457: AI‑Perf parsing improvements — LGTM

Good fallbacks for zero records and ms→s conversions; preserves robustness on partial data.


577-665: Phase-aware processing: concise and clear

Nice separation of overflow vs recovery behavior with optional printing; no action needed.

tests/fault_tolerance/deploy/parse_factory.py (1)

101-108: print_output propagation — LGTM

New parameter is consistently threaded through aiperf/legacy paths while preserving defaults.

Also applies to: 185-206

@tzulingk tzulingk changed the title Add prompt > seq_len k8 tests. feat: Add prompt > seq_len k8 tests. Oct 28, 2025
@github-actions github-actions bot added the feat label Oct 28, 2025
@tzulingk tzulingk enabled auto-merge (squash) October 28, 2025 04:34
@rmccorm4
Contributor

Please fix the test failures:

tests/fault_tolerance/deploy/scenarios.py:615: in add_token_overflow_scenarios
    overflow_failure = TokenOverflowFailure(
E   TypeError: TokenOverflowFailure.__init__() got an unexpected keyword argument 'duration'


@indrajit96 indrajit96 left a comment


Thanks a lot for this extensive test!
Can we also run some normal tests to make sure we have no regression?
Mostly minor comments with code restructuring and config-reading concerns

@indrajit96
Contributor

Can we also update FT docs with the new test?

@tzulingk
Contributor Author

Please fix the test failures:

tests/fault_tolerance/deploy/scenarios.py:615: in add_token_overflow_scenarios
    overflow_failure = TokenOverflowFailure(
E   TypeError: TokenOverflowFailure.__init__() got an unexpected keyword argument 'duration'

Done in commit 03f7e64

@tzulingk
Contributor Author

Thanks a lot for this extensive test!
Can we also run some normal tests to make sure we have no regression?
Mostly minor comments with code restructuring and config-reading concerns

tested on
test_fault_scenario[sglang-agg-tp-1-dp-1-frontend]
test_fault_scenario[trtllm-agg-tp-2-dp-1-decode_worker_pod]

@tzulingk tzulingk requested a review from indrajit96 October 29, 2025 05:53
@tzulingk
Contributor Author

Can we also update FT docs with the new test?

done in commit de91ba7


@indrajit96 indrajit96 left a comment


Nice work with the test and assertions!
LGTM!


@keivenchang keivenchang left a comment


Nice. Checking both that bad requests get rejected and that the system actually bounces back. Nice work reusing WORKER_MAP and keeping it consistent across all three backends.

Some general coding comments:

  • You've got all_metrics with 10+ fields and deployment_info getting passed around everywhere as plain dicts. These should be dataclasses - then you'll get autocomplete, type checking, and catch bugs before runtime instead of getting KeyErrors in production. Dict[str, Any] loses all the benefits of Python's type system.
  • The print() and logging mix is problematic, you should pick one. Test frameworks should use all logging (like info,debug,warning) so users can control verbosity. I see print(f"\n{'='*60}") in some places and logging.warning() in others - it makes output reading/parsing harder, on the humans and scripts that may read it.

@tzulingk
Contributor Author

Nice. Checking both that bad requests get rejected and that the system actually bounces back. Nice work reusing WORKER_MAP and keeping it consistent across all three backends.

Some general coding comments:

  • You've got all_metrics with 10+ fields and deployment_info getting passed around everywhere as plain dicts. These should be dataclasses - then you'll get autocomplete, type checking, and catch bugs before runtime instead of getting KeyErrors in production. Dict[str, Any] loses all the benefits of Python's type system.
  • The print() and logging mix is problematic, you should pick one. Test frameworks should use all logging (like info,debug,warning) so users can control verbosity. I see print(f"\n{'='*60}") in some places and logging.warning() in others - it makes output reading/parsing harder, on the humans and scripts that may read it.

Created https://linear.app/nvidia/issue/DIS-947/refctor-use-dataclass-for-passing-arguments to track this.
Replaced print() with logging.
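The print()-to-logging switch described here might look like this minimal sketch; the logger name and format string are assumptions, not the repository's actual configuration.

```python
# Small sketch: route summary output through a single logger with
# configurable verbosity instead of bare print() calls.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="[TEST] %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("fault_tolerance")

def log_summary(rejected: int, total: int):
    # Previously: print(f"\n{'='*60}") and friends; with logging,
    # callers can raise the level to silence or filter this output.
    logger.info("=" * 60)
    logger.info("SESSION SUMMARY - COMBINED OVERFLOW/RECOVERY TEST")
    logger.info("Overflow: %d/%d rejected (%.1f%%)",
                rejected, total, 100 * rejected / total)

log_summary(43, 45)
```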

@tzulingk
Contributor Author

@keivenchang Note that although the logs look a bit less clean after replacing print() with logging, I prefer to use logging.info() instead of print(). Using logging is a more standardized approach for message output, and it also avoids the buffering issues between print and logging that can cause mixed or out-of-order logs.

[TEST] 2025-10-30T19:52:54 INFO root: 
============================================================
SESSION SUMMARY - COMBINED OVERFLOW/RECOVERY TEST
============================================================

Phase Breakdown:
  Overflow: 43/45 rejected (95.6%)
  Recovery: 45/45 succeeded (100.0%)
[TEST] 2025-10-30T19:52:54 INFO root: 
============================================================
FAULT TOLERANCE TEST SUMMARY - AI-PERF
============================================================

@tzulingk tzulingk merged commit c4abe9b into main Oct 31, 2025
22 of 23 checks passed
@tzulingk tzulingk deleted the tzulingk/overflow_k8_test branch October 31, 2025 04:20