
Conversation


@tmonty12 tmonty12 commented Nov 24, 2025

Overview:

Adds the following Rolling Upgrade scenarios to the FT test framework:

  • input matrix: vllm/sglang/trtllm × agg/disagg
  • scenario: each worker has 2 replicas; a rolling upgrade is triggered by adding a no-op env var
  • load: constant load that is stopped once the rolling upgrade completes (rather than a time- or request-count-based load)
  • success criteria: 100% of requests complete successfully (not yet enforced by the framework)

Details:

  • Adds the concept of continuous_load to the aiperf client (see the sketch after this list).
    • Uses the aiperf --benchmark-duration flag set to 30 minutes instead of --request-count.
    • _terminate_client_processes terminates the aiperf processes once the scenario has completed by sending a SIGINT; aiperf receives it and exports the needed logs/metrics before exiting.
  • tests/fault_tolerance/deploy/scenarios.py:
    • made Failure more generic with an execute method and implemented subclasses for the different failure types
  • Adds --skip-restart-services for running the FT tests - for local testing there is no need to restart etcd/nats every time.
  • tests/conftest.py: the log directory now includes a timestamp (the same test can be run multiple times without worrying about directory collisions).
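
The continuous-load flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: only the --benchmark-duration flag and the SIGINT-to-flush behavior come from the description; the helper name, timeouts, and command assembly are assumptions.

import signal
import subprocess


def run_continuous_aiperf(base_cmd: list[str], duration_s: int = 1800) -> subprocess.CompletedProcess:
    # Hypothetical helper: drive aiperf by wall-clock duration instead of request count.
    cmd = base_cmd + ["--benchmark-duration", str(duration_s)]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    try:
        # Give the benchmark a grace period beyond its nominal window.
        stdout, stderr = proc.communicate(timeout=duration_s + 300)
    except subprocess.TimeoutExpired:
        # SIGINT (rather than SIGKILL) lets aiperf export its logs/metrics before exiting,
        # mirroring how _terminate_client_processes stops the clients when the scenario ends.
        proc.send_signal(signal.SIGINT)
        stdout, stderr = proc.communicate(timeout=120)
    return subprocess.CompletedProcess(cmd, proc.returncode, stdout, stderr)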

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • New Features

    • Continuous load testing mode for 30-minute benchmark execution
    • Rolling upgrade failure injection scenarios for deployment resilience testing
    • Optional --skip-restart-services flag to skip service restarts during test setup
  • Improvements

    • Test logs now organized with timestamped directories for better traceability
    • Enhanced AI-Perf result parsing to handle cancelled runs and multiple result formats


@tmonty12 tmonty12 requested review from a team as code owners November 24, 2025 19:32
@github-actions github-actions bot added the feat label Nov 24, 2025
coderabbitai bot commented Nov 24, 2025

Walkthrough

The pull request extends the test deployment framework by introducing continuous load mode execution, refactoring fault injection into an abstract class hierarchy, implementing per-service environment variable management in deployments, adding timestamped logging directories, and introducing async signal handling with stronger typing throughout the test orchestration layer.

Changes

Cohort / File(s) — Summary

  • Fixture and Configuration Enhancements (tests/conftest.py, tests/fault_tolerance/deploy/conftest.py): Added datetime import and timestamped log directory naming in the logger fixture; introduced the --skip-restart-services pytest option and corresponding skip_restart_services fixture to control service restart behavior.
  • Client Layer: Continuous Load Support (tests/fault_tolerance/deploy/client.py, tests/fault_tolerance/deploy/client_factory.py, tests/fault_tolerance/deploy/legacy_client.py): Extended run_aiperf and client signatures with a continuous_load: bool parameter; added run_aiperf_with_signal_handling for subprocess management with SIGINT forwarding and timeout logic; the legacy client raises ValueError if continuous load is attempted; updated pod selection to use list-based service names.
  • Failure Injection Architecture (tests/fault_tolerance/deploy/scenarios.py): Introduced an abstract Failure base class with concrete subclasses (RollingUpgradeFailure, DeletePodFailure, TerminateProcessFailure, TokenOverflowFailure); refactored DeploymentInfo with Required type constraints; added a continuous_load field to Load; renamed _create_deployment_spec to _create_deployment_info; added the add_rolling_upgrade_scenarios() helper.
  • Test Results Handling (tests/fault_tolerance/deploy/parse_results.py): Extended AI-Perf result parsing to support dual formats (top-level vs nested "records"); added cancellation logging with completion count tracking; improved data integrity checks for cancelled runs.
  • Test Orchestration Refactor (tests/fault_tolerance/deploy/test_deployment.py): Reworked client process orchestration to use an explicit log_dir; introduced the _terminate_client_processes helper for graceful SIGINT dispatch; replaced manual failure injection with the async polymorphic _inject_failures; updated the test_fault_scenario signature with skip_restart_services; enhanced results handling by deriving the base log directory from request.node.log_dir.
  • Deployment Management Infrastructure (tests/utils/managed_deployment.py): Added per-service environment variable management (set_service_env_var, get_service_env_vars, set_service_replicas); changed the DeploymentSpec.services() return type to list[ServiceSpec]; updated pod/service typing (e.g., Pod, Service from kr8s); added trigger_rolling_upgrade() and wait_for_unready() methods; generalized readiness checks via _wait_for_condition(); introduced the skip_restart_services flag and optional _custom_api typing; refactored get_pods() to accept list[str] | None and return dict[str, list[Pod]].

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • Failure injection abstraction (scenarios.py): Verify all concrete failure subclasses implement execute() correctly and handle edge cases in rolling upgrade, pod deletion, and process termination flows.
  • Deployment API changes (managed_deployment.py): Review type signature changes to get_pods(), DeploymentSpec.services(), and new service environment variable management for correctness and backward compatibility implications.
  • Async signal handling (client.py, test_deployment.py): Validate SIGINT forwarding logic, timeout handling, and Mac-specific edge case (-9) workaround for robustness.
  • Log directory handling (conftest.py, test_deployment.py): Ensure timestamped log directory creation and retrieval via request.node.log_dir is consistent across all test fixtures and client invocations.
  • Continuous load mode wiring (client.py, legacy_client.py, scenarios.py): Verify parameter threading, guard logic in legacy client, and benchmark duration/timeout adjustments are correctly integrated end-to-end.

Poem

🐰 Our warren's tests now dance with grace,
With timestamped logs in every place,
Failures born from abstract kin,
Continuous loads let chaos begin,
Rolling upgrades, signals sent—
A test framework, elegantly bent! 🎯

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 74.14%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
  • Description check — ❓ Inconclusive: The description covers the key aspects (overview, details, related issues) but lacks specific file references for where reviewers should start, and the related issue number is a placeholder. Resolution: specify the exact files for review focus (e.g., scenarios.py, managed_deployment.py) and replace the placeholder #xxx with the actual issue number.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: The title accurately describes the primary change: introducing rolling upgrade test scenarios to the fault tolerance framework.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (12)
tests/fault_tolerance/deploy/parse_results.py (1)

344-409: Dual-format metrics parsing looks solid; consider removing unused error_request_count.

The updated parsing for request_count / error_request_count across records and top-level fields, with type checks, is a good way to support both old and new AI-Perf formats while avoiding type errors.

One small cleanup: error_request_count (lines 401‑409) is computed but never used; all downstream error accounting relies on error_summary via error_count. Either wire error_request_count into a consistency check (e.g., compare against error_count) or drop this block to avoid confusion.
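
As a rough illustration of the dual-format handling discussed here, a sketch that reads request_count from either shape; only the "records" nesting and the request_count field name come from this review, the rest is assumed:

def extract_request_count(results: dict) -> int | None:
    # Newer AI-Perf exports may nest metrics under "records"; older ones keep them top-level.
    for container in (results.get("records"), results):
        if not isinstance(container, dict):
            continue
        value = container.get("request_count")
        # Some formats store a plain number, others a dict of statistics.
        if isinstance(value, (int, float)):
            return int(value)
        if isinstance(value, dict) and isinstance(value.get("avg"), (int, float)):
            return int(value["avg"])
    return None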

tests/fault_tolerance/deploy/legacy_client.py (1)

183-219: Good API compatibility guard for continuous_load on legacy client.

Adding continuous_load to the signature but immediately rejecting True with a ValueError is a clean way to keep call sites uniform while preventing unsupported usage in the legacy path. The error message is clear; if Ruff’s TRY003 rule is enforced in CI, you could either shorten the message slightly or suppress that rule, but functionally this is fine.

tests/conftest.py (1)

156-166: Per-test timestamped log_dir integrates well with FT harness.

Storing logs under {test_name}_{timestamp}/test.log.txt and exposing request.node.log_dir gives each test an isolated directory and matches how test_deployment.py discovers logs. One minor follow-up you might consider later is passing request.node.log_dir (instead of request.node.name) into EtcdServer/NatsServer so all logs for a test land under the same root, but the current split is functionally fine.
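
For reference, a sketch of the per-test timestamped directory pattern described here; the actual fixture in tests/conftest.py may differ in handler setup and naming details:

import logging
from datetime import datetime
from pathlib import Path

import pytest


@pytest.fixture
def logger(request):
    # One directory per test invocation: "<test_name>_<YYYYMMDD_HHMMSS>/test.log.txt"
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_dir = Path(f"{request.node.name}_{timestamp}")
    log_dir.mkdir(parents=True, exist_ok=True)

    # Expose the directory so test_deployment.py can reuse it via request.node.log_dir.
    request.node.log_dir = str(log_dir)

    log = logging.getLogger(request.node.name)
    handler = logging.FileHandler(log_dir / "test.log.txt")
    log.addHandler(handler)
    yield log
    log.removeHandler(handler)
    handler.close()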

tests/fault_tolerance/deploy/test_deployment.py (2)

192-219: SIGINT-based client shutdown is reasonable; consider minor robustness tweaks.

Sending SIGINT to each alive client process and then relying on the _clients teardown to join() them is a sensible way to stop continuous-load clients while allowing them to flush AI‑Perf output.

Two optional enhancements you might consider:

  • Use a narrower exception type (e.g., OSError) in the generic except Exception block if you want to satisfy BLE001, while still catching ProcessLookupError.
  • Optionally log when a process remains alive after SIGINT and join() (with a timeout in _clients) to avoid hard hangs if a client ignores the signal.

These aren’t blockers but could improve diagnosability and static-analysis cleanliness.
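
A minimal sketch of the SIGINT dispatch pattern under discussion, with the two suggested tweaks (narrower exception type, bounded join with logging) folded in; names and timeouts are illustrative:

import logging
import os
import signal
from multiprocessing import Process


def _terminate_client_processes(processes: list[Process], logger: logging.Logger) -> None:
    # Ask each live client to stop; aiperf flushes its artifacts on SIGINT.
    for proc in processes:
        if not proc.is_alive():
            continue
        try:
            os.kill(proc.pid, signal.SIGINT)
        except ProcessLookupError:
            pass  # Raced with a normal exit.
        except OSError as exc:  # Narrower than a blanket Exception, per the note above.
            logger.warning(f"Failed to signal client pid {proc.pid}: {exc}")

    # Bounded join so a client that ignores SIGINT cannot hang the test.
    for proc in processes:
        proc.join(timeout=120)
        if proc.is_alive():
            logger.warning(f"Client pid {proc.pid} still alive after SIGINT and join timeout")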


363-368: Async test orchestration with ManagedDeployment, clients, and failures is well-structured; a couple of small cleanups.

The async test_fault_scenario:

  • Wires in Scenario, image, namespace, and skip_restart_services fixtures appropriately.
  • Uses request.node.log_dir consistently for both ManagedDeployment and client processes, aligning deployment logs and AI‑Perf outputs under the same test directory.
  • Runs clients via _clients, injects failures with _inject_failures, and conditionally terminates continuous-load clients via _terminate_client_processes before the _clients teardown joins them.

Two minor polish items:

  • The # noqa: F811 on the scenario: Scenario parameter appears unused now that you’re importing Scenario rather than shadowing a name; you can drop it to satisfy Ruff (same for the # noqa: F811 on _inject_failures).
  • If you later introduce more async behavior during failure injection or deployment management, it may be worth documenting that scenario.load.continuous_load is currently assumed to be False for mixed-token tests so that _terminate_client_processes only runs in the single-phase continuous-load scenarios.

Functionally this structure looks good.

Also applies to: 408-425

tests/fault_tolerance/deploy/client.py (2)

274-376: Continuous-load behavior in run_aiperf looks consistent; minor log and threshold nits

  • The continuous-load branch (--benchmark-duration 1800, single attempt, fixed timeout) is wired correctly into command construction, retry logic, and logging, and is only enabled when continuous_load=True.
  • Note that validate_aiperf_results still uses requests_per_client to derive the failure threshold even in continuous-load mode; if you intend to eventually enforce “100% success over the whole 30‑minute window”, you may want a follow-up change to base this on actual request_count instead of a synthetic requests_per_client value.
  • Minor: log message at Line 455 has a typo: "existed succesfully" → "exited successfully".
-            f"AI-Perf sustained continuous load for {pod_name} and existed succesfully"
+            f"AI-Perf sustained continuous load for {pod_name} and exited successfully"

Also applies to: 406-456


463-513: Improve SIGINT handler scoping and address static-analysis nits in run_aiperf_with_signal_handling

Functionally this gives you the desired “send SIGINT to aiperf so it can flush artifacts” behavior, but a couple of refinements would make it safer and quieter:

  • signal.signal(signal.SIGINT, signal_handler) is process‑wide and the handler is never restored, so any later SIGINTs in this process will also be routed through this handler. Capturing the previous handler and restoring it in a finally after communicate() (or on all exit paths) will keep the side‑effects localized.
  • signal_handler’s frame argument is intentionally unused; rename it to _frame (or use _ for both args) to silence ARG001 without changing behavior.
  • Ruff S603 about untrusted subprocess input is a false positive here, since cmd_attempt originates from a controlled list constructed in run_aiperf, not from user input, but adding a brief comment noting this assumption could help future readers.

Example refinement:

-    def signal_handler(signum, frame):
+    def signal_handler(signum, _frame):
         logger.info(f"Received signal {signum}, forwarding to aiperf subprocess")
         try:
             proc.send_signal(signal.SIGINT)
         except ProcessLookupError:
             pass  # Process already terminated

-    signal.signal(signal.SIGINT, signal_handler)
+    old_handler = signal.getsignal(signal.SIGINT)
+    signal.signal(signal.SIGINT, signal_handler)
     try:
         stdout, stderr = proc.communicate(timeout=timeout)
         returncode = proc.returncode
     ...
-    return subprocess.CompletedProcess(cmd_attempt, returncode, stdout, stderr)
+    finally:
+        signal.signal(signal.SIGINT, old_handler)
+    return subprocess.CompletedProcess(cmd_attempt, returncode, stdout, stderr)
tests/fault_tolerance/deploy/scenarios.py (2)

152-207: New Failure hierarchy cleanly models fault types; consider using the logger

The abstract Failure base class plus concrete subclasses (RollingUpgradeFailure, DeletePodFailure, TerminateProcessFailure, TokenOverflowFailure) gives you a much clearer, type-safe way to express faults than raw tuples/dicts. The wiring to ManagedDeployment (trigger_rolling_upgrade, get_pods, get_processes) looks consistent.

Minor improvement: RollingUpgradeFailure.execute and DeletePodFailure.execute accept a logger argument but never use it; adding log lines when injecting failures (e.g., which services/pods are being upgraded or deleted) would help when debugging FT runs.
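
A condensed sketch of the class shape this comment describes; the real classes in scenarios.py carry more fields, and whether trigger_rolling_upgrade needs to be awaited depends on the actual ManagedDeployment API:

import logging
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Failure(ABC):
    """When to inject (seconds into the run) and which services to target."""

    time: int = 0
    service_names: list[str] = field(default_factory=list)

    @abstractmethod
    async def execute(self, deployment, logger: logging.Logger) -> None:
        ...


@dataclass
class RollingUpgradeFailure(Failure):
    async def execute(self, deployment, logger: logging.Logger) -> None:
        # Using the injected logger addresses the nit above.
        logger.info(f"Triggering rolling upgrade for services: {self.service_names}")
        deployment.trigger_rolling_upgrade(self.service_names)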


208-252: TerminateProcessFailure validation and process targeting look correct

The constructor enforces both process_name and signal being non-empty before delegating to the base Failure dataclass, which prevents misconfigured failures from silently doing nothing. The execute implementation uses deployment.get_processes(pod) and matches processes via substring on process.command, then calls process.kill(self.signal), which lines up with PodProcess.kill’s expected string signal API.

Given the static hint about long exception messages, you could optionally simplify the ValueError text, but that’s stylistic, not functional.

tests/utils/managed_deployment.py (3)

317-350: Per-service env var helpers on DeploymentSpec look correct and defensive

set_service_env_var / get_service_env_vars operate directly on self._deployment_spec["spec"]["services"][service_name] and validate that the service exists before mutating/reading, which is important to catch typos early. The update-in-place logic for existing envs avoids duplicates, and get_service_env_vars gracefully falls back to an empty list when envs is missing.

If you find yourself adding more per-service helpers, you could consider a small internal helper to centralize the service-existence check, but that’s purely a readability refactor.
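
Roughly, the update-in-place behavior described here looks like the following (shown against a plain dict standing in for the parsed deployment spec):

def set_service_env_var(spec: dict, service_name: str, name: str, value: str) -> None:
    services = spec["spec"]["services"]
    if service_name not in services:
        raise ValueError(f"Service '{service_name}' not found in deployment spec")

    envs = services[service_name].setdefault("envs", [])
    # Update in place if the variable already exists, otherwise append,
    # so repeated calls never create duplicate entries.
    for env in envs:
        if env.get("name") == name:
            env["value"] = value
            return
    envs.append({"name": name, "value": value})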


561-661: Unified ready/unready waiting via _wait_for_condition is a solid abstraction

wait_for_unready and _wait_for_ready both delegate to _wait_for_condition, which:

  • Polls the DGD CR’s status via CustomObjectsApi.get_namespaced_custom_object.
  • Compares the observed Ready condition’s status against str(ready_condition_val) and the state field against state_val.
  • Logs detailed state/conditions periodically and returns once both expectations match, otherwise raising TimeoutError after timeout.

This gives you a single, debuggable code path for both “pending/unready” and “successful/ready” transitions. The final TimeoutError message mentions “become ready” even in the wait_for_unready case, but that’s cosmetic.
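
A generic sketch of the shared polling pattern; in the real method the status comes from CustomObjectsApi.get_namespaced_custom_object, here it is injected as a callable:

import time
from typing import Callable


def wait_for_condition(
    get_status: Callable[[], dict],
    ready_condition_val: bool,
    state_val: str,
    timeout: float = 600.0,
    poll_interval: float = 5.0,
) -> None:
    """Poll a status dict until its Ready condition and state match the expected values."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        conditions = {c.get("type"): c.get("status") for c in status.get("conditions", [])}
        ready_ok = conditions.get("Ready") == str(ready_condition_val)
        state_ok = status.get("state") == state_val
        if ready_ok and state_ok:
            return
        time.sleep(poll_interval)
    raise TimeoutError(f"Deployment did not reach state={state_val!r}, Ready={ready_condition_val}")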


952-975: _cleanup and __aenter__ changes make startup/teardown more robust and configurable

  • _cleanup now always calls _get_service_logs() before stopping tracked port-forwards and then deleting the DGD CR, ensuring logs/metrics are preserved even when teardown encounters port-forward errors.
  • __aenter__ initializes Kubernetes clients, then runs _delete_deployment() and (unless skip_restart_services is True) _restart_etcd() and _restart_nats() in parallel via asyncio.gather, before creating the new deployment and waiting for it to be ready.

This sequence keeps the test environment clean between runs while allowing local developers to opt out of disruptive etcd/nats restarts using skip_restart_services.

Also applies to: 977-990
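
The conditional parallel-restart shape described above, as a sketch; whether _delete_deployment joins the same gather as the etcd/nats restarts is an assumption:

import asyncio


async def prepare_environment(deployment, skip_restart_services: bool) -> None:
    # Always start from a clean DGD deployment.
    tasks = [deployment._delete_deployment()]
    # Only restart etcd/nats when the caller has not opted out
    # (e.g. --skip-restart-services for fast local iteration).
    if not skip_restart_services:
        tasks += [deployment._restart_etcd(), deployment._restart_nats()]
    await asyncio.gather(*tasks)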

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e75bcf6 and feb58cc.

📒 Files selected for processing (9)
  • tests/conftest.py (2 hunks)
  • tests/fault_tolerance/deploy/client.py (13 hunks)
  • tests/fault_tolerance/deploy/client_factory.py (1 hunks)
  • tests/fault_tolerance/deploy/conftest.py (2 hunks)
  • tests/fault_tolerance/deploy/legacy_client.py (3 hunks)
  • tests/fault_tolerance/deploy/parse_results.py (3 hunks)
  • tests/fault_tolerance/deploy/scenarios.py (10 hunks)
  • tests/fault_tolerance/deploy/test_deployment.py (12 hunks)
  • tests/utils/managed_deployment.py (14 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-24T17:28:25.813Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/fault_tolerance/deploy/test_deployment.py:36-43
Timestamp: 2025-09-24T17:28:25.813Z
Learning: In tests/fault_tolerance/deploy/client.py, the payload variable is defined as a global template at the top of the file and is used throughout the client functions. It's not undefined as initially assessed.

Applied to files:

  • tests/fault_tolerance/deploy/client_factory.py
  • tests/fault_tolerance/deploy/legacy_client.py
  • tests/fault_tolerance/deploy/test_deployment.py
📚 Learning: 2025-09-24T17:26:17.225Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/fault_tolerance/deploy/client.py:196-216
Timestamp: 2025-09-24T17:26:17.225Z
Learning: In tests/fault_tolerance/deploy/client.py, when no pods are ready, the port defaults to 0, creating URL "http://localhost:0/{endpoint}". The requests.post() call will raise ConnectionError or ConnectTimeout, which are caught by requests.RequestException in _single_request function.

Applied to files:

  • tests/fault_tolerance/deploy/legacy_client.py
  • tests/utils/managed_deployment.py
📚 Learning: 2025-09-24T17:50:00.970Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:555-559
Timestamp: 2025-09-24T17:50:00.970Z
Learning: In tests/utils/managed_deployment.py, when handling None service_name in get_service method, prefer setting service_name to empty string "" rather than defaulting to frontend_service_name, to avoid confusion per user tzulingk's preference.

Applied to files:

  • tests/utils/managed_deployment.py
🧬 Code graph analysis (5)
tests/fault_tolerance/deploy/parse_results.py (1)
tests/conftest.py (1)
  • logger (157-172)
tests/fault_tolerance/deploy/client.py (1)
tests/utils/managed_deployment.py (1)
  • get_pods (764-786)
tests/conftest.py (2)
tests/utils/managed_deployment.py (3)
  • name (47-49)
  • name (204-206)
  • name (209-210)
tests/utils/managed_process.py (1)
  • log_path (98-100)
tests/fault_tolerance/deploy/scenarios.py (1)
tests/utils/managed_deployment.py (5)
  • DeploymentSpec (192-434)
  • spec (365-366)
  • set_service_replicas (421-429)
  • model (78-89)
  • model (92-120)
tests/utils/managed_deployment.py (2)
tests/fault_tolerance/deploy/client.py (1)
  • client (581-675)
tests/fault_tolerance/deploy/legacy_client.py (1)
  • client (183-306)
🪛 Ruff (0.14.5)
tests/fault_tolerance/deploy/legacy_client.py

218-218: Avoid specifying long messages outside the exception class

(TRY003)

tests/fault_tolerance/deploy/client.py

475-475: subprocess call: check for execution of untrusted input

(S603)


484-484: Unused function argument: frame

(ARG001)

tests/fault_tolerance/deploy/test_deployment.py

209-209: Abstract raise to an inner function

(TRY301)


209-209: Avoid specifying long messages outside the exception class

(TRY003)


212-212: Do not catch blind exception: Exception

(BLE001)


225-225: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)


363-363: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)

tests/fault_tolerance/deploy/scenarios.py

180-180: Unused method argument: logger

(ARG002)


196-196: Unused method argument: logger

(ARG002)


232-234: Avoid specifying long messages outside the exception class

(TRY003)

tests/utils/managed_deployment.py

323-323: Avoid specifying long messages outside the exception class

(TRY003)


347-347: Avoid specifying long messages outside the exception class

(TRY003)


427-427: Avoid specifying long messages outside the exception class

(TRY003)


718-720: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (arm64)
🔇 Additional comments (23)
tests/fault_tolerance/deploy/parse_results.py (1)

428-438: Cancellation handling aligns with continuous-load semantics.

Logging was_cancelled and treating cancelled runs as “completed what they finished, no synthetic errors for the rest” is the right behavior for continuous load: it preserves real successes and failures without penalizing early stop. The explanatory comment here is clear and matches the aggregation logic above.

tests/fault_tolerance/deploy/client_factory.py (1)

32-46: Docstring accurately reflects the extended client signature.

Including continuous_load in the documented client(...) signature keeps the factory contract in sync with both AI‑Perf and legacy implementations. No functional issues here.

tests/fault_tolerance/deploy/legacy_client.py (1)

235-237: Updated get_pods call matches mapping-based API.

Passing [managed_deployment.frontend_service_name] and then iterating pods[managed_deployment.frontend_service_name] aligns with a mapping {service_name: [pod, ...]} API and mirrors how other components access pods per service. This change looks consistent with the broader ManagedDeployment usage.

tests/fault_tolerance/deploy/conftest.py (1)

38-44: --skip-restart-services option and fixture wiring look correct.

The new CLI flag plus skip_restart_services fixture cleanly expose restart control to tests, and the help text clearly documents the default behavior. No issues spotted.

Also applies to: 121-124

tests/fault_tolerance/deploy/test_deployment.py (2)

58-85: Client process orchestration and continuous_load wiring look correct.

The refactored _clients context manager:

  • Threads log_dir and typed DeploymentSpec/Load cleanly into the client processes.
  • Differentiates retry_delay_or_rate between legacy (rate limiting) and AI‑Perf (retry delay), matching the documented “differs between implementations” parameter.
  • Derives continuous_load from load_config and only passes it to clients in the single‑phase path, which is appropriate given mixed‑token tests aren’t continuous‑load scenarios.

The join in the finally portion of the context ensures all client processes are waited on, while test_fault_scenario can still trigger early termination for continuous‑load tests via _terminate_client_processes.

Also applies to: 95-107, 108-160, 162-183


247-264: Using request.node.log_dir as the base log directory ties the result parsing together cleanly.

Falling back to request.node.log_dir (with request.node.name as a backup) keeps the results path in sync with the per-test directory created by the global logger fixture. The overflow/recovery suffix handling is consistent with how _clients names the overflow and recovery phases.

tests/fault_tolerance/deploy/client.py (1)

64-90: Front-end pod selection now correctly uses ManagedDeployment.get_pods

Using managed_deployment.get_pods([managed_deployment.frontend_service_name]) keeps pod discovery consistent with the new ManagedDeployment.get_pods API and avoids duplicating label logic; the ready-pod filtering and round-robin selection look correct.

tests/fault_tolerance/deploy/scenarios.py (6)

129-150: Load.continuous_load flag is well-integrated but only used by new scenarios

Adding continuous_load: bool = False to Load makes the semantics explicit and aligns with the new aiperf client behavior; existing helpers (create_aiperf_load, create_legacy_load, load, moe_load) leave it at the default False, so the only callers enabling continuous load are your new rolling-upgrade scenarios, which is a safe, localized change.
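
In shape, the flag addition amounts to something like this (other Load fields elided and defaults here are placeholders; the success_threshold and max_retries values come from the rolling-upgrade scenarios discussed below):

from dataclasses import dataclass


@dataclass
class Load:
    # Existing fields elided; only the pieces referenced in this review are shown.
    success_threshold: float = 90.0   # placeholder default
    max_retries: int = 3              # placeholder default
    continuous_load: bool = False     # default False keeps existing scenarios unchanged


# Only the new rolling-upgrade scenarios opt in:
rolling_upgrade_load = Load(success_threshold=100.0, max_retries=1, continuous_load=True)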


263-285: TokenOverflowFailure as a client-side no-op is consistent with its purpose

Constraining TokenOverflowFailure to set service_names=["Client"] and making execute a no-op (with behavior driven entirely by the Load mixed-token configuration) matches the design where overflow is induced by client request sizes, not server-side mutations. This keeps the failure model unified without overloading the deployment layer.


334-356: _set_tensor_parallel now correctly consumes DeploymentInfo and DeploymentSpec

Accepting deployment_spec: DeploymentInfo and then extracting spec = deployment_spec["spec"] to operate on the DeploymentSpec instance is consistent with how _create_deployments_for_backend constructs DeploymentInfo. Using DeploymentSpec.set_tensor_parallel when available (and falling back to direct tensor_parallel_size fields) preserves compatibility and keeps the backend logic centralized in DeploymentSpec.


404-411: Deployment creation now consistently uses _create_deployment_info

Switching _create_deployments_for_backend to call _create_deployment_info ensures all backend deployments share the same DeploymentInfo structure (spec, backend, optional model/is_moe). This lines up with how DEPLOYMENT_SPECS is populated and consumed later when creating Scenario instances.


492-557: Backend failure maps correctly instantiate new failure classes

The new backend_failure_map entries that use TerminateProcessFailure and DeletePodFailure per backend/deploy_type look consistent:

  • Service names passed into TerminateProcessFailure (Frontend, decode/prefill workers, and backend-specific engine components) match WORKER_MAP entries.
  • Signals (SIGINT for frontend; SIGKILL for worker and engine internals) align with the intended severity per comment.
  • Prefill-related failures are later skipped for aggregated deployments as before.

This preserves previous failure semantics while benefiting from the new class-based API.


852-903: Rolling-upgrade scenarios cover the intended matrix; minor naming/behavior looks good

  • For each backend (vllm, sglang, trtllm) and mode (agg, disagg), you create a fresh DeploymentSpec from the backend YAML, set worker replicas to 2 via DeploymentSpec.set_service_replicas, and build service_names covering decode (and prefill for disagg). The trtllm decode_agg special case is handled correctly.
  • The Load object enables continuous_load=True with success_threshold=100.0 and max_retries=1, which is appropriate for long-running, non-retryable rolling-upgrade tests.
  • RollingUpgradeFailure is configured once per scenario with time=30 and the computed service_names, and wired into the global scenarios map under unique keys ({backend}-{worker_mode}-rolling-upgrade), avoiding collisions.

This hook plus the add_rolling_upgrade_scenarios() call at the end cleanly integrates rolling upgrades into the existing FT scenario registry.
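
A compressed sketch of the matrix construction described in this comment, assuming DeploymentSpec, Scenario, Load, and RollingUpgradeFailure are importable from the test modules; the trtllm decode_agg special case is omitted, and the YAML paths, service names, and Scenario fields are assumptions:

def add_rolling_upgrade_scenarios(scenarios: dict) -> None:
    # Backend x mode matrix; one rolling-upgrade scenario per combination.
    for backend in ("vllm", "sglang", "trtllm"):
        for mode in ("agg", "disagg"):
            spec = DeploymentSpec(f"{backend}/{mode}.yaml")  # assumed path layout

            # Two replicas per worker so one pod can roll while the other keeps serving.
            service_names = ["decode"] if mode == "agg" else ["decode", "prefill"]
            for name in service_names:
                spec.set_service_replicas(name, 2)

            scenarios[f"{backend}-{mode}-rolling-upgrade"] = Scenario(
                deployment=spec,
                load=Load(success_threshold=100.0, max_retries=1, continuous_load=True),
                failures=[RollingUpgradeFailure(time=30, service_names=service_names)],
            )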

tests/utils/managed_deployment.py (10)

351-363: services, __getitem__, and set_service_replicas align the public API around ServiceSpec

  • The services property now returns a list[ServiceSpec], and __getitem__ returns a ServiceSpec for a given service name. This matches how _set_replicas and other helpers in scenarios.py consume deployment specs.
  • set_service_replicas mirrors the env-var helpers, checking the service exists before setting replicas on the raw spec. This is exactly what the rolling-upgrade scenario builder expects when setting worker replicas to 2.

Together these changes make DeploymentSpec the single, typed entry point for per-service manipulations used across the FT tests.

Also applies to: 421-429


438-479: PodProcess helper correctly encapsulates pod-level process management

Parsing ps -aux output into PodProcess instances and centralizing kill/wait logic (including the “PID 1 Python process → prefer SIGINT for graceful shutdown” heuristic) gives a clean abstraction for the new TerminateProcessFailure. The kill(self.signal) calls from TerminateProcessFailure line up with this API, and wait() provides a simple readiness check if you need it later.
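
A rough sketch of parsing ps -aux output into lightweight process records (column handling simplified; the real PodProcess also wraps kill/wait against the pod):

from dataclasses import dataclass


@dataclass
class PodProcess:
    pid: int
    command: str


def parse_ps_aux(output: str) -> list[PodProcess]:
    processes = []
    for line in output.strip().splitlines()[1:]:  # skip the USER/PID/... header row
        cols = line.split(None, 10)  # ps -aux has 10 fixed columns before COMMAND
        if len(cols) < 11:
            continue
        processes.append(PodProcess(pid=int(cols[1]), command=cols[10]))
    return processes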


481-499: New ManagedDeployment fields support frontend discovery and local testing

Adding frontend_service_name: str = "Frontend" and skip_restart_services: bool = False to ManagedDeployment:

  • Lets get_pods and _get_pod_metrics distinguish between frontend vs system ports by name, which is used by both the legacy and aiperf clients.
  • Gives tests a way to skip disruptive etcd/nats restarts while still deleting/recreating the DGD, which is very useful for fast local iteration.

The type for _custom_api is also tightened to CustomObjectsApi, which helps static checking without changing runtime behavior.


683-709: _create_deployment now wires services and handles CRD existence more robustly

  • Storing self._services = self.deployment_spec.services captures the list of ServiceSpec objects early, which can be reused by other helpers later.
  • Catching exceptions.ApiException and treating HTTP 409 as “already exists” avoids hard failures when reusing deployments, while still surfacing other API errors.

This pairs nicely with the explicit _delete_deployment and rolling-upgrade logic elsewhere in the class.


711-748: trigger_rolling_upgrade correctly patches per-service env vars to drive rollouts

The new trigger_rolling_upgrade method:

  • Validates service_names is non-empty, preventing accidental “no-op” invocations.
  • Uses DeploymentSpec.set_service_env_var and get_service_env_vars to add a dummy TEST_ROLLING_UPDATE_TRIGGER env var with a random hex value per service.
  • Builds a merge-patch+json body that patches only the relevant spec.services[service].envs entries on the DGD CR.

This is exactly what the rolling-upgrade fault type in scenarios.py needs to trigger a pod-by-pod rollout without altering other spec fields.
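
A sketch of the patch shape implied here, using the Kubernetes Python client; the CRD group/version/plural strings are assumptions, and only the TEST_ROLLING_UPDATE_TRIGGER env var name comes from the review:

import secrets

from kubernetes import client


def trigger_rolling_upgrade(custom_api: client.CustomObjectsApi,
                            namespace: str,
                            deployment_name: str,
                            service_names: list[str]) -> None:
    if not service_names:
        raise ValueError("service_names must not be empty")

    # A no-op env var with a fresh random value is enough to change the pod
    # template and force a pod-by-pod rollout of each targeted service.
    trigger = {"name": "TEST_ROLLING_UPDATE_TRIGGER", "value": secrets.token_hex(8)}
    body = {"spec": {"services": {name: {"envs": [trigger]} for name in service_names}}}

    # Depending on the client version you may need to force the merge-patch+json content type.
    custom_api.patch_namespaced_custom_object(
        group="nvidia.com",            # assumption: actual group/version/plural may differ
        version="v1alpha1",
        namespace=namespace,
        plural="dynamographdeployments",
        name=deployment_name,
        body=body,
    )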


749-785: get_processes and get_pods provide the abstractions needed by new failure types

  • get_processes runs ps -aux inside a pod and wraps results as PodProcess instances, which are then used by TerminateProcessFailure to locate and kill specific processes.
  • get_pods returns a dict[str, list[Pod]], keyed by service name, and uses the nvidia.com/selector={deployment-name}-{service} label to fetch pods. When service_names is None, it enumerates all services from DeploymentSpec.services.

This matches how both the new failure injections (delete pod, kill process) and logging/metrics helpers iterate over pods. Based on prior learnings, the get_service method now defaults service_name to "", not the frontend name, which aligns with the repository's preferred behavior.


788-812: get_pod_manifest_logs_metrics and _get_pod_metrics capture rich diagnostics for failures

Collecting the pod’s manifest YAML, current logs, previous logs, and Prometheus metrics under log_dir/service_name/ gives excellent visibility into failure scenarios, especially before pod deletion in DeletePodFailure. Using self.port_forward and tracking forwards in _active_port_forwards ensures metrics scraping is consistent with the existing port-forward lifecycle managed in _cleanup.


813-823: _get_service_logs reuse via get_pods is clean and flexible

The refactored _get_service_logs uses get_pods(service_names) to drive log/metrics collection, supporting both “all services” and a specific service_name. This centralizes pod enumeration and keeps the logging surface aligned with the label-based pod discovery used elsewhere.


877-945: port_forward’s connection-testing and backoff improve reliability

The enhanced port_forward:

  • Uses local_port=0 and address="127.0.0.1" for conflict avoidance and isolation.
  • Implements a bounded exponential backoff loop that waits for local_port assignment, then validates connectivity with a HEAD request before returning.
  • Tracks successful forwards in _active_port_forwards for cleanup, and attempts restart of the port-forward on connection failures until max_connection_attempts is exhausted.

This should significantly reduce flaky tests due to transient port-forward issues while still cleaning up background threads on failure.
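
The backoff-plus-probe idea in isolation (the port-forward object is abstracted to anything exposing a local_port attribute; requests.head stands in for the connectivity check):

import time

import requests


def wait_for_port_forward(pf, max_connection_attempts: int = 5, base_delay: float = 0.5) -> int:
    """Return the assigned local port once the forward accepts connections."""
    for attempt in range(max_connection_attempts):
        # local_port=0 means "pick a free port"; wait until one is assigned.
        deadline = time.monotonic() + 10
        while pf.local_port == 0 and time.monotonic() < deadline:
            time.sleep(0.1)

        try:
            # A HEAD request is enough to prove the tunnel is actually usable.
            requests.head(f"http://127.0.0.1:{pf.local_port}/", timeout=2)
            return pf.local_port
        except requests.RequestException:
            # Exponential backoff, then let the caller restart the forward if needed.
            time.sleep(base_delay * (2**attempt))
    raise TimeoutError("port-forward never became reachable")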


757-763: get_service default behavior matches prior preference

Defaulting service_name to "" when falsy (instead of implicitly assuming the frontend) aligns with the previously documented preference to avoid hidden frontend defaults for get_service, and keeps the resulting Service name ({deployment_name}-) explicit.

Comment on lines 221 to 233

async def _inject_failures(
    failures: list[Failure],
    logger: logging.Logger,
    deployment: ManagedDeployment,
):  # noqa: F811
    for failure in failures:
        time.sleep(failure.time)

        logger.info(f"Injecting failure for: {failure}")

        for x in range(replicas):
            pod = pods[x % num_pods]

            if failure.command == "delete_pod":
                deployment.get_pod_logs(failure.pod_name, pod, ".before_delete")
                pod.delete(force=True)
            else:
                processes = deployment.get_processes(pod)
                for process in processes:
                    if failure.command in process.command:
                        logger.info(
                            f"Terminating {failure.pod_name} Pid {process.pid} Command {process.command}"
                        )
                        process.kill(failure.signal)
        # Execute the failure using the polymorphic execute method
        await failure.execute(deployment, logger)


🛠️ Refactor suggestion | 🟠 Major

Avoid time.sleep inside async _inject_failures; use asyncio.sleep instead.

_inject_failures is declared async and is awaited from test_fault_scenario, but it uses time.sleep(failure.time) inside the loop. This blocks the event loop for each failure interval, which can stall other async operations involved in managing the deployment or monitoring state.

Switching to asyncio.sleep preserves the intended timing while keeping the event loop responsive.

Suggested change:

-import time
+import time
+import asyncio
@@
 async def _inject_failures(
     failures: list[Failure],
     logger: logging.Logger,
     deployment: ManagedDeployment,
 ):  # noqa: F811
     for failure in failures:
-        time.sleep(failure.time)
+        await asyncio.sleep(failure.time)

(If failure.time is intended as an absolute offset rather than a relative delay, you can still implement that with asyncio.sleep(max(0, failure.time - elapsed)) while remaining non-blocking.)

🧰 Tools
🪛 Ruff (0.14.5)

225-225: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents
In tests/fault_tolerance/deploy/test_deployment.py around lines 221 to 233, the
async helper _inject_failures blocks the event loop by calling
time.sleep(failure.time); change that to await asyncio.sleep(failure.time) and
add an import for asyncio at top of the file so the sleep is non-blocking. If
failure.time represents an absolute offset rather than a relative delay, compute
elapsed time and await asyncio.sleep(max(0, failure.time - elapsed)) instead to
keep timings correct while remaining asynchronous.

        Set the number of replicas for a specific service
        """
        # Check service exists
        if service_name not in self._deployment_spec["spec"]["services"]:
nit: since we do this validation in multiple places - maybe a utility function , "get_service_spec" that returns the dictionary and would fail if not in the spec?

