feat: fault tolerance rolling upgrade test scenarios #4558
Conversation
Walkthrough

The pull request extends the test deployment framework by introducing continuous load mode execution, refactoring fault injection into an abstract class hierarchy, implementing per-service environment variable management in deployments, adding timestamped logging directories, and introducing async signal handling with stronger typing throughout the test orchestration layer.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:
Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (12)
tests/fault_tolerance/deploy/parse_results.py (1)
344-409: Dual-format metrics parsing looks solid; consider removing unused `error_request_count`.

The updated parsing for `request_count`/`error_request_count` across `records` and top-level fields, with type checks, is a good way to support both old and new AI-Perf formats while avoiding type errors.

One small cleanup: `error_request_count` (lines 401-409) is computed but never used; all downstream error accounting relies on `error_summary` via `error_count`. Either wire `error_request_count` into a consistency check (e.g., compare against `error_count`) or drop this block to avoid confusion.
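A minimal sketch of the suggested consistency check; the variable names follow the comment above, but the surrounding parsing code is assumed:

```python
# Hypothetical sketch: cross-check the two error counts instead of silently
# dropping error_request_count. Names mirror the review comment; the
# surrounding parse_results.py structure is assumed.
if isinstance(error_request_count, int) and error_request_count != error_count:
    logger.warning(
        "AI-Perf error_request_count (%d) disagrees with error_summary total (%d)",
        error_request_count,
        error_count,
    )
```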
tests/fault_tolerance/deploy/legacy_client.py (1)

183-219: Good API compatibility guard for `continuous_load` on legacy client.

Adding `continuous_load` to the signature but immediately rejecting `True` with a `ValueError` is a clean way to keep call sites uniform while preventing unsupported usage in the legacy path. The error message is clear; if Ruff's TRY003 rule is enforced in CI, you could either shorten the message slightly or suppress that rule, but functionally this is fine.
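As a sketch of the guard pattern described above (the parameter list is illustrative, not the file's exact signature):

```python
def client(deployment_spec, load, log_dir, *, continuous_load: bool = False):
    # Hypothetical signature: accept the flag so call sites stay uniform,
    # but reject it since the legacy client cannot run an open-ended window.
    if continuous_load:
        raise ValueError("continuous_load is not supported by the legacy client")
    ...
```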
tests/conftest.py (1)

156-166: Per-test timestamped `log_dir` integrates well with FT harness.

Storing logs under `{test_name}_{timestamp}/test.log.txt` and exposing `request.node.log_dir` gives each test an isolated directory and matches how `test_deployment.py` discovers logs. One minor follow-up you might consider later is passing `request.node.log_dir` (instead of `request.node.name`) into `EtcdServer`/`NatsServer` so all logs for a test land under the same root, but the current split is functionally fine.
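A minimal pytest sketch of the timestamped-directory idea, assuming a `request`-based fixture (the real fixture in `tests/conftest.py` may differ in details):

```python
import datetime
import logging
import pathlib

import pytest


@pytest.fixture
def logger(request):
    # Hypothetical sketch: one directory per test run, so repeated runs of
    # the same test never collide on log paths.
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    log_dir = pathlib.Path("logs") / f"{request.node.name}_{stamp}"
    log_dir.mkdir(parents=True, exist_ok=True)
    request.node.log_dir = log_dir  # exposed for other fixtures and the test

    log = logging.getLogger(request.node.name)
    log.addHandler(logging.FileHandler(log_dir / "test.log.txt"))
    yield log
```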
tests/fault_tolerance/deploy/test_deployment.py (2)

192-219: SIGINT-based client shutdown is reasonable; consider minor robustness tweaks.

Sending SIGINT to each alive client process and then relying on the `_clients` teardown to `join()` them is a sensible way to stop continuous-load clients while allowing them to flush AI-Perf output.

Two optional enhancements you might consider:

- Use a narrower exception type (e.g., `OSError`) in the generic `except Exception` block if you want to satisfy BLE001, while still catching `ProcessLookupError`.
- Optionally log when a process remains alive after SIGINT and `join()` (with a timeout in `_clients`) to avoid hard hangs if a client ignores the signal.

These aren't blockers but could improve diagnosability and static-analysis cleanliness. A combined sketch follows.
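A hedged sketch of both tweaks together, assuming `multiprocessing.Process`-style clients (the actual process type in the test is an assumption):

```python
import os
import signal


def _terminate_sketch(procs, logger):
    # Hypothetical sketch combining the two suggestions above.
    for proc in procs:
        if not proc.is_alive():
            continue
        try:
            os.kill(proc.pid, signal.SIGINT)
        except OSError as exc:  # narrower than Exception; covers ProcessLookupError
            logger.warning("Could not signal client %s: %s", proc.pid, exc)
    for proc in procs:
        proc.join(timeout=60)  # bounded join so a stuck client can't hang teardown
        if proc.is_alive():
            logger.warning("Client pid %s ignored SIGINT and is still alive", proc.pid)
```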
363-368: Async test orchestration with `ManagedDeployment`, clients, and failures is well-structured; a couple of small cleanups.

The async `test_fault_scenario`:

- Wires in `Scenario`, `image`, `namespace`, and `skip_restart_services` fixtures appropriately.
- Uses `request.node.log_dir` consistently for both `ManagedDeployment` and client processes, aligning deployment logs and AI-Perf outputs under the same test directory.
- Runs clients via `_clients`, injects failures with `_inject_failures`, and conditionally terminates continuous-load clients via `_terminate_client_processes` before the `_clients` teardown joins them.

Two minor polish items:

- The `# noqa: F811` on the `scenario: Scenario` parameter appears unused now that you're importing `Scenario` rather than shadowing a name; you can drop it to satisfy Ruff (same for the `# noqa: F811` on `_inject_failures`).
- If you later introduce more async behavior during failure injection or deployment management, it may be worth documenting that `scenario.load.continuous_load` is currently assumed to be `False` for mixed-token tests so that `_terminate_client_processes` only runs in the single-phase continuous-load scenarios.

Functionally this structure looks good.
Also applies to: 408-425
tests/fault_tolerance/deploy/client.py (2)
274-376: Continuous-load behavior in `run_aiperf` looks consistent; minor log and threshold nits.

- The continuous-load branch (`--benchmark-duration 1800`, single attempt, fixed timeout) is wired correctly into command construction, retry logic, and logging, and is only enabled when `continuous_load=True`.
- Note that `validate_aiperf_results` still uses `requests_per_client` to derive the failure threshold even in continuous-load mode; if you intend to eventually enforce "100% success over the whole 30-minute window", you may want a follow-up change to base this on the actual `request_count` instead of a synthetic `requests_per_client` value (see the sketch after this comment).
- Minor: log message at Line 455 has a typo: `"existed succesfully"` → `"exited successfully"`.

```diff
- f"AI-Perf sustained continuous load for {pod_name} and existed succesfully"
+ f"AI-Perf sustained continuous load for {pod_name} and exited successfully"
```

Also applies to: 406-456
463-513: Improve SIGINT handler scoping and address static-analysis nits in `run_aiperf_with_signal_handling`.

Functionally this gives you the desired "send SIGINT to aiperf so it can flush artifacts" behavior, but a couple of refinements would make it safer and quieter:

- `signal.signal(signal.SIGINT, signal_handler)` is process-wide and the handler is never restored, so any later SIGINTs in this process will also be routed through this handler. Capturing the previous handler and restoring it in a `finally` after `communicate()` (or on all exit paths) will keep the side-effects localized.
- `signal_handler`'s `frame` argument is intentionally unused; rename it to `_frame` (or use `_` for both args) to silence ARG001 without changing behavior.
- Ruff S603 about untrusted subprocess input is a false positive here, since `cmd_attempt` originates from a controlled list constructed in `run_aiperf`, not from user input, but adding a brief comment noting this assumption could help future readers.

Example refinement:

```diff
-    def signal_handler(signum, frame):
+    def signal_handler(signum, _frame):
         logger.info(f"Received signal {signum}, forwarding to aiperf subprocess")
         try:
             proc.send_signal(signal.SIGINT)
         except ProcessLookupError:
             pass  # Process already terminated

-    signal.signal(signal.SIGINT, signal_handler)
+    old_handler = signal.getsignal(signal.SIGINT)
+    signal.signal(signal.SIGINT, signal_handler)
     try:
         stdout, stderr = proc.communicate(timeout=timeout)
         returncode = proc.returncode
         ...
-    return subprocess.CompletedProcess(cmd_attempt, returncode, stdout, stderr)
+    finally:
+        signal.signal(signal.SIGINT, old_handler)
+    return subprocess.CompletedProcess(cmd_attempt, returncode, stdout, stderr)
```

tests/fault_tolerance/deploy/scenarios.py (2)
152-207: New `Failure` hierarchy cleanly models fault types; consider using the logger.

The abstract `Failure` base class plus concrete subclasses (`RollingUpgradeFailure`, `DeletePodFailure`, `TerminateProcessFailure`, `TokenOverflowFailure`) gives you a much clearer, type-safe way to express faults than raw tuples/dicts. The wiring to `ManagedDeployment` (`trigger_rolling_upgrade`, `get_pods`, `get_processes`) looks consistent.

Minor improvement: `RollingUpgradeFailure.execute` and `DeletePodFailure.execute` accept a `logger` argument but never use it; adding log lines when injecting failures (e.g., which services/pods are being upgraded or deleted) would help when debugging FT runs.
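A sketch of what using the logger might look like in `RollingUpgradeFailure.execute` (the exact `trigger_rolling_upgrade` call signature is an assumption):

```python
async def execute(self, deployment, logger):
    # Hypothetical sketch: record which services the upgrade will roll
    # before triggering the rollout, so FT logs show the injection point.
    logger.info(
        "Injecting rolling upgrade for services: %s", ", ".join(self.service_names)
    )
    deployment.trigger_rolling_upgrade(self.service_names)
```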
208-252: `TerminateProcessFailure` validation and process targeting look correct.

The constructor enforces both `process_name` and `signal` being non-empty before delegating to the base `Failure` dataclass, which prevents misconfigured failures from silently doing nothing. The `execute` implementation uses `deployment.get_processes(pod)` and matches processes via substring on `process.command`, then calls `process.kill(self.signal)`, which lines up with `PodProcess.kill`'s expected string signal API.

Given the static hint about long exception messages, you could optionally simplify the `ValueError` text, but that's stylistic, not functional.

tests/utils/managed_deployment.py (3)
317-350: Per-service env var helpers on `DeploymentSpec` look correct and defensive.

`set_service_env_var`/`get_service_env_vars` operate directly on `self._deployment_spec["spec"]["services"][service_name]` and validate that the service exists before mutating/reading, which is important to catch typos early. The update-in-place logic for existing envs avoids duplicates, and `get_service_env_vars` gracefully falls back to an empty list when `envs` is missing.

If you find yourself adding more per-service helpers, you could consider a small internal helper to centralize the service-existence check, but that's purely a readability refactor.
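A sketch of that internal helper; the name `_get_service_spec` echoes a reviewer suggestion later in this thread and is an assumption, not existing code:

```python
def _get_service_spec(self, service_name: str) -> dict:
    # Hypothetical helper: one place to validate that a service exists.
    services = self._deployment_spec["spec"]["services"]
    if service_name not in services:
        raise ValueError(f"Unknown service: {service_name!r}")
    return services[service_name]


def set_service_env_var(self, service_name: str, name: str, value: str) -> None:
    envs = self._get_service_spec(service_name).setdefault("envs", [])
    for env in envs:
        if env["name"] == name:
            env["value"] = value  # update in place to avoid duplicates
            return
    envs.append({"name": name, "value": value})
```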
561-661: Unified ready/unready waiting via `_wait_for_condition` is a solid abstraction.

`wait_for_unready` and `_wait_for_ready` both delegate to `_wait_for_condition`, which:

- Polls the DGD CR's status via `CustomObjectsApi.get_namespaced_custom_object`.
- Compares the observed `Ready` condition's `status` against `str(ready_condition_val)` and the `state` field against `state_val`.
- Logs detailed state/conditions periodically and returns once both expectations match, otherwise raising `TimeoutError` after `timeout`.

This gives you a single, debuggable code path for both "pending/unready" and "successful/ready" transitions. The final `TimeoutError` message mentions "become ready" even in the `wait_for_unready` case, but that's cosmetic.
952-975: `_cleanup` and `__aenter__` changes make startup/teardown more robust and configurable.

- `_cleanup` now always calls `_get_service_logs()` before stopping tracked port-forwards and then deleting the DGD CR, ensuring logs/metrics are preserved even when teardown encounters port-forward errors.
- `__aenter__` initializes Kubernetes clients, then runs `_delete_deployment()` and (unless `skip_restart_services` is True) `_restart_etcd()` and `_restart_nats()` in parallel via `asyncio.gather`, before creating the new deployment and waiting for it to be ready.

This sequence keeps the test environment clean between runs while allowing local developers to opt out of disruptive etcd/nats restarts using `skip_restart_services`.

Also applies to: 977-990
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- tests/conftest.py (2 hunks)
- tests/fault_tolerance/deploy/client.py (13 hunks)
- tests/fault_tolerance/deploy/client_factory.py (1 hunks)
- tests/fault_tolerance/deploy/conftest.py (2 hunks)
- tests/fault_tolerance/deploy/legacy_client.py (3 hunks)
- tests/fault_tolerance/deploy/parse_results.py (3 hunks)
- tests/fault_tolerance/deploy/scenarios.py (10 hunks)
- tests/fault_tolerance/deploy/test_deployment.py (12 hunks)
- tests/utils/managed_deployment.py (14 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-24T17:28:25.813Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/fault_tolerance/deploy/test_deployment.py:36-43
Timestamp: 2025-09-24T17:28:25.813Z
Learning: In tests/fault_tolerance/deploy/client.py, the payload variable is defined as a global template at the top of the file and is used throughout the client functions. It's not undefined as initially assessed.
Applied to files:
- tests/fault_tolerance/deploy/client_factory.py
- tests/fault_tolerance/deploy/legacy_client.py
- tests/fault_tolerance/deploy/test_deployment.py
📚 Learning: 2025-09-24T17:26:17.225Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/fault_tolerance/deploy/client.py:196-216
Timestamp: 2025-09-24T17:26:17.225Z
Learning: In tests/fault_tolerance/deploy/client.py, when no pods are ready, the port defaults to 0, creating URL "http://localhost:0/{endpoint}". The requests.post() call will raise ConnectionError or ConnectTimeout, which are caught by requests.RequestException in _single_request function.
Applied to files:
- tests/fault_tolerance/deploy/legacy_client.py
- tests/utils/managed_deployment.py
📚 Learning: 2025-09-24T17:50:00.970Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:555-559
Timestamp: 2025-09-24T17:50:00.970Z
Learning: In tests/utils/managed_deployment.py, when handling None service_name in get_service method, prefer setting service_name to empty string "" rather than defaulting to frontend_service_name, to avoid confusion per user tzulingk's preference.
Applied to files:
- tests/utils/managed_deployment.py
🧬 Code graph analysis (5)
tests/fault_tolerance/deploy/parse_results.py (1)

tests/conftest.py (1)
- logger (157-172)

tests/fault_tolerance/deploy/client.py (1)

tests/utils/managed_deployment.py (1)
- get_pods (764-786)

tests/conftest.py (2)

tests/utils/managed_deployment.py (3)
- name (47-49)
- name (204-206)
- name (209-210)

tests/utils/managed_process.py (1)
- log_path (98-100)

tests/fault_tolerance/deploy/scenarios.py (1)

tests/utils/managed_deployment.py (5)
- DeploymentSpec (192-434)
- spec (365-366)
- set_service_replicas (421-429)
- model (78-89)
- model (92-120)

tests/utils/managed_deployment.py (2)

tests/fault_tolerance/deploy/client.py (1)
- client (581-675)

tests/fault_tolerance/deploy/legacy_client.py (1)
- client (183-306)
🪛 Ruff (0.14.5)
tests/fault_tolerance/deploy/legacy_client.py
218-218: Avoid specifying long messages outside the exception class
(TRY003)
tests/fault_tolerance/deploy/client.py
475-475: subprocess call: check for execution of untrusted input
(S603)
484-484: Unused function argument: frame
(ARG001)
tests/fault_tolerance/deploy/test_deployment.py
209-209: Abstract raise to an inner function
(TRY301)
209-209: Avoid specifying long messages outside the exception class
(TRY003)
212-212: Do not catch blind exception: Exception
(BLE001)
225-225: Unused noqa directive (unused: F811)
Remove unused noqa directive
(RUF100)
363-363: Unused noqa directive (unused: F811)
Remove unused noqa directive
(RUF100)
tests/fault_tolerance/deploy/scenarios.py
180-180: Unused method argument: logger
(ARG002)
196-196: Unused method argument: logger
(ARG002)
232-234: Avoid specifying long messages outside the exception class
(TRY003)
tests/utils/managed_deployment.py
323-323: Avoid specifying long messages outside the exception class
(TRY003)
347-347: Avoid specifying long messages outside the exception class
(TRY003)
427-427: Avoid specifying long messages outside the exception class
(TRY003)
718-720: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: vllm (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang (arm64)
🔇 Additional comments (23)
tests/fault_tolerance/deploy/parse_results.py (1)
428-438: Cancellation handling aligns with continuous-load semantics.

Logging `was_cancelled` and treating cancelled runs as "completed what they finished, no synthetic errors for the rest" is the right behavior for continuous load: it preserves real successes and failures without penalizing early stop. The explanatory comment here is clear and matches the aggregation logic above.

tests/fault_tolerance/deploy/client_factory.py (1)
32-46: Docstring accurately reflects the extended client signature.

Including `continuous_load` in the documented `client(...)` signature keeps the factory contract in sync with both AI-Perf and legacy implementations. No functional issues here.

tests/fault_tolerance/deploy/legacy_client.py (1)
235-237: Updated `get_pods` call matches mapping-based API.

Passing `[managed_deployment.frontend_service_name]` and then iterating `pods[managed_deployment.frontend_service_name]` aligns with a mapping `{service_name: [pod, ...]}` API and mirrors how other components access pods per service. This change looks consistent with the broader ManagedDeployment usage.

tests/fault_tolerance/deploy/conftest.py (1)
38-44: `--skip-restart-services` option and fixture wiring look correct.

The new CLI flag plus `skip_restart_services` fixture cleanly expose restart control to tests, and the help text clearly documents the default behavior. No issues spotted.

Also applies to: 121-124
tests/fault_tolerance/deploy/test_deployment.py (2)
58-85: Client process orchestration and `continuous_load` wiring look correct.

The refactored `_clients` context manager:

- Threads `log_dir` and typed `DeploymentSpec`/`Load` cleanly into the client processes.
- Differentiates `retry_delay_or_rate` between legacy (rate limiting) and AI-Perf (retry delay), matching the documented "differs between implementations" parameter.
- Derives `continuous_load` from `load_config` and only passes it to clients in the single-phase path, which is appropriate given mixed-token tests aren't continuous-load scenarios.

The join in the `finally` portion of the context ensures all client processes are waited on, while `test_fault_scenario` can still trigger early termination for continuous-load tests via `_terminate_client_processes`.

Also applies to: 95-107, 108-160, 162-183
247-264: Using `request.node.log_dir` as the base log directory ties the result parsing together cleanly.

Falling back to `request.node.log_dir` (with `request.node.name` as a backup) keeps the results path in sync with the per-test directory created by the global `logger` fixture. The overflow/recovery suffix handling is consistent with how `_clients` names the overflow and recovery phases.

tests/fault_tolerance/deploy/client.py (1)
64-90: Front-end pod selection now correctly uses ManagedDeployment.get_pods.

Using `managed_deployment.get_pods([managed_deployment.frontend_service_name])` keeps pod discovery consistent with the new `ManagedDeployment.get_pods` API and avoids duplicating label logic; the ready-pod filtering and round-robin selection look correct.

tests/fault_tolerance/deploy/scenarios.py (6)
129-150: `Load.continuous_load` flag is well-integrated but only used by new scenarios.

Adding `continuous_load: bool = False` to `Load` makes the semantics explicit and aligns with the new aiperf client behavior; existing helpers (`create_aiperf_load`, `create_legacy_load`, `load`, `moe_load`) leave it at the default `False`, so the only callers enabling continuous load are your new rolling-upgrade scenarios, which is a safe, localized change.
263-285: `TokenOverflowFailure` as a client-side no-op is consistent with its purpose.

Constraining `TokenOverflowFailure` to set `service_names=["Client"]` and making `execute` a no-op (with behavior driven entirely by the `Load` mixed-token configuration) matches the design where overflow is induced by client request sizes, not server-side mutations. This keeps the failure model unified without overloading the deployment layer.
334-356: `_set_tensor_parallel` now correctly consumes `DeploymentInfo` and `DeploymentSpec`.

Accepting `deployment_spec: DeploymentInfo` and then extracting `spec = deployment_spec["spec"]` to operate on the `DeploymentSpec` instance is consistent with how `_create_deployments_for_backend` constructs `DeploymentInfo`. Using `DeploymentSpec.set_tensor_parallel` when available (and falling back to direct `tensor_parallel_size` fields) preserves compatibility and keeps the backend logic centralized in `DeploymentSpec`.
404-411: Deployment creation now consistently uses `_create_deployment_info`.

Switching `_create_deployments_for_backend` to call `_create_deployment_info` ensures all backend deployments share the same `DeploymentInfo` structure (`spec`, `backend`, optional `model`/`is_moe`). This lines up with how `DEPLOYMENT_SPECS` is populated and consumed later when creating `Scenario` instances.
492-557: Backend failure maps correctly instantiate new failure classes.

The new `backend_failure_map` entries that use `TerminateProcessFailure` and `DeletePodFailure` per backend/deploy_type look consistent:

- Service names passed into `TerminateProcessFailure` (Frontend, decode/prefill workers, and backend-specific engine components) match `WORKER_MAP` entries.
- Signals (`SIGINT` for frontend; `SIGKILL` for worker and engine internals) align with the intended severity per comment.
- Prefill-related failures are later skipped for aggregated deployments as before.
This preserves previous failure semantics while benefiting from the new class-based API.
852-903: Rolling-upgrade scenarios cover the intended matrix; minor naming/behavior looks good.

- For each backend (`vllm`, `sglang`, `trtllm`) and mode (`agg`, `disagg`), you create a fresh `DeploymentSpec` from the backend YAML, set worker replicas to 2 via `DeploymentSpec.set_service_replicas`, and build `service_names` covering decode (and prefill for disagg). The `trtllm` `decode_agg` special case is handled correctly.
- The `Load` object enables `continuous_load=True` with `success_threshold=100.0` and `max_retries=1`, which is appropriate for long-running, non-retryable rolling-upgrade tests.
- `RollingUpgradeFailure` is configured once per scenario with `time=30` and the computed `service_names`, and wired into the global `scenarios` map under unique keys (`{backend}-{worker_mode}-rolling-upgrade`), avoiding collisions.

This hook plus the `add_rolling_upgrade_scenarios()` call at the end cleanly integrates rolling upgrades into the existing FT scenario registry.

tests/utils/managed_deployment.py (10)
351-363: `services`, `__getitem__`, and `set_service_replicas` align the public API around `ServiceSpec`.

- The `services` property now returns a `list[ServiceSpec]`, and `__getitem__` returns a `ServiceSpec` for a given service name. This matches how `_set_replicas` and other helpers in `scenarios.py` consume deployment specs.
- `set_service_replicas` mirrors the env-var helpers, checking the service exists before setting `replicas` on the raw spec. This is exactly what the rolling-upgrade scenario builder expects when setting worker replicas to 2.

Together these changes make `DeploymentSpec` the single, typed entry point for per-service manipulations used across the FT tests.

Also applies to: 421-429
438-479: `PodProcess` helper correctly encapsulates pod-level process management.

Parsing `ps -aux` output into `PodProcess` instances and centralizing kill/wait logic (including the "PID 1 Python process → prefer SIGINT for graceful shutdown" heuristic) gives a clean abstraction for the new `TerminateProcessFailure`. The `kill(self.signal)` calls from `TerminateProcessFailure` line up with this API, and `wait()` provides a simple readiness check if you need it later.
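A rough sketch of the parsing idea described above, assuming each line of `ps -aux` output follows the standard 11-column layout (the real `PodProcess` fields may differ):

```python
from dataclasses import dataclass


@dataclass
class PodProcess:
    pid: int
    command: str


def parse_ps_aux(output: str) -> list[PodProcess]:
    # Hypothetical sketch: skip the header row, then take PID (column 2)
    # and the full command (columns 11+) from each remaining line.
    processes = []
    for line in output.splitlines()[1:]:
        fields = line.split(None, 10)
        if len(fields) == 11:
            processes.append(PodProcess(pid=int(fields[1]), command=fields[10]))
    return processes
```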
481-499: New `ManagedDeployment` fields support frontend discovery and local testing.

Adding `frontend_service_name: str = "Frontend"` and `skip_restart_services: bool = False` to `ManagedDeployment`:

- Lets `get_pods` and `_get_pod_metrics` distinguish between frontend vs system ports by name, which is used by both the legacy and aiperf clients.
- Gives tests a way to skip disruptive etcd/nats restarts while still deleting/recreating the DGD, which is very useful for fast local iteration.

The type for `_custom_api` is also tightened to `CustomObjectsApi`, which helps static checking without changing runtime behavior.
683-709: `_create_deployment` now wires services and handles CRD existence more robustly.

- Storing `self._services = self.deployment_spec.services` captures the list of `ServiceSpec` objects early, which can be reused by other helpers later.
- Catching `exceptions.ApiException` and treating HTTP 409 as "already exists" avoids hard failures when reusing deployments, while still surfacing other API errors.

This pairs nicely with the explicit `_delete_deployment` and rolling-upgrade logic elsewhere in the class.
711-748: `trigger_rolling_upgrade` correctly patches per-service env vars to drive rollouts.

The new `trigger_rolling_upgrade` method:

- Validates `service_names` is non-empty, preventing accidental "no-op" invocations.
- Uses `DeploymentSpec.set_service_env_var` and `get_service_env_vars` to add a dummy `TEST_ROLLING_UPDATE_TRIGGER` env var with a random hex value per service.
- Builds a `merge-patch+json` body that patches only the relevant `spec.services[service].envs` entries on the DGD CR.

This is exactly what the rolling-upgrade fault type in `scenarios.py` needs to trigger a pod-by-pod rollout without altering other spec fields.
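A sketch of the patch approach, using the kubernetes Python client; the CRD group/version/plural coordinates here are placeholders, not the file's actual values:

```python
import secrets

from kubernetes import client


def trigger_rolling_upgrade_sketch(
    custom_api: client.CustomObjectsApi,
    namespace: str,
    name: str,
    service_names: list[str],
) -> None:
    # Hypothetical sketch: bump a dummy env var so the operator rolls pods.
    services = {
        svc: {
            "envs": [
                {"name": "TEST_ROLLING_UPDATE_TRIGGER",
                 "value": secrets.token_hex(8)}
            ]
        }
        for svc in service_names
    }
    body = {"spec": {"services": services}}
    custom_api.patch_namespaced_custom_object(
        group="nvidia.com",  # placeholder CRD coordinates
        version="v1alpha1",
        namespace=namespace,
        plural="dynamographdeployments",
        name=name,
        body=body,
    )
```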
749-785: `get_processes` and `get_pods` provide the abstractions needed by new failure types.

- `get_processes` runs `ps -aux` inside a pod and wraps results as `PodProcess` instances, which are then used by `TerminateProcessFailure` to locate and kill specific processes.
- `get_pods` returns a `dict[str, list[Pod]]`, keyed by service name, and uses the `nvidia.com/selector={deployment-name}-{service}` label to fetch pods. When `service_names` is `None`, it enumerates all services from `DeploymentSpec.services`.

This matches how both the new failure injections (delete pod, kill process) and logging/metrics helpers iterate over pods. Based on prior learnings, the `get_service` method now defaults `service_name` to `""`, not the frontend name, which aligns with the repository's preferred behavior.
788-812: `get_pod_manifest_logs_metrics` and `_get_pod_metrics` capture rich diagnostics for failures.

Collecting the pod's manifest YAML, current logs, previous logs, and Prometheus metrics under `log_dir/service_name/` gives excellent visibility into failure scenarios, especially before pod deletion in `DeletePodFailure`. Using `self.port_forward` and tracking forwards in `_active_port_forwards` ensures metrics scraping is consistent with the existing port-forward lifecycle managed in `_cleanup`.
813-823: `_get_service_logs` reuse via `get_pods` is clean and flexible.

The refactored `_get_service_logs` uses `get_pods(service_names)` to drive log/metrics collection, supporting both "all services" and a specific `service_name`. This centralizes pod enumeration and keeps the logging surface aligned with the label-based pod discovery used elsewhere.
877-945: `port_forward`'s connection-testing and backoff improve reliability.

The enhanced `port_forward`:

- Uses `local_port=0` and `address="127.0.0.1"` for conflict avoidance and isolation.
- Implements a bounded exponential backoff loop that waits for `local_port` assignment, then validates connectivity with a HEAD request before returning.
- Tracks successful forwards in `_active_port_forwards` for cleanup, and attempts restart of the port-forward on connection failures until `max_connection_attempts` is exhausted.

This should significantly reduce flaky tests due to transient port-forward issues while still cleaning up background threads on failure.
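A standalone sketch of the backoff-plus-probe pattern described above; the HEAD probe and timing constants are illustrative, not the file's actual implementation:

```python
import time

import requests


def wait_for_forward(local_port: int, max_attempts: int = 5) -> bool:
    # Hypothetical sketch: bounded exponential backoff, then verify the
    # tunnel actually answers before handing it to the test.
    delay = 0.5
    for _ in range(max_attempts):
        try:
            requests.head(f"http://127.0.0.1:{local_port}/", timeout=2)
            return True
        except requests.RequestException:
            time.sleep(delay)
            delay = min(delay * 2, 8.0)  # cap the backoff
    return False
```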
757-763: `get_service` default behavior matches prior preference.

Defaulting `service_name` to `""` when falsy (instead of implicitly assuming the frontend) aligns with the previously documented preference to avoid hidden frontend defaults for `get_service`, and keeps the resulting Service name (`{deployment_name}-`) explicit.
In tests/fault_tolerance/deploy/test_deployment.py:

```diff
 async def _inject_failures(
     failures: list[Failure],
     logger: logging.Logger,
     deployment: ManagedDeployment,
 ):  # noqa: F811
     for failure in failures:
         time.sleep(failure.time)

         logger.info(f"Injecting failure for: {failure}")

-        for x in range(replicas):
-            pod = pods[x % num_pods]
-
-            if failure.command == "delete_pod":
-                deployment.get_pod_logs(failure.pod_name, pod, ".before_delete")
-                pod.delete(force=True)
-            else:
-                processes = deployment.get_processes(pod)
-                for process in processes:
-                    if failure.command in process.command:
-                        logger.info(
-                            f"Terminating {failure.pod_name} Pid {process.pid} Command {process.command}"
-                        )
-                        process.kill(failure.signal)
+        # Execute the failure using the polymorphic execute method
+        await failure.execute(deployment, logger)
```
🛠️ Refactor suggestion | 🟠 Major
Avoid `time.sleep` inside async `_inject_failures`; use `asyncio.sleep` instead.

`_inject_failures` is declared async and is awaited from `test_fault_scenario`, but it uses `time.sleep(failure.time)` inside the loop. This blocks the event loop for each failure interval, which can stall other async operations involved in managing the deployment or monitoring state.
Switching to asyncio.sleep preserves the intended timing while keeping the event loop responsive.
Suggested change:
```diff
-import time
+import time
+import asyncio
@@
 async def _inject_failures(
     failures: list[Failure],
     logger: logging.Logger,
     deployment: ManagedDeployment,
 ):  # noqa: F811
     for failure in failures:
-        time.sleep(failure.time)
+        await asyncio.sleep(failure.time)
```

(If `failure.time` is intended as an absolute offset rather than a relative delay, you can still implement that with `asyncio.sleep(max(0, failure.time - elapsed))` while remaining non-blocking.)
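A sketch of that absolute-offset variant, assuming failure times are measured from the start of injection (`loop.time()` provides a monotonic clock):

```python
import asyncio


async def inject_at_offsets(failures, deployment, logger):
    # Hypothetical sketch: treat failure.time as an offset from t0, sleeping
    # only for the remaining gap so late starts don't accumulate drift.
    loop = asyncio.get_running_loop()
    t0 = loop.time()
    for failure in failures:
        elapsed = loop.time() - t0
        await asyncio.sleep(max(0, failure.time - elapsed))
        await failure.execute(deployment, logger)
```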
🧰 Tools
🪛 Ruff (0.14.5)
225-225: Unused noqa directive (unused: F811)
Remove unused noqa directive
(RUF100)
🤖 Prompt for AI Agents
In tests/fault_tolerance/deploy/test_deployment.py around lines 221 to 233, the
async helper _inject_failures blocks the event loop by calling
time.sleep(failure.time); change that to await asyncio.sleep(failure.time) and
add an import for asyncio at top of the file so the sleep is non-blocking. If
failure.time represents an absolute offset rather than a relative delay, compute
elapsed time and await asyncio.sleep(max(0, failure.time - elapsed)) instead to
keep timings correct while remaining asynchronous.
In tests/utils/managed_deployment.py:

```python
        Set the number of replicas for a specific service
        """
        # Check service exists
        if service_name not in self._deployment_spec["spec"]["services"]:
```
nit: since we do this validation in multiple places - maybe a utility function, `get_service_spec`, that returns the dictionary and fails if the service is not in the spec?
Overview:
Adds the following Rolling Upgrade scenarios to the FT test framework:
Details:
- Adds `continuous_load` to the aiperf client: the `--benchmark-duration` flag is set to 30 minutes instead of `--request-count`.
- `_terminate_client_processes` will terminate the aiperf processes once the scenario has completed. This is done by sending a SIGINT; aiperf will receive it and export the needed logs/metrics before exiting.
- `tests/fault_tolerance/deploy/scenarios.py`: made `Failure` more generic with an `execute` method and implemented classes for the different failure types.
- Added `--skip-restart-services` for running the FT tests - for local testing you don't need to restart etcd/nats every time.
- `tests/conftest.py`: the log directory now includes a timestamp (you can run the same test multiple times without worrying about dirs).

Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Release Notes
New Features
- `--skip-restart-services` flag to skip service restarts during test setup

Improvements