
Conversation

@indrajit96 (Contributor) commented on Oct 3, 2025

Overview:

Adds dual-client support for the fault tolerance tests, enabling dynamic selection between the AI-Perf client and the legacy custom client via the --client-type command-line argument.

Details:

New Files:

  • legacy_client.py - Original custom HTTP client implementation (JSONL logging)
  • client_factory.py - Factory pattern for client selection (see the sketch after this list)
  • legacy_parse_results.py - Parser for legacy JSONL result format
  • parse_factory.py - Auto-detects result format and routes to appropriate parser
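
A minimal, self-contained sketch of the factory pattern described for client_factory.py. The function name `get_client_function` appears in this PR's review context, but the registry contents, signatures, and stand-in client bodies below are illustrative assumptions, not the actual implementation:

```python
# Hedged sketch of the client-selection factory. The stand-in client
# functions and the registry contents are illustrative assumptions;
# the real client_factory.py wires in the actual aiperf/legacy callables.
from typing import Callable, Dict


def _aiperf_client(deployment, load, worker_idx: int) -> None:
    """Stand-in for the AI-Perf client entry point (illustrative only)."""
    ...


def _legacy_client(deployment, load, worker_idx: int) -> None:
    """Stand-in for the legacy JSONL-logging client entry point (illustrative only)."""
    ...


_CLIENTS: Dict[str, Callable] = {"aiperf": _aiperf_client, "legacy": _legacy_client}


def get_client_function(client_type: str) -> Callable:
    """Look up the client callable for the requested --client-type value."""
    try:
        return _CLIENTS[client_type]
    except KeyError:
        raise ValueError(
            f"Unsupported client type {client_type!r}; expected one of {sorted(_CLIENTS)}"
        ) from None
```

A dictionary-backed registry like this keeps adding a third client type down to a one-line change, which is the main appeal of the factory split.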

Modified Files:

  • scenarios.py - Added client_type and max_request_rate fields to Load dataclass
  • test_deployment.py - Integrated factory patterns for dynamic client/parser selection
  • conftest.py - Added the --client-type pytest command-line option (sketched after this list)
  • README.md - Documented dual client usage and examples
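
As referenced in the conftest.py bullet above, a hedged sketch of the pytest wiring. The option name, choices, and the AI-Perf default come from this PR's description; the help text and the fixture body are assumptions:

```python
# Sketch of the conftest.py additions. Option name/choices and the "aiperf"
# default are taken from the PR description; help text is an assumption.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--client-type",
        choices=["aiperf", "legacy"],
        default="aiperf",  # default preserves existing AI-Perf behavior
        help="Select the load-generation client used by fault tolerance tests.",
    )


@pytest.fixture
def client_type(request):
    """Expose the --client-type command-line value to tests."""
    return request.config.getoption("--client-type")
```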

Key Features:

  • Command-line toggle: --client-type aiperf or --client-type legacy
  • Auto-detection: the parser automatically detects the result format, with no manual specification (see the sketch after this list)
  • Factory pattern: Clean separation of client implementations
  • Backward compatible: Defaults to AI-Perf client (existing behavior)
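
For the auto-detection feature above, the logic could look roughly like the sketch below; the file patterns used to tell the two formats apart are assumptions, since the real detection rules live in parse_factory.py:

```python
# Hedged sketch of result-format auto-detection. The patterns
# ("profile_export*.json" for AI-Perf, "*.jsonl" for the legacy client)
# are assumptions; parse_factory.py defines the actual markers.
from pathlib import Path
from typing import Optional


def detect_result_type(log_dir: str) -> Optional[str]:
    """Guess whether a results directory was produced by aiperf or the legacy client."""
    root = Path(log_dir)
    if any(root.rglob("profile_export*.json")):
        return "aiperf"
    if any(root.rglob("*.jsonl")):
        return "legacy"
    return None  # nothing recognizable found
```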

Where should the reviewer start?

  1. client_factory.py - Review factory pattern implementation
  2. parse_factory.py - Check auto-detection logic
  3. test_deployment.py - Review scenario fixture integration
  4. conftest.py - Verify new pytest option
  5. README.md - Review documentation completeness

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Add legacy fault-tolerance client with rate limiting; choose client type (AI-Perf or legacy) via a command-line flag.
    • Auto-detect and parse results from AI-Perf or legacy runs; includes an info/parse CLI.
    • Scenario helpers to quickly configure AI-Perf or legacy loads.
  • Tests

    • New option to select client type in test runs; fixtures generate per-test tables and session summaries.
  • Documentation

    • Expanded README covering dual-client architecture, usage, and example outputs.
  • Refactor

    • Factory-based client and parser selection; unified load configuration for client processes.

Signed-off-by: Indrajit Bhosale <[email protected]>
@indrajit96 requested review from a team as code owners October 3, 2025 21:25
@copy-pr-bot bot commented on Oct 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai bot (Contributor) commented on Oct 3, 2025

Walkthrough

Introduces dual-client support for fault-tolerance tests by adding client and parse factories, a legacy client implementation, legacy result parsers, scenario config updates, a pytest CLI option for client selection, and refactoring test wiring to use the factories. README gains duplicated documentation content. No changes to exported APIs outside test modules.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Documentation update**<br>`tests/fault_tolerance/deploy/README.md` | Appends extensive dual-client documentation; content duplicated; no functional code changes. |
| **Client selection factory**<br>`tests/fault_tolerance/deploy/client_factory.py` | Adds factory functions to select the client function by type ("aiperf" or "legacy"), validation, supported-types enumeration, and descriptions. |
| **Pytest CLI option**<br>`tests/fault_tolerance/deploy/conftest.py` | Adds --client-type option with choices ["aiperf", "legacy"] and a client_type fixture exposing it. |
| **Legacy client implementation**<br>`tests/fault_tolerance/deploy/legacy_client.py` | Adds legacy client orchestration with HTTP requests, retries, per-request JSONL logging, rate limiting, port-forward management, and round-robin pod selection. |
| **Legacy result parser**<br>`tests/fault_tolerance/deploy/legacy_parse_results.py` | Adds parsing of legacy JSONL logs and process logs, metrics computation, recovery time estimation, directory processing, and a CLI. |
| **Parse/detect factory**<br>`tests/fault_tolerance/deploy/parse_factory.py` | Adds auto-detection of result type (AI-Perf vs legacy), routing to the appropriate parser, info summary, pretty-print, and a CLI. |
| **Scenario configuration**<br>`tests/fault_tolerance/deploy/scenarios.py` | Extends Load with client_type and max_request_rate; adds create_aiperf_load and create_legacy_load helper factories with defaults. |
| **Test wiring and flow**<br>`tests/fault_tolerance/deploy/test_deployment.py` | Refactors to use the Load config and client factory; supports client-type override; adjusts retries/rate per client type; integrates the parse factory for per-test and session summaries; updates process startup and logging. |
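
For the scenarios.py row above, a hedged sketch of the extended Load configuration and the helper factories. Only `client_type`, `max_request_rate`, `create_aiperf_load`, and `create_legacy_load` are named in this PR; the remaining fields, types, and defaults are assumptions:

```python
# Hedged sketch of the Load extension in scenarios.py. Only client_type and
# max_request_rate are named in the PR; other fields and defaults are
# illustrative assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Load:
    clients: int = 1
    requests_per_client: int = 100
    client_type: str = "aiperf"               # "aiperf" or "legacy"
    max_request_rate: Optional[float] = None  # requests/sec cap, used by the legacy client


def create_aiperf_load(**overrides) -> Load:
    """Convenience factory for an AI-Perf driven load."""
    return Load(client_type="aiperf", **overrides)


def create_legacy_load(max_request_rate: float = 2.0, **overrides) -> Load:
    """Convenience factory for a legacy-client load with a request-rate cap."""
    return Load(client_type="legacy", max_request_rate=max_request_rate, **overrides)
```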

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Tester
  participant Pytest as Pytest (conftest)
  participant Scenario as Scenario Fixture
  participant Test as test_deployment.py
  participant Factory as client_factory
  participant Client as aiperf/legacy client

  Tester->>Pytest: Run with --client-type (optional)
  Pytest-->>Scenario: client_type fixture
  Scenario-->>Test: Load(client_type, retries, rate, ...)
  Test->>Factory: get_client_function(Load.client_type)
  Factory-->>Test: client_fn
  loop Clients
    Test->>Client: start client_fn(deployment, Load, idx)
    Client-->>Test: per-request logs (JSONL/text)
  end
  Note over Test,Client: Clients run with retries/rate based on client_type
sequenceDiagram
  autonumber
  actor Tester
  participant Parser as parse_factory
  participant AIP as aiperf parser
  participant LEG as legacy_parse_results

  Tester->>Parser: parse_test_results(log_dir|log_paths, [force_parser])
  alt force aiperf
    Parser->>AIP: parse_results(...)
    AIP-->>Parser: table/summary
  else force legacy
    Parser->>LEG: main/process_test_directory(...)
    LEG-->>Parser: table/summary
  else auto-detect
    Parser->>Parser: detect_result_type(log_dir)
    alt aiperf detected
      Parser->>AIP: parse_results(...)
      AIP-->>Parser: table/summary
    else legacy detected
      Parser->>LEG: main/process_test_directory(...)
      LEG-->>Parser: table/summary
    else none
      Parser-->>Tester: warning / no results
    end
  end
  Parser-->>Tester: parsed results or info
sequenceDiagram
  autonumber
  participant Legacy as legacy_client
  participant Deploy as ManagedDeployment
  participant Pod as Target Pod
  participant Svc as Frontend (port-forward)
  participant API as HTTP Endpoint

  Legacy->>Deploy: list pods / ensure forwards
  loop Requests
    Legacy->>Pod: select next ready pod (round-robin)
    Legacy->>Svc: ensure port-forward
    Legacy->>API: POST generate(payload)
    API-->>Legacy: response / error
    Legacy->>Legacy: retry if needed (max_retries, delay)
    Legacy-->>Legacy: log JSONL entry + sleep for rate limit
  end
  Legacy->>Deploy: cleanup forwards
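
The request loop in the diagram above could be approximated by the following sketch; the endpoint, payload shape, and timing defaults are assumptions rather than the actual legacy_client.py implementation:

```python
# Hedged sketch of the legacy client's per-request loop: retry with a delay,
# append one JSONL record per attempt, then sleep to respect the rate cap.
# The URL, payload handling, and defaults are assumptions, not legacy_client.py itself.
import json
import time

import requests


def send_with_retries(url: str, payload: dict, log_path: str,
                      max_retries: int = 3, retry_delay: float = 1.0,
                      max_request_rate: float = 2.0) -> None:
    for attempt in range(1, max_retries + 1):
        record = {"time": time.time(), "attempt": attempt, "url": url}
        try:
            resp = requests.post(url, json=payload, timeout=30)
            record["status"] = resp.status_code
            success = resp.ok
        except requests.RequestException as exc:
            record["error"] = str(exc)
            success = False
        with open(log_path, "a") as f:      # one JSON object per line (JSONL)
            f.write(json.dumps(record) + "\n")
        if success:
            break
        time.sleep(retry_delay)             # back off before retrying
    time.sleep(1.0 / max_request_rate)      # crude rate limiting between requests
```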

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

A rabbit taps keys with caffeinated cheer,
Two clients now dance—legacy and peer.
Factories choose, parsers divine,
Logs turn to tables, neatly aligned.
Pods hum along, retries in play—
Hippity-hop, fault-tolerance day! 🐇✨

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title succinctly and accurately describes the primary change by highlighting the addition of a legacy client alongside the AI-Perf client for fault tolerance tests and reflects the core functionality introduced by the pull request. It is clear, specific, and directly related to the changeset without extraneous details. |
| Description Check | ✅ Passed | The pull request description follows the repository template by providing the required Overview, Details, Where should the reviewer start, and Related Issues sections, and it clearly outlines new and modified files, key features, and reviewer guidance while referencing related work. All mandatory sections are present and adequately filled out with no significant omissions. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 93.10%, which is sufficient. The required threshold is 80.00%. |


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c48f49a and 53070c1.

📒 Files selected for processing (8)
  • tests/fault_tolerance/deploy/README.md (1 hunks)
  • tests/fault_tolerance/deploy/client_factory.py (1 hunks)
  • tests/fault_tolerance/deploy/conftest.py (2 hunks)
  • tests/fault_tolerance/deploy/legacy_client.py (1 hunks)
  • tests/fault_tolerance/deploy/legacy_parse_results.py (1 hunks)
  • tests/fault_tolerance/deploy/parse_factory.py (1 hunks)
  • tests/fault_tolerance/deploy/scenarios.py (2 hunks)
  • tests/fault_tolerance/deploy/test_deployment.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
tests/fault_tolerance/deploy/client_factory.py (2)
  • tests/fault_tolerance/deploy/conftest.py (1): client_type (42-44)
  • tests/fault_tolerance/deploy/legacy_client.py (1): client (174-292)

tests/fault_tolerance/deploy/scenarios.py (1)
  • tests/fault_tolerance/deploy/conftest.py (1): client_type (42-44)

tests/fault_tolerance/deploy/legacy_client.py (1)
  • tests/utils/managed_deployment.py (3): ManagedDeployment (359-802), get_pods (572-598), port_forward (685-756)

tests/fault_tolerance/deploy/parse_factory.py (1)
  • tests/fault_tolerance/deploy/legacy_parse_results.py (1): main (421-557)

tests/fault_tolerance/deploy/test_deployment.py (4)
  • tests/fault_tolerance/deploy/client_factory.py (1): get_client_function (21-69)
  • tests/fault_tolerance/deploy/parse_factory.py (1): parse_test_results (101-222)
  • tests/fault_tolerance/deploy/scenarios.py (1): Load (96-104)
  • tests/fault_tolerance/deploy/conftest.py (2): client_type (42-44), namespace (37-38)
🪛 markdownlint-cli2 (0.18.1)

tests/fault_tolerance/deploy/README.md
  • 478-478: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🪛 Ruff (0.13.2)

tests/fault_tolerance/deploy/client_factory.py
  • 66-69: Avoid specifying long messages outside the exception class (TRY003)
  • 120-123: Avoid specifying long messages outside the exception class (TRY003)

tests/fault_tolerance/deploy/legacy_client.py
  • 72-72: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
  • 80-80: Unused function argument: logger (ARG001)
  • 282-282: Do not catch blind exception: Exception (BLE001)
  • 283-283: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 286-286: Loop control variable pf_name not used within loop body; rename unused pf_name to _pf_name (B007)
  • 289-290: try-except-pass detected, consider logging the exception (S110)
  • 289-289: Do not catch blind exception: Exception (BLE001)

tests/fault_tolerance/deploy/legacy_parse_results.py
  • 315-315: Unused function argument: failure_type (ARG001)
  • 349-349: Loop control variable process not used within loop body (B007)
  • 350-350: Loop control variable replica not used within loop body; rename unused replica to _replica (B007)

tests/fault_tolerance/deploy/parse_factory.py
  • 141-144: Avoid specifying long messages outside the exception class (TRY003)
  • 172-172: Avoid specifying long messages outside the exception class (TRY003)
  • 175-178: Avoid specifying long messages outside the exception class (TRY003)
  • 222-222: Avoid specifying long messages outside the exception class (TRY003)
  • 368-368: Do not catch blind exception: Exception (BLE001)
  • 369-369: Use logging.exception instead of logging.error; replace with exception (TRY400)

tests/fault_tolerance/deploy/test_deployment.py
  • 168-168: Do not catch blind exception: Exception (BLE001)
  • 169-169: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 193-193: Do not catch blind exception: Exception (BLE001)
  • 194-194: Use logging.exception instead of logging.error; replace with exception (TRY400)
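
For the recurring BLE001/TRY400 findings above, the usual remediation looks like this minimal, illustrative sketch (not the code under review):

```python
# Inside an except block, logging.exception records the traceback
# automatically, whereas logging.error drops it unless exc_info is passed.
import logging

logger = logging.getLogger(__name__)


def cleanup(resource) -> None:
    try:
        resource.close()
    except Exception:  # narrow this to the expected error type to satisfy BLE001
        logger.exception("Failed to close resource during cleanup")
```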

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo

@tzulingk (Contributor) left a comment


LGTM. Thanks for working on this.

Signed-off-by: Indrajit Bhosale <[email protected]>
@indrajit96 (Contributor, Author) commented:

/ok to test 3d2d13a

Signed-off-by: Indrajit Bhosale <[email protected]>
@indrajit96 (Contributor, Author) commented:

/ok to test 32f13ec

@indrajit96 (Contributor, Author) commented:

/ok to test 2aede48

@indrajit96 enabled auto-merge (squash) October 6, 2025 18:07
@indrajit96 (Contributor, Author) commented:

/ok to test 3c66b9a

@indrajit96 (Contributor, Author) commented:

/ok to test b58726a

@indrajit96 (Contributor, Author) commented:

/ok to test d9d584a

@indrajit96 (Contributor, Author) commented:

/ok to test d9d584a

@indrajit96 merged commit f414bc5 into main Oct 6, 2025
25 of 27 checks passed
@indrajit96 deleted the ibhosale_custom_client branch October 6, 2025 21:53
nv-tusharma pushed a commit that referenced this pull request Oct 20, 2025