
Conversation

@indrajit96 (Contributor) commented on Oct 3, 2025

Overview:

Adds dual-client support for the fault tolerance tests, enabling dynamic selection between the AI-Perf client and the legacy custom client via the --client-type command-line argument.

Details:

New Files:

  • legacy_client.py - Original custom HTTP client implementation (JSONL logging)
  • client_factory.py - Factory pattern for client selection (see the sketch after this list)
  • legacy_parse_results.py - Parser for legacy JSONL result format
  • parse_factory.py - Auto-detects result format and routes to appropriate parser
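
A minimal, self-contained sketch of the factory pattern described for client_factory.py. The function name `get_client_function` appears in this PR's review context, but the registry contents, signatures, and stand-in client bodies below are illustrative assumptions, not the actual implementation:

```python
# Hedged sketch of the client-selection factory. The stand-in client
# functions and the registry contents are illustrative assumptions;
# the real client_factory.py wires in the actual aiperf/legacy callables.
from typing import Callable, Dict


def _aiperf_client(deployment, load, worker_idx: int) -> None:
    """Stand-in for the AI-Perf client entry point (illustrative only)."""
    ...


def _legacy_client(deployment, load, worker_idx: int) -> None:
    """Stand-in for the legacy JSONL-logging client entry point (illustrative only)."""
    ...


_CLIENTS: Dict[str, Callable] = {"aiperf": _aiperf_client, "legacy": _legacy_client}


def get_client_function(client_type: str) -> Callable:
    """Look up the client callable for the requested --client-type value."""
    try:
        return _CLIENTS[client_type]
    except KeyError:
        raise ValueError(
            f"Unsupported client type {client_type!r}; expected one of {sorted(_CLIENTS)}"
        ) from None
```

A dictionary-backed registry like this keeps adding a third client type down to a one-line change, which is the main appeal of the factory split.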

Modified Files:

  • scenarios.py - Added client_type and max_request_rate fields to Load dataclass
  • test_deployment.py - Integrated factory patterns for dynamic client/parser selection
  • conftest.py - Added the --client-type pytest command-line option (sketched after this list)
  • README.md - Documented dual client usage and examples
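
As referenced in the conftest.py bullet above, a hedged sketch of the pytest wiring. The option name, choices, and the AI-Perf default come from this PR's description; the help text and the fixture body are assumptions:

```python
# Sketch of the conftest.py additions. Option name/choices and the "aiperf"
# default are taken from the PR description; help text is an assumption.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--client-type",
        choices=["aiperf", "legacy"],
        default="aiperf",  # default preserves existing AI-Perf behavior
        help="Select the load-generation client used by fault tolerance tests.",
    )


@pytest.fixture
def client_type(request):
    """Expose the --client-type command-line value to tests."""
    return request.config.getoption("--client-type")
```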

Key Features:

  • Command-line toggle: --client-type aiperf or --client-type legacy
  • Auto-detection: the parser automatically detects the result format, with no manual specification (see the sketch after this list)
  • Factory pattern: Clean separation of client implementations
  • Backward compatible: Defaults to AI-Perf client (existing behavior)
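
For the auto-detection feature above, the logic could look roughly like the sketch below; the file patterns used to tell the two formats apart are assumptions, since the real detection rules live in parse_factory.py:

```python
# Hedged sketch of result-format auto-detection. The patterns
# ("profile_export*.json" for AI-Perf, "*.jsonl" for the legacy client)
# are assumptions; parse_factory.py defines the actual markers.
from pathlib import Path
from typing import Optional


def detect_result_type(log_dir: str) -> Optional[str]:
    """Guess whether a results directory was produced by aiperf or the legacy client."""
    root = Path(log_dir)
    if any(root.rglob("profile_export*.json")):
        return "aiperf"
    if any(root.rglob("*.jsonl")):
        return "legacy"
    return None  # nothing recognizable found
```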

Where should the reviewer start?

  1. client_factory.py - Review factory pattern implementation
  2. parse_factory.py - Check auto-detection logic
  3. test_deployment.py - Review scenario fixture integration
  4. conftest.py - Verify new pytest option
  5. README.md - Review documentation completeness

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Add legacy fault-tolerance client with rate limiting; choose client type (AI-Perf or legacy) via a command-line flag.
    • Auto-detect and parse results from AI-Perf or legacy runs; includes an info/parse CLI.
    • Scenario helpers to quickly configure AI-Perf or legacy loads.
  • Tests

    • New option to select client type in test runs; fixtures generate per-test tables and session summaries.
  • Documentation

    • Expanded README covering dual-client architecture, usage, and example outputs.
  • Refactor

    • Factory-based client and parser selection; unified load configuration for client processes.

Signed-off-by: Indrajit Bhosale <[email protected]>
@indrajit96 requested review from a team as code owners October 3, 2025 21:25
@copy-pr-bot bot commented on Oct 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai bot (Contributor) commented on Oct 3, 2025

Walkthrough

Introduces dual-client support for fault-tolerance tests by adding client and parse factories, a legacy client implementation, legacy result parsers, scenario config updates, a pytest CLI option for client selection, and refactoring test wiring to use the factories. README gains duplicated documentation content. No changes to exported APIs outside test modules.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Documentation update**<br>`tests/fault_tolerance/deploy/README.md` | Appends extensive dual-client documentation; content duplicated; no functional code changes. |
| **Client selection factory**<br>`tests/fault_tolerance/deploy/client_factory.py` | Adds factory functions to select the client function by type ("aiperf" or "legacy"), validation, supported-types enumeration, and descriptions. |
| **Pytest CLI option**<br>`tests/fault_tolerance/deploy/conftest.py` | Adds --client-type option with choices ["aiperf", "legacy"] and a client_type fixture exposing it. |
| **Legacy client implementation**<br>`tests/fault_tolerance/deploy/legacy_client.py` | Adds legacy client orchestration with HTTP requests, retries, per-request JSONL logging, rate limiting, port-forward management, and round-robin pod selection. |
| **Legacy result parser**<br>`tests/fault_tolerance/deploy/legacy_parse_results.py` | Adds parsing of legacy JSONL logs and process logs, metrics computation, recovery time estimation, directory processing, and a CLI. |
| **Parse/detect factory**<br>`tests/fault_tolerance/deploy/parse_factory.py` | Adds auto-detection of result type (AI-Perf vs legacy), routing to the appropriate parser, info summary, pretty-print, and a CLI. |
| **Scenario configuration**<br>`tests/fault_tolerance/deploy/scenarios.py` | Extends Load with client_type and max_request_rate; adds create_aiperf_load and create_legacy_load helper factories with defaults. |
| **Test wiring and flow**<br>`tests/fault_tolerance/deploy/test_deployment.py` | Refactors to use the Load config and client factory; supports client-type override; adjusts retries/rate per client type; integrates the parse factory for per-test and session summaries; updates process startup and logging. |
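
For the scenarios.py row above, a hedged sketch of the extended Load configuration and the helper factories. Only `client_type`, `max_request_rate`, `create_aiperf_load`, and `create_legacy_load` are named in this PR; the remaining fields, types, and defaults are assumptions:

```python
# Hedged sketch of the Load extension in scenarios.py. Only client_type and
# max_request_rate are named in the PR; other fields and defaults are
# illustrative assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Load:
    clients: int = 1
    requests_per_client: int = 100
    client_type: str = "aiperf"               # "aiperf" or "legacy"
    max_request_rate: Optional[float] = None  # requests/sec cap, used by the legacy client


def create_aiperf_load(**overrides) -> Load:
    """Convenience factory for an AI-Perf driven load."""
    return Load(client_type="aiperf", **overrides)


def create_legacy_load(max_request_rate: float = 2.0, **overrides) -> Load:
    """Convenience factory for a legacy-client load with a request-rate cap."""
    return Load(client_type="legacy", max_request_rate=max_request_rate, **overrides)
```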

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Tester
  participant Pytest as Pytest (conftest)
  participant Scenario as Scenario Fixture
  participant Test as test_deployment.py
  participant Factory as client_factory
  participant Client as aiperf/legacy client

  Tester->>Pytest: Run with --client-type (optional)
  Pytest-->>Scenario: client_type fixture
  Scenario-->>Test: Load(client_type, retries, rate, ...)
  Test->>Factory: get_client_function(Load.client_type)
  Factory-->>Test: client_fn
  loop Clients
    Test->>Client: start client_fn(deployment, Load, idx)
    Client-->>Test: per-request logs (JSONL/text)
  end
  Note over Test,Client: Clients run with retries/rate based on client_type
sequenceDiagram
  autonumber
  actor Tester
  participant Parser as parse_factory
  participant AIP as aiperf parser
  participant LEG as legacy_parse_results

  Tester->>Parser: parse_test_results(log_dir|log_paths, [force_parser])
  alt force aiperf
    Parser->>AIP: parse_results(...)
    AIP-->>Parser: table/summary
  else force legacy
    Parser->>LEG: main/process_test_directory(...)
    LEG-->>Parser: table/summary
  else auto-detect
    Parser->>Parser: detect_result_type(log_dir)
    alt aiperf detected
      Parser->>AIP: parse_results(...)
      AIP-->>Parser: table/summary
    else legacy detected
      Parser->>LEG: main/process_test_directory(...)
      LEG-->>Parser: table/summary
    else none
      Parser-->>Tester: warning / no results
    end
  end
  Parser-->>Tester: parsed results or info
sequenceDiagram
  autonumber
  participant Legacy as legacy_client
  participant Deploy as ManagedDeployment
  participant Pod as Target Pod
  participant Svc as Frontend (port-forward)
  participant API as HTTP Endpoint

  Legacy->>Deploy: list pods / ensure forwards
  loop Requests
    Legacy->>Pod: select next ready pod (round-robin)
    Legacy->>Svc: ensure port-forward
    Legacy->>API: POST generate(payload)
    API-->>Legacy: response / error
    Legacy->>Legacy: retry if needed (max_retries, delay)
    Legacy-->>Legacy: log JSONL entry + sleep for rate limit
  end
  Legacy->>Deploy: cleanup forwards
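
The request loop in the diagram above could be approximated by the following sketch; the endpoint, payload shape, and timing defaults are assumptions rather than the actual legacy_client.py implementation:

```python
# Hedged sketch of the legacy client's per-request loop: retry with a delay,
# append one JSONL record per attempt, then sleep to respect the rate cap.
# The URL, payload handling, and defaults are assumptions, not legacy_client.py itself.
import json
import time

import requests


def send_with_retries(url: str, payload: dict, log_path: str,
                      max_retries: int = 3, retry_delay: float = 1.0,
                      max_request_rate: float = 2.0) -> None:
    for attempt in range(1, max_retries + 1):
        record = {"time": time.time(), "attempt": attempt, "url": url}
        try:
            resp = requests.post(url, json=payload, timeout=30)
            record["status"] = resp.status_code
            success = resp.ok
        except requests.RequestException as exc:
            record["error"] = str(exc)
            success = False
        with open(log_path, "a") as f:      # one JSON object per line (JSONL)
            f.write(json.dumps(record) + "\n")
        if success:
            break
        time.sleep(retry_delay)             # back off before retrying
    time.sleep(1.0 / max_request_rate)      # crude rate limiting between requests
```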

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

A rabbit taps keys with caffeinated cheer,
Two clients now dance—legacy and peer.
Factories choose, parsers divine,
Logs turn to tables, neatly aligned.
Pods hum along, retries in play—
Hippity-hop, fault-tolerance day! 🐇✨

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title succinctly and accurately describes the primary change by highlighting the addition of a legacy client alongside the AI-Perf client for fault tolerance tests and reflects the core functionality introduced by the pull request. It is clear, specific, and directly related to the changeset without extraneous details. |
| Description Check | ✅ Passed | The pull request description follows the repository template by providing the required Overview, Details, Where should the reviewer start, and Related Issues sections, and it clearly outlines new and modified files, key features, and reviewer guidance while referencing related work. All mandatory sections are present and adequately filled out with no significant omissions. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 93.10%, which is sufficient. The required threshold is 80.00%. |


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c48f49a and 53070c1.

📒 Files selected for processing (8)
  • tests/fault_tolerance/deploy/README.md (1 hunks)
  • tests/fault_tolerance/deploy/client_factory.py (1 hunks)
  • tests/fault_tolerance/deploy/conftest.py (2 hunks)
  • tests/fault_tolerance/deploy/legacy_client.py (1 hunks)
  • tests/fault_tolerance/deploy/legacy_parse_results.py (1 hunks)
  • tests/fault_tolerance/deploy/parse_factory.py (1 hunks)
  • tests/fault_tolerance/deploy/scenarios.py (2 hunks)
  • tests/fault_tolerance/deploy/test_deployment.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
tests/fault_tolerance/deploy/client_factory.py (2)
  • tests/fault_tolerance/deploy/conftest.py (1): client_type (42-44)
  • tests/fault_tolerance/deploy/legacy_client.py (1): client (174-292)

tests/fault_tolerance/deploy/scenarios.py (1)
  • tests/fault_tolerance/deploy/conftest.py (1): client_type (42-44)

tests/fault_tolerance/deploy/legacy_client.py (1)
  • tests/utils/managed_deployment.py (3): ManagedDeployment (359-802), get_pods (572-598), port_forward (685-756)

tests/fault_tolerance/deploy/parse_factory.py (1)
  • tests/fault_tolerance/deploy/legacy_parse_results.py (1): main (421-557)

tests/fault_tolerance/deploy/test_deployment.py (4)
  • tests/fault_tolerance/deploy/client_factory.py (1): get_client_function (21-69)
  • tests/fault_tolerance/deploy/parse_factory.py (1): parse_test_results (101-222)
  • tests/fault_tolerance/deploy/scenarios.py (1): Load (96-104)
  • tests/fault_tolerance/deploy/conftest.py (2): client_type (42-44), namespace (37-38)
🪛 markdownlint-cli2 (0.18.1)

tests/fault_tolerance/deploy/README.md
  • 478-478: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🪛 Ruff (0.13.2)

tests/fault_tolerance/deploy/client_factory.py
  • 66-69: Avoid specifying long messages outside the exception class (TRY003)
  • 120-123: Avoid specifying long messages outside the exception class (TRY003)

tests/fault_tolerance/deploy/legacy_client.py
  • 72-72: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
  • 80-80: Unused function argument: logger (ARG001)
  • 282-282: Do not catch blind exception: Exception (BLE001)
  • 283-283: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 286-286: Loop control variable pf_name not used within loop body; rename unused pf_name to _pf_name (B007)
  • 289-290: try-except-pass detected, consider logging the exception (S110)
  • 289-289: Do not catch blind exception: Exception (BLE001)

tests/fault_tolerance/deploy/legacy_parse_results.py
  • 315-315: Unused function argument: failure_type (ARG001)
  • 349-349: Loop control variable process not used within loop body (B007)
  • 350-350: Loop control variable replica not used within loop body; rename unused replica to _replica (B007)

tests/fault_tolerance/deploy/parse_factory.py
  • 141-144: Avoid specifying long messages outside the exception class (TRY003)
  • 172-172: Avoid specifying long messages outside the exception class (TRY003)
  • 175-178: Avoid specifying long messages outside the exception class (TRY003)
  • 222-222: Avoid specifying long messages outside the exception class (TRY003)
  • 368-368: Do not catch blind exception: Exception (BLE001)
  • 369-369: Use logging.exception instead of logging.error; replace with exception (TRY400)

tests/fault_tolerance/deploy/test_deployment.py
  • 168-168: Do not catch blind exception: Exception (BLE001)
  • 169-169: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 193-193: Do not catch blind exception: Exception (BLE001)
  • 194-194: Use logging.exception instead of logging.error; replace with exception (TRY400)
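
For the recurring BLE001/TRY400 findings above, the usual remediation looks like this minimal, illustrative sketch (not the code under review):

```python
# Inside an except block, logging.exception records the traceback
# automatically, whereas logging.error drops it unless exc_info is passed.
import logging

logger = logging.getLogger(__name__)


def cleanup(resource) -> None:
    try:
        resource.close()
    except Exception:  # narrow this to the expected error type to satisfy BLE001
        logger.exception("Failed to close resource during cleanup")
```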

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo

@tzulingk (Contributor) left a comment


LGTM. Thanks for working on this.

Signed-off-by: Indrajit Bhosale <[email protected]>
@indrajit96 (Contributor, Author) commented:

/ok to test 3d2d13a

Signed-off-by: Indrajit Bhosale <[email protected]>
@indrajit96 (Contributor, Author) commented:

/ok to test 32f13ec

@indrajit96 (Contributor, Author) commented:

/ok to test 2aede48

@indrajit96 enabled auto-merge (squash) October 6, 2025 18:07
@indrajit96 (Contributor, Author) commented:

/ok to test 3c66b9a

@indrajit96 (Contributor, Author) commented:

/ok to test b58726a

@indrajit96 (Contributor, Author) commented:

/ok to test d9d584a

@indrajit96 (Contributor, Author) commented:

/ok to test d9d584a

@indrajit96 merged commit f414bc5 into main Oct 6, 2025
25 of 27 checks passed
@indrajit96 deleted the ibhosale_custom_client branch October 6, 2025 21:53
nv-tusharma pushed a commit that referenced this pull request Oct 20, 2025