
Conversation

@rlratzel (Contributor) commented Jan 21, 2026

  • Allows disabled blocks to be completely ignored. This fixes the issue of a disabled Slack sink still requiring an env var to be set, and also allows incomplete blocks, or blocks with known errors, to remain in the file as disabled for later attention.
  • No longer prints stderr output when checking the ray status command, which cleans up output and prevents the appearance of an actual error.
  • Allows users to use Ray's default value for the object store size by accepting "default" as a value. This can be used to override a session-wide default setting (see the config sketch below).
  • Exposes the default object_store_size as a session-level param instead of a baked-in default to improve understanding. This can be overridden by entries as before.
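
A hypothetical config sketch of these behaviors; the section layout and entry names are assumptions rather than the project's actual schema, and only enabled and object_store_size come from this PR:

```yaml
# Hypothetical sketch; section layout and entry names are illustrative only.
session:
  object_store_size: 536870912000   # session-wide default (500 GB in bytes)

sinks:
  slack:
    enabled: false                  # whole block ignored; its env vars are never required

entries:
  - name: dedup_benchmark           # hypothetical entry
    object_store_size: default      # use Ray's own default, overriding the session value
  - name: classifier_benchmark      # hypothetical entry
    object_store_size: 0.5          # fraction of total memory
```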

…ts stderr output when checking ray status command.

Signed-off-by: rlratzel <[email protected]>
@copy-pr-bot bot commented Jan 21, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@rlratzel rlratzel marked this pull request as ready for review January 22, 2026 06:32
@greptile-apps bot (Contributor) commented Jan 22, 2026

Greptile Summary

This PR refactors the benchmarking framework's configuration handling to better support disabled entries and sinks, improves Ray object store size configuration flexibility, and enhances observability.

Key Changes:

  • Implemented centralized filtering of disabled blocks via a remove_disabled_blocks() function, eliminating the enabled checks previously scattered throughout sink implementations (a sketch follows this list)
  • Removed the enabled field from the Entry dataclass and from sink runtime checks, moving all filtering to the config preprocessing stage
  • Renamed object_store_size_bytes to object_store_size and expanded type support to include int (bytes), float (fraction of memory), and the string "default" for Ray's default behavior
  • Exposed object_store_size as session-level configuration (previously hardcoded as 50% of memory) with a 500 GB default
  • Added get_ray_cluster_data() to capture actual Ray cluster resources in results
  • Improved error logging by suppressing stderr output in Ray status checks to reduce noise
  • Renamed Session.create_from_dict() to Session.from_dict() for consistency
  • Added an Entry.from_dict() class method to filter unknown fields and handle legacy configs gracefully
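
A minimal sketch of such a recursive filter, extrapolated from the dict-handling snippet quoted in the review comment further down; the list branch and the None-means-dropped convention are assumptions:

```python
# Minimal sketch of a recursive disabled-block filter. The dict branch mirrors
# the snippet quoted in the review below; the list branch and the use of None
# to mean "dropped" are assumptions about the real implementation.
def remove_disabled_blocks(obj):
    if isinstance(obj, dict):
        if obj.get("enabled") is False:
            return None  # drop this block and everything nested inside it
        result = {}
        for k, v in obj.items():
            filtered = remove_disabled_blocks(v)
            if filtered is not None:
                result[k] = filtered
        return result
    if isinstance(obj, list):
        filtered_items = (remove_disabled_blocks(item) for item in obj)
        return [item for item in filtered_items if item is not None]
    return obj  # scalars pass through unchanged
```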

Impact:
The changes improve UX by allowing incomplete or error-prone config blocks to remain disabled without causing validation errors, and make object store configuration more transparent and flexible (a sketch of the implied size handling follows).
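
A hedged sketch of the object_store_size handling the PR implies: per the sequence diagram below, a float is converted to bytes in Session.__post_init__() and "default" is passed to Ray as None. The function name and the psutil-based memory lookup are assumptions:

```python
# Hedged sketch of the implied object_store_size normalization; the function
# name and the psutil-based memory lookup are assumptions, not project code.
import psutil

def normalize_object_store_size(value):
    if value == "default":
        return None  # setup passes None so Ray applies its own default
    if isinstance(value, float):
        # fraction of total system memory, converted to bytes
        return int(value * psutil.virtual_memory().total)
    if isinstance(value, int):
        return value  # already an absolute size in bytes
    raise TypeError(f"unsupported object_store_size: {value!r}")
```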

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk
  • The refactoring improves code organization and UX. The changes are well-structured with proper type handling and error management. One minor style suggestion exists around removing the enabled key from filtered configs, but this doesn't impact functionality since Entry.from_dict() filters unknown fields. The main risk is the behavioral change in how disabled blocks are processed, but this is intentional and well-tested based on the PR description.
  • Pay attention to benchmarking/runner/entry.py (confidence 3/5) as it has the most significant refactoring including removal of the enabled field and renaming of object_store_size_bytes

Important Files Changed

| Filename | Overview |
| --- | --- |
| benchmarking/runner/utils.py | Added remove_disabled_blocks() function to recursively filter out disabled configuration blocks |
| benchmarking/runner/session.py | Renamed method create_from_dict to from_dict, removed sink-level enabled checks, updated object_store_size handling to support int/float/str/None types |
| benchmarking/runner/entry.py | Added from_dict() class method, removed enabled field, renamed object_store_size_bytes to object_store_size with expanded type support |
| benchmarking/runner/ray_cluster.py | Added get_ray_cluster_data() function, suppressed stderr details in check_ray_responsive(), renamed parameter object_store_size_bytes to object_store_size |
| benchmarking/run.py | Integrated remove_disabled_blocks() call, updated to use Session.from_dict(), added "default" string handling for object_store_size, replaced empty ray_data dict with get_ray_cluster_data() call |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant run.py
    participant Session
    participant utils
    participant Entry
    participant RayCluster
    participant Sinks

    User->>run.py: Load YAML config
    run.py->>Session: assert_valid_config_dict()
    run.py->>utils: remove_disabled_blocks(config_dict)
    Note over utils: Recursively filters out<br/>entries/sinks with enabled: false
    utils-->>run.py: Filtered config_dict
    run.py->>utils: resolve_env_vars(config_dict)
    utils-->>run.py: Resolved config_dict
    run.py->>Session: from_dict(config_dict, entries_filter)
    Session->>Entry: Entry.from_dict() for each entry
    Note over Entry: Filters unknown fields,<br/>processes object_store_size
    Entry-->>Session: Entry objects
    Session->>Sinks: create_sinks_from_dict()
    Note over Sinks: No enabled checks,<br/>only enabled sinks in list
    Sinks-->>Session: Sink objects
    Session->>Session: __post_init__()
    Note over Session: Convert float object_store_size<br/>to bytes, propagate defaults<br/>to entries
    Session-->>run.py: Session object
    
    loop For each entry
        run.py->>RayCluster: setup_ray_cluster_and_env()
        Note over RayCluster: Pass object_store_size<br/>(None if "default")
        RayCluster-->>run.py: Ray client
        run.py->>run.py: Execute entry script
        run.py->>RayCluster: get_ray_cluster_data()
        RayCluster->>RayCluster: ray.cluster_resources()
        RayCluster-->>run.py: Ray cluster info
        run.py->>Sinks: process_result()
        run.py->>RayCluster: teardown_ray_cluster_and_env()
    end
    
    run.py->>Sinks: finalize()
    Note over Sinks: No enabled checks,<br/>all sinks execute
```

```yaml
- metric: domain_label_news_count
  exact_value: 2817
  # override the session-level object_store_size setting for this entry
  object_store_size: 214748364800
```
Contributor commented:

This was most likely unintentional. We just need to use the default.

I do think we should have a default of some reasonable number like 200 or 500 GB, since this number affects performance, and we'd rather be prescriptive than have to second-guess what object store size ended up being used.

Contributor commented:

I'm okay with a default of 0.5 as long as we log details from runtime; see the comment below.

@rlratzel (Author) commented:

I set it to 500 GB based on offline discussion, but I can certainly change it back to 0.5 since we log the object store size.

```python
# These will be appended with the benchmark params by the benchmark script.
(session_entry_path / "params.json").write_text(
    json.dumps(
        {
```
Contributor commented:

WDYT of logging stuff from runtime, as @ayushdg recommended?

ray.cluster_resources() exposes a nice dictionary we can log/use:

```python
{'accelerator_type:A100': 1.0,
 'node:__internal_head__': 1.0,
 'node:10.184.206.10': 1.0,
 'CPU': 128.0,
 'object_store_memory': 644245094400.0,
 'memory': 1225750720512.0,
 'GPU': 4.0}
```

Maybe a new set of keys?

@rlratzel (Author) commented Jan 23, 2026:

There was a placeholder for ray data in results.json which I used for this. Would we rather this data go in params.json? Since it's actual data returned from the cluster, it seems less like params to the benchmark and more like results, but I don't feel strongly.
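
A minimal sketch of what get_ray_cluster_data() might do with that results.json placeholder, assuming it simply snapshots ray.cluster_resources(); the wrapper key name is an assumption:

```python
# Minimal sketch, assuming get_ray_cluster_data() just snapshots
# ray.cluster_resources() for results.json; the wrapper key is an assumption.
import ray

def get_ray_cluster_data() -> dict:
    # cluster_resources() reports totals such as CPU, GPU, memory, and
    # object_store_memory, so the object store size actually in effect is
    # recorded alongside the benchmark results.
    return {"cluster_resources": ray.cluster_resources()}
```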

…ing for entry that was likely for debug, changes default object_store_size to explicit 500GB.

Signed-off-by: rlratzel <[email protected]>
@greptile-apps bot left a comment

1 file reviewed, 1 comment

Comment on lines +73 to +77:

```python
result = {}
for k, v in obj.items():
    filtered = remove_disabled_blocks(v)
    if filtered is not None:
        result[k] = filtered
```

The enabled key is not being removed from the result dictionary, which means it will be passed through to Entry/Sink constructors. While Entry.from_dict() filters unknown fields, this still leaves the enabled key in the config data unnecessarily.

Consider adding if k != "enabled" to the condition:

Suggested change:

```diff
 result = {}
 for k, v in obj.items():
-    filtered = remove_disabled_blocks(v)
-    if filtered is not None:
-        result[k] = filtered
+    if k != "enabled":
+        filtered = remove_disabled_blocks(v)
+        if filtered is not None:
+            result[k] = filtered
```

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@praateekmahajan merged commit 9735c69 into NVIDIA-NeMo:main Jan 23, 2026
17 checks passed