
Conversation

@rlratzel (Contributor) commented Jan 21, 2026

  • Allows disabled blocks to be completely ignored. This fixes the issue of a disabled Slack sink still requiring an env var to be set, and also allows incomplete blocks, or blocks with known errors, to remain in the file as disabled for later attention.
  • No longer prints stderr output when checking the ray status command, which cleans up output and prevents the appearance of an actual error.
  • Allows users to use Ray's default value for the object store size by accepting "default" as a value. This can be used to override a session-wide default setting (see the config sketch below).
  • Exposes the default object_store_size as a session-level param instead of a baked-in default to improve understanding. This can be overridden by entries as before.
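
A hypothetical config sketch of these behaviors; the section layout and entry names are assumptions rather than the project's actual schema, and only enabled and object_store_size come from this PR:

```yaml
# Hypothetical sketch; section layout and entry names are illustrative only.
session:
  object_store_size: 536870912000   # session-wide default (500 GB in bytes)

sinks:
  slack:
    enabled: false                  # whole block ignored; its env vars are never required

entries:
  - name: dedup_benchmark           # hypothetical entry
    object_store_size: default      # use Ray's own default, overriding the session value
  - name: classifier_benchmark      # hypothetical entry
    object_store_size: 0.5          # fraction of total memory
```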

…ts stderr output when checking ray status command.

Signed-off-by: rlratzel <[email protected]>
@copy-pr-bot bot commented Jan 21, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@rlratzel rlratzel marked this pull request as ready for review January 22, 2026 06:32
@greptile-apps bot (Contributor) commented Jan 22, 2026

Greptile Summary

This PR refactors the benchmarking framework's configuration handling to better support disabled entries and sinks, improves Ray object store size configuration flexibility, and enhances observability.

Key Changes:

  • Implemented centralized filtering of disabled blocks via a remove_disabled_blocks() function, eliminating the enabled checks previously scattered throughout sink implementations (a sketch follows this list)
  • Removed the enabled field from the Entry dataclass and from sink runtime checks, moving all filtering to the config preprocessing stage
  • Renamed object_store_size_bytes to object_store_size and expanded type support to include int (bytes), float (fraction of memory), and the string "default" for Ray's default behavior
  • Exposed object_store_size as session-level configuration (previously hardcoded as 50% of memory) with a 500 GB default
  • Added get_ray_cluster_data() to capture actual Ray cluster resources in results
  • Improved error logging by suppressing stderr output in Ray status checks to reduce noise
  • Renamed Session.create_from_dict() to Session.from_dict() for consistency
  • Added an Entry.from_dict() class method to filter unknown fields and handle legacy configs gracefully
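
A minimal sketch of such a recursive filter, extrapolated from the dict-handling snippet quoted in the review comment further down; the list branch and the None-means-dropped convention are assumptions:

```python
# Minimal sketch of a recursive disabled-block filter. The dict branch mirrors
# the snippet quoted in the review below; the list branch and the use of None
# to mean "dropped" are assumptions about the real implementation.
def remove_disabled_blocks(obj):
    if isinstance(obj, dict):
        if obj.get("enabled") is False:
            return None  # drop this block and everything nested inside it
        result = {}
        for k, v in obj.items():
            filtered = remove_disabled_blocks(v)
            if filtered is not None:
                result[k] = filtered
        return result
    if isinstance(obj, list):
        filtered_items = (remove_disabled_blocks(item) for item in obj)
        return [item for item in filtered_items if item is not None]
    return obj  # scalars pass through unchanged
```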

Impact:
The changes improve UX by allowing incomplete or error-prone config blocks to remain disabled without causing validation errors, and make object store configuration more transparent and flexible (a sketch of the implied size handling follows).
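
A hedged sketch of the object_store_size handling the PR implies: per the sequence diagram below, a float is converted to bytes in Session.__post_init__() and "default" is passed to Ray as None. The function name and the psutil-based memory lookup are assumptions:

```python
# Hedged sketch of the implied object_store_size normalization; the function
# name and the psutil-based memory lookup are assumptions, not project code.
import psutil

def normalize_object_store_size(value):
    if value == "default":
        return None  # setup passes None so Ray applies its own default
    if isinstance(value, float):
        # fraction of total system memory, converted to bytes
        return int(value * psutil.virtual_memory().total)
    if isinstance(value, int):
        return value  # already an absolute size in bytes
    raise TypeError(f"unsupported object_store_size: {value!r}")
```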

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk
  • The refactoring improves code organization and UX. The changes are well-structured with proper type handling and error management. One minor style suggestion exists around removing the enabled key from filtered configs, but this doesn't impact functionality since Entry.from_dict() filters unknown fields. The main risk is the behavioral change in how disabled blocks are processed, but this is intentional and well-tested based on the PR description.
  • Pay attention to benchmarking/runner/entry.py (confidence 3/5) as it has the most significant refactoring including removal of the enabled field and renaming of object_store_size_bytes

Important Files Changed

| Filename | Overview |
| --- | --- |
| benchmarking/runner/utils.py | Added remove_disabled_blocks() function to recursively filter out disabled configuration blocks |
| benchmarking/runner/session.py | Renamed method create_from_dict to from_dict, removed sink-level enabled checks, updated object_store_size handling to support int/float/str/None types |
| benchmarking/runner/entry.py | Added from_dict() class method, removed enabled field, renamed object_store_size_bytes to object_store_size with expanded type support |
| benchmarking/runner/ray_cluster.py | Added get_ray_cluster_data() function, suppressed stderr details in check_ray_responsive(), renamed parameter object_store_size_bytes to object_store_size |
| benchmarking/run.py | Integrated remove_disabled_blocks() call, updated to use Session.from_dict(), added "default" string handling for object_store_size, replaced empty ray_data dict with get_ray_cluster_data() call |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant run.py
    participant Session
    participant utils
    participant Entry
    participant RayCluster
    participant Sinks

    User->>run.py: Load YAML config
    run.py->>Session: assert_valid_config_dict()
    run.py->>utils: remove_disabled_blocks(config_dict)
    Note over utils: Recursively filters out<br/>entries/sinks with enabled: false
    utils-->>run.py: Filtered config_dict
    run.py->>utils: resolve_env_vars(config_dict)
    utils-->>run.py: Resolved config_dict
    run.py->>Session: from_dict(config_dict, entries_filter)
    Session->>Entry: Entry.from_dict() for each entry
    Note over Entry: Filters unknown fields,<br/>processes object_store_size
    Entry-->>Session: Entry objects
    Session->>Sinks: create_sinks_from_dict()
    Note over Sinks: No enabled checks,<br/>only enabled sinks in list
    Sinks-->>Session: Sink objects
    Session->>Session: __post_init__()
    Note over Session: Convert float object_store_size<br/>to bytes, propagate defaults<br/>to entries
    Session-->>run.py: Session object
    
    loop For each entry
        run.py->>RayCluster: setup_ray_cluster_and_env()
        Note over RayCluster: Pass object_store_size<br/>(None if "default")
        RayCluster-->>run.py: Ray client
        run.py->>run.py: Execute entry script
        run.py->>RayCluster: get_ray_cluster_data()
        RayCluster->>RayCluster: ray.cluster_resources()
        RayCluster-->>run.py: Ray cluster info
        run.py->>Sinks: process_result()
        run.py->>RayCluster: teardown_ray_cluster_and_env()
    end
    
    run.py->>Sinks: finalize()
    Note over Sinks: No enabled checks,<br/>all sinks execute
```

```yaml
- metric: domain_label_news_count
  exact_value: 2817
  # override the session-level object_store_size setting for this entry
  object_store_size: 214748364800
```
Contributor commented:

This was most likely unintentional. We just need to use the default.

I do think we should have a default of some reasonable number like 200 or 500 GB, since this number affects performance, and we'd rather be prescriptive than have to second-guess what object store size ended up being used.

Contributor commented:

I'm okay with a default of 0.5 as long as we log details from runtime; see the comment below.

@rlratzel (Author) commented:

I set it to 500 GB based on offline discussion, but I can certainly change it back to 0.5 since we log the object store size.

```python
# These will be appended with the benchmark params by the benchmark script.
(session_entry_path / "params.json").write_text(
    json.dumps(
        {
```
Contributor commented:

WDYT of logging stuff from runtime, as @ayushdg recommended?

ray.cluster_resources() exposes a nice dictionary we can log/use:

```python
{'accelerator_type:A100': 1.0,
 'node:__internal_head__': 1.0,
 'node:10.184.206.10': 1.0,
 'CPU': 128.0,
 'object_store_memory': 644245094400.0,
 'memory': 1225750720512.0,
 'GPU': 4.0}
```

Maybe a new set of keys?

@rlratzel (Author) commented Jan 23, 2026:

There was a placeholder for ray data in results.json which I used for this. Would we rather this data go in params.json? Since it's actual data returned from the cluster, it seems less like params to the benchmark and more like results, but I don't feel strongly.
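
A minimal sketch of what get_ray_cluster_data() might do with that results.json placeholder, assuming it simply snapshots ray.cluster_resources(); the wrapper key name is an assumption:

```python
# Minimal sketch, assuming get_ray_cluster_data() just snapshots
# ray.cluster_resources() for results.json; the wrapper key is an assumption.
import ray

def get_ray_cluster_data() -> dict:
    # cluster_resources() reports totals such as CPU, GPU, memory, and
    # object_store_memory, so the object store size actually in effect is
    # recorded alongside the benchmark results.
    return {"cluster_resources": ray.cluster_resources()}
```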

…ing for entry that was likely for debug, changes default object_store_size to explicit 500GB.

Signed-off-by: rlratzel <[email protected]>
@greptile-apps bot left a comment

1 file reviewed, 1 comment

Comment on lines +73 to +77:

```python
result = {}
for k, v in obj.items():
    filtered = remove_disabled_blocks(v)
    if filtered is not None:
        result[k] = filtered
```

The enabled key is not being removed from the result dictionary, which means it will be passed through to Entry/Sink constructors. While Entry.from_dict() filters unknown fields, this still leaves the enabled key in the config data unnecessarily.

Consider adding if k != "enabled" to the condition:

Suggested change:

```diff
 result = {}
 for k, v in obj.items():
-    filtered = remove_disabled_blocks(v)
-    if filtered is not None:
-        result[k] = filtered
+    if k != "enabled":
+        filtered = remove_disabled_blocks(v)
+        if filtered is not None:
+            result[k] = filtered
```

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@praateekmahajan merged commit 9735c69 into NVIDIA-NeMo:main Jan 23, 2026
17 checks passed