[benchmarking] Bug fixes and UX improvements #1409
Conversation
…ink from requiring env var. Signed-off-by: rlratzel <[email protected]>
…ts stderr output when checking ray status command. Signed-off-by: rlratzel <[email protected]>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
…from_dict for consistency. Signed-off-by: rlratzel <[email protected]>
…fixes1 Signed-off-by: rlratzel <[email protected]>
Greptile Overview

Greptile Summary: This PR refactors the benchmarking framework's configuration handling to better support disabled entries and sinks, improves Ray object store size configuration flexibility, and enhances observability.

Confidence Score: 4/5
Sequence Diagram

sequenceDiagram
participant User
participant run.py
participant Session
participant utils
participant Entry
participant RayCluster
participant Sinks
User->>run.py: Load YAML config
run.py->>Session: assert_valid_config_dict()
run.py->>utils: remove_disabled_blocks(config_dict)
Note over utils: Recursively filters out<br/>entries/sinks with enabled: false
utils-->>run.py: Filtered config_dict
run.py->>utils: resolve_env_vars(config_dict)
utils-->>run.py: Resolved config_dict
run.py->>Session: from_dict(config_dict, entries_filter)
Session->>Entry: Entry.from_dict() for each entry
Note over Entry: Filters unknown fields,<br/>processes object_store_size
Entry-->>Session: Entry objects
Session->>Sinks: create_sinks_from_dict()
Note over Sinks: No enabled checks,<br/>only enabled sinks in list
Sinks-->>Session: Sink objects
Session->>Session: __post_init__()
Note over Session: Convert float object_store_size<br/>to bytes, propagate defaults<br/>to entries
Session-->>run.py: Session object
loop For each entry
run.py->>RayCluster: setup_ray_cluster_and_env()
Note over RayCluster: Pass object_store_size<br/>(None if "default")
RayCluster-->>run.py: Ray client
run.py->>run.py: Execute entry script
run.py->>RayCluster: get_ray_cluster_data()
RayCluster->>RayCluster: ray.cluster_resources()
RayCluster-->>run.py: Ray cluster info
run.py->>Sinks: process_result()
run.py->>RayCluster: teardown_ray_cluster_and_env()
end
run.py->>Sinks: finalize()
Note over Sinks: No enabled checks,<br/>all sinks execute
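The flow in the diagram can be summarized as a minimal Python sketch. This is illustrative only: the module paths, the `entry.run()` call, and the exact signatures of `remove_disabled_blocks`, `resolve_env_vars`, `Session.from_dict`, and the Ray cluster helpers are assumptions, not the benchmarking framework's actual API.

```python
# Hypothetical driver loop mirroring the sequence diagram above.
# Module paths and function signatures are assumptions for this sketch.
import yaml

from benchmarking.session import Session
from benchmarking.utils import remove_disabled_blocks, resolve_env_vars
from benchmarking import ray_cluster


def main(config_path: str) -> None:
    with open(config_path) as f:
        config_dict = yaml.safe_load(f)

    Session.assert_valid_config_dict(config_dict)       # validate the raw YAML
    config_dict = remove_disabled_blocks(config_dict)   # drop blocks with enabled: false
    config_dict = resolve_env_vars(config_dict)         # expand environment variables

    session = Session.from_dict(config_dict)            # builds Entry and Sink objects

    for entry in session.entries:
        client = ray_cluster.setup_ray_cluster_and_env(
            object_store_size=entry.object_store_size,  # None when set to "default"
        )
        try:
            result = entry.run()                        # assumed API; runs the entry's script
            result["ray_cluster"] = ray_cluster.get_ray_cluster_data()
            for sink in session.sinks:
                sink.process_result(result)
        finally:
            ray_cluster.teardown_ray_cluster_and_env(client)

    for sink in session.sinks:
        sink.finalize()
```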
Signed-off-by: rlratzel <[email protected]>
benchmarking/nightly-benchmark.yaml (outdated)
    - metric: domain_label_news_count
      exact_value: 2817
    # override the session-level object_store_size setting for this entry
    object_store_size: 214748364800
This was most likely unintentional. We just need to use the default.
I do think we should have a default of some reasonable number like 200 or 500 GB, since this number affects performance, and we'd rather be prescriptive than have to second-guess what object store size ended up being used.
I'm okay with a default of 0.5 as long as we log details from the runtime; see the comment below.
I set it to 500 GB based on an offline discussion, but I can certainly change it back to 0.5 since we log the object store size.
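To make the numbers in this thread concrete: `214748364800` bytes is 200 GiB, and 500 GiB is `536870912000` bytes. Below is a minimal sketch of how a mixed `object_store_size` setting (integer bytes, a float fraction, or the literal `"default"`) could be normalized before being handed to Ray. The helper name, the `psutil` dependency, and the "float means fraction of system memory" reading of `0.5` are assumptions for illustration, not the PR's actual implementation.

```python
# Illustrative only: a hypothetical normalization helper for object_store_size.
import psutil


def normalize_object_store_size(value) -> int | None:
    """Return an object store size in bytes, or None to let Ray pick its own default."""
    if value == "default":
        # The sequence diagram shows "default" being passed to Ray as None.
        return None
    if isinstance(value, float):
        # Assumption: a float such as 0.5 means a fraction of total system memory.
        return int(psutil.virtual_memory().total * value)
    # Already an integer byte count, e.g. 214748364800 == 200 * 1024**3 (200 GiB).
    return int(value)


# The 500 GB default discussed above, expressed in binary units:
DEFAULT_OBJECT_STORE_SIZE = 500 * 1024**3  # 536_870_912_000 bytes
```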
    # These will be appended with the benchmark params by the benchmark script.
    (session_entry_path / "params.json").write_text(
        json.dumps(
            {
WDYT of logging details from the runtime, as @ayushdg recommended? `ray.cluster_resources()` exposes a nice dictionary we can log/use:

    {'accelerator_type:A100': 1.0,
     'node:__internal_head__': 1.0,
     'node:10.184.206.10': 1.0,
     'CPU': 128.0,
     'object_store_memory': 644245094400.0,
     'memory': 1225750720512.0,
     'GPU': 4.0}

Maybe a new set of keys?
There was a placeholder for Ray data in results.json, which I used for this. Would we rather have this data go in params.json? Since it's actual data returned from the cluster, it seems less like params to the benchmark and more like results, but I don't feel strongly.
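A minimal sketch of what recording `ray.cluster_resources()` alongside the benchmark output could look like; `ray.cluster_resources()` is the real Ray call quoted above, while the `results.json` layout and the `ray_cluster` key are assumptions based on this discussion.

```python
# Sketch only: capture Ray runtime info next to the benchmark results.
import json
from pathlib import Path

import ray


def write_results(session_entry_path: Path, benchmark_results: dict) -> None:
    payload = {
        "results": benchmark_results,
        # e.g. {'CPU': 128.0, 'GPU': 4.0, 'object_store_memory': 644245094400.0, ...}
        "ray_cluster": ray.cluster_resources(),
    }
    (session_entry_path / "results.json").write_text(json.dumps(payload, indent=2))
```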
…fixes1 Signed-off-by: rlratzel <[email protected]>
…ing for entry that was likely for debug, changes default object_store_size to explicit 500GB. Signed-off-by: rlratzel <[email protected]>
1 file reviewed, 1 comment
    result = {}
    for k, v in obj.items():
        filtered = remove_disabled_blocks(v)
        if filtered is not None:
            result[k] = filtered
The `enabled` key is not being removed from the result dictionary, which means it will be passed through to Entry/Sink constructors. While `Entry.from_dict()` filters unknown fields, this still leaves the `enabled` key in the config data unnecessarily.
Consider adding `if k != "enabled"` to the condition:
Suggested change:

    result = {}
    for k, v in obj.items():
        if k != "enabled":
            filtered = remove_disabled_blocks(v)
            if filtered is not None:
                result[k] = filtered
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
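For context, here is a self-contained sketch of what `remove_disabled_blocks` might look like with the suggested `enabled`-key filtering folded in. Only the dict loop shown above comes from the PR; the disabled-block check, the list handling, and the usage example are assumed completions for illustration.

```python
# Sketch of a recursive remove_disabled_blocks incorporating the suggestion above.
# Only the dict loop comes from the PR; everything else is an assumed completion.
def remove_disabled_blocks(obj):
    if isinstance(obj, dict):
        # Drop the whole block if it is explicitly disabled.
        if obj.get("enabled") is False:
            return None
        result = {}
        for k, v in obj.items():
            if k != "enabled":  # suggested: don't pass "enabled" downstream
                filtered = remove_disabled_blocks(v)
                if filtered is not None:
                    result[k] = filtered
        return result
    if isinstance(obj, list):
        return [
            filtered
            for filtered in (remove_disabled_blocks(item) for item in obj)
            if filtered is not None
        ]
    return obj


# Example: a disabled sink disappears and "enabled" keys are stripped.
config = {
    "sinks": [
        {"name": "csv", "enabled": True},
        {"name": "slack", "enabled": False},
    ]
}
assert remove_disabled_blocks(config) == {"sinks": [{"name": "csv"}]}
```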
- `default` as a value. This can be used to override a session-wide default setting (see below).
- `object_store_size` as a session-level param instead of a baked-in default, to improve understanding. This can be overridden by entries as before (see the sketch below).
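A rough illustration of the resulting configuration semantics, written as the Python dict the YAML would load into; the key names besides `object_store_size`, and the entry names, are placeholders rather than the actual schema.

```python
# Illustrative config shape only; the real YAML schema may differ.
config_dict = {
    "object_store_size": 500 * 1024**3,  # session-level default, in bytes
    "entries": [
        # Inherits the session-level 500 GiB object store size.
        {"name": "entry_a"},
        # "default" overrides the session value and is passed to Ray as None,
        # letting Ray choose its own object store size.
        {"name": "entry_b", "object_store_size": "default"},
        # Explicit per-entry override in bytes (200 GiB).
        {"name": "entry_c", "object_store_size": 214748364800},
    ],
}
```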