[REVIEW] Benchmarking Script for E2E Pipeline #1389
Signed-off-by: Vibhu Jawa <[email protected]>
benchmarking/local-cc-e2e.yaml
Outdated
- name: cc_e2e_pipeline_local
  enabled: true
  script: cc_e2e_pipeline_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --fasttext_model_path=/raid/vjawa/models/lid.176.bin
    --download_path={session_entry_dir}/scratch/downloads
    --output_path={session_entry_dir}/scratch/output
    --snapshot=2024-30
    --url_limit=1
    --record_limit=100
    --executor=ray_data
  timeout_s: 3600
  ray:
    num_cpus: 4
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480
- Instead of a new file, can you add it to nightly and then run with --entries cc_e2e_pipeline_local?
- We should probably have more CPUs to test benchmarking.
- With url_limit = 1, that's only downloading one file, and if the download takes 10 minutes the other N-1 CPUs would just be idle. So we should probably have url_limit = K * num_cpus to test parallelism and load balancing.
- For record_limit, wdyt of removing it? We know there is a backpressure issue in Ray Data if the stages are actors, so we can test for that later. And IIRC when record_limit is higher, iterate is slower than download. I'm hoping that in future, when we combine iterate and extract, we'll see perf benefits.
- For fasttext-model-path and hf-home, we could probably use something in the datasets path itself? wdyt?
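To make the url_limit = K * num_cpus suggestion concrete, here is a minimal sketch of the sizing arithmetic. The helper names and the default K of 4 are assumptions for illustration only; they are not part of the benchmark script, and the idle-CPU estimate assumes each URL download occupies one CPU.

```python
def suggested_url_limit(num_cpus: int, k: int = 4) -> int:
    """Return K * num_cpus so every CPU has several downloads to pick up.

    `k` is a hypothetical oversubscription factor, not a value from the PR.
    """
    if num_cpus < 1 or k < 1:
        raise ValueError("num_cpus and k must be positive")
    return k * num_cpus


def idle_cpus(num_cpus: int, url_limit: int) -> int:
    """Estimate CPUs left without work, assuming one URL keeps one CPU busy."""
    return max(num_cpus - url_limit, 0)


if __name__ == "__main__":
    # The config under review: 4 CPUs but a single URL to download.
    print(idle_cpus(num_cpus=4, url_limit=1))
    # Sizing the limit from the CPU count instead.
    print(suggested_url_limit(num_cpus=4))
```

A K larger than 1 also leaves room for load balancing, since downloads that finish early can be followed by another URL on the same CPU.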
Addressed the review:
- Added something like arxiv_e2e_pipeline_*
- Moved to a locally downloaded tar dump (to better simulate the workflow)
- We run on 45 downloaded tar files now; this can be increased in a follow-up PR for larger-scale testing as needed
- Removed record_limit
- Added paths, let me know if that works.
benchmarking/local-cc-e2e.yaml
Outdated
    num_cpus: 4
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480
Can we add more metrics to track, such as the number of rows at the beginning and the number of rows at the end (exact_value), and then probably something for throughput (min_value)?
Added more metrics like num_tar_files, num_input_documents, and num_output_documents first; I have not added throughput as a requirement for now. Please take a look (because I have not run on the benchmarking machine).
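As a rough illustration of how exact_value and min_value requirements on these metrics could be checked, here is a small sketch. The Requirement class, the check_requirements function, the throughput_docs_per_sec metric name, and every number except the 45 tar files are assumptions made up for this example; the benchmarking framework's real requirement handling may look different.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Requirement:
    """Hypothetical requirement: an exact match and/or a lower bound."""
    exact_value: Optional[float] = None
    min_value: Optional[float] = None


def check_requirements(
    metrics: Dict[str, float], requirements: Dict[str, Requirement]
) -> List[str]:
    """Return one human-readable failure per violated requirement."""
    failures = []
    for name, req in requirements.items():
        if name not in metrics:
            failures.append(f"{name}: metric missing")
            continue
        value = metrics[name]
        if req.exact_value is not None and value != req.exact_value:
            failures.append(f"{name}: expected {req.exact_value}, got {value}")
        if req.min_value is not None and value < req.min_value:
            failures.append(f"{name}: {value} is below minimum {req.min_value}")
    return failures


if __name__ == "__main__":
    # Illustrative values only; document counts and throughput are invented.
    metrics = {
        "num_tar_files": 45,
        "num_input_documents": 1000,
        "num_output_documents": 800,
        "throughput_docs_per_sec": 12.5,
    }
    requirements = {
        "num_tar_files": Requirement(exact_value=45),
        "throughput_docs_per_sec": Requirement(min_value=10.0),
    }
    print(check_requirements(metrics, requirements))
```

Row counts suit exact_value because they should be deterministic for a fixed input, while throughput varies between runs and is better expressed as a floor.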
Pull request overview
This pull request adds a comprehensive end-to-end benchmarking script for the ArXiv text processing pipeline, designed for nightly benchmarking with support for both local tar file processing and S3 downloading modes.
Changes:
- Added get_aggregated_stage_stats utility function to benchmarking/scripts/utils.py for extracting aggregated performance metrics from pipeline results
- Created arxiv_e2e_pipeline_benchmark.py with a full E2E ArXiv processing pipeline including extraction, heuristic filters, quality classifiers, and configurable output formats
- Added two new dataset configurations (arxiv_downloads and fasttext_model) and two benchmark entries (arxiv_e2e_pipeline_raydata and arxiv_e2e_pipeline_xenna) to nightly-benchmark.yaml for testing both Ray Data and Xenna executors
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| benchmarking/scripts/utils.py | Adds helper function for aggregating stage performance statistics by matching stage name prefixes |
| benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py | Complete E2E benchmark script with custom stages for local tar file processing, comprehensive filtering pipeline, and configurable execution modes |
| benchmarking/nightly-benchmark.yaml | Adds dataset configurations and two benchmark entries for testing the ArXiv pipeline with different executors |
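For readers unfamiliar with the idea of aggregating stage statistics by matching stage name prefixes, a minimal sketch follows. It is not the get_aggregated_stage_stats implementation from this PR: the function name, the record layout (stage_name, process_time_s, num_items), and the returned keys are all assumptions chosen to show the technique.

```python
from typing import Dict, Iterable, Mapping, Union

Number = Union[int, float]


def aggregate_stage_stats_by_prefix(
    records: Iterable[Mapping[str, object]], prefix: str
) -> Dict[str, Number]:
    """Sum time and item counts over records whose stage name starts with prefix.

    Each record is assumed (hypothetically) to carry the keys
    'stage_name', 'process_time_s', and 'num_items'.
    """
    matched = [r for r in records if str(r["stage_name"]).startswith(prefix)]
    total_time = sum(float(r["process_time_s"]) for r in matched)
    total_items = sum(int(r["num_items"]) for r in matched)
    return {
        "num_records": len(matched),
        "total_process_time_s": total_time,
        "total_items": total_items,
        # Guard against division by zero when nothing matched or time is 0.
        "items_per_second": total_items / total_time if total_time > 0 else 0.0,
    }


if __name__ == "__main__":
    # Invented per-task records; real pipeline results will differ.
    records = [
        {"stage_name": "extract_0", "process_time_s": 2.0, "num_items": 10},
        {"stage_name": "extract_1", "process_time_s": 3.0, "num_items": 20},
        {"stage_name": "filter_0", "process_time_s": 1.0, "num_items": 5},
    ]
    print(aggregate_stage_stats_by_prefix(records, "extract"))
```

Prefix matching lets one call cover stages whose names carry a suffix, at the cost of also catching any unrelated stage that happens to share the prefix.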
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
No files reviewed, no comments
This pull request adds a new configuration file for running local end-to-end benchmarking of the Common Crawl pipeline. The configuration includes local paths, global settings, and a single enabled entry for executing the pipeline with specific arguments and resource allocations.
Configuration for local benchmarking:
- Added benchmarking/local-cc-e2e.yaml to define local paths for results, datasets, and models, as well as global settings like default timeout and scratch deletion.
- Defined a single enabled entry (cc_e2e_pipeline_local) with script arguments, resource limits (CPUs, GPUs, object store size), and Ray-specific settings for local execution.