[REVIEW] Benchmarking Script for E2E Pipeline #1389
VibhuJawa merged 19 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
benchmarking/local-cc-e2e.yaml (outdated)
```yaml
- name: cc_e2e_pipeline_local
  enabled: true
  script: cc_e2e_pipeline_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --fasttext_model_path=/raid/vjawa/models/lid.176.bin
    --download_path={session_entry_dir}/scratch/downloads
    --output_path={session_entry_dir}/scratch/output
    --snapshot=2024-30
    --url_limit=1
    --record_limit=100
    --executor=ray_data
  timeout_s: 3600
  ray:
    num_cpus: 4
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480
```
- Instead of a new file, can you add this to nightly and then run it with `--entries cc_e2e_pipeline_local`? We probably should also have more CPUs for benchmarking.
- With `url_limit=1` we only download one file, so if that download takes 10 minutes the other N-1 CPUs just sit idle. We should probably set `url_limit = K * num_cpus` to test parallelism and load balancing.
- For `record_limit`, wdyt of removing it? We know there is a backpressure issue in Ray Data when the stages are actors, so we can test for that later. IIRC when `record_limit` is higher, iterate is slower than download; I'm hoping that when we combine iterate and extract in the future we'll see perf benefits.
- For `fasttext-model-path` and `hf-home`, could we use something in the datasets path itself? wdyt?
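The `url_limit = K * num_cpus` sizing rule above can be sketched as a tiny helper (the function name and the default `K` are hypothetical, not from the PR):

```python
def choose_url_limit(num_cpus: int, k: int = 4) -> int:
    """Pick a url_limit so every CPU has work queued.

    With url_limit=1, one worker downloads while the other
    num_cpus - 1 sit idle; K files per CPU keeps all workers
    busy and actually exercises load balancing.
    """
    return k * num_cpus


# e.g. for the 4-CPU local config above:
print(choose_url_limit(4))  # -> 16
```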
There was a problem hiding this comment.
Addressed the review:
- Added entries like `arxiv_e2e_pipeline_*`.
- Moved to a locally downloaded tar dump (to simulate the workflow better).
- We now run on 45 downloaded tar files; we can increase this in a follow-up PR for larger-scale testing as needed.
- Removed `record_limit`.
- Added the paths; let me know if that works.
benchmarking/local-cc-e2e.yaml (outdated)
```yaml
num_cpus: 4
num_gpus: 0
enable_object_spilling: false
object_store_size_bytes: 11474836480
```
Can we add more metrics to track, such as the number of rows at the beginning and the number of rows at the end (`exact_value`), and then probably something for throughput (`min_value`)?
Added more metrics like `num_tar_files`, `num_input_documents`, and `num_output_documents`. For now I have not added throughput as a requirement. Please take a look (I have not run this on the benchmarking machine yet).
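A minimal sketch of how such row-count metrics could be collected around a pipeline run. The `run_pipeline` callable and this exact function are assumptions for illustration; only the metric names mirror the ones mentioned above, and throughput is included purely to illustrate a min-value-style check:

```python
import time


def collect_pipeline_metrics(run_pipeline, tar_paths, input_docs):
    """Run the pipeline once and record row counts plus an illustrative throughput."""
    start = time.perf_counter()
    output_docs = run_pipeline(input_docs)
    elapsed_s = time.perf_counter() - start
    return {
        "num_tar_files": len(tar_paths),           # checked as an exact value
        "num_input_documents": len(input_docs),    # checked as an exact value
        "num_output_documents": len(output_docs),  # checked as an exact value
        # Throughput varies per machine, so it would be a min-value check.
        "docs_per_second": len(output_docs) / max(elapsed_s, 1e-9),
    }
```

Exact-value checks on the document counts catch silent row loss between stages, while a throughput floor only flags gross regressions.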
Pull request overview
This pull request adds a comprehensive end-to-end benchmarking script for the ArXiv text processing pipeline, designed for nightly benchmarking with support for both local tar file processing and S3 downloading modes.
Changes:
- Added a `get_aggregated_stage_stats` utility function to `benchmarking/scripts/utils.py` for extracting aggregated performance metrics from pipeline results
- Created `arxiv_e2e_pipeline_benchmark.py` with a full E2E ArXiv processing pipeline including extraction, heuristic filters, quality classifiers, and configurable output formats
- Added two new dataset configurations (`arxiv_downloads` and `fasttext_model`) and two benchmark entries (`arxiv_e2e_pipeline_raydata` and `arxiv_e2e_pipeline_xenna`) to `nightly-benchmark.yaml` for testing both Ray Data and Xenna executors
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| benchmarking/scripts/utils.py | Adds helper function for aggregating stage performance statistics by matching stage name prefixes |
| benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py | Complete E2E benchmark script with custom stages for local tar file processing, comprehensive filtering pipeline, and configurable execution modes |
| benchmarking/nightly-benchmark.yaml | Adds dataset configurations and two benchmark entries for testing the ArXiv pipeline with different executors |
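As a rough illustration of the prefix-matching aggregation described for `utils.py`, a sketch might look like the following. The signature, input shape, and metric names are assumptions; the actual helper in the PR may differ:

```python
def get_aggregated_stage_stats(stage_stats: dict, prefix: str) -> dict:
    """Sum numeric metrics across all stages whose name starts with `prefix`.

    Pipeline stages often appear as replicated instances (e.g. "download_0",
    "download_1"), so aggregating by name prefix rolls them up per logical stage.
    """
    agg: dict = {}
    for name, stats in stage_stats.items():
        if not name.startswith(prefix):
            continue
        for key, value in stats.items():
            if isinstance(value, (int, float)):
                agg[key] = agg.get(key, 0) + value
    return agg


# Hypothetical per-stage stats rolled up for the "download" stage:
stats = {
    "download_0": {"time_s": 2.0, "rows": 10},
    "download_1": {"time_s": 3.0, "rows": 5},
    "extract_0": {"time_s": 1.0, "rows": 15},
}
print(get_aggregated_stage_stats(stats, "download"))  # -> {'time_s': 5.0, 'rows': 15}
```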
```diff
@@ -0,0 +1,500 @@
+ # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
Copyright year should be 2025, not 2026
Suggested change:

```diff
- # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
This pull request adds a new configuration file for running local end-to-end benchmarking of the Common Crawl pipeline. The configuration includes local paths, global settings, and a single enabled entry for executing the pipeline with specific arguments and resource allocations.
Configuration for local benchmarking:
- Added `benchmarking/local-cc-e2e.yaml` to define local paths for results, datasets, and models, as well as global settings like the default timeout and scratch deletion.
- Added a single enabled entry (`cc_e2e_pipeline_local`) with script arguments, resource limits (CPUs, GPUs, object store size), and Ray-specific settings for local execution.