
Conversation

@VibhuJawa
Contributor

This pull request adds a new configuration file for running local end-to-end benchmarking of the Common Crawl pipeline. The configuration includes local paths, global settings, and a single enabled entry for executing the pipeline with specific arguments and resource allocations.

Configuration for local benchmarking:

  • Added benchmarking/local-cc-e2e.yaml to define local paths for results, datasets, and models, as well as global settings like default timeout and scratch deletion.
  • Configured a single pipeline entry (cc_e2e_pipeline_local) with script arguments, resource limits (CPUs, GPUs, object store size), and Ray-specific settings for local execution; a rough sketch of the file's shape follows below.
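
A rough sketch of the file's shape, for orientation only (the key names below are illustrative assumptions, not the actual schema; the real entry is quoted in the review thread below):

# benchmarking/local-cc-e2e.yaml -- illustrative sketch, not the actual file
paths:
  results: /raid/vjawa/benchmark_results    # assumed local results path
  datasets: /raid/vjawa/datasets            # assumed local datasets path
  models: /raid/vjawa/models                # matches the model path used by the entry
global:
  default_timeout_s: 3600                   # default timeout (value from the entry below)
  delete_scratch: true                      # assumed flag controlling scratch deletion
entries:
  - name: cc_e2e_pipeline_local             # the single enabled entry (full config below)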

Signed-off-by: Vibhu Jawa <[email protected]>

copy-pr-bot bot commented Jan 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Comment on lines 18 to 35
- name: cc_e2e_pipeline_local
  enabled: true
  script: cc_e2e_pipeline_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --fasttext_model_path=/raid/vjawa/models/lid.176.bin
    --download_path={session_entry_dir}/scratch/downloads
    --output_path={session_entry_dir}/scratch/output
    --snapshot=2024-30
    --url_limit=1
    --record_limit=100
    --executor=ray_data
  timeout_s: 3600
  ray:
    num_cpus: 4
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480
Contributor

  • Instead of a new file, can you add it to nightly and then run with --entries cc_e2e_pipeline_local?
  • We should probably have more CPUs to test benchmarking.
  • For url_limit = 1, we're only downloading one file; if the download takes 10 minutes, the other N-1 CPUs would just sit idle. We should probably set url_limit = K * num_cpus to test parallelism and load balancing.
  • For record_limit, wdyt of removing it? We know there is a backpressure issue in Ray Data when the stages are actors, so we can test for that later. IIRC, when record_limit is higher, iterate is slower than download; I'm hoping that when we combine iterate and extract in the future we'll see perf benefits.

For fasttext-model-path and hf-home, could we use something under the datasets path itself? wdyt?
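
Put together, the suggestions might look roughly like this: url_limit = 64 (K = 4 times num_cpus = 16, both values illustrative), record_limit dropped, and the model moved under a hypothetical {datasets_path} placeholder:

- name: cc_e2e_pipeline_local
  enabled: true
  script: cc_e2e_pipeline_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --fasttext_model_path={datasets_path}/models/lid.176.bin
    --download_path={session_entry_dir}/scratch/downloads
    --output_path={session_entry_dir}/scratch/output
    --snapshot=2024-30
    --url_limit=64
    --executor=ray_data
  timeout_s: 3600
  ray:
    num_cpus: 16                            # more CPUs, per the suggestion above
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480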

Contributor Author

Addressed the review:

  • Added something like arxiv_e2e_pipeline_*
  • Moved to a locally downloaded tar dump (to simulate the workflow better)
  • We run on 45 downloaded tar files now; we can increase this in a follow-up PR for larger-scale testing as needed
  • Removed record_limit
  • Added the paths; let me know if that works.

num_cpus: 4
num_gpus: 0
enable_object_spilling: false
object_store_size_bytes: 11474836480
Contributor

Can we add more metrics to track, such as the number of rows at the beginning and the number of rows at the end (exact_value), and then probably something for throughput (min_value)?

Contributor Author

@VibhuJawa commented Jan 22, 2026

Added more metrics like num_tar_files, num_input_documents, and num_output_documents. For now, I have not added throughput as a requirement. Please take a look (I have not yet run this on the benchmarking machine).
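
For illustration, such checks might be expressed in the entry YAML along these lines (the metrics schema and all values other than the tar-file count are assumptions, not taken from the PR):

metrics:
  num_tar_files:
    exact_value: 45              # grounded: the entry runs on 45 downloaded tar files
  num_input_documents:
    exact_value: 100000          # hypothetical placeholder; the real count comes from a run
  num_output_documents:
    exact_value: 80000           # hypothetical placeholder
  documents_per_second:
    min_value: 50                # hypothetical throughput floor (min_value), as suggested above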

Contributor

Copilot AI left a comment

Pull request overview

This pull request adds a comprehensive end-to-end benchmarking script for the ArXiv text processing pipeline, designed for nightly benchmarking with support for both local tar file processing and S3 downloading modes.

Changes:

  • Added get_aggregated_stage_stats utility function to benchmarking/scripts/utils.py for extracting aggregated performance metrics from pipeline results
  • Created arxiv_e2e_pipeline_benchmark.py with a full E2E ArXiv processing pipeline including extraction, heuristic filters, quality classifiers, and configurable output formats
  • Added two new dataset configurations (arxiv_downloads and fasttext_model) and two benchmark entries (arxiv_e2e_pipeline_raydata and arxiv_e2e_pipeline_xenna) to nightly-benchmark.yaml for testing both the Ray Data and Xenna executors (a rough sketch follows this list)
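
A rough sketch of the shape of those additions (key names, paths, and the {datasets_root} placeholder are assumptions based on this summary, not the actual diff):

datasets:
  arxiv_downloads:
    path: {datasets_root}/arxiv/downloads   # the locally downloaded tar files
  fasttext_model:
    path: {datasets_root}/models/lid.176.bin

entries:
  - name: arxiv_e2e_pipeline_raydata
    script: arxiv_e2e_pipeline_benchmark.py
    args: --executor=ray_data               # plus paths/limits, as in the CC entry above
  - name: arxiv_e2e_pipeline_xenna
    script: arxiv_e2e_pipeline_benchmark.py
    args: --executor=xenna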

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

  • benchmarking/scripts/utils.py — Adds a helper function for aggregating stage performance statistics by matching stage name prefixes
  • benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py — Complete E2E benchmark script with custom stages for local tar file processing, a comprehensive filtering pipeline, and configurable execution modes
  • benchmarking/nightly-benchmark.yaml — Adds dataset configurations and two benchmark entries for testing the ArXiv pipeline with different executors

Signed-off-by: Vibhu Jawa <[email protected]>
@VibhuJawa requested a review from Copilot January 22, 2026 21:47
@VibhuJawa changed the title from [WIP] Benchmarking Script for E2E Pipeline to [REVIEW] Benchmarking Script for E2E Pipeline on Jan 22, 2026
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Contributor

@greptile-apps bot left a comment

No files reviewed, no comments
