[REVIEW] Benchmarking Script for E2E Pipeline #1389
VibhuJawa merged 19 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
benchmarking/local-cc-e2e.yaml (outdated)
```yaml
- name: cc_e2e_pipeline_local
  enabled: true
  script: cc_e2e_pipeline_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --fasttext_model_path=/raid/vjawa/models/lid.176.bin
    --download_path={session_entry_dir}/scratch/downloads
    --output_path={session_entry_dir}/scratch/output
    --snapshot=2024-30
    --url_limit=1
    --record_limit=100
    --executor=ray_data
  timeout_s: 3600
  ray:
    num_cpus: 4
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480
```
- Instead of a new file, can you add this to nightly and then run it with `--entries cc_e2e_pipeline_local`? We probably should also have more CPUs for benchmarking.
- With `url_limit=1` we only download one file, so if that download takes 10 minutes the other N-1 CPUs just sit idle. We should probably set `url_limit = K * num_cpus` to test parallelism and load balancing.
- For `record_limit`, wdyt of removing it? We know there is a backpressure issue in Ray Data when the stages are actors, so we can test for that later. IIRC when `record_limit` is higher, iterate is slower than download; I'm hoping that when we combine iterate and extract in the future we'll see perf benefits.
- For `fasttext-model-path` and `hf-home`, could we use something in the datasets path itself? wdyt?
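The `url_limit = K * num_cpus` sizing rule above can be sketched as a tiny helper (the function name and the default `K` are hypothetical, not from the PR):

```python
def choose_url_limit(num_cpus: int, k: int = 4) -> int:
    """Pick a url_limit so every CPU has work queued.

    With url_limit=1, one worker downloads while the other
    num_cpus - 1 sit idle; K files per CPU keeps all workers
    busy and actually exercises load balancing.
    """
    return k * num_cpus


# e.g. for the 4-CPU local config above:
print(choose_url_limit(4))  # -> 16
```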
There was a problem hiding this comment.
Addressed the review:
- Added entries like `arxiv_e2e_pipeline_*`.
- Moved to a locally downloaded tar dump (to simulate the workflow better).
- We now run on 45 downloaded tar files; we can increase this in a follow-up PR for larger-scale testing as needed.
- Removed `record_limit`.
- Added the paths; let me know if that works.
benchmarking/local-cc-e2e.yaml (outdated)
```yaml
num_cpus: 4
num_gpus: 0
enable_object_spilling: false
object_store_size_bytes: 11474836480
```
Can we add more metrics to track, such as the number of rows at the beginning and the number of rows at the end (`exact_value`), and then probably something for throughput (`min_value`)?
Added more metrics like `num_tar_files`, `num_input_documents`, and `num_output_documents`. For now I have not added throughput as a requirement. Please take a look (I have not run this on the benchmarking machine yet).
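A minimal sketch of how such row-count metrics could be collected around a pipeline run. The `run_pipeline` callable and this exact function are assumptions for illustration; only the metric names mirror the ones mentioned above, and throughput is included purely to illustrate a min-value-style check:

```python
import time


def collect_pipeline_metrics(run_pipeline, tar_paths, input_docs):
    """Run the pipeline once and record row counts plus an illustrative throughput."""
    start = time.perf_counter()
    output_docs = run_pipeline(input_docs)
    elapsed_s = time.perf_counter() - start
    return {
        "num_tar_files": len(tar_paths),           # checked as an exact value
        "num_input_documents": len(input_docs),    # checked as an exact value
        "num_output_documents": len(output_docs),  # checked as an exact value
        # Throughput varies per machine, so it would be a min-value check.
        "docs_per_second": len(output_docs) / max(elapsed_s, 1e-9),
    }
```

Exact-value checks on the document counts catch silent row loss between stages, while a throughput floor only flags gross regressions.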
Pull request overview
This pull request adds a comprehensive end-to-end benchmarking script for the ArXiv text processing pipeline, designed for nightly benchmarking with support for both local tar file processing and S3 downloading modes.
Changes:
- Added a `get_aggregated_stage_stats` utility function to `benchmarking/scripts/utils.py` for extracting aggregated performance metrics from pipeline results
- Created `arxiv_e2e_pipeline_benchmark.py` with a full E2E ArXiv processing pipeline including extraction, heuristic filters, quality classifiers, and configurable output formats
- Added two new dataset configurations (`arxiv_downloads` and `fasttext_model`) and two benchmark entries (`arxiv_e2e_pipeline_raydata` and `arxiv_e2e_pipeline_xenna`) to `nightly-benchmark.yaml` for testing both Ray Data and Xenna executors
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| benchmarking/scripts/utils.py | Adds helper function for aggregating stage performance statistics by matching stage name prefixes |
| benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py | Complete E2E benchmark script with custom stages for local tar file processing, comprehensive filtering pipeline, and configurable execution modes |
| benchmarking/nightly-benchmark.yaml | Adds dataset configurations and two benchmark entries for testing the ArXiv pipeline with different executors |
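As a rough illustration of the prefix-matching aggregation described for `utils.py`, a sketch might look like the following. The signature, input shape, and metric names are assumptions; the actual helper in the PR may differ:

```python
def get_aggregated_stage_stats(stage_stats: dict, prefix: str) -> dict:
    """Sum numeric metrics across all stages whose name starts with `prefix`.

    Pipeline stages often appear as replicated instances (e.g. "download_0",
    "download_1"), so aggregating by name prefix rolls them up per logical stage.
    """
    agg: dict = {}
    for name, stats in stage_stats.items():
        if not name.startswith(prefix):
            continue
        for key, value in stats.items():
            if isinstance(value, (int, float)):
                agg[key] = agg.get(key, 0) + value
    return agg


# Hypothetical per-stage stats rolled up for the "download" stage:
stats = {
    "download_0": {"time_s": 2.0, "rows": 10},
    "download_1": {"time_s": 3.0, "rows": 5},
    "extract_0": {"time_s": 1.0, "rows": 15},
}
print(get_aggregated_stage_stats(stats, "download"))  # -> {'time_s': 5.0, 'rows': 15}
```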
```diff
@@ -0,0 +1,500 @@
+ # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
Copyright year should be 2025, not 2026
Suggested change:

```diff
- # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
This pull request adds a new configuration file for running local end-to-end benchmarking of the Common Crawl pipeline. The configuration includes local paths, global settings, and a single enabled entry for executing the pipeline with specific arguments and resource allocations.
Configuration for local benchmarking:
- Added `benchmarking/local-cc-e2e.yaml` to define local paths for results, datasets, and models, as well as global settings like the default timeout and scratch deletion.
- Added a single enabled entry (`cc_e2e_pipeline_local`) with script arguments, resource limits (CPUs, GPUs, object store size), and Ray-specific settings for local execution.