feat: enable progress tracking in status CLI (EVAL-377) by gchlebus · Pull Request #780 · NVIDIA-NeMo/Evaluator

gchlebus · 2026-02-27T14:01:18Z

Summary

Enable progress tracking (as # of processed requests) in the nel status CLI command (EVAL-377). The infrastructure already existed but was disabled as WIP. This PR unhides it, fixes accuracy issues, and switches from percentages to raw request counts.

Problem

Progress was hidden — the status command stripped the progress field from output (WIP guard).
Percentage-based progress was broken for meta-tasks — MMLU has 57 subtasks, so limit_samples=20 produced 1140 requests against a denominator of 20, showing 5700%.
Retry and cache inflation — failed requests (non-200) and cached replays in auto-chained Slurm jobs inflated the count.
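The MMLU figure above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the variable names are hypothetical and the numbers are the ones quoted in the description, not values read from the launcher.

```python
# Numbers quoted in the PR description for the MMLU example.
num_subtasks = 57     # MMLU subtasks
limit_samples = 20    # per-subtask sample limit

# Each subtask issues limit_samples requests, so the numerator grows with the subtask count.
requests_sent = num_subtasks * limit_samples

# The old denominator used limit_samples alone and ignored the subtask fan-out.
assumed_dataset_size = limit_samples

old_progress = requests_sent / assumed_dataset_size
print(requests_sent, f"{old_progress:.0%}")
```

A raw request count avoids the problem because it needs no denominator at all.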

Solution

Display raw request count instead of percentages. This works universally across single-turn, multi-turn, meta-tasks (MMLU), and n-repeat evaluations without needing to compute dataset sizes.

Changes

CLI (`status.py`)

Add "Requests Processed" column to the status table (between Status and executor info)
Include progress in JSON output (removed pop("progress") WIP guard)
Format: raw integer or "-" for unknown/missing
Rename _format_progress → _format_requests_processed
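The formatting rule listed above (raw integer, or "-" when the value is unknown or missing) could look roughly like the sketch below. The function name and the dict-unwrapping step are assumptions based on the commit messages in this PR; the actual `_format_requests_processed` in `status.py` may differ.

```python
from typing import Any


def format_requests_processed(progress: Any) -> str:
    """Return the table cell text for a job's progress value (hypothetical helper)."""
    # One commit notes that executors may return {"progress": <value>}; unwrap it first.
    if isinstance(progress, dict):
        progress = progress.get("progress")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(progress, int) and not isinstance(progress, bool) and progress >= 0:
        return str(progress)
    # Anything else (None, floats, strings, negative values) is shown as unknown.
    return "-"
```

For example, `format_requests_processed({"progress": 800})` yields `"800"`, while `format_requests_processed(None)` yields `"-"`.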

Progress Tracking Interceptor (`progress_tracking_interceptor.py`)

Skip cache_hit responses — prevents double-counting in auto-chained Slurm jobs where cached responses are replayed
Skip non-200 responses — prevents retry inflation (failed attempts return non-200)
Rename log fields from samples_processed → requests_processed for clarity
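The two skip rules above amount to a small predicate. The following is a minimal sketch, assuming a response object that exposes a status code and a cache-hit flag; the `Response` class and `should_count` function are hypothetical and are not the interceptor's real types.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical stand-in for the response seen by the interceptor."""
    status_code: int
    cache_hit: bool = False


def should_count(response: Response) -> bool:
    # Cached replays were already counted when they were first served.
    if response.cache_hit:
        return False
    # Failed attempts (non-200) are expected to be retried, so they are not counted.
    return response.status_code == 200


responses = [
    Response(200),
    Response(500),                  # failed attempt, skipped
    Response(200),
    Response(200, cache_hit=True),  # cached replay, skipped
    Response(429),                  # failed attempt, skipped
]
requests_processed = sum(should_count(r) for r in responses)
print(requests_processed)
```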

Executors (local + slurm)

Simplify _get_progress() to return Optional[int] (raw request count) instead of Optional[float] (ratio)
_get_progress no longer calls _get_dataset_size / parses run_config.yml (functions kept for future reuse)
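With the ratio gone, getting the progress reduces to reading an integer. The sketch below assumes the count is stored as a single-line text file, as the commit messages suggest; the function name and file handling are hypothetical and do not reproduce the executors' `_get_progress()`.

```python
from pathlib import Path
from typing import Optional


def read_request_count(progress_file: Path) -> Optional[int]:
    """Return the raw request count from a progress file, or None if unavailable."""
    try:
        text = progress_file.read_text().strip()
    except OSError:
        # Missing or unreadable file: progress is unknown, not an error.
        return None
    try:
        count = int(text)
    except ValueError:
        # Empty or malformed content is also treated as unknown.
        return None
    return count if count >= 0 else None
```

Returning `None` instead of raising keeps a progress lookup failure from affecting the reported job status.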

Breaking Changes

status --json output: progress field is now an int (request count) instead of float (0.0–1.0 ratio). Confirm no downstream consumers depend on the old format.
Log field rename: samples_processed → requests_processed in structured logs from the progress tracking interceptor.
Progress tracking HTTP updates (POSTs to progress_tracking_url): The values reported are now lower/more accurate. Previously the counter incremented on every response including cache hits and failed retries; now it only counts successful (HTTP 200), non-cached responses. The POST mechanism and payload format are unchanged — only the reported numbers differ.
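A downstream consumer that may see output from both before and after this change could branch on the JSON type of the field. This is a hedged sketch: the payloads below are hypothetical one-field records, not the real `status --json` schema, and `describe_progress` is an illustrative name.

```python
import json
from typing import Any


def describe_progress(value: Any) -> str:
    """Interpret a progress value from either the old or the new output format."""
    if value is None or isinstance(value, bool):
        return "unknown"
    if isinstance(value, int):
        # New format: raw count of processed requests.
        return f"{value} requests"
    if isinstance(value, float):
        # Old format: ratio in the range 0.0 to 1.0.
        return f"{value:.1%} (legacy ratio)"
    return "unknown"


# Hypothetical records; json.loads maps 800 to int and 0.8 to float.
new_record = json.loads('{"progress": 800}')
old_record = json.loads('{"progress": 0.8}')
print(describe_progress(new_record["progress"]))
print(describe_progress(old_record["progress"]))
```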

copy-pr-bot · 2026-02-27T14:01:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

gchlebus · 2026-03-03T21:15:40Z

/ok to test dc68665

gchlebus · 2026-03-03T21:52:39Z

/ok to test d0fb481

Un-hide progress field that was disabled as WIP:
- Add 'Progress' column to table output between Status and executor info
- Include progress in JSON output (removed pop('progress'))
- Format progress: float -> '75.3%', int -> '1234 samples', unknown/None -> '-'
- Add unit tests for progress formatting logic

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>

The executor returns progress as {'progress': <value>} but _format_progress expects a bare float/int. Unwrap the dict before formatting so the progress column correctly shows percentages (e.g. 80.0%) instead of '-' for running and completed jobs. Add test coverage for dict-wrapped progress extraction.

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>

_read_files_from_remote flattened all newlines to spaces before extracting file contents via regex. This broke yaml.safe_load() on run_config.yml (which contains Jinja2 command templates), causing _get_progress() to throw and the catch-all exception handler to report all jobs as FAILED even when they completed successfully. Replace the newline-flattening + flat regex with re.DOTALL matching that preserves file content intact. Handle files both with and without trailing newlines.

EVAL-377
Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>

Progress tracking interceptor:
- Skip cache_hit responses (avoids double-counting in auto-chained Slurm jobs where cached responses are replayed)
- Skip non-200 responses (avoids inflating count from retries)
- Rename log fields to 'requests_processed' for clarity

Launcher (both local and slurm executors):
- Simplify _get_progress to return raw int request count
- Remove run_config.yml parsing and dataset_size computation (eliminates YAML parse failures from Jinja2 templates and incorrect percentages for meta-tasks like MMLU)
- Remove unused get_eval_factory_dataset_size_from_run_config imports
- Display format changed from 'N samples' / 'X.Y%' to 'N requests'

EVAL-377
Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>

…verage
- Fix test_local_executor.py: assertions now expect raw int instead of float ratio; remove dataset_size from fixture; replace dead test_get_status_progress_without_dataset_size with direct assertion
- Fix test_slurm_executor.py: _query_slurm_for_status_and_progress mocks now use int values (800, 400) instead of floats (0.8, 0.4)
- Add _read_files_from_remote tests for empty files and mixed trailing-newline scenarios
- Add post_eval_hook test for zero-progress (all-cached) scenario covering auto-chained Slurm job behavior
- Document log field rename (samples_processed → requests_processed) in interceptor source

EVAL-377
Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>

Revert the _read_files_from_remote regex change: since run_config.yml parsing was removed, all current callers only read single-line files (progress integers, job IDs) where newline flattening is harmless. Keeps the MR scope tight.

Add test_resume_with_cache_hits_and_new_requests: the core auto-chain scenario where a resumed interceptor pre-loads progress from file, replays cached responses (skipped), then processes new requests. Verifies counter goes 42 → 42 (cache replay) → 50 (8 new), not 42 → 84 → 92 (double-counting).

EVAL-377
Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
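The resume scenario described in this commit can be traced with a toy counter. This is a sketch under stated assumptions: `RequestCounter` is a hypothetical class, not the interceptor or the actual test, and the initial value stands in for progress pre-loaded from file.

```python
class RequestCounter:
    """Hypothetical counter that mimics the skip rules on resume."""

    def __init__(self, initial: int = 0) -> None:
        # On resume, the count starts from the previously saved progress.
        self.requests_processed = initial

    def record(self, status_code: int, cache_hit: bool) -> int:
        # Count only successful, non-cached responses.
        if not cache_hit and status_code == 200:
            self.requests_processed += 1
        return self.requests_processed


counter = RequestCounter(initial=42)

# Replay of the 42 cached responses from the previous job: none are counted.
for _ in range(42):
    counter.record(200, cache_hit=True)
after_replay = counter.requests_processed

# Eight genuinely new requests are counted.
for _ in range(8):
    counter.record(200, cache_hit=False)
after_new = counter.requests_processed

print(after_replay, after_new)
```

Without the cache-hit check, the replay alone would have pushed the counter from 42 to 84.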


gchlebus · 2026-03-04T07:18:35Z

/ok to test a13f6f3

gchlebus requested review from a team as code owners February 27, 2026 14:01

github-actions bot added nemo-evaluator-launcher tests labels Feb 27, 2026

gchlebus force-pushed the gchlebus/feat/eval-377-progress-tracking branch 4 times, most recently from e9eb0fe to c18961f March 3, 2026 20:14

github-actions bot added the nemo-evaluator label Mar 3, 2026

gchlebus force-pushed the gchlebus/feat/eval-377-progress-tracking branch from df2f1dd to d0fb481 March 3, 2026 21:51

copy-pr-bot bot temporarily deployed to test March 3, 2026 21:54 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 3, 2026 21:54 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 3, 2026 21:56 Inactive

gchlebus added 11 commits March 4, 2026 08:14

rename Progress column to Requests Processed

88a95f7

Clearer label for raw request count — reserves 'Progress' for a future percentage-based column once denominators are available. EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>

rename _format_progress to _format_requests_processed

30039e4

Align method name with the column header rename. EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>

keep _get_dataset_size and dataset_size imports for future reuse

7698ad1

EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>

slim down status CLI tests to essential set

6c5eef4

EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>

fix: remove unused imports flagged by ruff

a13f6f3

EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>

gchlebus force-pushed the gchlebus/feat/eval-377-progress-tracking branch from d0fb481 to a13f6f3 March 4, 2026 07:18

copy-pr-bot bot deployed to test March 4, 2026 07:19 Active

copy-pr-bot bot deployed to nemo-ci March 4, 2026 07:20 Active

copy-pr-bot bot temporarily deployed to nemo-ci March 4, 2026 07:20 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 4, 2026 07:22 Inactive

gchlebus wants to merge 11 commits into main from gchlebus/feat/eval-377-progress-tracking
