feat: enable progress tracking in status CLI (EVAL-377)#780

Open
gchlebus wants to merge 11 commits into main from gchlebus/feat/eval-377-progress-tracking

@gchlebus gchlebus commented Feb 27, 2026

Summary

Enable progress tracking (as # of processed requests) in the nel status CLI command (EVAL-377). The infrastructure already existed but was disabled as WIP. This PR unhides it, fixes accuracy issues, and switches from percentages to raw request counts.

[Image: Screenshot 2026-03-03 at 22 15 49]

Problem

  1. Progress was hidden — the status command stripped the progress field from output (WIP guard).
  2. Percentage-based progress was broken for meta-tasks — MMLU has 57 subtasks, so limit_samples=20 produced 1140 requests against a denominator of 20, showing 5700%.
  3. Retry and cache inflation — failed requests (non-200) and cached replays in auto-chained Slurm jobs inflated the count.
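The arithmetic behind the 5700% figure in item 2 can be reproduced in a few lines. This is only an illustration of the numbers quoted above; `naive_progress_percent` is a hypothetical helper, not a function from the codebase.

```python
def naive_progress_percent(requests_seen: int, limit_samples: int) -> float:
    """Percentage computed against the per-task sample limit (the broken denominator)."""
    return 100.0 * requests_seen / limit_samples


subtasks = 57        # MMLU subtasks, as stated in the description
limit_samples = 20   # samples requested per subtask
requests_seen = subtasks * limit_samples  # every subtask issues its own requests

percent = naive_progress_percent(requests_seen, limit_samples)
print(requests_seen, percent)
```

The denominator only accounts for one subtask, so the ratio grows with the number of subtasks.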

Solution

Display raw request count instead of percentages. This works universally across single-turn, multi-turn, meta-tasks (MMLU), and n-repeat evaluations without needing to compute dataset sizes.

Changes

CLI (status.py)

  • Add "Requests Processed" column to the status table (between Status and executor info)
  • Include progress in JSON output (removed pop("progress") WIP guard)
  • Format: raw integer or "-" for unknown/missing
  • Rename _format_progress → _format_requests_processed
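A minimal sketch of the formatting rule described in this list, assuming the value may arrive either bare or wrapped as `{"progress": value}` (as one of the commit messages mentions). The function name and exact behaviour here are illustrative, not the code in status.py.

```python
from typing import Any


def format_requests_processed(progress: Any) -> str:
    """Render a request count for the status table, or "-" when it is unknown."""
    # Unwrap the executor-style {"progress": value} dict if present.
    if isinstance(progress, dict):
        progress = progress.get("progress")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(progress, bool) or not isinstance(progress, int):
        return "-"
    return str(progress)


print(format_requests_processed(1234))
print(format_requests_processed({"progress": 800}))
print(format_requests_processed(None))
```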

Progress Tracking Interceptor (progress_tracking_interceptor.py)

  • Skip cache_hit responses — prevents double-counting in auto-chained Slurm jobs where cached responses are replayed
  • Skip non-200 responses — prevents retry inflation (failed attempts return non-200)
  • Rename log fields from samples_processed → requests_processed for clarity
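The counting rule can be sketched as follows. `Response` and `RequestCounter` are hypothetical stand-ins for the interceptor's real types, which are not shown in this PR description; only the skip conditions (cache hits and non-200 responses) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical response shape: an HTTP status plus a cache-hit flag."""
    status_code: int
    cache_hit: bool = False


class RequestCounter:
    """Counts only successful, non-cached responses."""

    def __init__(self, initial: int = 0) -> None:
        # A resumed job would pre-load the count saved by the previous job.
        self.requests_processed = initial

    def observe(self, response: Response) -> int:
        if response.cache_hit:
            return self.requests_processed  # replayed from cache: already counted
        if response.status_code != 200:
            return self.requests_processed  # failed attempt: a retry will follow
        self.requests_processed += 1
        return self.requests_processed


# Resume scenario: 42 requests already counted, 42 cache replays, then 8 new requests.
counter = RequestCounter(initial=42)
for _ in range(42):
    counter.observe(Response(status_code=200, cache_hit=True))
counter.observe(Response(status_code=500))  # failed attempt, not counted
for _ in range(8):
    counter.observe(Response(status_code=200))
print(counter.requests_processed)
```

Without the two skip conditions, the same sequence would count the cache replays and the failed attempt as well.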

Executors (local + slurm)

  • Simplify _get_progress() to return Optional[int] (raw request count) instead of Optional[float] (ratio)
  • _get_progress no longer calls _get_dataset_size / parses run_config.yml (functions kept for future reuse)
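A rough sketch of what returning a raw count can look like, assuming the count is stored as a single integer in a text file. The file name and the helper `read_request_count` are assumptions made for illustration; the executors' actual `_get_progress()` may differ.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_request_count(progress_file: Path) -> Optional[int]:
    """Return the stored request count, or None if it is missing or unreadable."""
    try:
        text = progress_file.read_text().strip()
    except OSError:
        return None  # file not written yet, or not accessible
    try:
        return int(text)
    except ValueError:
        return None  # empty or malformed content


with tempfile.TemporaryDirectory() as tmp:
    progress_path = Path(tmp) / "progress"  # hypothetical file name
    progress_path.write_text("800\n")
    count = read_request_count(progress_path)
    missing = read_request_count(Path(tmp) / "absent")

print(count, missing)
```

No dataset size or run configuration is needed for this, which is the simplification the bullets above describe.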

Breaking Changes

  • status --json output: progress field is now an int (request count) instead of float (0.0–1.0 ratio). Confirm no downstream consumers depend on the old format.
  • Log field rename: samples_processed → requests_processed in structured logs from the progress tracking interceptor.
  • Progress tracking HTTP updates (POSTs to progress_tracking_url): The values reported are now lower/more accurate. Previously the counter incremented on every response including cache hits and failed retries; now it only counts successful (HTTP 200), non-cached responses. The POST mechanism and payload format are unchanged — only the reported numbers differ.
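For downstream consumers of the JSON output, one possible way to tell the two formats apart is by type, since a JSON integer and a JSON float decode to different Python types. The payload shape below (a bare `progress` key) is an assumption for illustration; the real `status --json` structure should be checked before relying on this.

```python
import json
from typing import Optional, Union


def interpret_progress(value: Union[int, float, None]) -> Optional[str]:
    """Describe a progress value, distinguishing the new int from the old float ratio."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, int):
        return f"{value} requests"            # new format: raw request count
    if isinstance(value, float):
        return f"{value:.1%} (legacy ratio)"  # old format: 0.0-1.0 ratio
    return None


new_payload = json.loads('{"progress": 1140}')  # hypothetical payload shape
old_payload = json.loads('{"progress": 0.8}')

print(interpret_progress(new_payload["progress"]))
print(interpret_progress(old_payload["progress"]))
```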

@gchlebus gchlebus requested review from a team as code owners February 27, 2026 14:01

copy-pr-bot bot commented Feb 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@gchlebus gchlebus force-pushed the gchlebus/feat/eval-377-progress-tracking branch 4 times, most recently from e9eb0fe to c18961f Compare March 3, 2026 20:14

gchlebus commented Mar 3, 2026

/ok to test dc68665

@gchlebus gchlebus force-pushed the gchlebus/feat/eval-377-progress-tracking branch from df2f1dd to d0fb481 Compare March 3, 2026 21:51

gchlebus commented Mar 3, 2026

/ok to test d0fb481

gchlebus added 11 commits March 4, 2026 08:14
Un-hide progress field that was disabled as WIP:
- Add 'Progress' column to table output between Status and executor info
- Include progress in JSON output (removed pop('progress'))
- Format progress: float -> '75.3%', int -> '1234 samples', unknown/None -> '-'
- Add unit tests for progress formatting logic

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
The executor returns progress as {'progress': <value>} but _format_progress
expects a bare float/int. Unwrap the dict before formatting so the progress
column correctly shows percentages (e.g. 80.0%) instead of '-' for running
and completed jobs.

Add test coverage for dict-wrapped progress extraction.

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
The function flattened all newlines to spaces before extracting file
contents via regex. This broke yaml.safe_load() on run_config.yml
(which contains Jinja2 command templates), causing _get_progress()
to throw and the catch-all exception handler to report all jobs as
FAILED even when they completed successfully.

Replace the newline-flattening + flat regex with re.DOTALL matching
that preserves file content intact. Handle files both with and
without trailing newlines.

EVAL-377

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
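The difference between the two extraction strategies in the commit message above can be illustrated with a small sketch. The `__FILE_START__` / `__FILE_END__` markers are hypothetical; the real framing used by `_read_files_from_remote` is not shown in this PR.

```python
import re

START = "__FILE_START__"  # hypothetical delimiter, for illustration only
END = "__FILE_END__"      # hypothetical delimiter, for illustration only


def extract_flattened(raw: str) -> list[str]:
    """Old approach: flatten newlines first, which destroys multi-line content."""
    flat = raw.replace("\n", " ")
    pattern = re.compile(rf"{re.escape(START)} ?(.*?) ?{re.escape(END)}")
    return pattern.findall(flat)


def extract_dotall(raw: str) -> list[str]:
    """DOTALL approach: '.' also matches newlines, so content stays intact."""
    pattern = re.compile(
        rf"{re.escape(START)}\n(.*?)\n?{re.escape(END)}", re.DOTALL
    )
    return pattern.findall(raw)


# Two files: a multi-line YAML-like file ending in a newline, and a
# single-line file without a trailing newline.
raw = f"{START}\na: 1\nb: 2\n{END}\n{START}\n42{END}\n"

print(extract_flattened(raw))
print(extract_dotall(raw))
```

With flattening, the two YAML lines collapse into one line and no longer parse as the original mapping; the DOTALL variant returns them unchanged and handles both trailing-newline cases.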
Progress tracking interceptor:
- Skip cache_hit responses (avoids double-counting in auto-chained
  Slurm jobs where cached responses are replayed)
- Skip non-200 responses (avoids inflating count from retries)
- Rename log fields to 'requests_processed' for clarity

Launcher (both local and slurm executors):
- Simplify _get_progress to return raw int request count
- Remove run_config.yml parsing and dataset_size computation
  (eliminates YAML parse failures from Jinja2 templates and
  incorrect percentages for meta-tasks like MMLU)
- Remove unused get_eval_factory_dataset_size_from_run_config imports
- Display format changed from 'N samples' / 'X.Y%' to 'N requests'

EVAL-377

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
…verage

- Fix test_local_executor.py: assertions now expect raw int instead
  of float ratio; remove dataset_size from fixture; replace dead
  test_get_status_progress_without_dataset_size with direct assertion
- Fix test_slurm_executor.py: _query_slurm_for_status_and_progress
  mocks now use int values (800, 400) instead of floats (0.8, 0.4)
- Add _read_files_from_remote tests for empty files and mixed
  trailing-newline scenarios
- Add post_eval_hook test for zero-progress (all-cached) scenario
  covering auto-chained Slurm job behavior
- Document log field rename (samples_processed → requests_processed)
  in interceptor source

EVAL-377

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Revert the _read_files_from_remote regex change — since run_config.yml
parsing was removed, all current callers only read single-line files
(progress integers, job IDs) where newline flattening is harmless.
Keeps the MR scope tight.

Add test_resume_with_cache_hits_and_new_requests: the core auto-chain
scenario where a resumed interceptor pre-loads progress from file,
replays cached responses (skipped), then processes new requests.
Verifies counter goes 42 → 42 (cache replay) → 50 (8 new), not
42 → 84 → 92 (double-counting).

EVAL-377

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Clearer label for raw request count — reserves 'Progress' for a
future percentage-based column once denominators are available.

EVAL-377

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Align method name with the column header rename.

EVAL-377

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
EVAL-377

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
EVAL-377

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
EVAL-377

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
@gchlebus gchlebus force-pushed the gchlebus/feat/eval-377-progress-tracking branch from d0fb481 to a13f6f3 Compare March 4, 2026 07:18

gchlebus commented Mar 4, 2026

/ok to test a13f6f3
