feat: enable progress tracking in status CLI (EVAL-377)#780
Open
feat: enable progress tracking in status CLI (EVAL-377)#780
Conversation
e9eb0fe to
c18961f
Compare
Contributor
Author
|
/ok to test dc68665 |
df2f1dd to
d0fb481
Compare
Contributor
Author
|
/ok to test d0fb481 |
Un-hide progress field that was disabled as WIP:
- Add 'Progress' column to table output between Status and executor info
- Include progress in JSON output (removed pop('progress'))
- Format progress: float -> '75.3%', int -> '1234 samples', unknown/None -> '-'
- Add unit tests for progress formatting logic
Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
The executor returns progress as {'progress': <value>} but _format_progress
expects a bare float/int. Unwrap the dict before formatting so the progress
column correctly shows percentages (e.g. 80.0%) instead of '-' for running
and completed jobs.
Add test coverage for dict-wrapped progress extraction.
Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
The function flattened all newlines to spaces before extracting file contents via regex. This broke yaml.safe_load() on run_config.yml (which contains Jinja2 command templates), causing _get_progress() to throw and the catch-all exception handler to report all jobs as FAILED even when they completed successfully. Replace the newline-flattening + flat regex with re.DOTALL matching that preserves file content intact. Handle files both with and without trailing newlines. EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Progress tracking interceptor: - Skip cache_hit responses (avoids double-counting in auto-chained Slurm jobs where cached responses are replayed) - Skip non-200 responses (avoids inflating count from retries) - Rename log fields to 'requests_processed' for clarity Launcher (both local and slurm executors): - Simplify _get_progress to return raw int request count - Remove run_config.yml parsing and dataset_size computation (eliminates YAML parse failures from Jinja2 templates and incorrect percentages for meta-tasks like MMLU) - Remove unused get_eval_factory_dataset_size_from_run_config imports - Display format changed from 'N samples' / 'X.Y%' to 'N requests' EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
…verage - Fix test_local_executor.py: assertions now expect raw int instead of float ratio; remove dataset_size from fixture; replace dead test_get_status_progress_without_dataset_size with direct assertion - Fix test_slurm_executor.py: _query_slurm_for_status_and_progress mocks now use int values (800, 400) instead of floats (0.8, 0.4) - Add _read_files_from_remote tests for empty files and mixed trailing-newline scenarios - Add post_eval_hook test for zero-progress (all-cached) scenario covering auto-chained Slurm job behavior - Document log field rename (samples_processed → requests_processed) in interceptor source EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Revert the _read_files_from_remote regex change — since run_config.yml parsing was removed, all current callers only read single-line files (progress integers, job IDs) where newline flattening is harmless. Keeps the MR scope tight. Add test_resume_with_cache_hits_and_new_requests: the core auto-chain scenario where a resumed interceptor pre-loads progress from file, replays cached responses (skipped), then processes new requests. Verifies counter goes 42 → 42 (cache replay) → 50 (8 new), not 42 → 84 → 92 (double-counting). EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Clearer label for raw request count — reserves 'Progress' for a future percentage-based column once denominators are available. EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Align method name with the column header rename. EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
EVAL-377 Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
d0fb481 to
a13f6f3
Compare
Contributor
Author
|
/ok to test a13f6f3 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Enable progress tracking (as # of processed requests) in the
nel statusCLI command (EVAL-377). The infrastructure already existed but was disabled as WIP. This PR unhides it, fixes accuracy issues, and switches from percentages to raw request counts.Problem
statuscommand stripped theprogressfield from output (WIP guard).limit_samples=20produced 1140 requests against a denominator of 20, showing 5700%.Solution
Display raw request count instead of percentages. This works universally across single-turn, multi-turn, meta-tasks (MMLU), and n-repeat evaluations without needing to compute dataset sizes.
Changes
CLI (
status.py)pop("progress")WIP guard)"-"for unknown/missing_format_progress→_format_requests_processedProgress Tracking Interceptor (
progress_tracking_interceptor.py)cache_hitresponses — prevents double-counting in auto-chained Slurm jobs where cached responses are replayedsamples_processed→requests_processedfor clarityExecutors (local + slurm)
_get_progress()to returnOptional[int](raw request count) instead ofOptional[float](ratio)_get_progressno longer calls_get_dataset_size/ parsesrun_config.yml(functions kept for future reuse)Breaking Changes
status --jsonoutput:progressfield is now anint(request count) instead offloat(0.0–1.0 ratio). Confirm no downstream consumers depend on the old format.samples_processed→requests_processedin structured logs from the progress tracking interceptor.progress_tracking_url): The values reported are now lower/more accurate. Previously the counter incremented on every response including cache hits and failed retries; now it only counts successful (HTTP 200), non-cached responses. The POST mechanism and payload format are unchanged — only the reported numbers differ.