tests/e2e/test_eval_e2e.py::TestEvalPerTask::test_image_feature_extraction (added in #786) fails on both QNN and OpenVINO with knn_top5_accuracy=20.0, below the asserted floor. Investigation shows this is a test-design problem, not a product bug — the eval pipeline, evaluator, exporter, quantization, and EPs are all correct. The test asserts a kNN accuracy magnitude that is (a) unreachable even unquantized and (b) statistically meaningless at --samples 10.
The EP is innocent: QNN and OpenVINO both produce the identical knn_top5_accuracy of 20. Quantization itself is innocent as well.
--samples 10 is the entire cause. For a 100-class dataset, 10 samples gives ~0.1 reference images per class, so leave-one-out kNN is meaningless: even unquantized FP32 scores 0/0, and the quantized 20 is noise (a stable metric would not swing 0→20 from a ~1 pt quality change).
The 30/60 floor was never reachable as a pair: the best case at adequate samples is ~37/52 (top1/top5), so top1 can clear 30 but top5 stays below 60 even unquantized.
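The sample-count arithmetic behind these findings can be sketched as follows. This is an illustrative calculation, not code from the repository: the function names are hypothetical, and the last function assumes class labels are drawn independently and uniformly, which may not match how the dataset is actually shuffled.

```python
def images_per_class(num_samples: int, num_classes: int) -> float:
    """Average number of images available per class."""
    return num_samples / num_classes


def accuracy_step(num_samples: int) -> float:
    """Smallest possible change in an accuracy percentage (one query flipping)."""
    return 100.0 / num_samples


def prob_query_has_same_class_reference(num_samples: int, num_classes: int) -> float:
    """Chance that a leave-one-out query has any same-class image to retrieve.

    Assumption: labels are independent and uniform over the classes.
    """
    return 1.0 - (1.0 - 1.0 / num_classes) ** (num_samples - 1)


if __name__ == "__main__":
    for n in (10, 100):
        print(
            f"samples={n}: "
            f"{images_per_class(n, 100):.2f} images/class, "
            f"step={accuracy_step(n):.1f} pts, "
            f"P(same-class reference exists)="
            f"{prob_query_has_same_class_reference(n, 100):.3f}"
        )
```

At 10 samples each query that flips moves the metric by 10 points, so a reading of 20 corresponds to just two queries, which is why it cannot be distinguished from noise.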
Not the cause (verified in code)
Evaluator extraction is correct. eval/image_feature_extraction_evaluator.py takes the CLS token (last_hidden_state[:, 0]); eval/metrics/knn_accuracy.py L2-normalizes before cosine kNN. For facebook/dinov2-small (register-free DINOv2) index 0 is genuinely the CLS token. No wrong-tensor / wrong-pooling / missing-normalization bug.
Export / quantization behave consistently across precisions (all ~52 at samples=100).
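For readers unfamiliar with the metric, the pipeline described above (CLS token, L2 normalization, cosine leave-one-out kNN) can be sketched in plain Python. This is a minimal illustration and not the code in eval/metrics/knn_accuracy.py: the function names are hypothetical, and it assumes a query counts as a top-k hit when any of its k nearest neighbors shares its label.

```python
import math


def cls_token(last_hidden_state_row):
    """Token 0 of one sample's [tokens][dim] hidden state."""
    return last_hidden_state_row[0]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def leave_one_out_knn_accuracy(embeddings, labels, k):
    """Percentage of queries with a same-label image among their k nearest.

    Each sample is used as a query against all other samples; similarity is
    the dot product of unit vectors, i.e. cosine similarity.
    """
    unit = [l2_normalize(e) for e in embeddings]
    hits = 0
    for i, query in enumerate(unit):
        scored = [
            (sum(a * b for a, b in zip(query, ref)), j)
            for j, ref in enumerate(unit)
            if j != i  # leave the query itself out
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        if any(labels[j] == labels[i] for _, j in scored[:k]):
            hits += 1
    return 100.0 * hits / len(unit)
```

Under this definition a query whose class has no other image in the sample set can never score a hit, whatever the embedding quality.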
Verdict
Not a defect in eval / model / quantization / EP. The test asserts an accuracy magnitude in a suite whose own docstring says it must not — "These tests do NOT assert metric magnitudes for accuracy regression — that's the suite under scripts/e2e_eval/run_eval.py." — at a sample count where the metric is pure noise.
Recommended fix
A (recommended): make it a true smoke test — keep --samples 10, drop the kNN magnitude assertion, assert only presence + finite + top1 <= top5. Accuracy regression for this task belongs in scripts/e2e_eval/run_eval.py.
B (alternative): bump this test to --samples 100 (where the metric is stable and EP/precision-independent at ~37/52) and assert a loose unconditional degenerate-guard (e.g. top1 >= 15, top5 >= 30). No EP-gating is needed because quantization is innocent. (Would want an NPU@100 baseline to confirm margins.)
Note: gating the floor to QNN (an early guess) is wrong — QNN also returns 20.
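Option A's checks (presence, finite, top1 <= top5) could look roughly like the helper below. It is a sketch under assumptions: the metrics are taken to arrive as a flat dict, knn_top1_accuracy is an assumed key name mirroring the knn_top5_accuracy key quoted above, and knn_smoke_problems is a hypothetical helper, not an existing function in the test suite.

```python
import math

# Assumed key names; only knn_top5_accuracy is quoted in the failure output.
TOP1_KEY = "knn_top1_accuracy"
TOP5_KEY = "knn_top5_accuracy"


def knn_smoke_problems(metrics: dict) -> list:
    """Return a list of smoke-check failures; an empty list means the checks pass.

    Checks presence, finiteness, and top1 <= top5. It deliberately does not
    check any accuracy magnitude.
    """
    problems = []
    for key in (TOP1_KEY, TOP5_KEY):
        if key not in metrics:
            problems.append(f"missing metric: {key}")
        elif not isinstance(metrics[key], (int, float)) or not math.isfinite(metrics[key]):
            problems.append(f"non-finite metric: {key}={metrics[key]!r}")
    if not problems and metrics[TOP1_KEY] > metrics[TOP5_KEY]:
        problems.append("top1 accuracy exceeds top5 accuracy")
    return problems
```

In a test this would be used as `assert knn_smoke_problems(metrics) == []`, which passes for the observed 0/20 result at --samples 10 while still catching a missing or NaN metric.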
Symptom
#786 originally asserted knn_top1 >= 30 and knn_top5 >= 60 unconditionally.
Root-cause decomposition
Setup for the measurements: facebook/dinov2-small, default dataset timm/mini-imagenet, shuffle seed=42, run via winml eval ... --device cpu --precision <p> --samples <n>.