e2e test_image_feature_extraction: kNN accuracy floor is unreachable & noise-level at samples=10 (quantization & EP innocent) #826

Description

@timenick

Summary

tests/e2e/test_eval_e2e.py::TestEvalPerTask::test_image_feature_extraction (added in #786) fails on both QNN and OpenVINO with knn_top5_accuracy=20.0, below the asserted floor. Investigation shows this is a test-design problem, not a product bug — the eval pipeline, evaluator, exporter, quantization, and EPs are all correct. The test asserts a kNN accuracy magnitude that is (a) unreachable even unquantized and (b) statistically meaningless at --samples 10.

Symptom

AssertionError: metric knn_top5_accuracy=20.0 outside expected range [..., 100.0]   # OpenVINO (junit-ov-eval.xml)
AssertionError: metric knn_top5_accuracy=20.0 outside expected range [..., 100.0]   # QNN      (junit-qnn-eval.xml)

#786 originally asserted knn_top1 >= 30, knn_top5 >= 60 unconditionally.

Root-cause decomposition

facebook/dinov2-small, default dataset timm/mini-imagenet, shuffle seed=42, via winml eval ... --device cpu --precision <p> --samples <n>:

| precision | samples | knn_top1 | knn_top5 |
|---|---|---|---|
| fp32 | 100 | 36 | 53 |
| int8 (w8a8) | 100 | 37 | 51 |
| w8a16 (NPU auto-precision) | 100 | 37 | 52 |
| fp32 | 10 | 0 | 0 |
| NPU QNN / OV (w8a16) | 10 | ~10–20 | 20 |

Findings

  1. Quantization is innocent, including W8A16. fp32 / int8 / w8a16 at samples=100 all land at ~36–37 / 51–53; W8A16 costs ~1 pt. (The W8A16 accuracy-loss pattern from #179, "Quantization causes big accuracy loss", does not apply here: DINOv2-small is not in #179's table and is empirically W8A16-robust.)
  2. EP is innocent. QNN and OpenVINO produce the identical knn_top5 of 20, so the failure is not specific to either EP.
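  3. --samples 10 is the entire cause. For a 100-class dataset, 10 samples give ~0.1 images per class, so a leave-one-out query will usually have no same-class reference at all and kNN accuracy is meaningless. At 10 samples each query is also worth 10 points, so a score of 20 is just two hits. Even unquantized FP32 scores 0/0, and the quantized 20 is noise (a stable metric would not swing 0→20 from a ~1 pt quality change).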
  4. The 30/60 floor was never reachable — best case at adequate samples is ~37/52, and top5 < 60 even unquantized.
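
The arithmetic behind finding 3 can be sketched as follows. This is a back-of-the-envelope model, not code from the repo: it assumes class labels are drawn uniformly and independently across the 100 classes, and both function names are hypothetical.

```python
def accuracy_step(num_samples: int) -> float:
    """Smallest possible change in an accuracy percentage: one query out of num_samples."""
    return 100.0 / num_samples


def loo_hit_ceiling(num_samples: int, num_classes: int = 100) -> float:
    """Upper bound on leave-one-out kNN accuracy (as a fraction).

    A query can only be a hit if at least one of the other num_samples - 1
    samples shares its class. ASSUMPTION: labels are uniform and independent
    over num_classes, so each other sample matches with probability 1/num_classes.
    """
    p_no_match = (1.0 - 1.0 / num_classes) ** (num_samples - 1)
    return 1.0 - p_no_match


if __name__ == "__main__":
    for n in (10, 100):
        print(
            f"samples={n}: step={accuracy_step(n):.1f} pts, "
            f"ceiling={100 * loo_hit_ceiling(n):.1f}%"
        )
```

Under that assumption only about 9% of queries can possibly hit at samples=10, against about 63% at samples=100, so even a perfect embedding would barely clear a top5 floor of 60 in this model.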

Not the cause (verified in code)

  • Evaluator extraction is correct. eval/image_feature_extraction_evaluator.py takes the CLS token (last_hidden_state[:, 0]); eval/metrics/knn_accuracy.py L2-normalizes before cosine kNN. For facebook/dinov2-small (register-free DINOv2) index 0 is genuinely the CLS token. No wrong-tensor / wrong-pooling / missing-normalization bug.
  • Export / quantization behave consistently across precisions (knn_top5 of 51–53 for all three at samples=100).
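
For readers unfamiliar with the metric, the pipeline described above (CLS token, L2 normalization, cosine leave-one-out kNN) can be sketched in NumPy. This is an illustrative reimplementation, not the code in eval/metrics/knn_accuracy.py. It assumes top-k counts a hit when any of the k nearest neighbours shares the query's label, and that no embedding has zero norm; the function names are hypothetical.

```python
import numpy as np


def cls_embeddings(last_hidden_state: np.ndarray) -> np.ndarray:
    """Take token 0 of a (batch, tokens, dim) array, i.e. the CLS token."""
    return last_hidden_state[:, 0]


def loo_knn_topk_accuracy(features, labels, k: int) -> float:
    """Leave-one-out cosine kNN accuracy in percent.

    ASSUMPTION: a query is a hit if any of its k nearest neighbours
    (excluding itself) has the same label.
    """
    feats = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    # L2-normalize so the dot product equals cosine similarity.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ feats.T
    # Exclude each sample from its own neighbour list.
    np.fill_diagonal(sims, -np.inf)
    k = min(k, len(labels) - 1)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    hits = (labels[nn_idx] == labels[:, None]).any(axis=1)
    return 100.0 * float(hits.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(10, 5, 8))   # 10 images, 5 tokens, dim 8
    labels = rng.integers(0, 100, size=10)  # 10 samples over 100 classes
    feats = cls_embeddings(hidden)
    print(loo_knn_topk_accuracy(feats, labels, 1),
          loo_knn_topk_accuracy(feats, labels, 5))
```

With this definition top1 can never exceed top5, which is the only ordering invariant a smoke test can safely rely on.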

Verdict

Not a defect in eval / model / quantization / EP. The test asserts an accuracy magnitude in a suite whose own docstring says it must not ("These tests do NOT assert metric magnitudes for accuracy regression", deferring that to the suite under scripts/e2e_eval/run_eval.py), and it does so at a sample count where the metric is pure noise.

Recommended fix

  • A (recommended): make it a true smoke test — keep --samples 10, drop the kNN magnitude assertion, assert only presence + finite + top1 <= top5. Accuracy regression for this task belongs in scripts/e2e_eval/run_eval.py.
  • B (alternative): bump this test to --samples 100 (where the metric is stable and EP/precision-independent at ~37/52) and assert a loose unconditional degenerate-guard (e.g. top1 >= 15, top5 >= 30). No EP-gating is needed because quantization is innocent. (Would want an NPU@100 baseline to confirm margins.)

Note: gating the floor to QNN (an early guess) is wrong — QNN also returns 20.
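
Option A could look roughly like the helper below. It is a sketch only: the helper name is hypothetical, knn_top1_accuracy is assumed by analogy with the knn_top5_accuracy key seen in the failure message, and how the test obtains the metrics dict is left out.

```python
import math


def assert_knn_smoke(metrics: dict) -> None:
    """Smoke-level checks: presence, finiteness, and top1 <= top5. No magnitude floor."""
    # ASSUMPTION: metric key names; only knn_top5_accuracy appears in the failure log.
    top1_key, top5_key = "knn_top1_accuracy", "knn_top5_accuracy"
    for key in (top1_key, top5_key):
        assert key in metrics, f"missing metric {key}"
        value = metrics[key]
        assert isinstance(value, (int, float)), f"{key} is not numeric: {value!r}"
        assert math.isfinite(value), f"{key} is not finite: {value!r}"
        assert 0.0 <= value <= 100.0, f"{key} outside [0, 100]: {value!r}"
    assert metrics[top1_key] <= metrics[top5_key], "top1 must not exceed top5"


if __name__ == "__main__":
    # Both observed outcomes at --samples 10 pass as smoke results.
    assert_knn_smoke({"knn_top1_accuracy": 0.0, "knn_top5_accuracy": 0.0})
    assert_knn_smoke({"knn_top1_accuracy": 10.0, "knn_top5_accuracy": 20.0})
```

The point of the design is that both the fp32 result (0/0) and the quantized result (top5 of 20) pass, since neither says anything about model quality at this sample count.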

Labels

P0 Critical: blocking, crash, data loss
