e2e test_image_feature_extraction: kNN accuracy floor is unreachable & noise-level at samples=10 (quantization & EP innocent)

## Summary

`tests/e2e/test_eval_e2e.py::TestEvalPerTask::test_image_feature_extraction` (added in #786) fails on **both QNN and OpenVINO** with `knn_top5_accuracy=20.0`, below the asserted floor. Investigation shows this is a **test-design problem, not a product bug** — the eval pipeline, evaluator, exporter, quantization, and EPs are all correct. The test asserts a kNN **accuracy magnitude** that is (a) unreachable even unquantized and (b) statistically meaningless at `--samples 10`.

## Symptom

```
AssertionError: metric knn_top5_accuracy=20.0 outside expected range [..., 100.0]   # OpenVINO (junit-ov-eval.xml)
AssertionError: metric knn_top5_accuracy=20.0 outside expected range [..., 100.0]   # QNN      (junit-qnn-eval.xml)
```

#786 originally asserted `knn_top1 >= 30`, `knn_top5 >= 60` unconditionally.

## Root-cause decomposition

`facebook/dinov2-small`, default dataset `timm/mini-imagenet`, shuffle seed=42, via `winml eval ... --device cpu --precision <p> --samples <n>`:

| precision | samples | knn_top1 | knn_top5 |
|---|---|---|---|
| fp32 | 100 | 36 | 53 |
| int8 (w8a8) | 100 | 37 | 51 |
| **w8a16** (NPU auto-precision) | 100 | 37 | 52 |
| fp32 | **10** | **0** | **0** |
| NPU QNN / OV (w8a16) | 10 | ~10–20 | **20** |

## Findings

1. **Quantization is innocent — including W8A16.** fp32 / int8 / **w8a16** at samples=100 all land ~36–37 / 51–53; W8A16 costs ~1 pt. (#179's W8A16 accuracy-loss pattern does **not** apply here — DINOv2-small is not in #179's table and is empirically W8A16-robust.)
2. **EP is innocent.** QNN and OpenVINO produce the identical `20`, so the failure does not depend on the execution provider.
3. **`--samples 10` is the entire cause.** For a 100-class dataset, 10 samples give ~0.1 reference images per class, so leave-one-out kNN is meaningless: even **unquantized FP32 scores 0/0**, and the quantized `20` is noise (a stable metric would not swing 0→20 from a ~1 pt quality change).
4. **The 30/60 floor was never reachable.** The best observed results at adequate samples are ~37 top-1 and ~53 top-5 (table above), so top5 stays below 60 even unquantized.
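
The sample-count argument in findings 3 and 4 can be checked with back-of-envelope arithmetic. The sketch below is not taken from the repo. It assumes labels are drawn uniformly and independently from the 100 classes (a simplification of the seeded shuffle), and it treats per-sample hits as independent Bernoulli trials, which leave-one-out kNN only approximates. Both function names are hypothetical.

```python
import math


def loo_hit_ceiling(n_samples: int, n_classes: int) -> float:
    """Expected % of samples that have at least one same-class neighbour.

    A leave-one-out query can only be correct if some other sample shares
    its label, so this is an upper bound on expected accuracy.
    ASSUMPTION: labels are uniform and independent across samples.
    """
    p_alone = (1.0 - 1.0 / n_classes) ** (n_samples - 1)
    return 100.0 * (1.0 - p_alone)


def binomial_std_error_pts(accuracy_pct: float, n_samples: int) -> float:
    """Standard error, in percentage points, of an accuracy estimate.

    ASSUMPTION: per-sample hits are independent Bernoulli trials.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_samples)


# Ceiling on expected accuracy at each sample count.
print(round(loo_hit_ceiling(10, 100), 1))    # 8.6
print(round(loo_hit_ceiling(100, 100), 1))   # 63.0

# Each sample is worth 100 / n points, so at n=10 the metric moves in steps of 10.
print(100 / 10)                              # 10.0

# Sampling error around the observed ~52 top-5 at n=100.
print(round(binomial_std_error_pts(52, 100), 1))  # 5.0
```

Under these assumptions, the expected ceiling at `--samples 10` is under 9 points and the metric moves in steps of 10, so `20` is just two hits out of ten. At `--samples 100` the ceiling is only about 63, which puts a top5 floor of 60 right at the edge of what is attainable, and a standard error of about 5 points comfortably covers the 51–53 spread between precisions.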

## Not the cause (verified in code)

- **Evaluator extraction is correct.** `eval/image_feature_extraction_evaluator.py` takes the CLS token (`last_hidden_state[:, 0]`); `eval/metrics/knn_accuracy.py` L2-normalizes before cosine kNN. For `facebook/dinov2-small` (register-free DINOv2) index 0 is genuinely the CLS token. No wrong-tensor / wrong-pooling / missing-normalization bug.
- **Export / quantization** behave consistently across precisions (top5 is 51–53 for fp32, int8, and w8a16 at samples=100).
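
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of leave-one-out cosine kNN over L2-normalized features. It is an illustration, not the code in `eval/metrics/knn_accuracy.py`. The function names are hypothetical, and the scoring rule (a query counts as a hit if any of its k nearest neighbours shares its label) is an assumption about how top-k is defined.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def loo_knn_topk_accuracy(features, labels, k=5):
    """Leave-one-out kNN accuracy in percent.

    ASSUMPTION: a query is a hit if any of its k nearest neighbours
    (by cosine similarity, excluding itself) has the same label.
    """
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    hits = 0
    for i in range(n):
        # Cosine similarity of sample i to every other sample.
        sims = [
            (sum(a * b for a, b in zip(feats[i], feats[j])), j)
            for j in range(n)
            if j != i
        ]
        sims.sort(reverse=True)
        neighbours = [j for _, j in sims[:k]]
        if any(labels[j] == labels[i] for j in neighbours):
            hits += 1
    return 100.0 * hits / n


# Two tight clusters: every sample's nearest neighbour shares its label.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(loo_knn_topk_accuracy(feats, ["a", "a", "b", "b"], k=1))

# Same features, but no two samples share a label: nothing can ever hit.
print(loo_knn_topk_accuracy(feats, ["a", "b", "c", "d"], k=3))
```

The second call shows the degenerate case behind the `--samples 10` result: when no other sample shares a query's label, the score is 0 no matter how good the features are.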

## Verdict

Not a defect in eval / model / quantization / EP. The test asserts an accuracy magnitude in a suite whose own docstring says it must **not** — *"These tests do NOT assert metric magnitudes for accuracy regression — that's the suite under `scripts/e2e_eval/run_eval.py`."* — at a sample count where the metric is pure noise.

## Recommended fix

- **A (recommended):** make it a true smoke test — keep `--samples 10`, drop the kNN magnitude assertion, assert only presence + finite + `top1 <= top5`. Accuracy regression for this task belongs in `scripts/e2e_eval/run_eval.py`.
- **B (alternative):** bump this test to `--samples 100` (where the metric is stable and EP/precision-independent at ~37/52) and assert a loose **unconditional** degenerate-guard (e.g. `top1 >= 15`, `top5 >= 30`). No EP-gating is needed because quantization is innocent. (Would want an NPU@100 baseline to confirm margins.)

Note: gating the floor to QNN (an early guess) is **wrong** — QNN also returns `20`.
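
Option A could look roughly like the helper below. This is a sketch only: `assert_knn_smoke` is a hypothetical name, the metrics are assumed to arrive as a plain dict, and the key `knn_top1_accuracy` is assumed by analogy with the `knn_top5_accuracy` key seen in the failure message.

```python
import math


def assert_knn_smoke(
    metrics,
    top1_key="knn_top1_accuracy",  # ASSUMPTION: key name mirrors the top5 key
    top5_key="knn_top5_accuracy",
):
    """Smoke-level checks only: presence, finiteness, and top1 <= top5.

    Deliberately asserts no accuracy magnitude, so it stays valid at
    --samples 10 where the metric is noise.
    """
    for key in (top1_key, top5_key):
        assert key in metrics, f"missing metric {key}"
        value = metrics[key]
        assert isinstance(value, (int, float)), f"{key} is not numeric: {value!r}"
        assert math.isfinite(value), f"{key} is not finite: {value!r}"
    assert metrics[top1_key] <= metrics[top5_key], (
        f"top1 ({metrics[top1_key]}) exceeds top5 ({metrics[top5_key]})"
    )


# Both results from the table pass, including the fp32 0/0 case.
assert_knn_smoke({"knn_top1_accuracy": 10.0, "knn_top5_accuracy": 20.0})
assert_knn_smoke({"knn_top1_accuracy": 0.0, "knn_top5_accuracy": 0.0})
```

A NaN, a missing key, or top1 above top5 still fails, so the test keeps catching a broken pipeline without depending on accuracy magnitude.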

## Refs

- Introduced in #786
- #179 (W8A16 accuracy loss) — referenced during investigation but does **not** apply to this model
- #820 (modality-aware e2e alignment) — the place to land the smoke-only fix

