feat: implement fill-mask model task evaluator #341
Merged
DingmaomaoBJTU
approved these changes
Apr 22, 2026
Implement #317
Fill-Mask Evaluator Design
1. Overview
Fill-mask models (BERT, RoBERTa, DistilBERT, etc.) are Masked Language Models (MLMs). Given a sentence with one or more tokens replaced by `[MASK]`, the model predicts the original token.
Example:
2. Usage
Output:
3. Metric
Pseudo-perplexity (PPPL) — the community-standard score for comparing Masked Language Models, introduced by Salazar et al. 2020 ("Masked Language Model Scoring", https://arxiv.org/abs/1910.14659).
For each real token $w_i$ in the corpus, we mask only that one position and measure the model's log-probability of the original token given the rest of the sentence:
$$\mathrm{PPPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_{\setminus i})\right)$$
where $N$ is the total number of scored tokens and $w_{\setminus i}$ denotes the sentence with position $i$ replaced by `[MASK]`.
Why pseudo-perplexity and not "regular" perplexity?
Classical perplexity is defined for autoregressive LMs via the chain rule: $P(w_1,\dots,w_N) = \prod_i P(w_i \mid w_{<i})$. MLMs are bidirectional: they only provide conditionals $P(w_i \mid w_{\setminus i})$, which do not come from any consistent joint distribution. So standard perplexity is mathematically undefined for MLMs. "Pseudo-likelihood" (Besag 1975) is the tractable surrogate used when a joint is unavailable; "pseudo-perplexity" is its exponentiated, normalized form.
Why pseudo-perplexity and not exp(MLM-loss) with 15% masking?
The 15% random-masking protocol (BERT's training objective with the 80/10/10 split) is sometimes repurposed as an eval metric, but it has drawbacks:
PPPL avoids both: it scores every real token (deterministic), and each is scored with a clean N−1 context — the regime the model is actually used in.
Why not top-k accuracy?
A top-1 accuracy metric only checks whether the correct token is the model's top-ranked prediction. PPPL instead uses the probability the model assigns to the correct token: a model that assigns 40% to the correct token scores much better than one that assigns 5%, even if both get it "wrong" by top-1. This makes PPPL more sensitive to quality differences, especially when comparing ONNX-quantized models against PyTorch baselines.
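The 40% versus 5% comparison can be made concrete with a small sketch. Everything here is illustrative: the two "models", the three-word vocabulary, and the helper names `top1_accuracy` and `pseudo_perplexity` are hypothetical and are not taken from the PR.

```python
import math

def top1_accuracy(dists, targets):
    """Fraction of positions where the highest-probability token is the target."""
    hits = sum(1 for d, t in zip(dists, targets) if max(d, key=d.get) == t)
    return hits / len(targets)

def pseudo_perplexity(dists, targets):
    """exp of the mean negative log-probability assigned to the target tokens."""
    nll = -sum(math.log(d[t]) for d, t in zip(dists, targets)) / len(targets)
    return math.exp(nll)

# One masked position whose original token is "paris" (made-up numbers).
targets = ["paris"]
model_a = [{"london": 0.45, "paris": 0.40, "rome": 0.15}]  # 40% on the target
model_b = [{"london": 0.50, "paris": 0.05, "rome": 0.45}]  # 5% on the target

acc_a = top1_accuracy(model_a, targets)
acc_b = top1_accuracy(model_b, targets)
pppl_a = pseudo_perplexity(model_a, targets)  # 1 / 0.40
pppl_b = pseudo_perplexity(model_b, targets)  # 1 / 0.05

print(acc_a, acc_b)    # both models miss by top-1
print(pppl_a, pppl_b)  # but the second is far more "surprised"
```

Top-1 accuracy cannot tell the two apart, while the per-token probability separates them by a factor of eight.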
Interpretation. PPPL is the effective branching factor of the model: a PPPL of 4 means the model is, on average, as uncertain as if it were choosing uniformly between 4 candidates per token. Lower is better. Published values for BERT-family MLMs on English text sit in the 3–10 range.
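The branching-factor reading can be checked with a short worked case. Assume a hypothetical model that assigns probability $1/k$ to the correct token at every one of the $N$ scored positions; then the exponentiated mean NLL collapses to $k$:

$$
\mathrm{PPPL}
  = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log\frac{1}{k}\right)
  = \exp(\log k)
  = k
$$

With $k = 4$ this gives a PPPL of 4, matching the "choosing uniformly between 4 candidates" description.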
4. Implementation
The evaluator processes each text sample in four steps:
- Tokenize: Convert text to token IDs with the model's tokenizer. Pad to the ONNX model's fixed sequence length if required.
- Identify real positions: Use `tokenizer.get_special_tokens_mask` and the pad-token ID to filter out `[CLS]`, `[SEP]`, and padding. Only real content tokens are scored.
- Mask one at a time and infer: For each real position $i$, replace only that position with `[MASK]`, run the model forward, and gather the log-softmax probability of the original token at position $i$. The rest of the sentence stays intact (N−1 correct context). One forward pass per scored token.
- Aggregate: `PseudoPerplexityMetric` accumulates per-token log-probabilities across the corpus, then returns mean NLL and $\exp(\text{mean NLL})$ as `pseudo_perplexity`.
The WinML ONNX builds for NPU use a fixed `batch=1` shape, so one forward per masked position is the natural access pattern, with no batching gymnastics. The same code path handles the HF PyTorch baseline (dynamic batch) without changes.
5. Evaluation results
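Before the numbers, the four steps from section 4 (tokenize, identify real positions, mask one at a time, aggregate) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's code: the toy vocabulary, `encode`, `uniform_forward`, and the `PseudoPerplexityAccumulator` class are hypothetical stand-ins for the real tokenizer, the ONNX or PyTorch forward pass, and `PseudoPerplexityMetric`.

```python
import math

# Hypothetical toy vocabulary; a real run would use the model's tokenizer.
VOCAB = {"[CLS]": 0, "[SEP]": 1, "[PAD]": 2, "[MASK]": 3,
         "the": 4, "cat": 5, "sat": 6, "down": 7}
SPECIAL_IDS = {VOCAB["[CLS]"], VOCAB["[SEP]"], VOCAB["[PAD]"]}

def encode(text, seq_len):
    """Step 1: tokenize and pad to a fixed length (no truncation in this toy)."""
    ids = [VOCAB["[CLS]"]] + [VOCAB[w] for w in text.split()] + [VOCAB["[SEP]"]]
    return ids + [VOCAB["[PAD]"]] * (seq_len - len(ids))

def real_positions(ids):
    """Step 2: keep only content tokens (skip [CLS], [SEP], padding)."""
    return [i for i, t in enumerate(ids) if t not in SPECIAL_IDS]

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def score_sample(ids, forward):
    """Step 3: one forward pass per real position, masking only that position."""
    log_probs = []
    for i in real_positions(ids):
        masked = list(ids)
        masked[i] = VOCAB["[MASK]"]
        logits = forward(masked)  # assumed shape: [seq_len][vocab_size]
        log_probs.append(log_softmax(logits[i])[ids[i]])
    return log_probs

class PseudoPerplexityAccumulator:
    """Step 4: hypothetical stand-in for PseudoPerplexityMetric."""
    def __init__(self):
        self.total_log_prob = 0.0
        self.count = 0

    def update(self, log_probs):
        self.total_log_prob += sum(log_probs)
        self.count += len(log_probs)

    def compute(self):
        mean_nll = -self.total_log_prob / self.count
        return {"mean_nll": mean_nll, "pseudo_perplexity": math.exp(mean_nll)}

def uniform_forward(ids):
    """Dummy model: equal logits for every token at every position."""
    return [[0.0] * len(VOCAB) for _ in ids]

ids = encode("the cat sat", seq_len=8)
metric = PseudoPerplexityAccumulator()
metric.update(score_sample(ids, uniform_forward))
result = metric.compute()
print(result)
```

Because the dummy model is uniform over an 8-token vocabulary, the resulting pseudo-perplexity equals the vocabulary size, which is a convenient sanity check for the aggregation logic.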
Run: `run_eval.py --eval-type accuracy --device npu` on all 8 fill-mask models in the registry (100 samples each, wikitext-2-raw-v1 test split).
Thresholds: |Δ relative| < 5% → PASS, < 10% → AT_RISK, ≥ 10% → REGRESSION. The 5% boundary corresponds to a per-token NLL increase of ~0.05 nats, a small but user-perceptible quality loss, and sits 10× above the paired-comparison noise floor (~0.5% relative at 10k tokens).
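The threshold rule can be written down in a few lines. This sketch assumes that "Δ relative" means the relative PPPL difference between the NPU run and the PyTorch baseline; the function name `classify` and the example numbers are hypothetical and do not come from the PR's results.

```python
import math

def classify(baseline_pppl, candidate_pppl):
    """Map the relative PPPL change onto PASS / AT_RISK / REGRESSION."""
    rel = abs(candidate_pppl - baseline_pppl) / baseline_pppl
    if rel < 0.05:
        return "PASS"
    if rel < 0.10:
        return "AT_RISK"
    return "REGRESSION"

# Made-up PPPL values, chosen only to exercise each band.
print(classify(5.0, 5.2))  # 4% change
print(classify(5.0, 5.4))  # 8% change
print(classify(5.0, 6.0))  # 20% change

# A 5% relative PPPL increase expressed as a per-token NLL increase in nats.
nll_delta_at_5_percent = math.log(1.05)
print(round(nll_delta_at_5_percent, 4))
```

Since PPPL is the exponential of the mean NLL, a relative PPPL change of 5% is an additive NLL change of ln(1.05), which is where the ~0.05 nats figure comes from.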
Quantization-only regression, confirmed
For both REGRESSION cases (distilbert, mbert-cased), rerunning the un-quantized fp32 optimized ONNX on NPU isolates the cause:
The fp32 ONNX matches the PyTorch baseline within noise in both cases, so export and ORT optimization are clean. The entire regression is introduced by the w8a16 quantization stage.