feat: Implement fill-mask model evaluation #318
Closed
zhenchaoni wants to merge 4 commits into
Implement #317
Fill-Mask Evaluator Design
1. Overview
Fill-mask models (BERT, RoBERTa, DistilBERT, etc.) are Masked Language Models (MLM). Given a sentence with one or more tokens replaced by `[MASK]`, the model predicts the original token.
Example:
2. Usage
Output:
3. Metric
Cross-entropy is the standard loss metric for Masked Language Modeling (MLM) evaluation.
For each masked token, cross-entropy measures how well the model predicts the original token:
$$\text{CE} = -\frac{1}{N} \sum_{i=1}^{N} \log P(t_i \mid \text{context})$$
where $N$ is the total number of masked tokens across all samples, and $P(t_i \mid \text{context})$ is the model's predicted probability for the correct token $t_i$.
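As a rough illustration of the mean cross-entropy described above, the sketch below scores a toy batch in plain Python. It assumes that unmasked positions carry the label `-100` (the convention the Implementation section mentions); the function names `log_softmax` and `mean_cross_entropy` and the toy logits are hypothetical and are not taken from this PR.

```python
import math

IGNORE_INDEX = -100  # assumed label value for positions that were not masked


def log_softmax(logits):
    """Numerically stable log-softmax over one row of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_cross_entropy(logits, labels):
    """Mean of -log P(t_i | context) over positions whose label is not IGNORE_INDEX."""
    total, n = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # unmasked position: contributes nothing to the metric
        total += -log_softmax(row)[label]
        n += 1
    if n == 0:
        raise ValueError("no masked positions to score")
    return total / n


# Toy input: 3 positions, vocabulary of 4 tokens; positions 0 and 2 are masked.
logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
labels = [0, IGNORE_INDEX, 3]
ce = mean_cross_entropy(logits, labels)
print(f"mean CE = {ce:.4f} nats")
```

Position 2 has uniform logits, so its loss is $\log 4$; position 0 favours the correct token and contributes less. The natural logarithm is used here, so the result is in nats.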
Why cross-entropy?
It is the MLM loss function. Cross-entropy is the exact objective that masked language models are trained to minimize. Evaluating with the same loss gives a direct measure of how well the model performs at its core task — no proxy metric or downstream adaptation needed.
More informative than accuracy. A top-1 accuracy metric only checks whether the correct token is the model's top-ranked prediction, a binary right-or-wrong outcome. Cross-entropy instead uses the probability the model assigns to the correct token: a model that assigns 40% to the correct token scores much better than one that assigns 5%, even if both get it "wrong" by top-1. This makes CE more sensitive to quality differences, especially when comparing ONNX-quantized models against PyTorch baselines.
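The 40% versus 5% comparison can be checked with a few lines of Python. The two probability vectors below are made-up illustrations over a three-token vocabulary, and the helper names are hypothetical.

```python
import math


def top1_correct(probs, target):
    """True if the highest-probability token is the target token."""
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted == target


def token_ce(probs, target):
    """Cross-entropy contribution of one masked token: -log P(target)."""
    return -math.log(probs[target])


target = 1  # index of the correct token in this toy vocabulary

# Both hypothetical models rank token 0 first, so both are wrong by top-1.
model_a = [0.45, 0.40, 0.15]  # assigns 40% to the correct token
model_b = [0.90, 0.05, 0.05]  # assigns 5% to the correct token

acc_a, acc_b = top1_correct(model_a, target), top1_correct(model_b, target)
ce_a, ce_b = token_ce(model_a, target), token_ce(model_b, target)

print(acc_a, acc_b)  # False False: accuracy cannot separate the two models
print(ce_a, ce_b)    # about 0.92 vs about 3.00 nats: CE separates them clearly
```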
Linear and comparable. CE is a continuous value where smaller is better. A fixed CE difference between two models corresponds to the same multiplicative change in the geometric-mean probability assigned to the correct tokens, so a delta of 0.1 has a consistent meaning regardless of the absolute value. This makes relative comparison straightforward (e.g., "the ONNX model's CE is 2.8% higher than the PyTorch baseline"), which is what the e2e evaluation system needs for pass/regression verdicts.
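A relative comparison of this kind could look like the sketch below. The function names, the 5% tolerance, and the CE values are assumptions chosen for illustration; the PR text does not state the threshold or the verdict logic the e2e system actually uses.

```python
def relative_ce_delta(baseline_ce, candidate_ce):
    """Relative change of the candidate's CE with respect to the baseline."""
    if baseline_ce <= 0:
        raise ValueError("baseline CE must be positive")
    return (candidate_ce - baseline_ce) / baseline_ce


def verdict(baseline_ce, candidate_ce, max_regression=0.05):
    """Return 'pass' if CE grew by at most max_regression (hypothetical 5% default)."""
    delta = relative_ce_delta(baseline_ce, candidate_ce)
    return "pass" if delta <= max_regression else "regression"


# Hypothetical numbers: PyTorch baseline CE 2.50, ONNX model CE 2.57.
delta = relative_ce_delta(2.50, 2.57)
print(f"ONNX CE is {delta:.1%} higher than baseline: {verdict(2.50, 2.57)}")
```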
4. Implementation
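As orientation for the masking step in this section (select about 15% of tokens, then 80% `[MASK]`, 10% random token, 10% unchanged, with a fixed seed), here is a plain-Python stand-in. It is not the code of `DataCollatorForLanguageModeling` and not the PR's implementation; the function name `mask_tokens`, its parameters, and the token IDs in the usage lines are assumptions.

```python
import random

IGNORE_INDEX = -100  # label for positions that are not scored


def mask_tokens(input_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) following the 80/10/10 MLM masking rule."""
    rng = random.Random(seed)  # fixed seed makes the masking reproducible
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Special tokens are never selected; others are selected with p = 0.15.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the original token becomes the label
        r = rng.random()
        if r < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return masked, labels


# Hypothetical IDs: 200 ordinary tokens, mask ID 103, vocabulary size 30522.
ids = list(range(1000, 1200))
masked_ids, labels = mask_tokens(ids, mask_id=103, vocab_size=30522)
n_selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{n_selected} of {len(ids)} tokens selected for prediction")
```

Calling `mask_tokens` twice with the same seed yields identical output, which is the property the fixed seed is meant to provide.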
The evaluator processes each text sample in four steps:
Tokenize: Convert text to token IDs with the model's tokenizer. Pad to the ONNX model's fixed sequence length if required.
Mask: Use HuggingFace `DataCollatorForLanguageModeling` to randomly replace 15% of tokens following the standard MLM protocol (80% → `[MASK]`, 10% → random token, 10% → unchanged). A fixed seed ensures reproducible masking across runs. The original token IDs at masked positions become the labels.
Infer: Pass the masked `input_ids` and `attention_mask` directly to the model (no HF pipeline). The model returns logits of shape `[seq_len, vocab_size]`.
Score: `CrossEntropyMetric` accumulates the cross-entropy loss at masked positions (`labels != -100`) across all samples, then computes the mean CE per token.
5. Evaluation results