The current underlying implementation of BERT score supports a limited set of transformer models, and FMEval further truncates this list to microsoft/deberta-xlarge-mnli and roberta-large-mnli.
Torchmetrics provides a more generic BERT score implementation. The specific request here is to not limit what transformer models users can configure. The behavior should be as follows:
- If microsoft/deberta-xlarge-mnli or roberta-large-mnli is specified, use the existing bert-score implementation to avoid regressions for existing customers.
- Otherwise, use the torchmetrics implementation of BERT score.
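The dispatch described above could be sketched as follows. This is a minimal illustration, not FMEval's actual code; the function name `choose_bertscore_backend` and the returned string labels are placeholders chosen here.

```python
# Models the existing bert-score path currently supports in FMEval
# (per the proposal above).
BERT_SCORE_NATIVE_MODELS = {
    "microsoft/deberta-xlarge-mnli",
    "roberta-large-mnli",
}


def choose_bertscore_backend(model_name: str) -> str:
    """Pick a BERT score backend for the given transformer model name.

    Keeps the legacy bert-score path for the two models it already
    supports, and routes every other model to torchmetrics.
    """
    if model_name in BERT_SCORE_NATIVE_MODELS:
        # Preserve existing behavior to avoid regressions.
        return "bert-score"
    # Any other transformer model (including fine-tuned ones) goes
    # through the more generic torchmetrics implementation.
    return "torchmetrics"
```

With this shape, existing configurations keep their current scores bit-for-bit, while any other Hugging Face model identifier simply falls through to torchmetrics.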
Use cases for a broader set of models underlying BERT score:
- Monolingual BERT models have been shown to outperform multilingual BERT models on certain tasks. https://aclanthology.org/2021.acl-long.243.pdf
- Customers may fine-tune their own transformers, which can be downloaded into the container running FMEval and passed to the torchmetrics BERT score implementation.