The current underlying implementation of BERT score supports a limited set of transformer models, and FMEval further truncates this list to microsoft/deberta-xlarge-mnli and roberta-large-mnli.
Torchmetrics provides a more generic BERT score implementation. The specific request here is to not limit what transformer models users can configure. The behavior should be as follows:
- If microsoft/deberta-xlarge-mnli or roberta-large-mnli is specified, use the existing bert-score implementation to avoid regressions for existing customers.
- Otherwise, use the torchmetrics implementation of BERT score.
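The dispatch described above could be sketched as follows. This is a minimal illustration, not FMEval's actual code; the function name `choose_bertscore_backend` and the returned string labels are placeholders chosen here.

```python
# Models the existing bert-score path currently supports in FMEval
# (per the proposal above).
BERT_SCORE_NATIVE_MODELS = {
    "microsoft/deberta-xlarge-mnli",
    "roberta-large-mnli",
}


def choose_bertscore_backend(model_name: str) -> str:
    """Pick a BERT score backend for the given transformer model name.

    Keeps the legacy bert-score path for the two models it already
    supports, and routes every other model to torchmetrics.
    """
    if model_name in BERT_SCORE_NATIVE_MODELS:
        # Preserve existing behavior to avoid regressions.
        return "bert-score"
    # Any other transformer model (including fine-tuned ones) goes
    # through the more generic torchmetrics implementation.
    return "torchmetrics"
```

With this shape, existing configurations keep their current scores bit-for-bit, while any other Hugging Face model identifier simply falls through to torchmetrics.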
Use cases for a broader set of models underlying BERT score:
- Monolingual BERT models have been shown to outperform multilingual BERT models on certain tasks. https://aclanthology.org/2021.acl-long.243.pdf
- Customers may fine-tune their own transformers, which can be downloaded into the container running FMEval and passed to the torchmetrics BERT score implementation.