-
Hey @sam-hey, I am working on resolving this issue. I have a few clarifying questions before I submit a PR. When I pass a model as a `CrossEncoder`, I get an error.

So, I think we can resolve this by checking if the model is a `CrossEncoder`. Now, although something like this will be needed anyway, it does not resolve the bug that you talked about, i.e. when we instantiate the model through mteb's own model loading. So, do you suggest I should check for the model type? Just logically, something like:
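For illustration, a minimal sketch of a name-based check (purely an assumption about the kind of heuristic meant here; as the replies below point out, it is not a reliable way to detect cross-encoders):

```python
# Illustrative sketch only: detect a cross-encoder by its model name.
# As noted in the replies below, this heuristic is unreliable.
def looks_like_cross_encoder(model_name: str) -> bool:
    return "cross-encoder" in model_name.lower()


print(looks_like_cross_encoder("cross-encoder/msmarco-MiniLM-L6-en-de-v1"))  # True
print(looks_like_cross_encoder("sentence-transformers/msmarco-MiniLM-L-6-v3"))  # False (bi-encoder)
```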
cc: @Muennighoff
-
Not all cross encoders have "cross-encoder" in their name.
-
Hello @imadtyx, thanks a lot for working on this! As @Samoed pointed out, it's not possible to achieve this by simply searching for the text. I suggest reducing the scope of the PR to first just passing the model of type `CrossEncoder`:

```python
import mteb
from mteb import MTEB
from sentence_transformers import CrossEncoder, SentenceTransformer

cross_encoder = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
subset = "default"  # subset name used in the NFCorpus dataset
eval_splits = ["test"]

evaluation = MTEB(tasks=tasks)
evaluation.run(
    cross_encoder,
    eval_splits=eval_splits,
)
```

Then, we can simply check for the model using:

mteb/mteb/evaluation/evaluators/RetrievalEvaluator.py, lines 71 to 74 in c26adee

The other PR will make it easier to create models from model metadata for `CrossEncoder`.
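For illustration, a minimal sketch of such a type check, assuming it keys off `sentence_transformers.CrossEncoder` (the exact helper used in mteb may differ):

```python
from sentence_transformers import CrossEncoder


def is_cross_encoder(model) -> bool:
    # A model that is (or subclasses) sentence-transformers' CrossEncoder is scored
    # with model.predict on raw sentence pairs instead of model.encode.
    return isinstance(model, CrossEncoder)
```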
You can refer to that implementation for reference. cc @orionw
-
Okay great! I believe in that case, passing the model as a `CrossEncoder` can be handled in the STS evaluator roughly like this:

```python
if is_cross_encoder(model):
    logger.info(
        "The custom predict function of the model will be used if not a SentenceTransformer CrossEncoder"
    )
    # Cross-encoders score raw sentence pairs directly; no embeddings are produced.
    pairs = list(zip(self.sentences1, self.sentences2))
    similarity_scores = model.predict(pairs)  # returns one similarity score per pair
    pearson, _ = pearsonr(self.gold_scores, similarity_scores)
    spearman, _ = spearmanr(self.gold_scores, similarity_scores)
    return {"pearson": pearson, "spearman": spearman}
else:
    embeddings1 = model.encode(
        self.sentences1,
        task_name=self.task_name,
        **encode_kwargs,
    )
    embeddings2 = model.encode(
        self.sentences2,
        task_name=self.task_name,
        **encode_kwargs,
    )

    cosine_scores = 1 - paired_cosine_distances(embeddings1, embeddings2)
    manhattan_distances = -paired_manhattan_distances(embeddings1, embeddings2)
    euclidean_distances = -paired_euclidean_distances(embeddings1, embeddings2)

    cosine_pearson, _ = pearsonr(self.gold_scores, cosine_scores)
    cosine_spearman, _ = spearmanr(self.gold_scores, cosine_scores)
    manhattan_pearson, _ = pearsonr(self.gold_scores, manhattan_distances)
    manhattan_spearman, _ = spearmanr(self.gold_scores, manhattan_distances)
    euclidean_pearson, _ = pearsonr(self.gold_scores, euclidean_distances)
    euclidean_spearman, _ = spearmanr(self.gold_scores, euclidean_distances)

    similarity_scores = None
    if hasattr(model, "similarity_pairwise"):
        similarity_scores = model.similarity_pairwise(embeddings1, embeddings2)  # type: ignore
    elif hasattr(model, "similarity"):
        _similarity_scores = [
            float(model.similarity(e1, e2))  # type: ignore
            for e1, e2 in zip(embeddings1, embeddings2)
        ]
        similarity_scores = np.array(_similarity_scores)

    if similarity_scores is not None:
        pearson, _ = pearsonr(self.gold_scores, similarity_scores)
        spearman, _ = spearmanr(self.gold_scores, similarity_scores)
    else:
        # if the model does not have a similarity function, we assume cosine similarity
        pearson = cosine_pearson
        spearman = cosine_spearman

    return {
        # using the model's own similarity score
        "pearson": pearson,
        "spearman": spearman,
        # generic similarity scores
        "cosine_pearson": cosine_pearson,
        "cosine_spearman": cosine_spearman,
        "manhattan_pearson": manhattan_pearson,
        "manhattan_spearman": manhattan_spearman,
        "euclidean_pearson": euclidean_pearson,
        "euclidean_spearman": euclidean_spearman,
    }
```

Now, the final scores/evaluation_results dictionary later on gets one more key called `main_score`:

mteb/mteb/abstasks/AbsTaskSTS.py, line 90 in dba7a95

And for the STS tasks, the `main_score` is `cosine_spearman`:

mteb/mteb/tasks/STS/eng/STS12STS.py, line 22 in dba7a95

However, if we use a cross-encoder, the returned scores only contain `pearson` and `spearman`, so the `main_score` lookup fails:

mteb/mteb/abstasks/AbsTaskSTS.py, lines 93 to 94 in dba7a95

Therefore, to resolve this, I think we have three options:
Although this is the easiest, this would technically be wrong.

```python
scores["main_score"] = scores.get(self.metadata.main_score, scores["spearman"])
```

These are roughly my suggestions, but if you guys have anything else in mind, please let me know. I can make changes based on whichever option you decide and then submit a PR accordingly.
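To make the failure mode described above concrete, here is a rough paraphrase of the `main_score` lookup (simplified and with placeholder values; the exact lines are in the AbsTaskSTS.py permalinks above):

```python
# Simplified paraphrase, not the literal mteb code.
scores = {"pearson": 0.71, "spearman": 0.70}  # placeholder values from the cross-encoder branch
main_score = "cosine_spearman"                # metadata.main_score for most STS tasks

try:
    scores["main_score"] = scores[main_score]
except KeyError as err:
    # The cross-encoder branch never computes cosine scores, so the lookup fails.
    print(f"missing key: {err}")
```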
-
Thanks a lot for your effort so far! I would probably go with setting all non-valid operations to
-
Great work here! This seems ready for a PR (where we can then do the final polish). A few things I would change:

This also requires a few changes to the default scores in tasks, as you say. Most currently use "cosine_spearman"; I believe we can change it to e.g. "spearman" (given these are essentially just "cosine_spearman" in most cases). However, I would love @isaac-chung's and @orionw's opinions here as well. (I think you can go ahead with the PR though - that also makes the effect of the changes easier to judge.)

If we want to keep it fully backward compatible, it would require making v2 of almost all STS tasks.
-
CrossEncoders don’t have an
-
Yeah, this is part of why I only added cross-encoders to Retrieval + Reranking. Plus, I don't think people typically use cross-encoders for STS/BiText Mining/Summarization, etc.

I'm with @Samoed in that I don't like having the metric change; if I had to pick, I would probably keep the same name, even though technically it's not a cosine_spearman. Otherwise we need to shift everything (but I suppose before v2 is a good time for that, so maybe we should just do it quickly?).

If we're going to start adding cross-encoders to every MTEB abstract task, each one will have a unique evaluation process like this. Then if we add sparse models, generative retrieval, late interaction... we will need a lot of these blocks for each one, and I don't think we could get around it because each will need its own thing. If we're going to add new model types to MTEB, it might be worth restructuring it so we can add more than just cross-encoders -- perhaps we have each abstract task call something like a single shared scoring method (see the sketch below).

On the flip side, we could also just not support it for those classes -- most people just want retrieval and reranking for cross-encoders and such.
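Purely as an illustration of the kind of restructuring described above (all names here, `PairScorer`, `pairwise_scores`, and the adapters, are hypothetical and not part of mteb's API): each abstract task could depend on a single scoring interface, and each model type could provide an adapter for it.

```python
from typing import Protocol, Sequence

import numpy as np


class PairScorer(Protocol):
    """Hypothetical interface: one similarity score per (sentence1, sentence2) pair."""

    def pairwise_scores(self, sentences1: Sequence[str], sentences2: Sequence[str]) -> np.ndarray: ...


class BiEncoderScorer:
    """Adapter for embedding models: score = cosine similarity of the two embeddings."""

    def __init__(self, model):
        self.model = model

    def pairwise_scores(self, sentences1, sentences2):
        e1 = np.asarray(self.model.encode(list(sentences1)))
        e2 = np.asarray(self.model.encode(list(sentences2)))
        return (e1 * e2).sum(axis=1) / (np.linalg.norm(e1, axis=1) * np.linalg.norm(e2, axis=1))


class CrossEncoderScorer:
    """Adapter for cross-encoders: score = model.predict on the raw sentence pairs."""

    def __init__(self, model):
        self.model = model

    def pairwise_scores(self, sentences1, sentences2):
        return np.asarray(self.model.predict(list(zip(sentences1, sentences2))))
```

A task's evaluator would then only call `pairwise_scores`, so supporting sparse, late-interaction, or generative models would mean writing one adapter rather than adding a new branch to every abstract task.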
-
I agree with @orionw here. Just because "Currently, Cross-Encoders are not supported when using STS" does not automatically mean there is a good reason to add support. So far, no evidence has been provided here in support of such a case, so I'd lean toward not pursuing this avenue at this time.

Given the interest in using cross encoders in this issue/discussion, I'd suggest working on this long-overdue issue first if anyone has capacity to contribute: #1214. Reranking is a main use case for cross encoders, and having that supported will complete the missing piece in the library. We can likely reuse some of the points discussed here on that linked issue as well.

For context, I believe our current priorities are:

After which we can look more into enhancements and improvements.
-
Thanks, @sam-hey @KennethEnevoldsen @orionw @isaac-chung and @Samoed for your comments. I was hoping to get a bit more clarity on the next steps here since I see some conflicting/confusing suggestions. I think we basically have these two suggestions currently:
In case you guys want me to do a PR for now, I have the code ready with the changes that @KennethEnevoldsen suggested. However, I understand if the current priorities are different. I am completely fine with whatever you guys decide, so please let me know your decision. 🙂
-
Currently, Cross-Encoders are not supported when using STS, and all models are treated as if they generate embeddings. This leads to suboptimal results without raising any errors.
Including support for Cross-Encoders would be a valuable enhancement.
- `sentence-transformers/msmarco-MiniLM-L-6-v3`
- `cross-encoder/msmarco-MiniLM-L6-en-de-v1`
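A minimal sketch of the reported behaviour, assuming the cross-encoder above is loaded by name via `mteb.get_model` and evaluated on an STS task such as STS12 (the exact behaviour and scores depend on the mteb version in use):

```python
import mteb

# Per the report above, the cross-encoder checkpoint ends up wrapped like an
# embedding model, so the STS run completes without errors, but the resulting
# scores are not meaningful for a cross-encoder.
model = mteb.get_model("cross-encoder/msmarco-MiniLM-L6-en-de-v1")
tasks = mteb.get_tasks(tasks=["STS12"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model)
```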