-
Hey @sam-hey, I am working on resolving this issue. I have a few clarifying questions before I submit a PR. When I pass a model as a `CrossEncoder`, I get an error.

So, I think we can resolve this by checking if the model is a `CrossEncoder`. Now, although something like this will be needed anyway, it does not resolve the bug that you talked about, i.e. when we instantiate the model through mteb's own model loading. So, do you suggest I should check for the model type? Just logically, something like:
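For illustration, a minimal sketch of a name-based check (purely an assumption about the kind of heuristic meant here; as the replies below point out, it is not a reliable way to detect cross-encoders):

```python
# Illustrative sketch only: detect a cross-encoder by its model name.
# As noted in the replies below, this heuristic is unreliable.
def looks_like_cross_encoder(model_name: str) -> bool:
    return "cross-encoder" in model_name.lower()


print(looks_like_cross_encoder("cross-encoder/msmarco-MiniLM-L6-en-de-v1"))  # True
print(looks_like_cross_encoder("sentence-transformers/msmarco-MiniLM-L-6-v3"))  # False (bi-encoder)
```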
cc: @Muennighoff
-
Not all cross encoders have "cross-encoder" in their name.
-
Hello @imadtyx, thanks a lot for working on this! As @Samoed pointed out, it's not possible to achieve this by simply searching for the text. I suggest reducing the scope of the PR to first just passing the model of type `CrossEncoder`:

```python
import mteb
from mteb import MTEB
from sentence_transformers import CrossEncoder, SentenceTransformer

cross_encoder = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
subset = "default"  # subset name used in the NFCorpus dataset
eval_splits = ["test"]

evaluation = MTEB(tasks=tasks)
evaluation.run(
    cross_encoder,
    eval_splits=eval_splits,
)
```

Then, we can simply check for the model using:

mteb/mteb/evaluation/evaluators/RetrievalEvaluator.py, lines 71 to 74 in c26adee

The other PR will make it easier to create models from model metadata for `CrossEncoder`.
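For illustration, a minimal sketch of such a type check, assuming it keys off `sentence_transformers.CrossEncoder` (the exact helper used in mteb may differ):

```python
from sentence_transformers import CrossEncoder


def is_cross_encoder(model) -> bool:
    # A model that is (or subclasses) sentence-transformers' CrossEncoder is scored
    # with model.predict on raw sentence pairs instead of model.encode.
    return isinstance(model, CrossEncoder)
```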
You can refer to that implementation for reference. cc @orionw
-
Okay great! I believe in that case, passing the model as a `CrossEncoder` can be handled in the STS evaluator roughly like this:

```python
if is_cross_encoder(model):
    logger.info(
        "The custom predict function of the model will be used if not a SentenceTransformer CrossEncoder"
    )
    # Cross-encoders score raw sentence pairs directly; no embeddings are produced.
    pairs = list(zip(self.sentences1, self.sentences2))
    similarity_scores = model.predict(pairs)  # returns one similarity score per pair
    pearson, _ = pearsonr(self.gold_scores, similarity_scores)
    spearman, _ = spearmanr(self.gold_scores, similarity_scores)
    return {"pearson": pearson, "spearman": spearman}
else:
    embeddings1 = model.encode(
        self.sentences1,
        task_name=self.task_name,
        **encode_kwargs,
    )
    embeddings2 = model.encode(
        self.sentences2,
        task_name=self.task_name,
        **encode_kwargs,
    )

    cosine_scores = 1 - paired_cosine_distances(embeddings1, embeddings2)
    manhattan_distances = -paired_manhattan_distances(embeddings1, embeddings2)
    euclidean_distances = -paired_euclidean_distances(embeddings1, embeddings2)

    cosine_pearson, _ = pearsonr(self.gold_scores, cosine_scores)
    cosine_spearman, _ = spearmanr(self.gold_scores, cosine_scores)
    manhattan_pearson, _ = pearsonr(self.gold_scores, manhattan_distances)
    manhattan_spearman, _ = spearmanr(self.gold_scores, manhattan_distances)
    euclidean_pearson, _ = pearsonr(self.gold_scores, euclidean_distances)
    euclidean_spearman, _ = spearmanr(self.gold_scores, euclidean_distances)

    similarity_scores = None
    if hasattr(model, "similarity_pairwise"):
        similarity_scores = model.similarity_pairwise(embeddings1, embeddings2)  # type: ignore
    elif hasattr(model, "similarity"):
        _similarity_scores = [
            float(model.similarity(e1, e2))  # type: ignore
            for e1, e2 in zip(embeddings1, embeddings2)
        ]
        similarity_scores = np.array(_similarity_scores)

    if similarity_scores is not None:
        pearson, _ = pearsonr(self.gold_scores, similarity_scores)
        spearman, _ = spearmanr(self.gold_scores, similarity_scores)
    else:
        # if the model does not have a similarity function, we assume cosine similarity
        pearson = cosine_pearson
        spearman = cosine_spearman

    return {
        # using the model's own similarity score
        "pearson": pearson,
        "spearman": spearman,
        # generic similarity scores
        "cosine_pearson": cosine_pearson,
        "cosine_spearman": cosine_spearman,
        "manhattan_pearson": manhattan_pearson,
        "manhattan_spearman": manhattan_spearman,
        "euclidean_pearson": euclidean_pearson,
        "euclidean_spearman": euclidean_spearman,
    }
```

Now, the final scores/evaluation_results dictionary later on gets one more key called `main_score`:

mteb/mteb/abstasks/AbsTaskSTS.py, line 90 in dba7a95

And for the STS tasks, the `main_score` is `cosine_spearman`:

mteb/mteb/tasks/STS/eng/STS12STS.py, line 22 in dba7a95

However, if we use a cross-encoder, the returned scores only contain `pearson` and `spearman`, so the `main_score` lookup fails:

mteb/mteb/abstasks/AbsTaskSTS.py, lines 93 to 94 in dba7a95

Therefore, to resolve this, I think we have three options:
Although this is the easiest, this would technically be wrong.

```python
scores["main_score"] = scores.get(self.metadata.main_score, scores["spearman"])
```

These are roughly my suggestions, but if you guys have anything else in mind, please let me know. I can make changes based on whichever option you decide and then submit a PR accordingly.
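To make the failure mode described above concrete, here is a rough paraphrase of the `main_score` lookup (simplified and with placeholder values; the exact lines are in the AbsTaskSTS.py permalinks above):

```python
# Simplified paraphrase, not the literal mteb code.
scores = {"pearson": 0.71, "spearman": 0.70}  # placeholder values from the cross-encoder branch
main_score = "cosine_spearman"                # metadata.main_score for most STS tasks

try:
    scores["main_score"] = scores[main_score]
except KeyError as err:
    # The cross-encoder branch never computes cosine scores, so the lookup fails.
    print(f"missing key: {err}")
```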
-
Thanks a lot for your effort so far! I would probably go with setting all non-valid operations to
-
Great work here! This seems ready for a PR (where we can then do the final polish). A few things I would change:

This also requires a few changes to the default scores in tasks, as you say. Most currently use "cosine_spearman"; I believe we can change it to e.g. "spearman" (given these are essentially just "cosine_spearman" in most cases). However, I would love @isaac-chung's and @orionw's opinions here as well. (I think you can go ahead with the PR though - that also makes the effect of the changes easier to judge.)

If we want to keep it fully backward compatible, it would require making v2 of almost all STS tasks.
-
CrossEncoders don’t have an
-
Yeah, this is part of why I only added cross-encoders to Retrieval + Reranking. Plus, I don't think people typically use cross-encoders for STS/BiText Mining/Summarization, etc.

I'm with @Samoed in that I don't like having the metric change; if I had to pick, I would probably keep the same name, even though technically it's not a cosine_spearman. Otherwise we need to shift everything (but I suppose before v2 is a good time for that, so maybe we should just do it quickly?).

If we're going to start adding cross-encoders to every MTEB abstract task, each one will have a unique evaluation process like this. Then if we add sparse models, generative retrieval, late interaction... we will need a lot of these blocks for each one, and I don't think we could get around it because each will need its own thing. If we're going to add new model types to MTEB, it might be worth restructuring it so we can add more than just cross-encoders -- perhaps we have each abstract task call something like a single shared scoring method (see the sketch below).

On the flip side, we could also just not support it for those classes -- most people just want retrieval and reranking for cross-encoders and such.
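Purely as an illustration of the kind of restructuring described above (all names here, `PairScorer`, `pairwise_scores`, and the adapters, are hypothetical and not part of mteb's API): each abstract task could depend on a single scoring interface, and each model type could provide an adapter for it.

```python
from typing import Protocol, Sequence

import numpy as np


class PairScorer(Protocol):
    """Hypothetical interface: one similarity score per (sentence1, sentence2) pair."""

    def pairwise_scores(self, sentences1: Sequence[str], sentences2: Sequence[str]) -> np.ndarray: ...


class BiEncoderScorer:
    """Adapter for embedding models: score = cosine similarity of the two embeddings."""

    def __init__(self, model):
        self.model = model

    def pairwise_scores(self, sentences1, sentences2):
        e1 = np.asarray(self.model.encode(list(sentences1)))
        e2 = np.asarray(self.model.encode(list(sentences2)))
        return (e1 * e2).sum(axis=1) / (np.linalg.norm(e1, axis=1) * np.linalg.norm(e2, axis=1))


class CrossEncoderScorer:
    """Adapter for cross-encoders: score = model.predict on the raw sentence pairs."""

    def __init__(self, model):
        self.model = model

    def pairwise_scores(self, sentences1, sentences2):
        return np.asarray(self.model.predict(list(zip(sentences1, sentences2))))
```

A task's evaluator would then only call `pairwise_scores`, so supporting sparse, late-interaction, or generative models would mean writing one adapter rather than adding a new branch to every abstract task.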
-
I agree with @orionw here. Just because "Currently, Cross-Encoders are not supported when using STS" does not automatically mean there is a good reason to add support. So far, no evidence has been provided here in support of such a case, so I'd lean toward not pursuing this avenue at this time.

Given the interest in using cross encoders in this issue/discussion, I'd suggest working on this long-overdue issue first if anyone has capacity to contribute: #1214. Reranking is a main use case for cross encoders, and having that supported will complete the missing piece in the library. We can likely reuse some of the points discussed here on that linked issue as well.

For context, I believe our current priorities are:

After which we can look more into enhancements and improvements.
-
Thanks, @sam-hey @KennethEnevoldsen @orionw @isaac-chung and @Samoed for your comments. I was hoping to get a bit more clarity on the next steps here since I see some conflicting/confusing suggestions. I think we basically have these two suggestions currently:
In case you guys want me to do a PR for now, I have the code ready with the changes that @KennethEnevoldsen suggested. However, I understand if the current priorities are different. I am completely fine with whatever you guys decide, so please let me know your decision. 🙂
-
Currently, Cross-Encoders are not supported when using STS, and all models are treated as if they generate embeddings. This leads to suboptimal results without raising any errors.
Including support for Cross-Encoders would be a valuable enhancement.
- `sentence-transformers/msmarco-MiniLM-L-6-v3`
- `cross-encoder/msmarco-MiniLM-L6-en-de-v1`
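A minimal sketch of the reported behaviour, assuming the cross-encoder above is loaded by name via `mteb.get_model` and evaluated on an STS task such as STS12 (the exact behaviour and scores depend on the mteb version in use):

```python
import mteb

# Per the report above, the cross-encoder checkpoint ends up wrapped like an
# embedding model, so the STS run completes without errors, but the resulting
# scores are not meaningful for a cross-encoder.
model = mteb.get_model("cross-encoder/msmarco-MiniLM-L6-en-de-v1")
tasks = mteb.get_tasks(tasks=["STS12"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model)
```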