Generalized Thresholding for ColBERT Scores Across Datasets #383
Comments
In general, we do not use thresholds, since it is indeed hard to select a proper one. It can differ significantly not only across datasets, but even within a single dataset.
Hi @FaisalAliShah! I am facing the same challenge. I also tried normalizing scores by query length or token count, but for many queries I get a normalized score >1 on the same dataset. Did you experience something similar? I am looking to set a threshold for a single dataset and would appreciate any help. I am also skeptical about other approaches, such as rescaling scores based on observed min-max values across datasets. I have added more details about my challenge here: stanford-futuredata/ColBERT#374
Hi @joein, thanks for your kind reply. I have opened the issue in the ColBERT repository as well; hopefully I will hear back from them too. Regards, Faisal Ali
Hi @AshutoshSabic, yes, you're right. I considered normalizing the score by the query token count as well, but as you mentioned, it would still result in scores >1 in most cases. This is primarily because query token counts typically range from 10 to 20, while ColBERT generally returns scores >20 for good matches. I also tried min-max normalization, but it had its drawbacks, since I was selecting the min and max from the scores in the responses; this approach yields a 1.0 (100% match) score even when there is no actual match in the data. For now, I've applied a plain threshold to the scores returned by ColBERT, but I'm still exploring better solutions for this!
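For illustration, a minimal sketch of the per-query-token normalization described above (the function and variable names are assumptions, not the original code). If the raw score also includes contributions from ColBERT's [MASK] query-augmentation tokens, dividing by only the visible query tokens is one plausible reason the ratio ends up above 1:

```python
def normalize_by_query_tokens(raw_score: float, num_query_tokens: int) -> float:
    """Divide a raw ColBERT (MaxSim) score by the query token count.

    With L2-normalized token embeddings, each query token contributes at most
    ~1.0 to the sum, so this ratio stays <= 1 only if num_query_tokens counts
    every token that contributed to the score (including any [MASK]
    augmentation/padding tokens). Counting only the 10-20 visible tokens while
    the model scored e.g. 32 tokens overshoots.
    """
    return raw_score / max(num_query_tokens, 1)

# With the ranges mentioned above: a "good match" score of ~25 divided by
# 15 visible query tokens gives ~1.67, i.e. > 1.
print(normalize_by_query_tokens(25.0, 15))
```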
I’m currently working with ColBERT for document re-ranking and facing challenges in applying a generalized threshold to ColBERT scores across different datasets. Due to the variability in score ranges, it’s difficult to set a fixed threshold for relevance filtering. Unlike typical embedding similarity scores, ColBERT’s late interaction mechanism produces scores that can vary significantly based on query length, token distributions, and dataset characteristics.
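To make the mechanism concrete, here is a rough numpy sketch of late-interaction (MaxSim) scoring (shapes and names are illustrative rather than tied to any specific ColBERT implementation). Each query token takes the maximum similarity over all document tokens, and those maxima are summed, so the raw score naturally grows with the number of query tokens:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score.

    query_emb: [num_query_tokens, dim] L2-normalized token embeddings
    doc_emb:   [num_doc_tokens, dim]   L2-normalized token embeddings
    """
    sim = query_emb @ doc_emb.T          # cosine similarity for every token pair
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

# Because the score is a sum over query tokens, its upper bound is roughly the
# number of query tokens, so score ranges shift with query length and dataset.
```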
I tried min-max normalization on the scores returned for a particular query, but it turns out that even when the search is irrelevant it still produces high normalized scores, because I was selecting min_score and max_score from that query's own responses.
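A small sketch of why this happens (variable names are illustrative): with per-query min-max scaling, the best-scoring hit is always mapped to exactly 1.0, regardless of whether any hit is actually relevant.

```python
def minmax_normalize(scores: list[float]) -> list[float]:
    """Min-max scaling using only the scores of a single query's results."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Even for a completely irrelevant query, the top hit is forced to 1.0:
print(minmax_normalize([7.2, 6.9, 6.5]))  # -> [1.0, 0.57..., 0.0]
```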
Here are some of the approaches I’ve considered, but each has limitations when applied generally:
Normalizing scores by query length or token count
Rescaling scores based on observed min-max values in different datasets
Z-score normalization based on empirical mean and variance across datasets (a rough sketch follows this list)
Using adaptive thresholds or lightweight classifiers to predict relevance
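To make the z-score idea above concrete, here is a rough sketch (the background-sample step and function names are assumptions, not an existing API): estimate the mean and standard deviation from a sample of scores on the target dataset, then apply a single threshold in standard-deviation units instead of raw score units.

```python
import numpy as np

def fit_score_stats(background_scores: list[float]) -> tuple[float, float]:
    """Estimate mean/std from a sample of ColBERT scores on the target dataset,
    e.g. top-k scores collected for a few hundred representative queries."""
    arr = np.asarray(background_scores, dtype=float)
    return float(arr.mean()), float(arr.std() + 1e-9)

def z_score(raw_score: float, mean: float, std: float) -> float:
    return (raw_score - mean) / std

# Usage: fit once per dataset, then share one threshold expressed in z-units.
mean, std = fit_score_stats([18.3, 21.7, 15.2, 24.9, 19.5])  # toy sample
relevant = [s for s in [26.1, 17.0, 22.4] if z_score(s, mean, std) > 1.0]
```

The per-dataset fitting step is still dataset-specific work, but it moves the dataset dependence into a small calibration sample rather than into the threshold itself.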
However, each approach tends to be dataset-specific, and I would like a solution that can generalize effectively across datasets. Do you have any recommended strategies for achieving a more standardized scoring range or threshold? Alternatively, is there any built-in functionality planned (or that I might have missed) for scaling or calibrating ColBERT scores in a more generalizable way?
Any guidance or suggestions would be greatly appreciated! I have attached my code snippet below showing how I am using it.