Generalized Thresholding for ColBERT Scores Across Datasets #383

Closed

FaisalAliShah opened this issue Oct 31, 2024 · 4 comments

FaisalAliShah commented Oct 31, 2024

I’m currently working with ColBERT for document re-ranking and facing challenges in applying a generalized threshold to ColBERT scores across different datasets. Due to the variability in score ranges, it’s difficult to set a fixed threshold for relevance filtering. Unlike typical embedding similarity scores, ColBERT’s late interaction mechanism produces scores that can vary significantly based on query length, token distributions, and dataset characteristics.

I tried min-max normalization on the scores returned for a particular query, but it turns out that even an irrelevant search still yields results, because I was selecting min_score and max_score from the query responses themselves.

Here are some of the approaches I’ve considered, but each has limitations when applied generally:

Normalizing scores by query length or token count
Rescaling scores based on observed min-max values in different datasets
Z-score normalization based on empirical mean and variance across datasets
Using adaptive thresholds or lightweight classifiers to predict relevance
However, each approach tends to be dataset-specific, and I would like a solution that can generalize effectively across datasets. Do you have any recommended strategies for achieving a more standardized scoring range or threshold? Alternatively, is there any built-in functionality planned (or that I might have missed) for scaling or calibrating ColBERT scores in a more generalizable way?
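For concreteness, here is a rough sketch of the first two approaches above (the helper names and the example scores input are purely illustrative, not how my pipeline is actually structured):

from typing import List

def normalize_by_query_tokens(scores: List[float], num_query_tokens: int) -> List[float]:
    # ColBERT's summed MaxSim score grows roughly with the number of query tokens,
    # so dividing by the token count puts queries of different lengths on a more
    # comparable scale.
    return [s / max(num_query_tokens, 1) for s in scores]

def min_max_rescale(scores: List[float]) -> List[float]:
    # Rescales the scores of a single response set into [0, 1]. The pitfall I hit:
    # the top hit always maps to 1.0, even when nothing in the set is relevant.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]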

Any guidance or suggestions would be greatly appreciated! I have attached my code snippet below showing how I am using it.

# `models` is qdrant_client.models; this is a fragment of my retrieval method.
prefetch = [
    models.Prefetch(
        query=dense_embedding,
        using=dense_vector_name,
        limit=20,
    ),
    models.Prefetch(
        query=sparse_embedding,
        using=sparse_vector_name,
        limit=20,
    ),
]

# Hybrid mode prefetches dense + sparse candidates and re-ranks them with ColBERT;
# dense mode queries the dense vectors directly.
search_results = self.qdrant_client.query_points(
    collection_name=kwargs["collection_name"],
    prefetch=prefetch if Config.RETRIEVAL_MODE == QdrantSearchEnums.HYBRID.value else None,
    query=dense_embedding if Config.RETRIEVAL_MODE == QdrantSearchEnums.DENSE.value else colbert_embedding,
    using=dense_vector_name if Config.RETRIEVAL_MODE == QdrantSearchEnums.DENSE.value else colbert_vector_name,
    with_payload=True,
    limit=10,
    # score_threshold=17,  # a fixed threshold like this does not generalize across datasets
).points
return search_results
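For reference, the kind of post-filtering I have in mind would look roughly like this (a sketch only: it assumes colbert_embedding holds one vector per query token, and the 0.7 cut-off is an arbitrary placeholder, not a recommended value):

# Hypothetical post-processing of the points returned above.
query_token_count = len(colbert_embedding)  # assumes one multivector row per query token
filtered = [
    point for point in search_results
    if point.score / max(query_token_count, 1) >= 0.7  # placeholder cut-off
]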
joein (Member) commented Oct 31, 2024

Hi @FaisalAliShah

In general, we don't practice thresholds, since it is indeed hard to select a proper one: it can differ significantly not only across datasets, but even within a single dataset.
This question is also more model-specific than tool-specific (qdrant / fastembed); I think there is a higher chance of finding the answer you're looking for in ColBERT's repository.

AshutoshSabic commented Nov 6, 2024

Hi @FaisalAliShah!

I am facing the same challenge. I had also tried 'Normalizing scores by query length or token count', but for a lot of queries I get a normalized score >1 on the same dataset.

Did you experience something similar? I am looking to set the threshold for a single dataset. Would appreciate any help.

I am also skeptical about the other approaches:

Rescaling scores based on observed min-max values in different datasets
Z-score normalization based on empirical mean and variance across datasets
Using adaptive thresholds or lightweight classifiers to predict relevance

I have added more details regarding my challenge here: stanford-futuredata/ColBERT#374

FaisalAliShah (Author) commented

Hi @joein,

Thanks for your kind reply. I have opened an issue in the ColBERT repository as well; hopefully I will hear back from them.

Regards,

Faisal Ali

FaisalAliShah (Author) commented

Hi @AshutoshSabic,

Yes, you're right. I considered normalizing the score by the query token count as well, but as you mentioned, it would still result in scores >1 in most cases. This is primarily because query token counts typically range from 10 to 20, while ColBERT generally returns scores >20 for good matches. I also tried min-max normalization, but it had its drawbacks: since I was selecting the min and max from the scores in the responses, it would yield a score of 1 (a 100% match) even when there was no actual match in the data. For now, I've applied a fixed threshold to the scores returned by ColBERT, but I'm still exploring better solutions.
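To make the min-max drawback concrete with made-up numbers: both result sets end up with a "perfect" 1.0 for their top hit, even though the second query matched nothing relevant.

relevant_scores = [28.4, 24.1, 19.7]    # good matches, raw ColBERT scores > 20
irrelevant_scores = [9.3, 8.9, 8.6]     # no real match, uniformly low scores

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

print(min_max(relevant_scores))    # ≈ [1.0, 0.51, 0.0]
print(min_max(irrelevant_scores))  # ≈ [1.0, 0.43, 0.0] -- still 1.0 at the top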
