Generalized Thresholding for ColBERT Scores Across Datasets #383

Closed

FaisalAliShah opened this issue Oct 31, 2024 · 4 comments

FaisalAliShah commented Oct 31, 2024

I’m currently working with ColBERT for document re-ranking and facing challenges in applying a generalized threshold to ColBERT scores across different datasets. Due to the variability in score ranges, it’s difficult to set a fixed threshold for relevance filtering. Unlike typical embedding similarity scores, ColBERT’s late interaction mechanism produces scores that can vary significantly based on query length, token distributions, and dataset characteristics.

I tried min-max normalization on the scores returned for a particular query, but it turns out that even an irrelevant search still yields results, because I was selecting min_score and max_score from the query responses themselves.

Here are some of the approaches I’ve considered, but each has limitations when applied generally:

Normalizing scores by query length or token count
Rescaling scores based on observed min-max values in different datasets
Z-score normalization based on empirical mean and variance across datasets
Using adaptive thresholds or lightweight classifiers to predict relevance
However, each approach tends to be dataset-specific, and I would like a solution that can generalize effectively across datasets. Do you have any recommended strategies for achieving a more standardized scoring range or threshold? Alternatively, is there any built-in functionality planned (or that I might have missed) for scaling or calibrating ColBERT scores in a more generalizable way?
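For concreteness, here is a rough sketch of the first two approaches above (the helper names and the example scores input are purely illustrative, not how my pipeline is actually structured):

from typing import List

def normalize_by_query_tokens(scores: List[float], num_query_tokens: int) -> List[float]:
    # ColBERT's summed MaxSim score grows roughly with the number of query tokens,
    # so dividing by the token count puts queries of different lengths on a more
    # comparable scale.
    return [s / max(num_query_tokens, 1) for s in scores]

def min_max_rescale(scores: List[float]) -> List[float]:
    # Rescales the scores of a single response set into [0, 1]. The pitfall I hit:
    # the top hit always maps to 1.0, even when nothing in the set is relevant.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]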

Any guidance or suggestions would be greatly appreciated! I have attached my code snippet below showing how I am using it.

# `models` is qdrant_client.models; this is a fragment of my retrieval method.
prefetch = [
    models.Prefetch(
        query=dense_embedding,
        using=dense_vector_name,
        limit=20,
    ),
    models.Prefetch(
        query=sparse_embedding,
        using=sparse_vector_name,
        limit=20,
    ),
]

# Hybrid mode prefetches dense + sparse candidates and re-ranks them with ColBERT;
# dense mode queries the dense vectors directly.
search_results = self.qdrant_client.query_points(
    collection_name=kwargs["collection_name"],
    prefetch=prefetch if Config.RETRIEVAL_MODE == QdrantSearchEnums.HYBRID.value else None,
    query=dense_embedding if Config.RETRIEVAL_MODE == QdrantSearchEnums.DENSE.value else colbert_embedding,
    using=dense_vector_name if Config.RETRIEVAL_MODE == QdrantSearchEnums.DENSE.value else colbert_vector_name,
    with_payload=True,
    limit=10,
    # score_threshold=17,  # a fixed threshold like this does not generalize across datasets
).points
return search_results
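For reference, the kind of post-filtering I have in mind would look roughly like this (a sketch only: it assumes colbert_embedding holds one vector per query token, and the 0.7 cut-off is an arbitrary placeholder, not a recommended value):

# Hypothetical post-processing of the points returned above.
query_token_count = len(colbert_embedding)  # assumes one multivector row per query token
filtered = [
    point for point in search_results
    if point.score / max(query_token_count, 1) >= 0.7  # placeholder cut-off
]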
joein (Member) commented Oct 31, 2024

Hi @FaisalAliShah

In general, we don't practice thresholds, since it is indeed hard to select a proper one: it can differ significantly not only across datasets, but even within a single dataset.
This question is also more model-specific than tool-specific (qdrant / fastembed); I think there is a higher chance of finding the answer you're looking for in ColBERT's repository.

AshutoshSabic commented Nov 6, 2024

Hi @FaisalAliShah!

I am facing the same challenge. I had also tried 'Normalizing scores by query length or token count', but for a lot of queries I get a normalized score >1 on the same dataset.

Did you experience something similar? I am looking to set the threshold for a single dataset. Would appreciate any help.

I am also skeptical about the other approaches:

Rescaling scores based on observed min-max values in different datasets
Z-score normalization based on empirical mean and variance across datasets
Using adaptive thresholds or lightweight classifiers to predict relevance

I have added more details regarding my challenge here: stanford-futuredata/ColBERT#374

FaisalAliShah (Author) commented

Hi @joein,

Thanks for your kind reply. I have opened an issue in the ColBERT repository as well; hopefully I will hear back from them.

Regards,

Faisal Ali

FaisalAliShah (Author) commented

Hi @AshutoshSabic,

Yes, you're right. I considered normalizing the score by the query token count as well, but as you mentioned, it would still result in scores >1 in most cases. This is primarily because query token counts typically range from 10 to 20, while ColBERT generally returns scores >20 for good matches. I also tried min-max normalization, but it had its drawbacks: since I was selecting the min and max from the scores in the responses, it would yield a score of 1 (a 100% match) even when there was no actual match in the data. For now, I've applied a fixed threshold to the scores returned by ColBERT, but I'm still exploring better solutions.
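To make the min-max drawback concrete with made-up numbers: both result sets end up with a "perfect" 1.0 for their top hit, even though the second query matched nothing relevant.

relevant_scores = [28.4, 24.1, 19.7]    # good matches, raw ColBERT scores > 20
irrelevant_scores = [9.3, 8.9, 8.6]     # no real match, uniformly low scores

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

print(min_max(relevant_scores))    # ≈ [1.0, 0.51, 0.0]
print(min_max(irrelevant_scores))  # ≈ [1.0, 0.43, 0.0] -- still 1.0 at the top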
