This repository provides a comprehensive implementation of various BM25 variants in the Go programming language. BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query. This implementation is inspired by and builds upon the work done by Dorian Brown in their Python implementation of BM25 algorithms.
This repository includes implementations of the following BM25 variants:
- Okapi BM25
- BM25L
- BM25+
- BM25-Adpt
- BM25T
These variants are based on the research paper "A Study of Efficient and Robust IR Metrics" by Luca Pinto, Diego Ceccarelli, and Claudio Lucchese, which provides an overview and benchmarks of each method.
To use this BM25 implementation, you need to have Go installed on your system. You can download and install Go from the official website: https://golang.org/dl/
Once you have Go installed, you can clone this repository and build the package:
git clone https://github.com/github.com/crawlab-team/bm25.git
cd bm25-golang
go build
To initialize a BM25 instance, you need to provide a corpus of text documents and a tokenizer function. The tokenizer function is responsible for splitting a document into individual tokens (e.g., words).
Here's an example of how to initialize a BM25 instance using the BM25Okapi
variant:
import (
"strings"
"github.com/crawlab-team/bm25"
)
corpus := []string{
"Hello there good man!",
"It is quite windy in London",
"How is the weather today?",
}
tokenizer := func(s string) []string {
return strings.Split(s, " ")
}
bm25, err := bm25.NewBM25Okapi(corpus, tokenizer, nil)
if err != nil {
// Handle error
}
In this example, we define a corpus of three text documents and a simple tokenizer function that splits the text on whitespace characters. We then create a new instance of BM25Okapi
using the NewBM25Okapi
function, passing in the corpus, tokenizer, and a logger (which can be nil
if you don't need logging).
Once you have initialized a BM25 instance, you can use it to rank documents based on their relevance to a given query. Here's an example:
query := "windy London"
tokenizedQuery := tokenizer(query)
scores, err := bm25.GetScores(tokenizedQuery)
if err != nil {
// Handle error
}
// Scores is now a slice of float64 values representing the relevance scores
// for each document in the corpus.
In this example, we define a query string "windy London"
and tokenize it using the same tokenizer function we used for the corpus. We then call the GetScores
method on the BM25Okapi
instance, passing in the tokenized query. The GetScores
method returns a slice of float64
values representing the relevance scores for each document in the corpus.
Alternatively, you can use the GetTopN
method to retrieve the top N
most relevant documents:
topN := 1
topDocs, err := bm25.GetTopN(tokenizedQuery, topN)
if err != nil {
// Handle error
}
// topDocs is now a slice of strings containing the top N most relevant documents.
In this example, we call the GetTopN
method on the BM25Okapi
instance, passing in the tokenized query and the value 1
for topN
. The GetTopN
method returns a slice of strings containing the top N
most relevant documents.
This implementation also provides parallel and batched computation methods for improved performance when dealing with large corpora or many queries. These methods include:
GetScoresParallel
: Computes the BM25 scores for a given query using parallel computation.GetBatchScoresParallel
: Computes the BM25 scores for a given query and a subset of documents using parallel computation.GetTopNParallel
: Returns the topN
documents for a given query using parallel computation.GetScoresBatched
: Computes the BM25 scores for a given query using parallel computation with batching.GetBatchScoresBatched
: Computes the BM25 scores for a given query and a subset of documents using parallel computation with batching.GetTopNBatched
: Returns the topN
documents for a given query using parallel computation with batching.
These methods follow a similar usage pattern as their non-parallel and non-batched counterparts, but they provide improved performance by leveraging Go's concurrency features and batching techniques.
For more detailed examples and usage scenarios, please refer to the examples/
directory in this repository.
Contributions to this project are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request. Make sure to follow the established coding conventions and provide appropriate tests for any new features or bug fixes.
This project is licensed under the MIT License.
This BM25 implementation in Go is inspired by and builds upon the work done by Dorian Brown in their Python implementation of BM25 algorithms. We would like to express our gratitude for their valuable contribution to the field of information retrieval and for providing a solid foundation for this Go implementation.