Releases: xhluca/bm25s

0.2.10

22 Mar 00:36
0bd6056

What's Changed

  • fix: update tokenize docstring to avoid SyntaxWarning - invalid escape sequence \w by @yaminivibha in #124
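For context, this kind of warning appears when a regular (non-raw) string literal, such as a docstring, contains a backslash sequence like \w that Python does not recognize. The sketch below is illustrative only and is not taken from the bm25s source; the helper name compile_warnings and the sample docstrings are hypothetical. It compiles two tiny snippets and records the warning categories each one triggers (SyntaxWarning on Python 3.12 and later, DeprecationWarning on older versions).

```python
import warnings


def compile_warnings(source: str) -> list:
    """Compile `source` and return the categories of any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        compile(source, "<example>", "exec")
    return [w.category for w in caught]


# Hypothetical docstrings. The backslash is doubled here only so that the
# *compiled* source contains a single backslash followed by "w".
plain_source = 'def f():\n    """Match \\w+ tokens."""\n'
raw_source = 'def f():\n    r"""Match \\w+ tokens."""\n'

plain_categories = compile_warnings(plain_source)
raw_categories = compile_warnings(raw_source)

print("plain docstring:", [c.__name__ for c in plain_categories])
print("raw docstring:  ", [c.__name__ for c in raw_categories])
```

Prefixing the docstring with r (or doubling the backslash) keeps the text identical for readers while leaving the compiler nothing to warn about.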

Full Changelog: 0.2.9...0.2.10

0.2.9

11 Mar 03:14
e06b4b0

Full Changelog: 0.2.8...0.2.9

0.2.8

08 Mar 22:08
c4d34f2

Full Changelog: 0.2.7...0.2.8

0.2.7post1

16 Jan 05:16
bff5ad3

Notes

The way tokenizers handle the null token has changed. The null token is now added to the vocab first rather than last, because the previous approach was inconsistent with the general convention that the "" string maps to 0. The change is backward compatible in that tokenizers from either side of it still work with the retriever object, but expect tokenizers from before 0.2.7 to behave differently from those in 0.2.7 and later.
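To make the ordering change concrete, here is a minimal sketch of a vocab builder that places the null token either first or last. It is not the bm25s tokenizer; the function build_vocab and its null_first flag are hypothetical names used only to illustrate the two orderings described in the note.

```python
def build_vocab(tokens, null_first=True):
    """Map each distinct token to an integer ID in order of first appearance.

    Hypothetical helper: `null_first=True` mimics the ordering described for
    0.2.7 and later, `null_first=False` mimics the earlier ordering.
    """
    vocab = {}
    if null_first:
        vocab[""] = 0  # reserve ID 0 for the null token
    for tok in tokens:
        if tok != "" and tok not in vocab:
            vocab[tok] = len(vocab)
    if not null_first:
        vocab[""] = len(vocab)  # null token appended at the end
    return vocab


tokens = ["cat", "dog", "cat", "bird"]
new_style = build_vocab(tokens, null_first=True)
old_style = build_vocab(tokens, null_first=False)

print(new_style)  # {'': 0, 'cat': 1, 'dog': 2, 'bird': 3}
print(old_style)  # {'cat': 0, 'dog': 1, 'bird': 2, '': 3}
```

In this toy example every word's ID shifts by one between the two orderings, which is the kind of difference to expect when comparing vocabularies produced before and after the change.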

Full Changelog: 0.2.6...0.2.7

0.2.7pre3

15 Jan 19:26
813fcdf
Pre-release

Full Changelog: 0.2.7pre2...0.2.7pre3

0.2.7pre2

09 Jan 03:05
ec0bcff
Pre-release

Full Changelog: 0.2.7pre1...0.2.7pre2

0.2.7pre1

29 Dec 01:18
6dfb6ce
Pre-release

Notes

  • The way tokenizers handle the null token has changed. The null token is now added to the vocab first rather than last, because the previous approach was inconsistent with the general convention that the "" string maps to 0. The change is backward compatible in that tokenizers from either side of it still work with the retriever object, but expect tokenizers from before 0.2.7 to behave differently from those in 0.2.7 and later.

Full Changelog: 0.2.6...0.2.7

0.2.6

23 Dec 23:01
ce8f886

What's Changed

  • Extending to Non-ASCII characters with corpora loading and saving by @IssacXid in #93
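As a generic illustration of what saving and loading non-ASCII text involves, the sketch below round-trips a small corpus through a JSON Lines file. It does not reproduce the bm25s implementation: the file name, the sample corpus, and the choice of json with ensure_ascii=False and an explicit UTF-8 encoding are all assumptions made for the example.

```python
import json
import os
import tempfile

# Hypothetical corpus mixing accented Latin, Japanese, and plain ASCII text.
corpus = ["naïve café", "東京タワー", "plain ascii"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.jsonl")

    # Write one JSON string per line. ensure_ascii=False keeps the characters
    # as-is instead of escaping them, so the file must be opened as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        for doc in corpus:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")

    # Read the raw file contents, then parse it back line by line.
    with open(path, "r", encoding="utf-8") as f:
        raw_text = f.read()
    with open(path, "r", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f]

print(loaded == corpus)
```

Passing the encoding explicitly on both the write and the read side avoids relying on the platform default, which is a common source of errors with non-ASCII data.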

Full Changelog: 0.2.5...0.2.6

0.2.5

26 Nov 17:00
c4fef24

What's Changed

  • Update README.md by @xhluca in #83
  • Added support for saving and loading non ASCII chars in corpus and vocab by @IssacXid in #86
  • Update README.md by @mrisher in #87

Full Changelog: 0.2.4...0.2.5

0.2.4

13 Nov 22:46
8b5ff10

What's Changed

  • Fix crash tokenizing with empty word_to_id by @mgraczyk in #72
  • Create nltk_stemmer.py by @aflip in #77

aa31a23: This commit improves the handling of unknown tokens during tokenization and retrieval, strengthens error handling, and improves logging for easier debugging.

  • bm25s/__init__.py: Added checks in the get_scores_from_ids method to raise a ValueError if max_token_id exceeds the number of tokens in the index. The _get_top_k_results method now handles empty queries by returning zero scores for all documents.
  • bm25s/tokenization.py: Fixed streaming_tokenize so that it correctly adds new tokens and updates word_to_id, word_to_stem, and stem_to_sid.
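The two retrieval-side behaviors described above (rejecting out-of-range token IDs and scoring an empty query as all zeros) can be sketched in a few lines. This is a toy stand-in, not the bm25s internals: check_token_ids, score_query, and the token_scores layout are hypothetical, and the sketch assumes 0-based token IDs, so any ID greater than or equal to the number of indexed tokens is treated as out of range.

```python
def check_token_ids(token_ids, num_tokens):
    """Raise ValueError if any token ID falls outside the index (assumed 0-based)."""
    if token_ids and max(token_ids) >= num_tokens:
        raise ValueError(
            f"max token id {max(token_ids)} is out of range "
            f"for an index with {num_tokens} tokens"
        )


def score_query(token_ids, token_scores, num_docs):
    """Sum per-token document scores; an empty query yields zeros for every document."""
    check_token_ids(token_ids, len(token_scores))
    scores = [0.0] * num_docs
    for tid in token_ids:
        for doc_idx, value in enumerate(token_scores[tid]):
            scores[doc_idx] += value
    return scores


# Hypothetical index: token_scores[token_id][doc_idx], 3 tokens and 2 documents.
token_scores = [
    [0.5, 0.0],
    [0.0, 1.0],
    [0.25, 0.25],
]

print(score_query([0, 2], token_scores, num_docs=2))  # [0.75, 0.25]
print(score_query([], token_scores, num_docs=2))      # [0.0, 0.0]

try:
    score_query([5], token_scores, num_docs=2)
except ValueError as err:
    print("rejected:", err)
```

Failing fast on an unknown ID surfaces a mismatch between tokenizer and index immediately, while the zero-score path lets callers pass empty queries without special-casing them.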

Full Changelog: 0.2.3...0.2.4