-
Notifications
You must be signed in to change notification settings - Fork 63
Truncated Printing of the Tokenization Results for issue #90 #121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Truncated Printing of the Tokenization Results for issue #90 #121
Conversation
Hi I'm quite busy with a submission so won't have a chance to review this for the next 2 weeks. Will review after that! |
Thank you very much for the heads up! Please take time! I hope to have a chance to learn from your feedback, so looking forward to it :) Thank you again and hope that you have all the best luck for your submission. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove CorpusIDsList? I do not believe it adds value. the issue #90 only applies to printing, rather than the actual returned object. There's no benefit to create a custom list object inherited from list as far as I can see in this scenario, but adds more complexity to the code. I prefer occam's razor here.
bm25s/tokenization.py
Outdated
@@ -634,5 +748,4 @@ def tokenize( | |||
) | |||
): | |||
corpus_ids[i] = [reverse_dict[token_id] for token_id in token_ids] | |||
|
|||
return corpus_ids | |||
return CorpusIDsList(corpus_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a good idea, because it changes the return signature and would thus create a breaking change. I also do not see the benefit of abstracting the returned object with another class. Can you revert it back to the original returned object and remove CorpusIDsList
?
7: [10, 11, 12, 13, 14] | ||
8: [15, 16, 17, 18, 19] | ||
9: [0, 1, 2, 3, 0, 20, 21, 22, 23, 24, ...] | ||
... (total 500000 docs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea: ... (total 500000 docs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
bm25s/tokenization.py:110
- [nitpick] Consider using the variable 'lines_print_max_num' instead of the hard-coded value '10' to maintain consistency in the truncation logic for displaying vocab items.
if len(list(vocab_keys)) > 10:
tests/core/test_tokenizer_misc.py
Outdated
corpus_ids = bm25s.tokenize(corpus, stopwords="en", return_ids=False) | ||
repr_corpus_ids = repr(corpus_ids) | ||
self.assertEqual(repr_corpus_ids, str(repr_corpus_ids)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In test_truncation_of_large_corpus, the assertion compares repr_corpus_ids to its own string conversion, which is always true. It should compare repr_corpus_ids to str(corpus_ids) to correctly validate the repr output.
self.assertEqual(repr_corpus_ids, str(repr_corpus_ids)) | |
self.assertEqual(repr_corpus_ids, str(corpus_ids)) |
Copilot is powered by AI, so mistakes are possible. Review output carefully before use.
tests/core/test_tokenizer_misc.py
Outdated
corpus_ids = bm25s.tokenize(corpus, stopwords="en", return_ids=False) | ||
repr_corpus_ids = repr(corpus_ids) | ||
self.assertEqual(repr_corpus_ids, str(repr_corpus_ids)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In test_truncation_of_small_corpus, the assertion mistakenly compares repr_corpus_ids to its own string conversion instead of comparing it to str(corpus_ids). Adjusting this check will better verify the intended repr behavior.
self.assertEqual(repr_corpus_ids, str(repr_corpus_ids)) | |
self.assertEqual(repr_corpus_ids, str(corpus_ids)) |
Copilot is powered by AI, so mistakes are possible. Review output carefully before use.
thank you for such a kind and helpful advice, I really appreciate it. I will try to read through the suggestion and try to follow the advice and the guidance. Please bear with me because I am rather a novice at OSS. Again, thank you very much! |
No worries, take your time! |
Thank you again, for taking the time, to review the code and to provide your valuable feedback. I’ve pushed two new commits to address your feedback on this pull request. Specifically:
The added new commits are:
If there’s anything else you’d like me to refine or edit or clarify, please let me know. Thank you once again for your thoughtful feedback and support; I really appreciate it. Looking forward to your thoughts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Hi! Attached please find the possible fix for issue #90
Thank you for considering my submission, and I truly appreciate it.
Summary
Added
__repr__
method forTokenized
class atbm25s/tokenization.py
,and added
CorpusIDsList
class that inherits the built in class of pythonlist
,in order to make the printing of the tokenization results of the large corpus as follows.
return_ids=True
return_ids=False
Details
bm25s/tokenization.py
return_ids=True
At
Tokenized
class, added__repr__
that can print the pretty output. It was inspired from the examples of pandas’ DataFrame or HuggingFace’s truncated output of tensors. The package pandas uses some formatting helpers to set each of the lines. I tried to implement such methods, but for current purpose, the helper module was deemed possibly excessive; hence the current version. I reckoned that I would use tab instead of the spaces, but thought that spaces would be better for uniform representation over diverse environments.return_ids=False
Added a new class
CorpusIDsList
that extends Python's native list, so that the returned types would be consistent. The name was chosen, following the original return values, for the coherence of the reading.tests/core/test_tokenizer_misc.py
Added tests (
test_truncation_of_large_corpus
andtest_truncation_of_small_corpus
), in order to check the marker of truncation, ...]
or... (total
.Some more details
The reproduction code that I have used for the issue was as follows, using the examples in the original readme.
I have ran the local tests at
tests/core
, to confirm the added lines would not conflict with the original code base. If there are further improvements or alternative approaches you’d prefer, please let me know; I’m happy to edit or modify this further following your feedback, and I am looking forward to it.