questcollector

When generating a testset, the transforms count tokens using tiktoken.
However, if a document includes special tokens like '', tiktoken raises an error.
To prevent this, this PR adds the parameter "disallowed_special=()" to the tiktoken encode calls.
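
The failure and the fix can be reproduced with a small helper. This is a sketch rather than code from the PR: `compare_encode_modes` is a hypothetical name, and `<|endoftext|>` in the usage note is an assumed example of a special token (the description above does not preserve the one that triggered the error). The `encoding` argument is expected to be a tiktoken `Encoding`, whose `encode` method defaults to `disallowed_special="all"` and raises `ValueError` when the text matches a special token.

```python
from typing import Optional, Tuple


def compare_encode_modes(text: str, encoding) -> Tuple[Optional[str], int]:
    """Encode `text` twice and report what happens in each mode.

    Returns (error_message_or_None, token_count_with_check_disabled).
    `encoding` is assumed to behave like a tiktoken Encoding.
    """
    try:
        # Default mode: tiktoken rejects text that matches a special token.
        encoding.encode(text)
        default_error = None
    except ValueError as exc:
        default_error = str(exc)

    # An empty disallowed set turns the check off, so special-token text is
    # tokenized like any other text.
    token_count = len(encoding.encode(text, disallowed_special=()))
    return default_error, token_count
```

With tiktoken installed, calling `compare_encode_modes("intro <|endoftext|> outro", tiktoken.get_encoding("cl100k_base"))` should return an error message for the default call together with a token count from the second call.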

@dosubot (bot) added the size:XS label (This PR changes 0-9 lines, ignoring generated files.) on Jul 4, 2025
@greptile-apps greptile-apps bot left a comment

PR Summary

Fixed tiktoken encoding errors by adding disallowed_special=() parameter when processing text containing special tokens like '' during testset generation.

  • Modified ragas/src/ragas/testset/transforms/base.py to handle special tokens in LLMBasedExtractor.split_text_by_token_limit()
  • Updated ragas/src/ragas/utils.py to add special token handling in num_tokens_from_string() utility
  • Both changes prevent tiktoken from raising errors when encountering special tokens in documents
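
For the first bullet, a token-limit splitter along the following lines shows where the parameter matters. This is an assumed sketch, not the actual body of LLMBasedExtractor.split_text_by_token_limit(): the standalone function, its parameter names, and the explicit `encoding` argument are illustrative. It relies only on tiktoken's `Encoding.encode` and `Encoding.decode`.

```python
from typing import List


def split_text_by_token_limit(text: str, max_token_limit: int, encoding) -> List[str]:
    """Split `text` into chunks of at most `max_token_limit` tokens.

    `encoding` is assumed to be a tiktoken Encoding, for example
    tiktoken.get_encoding("cl100k_base").
    """
    if max_token_limit <= 0:
        raise ValueError("max_token_limit must be positive")

    # Without disallowed_special=(), this call would raise ValueError for
    # documents that contain special-token text.
    tokens = encoding.encode(text, disallowed_special=())

    # Decode fixed-size token windows back into text. A window boundary can
    # fall inside a multi-byte character, in which case decoding may yield a
    # replacement character at the edge of a chunk.
    return [
        encoding.decode(tokens[start:start + max_token_limit])
        for start in range(0, len(tokens), max_token_limit)
    ]
```

Passing the encoding in explicitly keeps the sketch independent of how the extractor obtains its tokenizer.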

2 files reviewed, 1 comment

Comment on lines 225 to 229
 def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
     """Returns the number of tokens in a text string."""
     encoding = tiktoken.get_encoding(encoding_name)
-    num_tokens = len(encoding.encode(string))
+    num_tokens = len(encoding.encode(string, disallowed_special=()))
     return num_tokens
Contributor

style: Add a docstring explaining the disallowed_special parameter and why it is set to an empty tuple. This helps future maintainers understand the reasoning behind this configuration.

Member

@claude add this comment to explain it
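
One possible wording for the requested explanation is sketched below. The docstring text is a suggestion only and is not taken from the PR; the function signature matches the snippet under review, and the in-function import is an assumption made to keep the example self-contained.

```python
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """Return the number of tokens in a text string.

    The text is encoded with ``disallowed_special=()``. By default tiktoken
    raises ``ValueError`` when the input contains text that matches a special
    token, which can happen with arbitrary user documents. Passing an empty
    tuple disables that check, so such text is counted as ordinary tokens
    instead of aborting testset generation.
    """
    # Imported here only to keep this sketch self-contained; a real module
    # would normally import tiktoken at the top of the file.
    import tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string, disallowed_special=()))
```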
