BOS/EOS/PAD options in tokens cli; speed up tokenization by segmenting paragraphs. #102

Merged
merged 5 commits into main from soldni/tokenizer
Jan 20, 2024

Conversation

soldni (Member) commented Jan 20, 2024

  • Revamped flags for the tokenizer CLI. It now supports providing BOS, EOS, and PAD tokens.
  • Added a deprecation warning for the older CLI flags.
  • Added tests for the tokenizer.
  • Added an experimental feature to split paragraphs before tokenization (a rough sketch follows below). This significantly speeds up tokenizers such as Llama and Mixtral by working around the inefficient `Replace` normalizer in HuggingFace's `tokenizers` library. The workaround should no longer be needed once the corresponding upstream PR is merged.

soldni merged commit 45b5eea into main on Jan 20, 2024
13 checks passed
soldni deleted the soldni/tokenizer branch on January 20, 2024 23:06