Add chunking script #13

Merged
jacobthebanana merged 5 commits into main from add_chunking_script on Aug 5, 2025
Conversation

@fcogidi (Collaborator) commented Jul 29, 2025

PR Type

Feature.

Short Description

Add a utility script for chunking a Hugging Face dataset. The dataset is expected to have a "text" column, which is chunked token-wise using the specified tokenizer. The script can optionally upload the chunked dataset to the Hugging Face Hub.
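
For readers who have not seen the script, token-wise chunking of a "text" column can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: `chunk_text`, `chunk_batch`, the `chunk_size` and `overlap` parameters, and the `WhitespaceTokenizer` stand-in are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List


def chunk_text(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    chunk_size: int,
    overlap: int = 0,
) -> List[str]:
    """Split `text` into pieces of at most `chunk_size` tokens (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    token_ids = encode(text)
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(token_ids), step):
        chunks.append(decode(token_ids[start : start + chunk_size]))
        if start + chunk_size >= len(token_ids):
            break  # this window already reached the end of the text
    return chunks


def chunk_batch(
    batch: Dict[str, List[str]],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    chunk_size: int,
    overlap: int = 0,
) -> Dict[str, List[str]]:
    """Batched form: each input row in the "text" column may yield several output rows."""
    out: List[str] = []
    for text in batch["text"]:
        out.extend(chunk_text(text, encode, decode, chunk_size, overlap))
    return {"text": out}


class WhitespaceTokenizer:
    """Toy stand-in for a real tokenizer so the sketch runs without downloads."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}
        self.words: List[str] = []

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab[word] = len(self.words)
                self.words.append(word)
            ids.append(self.vocab[word])
        return ids

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.words[i] for i in ids)


tok = WhitespaceTokenizer()
print(chunk_text("a b c d e", tok.encode, tok.decode, chunk_size=2))
```

With a real Hugging Face tokenizer, one would presumably pass `tokenizer.encode` and `tokenizer.decode` in place of the toy methods, and apply `chunk_batch` through `Dataset.map(..., batched=True, remove_columns=...)` so that the number of rows is allowed to change. Decoding and re-encoding a chunk does not always reproduce the same token IDs, so chunk lengths should be treated as approximate.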

Tests Added

None

@fcogidi requested a review from jacobthebanana on July 29, 2025 19:44
@fcogidi self-assigned this on Jul 29, 2025
@jacobthebanana merged commit 38d343f into main on Aug 5, 2025
4 checks passed
2 participants