-
Notifications
You must be signed in to change notification settings - Fork 17.7k
text-splitters: Fix regex separator merge bug in CharacterTextSplitter #31137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
text-splitters: Fix regex separator merge bug in CharacterTextSplitter #31137
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, thanks for this. I'm wondering what you think of the following example:
from langchain_text_splitters import CharacterTextSplitter
separator="apple"
splitter = CharacterTextSplitter(
separator=separator,
chunk_size=200,
chunk_overlap=0,
keep_separator=False,
is_separator_regex=True,
)
splitter.split_text("There is an apple on the plate.")
Currently this will output
['There is an apple on the plate.']
regardless of the value of is_separator_regex
or keep_separator
. That is, the behavior is such that if we're not splitting, the separators are ignored.
On your branch, we will get
['There is an apple on the plate.']
if is_separator_regex=False
, and
['There is an on the plate.']
if is_separator_regex=True
. This is arguably introducing an inconsistency (though I agree the current behavior is problematic and warrants a fix).
I'm wondering if it's possible to fix the issue of introducing erroneous lookahead patterns, without otherwise changing behavior for users.
…of lookaround regex separators
Thank you for taking the time to review ! I realize I missed preserving the original consistency, so I’ve updated the code to maintain the existing split logic while still fixing the erroneous lookaround re-insertion. I’d appreciate it if you could take another look and let me know of any further improvements or errors I should address. |
Description:
Fix the merge logic in
CharacterTextSplitter.split_text
so that when using a regex lookahead separator (is_separator_regex=True
) withkeep_separator=False
, the raw pattern is not re-inserted between chunks.Issue:
Fixes #31136
Dependencies:
None
Twitter handle:
None
Since this is my first open-source PR, please feel free to point out any mistakes, and I'll be eager to make corrections.