text-splitters: Fix regex separator merge bug in CharacterTextSplitter #31137

Merged

suminnnnn
Contributor

Description:
Fix the merge logic in CharacterTextSplitter.split_text so that when using a regex lookahead separator (is_separator_regex=True) with keep_separator=False, the raw pattern is not re-inserted between chunks.
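
To make the failure mode concrete, here is a minimal sketch that does not use LangChain at all. `naive_split_and_merge` is a hypothetical, simplified stand-in for a split-then-merge step (the real `split_text` and its merge helper do more, such as overlap handling), and the separator pattern and sample text are made up for illustration.

```python
import re


def naive_split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Hypothetical model: split on a regex, then greedily re-join the pieces
    using the raw separator string as glue."""
    # Splitting on a zero-width lookahead consumes no characters,
    # so every piece still contains its own "Step N" prefix.
    pieces = [p for p in re.split(separator, text) if p]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        # The problematic step: the regex source text is used as the joiner.
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "Step 1 mix. Step 2 bake. Step 3 cool."
chunks = naive_split_and_merge(text, r"(?=Step \d)", chunk_size=100)
print(chunks)  # the single merged chunk contains the literal text "(?=Step \d)"
```

Because the lookahead matches an empty string, nothing was removed from the text during the split, so gluing the pieces back together with the pattern adds characters that were never in the input.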

Issue:
Fixes #31136

Dependencies:
None

Twitter handle:
None

Since this is my first open-source PR, please feel free to point out any mistakes, and I'll be eager to make corrections.

@dosubot dosubot bot added the size:M, Ɑ: text splitters, and 🤖:bug labels May 6, 2025
Collaborator

@ccurme ccurme left a comment

Hello, thanks for this. I'm wondering what you think of the following example:

```python
from langchain_text_splitters import CharacterTextSplitter

separator = "apple"

splitter = CharacterTextSplitter(
    separator=separator,
    chunk_size=200,
    chunk_overlap=0,
    keep_separator=False,
    is_separator_regex=True,
)

splitter.split_text("There is an apple on the plate.")
```

Currently this will output

['There is an apple on the plate.']

regardless of the value of is_separator_regex or keep_separator. That is, the behavior is such that if we're not splitting, the separators are ignored.

On your branch, we will get

['There is an apple on the plate.']

if is_separator_regex=False, and

['There is an  on the plate.']

if is_separator_regex=True. This is arguably introducing an inconsistency (though I agree the current behavior is problematic and warrants a fix).

I'm wondering if it's possible to fix the issue of introducing erroneous lookahead patterns, without otherwise changing behavior for users.
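
The difference between the two outputs above comes down to which string the pieces are re-joined with. A minimal sketch using only Python's `re` module (not the splitter itself) illustrates this:

```python
import re

text = "There is an apple on the plate."

# Splitting on "apple" removes the word and leaves the text on either side.
pieces = re.split("apple", text)

# Re-joining with the separator restores the original sentence.
rejoined_with_separator = "apple".join(pieces)

# Re-joining with an empty string drops the word and leaves a double space.
rejoined_with_empty = "".join(pieces)

print(rejoined_with_separator)
print(rejoined_with_empty)
```

For a plain word like `apple`, the regex source and the matched text are the same string, so re-inserting it is harmless. For a lookahead pattern they differ, which is where the erroneous text comes from.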

@suminnnnn
Contributor Author

Thank you for taking the time to review!

I realize I missed preserving the original consistency, so I’ve updated the code to maintain the existing split logic while still fixing the erroneous lookaround re-insertion.
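
The updated diff is not reproduced in this conversation. As a rough sketch of the idea only, and assuming the approach is to detect a separator that begins with a lookaround group, the choice of merge separator could look something like this. `choose_merge_separator` and `LOOKAROUND_PREFIXES` are hypothetical names for illustration, not identifiers confirmed to exist in the library.

```python
# Prefixes of the four zero-width lookaround groups in Python's regex syntax.
LOOKAROUND_PREFIXES = ("(?=", "(?!", "(?<=", "(?<!")


def choose_merge_separator(
    separator: str, keep_separator: bool, is_separator_regex: bool
) -> str:
    """Hypothetical helper: pick the string used to re-join split pieces."""
    # Only treat the separator as a lookaround when it is actually a regex;
    # a literal separator that happens to start with "(?=" is ordinary text.
    is_lookaround = is_separator_regex and separator.startswith(LOOKAROUND_PREFIXES)

    # Assumption: when the separator is kept, it already lives inside the
    # pieces, and a lookaround matched no characters, so nothing is re-inserted.
    if keep_separator or is_lookaround:
        return ""

    # Otherwise keep the pre-existing behavior and re-insert the separator.
    return separator


print(repr(choose_merge_separator("apple", False, True)))
print(repr(choose_merge_separator(r"(?=Step \d)", False, True)))
```

Under this sketch, the `apple` example from the review still round-trips to the original sentence, while a lookahead separator is no longer pasted back between chunks. It only checks the start of the pattern, so a lookaround that appears later in a longer regex would not be detected.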

I’d appreciate it if you could take another look and let me know of any further improvements or errors I should address.

@suminnnnn suminnnnn requested a review from ccurme May 8, 2025 20:55
@dosubot dosubot bot added the lgtm label May 10, 2025
@ccurme ccurme merged commit 683da2c into langchain-ai:master May 10, 2025
33 checks passed
Labels
🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
lgtm (PR looks good. Use to confirm that a PR is ready for merging.)
size:M (This PR changes 30-99 lines, ignoring generated files.)
Ɑ: text splitters (Related to text splitters package)
Development

Successfully merging this pull request may close these issues.

CharacterTextSplitter re-inserts regex separator when keep_separator=False
2 participants