GH-4428: (Enhancement) Fix typos and enhance TextSplitter with Document property preservation and chunk tracking #4429

Hyeri1ee · 2025-09-19T19:13:36Z

Summary

Fix typos in spring-ai-commons and enhance TextSplitter to preserve Document properties and add chunk tracking for better RAG system support.

Current Issues

Typos:
- IdGenerator.java: "dependant" should be "dependent"
- TextSplitter.java: Comment contains "slit" instead of "split"
Property Loss: Document score values are lost during splitting
Missing Traceability: Cannot track which original document chunks came from
Unimplemented TODO: The "TODO copy over other properties" comment was never implemented

Changes Made

1. Typo Fixes

IdGenerator.java: Fixed "dependant" → "dependent" in JavaDoc comment
TextSplitter.java: Fixed "slit" → "split" in comment

2. TextSplitter Enhancements

Preserve original document score values in all chunks
Add tracking metadata: parent_document_id, chunk_index, total_chunks
Implement the TODO comment for copying document properties
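
A minimal sketch of the enrichment step described above, assuming a simplified stand-in for the chunk type. The `Chunk` record and the `toChunks` helper are hypothetical names used only for illustration; they are not the Spring AI `Document` class or the actual `TextSplitter` code in this PR. Only the three metadata keys come from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChunkTrackingSketch {

    // Hypothetical stand-in for a document chunk; not the Spring AI Document class.
    public record Chunk(String text, Double score, Map<String, Object> metadata) {}

    // Copies the parent's metadata and score onto every chunk and adds the tracking keys.
    public static List<Chunk> toChunks(String parentId, Double parentScore,
            Map<String, Object> parentMetadata, List<String> chunkTexts) {
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < chunkTexts.size(); i++) {
            // Fresh map per chunk so chunks never share mutable state.
            Map<String, Object> metadata = new LinkedHashMap<>(parentMetadata);
            metadata.put("parent_document_id", parentId);
            metadata.put("chunk_index", i);
            metadata.put("total_chunks", chunkTexts.size());
            chunks.add(new Chunk(chunkTexts.get(i), parentScore, metadata));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = toChunks("doc-123", 0.95,
                Map.<String, Object>of("source", "document.pdf"),
                List.of("part one", "part two", "part three"));
        for (Chunk chunk : chunks) {
            System.out.println(chunk.metadata() + " score=" + chunk.score());
        }
    }
}
```

Copying the parent metadata into a new map per chunk keeps the original keys intact while letting each chunk carry its own `chunk_index`.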

3. Test Updates

TextSplitterTests.java: Added new tests and updated existing ones
TokenTextSplitterTest.java: Updated test assertions for new metadata fields
Test Comments: Fixed misleading comments about chunk-specific fields

Enhanced Metadata Example

//Before: Basic metadata only
{"source": "document.pdf"}

//After: Enhanced with tracking
{
  "source": "document.pdf",
  "parent_document_id": "doc-123",
  "chunk_index": 0,
  "total_chunks": 3
}

Benefits

Document Reconstruction: Group and sort chunks by parent
Score Preservation: Maintain relevance rankings
Better RAG: Enhanced search context and debugging
Code Quality: Fixed typos improve documentation accuracy
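
To illustrate the document reconstruction benefit, here is a rough sketch of grouping chunk metadata by parent and sorting each group by position. It works on plain `Map` objects rather than Spring AI types, and `groupByParent` is a hypothetical helper that is not part of this PR. It assumes `parent_document_id` is stored as a `String` and `chunk_index` as an `Integer`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ChunkRegroupSketch {

    // Groups chunk metadata by parent_document_id, then orders each group by chunk_index.
    public static Map<String, List<Map<String, Object>>> groupByParent(
            List<Map<String, Object>> chunkMetadata) {
        // TreeMap keeps parent ids in a stable, sorted order.
        Map<String, List<Map<String, Object>>> grouped = new TreeMap<>();
        for (Map<String, Object> metadata : chunkMetadata) {
            String parentId = (String) metadata.get("parent_document_id");
            grouped.computeIfAbsent(parentId, key -> new ArrayList<>()).add(metadata);
        }
        for (List<Map<String, Object>> group : grouped.values()) {
            group.sort(Comparator.comparingInt(
                    (Map<String, Object> m) -> (Integer) m.get("chunk_index")));
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Chunks arrive in arbitrary order, for example from a similarity search.
        List<Map<String, Object>> retrieved = List.of(
                Map.<String, Object>of("parent_document_id", "doc-b", "chunk_index", 0),
                Map.<String, Object>of("parent_document_id", "doc-a", "chunk_index", 1),
                Map.<String, Object>of("parent_document_id", "doc-a", "chunk_index", 0));
        groupByParent(retrieved).forEach((parentId, group) ->
                System.out.println(parentId + " has " + group.size() + " chunk(s)"));
    }
}
```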

Testing

All existing tests pass, and the changes are backward compatible
New comprehensive test coverage added:
- testScorePreservation(): Verifies score preservation
- testParentDocumentTracking(): Validates parent tracking
- testChunkMetadataInformation(): Tests chunk position metadata
- testEnhancedMetadataWithMultipleDocuments(): Multi-document scenarios
Updated existing tests to handle new metadata fields:
- TextSplitterTests.java: Fixed test assertions for enhanced metadata
- TokenTextSplitterTest.java: Updated to validate chunk-specific fields

Files Modified

IdGenerator.java: Typo fix in JavaDoc
TextSplitter.java: Typo fix + enhanced functionality
TextSplitterTests.java: Updated tests + new test coverage
TokenTextSplitterTest.java: Updated tests for enhanced metadata validation coverage

Use Case

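import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;

// Any TextSplitter implementation applies; TokenTextSplitter is used here for illustration.
TokenTextSplitter splitter = new TokenTextSplitter();

Document doc = Document.builder()
    .text("Long content...")
    .score(0.95)
    .build();

List<Document> chunks = splitter.split(doc);
chunks.get(0).getScore(); // → 0.95 (preserved)
chunks.get(0).getMetadata().get("parent_document_id"); // → original doc ID
chunks.get(0).getMetadata().get("chunk_index"); // → 0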

This enables proper chunk-to-document relationships essential for effective RAG systems while improving overall code quality.

closes #4428

GH-4428: improve TextSplitter with property preservation and tracking (f7c1f5e)
Signed-off-by: Hyeri1ee <[email protected]>

- Correct comment: 'excluding' -> 'including' chunk-specific fields
- Update TokenTextSplitterTest to handle new metadata fields
- Ensure all tests pass with enhanced TextSplitter functionality
Signed-off-by: Hyeri1ee <[email protected]>

Hyeri1ee force-pushed the main branch from 4ec4dec to f7c1f5e on September 19, 2025 19:38

Hyeri1ee changed the title ~~Enhancement: Fix typos and enhance TextSplitter with Document property preservation and chunk tracking~~ GH-4428: Enhancement: Fix typos and enhance TextSplitter with Document property preservation and chunk tracking Sep 19, 2025

Hyeri1ee changed the title ~~GH-4428: Enhancement: Fix typos and enhance TextSplitter with Document property preservation and chunk tracking~~ GH-4428: (Enhancement) Fix typos and enhance TextSplitter with Document property preservation and chunk tracking Sep 19, 2025
