@Hyeri1ee Hyeri1ee commented Sep 19, 2025

Summary

Fix typos in spring-ai-commons and enhance TextSplitter to preserve Document properties and add chunk tracking for better RAG system support.

Current Issues

  1. Typos:
    • IdGenerator.java: "dependant" should be "dependent"
    • TextSplitter.java: Comment contains "slit" instead of "split"
  2. Property Loss: Document score values are lost during splitting
  3. Missing Traceability: Cannot track which original document a chunk came from
  4. Unimplemented TODO: The "TODO copy over other properties" comment was never implemented

Changes Made

1. Typo Fixes

  • IdGenerator.java: Fixed "dependant" → "dependent" in JavaDoc comment
  • TextSplitter.java: Fixed "slit" → "split" in comment

2. TextSplitter Enhancements

  • Preserve original document score values in all chunks
  • Add tracking metadata: parent_document_id, chunk_index, total_chunks
  • Implement the TODO comment for copying document properties
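
As a rough illustration of the enhancement listed above, the sketch below shows one way a splitter could copy the parent's score and attach the three tracking fields to every chunk. It does not reproduce the code in this PR: `ChunkTrackingSketch`, the `Doc` record, the fixed-size `splitText` helper, and the `parent.id() + "-" + i` chunk ID scheme are all hypothetical stand-ins so the example runs without Spring AI on the classpath. Only the metadata keys (`parent_document_id`, `chunk_index`, `total_chunks`) come from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChunkTrackingSketch {

    /** Hypothetical stand-in for a document: id, text, optional score, metadata. */
    record Doc(String id, String text, Double score, Map<String, Object> metadata) {}

    /** Naive fixed-size splitting; a real splitter would be token-aware. */
    static List<String> splitText(String text, int chunkSize) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < text.length(); i += chunkSize) {
            parts.add(text.substring(i, Math.min(text.length(), i + chunkSize)));
        }
        return parts;
    }

    /** Splits the parent and carries its score and metadata over to each chunk. */
    static List<Doc> split(Doc parent, int chunkSize) {
        List<String> parts = splitText(parent.text(), chunkSize);
        List<Doc> chunks = new ArrayList<>();
        for (int i = 0; i < parts.size(); i++) {
            // Copy the parent's metadata first, then add the tracking fields.
            Map<String, Object> metadata = new LinkedHashMap<>(parent.metadata());
            metadata.put("parent_document_id", parent.id());
            metadata.put("chunk_index", i);
            metadata.put("total_chunks", parts.size());
            // The chunk ID scheme here is an assumption made for the sketch.
            chunks.add(new Doc(parent.id() + "-" + i, parts.get(i), parent.score(), metadata));
        }
        return chunks;
    }

    public static void main(String[] args) {
        Doc doc = new Doc("doc-123", "abcdefghij", 0.95, Map.of("source", "document.pdf"));
        for (Doc chunk : split(doc, 4)) {
            System.out.println(chunk.text() + " " + chunk.score() + " " + chunk.metadata());
        }
    }
}
```

Copying the parent's metadata into a fresh map per chunk keeps the chunks independent of each other, so a later change to one chunk's metadata does not leak into its siblings.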

3. Test Updates

  • TextSplitterTests.java: Added new tests and updated existing ones
  • TokenTextSplitterTest.java: Updated test assertions for new metadata fields
  • Test Comments: Fixed misleading comments about chunk-specific fields

Enhanced Metadata Example

// Before: Basic metadata only
{"source": "document.pdf"}

// After: Enhanced with tracking
{
  "source": "document.pdf",
  "parent_document_id": "doc-123",
  "chunk_index": 0,
  "total_chunks": 3
}

Benefits

  • Document Reconstruction: Group and sort chunks by parent
  • Score Preservation: Maintain relevance rankings
  • Better RAG: Enhanced search context and debugging
  • Code Quality: Fixed typos improve documentation accuracy
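
The document reconstruction benefit above can be sketched as follows: group retrieved chunks by `parent_document_id`, sort each group by `chunk_index`, and join the text. The `ChunkReassemblySketch` class and `Chunk` record are hypothetical and independent of Spring AI. Plain concatenation also assumes the chunks do not overlap and that the splitter did not trim whitespace at chunk boundaries, which may not hold for a real splitter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ChunkReassemblySketch {

    /** Hypothetical stand-in for a retrieved chunk: text plus tracking metadata. */
    record Chunk(String text, Map<String, Object> metadata) {}

    /** Returns parent document ID -> text rebuilt from its chunks in index order. */
    static Map<String, String> reassemble(List<Chunk> chunks) {
        // Group chunks by the parent they were split from.
        Map<String, List<Chunk>> byParent = new TreeMap<>();
        for (Chunk chunk : chunks) {
            String parentId = (String) chunk.metadata().get("parent_document_id");
            byParent.computeIfAbsent(parentId, k -> new ArrayList<>()).add(chunk);
        }
        // Sort each group by chunk_index and concatenate the text.
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<Chunk>> entry : byParent.entrySet()) {
            List<Chunk> group = entry.getValue();
            group.sort(Comparator.comparingInt(
                    (Chunk c) -> (Integer) c.metadata().get("chunk_index")));
            StringBuilder text = new StringBuilder();
            for (Chunk c : group) {
                text.append(c.text());
            }
            result.put(entry.getKey(), text.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Chunks arrive out of order, as they might from a similarity search.
        List<Chunk> retrieved = List.of(
                new Chunk("world", Map.of("parent_document_id", "doc-a", "chunk_index", 1)),
                new Chunk("solo", Map.of("parent_document_id", "doc-b", "chunk_index", 0)),
                new Chunk("Hello ", Map.of("parent_document_id", "doc-a", "chunk_index", 0)));
        System.out.println(reassemble(retrieved));
    }
}
```

If `total_chunks` is also stored, a caller could compare it with each group's size to detect that some chunks of a parent were not retrieved.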

Testing

  • All existing tests pass with backward compatibility
  • New comprehensive test coverage added:
    • testScorePreservation(): Verifies score preservation
    • testParentDocumentTracking(): Validates parent tracking
    • testChunkMetadataInformation(): Tests chunk position metadata
    • testEnhancedMetadataWithMultipleDocuments(): Multi-document scenarios
  • Updated existing tests to handle new metadata fields:
    • TextSplitterTests.java: Fixed test assertions for enhanced metadata
    • TokenTextSplitterTest.java: Updated to validate chunk-specific fields

Files Modified

  • IdGenerator.java: Typo fix in JavaDoc
  • TextSplitter.java: Typo fix + enhanced functionality
  • TextSplitterTests.java: Updated tests + new test coverage
  • TokenTextSplitterTest.java: Updated tests for enhanced metadata validation

Use Case

Document doc = Document.builder()
    .text("Long content...")
    .score(0.95)
    .build();

List<Document> chunks = splitter.split(doc);
chunks.get(0).getScore(); // → 0.95 (preserved)
chunks.get(0).getMetadata().get("parent_document_id"); // → original doc ID
chunks.get(0).getMetadata().get("chunk_index"); // → 0

This enables proper chunk-to-document relationships essential for effective RAG systems while improving overall code quality.

closes #4428

- Correct comment: 'excluding' -> 'including' chunk-specific fields
- Update TokenTextSplitterTest to handle new metadata fields
- Ensure all tests pass with enhanced TextSplitter functionality

Signed-off-by: Hyeri1ee <[email protected]>
@Hyeri1ee Hyeri1ee changed the title Enhancement: Fix typos and enhance TextSplitter with Document property preservation and chunk tracking GH-4428: Enhancement: Fix typos and enhance TextSplitter with Document property preservation and chunk tracking Sep 19, 2025
@Hyeri1ee Hyeri1ee changed the title GH-4428: Enhancement: Fix typos and enhance TextSplitter with Document property preservation and chunk tracking GH-4428: (Enhancement) Fix typos and enhance TextSplitter with Document property preservation and chunk tracking Sep 19, 2025
Successfully merging this pull request may close these issues.

TextSplitter loses Document score and parent tracking information during splitting