Conversation
Walkthrough

This PR adds page-level metadata tracking throughout the RAG pipeline.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Extractor
    participant Chunker
    participant VectorStore
    participant SearchUI
    Extractor->>Extractor: Extract PDF page-by-page
    Extractor->>Extractor: Compute page_offsets<br/>(char positions per page)
    Extractor->>Chunker: ExtractionOutput + page_offsets
    Chunker->>Chunker: For each chunk,<br/>find page_number<br/>via _find_page_number()
    Chunker->>Chunker: Create TextChunk<br/>with page_number
    Chunker->>VectorStore: TextChunk[] + page_number
    VectorStore->>VectorStore: Store page_number<br/>as kiln_page_number metadata
    SearchUI->>VectorStore: Query for chunks
    VectorStore->>VectorStore: Retrieve kiln_page_number<br/>from metadata
    VectorStore->>SearchUI: SearchResult<br/>+ page_number
    SearchUI->>SearchUI: Display "Page: N"<br/>or "N/A"
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes
🚥 Pre-merge checks: 2 passed ✅, 1 failed ❌ (warning)
Summary of Changes

Hello @leonardmq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the RAG (Retrieval Augmented Generation) system by implementing a full pipeline for tracking and displaying page numbers for document chunks. This feature allows the system to provide more precise citations by linking retrieved information directly to its source page, improving the overall utility and verifiability of generated responses. The changes span from data extraction and chunking to indexing, retrieval, and finally, presentation in the user interface.
📊 Coverage Report

Overall Coverage: 92% (diff: origin/main...HEAD)

Line-by-line diff coverage (`!` marks uncovered lines):

libs/core/kiln_ai/adapters/chunkers/base_chunker.py, lines 43-55:

```
  43        search_start = 0
  44        for chunk in chunking_result.chunks:
  45            chunk_start_offset = text.find(chunk.text, search_start)
  46            if chunk_start_offset == -1:
! 47                logger.warning(
  48                    f"Chunk text not found in sanitized text starting from offset {search_start}. "
  49                    "This may indicate an issue with the chunker implementation."
  50                )
! 51                chunk.page_number = None
  52            else:
  53                page_number = self._find_page_number(
  54                    chunk_start_offset, page_offsets
  55                )
```

Lines 76-84:

```
  76        for i in range(len(page_offsets) - 1, -1, -1):
  77            if chunk_offset >= page_offsets[i]:
  78                return i
  79
! 80        return None
  81
  82    @abstractmethod
  83    async def _chunk(self, text: str) -> ChunkingResult:
  84        pass
```

libs/core/kiln_ai/tools/rag_tools.py, lines 50-58:

```
  50            "document_id": search_result.document_id,
  51            "chunk_idx": search_result.chunk_idx,
  52        }
  53        if search_result.page_number is not None:
! 54            metadata["page_number"] = search_result.page_number
  55        results.append(
  56            ChunkContext(
  57                metadata=metadata,
  58                text=search_result.chunk_text,
```
Code Review
This pull request introduces page number tracking and display functionality across the RAG pipeline. Key changes include adding a page_number field to TextChunk and SearchResult models, and page_offsets to ExtractionOutput and Extraction models. The BaseChunker now accepts page_offsets to assign page numbers to chunks, with specific logic implemented in LitellmExtractor to calculate these offsets for PDF extractions. The frontend UI has been updated to display the page number for each chunk. Extensive unit tests have been added or modified for BaseChunker, FixedWindowChunker, SemanticChunker, LitellmExtractor, and LanceDBAdapter to ensure correct handling, storage, and retrieval of page numbers and offsets. Review comments suggest improving the efficiency of text.find() in the chunker, clarifying the return behavior of _find_page_number for offsets before the first page, extracting metadata creation logic into a separate function, and adding comments to explain test cases.
```python
chunk_start_offset = text.find(chunk.text, search_start)
if chunk_start_offset == -1:
    logger.warning(
        f"Chunk text not found in sanitized text starting from offset {search_start}. "
        "This may indicate an issue with the chunker implementation."
    )
```
The text.find() method can be inefficient for large texts or frequent calls. Consider using a more efficient string searching algorithm or pre-processing the text if performance becomes a bottleneck. Also, consider adding a more descriptive error message to the logger.
```diff
 chunk_start_offset = text.find(chunk.text, search_start)
 if chunk_start_offset == -1:
     logger.warning(
         f"Chunk text not found in sanitized text starting from offset {search_start}. "
-        "This may indicate an issue with the chunker implementation."
+        "This may indicate an issue with the chunker implementation. "
+        f"Chunk text: '{chunk.text[:50]}...'"
     )
 chunk.page_number = None
```
```python
if chunk_offset < page_offsets[0]:
    return 0
```
Returning 0 here might not be the most intuitive behavior. If the chunk_offset is before the first page, it might be better to return None to indicate that the chunk doesn't belong to any page. This would require updating the type hint to Optional[int].
```diff
 if chunk_offset < page_offsets[0]:
-    return 0
+    return None
```
```python
metadata: Dict[str, Any] = {
    # metadata is populated by some internal llama_index logic
    # that uses for example the source_node relationship
    "kiln_doc_id": document_id,
    "kiln_chunk_idx": chunk_idx,
    #
    # llama_index lancedb vector store automatically sets these metadata:
    # "doc_id": "UUID node_id of the Source Node relationship",
    # "document_id": "UUID node_id of the Source Node relationship",
    # "ref_doc_id": "UUID node_id of the Source Node relationship"
    #
    # llama_index file loaders set these metadata, which would be useful to also support:
    # "creation_date": "2025-09-03",
    # "file_name": "file.pdf",
    # "file_path": "/absolute/path/to/the/file.pdf",
    # "file_size": 395154,
    # "file_type": "application\/pdf",
    # "last_modified_date": "2025-09-03",
    # "page_label": "1",
}

if page_number is not None:
    metadata["kiln_page_number"] = page_number
```
Consider extracting the metadata creation logic into a separate function for better readability and maintainability.
```python
def _create_metadata(document_id: str, chunk_idx: int, page_number: int | None = None) -> Dict[str, Any]:
    metadata: Dict[str, Any] = {
        # metadata is populated by some internal llama_index logic
        # that uses for example the source_node relationship
        "kiln_doc_id": document_id,
        "kiln_chunk_idx": chunk_idx,
        #
        # llama_index lancedb vector store automatically sets these metadata:
        # "doc_id": "UUID node_id of the Source Node relationship",
        # "document_id": "UUID node_id of the Source Node relationship",
        # "ref_doc_id": "UUID node_id of the Source Node relationship"
        #
        # llama_index file loaders set these metadata, which would be useful to also support:
        # "creation_date": "2025-09-03",
        # "file_name": "file.pdf",
        # "file_path": "/absolute/path/to/the/file.pdf",
        # "file_size": 395154,
        # "file_type": "application\\/pdf",
        # "last_modified_date": "2025-09-03",
        # "page_label": "1",
    }
    if page_number is not None:
        metadata["kiln_page_number"] = page_number
    return metadata


def convert_to_llama_index_node(
    document_id: str,
    chunk_idx: int,
    node_id: str,
    text: str,
    vector: List[float],
    page_number: int | None = None,
) -> TextNode:
    metadata = _create_metadata(document_id, chunk_idx, page_number)
    return TextNode(
        id_=node_id,
        text=text,
        embedding=vector,
        metadata=metadata,
        relationships={
            # when using the llama_index loaders, llama_index groups Nodes under Documents
            # and relationships point to the Document (which is also a Node), which confusingly
```
```python
# Test with valid page_number (int) - should work
chunks=[
    Chunk(
        content=KilnAttachmentModel.from_data("chunk 0", "text/plain"),
        page_number=0,
    ),
    Chunk(
        content=KilnAttachmentModel.from_data("chunk 1", "text/plain"),
```
```python
assert len(nodes) == 4
assert nodes[0].metadata.get("kiln_page_number") == 0
assert nodes[1].metadata.get("kiln_page_number") == 1
assert "kiln_page_number" not in nodes[2].metadata  # None should not be stored
```
Actionable comments posted: 0

Caution: some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libs/core/kiln_ai/tools/rag_tools.py (1)

183-195: Page number is lost during reranking. The `rerank` method reconstructs `SearchResult` objects but doesn't preserve `page_number`. After reranking, chunks will lose their page metadata, breaking the page number feature when a reranker is configured.

🐛 Proposed fix to preserve page_number through reranking

The `rerank` method needs access to the original `page_number` from the input `SearchResult`. One approach is to store page numbers in a lookup and retrieve them when rebuilding:

```diff
 async def rerank(
     self, search_results: List[SearchResult], query: str
 ) -> List[SearchResult]:
     if self.reranker is None:
         return search_results

+    # Build lookup for page_number by global chunk id
+    page_number_lookup = {
+        f"{r.document_id}::{r.chunk_idx}": r.page_number
+        for r in search_results
+    }
+
     reranked_results = await self.reranker.rerank(
         query=query,
         documents=convert_search_results_to_rerank_input(search_results),
     )
     reranked_search_results = []
     for result in reranked_results.results:
         document_id, chunk_idx = split_global_chunk_id(result.document.id)
         reranked_search_results.append(
             SearchResult(
                 document_id=document_id,
                 chunk_idx=chunk_idx,
                 chunk_text=result.document.text,
                 similarity=result.relevance_score,
+                page_number=page_number_lookup.get(result.document.id),
             )
         )
     return reranked_search_results
```
🧹 Nitpick comments (4)
libs/core/kiln_ai/adapters/vector_store/test_lancedb_helpers.py (1)

183-195: LGTM! Test correctly verifies that an explicit `page_number=None` does not add `kiln_page_number` to metadata.

Consider adding a test for `page_number=0` to verify the first page (0-indexed) is correctly stored and not falsely treated as absent due to Python's truthiness of 0:

♻️ Optional test for page_number=0

```python
def test_convert_to_llama_index_node_with_page_number_zero():
    node = convert_to_llama_index_node(
        document_id="doc-123",
        chunk_idx=0,
        node_id="11111111-1111-5111-8111-111111111111",
        text="hello",
        vector=[0.1, 0.2],
        page_number=0,
    )
    assert node.metadata["kiln_page_number"] == 0
```

libs/core/kiln_ai/datamodel/extraction.py (1)
98-101: LGTM! The `page_offsets` field is well-defined with a clear description. Using `list[int] | None` follows modern Python 3.10+ typing syntax, which aligns with the coding guidelines.

Consider adding a validator to enforce `page_offsets` invariants when provided (e.g., offsets are sorted, first offset is 0). This would catch data corruption early:

♻️ Optional validator for page_offsets

```python
@field_validator("page_offsets")
@classmethod
def validate_page_offsets(cls, v: list[int] | None) -> list[int] | None:
    if v is None:
        return v
    if len(v) == 0:
        raise ValueError("page_offsets must not be empty if provided")
    if v[0] != 0:
        raise ValueError("page_offsets must start with 0")
    if v != sorted(v):
        raise ValueError("page_offsets must be sorted in ascending order")
    return v
```

libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py (1)
776-790: Remove duplicate page_offsets assertions. Both tests repeat the same block twice; trimming improves readability without losing coverage.

♻️ Suggested cleanup

```diff
@@
-    # Verify page_offsets are present and correct
-    assert result.page_offsets is not None
-    assert len(result.page_offsets) == 2
-    assert result.page_offsets[0] == 0
-    # Page 1 starts after page 0 content + separator (2 chars for "\n\n")
-    expected_page_1_offset = len("Content from page 1") + 2
-    assert result.page_offsets[1] == expected_page_1_offset
@@
-    # Verify page_offsets are present and correct
-    assert result.page_offsets is not None
-    assert len(result.page_offsets) == 2
-    assert result.page_offsets[0] == 0
-    expected_page_1_offset = len("Content from page 1") + 2
-    assert result.page_offsets[1] == expected_page_1_offset
```

Also applies to: 916-928
libs/core/kiln_ai/adapters/chunkers/base_chunker.py (1)

61-80: Minor: unreachable return statement. The `return None` at line 80 is unreachable. Given the checks:

- If `page_offsets` is empty → returns `None` at line 71.
- If `chunk_offset < page_offsets[0]` → returns `0` at line 74.
- Otherwise, the loop will always find an `i` where `chunk_offset >= page_offsets[i]` (at minimum `i = 0`).

This is harmless defensive code, but you could simplify by removing it or adding an assertion.

♻️ Optional simplification

```diff
 for i in range(len(page_offsets) - 1, -1, -1):
     if chunk_offset >= page_offsets[i]:
         return i

-return None
+# This line is unreachable given the checks above, but satisfies the type checker
+raise AssertionError("Unreachable: chunk_offset must match some page")
```
What does this PR do?

This adds page numbers into chunks in RAG. This allows the caller of the `RagTool` to format citations that point to a specific page.

Page number is nullable. It is `None` for non-PDF (or non-page) documents.

Getting the page number in an existing Search Tool requires reindexing: delete `~/.kiln_ai/rag_indexes/`.

The entire flow is:

- store `kiln_page_number` as metadata in the record (alongside `kiln_doc_id` and chunk id)
- in the `RagTool`, add the page number in the chunk metadata
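The extraction end of this flow computes `page_offsets` while joining page texts. A hypothetical minimal version, assuming pages are joined with a `"\n\n"` separator as the extractor tests suggest:

```python
SEPARATOR = "\n\n"

def join_pages(pages: list[str]) -> tuple[str, list[int]]:
    """Concatenate page texts and record the start offset of each page."""
    offsets: list[int] = []
    cursor = 0
    for i, page in enumerate(pages):
        offsets.append(cursor)
        cursor += len(page)
        if i < len(pages) - 1:
            cursor += len(SEPARATOR)
    return SEPARATOR.join(pages), offsets

text, page_offsets = join_pages(["Content from page 1", "Content from page 2"])
print(page_offsets)  # [0, 21] — page 2 starts after 19 chars + 2-char separator
```

These offsets are what the chunker later searches to assign each chunk its page number.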