Conversation
Walkthrough

This PR adds page-level metadata tracking throughout the RAG pipeline.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Extractor
    participant Chunker
    participant VectorStore
    participant SearchUI
    Extractor->>Extractor: Extract PDF page-by-page
    Extractor->>Extractor: Compute page_offsets<br/>(char positions per page)
    Extractor->>Chunker: ExtractionOutput + page_offsets
    Chunker->>Chunker: For each chunk,<br/>find page_number<br/>via _find_page_number()
    Chunker->>Chunker: Create TextChunk<br/>with page_number
    Chunker->>VectorStore: TextChunk[] + page_number
    VectorStore->>VectorStore: Store page_number<br/>as kiln_page_number metadata
    SearchUI->>VectorStore: Query for chunks
    VectorStore->>VectorStore: Retrieve kiln_page_number<br/>from metadata
    VectorStore->>SearchUI: SearchResult<br/>+ page_number
    SearchUI->>SearchUI: Display "Page: N"<br/>or "N/A"
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes
🚥 Pre-merge checks: 2 passed ✅, 1 failed ❌ (warning)
Summary of Changes

Hello @leonardmq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the RAG (Retrieval Augmented Generation) system by implementing a full pipeline for tracking and displaying page numbers for document chunks. This feature allows the system to provide more precise citations by linking retrieved information directly to its source page, improving the overall utility and verifiability of generated responses. The changes span from data extraction and chunking to indexing, retrieval, and finally, presentation in the user interface.
📊 Coverage Report

Overall Coverage: 92% (diff: origin/main...HEAD)

Line-by-line diff coverage (`!` marks uncovered lines):

libs/core/kiln_ai/adapters/chunkers/base_chunker.py, lines 43-55:

```
  43        search_start = 0
  44        for chunk in chunking_result.chunks:
  45            chunk_start_offset = text.find(chunk.text, search_start)
  46            if chunk_start_offset == -1:
! 47                logger.warning(
  48                    f"Chunk text not found in sanitized text starting from offset {search_start}. "
  49                    "This may indicate an issue with the chunker implementation."
  50                )
! 51                chunk.page_number = None
  52            else:
  53                page_number = self._find_page_number(
  54                    chunk_start_offset, page_offsets
  55                )
```

Lines 76-84:

```
  76        for i in range(len(page_offsets) - 1, -1, -1):
  77            if chunk_offset >= page_offsets[i]:
  78                return i
  79
! 80        return None
  81
  82    @abstractmethod
  83    async def _chunk(self, text: str) -> ChunkingResult:
  84        pass
```

libs/core/kiln_ai/tools/rag_tools.py, lines 50-58:

```
  50            "document_id": search_result.document_id,
  51            "chunk_idx": search_result.chunk_idx,
  52        }
  53        if search_result.page_number is not None:
! 54            metadata["page_number"] = search_result.page_number
  55        results.append(
  56            ChunkContext(
  57                metadata=metadata,
  58                text=search_result.chunk_text,
```
Code Review
This pull request introduces page number tracking and display functionality across the RAG pipeline. Key changes include adding a page_number field to TextChunk and SearchResult models, and page_offsets to ExtractionOutput and Extraction models. The BaseChunker now accepts page_offsets to assign page numbers to chunks, with specific logic implemented in LitellmExtractor to calculate these offsets for PDF extractions. The frontend UI has been updated to display the page number for each chunk. Extensive unit tests have been added or modified for BaseChunker, FixedWindowChunker, SemanticChunker, LitellmExtractor, and LanceDBAdapter to ensure correct handling, storage, and retrieval of page numbers and offsets. Review comments suggest improving the efficiency of text.find() in the chunker, clarifying the return behavior of _find_page_number for offsets before the first page, extracting metadata creation logic into a separate function, and adding comments to explain test cases.
```python
chunk_start_offset = text.find(chunk.text, search_start)
if chunk_start_offset == -1:
    logger.warning(
        f"Chunk text not found in sanitized text starting from offset {search_start}. "
        "This may indicate an issue with the chunker implementation."
    )
```
The text.find() method can be inefficient for large texts or frequent calls. Consider using a more efficient string searching algorithm or pre-processing the text if performance becomes a bottleneck. Also, consider adding a more descriptive error message to the logger.
```diff
 chunk_start_offset = text.find(chunk.text, search_start)
 if chunk_start_offset == -1:
     logger.warning(
         f"Chunk text not found in sanitized text starting from offset {search_start}. "
-        "This may indicate an issue with the chunker implementation."
+        "This may indicate an issue with the chunker implementation. "
+        f"Chunk text: '{chunk.text[:50]}...'"
     )
 chunk.page_number = None
```
```python
if chunk_offset < page_offsets[0]:
    return 0
```
Returning 0 here might not be the most intuitive behavior. If the chunk_offset is before the first page, it might be better to return None to indicate that the chunk doesn't belong to any page. This would require updating the type hint to Optional[int].
```diff
 if chunk_offset < page_offsets[0]:
-    return 0
+    return None
```
```python
metadata: Dict[str, Any] = {
    # metadata is populated by some internal llama_index logic
    # that uses for example the source_node relationship
    "kiln_doc_id": document_id,
    "kiln_chunk_idx": chunk_idx,
    #
    # llama_index lancedb vector store automatically sets these metadata:
    # "doc_id": "UUID node_id of the Source Node relationship",
    # "document_id": "UUID node_id of the Source Node relationship",
    # "ref_doc_id": "UUID node_id of the Source Node relationship"
    #
    # llama_index file loaders set these metadata, which would be useful to also support:
    # "creation_date": "2025-09-03",
    # "file_name": "file.pdf",
    # "file_path": "/absolute/path/to/the/file.pdf",
    # "file_size": 395154,
    # "file_type": "application\/pdf",
    # "last_modified_date": "2025-09-03",
    # "page_label": "1",
}

if page_number is not None:
    metadata["kiln_page_number"] = page_number
```
Consider extracting the metadata creation logic into a separate function for better readability and maintainability.
```python
def _create_metadata(document_id: str, chunk_idx: int, page_number: int | None = None) -> Dict[str, Any]:
    metadata: Dict[str, Any] = {
        # metadata is populated by some internal llama_index logic
        # that uses for example the source_node relationship
        "kiln_doc_id": document_id,
        "kiln_chunk_idx": chunk_idx,
        #
        # llama_index lancedb vector store automatically sets these metadata:
        # "doc_id": "UUID node_id of the Source Node relationship",
        # "document_id": "UUID node_id of the Source Node relationship",
        # "ref_doc_id": "UUID node_id of the Source Node relationship"
        #
        # llama_index file loaders set these metadata, which would be useful to also support:
        # "creation_date": "2025-09-03",
        # "file_name": "file.pdf",
        # "file_path": "/absolute/path/to/the/file.pdf",
        # "file_size": 395154,
        # "file_type": "application\\/pdf",
        # "last_modified_date": "2025-09-03",
        # "page_label": "1",
    }
    if page_number is not None:
        metadata["kiln_page_number"] = page_number
    return metadata


def convert_to_llama_index_node(
    document_id: str,
    chunk_idx: int,
    node_id: str,
    text: str,
    vector: List[float],
    page_number: int | None = None,
) -> TextNode:
    metadata = _create_metadata(document_id, chunk_idx, page_number)
    return TextNode(
        id_=node_id,
        text=text,
        embedding=vector,
        metadata=metadata,
        relationships={
            # when using the llama_index loaders, llama_index groups Nodes under Documents
            # and relationships point to the Document (which is also a Node), which confusingly
```
```python
# Test with valid page_number (int) - should work
chunks=[
    Chunk(
        content=KilnAttachmentModel.from_data("chunk 0", "text/plain"),
        page_number=0,
    ),
    Chunk(
        content=KilnAttachmentModel.from_data("chunk 1", "text/plain"),
```
```python
assert len(nodes) == 4
assert nodes[0].metadata.get("kiln_page_number") == 0
assert nodes[1].metadata.get("kiln_page_number") == 1
assert "kiln_page_number" not in nodes[2].metadata  # None should not be stored
```
Actionable comments posted: 0

Caution: some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libs/core/kiln_ai/tools/rag_tools.py (1)

183-195: Page number is lost during reranking. The `rerank` method reconstructs `SearchResult` objects but doesn't preserve `page_number`. After reranking, chunks will lose their page metadata, breaking the page number feature when a reranker is configured.

🐛 Proposed fix to preserve page_number through reranking

The `rerank` method needs access to the original `page_number` from the input `SearchResult`. One approach is to store page numbers in a lookup and retrieve them when rebuilding:

```diff
 async def rerank(
     self, search_results: List[SearchResult], query: str
 ) -> List[SearchResult]:
     if self.reranker is None:
         return search_results

+    # Build lookup for page_number by global chunk id
+    page_number_lookup = {
+        f"{r.document_id}::{r.chunk_idx}": r.page_number
+        for r in search_results
+    }
+
     reranked_results = await self.reranker.rerank(
         query=query,
         documents=convert_search_results_to_rerank_input(search_results),
     )
     reranked_search_results = []
     for result in reranked_results.results:
         document_id, chunk_idx = split_global_chunk_id(result.document.id)
         reranked_search_results.append(
             SearchResult(
                 document_id=document_id,
                 chunk_idx=chunk_idx,
                 chunk_text=result.document.text,
                 similarity=result.relevance_score,
+                page_number=page_number_lookup.get(result.document.id),
             )
         )
     return reranked_search_results
```
🧹 Nitpick comments (4)
libs/core/kiln_ai/adapters/vector_store/test_lancedb_helpers.py (1)

183-195: LGTM! Test correctly verifies that an explicit `page_number=None` does not add `kiln_page_number` to metadata.

Consider adding a test for `page_number=0` to verify the first page (0-indexed) is correctly stored and not falsely treated as absent due to Python's truthiness of 0:

♻️ Optional test for page_number=0

```python
def test_convert_to_llama_index_node_with_page_number_zero():
    node = convert_to_llama_index_node(
        document_id="doc-123",
        chunk_idx=0,
        node_id="11111111-1111-5111-8111-111111111111",
        text="hello",
        vector=[0.1, 0.2],
        page_number=0,
    )
    assert node.metadata["kiln_page_number"] == 0
```

libs/core/kiln_ai/datamodel/extraction.py (1)
98-101: LGTM! The `page_offsets` field is well-defined with a clear description. Using `list[int] | None` follows modern Python 3.10+ typing syntax, which aligns with the coding guidelines.

Consider adding a validator to enforce `page_offsets` invariants when provided (e.g., offsets are sorted, first offset is 0). This would catch data corruption early:

♻️ Optional validator for page_offsets

```python
@field_validator("page_offsets")
@classmethod
def validate_page_offsets(cls, v: list[int] | None) -> list[int] | None:
    if v is None:
        return v
    if len(v) == 0:
        raise ValueError("page_offsets must not be empty if provided")
    if v[0] != 0:
        raise ValueError("page_offsets must start with 0")
    if v != sorted(v):
        raise ValueError("page_offsets must be sorted in ascending order")
    return v
```

libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py (1)
776-790: Remove duplicate page_offsets assertions. Both tests repeat the same block twice; trimming improves readability without losing coverage.

♻️ Suggested cleanup

```diff
@@
-    # Verify page_offsets are present and correct
-    assert result.page_offsets is not None
-    assert len(result.page_offsets) == 2
-    assert result.page_offsets[0] == 0
-    # Page 1 starts after page 0 content + separator (2 chars for "\n\n")
-    expected_page_1_offset = len("Content from page 1") + 2
-    assert result.page_offsets[1] == expected_page_1_offset
@@
-    # Verify page_offsets are present and correct
-    assert result.page_offsets is not None
-    assert len(result.page_offsets) == 2
-    assert result.page_offsets[0] == 0
-    expected_page_1_offset = len("Content from page 1") + 2
-    assert result.page_offsets[1] == expected_page_1_offset
```

Also applies to: 916-928
libs/core/kiln_ai/adapters/chunkers/base_chunker.py (1)

61-80: Minor: unreachable return statement. The `return None` at line 80 is unreachable. Given the checks:

- If `page_offsets` is empty → returns `None` at line 71.
- If `chunk_offset < page_offsets[0]` → returns `0` at line 74.
- Otherwise, the loop will always find an `i` where `chunk_offset >= page_offsets[i]` (at minimum `i = 0`).

This is harmless defensive code, but you could simplify by removing it or adding an assertion.

♻️ Optional simplification

```diff
 for i in range(len(page_offsets) - 1, -1, -1):
     if chunk_offset >= page_offsets[i]:
         return i

-return None
+# This line is unreachable given the checks above, but satisfies the type checker
+raise AssertionError("Unreachable: chunk_offset must match some page")
```
What does this PR do?

This adds page numbers into chunks in RAG. This allows the caller of the `RagTool` to format citations that point to a specific page.

Page number is nullable. It is `None` for non-PDF (or non-page) documents.

Getting the page number in an existing Search Tool requires reindexing: delete `~/.kiln_ai/rag_indexes/`.

The entire flow is:

- store `kiln_page_number` as metadata in the record (alongside `kiln_doc_id` and chunk id)
- in the `RagTool`, add the page number in the chunk metadata
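The extraction end of this flow computes `page_offsets` while joining page texts. A hypothetical minimal version, assuming pages are joined with a `"\n\n"` separator as the extractor tests suggest:

```python
SEPARATOR = "\n\n"

def join_pages(pages: list[str]) -> tuple[str, list[int]]:
    """Concatenate page texts and record the start offset of each page."""
    offsets: list[int] = []
    cursor = 0
    for i, page in enumerate(pages):
        offsets.append(cursor)
        cursor += len(page)
        if i < len(pages) - 1:
            cursor += len(SEPARATOR)
    return SEPARATOR.join(pages), offsets

text, page_offsets = join_pages(["Content from page 1", "Content from page 2"])
print(page_offsets)  # [0, 21] — page 2 starts after 19 chars + 2-char separator
```

These offsets are what the chunker later searches to assign each chunk its page number.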