
feat: add page numbers in chunks (RAG)#989

Open
leonardmq wants to merge 2 commits into main from leonard/kil-385-rag-page-numbers-in-chunks

Conversation

@leonardmq
Collaborator

@leonardmq leonardmq commented Jan 27, 2026

What does this PR do?

This adds page numbers to chunks in RAG, allowing the caller of the RagTool to format citations that point to a specific page.

Page number is nullable: it is None for non-PDF (or otherwise non-paged) documents.

Getting page numbers in an existing Search Tool requires reindexing: delete ~/.kiln_ai/rag_indexes/
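A minimal way to force that reindex from a shell (assuming the default Kiln data directory; the next indexing run rebuilds the index with page-number metadata):

```shell
# Remove the cached RAG indexes so the next indexing run rebuilds them.
# Path is as stated above; adjust if your Kiln data directory differs.
rm -rf ~/.kiln_ai/rag_indexes/
```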

The entire flow is:

  1. Extract -> extract each page and, as the extraction datamodel is materialized to disk as a single unit, record an array of char offsets marking the start of each page
  2. Chunking -> chunk the entire extraction, then locate each chunk within it and use the char offset range it falls into to map it back to the corresponding page
  3. Indexing -> add kiln_page_number as metadata on the record (alongside kiln_doc_id and the chunk id)
  4. Retrieval -> in the RagTool, include the page number in the chunk metadata
  5. UI -> surface the page number

Checklists

  • Tests have been run locally and passed
  • New tests have been added to any work in /lib

Summary by CodeRabbit

Release Notes

  • New Features

    • PDF extractions now track page numbers for each chunk, enabling better content attribution in search results and improved visibility into source document structure.
    • Search results display page information when chunks originate from multi-page documents, helping users quickly identify document location.
    • RAG configuration UI now shows page numbers alongside chunk indices for enhanced document context.
  • Tests

    • Expanded test coverage for page number tracking across extraction, chunking, vector storage, and search components.


@coderabbitai
Contributor

coderabbitai bot commented Jan 27, 2026

Walkthrough

This PR adds page-level metadata tracking throughout the RAG pipeline. It introduces page_offsets to extraction output (character positions marking page starts), computes page_number for each chunk during chunking, and propagates this data through vector storage and search results to the UI.

Changes

Cohort / File(s) Summary
Data Models
libs/core/kiln_ai/datamodel/extraction.py, libs/core/kiln_ai/datamodel/chunk.py, app/web_ui/src/lib/api_schema.d.ts, libs/core/kiln_ai/adapters/extractors/base_extractor.py, libs/core/kiln_ai/adapters/vector_store/base_vector_store_adapter.py
Added optional page_offsets field to Extraction and ExtractionOutput models; added optional page_number field to Chunk and SearchResult models across backend and frontend schemas.
Chunker Logic & Tests
libs/core/kiln_ai/adapters/chunkers/base_chunker.py, libs/core/kiln_ai/adapters/chunkers/test_base_chunker.py, libs/core/kiln_ai/adapters/chunkers/test_fixed_window_chunker.py, libs/core/kiln_ai/adapters/chunkers/test_semantic_chunker.py
Extended BaseChunker.chunk() to accept optional page_offsets parameter; added _find_page_number() helper to map chunk character offsets to page indices; propagates page_number to each produced TextChunk. Includes extensive page-number validation tests with some duplication.
PDF Extraction
libs/core/kiln_ai/adapters/extractors/litellm_extractor.py, libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py
Introduced PdfPageResult model to track per-page content; refactored _extract_pdf_page_by_page() to return tuple of content and page_offsets list computed from sequential page content lengths; updated _extract() to pass page_offsets to ExtractionOutput. Added assertions in tests to verify page_offsets correctness across PDF extraction scenarios.
RAG Pipeline Integration
libs/core/kiln_ai/adapters/rag/rag_runners.py, libs/core/kiln_ai/adapters/rag/test_rag_runners.py, libs/core/kiln_ai/adapters/extractors/extractor_runner.py
Updated RAG runners to propagate page_offsets from Extraction into chunking step; ensured page_number from TextChunk is carried to Chunk model in ChunkedDocument. Tests mock and verify page_offsets propagation.
Vector Store & Search
libs/core/kiln_ai/adapters/vector_store/lancedb_adapter.py, libs/core/kiln_ai/adapters/vector_store/lancedb_helpers.py, libs/core/kiln_ai/adapters/vector_store/test_lancedb_adapter.py, libs/core/kiln_ai/adapters/vector_store/test_lancedb_helpers.py, libs/core/kiln_ai/adapters/vector_store_loaders/vector_store_loader.py
Modified convert_to_llama_index_node() to accept and store page_number in node metadata as kiln_page_number; updated vector store adapter to retrieve page_number from metadata during query results formatting and include in SearchResult; vector_store_loader extracts page_number from chunk data. Tests verify metadata storage and retrieval with duplicated test scenarios.
RAG Tools & UI
libs/core/kiln_ai/tools/rag_tools.py, app/web_ui/src/routes/(app)/docs/rag_configs/[project_id]/[rag_config_id]/rag_config/+page.svelte
Updated ChunkContext metadata construction to conditionally include page_number when available; updated search result display to show page number (e.g., "Page: N" or "N/A") alongside chunk index.
Data Model Tests
libs/core/kiln_ai/datamodel/test_chunk_models.py
Added unit tests verifying page_number field behavior (default None, explicit assignment, explicit None).

Sequence Diagram(s)

sequenceDiagram
    participant Extractor
    participant Chunker
    participant VectorStore
    participant SearchUI

    Extractor->>Extractor: Extract PDF page-by-page
    Extractor->>Extractor: Compute page_offsets<br/>(char positions per page)
    Extractor->>Chunker: ExtractionOutput + page_offsets

    Chunker->>Chunker: For each chunk,<br/>find page_number<br/>via _find_page_number()
    Chunker->>Chunker: Create TextChunk<br/>with page_number
    Chunker->>VectorStore: TextChunk[] + page_number

    VectorStore->>VectorStore: Store page_number<br/>as kiln_page_number metadata
    SearchUI->>VectorStore: Query for chunks

    VectorStore->>VectorStore: Retrieve kiln_page_number<br/>from metadata
    VectorStore->>SearchUI: SearchResult<br/>+ page_number
    SearchUI->>SearchUI: Display "Page: N"<br/>or "N/A"

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related PRs

  • PR#596: Both PRs add per-page PDF extraction and propagate page-level metadata (page offsets/page numbers) through the extractor and downstream models.
  • PR#527: Both PRs modify PDF extraction in litellm_extractor.py to produce per-page outputs and introduce page-level metadata (page content/numbers and offsets).
  • PR#487: Both PRs extend the vector store adapter surface to include page-number propagation through SearchResult and vector store plumbing.

Suggested reviewers

  • scosman
  • chiang-daniel
  • sfierro

Poem

🐰 Page numbers now hop through the chunks with glee,
From PDFs to offsets, a mapping spree,
Each chunk knows its page, from zero to n,
Vector stores remember, and searches ascend,
Metadata magic makes search complete! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 54.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The title 'feat: add page numbers in chunks (RAG)' is concise, clear, and accurately describes the main change: adding page number metadata to chunks in the RAG pipeline.
  • Description check ✅ Passed — The description clearly explains the PR's purpose and outlines the implementation flow across five stages, with both checklist items marked complete, but is missing a 'Related Issues' section.


@gemini-code-assist
Contributor

Summary of Changes

Hello @leonardmq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the RAG (Retrieval Augmented Generation) system by implementing a full pipeline for tracking and displaying page numbers for document chunks. This feature allows the system to provide more precise citations by linking retrieved information directly to its source page, improving the overall utility and verifiability of generated responses. The changes span from data extraction and chunking to indexing, retrieval, and finally, presentation in the user interface.

Highlights

  • Page Number Integration: Introduced page_number as an optional field across key RAG components, including TextChunk, Chunk, SearchResult, and the API schema, allowing for page-specific citations.
  • PDF Page Offset Extraction: The LitellmExtractor now processes PDF documents page by page, calculating and storing character page_offsets within the ExtractionOutput and Extraction datamodels. This enables mapping chunks back to their original pages.
  • Chunk-to-Page Mapping Logic: The BaseChunker was enhanced to accept page_offsets during chunking. It uses these offsets to determine and assign the correct page_number to each generated TextChunk, which is then persisted in the Chunk datamodel.
  • Vector Store Indexing and Retrieval: Page numbers are now stored as kiln_page_number metadata in the LanceDB vector store during indexing and are retrieved and included in SearchResult objects during queries. This ensures page information is available throughout the RAG pipeline.
  • User Interface Display: The web UI has been updated to display the page number alongside the chunk index in search results, providing users with direct page references for retrieved information.
  • Comprehensive Testing: Extensive unit and integration tests were added or updated across chunkers, extractors, and vector store adapters to validate the correct handling, storage, and retrieval of page numbers, including edge cases and scenarios with and without page offsets.


@github-actions

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

  • libs/core/kiln_ai/adapters/chunkers/base_chunker.py (87.5%): Missing lines 47,51,80
  • libs/core/kiln_ai/adapters/extractors/base_extractor.py (100%)
  • libs/core/kiln_ai/adapters/extractors/litellm_extractor.py (100%)
  • libs/core/kiln_ai/adapters/vector_store/base_vector_store_adapter.py (100%)
  • libs/core/kiln_ai/adapters/vector_store/lancedb_adapter.py (100%)
  • libs/core/kiln_ai/adapters/vector_store/lancedb_helpers.py (100%)
  • libs/core/kiln_ai/adapters/vector_store_loaders/vector_store_loader.py (100%)
  • libs/core/kiln_ai/datamodel/chunk.py (100%)
  • libs/core/kiln_ai/datamodel/extraction.py (100%)
  • libs/core/kiln_ai/tools/rag_tools.py (66.7%): Missing lines 54

Summary

  • Total: 60 lines
  • Missing: 4 lines
  • Coverage: 93%

Line-by-line

View line-by-line diff coverage

libs/core/kiln_ai/adapters/chunkers/base_chunker.py

Lines 43-55

  43             search_start = 0
  44             for chunk in chunking_result.chunks:
  45                 chunk_start_offset = text.find(chunk.text, search_start)
  46                 if chunk_start_offset == -1:
! 47                     logger.warning(
  48                         f"Chunk text not found in sanitized text starting from offset {search_start}. "
  49                         "This may indicate an issue with the chunker implementation."
  50                     )
! 51                     chunk.page_number = None
  52                 else:
  53                     page_number = self._find_page_number(
  54                         chunk_start_offset, page_offsets
  55                     )

Lines 76-84

  76         for i in range(len(page_offsets) - 1, -1, -1):
  77             if chunk_offset >= page_offsets[i]:
  78                 return i
  79 
! 80         return None
  81 
  82     @abstractmethod
  83     async def _chunk(self, text: str) -> ChunkingResult:
  84         pass

libs/core/kiln_ai/tools/rag_tools.py

Lines 50-58

  50             "document_id": search_result.document_id,
  51             "chunk_idx": search_result.chunk_idx,
  52         }
  53         if search_result.page_number is not None:
! 54             metadata["page_number"] = search_result.page_number
  55         results.append(
  56             ChunkContext(
  57                 metadata=metadata,
  58                 text=search_result.chunk_text,



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces page number tracking and display across the RAG pipeline. Key changes include adding a page_number field to the TextChunk and SearchResult models, and page_offsets to the ExtractionOutput and Extraction models. The BaseChunker now accepts page_offsets to assign page numbers to chunks, with specific logic implemented in LitellmExtractor to calculate these offsets for PDF extractions. The frontend UI has been updated to display the page number for each chunk. Extensive unit tests have been added or modified for BaseChunker, FixedWindowChunker, SemanticChunker, LitellmExtractor, and LanceDBAdapter to ensure correct handling, storage, and retrieval of page numbers and offsets.

Review comments suggest improving the efficiency of text.find() in the chunker, clarifying the return behavior of _find_page_number for offsets before the first page, extracting the metadata creation logic into a separate function, and adding comments to explain test cases.

Comment on lines +45 to +50
chunk_start_offset = text.find(chunk.text, search_start)
if chunk_start_offset == -1:
logger.warning(
f"Chunk text not found in sanitized text starting from offset {search_start}. "
"This may indicate an issue with the chunker implementation."
)

medium

The text.find() method can be inefficient for large texts or frequent calls. Consider using a more efficient string searching algorithm or pre-processing the text if performance becomes a bottleneck. Also, consider adding a more descriptive error message to the logger.

Suggested change
chunk_start_offset = text.find(chunk.text, search_start)
if chunk_start_offset == -1:
logger.warning(
f"Chunk text not found in sanitized text starting from offset {search_start}. "
"This may indicate an issue with the chunker implementation."
)
chunk_start_offset = text.find(chunk.text, search_start)
if chunk_start_offset == -1:
logger.warning(
f"Chunk text not found in sanitized text starting from offset {search_start}. "
"This may indicate an issue with the chunker implementation. "
f"Chunk text: '{chunk.text[:50]}...'"
)
chunk.page_number = None

Comment on lines +73 to +74
if chunk_offset < page_offsets[0]:
return 0

medium

Returning 0 here might not be the most intuitive behavior. If the chunk_offset is before the first page, it might be better to return None to indicate that the chunk doesn't belong to any page. This would require updating the type hint to Optional[int].

Suggested change
if chunk_offset < page_offsets[0]:
return 0
if chunk_offset < page_offsets[0]:
return None

Comment on lines +55 to +77
metadata: Dict[str, Any] = {
# metadata is populated by some internal llama_index logic
# that uses for example the source_node relationship
"kiln_doc_id": document_id,
"kiln_chunk_idx": chunk_idx,
#
# llama_index lancedb vector store automatically sets these metadata:
# "doc_id": "UUID node_id of the Source Node relationship",
# "document_id": "UUID node_id of the Source Node relationship",
# "ref_doc_id": "UUID node_id of the Source Node relationship"
#
# llama_index file loaders set these metadata, which would be useful to also support:
# "creation_date": "2025-09-03",
# "file_name": "file.pdf",
# "file_path": "/absolute/path/to/the/file.pdf",
# "file_size": 395154,
# "file_type": "application\/pdf",
# "last_modified_date": "2025-09-03",
# "page_label": "1",
}

if page_number is not None:
metadata["kiln_page_number"] = page_number

medium

Consider extracting the metadata creation logic into a separate function for better readability and maintainability.

def _create_metadata(document_id: str, chunk_idx: int, page_number: int | None = None) -> Dict[str, Any]:
    metadata: Dict[str, Any] = {
        # metadata is populated by some internal llama_index logic
        # that uses for example the source_node relationship
        "kiln_doc_id": document_id,
        "kiln_chunk_idx": chunk_idx,
        #
        # llama_index lancedb vector store automatically sets these metadata:
        # "doc_id": "UUID node_id of the Source Node relationship",
        # "document_id": "UUID node_id of the Source Node relationship",
        # "ref_doc_id": "UUID node_id of the Source Node relationship"
        #
        # llama_index file loaders set these metadata, which would be useful to also support:
        # "creation_date": "2025-09-03",
        # "file_name": "file.pdf",
        # "file_path": "/absolute/path/to/the/file.pdf",
        # "file_size": 395154,
        # "file_type": "application\\/pdf",
        # "last_modified_date": "2025-09-03",
        # "page_label": "1",
    }

    if page_number is not None:
        metadata["kiln_page_number"] = page_number
    return metadata

def convert_to_llama_index_node(
    document_id: str,
    chunk_idx: int,
    node_id: str,
    text: str,
    vector: List[float],
    page_number: int | None = None,
) -> TextNode:
    metadata = _create_metadata(document_id, chunk_idx, page_number)
    return TextNode(
        id_=node_id,
        text=text,
        embedding=vector,
        metadata=metadata,
        relationships={
            # when using the llama_index loaders, llama_index groups Nodes under Documents
            # and relationships point to the Document (which is also a Node), which confusingly

):
adapter.format_query_result(query_result)

# Test with valid page_number (int) - should work

medium

Consider adding a comment explaining why these specific values are being tested. What is the significance of testing with an integer, None, and a float?

chunks=[
Chunk(
content=KilnAttachmentModel.from_data("chunk 0", "text/plain"),
page_number=0,

medium

Consider adding a check to ensure that the cache directory is empty before running the test. This can help prevent false positives if a previous test run failed and left data in the cache.

page_number=0,
),
Chunk(
content=KilnAttachmentModel.from_data("chunk 1", "text/plain"),

medium

Consider adding a comment explaining why the id is being asserted to be not None.

assert len(nodes) == 4
assert nodes[0].metadata.get("kiln_page_number") == 0
assert nodes[1].metadata.get("kiln_page_number") == 1
assert "kiln_page_number" not in nodes[2].metadata # None should not be stored

medium

Consider adding a comment explaining why kiln_page_number should not be in metadata when the value is None.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libs/core/kiln_ai/tools/rag_tools.py (1)

183-195: Page number is lost during reranking.

The rerank method reconstructs SearchResult objects but doesn't preserve page_number. After reranking, chunks will lose their page metadata, breaking the page number feature when a reranker is configured.

🐛 Proposed fix to preserve page_number through reranking

The rerank method needs access to the original page_number from the input SearchResult. One approach is to store page numbers in a lookup and retrieve them when rebuilding:

     async def rerank(
         self, search_results: List[SearchResult], query: str
     ) -> List[SearchResult]:
         if self.reranker is None:
             return search_results

+        # Build lookup for page_number by global chunk id
+        page_number_lookup = {
+            f"{r.document_id}::{r.chunk_idx}": r.page_number
+            for r in search_results
+        }
+
         reranked_results = await self.reranker.rerank(
             query=query,
             documents=convert_search_results_to_rerank_input(search_results),
         )

         reranked_search_results = []
         for result in reranked_results.results:
             document_id, chunk_idx = split_global_chunk_id(result.document.id)
             reranked_search_results.append(
                 SearchResult(
                     document_id=document_id,
                     chunk_idx=chunk_idx,
                     chunk_text=result.document.text,
                     similarity=result.relevance_score,
+                    page_number=page_number_lookup.get(result.document.id),
                 )
             )

         return reranked_search_results
🧹 Nitpick comments (4)
libs/core/kiln_ai/adapters/vector_store/test_lancedb_helpers.py (1)

183-195: LGTM!

Test correctly verifies that explicit page_number=None does not add kiln_page_number to metadata.

Consider adding a test for page_number=0 to verify the first page (0-indexed) is correctly stored and not falsely treated as absent due to Python's truthiness of 0:

♻️ Optional test for page_number=0
def test_convert_to_llama_index_node_with_page_number_zero():
    node = convert_to_llama_index_node(
        document_id="doc-123",
        chunk_idx=0,
        node_id="11111111-1111-5111-8111-111111111111",
        text="hello",
        vector=[0.1, 0.2],
        page_number=0,
    )

    assert node.metadata["kiln_page_number"] == 0
libs/core/kiln_ai/datamodel/extraction.py (1)

98-101: LGTM!

The page_offsets field is well-defined with a clear description. Using list[int] | None follows modern Python 3.10+ typing syntax which aligns with the coding guidelines.

Consider adding a validator to ensure page_offsets invariants when provided (e.g., offsets are sorted, first offset is 0). This would catch data corruption early:

♻️ Optional validator for page_offsets
`@field_validator`("page_offsets")
`@classmethod`
def validate_page_offsets(cls, v: list[int] | None) -> list[int] | None:
    if v is None:
        return v
    if len(v) == 0:
        raise ValueError("page_offsets must not be empty if provided")
    if v[0] != 0:
        raise ValueError("page_offsets must start with 0")
    if v != sorted(v):
        raise ValueError("page_offsets must be sorted in ascending order")
    return v
libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py (1)

776-790: Remove duplicate page_offsets assertions.

Both tests repeat the same block twice; trimming improves readability without losing coverage.

♻️ Suggested cleanup
@@
-    # Verify page_offsets are present and correct
-    assert result.page_offsets is not None
-    assert len(result.page_offsets) == 2
-    assert result.page_offsets[0] == 0
-    # Page 1 starts after page 0 content + separator (2 chars for "\n\n")
-    expected_page_1_offset = len("Content from page 1") + 2
-    assert result.page_offsets[1] == expected_page_1_offset
@@
-    # Verify page_offsets are present and correct
-    assert result.page_offsets is not None
-    assert len(result.page_offsets) == 2
-    assert result.page_offsets[0] == 0
-    expected_page_1_offset = len("Content from page 1") + 2
-    assert result.page_offsets[1] == expected_page_1_offset

Also applies to: 916-928

libs/core/kiln_ai/adapters/chunkers/base_chunker.py (1)

61-80: Minor: Unreachable return statement.

The return None at line 80 is unreachable. Given the checks:

  1. If page_offsets is empty → returns None at line 71.
  2. If chunk_offset < page_offsets[0] → returns 0 at line 74.
  3. Otherwise, the loop will always find an i where chunk_offset >= page_offsets[i] (at minimum i=0).

This is harmless defensive code, but you could simplify by removing it or adding an assertion.

♻️ Optional simplification
         for i in range(len(page_offsets) - 1, -1, -1):
             if chunk_offset >= page_offsets[i]:
                 return i

-        return None
+        # This line is unreachable given the checks above, but satisfies type checker
+        raise AssertionError("Unreachable: chunk_offset must match some page")
