Feature Request: Incremental Index Updates for Large Documents #316

@venkatsp17

Problem

PageIndex currently appears to rebuild a document's entire index whenever the document is reprocessed.

This works well for static documents, but it can become expensive for large documents that undergo small updates.

For example:

  • A 500-page policy manual receives a 2-page revision.
  • A contract receives a minor amendment.
  • A compliance document is updated in a single section.

In these cases, rebuilding the entire hierarchical index may require regenerating summaries and tree structures for large portions of the document even though only a small subset changed.

Proposed Enhancement

Introduce incremental index updates that can:

  1. Detect changed sections between document versions.
  2. Rebuild only affected branches of the document tree.
  3. Recompute summaries only along impacted paths.
  4. Optionally maintain document version history and change tracking.

At a high level, this could be achieved through node-level content hashing and selective subtree regeneration.
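To make the hashing idea concrete, here is a minimal sketch of Merkle-style hashing over a document tree. The `Node` class and `hash_tree` function are hypothetical names used for illustration and are not part of the PageIndex API; the sketch assumes each node carries its own text and an ordered list of children.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node shape, not PageIndex's actual schema.
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)
    content_hash: str = ""   # hash of this node's own text only
    subtree_hash: str = ""   # hash covering this node and all descendants


def hash_tree(node: Node) -> str:
    """Fill in content_hash and subtree_hash bottom-up; return the subtree hash."""
    node.content_hash = hashlib.sha256(node.text.encode("utf-8")).hexdigest()
    h = hashlib.sha256()
    h.update(node.title.encode("utf-8"))
    h.update(node.content_hash.encode("ascii"))
    for child in node.children:
        # A change anywhere below this node changes a child's subtree hash,
        # which in turn changes this node's subtree hash.
        h.update(hash_tree(child).encode("ascii"))
    node.subtree_hash = h.hexdigest()
    return node.subtree_hash


def build(revised: bool) -> Node:
    """Tiny two-chapter example; only section 1.2 differs between versions."""
    body = "Retention period is 7 years." if revised else "Retention period is 5 years."
    return Node("Manual", children=[
        Node("Chapter 1", children=[Node("1.1", "Scope."), Node("1.2", body)]),
        Node("Chapter 2", children=[Node("2.1", "Definitions.")]),
    ])


v1, v2 = build(revised=False), build(revised=True)
hash_tree(v1)
hash_tree(v2)
# Chapter 2 is untouched, so its subtree hash matches and it can be skipped.
print(v1.children[1].subtree_hash == v2.children[1].subtree_hash)
# Chapter 1 contains the edit, so its subtree hash differs.
print(v1.children[0].subtree_hash == v2.children[0].subtree_hash)
```

Storing both hashes per node is a design choice: the subtree hash lets a comparison skip an unchanged branch in one check, while the content hash tells whether a node's own text changed or only something beneath it.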

Benefits

Reduced Indexing Cost

Only modified sections, plus the summaries along their parent paths, would require reprocessing, so LLM and indexing costs would scale roughly with the size of the change rather than the size of the document.

Faster Updates

Small document revisions could be indexed much faster than full document rebuilds.

Better Enterprise Support

Many enterprise workflows involve:

  • Policy revisions
  • Regulatory updates
  • Contract amendments
  • Documentation versioning

Incremental updates would make PageIndex more practical in these environments.

Foundation for Future Features

This capability could also enable:

  • Document version comparison
  • Change history visualization
  • Impact analysis of document updates
  • Audit and compliance workflows

Example

Current workflow:

Document v1
→ Build full tree

Document v2 (2 pages changed)
→ Rebuild full tree

Proposed workflow:

Document v1
→ Build full tree

Document v2 (2 pages changed)
→ Detect changed nodes
→ Rebuild affected subtree only
→ Update parent summaries as needed
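The proposed workflow could be planned roughly as follows. This is a sketch under stated assumptions, not a description of how PageIndex works: `Node`, `subtree_hash`, and `plan_update` are hypothetical names, sections are matched between versions by title, and a node whose own text changed is conservatively rebuilt together with everything beneath it.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node shape, not PageIndex's actual schema.
    title: str
    text: str = ""
    children: list = field(default_factory=list)


def subtree_hash(node):
    """Hash of a node's title, text, and all descendants.

    Recomputed on every call for brevity; a real index would cache it per node.
    """
    h = hashlib.sha256(node.title.encode("utf-8") + b"\x00" + node.text.encode("utf-8"))
    for child in node.children:
        h.update(subtree_hash(child).encode("ascii"))
    return h.hexdigest()


def plan_update(old, new, path=()):
    """Return (rebuild, refresh) lists of title paths for moving from old to new.

    rebuild: subtrees to regenerate from scratch.
    refresh: ancestors whose summaries should be recomputed, deepest first.
    """
    here = path + (new.title,)
    if old is not None and subtree_hash(old) == subtree_hash(new):
        return [], []                      # unchanged branch: skip entirely
    if old is None or old.text != new.text or not new.children:
        return [here], []                  # new or edited node: rebuild its subtree
    old_by_title = {c.title: c for c in old.children}  # assumption: titles identify sections
    rebuild, refresh = [], []
    for child in new.children:
        r, f = plan_update(old_by_title.get(child.title), child, here)
        rebuild += r
        refresh += f
    refresh.append(here)                   # something below changed (or was removed)
    return rebuild, refresh


v1 = Node("Manual", children=[
    Node("Ch1", children=[Node("1.1", "a"), Node("1.2", "b")]),
    Node("Ch2", children=[Node("2.1", "c")]),
])
v2 = Node("Manual", children=[
    Node("Ch1", children=[Node("1.1", "a"), Node("1.2", "b revised")]),
    Node("Ch2", children=[Node("2.1", "c")]),
])

rebuild, refresh = plan_update(v1, v2)
print(rebuild)  # [('Manual', 'Ch1', '1.2')]
print(refresh)  # [('Manual', 'Ch1'), ('Manual',)]
```

Matching by title breaks down when a section is renamed or reordered; a stable node identifier, if the index has one, would be a more robust key.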

Discussion

Has incremental indexing already been considered for the roadmap?

It seems like a natural fit for PageIndex's hierarchical tree architecture and could provide significant performance improvements for large, frequently updated documents.
