Problem
Currently, PageIndex appears to rebuild the entire index whenever a document is reprocessed.
This works well for static documents, but it can become expensive for large documents that undergo small updates.
For example:
- A 500-page policy manual receives a 2-page revision.
- A contract receives a minor amendment.
- A compliance document is updated in a single section.
In these cases, rebuilding the entire hierarchical index may require regenerating summaries and tree structures for large portions of the document even though only a small subset changed.
Proposed Enhancement
Introduce incremental index updates that can:
- Detect changed sections between document versions.
- Rebuild only affected branches of the document tree.
- Recompute summaries only along impacted paths.
- Optionally maintain document version history and change tracking.
At a high level, this could be achieved through node-level content hashing and selective subtree regeneration.
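To make the hashing idea concrete, here is a minimal sketch of Merkle-style change detection. The `Node` class, the helper names, and the choice to match sections across versions by title are all assumptions made for illustration; none of this reflects PageIndex's actual data model or API.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node; not PageIndex's actual schema."""
    title: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def content_hash(node: Node) -> str:
    """Hash of the node's own title and text only."""
    return hashlib.sha256(f"{node.title}\n{node.text}".encode("utf-8")).hexdigest()


def subtree_hash(node: Node) -> str:
    """Merkle-style hash: own content plus the hashes of all children, in order."""
    h = hashlib.sha256(content_hash(node).encode("utf-8"))
    for child in node.children:
        h.update(subtree_hash(child).encode("utf-8"))
    return h.hexdigest()


def changed_paths(
    old: Optional[Node], new: Node, path: Tuple[str, ...] = ()
) -> List[Tuple[str, ...]]:
    """Return title paths of nodes in `new` whose own content is new or changed."""
    here = path + (new.title,)
    if old is not None and subtree_hash(old) == subtree_hash(new):
        return []  # identical subtree: nothing below needs rebuilding
    changed = []
    if old is None or content_hash(old) != content_hash(new):
        changed.append(here)
    # Assumption: sibling titles are unique, so children can be matched by title.
    old_children = {c.title: c for c in old.children} if old else {}
    for child in new.children:
        changed.extend(changed_paths(old_children.get(child.title), child, here))
    return changed
```

Because an unchanged subtree hash short-circuits the recursion, the comparison only descends into branches that actually differ. This sketch recomputes hashes on every call and does not report deleted sections; a real implementation would likely store the hashes alongside each node and treat a removed child as a change to its parent.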
Benefits
Reduced Indexing Cost
Only modified sections, plus the summaries on the path from each one up to the root, would require reprocessing, which should substantially reduce LLM and indexing costs for large documents.
Faster Updates
Small document revisions could be indexed much faster than with a full rebuild.
Better Enterprise Support
Many enterprise workflows involve:
- Policy revisions
- Regulatory updates
- Contract amendments
- Documentation versioning
Incremental updates would make PageIndex more practical in these environments.
Foundation for Future Features
This capability could also enable:
- Document version comparison
- Change history visualization
- Impact analysis of document updates
- Audit and compliance workflows
Example
Current workflow:
Document v1
→ Build full tree
Document v2 (2 pages changed)
→ Rebuild full tree
Proposed workflow:
Document v1
→ Build full tree
Document v2 (2 pages changed)
→ Detect changed nodes
→ Rebuild affected subtree only
→ Update parent summaries as needed
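The proposed workflow could be approximated with a summary cache keyed by a hash of each subtree, so that only nodes on a changed path trigger a new summarization call. The following is a sketch under stated assumptions: `Section`, `build_summaries`, and the `summarize` callback (a stand-in for an LLM call) are hypothetical names, not part of PageIndex.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Section:
    """Hypothetical node shape, assumed for illustration only."""
    title: str
    text: str = ""
    children: List["Section"] = field(default_factory=list)
    summary: str = ""


def build_summaries(
    node: Section,
    cache: Dict[str, str],
    summarize: Callable[[str, str, List[str]], str],
) -> str:
    """Fill in summaries bottom-up and return the node's subtree key.

    The key covers the node's own content and its children's keys, so a
    change anywhere below a node changes that node's key as well.
    """
    child_keys = [build_summaries(c, cache, summarize) for c in node.children]
    raw = "\x00".join([node.title, node.text, *child_keys])
    key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    if key not in cache:
        # Cache miss: this node or something beneath it changed.
        child_summaries = [c.summary for c in node.children]
        cache[key] = summarize(node.title, node.text, child_summaries)
    node.summary = cache[key]
    return key


# Usage: a stub summarizer that records which nodes it was asked about.
calls: List[str] = []

def fake_summarize(title: str, text: str, child_summaries: List[str]) -> str:
    calls.append(title)
    return f"summary of {title}"

cache: Dict[str, str] = {}
v1 = Section("Manual", children=[Section("Intro", "a"),
                                 Section("Policy", "b", [Section("Leave", "c")])])
build_summaries(v1, cache, fake_summarize)   # first build summarizes every node

calls.clear()
v2 = Section("Manual", children=[Section("Intro", "a"),
                                 Section("Policy", "b", [Section("Leave", "c (revised)")])])
build_summaries(v2, cache, fake_summarize)   # only the changed path is summarized
print(calls)
```

In this sketch the unchanged "Intro" branch is served from the cache, while "Leave" and its ancestors are regenerated. Keying the cache on content rather than on document version also means that sections shared verbatim across versions are reused without explicit diffing.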
Discussion
Has incremental indexing already been considered for the roadmap?
It seems like a natural fit for PageIndex's hierarchical tree architecture and could provide significant performance improvements for large, frequently updated documents.