What needs to be done
Create forge/ingest/arxiv_loader.py, which fetches papers from arXiv by author name or search query via the arXiv API, extracts each paper's abstract and full text (from the PDF), and ingests them into the knowledge base.
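A rough sketch of the fetch step, to make the scope concrete. It assumes the public arXiv query endpoint (`export.arxiv.org/api/query`) and its Atom response format; `ArxivPaper`, `build_query_url`, `parse_feed`, and `fetch_papers` are hypothetical names for illustration, not existing forge code. PDF download and text extraction are left to the existing loader pattern.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the feed


@dataclass
class ArxivPaper:
    """Hypothetical container for one search result."""
    arxiv_id: str
    title: str
    abstract: str
    authors: List[str]
    pdf_url: Optional[str]


def build_query_url(author: Optional[str] = None,
                    query: Optional[str] = None,
                    max_results: int = 10) -> str:
    """Build a query URL from an author name, a free-text query, or both."""
    terms = []
    if author:
        terms.append(f'au:"{author}"')   # au: is the author field prefix
    if query:
        terms.append(f'all:"{query}"')   # all: searches every field
    if not terms:
        raise ValueError("provide an author, a query, or both")
    params = {
        "search_query": " AND ".join(terms),
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text: str) -> List[ArxivPaper]:
    """Parse the Atom feed returned by the API into ArxivPaper records."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        pdf_url = None
        for link in entry.findall(f"{ATOM}link"):
            if link.get("title") == "pdf" or link.get("type") == "application/pdf":
                pdf_url = link.get("href")
        entry_id = entry.findtext(f"{ATOM}id", default="")
        papers.append(ArxivPaper(
            arxiv_id=entry_id.rsplit("/abs/", 1)[-1],
            # Titles and abstracts are line-wrapped in the feed; collapse whitespace.
            title=" ".join(entry.findtext(f"{ATOM}title", default="").split()),
            abstract=" ".join(entry.findtext(f"{ATOM}summary", default="").split()),
            authors=[a.findtext(f"{ATOM}name", default="")
                     for a in entry.findall(f"{ATOM}author")],
            pdf_url=pdf_url,
        ))
    return papers


def fetch_papers(author: Optional[str] = None,
                 query: Optional[str] = None,
                 max_results: int = 10) -> List[ArxivPaper]:
    """Query the API over the network and return parsed results."""
    url = build_query_url(author, query, max_results)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_feed(resp.read().decode("utf-8"))
```

Keeping URL construction and feed parsing separate from the network call means both can be unit-tested against a saved feed without hitting arXiv.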
Why this matters
arXiv is a primary source of AI/ML research papers, so researchers need to ingest papers from it directly.
Where to look
- forge/ingest/document_loader.py (existing pattern for PDF loading)
- forge/ingest/upserter.py (batch embedding + upsert)
- examples/ingest_arxiv.py (example script already exists as a template)
NVIDIA Stack Impact
Uses Triton for batch embedding of paper chunks.
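A possible shape for the chunk-and-batch step, as a sketch only. `chunk_text`, `batched`, and `embed_in_batches` are hypothetical helpers, and `embed_batch` is a stand-in for whatever Triton-backed call forge/ingest/upserter.py exposes; no Triton client API is assumed here, and the word-based chunk sizes are placeholders.

```python
from typing import Callable, Iterator, List, Sequence


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    # Stop once the remaining words are already covered by the previous chunk.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for i in range(0, len(items), batch_size):
        yield list(items[i:i + batch_size])


def embed_in_batches(chunks: Sequence[str],
                     embed_batch: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 32) -> List[List[float]]:
    """Embed chunks batch by batch; `embed_batch` is the hypothetical model call."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

Passing the embedding call in as a parameter keeps the loader testable with a fake embedder and leaves the real batching policy to the upserter.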
Acceptance criteria