Add arXiv paper ingestion via API #3

@dentity007

Description

What needs to be done

Create forge/ingest/arxiv_loader.py, which fetches papers from arXiv by author name or search query using the arXiv API, extracts each paper's abstract and full text (from the PDF), and ingests them into the knowledge base.
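As a starting point, the fetch step could look roughly like the sketch below. It builds a query URL for the public arXiv API endpoint (export.arxiv.org/api/query, using the au: and all: field prefixes) and parses the Atom feed that the API returns. The names ArxivPaper, build_query_url, and parse_feed are hypothetical and not part of the existing forge code; the actual HTTP request and the PDF download are left out.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace of the Atom feed


@dataclass
class ArxivPaper:
    """Hypothetical record type; the real loader may use a different shape."""
    arxiv_id: str
    title: str
    abstract: str
    authors: List[str]
    pdf_url: Optional[str]


def build_query_url(author: Optional[str] = None,
                    query: Optional[str] = None,
                    max_results: int = 10) -> str:
    """Build an arXiv API URL for exactly one of: author name or free-text query."""
    if bool(author) == bool(query):
        raise ValueError("pass exactly one of author or query")
    search = f'au:"{author}"' if author else f"all:{query}"
    params = {"search_query": search, "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def _squash(text: str) -> str:
    """Collapse the line breaks and runs of spaces found in feed text."""
    return " ".join(text.split())


def parse_feed(xml_text: str) -> List[ArxivPaper]:
    """Turn an arXiv Atom response into a list of ArxivPaper records."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        pdf_url = None
        for link in entry.findall(f"{ATOM}link"):
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        entry_id = entry.findtext(f"{ATOM}id", "")
        papers.append(ArxivPaper(
            arxiv_id=entry_id.rsplit("/abs/", 1)[-1],
            title=_squash(entry.findtext(f"{ATOM}title", "")),
            abstract=_squash(entry.findtext(f"{ATOM}summary", "")),
            authors=[a.findtext(f"{ATOM}name", "")
                     for a in entry.findall(f"{ATOM}author")],
            pdf_url=pdf_url,
        ))
    return papers
```

The response body for build_query_url(author="...") can be fetched with any HTTP client and passed to parse_feed; each record's pdf_url can then go through the existing PDF loading path.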

Why this matters

arXiv is a primary source of AI/ML research papers, so researchers need to be able to ingest papers directly from it.

Where to look

  • forge/ingest/document_loader.py (existing pattern for PDF loading)
  • forge/ingest/upserter.py (batch embedding + upsert)
  • examples/ingest_arxiv.py (example script already exists as a template)

NVIDIA Stack Impact

Uses Triton for batch embedding of paper chunks.
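A minimal sketch of the batching side of this, assuming the embedding call is passed in as a plain callable. The embed_batch argument stands in for whatever Triton client wrapper forge/ingest/upserter.py already uses, which is not shown in this issue; batched, embed_in_batches, and the default batch size of 32 are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed all chunks, calling embed_batch once per batch.

    embed_batch is a placeholder for the real inference call; it must
    return one vector per input text, in the same order.
    """
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        out = embed_batch(batch)
        if len(out) != len(batch):
            raise RuntimeError("embedding backend returned a wrong-sized batch")
        vectors.extend(out)
    return vectors
```

Keeping the backend behind a callable also makes the loader testable without a running inference server.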

Acceptance criteria

  • Can ingest papers by author name
  • Can ingest papers by search query
  • PDF text extraction works
  • Chunks embedded and upserted to Qdrant
  • Graph nodes created for papers and authors
  • Tests added
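To make the chunking and graph criteria above more concrete, here is a rough sketch of both steps. The word-based chunking with overlap, the node and edge dictionaries, and the AUTHORED edge label are all assumptions for illustration; the project's actual chunker and graph schema are not described in this issue and may differ.

```python
from typing import Dict, List, Tuple


def chunk_text(text: str, chunk_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows of chunk_words, overlapping by overlap words."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_words")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def paper_graph(
    arxiv_id: str, title: str, authors: List[str]
) -> Tuple[List[Dict[str, str]], List[Tuple[str, str, str]]]:
    """Build one paper node, one node per author, and author-to-paper edges.

    The id format and field names are hypothetical placeholders.
    """
    paper_id = f"paper:{arxiv_id}"
    nodes = [{"id": paper_id, "type": "paper", "title": title}]
    edges = []
    for name in authors:
        author_id = "author:" + "-".join(name.lower().split())
        nodes.append({"id": author_id, "type": "author", "name": name})
        edges.append((author_id, "AUTHORED", paper_id))
    return nodes, edges
```

Deriving the author id from the normalized name means the same author ingested from two papers maps to one node, at the cost of merging distinct people who share a name; a test for that trade-off would fit under the "Tests added" item.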

Labels

  • enhancement (New feature or request)
  • help wanted (Extra attention is needed)
