Add PDF fragment loader plugin to directory #954

Open
wants to merge 2 commits into
base: main
from

Conversation

agustif
@agustif agustif commented Apr 25, 2025

Hi! After building this feature for my other arXiv plugin, I thought it could be useful for generic PDFs too, so I made this plugin.

llm-plugin-pdf provides a -f pdf: loader that can load local or remote PDF files as fragments.

It's a small wrapper around PyMuPDF that tries to parse a PDF's text and images into Markdown, so the file's contents can be provided as a fragment.

This should use far fewer tokens than feeding a full PDF to a model directly. Most papers are built from source, so their text extracts well without relying on clunky OCR. (I explored using GROBID for other uses, but it requires a server, which made it a no-go for this. PyMuPDF worked well in my tests, and I was also able to parse the PDF's images into base64-encoded data, so everything is passed to the model as fragments, not only the text.)
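For readers who want a picture of the approach described above, here is a minimal sketch and not the plugin's actual code. The names `extract_pages`, `pages_to_markdown`, and the `Page` alias are hypothetical, the Markdown layout (a `## Page N` heading, images as data URIs) is an assumption, and the extraction step assumes a recent PyMuPDF release that can be imported as `pymupdf`.

```python
import base64
from typing import Iterable, List, Tuple

# Hypothetical shape for one page: (page text, [(image extension, raw bytes), ...])
Page = Tuple[str, List[Tuple[str, bytes]]]


def extract_pages(path: str) -> List[Page]:
    """Pull text and embedded images out of a local PDF with PyMuPDF."""
    # Imported lazily so the formatter below can be used without PyMuPDF.
    import pymupdf  # assumption: recent PyMuPDF; older releases use `import fitz`

    pages: List[Page] = []
    with pymupdf.open(path) as doc:
        for page in doc:
            images = []
            for info in page.get_images(full=True):
                # The first item of each entry is the image's xref number.
                extracted = doc.extract_image(info[0])
                images.append((extracted["ext"], extracted["image"]))
            pages.append((page.get_text(), images))
    return pages


def pages_to_markdown(pages: Iterable[Page]) -> str:
    """Render pages as Markdown, embedding each image as a base64 data URI."""
    parts = []
    for number, (text, images) in enumerate(pages, start=1):
        parts.append(f"## Page {number}\n\n{text.strip()}")
        for index, (ext, data) in enumerate(images, start=1):
            encoded = base64.b64encode(data).decode("ascii")
            parts.append(
                f"![page {number} image {index}](data:image/{ext};base64,{encoded})"
            )
    return "\n\n".join(parts)
```

Something like `pages_to_markdown(extract_pages("paper.pdf"))` would then give a single string that a fragment loader could return. Handling remote PDFs would need a download step first, which this sketch leaves out.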

@agustif changed the title from "Docs/add plugin pdf" to "Add PDF fragment loader plugin to directory" on Apr 25, 2025
@simonw
Owner

simonw commented May 4, 2025

This plugin can now be upgraded to pass those images as attachments, not as base64 encoded strings:

2 participants