LLM-Librarian

This repo uses LLMs as a librarian: it classifies documents, extracts type-specific metadata from them, and files them into an OpenSearch database and a vector store run by RAGatouille/Byaldi behind a Flask server I wrote. It currently works only on PDFs, which go in the pdfs folder.

This will serve as the base of a larger RAG system, where search and research agents use it as their knowledge store.

To run this: Set your GEMINI_API_KEY environment variable (or the API key for another provider supported by PydanticAI, in which case you also need to change the model it's using).

Set your OpenSearch environment variables.

Run OpenSearch with docker-compose up -d from wherever you place docker-compose.yaml (I run this on a separate computer).

(I usually run tmux here)

cd DwarfInTheFlask

uv run flask_server.py

Navigate back to root (or exit tmux)

uv run process_all.py
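Since the steps above depend on environment variables being set, a small preflight check can catch a missing one before anything starts. This is a minimal sketch, not part of the repo: GEMINI_API_KEY comes from the instructions above, while the OpenSearch variable names and the helper function are hypothetical placeholders, so substitute whatever names your setup actually uses.

```python
import os

# GEMINI_API_KEY is named in this README.
REQUIRED_VARS = ["GEMINI_API_KEY"]

# Hypothetical names for illustration only; the real OpenSearch variable
# names are whatever your configuration defines.
HYPOTHETICAL_OPENSEARCH_VARS = ["OPENSEARCH_HOST", "OPENSEARCH_PORT"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


def preflight():
    """Exit with a readable message if any expected variable is missing."""
    missing = missing_env_vars(REQUIRED_VARS + HYPOTHETICAL_OPENSEARCH_VARS)
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("Environment looks complete.")
```

Calling preflight() before starting the Flask server or process_all.py would fail fast with the list of missing names instead of failing partway through a run.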

NOTES:

-This uses OCR on the first page of the document to make title extraction more reliable for books, whose titles sometimes aren't actually included in the text of the PDF.

-Metadata extraction falls back to a partial set of pages if something goes wrong with the full PDF, such as exceeding the LLM's context length. By default it uses Gemini to avoid this issue, but I don't want it to rely on Gemini.

-Chunk text extraction for OpenSearch indexing uses PyMuPDF. This is kept separate from the OCR text because, even when metadata extraction uses only partial pages, chunk indexing still needs to use all of the text.
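The fallback and chunking behavior described in the notes can be sketched roughly as follows. This is an illustration under assumptions, not the code in pdf_metadata_extractor.py: the function names, the fallback_pages default, and the character-window chunking scheme are all hypothetical, and the per-page strings are assumed to have been extracted already (for example with PyMuPDF's page.get_text()).

```python
from typing import Callable, Dict, List, Sequence


def extract_with_fallback(
    pages: Sequence[str],
    extract: Callable[[str], Dict],
    fallback_pages: int = 10,
) -> Dict:
    """Try metadata extraction on the whole document, then retry on the
    leading pages only if the first attempt raises."""
    try:
        return extract("\n".join(pages))
    except Exception:
        # Assumed failure mode: the LLM call raises, e.g. because the
        # full text exceeds the context length. A real implementation
        # would likely catch a narrower exception type.
        return extract("\n".join(pages[:fallback_pages]))


def chunk_text(
    pages: Sequence[str], chunk_size: int = 1000, overlap: int = 100
) -> List[str]:
    """Split the full text of every page into overlapping character windows.

    Chunking always uses all pages, independent of whether metadata
    extraction fell back to a partial page set.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    text = "\n".join(pages)
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Keeping the two functions separate mirrors the point of the last note: a metadata fallback can shrink what the LLM sees without shrinking what gets indexed.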

RhizoNymph/LLM-Librarian
