improve extracting PDF text #572

Open
wants to merge 1 commit into main
Conversation

@BBC-Esq commented Apr 5, 2025

You're already using the best langchain PDF loader, PyMuPDFParser. However, the default "mode" parameter is "page", which means it extracts the text from each page of the PDF (along with its page metadata), and each page is then split by the RecursiveCharacterTextSplitter. Langchain's "page" mode uses pymupdf's get_text method, which extracts the text of a single PDF page.

Langchain's other option is "single" mode, which also uses pymupdf's get_text but then concatenates everything. The huge drawback is that you lose the page metadata...and forget about trying to assign page metadata to each "chunk" afterwards...
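For context, here's a minimal sketch of the two existing modes (assuming a recent langchain-community release where PyMuPDFParser exposes the mode parameter; "report.pdf" is a placeholder path):

```python
from langchain_core.documents.base import Blob
from langchain_community.document_loaders.parsers import PyMuPDFParser

blob = Blob.from_path("report.pdf")  # placeholder path

# "page" mode (the default): one Document per PDF page, each carrying page
# metadata, so the splitter can never produce a chunk that spans two pages.
per_page_docs = PyMuPDFParser(mode="page").parse(blob)

# "single" mode: every page's text concatenated into one Document,
# at the cost of losing the per-page metadata.
single_doc = PyMuPDFParser(mode="single").parse(blob)
```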

This PR solves that issue, ultimately allowing for accurate "page citations" in a user's application.

It uses custom loader/parser classes that do the following (see the sketch after this list):

  1. Still uses page mode, but prepends a unique page marker to the text extracted from each page (e.g. [[page1]], [[page2]] and so on).
  2. Concatenates all of the text.
  3. Creates a "clean" copy of the entire concatenated text WITHOUT the page markers.
  4. Splits the "clean" text.
  5. For each chunk, uses regex to search the concatenated text WITH the page markers, locating where that chunk begins and the nearest page marker PRIOR to it.
  6. Assigns accurate page metadata for each chunk.

The benefit of this is that chunks of text are no longer artificially split at the page boundaries of the PDF itself, i.e. chunks can span pages.
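For illustration, a minimal sketch of the approach (not the PR's actual implementation; the function name and the "report.pdf" path are hypothetical, and it bookkeeps the page start offsets directly instead of regex-searching the [[pageN]] markers, but it yields the same per-chunk page metadata):

```python
from bisect import bisect_right

import fitz  # PyMuPDF
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_pdf_with_page_metadata(pdf_path: str,
                                chunk_size: int = 1000,
                                chunk_overlap: int = 100) -> list[Document]:
    """Concatenate all pages, split across page boundaries, then assign each
    chunk the page on which it begins."""
    pdf = fitz.open(pdf_path)

    # Steps 1-3: extract each page with get_text() and remember the offset at
    # which it starts in the concatenated ("clean") text.
    page_starts, parts, offset = [], [], 0
    for page in pdf:
        text = page.get_text()
        page_starts.append(offset)
        parts.append(text)
        offset += len(text)
    clean_text = "".join(parts)

    # Step 4: split the clean text; chunks are free to span page boundaries.
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                              chunk_overlap=chunk_overlap)
    chunks = splitter.split_text(clean_text)

    # Steps 5-6: find where each chunk starts in the clean text and attribute
    # it to the last page that begins at or before that position.
    documents, cursor = [], 0
    for chunk in chunks:
        start = clean_text.find(chunk, cursor)
        if start == -1:          # overlap edge case: retry from the beginning
            start = max(clean_text.find(chunk), 0)
        cursor = start
        page_number = bisect_right(page_starts, start)  # 1-based page number
        documents.append(Document(page_content=chunk,
                                  metadata={"source": pdf_path, "page": page_number}))
    return documents


# Hypothetical usage: every chunk carries the page it starts on,
# even when the chunk runs past a page boundary.
docs = load_pdf_with_page_metadata("report.pdf", chunk_size=1000)
print(docs[0].metadata)  # e.g. {'source': 'report.pdf', 'page': 1}
```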

The only "parser" from langchain does does this "out of the box" is pdfminer, but it's insanely slow. Therefore, I highly recommend using this custom pymupdf approach instead.

Moreover, it more accurately respects the chunk_size parameter. For example, if a particular page of a PDF only has 200 characters, page mode gives you a 200-character chunk even if you set chunk_size to 1,000,000. The custom approach allows chunks to extend across pages, obviating this problem.

Many embedding models can now handle chunk sizes well above the standard 512 characters, and there are use cases for that...but overall it's just better to have accurate page metadata for each "chunk" after processing the entire concatenated text.

@BBC-Esq changed the title from "improve pymupdfparser" to "improve extracting PDF text" on Apr 5, 2025
@BBC-Esq (Author) commented May 14, 2025

Can someone look at this?
