improve extracting PDF text #572

Open
wants to merge 1 commit into main
Conversation

@BBC-Esq commented Apr 5, 2025

You're already using the best langchain PDF loader, PyMuPDFParser. However, the default "mode" parameter is "page", which means it extracts the text from each page of the PDF (along with its page metadata), and each page is then split by the RecursiveCharacterTextSplitter. Langchain's "page" mode uses pymupdf's get_text method, which extracts the text of a single PDF page.

Langchain's other option is "single" mode, which also uses pymupdf's get_text but then concatenates everything. The huge drawback is that you lose the page metadata...and forget about trying to assign page metadata to each "chunk" afterwards...
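For context, here's a minimal sketch of the two existing modes (assuming a recent langchain-community release where PyMuPDFParser exposes the mode parameter; "report.pdf" is a placeholder path):

```python
from langchain_core.documents.base import Blob
from langchain_community.document_loaders.parsers import PyMuPDFParser

blob = Blob.from_path("report.pdf")  # placeholder path

# "page" mode (the default): one Document per PDF page, each carrying page
# metadata, so the splitter can never produce a chunk that spans two pages.
per_page_docs = PyMuPDFParser(mode="page").parse(blob)

# "single" mode: every page's text concatenated into one Document,
# at the cost of losing the per-page metadata.
single_doc = PyMuPDFParser(mode="single").parse(blob)
```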

This PR solves that issue, ultimately allowing for accurate "page citations" in a user's application.

It uses custom loader/parser classes that do the following (see the sketch after this list):

  1. Still uses page mode, but prepends a unique page marker to the text extracted from each page (e.g. [[page1]], [[page2]] and so on).
  2. Concatenates all of the text.
  3. Creates a "clean" copy of the entire concatenated text WITHOUT the page markers.
  4. Splits the "clean" text.
  5. For each chunk, uses regex to search the concatenated text WITH the page markers, locating where that chunk begins and the nearest page marker PRIOR to it.
  6. Assigns accurate page metadata for each chunk.

The benefit of this is that chunks of text are no longer artificially split at the page boundaries of the PDF itself, i.e. chunks can span pages.
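For illustration, a minimal sketch of the approach (not the PR's actual implementation; the function name and the "report.pdf" path are hypothetical, and it bookkeeps the page start offsets directly instead of regex-searching the [[pageN]] markers, but it yields the same per-chunk page metadata):

```python
from bisect import bisect_right

import fitz  # PyMuPDF
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_pdf_with_page_metadata(pdf_path: str,
                                chunk_size: int = 1000,
                                chunk_overlap: int = 100) -> list[Document]:
    """Concatenate all pages, split across page boundaries, then assign each
    chunk the page on which it begins."""
    pdf = fitz.open(pdf_path)

    # Steps 1-3: extract each page with get_text() and remember the offset at
    # which it starts in the concatenated ("clean") text.
    page_starts, parts, offset = [], [], 0
    for page in pdf:
        text = page.get_text()
        page_starts.append(offset)
        parts.append(text)
        offset += len(text)
    clean_text = "".join(parts)

    # Step 4: split the clean text; chunks are free to span page boundaries.
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                              chunk_overlap=chunk_overlap)
    chunks = splitter.split_text(clean_text)

    # Steps 5-6: find where each chunk starts in the clean text and attribute
    # it to the last page that begins at or before that position.
    documents, cursor = [], 0
    for chunk in chunks:
        start = clean_text.find(chunk, cursor)
        if start == -1:          # overlap edge case: retry from the beginning
            start = max(clean_text.find(chunk), 0)
        cursor = start
        page_number = bisect_right(page_starts, start)  # 1-based page number
        documents.append(Document(page_content=chunk,
                                  metadata={"source": pdf_path, "page": page_number}))
    return documents


# Hypothetical usage: every chunk carries the page it starts on,
# even when the chunk runs past a page boundary.
docs = load_pdf_with_page_metadata("report.pdf", chunk_size=1000)
print(docs[0].metadata)  # e.g. {'source': 'report.pdf', 'page': 1}
```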

The only "parser" from langchain does does this "out of the box" is pdfminer, but it's insanely slow. Therefore, I highly recommend using this custom pymupdf approach instead.

Moreover, it more accurately respects the chunk_size parameter. For example, if a particular page of a PDF only has 200 characters, page mode gives you a 200-character chunk even if you set chunk_size to 1,000,000. The custom approach allows chunks to extend across pages, obviating this problem.

Many embedding models can now handle chunk sizes well above the standard 512 characters, and there are use cases for that...but overall it's just better to have accurate page metadata for each "chunk" after processing the entire concatenated text.

@BBC-Esq changed the title from "improve pymupdfparser" to "improve extracting PDF text" on Apr 5, 2025
@BBC-Esq (Author) commented May 14, 2025

Can someone look at this?
