
Question about PyMuPDF4LLM #200

Answered by JorjMcKie
dkoterwa asked this question in Q&A
Dec 2, 2024 · 1 comments · 4 replies

Roughly true, yes.

But please be aware that in PDF we cannot rely on text being nicely ordered in a way that makes sense semantically. The "blocks" will always follow their physical enumeration sequence in the respective PDF source.
Even when sorted (sort=True parameter), single pieces of text may remain out of order because they may have been added or replaced by some editing mechanism.
Many, if not most, of these pesky things are resolved in to_markdown() because text flow is re-synthesized from a lower level of detail upwards. This is part of the reason why this extraction type is significantly slower.
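To illustrate the ordering point, here is a minimal pure-Python sketch. It does not call PyMuPDF. The block tuples are made-up data shaped like the `(x0, y0, x1, y1, text, block_no, block_type)` items that `page.get_text("blocks")` returns, and the sort key only approximates what `sort=True` does (vertical position first, then horizontal). The helper names `storage_order` and `coordinate_sorted` are hypothetical.

```python
# Made-up blocks for a two-column page, listed in "physical" storage order.
# Tuple layout mirrors PyMuPDF block items:
# (x0, y0, x1, y1, text, block_no, block_type)
blocks = [
    (320, 100, 540, 280, "Right 1", 0, 0),
    (72, 50, 540, 80, "Title", 1, 0),
    (72, 300, 300, 480, "Left 2", 2, 0),
    (72, 100, 300, 280, "Left 1", 3, 0),
    (320, 300, 540, 480, "Right 2", 4, 0),
]


def storage_order(blocks):
    """Return block texts in the order they are stored."""
    return [b[4] for b in blocks]


def coordinate_sorted(blocks):
    """Return block texts sorted top-to-bottom, then left-to-right.

    Assumption: this approximates a coordinate sort such as sort=True;
    it is not PyMuPDF's actual implementation.
    """
    return [b[4] for b in sorted(blocks, key=lambda b: (b[1], b[0]))]


print("stored:", storage_order(blocks))
print("sorted:", coordinate_sorted(blocks))
```

In this sketch the coordinate sort puts the title first but interleaves the two columns ("Left 1", "Right 1", "Left 2", "Right 2") instead of reading the left column through before the right one. A purely geometric sort cannot recover that kind of reading order, which is why a layout-aware step such as to_markdown() has more work to do.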

Answer selected by JorjMcKie

This discussion was converted from issue #199 on December 02, 2024 12:52.