Hi, |
Answered by JorjMcKie, Dec 2, 2024
PyMuPDF4LLM uses an advanced way of detecting text blocks on a page, compared to |
Roughly true, yes.
But please be aware that in PDF we cannot rely on text being nicely ordered in a way that makes sense semantically. The "blocks" will always follow their physical enumeration sequence in the respective PDF source.
Even when sorted (`sort=True` parameter), single text pieces may remain out of order because they may have been added or replaced by some editing mechanism.
Many / most of these pesky things are resolved in `to_markdown()` because the text flow is re-synthesized from a lower detail level upwards. This is part of the reason why this extraction type is significantly slower.
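
To make the difference concrete, here is a minimal sketch, not the library's actual implementation. `sort_blocks` is a hypothetical helper that only approximates what a coordinate-based sort does to block tuples shaped like the ones `page.get_text("blocks")` returns; the sort key (bottom edge, then left edge) and the sample blocks are assumptions for illustration. `compare_extractions` is also a hypothetical helper; it assumes a recent PyMuPDF (importable as `pymupdf`) and PyMuPDF4LLM are installed.

```python
# Block tuples follow the layout of page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)

def sort_blocks(blocks):
    """Hypothetical helper: order blocks top-to-bottom, then left-to-right.

    Assumption: bottom edge (y1) first, then left edge (x0). This only
    approximates a coordinate-based sort such as sort=True.
    """
    return sorted(blocks, key=lambda b: (b[3], b[0]))


# Made-up blocks in "physical" order: the title was appended last
# (for example by an editing tool), although it sits at the top of the page.
physical_order = [
    (72, 100, 300, 140, "First paragraph", 0, 0),
    (72, 150, 300, 190, "Second paragraph", 1, 0),
    (72, 50, 300, 80, "Title added later", 2, 0),
]

print([b[4] for b in physical_order])
# ['First paragraph', 'Second paragraph', 'Title added later']
print([b[4] for b in sort_blocks(physical_order)])
# ['Title added later', 'First paragraph', 'Second paragraph']


def compare_extractions(path):
    """Hypothetical helper: run the three extraction variants on page 0."""
    import pymupdf       # older PyMuPDF versions are imported as "fitz"
    import pymupdf4llm

    doc = pymupdf.open(path)
    page = doc[0]
    raw = [b[4] for b in page.get_text("blocks")]                 # physical order
    by_position = [b[4] for b in page.get_text("blocks", sort=True)]
    markdown = pymupdf4llm.to_markdown(path, pages=[0])           # slower, re-synthesized flow
    doc.close()
    return raw, by_position, markdown
```

Calling `compare_extractions("some.pdf")` on a file of your own (the path is a placeholder) lets you check page by page whether `sort=True` is already good enough or whether the slower `to_markdown()` route is worth it.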