Question about PyMuPDF4LLM #200

dkoterwa · 2024-12-02T10:41:40Z


Hi,
I wanted to ask whether there is any difference in text processing between get_text("blocks") and the way PyMuPDF4LLM processes text.
I know that for each text rectangle you assign a Markdown tag to set the header level, but is there anything more than that? I am not asking about table/image processing, just the raw text.


JorjMcKie · 2024-12-02T12:54:39Z

JorjMcKie
Dec 2, 2024
Maintainer

PyMuPDF4LLM uses a more advanced way of detecting text blocks on a page than get_text("blocks"). This allows it to detect multi-column pages more reliably.
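
To illustrate why multi-column pages are the hard case, here is a minimal pure-Python sketch. The tuples mimic the layout returned by get_text("blocks"), i.e. (x0, y0, x1, y1, text, block_no, block_type), but the coordinates, the `page_width` value and the `column_order` helper are invented for the example. The naive sort key only approximates a vertical-then-horizontal sort, and the column split is a deliberately crude stand-in, not PyMuPDF4LLM's actual detection logic.

```python
# Hypothetical blocks on a two-column page, in the tuple layout of
# page.get_text("blocks"): (x0, y0, x1, y1, text, block_no, block_type)
blocks = [
    (50, 100, 280, 150, "L1", 0, 0),
    (320, 100, 550, 150, "R1", 1, 0),
    (50, 160, 280, 210, "L2", 2, 0),
    (320, 160, 550, 210, "R2", 3, 0),
]


def naive_order(blocks):
    """Order blocks by vertical, then horizontal position only."""
    return [b[4] for b in sorted(blocks, key=lambda b: (b[3], b[0]))]


def column_order(blocks, page_width=600):
    """Crude column-aware order: assign each block to the left or right
    half of the page by its horizontal center, then read each column
    top to bottom. Assumes exactly two equally wide columns."""
    mid = page_width / 2
    left = [b for b in blocks if (b[0] + b[2]) / 2 < mid]
    right = [b for b in blocks if (b[0] + b[2]) / 2 >= mid]
    top_down = lambda b: (b[1], b[0])
    return [b[4] for b in sorted(left, key=top_down) + sorted(right, key=top_down)]


print(naive_order(blocks))   # ['L1', 'R1', 'L2', 'R2'] (columns interleaved)
print(column_order(blocks))  # ['L1', 'L2', 'R1', 'R2'] (reading order)
```

A purely positional sort interleaves the two columns, which is why some form of column detection is needed before ordering the text.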

4 replies

dkoterwa Dec 2, 2024
Author

Thank you for the quick reply. Could you please specify what "an advanced way" means? I can see that PyMuPDF4LLM joins bboxes that overlap and makes sure that bboxes are not concatenated across columns. If we assume that the PDF document is single-column, can we expect the extracted blocks and the Markdown elements to be very similar across these two methods?
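
The idea of joining overlapping bboxes without crossing a column boundary can be sketched roughly as follows. This is only an illustration of the concept described in the question, not the code PyMuPDF4LLM uses: the function names, the plain (x0, y0, x1, y1) tuples and the single `gutter_x` column boundary are all assumptions made for the example.

```python
def intersects(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def same_column(a, b, gutter_x):
    """True if the horizontal centers of a and b lie on the same side of
    the (hypothetical) column boundary at x = gutter_x."""
    return ((a[0] + a[2]) / 2 < gutter_x) == ((b[0] + b[2]) / 2 < gutter_x)


def join_boxes(boxes, gutter_x):
    """Merge overlapping boxes, but never across the column boundary."""
    merged = []
    for box in boxes:
        changed = True
        while changed:  # a grown box may now overlap further boxes
            changed = False
            for i, other in enumerate(merged):
                if same_column(box, other, gutter_x) and intersects(box, other):
                    box = union(box, merged.pop(i))
                    changed = True
                    break
        merged.append(box)
    return merged


boxes = [
    (50, 100, 280, 150),   # left column
    (50, 140, 280, 200),   # left column, overlaps the first box
    (270, 100, 550, 150),  # right column, slightly overlaps the first box
]
print(join_boxes(boxes, gutter_x=300))
# [(50, 100, 280, 200), (270, 100, 550, 150)]
```

The two left-column boxes are joined into one, while the right-column box stays separate even though it overlaps geometrically.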

JorjMcKie Dec 2, 2024
Maintainer

Roughly true, yes.

But please be aware that in PDF we cannot rely on text being ordered in a way that makes sense semantically. The "blocks" will always follow their physical enumeration sequence in the respective PDF source.
Even when sorted (sort=True parameter), single text pieces may remain out of order because they may have been added / replaced by some editing mechanism.
Most of these pesky things are resolved in to_markdown() because text flow is re-synthesized from a lower detail level upwards, which is part of the reason why this extraction type is significantly slower.
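
As a rough picture of what "re-synthesizing text flow from a lower level" can mean, the sketch below rebuilds lines from individual word rectangles instead of trusting the stored order. The tuples follow the layout of get_text("words"), i.e. (x0, y0, x1, y1, word, block_no, line_no, word_no), but the sample data, the `rebuild_lines` helper and its `y_tolerance` parameter are assumptions for illustration. It is not the algorithm used by to_markdown().

```python
# Hypothetical words in PDF storage order. The last one was appended by a
# later edit, so it is stored after the text that visually follows it.
words = [
    (72, 100, 100, 112, "Hello", 0, 0, 0),
    (72, 120, 110, 132, "Second", 0, 1, 0),
    (115, 120, 140, 132, "line", 0, 1, 1),
    (105, 100.5, 140, 112.5, "world", 1, 0, 0),
]


def rebuild_lines(words, y_tolerance=3.0):
    """Group words into lines by their bottom coordinate (y1), then order
    the words of each line from left to right."""
    lines = []  # each entry: [bottom y of the line's first word, [words]]
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(w[3] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append([w[3], [w]])
    return [
        " ".join(t[4] for t in sorted(ws, key=lambda t: t[0]))
        for _, ws in lines
    ]


print(" ".join(w[4] for w in words))  # Hello Second line world
print(rebuild_lines(words))           # ['Hello world', 'Second line']
```

Working at word level costs more than reading whole blocks, which is consistent with the slower extraction mentioned above.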

Answer selected by JorjMcKie

dkoterwa Dec 3, 2024
Author

Everything I wanted to ask about is clear now. Thank you for your help. I think this discussion can be closed.

JorjMcKie Dec 3, 2024
Maintainer

Thanks for the interesting post.
Discussions need not be closed. We usually even leave them open so they are visible for the user community.
