Get page information from extracted text #226

atahanuz · 2025-02-09T15:04:55Z

Hello, I'm using megaparse vision to extract text from a PDF. After the extraction, how can I find which part of the extracted text belongs to which page (e.g. page 7 of the PDF)?

linear · 2025-02-09T15:04:58Z

MEG-42 Get page information from extracted text

dosubot · 2025-02-09T15:05:49Z

Hey @atahanuz! I'm here to help you with bug fixes, answer questions, and become a contributor. Let's squash those bugs together! 🤖

To determine which part of the extracted text from a PDF belongs to a specific page, such as page 7, you can use the page_range attribute in the Block objects. Each Block object has a page_range attribute that indicates the page number(s) the text belongs to. You can filter the blocks by checking if the page_range matches the desired page number.

Here's a code snippet to illustrate how you can filter the blocks for page 7:

# Assuming `mp_document` is the MPDocument object returned by the convert method
page_number = 7
page_7_blocks = [
    block for block in mp_document.content
    if block.page_range and block.page_range[0] == page_number
]

# Now `page_7_blocks` contains all blocks from page 7

This code iterates over the content of the MPDocument and selects blocks where the page_range matches page 7 ^[1].
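Because page_range is a pair of page numbers, a block may span several pages, and comparing only page_range[0] would skip a block that starts before the page of interest. Below is a minimal sketch of a containment check, assuming page_range is an inclusive (start, end) tuple or None. The Block dataclass and the blocks_on_page helper are hypothetical stand-ins for illustration, not MegaParse's own classes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for a parsed block; not MegaParse's own class."""
    text: str
    page_range: Optional[Tuple[int, int]] = None


def blocks_on_page(blocks: List[Block], page_number: int) -> List[Block]:
    """Return blocks whose inclusive (start, end) page_range covers page_number."""
    return [
        b for b in blocks
        if b.page_range is not None
        and b.page_range[0] <= page_number <= b.page_range[1]
    ]


# Illustrative data only; real blocks would come from the parsed document.
sample_blocks = [
    Block("Intro", (0, 0)),
    Block("Table spanning pages", (6, 8)),
    Block("Page 7 paragraph", (7, 7)),
    Block("No page info", None),
]
page_7_texts = [b.text for b in blocks_on_page(sample_blocks, 7)]
```

Note that if every block carries a document-wide range, this filter returns all of them and cannot isolate a single page.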


atahanuz · 2025-02-09T15:24:48Z

@dosu
The page_range attribute gives (0, 48), which is the entire document. I haven't been able to access specific pages.

dosubot · 2025-02-09T15:26:18Z

If the page_range attribute is giving you (0, 48) for the entire document, it seems like the page_range is not being set correctly for individual pages. This might be due to how the process_file and __to_elements_list__ methods are handling the page information.

In MegaParse, the process_file method converts each page of a PDF into a base64 encoded image, and the number of pages (n_pages) is used to set the page_range in the __to_elements_list__ method. If the page_range is not reflecting individual pages, you might want to check if the metadata.page_number is being correctly assigned during the conversion process.

The __convert_element_to_block method is responsible for setting the page_range using the metadata.page_number from each Element. Ensure that this metadata is correctly populated for each page:

page_range=(
    (metadata.page_number, metadata.page_number)
    if metadata.page_number
    else None
)

If the metadata.page_number is not available or not set correctly, the page_range will default to None or cover the entire document. You may need to verify that the metadata.page_number is being correctly extracted and assigned during the parsing process ^[1].
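One way to check whether per-page information survived parsing is to tally the page_range values across all blocks. The sketch below assumes page_range is an inclusive (start, end) tuple or None. The Block dataclass and both helper functions are hypothetical, written for illustration rather than taken from MegaParse.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for a parsed block; not MegaParse's own class."""
    text: str
    page_range: Optional[Tuple[int, int]] = None


def summarize_page_ranges(blocks: List[Block]) -> Counter:
    """Count how many blocks carry each page_range value (None included)."""
    return Counter(b.page_range for b in blocks)


def has_per_page_info(blocks: List[Block]) -> bool:
    """True if at least one block has a single-page range (start == end)."""
    return any(
        b.page_range is not None and b.page_range[0] == b.page_range[1]
        for b in blocks
    )


# Illustrative data: one document-wide case, one per-page case.
whole_doc = [Block("a", (0, 48)), Block("b", (0, 48))]
per_page = [Block("a", (0, 0)), Block("b", (7, 7)), Block("c", None)]

summary = summarize_page_ranges(whole_doc)
```

If the tally shows a single document-wide range for every block, the page numbers were most likely lost before the blocks were built, and no filter on page_range can recover them. Separately, if page numbers are zero-based (an assumption here), a truthiness check such as `if metadata.page_number` would treat page 0 as missing.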


atahanuz · 2025-02-09T15:36:56Z

Dosubot wasn't able to help. Can someone else help? 🥲

dosubot bot added the question (Further information is requested) label Feb 9, 2025

StanGirard assigned chloedia Feb 12, 2025
