Get page information from extracted text #226

Open
atahanuz opened this issue Feb 9, 2025 · 5 comments

atahanuz commented Feb 9, 2025

Hello, I'm using MegaParse Vision to extract text from a PDF. After the extraction, how can I find which part of the extracted text belongs to which page (e.g. page 7 of the PDF)?


linear bot commented Feb 9, 2025


dosubot bot commented Feb 9, 2025

Hey @atahanuz! I'm here to help you with bug fixes, answer questions, and become a contributor. Let's squash those bugs together! 🤖

To determine which part of the extracted text belongs to a specific page, such as page 7, you can use the page_range attribute of the Block objects. It indicates the page number(s) the block's text comes from, so you can filter the blocks by checking whether page_range matches the desired page number.

Here's a code snippet to illustrate how you can filter the blocks for page 7:

# Assuming `mp_document` is the MPDocument object returned by the convert method
page_number = 7
page_7_blocks = [
    block for block in mp_document.content
    if block.page_range and block.page_range[0] == page_number
]

# Now `page_7_blocks` contains all blocks from page 7

This code iterates over the content of the MPDocument and selects the blocks whose page_range starts at page 7 [1].
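
The snippet above only compares the first element of page_range, so a block that starts on an earlier page and continues onto page 7 would be skipped. A minimal sketch of a containment check follows. It assumes page_range is an inclusive (start, end) tuple or None; the `Block` dataclass here is a hypothetical stand-in rather than MegaParse's own class, and whether pages are numbered from 0 or 1 is not stated in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for a parsed block; only the fields used here."""
    text: str
    page_range: Optional[Tuple[int, int]] = None


def blocks_on_page(blocks: Sequence[Block], page: int) -> List[Block]:
    """Return blocks whose inclusive (start, end) page_range covers `page`."""
    return [
        b for b in blocks
        if b.page_range is not None
        and b.page_range[0] <= page <= b.page_range[1]
    ]


# Made-up sample data for illustration only.
blocks = [
    Block("intro", (0, 0)),
    Block("spans two pages", (6, 7)),
    Block("page seven only", (7, 7)),
    Block("no page info"),
]
page_7_texts = [b.text for b in blocks_on_page(blocks, 7)]
print(page_7_texts)
```

Note that if every block reports the whole document as its range, this check matches every block for every page, so it only helps once per-page ranges are actually populated.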

To continue talking to Dosu, mention @dosu.



@dosubot dosubot bot added the question Further information is requested label Feb 9, 2025

atahanuz commented Feb 9, 2025

@dosu
The page_range attribute gives (0, 48), which is the entire document. I haven't been able to access specific pages.


dosubot bot commented Feb 9, 2025

If the page_range attribute is giving you (0, 48) for the entire document, it seems like the page_range is not being set correctly for individual pages. This might be due to how the process_file and __to_elements_list__ methods are handling the page information.

In MegaParse, the process_file method converts each page of a PDF into a base64 encoded image, and the number of pages (n_pages) is used to set the page_range in the __to_elements_list__ method. If the page_range is not reflecting individual pages, you might want to check if the metadata.page_number is being correctly assigned during the conversion process.
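
If the parser does not report per-page ranges, one possible workaround is to track the page yourself by parsing page inputs one at a time. The sketch below only illustrates that bookkeeping and is not a MegaParse feature: `parse_per_page`, `parse_page`, and `fake_parse` are hypothetical names, and how per-page inputs are produced and passed to the real parser is left to the caller.

```python
from typing import Callable, Dict, Sequence


def parse_per_page(
    pages: Sequence[bytes],
    parse_page: Callable[[bytes], str],
    first_page_number: int = 1,
) -> Dict[int, str]:
    """Parse each page input separately and key the text by page number."""
    return {
        first_page_number + i: parse_page(page)
        for i, page in enumerate(pages)
    }


def fake_parse(page: bytes) -> str:
    """Stand-in parser for demonstration; a real call would go to the parser."""
    return page.decode("utf-8").upper()


text_by_page = parse_per_page([b"first page", b"second page"], fake_parse)
print(text_by_page[2])
```

The trade-off is one parser call per page, which may be slower or more costly than a single call for the whole document.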

The __convert_element_to_block method is responsible for setting the page_range using the metadata.page_number from each Element. Ensure that this metadata is correctly populated for each page:

page_range=(metadata.page_number, metadata.page_number)
if metadata.page_number
else None

If the metadata.page_number is not available or not set correctly, the page_range will default to None or cover the entire document. You may need to verify that the metadata.page_number is being correctly extracted and assigned during the parsing process [1].
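
One detail worth checking in the fragment above: it tests the truthiness of metadata.page_number, so a page number of 0 would be treated as missing. Whether MegaParse numbers pages from 0 is not confirmed in this thread, so the following is only a sketch of the stricter check. `to_page_range` is a hypothetical helper, and `SimpleNamespace` stands in for the real metadata object.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def to_page_range(metadata: Any) -> Optional[Tuple[int, int]]:
    """Build a single-page range, treating only None as 'no page number'."""
    page = getattr(metadata, "page_number", None)
    return (page, page) if page is not None else None


# Made-up metadata objects for illustration only.
first = to_page_range(SimpleNamespace(page_number=0))
seventh = to_page_range(SimpleNamespace(page_number=7))
missing = to_page_range(SimpleNamespace(page_number=None))
print(first, seventh, missing)
```

With the truthiness test, the first case would yield None instead of (0, 0).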



atahanuz commented Feb 9, 2025

Dosubot wasn't able to help. Can someone else? 🥲
