Strange cid tag into simple pdf extraction #225

Open
nekopep opened this issue Feb 9, 2025 · 4 comments
Labels
bug Something isn't working

Comments


nekopep commented Feb 9, 2025

I'm trying MegaParse to see how it handles my PDFs, and I found a buggy behaviour.
Every PDF I gave it comes back with some letters rendered as (cid:xxx).
It is always the letter "e".

Préambul(cid:4) L(cid:4) p(cid:4)rsonn(cid:4)l, dans l'(cid:4)x(cid:4)rcic(cid:4) d(cid:4) s(cid:4)s fonctions, (cid:4)st conduit à utilis(cid:4)r l(cid:4)s outils (cid:4)t logici(cid:4)ls informatiqu(cid:4)s mis à sa disposition pour accéd(cid:4)r aux systèm(cid:4)s d'information (cid:4)t d(cid:4) communication d(cid:4) la société. L’utilisation d(cid:4) s(cid:4)rvic(cid:4)s (cid:4)n lign(cid:4) doit êtr(cid:4) (cid:4)ff(cid:4)ctué(cid:4) (cid:4)xclusiv(cid:4)m(cid:4)nt à d(cid:4)s fins prof(cid:4)ssionn(cid:4)ll(cid:4)s.
Ce document définit les règles et les logiciels homologués pour le groupe.
D’un(cid:4) manièr(cid:4) général(cid:4), la philosophi(cid:4) du group(cid:4) (cid:4)st d’(cid:4)xt(cid:4)rnalis(cid:4)r l(cid:4) moins d(cid:4) donné(cid:4)s possibl(cid:4)s v(cid:4)rs l’(cid:4)xtéri(cid:4)ur. Quand c(cid:4)la n(cid:4) p(cid:4)ut êtr(cid:4) évité, il privilégi(cid:4) d(cid:4)s s(cid:4)rvic(cid:4)s (cid:4)uropé(cid:4)ns.
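
For anyone triaging similar output, here is a minimal sketch for measuring how many (cid:N) placeholders an extraction contains and for patching them as a stopgap. It only uses the standard library and works on the extracted text, not on the PDF. The helper names `count_cid_tokens` and `replace_cid_tokens` are hypothetical, and the mapping `{4: "e"}` is an assumption taken from the sample above, where (cid:4) always stands for "e"; other fonts or files may use different numbers.

```python
import re
from collections import Counter

# Matches placeholders such as "(cid:4)" in extracted text.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def count_cid_tokens(text: str) -> Counter:
    """Count how often each (cid:N) placeholder appears in the text."""
    return Counter(int(n) for n in CID_PATTERN.findall(text))


def replace_cid_tokens(text: str, mapping: dict) -> str:
    """Replace placeholders found in `mapping`; leave unknown ones untouched."""
    def substitute(match: re.Match) -> str:
        return mapping.get(int(match.group(1)), match.group(0))
    return CID_PATTERN.sub(substitute, text)


sample = "L(cid:4) p(cid:4)rsonn(cid:4)l"
print(count_cid_tokens(sample))               # Counter({4: 3})
# Assumption from this issue: (cid:4) is the letter "e".
print(replace_cid_tokens(sample, {4: "e"}))   # Le personnel
```

This is a workaround on the output only. It does not address why the glyph has no Unicode mapping in the first place, which is what the issue is about.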

The PDF files are generated with OnlyOffice version 8.2.2.22 (deb) on Debian 12.
Attached is a very simple sample PDF:

Test_003.pdf

Code used:
from megaparse import MegaParse
megaparse = MegaParse()
response = megaparse.load("./data/Test_003.pdf")

Also, I see that it always logs "Switching to Unstructured Parser".


linear bot commented Feb 9, 2025


dosubot bot commented Feb 9, 2025

The issue you're experiencing with the (cid:xxx) tags, particularly affecting the letter "e", might be related to how MegaParse handles character encoding or special characters in PDFs. Unfortunately, I couldn't find specific information on how MegaParse deals with character encoding or the 'cid' tag issue in the repository.

The message "Switching to Unstructured Parser" indicates that MegaParse is using the unstructured library for parsing, which happens when the default parsing method determines that a different strategy is more appropriate, especially when the global strategy is not set to high-resolution [1].

Since the PDF is generated by OnlyOffice, there might be specific encoding or font issues that MegaParse doesn't handle well. You might want to check if there are any updates or patches for MegaParse that address this issue, or consider preprocessing the PDF to standardize its encoding before parsing.


@dosubot dosubot bot added the bug Something isn't working label Feb 9, 2025

nekopep commented Feb 9, 2025

@dosu: Can you provide an example of how to set StrategyEnum.HI_RES?


dosubot bot commented Feb 9, 2025

To set StrategyEnum.HI_RES in MegaParse, you can create an instance of ParseFileConfig and set the strategy attribute to StrategyEnum.HI_RES as shown in the following example:

from megaparse_sdk.schema.parser_config import ParseFileConfig, StrategyEnum

config = ParseFileConfig(strategy=StrategyEnum.HI_RES)
