Strange cid tag into simple pdf extraction #225
Comments
The message "Switching to Unstructured Parser" indicates that MegaParse is using the unstructured library for parsing. This happens when the default parsing method determines that a different strategy is more appropriate, especially when the global strategy is not set to high-resolution [1]. Since the PDF is generated by OnlyOffice, there might be specific encoding or font issues that MegaParse doesn't handle well. You might want to check whether any updates or patches for MegaParse address this issue, or consider preprocessing the PDF to standardize its encoding before parsing.
@dosu: Can you provide an example of how to set StrategyEnum.HI_RES?
To set StrategyEnum.HI_RES:
from megaparse_sdk.schema.parser_config import ParseFileConfig, StrategyEnum
config = ParseFileConfig(strategy=StrategyEnum.HI_RES)
I'm trying MegaParse to see how it handles my PDFs, and I found some buggy behaviour.
Every PDF I give it comes back with some letters rendered as (cid:xxx).
It is always the letter "e".
Préambul(cid:4) L(cid:4) p(cid:4)rsonn(cid:4)l, dans l'(cid:4)x(cid:4)rcic(cid:4) d(cid:4) s(cid:4)s fonctions, (cid:4)st conduit à utilis(cid:4)r l(cid:4)s outils (cid:4)t logici(cid:4)ls informatiqu(cid:4)s mis à sa disposition pour accéd(cid:4)r aux systèm(cid:4)s d'information (cid:4)t d(cid:4) communication d(cid:4) la société. L’utilisation d(cid:4) s(cid:4)rvic(cid:4)s (cid:4)n lign(cid:4) doit êtr(cid:4) (cid:4)ff(cid:4)ctué(cid:4) (cid:4)xclusiv(cid:4)m(cid:4)nt à d(cid:4)s fins prof(cid:4)ssionn(cid:4)ll(cid:4)s.
Ce document définit les règles et les logiciels homologués pour le groupe.
D’un(cid:4) manièr(cid:4) général(cid:4), la philosophi(cid:4) du group(cid:4) (cid:4)st d’(cid:4)xt(cid:4)rnalis(cid:4)r l(cid:4) moins d(cid:4) donné(cid:4)s possibl(cid:4)s v(cid:4)rs l’(cid:4)xtéri(cid:4)ur. Quand c(cid:4)la n(cid:4) p(cid:4)ut êtr(cid:4) évité, il privilégi(cid:4) d(cid:4)s s(cid:4)rvic(cid:4)s (cid:4)uropé(cid:4)ns.
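As a stopgap while the root cause is open, the `(cid:N)` placeholders can be counted and patched in the extracted text after parsing. The sketch below is not part of MegaParse: the helper names `count_cids` and `replace_cids` are hypothetical, and the mapping `{4: "e"}` is an assumption taken from the observation above that the missing letter is always "e". CID numbers are specific to a font, so the mapping may differ for other documents or fonts.

```python
import re
from collections import Counter

# Matches the "(cid:N)" placeholders seen in the extracted text above.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def count_cids(text: str) -> Counter:
    """Count how often each CID number appears in the extracted text."""
    return Counter(int(n) for n in CID_PATTERN.findall(text))


def replace_cids(text: str, mapping: dict[int, str]) -> str:
    """Replace known CIDs with characters; leave unknown ones untouched."""
    def substitute(match: re.Match) -> str:
        cid = int(match.group(1))
        return mapping.get(cid, match.group(0))

    return CID_PATTERN.sub(substitute, text)


# Fragment of the output reported in this issue.
sample = "L(cid:4) p(cid:4)rsonn(cid:4)l, dans l'(cid:4)x(cid:4)rcic(cid:4) d(cid:4) s(cid:4)s fonctions"

print(count_cids(sample))               # which CIDs occur, and how often
print(replace_cids(sample, {4: "e"}))   # assumed mapping: cid 4 -> "e"
```

Counting first helps confirm that only one CID is involved before applying a mapping. This only masks the symptom; the text is still extracted without a proper character mapping for that glyph.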
The PDF files are generated with OnlyOffice version 8.2.2.22 (deb) on Debian 12.
Attached is a very simple sample PDF:
Test_003.pdf
Code used:
from megaparse import MegaParse
megaparse = MegaParse()
response = megaparse.load("./data/Test_003.pdf")
Also, I see that it always prints "Switching to Unstructured Parser".