Strange cid tag into simple pdf extraction #225

nekopep · 2025-02-09T14:38:41Z

I'm trying MegaParse to see how it handles my PDFs, and I found some buggy behaviour.
Every PDF I gave it returns text with some letters rendered as (cid:xxx).
It is always the letter "e".

Préambul(cid:4) L(cid:4) p(cid:4)rsonn(cid:4)l, dans l'(cid:4)x(cid:4)rcic(cid:4) d(cid:4) s(cid:4)s fonctions, (cid:4)st conduit à utilis(cid:4)r l(cid:4)s outils (cid:4)t logici(cid:4)ls informatiqu(cid:4)s mis à sa disposition pour accéd(cid:4)r aux systèm(cid:4)s d'information (cid:4)t d(cid:4) communication d(cid:4) la société. L’utilisation d(cid:4) s(cid:4)rvic(cid:4)s (cid:4)n lign(cid:4) doit êtr(cid:4) (cid:4)ff(cid:4)ctué(cid:4) (cid:4)xclusiv(cid:4)m(cid:4)nt à d(cid:4)s fins prof(cid:4)ssionn(cid:4)ll(cid:4)s.
Ce document définit les règles et les logiciels homologués pour le groupe.
D’un(cid:4) manièr(cid:4) général(cid:4), la philosophi(cid:4) du group(cid:4) (cid:4)st d’(cid:4)xt(cid:4)rnalis(cid:4)r l(cid:4) moins d(cid:4) donné(cid:4)s possibl(cid:4)s v(cid:4)rs l’(cid:4)xtéri(cid:4)ur. Quand c(cid:4)la n(cid:4) p(cid:4)ut êtr(cid:4) évité, il privilégi(cid:4) d(cid:4)s s(cid:4)rvic(cid:4)s (cid:4)uropé(cid:4)ns.

The PDF files are generated with OnlyOffice version 8.2.2.22 (deb) on Debian 12.
A very simple sample PDF is attached:

Test_003.pdf

Code used:

from megaparse import MegaParse

megaparse = MegaParse()
response = megaparse.load("./data/Test_003.pdf")

I also see that it always prints "Switching to Unstructured Parser".

linear · 2025-02-09T14:38:44Z

MEG-41 Strange cid tag into simple pdf extraction

dosubot · 2025-02-09T14:40:01Z

The issue you're experiencing with the (cid:xxx) tags, particularly affecting the letter "e", might be related to how MegaParse handles character encoding or special characters in PDFs. Unfortunately, I couldn't find specific information on how MegaParse deals with character encoding or the 'cid' tag issue in the repository.

The message "Switching to Unstructured Parser" indicates that MegaParse is using the unstructured library for parsing, which happens when the default parsing method determines that a different strategy is more appropriate, especially when the global strategy is not set to high-resolution ^[1].

Since the PDF is generated by OnlyOffice, there might be specific encoding or font issues that MegaParse doesn't handle well. You might want to check if there are any updates or patches for MegaParse that address this issue, or consider preprocessing the PDF to standardize its encoding before parsing.
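As a stopgap that works on the extracted text rather than on the PDF itself, the placeholders can be detected and substituted after parsing. The following is a minimal sketch, not part of MegaParse: the helper names `find_cids` and `replace_cid_tokens` are hypothetical, and the mapping of cid 4 to "e" is an assumption taken from the sample output in this issue. Other documents or fonts may use different glyph ids, so the mapping has to be checked per file.

```python
import re

# Matches placeholders of the form "(cid:4)" as they appear in the extracted text.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cids(text):
    """Return the sorted list of distinct cid numbers found in the text."""
    return sorted({int(n) for n in CID_PATTERN.findall(text)})


def replace_cid_tokens(text, cid_map):
    """Replace (cid:N) placeholders using cid_map; unknown ids are left untouched."""
    def _substitute(match):
        cid = int(match.group(1))
        return cid_map.get(cid, match.group(0))

    return CID_PATTERN.sub(_substitute, text)


# Fragment copied from the sample output in this issue.
sample = "L(cid:4) p(cid:4)rsonn(cid:4)l"

print(find_cids(sample))                     # [4]
# Assumption: cid 4 stands for "e" in this particular PDF.
print(replace_cid_tokens(sample, {4: "e"}))  # Le personnel
```

Running `find_cids` first shows which glyph ids are affected before committing to a mapping. This only hides the symptom; the underlying cause is likely a font in the PDF whose glyphs cannot be mapped back to Unicode.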

To continue talking to Dosu, mention @dosu.

nekopep · 2025-02-09T14:45:09Z

@dosu : Can you provide an example on how to set StrategyEnum.HI_RES ?

dosubot · 2025-02-09T14:45:52Z

To set StrategyEnum.HI_RES in MegaParse, you can create an instance of ParseFileConfig and set the strategy attribute to StrategyEnum.HI_RES as shown in the following example:

from megaparse_sdk.schema.parser_config import ParseFileConfig, StrategyEnum

config = ParseFileConfig(strategy=StrategyEnum.HI_RES)

dosubot bot added the bug (Something isn't working) label on Feb 9, 2025
