@ConvoluteHumanBot

Description of the new Feature

Refactored `PdfContentStreamHandler` into an abstract class so that the `ContentOperator` interface is more flexible to use.

Implemented a new `PdfContentStreamHandler`, `PdfContentTextLocator`, which finds text matching a regex on a page and locates the bounding box coordinates of each match.
The basic implementation searches inside a single `PdfString` and uses the width of each glyph to calculate the width of the match.
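The glyph-based width calculation described above can be sketched roughly as follows. This is not the code from this PR: `TextLocatorSketch`, `Box`, `locate`, and the `glyphWidth` lookup are hypothetical names, glyph widths are assumed to be given in 1/1000 of a text unit, and character spacing, word spacing, horizontal scaling, and the text matrix are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToDoubleFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the PdfContentTextLocator from this PR.
final class TextLocatorSketch {

    /** Axis-aligned box for one regex match (illustrative type). */
    static final class Box {
        final String text;
        final double x;
        final double y;
        final double width;
        final double height;

        Box(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Finds every regex match inside one decoded string and derives its box
     * from per-glyph advances. glyphWidth is assumed to return widths in
     * 1/1000 of a text unit; the font size is used as an approximate height.
     */
    static List<Box> locate(String text, Pattern pattern, IntToDoubleFunction glyphWidth,
                            double fontSize, double startX, double baselineY) {
        // prefix[i] is the horizontal advance accumulated before character i.
        double[] prefix = new double[text.length() + 1];
        for (int i = 0; i < text.length(); i++) {
            prefix[i + 1] = prefix[i] + glyphWidth.applyAsDouble(text.charAt(i)) / 1000.0 * fontSize;
        }

        List<Box> boxes = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // skip empty matches, they have no extent
            }
            double x = startX + prefix[m.start()];
            double width = prefix[m.end()] - prefix[m.start()];
            boxes.add(new Box(m.group(), x, baselineY, width, fontSize));
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Every glyph is 500/1000 wide here, purely for illustration.
        List<Box> boxes = locate("Total: 42 EUR", Pattern.compile("\\d+"), c -> 500.0, 10.0, 100.0, 700.0);
        for (Box b : boxes) {
            System.out.println(b.text + " x=" + b.x + " width=" + b.width);
        }
    }
}
```

Accumulating the advances once as a prefix array means each match only needs two lookups, whatever the number of matches in the string.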

Possible improvement

Use the same logic as `renderText` to search across multiple `PdfString` objects on the same line. I am skeptical about this, though, because string matching on a print-oriented format can produce unwanted behaviour.
For example, two `PdfString` objects can sit on the same line yet belong to two different paragraphs, or be inserted in a different order, as in the PDF document attached below.
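To make the trade-off concrete, here is a minimal sketch of what grouping fragments by line could look like. `LineGroupingSketch`, `Fragment`, and `joinByLine` are hypothetical names and do not come from `renderText`: fragments are bucketed by a rounded baseline, sorted by x, and concatenated. Sorting by x handles fragments inserted out of order, but fragments that share a baseline while belonging to different paragraphs or columns are still joined, which is exactly the risk mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of line grouping, not code from this PR.
final class LineGroupingSketch {

    /** One decoded string with the position where it starts (illustrative type). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Buckets fragments by baseline (y rounded to the given tolerance),
     * orders each bucket left to right, and joins the text.
     * Lines are returned top to bottom, i.e. by descending y.
     */
    static List<String> joinByLine(List<Fragment> fragments, double tolerance) {
        Map<Long, List<Fragment>> lines = new TreeMap<>(Comparator.reverseOrder());
        for (Fragment f : fragments) {
            long key = Math.round(f.y / tolerance);
            lines.computeIfAbsent(key, k -> new ArrayList<>()).add(f);
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines.values()) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("world", 150, 700)); // appears first in the stream
        fragments.add(new Fragment("Hello ", 100, 700));
        fragments.add(new Fragment("Next", 100, 680));
        System.out.println(joinByLine(fragments, 2.0));
    }
}
```

Rounding to a fixed tolerance is a simplification: two baselines that fall on either side of a bucket boundary end up on different lines even when they are very close.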

example_file.pdf

Your real name

Alessandro Ragusi

… for Type0 fonts, the `getWidth` method in `DocumentFont.java` failed to match the character in the `metrics` map (the char was not properly decoded).

Use `ParsedText.getWidth` instead of `ParsedText.getUnscaledTextWidth` when adjusting the `textMatrix` in `displayPdfString`, to avoid unnecessary calculations.
…ic implementation in each class, so that `ContentOperator` is more flexible to use.

Implemented a new `PdfContentStreamHandler`, `PdfContentTextLocator`, to find and locate the coordinates of a regex match in the text of a page (the basic logic searches inside a single `PdfString` and could be extended to group `PdfString` objects on the same line).
@sonarqubecloud

Quality Gate failed

Failed conditions
58.8% Duplication on New Code (required ≤ 3%)

