New Feature: extract coordinates of matched text #1442

ConvoluteHumanBot · 2025-10-23T14:50:45Z

Description of the new Feature

Refactored `PdfContentStreamHandler` into an abstract class so that the `ContentOperator` interface is more flexible to use.

Implemented a new `PdfContentStreamHandler` subclass, `PdfContentTextLocator`, which finds text matching a regex inside a page and returns the bounding box coordinates of each match.
The basic implementation searches inside a single `PdfString` and uses the width of each glyph to calculate the width of the match.
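For reviewers, the per-glyph width calculation can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: `TextLocatorSketch`, `Box`, and the `glyphWidth` callback are hypothetical names, widths are assumed to be in 1/1000 text-space units, and character spacing, word spacing, horizontal scaling, and the text matrix transform are all ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToDoubleFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextLocatorSketch {

    /** Axis-aligned box for one match; (llx, lly) is the lower-left corner. Hypothetical type. */
    public static final class Box {
        public final String text;
        public final double llx, lly, urx, ury;

        public Box(String text, double llx, double lly, double urx, double ury) {
            this.text = text;
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    /**
     * Finds every regex match inside one decoded string and derives its box
     * from the summed glyph widths. glyphWidth is an assumed callback that
     * returns the width of a character in 1/1000 text-space units.
     */
    public static List<Box> locate(String decoded, Pattern pattern,
                                   IntToDoubleFunction glyphWidth,
                                   double fontSize, double startX, double baselineY) {
        // Prefix sums: offsets[i] is the horizontal advance before character i.
        double[] offsets = new double[decoded.length() + 1];
        for (int i = 0; i < decoded.length(); i++) {
            double width = glyphWidth.applyAsDouble(decoded.charAt(i)) / 1000.0 * fontSize;
            offsets[i + 1] = offsets[i] + width;
        }

        List<Box> boxes = new ArrayList<>();
        Matcher matcher = pattern.matcher(decoded);
        while (matcher.find()) {
            // The box height is approximated by the font size (an assumption;
            // real ascent and descent values would be more precise).
            boxes.add(new Box(matcher.group(),
                    startX + offsets[matcher.start()], baselineY,
                    startX + offsets[matcher.end()], baselineY + fontSize));
        }
        return boxes;
    }
}
```

Using prefix sums means each match costs two array lookups regardless of its length, and the same offsets serve every match found in the string.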

Possible improvement

Use the same logic as `renderText` to search across multiple `PdfString` objects on the same line. I'm skeptical about this because of possible unwanted behaviour: we would be doing string matching on a format designed for printing, not for reading order.
For example, two `PdfString` objects could sit on the same line but belong to two different paragraphs, or be inserted in a different order, as in the PDF document attached below.

example_file.pdf
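To make the trade-off concrete, here is a rough sketch of grouping strings by baseline. It is only an illustration under assumptions: `LineGroupingSketch`, `Chunk`, and the `tolerance` parameter are hypothetical and do not exist in the code base, and this is not how `renderText` works internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LineGroupingSketch {

    /** A decoded string and the position where it starts. Hypothetical type. */
    public static final class Chunk {
        public final String text;
        public final double x;
        public final double y;

        public Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups chunks whose baselines differ by at most tolerance, then joins
     * each group left to right, regardless of content stream order.
     */
    public static List<String> groupIntoLines(List<Chunk> chunks, double tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Top of the page first: PDF y coordinates grow upwards.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk chunk : sorted) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - chunk.y) <= tolerance) {
                last.add(chunk);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(chunk);
                lines.add(line);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            StringBuilder joined = new StringBuilder();
            for (Chunk chunk : line) {
                joined.append(chunk.text);
            }
            result.add(joined.toString());
        }
        return result;
    }
}
```

Sorting by x fixes strings that were written out of order, but it also shows the risk described above: two chunks from different paragraphs or columns that share a baseline are joined into one string, so a regex could match text that a reader would never see as contiguous.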

Your real name

Alessandro Ragusi

… for Type0 fonts, `getWidth` in `DocumentFont.java` failed to find the character in the `metrics` map because the character was not properly decoded. Now uses `ParsedText.getWidth` instead of `ParsedText.getUnscaledTextWidth` when adjusting the `textMatrix` in `displayPdfString`, to avoid unnecessary calculations.

…ic implementation in each class, so that `ContentOperator` is more flexible to use. Implemented a new `PdfContentStreamHandler` subclass, `PdfContentTextLocator`, to find and locate the coordinates of a regex match in the text of a page. The basic logic searches inside a single `PdfString` and could be extended to group `PdfString` objects on the same line.

sonarqubecloud · 2025-10-23T14:52:13Z

Quality Gate failed

Failed conditions
58.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

ConvoluteHumanBot added 2 commits October 23, 2025 02:30
