[New] Support setting OCR confidence and coincidence thresholds #100

adombeck · 2025-10-29T14:34:07Z

With the default values

DEFAULT_CONFIDENCE: float = 0.7
DEFAULT_COINCIDENCE: float = 80.0

the match_text keyword often matches incorrect text in our tests. We need to set higher values to avoid that.

adombeck · 2025-10-29T15:22:31Z

The terms "confidence" and "coincidence" are currently used inconsistently:

- When using tesseract, only "confidence" is used, and it refers to the similarity between the text that we try to find and the text that was returned by OCR.
- When using RapidOCR:
  - "confidence" refers to the confidence score returned by RapidOCR, which is the model's estimated probability that the returned text is actually what is visible in the image.
  - "coincidence" refers to what "confidence" means for tesseract: the similarity between the text that we try to find and the text that was returned by OCR.
- In the matches dictionaries returned by find_text, the confidence item always refers to the text similarity, even when using RapidOCR, where the coincidence argument is used for that.

As far as I can tell, confidence and coincidence are currently not really part of YARF's API, besides the confidence item in the return value of find_text. So maybe now (before merging this PR) would be a good time to rename them and use them more consistently.

I propose that we always use:

similarity (or similarity_threshold) for text similarity
confidence (or confidence_threshold) for the OCR confidence score

It would also be nice to consistently use either a percentage (a value between 0 and 100) or a fraction (between 0 and 1) for both. Currently, we use a fraction for the RapidOCR confidence and a percentage for the other values.
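
To make the distinction concrete, here is a minimal sketch of how the two values could be checked side by side on a single 0-100 scale. It is an illustration only: `similarity_percent`, `to_percent` and `is_match` are hypothetical names, and `difflib.SequenceMatcher` is a stand-in for whatever string matching the OCR backends actually use.

```python
from difflib import SequenceMatcher


def similarity_percent(expected: str, recognized: str) -> float:
    """Text similarity between the searched and the recognized text (0-100)."""
    # Stand-in metric; the real backends may compute similarity differently.
    return 100.0 * SequenceMatcher(None, expected, recognized).ratio()


def to_percent(fraction: float) -> float:
    """Convert a fraction in [0, 1] to a percentage in [0, 100]."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"expected a fraction between 0 and 1, got {fraction}")
    return fraction * 100.0


def is_match(
    expected: str,
    recognized: str,
    ocr_confidence: float,
    similarity_threshold: float = 80.0,
    confidence_threshold: float = 70.0,
) -> bool:
    """Accept a result only if both thresholds (in percent) are met.

    `ocr_confidence` is assumed to arrive as a fraction, as RapidOCR
    reports it, and is normalized to a percentage before comparing.
    """
    confidence = to_percent(ocr_confidence)
    similarity = similarity_percent(expected, recognized)
    return confidence >= confidence_threshold and similarity >= similarity_threshold
```

With one scale for both values, a test only has to reason about percentages, whichever OCR backend is in use.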

adombeck · 2025-10-29T16:34:23Z

FAIL Required test coverage of 100% not reached. Total coverage: 99.78%

I'll add tests once the changes have been reviewed and we have reached a conclusion regarding #100 (comment)

yarf/rf_libraries/libraries/ocr/rapidocr.py

p-gentili · 2025-11-10T11:08:59Z

yarf/rf_libraries/libraries/video_input_base.py

+    @keyword
+    def set_ocr_confidence_threshold(self, threshold: float) -> None:
+        """
+        Set the OCR confidence threshold.
+
+        Args:
+            threshold: Confidence threshold between 0 and 1.
+        """
+        logger.debug(f"Setting OCR confidence threshold to {threshold}")
+        self.ocr.DEFAULT_CONFIDENCE = threshold  # type: ignore[union-attr]
+
+    @keyword
+    def set_ocr_coincidence_threshold(self, threshold: float) -> None:


What if we use a global variable instead? It won't help with tesseract being special, but we can mention it in the related documentation.

you mean using a global variable instead of introducing these set_* keywords? how do you set the global variable in the robot framework tests?

exactly, with Set Global Variable.

Ah okay. Yes, that should work too. It's a bit less discoverable but we could solve that with documentation.
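
For illustration, a rough sketch of how the library side could read such a global variable. The variable name `${OCR_SIMILARITY_THRESHOLD}` and the helper `resolve_threshold` are hypothetical. Inside a running Robot Framework test, the lookup function passed in would be `BuiltIn().get_variable_value` from `robot.libraries.BuiltIn`; the sketch takes it as a parameter so it can be tried without Robot Framework.

```python
from typing import Any, Callable

# Hypothetical variable name, set in a test with e.g.:
#   Set Global Variable    ${OCR_SIMILARITY_THRESHOLD}    90
SIMILARITY_VARIABLE = "${OCR_SIMILARITY_THRESHOLD}"


def resolve_threshold(
    get_variable: Callable[[str, Any], Any],
    name: str,
    default: float,
) -> float:
    """Look up a threshold variable and fall back to `default` if unset.

    `get_variable(name, default)` is assumed to behave like
    `BuiltIn().get_variable_value(name, default)`.
    """
    # Robot Framework variables often arrive as strings, so convert first.
    value = float(get_variable(name, default))
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return value
```

The range check is there because a typo such as 0.9 instead of 90 would otherwise silently accept nearly every match.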

p-gentili · 2025-11-10T11:20:41Z

I propose that we always use:

similarity (or similarity_threshold) for text similarity
confidence (or confidence_threshold) for the OCR confidence score

It would also be nice to consistently use either a percentage (a value between 0 and 100) or a fraction (between 0 and 1) for both. Currently, we use a fraction for the RapidOCR confidence and a percentage for the other values.

Very good point, and thanks for the details provided.

I would go with [0, 100] values; they are easier to handle.
Internally, I believe your names are good. For external usage, if we end up using global variables, I'd recommend clarifying the context, e.g. OCR_ACCURACY_TH (or without _TH) and OCR_SIMILARITY_TH.

Makes sense?

adombeck · 2025-11-11T15:58:57Z

I propose that we always use:

* `similarity` (or `similarity_threshold`) for text similarity

* `confidence` (or `confidence_threshold`) for the OCR confidence score

To consistently use these terms, we have to change the vendored yarf/vendor/RPA/recognition/ocr.py. The find function returns a dict where currently the confidence item refers to text similarity. I would like to rename that to similarity.

Tesseract actually also returns a confidence value for the words it finds but yarf/vendor/RPA/recognition/ocr.py doesn't use it. With a small patch we could also add the actual confidence to the returned dict.

However, I don't think it's a good idea to just change vendored code, as I also mentioned in #113 (comment). How should we go about this?

adombeck · 2025-11-13T12:20:22Z

However, I don't think it's a good idea to just change vendored code, as I also mentioned in #113 (comment). How should we go about this?

@p-gentili what do you think?

p-gentili · 2025-11-13T12:47:52Z

I agree with your comment, and I'd like to try moving forward with .patch files, but I'm afraid I can't commit to the task in the short term.

To consistently use these terms, we have to change the vendored yarf/vendor/RPA/recognition/ocr.py. The find function returns a dict where currently the confidence item refers to text similarity. I would like to rename that to similarity.

I would recommend creating a new small module in yarf.rf_libraries.libraries.ocr, which just wraps RPA's tesseract and maps the dictionary to what we expect. Also, it would be a good opportunity to make both RapidOCRReader and the new Tesseract class inherit from the same abstract class, where we define the expected return value of the find method.
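
A minimal sketch of what such a wrapper could look like, under stated assumptions: the class names `OCRReader` and `TesseractReader`, the constructor argument `rpa_find`, and the exact keys of the returned dictionaries are hypothetical. The only thing taken from the discussion is that the vendored find function returns dicts whose confidence item actually holds the text similarity.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Optional

Match = Dict[str, Any]


class OCRReader(ABC):
    """Common interface for OCR backends (hypothetical)."""

    @abstractmethod
    def find(self, image: Any, text: str) -> List[Match]:
        """Return matches with a 'similarity' and a 'confidence' item."""


class TesseractReader(OCRReader):
    """Wraps the vendored find function without modifying it."""

    def __init__(self, rpa_find: Callable[[Any, str], List[Match]]) -> None:
        # `rpa_find` stands in for the find function of the vendored module.
        self._rpa_find = rpa_find

    def find(self, image: Any, text: str) -> List[Match]:
        mapped_matches = []
        for match in self._rpa_find(image, text):
            mapped = dict(match)  # copy, so the vendored result stays untouched
            # The vendored 'confidence' item is really the text similarity.
            mapped["similarity"] = mapped.pop("confidence")
            # The OCR confidence is not exposed by the vendored code.
            confidence: Optional[float] = None
            mapped["confidence"] = confidence
            mapped_matches.append(mapped)
        return mapped_matches
```

Keeping the renaming in the wrapper leaves the vendored file untouched, at the cost of the tesseract OCR confidence staying unavailable.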

Tesseract actually also returns a confidence value for the words it finds but yarf/vendor/RPA/recognition/ocr.py doesn't use it. With a small patch we could also add the actual confidence to the returned dict.

I don't have a solution for this that doesn't require modifying the vendor package. Maybe we can postpone it.

github-actions · 2026-01-13T01:41:42Z

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in a week.

github-actions · 2026-01-20T01:41:50Z

This PR was closed because of inactivity. Feel free to re-open this once you want to work on it again!

adombeck · 2026-01-20T16:30:43Z

The PR is still relevant. I just want to get #138 merged first.

[New] Support setting confidence and coincidence thresholds

944620f

adombeck force-pushed the support-setting-ocr-confidence branch from 776cbe2 to 10195a8 Compare October 29, 2025 14:56

adombeck force-pushed the support-setting-ocr-confidence branch 2 times, most recently from 5ca4548 to 906cc2f Compare October 29, 2025 15:52

[New] Return OCR confidence in match result

aeaf4a7

adombeck force-pushed the support-setting-ocr-confidence branch 2 times, most recently from 78281b8 to 1ed467d Compare October 29, 2025 16:21

adombeck marked this pull request as ready for review October 29, 2025 16:32

p-gentili requested review from douglasdotc and fernando79513 October 30, 2025 10:49

p-gentili self-assigned this Nov 10, 2025

p-gentili reviewed Nov 10, 2025

adombeck added 2 commits November 10, 2025 12:46

[New] Log matched text in find_text

8e8bd19

[New] Log rejected matches with high coincidence

903109e

When the coincidence threshold is set too high, we can end up rejecting the very matches we want. Logging matches that were rejected despite a relatively high coincidence can help debug those cases.
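
The idea can be sketched roughly as follows. This is not the code from the commit: `filter_matches`, the `near_miss_margin` parameter and the `text` and `coincidence` keys are hypothetical and only illustrate the approach.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def filter_matches(
    matches: List[Dict[str, Any]],
    threshold: float,
    near_miss_margin: float = 20.0,
) -> List[Dict[str, Any]]:
    """Keep matches at or above `threshold` and log the near misses.

    A near miss is a match whose coincidence is below the threshold but
    within `near_miss_margin` percentage points of it.
    """
    accepted = []
    for match in matches:
        score = match["coincidence"]
        if score >= threshold:
            accepted.append(match)
        elif score >= threshold - near_miss_margin:
            logger.debug(
                "Rejected %r: coincidence %.1f is below threshold %.1f",
                match["text"],
                score,
                threshold,
            )
    return accepted
```

Limiting the log to near misses keeps the output readable when OCR returns many unrelated words.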

adombeck force-pushed the support-setting-ocr-confidence branch from 1ed467d to 576b5b5 Compare November 10, 2025 11:47

fixup! [New] Log rejected matches with high coincidence

3ef0713

adombeck force-pushed the support-setting-ocr-confidence branch from 576b5b5 to 3ef0713 Compare November 10, 2025 12:19

adombeck mentioned this pull request Jan 9, 2026

[New] Consistent usage of "similarity" and "confidence" #138

Open

github-actions bot added the Stale label Jan 13, 2026

github-actions bot closed this Jan 20, 2026

adombeck reopened this Jan 20, 2026

github-actions bot removed the Stale label Jan 21, 2026
