@adombeck
Contributor

With the default values

DEFAULT_CONFIDENCE: float = 0.7
DEFAULT_COINCIDENCE: float = 80.0

the match_text keyword often matches incorrect text in our tests. We need to set higher values to avoid that.

@adombeck adombeck force-pushed the support-setting-ocr-confidence branch from 776cbe2 to 10195a8 Compare October 29, 2025 14:56
@adombeck
Contributor Author

adombeck commented Oct 29, 2025

The terms "confidence" and "coincidence" are currently used inconsistently:

  • When using tesseract, only "confidence" is used, and it refers to the similarity between the text that it tries to find and the text that was returned by OCR.
  • When using RapidOCR:
    • "confidence" refers to the confidence score returned by RapidOCR, which is the model's estimated probability that the returned text is actually what is visible in the image.
    • "coincidence" refers to the same thing that "confidence" refers to in tesseract: the similarity between the text that we try to find and the text that was returned by OCR.
  • In the matches dictionaries returned by find_text, the confidence item always refers to the text similarity, even when using RapidOCR, where the coincidence argument is used for that.
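
To make the distinction between the two scores concrete, here is a minimal sketch with made-up data. `OcrResult`, `text_similarity` and the sample values are assumptions for illustration, not YARF, tesseract or RapidOCR code, and the similarity is computed with `difflib.SequenceMatcher` from the standard library, which may differ from what the real readers use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class OcrResult:
    """Hypothetical OCR output: the recognized text plus the model's own score."""

    text: str
    confidence: float  # model's estimated probability that `text` is correct (0..1)


def text_similarity(wanted: str, found: str) -> float:
    """Similarity between the text we search for and the text OCR returned (0..1)."""
    return SequenceMatcher(None, wanted.lower(), found.lower()).ratio()


# The model is very sure about what it read, but it is not the text we want.
sure_but_wrong = OcrResult(text="Cancel", confidence=0.99)
# The model is unsure, but what it read is exactly the text we want.
unsure_but_right = OcrResult(text="Install", confidence=0.55)

for result in (sure_but_wrong, unsure_but_right):
    similarity = text_similarity("Install", result.text)
    print(f"{result.text!r}: confidence={result.confidence:.2f}, similarity={similarity:.2f}")
```

The two numbers answer different questions, so a high value for one says nothing about the other, which is why using one name for both is confusing.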

As far as I can tell, confidence and coincidence are currently not really part of YARF's API, besides the confidence item in the return value of find_text. So maybe now (before merging this PR) would be a good time to rename them and use them more consistently.

I propose that we always use:

  • similarity (or similarity_threshold) for text similarity
  • confidence (or confidence_threshold) for the OCR confidence score

It would also be nice to use either a percentage (a value between 0 and 100) or a fraction (between 0 and 1) for both. Currently, for RapidOCR we use a fraction for confidence and a percentage for the other values.
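
As a rough sketch of what unifying the units could look like, the helper below converts a threshold to the 0..100 scale. The name `to_percentage` and the `is_fraction` flag are hypothetical and not part of YARF.

```python
def to_percentage(value: float, *, is_fraction: bool) -> float:
    """Convert a threshold to the 0..100 scale, validating its range first."""
    upper = 1.0 if is_fraction else 100.0
    if not 0.0 <= value <= upper:
        raise ValueError(f"threshold {value} is outside [0, {upper:g}]")
    return value * 100.0 if is_fraction else float(value)


# The current RapidOCR confidence default is a fraction, the coincidence
# default is already a percentage.
confidence_pct = to_percentage(0.7, is_fraction=True)
coincidence_pct = to_percentage(80.0, is_fraction=False)
```

The explicit flag is deliberate: guessing the unit from the magnitude would be ambiguous for a value such as 1.0, which is valid on both scales.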

@adombeck adombeck force-pushed the support-setting-ocr-confidence branch 2 times, most recently from 5ca4548 to 906cc2f Compare October 29, 2025 15:52
@adombeck adombeck force-pushed the support-setting-ocr-confidence branch 2 times, most recently from 78281b8 to 1ed467d Compare October 29, 2025 16:21
@adombeck adombeck marked this pull request as ready for review October 29, 2025 16:32
@adombeck
Contributor Author

FAIL Required test coverage of 100% not reached. Total coverage: 99.78%

I'll add tests once the changes have been reviewed and we have reached a conclusion regarding #100 (comment)

@p-gentili p-gentili self-assigned this Nov 10, 2025
Comment on lines +130 to +142
    @keyword
    def set_ocr_confidence_threshold(self, threshold: float) -> None:
        """
        Set the OCR confidence threshold.
        Args:
            threshold: Confidence threshold between 0 and 1.
        """
        logger.debug(f"Setting OCR confidence threshold to {threshold}")
        self.ocr.DEFAULT_CONFIDENCE = threshold  # type: ignore[union-attr]

    @keyword
    def set_ocr_coincidence_threshold(self, threshold: float) -> None:
Collaborator

What if we use a global variable instead? It won't help with tesseract being special, but we can mention it in the related documentation.

Contributor Author

You mean using a global variable instead of introducing these set_* keywords? How do you set the global variable in the Robot Framework tests?

Collaborator

Exactly, with Set Global Variable.

Contributor Author

Ah okay. Yes, that should work too. It's a bit less discoverable but we could solve that with documentation.
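
For reference, a minimal sketch of how the library side could read such a variable. The variable names and default values are hypothetical, and a plain mapping stands in for Robot Framework's variable store so the sketch runs on its own; in a real library the lookup would probably go through `BuiltIn().get_variable_value(name, default)`.

```python
from typing import Any, Mapping

# Hypothetical variable names and defaults, for illustration only.
DEFAULT_THRESHOLDS = {
    "OCR_CONFIDENCE_THRESHOLD": 70.0,
    "OCR_SIMILARITY_THRESHOLD": 80.0,
}


def resolve_threshold(variables: Mapping[str, Any], name: str) -> float:
    """Return the threshold stored under `name`, or the default if it is unset."""
    raw = variables.get(name, DEFAULT_THRESHOLDS[name])
    threshold = float(raw)  # values set from test data often arrive as strings
    if not 0.0 <= threshold <= 100.0:
        raise ValueError(f"{name} must be between 0 and 100, got {threshold}")
    return threshold


# A test suite overrode one variable; the other falls back to its default.
suite_variables = {"OCR_CONFIDENCE_THRESHOLD": "90"}
confidence = resolve_threshold(suite_variables, "OCR_CONFIDENCE_THRESHOLD")
similarity = resolve_threshold(suite_variables, "OCR_SIMILARITY_THRESHOLD")
```

Validating at lookup time matters more with variables than with keywords, because nothing checks the value at the point where the test sets it.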

@p-gentili
Collaborator

p-gentili commented Nov 10, 2025

> I propose that we always use:
>
> similarity (or similarity_threshold) for text similarity
> confidence (or confidence_threshold) for the OCR confidence score
>
> It would also be nice to use either a percentage (a value between 0 and 100) or a fraction (between 0 and 1) for both. Currently, for RapidOCR we use a fraction for confidence and a percentage for the other values.

Very good point, and thanks for the details provided.

  1. I would go with [0,100] values, easier to handle
  2. Internally, I believe your names are good. For external usage, if we end up using global variables, I'd recommend clarifying the context, so OCR_ACCURACY_TH (or w/o _TH) and OCR_SIMILARITY_TH.

Makes sense?

When the coincidence threshold is set too high, we can end up rejecting
the very matches we want. Logging matches that were rejected despite a
relatively high coincidence helps debug those cases.
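
A minimal sketch of the idea in this commit message, not the code from the commit itself. The function name, the dict keys and the `near_miss_margin` parameter are assumptions for illustration.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def filter_matches(
    matches: list[dict[str, Any]],
    threshold: float,
    near_miss_margin: float = 10.0,
) -> list[dict[str, Any]]:
    """Keep matches at or above `threshold`; log rejected ones that came close.

    Each match is a dict with "text" and "coincidence" (0..100) keys.
    """
    accepted = []
    for match in matches:
        score = match["coincidence"]
        if score >= threshold:
            accepted.append(match)
        elif score >= threshold - near_miss_margin:
            # Near miss: worth surfacing when debugging a too-strict threshold.
            logger.debug(
                "Rejected %r: coincidence %.1f is below threshold %.1f",
                match["text"],
                score,
                threshold,
            )
    return accepted
```

Only near misses are logged, so clearly unrelated text on the screen does not flood the debug output.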
@adombeck adombeck force-pushed the support-setting-ocr-confidence branch from 1ed467d to 576b5b5 Compare November 10, 2025 11:47
@adombeck adombeck force-pushed the support-setting-ocr-confidence branch from 576b5b5 to 3ef0713 Compare November 10, 2025 12:19
@adombeck
Contributor Author

> I propose that we always use:
>
> * `similarity` (or `similarity_threshold`) for text similarity
> * `confidence` (or `confidence_threshold`) for the OCR confidence score

To consistently use these terms, we have to change the vendored yarf/vendor/RPA/recognition/ocr.py. The find function returns a dict where currently the confidence item refers to text similarity. I would like to rename that to similarity.

Tesseract actually also returns a confidence value for the words it finds but yarf/vendor/RPA/recognition/ocr.py doesn't use it. With a small patch we could also add the actual confidence to the returned dict.
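
A rough sketch of what carrying the real confidence along could look like. It assumes the reader obtains its words from pytesseract's `image_to_data` with dict output, where `text` and `conf` are parallel lists and non-word entries have a `conf` of -1; whether the vendored module does exactly that is not confirmed here. The function name and the sample data are made up.

```python
from __future__ import annotations


def words_with_confidence(data: dict) -> list[dict]:
    """Pair each recognized word with the confidence tesseract reported for it."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # may arrive as int, float or str depending on version
        if not text.strip() or conf < 0:
            continue  # skip layout entries that carry no recognized word
        words.append({"text": text, "confidence": conf})
    return words


# Hand-written sample in the assumed shape, not real OCR output.
sample = {"text": ["", "Install", "Ubuntu"], "conf": ["-1", "96.3", 91]}
words = words_with_confidence(sample)
```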

However, I don't think it's a good idea to just change vendored code, as I also mentioned in #113 (comment). How should we go about this?

@adombeck
Contributor Author

> However, I don't think it's a good idea to just change vendored code, as I also mentioned in #113 (comment). How should we go about this?

@p-gentili what do you think?

@p-gentili
Collaborator

I agree with your comment, and I'd like to try moving forward with .patch files, but I can't commit to the task in the short term I'm afraid.

> To consistently use these terms, we have to change the vendored yarf/vendor/RPA/recognition/ocr.py. The find function returns a dict where currently the confidence item refers to text similarity. I would like to rename that to similarity.

I would recommend creating a new small module in yarf.rf_libraries.libraries.ocr, which just wraps RPA's tesseract and maps the dictionary to what we expect. Also, it would be a good opportunity to make both RapidOCRReader and the new Tesseract inherit from the same abstract class, where we define the expected return for the find method.
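
The suggestion could be sketched roughly as follows. Everything here is hypothetical: the class names, the `find` signature, the dict keys and `fake_vendored_find`, which stands in for the vendored function so the example runs without the RPA package.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any, Callable


class OcrReader(ABC):
    """Hypothetical common interface for all OCR backends."""

    @abstractmethod
    def find(self, image: Any, text: str) -> list[dict[str, Any]]:
        """Return matches as dicts with "text", "region" and "similarity" keys."""


class TesseractReader(OcrReader):
    """Wraps a vendored find function and renames its misleading key."""

    def __init__(self, vendored_find: Callable[..., list[dict[str, Any]]]) -> None:
        self._vendored_find = vendored_find

    def find(self, image: Any, text: str) -> list[dict[str, Any]]:
        matches = []
        for raw in self._vendored_find(image, text):
            mapped = dict(raw)  # copy so the vendored result is not mutated
            # The vendored "confidence" item is really the text similarity.
            mapped["similarity"] = mapped.pop("confidence")
            matches.append(mapped)
        return matches


def fake_vendored_find(image: Any, text: str) -> list[dict[str, Any]]:
    """Stand-in for the vendored function, returning the current dict shape."""
    return [{"text": "Install", "region": (10, 20, 90, 40), "confidence": 100.0}]


reader = TesseractReader(fake_vendored_find)
result = reader.find(image=None, text="Install")
```

Because the renaming happens in the wrapper, the vendored file stays untouched, and a RapidOCR reader implementing the same base class would return the same keys.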

> Tesseract actually also returns a confidence value for the words it finds but yarf/vendor/RPA/recognition/ocr.py doesn't use it. With a small patch we could also add the actual confidence to the returned dict.

I don't have a solution for this that doesn't require modifying the vendored package. Maybe we can postpone it.

@github-actions

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in a week.

@github-actions github-actions bot added the Stale label Jan 13, 2026
@github-actions

This PR was closed because of inactivity. Feel free to re-open this once you want to work on it again!

@github-actions github-actions bot closed this Jan 20, 2026
@adombeck
Contributor Author

The PR is still relevant. I just want to get #138 merged first.

@adombeck adombeck reopened this Jan 20, 2026
@github-actions github-actions bot removed the Stale label Jan 21, 2026