[New] Consistent usage of "similarity" and "confidence" by adombeck · Pull Request #138 · canonical/yarf

adombeck · 2025-12-17T22:25:33Z

Description

This PR contains multiple improvements to the OCR modules used by yarf:

Consistently use the terms "similarity" and "confidence", and use percentages for both
Make yarf/vendor/RPA/recognition/ocr.py return actual confidence
Log the matched text in find_text
Log rejected matches with high similarity
Log image which matched text

See commit messages for details.

Tests

The existing tests were updated and run via uv run pytest.

adombeck · 2025-12-17T23:51:02Z

While testing the changes to yarf/vendor/RPA/recognition/ocr.py I noticed how much worse its results are compared to RapidOCR. Why do we even keep it? It would simplify things quite a lot if we could just remove it and always use RapidOCR.

p-gentili · 2026-01-08T11:44:12Z

Hey @fernando79513, can you please take a look at this PR? The idea looks good to me but you worked on those files in the first place and you might have a different view on it.

fernando79513 · 2026-01-08T12:32:19Z

> While testing the changes to yarf/vendor/RPA/recognition/ocr.py I noticed how much worse its results are compared to RapidOCR. Why do we even keep it? It would simplify things quite a lot if we could just remove it and always use RapidOCR.

The only reason we kept it is that RapidOCR is noticeably slower than tesseract (around 2x), especially on big images. We may want to keep it in case we need faster detection times.

Nevertheless, maintaining both introduces some complexity, so I don't have a strong opinion about keeping Tesseract. Any comments on that @p-gentili?

fernando79513

Thanks a lot for the work here.
I usually prefer having confidence and similarities with values from 0-1 instead of 0-100. We didn't change that in the beginning because we couldn't modify the RPA library, but I guess it should be possible now that it's vendorized. I leave it up to you though.

yarf/rf_libraries/libraries/video_input_base.py

p-gentili · 2026-01-08T14:53:51Z

> Nevertheless, maintaining both introduces some complexity, so I don't have a strong opinion about keeping Tesseract. Any comments on that @p-gentili?

Although I believe tesseract just doesn't work well in our use cases, this change doesn't really justify introducing a breaking change. The patch doesn't look that complex, and it would be similar to the amount of work required for any other OCR engine we might introduce.

Nevertheless, we can discuss the removal separately, but I would keep this PR as is.

adombeck · 2026-01-09T15:06:49Z

> I usually prefer having confidence and similarities with values from 0-1 instead of 0-100.

@p-gentili said in #100 (comment) that he prefers 0-100, so that's what I implemented here. I don't have an opinion on that so I'll let you two discuss it.

adombeck · 2026-01-09T15:28:14Z

Rebased on main and resolved the conflicts.

adombeck · 2026-01-09T17:19:55Z

I noticed that a call to get_text_position now logs the matching image twice, because it calls match_text (which calls find_text, where we log the image now), and the function logs it itself in:

yarf/yarf/rf_libraries/libraries/video_input_base.py

Line 404 in a04652e

log_image(matched_image, "Matched text region:")

I'll remove the log_image call in get_text_position to avoid that.

p-gentili · 2026-01-09T17:39:43Z

> @p-gentili said in #100 (comment) that he prefers 0-100, so that's what I implemented here. I don't have an opinion on that so I'll let you two discuss it.

lol, let's fight @fernando79513 ! I like integers, that's all. If in this context [0,1] makes more sense that's fine by me.

fernando79513

Just a minor comment, but I think it would make more sense to have the logs like this.
Let me know what you think. It's not really impactful for the code execution.

fernando79513 · 2026-01-20T16:13:21Z

yarf/rf_libraries/libraries/ocr/rapidocr.py

+            elif similarity >= self.SIMILARITY_LOG_THRESHOLD:
+                logger.debug(
+                    f"Rejected match for text '{match_text}' "
+                    f"with similarity {similarity} "
+                    f"and confidence {item.confidence}: '{item.text}'"
+                )


If you have a similarity threshold, why not also have a confidence threshold?
Also, why is the similarity log threshold the same as the similarity threshold?
I think it would be more useful to log the cases in which both confidence and similarity are close to the threshold values:

Suggested change

-            elif similarity >= self.SIMILARITY_LOG_THRESHOLD:
-                logger.debug(
-                    f"Rejected match for text '{match_text}' "
-                    f"with similarity {similarity} "
-                    f"and confidence {item.confidence}: '{item.text}'"
-                )
+            elif (
+                similarity >= self.SIMILARITY_LOG_THRESHOLD
+                and item.confidence >= self.CONFIDENCE_LOG_THRESHOLD
+            ):
+                logger.debug(
+                    f"Rejected match for text '{match_text}' "
+                    f"with similarity {similarity} "
+                    f"and confidence {item.confidence}: '{item.text}'"
+                )

I'd suggest:

SIMILARITY_LOG_THRESHOLD = 70.0
CONFIDENCE_LOG_THRESHOLD = 60.0

Another option would be to have a single "LOG_THRESHOLD" and compare:

elif (
    similarity >= similarity_threshold - self.LOG_THRESHOLD
    and item.confidence >= confidence_threshold - self.LOG_THRESHOLD
):
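The relative-margin idea can be sketched as a standalone function, assuming all scores are percentages. The name `should_log_rejection`, the `LOG_MARGIN` value, and the thresholds used in the usage lines are hypothetical and not taken from yarf.

```python
LOG_MARGIN = 10.0  # hypothetical margin, in percentage points


def should_log_rejection(
    similarity: float,
    confidence: float,
    similarity_threshold: float,
    confidence_threshold: float,
    margin: float = LOG_MARGIN,
) -> bool:
    """Return True for a rejected match whose scores are both near the thresholds."""
    accepted = (
        similarity >= similarity_threshold
        and confidence >= confidence_threshold
    )
    if accepted:
        return False  # accepted matches are never reported as rejections
    return (
        similarity >= similarity_threshold - margin
        and confidence >= confidence_threshold - margin
    )


# Near miss on similarity (75 against a threshold of 80): logged.
print(should_log_rejection(75, 90, similarity_threshold=80, confidence_threshold=70))
# Far below the similarity threshold: not logged.
print(should_log_rejection(50, 90, similarity_threshold=80, confidence_threshold=70))
```

Because the log band is derived from the active thresholds, it moves with them when a caller raises or lowers either one.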

adombeck · 2026-01-20

> If you have a similarity threshold, why not also have a confidence threshold?

I don't think that would be useful. The confidence score measures how confident the OCR engine is that the string it returned is the actual string that's on the image. Let's say we want to find the string "foo" in an image, but the image only contains "bar". The OCR engine returns {"text": "bar", "confidence": "99"}, i.e. it's confident that the image really contains "bar". We calculate the similarity to "foo" which results in a similarity score of 10. Logging the message Rejected match for text 'foo' with similarity 10 and confidence 99: 'bar' is not very helpful. Does that make sense?

> Also, why is the similarity log threshold the same as the similarity threshold?

Because I only found it useful when setting a higher threshold (which is made possible by #100). The log message is only useful to spot false negatives (i.e. a match was rejected even though it should have been accepted). We found the default threshold to be so low that it frequently causes false positive matches, so I doubt that there will be a lot of false negatives with an even lower similarity score.

fernando79513 · 2026-01-23

> I don't think that would be useful

But in that case you won't log anything. Both have to be close to the threshold to log anything.
Imagine your text is blurry, but the OCR still manages to get your word right. You try to look for foo, and you get: {"text": "foo", "confidence": "65"}. The OCR was close to getting it, but the confidence was just below the threshold. I think that information could be useful.

If the OCR returns {"text": "bar", "confidence": "99"}, similarity will be 0 and it won't log anything

> Because I only found it useful

I don't see the point in having it exactly the same. If you are looking for "fooo" and OCR gets "foo0", you may not want to match it, because the similarity is 75, but it's close enough to your threshold that you may want it to be logged.
If you have the exact same value, you are basically filtering for similarity to see when the confidence is too low but the similarity is high.

I think having a log threshold right below your real threshold will be useful to log only the cases that are close, so you can adjust your thresholds accordingly.
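For reference, a percentage similarity like the one in the "fooo" / "foo0" example can be computed with the standard library. This is only a sketch: it uses `difflib.SequenceMatcher` as a stand-in, and the thread does not say which similarity metric yarf actually uses. The helper name `similarity_percent` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity_percent(expected: str, found: str) -> float:
    """Similarity of two strings as a percentage (0 to 100)."""
    # ratio() is 2 * matching_characters / total_characters, in [0, 1].
    return SequenceMatcher(None, expected, found).ratio() * 100


print(similarity_percent("fooo", "foo0"))  # 75.0
print(similarity_percent("foo", "foo"))    # 100.0
```

With this metric, three of the four characters match in each string, which gives the 75 quoted above.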

adombeck · 2026-01-23

> But in that case you won't log anything. Both have to be close to the threshold to log anything. Imagine your text is blurry, but the OCR still manages to get your word right. You try to look for foo, and you get: {"text": "foo", "confidence": "65"}. The OCR was close to getting it, but the confidence was just below the threshold. I think that information could be useful.

You already get that message with my proposal though, because the similarity between "foo" and "foo" is 100, so it's above the similarity log threshold. In the diff you suggested, you just add an additional condition for logging the rejected matches, i.e. that the confidence is also high. That means when the text is blurry but the OCR engine still recognizes the text we're looking for, e.g. {"text": "foo", "confidence": "50"}, it will reject the match and won't even log about it.
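The difference between the two logging conditions can be checked with a small sketch. It uses the threshold values from the suggested change (70.0 and 60.0); the function names are hypothetical, and the similarity values are supplied by hand rather than computed.

```python
SIMILARITY_LOG_THRESHOLD = 70.0  # value from the suggested change
CONFIDENCE_LOG_THRESHOLD = 60.0  # value from the suggested change


def logs_similarity_only(similarity: float, confidence: float) -> bool:
    """Condition from the PR: only similarity decides whether to log."""
    return similarity >= SIMILARITY_LOG_THRESHOLD


def logs_similarity_and_confidence(similarity: float, confidence: float) -> bool:
    """Condition from the suggested change: both scores must be high."""
    return (
        similarity >= SIMILARITY_LOG_THRESHOLD
        and confidence >= CONFIDENCE_LOG_THRESHOLD
    )


# Blurry text: OCR reads "foo" correctly (similarity 100) but with confidence 50.
print(logs_similarity_only(100.0, 50.0))            # True
print(logs_similarity_and_confidence(100.0, 50.0))  # False
```

The second condition is a strict subset of the first, so it can only suppress log messages, never add any.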

> I don't see the point in having it exactly the same. If you are looking for "fooo" and OCR gets "foo0", you may not want to match it, because the similarity is 75, but it's close enough to your threshold that you may want it to be logged. If you have the exact same value, you are basically filtering for similarity to see when the confidence is too low but the similarity is high.

That is a case we want to log IMO. In the case from above, when the engine is unsure that it recognized the exact text and returns {"text": "foo", "confidence": "50"}, we do want to log the rejected match even if the confidence score is very low.

> I think having a log threshold right below your real threshold will be useful to log only the cases that are close, so you can adjust your thresholds accordingly.

Agreed, that's why in our tests we are using a similarity threshold of 92 and a log threshold of 80. My point is just that the default similarity threshold is already too low and causes too many false positives. I don't think we can safely change that, because it would break existing tests. So the proposed log threshold of 80 would mainly be useful when you set a higher similarity threshold (except for the case discussed above, where the match was rejected because of low confidence, in that case the log threshold of 80 would also be useful without setting a higher similarity threshold). Anyway, if we also make the log threshold configurable, which we should probably do anyway, I'm also fine with using a lower default there.
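A rough sketch of how the three outcomes fall out with a similarity threshold of 92 and a log threshold of 80, as mentioned above. The `classify` function, the `Outcome` enum, and the confidence threshold of 70 are hypothetical and only serve the illustration.

```python
from enum import Enum


class Outcome(Enum):
    ACCEPTED = "accepted"
    REJECTED_LOGGED = "rejected, logged"
    REJECTED_SILENT = "rejected, not logged"


def classify(
    similarity: float,
    confidence: float,
    similarity_threshold: float = 92.0,      # value used in the tests mentioned above
    confidence_threshold: float = 70.0,      # hypothetical value
    similarity_log_threshold: float = 80.0,  # log threshold mentioned above
) -> Outcome:
    """Classify an OCR result; all scores are percentages (0 to 100)."""
    if similarity >= similarity_threshold and confidence >= confidence_threshold:
        return Outcome.ACCEPTED
    if similarity >= similarity_log_threshold:
        return Outcome.REJECTED_LOGGED
    return Outcome.REJECTED_SILENT


print(classify(85.0, 99.0))   # near miss on similarity
print(classify(100.0, 50.0))  # exact text, low confidence
print(classify(60.0, 99.0))   # clearly a different string
```

Both the near miss and the low-confidence exact match end up in the logged band, which covers the two false-negative cases discussed in this thread.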

fernando79513 · 2026-01-23

> You already get that message with my proposal though.

You are 100% right there. Sorry for my confusion. The condition I suggested was even more restrictive.

If you think we don't need a lower bound to log the confidence results, I think that's okay. The lower bound for outputting a result (text_score) is already set at 0.5 in the rapidocr config.yaml, so I don't think it's going to introduce too much noise.

I didn't get at first why we would want to have a logging mechanism that just skips logging when using default values (for the similarity threshold). If you are only planning on increasing it, it makes sense because you will log the results you would get with the default values.
Nevertheless, if you were going to reduce the similarity threshold, you wouldn't log anything; that's why I liked the "LOG_THRESHOLD" approach.

Maybe the threshold is already too permissive, and we don't have to bother about it...

I leave it up to you if you want to have an absolute threshold for logs or a relative one.

adombeck · 2026-01-23

> You are 100% right there. Sorry for my confusion.

No worries, I also found it challenging to understand the actual effect of the different approaches in practice and actually tried out the approaches to see the actual results.

> If you are only planning on increasing it, it makes sense because you will log the results you would get with the default values.

Exactly, that was the idea.

> Maybe the threshold is already too permissive, and we don't have to bother about it...

That's the case in my experience

fernando79513

LGTM +1

adombeck · 2026-01-23T14:26:46Z

rebased on main and signed my commits which were missing signatures

adombeck · 2026-01-23T14:28:28Z

Do you have a strict "only squash and merge" policy? I tried to make the commits self-contained and write useful commit messages, would be a shame to lose those.

p-gentili · 2026-01-23T15:00:54Z

> Do you have a strict "only squash and merge" policy? I tried to make the commits self-contained and write useful commit messages, would be a shame to lose those.

We do, because versioning is computed automatically, but I can make it work for you. I just need you to rename all those commits removing any sort of prefix.

We used the terms "confidence" and "coincidence", which are ambiguous and used inconsistently:

* When using tesseract, only "confidence" is used, and it refers to the similarity between the text that it tries to find and the text that was returned by OCR.
* When using RapidOCR:
  * "confidence" refers to the confidence score returned by RapidOCR, which is the model's estimated probability that the returned text is actually what is visible in the image.
  * "coincidence" refers to the same which "confidence" refers to in tesseract: the similarity between the text that we try to find and the text that was returned by OCR
* In the matches dictionaries returned by find_text, the confidence item always refers to the text similarity, even when using RapidOCR, where the coincidence argument is used for that.
* For RapidOCR, the "confidence" value is a fraction (between 0 and 1), for the other values we use percentage (between 0 and 100).

We now consistently use:

* "similarity" for text similarity
* "confidence" for the OCR confidence score
* Percentage (0 to 100) for all similarity and confidence values.
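As an illustration of the unit change described above, a fractional confidence score can be converted to a percentage before it is stored next to the similarity. This is a minimal sketch; `to_percent`, `make_match`, and the dictionary layout are hypothetical and do not reproduce yarf's actual code.

```python
def to_percent(score: float) -> float:
    """Convert a fractional score (0 to 1) into a percentage (0 to 100)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"expected a fraction between 0 and 1, got {score}")
    return round(score * 100, 2)


def make_match(text: str, similarity: float, engine_confidence: float) -> dict:
    """Build a match record that keeps the two scores under separate names."""
    return {
        "text": text,
        "similarity": similarity,                     # already a percentage
        "confidence": to_percent(engine_confidence),  # fraction converted to percentage
    }


print(make_match("foo", 100.0, 0.5))
```

Keeping both values on the same 0 to 100 scale means a single threshold convention applies to both.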


adombeck · 2026-01-23T15:03:09Z

> I can make it work for you.

Thank you!

> I just need you to rename all those commits removing any sort of prefix.

Done!

adombeck mentioned this pull request Dec 17, 2025

[New] Make RPA/recognition/ocr return actual confidence #139

Closed

adombeck force-pushed the similarity-and-confidence branch 3 times, most recently from 86f891e to b51f8d8 Compare December 17, 2025 23:42

adombeck force-pushed the similarity-and-confidence branch from b51f8d8 to 7668069 Compare December 17, 2025 23:53

p-gentili requested a review from fernando79513 January 8, 2026 11:44

fernando79513 reviewed Jan 8, 2026


yarf/rf_libraries/libraries/video_input_base.py (outdated)

adombeck force-pushed the similarity-and-confidence branch from 7668069 to 3faead5 Compare January 9, 2026 15:28

adombeck force-pushed the similarity-and-confidence branch 2 times, most recently from 1accf29 to a04652e Compare January 9, 2026 15:40

adombeck force-pushed the similarity-and-confidence branch from a04652e to 556ec5b Compare January 9, 2026 17:35

adombeck requested a review from fernando79513 January 9, 2026 17:46

p-gentili assigned fernando79513 Jan 14, 2026

fernando79513 requested changes Jan 20, 2026


adombeck mentioned this pull request Jan 20, 2026

[New] Support setting OCR similarity and confidence thresholds #100

Open

adombeck requested a review from fernando79513 January 20, 2026 18:35

fernando79513 approved these changes Jan 23, 2026


adombeck force-pushed the similarity-and-confidence branch from 556ec5b to 2ca4284 Compare January 23, 2026 14:26

adombeck added 5 commits January 23, 2026 16:02

Make RPA/recognition/ocr return actual confidence

e956227

The module only returned similarity but tesseract does actually also return confidence values, we just have to use it.
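A sketch of how per-word confidence values could be read from tesseract output. It assumes data shaped like the dictionary returned by `pytesseract.image_to_data(..., output_type=Output.DICT)`, with parallel `text` and `conf` lists in which `-1` marks entries without a recognized word. Whether the vendored module uses that call is not shown in this thread, and the helper name and sample data are hypothetical.

```python
def words_with_confidence(data: dict) -> list[dict]:
    """Pair each recognized word with its confidence (0 to 100)."""
    results = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # values may arrive as strings or numbers
        if conf < 0 or not text.strip():
            continue  # skip layout entries that carry no recognized word
        results.append({"text": text, "confidence": conf})
    return results


# Hand-written sample in the assumed shape; no OCR engine is invoked here.
sample = {"text": ["", "foo", "bar"], "conf": ["-1", "96.5", "41"]}
print(words_with_confidence(sample))
```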

Log matched text in find_text

20e947e

Log rejected matches with high similarity

a5a7d59

When setting a too high similarity threshold, we can end up rejecting matches which are the ones we want to match. Logging those matches which were rejected even though they have a relatively high similarity can help debug those cases.

Log image which matched text

d61334d

Including the image in the HTML log allows debugging cases in which text was found on a screenshot that shouldn't contain the text.

Avoid logging image twice in get_text_position

fe4a7f7

It's now already logged in find_text which is called via match_text.

adombeck force-pushed the similarity-and-confidence branch from 2ca4284 to fe4a7f7 Compare January 23, 2026 15:02

p-gentili merged commit 78ee5c2 into main Jan 23, 2026
6 checks passed

p-gentili deleted the similarity-and-confidence branch January 23, 2026 15:06

fernando79513 mentioned this pull request Feb 4, 2026

[BugFix] Fix match image logging #171

Merged


3 participants
