-
Notifications
You must be signed in to change notification settings - Fork 3
[New] Consistent usage of "similarity" and "confidence" #138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
98d905d
Consistent usage of "similarity" and "confidence"
adombeck e956227
Make RPA/recognition/ocr return actual confidence
adombeck 20e947e
Log matched text in find_text
adombeck a5a7d59
Log rejected matches with high similarity
adombeck d61334d
Log image which matched text
adombeck fe4a7f7
Avoid logging image twice in get_text_position
adombeck File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are having a similarity threshold, why not having also a confidence threshold?
Also, why is the similarity log threshold the same as the similarity threshold?
I think it would be more useful to log the cases in which both confidence and similarity are close to the threshold values:
I'd suggest:
Another option would be having just a "LOG_THRESHOLD" and compare:
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that would be useful. The confidence score measures how confident the OCR engine is that the string it returned is the actual string that's on the image. Let's say we want to find the string "foo" in an image, but the image only contains "bar". The OCR engine returns
{"text": "bar", "confidence": "99"}, i.e. it's confident that the image really contains "bar". We calculate the similarity to "foo" which results in a similarity score of 10. Logging the messageRejected match for text 'foo' with similarity 10 and confidence 99: 'bar'is not very helpful. Does that make sense?Because I only found it useful when setting a higher threshold (which is made possible by #100). The log message is only useful to spot false negatives (i.e. a match was rejected even though it should have been accepted). We found the default threshold to be so low that it frequently causes false positive matches, so I doubt that there will be a lot of false negatives with an even lower similarity score.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in that case you won't log anything. Both have to be close to the threshold to log anything.
Imagine you text is blurry, but the OCR still manages to get your word right. You try to look for foo, and you get:
{"text": "foo", "confidence": "65"}. The OCR was close to get it, but the confidence was just below the threshold. I think that information could be useful.If the OCR returns
{"text": "bar", "confidence": "99"}, similarity will be 0 and it won't log anythingI don't see the point in having it exactly the same. If you are looking for "fooo" and OCR gets "foo0", you may not want to match it, because the similarity is 75, but it's close enough to your threshold that you may want it to be logged.
If you have the exact same value, you are basically filtering for similarity to see when the confidence is too low but the similarity is high.
I think having a log threshold right below your real threshold will be useful to log only the cases that are close, so you can adjust your thresholds accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already get that message with my proposal though, because the similarity between "foo" and "foo" is 100, so it's above the similarity log threshold. In the diff you suggested, you just add an additional condition for logging the rejected matches, i.e. that the confidence is also high. That means when the text is blurry but the OCR engine still recognizes the text we're looking for, e.g.
{"text": "foo", "confidence": "50"}, it will reject the match and won't even log about it.That is a case we want to log IMO. In the case from above, when the engine is unsure that it recognized the exact text and returns
{"text": "foo", "confidence": "50"}, we do want to log the rejected match even if the confidence score is very low.Agreed, that's why in our tests we are using a similarity threshold of 92 and a log threshold of 80. My point is just that the default similarity threshold is already too low and causes too many false positives. I don't think we can safely change that, because it would break existing tests. So the proposed log threshold of 80 would mainly be useful when you set a higher similarity threshold (except for the case discussed above, where the match was rejected because of low confidence, in that case the log threshold of 80 would also be useful without setting a higher similarity threshold). Anyway, if we also make the log threshold configurable, which we should probably do anyway, I'm also fine with using a lower default there.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are 100% right there. Sorry for my confusion. The condition I suggested was even more restrictive.
If you think we don't need a lower bound to log the confidence results, I think that's okay. The lower bound for outputting a result (text_score) is already set at 0.5 in the rapidocr
config.yaml, so I don't think it's going to introduce too much noise.I didn't get at first why we would want to have a logging mechanism that just skips logging when using default values (for the similarity threshold). If you are only planning on increasing it, it makes sense because you will log the results you would get with the default values.
Nevertheless, if you were going to reduce the similarity threshold, you wouldn't log anything; that's why I liked the "LOG_THRESHOLD" approach.
Maybe the threshold is already too permissive, and we don't have to bother about it...
I leave it up to you if you want to have an absolute threshold for logs or a relative one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, I also found it challenging to understand the actual effect of the different approaches in practice and actually tried out the approaches to see the actual results.
Exactly, that was the idea.
That's the case in my experience