-
Notifications
You must be signed in to change notification settings - Fork 2
[New] Support setting OCR confidence and coincidence thresholds #100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
776cbe2 to
10195a8
Compare
|
The terms "confidence" and "coincidence" are currently used inconsistently:
As far as I can tell, I propose that we always use:
It would also be nice to use both percentage (a value between 0 and 100) or a fraction (between 0 and 1) for both. Currently, for RapidOCR we use a fraction for |
5ca4548 to
906cc2f
Compare
78281b8 to
1ed467d
Compare
I'll add tests when the changes were reviewed and we reached a conclusion regarding #100 (comment) |
| @keyword | ||
| def set_ocr_confidence_threshold(self, threshold: float) -> None: | ||
| """ | ||
| Set the OCR confidence threshold. | ||
| Args: | ||
| threshold: Confidence threshold between 0 and 1. | ||
| """ | ||
| logger.debug(f"Setting OCR confidence threshold to {threshold}") | ||
| self.ocr.DEFAULT_CONFIDENCE = threshold # type: ignore[union-attr] | ||
|
|
||
| @keyword | ||
| def set_ocr_coincidence_threshold(self, threshold: float) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we use a global variable instead? It won't help with tesseract being special, but we can mention it in the related documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean using a global variable instead of introducing these set_* keywords? how do you set the global variable in the robot framework tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly, with Set Global Variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay. Yes, that should work too. It's a bit less discoverable but we could solve that with documentation.
Very good point, and thanks for the details provided.
Makes sense? |
When setting a too high coincidence threshold, we can end up rejecting matches which are the ones we want to match. Logging those matches which were rejected even though they have a relatively high coincidence can help debug those cases.
1ed467d to
576b5b5
Compare
576b5b5 to
3ef0713
Compare
To consistently use these terms, we have to change the vendored Tesseract actually also returns a confidence value for the words it finds but However, I don't think it's a good idea to just change vendored code, as I also mentioned in #113 (comment). How should we go about this? |
@p-gentili what do you think? |
|
I agree with your comment, and I'd like to try moving forward with
I would recommend creating a new small module in
I don't have a solution for this which doesn't require modifying the vendor package. Maybe we can post-pone it. |
|
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in a week. |
|
This PR was closed because of inactivity. Feel free to re-open this once you want to work on it again! |
|
The PR is still relevant. I just want to get #138 merged before. |
With the default values
the
match_textkeyword often matches incorrect text in our tests. We need to set higher values to avoid that.