empty filename in ALTO xml file #2700
Comments
Thank you for reporting this. The error already existed with commit d7cee03. |
The title can be set for hOCR and PDF output. Currently it is also used for ALTO, so setting the title can be used as a workaround for issue tesseract-ocr#2700. The constant unknown_title_ is no longer needed and therefore removed. Signed-off-by: Stefan Weil <[email protected]>
Pull request #2705 implements a workaround to set the missing filename. The final fix needs more effort because the image filename is currently not available |
I just had a look at our ALTO files (created by ABBYY FineReader). None of them contains |
@renarios, is the XML element |
@stweil, I found the ALTO standard on this website, and it says that the element is not required, but that it is preferable to add it. |
ALTO files are often (or at least sometimes) stored alongside the images used for OCR. There is definitely a point in referencing the image in the ALTO file, as that relationship would otherwise have to be described or deduced some other way. Example: https://data.kb.se/datasets/2014/10/aftonbladet/1862/01/urn%253Anbn%253Ase%253Akb%253Adark-29967/ |
I don't think we should be satisfied with the workaround in #2705 yet. For the user this means calling something like We know the input image file name, so that's exactly what we should be referencing in
But those two make the existing structure of setting up the output file from the output basename and renderer extension in the constructor of the renderers already, and then for each page's results merely appending text to that file inadequate: For ALTO, we actually have to use different output files for different pages! And each output file must refer to:
It's probably not just ALTO – there might be other (current or future) single-page renderers, too. But we definitely also have output options that need multi-page rendering behaviour, e.g. PDF. So, it's not enough to just call One solution might be to allow renderers to switch their underlying

    bool TessResultRenderer::AddImage(TessBaseAPI* api, const char* filename) {
      if (!happy_) return false;
      ++imagenum_;
      bool ok = AddImageHandler(api);
      if (next_) {
        ok = next_->AddImage(api, filename) && ok;
      }
      return ok;
    }

... with something like ...

    bool TessAltoRenderer::AddImage(TessBaseAPI* api, const char* filename) {
      if (!happy_) return false;
      ++imagenum_;
      // begin: single-page behaviour
      if (imagenum_ > 0)  // not the first page: finish the previous document
        happy_ = EndDocumentHandler();  // append postamble
      // strcmp() returns non-zero when the strings differ, i.e. not stdout
      if (strcmp(outputbase_, "-") && strcmp(outputbase_, "stdout")) {
        if (imagenum_ > 0)
          fclose(fout_);
        // one output file per page: <outputbase>_<imagenum>.<extension>
        STRING outfile = STRING(outputbase_);
        outfile.add_str_int("_", imagenum_);
        outfile += STRING(".") + STRING(file_extension_);
        fout_ = fopen(outfile.c_str(), "wb");
        if (fout_ == nullptr) {
          happy_ = false;
          return false;  // do not write the preamble to a null file
        }
      }
      title_ = filename;
      happy_ = BeginDocumentHandler() && happy_;  // append preamble
      if (!happy_) return false;
      // end: single-page behaviour
      bool ok = AddImageHandler(api);  // append results
      if (next_) {
        ok = next_->AddImage(api, filename) && ok;
      }
      return ok;
    }

Of course, one might even sub-class the old behaviour into An alternative, much simpler solution could be to just return with an error when ALTO output is requested in the multi-input case. (But some structural changes are required even in the single-input case, because the input filename still needs to enter the preamble.) @stweil what do you think? |
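The per-page naming step in the sketch above (output base, underscore, page index, extension, with stdout as a special case) can be looked at in isolation. The following is only a minimal illustration under stated assumptions: `PageOutputName` and `WritesToStdout` are hypothetical helper names that do not exist in Tesseract, and `std::string` stands in for Tesseract's own `STRING` class.

```cpp
#include <string>

// Hypothetical helper, not part of Tesseract: builds the per-page output
// file name "<outputbase>_<imagenum>.<extension>" as in the sketch above.
std::string PageOutputName(const std::string& outputbase, int imagenum,
                           const std::string& extension) {
  return outputbase + "_" + std::to_string(imagenum) + "." + extension;
}

// Hypothetical helper: true if the output base selects stdout, in which
// case no per-page file would be opened (mirrors the "-" / "stdout" check).
bool WritesToStdout(const std::string& outputbase) {
  return outputbase == "-" || outputbase == "stdout";
}
```

With an output base of `out` and the extension `xml`, the first two pages would go to `out_0.xml` and `out_1.xml` if page numbering starts at zero, which is an assumption of this illustration.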
Is there actually any progress regarding this issue? Please keep in mind that this issue is about unexpected behavior that should be turned off: Tesseract version 4.1.1 writes three tabs as text where a filename should appear. I'd prefer a pragmatic solution. |
@bertsky wrote a good summary of the problem, which rules out an easy fix. Basically, the current API needs changes to provide the filename at the right place. Changing the API would be possible, as we are talking about Tesseract 5, which may be API-incompatible with Tesseract 4. But we also have to consider third-party software like tesserocr, which must work with both Tesseract 4 and 5. That possibly makes it difficult. Writing no |
If these are the options, it's preferable to skip the output of data lacking any value. |
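Skipping the element when no filename is known could look roughly like the sketch below. It assumes the element in question is ALTO's `fileName` inside `sourceImageInformation`; `SourceImageInformation` and `EscapeXml` are hypothetical helper names for illustration, not Tesseract functions.

```cpp
#include <string>

// Escapes the characters that may not appear verbatim in XML text content.
std::string EscapeXml(const std::string& text) {
  std::string out;
  for (char c : text) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      default: out += c;
    }
  }
  return out;
}

// Hypothetical helper: returns the source image block for the ALTO
// preamble, or an empty string when no file name is known, so that no
// empty or placeholder element is written.
std::string SourceImageInformation(const std::string& filename) {
  if (filename.empty()) return "";
  return "<sourceImageInformation><fileName>" + EscapeXml(filename) +
         "</fileName></sourceImageInformation>";
}
```

The design choice here is to omit the whole block rather than emit an empty element, which matches the preference stated above for not writing data that carries no value.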
@M3ssman I don't understand where these three tabs come from in the current implementation. In my understanding,
@stweil did you mean we have to omit the element
Difficult yes, but worthwhile: We have a structural problem (multi-input with single- vs multi-output renderers) related to API and CLI which will not go away. To avoid breaking the API because there are already early adopters effectively means locking down the API forever. Instead we should actively support the transition in modules like |
@bertsky Sorry, I messed up. Tesseract 4.1.1 produces empty elements like
Tesseract 4.1.0 produced the tab output
|
Pull request #3517 is merged now in Git master, so this issue can be closed. |
@stweil, okay so in #3517 you went for the "simpler solution" sketched above, which does not generalize to the multi-input case:
Thus, IMO you still need to abort with an error if the ALTO renderer is requested for multi-input (multi-page TIFF or multi-line text file). In the current implementation, only the first page will have the correct |
Signed-off-by: Stefan Weil <[email protected]>
This is now implemented for 4.1 in commit b2eb72b. |
For multi-page TIFF the current solution works perfectly: there is only a single image file, and it is correctly named in If Tesseract processes a list of image files, |
Good point!
Sorry, I thought I had seen a multi-output case before (using A warning for each single-page renderer (ALTO, hOCR, ...) active in a multi-input run would still be nice, though. |
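The suggested warning for single-page renderers in a multi-input run could be sketched as follows. This is a hedged illustration only: `RendererInfo` and `MultiInputWarnings` are hypothetical names, and which renderers count as single-page is left to the caller rather than taken from Tesseract.

```cpp
#include <string>
#include <vector>

// Hypothetical description of an active renderer; not a Tesseract type.
struct RendererInfo {
  std::string name;   // e.g. "alto"
  bool single_page;   // true if one output document describes one page
};

// Returns one warning per single-page renderer when more than one input
// image is processed; returns an empty list for single-input runs.
std::vector<std::string> MultiInputWarnings(
    const std::vector<RendererInfo>& renderers, int num_inputs) {
  std::vector<std::string> warnings;
  if (num_inputs <= 1) return warnings;
  for (const RendererInfo& renderer : renderers) {
    if (renderer.single_page) {
      warnings.push_back("Warning: renderer '" + renderer.name +
                         "' describes a single page, but " +
                         std::to_string(num_inputs) +
                         " input images were given");
    }
  }
  return warnings;
}
```

A caller would print the returned messages to stderr before processing starts; turning the warning into a hard error, as proposed earlier in the thread, would only change what the caller does with a non-empty result.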
Environment
Current Behavior:
Running
tesseract <tif file> <basename> -l nld --dpi 300 --oem 2 --psm 1 alto
gives an XML output file. In the XML output file, the filename is empty:
Expected Behavior:
Suggested Fix:
insert filename