-
Definitely! Very good idea IMO. Lowers the threshold for going from simple lxml matching to full DOM programming of PAGE-XML, without dragging in OCR-D as a dependency. Since PAGE-XML is quite complex, lxml-based code tends to under-utilize the schema.
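To illustrate the gap between plain element matching and working with typed objects, here is a rough sketch, not the OCR-D PAGE API itself. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so it runs without extra dependencies. The `TextRegion` dataclass and the helpers `parse_points` and `text_regions` are hypothetical names made up for this example, and the sample document is trimmed (no `Metadata`), so it is not schema-valid.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple

# Namespace of the PAGE 2019 schema.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}

# Trimmed sample for illustration only (Metadata omitted, so not schema-valid).
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="page1.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="10,10 500,10 500,200 10,200"/>
      <TextEquiv><Unicode>Hello PAGE</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


@dataclass
class TextRegion:
    """Hypothetical typed view of a pc:TextRegion element."""
    id: str
    points: List[Tuple[int, int]]
    text: str


def parse_points(points: str) -> List[Tuple[int, int]]:
    """Turn a PAGE points string like '10,10 500,10' into (x, y) tuples."""
    result = []
    for pair in points.split():
        x, y = pair.split(",")
        result.append((int(x), int(y)))
    return result


def text_regions(root: ET.Element) -> List[TextRegion]:
    """Collect every TextRegion in the document as a typed object."""
    regions = []
    for el in root.iterfind(".//pc:TextRegion", NS):
        coords = el.find("pc:Coords", NS)
        unicode_el = el.find("pc:TextEquiv/pc:Unicode", NS)
        regions.append(TextRegion(
            id=el.get("id", ""),
            points=parse_points(coords.get("points", "")) if coords is not None else [],
            text=(unicode_el.text or "") if unicode_el is not None else "",
        ))
    return regions


root = ET.fromstring(SAMPLE)

# Style 1: plain matching. Quick, but every caller repeats namespace and path details.
raw_ids = [el.get("id") for el in root.iterfind(".//pc:TextRegion", NS)]

# Style 2: typed objects. Callers work with attributes instead of XPath strings.
regions = text_regions(root)
```

A generated API covers the whole schema in the second style, which hand-written wrappers like this one rarely do.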
-
Well, why not go for the big picture and do the same for other central OCR/library formats like METS or ALTO?
-
We generate the OCR-D PAGE API with generateDS from the PAGE 2019 XML Schema and enhance it with additional methods to handle more complex tasks. This approach has several advantages: we can be confident that the PAGE-XML is valid with respect to the schema, we have a consistent API to use across projects, and we have the tooling and experience to extend it easily.
But since it is largely independent of OCR-D conventions, it might make sense to move the PAGE API functionality to its own dedicated package. That would make it easier to collaborate on it with developers not directly involved in OCR-D/core development, e.g. developers of other Python-based OCR/layout analysis engines.
We would do that in a non-breaking fashion for existing OCR-D code, i.e. it would still be available from the `ocrd_models` package with the same naming etc.
What do you think?
@bertsky @chreul @maxnth @M3ssman @krvoigt (feel free to tag others who might have an opinion)