Outsourcing PAGE generateDS API to its own package #778

kba · 2021-12-22T10:13:34Z

kba
Dec 22, 2021
Maintainer

We generate the OCR-D PAGE API with generateDS from the PAGE 2019 XML Schema and enhance it with additional methods to handle more complex tasks. This approach has three advantages: we can be confident that the PAGE-XML is valid with respect to the schema, we have a consistent API to use across projects, and we have the tooling and experience to extend it easily.

But since it is largely independent of OCR-D conventions, it might make sense to move the PAGE API functionality to its own dedicated package, to make it easier to collaborate on it with developers not directly involved in OCR-D/core development, e.g. of other Python-based OCR/layout analysis engines.

We would do that in a non-breaking fashion for existing OCR-D code, i.e. it would still be available from the ocrd_models package with the same naming etc.
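A minimal sketch of what such a non-breaking move could look like, assuming the standalone package exposes its public names via `__all__`. The names `page_api` and `legacy_page_module` are made up for illustration, and the placeholder class and function only stand in for generateDS-generated code. In a real package the shim would more likely be a single `from page_api import *` line in the existing module.

```python
import types


def make_standalone_module():
    """Build a stand-in for a hypothetical standalone package 'page_api'."""
    mod = types.ModuleType("page_api")

    class PcGtsType:
        """Placeholder for a generateDS-generated class."""

    def parse(path):
        """Placeholder for a generated parse() entry point."""
        return PcGtsType()

    mod.PcGtsType = PcGtsType
    mod.parse = parse
    mod.__all__ = ["PcGtsType", "parse"]
    return mod


def make_compat_shim(target, shim_name):
    """Re-export every public name of `target` under the old module name."""
    shim = types.ModuleType(shim_name)
    for name in target.__all__:
        setattr(shim, name, getattr(target, name))
    shim.__all__ = list(target.__all__)
    return shim


standalone = make_standalone_module()
legacy = make_compat_shim(standalone, "legacy_page_module")

# Existing code that uses the old module gets the very same objects.
assert legacy.parse is standalone.parse
assert legacy.PcGtsType is standalone.PcGtsType
```

Because the shim hands out the same objects rather than copies, `isinstance` checks keep working across old and new import paths.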

What do you think?

@bertsky @chreul @maxnth @M3ssman @krvoigt (feel free to tag others who might have an opinion)

bertsky · 2021-12-22T10:32:56Z

bertsky
Dec 22, 2021
Collaborator

Definitely! Very good idea IMO. Lowers the threshold for going from simple lxml matching to full DOM programming of PAGE-XML, without dragging in OCR-D as a dependency. Since PAGE-XML is quite complex, lxml-based code tends to under-utilize the schema.
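To illustrate what "simple matching" tends to look like, here is a rough sketch that pulls line texts out of a PAGE document by element name. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra dependencies. The sample document is hand-written and trimmed (for instance, it omits `Metadata`), and `line_texts` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# PAGE 2019 namespace; every lookup has to spell it out explicitly.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hand-written, trimmed sample for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.tif" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="r1_l1">
        <Coords points="0,0 100,0 100,25 0,25"/>
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l2">
        <Coords points="0,25 100,25 100,50 0,50"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def line_texts(xml_text):
    """Return (line id, text) pairs found by plain element-name matching."""
    root = ET.fromstring(xml_text)
    result = []
    for line in root.iterfind(".//pc:TextLine", NS):
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        text = unicode_el.text if unicode_el is not None else None
        result.append((line.get("id"), text))
    return result


print(line_texts(SAMPLE))
```

Nothing here knows about the schema: a misspelled element name or a wrong namespace silently yields an empty result, which is the kind of gap a schema-derived API closes.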

0 replies

M3ssman · 2021-12-22T15:08:17Z

M3ssman
Dec 22, 2021

Well, why not go for the big picture and do the same for other central OCR/library formats like ... METS or ALTO?
Besides, OCR-D-specific domain types should be preserved, too.

2 replies

kba Dec 22, 2021
Maintainer Author

> Well, why not go for the big picture and do the same for other central OCR/library formats like ... METS or ALTO?

We could do the same for OcrdMets of course, but I think there is less demand for that and it is quite specific to our needs in OCR-D. METS is a fairly big standard and extensible too (e.g. with MODS, MIX etc.), which is why we didn't go the "generate code from XSD" route in this case. But if there is demand for that, we could do that as well.

As for ALTO: Besides #692 there is no code concerned with ALTO in OCR-D/core AFAIK. Or do you mean to create packages for generateDS-generated (and user-method-enhanced) Python APIs for all the formats in OCR-D?

> Besides, OCR-D-specific domain types should be preserved, too.

What do you mean by "domain types"?

M3ssman Dec 23, 2021

OK, I read further and found that there's already some work done to model the METS domain with OcrdMets, OcrdAgent and OcrdFile, so there's no need to go for METS at all. FYI, I was thinking in terms of https://github.com/ulb-sachsen-anhalt/mets-model
