Skip to content

Conversation

kba
Copy link
Contributor

@kba kba commented Apr 11, 2025

Ports the OCR-D processor to the v3 API and updates Dockerfile and Makefile with our current best practices.

I had to pin uniseg and rapidfuzz to older versions because of breaking changes in rapidfuzz>=3 and uniseg>=0.9. Is there a development branch that updates those APIs I could merge/cherry-pick?

Also includes https://github.com/bertsky/dinglehopper/tree/normalized-cer which uses normalized_distance and gets rid of the potential Infinity results AFAICT.

@mikegerber
Copy link
Member

mikegerber commented Apr 15, 2025

Also includes https://github.com/bertsky/dinglehopper/tree/normalized-cer which uses normalized_distance and gets rid of the potential Infinity results AFAICT.

TBH It would really prefer to have this in a separate PR...

@mikegerber
Copy link
Member

I had to pin uniseg and rapidfuzz to older versions because of breaking changes in rapidfuzz>=3 and uniseg>=0.9. Is there a development branch that updates those APIs I could merge/cherry-pick?

I'll look into it and update this PR accordingly.

@bertsky
Copy link
Contributor

bertsky commented Apr 15, 2025

TBH It would really prefer to have this in a separate PR...

#129

@mikegerber
Copy link
Member

TBH It would really prefer to have this in a separate PR...

#129

Thanks a lot :) I'll dissect this PR first and then think about that change!

@mikegerber
Copy link
Member

mikegerber commented Apr 15, 2025

I had to pin uniseg and rapidfuzz to older versions because of breaking changes in rapidfuzz>=3 and uniseg>=0.9. Is there a development branch that updates those APIs I could merge/cherry-pick?

I'll look into it and update this PR accordingly.

Downgrading to uniseg 0.9.0 fixed the issues with master, and 0.9.1 breaks ... I've opened #130.

Do you remember what the problem with rapidfuzz 3 was? I don't see it here with RapidFuzz 3.13.0.

kba and others added 2 commits April 16, 2025 14:45
Breakage with the newest uniseg API was fixed in master.

Can't see any issue with rapidfuzz, so removing that pin, too.
@mikegerber
Copy link
Member

I had to pin uniseg and rapidfuzz to older versions because of breaking changes in rapidfuzz>=3 and uniseg>=0.9. Is there a development branch that updates those APIs I could merge/cherry-pick?

I'll look into it and update this PR accordingly.

Downgrading to uniseg 0.9.0 fixed the issues with master, and 0.9.1 breaks ... I've opened #130.

Do you remember what the problem with rapidfuzz 3 was? I don't see it here with RapidFuzz 3.13.0.

I've fixed the uniseg issues in #132 and rebased this PR accordingly. Can't see any issue with RapidFuzz, so I removed that pin.

@mikegerber
Copy link
Member

TBH I'm not terribly happy with this PR. Besides updating to OCR-D's V3 API, it

a. does all kind of things with the Dockerfile and the Makefile. Mind you, "(y)our standards" do not necessarily apply to non-OCR-D projects)

b. as a drive-by shooting, changes the definition of CER and just redefines distance completely (without even adapting the docstring)

These changes may very well be great and awesome, but are definitely 1. not part of an update to V3 API and 2. definitely worth some separate discussion.

I'll cut out part b and review it in #129. I'll review (and fix) part a and may merge it with this PR.

@mikegerber mikegerber self-assigned this Apr 17, 2025
@mikegerber
Copy link
Member

I've rebased to remove the changes in #129, that needs more discussion.

@mikegerber mikegerber mentioned this pull request Apr 17, 2025
4 tasks
@mikegerber
Copy link
Member

With #129 separate, this is ready to merge.

Not a big fan of a. the indirection of Makefile (I mostly prefer a direct pip install over make install) b. having these changes in the somewhat unrelated changes for V3 API, but I'll keep it in.

I have not tested the Docker changes (also unrelated to V3 API AFAICT) at all.

@mikegerber mikegerber merged commit b1c109b into qurator-spk:master Apr 17, 2025
10 checks passed
@bertsky
Copy link
Contributor

bertsky commented Apr 17, 2025

a. does all kind of things with the Dockerfile and the Makefile. Mind you, "(y)our standards" do not necessarily apply to non-OCR-D projects)

If our Dockerfile conventions are too much of a stretch, I recommend we split off ocrd/dinglehopper from qurator/dinglehopper (or whatever you prefer), and also separate Dockerfiles (or two-staged).

The Makefile is just for convenience, and things (like testdata dependencies) that are easier to maintain than in pyproject.toml alone. It's up to you to use other solutions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants