
Update m071 #807

Merged
merged 2 commits into from
Mar 25, 2025

Conversation

vr8hub
Contributor

@vr8hub vr8hub commented Mar 16, 2025

This should be merged after #806.

@acabal
Member

acabal commented Mar 16, 2025

If we're doing this, it would be very helpful to update `se create-draft` to process and prefill this with the correct name. Transcriptions credit PGDP in various ways, so we should normalize it to our formula right off the bat instead of annoying the producer with more small busywork. Can you do that as part of this PR?
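
For illustration only, a minimal sketch of the kind of normalization being asked for. It is not the code from this PR: `normalize_pgdp`, the regex, and the `CANONICAL_PGDP` value are all hypothetical, the list of variants is an assumption about how such credits are commonly written, and the canonical string is a placeholder to be replaced by whatever wording the manual's formula actually requires.

```python
import re

# Placeholder only: substitute the exact wording required by the manual.
CANONICAL_PGDP = "Distributed Proofreaders"

# Assumed variants: an optional leading "the", an optional "Online",
# "Proofreaders" or "Proofreading Team", and an optional trailing "at <URL>".
PGDP_RE = re.compile(
    r"(?:the\s+)?(?:Online\s+)?Distributed\s+Proofread(?:ers|ing)(?:\s+Team)?"
    r"(?:\s+at\s+https?://[\w./-]*[\w/])?",
    re.IGNORECASE,
)


def normalize_pgdp(credit, canonical=CANONICAL_PGDP):
    """Replace any recognized PGDP variant in a credit line with one canonical name."""
    return PGDP_RE.sub(canonical, credit)
```

Running the credit through a single substitution like this at draft-creation time would mean the producer never sees the raw variants.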

@vr8hub
Contributor Author

vr8hub commented Mar 16, 2025

Right, good call. I'll work on that.

@vr8hub
Contributor Author

vr8hub commented Mar 16, 2025

Done, I think. I tested it with a recent book (75629). The colophon/metadata are correct, but that's because they're right in the templates. The code to get the producers doesn't seem to work any more.*

Anyway, I believe this PR is OK, but you might look at the changes to create_draft to see if anything else needs to be done. I removed one thing that was changing "the Online" to "The Online," but there are a couple of other replaces there whose purpose I wasn't sure of.

*It appears that PG has changed some things about how their ebooks look, and maybe create-draft needs to be updated accordingly? I'm definitely not up on how all of that parsing works, but judging from the things it's searching for:

  1. They now identify transcribers with "Credits: ", not "Producers".
  2. They now identify the start with "START OF THE" not "START OF THIS".
  3. The license at the end is identified with "END OF THE PROJECT GUTENBERG"; the test currently uses non-uppercase text, and it doesn't appear to be case-insensitive (although I don't know what the "namespaces=namespaces" does).

As a result, running create-draft on the above ID didn't yield any transcribers, even though they're on the page, and it included all the start/end cruft.
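
To make the three observations above concrete, here is a rough sketch of marker matching that tolerates both wordings and ignores case. It is an assumption-laden illustration, not the actual create-draft parsing: it works on plain text with the standard `re` module rather than on the parsed HTML, `split_pg_text` and the three patterns are hypothetical names, and the older "Produced by" credit wording is an assumption about the previous format.

```python
import re

# Hypothetical patterns, not taken from create-draft.
# Accept both "START OF THE" and "START OF THIS", in any letter case.
START_RE = re.compile(r"START OF (?:THE|THIS) PROJECT GUTENBERG", re.IGNORECASE)
END_RE = re.compile(r"END OF (?:THE|THIS) PROJECT GUTENBERG", re.IGNORECASE)
# Accept the newer "Credits:" label and, as an assumption, an older "Produced by".
CREDITS_RE = re.compile(
    r"^\s*(?:Credits:|Produced by)\s*(.+)$", re.IGNORECASE | re.MULTILINE
)


def split_pg_text(text):
    """Return (credits, body): the credit line if found, and the text between the markers."""
    start = START_RE.search(text)
    end = END_RE.search(text)

    # Only look for credits in the header, i.e. before the start marker.
    header = text[: start.start()] if start else text
    credits_match = CREDITS_RE.search(header)
    credits = credits_match.group(1).strip() if credits_match else None

    if start is None or end is None:
        # Markers not recognized: return the text untouched.
        return credits, text

    # The body starts on the line after the start marker...
    newline = text.find("\n", start.end())
    body_start = len(text) if newline == -1 else newline + 1
    # ...and stops at the beginning of the line holding the end marker.
    body_end = text.rfind("\n", 0, end.start()) + 1
    return credits, text[body_start:body_end].strip()
```

If the markers are not recognized, the function returns the text unchanged, which matches the symptom described above: no transcribers and all of the start/end cruft left in place.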

I tried changing a couple of the things and rerunning; it didn't help, but I didn't try to debug it any further, as I wanted to get this finished before tackling something else.

@acabal acabal merged commit ad004d6 into standardebooks:master Mar 25, 2025
1 check passed
@acabal
Member

acabal commented Mar 25, 2025

Great, thanks. It does seem like we need to adapt the tool to their new credits format. I also found a few other errors in that process. I'll push fixes soon.

@vr8hub vr8hub deleted the update_m071 branch March 25, 2025 23:49
2 participants