Support for WARC resource record types. #248
WARC resource records store "a resource retrieved over a network, but not necessarily as a direct result of a single HTTP request/response exchange." These records can omit HTTP headers, which makes the prior code problematic because it relied on HTTP headers being present. Here we add a simple method to resolve the attribute values needed to process records for linearization.
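For readers following along, here is a minimal sketch of the fallback idea, not the code in this PR. It assumes the headers are available as plain mappings; `ResolvedAttributes`, `resolve_attributes`, and `_split_content_type` are hypothetical names used only for illustration. The sketch prefers the HTTP `Content-Type` when HTTP headers exist (response records) and otherwise falls back to the WARC-level `Content-Type`, which for resource records describes the payload itself.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass
class ResolvedAttributes:
    """Hypothetical container for the values needed before linearization."""
    content_type: Optional[str]
    charset: Optional[str]


def _split_content_type(value: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a value like 'text/html; charset=UTF-8' into (media type, charset)."""
    if not value:
        return None, None
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower() or None
    charset = None
    for param in parts[1:]:
        key, _, val = param.partition("=")
        if key.strip().lower() == "charset" and val:
            charset = val.strip().strip('"').lower()
    return media_type, charset


def resolve_attributes(
    warc_headers: Mapping[str, str],
    http_headers: Optional[Mapping[str, str]] = None,
) -> ResolvedAttributes:
    """Resolve content type and charset whether or not HTTP headers exist."""
    # Header names are case-insensitive, so normalize the keys first.
    lowered_http = {k.lower(): v for k, v in (http_headers or {}).items()}
    lowered_warc = {k.lower(): v for k, v in warc_headers.items()}
    # Prefer the HTTP header; fall back to the WARC header when it is absent.
    raw = lowered_http.get("content-type") or lowered_warc.get("content-type")
    media_type, charset = _split_content_type(raw)
    return ResolvedAttributes(content_type=media_type, charset=charset)
```

With a resource record that carries only WARC headers, `resolve_attributes({"WARC-Type": "resource", "Content-Type": "text/html; charset=UTF-8"})` still yields a usable content type and charset instead of failing on the missing HTTP headers.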
looks good, not sure why python style is failing tho!
Some cleanup on the code organization to introduce support for resource WARC record types. Improved safety. Also linting.
@soldni Just introduced a small refactor that cleaned things up a bit. On the plus side, got to move just a little code out of processor. Will mark as draft pr next time 😜 -- thanks for quick review.
Looks good to me. Have you run this on a sample dataset?
Yes. Here's a sample record based on sponge data.
One problem is still outstanding: detecting non-character encodings. It may require further quality checks with warnings, exceptions, or remediation, which may make sense to have in the processor.
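A rough sketch of what such a check could look like, assuming "non-character encodings" means payloads whose bytes do not decode as text under the declared charset. `decode_or_warn` is a hypothetical helper, not part of this PR, and the UTF-8 default is an assumption.

```python
import warnings
from typing import Optional


def decode_or_warn(payload: bytes, charset: Optional[str] = None) -> Optional[str]:
    """Strictly decode a payload; warn and return None if it is not valid text."""
    encoding = charset or "utf-8"  # assumed default when no charset is declared
    try:
        return payload.decode(encoding, errors="strict")
    except (UnicodeDecodeError, LookupError) as exc:
        # LookupError covers charset labels that Python does not recognize.
        warnings.warn(f"payload is not decodable as {encoding!r}: {exc}")
        return None
```

Returning `None` leaves the choice between skipping, raising, or remediating to the caller, which is where the processor could come in.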