Support for WARC resource record types. #248

Merged
merged 2 commits into from
Mar 4, 2025

no0p
Contributor

@no0p no0p commented Mar 4, 2025

WARC resource records store "a resource retrieved over a network, but not necessarily as a direct result of a single HTTP request/response exchange."

These records can omit HTTP headers, which breaks the prior code that relied on HTTP headers being present. Here we add a simple method to resolve the attribute values necessary to process records for linearization.
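
For readers unfamiliar with the problem, here is a minimal sketch of the fallback idea, not the code merged in this PR. The `Record` class and the `resolve_attribute` and `resolve_content_type` helpers are hypothetical names made up for illustration. It relies on one fact from the WARC format: a `response` record's WARC-level `Content-Type` describes the HTTP message itself, while a `resource` record's WARC-level `Content-Type` describes the payload directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a parsed WARC record (not the project's class)."""
    rec_type: str                        # e.g. "response" or "resource"
    warc_headers: dict                   # WARC-level headers, always present
    http_headers: Optional[dict] = None  # None when the record has no HTTP block


def resolve_attribute(record, http_name, warc_name, default=None):
    """Prefer the HTTP header, fall back to the WARC header, then to a default.

    Lookups here are case-sensitive; a real parser would normalise header names.
    """
    if record.http_headers:
        value = record.http_headers.get(http_name)
        if value is not None:
            return value
    return record.warc_headers.get(warc_name, default)


def resolve_content_type(record):
    """Return the bare media type of the payload, e.g. "text/html"."""
    raw = resolve_attribute(record, "Content-Type", "Content-Type", default="")
    # Drop parameters such as "; charset=utf-8" and normalise case.
    return raw.split(";", 1)[0].strip().lower()


response = Record(
    rec_type="response",
    warc_headers={"Content-Type": "application/http; msgtype=response"},
    http_headers={"Content-Type": "text/html; charset=UTF-8"},
)
resource = Record(
    rec_type="resource",
    warc_headers={"Content-Type": "text/plain"},
)

print(resolve_content_type(response))
print(resolve_content_type(resource))
```

The point of routing every lookup through one helper is that callers no longer need to know whether a record carries an HTTP block.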

@soldni
Member

soldni commented Mar 4, 2025

looks good, not sure why python style is failing tho!

Some cleanup on the code organization to introduce support
for resource WARC record types. Improved safety.

Also linting.
@no0p
Contributor Author

no0p commented Mar 4, 2025

@soldni Just introduced a small refactor that cleaned things up a bit.

On the plus side, got to move just a little code out of processor.

Will mark as draft pr next time 😜 -- thanks for quick review.

@no0p no0p requested a review from Whattabatt March 4, 2025 17:24
Contributor

@Whattabatt Whattabatt left a comment

Looks good to me. Have you run this on a sample dataset?

@no0p
Contributor Author

no0p commented Mar 4, 2025

> Looks good to me. Have you run this on a sample dataset?

Yes.

Here's a sample record based on sponge data.

{"id":"vyawcbpdjbr5wifme7rf62kidrbbamvo","text":"The contents you find here are totally obsolete so do not use this website to obtain a solution for your issues: the indications you get from here will not be accurate and may lead to data loss or other mayor problems\n\nSearch is not Working in Logical doc With openJDK.\n\nWe tried to make LogicalDOC as intuitive as possible, but an advice is always welcome.\n\nModerator: car031\n\npavan kalburgi\nPosts: 1\nJoined: Wed Jul 14, 2021 8:43 am\n\nSearch is not Working in Logical doc With openJDK.\n\nWed Jul 14, 2021 9:46 am\n\nWhen I am trying to search on logical doc by expression it is giving 0 results to me. I have uploaded the .txt file and trying search the content inside of that .txt file. When see the logs its saying,\njava.lang.NoClassDefFoundError: Could not initialize class com.ibm.icu.text.SimpleDateFormat.\nI have the attached complete\ndmsLog.7z\nlog file.\n\nNote : This issue is arising when my JAVA_HOME is set to Open JDK.\nWith Oracle JDK it is working fine.\ndmsLog.7z\n\nReturn to “Usage”\n\nWho is online\n\nUsers browsing this forum: No registered users and 10 guests\n\n× Attention! This forum has been dismissed and will be soon removed. The contents you find here are totally obsolete so do not use this website to obtain a solution for your issues: the indications you get from here will not be accurate and may lead to data loss or other mayor problems.","source":"pantry","created":"2025-03-03T19:01:53.572Z","added":"2025-03-03T19:01:53.565Z","version":"v0","metadata":{"warc_url":"https://forums.logicaldoc.com/viewtopic.php?f=8&p=17218&sid=b6cbeb61aea736b6a832aa5af58a053c&t=11267","url":"forums.logicaldoc.com/viewtopic.php?f=8&p=17218&sid=b6cbeb61aea736b6a832aa5af58a053c&t=11267","warc_date":"2025-03-03T19:01:53.573Z","warc_filename":"","content_type":"text","uncompressed_offset":0},"attributes":{}}

One problem remains outstanding: detecting non-character encodings. It may require further quality checks with warnings, exceptions, or remediation, which may make sense to have in the processor.
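
As a starting point for that discussion, one possible quality check could look like the sketch below. This is an assumption about what such a check might be, not something in this PR. The `looks_like_text` function and its `max_bad_ratio` threshold are hypothetical, and the heuristic only accepts UTF-8, so legitimate text in other encodings (Latin-1, for example) may be rejected.

```python
import unicodedata


def looks_like_text(payload: bytes, max_bad_ratio: float = 0.01) -> bool:
    """Heuristic: the payload decodes as UTF-8 and has few control characters."""
    if not payload:
        return False
    # NUL bytes are a strong hint of binary content.
    if b"\x00" in payload:
        return False
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Count control characters other than common whitespace.
    bad = sum(
        1
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t"
    )
    return bad / len(text) <= max_bad_ratio


print(looks_like_text(b"plain text\nwith two lines"))
print(looks_like_text(b"\x89PNG\r\n\x1a\n\x00\x00"))
```

Whether a failed check should produce a warning, raise an exception, or trigger remediation is left open, as in the comment above.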

@no0p no0p merged commit ca1727f into main Mar 4, 2025
14 checks passed
@no0p no0p deleted the warc_resource_support branch March 4, 2025 17:55
3 participants