
indexer should report which WARC file causes an error #411

@anarcat

Is your feature request related to a problem? Please describe.

There are a few cases where the indexer cannot correctly create a CDX file from a WARC file. Examples include #44 and #168, both of which have valid workarounds.

I would very much like to fix the underlying problem, but it only appeared after indexing many WARC files. I added about two dozen of those to the collection, and now indexing fails with this error, without any more information:

Invalid WARC record, first line:

Describe the solution you'd like

The indexer should catch that error and report which file triggered it so it can be fixed correctly.
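
To illustrate the kind of reporting meant here, below is a minimal sketch, not pywb's actual code. The names `index_all` and `index_one` are hypothetical; `index_one` stands in for whatever routine indexes a single WARC file and raises on an invalid record.

```python
import sys
from typing import Callable, Iterable, List, Tuple


def index_all(paths: Iterable[str],
              index_one: Callable[[str], None]) -> List[Tuple[str, Exception]]:
    """Index each WARC in turn and return (path, error) for every failure.

    `index_one` is a hypothetical per-file indexing function.
    """
    failures: List[Tuple[str, Exception]] = []
    for path in paths:
        try:
            index_one(path)
        except Exception as exc:  # broad on purpose: we only add file context
            # Name the offending file next to the original error message.
            print(f"error indexing {path}: {exc}", file=sys.stderr)
            failures.append((path, exc))
    return failures
```

With something like this, one bad file no longer hides behind a bare "Invalid WARC record" message, and the remaining files can still be processed.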

Describe alternatives you've considered

I've considered creating a new collection and running wb-manager add on each WARC file one by one so I could tell which one is triggering the problem. But add is designed to support adding multiple files at once, so it should also report errors accordingly.
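
For reference, the one-by-one workaround could be scripted roughly as follows. This is a sketch under assumptions: the scratch collection already exists, and `wb-manager add` either exits with a non-zero status or prints the "Invalid WARC record" message when indexing fails (I have not verified the exit status). The function name `find_bad_warcs` and the `run` parameter are made up for this example.

```python
import subprocess
from typing import Callable, List, Sequence


def find_bad_warcs(collection: str,
                   warcs: Sequence[str],
                   run: Callable = subprocess.run) -> List[str]:
    """Add WARC files to `collection` one at a time and return those that fail."""
    bad: List[str] = []
    for warc in warcs:
        # One wb-manager invocation per file, so a failure maps to one file.
        result = run(["wb-manager", "add", collection, warc],
                     capture_output=True, text=True)
        output = (result.stdout or "") + (result.stderr or "")
        # Assumption: failure shows up as a non-zero exit or as this message.
        if result.returncode != 0 or "Invalid WARC record" in output:
            bad.append(warc)
    return bad
```

It would be called with a throwaway collection name and the list of WARC paths, for example `find_bad_warcs("scratch", paths)`.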

Additional context

This is one of many issues found when using pywb with larger collections; see #408 and #410.
