Is a "members' library" test corpus possible? #38

anjackson · 2022-10-05T21:18:47Z

anjackson
Oct 5, 2022
Maintainer

A regular issue in digital preservation is difficulties in supplying example files due to copyright or other legal restrictions. This is very common in things like format identification, where we need examples in order to refine our tools.

Ideally, we could find or make openly-licensed example files, but this is not always possible. Another tactic is to randomize content that can be identified as, e.g., text strings while leaving the rest of the file unchanged, but this risks leaking content that cannot be easily identified. A third tactic is to source files from large collections like web archives, but it's often hard to find close matches to realistic files from other contexts.
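To make the second tactic concrete, here is a minimal sketch in Python, assuming "text strings" means runs of four or more printable ASCII bytes (roughly what the Unix `strings` tool reports). The function name `scramble_strings` and its `skip` and `seed` parameters are made up for illustration; this is not an existing tool.

```python
import random
import re
import string

# Assumption: "identifiable text" = runs of 4+ printable ASCII bytes.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def scramble_strings(data: bytes, skip: int = 0, seed: int = 0) -> bytes:
    """Replace printable ASCII runs with random lowercase letters.

    The first `skip` bytes are left untouched, so a leading signature
    (magic number) can be preserved. Byte offsets and the overall file
    length do not change.
    """
    rng = random.Random(seed)
    letters = string.ascii_lowercase.encode("ascii")

    def replace(match):
        # Same length as the matched run, so offsets elsewhere stay valid.
        return bytes(rng.choice(letters) for _ in range(len(match.group())))

    return data[:skip] + PRINTABLE_RUN.sub(replace, data[skip:])


if __name__ == "__main__":
    sample = b"%PDF-1.4\n\x00\x01Confidential report\xff\xfe"
    print(scramble_strings(sample, skip=8))
```

The sketch also shows why the tactic is risky. Text stored as UTF-16, compressed streams, and runs shorter than four bytes all pass through untouched, so content can leak. In the other direction, internal markers that happen to be printable ASCII get scrambled too, which can destroy exactly the structure an identification tool needs.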

As an alternative, is it possible to set up something akin to a "members' library", where we support the storage of a set of example files that are shared with us on the understanding that they will not be distributed further, or used in any way that makes it possible to reverse-engineer the original files?

This is kind-of happening already, with some example files being shared privately/in smaller groups. But I'm wondering if there's a way to make this easier and ensure tool developers/vendors can access the corpus under appropriate conditions?

For example, this could be a shared Google Drive folder, or it could be stored on some cloud somewhere and only made accessible via specific virtual machines. But maybe there are better, simpler ideas!

ross-spencer · 2022-10-07T07:42:51Z

ross-spencer
Oct 7, 2022
Maintainer

Good question Andy.

> A regular issue in digital preservation is difficulties in supplying example files due to copyright or other legal restrictions. This is very common in things like format identification, where we need examples in order to refine our tools.

On the cause: copyright and legal restrictions are obviously part of it. But it is perhaps important to reflect, for others picking up this thread who don't feel the problem so profoundly, that often the file being analyzed comes from a legacy collection, and it is, for now, a non-trivial exercise to access the software that created it in order to generate sanitized samples that can be shared more widely.

> A regular issue in digital preservation is difficulties in supplying example files

Additionally, there are different use-cases (or life-cycle stages) for this work, which perhaps affect the intentions of any particular corpus. I started to write a little bit about it around the OPF corpus, but today I might summarize it as:

- A format isn't identified so needs an exemplar for identification, skeleton file, or file format specification.
- A format is identified, but tooling needs to be developed around it, e.g. metadata, object extraction, "migration".
- A format is identified and preservation workflows need to be tested, e.g. testable/repeatable components of PAR.
- A format presents unusual attributes, so existing tooling, signatures, workflows, etc. need to be adapted.

The shape and number of corpus items may differ in each of the life-cycle stages, or not. The number of stakeholders in each instance may vary too.

Initial instincts:

1. Members-only - I am not keen on gate-keeping. Perhaps there is something that can be achieved using licensing? And perhaps there is an organization willing to back up the legal aspects of that.
2. Federation - I believe this can still work, given a standard approach to follow, e.g. a GitHub/NextCloud/GoogleDrive is created using the following structure . Other tooling can be built on top of that, e.g. something TROVE-like for accessing file-format instances with certain properties.
3. Analysis of existing solutions - (key to understanding a new initiative) is it simply legal? What's missing or not working with the OPF format corpus, or Archivematica Sample-Data, Govdocs/Common-crawl or indeed, someone else's small collection of objects? Or even File Format Wiki, which largely just links out to examples and doesn't host them.
   - size/volume of corpus?
   - understandability?
   - metadata?
   - safety/security?
   - platform?
   - perceived need vs. actual need?
4. Exemplars? or rogues and villains? - One point I raised at iPRES was the need for guides on how to create and document useful digital objects/files for those working on file-formats in different contexts. These would be my exemplar objects, and are perhaps a different corpus/use-case from the one you're describing? Rogues/villains I would categorize as pre-existing objects coming from existing collections/accessions. Perhaps these are simpler, and don't require such rigor, as it's just a pick-n-mix?
5. The importance of identifiers - we have an open issue on the OPF corpus: Mapping of files to PUIDs openpreserve/format-corpus#19 - Dianne mentions PRONOM, Euan, Wikidata. Do we need to pick one identifier? Do we satisfy each person's desire? A corpus may capture a lot of desires, and would maybe need to be extensible in terms of the data it holds, but also, see 3.
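One way to picture the "extensible" option raised under identifiers is a per-file record that carries any number of identifier schemes side by side, rather than committing to one. The following is only a sketch under that assumption. The `CorpusEntry` class, its field names, and `find_by_identifier` are hypothetical, and the path, checksum, and Wikidata value are placeholders rather than real data.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusEntry:
    """One file in a (hypothetical) shared corpus manifest."""
    path: str
    sha256: str
    # Keyed by scheme name, so new schemes can be added without
    # changing the record shape, e.g. {"pronom": ..., "wikidata": ...}.
    identifiers: dict = field(default_factory=dict)
    notes: str = ""


def find_by_identifier(entries, scheme, value):
    """Return the entries that carry `value` under the given scheme."""
    return [e for e in entries if e.identifiers.get(scheme) == value]


if __name__ == "__main__":
    entries = [
        CorpusEntry(
            path="pdf/example-1.pdf",           # placeholder path
            sha256="<checksum goes here>",      # placeholder checksum
            identifiers={
                "pronom": "fmt/18",             # PRONOM PUID for PDF 1.4
                "wikidata": "<QID goes here>",  # placeholder, not a real QID
            },
        ),
        CorpusEntry(path="misc/unknown.bin", sha256="<checksum goes here>"),
    ]
    print([e.path for e in find_by_identifier(entries, "pronom", "fmt/18")])
```

A record like this does not settle which identifier is authoritative. It only shows that a corpus could hold several schemes at once, with entries that have no identifier yet (the first life-cycle stage above) sitting alongside well-described ones.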
