Replies: 1 comment
-
Good question, Andy.
On the cause: copyright and legal restrictions are obviously part of it. But it is perhaps worth noting, for others picking up this thread who don't feel the problem so profoundly, that the file being analyzed often comes from a legacy collection, and for now it is a non-trivial exercise to access the software that created it in order to generate sanitized samples that can be shared more widely.
Additionally, there are different use-cases (or life-cycle stages) for this work, which perhaps affect the intentions behind any particular corpus. I started to write a little about this around the OPF corpus, but today I might summarize it as:
The shape and number of corpus items may differ in each of the life-cycle stages, or not. The number of stakeholders in each instance may vary too. Initial instincts,
-
A recurring issue in digital preservation is the difficulty of supplying example files because of copyright or other legal restrictions. This is very common in areas like format identification, where we need examples in order to refine our tools.
Ideally, we could find or make openly-licensed example files, but this is not always possible. Another tactic is to randomize content that can be identified as, e.g., text strings while leaving the rest of the file unchanged, but this risks leaking content that cannot be easily identified. A third tactic is to source files from large collections like web archives, but it's often hard to find close matches to realistic files from other contexts.
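To make the second tactic concrete, here is a minimal Python sketch of one way it could work. It assumes that "identifiable text" means runs of four or more printable ASCII bytes (roughly what the Unix `strings` tool reports). The function name `scramble_text_runs` and the `skip_prefix` parameter are hypothetical, made up for this illustration, and not part of any existing tool.

```python
import random
import re
import string

# Assumption: "text" means runs of 4+ printable ASCII bytes.
TEXT_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def scramble_text_runs(data: bytes, seed=None, skip_prefix: int = 0) -> bytes:
    """Replace letters and digits inside ASCII text runs with random ones.

    File length, byte offsets, spaces, punctuation and all non-text bytes
    are preserved. The first `skip_prefix` bytes are left untouched so a
    signature such as b"%PDF" can survive.
    """
    rng = random.Random(seed)

    def replace(match):
        out = bytearray()
        for byte in match.group():
            ch = chr(byte)
            if ch.isalpha():
                # Keep the case so the run still "looks" like the original.
                pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
                out.append(ord(rng.choice(pool)))
            elif ch.isdigit():
                out.append(ord(rng.choice(string.digits)))
            else:
                out.append(byte)  # spaces and punctuation stay as they are
        return bytes(out)

    return data[:skip_prefix] + TEXT_RUN.sub(replace, data[skip_prefix:])
```

This also illustrates the leak risk mentioned above: UTF-16 text, compressed streams, or short strings never match the ASCII pattern and would pass through unchanged. Keywords that identification tools rely on would be scrambled along with everything else unless they fall inside `skip_prefix`, so a real tool would need format-specific knowledge of what to keep.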
As an alternative, is it possible to set up something akin to a "members' library", where we support the storage of a set of example files that are shared with us on the understanding that they will not be distributed further, or used in any way that makes it possible to reverse-engineer the original files?
This is kind of happening already, with some example files being shared privately or in smaller groups. But I'm wondering if there's a way to make this easier and ensure tool developers/vendors can access the corpus under appropriate conditions?
For example, this could be a shared Google Drive folder, or it could be stored on some cloud somewhere and only made accessible via specific virtual machines. But maybe there are better, simpler ideas!