Data Entity identification and handling in tools #400

Open
simleo opened this issue Feb 6, 2025 · 6 comments

Comments

@simleo
Contributor

simleo commented Feb 6, 2025

This issue is about the behavior of tools such as rocrate-validator and ro-crate-py when dealing with data entities, and whether the spec is clear enough in this respect. Their current approach is described in crs4/rocrate-validator#62 (comment) -- crs4/rocrate-validator#62 was opened by me after discussing the handling of files in nextflow-io/nf-prov#39, which is adding Workflow Run RO-Crate support to Nextflow, but then I discussed it with @kikkomep and stopped thinking that it was a bug in the validator.

In short, given this statement in the spec (it's the same in 1.1 and 1.2-DRAFT):

Where files and folders are represented as Data Entities in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the hasPart property.

should a tool consider File and Dataset entities as data entities and check that they are linked from the Root Data Entity (RDE)'s hasPart (this is what the validator does) or should it assume that everything that's linked from the RDE's hasPart is a data entity (this is what ro-crate-py does)? Or is it right for different tools with different roles to do different things?

I currently think that the validator is doing the right thing, since it's supposed to check for things like forgetting to link data entities from the RDE's hasPart. Regarding ro-crate-py, I'm starting to have doubts: in particular, following the "indirectly" bit above, it implements a recursive walk of hasPart properties, and I've recently noticed that this leads to a weird situation where SoftwareApplication entities are read as data entities because the ComputationalWorkflow is also a File and it links to the workflow's tools via hasPart as prescribed by the Workflow Run Crate profile (which follows Bioschemas in this respect). Maybe it should only follow hasPart from Dataset and not File?
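For illustration, the two readings can be sketched roughly as follows. This is a minimal Python sketch over a plain JSON-LD `@graph` list, not the actual code of rocrate-validator or ro-crate-py; the function names and the example graph (including `orphan.txt` and `#fastp`) are made up for the example.

```python
def entity_types(entity):
    """Return an entity's @type as a set (the value may be a string or a list)."""
    value = entity.get("@type", [])
    return {value} if isinstance(value, str) else set(value)


def part_ids(entity):
    """Return the @id values referenced by hasPart (a dict, a list, or absent)."""
    value = entity.get("hasPart", [])
    items = value if isinstance(value, list) else [value]
    return [item["@id"] for item in items if isinstance(item, dict) and "@id" in item]


def reachable_via_has_part(graph, root_id, follow_types=None):
    """Collect every @id reachable from the root through hasPart.

    If follow_types is given, only descend into entities with one of those types.
    """
    by_id = {entity["@id"]: entity for entity in graph}
    seen, stack = set(), [root_id]
    while stack:
        current = by_id.get(stack.pop())
        if current is None:
            continue  # referenced but not described in the graph
        if follow_types is not None and not (entity_types(current) & follow_types):
            continue  # listed as a part, but its own hasPart is not followed
        for part_id in part_ids(current):
            if part_id not in seen:
                seen.add(part_id)
                stack.append(part_id)
    return seen


def unlinked_files_and_datasets(graph, root_id):
    """Validator-style reading: File/Dataset entities not reachable from the root."""
    linked = reachable_via_has_part(graph, root_id)
    return sorted(
        entity["@id"]
        for entity in graph
        if entity["@id"] != root_id
        and entity_types(entity) & {"File", "Dataset"}
        and entity["@id"] not in linked
    )


# Hypothetical crate: the workflow is also a File and links to a tool via hasPart.
graph = [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "workflow.cwl"}, {"@id": "data/"}]},
    {"@id": "workflow.cwl",
     "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
     "hasPart": [{"@id": "#fastp"}]},
    {"@id": "#fastp", "@type": "SoftwareApplication"},
    {"@id": "data/", "@type": "Dataset", "hasPart": [{"@id": "data/reads.fq"}]},
    {"@id": "data/reads.fq", "@type": "File"},
    {"@id": "orphan.txt", "@type": "File"},
]

# Unrestricted recursive walk: the SoftwareApplication is picked up too.
everything = reachable_via_has_part(graph, "./")
# Following hasPart only from Dataset entities leaves the tool out.
dataset_only = reachable_via_has_part(graph, "./", follow_types={"Dataset"})
# The type-based check flags the File that was never linked.
missing = unlinked_files_and_datasets(graph, "./")
```

In this sketch the unrestricted walk returns `#fastp` alongside the files, while the walk restricted to `Dataset` does not; the type-based check reports `orphan.txt` as unlinked.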

Should the spec statement cited above be made clearer in order to help implementations? E.g.:

File and Dataset entities in the RO-Crate JSON-LD MUST be linked to from the Root Data Entity using the hasPart property. This link could be indirect, meaning that the Root Data Entity links to a Dataset whose hasPart, in turn, links to other File and Dataset entities.

Or is it better to leave it as it is, and allow different tools to do what's more appropriate for their purpose?

@elichad
Contributor

elichad commented Feb 6, 2025

Notes

My understanding of the current situation from a recent discussion with Stian is this:

  • all data entities must have @type of either File or Dataset
  • but not all entities with type File or Dataset are necessarily data entities
  • data entities must be reachable from the root data entity via hasPart relations (but this isn't really what defines them, only how you expect to find them)

Looking through the spec again now, I don't think we have a machine-actionable definition of what a data entity is. We have a section Contextual vs Data entities that says (same in both 1.1 and 1.2-DRAFT):

Data entities primarily exist in their own right as a file or directory (which may be in the RO-Crate Root directory or downloadable by URL).

And 1.2-DRAFT includes a counter-example for the assumption that all Files are data entities:

Files in the RO-Crate Root are not necessarily data entities – the RO-Crate Metadata Descriptor is a file in the RO-Crate Root, but is considered a Contextual Entity as it is describing the RO-Crate, rather than being part of it. On the other hand, the Root Data Entity is a data entity within its own metadata file.

In both versions we also have this, which shows that checking a web-based entity for downloadability also doesn't indicate whether it should be in hasPart. That makes things more difficult:

Some contextual entities can also be considered data entities – for instance the license property refers to a CreativeWork that can reasonably be downloaded, however a license document is not usually considered as part of research outputs and would therefore typically not be included in hasPart on the root data entity.

Thoughts

I think it would be nice to declare a fully machine-actionable definition of data entities. For example, I know @ptsefton had some ideas about using conformsTo on individual datasets (see this line in draft PR #390 https://github.com/ResearchObject/ro-crate/pull/388/files#diff-93edefce62dc56f998054f6c2d9eb87bc0d317af57df2b4bf243ba0f6f0c5400R138). I don't know if that's the specific approach we want to take, but it seems like we may need to add something if we want to avoid needing that element of human judgment about whether something is or isn't a data entity. (Though that said, I guess the human judgement part does get filled into hasPart. Just a validator can't check if the judgement is right!)

More tangibly, there is at least an option to partially improve this for entities which use a local/relative URI for their @id, as (I think) if those have @type of File or Dataset, they are definitely data entities.
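A rough sketch of that partial check, in Python. It assumes that "local/relative" means an `@id` with no scheme, no authority and a non-empty path, so fragment-only identifiers such as `#something` are deliberately excluded; that interpretation and the function names are assumptions for this example, not something the spec states.

```python
from urllib.parse import urlsplit


def entity_types(entity):
    """Return an entity's @type as a set (the value may be a string or a list)."""
    value = entity.get("@type", [])
    return {value} if isinstance(value, str) else set(value)


def is_relative_path_id(entity_id):
    """True for ids like 'data/reads.fq' or './'; False for URLs and '#fragment' ids."""
    parts = urlsplit(entity_id)
    return not parts.scheme and not parts.netloc and bool(parts.path)


def definitely_data_entity(entity):
    """Assumed heuristic: File/Dataset type plus a relative path @id."""
    return bool(entity_types(entity) & {"File", "Dataset"}) and is_relative_path_id(
        entity["@id"]
    )
```

Anything this returns `False` for is simply undecided: a web-based `File`, for example, still needs the human judgement discussed above.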

@ptsefton
Contributor

ptsefton commented Feb 6, 2025

Thanks @elichad this is indeed still confusing. I will attempt to clear it up.

Entities of type File are always data entities. In an Attached context, these MUST be present in the RO-Crate Root, and for web entities (Attached or Detached context) it would be up to the client to validate that they are there.

That statement "Files in the RO-Crate Root are not necessarily data entities" needs to be reworded - it invites confusion of files in the real world with entities in the @graph. We should say that the RO-Crate Metadata Descriptor is not considered a Data Entity and the RO-Crate Metadata File MUST (or SHOULD?) not reference itself as a File.

Files present in the RO-Crate Root of an Attached RO-Crate Package do not have to be represented as data entities. (ASIDE: I think this is covered elsewhere).
The RO-Crate Metadata Descriptor is a Contextual Entity which describes the RO-Crate as a whole and identifies the entry point for the RO-Crate. The use of the @id of ro-crate-metadata.json is a convention, and does not imply that the descriptor is a Data Entity.

The Root Data Entity in any RO-Crate is a Data Entity.

The statement about licenses can be reworded. I would take out the implication that a web-based license is a Data Entity when it does not have File as one of its @type values.

Some contextual entities may reference data in a similar way to Data Entities. For instance, the license property refers to a CreativeWork that can reasonably be downloaded; however, a license document is not usually considered part of research outputs and would not be included in hasPart on the root data entity.

{ example 1 .with a CC license ..}

If, however, a copy of the license is intended to be included in an Attached RO-Crate Package then it MUST:

  • Have an additional type of File.
  • Have an @id which is a relative URI which references a copy of the license that is present in the RO-Crate Root.
  • Indicate that the File is part of the package via hasPart. (An aside here -- this requirement to have hasPart seems to me like something we could drop in RO-Crate 2 (and maybe even 1.2) -- if data entities are well defined then why force people to have this extra step that can be quite error prone? We could just say if it's a File it's part of the RO-Crate.)

{ example 2 .with a CC license ..}

In a Detached RO-Crate Package, a license MAY be included in the packaged files by adding the type File, and optionally supplying a localPath property to indicate where the license may be stored in an RO-Crate Root if the package is downloaded.

{ example fragment - adding a localPath to example 1 above }

I think this would clear things up for File. Dataset is a bit more problematic, but I think a Dataset is considered to be a Data Entity in the following scenarios:

  • When it is the Root Data Entity (Attached or Detached)
  • In an Attached RO-Crate Package when it has a relative URI.

In all other cases, Dataset should be considered as a Contextual Entity. (NOTE: In cases where a Dataset has an absolute URI @id, resolving that to a list of Files or Datasets is complicated and out of scope for RO-Crate 1.2, though implementors may choose to build software that uses this approach).
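As a sketch of how a tool might apply the rules proposed in this comment (they are a proposal in this thread, not current spec text), in Python. The `attached` flag, the function names, and the reading of "relative URI" as "no scheme, no authority, non-empty path" are assumptions made for the example.

```python
from urllib.parse import urlsplit


def entity_types(entity):
    """Return an entity's @type as a set (the value may be a string or a list)."""
    value = entity.get("@type", [])
    return {value} if isinstance(value, str) else set(value)


def is_relative_path_id(entity_id):
    """True for ids like 'data/' or 'LICENSE.txt'; False for URLs and '#fragment' ids."""
    parts = urlsplit(entity_id)
    return not parts.scheme and not parts.netloc and bool(parts.path)


def is_data_entity(entity, root_id, attached=True):
    """Classify an entity under the proposed rules.

    - File: always a Data Entity.
    - Dataset: a Data Entity if it is the Root Data Entity, or if the crate
      is Attached and the @id is a relative URI.
    - Anything else: a Contextual Entity.
    """
    types = entity_types(entity)
    if "File" in types:
        return True
    if "Dataset" in types:
        if entity["@id"] == root_id:
            return True
        return attached and is_relative_path_id(entity["@id"])
    return False


# A web-based license stays contextual; a packaged copy typed as File does not.
web_license = {"@id": "https://creativecommons.org/licenses/by/4.0/", "@type": "CreativeWork"}
packaged_license = {"@id": "LICENSE.txt", "@type": ["File", "CreativeWork"]}
```

Under this reading, whether a license is part of the package is decided purely by whether it carries the File type, without consulting hasPart.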

How does this look @elichad and @stain? File -- ALWAYS a Data Entity. Dataset only a data entity in an attached context where we can reliably get a directory listing. (And I know it's a late entry, but how about dropping the MUST on hasPart, as it really just makes for a lot of extra checking and dealing with things that are in hasPart but not present as Data Entities etc.? This is an area where we could simplify software. Not to say you can't use it to show pathways to data from the root, but libraries and HTML previews etc. can provide a list of files and directories easily enough programmatically.)

I don't think we need conformsTo here @elichad.

@elichad
Contributor

elichad commented Feb 7, 2025

That statement "Files in the RO-Crate Root are not necessarily data entities" needs to be reworded - it invites confusion of files in the real world with entities in the @graph.

Indeed that is the mistake that I made 😅 thanks for pointing it out.

We should say that the RO-Crate Metadata Descriptor is not considered a Data Entity and the RO-Crate Metadata File MUST (or SHOULD?) not reference itself as a File.

Linking this to the recently opened #394 which was asking about this.

File -- ALWAYS a Data Entity. Dataset only a data entity in an attached context where we can reliably get a directory listing. (And I know it's a late entry, but how about dropping the MUST on hasPart as it really just makes for a lot of extra checking and dealing with things that are in hasPart but not present as Data Entities etc.

Interesting idea - are you suggesting dropping the hasPart requirement to a SHOULD or a MAY, or removing it completely? It does have its benefits for indicating nested structures in a crate.

Also, part of the challenge with the Nextflow PR linked at the top of thread is: when you are describing a workflow execution that generates some intermediate files, you might want to represent them in the metadata (as inputs/outputs for steps) but not include them in the crate (e.g. because they are large). So they're not data entities, but File is an intuitive choice for their type. We can advise alternative types to use in this case (e.g. CreativeWork or DigitalDocument), but it feels a bit strange.

@ptsefton
Contributor

ptsefton commented Feb 7, 2025

I am suggesting that hasPart could be optional. It is useful for showing hierarchy if you want it, but for basic packaging it's actually not necessary if we sort out our expectations about whether data needs to be present - it can be inferred that if there are File and Directory entities with URL or path IDs then they're part of the package. And as @simleo notes, following hasPart recursively has lots of issues and it's complicated for both producers and consumers.

Regarding files that don't (yet) exist, I agree that it makes sense for these to have @type File. I think there are two solutions worth considering:

  1. We add a property to indicate that the file does not or might not yet exist, something like dontValidate. I am not sure if there's an obvious one from a standard schema.
  2. For files that don't exist, give them a local id like #file/that/does/not/yet/exist.txt with a localPath to indicate the path -- this pattern would indicate that the File is not in the package but may come into existence at localPath.
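A small Python sketch of how option 2 might look to a consumer. The entity below reuses the id pattern from the comment; the `localPath` value, the second entity and the helper names are invented for illustration, and nothing here is defined by the spec.

```python
def entity_types(entity):
    """Return an entity's @type as a set (the value may be a string or a list)."""
    value = entity.get("@type", [])
    return {value} if isinstance(value, str) else set(value)


# Hypothetical graph fragment: one packaged file, one file that does not exist yet.
graph = [
    {"@id": "results/summary.csv", "@type": "File"},
    {
        "@id": "#file/that/does/not/yet/exist.txt",
        "@type": "File",
        "localPath": "that/does/not/yet/exist.txt",  # assumed value
    },
]


def files_expected_in_package(graph):
    """File entities a checker would look for: those without a '#' local id."""
    return [
        entity["@id"]
        for entity in graph
        if "File" in entity_types(entity) and not entity["@id"].startswith("#")
    ]


def possible_future_paths(graph):
    """Map '#'-identified File entities to the path where they may later appear."""
    return {
        entity["@id"]: entity["localPath"]
        for entity in graph
        if "File" in entity_types(entity)
        and entity["@id"].startswith("#")
        and "localPath" in entity
    }
```

The appeal of this pattern is that a checker needs no extra property: the shape of the `@id` alone tells it which files to look for on disk.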

@simleo
Contributor Author

simleo commented Feb 10, 2025

Also, part of the challenge with the Nextflow PR linked at the top of thread is: when you are describing a workflow execution that generates some intermediate files, you might want to represent them in the metadata (as inputs/outputs for steps) but not include them in the crate (e.g. because they are large). So they're not data entities, but File is an intuitive choice for their type. We can advise alternative types to use in this case (e.g. CreativeWork or DigitalDocument), but it feels a bit strange.

Intermediate files are added as CreativeWork in the Nextflow plugin. Their IDs look like:

#task/c6cb99a1e70c4b8f2eb83700dc0145d9/test_1.fastp.fastq.gz

BTW, they have released version 1.4.0 of the plugin with support for Workflow Run RO-Crate.

@ptsefton
Contributor

ptsefton commented Feb 10, 2025 via email


3 participants