Data Entity identification and handling in tools #400
Comments
Notes My understanding of the current situation from a recent discussion with Stian is this:
Looking through the spec again now, I don't think we have a machine-actionable definition of what a data entity is. We have a section Contextual vs Data entities that says (same in both 1.1 and 1.2-DRAFT):
And 1.2-DRAFT includes a counter-example for the assumption that all Files are data entities:
In both versions we also have this, which shows that checking a web-based entity for downloadability also doesn't indicate whether it should be in
Thoughts I think it would be nice to declare a fully machine-actionable definition of data entities. For example, I know @ptsefton had some ideas about using More tangibly, there is at least an option to partially improve this for entities which use a local/relative URI for their |
Thanks @elichad this is indeed still confusing. I will attempt to clear it up. Entities of type That statement "Files in the RO-Crate Root are not necessarily data entities" needs to be reworded - it invites confusion of files in the real world with entities in the @graph. We should say that the RO-Crate Metadata Descriptor is not considered a Data Entity and the RO-Crate Metadata File MUST (or SHOULD?) not reference itself as a File.
The statement about licenses can be reworded. I would take out the implication that a web-based license is a Data Entity when it does not have
I think this would clear things up for File. Dataset is a bit more problematic, but I think a Dataset is considered to be a Data Entity in the following scenarios:
In all other cases, Dataset should be considered as a Contextual Entity. (NOTE: In cases where a Dataset has an absolute URI How does this look @elichad and @stain? File -- ALWAYS a Data Entity. Dataset only a data entity in an attached context where we can reliably get a directory listing. (And I know it's a late entry, but how about dropping the MUST on I don't think we need conformsTo here @elichad. |
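A rough sketch of the rule proposed above (File is always a data entity; Dataset only in an attached crate where a directory listing is available) might look like the following. The function names, the `attached` flag, and the "relative `@id`" test are illustrative assumptions, not spec text or the API of any existing tool, and the Root Data Entity is not treated specially here.

```python
from urllib.parse import urlparse


def types_of(entity):
    """Return the entity's @type values as a set (JSON-LD allows str or list)."""
    value = entity.get("@type", [])
    return {value} if isinstance(value, str) else set(value)


def is_relative_id(identifier):
    """Assumption: a path-like @id has no URI scheme and is not a '#' local id."""
    return not urlparse(identifier).scheme and not identifier.startswith("#")


def is_data_entity(entity, attached=True):
    """Hypothetical classifier for the proposal in this comment."""
    types = types_of(entity)
    if "File" in types:
        return True  # proposal: File is ALWAYS a data entity
    if "Dataset" in types:
        # proposal: only where a directory listing can be obtained reliably
        return attached and is_relative_id(entity["@id"])
    return False  # everything else is treated as contextual
```

Under these assumptions, `{"@id": "data/", "@type": "Dataset"}` counts as a data entity in an attached crate, while a Dataset with an absolute URI does not.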
Indeed that is the mistake that I made 😅 thanks for pointing it out.
Linking this to the recently opened #394 which was asking about this.
Interesting idea - are you suggesting dropping the Also, part of the challenge with the Nextflow PR linked at the top of thread is: when you are describing a workflow execution that generates some intermediate files, you might want to represent them in the metadata (as inputs/outputs for steps) but not include them in the crate (e.g. because they are large). So they're not data entities, but File is an intuitive choice for their type. We can advise alternative types to use in this case (e.g. CreativeWork or DigitalDocument), but it feels a bit strange.
I am suggesting that hasPart could be optional. It is useful for showing hierarchy if you want it, but for basic packaging it's actually not necessary if we sort out our expectations about whether data needs to be present - it can be inferred that if there are File and Directory entities with URL or path IDs then they're part of the package. And as @simleo notes, following hasPart recursively has lots of issues and it's complicated for both producers and consumers. Regarding files that don't (yet) exist, I agree that it makes sense that these are @type File. I think there are two solutions worth considering:
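Intermediate files are added as CreativeWork in the Nextflow plugin. Their IDs look like:
#task/c6cb99a1e70c4b8f2eb83700dc0145d9/test_1.fastp.fastq.gz
BTW, they have released version 1.4.0 of the plugin with support for Workflow Run RO-Crate.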
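As a small illustration of that convention, here is a sketch of a crate graph in which an intermediate file is typed CreativeWork with a `#task/...` identifier, so a type-based scan for File entities does not pick it up. Only the `#task/...` ID is taken from this thread; `results/report.html`, the `name` value, and the helper names are made up for the example.

```python
def as_list(value):
    """Normalise a JSON-LD value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hypothetical graph: one packaged file plus one intermediate file that is
# described in the metadata but not shipped with the crate.
graph = [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "results/report.html"}]},
    {"@id": "results/report.html", "@type": "File"},
    {"@id": "#task/c6cb99a1e70c4b8f2eb83700dc0145d9/test_1.fastp.fastq.gz",
     "@type": "CreativeWork",
     "name": "test_1.fastp.fastq.gz"},
]


def packaged_files(graph):
    """Return the @id of every entity whose @type includes File."""
    return [e["@id"] for e in graph if "File" in as_list(e.get("@type"))]
```

With this graph, `packaged_files(graph)` returns only `results/report.html`, so a tool that expects every File to be present in the package would not look for the intermediate file.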
So if there is already precedent then we could stick with File reserved for things that are part of the package and MUST be there rather than adding something for File entities that may not exist? What does everyone think about that? Makes it simpler.
We can look at further nuance in V2.
Quoted message from Simone Leo (Monday, February 10, 2025):
Also, part of the challenge with the Nextflow PR linked at the top of thread is: when you are describing a workflow execution that generates some intermediate files, you might want to represent them in the metadata (as inputs/outputs for steps) but not include them in the crate (e.g. because they are large). So they're not data entities, but File is an intuitive choice for their type. We can advise alternative types to use in this case (e.g. CreativeWork or DigitalDocument), but it feels a bit strange.
Intermediate files are added as CreativeWork in the Nextflow plugin. Their IDs look like:
#task/c6cb99a1e70c4b8f2eb83700dc0145d9/test_1.fastp.fastq.gz
BTW, they have released version 1.4.0 of the plugin (https://github.com/nextflow-io/nf-prov/releases/tag/1.4.0) with support for Workflow Run RO-Crate.
This issue is about the behavior of tools such as rocrate-validator and ro-crate-py when dealing with data entities, and whether the spec is clear enough in this respect. Their current approach is described in crs4/rocrate-validator#62 (comment) -- crs4/rocrate-validator#62 was opened by me after discussing the handling of files in nextflow-io/nf-prov#39, which is adding Workflow Run RO-Crate support to Nextflow, but then I discussed it with @kikkomep and stopped thinking that it was a bug in the validator.
In short, given this statement in the spec (it's the same in 1.1 and 1.2-DRAFT):
should a tool consider File and Dataset entities as data entities and check that they are linked from the Root Data Entity (RDE)'s hasPart (this is what the validator does) or should it assume that everything that's linked from the RDE's hasPart is a data entity (this is what ro-crate-py does)? Or is it right for different tools with different roles to do different things?
I currently think that the validator is doing the right thing, since it's supposed to check for things like forgetting to link data entities from the RDE's hasPart. Regarding ro-crate-py, I'm starting to have doubts: in particular, following the "indirectly" bit above, it implements a recursive walk of hasPart properties, and I've recently noticed that this leads to a weird situation where SoftwareApplication entities are read as data entities because the ComputationalWorkflow is also a File and it links to the workflow's tools via hasPart as prescribed by the Workflow Run Crate profile (which follows Bioschemas in this respect). Maybe it should only follow hasPart from Dataset and not File?
Should the spec statement cited above be made clearer in order to help implementations? E.g.:
Or is it better to leave it as it is, and allow different tools to do what's more appropriate for their purpose?
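To make the two behaviours concrete, here is a minimal sketch over a plain JSON-LD `@graph` list. It is not the actual code of rocrate-validator or ro-crate-py: all function names, the `follow_types` parameter, and the example entities (`workflow.ga`, `#fastp`, `orphan.txt`) are assumptions made for illustration.

```python
def as_list(value):
    """Normalise a JSON-LD value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def types_of(entity):
    return set(as_list(entity.get("@type")))


def walk_has_part(graph, root_id="./", follow_types=None):
    """Collect ids reachable from the root via hasPart.

    If follow_types is given, hasPart is only followed from entities that
    have one of those types (e.g. {"Dataset"}).
    """
    by_id = {e["@id"]: e for e in graph}
    seen, found, stack = {root_id}, set(), [root_id]
    while stack:
        current = by_id.get(stack.pop())
        if current is None:
            continue  # referenced id has no entity in the graph
        if follow_types is not None and not (types_of(current) & follow_types):
            continue  # do not descend from this entity
        for ref in as_list(current.get("hasPart")):
            target = ref.get("@id") if isinstance(ref, dict) else None
            if target is not None and target not in seen:
                seen.add(target)
                found.add(target)
                stack.append(target)
    return found


def unlinked_data_entities(graph, root_id="./"):
    """Type-based check: File/Dataset entities not reachable via hasPart."""
    linked = walk_has_part(graph, root_id)
    return {e["@id"] for e in graph
            if e["@id"] != root_id
            and types_of(e) & {"File", "Dataset"}
            and e["@id"] not in linked}


graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "workflow.ga"}]},
    {"@id": "workflow.ga",
     "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
     "hasPart": [{"@id": "#fastp"}]},
    {"@id": "#fastp", "@type": "SoftwareApplication"},
    {"@id": "orphan.txt", "@type": "File"},
]
```

On this graph, the unrestricted walk also returns `#fastp` (the SoftwareApplication situation described above), `walk_has_part(graph, follow_types={"Dataset"})` returns only `workflow.ga`, and `unlinked_data_entities(graph)` flags `orphan.txt` as a File that was never linked from the root.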