Do not assume that an entity with type File is a data entity #62

Open
simleo opened this issue Jan 17, 2025 · 3 comments

@simleo
Member

simleo commented Jan 17, 2025

We currently have this check for entities whose type is usually used for data entities:

       [   MUST 13.1    ]  Data Entity MUST be directly referenced:            
                           Check if the Data Entity                            
                           is linked, either                                   
                           directly of inderectly,                             
                           to the Root Data Entity                             
                           using the hasPart (as                               
                           defined in schema.org)                              
                           property"                                           

I.e. we assume that entities of a certain type are data entities even if they are not in the root data entity's hasPart and then require that they are listed in the root data entity's hasPart. This is causing problems in nextflow-io/nf-prov#39 (implementation of WRROC for Nextflow), see in particular nextflow-io/nf-prov#39 (comment).

@kikkomep
Member

The specifications state that:

Data Entities representing files must have “File” as the value for @type. “File” is an RO-Crate alias for http://schema.org/MediaObject.

However, in the comments you mentioned:

… entities of type File are not necessarily Data Entities.

If both statements hold true, the File type can be used to denote both a File Data Entity and an entity that is not a Data Entity, making it impossible to uniquely represent a File Data Entity and distinguish it from generic File entities.

This lack of precise terminology to denote data entities complicates the validation of other requirements, such as the requirement MUST 13.1 you mentioned above, which refers to the specs statement:

When files and folders are represented as Data Entities in the RO-Crate JSON-LD, they must be linked, either directly or indirectly, to the Root Data Entity using the hasPart property.

Without a clear and unambiguous way to represent a Data Entity, it becomes impossible to automatically verify that all data entities are referenced from the Root Data Entity.

The assumption underlying the check implementation you mentioned in the issue is simply to mitigate this ambiguity and make the specification requirement automatically verifiable.

The only action we can take to address the issue you’ve raised is to disregard this (and potentially other) unverifiable requirement(s) until more precise terminology is introduced to accurately represent File Data Entities and distinguish them from generic File entities.

@simleo
Member Author

simleo commented Jan 30, 2025

After discussing this extensively with Marco: there are basically two approaches, given the following statement in the spec:

Where files and folders are represented as Data Entities in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the hasPart property.

  1. The current approach used by the validator, which treats File and Dataset instances as data entities and the above sentence as a requirement they must satisfy, reporting an error if they do not. The advantage is that if the crate author did fail to update the root data entity's hasPart to include all intended data entities, this is flagged as an error, allowing the author to fix the crate.

  2. The approach used by ro-crate-py, which, when reading an RO-Crate, considers as data entities only the entities linked to from the root data entity via hasPart, while all other entities are considered contextual entities. This effectively treats the spec section quoted above more as a definition of what data entities are (i.e., "an entity is a data entity if it is linked to from the root data entity's hasPart") than as a requirement. The validator could be modified to adopt this approach as well, but it would then fail to flag misplaced entities as described above.
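
To make the difference concrete, here is a minimal sketch of the two approaches applied to the same `@graph`. It is not the validator's or ro-crate-py's actual code: the function names, the example graph, and the choice to follow hasPart transitively are all assumptions made for illustration.

```python
DATA_TYPES = {"File", "Dataset"}


def as_list(value):
    """Normalise a JSON-LD value that may be missing, single, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def reachable_via_has_part(graph, root_id="./"):
    """Ids linked from the root via hasPart, directly or indirectly."""
    by_id = {e["@id"]: e for e in graph}
    seen = {root_id}
    stack = [root_id]
    while stack:
        current = by_id.get(stack.pop(), {})
        for ref in as_list(current.get("hasPart")):
            if ref["@id"] not in seen:
                seen.add(ref["@id"])
                stack.append(ref["@id"])
    return seen - {root_id}


def type_based_errors(graph, root_id="./"):
    """Approach 1: File/Dataset entities are assumed to be data entities,
    so any of them not reachable via hasPart is reported as an error."""
    linked = reachable_via_has_part(graph, root_id)
    return sorted(
        e["@id"]
        for e in graph
        if e["@id"] != root_id
        and DATA_TYPES & set(as_list(e.get("@type")))
        and e["@id"] not in linked
    )


def link_based_split(graph, root_id="./"):
    """Approach 2: data entities are exactly those reachable via hasPart;
    everything else (bar root and metadata descriptor) is contextual."""
    linked = reachable_via_has_part(graph, root_id)
    skip = {root_id, "ro-crate-metadata.json"}
    ids = [e["@id"] for e in graph]
    data = sorted(i for i in ids if i in linked)
    contextual = sorted(i for i in ids if i not in linked and i not in skip)
    return data, contextual


# Hypothetical crate: work/tmp.txt is typed File but only referenced by an action.
graph = [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "data/"}]},
    {"@id": "data/", "@type": "Dataset", "hasPart": [{"@id": "data/a.txt"}]},
    {"@id": "data/a.txt", "@type": "File"},
    {"@id": "work/tmp.txt", "@type": "File"},
    {"@id": "#run", "@type": "CreateAction", "object": [{"@id": "work/tmp.txt"}]},
]

print(type_based_errors(graph))   # approach 1 reports work/tmp.txt
print(link_based_split(graph))    # approach 2 files work/tmp.txt as contextual
```

With this graph, approach 1 reports `work/tmp.txt` as an error, while approach 2 silently classifies it as contextual, which is the trade-off described in the two items above.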

@elichad
Contributor

elichad commented Feb 6, 2025

One approach to partially resolve this could be to validate that entities appear in hasPart if both the following are true:

  • @type includes File or Dataset
  • @id is a local file path

This would obviously omit web-based data entities from the check, but (in 1.1) that is probably necessary, as there is nothing other than hasPart to go on to determine whether such an entity is data or contextual (it could differ between crates depending on context).
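
A rough sketch of that filter, in case it helps the discussion. The helper names are hypothetical, and "local file path" is interpreted here as "a relative reference with no URI scheme, no host, and not a `#fragment` id", which is an assumption rather than something the spec or the validator defines.

```python
from urllib.parse import urlsplit

DATA_TYPES = {"File", "Dataset"}


def is_local_path(entity_id):
    """Assumed definition: relative reference, not a '#local' identifier."""
    if not entity_id or entity_id.startswith("#"):
        return False
    parts = urlsplit(entity_id)
    return not parts.scheme and not parts.netloc


def must_appear_in_has_part(entity, root_id="./"):
    """True if the entity should be subject to the hasPart check:
    @type includes File or Dataset AND @id looks like a local path."""
    types = entity.get("@type", [])
    if isinstance(types, str):
        types = [types]
    entity_id = entity.get("@id", "")
    return (
        entity_id != root_id  # the root itself is never listed in hasPart
        and bool(DATA_TYPES & set(types))
        and is_local_path(entity_id)
    )


print(must_appear_in_has_part({"@id": "data/a.txt", "@type": "File"}))
print(must_appear_in_has_part({"@id": "https://example.org/x.csv", "@type": "File"}))
```

The first call selects the entity for the check; the second skips it because the `@id` is an absolute URL, so web-based entities are left to the crate author's use of hasPart.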
