Add Workflow Run RO-crate format #39

Merged: 58 commits, Feb 6, 2025

Conversation

Contributor

@famosab famosab commented Dec 18, 2024

We worked on a first version of the plugin which is able to render valid RO-crates for any workflow run.

Happy to receive feedback to get this finished up :)

Continues #19 and #33.

famosab and others added 16 commits November 18, 2024 15:45
add encodingFormat for nextflow.config
feat: add wrroc to valid formats
* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>
* feat: add README to create

* feat: ignore vscode

* fix: make getIntermediateOutputFiles work again (#18) (#19)

* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>

* feat: add README to json

* feat: check first if readme exists

* Add readme to hasPart

Signed-off-by: fbartusch <[email protected]>

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>
* Add getEncodingFormat function that return the encoding format for a file
* handle YAML files manually

Signed-off-by: fbartusch <[email protected]>
* main workflow complies (more or less) with ComputationalWorkflow profile version 1.0
  (if set in manifest add license, url, version, description, ...)
* Correct value vor ActionStatus

Signed-off-by: fbartusch <[email protected]>
* start with metaYaml imports

* merge dev-wrroc into metaYaml (#23)

* add encodingFormat for nextflow.config

* add encodingFormat for main.nf

* feat: add wrroc to valid formats

* fix: make getIntermediateOutputFiles work again (#18)

* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>

* feat: add README to crate (#14)

* feat: add README to create

* feat: ignore vscode

* fix: make getIntermediateOutputFiles work again (#18) (#19)

* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>

* feat: add README to json

* feat: check first if readme exists

* Add readme to hasPart

Signed-off-by: fbartusch <[email protected]>

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>

* WIP

* only add from meta if meta exists

* remove usage from ext args

* add module name to id

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>
@famosab

This comment was marked as outdated.

@simleo

This comment was marked as outdated.

@famosab
Contributor Author

famosab commented Dec 18, 2024

ro-crate-metadata.json
This was created using the plugin and this pipeline: https://github.com/famosab/wrrocmetatest

@bentsherman

This comment was marked as outdated.

@bentsherman bentsherman changed the base branch from master to workflow-run-crate December 18, 2024 15:46
@bentsherman bentsherman changed the base branch from workflow-run-crate to master December 18, 2024 15:47
@simleo

This comment was marked as outdated.

@famosab

This comment was marked as outdated.

@simleo

This comment was marked as outdated.

@famosab

This comment was marked as outdated.

@famosab

This comment was marked as outdated.

@simleo

This comment was marked as resolved.

@famosab

This comment was marked as outdated.

@simleo

This comment was marked as resolved.

@fbartusch
Contributor

I'm currently running tests against nf-core pipelines with some scripts I wrote for testing plugins.
It will take some time, and I hope to have some results this afternoon.

@simleo

simleo commented Jan 23, 2025

I've tried again (with the current version of the plugin) to run Famke's pipeline locally:

nextflow run main.nf -profile docker --input testsheet.csv --outdir results -c testdata.config

with local files in testsheet.csv:

sample,fastq_1,fastq_2
test,/home/simleo/repos/wrrocmetatest/read1.fq.gz,/home/simleo/repos/wrrocmetatest/read2.fq.gz

The resulting crate is not even readable by ro-crate-py because it uses absolute filesystem paths as ids (/home/simleo/repos/wrrocmetatest/read{1,2}.fq.gz). This violates the spec in File Data Entity:

@id MUST be either a URI Path relative to the RO Crate root, or an absolute URI

I see three possible ways to fix this:

  1. Copy the files into the crate and add them with their relative path
  2. Prepend file:// to the ids, making them absolute URIs
  3. Add them as CreativeWork as done for intermediates

However, with options 2 and 3 the crate consumer has no way to reconstruct the two input files.
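
Options 1 and 2 can be sketched as follows. This is only an illustration in Python, not the plugin's Groovy code; the helper name `crate_id` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def crate_id(file_path: str, crate_root: str) -> str:
    """Return a spec-conforming @id for a local file (hypothetical helper).

    A file under the crate root gets a relative URI path (option 1, assuming
    it was copied there); anything else becomes an absolute file:// URI
    (option 2).
    """
    path = PurePosixPath(file_path)
    root = PurePosixPath(crate_root)
    try:
        return str(path.relative_to(root))
    except ValueError:
        # Not under the crate root: fall back to an absolute URI.
        return path.as_uri()


print(crate_id("/data/crate/results/read1.fq.gz", "/data/crate"))
print(crate_id("/home/user/read1.fq.gz", "/data/crate"))
```

The second call yields a `file://` URI, which is valid as an id but still points to a location only the original machine can resolve.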

@fbartusch
Contributor

Most of the nf-core pipelines are currently failing with the plugin :(
nextflow.log says:

Jan-23 17:09:49.075 [main] DEBUG nextflow.Session - Failed to invoke observer completion handler: nextflow.prov.ProvObserver@6965f207
java.lang.NullPointerException: null
        at nextflow.prov.WrrocRenderer.getModuleId(WrrocRenderer.groovy:774)
        at nextflow.prov.WrrocRenderer.access$2(WrrocRenderer.groovy)
        at nextflow.prov.WrrocRenderer$_render_closure15.doCall(WrrocRenderer.groovy:331)
        at nextflow.prov.WrrocRenderer$_render_closure15.call(WrrocRenderer.groovy)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3661)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3646)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3692)
        at nextflow.prov.WrrocRenderer.render(WrrocRenderer.groovy:325)
        at nextflow.prov.ProvObserver.onFlowComplete(ProvObserver.groovy:121)
        at nextflow.Session.notifyFlowComplete(Session.groovy:1155)
        at nextflow.Session.shutdown0(Session.groovy:749)
        at nextflow.Session.destroy(Session.groovy:694)
        at nextflow.script.ScriptRunner.shutdown(ScriptRunner.groovy:260)
        at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:146)
        at nextflow.cli.CmdRun.run(CmdRun.groovy:376)
        at nextflow.cli.Launcher.run(Launcher.groovy:503)
        at nextflow.cli.Launcher.main(Launcher.groovy:658)

I'm now checking why it fails, using the nf-core bamtofastq pipeline, as it is the fastest (and simplest?) pipeline that fails. That should make debugging easier.

@bentsherman
Member

@simleo this is why I recommend using the original HTTP URLs:

sample,fastq_1,fastq_2
test,https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/denbi-mg-course/read1.fq.gz,https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/denbi-mg-course/read2.fq.gz

But it is unavoidable that some users will be using local input files and we'll need to handle that gracefully. As a first iteration I'm inclined to warn about such input files and maybe make them CreativeWork if they aren't included in the crate. I will try a few things

@bentsherman
Member

@simleo I ended up taking the absolute URI approach. That made the resulting crate valid. We can encourage the use of remote URIs as a best practice.

In summary, only input files that are (1) specified directly by a param, (2) local, and (3) not a directory, will be copied into the crate. All of these restrictions are designed to prevent explosive data transfers from directories, remote data, and file globs.
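
The three conditions can be read as a single predicate. The sketch below is a Python illustration under stated assumptions, not the plugin's actual logic: the function name `should_copy_into_crate`, the `"://"` test for remote values, and the example values are all hypothetical.

```python
import os
import tempfile


def should_copy_into_crate(value: str, param_values: set) -> bool:
    """Hypothetical predicate mirroring the three conditions above."""
    is_param = value in param_values      # (1) specified directly by a param
    is_local = "://" not in value         # (2) no remote scheme such as https://
    is_dir = os.path.isdir(value)         # (3) directories are excluded
    return is_param and is_local and not is_dir


# Demo with a temporary file and directory standing in for real inputs.
with tempfile.TemporaryDirectory() as tmp:
    sheet = os.path.join(tmp, "testsheet.csv")
    open(sheet, "w").close()
    remote = "https://example.org/read1.fq.gz"
    params = {sheet, tmp, remote}
    results = {
        "local file param": should_copy_into_crate(sheet, params),
        "directory param": should_copy_into_crate(tmp, params),
        "remote param": should_copy_into_crate(remote, params),
        "file not given as param": should_copy_into_crate(sheet, set()),
    }

print(results)
```

Only the first case is copied; the other three are left where they are, which is what keeps directories, remote data, and globbed files from being transferred.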

@fbartusch I ran bamtofastq with test profile and it succeeded. Let me know how the rest of your tests go with the latest revision

@fbartusch
Contributor

@bentsherman bamtofastq does indeed look good and the validator is happy. I'm now running the tests for the other nf-core pipelines.

@fbartusch
Contributor

@bentsherman Only one pipeline out of 42 I ran fails because of the plugin: demultiplex revision 1.5.1

Jan-30 10:16:18.415 [main] DEBUG nextflow.Session - Failed to invoke observer completion handler: nextflow.prov.ProvObserver@6e65fc8b
java.lang.NullPointerException: null
        at nextflow.prov.WrrocRenderer.getTaskOutputName(WrrocRenderer.groovy:870)
        at nextflow.prov.WrrocRenderer.access$9(WrrocRenderer.groovy)
        at nextflow.prov.WrrocRenderer$_render_closure22.doCall(WrrocRenderer.groovy:442)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:279)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
        at groovy.lang.Closure.call(Closure.java:433)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.callClosureForMapEntry(DefaultGroovyMethods.java:6061)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3985)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:4002)
        at nextflow.prov.WrrocRenderer.render(WrrocRenderer.groovy:439)
        at nextflow.prov.ProvObserver.onFlowComplete(ProvObserver.groovy:121)
        at nextflow.Session.notifyFlowComplete(Session.groovy:1155)
        at nextflow.Session.shutdown0(Session.groovy:749)
        at nextflow.Session.destroy(Session.groovy:694)
        at nextflow.script.ScriptRunner.shutdown(ScriptRunner.groovy:260)
        at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:146)
        at nextflow.cli.CmdRun.run(CmdRun.groovy:376)
        at nextflow.cli.Launcher.run(Launcher.groovy:503)
        at nextflow.cli.Launcher.main(Launcher.groovy:658)

But all others didn't pass the validator (I used the latest commit fa8c6c7, not the PyPI release). I think this is the validator version with the fewest remaining bugs, right @simleo?
I used grep to get the validator messages from all validation reports for all pipelines and uploaded them here

Although the list looks very long at first glance, these seem to be just corner cases.
There are three types of messages. For each type I give one example and my educated guess at the cause:

  1. "The RO-Crate does not include the Data Entity 'work/tmp/03/b9fce9cd2416a84f3e472fa0606095/all_logs_tabs.txt' as part of its payload"

All of these messages relate to files in the temporary directory work/tmp. Maybe this is just a corner case with the tmp directory the current code misses, because no other regular file from the workdir (like work/3c/52eb4a7b50f0eff9ef603a10d064ac) causes problems.

  2. "FormalParameter MUST have an additionalType"

Example: "violatingEntity": "./#param/genome" for mag pipeline revision 3.0.2.

Thanks to the saved effective nextflow.config in the RO-Crate I can be 100% sure that this is the parameter value during runtime and it's the default value:

genome = null

Also an edge case in handling null parameter values?

  3. "RO-Crate file descriptor "ro-crate-metadata.json" is not fully flattened at entity "#param/max_memory/value""

Example: "#param/max_memory/value" is not fully flattened for methylseq pipeline revision 2.6.0

It looks like this:

{
    "@id": "#param/max_memory/value",
    "@type": "PropertyValue",
    "exampleOfWork": {
        "@id": "#param/max_memory"
    },
    "name": "max_memory",
    "value": {
        "bytes": 6442450944,
        "giga": 6,
        "kilo": 6291456,
        "mega": 6144
    }
},

The effective configuration during runtime is: max_memory = '6 GB'. I don't actually know why it looks so strange in ro-crate-metadata.json. I guess Nextflow takes this value, sees it's some kind of "file size", and converts it into a map expressing the value in different units?
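
To make "not fully flattened" concrete, here is a small Python sketch that flags properties whose values are nested objects rather than literals or `{"@id": ...}` references. It is a simplification for illustration, not the validator's actual check (for instance, it ignores JSON-LD value objects), and the function name `nested_values` is hypothetical.

```python
def nested_values(entity: dict) -> list:
    """Return property names holding nested objects instead of literals
    or plain {"@id": ...} references (a simplified notion of 'not flattened')."""
    def is_nested(value) -> bool:
        if isinstance(value, dict):
            return set(value) != {"@id"}
        if isinstance(value, list):
            return any(is_nested(item) for item in value)
        return False

    return [key for key, value in entity.items() if is_nested(value)]


# The entity from the methylseq crate shown above.
entity = {
    "@id": "#param/max_memory/value",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#param/max_memory"},
    "name": "max_memory",
    "value": {"bytes": 6442450944, "giga": 6, "kilo": 6291456, "mega": 6144},
}

print(nested_values(entity))  # ['value']
```

The `exampleOfWork` reference passes because it contains only an `@id`; the `value` map is what trips the check.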

One last thing regarding the license.
@bentsherman, you are now using manifest.license for both main.nf and the RO-Crate itself. I thought that the RO-Crate has a license that tells how the RO-Crate (e.g. the contained research data and results) can be used. This can be different from the license under which the Nextflow workflow is published.
@simleo Is this correct?

@simleo

simleo commented Jan 31, 2025

But all others didn't pass the validator (I used the latest commit fa8c6c7, not the PyPI release)

That's the current development version, so good choice 👍

I thought that the RO-Crate has a license that tells how the RO-Crate (e.g. the contained research data and results) can be used. This can be different from the license under which the Nextflow workflow is published.

Workflow RO-Crate says:

The Crate MUST specify a license. The license is assumed to apply to any content of the crate, unless overriden by license on individual File entities.

where the first appearance of "Crate" here means the root data entity. See also Licensing, Access control and copyright.

@bentsherman
Member

"The RO-Crate does not include the Data Entity 'work/tmp/03/b9fce9cd2416a84f3e472fa0606095/all_logs_tabs.txt' as part of its payload"

This is most likely coming from a collectFile operator, which saves file outputs to work/tmp. If a downstream task uses the collected file then it will show up as a task input. I have added some logic to handle it but haven't tested it yet.

"FormalParameter MUST have an additionalType"

Example: "violatingEntity": "./#param/genome" for mag pipeline revision 3.0.2.

I have added some logic to exclude parameters set to null.

"RO-Crate file descriptor "ro-crate-metadata.json" is not fully flattened at entity "#param/max_memory/value"",

Example: "#param/max_memory/value" is not fully flattened for methylseq pipeline revision 2.6.0

I'm surprised the JSON serializer did this instead of just failing. Not sure how it came up with this result. I added some logic to treat durations and memory units as raw numbers.
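
As an illustration of "raw numbers", the following Python sketch turns a memory string such as '6 GB' into a byte count. It is not the plugin's Groovy code; the name `memory_to_bytes` is hypothetical, and the 1024-based units are an assumption that matches the values in the crate above (6 GB serialized as 6442450944 bytes).

```python
import re

# Assumed 1024-based multipliers, consistent with "6 GB" -> 6442450944 bytes.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def memory_to_bytes(text: str) -> int:
    """Convert a string like '6 GB' into a plain integer number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a memory value: {text!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS[unit.upper()])


print(memory_to_bytes("6 GB"))  # 6442450944
```

A single integer serializes as a JSON literal, so the resulting PropertyValue stays flat.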

I thought that the RO-Crate has a license that tells how the RO-Crate (e.g. the contained research data and results) can be used. This can be different from the license under which the Nextflow workflow is published.

You're right. I added the license back to the wrroc config options.

@fbartusch these fixes should remove most or all of the validation errors you saw

@fbartusch
Contributor

@bentsherman Indeed, the plugin now produces valid RO-Crates for most of the nf-core pipelines. Great work!
I ran 41 pipelines. The plugin produces valid RO-Crates for 36 of them. For one pipeline the plugin throws an exception:

demultiplex/1.5.1: 
java.lang.NullPointerException: null
        at nextflow.prov.WrrocRenderer.getTaskOutputName(WrrocRenderer.groovy:916)
        at nextflow.prov.WrrocRenderer.access$10(WrrocRenderer.groovy)
        at nextflow.prov.WrrocRenderer$_render_closure25.doCall(WrrocRenderer.groovy:459)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:279)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
        at groovy.lang.Closure.call(Closure.java:433)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.callClosureForMapEntry(DefaultGroovyMethods.java:6061)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3985)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:4002)
        at nextflow.prov.WrrocRenderer.render(WrrocRenderer.groovy:456)
        at nextflow.prov.ProvObserver.onFlowComplete(ProvObserver.groovy:121)
        at nextflow.Session.notifyFlowComplete(Session.groovy:1155)
        at nextflow.Session.shutdown0(Session.groovy:749)
        at nextflow.Session.destroy(Session.groovy:694)
        at nextflow.script.ScriptRunner.shutdown(ScriptRunner.groovy:260)
        at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:146)
        at nextflow.cli.CmdRun.run(CmdRun.groovy:376)
        at nextflow.cli.Launcher.run(Launcher.groovy:503)
        at nextflow.cli.Launcher.main(Launcher.groovy:658)

Four other pipelines have missing payloads. These files are in hasPart, but they are missing in the Crate:

denovotranscript-1.0.0-ro-crate-1.1-required.json:            "message": "The RO-Crate does not include the Data Entity 'work/ce/1f37ce4888f0e3305b4db32f2fcc76/okayset/all_assembled.okay.mrna' as part of its payload",
denovotranscript-1.0.0-ro-crate-1.1-required.json:            "message": "The RO-Crate does not include the Data Entity 'work/ce/1f37ce4888f0e3305b4db32f2fcc76/okayset/all_assembled.pubids' as part of its payload",
proteinfold-1.1.1-ro-crate-1.1-required.json:            "message": "The RO-Crate does not include the Data Entity 'work/8d/5f814f4ec329c4c24f51eb7e4c8460/T1024.1.fasta' as part of its payload",
proteinfold-1.1.1-ro-crate-1.1-required.json:            "message": "The RO-Crate does not include the Data Entity 'work/c1/c4175ab996c015eef3f88848408521/T1026.1.fasta' as part of its payload",
viralrecon-2.6.0-ro-crate-1.1-required.json:            "message": "The RO-Crate does not include the Data Entity 'work/13/c9227d087fffb9e03051bc06587ffc/quast/report.tsv' as part of its payload",
viralrecon-2.6.0-ro-crate-1.1-required.json:            "message": "The RO-Crate does not include the Data Entity 'work/58/89a60f6a6959b2ad48da2dbd9cb699/quast/report.tsv' as part of its payload",
viralrecon-2.6.0-ro-crate-1.1-required.json:            "message": "The RO-Crate does not include the Data Entity 'work/85/e6deaba6c3df726378006b31d72512/quast/report.tsv' as part of its payload",
viralrecon-2.6.0-ro-crate-1.1-required.json:            "message": "The RO-Crate does not include the Data Entity 'work/89/60cec3fa49ded79c35a944283142c8/quast/report.tsv' as part of its payload",

@bentsherman
Member

@fbartusch since we're getting into more obscure errors and this PR is a massive piece of work, I'm going to go ahead and merge it and cut a release. Let's pursue these two issues separately.

@bentsherman bentsherman merged commit 4bbf0c1 into nextflow-io:master Feb 6, 2025
2 checks passed
@simleo

simleo commented Feb 6, 2025

@bentsherman testing again with https://github.com/famosab/wrrocmetatest, I've removed the license setting from nextflow.config and added it back to testdata.config, but the license is missing from the ro-crate-metadata.json file. I also see no warning about the missing license parameter. Am I doing something wrong? I ran with:

nextflow run famosab/wrrocmetatest -profile docker --input testsheet.csv --outdir results -c testdata.config

Note that other parameters from testdata.config are working, e.g. I have a "John Doe" agent in the RO-Crate metadata file.

@bentsherman
Member

@simleo I ran with these settings in the testdata.config:

prov {
    enabled = true
    formats {
        wrroc {
            file = "${params.outdir}/ro-crate-metadata.json"
            overwrite = true
            license = "https://spdx.org/licenses/MIT"
        }
    }
}

That will set the license for the crate. Setting manifest.license will set the license for the pipeline. I am seeing both licenses in the crate metadata.
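
One way to confirm that both licenses ended up in the metadata is to look them up in the generated JSON. The Python sketch below assumes the usual RO-Crate layout (a metadata descriptor whose `about` points at the root dataset, and a `mainEntity` pointing at the workflow); the function name `crate_licenses` and the sample values are hypothetical.

```python
def crate_licenses(metadata: dict) -> dict:
    """Return the license of the root dataset and of the main workflow."""
    by_id = {entity["@id"]: entity for entity in metadata["@graph"]}
    descriptor = by_id["ro-crate-metadata.json"]
    root = by_id[descriptor["about"]["@id"]]
    workflow_id = root.get("mainEntity", {}).get("@id")
    workflow = by_id.get(workflow_id, {})
    return {"crate": root.get("license"), "workflow": workflow.get("license")}


# Hand-written stand-in for a generated ro-crate-metadata.json (values made up).
sample = {"@graph": [
    {"@id": "ro-crate-metadata.json", "about": {"@id": "./"}},
    {"@id": "./", "license": "https://spdx.org/licenses/MIT",
     "mainEntity": {"@id": "main.nf"}},
    {"@id": "main.nf", "license": "https://spdx.org/licenses/Apache-2.0"},
]}

print(crate_licenses(sample))
```

For a real crate, load the file with `json.load` and pass the result to the function; a `None` entry means the corresponding license is absent.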

@simleo

simleo commented Feb 6, 2025

@bentsherman I see the license again now. I reinstalled the plugin, something must have gone wrong with that before. Sorry for the noise.

@famosab
Contributor Author

famosab commented Feb 7, 2025

Thanks so much @bentsherman @simleo @fbartusch for putting in all the work to finish this massive PR 🚀

@drernie

drernie commented Feb 7, 2025

Great job! I'm already telling customers about it...

Development

Successfully merging this pull request may close these issues.

Add RO crate format
6 participants