Feature request - custom caching strategy #5308
Comments
Here is a self-contained example:

workflow {
    input = [
        [ id: 'test1', patient: 'patient_1' ],
        [ id: 'test2', patient: 'patient_3' ]
    ]

    Channel.fromList(input)
        | map { patient -> [patient.id, patient] }
        | set { patients }

    patients
        | map { id, patient -> id }
        | createFile
        | join(patients)
        | map { id, initialFile, patient -> [patient, initialFile] }
        | addPatientInfo
        | view
}

process createFile {
    input:
    val id

    output:
    tuple val(id), path("${id}.txt")

    script:
    """
    touch ${id}.txt
    echo "ID: ${id}" >> ${id}.txt
    """
}

process addPatientInfo {
    publishDir 'publish', mode: 'copy'

    input:
    tuple val(meta), path(initial_file)

    output:
    path "${meta.id}_with_patient.txt"

    script:
    """
    cp ${initial_file} ${meta.id}_with_patient.txt
    echo "Patient: ${meta.patient}" >> ${meta.id}_with_patient.txt
    """
}

If we were able to specify in the createFile which pieces of metadata were really relevant, this could be considerably shorter without sacrificing the reusability if the patient value changes for one or more samples:

workflow {
    input = [
        [ id: 'test1', patient: 'patient_1' ],
        [ id: 'test2', patient: 'patient_3' ]
    ]

    Channel.fromList(input)
        | createFile
        | addPatientInfo
        | view
}

process createFile {
    cache { meta.id }

    input:
    val(meta)

    output:
    tuple val(meta), path("${meta.id}.txt")

    script:
    """
    touch ${meta.id}.txt
    echo "ID: ${meta.id}" >> ${meta.id}.txt
    """
}

process addPatientInfo {
    publishDir 'publish', mode: 'copy'

    input:
    tuple val(meta), path(initial_file)

    output:
    path "${meta.id}_with_patient.txt"

    script:
    """
    cp ${initial_file} ${meta.id}_with_patient.txt
    echo "Patient: ${meta.patient}" >> ${meta.id}_with_patient.txt
    """
}
The cache closure wouldn't need any parameters; it could just use the task context like any other dynamic directive. We would also want to log a warning to make it clear that you're basically doing a "development" run, as this option shouldn't be used in production.

Here's a quandary -- I think the user would need to specify this custom strategy from the beginning. If I do a run, then add some metadata columns, but I forgot to specify this closure, then I'm out of luck. I guess you could make it work if you specify all inputs:

process ApplyLearning {
    cache { [meta.employer, edu] }

    input:
    tuple val(meta), path(edu, stageAs: "education.txt")

    // ...
}

I think you could tack that on to the second run and be able to resume.
Thinking about this for 30 more seconds, I would argue that if the intent is to declare that a process depends only on

process ApplyLearning {
    input:
    val(employer)
    path("education.txt")

    // ...
}

I think the deeper issue is that processes are invoked with channels instead of individual values and that makes it difficult to control how inputs are passed into the process. It's much easier to just provide the channel you already have, which might have more inputs than you actually need, rather than adding more channel boilerplate. But I think that solving that problem is probably the best way to address this one too.

workflow {
    input = [
        [ id: 'test1', patient: 'patient_1' ],
        [ id: 'test2', patient: 'patient_3' ]
    ]

    Channel.fromList(input)
        | map { meta ->
            def initial_file = createFile(meta.id)
            return [ meta, addPatientInfo(meta, initial_file) ]
        }
        | view
}

process createFile {
    input:
    val(id)

    output:
    path("${id}.txt")

    script:
    """
    touch ${id}.txt
    echo "ID: ${id}" >> ${id}.txt
    """
}

process addPatientInfo {
    publishDir 'publish', mode: 'copy'

    input:
    val(meta)
    path(initial_file)

    output:
    path "${meta.id}_with_patient.txt"

    script:
    """
    cp ${initial_file} ${meta.id}_with_patient.txt
    echo "Patient: ${meta.patient}" >> ${meta.id}_with_patient.txt
    """
}
What am I missing?
Paolo - maybe a clearer example would be:

params.input = "samples.basic.csv"

workflow {
    Channel.fromPath(params.input)
        | splitCsv(header:true)
        | GetEducation
        | ApplyLearning
}

process GetEducation {
    input: val(meta)
    output: tuple val(meta), path("education.txt")
    script: "echo '$meta.name going to $meta.university' > education.txt"
}

process ApplyLearning {
    input: tuple val(meta), path("education.txt")
    script: "cat education.txt > bio.txt && echo 'Now working at $meta.employer' >> bio.txt"
}

And we have two samplesheets, a
... and a
If I run using the basic samplesheet first, we run all four tasks:
If I launch using the extended samplesheet, I have to recalculate the
In the second (samples.extended.csv) run, the
I think it's important to be able to control the cache, in particular to manually cache time-consuming tasks. Suppose my script is as follows:

process saySecond {
    scratch true
    stageInMode "copy"
    container "master:5000/stress:latest"
    cache 'lenient'

    input:
    path db

    output:
    path("db2.json")

    script:
    """
    cat $db > db2.json
    """
}

workflow {
    ch_input = Channel.fromPath(["/data/workspace/1/nf-hello/db/a.txt", "/data/workspace/1/nf-hello/db/b.txt"])
    saySecond(ch_input)
}

View the hash of the task using the parameter
If nothing else changes, the cached hash will be the same the next time you run it. But if something goes wrong and an insignificant parameter is accidentally added to the script, the hash changes.
Is it possible to have a command that manually computes the hash of the current input and script, then replaces the hash in the database with the newly computed one, so that a time-consuming task can stay cached? Or add a field to the database that forces a task to be treated as cached, and apply it once it has been determined that the result has already been generated?
Community request for a similar feature here: https://community.seqera.io/t/process-input-that-is-not-cached-and-does-not-affect-task-hash/1209
This would be extremely useful for the in-house pipeline that I'm currently building; I have ended up using a bunch of subMaps to only select the parts of the meta map that are required for each process, to ensure that changing the initial metadata doesn't invalidate the entire cache.
Isn't it the same amount of work to provide the subMap as an input vs through a custom cache directive? |
+1 for subMap. We could consider a custom cache once we have custom types, in which we could include some declaration about caching.
Ben - the issue will be re-joining the full metadata back after the process. Contrast:

workflow {
    people = Channel.of(
        [name:"Rob", employer:"Seqera", loves:"Nextflow, and making Ben's life harder"],
        [name:"Ben", employer:"Seqera", loves:"Nextflow"],
        [name:"Paolo", employer:"Seqera", loves:"Nextflow"],
        [name:"Simon", employer:"NeoGenomics", loves:"Nextflow"]
    )

    people
        | map { person -> person.subMap('name', 'employer') }
        | DoSomething
        | join(people.map { person -> [person.subMap('name', 'employer'), person] })
        | map { _submap, txtFile, person -> [person, txtFile] }
        | view
}

process DoSomething {
    input: val(person)
    output: tuple val(person), path("${person.name}.txt")
    script:
    """
    echo "Hello ${person.name} from ${person.employer}" > ${person.name}.txt
    """
}

with

workflow {
    people = Channel.of(
        [name:"Rob", employer:"Seqera", loves:"Nextflow, and making Ben's life harder"],
        [name:"Ben", employer:"Seqera", loves:"Nextflow"],
        [name:"Paolo", employer:"Seqera", loves:"Nextflow"],
        [name:"Simon", employer:"NeoGenomics", loves:"Nextflow"]
    )
        | DoSomething
        | view
}

process DoSomething {
    // Proposed syntax: hash only the fields the script actually uses
    cache { person.subMap('name', 'employer') }
    input: val(person)
    output: tuple val(person), path("${person.name}.txt")
    script:
    """
    echo "Hello ${person.name} from ${person.employer}" > ${person.name}.txt
    """
}

That said, the longer I think about it, I actually prefer the idea that a task's cache inputs should describe everything that you need to reconstruct the outputs.
Exactly Rob. And it is easy to screw up the joins, because you need to ensure that you have exactly the right fields to map on (which is another bugbear of mine; it'd be really nice to have a join that will match |
New feature
One of Nextflow's most useful features is the ability to send metadata through the process DAG in parallel to the file data. Not all of the metadata will be necessary to all processes, but any change to the metadata will result in a new task hash, and potentially unnecessary recomputation of tasks.
I propose that users would appreciate the ability to signal to Nextflow which parts of the metadata are relevant to the process.
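As a rough illustration of why this matters, the toy model below hashes a task's inputs in Python. It is not Nextflow's actual hashing code: the `task_hash` and `select_keys` helpers and the sample values are hypothetical, and only the field names (`name`, `university`, `employer`) come from the examples in this thread. It sketches how adding an unused metadata field changes a hash taken over the whole map, while a hash taken over only the relevant fields stays stable.

```python
import hashlib
import json


def task_hash(process_name, script, inputs):
    """Toy task hash: digest of the process name, script text, and inputs.

    Hypothetical helper for illustration only, not Nextflow's implementation.
    """
    payload = json.dumps(
        {"process": process_name, "script": script, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def select_keys(meta, keys):
    """Keep only the listed keys, similar in spirit to Groovy's subMap."""
    return {k: meta[k] for k in keys if k in meta}


# Script text taken from the GetEducation example; sample values are made up.
script = "echo '$meta.name going to $meta.university' > education.txt"
basic = {"name": "Rob", "university": "Example University"}
extended = {**basic, "employer": "Seqera"}

# Hashing the whole meta map: the extra 'employer' field changes the hash.
full_basic = task_hash("GetEducation", script, basic)
full_extended = task_hash("GetEducation", script, extended)

# Hashing only the fields the script uses: the hash is unchanged.
relevant = ["name", "university"]
sel_basic = task_hash("GetEducation", script, select_keys(basic, relevant))
sel_extended = task_hash("GetEducation", script, select_keys(extended, relevant))

print(full_basic == full_extended)  # False
print(sel_basic == sel_extended)    # True
```

In this model, the list of relevant keys plays the role that the value returned by the proposed cache closure would play.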
Usage scenario
Let's say we have a samplesheet samples.csv
That we ingest in a simple workflow
If we expand the samplesheet to include employer metadata:
Any resumed runs will be unable to use the previously-computed tasks, as the GetEducation meta input has changed, even if we know that the new input will make no difference to the task calculation.
Suggest implementation
I propose a new option to the cache directive, whereby the user could supply a closure that returns some new value to be used as input into the task hash:
This would expand the opportunity to rely on the Nextflow task cache when a workflow is being interactively developed, and the extent of the required metadata is not known at the start of development.
The syntax might also include tuples and files (or bags of files):