Simplify connections #589
Introduce "natural connections": if there is only one way to connect two nodes (considering the number and types of their inputs and outputs), the user should be able to do:
wf.connect(nodeA, nodeB)
This would allow streamlining the definition of simple sequential workflows:
wf.connect([nodeA, nodeB, nodeC, nodeD])
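A hedged sketch of how such inference might work. `FakeNode` and `natural_pairing` are invented here purely for illustration; this is not nipype's API, and real inference would need richer port metadata than a single type tag:

```python
# Hypothetical sketch, not nipype's actual API: infer the single
# type-compatible output -> input pairing between two nodes.

class FakeNode:
    """Toy stand-in for a workflow node with typed, named ports."""
    def __init__(self, name, outputs, inputs):
        self.name = name
        self.outputs = outputs  # port name -> type tag, e.g. {'out_file': 'nifti'}
        self.inputs = inputs

def natural_pairing(src, dst):
    """Return (output, input) iff exactly one compatible pairing exists."""
    candidates = [
        (o, i)
        for o, o_type in src.outputs.items()
        for i, i_type in dst.inputs.items()
        if o_type == i_type
    ]
    if len(candidates) != 1:
        raise ValueError("connection between %s and %s is ambiguous or "
                         "impossible" % (src.name, dst.name))
    return candidates[0]
```

The hard part, as discussed below, is that real interfaces rarely carry enough metadata to make `candidates` come out to exactly one entry.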
satra commented on Jun 26, 2013
this is what wheels (the software that daniel built on nipype) did.
https://github.com/FNNDSC/wheels
but in general there are very few use cases where the joining works automatically without decorating the input and output specs further.
take a workflow and try finding the number of times this would be useful - at least in the workflows i use there are few instances.
chrisgorgo commented on Jun 27, 2013
@maartenmennes designed a framework out of frustration with how complicated defining workflows was in nipype. Maybe this use case is not so uncommon...
satra commented on Jun 27, 2013
i spoke to maarten, and i agree it would be good to take a look at his framework to see what we can learn. but i don't think making connections was the core issue. these were the issues i inferred (maarten can correct me - please do - we want to improve nipype so that you could use it too!):
i think the following
wf.connect(nodeA, nodeB)
is a tough problem without more extended semantics in the input and output spec.
while i would argue:
wf.connect(nodeA, 'outputname', nodeB, 'inputname')
is easy and explicit but might be considered verbose by some folks.
inati commented on Jun 27, 2013
Hi All,
I'm new to nipype, but having spent time with other pipelining frameworks, I would say that the hard part is not really the verbosity of this:
wf.connect(nodeA, 'outputname', nodeB, 'inputname')
For the applications that I'm more familiar with (the image recon side of things) the problems have a different performance curve. For example, if what you are handing off from nodeA to nodeB can fit into memory you are much better off. Ideally you just pass a pointer and then that's super efficient. But then it's hard to go across processes or compute nodes or to and from a GPU because at every node boundary you have to serialize and deserialize. It's ugly.
I think that for the nipype applications the performance is sort of limited by individual steps in the pipeline. We've basically said that it doesn't make sense to worry about disk space (at least in the scratch sense) we're just going to save all the intermediate results. We're going to use nifti or gifti or whatever at the node boundaries, actually it's more like, it's up to the pipeline architect to make sure that the output of nodeA is compatible with the input of nodeB. In my opinion this is a good thing. It's a bit harder to write the pipeline because you have to pay attention to these details, but it's way way better than having a system that can guess and does it wrong. We're also going to brute force the parallelization and just find the bits that are "stupidly parallel" and break there. IMHO this is good too.
Anyway, my question is a bit different:
Given a pipeline, how can I "profile" it to find the bottlenecks?
Cheers,
Souheil
Souheil Inati, PhD
Staff Scientist
FMRI Facility
NIMH/NIH/DHHS
souheil.inati@nih.gov
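One generic way to approach that profiling question, sketched here for illustration only: `PipelineProfiler` is an invented helper, not a nipype feature, and it assumes each node's work can be invoked as a callable. The idea is simply to time each node's execution and rank the totals:

```python
# Illustrative sketch (not part of nipype): wrap node execution with a
# timer and rank nodes by total wall-clock time to locate bottlenecks.
import time
from collections import defaultdict

class PipelineProfiler:
    def __init__(self):
        self.durations = defaultdict(float)  # node name -> seconds

    def run_node(self, name, run_fn, *args, **kwargs):
        """Execute one node's work, accumulating its wall-clock time."""
        start = time.perf_counter()
        try:
            return run_fn(*args, **kwargs)
        finally:
            self.durations[name] += time.perf_counter() - start

    def bottlenecks(self, top=3):
        """Return the `top` slowest nodes as (name, seconds) pairs."""
        return sorted(self.durations.items(), key=lambda kv: -kv[1])[:top]
```

In a real engine the same effect could come from execution callbacks fired at node start and stop, which is the direction discussed further down the thread.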
maartenmennes commented on Jun 27, 2013
Hi Chris,
thanks for starting this discussion. First off, I have to say that my experience with nipype is not from sitting down and trying hard to learn it, but rather from sitting down, staring at it a bit, understanding what it does, but then failing to get to the easy/actual implementation.
Here's a couple of points though:
- troubling me is the python framework around fsl/afni/spm commands. When using such commands I would want to be able to see the final/actual fsl command that is being run. With one node I understand that should be pretty easy using .command, but when using a whole workflow I found it became a lot trickier to easily track things.
- ...bound to strict quota limits. Although this might have been solved (at least in the CPAC implementation).
- there's only so much you can do in parallel in an image processing workflow.
- ...when you connect one output to one input, but connecting different outputs to several inputs became really hard to follow. Below is a print of how I currently specify my workflow in what I call my poor man's version of nipype/CPAC. It starts with a main branch, and then splits into 6 subbranches (with some of those splitting again) that represent differing pipelines that I might want to follow/compare. (I do create a graph, no worries.)
["reorient","selectvolumes","get_example","motioncorrection_fsl","skullstrip_fsl","grandmeanscaling","get_mask","spatialsmoothing6","bet_examplefunc","bbreg","nuisance_masks",
["highpassfilter", "finalize", "get_mask", "melodic"],
["nr_wm_csf_lin",
["highpassfilter", "finalize", "get_mask", "melodic"]],
["nr_6_wm_csf_lin",
["highpassfilter", "finalize", "get_mask", "melodic"]],
["nr_fw24_wm_csf_lin",
["highpassfilter",
["finalize","get_mask", "melodic"],
["scrubbing_FDJenkinson","finalize","get_mask"]],
["bandpassfilter",
["finalize","get_mask"],
["scrubbing_FDJenkinson","finalize","get_mask"]]],
["nr_fw24_wm_csf_lin_spike",
["highpassfilter", "finalize","get_mask", "melodic"]],
["nr_fr24_wm_csf_lin_spike",
["highpassfilter", "finalize","get_mask"]]]
In theory each node connects to the output of the previous node, but some nodes might get skipped in that chaining process. E.g., the inputs to motioncorrection_fsl are the outputs of selectvolumes and get_example. The output of that workflow is then a bash file that I can easily submit to our local cluster.
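One possible reading of how a nested spec like the one above could be expanded into pairwise connections. `expand_branches` is an illustrative helper written for this thread, not Maarten's actual code, and it implements only the default chaining (it cannot express the skipped-node cases he mentions):

```python
# Sketch: expand a nested branch spec into (parent, child) edges.
# Strings chain sequentially; a sublist branches off the node that
# precedes it, and sibling sublists branch off that same node.

def expand_branches(spec, parent=None, edges=None):
    if edges is None:
        edges = []
    for item in spec:
        if isinstance(item, list):
            # branch: recurse without advancing the parent, so the next
            # sibling sublist also hangs off the same node
            expand_branches(item, parent, edges)
        else:
            if parent is not None:
                edges.append((parent, item))
            parent = item  # chain: this node feeds the next plain string
    return edges
```

On a small spec, `expand_branches(["a", "b", ["c", "d"], ["e"]])` yields the chain a->b plus two branches off b, which matches the "main branch splitting into subbranches" description.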
I think we can agree that the above nested list is pretty easy to
understand and follow, so yes, I think
wf.connect(nodeA, 'outputname', nodeB, 'inputname')
is too verbose, especially because 'outputname' and 'inputname' are
typically variables and thus don't really give you any extra information.
You would have to properly trace them to see that you're actually using the
correct files.
I think nipype provides uber flexibility - which is great - but in my opinion it might be flexible up to a point where it's almost unnecessary, as in: there is a certain order to processing an image (at least the one we use now). Using some of that intrinsic order might indeed simplify the creation of workflows.
In summary: simpler workflow specification and being able to easily track the commands that are actually run (without having to go into each subnode directory) would be points that I would find very attractive.
Cheers,
Maarten
Maarten Mennes, Ph.D.
Post-Doctoral Researcher
Donders Institute for Brain, Cognition and Behaviour
Dept. of Cognitive Neuroscience
Radboud University Nijmegen Medical Centre
Nijmegen
The Netherlands
Google Scholar Author Link: http://scholar.google.com/citations?user=pLlSTVgAAAAJ&hl=en
satra commented on Jun 27, 2013
thanks @maartenmennes - this is useful.
i completely agree with the tracking part. we had hoped other people would build that - but sadly that didn't happen - we even provided a Debug Plugin which could be used and callbacks on node execution start and stop. this is high priority, but i really want to do this right - our previous attempts, including a crazy web page that could be opened only in firefox, didn't quite work. and various people have created their own.
started a separate thread for this here #590 - and i'll continue there.
first to address a misconception - outputname and inputname are not variables, they are named ports. this means they should tell you semantically what output you are connecting to what input. in fact, things are named precisely to avoid checking for the right files. except in specific cases, i really don't see a way to avoid named connections without creating much richer metadata for inputs and outputs. in your example, i believe you are making very specific assumptions about the processes that you are allowing to be specified in your pipeline and pushing the spec into the process. (btw, in some ways it reminds me of psom :) )
as i pointed to chris earlier, this sounds a bit like the wheels project: https://github.com/FNNDSC/wheels
however, there is a reason why any existing general workflow framework requires named/targeted connections. essentially workflows are graphs not trees, which means nodes may take any number of arbitrary inputs from other nodes. the only way to specify this for n_inputs > 1 is to use both the source node name and the output name (which is why wheels doesn't work if you have a conflict in the output name from two different nodes). so i'm quite curious to know how in the above spec motioncorrection_fsl knows where to get the two inputs from.
here is a recent connectivity pattern suggested by @ihrke which simplifies the specification while retaining the specificity.
https://github.com/ihrke/nipype_connect_str
i do want to simplify workflow specification (and i do want some amount of verbosity), but i haven't yet found something simpler (other than this new grammar suggested by matthias - which we will probably provide in an upcoming release).
i'm all ears, if anyone has better ideas.
btw, the reduced flexibility approach can be built on top of nipype by creating a linearized spec class - this is what the connectome viewer folks did and in some ways what cpac is doing.
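The n_inputs > 1 point can be made concrete with a toy check. Node and port names below echo Maarten's spec; the edge-list representation and `unnamed_safe` helper are invented for illustration and are not part of nipype:

```python
# Sketch: with two upstream nodes both producing 'out_file', a destination
# with multiple inputs cannot infer which output feeds which input -
# only the named four-tuple form disambiguates.
from collections import Counter

connections = [
    # (source_node, source_port, dest_node, dest_port)
    ("selectvolumes", "out_file", "motioncorrection_fsl", "in_file"),
    ("get_example",   "out_file", "motioncorrection_fsl", "ref_file"),
]

def unnamed_safe(conns):
    """A destination could use unnamed connect only if it receives
    exactly one incoming edge; otherwise names are required."""
    per_dest = Counter(dst for _src, _out, dst, _in in conns)
    return {dst: n == 1 for dst, n in per_dest.items()}
```

Here `unnamed_safe(connections)` flags `motioncorrection_fsl` as unsafe for unnamed connection, which is exactly satra's question about where its two inputs come from.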
mwaskom commented on Jun 27, 2013
"but in general there are very few use cases where the joining works automatically without decorating the input and output specs further."
I'm not sure that's really the case. How many of these lines exist in your workflows?
It's certainly nice to have flexibility, but we already have a convention for privileged input and output fields. It seems reasonable to extend this to a default connection behavior.
I do think it makes sense not to overload wf.connect() even further. Something like wf.auto_connect(nodeA, nodeB) that expands behind the scenes to wf.connect(nodeA, 'out_file', nodeB, 'in_file') seems straightforward. Being a fan of the other syntax for connect, I think a list form should also be possible.
PS apologies for accidental closing... hitting tab did not do what I expected it to.
satra commented on Jun 27, 2013
do a grin "out_file" in the examples directory. the fsl dominant workflows have this pattern - but we support many other packages.
i like auto_connect. but across interfaces, these connections cannot be inferred automatically. i still think the naming convention would need to be cleaned up across interfaces (including fsl) and those privileged inputs and outputs decorated with metadata (primary_output=True, primary_input=True). but even after this the user will need to mix auto_connect and connect.
mwaskom commented on Jun 28, 2013
+1 for primary_input and primary_output as trait metadata, although fallback to in_file and out_file by default seems reasonable (for things like Function interfaces). Also agree that some effort to move towards standardized in_file and out_file would be good in cases where that makes sense (e.g. mri_binarize -> out_file instead of binary_file). Don't see a huge problem with mixing the two connection styles; if anything it emphasizes the distinction between the main pipe and ancillary ones.
inati commented on Jun 28, 2013
It's not just the names. You're making a contract between the nodes as to what the data is going to be. You could say it's a single nifti file. But is it only floats or can I also give you shorts? The node interfaces could be very strongly typed. I'm not a fan of this.
Or maybe I'm confused and you are just saying that you want a convenience function so that "connect" when called with two nodes just hooks up the primary output to the primary input. That seems simple.
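That convenience behavior can be sketched minimally. The `resolve_auto` helper and the metadata dicts below are hypothetical, invented to illustrate the primary_output/primary_input idea with the in_file/out_file fallback discussed above; this is not nipype's actual trait API:

```python
# Hypothetical sketch: pick the port an auto-connection would use,
# preferring explicit primary_* metadata, then the conventional names.

def resolve_auto(src_outputs, dst_inputs):
    """src_outputs/dst_inputs map port name -> metadata dict,
    e.g. {'binary_file': {'primary_output': True}}."""
    out = next((n for n, m in src_outputs.items() if m.get("primary_output")),
               "out_file" if "out_file" in src_outputs else None)
    inp = next((n for n, m in dst_inputs.items() if m.get("primary_input")),
               "in_file" if "in_file" in dst_inputs else None)
    if out is None or inp is None:
        raise ValueError("cannot auto-connect; fall back to explicit connect()")
    return out, inp
```

For an interface like mri_binarize, marking binary_file with primary_output=True would let auto_connect find it without renaming the trait.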
mwaskom commented on Jun 28, 2013
Your second paragraph is what I meant. I'm not sure I follow the first part, but I guess it doesn't matter.
oesteban commented on Sep 4, 2014
The InterfacedWorkflows I'm developing in this PR #882 can work as first suggested.
dmwelch commented on Feb 3, 2015
I'd LOVE to see something like this! Just off the cuff, perhaps something as simple as
would be more palatable? Admittedly, it doesn't take into account map nodes and iterables... :-(