
Simplify connections. #589

Description

@chrisgorgo (Member)

Introduce "natural connections". If there is only one way to connect two nodes (considering the number and types of their inputs and outputs), the user should be able to do:

wf.connect(nodeA, nodeB)

This would allow streamlining the definition of simple sequential workflows:

wf.connect([nodeA, nodeB, nodeC, nodeD])
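
A minimal sketch of the intended inference, assuming nodes expose enumerable, typed input/output specs the way nipype interfaces do (natural_connect and natural_chain are hypothetical helpers, not existing API):

def natural_connect(wf, src, dst):
    # hypothetical: pair an output of `src` with an input of `dst` when
    # exactly one type-compatible pairing exists; otherwise refuse to guess.
    # traits(transient=None) skips traits' internal bookkeeping entries.
    candidates = []
    for out_name, out_trait in src.outputs.traits(transient=None).items():
        for in_name, in_trait in dst.inputs.traits(transient=None).items():
            if type(out_trait.trait_type) is type(in_trait.trait_type):
                candidates.append((out_name, in_name))
    if len(candidates) != 1:
        raise ValueError("connection %s -> %s is ambiguous or impossible"
                         % (src.name, dst.name))
    out_name, in_name = candidates[0]
    wf.connect(src, out_name, dst, in_name)

def natural_chain(wf, nodes):
    # the proposed wf.connect([nodeA, nodeB, nodeC, nodeD]) form
    for src, dst in zip(nodes[:-1], nodes[1:]):
        natural_connect(wf, src, dst)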

Activity

@satra (Member) commented on Jun 26, 2013

this is what wheels (the software that daniel built on nipype) did.

https://github.com/FNNDSC/wheels

but in general there are very few use cases where the joining works automatically without decorating the input and output specs further.

take a workflow and try finding the number of times this would be useful - at least in the workflows i use there are few instances.

@chrisgorgo (Member, Author) commented on Jun 27, 2013

@maartenmennes designed a framework out of frustration with how complicated defining workflows was in nipype. Maybe this use case is not so uncommon...

@satra (Member) commented on Jun 27, 2013

i spoke to maarten, and i agree it would be good to take a look at his framework to see what we can learn. but i don't think making connections was the core issue. these were the issues i inferred (maarten can correct me - please do - we want to improve nipype so that you could use it too!):

  • python framework, but maarten is used to shell scripting
  • new input and output names to learn when the flags he knew could be used easily. he had to look up nipype's help every single time.
  • level of parallelism was not necessary for his needs - he could submit a workflow per subject
  • things get hidden when a workflow is created - you cannot read off the set of commands you ran.

i think the following

wf.connect(nodeA, nodeB)

is a tough problem without more extended semantics in the input and output spec.

while i would argue:

wf.connect(nodeA, 'outputname', nodeB, 'inputname')

is easy and explicit but might be considered verbose by some folks.
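
for reference, connect already accepts a batched list form that cuts some of the repetition while keeping the named ports (node and port names below are placeholders):

wf.connect([(nodeA, nodeB, [('outputname', 'inputname')]),
            (nodeB, nodeC, [('out_file', 'in_file'),
                            ('out_matrix', 'in_matrix')])])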

@inati (Contributor) commented on Jun 27, 2013

Hi All,

I'm new to nipype, but having spent time with other pipelining frameworks, I would say that the hard part is not really the verbosity of this:

wf.connect(nodeA, 'outputname', nodeB, 'inputname')

For the applications that I'm more familiar with (the image recon side of things), the problems have a different performance curve. For example, if what you are handing off from nodeA to nodeB can fit into memory, you are much better off. Ideally you just pass a pointer, and that's super efficient. But then it's hard to go across processes or compute nodes, or to and from a GPU, because at every node boundary you have to serialize and deserialize. It's ugly.

I think that for the nipype applications the performance is sort of limited by individual steps in the pipeline. We've basically said that it doesn't make sense to worry about disk space (at least in the scratch sense), so we're just going to save all the intermediate results. We're going to use nifti or gifti or whatever at the node boundaries; actually, it's up to the pipeline architect to make sure that the output of nodeA is compatible with the input of nodeB. In my opinion this is a good thing. It's a bit harder to write the pipeline because you have to pay attention to these details, but it's way, way better than having a system that can guess and does it wrong. We're also going to brute-force the parallelization and just find the bits that are "stupidly parallel" and break there. IMHO this is good too.

Anyway, my question is a bit different:
Given a pipeline, how can I "profile" it to find the bottlenecks?

Cheers,
Souheil


Souheil Inati, PhD
Staff Scientist
FMRI Facility
NIMH/NIH/DHHS
souheil.inati@nih.gov


@maartenmennes commented on Jun 27, 2013

Hi Chris,

thanks for starting this discussion. First off, I have to say that my experience with nipype is not from sitting down and trying hard to learn it, but rather from sitting down, staring at it a bit, understanding what it does, but then failing to get to the easy/actual implementation.

Here's a couple of points though:

  • the python framework is fine (and I actually like it a lot), but what is troubling me is the python framework around fsl/afni/spm commands. When using such commands I would want to be able to see the final/actual fsl command that is being run. With one node I understand that should be pretty easy using .command, but when using a whole workflow I found it became a lot trickier to easily track things.
  • The data doubling (working dir and sink dir) was a no-go for me (I'm bound to strict quota limits), although this might have been solved (at least in the CPAC implementation).
  • Indeed, parallelising within subjects is not where the gain is. I think there's only so much you can do in parallel in an image processing workflow.
  • The workflow specification is indeed one of my issues. It's easy enough when you connect one output to one input, but connecting different outputs to several inputs became really hard to follow. Below is a print of how I currently specify my workflow in what I call my poor man's version of nipype/CPAC. It starts with a main branch and then splits into 6 sub-branches (some of which split again) that represent different pipelines that I might want to follow/compare. (I do create a graph, no worries.)

["reorient","selectvolumes","get_example","motioncorrection_fsl","skullstrip_fsl","grandmeanscaling","get_mask","spatialsmoothing6","bet_examplefunc","bbreg","nuisance_masks",
["highpassfilter", "finalize", "get_mask", "melodic"],
["nr_wm_csf_lin",
["highpassfilter", "finalize", "get_mask", "melodic"]],
["nr_6_wm_csf_lin",
["highpassfilter", "finalize", "get_mask", "melodic"]],
["nr_fw24_wm_csf_lin",
["highpassfilter",
["finalize","get_mask", "melodic"],
["scrubbing_FDJenkinson","finalize","get_mask"]],
["bandpassfilter",
["finalize","get_mask"],
["scrubbing_FDJenkinson","finalize","get_mask"]]],
["nr_fw24_wm_csf_lin_spike",
["highpassfilter", "finalize","get_mask", "melodic"]],
["nr_fr24_wm_csf_lin_spike",
["highpassfilter", "finalize","get_mask"]]]

In theory each node connects to the output of the previous node, but some nodes might get skipped in that chaining process. E.g., the inputs to motioncorrection_fsl are the outputs of selectvolumes and get_example. The output of that workflow is then a bash file that I can easily submit to our local cluster.
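
As an illustration, here is a sketch that expands such a nested branch list into (source, target) step pairs. It only covers the plain chaining case; the skipped/multi-input nodes mentioned above (e.g. motioncorrection_fsl) are exactly where a scheme like this needs extra information:

def expand_branches(spec, parent=None):
    # strings chain onto the previous step; a nested list opens a
    # sub-branch starting from the current step
    pairs = []
    prev = parent
    for item in spec:
        if isinstance(item, list):
            pairs.extend(expand_branches(item, parent=prev))
        else:
            if prev is not None:
                pairs.append((prev, item))
            prev = item
    return pairs

expand_branches(["reorient", "selectvolumes",
                 ["highpassfilter", "finalize"],
                 ["bandpassfilter", "finalize"]])
# [('reorient', 'selectvolumes'), ('selectvolumes', 'highpassfilter'),
#  ('highpassfilter', 'finalize'), ('selectvolumes', 'bandpassfilter'),
#  ('bandpassfilter', 'finalize')]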

I think we can agree that the above nested list is pretty easy to understand and follow, so yes, I think

wf.connect(nodeA, 'outputname', nodeB, 'inputname')

is too verbose, especially because 'outputname' and 'inputname' are typically variables and thus don't really give you any extra information. You would have to properly trace them to see that you're actually using the correct files.

I think nipype provides uber-flexibility - which is great - but in my opinion it might go beyond the point where it's needed, as in: there is a certain order to processing an image (at least the one we use now). Using some of that intrinsic order might indeed simplify the creation of workflows.

In summary, the points I would find very attractive are:

  • easier ways to specify workflows/nodes
  • easier access to the actual commands (when using a complex pipeline, without having to go into each subnode directory)

Cheers,
Maarten


Maarten Mennes, Ph.D.
Post-Doctoral Researcher
Donders Institute for Brain, Cognition and Behaviour
Dept. of Cognitive Neuroscience
Radboud University Nijmegen Medical Centre
Nijmegen
The Netherlands

Google Scholar Author Link: http://scholar.google.com/citations?user=pLlSTVgAAAAJ&hl=en

@satra (Member) commented on Jun 27, 2013

thanks @maartenmennes - this is useful.

  • easier access to the actual commands (when using a complex pipeline, without having to go into each subnode directory)

i completely agree with the tracking part. we had hoped other people would build that - but sadly that didn't happen - we even provided a Debug plugin, plus callbacks on node execution start and stop, that could be used for this. this is high priority, but i really want to do this right - our previous attempts, including a crazy web page that could only be opened in firefox, didn't quite work. and various people have created their own.

started a separate thread for this here #590 - and i'll continue there.
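
as an aside, those callbacks also suggest one answer to souheil's profiling question above. a rough sketch, assuming the execution plugin accepts a status_callback that fires with 'start'/'end' per node (the hook name and statuses are assumptions to check against the plugin you use):

import time

node_times = {}

def status_cb(node, status):
    # record wall-clock time between a node's start and end events;
    # node.fullname includes the workflow hierarchy
    if status == 'start':
        node_times[node.fullname] = time.time()
    elif status == 'end':
        node_times[node.fullname] = time.time() - node_times[node.fullname]

wf.run(plugin='MultiProc', plugin_args={'status_callback': status_cb})
# sort node_times by duration to find the bottlenecks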

  • easier ways to specify workflows/nodes

first to address a misconception - outputname and inputname are not variables, they are named ports. this means they should tell you semantically what output you are connecting to what input. in fact, things are named precisely to avoid checking for the right files.

except in specific cases, i really don't see a way to avoid named connections without creating much richer metadata for inputs and outputs. in your example, i believe you are making very specific assumptions about the processes that you are allowing to be specified in your pipeline and pushing the spec into the process. (btw, in some ways it reminds me of psom :) )

as i pointed to chris earlier, this sounds a bit like the wheels project: https://github.com/FNNDSC/wheels

however, there is a reason why any existing general workflow framework requires named/targeted connections. essentially workflows are graphs not trees, which means nodes may take any number of arbitrary inputs from other nodes. the only way to specify this for n_inputs > 1 is to use both the source node name and the output name (which is why wheels doesn't work if you have a conflict in the output name from two different nodes).

so i'm quite curious to know how in the above spec motioncorrection_fsl knows where to get the two inputs from.
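
to make the ambiguity concrete: two upstream nodes can both expose an output named out_file, and only the explicit form can say which goes where (node and port names below are made up):

# nodeA.out_file and nodeB.out_file would collide under name-based joining
wf.connect(nodeA, 'out_file', nodeC, 'in_file')
wf.connect(nodeB, 'out_file', nodeC, 'mask_file')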

here is a recent connectivity pattern suggested by @ihrke which simplifies the specification while retaining the specificity.

https://github.com/ihrke/nipype_connect_str

i do want to simplify workflow specification (and i do want some amount of verbosity), but i haven't yet found something simpler (other than this new grammar suggested by matthias - which we will probably provide in an upcoming release).

i'm all ears, if anyone has better ideas.

btw, the reduced-flexibility approach can be built on top of nipype by creating a linearized spec class - this is what the connectome viewer folks did and in some ways what cpac is doing.

@mwaskom (Member) commented on Jun 27, 2013

"but in general there are very few use cases where the joining works automatically without decorating the input and output specs further."

I'm not sure that's really the case. How many of these lines exist in your workflows?

wf.connect(node_a, "out_file", node_b, "in_file")

It's certainly nice to have flexibility, but we already have a convention for privileged input and output fields. It seems reasonable to extend this to a default connection behavior.

I do think it makes sense not to overload wf.connect() even further. Something like

wf.auto_connect(node_a, node_b)

that expands behind the scenes to

wf.connect(node_a, "out_file", node_b, "in_file")

Seems straightforward. Being a fan of the other syntax for connect, I think this should also be possible:

wf.auto_connect([(node_a, node_b), (node_b, node_c), (node_b, node_d)])
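
A minimal sketch of that expansion, assuming the out_file/in_file convention (auto_connect is the proposal here, not an existing Workflow method):

from nipype.pipeline.engine import Workflow

class AutoConnectWorkflow(Workflow):
    # hypothetical subclass: auto_connect hooks up the conventional
    # privileged ports, delegating to the ordinary connect
    def auto_connect(self, *args):
        if len(args) == 2:      # auto_connect(node_a, node_b)
            pairs = [args]
        else:                   # auto_connect([(node_a, node_b), ...])
            pairs = args[0]
        for src, dst in pairs:
            self.connect(src, 'out_file', dst, 'in_file')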

PS apologies for accidental closing...hitting tab did not do what I expected it to.

@satra (Member) commented on Jun 27, 2013

I'm not sure that's really the case. How many of these lines exist in your workflows?

wf.connect(node_a, "out_file", node_b, "in_file")

do a grin "out_file" in the examples directory. the fsl-dominant workflows have this pattern - but we support many other packages.

i like auto_connect. but across interfaces, these connections cannot be inferred automatically. i still think the naming convention would need to be cleaned up across interfaces (including fsl) and those privileged inputs and outputs decorated with metadata (primary_output=True, primary_input=True). but even after this the user will need to mix auto_connect and connect.
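
a sketch of what that decoration could look like - primary_input here is proposed metadata, not something nipype acts on yet (traits simply carries unknown keywords along), with the in_file fallback discussed below:

from nipype.interfaces.base import File, TraitedSpec

class ThresholdInputSpec(TraitedSpec):
    # primary_input=True marks the privileged port (proposed metadata)
    in_file = File(exists=True, primary_input=True, desc='image to threshold')
    mask_file = File(exists=True, desc='optional mask')

def primary_input_name(spec):
    # look up the decorated port, falling back to the in_file convention
    names = spec.trait_names(primary_input=True)
    return names[0] if names else 'in_file'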

@mwaskom (Member) commented on Jun 28, 2013

+1 for primary_input and primary_output as trait metadata, although fallback to in_file and out_file by default seems reasonable (for things like Function interfaces). Also agree that some effort to move towards standardized in_file and out_file would be good in cases where that makes sense (e.g. mri_binarize -> out_file instead of binary_file). Don't see a huge problem with mixing the two connection styles; if anything it emphasizes the distinction between the main pipe and ancillary ones.

@inati (Contributor) commented on Jun 28, 2013

It's not just the names. You're making a contract between the nodes as to what the data is going to be. You could say it's a single nifti file. But is it only floats or can I also give you shorts? The node interfaces could be very strongly typed. I'm not a fan of this.

Or maybe I'm confused and you are just saying that you want a convenience function so that "connect" when called with two nodes just hooks up the primary output to the primary input. That seems simple.


@mwaskom (Member) commented on Jun 28, 2013

Your second paragraph is what I meant. I'm not sure I follow the first part, but I guess it doesn't matter.

@oesteban (Contributor) commented on Sep 4, 2014

The InterfacedWorkflows I'm developing in this PR #882 can work as first suggested.

@dmwelch (Contributor) commented on Feb 3, 2015

I'd LOVE to see something like this! Just off the cuff, perhaps something as simple as

wf.node2.inputA = wf.node1.outputB

would be more palatable? Admittedly, it doesn't take into account map nodes and iterables... :-(
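
A toy sketch of how that assignment syntax could be intercepted and recorded as a connection (NodeProxy is hypothetical, and as noted it ignores MapNodes and iterables):

class NodeProxy(object):
    # wraps a node so attribute reads yield port references and
    # attribute writes become workflow connections
    def __init__(self, wf, node):
        object.__setattr__(self, '_wf', wf)
        object.__setattr__(self, '_node', node)

    def __getattr__(self, port):
        return (self._node, port)   # a (node, output-name) reference

    def __setattr__(self, port, ref):
        src_node, src_port = ref
        self._wf.connect(src_node, src_port, self._node, port)

# NodeProxy(wf, node2).inputA = NodeProxy(wf, node1).outputB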

added this to the future milestone on Nov 7, 2017