Added a port of scipy.ndimage.measurement.watershed. #99
dask.array.atop was deprecated and moved to dask.array.blockwise.
In my opinion it's better not to have this algorithm implemented than to instantiate a single chunk, but @jakirkham might disagree. We had some discussions about this, and we think the right approach is probably something like that found in #94 (run independently in different volumes), but with some overlap and a smarter way to link adjacent blocks. However, that's not exactly right: some seed propagation will need to happen from one volume to the next, so a two-pass approach might be necessary. Here's my idea:
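As a rough, numpy-only sketch of the overlap-and-trim machinery such an approach would involve (hypothetical helper names; a real implementation would presumably use dask.array.overlap / map_overlap rather than doing this by hand):

```python
import numpy as np

def split_with_overlap(arr, chunk, depth):
    """Split a 1-D array into chunks, each extended by up to `depth`
    halo cells on either side (mimicking dask.array.overlap)."""
    blocks = []
    for start in range(0, arr.size, chunk):
        lo = max(start - depth, 0)
        hi = min(start + chunk + depth, arr.size)
        # remember where the "core" region begins inside the padded block
        blocks.append((arr[lo:hi], start - lo))
    return blocks

def trim_and_join(blocks, chunk):
    """Drop the halos and concatenate the core regions back together."""
    return np.concatenate([b[off:off + chunk] for b, off in blocks])
```

A per-block watershed would run between the split and the trim; the hard part this thread is discussing is how labels in the halos get reconciled between neighboring blocks.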
I am interested in where this is going, so I attempted prototype implementations of the method suggested by @jni. I also tried a modified method with a different mode of seed communication between chunks. Prototypes of each method are in the attached zip file as notebooks. Bottom lines:
Both notebooks use the same approach to generate raw data:
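The data-generation cell itself isn't reproduced here, but a common way to build a synthetic watershed test case (a hypothetical stand-in, not necessarily what the notebooks do) is to scatter random seed points and use the distance to the nearest seed as the elevation landscape:

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (64, 64)

# Scatter a few random seed points; the "elevation" at each pixel is the
# distance to the nearest seed, giving a landscape of touching basins.
seeds = rng.integers(0, shape[0], size=(8, 2))
yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
dist = np.min(
    np.hypot(yy[..., None] - seeds[:, 0], xx[..., None] - seeds[:, 1]),
    axis=-1,
)

# Markers: one positive integer label at each seed location.
markers = np.zeros(shape, dtype=int)
for i, (r, c) in enumerate(seeds, start=1):
    markers[r, c] = i
```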
In the implementation of the method by @jni (v2 in the zip), I add a border to each chunk of the marker image and assign the border the value max(labels) + 1. I watershed the chunks for a first pass, then remove the border label to produce a new set of markers. I trim the overlap from the previous chunking, then re-overlap to cross-pollinate the markers. I create a mask using only the boundary-label basin. Finally, I perform the second-pass watershed, compose the respective parts of the first and second passes, and trim.

In the modified implementation, I do a first-pass watershed without a border. I then trim the overlap and re-overlap to cross-pollinate as before (this is a little tricky without a mask, but it works about as well). Then I perform the second-pass watershed.

I expect there are bugs in my implementation, but both methods show some amount of error between a watershed of the full Numpy array and the distributed version. The methods have almost exactly the same error pattern, which suggests that either I am doing something consistently wrong, or there is something unexpected happening in the behavior of the watershed. I would also like to point out something about the validation image at the bottom of each notebook: there is a strip of about 1/4 of the height of the image with very few errors. That strip is fairly consistent even when changing the random seed or flipping the image.

After playing around with these methods, there may be an important limitation. It seems to me that the watershed requires global image information in general. If a marker is more than one chunk away from any part of its basin, then that basin can't be labeled in only two passes with neighbor-only communication. In general, the process of propagating marker information would have to be repeated until the watershed image stops changing. I've attached a notebook v3 that demonstrates the limitation pretty clearly with a double-spiral starting image.
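The border-label step of the first pass, as I understand it from the description above, can be sketched in isolation (`add_border_marker` is a hypothetical helper name):

```python
import numpy as np

def add_border_marker(markers):
    """Pad a marker block with a synthetic border label of max(labels) + 1.

    The border basin soaks up everything that would otherwise flow out of
    the chunk; after the first watershed pass it is removed so only
    genuinely local assignments survive into the second pass.
    """
    border_label = markers.max() + 1
    out = np.pad(markers, 1, constant_values=border_label)
    return out, border_label

markers = np.array([[0, 0, 0],
                    [0, 1, 0],
                    [0, 0, 2]])
bordered, border_label = add_border_marker(markers)
```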
The v3 notebook is a little rougher than the other two, but hopefully gets the idea across. An iterative method may not require a while loop: a for loop with a fixed bound would also work, since marker information can only need to propagate a bounded number of chunks.
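A toy model of that bound: with a single marker at one end of a basin spanning every chunk, each synchronous pass of neighbor-only communication advances the label by exactly one chunk, so the worst case needs one pass fewer than the number of chunks.

```python
# 1 = chunk already reached by the marker, 0 = not yet reached.
# Each synchronous pass lets a chunk pick up the label only from its
# left neighbor, modeling neighbor-only communication between blocks.
chunks = [1, 0, 0, 0]
passes = 0
while 0 in chunks:
    chunks = [(c or chunks[i - 1]) if i > 0 else c
              for i, c in enumerate(chunks)]
    passes += 1
# worst case: len(chunks) - 1 passes before the labeling stops changing
```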
@wwarriner super cool! Sorry that I don't have time to check this all out in detail right now, but it sounds like a big leap forward. btw, if you publish your notebooks to gist.github.com, they can be viewed online at nbviewer.ipython.org, which is a nice way to share them, as it allows others to view them without having to boot up a local instance.
Jumping in to say thanks for working on this! I had a small poke around in the notebooks in your gist.
It's a little worrying that there are differences in the watershed results between the dask and scikit-image implementations. I'd have to dig more into why that could be before I'd have anything really useful to say on that.
```python
    if ii != 3:
        raise RuntimeError('structure dimensions must be equal to 3')

    return None
```
It's probably still too early for comments on this, but I wanted to flag that the indentation level here suggests that this function will return None every time. I imagine that's not quite what's intended, unless I'm missing context.
Absolutely, and happy to. Thank you for the review. I've updated the gist to incorporate your thoughts, and I corrected the issue you pointed out.
Agreed! My initial thought is that the watershed implementation is not robust with respect to segment boundaries. This may be because the implementation is queue-based: if a basin is split across chunks, that could alter the queue order. The result could be that individual ridge pixels are assigned to different basins depending on the precise number and value of pixels in each basin. Put another way, whichever basin arrives at a pixel first "wins" the right to assign a segment value. If the number and value of pixels changes, then the order of arrival may also change, resulting in flip-flops near boundaries like we are seeing.

It isn't fully clear what we can do from here if my assumption is correct. One option would be to alter the watershed code to provide consistent assignment of ridge pixels. A naive method would be to always use the lowest adjacent label.
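As a toy illustration of a deterministic tie-break (assuming a "lowest label wins" rule, which is only one possible choice), here is a 1-D ridge pixel resolved independently of flood order:

```python
import numpy as np

# A 1-D "ridge" pixel (0) sits between basins labeled 1 and 2.
labels = np.array([1, 1, 0, 2, 2])

for i in np.flatnonzero(labels == 0):
    neighbors = [labels[j] for j in (i - 1, i + 1)
                 if 0 <= j < labels.size and labels[j] > 0]
    # Deterministic rule: the lowest adjacent label always wins,
    # regardless of the order the queue-based flood reached the pixel.
    labels[i] = min(neighbors)
```

The point of such a rule is that the result depends only on the final neighborhood, not on queue ordering, so splitting a basin across chunks would no longer flip the assignment.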
I like this idea; it seems like a good way to test those assumptions out.
It has been a while since I have worked on this, but I have implemented this a time or three in the distant past, and I remember some of the painful bits. @jni's early comments are spot on, but in my code I kept, and propagated, the alias/union-find structure all the way through to the final merge, AND partitioned the initial label values in each block, so the minimum cluster ID is a function of num_pixels_in_block * block_order. This guarantees that no matter what ID you assign a pixel, it is unique between blocks. The labels can then be compressed (down to the minimum count for that block, as mentioned before), but their IDs must not overlap before the final boundary merge. Once you collect all the block-level IDs, you then run the same merge operation over them, and then collapse those IDs to minimize them for the final update and write. I am not sure I am describing the processing steps clearly, but in essence, as long as you can guarantee that each block is uniquely labeled, you can repeat the process on the block IDs to merge all of the blocks together. Hope this helps.
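A minimal sketch of that scheme, as I read it (hypothetical helper names, tiny 1-D blocks): each block gets a disjoint label range via a per-block offset, so labels can never collide before the merge, and a union-find pass then stitches adjacent blocks together.

```python
def find(parent, x):
    """Find the root label of x, with path halving for efficiency."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge two labels; the lower root survives, keeping results stable."""
    ra, rb = find(parent, a), find(parent, b)
    parent[max(ra, rb)] = min(ra, rb)

# Two blocks of 4 pixels each; block 1's labels are offset by
# block_index * block_size so they are globally unique before merging.
block_size = 4
block0 = [1, 1, 2, 2]                              # local labels, block 0
block1 = [x + block_size for x in [1, 1, 2, 2]]    # offset labels, block 1

parent = {label: label for label in set(block0) | set(block1)}

# The last pixel of block 0 touches the first pixel of block 1:
# record that their labels belong to the same component.
union(parent, block0[-1], block1[0])

merged = [find(parent, label) for label in block0 + block1]
```

After this merge pass, a final relabeling could compress the surviving roots to consecutive integers for the output write.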
Thanks for adding your thoughts @ebo!
I feel it might be a little too soon for a PR, but...
I got the watershed algorithm ported from scipy.ndimage.measurement.watershed working the same way that dask-image.label works.
I will have time to work on this a little this weekend if we can discuss what needs to be done to clean things up.
Sorry, I thought I had posted some comments a week ago, and realized this morning that I posted the numba version of the code to the numba list.
It would be nice to do a little profiling to see if I can get numba working with the low level dask stuff, but I have not heard back from the dask/dask-image folks if that is acceptable.
Let me know how you wish for me to proceed.
dask.array.atop was deprecated and moved to dask.array.blockwise.
Before you submit a pull request, check that it meets these guidelines:

- If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
- Check https://travis-ci.org/dask/dask-image/pull_requests and make sure that the tests pass for all supported Python versions.