Still facing NaN issues! #33
That is frustrating! Thank you for your efforts. Is this happening on the CPU or on the GPU? Eager mode and graph mode follow the same code path. As far as I understand, the main difference between the two modes for our purposes is that in graph mode multiple ops may be executed concurrently. The errors might be related to this concurrent execution. It might be worth trying to set the number of inter-op parallelism threads to 1.
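In case it helps, a minimal sketch of that setting (it has to run before any TensorFlow op executes, e.g. right after the import):

```python
import tensorflow as tf

# Force ops to run one at a time instead of concurrently.
# Must be called before any op has executed in the process.
tf.config.threading.set_inter_op_parallelism_threads(1)
```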
Currently I am only working on GPU, so it is happening on the GPU.
That's a good thing to test. I will try this. However, that still begs the question of how we will fix such an issue if this turns out to be the cause. I at least hope this will be faster to run compared to eager mode.
All the runs I launched yesterday ran fine for 20 h. These are: …
This is getting really frustrating, as I really just don't have a setup to repro to even be ...
Hi @chaithyagr, thank you so much for running these tests. It's very useful to know this. Just to be clear, you have never been able to reproduce the issue if either running on eager mode or if inter-op parallelism threads is 1. Is that correct? Yes, let's keep this issue open. Would you be willing to share some code that has failed before, even if it does not reproduce the issue consistently? It would give me a starting point to run some tests with code that has at least been known to fail. Interestingly, I don't currently observe this problem, and at this point I'm not sure whether this could be due to differences in our usage or something else (perhaps it's less likely on my hardware?).
Correct, and I get your point where you expect this to fix our issue, and frankly I don't get a huge speedup with …
I think the same code from #20 (comment) should be a good enough minimum repro. Although one problem I see with this version is that in TFMRI we now threshold the density (https://github.com/mrphys/tensorflow-mri/blob/cfd8930ee5281e7f6dceb17c4a5acaf625fd3243/tensorflow_mri/python/ops/traj_ops.py#L938) and this no-NaN reciprocal could be bad: `out = tf.math.reciprocal_no_nan(tfmri.sampling.estimate_density(traj, img_size, method, max_iter))`
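To illustrate why that reciprocal could be bad, a small self-contained sketch with a made-up density vector (not the actual TFMRI output):

```python
import tensorflow as tf

# Made-up density estimate; the exact zero mimics a sample the threshold clipped away.
density = tf.constant([0.0, 1e-8, 0.5, 2.0])

# reciprocal_no_nan maps the exact zero to 0 instead of inf, so the bad sample is
# silently zeroed out downstream rather than surfacing as a NaN/Inf.
weights = tf.math.reciprocal_no_nan(density)
print(weights.numpy())  # [0.e+00 1.e+08 2.e+00 5.e-01]

# A stricter option while debugging: fail fast if the density ever hits zero.
# (This raises InvalidArgumentError for the density above.)
# tf.debugging.assert_positive(density, message="density has non-positive entries")
```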
I don't think it could be hardware, as I run it on multiple machines and different GPUs. It is possible that something is wrong with my code specifically. I now plan to save the input when we get NaNs (although this may not be of much use, as an exact repeat of a run, which results in the same output, did not result in NaN, signifying that it is a more random issue).
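Roughly what I have in mind for saving the input (an eager-mode sketch; the helper and path names are just placeholders):

```python
import numpy as np
import tensorflow as tf

def dump_if_nan(tensor, path="nan_input.npy"):
    # Save the offending tensor so the exact failing input can be replayed later.
    if tf.reduce_any(tf.math.is_nan(tensor)):
        np.save(path, tensor.numpy())
        raise ValueError(f"NaN detected; input saved to {path}")
    return tensor
```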
A minor question, @jmontalt: I was wondering what happens if the input points are not in `[-pi, pi]`?
It should support up to the extended range `[-3*pi, 3*pi]`. This extended range should give some flexibility and robustness against points that are close to `pi` or `-pi`. For points beyond this extended range, behaviour is undefined. We do not currently assert that the points lie within the correct range; this would be rather inefficient, especially on the GPU. This behaviour is the same as in the original FINUFFT.
Well, while I agree with the wrapping from a mathematical point of view, I don't think this is valid in MRI. But I guess tensorflow-nufft is a generic package, so I guess it's fine.
It does remain valid in MRI. Note that wrapping a point by a multiple of 2*pi does not change the result, since the complex exponentials evaluated on the uniform grid are 2*pi-periodic.
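A small sketch of why the wrapping is harmless mathematically (assuming the `[-pi, pi)` convention; the uniform grid frequencies are integers, so shifting a point by 2*pi leaves every exponential unchanged):

```python
import numpy as np
import tensorflow as tf

# A point outside [-pi, pi) and its wrapped version differ by a multiple of 2*pi,
# so every exponential exp(-i*k*x) on the integer grid k is unchanged.
x = tf.constant(4.0, dtype=tf.float64)                      # outside [-pi, pi)
x_wrapped = tf.math.floormod(x + np.pi, 2 * np.pi) - np.pi  # back inside the range

k = tf.range(-8, 8, dtype=tf.float64)                       # integer grid frequencies
e_orig = tf.exp(tf.complex(tf.zeros_like(k), -k * x))
e_wrap = tf.exp(tf.complex(tf.zeros_like(k), -k * x_wrapped))
print(np.allclose(e_orig.numpy(), e_wrap.numpy()))          # True, up to float precision
```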
Yes, I get that. What I mean is that in the case of MRI, if the inputs are given as is (not re-normalizing to `[-pi, pi]`) …
This would be wrong. But then again, as I said, I don't think that's the job of tensorflow-nufft.
If this is valid, I think this could be the issue. I observed that on adding a lot of randomness to my k-space trajectory, the NaN issue came up suddenly. While I still can't pinpoint why, my repro has always been in a controlled setting where there are no points outside the range. Let's see how the code behaves in this scenario; I will try to create such an odd repro for now.
OK, I think that was the issue! Not sure if this is realistic, though. But proper guard checks have been added on my side; I will report back if the issue still remains.
Great, thanks for keeping me in the loop. Going forward, I'll make a note to add an option to assert that the points are in the right range, which would be useful for debugging purposes (I don't want to enable this by default because of the likely performance hit).
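In the meantime, a user-side check could look roughly like this (a sketch; `check_points` is just an illustrative name and the bounds assume the ranges discussed above):

```python
import numpy as np
import tensorflow as tf

def check_points(points, extended=True):
    # Fail fast, with a readable message, if any trajectory point falls outside
    # the range the NUFFT is expected to handle.
    bound = 3 * np.pi if extended else np.pi
    tf.debugging.assert_greater_equal(points, tf.cast(-bound, points.dtype),
                                      message="trajectory point below valid range")
    tf.debugging.assert_less_equal(points, tf.cast(bound, points.dtype),
                                   message="trajectory point above valid range")
    return points
```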
I will also document this behaviour on the website to reduce the chances of it happening again in the future. I'm keeping the issue open until everything is clear.
Well, as I just added …
This is really helpful! |
I think I finally have a complicated case that is reproducible! If I have a loss which is multiscale SSIM, I get NaNs in graph mode, but no NaNs in eager mode. However, using … I am not sure what the issue is specifically yet, but I think with this I can make a good minimum repro. Additional things I observed: the output from the network is sometimes plainly wrong even when it isn't NaN. To me, this could be because the MSSIM is launched before the end of …
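Roughly the shape the repro would take (a self-contained sketch with random images standing in for the network output, not my actual training code):

```python
import tensorflow as tf

def mssim_loss(y_true, y_pred):
    # Multiscale SSIM over the batch; NaNs anywhere in the inputs poison the result.
    return 1.0 - tf.reduce_mean(tf.image.ssim_multiscale(y_true, y_pred, max_val=1.0))

@tf.function
def graph_step(y_true, y_pred):
    return mssim_loss(y_true, y_pred)

# Random images standing in for ground truth and the reconstruction output.
y_true = tf.random.uniform([2, 192, 192, 1])
y_pred = tf.random.uniform([2, 192, 192, 1])

eager_val = mssim_loss(y_true, y_pred)
graph_val = graph_step(y_true, y_pred)
print(eager_val.numpy(), graph_val.numpy())

# Surface the failure instead of letting NaNs propagate into the optimizer.
tf.debugging.assert_all_finite(graph_val, "MS-SSIM produced NaN/Inf in graph mode")
```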
I wonder how much this issue is related to tensorflow/tensorflow#57353.
Hi @chaithyagr, thank you for your efforts. My best guess would be that there's a missing device synchronisation somewhere, which would explain the randomness of the results. It'll be some time until I have the chance to take a look, but if you want to share your repro I will keep it in mind. There's also some significant refactoring in progress which should make debugging easier in the future.
Updating with some more info. I seem to have hit NaN issues even in eager mode with this. I feel eager mode runs slower, thereby making NaNs due to a missing device sync less probable. For now, I added a device sync, as seen in chaithyagr#1, and I don't see any issues in my runs with this update. I have no idea though; I seem to have tackled this issue in some form or other every time, only for it to be back a month later :P
That's great, thank you! Fingers crossed it is finally fixed now. Good luck!
Even after a lot of effort, we still seem to face NaN issues, and sadly this seems to be very erratic and random. I don't have a repro test at all...
From what I see, I think this issue is only present in graph mode. @jmontalt, does that hint at any possible issues? Are there any separate code paths for eager mode / graph mode?
I will use this issue to track progress and what I observe.