invoker lock-file conflict on nfs cluster #207
Comments
stack-trace:
|
Thanks for the report! I'm currently chasing a deadline (Sunday)--I'll worry about this next week, likely by deriving all cache dirs (binary and invoker) from the one passed to the context. I'd also be very open to receiving a patch. :) |
No hurry at all - I've fixed it on my system so I'm happy, just wanted to let you know about the idea. I'm also pretty busy but a patch may be incoming some day :) |
So, how is work on this going? I think I have the same issue:
The file system is also NFS. I am only using a single node, so I am not sure why it is blocking itself, but apparently it is. I tried setting the cache directory when creating the context:
context = cl.Context([device], cache_dir="/tmp/idgaf/")
However, I got an error:
What is the current known good way to work around this? |
@Richardk2n In your cache directory ( You can also change the cache directory via the |
The application is not parallel and did lock even with a single instance. The cache directory contains nothing but the This seems like the kind of solution I was looking for. I will check whether it works for me. Thank you. Is pyopencl expected to work on NFS out of the box? |
If you remove the sqlite file from the cache directory, does your application run normally? Can you share the application code and the full output with us (perhaps in a gist)? pyopencl is expected to work out of the box on NFS file systems; however, the interplay between NFS and sqlite can sometimes be challenging (see e.g. https://www.sqlite.org/faq.html#q5 for some details). If the NFS server gets stuck somehow, for example with a stale lock (which I am assuming is happening in your case), things can go wrong. |
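Before deleting the sqlite file, it may help to confirm that it really is stuck behind a lock. The following is a minimal sketch using only Python's standard `sqlite3` module; the function name `sqlite_is_locked` and the `db_path` argument are hypothetical, and the actual file name inside the cache directory has to be filled in by hand. On a badly stuck NFS mount the call might hang instead of returning, so treat it as a diagnostic aid rather than a fix.

```python
import sqlite3


def sqlite_is_locked(db_path, timeout=0.5):
    """Return True if a write lock on db_path cannot be obtained within `timeout` seconds.

    Hypothetical helper for illustration; not part of pyopencl or pytools.
    """
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/ROLLBACK statements below are sent to sqlite unchanged.
    conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    try:
        # BEGIN IMMEDIATE requests the write lock right away and fails
        # if another connection (or a stale lock) is holding it.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("ROLLBACK")
        return False
    except sqlite3.OperationalError as exc:
        message = str(exc).lower()
        if "locked" in message or "busy" in message:
            return True
        raise  # some other problem, e.g. the path is not a database
    finally:
        conn.close()
```

If this reports a lock while no other process is using the cache, that would be consistent with the stale-lock assumption above, and removing the file (or moving the cache off NFS) is a reasonable next step.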
When running pyopencl on a cluster with an NFS filesystem, a lock file created in my home dir on one node prevents the other nodes from progressing. I've pasted a stack trace below.
At first I thought I could fix the problem by supplying the "cache_dir" argument when creating the pyopencl context, pointing it to somewhere in /tmp, which isn't on the NFS. However, those lock files aren't the problem: the problem is the use of PersistentDict to define the "invoker_cache" in invoker.py using the default lock file location, which in my case is inside my home dir on the NFS.
As a workaround, I've modified invoker.py on my system so the definition reads
Perhaps in future versions of pyopencl you could make the container_dir configurable?
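For illustration, here is a minimal sketch of how a node-local cache directory could be chosen so that lock files never land on the shared filesystem. It uses only the standard library; the helper name `node_local_cache_dir` and the directory naming scheme are assumptions, not anything pyopencl provides, and whether the system temp dir is actually off NFS depends on the cluster. The exact modified definition from the report is not reproduced here; the commented calls at the end only show where such a path might be passed.

```python
import os
import socket
import tempfile


def node_local_cache_dir(app_name="pyopencl-cache"):
    """Create and return a per-user, per-host cache directory under the local temp dir.

    Hypothetical helper for illustration only.
    """
    # Usually /tmp on Linux; check that it is node-local on your cluster.
    base = tempfile.gettempdir()
    user = os.environ.get("USER") or os.environ.get("USERNAME") or "unknown"
    # Including the host name keeps nodes apart even if the temp dir is shared.
    path = os.path.join(base, f"{app_name}-{user}-{socket.gethostname()}")
    os.makedirs(path, mode=0o700, exist_ok=True)
    return path


cache_dir = node_local_cache_dir()
print(cache_dir)

# Where the path might be used (assumes pyopencl / pytools are installed):
#   context = cl.Context([device], cache_dir=cache_dir)
#   PersistentDict("hypothetical-identifier", container_dir=cache_dir)
```

Because the directory is per host, each node builds its own cache, which trades some repeated compilation for not having to coordinate locks over NFS.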