invoker lock-file conflict on nfs cluster #207

Open
ahaldane opened this issue Oct 19, 2017 · 7 comments

@ahaldane

When running pyopencl on a cluster with an NFS filesystem, a lock file created in my home dir on one node prevents the other nodes from progressing. I've pasted a stack trace below.

At first I thought I could fix the problem by supplying the "cache_dir" argument when creating the pyopencl context, pointing it somewhere in /tmp that isn't on the NFS. However, those lock files aren't the problem: the problem is that invoker.py defines the "invoker_cache" as a PersistentDict with the default lock-file location, which in my case is inside my home dir on the NFS.

As a workaround, I've modified invoker.py on my system so that the definition reads:

invoker_cache = PersistentDict("pyopencl-invoker-cache-v1",
        key_builder=NumpyTypesKeyBuilder(),
        container_dir='/tmp/cl/invoker')

Perhaps in future versions of pyopencl you could make the container_dir configurable?
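
A minimal sketch of what that could look like, assuming an environment-variable override inside invoker.py (PYOPENCL_INVOKER_CACHE_DIR is hypothetical, not an existing pyopencl setting; pytools' PersistentDict falls back to its default location when container_dir is None):

import os

# Hypothetical override: let the user point the invoker cache somewhere
# node-local; if the variable is unset, pytools picks its usual default.
# PersistentDict and NumpyTypesKeyBuilder are in scope in invoker.py,
# as in the snippet above.
invoker_cache = PersistentDict("pyopencl-invoker-cache-v1",
        key_builder=NumpyTypesKeyBuilder(),
        container_dir=os.environ.get("PYOPENCL_INVOKER_CACHE_DIR"))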

@ahaldane (Author)

Stack trace:

  File "/usr/home/p/605/tuf33565/anaconda2/lib/python2.7/site-packages/pyopencl/__init__.py", line 320, in __getattr__
    knl = Kernel(self, attr)
  File "/usr/home/p/605/tuf33565/anaconda2/lib/python2.7/site-packages/pyopencl/cffi_cl.py", line 1690, in __init__
    self._setup(program)
  File "/usr/home/p/605/tuf33565/anaconda2/lib/python2.7/site-packages/pyopencl/cffi_cl.py", line 1700, in _setup
    work_around_arg_count_bug=None)
  File "/usr/home/p/605/tuf33565/anaconda2/lib/python2.7/site-packages/pyopencl/invoker.py", line 388, in generate_enqueue_and_set_args
    result = invoker_cache[cache_key]
  File "/usr/home/p/605/tuf33565/.local/lib/python2.7/site-packages/pytools/persistent_dict.py", line 472, in __getitem__
    return self.fetch(key)
  File "/usr/home/p/605/tuf33565/.local/lib/python2.7/site-packages/pytools/persistent_dict.py", line 700, in fetch
    LockManager(cleanup_m, self._lock_file(hexdigest_key))
  File "/usr/home/p/605/tuf33565/.local/lib/python2.7/site-packages/pytools/persistent_dict.py", line 128, in __init__
    "--something is wrong" % self.lock_file)
RuntimeError: waited more than three minutes on the lock file '/usr/home/p/605/tuf33565/.cache/pytools/pdict-v2-pyopencl-invoker-cache-v1-py2.7.13.final.0/75d86f4c7e7bed5781efc15198f91210c98d69a44f2a8fa928503c1cf560d256.lock'--something is wrong

@inducer (Owner)

inducer commented Oct 19, 2017

Thanks for the report! I'm currently chasing a deadline (Sunday)--I'll worry about this next week, likely by deriving all cache dirs (binary and invoker) from the one passed to the context. I'd also be very open to receiving a patch. :)
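
A rough sketch of that idea (all names here are illustrative, not pyopencl API: a helper that places the invoker cache under the context's cache_dir when one is given, reusing the NumpyTypesKeyBuilder from the snippet above):

import os
from pytools.persistent_dict import PersistentDict

def make_invoker_cache(context_cache_dir):
    # If the user passed cache_dir= to the Context, keep the invoker
    # cache next to the binary cache; otherwise fall back to the
    # pytools default location.
    container_dir = None
    if context_cache_dir is not None:
        container_dir = os.path.join(context_cache_dir, "invoker")
    return PersistentDict("pyopencl-invoker-cache-v1",
            key_builder=NumpyTypesKeyBuilder(),
            container_dir=container_dir)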

@ahaldane (Author)

No hurry at all - I've fixed it on my system, so I'm happy; I just wanted to let you know about the idea.

I'm also pretty busy, but a patch may be incoming some day :)

@Richardk2n

Richardk2n commented Oct 17, 2024

So, how is work on this going? I think I have the same issue:

/home/67/bt307867/.myVenv/lib/python3.12/site-packages/pytools/persistent_dict.py:513: UserWarning: PersistentDict: database '/home/67/bt307867/.cache/pytools/pdict-v5-pyopencl-invoker-cache-v42-py3.12.4.final.0.sqlite' busy, 20 retries

The file system is also NFS. I am only using a single node, so I am not sure why it is blocking itself, but apparently it is.
I tried to set a cache_dir as per the docs:

context = cl.Context([device], cache_dir="/tmp/idgaf/")

However, I got an error:

TypeError: __init__(): incompatible function arguments. The following argument types are supported:
    1. __init__(self, devices: object | None = None, properties: object | None = None, dev_type: object | None = None) -> None

What is the currently known-good way to work around this?
The function signature has changed, but adding container_dir is still a working workaround.
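
For reference, a sketch of that local edit against a current invoker.py (hypothetical: the exact surrounding code differs across pyopencl versions; the identifier is taken from the warning message above, and container_dir is the same pytools PersistentDict argument used in the 2017 workaround):

# Local edit in site-packages/pyopencl/invoker.py (sketch only; adapt
# to whatever the invoker_cache definition looks like in your copy).
invoker_cache = PersistentDict("pyopencl-invoker-cache-v42",
        key_builder=NumpyTypesKeyBuilder(),
        container_dir="/tmp/cl/invoker")  # node-local, off the NFS mount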

@matthiasdiener (Contributor)

@Richardk2n
Are you running a parallel application (or multiple applications that access the pyopencl cache at the same time)? Does the application eventually continue to run, or is the "busy" warning printed endlessly?

In your cache directory (/home/67/bt307867/.cache/pytools/), are there any indications of a stale NFS lock file (perhaps from a previous execution that crashed)?

You can also change the cache directory via the XDG_CACHE_HOME environment variable.
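
For example (the path is just an illustration; set the variable before pytools/pyopencl are first imported, or in the shell before launching Python):

import os

# Redirect all pytools caches (including pyopencl's invoker cache) to
# node-local storage instead of the NFS-mounted home directory.
os.environ["XDG_CACHE_HOME"] = "/tmp/my-cache"  # example path

import pyopencl as cl  # import only after setting the variable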

@Richardk2n

The application is not parallel and locked up even with a single instance.
I killed the application after 80 retries.

The cache directory contains nothing but the *.sqlite file.

The XDG_CACHE_HOME suggestion seems like the kind of solution I was looking for. I will check whether it works for me. Thank you.

Is pyopencl expected to work on NFS out of the box?
That is, is something wrong with the cluster I am using, or is this expected? I assume you have tested it in such use cases.
If this is a general issue, I would appreciate it being mentioned in the docs somewhere.

@matthiasdiener (Contributor)

If you remove the sqlite file from the cache directory, does your application run normally?
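
For example (a sketch: the path is taken from the warning message above, and removing it only forces the caches to be rebuilt on the next run):

import shutil
from pathlib import Path

# Remove the pytools persistent-dict caches (including the busy .sqlite
# file) so pyopencl rebuilds them from scratch on the next run.
cache_root = Path.home() / ".cache" / "pytools"
shutil.rmtree(cache_root, ignore_errors=True)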

Can you share the application code with us, and the full output (perhaps in a gist)?

pyopencl is expected to work out of the box on NFS file systems; however, the interplay between NFS and sqlite can sometimes be challenging (see e.g. https://www.sqlite.org/faq.html#q5 for some details). If the NFS server gets stuck somehow, e.g. on a stale lock (which I assume is happening in your case), things can go wrong.
