Interactions between cmdstan instances through /proc #1169

Open
adrian-lison opened this issue Jun 29, 2023 · 2 comments

@adrian-lison

I have a strange issue on an HPC cluster running SLURM: chains of one model crash when a different model that is running simultaneously finishes sampling. The odd part is that, as a workaround, I had to containerize the different models and exclude /proc from the bind paths of the container. The details of the bug and the workaround are described in this post: https://discourse.mc-stan.org/t/race-conditions-between-independent-cmdstan-model-runs/30918

I am not sure whether this is specific to my environment, but given the surprising interaction through /proc, I thought it was worth drawing attention to. Maybe someone has an idea of what could be causing such a bug.

@robmoss

robmoss commented Sep 6, 2024

I've experienced chains terminating unexpectedly (with no error messages or further information) when spawning multiple R processes that each use cmdstanr to sample from a different model (cmdstan 2.35.0, cmdstanr 0.8.1).

Sampling from each model in serial always succeeds, and running the same processes in parallel always results in at least one failure. As far as I can tell, it has nothing to do with the working directory or tempdir(). I haven't tried running the processes in separate containers with independent /proc directories. I don't have much experience with containers, but if I can find the time I'll give it a go and see whether it resolves this issue for me.

@robmoss

robmoss commented Sep 6, 2024

Sampling from multiple models in parallel succeeds when I create a new PID namespace for each process:

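# Run each model in its own PID namespace with a freshly mounted /proc.
for model in ${MODELS}; do
    unshare --fork --pid --mount-proc --user --map-root-user ./run.R "${model}" &
done
wait  # block until every background run has finished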
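
Because the failures described above are silent, it can help to record which parallel run actually failed. The following is a minimal sketch, not part of the original workaround: `run_all` is a hypothetical helper that starts one background process per model, waits on each PID individually, and reports the names of the runs that exited with a non-zero status. It assumes model names contain no spaces or colons, as in the loop above.

```shell
# Hypothetical helper (not from cmdstan or cmdstanr).
# Usage: run_all "<space-separated model names>" <command> [args...]
# The model name is appended as the last argument of the command.
run_all() {
    models=$1
    shift
    jobs_list=""
    for model in ${models}; do
        "$@" "${model}" &
        # Remember which PID belongs to which model.
        jobs_list="${jobs_list} $!:${model}"
    done
    failed=0
    for job in ${jobs_list}; do
        pid=${job%%:*}
        name=${job#*:}
        # wait returns the exit status of that specific background job.
        if ! wait "${pid}"; then
            echo "FAILED: ${name}" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every run succeeded.
    [ "${failed}" -eq 0 ]
}
```

With the namespace workaround, the call would look like `run_all "${MODELS}" unshare --fork --pid --mount-proc --user --map-root-user ./run.R`, assuming unprivileged user namespaces are enabled on the machine. Dropping the `unshare` prefix gives a direct comparison against the shared-/proc case.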
