Unit tests for `prif_stop` and `prif_error_stop` make fragile non-portable assumptions #137
Comments
Direct evidence that subjob invocations of
Note that when the number of images is increased from 1 to 8, the
This is non-scalable and will definitely fail when using a real job scheduler that enforces process parallelism limits.
bonachea added a commit to bonachea/caffeine that referenced this issue Mar 12, 2025
Previously the stop and error stop tests had EVERY image in an N-image job launching a recursive `fpm run` invocation that leads to an N-way exit test, for a total of (N + N^2) concurrent processes, where only (N + N) was intended. Change the stop tests so that only the first image performs the recursive fpm invocation for the sub-job. Partially addresses BerkeleyLab#137
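The process counts in the commit message can be checked with a short calculation. This is only an illustrative sketch: the function name `total_processes` and its parameters are hypothetical and do not appear in Caffeine. It assumes the outer job has N images and each recursive `fpm run` sub-job also launches N images.

```python
def total_processes(num_images: int, every_image_spawns: bool) -> int:
    """Count processes in the outer job plus all recursively launched sub-jobs.

    Hypothetical helper for illustration only; it models the arithmetic
    described in the commit message, not any actual Caffeine code.
    """
    outer = num_images
    # Either every image launches a sub-job, or only the first image does.
    sub_jobs = num_images if every_image_spawns else 1
    # Each sub-job is assumed to run with the same number of images.
    return outer + sub_jobs * num_images


if __name__ == "__main__":
    for n in (1, 8):
        before = total_processes(n, every_image_spawns=True)   # N + N^2
        after = total_processes(n, every_image_spawns=False)   # N + N
        print(f"N={n}: before={before}, after={after}")
```

For N = 8 this gives 8 + 64 = 72 processes before the change and 8 + 8 = 16 after it.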
bonachea added a commit to bonachea/caffeine that referenced this issue Mar 13, 2025
Previously the stop and error stop tests had EVERY image in an N-image job launching a recursive `fpm run` invocation that leads to an N-way exit test, for a total of (N + N^2) concurrent processes, where only (N + N) was intended. Change the stop tests so that only the first image performs the recursive fpm invocation for the sub-job. Partially addresses BerkeleyLab#137
bonachea added a commit to bonachea/caffeine that referenced this issue Mar 14, 2025
Previously the stop and error stop tests had EVERY image in an N-image job launching a recursive `fpm run` invocation that leads to an N-way exit test, for a total of (N + N^2) concurrent processes, where only (N + N) was intended. Change the stop tests so that only the first image performs the recursive fpm invocation for the sub-job. Partially addresses BerkeleyLab#137
Currently the approach taken to unit testing `prif_stop` and `prif_error_stop` is to unconditionally invoke `./build/run-fpm.sh` in the `fpm`-built Caffeine unit test and inspect the resulting process exit code.

I consider this entire approach to be very fragile for multiple reasons. It assumes that:

- `fpm` (and possibly the compiler) are available on the compute node
- `fpm` is capable of launching parallel jobs at all

I expect one or more of the above assumptions to be violated on some systems (completely breaking the Caffeine unit test) once we incorporate distributed conduits and non-trivial job spawners.

As such, we'll eventually need a "kill switch" to disable this practice, or better yet a more robust approach to exit testing that doesn't rely on programmatically invoking `fpm` to spawn a sub-job.
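One possible shape for such a kill switch is sketched below, purely as an illustration. The environment variable name `CAF_SKIP_EXIT_TESTS` and the function `should_run_exit_tests` are hypothetical and not part of Caffeine. The sketch assumes the test driver can consult an environment mapping and check for the launcher script before attempting any sub-job.

```python
import os

# Hypothetical variable name, chosen for illustration; not an existing Caffeine setting.
KILL_SWITCH_VAR = "CAF_SKIP_EXIT_TESTS"


def should_run_exit_tests(env, launcher="./build/run-fpm.sh"):
    """Decide whether the sub-job based exit tests should run.

    Returns a (run, reason) tuple so the caller can report why tests were skipped.
    """
    # Any value other than empty or "0" disables the exit tests.
    if env.get(KILL_SWITCH_VAR, "") not in ("", "0"):
        return False, f"disabled via {KILL_SWITCH_VAR}"
    # Skip rather than fail when the launcher script is missing or not executable.
    if not (os.path.isfile(launcher) and os.access(launcher, os.X_OK)):
        return False, f"launcher {launcher} not found or not executable"
    return True, "launcher available"


if __name__ == "__main__":
    run, reason = should_run_exit_tests(os.environ)
    print(("running" if run else "skipping") + " exit tests: " + reason)
```

A check like this only covers the "kill switch" half of the proposal. It does not verify that `fpm` can actually launch parallel jobs on the compute node, which would still need a more robust exit-testing approach.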