Unit tests for `prif_stop` and `prif_error_stop` make fragile non-portable assumptions #137

bonachea · 2024-09-12T03:37:38Z

Currently, the unit tests for `prif_stop` and `prif_error_stop` unconditionally invoke `./build/run-fpm.sh` from within the fpm-built Caffeine unit test and inspect the resulting process exit code.

I consider this entire approach to be very fragile for multiple reasons:

1. Assumes the Caffeine test executable is run from the source/build directory
2. Assumes fpm (and possibly the compiler) are available on the compute node
3. Assumes fpm is capable of launching parallel jobs at all
4. Assumes parallel jobs can be launched at all (by any command) from the compute node
5. Currently appears to have EVERY image launch the subjob
6. Relies on process exit code propagation, which can be unreliable in loosely coupled distributed systems

I expect one or more of the above assumptions to be violated on some systems (completely breaking the Caffeine unit test) once we incorporate distributed conduits and non-trivial job spawners.

As such, we'll eventually need a "kill switch" to disable this practice or, better yet, a more robust approach to exit testing that doesn't rely on programmatically invoking fpm to spawn a sub-job.
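As a rough illustration of the "kill switch" idea, here is a minimal shell sketch of an exit-status check that can be disabled through the environment. The function name `check_exit_status` and the variable `CAF_SKIP_EXIT_TESTS` are hypothetical names chosen for this example; they are not existing Caffeine settings, and the actual tests are driven from the Fortran test suite rather than from a script like this.

```sh
# Hypothetical helper: run a command and compare its exit status with an
# expected value. CAF_SKIP_EXIT_TESTS is an assumed variable name used only
# for this sketch; setting it to any non-empty value skips the check.
check_exit_status() {
  expected=$1
  shift
  if [ -n "${CAF_SKIP_EXIT_TESTS:-}" ]; then
    echo "SKIP: exit-status test disabled"
    return 0
  fi
  # Capture the status without tripping `set -e` in the calling script.
  actual=0
  "$@" || actual=$?
  if [ "$actual" -eq "$expected" ]; then
    echo "PASS: exit status $actual"
    return 0
  fi
  echo "FAIL: expected $expected, got $actual"
  return 1
}
```

For example, `check_exit_status 3 sh -c 'exit 3'` reports a pass, while exporting `CAF_SKIP_EXIT_TESTS=1` on a system where sub-jobs cannot be launched turns the same call into a skip. This only addresses the ability to opt out; it still relies on exit code propagation (item 6 above).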


bonachea · 2024-09-12T19:07:37Z

Direct evidence that the subjob invocations of fpm are not being performed once-per-test as intended, but rather once-per-image-per-test (problem 5 listed above):

{pcp-d-10} env CC=gcc CXX=c++ FC=gfortran GASNET_PSHM_NODES=1 ./build/run-fpm.sh test | tail -n 50
Project is up to date
        sums integer(c_int64_t) scalars with no optional arguments present
        multiplies default real scalars with all optional arguments present
        multiplies real(c_double) scalars with all optional arguments present
        performs a collective .and. operation across logical scalars
        sums default complex scalars with a stat-variable present
        sums complex(c_double) scalars with a stat-variable present
        sums default integer elements of a 2D array across images
    The prif_co_sum subroutine
        sums default integer scalars with no optional arguments present
        sums default integer scalars with all arguments present
        sums integer(c_int64_t) scalars with stat argument present
        sums default integer 1D arrays with no optional arguments present
        sums default integer 15D arrays with stat argument present
        sums default real scalars with result_image argument present
        sums double precision 2D arrays with no optional arguments present
        sums default complex scalars with stat argument present
        sums double precision 1D complex arrays with no optional arguments present
    A program that executes the prif_error_stop function
        exits with a non-zero exitstat when the program omits the stop code
        prints a character stop code and exits with a non-zero exitstat
        prints an integer stop code and exits with exitstat equal to the stop code
    prif_image_index
        returns 1 for the simplest case
        returns 1 when given the lower bounds
        returns 0 with invalid subscripts
        returns the expected answer for a more complicated case
    The prif_num_images function result
        is a valid number of images when invoked with no arguments
    PRIF RMA
        can send a value to another image
        can send a value with indirect interface
        can get a value from another image
        can get a value with indirect interface
    A program that executes the prif_stop function
        exits with a zero exitstat when the program omits the stop code
        prints an integer stop code and exits with exitstat equal to the stop code
        prints a character stop code and exits with a non-zero exitstat
    Teams
        can be created, changed to, and allocate coarrays
    The prif_this_image_no_coarray function result
        is the proper member of the set {1,2,...,num_images()} when invoked as this_image()

A total of 57 test cases

All Passed
Took 8.07435 seconds

A total of 57 test cases containing a total of 82 assertions

           0
{pcp-d-10} env CC=gcc CXX=c++ FC=gfortran GASNET_PSHM_NODES=8 ./build/run-fpm.sh test | tail -n 50 
Project is up to date
A total of 57 test cases

All Passed
Took 13.3721 seconds

All Passed
A total of 57 test cases containing a total of 82 assertions

All Passed
All Passed
All Passed
All Passed
All Passed
All Passed
Took 13.3721 seconds

A total of 57 test cases containing a total of 82 assertions

Took 13.3721 seconds

Took 13.3722 seconds

A total of 57 test cases containing a total of 82 assertions

A total of 57 test cases containing a total of 82 assertions
Took 13.3721 seconds


A total of 57 test cases containing a total of 82 assertions

Took 13.3721 seconds

Took 13.3722 seconds

Took 13.3721 seconds

A total of 57 test cases containing a total of 82 assertions

A total of 57 test cases containing a total of 82 assertions

A total of 57 test cases containing a total of 82 assertions

           0
           0
           0
           0
           0
           0
           0
           0

Note that when the number of images is increased from 1 to 8, the fpm "summary" output grows by a factor of 8.

This is non-scalable and will definitely fail when using a real job scheduler that enforces process parallelism limits.

Previously the stop and error stop tests had EVERY image in an N-image job launching a recursive `fpm run` invocation that leads to an N-way exit test, for a total of (N + N^2) concurrent processes, where only (N + N) was intended. Change the stop tests so that only the first image performs the recursive fpm invocation for the sub-job. Partially addresses BerkeleyLab#137
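To make the process counts in that description concrete, the following small sketch evaluates both formulas for a given number of images. The function name `total_processes` and its `all`/`first` mode labels are made up for this illustration.

```sh
# Total concurrent processes for an N-image test job whose exit test spawns
# an N-image sub-job.
#   mode "all":   every image launches the sub-job  -> N + N*N
#   mode "first": only the first image launches it  -> N + N
total_processes() {
  n=$1
  mode=$2
  case "$mode" in
    all)   echo $(( n + n * n )) ;;
    first) echo $(( n + n )) ;;
    *)     echo "unknown mode: $mode" >&2; return 2 ;;
  esac
}

total_processes 8 all     # prints 72
total_processes 8 first   # prints 16
```

For the 8-image run shown above, that is 72 concurrent processes before the change versus 16 after it.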

bonachea mentioned this issue Sep 12, 2024

Replace parallel statements with Caffeine calls #136

Merged

rouson mentioned this issue Dec 28, 2024

Test with Julienne #169

Open

2 tasks

bonachea mentioned this issue Mar 12, 2025

Fix and re-enable stop and error stop tests for flang #185

Open
