Describe the bug
The result is suspicious when running ctest dplasma_dgetrf_1d_mpi. The failure is deterministic, but it happens only on the Guyot system (without GPUs); the same setup never fails on Leconte. Building with gcc 11, 12, or 13, and with OpenBLAS or MKL, produces the same errors in the same cases.
To Reproduce
416aec9 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #109 from abouteiller/bugfix/dtd_gpu Aurelien Bouteiller 2 weeks ago
icldisco/parsec#adabbd4d1fb580358a32d489df19fa9c05a316e1 parsec (v1.1.0-4718-gadabbd4d)
SLURM_TIMELIMIT=1 OMPI_MCA_rmaps_base_oversubscribe=true salloc -wguyot ctest -R dplasma_dgetrf_1d_mpi --repeat until-fail:1 --verbose
salloc: Granted job allocation 5500
UpdateCTestConfiguration from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
UpdateCTestConfiguration from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Test project /home/bouteill/parsec/dplasma/build.cuda
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 340
Start 340: dplasma_dgetrf_1d_mpi
340: Test command: /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec "-n" "4" "./testing_dgetrf_1d" "-N" "378" "-t" "19" "-P" "1" "-x" "-v=5"
340: Working Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
340: Environment variables:
340: PARSEC_MCA_device_cuda_enabled=0
340: PARSEC_MCA_device_hip_enabled=0
340: PARSEC_MCA_device_level_zero_enabled=0
340: PARSEC_MCA_device_cuda_memory_use=10
340: PARSEC_MCA_device_hip_memory_use=10
340: PARSEC_MCA_device_level_zero_memory_use=10
340: Test timeout computed to be: 1500
340: W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
340: [ 2] TIME(s) 0.11725 : PaRSEC initialized
340: #+++++ cores detected : 128
340: #+++++ nodes x cores + gpu : 4 x 128 + 0 (512+0)
340: #+++++ thread mode : THREAD_SERIALIZED
340: #+++++ P x Q : 1 x 4 (4/4)
340: #+++++ M x N x K|NRHS : 378 x 378 x 1
340: #+++++ LDA , LDB : 378 , 378
340: #+++++ MB x NB , IB : 19 x 19 , 40
340: [ 0] TIME(s) 0.11894 : PaRSEC initialized
340: [ 3] TIME(s) 0.11955 : PaRSEC initialized
340: [ 1] TIME(s) 0.12168 : PaRSEC initialized
340: W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
340: This is often unintentional, and will perform poorly.
340: Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
340: and hide the real binding from PaRSEC; if you verified that the binding is correct,
340: this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
340: +++ Generate matrices ... Done
340: +++ Generate matrices ... Done
340: +++ Generate matrices ... Done
340: +++ Generate matrices ... Done
340: +++ Computing getrf ... [****] TIME(s) 9.45201 : dgetrf_1d PxQxg= 1 4 0 NB= 19 N= 378 : 0.003802 gflops - ENQ&PROG&DEST 9.52389 : 0.003773 gflops - ENQ 0.04388 - DEST 0.02800
340: +----------------------------------------------------------------------------------------------------------------------------+
340: | | | Data In | Data Out |
340: |Rank 0 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: | Dev 0 | 756 | 100.00 | 0.00 B | 0.00 B( -nan) | 0.00 B( -nan) | 0.00 B | 0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs | 756 | 100.00 | 0.00 B | 1.00 B(nan) | 0.00 B(nan) | 0.00 B | 1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: <DartMeasurement name="performance" type="numeric/double"
340: encoding="none" compression="none">
340: 0.0038019
340: </DartMeasurement>
340: Done.
340: +++ Computing getrf ... +----------------------------------------------------------------------------------------------------------------------------+
340: | | | Data In | Data Out |
340: |Rank 1 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: | Dev 0 | 811 | 100.00 | 0.00 B | 0.00 B( -nan) | 0.00 B( -nan) | 0.00 B | 0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs | 811 | 100.00 | 0.00 B | 1.00 B(nan) | 0.00 B(nan) | 0.00 B | 1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: Done.
340: +++ Computing getrf ... +----------------------------------------------------------------------------------------------------------------------------+
340: | | | Data In | Data Out |
340: |Rank 3 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: | Dev 0 | 906 | 100.00 | 0.00 B | 0.00 B( -nan) | 0.00 B( -nan) | 0.00 B | 0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs | 906 | 100.00 | 0.00 B | 1.00 B(nan) | 0.00 B(nan) | 0.00 B | 1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: Done.
340: +++ Computing getrf ... +----------------------------------------------------------------------------------------------------------------------------+
340: | | | Data In | Data Out |
340: |Rank 2 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: | Dev 0 | 861 | 100.00 | 0.00 B | 0.00 B( -nan) | 0.00 B( -nan) | 0.00 B | 0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs | 861 | 100.00 | 0.00 B | 1.00 B(nan) | 0.00 B(nan) | 0.00 B | 1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: Done.
340: ============
340: Checking the Residual of the solution
340: -- ||A||_oo = 1.025373e+02, ||X||_oo = 1.202008e+01, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 3.394100e+01
340: -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 6.559297e+11
340: -- Solution is suspicious !
340: --------------------------------------------------------------------------
340: Primary job terminated normally, but 1 process returned
340: a non-zero exit code. Per user-direction, the job has been aborted.
340: --------------------------------------------------------------------------
340: --------------------------------------------------------------------------
340: mpiexec detected that one or more processes exited with non-zero status, thus causing
340: the job to be terminated. The first process to do so was:
340:
340: Process name: [[26343,1],3]
340: Exit code: 1
340: --------------------------------------------------------------------------
1/1 Test #340: dplasma_dgetrf_1d_mpi ............***Failed 18.75 sec
0% tests passed, 1 tests failed out of 1
Label Time Summary:
dplasma = 18.75 sec*proc (1 test)
mpi = 18.75 sec*proc (1 test)
Total Test time (real) = 18.77 sec
The following tests FAILED:
340 - dplasma_dgetrf_1d_mpi (Failed)
Errors while running CTest
Output from these tests are in: /home/bouteill/parsec/dplasma/build.cuda/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
salloc: Relinquishing job allocation 5500
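For reference, the check reported above computes the standard backward-error ratio ||Ax-B||_oo / ((||A||_oo ||x||_oo + ||B||_oo) * N * eps). Below is a minimal standalone sketch of that criterion, not DPLASMA's actual checker: the function name and the pass threshold of 60 are illustrative assumptions, and eps is taken as LAPACK's relative machine precision (DBL_EPSILON/2), which reproduces the 6.559297e+11 figure when the norms from the run above are plugged in. A value that large is many orders of magnitude beyond what a correct factorization would produce.

/* Sketch of the residual criterion reported by the test (not DPLASMA's code):
 * result = ||A x - B||_oo / ((||A||_oo * ||x||_oo + ||B||_oo) * N * eps)   */
#include <float.h>
#include <math.h>
#include <stdio.h>

/* Hypothetical helper: returns 0 if the solution looks acceptable, 1 otherwise. */
static int residual_check_sketch(double normA, double normX, double normB,
                                 double normR, int N)
{
    double eps = DBL_EPSILON / 2.0;  /* LAPACK's dlamch('e') */
    double result = normR / ((normA * normX + normB) * (double)N * eps);
    printf("-- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = %e\n", result);
    /* Assumed threshold: a correct solve stays within a small multiple of 1. */
    return (isnan(result) || isinf(result) || result > 60.0);
}

int main(void)
{
    /* Norms copied from the failing run above: prints ~6.559297e+11 and
     * flags the solution as suspicious. */
    return residual_check_sketch(1.025373e+02, 1.202008e+01, 5.000000e-01,
                                 3.394100e+01, 378);
}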
module list
Currently Loaded Modulefiles:
1) ncurses/6.4/gcc-11.3.1-6rvznd 34) pmix/3.2.3/gcc-11.3.1-b6ek7p 67) mpfr/4.2.0/gcc-11.3.1-n3mu53
2) htop/3.2.2/gcc-11.3.1-xm6i3t 35) slurm/22.05.9/gcc-11.3.1-yqiafz 68) mpc/1.3.1/gcc-11.3.1-2x6jci
3) nghttp2/1.52.0/gcc-11.3.1-yzhzx5 36) gdrcopy/2.3/gcc-11.3.1-zm6nhb 69) gcc/13.2.0/gcc-11.3.1-ir6jns
4) zlib/1.2.13/gcc-11.3.1-uhneca 37) libnl/3.3.0/gcc-11.3.1-s2rfpt 70) openblas/0.3.23/gcc-11.3.1-zo7k5r
5) openssl/3.1.2/gcc-11.3.1-w3u2b2 38) rdma-core/41.0/gcc-11.3.1-zlh7l5
6) curl/8.1.2/gcc-11.3.1-dhcq4d 39) ucx/1.14.0/gcc-11.3.1-6ffd5t
7) libmd/1.0.4/gcc-11.3.1-yl2qth 40) openmpi/4.1.5/gcc-11.3.1-2rgaqk
8) libbsd/0.11.7/gcc-11.3.1-rxtb5h 41) gperf/3.1/gcc-11.3.1-lq7yw2
9) expat/2.5.0/gcc-11.3.1-z3mywy 42) jemalloc/5.3.0/gcc-11.3.1-gnjgyl
10) bzip2/1.0.8/gcc-11.3.1-g7buii 43) libuv/1.44.1/gcc-11.3.1-ikknoi
11) libiconv/1.17/gcc-11.3.1-h5tewp 44) unzip/6.0/gcc-11.3.1-xm5nhk
12) xz/5.4.1/gcc-11.3.1-ybherp 45) lua-luajit-openresty/2.1-20230410/gcc-11.3.1-lgkuf6
13) libxml2/2.10.3/gcc-11.3.1-jijod2 46) libluv/1.44.2-1/gcc-11.3.1-pyqvat
14) pigz/2.7/gcc-11.3.1-2ysjo2 47) unibilium/2.0.0/gcc-11.3.1-az5pko
15) zstd/1.5.5/gcc-11.3.1-maqtnh 48) libtermkey/0.22/gcc-11.3.1-gwvd67
16) tar/1.34/gcc-11.3.1-jl543d 49) libvterm/0.3.1/gcc-11.3.1-we43r4
17) gettext/0.21.1/gcc-11.3.1-sgm6rr 50) lua-lpeg/1.0.2-1/gcc-11.3.1-6e6xv6
18) libunistring/1.1/gcc-11.3.1-mswbrm 51) msgpack-c/3.1.1/gcc-11.3.1-pzscaq
19) libidn2/2.3.4/gcc-11.3.1-kp77oe 52) lua-mpack/1.0.9/gcc-11.3.1-z26msa
20) krb5/1.20.1/gcc-11.3.1-hb7cxy 53) tree-sitter/0.20.8/gcc-11.3.1-pgy6wn
21) libedit/3.1-20210216/gcc-11.3.1-b2res4 54) neovim/0.9.1/gcc-11.3.1-aro6rp
22) libxcrypt/4.4.35/gcc-11.3.1-v7ot4t 55) cmake/3.26.3/gcc-11.3.1-6bgawm
23) openssh/9.3p1/gcc-11.3.1-jo2led 56) ninja/1.11.1/gcc-11.3.1-qf72ao
24) pcre2/10.42/gcc-11.3.1-bk6jhf 57) gmp/6.2.1/gcc-11.3.1-c5vz5h
25) berkeley-db/18.1.40/gcc-11.3.1-yl6wjj 58) libffi/3.4.4/gcc-11.3.1-suq3vd
26) readline/8.2/gcc-11.3.1-b26lae 59) sqlite/3.42.0/gcc-11.3.1-trzf26
27) gdbm/1.23/gcc-11.3.1-6u5vme 60) util-linux-uuid/2.38.1/gcc-11.3.1-h4vnny
28) perl/5.38.0/gcc-11.3.1-r63sx3 61) python/3.10.12/gcc-11.3.1-msankb
29) git/2.41.0/gcc-11.3.1-tx4xbg 62) gdb/13.1/gcc-11.3.1-awps3c
30) cuda/11.8.0/gcc-11.3.1-vltbfy 63) libevent/2.1.12/gcc-11.3.1-iqf4hw
31) libpciaccess/0.17/gcc-11.3.1-qp6jxc 64) tmux/3.3a/gcc-11.3.1-nt2vwg
32) hwloc/2.9.1/gcc-11.3.1-hvnu6p 65) cscope/15.9/gcc-11.3.1-4duk6k
33) numactl/2.0.14/gcc-11.3.1-x35xlq 66) exuberant-ctags/5.8/gcc-11.3.1-f56ide