Open
Labels: bug
Description
Describe the bug
GEQRF (and its derivatives, such as LQ and SORMQR) uses more than the hard-coded limit of 2 GPU workspaces.
Important note
After #114 this error will no longer manifest in normal ctest/CI (the test is forced to run on CPU only), but it can still be reproduced by hand. The fix PR should add a dedicated QR+GPU test to cover this case explicitly.
To Reproduce
Run ctest on Leconte:

```
SLURM_TIMELIMIT=2 PARSEC_MCA_device_cuda_memory_use=20 OMPI_MCA_rmaps_base_oversubscribe=true salloc -N1 -wleconte ctest --rerun-failed
```
```
125/437 Test: dplasma_sgeqrf_shm
Command: "/usr/bin/srun" "./testing_sgeqrf" "-M" "487" "-N" "283" "-K" "97" "-t" "56" "-x" "-v=5"
Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
"dplasma_sgeqrf_shm" start time: Jan 31 19:38 EST
Output:
----------------------------------------------------------
srun: Job 4994 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 4994
[1706747884.458034] [leconte:2566339:0] ucp_context.c:1081 UCX WARN network device 'mlx5_0:1' is not available, please use one or more of: 'docker0'(tcp), 'enp1s0f0'(tcp), 'enp1s0f1'(tcp), 'lo'(tcp)
W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
#+++++ cores detected       : 40
#+++++ nodes x cores + gpu  : 1 x 40 + 0 (40+0)
#+++++ thread mode          : THREAD_SERIALIZED
#+++++ P x Q                : 1 x 1 (1/1)
#+++++ M x N x K|NRHS       : 487 x 283 x 97
#+++++ LDA , LDB            : 487 , 487
#+++++ MB x NB , IB         : 56 x 56 , 32
#+++++ KP x KQ              : 4 x 1
x@00000 parsec_device_pop_workspace: user requested more than 2 GPU workspaces which is the current hard-coded limit per GPU stream
@parsec_device_pop_workspace:206 (leconte:2566339)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
slurmstepd: error: *** STEP 4994.4 ON leconte CANCELLED AT 2024-02-01T00:38:06 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: leconte: task 0: Exited with exit code 250
<end of output>
Test time =   3.17 sec
----------------------------------------------------------
Test Failed.
```
Proposed fix
- Deprecate workspaces in PaRSEC
- Use the GPU info handles to provide more than 2 workspaces per stream
Environment
- Dplasma: 416aec9 (origin/master, origin/HEAD, master) Merge pull request #109 from abouteiller/bugfix/dtd_gpu ("bugfix: we must count the actual number of cuda devices") Aurelien Bouteiller, 22 hours ago
- Parsec: adabbd4d (origin/master, origin/HEAD, master) Merge pull request #620 from bosilca/fix/osx_warning Thomas Herault 7 days ago
- config.log:

```
../configure --prefix=/home/bouteill/parsec/dplasma/build.cuda --with-cuda --without-hip --enable-debug=noisier\,paranoid
```
```
Currently Loaded Modulefiles:
 1) ncurses/6.4/gcc-11.3.1-6rvznd            25) berkeley-db/18.1.40/gcc-11.3.1-yl6wjj              49) libvterm/0.3.1/gcc-11.3.1-we43r4
 2) htop/3.2.2/gcc-11.3.1-xm6i3t             26) readline/8.2/gcc-11.3.1-b26lae                     50) lua-lpeg/1.0.2-1/gcc-11.3.1-6e6xv6
 3) nghttp2/1.52.0/gcc-11.3.1-yzhzx5         27) gdbm/1.23/gcc-11.3.1-6u5vme                        51) msgpack-c/3.1.1/gcc-11.3.1-pzscaq
 4) zlib/1.2.13/gcc-11.3.1-uhneca            28) perl/5.38.0/gcc-11.3.1-r63sx3                      52) lua-mpack/1.0.9/gcc-11.3.1-z26msa
 5) openssl/3.1.2/gcc-11.3.1-w3u2b2          29) git/2.41.0/gcc-11.3.1-tx4xbg                       53) tree-sitter/0.20.8/gcc-11.3.1-pgy6wn
 6) curl/8.1.2/gcc-11.3.1-dhcq4d             30) cuda/11.8.0/gcc-11.3.1-vltbfy                      54) neovim/0.9.1/gcc-11.3.1-aro6rp
 7) libmd/1.0.4/gcc-11.3.1-yl2qth            31) libpciaccess/0.17/gcc-11.3.1-qp6jxc                55) cmake/3.26.3/gcc-11.3.1-6bgawm
 8) libbsd/0.11.7/gcc-11.3.1-rxtb5h          32) hwloc/2.9.1/gcc-11.3.1-hvnu6p                      56) ninja/1.11.1/gcc-11.3.1-qf72ao
 9) expat/2.5.0/gcc-11.3.1-z3mywy            33) numactl/2.0.14/gcc-11.3.1-x35xlq                   57) gmp/6.2.1/gcc-11.3.1-c5vz5h
10) bzip2/1.0.8/gcc-11.3.1-g7buii            34) pmix/3.2.3/gcc-11.3.1-b6ek7p                       58) libffi/3.4.4/gcc-11.3.1-suq3vd
11) libiconv/1.17/gcc-11.3.1-h5tewp          35) slurm/22.05.9/gcc-11.3.1-yqiafz                    59) sqlite/3.42.0/gcc-11.3.1-trzf26
12) xz/5.4.1/gcc-11.3.1-ybherp               36) gdrcopy/2.3/gcc-11.3.1-zm6nhb                      60) util-linux-uuid/2.38.1/gcc-11.3.1-h4vnny
13) libxml2/2.10.3/gcc-11.3.1-jijod2         37) libnl/3.3.0/gcc-11.3.1-s2rfpt                      61) python/3.10.12/gcc-11.3.1-msankb
14) pigz/2.7/gcc-11.3.1-2ysjo2               38) rdma-core/41.0/gcc-11.3.1-zlh7l5                   62) gdb/13.1/gcc-11.3.1-awps3c
15) zstd/1.5.5/gcc-11.3.1-maqtnh             39) ucx/1.14.0/gcc-11.3.1-6ffd5t                       63) libevent/2.1.12/gcc-11.3.1-iqf4hw
16) tar/1.34/gcc-11.3.1-jl543d               40) openmpi/4.1.5/gcc-11.3.1-2rgaqk                    64) tmux/3.3a/gcc-11.3.1-nt2vwg
17) gettext/0.21.1/gcc-11.3.1-sgm6rr         41) gperf/3.1/gcc-11.3.1-lq7yw2                        65) cscope/15.9/gcc-11.3.1-4duk6k
18) libunistring/1.1/gcc-11.3.1-mswbrm       42) jemalloc/5.3.0/gcc-11.3.1-gnjgyl                   66) exuberant-ctags/5.8/gcc-11.3.1-f56ide
19) libidn2/2.3.4/gcc-11.3.1-kp77oe          43) libuv/1.44.1/gcc-11.3.1-ikknoi                     67) intel-oneapi-tbb/2021.10.0/gcc-11.3.1-ptv4p2
20) krb5/1.20.1/gcc-11.3.1-hb7cxy            44) unzip/6.0/gcc-11.3.1-xm5nhk                        68) intel-oneapi-mkl/2023.2.0/gcc-11.3.1-d5uffv
21) libedit/3.1-20210216/gcc-11.3.1-b2res4   45) lua-luajit-openresty/2.1-20230410/gcc-11.3.1-lgkuf6 69) mpfr/4.2.0/gcc-11.3.1-n3mu53
22) libxcrypt/4.4.35/gcc-11.3.1-v7ot4t       46) libluv/1.44.2-1/gcc-11.3.1-pyqvat                  70) mpc/1.3.1/gcc-11.3.1-2x6jci
23) openssh/9.3p1/gcc-11.3.1-jo2led          47) unibilium/2.0.0/gcc-11.3.1-az5pko                  71) gcc/13.2.0/gcc-11.3.1-ir6jns
24) pcre2/10.42/gcc-11.3.1-bk6jhf            48) libtermkey/0.22/gcc-11.3.1-gwvd67
```