Description
We thought we had a real deadlock, but it was caused by passing bad parameters to the tester. It would still be good to detect such parameters and warn about them.
Investigation: This is a red herring. We don't have enough space to store a single t=3200 tile in this example. At runtime, it is hard to tell this problem apart from the normal case of just waiting for space to free up.
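To make the space argument concrete, here is a minimal sketch of the kind of up-front check that could warn about such parameters. It is not PaRSEC code: `tile_bytes`, `tile_fits`, and the `block_bytes` parameter are hypothetical names, and treating the device pool as `nb_blocks * block_bytes` is an assumption about how `device_cuda_memory_number_of_blocks` translates into capacity. The actual block size is not stated in this issue.

```c
#include <stddef.h>

/* Hypothetical helper (not a PaRSEC function): bytes needed to hold
 * one square tile of tile_dim x tile_dim elements. */
static size_t tile_bytes(size_t tile_dim, size_t elem_size)
{
    return tile_dim * tile_dim * elem_size;
}

/* Hypothetical check: returns 1 if a single tile fits in a device pool
 * made of nb_blocks blocks of block_bytes each, 0 otherwise.
 * ASSUMPTION: the pool capacity is nb_blocks * block_bytes. */
static int tile_fits(size_t tile_dim, size_t elem_size,
                     size_t nb_blocks, size_t block_bytes)
{
    return tile_bytes(tile_dim, elem_size) <= nb_blocks * block_bytes;
}
```

For the run below, a single-precision tile with t=3200 needs 3200 * 3200 * 4 = 40,960,000 bytes (about 39 MiB). If 21 blocks add up to less than that, the first reservation can never succeed, and a check like this could emit a warning before the first GPU task is scheduled.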
It livelocks on the first GPU task
OMPI_MCA_mpi_abort_print_stack=true OMPI_MCA_mpi_abort_delay=-1 PMIX_MCA_psec='' SLURM_TIMELIMIT=90 salloc -wleconte -n8 -N1 /usr/bin/srun "-n" "1" --cpu-bind=socket,verbose "tests/testing_spotrf" -c 4 "-N" "19400" "-t" "3200" "-v=5" "-g" "1" "-P" "1" "--" "--mca" "device_cuda_memory_number_of_blocks" "21" --mca comm_verbose 0 --mca debug_verbose 10 --mca bind_threads 0 2>&1 | tee bleh
d@00000 Thread 0 of VP 0 Execute potrf_spotrf(0)[0[, 0][, 1]]<343>{14} chore 0 device 1:cuda(0) @__parsec_execute:147
d@00000 GPU[1:cuda(0)]: Entering GPU management @parsec_device_kernel_scheduler:2514
d@00000 GPU[1:cuda(0)]: Upload data (if any) for potrf_spotrf(0)[0[, 0][, 1]]<343>{14} @parsec_device_kernel_scheduler:2531
d@00000 GPU[1:cuda(0)]: Try to Push potrf_spotrf(0)[0[, 0][, 1]]<343>{14} @parsec_device_kernel_push:1987
d@00000 GPU[1:cuda(0)]:potrf_spotrf(0)[0[, 0][, 1]]<343>{14}: Request space on GPU failed for flow T index 0/1 for task potrf_spotrf(0)[0[, 0][, 1]]<343>{14} @parsec_device_data_reserve_space:945
d@00000 GPU[1:cuda(0)]: GPU task 0x26bf03b0 has returned with ASYNC or AGAIN. Once the event trigger the task will be handled accordingly @parsec_device_progress_stream:1934
d@00000 GPU[1:cuda(0)]: GPU task 0x26bf03b0[0x7f87d000aeb0] is ready to be rescheduled on the same GPU device and same stream @parsec_device_progress_stream:1878
d@00000 GPU[1:cuda(0)]: GPU task 0x26bf03b0 has returned with ASYNC or AGAIN. Once the event trigger the task will be handled accordingly @parsec_device_progress_stream:1934
d@00000 GPU[1:cuda(0)]: GPU task 0x26bf03b0[0x7f87d000aeb0] is ready to be rescheduled on the same GPU device and same stream @parsec_device_progress_stream:1878
Note: this case also deadlocks in master, but in a way that is a lot less understandable.
Originally posted by @abouteiller in #733 (comment)