Open
Labels
bug (Something isn't working)
Description
Describe the bug
SGEMM deadlocks under tight GPU memory constraints (here, `device_cuda_memory_number_of_blocks` set to 21).
To Reproduce
- dplasma: 8ca297c6 (HEAD -> bugfix/gemm_gpu_missing_dtype, mine/bugfix/gemm_gpu_missing_dtype) Proper error checking when
- parsec: 8f89995 (HEAD) Bail out of reserve_space on under-transfer WRITE flows
PR ICLDisco/dplasma#139 must be applied; otherwise the run crashes before reaching the deadlock.
The following command will really deadlock:
OMPI_MCA_mpi_abort_print_stack=true OMPI_MCA_mpi_abort_delay=-1 PMIX_MCA_psec='' SLURM_TIMELIMIT=90 salloc -wleconte -n8 -N1 /usr/bin/srun "-n" "2" --cpu-bind=socket,verbose "tests/testing_sgemm" -c 4 "-N" "1980" "-t" "320" "-v=5" "-g" "1" "-P" "1" "--" "--mca" "device_cuda_memory_number_of_blocks" "21" --mca comm_verbose 0 --mca debug_verbose 10 --mca bind_threads 0 2>&1 | tee bleh
....
d@00001 GPU[1:cuda(0)]: Pop GEMM(6, 5, 6)[6, 5, 6[, 1][, 0][, 1][, 2][, 2][, 2][, -1][, -1][, -1][, 4][, 6][, 6]]<0>{30} DONE (return 0 data epoch 282) @parsec_device_kernel_pop:2288
d@00001 GPU[1:cuda(0)]: Complete GEMM(6, 5, 6)[6, 5, 6[, 1][, 0][, 1][, 2][, 2][, 2][, -1][, -1][, -1][, 4][, 6][, 6]]<0>{30} @parsec_device_kernel_scheduler:2646
d@00001 GPU[1:cuda(0)]: Epilog of GEMM(6, 5, 6)[6, 5, 6[, 1][, 0][, 1][, 2][, 2][, 2][, -1][, -1][, -1][, 4][, 6][, 6]]<0>{30} @parsec_device_kernel_epilog:2310
d@00001 GPU[1:cuda(0)]: CPU copy 0x7f3a74001170 [ref_count 2] gets the same version 9 as GPU copy 0x7f3a7552b590 [ref_count 1] @parsec_device_kernel_epilog:2355
d@00001 Arena: push a data of size 409600 from arena 0x36874350, aligned by 16, base ptr 0x7f3aa04bce90, data ptr 0x7f3aa04bcef0, sizeof prefix 96(96) @parsec_arena_release_chunk:163
d@00001 TERMDET-LOCAL: TASKPOOL 0x368821d0 NB_TASKS 1 -> 0 @parsec_termdet_local_taskpool_addto_nb_tasks:221
d@00001 TERMDET-LOCAL: TASKPOOL 0x368821d0 NB_PA 1 -> 0 @parsec_termdet_local_taskpool_addto_nb_tasks:233
d@00001 TERMDET-LOCAL TASKPOOL 0x368821d0: termination detected @parsec_termdet_local_termination_detected:146
d@00001 TERMDET-LOCAL TASKPOOL 0x368821d0: calling callback @parsec_termdet_local_termination_detected:149
d@00001 GPU[1:cuda(0)]: gpu_task 0x7f3a7801b950 freed @parsec_device_kernel_scheduler:2665
d@00001 GPU[1:cuda(0)]: Leaving GPU management @parsec_device_kernel_scheduler:2677
d@00000 Activate mask dep for C:GEMM(1, 6, 0)[1, 6, 0[, 0][, 0][, 0][, 0][, 3][, 0][, 0][, 3][, 1][, 7][, 3][, 8]]<0>{30} (current 0x0 now 0x1 goal 0xf) from C:READ_C(1, 6)[1, 6[, 0][, 0][, 0][, 8]]<0>{30} @parsec_update_deps_with_mask:1603
d@00000 Thread 1 of VP 0 Execute READ_C(0, 6)[0, 6[, 0][, 0][, 0][, 3]]<0>{30} chore 0 device 0:cpu-cores @__parsec_execute:147
d@00000 => Service GEMM(1, 6, 0)[1, 6, 0[, 0][, 0][, 0][, 0][, 3][, 0][, 0][, 3][, 1][, 7][, 3][, 8]]<0>{30} not yet ready @parsec_release_local_OUT_dependencies:1759
d@00000 TERMDET-LOCAL: TASKPOOL 0xdd963e0 NB_TASKS 121 -> 120 @parsec_termdet_local_taskpool_addto_nb_tasks:221
d@00000 Activate dependencies for GEMM(0, 6, 0)[0, 6, 0[, 0][, 0][, 0][, 0][, 3][, 0][, 0][, 3][, 1][, 1][, 3][, 3]]<0>{30} flags = 0x0021 @parsec_release_local_OUT_dependencies:1698
d@00000 Activate mask dep for C:GEMM(0, 6, 0)[0, 6, 0[, 0][, 0][, 0][, 0][, 3][, 0][, 0][, 3][, 1][, 1][, 3][, 3]]<0>{30} (current 0x0 now 0x1 goal 0xf) from C:READ_C(0, 6)[0, 6[, 0][, 0][, 0][, 3]]<0>{30} @parsec_update_deps_with_mask:1603
d@00000 => Service GEMM(0, 6, 0)[0, 6, 0[, 0][, 0][, 0][, 0][, 3][, 0][, 0][, 3][, 1][, 1][, 3][, 3]]<0>{30} not yet ready @parsec_release_local_OUT_dependencies:1759
d@00000 TERMDET-LOCAL: TASKPOOL 0xdd963e0 NB_TASKS 120 -> 119 @parsec_termdet_local_taskpool_addto_nb_tasks:221
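Essentially, rank 1 thinks it is done (its taskpool reaches NB_TASKS 0 and termination is detected), while rank 0 still has 100+ tasks to complete (NB_TASKS 119 in the last line above).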
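To compare the ranks over the whole captured output rather than the excerpt above, a small script can pull the last `NB_TASKS` value per rank and taskpool from the log. This is only a sketch: the function name `last_nb_tasks` is hypothetical, and the regular expression assumes every debug line has the same shape as the `TERMDET-LOCAL` lines quoted in this report.

```python
import re

# Assumed line shape, copied from the excerpt in this report:
#   d@00001 TERMDET-LOCAL: TASKPOOL 0x368821d0 NB_TASKS 1 -> 0 @...
NB_TASKS_RE = re.compile(
    r"d@(?P<rank>\d+)\s+TERMDET-LOCAL:\s+TASKPOOL\s+(?P<tp>0x[0-9a-fA-F]+)\s+"
    r"NB_TASKS\s+(?P<old>-?\d+)\s+->\s+(?P<new>-?\d+)"
)


def last_nb_tasks(lines):
    """Return {(rank, taskpool): last NB_TASKS value seen} for the given log lines.

    Lines that do not match (including NB_PA updates) are ignored.
    """
    last = {}
    for line in lines:
        m = NB_TASKS_RE.search(line)
        if m:
            key = (int(m.group("rank")), m.group("tp"))
            last[key] = int(m.group("new"))  # later lines overwrite earlier ones
    return last
```

Calling `last_nb_tasks(open("bleh"))` on the file written by `tee` in the command above returns one entry per rank and taskpool; a rank whose last value is 0 while another rank is still well above 0 matches the situation described here.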