Deadlock in memory-constrained dplasma-GEMM #741

@abouteiller

Describe the bug

testing_sgemm deadlocks when GPU memory is severely constrained (device_cuda_memory_number_of_blocks set to 21).

To Reproduce

  • dplasma: 8ca297c6 (HEAD -> bugfix/gemm_gpu_missing_dtype, mine/bugfix/gemm_gpu_missing_dtype) Proper error checking when
  • parsec: 8f89995 (HEAD) Bail out of reserve_space on under-transfer WRITE flows

This requires PR ICLDisco/dplasma#139; without it, the test crashes before reaching the deadlock.

This will -really- deadlock:

OMPI_MCA_mpi_abort_print_stack=true OMPI_MCA_mpi_abort_delay=-1  PMIX_MCA_psec='' SLURM_TIMELIMIT=90 salloc -wleconte -n8 -N1  /usr/bin/srun "-n" "2" --cpu-bind=socket,verbose  "tests/testing_sgemm" -c 4 "-N" "1980" "-t" "320" "-v=5" "-g" "1" "-P" "1" "--" "--mca" "device_cuda_memory_number_of_blocks" "21" --mca comm_verbose 0 --mca debug_verbose 10 --mca bind_threads 0 2>&1  | tee bleh
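
To give a sense of how tight the memory constraint in this command is, here is a rough back-of-the-envelope sketch. It assumes the matrices are square (M = K = N, since only -N is given), and that each of the 21 GPU blocks holds one tile. Neither assumption is confirmed here, so treat the numbers as an illustration only.

```python
# Rough sizing for the reproducer above (a sketch under stated assumptions).
N = 1980             # -N: matrix dimension
TILE = 320           # -t: tile dimension
BYTES_PER_ELEM = 4   # single precision (SGEMM)
GPU_BLOCKS = 21      # --mca device_cuda_memory_number_of_blocks

# Tiles per dimension, rounding up for the partial last tile.
tiles_per_dim = (N + TILE - 1) // TILE

# Size of one full tile in bytes.
tile_bytes = TILE * TILE * BYTES_PER_ELEM

# ASSUMPTION: A, B and C are all square N x N matrices.
tiles_total = 3 * tiles_per_dim * tiles_per_dim

# ASSUMPTION: one GPU block holds exactly one tile.
print(f"tiles per dimension : {tiles_per_dim}")
print(f"bytes per full tile : {tile_bytes}")
print(f"tiles in A, B and C : {tiles_total}")
print(f"GPU blocks available: {GPU_BLOCKS}")
```

The computed tile size agrees with the 409600-byte chunk reported by parsec_arena_release_chunk in the log below. Under these assumptions, a single device can hold only a small fraction of the tiles at any one time.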


....

d@00001 GPU[1:cuda(0)]: Pop GEMM(6, 5, 6)[6, 5, 6[, 1][, 0][, 1][, 2][, 2][, 2][, -1][, -1][, -1][, 4][, 6][, 6]]<0>{30} DONE (return 0 data epoch 282) @parsec_device_kernel_pop:2288
d@00001 GPU[1:cuda(0)]: Complete GEMM(6, 5, 6)[6, 5, 6[, 1][, 0][, 1][, 2][, 2][, 2][, -1][, -1][, -1][, 4][, 6][, 6]]<0>{30} @parsec_device_kernel_scheduler:2646
d@00001 GPU[1:cuda(0)]: Epilog of GEMM(6, 5, 6)[6, 5, 6[, 1][, 0][, 1][, 2][, 2][, 2][, -1][, -1][, -1][, 4][, 6][, 6]]<0>{30} @parsec_device_kernel_epilog:2310
d@00001 GPU[1:cuda(0)]: CPU copy 0x7f3a74001170 [ref_count 2] gets the same version 9 as GPU copy 0x7f3a7552b590 [ref_count 1] @parsec_device_kernel_epilog:2355
d@00001 Arena:  push a data of size 409600 from arena 0x36874350, aligned by 16, base ptr 0x7f3aa04bce90, data ptr 0x7f3aa04bcef0, sizeof prefix 96(96) @parsec_arena_release_chunk:163
d@00001 TERMDET-LOCAL:  TASKPOOL 0x368821d0 NB_TASKS 1 -> 0 @parsec_termdet_local_taskpool_addto_nb_tasks:221
d@00001 TERMDET-LOCAL:  TASKPOOL 0x368821d0  NB_PA 1 -> 0 @parsec_termdet_local_taskpool_addto_nb_tasks:233
d@00001 TERMDET-LOCAL   TASKPOOL 0x368821d0: termination detected @parsec_termdet_local_termination_detected:146
d@00001 TERMDET-LOCAL   TASKPOOL 0x368821d0: calling callback @parsec_termdet_local_termination_detected:149
d@00001 GPU[1:cuda(0)]: gpu_task 0x7f3a7801b950 freed @parsec_device_kernel_scheduler:2665
d@00001 GPU[1:cuda(0)]: Leaving GPU management @parsec_device_kernel_scheduler:2677


d@00000 Activate mask dep for C:GEMM(1, 6, 0)[1, 6, 0[, 0][, 0][, 0][, 0][, 3][, 0][, 0][, 3][, 1][, 7][, 3][, 8]]<0>{30} (current 0x0 now 0x1 goal 0xf) from C:READ_C(1, 6)[1, 6[, 0][, 0][, 0][, 8]]<0>{30} @parsec_update_deps_with_mask:1603
d@00000 Thread 1 of VP 0 Execute READ_C(0, 6)[0, 6[, 0][, 0][, 0][, 3]]<0>{30} chore 0 device 0:cpu-cores @__parsec_execute:147
d@00000   => Service GEMM(1, 6, 0)[1, 6, 0[, 0][, 0][, 0][, 0][, 3][, 0][, 0][, 3][, 1][, 7][, 3][, 8]]<0>{30} not yet ready @parsec_release_local_OUT_dependencies:1759
d@00000 TERMDET-LOCAL:  TASKPOOL 0xdd963e0 NB_TASKS 121 -> 120 @parsec_termdet_local_taskpool_addto_nb_tasks:221
d@00000 Activate dependencies for GEMM(0, 6, 0)[0, 6, 0[, 0][, 0][, 0][, 0][, 3][, 0][, 0][, 3][, 1][, 1][, 3][, 3]]<0>{30} flags = 0x0021 @parsec_release_local_OUT_dependencies:1698
d@00000 Activate mask dep for C:GEMM(0, 6, 0)[0, 6, 0[, 0][, 0][, 0][, 0][, 3][, 0][, 0][, 3][, 1][, 1][, 3][, 3]]<0>{30} (current 0x0 now 0x1 goal 0xf) from C:READ_C(0, 6)[0, 6[, 0][, 0][, 0][, 3]]<0>{30} @parsec_update_deps_with_mask:1603
d@00000   => Service GEMM(0, 6, 0)[0, 6, 0[, 0][, 0][, 0][, 0][, 3][, 0][, 0][, 3][, 1][, 1][, 3][, 3]]<0>{30} not yet ready @parsec_release_local_OUT_dependencies:1759
d@00000 TERMDET-LOCAL:  TASKPOOL 0xdd963e0 NB_TASKS 120 -> 119 @parsec_termdet_local_taskpool_addto_nb_tasks:221

Essentially, rank 1 thinks it is done (its taskpool reports termination), while rank 0 still has 100+ tasks to complete (NB_TASKS is at 119 in the last line shown).
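
To check this kind of mismatch in a long log, the TERMDET-LOCAL lines can be summarized per rank. The following is a minimal sketch, not part of PaRSEC: the `summarize` helper is hypothetical, its regular expressions are written against the line format seen in the excerpt above, and it assumes one taskpool of interest per rank.

```python
import re
from collections import defaultdict

# Written against debug lines such as:
#   d@00000 TERMDET-LOCAL:  TASKPOOL 0xdd963e0 NB_TASKS 120 -> 119 @...
#   d@00001 TERMDET-LOCAL   TASKPOOL 0x368821d0: termination detected @...
NB_TASKS_RE = re.compile(
    r"^d@(\d+)\s+TERMDET-LOCAL:?\s+TASKPOOL\s+(0x[0-9a-fA-F]+)"
    r"\s+NB_TASKS\s+(-?\d+)\s+->\s+(-?\d+)"
)
TERMINATED_RE = re.compile(
    r"^d@(\d+)\s+TERMDET-LOCAL:?\s+TASKPOOL\s+(0x[0-9a-fA-F]+):"
    r"\s+termination detected"
)


def summarize(lines):
    """Return {rank: {"pending": last NB_TASKS value, "terminated": bool}}.

    ASSUMPTION: a single taskpool per rank; with several taskpools the
    "pending" value is simply the most recent NB_TASKS update seen.
    """
    state = defaultdict(lambda: {"pending": None, "terminated": False})
    for line in lines:
        m = NB_TASKS_RE.match(line)
        if m:
            state[int(m.group(1))]["pending"] = int(m.group(4))
            continue
        m = TERMINATED_RE.match(line)
        if m:
            state[int(m.group(1))]["terminated"] = True
    return dict(state)


# Three lines copied from the log excerpt above.
sample = [
    "d@00001 TERMDET-LOCAL:  TASKPOOL 0x368821d0 NB_TASKS 1 -> 0 "
    "@parsec_termdet_local_taskpool_addto_nb_tasks:221",
    "d@00001 TERMDET-LOCAL   TASKPOOL 0x368821d0: termination detected "
    "@parsec_termdet_local_termination_detected:146",
    "d@00000 TERMDET-LOCAL:  TASKPOOL 0xdd963e0 NB_TASKS 120 -> 119 "
    "@parsec_termdet_local_taskpool_addto_nb_tasks:221",
]

for rank, info in sorted(summarize(sample).items()):
    print(f"rank {rank}: pending={info['pending']} terminated={info['terminated']}")
```

On the full run, the same function could be applied to the file written by tee, for example `summarize(open("bleh"))`.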

Labels: bug
