Closed
Labels: bug
Description
Looks like the failure in GEMM is an as-yet-undiscovered pre-existing issue.
RESOLUTION: found the issue in DPLASMA: a missing dplasma_add2arena for the gpuNN GEMM.
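For reference, a minimal sketch of the kind of arena registration that was missing, modeled on what the sgemm wrapper already does for the other GEMM variants; the ADT index name, the dc_A descriptor, and the exact dplasma_add2arena_tile signature used here are assumptions, not the actual fix.

/* Sketch only (assumed names): register a tile datatype for the gpu_NN
 * taskpool's arena so remote deps can resolve dst_datatype for its flows. */
dplasma_add2arena_tile( &parsec_sgemm->arenas_datatypes[PARSEC_sgemm_NN_gpu_DEFAULT_ADT_IDX],
                        dc_A->mb * dc_A->nb * sizeof(float),
                        PARSEC_ARENA_ALIGNMENT_SSE,
                        parsec_datatype_float_t,
                        dc_A->mb );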
OMPI_MCA_mpi_abort_print_stack=true OMPI_MCA_mpi_abort_delay=-1 PMIX_MCA_psec='' SLURM_TIMELIMIT=10 salloc -wleconte -n8 -N1 /usr/bin/srun "-n" "2" "tests/testing_sgemm" -c 4 "-N" "1940" "-t" "320" "-v=5" "-g" "1" "-P" "1" "--" "--mca" "device_cuda_memory_number_of_blocks" "21" --mca comm_verbose 20
...
d@00001 MPI: Retrieve datatype with mask 0x1 (remote_dep_get_datatypes) remote size 16384 @remote_dep_get_datatypes:929
[leconte:1373806] [0] func:/apps/spacks/2025-02-03/opt/spack/linux-rocky9-x86_64/gcc-11.4.1/openmpi-5.0.5-ms6djs2dvfi7hy3aisvr35t6opaznt5j/lib/libopen-pal.so.80(opal_backtrace_buffer+0x23) [0x7f5a103d5783]
[leconte:1373806] [1] func:/apps/spacks/2025-02-03/opt/spack/linux-rocky9-x86_64/gcc-11.4.1/openmpi-5.0.5-ms6djs2dvfi7hy3aisvr35t6opaznt5j/lib/libmpi.so.40(ompi_mpi_abort+0x117) [0x7f5a108bd2f7]
[leconte:1373806] [2] func:/apps/spacks/2025-02-03/opt/spack/linux-rocky9-x86_64/gcc-11.4.1/openmpi-5.0.5-ms6djs2dvfi7hy3aisvr35t6opaznt5j/lib/libmpi.so.40(ompi_mpi_errors_are_fatal_comm_handler+0xda) [0x7f5a108acd0a]
[leconte:1373806] [3] func:/apps/spacks/2025-02-03/opt/spack/linux-rocky9-x86_64/gcc-11.4.1/openmpi-5.0.5-ms6djs2dvfi7hy3aisvr35t6opaznt5j/lib/libmpi.so.40(ompi_errhandler_invoke+0x165) [0x7f5a108ac0a5]
[leconte:1373806] [4] func:/home/bouteill/parsec/dplasma-master/build.cuda/parsec/parsec/libparsec.so.4(remote_dep_mpi_retrieve_datatype+0x511) [0x7f5a5d49ffa1]
In gdb:
if(output->data.remote.dst_datatype!=PARSEC_DATATYPE_NULL) MPI_Type_get_name(output->data.remote.dst_datatype, type_name_dst, &len);
(From the backtrace)
if( PARSEC_ITERATE_STOP == ontask(es, &nc, (const parsec_task_t *)this_task, &flow_of_sgemm_NN_gpu_READ_B_for_B_dep2_atline_200, &data, rank_src, rank_dst, vpid_dst, successor_repo, successor_repo_key, ontask_arg) )
We have output->data.remote.dst_datatype == NULL, which is not equal to PARSEC_DATATYPE_NULL (MPI_DATATYPE_NULL), so we go on, call MPI_Type_get_name, and crash MPI.
Two issues here:
- The dst_datatype should not be NULL? Presumably that flow has a type and we should have retrieved it. Explanation: this comes from GLOBAL_BARRIER Y, which is a CTL and thus has no type, so it first looked like a bug in get_datatype with CTL; in the end, the arena_datatypes in GEMM_NN_GPU were not filled.
- Should we compare to NULL instead of MPI_DATATYPE_NULL, or both? This should not crash but instead raise a clean error in parsec (see the sketch after this list). Issue parsec_fatal when the datatype_arenas have not been set in the PTG: parsec#739
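A minimal sketch of the defensive check suggested by the second point, assuming it sits next to the existing MPI_Type_get_name call in remote_dep_mpi_retrieve_datatype; the variable names are taken from the gdb excerpt above, and the error message wording is an assumption.

/* Sketch only: guard against both a zero-initialized handle (NULL) and an
 * explicit MPI_DATATYPE_NULL before asking MPI for the type name, and fail
 * cleanly in parsec instead of letting MPI abort (see parsec#739). */
if( NULL == output->data.remote.dst_datatype ) {
    parsec_fatal("retrieve_datatype: no arena datatype registered for this flow "
                 "(was dplasma_add2arena called for this taskpool?)");
} else if( PARSEC_DATATYPE_NULL != output->data.remote.dst_datatype ) {
    MPI_Type_get_name(output->data.remote.dst_datatype, type_name_dst, &len);
}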
It is not immediately clear why or whether this is related to the PR, or whether we just fixed another issue that was masking this one.
Originally posted by @abouteiller in ICLDisco/parsec#733 (comment)