
shijin-aws (Contributor) commented Oct 9, 2025

Commit 815fe8e changed how ex_peer->super.base_handle is assigned
when CPU atomics are not available and the peer is local. A valid
base handle is required when use_memory_registration is true. This
patch fixes the issue.
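
The requirement stated above can be written as a small invariant check. This is a minimal sketch, not code from the patch: `sketch_handle_t`, `sketch_peer_t`, and `sketch_peer_handle_ok` are hypothetical stand-ins, and only the names `base_handle` and `use_memory_registration` come from the description.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a BTL registration handle. */
typedef struct {
    uint64_t base_addr;
} sketch_handle_t;

/* Hypothetical stand-in for a peer; only base_handle mirrors the description. */
typedef struct {
    const sketch_handle_t *base_handle;
} sketch_peer_t;

/* Returns true when the peer satisfies the invariant from the description:
 * if memory registration is in use, base_handle must not be NULL. */
static bool sketch_peer_handle_ok(const sketch_peer_t *peer,
                                  bool use_memory_registration)
{
    if (!use_memory_registration) {
        return true;
    }
    return NULL != peer->base_handle;
}
```

A check like this could sit in a debug assertion after peer setup, so that a missing handle is caught there rather than later in the BTL.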

shijin-aws (Contributor, Author) commented Oct 9, 2025

@devreal can you take a look? I was hitting a NULL pointer segfault when running IMB Accumulate on the main branch after #13330 was merged.

shijin-aws (Contributor, Author) commented

Backtrace:

#0  0x0000737b18d6faef in mca_btl_ofi_get (btl=0x5b9f173013e0, endpoint=0x737b10065520, local_address=0x5b9f1735cc18, remote_address=100738863024032,
    local_handle=0x5b9f180a6b60, remote_handle=0x0, size=4, flags=0, order=255, cbfunc=0x737b1918fc6d <ompi_osc_get_data_complete>, cbcontext=0x7fffb7dc06b6, cbdata=0x0)
    at btl_ofi_rdma.c:78
78      remote_address = (remote_address - (uint64_t) remote_handle->base_addr);
[Current thread is 1 (Thread 0x737b193c7780 (LWP 481239))]
(gdb) bt
#0  0x0000737b18d6faef in mca_btl_ofi_get (btl=0x5b9f173013e0, endpoint=0x737b10065520, local_address=0x5b9f1735cc18, remote_address=100738863024032,
    local_handle=0x5b9f180a6b60, remote_handle=0x0, size=4, flags=0, order=255, cbfunc=0x737b1918fc6d <ompi_osc_get_data_complete>, cbcontext=0x7fffb7dc06b6, cbdata=0x0)
    at btl_ofi_rdma.c:78
#1  0x0000737b1918fb5e in ompi_osc_rdma_btl_get (module=0x5b9f172ce2a0, btl_index=0 '\000', endpoint=0x737b10065520, local_address=0x5b9f1735cc18,
    remote_address=100738863024032, local_handle=0x5b9f180a6b60, remote_handle=0x0, size=4, flags=0, order=255, cbfunc=0x737b1918fc6d <ompi_osc_get_data_complete>,
    cbcontext=0x7fffb7dc06b6, cbdata=0x0) at /home/ubuntu/PortaFiducia/build/libraries/openmpi/main-debug/source/ompi/ompi/mca/osc/rdma/osc_rdma_btl_comm.h:66
#2  0x0000737b19190028 in ompi_osc_get_data_blocking (module=0x5b9f172ce2a0, btl_index=0 '\000', endpoint=0x737b10065520, source_address=100738863024032, source_handle=0x0,
    data=0x5b9f171af330, len=4) at osc_rdma_comm.c:109
#3  0x0000737b19197f3a in ompi_osc_rdma_gacc_contig (sync=0x5b9f172ce4a0, source=0x5b9f180a3010, source_count=1, source_datatype=0x5b9f12b52ea0 <ompi_mpi_float>, result=0x0,
    result_count=0, result_datatype=0x0, result_convertor=0x0, peer=0x5b9f181af0f0, target_address=100738863024032, target_handle=0x0, target_count=1,
    target_datatype=0x5b9f12b52ea0 <ompi_mpi_float>, op=0x5b9f12b50a80 <ompi_mpi_op_sum>, request=0x5b9f17107a70) at osc_rdma_accumulate.c:474
#4  0x0000737b19198624 in ompi_osc_rdma_gacc_master (sync=0x5b9f172ce4a0, source_addr=0x5b9f180a3010, source_count=1, source_datatype=0x5b9f12b52ea0 <ompi_mpi_float>,
    result_addr=0x0, result_count=0, result_datatype=0x0, peer=0x5b9f181af0f0, target_address=100738863024032, target_handle=0x0, target_count=1,
    target_datatype=0x5b9f12b52ea0 <ompi_mpi_float>, op=0x5b9f12b50a80 <ompi_mpi_op_sum>, request=0x5b9f17107a70) at osc_rdma_accumulate.c:587
#5  0x0000737b1919a25c in ompi_osc_rdma_rget_accumulate_internal (win=0x5b9f171b4850, origin_addr=0x5b9f180a3010, origin_count=1,
    origin_datatype=0x5b9f12b52ea0 <ompi_mpi_float>, result_addr=0x0, result_count=0, result_datatype=0x0, target_rank=0, target_disp=0, target_count=1,
    target_datatype=0x5b9f12b52ea0 <ompi_mpi_float>, op=0x5b9f12b50a80 <ompi_mpi_op_sum>, request_out=0x0) at osc_rdma_accumulate.c:1107
#6  0x0000737b1919a703 in ompi_osc_rdma_accumulate (origin_addr=0x5b9f180a3010, origin_count=1, origin_datatype=0x5b9f12b52ea0 <ompi_mpi_float>, target_rank=0, target_disp=0,
    target_count=1, target_datatype=0x5b9f12b52ea0 <ompi_mpi_float>, op=0x5b9f12b50a80 <ompi_mpi_op_sum>, win=0x5b9f171b4850) at osc_rdma_accumulate.c:1178
#7  0x0000737b18ed2acf in PMPI_Accumulate (origin_addr=0x5b9f180a3010, origin_count=1, origin_datatype=0x5b9f12b52ea0 <ompi_mpi_float>, target_rank=0, target_disp=0,
    target_count=1, target_datatype=0x5b9f12b52ea0 <ompi_mpi_float>, op=0x5b9f12b50a80 <ompi_mpi_op_sum>, win=0x5b9f171b4850) at accumulate_generated.c:129
#8  0x00005b9f12b2becb in IMB_accumulate (c_info=0x5b9f16f1bfd0, size=4, ITERATIONS=0x5b9f16f1c108, RUN_MODE=0x5b9f16f1c194, time=0x7fffb7dc1890) at ../src_c/IMB_ones_accu.c:178
#9  0x00005b9f12b18f10 in Bmark_descr::IMB_init_buffers_iter (this=0x5b9f16f1bc10, c_info=0x5b9f16f1bfd0, ITERATIONS=0x5b9f16f1c108, Bmark=0x5b9f16f1c178, BMODE=0x5b9f16f1c194,
    iter=1, size=4) at helpers/helper_IMB_functions.h:607
#10 0x00005b9f12b1f492 in OriginalBenchmark<BenchmarkSuite<(benchmark_suite_t)4>, &IMB_accumulate>::run (this=0x5b9f16f1bfa0, item=...) at helpers/original_benchmark.h:192
#11 0x00005b9f12aeba92 in main (argc=10, argv=0x7fffb7dc2068) at imb.cpp:329
(gdb) list
73      /* create completion context */
74      comp = mca_btl_ofi_rdma_completion_alloc(btl, endpoint, ofi_context, local_address,
75                                               local_handle, cbfunc, cbcontext, cbdata,
76                                               MCA_BTL_OFI_TYPE_GET);
77
78      remote_address = (remote_address - (uint64_t) remote_handle->base_addr);
79
80      /* Remote write data across the wire */
81      rc = fi_read(ofi_context->tx_ctx, local_address, size, /* payload */
82                   (NULL == local_handle ? NULL : local_handle->desc),

The segfault occurs because remote_handle is NULL:

(gdb) p remote_handle
$3 = (mca_btl_base_registration_handle_t *) 0x0

devreal (Contributor) previously approved these changes Oct 10, 2025

Thanks for the fix @shijin-aws!

shijin-aws (Contributor, Author) commented Oct 10, 2025

@devreal I re-pushed my commits after finding that btl_handle_data can already handle the branching between MPI_WIN_FLAVOR_ALLOCATE and non-MPI_WIN_FLAVOR_ALLOCATE flavors.

shijin-aws (Contributor, Author) commented

@hppritcha can you review it too?

Commit 815fe8e changed the ex_peer->super.base_handle assignment
when cpu atomics is not available + peer is local. We need a valid
base handle when use_memory_registration is true. This patch
fixes this issue.

Signed-off-by: Shi Jin <[email protected]>