Fix Active Message Payload Fragmentation by lightsighter · Pull Request #410 · StanfordLegion/realm

lightsighter · 2026-03-10T10:35:39Z

Add automatic payload chunking to the ActiveMessage class so that any active message whose payload exceeds the network backend's hard limit is transparently fragmented and reassembled, eliminating crashes like the assert(size <= ib_seg_size) failure in the UCX backend.

Changes

New NetworkModule::max_payload_size() interface

Added a max_payload_size(size_t header_size) pure virtual method to NetworkModule and a corresponding free function in the Network namespace. This returns the strict upper bound on payload size for a single active message on the eager (no RDMA) path, independent of congestion or buffer registration.
- GASNet-EX: returns the medium message limit derived from AM_LUBRequestMedium and cfg_outbuf_size.
- GASNet-1: returns gasnet_AMMaxMedium().
- MPI: returns AM_BUF_SIZE - header_size.
- UCX: returns SIZE_MAX since UCX handles fragmentation internally via automatic eager/rendezvous protocol selection.
- Loopback: returns SIZE_MAX.

Automatic chunking in ActiveMessage

ActiveMessage::init(NodeID, size_t) now checks the requested payload size against Network::max_payload_size(). If the payload exceeds the limit, the message enters a "chunked mode" that buffers payload locally and splits it into fragments at commit() time.
Each fragment is sent as an ActiveMessage<WrappedWithFragInfo> with FragmentInfo metadata (chunk ID, total chunks, message ID). The existing IncomingMessageManager / FragmentedMessage infrastructure reassembles them transparently before invoking the handler.
For the common case where the payload fits within the network limit, the code path is identical to before — zero overhead. On UCX (which returns SIZE_MAX), the chunking path is never entered.
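
As a rough illustration of the splitting step described above, the following sketch plans how a payload could be divided into fragments of at most the backend limit. The names `FragmentInfoSketch`, `FragmentSketch`, and `plan_fragments` are hypothetical and are not taken from the Realm sources; the actual FragmentInfo layout and the logic in commit() may differ, and the real unchunked path does not attach any fragment metadata at all.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Realm's FragmentInfo; real field names may differ.
struct FragmentInfoSketch {
  uint32_t chunk_id;     // index of this fragment, starting at 0
  uint32_t total_chunks; // number of fragments in the whole message
  uint64_t msg_id;       // identifies the message the fragment belongs to
};

struct FragmentSketch {
  FragmentInfoSketch info;
  size_t offset; // byte offset of this fragment within the full payload
  size_t size;   // number of payload bytes carried by this fragment
};

// Plan how a payload of `payload_size` bytes would be split when a single
// message may carry at most `max_payload` bytes (max_payload must be > 0).
inline std::vector<FragmentSketch> plan_fragments(size_t payload_size,
                                                  size_t max_payload,
                                                  uint64_t msg_id)
{
  std::vector<FragmentSketch> plan;
  if(payload_size <= max_payload) {
    // common case: the payload fits in one message
    plan.push_back({{0, 1, msg_id}, 0, payload_size});
    return plan;
  }
  // ceiling division gives the number of fragments
  const size_t total = (payload_size + max_payload - 1) / max_payload;
  for(size_t i = 0; i < total; i++) {
    const size_t offset = i * max_payload;
    const size_t size = std::min(max_payload, payload_size - offset);
    plan.push_back({{static_cast<uint32_t>(i), static_cast<uint32_t>(total), msg_id},
                    offset, size});
  }
  return plan;
}
```

With an illustrative 41 KiB payload and an 8 KiB limit, this plan contains six fragments: five full ones and a final fragment of 1 KiB.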

Automatic dual handler registration

ActiveMessageHandlerReg now automatically registers a WrappedWithFragInfo handler alongside the plain T handler. This ensures every message type can be received as a fragmented message without requiring explicit opt-in. The WrappedWithFragInfo handler uses the existing wrap_handler_unwrap mechanism to strip the FragmentInfo and dispatch to the original handler.

UCX pbuf_get assertion removal

Removed assert(size <= ib_seg_size) from UCPInternal::pbuf_get(). This was a Realm-side guard, not a UCX requirement. UCX's UCP layer handles message fragmentation internally — the send path automatically selects eager vs rendezvous based on message size, and the receive path already fully supports rendezvous via UCP_AM_RECV_ATTR_FLAG_RNDV.

Removal of ActiveMessageAuto

Deleted the ActiveMessageAuto class, DefaultActiveMessageBuilder type alias, and AutoMessageRegistrar struct, since their functionality is now subsumed by the base ActiveMessage class.
Converted the sole usage site in barrier_impl.cc (BARRIER_ENABLE_BROADCAST path) from ActiveMessageAuto to plain ActiveMessage.
Removed the AutoMessageRegistrar instance (automatic dual-registration handles this now).
Deleted tests/unit_tests/auto_actmsg_test.cc and removed it from CMakeLists.txt.

@SeyedMir @apryakhin

codecov · 2026-03-10T10:40:01Z

Codecov Report

❌ Patch coverage is 47.30539% with 88 lines in your changes missing coverage. Please review.
✅ Project coverage is 29.03%. Comparing base (87ec0ad) to head (3021c93).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/realm/activemsg.inl | 1.19% | 75 Missing and 8 partials ⚠️ |
| src/realm/network.cc | 0.00% | 2 Missing ⚠️ |
| src/realm/network.inl | 0.00% | 2 Missing ⚠️ |
| tests/unit_tests/actmsg_fragmentation_test.cc | 98.73% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #410      +/-   ##
==========================================
- Coverage   29.07%   29.03%   -0.05%     
==========================================
  Files         194      194              
  Lines       40229    40336     +107     
  Branches    14464    14463       -1     
==========================================
+ Hits        11697    11710      +13     
+ Misses      27723    27531     -192     
- Partials      809     1095     +286

lightsighter · 2026-03-11T05:41:12Z

Running Legion CI: https://gitlab.com/StanfordLegion/legion/-/pipelines/2377414546

SeyedMir · 2026-03-11T21:41:15Z

src/realm/network.h

+    //  of congestion, source/dest registration, etc.
+    // network backends that handle fragmentation internally (e.g. UCX) may
+    //  return SIZE_MAX to indicate no practical limit
+    size_t max_payload_size(size_t header_size);


I think this function needs at least another argument to specify whether the user will provide the payload buffer or not. Maybe pass a src buffer pointer (similar to recommended_max_payload) and set it to NULL if the query is for a case that the network module should provide the payload buffer.

In response to your comment, this is the plan that Claude Code came up with. Let me know if it looks sufficient to you. If so I'll execute it.

The core issue is that the hard limit on payload size depends on who provides the payload buffer. Let me trace through why for each backend.

UCX has two paths in the UCPMessageImpl constructor (ucp_internal.cc:2034-2043):

src_payload_addr != nullptr: uses the caller's buffer directly (PAYLOAD_BASE_EXTERNAL), then passes it to ucp_am_send_nbx which handles any size via rendezvous — effectively no limit.

src_payload_addr == nullptr: the network must allocate via pbuf_get(), which uses memory pools sized for small eager messages. Even with the assertion removed, asking a pool designed for 8KB segments to service a 40KB allocation is wasteful or may fail.

So for UCX, returning SIZE_MAX unconditionally is wrong — it's only correct when the caller provides a buffer. Without one, there should be a smaller limit so Realm chunks the message before the network has to allocate an oversized pool buffer.

GASNet-EX uses medium messages (with their outbuf_size / AM_LUBRequestMedium limits) for both cases when there's no RDMA destination, so the limit is the same regardless. But the interface should still be honest about the parameter so that if GASNet-EX's behavior ever changes (or a future backend cares), it works correctly.

MPI and GASNet-1 similarly have the same limit regardless of source buffer, but should accept the parameter for interface consistency.

Plan

Change the max_payload_size signature

In NetworkModule and Network namespace:

virtual size_t max_payload_size(size_t header_size, const void *src_payload_addr) = 0;
size_t max_payload_size(size_t header_size, const void *src_payload_addr);

The src_payload_addr is null when the network module will allocate the payload buffer, or non-null when the caller provides one.

Update backend implementations

UCX:

src_payload_addr != nullptr: return SIZE_MAX (caller's buffer goes straight to ucp_am_send_nbx, UCX handles fragmentation via rendezvous).

src_payload_addr == nullptr: return a limit based on the internal buffer pool capacity — likely ib_seg_size - header_size or config.pbuf_max_size, reflecting what pbuf_get can efficiently handle.

GASNet-EX, GASNet-1, MPI, Loopback: unchanged behavior, just accept and ignore the new parameter.
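
To make the proposed two-argument query concrete, here is a minimal sketch of how backends might answer it. The class names (`NetworkModuleSketch`, `UcxLikeModule`, `FixedBufferModule`) are invented for illustration and do not correspond to Realm's actual classes. The UCX-like branch assumes the pool capacity itself is the limit when no source buffer is given; whether the header size should also be subtracted there is left open.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the interface proposed in the plan above.
class NetworkModuleSketch {
public:
  virtual ~NetworkModuleSketch() = default;
  // src_payload_addr is null when the network module allocates the payload
  // buffer, and non-null when the caller provides one.
  virtual size_t max_payload_size(size_t header_size,
                                  const void *src_payload_addr) = 0;
};

// UCX-like behavior: no practical limit for caller-provided buffers,
// a pool-sized limit when the module must allocate the buffer itself.
class UcxLikeModule : public NetworkModuleSketch {
public:
  explicit UcxLikeModule(size_t pbuf_max) : pbuf_max_size(pbuf_max) {}
  size_t max_payload_size(size_t header_size, const void *src_payload_addr) override
  {
    (void)header_size; // assumption: header accounting is omitted in this sketch
    if(src_payload_addr != nullptr)
      return SIZE_MAX;
    return pbuf_max_size;
  }

private:
  size_t pbuf_max_size;
};

// MPI-like behavior: one fixed buffer holds header and payload, so the
// limit is the same whether or not the caller supplies a source buffer.
class FixedBufferModule : public NetworkModuleSketch {
public:
  explicit FixedBufferModule(size_t buf) : buf_size(buf) {}
  size_t max_payload_size(size_t header_size, const void *src_payload_addr) override
  {
    (void)src_payload_addr; // accepted for interface consistency, ignored
    return (header_size < buf_size) ? (buf_size - header_size) : 0;
  }

private:
  size_t buf_size;
};
```

A caller would then chunk only when the requested payload exceeds the value returned for the path it actually uses, so a query made with a null source pointer should size fragments that are also sent without a source buffer.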

Update call sites in ActiveMessage

init(NodeID, size_t): passes nullptr — correct, the network allocates the buffer.

init_chunked() and commit_chunked(): pass nullptr — chunks are sent via the no-source-buffer path.

The data-providing init variants (e.g., init(NodeID, const void*, size_t)) currently don't check max_payload_size at all. As a follow-up, these could also be made chunking-aware by querying max_payload_size(header_size, data_ptr). When the caller provides a buffer and the backend returns SIZE_MAX (UCX), chunking is skipped. When the backend returns a small limit (GASNet), chunks could be sent using slices of the original buffer — but that's a more complex change that can be deferred.

No changes needed to commit_chunked logic

The chunking loop sends each fragment as ActiveMessage<WrappedWithFragInfo>(target, chunk_size) — the no-source-buffer path. This is correct: the limit used to size the chunks was queried with src_payload_addr=nullptr, matching how the chunks are actually sent.

Impact

The practical effect is that on UCX, a large ActiveMessage with no source buffer (like the original crash) will now be chunked into pool-friendly sizes, while a large ActiveMessage with a caller-provided buffer will continue to be sent as a single message via UCX's native rendezvous — no unnecessary chunking overhead.

I found some more issues with this plan, updated it, and pushed the changes. See if you are satisfied with the current implementation.

Looks reasonable to me; config.pbuf_max_size should be what is returned for UCX module when src_payload_addr is nullptr. Note that the current default value for pbuf_max_size is 8KB.

realm/src/realm/ucx/ucp_internal.h

Line 122 in ea52777

size_t pbuf_max_size{8 << 10 /* 8K */};

It is worth documenting for the user of the new API that using max_payload_size may lead to sub-optimal perf (and that's why we have recommended_max_payload).

> Looks reasonable to me; config.pbuf_max_size should be what is returned for UCX module when src_payload_addr is nullptr. Note that the current default value for pbuf_max_size is 8KB.

I had Claude make this fix. See if it looks good to you.

> It is worth documenting for the user of the new API that using max_payload_size may lead to sub-optimal perf (and that's why we have recommended_max_payload).

I added documentation to that effect. See if you are happy with it.

I'll drop my comments on the PR shortly; I'm going over it now.

lightsighter · 2026-03-13T20:50:31Z

Updated description of the changes, now that we have fixed some more cases:

This PR fixes fragmentation issues for Realm active messages with large payloads. The original crash occurred when the UCX backend's pbuf_get asserted size <= ib_seg_size on a 41KB SimpleXferDesCreateMessage payload.

Changes

Automatic payload chunking in ActiveMessage

ActiveMessage now automatically fragments payloads that exceed the network backend's hard limit, using WrappedWithFragInfo wrapper headers and the existing FragmentedMessage reassembly infrastructure in IncomingMessageManager.
Chunking is transparent to callers — no API changes required. Messages that fit within the limit have zero overhead (the check is a single comparison).
Covers both the network-allocated path (init(target, size) + payload_ptr()/add_payload()) and the caller-provided data paths (init(target, data, len) and 2D variants). The data-ref path avoids copying by slicing the caller's buffer directly.
Each message type T automatically gets a companion WrappedWithFragInfo handler registered via dual registration in ActiveMessageHandlerReg, so no per-message-type opt-in is needed.

New Network::max_payload_size(header_size, src_payload_addr) interface

Returns the strict upper bound on payload size for a single active message on the eager (non-RDMA) path, as opposed to the advisory recommended_max_payload.
Takes a const void *src_payload_addr parameter so backends can distinguish caller-provided vs network-allocated buffers and check segment registration where applicable.
Backend implementations:
- UCX: SIZE_MAX when caller provides the buffer (rendezvous handles any size); pbuf pool capacity limit when nullptr.
- GASNet-EX/GASNet-1: Medium message limit regardless of source — Long messages require a RemoteAddress destination which this interface doesn't carry.
- MPI: AM_BUF_SIZE - header_size regardless — no registered segment concept.
- Loopback: SIZE_MAX.

UCX assertion removal

Removed assert(size <= ib_seg_size) from pbuf_get in ucp_internal.cc. This was a Realm-side guard, not a UCX requirement — UCX handles large messages internally via eager/rendezvous protocol selection.

Removed ActiveMessageAuto

Deleted the ActiveMessageAuto class and its test, since ActiveMessage now handles fragmentation directly. Converted the sole remaining usage in barrier_impl.cc.

apryakhin · 2026-03-18T14:26:52Z

src/realm/network.inl


+    inline size_t max_payload_size(size_t header_size, const void *src_payload_addr)
+    {
+#ifdef REALM_USE_MULTIPLE_NETWORKS


This needs at least a TODO describing what to do here next.

Claude is just mirroring what exists in related functions. You can see them in the main branch here:

https://github.com/StanfordLegion/realm/blob/main/src/realm/network.inl#L144-L220

What do you think should go in there for all those different functions?

apryakhin · 2026-03-18T14:29:58Z

src/realm/gasnetex/gasnetex_module.cc

+    //  messages regardless of whether the source is in a registered segment
+    //  (Long messages require a dest_payload_addr)
+    (void)src_payload_addr;
+    return recommended_max_payload(Network::my_node_id, false /*with_congestion*/,


Why does this return the recommended payload when the documentation clearly says that it is an upper bound? Shouldn't it be possible to use gasnet_AMMaxMedium here?

I think Claude is just reusing an existing function that has the same logic (computing the size of the AM medium minus the header size). If you want I can ask it to split it out and duplicate the logic so it is clear what is happening.

apryakhin · 2026-03-18T14:36:28Z

src/realm/ucx/ucp_internal.cc

    void *UCPInternal::pbuf_get(UCPWorker *worker, size_t size)
    {
      char *buf;
-      assert(size <= ib_seg_size);


Why is this removed?

Quoting from the description above:

Removed assert(size <= ib_seg_size) from UCPInternal::pbuf_get(). This was a Realm-side guard, not a UCX requirement. UCX's UCP layer handles message fragmentation internally — the send path automatically selects eager vs rendezvous based on message size, and the receive path already fully supports rendezvous via UCP_AM_RECV_ATTR_FLAG_RNDV.

This is the assertion that @SeyedMir put in because we thought Realm would always abide by it, and it is the one I originally tripped over when Realm didn't. It's not necessary in UCX because UCX automatically does the splitting and reassembly for you.

src/realm/activemsg.inl

src/realm/activemsg.h

apryakhin · 2026-03-18T14:59:29Z

src/realm/activemsg.inl

+        size_t net_max = Network::max_payload_size(sizeof(T), _data);
+        if(total_bytes > net_max) {
+          // linearize 2D data, then chunk
+          if(_line_stride == _bytes_per_line) {


Where is this change tested?

Claude and I added a new test tests/unit_tests/actmsg_fragmentation_test.cc which should cover this. See what you think.

tests/unit_tests/auto_actmsg_test.cc

apryakhin · 2026-03-18T15:19:59Z

src/realm/ucx/ucp_module.cc

        data, src_payload_addr.segment, &dest_payload_addr, with_congestion, header_size);
  }

+  size_t UCPModule::max_payload_size(size_t header_size, const void *src_payload_addr)


@SeyedMir How are we going to ensure that this works correctly with the receiver's maximum size? UCX handles transport-level fragmentation, but the rendezvous protocol still ends up in am_msg_recv_handler, where the receiver requests a pool object whose capacity can be below this limit when a payload address is provided. We had an assert that would effectively have caught this, but it was removed, so now it is likely to cause a silent failure, unless I am missing something here.

realm: this pull request fixes fragmentation issues for Realm active …

4256420

…messages with large payloads

lightsighter requested a review from apryakhin March 10, 2026 10:35

lightsighter self-assigned this Mar 10, 2026

github-actions bot added the bug Something isn't working label Mar 10, 2026

SeyedMir reviewed Mar 11, 2026

realm: more fixes for handling active message fragmentation

b2a8bd5

lightsighter added 4 commits March 13, 2026 15:43

Merge branch 'main' into mbauer-payload-fragmentation

8bad8f6

realm: address review comments

8108fcc

realm: fix permissions for UCP module

c1e1379

realm: use fully qualified name because c++ is dumb

3f170bc

apryakhin reviewed Mar 18, 2026

src/realm/activemsg.inl Outdated

apryakhin reviewed Mar 18, 2026

src/realm/activemsg.h Outdated

apryakhin reviewed Mar 18, 2026

tests/unit_tests/auto_actmsg_test.cc

apryakhin reviewed Mar 18, 2026

lightsighter mentioned this pull request Mar 20, 2026

Legion profiling medium payload too large! StanfordLegion/legion#1929

Open

lightsighter added 8 commits March 23, 2026 03:30

Merge branch 'main' into mbauer-payload-fragmentation

1d2024d

remove explicit chunking from certain kinds of active messages as tha…

47e86de

…t is now abstracted in the active message implementation

test: add a active message fragmentation test to recover some lost te…

b089c8a

…sting coverage

realm: fix formatting

5f20032

Merge branch 'main' into mbauer-payload-fragmentation

f2e5b44

realm: fix use-after free issue with reusing payload buffers in the U…

d495c3e

…CX module

realm: another fix for ucx payload fragmentation

9abfb13

realm: more cleanup for active message implementation to ensure chunk…

0a42bb1

…ing happens automatically when necessary

lightsighter added 2 commits March 25, 2026 03:16

Merge branch 'main' into mbauer-payload-fragmentation

268586f

realm: still fixing more bugs with payload fragmentation

3021c93
