
Fix Active Message Payload Fragmentation #410

Open
lightsighter wants to merge 16 commits into main from mbauer-payload-fragmentation


@lightsighter
Contributor

Add automatic payload chunking to the ActiveMessage class so that any active message whose payload exceeds the network backend's hard limit is transparently fragmented and reassembled, eliminating crashes like the assert(size <= ib_seg_size) failure in the UCX backend.

Changes

New NetworkModule::max_payload_size() interface

  • Added a max_payload_size(size_t header_size) pure virtual method to NetworkModule and a corresponding free function in the Network namespace. This returns the strict upper bound on payload size for a single active message on the eager (no RDMA) path, independent of congestion or buffer registration.
  • GASNet-EX: returns the medium message limit derived from AM_LUBRequestMedium and cfg_outbuf_size.
  • GASNet-1: returns gasnet_AMMaxMedium().
  • MPI: returns AM_BUF_SIZE - header_size.
  • UCX: returns SIZE_MAX since UCX handles fragmentation internally via automatic eager/rendezvous protocol selection.
  • Loopback: returns SIZE_MAX.
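As a rough illustration of the interface listed above, here is a minimal sketch of how a per-backend hard limit might be queried by a sender. The class names (NetworkModuleSketch, FixedBufferModule, UnlimitedModule), the needs_chunking helper, and the buffer size passed in are hypothetical stand-ins, not Realm's actual types or constants.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for the interface described above.
class NetworkModuleSketch {
public:
  virtual ~NetworkModuleSketch() = default;
  // Strict upper bound on the payload of a single eager active message.
  virtual std::size_t max_payload_size(std::size_t header_size) const = 0;
};

// Models a backend with a fixed per-message buffer, in the spirit of the
// MPI bullet (buffer size minus header size). The buffer size is a
// constructor argument here; the real constant's value is not assumed.
class FixedBufferModule : public NetworkModuleSketch {
public:
  explicit FixedBufferModule(std::size_t buf_size) : buf_size_(buf_size) {}
  std::size_t max_payload_size(std::size_t header_size) const override
  {
    assert(header_size <= buf_size_); // otherwise the subtraction would wrap
    return buf_size_ - header_size;
  }

private:
  std::size_t buf_size_;
};

// Models a backend that reports no practical limit (the UCX and loopback
// bullets in this version of the description).
class UnlimitedModule : public NetworkModuleSketch {
public:
  std::size_t max_payload_size(std::size_t) const override { return SIZE_MAX; }
};

// The decision a sender would make before choosing the chunked path.
inline bool needs_chunking(const NetworkModuleSketch &net, std::size_t header_size,
                           std::size_t payload_size)
{
  return payload_size > net.max_payload_size(header_size);
}
```

With a module that returns SIZE_MAX, needs_chunking is false for every payload size, which matches the claim that the chunking path is never entered on such a backend.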

Automatic chunking in ActiveMessage

  • ActiveMessage::init(NodeID, size_t) now checks the requested payload size against Network::max_payload_size(). If the payload exceeds the limit, the message enters a "chunked mode" that buffers payload locally and splits it into fragments at commit() time.
  • Each fragment is sent as an ActiveMessage<WrappedWithFragInfo> with FragmentInfo metadata (chunk ID, total chunks, message ID). The existing IncomingMessageManager / FragmentedMessage infrastructure reassembles them transparently before invoking the handler.
  • For the common case where the payload fits within the network limit, the code path is identical to before — zero overhead. On UCX (which returns SIZE_MAX), the chunking path is never entered.
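The split-and-reassemble behavior described above can be sketched roughly as follows. This is not the PR's code: FragmentInfoSketch, Fragment, split_payload, and reassemble are hypothetical names, and the field widths are assumptions. The PR only states that the metadata carries a chunk ID, a total chunk count, and a message ID.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical layout of the per-fragment metadata (field widths assumed).
struct FragmentInfoSketch {
  std::uint32_t chunk_id;
  std::uint32_t total_chunks;
  std::uint64_t msg_id;
};

struct Fragment {
  FragmentInfoSketch info;
  std::vector<unsigned char> bytes;
};

// Split a locally buffered payload into fragments of at most max_chunk bytes.
inline std::vector<Fragment> split_payload(const std::vector<unsigned char> &payload,
                                           std::size_t max_chunk, std::uint64_t msg_id)
{
  assert(max_chunk > 0);
  const std::size_t total = (payload.size() + max_chunk - 1) / max_chunk;
  std::vector<Fragment> out;
  out.reserve(total);
  for(std::size_t i = 0; i < total; i++) {
    const std::size_t begin = i * max_chunk;
    const std::size_t len = std::min(max_chunk, payload.size() - begin);
    Fragment f;
    f.info = {static_cast<std::uint32_t>(i), static_cast<std::uint32_t>(total), msg_id};
    f.bytes.assign(payload.data() + begin, payload.data() + begin + len);
    out.push_back(std::move(f));
  }
  return out;
}

// Reassemble fragments of one message that may have arrived in any order.
inline std::vector<unsigned char> reassemble(std::vector<Fragment> frags)
{
  std::sort(frags.begin(), frags.end(), [](const Fragment &a, const Fragment &b) {
    return a.info.chunk_id < b.info.chunk_id;
  });
  std::vector<unsigned char> out;
  for(const Fragment &f : frags) {
    assert(f.info.total_chunks == frags.size()); // all fragments must be present
    out.insert(out.end(), f.bytes.begin(), f.bytes.end());
  }
  return out;
}
```

The sketch copies each chunk into its own vector for clarity; a real implementation would write into network-provided buffers and track partially received messages by message ID.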

Automatic dual handler registration

  • ActiveMessageHandlerReg now automatically registers a WrappedWithFragInfo handler alongside the plain T handler. This ensures every message type can be received as a fragmented message without requiring explicit opt-in. The WrappedWithFragInfo handler uses the existing wrap_handler_unwrap mechanism to strip the FragmentInfo and dispatch to the original handler.
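To make the dual registration idea concrete, here is a minimal sketch under simplifying assumptions. HandlerTableSketch, WrappedSketch, FragInfoSketch, and the "/wrapped" key suffix are all hypothetical; Realm's actual registration uses its own handler tables and the wrap_handler_unwrap mechanism, and routes fragments through reassembly before the user handler runs, which this sketch omits.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for the fragment metadata.
struct FragInfoSketch {
  unsigned chunk_id;
  unsigned total_chunks;
  unsigned long long msg_id;
};

// Hypothetical wrapper header: fragment metadata followed by the user header.
template <typename T>
struct WrappedSketch {
  FragInfoSketch frag_info;
  T user;
};

using RawHandler =
    std::function<void(const void *header, const void *payload, std::size_t len)>;

class HandlerTableSketch {
public:
  // Registers the plain handler and, automatically, a companion handler that
  // accepts the wrapped header, strips the metadata, and forwards the call.
  template <typename T>
  void register_both(const std::string &name,
                     void (*handler)(const T &, const void *, std::size_t))
  {
    assert(handler != nullptr);
    table_[name] = [handler](const void *hdr, const void *payload, std::size_t len) {
      handler(*static_cast<const T *>(hdr), payload, len);
    };
    table_[name + "/wrapped"] = [handler](const void *hdr, const void *payload,
                                          std::size_t len) {
      const auto *w = static_cast<const WrappedSketch<T> *>(hdr);
      handler(w->user, payload, len);
    };
  }

  void dispatch(const std::string &name, const void *hdr, const void *payload,
                std::size_t len) const
  {
    table_.at(name)(hdr, payload, len);
  }

private:
  std::map<std::string, RawHandler> table_;
};
```

The point of the design is that a message type author writes only the plain handler; the wrapped variant comes for free, so any message type can be received in fragmented form.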

UCX pbuf_get assertion removal

  • Removed assert(size <= ib_seg_size) from UCPInternal::pbuf_get(). This was a Realm-side guard, not a UCX requirement. UCX's UCP layer handles message fragmentation internally — the send path automatically selects eager vs rendezvous based on message size, and the receive path already fully supports rendezvous via UCP_AM_RECV_ATTR_FLAG_RNDV.

Removal of ActiveMessageAuto

  • Deleted the ActiveMessageAuto class, DefaultActiveMessageBuilder type alias, and AutoMessageRegistrar struct, since their functionality is now subsumed by the base ActiveMessage class.
  • Converted the sole usage site in barrier_impl.cc (BARRIER_ENABLE_BROADCAST path) from ActiveMessageAuto to plain ActiveMessage.
  • Removed the AutoMessageRegistrar instance (automatic dual-registration handles this now).
  • Deleted tests/unit_tests/auto_actmsg_test.cc and removed it from CMakeLists.txt.

@SeyedMir @apryakhin

@lightsighter lightsighter requested a review from apryakhin March 10, 2026 10:35
@lightsighter lightsighter self-assigned this Mar 10, 2026
@github-actions github-actions bot added the bug Something isn't working label Mar 10, 2026
@codecov

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 47.30539% with 88 lines in your changes missing coverage. Please review.
✅ Project coverage is 29.03%. Comparing base (87ec0ad) to head (3021c93).
✅ All tests successful. No failed tests found.

Files with missing lines                         Patch %   Lines
src/realm/activemsg.inl                          1.19%     75 Missing and 8 partials ⚠️
src/realm/network.cc                             0.00%     2 Missing ⚠️
src/realm/network.inl                            0.00%     2 Missing ⚠️
tests/unit_tests/actmsg_fragmentation_test.cc    98.73%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #410      +/-   ##
==========================================
- Coverage   29.07%   29.03%   -0.05%     
==========================================
  Files         194      194              
  Lines       40229    40336     +107     
  Branches    14464    14463       -1     
==========================================
+ Hits        11697    11710      +13     
+ Misses      27723    27531     -192     
- Partials      809     1095     +286     


@lightsighter
Contributor Author

// of congestion, source/dest registration, etc.
// network backends that handle fragmentation internally (e.g. UCX) may
// return SIZE_MAX to indicate no practical limit
size_t max_payload_size(size_t header_size);
Contributor

@SeyedMir SeyedMir Mar 11, 2026

I think this function needs at least one more argument to specify whether the user will provide the payload buffer or not. Maybe pass a src buffer pointer (similar to recommended_max_payload) and set it to NULL if the query is for the case where the network module should provide the payload buffer.

Contributor Author

In response to your comment, this is the plan that Claude Code came up with. Let me know if it looks sufficient to you. If so I'll execute it.

The core issue is that the hard limit on payload size depends on who provides the payload buffer. Let me trace through why for each backend.

UCX has two paths in the UCPMessageImpl constructor (ucp_internal.cc:2034-2043):

  • src_payload_addr != nullptr: uses the caller's buffer directly (PAYLOAD_BASE_EXTERNAL), then passes it to ucp_am_send_nbx which handles any size via rendezvous — effectively no limit.
  • src_payload_addr == nullptr: the network must allocate via pbuf_get(), which uses memory pools sized for small eager messages. Even with the assertion removed, asking a pool designed for 8KB segments to service a 40KB allocation is wasteful or may fail.

So for UCX, returning SIZE_MAX unconditionally is wrong — it's only correct when the caller provides a buffer. Without one, there should be a smaller limit so Realm chunks the message before the network has to allocate an oversized pool buffer.

GASNet-EX uses medium messages (with their outbuf_size / AM_LUBRequestMedium limits) for both cases when there's no RDMA destination, so the limit is the same regardless. But the interface should still be honest about the parameter so that if GASNet-EX's behavior ever changes (or a future backend cares), it works correctly.

MPI and GASNet-1 similarly have the same limit regardless of source buffer, but should accept the parameter for interface consistency.

Plan

  1. Change the max_payload_size signature

In NetworkModule and Network namespace:

virtual size_t max_payload_size(size_t header_size, const void *src_payload_addr) = 0;
size_t max_payload_size(size_t header_size, const void *src_payload_addr);

The src_payload_addr is null when the network module will allocate the payload buffer, or non-null when the caller provides one.

  2. Update backend implementations
  • UCX:
    • src_payload_addr != nullptr: return SIZE_MAX (caller's buffer goes straight to ucp_am_send_nbx, UCX handles fragmentation via rendezvous).
    • src_payload_addr == nullptr: return a limit based on the internal buffer pool capacity, likely ib_seg_size - header_size or config.pbuf_max_size, reflecting what pbuf_get can efficiently handle.
  • GASNet-EX, GASNet-1, MPI, Loopback: unchanged behavior, just accept and ignore the new parameter.
  3. Update call sites in ActiveMessage
  • init(NodeID, size_t): passes nullptr, which is correct because the network allocates the buffer.
  • init_chunked() and commit_chunked(): pass nullptr, since chunks are sent via the no-source-buffer path.
  • The data-providing init variants (e.g., init(NodeID, const void*, size_t)) currently don't check max_payload_size at all. As a follow-up, these could also be made chunking-aware by querying max_payload_size(header_size, data_ptr). When the caller provides a buffer and the backend returns SIZE_MAX (UCX), chunking is skipped. When the backend returns a small limit (GASNet), chunks could be sent using slices of the original buffer, but that's a more complex change that can be deferred.
  4. No changes needed to commit_chunked logic

The chunking loop sends each fragment as ActiveMessage<WrappedWithFragInfo>(target, chunk_size) — the no-source-buffer path. This is correct: the limit used to size the chunks was queried with src_payload_addr=nullptr, matching how the chunks are actually sent.

Impact

The practical effect is that on UCX, a large ActiveMessage with no source buffer (like the original crash) will now be chunked into pool-friendly sizes, while a large ActiveMessage with a caller-provided buffer will continue to be sent as a single message via UCX's native rendezvous — no unnecessary chunking overhead.
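The buffer-dependent limit in the plan above could look roughly like the following sketch. The function names and the pool_capacity parameter are hypothetical; whether the header size should be subtracted from the pool capacity is left open by the plan, so this sketch ignores header_size and says so in a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a UCX-style hard limit that depends on who provides
// the payload buffer. pool_capacity stands in for whatever the internal
// buffer pool can serve; its real source is an assumption here.
inline std::size_t ucx_max_payload_size_sketch(std::size_t pool_capacity,
                                               std::size_t header_size,
                                               const void *src_payload_addr)
{
  assert(pool_capacity > 0);
  (void)header_size; // the plan leaves open whether the header is subtracted
  if(src_payload_addr != nullptr)
    return SIZE_MAX; // caller's buffer: the backend handles any size itself
  return pool_capacity; // network-allocated buffer: bounded by the pool
}

// Number of messages needed to carry payload_size bytes under a given limit.
// Written without (payload + limit - 1) so that limit == SIZE_MAX cannot wrap.
inline std::size_t chunks_needed(std::size_t payload_size, std::size_t limit)
{
  assert(limit > 0);
  if(payload_size <= limit)
    return 1;
  return payload_size / limit + (payload_size % limit != 0 ? 1 : 0);
}
```

For example, with an 8 KB pool and no source buffer, a 40 KB payload needs five chunks, while the same payload with a caller-provided buffer is sent as a single message.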

Contributor Author

I found some more issues with this plan, updated it, and pushed the changes. See if you are satisfied with the current implementation.

Contributor

Looks reasonable to me; config.pbuf_max_size should be what is returned for UCX module when src_payload_addr is nullptr. Note that the current default value for pbuf_max_size is 8KB.

size_t pbuf_max_size{8 << 10 /* 8K */};

Contributor

It is worth documenting for the user of the new API that using max_payload_size may lead to sub-optimal perf (and that's why we have recommended_max_payload).

Contributor Author

> Looks reasonable to me; config.pbuf_max_size should be what is returned for UCX module when src_payload_addr is nullptr. Note that the current default value for pbuf_max_size is 8KB.

I had Claude make this fix. See if it looks good to you.

> It is worth documenting for the user of the new API that using max_payload_size may lead to sub-optimal perf (and that's why we have recommended_max_payload).

I added documentation to that effect. See if you are happy with it.

Contributor

I'll drop my comments on the PR shortly; I'm going over it now.

@lightsighter
Contributor Author

Updated description of the changes, now that we have fixed some more cases:

This PR fixes fragmentation issues for Realm active messages with large payloads. The original crash occurred when the UCX backend's pbuf_get asserted size <= ib_seg_size on a 41KB SimpleXferDesCreateMessage payload.

Changes

Automatic payload chunking in ActiveMessage

  • ActiveMessage now automatically fragments payloads that exceed the network backend's hard limit, using WrappedWithFragInfo wrapper headers and the existing FragmentedMessage reassembly infrastructure in IncomingMessageManager.
  • Chunking is transparent to callers — no API changes required. Messages that fit within the limit have zero overhead (the check is a single comparison).
  • Covers both the network-allocated path (init(target, size) + payload_ptr()/add_payload()) and the caller-provided data paths (init(target, data, len) and 2D variants). The data-ref path avoids copying by slicing the caller's buffer directly.
  • Each message type T automatically gets a companion WrappedWithFragInfo handler registered via dual registration in ActiveMessageHandlerReg, so no per-message-type opt-in is needed.
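The copy-free slicing mentioned for the data-ref path can be illustrated with a small sketch. SliceSketch and slice_buffer are hypothetical names; the idea shown is only that each chunk is a pointer and length into the caller's buffer rather than a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical view into the caller's buffer: no bytes are copied.
struct SliceSketch {
  const unsigned char *ptr;
  std::size_t len;
};

// Cut [data, data + len) into consecutive slices of at most max_chunk bytes.
inline std::vector<SliceSketch> slice_buffer(const void *data, std::size_t len,
                                             std::size_t max_chunk)
{
  assert(max_chunk > 0);
  const unsigned char *base = static_cast<const unsigned char *>(data);
  std::vector<SliceSketch> slices;
  std::size_t off = 0;
  while(off < len) {
    const std::size_t n = (len - off < max_chunk) ? (len - off) : max_chunk;
    slices.push_back({base + off, n});
    off += n; // never exceeds len, so no wraparound even for a huge max_chunk
  }
  return slices;
}
```

Each slice would then be handed to the send path as the source buffer of one fragment. The caller's buffer must stay valid until all fragments have been sent, which is an obligation this sketch does not enforce.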

New Network::max_payload_size(header_size, src_payload_addr) interface

  • Returns the strict upper bound on payload size for a single active message on the eager (non-RDMA) path, as opposed to the advisory recommended_max_payload.
  • Takes a const void *src_payload_addr parameter so backends can distinguish caller-provided vs network-allocated buffers and check segment registration where applicable.
  • Backend implementations:
    • UCX: SIZE_MAX when caller provides the buffer (rendezvous handles any size); pbuf pool capacity limit when nullptr.
    • GASNet-EX/GASNet-1: Medium message limit regardless of source — Long messages require a RemoteAddress destination which this interface doesn't carry.
    • MPI: AM_BUF_SIZE - header_size regardless — no registered segment concept.
    • Loopback: SIZE_MAX.

UCX assertion removal

  • Removed assert(size <= ib_seg_size) from pbuf_get in ucp_internal.cc. This was a Realm-side guard, not a UCX requirement — UCX handles large messages internally via eager/rendezvous protocol selection.

Removed ActiveMessageAuto

  • Deleted the ActiveMessageAuto class and its test, since ActiveMessage now handles fragmentation directly. Converted the sole remaining usage in barrier_impl.cc.


inline size_t max_payload_size(size_t header_size, const void *src_payload_addr)
{
#ifdef REALM_USE_MULTIPLE_NETWORKS
Contributor

This needs at least a TODO on what to do here next

Contributor Author

@lightsighter lightsighter Mar 23, 2026

Claude is just mirroring what exists in related functions. You can see them in the main branch here:

https://github.com/StanfordLegion/realm/blob/main/src/realm/network.inl#L144-L220

What do you think should go in there for all those different functions?

// messages regardless of whether the source is in a registered segment
// (Long messages require a dest_payload_addr)
(void)src_payload_addr;
return recommended_max_payload(Network::my_node_id, false /*with_congestion*/,
Contributor

Why does this return the recommended payload size when the documentation clearly says that it's an upper bound? Shouldn't it be possible to use gasnet_AMMaxMedium here?

Contributor Author

I think Claude is just reusing an existing function that has the same logic (computing the size of the AM medium minus the header size). If you want I can ask it to split it out and duplicate the logic so it is clear what is happening.

void *UCPInternal::pbuf_get(UCPWorker *worker, size_t size)
{
  char *buf;
  assert(size <= ib_seg_size);
Contributor

Why is this removed?

Contributor Author

Quoting from the description above:

Removed assert(size <= ib_seg_size) from UCPInternal::pbuf_get(). This was a Realm-side guard, not a UCX requirement. UCX's UCP layer handles message fragmentation internally — the send path automatically selects eager vs rendezvous based on message size, and the receive path already fully supports rendezvous via UCP_AM_RECV_ATTR_FLAG_RNDV.

This was the assertion that @SeyedMir put in because we thought Realm would always abide by it, and it was the one that I originally tripped over when Realm didn't. It's not necessary in UCX because UCX automatically does the splitting and reassembly for you.

size_t net_max = Network::max_payload_size(sizeof(T), _data);
if(total_bytes > net_max) {
  // linearize 2D data, then chunk
  if(_line_stride == _bytes_per_line) {
Contributor

Where is this change tested?

Contributor Author

Claude and I added a new test tests/unit_tests/actmsg_fragmentation_test.cc which should cover this. See what you think.
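For readers following the 2D path under review, here is a minimal sketch of what "linearize 2D data, then chunk" could mean. linearize_2d is a hypothetical helper, not the code in the PR; it assumes the source is a set of equally sized lines separated by a byte stride, and that the contiguous case (stride equal to line length) needs no per-line gathering.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Gather `lines` rows of `bytes_per_line` bytes, spaced `line_stride` bytes
// apart, into one contiguous buffer that can then be chunked like 1D data.
inline std::vector<unsigned char> linearize_2d(const void *data,
                                               std::size_t bytes_per_line,
                                               std::size_t lines,
                                               std::size_t line_stride)
{
  assert(line_stride >= bytes_per_line);
  const unsigned char *src = static_cast<const unsigned char *>(data);
  std::vector<unsigned char> out(bytes_per_line * lines);
  if(out.empty())
    return out;
  if(line_stride == bytes_per_line) {
    // Already contiguous: a single copy suffices. A real implementation
    // could skip the copy entirely and chunk the source in place.
    std::memcpy(out.data(), src, out.size());
  } else {
    for(std::size_t i = 0; i < lines; i++)
      std::memcpy(out.data() + i * bytes_per_line, src + i * line_stride,
                  bytes_per_line);
  }
  return out;
}
```

A test for this path would want to cover both branches: a strided source, where padding bytes must be dropped, and a contiguous source, where the output must equal the input byte for byte.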

data, src_payload_addr.segment, &dest_payload_addr, with_congestion, header_size);
}

size_t UCPModule::max_payload_size(size_t header_size, const void *src_payload_addr)
Contributor

@SeyedMir How are we going to ensure that this works correctly with the receiver's maximum size? UCX handles transport-level fragmentation; however, the rendezvous protocol will end up in am_msg_recv_handler, where the receiver requests a pool object whose size can likely be below this limit if a payload address is provided. We had an assert that effectively would have caught this, but it was removed, so now it is likely to cause a silent failure, unless I am missing something here.
