Conversation
…messages with large payloads
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #410 +/- ##
==========================================
- Coverage 29.07% 29.03% -0.05%
==========================================
Files 194 194
Lines 40229 40336 +107
Branches 14464 14463 -1
==========================================
+ Hits 11697 11710 +13
+ Misses 27723 27531 -192
- Partials 809 1095 +286 ☔ View full report in Codecov by Sentry. |
|
Running Legion CI: https://gitlab.com/StanfordLegion/legion/-/pipelines/2377414546 |
src/realm/network.h
Outdated
| // of congestion, source/dest registration, etc. | ||
| // network backends that handle fragmentation internally (e.g. UCX) may | ||
| // return SIZE_MAX to indicate no practical limit | ||
| size_t max_payload_size(size_t header_size); |
There was a problem hiding this comment.
I think this function needs at least another argument to specify whether the user will provide the payload buffer or not. Maybe pass a src buffer pointer (similar to recommended_max_payload) and set it to NULL if the query is for a case that the network module should provide the payload buffer.
There was a problem hiding this comment.
In response to your comment, this is the plan that Claude Code came up with. Let me know if it looks sufficient to you. If so I'll execute it.
The core issue is that the hard limit on payload size depends on who provides the payload buffer. Let me trace through why for each backend.
UCX has two paths in the UCPMessageImpl constructor (ucp_internal.cc:2034-2043):
- src_payload_addr != nullptr: uses the caller's buffer directly (PAYLOAD_BASE_EXTERNAL), then passes it to ucp_am_send_nbx which handles any size via rendezvous — effectively no limit.
- src_payload_addr == nullptr: the network must allocate via pbuf_get(), which uses memory pools sized for small eager messages. Even with the assertion removed, asking a pool designed for 8KB segments to service a 40KB allocation is wasteful or may fail.
So for UCX, returning SIZE_MAX unconditionally is wrong — it's only correct when the caller provides a buffer. Without one, there should be a smaller limit so Realm chunks the message before the network has to allocate an oversized pool buffer.
GASNet-EX uses medium messages (with their outbuf_size / AM_LUBRequestMedium limits) for both cases when there's no RDMA destination, so the limit is the same regardless. But the interface should still be honest about the parameter so that if GASNet-EX's behavior ever changes (or a future backend cares), it works correctly.
MPI and GASNet-1 similarly have the same limit regardless of source buffer, but should accept the parameter for interface consistency.
Plan
- Change the max_payload_size signature
In NetworkModule and Network namespace:
virtual size_t max_payload_size(size_t header_size, const void *src_payload_addr) = 0;
size_t max_payload_size(size_t header_size, const void *src_payload_addr);
The src_payload_addr is null when the network module will allocate the payload buffer, or non-null when the caller provides one.
- Update backend implementations
- UCX:
- src_payload_addr != nullptr: return SIZE_MAX (caller's buffer goes straight to ucp_am_send_nbx, UCX handles fragmentation via rendezvous).
- src_payload_addr == nullptr: return a limit based on the internal buffer pool capacity — likely ib_seg_size - header_size or config.pbuf_max_size, reflecting what pbuf_get can efficiently handle.
- GASNet-EX, GASNet-1, MPI, Loopback: unchanged behavior, just accept and ignore the new parameter.
- Update call sites in ActiveMessage
- init(NodeID, size_t): passes nullptr — correct, the network allocates the buffer.
- init_chunked() and commit_chunked(): pass nullptr — chunks are sent via the no-source-buffer path.
- The data-providing init variants (e.g., init(NodeID, const void*, size_t)) currently don't check max_payload_size at all. As a follow-up, these could also be made chunking-aware by querying max_payload_size(header_size, data_ptr). When the caller provides a buffer and the backend returns SIZE_MAX (UCX), chunking is skipped. When the backend returns a small limit (GASNet), chunks could be sent using slices of the original buffer — but that's a more complex change that can be deferred.
- No changes needed to commit_chunked logic
The chunking loop sends each fragment as ActiveMessage<WrappedWithFragInfo>(target, chunk_size) — the no-source-buffer path. This is correct: the limit used to size the chunks was queried with src_payload_addr=nullptr, matching how the chunks are actually sent.
Impact
The practical effect is that on UCX, a large ActiveMessage with no source buffer (like the original crash) will now be chunked into pool-friendly sizes, while a large ActiveMessage with a caller-provided buffer will continue to be sent as a single message via UCX's native rendezvous — no unnecessary chunking overhead.
There was a problem hiding this comment.
I found some more issues with this plan and updated and then pushed the changes. See if you are satisfied with the current implementation.
There was a problem hiding this comment.
Looks reasonable to me; config.pbuf_max_size should be what is returned for UCX module when src_payload_addr is nullptr. Note that the current default value for pbuf_max_size is 8KB.
realm/src/realm/ucx/ucp_internal.h
Line 122 in ea52777
There was a problem hiding this comment.
It is worth documenting for the user of the new API that using max_payload_size may lead to sub-optimal perf (and that's why we have recommended_max_payload).
There was a problem hiding this comment.
Looks reasonable to me; config.pbuf_max_size should be what is returned for UCX module when src_payload_addr is nullptr. Note that the current default value for pbuf_max_size is 8KB.
I had Claude make this fix. See if it looks good to you.
It is worth documenting for the user of the new API that using max_payload_size may lead to sub-optimal perf (and that's why we have recommended_max_payload).
I added documentation to that effect. See if you are happy with it.
There was a problem hiding this comment.
I'll drop my comments on the PR shortly..going over it
|
Update description of the changes now that we fixed some more cases: This PR fixes fragmentation issues for Realm active messages with large payloads. The original crash occurred when the UCX backend's pbuf_get asserted size <= ib_seg_size on a 41KB SimpleXferDesCreateMessage payload. Changes Automatic payload chunking in ActiveMessage
New Network::max_payload_size(header_size, src_payload_addr) interface
UCX assertion removal
Removed ActiveMessageAuto
|
|
|
||
| inline size_t max_payload_size(size_t header_size, const void *src_payload_addr) | ||
| { | ||
| #ifdef REALM_USE_MULTIPLE_NETWORKS |
There was a problem hiding this comment.
This needs at least a TODO on what to do here next
There was a problem hiding this comment.
Claude is just mirroring what exists in related functions. You can see them in the main branch here:
https://github.com/StanfordLegion/realm/blob/main/src/realm/network.inl#L144-L220
What do you think should go in there for all those different functions?
| // messages regardless of whether the source is in a registered segment | ||
| // (Long messages require a dest_payload_addr) | ||
| (void)src_payload_addr; | ||
| return recommended_max_payload(Network::my_node_id, false /*with_congestion*/, |
There was a problem hiding this comment.
Why does this return recommended payload while the documentation clearly says that it's an upper bound? Should be possible to use gasnet_AMMaxMedium here?
There was a problem hiding this comment.
I think Claude is just reusing an existing function that has the same logic (computing the size of the AM medium minus the header size). If you want I can ask it to split it out and duplicate the logic so it is clear what is happening.
| void *UCPInternal::pbuf_get(UCPWorker *worker, size_t size) | ||
| { | ||
| char *buf; | ||
| assert(size <= ib_seg_size); |
There was a problem hiding this comment.
Quoting from the description above:
Removed assert(size <= ib_seg_size) from UCPInternal::pbuf_get(). This was a Realm-side guard, not a UCX requirement. UCX's UCP layer handles message fragmentation internally — the send path automatically selects eager vs rendezvous based on message size, and the receive path already fully supports rendezvous via UCP_AM_RECV_ATTR_FLAG_RNDV.
This was the assertion that @SeyedMir put in just because we thought Realm would always abide by it and was the one that I originally tripped over when Realm didn't. It's not necessary in UCX because UCX automatically does the splitting and reassembly for you.
src/realm/activemsg.inl
Outdated
| size_t net_max = Network::max_payload_size(sizeof(T), _data); | ||
| if(total_bytes > net_max) { | ||
| // linearize 2D data, then chunk | ||
| if(_line_stride == _bytes_per_line) { |
There was a problem hiding this comment.
Where is this change tested?
There was a problem hiding this comment.
Claude and I added a new test tests/unit_tests/actmsg_fragmentation_test.cc which should cover this. See what you think.
| data, src_payload_addr.segment, &dest_payload_addr, with_congestion, header_size); | ||
| } | ||
|
|
||
| size_t UCPModule::max_payload_size(size_t header_size, const void *src_payload_addr) |
There was a problem hiding this comment.
@SeyedMir How are we going to ensure that it works correctly with the receiver max size? UCX handles transport-level fragmentation, so however the rendezvous protocol will result in am_msg_recv_handler which receiver requests a pool object which can likely be below this limit if payload addr is provided. We had assert that effectively would have caught this but it was removed, so now it's likely to cause a silent failure. Unless I am missing anything here.
…t is now abstracted in the active message implementation
…ing happens automatically when necessary
Add automatic payload chunking to the ActiveMessage class so that any active message whose payload exceeds the network backend's hard limit is transparently fragmented and reassembled, eliminating crashes like the assert(size <= ib_seg_size) failure in the UCX backend.
Changes
New NetworkModule::max_payload_size() interface
Automatic chunking in ActiveMessage
Automatic dual handler registration
UCX pbuf_get assertion removal
Removal of ActiveMessageAuto
@SeyedMir @apryakhin