[RFC] [NOT FOR MAIN] libibverbs: Add ultra ethernet support #1653
Open: shefty wants to merge 16 commits into linux-rdma:master from shefty:uec
+533
−6
A user of libibverbs must rely heavily on external documentation, specifically the IBTA vol. 1 specification, to understand how the API is used. However, the API itself has evolved beyond support for only Infiniband. This leaves both users and potential vendors trying to plug into the API struggling, as the names used by the library reflect Infiniband naming, but the concepts have broader use. To provide better guidance on what the current verbs semantic model describes, provide documentation on how major verbs constructs are used. This includes referencing the historical meaning of verbs objects, as well as their evolved use. The proposed descriptions are directly intended to help new transports, such as Ultra Ethernet, understand how to adopt verbs for best results and where potential changes may be needed. Signed-off-by: Sean Hefty <[email protected]>
Ultra ethernet is a new connectionless transport that targets HPC and AI applications running at extreme scale. Introduce new node and transport types for devices that only support the new ultra ethernet transport. UET may be layered over UDP/IP using a well-known UDP port (similar to RoCEv2), or may be layered directly over IP. Define new GID types to allow users to select UET plus the underlying protocol layering (similar to how RoCEv1 and RoCEv2 are handled). Signed-off-by: Sean Hefty <[email protected]>
UET is designed around connectionless communication. To expose UET through verbs, we introduce a new reliable-unconnected QP type (named to align with existing QP types).
Infiniband defines several states that a QP may be in. Many of the states are unsuitable for unconnected QPs in general and may be irrelevant depending on HW implementations. For UET, we define only 2 states for a UET QP: RTS and error. A UET QP is created in the ready-to-send state.
To create a UET QP directly into the RTS state, the full set of QP attributes is needed at creation time. Struct ibv_qp_init_attr_ex is extended to include struct ibv_qp_attr for this purpose.
Signed-off-by: Sean Hefty <[email protected]>
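The two-state model described above can be pictured with a small stand-alone sketch. Every name here (`uet_qp_state`, `uet_qp_attr`, `uet_qp_create`, and so on) is hypothetical and is not taken from libibverbs or from this series. The sketch only illustrates that a QP starts in RTS when it is given its full attribute set at creation, and that the only other state is error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the libibverbs definitions. */
enum uet_qp_state { UET_QPS_RTS, UET_QPS_ERR };

struct uet_qp_attr {
    uint32_t src_id;   /* illustrative attribute only */
    uint8_t  port_num; /* illustrative attribute only */
};

struct uet_qp {
    enum uet_qp_state state;
    struct uet_qp_attr attr;
};

/* Creation takes the full attribute set because there is no later
 * INIT -> RTR -> RTS walk during which it could be supplied. */
static bool uet_qp_create(struct uet_qp *qp, const struct uet_qp_attr *attr)
{
    if (qp == NULL || attr == NULL || attr->port_num == 0)
        return false;
    qp->attr = *attr;
    qp->state = UET_QPS_RTS;
    return true;
}

/* The only transition in this model is RTS -> error. */
static void uet_qp_fail(struct uet_qp *qp)
{
    assert(qp != NULL);
    qp->state = UET_QPS_ERR;
}

/* Work may be posted only while the QP is in RTS. */
static bool uet_qp_can_post(const struct uet_qp *qp)
{
    return qp->state == UET_QPS_RTS;
}
```

A caller would fill `uet_qp_attr` once, call `uet_qp_create()`, and treat any later failure as terminal for that QP.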
Job IDs are used to identify a distributed application.
The concept is widely used in HPC and AI applications, to
identify a set of distributed processes as belonging to
a single application.
Job IDs are integral to ultra ethernet. A job ID is
carried in every transport message and is part of a
UET QP address. UEC defines that job IDs must be managed
by a privileged entity. The association of a job ID to
a specific QP is a protected operation.
A simple view of the job security model is shown as this
object model:
device <--- job ID
  ^         ^
  |         |
  PD <--- job key
  ^ ^       ^
  |  \___   | (optional)
  QP ---   MR
This patch focuses on the job ID. Job keys are discussed
in a following patch.
We define new verb calls to allocate a job object. Each
job object is assigned a unique ID. The assignment of
ID values to job objects is outside the scope of the API,
and would usually be handled through a job launcher or
process manager. The ibv_alloc_job() call is used to
create and configure a job object. It is expected that
the kernel will enforce that callers have the proper
privileges to create job objects on devices (similar
to opening QP 0 or 1).
Once a job object has been created, it may be shared with
local processes using a shared fd mechanism. The creating
process obtains a sharable fd using ibv_export_job() and
exchanges the fd with the processes of the job (e.g. via
sockets). On receiving the fd, the processes use
ibv_import_job() to set up local job resources.
A job is associated with addressing information, which
includes protocol stack data, as well as an ID. The number
of bytes of the ID which are valid is dependent on the
associated protocol. For UET, it is 3 bytes.
A job object performs an additional function beyond linking
a QP with a job ID. It defines a mechanism by which local
processes can share addressing information of peers. This
can reduce the amount of memory used to store addresses
locally and enables future optimizations, such as applying
job level encryption. The feature will also map well to
HPC and AI applications that identify peers using a rank.
Conceptually, a virtual address array may be stored with
a job object. Addresses are inserted or removed from the
array at a given index location. The intent is that the
index can map directly to the process' rank. When sending
to a peer, the peer can be identified by the job plus the
index.
Note that the implementation for the job's addressing array
is not defined. A vendor may implement this in a variety
of ways. Addresses may be pre-inserted by the job launcher,
and the transport addresses may be generated using an
algorithm.
Signed-off-by: Sean Hefty <[email protected]>
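The conceptual address array can be sketched as a rank-indexed table hanging off a job object. This is only an illustration under stated assumptions: the commit explicitly leaves the implementation undefined, and the types and helpers below (`sketch_job`, `peer_addr`, `job_insert_addr`, and the rest) are invented for the example rather than taken from the proposed API. The 24-bit mask reflects the 3 valid ID bytes stated for UET.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define JOB_ID_MASK   0x00FFFFFFu /* UET job IDs carry 3 valid bytes */
#define JOB_MAX_PEERS 8           /* arbitrary table size for the sketch */

/* Hypothetical peer address; the fields are illustrative only. */
struct peer_addr {
    uint8_t  ip[16];
    uint32_t qpn;
    bool     valid;
};

/* Hypothetical job object: an ID plus a rank-indexed address table. */
struct sketch_job {
    uint32_t id;
    struct peer_addr table[JOB_MAX_PEERS];
};

static void job_init(struct sketch_job *job, uint32_t id)
{
    memset(job, 0, sizeof(*job));
    job->id = id & JOB_ID_MASK; /* keep only the valid ID bytes */
}

/* Insert an address at the index that matches the peer's rank. */
static bool job_insert_addr(struct sketch_job *job, size_t rank,
                            const struct peer_addr *addr)
{
    if (rank >= JOB_MAX_PEERS)
        return false;
    job->table[rank] = *addr;
    job->table[rank].valid = true;
    return true;
}

static bool job_remove_addr(struct sketch_job *job, size_t rank)
{
    if (rank >= JOB_MAX_PEERS || !job->table[rank].valid)
        return false;
    job->table[rank].valid = false;
    return true;
}

/* A sender identifies the peer by (job, index). */
static const struct peer_addr *job_lookup(const struct sketch_job *job,
                                          size_t rank)
{
    if (rank >= JOB_MAX_PEERS || !job->table[rank].valid)
        return NULL;
    return &job->table[rank];
}
```

Because the index is the rank, processes that share the job object do not each need a private copy of every peer address.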
The job object model can be viewed as:
device <--- job ID
  ^         ^
  |         |
  PD <--- job key
  ^ ^       ^
  |  \___   | (optional)
  QP ---   MR
This patch introduces the job key object.
The relationship between a job key and a job ID is similar to
an lkey to a MR. A job object maps to a job ID value.
Job objects are device level objects. A job key associates
the job ID with a protection domain to provide process
level protections.
Job keys are associated with a 32-bit jkey value. The jkey
will be used when posting a WR to associate a transfer with
a specific job. That is, the jkey is what mirrors the lkey
concept. The NIC converts the jkey to the job ID when
transmitting packets on the wire, applying appropriate checks
that the QP has access to the target job ID, e.g. that the job
key and QP belong to the same PD.
UET allows a registered MR to optionally be accessible only
to members of a specific job. The job key will also be used
as an optional attribute when creating a MR. Details on
associating a MR with a job key are defined in a later patch.
Signed-off-by: Sean Hefty <[email protected]>
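The jkey-to-job-ID translation and the PD check can be sketched as a table lookup. This is a simplified software model of a check the commit attributes to the NIC. All names (`sketch_pd`, `sketch_jkey`, `sketch_qp`, `resolve_jkey`) are assumptions made for illustration and do not appear in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for verbs objects. */
struct sketch_pd { int handle; };

struct sketch_jkey {
    uint32_t jkey;              /* 32-bit value placed in the WR */
    uint32_t job_id;            /* job ID carried on the wire */
    const struct sketch_pd *pd; /* PD the job key was created on */
};

struct sketch_qp { const struct sketch_pd *pd; };

/* Translate a jkey into a job ID, but only when the job key and the
 * posting QP belong to the same PD. Returns false otherwise. */
static bool resolve_jkey(const struct sketch_jkey *tbl, size_t n,
                         const struct sketch_qp *qp, uint32_t jkey,
                         uint32_t *job_id)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].jkey != jkey)
            continue;
        if (tbl[i].pd != qp->pd)
            return false; /* key exists, but on a different PD */
        *job_id = tbl[i].job_id;
        return true;
    }
    return false; /* unknown jkey */
}
```

The lookup mirrors how an lkey is validated against the PD of the QP that posts the work request.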
Add new extended QP functions to set necessary input fields related to supporting RU QPs and UE transport. The UE transport supports 64 bits of immediate data and 64-bit rkeys. Provide expanded APIs to support both. Also include APIs to set full UET destination address data.
UET QPs have an additional address component beyond the QP or endpoint address, called a resource index. A resource index can be viewed as an additional receive queue attached to the QP that is directly addressable by a sender. One intended use of resource indices is to allow a single UET QP to separate traffic from different services. For example, HPC traffic may use one subset of indices, AI traffic a different subset, and storage a third. The number of resource indices supported by a QP is vendor specific, and how they are used by applications is outside the scope of the verbs API. The resource index concept reuses the verbs work queue concept.
A new send WR flag, delivery complete, is also added. When requested and supported by the provider, this flag means that a completion for the send operation indicates that the data is globally observable at the target. This is an optional feature of the UE transport.
Signed-off-by: Sean Hefty <[email protected]>
Allow UET specific information to be reported as part of work completions. This includes the larger immediate data size, the job ID carried in the transport header, and a peer ID, also carried in the transport header. Included with completion data is a UET transport field, called the initiator in UEC terminology. This is a user configurable value intended to map to the rank number for a parallel application. The initiator field only has meaning within a specific job ID. As a result, when the value is valid in a completion, so is the job ID. (For UET, the initiator value is part of the UET address.) The verbs naming of this field is the slightly more generic term, src_id, to align with src_qpn (in ibv_wc). Signed-off-by: Sean Hefty <[email protected]>
The UET protocol and devices support advanced features for memory regions. From the viewpoint of the protocol, an rkey is 64-bits, with specific meaning applied to several of the bits. Struct ibv_mr is extended to report a 64-bit rkey. Providers are expected to set the 32-bit rkey and/or rkey64 field in struct ibv_mr correctly based on the transports supported by the device. A second protocol feature is that a MR may be restricted to being accessible by a specific job. Since a UET QP may be used to communicate with multiple jobs simultaneously, the memory registration call is expanded to allow associating a job key with a MR. Signed-off-by: Sean Hefty <[email protected]>
UET defines multiple packet delivery modes:
  ROD - reliable, ordered delivery
  RUD - reliable, unordered delivery
  RUDI - reliable, unordered delivery for idempotent transfers
  UUD - unreliable, unordered delivery
The packet delivery modes impact how out of order packets are handled at the receiver, retry mechanisms, multi-pathing support, and congestion control algorithms, among other behavior. A single UET QP may use multiple packet delivery modes simultaneously based on the application data transfer being performed. Even traditional RDMA protocols are evolving to allow greater flexibility in how message and data ordering are delivered at the receiver.
This patch introduces a new QP attribute structure called QP semantics. This structure defines the message and data ordering requirements that a QP must implement. If a QP cannot meet the requested semantics, QP creation should fail, but a vendor can always provide stronger guarantees than those requested by the user.
QP semantics indicate if the QP must provide message and data ordering guarantees, such as write-after-write, read-after-write, send-after-write, etc. Traditionally, these ordering guarantees were defined by the relevant RDMA specifications, and users of the libibverbs API needed to know to reference those specs in order to use a QP correctly (such as when to fence data transfers). As an alternative, a new device level query call is added, which can return the supported ordering guarantees for a given QP type over a specific transport.
The QP semantics may optionally be passed into the create QP operation. After querying for supported semantics, applications can remove unneeded ordering guarantees in order to leverage available network features (such as multipath support). This allows vendors to adjust transport behavior accordingly. For example, UET can leverage ROD when sending messages, but use RUD or RUDI for RDMA transfers.
Data ordering between messages is further qualified by the maximum transfer size for which the ordering holds. For example, RDMA write-after-read ordering may be restricted to single MTU transfers.
Finally, as a 'fix' to MTU sizes being forced to a power of 2, a max_pdu is introduced. The max PDU reports the maximum size of *user* data that can be carried in a single transport packet. The max PDU is relative to the port MTU, minus protocol headers.
Signed-off-by: Sean Hefty <[email protected]>
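The negotiation described in this commit (query what is supported, drop what is not needed, fail creation if a request cannot be met) can be sketched as a subset check. The structure layout, flag names, and helper functions below are assumptions made for illustration. They are not the definitions proposed in the patch, and the header size used in any call is an arbitrary number rather than a real UET header length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ordering flags; names are illustrative only. */
enum {
    SK_ORDER_WAW = 1u << 0, /* write-after-write */
    SK_ORDER_RAW = 1u << 1, /* read-after-write */
    SK_ORDER_SAW = 1u << 2, /* send-after-write */
};

/* Hypothetical QP semantics structure. */
struct sketch_qp_semantics {
    uint32_t order_flags;      /* required ordering guarantees */
    uint32_t max_ordered_size; /* largest transfer for which ordering holds */
};

/* A request is satisfiable when every requested guarantee is offered
 * and the ordered transfer size does not exceed the supported one.
 * The provider may still deliver stronger guarantees than requested. */
static bool semantics_supported(const struct sketch_qp_semantics *supported,
                                const struct sketch_qp_semantics *requested)
{
    if (requested->order_flags & ~supported->order_flags)
        return false;
    return requested->max_ordered_size <= supported->max_ordered_size;
}

/* Max PDU: user data per packet, i.e. port MTU minus protocol headers. */
static uint32_t sketch_max_pdu(uint32_t port_mtu, uint32_t header_bytes)
{
    return port_mtu > header_bytes ? port_mtu - header_bytes : 0;
}
```

An application that only needs send-after-write ordering would clear the other bits before creating the QP, which leaves the provider free to use unordered delivery for the remaining traffic.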
Legacy RDMA transports are restricted to 32 bits of immediate data, while UET supports 64 bits. Additionally, UET does not require that RDMA writes with immediate consume a posted receive buffer at the target. The spec even goes so far as to mandate that RDMA traffic be treated separately from send operations at the target; however, such a mandate is not visible in the transport and places restrictions on the NIC implementation.
NICs that support multiple protocols, including UET, may be optimized for legacy RDMA support. For example, CQ entries may only be able to store 32 bits of immediate data.
To handle different implementations and transports, we extend the QP semantic structure to report the immediate data size, as well as implementation constraints, such as the need to consume a posted receive buffer. This change has an added advantage that it is now possible for a user to indicate that immediate data will not be used by setting the size to 0 when creating the QP.
For devices which support a smaller immediate data size than that carried by the transport, truncated immediate data is extended with 0s when writing to the wire, and completions report the lowest valid bits.
The QP semantics are extended with a new use_flags field. These flags allow providers to inform applications of constraints on using the HW, allowing greater flexibility in implementations.
When set, IBV_QP_USAGE_IMM_DATA_RQ indicates that RDMA writes with immediate data will consume a posted receive buffer on the QP. This is standard behavior for legacy RDMA transports, but not for UET. By setting this flag, a provider can indicate this as its default requirement even when using UET QPs.
Signed-off-by: Sean Hefty <[email protected]>
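The truncation rule for devices with a narrower immediate data field can be written out as two masking helpers. This is a minimal sketch of the stated behavior (zero extension on the wire, lowest valid bits in completions). The function names are hypothetical and the code does not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Mask covering the low `bits` bits; 0 means immediate data is unused. */
static uint64_t imm_mask(unsigned bits)
{
    if (bits == 0)
        return 0;
    if (bits >= 64)
        return UINT64_MAX;
    return (UINT64_C(1) << bits) - 1;
}

/* Sender side: a device that holds only `dev_bits` of immediate data
 * places the truncated value on the wire, with the upper bits of the
 * 64-bit transport field filled with zeros. */
static uint64_t imm_to_wire(uint64_t imm, unsigned dev_bits)
{
    return imm & imm_mask(dev_bits);
}

/* Receiver side: the completion reports only the lowest valid bits
 * that the device is able to store. */
static uint64_t imm_from_wire(uint64_t wire_imm, unsigned dev_bits)
{
    return wire_imm & imm_mask(dev_bits);
}
```

With `dev_bits` set to 32, a value such as `0x1122334455667788` goes on the wire as `0x0000000055667788`, so applications that must interoperate with such devices should keep meaningful data in the low 32 bits.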
Legacy RDMA devices immediately expose a new MR as soon as the memory registration process completes. That is, even before reg_mr() returns to the caller, the region is accessible to any QP sharing the same PD.
UET allows for greater control over access to a MR. Even once a MR has been created, exposure of the MR is treated as a separate operation. This further allows access to a MR to be revoked without it being destroyed, which enables a MR to be used once. E.g. the MR may be the target of a single RDMA operation, with access controlled by the owner of the MR. This behavior differs from the remote invalidate operation.
To support this additional level of control, we introduce new QP operations: attach MR and detach MR. A provider indicates that MRs must be explicitly attached to a QP through a new QP usage flag, as this behavior may be specific to a given transport protocol + QP type. E.g. UET + RU QPs may support MR attachment, but UET + UD QPs may not (since the feature is not required).
Support for, and the need for, attaching a MR to a QP is indicated by the IBV_QP_USAGE_ATTACH_MR usage flag.
Signed-off-by: Sean Hefty <[email protected]>
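The separation between registering a MR and exposing it can be modeled with an explicit attached flag. The sketch below is an assumption-laden illustration: `sketch_mr` and the helper names are invented, and the use-once helper simply models the owner detaching the MR after the first successful remote operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical MR model: registered, but not necessarily exposed. */
struct sketch_mr {
    uint64_t rkey;
    bool     attached; /* true while the MR is exposed through a QP */
};

/* Registration alone does not expose the region in this model. */
static void sketch_reg_mr(struct sketch_mr *mr, uint64_t rkey)
{
    mr->rkey = rkey;
    mr->attached = false;
}

static void sketch_attach_mr(struct sketch_mr *mr) { mr->attached = true; }
static void sketch_detach_mr(struct sketch_mr *mr) { mr->attached = false; }

/* Remote access needs both a matching rkey and an attached MR. */
static bool remote_access_allowed(const struct sketch_mr *mr, uint64_t rkey)
{
    return mr->attached && mr->rkey == rkey;
}

/* Use-once pattern: the owner revokes access after the first
 * successful operation, without destroying the MR. */
static bool remote_access_once(struct sketch_mr *mr, uint64_t rkey)
{
    if (!remote_access_allowed(mr, rkey))
        return false;
    sketch_detach_mr(mr);
    return true;
}
```

The MR remains registered after the detach, so it can be attached again later without repeating the registration.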
UET allows for user selected rkey values to improve scalability. Expose support via a device capability flag and update memory registration accordingly. Signed-off-by: Sean Hefty <[email protected]>
Introduce a concept called derived memory regions. Derived MRs are similar to legacy RDMA memory windows, but are set up through the memory registration API, rather than post send. Derived MRs are new MRs that are wholly contained within an existing MR (to share page mappings, for example), but have different access rights or other attributes. For UET, a derived MR allows a MR to be associated with different jobs, with the access for each job being different, while still being able to share the underlying HW page mappings.
Applications must assume that a derived MR holds a reference on the original MR. The original MR may not be destroyed until all derived MRs have been closed.
When a MR is created, a derive_cnt field may be provided to indicate the number of derived MRs that an application intends to create. This field is considered an optimization and may be ignored by the provider.
Providers that do not support derived MRs may simply create a new MR without sharing resources with the original MR.
A derived MR is subject to reported provider restrictions, such as IBV_QP_USAGE_ATTACH_MR.
Signed-off-by: Sean Hefty <[email protected]>
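The containment and lifetime rules for derived MRs amount to a bounds check plus a reference count. The following sketch assumes a simplified MR record. `sketch_mr`, `sketch_mr_derive`, and `sketch_mr_close` are hypothetical names, and a real provider would track considerably more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MR record used only for this illustration. */
struct sketch_mr {
    struct sketch_mr *parent; /* NULL for an original MR */
    uint64_t addr;
    size_t   len;
    unsigned access;          /* access rights, opaque here */
    unsigned derived;         /* open derived MRs referencing this MR */
    bool     open;
};

static void sketch_mr_open(struct sketch_mr *mr, uint64_t addr, size_t len,
                           unsigned access)
{
    mr->parent = NULL;
    mr->addr = addr;
    mr->len = len;
    mr->access = access;
    mr->derived = 0;
    mr->open = true;
}

/* A derived MR must lie wholly inside the original and takes a
 * reference on it. The checks are ordered to avoid overflow. */
static bool sketch_mr_derive(struct sketch_mr *child, struct sketch_mr *orig,
                             uint64_t addr, size_t len, unsigned access)
{
    if (!orig->open || addr < orig->addr || len > orig->len ||
        addr - orig->addr > orig->len - len)
        return false;
    sketch_mr_open(child, addr, len, access);
    child->parent = orig;
    orig->derived++;
    return true;
}

/* The original cannot be closed while derived MRs remain open. */
static bool sketch_mr_close(struct sketch_mr *mr)
{
    if (!mr->open || mr->derived > 0)
        return false;
    if (mr->parent != NULL)
        mr->parent->derived--;
    mr->open = false;
    return true;
}
```

Closing must therefore proceed from the derived MRs back to the original, which matches the ordering requirement stated above.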
The UET initiator is equivalent to an MPI rank or CCL communicator ID. It is a user settable value used for tag matching purposes. UET carries the initiator field directly in the transport header. Extend the initiator QP attributes to allow the user to set the value. We use the more generic term, src_id, instead of the UET specific term. The naming is aligned with src_qpn in ibv_wc. Signed-off-by: Sean Hefty <[email protected]>
UET associates multiple receive queues with a single queue pair. In UET terms, a QP maps to a PIDonFEP, and the receive queues are known as resource indices. Resource indices allow for receive side resources to be separated, such that they may be dedicated to separate services (e.g. MPI, CCL, storage). To support separate resources, we reuse the verbs work queue objects (ibv_wq). The API is extended slightly for UET. First, we add an extended device attribute, max_rqw_per_qp, to limit the number of WQs which may be associated with a QP. Secondly, we extend the WQ attributes to allow the user to select the wq_num (i.e. UET resource index) associated with a WQ. It is the responsibility of higher-level SW to allocate, configure, and associate WQs with QPs, so that the QP is assigned the correct number of WQs with the necessary addresses. Signed-off-by: Sean Hefty <[email protected]>
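The association of several receive work queues with one QP, each addressed by a resource index, can be sketched as a small lookup table. The limit constant stands in for the device attribute described above, and every type and function name is an assumption made for the example rather than part of the extended API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_WQ_PER_QP 4 /* stand-in for the device-reported limit */

/* Hypothetical WQ: wq_num plays the role of the UET resource index. */
struct sketch_wq {
    uint32_t    wq_num;
    const char *service; /* e.g. "MPI", "CCL", "storage" */
};

struct sketch_qp {
    struct sketch_wq *wqs[SKETCH_MAX_WQ_PER_QP];
    size_t            nwq;
};

/* Look up the receive queue that a sender addressed by resource index. */
static struct sketch_wq *qp_find_wq(const struct sketch_qp *qp,
                                    uint32_t resource_index)
{
    for (size_t i = 0; i < qp->nwq; i++)
        if (qp->wqs[i]->wq_num == resource_index)
            return qp->wqs[i];
    return NULL;
}

/* Higher-level SW associates WQs with the QP; reject duplicates and
 * anything beyond the per-QP limit. */
static bool qp_add_wq(struct sketch_qp *qp, struct sketch_wq *wq)
{
    if (qp->nwq >= SKETCH_MAX_WQ_PER_QP || qp_find_wq(qp, wq->wq_num))
        return false;
    qp->wqs[qp->nwq++] = wq;
    return true;
}
```

In this model a sender targeting resource index 1 lands in whichever WQ was created with `wq_num` 1, independent of the order in which the WQs were added.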
Include descriptions of new objects introduced for UET: job, jkey, and address table, with verbs semantic constructs definitions. Signed-off-by: Sean Hefty <[email protected]>
This series targets merging into a staging branch until an underlying implementation is available.
The following set of patches introduces extensions to the libibverbs API to support Ultra Ethernet Transport (UET). The UE specification is available from the UE website:
https://ultraethernet.org/wp-content/uploads/sites/20/2025/10/UE-Specification-1.0.1.pdf
UEC is working towards kernel and user space implementations for UET devices. The purpose behind this PR is to share the targeted user space API changes between UEC developers and rdma-core maintainers. This allows for an open discussion to ensure that development continues to move forward along an acceptable path. It is anticipated that libibverbs API changes will be reflected to some degree in the kernel ABI and APIs, and that some of the detailed changes here will result in deeper discussions both within the UEC and with rdma-core.
The first patch in this series describes the verbs semantic model. The intent is to document how verbs, despite its naming, has evolved beyond its original Infiniband constructs. For UET, a clear mapping between verbs objects and field structures to transport specific fields will eventually fall out.
Subsequent patches extend the verbs API for UET. UET has several significant features. These include:
Reliable-unconnected communication semantic as viewed by the application. We introduce a new QP type, update QP attributes, and define UET QP states.
UET uses job-based communication. This series specifically targets the UEC AI base profile. The AI base profile targets support for CCL style jobs, while excluding more complex transport operations such as tag matching. To support job-based communication, new job-related objects are introduced. Transport operations that target the AI full and HPC profiles are deferred for future changes.
UET allows finer control over the exposure of registered memory regions. The memory registration API and QP operations are extended to support UET allowed memory registration models. This includes restricting a MR to being used by a specific job. It also allows exposing a MR to the network separate from the creation of the MR.
UET is designed for sending and receiving data out-of-order. UET defines 3 reliable traffic ordering modes: reliable-ordered delivery (ROD), reliable-unordered delivery (RUD), and reliable-unordered delivery for idempotent transfers (RUDI). Traffic between 2 UET QPs may use all delivery mechanisms simultaneously. For example, message headers may use ROD, while message data may use RUD. This is a significant deviation from the verbs RC QP model. To allow the application and verbs provider to negotiate when message and/or data ordering may be relaxed, a new QP semantic structure is introduced.
Additional details are documented in each patch.