Skip to content

Conversation

jiaxiyan
Copy link
Contributor

@jiaxiyan jiaxiyan commented Sep 24, 2025

Share the domain among BTL and MTL to reduce the number of domains needed. This prevents hitting the resource limit on systems with high core counts.

@jiaxiyan
Copy link
Contributor Author

@bwbarrett @hppritcha Can you review this?

@jiaxiyan jiaxiyan changed the title ofi/btl: Reuse MTL's domain and fabric in BTL ofi: Reuse MTL's domain and fabric in BTL Oct 7, 2025
"fi_fabric",
ompi_process_info.nodename, __FILE__, __LINE__,
fi_strerror(-ret), -ret);
MTL_OFI_LOG_FI_ERR(ret, "opal_common_ofi_fi_fabric failed");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually don't log an error if we've just puked a show help message. That said, I think I'd just leave the MTL / BTL show helps and not move that part into the common code, so that there's some distinction on whether the MTL or BTL call failed.

#endif

if (opal_common_ofi.fabric) {
hints.fabric_attr->fabric = opal_common_ofi.fabric;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should pick an approach and stick with it. If we're going to have all the creation code in the ofi common code, then we don't need this bit. If we're going to try to reuse the getinfo() behavior, it's the name, not the fabric pointer, we should be setting here, but we also don't need the opal_common_ofi_fi_fabric() (and friends) calls.

opal_output_verbose(1, opal_common_ofi.output, "Reusing existing fabric: %s",
fabric_attr->name);
} else {
ret = fi_fabric(fabric_attr, fabric, context);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this handling of context is correct. You can't silently swap out the context under the covers. Given that neither the MTL nor BTL uses the context field today, I'd just remove the argument and set it to NULL always. If we ever do figure out event queues, they're going to have to end up in the common code to allow this sharing anyway.

Share the domain among MTL and BTL to reduce the number of domains needed.
This prevents hitting the resource limit on systems with high core counts.

Signed-off-by: Jessie Yang <[email protected]>
@jiaxiyan jiaxiyan changed the title ofi: Reuse MTL's domain and fabric in BTL ofi: Share the domain among MTL and BTL Oct 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants