-
Notifications
You must be signed in to change notification settings - Fork 927
ofi: Share the domain among MTL and BTL #13415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@bwbarrett @hppritcha Can you review this? |
ompi/mca/mtl/ofi/mtl_ofi_component.c
Outdated
"fi_fabric", | ||
ompi_process_info.nodename, __FILE__, __LINE__, | ||
fi_strerror(-ret), -ret); | ||
MTL_OFI_LOG_FI_ERR(ret, "opal_common_ofi_fi_fabric failed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we usually don't log an error if we've just puked a show help message. That said, I think I'd just leave the MTL / BTL show helps and not move that part into the common code, so that there's some distinction on whether the MTL or BTL call failed.
opal/mca/btl/ofi/btl_ofi_component.c
Outdated
#endif | ||
|
||
if (opal_common_ofi.fabric) { | ||
hints.fabric_attr->fabric = opal_common_ofi.fabric; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should pick an approach and stick with it. If we're going to have all the creation code in the ofi common code, then we don't need this bit. If we're going to try to reuse the getinfo() behavior, it's the name, not the fabric pointer, we should be setting here, but we also don't need the opal_common_ofi_fi_fabric() (and friends) calls.
opal/mca/common/ofi/common_ofi.c
Outdated
opal_output_verbose(1, opal_common_ofi.output, "Reusing existing fabric: %s", | ||
fabric_attr->name); | ||
} else { | ||
ret = fi_fabric(fabric_attr, fabric, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this handling of context is correct. You can't silently swap out the context under the covers. Given that neither the MTL nor BTL uses the context field today, I'd just remove the argument and set it to NULL always. If we ever do figure out event queues, they're going to have to end up in the common code to allow this sharing anyway.
Share the domain among MTL and BTL to reduce the number of domains needed. This prevents hitting the resource limit on systems with high core counts. Signed-off-by: Jessie Yang <[email protected]>
Share the domain among BTL and MTL to reduce the number of domains needed. This prevents hitting the resource limit on systems with high core counts.