Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix compatibility issues with tcp provider #815

Merged

Conversation

yexiang-aws
Copy link
Contributor

@yexiang-aws yexiang-aws commented Mar 19, 2025

Description of changes:

Commit 1:
rdma: Lower provider not faound message to INFO

Not finding any provider in rdma initialization is not necessarily an error that users should be concerned by. When a transport isn't requested by the platform or user, both sendrecv and rdma transports are initialized. Given the restrictive requirements of the rdma transport, not finding a provider is a reasonable outcome. Log level set to INFO, instead of WARN will be sufficient. For other error codes, keeping them as WARN.

Commit 2:
sendrecv: Set FI_RMA for tcp provider

Regarding issue: ofiwg/libfabric#10882 that fi_mr_reg call to the TCP provider will fail unless the attr->requested_key is unique for each call, while fix in Libfabric is in progress and will be included in Libfabric 2.2, plugin needs a workaround to unblock EFA installer release.

This change to sendrecv sets FI_RMA capability for the TCP provider, and plugin will allocate and provide unique mr key so that it meets the fi_mr_reg call's current expectation.

Commit 3:
sendrecv: Only request READ access when RMA enabled

When remote access is not requested, mr access flags(FI_READ/FI_REMOTE_READ) should not be set either


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@yexiang-aws yexiang-aws force-pushed the bugfix/sendrecv-tcp-key-validation branch from d39205d to 13a657b Compare March 19, 2025 19:15
@yexiang-aws yexiang-aws force-pushed the bugfix/sendrecv-tcp-key-validation branch 2 times, most recently from 23421b4 to dd5f30e Compare March 19, 2025 22:48
@yexiang-aws yexiang-aws changed the title Bugfix: sendrecv set FI_RMA for tcp provider Bugfix: fix compatibility issues with tcp provider Mar 19, 2025
@yexiang-aws yexiang-aws force-pushed the bugfix/sendrecv-tcp-key-validation branch from dd5f30e to afaf7cc Compare March 19, 2025 22:56
Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need another commit, which removes the mr_attr.access |= FI_READ or mr_attr.access |= FI_REMOTE_READ in the case where FI_RMA is not supported in the sendrecv provider. Otherwise, we're still violating contracts here.

We also should fix the fact that we have the READ / REMOTE_READ flags flipped in the host vs. accelerator for the rdma protocol, but that can be a SIM if you prefer.

@yexiang-aws yexiang-aws force-pushed the bugfix/sendrecv-tcp-key-validation branch from afaf7cc to 2e61874 Compare March 20, 2025 07:25
@yexiang-aws yexiang-aws marked this pull request as ready for review March 20, 2025 07:26
@yexiang-aws yexiang-aws requested a review from a team as a code owner March 20, 2025 07:26
Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of the "bugfix" prefix on the first line of commit messages. You only get 50 characters or so, and that eats up 7 of them. I'd much rather you focus on getting the "what" part of the commit right. The prefix of area of code can help, in that you almost always need that anyway.

Also, please avoid "fix " in the subject of a commit message. Tell me what you changed, like "sendrecv: Only request READ access when RMA enabled". The current state of bugs is lost context when you're reading through git commit messages a year from now.

Finally, your commits messages should have subject bodies. They should describe both the what and the why of the change.

@yexiang-aws yexiang-aws force-pushed the bugfix/sendrecv-tcp-key-validation branch 2 times, most recently from a279bba to 920af5c Compare March 20, 2025 20:23
Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first commit, why is moving from WARN to INFO the right thing to do? The subject gives me the what, but why should be answered in the body. If I were writing it, I'd probably write something like:

rdma: Lower provider not found message to INFO

Not finding any providers in rdma initialization is not necessarily an error the user should be concerned by. When a transport isn't requested by the platform or user, both sendrecv and rdma transports are initialized, and given the restrictive requirements of the rdma transport, not finding a provider is a reasonable outcome.

For the second commit, you need to give more context in the subject. Link to the GitHub issue in Libfabric. Make it clear this is working around a bug in Libfabric and not some API thing. This is not the place to be sparse with words.

@yexiang-aws yexiang-aws force-pushed the bugfix/sendrecv-tcp-key-validation branch 2 times, most recently from 7c7aa50 to def1cfb Compare March 20, 2025 21:29
Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You removed your signed-off-by lines.

Copy link
Contributor

@rauteric rauteric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in your first commit message: rdma: Lower provider not faound message to INFO

Not finding any provider in rdma initialization is not necessarily
an error that users should be concerned by. When a transport isn't
requested by the platform or user, both sendrecv and rdma transports
are initialized. Given the restrictive requirements of the rdma transport,
not finding a provider is a reasonable outcome. Log level set to INFO,
instead of WARN will be sufficient. For other error codes, keeping
them as WARN.

Signed-off-by: Ye Xiang <[email protected]>
Regarding Libfabric issue: ofiwg/libfabric#10882

fi_mr_reg call to the TCP provider will fail unless the attr->requested_key
is unique for each call. While fix in Libfabric is in progress and will be
included after Libfabric version 2.2, we needed a workaround to mitigate
users with TCP provider.

This change to sendrecv sets FI_RMA capability for the TCP provider, and plugin
will allocate and provide unique mr key so that it meets the fi_mr_reg call's
current expectation.

Signed-off-by: Ye Xiang <[email protected]>
When remote access is not requested, mr access flags(FI_READ/FI_REMOTE_READ)
should not be set either

Signed-off-by: Ye Xiang <[email protected]>
@yexiang-aws yexiang-aws force-pushed the bugfix/sendrecv-tcp-key-validation branch from def1cfb to 56c91af Compare March 20, 2025 22:01
@yexiang-aws yexiang-aws changed the title Bugfix: fix compatibility issues with tcp provider Fix compatibility issues with tcp provider Mar 20, 2025
@yexiang-aws yexiang-aws merged commit b53a2ac into aws:master Mar 21, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants