-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix compatibility issues with tcp provider #815
Fix compatibility issues with tcp provider #815
Conversation
d39205d
to
13a657b
Compare
23421b4
to
dd5f30e
Compare
dd5f30e
to
afaf7cc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need another commit, which removes the mr_attr.access |= FI_READ
or mr_attr.access |= FI_REMOTE_READ
in the case where FI_RMA is not supported in the sendrecv provider. Otherwise, we're still violating contracts here.
We also should fix the fact that we have the READ / REMOTE_READ flags flipped in the host vs. accelerator for the rdma protocol, but that can be a SIM if you prefer.
afaf7cc
to
2e61874
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a huge fan of the "bugfix" prefix on the first line of commit messages. You only get 50 characters or so, and that eats up 7 of them. I'd much rather you focus on getting the "what" part of the commit right. The prefix of area of code can help, in that you almost always need that anyway.
Also, please avoid "fix " in the subject of a commit message. Tell me what you changed, like "sendrecv: Only request READ access when RMA enabled". The current state of bugs is lost context when you're reading through git commit messages a year from now.
Finally, your commits messages should have subject bodies. They should describe both the what and the why of the change.
a279bba
to
920af5c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the first commit, why is moving from WARN to INFO the right thing to do? The subject gives me the what, but why should be answered in the body. If I were writing it, I'd probably write something like:
rdma: Lower provider not found message to INFO
Not finding any providers in rdma initialization is not necessarily an error the user should be concerned by. When a transport isn't requested by the platform or user, both sendrecv and rdma transports are initialized, and given the restrictive requirements of the rdma transport, not finding a provider is a reasonable outcome.
For the second commit, you need to give more context in the subject. Link to the GitHub issue in Libfabric. Make it clear this is working around a bug in Libfabric and not some API thing. This is not the place to be sparse with words.
7c7aa50
to
def1cfb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You removed your signed-off-by lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in your first commit message: rdma: Lower provider not faound message to INFO
Not finding any provider in rdma initialization is not necessarily an error that users should be concerned by. When a transport isn't requested by the platform or user, both sendrecv and rdma transports are initialized. Given the restrictive requirements of the rdma transport, not finding a provider is a reasonable outcome. Log level set to INFO, instead of WARN will be sufficient. For other error codes, keeping them as WARN. Signed-off-by: Ye Xiang <[email protected]>
Regarding Libfabric issue: ofiwg/libfabric#10882 fi_mr_reg call to the TCP provider will fail unless the attr->requested_key is unique for each call. While fix in Libfabric is in progress and will be included after Libfabric version 2.2, we needed a workaround to mitigate users with TCP provider. This change to sendrecv sets FI_RMA capability for the TCP provider, and plugin will allocate and provide unique mr key so that it meets the fi_mr_reg call's current expectation. Signed-off-by: Ye Xiang <[email protected]>
When remote access is not requested, mr access flags(FI_READ/FI_REMOTE_READ) should not be set either Signed-off-by: Ye Xiang <[email protected]>
def1cfb
to
56c91af
Compare
Description of changes:
Commit 1:
rdma
: Lower provider not faound message to INFONot finding any provider in rdma initialization is not necessarily an error that users should be concerned by. When a transport isn't requested by the platform or user, both sendrecv and rdma transports are initialized. Given the restrictive requirements of the rdma transport, not finding a provider is a reasonable outcome. Log level set to INFO, instead of WARN will be sufficient. For other error codes, keeping them as WARN.
Commit 2:
sendrecv
: Set FI_RMA for tcp providerRegarding issue: ofiwg/libfabric#10882 that
fi_mr_reg
call to the TCP provider will fail unless theattr->requested_key
is unique for each call, while fix in Libfabric is in progress and will be included in Libfabric 2.2, plugin needs a workaround to unblock EFA installer release.This change to
sendrecv
sets FI_RMA capability for the TCP provider, and plugin will allocate and provide unique mr key so that it meets the fi_mr_reg call's current expectation.Commit 3:
sendrecv
: Only request READ access when RMA enabledWhen remote access is not requested, mr access flags(FI_READ/FI_REMOTE_READ) should not be set either
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.