-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
api: Update internal send/recv function signatures #820
base: master
Are you sure you want to change the base?
Conversation
17a7d12
to
a98169e
Compare
e03051a
to
484bf7d
Compare
We've already adopted interfaces to v9 API. This change updates the internal functions to match latest interfaces. Signed-off-by: Ye Xiang <[email protected]>
ncclResult_t nccl_net_ofi_isend_v9(void* sendComm, void* data, size_t size, | ||
int tag, void* mhandle, void** request) | ||
{ | ||
return nccl_net_ofi_isend(sendComm, data, size, tag, mhandle, request); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: The isend_v9 interface looks the same as the isend interface. Do we really need both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the two we shouldn't need. We have confusion around APIs every time we need to add / change an interface function.
What we have been doing (mostly, we've screwed this up in the past) is that the latest version API we support gets the un-versioned names. When a function interface changes, we add a function suffixed with _v[N-1]
and update all the previous code blocks. To me, this is not intuitive, and I don't love that we have to update all the old interfaces in an error-prone process.
I think we should change our operations. Every API function should have a version suffix of the API version in which the function was first used. When a function's prototype or behavior changes, we add a new version of the api with a version suffix of the API version in which the change occurred. Then we only have to copy the v[N-1] block to a vN block, change the few functions in the new version that changed, and not touch the past at all. I think this is more intuitive, but also means we don't change the past, which gives me a bit of comfort.
What I think we should do with this patch is not touch the unversioned interface functions. Just have this be changing the core interface from int to size_t, removing the overflow check for the size_t -> int cast, and adding the handling to old functions to pass a size_t array instead of an int array into the internal recv function.
In another patch, we should rename all the old functions to follow the "first time added" behavior, so we can get rid of some of this version madness.
@rajachan thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- agreed with the point of the naming confusion, suffixing with version firstly used makes more sense to me.
- irrelevant to what we suffix the old version interfaces with, if there's an internal function signature change(like this patch), won't we need to touch old versions in an error-prone process any ways? e.g. the
_v4
function takes in an int, whether it's called _v4, or _v2, we still need an type conversion to size_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but going back and renaming functions in a bunch of api structs is bad mojo.
@@ -5773,7 +5773,7 @@ static inline int check_post_rx_buff_req(nccl_net_ofi_rdma_req_t *rx_buff_req) | |||
* @brief Send a message. This "interface function" is called, indirectly, from | |||
* the application | |||
*/ | |||
static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, int size, int tag, | |||
static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, size_t size, int tag, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just glancing at uses of size, it looks like we pass it into tracing and so into LTTNG (for send, NCCL_OFI_TRACE_SEND
), so we should make sure this patch works with LTTNG enabled (--with-lttng
). I'm guessing we'll get some complaint about trying to fit a size_t into an int, since we haven't updated LTTNG yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch should update LTTNG as well, in that case.
} | ||
for (size_t i = 0; i < n; i++) { | ||
if (OFI_UNLIKELY(sizes[i] < 0)) { | ||
NCCL_OFI_WARN("Message size %d can't be negative at index %zu", sizes[i], i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't bozo check in the critical path. You shouldn't need this check function any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
today, user is technically allowed to provide a negative value, and we implicitly type case to size_t, for example, when calling alloc_rdma_send_req
in send()
. Shouldn't that be checked and dis-allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Length is always a non-negative value. NCCL isn't going to pass a negative number. It might pass a very large number, which is why I was worried about truncation. But a bunch of unnecessary checks just add latency.
ncclResult_t nccl_net_ofi_isend_v9(void* sendComm, void* data, size_t size, | ||
int tag, void* mhandle, void** request) | ||
{ | ||
return nccl_net_ofi_isend(sendComm, data, size, tag, mhandle, request); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the two we shouldn't need. We have confusion around APIs every time we need to add / change an interface function.
What we have been doing (mostly, we've screwed this up in the past) is that the latest version API we support gets the un-versioned names. When a function interface changes, we add a function suffixed with _v[N-1]
and update all the previous code blocks. To me, this is not intuitive, and I don't love that we have to update all the old interfaces in an error-prone process.
I think we should change our operations. Every API function should have a version suffix of the API version in which the function was first used. When a function's prototype or behavior changes, we add a new version of the api with a version suffix of the API version in which the change occurred. Then we only have to copy the v[N-1] block to a vN block, change the few functions in the new version that changed, and not touch the past at all. I think this is more intuitive, but also means we don't change the past, which gives me a bit of comfort.
What I think we should do with this patch is not touch the unversioned interface functions. Just have this be changing the core interface from int to size_t, removing the overflow check for the size_t -> int cast, and adding the handling to old functions to pass a size_t array instead of an int array into the internal recv function.
In another patch, we should rename all the old functions to follow the "first time added" behavior, so we can get rid of some of this version madness.
@rajachan thoughts?
@@ -5773,7 +5773,7 @@ static inline int check_post_rx_buff_req(nccl_net_ofi_rdma_req_t *rx_buff_req) | |||
* @brief Send a message. This "interface function" is called, indirectly, from | |||
* the application | |||
*/ | |||
static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, int size, int tag, | |||
static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, size_t size, int tag, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch should update LTTNG as well, in that case.
Description of changes:
We've already adopted interfaces to v9 API. This change updates the internal send/recv function definitions to match latest interfaces.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.