Replace nccl_ofi_deque with std::deque #821
Open: AvivBenchorin wants to merge 1 commit into aws:master from AvivBenchorin:feature/cpp_migration_deque
rauteric reviewed Mar 24, 2025
rauteric reviewed Mar 24, 2025
bwbarrett reviewed Mar 25, 2025
75b9d56 to 11c43ab
bwbarrett previously approved these changes Mar 26, 2025
11c43ab to 70003f7
rauteric reviewed Mar 26, 2025
70003f7 to 108df04
Replaces the C nccl_ofi_deque implementation with the C++ STL `std::deque`, and updates all usage of the previous nccl_ofi_deque C implementation to use it. The deque in `nccl_net_ofi_rdma_ep_t`, `pending_reqs_queue`, gains an associated lock which must be acquired and released around every call to its member functions. The send and recv comm cleanup list deques already have a shared lock, `comm_cleanup_list_lock`. Signed-off-by: Aviv Benchorin <[email protected]>
108df04 to 42b3144
bwbarrett approved these changes Mar 27, 2025
Replaces the C nccl_ofi_deque implementation with the C++ STL `std::deque`, and updates all usage of the previous nccl_ofi_deque C implementation to use it.
The deque in `nccl_net_ofi_rdma_ep_t`, `pending_reqs_queue`, gains an associated lock which must be acquired and released around every call to its member functions. The send and recv comm cleanup list deques already have a shared lock, `comm_cleanup_list_lock`.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
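A minimal sketch of the locking discipline described above. The `endpoint` and `request` types are hypothetical stand-ins for the plugin's real structures, and `pending_reqs_lock` is an assumed name for the associated lock; only `pending_reqs_queue` comes from the description. The point it illustrates is that every access to the `std::deque`, including the emptiness check, happens while the mutex is held:

```cpp
#include <deque>
#include <mutex>

// Hypothetical stand-in for the plugin's request type.
struct request {
    int id;
};

// Hypothetical stand-in for nccl_net_ofi_rdma_ep_t.
struct endpoint {
    std::deque<request *> pending_reqs_queue;
    std::mutex pending_reqs_lock;  // assumed name for the associated lock
};

// Append a request while holding the lock.
void enqueue_pending(endpoint &ep, request *req)
{
    std::lock_guard<std::mutex> guard(ep.pending_reqs_lock);
    ep.pending_reqs_queue.push_back(req);
}

// Remove and return the oldest request, or nullptr if the queue is empty.
// The empty() check and pop_front() share one critical section, so another
// thread cannot drain the queue between them.
request *dequeue_pending(endpoint &ep)
{
    std::lock_guard<std::mutex> guard(ep.pending_reqs_lock);
    if (ep.pending_reqs_queue.empty()) {
        return nullptr;
    }
    request *req = ep.pending_reqs_queue.front();
    ep.pending_reqs_queue.pop_front();
    return req;
}
```

Using `std::lock_guard` rather than manual lock and unlock calls means the lock is released on every return path, including the early return for an empty queue.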
Below is the PR description for the first iteration as a draft, and no longer represents the ready-for-review PR changes.
This is a draft PR intended to get early feedback on my refactoring of nccl_ofi_deque to C++.

My commit changes the previous C implementation of `nccl_ofi_deque` to a templated `nccl_ofi_deque_t` class with `std::deque` and `std::mutex` member variables and thread-safe wrappers to interact with the `std::deque`. It also updates all usage of the previous `nccl_ofi_deque` implementation to use the new class.

The particular areas of feedback I am looking for are:

- `std::deque` or `std::optional`, when to use the `auto` or `const` keywords, passing by value or reference for `nccl_ofi_deque_t` member functions, etc.
- `nccl_ofi_deque_t` implementation is just closely wrapping the `std::deque` member functions (except for `for_each_conditional_erase`), what should the unit tests for the class look like? My thought is to port the previous C implementation's unit test 1-to-1 to the C++ class, but I'm not sure if it's still necessary.
- `nccl_ofi_deque_t` declaration and definitions in the header file: to deal with linker errors for my `nccl_ofi_deque_t` template functions (discussed here), I put both the declarations and definitions in the `nccl_ofi_deque.h` header file and deleted `nccl_ofi_deque.cpp`. I don't know if this is the best solution or if there is a more appropriate one for our codebase.
- `for_each_conditional_erase` and thread-safety: I refactored iterating across the deque in `recv_comm_process_all_finalizing` and `send_comm_process_all_finalizing` to be the same as with the C implementation (get the front of the deque and repeatedly iterate to the next element with a "next" call). However, this is not perfectly thread-safe, so I tried creating the `for_each_conditional_erase` member function to encapsulate the for-loop inside the class (see my comment under the `for_each_conditional_erase` definition for more details). My goal is to get feedback on whether the current imperfect locking in `recv_comm_process_all_finalizing` and `send_comm_process_all_finalizing` is appropriate, or if I should try to further develop an approach like `for_each_conditional_erase`.

But please call out any other issues or areas for improvement that you see.
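To make the draft design above concrete, here is a rough sketch of a templated wrapper that owns both a `std::deque` and a `std::mutex`. It is not the PR's code: the class name `locked_deque`, the member names, and the exact signature of `for_each_conditional_erase` are assumptions for illustration, and it needs C++17 for `std::optional`. All member functions are defined inside the class template, which is the usual reason such code lives in a header: the compiler needs the definitions visible wherever the template is instantiated.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical thread-safe wrapper; not the PR's nccl_ofi_deque_t.
template <typename T>
class locked_deque {
public:
    void push_back(T value)
    {
        std::lock_guard<std::mutex> guard(lock_);
        items_.push_back(std::move(value));
    }

    // Remove and return the front element, or std::nullopt if empty.
    std::optional<T> pop_front()
    {
        std::lock_guard<std::mutex> guard(lock_);
        if (items_.empty()) {
            return std::nullopt;
        }
        std::optional<T> value(std::move(items_.front()));
        items_.pop_front();
        return value;
    }

    // Visit every element under a single lock acquisition and erase those
    // for which should_erase returns true. Returns the number erased.
    // The callback must not call back into this object: the mutex is not
    // recursive, so re-entry would deadlock.
    template <typename Pred>
    std::size_t for_each_conditional_erase(Pred should_erase)
    {
        std::lock_guard<std::mutex> guard(lock_);
        std::size_t erased = 0;
        for (auto it = items_.begin(); it != items_.end();) {
            if (should_erase(*it)) {
                it = items_.erase(it);  // erase returns the next valid iterator
                ++erased;
            } else {
                ++it;
            }
        }
        return erased;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> guard(lock_);
        return items_.size();
    }

private:
    mutable std::mutex lock_;  // mutable so const members can lock
    std::deque<T> items_;
};
```

Holding the lock for the whole traversal closes the gap that a front-then-next loop leaves between calls, at the cost of running the callback inside the critical section.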
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.