Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace nccl_ofi_deque with std::deque #821

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

AvivBenchorin
Copy link
Contributor

@AvivBenchorin AvivBenchorin commented Mar 24, 2025

Replaces the C nccl_ofi_deque implementation with the C++ STL deque implementation of deque, and updates all usage of the previous nccl_ofi_deque C implementation to use the new class.

The deque in nccl_net_ofi_rdma_ep_t, pending_reqs_queue, gains has an associated lock which must be grabbed and released whenever calling its member functions. The send and recv comm cleanup list deques already have a shared lock, comm_cleanup_list_lock.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


Below is the PR description for the first iteration as a draft, and no longer represents the ready-for-review PR changes.

This is a draft PR intended to get early feedback on my refactoring of nccl_ofi_deque to C++.

My commit changes the previous C implementation of nccl_ofi_deque to a templated nccl_ofi_deque_t class with std::deque and std::mutex member variables and thread-safe wrappers to interact with the std::deque. It also updates all usage of the previous nccl_ofi_deque implementation to use the new class.

The particular areas of feedback I am looking for are:

  • C++ best practices: e.g. how I'm using STL classes like std::deque or std::optional, when to use the auto or const keywords, passing by value or reference for nccl_ofi_deque_t member functions, etc
  • Performance impact: whether my changes risk impacting the plugin performance in any obvious way.
  • Unit tests: given my nccl_ofi_deque_t implementation is just closely wrapping the std::deque member functions (except for for_each_conditional_erase), what should the unit tests for the class look like? My thoughts are to refactor the previous C implementation unit test 1-to-1 to using the C++ class, but I'm not sure if its still necessary.
  • Putting all of the nccl_ofi_deque_t declaration and definitions in the header file: to deal with linker errors for my nccl_ofi_deque_t template functions (discussed here), I put both the declarations and definitions in the nccl_ofi_deque.h header file and deleted the nccl_ofi_deque.cpp. I don't know if this is the best solution or if there is a more appropriate one for our codebase.
  • for_each_conditional_erase and thread-safety: I refactored iterating across the deque in recv_comm_process_all_finalizing and send_comm_process_all_finalizing to be the same as with the C implementation (get the front of the deque and repeatedly iterate to the next element with a "next" call). However, this is not perfectly thread-safe, and I tried creating the for_each_conditional_erase member function to encapsulate the for-loop inside the class (see my comment under the for_each_conditional_erase definition for more details). My goal is to get feedback on whether the current imperfect locking in recv_comm_process_all_finalizing and send_comm_process_all_finalizing is appropriate, or if I should try to further develop an approach like for_each_conditional_erase.

But please call out any other issues or areas for improvement that you see.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@AvivBenchorin AvivBenchorin added the enhancement New feature or request label Mar 24, 2025
@AvivBenchorin AvivBenchorin force-pushed the feature/cpp_migration_deque branch from 75b9d56 to 11c43ab Compare March 26, 2025 01:11
@AvivBenchorin AvivBenchorin changed the title [FEEDBACK DRAFT PR] Refactor nccl_ofi_deque to C++ [FEEDBACK DRAFT PR] Replace nccl_ofi_deque with std::deque Mar 26, 2025
bwbarrett
bwbarrett previously approved these changes Mar 26, 2025
@AvivBenchorin AvivBenchorin force-pushed the feature/cpp_migration_deque branch from 11c43ab to 70003f7 Compare March 26, 2025 17:09
@AvivBenchorin AvivBenchorin marked this pull request as ready for review March 26, 2025 17:10
@AvivBenchorin AvivBenchorin requested a review from a team as a code owner March 26, 2025 17:10
@AvivBenchorin AvivBenchorin changed the title [FEEDBACK DRAFT PR] Replace nccl_ofi_deque with std::deque Replace nccl_ofi_deque with std::deque Mar 27, 2025
@AvivBenchorin AvivBenchorin force-pushed the feature/cpp_migration_deque branch from 70003f7 to 108df04 Compare March 27, 2025 18:55
Replaces the C nccl_ofi_deque implementation with the C++ STL deque
implementation of deque, and updates all usage of the previous
nccl_ofi_deque C implementation to use the new class.

The deque in `nccl_net_ofi_rdma_ep_t`, `pending_reqs_queue`, gains has
an associated lock which must be grabbed and released whenever calling
its member functions. The send and recv comm cleanup list deques already
have a shared lock, `comm_cleanup_list_lock`.

Signed-off-by: Aviv Benchorin <[email protected]>
@AvivBenchorin AvivBenchorin force-pushed the feature/cpp_migration_deque branch from 108df04 to 42b3144 Compare March 27, 2025 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants