Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/UCP: reduce srg #1035

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Sergei-Lebedev
Copy link
Contributor

What

Adding scatter reruce gather algorithm for reduce collective in TL UCP.

Why ?

Improves performance of medium and large message size reduce

How ?

Implementation is similar to SRA allreduce, but the last operation is gather instead of allgather

@swx-jenkins3
Copy link

Can one of the admins verify this patch?

@janjust janjust requested a review from MamziB February 5, 2025 19:40
@@ -83,7 +85,7 @@ static inline ucc_rank_t ucc_kn_pattern_radix_pow_init(ucc_knomial_pattern_t *p,
static inline void
ucc_knomial_pattern_init_impl(ucc_rank_t size, ucc_rank_t rank,
ucc_kn_radix_t radix, ucc_knomial_pattern_t *p,
int backward)
int backward, int extra)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int backward, int extra)
int backward, int has_extra)

to indicate it's a bool and not the number of extra

if (rank == root && ucc_knomial_pattern_loop_first_iteration(p) && !UCC_IS_INPLACE(*args)) {
ucc_kn_g_pattern_peer_seg(vrank, p, &peer_seg_count,
&peer_seg_offset);
status = ucc_mc_memcpy(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is error handled here?

}
if (ucc_unlikely(UCC_OK != status)) {
task->super.status = status;
return;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just checking: do we need to free ucp task here?

The extension adds the support for arbitrary radix.
3. The algorithm targets Large message sizes (ie. optimized for max bandwidth).
4. If number of ranks in the team can not form a full radix subtree
(for radix=2 this means the team size is not power of 2) then there will be
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(for radix=2 this means the team size is not power of 2) then there will be
(this means the team size is not a power of the radix) then there will be

However, if they are used together as part of SRG reduce one has to
provide the same radix for both routines.
6. After the completion of reduce-scatter phase the local result (at non EXTRA
ranks) will be located in dst buffer at offset the can be commputed by the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ranks) will be located in dst buffer at offset the can be commputed by the
ranks) will be located in the dst buffer at an offset that can be computed by the

UCC_CHECK_GOTO(ucc_tl_ucp_reduce_scatter_knomial_init_r(&args, team, &task,
radix),
out, status);
UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, task), out, status);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is memory leak here in case of error (e.g. the scratch buffer, or the schedule handle)

args.args.dst.info.buffer = rs_rbuf;
args.args.mask |= UCC_COLL_ARGS_FIELD_FLAGS;
args.args.flags |= UCC_COLL_ARGS_FLAG_IN_PLACE;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bank line


UCC_CHECK_GOTO(ucc_tl_ucp_gather_knomial_init_r(&args, team, &task, radix),
out, status);
UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, task), out, status);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there mem leak here as well ?

&n_frags, &pipeline_depth);
if (n_frags > 1) {
bargs.mask |= UCC_BASE_CARGS_MAX_FRAG_COUNT;
bargs.max_frag_count = ucc_buffer_block_count(max_frag_count, n_frags, 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align the "="? (I'm not sure of the convention)

@@ -214,6 +216,7 @@ void ucc_tl_ucp_reduce_scatter_knomial_progress(ucc_coll_task_t *coll_task)
size_t data_size = count * dt_size;
ucc_rank_t rank = task->subset.myrank;
ucc_rank_t size = task->subset.map.ep_num;
ucc_rank_t root = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does it have a real meaning to set root at 0 here and below ?

@samnordmann
Copy link
Collaborator

Should we update the tests with the new algo(s) ?

@janjust
Copy link
Collaborator

janjust commented Feb 6, 2025

Reference PR, no reviewes needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants