[BUG]: BlockAdjacentDifference requests 2x the shared memory actually needed #3711

Open
1 task done
pauleonix opened this issue Feb 6, 2025 · 2 comments
Labels
bug Something isn't working right.

Comments

@pauleonix
Contributor

pauleonix commented Feb 6, 2025

Is this a duplicate?

Type of Bug

Performance

Component

CUB

Describe the bug

Looking at

struct _TempStorage
{
  T first_items[BLOCK_THREADS];
  T last_items[BLOCK_THREADS];
};
and at all the member functions, it seems that none of them uses both first_items and last_items, i.e., a single array (halo_items or similar) would suffice. The unnecessarily large shared memory request can in practice reduce occupancy and therefore hurt performance.
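To make the size difference concrete, here is a minimal host-side sketch. `TwoArrayStorage` mirrors the layout quoted above, while `SingleArrayStorage`, `blocks_by_smem`, and the 48 KiB budget are hypothetical names and numbers chosen only for illustration; they are not part of CUB. The estimate also ignores every other occupancy limit (registers, threads per SM, other shared memory used by the kernel).

```cpp
#include <cstddef>

// Layout quoted in the issue: two arrays with one item per thread each.
template <typename T, int BLOCK_THREADS>
struct TwoArrayStorage
{
  T first_items[BLOCK_THREADS];
  T last_items[BLOCK_THREADS];
};

// Hypothetical alternative suggested in the issue: a single halo array.
template <typename T, int BLOCK_THREADS>
struct SingleArrayStorage
{
  T halo_items[BLOCK_THREADS];
};

// Rough upper bound on resident blocks when shared memory is the only limit.
// The budget is an assumption supplied by the caller, not a queried device value.
constexpr std::size_t blocks_by_smem(std::size_t smem_budget_bytes,
                                     std::size_t bytes_per_block)
{
  return smem_budget_bytes / bytes_per_block;
}

constexpr std::size_t two_array_bytes = sizeof(TwoArrayStorage<double, 256>);
constexpr std::size_t one_array_bytes = sizeof(SingleArrayStorage<double, 256>);

// The two-array layout is exactly twice as large as the single-array one.
static_assert(two_array_bytes == 2 * one_array_bytes,
              "two-array layout doubles the footprint");
```

With an assumed 48 KiB shared memory budget, `blocks_by_smem(48 * 1024, one_array_bytes)` is twice `blocks_by_smem(48 * 1024, two_array_bytes)`, which is the occupancy effect described above in its simplest form.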

How to Reproduce

Not applicable.

Expected behavior

BlockAdjacentDifference should only request as much shared memory as it actually needs.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

@pauleonix pauleonix added the bug Something isn't working right. label Feb 6, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 6, 2025
@pauleonix
Contributor Author

pauleonix commented Feb 6, 2025

Arguably there should also be a version of these algorithms that uses shared memory only for inter-warp communication and warp shuffles otherwise, trading some performance for minimal shared memory usage (similar to e.g. BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED), but that would be a separate issue.
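As a rough illustration of what such a variant could save, the following host-side sketch compares a per-thread halo array with a footprint of one boundary item per warp. The function names are hypothetical, and the per-warp footprint is an assumption about how a shuffle-based design might look, not a description of any existing CUB algorithm.

```cpp
#include <cstddef>

// Warp size on current NVIDIA GPUs.
constexpr int kWarpThreads = 32;

// Footprint of a single per-thread halo array: one item per thread.
constexpr std::size_t per_thread_halo_bytes(std::size_t item_bytes, int block_threads)
{
  return item_bytes * static_cast<std::size_t>(block_threads);
}

// Number of warps in a block, rounding up for partially filled warps.
constexpr int warps_per_block(int block_threads)
{
  return (block_threads + kWarpThreads - 1) / kWarpThreads;
}

// Assumed footprint of a shuffle-based variant: neighbors inside a warp are
// exchanged via shuffles, so only one boundary item per warp goes through
// shared memory.
constexpr std::size_t per_warp_halo_bytes(std::size_t item_bytes, int block_threads)
{
  return item_bytes * static_cast<std::size_t>(warps_per_block(block_threads));
}
```

For a 256-thread block with 8-byte items, this gives 2048 bytes for the per-thread array versus 64 bytes for the per-warp one, at the cost of the extra shuffle instructions mentioned above.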

@pauleonix
Contributor Author

Maybe at some point it was planned to also have algorithms that look both left and right, but it is a hard ask to pessimize these common algorithms just so that those non-existent ones can use the same API.
