Is this a duplicate?

Type of Bug
Performance

Component
CUB

Describe the bug
Looking at cccl/cub/cub/block/block_adjacent_difference.cuh, lines 133 to 137 at 9b7333b, and at all the member functions, it seems like none of them uses both first_items and last_items, i.e. a single array halo_items or similar would suffice. The unnecessarily large shared-memory request can in practice reduce occupancy and therefore hurt performance.

How to Reproduce
Not applicable.

Expected behavior
BlockAdjacentDifference should only request as much shared memory as it actually needs.

Reproduction link
No response

Operating System
No response

nvidia-smi output
No response

NVCC version
No response
Arguably there should be a version of these algorithms that uses shared memory only for inter-warp communication and warp shuffles otherwise, for minimal shared-memory usage at reduced performance, like e.g. BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED, but that would be another issue.
Maybe at some point it was planned to also have algorithms that look both left and right, but it is a hard ask to pessimize these common algorithms just so those non-existent ones can use the same API.