Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
For my application I had to write a custom histogram, because I needed to do other work alongside the histogramming and so could not use CUB's. Out of curiosity, I benchmarked it against the CUB implementation, and mine seems to be considerably faster. Here's a plot of the results for an RTX 2080Ti:
My use case is creating the histogram of a text over an alphabet. For the tests I used texts with characters drawn uniformly at random from the alphabet `[0, alphabet_size)`. The results also differ in two cases: for an alphabet size of 64'000, the CUB histogram entry for character 33'555 is 0, which it shouldn't be, and for an alphabet size of 100'000, the entries for character 0 differ. I've tested my implementation thoroughly with many random inputs, so I'm relatively sure the CUB result is wrong.
Here's the script I used for benchmarking:
I compiled with `nvcc -O3 -Xcompiler -fopenmp -arch=sm_75 hist_comp.cu -o hist_comp` and ran it for 20 iterations, using version 12.0.
My implementation packs as many histograms as possible into shared memory and assigns a local histogram to each thread in the block in a round-robin fashion, to minimize how many threads within a warp share the same local histogram. If a histogram doesn't fit in shared memory, it falls back to atomic updates directly in global memory.
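To make the scheme above concrete, here is a minimal sketch of how it could look. It is not my actual benchmark code: the names `text_histogram` and `histogram_on_gpu`, the 256x64 launch shape, and the `uint32_t` character type are all assumptions for illustration, every character is assumed to be smaller than `alphabet_size`, and error checking is left out for brevity.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (illustration only). `copies` is the number of local
// histograms that fit in this block's shared memory; 0 means none fit.
// Assumes every text[i] < alphabet_size.
__global__ void text_histogram(const uint32_t* text, size_t n,
                               unsigned int* global_hist,
                               unsigned int alphabet_size,
                               unsigned int copies)
{
    extern __shared__ unsigned int smem[];
    const size_t start  = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    const size_t stride = (size_t)gridDim.x * blockDim.x;

    if (copies == 0) {
        // Fallback: the histogram does not fit, so update global memory atomically.
        for (size_t i = start; i < n; i += stride)
            atomicAdd(&global_hist[text[i]], 1u);
        return;  // uniform across the block, so no barrier is skipped
    }

    // Zero all local histograms of this block.
    for (unsigned int i = threadIdx.x; i < copies * alphabet_size; i += blockDim.x)
        smem[i] = 0;
    __syncthreads();

    // Round-robin assignment: neighbouring threads get different local histograms.
    unsigned int* local = smem + (threadIdx.x % copies) * alphabet_size;
    for (size_t i = start; i < n; i += stride)
        atomicAdd(&local[text[i]], 1u);
    __syncthreads();

    // Merge the local histograms into the global one.
    for (unsigned int bin = threadIdx.x; bin < alphabet_size; bin += blockDim.x) {
        unsigned int sum = 0;
        for (unsigned int c = 0; c < copies; ++c)
            sum += smem[c * alphabet_size + bin];
        if (sum != 0)
            atomicAdd(&global_hist[bin], sum);
    }
}

// Hypothetical host wrapper; assumes alphabet_size > 0.
std::vector<unsigned int> histogram_on_gpu(const std::vector<uint32_t>& text,
                                           unsigned int alphabet_size)
{
    const unsigned int threads = 256, blocks = 64;  // assumed launch shape
    int smem_bytes = 0;
    cudaDeviceGetAttribute(&smem_bytes, cudaDevAttrMaxSharedMemoryPerBlock, 0);

    // Pack as many local histograms as fit, at most one per thread.
    unsigned int copies =
        (unsigned int)(smem_bytes / (alphabet_size * sizeof(unsigned int)));
    if (copies > threads) copies = threads;

    uint32_t* d_text = nullptr;
    unsigned int* d_hist = nullptr;
    cudaMalloc(&d_text, text.size() * sizeof(uint32_t));
    cudaMalloc(&d_hist, alphabet_size * sizeof(unsigned int));
    cudaMemcpy(d_text, text.data(), text.size() * sizeof(uint32_t),
               cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, alphabet_size * sizeof(unsigned int));

    const size_t smem = (size_t)copies * alphabet_size * sizeof(unsigned int);
    text_histogram<<<blocks, threads, smem>>>(d_text, text.size(), d_hist,
                                              alphabet_size, copies);

    std::vector<unsigned int> hist(alphabet_size);
    cudaMemcpy(hist.data(), d_hist, alphabet_size * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_text);
    cudaFree(d_hist);
    return hist;
}
```

With the default 48 KB of shared memory per block, an alphabet of 64'000 or 100'000 four-byte counters does not fit at all, so in this sketch those sizes take the global-memory path.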
Here is also the CSV of the results, with more data sizes: 2080Ti.csv.
Is your implementation optimized for other specific use cases? Just leaving this here in case you're interested in improving the performance for this use case.
Describe the solution you'd like
Improve the performance of HistogramEven for creating histograms of texts.
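For reference, this is roughly how `cub::DeviceHistogram::HistogramEven` can be called for this use case, with one bin per character. It is a sketch rather than the benchmark script: the wrapper name `cub_text_histogram` and the choice of `int` for samples, counters and levels are assumptions, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Hypothetical wrapper (illustration only): histogram of a text whose
// characters lie in [0, alphabet_size), using one bin per character.
std::vector<int> cub_text_histogram(const std::vector<int>& text, int alphabet_size)
{
    int* d_samples = nullptr;
    int* d_hist = nullptr;
    cudaMalloc(&d_samples, text.size() * sizeof(int));
    cudaMalloc(&d_hist, alphabet_size * sizeof(int));
    cudaMemcpy(d_samples, text.data(), text.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, alphabet_size * sizeof(int));

    // alphabet_size bins need alphabet_size + 1 level boundaries over [0, alphabet_size).
    const int num_levels  = alphabet_size + 1;
    const int lower_level = 0;
    const int upper_level = alphabet_size;
    const int num_samples = (int)text.size();

    // First call only queries the temporary storage size.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceHistogram::HistogramEven(d_temp, temp_bytes, d_samples, d_hist,
                                        num_levels, lower_level, upper_level,
                                        num_samples);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call computes the histogram.
    cub::DeviceHistogram::HistogramEven(d_temp, temp_bytes, d_samples, d_hist,
                                        num_levels, lower_level, upper_level,
                                        num_samples);

    std::vector<int> hist(alphabet_size);
    cudaMemcpy(hist.data(), d_hist, alphabet_size * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_samples);
    cudaFree(d_hist);
    return hist;
}
```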
Describe alternatives you've considered
No response
Additional context
No response