Refactor interfaces to hide tile_bounds and allow dynamic block_size #129
Conversation
…pdate tests to reflect this change
I tested this PR in nerfstudio with block sizes 2-16; training works for all sizes tested.
I like the idea of cleaning up
This looks good to me, tested on block sizes 2^n from 1 to 4. Please rename `block_size` to `block_width` so we can distinguish the actual block size (total size) from the side length. Otherwise LGTM; I'll approve after those changes. Make sure to make the accompanying changes in nerfstudio.
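For reference, the distinction the rename is meant to encode (values here are illustrative): `block_width` is the side length of the square CUDA block, while the total block size is its square.

```python
block_width = 16               # side length of a square CUDA block/tile
block_size = block_width ** 2  # total threads per block: 16 * 16 = 256
```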
great lgtm
@kerrj, can you explain the need for this in relation to shared memory? I thought that regardless of the number of threads/workers in a CUDA block, the shared memory size is fixed, so changing the block_width, and hence the thread count, would not make a difference to the allocatable shared memory. Let me know if I have any misunderstandings.
There are some nuances with shared memory size:
1) Some GPUs allow allocating more than 48KB of shared memory if you allocate it dynamically.
2) Launching a kernel that requests too much shared memory limits how many blocks can launch at the same time, which can starve the processor; in that case launching with a smaller block size is actually faster since more of the processors can be utilized.
3)
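To illustrate point 2 with back-of-the-envelope arithmetic (the per-SM shared-memory budget below is an assumed figure; it varies by GPU generation):

```python
def max_resident_blocks(smem_per_sm: int, smem_per_block: int) -> int:
    """How many blocks can be resident on one SM, limited only by shared memory."""
    # Each resident block reserves its full shared-memory request up front,
    # so an SM can only host as many blocks as fit in its budget.
    if smem_per_block > smem_per_sm:
        return 0  # the kernel cannot launch on this SM at all
    return smem_per_sm // smem_per_block

# With an assumed 100 KB shared-memory budget per SM:
# a 48 KB-per-block request leaves room for only 2 resident blocks,
# while 16 KB per block allows 6, giving the SM more latency hiding.
print(max_resident_blocks(100 * 1024, 48 * 1024))  # 2
print(max_resident_blocks(100 * 1024, 16 * 1024))  # 6
```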
Previously the user had to work with the black-magic `tile_bounds`, and the values of `BLOCK_X`, `BLOCK_Y` were hardcoded in CUDA, so `16` was the only valid input. Now the CUDA code has been refactored to take any block size <= 16, and the tile_bounds computation is completely hidden from the user. The new interfaces for `project_gaussians` and `rasterize_gaussians` instead accept an integer `block_size` input, from which the tile bounds are calculated. There are 2 goals for this change: