
Refactor interfaces to hide tile_bounds and allow dynamic block_size #129

Merged
3 commits merged into main on Feb 20, 2024

Conversation

@kerrj (Collaborator) commented Feb 13, 2024

Previously, users had to work with the black-magic tile_bounds argument, and the values of BLOCK_X and BLOCK_Y were hardcoded in CUDA, so 16 was the only valid input. The CUDA code has now been refactored to take any block size <= 16, and the tile_bounds computation is completely hidden from the user. The new interfaces for project_gaussians and rasterize_gaussians instead accept an integer block_size input, from which the tile bounds are calculated.
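
As a sketch of what the library can now do internally (the variable names here are illustrative, not the exact gsplat code), the tile grid is just a ceiling division of the image dimensions by the block size:

```cuda
// Illustrative: derive the tile grid from block_size via ceiling division.
dim3 tile_bounds(
    (img_width  + block_size - 1) / block_size,  // tiles across
    (img_height + block_size - 1) / block_size,  // tiles down
    1);
dim3 block(block_size, block_size, 1);           // block_size <= 16 per side
```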

There are 2 goals for this change:

  1. Clean up the user-facing interface for tile_bounds and get rid of the hardcoded BLOCK_X in CUDA, and
  2. Dynamic block size will allow more sophisticated shared-memory handling down the road, allowing smaller block sizes to rasterize higher-dimensional values more efficiently (see the sketch after this list). Once implemented, this will address ND-rasterization speed issues like those reported in #68 ("nd rasterizer is 10x slower than rasterizer").
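
As a rough illustration of why block size matters for shared memory (the numbers and layout below are hypothetical, not gsplat's actual kernel): if a rasterizer caches one multi-channel float value per thread in shared memory, the per-block budget scales with the square of the block side length.

```cuda
// Hypothetical per-block shared-memory budget when each thread caches
// one `channels`-dimensional float feature (not the actual gsplat layout):
size_t shmem_bytes(unsigned block_width, unsigned channels) {
    return size_t(block_width) * block_width * channels * sizeof(float);
}
// 16x16 block, 64 channels: 16*16*64*4 = 65536 B (64 KB) per block;
//  8x8  block, 64 channels:  8*8*64*4  = 16384 B (16 KB) per block,
// leaving room for more blocks to stay resident per SM.
```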

@kerrj (Collaborator, Author) commented Feb 13, 2024

I tested this PR in nerfstudio with block sizes 2-16; training works for all sizes tested.

@kerrj mentioned this pull request Feb 14, 2024
@liruilong940607 (Collaborator) commented

I like the idea of cleaning up tile_bounds! Will take a closer look later today or tomorrow!

@vye16 (Collaborator) left a comment

This looks good to me; tested on block sizes 2^n for n from 1 to 4. Please rename block_size to block_width so we can distinguish the actual block size (total thread count) from the side length. Otherwise LGTM; I'll approve after those changes. Make sure to make the accompanying changes in nerfstudio.
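
For context, the distinction being drawn here is side length versus total thread count; a minimal illustration (hypothetical snippet, not the gsplat source):

```cuda
// block_width is the side length of a square CUDA block;
// the total "block size" is its square.
unsigned block_width = 16;                               // side length
dim3 block(block_width, block_width, 1);                 // 16 x 16 threads
unsigned threads_per_block = block_width * block_width;  // 256 total
```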

@kerrj requested a review from vye16 on February 20, 2024 at 20:38
@vye16 (Collaborator) left a comment

Great, LGTM!

@vye16 merged commit 10bc1d0 into main on Feb 20, 2024. 2 checks passed.
@maturk (Collaborator) commented Mar 10, 2024

@kerrj, can you explain the need for this in relation to shared memory? I thought that regardless of the number of threads/workers in a CUDA block, the shared memory size is fixed, so changing the block_width, and hence the thread count, would not make a difference to the allocatable shared memory. Let me know if I have any misunderstandings.

@maturk mentioned this pull request Mar 13, 2024
@kerrj (Collaborator, Author) commented Mar 18, 2024

There are some nuances with shared memory size:

1. Some GPUs allow allocating more than 48 KB of shared memory if you allocate it dynamically.
2. Launching a kernel that requests too much shared memory limits how many blocks can run at the same time, which can starve the processor; in that case, launching with a smaller block size is actually faster, since more of the processors can be utilized.
3. "Regardless of the num of threads/workers in a CUDA block, the shared memory size is fixed": this is true of the maximum shared memory, but you can launch with less to free up shared memory for other executing blocks.
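
A minimal CUDA sketch of points 1 and 3 (the kernel, names, and sizes are illustrative, not the actual gsplat code):

```cuda
#include <cuda_runtime.h>

// Dynamically sized shared memory: the byte count is chosen at launch time.
extern __shared__ float tile_cache[];

__global__ void rasterize_sketch(int channels) {
    // Each thread owns a `channels`-float slice of the launch-sized cache.
    int t = threadIdx.y * blockDim.x + threadIdx.x;
    tile_cache[t * channels] = 0.f;  // placeholder work
}

int main() {
    int block_width = 16, channels = 64;
    dim3 block(block_width, block_width, 1);
    size_t shmem = size_t(block_width) * block_width * channels * sizeof(float);
    // 16*16*64*4 = 64 KB, which exceeds the default 48 KB limit.

    // Point 1: opt in to >48 KB of dynamic shared memory on GPUs that support it.
    cudaFuncSetAttribute(rasterize_sketch,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)shmem);

    // Point 3: request only what this block size needs; a smaller block_width
    // shrinks `shmem`, leaving shared memory free for other resident blocks.
    rasterize_sketch<<<dim3(1, 1, 1), block, shmem>>>(channels);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}
```

With block_width = 8 the same launch needs only 16 KB, which is exactly the trade-off described above: less shared memory per block, more blocks executing concurrently.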
