GPU Evaluation #243
I've got some experimental GPU code in the … Unfortunately, it's also slower than the CPU implementation (at least on my Mac). I'm not exactly sure why: there's overhead from the GPU-based bytecode interpreter, and occupancy isn't great, but I haven't found any specific issues. I'd be curious to see if you can figure out the bottleneck!

If it can be made performant, we'd have to discuss where it should live. The three options are (1) in the main …
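For context on what such an interpreter looks like, here is a minimal WGSL sketch of a tape-walking compute shader. The instruction encoding, buffer bindings, and opcode numbers are all assumptions for illustration, not fidget's actual layout.

```wgsl
// Minimal sketch of a tape-walking interpreter, assuming a hypothetical
// instruction encoding of (opcode, out, lhs, rhs) packed into a vec4u.
// Names, bindings, and opcodes are illustrative, not fidget's real layout.
struct Tape {
    len: u32,
    ops: array<vec4u>,  // runtime-sized: one vec4u per instruction
}

@group(0) @binding(0) var<storage, read> tape: Tape;
@group(0) @binding(1) var<storage, read_write> result: array<vec4f>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3u) {
    // Per-thread register file in function-local ("private") memory.
    var reg: array<vec4f, 256>;
    for (var i = 0u; i < tape.len; i = i + 1u) {
        let op = tape.ops[i];
        switch op[0] {
            case 0u: { reg[op[1]] = reg[op[2]] + reg[op[3]]; }      // add
            case 1u: { reg[op[1]] = reg[op[2]] * reg[op[3]]; }      // mul
            case 2u: { reg[op[1]] = min(reg[op[2]], reg[op[3]]); }  // min
            default: { }
        }
    }
    result[gid.x] = reg[0];
}
```

The per-thread `var reg: array<vec4f, 256>` register file and the per-opcode `switch` are the two pieces discussed in the reply below.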
I took a look and it's pretty close to what I was planning on doing. At first glance, it looks like the biggest bottleneck probably comes from allocating 1024 registers' worth of local memory (256 × vec4f) via:

```wgsl
var reg: array<vec4f, 256>;
```

When storing data in registers like this, the GPU will eventually spill into global memory once the L1 and L2 caches are full, which makes every read/store into the array progressively less efficient. Compounding that slowdown with the branch divergence of each operation could lead to an incredible amount of stalling in worst-case scenarios.

It might be possible to mitigate the branch-divergence issues by declaring local variables like this:

```wgsl
var reg_1: vec4f;
var reg_2: vec4f = reg[op[2]];
var reg_3: vec4f = reg[op[3]];
switch op[0] {
    // each case now reads only reg_2/reg_3 and writes reg_1
    ...
}
```

so that the slower memory accesses are synchronized between threads, and then writing to the register array after the switch statement, for the same reason:

```wgsl
reg[op[1]] = reg_1;
```

As for the best way to store register memory, benchmarking is the only way to tell which option is best under the circumstances. The first that comes to mind, commonly used to avoid cache overflow, is shared memory; if you aren't aware of it, this is a pretty good article: https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/. Essentially, the biggest question is how many registers can be dedicated to each thread before occupancy becomes severely limited: at 256 × vec4f, each thread needs 4 KB, so even a 64-thread workgroup would need 256 KB, far beyond the tens of kilobytes of shared memory typically available per workgroup.

The other option would be to use a buffer or image for registers, which gives an unlimited number of registers at the cost of dealing with global-memory read/write latency even when a tape only uses a handful of registers. This is my preferred format for working with any large amount of data, and it has the most flexibility and room for optimization.

Your sentiment about finding the right API is very fair, especially with how fast the Rust ecosystem is evolving. In my opinion, for a project this low-level and optimization-heavy, Vulkan would be the best graphics API to go with. Ash gives direct bindings to it and is, I would think, the most stable. I have personally been using the Vulkano crate for my project; some of its recent development has put it miles ahead of any other crate I've seen for anything graphics-heavy, but for a project like this it's probably too unstable at the moment. Given the non-standard nature of Rust graphics libraries, I'm curious how you would go about integrating it directly into the fidget crate?
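To make the shared-memory option concrete, here is a rough WGSL sketch of a workgroup-memory register file, assuming (purely for illustration) a workgroup size of 64 and a budget of 8 registers per thread, with the interleaved indexing described in the NVIDIA article above:

```wgsl
// Hypothetical workgroup-memory register file: 8 registers per thread for a
// 64-thread workgroup (64 * 8 * 16 bytes = 8 KiB of shared memory).
// The sizes are illustrative; the real budget would have to be benchmarked.
const WG_SIZE: u32 = 64u;
const REGS_PER_THREAD: u32 = 8u;

// Interleaved layout: register r of thread t lives at r * WG_SIZE + t, so
// adjacent threads touch adjacent addresses (and different banks) per access.
var<workgroup> shared_reg: array<vec4f, 512>; // WG_SIZE * REGS_PER_THREAD

fn read_reg(tid: u32, r: u32) -> vec4f {
    return shared_reg[r * WG_SIZE + tid];
}

fn write_reg(tid: u32, r: u32, v: vec4f) {
    shared_reg[r * WG_SIZE + tid] = v;
}
```

And a sketch of the buffer-backed alternative, where each thread gets a fixed-stride slice of one large storage buffer; the binding number and stride are likewise assumptions, not anything fidget currently defines:

```wgsl
// Hypothetical global-memory register file: a storage buffer sliced per thread.
// Capacity is effectively unlimited, but every access pays global-memory latency.
const MAX_REGS: u32 = 256u;

@group(0) @binding(2) var<storage, read_write> global_reg: array<vec4f>;

fn reg_index(thread_id: u32, r: u32) -> u32 {
    // One contiguous MAX_REGS-sized slice per thread.
    return thread_id * MAX_REGS + r;
}
```

Which layout actually wins would come down to benchmarking, as noted above: the workgroup version caps the register count per thread, while the buffer version trades latency for unlimited capacity.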
I recently rewrote the rendering backend of my game engine, and I need to find ways to maximize the GPU workload, since I'm going to need as many CPU threads as possible working on particle simulation. Currently, I have all spare threads generating SDFs with fidget for world generation, and it is unfortunately far too slow for what I'm aiming for.

I either need something that can run in async compute shaders or a different method for world generation. I know far more about GPU optimization than CPU optimization, and I've wanted to write something like this for a while, so I might give it a shot sometime in the next few weeks. Before getting into it, I wanted to check whether you would want to add it to the crate, and what your thoughts on this are.