GPU Evaluation #243
I've got some experimental GPU code in the … Unfortunately, it's also slower than the CPU implementation (at least on my Mac). I'm not exactly sure why: there's overhead from the GPU-based bytecode interpreter, and occupancy isn't great, but I haven't found any specific issues. I'd be curious to see if you can figure out the bottleneck!

If it can be made performant, we'd have to discuss where it should live. The three options are (1) in the main …
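For context on what such an interpreter looks like, here is a minimal WGSL sketch of a tape-walking compute shader. The instruction encoding, buffer bindings, and opcode numbers are all assumptions for illustration, not fidget's actual layout.

```wgsl
// Minimal sketch of a tape-walking interpreter, assuming a hypothetical
// instruction encoding of (opcode, out, lhs, rhs) packed into a vec4u.
// Names, bindings, and opcodes are illustrative, not fidget's real layout.
struct Tape {
    len: u32,
    ops: array<vec4u>,  // runtime-sized: one vec4u per instruction
}

@group(0) @binding(0) var<storage, read> tape: Tape;
@group(0) @binding(1) var<storage, read_write> result: array<vec4f>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3u) {
    // Per-thread register file in function-local ("private") memory.
    var reg: array<vec4f, 256>;
    for (var i = 0u; i < tape.len; i = i + 1u) {
        let op = tape.ops[i];
        switch op[0] {
            case 0u: { reg[op[1]] = reg[op[2]] + reg[op[3]]; }      // add
            case 1u: { reg[op[1]] = reg[op[2]] * reg[op[3]]; }      // mul
            case 2u: { reg[op[1]] = min(reg[op[2]], reg[op[3]]); }  // min
            default: { }
        }
    }
    result[gid.x] = reg[0];
}
```

The per-thread `var reg: array<vec4f, 256>` register file and the per-opcode `switch` are the two pieces discussed in the reply below.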
I took a look and it's pretty close to what I was planning on doing. At first glance, it looks like the biggest bottleneck probably comes from allocating 1024 registers' worth of local memory (256 × vec4f) via:

```wgsl
var reg: array<vec4f, 256>;
```

When storing data in registers like this, the GPU will eventually spill into global memory once the L1 and L2 caches are full, which makes every read/store into the array progressively less efficient. Compounding that slowdown with the branch divergence of each operation could lead to an incredible amount of stalling in worst-case scenarios.

It might be possible to mitigate the branch-divergence issues by declaring local variables like this:

```wgsl
var reg_1: vec4f;
var reg_2: vec4f = reg[op[2]];
var reg_3: vec4f = reg[op[3]];
switch op[0] {
    // each case now reads only reg_2/reg_3 and writes reg_1
    ...
}
```

so that the slower memory accesses are synchronized between threads, and then writing to the register array after the switch statement, for the same reason:

```wgsl
reg[op[1]] = reg_1;
```

As for the best way to store register memory, benchmarking is the only way to tell which option is best under the circumstances. The first that comes to mind, commonly used to avoid cache overflow, is shared memory; if you aren't aware of it, this is a pretty good article: https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/. Essentially, the biggest question is how many registers can be dedicated to each thread before occupancy becomes severely limited: at 256 × vec4f, each thread needs 4 KB, so even a 64-thread workgroup would need 256 KB, far beyond the tens of kilobytes of shared memory typically available per workgroup.

The other option would be to use a buffer or image for registers, which gives an unlimited number of registers at the cost of dealing with global-memory read/write latency even when a tape only uses a handful of registers. This is my preferred format for working with any large amount of data, and it has the most flexibility and room for optimization.

Your sentiment about finding the right API is very fair, especially with how fast the Rust ecosystem is evolving. In my opinion, for a project this low-level and optimization-heavy, Vulkan would be the best graphics API to go with. Ash gives direct bindings to it and is, I would think, the most stable. I have personally been using the Vulkano crate for my project; some of its recent development has put it miles ahead of any other crate I've seen for anything graphics-heavy, but for a project like this it's probably too unstable at the moment. Given the non-standard nature of Rust graphics libraries, I'm curious how you would go about integrating it directly into the fidget crate?
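To make the shared-memory option concrete, here is a rough WGSL sketch of a workgroup-memory register file, assuming (purely for illustration) a workgroup size of 64 and a budget of 8 registers per thread, with the interleaved indexing described in the NVIDIA article above:

```wgsl
// Hypothetical workgroup-memory register file: 8 registers per thread for a
// 64-thread workgroup (64 * 8 * 16 bytes = 8 KiB of shared memory).
// The sizes are illustrative; the real budget would have to be benchmarked.
const WG_SIZE: u32 = 64u;
const REGS_PER_THREAD: u32 = 8u;

// Interleaved layout: register r of thread t lives at r * WG_SIZE + t, so
// adjacent threads touch adjacent addresses (and different banks) per access.
var<workgroup> shared_reg: array<vec4f, 512>; // WG_SIZE * REGS_PER_THREAD

fn read_reg(tid: u32, r: u32) -> vec4f {
    return shared_reg[r * WG_SIZE + tid];
}

fn write_reg(tid: u32, r: u32, v: vec4f) {
    shared_reg[r * WG_SIZE + tid] = v;
}
```

And a sketch of the buffer-backed alternative, where each thread gets a fixed-stride slice of one large storage buffer; the binding number and stride are likewise assumptions, not anything fidget currently defines:

```wgsl
// Hypothetical global-memory register file: a storage buffer sliced per thread.
// Capacity is effectively unlimited, but every access pays global-memory latency.
const MAX_REGS: u32 = 256u;

@group(0) @binding(2) var<storage, read_write> global_reg: array<vec4f>;

fn reg_index(thread_id: u32, r: u32) -> u32 {
    // One contiguous MAX_REGS-sized slice per thread.
    return thread_id * MAX_REGS + r;
}
```

Which layout actually wins would come down to benchmarking, as noted above: the workgroup version caps the register count per thread, while the buffer version trades latency for unlimited capacity.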
I recently rewrote the rendering backend of my game engine, and I need to find ways to maximize the GPU workload, since I'm going to need as many CPU threads as possible working on particle simulation. Currently, I have all spare threads generating SDFs with fidget for world generation, and it is unfortunately far too slow for what I'm aiming for.

I either need something that can run in async compute shaders or a different method for world generation. I know far more about GPU optimization than CPU optimization, and I've wanted to write something like this for a while, so I might give it a shot sometime in the next few weeks. Before getting into it, I wanted to check whether you would want to add it to the crate, and what your thoughts on this are.