Overview
Investigate using KernelBench in our current RL framework
The objective is to obtain a pure CUDA C++ implementation of each Python-based kernel, compile it, verify its correctness, and then establish a baseline with NVIDIA Nsight Compute (ncu). Once we have the baseline, we feed common kernel-optimization best practices to a model as a prompt, implement the suggested changes, and repeat the process until we have a highly performant kernel (for example, using Elapsed Cycles as the reward).
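The loop described above could be sketched roughly as follows. This is a minimal illustration, not a proposed design: the names `Measurement`, `reward`, `optimize`, `propose`, and `measure` are all hypothetical, and the choice of "baseline cycles divided by candidate cycles" as the reward is an assumption. In practice `propose` would call the model and `measure` would compile the kernel, verify it, and read Elapsed Cycles from an ncu report.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Measurement:
    """Outcome of compiling, verifying, and profiling one candidate kernel."""
    compiled: bool
    correct: bool
    elapsed_cycles: Optional[int]  # value read from the profiler, if any


def reward(baseline_cycles: int, m: Measurement) -> float:
    """Speedup over the baseline in Elapsed Cycles; 0.0 for broken kernels."""
    if not (m.compiled and m.correct) or not m.elapsed_cycles:
        return 0.0
    return baseline_cycles / m.elapsed_cycles


def optimize(
    source: str,
    baseline_cycles: int,
    propose: Callable[[str, float], str],    # hypothetical: asks the model for a rewrite
    measure: Callable[[str], Measurement],   # hypothetical: compile + verify + profile
    rounds: int = 3,
) -> Tuple[str, float]:
    """Keep the best kernel seen so far and iterate for a fixed number of rounds."""
    best_source, best_reward = source, 1.0   # the baseline has a speedup of 1.0
    for _ in range(rounds):
        candidate = propose(best_source, best_reward)
        r = reward(baseline_cycles, measure(candidate))
        if r > best_reward:                  # accept only strict improvements
            best_source, best_reward = candidate, r
    return best_source, best_reward
```

Giving incorrect or non-compiling kernels a reward of zero keeps the model from trading correctness for speed; whether a harsher penalty is needed is left open here.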