
gpu offload host code generation #142097

Open: wants to merge 10 commits into base: master

Conversation

ZuseZ4
Member

@ZuseZ4 ZuseZ4 commented Jun 5, 2025

r? ghost

This will generate most of the host-side code needed to use LLVM's offload feature.
The first PR will only handle automatic memory transfers to and from the device.
So if a user calls a kernel, we will copy inputs back and forth, but we won't do the actual kernel launch.
Before merging, we will use LLVM's Info infrastructure to verify that the memcopies match what OpenMP offload generates in C++. `LIBOMPTARGET_INFO=-1 ./my_rust_binary` should print that a memcpy to and later from the device is happening.

A follow-up PR will generate the actual device-side kernel which will then do computations on the GPU.
A third PR will implement manual host2device and device2host functionality, but the goal is to minimize cases where a user has to override our default handling for performance reasons.
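
To make the scope of this first PR concrete, here is a minimal host-only sketch of the transfer order described above. `MockDevice`, `data_begin`, `data_end` and `call_kernel_without_launch` are hypothetical names invented for illustration; nothing here calls the real offload runtime.

```rust
/// Toy stand-in for a GPU: just a second buffer that lives on the host.
struct MockDevice {
    buffer: Vec<u8>,
    log: Vec<&'static str>,
}

impl MockDevice {
    fn new() -> Self {
        MockDevice { buffer: Vec::new(), log: Vec::new() }
    }

    /// Host -> device copy, the effect the "begin" call is meant to have.
    fn data_begin(&mut self, host: &[u8]) {
        self.buffer = host.to_vec();
        self.log.push("to device");
    }

    /// Device -> host copy, the effect the "end" call is meant to have.
    fn data_end(&mut self, host: &mut [u8]) {
        host.copy_from_slice(&self.buffer);
        self.log.push("from device");
    }
}

fn call_kernel_without_launch(dev: &mut MockDevice, data: &mut [u8]) {
    dev.data_begin(data);
    // A kernel launch would go here; the first PR deliberately leaves it out.
    dev.data_end(data);
}

fn main() {
    let mut dev = MockDevice::new();
    let mut data = [1u8, 2, 3, 4];
    call_kernel_without_launch(&mut dev, &mut data);
    println!("{:?} {:?}", data, dev.log);
}
```

Since no kernel runs in between, the data comes back unchanged; only the two transfers are observable.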

I'm trying to get a full MVP out first, so this just recognizes GPU functions based on magic names. The final frontend will move this over to proper macros, as I'm already doing for the autodiff work.
This work will also be compatible with std::autodiff, so one can differentiate GPU kernels.

Tracking:

@rustbot rustbot added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 5, 2025
@ZuseZ4 ZuseZ4 added F-gpu_offload `#![feature(gpu_offload)]` and removed A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 5, 2025
@rustbot rustbot added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 5, 2025
@rustbot rustbot added the F-autodiff `#![feature(autodiff)]` label Jun 9, 2025
@ZuseZ4
Member Author

ZuseZ4 commented Jun 10, 2025

@oli-obk Feature-wise, I am almost done. I'll add a few more lines to describe the layout of Rust types to the offload library, but in this PR I only intend to support one or two types (maybe arrays, raw pointers, or slices). I might even hardcode the length in the very first approach. In a follow-up PR I'll do some proper type parsing on a higher level, similar to what I did in the past with Rust TypeTrees. This work is much simpler and more reliable though, since offload doesn't care what type something has, just how many bytes it occupies and hence how many need to be moved to/from the GPU.
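
As a rough illustration of the "only the byte count matters" point, the sketch below computes transfer sizes with `std::mem::size_of_val`. The helper name `bytes_to_transfer` is made up for this example and is not part of the PR.

```rust
use std::mem::size_of_val;

/// Number of bytes a value occupies; in this sketch that is all a transfer needs.
fn bytes_to_transfer<T: ?Sized>(value: &T) -> usize {
    size_of_val(value)
}

fn main() {
    let array = [0.0f64; 256];
    let vector = vec![0.0f32; 100];
    let slice: &[f32] = &vector[..10];

    println!("array: {} bytes", bytes_to_transfer(&array));
    println!("slice: {} bytes", bytes_to_transfer(slice));
}
```

Arrays and slices carry their length, so the size falls out directly. A raw pointer carries no length, which is presumably one reason a hardcoded length is on the table for the first version.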

I was able to just move a few of the builder methods I needed to the generic builder.
However, there are also around 7 that I had to duplicate. I guess at some point I'll need to do the proper work of enabling the trait implementations for both builders :/
Once I have everything working, I'll clean it up and add some tests and docs.

@ZuseZ4 ZuseZ4 mentioned this pull request Mar 4, 2025
5 tasks
@ZuseZ4
Member Author

ZuseZ4 commented Jun 12, 2025

Not fully ready yet, I apparently missed yet another global to initialize the offload runtime. But at least it compiles successfully to a binary if I emit the IR from Rust, and then use clang for the rest. I'll add the global today, then I should be done and will clean it up

@rustbot rustbot added the T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) label Jun 17, 2025
@ZuseZ4 ZuseZ4 marked this pull request as ready for review June 19, 2025 00:38
@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 19, 2025
@rustbot
Collaborator

rustbot commented Jun 19, 2025

Some changes occurred in compiler/rustc_codegen_ssa

cc @WaffleLapkin

Member Author

@ZuseZ4 ZuseZ4 left a comment

I did a first round of review myself; I'll address those comments tomorrow.
I'll also clean up the code in gpu_builder more. It has a lot of duplication and IR comments from when I was trying to figure out what to generate.

@@ -117,6 +118,70 @@ impl<'a, 'll, CX: Borrow<SCx<'ll>>> GenericBuilder<'a, 'll, CX> {
}
bx
}

pub(crate) fn my_alloca2(&mut self, ty: &'ll Type, align: Align, name: &str) -> &'ll Value {
Member Author

I'll find a better name for it.

Contributor

also document why/how it is different from alloca

Member Author

I gave it a name, because I feel that alloca is the best place to add some names to make the IR more readable without much effort.
Other than that, I could add a few more LLVM wrappers to do the following:

  1. save the current insert point
  2. jump to the end of the allocas
  3. insert the alloca
  4. restore the original insert point

If you feel strongly I could do it, but for now I just added a comment saying that I expect callers to be sure they really want to insert their alloca at the current insert point. That is the default for all other (non-alloca) builder functions, so together with the new name it is hopefully not too surprising.
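
The difference between the two strategies can be sketched with a toy builder. Everything below (`ToyBuilder`, `Inst`, `direct_alloca`, `entry_alloca`) is a hypothetical model for illustration and does not use rustc's or LLVM's builder API.

```rust
/// Heavily simplified model of an instruction list.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    Alloca(String),
    Other(String),
}

struct ToyBuilder {
    insts: Vec<Inst>,
    /// Index at which the next instruction is inserted.
    insert_point: usize,
}

impl ToyBuilder {
    fn insert(&mut self, inst: Inst) {
        self.insts.insert(self.insert_point, inst);
        self.insert_point += 1;
    }

    /// Inserts the alloca at the current insert point.
    fn direct_alloca(&mut self, name: &str) {
        self.insert(Inst::Alloca(name.to_string()));
    }

    /// Follows the four steps: save, jump, insert, restore.
    fn entry_alloca(&mut self, name: &str) {
        let saved = self.insert_point; // 1. save
        let end_of_allocas = self
            .insts
            .iter()
            .take_while(|i| matches!(i, Inst::Alloca(_)))
            .count();
        self.insert_point = end_of_allocas; // 2. jump
        self.insert(Inst::Alloca(name.to_string())); // 3. insert
        // 4. restore, shifted by one if the new alloca landed at or before the saved point
        self.insert_point = if end_of_allocas <= saved { saved + 1 } else { saved };
    }
}

fn main() {
    let mut bx = ToyBuilder {
        insts: vec![Inst::Alloca("a".into()), Inst::Other("x".into())],
        insert_point: 2,
    };
    bx.entry_alloca("b"); // grouped with the leading allocas
    bx.direct_alloca("c"); // placed wherever the builder currently is
    println!("{:?}", bx.insts);
}
```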

@ZuseZ4 ZuseZ4 requested a review from oli-obk June 24, 2025 21:59
@ZuseZ4
Member Author

ZuseZ4 commented Jun 24, 2025

ok, I think I'm mostly done. Do you have any suggestions? I don't want to add any actual run tests, as these would require a working clang based on the same commit.

@@ -667,6 +668,12 @@ pub(crate) fn run_pass_manager(
write::llvm_optimize(cgcx, dcx, module, None, config, opt_level, opt_stage, stage)?;
}

if cfg!(llvm_enzyme) && enable_gpu && !thin {
Member Author

@ZuseZ4 ZuseZ4 Jun 24, 2025

There is no dependency of offload on Enzyme, but since I think I'm supposed to gate my features, for now I'll just re-use the ones from Enzyme.

Contributor

I don't think the cfg gate is necessary, as long as the Offload::Enable is gated behind -Zunstable-options or sth
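
A minimal sketch of what gating on a runtime option instead of `cfg!` could look like. The enum and both functions are hypothetical stand-ins; the only detail taken from this thread is that the option is spelled `-Z offload=Enable` on the command line.

```rust
/// Hypothetical mirror of an unstable `-Z offload=...` option.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Offload {
    Disable,
    Enable,
}

/// Parses the option value; an absent option means the feature stays off.
fn parse_offload(value: Option<&str>) -> Result<Offload, String> {
    match value {
        None => Ok(Offload::Disable),
        Some("Enable") => Ok(Offload::Enable),
        Some(other) => Err(format!("unknown offload mode: {}", other)),
    }
}

/// Runtime gate: no compile-time cfg is involved.
fn should_run_offload_pass(mode: Offload, thin_lto: bool) -> bool {
    mode == Offload::Enable && !thin_lto
}

fn main() {
    let mode = parse_offload(Some("Enable")).unwrap();
    println!("run offload pass: {}", should_run_offload_pass(mode, false));
}
```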

@bors
Collaborator

bors commented Jun 26, 2025

☔ The latest upstream changes (presumably #143026) made this pull request unmergeable. Please resolve the merge conflicts.

Contributor

@oli-obk oli-obk left a comment

I... don't know if I can review this properly. I can review it from the "does this fit into how I want the llvm backend to look" side, but what it actually does just looks random to me.

@oli-obk
Contributor

oli-obk commented Jul 1, 2025

> I don't want to add any actual run tests, as these would require a working clang based on the same commit.

why is clang necessary for this?

@ZuseZ4
Member Author

ZuseZ4 commented Jul 1, 2025

> > I don't want to add any actual run tests, as these would require a working clang based on the same commit.
>
> why is clang necessary for this?

This time I started with dev guide docs! https://rustc-dev-guide.rust-lang.org/offload/installation.html#usage
Pretty much, to create and run the binary we have to implement multiple steps, and this is only the first step out of maybe 5.
Clang has the full pipeline implemented, so I just rely on it for the following steps until we have implemented more in rustc.

@ZuseZ4
Member Author

ZuseZ4 commented Jul 1, 2025

> I... don't know if I can review this properly. I can review it from the "does this fit into how I want the llvm backend to look" side, but what it actually does just looks random to me.

Thanks! And no worries, I'm discussing the offloading design with @jdoerfert and @kevinsala. The memory transfer is pretty straightforward and not that interesting. The only question was how many layers of abstraction we wanted, but we made a decision which should be fine, and we can always re-evaluate it later. For the kernel launch PR I'll ask them to also review the code, but they aren't Rust devs, so your reviews on the rustc side are definitely appreciated!

@ZuseZ4
Member Author

ZuseZ4 commented Jul 8, 2025

@oli-obk I cleaned up the code a bit more and addressed your feedback. This reduced the number of values which I pass around between functions by a lot.
Lmk if it looks good to you or if you have more thoughts; afterwards I'll clean up the history (and rebase).

let mapper_update = "__tgt_target_data_update_mapper";
let mapper_end = "__tgt_target_data_end_mapper";
let begin_mapper_decl = declare_offload_fn(&cx, mapper_begin, mapper_fn_ty);
let update_mapper_decl = declare_offload_fn(&cx, mapper_update, mapper_fn_ty);
Contributor

This function seems to be unused

Member Author

Hmm, did you mean to add the comment somewhere else? the gen_tgt_data_mappers function is called, and the return values of declare_offload_fn are also used.

Contributor

__tgt_target_data_update_mapper specifically is "Step 3" and unused for now; make use of it as `_ = update_mapper;` at the Step 3 site
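
For readers unfamiliar with the idiom, here is a small sketch of `_ = value;` as a placeholder use. `declare` and `generate` are made-up stand-ins for the real declaration and codegen functions, and only the "Step 3" label comes from the comment above.

```rust
/// Stand-in for declaring a runtime function; returns a printable declaration.
fn declare(name: &str) -> String {
    format!("declare @{}", name)
}

fn generate() -> Vec<String> {
    let begin_mapper = declare("begin");
    let update_mapper = declare("update");
    let end_mapper = declare("end");

    let mut calls = Vec::new();
    calls.push(begin_mapper);

    // Step 3: not generated yet. The wildcard assignment references the binding
    // at the place where it will eventually be needed, without emitting anything.
    _ = update_mapper;

    calls.push(end_mapper);
    calls
}

fn main() {
    println!("{:?}", generate());
}
```

The assignment to `_` binds nothing, so it has no effect on the generated output; it only documents where the declaration is expected to be used later.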

// void *AuxAddr;
// } __tgt_offload_entry;
let kernel_elements =
vec![ti32, ti32, tptr, tptr, tptr, tptr, tptr, tptr, ti64, ti64, tarr, tarr, ti32];
Contributor

these entries do not match the fields in the comment above

Member Author

Uuuh, yeah. The tgt_offload_entry ended up in the function above, where it also matches. Now just let me find the docs for what the tgt_kernel_argument values stand for.

Member Author

not pretty, but I found and added the right definition. I don't want to add links to a different gh repo, and looking the definition up would be annoying to contributors, so I think that's the best solution.
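
To visualize the element list under discussion, here is a `#[repr(C)]` sketch with one field per entry of `kernel_elements`. The field types follow the order in the quoted `vec!`; the field names and the guess that `tarr` means an array of three 32-bit integers are assumptions on my part and are not confirmed anywhere in this thread.

```rust
use std::ffi::c_void;
use std::mem::size_of;

/// One field per entry of the quoted element list. Names are assumed.
#[repr(C)]
#[allow(dead_code)]
struct KernelArgs {
    version: u32,               // ti32
    num_args: u32,              // ti32
    arg_base_ptrs: *mut c_void, // tptr
    arg_ptrs: *mut c_void,      // tptr
    arg_sizes: *mut c_void,     // tptr
    arg_types: *mut c_void,     // tptr
    arg_names: *mut c_void,     // tptr
    arg_mappers: *mut c_void,   // tptr
    trip_count: u64,            // ti64
    flags: u64,                 // ti64
    num_teams: [u32; 3],        // tarr (assumed to be 3 x i32)
    thread_limit: [u32; 3],     // tarr (assumed to be 3 x i32)
    dyn_cgroup_mem: u32,        // ti32
}

fn main() {
    println!("KernelArgs occupies {} bytes on this target", size_of::<KernelArgs>());
}
```

Writing the layout down next to the element list makes a mismatch like the one flagged above easy to spot during review.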

Comment on lines 342 to 345
tgt_bin_desc_alloca,
cx.get_const_i8(0),
cx.get_const_i64(32),
Align::from_bytes(8).unwrap(),
Contributor

explain magic numbers pls

Member Author

https://llvm.org/docs/LangRef.html#llvm-memset-intrinsics
The 32 is just the bit width, because the first type is an i32. The byte value we set it to is zero, so no magic number.
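
For reference, the LangRef entry linked above describes the operands of `llvm.memset` as a destination pointer, a byte value to fill with, and a length counted in bytes. A host-side analogue with named constants might look like the sketch below; `Descriptor`, `FILL_BYTE` and `LEN_BYTES` are hypothetical names, and the 32-byte, 8-aligned buffer only mirrors the constants in the quoted call.

```rust
use std::ptr;

/// 8-byte aligned scratch area, standing in for the stack slot being cleared.
#[repr(C, align(8))]
struct Descriptor {
    bytes: [u8; 32],
}

const FILL_BYTE: u8 = 0; // value written to every byte
const LEN_BYTES: usize = 32; // number of bytes to fill

fn zeroed_descriptor() -> Descriptor {
    // Start from a non-zero pattern so the effect of the fill is visible.
    let mut desc = Descriptor { bytes: [0xAA; 32] };
    // SAFETY: `desc.bytes` is exactly LEN_BYTES long and valid for writes.
    unsafe { ptr::write_bytes(desc.bytes.as_mut_ptr(), FILL_BYTE, LEN_BYTES) };
    desc
}

fn main() {
    let desc = zeroed_descriptor();
    println!("all zero: {}", desc.bytes.iter().all(|&b| b == 0));
}
```

Naming the constants, or deriving the length from the size of the type being cleared, would address the "magic numbers" request without relying on a comment.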

@ZuseZ4
Member Author

ZuseZ4 commented Jul 9, 2025

fwiw, running this directly through rustc or as part of the rust test suite works. Calling cargo fails due to triggering an llvm assertion, because cargo adds --emit=link. Everything else works. cc @kevinsala

Can't get register for value!
UNREACHABLE executed at `somepath/rust/src/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:1967!

.rustup/toolchains/offload/bin/rustc --crate-name r --edition=2024 src/lib.rs --crate-type cdylib -C opt-level=3 -C lto=fat --out-dir /somepath/drehwald/prog/offload/r/target/release/deps -C strip=debuginfo -O -C lto=fat -C panic=abort -Z offload=Enable --emit=llvm-ir --emit=dep-info,link

Which is fair, I'm still using clang/lld as the driver and linker.
In the next PR I'll start using rustc for more and fixing linking as part of it.

Also looking forward to https://github.com/rust-lang/rust/pull/143684/files, which will fix some offload logic, but isn't blocking. https://github.com/nikic/llvm-project/blob/c5e8134bf939bd5fcf48faeb4efd0d18c4721b4a/offload/libomptarget/PluginManager.cpp#L289

@ZuseZ4
Member Author

ZuseZ4 commented Jul 10, 2025

I removed the cfg gate, happy to not have it rely on Enzyme.

l: Linkage,
) -> &'ll llvm::Value {
let llglobal = add_global(cx, name, initializer, l);
unsafe { llvm::LLVMSetUnnamedAddress(llglobal, llvm::UnnamedAddr::Global) };
Contributor

Suggested change
- unsafe { llvm::LLVMSetUnnamedAddress(llglobal, llvm::UnnamedAddr::Global) };
+ llvm::SetUnnamedAddress(llglobal, llvm::UnnamedAddr::Global);

Contributor

@oli-obk oli-obk left a comment

Ok, one more round then I think this can ship 😆

let name = format!(".offloading.entry.kernel_{num}");
let ci64_0 = cx.get_const_i64(0);
let ci16_1 = cx.get_const_i16(1);
let elems: Vec<&llvm::Value> = vec![
Contributor

which data structure is this?

Comment on lines +223 to +224
let c_section_name = CString::new(".llvm.rodata.offloading").unwrap();
llvm::set_section(llglobal, &c_section_name);
Contributor

Suggested change
- let c_section_name = CString::new(".llvm.rodata.offloading").unwrap();
- llvm::set_section(llglobal, &c_section_name);
+ llvm::set_section(llglobal, c".llvm.rodata.offloading");

tgt_bin_desc_alloca,
cx.get_const_i8(0),
cx.get_const_i64(32),
Align::from_bytes(8).unwrap(),
Contributor

Suggested change
- Align::from_bytes(8).unwrap(),
+ Align::EIGHT,

Comment on lines +415 to +432
let gep1 = builder.inbounds_gep(ty, a1, &[i32_0, i32_0]);
let gep2 = builder.inbounds_gep(ty, a2, &[i32_0, i32_0]);
let gep3 = builder.inbounds_gep(ty2, a4, &[i32_0, i32_0]);

let nullptr = cx.const_null(cx.type_ptr());
let o_type = o_types[0];
let args = vec![
s_ident_t,
cx.get_const_i64(u64::MAX),
cx.get_const_i32(num_args),
gep1,
gep2,
gep3,
o_type,
nullptr,
nullptr,
];
builder.call(fn_ty, end_mapper_decl, &args, None);
Contributor

Since almost the exact same call is gonna be happening thrice in the future, pull this into a function that deduplicates as much as possible
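
One possible shape for that deduplication, sketched on plain strings instead of LLVM values. `MapperCall` and `gen_mapper_call` are hypothetical names; the argument order mirrors the quoted `args` vector, and the begin symbol is the libomptarget counterpart of the two names quoted earlier in this review.

```rust
/// Which of the three data-mapping runtime calls to emit.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MapperCall {
    Begin,
    Update,
    End,
}

impl MapperCall {
    fn symbol(self) -> &'static str {
        match self {
            MapperCall::Begin => "__tgt_target_data_begin_mapper",
            MapperCall::Update => "__tgt_target_data_update_mapper",
            MapperCall::End => "__tgt_target_data_end_mapper",
        }
    }
}

/// Emits one call; only the callee differs between the three variants.
fn gen_mapper_call(ir: &mut Vec<String>, which: MapperCall, num_args: u32) {
    let args = format!(
        "s_ident_t, -1, {}, gep1, gep2, gep3, o_type, null, null",
        num_args
    );
    ir.push(format!("call void @{}({})", which.symbol(), args));
}

fn main() {
    let mut ir = Vec::new();
    for which in [MapperCall::Begin, MapperCall::Update, MapperCall::End] {
        gen_mapper_call(&mut ir, which, 1);
    }
    for line in &ir {
        println!("{}", line);
    }
}
```

In the real builder code the helper would presumably take the declaration and function type as parameters and recompute the GEPs per call, but the idea is the same: one argument list, three callees.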

Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. F-autodiff `#![feature(autodiff)]` F-gpu_offload `#![feature(gpu_offload)]` S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
5 participants