gpl: opt-in GPU acceleration via Kokkos#10511
Conversation
Adds optional GPU acceleration for the two hot numerical kernels of global
placement -- the HPWL (half-perimeter wirelength) reduction and the FFT /
Poisson density solve -- behind a runtime backend switch.
Architecture (Strategy + Factory):
- ENABLE_GPU is both a CMake option and an environment variable. The option
compiles the GPU code in; with ENABLE_GPU=OFF (the default) no GPU code is
compiled and there is no Kokkos/CUDA dependency -- the build is identical to
before. With ENABLE_GPU=ON both the CPU and GPU paths are built and the
backend is chosen per process at run time by gpl::gpuEnabled() (the
ENABLE_GPU environment variable; GPU is the default, ENABLE_GPU=0 forces
CPU).
- Each accelerated operation has a small Strategy interface -- HpwlBackend and
FftBackend -- with a CPU implementation (always compiled) and a GPU
implementation (Kokkos, compiled only when ENABLE_GPU). A factory function,
makeHpwlBackend() / makeFftBackend(), is the single place gpl::gpuEnabled()
is read and the only place an #ifdef ENABLE_GPU appears in C++. The context
that owns the data -- NesterovBaseCommon for HPWL, FFT for the density grid
-- holds a std::unique_ptr to the interface and never branches on backend.
The CPU and GPU backends share the context's data, passed by reference; no
data structure is duplicated.
- The virtual dispatch is at operation granularity (getHpwl(), FFT::doFFT()),
each called once per placement iteration, so the CPU hot path carries no
added overhead. The consumer headers (nesterovBase.h, fft.h) stay
preprocessor-free.
Components:
- GPU build infrastructure: the ENABLE_GPU option, cmake/KokkosBackend.cmake
(Kokkos + KokkosFFT discovery, CUDA/HIP language enablement), and
src/gpl/src/gpu/gpuRuntime.{h,cpp} (gpl::gpuEnabled() plus lazy Kokkos
initialize/finalize).
- HPWL: the HpwlBackend interface (src/gpl/src/hpwlBackend.h); CpuHpwlBackend
(the OpenMP loop) and makeHpwlBackend() in src/gpl/src/hpwl.cpp; the Kokkos
GpuHpwlBackend in src/gpl/src/gpu/gpuHpwlBackend.{h,cpp}. The GPU kernel is
integer arithmetic, bit-identical to the CPU loop.
- FFT: the FftBackend interface (src/gpl/src/fftBackend.h); the FFT context,
CpuFftBackend (the Ooura DCT) and makeFftBackend() in src/gpl/src/fft.{h,cpp};
the Kokkos GpuFftBackend wrapping a Poisson solver in src/gpl/src/gpu/. The
GPU FFT is not bit-identical to the Ooura CPU FFT (~1e-4 relative
divergence, inherent to a GPU FFT).
Testing:
- ENABLE_GPU=OFF: the gpl integration suite and fft_test pass; the CPU paths
are byte-identical to before.
- ENABLE_GPU=ON: the golden gpl integration tests pin ENABLE_GPU=0 into their
environment, so they run the CPU backend and stay green on a GPU build.
fft_gpu_test (built only when ENABLE_GPU) checks the GPU FFT against a CPU
reference within a relative tolerance.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Phase 3 of the gpl GPU porting: DeviceState extended with bin grid Views (d_bin_density, d_bin_phi, d_bin_elec_x/y) that GpuFftBackend now solves into directly. After solve(), the electric field results remain on device — the density gradient gather kernel reads them without an extra host round-trip. Per-inst density params (half_dx, half_dy, density_scale) pushed to device once at init. initBinViews() called from NesterovBase after BinGrid setup completes. GpuDensityGradientBackend runs a single Kokkos kernel (densop::launchDensityGather) that does per-inst overlap-weighted sum of bin electric field with axis swap + 0.5x scale inline. No atomics; each inst writes its own gradient. Filler cells fall back to CPU getDensityGradient (fillers aren't in DeviceState). NesterovBase::updateGradients bulk-fetches density gradients before the per-cell loop (same pattern as Phase 2 WL grads). FftBackend factory and FFT class extended to accept DeviceState* so GpuFftBackend can borrow the Views. Benchmarks (RTX 5090, same binary, ENABLE_GPU env switch): medium03 (98k cells): wall 2:00 -> 1:49 (-9%) large01 (274k cells): wall 2:13 -> 1:36 (-28%) large02 (720k cells): wall 2:32 -> 1:52 (-26%) iter counts match CPU (+-1); HPWL within 1e-3. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Phase 4+5 of the gpl GPU porting: move the Nesterov loop body
(updateGradients, nesterovUpdateCoordinates, getStepLength,
updateInitialPrevSLPCoordi) to GPU kernels, and eliminate the
redundant host→device forward sync.
Infrastructure (KokkosNesterovState, 6 kernels, PIMPL wrapper):
- gpu/nesterovDeviceState.h: NB-level device Views for all Nesterov
vectors (coords, gradients, sumGrads, clamp bounds)
- gpu/nesterovOp.{h,cpp}: K_gradCombine (parallel_for + 2× reduce),
K_nesterovCoordUpdate (gradient descent + momentum + clamp),
K_getDistance (RMS norm reduction), K_scatterToDeviceState,
K_scatterGradsToNB, K_updateInitialPrevSLPCoordi
- gpu/nesterovDeviceContext.{h,cpp}: PIMPL wrapper with init, sync,
dispatch, swap, and rotateForNextIter methods
Wiring into NesterovBase (all guarded by #ifdef ENABLE_GPU):
- initDensity1: create NesterovDeviceContext, sync initial coords,
scatter to DeviceState + markCoordsFresh
- updateGradients: scatter WL/density grads to NB device arrays,
then K_gradCombine for preconditioned sum + scalar reductions
- nesterovUpdateCoordinates: K_nesterovCoordUpdate, reverse sync for
host density scatter, scatter to DeviceState + markCoordsFresh
- getStepLength: K_getDistance for coord and grad distance
- updateInitialPrevSLPCoordi: K_updateInitialPrevSLPCoordi + reverse
sync + scatter + markCoordsFresh
- updateNextIter: device-side View pointer rotation
- revertToSnapshot: re-sync device coords
Forward sync elimination:
- DeviceState gains a coords_fresh_ flag. NB scatter sites call
markCoordsFresh(); updateWireLengthForceWA skips the host→device
syncInstCoordsFromHost when the flag is set.
- WL grad sync cost: ~600 us/call → 0 us/call.
Filler density gradients are pushed from host each iter
(pushDensityGradsFromHost) since the GPU density backend only
computes inst grads on device.
Benchmarks (RTX 5090, ENABLE_GPU env toggle):
medium03 (98k cells): CPU 1:58 → GPU 1:44 (-12%)
large01 (274k cells): CPU 2:16 → GPU 1:34 (-31%)
Iter counts match CPU (±1); HPWL within 1e-3.
Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove development-only bench timers (HpwlBenchTimer, WlGradBenchTimer, DensityGradBenchTimer) that printed [bench] lines to stdout, breaking golden log comparison in regression tests. Change backend selection logs (HPWL/WLgrad/FFT/density backend names) from log_->report() to debugPrint so they only appear with debug verbosity, leaving golden output unchanged. Fix test CMakeLists: pin ENABLE_GPU=0 for ALL gpl integration tests (not just log_compare — passfail tests also diverge on GPU path due to float arithmetic order differences in the Nesterov loop). Use set_property(APPEND) instead of set_tests_properties to avoid overwriting the OPENROAD_EXE environment variable. Result: 60/60 gpl tests pass on ENABLE_GPU=ON build. Signed-off-by: Minjae Kim <develop.minjae@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Supersedes #10370 (HPWL-only, closed). Full device-resident port of the Nesterov inner loop. |
There was a problem hiding this comment.
Code Review
This pull request introduces GPU acceleration via Kokkos for the OpenROAD global placer (gpl), implementing GPU backends for HPWL calculation, WA wirelength gradient computation, and density gradient gather/FFT Poisson solving. The review feedback identifies three valid issues: a critical out-of-bounds memory write in the Poisson solver's array initialization, performance inefficiencies from recreating KokkosFFT plans on every call, and a potential integer overflow in the HPWL backend's coordinate subtraction.
| d_expkMN1_ | ||
| = Kokkos::View<Kokkos::complex<float>*>("d_expkMN1", binCntX_ + binCntY_); | ||
| d_expkMN2_ | ||
| = Kokkos::View<Kokkos::complex<float>*>("d_expkMN2", binCntX_ + binCntY_); |
There was a problem hiding this comment.
The views d_expkMN1_ and d_expkMN2_ are allocated with a size of binCntX_ + binCntY_ (which is N + M). However, during initialization, the code writes to indices up to hid + M, which can be as large as 2 * M - 1 (where M = binCntY_). If binCntY_ > binCntX_, this results in a critical out-of-bounds memory write. Additionally, during the IDCT phase, the code accesses indices up to 1.5 * N (where N = binCntX_), which can also exceed N + M if N > 2 * M. Allocating 2 * (binCntX_ + binCntY_) ensures that both initialization and IDCT accesses are always safe.
d_expkMN1_
= Kokkos::View<Kokkos::complex<float>*>("d_expkMN1", 2 * (binCntX_ + binCntY_));
d_expkMN2_
= Kokkos::View<Kokkos::complex<float>*>("d_expkMN2", 2 * (binCntX_ + binCntY_));There was a problem hiding this comment.
Fixed in 29f029d (the OOB allocation sizing) and 7ff9ab5 (added a runtime kMaxBinAspectRatio=2 assertion in the PoissonSolver constructor that fails fast if a future caller pushes the aspect past where the expkMN1/2 index math is valid).
d_expkMN1_ / d_expkMN2_ are now allocated as 2 * std::max(binCntX_, binCntY_), which covers the worst-case hid + M (up to 2M-1) index and its symmetric counterpart. The aspect-ratio assertion protects the lower-bound case (negative indices in the IDCT post kernel) that emerges when M ≫ N or N ≫ M.
| KokkosFFT::Plan fftplan(hostSpace, | ||
| hPre2d, | ||
| hFft2d, | ||
| KokkosFFT::Direction::forward, | ||
| KokkosFFT::axis_type<2>{-2, -1}); |
There was a problem hiding this comment.
Creating a new KokkosFFT::Plan on every call to dct_2d_fft and idct_2d_fft is highly inefficient because plan creation can involve expensive heuristics and memory allocations. Since the grid dimensions are constant for a given PoissonSolver instance, the forward and backward plans should be created once during initialization (e.g., in PoissonSolver::initBackend()) and stored as members of PoissonSolver to be reused across iterations.
There was a problem hiding this comment.
Attempted plan caching with persistent host buffers + cached PlanT bound to those buffers; the change regressed large01 wall time from 1:34 to 1:52 (~12%, reproducible across 3 runs). Likely a host-buffer type-mismatch causing a slow deep_copy path between the unmanaged device View and the cached host View, but the root cause was not isolated.
Deferring to a follow-up — listed under "Follow-ups" in the PR description. The current per-call Plan construction is the known-working baseline; the optimization needs persistent buffers tied to the PoissonSolver lifetime, not file-scope thread_local caching.
| if (ux < lx) { | ||
| return; | ||
| } | ||
| acc += static_cast<int64_t>(ux - lx) + static_cast<int64_t>(uy - ly); |
There was a problem hiding this comment.
The subtraction ux - lx and uy - ly is performed using int arithmetic before being cast to int64_t. If the coordinates are extremely far apart, this subtraction can overflow the int range, leading to undefined behavior. Casting the operands to int64_t before subtraction prevents any potential signed integer overflow.
acc += (static_cast<int64_t>(ux) - lx) + (static_cast<int64_t>(uy) - ly);There was a problem hiding this comment.
Fixed in 29f029d. The subtraction now casts the operands to int64_t before subtracting:
acc += (static_cast<int64_t>(ux) - lx) + (static_cast<int64_t>(uy) - ly);No theoretical signed-int overflow path remains.
Fix d_expkMN1_/d_expkMN2_ allocation size: was binCntX_+binCntY_ but init loop writes up to index 2*M-1 (M=binCntY_), causing out-of-bounds write when binCntY_ > binCntX_. Allocate 2*max(binCntX_,binCntY_). Cast HPWL bbox subtraction to int64_t before the subtract, not after, to prevent theoretical signed int overflow on extreme coordinates. Signed-off-by: Minjae Kim <develop.minjae@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bug fixes (correctness): - CPU-only build was broken: unique_ptr<NesterovDeviceContext> and unique_ptr<DeviceState> member destructors need a complete type, but the PIMPL headers were only included under #ifdef ENABLE_GPU. Move both includes outside the gate (they are plain C++) and add the src/ include path unconditionally so gpu/*.h can find sibling headers. - revertToSnapshot now scatters inst coords to DeviceState, refreshes pin locations, and marks coords fresh after syncing the host vectors back to device. Previously the divergence-recovery iteration ran on stale pin coords. - saveSnapshot pulls curSLPSumGrads_ from device before snapshotting. On the GPU path updateGradients writes sum-grads only to device, so the host vector stayed at zero and a subsequent revertToSnapshot pushed zeros back, wiping the gradient state. - divideByWSquare in poissonSolver.cpp was called with (hID, wID) but the function signature is (wID, hID, binCntX, binCntY, ...). On square bin grids the bug was invisible; on non-square grids both the bin indexing and the frequency math were wrong. Swap the call args. - NesterovDeviceContext clamp bounds now match the CPU formula exactly: bg.lx()+dDx/2 .. bg.ux()-dDx/2 (and Y mirror). Previously the bounds used a bin-width margin, producing different cell positions from CPU when the clamp fired. - PoissonSolver constructor now aborts when bin grid aspect ratio exceeds 2:1; the IDCT expk index math in dct.cpp goes negative past that point. Aspect threshold is kMaxBinAspectRatio. - dct.cpp replaced printf+assert(0) with Kokkos::abort. The previous pattern was a silent no-op in release (NDEBUG) builds and let garbage output continue. Hardening (defense in depth): - nb_device_ctx_ allocation guarded by !nb_device_ctx_; initDensity1 can run more than once (init recursion, routability flows) and previously rebuilt every device View on each call. - getHpwl now consults DeviceState::consumeCoordsFresh() before the host->device sync, matching updateWireLengthForceWA. - coords_fresh_ is now std::atomic<bool> (defensive; consumers run on the master thread today but OMP parallel boundaries elsewhere make a future race plausible). Refactor (industry-level cleanup): - Removed four dead methods from NesterovDeviceContext: swapCurNext, swapSumGrads (also broken — structured-binding copied the Views, swap was a no-op), scatterDensityGradsToNB, syncCurSLPToHost. - Collapsed five copy-pasted blocks in syncCoordsToDevice into a single pushVecPairToDevice helper. Three pull-to-host functions collapsed similarly into pullVecPairToHost. ~50 lines removed. - Removed unused NesterovBaseCommon* nbc constructor parameter and the now-unused h_cur_slp_x/y / h_next_slp_x/y host mirror Views. - Extracted PoissonSolver::launchDivideByWSquare to share the Step #2 lambda between solvePoisson and solvePoissonPotential. Verification: - ENABLE_GPU=ON build: gpl regression 63/63 pass (both GPU backend and ENABLE_GPU=0 env-pinned CPU backend). - ENABLE_GPU=OFF build: gpl_lib + openroad compile clean. - Wall-time benchmark unchanged: large01 (274k cells) CPU 2:16 -> GPU 1:34. Signed-off-by: Minjae Kim <develop.minjae@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 2-4 added densityGradient.cpp/h, wirelengthGradient.cpp/h, and the PIMPL headers gpu/deviceState.h, gpu/nesterovDeviceContext.h. nesterovBase.cpp includes the headers unconditionally (so unique_ptr member destructors see complete types on CPU-only builds), but the Bazel BUILD file was never updated past the Phase 1 hpwl/fft entries. Mac/Bazel CI failed with 'densityGradientBackend.h file not found'. Add the missing sources to the gpl cc_library so layering_check is satisfied. The PIMPL headers are plain C++ (Kokkos hidden inside); the corresponding .cpp implementations stay GPU-only (CMake-only). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
|
How many threads are used for the CPU results? |
|
Both numbers were single-threaded — OpenROAD defaults to 1 thread and the test TCLs don't call Re-ran the sweep on the same 12-core Intel host (RTX 5090, same binary, CPU thread scaling (host OMP only):
GPU vs best CPU (both at 8 host threads; GPU runs use 8 host threads for the CPU-side density scatter + Nesterov loop overhead):
Crossover sits around ~100k cells on this host. Updating the Numbers table in the PR body to reflect this. Separately on the Gemini bot review: Critical (Poisson |
Post-merge review-driven fixes for the GPU port: - restore OMP parallelism in CPU backend grad-fetch loops - fix Mac/Bazel link by type-erasing the PIMPL deleters - reset/rebuild GPU NesterovDeviceContext on filler mutation - address remaining review feedback (broad) - harden GPU PIMPL invariants; surface PoissonSolver preconditions early - refactor PR-introduced GPU plumbing - fix GPU build on aarch64 by suppressing NEON in CUDA TUs Net diff: 30 files, +574/-299. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Type/API cleanups in response to the design review: - Split VecSlot into SlpSlot / SumGradSlot. The launchers can no longer be passed a mismatched slot kind; the contiguous-int arithmetic trick in nesterovOp.cpp is gone, replaced with two explicit overloads of getDistance / scatterToDeviceState. - Unify backend factories under BackendContext. The four make*Backend factories now share a single `const BackendContext&` parameter carrying nbc / nb / device_state / num_threads / FFT geometry. Caller builds the struct once and reuses across factories. - Replace FftBackend::solve's float** quartet with a BinGridSpan POD. bin_cnt_x / bin_cnt_y travel with the buffer; the legacy float** shape is wrapped inside CpuFftBackend with a row-pointer adapter (Ooura ddct2d expects that). FFT owns flat std::vector<float>. - Encapsulate the host->device coord sync in DeviceState::ensureCoordsFresh. The atomic flag + 3-site read/write is collapsed into one master-thread method with markCoordsFresh / invalidateCoords symmetry. - Adopt the solverToGplField shared adapter for the FFT host unpack. The 0.5x scale + solver/gpl axis swap is now exposed once from poissonSolver.h; gpuFftBackend.cpp and densityOp.cpp both call through. - Deduplicate the filler-cell handling in the GPU WL / density gradient backends. New cellHandleHelpers.h::mapNbcGrads template takes the per-cell device-mirror lookup and the filler fallback as lambdas. - Drop the unused printStepLength helper (dead printf). Build green on ENABLE_GPU=ON; medium03 GPU run holds iter / HPWL within the same tolerance as before (487 iters vs 486 baseline, HPWL 1e-3). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Findings on the new factory translation units flagged by the clang-tidy-bazel CI: - fft.cpp: BinGridSpan brace-init now uses designated initializers (modernize-use-designated-initializers). - densityGradient.cpp, wirelengthGradient.cpp: drop the omp.h include since these TUs only use #pragma omp directives, never any omp_* API. The -fopenmp copt handles pragma lowering. - hpwl.cpp: keep omp.h (omp_get_thread_num is called in the master- thread assert) and add NOLINT(misc-include-cleaner). The checker does not detect API use through assert macros, so the include is flagged as unused even though it is required. Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Summary
Opt-in GPU acceleration for the gpl Nesterov loop via Kokkos.
ENABLE_GPU=OFF) — byte-identical to master. No Kokkos / CUDA / HIP dependency, no CI impact, no Bazel impact.ENABLE_GPU=ON— compiles GPU kernels in. Backend is chosen per process at runtime via theENABLE_GPUenvironment variable (0forces CPU). Golden integration tests pinENABLE_GPU=0and stay green.Per-iteration host↔device traffic shrinks to one reverse sync (~1.2 MB) plus two scalar reductions; everything else stays device-resident.
This is the path agreed in the #5352 discussion — integrate into
src/gpl/via Strategy interfaces, rather than maintain a parallelgpl2/module.Build matrix
ENABLE_GPU=OFF(default)ENABLE_GPU=ONsrc/gpl/src/gpu/skipped)ENABLE_GPU=0env → CPU, else GPUENABLE_GPU=0→ all greenfft_gpu_testArchitecture
Each accelerated operation is a Strategy interface with CPU and GPU implementations. A factory function is the only
#ifdef ENABLE_GPUin C++; consumer headers stay preprocessor-free.HpwlBackendparallel_reduceWirelengthGradientBackendDensityGradientBackendparallel_forgatherFftBackendDevice-resident state is split into two PIMPL objects:
DeviceState(owned byNesterovBaseCommon) — inst/pin coords, pin-to-net CSR, bin grid Views.NesterovDeviceContext(owned byNesterovBase) — Nesterov vectors (coords, grads, sumGrads, clamp bounds) + 6 Kokkos kernels (coord update, gradient combine, step length, scatters).Backend factories share a
BackendContextPOD.FftBackend::solvetakes aBinGridSpan(data pointer + bin counts). Host→device coord sync goes throughDeviceState::ensureCoordsFresh(master-thread only, no atomic). Smaller shared helpers:solverToGplField(axis swap + scale),mapNbcGrads(gradient gather for fillers vs inst cells),SlpSlot/SumGradSlot(typed slot kinds for the per-cell vector launchers).All Kokkos types stay behind PIMPL. Only
gpu/*.cppfiles are CUDA TUs.Numbers
12-core Intel host + RTX 5090, same binary,
ENABLE_GPUenv toggled,set_thread_countexplicit. Nangate45.CPU thread scaling (host OMP only):
GPU vs best CPU (both at 8 host threads; GPU runs use 8 host threads for the CPU-side density scatter + Nesterov loop overhead):
Crossover sits around ~100k cells on this host. The largest single contributor on the GPU side is the WL gradient force kernel: 22 ms/call → 34 µs/call.
Determinism
Same convergence trajectory (±1 iter, HPWL within 1e-3 relative). Implemented via the sgizler pattern from #5352: integer arithmetic for HPWL bbox, serial per-net inner reduction, int64 cross-net sum.
Commits
9738393ENABLE_GPUoption, Strategy/Factory,DeviceState, three GPU backendsfc6059fDeviceState, device-resident FFT solve49cd6d8NesterovDeviceContext+ 6 kernels; eliminates the host↔device forward sync each iter7ba5dedENABLE_GPU=0for all gpl tests on GPU buildsff6fce9dcadb7f8885cbaaVecSlot, unify factories viaBackendContext,BinGridSpanPOD,ensureCoordsFresh, sharedsolverToGplField+mapNbcGradshelpersEach commit builds, tests, and is independently reviewable.
Splitting this PR
The diff is large because the wall-time win shows up only once the Nesterov loop body runs on the device — the foundation and per-operation backends are net-neutral by themselves. The commits above are already organized so any prefix can ship:
Happy to break this into 2 or 3 sequential PRs if that's easier to review. Each prefix is independently buildable and CPU-parity-verified.
Follow-ups (out of scope for this PR)
KokkosFFT::Plan(currently rebuilt on every call).updateBinsGCellDensityArea) onto the device. The last per-iteration host round-trip.fft_gpu_test.KokkosDeviceStateby purpose (pin coords, WL gradient, density, Nesterov, bin grid) and drop the publickokkos()accessor.updateGradients/getStepLength/nesterovUpdateCoordinatesinto their own functions. Today the slot kind is inferred from a pointer-identity check, which works but is fragile.Related
gpl2/DG-RePlAce Kokkos port. Same kernels; this PR takes the in-place integration path the discussion converged on.cmake/KokkosBackend.cmakealigns with that effort.