gpl: opt-in GPU acceleration via Kokkos by ApeachM · Pull Request #10511 · The-OpenROAD-Project/OpenROAD

ApeachM · 2026-05-26T09:30:22Z

Continuation of #10370, which was originally scoped to HPWL only. After ~3 weeks without review I completed the rest of the device-resident port so the win is actually visible; the original HPWL-only diff was net-neutral on wall time. If a smaller scope is preferable I'm happy to split this back up — see Splitting this PR below.

Summary

Opt-in GPU acceleration for the gpl Nesterov loop via Kokkos.

Default (ENABLE_GPU=OFF) — byte-identical to master. No Kokkos / CUDA / HIP dependency, no CI impact, no Bazel impact.
ENABLE_GPU=ON — compiles GPU kernels in. Backend is chosen per process at runtime via the ENABLE_GPU environment variable (0 forces CPU). Golden integration tests pin ENABLE_GPU=0 and stay green.
Result — on a 274k-cell Nangate45 design at CPU 8T baseline, wall time drops 1:15 → 0:56 (−25%), RTX 5090. Convergence is preserved (±1 iter, HPWL within 1e-3 relative).

Per-iteration host↔device traffic shrinks to one reverse sync (~1.2 MB) plus two scalar reductions; everything else stays device-resident.

This is the path agreed in the #5352 discussion — integrate into src/gpl/ via Strategy interfaces, rather than maintain a parallel gpl2/ module.

Build matrix

	`ENABLE_GPU=OFF` (default)	`ENABLE_GPU=ON`
Extra deps	None	Kokkos + KokkosFFT + CUDA or HIP
GPU code compiled	No (`src/gpl/src/gpu/` skipped)	Yes (CUDA/HIP TUs)
Runtime backend	CPU always	`ENABLE_GPU=0` env → CPU, else GPU
CI / Bazel	Unaffected	Tests pin `ENABLE_GPU=0` → all green
gpl test suite	All pass	All pass + `fft_gpu_test`

# Default build
cmake -S . -B build && ninja -C build
ctest --test-dir build -R gpl

# GPU build
cmake -S . -B build_gpu -DENABLE_GPU=ON \
  -DKokkos_ROOT=<path> -DKokkosFFT_DIR=<path>
ninja -C build_gpu
ctest --test-dir build_gpu -R gpl

Architecture

Each accelerated operation is a Strategy interface with CPU and GPU implementations. A factory function is the only #ifdef ENABLE_GPU in C++; consumer headers stay preprocessor-free.

Operation	Strategy	CPU	GPU
HPWL	`HpwlBackend`	OpenMP loop	Kokkos `parallel_reduce`
WL gradient	`WirelengthGradientBackend`	per-cell loop	5-kernel pipeline
Density gradient	`DensityGradientBackend`	per-cell FFT lookup	`parallel_for` gather
FFT / Poisson	`FftBackend`	Ooura DCT	KokkosFFT DCT

Device-resident state is split into two PIMPL objects:

DeviceState (owned by NesterovBaseCommon) — inst/pin coords, pin-to-net CSR, bin grid Views.
NesterovDeviceContext (owned by NesterovBase) — Nesterov vectors (coords, grads, sumGrads, clamp bounds) + 6 Kokkos kernels (coord update, gradient combine, step length, scatters).

Backend factories share a BackendContext POD. FftBackend::solve takes a BinGridSpan (data pointer + bin counts). Host→device coord sync goes through DeviceState::ensureCoordsFresh (master-thread only, no atomic). Smaller shared helpers: solverToGplField (axis swap + scale), mapNbcGrads (gradient gather for fillers vs inst cells), SlpSlot / SumGradSlot (typed slot kinds for the per-cell vector launchers).

All Kokkos types stay behind PIMPL. Only gpu/*.cpp files are CUDA TUs.

Numbers

12-core Intel host + RTX 5090, same binary, ENABLE_GPU env toggled, set_thread_count explicit. Nangate45.

CPU thread scaling (host OMP only):

Design	Cells	Iters	1T	2T	4T	8T
medium03	98k	486	2:02	1:40	1:31	1:28
large01	274k	554	1:56	1:27	1:16	1:15

GPU vs best CPU (both at 8 host threads; GPU runs use 8 host threads for the CPU-side density scatter + Nesterov loop overhead):

Design	Cells	CPU 8T	GPU	Δ
medium03	98k	1:28	1:29	parity
large01	274k	1:15	0:56	−25%

medium03 (98k) — parity. The per-iter GPU overhead (~1.2 MB reverse sync for the host density scatter + two scalar reductions) cancels the compute saving at this size.
large01 (274k) — GPU −25%. Overhead is constant, compute saving scales with cells, so GPU pulls ahead on larger designs.

Crossover sits around ~100k cells on this host. The largest single contributor on the GPU side is the WL gradient force kernel: 22 ms/call → 34 µs/call.

Determinism

Same convergence trajectory (±1 iter, HPWL within 1e-3 relative). Implemented via the sgizler pattern from #5352: integer arithmetic for HPWL bbox, serial per-net inner reduction, int64 cross-net sum.

Commits

#	Commit	Scope
1	`9738393`	Foundation + HPWL + FFT + WL gradient — build infra, `ENABLE_GPU` option, Strategy/Factory, `DeviceState`, three GPU backends
2	`fc6059f`	Density gradient — gather kernel, bin grid Views on `DeviceState`, device-resident FFT solve
3	`49cd6d8`	Nesterov loop body on GPU — `NesterovDeviceContext` + 6 kernels; eliminates the host↔device forward sync each iter
4	`7ba5ded`	Remove bench instrumentation; pin `ENABLE_GPU=0` for all gpl tests on GPU builds
5	`ff6fce9`	Fix Poisson OOB + HPWL int overflow (Gemini bot review)
6	`dcadb7f8`	Address internal review pass — correctness fixes + helper-function refactor
7	`885cbaa`	Refactor GPU port surface for review — split `VecSlot`, unify factories via `BackendContext`, `BinGridSpan` POD, `ensureCoordsFresh`, shared `solverToGplField` + `mapNbcGrads` helpers

Each commit builds, tests, and is independently reviewable.

Splitting this PR

The diff is large because the wall-time win shows up only once the Nesterov loop body runs on the device — the foundation and per-operation backends are net-neutral by themselves. The commits above are already organized so any prefix can ship:

Commits 1+2 (build infra, HPWL, FFT, WL gradient, density gradient) — landable as a first PR, modest perf change but establishes the dispatch shape.
Commit 3 — adds the device-resident Nesterov loop; this is where the −25% appears.
Commits 4–7 — cleanup, review fixes, and design-pattern refactor.

Happy to break this into 2 or 3 sequential PRs if that's easier to review. Each prefix is independently buildable and CPU-parity-verified.

Follow-ups (out of scope for this PR)

Cache the KokkosFFT::Plan (currently rebuilt on every call).
Move the filler density gradient onto the device. Saves the ~1.2 MB upload per iteration.
Move the density bin scatter (updateBinsGCellDensityArea) onto the device. The last per-iteration host round-trip.
CPU↔GPU equivalence tests for HPWL / WL gradient / density gradient, matching fft_gpu_test.
Split KokkosDeviceState by purpose (pin coords, WL gradient, density, Nesterov, bin grid) and drop the public kokkos() accessor.
Pull the GPU branches out of updateGradients / getStepLength / nesterovUpdateCoordinates into their own functions. Today the slot kind is inferred from a pointer-identity check, which works but is fragile.

Adds optional GPU acceleration for the two hot numerical kernels of global placement -- the HPWL (half-perimeter wirelength) reduction and the FFT / Poisson density solve -- behind a runtime backend switch. Architecture (Strategy + Factory): - ENABLE_GPU is both a CMake option and an environment variable. The option compiles the GPU code in; with ENABLE_GPU=OFF (the default) no GPU code is compiled and there is no Kokkos/CUDA dependency -- the build is identical to before. With ENABLE_GPU=ON both the CPU and GPU paths are built and the backend is chosen per process at run time by gpl::gpuEnabled() (the ENABLE_GPU environment variable; GPU is the default, ENABLE_GPU=0 forces CPU). - Each accelerated operation has a small Strategy interface -- HpwlBackend and FftBackend -- with a CPU implementation (always compiled) and a GPU implementation (Kokkos, compiled only when ENABLE_GPU). A factory function, makeHpwlBackend() / makeFftBackend(), is the single place gpl::gpuEnabled() is read and the only place an #ifdef ENABLE_GPU appears in C++. The context that owns the data -- NesterovBaseCommon for HPWL, FFT for the density grid -- holds a std::unique_ptr to the interface and never branches on backend. The CPU and GPU backends share the context's data, passed by reference; no data structure is duplicated. - The virtual dispatch is at operation granularity (getHpwl(), FFT::doFFT()), each called once per placement iteration, so the CPU hot path carries no added overhead. The consumer headers (nesterovBase.h, fft.h) stay preprocessor-free. Components: - GPU build infrastructure: the ENABLE_GPU option, cmake/KokkosBackend.cmake (Kokkos + KokkosFFT discovery, CUDA/HIP language enablement), and src/gpl/src/gpu/gpuRuntime.{h,cpp} (gpl::gpuEnabled() plus lazy Kokkos initialize/finalize). - HPWL: the HpwlBackend interface (src/gpl/src/hpwlBackend.h); CpuHpwlBackend (the OpenMP loop) and makeHpwlBackend() in src/gpl/src/hpwl.cpp; the Kokkos GpuHpwlBackend in src/gpl/src/gpu/gpuHpwlBackend.{h,cpp}. The GPU kernel is integer arithmetic, bit-identical to the CPU loop. - FFT: the FftBackend interface (src/gpl/src/fftBackend.h); the FFT context, CpuFftBackend (the Ooura DCT) and makeFftBackend() in src/gpl/src/fft.{h,cpp}; the Kokkos GpuFftBackend wrapping a Poisson solver in src/gpl/src/gpu/. The GPU FFT is not bit-identical to the Ooura CPU FFT (~1e-4 relative divergence, inherent to a GPU FFT). Testing: - ENABLE_GPU=OFF: the gpl integration suite and fft_test pass; the CPU paths are byte-identical to before. - ENABLE_GPU=ON: the golden gpl integration tests pin ENABLE_GPU=0 into their environment, so they run the CPU backend and stay green on a GPU build. fft_gpu_test (built only when ENABLE_GPU) checks the GPU FFT against a CPU reference within a relative tolerance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Minjae Kim <develop.minjae@gmail.com>

Phase 3 of the gpl GPU porting: DeviceState extended with bin grid Views (d_bin_density, d_bin_phi, d_bin_elec_x/y) that GpuFftBackend now solves into directly. After solve(), the electric field results remain on device — the density gradient gather kernel reads them without an extra host round-trip. Per-inst density params (half_dx, half_dy, density_scale) pushed to device once at init. initBinViews() called from NesterovBase after BinGrid setup completes. GpuDensityGradientBackend runs a single Kokkos kernel (densop::launchDensityGather) that does per-inst overlap-weighted sum of bin electric field with axis swap + 0.5x scale inline. No atomics; each inst writes its own gradient. Filler cells fall back to CPU getDensityGradient (fillers aren't in DeviceState). NesterovBase::updateGradients bulk-fetches density gradients before the per-cell loop (same pattern as Phase 2 WL grads). FftBackend factory and FFT class extended to accept DeviceState* so GpuFftBackend can borrow the Views. Benchmarks (RTX 5090, same binary, ENABLE_GPU env switch): medium03 (98k cells): wall 2:00 -> 1:49 (-9%) large01 (274k cells): wall 2:13 -> 1:36 (-28%) large02 (720k cells): wall 2:32 -> 1:52 (-26%) iter counts match CPU (+-1); HPWL within 1e-3. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Minjae Kim <develop.minjae@gmail.com>

Phase 4+5 of the gpl GPU porting: move the Nesterov loop body (updateGradients, nesterovUpdateCoordinates, getStepLength, updateInitialPrevSLPCoordi) to GPU kernels, and eliminate the redundant host→device forward sync. Infrastructure (KokkosNesterovState, 6 kernels, PIMPL wrapper): - gpu/nesterovDeviceState.h: NB-level device Views for all Nesterov vectors (coords, gradients, sumGrads, clamp bounds) - gpu/nesterovOp.{h,cpp}: K_gradCombine (parallel_for + 2× reduce), K_nesterovCoordUpdate (gradient descent + momentum + clamp), K_getDistance (RMS norm reduction), K_scatterToDeviceState, K_scatterGradsToNB, K_updateInitialPrevSLPCoordi - gpu/nesterovDeviceContext.{h,cpp}: PIMPL wrapper with init, sync, dispatch, swap, and rotateForNextIter methods Wiring into NesterovBase (all guarded by #ifdef ENABLE_GPU): - initDensity1: create NesterovDeviceContext, sync initial coords, scatter to DeviceState + markCoordsFresh - updateGradients: scatter WL/density grads to NB device arrays, then K_gradCombine for preconditioned sum + scalar reductions - nesterovUpdateCoordinates: K_nesterovCoordUpdate, reverse sync for host density scatter, scatter to DeviceState + markCoordsFresh - getStepLength: K_getDistance for coord and grad distance - updateInitialPrevSLPCoordi: K_updateInitialPrevSLPCoordi + reverse sync + scatter + markCoordsFresh - updateNextIter: device-side View pointer rotation - revertToSnapshot: re-sync device coords Forward sync elimination: - DeviceState gains a coords_fresh_ flag. NB scatter sites call markCoordsFresh(); updateWireLengthForceWA skips the host→device syncInstCoordsFromHost when the flag is set. - WL grad sync cost: ~600 us/call → 0 us/call. Filler density gradients are pushed from host each iter (pushDensityGradsFromHost) since the GPU density backend only computes inst grads on device. Benchmarks (RTX 5090, ENABLE_GPU env toggle): medium03 (98k cells): CPU 1:58 → GPU 1:44 (-12%) large01 (274k cells): CPU 2:16 → GPU 1:34 (-31%) Iter counts match CPU (±1); HPWL within 1e-3. Signed-off-by: Minjae Kim <develop.minjae@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Remove development-only bench timers (HpwlBenchTimer, WlGradBenchTimer, DensityGradBenchTimer) that printed [bench] lines to stdout, breaking golden log comparison in regression tests. Change backend selection logs (HPWL/WLgrad/FFT/density backend names) from log_->report() to debugPrint so they only appear with debug verbosity, leaving golden output unchanged. Fix test CMakeLists: pin ENABLE_GPU=0 for ALL gpl integration tests (not just log_compare — passfail tests also diverge on GPU path due to float arithmetic order differences in the Nesterov loop). Use set_property(APPEND) instead of set_tests_properties to avoid overwriting the OPENROAD_EXE environment variable. Result: 60/60 gpl tests pass on ENABLE_GPU=ON build. Signed-off-by: Minjae Kim <develop.minjae@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ApeachM · 2026-05-26T09:30:30Z

Supersedes #10370 (HPWL-only, closed). Full device-resident port of the Nesterov inner loop.

gemini-code-assist

Code Review

This pull request introduces GPU acceleration via Kokkos for the OpenROAD global placer (gpl), implementing GPU backends for HPWL calculation, WA wirelength gradient computation, and density gradient gather/FFT Poisson solving. The review feedback identifies three valid issues: a critical out-of-bounds memory write in the Poisson solver's array initialization, performance inefficiencies from recreating KokkosFFT plans on every call, and a potential integer overflow in the HPWL backend's coordinate subtraction.

gemini-code-assist · 2026-05-26T09:33:08Z

+  d_expkMN1_
+      = Kokkos::View<Kokkos::complex<float>*>("d_expkMN1", binCntX_ + binCntY_);
+  d_expkMN2_
+      = Kokkos::View<Kokkos::complex<float>*>("d_expkMN2", binCntX_ + binCntY_);


The views d_expkMN1_ and d_expkMN2_ are allocated with a size of binCntX_ + binCntY_ (which is N + M). However, during initialization, the code writes to indices up to hid + M, which can be as large as 2 * M - 1 (where M = binCntY_). If binCntY_ > binCntX_, this results in a critical out-of-bounds memory write. Additionally, during the IDCT phase, the code accesses indices up to 1.5 * N (where N = binCntX_), which can also exceed N + M if N > 2 * M. Allocating 2 * (binCntX_ + binCntY_) ensures that both initialization and IDCT accesses are always safe.

d_expkMN1_ = Kokkos::View<Kokkos::complex<float>*>("d_expkMN1", 2 * (binCntX_ + binCntY_)); d_expkMN2_ = Kokkos::View<Kokkos::complex<float>*>("d_expkMN2", 2 * (binCntX_ + binCntY_));

Fixed in 29f029d (the OOB allocation sizing) and 7ff9ab5 (added a runtime kMaxBinAspectRatio=2 assertion in the PoissonSolver constructor that fails fast if a future caller pushes the aspect past where the expkMN1/2 index math is valid).

d_expkMN1_ / d_expkMN2_ are now allocated as 2 * std::max(binCntX_, binCntY_), which covers the worst-case hid + M (up to 2M-1) index and its symmetric counterpart. The aspect-ratio assertion protects the lower-bound case (negative indices in the IDCT post kernel) that emerges when M ≫ N or N ≫ M.

gemini-code-assist · 2026-05-26T09:33:08Z

+  KokkosFFT::Plan fftplan(hostSpace,
+                          hPre2d,
+                          hFft2d,
+                          KokkosFFT::Direction::forward,
+                          KokkosFFT::axis_type<2>{-2, -1});


Creating a new KokkosFFT::Plan on every call to dct_2d_fft and idct_2d_fft is highly inefficient because plan creation can involve expensive heuristics and memory allocations. Since the grid dimensions are constant for a given PoissonSolver instance, the forward and backward plans should be created once during initialization (e.g., in PoissonSolver::initBackend()) and stored as members of PoissonSolver to be reused across iterations.

Attempted plan caching with persistent host buffers + cached PlanT bound to those buffers; the change regressed large01 wall time from 1:34 to 1:52 (~12%, reproducible across 3 runs). Likely a host-buffer type-mismatch causing a slow deep_copy path between the unmanaged device View and the cached host View, but the root cause was not isolated.

Deferring to a follow-up — listed under "Follow-ups" in the PR description. The current per-call Plan construction is the known-working baseline; the optimization needs persistent buffers tied to the PoissonSolver lifetime, not file-scope thread_local caching.

gemini-code-assist · 2026-05-26T09:33:08Z

+        if (ux < lx) {
+          return;
+        }
+        acc += static_cast<int64_t>(ux - lx) + static_cast<int64_t>(uy - ly);


The subtraction ux - lx and uy - ly is performed using int arithmetic before being cast to int64_t. If the coordinates are extremely far apart, this subtraction can overflow the int range, leading to undefined behavior. Casting the operands to int64_t before subtraction prevents any potential signed integer overflow.

acc += (static_cast<int64_t>(ux) - lx) + (static_cast<int64_t>(uy) - ly);

Fixed in 29f029d. The subtraction now casts the operands to int64_t before subtracting:

acc += (static_cast<int64_t>(ux) - lx) + (static_cast<int64_t>(uy) - ly);

No theoretical signed-int overflow path remains.

Fix d_expkMN1_/d_expkMN2_ allocation size: was binCntX_+binCntY_ but init loop writes up to index 2*M-1 (M=binCntY_), causing out-of-bounds write when binCntY_ > binCntX_. Allocate 2*max(binCntX_,binCntY_). Cast HPWL bbox subtraction to int64_t before the subtract, not after, to prevent theoretical signed int overflow on extreme coordinates. Signed-off-by: Minjae Kim <develop.minjae@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Bug fixes (correctness): - CPU-only build was broken: unique_ptr<NesterovDeviceContext> and unique_ptr<DeviceState> member destructors need a complete type, but the PIMPL headers were only included under #ifdef ENABLE_GPU. Move both includes outside the gate (they are plain C++) and add the src/ include path unconditionally so gpu/*.h can find sibling headers. - revertToSnapshot now scatters inst coords to DeviceState, refreshes pin locations, and marks coords fresh after syncing the host vectors back to device. Previously the divergence-recovery iteration ran on stale pin coords. - saveSnapshot pulls curSLPSumGrads_ from device before snapshotting. On the GPU path updateGradients writes sum-grads only to device, so the host vector stayed at zero and a subsequent revertToSnapshot pushed zeros back, wiping the gradient state. - divideByWSquare in poissonSolver.cpp was called with (hID, wID) but the function signature is (wID, hID, binCntX, binCntY, ...). On square bin grids the bug was invisible; on non-square grids both the bin indexing and the frequency math were wrong. Swap the call args. - NesterovDeviceContext clamp bounds now match the CPU formula exactly: bg.lx()+dDx/2 .. bg.ux()-dDx/2 (and Y mirror). Previously the bounds used a bin-width margin, producing different cell positions from CPU when the clamp fired. - PoissonSolver constructor now aborts when bin grid aspect ratio exceeds 2:1; the IDCT expk index math in dct.cpp goes negative past that point. Aspect threshold is kMaxBinAspectRatio. - dct.cpp replaced printf+assert(0) with Kokkos::abort. The previous pattern was a silent no-op in release (NDEBUG) builds and let garbage output continue. Hardening (defense in depth): - nb_device_ctx_ allocation guarded by !nb_device_ctx_; initDensity1 can run more than once (init recursion, routability flows) and previously rebuilt every device View on each call. - getHpwl now consults DeviceState::consumeCoordsFresh() before the host->device sync, matching updateWireLengthForceWA. - coords_fresh_ is now std::atomic<bool> (defensive; consumers run on the master thread today but OMP parallel boundaries elsewhere make a future race plausible). Refactor (industry-level cleanup): - Removed four dead methods from NesterovDeviceContext: swapCurNext, swapSumGrads (also broken — structured-binding copied the Views, swap was a no-op), scatterDensityGradsToNB, syncCurSLPToHost. - Collapsed five copy-pasted blocks in syncCoordsToDevice into a single pushVecPairToDevice helper. Three pull-to-host functions collapsed similarly into pullVecPairToHost. ~50 lines removed. - Removed unused NesterovBaseCommon* nbc constructor parameter and the now-unused h_cur_slp_x/y / h_next_slp_x/y host mirror Views. - Extracted PoissonSolver::launchDivideByWSquare to share the Step #2 lambda between solvePoisson and solvePoissonPotential. Verification: - ENABLE_GPU=ON build: gpl regression 63/63 pass (both GPU backend and ENABLE_GPU=0 env-pinned CPU backend). - ENABLE_GPU=OFF build: gpl_lib + openroad compile clean. - Wall-time benchmark unchanged: large01 (274k cells) CPU 2:16 -> GPU 1:34. Signed-off-by: Minjae Kim <develop.minjae@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 2-4 added densityGradient.cpp/h, wirelengthGradient.cpp/h, and the PIMPL headers gpu/deviceState.h, gpu/nesterovDeviceContext.h. nesterovBase.cpp includes the headers unconditionally (so unique_ptr member destructors see complete types on CPU-only builds), but the Bazel BUILD file was never updated past the Phase 1 hpwl/fft entries. Mac/Bazel CI failed with 'densityGradientBackend.h file not found'. Add the missing sources to the gpl cc_library so layering_check is satisfied. The PIMPL headers are plain C++ (Kokkos hidden inside); the corresponding .cpp implementations stay GPU-only (CMake-only). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Minjae Kim <develop.minjae@gmail.com>

maliberty · 2026-05-26T20:42:52Z

How many threads are used for the CPU results?

ApeachM · 2026-05-27T01:40:14Z

Both numbers were single-threaded — OpenROAD defaults to 1 thread and the test TCLs don't call set_thread_count. Not a fair baseline for gpl, which has parallelizable OMP loops.

Re-ran the sweep on the same 12-core Intel host (RTX 5090, same binary, set_thread_count explicit).

CPU thread scaling (host OMP only):

Design	Cells	1T	2T	4T	8T
medium03	98k	2:02	1:40	1:31	1:28
large01	274k	1:56	1:27	1:16	1:15

GPU vs best CPU (both at 8 host threads; GPU runs use 8 host threads for the CPU-side density scatter + Nesterov loop overhead):

Design	Cells	CPU 8T	GPU	Δ
medium03	98k	1:28	1:29	parity
large01	274k	1:15	0:56	−25%

medium03 (98k) — parity. The GPU per-iter overhead (~1.2 MB reverse sync for the host density scatter + two scalar reductions) cancels the compute saving at this size.
large01 (274k) — GPU −25%. Overhead is constant, compute saving scales with cells, so GPU pulls ahead on larger designs.

Crossover sits around ~100k cells on this host. Updating the Numbers table in the PR body to reflect this.

Separately on the Gemini bot review: Critical (Poisson expkMN1/2 OOB) and Medium (HPWL ux-lx int overflow) are fixed. High (KokkosFFT Plan recreation) is deferred — a caching attempt regressed wall time, so the current per-call construction stays as the known-working baseline. See that thread and the Follow-ups section.

Post-merge review-driven fixes for the GPU port: - restore OMP parallelism in CPU backend grad-fetch loops - fix Mac/Bazel link by type-erasing the PIMPL deleters - reset/rebuild GPU NesterovDeviceContext on filler mutation - address remaining review feedback (broad) - harden GPU PIMPL invariants; surface PoissonSolver preconditions early - refactor PR-introduced GPU plumbing - fix GPU build on aarch64 by suppressing NEON in CUDA TUs Net diff: 30 files, +574/-299. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Minjae Kim <develop.minjae@gmail.com>

Type/API cleanups in response to the design review: - Split VecSlot into SlpSlot / SumGradSlot. The launchers can no longer be passed a mismatched slot kind; the contiguous-int arithmetic trick in nesterovOp.cpp is gone, replaced with two explicit overloads of getDistance / scatterToDeviceState. - Unify backend factories under BackendContext. The four make*Backend factories now share a single `const BackendContext&` parameter carrying nbc / nb / device_state / num_threads / FFT geometry. Caller builds the struct once and reuses across factories. - Replace FftBackend::solve's float** quartet with a BinGridSpan POD. bin_cnt_x / bin_cnt_y travel with the buffer; the legacy float** shape is wrapped inside CpuFftBackend with a row-pointer adapter (Ooura ddct2d expects that). FFT owns flat std::vector<float>. - Encapsulate the host->device coord sync in DeviceState::ensureCoordsFresh. The atomic flag + 3-site read/write is collapsed into one master-thread method with markCoordsFresh / invalidateCoords symmetry. - Adopt the solverToGplField shared adapter for the FFT host unpack. The 0.5x scale + solver/gpl axis swap is now exposed once from poissonSolver.h; gpuFftBackend.cpp and densityOp.cpp both call through. - Deduplicate the filler-cell handling in the GPU WL / density gradient backends. New cellHandleHelpers.h::mapNbcGrads template takes the per-cell device-mirror lookup and the filler fallback as lambdas. - Drop the unused printStepLength helper (dead printf). Build green on ENABLE_GPU=ON; medium03 GPU run holds iter / HPWL within the same tolerance as before (487 iters vs 486 baseline, HPWL 1e-3). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Minjae Kim <develop.minjae@gmail.com>

Findings on the new factory translation units flagged by the clang-tidy-bazel CI: - fft.cpp: BinGridSpan brace-init now uses designated initializers (modernize-use-designated-initializers). - densityGradient.cpp, wirelengthGradient.cpp: drop the omp.h include since these TUs only use #pragma omp directives, never any omp_* API. The -fopenmp copt handles pragma lowering. - hpwl.cpp: keep omp.h (omp_get_thread_num is called in the master- thread assert) and add NOLINT(misc-include-cleaner). The checker does not detect API use through assert macros, so the include is flagged as unused even though it is required. Signed-off-by: Minjae Kim <develop.minjae@gmail.com>

ApeachM and others added 4 commits May 26, 2026 00:24

ApeachM requested review from a team as code owners May 26, 2026 09:30

ApeachM requested review from eder-matheus and gudeh May 26, 2026 09:30

github-actions Bot added the size/XL label May 26, 2026

gemini-code-assist Bot reviewed May 26, 2026

View reviewed changes

ApeachM force-pushed the gpl-gpu-fft branch from 61af491 to ff6fce9 Compare May 26, 2026 09:48

ApeachM force-pushed the gpl-gpu-fft branch from dcadb7f to 7ff9ab5 Compare May 26, 2026 13:54

ApeachM force-pushed the gpl-gpu-fft branch from 8fc39a1 to 8eb202b Compare May 27, 2026 09:42

ApeachM and others added 4 commits May 27, 2026 19:37

Merge branch 'The-OpenROAD-Project:master' into gpl-gpu-fft

fea0827

Merge branch 'The-OpenROAD-Project:master' into gpl-gpu-fft

3e18030

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gpl: opt-in GPU acceleration via Kokkos#10511

gpl: opt-in GPU acceleration via Kokkos#10511
ApeachM wants to merge 12 commits into
The-OpenROAD-Project:masterfrom
ApeachM:gpl-gpu-fft

ApeachM commented May 26, 2026 •

edited

Loading

Uh oh!

ApeachM commented May 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

ApeachM May 26, 2026

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

ApeachM May 26, 2026

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

ApeachM May 26, 2026

Uh oh!

maliberty commented May 26, 2026

Uh oh!

ApeachM commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ApeachM commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Build matrix

Architecture

Numbers

Determinism

Commits

Splitting this PR

Follow-ups (out of scope for this PR)

Related

Uh oh!

ApeachM commented May 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

ApeachM May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

ApeachM May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

ApeachM May 26, 2026

Choose a reason for hiding this comment

Uh oh!

maliberty commented May 26, 2026

Uh oh!

ApeachM commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ApeachM commented May 26, 2026 •

edited

Loading

ApeachM commented May 27, 2026 •

edited

Loading