Skip to content

gpl: opt-in GPU acceleration via Kokkos#10511

Open
ApeachM wants to merge 12 commits into
The-OpenROAD-Project:masterfrom
ApeachM:gpl-gpu-fft
Open

gpl: opt-in GPU acceleration via Kokkos#10511
ApeachM wants to merge 12 commits into
The-OpenROAD-Project:masterfrom
ApeachM:gpl-gpu-fft

Conversation

@ApeachM
Copy link
Copy Markdown
Contributor

@ApeachM ApeachM commented May 26, 2026

Continuation of #10370, which was originally scoped to HPWL only. After ~3 weeks without review I completed the rest of the device-resident port so the win is actually visible; the original HPWL-only diff was net-neutral on wall time. If a smaller scope is preferable I'm happy to split this back up — see Splitting this PR below.

Summary

Opt-in GPU acceleration for the gpl Nesterov loop via Kokkos.

  • Default (ENABLE_GPU=OFF) — byte-identical to master. No Kokkos / CUDA / HIP dependency, no CI impact, no Bazel impact.
  • ENABLE_GPU=ON — compiles GPU kernels in. Backend is chosen per process at runtime via the ENABLE_GPU environment variable (0 forces CPU). Golden integration tests pin ENABLE_GPU=0 and stay green.
  • Result — on a 274k-cell Nangate45 design at CPU 8T baseline, wall time drops 1:15 → 0:56 (−25%), RTX 5090. Convergence is preserved (±1 iter, HPWL within 1e-3 relative).

Per-iteration host↔device traffic shrinks to one reverse sync (~1.2 MB) plus two scalar reductions; everything else stays device-resident.

This is the path agreed in the #5352 discussion — integrate into src/gpl/ via Strategy interfaces, rather than maintain a parallel gpl2/ module.

Build matrix

ENABLE_GPU=OFF (default) ENABLE_GPU=ON
Extra deps None Kokkos + KokkosFFT + CUDA or HIP
GPU code compiled No (src/gpl/src/gpu/ skipped) Yes (CUDA/HIP TUs)
Runtime backend CPU always ENABLE_GPU=0 env → CPU, else GPU
CI / Bazel Unaffected Tests pin ENABLE_GPU=0 → all green
gpl test suite All pass All pass + fft_gpu_test
# Default build
cmake -S . -B build && ninja -C build
ctest --test-dir build -R gpl

# GPU build
cmake -S . -B build_gpu -DENABLE_GPU=ON \
  -DKokkos_ROOT=<path> -DKokkosFFT_DIR=<path>
ninja -C build_gpu
ctest --test-dir build_gpu -R gpl

Architecture

Each accelerated operation is a Strategy interface with CPU and GPU implementations. A factory function is the only #ifdef ENABLE_GPU in C++; consumer headers stay preprocessor-free.

Operation Strategy CPU GPU
HPWL HpwlBackend OpenMP loop Kokkos parallel_reduce
WL gradient WirelengthGradientBackend per-cell loop 5-kernel pipeline
Density gradient DensityGradientBackend per-cell FFT lookup parallel_for gather
FFT / Poisson FftBackend Ooura DCT KokkosFFT DCT

Device-resident state is split into two PIMPL objects:

  • DeviceState (owned by NesterovBaseCommon) — inst/pin coords, pin-to-net CSR, bin grid Views.
  • NesterovDeviceContext (owned by NesterovBase) — Nesterov vectors (coords, grads, sumGrads, clamp bounds) + 6 Kokkos kernels (coord update, gradient combine, step length, scatters).

Backend factories share a BackendContext POD. FftBackend::solve takes a BinGridSpan (data pointer + bin counts). Host→device coord sync goes through DeviceState::ensureCoordsFresh (master-thread only, no atomic). Smaller shared helpers: solverToGplField (axis swap + scale), mapNbcGrads (gradient gather for fillers vs inst cells), SlpSlot / SumGradSlot (typed slot kinds for the per-cell vector launchers).

All Kokkos types stay behind PIMPL. Only gpu/*.cpp files are CUDA TUs.

Numbers

12-core Intel host + RTX 5090, same binary, ENABLE_GPU env toggled, set_thread_count explicit. Nangate45.

CPU thread scaling (host OMP only):

Design Cells Iters 1T 2T 4T 8T
medium03 98k 486 2:02 1:40 1:31 1:28
large01 274k 554 1:56 1:27 1:16 1:15

GPU vs best CPU (both at 8 host threads; GPU runs use 8 host threads for the CPU-side density scatter + Nesterov loop overhead):

Design Cells CPU 8T GPU Δ
medium03 98k 1:28 1:29 parity
large01 274k 1:15 0:56 −25%
  • medium03 (98k) — parity. The per-iter GPU overhead (~1.2 MB reverse sync for the host density scatter + two scalar reductions) cancels the compute saving at this size.
  • large01 (274k)GPU −25%. Overhead is constant, compute saving scales with cells, so GPU pulls ahead on larger designs.

Crossover sits around ~100k cells on this host. The largest single contributor on the GPU side is the WL gradient force kernel: 22 ms/call → 34 µs/call.

Determinism

Same convergence trajectory (±1 iter, HPWL within 1e-3 relative). Implemented via the sgizler pattern from #5352: integer arithmetic for HPWL bbox, serial per-net inner reduction, int64 cross-net sum.

Commits

# Commit Scope
1 9738393 Foundation + HPWL + FFT + WL gradient — build infra, ENABLE_GPU option, Strategy/Factory, DeviceState, three GPU backends
2 fc6059f Density gradient — gather kernel, bin grid Views on DeviceState, device-resident FFT solve
3 49cd6d8 Nesterov loop body on GPU — NesterovDeviceContext + 6 kernels; eliminates the host↔device forward sync each iter
4 7ba5ded Remove bench instrumentation; pin ENABLE_GPU=0 for all gpl tests on GPU builds
5 ff6fce9 Fix Poisson OOB + HPWL int overflow (Gemini bot review)
6 dcadb7f8 Address internal review pass — correctness fixes + helper-function refactor
7 885cbaa Refactor GPU port surface for review — split VecSlot, unify factories via BackendContext, BinGridSpan POD, ensureCoordsFresh, shared solverToGplField + mapNbcGrads helpers

Each commit builds, tests, and is independently reviewable.

Splitting this PR

The diff is large because the wall-time win shows up only once the Nesterov loop body runs on the device — the foundation and per-operation backends are net-neutral by themselves. The commits above are already organized so any prefix can ship:

  1. Commits 1+2 (build infra, HPWL, FFT, WL gradient, density gradient) — landable as a first PR, modest perf change but establishes the dispatch shape.
  2. Commit 3 — adds the device-resident Nesterov loop; this is where the −25% appears.
  3. Commits 4–7 — cleanup, review fixes, and design-pattern refactor.

Happy to break this into 2 or 3 sequential PRs if that's easier to review. Each prefix is independently buildable and CPU-parity-verified.

Follow-ups (out of scope for this PR)

  • Cache the KokkosFFT::Plan (currently rebuilt on every call).
  • Move the filler density gradient onto the device. Saves the ~1.2 MB upload per iteration.
  • Move the density bin scatter (updateBinsGCellDensityArea) onto the device. The last per-iteration host round-trip.
  • CPU↔GPU equivalence tests for HPWL / WL gradient / density gradient, matching fft_gpu_test.
  • Split KokkosDeviceState by purpose (pin coords, WL gradient, density, Nesterov, bin grid) and drop the public kokkos() accessor.
  • Pull the GPU branches out of updateGradients / getStepLength / nesterovUpdateCoordinates into their own functions. Today the slot kind is inferred from a pointer-identity check, which works but is fragile.

Related

  • #5352 — Antmicro's gpl2/ DG-RePlAce Kokkos port. Same kernels; this PR takes the in-place integration path the discussion converged on.
  • #10336 — Bazel vs CMake QoR divergence. The FP-contraction discipline in cmake/KokkosBackend.cmake aligns with that effort.
  • #10370 — original HPWL-only PR.

ApeachM and others added 4 commits May 26, 2026 00:24
Adds optional GPU acceleration for the two hot numerical kernels of global
placement -- the HPWL (half-perimeter wirelength) reduction and the FFT /
Poisson density solve -- behind a runtime backend switch.

Architecture (Strategy + Factory):

- ENABLE_GPU is both a CMake option and an environment variable. The option
  compiles the GPU code in; with ENABLE_GPU=OFF (the default) no GPU code is
  compiled and there is no Kokkos/CUDA dependency -- the build is identical to
  before. With ENABLE_GPU=ON both the CPU and GPU paths are built and the
  backend is chosen per process at run time by gpl::gpuEnabled() (the
  ENABLE_GPU environment variable; GPU is the default, ENABLE_GPU=0 forces
  CPU).

- Each accelerated operation has a small Strategy interface -- HpwlBackend and
  FftBackend -- with a CPU implementation (always compiled) and a GPU
  implementation (Kokkos, compiled only when ENABLE_GPU). A factory function,
  makeHpwlBackend() / makeFftBackend(), is the single place gpl::gpuEnabled()
  is read and the only place an #ifdef ENABLE_GPU appears in C++. The context
  that owns the data -- NesterovBaseCommon for HPWL, FFT for the density grid
  -- holds a std::unique_ptr to the interface and never branches on backend.
  The CPU and GPU backends share the context's data, passed by reference; no
  data structure is duplicated.

- The virtual dispatch is at operation granularity (getHpwl(), FFT::doFFT()),
  each called once per placement iteration, so the CPU hot path carries no
  added overhead. The consumer headers (nesterovBase.h, fft.h) stay
  preprocessor-free.

Components:

- GPU build infrastructure: the ENABLE_GPU option, cmake/KokkosBackend.cmake
  (Kokkos + KokkosFFT discovery, CUDA/HIP language enablement), and
  src/gpl/src/gpu/gpuRuntime.{h,cpp} (gpl::gpuEnabled() plus lazy Kokkos
  initialize/finalize).

- HPWL: the HpwlBackend interface (src/gpl/src/hpwlBackend.h); CpuHpwlBackend
  (the OpenMP loop) and makeHpwlBackend() in src/gpl/src/hpwl.cpp; the Kokkos
  GpuHpwlBackend in src/gpl/src/gpu/gpuHpwlBackend.{h,cpp}. The GPU kernel is
  integer arithmetic, bit-identical to the CPU loop.

- FFT: the FftBackend interface (src/gpl/src/fftBackend.h); the FFT context,
  CpuFftBackend (the Ooura DCT) and makeFftBackend() in src/gpl/src/fft.{h,cpp};
  the Kokkos GpuFftBackend wrapping a Poisson solver in src/gpl/src/gpu/. The
  GPU FFT is not bit-identical to the Ooura CPU FFT (~1e-4 relative
  divergence, inherent to a GPU FFT).

Testing:

- ENABLE_GPU=OFF: the gpl integration suite and fft_test pass; the CPU paths
  are byte-identical to before.
- ENABLE_GPU=ON: the golden gpl integration tests pin ENABLE_GPU=0 into their
  environment, so they run the CPU backend and stay green on a GPU build.
  fft_gpu_test (built only when ENABLE_GPU) checks the GPU FFT against a CPU
  reference within a relative tolerance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Phase 3 of the gpl GPU porting:

DeviceState extended with bin grid Views (d_bin_density, d_bin_phi,
d_bin_elec_x/y) that GpuFftBackend now solves into directly. After
solve(), the electric field results remain on device — the density
gradient gather kernel reads them without an extra host round-trip.

Per-inst density params (half_dx, half_dy, density_scale) pushed to
device once at init. initBinViews() called from NesterovBase after
BinGrid setup completes.

GpuDensityGradientBackend runs a single Kokkos kernel
(densop::launchDensityGather) that does per-inst overlap-weighted sum
of bin electric field with axis swap + 0.5x scale inline. No atomics;
each inst writes its own gradient. Filler cells fall back to CPU
getDensityGradient (fillers aren't in DeviceState).

NesterovBase::updateGradients bulk-fetches density gradients before the
per-cell loop (same pattern as Phase 2 WL grads).

FftBackend factory and FFT class extended to accept DeviceState* so
GpuFftBackend can borrow the Views.

Benchmarks (RTX 5090, same binary, ENABLE_GPU env switch):

  medium03 (98k cells):  wall 2:00 -> 1:49  (-9%)
  large01  (274k cells): wall 2:13 -> 1:36  (-28%)
  large02  (720k cells): wall 2:32 -> 1:52  (-26%)

iter counts match CPU (+-1); HPWL within 1e-3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Phase 4+5 of the gpl GPU porting: move the Nesterov loop body
(updateGradients, nesterovUpdateCoordinates, getStepLength,
updateInitialPrevSLPCoordi) to GPU kernels, and eliminate the
redundant host→device forward sync.

Infrastructure (KokkosNesterovState, 6 kernels, PIMPL wrapper):

- gpu/nesterovDeviceState.h: NB-level device Views for all Nesterov
  vectors (coords, gradients, sumGrads, clamp bounds)
- gpu/nesterovOp.{h,cpp}: K_gradCombine (parallel_for + 2× reduce),
  K_nesterovCoordUpdate (gradient descent + momentum + clamp),
  K_getDistance (RMS norm reduction), K_scatterToDeviceState,
  K_scatterGradsToNB, K_updateInitialPrevSLPCoordi
- gpu/nesterovDeviceContext.{h,cpp}: PIMPL wrapper with init, sync,
  dispatch, swap, and rotateForNextIter methods

Wiring into NesterovBase (all guarded by #ifdef ENABLE_GPU):

- initDensity1: create NesterovDeviceContext, sync initial coords,
  scatter to DeviceState + markCoordsFresh
- updateGradients: scatter WL/density grads to NB device arrays,
  then K_gradCombine for preconditioned sum + scalar reductions
- nesterovUpdateCoordinates: K_nesterovCoordUpdate, reverse sync for
  host density scatter, scatter to DeviceState + markCoordsFresh
- getStepLength: K_getDistance for coord and grad distance
- updateInitialPrevSLPCoordi: K_updateInitialPrevSLPCoordi + reverse
  sync + scatter + markCoordsFresh
- updateNextIter: device-side View pointer rotation
- revertToSnapshot: re-sync device coords

Forward sync elimination:

- DeviceState gains a coords_fresh_ flag. NB scatter sites call
  markCoordsFresh(); updateWireLengthForceWA skips the host→device
  syncInstCoordsFromHost when the flag is set.
- WL grad sync cost: ~600 us/call → 0 us/call.

Filler density gradients are pushed from host each iter
(pushDensityGradsFromHost) since the GPU density backend only
computes inst grads on device.

Benchmarks (RTX 5090, ENABLE_GPU env toggle):

  medium03 (98k cells):  CPU 1:58 → GPU 1:44  (-12%)
  large01  (274k cells): CPU 2:16 → GPU 1:34  (-31%)

Iter counts match CPU (±1); HPWL within 1e-3.

Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove development-only bench timers (HpwlBenchTimer, WlGradBenchTimer,
DensityGradBenchTimer) that printed [bench] lines to stdout, breaking
golden log comparison in regression tests.

Change backend selection logs (HPWL/WLgrad/FFT/density backend names)
from log_->report() to debugPrint so they only appear with debug
verbosity, leaving golden output unchanged.

Fix test CMakeLists: pin ENABLE_GPU=0 for ALL gpl integration tests
(not just log_compare — passfail tests also diverge on GPU path due to
float arithmetic order differences in the Nesterov loop). Use
set_property(APPEND) instead of set_tests_properties to avoid
overwriting the OPENROAD_EXE environment variable.

Result: 60/60 gpl tests pass on ENABLE_GPU=ON build.

Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ApeachM ApeachM requested review from a team as code owners May 26, 2026 09:30
@ApeachM ApeachM requested review from eder-matheus and gudeh May 26, 2026 09:30
@ApeachM
Copy link
Copy Markdown
Contributor Author

ApeachM commented May 26, 2026

Supersedes #10370 (HPWL-only, closed). Full device-resident port of the Nesterov inner loop.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces GPU acceleration via Kokkos for the OpenROAD global placer (gpl), implementing GPU backends for HPWL calculation, WA wirelength gradient computation, and density gradient gather/FFT Poisson solving. The review feedback identifies three valid issues: a critical out-of-bounds memory write in the Poisson solver's array initialization, performance inefficiencies from recreating KokkosFFT plans on every call, and a potential integer overflow in the HPWL backend's coordinate subtraction.

Comment thread src/gpl/src/gpu/poissonSolver.cpp Outdated
Comment on lines +236 to +239
d_expkMN1_
= Kokkos::View<Kokkos::complex<float>*>("d_expkMN1", binCntX_ + binCntY_);
d_expkMN2_
= Kokkos::View<Kokkos::complex<float>*>("d_expkMN2", binCntX_ + binCntY_);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The views d_expkMN1_ and d_expkMN2_ are allocated with a size of binCntX_ + binCntY_ (which is N + M). However, during initialization, the code writes to indices up to hid + M, which can be as large as 2 * M - 1 (where M = binCntY_). If binCntY_ > binCntX_, this results in a critical out-of-bounds memory write. Additionally, during the IDCT phase, the code accesses indices up to 1.5 * N (where N = binCntX_), which can also exceed N + M if N > 2 * M. Allocating 2 * (binCntX_ + binCntY_) ensures that both initialization and IDCT accesses are always safe.

  d_expkMN1_
      = Kokkos::View<Kokkos::complex<float>*>("d_expkMN1", 2 * (binCntX_ + binCntY_));
  d_expkMN2_
      = Kokkos::View<Kokkos::complex<float>*>("d_expkMN2", 2 * (binCntX_ + binCntY_));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 29f029d (the OOB allocation sizing) and 7ff9ab5 (added a runtime kMaxBinAspectRatio=2 assertion in the PoissonSolver constructor that fails fast if a future caller pushes the aspect past where the expkMN1/2 index math is valid).

d_expkMN1_ / d_expkMN2_ are now allocated as 2 * std::max(binCntX_, binCntY_), which covers the worst-case hid + M (up to 2M-1) index and its symmetric counterpart. The aspect-ratio assertion protects the lower-bound case (negative indices in the IDCT post kernel) that emerges when M ≫ N or N ≫ M.

Comment thread src/gpl/src/gpu/dct.cpp
Comment on lines +114 to +118
KokkosFFT::Plan fftplan(hostSpace,
hPre2d,
hFft2d,
KokkosFFT::Direction::forward,
KokkosFFT::axis_type<2>{-2, -1});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Creating a new KokkosFFT::Plan on every call to dct_2d_fft and idct_2d_fft is highly inefficient because plan creation can involve expensive heuristics and memory allocations. Since the grid dimensions are constant for a given PoissonSolver instance, the forward and backward plans should be created once during initialization (e.g., in PoissonSolver::initBackend()) and stored as members of PoissonSolver to be reused across iterations.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attempted plan caching with persistent host buffers + cached PlanT bound to those buffers; the change regressed large01 wall time from 1:34 to 1:52 (~12%, reproducible across 3 runs). Likely a host-buffer type-mismatch causing a slow deep_copy path between the unmanaged device View and the cached host View, but the root cause was not isolated.

Deferring to a follow-up — listed under "Follow-ups" in the PR description. The current per-call Plan construction is the known-working baseline; the optimization needs persistent buffers tied to the PoissonSolver lifetime, not file-scope thread_local caching.

Comment thread src/gpl/src/gpu/gpuHpwlBackend.cpp Outdated
if (ux < lx) {
return;
}
acc += static_cast<int64_t>(ux - lx) + static_cast<int64_t>(uy - ly);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The subtraction ux - lx and uy - ly is performed using int arithmetic before being cast to int64_t. If the coordinates are extremely far apart, this subtraction can overflow the int range, leading to undefined behavior. Casting the operands to int64_t before subtraction prevents any potential signed integer overflow.

        acc += (static_cast<int64_t>(ux) - lx) + (static_cast<int64_t>(uy) - ly);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 29f029d. The subtraction now casts the operands to int64_t before subtracting:

acc += (static_cast<int64_t>(ux) - lx) + (static_cast<int64_t>(uy) - ly);

No theoretical signed-int overflow path remains.

Fix d_expkMN1_/d_expkMN2_ allocation size: was binCntX_+binCntY_ but
init loop writes up to index 2*M-1 (M=binCntY_), causing out-of-bounds
write when binCntY_ > binCntX_. Allocate 2*max(binCntX_,binCntY_).

Cast HPWL bbox subtraction to int64_t before the subtract, not after,
to prevent theoretical signed int overflow on extreme coordinates.

Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bug fixes (correctness):
- CPU-only build was broken: unique_ptr<NesterovDeviceContext> and
  unique_ptr<DeviceState> member destructors need a complete type, but
  the PIMPL headers were only included under #ifdef ENABLE_GPU. Move
  both includes outside the gate (they are plain C++) and add the src/
  include path unconditionally so gpu/*.h can find sibling headers.
- revertToSnapshot now scatters inst coords to DeviceState, refreshes
  pin locations, and marks coords fresh after syncing the host vectors
  back to device. Previously the divergence-recovery iteration ran on
  stale pin coords.
- saveSnapshot pulls curSLPSumGrads_ from device before snapshotting.
  On the GPU path updateGradients writes sum-grads only to device, so
  the host vector stayed at zero and a subsequent revertToSnapshot
  pushed zeros back, wiping the gradient state.
- divideByWSquare in poissonSolver.cpp was called with (hID, wID) but
  the function signature is (wID, hID, binCntX, binCntY, ...). On
  square bin grids the bug was invisible; on non-square grids both the
  bin indexing and the frequency math were wrong. Swap the call args.
- NesterovDeviceContext clamp bounds now match the CPU formula
  exactly: bg.lx()+dDx/2 .. bg.ux()-dDx/2 (and Y mirror). Previously
  the bounds used a bin-width margin, producing different cell
  positions from CPU when the clamp fired.
- PoissonSolver constructor now aborts when bin grid aspect ratio
  exceeds 2:1; the IDCT expk index math in dct.cpp goes negative past
  that point. Aspect threshold is kMaxBinAspectRatio.
- dct.cpp replaced printf+assert(0) with Kokkos::abort. The previous
  pattern was a silent no-op in release (NDEBUG) builds and let
  garbage output continue.

Hardening (defense in depth):
- nb_device_ctx_ allocation guarded by !nb_device_ctx_; initDensity1
  can run more than once (init recursion, routability flows) and
  previously rebuilt every device View on each call.
- getHpwl now consults DeviceState::consumeCoordsFresh() before the
  host->device sync, matching updateWireLengthForceWA.
- coords_fresh_ is now std::atomic<bool> (defensive; consumers run on
  the master thread today but OMP parallel boundaries elsewhere make
  a future race plausible).

Refactor (industry-level cleanup):
- Removed four dead methods from NesterovDeviceContext: swapCurNext,
  swapSumGrads (also broken — structured-binding copied the Views,
  swap was a no-op), scatterDensityGradsToNB, syncCurSLPToHost.
- Collapsed five copy-pasted blocks in syncCoordsToDevice into a
  single pushVecPairToDevice helper. Three pull-to-host functions
  collapsed similarly into pullVecPairToHost. ~50 lines removed.
- Removed unused NesterovBaseCommon* nbc constructor parameter and
  the now-unused h_cur_slp_x/y / h_next_slp_x/y host mirror Views.
- Extracted PoissonSolver::launchDivideByWSquare to share the Step
  #2 lambda between solvePoisson and solvePoissonPotential.

Verification:
- ENABLE_GPU=ON build: gpl regression 63/63 pass (both GPU backend
  and ENABLE_GPU=0 env-pinned CPU backend).
- ENABLE_GPU=OFF build: gpl_lib + openroad compile clean.
- Wall-time benchmark unchanged: large01 (274k cells) CPU 2:16 -> GPU 1:34.

Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 2-4 added densityGradient.cpp/h, wirelengthGradient.cpp/h, and
the PIMPL headers gpu/deviceState.h, gpu/nesterovDeviceContext.h.
nesterovBase.cpp includes the headers unconditionally (so unique_ptr
member destructors see complete types on CPU-only builds), but the
Bazel BUILD file was never updated past the Phase 1 hpwl/fft entries.
Mac/Bazel CI failed with 'densityGradientBackend.h file not found'.

Add the missing sources to the gpl cc_library so layering_check is
satisfied. The PIMPL headers are plain C++ (Kokkos hidden inside);
the corresponding .cpp implementations stay GPU-only (CMake-only).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
@maliberty
Copy link
Copy Markdown
Member

How many threads are used for the CPU results?

@ApeachM
Copy link
Copy Markdown
Contributor Author

ApeachM commented May 27, 2026

Both numbers were single-threaded — OpenROAD defaults to 1 thread and the test TCLs don't call set_thread_count. Not a fair baseline for gpl, which has parallelizable OMP loops.

Re-ran the sweep on the same 12-core Intel host (RTX 5090, same binary, set_thread_count explicit).

CPU thread scaling (host OMP only):

Design Cells 1T 2T 4T 8T
medium03 98k 2:02 1:40 1:31 1:28
large01 274k 1:56 1:27 1:16 1:15

GPU vs best CPU (both at 8 host threads; GPU runs use 8 host threads for the CPU-side density scatter + Nesterov loop overhead):

Design Cells CPU 8T GPU Δ
medium03 98k 1:28 1:29 parity
large01 274k 1:15 0:56 −25%
  • medium03 (98k) — parity. The GPU per-iter overhead (~1.2 MB reverse sync for the host density scatter + two scalar reductions) cancels the compute saving at this size.
  • large01 (274k)GPU −25%. Overhead is constant, compute saving scales with cells, so GPU pulls ahead on larger designs.

Crossover sits around ~100k cells on this host. Updating the Numbers table in the PR body to reflect this.


Separately on the Gemini bot review: Critical (Poisson expkMN1/2 OOB) and Medium (HPWL ux-lx int overflow) are fixed. High (KokkosFFT Plan recreation) is deferred — a caching attempt regressed wall time, so the current per-call construction stays as the known-working baseline. See that thread and the Follow-ups section.

Post-merge review-driven fixes for the GPU port:
  - restore OMP parallelism in CPU backend grad-fetch loops
  - fix Mac/Bazel link by type-erasing the PIMPL deleters
  - reset/rebuild GPU NesterovDeviceContext on filler mutation
  - address remaining review feedback (broad)
  - harden GPU PIMPL invariants; surface PoissonSolver preconditions early
  - refactor PR-introduced GPU plumbing
  - fix GPU build on aarch64 by suppressing NEON in CUDA TUs

Net diff: 30 files, +574/-299.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
ApeachM and others added 4 commits May 27, 2026 19:37
Type/API cleanups in response to the design review:

- Split VecSlot into SlpSlot / SumGradSlot. The launchers can no longer
  be passed a mismatched slot kind; the contiguous-int arithmetic trick
  in nesterovOp.cpp is gone, replaced with two explicit overloads of
  getDistance / scatterToDeviceState.

- Unify backend factories under BackendContext. The four make*Backend
  factories now share a single `const BackendContext&` parameter
  carrying nbc / nb / device_state / num_threads / FFT geometry. Caller
  builds the struct once and reuses across factories.

- Replace FftBackend::solve's float** quartet with a BinGridSpan POD.
  bin_cnt_x / bin_cnt_y travel with the buffer; the legacy float**
  shape is wrapped inside CpuFftBackend with a row-pointer adapter
  (Ooura ddct2d expects that). FFT owns flat std::vector<float>.

- Encapsulate the host->device coord sync in DeviceState::ensureCoordsFresh.
  The atomic flag + 3-site read/write is collapsed into one master-thread
  method with markCoordsFresh / invalidateCoords symmetry.

- Adopt the solverToGplField shared adapter for the FFT host unpack.
  The 0.5x scale + solver/gpl axis swap is now exposed once from
  poissonSolver.h; gpuFftBackend.cpp and densityOp.cpp both call through.

- Deduplicate the filler-cell handling in the GPU WL / density gradient
  backends. New cellHandleHelpers.h::mapNbcGrads template takes the
  per-cell device-mirror lookup and the filler fallback as lambdas.

- Drop the unused printStepLength helper (dead printf).

Build green on ENABLE_GPU=ON; medium03 GPU run holds iter / HPWL within
the same tolerance as before (487 iters vs 486 baseline, HPWL 1e-3).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Findings on the new factory translation units flagged by the
clang-tidy-bazel CI:

- fft.cpp: BinGridSpan brace-init now uses designated initializers
  (modernize-use-designated-initializers).
- densityGradient.cpp, wirelengthGradient.cpp: drop the omp.h include
  since these TUs only use #pragma omp directives, never any omp_*
  API. The -fopenmp copt handles pragma lowering.
- hpwl.cpp: keep omp.h (omp_get_thread_num is called in the master-
  thread assert) and add NOLINT(misc-include-cleaner). The checker
  does not detect API use through assert macros, so the include is
  flagged as unused even though it is required.

Signed-off-by: Minjae Kim <develop.minjae@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants