ADACS GPU Work -- Base #622

Open
gusgw wants to merge 254 commits into 21cmfast:main from gusgw:adacs-gpu-base

gusgw commented Feb 23, 2026

Summary

This PR provides the foundation for GPU acceleration in 21cmFAST, developed as part of the ADACS optimization project. It includes:

  • Complete GPU implementation of the InitialConditions path
  • Support for both ZELDOVICH and 2LPT perturbation algorithms on GPU
  • Verified CPU/GPU numerical parity across all physics configurations
  • Build system improvements for profiling and optimization workflows

Key Changes

GPU Initial Conditions Implementation

  • GPU InitialConditions path: Full CUDA implementation of initial condition generation, including density field computation, velocity field generation, and high-resolution perturbation support
  • 2LPT support on GPU: Second-order Lagrangian Perturbation Theory implementation with correct cuFFT handling and volume scaling
  • MapMass GPU kernel: Rewritten to match CPU algorithm exactly, with proper velocity index calculation and bounds checking

Build System Improvements

  • Profile-Guided Optimization (PGO) workflow support via PY21C_PGO_PHASE and PY21C_PGO_DIR environment variables
  • Debug symbols and symbol visibility controls for profiling
  • Build control environment variables for flexible optimization levels
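As an illustration of how a build script might turn the two PGO environment variables into compiler flags, here is a minimal sketch. The phase values `generate` and `use`, the default directory `pgo-data`, and the helper name `pgo_flags` are assumptions made for this example; the PR does not state which values `PY21C_PGO_PHASE` accepts. The flags themselves are standard GCC options.

```python
import os


def pgo_flags(env=None):
    """Return extra GCC flags for the requested PGO phase.

    Hypothetical helper: the phase names "generate" and "use" are
    assumptions, not values confirmed by the PR.
    """
    env = os.environ if env is None else env
    phase = env.get("PY21C_PGO_PHASE", "").strip().lower()
    # Hypothetical default location for the profile data.
    profile_dir = env.get("PY21C_PGO_DIR", "pgo-data")

    if not phase:
        # No PGO requested: build as usual.
        return []
    if phase == "generate":
        # Instrumented build that writes profile data when run.
        return [f"-fprofile-generate={profile_dir}"]
    if phase == "use":
        # Optimized build that reads the collected profile data.
        return [f"-fprofile-use={profile_dir}", "-fprofile-correction"]
    raise ValueError(f"unknown PY21C_PGO_PHASE: {phase!r}")
```

A typical two-pass workflow under these assumptions would build with the `generate` phase, run a representative workload, then rebuild with the `use` phase pointing at the same directory.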

Testing Infrastructure

  • GPU-CPU parity test framework with reference data
  • Comprehensive field diagnostics for CPU/GPU comparison
  • Three-way comparison infrastructure (main vs cpu-optimized vs gpu)

Validation

Extensive three-way comparison testing was performed on both Skylake/P100 and Milan/A100 architectures:

Numerical Parity (CPU vs GPU)

| Architecture | Min Correlation | Notes |
|---|---|---|
| Skylake/P100 | 0.999999 | All non-discrete scripts |
| Milan/A100 | 0.9995 | All non-discrete scripts |

Discrete halo sampling scripts show expected divergence due to different random number sequences on CPU vs GPU.
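For readers who want to reproduce a "minimum correlation" figure, the sketch below computes a Pearson correlation per field and takes the minimum across fields. This assumes the reported metric is a Pearson coefficient over flattened field values, which the PR does not state; the function names and the dictionary-of-fields layout are hypothetical.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant field")
    return cov / math.sqrt(var_a * var_b)


def min_field_correlation(cpu_fields, gpu_fields):
    """Smallest per-field correlation between CPU and GPU outputs.

    Both arguments map a field name to a flat sequence of values.
    """
    return min(
        pearson_correlation(cpu_fields[name], gpu_fields[name])
        for name in cpu_fields
    )
```

A single summary number like this is convenient for a table, but it can hide localized differences, so it is best read alongside per-field diagnostics.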

Performance (Average over 46 test scripts)

| Architecture | CPU vs Main | GPU vs CPU |
|---|---|---|
| Skylake/P100 | +13.0% | +8.9% |
| Milan/A100 | +10.5% | -7.2% |

Note: GPU is slower on A100 for these small test workloads due to transfer overhead. Larger production runs are expected to benefit more from GPU acceleration.
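The effect of a fixed transfer cost can be seen in a toy model, sketched below. It assumes GPU wall time is the CPU time divided by a kernel speedup factor plus a constant transfer overhead, and that a positive percentage means the GPU is faster. Both the model and the sign convention are assumptions for illustration, not measurements or definitions from this PR.

```python
def gpu_speedup_percent(cpu_seconds, kernel_speedup, transfer_seconds):
    """Relative GPU gain in percent under a toy cost model.

    Assumed model: gpu_time = cpu_time / kernel_speedup + transfer_time.
    Positive result means the GPU run is faster than the CPU run.
    """
    gpu_seconds = cpu_seconds / kernel_speedup + transfer_seconds
    return 100.0 * (cpu_seconds - gpu_seconds) / cpu_seconds


def break_even_cpu_seconds(kernel_speedup, transfer_seconds):
    """CPU run time at which the GPU run takes exactly as long."""
    if kernel_speedup <= 1.0:
        raise ValueError("kernel_speedup must exceed 1 for a break-even point")
    return transfer_seconds / (1.0 - 1.0 / kernel_speedup)
```

Under this model, a workload shorter than the break-even time shows a negative percentage no matter how fast the kernels are, while the gain approaches the kernel-only limit as the workload grows.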

Selected Recent Bug Fixes

  • Fix cuFFT first-call failure on P100/Pascal GPUs
  • Fix GPU velocity displacement calculation in MapMass_gpu
  • Fix GPU stochasticity: position randomization and type mismatch for discrete halo sampling
  • Fix 2LPT implementation: cuFFT R2C requires tightly-packed input (not FFT-padded),
    and phi_2 needs VOLUME pre-multiplication to match velocity kernel expectations
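The difference between an FFT-padded buffer and a tightly-packed one, as mentioned in the 2LPT fix, can be sketched with plain index arithmetic. The example assumes a cubic real field of side `n` stored in row-major order, with the padded layout widening the last axis to `2 * (n // 2 + 1)` elements, as is conventional for in-place real-to-complex transforms. The function names are hypothetical and this is not the PR's CUDA code.

```python
def padded_index(i, j, k, n):
    """Flat index into a buffer whose last axis is padded to 2*(n//2 + 1)."""
    return k + 2 * (n // 2 + 1) * (j + n * i)


def packed_index(i, j, k, n):
    """Flat index into a tightly-packed n*n*n buffer."""
    return k + n * (j + n * i)


def pack_real_field(padded, n):
    """Copy the real samples out of a padded buffer into a packed one.

    The padding slots at the end of each row are skipped.
    """
    packed = [0.0] * (n * n * n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                packed[packed_index(i, j, k, n)] = padded[padded_index(i, j, k, n)]
    return packed
```

Passing a padded buffer to a transform that expects the packed layout shifts every row after the first by the accumulated padding, which corrupts the field rather than raising an error. That makes this kind of mismatch easy to miss without a CPU/GPU comparison.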

Test Configurations

Testing covered all major physics configurations:

  • park19, Munoz21, Qin20 physics models
  • Coeval and lightcone calculations
  • With and without 2LPT (ZELDOVICH algorithm)
  • Minihalo and discrete halo sampling modes
  • Multiple random seeds for reproducibility

Commits (75 total)

Key commits:

  • 7c3a5060 Fix GPU 2LPT implementation
  • 278aa749 Re-enable GPU InitialConditions path
  • 0c01b8f7 Implement 2LPT support in GPU MapMass kernel
  • 4433dea1 Fix cuFFT first-call failure on P100/Pascal GPUs
  • 4ca51dfc Cherry-pick InitialConditions GPU implementation
  • 107c6f9f Fix discrete halo correlation failures in GPU stochasticity sampling
  • 32568104 Fix GPU stochasticity: position randomization and type mismatch
  • 29a9b3d3 Fix GPU velocity displacement calculation in MapMass_gpu

Future Work

This branch serves as the base for continued GPU optimization work:

  • GPU profiling and kernel optimization
  • Extended GPU coverage for additional computation stages
  • Performance optimization for production workloads

Test Plan

  • CI tests pass
  • GPU parity tests pass on P100 and A100
  • Coeval calculations produce matching results between CPU and GPU
  • Lightcone calculations produce matching results between CPU and GPU
  • Both ZELDOVICH and 2LPT algorithms work correctly on GPU

Phase 1.2 of incremental upstream merge. This commit integrates the
source-flag-redesign changes from upstream, which replace
USE_MASS_DEPENDENT_ZETA with the SOURCE_MODEL enum.

Conflict resolutions:
- HaloBox.h: Added extern "C" block, kept convert_halo_props function
- SpinTemperatureBox.c: Updated flag checks to SOURCE_MODEL == 1
- map_mass.h: Added extern "C" block, kept MapMass_gpu function
- rng.h: Kept extern "C" block
- HaloCatalog.h: Preserved updateGlobalParams CUDA utility function
- PerturbedField.h: Added extern "C" block for CUDA compatibility
- PerturbedHaloCatalog.h: Added extern "C" block for CUDA compatibility

File renames by upstream (accepted):
- HaloField.c -> HaloCatalog.c
- HaloField.h -> HaloCatalog.h
- PerturbField.c -> PerturbedField.c
- PerturbField.h -> PerturbedField.h
- PerturbHaloField.c -> PerturbedHaloCatalog.c
- PerturbHaloField.h -> PerturbedHaloCatalog.h

Removed CFFI wrapper files (we use nanobind):
- _inputparams_wrapper.h
- _outputstructs_wrapper.h

This field was added to the C struct in the source-flag-redesign merge
(PR 21cmfast#572) but was missing from the Python wrapper class, causing
AttributeError when creating AstroOptions objects.

Signed-off-by: Angus Gray-Weale <gusgw@gusgw.net>

Merged commits:
- 21cmfast#570 A_s_branch (primordial amplitude parameter)
- 21cmfast#576 compiler-detection
- 21cmfast#578 CosmoTables (cosmology table improvements)

Key changes:
- CosmoTables now passed from Python instead of reading from file in C
- Removed CFFI-specific code (we use nanobind)
- Kept nanobind import style in test files
- Add Table1D and CosmoTables struct definitions to InputParameters.h
- Add Table1D, CosmoTables bindings and Free_cosmo_tables_global to wrapper
- Fix cosmology.c to use size_density instead of CLASS_LENGTH
- Update inputs.py to use nanobind instead of CFFI for Table1D/CosmoTables

Signed-off-by: Angus Gray-Weale <gusgw@gusgw.net>

Merged commits:
- 21cmfast#582 X_RAY_HEATING (conditional X-ray heating feature)
- 21cmfast#575 readme-updates (documentation improvements)
- Various CI updates (21cmfast#579-585)

Key changes:
- Added USE_X_RAY_HEATING flag to AstroOptions
- Conditional memory allocation for X-ray heating arrays
- Removed CFFI files (we use nanobind)

gusgw commented Feb 25, 2026

@qyx268 Yes, adding the label sounds like a good idea. We manage labels within a YAML file in the repo, so I will add that in a separate PR.

@gusgw is this meant to replace #541? If this is going to be merged ~soon, I think we should rather aim to merge into the branch release-v4.2. We can then manage the release properly.

Also @gusgw if you need any pointers on handling the conflicts, let me know. Of course we've done a fair bit of work and bugfixing etc on the main branch since you branched off.

Hi @steven-murray, I quite agree! I have merged in up to 4.0.0, I think, and all my tests are passing. I should be up to 4.2 shortly. I'm working through testing my CPU-only and GPU versions against the relevant point on main.

gusgw and others added 23 commits February 25, 2026 17:31
Signed-off-by: Angus Gray-Weale <gusgw@gusgw.net>

Bumps [actions/github-script](https://github.com/actions/github-script) from 6 to 8.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@v6...v8)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Bumps [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact) from 14 to 15.
- [Release notes](https://github.com/dawidd6/action-download-artifact/releases)
- [Commits](dawidd6/action-download-artifact@v14...v15)

---
updated-dependencies:
- dependency-name: dawidd6/action-download-artifact
  dependency-version: '15'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 6 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v6...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Bumps [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact) from 15 to 16.
- [Release notes](https://github.com/dawidd6/action-download-artifact/releases)
- [Commits](dawidd6/action-download-artifact@v15...v16)

---
updated-dependencies:
- dependency-name: dawidd6/action-download-artifact
  dependency-version: '16'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Remove 16 files that should not be in the repository:
- PR-adacs-gpu-base.md (draft PR document)
- check_gpu_usage.{md,py} (local development files)
- gpu_test_*.py, test_gpu*.py, simple_gpu_test.py (local test scripts)
- import_21cmfast.py (local development script)
- install_custom.py (redundant install wrapper)

Restore bump script that was accidentally removed during merges.

… USE_SIGMA_8

- Add missing power_in_vcb function binding to _wrapper.cpp (fixes test_ps_runs)
- Cast USE_SIGMA_8 to bool in inputs.py to satisfy nanobind's strict type checking (fixes test_coeval_against_direct)

Labels

  • priority: high (High priority)
  • type: feature: ui (New feature that adds functionality for the user)
  • type: maint: build (Build System and Dependencies)

8 participants