MPI FFTW #997

anandrdbz · 2025-09-08T07:46:17Z

User description

Description

Fast Fourier transform for energy cascade implemented on multiple ranks. Improves upon Conrad's implementation by using Pencil decomposition (2D) instead of Slabs (1D) in post_process. This required the use of cartesian sub-communicators.

Tested for correctness on Taylor-Green Vortex problem.

Also adds a Lax Friedrichs Riemann solver as an option (riemann_solver = 5)

Fixes #(issue) [optional]

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
[x ] New feature (non-breaking change which adds functionality)
Something else

Scope

[ x] This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration

Test A
Test B

Test Configuration:

What computers and compilers did you use to test this:

Checklist

I have added comments for the new code
I added Doxygen docstrings to the new code
I have made corresponding changes to the documentation (docs/)
I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
I have added example cases in examples/ that demonstrate my new feature performing as expected.
They run to completion and demonstrate "interesting physics"
I ran ./mfc.sh format before committing my code
New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
This PR does not introduce any repeated code (it follows the DRY principle)
I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any code source files (anything in `src/simulation`)

To make sure the code is performing as expected on GPU devices, I have:

Checked that the code compiles using NVHPC compilers
Checked that the code compiles using CRAY compilers
Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
Ran the code on MI200+ GPUs and ensure the new features performed as expected (the GPU results match the CPU results)
Enclosed the new feature via nvtx ranges so that they can be identified in profiles
Ran a Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

PR Type

Enhancement

Description

Implement MPI-based 3D FFT for energy cascade analysis
Add pencil decomposition using cartesian sub-communicators
Integrate FFTW library with MPI transpose operations
Add input validation for FFT requirements

Diagram Walkthrough

flowchart LR
  A["3D Velocity Field"] --> B["X-direction FFT"]
  B --> C["MPI Transpose X→Y"]
  C --> D["Y-direction FFT"]
  D --> E["MPI Transpose Y→Z"]
  E --> F["Z-direction FFT"]
  F --> G["Energy Spectrum Calculation"]
  G --> H["Energy Cascade Output"]

File Walkthrough

Relevant files

Enhancement

m_mpi_common.fpp `Add FFT-specific processor topology optimization` src/common/m_mpi_common.fpp Add conditional processor topology optimization for FFT operations Implement 2D pencil decomposition for post-processing with FFT Maintain existing 3D decomposition logic for non-FFT cases	+108/-1
m_checker.fpp `Add FFT input validation constraints` src/post_process/m_checker.fpp Add `s_check_inputs_fft` subroutine for FFT parameter validation Prohibit `file_per_process` when FFT is enabled Require 3D domain and even local dimensions for FFT	+8/-0
m_global_parameters.fpp `Add FFT write parameter` src/post_process/m_global_parameters.fpp Add `fft_wrt` logical parameter for FFT output control Initialize `fft_wrt` to false in default settings	+2/-0
m_mpi_proxy.fpp `Include FFT parameter in MPI broadcasts` src/post_process/m_mpi_proxy.fpp Add `fft_wrt` to MPI broadcast variable list	+1/-1
m_start_up.fpp `Implement complete MPI-based 3D FFT system` src/post_process/m_start_up.fpp Integrate FFTW3 library with C bindings Implement 3D FFT with MPI pencil decomposition Add energy spectrum calculation and cascade analysis Create MPI transpose routines for data redistribution Initialize FFTW plans and cartesian communicators	+339/-3

Configuration changes

case_dicts.py `Add FFT parameter to toolchain configuration` toolchain/mfc/run/case_dicts.py Add `fft_wrt` parameter to post-processing parameter dictionary	+1/-0

qodo-merge-pro · 2025-09-08T07:47:11Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Possible Issue FFT pencil sizes and MPI cartesian splits appear inconsistent: local sizes are derived using different processor axes than the allocation and transpose logic assumes, which can cause out-of-bounds or misaddressed data during Alltoall and k-index remapping. Verify Nxloc/Nyloc/Nyloc2 definitions, proc_coords usage, and that communicator dimensions match the intended pencils. if(fft_wrt) then num_procs_x = (m_glb + 1) / (m + 1) num_procs_y = (n_glb + 1) / (n + 1) num_procs_z = (p_glb + 1) / (p + 1) Nx = m_glb + 1 Ny = n_glb + 1 Nz = p_glb + 1 Nxloc = (m_glb + 1) / num_procs_y Nyloc = n + 1 Nyloc2 = (n_glb + 1) / num_procs_z Nzloc = p + 1 Nf = max(Nx, Ny, Nz) @:ALLOCATE(data_in(NxNylocNzloc)) @:ALLOCATE(data_out(NxNylocNzloc)) @:ALLOCATE(data_cmplx(Nx, Nyloc, Nzloc)) @:ALLOCATE(data_cmplx_y(Nxloc, Ny, Nzloc)) @:ALLOCATE(data_cmplx_z(Nxloc, Nyloc2, Nz)) @:ALLOCATE(En_real(Nxloc, Nyloc2, Nz)) @:ALLOCATE(En(Nf)) size_n(1) = Nx inembed(1) = Nx onembed(1) = Nx fwd_plan_x = fftw_plan_many_dft(1, size_n, NylocNzloc, & data_in, inembed, 1, Nx, & data_out, onembed, 1, Nx, & FFTW_FORWARD, FFTW_MEASURE) size_n(1) = Ny inembed(1) = Ny onembed(1) = Ny fwd_plan_y = fftw_plan_many_dft(1, size_n, NxlocNzloc, & data_out, inembed, 1, Ny, & data_in, onembed, 1, Ny, & FFTW_FORWARD, FFTW_MEASURE) size_n(1) = Nz inembed(1) = Nz onembed(1) = Nz fwd_plan_z = fftw_plan_many_dft(1, size_n, NxlocNyloc2, & data_in, inembed, 1, Nz, & data_out, onembed, 1, Nz, & FFTW_FORWARD, FFTW_MEASURE) call MPI_CART_CREATE(MPI_COMM_WORLD, 3, (/num_procs_x, & num_procs_y, num_procs_z/), & (/.true., .true., .true./), & .false., MPI_COMM_CART, ierr) call MPI_CART_COORDS(MPI_COMM_CART, proc_rank, 3, & cart3d_coords, ierr) call MPI_Cart_SUB(MPI_COMM_CART, (/.true., .true., .false./), MPI_COMM_CART12, ierr) call MPI_COMM_RANK(MPI_COMM_CART12, proc_rank12, ierr) call MPI_CART_COORDS(MPI_COMM_CART12, proc_rank12, 2, cart2d12_coords, ierr) call MPI_Cart_SUB(MPI_COMM_CART, (/.true., .false., .true./), MPI_COMM_CART13, ierr) call MPI_COMM_RANK(MPI_COMM_CART13, proc_rank13, ierr) call MPI_CART_COORDS(MPI_COMM_CART13, proc_rank13, 2, cart2d13_coords, ierr) end if Buffer Mismatch* MPI_Alltoall uses MPI_DOUBLE_COMPLEX while the Fortran type is complex(c_double_complex); ensure MPI datatype compatibility or provide a derived MPI type. Also check send/recv counts and contiguous packing sizes match allocations to avoid buffer overruns. call MPI_Alltoall(sendbuf, NxlocNylocNzloc, MPI_DOUBLE_COMPLEX, & recvbuf, NxlocNylocNzloc, MPI_DOUBLE_COMPLEX, MPI_COMM_CART12, ierr) do src_rank = 0, num_procs_y - 1 do l = 1, Nzloc do k = 1, Nyloc do j = 1, Nxloc data_cmplx_y(j, k + src_rankNyloc, l) = recvbuf(j + (k-1)Nxloc + (l-1)NxlocNyloc + src_rankNxlocNylocNzloc) end do end do end do end do deallocate(sendbuf) deallocate(recvbuf) end subroutine s_mpi_transpose_x2y subroutine s_mpi_transpose_y2z complex(c_double_complex), allocatable :: sendbuf(:), recvbuf(:) integer :: dest_rank, src_rank integer :: j, k, l allocate(sendbuf(NyNxlocNzloc)) allocate(recvbuf(NyNxlocNzloc)) do dest_rank = 0, num_procs_z - 1 do l = 1, Nzloc do j = 1, Nxloc do k = 1, Nyloc2 sendbuf(k + (j-1)Nyloc2 + (l-1)(Nyloc2Nxloc) + dest_rankNyloc2NxlocNzloc) = data_cmplx_y(j, k + dest_rankNyloc2, l) end do end do end do end do call MPI_Alltoall(sendbuf, Nyloc2NxlocNzloc, MPI_DOUBLE_COMPLEX, & recvbuf, Nyloc2NxlocNzloc, MPI_DOUBLE_COMPLEX, MPI_COMM_CART13, ierr) Validation Logic FFT constraints require local dimensions divisible by 2, but decomposition may yield odd local sizes depending on global sizes and proc topology. Add checks for divisibility by num_procs and ensure Ny/Nz local sizes used by transposes match the enforced constraints. !> Checks constraints on fft_wrt impure subroutine s_check_inputs_fft @:PROHIBIT(fft_wrt .and. file_per_process, "Turn off file_per_process with fft_wrt") @:PROHIBIT(fft_wrt .and. (n == 0 .or. p == 0), "FFT WRT only in 3D") @:PROHIBIT(fft_wrt .and. (MOD(m+1,2) == 1 .or. MOD(n+1,2) == 1 .or. MOD(p+1,2) == 1), "FFT WRT requires local dimensions divisible by 2") end subroutine s_check_inputs_fft

src/post_process/m_start_up.fpp

codecov · 2025-09-08T10:34:11Z

Codecov Report

❌ Patch coverage is 21.83544% with 494 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.19%. Comparing base (e966781) to head (179e3b7).
⚠️ Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
src/simulation/m_riemann_solvers.fpp	25.00%	265 Missing and 32 partials ⚠️
src/post_process/m_start_up.fpp	0.55%	174 Missing and 4 partials ⚠️
src/common/m_mpi_common.fpp	62.00%	17 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #997      +/-   ##
==========================================
+ Coverage   40.91%   41.19%   +0.28%     
==========================================
  Files          70       69       -1     
  Lines       20270    20379     +109     
  Branches     2520     2540      +20     
==========================================
+ Hits         8293     8396     +103     
+ Misses      10439    10408      -31     
- Partials     1538     1575      +37

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sbryngelson · 2025-09-16T11:29:04Z

/improve

src/post_process/m_start_up.fpp

sbryngelson · 2025-09-16T11:30:41Z

src/post_process/m_start_up.fpp

+
+    end subroutine s_mpi_FFT_fwd
+
+    subroutine s_mpi_transpose_x2y


all these mpi routines seem like they should go somewhere else... a new module?

Copilot

Pull Request Overview

This PR implements MPI-based 3D Fast Fourier Transform (FFT) functionality for energy cascade analysis in the post-processing module. The implementation uses pencil decomposition with 2D cartesian sub-communicators instead of the previous 1D slab approach, improving computational efficiency for multi-rank FFT operations.

Key changes:

Add MPI-based 3D FFT implementation with FFTW integration for energy spectrum calculations
Implement pencil decomposition using cartesian sub-communicators for better parallel efficiency
Add comprehensive input validation for FFT parameter constraints

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
toolchain/mfc/run/case_dicts.py	Add `fft_wrt` parameter to post-processing configuration
src/post_process/m_start_up.fpp	Implement complete 3D FFT system with FFTW, MPI transpose operations, and energy cascade calculations
src/post_process/m_mpi_proxy.fpp	Include `fft_wrt` parameter in MPI broadcast list
src/post_process/m_global_parameters.fpp	Add `fft_wrt` logical parameter declaration and initialization
src/post_process/m_checker.fpp	Add FFT input validation constraints and requirements
src/common/m_mpi_common.fpp	Add conditional processor topology optimization for FFT operations using 2D pencil decomposition

src/post_process/m_start_up.fpp

src/post_process/m_checker.fpp

…FFTW

anandrdbz · 2025-09-18T07:22:03Z

Further changes:

There's also been more changes today that basically uses pencil decomposition throughout (including pre_process, and simulation) whenever fft_wrt is used.

This is done so that file_per_process can be enabled. The performance hit in going from block to pencil decomposition is far less than the performance hit in I/O when not using file_per_process, especially considering time averaging for spectrum.

I've also added the Lax Friedrichs Riemann solver

sbryngelson · 2025-09-20T02:11:43Z

please add tests for the LF riemann solver

sbryngelson · 2025-09-20T02:11:55Z

/improve

Copilot

Pull Request Overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

src/post_process/m_start_up.fpp:1

The s_lf_riemann_solver subroutine is 831 lines long, which significantly exceeds the coding guideline limit of ≤500 lines for subroutines. This subroutine should be refactored into smaller, more manageable helper subroutines.

#:include 'macros.fpp'

src/post_process/m_start_up.fpp:1

The s_lf_riemann_solver subroutine is 831 lines long, which significantly exceeds the coding guideline limit of ≤500 lines for subroutines. This subroutine should be refactored into smaller, more manageable helper subroutines.

#:include 'macros.fpp'

src/simulation/m_riemann_solvers.fpp

src/post_process/m_start_up.fpp

src/post_process/m_checker.fpp

Anand added 3 commits September 8, 2025 03:01

Compiles, need to test

a7c8e14

FFTW incompatible with file_per_process

e5194de

Successfuly ran with 4 procs

0da3b44

anandrdbz requested review from a team as code owners September 8, 2025 07:46

anandrdbz changed the title ~~MPI FFTW~~ MPI FFTW (Draft) Sep 8, 2025

qodo-merge-pro bot changed the title ~~MPI FFTW (Draft)~~ MPI FFTW Sep 8, 2025

qodo-merge-pro bot added the Review effort 4/5 label Sep 8, 2025

qodo-merge-pro bot reviewed Sep 8, 2025

View reviewed changes

src/post_process/m_start_up.fpp Outdated Show resolved Hide resolved

src/post_process/m_start_up.fpp Outdated Show resolved Hide resolved

src/post_process/m_start_up.fpp Outdated Show resolved Hide resolved

Add appropriate size checkers for fft and format

a655546

Anand and others added 6 commits September 12, 2025 16:00

1D Spectrum corrections

0384a60

Create Directory for En FFT

6bd60df

Bug

b673538

bug fix

eacd558

Relax Constraint (only needed for global)

c3425ee

Merge branch 'master' into MPI_FFTW

c322266

sbryngelson requested a review from Copilot September 16, 2025 11:29

sbryngelson approved these changes Sep 16, 2025

View reviewed changes

qodo-merge-pro bot reviewed Sep 16, 2025

View reviewed changes

src/post_process/m_start_up.fpp Show resolved Hide resolved

src/post_process/m_start_up.fpp Outdated Show resolved Hide resolved

src/post_process/m_start_up.fpp Show resolved Hide resolved

sbryngelson reviewed Sep 16, 2025

View reviewed changes

Copilot AI reviewed Sep 16, 2025

View reviewed changes

sbryngelson reviewed Sep 16, 2025

View reviewed changes

src/post_process/m_checker.fpp Outdated Show resolved Hide resolved

sbryngelson reviewed Sep 16, 2025

View reviewed changes

src/post_process/m_checker.fpp Outdated Show resolved Hide resolved

Anand Radhakrishnan and others added 4 commits September 18, 2025 02:36

Keep Pencil Decomposition throughout (better results)

c47ed0c

Merge branch 'MPI_FFTW' of https://github.com/anandrdbz/MFC into MPI_…

833665a

…FFTW

Add Mach number to initial condition for case 380 (TGV)

1bb0097

Add Lax Friedrichs Riemann Solver (set riemann_solver = 5)

94d7c9e

Add Checker for cyl_coord and format

0dba2f2

sbryngelson requested a review from Copilot September 20, 2025 02:12

Copilot AI reviewed Sep 20, 2025

View reviewed changes

src/simulation/m_riemann_solvers.fpp Show resolved Hide resolved

src/simulation/m_riemann_solvers.fpp Show resolved Hide resolved

src/post_process/m_start_up.fpp Show resolved Hide resolved

src/post_process/m_start_up.fpp Show resolved Hide resolved

qodo-merge-pro bot reviewed Sep 20, 2025

View reviewed changes

src/post_process/m_start_up.fpp Show resolved Hide resolved

src/post_process/m_start_up.fpp Show resolved Hide resolved

src/post_process/m_checker.fpp Outdated Show resolved Hide resolved

Anand and others added 3 commits September 21, 2025 23:22

Add tests for LF Riemann Solver + AI suggestions

91fbe3c

lint correction

6066867

Integer Overflow check

179e3b7

MPI FFTW #997

Are you sure you want to change the base?

MPI FFTW #997

Uh oh!

Conversation

anandrdbz commented Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Description

Type of change

Scope

How Has This Been Tested?

Checklist

If your code changes any code source files (anything in src/simulation)

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

qodo-merge-pro bot commented Sep 8, 2025

PR Reviewer Guide 🔍

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sbryngelson commented Sep 16, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sbryngelson Sep 16, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

anandrdbz commented Sep 18, 2025

Uh oh!

sbryngelson commented Sep 20, 2025

Uh oh!

sbryngelson commented Sep 20, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

anandrdbz commented Sep 8, 2025 •

edited

Loading

If your code changes any code source files (anything in `src/simulation`)

codecov bot commented Sep 8, 2025 •

edited

Loading