Skip to content

Conversation

anandrdbz
Copy link
Contributor

@anandrdbz anandrdbz commented Sep 8, 2025

User description

Description

Fast Fourier transform for energy cascade implemented on multiple ranks. Improves upon Conrad's implementation by using Pencil decomposition (2D) instead of Slabs (1D) in post_process. This required the use of cartesian sub-communicators.

Tested for correctness on Taylor-Green Vortex problem.

Also adds a Lax Friedrichs Riemann solver as an option (riemann_solver = 5)

Fixes #(issue) [optional]

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • [x ] New feature (non-breaking change which adds functionality)
  • Something else

Scope

  • [ x] This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration

  • Test A
  • Test B

Test Configuration:

  • What computers and compilers did you use to test this:

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics"
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any code source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensure the new features performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature via nvtx ranges so that they can be identified in profiles
  • Ran a Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
  • Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

PR Type

Enhancement


Description

  • Implement MPI-based 3D FFT for energy cascade analysis

  • Add pencil decomposition using cartesian sub-communicators

  • Integrate FFTW library with MPI transpose operations

  • Add input validation for FFT requirements


Diagram Walkthrough

flowchart LR
  A["3D Velocity Field"] --> B["X-direction FFT"]
  B --> C["MPI Transpose X→Y"]
  C --> D["Y-direction FFT"]
  D --> E["MPI Transpose Y→Z"]
  E --> F["Z-direction FFT"]
  F --> G["Energy Spectrum Calculation"]
  G --> H["Energy Cascade Output"]
Loading

File Walkthrough

Relevant files
Enhancement
m_mpi_common.fpp
Add FFT-specific processor topology optimization                 

src/common/m_mpi_common.fpp

  • Add conditional processor topology optimization for FFT operations
  • Implement 2D pencil decomposition for post-processing with FFT
  • Maintain existing 3D decomposition logic for non-FFT cases
+108/-1 
m_checker.fpp
Add FFT input validation constraints                                         

src/post_process/m_checker.fpp

  • Add s_check_inputs_fft subroutine for FFT parameter validation
  • Prohibit file_per_process when FFT is enabled
  • Require 3D domain and even local dimensions for FFT
+8/-0     
m_global_parameters.fpp
Add FFT write parameter                                                                   

src/post_process/m_global_parameters.fpp

  • Add fft_wrt logical parameter for FFT output control
  • Initialize fft_wrt to false in default settings
+2/-0     
m_mpi_proxy.fpp
Include FFT parameter in MPI broadcasts                                   

src/post_process/m_mpi_proxy.fpp

  • Add fft_wrt to MPI broadcast variable list
+1/-1     
m_start_up.fpp
Implement complete MPI-based 3D FFT system                             

src/post_process/m_start_up.fpp

  • Integrate FFTW3 library with C bindings
  • Implement 3D FFT with MPI pencil decomposition
  • Add energy spectrum calculation and cascade analysis
  • Create MPI transpose routines for data redistribution
  • Initialize FFTW plans and cartesian communicators
+339/-3 
Configuration changes
case_dicts.py
Add FFT parameter to toolchain configuration                         

toolchain/mfc/run/case_dicts.py

  • Add fft_wrt parameter to post-processing parameter dictionary
+1/-0     

@anandrdbz anandrdbz requested review from a team as code owners September 8, 2025 07:46
@anandrdbz anandrdbz changed the title MPI FFTW MPI FFTW (Draft) Sep 8, 2025
@qodo-merge-pro qodo-merge-pro bot changed the title MPI FFTW (Draft) MPI FFTW Sep 8, 2025
Copy link

qodo-merge-pro bot commented Sep 8, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

FFT pencil sizes and MPI cartesian splits appear inconsistent: local sizes are derived using different processor axes than the allocation and transpose logic assumes, which can cause out-of-bounds or misaddressed data during Alltoall and k-index remapping. Verify Nxloc/Nyloc/Nyloc2 definitions, proc_coords usage, and that communicator dimensions match the intended pencils.

if(fft_wrt) then 

    num_procs_x = (m_glb + 1) / (m + 1)
    num_procs_y = (n_glb + 1) / (n + 1)
    num_procs_z = (p_glb + 1) / (p + 1)

    Nx = m_glb + 1
    Ny = n_glb + 1
    Nz = p_glb + 1

    Nxloc = (m_glb + 1) / num_procs_y
    Nyloc = n + 1
    Nyloc2 = (n_glb + 1) / num_procs_z
    Nzloc = p + 1 

    Nf = max(Nx, Ny, Nz)

    @:ALLOCATE(data_in(Nx*Nyloc*Nzloc))
    @:ALLOCATE(data_out(Nx*Nyloc*Nzloc))

    @:ALLOCATE(data_cmplx(Nx, Nyloc, Nzloc))
    @:ALLOCATE(data_cmplx_y(Nxloc, Ny, Nzloc))
    @:ALLOCATE(data_cmplx_z(Nxloc, Nyloc2, Nz))

    @:ALLOCATE(En_real(Nxloc, Nyloc2, Nz))
    @:ALLOCATE(En(Nf)) 

    size_n(1) = Nx
    inembed(1) = Nx 
    onembed(1) = Nx 

    fwd_plan_x =  fftw_plan_many_dft(1, size_n, Nyloc*Nzloc, & 
                                   data_in, inembed, 1, Nx, & 
                                   data_out, onembed, 1, Nx, & 
                                   FFTW_FORWARD, FFTW_MEASURE)

    size_n(1) = Ny 
    inembed(1) = Ny 
    onembed(1) = Ny 

    fwd_plan_y = fftw_plan_many_dft(1, size_n, Nxloc*Nzloc, & 
                                   data_out, inembed, 1, Ny, & 
                                   data_in, onembed, 1, Ny, & 
                                   FFTW_FORWARD, FFTW_MEASURE)

    size_n(1) = Nz 
    inembed(1) = Nz 
    onembed(1) = Nz   

    fwd_plan_z =  fftw_plan_many_dft(1, size_n, Nxloc*Nyloc2, & 
                                   data_in, inembed, 1, Nz, & 
                                   data_out, onembed, 1, Nz, & 
                                   FFTW_FORWARD, FFTW_MEASURE)   


    call MPI_CART_CREATE(MPI_COMM_WORLD, 3, (/num_procs_x, &
                                                  num_procs_y, num_procs_z/), &
                             (/.true., .true., .true./), &
                             .false., MPI_COMM_CART, ierr)   
    call MPI_CART_COORDS(MPI_COMM_CART, proc_rank, 3, &
                             cart3d_coords, ierr)  

    call MPI_Cart_SUB(MPI_COMM_CART, (/.true., .true., .false./), MPI_COMM_CART12, ierr)
    call MPI_COMM_RANK(MPI_COMM_CART12, proc_rank12, ierr)
    call MPI_CART_COORDS(MPI_COMM_CART12, proc_rank12, 2, cart2d12_coords, ierr)

    call MPI_Cart_SUB(MPI_COMM_CART, (/.true., .false., .true./), MPI_COMM_CART13, ierr)
    call MPI_COMM_RANK(MPI_COMM_CART13, proc_rank13, ierr)
    call MPI_CART_COORDS(MPI_COMM_CART13, proc_rank13, 2, cart2d13_coords, ierr)

end if
Buffer Mismatch

MPI_Alltoall uses MPI_DOUBLE_COMPLEX while the Fortran type is complex(c_double_complex); ensure MPI datatype compatibility or provide a derived MPI type. Also check send/recv counts and contiguous packing sizes match allocations to avoid buffer overruns.

call MPI_Alltoall(sendbuf, Nxloc*Nyloc*Nzloc, MPI_DOUBLE_COMPLEX, & 
                      recvbuf, Nxloc*Nyloc*Nzloc, MPI_DOUBLE_COMPLEX, MPI_COMM_CART12, ierr)

do src_rank = 0, num_procs_y - 1
    do l = 1, Nzloc 
        do k = 1, Nyloc 
            do j = 1, Nxloc
                data_cmplx_y(j, k + src_rank*Nyloc, l) = recvbuf(j + (k-1)*Nxloc + (l-1)*Nxloc*Nyloc + src_rank*Nxloc*Nyloc*Nzloc) 
            end do 
        end do 
    end do
end do

deallocate(sendbuf)
deallocate(recvbuf)

end subroutine s_mpi_transpose_x2y

subroutine s_mpi_transpose_y2z
complex(c_double_complex), allocatable :: sendbuf(:), recvbuf(:)
integer :: dest_rank, src_rank
integer :: j, k, l

allocate(sendbuf(Ny*Nxloc*Nzloc))
allocate(recvbuf(Ny*Nxloc*Nzloc))

do dest_rank = 0, num_procs_z - 1
    do l = 1, Nzloc 
        do j = 1, Nxloc
            do k = 1, Nyloc2 
                sendbuf(k + (j-1)*Nyloc2 + (l-1)*(Nyloc2*Nxloc) + dest_rank*Nyloc2*Nxloc*Nzloc) = data_cmplx_y(j, k + dest_rank*Nyloc2, l)
            end do 
        end do 
    end do
end do

call MPI_Alltoall(sendbuf, Nyloc2*Nxloc*Nzloc, MPI_DOUBLE_COMPLEX, & 
                      recvbuf, Nyloc2*Nxloc*Nzloc, MPI_DOUBLE_COMPLEX, MPI_COMM_CART13, ierr)
Validation Logic

FFT constraints require local dimensions divisible by 2, but decomposition may yield odd local sizes depending on global sizes and proc topology. Add checks for divisibility by num_procs and ensure Ny/Nz local sizes used by transposes match the enforced constraints.

 !> Checks constraints on fft_wrt
impure subroutine s_check_inputs_fft
    @:PROHIBIT(fft_wrt .and. file_per_process, "Turn off file_per_process with fft_wrt")
    @:PROHIBIT(fft_wrt .and. (n == 0 .or. p == 0), "FFT WRT only in 3D")
    @:PROHIBIT(fft_wrt .and. (MOD(m+1,2) == 1 .or. MOD(n+1,2) == 1 .or. MOD(p+1,2) == 1), "FFT WRT requires local dimensions divisible by 2")
end subroutine s_check_inputs_fft

Copy link

codecov bot commented Sep 8, 2025

Codecov Report

❌ Patch coverage is 21.83544% with 494 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.19%. Comparing base (e966781) to head (179e3b7).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/simulation/m_riemann_solvers.fpp 25.00% 265 Missing and 32 partials ⚠️
src/post_process/m_start_up.fpp 0.55% 174 Missing and 4 partials ⚠️
src/common/m_mpi_common.fpp 62.00% 17 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #997      +/-   ##
==========================================
+ Coverage   40.91%   41.19%   +0.28%     
==========================================
  Files          70       69       -1     
  Lines       20270    20379     +109     
  Branches     2520     2540      +20     
==========================================
+ Hits         8293     8396     +103     
+ Misses      10439    10408      -31     
- Partials     1538     1575      +37     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sbryngelson
Copy link
Member

/improve

@sbryngelson sbryngelson requested a review from Copilot September 16, 2025 11:29

end subroutine s_mpi_FFT_fwd

subroutine s_mpi_transpose_x2y
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all these mpi routines seem like they should go somewhere else... a new module?

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements MPI-based 3D Fast Fourier Transform (FFT) functionality for energy cascade analysis in the post-processing module. The implementation uses pencil decomposition with 2D cartesian sub-communicators instead of the previous 1D slab approach, improving computational efficiency for multi-rank FFT operations.

Key changes:

  • Add MPI-based 3D FFT implementation with FFTW integration for energy spectrum calculations
  • Implement pencil decomposition using cartesian sub-communicators for better parallel efficiency
  • Add comprehensive input validation for FFT parameter constraints

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
toolchain/mfc/run/case_dicts.py Add fft_wrt parameter to post-processing configuration
src/post_process/m_start_up.fpp Implement complete 3D FFT system with FFTW, MPI transpose operations, and energy cascade calculations
src/post_process/m_mpi_proxy.fpp Include fft_wrt parameter in MPI broadcast list
src/post_process/m_global_parameters.fpp Add fft_wrt logical parameter declaration and initialization
src/post_process/m_checker.fpp Add FFT input validation constraints and requirements
src/common/m_mpi_common.fpp Add conditional processor topology optimization for FFT operations using 2D pencil decomposition

@anandrdbz
Copy link
Contributor Author

Further changes:

There's also been more changes today that basically uses pencil decomposition throughout (including pre_process, and simulation) whenever fft_wrt is used.

This is done so that file_per_process can be enabled. The performance hit in going from block to pencil decomposition is far less than the performance hit in I/O when not using file_per_process, especially considering time averaging for spectrum.

I've also added the Lax Friedrichs Riemann solver

@sbryngelson
Copy link
Member

please add tests for the LF riemann solver

@sbryngelson
Copy link
Member

/improve

@sbryngelson sbryngelson requested a review from Copilot September 20, 2025 02:12
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

src/post_process/m_start_up.fpp:1

  • The s_lf_riemann_solver subroutine is 831 lines long, which significantly exceeds the coding guideline limit of ≤500 lines for subroutines. This subroutine should be refactored into smaller, more manageable helper subroutines.
#:include 'macros.fpp'

src/post_process/m_start_up.fpp:1

  • The s_lf_riemann_solver subroutine is 831 lines long, which significantly exceeds the coding guideline limit of ≤500 lines for subroutines. This subroutine should be refactored into smaller, more manageable helper subroutines.
#:include 'macros.fpp'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants