Twisted clover deflated multigrid
For this case study we are using a 64x64x64x128 configuration as provided by the TMC collaboration.
Parameter | Value |
---|---|
Volume | 64x64x64x128 |
Gauge action | Iwasaki |
beta | 1.778 |
Fermion action | Twisted Clover |
kappa | 0.139427 |
mu | 0.000720 |
CSW | 1.69 |
We will start with a plain Twisted Clover Multigrid solve and use it as a yardstick against which all subsequent improvements will be compared. Those improvements are:
- Using communication-avoiding versions of the CG solvers used on the coarse levels
- Deflating the coarsest level using a Hermitian eigensolver
Each of these steps requires some judicious tuning, which we will demonstrate explicitly with example commands. Please note that throughout this exercise we will be working with the QUDA develop branch and the following performance-critical CMake options:
- `QUDA_MULTIBLAS_N=4`
- `QUDA_DYNAMIC_CLOVER=ON`
The entire exercise will be performed on 32 nodes of Summit. We will utilise only 4 of the 6 available GPUs per node, for a total of 128 NVIDIA V100 GPUs.
To begin, we will use the QUDA Twisted Clover Multigrid inverter with previously tuned parameters. This represents the 'best case' scenario, in which the user has scaled the value of mu on the coarse levels as well as they can. Standard (non-communication-avoiding) CG will be used in the setup phase, with standard GCR on the coarse levels. The following incantation was used:
```
command="jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test \
--prec double --prec-sloppy single --prec-null half --prec-precondition half \
--dim 64 32 16 16 --gridsize 1 2 4 4 --load-gauge /gpfs/alpine/scratch/howarth/lgt104/conf_latest.0500 \
--kappa 0.1394265 --mu 0.00072 --clover-coeff 0.235630785 \
--recon 12 --recon-sloppy 8 --recon-precondition 8 \
--dslash-type twisted-clover --compute-clover true --tol 1e-10 \
--mg-levels 3 --mg-block-size 1 4 4 4 8 --mg-block-size 0 2 2 2 2 \
--mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg \
--mg-mu-factor 2 70.0 \
--nsrc 10 \
--niter 250 \
--verbosity verbose \
--mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 0 --mg-nu-post 1 4 \
--mg-coarse-solver 1 gcr --mg-coarse-solver-tol 1 0.35 --mg-coarse-solver-maxiter 1 10 \
--mg-coarse-solver 2 gcr --mg-coarse-solver-tol 2 0.01 --mg-coarse-solver-maxiter 2 20 \
--mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose "
```
which we shall now break down in full.
```
jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test \
```
These are the Summit-specific launch options and the executable.
```
--prec double --prec-sloppy single --prec-null half --prec-precondition half \
```
These are the precisions used by the various stages of the solver. The final result (`prec`) is computed in double precision. The `sloppy` precision is used to perform global sums or computations in which only a good estimate is sufficient. The `null` and `precondition` precisions are used by the multigrid solver. This already represents a significant part of QUDA's total acceleration strategy, as is documented here.
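The mixed-precision idea can be sketched outside of QUDA. The following is a minimal numpy illustration (not QUDA code) of double-precision iterative refinement around a "sloppy" low-precision inner solve; the random matrix and the direct inner solver are stand-ins for the actual Dirac operator and Krylov solvers:

```python
import numpy as np

def sloppy_solve(A32, r32):
    """Stand-in for a low-precision inner solver (here: direct solve in float32)."""
    return np.linalg.solve(A32, r32)

def mixed_precision_refine(A, b, tol=1e-12, maxiter=50):
    """Iterative refinement: sloppy corrections, double-precision accumulation."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)                              # accumulator kept in double
    for _ in range(maxiter):
        r = b - A @ x                                 # residual in double precision
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        e = sloppy_solve(A32, r.astype(np.float32))   # correction in float32
        x += e.astype(np.float64)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100)) + 100 * np.eye(100)  # well-conditioned test matrix
b = rng.standard_normal(100)
x = mixed_precision_refine(A, b)
```

Although each correction is computed in single precision, the residual and accumulated solution live in double precision, so the final answer reaches double-precision accuracy.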
```
--dim 64 32 16 16 --gridsize 1 2 4 4 --load-gauge /gpfs/alpine/scratch/howarth/lgt104/conf_latest.0500 \
--kappa 0.1394265 --mu 0.00072 --clover-coeff 0.235630785 \
--dslash-type twisted-clover --compute-clover true --tol 1e-10 \
```
These are simply the problem parameters and the GPU partitioning. Notice that the `--clover-coeff` parameter is the product of the CSW and kappa parameters.
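As a quick arithmetic check (a trivial illustration, not QUDA code), that relation can be verified with the values used in the command above:

```python
# Verify that --clover-coeff equals c_sw * kappa for the values above.
csw = 1.69
kappa = 0.1394265
clover_coeff = csw * kappa
assert abs(clover_coeff - 0.235630785) < 1e-12
```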
```
--recon 12 --recon-sloppy 8 --recon-precondition 8 \
```
The `recon` parameters govern how the SU(3) gauge field elements are treated. A value of 18 instructs QUDA to load all 18 real numbers of each SU(3) matrix into registers; 12 and 8 use different reconstruction strategies to communicate fewer real numbers and rebuild the full 18 reals of SU(3) on the fly.
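The 12-parameter case can be illustrated with a small numpy sketch (an illustration of the general technique, not QUDA's kernels): for a matrix U in SU(3), the third row is the complex conjugate of the cross product of the first two, so only the first two rows (12 reals) need to be stored or communicated.

```python
import numpy as np

def random_su3(rng):
    """Random SU(3) matrix via QR of a random complex matrix."""
    z = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    q = q @ np.diag(d / np.abs(d))          # fix phases -> unitary with positive R diagonal
    return q / np.linalg.det(q) ** (1.0 / 3.0)  # scale so det = 1 -> SU(3)

def reconstruct_12(rows12):
    """Rebuild a full SU(3) link from its first two rows (--recon 12 idea)."""
    a, b = rows12
    c = np.conj(np.cross(a, b))             # third row from unitarity + det = 1
    return np.vstack([a, b, c])

rng = np.random.default_rng(2)
U = random_su3(rng)
U12 = U[:2]                                 # the 12 reals actually kept
U_full = reconstruct_12(U12)
assert np.allclose(U_full, U)               # reconstruction is exact
```

The 8-parameter strategy compresses further still, at the cost of more arithmetic in the reconstruction; on bandwidth-bound kernels this trade is usually a win.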
```
--mg-levels 3 --mg-block-size 1 4 4 4 8 --mg-block-size 0 2 2 2 2 \
--mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg \
--mg-mu-factor 2 70.0 \
--mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 0 --mg-nu-post 1 4 \
--mg-coarse-solver 1 gcr --mg-coarse-solver-tol 1 0.35 --mg-coarse-solver-maxiter 1 10 \
--mg-coarse-solver 2 gcr --mg-coarse-solver-tol 2 0.01 --mg-coarse-solver-maxiter 2 20 \
--mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose
```
These are the multigrid-specific parameters. Notice that we use standard CG in the setup phase and increase the value of mu on the coarsest level by a factor of 70. On the coarse levels we use standard GCR and, as can be seen, the coarsest-level solve must be performed to a relatively tight tolerance (0.01).
```
--nsrc 10 \
--niter 250 \
--verbosity verbose \
```
Here we simply instruct QUDA to perform 10 inversions so that we can obtain a good average, to error out after 250 outer iterations of the solver, and to give verbose output. Data from this command is shown in comparison with the improvements gained from using CA solver variants in the next section. For the time being, we show the spectrum of the coarse grid operator in its natural state and after mu scaling. We see that the mu scaling does indeed add a constant shift to the vast majority of the spectrum, but at the very low end there are some irregularities.
We shall now utilise the communication-avoiding variants of CG and GCR throughout the multigrid solve. The replacement is straightforward: simply switch `cg` and `gcr` for `ca-cg` and `ca-gcr`. This does come with some extra parameters that we must be aware of. One important parameter is the number of 'inner iterations' to perform in the communication-avoiding routines before initiating a communication to perform a correction.
[Explain CA method, Add Commands]
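As a hedged illustration of the s-step idea (an illustration only, not QUDA's actual CA-CG implementation), the following numpy sketch shows how one cycle trades the per-iteration global reductions of standard CG for s matrix-vector products followed by a single block reduction:

```python
import numpy as np

def ca_cycle(A, x, b, s=4):
    """One communication-avoiding cycle over span{r, Ar, ..., A^(s-1) r}."""
    r = b - A @ x
    V = np.empty((len(b), s))
    V[:, 0] = r / np.linalg.norm(r)
    for k in range(1, s):                     # s SpMVs build the Krylov basis
        w = A @ V[:, k - 1]
        V[:, k] = w / np.linalg.norm(w)       # column scaling for conditioning only;
                                              # real CA-CG uses a scaled/Chebyshev basis
                                              # to avoid even these extra reductions
    G = V.T @ (A @ V)                         # ONE block reduction replaces s dot products
    c, *_ = np.linalg.lstsq(G, V.T @ r, rcond=None)
    return x + V @ c                          # Galerkin correction, computed locally

rng = np.random.default_rng(4)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)                 # SPD test matrix
b = rng.standard_normal(50)
x = np.zeros(50)
for _ in range(10):                           # each cycle ~ s steps of restarted CG
    x = ca_cycle(A, x, b)
```

In exact arithmetic each cycle is equivalent to s steps of CG, but the latency-bound reductions are batched, which is exactly what helps at the coarse levels where local volumes are tiny and communication dominates.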
The reason for using a large mu value on the coarsest level is to push the real, low eigenvalues away from zero. This improves the condition number of the coarse grid operator to some degree, but at the same time the artificially transformed operator loses coherence with the finer grid. A simple solution to this problem is to explicitly deflate the low eigenmodes on the coarse grid. This is a surprisingly inexpensive strategy: the coarse grid solves are always done in low precision, where the achievable FLOPS are high, and the tolerance on the eigenmodes need not be too tight.
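A toy numpy sketch (not QUDA's coarse-grid eigensolver) of why deflation pays off: projecting out a handful of low eigenmodes of a Hermitian matrix and solving them exactly leaves the Krylov solver with a far better-conditioned problem.

```python
import numpy as np

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((200, 200)))
evals = np.concatenate([np.linspace(1e-4, 1e-3, 8),    # a few troublesome low modes
                        np.linspace(1.0, 10.0, 192)])  # the bulk of the spectrum
A = Q @ np.diag(evals) @ Q.T                           # Hermitian test matrix
b = rng.standard_normal(200)

def cg(A, b, tol=1e-8, maxiter=2000):
    """Plain CG; returns the solution and the iteration count."""
    x = np.zeros_like(b); r = b.copy(); p = r.copy()
    rr = r @ r; it = 0
    while np.sqrt(rr) > tol * np.linalg.norm(b) and it < maxiter:
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p; r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new; it += 1
    return x, it

# Plain CG must fight the tiny eigenvalues ...
_, iters_plain = cg(A, b)

# ... while deflated CG handles them exactly and iterates only on the rest.
w, V = np.linalg.eigh(A)
Vd = V[:, :8]                                 # the 8 lowest eigenmodes
x_defl = Vd @ ((Vd.T @ b) / w[:8])            # exact solve in the deflated subspace
b_perp = b - Vd @ (Vd.T @ b)                  # right-hand side with low modes removed
x_rest, iters_defl = cg(A, b_perp)
x = x_defl + x_rest
```

The deflated solve converges in far fewer iterations, mirroring the effect the coarse-grid deflation has on the coarsest-level solver.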
There are two parts of this strategy to be aware of. The first is the setup time of the solver, which is largely dependent on the number of eigenmodes one requests. The second is the amount of time it takes to deflate the coarse solve, which is wholly dependent on the requested number of eigenmodes. With this in mind, we shall deflate the coarse grid with several values of requested eigenmodes, demonstrating the diminishing returns. Throughout, we shall employ the CA-CG and CA-GCR solvers.
[Add Commands]