Twisted clover deflated multigrid
For this case study we are using a 64x64x64x128 configuration as provided by the TMC collaboration.
Parameter | Value |
---|---|
Volume | 64x64x64x128 |
Gauge action | Iwasaki |
beta | 1.778 |
Fermion action | Twisted Clover |
kappa | 0.139427 |
mu | 0.000720 |
CSW | 1.69 |
We will start with a plain Twisted Clover Multigrid solve and use it as a yardstick against which all subsequent improvements will be compared. Those improvements are:
- Using communication-avoiding versions of the CG solvers used on the coarse levels
- Deflating the coarsest level using a Hermitian eigensolver
Each of these steps requires some judicious tuning, which we will demonstrate explicitly with example commands. Please note that throughout this exercise we will be working with the QUDA develop branch and the following performance-critical CMake options:
- `QUDA_MULTIBLAS_N=4`
- `QUDA_DYNAMIC_CLOVER=ON`
The entire exercise will be performed on 32 nodes of Summit. We will utilise only 4 of the 6 available GPUs per node, for a total of 128 NVIDIA V100 GPUs.
To begin, we will use the QUDA Twisted Clover Multigrid inverter with previously tuned parameters. This represents the 'best case' scenario, in which the user has scaled the value of mu on the coarse levels as well as they can. Standard (non-communication-avoiding) CG will be used in the setup phase, with standard GCR on the coarse levels. The following incantation was used:
```
command="jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test \
--prec double --prec-sloppy single --prec-null half --prec-precondition half \
--dim 64 32 16 16 --gridsize 1 2 4 4 --load-gauge /gpfs/alpine/scratch/howarth/lgt104/conf_latest.0500 \
--kappa 0.1394265 --mu 0.00072 --clover-coeff 0.235630785 \
--recon 12 --recon-sloppy 8 --recon-precondition 8 \
--dslash-type twisted-clover --compute-clover true --tol 1e-10 \
--mg-levels 3 --mg-block-size 1 4 4 4 8 --mg-block-size 0 2 2 2 2 \
--mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg \
--mg-mu-factor 2 70.0 \
--nsrc 10 \
--niter 250 \
--verbosity verbose \
--mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 0 --mg-nu-post 1 4 \
--mg-coarse-solver 1 gcr --mg-coarse-solver-tol 1 0.35 --mg-coarse-solver-maxiter 1 10 \
--mg-coarse-solver 2 gcr --mg-coarse-solver-tol 2 0.01 --mg-coarse-solver-maxiter 2 20 \
--mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose "
```
which we shall now break down in full.
```
jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test \
```
These are the Summit-specific launch options and the executable.
```
--prec double --prec-sloppy single --prec-null half --prec-precondition half \
```
These are the precisions used by the various stages of the solver. The final result (`prec`) is computed in double precision. The `sloppy` precision is used to perform global sums or computations in which only a good estimate is sufficient. The `null` and `precondition` precisions are used by the multigrid solver. This already represents a significant part of QUDA's total acceleration strategy, as is documented here.
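The mixed-precision idea can be sketched outside of QUDA. The following is a minimal numpy illustration (not QUDA code) of double-precision iterative refinement around a "sloppy" low-precision inner solve; the random matrix and the direct inner solver are stand-ins for the actual Dirac operator and Krylov solvers:

```python
import numpy as np

def sloppy_solve(A32, r32):
    """Stand-in for a low-precision inner solver (here: direct solve in float32)."""
    return np.linalg.solve(A32, r32)

def mixed_precision_refine(A, b, tol=1e-12, maxiter=50):
    """Iterative refinement: sloppy corrections, double-precision accumulation."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)                              # accumulator kept in double
    for _ in range(maxiter):
        r = b - A @ x                                 # residual in double precision
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        e = sloppy_solve(A32, r.astype(np.float32))   # correction in float32
        x += e.astype(np.float64)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100)) + 100 * np.eye(100)  # well-conditioned test matrix
b = rng.standard_normal(100)
x = mixed_precision_refine(A, b)
```

Although each correction is computed in single precision, the residual and accumulated solution live in double precision, so the final answer reaches double-precision accuracy.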
```
--dim 64 32 16 16 --gridsize 1 2 4 4 --load-gauge /gpfs/alpine/scratch/howarth/lgt104/conf_latest.0500 \
--kappa 0.1394265 --mu 0.00072 --clover-coeff 0.235630785 \
--dslash-type twisted-clover --compute-clover true --tol 1e-10 \
```
These are simply the problem parameters and the GPU partitioning. Notice that the `--clover-coeff` parameter is the product of the CSW and kappa parameters.
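As a quick arithmetic check (a trivial illustration, not QUDA code), that relation can be verified with the values used in the command above:

```python
# Verify that --clover-coeff equals c_sw * kappa for the values above.
csw = 1.69
kappa = 0.1394265
clover_coeff = csw * kappa
assert abs(clover_coeff - 0.235630785) < 1e-12
```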
```
--recon 12 --recon-sloppy 8 --recon-precondition 8 \
```
The `recon` parameters govern how the SU(3) gauge field elements are treated. A value of 18 instructs QUDA to load all 18 real numbers of each SU(3) matrix into registers; 12 and 8 use different reconstruction strategies to communicate fewer real numbers and rebuild the full 18 reals of SU(3) on the fly.
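The 12-parameter case can be illustrated with a small numpy sketch (an illustration of the general technique, not QUDA's kernels): for a matrix U in SU(3), the third row is the complex conjugate of the cross product of the first two, so only the first two rows (12 reals) need to be stored or communicated.

```python
import numpy as np

def random_su3(rng):
    """Random SU(3) matrix via QR of a random complex matrix."""
    z = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    q = q @ np.diag(d / np.abs(d))          # fix phases -> unitary with positive R diagonal
    return q / np.linalg.det(q) ** (1.0 / 3.0)  # scale so det = 1 -> SU(3)

def reconstruct_12(rows12):
    """Rebuild a full SU(3) link from its first two rows (--recon 12 idea)."""
    a, b = rows12
    c = np.conj(np.cross(a, b))             # third row from unitarity + det = 1
    return np.vstack([a, b, c])

rng = np.random.default_rng(2)
U = random_su3(rng)
U12 = U[:2]                                 # the 12 reals actually kept
U_full = reconstruct_12(U12)
assert np.allclose(U_full, U)               # reconstruction is exact
```

The 8-parameter strategy compresses further still, at the cost of more arithmetic in the reconstruction; on bandwidth-bound kernels this trade is usually a win.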
```
--mg-levels 3 --mg-block-size 1 4 4 4 8 --mg-block-size 0 2 2 2 2 \
--mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg \
--mg-mu-factor 2 70.0 \
--mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 0 --mg-nu-post 1 4 \
--mg-coarse-solver 1 gcr --mg-coarse-solver-tol 1 0.35 --mg-coarse-solver-maxiter 1 10 \
--mg-coarse-solver 2 gcr --mg-coarse-solver-tol 2 0.01 --mg-coarse-solver-maxiter 2 20 \
--mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose
```
These are the multigrid-specific parameters. Notice that we use standard CG in the setup phase and increase the value of mu on the coarsest level by a factor of 70. On the coarse levels we use standard GCR and, as can be seen, the coarsest-level solve must be performed to a relatively tight tolerance (0.01).
```
--nsrc 10 \
--niter 250 \
--verbosity verbose \
```
Here we simply instruct QUDA to perform 10 inversions so that we can obtain a good average, to error out after 250 outer iterations of the solver, and to give verbose output. Data from this command is shown in comparison with the improvements gained from using CA solver variants in the next section. For the time being, we show the spectrum of the coarse grid operator in its natural state and after mu scaling. We see that the mu scaling does indeed add a constant shift to the vast majority of the spectrum, but at the very low end there are some irregularities.
We shall now utilise the communication-avoiding variants of CG and GCR throughout the multigrid solve. The replacement is straightforward: simply switch `cg` and `gcr` for `ca-cg` and `ca-gcr`. This does come with some extra parameters that we must be aware of. One important parameter is the number of 'inner iterations' to perform in the communication-avoiding routines before initiating a communication to perform a correction.
[Explain CA method, Add Commands]
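As a hedged illustration of the s-step idea (an illustration only, not QUDA's actual CA-CG implementation), the following numpy sketch shows how one cycle trades the per-iteration global reductions of standard CG for s matrix-vector products followed by a single block reduction:

```python
import numpy as np

def ca_cycle(A, x, b, s=4):
    """One communication-avoiding cycle over span{r, Ar, ..., A^(s-1) r}."""
    r = b - A @ x
    V = np.empty((len(b), s))
    V[:, 0] = r / np.linalg.norm(r)
    for k in range(1, s):                     # s SpMVs build the Krylov basis
        w = A @ V[:, k - 1]
        V[:, k] = w / np.linalg.norm(w)       # column scaling for conditioning only;
                                              # real CA-CG uses a scaled/Chebyshev basis
                                              # to avoid even these extra reductions
    G = V.T @ (A @ V)                         # ONE block reduction replaces s dot products
    c, *_ = np.linalg.lstsq(G, V.T @ r, rcond=None)
    return x + V @ c                          # Galerkin correction, computed locally

rng = np.random.default_rng(4)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)                 # SPD test matrix
b = rng.standard_normal(50)
x = np.zeros(50)
for _ in range(10):                           # each cycle ~ s steps of restarted CG
    x = ca_cycle(A, x, b)
```

In exact arithmetic each cycle is equivalent to s steps of CG, but the latency-bound reductions are batched, which is exactly what helps at the coarse levels where local volumes are tiny and communication dominates.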
The reason for using a large mu value on the coarsest level is to push the real, low eigenvalues away from zero. This improves the condition number of the coarse grid operator to some degree, but at the same time the artificially transformed operator loses coherence with the finer grid. A simple solution to this problem is to explicitly deflate the low eigenmodes on the coarse grid. This is a surprisingly inexpensive strategy: the coarse grid solves are always done in low precision, where the achievable FLOPS are high, and the tolerance on the eigenmodes need not be too tight.
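A toy numpy sketch (not QUDA's coarse-grid eigensolver) of why deflation pays off: projecting out a handful of low eigenmodes of a Hermitian matrix and solving them exactly leaves the Krylov solver with a far better-conditioned problem.

```python
import numpy as np

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((200, 200)))
evals = np.concatenate([np.linspace(1e-4, 1e-3, 8),    # a few troublesome low modes
                        np.linspace(1.0, 10.0, 192)])  # the bulk of the spectrum
A = Q @ np.diag(evals) @ Q.T                           # Hermitian test matrix
b = rng.standard_normal(200)

def cg(A, b, tol=1e-8, maxiter=2000):
    """Plain CG; returns the solution and the iteration count."""
    x = np.zeros_like(b); r = b.copy(); p = r.copy()
    rr = r @ r; it = 0
    while np.sqrt(rr) > tol * np.linalg.norm(b) and it < maxiter:
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p; r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new; it += 1
    return x, it

# Plain CG must fight the tiny eigenvalues ...
_, iters_plain = cg(A, b)

# ... while deflated CG handles them exactly and iterates only on the rest.
w, V = np.linalg.eigh(A)
Vd = V[:, :8]                                 # the 8 lowest eigenmodes
x_defl = Vd @ ((Vd.T @ b) / w[:8])            # exact solve in the deflated subspace
b_perp = b - Vd @ (Vd.T @ b)                  # right-hand side with low modes removed
x_rest, iters_defl = cg(A, b_perp)
x = x_defl + x_rest
```

The deflated solve converges in far fewer iterations, mirroring the effect the coarse-grid deflation has on the coarsest-level solver.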
There are two parts of this strategy to be aware of. The first is the setup time of the solver, which is largely dependent on the number of eigenmodes one requests. The second is the amount of time it takes to deflate the coarse solve, which is wholly dependent on the requested number of eigenmodes. With this in mind, we shall deflate the coarse grid with several values of requested eigenmodes, demonstrating the diminishing returns. Throughout, we shall employ the CA-CG and CA-GCR solvers.
[Add Commands]