Commit adf68b6

Update benchmarks: add results for Archer2 up to 1024x1024x1024
1 parent 8665a8b commit adf68b6

27 files changed

+233
-1
lines changed

docs/source/pages/api_domain.rst

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,9 @@
33
===========================
44

55
This page explains the key public interfaces of the 2D decomposition library. After reading this section, users should be able to easily build applications using this domain decomposition strategy.
6-
The library interface is designed to be very simple. One can refer to the sample applications for a quick start.
6+
The library interface is designed to be very simple. One can refer to the
7+
`example applications <https://github.com/2decomp-fft/2decomp-fft/tree/main/examples>`_
8+
for a quick start.
79

810
The 2D Pencil Decomposition API is defined in three Fortran modules, which should be used by applications as:
911

docs/source/pages/benchmarks.rst

Lines changed: 68 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,71 @@
1+
.. role:: raw-html(raw)
2+
:format: html
3+
4+
.. |br| raw:: html
5+
6+
<br/>
7+
18
==========
29
Benchmarks
310
==========
11+
12+
This page reports the results of some scalability tests, which are available under the
13+
`examples <https://github.com/2decomp-fft/2decomp-fft/tree/main/examples>`_ directory.
14+
The tests performed are the following:
15+
16+
#. `Transpose of a real <https://github.com/2decomp-fft/2decomp-fft/blob/main/examples/test2d/timing2d_real.f90>`_
17+
3D array.
18+
19+
#. `Transpose of a complex <https://github.com/2decomp-fft/2decomp-fft/blob/main/examples/test2d/timing2d_complex.f90>`_ 3D array.
20+
21+
#. 3D FFT transform
22+
`(fft_r2c_x) <https://github.com/2decomp-fft/2decomp-fft/blob/main/examples/fft_physical_x/fft_r2c_x.f90>`_
23+
of a **real** 3D array starting from the `X` physical direction.
24+
The transform is applied both forward and backward to recover the input array.
25+
26+
#. 3D FFT transform
27+
`(fft_c2c_x) <https://github.com/2decomp-fft/2decomp-fft/blob/main/examples/fft_physical_x/fft_c2c_x.f90>`_
28+
of a **complex** 3D array starting from the `X` physical direction.
29+
The transform is applied both forward and backward to recover the input array.
30+
31+
#. 3D FFT transform
32+
`(fft_r2c_z) <https://github.com/2decomp-fft/2decomp-fft/blob/main/examples/fft_physical_z/fft_r2c_z.f90>`_
33+
of a **real** 3D array starting from the `Z` physical direction.
34+
The transform is applied both forward and backward to recover the input array.
35+
36+
#. 3D FFT transform
37+
`(fft_c2c_z) <https://github.com/2decomp-fft/2decomp-fft/blob/main/examples/fft_physical_z/fft_c2c_z.f90>`_
38+
of a **complex** 3D array starting from the `Z` physical direction.
39+
The transform is applied both forward and backward to recover the input array.
40+
41+
All timings are collected by averaging over 50 repetitions of the test, with iteration 0 discarded.
42+
Two resolutions have been tested:
43+
44+
* ``NX=NY=NZ=512`` which corresponds to roughly 130 million points.
45+
46+
* ``NX=NY=NZ=1024`` which corresponds to roughly 1 billion points.
47+
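As a minimal sketch of the timing protocol and grid sizes described above, the following Python snippet averages per-iteration timings while discarding iteration 0, and computes the number of grid points for each resolution. The function names and the sample timings are hypothetical and are not part of 2DECOMP&FFT; the exact loop structure of the benchmark programs is assumed here.

```python
def mean_time(timings):
    """Average per-iteration timings, discarding iteration 0 (assumed warm-up)."""
    kept = timings[1:]
    if not kept:
        raise ValueError("need at least two iterations to discard the first")
    return sum(kept) / len(kept)


def grid_points(n):
    """Number of points in a cubic grid with NX = NY = NZ = n."""
    return n ** 3


# Hypothetical timings in seconds; iteration 0 is slower and is dropped.
sample = [5.0, 1.0, 2.0, 3.0]
print(mean_time(sample))   # 2.0
print(grid_points(512))    # 134217728
print(grid_points(1024))   # 1073741824
```

The exact counts, about 134 million and 1.07 billion points, are consistent with the rounded figures quoted in the list.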
48+
A **2D** label for the results indicates a 2D (i.e. pencil) decomposition using the optimal automatic configuration
49+
(which generally corresponds to the decomposition closest to ``NR=NC``).
50+
A **1D** label for the results indicates a 1D (i.e. slab) decomposition. With 2DECOMP&FFT this is obtained by
51+
forcing one of the two decomposition dimensions to 1. If ``N_ROW=1``, an initial ``Z`` slab configuration
52+
(i.e. local memory data are in the ``XY`` plane) is obtained;
53+
conversely, ``N_COL=1`` starts from an ``X`` slab configuration
54+
(i.e. local memory data are in the ``YZ`` plane).
55+
Generally only one set of slab data is plotted, since the performance is relatively similar.
56+
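To illustrate the difference between the 2D and 1D labels, the sketch below picks a processor grid for a given number of MPI ranks. It is an assumption-laden illustration: ``closest_square_grid`` simply returns the factor pair nearest to square, which is only what the automatic configuration "generally corresponds to" according to the text, and the library's real selection logic may differ. Both helper names are hypothetical.

```python
import math


def closest_square_grid(nproc):
    """Return (n_row, n_col) with n_row * n_col == nproc, as close to square as possible."""
    if nproc < 1:
        raise ValueError("nproc must be positive")
    # Walk down from floor(sqrt(nproc)); the first divisor gives the squarest pair.
    for n_row in range(math.isqrt(nproc), 0, -1):
        if nproc % n_row == 0:
            return n_row, nproc // n_row


def slab_grid(nproc, start):
    """1D (slab) grid: N_ROW=1 starts from Z slabs, N_COL=1 starts from X slabs."""
    if start == "Z":
        return 1, nproc
    if start == "X":
        return nproc, 1
    raise ValueError("start must be 'X' or 'Z'")


print(closest_square_grid(128))   # (8, 16)
print(closest_square_grid(1024))  # (32, 32)
print(slab_grid(128, "Z"))        # (1, 128)
print(slab_grid(128, "X"))        # (128, 1)
```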
57+
58+
The library has been benchmarked on the following systems:
59+
60+
* The UK National Supercomputer service `Archer2 <https://www.archer2.ac.uk>`_
61+
62+
* The GPU partition of the EPCC `Cirrus <https://github.com/2decomp-fft/2decomp-fft/tree/main/examples>`_
63+
service
64+
65+
66+
.. toctree::
67+
:maxdepth: 1
68+
:caption: Detailed list of the results:
69+
70+
benchmarks_archer2.rst
71+
Lines changed: 162 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,162 @@
1+
.. role:: raw-html(raw)
2+
:format: html
3+
4+
.. |br| raw:: html
5+
6+
<br/>
7+
8+
========================
9+
Results for CPU: Archer2
10+
========================
11+
12+
Archer2 is the UK National Supercomputer service, capable of 28 Pflop/s at peak performance.
13+
The system has 5,860 compute nodes, each with dual AMD EPYC 7742 64-core processors at 2.25 GHz,
14+
giving 750,080 cores in total.
15+
16+
Results are listed below:
17+
18+
* Transpose real 3D array
19+
20+
* Resolution ``NX=NY=NZ=512`` |br| |CPU_0512_TrReal_Scal| |CPU_0512_TrReal_SpeedUp|
21+
22+
* Resolution ``NX=NY=NZ=1024`` |br| |CPU_1024_TrReal_Scal| |CPU_1024_TrReal_SpeedUp|
23+
24+
* Transpose complex 3D array
25+
26+
* Resolution ``NX=NY=NZ=512`` |br| |CPU_0512_TrClx_Scal| |CPU_0512_TrClx_SpeedUp|
27+
28+
* Resolution ``NX=NY=NZ=1024`` |br| |CPU_1024_TrClx_Scal| |CPU_1024_TrClx_SpeedUp|
29+
30+
* FFT transform of a 3D real array starting from ``X`` physical direction
31+
32+
* Resolution ``NX=NY=NZ=512`` |br| |CPU_0512_R2CX_Scal| |CPU_0512_R2CX_SpeedUp|
33+
34+
* Resolution ``NX=NY=NZ=1024`` |br| |CPU_1024_R2CX_Scal| |CPU_1024_R2CX_SpeedUp|
35+
36+
* FFT transform of a 3D complex array starting from ``X`` physical direction
37+
38+
* Resolution ``NX=NY=NZ=512`` |br| |CPU_0512_C2CX_Scal| |CPU_0512_C2CX_SpeedUp|
39+
40+
* Resolution ``NX=NY=NZ=1024`` |br| |CPU_1024_C2CX_Scal| |CPU_1024_C2CX_SpeedUp|
41+
42+
* FFT transform of a 3D real array starting from ``Z`` physical direction
43+
44+
* Resolution ``NX=NY=NZ=512`` |br| |CPU_0512_R2CZ_Scal| |CPU_0512_R2CZ_SpeedUp|
45+
46+
* Resolution ``NX=NY=NZ=1024`` |br| |CPU_1024_R2CZ_Scal| |CPU_1024_R2CZ_SpeedUp|
47+
48+
* FFT transform of a 3D complex array starting from ``Z`` physical direction
49+
50+
* Resolution ``NX=NY=NZ=512`` |br| |CPU_0512_C2CZ_Scal| |CPU_0512_C2CZ_SpeedUp|
51+
52+
* Resolution ``NX=NY=NZ=1024`` |br| |CPU_1024_C2CZ_Scal| |CPU_1024_C2CZ_SpeedUp|
53+
54+
Discussion on Archer2 results
55+
_____________________________
56+
57+
The results above show that version 2.0 of the 2DECOMP&FFT library retains extremely good
58+
scalability.
59+
The transpose tests show no difference between compilers, since the tests mainly exercise MPI communication
60+
and all executables use CRAY MPICH (version 8.1.23).
61+
It is interesting to note that a 1D decomposition, when possible, can give up to an 80% speedup in comparison
62+
with the optimal 2D decomposition. This is due to a new feature of the library: a simple copy,
63+
completely avoiding MPI communication, is performed when all the data are co-located in local memory.
64+
This was not the case with the previous version of the library.
65+
The performance of the CRAY and GNU compilers using the *generic* FFT tends to differ at low core counts, with GNU
66+
performing a bit better in some cases (up to a 50% performance increase); however, the results tend to converge as the
67+
number of nodes increases.
68+
This gives some superlinear behaviour when looking at the speedup.
69+
70+
**FFTW** has been tested only with the CRAY compiler and gives a speedup of about 3 at low core counts,
71+
decreasing to between 1.5 and 2 for the larger numbers of nodes.
72+
The speedup with FFTW is generally very close to the ideal linear behaviour.
73+
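For readers who want to reproduce speedup curves from their own timings, the sketch below shows one common way to compute strong-scaling speedup and parallel efficiency relative to the smallest run. The timings are invented for illustration and are not Archer2 measurements, and the function names are hypothetical. Note that this is speedup with respect to core count, which is distinct from the FFTW-versus-generic ratio quoted above.

```python
def speedup(t_ref, t):
    """Strong-scaling speedup of a run taking t seconds versus the reference run."""
    return t_ref / t


def efficiency(cores_ref, t_ref, cores, t):
    """Parallel efficiency: measured speedup divided by the ideal (linear) speedup."""
    return speedup(t_ref, t) / (cores / cores_ref)


# Hypothetical (cores -> seconds) timings; NOT measured data.
runs = {128: 8.0, 256: 4.0, 512: 2.5}
cores_ref = min(runs)
t_ref = runs[cores_ref]

for cores, t in sorted(runs.items()):
    s = speedup(t_ref, t)
    e = efficiency(cores_ref, t_ref, cores, t)
    # An efficiency above 1.0 would indicate superlinear behaviour.
    print(f"{cores:5d} cores: speedup {s:.2f}, efficiency {e:.2f}")
```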
74+
..
75+
_Figures for Archer 2
76+
77+
.. |CPU_0512_TrReal_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_TrReal_CPU_Archer2_ScalabilityTsec.pdf
78+
:width: 35%
79+
:alt: Archer2 Scalability transpose real test: Resolution 512^3
80+
.. |CPU_0512_TrReal_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_TrReal_CPU_Archer2_SpeedUp.pdf
81+
:width: 35%
82+
:alt: Archer2 SpeedUp transpose real test: Resolution 512^3
83+
.. |CPU_1024_TrReal_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_TrReal_CPU_Archer2_ScalabilityTsec.pdf
84+
:width: 35%
85+
:alt: Archer2 Scalability transpose real test: Resolution 1024^3
86+
.. |CPU_1024_TrReal_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_TrReal_CPU_Archer2_SpeedUp.pdf
87+
:width: 35%
88+
:alt: Archer2 SpeedUp transpose real test: Resolution 1024^3
89+
90+
91+
.. |CPU_0512_TrClx_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_TrClx_CPU_Archer2_ScalabilityTsec.pdf
92+
:width: 35%
93+
:alt: Archer2 Scalability transpose complex test: Resolution 512^3
94+
.. |CPU_0512_TrClx_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_TrClx_CPU_Archer2_SpeedUp.pdf
95+
:width: 35%
96+
:alt: Archer2 SpeedUp transpose complex test: Resolution 512^3
97+
.. |CPU_1024_TrClx_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_TrClx_CPU_Archer2_ScalabilityTsec.pdf
98+
:width: 35%
99+
:alt: Archer2 Scalability transpose complex test: Resolution 1024^3
100+
.. |CPU_1024_TrClx_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_TrClx_CPU_Archer2_SpeedUp.pdf
101+
:width: 35%
102+
:alt: Archer2 SpeedUp transpose complex test: Resolution 1024^3
103+
104+
105+
.. |CPU_0512_R2CX_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_R2CX_CPU_Archer2_ScalabilityTsec.pdf
106+
:width: 35%
107+
:alt: Archer2 Scalability R2CX test: Resolution 0512^3
108+
.. |CPU_0512_R2CX_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_R2CX_CPU_Archer2_SpeedUp.pdf
109+
:width: 35%
110+
:alt: Archer2 SpeedUp R2CX test: Resolution 0512^3
111+
.. |CPU_1024_R2CX_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_R2CX_CPU_Archer2_ScalabilityTsec.pdf
112+
:width: 35%
113+
:alt: Archer2 Scalability R2CX test: Resolution 1024^3
114+
.. |CPU_1024_R2CX_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_R2CX_CPU_Archer2_SpeedUp.pdf
115+
:width: 35%
116+
:alt: Archer2 SpeedUp R2CX test: Resolution 1024^3
117+
118+
119+
.. |CPU_0512_C2CX_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_C2CX_CPU_Archer2_ScalabilityTsec.pdf
120+
:width: 35%
121+
:alt: Archer2 Scalability C2CX test: Resolution 0512^3
122+
.. |CPU_0512_C2CX_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_C2CX_CPU_Archer2_SpeedUp.pdf
123+
:width: 35%
124+
:alt: Archer2 SpeedUp C2CX test: Resolution 0512^3
125+
.. |CPU_1024_C2CX_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_C2CX_CPU_Archer2_ScalabilityTsec.pdf
126+
:width: 35%
127+
:alt: Archer2 Scalability C2CX test: Resolution 1024^3
128+
.. |CPU_1024_C2CX_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_C2CX_CPU_Archer2_SpeedUp.pdf
129+
:width: 35%
130+
:alt: Archer2 SpeedUp C2CX test: Resolution 1024^3
131+
132+
133+
.. |CPU_0512_R2CZ_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_R2CZ_CPU_Archer2_ScalabilityTsec.pdf
134+
:width: 35%
135+
:alt: Archer2 Scalability R2CZ test: Resolution 0512^3
136+
.. |CPU_0512_R2CZ_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_R2CZ_CPU_Archer2_SpeedUp.pdf
137+
:width: 35%
138+
:alt: Archer2 SpeedUp R2CZ test: Resolution 0512^3
139+
.. |CPU_1024_R2CZ_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_R2CZ_CPU_Archer2_ScalabilityTsec.pdf
140+
:width: 35%
141+
:alt: Archer2 Scalability R2CZ test: Resolution 1024^3
142+
.. |CPU_1024_R2CZ_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_R2CZ_CPU_Archer2_SpeedUp.pdf
143+
:width: 35%
144+
:alt: Archer2 SpeedUp R2CZ test: Resolution 1024^3
145+
146+
147+
.. |CPU_0512_C2CZ_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_C2CZ_CPU_Archer2_ScalabilityTsec.pdf
148+
:width: 35%
149+
:alt: Archer2 Scalability C2CZ test: Resolution 0512^3
150+
.. |CPU_0512_C2CZ_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_C2CZ_CPU_Archer2_SpeedUp.pdf
151+
:width: 35%
152+
:alt: Archer2 SpeedUp C2CZ test: Resolution 0512^3
153+
.. |CPU_1024_C2CZ_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_C2CZ_CPU_Archer2_ScalabilityTsec.pdf
154+
:width: 35%
155+
:alt: Archer2 Scalability C2CZ test: Resolution 1024^3
156+
.. |CPU_1024_C2CZ_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_C2CZ_CPU_Archer2_SpeedUp.pdf
157+
:width: 35%
158+
:alt: Archer2 SpeedUp C2CZ test: Resolution 1024^3
159+
160+
161+
162+
