Commit 112ae0b

Update the benchmark with Cirrus results
1 parent d43e917 commit 112ae0b

38 files changed

+206
-2
lines changed

docs/source/pages/benchmarks.rst

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -68,4 +68,5 @@ The library has been benchmark on the following systems:
6868
:caption: Here the detailed list of the results:
6969

7070
benchmarks_archer2.rst
71+
benchmarks_cirrus.rst
7172

docs/source/pages/benchmarks_archer2.rst

Lines changed: 3 additions & 1 deletion
@@ -5,6 +5,8 @@
55

66
<br/>
77

8+
.. _benchmark-archer2:
9+
810
========================
911
Results for CPU: Archer2
1012
========================
@@ -23,7 +25,7 @@ Resuls are listed below
2325

2426
* Transpose complex 3D array
2527

26-
* Resolution ``NX=NY=NZ=512`` |br| |CPU_0512_TrReal_Scal| |CPU_0512_TrReal_SpeedUp|
28+
* Resolution ``NX=NY=NZ=512`` |br| |CPU_0512_TrClx_Scal| |CPU_0512_TrClx_SpeedUp|
2729

2830
* Resolution ``NX=NY=NZ=1024`` |br| |CPU_1024_TrClx_Scal| |CPU_1024_TrClx_SpeedUp|
2931

Lines changed: 200 additions & 0 deletions
@@ -0,0 +1,200 @@
1+
.. role:: raw-html(raw)
2+
:format: html
3+
4+
.. |br| raw:: html
5+
6+
<br/>
7+
8+
========================
9+
Results for GPU: Cirrus
10+
========================
11+
12+
Cirrus is a national UK service provided by EPCC. Each node of the GPU partition of the cluster has
13+
2 Intel Xeon “Cascade Lake” processors (2.4 GHz, 20 cores each) together with
14+
4 NVIDIA Tesla V100-SXM2-16GB GPUs (640 Tensor cores and 5,120 CUDA cores each).
15+
The GPU partition has 36 nodes, for a total of 144 GPUs.
16+
17+
Results are listed below
18+
19+
* Transpose real 3D array
20+
21+
* Resolution ``NX=NY=NZ=512`` |br| |GPU_0512_TrReal_Scal| |GPU_0512_TrReal_SpeedUp|
22+
23+
* Resolution ``NX=NY=NZ=1024`` |br| |GPU_1024_TrReal_Scal| |GPU_1024_TrReal_SpeedUp|
24+
25+
* Transpose complex 3D array
26+
27+
* Resolution ``NX=NY=NZ=512`` |br| |GPU_0512_TrClx_Scal| |GPU_0512_TrClx_SpeedUp| |GPU_0512_TrClx_SpeedUpGPU|
28+
29+
* Resolution ``NX=NY=NZ=1024`` |br| |GPU_1024_TrClx_Scal| |GPU_1024_TrClx_SpeedUp| |GPU_1024_TrClx_SpeedUpGPU|
30+
31+
* FFT transform of a 3D real array starting from ``X`` physical direction
32+
33+
* Resolution ``NX=NY=NZ=512`` |br| |GPU_0512_R2CX_Scal| |GPU_0512_R2CX_SpeedUp| |GPU_0512_R2CX_SpeedUpGPU|
34+
35+
* Resolution ``NX=NY=NZ=1024`` |br| |GPU_1024_R2CX_Scal| |GPU_1024_R2CX_SpeedUp| |GPU_1024_R2CX_SpeedUpGPU|
36+
37+
* FFT transform of a 3D complex array starting from ``X`` physical direction
38+
39+
* Resolution ``NX=NY=NZ=512`` |br| |GPU_0512_C2CX_Scal| |GPU_0512_C2CX_SpeedUp| |GPU_0512_C2CX_SpeedUpGPU|
40+
41+
* Resolution ``NX=NY=NZ=1024`` |br| |GPU_1024_C2CX_Scal| |GPU_1024_C2CX_SpeedUp| |GPU_1024_C2CX_SpeedUpGPU|
42+
43+
* FFT transform of a 3D real array starting from ``Z`` physical direction
44+
45+
* Resolution ``NX=NY=NZ=512`` |br| |GPU_0512_R2CZ_Scal| |GPU_0512_R2CZ_SpeedUp| |GPU_0512_R2CZ_SpeedUpGPU|
46+
47+
* Resolution ``NX=NY=NZ=1024`` |br| |GPU_1024_R2CZ_Scal| |GPU_1024_R2CZ_SpeedUp| |GPU_1024_R2CZ_SpeedUpGPU|
48+
49+
* FFT transform of a 3D complex array starting from ``Z`` physical direction
50+
51+
* Resolution ``NX=NY=NZ=512`` |br| |GPU_0512_C2CZ_Scal| |GPU_0512_C2CZ_SpeedUp| |GPU_0512_C2CZ_SpeedUpGPU|
52+
53+
* Resolution ``NX=NY=NZ=1024`` |br| |GPU_1024_C2CZ_Scal| |GPU_1024_C2CZ_SpeedUp| |GPU_1024_C2CZ_SpeedUpGPU|
54+
55+
Discussion on Cirrus results
56+
_____________________________
57+
58+
This page presents the first scalability tests of the GPU version of the 2DECOMP&FFT library, version 2.0.
59+
All results have been obtained using the NVHPC compiler version 22.11 together with Open MPI 4.1.4.
60+
The GPU builds are tested with both CUDA-aware MPI and the NVIDIA Collective Communication Library (NCCL).
61+
The smallest resolution case ``NX=NY=NZ=512`` also fits on a single GPU, therefore results are also reported
62+
for 1/4 of a node (1 GPU) and for 1/2 of a node (2 GPUs). Results with the pure MPI version instead always use
63+
at least a full node with all 40 cores.
64+
The GPU/CPU speedup is computed using as CPU reference time the case with 1 full node for the ``NX=NY=NZ=512``
65+
resolution and 2 full nodes for ``NX=NY=NZ=1024``, since the largest case needs at least 8 GPUs to fit in memory.
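
As a rough check of the memory argument above, the size of a single 3D array at each resolution can be estimated directly. The sketch below assumes double precision (8 bytes per real value, 16 bytes per complex value), which this page does not state. The helper name ``array_size_gib`` is purely illustrative and is not part of 2DECOMP&FFT. The number of work arrays and communication buffers allocated by the library is not given here, so the figures are only per-array lower bounds.

```python
# Assumption: double precision, i.e. 8 bytes per real and 16 bytes per complex value.
BYTES_PER_REAL = 8
BYTES_PER_COMPLEX = 16
GIB = 1024**3


def array_size_gib(n, bytes_per_value):
    """Size in GiB of one n x n x n array (hypothetical helper, for illustration only)."""
    return n**3 * bytes_per_value / GIB


for n in (512, 1024):
    real_gib = array_size_gib(n, BYTES_PER_REAL)
    complex_gib = array_size_gib(n, BYTES_PER_COMPLEX)
    print(f"N={n}: real array {real_gib:g} GiB, complex array {complex_gib:g} GiB")
```

Under that assumption a single complex ``1024^3`` array is already on the order of the 16 GB memory of one V100, which is consistent with the largest case having to be spread over several GPUs.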
66+
67+
The CPU results, particularly the ones for the real and complex transposes, show acceptable scalability, but
68+
they are not comparable with the ones presented for :doc:`Archer2 <benchmarks_archer2>`, particularly for the coarser
69+
mesh resolution. This could be mainly attributed to the network, which is considerably slower than the one available
70+
on Archer2.
71+
72+
Communication improves greatly when using GPUs, both with CUDA-aware MPI and, in particular, with NCCL.
73+
For the GPU cases the slab decomposition also tends to give better and more consistent performance, with
74+
NCCL generally 50% or more faster than CUDA-aware MPI.
75+
For the low resolution case there is a very noticeable drop in performance when moving beyond
76+
1 node, which can be attributed to the already mentioned relatively slow interconnect.
77+
For the larger case, where at least 2 nodes are necessary to fit the case in GPU memory, the interconnect
78+
issue is less visible.
79+
The speedup of the GPU runs over the CPU runs is a factor of 5 or above, depending on the
80+
case and the resolution.
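
The GPU/CPU speedup described above can be sketched as the ratio between the CPU reference time and the GPU time. The snippet below is a minimal illustration, not the script used to produce the figures. The function name, the dictionary layout and the timings are hypothetical placeholders. Only the choice of CPU reference (1 full node for ``512^3``, 2 full nodes for ``1024^3``) comes from the text.

```python
# CPU reference node count per resolution, as stated in the discussion above.
CPU_REFERENCE_NODES = {512: 1, 1024: 2}


def gpu_over_cpu_speedup(resolution, cpu_times_by_nodes, gpu_time):
    """Return the CPU reference time divided by the GPU time (hypothetical helper).

    cpu_times_by_nodes maps a number of CPU nodes to a wall-clock time in seconds.
    """
    reference_nodes = CPU_REFERENCE_NODES[resolution]
    return cpu_times_by_nodes[reference_nodes] / gpu_time


# Placeholder timings in seconds: NOT measured values from Cirrus.
example_cpu_times = {1: 8.0, 2: 4.0}
example_gpu_time = 2.0

speedup_512 = gpu_over_cpu_speedup(512, example_cpu_times, example_gpu_time)
print(f"GPU/CPU speedup for 512^3 (placeholder data): {speedup_512:.1f}")
```

Keeping the reference node count in one table makes the convention explicit: every GPU run at a given resolution is compared against the same CPU time, regardless of how many GPUs it used.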
81+
82+
..
83+
_Figures for Cirrus
84+
85+
.. |GPU_0512_TrReal_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_TrReal_GPU_Cirrus_ScalabilityTsec.pdf
86+
:width: 32%
87+
:alt: Cirrus Scalability transpose real test: Resolution 512^3
88+
.. |GPU_0512_TrReal_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_TrReal_GPU_Cirrus_SpeedUp.pdf
89+
:width: 32%
90+
:alt: Cirrus SpeedUp transpose real test: Resolution 512^3
91+
.. |GPU_1024_TrReal_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_TrReal_GPU_Cirrus_ScalabilityTsec.pdf
92+
:width: 32%
93+
:alt: Cirrus Scalability transpose real test: Resolution 1024^3
94+
.. |GPU_1024_TrReal_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_TrReal_GPU_Cirrus_SpeedUp.pdf
95+
:width: 32%
96+
:alt: Cirrus SpeedUp transpose real test: Resolution 1024^3
97+
98+
99+
.. |GPU_0512_TrClx_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_TrClx_GPU_Cirrus_ScalabilityTsec.pdf
100+
:width: 32%
101+
:alt: Cirrus Scalability transpose complex test: Resolution 512^3
102+
.. |GPU_0512_TrClx_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_TrClx_GPU_Cirrus_SpeedUp.pdf
103+
:width: 32%
104+
:alt: Cirrus SpeedUp transpose complex test: Resolution 512^3
105+
.. |GPU_0512_TrClx_SpeedUpGPU| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_TrClx_GPU_Cirrus_SpeedUpGPUoverCPU.pdf
106+
:width: 32%
107+
:alt: Cirrus SpeedUp GPU/CPU transpose complex test: Resolution 512^3
108+
.. |GPU_1024_TrClx_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_TrClx_GPU_Cirrus_ScalabilityTsec.pdf
109+
:width: 32%
110+
:alt: Cirrus Scalability transpose complex test: Resolution 1024^3
111+
.. |GPU_1024_TrClx_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_TrClx_GPU_Cirrus_SpeedUp.pdf
112+
:width: 32%
113+
:alt: Cirrus SpeedUp transpose complex test: Resolution 1024^3
114+
.. |GPU_1024_TrClx_SpeedUpGPU| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_TrClx_GPU_Cirrus_SpeedUpGPUoverCPU.pdf
115+
:width: 32%
116+
:alt: Cirrus SpeedUp GPU/CPU transpose complex test: Resolution 1024^3
117+
118+
119+
.. |GPU_0512_R2CX_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_R2CX_GPU_Cirrus_ScalabilityTsec.pdf
120+
:width: 32%
121+
:alt: Cirrus Scalability R2CX test: Resolution 0512^3
122+
.. |GPU_0512_R2CX_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_R2CX_GPU_Cirrus_SpeedUp.pdf
123+
:width: 32%
124+
:alt: Cirrus SpeedUp R2CX test: Resolution 0512^3
125+
.. |GPU_0512_R2CX_SpeedUpGPU| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_R2CX_GPU_Cirrus_SpeedUpGPUoverCPU.pdf
126+
:width: 32%
127+
:alt: Cirrus SpeedUp GPU/CPU R2CX test: Resolution 0512^3
128+
.. |GPU_1024_R2CX_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_R2CX_GPU_Cirrus_ScalabilityTsec.pdf
129+
:width: 32%
130+
:alt: Cirrus Scalability R2CX test: Resolution 1024^3
131+
.. |GPU_1024_R2CX_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_R2CX_GPU_Cirrus_SpeedUp.pdf
132+
:width: 32%
133+
:alt: Cirrus SpeedUp R2CX test: Resolution 1024^3
134+
.. |GPU_1024_R2CX_SpeedUpGPU| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_R2CX_GPU_Cirrus_SpeedUpGPUoverCPU.pdf
135+
:width: 32%
136+
:alt: Cirrus SpeedUp GPU/CPU R2CX test: Resolution 1024^3
137+
138+
139+
.. |GPU_0512_C2CX_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_C2CX_GPU_Cirrus_ScalabilityTsec.pdf
140+
:width: 32%
141+
:alt: Cirrus Scalability C2CX test: Resolution 0512^3
142+
.. |GPU_0512_C2CX_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_C2CX_GPU_Cirrus_SpeedUp.pdf
143+
:width: 32%
144+
:alt: Cirrus SpeedUp C2CX test: Resolution 0512^3
145+
.. |GPU_0512_C2CX_SpeedUpGPU| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_C2CX_GPU_Cirrus_SpeedUpGPUoverCPU.pdf
146+
:width: 32%
147+
:alt: Cirrus SpeedUp GPU/CPU C2CX test: Resolution 0512^3
148+
.. |GPU_1024_C2CX_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_C2CX_GPU_Cirrus_ScalabilityTsec.pdf
149+
:width: 32%
150+
:alt: Cirrus Scalability C2CX test: Resolution 1024^3
151+
.. |GPU_1024_C2CX_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_C2CX_GPU_Cirrus_SpeedUp.pdf
152+
:width: 32%
153+
:alt: Cirrus SpeedUp C2CX test: Resolution 1024^3
154+
.. |GPU_1024_C2CX_SpeedUpGPU| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_C2CX_GPU_Cirrus_SpeedUpGPUoverCPU.pdf
155+
:width: 32%
156+
:alt: Cirrus SpeedUp GPU/CPU C2CX test: Resolution 1024^3
157+
158+
159+
.. |GPU_0512_R2CZ_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_R2CZ_GPU_Cirrus_ScalabilityTsec.pdf
160+
:width: 32%
161+
:alt: Cirrus Scalability R2CZ test: Resolution 0512^3
162+
.. |GPU_0512_R2CZ_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_R2CZ_GPU_Cirrus_SpeedUp.pdf
163+
:width: 32%
164+
:alt: Cirrus SpeedUp R2CZ test: Resolution 0512^3
165+
.. |GPU_0512_R2CZ_SpeedUpGPU| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_R2CZ_GPU_Cirrus_SpeedUpGPUoverCPU.pdf
166+
:width: 32%
167+
:alt: Cirrus SpeedUp GPU/CPU R2CZ test: Resolution 0512^3
168+
.. |GPU_1024_R2CZ_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_R2CZ_GPU_Cirrus_ScalabilityTsec.pdf
169+
:width: 32%
170+
:alt: Cirrus Scalability R2CZ test: Resolution 1024^3
171+
.. |GPU_1024_R2CZ_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_R2CZ_GPU_Cirrus_SpeedUp.pdf
172+
:width: 32%
173+
:alt: Cirrus SpeedUp R2CZ test: Resolution 1024^3
174+
.. |GPU_1024_R2CZ_SpeedUpGPU| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_R2CZ_GPU_Cirrus_SpeedUpGPUoverCPU.pdf
175+
:width: 32%
176+
:alt: Cirrus SpeedUp GPU/CPU R2CZ test: Resolution 1024^3
177+
178+
179+
.. |GPU_0512_C2CZ_Scal| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_C2CZ_GPU_Cirrus_ScalabilityTsec.pdf
180+
:width: 32%
181+
:alt: Cirrus Scalability C2CZ test: Resolution 0512^3
182+
.. |GPU_0512_C2CZ_SpeedUp| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_C2CZ_GPU_Cirrus_SpeedUp.pdf
183+
:width: 32%
184+
:alt: Cirrus SpeedUp C2CZ test: Resolution 0512^3
185+
.. |GPU_0512_C2CZ_SpeedUpGPU| image:: benchmarks_figs/2023_08_01_Res0512x0512x0512_C2CZ_GPU_Cirrus_SpeedUpGPUoverCPU.pdf
186+
:width: 32%
187+
:alt: Cirrus SpeedUp GPU/CPU C2CZ test: Resolution 0512^3
188+
.. |GPU_1024_C2CZ_Scal| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_C2CZ_GPU_Cirrus_ScalabilityTsec.pdf
189+
:width: 32%
190+
:alt: Cirrus Scalability C2CZ test: Resolution 1024^3
191+
.. |GPU_1024_C2CZ_SpeedUp| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_C2CZ_GPU_Cirrus_SpeedUp.pdf
192+
:width: 32%
193+
:alt: Cirrus SpeedUp C2CZ test: Resolution 1024^3
194+
.. |GPU_1024_C2CZ_SpeedUpGPU| image:: benchmarks_figs/2023_08_01_Res1024x1024x1024_C2CZ_GPU_Cirrus_SpeedUpGPUoverCPU.pdf
195+
:width: 32%
196+
:alt: Cirrus SpeedUp GPU/CPU C2CZ test: Resolution 1024^3
197+
198+
199+
200+

readme.md

Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
1-
test
1+
# 2DECOMP&FFT.github.io
2+
Website source for the 2DECOMP&FFT library for 2D and 1D decomposition and FFT transforms.
