Performances

From CFD Benchmark
Revision as of 15:32, 1 October 2020 by Abdelsamie (Talk | contribs)


The results presented above were obtained on different architectures, which prevents a direct comparison of the performance of the codes. However, the TauBench pseudo-benchmark~\cite{TauBench,TauBench2} makes it possible to assess the raw performance of each machine in a separate test. The following sections describe the configuration of the three high-performance computing (HPC) systems that were used to run YALES2, DINO and Nek5000 for the TGV benchmarks. The TauBench methodology and its results on the target computers are then presented. Finally, the performance analysis for the 3-D TGV cases is discussed in the last subsection.

Presentation of the machines used for the benchmark

Irene Joliot-Curie from TGCC

The YALES2 results were obtained on the Irene Joliot-Curie machine~\cite{irene} operated by TGCC/CEA for GENCI (French National Agency for Supercomputing). This machine, built by Atos, was introduced in September 2018 and ranked 61st in the June 2020 Top500 list~\cite{ireneTop500}. It is composed of 1,656 compute nodes, each with 192 GB of memory and two sockets of Intel Skylake (Xeon Platinum 8168) processors with 24 cores each, operating at 2.7 GHz. Each core can deliver a theoretical peak performance of 86.4 GFlop/s with AVX-512 and FMA activated. The interconnect is InfiniBand EDR. The theoretical maximum memory bandwidth is 128 GB/s per socket, and thus 5.33 GB/s per core when all cores are active. The software stack consists of the Intel Compiler 19 and OpenMPI 2.0.4, both used for all results obtained in this benchmark.

SuperMUC-NG from LRZ

All test cases with DINO were simulated on the SuperMUC machine, hosted at the Leibniz Supercomputing Center (LRZ) in Munich, Germany~\cite{supermuc}. In the course of the project, results have been obtained on three versions of SuperMUC (Phase I, Phase II, and NG), but only the performance on the most recent system, SuperMUC-NG, is discussed here. This machine was assembled by Lenovo and ranked 13th in the June 2020 Top500 list~\cite{supermucTop500}. SuperMUC-NG consists of 6,336 compute nodes, each with 96 GB of memory and two sockets of Intel Skylake Xeon Platinum 8174 processors with 24 cores each. The theoretical peak performance of a single core is 99.2 GFlop/s with AVX-512 and FMA activated at a sustained frequency of 3.1 GHz. The nodes are interconnected with an Intel Omni-Path network. All tests presented here were obtained with the Intel 19 compiler and Intel MPI.

Piz Daint from CSCS

The Nek5000 simulations were performed on the XC40 partition of the Piz Daint machine at CSCS in Switzerland~\cite{pizdaint}. This machine, manufactured by HPE and Cray, ranked 231st in the June 2020 Top500 list~\cite{pizdaintTop500}. It is composed of 1,813 compute nodes, each containing 64 GB of RAM and two sockets of Intel Xeon E5-2695 v4 processors (18 cores at 2.1 GHz per socket). The theoretical peak performance of a single core is 33.6 GFlop/s with AVX2 and FMA activated. The interconnect is based on the Aries routing and communications ASIC and a Dragonfly network topology. The maximum achievable memory bandwidth (BW) of a single socket is 76.8 GB/s, which leaves 4.27 GB/s for each core in a fully occupied socket. All tests presented here have been obtained with the Intel 18 compiler.
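The per-core figures quoted for the three machines can be cross-checked with simple arithmetic. The following Python sketch is only an illustration: it assumes double-precision arithmetic and two FMA units per core (neither is stated in the text), and the function names are hypothetical. The socket bandwidth of SuperMUC-NG is not given above, so only the two machines with a quoted value are included in the bandwidth check.

```python
# Assumptions (not stated in the text): double precision (64-bit words)
# and two FMA units per core. Function names are illustrative only.

def flops_per_cycle(vector_bits, fma_units=2, word_bits=64):
    """Floating-point operations per cycle for one core."""
    lanes = vector_bits // word_bits      # values per vector register
    return lanes * 2 * fma_units          # one FMA counts as 2 flops


def peak_gflops(freq_ghz, vector_bits):
    """Theoretical single-core peak in GFlop/s."""
    return freq_ghz * flops_per_cycle(vector_bits)


def bandwidth_per_core(socket_bw_gbs, cores_per_socket):
    """Bandwidth share of one core when the socket is fully occupied."""
    return socket_bw_gbs / cores_per_socket


# Frequencies and vector widths (AVX-512 = 512 bits, AVX2 = 256 bits)
machines = {
    "Irene Joliot-Curie": {"freq_ghz": 2.7, "vector_bits": 512},
    "SuperMUC-NG": {"freq_ghz": 3.1, "vector_bits": 512},
    "Piz Daint": {"freq_ghz": 2.1, "vector_bits": 256},
}

peaks = {name: round(peak_gflops(**m), 1) for name, m in machines.items()}

# Socket bandwidth [GB/s] and cores per socket, as quoted in the text
shared_bw = {
    "Irene Joliot-Curie": round(bandwidth_per_core(128.0, 24), 2),
    "Piz Daint": round(bandwidth_per_core(76.8, 18), 2),
}

for name, value in peaks.items():
    print(f"{name}: {value} GFlop/s peak per core")
for name, value in shared_bw.items():
    print(f"{name}: {value} GB/s per core (full socket)")
```

Under these assumptions the sketch reproduces the peak values and per-core bandwidths quoted in the machine descriptions.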


TauBench performance benchmark

The scalable benchmark TauBench emulates the run-time behavior of the compressible TAU flow solver~\cite{dlr22421} with respect to memory footprint and floating-point performance. Since the TAU solver relies on unstructured grids, its most important property is that all accesses to the grid are indirect, as in most modern CFD solvers. TauBench can therefore be used to estimate the performance of a generic flow solver with respect to machine properties such as memory bandwidth or cache misses and latencies. It provides a reference measurement of a system on a workload that is more representative of usual CFD codes than the widely used LINPACK benchmark~\cite{linpack}, which is mostly CPU-bound. Even though a single-core benchmark neglects some important effects (like MPI communications or the memory bandwidth saturation that appears on fully filled nodes), TauBench is still a good indicator of the relative performance of each architecture.
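To illustrate what indirect grid access means in practice, the following Python sketch accumulates a simple edge-based residual on a tiny unstructured mesh. It is not the TauBench kernel: the function name, the mesh and the "flux" (a plain difference of nodal values) are hypothetical and chosen only to show the access pattern.

```python
# Illustrative only: not the TauBench kernel. The mesh, the data and the
# flux formula are hypothetical and serve to show indirect addressing.

def edge_residual(edges, u):
    """Accumulate a conservative edge-based residual.

    Nodal values are reached through the edge list (u[i], u[j]) rather
    than through a regular stride, so the memory access pattern depends
    on the mesh connectivity.
    """
    res = [0.0] * len(u)
    for i, j in edges:
        flux = u[j] - u[i]   # indirect reads
        res[i] += flux       # indirect writes (scatter)
        res[j] -= flux
    return res


# Four nodes connected by four edges, in arbitrary order
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
u = [1.0, 2.0, 4.0, 8.0]

res = edge_residual(edges, u)
print(res)
```

Because each edge adds a flux to one node and subtracts it from the other, the residuals sum to zero. On large meshes the scattered reads and writes make such loops sensitive to memory bandwidth and cache behavior rather than to peak floating-point throughput.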


Single-core performance obtained by TauBench for the three HPC systems employed in the benchmarks

Machine                           Irene Joliot-Curie   SuperMUC-NG   Piz Daint
Frequency [GHz]                   2.7                  3.1           2.1
Single-core TauBench [GFlop/s]    2.97                 3.34          4.13
Single-core peak [GFlop/s]        86.4                 99.2          33.6
Single-core BW [GB/s]             21.3                 21.3          19.2
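One way to read the table is to express the TauBench result as a fraction of the theoretical single-core peak. The short Python sketch below does this with the tabulated values. The ratio is an illustrative metric chosen here, not one defined in the text, and the function name is hypothetical.

```python
# Values taken from the table above:
# (single-core TauBench [GFlop/s], single-core peak [GFlop/s])
table = {
    "Irene Joliot-Curie": (2.97, 86.4),
    "SuperMUC-NG": (3.34, 99.2),
    "Piz Daint": (4.13, 33.6),
}


def fraction_of_peak(sustained_gflops, peak_gflops):
    """Share of the theoretical peak reached by the benchmark."""
    return sustained_gflops / peak_gflops


fractions = {name: fraction_of_peak(*vals) for name, vals in table.items()}

for name, frac in fractions.items():
    print(f"{name}: {100.0 * frac:.1f} % of single-core peak")
```

On all three machines the benchmark reaches only a small share of the peak, which is consistent with a workload that is limited by memory access rather than by arithmetic throughput.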