Performances

The results presented above were obtained on different architectures, preventing a direct comparison of the performance of the codes. However, the TauBench pseudo-benchmark~\cite{TauBench,TauBench2} made it possible to assess the raw performance of each machine in a separate test. The following subsections describe the configuration of the three high-performance computing (HPC) systems used to run YALES2, DINO and Nek5000 for the TGV benchmarks. Subsequently, the TauBench methodology and its results on the target computers are presented. Finally, the performance analysis for the 3-D TGV cases is discussed in the last subsection.

Presentation of the machines used for the benchmark

Irene Joliot-Curie from TGCC

The YALES2 results were obtained on the Irene Joliot-Curie machine~\cite{irene} operated by TGCC/CEA for GENCI (French National Agency for Supercomputing). This machine, built by Atos, was introduced in September 2018 and is ranked 61st in the June 2020 Top500 list~\cite{ireneTop500}. It is composed of 1,656 compute nodes, each with 192 GB of memory and two sockets of Intel Skylake processors (Xeon Platinum 8168) with 24 cores each, operating at 2.7 GHz. Each core can deliver a theoretical peak performance of 86.4 GFlop/s with AVX-512 and FMA activated. The interconnect is an Infiniband EDR, and the theoretical maximum memory bandwidth is 128 GB/s per socket, i.e., 5.33 GB/s per core when all cores are active. The software stack consists of the Intel Compiler 19 and OpenMPI 2.0.4, both used for all results obtained in this benchmark.

SuperMUC-NG from LRZ

All test cases with DINO were simulated on the SuperMUC machine, hosted at the Leibniz Supercomputing Center (LRZ) in Munich, Germany~\cite{supermuc}. In the course of the project, results were obtained on three versions of SuperMUC (Phase I, Phase II, and NG), but only the performance on the most recent system, SuperMUC-NG, is discussed here. This machine was assembled by Lenovo and ranks 13th in the June 2020 Top500 list~\cite{supermucTop500}. SuperMUC-NG is composed of 6,336 compute nodes, each with 96 GB of memory and two sockets of Intel Skylake processors (Xeon Platinum 8174) with 24 cores each. The theoretical peak performance of a single core is 99.2 GFlop/s with AVX-512 and FMA activated at a sustained frequency of 3.1 GHz. The nodes are interconnected with an Intel OmniPath network. All tests presented here were obtained with the Intel 19 compiler and Intel MPI.

Piz Daint from CSCS

The Nek5000 simulations were performed on the XC40 partition of the Piz Daint machine at CSCS in Switzerland~\cite{pizdaint}. This machine, manufactured by HPE and Cray, ranks 231st in the June 2020 Top500 list~\cite{pizdaintTop500}. It is composed of 1,813 compute nodes, each containing 64 GB of RAM and two sockets of Intel Xeon E5-2695 v4 processors (18 cores at 2.1 GHz per socket). The theoretical peak performance of a single core is 33.6 GFlop/s with AVX2 and FMA activated. The interconnect is based on the Aries routing and communications ASIC with a Dragonfly network topology, and the maximum achievable memory bandwidth (BW) of a single socket is 76.8 GB/s, which corresponds to 4.27 GB/s per core in a fully occupied socket. All tests presented here were obtained with the Intel 18 compiler.
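As a quick cross-check, the per-core peak values quoted above follow from the clock frequency and the vector capabilities of each processor. The minimal sketch below assumes the usual double-precision characteristics of these chips (two FMA units per core, 8 doubles per AVX-512 register on the Skylake machines, 4 doubles per AVX2 register on Piz Daint); it is only meant to show where the numbers come from, not to replace the vendor specifications.

<syntaxhighlight lang="python">
# Minimal sketch: per-core double-precision peak = frequency x FLOPs per cycle,
# with FLOPs/cycle = (doubles per vector) x 2 (one FMA = 2 FLOPs) x (FMA units per core).
machines = {
    # name: (frequency [GHz], doubles per vector register, FMA units per core)
    "Irene Joliot-Curie": (2.7, 8, 2),  # Skylake Xeon Platinum 8168, AVX-512
    "SuperMUC-NG":        (3.1, 8, 2),  # Skylake Xeon Platinum 8174, AVX-512
    "Piz Daint":          (2.1, 4, 2),  # Broadwell Xeon E5-2695 v4, AVX2
}

for name, (freq_ghz, doubles_per_vector, fma_units) in machines.items():
    flops_per_cycle = doubles_per_vector * 2 * fma_units
    peak_gflops = freq_ghz * flops_per_cycle
    print(f"{name}: {peak_gflops:.1f} GFlop/s per core")
# Prints 86.4, 99.2 and 33.6 GFlop/s, matching the figures quoted in the text.
</syntaxhighlight>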


TauBench performance benchmark

The scalable benchmark TauBench emulates the run-time behavior of the compressible TAU flow solver~\cite{dlr22421} with respect to memory footprint and floating-point performance. Since the TAU solver relies on unstructured grids, its most important property is that all accesses to the grid are indirect, as in most modern CFD solvers. TauBench can therefore be used to estimate the performance of a generic flow solver with respect to machine properties such as memory bandwidth or cache misses and latencies. It provides a reference measurement of a system on a workload that is more representative of typical CFD codes than the widely used LINPACK benchmark~\cite{linpack}, which is mostly CPU-bound. Even though some important effects are neglected by using a single-core benchmark (such as MPI communications or the memory-bandwidth saturation that appears on fully occupied nodes), TauBench is still a good indicator of the relative performance of each architecture.
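To picture the kind of memory access TauBench reproduces, the toy sketch below (not TauBench's actual kernel) contrasts a contiguous, structured access with the indirect, gather-type access of an unstructured solver, where every neighbour is reached through a connectivity array; this latter pattern is what makes performance sensitive to memory bandwidth and cache latency rather than to raw arithmetic throughput.

<syntaxhighlight lang="python">
import numpy as np

n_nodes = 2_000_000
values = np.random.rand(n_nodes)

# Structured access: neighbours sit at fixed offsets, so memory is streamed contiguously.
structured = values[:-1] + values[1:]

# Unstructured access: neighbours are looked up in a connectivity array,
# turning every load into an indirect (gather) access with poor spatial locality.
edges = np.random.randint(0, n_nodes, size=(n_nodes, 2))
unstructured = values[edges[:, 0]] + values[edges[:, 1]]
</syntaxhighlight>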


{| class="wikitable"
|+ Single-core performance obtained by TauBench for the three HPC systems employed in the benchmarks
! Machine !! Irene Joliot-Curie !! SuperMUC-NG !! Piz Daint
|-
! Frequency [GHz]
| 2.7 || 3.1 || 2.1
|-
! Single-core TauBench [GFlop/s]
| 2.97 || 3.34 || 4.13
|-
! Single-core peak [GFlop/s]
| 86.4 || 99.2 || 33.6
|-
! Single-core BW [GB/s]
| 21.3 || 21.3 || 19.2
|}

This table presents some important data for the three HPC systems considered: the processor frequency, the TauBench result, the theoretical peak performance and the memory bandwidth. The last three quantities are given for a single core. The most striking, though not unexpected, conclusion is that TauBench yields results that are much lower than the theoretical peak performance (3% of the peak for Irene Joliot-Curie and SuperMUC-NG, and 12% for Piz Daint). This is in agreement with the results of the similar and well-known HPCG benchmark [86], which is also intended as a complement to the High Performance LINPACK (HPL) benchmark currently used to rank the Top500 computing systems. For example, the HPCG benchmark on Irene Joliot-Curie measured a performance of 0.66 GFlop/s per core [87], which is only approximately 0.8% of the theoretical peak. The main reason for the discrepancy between LINPACK and HPCG is that the former is purely CPU-bound, while HPCG is mostly limited by memory bandwidth and cache-miss effects. TauBench lies somewhere in between and is probably a good estimate for many CFD codes.

Indeed, the theoretical peak performance can only be reached when issuing two fully-vectorized Fused Multiply-Add (FMA) instructions per cycle. This situation is never achieved in any CFD code. Most of them are usually limited to issuing only one non-vectorized, non-FMA instruction per cycle; in this situation, the peak performance (in GFlop/s) is simply equal to the processor frequency (in GHz). This can be observed very clearly on SuperMUC-NG and Irene Joliot-Curie. Regarding Piz Daint, the TauBench result is actually much better relative to the theoretical peak than on the other two machines; this is somewhat unexpected and might be due to the use of Turbo Boost on this machine, or to better memory/cache performance.

It should be pointed out that no effort was made to find the best parameters (tolerances, time step, etc.) to minimize the computational cost, and the following results should only be considered as indicative of the time-to-solution of the three codes.
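For completeness, the fractions of peak and the "one instruction per cycle" observation above can be checked directly against the table; the short sketch below only reuses the table values and assumes nothing else.

<syntaxhighlight lang="python">
# Data taken from the table above: (TauBench [GFlop/s], peak [GFlop/s], frequency [GHz]).
machines = {
    "Irene Joliot-Curie": (2.97, 86.4, 2.7),
    "SuperMUC-NG":        (3.34, 99.2, 3.1),
    "Piz Daint":          (4.13, 33.6, 2.1),
}

for name, (taubench, peak, freq) in machines.items():
    print(f"{name}: {100 * taubench / peak:.1f}% of peak, "
          f"{taubench / freq:.2f} FLOPs per cycle")
# Irene Joliot-Curie and SuperMUC-NG sustain roughly one FLOP per cycle,
# whereas Piz Daint sustains almost two, hence its larger fraction of peak.
</syntaxhighlight>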

Methodology

Several metrics will be used in the following sections to characterize the codes. They are introduced by giving both a formal definition and a few complementary explanations. In order to avoid any confusion between CPU time and simulated time, the corresponding data will be indexed by CPU and Sim, respectively.

First, the Wall-Clock Time (WCT) is the elapsed time needed to perform a simulation on a given number of cores N<sub>cores</sub>. The product T<sub>CPU</sub> = N<sub>cores</sub> × WCT is thus the total CPU time of the simulation. A more meaningful metric is the so-called Reduced Computational Time (RCT), which is computed as:

<math>
RCT = \frac{T_{CPU}}{N_{it} \times N_p}
</math>
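As a small usage illustration, and assuming, as is usual for this metric, that N<sub>it</sub> denotes the number of time steps and N<sub>p</sub> the number of grid points, these quantities can be evaluated as follows; all run parameters below are placeholder values, not benchmark results.

<syntaxhighlight lang="python">
# Placeholder run parameters, for illustration only.
wct = 3600.0          # wall-clock time WCT [s]
n_cores = 1024        # number of cores N_cores
n_it = 20_000         # number of time steps N_it (assumed meaning)
n_p = 64_000_000      # number of grid points N_p (assumed meaning)

t_cpu = n_cores * wct          # total CPU time T_CPU [core-s]
rct = t_cpu / (n_it * n_p)     # reduced computational time RCT

print(f"T_CPU = {t_cpu:.3e} core-s, RCT = {rct:.3e}")
</syntaxhighlight>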