CP2K Benchmark Suite

Introduction

The purpose of the CP2K benchmark suite is to provide performance figures which can be used to guide users towards the best configuration (e.g. machine, number of MPI processes, number of OpenMP threads) for a particular problem, and to give a good estimate of the parallel performance of the code for different types of method. Five benchmarks are provided: H2O-64, Fayalite-FIST, LiH-HFX, H2O-DFT-LS and H2O-64-RI-MP2. Descriptions of each benchmark, along with performance figures, are given below.

We encourage you to contribute benchmark results from your own local cluster or HPC system - just run the inputs and add timings in the relevant sections below. Python scripts for generating the scaling graphs are provided in tools/benchmark_plots/. Please also update the list of machines for which benchmark data is provided.

If you have any questions or problems running benchmarks or using the scripts please contact Iain Bethune (ibethune@epcc.ed.ac.uk).

Notes on Results

Some benchmarks perform MD, whilst the more expensive methods perform only a single-point energy calculation; the total time for the calculation is therefore reported against the number of compute nodes used. Each benchmark uses a different system, so results are not directly comparable between benchmarks.

The mixed-mode MPI/OpenMP version of CP2K is used to measure performance (there is negligible overhead from running this version with 1 thread per process compared to the pure MPI code). For a fixed number of cores, all reasonable combinations of MPI processes and OpenMP threads were tested, subject to keeping each process's threads within a single NUMA region. For example, on ARCHER 6 cores share a single NUMA region, so no more than 6 threads per process were used, as performance with more threads would be very poor. From these combinations, the best run time and the corresponding number of threads per process are reported. As most HPC systems charge by the node, full nodes were utilised at all times.
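As a rough illustration of the selection described above, the following Python sketch enumerates candidate process/thread layouts for a fixed core count. The function name thread_layouts and the rule that the thread count must divide the NUMA region size evenly are assumptions made for this example, not the procedure actually used to produce the results. The value 576 is the ARCHER core count from the H2O-64 results, and 6 is the NUMA region size quoted above.

```python
def thread_layouts(total_cores, numa_cores):
    """Return (mpi_processes, omp_threads) pairs that use every core.

    Assumption for this sketch: a layout is kept only if the threads of
    one process fit inside a NUMA region and tile it exactly.
    """
    layouts = []
    for threads in range(1, numa_cores + 1):
        fits_numa_region = numa_cores % threads == 0
        uses_all_cores = total_cores % threads == 0
        if fits_numa_region and uses_all_cores:
            layouts.append((total_cores // threads, threads))
    return layouts


for processes, threads in thread_layouts(total_cores=576, numa_cores=6):
    print(f"{processes} MPI processes x {threads} OpenMP threads")
```

Each layout returned would then be timed, and the fastest one reported.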

The systems used to obtain the benchmark results are described on the systems page.

Benchmarks

H2O-64

Description

Ab-initio molecular dynamics of liquid water using the Born-Oppenheimer approach, using Quickstep DFT. Production quality settings for the basis sets (TZV2P) and the planewave cutoff (280 Ry) are chosen, and the Local Density Approximation (LDA) is used for the calculation of the Exchange-Correlation energy. The configurations were generated by classical equilibration, and the initial guess of the electronic density is made based on Atomic Orbitals. The system contains 64 water molecules (192 atoms, 512 electrons) in a 12.4 Å³ cell and MD is run for 10 steps.

Availability

The benchmark is available (along with other water systems) from the CP2K source distribution: benchmarks/QS/

Results

The best configurations are shown below. Click the links to see more detail.

| Machine Name | Architecture | Date | Git Commit | Fastest time (s) | Configuration | Detailed results |
| --- | --- | --- | --- | --- | --- | --- |
| HECToR | Cray XE6 | 21/01/2014 | 82b8204 | 39.066 | 512 cores, 2 OMP threads per MPI task | hector-h2o-64 |
| ARCHER | Cray XC30 | 08/01/2014 | 292a983 | 18.11 | 576 cores, 1 OMP thread per MPI task | archer-h2o-64 |
| Magnus | Cray XC40 | 22/10/2014 | 27eacee | 17.275 | 384 cores, 1 OMP thread per MPI task | magnus-h2o-64 |
| Piz Daint | Cray XC30 | 12/05/2015 | f439118 | 19.885 | 192 cores, 1 OMP thread per MPI task, no GPU | piz-daint-h2o-64 |
| Cirrus | SGI ICE XA | 24/11/2016 | 989a92c | 15.560 | 1152 cores, 9 OMP threads per MPI task | cirrus-h2o-64 |
| Noctua | Cray CS500 | 25/09/2019 | 9f58d81 | 13.3 | 640 cores, 10 OMP threads per MPI task | noctua-h2o-64 |
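The fastest time alone hides how many cores were needed to reach it. The sketch below, offered only as one possible way of reading the H2O-64 figures, multiplies each fastest time by its core count to give a resource cost in core-seconds. The times and core counts are copied from the results above; the helper name core_seconds is an assumption of this example. The runs use different CP2K versions and different processor generations, so the ranking is indicative at best.

```python
# (machine, fastest time in seconds, cores) copied from the H2O-64 results
results = [
    ("HECToR", 39.066, 512),
    ("ARCHER", 18.11, 576),
    ("Magnus", 17.275, 384),
    ("Piz Daint", 19.885, 192),
    ("Cirrus", 15.560, 1152),
    ("Noctua", 13.3, 640),
]


def core_seconds(time_s, cores):
    """Resource cost of a run: wall-clock time multiplied by cores used."""
    return time_s * cores


# Sort from cheapest to most expensive in core-seconds.
ranked = sorted(results, key=lambda row: core_seconds(row[1], row[2]))
for machine, time_s, cores in ranked:
    print(f"{machine:10s} {core_seconds(time_s, cores):10.1f} core-seconds")
```

By this measure the 192-core Piz Daint run is the cheapest, even though it is not the fastest in wall-clock time.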

Fayalite-FIST

Description

This is a short molecular dynamics run of 1000 time steps in an NPT ensemble at 300 K. It consists of 28000 atoms - a 10³ supercell with 28 atoms of iron silicate (Fe2SiO4, also known as Fayalite) per unit cell. The simulation employs a classical potential (Morse with a hard-core repulsive term and 5.5 Å cutoff) with long-range electrostatics using Smoothed Particle Mesh Ewald (SPME) summation. While CP2K does support classical potentials via the Frontiers In Simulation Technology (FIST) module, this is not a typical calculation for CP2K but is included to give an impression of the performance difference between machines for the MM part of a QM/MM calculation.
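The system size quoted above can be checked with a few lines of arithmetic. The sketch below assumes the supercell is 10 × 10 × 10 unit cells, which is the reading consistent with 28 atoms per unit cell and 28000 atoms in total. The variable names are illustrative only.

```python
# Atoms in one Fe2SiO4 formula unit: 2 Fe + 1 Si + 4 O
atoms_per_formula_unit = 2 + 1 + 4

# From the description: 28 atoms per unit cell
atoms_per_unit_cell = 28
formula_units_per_cell = atoms_per_unit_cell // atoms_per_formula_unit

# Assumed supercell: 10 replicas along each of the three lattice directions
replicas = 10 ** 3
total_atoms = atoms_per_unit_cell * replicas

print(f"formula units per unit cell: {formula_units_per_cell}")
print(f"atoms in the supercell:      {total_atoms}")
```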

Availability

The benchmark is available from the CP2K source distribution: benchmarks/Fist/

Results

The best configurations are shown below. Click the links to see more detail.

| Machine Name | Architecture | Date | Git Commit | Fastest time (s) | Configuration | Detailed results |
| --- | --- | --- | --- | --- | --- | --- |
| HECToR | Cray XE6 | 21/01/2014 | 82b8204 | 403.928 | 2048 cores, 4 OMP threads per MPI task | hector-fayalite-fist |
| ARCHER | Cray XC30 | 09/01/2014 | 292a983 | 197.117 | 576 cores, 6 OMP threads per MPI task | archer-fayalite-fist |
| Magnus | Cray XC40 | 06/11/2014 | 27eacee | 150.493 | 768 cores, 6 OMP threads per MPI task | magnus-fayalite-fist |
| Piz Daint | Cray XC30 | 12/05/2015 | f439118 | 207.972 | 512 cores, 2 OMP threads per MPI task, no GPU | piz-daint-fayalite-fist |
| Cirrus | SGI ICE XA | 24/11/2016 | 989a92c | 166.192 | 576 cores, 2 OMP threads per MPI task | cirrus-fayalite-fist |
| Noctua | Cray CS500 | 25/09/2019 | 9f58d81 | 119.820 | 2560 cores, 10 OMP threads per MPI task | noctua-fayalite-fist |

LiH-HFX

Description

This is a single-point energy calculation using Quickstep GAPW (Gaussian and Augmented Plane-Waves) with hybrid Hartree-Fock exchange. It consists of a 216 atom Lithium Hydride crystal with 432 electrons in a 12.3 Å³ cell. These types of calculations are generally around one hundred times the computational cost of a standard local DFT calculation, although this can be reduced using the Auxiliary Density Matrix Method (ADMM). Using OpenMP is of particular benefit here as the HFX implementation requires a large amount of memory to store partial integrals. By using several threads, fewer MPI processes share the available memory on the node and thus enough memory is available to avoid recomputing any integrals on-the-fly, improving performance.
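The memory argument above can be made concrete with a small sketch. The function name memory_per_process_gb and the node used in the example (64 GB of memory, 24 cores) are hypothetical values chosen for illustration; they are not taken from any of the benchmarked systems. The sketch also assumes node memory is divided evenly between the MPI processes on the node.

```python
def memory_per_process_gb(node_memory_gb, cores_per_node, threads_per_process):
    """Memory available to each MPI process when a full node is used.

    Assumes one thread per core and an even split of node memory
    between the MPI processes placed on that node.
    """
    if cores_per_node % threads_per_process != 0:
        raise ValueError("threads per process must divide cores per node")
    processes_per_node = cores_per_node // threads_per_process
    return node_memory_gb / processes_per_node


# Hypothetical node: 64 GB of memory and 24 cores
for threads in (1, 2, 6):
    share = memory_per_process_gb(64, 24, threads)
    print(f"{threads} thread(s) per process: {share:.2f} GB per MPI process")
```

With more threads per process, each process has a larger share of memory in which to keep stored integrals.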

Availability

The benchmark is available from benchmarks/QS_LiH_HFX/.

Results

The best configurations are shown below. Click the links to see more detail.

| Machine Name | Architecture | Date | Git Commit | Fastest time (s) | Configuration | Detailed results |
| --- | --- | --- | --- | --- | --- | --- |
| HECToR | Cray XE6 | 21/01/2014 | 82b8204 (*) | 121.362 | 65536 cores, 8 OMP threads per MPI task | hector-lih-hfx |
| ARCHER | Cray XC30 | 09/01/2014 | 292a983 (*) | 51.172 | 49152 cores, 6 OMP threads per MPI task | archer-lih-hfx |
| Magnus | Cray XC40 | 10/11/2014 | 27eacee (*) | 62.075 | 24576 cores, 4 OMP threads per MPI task | magnus-lih-hfx |
| Piz Daint | Cray XC30 | 12/05/2015 | f439118 | 66.051 | 32768 cores, 4 OMP threads per MPI task, no GPU | piz-daint-lih-hfx |
| Cirrus | SGI ICE XA | 24/11/2016 | 989a92c | 483.676 | 2016 cores, 6 OMP threads per MPI task | cirrus-lih-hfx |

(*) Prior to r14945, a bug resulted in an underestimation of the number of ERIs which should be computed (by roughly 50% for this benchmark). Therefore these results cannot be compared directly with later ones.

H2O-DFT-LS

Description

This is a single-point energy calculation using linear-scaling DFT. It consists of 6144 atoms in a 39 Å³ box (2048 water molecules in total). An LDA functional is used with a DZVP MOLOPT basis set and a 300 Ry cut-off. For large systems the linear-scaling approach for solving the Self-Consistent-Field equations is much cheaper computationally than standard DFT, and allows scaling up to 1 million atoms for simple systems. The linear scaling cost results from the fact that the algorithm is based on an iteration on the density matrix. The cubically-scaling orthogonalisation step of standard Quickstep DFT using OT is avoided, and the key operation is sparse matrix-matrix multiplication, where the matrices have a number of non-zero entries that scales linearly with system size. These operations are implemented efficiently in the DBCSR library.
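To illustrate why the number of non-zero entries grows linearly, the toy model below counts matrix entries for sites on a periodic one-dimensional chain, keeping an entry only when two sites lie within a fixed cutoff. This is a deliberately simplified sketch: the function count_nonzeros and the chain geometry are assumptions made for this example, and it does not use or represent the DBCSR library.

```python
def count_nonzeros(n_sites, cutoff):
    """Count entries (i, j) whose sites are within `cutoff` of each other.

    Sites sit at integer positions on a periodic 1D chain, so the
    separation is the shorter of the two ways around the ring.
    """
    count = 0
    for i in range(n_sites):
        for j in range(n_sites):
            direct = abs(i - j)
            separation = min(direct, n_sites - direct)
            if separation <= cutoff:
                count += 1
    return count


for n_sites in (100, 200, 400):
    nonzeros = count_nonzeros(n_sites, cutoff=3)
    dense = n_sites * n_sites
    print(f"{n_sites} sites: {nonzeros} non-zeros out of {dense} entries")
```

Doubling the number of sites doubles the non-zero count, whereas the dense matrix size grows fourfold.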

Availability

The benchmark input file used to generate these results is available here.

It is a slightly modified version of the more general input in the CP2K GitHub repository at benchmarks/QS_DM_LS/H2O-dft-ls.inp, where the problem size can be tuned with the parameter NREP.

Results

The best configurations are shown below. Click the links to see more detail.

| Machine Name | Architecture | Date | Git Commit | Fastest time (s) | Configuration | Detailed results |
| --- | --- | --- | --- | --- | --- | --- |
| HECToR | Cray XE6 | 16/01/2014 | 82b8204 | 98.256 | 65536 cores, 8 OMP threads per MPI task | hector-h2o-dft-ls |
| ARCHER | Cray XC30 | 08/01/2014 | 292a983 | 28.476 | 49152 cores, 4 OMP threads per MPI task | archer-h2o-dft-ls |
| Magnus | Cray XC40 | 03/12/2014 | 27eacee | 30.921 | 24576 cores, 2 OMP threads per MPI task | magnus-h2o-dft-ls |
| Piz Daint | Cray XC30 | 12/05/2015 | f439118 | 27.900 | 32768 cores, 2 OMP threads per MPI task, no GPU | piz-daint-h2o-dft-ls |
| Cirrus | SGI ICE XA | 24/11/2016 | 989a92c | 543.032 | 2016 cores, 2 OMP threads per MPI task | cirrus-h2o-dft-ls |
| Noctua | Cray CS500 | 25/09/2019 | 9f58d81 | 37.730 | 10240 cores, 10 OMP threads per MPI task | noctua-h2o-dft-ls |

H2O-64-RI-MP2

Description

This benchmark is a single-point energy calculation using 2nd order Møller-Plesset perturbation theory (MP2) with the Resolution-of-the-Identity approximation to calculate the exchange-correlation energy. The system consists of 64 water molecules in a 12.4 Å³ cell. This is exactly the same system as used by H2O-64 but using a much more accurate model, which is around 100 times more computationally demanding than standard DFT calculations.

Availability

The benchmark is in the CP2K GitHub repository at: benchmarks/QS_mp2_rpa/64-H2O/.

Results

The best configurations are shown below. Click the links to see more detail.

| Machine Name | Architecture | Date | Git Commit | Fastest time (s) | Configuration | Detailed results |
| --- | --- | --- | --- | --- | --- | --- |
| HECToR | Cray XE6 | 13/01/2014 | 82b8204 | 141.633 | 49152 cores, 8 OMP threads per MPI task | hector-h2o-64-ri-mp2 |
| ARCHER | Cray XC30 | 09/01/2014 | 292a983 | 83.945 | 36864 cores, 4 OMP threads per MPI task | archer-h2o-64-ri-mp2 |
| Magnus | Cray XC40 | 04/11/2014 | 27eacee | 63.891 | 24576 cores, 6 OMP threads per MPI task | magnus-h2o-64-ri-mp2 |
| Piz Daint | Cray XC30 | 12/05/2015 | f439118 | 48.15 | 32768 cores, 8 OMP threads per MPI task, no GPU | piz-daint-h2o-64-ri-mp2 |
| Cirrus | SGI ICE XA | 24/11/2016 | 989a92c | 303.571 | 2016 cores, 1 OMP thread per MPI task | cirrus-h2o-64-ri-mp2 |
| Noctua | Cray CS500 | 25/09/2019 | 9f58d81 | 82.571 | 10240 cores, 2 OMP threads per MPI task | noctua-h2o-64-ri-mp2 |