How to Compile CP2K with CUDA Support
Currently, three major operations in CP2K support CUDA acceleration:
- Anything that uses dbcsr_multiply, i.e. sparse matrix multiplication, when compiled with -D__ACC -D__DBCSR_ACC. This benefits in particular the linear scaling DFT code. See also the DBCSR project.
- FFTs, when compiled with -D__PW_CUDA.
- scalapack/blas calls (in particular pdgemm/pdsyrk/dgemm), if CP2K is linked against an accelerated library that executes these calls on the GPU. The impact of this is most visible for MP2 and RPA calculations. On the hybrid Cray XC50, linking against cray-libsci_acc makes this happen.
To enable all CUDA acceleration options, the following lines have to be added to the ARCH-file:
NVCC = /path_to_cuda/bin/nvcc
DFLAGS += -D__ACC -D__DBCSR_ACC -D__PW_CUDA
LIBS += -lcudart -lcublas -lcufft -lnvrtc
See here for details. As a prerequisite, the Nvidia CUDA Toolkit has to be installed.
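A quick sanity check of an ARCH-file can catch a forgotten flag before a long build. The following is only a minimal sketch, not part of CP2K: the helper names `collect_tokens` and `missing_cuda_settings` are hypothetical, the required flags and libraries are simply the ones listed above, and the parser assumes plain single-line `VAR =`, `VAR :=` or `VAR +=` assignments (no backslash continuations, no make functions).

```python
import re

# Flags and libraries taken from the ARCH-file lines above.
REQUIRED_DFLAGS = {"-D__ACC", "-D__DBCSR_ACC", "-D__PW_CUDA"}
REQUIRED_LIBS = {"-lcudart", "-lcublas", "-lcufft", "-lnvrtc"}


def collect_tokens(arch_text, variable):
    """Gather all whitespace-separated tokens assigned to a make variable."""
    pattern = re.compile(
        r"^\s*" + re.escape(variable) + r"\s*(?:\+=|:=|=)\s*(.*)$"
    )
    tokens = set()
    for line in arch_text.splitlines():
        line = line.split("#", 1)[0]  # drop make comments
        match = pattern.match(line)
        if match:
            tokens.update(match.group(1).split())
    return tokens


def missing_cuda_settings(arch_text):
    """Return the required settings that the ARCH text does not mention."""
    missing = {
        "DFLAGS": sorted(REQUIRED_DFLAGS - collect_tokens(arch_text, "DFLAGS")),
        "LIBS": sorted(REQUIRED_LIBS - collect_tokens(arch_text, "LIBS")),
    }
    if not collect_tokens(arch_text, "NVCC"):
        missing["NVCC"] = ["<path to nvcc>"]
    # Keep only the variables that actually lack something.
    return {name: items for name, items in missing.items() if items}
```

Calling `missing_cuda_settings(open(path).read())` on an ARCH-file returns an empty dictionary when all three settings are present, and otherwise lists what is missing per variable.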
Libcusmm
The acceleration of DBCSR is performed by libcusmm. This library provides a number of kernels, each of which multiplies blocks of one specific block size. The block sizes of a simulation are determined by the employed basis set. As of DBCSR 1.1, libcusmm is by default able to generate any kernel for {m,n,k}≤80; see here for more details. The DBCSR statistics are printed at the end of every CP2K run, for example:
-------------------------------------------------------------------------------
-                                                                             -
-                              DBCSR STATISTICS                               -
-                                                                             -
-------------------------------------------------------------------------------
COUNTER                                     CPU               ACC      ACC%
number of processed stacks                  160                64      28.6
matmuls inhomo. stacks                    11880                 0       0.0
matmuls total                            132360             53530      28.8
flops 13 x 13 x 13                            0          33218640     100.0
flops 24 x 13 x 13                            0          55177824     100.0
...
flops total                          1452705420         657928368      31.2
marketing flops                      2048000000
-------------------------------------------------------------------------------
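To read the ACC% column: in the sample above, the values are consistent with ACC / (CPU + ACC), e.g. 64 / (160 + 64) ≈ 28.6 %. This is inferred from the printed numbers rather than taken from the DBCSR documentation, so treat the following as a hedged sketch. The helper names `acc_share` and `parse_dbcsr_row` are hypothetical, and the parser assumes that each data row ends in the three fields CPU, ACC and ACC%.

```python
def acc_share(cpu, acc):
    """Percentage of the work done on the accelerator, to one decimal.

    Assumption: ACC% = ACC / (CPU + ACC), inferred from the sample output.
    """
    total = cpu + acc
    return round(100.0 * acc / total, 1) if total else 0.0


def parse_dbcsr_row(line):
    """Split a statistics row into (counter name, CPU count, ACC count).

    Assumes the last three whitespace-separated fields are CPU, ACC, ACC%.
    """
    fields = line.split()
    name = " ".join(fields[:-3])
    return name, int(fields[-3]), int(fields[-2])


# Rows copied from the sample output above.
rows = [
    "number of processed stacks 160 64 28.6",
    "matmuls total 132360 53530 28.8",
    "flops total 1452705420 657928368 31.2",
]
for row in rows:
    name, cpu, acc = parse_dbcsr_row(row)
    print(f"{name}: {acc_share(cpu, acc)} % on the accelerator")
```

A low share in the "flops total" row, together with "flops m x n x k" rows that show all work in the CPU column, is a hint that no GPU kernel was used for those block sizes.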
Support for more GPUs can be added; please refer to this howto.
Profiling
If you are interested in profiling CP2K with nvprof, have a look at these remarks.