How to Compile CP2K with CUDA Support

Currently, three major operations in CP2K support CUDA acceleration:

Anything that uses dbcsr_multiply, i.e. sparse matrix multiplication, when compiled with -D__ACC -D__DBCSR_ACC. This particularly benefits the linear-scaling DFT code. See also the DBCSR project.
FFTs, when compiled with -D__PW_CUDA.
Dense linear algebra, when CP2K is linked against an accelerated ScaLAPACK/BLAS library that executes its calls (in particular pdgemm/pdsyrk/dgemm) on the GPU. The impact is most visible for MP2 and RPA calculations. On the hybrid Cray XC50, linking against cray-libsci_acc enables this.

To enable all CUDA acceleration options, add the following lines to the ARCH file:

NVCC    = /path_to_cuda/bin/nvcc
DFLAGS += -D__ACC -D__DBCSR_ACC -D__PW_CUDA
LIBS   += -lcudart -lcublas -lcufft -lnvrtc

See here for details. As a prerequisite, the NVIDIA CUDA Toolkit has to be installed.
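Before starting a long build, it can be useful to check that an ARCH file really contains the settings listed above. The following is a minimal sketch in Python, not part of CP2K: the function names `collect_tokens` and `missing_cuda_settings` are hypothetical, and the parser assumes simple one-line `VAR =`, `VAR :=` or `VAR +=` assignments without backslash continuations.

```python
import re

# Settings taken from the ARCH-file lines shown above.
REQUIRED_DFLAGS = {"-D__ACC", "-D__DBCSR_ACC", "-D__PW_CUDA"}
REQUIRED_LIBS = {"-lcudart", "-lcublas", "-lcufft", "-lnvrtc"}


def collect_tokens(arch_text, variable):
    """Collect every whitespace-separated token assigned to `variable`.

    Assumption: assignments fit on one line (no backslash continuation).
    """
    pattern = re.compile(r"^\s*" + re.escape(variable) + r"\s*[+:]?=\s*(.*)$")
    tokens = set()
    for line in arch_text.splitlines():
        line = line.split("#", 1)[0]  # drop Makefile comments
        match = pattern.match(line)
        if match:
            tokens.update(match.group(1).split())
    return tokens


def missing_cuda_settings(arch_text):
    """Report which of the CUDA-related settings are absent from the text."""
    return {
        "NVCC": not collect_tokens(arch_text, "NVCC"),  # True if NVCC is unset
        "DFLAGS": sorted(REQUIRED_DFLAGS - collect_tokens(arch_text, "DFLAGS")),
        "LIBS": sorted(REQUIRED_LIBS - collect_tokens(arch_text, "LIBS")),
    }


if __name__ == "__main__":
    example = (
        "NVCC    = /path_to_cuda/bin/nvcc\n"
        "DFLAGS += -D__ACC -D__DBCSR_ACC -D__PW_CUDA\n"
        "LIBS   += -lcudart -lcublas -lcufft -lnvrtc\n"
    )
    print(missing_cuda_settings(example))
```

An empty list for DFLAGS and LIBS, together with `False` for NVCC, means that all three lines are present. The check only inspects the text; it does not verify that the CUDA Toolkit is installed at the given path.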

Libcusmm

The acceleration of DBCSR is performed by libcusmm. This library provides a number of kernels, each of which multiplies blocks of specific sizes. The block sizes of a simulation are determined by the basis set employed. As of DBCSR 1.1, libcusmm can by default generate any kernel for {m,n,k} ≤ 80; see here for more details. The DBCSR statistics are printed at the end of every CP2K run, for example:

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                DBCSR STATISTICS                             -
 -                                                                             -
 -------------------------------------------------------------------------------
 COUNTER                                      CPU                  ACC      ACC%
 number of processed stacks                   160                   64      28.6
 matmuls inhomo. stacks                     11880                    0       0.0
 matmuls total                             132360                53530      28.8
 flops  13 x   13 x   13                        0             33218640     100.0
 flops  24 x   13 x   13                        0             55177824     100.0
...
 flops total                           1452705420            657928368      31.2
 marketing flops                       2048000000
 -------------------------------------------------------------------------------
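To see how much of the work actually ran on the GPU, the counters in this block can be read programmatically. The sketch below is a hypothetical helper, not a CP2K or DBCSR tool. It assumes the column layout shown in the example above, and it assumes that the ACC% column equals ACC / (CPU + ACC), which is consistent with the sample numbers (64 / (160 + 64) gives 28.6%) but is inferred here rather than taken from the DBCSR documentation.

```python
import re

# A counter row: a name, at least two spaces, then CPU, ACC and ACC% columns.
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s{2,}(?P<cpu>\d+)\s+(?P<acc>\d+)\s+(?P<pct>\d+\.\d+)\s*$"
)


def parse_dbcsr_statistics(text):
    """Return {counter name: (cpu, acc)} for every three-column counter row."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            # Collapse the padding inside names such as "flops  13 x   13 x   13".
            name = " ".join(match.group("name").split())
            rows[name] = (int(match.group("cpu")), int(match.group("acc")))
    return rows


def acc_percent(cpu, acc):
    """Share of work done on the accelerator (assumed definition of ACC%)."""
    total = cpu + acc
    return 100.0 * acc / total if total else 0.0


if __name__ == "__main__":
    sample = """
 COUNTER                                      CPU                  ACC      ACC%
 number of processed stacks                   160                   64      28.6
 matmuls total                             132360                53530      28.8
 flops  13 x   13 x   13                        0             33218640     100.0
 flops total                           1452705420            657928368      31.2
"""
    for name, (cpu, acc) in parse_dbcsr_statistics(sample).items():
        print(f"{name}: {acc_percent(cpu, acc):.1f}% on ACC")
```

Rows of the form `flops m x n x k` whose ACC share is well below 100% indicate block sizes that were, at least partly, processed on the CPU.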

Support for additional GPUs can be added; please refer to this howto.

Profiling

If you are interested in profiling CP2K with nvprof, have a look at these remarks.