====== How to Compile CP2K with CUDA Support ======
Currently three major groups of operations in CP2K support CUDA acceleration:
  * Anything that uses ''dbcsr_multiply'', i.e. sparse matrix multiplication, when compiled with ''%%-D__ACC -D__DBCSR_ACC%%''. This benefits in particular the [[doi>10.1021/ct200897x| linear scaling DFT]] code. See also [[http://dbcsr.cp2k.org | the DBCSR project]].
  * FFTs, when compiled with ''%%-D__PW_CUDA%%''.
  * Dense linear algebra (in particular pdgemm/pdsyrk/dgemm), when CP2K is linked against an accelerated ScaLAPACK/BLAS library that executes these calls on the GPU. The impact is most visible for MP2 and RPA calculations. On the hybrid Cray XC50, linking against cray-libsci_acc makes this happen.
To enable all CUDA acceleration options the following lines have to be added to the ARCH-file:
  NVCC    = /path_to_cuda/bin/nvcc
  DFLAGS += -D__ACC -D__DBCSR_ACC -D__PW_CUDA
  LIBS   += -lcudart -lcublas -lcufft -lnvrtc
See [[https://github.com/cp2k/cp2k/blob/master/INSTALL.md#2j-cuda-optional-improved-performance-on-gpu-systems | the CUDA section of INSTALL.md]] for details.
As a prerequisite, the [[https://developer.nvidia.com/cuda-toolkit | NVIDIA CUDA Toolkit]] has to be installed.
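The following is a minimal sketch, not part of CP2K, of how one might check that an ARCH-file contains the settings listed above. The function names ''collect_tokens'' and ''missing_cuda_settings'' are hypothetical, and the parsing is deliberately simplified: it assumes one assignment per line and ignores backslash line continuations and variable expansion.

```python
import re

# Settings taken from the ARCH-file lines above.
REQUIRED_DFLAGS = ("-D__ACC", "-D__DBCSR_ACC", "-D__PW_CUDA")
REQUIRED_LIBS = ("-lcudart", "-lcublas", "-lcufft", "-lnvrtc")


def collect_tokens(arch_text, variable):
    """Collect all whitespace-separated tokens assigned to a make variable.

    Simplification: '=', ':=' and '+=' are all treated as appending.
    """
    pattern = re.compile(r"^\s*" + re.escape(variable) + r"\s*(?:\+=|:=|=)\s*(.*)$")
    tokens = []
    for line in arch_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = pattern.match(line)
        if match:
            tokens.extend(match.group(1).split())
    return tokens


def missing_cuda_settings(arch_text):
    """Return the CUDA-related settings that the ARCH-file text lacks."""
    dflags = collect_tokens(arch_text, "DFLAGS")
    libs = collect_tokens(arch_text, "LIBS")
    missing = [flag for flag in REQUIRED_DFLAGS if flag not in dflags]
    missing += [lib for lib in REQUIRED_LIBS if lib not in libs]
    if not collect_tokens(arch_text, "NVCC"):
        missing.append("NVCC")
    return missing


# Example ARCH-file fragment in which the FFT flag was forgotten.
example = """
NVCC = /path_to_cuda/bin/nvcc
DFLAGS += -D__ACC -D__DBCSR_ACC
LIBS += -lcudart -lcublas -lcufft -lnvrtc
"""
print(missing_cuda_settings(example))  # ['-D__PW_CUDA']
```

An empty list means that all three ''DFLAGS'', the four libraries and ''NVCC'' were found. It does not verify that the path given for ''NVCC'' actually exists.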
===== Libcusmm =====
The acceleration of DBCSR is performed by libcusmm. This library provides a number of kernels, each of which multiplies blocks of specific block sizes. The block sizes of a simulation are determined by the employed basis set. As of DBCSR 1.1, libcusmm is by default able to generate any kernel for {m,n,k}≤80; see [[ https://github.com/cp2k/dbcsr/blob/develop/src/acc/libsmm_acc/libcusmm/README.md | the libcusmm README]] for more details. The //DBCSR Statistics// printed at the end of every CP2K run show how the work was split between the CPU and the accelerator (ACC), for example:
  -------------------------------------------------------------------------------
  -                                                                             -
  -                              DBCSR STATISTICS                               -
  -                                                                             -
  -------------------------------------------------------------------------------
  COUNTER                                   CPU            ACC      ACC%
  number of processed stacks                160             64      28.6
  matmuls inhomo. stacks                  11880              0       0.0
  matmuls total                          132360          53530      28.8
  flops 13 x 13 x 13                          0       33218640     100.0
  flops 24 x 13 x 13                          0       55177824     100.0
  ...
  flops total                        1452705420      657928368      31.2
  marketing flops                    2048000000
  -------------------------------------------------------------------------------
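The figures in the example are consistent with ACC% being the accelerator's share of the total, i.e. 100·ACC/(CPU+ACC). The sketch below recomputes that share from rows copied out of the example and extracts the (m,n,k) block sizes from the ''flops m x n x k'' rows. The helper names ''parse_dbcsr_stats'' and ''block_shares'' are hypothetical, and the regular expressions assume the three-column layout shown above, which may differ in other CP2K versions.

```python
import re

# Rows copied from the example output above (elided rows left out).
STATS = """\
COUNTER                                   CPU            ACC      ACC%
number of processed stacks                160             64      28.6
matmuls inhomo. stacks                  11880              0       0.0
matmuls total                          132360          53530      28.8
flops 13 x 13 x 13                          0       33218640     100.0
flops 24 x 13 x 13                          0       55177824     100.0
flops total                        1452705420      657928368      31.2
marketing flops                    2048000000
"""

# A data row: counter name, CPU count, ACC count, printed percentage.
ROW = re.compile(r"^\s*(?P<name>.+?)\s+(?P<cpu>\d+)\s+(?P<acc>\d+)\s+(?P<pct>\d+\.\d+)\s*$")
# A per-block-size row such as "flops 13 x 13 x 13".
BLOCK = re.compile(r"^flops\s+(\d+)\s*x\s*(\d+)\s*x\s*(\d+)$")

DEFAULT_MAX_DIM = 80  # default libcusmm limit on m, n and k stated in the text


def parse_dbcsr_stats(text):
    """Return (name, cpu, acc, acc_share_percent) for every three-column row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, box lines and single-value rows are skipped
        cpu, acc = int(match["cpu"]), int(match["acc"])
        total = cpu + acc
        share = 100.0 * acc / total if total else 0.0
        rows.append((" ".join(match["name"].split()), cpu, acc, share))
    return rows


def block_shares(rows):
    """Map each (m, n, k) block size to its accelerator share in percent."""
    shares = {}
    for name, _cpu, _acc, share in rows:
        match = BLOCK.match(name)
        if match:
            shares[tuple(int(g) for g in match.groups())] = share
    return shares


rows = parse_dbcsr_stats(STATS)
for name, cpu, acc, share in rows:
    print(f"{name:<28}{share:6.1f}")

for dims, share in block_shares(rows).items():
    within_default = max(dims) <= DEFAULT_MAX_DIM
    print(dims, f"{share:.1f}", "within default range" if within_default else "above default range")
```

Rounded to one decimal place, the recomputed shares reproduce the ACC% column of the example (28.6, 0.0, 28.8, 100.0, 100.0 and 31.2). The ''marketing flops'' row carries a single value and is therefore skipped by this parser.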
Support for additional GPUs can be added; please refer to [[https://github.com/cp2k/dbcsr/blob/develop/src/acc/libsmm_acc/libcusmm/tune.md | this tuning howto]].
===== Profiling =====
If you are interested in profiling CP2K with nvprof, have a look at [[dev:profiling#nvprof | these remarks]].