howto:libcusmm
Differences
This shows you the differences between two versions of the page.
howto:libcusmm [2014/10/27 18:16] – oschuett | howto:libcusmm [2019/04/09 12:45] (current) – removed alazzaro | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Howto Optimize CUDA Kernels for Libcusmm ====== | ||
- | === Step 1: Go to the libcusmm directory === | ||
- | < | ||
- | $ cd $CP2K_ROOT/ | ||
- | </ | ||
- | === Step 2: Adapt tune.py for your Environment === | ||
- | The '' | ||
- | <code python> | ||
- | ... | ||
- | def gen_jobfile(outdir, | ||
- | t = "/ | ||
- | all_exe_src = [basename(fn) for fn in glob(outdir+t+" | ||
- | all_exe = sorted([fn.replace(" | ||
- | |||
- | output = "# | ||
- | output += "# | ||
- | output += "# | ||
- | output += "# | ||
- | output += " | ||
- | output += " | ||
- | output += " | ||
- | output += " | ||
- | output += " | ||
- | output += "cd $SLURM_SUBMIT_DIR \n" | ||
- | output += " | ||
- | output += " | ||
- | for exe in all_exe: | ||
- | output += "aprun -b -n 1 -N 1 -d 8 make -j 16 %s & | ||
- | ... | ||
- | </ | ||
- | |||
- | === Step 3: Run the script tune.py === | ||
- | The script takes as arguments the block sizes you want to add to libcusmm. For example, if your system contains blocks of sizes 5 and 8, type: | ||
- | < | ||
- | $ ./tune.py 5 8 | ||
- | Found 23 parameter sets for 5x5x5 | ||
- | Found 31 parameter sets for 5x5x8 | ||
- | Found 107 parameter sets for 5x8x5 | ||
- | Found 171 parameter sets for 5x8x8 | ||
- | Found 75 parameter sets for 8x5x5 | ||
- | Found 107 parameter sets for 8x5x8 | ||
- | Found 248 parameter sets for 8x8x5 | ||
- | Found 424 parameter sets for 8x8x8 | ||
- | </ | ||
- | |||
- | The script will create a directory for each combination of the blocksizes: | ||
- | < | ||
- | $ ls -d tune_* | ||
- | tune_5x5x5 | ||
- | </ | ||
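The eight combinations reported for block sizes 5 and 8 are consistent with taking every ordered (m, n, k) triple of the requested sizes. The following is a minimal sketch of that enumeration, assuming this is how the directory names arise; `tune_dirs` is a hypothetical helper for illustration and not a function taken from tune.py.

```python
from itertools import product


def tune_dirs(blocksizes):
    """Return one directory name per ordered (m, n, k) triple of block sizes."""
    # Every ordered triple is a separate kernel shape, so N sizes give N**3 names.
    return ["tune_%dx%dx%d" % (m, n, k)
            for m, n, k in product(blocksizes, repeat=3)]


if __name__ == "__main__":
    for name in tune_dirs([5, 8]):
        print(name)
```

With two block sizes this yields 2**3 = 8 names, in the same order as the parameter-set counts printed in Step 3.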
- | |||
- | Each directory contains a number of files: | ||
- | < | ||
- | $ ls -1 tune_8x8x8/ | ||
- | Makefile | ||
- | tune_8x8x8_exe0_main.cu | ||
- | tune_8x8x8_exe0_part0.cu | ||
- | tune_8x8x8_exe0_part1.cu | ||
- | tune_8x8x8_exe0_part2.cu | ||
- | tune_8x8x8_exe0_part3.cu | ||
- | tune_8x8x8_exe0_part4.cu | ||
- | tune_8x8x8.job | ||
- | </ | ||
- | For each possible parameter-set a // | ||
- | |||
- | In order to parallelize the benchmarking the launchers are distributed over multiple executables. | ||
- | Currently, up to 10000 launchers are benchmarked by one // | ||
- | |||
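The distribution of launchers over executables can be pictured as simple chunking. This is only a sketch under the assumption that launchers are grouped consecutively; the limit of 10000 comes from the text above, while `split_launchers` and its parameter names are hypothetical and not part of tune.py.

```python
def split_launchers(launchers, max_per_exe=10000):
    """Split a list of launchers into consecutive groups of at most max_per_exe."""
    # Each group would be compiled into its own executable (exe0, exe1, ...).
    return [launchers[i:i + max_per_exe]
            for i in range(0, len(launchers), max_per_exe)]
```

Under this assumption, the 424 parameter sets found for 8x8x8 fit into a single group, which matches the directory listing above showing only files for `exe0`.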
- | === Step 4: Adapt submit.py for your Environment === | ||
- | The script '' | ||
- | |||
- | === Step 5: Submit Jobs === | ||
- | Each tune-directory contains a job file. | ||
- | Since there might be many tune-directories, the convenience script '' | ||
- | |||
- | When '' | ||
- | < | ||
- | $ ./ | ||
- | tune_5x5x5: Would submit, run with " | ||
- | tune_5x5x8: Would submit, run with " | ||
- | tune_5x8x5: Would submit, run with " | ||
- | tune_5x8x8: Would submit, run with " | ||
- | tune_8x5x5: Would submit, run with " | ||
- | tune_8x5x8: Would submit, run with " | ||
- | tune_8x8x5: Would submit, run with " | ||
- | tune_8x8x8: Would submit, run with " | ||
- | Number of jobs submitted: 8 | ||
- | </ | ||
- | |||
- | Only when '' | ||
- | < | ||
- | $ ./submit.py doit! | ||
- | tune_5x5x5: Submitting | ||
- | Submitted batch job 277987 | ||
- | tune_5x5x8: Submitting | ||
- | Submitted batch job 277988 | ||
- | tune_5x8x5: Submitting | ||
- | Submitted batch job 277989 | ||
- | tune_5x8x8: Submitting | ||
- | Submitted batch job 277990 | ||
- | tune_8x5x5: Submitting | ||
- | Submitted batch job 277991 | ||
- | tune_8x5x8: Submitting | ||
- | Submitted batch job 277992 | ||
- | tune_8x8x5: Submitting | ||
- | Submitted batch job 277993 | ||
- | tune_8x8x8: Submitting | ||
- | Submitted batch job 277994 | ||
- | Number of jobs submitted: 8 | ||
- | </ | ||
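The two runs above follow a common dry-run pattern: nothing is submitted unless an explicit confirmation argument is given. The sketch below illustrates that pattern and is not the actual submit.py. It assumes a SLURM system with `sbatch` on the path and a job file named after its directory (as in `tune_8x8x8.job`); the function names and the dry-run message are hypothetical.

```python
import sys
from glob import glob
from os.path import basename, isdir
from subprocess import check_call


def sbatch(tune_dir):
    """Submit the job file of one tune directory (assumes SLURM's sbatch)."""
    check_call(["sbatch", basename(tune_dir) + ".job"], cwd=tune_dir)


def submit_all(tune_dirs, doit, submit=sbatch):
    """Visit all tune directories; call submit(dir) only when doit is True."""
    n_jobs = 0
    for d in sorted(tune_dirs):
        if doit:
            print("%s: Submitting" % d)
            submit(d)
        else:
            # Dry run: report what would happen without touching the queue.
            print("%s: Would submit (dry run)" % d)
        n_jobs += 1
    return n_jobs


if __name__ == "__main__":
    dirs = [d for d in glob("tune_*") if isdir(d)]
    n = submit_all(dirs, doit=(sys.argv[1:] == ["doit!"]))
    print("Number of jobs submitted: %d" % n)
```

Passing the submit function as a parameter keeps the dry-run logic testable without a batch system.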
- | |||
- | === Step 6: Collect Results === | ||
- | Run '' | ||
- | < | ||
- | $ ./ | ||
- | Reading: tune_5x5x5/ | ||
- | Reading: tune_5x5x8/ | ||
- | Reading: tune_5x8x5/ | ||
- | Reading: tune_5x8x8/ | ||
- | Reading: tune_8x5x5/ | ||
- | Reading: tune_8x5x8/ | ||
- | Reading: tune_8x8x5/ | ||
- | Reading: tune_8x8x8/ | ||
- | Kernel_dnt_tiny(m=5, | ||
- | Kernel_dnt_tiny(m=5, | ||
- | Kernel_dnt_medium(m=5, | ||
- | Kernel_dnt_tiny(m=5, | ||
- | Kernel_dnt_medium(m=8, | ||
- | Kernel_dnt_medium(m=8, | ||
- | Kernel_dnt_tiny(m=8, | ||
- | Kernel_dnt_tiny(m=8, | ||
- | </ |
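The output above lists one kernel per block-size triple, which suggests that the collection step keeps a single winner for each (m, n, k). The sketch below shows one way such a selection could work, assuming each benchmark yields a performance figure where higher is better. The record layout and the name `pick_winners` are hypothetical; the real log format is not shown on this page.

```python
def pick_winners(results):
    """Keep the best-performing kernel for every (m, n, k) triple.

    results: iterable of ((m, n, k), kernel_description, performance),
             a hypothetical record layout used only for this sketch.
    Returns a dict mapping (m, n, k) to (kernel_description, performance).
    """
    winners = {}
    for mnk, kernel, perf in results:
        # Replace the stored entry only if this kernel is strictly faster.
        if mnk not in winners or perf > winners[mnk][1]:
            winners[mnk] = (kernel, perf)
    return winners
```

Ties are resolved in favour of the kernel seen first, which keeps the result deterministic for a fixed reading order.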
howto/libcusmm.1414433767.txt.gz · Last modified: 2020/08/21 10:15 (external edit)