GeantV Geometry: SIMD abstraction and interfacing with CUDA

Johannes de Fine Licht ([email protected])
Sandro Wenzel ([email protected])
PH-SFT

Thanks to Michal Husejko and Romain Wartel at TechLab for providing hardware.

Abstracted algorithms
- Write common code for multiple types of SIMD "backends" (Vc/Cilk/CUDA).
- Eases code maintenance; requires only one kernel to be written for a given functionality.
- Must retain the performance of a low-level implementation.

Illustrating scalar/SIMD abstraction and kernels
[Diagram labels: Single particle interface; Vector interface; Scalar Looper; Vc Looper; Cilk Plus Looper; Thread access; External CUDA kernels; Backend instantiation; Scalar Backend; Vc Backend; Cilk Plus Backend; CUDA Backend; Kernel instantiation; C-like abstract kernels; C-like specialized kernels]

Generic kernels on abstract types
- Generic programming; write algorithms in a generic way on abstract types.
- Implementation depends on the backend.
- The backend determines the types on which the high-level operations are performed.

Implementations wrapped in types
- Operations are abstracted.
- Backends implement all operations.
- Vc already supports arithmetic.
- CUDA uses primitive types.
- Wrapper structs are implemented if needed.

Interfacing with CUDA
- Involving the GPU should be easy.
- Principle of common code.
- Issue: GPU code needs nvcc, but we want a different compiler for the CPU.
- GNU/Clang (C++11, Vc, Cilk) vs. nvcc (CUDA).

Dual namespaces
- Utilize distinct namespaces but the same code for CPU and GPU.
- Classes live in one or both environments.
- Provide abstraction to interface between environments.
[Diagram labels: namespace vecgeom; Host memory; vecgeom::CudaManager; vecgeom::UnplacedBox; vecgeom::LogicalVolume; …; Allocation; Synchronize content; Launch kernel; Instantiate; namespace vecgeom_cuda; GPU memory; CUDA kernel; vecgeom_cuda::UnplacedBox; vecgeom_cuda::LogicalVolume; …]

Compilation scheme
[Diagram labels: Source files .cpp + optional modules (ROOT, benchmarking…); Compile for C++11 with vector backend; namespace vecgeom; .o; Interface header; Source files .cu; CUDA interface .cu; Compile with NVCC with CUDA backend; namespace vecgeom_cuda; .o; Linking; Executable]

Status
- Common code for box methods runs for scalar, Vc, Cilk and CUDA backends.
- Implemented GPU memory synchronization.
- Dual namespaces allow dispatching work to CPU or GPU.
- Location can run in either environment.
- Results are in agreement.
[Figure: "Particle location in box geometry". x-axis: number of particles located; y-axis: speedup (ticks 0.0 to 4.5); series: GPU, GPU with overhead, CPU.]
Preliminary benchmarks run in hybrid CPU/GPU environment for four-level box geometry. Location of particles is a tree algorithm, so speedups are only seen at very high particle multiplicity.

Appendix: CPU/GPU same interface
- Main executable compiled with C++ compiler.
- Same code used.
- CUDA kernel compiled with NVCC.

Appendix: Common code

Appendix: Template hierarchy
[Diagram labels: Vector interface; Scalar Backend; Vc Backend; Cilk Plus Backend; CUDA Backend; External CUDA kernels; C-like abstract kernels]
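
Sketch: one kernel, several backends
The idea of writing a kernel once on abstract types and letting the backend decide the concrete types can be sketched roughly as below. This is a minimal illustration under stated assumptions, not code from the project: ScalarBackend, ArrayBackend, Real4, Mask4 and InsideBox are hypothetical names, and the fixed-width array type only stands in for a real SIMD type such as those a library like Vc would provide.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical scalar backend: the kernel processes one particle at a time.
struct ScalarBackend {
  using Real = double;
  using Mask = bool;
  static Real Abs(Real x) { return std::fabs(x); }
  static Mask Less(Real a, double b) { return a < b; }
  static Mask And(Mask a, Mask b) { return a && b; }
};

// Hypothetical fixed-width types standing in for SIMD vector and mask types.
constexpr std::size_t kLanes = 4;
struct Real4 { std::array<double, kLanes> v; };
struct Mask4 { std::array<bool, kLanes> v; };

// Hypothetical "vector" backend: every operation acts on all lanes.
struct ArrayBackend {
  using Real = Real4;
  using Mask = Mask4;
  static Real Abs(Real const &x) {
    Real r{};
    for (std::size_t i = 0; i < kLanes; ++i) r.v[i] = std::fabs(x.v[i]);
    return r;
  }
  static Mask Less(Real const &a, double b) {
    Mask m{};
    for (std::size_t i = 0; i < kLanes; ++i) m.v[i] = a.v[i] < b;
    return m;
  }
  static Mask And(Mask const &a, Mask const &b) {
    Mask m{};
    for (std::size_t i = 0; i < kLanes; ++i) m.v[i] = a.v[i] && b.v[i];
    return m;
  }
};

// Generic kernel, written once: is a point strictly inside an origin-centred
// box with half-lengths (hx, hy, hz)? The backend supplies the types and the
// implementation of each operation.
template <class Backend>
typename Backend::Mask InsideBox(typename Backend::Real const &x,
                                 typename Backend::Real const &y,
                                 typename Backend::Real const &z,
                                 double hx, double hy, double hz) {
  auto inX = Backend::Less(Backend::Abs(x), hx);
  auto inY = Backend::Less(Backend::Abs(y), hy);
  auto inZ = Backend::Less(Backend::Abs(z), hz);
  return Backend::And(Backend::And(inX, inY), inZ);
}
```

Instantiating InsideBox<ScalarBackend> gives a single-particle routine returning bool, while InsideBox<ArrayBackend> evaluates four particles per call and returns a per-lane mask. A CUDA backend in this style would presumably use primitive types much like the scalar one, with each thread handling one particle.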