SIMD abstraction and interfacing with CUDA - Indico

GeantV Geometry: SIMD abstraction and interfacing with CUDA
Johannes de Fine Licht ([email protected])
Sandro Wenzel ([email protected])
Thanks to Michal Husejko and Romain Wartel at TechLab for providing hardware.
PH - SFT
Abstracted algorithms
Write common code for multiple types of SIMD “backends” (Vc/Cilk/CUDA).
Eases code maintenance; requires only one kernel to be written for a given functionality
Must retain performance of a low level implementation
Illustrating scalar/SIMD abstraction and kernels
[Diagram: layers of the abstraction]
Interfaces: Vector interface, Single particle interface, External CUDA kernels
Loopers and thread access: Scalar Looper, Vc Looper, Cilk Plus Looper, Thread access
Backend instantiation: Scalar Backend, Vc Backend, Cilk Plus Backend, CUDA Backend
Kernel instantiation: C-like abstract kernels, C-like specialized kernels
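A minimal sketch of how the layers in the diagram might fit together, assuming a kernel templated on a backend and a looper that steps through the input arrays. All names here (ScalarBackend, Pack4, Pack4Backend, SquaredNormKernel, Loop, SquaredNorm) are hypothetical and not taken from VecGeom; Pack4 is a hand-written stand-in for a real SIMD type such as one provided by Vc.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar backend: the kernel sees one double at a time.
struct ScalarBackend {
  typedef double value_type;
  static const std::size_t kWidth = 1;
  static value_type Load(const double* p) { return *p; }
  static void Store(const value_type& v, double* p) { *p = v; }
};

// Hand-written stand-in for a SIMD register of four doubles.
struct Pack4 {
  double lane[4];
};

inline Pack4 operator+(const Pack4& a, const Pack4& b) {
  Pack4 r;
  for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
  return r;
}

inline Pack4 operator*(const Pack4& a, const Pack4& b) {
  Pack4 r;
  for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i];
  return r;
}

// Hypothetical vector backend built on the stand-in type.
struct Pack4Backend {
  typedef Pack4 value_type;
  static const std::size_t kWidth = 4;
  static value_type Load(const double* p) {
    Pack4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = p[i];
    return r;
  }
  static void Store(const value_type& v, double* p) {
    for (int i = 0; i < 4; ++i) p[i] = v.lane[i];
  }
};

// Abstract kernel: written once, instantiated once per backend.
template <class Backend>
void SquaredNormKernel(const typename Backend::value_type& x,
                       const typename Backend::value_type& y,
                       const typename Backend::value_type& z,
                       typename Backend::value_type* out) {
  *out = x * x + y * y + z * z;
}

// Looper behind the vector interface: walks the arrays one
// backend-width chunk at a time and calls the kernel on each chunk.
template <class Backend>
void Loop(const double* x, const double* y, const double* z, double* out,
          std::size_t n) {
  assert(n % Backend::kWidth == 0);  // sketch only: no tail handling
  for (std::size_t i = 0; i < n; i += Backend::kWidth) {
    typename Backend::value_type r;
    SquaredNormKernel<Backend>(Backend::Load(x + i), Backend::Load(y + i),
                               Backend::Load(z + i), &r);
    Backend::Store(r, out + i);
  }
}

// Single particle interface: the same kernel with the scalar backend.
inline double SquaredNorm(double x, double y, double z) {
  double r;
  SquaredNormKernel<ScalarBackend>(x, y, z, &r);
  return r;
}
```

Calling Loop<ScalarBackend> and Loop<Pack4Backend> on the same arrays gives the same results; only the chunk width differs.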
Generic kernels on abstract types
Generic programming; write algorithms in a generic way on abstract types
Implementation depends on the backend
Backend determines the types on which the high level operations are performed
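As an illustration of a backend determining the types, the sketch below writes a box containment test once and instantiates it for two backends. The backend structs, the typedef names precision_v and bool_v, and the two-lane types Double2 and Bool2 are assumptions made for this example, not the actual VecGeom definitions.

```cpp
#include <cmath>

// Hypothetical scalar backend: plain built-in types.
struct ScalarBackend {
  typedef double precision_v;
  typedef bool bool_v;
};

// Hand-written two-lane stand-ins for a SIMD value and its mask.
struct Double2 {
  double lane[2];
};
struct Bool2 {
  bool lane[2];
};

// Hypothetical vector backend using the stand-in types.
struct Pack2Backend {
  typedef Double2 precision_v;
  typedef Bool2 bool_v;
};

// High level operations, provided once per set of types.
inline double Abs(double a) { return std::fabs(a); }

inline Double2 Abs(const Double2& a) {
  Double2 r;
  for (int i = 0; i < 2; ++i) r.lane[i] = std::fabs(a.lane[i]);
  return r;
}

inline Bool2 operator<=(const Double2& a, double b) {
  Bool2 r;
  for (int i = 0; i < 2; ++i) r.lane[i] = a.lane[i] <= b;
  return r;
}

inline Bool2 operator&&(const Bool2& a, const Bool2& b) {
  Bool2 r;
  for (int i = 0; i < 2; ++i) r.lane[i] = a.lane[i] && b.lane[i];
  return r;
}

// Generic kernel: the algorithm is fixed, the types come from the backend.
// hx, hy, hz are the half-lengths of a box centred on the origin.
template <class Backend>
typename Backend::bool_v InsideBox(const typename Backend::precision_v& x,
                                   const typename Backend::precision_v& y,
                                   const typename Backend::precision_v& z,
                                   double hx, double hy, double hz) {
  typename Backend::bool_v inside = Abs(x) <= hx;
  inside = inside && (Abs(y) <= hy);
  inside = inside && (Abs(z) <= hz);
  return inside;
}
```

InsideBox<ScalarBackend> returns a single bool, while InsideBox<Pack2Backend> returns one flag per lane.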
Implementations wrapped in types
Operations are abstracted
Backends implement all operations
Vc already supports arithmetic
CUDA uses primitive types
Wrapper structs are implemented if needed
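A sketch of what abstracting an operation could look like, assuming a conditional assignment that is overloaded once for primitive types and once for a wrapper struct. MaskedAssign, Real4, Mask4 and PositiveDifference are hypothetical names chosen for this example.

```cpp
// Primitive types (what a scalar or CUDA backend would pass around):
// the abstract operation reduces to a branch.
inline void MaskedAssign(bool cond, double value, double* out) {
  if (cond) *out = value;
}

// Hypothetical wrapper structs for a backend whose native type has
// no operators of its own.
struct Real4 {
  double lane[4];
};
struct Mask4 {
  bool lane[4];
};

inline Real4 operator-(const Real4& a, const Real4& b) {
  Real4 r;
  for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] - b.lane[i];
  return r;
}

inline Mask4 operator<(const Real4& a, double b) {
  Mask4 r;
  for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] < b;
  return r;
}

// The same abstract operation, applied lane by lane.
inline void MaskedAssign(const Mask4& cond, double value, Real4* out) {
  for (int i = 0; i < 4; ++i) {
    if (cond.lane[i]) out->lane[i] = value;
  }
}

// Kernel code only uses the abstract operations, so it compiles for
// both the primitive type and the wrapper.
template <class Real>
Real PositiveDifference(const Real& a, const Real& b) {
  Real d = a - b;
  MaskedAssign(d < 0.0, 0.0, &d);  // clamp negative results to zero
  return d;
}
```

With this layout the kernel never branches on individual lanes itself; each backend decides how the conditional assignment is carried out.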
Interfacing with CUDA
Involving the GPU should be easy
Principle of common code
Issue: GPU code needs nvcc, but we want a different compiler for CPU
GNU/Clang (C++11, Vc, Cilk) vs. nvcc (CUDA)
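One common way to let the same source pass through both compilers is a qualifier macro keyed on __CUDACC__, which nvcc defines when compiling CUDA source. The sketch below assumes this approach; the macro name EXAMPLE_HOST_DEVICE and the functions are invented for illustration, and the CUDA kernel is only visible to nvcc.

```cpp
#include <math.h>

// nvcc defines __CUDACC__ when compiling CUDA source; GNU/Clang in
// plain C++ mode do not, so the qualifiers vanish for the CPU build.
#ifdef __CUDACC__
#define EXAMPLE_HOST_DEVICE __host__ __device__
#else
#define EXAMPLE_HOST_DEVICE
#endif

// Common code: callable from CPU code and, under nvcc, from GPU code.
EXAMPLE_HOST_DEVICE
inline bool InsideBox(double x, double y, double z,
                      double hx, double hy, double hz) {
  return fabs(x) <= hx && fabs(y) <= hy && fabs(z) <= hz;
}

#ifdef __CUDACC__
// Only seen by nvcc: each GPU thread handles one particle.
__global__ void InsideBoxKernel(const double* x, const double* y,
                                const double* z, int n,
                                double hx, double hy, double hz, bool* out) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = InsideBox(x[i], y[i], z[i], hx, hy, hz);
}
#endif

// CPU path: a plain loop over the same common function.
inline void InsideBoxCpu(const double* x, const double* y, const double* z,
                         int n, double hx, double hy, double hz, bool* out) {
  for (int i = 0; i < n; ++i) {
    out[i] = InsideBox(x[i], y[i], z[i], hx, hy, hz);
  }
}
```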
Dual namespaces
Utilize distinct namespaces but same code for CPU and GPU
Classes live in one or both environments
Provide abstraction to interface between environments
[Diagram: host and GPU environments]
Host memory, namespace vecgeom: vecgeom::CudaManager, vecgeom::UnplacedBox, vecgeom::LogicalVolume, …
GPU memory, namespace vecgeom_cuda: vecgeom_cuda::UnplacedBox, vecgeom_cuda::LogicalVolume, …
CUDA kernel (namespace vecgeom)
Steps between host and GPU: Allocation, Synchronize content, Launch kernel, Instantiate
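A sketch of how one class source could end up in either namespace, assuming the namespace name is chosen by a macro depending on the compiler. The macro names and the class body are illustrative assumptions; only the namespace names and the class name UnplacedBox come from the slide.

```cpp
// Pick the namespace and the function qualifiers from the compiler.
// __CUDACC__ is defined by nvcc when compiling CUDA source.
#ifdef __CUDACC__
#define EXAMPLE_NAMESPACE vecgeom_cuda
#define EXAMPLE_HOST_DEVICE __host__ __device__
#else
#define EXAMPLE_NAMESPACE vecgeom
#define EXAMPLE_HOST_DEVICE
#endif

namespace EXAMPLE_NAMESPACE {

// Illustrative class body; the members are assumptions for this sketch.
class UnplacedBox {
 public:
  EXAMPLE_HOST_DEVICE
  UnplacedBox(double hx, double hy, double hz) : hx_(hx), hy_(hy), hz_(hz) {}

  // Volume of a box given by its half-lengths.
  EXAMPLE_HOST_DEVICE
  double volume() const { return 8.0 * hx_ * hy_ * hz_; }

 private:
  double hx_, hy_, hz_;
};

}  // namespace EXAMPLE_NAMESPACE
```

Compiled by a C++ compiler this yields vecgeom::UnplacedBox; the same file compiled by nvcc would yield vecgeom_cuda::UnplacedBox, so both can be linked into one executable without symbol clashes.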
Compilation scheme
Source files (.cpp) + optional modules (ROOT, benchmarking…): compile for C++11 with vector backend, namespace vecgeom, giving .o
Source files (.cu) + CUDA interface (.cu): compile with NVCC with CUDA backend, namespace vecgeom_cuda, giving .o
Interface header
Linking: Executable
Status
Common code for box methods runs for scalar, Vc, Cilk and CUDA backends
Implemented GPU memory synchronization
Dual namespaces allow dispatching work to CPU or GPU
Location can run in either environment
Results are in agreement
[Figure: Particle location in box geometry. Speedup (axis from 0.0 to 4.5) against number of particles located, with curves for GPU, GPU with overhead, and CPU.]
Preliminary benchmarks run in hybrid CPU/GPU environment for four-level box geometry.
Location of particles is a tree algorithm, so speedups are only seen at very high particle multiplicity.
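To make the tree nature of particle location concrete, here is a minimal sketch that descends a hierarchy of nested, axis-aligned boxes and returns the deepest one containing a point. BoxNode and Locate are hypothetical, rotations are ignored, and this is not the VecGeom implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical node of a box hierarchy. The centre is given in the
// frame of the parent box; hx, hy, hz are half-lengths.
struct BoxNode {
  int id;
  double cx, cy, cz;
  double hx, hy, hz;
  std::vector<const BoxNode*> daughters;
};

// Returns the deepest box containing the point (x, y, z), where the
// point is given in the frame of node's parent, or nullptr if the
// point lies outside node.
inline const BoxNode* Locate(const BoxNode& node, double x, double y,
                             double z) {
  // Transform into the local frame of this box (translation only).
  const double lx = x - node.cx;
  const double ly = y - node.cy;
  const double lz = z - node.cz;
  if (std::fabs(lx) > node.hx || std::fabs(ly) > node.hy ||
      std::fabs(lz) > node.hz) {
    return nullptr;
  }
  // Inside this box: try each daughter before settling for this level.
  for (std::size_t i = 0; i < node.daughters.size(); ++i) {
    const BoxNode* hit = Locate(*node.daughters[i], lx, ly, lz);
    if (hit != nullptr) return hit;
  }
  return &node;
}
```

Each particle follows its own path down the tree, so neighbouring particles do different amounts of work.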
Appendix: CPU/GPU same interface
Main executable compiled with C++ compiler
Same code used
CUDA kernel compiled with NVCC
Appendix: Common code
Appendix: Template hierarchy
Vector interface
Scalar Backend, Vc Backend, Cilk Plus Backend, CUDA Backend
C-like abstract kernels
External CUDA kernels