
Domain-Specific Approaches in Scientific Computing
Naoya Maruyama, RIKEN Advanced Institute of Computational Science and Tokyo Institute of Technology
CREST-ECRC Workshop @ KAUST (Sep 4, 2014)
RIKEN AICS
•  RIKEN (pronounced as ree-ken)
–  One of the largest government-funded research laboratories in Japan
–  AICS is one institute within RIKEN
•  Founded in July 2010
–  Director: Kimihiko Hirao
•  Operations Division
–  Responsible for operations and management of the K computer
•  Research Division
–  Division Head: Akinori Yonezawa
–  19 research groups
–  Approximately 100 research members
•  Exascale Supercomputer Project
–  Newly formed in April 2014

Programming GPUs
•  Notoriously difficult
–  Vendor-specific programming interfaces
–  Low-level programming models
•  Gradual improvement
–  Single memory space between host and GPUs
–  Low-friction porting with OpenACC / OpenMP (see the sketch after this list)
•  Still, many tedious but potentially very effective optimizations are often left to programmers
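As an aside, the "low-friction porting" mentioned above typically means annotating existing loops with directives. A minimal illustrative sketch, not taken from the talk; the arrays and size n are placeholders:

// Hypothetical OpenACC example: the loop is offloaded to the GPU by a single directive.
// in, out, and n are placeholders; only the pragma is the point here.
#pragma acc parallel loop copyin(in[0:n]) copy(out[0:n])
for (int i = 1; i < n - 1; ++i) {
  out[i] = 0.5f * (in[i - 1] + in[i + 1]);
}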
Stencil Performances on GPUs

[Bar chart: stencil performance in GFLOPS on Fermi and Kepler GPUs for the Peak, Baseline, OPT, ROC, SHM, and SHM-WARP versions; reported values range from 77.5 to 276.5 GFLOPS.]
•  1.6x performance improvement on Kepler
•  Adverse effect on Fermi!
•  More details in [Maruyama and Aoki, 2014]
Stencil Frameworks
•  Pochoir [Tang2011], Mint [Unat2011], Halide [Ragan-Kelley2013], [Shimokawabe2014], and many more
•  High-level abstractions for stencil computations
•  Implement aggressive optimizations that are usually not a focus of general-purpose compilers

Physis Stencil Framework
[Maruyama11], http://github.com/naoyam/physis
Stencil DSL
•  Declarative
•  Portable
•  Global-view
•  C-based
void diffusion(int x, int y, int z,
               PSGrid3DFloat g1, PSGrid3DFloat g2) {
  float v = PSGridGet(g1,x,y,z)
    + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
    + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
    + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
  PSGridEmit(g2, v/7.0);
}

Physis DSL Compiler
•  Target-specific code generation and optimizations
•  Automatic parallelization

Targets: C, C+MPI, CUDA, CUDA+MPI, OpenMP, OpenCL
Physis DSL Design Goals
•  Focus on stencils with regular multi-dimensional grids
•  C with a small number of custom constructs for stencil computations
–  Fortran version in progress
•  A small flavor of functional programming to allow for automatic parallelization
•  Portable across architectures

Physis DSL Constructs
•  Grid data structures
•  Stencil definitions
•  Control logic
Grid Data Structures
DeclareGrid3D(type, name);

•  Declares a type named PSGrid3Dname
•  Represents an N-D grid with the given type as its element type
•  An allocation function is automatically introduced
–  PSGrid3DnameNew(int, int, int)
DeclareGrid3D(float, Weight);

void foo() {
  PSGrid3DWeight g = PSGrid3DWeightNew(N, N, N);
  ...
  PSGridFree(g);
}

User-Defined Types
•  E.g., grids with 3D velocity fields
struct point {
  double vx, vy, vz;
};
DeclareGrid3D(struct point, Point);

void foo() {
  PSGrid3DPoint g = PSGrid3DPointNew(N, N, N);
  ...
  PSGridFree(g);
}

Data Layout Optimization
•  User code is always AoS
•  Converted to suitable layouts depending on the target architecture

User code (AoS)
struct Point {
  float u;
  float v;
  float w;
  float x[19];
};
DeclareGrid3D(Point, struct Point);
PSGrid3DPoint g;
...
PSGridGet(g.u, i, j, k);
PSGridGet(g.x[l], i, j+1, k);
Generated code for GPU (SoA)
struct Point {
  float *u;
  float *v;
  float *w;
  float *x;
};
...
// PSGridGet(g.u, i, j, k);
g->u[i+j*nx+k*nx*ny]
// PSGridGet(g.x[l], i, j+1, k);
g->x[i+(j+1)*nx+k*nx*ny+l*nx*ny*nz];
Stencil Definition
void kernel(const int x, const int y, const int z,
            PSGrid3DPoint g1, PSGrid3DDouble g2) {
  double v = PSGridGet(g1,x,y,z).vx
    + PSGridGet(g1,x-1,y,z).vx + PSGridGet(g1,x+1,y,z).vx
    + PSGridGet(g1,x,y-1,z).vy + PSGridGet(g1,x,y+1,z).vy
    + PSGridGet(g1,x,y,z-1).vz + PSGridGet(g1,x,y,z+1).vz;
  PSGridEmit(g2, v/7.0);
}

•  Defines stencil operations as an elemental function for coordinate (x, y, z)
•  Reading a value of grid g at coordinate (x+a, y+b, z+c): PSGridGet(g, x+a, y+b, z+c)
–  a, b, c must be compile-time constants
•  Storing a value v to grid g at coordinate (x, y, z): PSGridEmit(g, v)
Control Logic
Using the stencil kernel defined on the previous slide:

PSGrid3DPoint g1 = PSGrid3DPointNew(NX, NY, NZ);
PSGrid3DDouble g2 = PSGrid3DDoubleNew(NX, NY, NZ);
PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);
PSStencilMap(kernel, d, g1, g2);
double max_value;
PSReduce(&max_value, g2, PS_MAX);

Implementation
[Diagram: Physis code → DSL translator → generated source code → standard compiler → executable]
•  Usage:
$ physisc-ref user-program.c
→ Converts to reference C code
$ physisc-cuda user-program.c
→ Converts to CUDA code

DSL Translator
1.  Input program analysis
    1.  Detection of grid data types
    2.  Analysis of data accesses in stencil functions
2.  Code generation
    1.  Generation of actual grid data types
    2.  Generation of stencil operation functions
3.  Optimization

Code Generation
•  Target-specific code translators as C++ classes
–  ReferenceTranslator
•  Reference code translator
–  CUDATranslator
•  Extends ReferenceTranslator for the CUDA target
–  MPITranslator
–  MPICUDATranslator
•  Grid data types
–  CPU target → AoS
–  GPU target → SoA
–  More adaptive layout selection under investigation
•  Invocation of stencil functions
–  Automatic parallelization in CUDA and MPI by domain decomposition

CUDA Thread Blocking

•  Each thread sweeps points in the Z dimension (a rough sketch of this structure follows the generated-code excerpt below)
•  X and Y dimensions are blocked with AxB thread blocks, where A and B are user-configurable parameters (64x4 by default)

[Figure: thread block decomposition in the X-Y plane; each thread sweeps along Z.]

Example: 7-point Stencil GPU Code

__device__ void kernel(const int x, const int y, const int z,
                       __PSGrid3DFloatDev *g, __PSGrid3DFloatDev *g2) {
  float v = ((((((*__PSGridGetAddrFloat3D(g,x,y,z)
    + *__PSGridGetAddrFloat3D(g,(x + 1),y,z))
    + *__PSGridGetAddrFloat3D(g,(x - 1),y,z))
    + *__PSGridGetAddrFloat3D(g,x,(y + 1),z))
    + *__PSGridGetAddrFloat3D(g,x,(y - 1),z))
    + *__PSGridGetAddrFloat3D(g,x,y,(z - 1)))
    + *__PSGridGetAddrFloat3D(g,x,y,(z + 1)));
  *__PSGridEmitAddrFloat3D(g2,x,y,z) = v;
}

__global__ void __PSStencilRun_kernel(int offset0, int offset1, __PSDomain dom,
                                      __PSGrid3DFloatDev g, __PSGrid3DFloatDev g2) {
  int x = blockIdx.x * blockDim.x + threadIdx.x + offset0;
  int y = blockIdx.y * blockDim.y + threadIdx.y + offset1;
  if (x < dom.local_min[0] || x >= dom.local_max[0] ||
      (y < dom.local_min[1] || y >= dom.local_max[1]))
  // ... (listing truncated on the slide)
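The generated kernel above is cut off on the slide. Purely as a rough sketch of the z-sweeping structure described under "CUDA Thread Blocking" (the function and variable names below are made up, not the Physis-generated code):

// Illustrative sketch: one thread sweeps the Z dimension of a 7-point stencil.
// nx, ny, nz, in, and out are placeholders, not Physis runtime symbols.
__global__ void sweep_z_7pt(const float *in, float *out, int nx, int ny, int nz) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;  // X/Y blocked, e.g., 64x4 threads per block
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < 1 || x >= nx - 1 || y < 1 || y >= ny - 1) return;
  for (int z = 1; z < nz - 1; ++z) {              // each thread walks its (x, y) column
    size_t idx = x + (size_t)y * nx + (size_t)z * nx * ny;
    out[idx] = (in[idx]
                + in[idx - 1] + in[idx + 1]
                + in[idx - nx] + in[idx + nx]
                + in[idx - (size_t)nx * ny] + in[idx + (size_t)nx * ny]) / 7.0f;
  }
}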
Optimization

•  Peephole optimizations in stencil functions
•  Configurable with an optional translation config file (an illustrative example follows the option list below)
OPT_OFFSET_COMP = true | false
OPT_LOOP_PEELING = true | false
OPT_REGISTER_BLOCKING = true | false
OPT_LOOP_OPT = true | false
OPT_UNCONDITIONAL_GET = true | false
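For instance, a config file enabling only some of the optimizations might look like the following; the particular choice of values is arbitrary and for illustration only, and the key = value syntax is taken from the option list above:

OPT_OFFSET_COMP = true
OPT_LOOP_PEELING = false
OPT_REGISTER_BLOCKING = true
OPT_LOOP_OPT = true
OPT_UNCONDITIONAL_GET = false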
OPT: Register Blocking
•  Exploit locality within each thread

BEFORE
for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[i][j][k]+f[i][j][k-1]+f[i][j][k+1]);
}

AFTER
double kc = f[i][j][0];
double kn = f[i][j][1];
for (int k = 1; k < n-1; ++k) {
  double kp = kc;
  kc = kn;
  kn = f[i][j][k+1];
  g[i][j][k] = a*(kc+kp+kn);
}

OPT: Common Subexpression Elimination
•  Common subexpressions in offset expressions
BEFORE
for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[i+j*n+k*n*n]+
    f[i+j*n+(k-1)*n*n]+f[i+j*n+(k+1)*n*n]);
}

AFTER
int idx = i+j*n+n*n;
for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[idx]+f[idx-n*n]+f[idx+n*n]);
  idx += n*n;
}

OPT: Overlapped Computation and Communication

1.  Copy boundaries from GPU to CPU for non-unit-stride cases
2.  Compute interior points
3.  Exchange boundaries with neighbors
4.  Compute boundaries

[Figure: timeline showing inner points and boundary planes being processed so that computation overlaps with communication.]

Optimization Example: 7-Point Stencil CPU Code

// Computing the interior points, the boundary exchange, and computing the
// boundary planes proceed concurrently on separate CUDA streams.
for (i = 0; i < iter; ++i) {
  __PSStencilRun_kernel_interior<<<s0_grid_dim,block_dim,0,stream_interior>>>
      (__PSGetLocalOffset(0),__PSGetLocalOffset(1),__PSDomainShrink(&s0->dom,1),
       *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
       *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  int fw_width[3] = {1L, 1L, 1L};
  int bw_width[3] = {1L, 1L, 1L};
  __PSLoadNeighbor(s0->g,fw_width,bw_width,0,i > 0,1);
  __PSStencilRun_kernel_boundary_1_bw<<<1,(dim3(1,128,4)),0,stream_boundary_kernel[0]>>>
      (__PSDomainGetBoundary(&s0->dom,0,0,1,5,0),
       *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
       *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  __PSStencilRun_kernel_boundary_1_bw<<<1,(dim3(1,128,4)),0,stream_boundary_kernel[1]>>>
      (__PSDomainGetBoundary(&s0->dom,0,0,1,5,1),
       *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
       *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  ...
  __PSStencilRun_kernel_boundary_2_fw<<<1,(dim3(128,1,4)),0,stream_boundary_kernel[11]>>>
      (__PSDomainGetBoundary(&s0->dom,1,1,1,1,0),
       *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
       *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  cudaThreadSynchronize();
}
cudaThreadSynchronize();
}

Evaluation

•  Performance and productivity
•  Sample code
–  7-point diffusion
–  Poisson benchmark (Himeno)
–  Seismic simulation
•  Platform
–  Tsubame 2.0
•  Node: Westmere-EP 2.9GHz x 2 + M2050 x 3
•  Dual InfiniBand QDR with full bisection BW fat tree

Productivity
[Bar chart: increase of lines of code relative to the original for Diffusion, Himeno, and Seismic; bars for Original, MPI, Physis, Generated (No Opt), and Generated (Opt).]
•  Similar size to the sequential C code

7-point Diffusion
[Bar chart: achieved bandwidth (GB/s) on 1 GPU (Tesla M2050), CUDA 4.1, for Hand-tuned, No-opt, Register Blocking, Offset CSE, and Full Opt versions.]
•  95% of the performance of hand-tuned code
Diffusion Weak Scaling

[Line chart: GFlops vs. number of GPUs (up to 300) for grid sizes 512x256x256 and 256x128x128.]

Large-Scale Flow Simulation with LBM
•  Original application by Onodera and Aoki [HPCS'13]
•  Uses the coherent-structure Smagorinsky model
•  D3Q19 discrete velocity model
•  10 km x 10 km of downtown Tokyo

Implementation in Physis
•  Define grid data types as SoA
•  Reuse original data structures as much as possible
•  Define stencil functions (an illustrative sketch follows the declarations below)

struct VariablesP {
  FLOAT f_n[NUM_DIRECTION]; // NUM_DIRECTION=19
};
DeclareGrid3D(Variables, struct VariablesP);

struct BasisVariablesP {
  FLOAT r_n;
  FLOAT u_n;
  FLOAT v_n;
  FLOAT w_n;
};
DeclareGrid3D(BasisVariables, struct BasisVariablesP);

struct FluidPropertyP {
  FLOAT vis;
};
DeclareGrid3D(FluidProperty, struct FluidPropertyP);

struct StressP {
  FLOAT vis_sgs;
  FLOAT Fcs_sgs;
  FLOAT Div;
  FLOAT SS;
  FLOAT WW;
};
DeclareGrid3D(Stress, struct StressP);

PSGrid3DVariables cq_p;
PSGrid3DVariables cqn_p;
PSGrid3DBasisVariables cbq_p;
PSGrid3DFluidProperty cfp_p;
PSGrid3DStress str_p;
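The LBM stencil functions themselves are not shown on this slide. Purely as an illustration of how one could be written over these grids, following the pattern of the earlier diffusion example; this kernel and the u_avg output grid are hypothetical, not code from the original application:

/* Hypothetical example: average the x-velocity of the six face neighbors
 * from the BasisVariables grid and emit it into a separate scalar grid.
 * u_avg (a PSGrid3DFloat) is introduced only for this sketch. */
void velocity_smooth(const int x, const int y, const int z,
                     PSGrid3DBasisVariables bq, PSGrid3DFloat u_avg) {
  FLOAT u = PSGridGet(bq, x-1, y, z).u_n + PSGridGet(bq, x+1, y, z).u_n
          + PSGridGet(bq, x, y-1, z).u_n + PSGridGet(bq, x, y+1, z).u_n
          + PSGridGet(bq, x, y, z-1).u_n + PSGridGet(bq, x, y, z+1).u_n;
  PSGridEmit(u_avg, u / 6.0);
}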
LBM Performance Result
[Bar chart: elapsed time (sec) vs. grid size (128^3, 192^3, 256^3) for the Manual and Physis versions on 1 GPU (Tesla M2075), CUDA 4.2; lower is better.]

•  Comparable performance to the manually written CUDA version
•  Portability benefit with Physis

Climate Model
•  SCALE-LES: finite volume method with 3D Cartesian grids
•  Porting of the dynamics kernels to Physis
–  3rd-order Runge-Kutta
–  Consists of 25 small triply nested loops
•  Code size comparison
–  Original (F90): 520 lines
–  Physis: 244 lines

SCALE-LES Performance Result
•  CPU: Xeon E5, 8 cores (BW: ~30 GB/s)
•  GPU: Tesla C2050 (BW: ~100 GB/s)

[Bar charts: RK elapsed time (sec) and RK bandwidth (GB/s) for the original code on the Xeon E5 (8 cores) and the Physis version on the C2050; a 3.6x improvement is highlighted.]

•  Reasonable performance improvement by using the GPU

Summary
•  Physis stencil framework
–  C-based embedded DSL
–  Designed to provide both functional and performance portability
–  Primary focus on GPUs
•  Ongoing development
–  Advanced optimizations such as kernel fusion
–  Extension to support block-based structured AMR
•  Code is available at http://github.com/naoyam/physis
•  Acknowledgments
–  Mohamed Wahib (RIKEN), Seiya Nishizawa (RIKEN), Tetsuya Hoshino (Tokyo Tech), Tomoki Kawamura (Tokyo Tech), Takayuki Aoki (Tokyo Tech), Takashi Shimokawabe (Tokyo Tech), Satoshi Matsuoka (Tokyo Tech), Guanghao Jin (Tokyo Tech), Toshio Endo (Tokyo Tech)
–  The ROSE project
–  JST CREST Post-Petascale Program