Design and Optimization of a Portable Lattice Boltzmann Code for

Design and Optimization of a Portable Lattice
Boltzmann Code for Heterogeneous Architectures
E Calore, S F Schifano, R Tripiccione
Enrico Calore
INFN Ferrara, Italy
Perspectives of GPU Computing in Physics and Astrophysics
September 17, 2014
Rome, Italy
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
1 / 33
Outline
1
Introduction
Hardware trends
Software tools
2
The Lattice Boltzmann Method at a glance
The D2Q37 model
Propagate
Boundary Conditions
Collide
3
Implementations details
4
Results and Conclusion
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
2 / 33
Outline
1
Introduction
Hardware trends
Software tools
2
The Lattice Boltzmann Method at a glance
The D2Q37 model
Propagate
Boundary Conditions
Collide
3
Implementations details
4
Results and Conclusion
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
3 / 33
GPUs and MICs performances are growing
Courtesy of Dr. Karl Rupp, Technische Universität Wien
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
4 / 33
Accelerators use in HPC is growing
Accelerator architectures in the Top500 Supercomputers
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
5 / 33
Outline
1
Introduction
Hardware trends
Software tools
2
The Lattice Boltzmann Method at a glance
The D2Q37 model
Propagate
Boundary Conditions
Collide
3
Implementations details
4
Results and Conclusion
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
6 / 33
OpenCL (Open Computing Language):
The same code can be run on CPUs, GPUs, MICs, etc.
Functions to be offloaded on the accelerator have to be explicitly
programmed (as in CUDA)
Data movements between host and accelerator has to be explicitly
programmed (as in CUDA)
NVIDIA do not support it anymore
OpenACC (for Open Accelerators):
The same code (will probably) run on CPUs, GPUs, MICs, etc.
Functions to be offloaded are “annotated” with #pragma directives
Data movements between host and accelerator could be managed
automatically or manually
Support is still limited, but seems to be quickly growing
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
7 / 33
Outline
1
Introduction
Hardware trends
Software tools
2
The Lattice Boltzmann Method at a glance
The D2Q37 model
Propagate
Boundary Conditions
Collide
3
Implementations details
4
Results and Conclusion
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
8 / 33
The D2Q37 Lattice Boltzmann Model
Lattice Boltzmann method (LBM) is a class of computational fluid dynamics
(CFD) methods
simulation of synthetic dynamics described by the discrete Boltzmann
equation, instead of the Navier-Stokes equations
a set of virtual particles called populations arranged at edges of a
discrete and regular grid
interacting by propagation and collision reproduce – after appropriate
averaging – the dynamics of fluids
D2Q37 is a D2 model with 37 components of velocity (populations)
suitable to study behaviour of compressible gas and fluids optionally in
presence of combustion 1 effects
correct treatment of Navier-Stokes, heat transport and perfect-gas
(P = ρT ) equations
1
chemical reactions turning cold-mixture of reactants into hot-mixture of burnt product.
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
9 / 33
Computational Scheme of LBM
foreach time−step
foreach lattice−point
propagate ( ) ;
endfor
foreach lattice−point
collide ( ) ;
endfor
endfor
Embarassing parallelism
All sites can be processed in parallel applying in sequence propagate and
collide.
Challenge
Design an efficient implementation able exploit a large fraction of available
peak performance.
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
10 / 33
Outline
1
Introduction
Hardware trends
Software tools
2
The Lattice Boltzmann Method at a glance
The D2Q37 model
Propagate
Boundary Conditions
Collide
3
Implementations details
4
Results and Conclusion
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
11 / 33
D2Q37: propagation scheme
perform accesses to neighbour-cells at distance 1,2, and 3
generate memory-accesses with sparse addressing patterns
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
12 / 33
Outline
1
Introduction
Hardware trends
Software tools
2
The Lattice Boltzmann Method at a glance
The D2Q37 model
Propagate
Boundary Conditions
Collide
3
Implementations details
4
Results and Conclusion
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
13 / 33
D2Q37: boundary-conditions
After propagation, boundary conditions are enforced at top and bottom edges
of the lattice.
2D lattice with period-boundaries along
X -direction
at the top and the bottom boundary
conditions are enforced:
I
to adjust some values at sites y = 0 . . . 2
and y = Ny − 3 . . . Ny − 1
I
e.g. set vertical velocity to zero
At left and and right edges we apply periodic boundary conditions.
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
14 / 33
Outline
1
Introduction
Hardware trends
Software tools
2
The Lattice Boltzmann Method at a glance
The D2Q37 model
Propagate
Boundary Conditions
Collide
3
Implementations details
4
Results and Conclusion
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
15 / 33
D2Q37 collision
collision is computed at each lattice-cell after computation of boundary
conditions
computational intensive: for the D2Q37 model requires ≈ 7500 DP
floating-point operations
completely local: arithmetic operations require only the populations
associate to the site
computation of propagate and collide kernels are kept separate
after propagate but before collide we may need to perform collective
operations (e.g. divergence of of the velocity field) if we include
computations conbustion effects.
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
16 / 33
Outline
1
Introduction
Hardware trends
Software tools
2
The Lattice Boltzmann Method at a glance
The D2Q37 model
Propagate
Boundary Conditions
Collide
3
Implementations details
4
Results and Conclusion
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
17 / 33
Grid and Memory Layout
Uni-dimensional array of NTHREADS, each thread processing one lattice site.
Ly = α × Nwi , α ∈ N;
(Ly × Lx )/Nwi = Nwg
Data stored as Structures-of-Arrays (SoA)
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
18 / 33
Grid and Memory Layout
Uni-dimensional array of NTHREADS, each thread processing one lattice site.
Ly = α × Nwi , α ∈ N;
(Ly × Lx )/Nwi = Nwg
Data stored as Structures-of-Arrays (SoA)
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
18 / 33
OpenCL Propagate device function
__kernel void prop ( __global const data_t∗ prv , __global data_t∗ nxt ) {
int ix ,
iy ,
site_i ;
/ / Work−i t e m i n d e x along t h e X dimension .
/ / Work−i t e m i n d e x along t h e Y dimension .
/ / Index o f c u r r e n t s i t e .
/ / Sets t h e work−i t e m i n d i c e s ( Y i s used as t h e f a s t e s t dimension ) .
ix = ( int ) get_global_id ( 1 ) ;
iy = ( int ) get_global_id ( 0 ) ;
site_i = ( HX+3+ix ) ∗ NY + ( HY+iy ) ;
nxt [
nxt [
nxt [
nxt [
nxt [
nxt [
nxt [
NX∗NY
2∗NX∗NY
3∗NX∗NY
4∗NX∗NY
5∗NX∗NY
6∗NX∗NY
+
+
+
+
+
+
site_i ]
site_i ]
site_i ]
site_i ]
site_i ]
site_i ]
site_i ]
=
=
=
=
=
=
=
prv [
prv [
prv [
prv [
prv [
prv [
prv [
NX∗NY
2∗NX∗NY
3∗NX∗NY
4∗NX∗NY
5∗NX∗NY
6∗NX∗NY
+
+
+
+
+
+
site_i
site_i
site_i
site_i
site_i
site_i
site_i
−
−
−
−
−
−
−
3∗NY
3∗NY
3∗NY
2∗NY
2∗NY
2∗NY
2∗NY
+ 1];
];
− 1];
+ 2];
+ 1];
];
− 1];
...
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
19 / 33
OpenACC Propagate function
inline void propagate ( const __restrict data_t ∗ const prv ,
__restrict data_t ∗ const nxt ) {
int ix , iy , site_i ;
#pragma acc kernels present ( prv ) present ( nxt )
#pragma acc loop independent gang
for ( ix = HX ; ix < ( HX+SIZEX ) ; ix++) {
#pragma acc loop independent vector ( BLKSIZE )
for ( iy = HY ; iy < ( HY+SIZEY ) ; iy++) {
site_i = ( ix∗NY ) + iy ;
nxt [
nxt [
nxt [
nxt [
nxt [
nxt [
nxt [
NX∗NY
2∗NX∗NY
3∗NX∗NY
4∗NX∗NY
5∗NX∗NY
6∗NX∗NY
+
+
+
+
+
+
site_i ]
site_i ]
site_i ]
site_i ]
site_i ]
site_i ]
site_i ]
=
=
=
=
=
=
=
prv [
prv [
prv [
prv [
prv [
prv [
prv [
NX∗NY
2∗NX∗NY
3∗NX∗NY
4∗NX∗NY
5∗NX∗NY
6∗NX∗NY
+
+
+
+
+
+
site_i
site_i
site_i
site_i
site_i
site_i
site_i
−
−
−
−
−
−
−
3∗NY
3∗NY
3∗NY
2∗NY
2∗NY
2∗NY
2∗NY
+ 1];
];
− 1];
+ 2];
+ 1];
];
− 1];
...
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
20 / 33
Outline
1
Introduction
Hardware trends
Software tools
2
The Lattice Boltzmann Method at a glance
The D2Q37 model
Propagate
Boundary Conditions
Collide
3
Implementations details
4
Results and Conclusion
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
21 / 33
Hardware used: Eurora prototype
Eurora (Eurotech and Cineca)
Hot water cooling system
Deliver 3,209 MFLOPs per Watt of
sustained performance
1st in the Green500 of June 2013
Computing Nodes: 64
Processor Type:
Intel Xeon E5-2658 @ 2.10GHz
Intel Xeon E5-2687W @ 3.10GHz
Accelerator Type:
MIC - Intel Xeon-Phi 5120D
GPU - NVIDIA Tesla K20x
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
22 / 33
OpenCL WG size selection for Propagate (Xeon-Phi)
Performance of propagate as function of the number of work-items Nwi per
work-group, and the number of work-groups Nwg .
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
23 / 33
OpenCL WG size selection for Collide (Xeon-Phi)
Performance of collide as function of the number of work-items Nwi per
work-group, and the number of work-groups Nwg .
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
24 / 33
2 x NVIDIA K20s GPU
Run time on 2 x GPU (NVIDIA K20s)
80
CUDA
OpenCL
70 OpenACC
[msec] per iteration
60
50
40
30
20
10
0
Propagate
E. Calore (INFN of Ferrara)
BC
Portable LBM for Heterogeneous HPC
Collide
GPU Comp., Sep. 17, 2014
25 / 33
2 x Intel Xeon Phi MIC
Run time on 2 x MIC (Intel Xeon Phi)
80
70
C
OpenCL
[msec] per iteration
60
50
40
30
20
10
0
Propagate
E. Calore (INFN of Ferrara)
BC
Portable LBM for Heterogeneous HPC
Collide
GPU Comp., Sep. 17, 2014
26 / 33
Propagate
Run time (Propagate - 1920x2048 lattice)
600
[msec] per iteration
C
C Opt.
CUDA
500
OpenCL
OpenACC
400
300
200
100
0
MIC
E. Calore (INFN of Ferrara)
GPU
CPU2
Portable LBM for Heterogeneous HPC
CPU3
GPU Comp., Sep. 17, 2014
27 / 33
Collide
Run time (Collide - 1920x2048 lattice)
600
[msec] per iteration
C
C Opt.
CUDA
500
OpenCL
OpenACC
400
300
200
100
0
MIC
E. Calore (INFN of Ferrara)
GPU
CPU2
Portable LBM for Heterogeneous HPC
CPU3
GPU Comp., Sep. 17, 2014
28 / 33
Scalability on Eurora Nodes (OpenCL code)
Weak regime lattice size:
Strong regime lattice size:
E. Calore (INFN of Ferrara)
256 × 8192 × No_devices.
1024 × 8192.
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
29 / 33
Simulation of the Rayleigh-Taylor (RT) Instability
Instability at the interface of two fluids of different densities triggered by
gravity.
A cold-dense fluid over a less dense and warmer fluid triggers an instability
that mixes the two fluid-regions (till equilibrium is reached).
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
30 / 33
Conclusions
1
we have presented an OpenCL and an OpenACC implementations of a
fluid-dynamic simulation based on Lattice Boltzmann methods
2
code portability: they have been succesfully ported and run on several
computing architectures, including CPU, GPU and MIC systems
3
performance portability: results are of a similar level of codes written
using more “native” programming frameworks, such as CUDA or C
4
OpenCL
I
I
5
easily portable across several architecture preserving performances;
but not all vendors are today commited to support this standard;
OpenACC
I
I
easily utilizable with few coding efforts;
but compilers are not available for all architectures yet.
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
31 / 33
Acknowledgments
Luca Biferale, Mauro Sbragaglia, Patrizio Ripesi
University of Tor Vergata and INFN Roma, Italy
Andrea Scagliarini
University of Barcelona, Spain
Filippo Mantovani
BSC institute, Spain
Enrico Calore, Sebastiano Fabio Schifano, Raffaele Tripiccione,
University and INFN of Ferrara, Italy
Federico Toschi
Eindhoven University of Technology The Netherlands, and CNR-IAC, Roma Italy
This work has been performed in the framework of the INFN COKA and
SUMA projects.
We would like to thank CINECA (ITALY) and JSC (GERMANY) institutes for
access to their systems.
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
32 / 33
Thanks for Your attention
E. Calore (INFN of Ferrara)
Portable LBM for Heterogeneous HPC
GPU Comp., Sep. 17, 2014
33 / 33