Enhancement of FDL3DI

John Eisenlohr
Nandan Phadke
P. Sadayappan
We are Experimentalists
• How to design efficient aircraft
  – theories of fluid dynamics
  – experiments with particular designs
• How to design efficient software
  – theories of computation
  – experiments with particular applications
• Deal with Technological Change
Pushing Toward Exascale
HPC Application Challenges
• Transistor counts are still increasing rapidly
• However, traditional single-core advancements peaked long ago
  – Power limitations
• Parallel processing hardware has emerged as the solution
  – Multi-core
  – Short vector units
  – Compute accelerators (graphics processors, Xeon Phi)
• Bandwidth has not increased at the same rate as computation
  – Cache
  – DRAM
  – Inter-chip communication
AFRL Spirit Bandwidth/Compute Roofline
[Roofline plot: double-precision GFLOP/s (log scale, 8-256; peak 166.4 GFLOP/s) vs. operational intensity (FLOP/byte), with markers at 0.08 (Kestrel Gauss-Seidel) and 1.65 (balanced machine)]
• Kestrel Gauss-Seidel kernel: 1904 bytes, 150 operations, operational intensity 0.08
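As an illustration of how the roofline numbers above constrain performance, the short sketch below computes the attainable rate as min(peak, intensity x bandwidth). The bandwidth value is not given on the slide; it is inferred from the 166.4 GFLOP/s peak and the 1.65 FLOP/byte balance point, so treat it as an assumption.

    ! Sketch: roofline bound for the Kestrel Gauss-Seidel kernel (assumed values noted below).
    program roofline_bound
      implicit none
      real(8) :: peak_gflops, bandwidth_gbs, oi_kernel, bound
      peak_gflops   = 166.4d0                 ! peak double-precision GFLOP/s (from the slide)
      bandwidth_gbs = peak_gflops / 1.65d0    ! GB/s implied by the 1.65 FLOP/byte balance point (assumption)
      oi_kernel     = 150.0d0 / 1904.0d0      ! 150 operations per 1904 bytes, about 0.08 FLOP/byte
      bound = min(peak_gflops, oi_kernel * bandwidth_gbs)
      print '(a, f6.2, a)', 'Attainable rate: ', bound, ' GFLOP/s'   ! roughly 8 GFLOP/s, far below peak
    end program roofline_bound

At an operational intensity of 0.08 the kernel is bandwidth-bound at roughly 8 GFLOP/s under this machine model, which is why it sits well to the left of the balance point on the roofline.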
Minimize communication to save energy
Source: John Shalf, LBNL
Vectorization (1/3)
Vectorization (2/3)
Vectorization (3/3)
HPC Applications: Key Issues
• Optimize Data Movement
  – data locality
  – algorithm modifications for better reuse (see the tiling sketch after this list)
• Exploit Vectorization
• Parallelism within a node
  – MPI
  – OpenMP
  – accelerators
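As a hedged sketch of the "algorithm modifications for better reuse" idea (not code from AVUS or FDL3DI), the loop nest below is blocked so that each sub-block of the arrays stays cache-resident while it is being reused; the array names and block size are illustrative assumptions.

    ! Sketch: cache blocking (tiling) of a simple 2D stencil update so that each
    ! bs-by-bs sub-block is reused while it is still resident in cache.
    subroutine blocked_update(a, b, n)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(in)    :: a(n, n)
      real(8), intent(inout) :: b(n, n)
      integer, parameter :: bs = 64                  ! block size, tuned to cache capacity
      integer :: ib, jb, i, j
      do jb = 2, n - 1, bs                           ! loop over blocks
         do ib = 2, n - 1, bs
            do j = jb, min(jb + bs - 1, n - 1)       ! loop within one block
               do i = ib, min(ib + bs - 1, n - 1)    ! unit stride in the first index
                  b(i, j) = 0.25d0 * (a(i-1, j) + a(i+1, j) + a(i, j-1) + a(i, j+1))
               end do
            end do
         end do
      end do
    end subroutine blocked_update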
AVUS CFD Code Characteristics
• Unstructured-grid CFD
• Gauss-Seidel linear system solver
• Upper bound on operational intensity is 0.08
• Non-optimal vectorization
• Inherently serial algorithm (see the sketch after this list)
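A minimal sketch (not the AVUS kernel itself) of why a Gauss-Seidel sweep is described as inherently serial: each unknown is updated using values already updated earlier in the same sweep, so the outer loop carries a dependence that blocks straightforward vectorization and parallelization. The dense system below is an assumption for illustration only.

    ! Sketch: one Gauss-Seidel sweep for A x = rhs on a dense n-by-n system.
    ! x(i) depends on x(1:i-1) already updated in THIS sweep, so iterations
    ! of the i loop cannot run independently.
    subroutine gauss_seidel_sweep(a, x, rhs, n)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(in)    :: a(n, n), rhs(n)
      real(8), intent(inout) :: x(n)
      integer :: i, j
      real(8) :: s
      do i = 1, n
         s = rhs(i)
         do j = 1, n
            if (j /= i) s = s - a(i, j) * x(j)   ! uses new x(1:i-1) and old x(i+1:n)
         end do
         x(i) = s / a(i, i)                      ! written before later rows read it
      end do
    end subroutine gauss_seidel_sweep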
AVUS CFD Code Evaluation
• Multi-core cache contention
  – Open Source Fighter
  – Intra-node performance is sub-linear
  – Inter-node performance scales well
[Scaling plots on AFRL Spirit and OSC Oakley]
AVUS CFD Code Status
• Vectorization improvements with changed data layout and hand-coding (see the data-layout sketch after this list)
• Cache-resident sub-blocks
  – requires reordering cells
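As a hedged sketch of the kind of data-layout change that helps vectorization (the field names and the update are assumptions, not AVUS's actual data structures): moving from an array-of-structures layout to separate per-field arrays gives the inner loop unit-stride access that compilers can vectorize.

    ! Sketch: array-of-structures vs. structure-of-arrays layout for per-cell data.
    module layout_demo
      implicit none
      type cell_aos                        ! array-of-structures: fields interleaved,
         real(8) :: rho, u, v, w, p        ! so a loop over rho alone strides through memory
      end type cell_aos
      type cell_soa                        ! structure-of-arrays: each field contiguous
         real(8), allocatable :: rho(:), u(:), v(:), w(:), p(:)
      end type cell_soa
    contains
      subroutine scale_density(cells, factor, n)
        integer, intent(in)           :: n
        type(cell_soa), intent(inout) :: cells
        real(8), intent(in)           :: factor
        integer :: i
        do i = 1, n                        ! unit-stride access; compilers vectorize this loop
           cells%rho(i) = factor * cells%rho(i)
        end do
      end subroutine scale_density
    end module layout_demo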
FDL3DI Solver
• An Alternating Direction Implicit (ADI) solver for CFD simulations (a line-solve sketch follows this list)
• Regular 3D grids
• 75K lines of Fortran 90 code
• MPI parallelism
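As a hedged illustration of the ADI structure (not FDL3DI's actual routines): each implicit step reduces to many independent tridiagonal solves along one grid direction, and a single line solve can be done with the Thomas algorithm sketched below. The array names are assumptions.

    ! Sketch: Thomas algorithm for one tridiagonal line solve, the building block
    ! of an ADI sweep along a single grid direction.
    ! a = sub-diagonal, b = diagonal, c = super-diagonal, d = right-hand side
    ! (b and d are overwritten; d returns the solution).
    subroutine thomas(a, b, c, d, n)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(in)    :: a(n), c(n)
      real(8), intent(inout) :: b(n), d(n)
      integer :: i
      real(8) :: m
      do i = 2, n                          ! forward elimination
         m = a(i) / b(i-1)
         b(i) = b(i) - m * c(i-1)
         d(i) = d(i) - m * d(i-1)
      end do
      d(n) = d(n) / b(n)                   ! back substitution
      do i = n - 1, 1, -1
         d(i) = (d(i) - c(i) * d(i+1)) / b(i)
      end do
    end subroutine thomas

Because each grid line in a sweep has its own independent tridiagonal system, the line solves can in principle be distributed across cores, which is the parallelization contrast with Gauss-Seidel drawn later in the deck.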
FDL3DI Solver Characteristics
• Four subroutines account for the bulk of processing
  – SWEEPJ1, SWEEPI, SWEEPK, SWEEPJ2
• Examination of the code and HPCToolkit data gives an upper bound on operational intensity for the SWEEPs in the range of 40-80 floating-point ops per byte of data read (quite high)
• SWEEPJ1, SWEEPK and SWEEPJ2 should vectorize well

FDL3DI Profiling Data: Operational Intensity
[Plot: operational intensity (ops/byte) vs. number of processes (2, 4, 8, 12, 24, 48, 96, 120) for SWEEPJ1, SWEEPI, SWEEPK, and SWEEPJ2]

FDL3DI Profiling Data: GFLOPS per Core
[Plot: GFLOPS per core vs. number of processes (2, 4, 8, 12, 24, 48, 96, 120) for SWEEPJ1, SWEEPI, SWEEPK, and SWEEPJ2]
FDL3DI Profiling Data Interpretation
• Getting nowhere near the upper bound on operational intensity
• Drop in GFLOPS per core on a single node (12 cores) shows data movement is an issue
• Counts of vector operations vs. scalar operations show we could do better
FDL3DI and KCFD
• Many similarities but also major differences
  – FDL3DI has a much higher upper bound for operational intensity
  – ADI parallelization vs. Gauss-Seidel
FDL3DI Next Steps
1. Improve memory access patterns
   a. Cache-resident sub-blocks within a process
2. Improve vectorization
3. Hybrid parallelization
   a. OpenMP instead of MPI within a node (see the sketch after this list)
   b. Accelerators (Xeon Phi or GPU)
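A minimal sketch of the hybrid idea, assuming the ADI structure sketched earlier: the (j,k) grid lines of an I-direction sweep are independent tridiagonal solves, so OpenMP threads within a node can share them while MPI continues to handle the decomposition across nodes. The array names, loop bounds, and the call to the illustrative thomas routine are assumptions, not FDL3DI code.

    ! Sketch: OpenMP within a node for one ADI sweep in the I direction.
    ! Each (j,k) line is an independent tridiagonal solve, so the j/k loops
    ! can be shared among threads while MPI handles inter-node parallelism.
    subroutine sweep_i_omp(a, b, c, d, ni, nj, nk)
      implicit none
      integer, intent(in)    :: ni, nj, nk
      real(8), intent(in)    :: a(ni, nj, nk), c(ni, nj, nk)
      real(8), intent(inout) :: b(ni, nj, nk), d(ni, nj, nk)
      integer :: j, k
      !$omp parallel do collapse(2) schedule(static)
      do k = 1, nk
         do j = 1, nj
            call thomas(a(:, j, k), b(:, j, k), c(:, j, k), d(:, j, k), ni)   ! independent line solve
         end do
      end do
      !$omp end parallel do
    end subroutine sweep_i_omp

The usual motivation is that fewer MPI ranks per node means fewer sub-domain boundaries and less intra-node message traffic, which fits the earlier point about minimizing communication to save energy.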
Fruits of this Work
• Faster FDL3DI
• Better understanding of what kind of code runs well on today's hardware
• And tomorrow's hardware?
What is the future of CFD?