Enhancement of FDL3DI
John Eisenlohr, Nandan Phadke, P. Sadayappan

We are Experimentalists
• How to design efficient aircraft
  – theories of fluid dynamics
  – experiments with particular designs
• How to design efficient software
  – theories of computation
  – experiments with particular applications
• Deal with Technological Change

Pushing Toward Exascale

HPC Application Challenges
• Transistor counts are still increasing rapidly
• However, traditional single-core advancements peaked long ago
  – Power limitations
• Parallel processing hardware has emerged as the solution
  – Multi-core
  – Short vector units
  – Compute accelerators (Graphics Processors, Xeon Phi)
• Bandwidth has not increased at the same rate as computation
  – Cache
  – DRAM
  – Inter-chip communication

AFRL Spirit Bandwidth/Compute Roofline
• Kestrel Gauss-Seidel: 1904 bytes, 150 operations, 0.08 operational intensity (the arithmetic is worked out after the profiling charts below)
[Figure: roofline plot of double-precision GFLOP/s (y-axis, 8 to 256; peak 166.4 GFLOP/s) against operational intensity (FLOP/byte), with the Kestrel Gauss-Seidel kernel at 0.08 flop/byte and the balanced machine point at 1.65 flop/byte.]

Minimize communication to save energy
Source: John Shalf, LBNL

Vectorization (1/3)

Vectorization (2/3)

Vectorization (3/3)

HPC Applications: Key Issues
• Optimize data movement
  – Data locality
  – Algorithm modifications for better reuse
• Exploit vectorization
• Parallelism within a node
  – MPI
  – OpenMP
  – Accelerators

AVUS CFD Code Characteristics
• Unstructured-grid CFD
• Gauss-Seidel linear system solver
• Upper bound on operational intensity is 0.08
• Non-optimal vectorization
• Inherently serial algorithm

AVUS CFD Code Evaluation
• Multi-core cache contention
  – Open Source Fighter test case
  – Intra-node performance is sub-linear
  – Inter-node performance scales well
[Figure: AVUS scaling results on AFRL Spirit and OSC Oakley.]

AVUS CFD Code Status
• Vectorization improvements with changed data layout and hand-coding
• Cache-resident sub-blocks
  – requires reordering cells

FDL3DI Solver
• An Alternating Direction Implicit solver for CFD simulations
• Regular 3D grids
• 75K lines of Fortran90 code
• MPI parallelism

FDL3DI Solver Characteristics
• Four subroutines account for the bulk of processing
  – SWEEPJ1, SWEEPI, SWEEPK, SWEEPJ2
• Examination of the code and HPCToolkit data gives an upper bound on operational intensity for the SWEEPs in the range of 40-80 floating-point ops per byte of data read (quite high)
• SWEEPJ1, SWEEPK and SWEEPJ2 should vectorize well

FDL3DI Profiling Data: Operational Intensity
[Figure: operational intensity (ops/byte, 0 to 7) of SWEEPJ1, SWEEPI, SWEEPK and SWEEPJ2 vs. number of processes (2, 4, 8, 12, 24, 48, 96, 120).]

FDL3DI Profiling Data: GFLOPS per Core
[Figure: operational intensity (ops/byte, 0 to 7) and GFLOPS per core (0 to 4) of SWEEPJ1, SWEEPI, SWEEPK and SWEEPJ2 vs. number of processes (2, 4, 8, 12, 24, 48, 96, 120).]
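For reference while reading these charts and the interpretation that follows, here is the roofline arithmetic behind the earlier AFRL Spirit slide. It is a hedged sketch using only the numbers given there (150 operations per 1904 bytes moved, a 166.4 GFLOP/s double-precision peak, and a "balanced" point at 1.65 flop/byte), and it assumes the balanced point is the ridge where the bandwidth and compute ceilings of the roofline meet:

\[
\mathrm{OI}_{\mathrm{GS}} = \frac{150\ \text{flops}}{1904\ \text{bytes}} \approx 0.079\ \text{flop/byte},
\qquad
P_{\text{attainable}} \le \min\!\left(P_{\text{peak}},\ \frac{\mathrm{OI}_{\mathrm{GS}}}{1.65}\,P_{\text{peak}}\right)
\approx \frac{0.079}{1.65}\times 166.4 \approx 8\ \text{GFLOP/s}.
\]

In other words, a kernel at roughly 0.08 flop/byte is memory-bandwidth bound at about 5% of machine peak no matter how well it vectorizes.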
FDL3DI Profiling Data Interpretation
• Getting nowhere near the upper bound of operational intensity
• The drop in GFLOPS per core going from 2 processes to a full single node (12 cores) shows data movement is an issue
• Counts of vector operations vs. scalar operations show we could do better

FDL3DI and KCFD
• Many similarities but also major differences
  – FDL3DI has a much higher upper bound for operational intensity
  – ADI parallelization vs. Gauss-Seidel

FDL3DI Next Steps
1. Improve memory access patterns
   a. Cache-resident sub-blocks within a process
2. Improve vectorization
3. Hybrid parallelization
   a. OpenMP instead of MPI within a node
   b. Accelerators (Xeon Phi or GPU)
(A minimal Fortran sketch of steps 1a, 2 and 3a appears after the last slide.)

Fruits of this Work
• Faster FDL3DI
• Better understanding of what kind of code runs well on today's hardware
• And tomorrow's hardware?

What is the future of CFD?
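Referenced from the "FDL3DI Next Steps" slide: a minimal Fortran sketch of what steps 1a, 2 and 3a could look like on a regular 3D grid. This is not FDL3DI source and is not the ADI SWEEP algorithm itself; the subroutine name, the arrays q, qnew and rhs, the block size nb, and the 7-point update are illustrative assumptions standing in for the real kernels.

! Hypothetical illustration only, not FDL3DI source: a cache-blocked,
! OpenMP-threaded, vectorizable update of a regular 3D grid.
subroutine blocked_sweep_sketch(q, qnew, rhs, ni, nj, nk)
  implicit none
  integer, intent(in)  :: ni, nj, nk
  real(8), intent(in)  :: q(ni, nj, nk), rhs(ni, nj, nk)
  real(8), intent(out) :: qnew(ni, nj, nk)
  integer, parameter :: nb = 32        ! assumed sub-block edge, sized for cache
  real(8), parameter :: w = 1.0d0 / 6.0d0
  integer :: i, j, k, jb, kb

  ! Step 1a: tile the outer (j,k) loops into nb x nb sub-blocks so the
  !          neighbouring planes of q touched by the stencil are reused
  !          while they are still resident in cache.
  ! Step 3a: thread the tiled loops with OpenMP inside the node, rather
  !          than running one MPI rank per core.
  !$omp parallel do collapse(2) private(i, j, k) schedule(static)
  do kb = 2, nk - 1, nb
     do jb = 2, nj - 1, nb
        do k = kb, min(kb + nb - 1, nk - 1)
           do j = jb, min(jb + nb - 1, nj - 1)
              ! Step 2: the innermost loop runs over the first (unit-stride)
              !         array index, so the compiler can vectorize it.
              do i = 2, ni - 1
                 qnew(i, j, k) = rhs(i, j, k) + w *                        &
                      ( q(i-1, j, k) + q(i+1, j, k)                        &
                      + q(i, j-1, k) + q(i, j+1, k)                        &
                      + q(i, j, k-1) + q(i, j, k+1) )
              end do
           end do
        end do
     end do
  end do
  !$omp end parallel do
end subroutine blocked_sweep_sketch

! Tiny driver so the sketch compiles and runs on its own.
program sweep_demo
  implicit none
  integer, parameter :: n = 64
  real(8), allocatable :: q(:,:,:), qnew(:,:,:), rhs(:,:,:)
  allocate(q(n, n, n), qnew(n, n, n), rhs(n, n, n))
  call random_number(q)
  call random_number(rhs)
  qnew = 0.0d0
  call blocked_sweep_sketch(q, qnew, rhs, n, n, n)
  print *, 'sample updated value:', qnew(n/2, n/2, n/2)
  deallocate(q, qnew, rhs)
end program sweep_demo

One typical way to build the sketch is gfortran -O2 -fopenmp; the point of the example is the loop structure, not the particular update inside it.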