Enhancement of FDL3DI
John Eisenlohr, Nandan Phadke, P. Sadayappan

We are Experimentalists
• How to design efficient aircraft
  – theories of fluid dynamics
  – experiments with particular designs
• How to design efficient software
  – theories of computation
  – experiments with particular applications
• Deal with Technological Change

Pushing Toward Exascale

HPC Application Challenges
• Transistor counts are still increasing rapidly
• However, traditional single-core advancements peaked long ago
  – Power limitations
• Parallel processing hardware has emerged as the solution
  – Multi-core
  – Short vector units
  – Compute accelerators (Graphics Processors, Xeon Phi)
• Bandwidth has not increased at the same rate as computation
  – Cache
  – DRAM
  – Inter-chip communication

AFRL Spirit Bandwidth/Compute Roofline
• Kestrel Gauss-Seidel: 1904 bytes, 150 operations, 0.08 operational intensity (the arithmetic is worked out after the profiling charts below)
[Figure: roofline plot of double-precision GFLOP/s (y-axis, 8 to 256; peak 166.4 GFLOP/s) against operational intensity (FLOP/byte), with the Kestrel Gauss-Seidel kernel at 0.08 flop/byte and the balanced machine point at 1.65 flop/byte.]

Minimize communication to save energy
Source: John Shalf, LBNL

Vectorization (1/3)

Vectorization (2/3)

Vectorization (3/3)

HPC Applications: Key Issues
• Optimize data movement
  – Data locality
  – Algorithm modifications for better reuse
• Exploit vectorization
• Parallelism within a node
  – MPI
  – OpenMP
  – Accelerators

AVUS CFD Code Characteristics
• Unstructured-grid CFD
• Gauss-Seidel linear system solver
• Upper bound on operational intensity is 0.08
• Non-optimal vectorization
• Inherently serial algorithm

AVUS CFD Code Evaluation
• Multi-core cache contention
  – Open Source Fighter test case
  – Intra-node performance is sub-linear
  – Inter-node performance scales well
[Figure: AVUS scaling results on AFRL Spirit and OSC Oakley.]

AVUS CFD Code Status
• Vectorization improvements with changed data layout and hand-coding
• Cache-resident sub-blocks
  – requires reordering cells

FDL3DI Solver
• An Alternating Direction Implicit solver for CFD simulations
• Regular 3D grids
• 75K lines of Fortran90 code
• MPI parallelism

FDL3DI Solver Characteristics
• Four subroutines account for the bulk of processing
  – SWEEPJ1, SWEEPI, SWEEPK, SWEEPJ2
• Examination of the code and HPCToolkit data gives an upper bound on operational intensity for the SWEEPs in the range of 40-80 floating-point ops per byte of data read (quite high)
• SWEEPJ1, SWEEPK and SWEEPJ2 should vectorize well

FDL3DI Profiling Data: Operational Intensity
[Figure: operational intensity (ops/byte, 0 to 7) of SWEEPJ1, SWEEPI, SWEEPK and SWEEPJ2 vs. number of processes (2, 4, 8, 12, 24, 48, 96, 120).]

FDL3DI Profiling Data: GFLOPS per Core
[Figure: operational intensity (ops/byte, 0 to 7) and GFLOPS per core (0 to 4) of SWEEPJ1, SWEEPI, SWEEPK and SWEEPJ2 vs. number of processes (2, 4, 8, 12, 24, 48, 96, 120).]
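For reference while reading these charts and the interpretation that follows, here is the roofline arithmetic behind the earlier AFRL Spirit slide. It is a hedged sketch using only the numbers given there (150 operations per 1904 bytes moved, a 166.4 GFLOP/s double-precision peak, and a "balanced" point at 1.65 flop/byte), and it assumes the balanced point is the ridge where the bandwidth and compute ceilings of the roofline meet:

\[
\mathrm{OI}_{\mathrm{GS}} = \frac{150\ \text{flops}}{1904\ \text{bytes}} \approx 0.079\ \text{flop/byte},
\qquad
P_{\text{attainable}} \le \min\!\left(P_{\text{peak}},\ \frac{\mathrm{OI}_{\mathrm{GS}}}{1.65}\,P_{\text{peak}}\right)
\approx \frac{0.079}{1.65}\times 166.4 \approx 8\ \text{GFLOP/s}.
\]

In other words, a kernel at roughly 0.08 flop/byte is memory-bandwidth bound at about 5% of machine peak no matter how well it vectorizes.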
FDL3DI Profiling Data Interpretation
• Getting nowhere near the upper bound of operational intensity
• The drop in GFLOPS per core going from 2 processes to a full single node (12 cores) shows data movement is an issue
• Counts of vector operations vs. scalar operations show we could do better

FDL3DI and KCFD
• Many similarities but also major differences
  – FDL3DI has a much higher upper bound for operational intensity
  – ADI parallelization vs. Gauss-Seidel

FDL3DI Next Steps
1. Improve memory access patterns
   a. Cache-resident sub-blocks within a process
2. Improve vectorization
3. Hybrid parallelization
   a. OpenMP instead of MPI within a node
   b. Accelerators (Xeon Phi or GPU)
(A minimal Fortran sketch of steps 1a, 2 and 3a appears after the last slide.)

Fruits of this Work
• Faster FDL3DI
• Better understanding of what kind of code runs well on today's hardware
• And tomorrow's hardware?

What is the future of CFD?
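Referenced from the "FDL3DI Next Steps" slide: a minimal Fortran sketch of what steps 1a, 2 and 3a could look like on a regular 3D grid. This is not FDL3DI source and is not the ADI SWEEP algorithm itself; the subroutine name, the arrays q, qnew and rhs, the block size nb, and the 7-point update are illustrative assumptions standing in for the real kernels.

! Hypothetical illustration only, not FDL3DI source: a cache-blocked,
! OpenMP-threaded, vectorizable update of a regular 3D grid.
subroutine blocked_sweep_sketch(q, qnew, rhs, ni, nj, nk)
  implicit none
  integer, intent(in)  :: ni, nj, nk
  real(8), intent(in)  :: q(ni, nj, nk), rhs(ni, nj, nk)
  real(8), intent(out) :: qnew(ni, nj, nk)
  integer, parameter :: nb = 32        ! assumed sub-block edge, sized for cache
  real(8), parameter :: w = 1.0d0 / 6.0d0
  integer :: i, j, k, jb, kb

  ! Step 1a: tile the outer (j,k) loops into nb x nb sub-blocks so the
  !          neighbouring planes of q touched by the stencil are reused
  !          while they are still resident in cache.
  ! Step 3a: thread the tiled loops with OpenMP inside the node, rather
  !          than running one MPI rank per core.
  !$omp parallel do collapse(2) private(i, j, k) schedule(static)
  do kb = 2, nk - 1, nb
     do jb = 2, nj - 1, nb
        do k = kb, min(kb + nb - 1, nk - 1)
           do j = jb, min(jb + nb - 1, nj - 1)
              ! Step 2: the innermost loop runs over the first (unit-stride)
              !         array index, so the compiler can vectorize it.
              do i = 2, ni - 1
                 qnew(i, j, k) = rhs(i, j, k) + w *                        &
                      ( q(i-1, j, k) + q(i+1, j, k)                        &
                      + q(i, j-1, k) + q(i, j+1, k)                        &
                      + q(i, j, k-1) + q(i, j, k+1) )
              end do
           end do
        end do
     end do
  end do
  !$omp end parallel do
end subroutine blocked_sweep_sketch

! Tiny driver so the sketch compiles and runs on its own.
program sweep_demo
  implicit none
  integer, parameter :: n = 64
  real(8), allocatable :: q(:,:,:), qnew(:,:,:), rhs(:,:,:)
  allocate(q(n, n, n), qnew(n, n, n), rhs(n, n, n))
  call random_number(q)
  call random_number(rhs)
  qnew = 0.0d0
  call blocked_sweep_sketch(q, qnew, rhs, n, n, n)
  print *, 'sample updated value:', qnew(n/2, n/2, n/2)
  deallocate(q, qnew, rhs)
end program sweep_demo

One typical way to build the sketch is gfortran -O2 -fopenmp; the point of the example is the loop structure, not the particular update inside it.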