Efficient multigrid solvers for mixed finite element discretisations in NWP models

Colin Cotter†, David Ham†, Lawrence Mitchell†,*, Robert Scheichl*, Eike Hermann Müller*
* University of Bath, † Imperial College London
16th ECMWF Workshop on High Performance Computing in Meteorology, Oct 2014

Motivation: separation of concerns and high-level abstractions

[Diagram: the domain specialist expresses equations, algorithms and the discretisation as high-level code; the firedrake/PyOP2/PETSc toolchain, maintained by computational scientists, turns this into parallel optimised code.]

Use case: a complex iterative solver for the pressure correction equation.

* All credit for developing firedrake/PyOP2 goes to David Ham and his group at Imperial; I am giving this talk as a "user".

Motivation: the computational challenge

The computational challenge in numerical forecasts is the solver for the elliptic pressure correction equation, which needs (algorithmic and parallel) scalability and performance.

[Plots: weak scaling of the total solution time and the absolute performance (FLOPs) for CG and multigrid solvers on HECToR (AMD Opteron, up to 65536 CPU cores) and Titan (nVidia Kepler, up to 16384 GPUs), compared against single-GPU and Titan theoretical peak.]

Previous result: 0.5·10^12 dofs on 16384 GPUs of Titan, 0.78 PFLOPs, 20%-50% of peak memory bandwidth, but for a finite volume discretisation and a simpler geometry.

Challenges

- Finite element discretisation (Met Office next-generation dynamical core), unstructured grids ⇒ more complicated.
- Duplicated effort for CPU and GPU (Xeon Phi, ...) implementation and optimisation: why reinvent the wheel?
- Two issues get mixed up: the algorithmic work (domain specialist / numerical analyst) and the parallelisation and low-level optimisation (computational scientist).

Goal

- Separation of concerns (algorithmic ↔ computational) ⇒ rapid algorithm development and testing at a high abstraction level.
- Performance portability.
- Reuse of existing, tested and optimised tools and libraries.
⇒ Choice of the correct abstraction(s).

This talk

Iterative solver for the Helmholtz equation

    φ + ω ∇·(φ* u) = r_φ
    -u - ω ∇φ = r_u

- Mixed finite element discretisation on an icosahedral grid.
- Matrix-free multigrid preconditioner for the pressure correction.
- Performance-portable firedrake/PyOP2 toolchain [Rathgeber et al. (2012 & 2014)]: Python ⇒ automatic generation of optimised C code.
- Collaboration between numerical algorithm developers (Colin, Rob, Eike) and computational scientists / library developers (David, Lawrence, the PETSc, PyOP2 and firedrake teams).

Shallow water equations

Model system: the shallow water equations

    ∂φ/∂t + ∇·(φ u) = 0
    ∂u/∂t + (u·∇) u + g ∇φ = 0

Implicit timestepping ⇒ a linear system for the velocity u and the height perturbation φ:

    φ + ω ∇·(φ* u) = r_φ
    -u - ω ∇φ = r_u

with reference state φ* ≡ 1 and ω = c_h ∆t = √(g φ*) ∆t.

Mixed system

Finite element discretisation with φ ∈ V2, u ∈ V1 ⇒ a matrix system for the dof-vectors Φ ∈ R^{n_V2}, U ∈ R^{n_V1}:

      ( Φ )   ( M_φ    ωB  ) ( Φ )   ( R_φ )
    A ( U ) = ( ωB^T  -M_u ) ( U ) = ( R_u )

Mixed finite element pairs: DG0 + RT1 (lowest order) and DG1 + BDFM1 (higher order), with

    (M_φ)_ij ≡ ∫_Ω β_i(x) β_j(x) dx       (pressure mass matrix)
    (M_u)_ef ≡ ∫_Ω v_e(x)·v_f(x) dx       (velocity mass matrix)
    B_ie ≡ ∫_Ω (∇·v_e(x)) β_i(x) dx       (derivative matrix)
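To make the block structure concrete, the following is a minimal UFL sketch, not the code used in this work: it assembles the mixed operator on a small planar test mesh with the lowest-order DG0 + RT1 pair; the mesh, the function space choices and the value of omega are placeholders rather than the icosahedral setup of the talk.

    from firedrake import *

    # Sketch only: assemble the mixed operator [[M_phi, omega*B], [omega*B^T, -M_u]]
    # on a unit-square test mesh (placeholders, not the talk's configuration).
    mesh = UnitSquareMesh(8, 8)
    V2 = FunctionSpace(mesh, 'DG', 0)   # pressure space (DG0)
    V1 = FunctionSpace(mesh, 'RT', 1)   # velocity space (lowest-order Raviart-Thomas)
    W = V2 * V1                         # mixed space

    phi, u = TrialFunctions(W)
    psi, w = TestFunctions(W)
    omega = Constant(0.1)               # placeholder value for omega = c_h*dt

    a = (psi*phi + omega*psi*div(u)
         + omega*div(w)*phi - inner(w, u))*dx

    A = assemble(a)                     # assembled mixed operator

In the actual solver the mixed operator is not assembled explicitly like this; the sketch only spells out the weak form behind the block matrix above.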
Iterative solver

1. Outer iteration (mixed system): GMRES, Richardson, BiCGStab, ...
   Preconditioner, with lumped velocity mass matrix M_u*:

       P ≡ ( M_φ    ωB   )
           ( ωB^T  -M_u* )

       P^{-1} = ( 1                 0 ) ( H^{-1}        0       ) ( 1   ωB (M_u*)^{-1} )
                ( (M_u*)^{-1} ωB^T  1 ) ( 0    -(M_u*)^{-1}     ) ( 0          1       )

   with the elliptic, positive definite Helmholtz operator

       H = M_φ + ω^2 B (M_u*)^{-1} B^T

2. Inner iteration (pressure system H φ' = R_φ'): CG with a geometric multigrid preconditioner.

Multigrid

[Diagram: multigrid V-cycle with smoothing on every level; the coarsened grids have 1/4, 1/16 and 1/64 of the cells of the fine grid.]

Computational bottleneck: the smoother and the operator application

    φ' ↦ φ' + ρ_relax D_H^{-1} (R_φ' - H φ'),    H = M_φ + ω^2 B (M_u*)^{-1} B^T

where D_H is the diagonal of H and ρ_relax a relaxation parameter.

Intrinsic length scale: Λ/∆x = c_s ∆t/∆x ∼ ν_CFL = O(10) grid spacings for all resolutions ∆x ⇒ a small number of multigrid levels and a simple coarse-level treatment are sufficient.
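To spell out this smoother kernel, here is a schematic numpy sketch of my own (not the talk's Firedrake implementation, which applies H matrix-free through UFL and PyOP2): the operator is passed as a callable apply_H, only the inverse diagonal is stored explicitly, and the relaxation parameters are placeholder values.

    import numpy as np

    def jacobi_smooth(apply_H, D_inv, R, phi, rho_relax=0.8, n_smooth=2):
        """Damped Jacobi smoother for H*phi = R:
            phi <- phi + rho_relax * D_H^{-1} * (R - H*phi).
        apply_H: callable realising the (possibly matrix-free) operator H,
        D_inv: inverse diagonal of H; rho_relax, n_smooth: placeholder values."""
        for _ in range(n_smooth):
            phi = phi + rho_relax * D_inv * (R - apply_H(phi))
        return phi

    # Tiny usage example with a dense stand-in for H (illustration only)
    n = 16
    H = np.eye(n) + 0.1*np.random.rand(n, n)
    phi = jacobi_smooth(lambda v: H @ v, 1.0/np.diag(H),
                        np.random.rand(n), np.zeros(n))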
Grid iteration in FEM codes

The computational bottlenecks in finite element codes are loops over the grid.

[Diagram: an unstructured triangular grid with numbered cells and facets; cell e carries the pressure dof φ_e and its three facets carry the velocity dofs u_{m_e(1)}, u_{m_e(2)}, u_{m_e(3)}.]

Example: the weak divergence φ = ∇·u, i.e. Φ = B U, which in every cell e performs the local update

    φ_e ↦ φ_e + b_1^(e) u_{m_e(1)} + b_2^(e) u_{m_e(2)} + b_3^(e) u_{m_e(3)}

The pattern is a loop over topological entities (e.g. cells); in each cell:
- access the local dofs (e.g. pressure and velocity),
- execute a local kernel,
- write the data back.
In addition, halo exchanges and thread parallelism have to be managed.

Firedrake/PyOP2 framework

[Diagram: layered architecture. Algorithm developers and numerical analysts (Eike, Rob, Colin) work against the Firedrake interface: problem definition in FEM weak form via the Unified Form Language (UFL); geometry, (non)linear solves, assembly and compiled expressions; local assembly kernels (ASTs) generated by a modified FFC with FIAT and optimised by the COFFEE AST optimiser; meshes, matrices and vectors provided through PETSc4py (KSP, SNES, DMPlex). Library developers and computational scientists (Lawrence, David, the PyOP2, firedrake and PETSc developers) maintain the PyOP2 layer: parallel loops that execute kernels over the mesh on data structures (Set, Map, Dat), with parallel scheduling and code generation for MPI, CPU (OpenMP/OpenCL), GPU (PyCUDA/PyOpenCL) and future architectures.]

Firedrake is a performance-portable finite-element computation framework; PyOP2 is a framework for parallel computations on unstructured meshes that produces explicitly parallel, hardware-specific implementations. [Florian Rathgeber, PhD thesis, Imperial College (2014)]

UFL: the Unified Form Language [Alnæs et al. (2014)]

Example: the bilinear form

    a(u, v) = ∫_Ω u(x) v(x) dx

Python code:

    from firedrake import *
    mesh = UnitSquareMesh(8, 8)
    V = FunctionSpace(mesh, 'CG', 1)
    u = TrialFunction(V)
    v = TestFunction(V)
    a_uv = assemble(u*v*dx)

Generated code

[Figure: C code generated by Firedrake/PyOP2 for the local kernel and for its execution over the grid, OpenMP backend.]

Putting it all together

≈ 1,600 lines of highly modular Python code (incl. comments).

1. Use UFL wherever possible, e.g. the weak divergence operator:

    ∫ φ (∇·u) dx ↦ assemble(phi*div(u)*dx)

2. Use PETSc for the iterative solvers; operators and preconditioners are provided as callbacks.

3. Escape hatch: write direct PyOP2 parallel loops, e.g. for the lumped mass matrix division U ↦ (M_u*)^{-1} U:

    kernel_code = '''void lumped_mass_divide(double **m, double **U) {
        for (int i=0; i<4; ++i) U[i][0] /= m[0][i];
    }'''
    kernel = op2.Kernel(kernel_code, "lumped_mass_divide")
    op2.par_loop(kernel, facetset,
                 m.dat(op2.READ, self.facet2dof_map_facets),
                 u.dat(op2.RW, self.facet2dof_map_BDFM1))

Helmholtz operator

Helmholtz operator: H φ = M_φ φ + ω^2 B (M_u*)^{-1} B^T φ.

Operator application:

    class Operator(object):
        [...]
        def apply(self, phi):
            psi = TestFunction(V_pressure)
            w = TestFunction(V_velocity)
            B_phi = assemble(div(w)*phi*dx)
            self.velocity_mass_matrix.divide(B_phi)
            BT_B_phi = assemble(psi*div(B_phi)*dx)
            M_phi = assemble(psi*phi*dx)
            return assemble(M_phi + self.omega**2*BT_B_phi)

Interface with PETSc:

    petsc_op = PETSc.Mat().create()
    petsc_op.setPythonContext(Operator(omega))
    petsc_op.setUp()
    ksp = PETSc.KSP()
    ksp.create()
    ksp.setOperators(petsc_op)

Algorithmic performance

Convergence history. Inner solve: 3 CG iterations with multigrid preconditioner; icosahedral grid with 327,680 cells; 4 multigrid levels.

[Plot: relative residual ||r||_2/||r_0||_2 versus iteration (0 to 20), falling from 10^0 to below 10^-6, for DG0 + RT0 and DG1 + BDFM1 with Richardson and GMRES outer iterations.]

Efficiency

Single-node run on ARCHER, lowest order, 5.2·10^6 cells [preliminary].

[Bar chart: time spent in the total solve, the pressure solve and the Helmholtz operator application, with achieved memory bandwidths of 1.3, 4.7, 10.6 and 17.1 GB/s for the individual components.]

STREAM triad: 74.1 GB/s per node ⇒ up to ≈ 23% of peak memory bandwidth.

Parallel scalability

Weak scaling on ARCHER [preliminary].

[Plot: solution time versus number of cores for DG0 + RT0 and DG1 + BDFM1, comparing hypre BoomerAMG with the matrix-free geometric multigrid; the number of cells grows with the core count, from 1.3 million cells upwards.]

Conclusion

Summary
- Matrix-free multigrid preconditioner for the pressure correction equation.
- Implementation in the firedrake/PyOP2/PETSc framework: performance portability, and the correct abstractions ⇒ separation of concerns.

Outlook
- Test the GPU backend.
- Extend to 3d: a regular vertical grid ⇒ improved caching ⇒ improved use of memory bandwidth.
- Improve and extend parallel scaling.
- Full 3d dynamical core (Colin Cotter).

References

The firedrake project: http://firedrakeproject.org/

F. Rathgeber et al.: Firedrake: automating the finite element method by composing abstractions. In preparation.

F. Rathgeber et al.: PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes. High Performance Computing, Networking, Storage and Analysis, SC Companion, pp. 1116-1123, Los Alamitos, CA, USA, 2012. IEEE Computer Society.

E. Müller et al.: Petascale elliptic solvers for anisotropic PDEs on GPU clusters. Submitted to Parallel Computing (2014) [arXiv:1402.3545]

E. Müller, R. Scheichl: Massively parallel solvers for elliptic PDEs in Numerical Weather- and Climate Prediction. QJRMS (2013) [arXiv:1307.2036]

Helmholtz solver code on github: https://github.com/firedrakeproject/firedrake-helmholtzsolver
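Appendix-style illustration: the full implementation lives in the repository linked above. Purely to show the petsc4py callback pattern of the "Putting it all together" and "Helmholtz operator" slides in a self-contained way, here is a toy sketch of my own, with a 1d stand-in operator and placeholder sizes instead of the talk's Helmholtz operator.

    from petsc4py import PETSc

    n = 32  # placeholder problem size

    class ToyOperator(object):
        # Matrix-free application of a 1d SPD stand-in operator
        #   (H x)_i = (1 + 2/h^2) x_i - (x_{i-1} + x_{i+1})/h^2,
        # standing in for the pressure Helmholtz operator of the talk.
        def __init__(self, n):
            self.h = 1.0/n
        def mult(self, mat, x, y):
            xa = x.getArray(readonly=True)
            ya = (1.0 + 2.0/self.h**2)*xa
            ya[1:] -= xa[:-1]/self.h**2
            ya[:-1] -= xa[1:]/self.h**2
            y.setArray(ya)

    class DiagPC(object):
        # Jacobi-type preconditioner supplied as a Python callback
        def setUp(self, pc):
            self.d_inv = 1.0/(1.0 + 2.0*n**2)  # inverse of the constant diagonal
        def apply(self, pc, x, y):
            y.setArray(self.d_inv*x.getArray(readonly=True))

    # Wrap the Python operator in a PETSc shell matrix and solve with CG
    H = PETSc.Mat().createPython([n, n], comm=PETSc.COMM_SELF)
    H.setPythonContext(ToyOperator(n))
    H.setUp()

    ksp = PETSc.KSP().create(comm=PETSc.COMM_SELF)
    ksp.setOperators(H)
    ksp.setType('cg')
    ksp.getPC().setType('python')
    ksp.getPC().setPythonContext(DiagPC())

    x = PETSc.Vec().createSeq(n)
    b = x.duplicate()
    b.set(1.0)
    ksp.solve(b, x)
    print("CG iterations:", ksp.getIterationNumber())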