Efficient multigrid solvers for mixed finite element discretisations in NWP models

Colin Cotter†, David Ham†, Lawrence Mitchell†, Eike Hermann Müller*, Robert Scheichl*
*University of Bath, †Imperial College London

16th ECMWF Workshop on High Performance Computing in Meteorology, Oct 2014
Motivation
Separation of concerns and high-level abstractions
[Diagram: the domain specialist works at the level of equations, algorithms and discretisation in high-level code; the computational scientist, via PETSc, firedrake and PyOP2*, turns this into parallel, optimised code.]
Use case: complex iterative solver for pressure correction eqn.
* All credit for developing firedrake/PyOP2 goes to David Ham and his group at
Imperial, I’m giving this talk as a “user”
Motivation
Computational challenge in numerical forecasts:
Solver for elliptic pressure correction equation
(algorithmic + parallel) scalability & performance
[Plots: weak scaling of the total solution time and absolute FLOP performance (GigaFLOP to PetaFLOP range, against theoretical peaks) for CG and multigrid solvers, on up to 65536 CPU cores of HECToR (AMD Opteron) and up to 16384 GPUs of Titan (nVidia Kepler).]
0.5 · 10¹² dofs on 16384 GPUs of Titan, 0.78 PFLOPs, 20%-50% of peak BW
Finite volume discretisation, simpler geometry
Motivation
Challenges
Finite element discretisation (Met Office next generation dyn. core),
unstructured grids ⇒ more complicated
Duplicate effort for CPU, GPU (Xeon Phi, ...) implementation & optimisation: reinvent the wheel?
Two distinct issues get mixed:
algorithmic (domain specialist/numerical analyst)
parallelisation & low-level optimisation (computational scientist)
Goal
Separation of concerns (algorithmic ↔ computational)
⇒ rapid algorithm development and testing at high
abstraction level
Performance portability
Reuse existing, tested and optimised tools and libraries
⇒ Choice of correct abstraction(s)
This talk
Iterative solver for Helmholtz equation:
    φ + ω ∇·(φ* u) = r_φ
    −u − ω ∇φ = r_u
Mixed finite element discretisation, icosahedral grid
Matrix-free multigrid preconditioner for pressure correction
Performance portable firedrake/PyOP2 toolchain
[Rathgeber et al., (2012 & 2014)]
Python ⇒ automatic generation of optimised C code
Collaboration between
Numerical algorithm developers (Colin, Rob, Eike)
Computational scientists, Library developers (David,
Lawrence, {PETSc,PyOP2,firedrake} teams)
Shallow water equations
Model system: shallow water equations
    ∂φ/∂t + ∇·(φu) = 0
    ∂u/∂t + (u·∇)u + g∇φ = 0
Implicit timestepping
⇒ linear system for velocity u and height perturbation φ
    φ + ω ∇·(φ* u) = r_φ
    −u − ω ∇φ = r_u
reference state φ* ≡ 1,    ω = c_h ∆t = √(g φ*) ∆t
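Where does the symmetric coefficient ω come from? A hedged sketch (my assumption: backward Euler time stepping with the velocity rescaled by the gravity wave speed c_h; the talk may use a different, e.g. Crank-Nicolson, discretisation):

\phi^{n+1} + \Delta t\,\nabla\cdot(\phi^* u^{n+1}) = \phi^n, \qquad u^{n+1} + g\,\Delta t\,\nabla\phi^{n+1} = u^n
\tilde{u} = u / c_h, \quad c_h = \sqrt{g\phi^*}
\Rightarrow\quad \phi^{n+1} + \omega\,\nabla\cdot(\phi^*\tilde{u}^{n+1}) = \phi^n, \qquad \tilde{u}^{n+1} + \omega\,\nabla\phi^{n+1} = \tilde{u}^n, \qquad \omega = c_h\,\Delta t

(up to the overall sign convention used on the velocity equation above).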
Mixed system
Finite element discretisation: φ ∈ V2, u ∈ V1
⇒ Matrix system for dof-vectors Φ ∈ R^(n_V2), U ∈ R^(n_V1)
    A [ Φ ]  =  [ Mφ     ωB  ] [ Φ ]  =  [ Rφ ]
      [ U ]     [ ωB^T   −Mu ] [ U ]     [ Ru ]
Mixed finite elements
DG0 + RT1 (lowest order)
DG1 + BDFM1 (higher order)
    (Mφ)_ij ≡ ∫_Ω β_i(x) β_j(x) dx,    (Mu)_ef ≡ ∫_Ω v_e(x) · v_f(x) dx    (mass matrices)

    B_ie ≡ ∫_Ω ∇·v_e(x) β_i(x) dx    (derivative matrix)
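These building blocks map almost literally onto UFL. Below is a minimal sketch of assembling Mφ, Mu and B for the lowest-order pairing; the small unit-square test mesh and the variable names are placeholders of mine, not the talk's production setup (which uses an icosahedral grid):

from firedrake import *

mesh = UnitSquareMesh(8, 8)
V_pressure = FunctionSpace(mesh, 'DG', 0)   # V2, piecewise constant pressure
V_velocity = FunctionSpace(mesh, 'RT', 1)   # V1, lowest order Raviart-Thomas velocity

phi, psi = TrialFunction(V_pressure), TestFunction(V_pressure)
u, w = TrialFunction(V_velocity), TestFunction(V_velocity)

M_phi = assemble(psi*phi*dx)        # pressure mass matrix (Mφ)_ij
M_u = assemble(dot(w, u)*dx)        # velocity mass matrix (Mu)_ef
B = assemble(psi*div(u)*dx)         # weak divergence / derivative matrix B_ie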
Iterative solver
1. Outer iteration (mixed system): GMRES, Richardson, BiCGStab, ...

   Preconditioner:

       P ≡ [ Mφ     ωB   ]
           [ ωB^T   −Mu* ]        (Mu*: lumped velocity mass matrix)

       P⁻¹ = [ 1              0 ]  [ H⁻¹   0         ]  [ 1   ωB(Mu*)⁻¹ ]
             [ (Mu*)⁻¹ωB^T    1 ]  [ 0    −(Mu*)⁻¹   ]  [ 0   1         ]

   with the elliptic, positive definite Helmholtz operator

       H = Mφ + ω² B (Mu*)⁻¹ B^T

   (a numerical sanity check of this block factorisation is sketched after this list)
2. Inner iteration (pressure system H φ′ = R′_φ): CG with geometric multigrid preconditioner
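As a quick check that the factorisation above really inverts P, here is a small self-contained numpy sketch (random stand-ins for Mφ, Mu* and B; purely illustrative, not the solver code from the talk):

import numpy as np

n_p, n_u, omega = 5, 8, 0.7
rng = np.random.default_rng(0)
M_phi = np.diag(rng.uniform(1.0, 2.0, n_p))            # pressure mass matrix
M_u = np.diag(rng.uniform(1.0, 2.0, n_u))              # lumped velocity mass matrix Mu*
B = rng.standard_normal((n_p, n_u))                    # derivative matrix

P = np.block([[M_phi, omega*B], [omega*B.T, -M_u]])
H = M_phi + omega**2 * B @ np.linalg.inv(M_u) @ B.T    # Helmholtz operator

I_p, I_u, Z = np.eye(n_p), np.eye(n_u), np.zeros((n_p, n_u))
L = np.block([[I_p, Z], [np.linalg.inv(M_u) @ (omega*B.T), I_u]])
D = np.block([[np.linalg.inv(H), Z], [Z.T, -np.linalg.inv(M_u)]])
U = np.block([[I_p, omega*B @ np.linalg.inv(M_u)], [Z.T, I_u]])

print(np.allclose(L @ D @ U @ P, np.eye(n_p + n_u)))   # -> True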
Multigrid
Multigrid V-cycle
[Diagram: multigrid V-cycle, smoothing on a hierarchy of grids of relative size 1, 1/4, 1/16 and 1/64.]
Computational bottleneck
Smoother and operator application
    φ′ ↦ φ′ + ρ_relax D_H⁻¹ (R′_φ − H φ′),        H = Mφ + ω² B (Mu*)⁻¹ B^T
Intrinsic length scale
    Λ/∆x = c_s ∆t/∆x ∼ ν_CFL = O(10) grid spacings (∀ resolutions ∆x)
⇒ small number of multigrid levels / coarse-grid smoother sufficient
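To make the V-cycle structure and the smoother update above concrete, here is a toy numpy sketch for a 1D model Helmholtz problem. Everything in it (the model operator, the cell-averaging transfer operators, the relaxation parameter) is an illustrative assumption of mine, not the talk's implementation:

import numpy as np

def helmholtz_1d(n, omega2=1.0):
    # toy model operator H = M + omega^2 * stiffness on a 1D grid of n cells
    h = 1.0/n
    return (np.diag(np.full(n, h + 2.0*omega2/h))
            + np.diag(np.full(n-1, -omega2/h), 1)
            + np.diag(np.full(n-1, -omega2/h), -1))

def smooth(H, phi, r, rho=2.0/3.0, nu=2):
    # nu relaxation steps: phi -> phi + rho * D_H^{-1} (r - H phi)
    D_inv = 1.0/np.diag(H)
    for _ in range(nu):
        phi = phi + rho*D_inv*(r - H @ phi)
    return phi

def restrict(r):   # cell-averaging restriction (halves the grid)
    return 0.5*(r[0::2] + r[1::2])

def prolong(e):    # piecewise-constant prolongation
    return np.repeat(e, 2)

def vcycle(H_levels, phi, r, level=0):
    H = H_levels[level]
    phi = smooth(H, phi, r)                         # pre-smoothing
    if level + 1 < len(H_levels):
        r_c = restrict(r - H @ phi)                 # restrict residual
        e = vcycle(H_levels, np.zeros_like(r_c), r_c, level+1)
        phi = phi + prolong(e)                      # coarse grid correction
    return smooth(H, phi, r)                        # post-smoothing

# example usage: 4 levels, as in the convergence results later in the talk
H_levels = [helmholtz_1d(64 // 2**l) for l in range(4)]
phi, r = np.zeros(64), np.ones(64)
for _ in range(5):
    phi = vcycle(H_levels, phi, r)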
Grid iteration in FEM codes
Computational bottlenecks in finite element codes
[Figure: triangular grid; a cell e with its pressure dof φ_e and the velocity dofs u_{m_e(1)}, u_{m_e(2)}, u_{m_e(3)} on its three facets.]

Example: weak divergence φ = ∇·u  →  Φ = BU, with the local update

    φ_e ↦ φ_e + b_1^(e) u_{m_e(1)} + b_2^(e) u_{m_e(2)} + b_3^(e) u_{m_e(3)}

Loop over topological entities (e.g. cells). In each cell:
Access local dofs (e.g. pressure and velocity)
Execute local kernel (see the sketch below)
Write data back
Manage halo exchanges, thread parallelism
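Expressed in the op2.par_loop interface used later in this talk, such an update could look roughly like the sketch below. The data, map and kernel names (phi, u, b, cell2dof_*) are hypothetical placeholders of mine, not the talk's code:

from pyop2 import op2   # exact import path may differ between PyOP2/firedrake versions

# local kernel: phi_e <- phi_e + sum_k b_k^(e) * u_{m_e(k)}
kernel_code = '''void weak_divergence(double **phi, double **u, double **b) {
  for (int k=0; k<3; ++k) phi[0][0] += b[0][k]*u[k][0];
}'''
kernel = op2.Kernel(kernel_code, "weak_divergence")
op2.par_loop(kernel, mesh.cell_set,
             phi.dat(op2.INC, cell2dof_map_pressure),
             u.dat(op2.READ, cell2dof_map_velocity),
             b.dat(op2.READ, cell2dof_map_coefficients))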
Firedrake/PyOP2 framework
[Diagram: the firedrake/PyOP2 toolchain.]

Algorithm developers / numerical analysts (Eike, Rob, Colin) work against the Firedrake interface: problem definition in FEM weak form via the Unified Form Language (UFL), geometry, (non)linear solves, assembly and compiled expressions.

Library developers / computational scientists (Lawrence, David and the firedrake, PyOP2 and PETSc developers) provide the layers underneath:
modified FFC + FIAT: local assembly kernels (as an AST)
PETSc4py (KSP, SNES, DMPlex): meshes, matrices, vectors
PyOP2 interface: parallel loops (kernels executed over the mesh) on Set/Map/Dat data structures
COFFEE AST optimizer, parallel scheduling and code generation
backends: CPU (MPI + OpenMP/OpenCL), GPU (PyCUDA/PyOpenCL), future architectures

Firedrake: performance-portable finite element computation framework
PyOP2: framework for parallel computations on unstructured meshes, generating the explicitly parallel, hardware-specific implementation

[Florian Rathgeber, PhD thesis, Imperial College (2014)]
UFL
Unified Form Language [Alnæs et al. (2014)]
Example: Bilinear form
    a(u, v) = ∫_Ω u(x) v(x) dx
Python code

from firedrake import *
mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, 'CG', 1)
u = TrialFunction(V)
v = TestFunction(V)
a_uv = assemble(u*v*dx)   # assembles the mass matrix for the form a(u, v)
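A hedged follow-up (not on the slide): the same symbolic form can be handed straight to a linear solve; the right-hand side f and the solver options below are placeholders:

f = Constant(1.0)
a = u*v*dx                 # symbolic bilinear form (unassembled)
L = f*v*dx                 # right-hand side linear form
phi = Function(V)
solve(a == L, phi, solver_parameters={'ksp_type': 'cg'})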
Generated code
C code generated by Firedrake/PyOP2 (OpenMP backend)
[Listings: the generated local kernel and the loop that executes it over the grid; not reproduced here.]
Putting it all together
≈ 1,600 lines of highly modular Python code (incl. comments)
1. Use UFL wherever possible
       ∫ φ (∇·u) dx ↦ assemble(phi*div(u)*dx)    (weak divergence operator)
2. Use PETSc for iterative solvers
   provide operators/preconditioners as callbacks
3. Escape hatch: write direct PyOP2 parallel loops
   lumped mass matrix division: U ↦ (Mu*)⁻¹ U
kernel_code = '''void lumped_mass_divide(double **m, double **U) {
  // divide local velocity dofs by the corresponding lumped mass entries
  for (int i=0; i<4; ++i) U[i][0] /= m[0][i];
}'''
kernel = op2.Kernel(kernel_code, "lumped_mass_divide")
op2.par_loop(kernel, facetset,
             m.dat(op2.READ, self.facet2dof_map_facets),
             u.dat(op2.RW, self.facet2dof_map_BDFM1))
Helmholtz operator
Helmholtz operator: H φ = Mφ φ + ω² B (Mu*)⁻¹ B^T φ
Operator application
class Operator(object):
    [...]
    def apply(self, phi):
        psi = TestFunction(V_pressure)
        w = TestFunction(V_velocity)
        B_phi = assemble(div(w)*phi*dx)            # apply B^T (result in velocity space)
        self.velocity_mass_matrix.divide(B_phi)    # multiply by lumped (Mu*)^-1
        BT_B_phi = assemble(psi*div(B_phi)*dx)     # apply B (back to pressure space)
        M_phi = assemble(psi*phi*dx)               # apply pressure mass matrix
        return assemble(M_phi + self.omega**2*BT_B_phi)
Interface with PETSc
petsc_op = PETSc.Mat().create()
petsc_op.setPythonContext(Operator(omega))
petsc_op.setUp()
ksp = PETSc.KSP()
ksp.create()
ksp.setOperators(petsc_op)
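The preconditioner side can be hooked in the same way through petsc4py's "python" PC type. A minimal sketch under my own naming (MultigridPC and my_vcycle are hypothetical; the real code wraps the geometric multigrid described earlier):

class MultigridPC(object):
    def __init__(self, vcycle):
        self.vcycle = vcycle              # callable: residual array -> correction array
    def apply(self, pc, x, y):
        # y <- approximate H^{-1} x, e.g. one multigrid V-cycle
        y.setArray(self.vcycle(x.getArray(readonly=True)))

ksp.setType('cg')
pc = ksp.getPC()
pc.setType(pc.Type.PYTHON)
pc.setPythonContext(MultigridPC(my_vcycle))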
Algorithmic performance
Convergence history
Inner Solve: 3 CG iterations with multigrid preconditioner
Icosahedral grid, 327,680 cells
4 multigrid levels
[Plot: convergence history, relative residual ‖r‖₂/‖r₀‖₂ (10⁰ down to 10⁻⁶) vs. iteration (0-20), for DG0 + RT0 and DG1 + BDFM1, each with Richardson and GMRES outer iterations.]
Efficiency
Single node run on ARCHER, lowest order, 5.2 · 10⁶ cells [preliminary]
[Bar chart: time (0-7 s) spent in the total solve, the pressure solve and the Helmholtz operator application, annotated with achieved memory bandwidths of 1.3, 4.7, 10.6 and 17.1 GB/s.]
STREAM triad: 74.1GB/s per node ⇒ up to ≈ 23% of peak BW
Parallel scalability
Weak scaling on ARCHER [preliminary]
[Plot: weak scaling of the solution time (10⁰ to 10² s) on 24 to 6144 cores of ARCHER (1.3 to 335.5 million cells), comparing the matrix-free geometric MG preconditioner with hypre BoomerAMG, for DG0 + RT0 and DG1 + BDFM1.]
Conclusion
Summary
Matrix-free multigrid-preconditioner for pressure correction
equation
Implementation in firedrake/PyOP2/PETSc framework:
Performance portability
Correct abstractions ⇒ Separation of concerns
Outlook
Test on GPU backend
Extend to 3d: Regular vertical grid ⇒ Improved caching ⇒
Improved memory BW
Improve and extend parallel scaling
Full 3d dynamical core (Colin Cotter)
References
The firedrake project: http://firedrakeproject.org/
F. Rathgeber et al.: Firedrake: automating the finite element method by composing abstractions. In preparation.
F. Rathgeber et al.: PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes. High Performance Computing, Networking Storage and Analysis, SC Companion, pp. 1116-1123, Los Alamitos, CA, USA, 2012. IEEE Computer Society.
E. Müller et al.: Petascale elliptic solvers for anisotropic PDEs on GPU clusters. Submitted to Parallel Computing (2014) [arXiv:1402.3545]
E. Müller, R. Scheichl: Massively parallel solvers for elliptic PDEs in Numerical Weather- and Climate Prediction. QJRMS (2013) [arXiv:1307.2036]
Helmholtz solver code on github
https://github.com/firedrakeproject/firedrake-helmholtzsolver