Applications

Tianhe
天河
Software Issues for
Extreme Large Scale Computing
Prof. Yutong Lu
School of Computer Science, NUDT &
State Key Laboratory of High Performance Computing
[email protected]
国防科学技术大学
National University of Defense Technology
Outline
r Trend
of HPC Architecture
r Scalable
System software
r Applications
国防科学技术大学
National University of Defense Technology
Challenges
PSPR
r Performance
r Scalability
r Power consumption
r Reliability
Trend of Architecture
r Tree
Tianhe
天河
carriages of Performance
Ø  Frequency
Ø  ILP
Ø  Parallelism
r Performance
Ø  ……
= Parallelism
Ø  Year
2010:TH-1A,4.7Pflops,7168Nodes,186,368
Cores
Ø  Year 2013:TH-2, 54.9Pflops, 16000Nodes,
3,120,000 Cores
Ø  ……
r Exploit
parallelism
Ø  Longitude
( 100,000nodes)
Ø  Latitude(multi/many cores, SIMD、ILP)
国防科学技术大学
National University of Defense Technology
Trend of Architecture
r Heterogeneous
Ø  Some
architecture
of top-level supercomputers
u  Tianhe-2
– 
Intel Xeon Phi
u  Titan
– 
NVIDIA K20X GPU
u  Lately,
r Compute
#53/Top500, #24/Top100,#4/Top10
Efficiency
Ø  More
computations per joule
Ø  More computations per transistor
国防科学技术大学
National University of Defense Technology
Tianhe
天河
Trend of Architecture
r Many
Ø Intel
core processor
MIC
u >60cores,
>200threads
u 1.15GHz
u >
1TFlops performance
u 512b SIMD
Ø GPU,
NVIDIA Kapler
u 2688
cores
u 732MHz
u 1.31TFlops
国防科学技术大学
National University of Defense Technology
Tianhe
天河
Trend of Architecture
r Tianhe
TH-1A
GPU
TH-2
MIC
vs
Data Parallel
r  Simple instruction
r 
r 
r 
Limited scheduling
GPU Direct available
Steep learning curve
r  Supporting
Cuda
Open CL
…
2CPU + 2GPU ~71%
r  Whole system 56.5%
r 
国防科学技术大学
National University of Defense Technology
Native,Offload,Symmetric,Shared
r 
SIMD available
Ø  ~ 4.5 times speedup
on Tianhe-2
Relatively easy to get started
Intel Supporting
r 
2CPU + 3MIC ~76.5%
r 
Whole system 61.6%
r 
~40% ↑ MPI communication
on Tianhe-1A
Ø  5% ↑ Linpack
r 
Multi threads & SIMD
Flexible modes
Ø 
Ø 
Ø 
Ø 
Ø 
天河
supercomputers
r 
Ø 
Tianhe
r 
Trend of Architecture
Memory Hierarchy
r Performance of CPU ↑59%, Perf of MEM ↑26%
Register        1 cycle
L1 Cache        3 cycles
L2 Cache        10 cycles
L3 Cache        30 cycles
Local Mem       150 cycles
Non-local MEM   >1500 cycles
r Exploit data locality, reduce communication and memory accessing
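A rough way to see why data locality matters is to weight the slide's latency figures by how often each level serves an access. The sketch below is illustrative only: the cycle counts are the ones listed on the slide (with ">1500" taken as 1500), while the two access mixes are made-up assumptions, not measurements from Tianhe.

```python
# Cycle counts per memory level, as listed on the slide.
LATENCY_CYCLES = {
    "Register": 1,
    "L1 Cache": 3,
    "L2 Cache": 10,
    "L3 Cache": 30,
    "Local Mem": 150,
    "Non-local MEM": 1500,  # slide says ">1500"; 1500 is used as a lower bound
}


def average_cycles(access_mix):
    """Weighted average latency, given the fraction of accesses served per level."""
    total = sum(access_mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(LATENCY_CYCLES[level] * frac for level, frac in access_mix.items())


# Hypothetical access mixes (assumptions chosen for illustration).
GOOD_LOCALITY = {"L1 Cache": 0.90, "L2 Cache": 0.08, "Local Mem": 0.02}
POOR_LOCALITY = {"L1 Cache": 0.60, "Local Mem": 0.30, "Non-local MEM": 0.10}

if __name__ == "__main__":
    print("good locality:", average_cycles(GOOD_LOCALITY), "cycles per access")
    print("poor locality:", average_cycles(POOR_LOCALITY), "cycles per access")
```

Under these assumed mixes, a modest share of non-local accesses dominates the average cost, which is the point of the "exploit data locality" bullet.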
Trend of Architecture
r Memory
Tianhe
天河
architecture will be benefited from
multiple technologies
Ø Deeper
memory hierarchy
Ø Advanced package technology
u 3D
stack、MCM
Ø Optical
connection btw chips
国防科学技术大学
National University of Defense Technology
Trend of Architecture
Power Consumption
r PW for data moving = 48X PW for data computing
Ø MLA inside core: 100 pJ
Ø Read inside CPU: 4800 pJ
Ø Data moving btw cores: 7500 pJ
Ø Data moving btw nodes: 9000 pJ
r DTF, reduce 20% power consumption, with 5% performance loss
r Power control applications, power aware, minimum data moving
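The energy figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes the four energy values pair with the four operations in the order listed, and that runtime scales inversely with performance. Both are assumptions made for illustration, not statements from the slide.

```python
# Energy per operation in picojoules, paired in the order listed on the slide.
ENERGY_PJ = {
    "MLA inside core": 100,
    "Read inside CPU": 4800,
    "Data moving btw cores": 7500,
    "Data moving btw nodes": 9000,
}


def relative_cost(operation, baseline="MLA inside core"):
    """Energy of an operation as a multiple of the baseline operation."""
    return ENERGY_PJ[operation] / ENERGY_PJ[baseline]


def energy_to_solution(power_reduction, performance_loss):
    """Relative energy for a fixed job: power x time, with time ~ 1 / performance."""
    relative_power = 1.0 - power_reduction
    relative_time = 1.0 / (1.0 - performance_loss)
    return relative_power * relative_time


if __name__ == "__main__":
    for name in ENERGY_PJ:
        print(f"{name}: {relative_cost(name):.0f}x")
    # 20% less power with 5% less performance, as quoted on the slide.
    print(f"relative energy to solution: {energy_to_solution(0.20, 0.05):.3f}")
```

With these assumptions, a read inside the CPU costs 48 times the in-core MLA, matching the 48X on the slide, and the 20% power cut still saves about 16% of the energy for a fixed job despite the longer runtime.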
Trend of Architecture
Interconnection network
r NIC
Ø High Bandwidth
Ø Multiple Lanes
r Router
Ø High radix Vs. Low radix
r Topology
Ø N-D Torus Vs. Fat Tree
Ø N Dimension Tree
r Optical
Ø High BW, Low Latency, EMC
r Cost
r Topo-aware software
[Figure: interconnect topology. Racks 0 to 124, built from 32-compute-node groups with NRM, SW, M and BTM modules, connected through 576-port switches 0 to 12]
Trend of Architecture
r Communication
r Reliability
r Power
r Programmability
Heavy the burden of Software
Software issues
r Scalability
Ø How
to use the exist systems better
Ø How to explore the next generation systems
r Resilience
Ø Reduce
the CR overhead
Ø Lightweight resilience method
r Power
Control
r Programmability
r HPC vs Big data
Ø Data
management and filesystem
国防科学技术大学
National University of Defense Technology
Tianhe
天河
Highlights of Tianhe-2
Perf     54.9PFlops / 33.86PFlops
Nodes    16000
Mem      1.4PB
Racks    125+8+13+24=170 (720m2)
Power    17.8 MW (1.9GFlops/W)
Cool     Close-coupled chilled water cooling
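The power-efficiency figure follows from the performance and power numbers in this table. A minimal check, assuming the 1.9GFlops/W value is computed from the 33.86PFlops figure (the lower of the two performance numbers) rather than from the 54.9PFlops one:

```python
def gflops_per_watt(pflops, megawatts):
    """Convert PFlops and MW to GFlops per watt (1 PFlops = 1e6 GFlops, 1 MW = 1e6 W)."""
    return (pflops * 1e6) / (megawatts * 1e6)


if __name__ == "__main__":
    print(f"33.86 PFlops at 17.8 MW: {gflops_per_watt(33.86, 17.8):.2f} GFlops/W")
    print(f"54.9 PFlops at 17.8 MW:  {gflops_per_watt(54.9, 17.8):.2f} GFlops/W")
```

Only the first result rounds to 1.9, which supports the assumption that the quoted efficiency is based on the measured rather than the peak performance.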
TH-2
[Figure: system composition. TH-2 system (125 x Rack), Rack (8 x Frame), Frame (8 x board), Compute board; module labels APM, CPM, ION]
r Processors: IVB #32000, Phi #48000, FT-1500 #4096
r TH-Net: racks 0 to 124 built from groups of 32 compute nodes (NRM, SW, M, BTM modules), connected through 576-port switches 0 to 12
r Service Manage System: Independent CPU Array, Commercial CPU Array
r Node types: FT Compute Node, Compute Node, IO Enhanced Compute Node, Service Node, Login Node, Management Node, IO Server Node
r Networks: IP Interconnect Network, IB Storage Network
r Storage: Distributed Local Storage, Accelerating Storage, Global Storage
r Hybrid Hierarchy shared storage System, 12.4PB
Highlights of Tianhe-2
r Software
Stack
国防科学技术大学
National University of Defense Technology
Tianhe
天河
Programming model
r Trend
Tianhe
天河
of programming model
Ø Whole
system
u MPI
u New
Ø Intra
Data-driven model
node
Portability
Performance
Simplicity and Symmetry
Modularity
Compatibility
Completeness Distributed memory
u Various
–  OpenMP,
Ø Others
u PGAS
...
国防科学技
–  术 大 学
National University of Defense Technology
Cuda/OpenCL, OpenACC
We are here…
r Top5 (2013.11)
Computer              Cores        Rmax (Tflop/s)   Rpeak (Tflop/s)
Tianhe-2              3,120,000    33,862.7         54,902.4
Titan                 560,640      17,590.0         27,112.5
Sequoia BlueGene/Q    1,572,864    16324.75         20132.66
K computer            705,024      10510.00         11280.38
Mira - BlueGene/Q     786,432      8162.38          10066.33
AVG #core of top500: 41,434
r Selected applications on Tianhe
Ø More than 10,000 ~ 1,000,000 cores
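The Rmax and Rpeak figures in the Top5 list give the LINPACK efficiency of each system directly. The short sketch below recomputes that ratio from the numbers on the slide; the variable and function names are illustrative.

```python
# (name, cores, Rmax in Tflop/s, Rpeak in Tflop/s), as listed for 2013.11.
TOP5 = [
    ("Tianhe-2", 3_120_000, 33862.7, 54902.4),
    ("Titan", 560_640, 17590.0, 27112.5),
    ("Sequoia BlueGene/Q", 1_572_864, 16324.75, 20132.66),
    ("K computer", 705_024, 10510.00, 11280.38),
    ("Mira - BlueGene/Q", 786_432, 8162.38, 10066.33),
]


def linpack_efficiency(rmax, rpeak):
    """Fraction of peak performance achieved on LINPACK."""
    return rmax / rpeak


EFFICIENCY = {name: linpack_efficiency(rmax, rpeak) for name, _, rmax, rpeak in TOP5}

if __name__ == "__main__":
    for name, eff in EFFICIENCY.items():
        print(f"{name:20s} {eff:6.1%}")
```

The two accelerator-based systems at the top of the list reach a noticeably lower fraction of peak than the homogeneous ones, which is consistent with the software burden discussed in the earlier slides.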
Scalable MPI
r Performance
Ø P2P: Bandwidth/Latency
Ø Collective communication
Ø Communicator/Group operations
Ø MPI-Init
r Resource consumption
Ø Memory
Ø Network connection
r Data structures should be redesigned
Ø Communicator, RMA window, protocol buffer…
Scalable MPI
r TH-Express2
Ø Network
& TH-Express2+
Interface Chip: NIC
u 10Gbps
X 8lane
u 14Gbps X 8lane(plus)
Ø  Network
u 16
Router Chip: NRC
ports, more(plus)
Ø Optic
and electronic hybrid network
Ø Topology:
Ø  Design
Fat tree à N Dimension Tree
for extension to 100PFlops
国防科学技术大学
National University of Defense Technology
Tianhe
天河
Scalable MPI
Message Passing services over TH-Express
r Galaxy Express (GLEX)
Ø Basic message passing infrastructure on network interface
Ø User level communication technology
Ø User and kernel API
r MPICH-GLEX
Ø Based on MPICH from ANL
Ø Nemesis netmod using GLEX
r Design Consideration
Ø Protocol: different communication mechanisms exhibit different performance and resource usage
Ø Application characteristic: communication mode, such as nearest-neighbor communication
Ø Scalability: balance between performance and resource usage
Scalable MPI
r Message
Tianhe
天河
passing protocols
r Various protocols in low level with TH-Net
Ø Eager
Protocol
u Exclusive
– 
Performance oriented
u Shared
– 
RDMA Channel
Scalability oriented
u Hybrid
– 
RDMA Channel
channels
Combine application model
Ø  Rendezvous
u Zero-copy
protocol
data transfer based on RDMA Get
国防科学技术大学
National University of Defense Technology
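The protocol choices on this slide can be pictured as a small decision function. The sketch below is a hypothetical illustration, not the MPICH-GLEX implementation: the 8 KiB eager limit, the `Plan` type and the `peer_is_frequent` flag are all assumptions introduced here, and the hybrid rule (exclusive channels only for frequently used peers, such as nearest neighbours) is one possible reading of "combine application model".

```python
from dataclasses import dataclass

EAGER_LIMIT_BYTES = 8 * 1024  # hypothetical threshold, not taken from the slides


@dataclass(frozen=True)
class Plan:
    protocol: str  # "eager" or "rendezvous"
    channel: str   # "exclusive", "shared" or "rdma-get"


def choose_protocol(msg_bytes, peer_is_frequent, eager_limit=EAGER_LIMIT_BYTES):
    """Pick a transfer plan for one message (illustrative hybrid policy)."""
    if msg_bytes > eager_limit:
        # Large message: handshake first, then the receiver pulls the data
        # straight from the sender's buffer with RDMA Get (zero-copy).
        return Plan("rendezvous", "rdma-get")
    # Small message: send immediately into a pre-allocated channel.
    # Exclusive channels favour performance, shared channels favour scalability.
    return Plan("eager", "exclusive" if peer_is_frequent else "shared")


if __name__ == "__main__":
    print(choose_protocol(512, peer_is_frequent=True))
    print(choose_protocol(512, peer_is_frequent=False))
    print(choose_protocol(4 * 1024 * 1024, peer_is_frequent=True))
```

Keeping exclusive channels for a few frequent peers bounds per-process memory as the job grows, while the common nearest-neighbour traffic still takes the fast path.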
Scalable MPI
r Collective
communication
Ø Data-parallel
styles for users
Ø MPI interface level
u NonBlock
collective
u Alltoallv/AllGetherV
u Group-split
Ø Implementation
u Scalable
level
algorithm
u Topology aware
u Hardware offload
国防科学技术大学
National University of Defense Technology
Tianhe
天河
Scalable MPI
r Collective offload
Ø  Construct topology-aware algorithm tree dynamically
Ø  Message passed between tree nodes automatically based
on the trigger mechanism of NIC
Ø  Bypass effect of OS noise, improving overlap of
computation and communication
Ø Reduce latency of large scale collective operation
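The idea of an algorithm tree for collectives can be simulated in software. The sketch below is a host-side model only: it builds a k-ary tree over an ordered list of ranks and combines values from the leaves to the root. On the real system the forwarding is triggered by the NIC; the function names, the fanout and the assumption that ranks arrive already sorted by physical proximity are all hypothetical.

```python
def build_tree(ranks, fanout=2):
    """Return {rank: [children]} for a k-ary tree laid over the given rank order.

    If `ranks` is sorted by physical location (an assumption here), nearby
    ranks end up in the same subtree.
    """
    children = {r: [] for r in ranks}
    for i, r in enumerate(ranks[1:], start=1):
        parent = ranks[(i - 1) // fanout]
        children[parent].append(r)
    return children


def tree_reduce(values, children, root, op=lambda a, b: a + b):
    """Combine values bottom-up; returns (result, depth of the tree below root)."""
    result, depth = values[root], 0
    for child in children[root]:
        sub_result, sub_depth = tree_reduce(values, children, child, op)
        result = op(result, sub_result)
        depth = max(depth, sub_depth + 1)
    return result, depth


if __name__ == "__main__":
    ranks = list(range(8))
    values = {r: r + 1 for r in ranks}
    tree = build_tree(ranks, fanout=2)
    total, depth = tree_reduce(values, tree, root=0)
    print("sum:", total, "tree depth:", depth)
```

The depth grows logarithmically with the number of ranks, so the number of forwarding steps on the critical path stays small even for very large jobs; a wider fanout trades fewer steps for more work per node.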
Scalable MPI
r Non-stop
Tianhe
天河
and fault Resilient MPI (NR-MPI)
Ø  Application
continue execution without being relaunched
Ø  Failure detection and MPI state recovery done by runtime
Ø  Data-backup by application-level diskless C/R
Ø  Reconstruct of MPI communicator and channel
Legend
Failure
Detector
NR-MPI Programming Interface
NR-­‐MPI
Normal execution
Stand-by execution
Local data
Recreating the
Data backup
Data restore
restore
World comm.
Data comm for mutual
Local
Failure detection
data backup and restore
data copy
and notification
MPI Application
Data Backup and Restore Module
P0
P1
Failure Detecting
Module
State Management
Module
Failure Recovery
Module
Existing MPI Library
P2
P3
P2
P4
Operating System
Communication Adapter
P5
Phase 1
国防科学技术大学
National University of Defense Technology
FA
Phase 2
Phase 3
-- publishing on ICPADS2013
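Diskless checkpointing with mutual backup can be sketched in plain Python. This is a toy model under stated assumptions, not the NR-MPI code: processes are simple objects, each keeps its own in-memory checkpoint plus a copy of one neighbour's data (a ring of buddies is assumed here), and a single failure is repaired by a stand-by process that restores from the buddy copy. The names `Proc`, `checkpoint` and `recover` are hypothetical.

```python
class Proc:
    """One simulated process with its working data and in-memory checkpoints."""

    def __init__(self, rank, data):
        self.rank = rank
        self.data = list(data)   # working state
        self.local_copy = None   # own checkpoint, held in memory
        self.buddy_copy = None   # checkpoint held on behalf of the previous rank


def checkpoint(procs):
    """Each process saves its data locally and on the next rank (no disk involved)."""
    n = len(procs)
    for p in procs:
        p.local_copy = list(p.data)
        procs[(p.rank + 1) % n].buddy_copy = list(p.data)


def recover(procs, failed):
    """Replace one failed rank from its buddy's copy and roll the others back."""
    n = len(procs)
    holder = procs[(failed + 1) % n]
    replacement = Proc(failed, holder.buddy_copy)  # stand-by process takes over
    procs[failed] = replacement
    for p in procs:
        if p.rank != failed:
            p.data = list(p.local_copy)            # survivors restore local data
    checkpoint(procs)                              # rebuild all backups
    return replacement


if __name__ == "__main__":
    procs = [Proc(i, [i, i * 10]) for i in range(4)]
    checkpoint(procs)
    procs[2].data.append(99)   # progress made after the checkpoint
    recover(procs, failed=1)   # rank 1 is lost and replaced
    print([p.data for p in procs])
```

The model tolerates one failure at a time: if a process and the neighbour holding its copy fail together, the data is lost, which is the usual trade-off of in-memory buddy schemes.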
Dynamic Software
r Complexity:Multidisciplinary,
Tianhe
天河
Multi-physics,
Multi-scale, Multi-method
r Legacy
applications:, Long term for developing,
Expensive, Difficult
r Autotuning the performance
r Dynamic resources requirement and providing
r Topo-aware and Latency hiding
r Resource sharing & Hybrid runtime
r Fault tolerant and Resilience
r Rethink & Redesign the software
国防科学技术大学
National University of Defense Technology
Scientific Discovery
r Creative
Computing Technology
Ø Hardware,
system software, algorithm,
applications
r Creative
Ø Data
r Big
Data Processing Technology
management, Analysis, Visualization
Data come from
Ø Experiment
Ø Observation
Ø Sensor
network
Ø Simulation
r Challenge
of computing/throughput
国防科学技术大学
National University of Defense Technology
Tianhe
天河
HPC Vs Big Data
r Increasing
Tianhe
天河
I/O requirements
Ø  Large
scale Pre/Post data sets
Ø  Visualization and Analysis
Ø  Big science with Big data
Ø  Expected data volume per simulation from ~GB to ~PB,
typically ~100 TB
r I/O
Bottleneck
Ø  Scalability,
Efficiency, Performance, Economic and
durability
r What’s
needed for Parallel IO interface
Ø  More
hints could be expressed
Ø  More patterns could be supported
Ø  Interface to application IO library
国防科学技术大学
National University of Defense Technology
Scalable IO Structure
r IO
Tianhe
天河
Architecture on Tianhe-2
Ø Multiple Layers
u  Local Disk
u  PCI-E SSD
u  Disk Array
Ø 6400
& Hybrid Storages
local Disks
u Bus
attached
High
High Speed
Speed Network
Network
Ø 64
Storage
Storage Network
Network
Global
Global
Storage
Storage
and IB QDR port
Storage Servers
u Sustained:about
国防科学技术大学
National University of Defense Technology
Burst
Buffer
SSD
SSD
I/O
I/O Forward
Forward
Ø 256 IO nodes
u  Burst: above 1TB/s
u  TH-Express
Data
Analysis
Store
Local
Local
Storage
Storage
100GB/s
Large
Capacity
Persistent
Store
Scalable IO Structure
r H2FS: Hybrid Hierarchy File System
Ø HVN, Hybrid, Unified and Isolated dynamic namespace maintained by centralized servers
Ø DPU, A fundamental HVN unit for data processing, tightly couples a compute node with its local storage
Ø Layered and enriched metadata, I/O hints as high level metadata
r I/O API
Ø POSIX
Ø MPI-IO
Ø Extended API, layout and policy guide
Ø HDF5 over POSIX and extended API
Ø Object API (todo)
[Figure: H2FS architecture. HPC Workloads and Data-Intensive Workloads reach the Interface (Posix API, Layout API, Policy API); Job Management; HVN 01 (Job 01), HVN 02 (Job 02) and HVN 03 (Job 03), each a Virtual Namespace over DPUs 01 to 08; HVN Management API and HVN Management Server; Local Storage, Shared Storage, Policy Database, Logging Volume]
Scalable IO Structure
r HPC benefits
Ø Scalable burst BW for typical HPC application
Ø Isolated HVN lets data intensive applications personalize their optimization
Ø Reduced requirements for costly shared storage
Ø Scalability, Efficiency, Economy and Ease of use
r Data processing benefits
Ø Maximum locality, DPU provides opportunity to schedule tasks close to data
Ø Single namespace makes post-processing easy
Ø Reduction of data movement, better support for in-situ data analysis and data in-transit analysis
Different Levels of Performance
r Peak
performance
r LINPACK performance
Ø Avg.
80%
r Gordon
Bell Prize performance
Ø  ~30%
r Application
sustained performance
Ø  <5%~10%
r HPCG
Benchmark
Ø  ~1%
国防科学技术大学
National University of Defense Technology
Tianhe
天河
Applications
r High
Energy Density
Physics
r Weather
& Climate
r CFD
r Seismic
data
processing
r Bio-information
r E-Gov
& Service
国防科学技术大学
National University of Defense Technology
Tianhe
天河
Applications
r Cases
Study(CPU+MIC)
Ø CFD
u High-order
CFD Simulation 682.4billion grids,
#7168,1.37million cores
Ø Climate
u Global
shallow water model,#8664,~1.7million
cores, 77%
Ø Physics
u Gyrokinetic
Toroidal Code GTC,#2048,
~160,000 cores
Ø Business
Opinion Analysis
u 600TB
structured/non structured data with
micMR (Hadoop over MIC), #1024,
100Million Rec/day
Ø ……
国防科学技术大学
National University of Defense Technology
Tianhe
天河
Applications
HCFD: High-Order SimulaTor of Aerodynamics
Ø WCNS: Weighted Compact Nonlinear Scheme
Ø Explicit Runge-Kutta
Slides provided by Dr. Yongxian Wang from NUDT
Applications
HCFD: High-Order SimulaTor of Aerodynamics
r C919 complex shape flow field simulation
r Direct Numerical Simulation (DNS)
Applications
r Cardiac
Tianhe
天河
subcellular level nanoscale calcium-spread
mechanical simulation
Ø  Explore
the pathogenesis of heart disease
Ø  CPU+MIC hybrid computing
u 4096
nodes
u 1 MIC Vs.1.82-5 IVB
r Virtual
drug screening - molecular docking
calculations
Ø  DOCK6.5
Ø  303,826
compounds conformation(specs)
Ø  1,100 drug target(pdtd)
Ø  Over 334 million docking calculation
国防科学技术大学
National University of Defense Technology
Applications
r The Catalytic Mechanism of Human Oxidosqualene Cyclase
Ø QM/MM MD simulation (Qchem-Tinker)
Ø 22000 atoms
Ø Sun Yat-Sen University
r Study the pathogenesis of Flavobacterium
Ø Research and product development of the key technology in freshwater fish immune disease prevention and control
Ø Pearl River Fisheries Research Institute
Ø Chinese Academy of Fishery Sciences
r Regional Marine digitizing system of Pearl River Estuary, South China Sea
Ø Sun Yat-Sen University
Applications
r Neutrino Mass Measurement
Ø Beijing Normal University
Ø High energy institute of Canada
Ø Simulate 13.7-billion-years cosmic evolution in 48 hours
r High-speed rail tunnel aerodynamic effects
Ø Middle South University
r Shock Wave/Turbulent Boundary Layer Interaction
Ø Structural safety of the high-speed aircraft
Ø Problem scale: 2800 million grid points
Ø Consumed time: 12 hours
[Figure panels: Massless Neutrino, Massive Neutrino]
Applications
KylinCloud Cloud Platform
r Architecture
[Figure: KylinCloud architecture. Application areas: E-Gov, SmartCity, Education, Energy, Finance, Resource Rental, Big Data. Platform as a Service: Develop/Deploy Environment, DataBase & Middleware, Data Analysis, Auto Deployment, Auto Configuration, Auto Scaling. Infrastructure as a Service: Service Orchestration, Resource Scheduling, Compute Resource, Storage Resource, Network Resource. Common services: Billing, Configuration, Authentication, Log System, Monitoring. Base: Kylin Server Operating System]
r Features
Ø Customized according to the need of various applications and the arch. of TH-2
Ø Provide IaaS and PaaS services to applications with efficient resource management and scheduling mechanisms
Ø Provide high scalability to manage datacenters with 10,000 nodes and 100,000 cores
Ø Provide enhanced security and high availability to ensure the safety of data and applications
Ø Provide multiple-level user management and quota management to tenants
Ø Provide friendly self-service portal and the statistics, reporting and displaying of the usage of resource
Applications
r Applications
Ø E-Gov
Ø RenderCloud
Ø micMR
Ø Video Processing
Ø Electromagnetic Spectrum Management
Applications
r Cases
Study(CPU+MIC)(cont.)
Over PFolps Applications
Ø Seissol: Earthquake simulation
u #8192
with 3mics
Ø GRN:
Whole-Genome Regulatory
Networks Inference through Large-Scale
Bayesian Learning
u #8192
Ø SCF:
with 3mics
Quantum Chemical Simulations of
DNA
u #8100
with 3mics
Ø 国
…防…
科学技术大学
National University of Defense Technology
Tianhe
天河
Applications
r Need
Tianhe
天河
custom hybrid algorithms
Ø Performance-oriented
programming
Ø Simple optimization based on MPI+OpenMP
alone will not be enough
Ø Rethinking heterogeneous new algorithms at the
physics level to maximize the performance
r Application Code
Ø Portable, persistent, correct
Ø Scalability,Maintainable
Ø Resilience or Oblivious
Ø Productivity
国防科学技术大学
National University of Defense Technology
Tianhe
Summary
r HPC Challenges
r Big science, Big engineering, Big data
Keys
r Eco-system
r Co-design
r HPC Education
[Figure: HPC Eco-System. HPC System, Parallel software, Domain APP, Talents, Algorithm & model]
[Figure: Mz versus Alpha at MA = 0.76 for Grid1_xhs, Roe+SST_wgx, sa_vl_mb, ROE_SST_xzy and FL-26]
Thanks