Manifold - CASL Gatech | Computer Architecture and Systems

4/8/14
Manifold: A Parallel Simulation Framework for
Multicore Systems
Jun Wang, Jesse Beu, Rishiraj Bheda, Tom Conte, Zhenjiang Dong,
Chad Kersey, Mitchelle Rasquinha, George Riley, William Song, He Xiao, Peng Xu,
and Sudhakar Yalamanchili
School of Electrical and Computer Engineering and School of Computer Science
Georgia Institute of Technology
Atlanta, GA. 30332
ISPASS 2014
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | SCHOOL OF COMPUTER SCIENCE | GEORGIA INSTITUTE OF TECHNOLOGY
MANIFOLD
Motivation
“Remember that all models are wrong; the practical question is how
wrong do they have to be to not be useful.”
Box, G. E. P., and Draper, N. R., (1987), Empirical Model Building
and Response Surfaces, John Wiley & Sons, New York, NY.
[Photo: George E. P. Box, 2011]
Simulation Infrastructure Challenges

- Scalability
  - Processors are parallel and tools are not → not sustainable
- Multi-disciplinary
  - Need Functional + Timing + Physical models to model complete systems
  - Cores, networks, memories, software at scale
- Islands of expertise
  - Ability to integrate point tools → best-of-breed models
- Composability
  - Easily construct the simulator you need

[Figure: a tiled multicore spanning applications & OS, a microarchitecture simulator, and physical models (power, thermal, etc.)]
Overview

- Execution Model
- Multicore Emulator Front-End & Component-Based Timing Model Back-End
- Physical Modeling
- Parallel Simulation
- Some Example Simulators
Manifold Overview

- A parallel simulation framework for multicore architectures
- Consists of:
  - A parallel simulation kernel
  - A (growing) set of architectural components
  - Integration of physical models for energy, thermal, power, and so on
- Goal: easy construction of parallel simulators of multicore architectures

[Figure: ease of composition — cores, caches, networks, memory controllers, and the kernel from the Manifold repository assembled into a multicore system around an interconnection network]
Related Work: Graphite, Sniper, Gem5, SST, PTLSim, etc.
Execution Model: Overview

- Instruction stream generated by i) trace files, ii) QSim server, or iii) QSim library
- System timing model: a multicore model built with Manifold components, mixing cycle-level components and high-level timing components
- Components assigned to multiple logical processes (LPs)
- Each LP assigned to one MPI task; LPs run in parallel

[Figure: a serial or parallel emulator (or traces) feeds instruction streams into the system timing model, which exerts backpressure on the emulator; physical models (e.g., power) attach to the timing components]
Execution Model (Socket/Blade)

- Parallel simulation
- Full-system simulation: applications on Linux over QSim virtual CPUs
- Integrated physical models
- Component-based design
- Hybrid timing model
- Multiscale simulation

[Figure: the QSim functional front-end (application, Linux, virtual CPUs) feeds QSim proxies over TCP/IP or shared memory; the timing back-end is partitioned into logical processes (LPs) — each holding cores with L1$/L2$, or the network — running over MPI on the Parallel Simulation Kernel (PSK)]
QSim Multicore Emulator

- Library for instantiating multicore emulators
- Based on the core translation engine from QEMU
- Runs unmodified x86 (32-bit) binaries on a lightly modified Linux kernel
- Provides a callback interface for execution events
- Callbacks generated for all instructions, including the OS
- Filtering of the instruction stream
- Can be extended to support other ISAs, e.g., ARM, PPC, via QEMU support

[Figure: same organization as the previous slide — QSim virtual CPUs feed per-LP QSim proxies over TCP/IP or shared memory]
QSim Features

- Fine-grained, instruction-level execution control
  - Single-instruction level
- Two ways to use:
  - QsimLib: for creating multithreaded emulators
  - QSim Server: for serving parallel/distributed simulations
- State files for fast startup
  - State file contains memory state after Linux bootup
- Fast-forward and region-of-interest support
- Booted up to 512 virtual cores
Timing Model Components

- Components communicate through ports over unidirectional links: the sender calls send(data*) and the receiver's event_handler(T*) is invoked; serialize()/deserialize() are handled transparently in the kernel
- A component may be time-stepped, discrete event, or both
  - Named handlers for each case
  - Clock subscription
  - Can mix time-stepped and DES
- Kernels deliver and send events, including inter-LP events, and enforce correct ordering across and within LPs
Manifold Component Models

- Multiple models, at multiple levels of abstraction
- Multiple core models: in-order, out-of-order, abstract, etc. (e.g., Zesto)
- Coherent cache hierarchy: Manager-Client Pairing,* with coherence domain/realm managers and clients arranged in tiers
- IRIS: flit-level network simulator
- CaffDRAM

G. Loh et al., "Zesto: A Cycle-Level Simulator for Highly Detailed Microarchitecture Exploration," ISPASS 2009.
*J. G. Beu, "Manager-Client Pairing: A Framework for Implementing Coherence Hierarchies," MICRO-44, Dec. 2011.
Modeling Physical Phenomena

- Energy Introspector (EI) is a modeling library that facilitates the (selective) use of different models and captures the interactions among microprocessor physics models.
- User interface: microarchitecture description & statistics
- Model library wrappers:
  - Power models: McPAT (CACTI), Orion (DSENT), ...
  - Thermal models: HotSpot, 3D-ICE, TSI, ...
  - Reliability models: NBTI, TDDB, electromigration, ...
  - Delay models: PDN, temperature inversion, ...
- Coordinated multi-physics interface with synchronized interactions, e.g., compute_power(); compute_temperature();
- Available at www.manifold.gatech.edu
Architecture-Level Physical Modeling

Abstract representation of architecture-physics interactions:

- Input workloads drive microarchitecture execution under an architectural configuration and a physical configuration; execution produces activity counts
- Dynamic power is calculated from activity counts and voltage; leakage power is calculated from voltage and temperature (leakage feedback)
- Total power, the package configuration, and the die floor-planning feed thermal modeling, which produces temperature
- A VF controller adjusts clock frequency, power gating/boosting, etc., based on power and timing
- Delay modeling yields delay/timing error; reliability modeling yields failure rate/MTTF (*model not integrated yet)
Processor Representation

The processor is represented as a hierarchy of pseudo components, each combining a data queue with a model library:

- Architecture/circuit blocks: pseudo components with energy libraries
- Intermediate components (i.e., core, tile, etc.): pseudo components with no model library
- Die partitions (floor-planning): pseudo components with reliability libraries
- Processor package: a pseudo component with a thermal library
Parallel Simulation Kernel: Synchronization Algorithms

- Problem: synchronized advance of simulation time
- State-of-the-practice solutions, included in Manifold:
  - Conservative algorithms, e.g., lower bound time stamp (LBTS)
  - Optimistic algorithms, e.g., Time Warp
- New: Forecast Null Message (FNM)
  - Predicts the time value of the next event
  - Uses domain-specific information to predict the time of future events
  - For example, the consequence of a last-level cache access
Forecast Null Message (FNM) Algorithm

1. Forecast determined by runtime state
   - E.g., at cycle t, a cache receives a request; with latency L, the earliest time at which it can send out a message is (t + L). This is its forecast.
2. Because components send credits after receiving events, the time-stamp of a null message must consider the neighbors' forecasts.
3. The null-message time-stamp is set to min(out_forecast, in_forecast), where in_forecast is the forecast carried on incoming null messages.

[Figure: LP1 sends Null(ts, forecast) to LP2]
Building and Running Parallel Simulations

1. Initialization: configuration parameters
2. Instantiate components: from the Manifold library; inputs (trace, QSim, etc.)
3. Connect components: instantiate links; set timing behavior (time-stepped vs. discrete event)
4. Register clocks
5. Simulation functions: set duration, cleanup, etc.
CMP (16-, 32-, 64-core)

- 16-, 32-, 64-core CMP models
  - 2, 4, 8 memory controllers, respectively
  - 5x4, 6x6, 9x8 torus, respectively
- Host: Linux cluster; each node has 2 Intel Xeon X5670 6-core CPUs with 24 hardware threads
- 13, 22, 40 hardware threads used by the simulator on 1, 2, 3 nodes, respectively
- 200 million simulated cycles in the region of interest (ROI)
  - Saved boot state and fast-forwarded to the ROI
Sample Results: Simulation Time in Minutes

| Benchmark | 16-core Seq. | 16-core Para. | 32-core Seq. | 32-core Para. | 64-core Seq. | 64-core Para. |
|-----------|--------------|---------------|--------------|---------------|--------------|---------------|
| dedup     | 1095.7 | 251.4 (4.4X) | 2134.8 | 301.3 (7.1X) | 2322.9 | 345.3 (6.7X)  |
| facesim   | 1259.3 | 234.9 (5.4X) | 2614.2 | 303.6 (8.6X) | 3170.2 | 342.3 (9.3X)  |
| ferret    | 1124.8 | 227.8 (4.9X) | 1777.9 | 255.6 (7.0X) | 2534.3 | 331.3 (7.6X)  |
| freqmine  | 1203.3 | 218.0 (5.5X) | 1635.6 | 245.6 (6.7X) | 2718.9 | 337.3 (8.1X)  |
| stream    | 1183.8 | 222.7 (5.3X) | 1710.6 | 244.3 (7.0X) | 4796.4 | 396.2 (12.1X) |
| vips      | 1167.0 | 227.3 (5.1X) | 1716.3 | 257.2 (6.7X) | 2564.6 | 337.9 (7.6X)  |
| barnes    | 1039.9 | 224.3 (4.6X) | 1693.0 | 283.3 (6.0X) | 3791.8 | 341.4 (11.1X) |
| cholesky  | 1182.4 | 227.2 (5.2X) | 1600.3 | 245.7 (6.5X) | 4278.3 | 402.1 (10.6X) |
| fmm       | 1146.3 | 229.6 (5.0X) | 1689.8 | 253.6 (6.7X) | 5037.2 | 416.1 (12.1X) |
| lu        | 871.2  | 156.4 (5.6X) | 1475.8 | 204.6 (7.2X) | 4540.3 | 402.7 (11.3X) |
| radiosity | 1022.3 | 228.8 (4.5X) | 1567.5 | 250.4 (6.3X) | 2813.5 | 350.3 (8.0X)  |
| water     | 671.5  | 158.4 (4.2X) | 1397.3 | 236.7 (5.9X) | 2560.1 | 356.3 (7.2X)  |
Simulation KIPS and KIPS per H/W Thread

- Need metrics for assessing the scalability of parallel simulation
- Note the impact of non-instruction events, e.g., network or memory events
- The drop in KIPS per hardware thread roughly parallels the drop in parallel efficiency
Some Lessons

- Variations in component timing behaviors
- Physical models can be the bottleneck
- Event flow control
  - Hidden (infinite-capacity) buffers occur
  - Example: memory controller interface
- Effects of simulation model partitioning
  - Synchronization overhead
- Relaxed synchronization for power/thermal modeling
Power Capping Controller

- Dynamic voltage/frequency scaling drives measured power to a new set point
- Regulating asymmetric processors (two in-order and two OOO cores)

N. Almoosa, W. Song, Y. Wardi, and S. Yalamanchili, "A Power Capping Controller for Multicore Processors," American Control Conf., June 2012.
Microfluidic Cooling in Die Stacks

- 16 symmetric, Nehalem-like, OoO cores; 3 GHz, 1.0 V, max temperature 100°C
- Die: 8.4 mm x 8.4 mm; each core tile 2.1 mm x 2.1 mm
- IL1: 32 KB, 256 sets, 32 B, 4 cycles; DL1: 128 KB, 4096 sets, 64 B
- L2 & network cache layer: L2 (per core) 2 MB, 4096 sets, 128 B, 35 cycles
- DRAM: 1 GB, 50 ns access time (for the performance model)
- Thermal grids: 50x50; sampling period: 1 µs; steady-state analysis
- Ambient temperature: 300 K
- Executing SPLASH and PARSEC benchmarks

[Figure: core-layer floorplan (FE, SCH, INT, FPU, DL1 blocks per core) stacked over a 16-tile L2 cache layer]

H. Xiao, Z. Min, S. Yalamanchili and Y. Joshi, "Leakage Power Characterization and Minimization over 3D Stacked Multi-core Chip with Microfluidic Cooling," IEEE Symposium on Thermal Measurement, Modeling, and Management (SEMI-THERM), March 2014.
Impact of Cooling Co-design on Leakage Power

[Figure: leakage power reduction reaches a saturation point per benchmark — 4.75X and 3.31X for Barnes, 1.79X and 1.39X for Ocean-c (Xiao et al., SEMI-THERM 2014)]
Summary

www.manifold.gatech.edu

- Composable simulation infrastructure for constructing multicore simulators
  - Parallel execution
  - Integrated physical models
- Provides a base library of components to build useful simulators
- Distributes some stock simulators
- Need: validation techniques such as uncertainty quantification

[Figure: coupled concerns — microarchitecture and workload execution, power dissipation, thermal coupling and cooling, degradation and recovery — spanning novel cooling technology, thermal field modeling, the power distribution network, microarchitecture, and power management algorithms]