4/8/14

Manifold: A Parallel Simulation Framework for Multicore Systems

Jun Wang, Jesse Beu, Rishiraj Bheda, Tom Conte, Zhenjiang Dong, Chad Kersey, Mitchelle Rasquinha, George Riley, William Song, He Xiao, Peng Xu, and Sudhakar Yalamanchili
School of Electrical and Computer Engineering and School of Computer Science
Georgia Institute of Technology, Atlanta, GA 30332

ISPASS 2014

Sponsors

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | SCHOOL OF COMPUTER SCIENCE | GEORGIA INSTITUTE OF TECHNOLOGY

Motivation

"Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful."
Box, G. E. P., and Draper, N. R. (1987), Empirical Model Building and Response Surfaces, John Wiley & Sons, New York, NY.
(Photo: George E. P. Box, 2011)

Simulation Infrastructure Challenges

- Scalability: processors are parallel but simulation tools are not -> not sustainable
- Multi-disciplinary: applications and OS, the microarchitecture simulator, and physical models (power, thermal, etc.)
- Need functional + timing + physical models to model complete systems: cores, networks, memories, and software at scale
- Islands of expertise: ability to integrate point tools -> best-of-breed models
- Composability: easily construct the simulator you need

Overview

- Execution Model
- Multicore Emulator Front-End & Component-Based Timing Model Back-End
- Physical Modeling
- Parallel Simulation
- Some Example Simulators

Manifold Overview

- A parallel simulation framework for multicore architectures
- Consists of:
  - A parallel simulation kernel
  - A (growing) set of architectural components
  - Integration of physical models for energy, thermal, power, and so on
- Goal: easy construction of parallel simulators of multicore architectures
- Manifold repository: the kernel plus components (cores, caches, interconnection networks, memory controllers, and others) for ease of composition
- Related work: Graphite, Sniper, Gem5, SST, PTLSim, etc.
Execution Model: Overview

- Instruction stream: generated by i) trace files, ii) the QSim server, or iii) the QSim library; serial or parallel emulator options, with backpressure from the timing model to the emulator
- System timing model: a multicore model built with Manifold components, mixing cycle-level and higher-level timing components; physical models (e.g., power) attach to it
- Components are assigned to multiple logical processes (LPs); each LP is assigned to one MPI task; LPs run in parallel

Execution Model (Socket/Blade)

- Parallel simulation: the QSim functional front-end (virtual CPUs plus QSim proxies) feeds the timing back-end over TCP/IP or shared memory
- Hybrid, multiscale timing model
- Full-system simulation: integrated Linux
- Component-based design: each logical process (LP) of cores, caches, and network runs on an instance of the Parallel Simulation Kernel (PSK), with MPI between LPs

QSim Multicore Emulator

- Runs unmodified x86 (32-bit) binaries on a lightly modified Linux kernel
- Based on the core translation engine from QEMU
- A library for instantiating multicore emulators
- Provides a callback interface for execution events; callbacks are generated for all instructions, including OS code
- Supports filtering of the instruction stream
- Can be extended to support other ISAs (e.g., ARM, PPC) via QEMU support

QSim Features

- Fine-grained, single-instruction-level execution control
- Two ways to use:
  - QSimLib: for creating multithreaded emulators
  - QSim Server: for serving parallel/distributed simulations
- State files for fast startup: a state file contains the memory state after Linux boot-up
- Fast-forward and region-of-interest support
- Booted with up to 512 virtual cores

Timing Model Components

- Components connect through ports; links are unidirectional
- A component sends events with send(data*); events are delivered to a registered event_handler(T*)
- serialize()/deserialize() are handled transparently in the kernel
- The simulation kernel may be time-stepped, discrete-event, or both: named handlers for each case, clock subscription, and time-stepped and DES components can be mixed
- Kernels deliver events and enforce correct ordering of inter-LP events, across and within LPs

Manifold Component Models

- Multiple models at multiple levels of abstraction
- IRIS: flit-level network simulator
- CaffDRAM (DRAM model)
- Coherent cache hierarchy
- Multiple core models: in-order, out-of-order, abstract, etc.
- Multi-tier coherence hierarchy: a coherence domain manager (tier 1) over coherence realm managers (tiers 2 and 3) and their client caches (Manager-Client Pairing)*

G. Loh et al., "Zesto: A Cycle-Level Simulator for Highly Detailed Microarchitecture Exploration," ISPASS 2009.
* J. G. Beu et al., "Manager-Client Pairing: A Framework for Implementing Coherence Hierarchies," MICRO-44, Dec. 2011.

Modeling Physical Phenomena

- Energy Introspector (EI): a modeling library that facilitates the (selective) use of different models and captures the interactions among microprocessor physics models
- Model library wrappers:
  - Power: McPAT (CACTI), Orion (DSENT), ...
  - Thermal: HotSpot, 3D-ICE, TSI, ...
  - Reliability: NBTI, TDDB, electromigration, ...
  - Delay: PDN, temperature inversion, ...
- User interface: microarchitecture description and statistics
- Coordinated multi-physics interface with synchronized interactions, e.g., compute_power(); compute_temperature();
- Available at www.manifold.gatech.edu

Architecture-Level Physical Modeling

- Abstract representation of architecture-physics interactions
(Diagram: input workloads drive microarchitecture execution under architectural and physical configurations; activity counts feed dynamic power calculation; voltage and temperature feed leakage power calculation; a VF controller applies clock frequency and power gating/boosting; delay modeling produces delay / timing-error feedback.)
- Temperature from thermal modeling (with package configuration and die floor-planning) feeds back into leakage; reliability modeling yields failure rate / MTTF (* one model marked as not yet integrated)

Processor Representation

- The processor is represented as a hierarchy of pseudo components: architecture/circuit blocks, intermediate components (core, tile, etc.), die partitions (floor-planning), and the processor package
- Each pseudo component binds a model library (energy, reliability, thermal, or none) and a data queue

Parallel Simulation Kernel: Synchronization Algorithms

- Problem: synchronized advance of simulation time across LPs
- State-of-the-practice solutions (included in Manifold):
  - Conservative algorithms, e.g., lower bound time stamp (LBTS)
  - Optimistic algorithms, e.g., Time Warp
- New: Forecast Null Message (FNM)
  - Predicts the time value of the next event
  - Uses domain-specific information to predict the time of future events, e.g., the consequence of a last-level cache access
Forecast Null Message (FNM) Algorithm

1. The forecast is determined by runtime state. E.g., at cycle t a cache receives a request; with latency L, the earliest time it can send out a message is (t + L). This is its forecast.
2. Because components send credits after receiving events, the time stamp of a null message must also consider the neighbors' forecasts: LPs exchange Null(ts, forecast) messages.
3. The null-message time stamp is set to min(out_forecast, in_forecast), where in_forecast is the forecast carried on incoming null messages.

Building and Running Parallel Simulations

- Initialization: configuration parameters
- Instantiate components (from the Manifold library) and inputs (trace, QSim, etc.)
- Connect components: instantiate links
- Register clocks: set timing behavior (time-stepped vs. discrete-event)
- Simulation functions: set duration, cleanup, etc.

CMP Experiments (16-, 32-, 64-core)

- 16-, 32-, and 64-core CMP models (cores, caches, interconnection network, memory controllers)
- 2, 4, and 8 memory controllers, respectively; 5x4, 6x6, and 9x8 torus, respectively
- Host: Linux cluster; each node has 2 Intel Xeon X5670 6-core CPUs with 24 hardware threads
- 13, 22, and 40 hardware threads used by the simulator on 1, 2, and 3 nodes, respectively
- 200 million simulated cycles in the region of interest (ROI); saved boot state and fast-forward to the ROI

Sample Results: Simulation Time in Minutes (Seq. = sequential, Para. = parallel; parallel speedup in parentheses)

| Benchmark | 16-core Seq. | 16-core Para. | 32-core Seq. | 32-core Para. | 64-core Seq. | 64-core Para. |
|---|---|---|---|---|---|---|
| dedup | 1095.7 | 251.4 (4.4X) | 2134.8 | 301.3 (7.1X) | 2322.9 | 345.3 (6.7X) |
| facesim | 1259.3 | 234.9 (5.4X) | 2614.2 | 303.6 (8.6X) | 3170.2 | 342.3 (9.3X) |
| ferret | 1124.8 | 227.8 (4.9X) | 1777.9 | 255.6 (7.0X) | 2534.3 | 331.3 (7.6X) |
| freqmine | 1203.3 | 218.0 (5.5X) | 1635.6 | 245.6 (6.7X) | 2718.9 | 337.3 (8.1X) |
| stream | 1183.8 | 222.7 (5.3X) | 1710.6 | 244.3 (7.0X) | 4796.4 | 396.2 (12.1X) |
| vips | 1167.0 | 227.3 (5.1X) | 1716.3 | 257.2 (6.7X) | 2564.6 | 337.9 (7.6X) |
| barnes | 1039.9 | 224.3 (4.6X) | 1693.0 | 283.3 (6.0X) | 3791.8 | 341.4 (11.1X) |
| cholesky | 1182.4 | 227.2 (5.2X) | 1600.3 | 245.7 (6.5X) | 4278.3 | 402.1 (10.6X) |
| fmm | 1146.3 | 229.6 (5.0X) | 1689.8 | 253.6 (6.7X) | 5037.2 | 416.1 (12.1X) |
| lu | 871.2 | 156.4 (5.6X) | 1475.8 | 204.6 (7.2X) | 4540.3 | 402.7 (11.3X) |
| radiosity | 1022.3 | 228.8 (4.5X) | 1567.5 | 250.4 (6.3X) | 2813.5 | 350.3 (8.0X) |
| water | 671.5 | 158.4 (4.2X) | 1397.3 | 236.7 (5.9X) | 2560.1 | 356.3 (7.2X) |

Simulation KIPS and KIPS per Hardware Thread

- Need metrics for assessing the scalability of parallel simulation
- Note the impact of non-instruction events, e.g., network or memory events
- The drop in KIPS per hardware thread roughly parallels the drop in parallel efficiency

Some Lessons

- Variations in component timing behaviors
- Physical models can be the bottleneck
- Event flow control: occurrence of hidden (infinite-capacity) buffers, e.g., at the memory controller interface
- Effects of simulation model partitioning
- Synchronization overhead
- Relaxed synchronization for power/thermal modeling
Example: Power Capping Controller

- Dynamic voltage-frequency scaling (DVFS) toward a new power set point
- Regulating asymmetric processors: two in-order and two out-of-order cores
- N. Almoosa, W. Song, Y. Wardi, and S. Yalamanchili, "A Power Capping Controller for Multicore Processors," American Control Conference, June 2012.

Example: Microfluidic Cooling in Die Stacks

- 16 symmetric Nehalem-like, out-of-order cores at 3 GHz, 1.0 V, max temperature 100 C; die 8.4 mm x 8.4 mm, tiles 2.1 mm x 2.1 mm
- IL1: 32KB, 256 sets, 32B, 4 cycles; DL1: 128KB, 4096 sets, 64B
- L2 & network cache layer: L2 (per core) 2MB, 4096 sets, 128B, 35 cycles; DRAM: 1GB, 50 ns access time (for the performance model)
- Thermal grids: 50x50; sampling period: 1 us; steady-state analysis; ambient temperature: 300 K
- Executing SPLASH and PARSEC benchmarks
- H. Xiao, Z. Min, S. Yalamanchili, and Y. Joshi, "Leakage Power Characterization and Minimization over 3D Stacked Multi-core Chip with Microfluidic Cooling," IEEE Symposium on Thermal Measurement, Modeling, and Management (SEMI-THERM), March 2014.

Impact of Cooling Co-design on Leakage Power

(Chart: at the saturation point, Barnes shows 4.75X and 3.31X; Ocean-c shows 1.79X and 1.39X.)
H. Xiao, Z. Min, S. Yalamanchili, and Y.
Joshi, "Leakage Power Characterization and Minimization over 3D Stacked Multi-core Chip with Microfluidic Cooling," IEEE Symposium on Thermal Measurement, Modeling, and Management (SEMI-THERM), March 2014.

Summary

- Composable simulation infrastructure for constructing multicore simulators: www.manifold.gatech.edu
- Parallel execution
- Integrated physical models
- Provide a base library of components to build useful simulators; distribute some stock simulators
- Need: validation techniques such as uncertainty quantification
(Diagram: microarchitecture and workload execution coupled with power dissipation, thermal coupling and cooling, thermal field modeling, power distribution network, power management, degradation and recovery algorithms, and novel cooling technology.)