Tianhe 天河
Software Issues for Extreme Large Scale Computing

Prof. Yutong Lu
School of Computer Science, NUDT & State Key Laboratory of High Performance Computing
[email protected]
国防科学技术大学 National University of Defense Technology

Outline
- Trend of HPC Architecture
- Scalable System Software
- Applications

Challenges: PSPR
- Performance
- Scalability
- Power consumption
- Reliability

Trend of Architecture
- Three carriages of performance
  - Frequency
  - ILP
  - Parallelism
- Performance = Parallelism
  - ……
  - Year 2010: TH-1A, 4.7 PFlops, 7168 nodes, 186,368 cores
  - Year 2013: TH-2, 54.9 PFlops, 16000 nodes, 3,120,000 cores
  - ……
- Exploit parallelism
  - Longitude (100,000 nodes)
  - Latitude (multi/many cores, SIMD, ILP)

Trend of Architecture
- Heterogeneous
  - Architecture of some top-level supercomputers
    - Tianhe-2: Intel Xeon Phi
    - Titan: NVIDIA K20X GPU
    - Lately: #53/Top500, #24/Top100, #4/Top10
- Compute efficiency
  - More computations per joule
  - More computations per transistor

Trend of Architecture
- Many core processor
  - Intel MIC
    - >60 cores, >200 threads
    - 1.15 GHz
    - >1 TFlops performance
    - 512b SIMD
  - GPU, NVIDIA Kepler
    - 2688 cores
    - 732 MHz
    - 1.31 TFlops

Trend of Architecture: GPU vs MIC on Tianhe supercomputers
- TH-1A: GPU
  - Data parallel
  - Simple instruction
  - Limited scheduling
  - GPU Direct available
    - ~40% ↑ MPI communication on Tianhe-1A
    - 5% ↑ Linpack
  - Steep learning curve
  - Supporting: CUDA, OpenCL …
  - 2 CPU + 2 GPU: ~71%
  - Whole system: 56.5%
- TH-2: MIC
  - Multi threads & SIMD
  - Flexible modes: Native, Offload, Symmetric, Shared
  - SIMD available
    - ~4.5 times speedup on Tianhe-2
  - Relatively easy to get started
  - Intel supporting
  - 2 CPU + 3 MIC: ~76.5%
  - Whole system: 61.6%

Trend of Architecture: Memory Hierarchy
- Performance of CPU ↑59%, performance of memory ↑26%

  Level            Access latency
  Register         1 cycle
  L1 Cache         3 cycles
  L2 Cache         10 cycles
  L3 Cache         30 cycles
  Local Mem        150 cycles
  Non-local Mem    >1500 cycles

- Exploit data locality, reduce communication and memory accesses

Trend of Architecture
- Memory architecture will benefit from multiple technologies
  - Deeper memory hierarchy
  - Advanced package technology
    - 3D stack, MCM
  - Optical connection between chips

Trend of Architecture: Power Consumption
- PW for data moving is 48X PW for data computing
  - MLA inside core: 100 pJ
  - Read inside CPU: 4800 pJ
  - Data moving between cores: 7500 pJ
  - Data moving between nodes: 9000 pJ
- DTF: reduces power consumption by 20%, with a 5% performance loss
- Power control; applications: power aware, minimum data moving

Trend of Architecture: Interconnection Network
- NIC
  - High bandwidth
  - Multiple lanes
- Router
  - High radix vs. low radix
- Topology
  - N-D Torus vs. Fat Tree
  - N Dimension Tree
  - Cost
- Optical
  - High BW, low latency, EMC
- Topo-aware software

[Figure: interconnect topology. Recoverable labels: Rack 0, Rack 62, Rack 63, Rack 124; groups of 32 compute nodes; NRM, BTM, M and SW modules; 576-port switches 0, i, j, 12.]

Trend of Architecture
- Communication
- Reliability
- Power
- Programmability
A heavy burden on software.

Software issues
- Scalability
  - How to use the existing systems better
  - How to explore the next generation systems
- Resilience
  - Reduce the C/R overhead
  - Lightweight resilience method
- Power control
- Programmability
- HPC vs Big data
  - Data management and filesystem

Highlights of Tianhe-2

  Perf     54.9 PFlops / 33.86 PFlops
  Nodes    16000
  Mem      1.4 PB
  Racks    125 + 8 + 13 + 24 = 170 (720 m²)
  Power    17.8 MW (1.9 GFlops/W)
  Cool     Close-coupled chilled water cooling

TH-2
[Figure: TH-2 packaging and network. Recoverable labels: TH-2 system (125 x Rack); Rack (8 x Frame); Frame; Compute board; APM; Phi #48000; TH-Net with 576-port switches connecting racks 0 to 124, each built from groups of 32 compute nodes; Service Manage System; Independent CPU Array; Commercial CPU Array; Service Node; Login Node; FT Compute Node; Compute Node; IO Enhanced Compute Node.]
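The headline figures on the Highlights of Tianhe-2 slide can be cross-checked with a little arithmetic. The sketch below is not from the talk; it assumes that 33.86 PFlops is the measured Linpack figure (Rmax), that 54.9 PFlops is the peak (Rpeak), and that the 17.8 MW power figure applies to the Linpack run. The function names are illustrative only.

```python
# Figures quoted on the "Highlights of Tianhe-2" slide.
RPEAK_PFLOPS = 54.9   # assumed to be peak performance (Rpeak)
RMAX_PFLOPS = 33.86   # assumed to be Linpack performance (Rmax)
POWER_MW = 17.8       # quoted system power


def linpack_efficiency(rmax_pflops: float, rpeak_pflops: float) -> float:
    """Fraction of peak performance that is sustained on Linpack."""
    return rmax_pflops / rpeak_pflops


def gflops_per_watt(rmax_pflops: float, power_mw: float) -> float:
    """Energy efficiency: 1 PFlops = 1e6 GFlops, 1 MW = 1e6 W."""
    return (rmax_pflops * 1e6) / (power_mw * 1e6)


if __name__ == "__main__":
    eff = linpack_efficiency(RMAX_PFLOPS, RPEAK_PFLOPS)
    gpw = gflops_per_watt(RMAX_PFLOPS, POWER_MW)
    print(f"Linpack efficiency: {eff:.1%}")
    print(f"Energy efficiency:  {gpw:.2f} GFlops/W")
```

Under these assumptions the ratio comes out a little under 62%, in line with the "Whole system 61.6%" figure on the GPU vs MIC slide, and the energy efficiency rounds to the quoted 1.9 GFlops/W.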
[Figure: TH-2 system architecture. Recoverable labels: FT Compute Node; Compute Node; IO Enhanced Compute Node; IO Server Node; Login Node; Management Node; Commercial CPU Array; Distributed Local Storage; Accelerating Storage; CPM; Frame (8 x board); ION; FT-1500 #4096; IVB #32000; IP Interconnect Network; IB Storage Network; Global Storage; Hybrid Hierarchy shared storage system, 12.4 PB.]

Highlights of Tianhe-2
- Software Stack

Programming model
- Trend of programming model
  - Whole system
    - MPI
    - New data-driven model
  - Intra node
    - Various: OpenMP, CUDA/OpenCL, OpenACC
  - Others
    - PGAS ...
- Criteria: Portability, Performance, Simplicity and Symmetry, Modularity, Compatibility, Completeness, Distributed memory

We are here…
- Top5 (2013.11)

  Computer               Cores        Rmax (Tflop/s)   Rpeak (Tflop/s)
  Tianhe-2               3,120,000    33,862.7         54,902.4
  Titan                  560,640      17,590.0         27,112.5
  Sequoia - BlueGene/Q   1,572,864    16324.75         20132.66
  K computer             705,024      10510.00         11280.38
  Mira - BlueGene/Q      786,432      8162.38          10066.33

  AVG #core of top500: 41,434

- Selected applications on Tianhe
  - More than 10,000 ~ 1,000,000 cores

Scalable MPI
- Performance
  - P2P: bandwidth/latency
  - Collective communication
  - Communicator/group operations
  - MPI_Init
- Resource consumption
  - Memory
  - Network connection
- Data structures should be redesigned
  - Communicator, RMA window, protocol buffer…

Scalable MPI
- TH-Express2 & TH-Express2+
  - Network Interface Chip: NIC
    - 10 Gbps x 8 lanes
    - 14 Gbps x 8 lanes (plus)
  - Network Router Chip: NRC
    - 16 ports, more (plus)
  - Optic and electronic hybrid network
  - Topology: Fat tree → N Dimension Tree
  - Design for extension to 100 PFlops
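The lane figures quoted for the TH-Express2 NIC translate into a raw per-link rate by simple multiplication. The sketch below is an illustration, not a vendor specification: it assumes the lanes aggregate linearly and ignores line encoding and protocol overhead, which the slides do not state. The helper name `raw_link_gbps` is hypothetical.

```python
def raw_link_gbps(lane_gbps: float, lanes: int) -> float:
    """Raw signalling rate of one link, assuming lanes simply add up."""
    return lane_gbps * lanes


# Lane rates and lane counts as quoted on the TH-Express2 slide.
TH_EXPRESS2_GBPS = raw_link_gbps(10, 8)        # 10 Gbps x 8 lanes
TH_EXPRESS2_PLUS_GBPS = raw_link_gbps(14, 8)   # 14 Gbps x 8 lanes (plus)

if __name__ == "__main__":
    ratio = TH_EXPRESS2_PLUS_GBPS / TH_EXPRESS2_GBPS
    print(f"TH-Express2:  {TH_EXPRESS2_GBPS:.0f} Gbps raw per link")
    print(f"TH-Express2+: {TH_EXPRESS2_PLUS_GBPS:.0f} Gbps raw per link")
    print(f"Raw rate ratio (plus / base): {ratio:.2f}")
```

Payload bandwidth seen by MPI would be lower than these raw numbers; how much lower depends on details the talk does not give.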
Scalable MPI: Message Passing Services over TH-Express
- Galaxy Express (GLEX)
  - Basic message passing infrastructure on the network interface
  - User-level communication technology
  - User and kernel API
- MPICH-GLEX
  - Based on MPICH from ANL
  - Nemesis netmod using GLEX
- Design considerations
  - Protocol: different communication mechanisms exhibit different performance and resource usage
  - Application characteristics: communication mode, such as nearest-neighbor communication
  - Scalability: balance between performance and resource usage

Scalable MPI
- Message passing protocols
- Various low-level protocols with TH-Net
  - Eager protocol
    - Exclusive RDMA channel: performance oriented
    - Shared RDMA channel: scalability oriented
    - Hybrid channels: combine with the application model
  - Rendezvous protocol
    - Zero-copy data transfer based on RDMA Get

Scalable MPI
- Collective communication
  - Data-parallel styles for users
  - MPI interface level
    - Nonblocking collectives
    - Alltoallv/Allgatherv
    - Group-split
  - Implementation level
    - Scalable algorithm
    - Topology aware
    - Hardware offload

Scalable MPI
- Collective offload
  - Construct the topology-aware algorithm tree dynamically
  - Messages are passed between tree nodes automatically, based on the trigger mechanism of the NIC
  - Bypass the effect of OS noise, improving overlap of computation and communication
  - Reduce latency of large-scale collective operations

Scalable MPI
- Non-stop and fault Resilient MPI (NR-MPI)
  - Application continues execution without being relaunched
  - Failure detection and MPI state recovery are done by the runtime
  - Data backup by application-level diskless C/R
  - Reconstruction of the MPI communicator and channel
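The idea behind application-level diskless C/R with mutual data backup can be sketched as a small single-process simulation. This is not NR-MPI code: the ring pairing (each rank keeps an in-memory copy of its left neighbour's data), the `Rank` class and all function names are assumptions made for illustration, and only a single failure at a time is handled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rank:
    """One simulated process: its working data plus backups held for others."""
    rank_id: int
    data: Optional[List[float]]                      # None after a failure
    backups: Dict[int, List[float]] = field(default_factory=dict)


def backup_ring(ranks: List[Rank]) -> None:
    """Each rank stores a copy of its data in the memory of its right neighbour."""
    n = len(ranks)
    for r in ranks:
        buddy = ranks[(r.rank_id + 1) % n]
        buddy.backups[r.rank_id] = list(r.data)


def fail(rank: Rank) -> None:
    """Simulate a process failure: all in-memory state of the rank is lost."""
    rank.data = None
    rank.backups.clear()


def recover(ranks: List[Rank], failed_id: int) -> None:
    """Restore a failed rank from its right neighbour, then rebuild the backup
    it used to hold for its left neighbour (assumes both neighbours are alive)."""
    n = len(ranks)
    failed = ranks[failed_id]
    buddy = ranks[(failed_id + 1) % n]
    left = ranks[(failed_id - 1) % n]
    failed.data = list(buddy.backups[failed_id])
    failed.backups[left.rank_id] = list(left.data)


if __name__ == "__main__":
    ranks = [Rank(i, [float(i)] * 3) for i in range(4)]
    backup_ring(ranks)
    fail(ranks[2])
    recover(ranks, 2)
    print(ranks[2])
```

In a real runtime the backup and restore steps would be message exchanges between processes, and recovery would also involve rebuilding the communicator, which this sketch leaves out.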
[Figure: NR-MPI architecture and recovery timeline. Components: MPI Application; NR-MPI Programming Interface; Data Backup and Restore Module; Failure Detecting Module; State Management Module; Failure Recovery Module; Communication Adapter; Existing MPI Library; Operating System. Timeline: processes P0 to P5 across Phase 1, Phase 2 and Phase 3. Legend: normal execution; stand-by execution; local data copy; data communication for mutual data backup and restore; failure detection and notification; recreating the world communicator; data restore.]
(published at ICPADS 2013)

Dynamic Software
- Complexity: multidisciplinary, multi-physics, multi-scale, multi-method
- Legacy applications: long development time, expensive, difficult
- Autotuning the performance
- Dynamic resource requirements and provisioning
- Topo-aware and latency hiding
- Resource sharing & hybrid runtime
- Fault tolerance and resilience
- Rethink & redesign the software

Scientific Discovery
- Creative computing technology
  - Hardware, system software, algorithm, applications
- Creative data processing technology
  - Data management, analysis, visualization
- Big data come from
  - Experiment
  - Observation
  - Sensor network
  - Simulation
- Challenge of computing/throughput

HPC vs Big Data
- Increasing I/O requirements
  - Large-scale pre/post data sets
  - Visualization and analysis
  - Big science with big data
  - Expected data volume per simulation from ~GB to ~PB, typically ~100 TB
- I/O bottleneck
  - Scalability, efficiency, performance, economy and durability
- What is needed for the parallel IO interface
  - More hints could be expressed
  - More patterns could be supported
  - Interface to application IO library

Scalable IO Structure
- IO architecture on Tianhe-2
  - Multiple layers & hybrid storage
    - Local disk
    - PCI-E SSD
    - Disk array
  - 6400 local disks
    - Bus attached
  - 256 IO nodes
    - Burst: above 1 TB/s
    - TH-Express and IB QDR port
  - 64 storage servers
    - Sustained: about 100 GB/s

[Figure: I/O path. Recoverable labels: Local Storage; Data Analysis Store; High Speed Network; I/O Forward; SSD; Burst Buffer; Storage Network; Global Storage; Large Capacity Persistent Store.]

Scalable IO Structure
- H2FS: Hybrid Hierarchy File System
  - HVN: a hybrid, unified and isolated dynamic namespace maintained by centralized servers
  - DPU: a fundamental HVN unit for data processing, which tightly couples a compute node with its local storage
  - Layered and enriched metadata, with I/O hints as high-level metadata
- I/O API
  - POSIX
  - MPI-IO
  - Extended API: layout and policy guide
  - HDF5 over POSIX and the extended API
  - Object API (todo)

[Figure: H2FS structure. Recoverable labels: HPC Workloads; Data-Intensive Workloads; Job Management; Interface (POSIX API, Layout API, Policy API); HVN 01 (Job 01), HVN 02 (Job 02), HVN 03 (Job 03), each a Virtual Namespace; DPU 01 to DPU 08 with Local Storage; HVN Management API; HVN Management Server; Shared Storage; Policy Database; Logging Volume.]

Scalable IO Structure
- HPC benefits
  - Scalable burst BW for typical HPC applications
  - Isolated HVN lets data-intensive applications personalize their optimization
  - Reduced requirements for costly shared storage
  - Scalability, efficiency, economy and ease of use
- Data processing benefits
  - Maximum locality: DPU provides the opportunity to schedule tasks close to data
  - A single namespace makes post-processing easy
  - Reduction of data movement, better support for in-situ and in-transit data analysis

Different Levels of Performance
- Peak performance
- LINPACK performance
  - Avg. 80%
- Gordon Bell Prize performance
  - ~30%
- Application sustained performance
  - <5%~10%
- HPCG Benchmark
  - ~1%

Applications
- High Energy Density Physics
- Weather & Climate
- CFD
- Seismic data processing
- Bio-information
- E-Gov & Service

Applications
- Case studies (CPU+MIC)
  - CFD
    - High-order CFD simulation, 682.4 billion grids, #7168, 1.37 million cores
  - Climate
    - Global shallow water model, #8664, ~1.7 million cores, 77%
  - Physics
    - Gyrokinetic Toroidal Code GTC, #2048, ~160,000 cores
  - Business opinion analysis
    - 600 TB structured/non-structured data with micMR (Hadoop over MIC), #1024, 100 million records/day
  - ……

Applications
HCFD: High-Order SimulaTor of Aerodynamics
- WCNS: Weighted Compact Nonlinear Scheme
- Explicit Runge-Kutta
Slides provided by Dr. Yongxian Wang from NUDT

Applications
HCFD: High-Order SimulaTor of Aerodynamics
- C919 complex shape flow field simulation
- Direct Numerical Simulation (DNS)
Slides provided by Dr. Yongxian Wang from NUDT

Applications
- Cardiac subcellular-level nanoscale calcium-spread mechanical simulation
  - Explore the pathogenesis of heart disease
  - CPU+MIC hybrid computing
    - 4096 nodes
    - 1 MIC vs. 1.82-5 IVB
- Virtual drug screening: molecular docking calculations
  - DOCK6.5
  - 303,826 compound conformations (specs)
  - 1,100 drug targets (pdtd)
  - Over 334 million docking calculations

Applications
- The catalytic mechanism of human oxidosqualene cyclase
  - QM/MM MD simulation (Qchem-Tinker)
  - 22000 atoms
  - Sun Yat-Sen University
- Study the pathogenesis of Flavobacterium
  - Research and product development of the key technology in freshwater fish immune disease prevention and control
  - Pearl River Fisheries Research Institute
  - Chinese Academy of Fishery Sciences
- Regional marine digitizing system of the Pearl River Estuary and South China Sea
  - Sun Yat-Sen University

Applications
- Neutrino mass measurement
  - Beijing Normal University
  - High energy institute of Canada
  - Simulate 13.7-billion-year cosmic evolution in 48 hours
- High-speed rail tunnel aerodynamic effects
  - Middle South University
- Shock wave/turbulent boundary layer interaction
  - Structural safety of the high-speed aircraft
  - Problem scale: 2800 million grid points
  - Consumed time: 12 hours

[Figure: simulation results for massless versus massive neutrinos.]

Applications: KylinCloud Cloud Platform
- Architecture

[Figure: KylinCloud architecture. Recoverable labels: application domains (Resource Rental, Big Data, E-Gov, SmartCity, Education, Energy, Finance); Platform as a Service (Develop/Deploy Environment, Database & Middleware, Data Analysis, Auto Deployment, Auto Configuration, Auto Scaling); Infrastructure as a Service (Service Orchestration, Resource Scheduling, Compute Resource, Storage Resource, Network Resource); Billing; Configuration; Authentication; Log System; Monitoring; Kylin Server Operating System.]

- Features
  - Customized according to the needs of various applications and the architecture of TH-2
  - Provides IaaS and PaaS services to applications with efficient resource management and scheduling mechanisms
  - Provides high scalability to manage datacenters with 10,000 nodes and 100,000 cores
  - Provides enhanced security and high availability to ensure the safety of data and applications
  - Provides multiple-level user management and quota management to tenants
  - Provides a friendly self-service portal and the statistics, reporting and display of resource usage

Applications
- Applications
  - E-Gov
  - RenderCloud
  - micMR
  - Video processing
  - Electromagnetic spectrum management

Applications
- Case studies (CPU+MIC) (cont.): over-PFlops applications
  - SeisSol: earthquake simulation
    - #8192 with 3 MICs
  - GRN: Whole-Genome Regulatory Networks Inference through Large-Scale Bayesian Learning
    - #8192 with 3 MICs
  - SCF: quantum chemical simulations of DNA
    - #8100 with 3 MICs
  - ……

Applications
- Need custom hybrid algorithms
  - Performance-oriented programming
  - Simple optimization based on MPI+OpenMP alone will not be enough
  - Rethink new heterogeneous algorithms at the physics level to maximize performance
- Application code
  - Portable, persistent, correct
  - Scalable, maintainable
  - Resilient or oblivious
  - Productivity

Summary
Challenges
- HPC
- Big science, big engineering, big data
Keys
- Eco-system
- Co-design
- HPC education

[Figure: HPC Eco-System, linking HPC System, Parallel software, Domain APP, Talents, and Algorithm & model.]

[Figure: plot of Mz versus Alpha at MA = 0.76, comparing Grid1_xhs, Roe+SST_wgx, sa_vl_mb, ROE_SST_xzy and FL-26.]
Thanks