Large-Scale Distributed Systems
Layer Zero
COMP6511A Spring 2014, HKUST
Lin Gu
[email protected]
Technologies
Waves of new technologies in cloud and big data computation:
• MapReduce/Hadoop
• Dryad/DryadLINQ
• OpenCL
• X10
• DVM
• Megastore/Spanner
• Pregel
• RamCloud, SPARK/RDD
• Windows Azure
• GAE (Google App Engine)
The capability of big-data computation heavily relies on system
software and programming frameworks.
How to construct system software to support
general-purpose and high-performance computation
on large data sets?
Next Generation Technology
MapReduce/Hadoop is slow and constrained. How
to construct the next-generation technology?
• App-specific optimizations
– Pregel, PowerGraph
• More flexible frameworks
– Dryad/DryadLINQ, OpenCL, X10
• Data services
– PNUTS, Megastore, Spanner
• In-memory computation
– SPARK/RDD
– Layer Zero
Virtualization technology
[Figure: several VMs (VM 1–VM 4), each hosting applications, running on shared physical hardware]
• Package resources
• Enforce isolation
• A fundamental component of cloud technology in datacenters
Towards a datacenter-scale virtual machine
Layer Zero: unite compute nodes in a datacenter as a big virtual machine
– General
– Scalable (1000s of machines)
– Efficient
– Easy-to-program
– Portable
"The datacenter as a computer" [Barroso 2009]
Layer Zero (L0): big virtual machine
[Figure: applications running in VM 1, VM 2, …, VM k on top of Layer Zero]
i0: Instruction Set Architecture for Layer Zero
Why not other approaches?
c0 touches many levels of the system software stack. Why not a simpler approach?
• MapReduce (Hadoop) – application frameworks
• X10 – parallel programming languages
• MPI – system calls/APIs

Problems with the simpler approaches:
• Increased complexity
– Partitioned program state (MapReduce)
– Programmer-specified synchronization (X10)
– Semantic gaps (MPI)
• Decreased performance
– 7X improvement is possible (k-means)
• Diminished generality
– Specific control flow and dependence relations (MapReduce)
Layer Zero architecture
[Figure: Layer Zero spans physical hosts 1–3, each running L0 and Scheduler 0; resources are organized into Resource Groups of Virtual Memory Containers (VMCs) and Virtual Processor Containers (VPCs), on which tasks run]
i0 - instructions
Example instruction: add (1000):q, 8(1000), 0x80100
– opcode: add
– operands: (1000), 8(1000), 0x80100 (all addressed through memory)
– operand attribute :q – 64-bit integer
[Figure: memory layout of the example – (1000) holds 1, 8(1000) holds 2, and the sum 3 is stored at 0x80100]
• Unified operand address based on memory
• Orthogonality in instruction design
• Selected group of frequently used instructions for efficiency
• Support for massive, flexible and efficient parallel processing
i0 - instructions

Instruction  Operands        Effect
mov          D1, M1          Move [D1] to M1
add          D1, D2, M1      Add [D1] and [D2]; store the result in M1
sub          D1, D2, M1      Subtract [D2] from [D1]; store the result in M1
div          D1, D2, M1      Divide [D1] by [D2]; store the result in M1
and          D1, D2, M1      Store the bitwise AND of [D1] and [D2] in M1
or           D1, D2, M1      Store the bitwise inclusive OR of [D1] and [D2] in M1
xor          D1, D2, M1      Store the bitwise exclusive OR of [D1] and [D2] in M1
br           D1, D2, M1      Compare [D1] and [D2]; jump to M1 depending on the comparison result
bl           M1, M2          Branch and link (procedure call)
spawn        M1, M2, M3, M4  Create a new runner
mul          D1, D2, M1      Multiply [D1] and [D2]; store the result in M1
exit                         Exit and commit or abort

• Selected group of frequently used instructions
• Instructions for massive, flexible and efficient parallel processing
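To make the memory-addressed operand model above concrete, here is a minimal C sketch of an interpreter for a small subset of these instructions (mov, add, br). The flat byte-array memory, the decoded-instruction struct and the branch-if-equal condition are illustrative assumptions, not the actual i0 encoding or semantics.

#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: a tiny interpreter for a subset of the i0
 * instructions listed above, assuming all operands are 64-bit integers
 * addressed through a flat memory image. The encoding is hypothetical. */

enum opcode { OP_MOV, OP_ADD, OP_BR };

struct insn {
    enum opcode op;
    uint64_t d1, d2;   /* source operand addresses */
    uint64_t m1;       /* destination address, or branch target */
};

static uint8_t mem[1 << 20];   /* flat memory image */

static int64_t load(uint64_t addr)             { int64_t v; memcpy(&v, &mem[addr], 8); return v; }
static void    store(uint64_t addr, int64_t v) { memcpy(&mem[addr], &v, 8); }

/* Execute one instruction; return the next program counter. */
static uint64_t step(const struct insn *i, uint64_t pc)
{
    switch (i->op) {
    case OP_MOV:  store(i->m1, load(i->d1));               return pc + 1;
    case OP_ADD:  store(i->m1, load(i->d1) + load(i->d2)); return pc + 1;
    case OP_BR:   /* assumed here: jump to m1 if [d1] == [d2] */
                  return load(i->d1) == load(i->d2) ? i->m1 : pc + 1;
    }
    return pc + 1;
}

int main(void)
{
    /* Mirror the earlier example: add the values at 0x1000 and 0x1008
     * and store the sum at 0x80100. */
    store(0x1000, 1);
    store(0x1008, 2);
    struct insn add = { OP_ADD, 0x1000, 0x1008, 0x80100 };
    step(&add, 0);
    return load(0x80100) == 3 ? 0 : 1;
}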
Layer Zero Technologies
[Figure: the Layer Zero software stack – applications (Web time machine, Big Data Calculator, Collaborative editor), frameworks and languages (vMR, c0 lib, Java & more?), and the core layers (i0, c0 memory, scheduler, storage persistency)]

Parallelization and Dependence Control
Tasks – an example
Calculate the sum of 20,480 integers
– Each task sums two integers
– A parent task sums the results from two runners
[Figure: Scheduler 0 dispatches the tasks onto VPCs across L0]
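The task tree above can be sketched in plain C as a recursion in which every leaf adds two integers and every parent adds the results of its two children; on L0 each call would instead be a separate runner created by spawn/newr. The function name and array contents are illustrative.

#include <stdio.h>

#define N 20480   /* number of integers to sum, as in the slide */

/* One "task": sum the range [lo, hi) by splitting it in half until only
 * two elements remain. Each node of the recursion corresponds to one
 * runner in the L0 version; here they simply run sequentially. */
static long sum_task(const long *data, int lo, int hi)
{
    if (hi - lo == 1)
        return data[lo];
    if (hi - lo == 2)
        return data[lo] + data[lo + 1];      /* leaf task: add two integers */
    int mid = lo + (hi - lo) / 2;
    long left  = sum_task(data, lo, mid);    /* would be a child runner */
    long right = sum_task(data, mid, hi);    /* would be a child runner */
    return left + right;                     /* parent sums two results */
}

int main(void)
{
    static long data[N];
    for (int i = 0; i < N; i++)
        data[i] = 1;                         /* 20,480 ones -> sum is 20480 */
    printf("%ld\n", sum_task(data, 0, N));
    return 0;
}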
Many-task parallel execution
Create 1000s of new runners easily and efficiently:

newr stack, heap, watched, fi
newr stack, heap, watched, fi
newr stack, heap, watched, fi
...
newr stack, heap, watched, fi
exit:c

Store runner state: stack, heap, watched, fi
[Figure: L0 and Scheduler 0 dispatch the new runners onto VPCs]
Programming on a big single computer
• Large unified memory space
– Shared region (SR) and private region (PR): ~64 TB and ~4 GB
Challenge: thousands of tasks (runners) access the SR concurrently
• A snapshot of the ranges of interest for each runner
– Updates affect only the associated snapshot => concurrent accesses
– Most accesses handled at native speed
• Commit upon completion of the task
– Coordination only needed when committing memory ranges
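A minimal sketch of this snapshot-and-commit pattern, assuming a toy shared region held in a single array: a runner copies the range it is interested in, works on the copy at native speed, and coordinates (here with one mutex) only when committing the range back. The structures and commit policy below are illustrative, not the actual c0 memory protocol.

#include <pthread.h>
#include <string.h>
#include <stdint.h>

#define SR_WORDS 1024                      /* toy shared region (SR) */

static int64_t shared_region[SR_WORDS];
static pthread_mutex_t commit_lock = PTHREAD_MUTEX_INITIALIZER;

struct snapshot {
    size_t  base, len;                     /* range of interest */
    int64_t copy[SR_WORDS];                /* private copy of that range */
};

/* Take a snapshot of the range of interest; later reads and writes by the
 * runner go to the copy, so they need no coordination. */
static void snapshot_take(struct snapshot *s, size_t base, size_t len)
{
    s->base = base;
    s->len  = len;
    memcpy(s->copy, &shared_region[base], len * sizeof(int64_t));
}

/* Commit upon completion of the task: publish the updated range back to
 * the shared region. Only this step needs coordination. */
static void snapshot_commit(const struct snapshot *s)
{
    pthread_mutex_lock(&commit_lock);
    memcpy(&shared_region[s->base], s->copy, s->len * sizeof(int64_t));
    pthread_mutex_unlock(&commit_lock);
}

/* Example runner body: increment every word in its range. */
static void *runner(void *arg)
{
    struct snapshot snap;
    snapshot_take(&snap, 0, 16);
    for (size_t i = 0; i < snap.len; i++)
        snap.copy[i] += 1;                 /* native-speed access to the copy */
    snapshot_commit(&snap);
    return arg;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, runner, NULL);
    pthread_join(t, NULL);
    return (int)shared_region[0];          /* 1 after the commit */
}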
Manage runners
Parent runner creates 10,240 child runners.
– Commit 10,240 times? How to share data?
– Only 1 commit: the children are created by the parent and become schedulable when the parent commits.
[Figure: runner state machine – created (parent creates) -> schedulable (parent commits) -> running (schedule) -> finished (exit or abort)]
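The runner lifecycle above can be written down as a small state machine; the enum and function names in this C sketch are illustrative, not c0's scheduler interface.

#include <stdio.h>

/* Runner lifecycle as shown in the figure. */
enum runner_state { CREATED, SCHEDULABLE, RUNNING, FINISHED };

enum runner_event { PARENT_COMMITS, SCHEDULE, EXIT_OR_ABORT };

/* Apply one event to a runner; invalid transitions leave the state as-is. */
static enum runner_state transition(enum runner_state s, enum runner_event e)
{
    switch (s) {
    case CREATED:     return e == PARENT_COMMITS ? SCHEDULABLE : s;
    case SCHEDULABLE: return e == SCHEDULE       ? RUNNING     : s;
    case RUNNING:     return e == EXIT_OR_ABORT  ? FINISHED    : s;
    case FINISHED:    return s;
    }
    return s;
}

int main(void)
{
    /* A child runner created by spawn/newr walks through the lifecycle. */
    enum runner_state s = CREATED;
    s = transition(s, PARENT_COMMITS);   /* the parent's single commit */
    s = transition(s, SCHEDULE);         /* picked by Scheduler 0 */
    s = transition(s, EXIT_OR_ABORT);    /* task exits and commits */
    printf("%d\n", s == FINISHED);       /* prints 1 */
    return 0;
}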
Task dependency
Dependency and order control is a key issue in
concurrent task execution
• X10 – synchronization mechanisms
– Need to synchronize concurrent execution
• MapReduce – Restricted programming model
• Dryad – DAG-based
– Non-trivial burden in programming
– Automatic DAG generation only implemented for certain
high-level languages
Watcher
Watcher – explicitly expresses data dependence
– Data dependence declared as "watched ranges", e.g. [0x1000, 0x1010)
– Flexible way to declare dependence
– Automatic dependence resolution
[Figure: watcher state machine – a spawned watcher is created, enters the watching state when the parent commits, becomes schedulable when the watched memory changes, then runs and finishes on exit or abort]
Tomasulo's Algorithm
• From the IBM System/360 Model 91
• Goal: high performance using a limited number of registers, without a special compiler
– 4 double-precision FP registers on the 360
– Uses register renaming
• Descendants include: Alpha 21264, HP PA-8000, MIPS R10000, Pentium III, PowerPC 604, …
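A small C sketch of the register-renaming idea at the heart of Tomasulo's algorithm: each architectural register records the tag of the reservation station that will produce its newest value, so a later writer simply replaces the tag and WAW/WAR hazards disappear. The structures below are a teaching simplification, not the 360/91 hardware.

#include <stdio.h>

#define NUM_REGS 4        /* the 360 had 4 double-precision FP registers */
#define NO_TAG   (-1)

/* Register status: either the value is ready, or it will be produced by
 * the reservation station identified by 'tag' (register renaming). */
struct reg_status {
    int    tag;           /* NO_TAG if the value is valid */
    double value;
};

static struct reg_status regs[NUM_REGS];

/* Issue an instruction whose destination is register fd to reservation
 * station rs_tag: fd is renamed to rs_tag, so an earlier in-flight writer
 * of fd (WAW) or an earlier reader (WAR) no longer conflicts with it. */
static void issue(int rs_tag, int fd)
{
    regs[fd].tag = rs_tag;
}

/* A reservation station finishes and broadcasts its result on the CDB:
 * every register still waiting on this tag captures the value. */
static void broadcast(int rs_tag, double result)
{
    for (int r = 0; r < NUM_REGS; r++)
        if (regs[r].tag == rs_tag) {
            regs[r].value = result;
            regs[r].tag   = NO_TAG;
        }
}

int main(void)
{
    /* Two instructions both writing F0 (a WAW hazard without renaming):
     * only the newer producer, Add2, is allowed to update F0. */
    issue(/*Mult1*/ 1, /*F0*/ 0);
    issue(/*Add2*/  2, /*F0*/ 0);       /* F0 renamed again: waits on Add2 */
    broadcast(1, 3.0);                  /* Mult1's result no longer updates F0 */
    broadcast(2, 7.0);                  /* Add2's result becomes F0 */
    printf("F0 = %g\n", regs[0].value); /* prints F0 = 7 */
    return 0;
}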
Tomasulo Organization
[Figure: the FP op queue and FP registers feed reservation stations (Add1–Add3, Mult1–Mult2) in front of the FP adders and FP multipliers; load buffers (Load1–Load6) bring data from memory and store buffers write results to memory; the Common Data Bus (CDB) broadcasts results back to the reservation stations and registers]
Watcher
[Figure: two runners store their results at 0x1000 and 0x1008; a third runner watches the shared memory range [0x1000, 0x1010). The initial value at 0x1000 and 0x1008 is 0.]

if (*((long *)0x1000) != 0 &&
    *((long *)0x1008) != 0) {
    // add the sums produced by the two runners together
} else {
    // re-create itself and keep watching
}
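The same watcher pattern, simulated on one machine with threads and a polling loop: two producers write their partial sums into the watched slots, and the watcher combines them once both are non-zero. The thread-and-polling machinery stands in for L0's watched ranges and is purely illustrative.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sched.h>

/* Stand-ins for the shared words at 0x1000 and 0x1008; initially 0. */
static _Atomic long slot_a = 0, slot_b = 0;

static void *producer_a(void *arg) { atomic_store(&slot_a, 10); return arg; }
static void *producer_b(void *arg) { atomic_store(&slot_b, 32); return arg; }

/* The watcher: wait until both watched slots are non-zero, then combine. */
static void *watcher(void *arg)
{
    while (atomic_load(&slot_a) == 0 || atomic_load(&slot_b) == 0)
        sched_yield();                       /* "keep watching" */
    long sum = atomic_load(&slot_a) + atomic_load(&slot_b);
    printf("combined sum = %ld\n", sum);     /* prints 42 */
    return arg;
}

int main(void)
{
    pthread_t w, a, b;
    pthread_create(&w, NULL, watcher, NULL);
    pthread_create(&a, NULL, producer_a, NULL);
    pthread_create(&b, NULL, producer_b, NULL);
    pthread_join(a, NULL); pthread_join(b, NULL); pthread_join(w, NULL);
    return 0;
}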
What is the performance?
L0 is general-purpose. Can it be faster than specialized solutions for specific computing paradigms?
Implementation and evaluation
• Emulate c0 on x86-64
– Dynamic binary translation of i0 instructions
• Implemented and evaluated on
– CCMR – a research testbed
– Amazon Elastic Compute Cloud (EC2)
– Tianhe-1A (GZ)
• Workloads: microbenchmarks, prime checker, matrix operations, graph algorithms, and k-means clustering
• Compared with Xen, VMware, Hadoop and X10
Implementation and evaluation
• Platform: CCMR (50 nodes, 1,300 cores, 1.5 TB RAM, 100 TB HD)
– Standard compute node: 2 CPUs, 32 GB RAM or more, 4 TB HD
– Thin compute node: 1 CPU, 16 GB RAM, 2 TB HD
– GPU server: NVIDIA Tesla S1070 (960 cores) and Fermi
• Interconnect
– High-bandwidth Ethernet and switches
– 10GbE twin-ax cabling
Designed in 2009 to anticipate and emulate "future" datacenter clusters.
k-means: The MapReduce Approach
• MapReduce – a parallel computing framework
• Map: process independent input data
– Calculate the distance between a data point and all centroids
– Output <n, v> (assign data point v to the nearest centroid n)
Example – input: [1, 2, 3, 4, 5]; centroids: 1, 4; m: the map function
m(1) -> <1,1>, m(2) -> <1,2>, m(3) -> <4,3>, m(4) -> <4,4>, m(5) -> <4,5>
=> [<1,1>, <1,2>, <4,3>, <4,4>, <4,5>]
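A minimal C sketch of this map step for the 1-D example, with the nearest centroid's value used as the intermediate key, as on the slide; the emit function stands in for the framework's intermediate output and is illustrative.

#include <math.h>
#include <stdio.h>

/* Emit an intermediate <centroid, data point> pair. Here we just print it;
 * a real framework would shuffle it to the reducer. */
static void emit(double key, double value)
{
    printf("<%g, %g>\n", key, value);
}

/* Map: assign one data point to its nearest centroid. */
static void kmeans_map(double point, const double *centroids, int k)
{
    int best = 0;
    for (int i = 1; i < k; i++)
        if (fabs(point - centroids[i]) < fabs(point - centroids[best]))
            best = i;
    emit(centroids[best], point);
}

int main(void)
{
    const double centroids[] = { 1.0, 4.0 };   /* as in the slide */
    const double points[]    = { 1, 2, 3, 4, 5 };
    for (int i = 0; i < 5; i++)
        kmeans_map(points[i], centroids, 2);   /* <1,1> <1,2> <4,3> <4,4> <4,5> */
    return 0;
}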
k-means: The MapReduce Approach
• Reduce (like fold): compute the new centroids
• Group, sum, then average
– Group the intermediate data by key (centroid)
– A combiner processes local partial sums for local groups
– Reduce obtains the global average for each group
Example – r: the combiner and reduce function
(fold r 0 '(<1,1>, <1,2>)): aggregated value 0 -> 1 -> 1.5, so the new centroid is 1.5
(fold r 0 '(<4,3>, <4,4>, <4,5>)): aggregated value 0 -> 3 -> 3.5 -> 4, so the new centroid is 4
k-means: The MapReduce Approach
• Start over (more iterations)
• Stop upon convergence or when the iteration limit is reached
[Figure: the map and reduce stages repeat in each iteration]
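Putting the pieces together, a small self-contained sketch of the whole iteration loop for the 1-D example: assign points to the nearest centroid, average each group, and stop on convergence or an iteration limit. It is sequential and only illustrates the control flow, not the vMR/Hadoop implementation.

#include <math.h>
#include <stdio.h>

#define N 5
#define K 2
#define MAX_ITER 100

int main(void)
{
    const double points[N] = { 1, 2, 3, 4, 5 };
    double centroids[K]    = { 1, 4 };          /* initial centroids */

    for (int iter = 0; iter < MAX_ITER; iter++) {
        double sum[K] = { 0 }, next[K];
        int    cnt[K] = { 0 };

        /* "Map": assign each point to its nearest centroid. */
        for (int i = 0; i < N; i++) {
            int best = 0;
            for (int c = 1; c < K; c++)
                if (fabs(points[i] - centroids[c]) <
                    fabs(points[i] - centroids[best]))
                    best = c;
            sum[best] += points[i];
            cnt[best] += 1;
        }

        /* "Reduce": new centroid = average of its group. */
        double moved = 0.0;
        for (int c = 0; c < K; c++) {
            next[c] = cnt[c] ? sum[c] / cnt[c] : centroids[c];
            moved  += fabs(next[c] - centroids[c]);
            centroids[c] = next[c];
        }

        if (moved < 1e-9)                        /* converged: stop */
            break;
    }

    printf("centroids: %g %g\n", centroids[0], centroids[1]); /* 1.5 4 */
    return 0;
}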
Performance comparison – k-means on 16 nodes
[Figure: execution time of k-means on 16 working nodes; L0 is 13X faster]
Scalability with data size
[Figure: execution time and throughput of k-means as the dataset grows – throughput increases with data size, while the same workload takes about half a day on Hadoop/X10; baselines include Phoenix [Ranger 2007][Yoo 2009] and CGL-MapReduce [Ekanayake 2008]]
General, scalable, efficient, portable, easy-to-program
K-means/vMR – Scale working nodes
[Figure: k-means/vMR performance as the number of working nodes grows]
K-means/vMR – Scale datasets
[Figure: k-means/vMR performance as the dataset size grows]
Run c0 programs on a supercomputer
Performance, scalability, portability
Implementation on Supercomputer Tianhe-1A
• 186,368 (or 202,752) cores
– 7,168 nodes, each comprising 2 Intel Xeon X5670 2.93 GHz 6-core "Westmere" processors and 1 NVIDIA Tesla M2010 (or M2050) 448-ALU 1150 MHz "Fermi" GPU
– An additional 2,048 Galaxy FT-1000 1 GHz 8-core processors
• 229 TB memory
Installation on the TH-1A/GZ
Supercomputer
Performance comparison – execution time
of k-means on vMR vs. Hadoop
Conclusion and futuristic notes
• Cloud computing and big data processing are a grand challenge to computer scientists, and will be disruptive.
• c0 is an approach to unifying computation on a cluster built with commodity hardware.
– The illusion of a "big machine" – "the datacenter as a computer"
– Scales to many compute nodes
• Future work
– Better compiler, framework, libraries and support
– To be open-sourced. Contributions welcome!
• Innovative computing organization
– Store-and-Compute
– RAM is storage: remove the gap between main memory and storage
– LAN is local: network bandwidth is comparable to that of an internal data bus!
• Optimize computing, not computers
Questions