Large-Scale Distributed Systems
Layer Zero
COMP6511A Spring 2014, HKUST
Lin Gu
[email protected]
Technologies
Waves of new technologies in cloud and big data computation:
• MapReduce/Hadoop
• Dryad/DryadLINQ
• OpenCL
• X10
• DVM
• Megastore/Spanner
• Pregel
• RamCloud, SPARK/RDD
• Windows Azure
• GAE (Google App Engine)
The capability of big-data computation heavily relies on system
software and programming frameworks.
How to construct system software to support
general-purpose and high-performance computation
on large data sets?
Next Generation Technology
MapReduce/Hadoop is slow and constrained. How
to construct the next-generation technology?
• App-specific optimizations
– Pregel, PowerGraph
• More flexible frameworks
– Dryad/DryadLINQ, OpenCL, X10
• Data services
– PNUTS, Megastore, Spanner
• In-memory computation
– SPARK/RDD
– Layer Zero
Virtualization technology
[Figure: several VMs (VM 1–VM 4), each hosting applications, running on shared physical hardware]
• Package resources
• Enforce isolation
• A fundamental component of cloud technology in datacenters
Towards a datacenter-scale virtual machine
Layer Zero: unite compute nodes in a datacenter as a big virtual machine
– General
– Scalable (1000s of machines)
– Efficient
– Easy-to-program
– Portable
"The datacenter as a computer" [Barroso 2009]
Layer Zero (L0): big virtual machine
[Figure: applications running in VM 1, VM 2, …, VM k on top of Layer Zero]
i0: Instruction Set Architecture for Layer Zero
Why not other approaches?
c0 touches many levels of the system software stack. Why not a simpler approach?
• MapReduce (Hadoop) – application frameworks
• X10 – parallel programming languages
• MPI – system calls/APIs

Problems with the simpler approaches:
• Increased complexity
– Partitioned program state (MapReduce)
– Programmer-specified synchronization (X10)
– Semantic gaps (MPI)
• Decreased performance
– 7X improvement is possible (k-means)
• Diminished generality
– Specific control flow and dependence relations (MapReduce)
Layer Zero architecture
[Figure: Layer Zero spans physical hosts 1–3, each running L0 and Scheduler 0; resources are organized into Resource Groups of Virtual Memory Containers (VMCs) and Virtual Processor Containers (VPCs), on which tasks run]
i0 - instructions
Example instruction: add (1000):q, 8(1000), 0x80100
– opcode: add
– operands: (1000), 8(1000), 0x80100 (all addressed through memory)
– operand attribute :q – 64-bit integer
[Figure: memory layout of the example – (1000) holds 1, 8(1000) holds 2, and the sum 3 is stored at 0x80100]
• Unified operand address based on memory
• Orthogonality in instruction design
• Selected group of frequently used instructions for efficiency
• Support for massive, flexible and efficient parallel processing
i0 - instructions

Instruction  Operands        Effect
mov          D1, M1          Move [D1] to M1
add          D1, D2, M1      Add [D1] and [D2]; store the result in M1
sub          D1, D2, M1      Subtract [D2] from [D1]; store the result in M1
div          D1, D2, M1      Divide [D1] by [D2]; store the result in M1
and          D1, D2, M1      Store the bitwise AND of [D1] and [D2] in M1
or           D1, D2, M1      Store the bitwise inclusive OR of [D1] and [D2] in M1
xor          D1, D2, M1      Store the bitwise exclusive OR of [D1] and [D2] in M1
br           D1, D2, M1      Compare [D1] and [D2]; jump to M1 depending on the comparison result
bl           M1, M2          Branch and link (procedure call)
spawn        M1, M2, M3, M4  Create a new runner
mul          D1, D2, M1      Multiply [D1] and [D2]; store the result in M1
exit                         Exit and commit or abort

• Selected group of frequently used instructions
• Instructions for massive, flexible and efficient parallel processing
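To make the memory-addressed operand model above concrete, here is a minimal C sketch of an interpreter for a small subset of these instructions (mov, add, br). The flat byte-array memory, the decoded-instruction struct and the branch-if-equal condition are illustrative assumptions, not the actual i0 encoding or semantics.

#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: a tiny interpreter for a subset of the i0
 * instructions listed above, assuming all operands are 64-bit integers
 * addressed through a flat memory image. The encoding is hypothetical. */

enum opcode { OP_MOV, OP_ADD, OP_BR };

struct insn {
    enum opcode op;
    uint64_t d1, d2;   /* source operand addresses */
    uint64_t m1;       /* destination address, or branch target */
};

static uint8_t mem[1 << 20];   /* flat memory image */

static int64_t load(uint64_t addr)             { int64_t v; memcpy(&v, &mem[addr], 8); return v; }
static void    store(uint64_t addr, int64_t v) { memcpy(&mem[addr], &v, 8); }

/* Execute one instruction; return the next program counter. */
static uint64_t step(const struct insn *i, uint64_t pc)
{
    switch (i->op) {
    case OP_MOV:  store(i->m1, load(i->d1));               return pc + 1;
    case OP_ADD:  store(i->m1, load(i->d1) + load(i->d2)); return pc + 1;
    case OP_BR:   /* assumed here: jump to m1 if [d1] == [d2] */
                  return load(i->d1) == load(i->d2) ? i->m1 : pc + 1;
    }
    return pc + 1;
}

int main(void)
{
    /* Mirror the earlier example: add the values at 0x1000 and 0x1008
     * and store the sum at 0x80100. */
    store(0x1000, 1);
    store(0x1008, 2);
    struct insn add = { OP_ADD, 0x1000, 0x1008, 0x80100 };
    step(&add, 0);
    return load(0x80100) == 3 ? 0 : 1;
}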
Layer Zero Technologies
[Figure: the Layer Zero software stack – applications (Web time machine, Big Data Calculator, Collaborative editor), frameworks and languages (vMR, c0 lib, Java & more?), and the core layers (i0, c0 memory, scheduler, storage persistency)]

Parallelization and Dependence Control
Tasks – an example
Calculate the sum of 20,480 integers
– Each task sums two integers
– A parent task sums the results from two runners
[Figure: Scheduler 0 dispatches the tasks onto VPCs across L0]
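The task tree above can be sketched in plain C as a recursion in which every leaf adds two integers and every parent adds the results of its two children; on L0 each call would instead be a separate runner created by spawn/newr. The function name and array contents are illustrative.

#include <stdio.h>

#define N 20480   /* number of integers to sum, as in the slide */

/* One "task": sum the range [lo, hi) by splitting it in half until only
 * two elements remain. Each node of the recursion corresponds to one
 * runner in the L0 version; here they simply run sequentially. */
static long sum_task(const long *data, int lo, int hi)
{
    if (hi - lo == 1)
        return data[lo];
    if (hi - lo == 2)
        return data[lo] + data[lo + 1];      /* leaf task: add two integers */
    int mid = lo + (hi - lo) / 2;
    long left  = sum_task(data, lo, mid);    /* would be a child runner */
    long right = sum_task(data, mid, hi);    /* would be a child runner */
    return left + right;                     /* parent sums two results */
}

int main(void)
{
    static long data[N];
    for (int i = 0; i < N; i++)
        data[i] = 1;                         /* 20,480 ones -> sum is 20480 */
    printf("%ld\n", sum_task(data, 0, N));
    return 0;
}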
Many-task parallel execution
Create 1000s of new runners easily and efficiently:

newr stack, heap, watched, fi
newr stack, heap, watched, fi
newr stack, heap, watched, fi
...
newr stack, heap, watched, fi
exit:c

Store runner state: stack, heap, watched, fi
[Figure: L0 and Scheduler 0 dispatch the new runners onto VPCs]
Programming on a big single computer
• Large unified memory space
– Shared region (SR) and private region (PR): ~64 TB and ~4 GB
Challenge: thousands of tasks (runners) access the SR concurrently
• A snapshot of the ranges of interest for each runner
– Updates affect only the associated snapshot => concurrent accesses
– Most accesses handled at native speed
• Commit upon completion of the task
– Coordination only needed when committing memory ranges
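A minimal sketch of this snapshot-and-commit pattern, assuming a toy shared region held in a single array: a runner copies the range it is interested in, works on the copy at native speed, and coordinates (here with one mutex) only when committing the range back. The structures and commit policy below are illustrative, not the actual c0 memory protocol.

#include <pthread.h>
#include <string.h>
#include <stdint.h>

#define SR_WORDS 1024                      /* toy shared region (SR) */

static int64_t shared_region[SR_WORDS];
static pthread_mutex_t commit_lock = PTHREAD_MUTEX_INITIALIZER;

struct snapshot {
    size_t  base, len;                     /* range of interest */
    int64_t copy[SR_WORDS];                /* private copy of that range */
};

/* Take a snapshot of the range of interest; later reads and writes by the
 * runner go to the copy, so they need no coordination. */
static void snapshot_take(struct snapshot *s, size_t base, size_t len)
{
    s->base = base;
    s->len  = len;
    memcpy(s->copy, &shared_region[base], len * sizeof(int64_t));
}

/* Commit upon completion of the task: publish the updated range back to
 * the shared region. Only this step needs coordination. */
static void snapshot_commit(const struct snapshot *s)
{
    pthread_mutex_lock(&commit_lock);
    memcpy(&shared_region[s->base], s->copy, s->len * sizeof(int64_t));
    pthread_mutex_unlock(&commit_lock);
}

/* Example runner body: increment every word in its range. */
static void *runner(void *arg)
{
    struct snapshot snap;
    snapshot_take(&snap, 0, 16);
    for (size_t i = 0; i < snap.len; i++)
        snap.copy[i] += 1;                 /* native-speed access to the copy */
    snapshot_commit(&snap);
    return arg;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, runner, NULL);
    pthread_join(t, NULL);
    return (int)shared_region[0];          /* 1 after the commit */
}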
Manage runners
Parent runner creates 10,240 child runners.
– Commit 10,240 times? How to share data?
– Only 1 commit: the children are created by the parent and become schedulable when the parent commits.
[Figure: runner state machine – created (parent creates) -> schedulable (parent commits) -> running (schedule) -> finished (exit or abort)]
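The runner lifecycle above can be written down as a small state machine; the enum and function names in this C sketch are illustrative, not c0's scheduler interface.

#include <stdio.h>

/* Runner lifecycle as shown in the figure. */
enum runner_state { CREATED, SCHEDULABLE, RUNNING, FINISHED };

enum runner_event { PARENT_COMMITS, SCHEDULE, EXIT_OR_ABORT };

/* Apply one event to a runner; invalid transitions leave the state as-is. */
static enum runner_state transition(enum runner_state s, enum runner_event e)
{
    switch (s) {
    case CREATED:     return e == PARENT_COMMITS ? SCHEDULABLE : s;
    case SCHEDULABLE: return e == SCHEDULE       ? RUNNING     : s;
    case RUNNING:     return e == EXIT_OR_ABORT  ? FINISHED    : s;
    case FINISHED:    return s;
    }
    return s;
}

int main(void)
{
    /* A child runner created by spawn/newr walks through the lifecycle. */
    enum runner_state s = CREATED;
    s = transition(s, PARENT_COMMITS);   /* the parent's single commit */
    s = transition(s, SCHEDULE);         /* picked by Scheduler 0 */
    s = transition(s, EXIT_OR_ABORT);    /* task exits and commits */
    printf("%d\n", s == FINISHED);       /* prints 1 */
    return 0;
}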
Task dependency
Dependency and order control is a key issue in
concurrent task execution
• X10 – synchronization mechanisms
– Need to synchronize concurrent execution
• MapReduce – Restricted programming model
• Dryad – DAG-based
– Non-trivial burden in programming
– Automatic DAG generation only implemented for certain
high-level languages
Watcher
Watcher – explicitly expresses data dependence
– Data dependence declared as "watched ranges", e.g. [0x1000, 0x1010)
– Flexible way to declare dependence
– Automatic dependence resolution
[Figure: watcher state machine – a spawned watcher is created, enters the watching state when the parent commits, becomes schedulable when the watched memory changes, then runs and finishes on exit or abort]
Tomasulo's Algorithm
• From the IBM System/360 Model 91
• Goal: high performance using a limited number of registers, without a special compiler
– 4 double-precision FP registers on the 360
– Uses register renaming
• Descendants include: Alpha 21264, HP PA-8000, MIPS R10000, Pentium III, PowerPC 604, …
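A small C sketch of the register-renaming idea at the heart of Tomasulo's algorithm: each architectural register records the tag of the reservation station that will produce its newest value, so a later writer simply replaces the tag and WAW/WAR hazards disappear. The structures below are a teaching simplification, not the 360/91 hardware.

#include <stdio.h>

#define NUM_REGS 4        /* the 360 had 4 double-precision FP registers */
#define NO_TAG   (-1)

/* Register status: either the value is ready, or it will be produced by
 * the reservation station identified by 'tag' (register renaming). */
struct reg_status {
    int    tag;           /* NO_TAG if the value is valid */
    double value;
};

static struct reg_status regs[NUM_REGS];

/* Issue an instruction whose destination is register fd to reservation
 * station rs_tag: fd is renamed to rs_tag, so an earlier in-flight writer
 * of fd (WAW) or an earlier reader (WAR) no longer conflicts with it. */
static void issue(int rs_tag, int fd)
{
    regs[fd].tag = rs_tag;
}

/* A reservation station finishes and broadcasts its result on the CDB:
 * every register still waiting on this tag captures the value. */
static void broadcast(int rs_tag, double result)
{
    for (int r = 0; r < NUM_REGS; r++)
        if (regs[r].tag == rs_tag) {
            regs[r].value = result;
            regs[r].tag   = NO_TAG;
        }
}

int main(void)
{
    /* Two instructions both writing F0 (a WAW hazard without renaming):
     * only the newer producer, Add2, is allowed to update F0. */
    issue(/*Mult1*/ 1, /*F0*/ 0);
    issue(/*Add2*/  2, /*F0*/ 0);       /* F0 renamed again: waits on Add2 */
    broadcast(1, 3.0);                  /* Mult1's result no longer updates F0 */
    broadcast(2, 7.0);                  /* Add2's result becomes F0 */
    printf("F0 = %g\n", regs[0].value); /* prints F0 = 7 */
    return 0;
}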
Tomasulo Organization
[Figure: the FP op queue and FP registers feed reservation stations (Add1–Add3, Mult1–Mult2) in front of the FP adders and FP multipliers; load buffers (Load1–Load6) bring data from memory and store buffers write results to memory; the Common Data Bus (CDB) broadcasts results back to the reservation stations and registers]
Watcher
[Figure: two runners store their results at 0x1000 and 0x1008; a third runner watches the shared memory range [0x1000, 0x1010). The initial value at 0x1000 and 0x1008 is 0.]

if (*((long *)0x1000) != 0 &&
    *((long *)0x1008) != 0) {
    // add the sums produced by the two runners together
} else {
    // re-create itself and keep watching
}
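The same watcher pattern, simulated on one machine with threads and a polling loop: two producers write their partial sums into the watched slots, and the watcher combines them once both are non-zero. The thread-and-polling machinery stands in for L0's watched ranges and is purely illustrative.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sched.h>

/* Stand-ins for the shared words at 0x1000 and 0x1008; initially 0. */
static _Atomic long slot_a = 0, slot_b = 0;

static void *producer_a(void *arg) { atomic_store(&slot_a, 10); return arg; }
static void *producer_b(void *arg) { atomic_store(&slot_b, 32); return arg; }

/* The watcher: wait until both watched slots are non-zero, then combine. */
static void *watcher(void *arg)
{
    while (atomic_load(&slot_a) == 0 || atomic_load(&slot_b) == 0)
        sched_yield();                       /* "keep watching" */
    long sum = atomic_load(&slot_a) + atomic_load(&slot_b);
    printf("combined sum = %ld\n", sum);     /* prints 42 */
    return arg;
}

int main(void)
{
    pthread_t w, a, b;
    pthread_create(&w, NULL, watcher, NULL);
    pthread_create(&a, NULL, producer_a, NULL);
    pthread_create(&b, NULL, producer_b, NULL);
    pthread_join(a, NULL); pthread_join(b, NULL); pthread_join(w, NULL);
    return 0;
}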
What is the performance?
L0 is general-purpose. Can it be faster than specialized solutions for specific computing paradigms?
Implementation and evaluation
• Emulate c0 on x86-64
– Dynamic binary translation of i0 instructions
• Implemented and evaluated on
– CCMR – a research testbed
– Amazon Elastic Compute Cloud (EC2)
– Tianhe-1A (GZ)
• Workloads: microbenchmarks, prime checker, matrix operations, graph algorithms, and k-means clustering
• Compared with Xen, VMware, Hadoop and X10
Implementation and evaluation
• Platform: CCMR (50 nodes, 1,300 cores, 1.5 TB RAM, 100 TB HD)
– Standard compute node: 2 CPUs, 32 GB RAM or more, 4 TB HD
– Thin compute node: 1 CPU, 16 GB RAM, 2 TB HD
– GPU server: NVIDIA Tesla S1070 (960 cores) and Fermi
• Interconnect
– High-bandwidth Ethernet and switches
– 10GbE twin-ax cabling
Designed in 2009 to anticipate and emulate "future" datacenter clusters.
k-means: The MapReduce Approach
• MapReduce – a parallel computing framework
• Map: process independent input data
– Calculate the distance between a data point and all centroids
– Output <n, v> (assign data point v to the nearest centroid n)
Example – input: [1, 2, 3, 4, 5]; centroids: 1, 4; m: the map function
m(1) -> <1,1>, m(2) -> <1,2>, m(3) -> <4,3>, m(4) -> <4,4>, m(5) -> <4,5>
=> [<1,1>, <1,2>, <4,3>, <4,4>, <4,5>]
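A minimal C sketch of this map step for the 1-D example, with the nearest centroid's value used as the intermediate key, as on the slide; the emit function stands in for the framework's intermediate output and is illustrative.

#include <math.h>
#include <stdio.h>

/* Emit an intermediate <centroid, data point> pair. Here we just print it;
 * a real framework would shuffle it to the reducer. */
static void emit(double key, double value)
{
    printf("<%g, %g>\n", key, value);
}

/* Map: assign one data point to its nearest centroid. */
static void kmeans_map(double point, const double *centroids, int k)
{
    int best = 0;
    for (int i = 1; i < k; i++)
        if (fabs(point - centroids[i]) < fabs(point - centroids[best]))
            best = i;
    emit(centroids[best], point);
}

int main(void)
{
    const double centroids[] = { 1.0, 4.0 };   /* as in the slide */
    const double points[]    = { 1, 2, 3, 4, 5 };
    for (int i = 0; i < 5; i++)
        kmeans_map(points[i], centroids, 2);   /* <1,1> <1,2> <4,3> <4,4> <4,5> */
    return 0;
}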
k-means: The MapReduce Approach
• Reduce (like fold): compute the new centroids
• Group, sum, then average
– Group the intermediate data by key (centroid)
– A combiner processes local partial sums for local groups
– Reduce obtains the global average for each group
Example – r: the combiner and reduce function
(fold r 0 '(<1,1>, <1,2>)): aggregated value 0 -> 1 -> 1.5, so the new centroid is 1.5
(fold r 0 '(<4,3>, <4,4>, <4,5>)): aggregated value 0 -> 3 -> 3.5 -> 4, so the new centroid is 4
k-means: The MapReduce Approach
• Start over (more iterations)
• Stop upon convergence or when the iteration limit is reached
[Figure: the map and reduce stages repeat in each iteration]
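Putting the pieces together, a small self-contained sketch of the whole iteration loop for the 1-D example: assign points to the nearest centroid, average each group, and stop on convergence or an iteration limit. It is sequential and only illustrates the control flow, not the vMR/Hadoop implementation.

#include <math.h>
#include <stdio.h>

#define N 5
#define K 2
#define MAX_ITER 100

int main(void)
{
    const double points[N] = { 1, 2, 3, 4, 5 };
    double centroids[K]    = { 1, 4 };          /* initial centroids */

    for (int iter = 0; iter < MAX_ITER; iter++) {
        double sum[K] = { 0 }, next[K];
        int    cnt[K] = { 0 };

        /* "Map": assign each point to its nearest centroid. */
        for (int i = 0; i < N; i++) {
            int best = 0;
            for (int c = 1; c < K; c++)
                if (fabs(points[i] - centroids[c]) <
                    fabs(points[i] - centroids[best]))
                    best = c;
            sum[best] += points[i];
            cnt[best] += 1;
        }

        /* "Reduce": new centroid = average of its group. */
        double moved = 0.0;
        for (int c = 0; c < K; c++) {
            next[c] = cnt[c] ? sum[c] / cnt[c] : centroids[c];
            moved  += fabs(next[c] - centroids[c]);
            centroids[c] = next[c];
        }

        if (moved < 1e-9)                        /* converged: stop */
            break;
    }

    printf("centroids: %g %g\n", centroids[0], centroids[1]); /* 1.5 4 */
    return 0;
}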
Performance comparison – k-means on 16 nodes
[Figure: execution time of k-means on 16 working nodes; L0 is 13X faster]
Scalability with data size
[Figure: execution time and throughput of k-means as the dataset grows – throughput increases with data size, while the same workload takes about half a day on Hadoop/X10; baselines include Phoenix [Ranger 2007][Yoo 2009] and CGL-MapReduce [Ekanayake 2008]]
General, scalable, efficient, portable, easy-to-program
K-means/vMR – Scale working nodes
[Figure: k-means/vMR performance as the number of working nodes grows]
K-means/vMR – Scale datasets
[Figure: k-means/vMR performance as the dataset size grows]
Run c0 programs on a supercomputer
Performance, scalability, portability
Implementation on Supercomputer Tianhe-1A
• 186,368 (or 202,752) cores
– 7,168 nodes, each comprising 2 Intel Xeon X5670 2.93 GHz 6-core "Westmere" processors and 1 NVIDIA Tesla M2010 (or M2050) 448-ALU 1150 MHz "Fermi" GPU
– An additional 2,048 Galaxy FT-1000 1 GHz 8-core processors
• 229 TB memory
Installation on the TH-1A/GZ
Supercomputer
Performance comparison – execution time
of k-means on vMR vs. Hadoop
Conclusion and futuristic notes
• Cloud computing and big data processing are a grand challenge to computer scientists, and will be disruptive.
• c0 is an approach to unifying computation on a cluster built with commodity hardware.
– The illusion of a "big machine" – "the datacenter as a computer"
– Scales to many compute nodes
• Future work
– Better compiler, framework, libraries and support
– To be open-sourced. Contributions welcome!
• Innovative computing organization
– Store-and-Compute
– RAM is storage: remove the gap between main memory and storage
– LAN is local: network bandwidth is comparable to that of an internal data bus!
• Optimize computing, not computers
Questions