Intensive Lectures: Computational Complexity Theory and Parallel/Distributed Algorithms

Lectures on Parallel and Distributed Computing
Dr. Wei Chen (陈慰), Professor
Tennessee State University
At the College of Information Engineering, Shanghai Maritime University, August 2013
Outline
Lecture 1: Introduction to parallel computing
Lecture 2: Parallel computational models
Lecture 3: Parallel algorithm design and analysis
Lecture 4: Distributed-memory programming with PVM/MPI
Lecture 5: Shared-memory programming with OpenMP
Lecture 6: Shared-memory programming with GPU
Lecture 7: Introduction to distributed systems
Lecture 8: Synchronous network algorithms
Lecture 9: Asynchronous shared-memory/network algorithms
References:
(1) Lectures 1, 2 & 3: Joseph JaJa, “An Introduction to Parallel Algorithms,” Addison-Wesley, 1992.
(2) Lectures 4, 5 & 6: Peter S. Pacheco, “An Introduction to Parallel Programming,” Morgan Kaufmann Publishers, 2011.
(3) Lectures 7 & 8: Nancy A. Lynch, “Distributed Algorithms,” Morgan Kaufmann Publishers, 1996.
Lecture 1: Introduction to Parallel Computing
Why Parallel Computing
Problems with large computational complexity
• Computationally hard problems (NP-complete problems) require exponential computing time.
Problems with large input size
• Quantum chemistry, statistical mechanics, relativistic physics, astrophysics, fluid mechanics, biology, genetic engineering, …
For example, it takes 1 hour on a current computer to simulate 1 second of a reaction between a protein molecule and water molecules; simulating 1 hour of the reaction would take 1.14 × 10^8 years.
Why Parallel Computing
Physical limitation of CPU computational power
Over the past 50 years, CPU speed has doubled roughly every 2.5 years. But there is a physical limitation: the speed of light is 3 × 10^8 m/sec. Therefore, the CPU clock rate is expected to be limited to about 10 GHz.
To solve computationally hard problems:
• Parallel processing
• DNA computers
• Quantum computers
…
What is parallel computing
• Using a number of processors to process one task
• Speeding up the processing by distributing it among the processors
[Figure: one problem divided among multiple processors]
Classification of parallel computers
Two kinds of classification
1. Flynn’s Classification
• SISD (Single Instruction stream, Single Data stream)
• MISD (Multiple Instruction stream, Single Data stream)
• SIMD (Single Instruction stream, Multiple Data stream)
• MIMD (Multiple Instruction stream, Multiple Data stream)
2. Classification by memory status
• Shared memory
• Distributed memory
Flynn’s classification (1)
SISD (Single Instruction, Single Data) computer
• von Neumann’s single-processor computer
[Diagram: a Control unit sends an instruction stream to the Processor, which exchanges a data stream with Memory]
Flynn’s classification (2)
MISD (Multiple Instruction, Single Data) computer
All processors share a common memory, have their own control units, and execute their own instructions on the same data.
[Diagram: several Control/Processor pairs, each with its own instruction stream, all operating on the single data stream of a shared Memory]
Flynn’s classification (3)
SIMD (Single Instruction, Multiple Data) computer
• Processors execute the same instruction on different data.
• Processor operations are synchronized by a global clock.
[Diagram: one Control unit broadcasts the instruction stream to all Processors, each exchanging its own data stream with a Shared Memory or an Interconnection Network]
Flynn’s classification (4)
MIMD (Multiple Instruction, Multiple Data) computer
• Processors have their own control units and execute different instructions on different data.
• Processor operations are executed asynchronously most of the time.
• It is also called a distributed computing system.
[Diagram: several Control/Processor pairs, each with its own instruction and data streams, connected to a Shared Memory or an Interconnection Network]
Classification by memory types (1)
1. Parallel computers with a shared common memory
• Communication based on shared memory
For example, suppose processor i sends some data to processor j. First, processor i writes the data to the shared memory; then processor j reads the data from the same address of the shared memory.
[Diagram: several Processors connected to a Shared Memory]
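As a concrete illustration (a sketch, not from the slides), the write-then-read exchange above can be modeled with Python threads sharing one address space; the dictionary, the "address" key, and the Event used for synchronization are illustrative assumptions:

```python
import threading

# Sketch of shared-memory communication between two "processors" (threads).
# shared_cell stands in for the shared memory; the Event provides the
# synchronization needed so the reader never reads before the writer writes.
shared_cell = {}
data_ready = threading.Event()
result = []

def processor_i():
    shared_cell["addr0"] = 42  # write data to an agreed address
    data_ready.set()           # signal that the data is available

def processor_j():
    data_ready.wait()                    # block until processor i has written
    result.append(shared_cell["addr0"])  # read from the same address

tj = threading.Thread(target=processor_j)
ti = threading.Thread(target=processor_i)
tj.start(); ti.start()
ti.join(); tj.join()
print(result)  # [42]
```

Without the Event (the "exclusive control" discussed on the next slide), processor j could read the cell before processor i writes it.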
Classification by memory types (2)
Features of parallel computers with a shared common memory
• Programming is easy.
• Exclusive control is necessary for access to the same memory cell.
• Realization is difficult when the number of processors is large.
(Reason) The number of processors connected to the shared memory is limited by physical factors such as the size and voltage of the units, and the latency of memory access.
Classification by memory types (3)
2. Parallel computers with distributed memory
Communication is one-to-one, based on an interconnection network.
For example, suppose processor i sends data to processor j. First, processor i issues a send command such as “send xxx to processor j”; then processor j obtains the data with a receive command.
[Diagram: several Processors, each with its own Local Memory, connected by an Interconnection Network]
Classification by memory types (4)
Features of parallel computers with distributed memory
• There are various interconnection-network architectures (generally, the degree of connectivity is not large).
• Programming is more difficult than with shared memory, since communication is one-to-one.
• It is easy to increase the number of processors.
Types of parallel computers with distributed memory
Complete connection type
Any two processors are connected.
Features
• Strong communication ability, but not practical (each processor has to be connected to many other processors).
[Figure: complete graph on five processors P1-P5]
Mesh connection type
Processors are connected as a two-dimensional lattice.
Features
• Each processor connects to only a few others, so it is easy to increase the number of processors.
• Large distance between processors: √n.
• Processor communication bottlenecks exist.
[Figure: 4×4 mesh of processors P0,0 through P3,3]
Types of parallel computers with distributed memory
Hypercube connection type
Processors are connected as a hypercube: each processor is labeled with a binary number, and two processors are connected if and only if their labels differ in exactly one bit.
Features
• Small distance between processors: log n.
• Balanced communication load because of its symmetric structure.
• Easy to increase the number of processors.
[Figure: 4-dimensional hypercube on processors 0000 through 1111]
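The labeling rule can be checked in a few lines of Python (a sketch; the function names are ours): neighbors differ in exactly one bit, and the routing distance between two processors is the Hamming distance of their labels, so the diameter of a 2^d-processor hypercube is d = log n.

```python
# Sketch of the hypercube connection rule: processors are connected
# iff their binary labels differ in exactly one bit.
def neighbors(label: int, dim: int) -> list[int]:
    """Neighbors of a processor in a dim-dimensional hypercube."""
    return [label ^ (1 << b) for b in range(dim)]

def distance(u: int, v: int) -> int:
    """Routing distance = number of differing bits (Hamming distance)."""
    return bin(u ^ v).count("1")

dim = 4  # 16 processors, labels 0000 .. 1111
print([f"{m:04b}" for m in neighbors(0b0000, dim)])  # ['0001', '0010', '0100', '1000']
print(distance(0b0000, 0b1111))  # 4 = log2(16), the diameter
```

Each processor has exactly log n neighbors, which is why the degree stays small while the diameter remains log n.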
Types of parallel computers with distributed memory
Other connection types
Tree connection type, butterfly connection type, bus connection type.
Criteria for selecting an interconnection network
• Small diameter (the largest distance between processors), for small communication delay.
• Symmetric structure, for easily increasing the number of processors.
The choice of interconnection network depends on the application, the ability of the processors, the upper bound on the number of processors, and other factors.
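To make the diameter criterion concrete, the topologies above can be compared numerically (a sketch under the assumption that n is both a perfect square and a power of two, e.g. n = 16, so all three topologies are defined):

```python
import math

# Diameters of the interconnection topologies discussed on the
# previous slides, for n processors.
def diameters(n: int) -> dict[str, int]:
    side = int(math.isqrt(n))
    assert side * side == n and (n & (n - 1)) == 0, "need a square power of two"
    return {
        "complete": 1,                    # every pair directly connected
        "mesh": 2 * (side - 1),           # corner to corner in a side x side lattice
        "hypercube": n.bit_length() - 1,  # log2(n) bit flips
    }

print(diameters(16))  # {'complete': 1, 'mesh': 6, 'hypercube': 4}
```

As n grows, the mesh diameter grows like √n while the hypercube diameter grows only like log n, which is the trade-off the criterion captures.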
Real parallel processing system (1)
Early parallel computer (ILLIAC IV)
• Built in 1972
• SIMD type with distributed memory, consisting of 64 processors
• A modified mesh connection type, equipped with a common data bus, a common control bus, and one control unit.
[Figure: 4×4 array of processors P0,0 through P3,3]
Real parallel processing system (2)
Parallel computers of the 1990s
• Shared common memory type
Workstations, SGI Origin2000, and others with 2-8 processors.
• Distributed memory type

Name      Maker       Processor num   Processing type   Network type
CM-2      TM          65536           SIMD              hypercube
CM-5      TM          1024            MIMD              fat tree
nCUBE2    NCUBE       8192            MIMD              hypercube
iWarp     CMU, Intel  64              MIMD              2D torus
Paragon   Intel       4096            MIMD              2D torus
SP-2      IBM         512             MIMD              HP switch
AP1000    Fujitsu     1024            MIMD              2D torus
SR2201    Hitachi     1024            MIMD              crossbar
Real parallel processing system (3)
Deep Blue
• Developed by IBM for playing chess only
• Defeated the world chess champion
• Based on the general-purpose parallel computer SP-2
[Diagram: SP-2 interconnection network (generalized hypercube) linking 32 RS/6000 nodes; each node has memory, a bus interface, and 8 Deep Blue chess chips on its Microchannel bus]
K (京) Computer - Fujitsu
Architecture: 88,128 SPARC64 VIIIfx 2.0 GHz 8-core processors; 864 cabinets, each with 96 computing nodes and 6 I/O nodes; 6-dimensional Tofu torus interconnect; Linux-based enhanced operating system; open-source Open MPI library; 12.6 MW; 10.51 petaflops; ranked #1 in 2011.
Titan – Oak Ridge National Laboratory
Architecture:
18,688 AMD Opteron 6274 16-core CPUs; Cray Linux; 8.2 MW; 17.59 petaflops; GPU-based; torus topology; ranked #1 in 2012.
Tianhe-1A (天河一号) – National Supercomputing Center, Tianjin
Architecture:
14,336 Xeon X5670 processors, 7,168 Nvidia Tesla M2050 GPUs, 2,048 FeiTeng 1000 SPARC-based processors, 4.7 petaflops; 112 computer cabinets, 8 I/O cabinets; 11-D hypercube topology with IB QDR/DDR; Linux; ranked #2 in 2011.
Exercises
Compare the top three supercomputers in the world.