Parallel Processing & Parallel Algorithm
May 8, 2003
B4 Yuuki Horita
Chapter 3
Principles of Parallel Programming
Principles of Parallel Programming
• Data Dependency
• Processor Communication
• Mapping
• Granularity
Data Dependency
• data flow dependency
• data anti-dependency
• data output dependency
• data input dependency
• data control dependency
data flow dependency & data anti-dependency
- data flow dependency (S2 reads A, which S1 writes)
  S1: A = B + C
  S2: D = A + E
- data anti-dependency (S2 writes B, which S1 reads)
  S1: A = B + C
  S2: B = D + E
data output dependency & data input dependency
- data output dependency (S1 and S2 both write A)
  S1: A = B + C
  S2: A = D + E
- data input dependency (S1 and S2 both read B)
  S1: A = B + C
  S2: D = B + E
data control dependency
  S1: A = B - C
      if ( A > 0 )
  S2:     then D = 1;
  S3:     else D = 2;
      endif
S1 ⇒ S2, S3 (whether S2 or S3 executes depends on the value computed in S1)
Dependency Graph
• G = ( V, E )
  V (Vertices): statements
  E (Edges): dependencies
ex) S1: A = B + C
    S2: B = A * 3
    S3: A = 2 * C
    S4: P = B ≥ 0
        if ( P is True )
    S5:     then D = 1
    S6:     else D = 2
        endif
S1 → S2 : flow dep., anti-dep.
S1 → S3 : output dep., input dep.
S2 → S3 : anti-dep.
S2 → S4 : flow dep.
S4 → S5, S6 : control dep.
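As a concrete illustration (not from the slides), a small C sketch that stores this dependency graph as an edge list; the struct and enum names are illustrative choices:

#include <stdio.h>

/* one entry per dependency edge of the graph above */
enum dep { FLOW, ANTI, OUTPUT, INPUT, CONTROL };
struct edge { int from, to; enum dep kind; };

static const char *dep_name[] = { "flow", "anti", "output", "input", "control" };

int main(void) {
    /* edges taken from the example: S1->S2, S1->S3, S2->S3, S2->S4, S4->S5, S4->S6 */
    struct edge g[] = {
        { 1, 2, FLOW }, { 1, 2, ANTI },
        { 1, 3, OUTPUT }, { 1, 3, INPUT },
        { 2, 3, ANTI },
        { 2, 4, FLOW },
        { 4, 5, CONTROL }, { 4, 6, CONTROL },
    };
    for (size_t i = 0; i < sizeof g / sizeof g[0]; i++)
        printf("S%d -> S%d : %s dep.\n", g[i].from, g[i].to, dep_name[g[i].kind]);
    return 0;
}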
Elimination of the dependencies
• data output dependency
• data anti-dependency
⇒ renaming can remove these forms of dependency
Elimination of the dependencies(2)
Ex)
  S1: A = B + C
  S2: B = D + E
  S3: A = F + G
  S1, S2 : anti-dep.
  S1, S3 : output dep.
Renaming:
  S1: A = B + C
  S2: B’ = D + E
  S3: A’ = F + G
No dependency!
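A minimal C sketch (not from the slides) of the same idea: after renaming, the three statements share no variables, so they may run in any order or in parallel, shown here with OpenMP sections (compile with an OpenMP flag such as -fopenmp); the input values are arbitrary:

#include <stdio.h>

int main(void) {
    int B = 1, C = 2, D = 3, E = 4, F = 5, G = 6;
    int A, B1, A1;                  /* B1 and A1 are the renamed variables */

    #pragma omp parallel sections
    {
        #pragma omp section
        { A = B + C; }              /* S1 */
        #pragma omp section
        { B1 = D + E; }             /* S2: writes B1 instead of B */
        #pragma omp section
        { A1 = F + G; }             /* S3: writes A1 instead of A */
    }
    printf("A=%d B'=%d A'=%d\n", A, B1, A1);
    return 0;
}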
Processor Communication
• Message passing communication
  - processors communicate via communication links
• Shared memory communication
  - processors communicate via a common memory
Message Passing System
[Figure: processing elements PE1 … PEm, each with a local memory M1 … Mm, connected by an interconnection network]
Message Passing System(2)
• Send and Receive operations
  - blocking
  - nonblocking
• System
  - synchronous : blocking send and blocking receive operations
  - asynchronous : nonblocking send and blocking receive operations
    ( the messages are buffered )
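A minimal sketch, assuming an MPI installation and, as in the asynchronous system above, a nonblocking send paired with a blocking receive: MPI_Isend/MPI_Wait on the sending side, MPI_Recv on the receiving side. Tag 0 and the single-int message are illustrative; run with two processes, e.g. mpirun -np 2:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Request req;
        value = 42;
        /* nonblocking send: returns immediately, the message is buffered */
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... the sender could keep computing here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* wait until the buffer may be reused */
    } else if (rank == 1) {
        /* blocking receive: waits until the message has arrived */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}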
Shared Memory System
[Figure: global memory modules M1 … Mm connected through an interconnection network to processing elements PE1 … PEm]
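A minimal C sketch (not from the slides) of shared-memory communication: two threads exchange a value through a variable in the common address space, protected by a POSIX mutex:

#include <pthread.h>
#include <stdio.h>

static int shared = 0;                          /* lives in the common memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    shared = 42;                                /* "send" by writing shared memory */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(t, NULL);

    pthread_mutex_lock(&lock);
    printf("read %d from shared memory\n", shared);  /* "receive" by reading it */
    pthread_mutex_unlock(&lock);
    return 0;
}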
Mapping
… matching parallel algorithms to parallel architectures
mapping to:
- asynchronous architecture
- synchronous architecture
- distributed architecture
Mapping to Asynchronous Architecture
Mapping a program to an asynchronous shared-memory computer has the following steps:
1. Allocate each statement in the program to a processor
2. Allocate each variable to a memory
3. Specify the control flow for each processor

The sender may send many messages without the receiver removing them from the channel
⇒ the need for buffering messages (see the sketch below)
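A minimal C sketch (not from the slides) of such buffering: a fixed-size channel lets the sender queue several messages before the receiver removes them; the capacity of 8 is an arbitrary choice:

#include <pthread.h>
#include <stdio.h>

#define CAP 8

static int buf[CAP];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void channel_send(int msg) {          /* only blocks when the buffer is full */
    pthread_mutex_lock(&m);
    while (count == CAP)
        pthread_cond_wait(&not_full, &m);
    buf[tail] = msg;
    tail = (tail + 1) % CAP;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&m);
}

static int channel_recv(void) {              /* blocking receive */
    pthread_mutex_lock(&m);
    while (count == 0)
        pthread_cond_wait(&not_empty, &m);
    int msg = buf[head];
    head = (head + 1) % CAP;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&m);
    return msg;
}

static void *sender(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++)
        channel_send(i);                     /* may queue several messages ahead of the receiver */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, sender, NULL);
    for (int i = 0; i < 5; i++)
        printf("received %d\n", channel_recv());
    pthread_join(t, NULL);
    return 0;
}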
Mapping to Synchronous Architecture
• Mapping a program is the same as for the asynchronous architecture
• a common clock is used for synchronization purposes
• each processor executes an instruction at each step ( at each clock tick )
• only one message exists at any one time on the channel
⇒ no buffering is needed
Mapping to Distributed Architecture
• a local memory is accessible only by the owning processor
• a channel connects only a pair of processors
• Mapping a program is the same as in the asynchronous shared-memory architecture, except that each variable is allocated either to a processor or to a channel
Granularity
relates to the ratio of the amount of computation to the
amount of communication
• fine : at statement level
• medium : at procedure level
• coarse : at program level
Program Level Parallelism
• a program creates a new process by creating a complete copy of itself
  - fork() (UNIX)
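A minimal C sketch (not from the slides) of program-level parallelism on UNIX: fork() duplicates the calling process, and parent and child then run as independent programs (error handling omitted):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();              /* create a complete copy of this process */
    if (pid == 0) {
        printf("child:  pid %d\n", (int)getpid());
    } else {
        printf("parent: pid %d, child %d\n", (int)getpid(), (int)pid);
        wait(NULL);                  /* wait for the child to finish */
    }
    return 0;
}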
Statement Level Parallelism
• Parbegin-Parend block
    Par-begin
        Statement1
        Statement2
        :
        Statementn
    Par-end
⇒ the statements Statement1 … Statementn are executed in parallel
Statement Level Parallelism(2)
ex) (a + b) * (c + d) - (e / f)
Par-begin
    Begin
        Par-begin
            t1 = a + b
            t2 = c + d
        Par-end
        t4 = t1 * t2
    End
    t3 = e / f
Par-end
t5 = t4 - t3
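A minimal C sketch (not from the slides) of the Parbegin-Parend structure above, using OpenMP sections as a stand-in; nested parallelism must be enabled for the inner block, and the values of a…f are arbitrary:

#include <omp.h>
#include <stdio.h>

int main(void) {
    double a = 1, b = 2, c = 3, d = 4, e = 8, f = 2;
    double t1, t2, t3, t4, t5;

    omp_set_nested(1);                /* allow the inner parallel region */
    #pragma omp parallel sections
    {
        #pragma omp section
        {                             /* branch 1: (a + b) * (c + d) */
            #pragma omp parallel sections
            {
                #pragma omp section
                t1 = a + b;
                #pragma omp section
                t2 = c + d;
            }
            t4 = t1 * t2;             /* runs after the inner Par-end */
        }
        #pragma omp section
        t3 = e / f;                   /* branch 2: e / f */
    }
    t5 = t4 - t3;                     /* sequential, after the outer Par-end */
    printf("t5 = %g\n", t5);
    return 0;
}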
Statement Level Parallelism(3)
• Fork, Join, Quit
  - Fork x
    causes a new process to be created that starts executing at the instruction labeled x
  - Join t, y
    t = t - 1
    if ( t = 0 ) then go to y
  - Quit
    the process terminates
Statement Level Parallelism(4)
ex) (a + b) * (c + d) – (e / f )
P1:  n = 2
     m = 2
     Fork P2
     Fork P3
     t1 = a + b;  Join m, P4;  Quit;
P2:  t2 = c + d;  Join m, P4;  Quit;
P4:  t4 = t1 * t2;  Join n, P5;  Quit;
P3:  t3 = e / f;  Join n, P5;  Quit;
P5:  t5 = t4 - t3
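A minimal C sketch (not from the slides) that emulates this Fork/Join/Quit program with POSIX threads: the Join counters n and m become atomic counters, the process that decrements a counter to zero continues at the labeled statement, and the others simply Quit (return); input values are arbitrary:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static double a = 1, b = 2, c = 3, d = 4, e = 8, f = 2;
static double t1, t2, t3, t4, t5;
static atomic_int n = 2, m = 2;             /* the Join counters */

static void P5(void) { t5 = t4 - t3; printf("t5 = %g\n", t5); }

static void P4(void) {
    t4 = t1 * t2;
    if (atomic_fetch_sub(&n, 1) == 1) P5(); /* Join n, P5 */
}                                           /* otherwise Quit */

static void *P2(void *arg) {
    (void)arg;
    t2 = c + d;
    if (atomic_fetch_sub(&m, 1) == 1) P4(); /* Join m, P4 */
    return NULL;                            /* Quit */
}

static void *P3(void *arg) {
    (void)arg;
    t3 = e / f;
    if (atomic_fetch_sub(&n, 1) == 1) P5(); /* Join n, P5 */
    return NULL;                            /* Quit */
}

int main(void) {                            /* plays the role of P1 */
    pthread_t p2, p3;
    pthread_create(&p2, NULL, P2, NULL);    /* Fork P2 */
    pthread_create(&p3, NULL, P3, NULL);    /* Fork P3 */
    t1 = a + b;
    if (atomic_fetch_sub(&m, 1) == 1) P4(); /* Join m, P4 */
    pthread_join(p2, NULL);
    pthread_join(p3, NULL);
    return 0;
}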