Petascale System Interconnect Technology Development Project

PSI Framework: A Dynamic Optimization of MPI

Hyacinthe NZIGOU M., Vivien Oddou and Feng Long GU

February 19, 2008
Presentation Flow

- Background
- Collective Communications Algorithms
- Motivation and Goal
- Current System Limitation
- PSI Framework Flowchart
- Experimental Results
  - MPI_Allgather
  - MPI_Alltoall
- Conclusions and Future Work
Background: Collective Communications

Collective communication algorithms:

- Broadcast: Sequential, Chain, Binary, Binomial, ...
- Scatter: Sequential, Chain, Binary, ...
- Gather: Sequential, Chain, Binary, ...
- Reduce: Gather followed by operation, Chain, Binary, Binomial, Rabenseifner, ...
- Allreduce: Reduce followed by broadcast, Allgather followed by operation, Chain, Binary, Binomial, Rabenseifner, ...
- Allgather: Gather + broadcast, Circular, Recursive doubling, ...
- Barrier: Extended ring, Distributed binomial, Tournament, ...
- Alltoall: Simple spread, Ring, Bruck, Pair-wise, ...
Background: Algorithm performances

[Figure: Alltoall algorithm performance on 32 processes — time (microseconds, log scale) vs. message size (1 B to 8 KB) for Simple Spread, Ring, Bruck, Recursive Doubling, and the original implementation.]
Background: Algorithm performances

[Figure: Alltoall algorithm performance on 32 processes for larger messages — time (microseconds, log scale) vs. message size (64 B up to ~512 KB) for Simple Spread, Ring, Bruck, "Pair light ba.", and the original implementation.]
Motivation and Objective

Find techniques that dynamically select the best-performing algorithm for a given situation, depending on:

- Network topology and parameters (latency, bandwidth, etc.)
- Message size
- Load imbalance

Develop collective communication routines that can adapt to the system/application configuration.
Current system limitation

In most static MPI implementations (MPICH, Fujitsu MPI, etc.), adaptability is based only on:

- Message size
- Number of processors involved in the communication

Such implementations have difficulty achieving high performance when the system behaves unpredictably.
System Design

Layered architecture (top to bottom):

- User Program
- PSI Framework
- MPI Library (MPI_Send, MPI_Recv)
- Parallel machine
Framework Implementation

User application program (Code.c):

    #include "Psi_mpi.h"
    ...
    PSI_MPI_Alltoall(... args ...);
    ...

Framework (Psi.c):

    #include "mpi.h"
    ...
    MPI_Alltoall(... args ...);
    ...

Message-size range boundaries: 0 bytes / 4 KB / 2 MB.
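The per-size-range dispatch above can be illustrated in C. This is a minimal sketch: the 4 KB and 2 MB boundaries are read off the slide, but the function name `size_class` and the three-way split are assumptions, not the actual PSI source.

```c
#include <stddef.h>

/* Map an Alltoall message size to one of the three size classes the
 * slide marks (0 B .. 4 KB .. 2 MB), so the framework can learn a
 * separate best algorithm per class.  Boundary placement is assumed. */
static int size_class(size_t bytes)
{
    if (bytes < 4 * 1024)
        return 0;                   /* small messages  */
    if (bytes < 2 * 1024 * 1024)
        return 1;                   /* medium messages */
    return 2;                       /* large messages  */
}
```

A wrapper such as PSI_MPI_Alltoall could then index a per-class table of elected algorithms before forwarding to the MPI library.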
PSI Framework Flowchart

1. Start program, initializations.
2. On each Alltoall(... args ...) / Allgather(... args ...) call:
3. Learn: t1 ← Algo1 × 5; t2 ← Algo2 × 5; ...
4. Compare: tmin = min(t1, t2, t3, ...)
5. ti = tmin? If yes, select Algo i; if no, continue learning.
6. Probing: Alltoall(... args ...) → Algo i
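The Learn/Compare/Select steps can be sketched as a small C routine. This is illustrative only: the trial count of 5 comes from the flowchart, while the names and the array layout are assumptions.

```c
#define NUM_ALGS 4   /* e.g. Bruck, Ring, Simple Spread, Pairwise */
#define TRIALS   5   /* each algorithm is timed 5 times (flowchart) */

/* Learning-phase result: t[a][k] is the k-th measured time of
 * algorithm a.  The election returns the index with the smallest
 * average time, i.e. the "ti = tmin" test in the flowchart. */
static int elect(const double t[NUM_ALGS][TRIALS])
{
    int best = 0;
    double best_avg = -1.0;
    for (int a = 0; a < NUM_ALGS; ++a) {
        double avg = 0.0;
        for (int k = 0; k < TRIALS; ++k)
            avg += t[a][k] / TRIALS;
        if (best_avg < 0.0 || avg < best_avg) {
            best_avg = avg;
            best = a;
        }
    }
    return best;   /* index of the elected algorithm */
}
```

In the real framework the timings come from the application's own collective calls, so learning adds no extra benchmark traffic.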
Example of Learning and Probing

User program:

    main()
    {
        ...
        init();
        ...
        for (i = 0; i < 150; ++i)
        {
            ...
            MPI_Alltoall();
            ...
        }
    }

Timeline of the 150 MPI_Alltoall() calls:

- Calls 0-3:   Bruck          (avg 2 s)    <- learning phase
- Calls 4-7:   Ring           (avg 1 s)
- Calls 8-11:  Simple Spread  (avg 0.5 s)
- Calls 12-15: Pairwise       (avg 1 s)
- Election: Simple Spread (lowest average time)
- Calls 16-149: Simple Spread              <- probing phase
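The timeline above (four trial calls per candidate, then the elected winner for every remaining call) can be sketched as a dispatch rule. The block-of-four schedule matches this slide; the real framework's interleaving may differ.

```c
#define CALLS_PER_ALG 4    /* calls 0-15: 4 calls per candidate     */
#define N_CANDIDATES  4    /* Bruck, Ring, Simple Spread, Pairwise  */
#define LEARN_CALLS   (CALLS_PER_ALG * N_CANDIDATES)

/* Which algorithm index serves the i-th MPI_Alltoall() call:
 * during learning, cycle through the candidates in blocks of
 * CALLS_PER_ALG; afterwards (probing phase), always return the
 * elected winner. */
static int algorithm_for_call(int i, int elected)
{
    if (i < LEARN_CALLS)
        return i / CALLS_PER_ALG;
    return elected;
}
```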
Experimental System Configuration

- Configuration: Riken PC Cluster
- Number of nodes: 128
- CPU: Intel Xeon 3.06 GHz (2 CPUs/node)
- RAM: 4 GB/node
- OS: Red Hat Enterprise Linux
- Compiler: Fujitsu C compiler 5.0
- Interconnect: InfiniBand
- MPI software: Fujitsu MPI
Experimental Results

[Figure: MPI_Allgather on psiocta, 8x1 — wall time (seconds) vs. message size (0 to 8e7 bytes) for psi-frame, built-in MPI, Pairwise, Gpairwise, and Ring.]

The framework always chooses the best-performing algorithm, which in this case is the Ring algorithm.
Experimental Results (Load Balance)

[Figure: Alltoall, 32x1 processes — time (ms) for Original vs. Framework, at message sizes of 256 and 512 bytes (left panel) and 32 and 64 bytes (right panel).]
Experimental Results (Load Imbalance)

[Figure: Performance evaluation under load imbalance — time (ms) for Original vs. Framework at message sizes of 32 and 64 bytes, shown against rank order (0 to 30) and the computation/communication ratio (0 to 4.5).]
Conclusions and Future Work

The experimental results of our framework show encouraging benefits:

- It can adapt to different platform systems.
- It always selects the most suitable algorithm under different load conditions.
- However, for a given situation, the slow algorithms tried during the learning phase degrade the overall performance.

Future work:

- Code the dynamic algorithm selection based on performance models.
- Implement algorithm grouping.
Conclusions and Future Work (cont.)

Model-based selection flowchart:

1. Start program, initializations; measure the P-LogP parameters os(m), or(m), g(m), L.
2. On each Alltoall(... args ...) call, predict each algorithm's time from the model, e.g. t1 ← P_ring = (P-1)(L + 2 g(m)); t2 ← Algo 2; ...; ti ← Algo i.
3. ti ← min(t1, t2, t3, ...); tj ← ti + 10% · ti.
4. ti ≤ tj? If yes, select Algo i; if no, fall back to Learn: ti ← Algo i × 5; tj ← Algo j × 5; ...
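The model-based branch can be made concrete with the ring formula the flowchart gives, t_ring = (P-1)(L + 2 g(m)), plus its 10% margin. This is a sketch: the parameter values would have to be measured per system, and reading the margin as an acceptance test between two predicted times is my interpretation of the flowchart.

```c
/* P-LogP prediction for the ring Alltoall, as on the slide:
 * t_ring = (P - 1) * (L + 2 * g(m)), where L is the network latency
 * and g(m) the per-message gap at message size m. */
static double ring_time(int nprocs, double L, double g_m)
{
    return (nprocs - 1) * (L + 2.0 * g_m);
}

/* 10% margin from the flowchart (tj = ti + 10% * ti): accept the
 * predicted candidate ti only if it does not exceed the reference
 * time tj_base by more than 10%. */
static int within_margin(double ti, double tj_base)
{
    return ti <= tj_base + 0.10 * tj_base;
}
```

When the prediction fails this test, the flowchart falls back to the empirical learning phase described earlier.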
Effect of Node Load Imbalance

Thank you for your attention.