PSI Framework: A Dynamic Optimization of MPI
Hyacinthe NZIGOU M., Vivien Oddou and Feng Long GU
February 19, 2008

Slide 2: Presentation Flow
- Background: collective communication algorithms
- Motivation and goal
- Current system limitation
- PSI Framework flowchart
- Experimental results: MPI_Allgather, MPI_Alltoall
- Conclusions and future work

Slide 3: Background: Collective Communications

  Collective    Algorithms
  Broadcast     Sequential, Chain, Binary, Binomial, ...
  Scatter       Sequential, Chain, Binary, ...
  Gather        Sequential, Chain, Binary, ...
  Reduce        Gather followed by operation, Chain, Binary, Binomial, Rabenseifner, ...
  Allreduce     Reduce followed by broadcast, Allgather followed by operation, Chain, Binary, Binomial, Rabenseifner, ...
  Allgather     Gather + broadcast, Circular, Recursive doubling, ...
  Barrier       Extended ring, Distributed binomial, Tournament, ...
  Alltoall      Simple spread, Ring, Bruck, Pair-wise, ...

Slide 4: Background: Algorithm Performances
[Figure: Alltoall algorithms, 32 procs. Time in microseconds (log scale, 100 to 100,000) vs. message size from 1 B to 8 KB. Curves: S. Spread, Ring, Bruck, Recurs. Db, Original.]

Slide 5: Background: Algorithm Performances
[Figure: Alltoall algorithms, 32 procs. Time in microseconds (log scale, 100 to 1,000,000) vs. message size from 64 B to 512 KB. Curves: S. Spread, Ring, Bruck, Pair light ba., Original.]

Slide 6: Motivation and Objective
- Find techniques that dynamically select the best-performing algorithm for a given situation:
  - network topology and parameters (latency, bandwidth, etc.)
  - message size
  - load imbalance
- Develop collective communication routines that can adapt to the system and application configuration.

Slide 7: Current System Limitation
- In most static MPI implementations (MPICH, Fujitsu MPI, etc.), adaptability is based only on:
  - message size
  - number of processes involved in the communication
- Such implementations have difficulty sustaining high performance when the system behaves unpredictably.

Slide 8: System Design
Layering: User Program -> PSI Framework -> MPI Library (MPI_Send, MPI_Recv) -> Parallel machine

Slide 9: Framework Implementation
The user application program swaps each MPI call for its PSI counterpart. The slide also marks message-size boundaries at 0 B, 4 KB and 2 MB.

  Code.c (plain MPI):              Psi.c (with the framework):
    #include "mpi.h"                 #include "Psi_mpi.h"
    ...                              ...
    MPI_Alltoall(... args ...);      PSI_MPI_Alltoall(... args ...);
    ...                              ...

Slide 10: PSI Framework Flowchart
1. Start program, initializations.
2. On calls to Alltoall(... args ...) / Allgather(... args ...):
   - Learn: time Algo 1 over 5 calls (t1), Algo 2 over 5 calls (t2), ...
   - Compare: tmin = min(t1, t2, t3, ...)
   - If ti = tmin, select Algo i.
3. Probing: every later Alltoall(... args ...) runs Algo i.
(See the sketch below.)
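To make the flowchart concrete, here is a minimal C sketch of how a wrapper in the spirit of PSI_MPI_Alltoall could implement the learn/compare/select cycle. The candidate table psi_algos, the constants NUM_ALGOS and LEARN_CALLS, and all the bookkeeping are illustrative assumptions, not the framework's actual code; only the MPI_Alltoall signature, MPI_Barrier and MPI_Wtime are standard MPI.

```c
#include <mpi.h>
#include <float.h>

#define NUM_ALGOS   4   /* e.g. Bruck, Ring, Simple spread, Pairwise  */
#define LEARN_CALLS 5   /* timed calls per algorithm during learning  */

/* Hypothetical table of candidate Alltoall implementations, each with
 * the standard MPI_Alltoall signature. */
typedef int (*alltoall_fn)(const void *, int, MPI_Datatype,
                           void *, int, MPI_Datatype, MPI_Comm);
extern alltoall_fn psi_algos[NUM_ALGOS];

static int    selected = -1;        /* winner of the election, -1 = none yet */
static int    calls;                /* invocations served so far             */
static double total[NUM_ALGOS];     /* accumulated time per algorithm        */

int PSI_MPI_Alltoall(const void *sbuf, int scount, MPI_Datatype stype,
                     void *rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)
{
    if (selected >= 0)              /* probing phase: reuse the winner */
        return psi_algos[selected](sbuf, scount, stype,
                                   rbuf, rcount, rtype, comm);

    /* Learning phase: the first NUM_ALGOS * LEARN_CALLS invocations are
     * served by the candidates in turn, and each one is timed. */
    int algo = calls / LEARN_CALLS;
    MPI_Barrier(comm);              /* align the ranks before timing */
    double t0 = MPI_Wtime();
    int rc = psi_algos[algo](sbuf, scount, stype,
                             rbuf, rcount, rtype, comm);
    total[algo] += MPI_Wtime() - t0;

    if (++calls == NUM_ALGOS * LEARN_CALLS) {   /* election */
        double best = DBL_MAX;
        for (int i = 0; i < NUM_ALGOS; ++i)
            if (total[i] < best) { best = total[i]; selected = i; }
    }
    return rc;
}
```

A real implementation would also have to make all ranks agree on the winner (for instance by allreducing the measured times) and, as the 0 B / 4 KB / 2 MB markers on slide 9 suggest, keep separate statistics per message-size range.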
Slide 11: Example of Learning and Probing

User program:

    main() {
        ...
        init();
        ...
        for (i = 0; i < 150; ++i) {
            ...
            MPI_Alltoall();
            ...
        }
    }

How the 150 MPI_Alltoall() calls are served:
- Learning phase, calls 0-15: calls 0-3 run Bruck (avg 2 s), calls 4-7 run Ring (avg 1 s), calls 8-11 run Simple spread (avg 0.5 s), calls 12-15 run Pairwise (avg 1 s).
- Election: Simple spread has the lowest average and is selected.
- Probing phase, calls 16-149: every call runs Simple spread.

Slide 12: Experimental System Configuration

  System            RIKEN PC Cluster
  Number of nodes   128
  CPU               Intel Xeon 3.06 GHz (2 CPUs/node)
  RAM               4 GB/node
  OS                Red Hat Enterprise Linux
  Compiler          Fujitsu C Compiler 5.0
  Interconnect      InfiniBand
  MPI software      Fujitsu MPI

Slide 13: Experimental Results
[Figure: MPI_Allgather on psiocta, 8x1. Wall time in seconds (0 to 1200) vs. message size (0 to 8e7 bytes). Curves: psi-frame, built-in MPI, Pairwise, Gpairwise, Ring.]
The framework always chooses the best-performing algorithm, in this case the Ring algorithm.

Slide 14: Experimental Results (Load Balance)
[Figure, left: Original vs. Framework, Alltoall 32x1 procs, message sizes 256 and 512 bytes, time 0-800 ms. Figure, right: same comparison for message sizes 32 and 64 bytes, time 0-450 ms.]

Slide 15: Experimental Results (Load Imbalance)
[Figure: Original vs. Framework performance evaluation, message sizes 32 and 64 bytes, time around 257,000-257,450 ms. Second panel: computation/communication ratio (0-4.5) by rank order (0-30).]

Slide 16: Conclusions and Future Work
- The experimental results of our framework show encouraging benefits:
  - it adapts to different platforms;
  - it always selects the most suitable algorithm under different load conditions.
- However, for a given situation, the slow algorithms exercised during the learning phase degrade the overall performance.
- Future work:
  - code dynamic algorithm selection based on performance models;
  - implement algorithm grouping.

Slide 17: Conclusions and Future Work (cont.)
Model-based selection flowchart:
1. Start program, initializations; measure the P-LogP parameters os(m), or(m), g(m) and L.
2. On Alltoall(... args ...), predict each algorithm's completion time from the model, e.g. for the ring algorithm t1 = (P-1)(L + 2 g(m)); likewise t2 for Algo 2, ..., ti for Algo i.
3. Let ti = min(t1, t2, t3, ...) and tj the next-best prediction, with threshold ti + 10% * ti.
4. If no other prediction falls within 10% of ti, select Algo i directly; otherwise fall back to learning (time Algo i over 5 calls, Algo j over 5 calls, ...).
(Sketched below.)
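As a sketch of how this model-driven path could avoid the learning overhead: with measured P-LogP parameters, predicted times such as the slide's t_ring = (P-1)(L + 2 g(m)) can be compared directly, and measurement is only needed when two predictions land inside the 10% band. Apart from that ring formula, everything below (the struct, the function names, the tie-breaking logic) is an illustrative assumption.

```c
#include <assert.h>

/* P-LogP parameters as named on the slide; g is evaluated at the current
 * message size m, i.e. g = g(m), and all four are assumed measured at
 * startup. */
typedef struct { double L, os, or_, g; } plogp_t;

/* Predicted ring Alltoall time from the slide: t_ring = (P-1)(L + 2 g(m)). */
static double t_ring(int P, const plogp_t *p) {
    return (P - 1) * (p->L + 2.0 * p->g);
}

/* Given the predicted times t[0..n-1] of all candidates, return the index
 * of a clear winner, or -1 when the runner-up is within 10% of the best
 * prediction and the measured learning phase should decide instead. */
static int select_by_model(const double t[], int n)
{
    assert(n >= 2);
    int best = 0, second = -1;
    for (int i = 1; i < n; ++i) {
        if (t[i] < t[best])                      { second = best; best = i; }
        else if (second < 0 || t[i] < t[second]) { second = i; }
    }
    return (t[second] > 1.10 * t[best]) ? best : -1;
}
```

This targets the drawback named on slide 16: obviously poor algorithms are ruled out by a cheap prediction instead of by slow measured runs.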
Slide 18: Closing
Effect of node load imbalance.
Thank you for your attention.