Highly Latency Tolerant Gaussian Elimination
Toshio ENDO, Kenjiro TAURA
University of Tokyo
Grid05 workshop

Background
- Demands for large-scale computing are increasing
- Grid computing is attractive for improving cost performance
- The Grid has been successful for applications with little communication:
  master-worker, parameter sweeping, ...@home projects
- Evaluations of applications with frequent communication on the Grid are still rare:
  matrix operations, PDE solvers (MPICH-G2, Cactus-G)

Obstacles to Running Apps with Frequent Communication
- Volatility and heterogeneity of computing nodes: will be solved by progress in programming tools
- Low bandwidth of WANs: will be solved by improvements in WANs
- Large latencies of WANs: more than 10 ms on the Grid, versus a few microseconds on supercomputers
- Large latencies will remain an obstacle, so latency-tolerant algorithms are important

Target Computation
- Gaussian elimination (GE) of dense matrices for solving linear equations Ax = b
  - Same as LU decomposition and Linpack
  - Used for fluid simulations, structural analysis, and the Top500 ranking
- Difficult to achieve good performance with large latencies:
  partial pivoting (PP) introduces frequent synchronous communications

Overview of This Work
- A Gaussian elimination algorithm that tolerates large latencies is presented
- An alternative pivoting method, named batched pivoting (BP), is proposed
  - More latency tolerant than PP
  - BP can largely reduce the frequency of synchronous communications

Outline
- Gaussian elimination with partial pivoting
- Batched pivoting
- Evaluation: latency tolerance, numerical accuracy
- Summary
Gaussian Elimination with Partial Pivoting
- GE of an n×n matrix A:
    for k = 1 to n
      Pivoting: find the largest element (pivot) in the k-th column
      Row exchange
      Update: a_ij ← a_ij − a_ik · a_kj / a_kk
- Since the pivot is used as a divisor, its absolute value should be large

Problem of GE with PP
- Well-known distribution: 2D block-cyclic distribution over p nodes (e.g. p = 6 = 2×3)
  - Good: communication amount is small, O(n²√p) in total
  - Each column is partitioned among nodes, so each pivot selection requires a synchronization
- With large latencies, synchronization costs become the bottleneck

Performance of GE/PP with Large Latencies
- We emulated large artificial latencies on a Linux cluster
  - Identical latencies are inserted between all pairs of nodes: base, +2 ms, +5 ms, +10 ms
- High Performance Linpack (HPL), i.e. GE with PP, is measured
  - Matrix size n = 32768, 64 (= 8×8) nodes
- [Figure: execution time (sec) vs. added latency (msec) for partial pivoting]
- With +10 ms latency, it gets 6 times slower: GE with PP is weak against large latencies

How about Other Pivoting Methods?
- Strict: complete pivoting, rook pivoting [Neal92] — not latency tolerant
- Partial pivoting, threshold pivoting [Malard91], etc. — not latency tolerant
- Batched pivoting (ours)
- Relaxed: pairwise pivoting [Sorensen85], no pivoting, etc. — numerically unstable

The Aim of Batched Pivoting (BP)
- BP reduces the frequency of synchronous communications
- We batch the pivot selections of several contiguous steps
  - The batch size d is determined in advance
  - Synchronous communications occur only every d steps
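As an illustration of the GE/PP loop described above (pivoting, row exchange, then the update a_ij ← a_ij − a_ik · a_kj / a_kk), here is a minimal sequential sketch in pure Python. It is illustrative only: the paper's setting is a distributed 2D block-cyclic HPL implementation in C, not this serial code.

```python
def ge_partial_pivoting(A, b):
    """Solve Ax = b by GE with partial pivoting; A is a list of row lists."""
    n = len(A)
    # Work on copies so the caller's data is untouched.
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        # Pivoting: find the largest element in the k-th column (rows k..n-1).
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        # Row exchange.
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # Update: a_ij <- a_ij - a_ik * a_kj / a_kk
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = b[k] - sum(A[k][j] * x[j] for j in range(k + 1, n))
        x[k] = s / A[k][k]
    return x
```

In the parallel 2D block-cyclic setting, the `max` over the column in the pivoting step is exactly the operation that forces one synchronization per step, which is what BP batches.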
Batched Pivoting Algorithm (1)
- An algorithm that selects d contiguous pivots
- In the figure, d columns are partitioned between P1 and P2
- Each node duplicates the columns and makes a sub-matrix
- Each node locally and speculatively performs GE with PP
- Each node obtains d pivot candidates

Batched Pivoting Algorithm (2)
- The sets of pivot candidates are gathered, and the 'best' set is selected
  - We try to avoid bad pivots (d = 4 in the figure)
  - P1: "I recommend 4.8 on the 50th row, -2.5 on the 241st row, 4.3 on the 285th row, -3.6 on the 36th row"
  - P2: "I recommend -9.2 on the 310th row, 6.8 on the 121st row, 0.8 on the 170th row, -5.9 on the 146th row"
  - The sets are compared and one is adopted
- The contents of the best set are broadcast as the final pivots

Comparison with PP
- Selected pivots
  - PP selects the pivot of each step independently
  - BP takes the pivots of d contiguous steps from a single node, so the pivots may be worse than with PP
- Computation costs
  - PP: (2/3)n³ + O(n²)
  - BP: (2/3)n³ + O(n²) + O(dn²), where the extra O(dn²) term is for the local GE
  - The difference is small if d << n

Comparison with Other Pivoting Methods
- Threshold pivoting [Malard91] etc.
  - It may not select the 'best' pivot: an element a_pk may become the pivot if |a_pk| ≥ α · max_i |a_ik| holds (0 < α ≤ 1)
  - Good: it can reduce communication for row exchanges
  - Bad: not latency tolerant
- Pairwise pivoting [Sorensen85] etc.
  - It repeatedly takes two adjacent rows and eliminates one of the two (cf. bubble sort)
  - Good: it enables pipelining of pivot selections
  - Bad: numerically unstable
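The batched pivot selection above can be sketched as follows. Assumptions not taken from the paper's code: each "node" works on a full copy of the d panel columns, candidate sets are ranked by their worst (smallest) absolute pivot, and the gather/broadcast is modeled by plain function calls rather than MPI collectives.

```python
def local_candidates(panel):
    """Speculative local GE with PP on a copy of the d panel columns.

    Returns (pivot_rows, pivot_values): the d candidates chosen by this node.
    """
    m = [row[:] for row in panel]          # duplicate the columns
    nrows, d = len(m), len(m[0])
    rows = list(range(nrows))              # track original row indices
    pivot_rows, pivot_values = [], []
    for k in range(d):
        p = max(range(k, nrows), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        rows[k], rows[p] = rows[p], rows[k]
        pivot_rows.append(rows[k])
        pivot_values.append(m[k][k])
        for i in range(k + 1, nrows):
            f = m[i][k] / m[k][k]
            for j in range(k, d):
                m[i][j] -= f * m[k][j]
    return pivot_rows, pivot_values

def select_best_set(candidate_sets):
    """Gather step: adopt the set whose worst (smallest) |pivot| is largest.

    The paper only says the 'best' set is chosen to avoid bad pivots; this
    max-min criterion is one plausible concrete choice, not the paper's.
    """
    return max(candidate_sets, key=lambda c: min(abs(v) for v in c[1]))
```

With this structure, only the gather of the candidate sets and the broadcast of the winner require communication: one synchronization per d columns instead of one per column.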
Environment for Parallel Experiments
- 192-node Linux cluster
  - Dual Xeon 2.4/2.8 GHz (1 CPU per node is used)
  - Gigabit Ethernet; latencies: 55–75 µs
- BP is implemented by modifying HPL
  - mpich 1.2.6
  - BLAS library by Kazushige Goto

Network Structure of Cluster
- [Figure: network structure of the ISTBS cluster at U. Tokyo; groups of 14 or 20 nodes attached to switches, connected by 1, 2, and 4 Gbps links]

Basic Parallel Performance
- Comparing the speeds of PP and BP (d = 4, 16, 64)
  - 32 to 160 nodes, n = 32768, sb = 256, no emulated latency
- [Figure: execution time (sec) vs. number of nodes for partial and batched pivoting]
- BP shows scalability similar to that of PP
- BP suffers from the overhead of additional computation: 7 to 15% with d = 64

Performance with Large Latencies
- Large latencies are added: +2 ms, +5 ms, +10 ms between all pairs of nodes
  - 64 (= 8×8) nodes, n = 32768, sb = 256
- [Figure: execution time (sec) vs. added latency (msec) for partial and batched pivoting]
- With latencies, BP is much faster; BP is stable against large latencies!
- When d is larger, it gets more tolerant of latencies

Evaluation Method of Numerical Accuracy
- We conducted experiments to evaluate numerical accuracy
  - Partial, batched, threshold, pairwise, and no pivoting are compared
  - Done on a single node; in BP, blocks of size 64 are regarded as nodes
  - 100 random matrices for each condition; matrix sizes are 128 to 2048
- Normalized residuals ||A x̃ − b|| / (ε ||A|| ||x̃|| n) are evaluated
  - x̃: computed solution, ε: machine epsilon (= 2^−53)
- The next slide shows the average residuals

Numerical Accuracy
- [Figure: normalized residual vs. matrix size n for partial, batched, threshold, pairwise, and no pivoting]
- PP achieves the best accuracy
- No pivoting and pairwise pivoting are numerically unstable
- BP and threshold pivoting achieve accuracy comparable to PP
  - Average residuals of BP (d = 4) are 1.1–1.6 times those of PP
  - The sizes of the residuals depend on d: a tradeoff between accuracy and latency tolerance

Summary
- A GE algorithm that tolerates the large latencies of the Grid
- Batched pivoting largely reduces the number of synchronous communications
- BP achieves numerical accuracy comparable to PP

Future Work
- Performance evaluation on an actual Grid
- Improvement of accuracy
  - Combining batched pivoting with complete or rook pivoting
- Theoretical error analysis (cf. the average-case analysis by Trefethen et al.)

Another Approach: Column Distribution
- When each column is placed on a single node, synchronization is not necessary
- It is latency tolerant, but slower because of the increase in communication amount
- [Figure: execution time (sec) vs. added latency (msec) for partial, partial-column, and batched pivoting]

Why PP Is Fragile with Large Latencies
- Batching several steps is a well-known technique for row exchange and update
- Then, can we reduce synchronizations for pivoting? No: pivoting cannot be batched or pipelined
- Each pivot selection depends on the pivoting of the preceding steps!
- In total, n synchronizations are required
- If latencies are too large, synchronization costs become the bottleneck

Bandwidth Requirements
- Estimation of the limit speed when the bisection bandwidth is given
  - Depends on n; very optimistic
- Our experiments require 250 Mbps; BlueGene/L (Jun. 2005) requires 5 Gbps
- [Figure: limit speed (GFlops) vs. matrix size n for 100 Mbps to 100 Gbps bisection bandwidth, with BlueGene/L and our experiments marked]

Effects of Latencies
- Estimation of the limit speed when the latency is given
  - Depends on n; very optimistic
- With latencies above 7 ms, we can never obtain the performance of BlueGene/L
- [Figure: limit speed (GFlops) vs. matrix size n for 0.1 ms to 100 ms latencies, with BlueGene/L and our experiments marked]

Displayed Residual of HPL Differs from the Actual Value
- Displayed results:

    ============================================================================
    T/V                N    NB     P     Q         Time          Gflops
    ----------------------------------------------------------------------------
    W21L2L4         1024   256     1     1         0.30       2.357e+00
    ----------------------------------------------------------------------------
    ||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 0.0307237 ...... PASSED
    ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 0.0135777 ...... PASSED
    ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035758 ...... PASSED

- The division by N is not shown in the third displayed formula, but in the source code (HPL_pdtest.c) that residual is actually divided by N:

    resid1 = resid0 / ( TEST->epsil * Anorm1 * (double)(N) );
    resid2 = resid0 / ( TEST->epsil * Anorm1 * Xnorm1 );
    resid3 = resid0 / ( TEST->epsil * AnormI * XnormI * (double)(N) );
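To make the hidden division by N in resid3 explicit, here is a pure-Python sketch of the three residual formulas quoted above from HPL_pdtest.c. The matrix and vector norm helpers are ad-hoc stand-ins, not HPL code.

```python
def norm1_matrix(A):            # max column sum, ||A||_1
    n = len(A)
    return max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))

def norminf_matrix(A):          # max row sum, ||A||_oo
    return max(sum(abs(v) for v in row) for row in A)

def norm1_vector(x):
    return sum(abs(v) for v in x)

def norminf_vector(x):
    return max(abs(v) for v in x)

def hpl_residuals(A, x, b, eps=2.0 ** -53):
    """The three scaled residuals computed (not displayed) by HPL_pdtest.c."""
    n = len(A)
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    resid0 = norminf_vector(r)                      # ||Ax - b||_oo
    resid1 = resid0 / (eps * norm1_matrix(A) * n)
    resid2 = resid0 / (eps * norm1_matrix(A) * norm1_vector(x))
    # HPL *displays* this one without the "* N", but computes it with it:
    resid3 = resid0 / (eps * norminf_matrix(A) * norminf_vector(x) * n)
    return resid1, resid2, resid3
```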
Numerical Accuracy (2)
- Compares PP and BP with large matrices
  - Matrices are generated by HPL, on 64 (= 8×8) nodes
- [Figure: normalized residual (resid3) vs. matrix size n for partial and batched pivoting]

Detail of Accuracy (Small Matrices)
- [Figure, left: worst case in 100 trials — normalized residual vs. matrix size n for partial, batched, threshold, pairwise, and no pivoting]
- [Figure, right: standard deviation — normalized residual vs. matrix size n for the same methods]

Detail of Accuracy (Large Matrices)
- [Figures: resid1, resid2, resid3 vs. matrix size n for partial and batched pivoting; from left to right:
    ||Ax-b||_oo / ( eps * ||A||_1  * N )
    ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1 )
    ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo * N )]

Notes
- Simple batched pivoting may fail if the local GEs fail on all nodes
  - This situation may occur with (nearly) sparse matrices
- We can recover by restarting from the failed column
  - The number of synchronizations then gets closer to that of partial pivoting
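The recovery described in the Notes slide can be sketched as a driver loop. This is an assumed shape, not the paper's implementation: `select_pivots_for_batch` is a hypothetical stand-in for the whole speculate-gather-select step of BP, reporting either success or the column index at which the local GEs failed on all nodes.

```python
def batched_pivots_with_restart(select_pivots_for_batch, n, d):
    """Select pivots for columns 0..n-1 in batches of (up to) d.

    select_pivots_for_batch(start, size) must return (pivots, failed_at),
    where failed_at is None on success, or the absolute column index at
    which the local GEs failed on all nodes.
    """
    pivots = []
    k = 0
    while k < n:
        size = min(d, n - k)
        batch, failed_at = select_pivots_for_batch(k, size)
        pivots.extend(batch)       # pivots chosen before the failure are kept
        if failed_at is None:
            k += size
        else:
            # Restart from the failed column with a fresh batch. In the worst
            # case this degenerates toward one synchronization per column,
            # i.e. toward the synchronization count of partial pivoting.
            k = failed_at
    return pivots
```

Each call to `select_pivots_for_batch` corresponds to one synchronization, which is why frequent restarts erode BP's latency advantage.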