www.el.gsic.titech.ac.jp

Highly Latency Tolerant
Gaussian Elimination
Toshio ENDO, Kenjiro TAURA
University of Tokyo
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
1
Background



Demands for large scale computing are increasing
Grid computing is attractive to improve cost
performance
Grid has been successful for applications with
small numbers of communication



Master-worker, parameter sweeping
…@home projects
Evaluations of applications with frequent
communication on Grid are still rare


Matrix ops, PDE solver
MPICH-G2, Cactus-G
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
2
Obstacles to Running
Apps with Frequent Comm.

Volatility, heterogeneity
of computing nodes
will be solved by
progress of
programming tools

Low bandwidth of WAN
will be solved by
improvements of WAN

Large latencies of WAN

More than 10ms on Grid >> a few us on supercomputers
Large latencies will remain an obstacle!
Algorithms that tolerate latencies are important
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
3
Target Computation
Gaussian elimination (GE) of dense matrices for
solving linear equations Ax  b


Used for



Same as LU decomposition, Linpack
Fluid simulations, structural analysis
Top500 ranking
Difficult to achieve good performance with
large latencies

Partial pivoting (PP) introduces frequent
synchronous communications
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
4
Overview of This Work


Gaussian elimination algorithm that tolerates
large latencies is presented
An alternative pivoting method, named
batched pivoting (BP) is proposed


More latency tolerant than PP
BP can largely reduce the frequency of
synchronous communications
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
5
Outline



Gaussian elimination with partial pivoting
Batched pivoting
Evaluation



Latency tolerance
Numerical accuracy
Summary
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
6
Gaussian Elimination with
Partial pivoting
GE of n×n matrix A
for k = 1 to n
Pivoting
Finds the largest element (pivot)
in the k-th column
Row Exchange
n
Update
aij  aij  aik akj / akk
Since pivot is used as divisor,
its absolute value should be larger
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
7
Problem of GE with PP

# of nodes p=6 (=2x3)
Well-known distribution:
2D block cyclic distribution

Good: Comm. amount is small
O(n 2 p )
n
sb
Grid05 workshop
Each column is partitioned
among nodes
Each pivot selection
requires synchronization
With large latencies, sync. costs
become bottleneck
Highly Latency Tolerant GE/T. Endo
8
Performance of GE/PP
with Large Latencies
We emulated large artificial
latencies on a Linux cluster



Identical latencies are inserted
among all pairs of nodes
base, +2ms, +5ms, +10ms
High performance Linpack
(HPL) is measured



GE with PP
Matrix size n=32768
64 (=8x8) nodes
With +10ms latency,
it gets 6 times slower!
1400
Execution time (sec)

1200
1000
800
600
400
200
0
GE with PP is weak in large latencies
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
+0
+2
+5
+10
Added latency (msec)
Partial
9
How about Other Pivoting
Methods?
Strict
Complete
Rook
[Neal92]
Not latency tolerant
Partial
Threshold
[Malard91]
etc.
Batched (Ours)
Relaxed
Grid05 workshop
Pairwise [Sorensen85]
No pivoting
etc.
Highly Latency Tolerant GE/T. Endo
numerically unstable
10
Outline



Gaussian elimination with partial pivoting
Batched pivoting
Evaluation



Latency tolerance
Numerical accuracy
Summary
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
11
The Aim of Batched Pivoting (BP)

BP reduces the frequency of synchronous
communications

We batch pivot selections of several contiguous steps


The size of batch d is determined in advance
Synchronous communications occur only every d steps
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
12
Batched Pivoting Algorithm (1)
Algorithm that selects d contiguous
pivots



In the figure, d columns are partitioned
between P1 and P2
Each node duplicates the columns
and makes a sub-matrix
Each node locally and speculatively
performs GE with PP

Each node obtains d pivot candidates sb
d
d
P1
P1
P3
P2
P4
P1
P3
P2
P4
P2
sb
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
13
Batched Pivoting Algorithm (2)


Sets of pivot candidates are gathered
The ‘best’ set is selected


We try to avoid bad pivots
d=4 in the figure
P1
I recommend
4.8 on 50th row,
-2.5 on 241th row,
4.3 on 285th row,
-3.6 on 36th row
P2
Compare
I recommend
-9.2 on 310th row,
6.8 on 121th row,
0.8 on 170th row,
-5.9 on 146th row
Adopt!

Contents of the best set are broadcast as final pivots
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
14
Comparison with PP

Selected pivots



PP selects pivot of each step independently
BP takes pivots of d contiguous steps from a single
node
Pivots may be worse than PP
Computation costs
PP: (2 / 3)n3  O(n2 )
3
2
2
 BP: (2 / 3)n  O(n )  O(dn )
The difference is small if d<<n

Grid05 workshop
Highly Latency Tolerant GE/T. Endo
For local GE
15
Comparison with
other pivoting methods

Threshold pivoting [Malard91] etc.




It may not select the ‘best’ pivot. An element may become the
pivot if a pk    max aik holds (0    1)
Good: It can reduce communication for row exchange
Bad: Not latency tolerant
Pairwise pivoting [Sorensen85] etc.



It repeatedly takes two adjacent rows and eliminates one of the
two (cf. bubble sort)
Good: It enables pipelining pivot selections
Bad: Numerically unstable
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
16
Outline



Gaussian elimination with partial pivoting
Batched pivoting
Evaluation



Latency tolerance
Numerical accuracy
Summary
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
17
Environment for
Parallel Experiments

192node Linux cluster




Dual Xeon 2.4/2.8GHz (1 CPU per node is used)
Gigabit ethernet
Latencies: 55—75 us
BP is implemented by modifying HPL


mpich 1.2.6
BLAS library by Kazushige Goto
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
18
Network Structure of Cluster
14
SW
14
SW
14
SW
14
SW
14
1Gbps
2Gbps
4Gbps
SW
SW
FW
SW
FS
SW
SW
14
Grid05 workshop
SW
14
SW
SW
SW
SW
SW
20
20
20
20
SW
ISTBS cluster
at U. Tokyo
14
Highly Latency Tolerant GE/T. Endo
19
Execution time (sec)
Basic Parallel Performance

400
350
300
250
200
150
100
50
0
Comparing speeds of PP
and BP (d=4, 16, 64)



32
96 128 160
64
The number of nodes


Partial
Batched(16)
Grid05 workshop
Batched(4)
Batched(64)
32 to 160 nodes
n=32768, sb=256
No emulated latency
BP shows similar scalability
to that of PP
BP suffers from overheads
of additional computation

7 to 15% with d=64
Highly Latency Tolerant GE/T. Endo
20
Performance with Large Latencies
Execution time (sec)
With latencies, BP is
much faster

Large latencies are added
1400

1200
1000

800

600
400
200

0
+0
+2
+5
+10
Added latency (msec)
Partial
Batched(16)
Grid05 workshop
Batched(4)
Batched(64)
+2ms, +5ms, +10ms
between all pairs of nodes
64(=8x8) nodes
n=32768, sb=256
BP is stable against large
latencies!

When d is larger, it gets more
tolerant of latencies
Highly Latency Tolerant GE/T. Endo
21
Evaluation Method of
Numerical Accuracy
We conducted experiments to evaluate numerical
accuracy
 Partial, Batched, Threshold, Pairwise, No pivoting are
compared




Done on a single node
In BP, blocks of size 64 are regarded as nodes
100 random matrices for each condition

Matrix sizes are 128 to 2048

Next slide shows the average residuals
x  b || / || A || || ~
x || n 
Normalized residuals || A~
are evaluated
53
~
 x : computed solution, ε: machine epsilon(= 2
)
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
22
Numerical Accuracy
NNorm
dual
orm al
aliized
zed resi
resi
dual
0.
101

1

01
0.
0.
1

0.01
PP achieves the best
accuracy
No pivoting, Pairwise are
numerically unstable
BP and threshold achieve
comparable accuracy to PP

0.001
100
1000
M atrix size n
P artial
atched(16)
BB atched(16)
5)
d(0.5)
Threshold(0.
Threshol
se
rw iise
airw
PP ai
Grid05 workshop
10000
B atched(4)
atched(64)
BB atched(64)
25)
d(0.25)
Threshold(0.
Threshol
NN oo

Average residuals of BP
(d=4) are x1.1--1.6 of PP
The sizes of residuals depend
on d
Tradeoff between accuracy
and latency tolerance
Highly Latency Tolerant GE/T. Endo
23
Outline



Gaussian elimination with partial pivoting
Batched pivoting
Evaluation



Latency tolerance
Numerical accuracy
Summary
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
24
Summary

A GE algorithm that tolerates large latencies of
Grid


Batched pivoting largely reduces the number of
synchronous communications
BP achieves comparable numerical accuracy to PP
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
25
Future Work


Performance evaluation on actual Grid
Improvement of accuracy


Combining batched pivoting with complete or rook
pivoting
Theoretical error analysis

cf. average case analysis by Trefethen et al.
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
26

When each column is
placed on a single
node, synchronization
is not necessary

It is latency tolerant,
but…
Slower because of
increase in comm.
amount

Grid05 workshop
Execution time (sec)
Another Approach:
Column Distribution
1400
1200
1000
800
600
400
200
0
Highly Latency Tolerant GE/T. Endo
+0
+2
+5
+10
Added latency (msec)
Partial
Partial-Col
Batched(4)
Batched(16)
Batched(64)
28
Why PP is fragile to large
latencies



Batching several steps are well-known technique for
row exchange and update
Then, can we reduce synchronizations for pivoting?
No. pivoting cannot be batched or pipelined


Each pivoting depends on pivoting of proceeding steps!
In total, n times synchronizations are required
If latencies are too large, synchronization
costs become bottleneck
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
29
Bandwidth Requirements
Estimation of limit
speed when bisection
bandwidth is given




Depends on n
Very optimistic
Our experiments
requires 250Mbps
Bluegene/L(Jun.2005)
requires 5Gbps
1.E+08
Limit speed(GFlops)

Bluegene/L
Our experiments
1.E+07
1.E+06
1.E+05
1.E+04
1.E+03
1.E+02
1.E+01
1.E+00
1.E+04
1.E+05
Highly Latency Tolerant GE/T. Endo
1.E+07
Matrix size n
100Mbps
10Gbps
Grid05 workshop
1.E+06
1Gbps
100Gbps
30
Effects of latencies
Estimation of limit
speed when latency is
given



Depends on n
Very optimistic
With >7ms latencies,
we can never obtain
performance of
Bluegene/L
Bluegene/L
Our experiments
1.E+09
1.E+08
Limit speed(GFlops)

1.E+07
1.E+06
1.E+05
1.E+04
1.E+03
1.E+02
1.E+01
1.E+00
1.0E+04
0.1ms
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
1.0E+05
1.0E+06
Matrix size n
1ms
10ms
1.0E+07
100ms
31
Displayed residual of HPL
differs from actual value

Displayed results
============================================================================
T/V
N
NB
P
Q
Time
Gflops
---------------------------------------------------------------------------W21L2L4
1024
256
1
1
0.30
2.357e+00
---------------------------------------------------------------------------||Ax-b||_oo / ( eps * ||A||_1 * N
) =
0.0307237 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) =
0.0135777 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =
0.0035758 ...... PASSED
Divide by n doesn’t be shown

Source code(HPL_pdtest.c)
resid1 = resid0 / ( TEST->epsil * Anorm1
* (double)(N) );
resid2 = resid0 / ( TEST->epsil * Anorm1 * Xnorm1
);
resid3 = resid0 / ( TEST->epsil * AnormI * XnormI * (double)(N) );
Actually, devided
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
32
Numerical Accuracy(2)
0.01
N orm alized residual(resid3)
Compares PP and BP with
large matrices
 Matrices are generated by
HPL
 On 64(=8x8) nodes
0.001
0.0001
0
20000
40000
60000
M atrix size n
P artial
B atched(16)
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
B atched(4)
B atched(64)
33
Detail of accuracy
(Small matrices)
Worst case in 100 trials
Standard deviation
10
0.1
10
0.1
1
Normalized
residual
正規化残差
正規化残差
Normalized
residual
1001
0.1
0.01
0.01
0.001
100
1000
1000
Matrix size n
行列サイズn
Partial
Partial
Batched(16)
Batched(16)
Threshold(0.5)
Threshold(0.5)
Pairwise
Pairwise
Grid05 workshop
10000
10000
Batched(4)
Batched(4)
Batched(64)
Batched(64)
Threshold(0.25)
Threshold(0.25)
No
No
1
0.01
0.1
0.01
0.001
0.001
0.0001
100
1000
行列サイズn
Matrix
size n
Partial
Batched(16)
Threshold(0.5)
Pairwise
Highly Latency Tolerant GE/T. Endo
10000
Batched(4)
Batched(64)
Threshold(0.25)
No
34
Detail of accuracy
(Large matrices)
0.01
0.001
20000 40000 60000
Matrix size n
Partial
Batched(4)
Batched(16)
Batched(64)
0.001
0.0001
0.001
0

0.01
resid3
0.01
resid2
resid1
0.1
0
20000 40000 60000
matrix size n
Partial
Batched(4)
Batched(16)
Batched(64)
0
20000 40000 60000
Matrix size n
Partial
Batched(4)
Batched(16)
Batched(64)
From left to right,
||Ax-b||_oo / ( eps * ||A||_1 * N
)
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo * N)
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
35
Notes

Simple batched pivoting may fail, if
local GEs fail on all nodes


This situation may occur with (nearly)
sparse matrices
We can recover by restarting from the
failed column
The number of synchronization gets
closer to that of partial pivoting
Grid05 workshop
Highly Latency Tolerant GE/T. Endo
36