Accelerating the Complex Hessenberg QR Algorithm
with the CSX600 Floating-Point Coprocessor

Yusaku Yamamoto 1, Takafumi Miyata 1, Yoshimasa Nakamura 2
1 Nagoya University
2 Kyoto University

Presented by Takafumi Miyata, Graduate School of Engineering, Nagoya University

PDCS 2007
November 20, 2007
Introduction
• Background
– Advent of many-core floating point accelerators as a
means to speed up scientific computations
• Objective of our study
– Apply these accelerators to the eigenvalue problem for
nonsymmetric matrices.
– Clarify potential problems.
– Modify existing algorithms or develop new ones if
necessary.
Outline of the talk
• Introduction
• Many-core floating-point accelerators and their
performance characteristics
• The nonsymmetric eigenvalue problem
• Proposed algorithm
– Modification of the small-bulge multishift QR algorithm for
floating-point accelerators
• Performance evaluation
• Conclusion
Many-core floating-point accelerators
• ClearSpeed CSX600
– 1+96 processor cores
– 48GFLOPS (double precision)
• GRAPE-DR (Tokyo Univ.)
– 512 processor cores
– 512GFLOPS (single precision)
– 256GFLOPS (double precision)
• Intel Larrabee (under development)
– 80 processor cores
– 1TFLOPS (single precision)
• Integrates hundreds of floating-point cores
• Very high GFLOPS value (peak performance)
Architecture of the CSX600 accelerator
• The CSX600 chip
– 1 main processor
– 96 floating-point processors
• 64bit
• 2flops/cycle
• 128Byte register file
• 6KB SRAM
– Operates at 250MHz
– Peak performance: 48GFLOPS
• ClearSpeed Advance board
– Two CSX600 processors
– 1GB DRAM
– Connected with the PC via the
PCI-X bus.
– Peak performance: 96GFLOPS
Problem with the data transfer speed
• Peak floating-point performance --- very high
– 48GFLOPS / chip
– 96GFLOPS / board
• Data transfer speed --- relatively low
– 3.2GB/s between the chip and on-board memory
– 1.066GB/s between the board and main memory
• Byte/flop
– 0.066Byte/flop between the chip and on-board memory
– 0.011Byte/flop between the board and main memory
[Figure: block diagram of the system. The host PC (CPU + DRAM) is connected
to the ClearSpeed Advance board over the PCI-X bus at 1.066GB/s; on the
board, the two CSX600 chips are connected to the on-board DRAM at 3.2GB/s.]
Byte/flop of typical linear algebraic operations

Function                  Operation       Data transfer (words)  Flop count  Byte/flop
Dot product               a := x^T y      2n                     2n          8
AXPY                      x := x + ay     3n                     2n          12
Matrix-vector mult.       y := Ax         n^2 + 2n               2n^2        4
Rank-1 update             A := A + xy^T   2n^2 + 2n              2n^2        8
Matrix mult. (MatMult)    C := C + AB     4n^2                   2n^3        16/n

• Operations other than matrix multiplication cannot exploit the performance
of the CSX600 due to the limitation of the data transfer speed.
• Matrix multiplication (MatMult) can be executed efficiently, but only if
the size is very large (n > 1500).
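The byte/flop entries in the table can be reproduced with a short calculation. This is a minimal sketch with a hypothetical helper name, assuming 8-byte double-precision words:

```python
WORD = 8  # bytes per double-precision element

def byte_per_flop(words_moved, flops):
    """Ratio of bytes transferred to floating-point operations performed."""
    return words_moved * WORD / flops

n = 1000
ratios = {
    "dot":  byte_per_flop(2 * n,            2 * n),       # a := x^T y        -> 8
    "axpy": byte_per_flop(3 * n,            2 * n),       # x := x + a*y      -> 12
    "gemv": byte_per_flop(n * n + 2 * n,    2 * n * n),   # y := A*x          -> ~4
    "ger":  byte_per_flop(2 * n * n + 2 * n, 2 * n * n),  # A := A + x*y^T    -> ~8
    "gemm": byte_per_flop(4 * n * n,        2 * n ** 3),  # C := C + A*B      -> 16/n
}
print(ratios)
```

Only GEMM's ratio shrinks with n, which is why everything must be pushed into matrix multiplication.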
Performance of MatMult on the ClearSpeed board

[Figure: two plots of MatMult performance for C := C + AB, with C of size
M x N and K the inner dimension. Each plot shows GFLOPS (0-30) as one
dimension grows from 1000 to 6000, with the other two dimensions fixed at
500, 1000, 1500, or 2000.]

• The library transfers the input data from the main memory to the board,
performs the computation, and returns the result to the main memory.
• M, N and K must be larger than 1500 to get a substantial performance gain.
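Effective GFLOPS figures like those in the plot can be measured with a simple timing loop. A minimal sketch with a hypothetical helper name; NumPy's BLAS stands in for the ClearSpeed library, so the numbers are illustrative of the methodology, not of the board:

```python
import time
import numpy as np

def matmult_gflops(m, n, k, reps=3):
    """Measure effective double-precision GFLOPS of C := A @ B."""
    a = np.random.rand(m, k)
    b = np.random.rand(k, n)
    t = time.perf_counter()
    for _ in range(reps):
        a @ b
    dt = (time.perf_counter() - t) / reps
    return 2.0 * m * n * k / dt / 1e9  # a matrix product costs 2mnk flops

print(f"{matmult_gflops(500, 500, 500):.2f} GFLOPS")
```

On an accelerator, the measured rate would also include the host-to-board transfer time, which is what suppresses performance at small M, N, K.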
Problems to be solved
• Problems
– Is it possible to reorganize the algorithm so that most of the
computations are done with matrix multiplications?
– What is the overhead of using very large matrix
multiplications?
– How can we reduce the overhead?
We consider these problems for the nonsymmetric eigenvalue
problem.
The nonsymmetric eigenvalue problem

• The problem
– Eigenvalue problem Ax = λx
• A: dense complex nonsymmetric matrix
• Compute all the eigenvalues / eigenvectors
• Applications
– Magnetohydrodynamics
– Structural dynamics
– Quantum chemistry
– Fluid dynamics
• Cf. Z. Bai and J. Demmel: A test matrix collection for non-Hermitian
eigenvalue problems.
Algorithm for the nonsymmetric eigenproblem

• The standard algorithm
– Similarity transformation to an upper triangular matrix
(diagonal elements = eigenvalues)
– Step 1: Dense matrix -> Hessenberg matrix
• Householder method (finite number of steps)
• Work: (10/3)n^3
– Step 2: Hessenberg matrix -> Upper triangular matrix  <- Target of speedup
• QR algorithm (iterative method)
• Work: 10n^3 (empirically)
– We focus on speeding up the QR algorithm.
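The two phases above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the Householder reduction is the standard one, but the iterative phase is shown as a plain unshifted QR sweep rather than the small-bulge multishift variant discussed next.

```python
import numpy as np

def to_hessenberg(A):
    """Reduce A to upper Hessenberg form by Householder similarity transforms."""
    H = A.astype(float).copy()
    n = H.shape[0]
    for k in range(n - 2):
        v = H[k + 1:, k].copy()
        v[0] += np.copysign(np.linalg.norm(v), v[0])
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue
        v /= nv
        H[k + 1:, :] -= 2.0 * np.outer(v, v @ H[k + 1:, :])  # apply P from the left
        H[:, k + 1:] -= 2.0 * np.outer(H[:, k + 1:] @ v, v)  # ... and from the right
    return H

rng = np.random.default_rng(0)
A = rng.random((6, 6))

H = to_hessenberg(A)                       # phase 1: finite number of steps
assert np.allclose(np.tril(H, -2), 0.0)    # zero below the first subdiagonal

for _ in range(100):                       # phase 2: QR iteration, H := R Q
    Q, R = np.linalg.qr(H)
    H = R @ Q
```

Both phases are similarity transformations, so the spectrum of A is preserved throughout.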
The small-bulge multishift QR algorithm

• Algorithm
– Matrices: A_l Hessenberg, Q unitary, R upper triangular
– Shifts s_1, ..., s_m: eigenvalues of the trailing m x m submatrix of A_l
– Perform m steps of the QR algorithm at once.
• Computational procedure for one iteration
– Introduce (m / 2) bulges.
– Transform the matrix back to Hessenberg form by chasing
the (m / 2) bulges.

[Figure: animation of bulge introduction and chasing along the subdiagonal
of the Hessenberg matrix, for the case of m = 4 shifts.]
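The shift choice above is easy to state in code. A minimal sketch, where a random Hessenberg matrix stands in for the iterate A_l:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 12, 4
A = np.triu(rng.random((n, n)), k=-1)        # upper Hessenberg test matrix

# the m shifts are the eigenvalues of the trailing m x m submatrix of A_l
shifts = np.linalg.eigvals(A[n - m:, n - m:])
print(shifts)
```

Each pair of shifts seeds one small bulge, giving the m / 2 bulges chased in the figure.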
Use of the level-3 BLAS

• Blocking of the bulge-chasing operations
– Chase the bulges by only k rows at a time.
– Divide the update operations into two parts:
• First, update the diagonal block sequentially.
– Accumulate the Householder transformations used in the
update as a unitary matrix.
• Next, update the off-diagonal blocks (level-3 BLAS).
– Multiply the off-diagonal blocks by the unitary matrix.

[Figure: Hessenberg matrix with 3x3 bulges chased by k rows; the diagonal
block is updated sequentially, the off-diagonal blocks by MatMult.]
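The blocked update above can be sketched directly: apply a sequence of reflectors to a small diagonal block while accumulating them into one unitary matrix U, then hit the off-diagonal block with a single matrix multiplication. The reflectors here are random stand-ins for the ones generated during bulge chasing:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 8                                        # diagonal block size
D = rng.random((k, k))                       # diagonal block (sequential update)
B = rng.random((k, 20))                      # off-diagonal block (MatMult update)
B_ref = B.copy()

U = np.eye(k)
for _ in range(5):                           # a few reflectors, as in chasing
    v = rng.random(k)
    v /= np.linalg.norm(v)
    P = np.eye(k) - 2.0 * np.outer(v, v)     # Householder reflector
    D = P @ D @ P.T                          # sequential similarity update
    U = U @ P                                # accumulate transformations
    B_ref = P @ B_ref                        # reference: one reflector at a time

B = U.T @ B                                  # one GEMM replaces all the applies
assert np.allclose(B, B_ref)
```

The GEMM does the same arithmetic as the reflector-by-reflector updates, but as a single large level-3 operation with a favorable byte/flop ratio.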
Performance on the CSX600

• Random matrix (n = 6000)
• Compute all the eigenvalues / eigenvectors with the
small-bulge multishift QR algorithm
• Computational environment
– Xeon 3.2 GHz, Memory 8 GB
– ClearSpeed Advance board (CSX600 x 2)

[Figure: execution time (sec), split into the MatMult part and the others.
Xeon alone: number of shifts m = 100, 120, 160, 200, 240.
Xeon + CSX600: number of shifts m = 600, 720, 960, 1200, 1440.]

• As the number of shifts increases ...
– the MatMult part decreases,
– the other parts increase (bottleneck).
-> The parts other than MatMult need to be sped up!
Modification of the algorithm (1)

• Reformulation as a recursive algorithm
– Chase (m / 2) / q bulges by k / q rows at a time.
– At the innermost level, the diagonal update remains sequential;
the off-diagonal updates at every level become MatMult.

[Figure: blocked bulge chasing at recursion level d = 1; the k x k diagonal
block is subdivided into blocks of size k / q.]
Modification of the algorithm (2)

• Deflation
– The trailing submatrix is isolated.
• Eigensolution of the isolated submatrix
– Apply the double-shift QR algorithm.
– The size of the submatrix increases with m.
• Division of the update operations
– Update the diagonal block until convergence (sequential).
• Accumulate the Householder transformations used in the
update as a unitary matrix.
– Update the off-diagonal block only once, at the end (MatMult).
• Multiply the off-diagonal blocks by the unitary matrix.
-> This reduces the computational work.
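A deflation test of the kind implied above can be sketched as follows: when a subdiagonal entry of the Hessenberg matrix becomes negligible, the problem splits and the trailing submatrix can be solved independently. The helper name and the tolerance are illustrative choices (a common textbook criterion), not necessarily the ones used in this work:

```python
import numpy as np

def deflation_point(H, eps=1e-12):
    """Return the largest i such that H[i, i-1] is negligible, or None."""
    n = H.shape[0]
    for i in range(n - 1, 0, -1):
        if abs(H[i, i - 1]) <= eps * (abs(H[i - 1, i - 1]) + abs(H[i, i])):
            return i
    return None

rng = np.random.default_rng(3)
H = np.triu(rng.random((8, 8)), k=-1)    # upper Hessenberg test matrix
H[5, 4] = 1e-16                          # force a negligible subdiagonal entry
i = deflation_point(H)
assert i == 5                            # trailing 3 x 3 block is isolated
```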
Numerical experiments
• Test problem
– Random matrices with elements in [0, 1]
• Reduced to Hessenberg form by Householder's method
– Compute all the eigenvalues / eigenvectors
• Computational environment
– Xeon 3.2 GHz, Memory 8 GB
– Fortran 77, double precision
– ClearSpeed Advance board (CSX600 x 2)
– Matrix multiplication
• ClearSpeed's library (for large MatMult)
• Intel Math Kernel Library (for small MatMult)
Numerical experiments

• Comparison
– Existing algorithm (small-bulge multishift QR method)
• MatMult part executed on the CSX600:
– Off-diagonal update
– Our algorithm (multishift QR + recursion)
• MatMult parts executed on the CSX600:
– Off-diagonal update
– Diagonal update
– Eigensolution of the isolated submatrix
• Parameter values
– Number of shifts m: chosen optimally for each case
– Row length of bulge chasing: k = (3/2)m
– Level of recursion: d = 1
– Number of subdivisions: q = m / 40
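The parameter relations above can be checked with trivial arithmetic; the relations k = (3/2)m and q = m/40 are from the slide, while m itself was tuned per run:

```python
def chasing_params(m):
    """Derived blocking parameters for a given number of shifts m."""
    k = 3 * m // 2       # rows chased per blocked step
    q = m // 40          # number of subdivisions at recursion level d = 1
    return k, q

# the (m, q) pairs reported in the experiments are consistent with q = m / 40
assert chasing_params(160) == (240, 4)
assert chasing_params(200) == (300, 5)
assert chasing_params(240) == (360, 6)
```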
Effect of our modifications

[Figure: execution time (sec), broken down into off-diagonal update, diagonal
update, isolated eigenproblem, and others, for the original algorithm vs.
ours; the CSX600 is used in all cases.
Left: n = 3000, m = 160, q = 4. Right: n = 6000, m = 200, q = 5.]

• Our algorithm is 1.4 times faster.
– Diagonal update: 1.5 times faster
– Eigensolution of the isolated submatrix: 10 times faster
Effect of using the CSX600

[Figure: execution time (sec), broken down into off-diagonal update, diagonal
update, isolated eigenproblem, and others.
n = 6000: Xeon (m = 100, q = 5) vs. Xeon + CSX600 (m = 200, q = 5).
n = 12000: Xeon (m = 100, q = 5) vs. Xeon + CSX600 (m = 240, q = 6).]

• By combining the CSX600 with our algorithm,
– 3.5 times speedup when n = 6000
– 3.8 times speedup when n = 12000
Conclusion

• We proposed an approach to accelerate the solution of the
nonsymmetric eigenproblem using a floating-point accelerator.
• We used the small-bulge multishift QR algorithm, which can exploit
matrix multiplications efficiently, as a basis.
• By reformulating part of the algorithm as a recursive algorithm, we
succeeded in reducing the computational time spent in the
non-blocked parts. This enables us to use a large block size (number
of shifts) with small overhead and to exploit the performance of the
floating-point accelerator.
• When solving an eigenproblem of order 12,000, our algorithm is
1.4 times faster than the original small-bulge multishift QR
algorithm. We obtained a 3.8 times speedup with the CSX600 over
the 3.2GHz Xeon.