PDCS 2007, November 20, 2007

Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor

Yusaku Yamamoto (Nagoya University), Takafumi Miyata (Nagoya University), Yoshimasa Nakamura (Kyoto University)

Introduction
• Background
– Advent of many-core floating-point accelerators as a means to speed up scientific computations
• Objective of our study
– Apply these accelerators to the eigenvalue problem for nonsymmetric matrices.
– Identify potential problems.
– Modify existing algorithms, or develop new ones, if necessary.

Outline of the talk
• Introduction
• Many-core floating-point accelerators and their performance characteristics
• The nonsymmetric eigenvalue problem
• Proposed algorithm
– Modification of the small-bulge multishift QR algorithm for floating-point accelerators
• Performance evaluation
• Conclusion

Many-core floating-point accelerators
• ClearSpeed CSX600
– 1 + 96 processor cores
– 48 GFLOPS (double precision)
• GRAPE-DR (University of Tokyo)
– 512 processor cores
– 512 GFLOPS (single precision), 256 GFLOPS (double precision)
• Intel Larrabee (under development)
– 80 processor cores
– 1 TFLOPS (single precision)
• These chips integrate up to hundreds of floating-point cores and offer very high peak GFLOPS values.

Architecture of the CSX600 accelerator
• The CSX600 chip
– 1 main processor
– 96 floating-point processors (64-bit, 2 flops/cycle, 128-byte register file, 6 KB SRAM each)
– Operates at 250 MHz
– Peak performance: 48 GFLOPS
• ClearSpeed Advance board
– Two CSX600 processors
– 1 GB DRAM
– Connected to the PC via the PCI-X bus
– Peak performance: 96 GFLOPS

Problem with the data transfer speed
• Peak floating-point performance: very high
– 48 GFLOPS per chip, 96 GFLOPS per board
• Data transfer speed: relatively low
– 3.2 GB/s between the chip and on-board memory
– 1.066 GB/s between the board and main memory
• Byte/flop ratio
– 0.066 Byte/flop between the chip and on-board memory
– 0.011 Byte/flop between the board and main memory
[Figure: data paths between the PC (CPU and main DRAM) and the ClearSpeed Advance board — 1.066 GB/s over the PCI-X bus between main memory and the board, 3.2 GB/s between each CSX600 chip and the on-board DRAM.]

Byte/flop of typical linear algebraic operations

Function                         Operation        Data transferred (words)   Flop count   Byte/flop
Dot product                      a := x^T y       2n                         2n           8
AXPY                             x := x + a y     3n                         2n           12
Matrix-vector multiplication     y := A x         n^2 + 2n                   2n^2         ~4
Rank-1 update                    A := A + x y^T   2n^2 + 2n                  2n^2         ~8
Matrix multiplication (MatMult)  C := C + A B     4n^2                       2n^3         16/n

• Operations other than matrix multiplication cannot exploit the performance of the CSX600 because of the limited data transfer speed.
• Matrix multiplication (MatMult) can be executed efficiently, but only if the size is very large (n > 1500).

Performance of MatMult on the ClearSpeed board
[Figure: GFLOPS of MatMult on the board, plotted against K for M = K = 500, 1000, 1500, 2000 and against M for N = K = 500, 1000, 1500, 2000 (each axis running to 6000); performance approaches 30 GFLOPS only for large dimensions.]
• The library transfers the input data from the main memory to the board, performs the computation, and returns the result to the main memory.
• M, N and K must be larger than 1500 to get a substantial performance gain.

Problems to be solved
• Problems
– Is it possible to reorganize the algorithm so that most of the computations are done with matrix multiplications?
– What is the overhead of using very large matrix multiplications?
– How can we reduce the overhead?
• We consider these problems for the nonsymmetric eigenvalue problem.
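The Byte/flop figures in the table above, and the n > 1500 threshold for MatMult, can be verified with a few lines of arithmetic. This is a sketch assuming 8-byte double-precision words; the helper function is ours for illustration, not part of any ClearSpeed library.

```python
# Bytes moved per floating-point operation, assuming 8-byte
# double-precision words (hypothetical helper for illustration).
def byte_per_flop(words_moved, flops):
    return 8.0 * words_moved / flops

n = 1000
print(byte_per_flop(2 * n, 2 * n))        # dot product: 8 Byte/flop
print(byte_per_flop(3 * n, 2 * n))        # AXPY: 12 Byte/flop
print(byte_per_flop(4 * n**2, 2 * n**3))  # MatMult: 16/n = 0.016 Byte/flop

# For MatMult not to be limited by the 0.011 Byte/flop board-to-memory
# link, we need 16/n <= 0.011, i.e. n >= 16/0.011, roughly 1455 --
# consistent with the observation that n must exceed about 1500.
print(16 / 0.011)
```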
The nonsymmetric eigenvalue problem
• The problem
– A: dense complex nonsymmetric matrix
– Compute all the eigenvalues / eigenvectors
• Applications
– Magnetohydrodynamics
– Structural dynamics
– Quantum chemistry
– Fluid dynamics
• Cf. Z. Bai and J. Demmel: A test matrix collection for non-Hermitian eigenvalue problems.

Algorithm for the nonsymmetric eigenproblem
• The standard algorithm: similarity transformations to an upper triangular matrix, whose diagonal elements are the eigenvalues.
– Dense matrix → Hessenberg matrix: Householder's method (finite number of steps, work (10/3)n^3)
– Hessenberg matrix → upper triangular matrix: QR algorithm (iterative, work about 10n^3 empirically)
• We focus on speeding up the QR algorithm (the target of speedup).

The small-bulge multishift QR algorithm
• Algorithm (A_l: Hessenberg, Q: unitary, R: upper triangular)
– Shifts s_1, …, s_m: the eigenvalues of the trailing m x m submatrix of A_l.
– Perform m steps of the shifted QR algorithm at once.
• Computational procedure for one iteration
– Introduce m/2 bulges (each 3 x 3 bulge carries two shifts).
– Transform the matrix back to Hessenberg form by chasing the m/2 bulges down the diagonal.
[Figure: bulge chasing illustrated step by step for the case of m = 4 shifts.]
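As an illustrative sketch (our own, assuming NumPy; not the authors' code), the shift computation described above can be written directly: the m shifts for the next multishift sweep are the eigenvalues of the trailing m x m block of the current Hessenberg iterate.

```python
import numpy as np

def trailing_shifts(H, m):
    """Shifts s_1, ..., s_m: eigenvalues of the trailing m x m submatrix."""
    n = H.shape[0]
    return np.linalg.eigvals(H[n - m:, n - m:])

rng = np.random.default_rng(0)
H = np.triu(rng.random((8, 8)), k=-1)  # a random 8 x 8 Hessenberg matrix
shifts = trailing_shifts(H, 4)         # m = 4 shifts -> m/2 = 2 bulges
print(shifts)
```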
Use of the level-3 BLAS: blocking of the bulge-chasing operations
• Chase the bulges by only k rows at a time.
• Divide the update operations into two parts:
– First, update the diagonal block sequentially, accumulating the Householder transformations used in the update into a unitary matrix.
– Next, update the off-diagonal blocks by multiplying them by the accumulated unitary matrix (level-3 BLAS).
[Figure: a 3 x 3 bulge chased k rows down the diagonal; the diagonal block is updated sequentially, the off-diagonal blocks by MatMult.]

Performance on the CSX600
• Random matrix (n = 6000); compute all the eigenvalues / eigenvectors with the small-bulge multishift QR algorithm.
• Computational environment
– Xeon 3.2 GHz, 8 GB memory
– ClearSpeed Advance board (two CSX600s)
[Figure: execution time (sec) on the Xeon alone and on Xeon + CSX600 for m = 100–240 shifts (MatMult sizes 600–1440), broken down into the MatMult part and the other parts.]
• As the number of shifts increases, the time for the MatMult part decreases, while the time for the other parts increases and becomes the bottleneck. The parts other than MatMult need to be sped up!
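The blocked update described above can be sketched as follows (our own illustration assuming NumPy, not the authors' Fortran code): the Householder reflectors applied while processing the diagonal block are accumulated into a single unitary matrix Q, and the off-diagonal block is then updated with one matrix multiplication instead of many rank-1 updates.

```python
import numpy as np

def accumulate_reflectors(vs):
    """Accumulate reflectors I - 2 v v^H (v normalized) into one unitary Q."""
    k = vs[0].shape[0]
    Q = np.eye(k, dtype=complex)
    for v in vs:
        v = v / np.linalg.norm(v)
        Q = Q - 2.0 * np.outer(Q @ v, v.conj())  # Q := Q (I - 2 v v^H)
    return Q

k = 6
rng = np.random.default_rng(1)
vs = [rng.standard_normal(k) + 1j * rng.standard_normal(k) for _ in range(3)]
Q = accumulate_reflectors(vs)       # transformations from the diagonal block
B = rng.standard_normal((k, 10))    # an off-diagonal block
B_new = Q.conj().T @ B              # one GEMM instead of 3 rank-1 updates
```

Because Q is a product of exact reflectors it is unitary, and applying Q^H to the off-diagonal block in one GEMM gives the same result as applying the reflectors one by one.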
Modification of the algorithm (1): reformulation as a recursive algorithm
• Apply the same blocking idea recursively inside the diagonal update: at recursion level d = 1, the chase of m/2 bulges by k rows is subdivided into q smaller chases, each moving (m/2)/q bulges by k/q rows.
• The diagonal update at the inner level is done sequentially; the off-diagonal updates at both levels become MatMult.
[Figure: the recursive blocking for recursion level d = 1.]

Modification of the algorithm (2)
• Deflation
– The trailing submatrix becomes isolated.
• Eigensolution of the isolated submatrix
– Apply the double-shift QR algorithm.
– The size of this submatrix increases with m.
• Division of the update operations
– Update the diagonal block until convergence (sequentially), accumulating the Householder transformations used in the update into a unitary matrix.
– Update the off-diagonal block only once, at the end, by multiplying it by the unitary matrix (MatMult).
• This reduces the computational work.

Numerical experiments
• Test problem
– Random matrices with elements in [0, 1], reduced to Hessenberg form by Householder's method.
– Compute all the eigenvalues / eigenvectors.
• Computational environment
– Xeon 3.2 GHz, 8 GB memory; Fortran 77, double precision
– ClearSpeed Advance board (two CSX600s)
– Matrix multiplication: ClearSpeed's library for large MatMult, Intel Math Kernel Library for small MatMult
• Comparison
– Existing algorithm (small-bulge multishift QR): the MatMult part is the off-diagonal update, run on the CSX600.
– Our algorithm (multishift QR + recursion): the MatMult parts are the off-diagonal update, the diagonal update, and the eigensolution of the isolated submatrix.
• Parameter values
– Number of shifts: m is chosen optimally for each case.
– Row length of bulge chasing: k = (3/2)m
– Level of recursion: d = 1
– Number of subdivisions: q = m / 40

Effect of our modifications (CSX600 used in all cases)
[Figure: execution time (sec) broken down into off-diagonal update, diagonal update, isolated eigenproblem, and others, for the original and the proposed algorithm; (n = 3000, m = 160, q = 4) and (n = 6000, m = 200, q = 5).]
• Our algorithm is 1.4 times faster.
– Diagonal update: 1.5 times faster
– Eigensolution of the isolated submatrix: 10 times faster

Effect of using the CSX600
[Figure: execution time (sec), with the same breakdown, on the Xeon alone versus Xeon + CSX600, for n = 6000 (m = 100, q = 5 without and m = 200, q = 5 with the board) and n = 12000 (m = 100, q = 5 without and m = 240, q = 6 with the board).]
• By combining the CSX600 with our algorithm, we obtain
– a 3.5 times speedup when n = 6000
– a 3.8 times speedup when n = 12000

Conclusion
• We proposed an approach to accelerating the solution of the nonsymmetric eigenproblem using a floating-point accelerator.
• As a basis, we used the small-bulge multishift QR algorithm, which can exploit matrix multiplications efficiently.
• By reformulating part of the algorithm as a recursive algorithm, we reduced the computational time spent in the non-blocked parts. This enables us to use a large block size (number of shifts) with small overhead and to exploit the performance of the floating-point accelerator.
• When solving an eigenproblem of order 12,000, our algorithm is 1.4 times faster than the original small-bulge multishift QR algorithm, and we obtained a 3.8 times speedup with the CSX600 over the 3.2 GHz Xeon.
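The recursive reformulation summarized in the conclusion can be illustrated with a small sketch. This is a hypothetical model of the task structure only, not the authors' implementation: at each recursion level, a chase of b bulges by k rows is subdivided into q chases of b/q bulges by k/q rows.

```python
def chase_tasks(bulges, rows, q, depth):
    """Leaf-level chasing tasks produced by the recursive subdivision."""
    if depth == 0:
        return [(bulges, rows)]
    tasks = []
    for _ in range(q):  # q subdivisions, each with 1/q of the bulges and rows
        tasks += chase_tasks(bulges // q, rows // q, q, depth - 1)
    return tasks

# m = 160 shifts -> 80 bulges, k = (3/2)m = 240 rows, q = m/40 = 4, d = 1
print(chase_tasks(80, 240, 4, 1))  # four chases of 20 bulges by 60 rows each
```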