SC14 Poster: GPU Acceleration

GPU Acceleration of Small Dense Matrix Computation of One-Sided Factorizations
T. Dong, M. Gates, A. Haidar, P. Luszczek, S. Tomov

OPTIMIZATION TECHNIQUES FOR DEVELOPING HIGH-PERFORMANCE BATCHED FACTORIZATIONS

DEFINITION

Batched factorization: the factorization of a set of k small matrices A1, A2, ..., Ak in parallel. At step i, each matrix Aj is partitioned into its already factored part, the current panel Pji, and the trailing matrix Aji; the panels P1i, ..., Pki are factored together and the trailing matrices A1i, ..., Aki are updated together.

Numerous applications require the factorization of many small matrices:
• High-order FEM simulations
• Direct sparse solvers
• Structural mechanics
• Astrophysics
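
To make the batched layout concrete, the sketch below stores the k matrices as an array of device pointers, one pointer per small matrix. This is a minimal illustration only; the names (batch_t, make_batch, d_A_array) are assumptions for this sketch and are not taken from the poster or from MAGMA's actual interface.

// Illustrative batched-matrix container: k small n-by-n matrices,
// each stored column-major in its own device buffer.
#include <cuda_runtime.h>
#include <vector>

struct batch_t {
    int      n;            // dimension of each (square) matrix
    int      ldda;         // leading dimension on the device
    int      batch_count;  // k, the number of matrices
    double **d_A_array;    // device array of k device pointers
};

// Allocate the k matrices and the pointer array on the GPU.
static batch_t make_batch(int n, int batch_count) {
    batch_t b{n, n, batch_count, nullptr};
    std::vector<double*> h_ptrs(batch_count);
    for (int j = 0; j < batch_count; ++j)
        cudaMalloc((void**)&h_ptrs[j], sizeof(double) * size_t(b.ldda) * n);
    cudaMalloc((void**)&b.d_A_array, sizeof(double*) * batch_count);
    cudaMemcpy(b.d_A_array, h_ptrs.data(),
               sizeof(double*) * batch_count, cudaMemcpyHostToDevice);
    return b;
}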

PROBLEM

How does dgetf2 work?

Classical (non-blocked) dgetf2 processes one column at a time:
1. Find the maximum absolute value of the column below the diagonal (the "pivot").
2. Swap the pivot row with the current row (a row of nb columns).
3. Scale the column below the diagonal by the inverse of the pivot.
4. Update the columns to the right of the current column with dger.
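
For reference, a minimal CPU sketch of these four steps on one m-by-nb, column-major panel is given below. It illustrates the algorithm only; it is not the GPU kernel used on the poster.

// Reference (CPU) sketch of the classical non-blocked getf2 on one panel.
// A is m-by-nb, column-major with leading dimension lda; ipiv is 1-based.
#include <cmath>
#include <utility>

void getf2_reference(int m, int nb, double *A, int lda, int *ipiv) {
    for (int j = 0; j < nb; ++j) {
        // 1. Find the pivot: max |A(i,j)| for i >= j.
        int piv = j;
        for (int i = j + 1; i < m; ++i)
            if (std::fabs(A[i + j*lda]) > std::fabs(A[piv + j*lda])) piv = i;
        ipiv[j] = piv + 1;
        // 2. Swap row j with the pivot row across the nb panel columns.
        if (piv != j)
            for (int c = 0; c < nb; ++c)
                std::swap(A[j + c*lda], A[piv + c*lda]);
        // 3. Scale the column below the diagonal by 1/pivot.
        double inv = 1.0 / A[j + j*lda];
        for (int i = j + 1; i < m; ++i) A[i + j*lda] *= inv;
        // 4. Rank-1 update of the columns to the right (the dger step).
        for (int c = j + 1; c < nb; ++c)
            for (int i = j + 1; i < m; ++i)
                A[i + c*lda] -= A[i + j*lda] * A[j + c*lda];
    }
}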

Bottlenecks:
• The panel factorization (dgetf2) requires 30% of the global time.
• It is memory bound.
• A synchronization is required between each kernel launch.

PROPOSED SOLUTIONS

• A natural way to implement a low-level panel routine would be to load the panel into the GPU's shared memory and do the entire computation there before writing the result back to main memory. While this is easy to implement, it cannot provide good performance, for two main reasons:
  – A GPU's shared memory is currently limited to only 48 KB per streaming multiprocessor (SMX) on the newest NVIDIA Kepler architecture, which limits the panel size.
  – Saturating the shared memory per SMX can decrease performance, since only one thread block will be mapped to that SMX at a time, resulting in low occupancy and, subsequently, poor core utilization.
• Since a CUDA warp consists of 32 threads, it is recommended to develop CUDA kernels that use a multiple of 32 threads per thread block.
• We therefore implemented all the low-level BLAS routines, such as dscal, dnorm, and dger, in a batched style that minimizes the use of shared memory; for example, we load only one column into shared memory during the LU dgetf2. A sketch of one such batched kernel is given after this list.
• We found that such an implementation allows many thread blocks to be executed by the same SMX in parallel, taking better advantage of its resources.
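
The kernel below is a minimal sketch of this batched style for the column-scaling step of getf2: one thread block per matrix in the batch, a multiple of 32 threads per block, and only a single broadcast value kept in shared memory. The kernel name and the 64-thread launch configuration are illustrative choices, not the poster's actual code.

// Illustrative batched "scale the column below the diagonal" kernel.
#include <cuda_runtime.h>

__global__ void batched_dscal_column(int m, int j, double **dA_array, int ldda)
{
    double *A = dA_array[blockIdx.x];   // this block's matrix
    __shared__ double inv_pivot;        // only one value kept in shared memory
    if (threadIdx.x == 0)
        inv_pivot = 1.0 / A[j + j * ldda];
    __syncthreads();

    // Threads stride over the column entries below the diagonal.
    for (int i = j + 1 + threadIdx.x; i < m; i += blockDim.x)
        A[i + j * ldda] *= inv_pivot;
}

// Launch: one block per matrix, 64 threads (a multiple of the 32-thread warp).
// batched_dscal_column<<< batch_count, 64 >>>(m, j, d_A_array, ldda);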

MULTI LEVEL BLOCKING

Proposition:
• Nested blocking: develop a nested blocking technique that also blocks the panel itself (a blocked dgetf2), in order to replace the dger kernel with a dgemm kernel. The blocked panel of width NB is split into sub-panels of width NNB: once a sub-panel is factored, a DTRSM computes the corresponding block row of U from the lower-triangular factor L, and a DGEMM updates the remaining part of the panel, while the part of U that is already done is not touched again (see the sketch below).
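
The following CPU reference sketch shows the arithmetic of this two-level blocking on a single m-by-NB panel, reusing getf2_reference (and the headers) from the earlier sketch; the batched GPU code follows the same structure, with batched kernels in place of the loops. It is an illustration under those assumptions, not the poster's actual routine.

// Blocked getf2: factor the panel by sub-panels of width NNB, so that the
// bulk of the panel update is a gemm (step 3) instead of a rank-1 dger.
void blocked_panel_factor(int m, int NB, int NNB,
                          double *A, int lda, int *ipiv)
{
    for (int k = 0; k < NB; k += NNB) {
        int jb = (NNB < NB - k) ? NNB : NB - k;
        // 1. Factor the (m-k)-by-jb sub-panel with the classical algorithm.
        getf2_reference(m - k, jb, A + k + k*lda, lda, ipiv + k);
        // Apply the sub-panel's row swaps to the other panel columns.
        for (int j = 0; j < jb; ++j) {
            int p = k + (ipiv[k + j] - 1);   // pivot row in panel-global indexing
            ipiv[k + j] = p + 1;
            if (p != k + j)
                for (int c = 0; c < NB; ++c) {
                    if (c >= k && c < k + jb) continue;   // already swapped
                    std::swap(A[(k + j) + c*lda], A[p + c*lda]);
                }
        }
        // 2. DTRSM step: U12 = L11^{-1} * A12 (unit lower triangular solve).
        for (int c = k + jb; c < NB; ++c)
            for (int i = 0; i < jb; ++i)
                for (int r = 0; r < i; ++r)
                    A[(k+i) + c*lda] -= A[(k+i) + (k+r)*lda] * A[(k+r) + c*lda];
        // 3. DGEMM step (replaces dger): A22 = A22 - L21 * U12.
        for (int c = k + jb; c < NB; ++c)
            for (int i = k + jb; i < m; ++i) {
                double s = 0.0;
                for (int r = 0; r < jb; ++r)
                    s += A[i + (k+r)*lda] * A[(k+r) + c*lda];
                A[i + c*lda] -= s;
            }
    }
}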

PARALLEL PIVOTING IN LU

How does the LU swap work?

CLASSICAL SEQUENTIAL SWAP

Given pivot indices ipiv[1, …, nb]:
  For row i = 1 to nb
    Swap row i with row ipiv[i] (a row of N columns)

Two rows, i1 and i2, can both swap with the same row k, creating a data dependency, so the swaps must be applied one after another.

Bottlenecks:
• The swap is sequential and data dependent.
• Data transfer is not coalesced: the GPU cannot read 32 values in parallel unless the matrix is stored row-wise. However, if the matrix is stored row-wise the swap is fast, but the other components become very slow.
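
A minimal sketch of this sequential swap (the dlaswp pattern) on one column-major, N-column matrix is shown below; the loop over i cannot be parallelized as written, because row ipiv[i] may itself be a row that a later iteration swaps again.

#include <utility>

// Apply the nb row interchanges recorded in ipiv (1-based, LAPACK style).
void sequential_swap(int nb, int N, double *A, int lda, const int *ipiv)
{
    for (int i = 0; i < nb; ++i) {          // must run in order
        int p = ipiv[i] - 1;
        if (p != i)
            for (int c = 0; c < N; ++c)     // row of N columns, stride lda
                std::swap(A[i + c*lda], A[p + c*lda]);
    }
}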

PARALLEL SWAP

Proposition:
• Develop a parallel version of the swap:
  Compute the final index ppiv[i] for each row i.
  For column j = 1 to N, swap entries 1, …, nb with entries ppiv[1, …, nb] in parallel.
• This improves the write-back of the swapped rows, as the data transfer is now coalesced.
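
The out-of-place sketch below illustrates the idea: the final index array is computed once by simulating the sequence of interchanges, and a kernel then writes every destination row in parallel, so the writes within a column are contiguous (coalesced). The second buffer, the kernel name, and the choice to rewrite all m rows are simplifications for clarity and are not the poster's actual in-place kernel.

#include <cuda_runtime.h>
#include <utility>

// Host: turn the sequential ipiv (1-based) into a final gather index:
// after all swaps, row r of the result equals row ppiv[r] of the input.
void compute_final_indices(int m, int nb, const int *ipiv, int *ppiv)
{
    for (int r = 0; r < m; ++r) ppiv[r] = r;
    for (int i = 0; i < nb; ++i) std::swap(ppiv[i], ppiv[ipiv[i] - 1]);
}

// Device: one thread per (row, column) element of the destination;
// consecutive threads write consecutive rows of a column => coalesced writes.
__global__ void parallel_swap_kernel(int m, int N,
                                     const double *A, int lda,
                                     double *B, int ldb, const int *d_ppiv)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;  // destination row
    int j = blockIdx.y;                             // column
    if (r < m && j < N)
        B[r + j * ldb] = A[d_ppiv[r] + j * lda];
}

// Launch sketch: dim3 grid((m + 127) / 128, N);
// parallel_swap_kernel<<< grid, 128 >>>(m, N, dA, lda, dB, ldb, d_ppiv);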

USING GPU STREAM

[Figure: performance (GFlop/s) of streamed dgemm vs. batched dgemm for K = 32, 64, and 128, as a function of the matrix size m = n (32 to 512).]

Bottlenecks:
• The batched gemm kernel from cuBLAS is well suited for small matrix sizes (< 128), but stagnates for larger sizes (> 128).

Proposition:
• Streamed gemm provides higher performance for large matrix sizes (> 128); we therefore propose switching between streamed and batched gemm according to the size of the trailing matrix update, as in the sketch below.
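
The host-side sketch below shows one way to implement this switch with the cuBLAS calls named on the poster. It assumes column-major matrices, that both host and device copies of the per-matrix pointers are available, and that streams have already been created; the 128 threshold follows the observation above, and the function and parameter names are illustrative.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// C_i = C_i - A_i * B_i for a batch of m x n x k trailing-matrix updates.
void trailing_update(cublasHandle_t handle, int m, int n, int k,
                     const double *const *dA_array, int lda,  // device array of ptrs
                     const double *const *dB_array, int ldb,
                     double *const *dC_array, int ldc,
                     const double *const *hA_array,           // host copies of the ptrs
                     const double *const *hB_array,
                     double *const *hC_array,
                     int batch_count, cudaStream_t *streams, int nstreams)
{
    const double alpha = -1.0, beta = 1.0;
    if (m < 128) {
        // Small trailing matrices: a single batched gemm call.
        cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                           &alpha, dA_array, lda, dB_array, ldb,
                           &beta, dC_array, ldc, batch_count);
    } else {
        // Larger trailing matrices: one gemm per matrix, spread over streams.
        for (int i = 0; i < batch_count; ++i) {
            cublasSetStream(handle, streams[i % nstreams]);
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha, hA_array[i], lda, hB_array[i], ldb,
                        &beta, hC_array[i], ldc);
        }
    }
}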

PROFILER TRACING

Profiler traces of the batched LU before and after the proposed optimizations show where the time goes:
• Panel: the classical getf2 takes about 30% of the execution time; the blocked getf2 reduces it to about 8%.
• Swap: the sequential swap kernel takes about 60% of the time; the parallel swap reduces it to about 10%.
• gemm: the share of the gemm kernel grows from about 15% to about 30%, so a larger fraction of the time is spent in the trailing-matrix update, executed as streamed or batched dgemm.

LU PERFORMANCE

[Figure: Batched dgetrf, count = 2000. GFlop/s vs. matrix size (0 to 600) for MAGMA w/ streamed/batched gemm, MAGMA w/ recursive blocking, MAGMA w/ parallel swap, MAGMA w/ classic blocked algorithm, MAGMA w/ non-blocked algorithm, and cuBLAS. Inset: speedup of the solver for matrix size 150, comparing MAGMA, MKL, and MA48.]

QR PERFORMANCE

[Figure: Batched dgeqrf, count = 2000. GFlop/s vs. matrix size (0 to 600); curves include MAGMA w/ streamed/batched gemm, MAGMA w/ recursive blocking, and MAGMA w/ classic blocked algorithm.]

CHOLESKY PERFORMANCE

[Figure: Batched dpotrf, count = 2000. GFlop/s vs. matrix size (0 to 600); curves include MAGMA w/ recursive blocking and MAGMA w/ classic blocked algorithm.]
600