OPTIMIZATION TECHNIQUES FOR DEVELOPING HIGH-PERFORMANCE BATCHED FACTORIZATIONS

GPU Acceleration of Small Dense Matrix Computation of One-Sided Factorizations
T. Dong, M. Gates, A. Haidar, P. Luszczek, S. Tomov

DEFINITION

Batched factorization: the factorization of a set of small matrices in parallel — k matrices A1, A2, ..., Ak, each split into its already-factored part, the current panel P1i ... Pki, and the trailing matrix A1i ... Aki to be updated.

Numerous applications require the factorization of many small matrices:
• Structural mechanics
• Astrophysics
• High-order FEM simulations
• Direct sparse solvers

PROBLEM

How does dgetf2 work? The classical (non-blocked) dgetf2 processes the panel column by column. For column j = 1 to N:
1 Find the max absolute value of the column below the diagonal — the "pivot".
2 Swap the rows (of size nb columns).
3 Scale the column below the diagonal by the inverse of the pivot.
4 Update the trailing part to the right of the column with dger.

Bottlenecks:
• The panel factorization requires 30% of the global time.
• It is memory bound.

PROFILER TRACING
[Trace: the panel takes 30% of the runtime with the classical getf2; with the blocked getf2 it drops to 8%.]

PROPOSED SOLUTIONS

• Nested blocking fashion: develop a nested blocking technique that also blocks the panel, in order to replace the dger kernel with a dgemm kernel (blocked dgetf2).
• A natural way to implement a low-level panel routine would be to load the panel into the GPU's shared memory and do the entire computation there before writing the result back to main memory. While this can be easily implemented, it cannot provide good performance, for two main reasons:
  – A GPU's shared memory is currently limited to only 48 KB per streaming multiprocessor (SMX) on the newest NVIDIA Kepler architecture, thereby limiting the panel size.
  – Saturating the shared memory of an SMX can decrease performance, since only one thread block will then be mapped to that SMX at a time, resulting in low occupancy and, subsequently, poor core utilization.
• Since a CUDA warp consists of 32 threads, it is recommended to develop CUDA kernels that use a multiple of 32 threads per thread block.
• We implemented all the low-level BLAS routines (dscal, dnorm, dger) in a batched style that minimizes the use of shared memory; for example, we load only one column into shared memory during the LU dgetf2. We found that such an implementation allows many thread blocks to be executed by the same SMX in parallel, taking better advantage of its resources.
• Remaining bottleneck: the synchronization between each kernel launch.

MULTI LEVEL BLOCKING
[Diagram: the NB-wide panel is blocked again into NNB-wide sub-panels; each sub-panel step performs a DTRSM on its block row and a DGEMM update of the rest, with the already-done part U above.]
[Figure: blocked panel speedup.]

PARALLEL PIVOTING IN LU

How does the LU swap work?

CLASSICAL SEQUENTIAL SWAP
Given pivot indices ipiv[ 1, …, nb ]:
  For row i = 1 to nb
    Swap row i with row ipiv[ i ], of size N columns.
Two rows, i1 and i2, can both swap with the same row, k, creating a data dependency.

Bottleneck:
• The swap is sequential and data dependent.

PARALLEL SWAP
Proposition:
• Develop a parallel version of the swap:
    Compute the final index ppiv[ i ] for each row i.
    For column j = 1 to N, swap entries 1, …, nb with entries ppiv[ 1, …, nb ] in parallel.
• Data transfer is not coalesced: the GPU cannot read 32 values in parallel unless the matrix is stored row-wise. However, if the matrix is stored row-wise, the swap is fast, but the other components become very slow.
• The parallel swap improves the write-back of the swapped rows, as the data transfer is now coalesced.

PROFILER TRACING
[Trace: with the sequential swap, the swap kernel takes 60% of the runtime and the gemm kernel 15%; with the parallel swap, the swap kernel drops to 10% and the gemm kernel rises to 30%.]

USING GPU STREAM

Bottleneck:
• The batched gemm kernel from cuBLAS is well suited for small matrix sizes (<128), but stagnates for large sizes (>128).
Proposition:
• Streamed gemm can provide higher performance for large matrix sizes (>128); we therefore propose switching between streamed and batched gemm according to the size of the trailing matrix update.
[Figure: streamed vs. batched dgemm, Gflop/s vs. matrix size m = n (32 … 512), for K = 32, 64, 128.]

LU PERFORMANCE
[Figure: batched dgetrf, count = 2000 — Gflop/s vs. matrix size (0 … 600), comparing MKL, MA48, cuBLAS, MAGMA w/ non-blocked algorithm, MAGMA w/ classic blocked algorithm, MAGMA w/ parallel swap, MAGMA w/ recursive blocking, and MAGMA w/ streamed/batched gemm.]
[Figure: speedup of the solver for matrix size 150.]

QR PERFORMANCE
[Figure: batched dgeqrf, count = 2000 — Gflop/s vs. matrix size, comparing MAGMA w/ recursive blocking and MAGMA w/ streamed/batched gemm.]

CHOLESKY PERFORMANCE
[Figure: batched dpotrf, count = 2000 — Gflop/s vs. matrix size (0 … 600), including MAGMA w/ classic blocked algorithm.]
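The four classical dgetf2 steps described above can be sketched as a plain-Python CPU reference (an illustration of the algorithm only, not the MAGMA GPU code; the function name `getf2_unblocked` is ours). A is an m x n panel stored as a list of rows; the factors L (unit lower) and U overwrite A, and the 0-based pivot indices are returned:

```python
def getf2_unblocked(A):
    """Unblocked LU panel factorization with partial pivoting (dgetf2 sketch)."""
    m, n = len(A), len(A[0])
    ipiv = []
    for j in range(n):
        # 1. find the max absolute value on/below the diagonal: the "pivot"
        p = max(range(j, m), key=lambda i: abs(A[i][j]))
        ipiv.append(p)
        # 2. swap the pivot row with the current row
        A[j], A[p] = A[p], A[j]
        # 3. scale the column below the diagonal by the inverse of the pivot
        inv = 1.0 / A[j][j]
        for i in range(j + 1, m):
            A[i][j] *= inv
        # 4. rank-1 update (dger) of the trailing part to the right
        for i in range(j + 1, m):
            for k in range(j + 1, n):
                A[i][k] -= A[i][j] * A[j][k]
    return ipiv
```

Step 4 is the dger call whose replacement by dgemm motivates the nested blocking proposed above.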
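The nested-blocking idea — blocking inside the panel so that the rank-1 dger update becomes a block (dgemm-style) update — can likewise be sketched in plain Python. This is a CPU illustration under our own naming (`getf2_nested`, inner width `ib`); on the GPU the two trailing loops map to dtrsm and dgemm kernels:

```python
def getf2_nested(A, ib=2):
    """Blocked LU panel: factor ib-wide sub-panels, then update via trsm + gemm."""
    m, n = len(A), len(A[0])
    ipiv = []
    for j in range(0, n, ib):
        jb = min(ib, n - j)
        # factor the ib-wide sub-panel with the unblocked kernel;
        # pivots are searched in the full column below row jj
        for jj in range(j, j + jb):
            p = max(range(jj, m), key=lambda i: abs(A[i][jj]))
            ipiv.append(p)
            A[jj], A[p] = A[p], A[jj]          # swap full rows
            inv = 1.0 / A[jj][jj]
            for i in range(jj + 1, m):
                A[i][jj] *= inv
            for i in range(jj + 1, m):          # rank-1 update restricted
                for k in range(jj + 1, j + jb): # to the sub-panel only
                    A[i][k] -= A[i][jj] * A[jj][k]
        # triangular solve (dtrsm) on the block row to the right
        for k in range(j + jb, n):
            for jj in range(j, j + jb):
                for i in range(j, jj):
                    A[jj][k] -= A[jj][i] * A[i][k]
        # one block update (dgemm) replaces many dger calls
        for i in range(j + jb, m):
            for k in range(j + jb, n):
                for t in range(j, j + jb):
                    A[i][k] -= A[i][t] * A[t][k]
    return ipiv
```

The arithmetic is identical to the unblocked version; only the grouping changes, which is what lets the GPU spend its time in a compute-bound dgemm instead of a memory-bound dger.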
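The classical sequential swap described above, in a minimal sketch (our naming, following the dlaswp pattern). The example pivot vector makes rows 0 and 1 both swap with the same row 2 — the data dependency that forces the swaps to run in order:

```python
def laswp_sequential(A, ipiv):
    """Apply LU row interchanges in order: row i <-> row ipiv[i] (N columns wide)."""
    for i, p in enumerate(ipiv):
        A[i], A[p] = A[p], A[i]
    return A

# Rows 0 and 1 both swap with row 2: row 1's swap depends on
# row 0's swap having already happened, so the loop is sequential.
A = [[0], [1], [2]]
laswp_sequential(A, [2, 2])
```

A naive "parallel" version in which both swaps read the original row 2 would produce the wrong result, which is why the order matters.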
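The parallel-swap proposition — first compute the final index ppiv[ i ] for each row i, then move every column independently — can be sketched as follows (a CPU illustration with our own names; on the GPU each column j is handled by a different thread, so the reads and writes coalesce):

```python
def compute_ppiv(ipiv, m):
    """Simulate the sequential swaps on an index vector only: after the
    loop, idx[i] is the original row that ends up in row i."""
    idx = list(range(m))
    for i, p in enumerate(ipiv):
        idx[i], idx[p] = idx[p], idx[i]
    return idx

def laswp_parallel(A, ipiv):
    """Once ppiv is known, the swap is a pure gather: every (i, j) entry
    can be moved independently, so all N columns proceed in parallel."""
    ppiv = compute_ppiv(ipiv, len(A))
    return [A[p][:] for p in ppiv]
```

The cheap sequential part now touches only an index vector of length m, while the expensive data movement has no dependencies left.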
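The 48 KB shared-memory limit quoted above directly caps the panel that the load-everything approach could hold; a quick back-of-the-envelope check (the helper name is ours):

```python
SMEM_PER_SMX = 48 * 1024   # bytes per Kepler SMX, as cited above
BYTES_PER_DOUBLE = 8

def max_panel_rows(nb):
    # Tallest m x nb double-precision panel that fits in shared memory.
    # Even reaching this fills the SMX with a single thread block --
    # the low-occupancy problem noted above.
    return SMEM_PER_SMX // (BYTES_PER_DOUBLE * nb)
```

For a typical nb = 32 panel this caps m at 192 rows, far below the matrix sizes in the performance plots, which is why the batched kernels keep only one column in shared memory instead.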
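The proposed streamed/batched switch amounts to a size test on the trailing matrix update; a schematic dispatcher (function and constant names are hypothetical — the crossover of 128 is the one measured above):

```python
GEMM_CROSSOVER = 128  # cuBLAS batched gemm stagnates beyond this size

def choose_gemm_variant(trailing_n):
    # Below the crossover, the cuBLAS batched gemm kernel wins;
    # above it, launching one gemm per CUDA stream is faster.
    return "batched" if trailing_n < GEMM_CROSSOVER else "streamed"
```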