Toward High Performance Matrix Multiplication for Exact Computation
Pascal Giorgi
Joint work with Romain Lebreton (U. Waterloo)
Funded by the French ANR project HPAC
Séminaire CASYS - LJK, April 2014

Motivations

Matrix multiplication plays a central role in computer algebra.
- algebraic complexity of O(n^ω) with ω < 2.3727 [Williams 2011]
Modern processors provide many levels of parallelism:
- superscalar execution, SIMD units, multiple cores

High performance matrix multiplication:
✓ numerical computing = classic algorithm + hardware arithmetic
✗ exact computing ≠ numerical computing
- the algebraic algorithm is not the most efficient one (≠ complexity model)
- the arithmetic is not directly in the hardware (e.g. Z, Fq, Z[x], Q[x,y,z])

Motivation: superscalar processor with SIMD

Hierarchical memory:
- L1 cache: 32 kB - 4 cycles
- L2 cache: 256 kB - 12 cycles
- L3 cache: 8 MB - 36 cycles
- RAM: 32 GB - 36 cycles + 57 ns

Motivations: practical algorithms

High performance algorithms (rules of thumb):
- the best asymptotic complexity is not always the fastest: constants matter
- a better arithmetic count is not always faster: caches matter
- process multiple data at the same time: vectorization
- fine/coarse grain task parallelism matters: multicore parallelism

Our goal: incorporate these rules into exact matrix multiplication.

Outline
1. Matrix multiplication with small integers
2. Matrix multiplication with multi-precision integers
3. Matrix multiplication with polynomials

Matrix multiplication with small integers

This corresponds to the case where each entry of the result fits in one processor register:
A, B ∈ Z^{n×n} such that ||AB||∞ < 2^s, where s is the register size.

Main interests:
- ring isomorphism: computation over Z/pZ can be carried out in Z/2^sZ whenever n(p−1)² < 2^s
- it is a building block for matrix multiplication with larger integers

Two possibilities for hardware support:
- use the floating-point mantissa, i.e. s = 53
- use native integers, i.e. s = 64
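As an illustration of the floating-point option, here is a minimal C++ sketch (illustrative, not the talk's actual code; the name mul_mod_p is hypothetical). The product is accumulated exactly in doubles and reduced once at the end; this is exact precisely when n(p−1)² < 2^53, the bound above. In practice the two inner loops would be a single BLAS dgemm call.

#include <cmath>
#include <cstddef>
#include <vector>

// Exact C = A*B mod p using double-precision accumulation.
// Entries of A and B are assumed reduced in [0, p).
// Valid when n * (p-1)^2 < 2^53, so every dot product fits
// exactly in the 53-bit double mantissa.
void mul_mod_p(const std::vector<double>& A,
               const std::vector<double>& B,
               std::vector<double>& C,
               std::size_t n, double p) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;                   // exact integer arithmetic in double
            for (std::size_t l = 0; l < n; ++l)
                acc += A[i*n + l] * B[l*n + j]; // in practice: one BLAS dgemm call
            C[i*n + j] = std::fmod(acc, p);     // single reduction at the end
        }
}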
Using floating point
- historically, the first approach in computer algebra [Dumas, Gautier, Pernet 2002]
✓ out-of-the-box performance from optimized BLAS
✗ handles matrices with entries < 2^26

Using native integers
✓ apply the same optimizations as BLAS libraries [Goto, Van De Geijn 2008]
✓ handle matrices with entries < 2^32

Number of vector operations per cycle (pipelined):
- Nehalem (2008), SSE4, 128 bits: floating point 1 mul + 1 add; integer 1 mul + 2 add
- Sandy Bridge (2011), AVX, 256 bits: floating point 1 mul + 1 add; integer 1 mul + 2 add
- Haswell (2013), AVX2, 256 bits: floating point 2 FMA

[Plot: GFLOPS vs matrix dimension (32 to 8192) for OpenBLAS double (SSE, AVX, AVX2) and our integer code (SSE, AVX2); benchmark on Intel i7-4960HQ @ 2.60GHz]

Matrix multiplication modulo a small integer

Let p be such that (p−1)² × n < 2^53.
1. perform the multiplication over Z using BLAS
2. reduce the result modulo p

[Plot: GFLOPS vs matrix dimension (32 to 8192) for OpenBLAS, mod p Classic and mod p Winograd; benchmark on Intel i7-4960HQ @ 2.60GHz]

Matrix multiplication with multi-precision integers

Direct approach
Let M(k) be the bit complexity of k-bit integer multiplication, and A, B ∈ Z^{n×n} such that ||A||∞, ||B||∞ ∈ O(2^k).
Computing AB with the direct algorithm costs O(n^ω M(k)) bit operations.
✗ not the best possible complexity, since M(k) is super-linear
✗ not efficient in practice
Remark: use an evaluation/interpolation technique for better performance!

Multi-modular approach

Let ||AB||∞ < M = m_1 · m_2 · · · m_k, with primes m_i ∈ O(1). Then AB can be reconstructed with the CRT from the (AB) mod m_i:
1. for each m_i, compute A_i = A mod m_i and B_i = B mod m_i
2. for each m_i, compute C_i = A_i B_i mod m_i
3. reconstruct C = AB from (C_1, ..., C_k)

Bit complexity: O(n^ω k + n² R(k)), where R(k) is the cost of reduction/reconstruction:
- R(k) = O(M(k) log(k)) using a divide-and-conquer strategy
- R(k) = O(k²) using the naive approach

Improving the naive approach with linear algebra

The reduction/reconstruction of n² values corresponds to a matrix multiplication:
✓ improves the bit complexity from O(n² k²) to O(n² k^{ω−1})
✓ benefits from optimized matrix multiplication, i.e. SIMD
Remark: a similar approach has been used by [Doliskani, Schost 2010] in a non-distributed code.

Multi-modular reduction of an integer matrix

Assume M = m_1 · · · m_k < β^k with m_i < β.

Multi-reduction of a single entry: let a = a_0 + a_1 β + ... + a_{k−1} β^{k−1} be the value to reduce modulo the m_i. Then

  [ |a|_{m_1} ]   [ 1  |β|_{m_1}  ...  |β^{k−1}|_{m_1} ]   [ a_0     ]       [ m_1 ]
  [    ...    ] = [ .     ...     ...        ...       ] × [ ...     ] − Q × [ ... ]
  [ |a|_{m_k} ]   [ 1  |β|_{m_k}  ...  |β^{k−1}|_{m_k} ]   [ a_{k−1} ]       [ m_k ]

with ||Q||∞ < kβ². In other words, multiplying the k×k matrix (|β^j|_{m_i}) by the vector of β-adic digits of a gives values that are congruent to a modulo each m_i and bounded by kβ²; a final cheap correction (the Q term) yields the |a|_{m_i}.

Lemma: if kβ² ∈ O(1), then the reduction of n² integers modulo the m_i costs O(n² k^{ω−1}) + O(n² k) bit operations.
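The lemma translates directly into code: stack the β-adic digits of all entries into a k×N matrix and multiply it by the precomputed table |β^j|_{m_i}. A minimal sketch with uint64 accumulation follows (the talk instead converts to double and calls BLAS dgemm; all names here are illustrative):

#include <cstddef>
#include <cstdint>
#include <vector>

// Reduce N integers, given by their beta-adic digits, modulo k moduli at once,
// as a (k x k) by (k x N) matrix product.
//   T[i][j]      = beta^j mod m[i]                (k x k precomputed table)
//   digits[j][e] = j-th beta-adic digit of entry e (k x N)
//   out[i][e]    = entry e mod m[i]
// With beta = 2^16 each product is < 2^32, so the accumulator (bounded by
// k * beta^2) never overflows a 64-bit word for any realistic k.
void multi_mod_reduce(const std::vector<std::uint64_t>& m,
                      const std::vector<std::vector<std::uint64_t>>& T,
                      const std::vector<std::vector<std::uint64_t>>& digits,
                      std::vector<std::vector<std::uint64_t>>& out) {
    const std::size_t k = m.size(), N = digits[0].size();
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t e = 0; e < N; ++e) {
            std::uint64_t acc = 0;               // bounded by k * beta^2
            for (std::size_t j = 0; j < k; ++j)
                acc += T[i][j] * digits[j][e];   // the (k x k)(k x N) product
            out[i][e] = acc % m[i];              // the "- Q x (m_1..m_k)" correction
        }
}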
Multi-modular reconstruction of an integer matrix

Assume M = m_1 · · · m_k < β^k with m_i < β, and let M_i = M/m_i.

CRT formula: a = ( Σ_{i=1..k} |a|_{m_i} · M_i · |M_i^{−1}|_{m_i} ) mod M

Reconstruction of a single entry: let M_i · |M_i^{−1}|_{m_i} = α_0^{(i)} + α_1^{(i)} β + ... + α_{k−1}^{(i)} β^{k−1} be the β-adic expansions of the CRT constants. Then

  [ ã_0     ]   [ α_0^{(1)}      ...  α_0^{(k)}     ]   [ |a|_{m_1} ]
  [  ...    ] = [    ...         ...     ...        ] × [    ...    ]
  [ ã_{k−1} ]   [ α_{k−1}^{(1)}  ...  α_{k−1}^{(k)} ]   [ |a|_{m_k} ]

with ã_i < kβ², and a = (ã_0 + ... + ã_{k−1} β^{k−1}) mod M is the CRT solution.

Lemma: if kβ² ∈ O(1), then the reconstruction of n² integers from their images modulo the m_i costs O(n² k^{ω−1}) + Õ(n² k) bit operations.

Matrix multiplication with multi-precision integers

Implementation of the multi-modular approach:
- choose β = 2^16 to optimize the β-adic conversions
- choose the m_i such that nβm_i < 2^53 and use BLAS dgemm
- use a linear storage for the multi-modular matrices

Sequential performance is compared with:
- the FLINT library (www.flintlib.org): divide and conquer
- the Mathemagix library (www.mathemagix.org): divide and conquer
- Doliskani's code (courtesy of J. Doliskani): uses dgemm for the reductions only

[Plots: integer matrix multiplication, time in seconds vs entry bitsize (2^4 to 2^16) for matrix dimensions 32, 128, 512 and 1024, comparing FLINT, Mathemagix, Doliskani's code and our code; benchmarks on Intel Xeon-2620 @ 2.0GHz]

Parallel multi-modular matrix multiplication

Steps 1 and 3 (multi-modular reduction/reconstruction): each thread reduces (resp. reconstructs) a chunk of the given matrix. For instance, with 4 threads and 7 moduli, thread 0 handles A_0 = A mod m_0 and A_1 = A mod m_1, thread 1 handles A_2 and A_3, and so on.

Step 2 (modular multiplications): each thread computes a bunch of the matrix multiplications mod m_i, e.g. thread 0 computes C_0 = A_0 B_0 mod m_0 and C_1 = A_1 B_1 mod m_1, thread 1 computes C_2 and C_3, and so on.

Parallel matrix multiplication with multi-precision integers:
- based on OpenMP tasks
- CPU affinity (hwloc-bind), allocator (tcmalloc)
- work still in progress for a better memory strategy!
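The talk schedules this with OpenMP tasks; a simpler sketch of step 2 with a parallel loop is shown below. The k products are independent, so no synchronization is needed. mul_mod_p is the small-prime kernel sketched earlier (hypothetical name), and the loop distribution is an assumption, not the talk's actual scheduling.

#include <cstddef>
#include <vector>

// Small-prime kernel sketched earlier (hypothetical name).
void mul_mod_p(const std::vector<double>& A, const std::vector<double>& B,
               std::vector<double>& C, std::size_t n, double p);

// Step 2: the k independent products C_i = A_i * B_i mod m_i are
// embarrassingly parallel; one OpenMP loop distributes them over the cores.
//   A[i], B[i], C[i] : n x n images modulo m[i]
void parallel_modular_products(const std::vector<std::vector<double>>& A,
                               const std::vector<std::vector<double>>& B,
                               std::vector<std::vector<double>>& C,
                               const std::vector<double>& m, std::size_t n) {
    #pragma omp parallel for schedule(dynamic)
    for (std::size_t i = 0; i < m.size(); ++i)
        mul_mod_p(A[i], B[i], C[i], n, m[i]);  // independent tasks, no sync
}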
[Plot: speedup of parallel integer matrix multiplication vs number of cores (2 to 12) for (dim, bitsize) = (1024, 1024), (1024, 4096) and (256, 8192); benchmark on Intel Xeon-2620 @ 2.0GHz (2 NUMA nodes with 6 cores each)]

Matrix multiplication over Fp[x]

We consider the "easiest" case: A, B ∈ Fp[x]^{n×n} such that
- deg(AB) < k = 2^t
- p is a Fourier prime, i.e. p = 2^t q + 1
- p is such that n(p−1)² < 2^53

Complexity: O(n^ω k + n² k log(k)) operations in Fp, using evaluation/interpolation with the FFT.
Remark: using a Vandermonde matrix, one can get an approach similar to the integer case, i.e. O(n^ω k + n² k^{ω−1}).

Evaluation/interpolation scheme

Let θ be a primitive k-th root of unity in Fp.
1. for i = 1 ... k, compute A_i = A(θ^{i−1}) and B_i = B(θ^{i−1})
2. for i = 1 ... k, compute C_i = A_i B_i over Fp
3. interpolate C = AB from (C_1, ..., C_k)

Steps 1 and 3: O(n²) calls to an FFT of length k over Fp. Step 2: k matrix multiplications modulo a small prime p.

FFT with SIMD over Fp

Butterfly operation modulo p: compute X + Y mod p and (X − Y)·θ^{2i} mod p.
- Barrett's modular multiplication with a constant (as in NTL)
- compute in [0, 2p) to remove two conditionals [Harvey 2014]

Let X, Y ∈ [0, 2p), W ∈ [0, p), p < β/4 and W' = ⌈Wβ/p⌉.

Algorithm Butterfly(X, Y, W, W', p):
1: X' := X + Y mod 2p
2: T := X − Y + 2p
3: Q := ⌈W'T/β⌉            (one high short product)
4: Y' := (WT − Qp) mod β   (two low short products)
5: return (X', Y')

✓ SSE/AVX provide 16-bit and 32-bit low short products
✗ no high short product available (a full product must be used)

Implementation:
- radix-4 FFT with 128-bit SSE (29-bit primes)
- BLAS-based matrix multiplication over Fp [FFLAS-FFPACK library]

[Plot: polynomial matrix multiplication performance, time in seconds for FLINT, Mathemagix (MMX) and our code, for matrix dimensions n from 16 to 2048 with degrees from 8192 down to 32; benchmark on Intel Xeon-2620 @ 2.0GHz]

Matrix multiplication over Z[x]

A, B ∈ Z[x]^{n×n} such that deg(AB) < d and ||(AB)_{ij}||∞ < k.
Complexity:
- Õ(n^ω d log(d) log(k)) bit operations using Kronecker substitution
- O(n^ω d log(k) + n² d log(d) log(k)) bit operations using CRT + FFT

Remark: if the result's degree and bitsize are not too large, the CRT with Fourier primes might suffice.
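The Kronecker substitution mentioned above packs a polynomial with bounded coefficients into one big integer by evaluating at x = 2^b; a single integer product then yields all coefficients of the polynomial product. A minimal GMP-based sketch (illustrative, not the talk's code; function names are hypothetical, coefficients are assumed nonnegative and to fit in an unsigned long):

#include <gmp.h>
#include <cstddef>
#include <vector>

// Pack f(x) = f[0] + f[1]x + ... into the integer f(2^b).
// b must satisfy 2^b > d * c^2 for inputs of degree < d with
// coefficients < c, so product coefficients do not overlap.
void kronecker_pack(const std::vector<unsigned long>& f, unsigned b, mpz_t out) {
    mpz_set_ui(out, 0);
    for (std::size_t i = f.size(); i-- > 0; ) {  // Horner: out = out*2^b + f[i]
        mpz_mul_2exp(out, out, b);
        mpz_add_ui(out, out, f[i]);
    }
}

// Unpack h(2^b) back into coefficients (h pre-sized to deg(f)+deg(g)+1;
// destroys its input). Valid when every coefficient is < 2^b.
void kronecker_unpack(mpz_t in, unsigned b, std::vector<unsigned long>& h) {
    mpz_t t; mpz_init(t);
    for (std::size_t i = 0; i < h.size(); ++i) {
        mpz_fdiv_r_2exp(t, in, b);     // t = in mod 2^b : coefficient i
        h[i] = mpz_get_ui(t);
        mpz_fdiv_q_2exp(in, in, b);    // in = in / 2^b
    }
    mpz_clear(t);
}

// Usage: pack f and g, one mpz_mul(H, F, G), then unpack H.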
Matrix multiplication over Z[x]

Implementation:
- use the CRT with Fourier primes
- re-use the multi-modular reduction/reconstruction with linear algebra
- re-use the multiplication in Fp[x]

[Plot: Z[x] matrix multiplication, time in seconds for FLINT vs our code, for coefficient bitsizes log(k) = 64 to 512 with (n = 16, deg = 8192) and (n = 128, deg = 512); benchmark on Intel Xeon-2620 @ 2.0GHz]

Parallel matrix multiplication over Z[x]

Very first attempt (work still in progress):
- parallel CRT with linear algebra (same code as in the Z case)
- perform each multiplication over Fp[x] in parallel
- some parts of the code are still sequential

  n   | d    | log(k) | 6 cores | 12 cores | sequential time
  64  | 1024 | 600    | ×3.52   | ×4.88    | 61.1s
  32  | 4096 | 600    | ×3.68   | ×5.02    | 64.4s
  32  | 2048 | 1024   | ×3.95   | ×5.73    | 54.5s
  128 | 128  | 1024   | ×3.76   | ×5.55    | 53.9s

benchmark on Intel Xeon-2620 @ 2.0GHz

Polynomial matrices in LinBox (proposition)

Generic handler class for polynomial matrices:

template<size_t type, size_t storage, class Field>
class PolynomialMatrix;

Specializations for different memory strategies:

// Matrix of polynomials
template<class Field>
class PolynomialMatrix<PMType::polfirst, PMStorage::plain, Field>;

// Polynomial of matrices
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::plain, Field>;

// Polynomial of matrices (partial view on monomials)
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::view, Field>;

Conclusion

High performance tools for exact linear algebra:
- matrix multiplication through floating point
- multi-dimensional CRT
- FFT for polynomials over word-size prime fields
- adaptive matrix representation

We provide in the LinBox library (www.linalg.org):
- efficient sequential/parallel matrix multiplication over Z
- efficient sequential matrix multiplication over Fp[x] and Z[x]