Toward High Performance Matrix Multiplication for Exact Computation

Pascal Giorgi
Joint work with Romain Lebreton (U. Waterloo)
Funded by the French ANR project HPAC

Séminaire CASYS - LJK, April 2014
Motivations

Matrix multiplication plays a central role in computer algebra.
  - algebraic complexity of O(n^ω) with ω < 2.3727   [Williams 2011]

Modern processors provide many levels of parallelism:
  - superscalar execution, SIMD units, multiple cores

High performance matrix multiplication:
  ✓ numerical computing = classic algorithm + hardware arithmetic
  ✗ exact computing ≠ numerical computing
    - the best algebraic algorithm is not always the most efficient in practice (≠ complexity model)
    - the arithmetic is not directly supported in hardware (e.g. Z, F_q, Z[x], Q[x, y, z])
Motivations: superscalar processors with SIMD

Hierarchical memory:
  - L1 cache: 32kB - 4 cycles
  - L2 cache: 256kB - 12 cycles
  - L3 cache: 8MB - 36 cycles
  - RAM: 32GB - 36 cycles + 57ns
Motivations: practical algorithms

High performance algorithms (rules of thumb):
  - the best asymptotic complexity is not always the fastest: constants matter
  - a better arithmetic count is not always faster: caches matter
  - process multiple data at the same time: vectorization
  - fine/coarse grain task parallelism matters: multicore parallelism
Our goal: try to incorporate these rules into exact matrix multiplication.
Outline

1. Matrix multiplication with small integers
2. Matrix multiplication with multi-precision integers
3. Matrix multiplication with polynomials
Matrix multiplication with small integers

This corresponds to the case where each entry of the result fits in one
processor register:

  A, B ∈ Z^{n×n} such that ||AB||_∞ < 2^s

where s is the register size.

Main interests:
  - ring isomorphism: computation over Z/pZ reduces to computation over Z/2^s Z when n(p − 1)^2 < 2^s
    (e.g. for s = 53 and p ≈ 2^23, any dimension n ≤ 128 is safe)
  - it is a building block for matrix multiplication with larger integers
Matrix multiplication with small integers

Two possibilities for hardware support:
  - use the floating-point mantissa, i.e. s = 53,
  - use native integers, i.e. s = 64.

Using floating point:
  - historically the first approach in computer algebra   [Dumas, Gautier, Pernet 2002]
  ✓ out-of-the-box performance from optimized BLAS
  ✗ handles matrices with entries < 2^26

Using native integers:
  ✓ apply the same optimizations as BLAS libraries   [Goto, Van De Geijn 2008]
  ✓ handles matrices with entries < 2^32
Matrix multiplication with small integers

Number of vector operations per cycle (pipelined):

                        SIMD            floating point    integers
  Nehalem (2008)        SSE4 128-bit    1 mul + 1 add     1 mul + 2 add
  Sandy Bridge (2011)   AVX 256-bit     1 mul + 1 add     -
  Haswell (2013)        AVX2 256-bit    2 FMA             1 mul + 2 add
[Figure: GFLOPS for matrix dimensions 32 to 8192, comparing OpenBLAS double (SSE, AVX, AVX2) with our integer code (SSE, AVX2); benchmark on Intel i7-4960HQ @ 2.60GHz]

Matrix multiplication with small integers
Matrix multiplication modulo a small integer
Let p be such that n × (p − 1)^2 < 2^53:
1. perform the multiplication over Z using BLAS
2. reduce the result modulo p
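A minimal sketch of these two steps (assuming the standard CBLAS interface, with entries stored as doubles already reduced into [0, p)):

  #include <cblas.h>
  #include <cmath>

  // C = A*B mod p for n x n row-major matrices of doubles holding
  // integers in [0, p). Exact as long as n*(p-1)^2 < 2^53, so every
  // dot product fits in the 53-bit mantissa before the reduction.
  void mul_mod_p(int n, const double* A, const double* B, double* C, double p)
  {
      // step 1: exact product over Z using the BLAS
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  n, n, n, 1.0, A, n, B, n, 0.0, C, n);
      // step 2: reduce each entry modulo p
      for (long i = 0; i < (long)n * n; ++i)
          C[i] = std::fmod(C[i], p);
  }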
[Figure: GFLOPS for matrix dimensions 32 to 8192, comparing OpenBLAS with the mod-p multiplication using the classic and Winograd algorithms; benchmark on Intel i7-4960HQ @ 2.60GHz]
Matrix multiplication with multi-precision integers

Direct approach
Let M(k) be the bit complexity of multiplying two k-bit integers, and let
A, B ∈ Z^{n×n} be such that ||A||_∞, ||B||_∞ ∈ O(2^k).
Computing AB with the direct algorithm costs O(n^ω M(k)) bit operations.

  ✗ not the best possible complexity, since M(k) is super-linear
  ✗ not efficient in practice

Remark: use an evaluation/interpolation technique for better performance!
Multi-modular matrix multiplication

Multi-modular approach
Choose primes m_i ∈ O(1) such that ||AB||_∞ < M = ∏_{i=1}^{k} m_i;
then AB can be reconstructed with the CRT from the (AB) mod m_i:

1. for each m_i compute A_i = A mod m_i and B_i = B mod m_i
2. for each m_i compute C_i = A_i B_i mod m_i
3. reconstruct C = AB from (C_1, ..., C_k)

Bit complexity:
O(n^ω k + n^2 R(k)), where R(k) is the cost of reduction/reconstruction
  - R(k) = O(M(k) log(k)) using a divide-and-conquer strategy
  - R(k) = O(k^2) using the naive approach
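To make the flow concrete, here is a minimal C++ sketch of the three steps. The helpers multimod_reduce and crt_reconstruct are only declared, not defined: they stand for steps 1 and 3, which the next slides recast as matrix products over the β-adic digits; mul_mod_p is the BLAS-based routine sketched earlier.

  #include <vector>
  typedef std::vector<double> Mat;   // row-major n x n block of doubles

  // Assumed helpers (declarations only); see the next slides for how
  // reduction/reconstruction become matrix products themselves.
  std::vector<Mat> multimod_reduce(const Mat& digits,
                                   const std::vector<double>& primes, int n);
  Mat crt_reconstruct(const std::vector<Mat>& Ci,
                      const std::vector<double>& primes, int n);
  void mul_mod_p(int n, const double* A, const double* B, double* C, double p);

  // C (as beta-adic digits) = A*B, where A and B are given by their digit
  // matrices (one row per beta-adic digit, n*n columns).
  Mat multimod_mul(int n, const Mat& Adigits, const Mat& Bdigits,
                   const std::vector<double>& primes)
  {
      int k = (int)primes.size();
      std::vector<Mat> Ai = multimod_reduce(Adigits, primes, n);  // step 1
      std::vector<Mat> Bi = multimod_reduce(Bdigits, primes, n);
      std::vector<Mat> Ci(k, Mat((size_t)n * n));
      for (int i = 0; i < k; ++i)                                 // step 2
          mul_mod_p(n, Ai[i].data(), Bi[i].data(), Ci[i].data(), primes[i]);
      return crt_reconstruct(Ci, primes, n);                      // step 3
  }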
Multi-modular matrix multiplication

Improving the naive approach with linear algebra
The reduction/reconstruction of n^2 values corresponds to a matrix multiplication:
  ✓ improves the bit complexity from O(n^2 k^2) to O(n^2 k^{ω−1})
  ✓ benefits from optimized matrix multiplication, i.e. SIMD

Remark: a similar approach has been used by [Doliskani, Schost 2010] in a
non-distributed code.
Multi-modular reductions of an integer matrix

Let us assume M = ∏_{i=1}^{k} m_i < β^k with m_i < β.

Multi-reduction of a single entry
Let a = a_0 + a_1 β + ... + a_{k−1} β^{k−1} be a value to reduce mod the m_i; then

\begin{pmatrix} |a|_{m_1} \\ \vdots \\ |a|_{m_k} \end{pmatrix}
=
\begin{pmatrix}
1 & |\beta|_{m_1} & \cdots & |\beta^{k-1}|_{m_1} \\
\vdots & \vdots & & \vdots \\
1 & |\beta|_{m_k} & \cdots & |\beta^{k-1}|_{m_k}
\end{pmatrix}
\times
\begin{pmatrix} a_0 \\ \vdots \\ a_{k-1} \end{pmatrix}
- Q \times
\begin{pmatrix} m_1 \\ \vdots \\ m_k \end{pmatrix}

with ||Q||_∞ < kβ^2.

Lemma: if kβ^2 ∈ O(1) then the reduction of n^2 integers modulo the
m_i's costs O(n^2 k^{ω−1}) + O(n^2 k) bit operations.
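As an illustration, a minimal sketch of this reduction as a single dgemm (assuming kβ^2 < 2^53 so that every intermediate value is exact in a double; the table V and the digit matrix D are precomputed):

  #include <cblas.h>
  #include <cmath>
  #include <vector>

  // Reduce n2 integers, given by their beta-adic digits, modulo k primes
  // with one matrix product R = V * D followed by a cheap correction.
  // V is k x k with V[i*k+j] = beta^j mod m_i; D is k x n2 with
  // D[j*n2+e] = j-th digit of entry e. All row-major doubles.
  void multimod_reduce_dgemm(int k, int n2,
                             const std::vector<double>& V,
                             const std::vector<double>& D,
                             const std::vector<double>& m,
                             std::vector<double>& R)   // k x n2 output
  {
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  k, n2, k, 1.0, V.data(), k, D.data(), n2,
                  0.0, R.data(), n2);
      // each R[i][e] < k*beta^2, so one fmod plays the role of the
      // correction by Q on the slide above
      for (int i = 0; i < k; ++i)
          for (int e = 0; e < n2; ++e)
              R[(size_t)i * n2 + e] = std::fmod(R[(size_t)i * n2 + e], m[i]);
  }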
Multi-modular reconstruction of an integer matrix

Let us assume M = ∏_{i=1}^{k} m_i < β^k with m_i < β, and let M_i = M/m_i.

CRT formula: a = ( ∑_{i=1}^{k} |a|_{m_i} · M_i |M_i^{−1}|_{m_i} ) mod M

Reconstruction of a single entry
Let M_i |M_i^{−1}|_{m_i} = α_0^{(i)} + α_1^{(i)} β + ... + α_{k−1}^{(i)} β^{k−1} be the CRT constants; then

\begin{pmatrix} a_0 \\ \vdots \\ a_{k-1} \end{pmatrix}
=
\begin{pmatrix}
\alpha_0^{(1)} & \cdots & \alpha_0^{(k)} \\
\vdots & & \vdots \\
\alpha_{k-1}^{(1)} & \cdots & \alpha_{k-1}^{(k)}
\end{pmatrix}
\times
\begin{pmatrix} |a|_{m_1} \\ \vdots \\ |a|_{m_k} \end{pmatrix}

with a_i < kβ^2 and a = (a_0 + ... + a_{k−1} β^{k−1}) mod M the CRT solution.

Lemma: if kβ^2 ∈ O(1) then the reconstruction of n^2 integers from their
images modulo the m_i's costs O(n^2 k^{ω−1}) + Õ(n^2 k) bit operations.
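A tiny worked instance (illustrative, not from the talk): take k = 2, β = 10, m_1 = 3, m_2 = 5, so M = 15, M_1 = 5, M_2 = 3, and the CRT constants are M_1|M_1^{−1}|_{m_1} = 5·2 = 10 = 0 + 1·β and M_2|M_2^{−1}|_{m_2} = 3·2 = 6 = 6 + 0·β. For a = 7 we have |a|_3 = 1 and |a|_5 = 2, hence

\begin{pmatrix} a_0 \\ a_1 \end{pmatrix}
= \begin{pmatrix} 0 & 6 \\ 1 & 0 \end{pmatrix}
\times \begin{pmatrix} 1 \\ 2 \end{pmatrix}
= \begin{pmatrix} 12 \\ 1 \end{pmatrix}

so a = (12 + 1·10) mod 15 = 22 mod 15 = 7 as expected, and indeed each a_i < kβ^2 = 200.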
Matrix multiplication with multi-precision integers

Implementation of the multi-modular approach:
  - choose β = 2^16 to optimize the β-adic conversions
  - choose the m_i s.t. nβm_i < 2^53 and use BLAS dgemm
  - use a linear storage for the multi-modular matrices

Compare sequential performance with:
  - FLINT library [1]: uses divide and conquer
  - Mathemagix library [2]: uses divide and conquer
  - Doliskani's code [3]: uses dgemm for reductions only

[1] www.flintlib.org
[2] www.mathemagix.org
[3] courtesy of J. Doliskani
Matrix multiplication with multi-precision integers

[Figure: integer matrix multiplication, matrix dim = 32; time in seconds vs entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication with multi-precision integers

[Figure: integer matrix multiplication, matrix dim = 128; time in seconds vs entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication with multi-precision integers

[Figure: integer matrix multiplication, matrix dim = 512; time in seconds vs entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication with multi-precision integers

[Figure: integer matrix multiplication, matrix dim = 1024; time in seconds vs entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Parallel multi-modular matrix multiplication

Steps 1 and 3:
1. for i = 1 ... k compute A_i = A mod m_i and B_i = B mod m_i
3. reconstruct C = AB from (C_1, ..., C_k)

Parallelization of multi-modular reduction/reconstruction: each thread
reduces (resp. reconstructs) a chunk of the given matrix.

[Diagram: the images A_0 = A mod m_0 through A_6 = A mod m_6 computed by 4 threads, each thread handling its own chunk of the matrix]
Parallel multi-modular matrix multiplication

Step 2:
2. for i = 1 ... k compute C_i = A_i B_i mod m_i

Parallelization of the modular multiplications: each thread computes a
bunch of matrix multiplications mod m_i:

  thread 0: C_0 = A_0 B_0 mod m_0,  C_1 = A_1 B_1 mod m_1
  thread 1: C_2 = A_2 B_2 mod m_2,  C_3 = A_3 B_3 mod m_3
  thread 2: C_4 = A_4 B_4 mod m_4,  C_5 = A_5 B_5 mod m_5
  thread 3: C_6 = A_6 B_6 mod m_6
Parallel matrix multiplication with multi-precision integers
  - based on OpenMP tasks
  - CPU affinity (hwloc-bind), allocator (tcmalloc)
  - work still in progress for a better memory strategy!
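A minimal sketch of how step 2 can be dispatched (the talk uses OpenMP tasks; a plain parallel loop is shown here for brevity, reusing the mul_mod_p routine from earlier):

  #include <omp.h>

  void mul_mod_p(int n, const double* A, const double* B, double* C, double p);

  // Each iteration is independent: one modular matrix product per prime.
  // Scheduling these as OpenMP tasks instead (as in the talk) additionally
  // allows overlap with the reduction/reconstruction steps.
  void parallel_step2(int n, int k, double* const* Ai, double* const* Bi,
                      double* const* Ci, const double* primes)
  {
      #pragma omp parallel for schedule(dynamic)
      for (int i = 0; i < k; ++i)
          mul_mod_p(n, Ai[i], Bi[i], Ci[i], primes[i]);
  }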
[Figure: speedup of parallel integer matrix multiplication on 2 to 12 cores for (dim = 1024, bitsize = 1024), (dim = 1024, bitsize = 4096) and (dim = 256, bitsize = 8192); benchmark on Intel Xeon-2620 @ 2.0GHz (2 NUMA nodes with 6 cores each)]
Matrix multiplication over F_p[x]

We consider the "easiest" case:
  - A, B ∈ F_p[x]^{n×n} such that deg(AB) < k = 2^t
  - p is a Fourier prime, i.e. p = 2^t q + 1
  - p is such that n(p − 1)^2 < 2^53

Complexity:
O(n^ω k + n^2 k log(k)) operations in F_p using evaluation/interpolation with FFT

Remark: using a Vandermonde matrix one can get an approach similar to the
integer case, i.e. O(n^ω k + n^2 k^{ω−1}).
Matrix multiplication over F_p[x]

Evaluation/interpolation scheme
Let θ be a primitive k-th root of unity in F_p:
1. for i = 1 ... k compute A_i = A(θ^{i−1}) and B_i = B(θ^{i−1})
2. for i = 1 ... k compute C_i = A_i B_i over F_p
3. interpolate C = AB from (C_1, ..., C_k)

  - steps 1 and 3: O(n^2) calls to FFT_k over F_p
  - step 2: k matrix multiplications modulo a small prime p
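In code, the scheme might look as follows (a sketch only: fft and inverse_fft stand for an assumed length-k FFT over F_p applied to one coefficient vector at a time, and mul_mod_p is the BLAS-based product from before):

  #include <vector>

  std::vector<double> fft(const std::vector<double>& coeffs,
                          double theta, double p);
  std::vector<double> inverse_fft(const std::vector<double>& values,
                                  double theta, double p);
  void mul_mod_p(int n, const double* A, const double* B, double* C, double p);

  // C = A*B for n x n matrices of polynomials over F_p, deg(AB) < k.
  // Entry (r,c) of A is A[r*n+c], a vector of k coefficients.
  void polymatmul(int n, int k, double theta, double p,
                  const std::vector<std::vector<double>>& A,
                  const std::vector<std::vector<double>>& B,
                  std::vector<std::vector<double>>& C)
  {
      int n2 = n * n;
      std::vector<std::vector<double>> Av(n2), Bv(n2);
      for (int e = 0; e < n2; ++e) {             // step 1: O(n^2) FFTs
          Av[e] = fft(A[e], theta, p);
          Bv[e] = fft(B[e], theta, p);
      }
      C.assign(n2, std::vector<double>(k, 0.0));
      std::vector<double> Ai(n2), Bi(n2), Ci(n2);
      for (int i = 0; i < k; ++i) {              // step 2: k modular dgemms
          for (int e = 0; e < n2; ++e) { Ai[e] = Av[e][i]; Bi[e] = Bv[e][i]; }
          mul_mod_p(n, Ai.data(), Bi.data(), Ci.data(), p);
          for (int e = 0; e < n2; ++e) C[e][i] = Ci[e];
      }
      for (int e = 0; e < n2; ++e)               // step 3: inverse FFTs
          C[e] = inverse_fft(C[e], theta, p);
  }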
FFT with SIMD over F_p

Butterfly operation modulo p:
compute X + Y mod p and (X − Y)θ^{2i} mod p.

Barrett's modular multiplication with a constant (as in NTL);
computing in [0, 2p) removes two conditionals   [Harvey 2014]

Let X, Y ∈ [0, 2p), W ∈ [0, p), p < β/4 and W′ = ⌈Wβ/p⌉.

Algorithm: Butterfly(X, Y, W, W′, p)
1: X′ := X + Y mod 2p
2: T := X − Y + 2p
3: Q := ⌈W′T/β⌉             ← 1 high short product
4: Y′ := (WT − Qp) mod β     ← 2 low short products
5: return (X′, Y′)
  ✓ SSE/AVX provide 16- or 32-bit low short products
  ✗ no high short product available (use a full product)
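A scalar C++ rendering of the butterfly (a sketch under the assumptions β = 2^64 and p < 2^62, using the floor variant of the precomputed quotient as in NTL and Harvey's paper; the 128-bit multiply uses the GCC/Clang __int128 extension):

  #include <cstdint>

  typedef unsigned __int128 uint128_t;  // GCC/Clang extension

  // One butterfly in [0, 2p), with beta = 2^64 and p < 2^62.
  // Wp = floor(W * 2^64 / p) is precomputed once per twiddle factor W.
  inline void butterfly(uint64_t& X, uint64_t& Y,
                        uint64_t W, uint64_t Wp, uint64_t p)
  {
      uint64_t Xp = X + Y;                   // in [0, 4p)
      if (Xp >= 2*p) Xp -= 2*p;              // X' = X + Y mod 2p
      uint64_t T = X - Y + 2*p;              // in (0, 4p)
      uint64_t Q = (uint64_t)(((uint128_t)Wp * T) >> 64); // high short product
      uint64_t Yp = W*T - Q*p;               // two low short products, mod 2^64
      X = Xp;                                // both results stay in [0, 2p)
      Y = Yp;
  }

In a SIMD version, line 4's two low short products map directly to the low-multiply instructions mentioned above, while line 3 has to fall back to a full product.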
Matrix multiplication over F_p[x]

Implementation:
  - radix-4 FFT with 128-bit SSE (29-bit primes)
  - BLAS-based matrix multiplication over F_p   [FFLAS-FFPACK library]
[Figure: polynomial matrix multiplication, time in seconds for FLINT, MMX (Mathemagix) and our code on (n, deg) pairs ranging from (n = 16, deg = 1024...8192) down to (n = 2048, deg = 32); benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication over Z[x]

A, B ∈ Z[x]^{n×n} such that deg(AB) < d and each coefficient matrix
satisfies ||(AB)_i||_∞ < k.

Complexity:
  - Õ(n^ω d log(d) log(k)) bit operations using Kronecker substitution
  - O(n^ω d log(k) + n^2 d log(d) log(k)) bit operations using CRT + FFT

Remark: if the result's degree and bitsize are not too large, CRT with
Fourier primes may suffice.
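For instance (a small example, not from the talk), Kronecker substitution packs the coefficients into a single integer: to multiply a(x) = 2 + 3x and b(x) = 1 + 4x, evaluate at x = 2^8:

  a(2^8) · b(2^8) = 770 · 1025 = 789250 = 2 + 11·2^8 + 12·2^16

whose base-2^8 digits 2, 11, 12 are exactly the coefficients of ab = 2 + 11x + 12x^2; the evaluation point must be taken large enough that the coefficients do not overlap.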
Matrix multiplication over Z[x]

Implementation:
  - use CRT with Fourier primes
  - re-use the multi-modular reduction/reconstruction with linear algebra
  - re-use the multiplication in F_p[x]
[Figure: time in seconds for FLINT and our code, for log(k) ∈ {64, 128, 256, 512} with (n = 16, deg = 8192) and log(k) ∈ {64, 128, 256} with (n = 128, deg = 512); benchmark on Intel Xeon-2620 @ 2.0GHz]
Parallel matrix multiplication over Z[x]

Very first attempt (work still in progress):
  - parallel CRT with linear algebra (same code as in the Z case)
  - perform each multiplication over F_p[x] in parallel
  - some parts of the code are still sequential

   n     d      log(k)   6 cores   12 cores   time seq
   64    1024   600      ×3.52     ×4.88      61.1s
   32    4096   600      ×3.68     ×5.02      64.4s
   32    2048   1024     ×3.95     ×5.73      54.5s
   128   128    1024     ×3.76     ×5.55      53.9s
Polynomial Matrix in LinBox (proposition)

Generic handler class for polynomial matrices:

template<size_t type, size_t storage, class Field>
class PolynomialMatrix;

Specializations for different memory strategies:

// Matrix of polynomials
template<class Field>
class PolynomialMatrix<PMType::polfirst, PMStorage::plain, Field>;

// Polynomial of matrices
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::plain, Field>;

// Polynomial of matrices (partial view on monomials)
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::view, Field>;
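A hypothetical usage sketch (the constructor signature below is an illustrative assumption, not the settled LinBox API): the polfirst layout stores each entry's coefficients contiguously, which suits the n^2 FFTs of steps 1 and 3, while matfirst stores each coefficient as a plain matrix, which suits the k BLAS products of step 2.

  // Field is any LinBox-compatible word-size prime field type (assumed).
  Field F(469762049);   // a 29-bit Fourier prime, 7*2^26 + 1

  // n x n matrix whose entries are polynomials of degree < d
  PolynomialMatrix<PMType::polfirst, PMStorage::plain, Field> A(F, n, n, d);

  // the same data seen as a polynomial with n x n matrix coefficients
  PolynomialMatrix<PMType::matfirst, PMStorage::plain, Field> B(F, n, n, d);

Switching between the two representations is then a data-layout conversion rather than a change of algorithm.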
Conclusion

High performance tools for exact linear algebra:
  - matrix multiplication through floating point
  - multi-dimensional CRT
  - FFT for polynomials over word-size prime fields
  - adaptive matrix representations

We provide in the LinBox library (www.linalg.org):
  - efficient sequential/parallel matrix multiplication over Z
  - efficient sequential matrix multiplication over F_p[x] and Z[x]