SIMD Programming and What You Must Know about CPU Peak FLOPS

Kenjiro Taura
Contents
1. Introduction
2. SIMD Instructions
3. SIMD programming alternatives
   - Auto loop vectorization
   - GCC's Vector Types
   - Vector intrinsics
4. What you must know to get near peak FLOPS
Remember the performance of our Cholesky code?
What is the theoretical limit?
we are using an Intel Ivy Bridge processor
its single core can execute, in every cycle,
- a floating point multiply instruction,
- a floating point add instruction,
- and others (e.g., integer arithmetic, load, store, ...), covered later
moreover, a single add/multiply instruction can add/multiply four or eight operands
⇒ Single Instruction Multiple Data (SIMD) instructions
Terminology
flops: floating point operations
FLOPS: Floating Point Operations Per Second
Practically,
Peak FLOPS = FP operands per instruction (SIMD width)
           × instructions per cycle (IPC)
           × cycles per second (frequency)
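For example (the 2.6 GHz figure here is illustrative, not from these slides): an Ivy Bridge core issuing one 8-wide add and one 8-wide multiply per cycle gives 8 × 2 × 2.6 × 10^9 ≈ 41.6 single precision GFLOPS.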
Peak flops/cycle of recent cores
Recent processors increasingly rely on SIMD as an energy efficient way of boosting peak FLOPS

Microarchitecture         ISA       SIMD width  throughput (per clock)  max flops/cycle
Nehalem                   SSE       4           1 add + 1 mul           8
Sandy Bridge/Ivy Bridge   AVX       8           1 add + 1 mul           16
Haswell                   AVX2      8           2 fmadds                32
Knights Corner            AVX-512F  16          1 fmadd(*)              32

SIMD width : number of single precision operands
ISA : Instruction Set Architecture
fmadd : fused multiply and add
(*) : a single hardware thread can achieve half of it
Today's theme
so our Cholesky code runs at a mere 5% of the peak
I will talk about the ingredients for improving it; improving it is a topic of the next hands-on
we'll learn
- practical ways to use SIMD instructions
- basics of processors: what kind of code can get close-to-peak performance
Intel SIMD instructions at a glance
Some example AVX instructions

operation  syntax                     C-like expression
multiply   vmulps %ymm0,%ymm1,%ymm2   ymm2 = ymm1 * ymm0
add        vaddps %ymm0,%ymm1,%ymm2   ymm2 = ymm1 + ymm0
load       vmovaps 400(%rax),%ymm0    ymm0 = *(rax+400)
store      vmovaps %ymm0,400(%rax)    *(rax+400) = ymm0

ymm0 ... ymm15 are 256-bit registers
they can hold eight single precision (C float) or four double precision (C double) floating point numbers
xxxxps stands for "packed single precision"
Intel SIMD instructions at a glance
Don't be confused by similar but different variations

instruction                operands  vector/scalar?  width (bits)  ISA
vmulps %ymm0,%ymm1,%ymm2   8 SPs     vector          256           AVX
vmulpd %ymm0,%ymm1,%ymm2   4 DPs     vector          256           AVX
mulps %xmm0,%xmm1          4 SPs     vector          128           SSE
mulpd %xmm0,%xmm1          2 DPs     vector          128           SSE
mulss %xmm0,%xmm1          1 SP      scalar          (32)          SSE
mulsd %xmm0,%xmm1          1 DP      scalar          (64)          SSE

...ps : packed single precision
...pd : packed double precision
xmm0, ..., xmm15 : 128-bit SSE registers (aliased to the lower half of the ymm registers)
Intel instructions have traditionally taken two operands, but since AVX, many new instructions take three
Applications/limitations of SIMD
SIMD is good at parallelizing computations that apply almost exactly the same series of instructions to contiguous data
⇒ generally, the main targets are simple loops whose iterations have identical control flow
less effective for
- loops with conditional branches
- loops with nested loops of variable trip counts
- loops with unknown function calls
- non-loops
(an illustrative sketch follows)
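As an illustrative sketch (these functions are mine, not from the slides), compare a SIMD-friendly loop with one that resists vectorization:

/* illustrative sketch: a SIMD-friendly loop vs. a loop that resists vectorization */
float g(float v);  /* some externally defined function the compiler cannot see into */

void friendly(float a, float c, const float * x, float * y, long n) {
  for (long i = 0; i < n; i++)
    y[i] = a * x[i] + c;             /* identical work per iteration: vectorizable */
}

void unfriendly(const float * x, float * y, long n) {
  for (long i = 0; i < n; i++)
    if (x[i] > 0.0f) y[i] = g(x[i]); /* branch + opaque call: hard to vectorize */
}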
Tools for SIMD programming
Auto vectorization
- loop vectorization
- basic block vectorization
SIMD-oriented constructs/directives
- SIMD directives for loops (OpenMP 4.0/Cilk Plus)
- SIMD-enabled functions (OpenMP 4.0/Cilk Plus)
- array languages (Cilk Plus)
- specially designed languages
Somewhat portable vector types
- GCC vector extensions
- Boost.SIMD
Vector intrinsics
Assembly programming
I'll cover auto vectorization, vector types, and intrinsics today, and the SIMD constructs/directives when I talk about OpenMP.
Contents
1. Introduction
2. SIMD Instructions
3. SIMD programming alternatives
   - Auto loop vectorization
   - GCC's Vector Types
   - Vector intrinsics
4. What you must know to get near peak FLOPS
Auto loop vectorization
write scalar loops and hope the compiler does the job
e.g.,

void axpy_auto(float a, float * x, float c, long m) {
  long j;
  for (j = 0; j < m; j++) {
    x[j] = a * x[j] + c;
  }
}

compile and run

$ gcc -o simd_auto -march=native -O3 simd_auto.c
$ ./simd_auto
8000000000.000000 flops
1226107900 clocks
6.524711 flops/clock
How to know if the compiler vectorized it?
there are options useful to know whether (and why) it did or didn't vectorize

                    report successful vectorization  report failed vectorization
GCC < 4.8           -ftree-vectorizer-verbose=1      -ftree-vectorizer-verbose=2
GCC ≥ 4.8           -fopt-info-vec-optimized         -fopt-info-vec-missed
Intel               -vec-report=1                    -vec-report=2

but don't hesitate to dive into assembly code
gcc -S is your friend
trick: enclose loops with inline assembler comments

asm volatile ("# xxx loop begin");
for (i = 0; i < n; i++) {
  ... /* hope to be vectorized */
}
asm volatile ("# xxx loop end");
Issues that complicate vectorization
- misaligned data
- potential aliasing
Telling the compiler "they don't happen; don't worry about them" helps it do a better job
Alignment and vectorization
background: SIMD load/store instructions require or prefer the address to be aligned to the vector size
- movaps: requires a 16 byte-aligned address
- vmovaps: requires a 32 byte-aligned address
even for loops as simple as this

for (j = 0; j < m; j++) {
  x[j] = a * x[j] + c;
}

the generated code runs a few scalar iterations until the address of x[j] becomes aligned
if possible,
(step 1): align your data appropriately and
(step 2): let the compiler know about it
Step 1: align data
static/local/global variables and arrays

float a[100] __attribute__((aligned(64)));

dynamically allocated data

float * a;
int err = posix_memalign((void **)&a, 64, sizeof(float) * 100);
if (err) failed();
Step 2: let the compiler know about alignment
GCC (≥ 4.6)

x = __builtin_assume_aligned(x, 64);

returns x, except that the compiler assumes it is now 64 byte-aligned
GCC (< 4.6); only works for function parameters

void f(float * __attribute__((aligned(64))) x) {
  ...
}
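Putting the two steps together, a minimal sketch (the function name axpy_aligned is mine; it assumes the caller allocated x with posix_memalign(..., 64, ...)):

/* minimal sketch combining step 1 (aligned allocation) and step 2 (telling the compiler) */
void axpy_aligned(float a, float * x, float c, long m) {
  float * xa = (float *)__builtin_assume_aligned(x, 64);
  for (long j = 0; j < m; j++) {
    xa[j] = a * xa[j] + c;  /* the compiler may now emit aligned vector loads/stores */
  }
}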
Aliasing and vectorization
loops operating on two or more arrays are often not vectorizable if the arrays can turn out to be the same

for (i = 0; i < m; i++) {
  y[i] = x[i+1] + 1;
}

good compilers generate code that first checks, at runtime, whether x[i+1:i+9] and y[i:i+8] point to the same region, and jumps to 8-way-vectorized code when they don't

$ gcc -march=native -O3 -fopt-info-vec-optimized
..
a.c:6: note: create runtime check for data references *_12 and *_8
a.c:6: note: created 1 versioning for alias checks.
..

if you know they don't overlap, it's advisable to make that explicit
the restrict keyword, introduced by C99, does just that
restrict keyword
annotate parameters of pointer type with restrict, if you know they never point to the same data

void axpy(float a, float * restrict x, float c,
          float * restrict y, long m) {
  for (long j = 0; j < m; j++) {
    y[j] = a * x[j+8] + c;
  }
}

you need to specify -std=c99 or -std=gnu99

$ gcc -march=native -O3 -S a.c -std=gnu99 -fopt-info-vec-optimized
...
a.c:5: note: LOOP VECTORIZED.
a.c:1: note: vectorized 1 loops in function.
...
Lower level alternatives
auto vectorization fails for all but the simplest kinds of code
there are various ways to explicitly vectorize your code
Contents
1. Introduction
2. SIMD Instructions
3. SIMD programming alternatives
   - Auto loop vectorization
   - GCC's Vector Types
   - Vector intrinsics
4. What you must know to get near peak FLOPS
GCC vector types
GCC allows you to define a vector type

typedef float float8 __attribute__((vector_size(32)));

You can use arithmetic on vector types

float8 x, y, z;
z += x * y;

You cannot mix scalars and vectors, however

float8 x, y, z;
z = 3.5 * x + y;  /* error */

You can combine them with intrinsics (a sketch follows)
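A sketch of the standard workaround (my example, not from the slides): broadcast the scalar into all lanes first, then use vector-vector arithmetic.

typedef float float8 __attribute__((vector_size(32)));

/* broadcast 3.5 into all eight lanes, then operate vector * vector */
float8 scale_add(float8 x, float8 y) {
  float8 t = { 3.5f, 3.5f, 3.5f, 3.5f, 3.5f, 3.5f, 3.5f, 3.5f };
  return t * x + y;
}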
axpy in GCC vector extension
scalar code

void axpy_auto(float a, float * x, float c, long m) {
  long j;
  for (j = 0; j < m; j++) {
    x[j] = a * x[j] + c;
  }
}

GCC vector extension

typedef float float8 __attribute__((vector_size(32)));

void axpy_vector_ext(float a, float * x, float c, long m) {
  long j;
  float8 a8 = { a,a,a,a,a,a,a,a };
  float8 c8 = { c,c,c,c,c,c,c,c };
  for (j = 0; j + 7 < m; j += 8) {
    *((float8*)&x[j]) = a8 * *((float8*)&x[j]) + c8;
  }
  for (; j < m; j++) {
    x[j] = a * x[j] + c;
  }
}
Contents
1. Introduction
2. SIMD Instructions
3. SIMD programming alternatives
   - Auto loop vectorization
   - GCC's Vector Types
   - Vector intrinsics
4. What you must know to get near peak FLOPS
Vector intrinsics
processor/platform-specific functions and types
for AVX,

#include <immintrin.h>

and you get
- __m256 (256-bit vector) and __m128 (128-bit vector) types
- lots of _mm256_xxx (AVX) and _mm_xxx (SSE) functions
each function maps almost directly to a single assembly instruction, so you know what will happen
- _mm256_load_ps
- _mm256_store_ps
- _mm256_add_ps
- _mm_add_ps, etc.
axpy with vector intrinsics

void axpy_intrin(float a, float * x, float c, long m) {
  long j;
  for (j = 0; j < m; j += 8) {
    _mm256_store_ps(&x[j],
                    _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(a),
                                                _mm256_load_ps(&x[j])),
                                  _mm256_set1_ps(c)));
  }
}

ugly
you would rather use vector extensions (or trivially vectorizable loops) and use intrinsics sparingly, only when necessary
- _mm256_set1_ps(x): extends a scalar to a vector
- _mm256_shuffle_ps(...): shuffles words in vectors
- etc.
(a mixed sketch follows)
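As a hedged illustration of that advice, a sketch (the function name is mine) that stays in the vector-extension world and calls an intrinsic only for the broadcast:

#include <immintrin.h>
typedef float float8 __attribute__((vector_size(32)));

/* sketch: GCC vector types for the arithmetic, one intrinsic for the broadcast */
void axpy_mixed(float a, float * x, float c, long m) {
  float8 a8 = (float8)_mm256_set1_ps(a);
  float8 c8 = (float8)_mm256_set1_ps(c);
  for (long j = 0; j + 7 < m; j += 8) {
    *(float8 *)&x[j] = a8 * *(float8 *)&x[j] + c8;
  }
}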
What you need to know to get near peak FLOPS
so you now know how to use SIMD instructions, which may bring an 8x performance boost
dense matrix computation seems particularly well-suited to SIMD
yet, just vectorizing the code does not automatically bring you a satisfactory result
An endeavor to nearly peak FLOPS
let's start with the simplest code doing only fp mul and add
can it be simpler than this?

typedef float float8 __attribute__((vector_size(32)));

float8 axpy_1(float8 a, float8 x, float8 c, long n) {
  long i;
  for (i = 0; i < n; i++) {
    x = a * x + c;
  }
  return x;
}

the code performs 16n flops (8 lanes × 2 flops × n iterations)
no memory access in the loop
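For concreteness, a sketch of a driver that would produce numbers like the ones on the next slide (my reconstruction, not the course's actual 0axpy.c; it uses the rdtsc cycle counter):

#include <stdio.h>
#include <x86intrin.h>

typedef float float8 __attribute__((vector_size(32)));
float8 axpy_1(float8 a, float8 x, float8 c, long n);

int main(void) {
  long n = 1000000000L;          /* 10^9 iterations = 16e9 flops */
  float8 a = {1,1,1,1,1,1,1,1};
  float8 x = {0,0,0,0,0,0,0,0};
  float8 c = {1,1,1,1,1,1,1,1};
  unsigned long long t0 = __rdtsc();
  x = axpy_1(a, x, c, n);
  unsigned long long t1 = __rdtsc();
  double flops = 16.0 * (double)n;
  printf("%f flops\n%llu clocks\n%f flops/clock\n",
         flops, t1 - t0, flops / (double)(t1 - t0));
  printf("lane 0 = %f\n", x[0]); /* use the result so it is not optimized away */
  return 0;
}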
Let's run it!
compile

$ gcc -o 0axpy -march=native -O3 0axpy.c

and run!

$ ./0axpy
16000000000.000000 flops
8024035912 clocks
1.994009 flops/clock

what!? let's examine what the compiler did
What!?
put a landmark in the assembly code

float8 axpy_1(float8 a, float8 x, float8 c, long n) {
  long i;
  asm volatile ("# ax+c loop begin");
  for (i = 0; i < n; i++) {
    x = a * x + c;
  }
  asm volatile ("# ax+c loop end");
  return x;
}

compile into assembly

$ gcc -S -march=native -O3 0axpy.c

and see 0axpy.s in your editor
Assembly

# ax+c loop begin
#NO_APP
        testq   %rdi, %rdi
        jle     .L2
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L3:
        addq    $1, %rax
        vmulps  %ymm1, %ymm0, %ymm1
        vaddps  %ymm2, %ymm1, %ymm1
        cmpq    %rdi, %rax
        jne     .L3
.L2:
#APP
# ax+c loop end

a little calculation shows the above loop takes 16/1.994009 ≈ 8 cycles per iteration
why?
Latency and throughput
an Ivy Bridge core can execute a vmulps and a vaddps every cycle
yet, that does not mean the result of the vmulps is available to the vaddps in the next cycle

.L3:
        addq    $1, %rax
        vmulps  %ymm1, %ymm0, %ymm1   # ymm1 *= ymm0 (ax)
        vaddps  %ymm2, %ymm1, %ymm1   # ymm1 += ymm2 (+c)
        cmpq    %rdi, %rax
        jne     .L3

what you need to know:
- "1 add + 1 mul every cycle" refers to throughput
- each instruction has a specific latency
Ivy Bridge latencies

instruction           latency (cycles)
fp add                3
fp mul                5
typical integer ops   1
...

http://www.agner.org/optimize/ is an invaluable resource
Our code in light of latencies
in our code, vaddps uses the result of vmulps
moreover, the vmulps of the next iteration uses the result of the vaddps of the previous iteration
well, that was obvious from the source code too

.L3:
        addq    $1, %rax
        vmulps  %ymm1, %ymm0, %ymm1   # ymm1 *= ymm0 (ax)
        vaddps  %ymm2, %ymm1, %ymm1   # ymm1 += ymm2 (+c)
        cmpq    %rdi, %rax
        jne     .L3

for (i = 0; i < n; i++) {
  x = a * x + c;
}

Conclusion: the loop can't run faster than 8 cycles/iteration, the length of the dependency chain per iteration (5 cycles for vmulps + 3 cycles for vaddps)
How to overcome latencies?
increase parallelism!
you can't make a serial chain of computation faster (change the algorithm if you want to)
but you can increase throughput, by running many independent chains
we expect the following to finish in the same number of cycles as the original, even though it performs twice as many flops
note that the two series are independent

float8 sequence2(float8 a, float8 x0, float8 x1, float8 c, long n) {
  long i;
  for (i = 0; i < n; i++) {
    x0 = a * x0 + c;
    x1 = a * x1 + c;
  }
  return x0 + x1;
}
... and you can have more than two chains ...
we expect to reach peak FLOPS with ≥ 8 chains, and we do

float8 sequence2(float8 a, float8 x0, float8 x1, ..., float8 c, long n) {
  for (i = 0; i < n; i++) {
    x0 = a * x0 + c;
    x1 = a * x1 + c;
    ...
  }
  return x0 + x1 + ...;
}

chains  clocks/iter    flops/clock
1       8.027084796    1.993252
2       8.024828988    3.987624
3       8.024573088    5.981627
4       8.024671996    7.975404
6       8.026337580    11.960623
8       8.106405064    15.789983
9       9.126867400    15.777593
11      11.162532772   15.767031
13      13.185083644   15.775402

(figure: clocks/iter and flops/clock vs. the number of chains)
What's happening in the processor
(figure: a timeline of cycles; decoded instructions wait in the reservation station until their operands are ready)
at cycle t, the core executes the t-th vmulps instruction and the (t − 5)-th vaddps instruction (out of order execution)
meanwhile, there are instructions waiting for their operands to arrive (in the reservation station)
A variable number of chains
we have gotten a program that achieves peak FLOPS
as having many variables is ugly and inflexible, let's turn them into an array

for (i = 0; i < n; i++) {
  x0 = a * x0 + c;
  x1 = a * x1 + c;
  ...
}

⇒

for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++) {
    x[j] = a * x[j] + c;
  }
}

we hope it shows a similar performance (≈ peak if m ≥ 8)
and the program can handle an arbitrary number of sequences!
Yet, when we experiment ...

chains  flops/clock
1       1.063351
2       2.118230
3       3.171326
4       4.234297
5       5.244192
6       5.890180
7       6.718166
8       7.473419
9       7.555141
10      7.387568
11      7.554162
12      7.589363
13      7.679630

(figure: flops/clock vs. the number of chains, scalar variables vs. array)

Two issues
- the throughput hits a plateau at ≈ 2 cycles/iter
- the latency of a single update became ≈ 16 cycles
Take a look at assembly
it looks like:

        # j loop begins
        ...
.L14:
        vmulps  (%rdi,%rax), %ymm0, %ymm2
        vaddps  %ymm1, %ymm2, %ymm2
        vmovaps %ymm2, (%rdi,%rax)
        addq    $32, %rax
        cmpq    %rcx, %rax
        jne     .L14
        ...
        # j loop ends

do you see the difference from the code we saw before for the scalar version?

.L3:
        addq    $1, %rax
        vmulps  %ymm1, %ymm0, %ymm1
        vaddps  %ymm2, %ymm1, %ymm1
        cmpq    %rdi, %rax
        jne     .L3
The reason for the latency (16 cycles)
both issues stem from the fact that the code now involves loads/stores
according to Agner's tables, the relevant latencies are

vmulps          : 5
vaddps          : 3
vmovaps (load)  : 4
vmovaps (store) : 4

this explains the 16 cycles (5 + 3 + 4 + 4 = 16) at m = 1
The reason for the throughput (2 cycles/iter)
What you need to know:
all instructions have their own throughput limits, due to execution resources (dispatch ports and execution units)
"a single mul + add per cycle" is just one example of such a limit
Ivy Bridge has 6 dispatch ports, behind which various execution units are attached
(figure: the Ivy Bridge pipeline, from the Intel 64 and IA-32 Architecture Optimization Reference Manual)
Dispatch ports and throughput limits
Specific instructions go to specific dispatch ports, e.g.,
- port 0 : FP mul, int ops, ...
- port 1 : FP add, int ops, ...
- port 2/3 : load, address calc
- port 4 : store data
- port 5 : int, jmp, ...
Dispatch ports tell us which combinations of instructions can be executed in a single cycle, e.g.,
- a vmulps + a vaddps + a load + an int add, or
- three int adds
- ...
Throughput limits of instructions
many instructions have a throughput of 1 instruction/cycle per dispatch port
yet some have less than that, or are not even pipelined
- vrsqrtps : 1/2
- vsqrtps : 1/7
- vdivps : 1/14
- vmovaps (store) : 1/2
the 256-bit store throughput of 1/2 is the reason for the plateau: each iteration does one store, so it cannot take fewer than 2 cycles
How to overcome the throughput limit?
we obviously need to quit loading/storing data for every single multiply/add

for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++) {
    x[j] = a * x[j] + c;
  }
}

the minimum "unit" of a computation should look like:

load x[j] to a register;
do x = a * x + c several times on the register;
store the result to x[j];

and run multiple independent units (a sketch follows)
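A minimal sketch of one such unit (the function name and the steps parameter are mine; a8 and c8 are the broadcast vectors from the earlier slides):

typedef float float8 __attribute__((vector_size(32)));

/* sketch: load once, update the register several times, store once */
void unit(float8 * x, long j, float8 a8, float8 c8, int steps) {
  float8 t = x[j];             /* one load */
  for (int s = 0; s < steps; s++)
    t = a8 * t + c8;           /* no memory traffic here */
  x[j] = t;                    /* one store */
}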
Several ways to arrange the computation
- advance all variables one step at a time (suffers from the store throughput)
- take one variable at a time and run it until the end (suffers from latency)
- strategy 1: take a few variables and run them until the end
- strategy 2: advance all variables, a few steps at a time
(figure: the (i, j) iteration space for each arrangement, showing which values stay on registers between load and store)
Implementing strategy 1
say we advance eight variables at a time

for (jj = 0; jj < m; jj += 8) {
  /* run 8 variables until the end */
  for (i = 0; i < n; i++) {
    float8 * xx = &x[jj];
    for (j = 0; j < 8; j++) {
      xx[j] = a * xx[j] + c;
    }
  }
}

we hope it loads/stores each variable only once through the whole i loop!
this depends a lot on the compiler's smartness: it must promote the fixed-size array into registers
if the compiler is not clever enough, manually write eight separate variables (a sketch follows)
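If the compiler does not do that promotion, a manual version might look like this sketch (my code; a8 and c8 are the broadcast vectors, and x is an array of float8):

typedef float float8 __attribute__((vector_size(32)));

/* sketch: eight named variables force each value to live in a register */
void strategy1_unrolled(float8 * x, long m, long n, float8 a8, float8 c8) {
  for (long jj = 0; jj + 7 < m; jj += 8) {
    float8 x0 = x[jj+0], x1 = x[jj+1], x2 = x[jj+2], x3 = x[jj+3];
    float8 x4 = x[jj+4], x5 = x[jj+5], x6 = x[jj+6], x7 = x[jj+7];
    for (long i = 0; i < n; i++) {   /* eight independent chains per iteration */
      x0 = a8 * x0 + c8; x1 = a8 * x1 + c8;
      x2 = a8 * x2 + c8; x3 = a8 * x3 + c8;
      x4 = a8 * x4 + c8; x5 = a8 * x5 + c8;
      x6 = a8 * x6 + c8; x7 = a8 * x7 + c8;
    }
    x[jj+0] = x0; x[jj+1] = x1; x[jj+2] = x2; x[jj+3] = x3;
    x[jj+4] = x4; x[jj+5] = x5; x[jj+6] = x6; x[jj+7] = x7;
  }
}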
Implementing strategy 2
say we advance all variables two steps at a time

for (i = 0; i < n; i += 2) {
  /* run all variables 2 steps */
  for (j = 0; j < m; j++) {
    x[j] = a * x[j] + c;
    x[j] = a * x[j] + c;
  }
}

again, we hope it performs a single load/store per j iteration (eliminating the load/store between the two updates)
the latency of a single j iteration increases, but we hope the j loop exposes lots of independent computations
Results
strategy 1: succeeded
- ≈ 15 flops/clock out of 16 flops/clock; 94% of the peak
strategy 2: not as successful as strategy 1
- ≈ 11 - 14 flops/clock (70% - 85% of the peak)
why?
(figure: flops/clock vs. the number of chains, for strategy 1 and strategy 2 with 2, 3, and 4 steps)
Reason for the suboptimal results of strategy 2
again, latency is the issue, but in a subtler way
remember the reservation station?
the reservation station is finite (Ivy Bridge has 54 entries)
⇒ the processor can hold only so many waiting (i.e., decoded but not yet sent to dispatch ports) instructions
when the reservation station is full, the processor cannot decode any more instructions (even if some of them could be executed immediately)
The limit on the number of waiting instructions
instructions must enter the reservation station in program order
instructions near the tail of a long dependency chain occupy entries for a long time
if there are too many such instructions, the processor cannot decode instructions further ahead, and fails to find otherwise executable instructions
Takeaways
the processor cannot execute a serial chain of dependent instructions fast (throughput = 1/latency)
have a number of independent chains and gain throughput
avoid putting long-waiting instructions upstream
peak FLOPS is rarely achievable, but trying to achieve it is a great learning opportunity (about superscalar processors)
Summary
without SIMD instructions, the processor achieves at most 1/D of its peak FLOPS (D = the number of SIMD lanes, typically 2 - 16)
flops are not the only limiting factor, and there are many algorithms that simply can't be vectorized; yet SIMD is too significant to simply ignore
orthogonally to SIMD, increase IPC by exposing independent instructions to the processor
long chains of dependent instructions are bad, as the processor can keep only so many waiting instructions
Remember GEMM?
the leaf loop of GEMM

for (i = 0; i < M; i++) {
  for (j = 0; j < N; j++) {
    for (k = 0; k < K; k++) {
      C(i,j) += A(i,k) * B(k,j);  // like c = a * x + c
    }
  }
}

is fairly close to what we have studied

for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++) {
    x[j] = a * x[j] + c;
  }
}

in the next hands-on, apply what you've learned to this code (one possible direction is sketched below)
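As one hedged sketch of the direction (my code, not the course's solution): vectorize over j and keep a strip of C on registers across the k loop, strategy-1 style. It assumes row-major layout, N a multiple of 8, and 32-byte aligned rows; blocking for caches is left to the hands-on.

typedef float float8 __attribute__((vector_size(32)));

/* sketch: keep C(i, j:j+8) in a register across the whole k loop */
void gemm_leaf(long M, long N, long K,
               const float * restrict A,   /* M x K */
               const float * restrict B,   /* K x N */
               float * restrict C) {       /* M x N */
  for (long i = 0; i < M; i++) {
    for (long j = 0; j + 7 < N; j += 8) {
      float8 cij = *(float8 *)&C[i * N + j];              /* load once */
      for (long k = 0; k < K; k++) {
        float aik = A[i * K + k];
        float8 a8 = { aik,aik,aik,aik,aik,aik,aik,aik };  /* broadcast the scalar */
        cij += a8 * *(const float8 *)&B[k * N + j];
      }
      *(float8 *)&C[i * N + j] = cij;                     /* store once */
    }
  }
}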