SIMD Programming and What You Must Know about CPU Peak FLOPS
Kenjiro Taura

Contents
1 Introduction
2 SIMD Instructions
3 SIMD programming alternatives
  - Auto loop vectorization
  - GCC's Vector Types
  - Vector intrinsics
4 What you must know to get near peak FLOPS

Remember the performance of our Cholesky code?
(figure: the measured performance of our Cholesky code)

What is the theoretical limit?
- we are using an Intel Ivy Bridge processor
- a single core can execute, in every cycle, a floating point multiply
  instruction, a floating point add instruction, and other instructions
  (e.g., integer arithmetic, load, store, ...), which I'll cover later
- moreover, a single add/multiply instruction can add/multiply four or
  eight operands: Single Instruction Multiple Data (SIMD) instructions

Terminology
- flops: floating point operations
- FLOPS: Floating Point Operations Per Second
- practically,

    peak FLOPS = FP operands per instruction (SIMD width)
               × instructions per cycle (IPC)
               × cycles per second (frequency)

Peak flops/cycle of recent cores
Recent processors increasingly rely on SIMD as an energy-efficient way
to boost peak FLOPS.

    microarchitecture         ISA       SIMD width  throughput (per cycle)  max flops/cycle
    Nehalem                   SSE            4      1 add + 1 mul                  8
    Sandy Bridge/Ivy Bridge   AVX            8      1 add + 1 mul                 16
    Haswell                   AVX2           8      2 fmadds                      32
    Knights Corner            AVX-512F      16      1 fmadd (*)                   32

  SIMD width : number of single precision operands
  ISA : Instruction Set Architecture
  fmadd : fused multiply-add
  (*) : a single hardware thread can achieve half of it
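To make the formula and the table concrete, here is the arithmetic for a
single Ivy Bridge core running single precision AVX code; the 3.0 GHz
clock below is an assumed example frequency, not a figure from these
slides:

    peak FLOPS = 8   (SP operands per AVX instruction)
               × 2   (instructions per cycle: one vaddps + one vmulps)
               × 3.0 × 10^9   (assumed clock)
               = 48 GFLOPS (single precision, one core)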
Today's theme
- so our Cholesky code runs at a mere 5% of the peak
- I'll talk about the ingredients with which to improve it, and make
  that a topic of the next hands-on
- we'll learn:
  - practical ways to use SIMD instructions
  - basics of processors, to know what kind of code can get
    close-to-peak performance

Intel SIMD instructions at a glance
Some example AVX instructions:

    operation  syntax                     C-like expression
    multiply   vmulps %ymm0,%ymm1,%ymm2   ymm2 = ymm1 * ymm0
    add        vaddps %ymm0,%ymm1,%ymm2   ymm2 = ymm1 + ymm0
    load       vmovaps 400(%rax),%ymm0    ymm0 = *(rax+400)
    store      vmovaps %ymm0,400(%rax)    *(rax+400) = ymm0

- ymm0 ... ymm15 are 256 bit registers; each can hold eight single
  precision (float of C) or four double precision (double of C)
  floating point numbers
- the ps suffix stands for "packed single precision"

Intel SIMD instructions at a glance
Don't be confused by similar but different variations:

    operands                   lanes  vector/scalar?  width (bits)  ISA
    vmulps %ymm0,%ymm1,%ymm2   8 SPs  vector          256           AVX
    vmulpd %ymm0,%ymm1,%ymm2   4 DPs  vector          256           AVX
    mulps  %xmm0,%xmm1         4 SPs  vector          128           SSE
    mulpd  %xmm0,%xmm1         2 DPs  vector          128           SSE
    mulss  %xmm0,%xmm1         1 SP   scalar          (32)          SSE
    mulsd  %xmm0,%xmm1         1 DP   scalar          (64)          SSE

- ...ps : packed single precision; ...pd : packed double precision
- xmm0 ... xmm15 : 128 bit SSE registers (aliased to the lower halves
  of the ymm registers)
- Intel instructions have traditionally taken two operands, but since
  AVX, many new instructions take three operands

Applications/limitations of SIMD
- SIMD is good at parallelizing computations that do almost exactly the
  same series of instructions on contiguous data
- ⇒ generally, the main targets are simple loops whose iterations have
  identical control flows
- it is less effective for:
  - loops with conditional branches
  - loops with nested loops of variable trip counts
  - loops with unknown function calls
  - non-loops

Tools for SIMD programming
- auto vectorization
  - loop vectorization
  - basic block vectorization
- SIMD-oriented constructs/directives
  - SIMD directives for loops (OpenMP 4.0/Cilk Plus)
  - SIMD-enabled functions (OpenMP 4.0/Cilk Plus)
  - array languages (Cilk Plus)
  - specially designed languages
- somewhat portable vector types
  - GCC vector extensions
  - Boost.SIMD
- vector intrinsics
- assembly programming
I'll cover auto vectorization, vector types, and intrinsics today, and
the directive-based constructs when I talk about OpenMP.

Auto loop vectorization
- write scalar loops and hope the compiler does the job, e.g.,

    void axpy_auto(float a, float * x, float c, long m) {
      long j;
      for (j = 0; j < m; j++) {
        x[j] = a * x[j] + c;
      }
    }

- compile and run:

    $ gcc -o simd_auto -march=native -O3 simd_auto.c
    $ ./simd_auto
    8000000000.000000 flops
    1226107900 clocks
    6.524711 flops/clock

How to know if the compiler vectorized it?
- there are options that report whether vectorization succeeded, and
  why it didn't:

               report successful vectorization  report failed vectorization
    GCC < 4.8  -ftree-vectorizer-verbose=1      -ftree-vectorizer-verbose=2
    GCC ≥ 4.8  -fopt-info-vec-optimized         -fopt-info-vec-missed
    Intel      -vec-report=1                    -vec-report=2

- but don't hesitate to dive into the assembly code; gcc -S is your
  friend
- trick: enclose loops with inline assembler comments

    asm volatile ("# xxx loop begin");
    for (i = 0; i < n; i++) {
      ... /* hope to be vectorized */
    }
    asm volatile ("# xxx loop end");

Issues that complicate vectorization
- misaligned data
- potential aliasing
Telling the compiler "they don't happen; don't worry about them" helps
compilers do a better job.

Alignment and vectorization
- background: SIMD load/store instructions require, or prefer, the
  address to be aligned to the vector size
  - movaps : requires a 16 byte-aligned address
  - vmovaps : requires a 32 byte-aligned address
- even for loops as simple as this,

    for (j = 0; j < m; j++) {
      x[j] = a * x[j] + c;
    }

  the generated code runs a few scalar iterations until the address of
  x[j] becomes aligned
- if possible, (step 1) align your data appropriately, and (step 2) let
  the compiler know about it

Step 1: align data
- static/local/global variables and arrays:

    float a[100] __attribute__((aligned(64)));

- dynamically allocated data:

    float * a;
    int err = posix_memalign((void **)&a, 64, sizeof(float) * 100);
    if (err) failed();

Step 2: let the compiler know about alignment
- GCC (≥ 4.6):

    x = __builtin_assume_aligned(x, 64);

  returns x, except that the compiler now assumes it is 64 byte-aligned
- GCC (< 4.6); only works for function parameters:

    void f(float * __attribute__((aligned(64))) x) {
      ...
    }
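Putting the two steps together, a minimal sketch (the function names
alloc_x and axpy_aligned are ours, invented for this example):

    #include <stdlib.h>

    /* allocate a 64 byte-aligned float array of m elements */
    float * alloc_x(long m) {
      void * p = NULL;
      if (posix_memalign(&p, 64, sizeof(float) * m)) return NULL;
      return (float *)p;
    }

    /* axpy over an array known to be 64 byte-aligned; the promise lets
       the compiler use aligned vector loads/stores and skip the scalar
       prologue that would otherwise re-align x at runtime */
    void axpy_aligned(float a, float * x, float c, long m) {
      x = __builtin_assume_aligned(x, 64);
      for (long j = 0; j < m; j++) {
        x[j] = a * x[j] + c;
      }
    }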
Aliasing and vectorization
- loops operating on two or more arrays are often not vectorizable if
  the arrays might happen to be the same:

    for (i = 0; i < m; i++) {
      y[i] = x[i+1] + 1;
    }

- good compilers generate code that first checks, at runtime, whether
  x[i+1:i+9] and y[i:i+8] point to the same region, and jumps to the
  8-way-vectorized code when they don't:

    $ gcc -march=native -O3 -fopt-info-vec-optimized ..
    a.c:6: note: create runtime check for data references *_12 and *_8
    a.c:6: note: created 1 versioning for alias checks.
    ..

- if you know they don't overlap, it's advisable to make that explicit;
  the restrict keyword, introduced by C99, does just that

restrict keyword
- annotate parameters of pointer type with restrict if you know they
  never point to the same data:

    void axpy(float a, float * restrict x, float c,
              float * restrict y, long m) {
      for (long j = 0; j < m; j++) {
        y[j] = a * x[j+8] + c;
      }
    }

- you need to specify -std=c99 or -std=gnu99:

    $ gcc -march=native -O3 -S a.c -std=gnu99 -fopt-info-vec-optimized
    ...
    a.c:5: note: LOOP VECTORIZED.
    a.c:1: note: vectorized 1 loops in function.
    ...

Lower level alternatives
- auto vectorization fails for all but the simplest kinds of code
- there are various ways to explicitly vectorize your code

GCC vector types
- GCC allows you to define a vector type:

    typedef float float8 __attribute__((vector_size(32)));

- you can use arithmetic on vector types:

    float8 x, y, z;
    z += x * y;

- you cannot mix scalars and vectors, however:

    float8 x, y, z;
    z = 3.5 * x + y;   /* error: scalar-vector mix */

- you can combine them with intrinsics (an example follows the
  intrinsics section below)

axpy in GCC vector extension
- scalar code:

    void axpy_auto(float a, float * x, float c, long m) {
      long j;
      for (j = 0; j < m; j++) {
        x[j] = a * x[j] + c;
      }
    }

- GCC vector extension:

    typedef float float8 __attribute__((vector_size(32)));

    void axpy_vector_ext(float a, float * x, float c, long m) {
      long j;
      float8 a8 = { a,a,a,a,a,a,a,a };
      float8 c8 = { c,c,c,c,c,c,c,c };
      for (j = 0; j + 7 < m; j += 8) {
        *((float8*)&x[j]) = a8 * *((float8*)&x[j]) + c8;
      }
      for (; j < m; j++) {
        x[j] = a * x[j] + c;
      }
    }

Vector intrinsics
- processor/platform-specific functions and types
- for AVX,

    #include <immintrin.h>

  gives you the __m256 (256 bit vector) and __m128 (128 bit vector)
  types and lots of _mm256_xxx (AVX) and _mm_xxx (SSE) functions
- each function maps almost directly to a single assembly instruction,
  so you know what will happen
  - _mm256_load_ps, _mm256_store_ps, _mm256_add_ps, _mm_add_ps, etc.

axpy with vector intrinsics

    void axpy_intrin(float a, float * x, float c, long m) {
      long j;
      for (j = 0; j < m; j += 8) {
        _mm256_store_ps(&x[j],
                        _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(a),
                                                    _mm256_load_ps(&x[j])),
                                      _mm256_set1_ps(c)));
      }
    }

- ugly
- you would rather use the vector extension (or trivially vectorizable
  loops) and use intrinsics sparingly, only when necessary
  - _mm256_set1_ps : extends a scalar to a vector
  - _mm256_shuffle_ps : shuffles words in vectors
  - etc.
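Here is the combination alluded to above, a minimal sketch: the
broadcast is done with an intrinsic, while the arithmetic stays in
vector-extension syntax. The helper name float8_set1 is ours, not a
standard API; the cast relies on GCC defining __m256 as a 32-byte float
vector, so it is a reinterpretation, not a conversion:

    #include <immintrin.h>

    typedef float float8 __attribute__((vector_size(32)));

    /* broadcast a scalar to all 8 lanes using the set1 intrinsic */
    static inline float8 float8_set1(float a) {
      return (float8)_mm256_set1_ps(a);
    }

    void axpy_mixed(float a, float * x, float c, long m) {
      float8 a8 = float8_set1(a);
      float8 c8 = float8_set1(c);
      long j;
      for (j = 0; j + 7 < m; j += 8) {
        *((float8*)&x[j]) = a8 * *((float8*)&x[j]) + c8;
      }
      for (; j < m; j++) {       /* scalar remainder */
        x[j] = a * x[j] + c;
      }
    }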
What you need to know to get near peak FLOPS
- so you now know how to use SIMD instructions, which may bring an 8x
  performance boost
- dense matrix computation seems particularly well-suited to SIMD
- yet, just vectorizing the code does not automatically bring a
  satisfactory result

An endeavor to nearly peak FLOPS
- let's start with the simplest code, doing only fp mul and add; can it
  be simpler than this?

    typedef float float8 __attribute__((vector_size(32)));

    float8 axpy_1(float8 a, float8 x, float8 c, long n) {
      long i;
      for (i = 0; i < n; i++) {
        x = a * x + c;
      }
      return x;
    }

- the code performs 16n flops (8 multiplies + 8 adds per iteration)
- there is no memory access in the loop
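The slides print flops, clocks, and flops/clock but do not show the
measurement harness, so the driver below is our sketch of one. It uses
__rdtsc from x86intrin.h, which counts reference cycles at a constant
rate, so flops/clock is approximate whenever the core clock differs from
the TSC rate; axpy_1 from the listing above is assumed to be in the same
file:

    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc */

    typedef float float8 __attribute__((vector_size(32)));

    float8 axpy_1(float8 a, float8 x, float8 c, long n);  /* above */

    int main(void) {
      long n = 1000 * 1000 * 1000;
      float8 a = { 0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5 };
      float8 x = a, c = a;
      unsigned long long t0 = __rdtsc();
      x = axpy_1(a, x, c, n);
      unsigned long long t1 = __rdtsc();
      double flops = 16.0 * n;         /* 16 flops per iteration */
      printf("%f flops\n%llu clocks\n%f flops/clock\n",
             flops, t1 - t0, flops / (t1 - t0));
      return x[0] > 1.0f;              /* keep the result live */
    }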
Let's run it!
- compile:

    $ gcc -o 0axpy -march=native -O3 0axpy.c

- and run!

    $ ./0axpy
    16000000000.000000 flops
    8024035912 clocks
    1.994009 flops/clock

- what!? let's examine what the compiler did

What!?
- put a landmark in the assembly code:

    float8 axpy_1(float8 a, float8 x, float8 c, long n) {
      long i;
      asm volatile ("# ax+c loop begin");
      for (i = 0; i < n; i++) {
        x = a * x + c;
      }
      asm volatile ("# ax+c loop end");
      return x;
    }

- compile into assembly:

    $ gcc -S -march=native -O3 0axpy.c

  and open 0axpy.s in your editor

Assembly

        # ax+c loop begin
    # 0 "" 2
    #NO_APP
        testq   %rdi, %rdi
        jle     .L2
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
    .L3:
        addq    $1, %rax
        vmulps  %ymm1, %ymm0, %ymm1
        vaddps  %ymm2, %ymm1, %ymm1
        cmpq    %rdi, %rax
        jne     .L3
    .L2:
    #APP
    # 20 "0sequence.c" 1
        # ax+c loop end

- a little calculation shows the above loop takes 16 / 1.994009 ≈ 8
  cycles per iteration. why?

Latency and throughput
- an Ivy Bridge core can execute a vmulps and a vaddps every cycle
- yet, that does not mean the result of the vmulps at line 3 is
  available in the next cycle for the vaddps at line 4:

    1 .L3:
    2     addq    $1, %rax
    3     vmulps  %ymm1, %ymm0, %ymm1   # ymm1 *= ymm0 (a*x)
    4     vaddps  %ymm2, %ymm1, %ymm1   # ymm1 += ymm2 (+c)
    5     cmpq    %rdi, %rax
    6     jne     .L3

- what you need to know:
  - "1 add + 1 mul every cycle" refers to throughput
  - each instruction has a specific latency

Ivy Bridge latencies

    instruction          latency (cycles)
    fp add               3
    fp mul               5
    typical integer ops  1
    ...

http://www.agner.org/optimize/ is an invaluable resource.

Our code in light of latencies
- in our code, vaddps uses the result of vmulps
- moreover, the vmulps of the next iteration uses the result of the
  vaddps of the previous iteration
- well, that was obvious from the source code too:

    for (i = 0; i < n; i++) {
      x = a * x + c;
    }

- conclusion: the loop can't run faster than 8 cycles/iteration, the
  length of the dependency chain (5 cycles for vmulps + 3 for vaddps)

How to overcome latencies?
- increase parallelism!
- you can't make a serial chain of computation run faster (change the
  algorithm if you want to)
- but you can increase throughput, by running many independent chains
- we expect the following to finish in the same number of cycles as the
  original, despite performing twice as many flops; note that the two
  series are independent:

    float8 sequence2(float8 a, float8 x0, float8 x1, float8 c, long n) {
      long i;
      for (i = 0; i < n; i++) {
        x0 = a * x0 + c;
        x1 = a * x1 + c;
      }
      return x0 + x1;
    }

... and you can have more than two chains ...
- we expect to reach peak FLOPS with ≥ 8 chains, and we do:

    float8 sequence2(float8 a, float8 x0, float8 x1, ...,
                     float8 c, long n) {
      for (i = 0; i < n; i++) {
        x0 = a * x0 + c;
        x1 = a * x1 + c;
        ...
      }
      return x0 + x1;
    }

    chains  clocks/iter   flops/clock
     1       8.027084796   1.993252
     2       8.024828988   3.987624
     3       8.024573088   5.981627
     4       8.024671996   7.975404
     6       8.026337580  11.960623
     8       8.106405064  15.789983
     9       9.126867400  15.777593
    11      11.162532772  15.767031
    13      13.185083644  15.775402

  (the accompanying plot shows clocks/iter and flops/clock against the
  number of chains)
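Spelled out for four chains, since the slide elides the loop body (a
sketch; the name sequence4 is ours):

    typedef float float8 __attribute__((vector_size(32)));

    /* four independent chains: the four updates in the loop body have
       no dependencies on each other, so their latencies overlap */
    float8 sequence4(float8 a, float8 x0, float8 x1,
                     float8 x2, float8 x3, float8 c, long n) {
      long i;
      for (i = 0; i < n; i++) {
        x0 = a * x0 + c;
        x1 = a * x1 + c;
        x2 = a * x2 + c;
        x3 = a * x3 + c;
      }
      return x0 + x1 + x2 + x3;
    }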
What's happening in the processor
(diagram: several register chains advancing across cycles, with waiting
instructions held in the reservation station)
- at cycle t, the core executes the t-th vmulps instruction and the
  (t − 5)-th vaddps instruction (out of order execution)
- meanwhile, there are instructions waiting for their operands to
  arrive (in the reservation station)

A variable number of chains
- we have now got a program that achieves peak FLOPS
- as having many variables is ugly and inflexible, let's turn them into
  an array:

    for (i = 0; i < n; i++) {          for (i = 0; i < n; i++) {
      x0 = a * x0 + c;             ⇒     for (j = 0; j < m; j++) {
      x1 = a * x1 + c;                     x[j] = a * x[j] + c;
      ...                                }
    }                                  }

- we hope it shows similar performance (≈ peak if m ≥ 8), and the
  program can handle an arbitrary number of sequences!

Yet, when we experiment ...

    chains  flops/clock
     1      1.063351
     2      2.118230
     3      3.171326
     4      4.234297
     5      5.244192
     6      5.890180
     7      6.718166
     8      7.473419
     9      7.555141
    10      7.387568
    11      7.554162
    12      7.589363
    13      7.679630

  (the accompanying plot compares the scalars version and the array
  version)

Two issues:
- the throughput hits a plateau at ≈ 2 cycles/iter
- the latency of a single update became ≈ 16 cycles

Take a look at assembly
- it looks like:

    # j loop begins
    ...
    .L14:
        vmulps  (%rdi,%rax), %ymm0, %ymm2
        vaddps  %ymm1, %ymm2, %ymm2
        vmovaps %ymm2, (%rdi,%rax)
        addq    $32, %rax
        cmpq    %rcx, %rax
        jne     .L14
    ...
    # j loop ends

- do you see the difference from the code we saw before for the scalar
  version?

    .L3:
        addq    $1, %rax
        vmulps  %ymm1, %ymm0, %ymm1
        vaddps  %ymm2, %ymm1, %ymm1
        cmpq    %rdi, %rax
        jne     .L3

The reason for the latency (16 cycles)
- both issues stem from the fact that the code now involves loads and
  stores
- according to Agner's document, the relevant latencies are:

    vmulps          : 5
    vaddps          : 3
    vmovaps (load)  : 4
    vmovaps (store) : 4

- this explains the 16 cycles (5 + 3 + 4 + 4 = 16) at m = 1

The reason for the throughput (2 cycles/iter)
- what you need to know: all instructions have their own throughput
  limits, due to execution resources (dispatch ports and execution
  units)
- "a single mul + add per cycle" is just one example of such a limit
- Ivy Bridge has 6 dispatch ports, behind which various execution units
  are attached (see the Intel 64 and IA-32 Architectures Optimization
  Reference Manual)

Dispatch ports and throughput limits
- specific instructions go to specific dispatch ports, e.g.,
  - port 0 : FP mul, int ops, ...
  - port 1 : FP add, int ops, ...
  - port 2/3 : load, address calculation
  - port 4 : store data
  - port 5 : int ops, jmp, ...
- the dispatch ports tell us which combinations of instructions can be
  executed in a single cycle, e.g., a vmulps + a vaddps + a load + an
  int add; or three int adds; ...

Throughput limits of instructions
- many instructions have a throughput of 1 instruction/cycle per
  dispatch port
- yet some have less than that, or are not even pipelined:
  - vrsqrtps : 1/2
  - vsqrtps : 1/7
  - vdivps : 1/14
  - vmovaps (store) : 1/2
- the 256 bit store throughput of 1/2 is the reason for the plateau

How to overcome the throughput limit?
- we obviously need to quit loading/storing data for every single
  multiply/add:

    for (i = 0; i < n; i++) {
      for (j = 0; j < m; j++) {
        x[j] = a * x[j] + c;
      }
    }

- the minimum "unit" of the computation should look like:

    load x[j] to a register;
    do x = a * x + c several times on the register;
    store the result to x[j];

  and we should run multiple independent units

Several ways to arrange the computation
- taking one variable at a time and running it on a register until the
  end suffers from latency; advancing all variables one step at a time
  suffers from the store throughput; so:
- strategy 1: take a few variables at a time and run them until the end
  (the set of variables kept on registers is small and constant)
- strategy 2: advance all variables, a few steps at a time (load to a
  register, advance a few steps, store)

Implementing strategy 1
- say we advance eight variables at a time:

    for (jj = 0; jj < m; jj += 8) {
      /* run 8 variables until the end */
      for (i = 0; i < n; i++) {
        float8 * xx = &x[jj];
        for (j = 0; j < 8; j++) {
          xx[j] = a * xx[j] + c;
        }
      }
    }

- we hope each variable is loaded/stored only once through the entire
  i loop!
- this depends a lot on the compiler's smartness: it must promote the
  fixed-size array into registers
- if the compiler is not clever enough, manually write eight separate
  variables, as in the sketch below
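The manual version the slide alludes to looks roughly like this (a
sketch with our own naming, shown with four variables to keep it short;
the remainder loop for m not a multiple of 4 is omitted):

    typedef float float8 __attribute__((vector_size(32)));

    /* strategy 1, manually unrolled: each of x0..x3 lives in a ymm
       register for the whole i loop; one load and one store per
       variable in total */
    void axpy_strategy1_manual(float8 a, float8 * x, float8 c,
                               long m, long n) {
      for (long jj = 0; jj + 3 < m; jj += 4) {
        float8 x0 = x[jj], x1 = x[jj+1], x2 = x[jj+2], x3 = x[jj+3];
        for (long i = 0; i < n; i++) {
          x0 = a * x0 + c;
          x1 = a * x1 + c;
          x2 = a * x2 + c;
          x3 = a * x3 + c;
        }
        x[jj] = x0; x[jj+1] = x1; x[jj+2] = x2; x[jj+3] = x3;
      }
    }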
Implementing strategy 2
- say we advance all variables two steps at a time:

    for (i = 0; i < n; i += 2) {
      /* run all variables 2 steps */
      for (j = 0; j < m; j++) {
        x[j] = a * x[j] + c;
        x[j] = a * x[j] + c;
      }
    }

- again, we hope it performs a single load/store per j iteration
  (eliminating the load/store between the two consecutive updates)
- the latency of a single j iteration increases, but we hope the j loop
  exposes lots of independent computations

Results
- strategy 1 succeeded: ≈ 15 flops/clock out of 16 flops/clock, 94% of
  the peak
- strategy 2 was not as successful: ≈ 11 - 14 flops/clock (70% - 85% of
  the peak)
- why?
  (the accompanying plot shows flops/clock against the number of chains
  for strategy 1 and for strategy 2 with 2, 3, and 4 steps)

The reason for strategy 2's suboptimal results
- again, latency is the issue, but in a subtler way
- remember the reservation station?
- the reservation station is finite (Ivy Bridge has 54 entries) ⇒ the
  processor can hold only so many waiting instructions (i.e., decoded
  but not yet sent to dispatch ports)
- when the reservation station is full, the processor cannot decode any
  more instructions (even if some of them could be executed immediately)

The limit on the number of waiting instructions
- instructions must enter the reservation station in program order
- instructions near the tail of a long dependency chain occupy entries
  for a long time
- if there are too many such instructions, the processor cannot decode
  instructions further ahead, and fails to find otherwise executable
  instructions

Takeaways
- the processor cannot execute a serial chain of dependent instructions
  fast (its throughput = 1/latency)
- have a number of independent chains and gain throughput
- avoid putting long-waiting instructions upstream
- peak FLOPS is rarely achievable, but trying to achieve it is a great
  learning opportunity (about superscalar processors)

Summary
- without SIMD instructions, the processor achieves at most 1/D of its
  peak FLOPS (D = the number of SIMD lanes, 2 - 16)
- flops are not the only limiting factor, and many algorithms simply
  can't be vectorized; yet SIMD is too significant to simply ignore
- orthogonally to SIMD, increase IPC by exposing independent
  instructions to the processor
- long chains of dependent instructions are bad, as the processor can
  keep only so many waiting instructions

Remember GEMM?
- the leaf loop of GEMM

    for (i = 0; i < M; i++) {
      for (j = 0; j < N; j++) {
        for (k = 0; k < K; k++) {
          C(i,j) += A(i,k) * B(k,j);   // like x = a * x + c
        }
      }
    }

  is fairly close to what we have studied:

    for (i = 0; i < n; i++) {
      for (j = 0; j < m; j++) {
        x[j] = a * x[j] + c;
      }
    }

- in the next hands-on, apply what you've learned to this code
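As a starting point, one possible shape of that leaf loop after applying
today's lessons is sketched below; it is our illustration, not the
hands-on solution. Assumptions: row-major arrays with 32 byte-aligned
rows (N a multiple of 8, M even), and the C(i,j)/A(i,k)/B(k,j) macros
replaced by explicit index expressions. Two rows of C give two
independent chains; in practice you would block more of i and j to get
enough chains:

    typedef float float8 __attribute__((vector_size(32)));

    /* strategy 1 applied to GEMM: C(i,j:j+8) and C(i+1,j:j+8) stay in
       registers for the whole k loop; loads/stores of C happen only at
       the edges of the k loop */
    void gemm_leaf(long M, long N, long K,
                   float * A, float * B, float * C) {
      for (long i = 0; i + 1 < M; i += 2) {
        for (long j = 0; j + 7 < N; j += 8) {
          float8 c0 = *(float8*)&C[i * N + j];
          float8 c1 = *(float8*)&C[(i + 1) * N + j];
          for (long k = 0; k < K; k++) {
            float a0 = A[i * K + k], a1 = A[(i + 1) * K + k];
            float8 a08 = { a0,a0,a0,a0,a0,a0,a0,a0 };  /* A(i,k)   */
            float8 a18 = { a1,a1,a1,a1,a1,a1,a1,a1 };  /* A(i+1,k) */
            float8 b = *(float8*)&B[k * N + j];        /* B(k,j:j+8) */
            c0 += a08 * b;    /* two independent chains */
            c1 += a18 * b;
          }
          *(float8*)&C[i * N + j] = c0;
          *(float8*)&C[(i + 1) * N + j] = c1;
        }
      }
    }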