anahpc13 Jos Martin presentation slides

MATLAB in HPC:
The challenges involved in providing a high-level language on high-performance hardware.
Jos Martin
Principal Architect, Parallel Computing Tools
© 2014 The MathWorks, Inc.
1
Agenda
– Aims
– CPU parallel
– GPU
– Some future directions
2
Attributes of MATLAB
1. Mathematically correct
2. Usable
3. Bug-free
4. Fast
3
Usable APIs
(vertical axis: Ease of Use at the top, Greater Control at the bottom)

               CPU Parallel                        GPU
Simple         built-in to toolboxes               gpuArray and associated maths
Intermediate   parfor, distributed arrays, batch   arrayfun(@fun, …), pagefun(@fun, …)
Detailed       spmd, jobs and tasks, parfeval      direct integration with CUDA kernels (in MATLAB & mex)
4
Built-in support
17 products contain built-in support for parallelism:
Simulink
Embedded Coder
Simulink Coder
Bioinformatics Toolbox
Robust Control Toolbox
Simulink Control Design
Simulink Design Optimization
Image Processing Toolbox
Global Optimization Toolbox
Model-Based Calibration Toolbox
Neural Network Toolbox
Optimization Toolbox
Statistics Toolbox
Communications System Toolbox
Phased Array System Toolbox
Signal Processing Toolbox
SystemTest
5
parfor
Definition
Code in a parfor loop is guaranteed by the programmer to be execution-order independent.
Why is that important?
Because the iterations of the loop can then be executed in any order, potentially at the same time on many different workers.
6
A simple parfor loop
parfor i = 1:N
    out(i) = someFunction(in(i));
end
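
A runnable sketch of the same pattern (N, in, and someFunction are placeholders on the slide; here sqrt stands in for the per-element work):

N = 1e4;
in = rand(1, N);
out = zeros(1, N);        % preallocate the sliced output variable
parfor i = 1:N
    out(i) = sqrt(in(i)); % any order-independent, per-iteration work
end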
7
Reduction with parfor
reduce = zeros(A, 1);
parfor i = 1:N
    reduce = reduce + someFun(i, A);
end
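
A runnable sketch of the reduction pattern (A and someFun are placeholders on the slide; here each iteration contributes a random A-by-1 vector):

A = 10;
N = 100;
reduce = zeros(A, 1);
parfor i = 1:N
    % reduce is a reduction variable: per-iteration contributions are
    % accumulated on each worker and the partial results combined at the end
    reduce = reduce + rand(A, 1);
end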
8
parfor Algorithms
– Harmonic division of work
– Minimal communication
– Static language analysis and transformation
– Non-deterministic
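
A quick way to see the non-determinism (assuming a parallel pool is open): the printed order usually differs from 1:8 and from run to run, which is exactly why the loop body must be order independent.

parfor i = 1:8
    fprintf('finished iteration %d\n', i);
end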
9
Distributed Arrays
D = distributed( someArray )
D = rand( 1e5, 'distributed' )
– 1-D distribution schemes (any dimension & partition)
– 2-D block-cyclic schemes
– ~300 functions available
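
A minimal sketch of typical use (assuming a parallel pool is available); the same code runs unchanged whatever the number of workers:

D = rand(1e4, 'distributed');  % matrix partitioned across the pool workers
b = sum(D, 2);                 % many functions operate directly on distributed data
x = D \ b;                     % the linear solve runs in parallel across the workers
xLocal = gather(x);            % bring the result back to the client when needed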

10
Some Benchmarks

Benchmark    Score (lower is better)   Implementation
HPL          3                         o = A\b;
FFT          6                         o = fft(v);
EP-Stream    6                         o = a.*b + c;
11
Science (even in HPC) is about the Maths
– Don’t make it hard to program
– Make expressing parallelism easy
12
Distributed Array Algorithms
– Some we write
– BLACS (Basic Linear Algebra Communication Subprograms)
– ScaLAPACK
– FFTW3 MPI
– Indexing and assignment
13
Underneath Distributed Arrays
– spmd
  – codistributed arrays (explicitly collective)
  – Communication primitives: labSend, labReceive, labSendReceive, labBroadcast, labBarrier, numlabs, labindex
  – Deadlock and interrupt detection
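
An illustrative spmd sketch of these primitives (assuming a pool is open): each worker passes its local data one place around a ring.

spmd
    x = labindex * ones(1, 3);              % data local to this worker
    next = mod(labindex, numlabs) + 1;      % worker to send to
    prev = mod(labindex - 2, numlabs) + 1;  % worker to receive from
    y = labSendReceive(next, prev, x);      % simultaneous send and receive
end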
14
You don't even need a cluster
– Make it easy to try out
  – Local cluster with Parallel Computing Toolbox
  – Scale out to a cluster with no code changes
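
A sketch of what "no code changes" means in practice (the cluster profile name here is a placeholder): only the pool you open differs.

parpool('local', 4);            % try it out on the desktop
% parpool('MyHPCCluster', 64);  % later: the same parfor / distributed code on a real cluster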
15
Using gpuArray
– Honestly – it’s just like an ordinary MATLAB array
– Except that the methods that are implemented for it will run on the GPU (over 300 currently and growing)
  – Maybe some of these will be faster on your GPU
– To get the data back to the CPU:
  c = gather(g);
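
For example (a minimal sketch): move the data over once, chain GPU operations, and gather at the end.

g = gpuArray(rand(1e3));  % copy data to the GPU
f = 2 .* fft(g);          % these operations run on the GPU
c = gather(f);            % bring the result back as an ordinary MATLAB array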
16
GPUness spreads
function [a, b, c] = example(d, e, f)
a = sin(d) + e;
b = cos(d) + f;
c = a + b + e + f;
17
GPUness spreads
function [a, b, c] = example(d, e, f)
% Imagine if the input d were on the GPU
a = sin(d) + e;
b = cos(d) + f;
c = a + b + e + f;
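
In a sketch of this (only d starts on the GPU; e and f are ordinary CPU scalars), all three outputs come back as gpuArrays:

d = gpuArray(rand(1000));
[a, b, c] = example(d, 1, 2);
class(a)   % 'gpuArray' – the GPU-ness has spread through the whole calculation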
18
Semantic work pattern: gpuArray
D = A.*B + C
[Diagram: over time, A.*B is evaluated first to produce tmp, then tmp + C is evaluated to produce D; each operation runs to completion before the next begins.]
19
Lazy Evaluation
– Where possible we queue operations up on the GPU and return control to the program immediately
  – We also try to amalgamate sets of operations together
20
Actual work pattern: gpuArray
[Diagram: the CPU enqueues the .* and + operations and immediately receives placeholder results (tmpfuture, Dfuture), so the CPU code continues; the GPU performs the actual work asynchronously, producing tmpactual and Dactual later in time.]
21
Lazy Evaluation
– Why do you care?
  – It improves performance a lot
  – The CPU & GPU work at the same time
– But be careful: tic;toc; can easily give you the wrong time, since the computation may not have finished.

d = gpuDevice; % Get the current GPU device
tic
gpuStuffToTime;
wait(d); % Wait for computation on the GPU d to finish
toc
22
Can we do better?
D = A.*B + C
[Diagram: the same computation drawn element by element: A(i).*B(i) gives tmp(i), and the subsequent + gives D(i), for i = 1..4; each element's chain is independent of the others, so they could all run at once.]
23
arrayfun
– Apply a function to each element of a set of gpuArrays
  [o1, o2] = arrayfun(@aFunction, s1, s2, s3)
– Some limitations apply
  – All code uses scalar variables
  – Only a subset of the MATLAB language is supported
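
For instance, the earlier D = A.*B + C example can be written as a single fused kernel (a sketch; the anonymous function runs once per element on the GPU):

A = rand(1e6, 1, 'gpuArray');
B = rand(1e6, 1, 'gpuArray');
C = rand(1e6, 1, 'gpuArray');
D = arrayfun(@(a, b, c) a.*b + c, A, B, C);  % one kernel launch instead of two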
24
Why is this a good idea?
– We know what inputs are being passed to your function
– We know what code is in your function
– With that we can infer the type of all variables in your code
– And then we can generate code for your GPU
– For each element of your input arrays we can execute your function on a single CUDA thread
  – Remember, a GPU can execute thousands of threads at once, and schedule even more
25
Other ways to express parallelism
– pagefun( @fun, A, B, C, ... )
  – For each page of arrays A, B, C, etc., call the function fun
  – A page of an array is defined to be A(:,:,I,J,K,...)
  – On the GPU, for certain functions, this can be run in parallel
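
For example (a minimal sketch), many small matrix multiplies can be batched in one call:

A = rand(4, 4, 1000, 'gpuArray');
B = rand(4, 4, 1000, 'gpuArray');
C = pagefun(@mtimes, A, B);  % 1000 independent 4-by-4 multiplies, batched on the GPU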
26
Invoking CUDA Kernels

MATLAB
% Setup
kern = parallel.gpu.CUDAKernel('myKern.ptx', cFcnSig);
% Configure
kern.ThreadBlockSize = [512 1];
kern.GridSize = [1024 1024];
% Run
[c, d] = feval(kern, a, b);

C & mex
// Setup
mxGPUArray const * A = mxGPUCreateFromMxArray(prhs[0]);
// Create a GPUArray to hold the result and get its underlying pointer.
mxGPUArray * B = mxGPUCreateGPUArray(mxGPUGetNumberOfDimensions(A),
                                     mxGPUGetDimensions(A),
                                     mxGPUGetClassID(A),
                                     mxGPUGetComplexity(A),
                                     MX_GPU_DO_NOT_INITIALIZE);
double * d_B = (double *)(mxGPUGetData(B));
// Standard CUDA kernel call using the CUDA runtime.
TimesTwo<<<blocksPerGrid, threadsPerBlock>>>(d_B, N);

// Device code prototype ...
__global__ void TimesTwo(double * const B, int const N) { ... };
27
Thank You
28