MATLAB in HPC: The challenges involved in providing a high-level language on high-performance hardware. Jos Martin Principal Architect, Parallel Computing Tools © 2014 The MathWorks, Inc.1 Agenda Aims CPU parallel GPU Some future directions 2 Attributes of MATLAB 1. 2. 3. 4. Mathematically correct Usable Bug-free Fast 3 Greater Control Ease of Use Usable API's CPU Parallel GPU built-in to toolboxes gpuArray, associated maths Intermediate parfor, distributed arrays, batch arrayfun(@fun, …) pagefun(@fun, …) Detailed spmd, jobs and tasks, parfeval direct integration with CUDA kernels (in MATLAB & mex) Simple 4 Built-in support 17 Products contain built-in support for parallelism Simulink Embedded Coder Simulink Coder Bioinformatics Toolbox Robust Control Toolbox Simulink Control Design Simulink Design Optimization Image Processing Toolbox Global Optimization Toolbox Model-Based Calibration Toolbox Neural Network Toolbox Optimization Toolbox Statistics Toolbox Communications System Toolbox Phased Array System Toolbox Signal Processing Toolbox SystemTest 5 parfor Definition Code in a parfor loop is guaranteed by the programmer to be execution order independent Why is that important? We can execute the iterates of the loop in any order, potentially at the same time on many different workers. 6 A simple parfor loop parfor i = 1:N out(i) = someFunction(in(i)); end 7 Reduction with parfor reduce = zeros(A, 1); parfor i = 1:N reduce = reduce + someFun(i, A); end 8 parfor Algorithms Harmonic division of work Minimal communication Static language analysis and transformation Non-deterministic 9 Distributed Arrays D = distributed( someArray ) D = rand( 1e5, 'distributed' ) 1-D distribution schemes (any dimension & partition) 2-D block-cyclic schemes ~300 functions available 10 Some Benchmarks Score (Lower is better) Implementation HPL FFT EP-Stream 3 6 6 o = A\b; o = fft(v); o = a.*b + c; 11 Science (even in HPC) is about the Maths Don’t make it hard to program Make expressing parallelism easy 12 Distributed Array Algorithms Some we write BLACS (Basic Linear Algebra Communication Subprograms) ScaLAPACK FFTW3 mpi Indexing and assignment 13 Underneath Distributed Arrays spmd – codistributed arrays (explicitly collective) – Communication primitives labSend, labRecieve, labSendReceive, labBroadcast, labBarrier, numlabs, labindex – Deadlock and interrupt detection 14 You don't even need a cluster Make it easy to try out – Local cluster with Parallel Computing Toolbox – Scale out to cluster with no code changes 15 Using gpuArray Honestly – it’s just like an ordinary MATLAB array Except that the methods that are implemented for it will run on the GPU (over 300 currently and growing) – Maybe some of these will be faster on your GPU Want to get the data back to the CPU c = gather(g); 16 GPUness spreads function [a, b, c] = example(d, e, f) a = sin(d) + e; b = cos(d) + f; c = a + b + e + f; 17 GPUness spreads function [a, b, c] = example(d, e, f) % Imagine if the input d were on the GPU a = sin(d) + e; b = cos(d) + f; c = a + b + e + f; 18 Semantic work pattern: gpuArray D = A.*B + C .* + A .* tmp + B .* C + .* D + time 19 Lazy Evaluation Where possible we queue things up on the GPU and return back to the program immediately – We also try to amalgamate sets of operations together 20 Actual work pattern: gpuArray .* .* On GPU On CPU A B + tmpactual + .* + .* + C tmpfuture Dfuture CPU code continues Dactual time 21 Lazy Evaluation Why do you care? – Improves performance a lot – CPU & GPU work at the same time. But be careful because tic;toc; can easily give you the wrong time, since the computation hasn’t finished d = gpuDevice; % Get the current GPU device tic gpuStuffToTime; wait(d); % wait for computation on the GPU d to finished toc 22 Can we do better? D = A.*B + C A(1) B(1) .* tmp(1) + D(1) A(2) B(2) .* tmp(2) + D(2) A(3) B(3) .* tmp(3) + D(3) A(4) B(4) .* tmp(4) + D(4) 23 arrayfun Apply a function to each element of a set of gpuArrays [o1, o2] = arrayfun(@aFunction, s1, s2, s3) Some limitations apply – All code uses scalar variables – Only a subset of the MATLAB language is supported 24 Why is this a good idea? We know what inputs are being passed to your function We know what code is in your function with that we can infer the type of all variables in your code and then we can generate code for your GPU for each element of your input arrays we can execute your function on a single CUDA thread – remember a GPU can execute thousands of threads at once, and schedule even more 25 Other ways to express parallelism pagefun( @fun, A, B, C, ... ) – For each page of arrays A, B, C, etc. call the function fun. – A page of an array is defined to be A(:,:,I,J,K,...) – On the GPU, for certain functions, this can be run in parallel 26 GPU Invoking CUDA Kernels MATLAB % Setup kern = parallel.gpu.CUDAKernel(‘myKern.ptx’, cFcnSig) % Configure kern.ThreadBlockSize=[512 1]; kern.GridSize=[1024 1024]; % Run [c, d] = feval(kern, a, b); C & mex // Setup mxGPUArray const * A = mxGPUCreateFromMxArray(prhs[0]); // Create a GPUArray to hold the result and get its underlying // pointer. mxGPUArray * B = mxGPUCreateGPUArray(mxGPUGetNumberOfDimensions(A), mxGPUGetDimensions(A), mxGPUGetClassID(A), mxGPUGetComplexity(A), MX_GPU_DO_NOT_INITIALIZE); double * d_B = (double *)(mxGPUGetData(B)); // Standard CUDA kernel call using the CUDA runtime. TimesTwo<<<blocksPerGrid, threadsPerBlock>>>(d_B, N); } // Device code prototype ... void __global__ TimesTwo(double * const B, int const N) { ... }; 27 Thank You 28
© Copyright 2025 ExpyDoc