Parallel Computation Patterns
Presented by Fred Graham

Seminar Objectives
• The patterns discussed in this presentation form the basis of many parallel algorithms that appear in applications today.
• Introduce and analyze the parallel pattern: Convolution.
• Introduce and analyze the parallel pattern: Prefix Sum (Parallel Scan).
• Introduce and analyze the parallel pattern: Reduction Trees.

Parallel Patterns: Convolution
• We start with the extremely important "Convolution" computation pattern.
• Convolution is a popular array operation used in various forms of signal processing, digital recording, image processing, video processing, and computer vision.

Parallel Patterns: Convolution cont'd
• Convolution is typically used as a filter that transforms signals and pixels into more desirable values.
  ▫ Some filters, such as Gaussian filters, smooth out the signal values so that one can see the "big-picture" trend.
  ▫ Others can be used to sharpen boundaries and edges of objects in images.
• Convolution typically involves a significant number of arithmetic operations on each data element.

Parallel Patterns: Convolution cont'd
• For large data sets, such as high-definition images and video, the amount of computation can be very large.
• Each output data element can be calculated independently of the others – a desirable trait for massively parallel computing.
• So how about an example…

Parallel Patterns: 1D Convolution
• Imagine we have a digital audio signal (see the waveform image below). The input data are in 1D form, hence 1-dimensional convolution.
• The example illustrates signal volume as a function of time.
• Notice the "noisy" waveform at the bottom: convolution can be used to smooth the noisy data out.
Image Source: http://www.stereophile.com/content/ps-audiogcc-100-integrated-amplifier-measurements

Parallel Patterns: 1D Convolution (How does it work?)
• Convolution is an array operation where each output data element in P is a weighted sum of a collection of neighbouring input elements in N.
• The weights used in the weighted sum are defined by an input mask array M.
• The illustration shows the calculation for output element P[2].
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Parallel Patterns: 1D Convolution (How does it work?)
• A 5-element mask M is used in this example, so each P element is generated as a weighted sum of the corresponding N element, up to two elements to its left, and up to two elements to its right.
• For example, P[2] on the previous slide is the weighted sum of N[0] (N[2-2]) through N[4] (N[2+2]):
  P[2] = N[0]*M[0] + N[1]*M[1] + N[2]*M[2] + N[3]*M[3] + N[4]*M[4]
       = 1*3 + 2*4 + 3*5 + 4*4 + 5*3 = 57

Parallel Patterns: 1D Convolution (How does it work?)
• Now you do it… Calculate the value of element P[3] in the previous example.
  P[3] = N[1]*M[0] + N[2]*M[1] + N[3]*M[2] + N[4]*M[3] + N[5]*M[4]
       = 2*3 + 3*4 + 4*5 + 5*4 + 6*3 = 76

  N:  N[0]=1  N[1]=2  N[2]=3  N[3]=4  N[4]=5  N[5]=6  N[6]=7
  M:  M[0]=3  M[1]=4  M[2]=5  M[3]=4  M[4]=3
  Products for P[3]:  6  12  20  20  18
  P:  P[2]=57  P[3]=76  (the remaining elements are computed the same way)
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
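Parallel Patterns: 1D Convolution (Kernel Sketch)
• Before moving to 2D, here is a minimal CUDA sketch of how this 1D pattern maps onto one thread per output element. It is not taken from the slides: the kernel name, parameter names, and the choice to treat out-of-range ("ghost") neighbours as zero are illustrative assumptions.

    // Minimal 1D convolution sketch: one thread computes one output element.
    // N: input array, M: mask of Mask_Width weights (assumed odd), P: output,
    // Width: number of elements. Ghost elements outside N are treated as 0.
    __global__ void convolution_1D_basic(const float *N, const float *M, float *P,
                                         int Mask_Width, int Width) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= Width) return;

        float Pvalue = 0.0f;
        int N_start = i - Mask_Width / 2;            // leftmost neighbour under the mask
        for (int j = 0; j < Mask_Width; j++) {
            int idx = N_start + j;
            if (idx >= 0 && idx < Width) {           // skip ghost elements
                Pvalue += N[idx] * M[j];
            }
        }
        P[i] = Pvalue;
    }

• A hypothetical launch for the 7-element example above (d_N, d_M, d_P are assumed device copies of N, M, P): convolution_1D_basic<<<1, 7>>>(d_N, d_M, d_P, 5, 7); this reproduces P[2] = 57 and P[3] = 76.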
Parallel Patterns: 2D Convolution
• For image processing and computer vision, input data is usually in 2D form, with pixels in an x-y space.
• In a 2D convolution, the mask M is a 2D array. Its x and y dimensions determine the range of neighbours included in the weighted sum calculation.

Parallel Patterns: 2D Convolution (How does it work?)
• To generate an output element, we take the subarray of N whose centre is at the corresponding location in the input array.
• We then perform pairwise multiplication between the elements of that subarray and the elements of the mask array M, and sum all the products.
• The result is placed in the position of P that corresponds to the N element being operated on. (A kernel sketch for the 2D case appears after the scan example below.)
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Parallel Patterns: 2D Convolution (How does it work?)
• Now you try… What will the resulting value be at P[1][0] if we perform a 2D convolution on the highlighted element?
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Parallel Patterns: 2D Convolution (How does it work?)
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Parallel Patterns: Prefix-Sum (Scan)
• Our next very important parallel pattern is Prefix Sum, or Scan as it is commonly known.
• Parallel scan patterns are frequently used to convert seemingly sequential operations, such as resource allocation, work assignment, and polynomial evaluation, into parallel operations.
• In general, if a computation is naturally described as a mathematical recursion, it can likely be parallelized as a parallel scan operation.

Parallel Patterns: Prefix-Sum (Scan)
• Parallel scan plays a key role in massively parallel computing for a very simple reason:
  ▫ Any sequential section of an application can drastically limit its overall performance.
• It is a key primitive in many parallel algorithms for converting serial computation into parallel computation.
• It is based on the reduction tree and reverse reduction tree patterns (discussed later).

Parallel Patterns: (Inclusive) Prefix-Sum (Scan) Definition
Definition: The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements [x0, x1, …, xn-1], and returns the array [x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-1)].
Example: If ⊕ is addition, then the all-prefix-sums operation on the array [3 1 7 0 4 1 6 3] returns [3 4 11 11 15 16 22 25].
So now that we have our definitions, let's do an example…

Parallel Patterns: (Inclusive) Prefix-Sum (Scan) Example
Assume that we have a 100-inch sausage to feed 10 people.
• We know how much each person wants in inches, defined by the array:
  ▫ [3 5 2 7 28 4 3 0 8 1]
• How do we cut the sausage quickly? How much will be left?
• Method 1: cut the sections sequentially: 3 inches first, 5 inches second, 2 inches third, etc. Clearly there is a better way…

Parallel Patterns: (Inclusive) Prefix-Sum (Scan) Example
• Method 2: calculate the prefix scan:
  ▫ [3, 8, 10, 17, 45, 49, 52, 52, 60, 61] (39 inches left)
• The numbers in the returned array are cutting locations. With this information, one can simultaneously make all the cuts that generate the sections each person ordered.
  ▫ The first cut point is at 3 inches, so person 0 gets 3 inches.
  ▫ The second cut point is at 8 inches, so person 1 gets 5 inches, etc.
• Since all the cutting points are known from the scan operation, all cuts can be done in parallel (a simple scan kernel sketch follows below).
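Parallel Patterns: 2D Convolution (Kernel Sketch)
• Returning to the 2D convolution described earlier, here is a minimal CUDA sketch with one thread per output pixel. It is not taken from the slides: the names, the MASK_RADIUS constant, and the zero treatment of ghost pixels outside the image are illustrative assumptions.

    #define MASK_RADIUS 2   // 5x5 mask, analogous to the 5-element 1D mask above

    // Minimal 2D convolution sketch: N is a width x height image (row-major),
    // M is a (2*MASK_RADIUS+1) x (2*MASK_RADIUS+1) mask, P is the output image.
    // Ghost pixels outside the image are treated as 0.
    __global__ void convolution_2D_basic(const float *N, const float *M, float *P,
                                         int width, int height) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // x position of the output pixel
        int row = blockIdx.y * blockDim.y + threadIdx.y;   // y position of the output pixel
        if (col >= width || row >= height) return;

        float Pvalue = 0.0f;
        for (int j = -MASK_RADIUS; j <= MASK_RADIUS; j++) {          // mask rows
            for (int k = -MASK_RADIUS; k <= MASK_RADIUS; k++) {      // mask columns
                int r = row + j;
                int c = col + k;
                if (r >= 0 && r < height && c >= 0 && c < width) {   // skip ghost pixels
                    Pvalue += N[r * width + c] *
                              M[(j + MASK_RADIUS) * (2 * MASK_RADIUS + 1) + (k + MASK_RADIUS)];
                }
            }
        }
        P[row * width + col] = Pvalue;
    }

• The structure is the same as the 1D kernel; only the loop over mask elements becomes two nested loops over the mask's rows and columns.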
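Parallel Patterns: Inclusive Scan (Kernel Sketch)
• A minimal CUDA sketch of a simple (not work-efficient) parallel inclusive scan over a single thread block, in the Kogge-Stone style. It is not taken from the slides: the kernel name, SECTION_SIZE, and the single-block restriction are illustrative assumptions.

    #define SECTION_SIZE 1024   // one thread block scans up to SECTION_SIZE elements

    // X: input values, Y: inclusive prefix sums, n: number of elements (n <= SECTION_SIZE).
    // Launched as inclusive_scan<<<1, SECTION_SIZE>>>(d_X, d_Y, 10) on the sausage
    // array above, it produces the cut points [3, 8, 10, 17, 45, 49, 52, 52, 60, 61].
    __global__ void inclusive_scan(const float *X, float *Y, int n) {
        __shared__ float section[SECTION_SIZE];
        int i = threadIdx.x;
        section[i] = (i < n) ? X[i] : 0.0f;   // pad unused slots with the identity (0 for +)

        for (int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            // read the partner value before any thread overwrites it
            float addend = (i >= stride) ? section[i - stride] : 0.0f;
            __syncthreads();
            if (i >= stride) section[i] += addend;
        }
        if (i < n) Y[i] = section[i];
    }

• This simple formulation doubles the reach of each partial sum at every step, so it finishes in log2(blockDim.x) iterations, but it performs more total additions than the work-efficient scans discussed next.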
Parallel Patterns: Typical Applications of Scan
• Scan is a simple and useful parallel building block.
  ▫ It converts recurrences from the sequential form:
      for (j = 1; j < n; j++) out[j] = out[j-1] + f(j);
  ▫ into the parallel form:
      forall (j) { temp[j] = f(j); }
      scan(out, temp);
• Scan is useful in many parallel algorithms:
  ▫ Radix sort
  ▫ Quicksort
  ▫ String comparison
  ▫ Lexical analysis
  ▫ Stream compaction
  ▫ Polynomial evaluation
  ▫ Solving recurrences
  ▫ Tree operations
  ▫ Histograms
  ▫ Etc.

Parallel Patterns: Prefix-Sum (Scan)
• Disclaimer… It should be noted that this is by far the simplest example one could conjure to explain this parallel pattern.
• There are many more complex and work-efficient scans for producing sum values over a given set of values.
  ▫ By far the fastest of these is the reduction tree, which we will discuss next.
  ▫ A reduction tree can generate the sum of N values in log2(N) steps.
• I leave it to the reader to explore these algorithms further; I recommend:
  ▫ Mark Harris, Parallel Prefix Sum with CUDA
    http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf

Parallel Patterns: Reduction Trees
• Reduction trees are arguably the most widely used parallel computation pattern.
• "Partition and summarize" problems over large input data sets are common uses of reduction trees:
  ▫ Partition the data set into smaller chunks.
  ▫ Assign a thread to process each chunk of data.
  ▫ Use a reduction tree to summarize the results from each chunk into the final answer.
• Google's MapReduce framework is an example of this pattern.

Parallel Patterns: Reduction Trees
• So what is a reduction computation?
• It is used to summarize a set of input values into one value using a "reduction operation", for example:
  ▫ Max
  ▫ Min
  ▫ Sum
  ▫ Product
  ▫ Often a user-defined reduction operation function, as long as the operation is associative and commutative and has a well-defined identity value (e.g. 0 for sum).

Parallel Patterns: Reduction Trees – Max Value Problem
• Let's take a look at a sequential reduction algorithm that finds the max value in an array of input data: 3 1 7 0 4 1 6 3
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
• If we know our input data are always positive, we can initialize the identity variable "result = 0".
• The algorithm then sequentially scans through the input and performs the reduction operation between "result" and the current input value.
• Therefore, the algorithm performs N operations, i.e. O(N) time.

Parallel Patterns: Reduction Trees – Max Value Problem
• This problem naturally lends itself to parallelization.
• The parallel reduction tree algorithm performs N-1 operations in log(N) steps (a kernel sketch appears after the references at the end of this document).
• This is a significant improvement for sufficiently large input data.
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Parallel Patterns: Reduction Trees – Max Value Problem
• A quick analysis shows that for N input values, the reduction tree performs:
  ▫ (1/2)N + (1/4)N + (1/8)N + … + (1/N)N = (1 - 1/N)N = N-1 operations.
  ▫ And it does this in log(N) steps: for an array of 1,000,000 input values, we can find the maximum in about 20 steps… Impressive!

Parallel Patterns: Conclusion
• This concludes my seminar on Parallel Computation Patterns; I hope you have found it informative.
• References used:
  ▫ Programming Massively Parallel Processors, 2nd Edition, David B. Kirk and Wen-mei W. Hwu.
  ▫ ECE408/CS483/ECE498al, University of Illinois, 2007-2012, David B. Kirk/NVIDIA and Wen-mei W. Hwu, SSL (2014).
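Parallel Patterns: Reduction Trees (Kernel Sketch)
• The sketch referenced in the Max Value Problem discussion above: a minimal CUDA reduction-tree kernel that computes a per-block maximum in shared memory. It is not taken from the slides: the names, BLOCK_SIZE, and the use of -FLT_MAX as the identity for max are illustrative assumptions, and the per-block partial maxima still need a second reduction pass (or a host-side loop) to produce the single final answer.

    #include <float.h>

    #define BLOCK_SIZE 256

    // Reduction tree for max: each block reduces BLOCK_SIZE input elements to one
    // partial maximum in log2(BLOCK_SIZE) steps, halving the number of active
    // threads at every step.
    __global__ void max_reduce(const float *input, float *block_max, int n) {
        __shared__ float partial[BLOCK_SIZE];
        unsigned int t = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

        partial[t] = (i < n) ? input[i] : -FLT_MAX;   // identity value for max
        __syncthreads();

        for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (t < stride)
                partial[t] = fmaxf(partial[t], partial[t + stride]);
            __syncthreads();
        }
        if (t == 0) block_max[blockIdx.x] = partial[0];   // one partial max per block
    }

• Launching with roughly N/BLOCK_SIZE blocks and then reducing the per-block results matches the analysis above: about N-1 total operations performed in on the order of log2(N) steps.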