Parallel Computation Patterns
Presented by
Fred Graham
Seminar Objectives
• The patterns being discussed in this presentation form the basis of
many parallel algorithms that appear in applications today.
• Introduce and analyze the Parallel Pattern: Convolution.
• Introduce and analyze the Parallel Pattern: Prefix Sum (Parallel Scan).
• Introduce and analyze the Parallel Pattern: Reduction Trees.
Parallel Patterns: Convolution
• We start with the extremely important “Convolution” computation
pattern.
• Convolution is a popular array operation that is used in various forms of
signal processing, digital recording, image processing, video processing,
and computer vision.
Parallel Patterns: Convolution cont’d
• Convolution is typically used as a filter that transforms signals and pixels
into more desirable values.
▫ Some filters smooth out the signal values so that one can see the "big-picture" trend.
▫ Others, like Gaussian-based filters, can be used to sharpen boundaries and
edges of objects in images.
• Typically involves a significant number of arithmetic operations on each
data element.
Parallel Patterns: Convolution cont’d
• For large data sets, such as high-definition images and video, the
amount of computation can be very large.
• Each output data element can be calculated independently of the others – a
desirable trait for massively parallel computing.
• So how about an example…
Parallel Patterns: 1D Convolution
• Imagine we have a digital audio signal, see image below. The input data are
in 1D form (hence 1-dimensional convolution). Convolution can be used to
smooth out noisy data.
• The example illustrates signal volume as a function of time.
• Notice the "noisy" waveform on the bottom. We can use convolution to
smooth the noisy data out.
Image Source: http://www.stereophile.com/content/ps-audiogcc-100-integrated-amplifier-measurements
Parallel Patterns: 1D Convolution (How does it work?)
• Convolution is an array operation where each output data element [P] is
a weighted sum of a collection of neighbouring input elements [N].
• The weights used in the weighted sum calculation are defined by an
input mask array [M].
• Illustration shows
calculation for output
element P[2]
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
Parallel Patterns: 1D Convolution (How does it work?)
• The fact that a 5-element mask [M] is used in this example means that each
[P] element is generated by a weighted sum of the corresponding [N]
element, up to two elements to the left and up to two elements to the right.
• For example, the calculation of P[2] in the previous slide is generated by the
weighted sum of N[0] (N[2-2]) through N[4] (N[2+2]).
• So P[2] = N[0]*M[0] + N[1]*M[1] + N[2]*M[2] + N[3]*M[3] + N[4]*M[4]
= 1*3 + 2*4 + 3*5 + 4*4 + 5*3
= 57
Parallel Patterns: 1D Convolution (How does it work?)
• Now you do it…Calculate the value at element P[3] in the previous
example.
• So P[3] = N[1]*M[0] + N[2]*M[1] + N[3]*M[2] + N[4]*M[3] + N[5]*M[4]
= 2*3 + 3*4 + 4*5 + 5*4 + 6*3
= 76
[Figure: 1D convolution example with N = [1 2 3 4 5 6 7] and M = [3 4 5 4 3];
the element-wise products 6, 12, 20, 20, 18 sum to P[3] = 76, and P[2] = 57.]
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
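To make the arithmetic concrete, here is a minimal sequential C sketch of the 1D convolution described above; the function name and the zero treatment of out-of-range neighbours are illustrative choices rather than details from the slides. It reproduces P[2] = 57 and P[3] = 76.

#include <stdio.h>

/* Sequential 1D convolution sketch: each P[i] is the weighted sum of
 * N[i-2..i+2] using the 5-element mask M. Out-of-range neighbours are
 * treated as 0 ("ghost" elements), one common boundary convention. */
void convolve_1d(const float *N, const float *M, float *P,
                 int width, int mask_width) {
    for (int i = 0; i < width; i++) {
        float sum = 0.0f;
        int start = i - mask_width / 2;           /* leftmost neighbour index */
        for (int j = 0; j < mask_width; j++) {
            int k = start + j;
            if (k >= 0 && k < width)              /* skip ghost elements */
                sum += N[k] * M[j];
        }
        P[i] = sum;
    }
}

int main(void) {
    float N[7] = {1, 2, 3, 4, 5, 6, 7};           /* input from the example  */
    float M[5] = {3, 4, 5, 4, 3};                 /* 5-element mask          */
    float P[7];
    convolve_1d(N, M, P, 7, 5);
    printf("P[2] = %g, P[3] = %g\n", P[2], P[3]); /* prints 57 and 76        */
    return 0;
}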
Parallel Patterns: 2D Convolution
• For image processing and computer vision, input data is usually in 2D
form, with pixels in an x-y space.
• In a 2D convolution, the mask M is a 2D array. Its x and y dimensions
determine the range of neighbours to be included in the weighted sum
calculation.
Parallel Patterns: 2D Convolution (How does it work?)
• To generate an output element, we take the subarray of the input array N
whose center is at the corresponding output location.
• We then perform pairwise
multiplication between
elements of the input array N
and the mask array M and sum
them all together.
• The result is placed in the same
position in P as the value in N
being operated on.
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
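A minimal sequential C sketch of this procedure is below; the 5x5 mask size, the zero padding at the borders, and the function name are assumptions made for illustration, not values taken from the figure.

/* Sequential 2D convolution sketch: P[y][x] is the sum of element-wise
 * products between the mask M and the subarray of N centred at (y, x).
 * Neighbours that fall outside N are treated as 0 (zero padding). */
#define MASK 5                                   /* assumed mask width (odd) */

void convolve_2d(int height, int width,
                 const float N[height][width],
                 const float M[MASK][MASK],
                 float P[height][width]) {
    int half = MASK / 2;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int my = 0; my < MASK; my++) {
                for (int mx = 0; mx < MASK; mx++) {
                    int ny = y + my - half;       /* neighbour row    */
                    int nx = x + mx - half;       /* neighbour column */
                    if (ny >= 0 && ny < height && nx >= 0 && nx < width)
                        sum += N[ny][nx] * M[my][mx];
                }
            }
            P[y][x] = sum;        /* same position as the centre element */
        }
    }
}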
Parallel Patterns: 2D Convolution (How does it work?)
• Now you try…What will the resulting value be at P[1][0] if we perform a
2D convolution centered on the highlighted element?
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
Parallel Patterns: 2D Convolution (How does it work?)
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
Parallel Patterns: Prefix-Sum (Scan)
• Our next very important parallel pattern is Prefix Sum or Scan as it is
commonly known.
• Parallel scan patterns are frequently used to convert seemingly
sequential operations, such as resource allocation, work assignment,
and polynomial evaluation, into parallel operations.
• In general, if a computation is naturally described as a mathematical
recursion, it can likely be parallelized as a parallel scan operation.
Parallel Patterns: Prefix-Sum (Scan)
• Parallel scan plays a key role in massively parallel computing for a very
simple reason:
▫ Any sequential section of an application can drastically limit its overall
performance.
• It is a key primitive in many parallel algorithms to convert serial
computation into parallel computation.
• Based on reduction tree and reverse reduction tree patterns (discussed
later).
Parallel Patterns: (Inclusive) Prefix-Sum (Scan) Definition
Definition: The all-prefix-sums operation takes a binary associative operator ⊕, and an array
of n elements
[x0, x1, …, xn-1],
and returns the array
[x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-1)].
Example: If ⊕ is addition, then the all-prefix-sums operation on the array
[3 1 7 0 4 1 6 3],
would return [3 4 11 11 15 16 22 25].
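As a quick sanity check of the definition, here is a minimal sequential C sketch of an inclusive scan with ⊕ = addition; it is the sequential reference, and the parallel, tree-based versions mentioned later produce the same output.

#include <stdio.h>

/* Inclusive prefix sum (scan): out[i] = x[0] + x[1] + ... + x[i]. */
void inclusive_scan(const int *x, int *out, int n) {
    int running = 0;
    for (int i = 0; i < n; i++) {
        running += x[i];
        out[i] = running;
    }
}

int main(void) {
    int x[8] = {3, 1, 7, 0, 4, 1, 6, 3};   /* array from the definition slide */
    int out[8];
    inclusive_scan(x, out, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);             /* prints: 3 4 11 11 15 16 22 25   */
    printf("\n");
    return 0;
}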
So now that we have our definitions, let’s do an example…
Parallel Patterns: (Inclusive) Prefix-Sum (Scan) Example
Assume that we have a 100-inch sausage to feed 10 people.
• We know how much each person wants in inches, defined by the array:
▫ [3 5 2 7 28 4 3 0 8 1]
How do we cut the sausage quickly? How much will be left?
• Method 1: cut the sections sequentially: 3 inches first, 5 inches second,
2 inches third, etc…clearly there is a better way...
Parallel Patterns: (Inclusive) Prefix-Sum (Scan) Example
• Method 2: calculate Prefix scan
▫ [3, 8, 10, 17, 45, 49, 52, 52, 60, 61] (39 inches left)
• The numbers in the returned array are the cutting locations. With this
information, one can simultaneously make all of the cuts that will generate
the sections that each person ordered.
▫ The first cut point is 3 inches – so person 0 gets 3 inches.
▫ The second cut point is 8 inches – so person 1 gets 5 inches etc….
• Since all the cutting points are known from the scan operation, all cuts can be
done in parallel.
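The cut points above are simply the inclusive scan of the order array; the short C sketch below (a sequential reference, not a parallel implementation) reproduces them and the leftover length.

#include <stdio.h>

int main(void) {
    int orders[10] = {3, 5, 2, 7, 28, 4, 3, 0, 8, 1}; /* orders in inches */
    int cuts[10];
    int running = 0;
    for (int i = 0; i < 10; i++) {          /* inclusive prefix sum of orders */
        running += orders[i];
        cuts[i] = running;                  /* cuts[i] = i-th cutting location */
    }
    for (int i = 0; i < 10; i++)
        printf("%d ", cuts[i]);             /* 3 8 10 17 45 49 52 52 60 61     */
    printf("\nleft over: %d inches\n", 100 - cuts[9]);   /* 39 inches          */
    return 0;
}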
Parallel Patterns: Typical Applications of Scan
• Scan is a simple and useful parallel building block:
▫ Convert recurrences from sequential:
for(j=1;j<n;j++)
    out[j] = out[j-1] + f(j);
▫ into parallel:
forall(j) { temp[j] = f(j); }
scan(out, temp);
• Useful for many parallel algorithms:
▫ Radix sort
▫ Quicksort
▫ String comparison
▫ Lexical analysis
▫ Stream compaction
▫ Polynomial evaluation
▫ Solving recurrences
▫ Tree operations
▫ Histograms
▫ Etc.
Parallel Patterns: Prefix-Sum (Scan)
• Disclaimer…It should be noted that this is by far the simplest example one
could conjure to explain this parallel pattern.
• There are many more complex and work-efficient scan algorithms for producing
the sums of a given set of values.
▫ By far the fastest of these is based on the reduction tree, which we will discuss next.
▫ A reduction tree can generate the sum of N values in log2(N) steps.
• I leave it to the reader to explore these algorithms further; I recommend:
▫ Mark Harris, Parallel Prefix Sum with CUDA:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf
Parallel Patterns: Reduction Trees
• Reduction trees are arguably the most widely used parallel computation
pattern.
• “Partition and Summarize” type problems for processing large input
data sets are common uses of reduction trees.
 Partition the data set into smaller chunks.
 Assign a thread to process each chunk of data.
 Use a reduction tree to summarize the results from each chunk into the final
answer.
• Google's MapReduce framework is a well-known example of this pattern.
Parallel Patterns: Reduction Trees
• So what is a reduction computation?
• It is used to summarize a set of input values into one value using a "reduction
operation", such as:
▫ Max
▫ Min
▫ Sum
▫ Product
▫ Often a user-defined reduction operation function, as long as the operation
 Is associative and commutative
 Has a well-defined identity value (e.g. 0 for sum)
Parallel Patterns: Reduction Trees – Max Value Problem
• Let’s take a look at a sequential reduction algorithm that will find the max
value given an array of input data.
[Figure: input array = 3 1 7 0 4 1 6 3]
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
• If we know our input data is always positive, we can initialize the identity
variable result = 0.
• Our algorithm then sequentially scans through the input and performs
the reduction operation between result and the current input value.
• Therefore, the algorithm performs N operations, i.e. it runs in O(N) time.
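A minimal C sketch of this sequential reduction is below, assuming, as the slide does, that all inputs are non-negative so that 0 can serve as the identity value.

#include <stdio.h>

/* Sequential max reduction: one pass over the input, N operations in total. */
int main(void) {
    int data[8] = {3, 1, 7, 0, 4, 1, 6, 3};   /* input array from the figure    */
    int n = 8;
    int result = 0;                  /* identity for max over non-negative data */
    for (int i = 0; i < n; i++)
        if (data[i] > result)        /* reduction operation: max                */
            result = data[i];
    printf("max = %d\n", result);    /* prints 7                                */
    return 0;
}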
Parallel Patterns: Reduction Trees – Max Value Problem
• This problem naturally lends itself to parallelization.
• The parallel reduction tree algorithm
performs N-1 operations
in log(N) steps.
• A significant improvement for
sufficiently large input data.
Image Source: David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
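The C sketch below mirrors the reduction-tree schedule: it runs the steps sequentially, but within each pass of the outer loop every pairwise max is independent and would execute concurrently on a parallel machine, giving log2(N) steps. The in-place pairing scheme is one common layout, not the only one.

#include <stdio.h>

/* Tree reduction sketch (max): in each step, element i is combined with
 * element i + stride. All iterations of the inner loop within one step are
 * independent, so N values need only log2(N) steps in parallel.
 * Assumes n is a power of two for simplicity. */
int main(void) {
    int data[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    int n = 8;
    for (int stride = 1; stride < n; stride *= 2) {        /* log2(n) steps      */
        for (int i = 0; i + stride < n; i += 2 * stride)   /* pairs in this step */
            if (data[i + stride] > data[i])
                data[i] = data[i + stride];                /* max of the pair    */
    }
    printf("max = %d\n", data[0]);   /* prints 7; 4 + 2 + 1 = 7 = N-1 operations */
    return 0;
}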
Parallel Patterns: Reduction Trees – Max Value Problem
• A quick analysis shows that for N input values, the reduction tree
performs:
▫ (1/2)N + (1/4)N + (1/8)N + … + (1/N)N = (1-(1/N))N = N-1 operations.
▫ If we can do this in log(N) steps:
 For an array of 1,000,000 input values, we can find the maximum number in 20
steps…
Impressive!!!
Parallel Patterns: Conclusion
• This concludes my seminar on Parallel Computation Patterns; I hope you
have found it informative.
• References used:
▫ Programming Massively Parallel Processors, 2nd Edition
David B. Kirk, Wen-mei W. Hwu
▫ ECE408/CS483/ECE498al, University of Illinois, 2007-2012
David B. Kirk/NVIDIA and Wen-mei W. Hwu, SSL (2014)