Sequential Dependencies and Overlapping

Computer Science 1 (4003-231)
What to Learn This Week?
•  We will learn how to analyze an algorithm for sequential
dependencies.
Sequential Dependencies
and Overlapping
•  We will also learn how to parallelize the algorithm while
enforcing the sequential dependencies.
•  We will then discover how the CPU’s cache memory
affects a parallel program’s performance.
Minseok Kwon
Department of Computer Science
Rochester Institute of Technology
1
All Pairs Shortest Paths
2
Floyd’s Algorithm
•  Given a network (or a graph), how do we find the
shortest paths between all possible pairs?
for i = 0 to n-1
// Update the distance between nodes r and c via node i.
for r = 0 to n-1
for c = 0 to n-1
D(r,c) = min(D(r,c), D(r,i) + D(i,c))
3
4
1
Computer Science 1 (4003-231)
I/O Files
I/O Files
package edu.rit.smp.network;
import edu.rit.io.DoubleMatrixFile;
import edu.rit.util.Random;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
public class FloydRandom {
public static void main(String[] args) {
// Parse command-line arguments.
if (args.length != 4) usage();
long seed = Long.parseLong(args[0]);
double radius = Double.parseDouble(args[1]);
int n = Integer.parseInt(args[2]);
String matrixfile = args[3];
// Compute distance matrix elements.
double[][] = new double[n][n];
for (int r = 0; r < n; r++) {
double[] dr = d[r];
for (int c = 0; c < n; c++) {
double dx = x[r] – x[c];
double dy = y[r] – y[c];
double distance = Math.sqrt(dx*dx + dy*dy);
dr[c] = (distance <= radius ?
distance : Double.POSITIVE_INFINITY);
}
}
// Write distance matrix to output file.
DoubleMatrixFile dmf = new DoubleMatrixFile(n,n,d);
DoubleMatrixFile.Writer writer = dmf.prepareToWrite
(new BufferedOutputStream
(new FileOutputStream(matrixfile)));
writer.write();
writer.close();
}
}
// Set up pseudorandom number generator.
Random prng = Random.getInstance(see);
// Generate random node locations in the unit square.
double[] x = new double[n];
double[] y = new double[n];
for (int i = 0; i < n; i++) {
x[i] = prng.nextDouble();
y[i] = prng.nextDouble();
}
5
Sequential Program
6
Sequential Program
package edu.rit.smp.network;
import edu.rit.io.DoubleMatrixFile;
import java.io.BufferedInputStream;
import java.io.BufferedOutputSream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
public class FloydSeq {
// Number of nodes.
static int n;
// Distance matrix.
static double[][] d;
reader.read();
reader.close();
n = dmf.getRowCount();
d = dmf.getMatrix();
long t2 = System.currentTimeMillis();
for (int i = 0; i < n; i++) {
double[] di = d[i];
for (int r = 0; r < n; r++) {
double[] dr = d[r];
for (int c = 0; c < n; c++)
dr[c] = Math.min(dr[c],dr[i]+di[c]);
}
}
long t3 = System.currentTimeMillis();
public static void main(String[] args) {
// Start timing.
long t1 = System.currentTimeMillis();
// Parse command-line arguments.
if (args.length != 2) usage();
File infile = new File(args[0]);
File outfile = new File(args[1]);
// Read distance matrix from input file.
DoubleMatrixFile dmf = new DoubleMatrixFile();
DoubleMatrixFile.Reader reader = dmf.prepareToRead
(new BufferedInputStream(new FileInputStream(infile)));
// Write distance matrix to output file.
DoubleMatrixFile.Writer writer = dmf.prepareToWrite(
new BufferedOutputStream (new FileOutputStream (outfile)));
writer.write();
writer.close();
// Stop timing.
long t4 = System.currentTimeMillis();
// Print the results.
7
8
2
Computer Science 1 (4003-231)
Parallel Floyd’s Algorithm
Row Slicing vs. Column Slicing
•  Let’s turn Floyd’s algorithm into a parallel version.
•  Q: Can we do the iterations of the outer loop, the loop
over i, in parallel?
•  One problem is that on the previous iteration, the values
of drc, dri, and dic could have changed.
•  There is a sequential dependency from each iteration i to
the next iteration i+1.
•  How about the middle loop, the loop over r?
•  It can be parallelized because the elements in row i and
column i do not change.
9
Parallel Program with Row Slicing
10
Results
new ParallelTeam().execute (new ParallelRegion() {
public void run() {
for (int ii = 0; ii < n; ii++) {
final int i = ii;
final double[] di = d[i];
execute(0, n-1, new IntegerForLoop() {
public void run(int first, int last) {
for (int r = first; r <= last; r++) {
double[] dr = d[r];
for (int c = 0; c < n; c++)
dr[c] = Math.min(dr[c], dr[i] + di[c]);
}
}
});
}
}
});
11
12
3
Computer Science 1 (4003-231)
Results
Parallel Program with Column Slicing
new ParallelTeam().execute (new ParallelRegion() {
public void run() {
for (int ii = 0; ii < n; ii++) {
final int i = ii;
final double[] di = d[i];
for (int r = 0; r < n; r++) {
final double[] dr = d[r];
execute(0, n-1, new IntegerForLoop() {
public void run(int first, int last) {
for (int c = first; c <= last; c++)
dr[c] = Math.min(dr[c], dr[i] + di[c]);
}
});
}
}
}
});
13
Results
14
Results
15
16
4
Computer Science 1 (4003-231)
Cellular Automata
Cellular Automata
•  A cellular automaton (CA) is a simple abstract computing
device that is capable of generating all kinds of
interesting behavior.
•  We’re especially interested in one dimensional
continuous CA (1-D CCA):
•  All cells are initially 0, except one cell in the middle is
initially 1.
•  X(i) = frac( (X(i-1) + X(i) + X(i+1))*1/3*A+B )
•  Once all the new values have been calculated from the
current values, all the cells simultaneously change to
their new values.
•  This process repeats and the CCA evolves through a
sequence of states as the iterations progress.
17
Cellular Automata
18
Cellular Automata
19
20
5
Computer Science 1 (4003-231)
Rational Arithmetic
Improving Memory Scalability
•  How can we generate 1-D CCA like those previous
figures?
For c = 0 to C-1:
cell(0,c) = 0
pixl(0,c) = 0
cell(0,c/2) = 1;
pixel(0,c/2) = 1;
For s = 1 to S:
For c = 0 to C-1:
cell(s,c) = frac((cell(s-1,c-1)+cell(s-1,c)+cell(s-1,c+1))*A/3+B)
pixel(s,c) = floatValue(cell(s,c));
Write pixel data to PJG image file
•  We should use rational arithmetic when calculating the
states of a CCA, not floating-point arithmetic, for higher
precision.
•  One problem, though, is that this program requires
O(SC) memory to hold the data.
•  This is particularly a problem when scaling up the
problem.
•  How can we improve scalability?
•  We will store the numerators and denominators as
arbitrary precision integers using java.math.BigInteger.
•  We will use java.math.BigDecimal to represent the color
of each pixel.
•  BigRational class provides all of these!
21
Improving Memory Scalability
22
Sequential Program
For c = 0 to C-1:
curCell(c) = 0
curCell(c/2) = 1
For s = 0 to S-1:
For c = 0 to C-1:
nextCell(c) = frac((curCell(c-1)+curCell(c)+curCell(c+1)*A/3+B)
For c = 0 to C-1:
pixel(c) = floatValue(curCell(c));
Write pixel data as row s of PJG image file
Swap(curCell, nextCell)
For c = 0 to C-1:
pixel(c) = floatValue(curCell(c))
Write pixel data as row S of PJG image file
package edu.rit.smp.ca;
import edu.rit.image.GrayImageRow;
import edu.rit.image.PJGGrayImage;
import edu.rit.image.PJGImage;
import edu.rit.util.Range;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
public class CCASeq {
// Constants.
static final BigRational ZERO = new BigRational(“0”);
static final BigRational ONE = new BigRational(“1”);
static final BigRational ONE_THIRD = new BigRational(“1/3”);
// Command-line arguments.
static int C;
static int S;
static BigRational A;
static BigRational B;
static File imagefile;
•  What’s the complexity now?
•  O(C)
// Old and new cell arrays.
static BigRational[] curCell;
static BigRational[] nextCell;
23
24
6
Computer Science 1 (4003-231)
Sequential Program
Sequential Program
// Grayscale image matrix.
static byte[][] pixelmatrix;
static PJGGrayImage image;
static PJGImage.Writer writer;
for (int i = 0; i < C; i++) {
curCell[i] = new BigRational();
nextCell[i] = new BigRational();
}
curCell[C/2].assign(ONE);
// One row of the grayscale image matrix.
static byte[] pixelrow;
static GrayImageRow imagerow;
// Set up pixel matrix, image, and image writer.
pixelmatrix = new byte[S+1][];
image = new PJGGrayImage(S+1, C, pixelmatrix);
writer = image.prepareToWrite(new BufferedOutputStream
(new FileOutputStream(imagefile)));
public static void main(String[] args) {
// Start timing.
long t1 = System.currentTimeMillis();
// Allocate storage for one pixel matrix row.
pixelrow = new byte[C];
imagerow = new GrayImageRow(pixelrow);
imagerow.setInterpretation(PJGGrayImage.ZERO_IS_WHITE);
// Parse command-line arguments.
if (args.length != 5) usage();
C = Integer.parseInt(args[0]);
S = Integer.parseInt(args[1]);
A = new BigRational(args[2]).mul(ONE_THIRD);
B = new BigRational(args[3]);
imagefile = new File(args[4]);
// Do S time steps.
for (int s = 0; s < S; s++) {
// Calculate next state of each cell.
for (int i = 0; i < C; i++)
nextCell[i].assign(curCell[i]).add(curCell[(i-1+C)%C])
.add(curCell[(i+1)%C]).mul(A).add(B)
.normalize().fracPart();
// Allocate storage for old and new cell arrays. Init them.
curCell = new BigRational[C];
nextCell = new BigRational[C];
25
Sequential Program
26
Sequential Program
// Write current CA state to image file.
writeCurrentCell(s);
// Write the current cell array to the given row of the image file.
private static void writeCurrentCell(int r) {
// Set image row’s gray values based on current cell states.
for (int i = 0; i < C; i++)
imagerow.setPixel(i, curCell[i].floatValue());
// Advance one time step – swap old and new cell arrays.
BigRational[] tmp = curCell;
curCell = nextCell;
nextCell = tmp;
// Set row r of the pixel matrix.
pixelmatrix[r] = pixelrow;
}
// Write final CA state to image file.
writeCurrentCell(S);
writer.close();
// Write row-r slice of the image to the image file.
writer.writeRowSlice(new Range(r,r));
}
// Stop timing.
long t2 = System.currentTimeMillis();
System.out.println((t2-t1)+” msec total”);
}
27
28
7
Computer Science 1 (4003-231)
Barrier Actions
Barrier Actions
•  Q: Now, let’s convert this sequential program to a
parallel version.
•  Are there any sequential dependencies?
•  There is indeed: we cannot compute the cell values for
the next step until we finish computing the cell values for
the current step.
•  Hence, the outer loop must remain a sequential loop.
•  However, the value calculated for any element in the
nextCell array does not depend on the value of any other
element in the nextCell array.
•  No sequential dependencies in the inner loop!
For c = 0 to C-1:
currCell(c) = 0
curCell(C/2) = 1
Parallel region:
For s = 0 to S-1:
Parallel for c = 0 to C-1:
nextCell(c) = frac((curCell(c-1)+curCell(c)+curCell(c+1)*A/3+B)
Compute and write pixel data as row s of PJG image file
Swap(curCell, nextCell)
Compute and write pixel data as rwo S of PJG image file
29
Barrier Actions
30
Barrier Actions
new ParallelTeam().execute(new ParallelRegion() {
...
execute(0, 99, new IntegerForLoop() {
public void run(int first, int last) {
for (int i = first; i <= last; i++) {
// Loop body
}
}
},
new BarrierAction() {
public void run() {
// Code to be executed in a single thread
}
});
...
});
31
32
8
Computer Science 1 (4003-231)
Barrier Actions
Parallel Program
// We can specify the constant barrier action instead of an object.
// Each thread waits at the barrier.
execute(0, 99, new IntegerForLoop() {
...
},
BarrierAction.WAIT);
// Calculate next state of each cell.
// Parallel inner loop with a barrier action.
execute(0, C-1, new IntegerForLoop() {
public IntegerSchedule schedule() {
return IntegerSchedule.guided();
}
public void run (int first, int last) {
for (int i = first; i <= last; i++)
nextCell[i].assign(curCell[i]).add(curCell[(i-1+C)%C])
.add(curCell[(i+1)%C]).mul(A).add(B)
.normalize().fracPart();
}
},
// Each thread does not wait at the barrier.
execute(0, 99, new IntegerForLoop() {
...
},
BarrierAction.NO_WAIT);
// Synchronize threads before next outer loop iteration.
new BarrierAction() {
public void run() throws Exception {
writeCurrentCell(s);
// Advance one time step – swap old and new cell arrays.
BigRational[] tmp = curCell;
curCell = nextCell;
nextCell = tmp;
}
});
33
Results
34
Results
35
36
9
Computer Science 1 (4003-231)
Overlapping
Overlapping
•  Q: What is causing the large sequential fraction in the
cellular automata problem?
•  It’s the barrier action that computes and writes the pixel
row to the image file!
•  However, we don’t have to finish calculating all the cells’
next states before starting to write the current states.
•  In fact, we can do this I/O and computations together.
•  In general, we can run the I/O thread in parallel with
computation threads.
•  This is called overlapping!
37
Speedup and Efficiency
38
Speedup and Efficiency
•  Running time
–  T(N,K) = max(F*T(N,1), 1/K * (1-F) * T(N,1))
•  Speedup
–  T(N,1) / T(N,K) = min(1/F, K/(1-F))
•  Efficiency
–  Eff(N,K) = Speedup(N,K) / K = min(1/KF, 1/(1-F))
•  Any difference does overlapping make?
•  With overlapping, the speedup is superlinear until it hits
the limit.
39
40
10
Computer Science 1 (4003-231)
Parallel Sections
Parallel Sections
•  So how do we program overlapping in PJ?
•  We should be able to run a parallel team of threads
where each thread may execute a different piece of
code, e.g., I/O and computations.
•  We can do this using parallel section!
new ParallelTeam(2).execute (new ParallelRegion() {
public void run() {
execute (new ParallelSection() {
public void run() {
// Code for computation
}},
new ParallelSection() {
public void run() {
// Code for I/O
}},
new BarrierAction() {
public void run() {
// Code for single-threaded barrier action
}});
}
});
41
Nested Parallel Regions
42
Nested Parallel Regions
•  The computation specialist’s task is itself a result parallel
problem.
•  So the computation section contains another parallel
region, which creates nested parallel regions.
For c = 0 to C-1:
currCell(c) = 0
currCell(C/2) = 1
Parallel region:
For s = 0 to S-1:
Parallel sections: (two threads)
Parallel region:
Parallel for c = 0 to C-1: (K threads)
nextCell = frac((currCell(c-1) + currCell(c) +
currCell(c+1)) * 1/3 * A + B)
Compute and write pixel data as row s of PJG image file
Barrier action: (single thread)
Swap(currCell, nextCell)
Compute and write pixel data as row S of PJG image file
43
44
11
Computer Science 1 (4003-231)
Parallel Program with Overlapping
Results
•  See edu.rit.smp.ca.CCASmp2.java
45
Results
46
Results
47
48
12
Computer Science 1 (4003-231)
Results
49
13