Computer Science 1 (4003-231) What to Learn This Week? • We will learn how to analyze an algorithm for sequential dependencies. Sequential Dependencies and Overlapping • We will also learn how to parallelize the algorithm while enforcing the sequential dependencies. • We will then discover how the CPU’s cache memory affects a parallel program’s performance. Minseok Kwon Department of Computer Science Rochester Institute of Technology 1 All Pairs Shortest Paths 2 Floyd’s Algorithm • Given a network (or a graph), how do we find the shortest paths between all possible pairs? for i = 0 to n-1 // Update the distance between nodes r and c via node i. for r = 0 to n-1 for c = 0 to n-1 D(r,c) = min(D(r,c), D(r,i) + D(i,c)) 3 4 1 Computer Science 1 (4003-231) I/O Files I/O Files package edu.rit.smp.network; import edu.rit.io.DoubleMatrixFile; import edu.rit.util.Random; import java.io.BufferedOutputStream; import java.io.FileOutputStream; public class FloydRandom { public static void main(String[] args) { // Parse command-line arguments. if (args.length != 4) usage(); long seed = Long.parseLong(args[0]); double radius = Double.parseDouble(args[1]); int n = Integer.parseInt(args[2]); String matrixfile = args[3]; // Compute distance matrix elements. double[][] = new double[n][n]; for (int r = 0; r < n; r++) { double[] dr = d[r]; for (int c = 0; c < n; c++) { double dx = x[r] – x[c]; double dy = y[r] – y[c]; double distance = Math.sqrt(dx*dx + dy*dy); dr[c] = (distance <= radius ? distance : Double.POSITIVE_INFINITY); } } // Write distance matrix to output file. DoubleMatrixFile dmf = new DoubleMatrixFile(n,n,d); DoubleMatrixFile.Writer writer = dmf.prepareToWrite (new BufferedOutputStream (new FileOutputStream(matrixfile))); writer.write(); writer.close(); } } // Set up pseudorandom number generator. Random prng = Random.getInstance(see); // Generate random node locations in the unit square. double[] x = new double[n]; double[] y = new double[n]; for (int i = 0; i < n; i++) { x[i] = prng.nextDouble(); y[i] = prng.nextDouble(); } 5 Sequential Program 6 Sequential Program package edu.rit.smp.network; import edu.rit.io.DoubleMatrixFile; import java.io.BufferedInputStream; import java.io.BufferedOutputSream; import java.io.File; import java.io.FileInputStream; import java.io.FileOutputStream; public class FloydSeq { // Number of nodes. static int n; // Distance matrix. static double[][] d; reader.read(); reader.close(); n = dmf.getRowCount(); d = dmf.getMatrix(); long t2 = System.currentTimeMillis(); for (int i = 0; i < n; i++) { double[] di = d[i]; for (int r = 0; r < n; r++) { double[] dr = d[r]; for (int c = 0; c < n; c++) dr[c] = Math.min(dr[c],dr[i]+di[c]); } } long t3 = System.currentTimeMillis(); public static void main(String[] args) { // Start timing. long t1 = System.currentTimeMillis(); // Parse command-line arguments. if (args.length != 2) usage(); File infile = new File(args[0]); File outfile = new File(args[1]); // Read distance matrix from input file. DoubleMatrixFile dmf = new DoubleMatrixFile(); DoubleMatrixFile.Reader reader = dmf.prepareToRead (new BufferedInputStream(new FileInputStream(infile))); // Write distance matrix to output file. DoubleMatrixFile.Writer writer = dmf.prepareToWrite( new BufferedOutputStream (new FileOutputStream (outfile))); writer.write(); writer.close(); // Stop timing. long t4 = System.currentTimeMillis(); // Print the results. 7 8 2 Computer Science 1 (4003-231) Parallel Floyd’s Algorithm Row Slicing vs. Column Slicing • Let’s turn Floyd’s algorithm into a parallel version. • Q: Can we do the iterations of the outer loop, the loop over i, in parallel? • One problem is that on the previous iteration, the values of drc, dri, and dic could have changed. • There is a sequential dependency from each iteration i to the next iteration i+1. • How about the middle loop, the loop over r? • It can be parallelized because the elements in row i and column i do not change. 9 Parallel Program with Row Slicing 10 Results new ParallelTeam().execute (new ParallelRegion() { public void run() { for (int ii = 0; ii < n; ii++) { final int i = ii; final double[] di = d[i]; execute(0, n-1, new IntegerForLoop() { public void run(int first, int last) { for (int r = first; r <= last; r++) { double[] dr = d[r]; for (int c = 0; c < n; c++) dr[c] = Math.min(dr[c], dr[i] + di[c]); } } }); } } }); 11 12 3 Computer Science 1 (4003-231) Results Parallel Program with Column Slicing new ParallelTeam().execute (new ParallelRegion() { public void run() { for (int ii = 0; ii < n; ii++) { final int i = ii; final double[] di = d[i]; for (int r = 0; r < n; r++) { final double[] dr = d[r]; execute(0, n-1, new IntegerForLoop() { public void run(int first, int last) { for (int c = first; c <= last; c++) dr[c] = Math.min(dr[c], dr[i] + di[c]); } }); } } } }); 13 Results 14 Results 15 16 4 Computer Science 1 (4003-231) Cellular Automata Cellular Automata • A cellular automaton (CA) is a simple abstract computing device that is capable of generating all kinds of interesting behavior. • We’re especially interested in one dimensional continuous CA (1-D CCA): • All cells are initially 0, except one cell in the middle is initially 1. • X(i) = frac( (X(i-1) + X(i) + X(i+1))*1/3*A+B ) • Once all the new values have been calculated from the current values, all the cells simultaneously change to their new values. • This process repeats and the CCA evolves through a sequence of states as the iterations progress. 17 Cellular Automata 18 Cellular Automata 19 20 5 Computer Science 1 (4003-231) Rational Arithmetic Improving Memory Scalability • How can we generate 1-D CCA like those previous figures? For c = 0 to C-1: cell(0,c) = 0 pixl(0,c) = 0 cell(0,c/2) = 1; pixel(0,c/2) = 1; For s = 1 to S: For c = 0 to C-1: cell(s,c) = frac((cell(s-1,c-1)+cell(s-1,c)+cell(s-1,c+1))*A/3+B) pixel(s,c) = floatValue(cell(s,c)); Write pixel data to PJG image file • We should use rational arithmetic when calculating the states of a CCA, not floating-point arithmetic, for higher precision. • One problem, though, is that this program requires O(SC) memory to hold the data. • This is particularly a problem when scaling up the problem. • How can we improve scalability? • We will store the numerators and denominators as arbitrary precision integers using java.math.BigInteger. • We will use java.math.BigDecimal to represent the color of each pixel. • BigRational class provides all of these! 21 Improving Memory Scalability 22 Sequential Program For c = 0 to C-1: curCell(c) = 0 curCell(c/2) = 1 For s = 0 to S-1: For c = 0 to C-1: nextCell(c) = frac((curCell(c-1)+curCell(c)+curCell(c+1)*A/3+B) For c = 0 to C-1: pixel(c) = floatValue(curCell(c)); Write pixel data as row s of PJG image file Swap(curCell, nextCell) For c = 0 to C-1: pixel(c) = floatValue(curCell(c)) Write pixel data as row S of PJG image file package edu.rit.smp.ca; import edu.rit.image.GrayImageRow; import edu.rit.image.PJGGrayImage; import edu.rit.image.PJGImage; import edu.rit.util.Range; import java.io.BufferedOutputStream; import java.io.File; import java.io.FileOutputStream; import java.io.IOException; public class CCASeq { // Constants. static final BigRational ZERO = new BigRational(“0”); static final BigRational ONE = new BigRational(“1”); static final BigRational ONE_THIRD = new BigRational(“1/3”); // Command-line arguments. static int C; static int S; static BigRational A; static BigRational B; static File imagefile; • What’s the complexity now? • O(C) // Old and new cell arrays. static BigRational[] curCell; static BigRational[] nextCell; 23 24 6 Computer Science 1 (4003-231) Sequential Program Sequential Program // Grayscale image matrix. static byte[][] pixelmatrix; static PJGGrayImage image; static PJGImage.Writer writer; for (int i = 0; i < C; i++) { curCell[i] = new BigRational(); nextCell[i] = new BigRational(); } curCell[C/2].assign(ONE); // One row of the grayscale image matrix. static byte[] pixelrow; static GrayImageRow imagerow; // Set up pixel matrix, image, and image writer. pixelmatrix = new byte[S+1][]; image = new PJGGrayImage(S+1, C, pixelmatrix); writer = image.prepareToWrite(new BufferedOutputStream (new FileOutputStream(imagefile))); public static void main(String[] args) { // Start timing. long t1 = System.currentTimeMillis(); // Allocate storage for one pixel matrix row. pixelrow = new byte[C]; imagerow = new GrayImageRow(pixelrow); imagerow.setInterpretation(PJGGrayImage.ZERO_IS_WHITE); // Parse command-line arguments. if (args.length != 5) usage(); C = Integer.parseInt(args[0]); S = Integer.parseInt(args[1]); A = new BigRational(args[2]).mul(ONE_THIRD); B = new BigRational(args[3]); imagefile = new File(args[4]); // Do S time steps. for (int s = 0; s < S; s++) { // Calculate next state of each cell. for (int i = 0; i < C; i++) nextCell[i].assign(curCell[i]).add(curCell[(i-1+C)%C]) .add(curCell[(i+1)%C]).mul(A).add(B) .normalize().fracPart(); // Allocate storage for old and new cell arrays. Init them. curCell = new BigRational[C]; nextCell = new BigRational[C]; 25 Sequential Program 26 Sequential Program // Write current CA state to image file. writeCurrentCell(s); // Write the current cell array to the given row of the image file. private static void writeCurrentCell(int r) { // Set image row’s gray values based on current cell states. for (int i = 0; i < C; i++) imagerow.setPixel(i, curCell[i].floatValue()); // Advance one time step – swap old and new cell arrays. BigRational[] tmp = curCell; curCell = nextCell; nextCell = tmp; // Set row r of the pixel matrix. pixelmatrix[r] = pixelrow; } // Write final CA state to image file. writeCurrentCell(S); writer.close(); // Write row-r slice of the image to the image file. writer.writeRowSlice(new Range(r,r)); } // Stop timing. long t2 = System.currentTimeMillis(); System.out.println((t2-t1)+” msec total”); } 27 28 7 Computer Science 1 (4003-231) Barrier Actions Barrier Actions • Q: Now, let’s convert this sequential program to a parallel version. • Are there any sequential dependencies? • There is indeed: we cannot compute the cell values for the next step until we finish computing the cell values for the current step. • Hence, the outer loop must remain a sequential loop. • However, the value calculated for any element in the nextCell array does not depend on the value of any other element in the nextCell array. • No sequential dependencies in the inner loop! For c = 0 to C-1: currCell(c) = 0 curCell(C/2) = 1 Parallel region: For s = 0 to S-1: Parallel for c = 0 to C-1: nextCell(c) = frac((curCell(c-1)+curCell(c)+curCell(c+1)*A/3+B) Compute and write pixel data as row s of PJG image file Swap(curCell, nextCell) Compute and write pixel data as rwo S of PJG image file 29 Barrier Actions 30 Barrier Actions new ParallelTeam().execute(new ParallelRegion() { ... execute(0, 99, new IntegerForLoop() { public void run(int first, int last) { for (int i = first; i <= last; i++) { // Loop body } } }, new BarrierAction() { public void run() { // Code to be executed in a single thread } }); ... }); 31 32 8 Computer Science 1 (4003-231) Barrier Actions Parallel Program // We can specify the constant barrier action instead of an object. // Each thread waits at the barrier. execute(0, 99, new IntegerForLoop() { ... }, BarrierAction.WAIT); // Calculate next state of each cell. // Parallel inner loop with a barrier action. execute(0, C-1, new IntegerForLoop() { public IntegerSchedule schedule() { return IntegerSchedule.guided(); } public void run (int first, int last) { for (int i = first; i <= last; i++) nextCell[i].assign(curCell[i]).add(curCell[(i-1+C)%C]) .add(curCell[(i+1)%C]).mul(A).add(B) .normalize().fracPart(); } }, // Each thread does not wait at the barrier. execute(0, 99, new IntegerForLoop() { ... }, BarrierAction.NO_WAIT); // Synchronize threads before next outer loop iteration. new BarrierAction() { public void run() throws Exception { writeCurrentCell(s); // Advance one time step – swap old and new cell arrays. BigRational[] tmp = curCell; curCell = nextCell; nextCell = tmp; } }); 33 Results 34 Results 35 36 9 Computer Science 1 (4003-231) Overlapping Overlapping • Q: What is causing the large sequential fraction in the cellular automata problem? • It’s the barrier action that computes and writes the pixel row to the image file! • However, we don’t have to finish calculating all the cells’ next states before starting to write the current states. • In fact, we can do this I/O and computations together. • In general, we can run the I/O thread in parallel with computation threads. • This is called overlapping! 37 Speedup and Efficiency 38 Speedup and Efficiency • Running time – T(N,K) = max(F*T(N,1), 1/K * (1-F) * T(N,1)) • Speedup – T(N,1) / T(N,K) = min(1/F, K/(1-F)) • Efficiency – Eff(N,K) = Speedup(N,K) / K = min(1/KF, 1/(1-F)) • Any difference does overlapping make? • With overlapping, the speedup is superlinear until it hits the limit. 39 40 10 Computer Science 1 (4003-231) Parallel Sections Parallel Sections • So how do we program overlapping in PJ? • We should be able to run a parallel team of threads where each thread may execute a different piece of code, e.g., I/O and computations. • We can do this using parallel section! new ParallelTeam(2).execute (new ParallelRegion() { public void run() { execute (new ParallelSection() { public void run() { // Code for computation }}, new ParallelSection() { public void run() { // Code for I/O }}, new BarrierAction() { public void run() { // Code for single-threaded barrier action }}); } }); 41 Nested Parallel Regions 42 Nested Parallel Regions • The computation specialist’s task is itself a result parallel problem. • So the computation section contains another parallel region, which creates nested parallel regions. For c = 0 to C-1: currCell(c) = 0 currCell(C/2) = 1 Parallel region: For s = 0 to S-1: Parallel sections: (two threads) Parallel region: Parallel for c = 0 to C-1: (K threads) nextCell = frac((currCell(c-1) + currCell(c) + currCell(c+1)) * 1/3 * A + B) Compute and write pixel data as row s of PJG image file Barrier action: (single thread) Swap(currCell, nextCell) Compute and write pixel data as row S of PJG image file 43 44 11 Computer Science 1 (4003-231) Parallel Program with Overlapping Results • See edu.rit.smp.ca.CCASmp2.java 45 Results 46 Results 47 48 12 Computer Science 1 (4003-231) Results 49 13
© Copyright 2024 ExpyDoc