Parallel Programming in MPI part 2 1 1 Answer of the previous exercise • 作成したプログラムについて、以下を説明してください。 Explain your program with the following points of view. • どのように解決したか。 How did you solve the problem. • 何が難しかったか。 Where was the difficulty. 2 Today's Topic • ノンブロッキング通信 Non-Blocking Communication • 通信の完了を待つ間に他の処理を行う Execute other instructions while waiting for the completion of a communication. • 集団通信関数の実装 Implementation of collective communications • MPIプログラムの時間計測 Measuring execution time of MPI programs • デッドロック Deadlock 3 Today's Topic • ノンブロッキング通信 Non-Blocking Communication • 通信の完了を待つ間に他の処理を行う Execute other instructions while waiting for the completion of a communication. • 集団通信関数の実装 Implementation of collective communications • MPIプログラムの時間計測 Measuring execution time of MPI programs • デッドロック Deadlock 4 ノンブロッキング通信関数 Non-blocking communication functions • ノンブロッキング = ある命令の完了を待たずに次の命令に移る Non-blocking = Do not wait for the completion of an instruction and proceed to the next instruction • Example) MPI_Irecv & MPI_Wait Blocking Non-Blocking MPI_Recv Wait for the arrival of data data Proceed to the next instruction without waiting for the data MPI_Irecv next instructions MPI_Wait data next instructions 5 MPI_Irecv Usage: int MPI_Irecv(void *b, int c, MPI_Datatype d, int src, int t, MPI_Comm comm, MPI_Request *r); • Non-Blocking Receive • Parameters: start address for storing received data, number of elements, data type, rank of the source, tag (= 0, in most cases), communicator (= MPI_COMM_WORLD, in most cases), request • request: 通信要求 Communication Request • この通信の完了を待つ際に用いる Used for Waiting completion of this communication • Example) MPI_Request req; ... MPI_Irecv(a, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &req); ... MPI_Wait(&req, &status); 6 6 MPI_Isend • Non-Blocking Send Usage: int MPI_Isend(void *b, int c, MPI_Datatype d, int dest, int t, MPI_Comm comm, MPI_Request *r); • Parameters: start address for sending data, number of elements, data type, rank of the destination, tag (= 0, in most cases), communicator (= MPI_COMM_WORLD, in most cases), request • Example) MPI_Request req; ... MPI_Isend(a, 100, MPI_INT, 1, 0, MPI_COMM_WORLD, &req); ... MPI_Wait(&req, &status); 7 7 Non-Blocking Send? • Blocking send (MPI_Send): 送信データが別の場所にコピーされるのを待つ Wait for the data to be copied to somewhere else. • ネットワークにデータを送出し終わるか、一時的にデータのコピーを作成するまで。 Until completion of the data to be transferred to the network or, until completion of the data to be copied to a temporal memory. • Non-Blocking send (MPI_Recv): 待たない 8 Notice: ノンブロッキング通信中はデータが不定 Data is not sure in non-blocking communications • MPI_Irecv: • 受信データの格納場所と指定した変数の値は MPI_Waitまで不定 Value of the variable specified for receiving data is not fixed before MPI_Wait Value of A at here can be 10 or 50 MPI_Irecv to A A 10 arrived data ... ~=A ... A 50 50 MPI_Wait Value of A is 50 ~=A 9 Notice: ノンブロッキング通信中はデータが不定 Data is not sure in non-blocking communications • MPI_Isend: • 送信データを格納した変数を MPI_Waitより前に書き換えると、実際に送信さ れる値は不定 If the variable that stored the data to be sent is modified before MPI_Wait, the value to be actually sent is unpredictable. Modifying value of A here causes incorrect communication MPI_Isend A ... A = 50 ... A 10 A 50 data sent 10 or 50 MPI_Wait You can modify value of A at here without any problem A = 100 10 MPI_Wait Usage: int MPI_Wait(MPI_Request *req, MPI_Status *stat); • ノンブロッキング通信(MPI_Isend、 MPI_Irecv)の完了を待つ。 Wait for the completion of MPI_Isend or MPI_Irecv • 送信データの書き換えや受信データの参照が行える Make sure that sending data can be modified, or receiving data can be referred. • Parameters: request, status • status: MPI_Irecv 完了時に受信データの statusを格納 The status of the received data is stored at the completion of MPI_Irecv 11 MPI_Waitall Usage: int MPI_Waitall(int c, MPI_Request *requests, MPI_Status *statuses); • 指定した数のノンブロッキング通信の完了を待つ Wait for the completion of specified number of non-blocking communications • Parameters: count, requests, statuses • count: ノンブロッキング通信の数 The number of non-blocking communications • requests, statuses: 少なくとも count個の要素を持つ MPI_Request と MPI_Statusの配列 Arrays of MPI_Request or MPI_Status that consists at least 'count' number of elements. 12 Today's Topic • ノンブロッキング通信 Non-Blocking Communication • 通信の完了を待つ間に他の処理を行う Execute other instructions while waiting for the completion of a communication. • 集団通信関数の実装 Implementation of collective communications • MPIプログラムの時間計測 Measuring execution time of MPI programs • デッドロック Deadlock 13 集団通信関数の中身 Inside of the functions of collective communications • 通常,集団通信関数は, MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv 等の一対一通信で実装される Usually, functions of collective communications are implemented by using message passing functions. 14 Inside of MPI_Bcast • One of the most simple implementations int MPI_Bcast(char *a, int c, MPI_Datatype d, int root, MPI_Comm comm) { int i, myid, procs; MPI_Status st; MPI_Comm_rank(comm, &myid); MPI_Comm_size(comm, &procs); if (myid == root){ for (i = 0; i < procs) if (i != root) MPI_Send(a, c, d, i, 0, comm); } else{ MPI_Recv(a, c, d, root, 0, comm, &st); } return 0; } 15 Another implementation: With MPI_Isend int MPI_Bcast(char *a, int c, MPI_Datatype d, int root, MPI_Comm comm) { int i, myid, procs, cntr; MPI_Status st, *stats; MPI_Request *reqs; MPI_Comm_rank(comm, &myid); MPI_Comm_rank(comm, &procs); if (myid == root){ stats = (MPI_Status *)malloc(sizeof(MPI_Status)*procs); reqs = (MPI_Request *)malloc(sizeof(MPI_Request)*procs); cntr = 0; for (i = 0; i < procs) if (i != root) MPI_Isend(a, c, d, i, 0, comm, &(reqs[cntr++])); MPI_Waitall(procs-1, reqs, stats); free(stats); free(reqs); } else{ MPI_Recv(a, c, d, root, 0, comm, &st); } return 0; } 16 Flow of the Simple Implementation Rank 0 Rank 1 Rank 2 Rank 3 Rank 4 Isend to 1 Irecv from 0 Irecv from 0 Irecv from 0 Irecv from 0 wait wait wait Isend to 2 Rank 5 Rank 6 Rank 7 Irecv from 0 Irecv from 0 Irecv from 0 wait wait Isend to 3 Isend to 4 wait wait Isend to 5 Isend to 6 Isend to 7 waitall 17 Time for Simple Implementation • 1 link can transfer 1 message at a time 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 Total Time = T * (P-1) T: Time for transferring 1 message P: Number of processes 18 Another implementation: Binomial Tree int MPI_Bcast(char *a, int c, MPI_Datatype d, int root, MPI_Comm comm) { int i, myid, procs; MPI_Status st; int mask, relative_rank, src, dst; int tag = 1, success = 0; MPI_Comm_rank(comm, &myid); MPI_Comm_rank(comm, &procs); relative_rank = myid - root; if (relative_rank < 0) relative_rank += procs; mask = 1; while (mask < num_procs){ if (relative_rank & mask){ src = myid - mask; if (src < 0) src += procs; MPI_Recv(a, c, d, src, 0, comm, &st); break; } mask <<= 1; } mask >>= 1; while (mask > 0){ if (relative_rank + mask < procs){ dst = myid + mask; if (dst >= procs) dst -= procs; MPI_Send (a, c, d, dst, 0, comm); } mask >>= 1; } return 0; } 19 Flow of Binomial Tree • Use 'mask' to determine when and how to Send/Recv Rank 0 Rank 1 Rank 2 Rank 3 Rank 4 Rank 5 Rank 6 Rank 7 mask = 1 mask = 1 mask = 1 mask = 1 mask = 1 mask = 1 mask = 1 mask = 1 mask = 2 Recv mask = 2 Recv mask = 2 Recv mask = 2 Recv mask = 4 mask = 4 from 6 from 4 Recv from 0 Recv from 2 Recv from 4 mask = 4 from 0 from 0 Send to 4 mask = 2 Send to 2 mask = 1 Send to 3 mask = 2 Send to 6 mask = 1 Send to 5 mask = 1 Send to 7 mask = 1 Send to 1 20 Time for Binomial Tree • Use multiple links at a time 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 Total Time = T * log2P T: Time for transferring 1 message P: Number of processes 21 Today's Topic • ノンブロッキング通信 Non-Blocking Communication • 通信の完了を待つ間に他の処理を行う Execute other instructions while waiting for the completion of a communication. • 集団通信関数の実装 Implementation of collective communications • MPIプログラムの時間計測 Measuring execution time of MPI programs • デッドロック Deadlock 22 MPIプログラムの時間計測 Measure the time of MPI programs • MPI_Wtime • 現在時間(秒)を実数で返す関数 Returns the current time in seconds. • Example) Measure time here ... double t1, t2; ... t1 = MPI_Wtime(); 処理 t2 = MPI_Wtime(); printf("Elapsed time: %e sec.\n", t2 – t1); 並列プログラムにおける時間計測の問題 Problem on measuring time in parallel programs • プロセス毎に違う時間を測定: どの時間が本当の所要時間か? Each process measures different time. Which time is the time we want? Rank 0 t1 = MPI_Wtime(); Rank 1 Read Measure time here t1 = MPI_Wtime(); Read Send Read Rank 2 t1 = MPI_Wtime(); Receive Receive t1 = MPI_Wtime(); Send t1 = MPI_Wtime(); t1 = MPI_Wtime(); 24 集団通信 MPI_Barrierを使った解決策 Use MPI_Barrier • 時間計測前にMPI_Barrierで同期 Synchronize processes before each measurement • For measuring total execution time. Rank 0 MPI_Barrier t1 = MPI_Wtime(); Measure time here Rank 1 MPI_Barrier Receive Rank 2 MPI_Barrier Receive Read Read Send Read MPI_Barrier Send MPI_Barrier MPI_Barrier t1 = MPI_Wtime(); 25 より細かい解析 Detailed analysis • Average • MPI_Reduce can be used to achieve the average: double t1, t2, t, total; t1 = MPI_Wtime(); ... t2 = MPI_Wtime(); t = t2 – t1; MPI_Reduce(&t, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD); if (myrank == 0) printf("Ave. elapsed: %e sec.\n", total/procs); • MAX and MIN • Use MPI_Gather to gather all of the results to Rank 0. • Let Rank 0 to find MAX and MIN 26 最大(Max)、平均(Ave)、最小(Min)の関係 Relationships among Max, Ave and Min • プロセス毎の負荷(仕事量)のばらつき検証に利用 Can be used for checking the load-balance. Max – Ave is large Max – Ave is small Ave – Min is large NG Mostly OK Ave – Min is small NG OK Time includes Computation Time and Communication Time 27 通信時間の計測 Measuring time for communications double t1, t2, t3, t4 comm=0; t3 = MPI_Wtime(); for (i = 0; i < N; i++){ computation t1 = MPI_Wtime(); communication t2 = MPI_Wtime(); comm += t2 – t1; computation t1 = MPI_Wtime(); communication t2 = MPI_Wtime(); comm += t2 – t1; } t4 = MPI_Wtime(); 28 Analyze computation time • Computation time = Total time - Communication time • Or, just measure the computation time • 計算時間のばらつき = 負荷の不均衡の度合い Balance of computation time shows balance of the amount of computation • 注意: 通信時間には、負荷の不均衡によって生じた待ち時間が含まれ るので、単純な評価は難しい Communication time is difficult to analyze since it consists waiting time caused by load-imbalance. ==> Balance computation first. 29 Today's Topic • ノンブロッキング通信 Non-Blocking Communication • 通信の完了を待つ間に他の処理を行う Execute other instructions while waiting for the completion of a communication. • 集団通信関数の実装 Implementation of collective communications • MPIプログラムの時間計測 Measuring execution time of MPI programs • デッドロック Deadlock 30 Deadlock • 何らかの理由で、プログラムを進行させることができなくなった状態 A status of a program in which it cannot proceed by some reasons. • MPIプログラムでデッドロックが発生しやすい場所: Places you need to be careful for deadlocks: 1. MPI_Recv, MPI_Wait, MPI_Waitall Wrong case: if (myid == 0){ MPI_Recv from rank 1 MPI_Send to rank 1 } if (myid == 1){ MPI_Recv from rank 0 MPI_Send to rank 0 } One solution: use MPI_Irecv if (myid == 0){ MPI_Irecv from rank 1 MPI_Send to rank 1 MPI_Wait } if (myid == 1){ MPI_Irecv from rank 0 MPI_Send to rank 0 MPI_Wait } 2. Collective communications 全部のプロセスが同じ集団通信関数を実行するまで先に進めない A program cannot proceed until all processes call the same collective communication function Summary • ノンブロッキング通信の効果 Effect of non-blocking communication • 通信開始と通信完了待ちを分離 Split the start and the completion of a communication • 通信と計算のオーバラップを可能にする Enable overlapping of communication and computation . • 集団通信の実装 Implementation of collective communication. • 内部で送信と受信を組み合わせて実装 Construct algorithms with sends and receives. • 所要時間はアルゴリズムに依存 Time depends on the algorithm. • MPIプログラムの時間計測 Measuring execution time of MPI programs • 並列プログラムではデッドロックに注意 Be careful about deadlocks. 32 Report) Make Reduce function by yourself • 次のページのプログラムの my_reduce関数の中身を追加してプロ グラムを完成させる Fill the inside of 'my_reduce' function in the program shown in the next slide • my_reduce: MPI_Reduceの簡略版 Simplified version of MPI_Reduce • 整数の総和のみ. ルートランクは 0限定. コミュニケータは MPI_COMM_WORLD Calculates total sum of integer numbers. The root rank is always 0. The communicator is always MPI_COMM_WORLD. • アルゴリズムは好きなものを考えてよい Any algorithm is OK. 33 #include <stdio.h> #include <stdlib.h> #include "mpi.h" #define N 20 int my_reduce(int *a, int *b, int c) { complete here by yourself return 0; } int main(int argc, char *argv[]) { int i, myid, procs; int a[N], b[N]; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myid); MPI_Comm_size(MPI_COMM_WORLD, &procs); for (i = 0; i < N; i++){ a[i] = i; b[i] = 0; } my_reduce(a, b, N); if (myid == 0) for (i = 0; i < N; i++) printf("b[%d] = %d , correct answer = %d\n", i, b[i], i*procs); MPI_Finalize(); return 0; } 34
© Copyright 2025 ExpyDoc