
January 25, 2011
1
Sample of correct code for last week's report.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "mpi.h"

#define N 1000

int main(int argc, char *argv[])
{
    double *a, *b, local_c, c, r, *temp;
    int i, p, myid, procs, local_size;
    struct timeval tv;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    local_size = N / procs;
    a = (double *)malloc(local_size * sizeof(double));
    b = (double *)malloc(local_size * sizeof(double));

    if (myid == 0) {
        /* Rank 0 generates its own blocks of the random vectors ... */
        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);
        for (i = 0; i < local_size; i++) {
            r = rand();
            a[i] = r / RAND_MAX;
        }
        for (i = 0; i < local_size; i++) {
            r = rand();
            b[i] = r / RAND_MAX;
        }
        /* ... then generates and sends one block of a and one block of b
           to every other rank. */
        temp = (double *)malloc(local_size * sizeof(double));
        for (p = 1; p < procs; p++) {
            for (i = 0; i < local_size; i++) {
                r = rand();
                temp[i] = r / RAND_MAX;
            }
            MPI_Send(temp, local_size, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
            for (i = 0; i < local_size; i++) {
                r = rand();
                temp[i] = r / RAND_MAX;
            }
            MPI_Send(temp, local_size, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(a, local_size, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(b, local_size, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    }

    /* Each process computes the partial dot product of its local blocks. */
    local_c = 0.0;
    for (i = 0; i < local_size; i++)
        local_c = local_c + a[i] * b[i];

    MPI_Reduce(&local_c, &c, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("Result: %e\n", c);

    free(a);
    free(b);
    if (myid == 0)
        free(temp);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
2
An answer that uses MPI_Bcast
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "mpi.h"

#define N 1000

int main(int argc, char *argv[])
{
    double *a, *b, c, sum, r;
    double t1, t2;
    int i;
    int myid, procs;
    struct timeval tv;

    a = (double *)malloc(N * sizeof(double));
    b = (double *)malloc(N * sizeof(double));
    c = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);
    MPI_Barrier(MPI_COMM_WORLD);

    if (0 == myid) {
        /* Rank 0 generates the full random vectors. */
        t1 = MPI_Wtime();
        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);
        for (i = 0; i < N; i++) {
            r = rand();
            a[i] = r / RAND_MAX;
        }
        for (i = 0; i < N; i++) {
            r = rand();
            b[i] = r / RAND_MAX;
        }
    }

    /* Broadcast both vectors to all processes. */
    MPI_Bcast(a, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process sums its own block of the dot product. */
    for (i = (N / procs) * myid; i < (N / procs) * (myid + 1); i++)
        c = c + a[i] * b[i];

    MPI_Reduce(&c, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    if (0 == myid) {
        t2 = MPI_Wtime();
        printf("Result: %e\n", sum);
        printf("number of processes: %d\n", procs);
        printf("Elapsed time: %e sec.\n", t2 - t1);
    }

    free(a);
    free(b);
    MPI_Finalize();
    return 0;
}
Compared with the previous slide:
- The structure of the program is simpler.
- It consumes more memory (every process holds the full vectors).
- It performs more communication.
--> It can be slower.
3
Result of execution
- The program slows down as the number of processes increases!
$ mpirun -np 8 ./mpi-2.exe
Result: 2.582018e+02
number of processes: 8
Elapsed time: 2.067089e-03 sec.
$ mpirun -np 2 ./mpi-2.exe
Result: 2.565114e+02
number of processes: 2
Elapsed time: 5.409718e-04 sec.
$ mpirun -np 1 ./mpi-2.exe
Result: 2.539768e+02
number of processes: 1
Elapsed time: 1.161098e-04 sec.
The amount of computation in this program is too small, so the time spent on communication outweighs the benefit of parallelization.
- Dot product of vectors: O(N)
- Matrix-matrix multiplication: O(N^3)
4
Today's Topic
- Non-blocking communication
  - Execute other instructions while waiting for a communication to complete.
- Implementation of collective communication functions
- Deadlock
5
Non-blocking communication functions
- Non-blocking = proceed to the next instruction without waiting for the completion of the current one.
- Example) MPI_Irecv & MPI_Wait
[Diagram: with blocking MPI_Recv, the process waits for the arrival of the data before continuing; with non-blocking MPI_Irecv, the process proceeds to the next instructions without waiting for the data and later calls MPI_Wait, by which time the data has arrived.]
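As a rough illustration of this idea, the following sketch (not part of the original slides) posts a non-blocking receive, does some unrelated work, and only then waits for the message; the helper do_other_work() is a made-up placeholder for any computation that does not touch the receive buffer.

#include <stdio.h>
#include "mpi.h"

/* Hypothetical helper: any work that does not touch the receive buffer. */
static double do_other_work(void)
{
    double s = 0.0;
    int i;
    for (i = 0; i < 1000000; i++)
        s += i * 0.5;
    return s;
}

int main(int argc, char *argv[])
{
    int myid, buf = 0;
    double s = 0.0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Run with at least two processes: rank 0 sends, rank 1 receives. */
    if (myid == 0) {
        buf = 42;
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (myid == 1) {
        MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        s = do_other_work();      /* overlaps with the message transfer    */
        MPI_Wait(&req, &status);  /* buf may only be read after this call  */
        printf("received %d (other work = %f)\n", buf, s);
    }

    MPI_Finalize();
    return 0;
}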
6
MPI_Irecv
Usage:
int MPI_Irecv(void *b, int c, MPI_Datatype d, int src,
int t, MPI_Comm comm, MPI_Request *r);
- Non-blocking receive
- Parameters:
  start address for storing received data,
  number of elements, data type,
  rank of the source, tag (= 0, in most cases),
  communicator (= MPI_COMM_WORLD, in most cases),
  request
  - request: communication request
    Used to wait for the completion of this communication.
- Example)
MPI_Request req;
...
MPI_Irecv(a, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
...
MPI_Wait(&req, &status);
7
MPI_Isend
Usage:
int MPI_Isend(void *b, int c, MPI_Datatype d, int dest,
              int t, MPI_Comm comm, MPI_Request *r);
- Non-blocking send
- Parameters:
  start address of the data to be sent,
  number of elements, data type,
  rank of the destination, tag (= 0, in most cases),
  communicator (= MPI_COMM_WORLD, in most cases),
  request
- Example)
MPI_Request req;
...
MPI_Isend(a, 100, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
...
MPI_Wait(&req, &status);
8
Non-Blocking Send?
- Blocking send (MPI_Send):
  Waits until the data to be sent has been copied somewhere else,
  i.e. until the data has been pushed out onto the network,
  or until a temporary copy of the data has been made.
- Non-blocking send (MPI_Isend):
  Does not wait.
9
Notice: Data is undefined during non-blocking communication
- MPI_Irecv:
  - The value of the variable specified as the receive buffer is not determined until MPI_Wait.
[Timeline diagram: MPI_Irecv is posted to A; while the message is still in flight, reading A may return either the old value or the newly arrived one (10 or 50 in the figure); after MPI_Wait, reading A reliably returns the received value (10).]
10
Notice: Data is undefined during non-blocking communication
- MPI_Isend:
  - If the variable holding the data to be sent is modified before MPI_Wait, the value that is actually sent is unpredictable.
[Timeline diagram: MPI_Isend is posted for A (holding 10); modifying A (e.g. A = 50) before MPI_Wait causes incorrect communication because the data actually sent may be either 10 or 50; after MPI_Wait, A can be modified (e.g. A = 100) without any problem.]
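Both notices boil down to one rule: do not touch a buffer involved in a non-blocking operation until the matching MPI_Wait has returned. The following hedged sketch (not from the original slides; the buffer names and values are invented) marks the unsafe and safe points on both the send and the receive side.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myid, sendbuf, recvbuf = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Run with at least two processes. */
    if (myid == 0) {
        sendbuf = 10;
        MPI_Isend(&sendbuf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* UNSAFE here: writing "sendbuf = 50;" now could make either
         * 10 or 50 arrive at rank 1.                                   */
        MPI_Wait(&req, &status);
        sendbuf = 50;                  /* SAFE: the send has completed  */
    } else if (myid == 1) {
        MPI_Irecv(&recvbuf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* UNSAFE here: recvbuf may still hold 0 or already hold 10.    */
        MPI_Wait(&req, &status);
        printf("recvbuf = %d\n", recvbuf);  /* SAFE: guaranteed to be 10 */
    }

    MPI_Finalize();
    return 0;
}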
11
MPI_Wait
Usage:
int MPI_Wait(MPI_Request *req, MPI_Status *stat);
- Waits for the completion of a non-blocking communication (MPI_Isend or MPI_Irecv).
  - After it returns, the send buffer may safely be modified and the receive buffer may safely be read.
- Parameters:
  request, status
  - status:
    The status of the received data is stored here when an MPI_Irecv completes.
12
MPI_Waitall
Usage:
int MPI_Waitall(int c,
MPI_Request *requests, MPI_Status *statuses);
- Waits for the completion of the specified number of non-blocking communications (see the sketch after this list).
- Parameters:
  count, requests, statuses
  - count:
    The number of non-blocking communications to wait for.
  - requests, statuses:
    Arrays of MPI_Request and MPI_Status with at least 'count' elements.
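As a hedged sketch of this usage (not part of the original slides; the message value is made up), rank 0 below posts one non-blocking send per other rank and then waits for all of them with a single MPI_Waitall call.

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int i, myid, procs, val;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    if (myid == 0) {
        /* One request and one status per outstanding send. */
        MPI_Request *reqs = (MPI_Request *)malloc(sizeof(MPI_Request) * (procs - 1));
        MPI_Status *stats = (MPI_Status *)malloc(sizeof(MPI_Status) * (procs - 1));
        val = 7;
        for (i = 1; i < procs; i++)
            MPI_Isend(&val, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[i - 1]);
        /* Wait for all procs-1 sends to complete before reusing 'val'. */
        MPI_Waitall(procs - 1, reqs, stats);
        free(reqs);
        free(stats);
    } else {
        MPI_Recv(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
        printf("rank %d received %d\n", myid, val);
    }

    MPI_Finalize();
    return 0;
}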
13
Inside the collective communication functions
- Usually, collective communication functions are implemented on top of point-to-point communications such as MPI_Send, MPI_Recv, MPI_Isend and MPI_Irecv.
14
Inside of MPI_Bcast
- One of the simplest implementations
int MPI_Bcast(char *a, int c, MPI_Datatype d, int root,
              MPI_Comm comm)
{
    int i, myid, procs;
    MPI_Status st;

    MPI_Comm_rank(comm, &myid);
    MPI_Comm_size(comm, &procs);

    if (myid == root) {
        /* The root sends the data to every other rank, one at a time. */
        for (i = 0; i < procs; i++)
            if (i != root)
                MPI_Send(a, c, d, i, 0, comm);
    } else {
        MPI_Recv(a, c, d, root, 0, comm, &st);
    }
    return 0;
}
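A minimal test driver for this implementation might look as follows. It is not part of the original slides, and it assumes the function above is renamed my_bcast so that it does not clash with the MPI library's own MPI_Bcast.

#include <stdio.h>
#include "mpi.h"

/* The hand-rolled broadcast from the slide above, renamed my_bcast. */
int my_bcast(char *a, int c, MPI_Datatype d, int root, MPI_Comm comm);

int main(int argc, char *argv[])
{
    int myid, val = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0)
        val = 123;                      /* only the root knows the value */

    /* Every rank should print val = 123 after the broadcast. */
    my_bcast((char *)&val, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d: val = %d\n", myid, val);

    MPI_Finalize();
    return 0;
}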
15
Another implementation: With MPI_Isend
int MPI_Bcast(char *a, int c, MPI_Datatype d, int root,
              MPI_Comm comm)
{
    int i, myid, procs;
    MPI_Status st, *stats;
    MPI_Request *reqs;

    MPI_Comm_rank(comm, &myid);
    MPI_Comm_size(comm, &procs);

    if (myid == root) {
        stats = (MPI_Status *)malloc(sizeof(MPI_Status) * procs);
        reqs = (MPI_Request *)malloc(sizeof(MPI_Request) * procs);
        /* Post all sends at once, then wait for them together. */
        reqs[root] = MPI_REQUEST_NULL;    /* no send to the root itself */
        for (i = 0; i < procs; i++)
            if (i != root)
                MPI_Isend(a, c, d, i, 0, comm, &reqs[i]);
        MPI_Waitall(procs, reqs, stats);
        free(stats);
        free(reqs);
    } else {
        MPI_Recv(a, c, d, root, 0, comm, &st);
    }
    return 0;
}
16
Another implementation: Binomial Tree
int MPI_Bcast(char *a, int c, MPI_Datatype d,
              int root, MPI_Comm comm)
{
    int myid, procs;
    MPI_Status st;
    int mask, relative_rank, src, dst;

    MPI_Comm_rank(comm, &myid);
    MPI_Comm_size(comm, &procs);

    /* Rank numbering relative to the root. */
    relative_rank = myid - root;
    if (relative_rank < 0)
        relative_rank += procs;

    /* Receive phase: wait for the data from the parent in the tree. */
    mask = 1;
    while (mask < procs) {
        if (relative_rank & mask) {
            src = myid - mask;
            if (src < 0) src += procs;
            MPI_Recv(a, c, d, src, 0, comm, &st);
            break;
        }
        mask <<= 1;
    }

    /* Send phase: forward the data to the children in the tree. */
    mask >>= 1;
    while (mask > 0) {
        if (relative_rank + mask < procs) {
            dst = myid + mask;
            if (dst >= procs) dst -= procs;
            MPI_Send(a, c, d, dst, 0, comm);
        }
        mask >>= 1;
    }
    return 0;
}
17
Flow of Binomial Tree
- Use 'mask' to determine when and how to send/receive.
[Diagram: broadcast among ranks 0-7. With mask = 4, rank 0 sends to rank 4. With mask = 2, ranks 0 and 4 send to ranks 2 and 6. With mask = 1, ranks 0, 2, 4 and 6 send to ranks 1, 3, 5 and 7. Each rank first receives from its parent when (relative_rank & mask) becomes nonzero, then forwards the data using the remaining smaller masks.]
18
Deadlock
- A state in which a program can no longer make progress, for whatever reason.
- Places in an MPI program where you need to be careful about deadlocks:
1. MPI_Recv, MPI_Wait, MPI_Waitall
Wrong case (both ranks block in MPI_Recv and never reach their MPI_Send):
if (myid == 0){
    MPI_Recv from rank 1
    MPI_Send to rank 1
}
if (myid == 1){
    MPI_Recv from rank 0
    MPI_Send to rank 0
}
One solution: use MPI_Irecv (a concrete sketch follows after this list):
if (myid == 0){
    MPI_Irecv from rank 1
    MPI_Send to rank 1
    MPI_Wait
}
if (myid == 1){
    MPI_Irecv from rank 0
    MPI_Send to rank 0
    MPI_Wait
}
2. Collective communications
   A program cannot proceed until all processes have called the same collective communication function.
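A self-contained version of the MPI_Irecv solution above might look like the hedged sketch below (not from the original slides; the exchanged values are made up): each of the two ranks posts its receive first, then sends, then waits, so neither blocks forever.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myid, other, sendval, recvval;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Intended for exactly two processes: ranks 0 and 1 exchange values. */
    other = 1 - myid;
    sendval = 100 + myid;

    /* Post the receive first, then send, then wait: no deadlock. */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req);
    MPI_Send(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, &status);

    printf("rank %d received %d\n", myid, recvval);

    MPI_Finalize();
    return 0;
}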
Summary
- Writing a parallel program requires dividing the computation, dividing the data, and communication.
- Parallelization does not always speed up a program.
- Some programs cannot be parallelized.
- Be careful about deadlocks in parallel programs.
20
Report) Write a Reduce function by yourself
- Complete the program shown on the next slide by filling in the body of the 'my_reduce' function.
- my_reduce: a simplified version of MPI_Reduce
  - It only computes the total sum of integers. The root rank is always 0.
    The communicator is always MPI_COMM_WORLD.
- Any algorithm is acceptable.
21
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define N 20

int my_reduce(int *a, int *b, int c)
{
    /* complete here by yourself */
    return 0;
}

int main(int argc, char *argv[])
{
    int i, myid, procs;
    int a[N], b[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    for (i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 0;
    }

    my_reduce(a, b, N);

    if (myid == 0)
        for (i = 0; i < N; i++)
            printf("b[%d] = %d , correct answer = %d\n", i, b[i], i * procs);

    MPI_Finalize();
    return 0;
}
22