SWoPP2010論文用評価実験シナリオ

HPCS2011 Jan. 18-19, 2011
パケットペーシングによる
全対全通信の最適化と
シミュレーション評価
○柴村英智1)
平尾智也1)
清水俊幸2)
三輪英樹2)
安島雄一郎2)
石畑宏明3)
薄田竜太郎1)
三吉郁夫2)
井上弘士4)
1) 九州先端科学技術研究所 2) 富士通株式会社
3) 東京工科大学
4) 九州大学
背景と目的
►
全対全通信（All-to-all）


►
同時送信可能なインターコネクトの提案


►
複数の通信コントローラ（NIC）をルータに搭載
同時送信を用いた全対全通信アルゴリズムの提案（高上ら）
高速インターネット通信におけるペーシング


►
インターコネクトへの負荷が高い（高トラフィック）
アルゴリズムとインターコネクトのトポロジとの親和性が重要
高いネットワーク利用効率を達成
インターコネクトへの応用は少ない
全対全通信におけるパケットペーシングの効果を探る
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
2
発表の流れ
►
背景と目的
►
A2AT：最適全対全通信アルゴリズム
►
NSIM：インターコネクトシミュレータ
►
評価対象システム
►
実験概要
►
評価実験と結果
►
まとめ
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
3
A2AT：最適全対全通信アルゴリズム
►
高上、矢崎、安島、清水、石畑、“2次元Meshネット
ワーク・Torusネットワーク上での最適全対全通信ア
ルゴリズム”、HPCS2010


複数リンクを用いた同時送信を前提
常に全てのリンクを利用し、
Nホップ時に、高々N個の
メッセージでリンクを共有
ホップ＝1
2011/1/18
ホップ＝2
ホップ＝3
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
4
A2ATの実行フェーズ
// alltoall by A2AT algorithm
n = mat_size ;
偶数ノード数時の実行フェーズ
if ( n % 2 == 0 ) {
for ( i=0; i<n; i++ ) {
if ( 0 < i && i <= (n-1)/2 ) {
// Phase 1
recv( -i, 0, msg_size, NIC0, &r_req[0] );
recv( i, 0, msg_size, NIC1, &r_req[1] );
recv( 0, -i, msg_size, NIC2, &r_req[2] );
recv( 0, i, msg_size, NIC3, &r_req[3] );
send( i, 0, msg_size, NIC0, &s_req[0] );
send( -i, 0, msg_size, NIC1, &s_req[1] );
send( 0, i, msg_size, NIC2, &s_req[2] );
send( 0, -i, msg_size, NIC3, &s_req[3] );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
}
for ( j=0; j<n; j++ ) {
if ( 0 < j && j < i && i <= (n-1)/2 ) {
// Phase 3
recv( i, j, msg_size, NIC0, &r_req[0] );
recv( -i, -j, msg_size, NIC1, &r_req[1] );
recv( j, i, msg_size, NIC2, &r_req[2] );
recv( -j, -i, msg_size, NIC3, &r_req[3] );
send( -i, -j, msg_size, NIC0, &s_req[0] );
send( i, j, msg_size, NIC1, &s_req[1] );
send( -j, -i, msg_size, NIC2, &s_req[2] );
send( j, i, msg_size, NIC3, &s_req[3] );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
if ( 0 < i && i <= (n-1)/2 ) {
// Phase 2
recv( -i, -i, msg_size, NIC0, &r_req[0] );
recv( i, i, msg_size, NIC1, &r_req[1] );
recv( i, -i, msg_size, NIC2, &r_req[2] );
recv( -i, i, msg_size, NIC3, &r_req[3] );
send( i, i, msg_size, NIC0, &s_req[0] );
send( -i, -i, msg_size, NIC1, &s_req[1] );
send( -i, i, msg_size, NIC2, &s_req[2] );
send( i, -i, msg_size, NIC3, &s_req[3] );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
}
// Phase 4
recv( -j, i, msg_size, NIC0, &r_req[0] );
recv( j, -i, msg_size, NIC1, &r_req[1] );
recv( -i, j, msg_size, NIC2, &r_req[2] );
recv( i, -j, msg_size, NIC3, &r_req[3] );
send( j, -i, msg_size, NIC0, &s_req[0] );
send( -j, i, msg_size, NIC1, &s_req[1] );
send( i, -j, msg_size, NIC2, &s_req[2] );
send( -i, j, msg_size, NIC3, &s_req[3] );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
for ( k=1; k<= n/2-1; k++ ) {
// Phase 5
recv( -n/2, -k, msg_size, NIC0, &r_req[0] );
recv( n/2, k, msg_size, NIC1, &r_req[1] );
recv( -k, -n/2, msg_size, NIC2, &r_req[2] );
recv( k, n/2, msg_size, NIC3, &r_req[3] );
send( n/2, k, msg_size, NIC0, &s_req[0] );
send( -n/2, -k, msg_size, NIC1,&s_req[1] );
send( k, n/2, msg_size, NIC2, &s_req[2] );
send( -k, -n/2, msg_size, NIC3,&s_req[3] );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
}
// Phase 6
recv( n/2, n/2, msg_size, NIC0, &r_req[0] );
recv( -n/2, 0, msg_size, NIC1, &r_req[1] );
recv( 0, -n/2, msg_size, NIC2, &r_req[2] );
send( -n/2, -n/2, msg_size, NIC0, &s_req[0] );
send( n/2, 0, msg_size, NIC1, &s_req[1] );
send( 0, n/2, msg_size, NIC2, &s_req[2] );
waitall( 3, s_req, s_stat );
waitall( 3, r_req, r_stat );
}
}
}
}
2011/1/18
※send, recvは非同期通信
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
5
A2ATにおけるパケットペーシング
►
Nホップ時に、高々N個のメッセージでリンクを共有
►
パケット間に（N－１）パケット分のギャップを予め挿
入し、リンクの帯域効率を向上
3ホップ時は、3メッセージで1リンクを共有
src
dst
パケットの流れ
2パケット分のギャップによるパケットペーシング
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
6
NSIM：インターコネクトシミュレータ
►
大規模インターコネクトの性能評価を目的
►
MGENプログラムと呼ぶ通信事象生成プログラムを
用いて、シミュレーション中に【通信事象】を生成

プログラム中の通信データは送受信しない
#include <stdio.h>
MGENプログラム例（Pairwise exchange)
#include "mgen.h"
int MGEN_Main( int argc, char *argv[] )
{
:
// Pairwise exchange
for ( i=1; i<mysize; i++ ) {
dst = src = myrank ^ i;
MGEN_Irecv(NULL, msg_size, MGEN_BYTE, src, 0, MGEN_COMM_WORLD, &req[0]);
MGEN_Isend(NULL, msg_size, MGEN_BYTE, dst, 0, MGEN_COMM_WORLD, &req[1]);
MGEN_Wait(&req[1], &stat[1]);
MGEN_Wait(&req[0], &stat[0]);
}
return (0);
}
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
7
評価対象システム（１）
● トポロジ：n×n 2次元トーラス網
リンクバンド幅：4GB/s
■ ルーティング：次元順＋日付変更線
■ パケット転送方式：VCT（フリット長：16B）
仮想チャネルバッファ：2KiB（MTU）×4
パケット長：32B～2KiB（ヘッダ長：32B）
仮想チャネル数：2
■ NIC数：4
■ フロー制御方式：クレジットベース
n
+X
-X
+Y
-Y
4GB/s
4GB/s
4GB/s
4GB/s
+X
-X
+Y
-Y
n
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
8
評価対象システム（２）
メモリ・DMA転送バンド幅：
MPI関数処理オーバヘッド：
ルーティング計算時間（RC）：
仮想チャネル設定時間（VA）：
スイッチ設定時間（SA）：
フリット転送時間（ST）：
スイッチ遅延時間：
ケーブル遅延時間：
MPIプロセス数／ノード：
16GB/s
200ns
4ns
4ns
4ns
4ns
78ns
10ns
1
※これらの仕様をコンフィグレーションファイルでNSIMに与える
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
9
コンフィグレーションファイルの例
(network
name=“2dt-8x8"
rank-node-filename=“2dt-8x8.rnc"
number-of-NIC=4
number-of-virtual-channels=2
packetsize=2048 ; byte
flitsize=16 ; byte
mtusize=2048 ; byte
packetheadersize=32 ; byte
virtual-channel-buffersize=8192 ; byte
routing-computation-time=4000 ; ps
virtual-channel-allocation-time=4000;ps
switch-allocation-time=4000 ; ps
flit-traversal-time=4000 ; ps
switch-latency=78000 ; ps
cable-latency=10000 ; ps
Tnrb=250 ; ps
enable-pipelining=true
connect-nodes-directly=true
(topology
name="3D Torus"
X=8 Y=8 Z=1
2011/1/18
torus-X=true
torus-Y=true
torus-Z=true
)
(mpi
number-of-target-ranks=64
rendezvous-size-threshold=67108864
zerocopy-size-threshold=4 ; byte
Tdma=250 ; ps
Tnic=200000 ; ps
Tcopy=250 ; ps
Tpps=200000 ; ps
Teps=200000 ; ps
Tppr=200000 ; ps
Tepr=200000 ; ps
Tpw =200000 ; ps
Tew =200000 ; ps
Tcp =200000 ; ps
Tce =200000 ; ps
Tci =200000 ; ps
)
)
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
10
実験概要
実践的なシステム仕様に基づいた評価実験

インターコネクトシミュレータNSIMを利用
►
実験１：全対全通信アルゴリズムの性能比較
►
実験２：パケットペーシングによる性能改善
►
実験３：シーケンス毎のチューニング
►
実験４：バリア同期によるインバランスの抑制
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
11
実験１：全対全通信アルゴリズムの性能比較
►
A2ATと他の全対全通信アルゴリズムを比較

アルゴリズム：a2at、pwx (pairwise exchange)、
ring、bruck



ノード数：64（8×8）、256（16×16）、
1,024（32×32）、4,096（64×64）
メッセージサイズ：64MiB÷ノード数


総有効データ転送量を同じにするため
評価指標


2011/1/18
a2at のみ、4NICを使った同時送信
MGEN実行時間（予想実行時間）、リンク帯域利用効率
次元毎・平均のリンクスループット（時系列）
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
12
a2at、pwx、ring、bruckの性能比較
►
実行性能： a2at ＜ pwx ＜ ring ＜ bruck
ただし、a2at は４NICを利用した同時送信
 a2at のリンク帯域利用効率が良くない
⇒次元軸毎にリンクバンド幅の時系列変化を調査

MGEN実行時間
2011/1/18
リンク帯域利用効率
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
13
時系列で見た次元毎のリンクバンド幅
※2次元トーラス網
64ノード（8×8）
実行時間：71.3ms
理想実行時間：17.0ms
各軸のリンクを平等に利用
各軸の帯域効率が変化
a2at
隣接通信が多いため±X軸を多用
実行時間：87.0ms
ホップ数の増加により＋Y軸のリンク利用が増加
pwx
実行時間：115.5ms
すべてのリンクを利用している機会が少ない
ring
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
14
n×nの2次元トーラス網における
全対全通信の理想実行時間
Tideal 
Q
Q (B) ：Bisectionリンクを通過する総データ量
Bbisect
n 
Q  2n  
2
Bbisect  4nBlink
2
Tideal
n

2
n 
2
 
n 
 2  Lmesg
 
Bbisect (B/sec)：bisectionバンド幅
Lmesg (B)：メッセージサイズ
Blink (B/sec)：リンクバンド幅
 n  Lmesg
2 B
  link
※本実験では、各パケットのヘッダ長もQに含め、理想実行時間を算出している
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
15
a2at-8x8と-9x9におけるリンクバンド幅
►
偶数ノード数の場合、プログラム後半から通信性能が低下

次元順ルーティングとアルゴリズムとの齟齬が原因
a2at-8x8（偶数ノード数：64）
理想実行時間：17.0ms
各軸の帯域効率が変化
実行時間：71.3ms
a2at-9x9（奇数ノード数：81）
理想実行時間：24.0ms
各軸のリンクを平等に利用
実行時間：84.0ms
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
16
a2at-8x8の平均VCバッファ利用率
バッファがほぼ飽和状態
（後続パケットをバッファできない）
※VC1は省略
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
17
2次元トーラス網8×8におけるa2at
30 | 43 | 35 | 28 | 37 | 45 | 32 | 58
----+----+----+----+----+----+----+---41 | 14 | 19 |リンクを通過する
12 | 21 | 16 | 47 | 54
メッセージが多い
----+----+----+----+----+----+----+---33 | 17 | 6 | 4 | 8 | 23 | 39 | 50
----+----+----+----+----+----+----+---26 | 10 | 2 | 0 | 1 | 9 | 25 | 62
----+----+----+----+----+----+----+---40 | 24 | 7 | 3 | 5 | 18 | 34 | 49
----+----+----+----+----+----+----+---48 | 15 | 22 | 11 | 20 | 13 | 42 | 53
----+----+----+----+----+----+----+---31 | 46 | 38 | 27 | 36 | 44 | 29 | 57
----+----+----+----+----+----+----+---60 | 56 | 52 | 63 | 51 | 55 | 59 | 61
2011/1/18
► 偶数ノード数の場合、
アルゴリズム後半で
最右辺・最下辺との
通信が発生
► ＋X （右）方向、およ
び、－Y （下）方向へ
の通信量が増加
► ＋X方向の通信混
雑により実行時間が
増加
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
18
実験２：パケットペーシングによる性能改善
►
全体を最適化するパケット間ギャップ値を求め、
ペーシングの効果を調査



アルゴリズム：a2at
ノード数：64（8×8）、81（9×9）、… 、256（16×16）
メッセージサイズ：1MiB



520×フルサイズパケット（2KiB）＋1×小サイズパケット（288B）
ルータ毎にパケット単位でアービトレーション
パケット間ギャップ


ハードウェアによるパケットペーシングを仮定
（ホップ数－1）＋パケット間ギャップバイアス

2011/1/18
－2～＋16 step 1.0 （0～＋6については step 0.125）
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
19
各ギャップバイアスにおける実行時間
►
適切なパケット間ギャップを与えることで、実行性能
が向上

パケットペーシングの効果を確認
※アルゴリズム：a2at メッセージサイズ：1MiB
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
20
ペーシング時のリンクバンド幅
►
パケット間ギャップバイアス値：1.250
►
実行性能は良くなっているが、リンク性能を十分に活用
していない
理想実行時間：17.0ms
理想実行時間：24.0ms
実行時間：42.7ms
実行時間：30.7ms
A2AT-8x8（64ノード）
A2AT-9x9（81ノード）
※アルゴリズム：a2at メッセージサイズ：1MiB
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
21
a2atの通信状況（2次元トーラス網 8x8）
ペーシング無し
実行時間=71.3(ms)
59.9%
ペーシング有り
実行時間=42.7(ms)
・ペーシングによって通信効率（奇数ノード数部）が改善
・最終フェーズ部の通信状況は悪い
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
22
a2atの通信状況（2次元トーラス網 9x9）
ペーシング無し
実行時間=84.0(ms)
36.5%
ペーシング有り
実行時間=30.7(ms)
・ペーシングによって通信効率が改善
・最終フェーズ部の通信状況も良好
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
23
ペーシングによる性能向上
シーケンス毎の最適化によって2倍以上の性能向上
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
24
実験３：シーケンス毎のチューニング
►
各シーケンス（実行フェーズ）の実行を最小化する
バイアス値を求めた後、全シーケンスを統合実行




アルゴリズム：a2at
ノード数：81（9×9）
メッセージサイズ：1MiB
パケット間ギャップ
►
2011/1/18
（ホップ数－1）＋パケット間ギャップバイアス
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
25
各バイアス値におけるシーケンス実行時間
パケット間ギャップ：1.250
※アルゴリズム：a2at メッセージサイズ：1MiB
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
26
全シーケンスの統合実行
►
概ね最適なリンクバンド幅で実行している

シーケンス15の性能が低い
⇒ シーケンス14の実行後のインバランスが原因
シーケンス14（ホップ数8）の通信中に他プロセスの
シーケンス15が始まっている
理想実行時間：24.0ms
実行時間：25.8ms
107.5%
シーケンス15
※アルゴリズム：a2at ノード数：81（9×9）メッセージサイズ：1MiB
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
27
実験４：バリア同期によるインバランスの抑制
►
インバランスを抑制するために、シーケンス終了毎
にバリア同期を入れる




アルゴリズム：a2at
ノード数：81（9×9）
メッセージサイズ：1MiB
パケット間ギャップ


バリア同期：ソフトウェアバリア（dissemination）

2011/1/18
（ホップ数－1）＋パケット間ギャップバイアス
同期コスト：11.5μs@81ノード ⇒ 218.5μs@20シーケンス
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
28
バリア同期挿入後の全シーケンス実行
►
バリア同期のインバランスによって性能低下


バリア同期用の通信が、シーケンス実行の通信を妨害
バリア同期のインバランス量を調査 ⇒ 3μs未満
理想実行時間：24.0ms
実行時間：92.6ms
※アルゴリズム：a2at ノード数：81（9×9）メッセージサイズ：1MiB
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
29
バリア同期のインバランスを考慮した実行
►
インバランス状態で各シーケンスを個別に最適化し、全
シーケンスを統合実行

インバランス量：3μs、正規分布発生
理想実行時間：24.0ms
実行時間：32.8ms
136.7%
※アルゴリズム：a2at ノード数：81（9×9）メッセージサイズ：1MiB
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
30
まとめと今後の課題
►
同時送信機構を活用するA2ATの性能評価

►
実践的な仕様では十分な性能を発揮できない場合がある
A2ATにパケットペーシングを適用

通信アルゴリズムに適した積極的なペーシングが有効


各フェーズ実行開始時刻のインバランスに敏感

►
最大で、理論値の107.5%で実行可能
余裕を持ったペーシングがポイント
今後の課題



2011/1/18
大規模システムを対象とした評価
様々な集団通信アルゴリズムを用いた評価
パケットペーシングの動的自動最適化
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
31
NSIMの一般公開
►
インターコネクトシミュレータNSIM（OpenNSIM）を
2010年11月17日より無償公開中

TaaS（Tools as a Service）と呼ぶクラウド環境から利用
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
32
OpenNSIM ポータルサイト
https://ngarch.isit.or.jp/taas/opennsim/
OpenNSIM
検索
あるいは
[swopp-announce:170454]
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
33
バックアップスライド
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
34
n×nの2次元トーラス網における
全対全通信の理想実行時間（1）
【全対全通信の理想実行時間】
bisection
0
1
理想実行時間：
Tideal (sec)
データ転送量：Q1 (B)
データ転送量：Q2 (B)
N-1
※ N = n×n
Tideal (sec) 
Q (B)
Bbisect (B/sec)
bisectionバンド幅：
Bbisect (B/sec)
Bisectionリンクを通過する
総データ量：Q (B) = Q1 + Q2
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
35
n×nの2次元トーラス網における
全対全通信の理想実行時間（2）
【Bisectionリンクを通過する総データ量：Q】
bisection
n 
 2  n
bisection
n 
 2  n
n 
 2 
n 
 2 
メッセージサイズ：
Lmesg (B)
n  n 
Q1    n  n  Lmesg
2 2
Q1
n
Q2
n  n 
Q2    n   n  Lmesg
2 2
2次元トーラス網
Q2 (B)
2011/1/18
 n  n 
Q  Q1  Q2  2n     L mesg
 2  2 
2
Q1 (B)
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
36
n×nの2次元トーラス網における
全対全通信の理想実行時間（3）
【 2次元トーラス網における全対全通信の理想実行時間： Tideal 】
Tideal 
Q
Bbisect
n  n 
Q  2n     Lmesg
2 2
Bbisect  （双方向）
 n （トーラスリンク）
 リンクバンド幅
2
 2  n  2  Blink
 4nBlink
Tideal
2011/1/18
n

2
n 
2
 
 n  Lmesg
2 B
  link
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
37
実験概要
►
実験１：全対全通信アルゴリズムの性能比較

►
実験２：パケットペーシングによる性能改善

►
A2ATにパケットペーシングを適用し、様々なパケット間
ギャップ値による性能を調査
実験３：シーケンス毎のチューニング

►
A2AT、pairwise exchange、ringを比較
A2ATを細分化し、各シーケンスの最適なパケット間
ギャップを求めた後、統合実行
実験４：バリア同期によるインバランスの抑制

2011/1/18
バリア同期を各シーケンス終了後に挿入し、インバランス
による性能低下を抑制した実行
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
38
本研究における同時送信
複数NICを用いて同時に複数のパケットをネットワークに投入
パケット
NIC
ネットワーク
パケットの到着順に、送信開始
パケットの先頭を整えて、送信開始
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
39
NSIM：インターコネクトシミュレータ
►
大規模インターコネクトの性能評価を目的
►
MGENプログラムと呼ぶ通信事象生成プログラムを
用いて、シミュレーション中に【通信事象】を生成

►
シミュレーション結果

►
MGENプログラムの通信データは送受信しない
並列プログラム予想実行時間、実効バンド幅、実効リンク
バンド幅、リンクビジー率、パケット待ち時間など多彩
並列離散事象シミュレーション（PDES）による、
NSIM自体の並列実装（MPI実装）
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
40
MGENプログラムの例
#include <stdio.h>
#include "mgen.h"
int MGEN_Main( int argc, char *argv[] )
{
int myrank, mysize, i, src, dst, msg_size=1024;
MGEN_Request req[2];
MGEN_Status stat[2];
MGEN_Comm_rank(MGEN_COMM_WORLD, &myrank);
MGEN_Comm_size(MGEN_COMM_WORLD, &mysize); // mysize must be a power of 2
// Pairwise exchange
for ( i=1; i<mysize; i++ ) {
dst = src = myrank ^ i;
MGEN_Irecv(NULL, msg_size, MGEN_BYTE, src, 0, MGEN_COMM_WORLD, &req[0]);
MGEN_Isend(NULL, msg_size, MGEN_BYTE, dst, 0, MGEN_COMM_WORLD, &req[1]);
MGEN_Wait(&req[1], &stat[1]);
MGEN_Wait(&req[0], &stat[0]);
}
return (0);
}
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
41
OpenNSIMにおける各種遅延の積算
送信側ランク
CPU
MPI_Send
受信側ランク
主記憶
NIC
NIC
主記憶
CPU
Tpps
Tpsend
Recv発行が早い
Tppr
Recv発行が遅い
MPI_Recv
Tcopy·Sm(m)
Tdma·(Sp(1)+…+Sp(n-1))+Tnic·(n-1)
Teps
Tdma·Sp(n)+Tnic
Twr(m)
Tlws(m)
Tprecv
Ti(m)
Tdma·Sp(n)+Tnic
MPI_Recv
Tcopy·Sm(m)
Tppr
Tcopy·Sm(m)
Tepr
MPI関数のスタート
ネットワーク遅延を
アップ遅延を積算
積算
•関数プロローグ、エピローグ •ルータ遅延
•メモリコピー（ユーザ⇒MPI） •ケーブル遅延
•DMA転送（主記憶⇒NIC） •衝突による遅延など
2011/1/18
Tprecv
Tepr
MPI関数の後処理
時間を積算
•DMA転送（NIC⇒主記憶）
•メモリコピー（MPI⇒ユーザ）
•関数プロローグ、エピローグ
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
42
OpenNSIMの主な仕様（1）
►
ノード数（Webブラウザから指定）

►
トポロジ（Webブラウザから指定）


►
8～128Kノード
直接網：2次元メッシュ網、3次元メッシュ網
2次元トーラス網、3次元トーラス網
間接網：FatTree （2層 or 3層FBB接続）
ルーティングアルゴリズム


直接網：次元順ルーティング＋日付変更線
間接網：V-Up D-Down あるいは D-Up V-Down
※ V: Vertical D: Diagonal
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
43
OpenNSIMの主な仕様（2）
►
ルータ






リンクバンド幅：4GB/s※（単方向）
パケット長：32B～2KiB※
パケット（仮想チャネル）バッファ長：8KiB※
フリット長：16B※
パケット転送方式：Virtual cut-through
フロー制御方式：クレジットベース
※デフォルト値：設定ファイルを用いて変更可能
►
評価アプリケーション（MGENプログラム）





2011/1/18
Point to point
Random ring （HPCCより）
Alltoall（pairwise exchange, ring, bruck）
Allreduce（recursive doubling）
ユーザが作成したMGENプログラム
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
44
OpenNSIMの主な仕様（3）
►
同期通信

►
非同期通信

►
MGEN_Send（）、MGEN_Recv（）
MGEN_Isend（）、MGEN_Irecv（）、MGEN_Wait（）
その他



MGEN_Comm_size（）
MGEN_Comm_rank（）
MGEN_Comp（）

2011/1/18
指定された時間だけ、シミュレーション時間を進める。
⇒ 既知の計算ブロックの演算時間を与えることで、アプリケーショ
ン実行時間の推定精度を高めることができる。
⇒ ロードインバランスやOSジッタを踏まえたシミュレーションも可能
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
45
OpenNSIMの主な仕様（4）
►
出力項目（時系列表示）






►
ネットワークレイテンシ
リンクスループット
バッファ利用率
通信レイテンシ
リンクスループットの割合の変動
輻輳の発生割合
出力項目（その他）

2011/1/18
輻輳箇所をノード・ポート毎に上位順で表示
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
46
OpenNSIMの入出力ファイル
入力ファイル
出力ファイル
MGENプログラム
性能情報
コンフィグ
レーション
ランク・ノード
変換テーブル
2011/1/18
OpenNSIM
統計情報
通信履歴
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
47
a2at、pwx、ring、bruckの性能比較
2011/1/18
MGEN実行時間
リンク帯域利用効率
リンクビジー率
平均パケット待ち時間
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
48
リンク帯域効率と平均パケット待ち時間
►
適切なパケット間ギャップ


リンク帯域効率の向上
パケット待ち時間の減少
※アルゴリズム：a2at メッセージサイズ：1MiB
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
49
pwxアルゴリズム
for ( i=1; i<mysize; i++ ) {
dst = src = myrank ^ i;
MGEN_Irecv(,,, src,,, &req[0]);
MGEN_Isend(,,, dst,,, &req[1]);
MGEN_Wait(&req[1], &stat[1]);
MGEN_Wait(&req[0], &stat[0]);
}
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
50
ringアルゴリズム
for ( i=1; i<mysize; i++ ) {
src=(myrank-i +mysize)%mysize;
dst=(myrank+i)%mysize;
MGEN_Irecv(,,, src,,, &req[0]);
MGEN_Isend(,,, dst,,, &req[1]);
MGEN_Wait(&req[1], &stat[1]);
MGEN_Wait(&req[0], &stat[0]);
}
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
51
A2ATの通信状況
MGEN_Waitによる
受信待ち状態
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
52
pwxの通信状況
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
53
ringの通信状況
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
54
パケット間ギャップの指定方法
void send( int a, int b, int size, int nic, MGEN_Request *req )
float gap_bias )
{
int myrank, mx, my;
int dst;
t_TsimRoutingInfo info;
MGEN_Comm_rank(MGEN_COMM_WORLD, &myrank);
mx = ( Xlevel( myrank ) + a + Xsize ) % Xsize;
my = ( Ylevel( myrank ) + b + Ysize ) % Ysize;
dst = RankXY( mx, my );
default_routing_info(&info);
info.SrcNIC = nic;
info.DstNIC = nic;
info.gap = 8.0 * (( hop_2dt(myrank, dst) -1 + gap_bias);
if ( info.gap < 0 ) info.gap = 0;
}
LMGEN_Isend( NULL, size, MGEN_BYTE, dst,
MGEN_TAG0, MGEN_COMM_WORLD, req, &info, 1 );
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
55
インターコネクト・コンフィグレーションファイル
# 2dt-9x9.cfg : NSIM configuration file
#
generated by mkconfig.sh on Sat Jun 26 03:34:51 JST 2010
(simulation
name="sim-2dt-9x9"
:
(network
name="2dt-9x9"
rank-node-filename="2dt-9x9.rnc"
number-of-NIC=4
number-of-virtual-channels=2
packetsize=2048 ; byte
flitsize=16 ; byte
mtusize=2048 ; byte
packetheadersize=32 ; byte
virtual-channel-buffersize=8192 ; byte
routing-computation-time=4000 ; ps
virtual-channel-allocation-time=4000 ; ps
switch-allocation-time=4000 ; ps
flit-traversal-time=4000 ; ps
switch-latency=78000 ; ps
cable-latency=10000 ; ps
Tnrb=62 ; ps
enable-pipelining=true
connect-nodes-directly=true
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
56
A2AT MGENプログラム（1 of 3）
#include
#include
#include
#include
<stdio.h>
<stdlib.h>
<string.h>
"mgen.h"
py = abs( Ylevel(dst) - Ylevel(src) );
my = min( py, Ysize-py );
hop_y = min( py, my );
hop = hop_x + hop_y;
#define BLOCKING_SENDRECV 0
return( hop );
}
#endif
#define MGEN_TAG0 0
#define MGEN_RANK0 0
#define
#define
#define
#define
NIC0
NIC1
NIC2
NIC3
0
1
2
3
// definitions for 2DT
#define Xlevel(n) ( (n) % Xsize )
#define Ylevel(n) (((n) % ( Xsize * Ysize )) / Xsize )
#define RankXY(x,y) ( (x) + ((y) * Xsize) )
#ifdef USE_GAP
#define max(a,b) (a>b?a:b)
#define min(a,b) (a>b?b:a)
#endif
static int Xsize;
static int Ysize;
void send( int a, int b, int size, int nic, MGEN_Request *req )
float gap_bias )
{
int myrank, mx, my;
int dst;
t_TsimRoutingInfo info;
MGEN_Comm_rank(MGEN_COMM_WORLD, &myrank);
mx = ( Xlevel( myrank ) + a + Xsize ) % Xsize;
my = ( Ylevel( myrank ) + b + Ysize ) % Ysize;
dst = RankXY( mx, my );
default_routing_info(&info);
info.SrcNIC = nic;
info.DstNIC = nic;
#ifdef USE_GAP
info.gap
= 8.0 * (( hop_2dt(myrank, dst) -1 + gap_bias);
if ( info.gap < 0 ) info.gap = 0;
#endif
static void default_routing_info(t_TsimRoutingInfo *info){
CONF_Default_routing(TSIM_MEVENT_SEND, 0, 0, MGEN_COMM_WORLD, info);
}
#if BLOCKING_SENDRECV
LMGEN_Send( NULL, size, MGEN_BYTE, dst,
#ifdef USE_GAP
MGEN_TAG0, MGEN_COMM_WORLD, &info, 1 );
int hop_2dt( int src, int dst )
#else
{
LMGEN_Isend( NULL, size, MGEN_BYTE, dst,
int hop;
MGEN_TAG0, MGEN_COMM_WORLD, req, &info, 1 );
int hop_x, hop_y;
#endif
int px,mx,py,my;
}
px = abs( Xlevel(dst) - Xlevel(src) );
mx = min( px , Xsize-px );
hop_x = min( px, mx );
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
57
A2AT MGENプログラム（2 of 3）
void recv( int a, int b, int size, int nic, MGEN_Request *req )
{
int myrank, mx, my;
int src;
t_TsimRoutingInfo info;
MGEN_Status status;
static int init = 0;
// Main routine variables
static int msg_size; // message size
static int mat_size; // matrix size
int n,rn;
int i,j,k;
MGEN_Comm_rank(MGEN_COMM_WORLD, &myrank);
mx = ( Xlevel( myrank ) + a + Xsize ) % Xsize;
my = ( Ylevel( myrank ) + b + Ysize ) % Ysize;
src = RankXY( mx, my );
default_routing_info(&info);
info.DstNIC = nic;
#if BLOCKING_SENDRECV
LMGEN_Recv( NULL, size, MGEN_BYTE, src,
MGEN_TAG0, MGEN_COMM_WORLD, &status, &info, 1 );
#else
LMGEN_Irecv( NULL, size, MGEN_BYTE, src,
MGEN_TAG0, MGEN_COMM_WORLD, req, &info, 1 );
#endif
}
MGEN_Comm_rank(MGEN_COMM_WORLD, &myrank);
MGEN_Comm_size(MGEN_COMM_WORLD, &mysize);
if ( init==0 ) {
init = 1;
if ( argc == 3 ) {
msg_size = atoi(argv[1]);
mat_size = atoi(argv[2]);
} else {
if (myrank==0)
printf("%% %s [message size] [matrix size]\n", argv[0]);
exit(EXIT_FAILURE);
}
}
void waitall( int arraysize, MGEN_Request *requests, MGEN_Status
*statuses )
{
#if ! BLOCKING_SENDRECV
int i;
for ( i=0; i< arraysize; i++ )
MGEN_Wait( &requests[i], &statuses[i] );
#endif
}
int MGEN_Main( int argc, char *argv[] )
{
// MGEN variables
int myrank;
int mysize;
MGEN_Request s_req[4], r_req[4];
MGEN_Status s_stat[4], r_stat[4];
// Initialization variable
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
58
A2AT MGENプログラム（3 of 3）
// alltoall by ishihata-a2a algorithm
Xsize = Ysize = n = mat_size ;
for ( i=0; i<n; i++ ) {
if ( 0 < i && i <= (n-1)/2 ) {
recv( -i, 0, msg_size, NIC0, &r_req[0]
recv( i, 0, msg_size, NIC1, &r_req[1]
recv( 0, -i, msg_size, NIC2, &r_req[2]
recv( 0, i, msg_size, NIC3, &r_req[3]
send( i, 0, msg_size, NIC0, &s_req[0]
send( -i, 0, msg_size, NIC1, &s_req[1]
send( 0, i, msg_size, NIC2, &s_req[2]
send( 0, -i, msg_size, NIC3, &s_req[3]
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
}
if ( 0 < i && i <= (n-1)/2 ) {
recv( -i, -i, msg_size, NIC0,
recv( i, i, msg_size, NIC1,
recv( i, -i, msg_size, NIC2,
recv( -i, i, msg_size, NIC3,
send( i, i, msg_size, NIC0,
send( -i, -i, msg_size, NIC1,
send( -i, i, msg_size, NIC2,
send( i, -i, msg_size, NIC3,
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
}
&r_req[0]
&r_req[1]
&r_req[2]
&r_req[3]
&s_req[0]
&s_req[1]
&s_req[2]
&s_req[3]
recv( -i, j, msg_size, NIC2, &r_req[2] );
recv( i, -j, msg_size, NIC3, &r_req[3] );
);
);
);
);
);
);
);
);
NIC0,
NIC1,
NIC2,
NIC3,
&s_req[0]
&s_req[1]
&s_req[2]
&s_req[3]
);
);
);
);
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
}
}
}
if ( n % 2 == 0 ) {
for ( k=1; k<= n/2-1; k++ ) {
recv( -n/2,
-k, msg_size, NIC0, &r_req[0] );
recv( n/2,
k, msg_size, NIC1, &r_req[1] );
recv(
-k, -n/2, msg_size, NIC2, &r_req[2] );
recv(
k, n/2, msg_size, NIC3, &r_req[3] );
send( n/2,
k, msg_size, NIC0, &s_req[0] );
send( -n/2,
-k, msg_size, NIC1, &s_req[1] );
send(
k, n/2, msg_size, NIC2, &s_req[2] );
send(
-k, -n/2, msg_size, NIC3, &s_req[3] );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
}
recv( n/2, n/2, msg_size, NIC0, &r_req[0] );
recv( -n/2,
0, msg_size, NIC1, &r_req[1] );
recv(
0, -n/2, msg_size, NIC2, &r_req[2] );
send( -n/2, -n/2, msg_size, NIC0, &s_req[0] );
send( n/2,
0, msg_size, NIC1, &s_req[1] );
send(
0, n/2, msg_size, NIC2, &s_req[2] );
waitall( 3, s_req, s_stat );
waitall( 3, r_req, r_stat );
);
);
);
);
);
);
);
);
for ( j=0; j<n; j++ ) {
if ( 0 < j && j < i && i <= (n-1)/2 ) {
recv( i, j, msg_size, NIC0, &r_req[0]
recv( -i, -j, msg_size, NIC1, &r_req[1]
recv( j, i, msg_size, NIC2, &r_req[2]
recv( -j, -i, msg_size, NIC3, &r_req[3]
send( -i, -j, msg_size, NIC0, &s_req[0]
send( i, j, msg_size, NIC1, &s_req[1]
send( -j, -i, msg_size, NIC2, &s_req[2]
send( j, i, msg_size, NIC3, &s_req[3]
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
recv( -j, i, msg_size, NIC0, &r_req[0]
recv( j, -i, msg_size, NIC1, &r_req[1]
2011/1/18
send( j, -i, msg_size,
send( -j, i, msg_size,
send( i, -j, msg_size,
send( -i, j, msg_size,
);
);
);
);
);
);
);
);
}
return(EXIT_SUCCESS);
}
/* EOF */
);
);
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
59
各シーケンスのリンクバンド幅
Seq-1
Seq-2
Seq-3
Seq-4
Seq-5
Seq-6
Seq-7
Seq-8
Seq-9
Seq-10
Seq-11
Seq-12
Seq-13
Seq-14
Seq-15
Seq-16
Seq-17
Seq-18
Seq-19
Seq-20
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
60
a2at-8x8の平均VCバッファ利用率
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
61
最適全対全通信アルゴリズム
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
62
A2AT：最適全対全通信アルゴリズム
2011/1/18
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
63
A2ATの実行フェーズ
// alltoall by A2AT algorithm
Xsize = Ysize = n = mat_size ;
偶数ノード数時の実行フェーズ
for ( i=0; i<n; i++ ) {
フェーズ単体実行用
if ( 0 < i && i <= (n-1)/2 ) {
// Phase 1
++cseq;
gap = gaplist[cseq-1];
if ( myrank == 0 ) printf("seq %d: phase 1:
gapbias=%.3f\n", cseq, gap);
recv( -i, 0, msg_size, NIC0, &r_req[0] );
recv( i, 0, msg_size, NIC1, &r_req[1] );
recv( 0, -i, msg_size, NIC2, &r_req[2] );
recv( 0, i, msg_size, NIC3, &r_req[3] );
send( i, 0, msg_size, NIC0, &s_req[0], gap );
send( -i, 0, msg_size, NIC1, &s_req[1], gap );
send( 0, i, msg_size, NIC2, &s_req[2], gap );
send( 0, -i, msg_size, NIC3, &s_req[3], gap );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
}
if ( 0 < i && i <= (n-1)/2 ) {
// Phase 2
++cseq;
gapbias = gaplist[cseq-1];
if ( myrank == 0 ) printf("seq %d: phase 2:
gap=%.3f\n", cseq, gap);
recv( -i, -i, msg_size, NIC0, &r_req[0] );
recv( i, i, msg_size, NIC1, &r_req[1] );
recv( i, -i, msg_size, NIC2, &r_req[2] );
recv( -i, i, msg_size, NIC3, &r_req[3] );
send( i, i, msg_size, NIC0, &s_req[0], gap );
send( -i, -i, msg_size, NIC1, &s_req[1], gap );
send( -i, i, msg_size, NIC2, &s_req[2], gap );
send( i, -i, msg_size, NIC3, &s_req[3], gap );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
}
2011/1/18
for ( j=0; j<n; j++ ) {
if ( 0 < j && j < i && i <= (n-1)/2 ) {
// Phase 3
++cseq;
gap = gaplist[cseq-1];
if ( myrank == 0 ) printf("seq %d: phase 3:
gap=%.3f\n", cseq, gap);
recv( i, j, msg_size, NIC0, &r_req[0] );
recv( -i, -j, msg_size, NIC1, &r_req[1] );
recv( j, i, msg_size, NIC2, &r_req[2] );
recv( -j, -i, msg_size, NIC3, &r_req[3] );
send( -i, -j, msg_size, NIC0, &s_req[0], gap );
send( i, j, msg_size, NIC1, &s_req[1], gap );
send( -j, -i, msg_size, NIC2, &s_req[2], gap );
send( j, i, msg_size, NIC3, &s_req[3], gap );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
// Phase 4
++cseq;
gap = gaplist[cseq-1];
if ( myrank == 0 ) printf("seq %d: phase 4:
gap=%.3f\n", cseq, gap);
recv( -j, i, msg_size, NIC0, &r_req[0] );
recv( j, -i, msg_size, NIC1, &r_req[1] );
recv( -i, j, msg_size, NIC2, &r_req[2] );
recv( i, -j, msg_size, NIC3, &r_req[3] );
send( j, -i, msg_size, NIC0, &s_req[0], gap );
send( -j, i, msg_size, NIC1, &s_req[1], gap );
send( i, -j, msg_size, NIC2, &s_req[2], gap );
send( -i, j, msg_size, NIC3, &s_req[3], gap );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
if ( n % 2 == 0 ) {
for ( k=1; k<= n/2-1; k++ ) {
// Phase 5
++cseq;
gap = gaplist[cseq-1];
if ( myrank == 0 ) printf("seq %d: phase 5:
gap=%.3f\n", cseq, gap);
recv( -n/2, -k, msg_size, NIC0, &r_req[0] );
recv( n/2, k, msg_size, NIC1, &r_req[1] );
recv( -k, -n/2, msg_size, NIC2, &r_req[2] );
recv( k, n/2, msg_size, NIC3, &r_req[3] );
send( n/2, k, msg_size, NIC0, &s_req[0], gap );
send( -n/2, -k, msg_size, NIC1, &s_req[1], gap );
send( k, n/2, msg_size, NIC2, &s_req[2], gap );
send( -k, -n/2, msg_size, NIC3, &s_req[3], gap );
waitall( 4, s_req, s_stat );
waitall( 4, r_req, r_stat );
}
// Phase 6
++cseq;
gap = gaplist[cseq-1];
if ( myrank == 0 ) printf("seq %d: phase 6:
gap=%.3f\n", cseq, gapbias);
recv( n/2, n/2, msg_size, NIC0, &r_req[0] );
recv( -n/2, 0, msg_size, NIC1, &r_req[1] );
recv( 0, -n/2, msg_size, NIC2, &r_req[2] );
send( -n/2, -n/2, msg_size, NIC0, &s_req[0], gap );
send( n/2, 0, msg_size, NIC1, &s_req[1], gap );
send( 0, n/2, msg_size, NIC2, &s_req[2], gap );
waitall( 3, s_req, s_stat );
waitall( 3, r_req, r_stat );
}
}
}
}
※send, recvは非同期通信
All Rights Reserved, Copyright © Institute of Systems, Information Technologies and Nanotechnologies 2011
64

Download Report