計算科学技術特論B - 理化学研究所計算科学研究機構

計算科学技術特論Ｂ
2016年5月12日 国立研究開発法人理化学研究所 計算科学研究機構�運用技術部門 ソフトウェア技術チーム�チームヘッド 南�一生 [email protected]
RIKENADVANCEDINSTITUTEFORCOMPUTATIONALSCIENCE
2016年5月12日計算科学技術特論Ｂ 2016年5月12日計算科学技術特論Ｂ 2016年5月12日計算科学技術特論Ｂ 2016年5月12日計算科学技術特論Ｂ 5
2016年5月12日計算科学技術特論Ｂ A Processor Performance
difficult
High
easy
Low
RSDFT
Lattice
QCD
Seism3D
NICAM
PHASE
FFB
Low
2016年5月12日計算科学技術特論Ｂ 2016年5月12日計算科学技術特論Ｂ ParallelPerformance
Performance
Parallel
Parallelization
7
difficult
High
●
●
漏れ電流が問題
FET
LSI
FET
Si
n-Si
LSI
2020
2030
LSI
FET
漏れ電流を押さえる
2016年5月12日計算科学技術特論Ｂ 9
FET
Kohn-Sham方程式
電子密度 n(r )=|
∑ ϕi (r )
2
波動関数
i
φi: 電子軌道（=波動関数）
ハミルトニアン
i：電子準位（=エネルギーバンド）
r：空間離散点（=空間格子）
Hϕ i (r) = ε iϕ i (r)
2016年5月12日計算科学技術特論Ｂ 10
実空間法
Hϕ i (r) = ε iϕ i (r)
Kohn-Sham
u
3
u
ML2
ML1,ML2,ML3
ML(=ML1×ML2×ML3
3
2016年5月12日計算科学技術特論Ｂ 11
Self-ConsistentFieldprocedure
1 (CG)共役勾配法
2 (GS)Gram-Schmidt規格直交化
3 密度とポテンシャルの更新
4 (SD)部分対角化
SCF
2016年5月12日計算科学技術特論Ｂ 12
r
u
×
×
SCF
4
or
5
×
SCF
v
Gram-Schmidt
w
n
SCF
n
r
o
scf_mix
um
※
→
SCF
r
F
→
2016年5月12日計算科学技術特論Ｂ u
v
END
13
J.-I.Iwataetal.,J.Comp.Phys.(2010)
Realspace
CPUspace
Blue:Siatom
Yellow:electrondensity
2016年5月12日計算科学技術特論Ｂ 14
2016年5月12日計算科学技術特論Ｂ RSDFT
• 実空間差分法
• 空間並列
計算コアの最適化
• 行列積化
ターゲット計算機：PACS-CS,T2K-Tsukuba
スレッド並列の実装
ターゲット計算機：PACS-CS,T2K-Tsukuba
2016年5月12日計算科学技術特論Ｂ 16
M EOT UP
ψ 1"
4
ψ1
ψ 2"
4
ψ2
ψ 1" ψ 2 ψ 1"
ψ 3"
4
ψ3
ψ 1" ψ 3 ψ 1"
ψ 2" ψ 3 ψ 2"
ψ 4"
4
ψ4
ψ 1" ψ 4 ψ 1"
ψ 2" ψ 4 ψ 2"
ψ 3" ψ 4 ψ 3"
ψ 5"
4
ψ5
ψ 1" ψ 5 ψ 1"
ψ 2" ψ 5 ψ 2"
ψ 3" ψ 5 ψ 3"
ψ 4" ψ 5 ψ 4"
ψ 6"
4
ψ6
ψ 1" ψ 6 ψ 1"
ψ 2" ψ 6 ψ 2"
ψ 3" ψ 6 ψ 3"
ψ 4" ψ 6 ψ 4"
ψ 5" ψ 6 ψ 5"
ψ 7"
4
ψ7
ψ 1" ψ 7 ψ 1"
ψ 2" ψ 7 ψ 2"
ψ 3" ψ 7 ψ 3"
ψ 4" ψ 7 ψ 4"
ψ 5" ψ 7 ψ 5"
ψ 6" ψ 7 ψ 6"
ψ 8"
4
ψ8
ψ 1" ψ 8 ψ 1"
ψ 2" ψ 8 ψ 2"
ψ 3" ψ 8 ψ 3"
ψ 4" ψ 8 ψ 4"
ψ 5" ψ 8 ψ 5"
ψ 6" ψ 8 ψ 6"
ψ 7" ψ 8 ψ 7"
ψ 9"
4
ψ9
ψ 1" ψ 9 ψ 1"
ψ 2" ψ 9 ψ 2"
ψ 3" ψ 9 ψ 3"
ψ 4" ψ 9 ψ 4"
ψ 5" ψ 9 ψ 5"
ψ 6" ψ 9 ψ 6"
ψ 7" ψ 9 ψ 7"
オリジナルは行列ベクトル積
ψ 8" ψ 9 ψ 8"
2016年5月12日計算科学技術特論Ｂ M EOT UP
ベクトル積を行列積に変換
(DGEMV)
再帰分割法
(DGEMM)
•
•
依存関係のある三角部とない四角部にブロック化して計算
再帰的にブロック化することで四角部を多く確保
2016年5月12日計算科学技術特論Ｂ ※ＳＤも同様に行列積化が可能
18
2016年5月12日計算科学技術特論Ｂ 3
8F7
B
e
gB A+
tm
63
l
MWW P_O
p
m
x
6
UYU Ud
yu
O(N2)
ψ m H KS ψ n
.
lk
ψn ψn
M EOT UP
B
b 6+
gB A
lk
8 9
8 9 H
8 9
8 9
8 9 H
8 9
l
H m ,n = ψ m H KS ψ n
O(N3)
s
B
b 6+
gB A,
lk
m
8=5
z
v
M9
n −1
'
n
ψ = ψ n − ∑ψ m ψ m ψ n
m =1
P c
P
&
$
$
$
%
H N ×N
N
DZ H
#& #
& #
!$ ! !
$! !
c
=
ε
!$ n !
$ cn !
!$ !
$ !
"% "
% "
ψ n' ( r ) = ∑ cn ,mψ m ( r )
m =1
lk
O(N3)
B 6, gB
A,
O(N3)
B
b 6+
gB A,
O(N3)
i
C Pc
8 9
P
%8 9 H
EOMWM MOV
lk
lk
8 9
8 9 H
8 9
D=77
1%)))
1b1b1
*
E7:
l
l
*+)b*+)b*+)
×
p E7:
*))
)-
×
u
6OM
22 /
B A,
8=5
,) .
B A,
8 9
8F7
+0 -
B A+
,1 /
B A,
8 9
E7:
8B E7:
*/%)))
=
YP(= O
*))
M EOT UP
UbUYS%
3 D P_O % = YP(= O
CE=
C8E 9H8 6OM
3
6OM
D P_O
=
=
YP(= O
YP(= O
MWW P_O
( CE=
(67E9F
5WW P_O
,*
E7:
z
m
9A88B
2016年5月12日計算科学技術特論Ｂ F+
E
2)
6 5E
8 9
%8 9 H
F _V_NM
C =
M UOT+
PU_
3*+1% +./% .*+% *)+-% +)-1
Si4096(格子：96x96x96，バンド：8192）
実行時間，T2K-Tsukuba
Si4,096(格子:96x96x96，バンド：8192)
速度向上率，T2K-Tsukuba
18.0
1000.0
SCF
DIAG
GS
DTCG
900.0
Ideal
SCF
DIAG
GS
DTCG
16.0
14.0
12.0
700.0
速度向上率
実行時間（秒）
800.0
600.0
500.0
400.0
10.0
8.0
6.0
300.0
200.0
4.0
100.0
2.0
0.0
0
500
*+1
.*+
-b-b1 1b1b1
21
m
1,000
1,500
並列数
*)+1b1b*/
2016年5月12日計算科学技術特論Ｂ 2,000
2,500
0.0
0
+)-1
1b*/b*/
500
*+1
.*+
-b-b1 1b1b1
1,000
1,500
並列数
*)+1b1b*/
2,000
+)-1
1b*/b*/
2,500
Si4 0 9 6(格子:9 6x9 6x9 6，バンド:8 1 9 2)
演算時間と通信時間，T 2 K -Tsukuba
1000.0
通信(S :大域)
通信(S :隣接)
演算
900.0
SCF
800.0
実行時間（秒）
700.0
DIAG
600.0
500.0
400.0
GS
300.0
200.0
100.0
DTCG
128
256
512
1,024
2,048
128
256
512
1,024
2,048
128
256
512
1,024
2,048
128
256
512
1,024
2,048
0.0
SCF
DIAG
並列数
GS
DTCG
fC8E 9H8
m
2016年5月12日計算科学技術特論Ｂ 1
2
3
4
5
6
2016年5月12日計算科学技術特論Ｂ 24
並列数
128
256
512
1,024
2,048
128
256
512
1,024
2,048
128
256
512
1,024
2,048
128
256
512
1,024
2,048
演算
通信(S :隣接) 通信(S :大域)
749.472 77.435 60.057
363.281 52.715 59.732
172.218 38.797 64.133
87.206 33.345 73.494
59.728 28.637 108.404
322.212 10.005 14.931
181.852
6.831
9.690
86.939
4.492 16.882
46.392
4.061 17.659
31.597
4.229 34.866
148.354
0.000
9.629
73.964
0.000 10.070
37.479
0.000
9.969
16.900
0.000
8.008
11.451
0.000 10.687
278.906 67.429 35.497
107.465 45.884 39.972
47.800 34.305 37.283
23.914 29.284 47.827
16.679 24.408 62.851
3
8F7
B
e
gB A+
tm
p
6
UYU Ud
l
MWW P_O
通信時間増大
演算時間と逆転
.
m
行列ベクトル積
x
yu
O(N2)
ψ m H KS ψ n
63
性能は悪い
並列度の不足
lk
ψn ψn
M EOT UP
B
b 6+
gB A
H m ,n = ψ m H KS ψ n
O(N3)
lk
通信時間減少せず
l
8 9
s
m
演算時間と同程度
8 9 H
8 9
行列積化で良好
並列度の不足
8=5
z
v
n −1
M9
ψ n' = ψ n − ∑ψ m ψ m ψ n
m =1
P c
P
&
$
$
$
%
H N ×N
#& #
& #
!$ ! !
$! !
!$ c n ! = ε $ c n !
!$ !
$ !
"% "
% "
N
DZ H
ψ n' ( r ) = ∑ cn ,mψ m ( r )
m =1
RSDFT
• 実空間差分法
• ベクトルの内積計算
が基本
• 空間並列
B
b 6+
gB A,
O(N3)
B 6, gB
A,
O(N3)
B
b 6+
gB A,
O(N3)
lk
通信時間増大
lk
8 9
行列積化で良好
演算時間と同程度
並列度の不足
i
C Pc
8 9
P
%8 9 H
Scalapackの性能
EOMWM MOV
Scalapackのスケー
ラビリティが悪い
lk
8 9 H
8 9
が悪い
8 9
lk
8 9 H
8 9
行列積化で良好
計算コアの最適化
• 行列積化
ターゲット計算機：PACS-CS,T2K-Tsukuba
スレッド並列の実装
ターゲット計算機：PACS-CS,T2K-Tsukuba
超並列向けの実装
• バンド並列の拡張
EIGENライブラリ※の適用
※高速固有値ライブラリ
•
Imamuraelal.SNA+MC2010(2010)
ターゲット計算機：Kcomputer
2016年5月12日計算科学技術特論Ｂ 26
Hϕ i (r) = ε iϕ i (r) φ : 電子軌道（=波動関数）
i
i：電子準位（=エネルギーバンド）
r：空間離散点（=空間格子）
iはエネルギーバンド量子数
ML2
iについての依存関係はない
空間(S)に加えエネルギーバンド(B)
の並列を実装
万を超える並列度を確保
3
2016年5月12日計算科学技術特論Ｂ 27
•
• 10
•
•
•
×
•
•
2016年5月12日計算科学技術特論Ｂ 28
- Gram-Schmidt
0
1
2
0
?
1
?
2
4
3
→
5
3
7
6
-
8
3
9
6
10
(1) 三角部の計算
(2) 計算した値を四角部に転送（バンド方向の各プロセッサに分配）
(3) 四角部を並列に計算
HPCS2012
2016年5月12日計算科学技術特論Ｂ 29
- 通信の見積りと効果の予測 E 6
h
h
t
✓5
D98G79
➢ M EOT UP 3
➢8F7 3
✓D98G79 3
➢8=5
M9
✓675EF3
➢8=5 DZ H
➢ M EOT UP 3
5F 9DH
✓5
➢8=5
➢ M EOT UP
✓
✓
✓
2016年5月12日計算科学技術特論Ｂ 367E9F
CE=
ux
8=5
M9
x
- 通信の見積りと効果の予測 M EOT UP
3
ULMWWSM T
UL MW1
6(
ULMWW P_O
UL MW1
A6
A6
ULMWW P_O
UL MW1
A6
ULMWW P_O
UL MW1
*
ULNOM
UL MW1
MWWSM T
=
*
A6
* *
*
6(A6
=Y WZS A6
A6
6(A6
=Y WZS A6
A6 *
) A6
6(A6
(
6(A6
(
(
6(A6 (
(A6 *
6(A6 (
(
6
6
UL MW1
6
6
UOLNOM
UL MW1
6E=K9 A6E=K9
ULU
UL MW1
W MLY
YP U MYV
6
/
t
6( 6
(
UL MW1
W MLY
YP U MYV
6
/
t
6( 6
(
YP(U O
P c
*
6( 6
ULU
YP
6( 6E=K9
6( 6
YP
(
P
6
/
6( 6
(
UL MW1
P
6
/
6( 6
(
6( 6
6
6(A6E=K9 (
UL MW1
ULaMU MWW
v
(
l
ULU O
3
6( 6
*
ULaMU MWW
63
A6
6E=K93 6b 6
6(A6
6(A6
UL MW1
ULU O
67E9F
*
6(A6 (
(A6 *
6(
EOMWM MOV
P
CE=
A6
* *
UL MW1
UL P_O
8=5
*
A6 *3
3 UY 6E=K9%A6E=K9
v
P3
(
6E=K93 6b 6
W MLY YP3
- 通信の見積りと効果の予測 -
8F7
l
l
OZYPLOS
67E9F
ULMWW P_O
UL MW1
6LP
6( 6LP(
ULMWW P_O
UL MW1
6LP
6( 6LP(
ULMWW P_O
UL MW1
6LP
6( 6LP
OS(
ULMWW P_O
UL MW1
6LP /
6( 6LP
OS(
ULMWW P_O
UL MW1
6LP
6( 6LP
OS(
ULMWW P_O
UL MW1
6
ULMWW P_O
UL MW1
6LP
6( 6LP
OS(
ULU
UL MW1
P
6LP
/
6( 6LP
OS(
UL MW1
P
6LP
/
6( 6LP
OS(
YP
ULU O
+
ULaMU MWW
CE=
ULU
6( 6LP
YP
UL MW1
W MLY
6LP
YP U MYV
/
UL MW1
W MLY
6LP
YP U MYV
/
l
63
l
ULU O
l
l
ULaMU MWW
l
67E9F
ULU
ULU O
l
ULaMU MWW
6LP
2016年5月12日計算科学技術特論Ｂ OS * (
OS(
t
6( 6LP
t
6( 6LP
6( 6LP
YP
l
OS * (
OS * (
UL MW1
P
6LP
/
6( 6LP
OS * (
UL MW1
P
6LP
/
6( 6LP
OS * (
6( 6LP
P3
,
W MLY
YP3
OS * (
- 効果の確認 -
I MV EOMWUYS
(
v
*+b*+b*+
1
u
tw
ll
F+
+%-))
*
.*+
-1b-1b-1
*2%+))
.*+ -b-b-b1
+
*%)))
/)b/)b/)
*2%+))
*%))) .b.b.b1
,
*%0+1
0+b0+b0+
*2%+))
*%0+1 /b/b/b1
-
-%)2/
2/b2/b2/
*2%+))
-%)2/ 1b1b1b1
.
1%)))
*+)b*+)b*+)
*2%+))
1%))) *)b*)b*)b1
F _V_NM
33
2016年5月12日計算科学技術特論Ｂ - 効果の確認 -
I MV EOMWUYS
S i4 0 9 6(格子:9 6x9 6x9 6,バンド:8 1 9 2)
GS, Weak Scaling, T2K-Tsukuba
Si4096(格子:96x96x96,バンド:8192)
DTCG, Weak Scaling, T2K-Tsukuba
45.0
実行時間（秒）
50.0
40.0
30.0
20.0
DTCG
演算
通信(S:隣接)
通信(S:大域)
通信(B:大域)
40.0
35.0
実行時間（秒）
GS
演算
通信(S :大域)
通信(B :大域)
60.0
30.0
25.0
20.0
15.0
10.0
10.0
5.0
0.0
0
1,000
2,000
3,000
並列数
4,000
0.0
5,000
0
20.0
15.0
10.0
5.0
4,000
5,000
RotV
演算
通信(S:大域)
30.0
実行時間（秒）
実行時間（秒）
25.0
3,000
並列数
35.0
MatE
演算
通信(S:隣接)
通信(S:大域)
30.0
2,000
Si4096(格子:96x96x96,バンド:8192)
RoTV/DIAG, Weak Scaling, T2K-Tsukuba
Si4096(格子:96x96x96,バンド:8192)
MatE/DIAG, Weak Scaling, T2K-Tsukuba
35.0
1,000
25.0
20.0
15.0
10.0
5.0
0.0
0.0
0
1,000
2,000
3,000
並列数
4,000
5,000
0
1,000
2,000
3,000
並列数
4,000
5,000
-Tofu
空間並列
-
空間並列＋バンド並列
バンド(1:On)
Tofuネットワークへのマッピング
各バンドグループをサブメッシュ/
トーラス・ネットワークにマッピング
バンド(1:Omax)
バンド(On+1:Om)
orbital
z
Z
space
Y
y
バンド(Om+1:Omax)
x
CPUspace
X
•
•
2016年5月12日計算科学技術特論Ｂマッピング・ルール
通信の最適化
サブメッシュ/トーラス内で通信が閉じられる
HPCS2012
35
-Gram-Schmidt
lgl
llllllllll C=
120.0
実行時間(秒）
90.0
MPI_Bcast
FZR_
s
t
u
•
原子数: 19,848
•
格子数: 320x320x120
•
軌道数: 41,472
•
トータルプロセス数: 12,288
mll
✓ 空間並列: 2,048(32x32x2)
✓ バンド並列: 6
60.0
MPI_Allreduce
•
32x32x12のトーラスにマッピング
30.0
0.0
マッピングなし
2016年5月12日計算科学技術特論Ｂ最適マッピング
36
-二軸並列の
-
SiNW,19,848原子,格子数:320x320x120,バンド数:41,472
トータル並列プロセス数は12,288で固定
120.0
Space
+Orbital
Space
Space
+Orbital
Space
Space
+Orbital
Space
Space
+Orbital
Space
TimeperSCF(sec.)
in grid points are kept within a part and -79%
do not affect the-78%Data-1 onWait/orbital
12288, 24576, 26864 and 73728 cores (cases 1, 2, 3
Globalcommunication/orbital
communications
within the other part. The number of parts is
and 4, respectively).
The number of parallel tasks is the product of
100.0
Globalcommunication/space
taken up to the number of parallelisms in orbitals. We used this
the number
of parallel tasks in space and that in orbitals. The
Adjacentcommunication/space
mapping in the experiments.
number of parallel tasks in space was fixed to 1,536, and the
Computation
80.0
number of parallel tasks in orbitals was varied from 1, 2, 3 and 6.
Therefore, the number of parallel tasks amounted to 1536, 3072,
5. PERFORMANCE ANALYSIS
-78%
4608,
60.0
in
grid
points
are
kept
within
a
part
and
do
not
affect or
the9216. Data-1 on 12288, 24576, 26864 and 73728 cores (cases 1, 2, 3
The scalability and efficiency of the modified code which is
communications
within the
part. above
The number of parts is
and 4, respectively). The number of parallel tasks is the product of
parallelized in terms of grid
points and orbitals
as other
described
Figure 6 plots the
computation time and communication time for
taken up to the number of parallelisms in orbitals. We used this
the number of parallel tasks in space and that in orbitals. The
were evaluated
with mapping
a simulation
model
named
“Data-1,”
40.0
GS,
CG,
MatE
in SD of
(MatE/SD)
andinRotV
SDfixed
(RotV/SD)
in the experiments.
number
parallel tasks
space in
was
to 1,536, . and the
-79%
depicting a SiNW with 19,848 atoms. The number of grid points
The horizontal
axis
in
each
figure
stands
for
the
number
of cores.
number of parallel tasks in orbitals was varied from
1, 2, 3 and 6.
and orbitals were 320x320x120
and 40,992, respectively.
We
The theoreticalTherefore,
computation
time of
forparallel
cases tasks
2, 3amounted
and 4 isto also
the number
1536, 3072,
5.
PERFORMANCE
ANALYSIS
20.0
used an MPI library tuned
the Tofuand
network
to obtain
4608,
or 9216. time of case1 is divided by 2, 3 and
the
computation
The to
scalability
efficiency
of the higher
modified codeploted
whichwhere
is
performance. We also made
an appropriate
taskpoints
mapping
to theas described
6, respectively.
They show
the and
parallelization
fortime for
parallelized
in terms of grid
and orbitals
above
Figure
6 plotsthe
thescalability
computationintime
communication
空間分割:12,288
Tofu network. 0.0
were evaluated with a simulation model named the
“Data-1,”
orbitals. The
timeand
becames
to
GS,measured
CG, MatE computation
in SD (MatE/SD)
RotV incloser
SD (RotV/SD)
.
depicting a SiNW with 19,848 atoms. The number of grid
points
The horizontaltime
axis in
figure stands
for the number
the theoretical
computation
as each
the number
of parallel
tasks inof cores.
5.1 Performanceand
of orbitals
DGEMM
were 320x320x120 and 40,992, respectively.
The theoretical
time and
for cases
2, 3 and
orbitalsWeincreased,
and GS, computation
CG, 空間分割:2,048
MatE/SD
RotV/SD
were4 is also
an MPI library
tuned
to the
network to obtain higher
The principal kernel of used
the modified
RSDFT
code
is aTofu
DGEMM
ploted
where
the
computation
time
of
case1
is
divided
by 2, 3 and
found to be well-parallelized in orbitals.
also made an
appropriate
6, respectively. They show the scalability in the parallelization for
routine of the BLAS. Weperformance.
optimized We
the DGEMM
routine
so as task
to mapping to the
Tofu such
network.
The communication
time with
to computation
the parallel time
tasksbecames
in spacecloser to
the orbitals.
The respect
measured
バンド分割:6
utilize the special features
as SIMD instructions and the
the
theoretical
computation
time
as
the
number
of
parallel tasks in
decreased
in
proportion
to
the
number
of
parallel
tasks
in
orbitals,
sector cache and hardware
synchronization
capability
of
5.1
Performance
of DGEMM
GSbarrier
MatE/SD
RotV/SD
CG
orbitals of
increased,
and
GS,functions
CG, MatE/SD
and RotV/SD
because
the
number
calls
of
MPI
for
global
and were
the SPARC64TM VIIIfx The
effectively.
The
sustained
performance
of
principal kernel 行列生成
of the modified RSDFT 回転
code is a DGEMM
found to be well-parallelized
ininorbitals.
adjacent
communications
decreased
proportion
to
the
the BLAS.
We optimized
the DGEMM
DGEMM on a compute routine
node isof123.7
giga-flops,
or 96.6%
of the routine so as to
The
communication
timetasks
with respect
to theOn
parallel
tasks in space
reciprocal
of
the
number
of
parallel
in
orbitals.
the
other
utilize
the
special
features
such
as
SIMD
instructions
and
the
peak performance. In particular, we found that the computation HPCS2012
decreased
in proportiontime
to thefor
number
paralleltasks
tasks in
sector cache
and hardware
capability
of global
hand, the
communication
the of
parallel
inorbitals,
time
as a result of keeping
the
block
data onbarrier
the synchronization
L1 cache
TM
2016年5月12日計算科学技術
特論Ｂ because the number of calls of MPI functions for global and
the SPARC64
VIIIfx effectively. The sustained performance
of
orbitals
was
supposed
to
increase
as
the
number
of
parallel
tasks
manually decreased by 12% compared with the computation time
adjacent communications decreased in proportion to the
DGEMM on a compute node is 123.7 giga-flops, or 96.6%
of the increased.
in orbitals
number
MPI processes
requiring
for the usual data replacement
operationsInofparticular,
the L1 cache.
reciprocalThe
of the
number of
of parallel
tasks in orbitals.
On the other
peak performance.
we foundThis
that the computation
communicationshand,
for the
the global
parallel
tasks in orbitals,
however,
wastasks in
communication
time
for
the
parallel
DGEMM tuned for the
K
computer
was
also
used
for
the
time as a result of keeping the block data on the L1 cache
actually
to a was
relatively
small
numberas of
nodes,
orbitals
supposed
to increase
thecompute
number of
parallel tasks
LINPACK benchmark program.
manually decreased by 12% compared with the computation
timerestricted
in orbitals
increased.
numbercommunications
of MPI processesofrequiring
and therefore,
the
wall clock
time The
for global
for the usual data replacement operations of the L1 cache.
This
communications
for small.
the parallel
in we
orbitals,
however, was
5.2 Scalability DGEMM tuned for the K computer was also usedthefor
parallel
in orbitals was
This tasks
means
succeeded
the tasks
actually
to a relatively small
of compute nodes,
LINPACK
benchmark
We measured the computation
time
for theprogram.
SCF iterations with
in decreasing time
for restricted
global communication
by number
the combination
大域通信時間を大幅に削減
5.2 Scalability
(a)
We measured the computation time for the SCF iterations with
400.0
160.0
theoretical computation
(a)
160.0
global/orbital
theoretical computation
120.0
wait/orbital
computation
global/space
Time per CG (sec.)
300.0
100.0
wait/orbital
200.0
80.0
40.0
80.0
40.0
100.0
0.0
0.0
(c)
0.040,000
Number0of cores
60,000
20,000
40,000
Number of cores
(c)
200.0
Time per MatE/SD (sec.)
100.0
60,000
theoretical computation
150.0
(d)
40000
80000
60000
300.0
theoretical computation
computation
computation
computation
adjacent/space
adjacent/space
adjacent/space
adjacent/space
global/space
global/space
global/space
global/space
global/orbital
global/orbital
global/orbital
global/orbital
50.0
200.0
100.0
80000
Number of cores
theoretical computation
100.0
theoretical computation
200.0
100.0
0.0
0
20,000
60000
20000
Number of
cores
(d)
0.0
0
40000
0
computation
50.0
0.0
20000
0.0
80,000
300.0
200.0
150.0
0
80,000
Time per RotV/SD(sec.)
20,000
Time per RotV/SD(sec.)
0
Time per MatE/SD (sec.)
adjacent/space
global/space
global/orbital
120.0
global/orbital
Time per CG (sec.)
Time per GS (sec.)
300.0
200.0
theoretical computation
computation
adjacent/space
theoretical computation
global/space
computation
global/orbital
(b)
computation
global/space
400.0
Time per GS (sec.)
and therefore, the wall clock time for global communications of
the parallel tasks in orbitals was small. This means we succeeded
in decreasing time for global communication by the combination
(b)
20,000
40,000
Number of cores
40,000
Number of cores
60,000
80,000
60,000
0
80,000
0.0
0
20,000
20,000
40,000
Number of cores
40,000
Number of cores
60,000
60,000
80,000
Figure 6. Computation and communication time of (a) GS, (b) CG, (c) MatE/SD and (d)
RotV/SD for different numbers of cores.
Figure 6.特論Ｂ Computation and communication time of (a) GS, (b) CG, (c) MatE/SD and (d)
2016年5月12日計算科学技術
RotV/SD for different numbers of cores.
80,000
Hasegawa et al.
13
Table 2. Distribution of computational costs for an iteration of the SCF calculation of the modified code.
Procedure block
SCF
SD
MatE/SD
EigenSolve/SD
RotV/SD
CG
GS
Execution
time (s)
Computation
time (s)
2903.10
1796.97
525.33
492.56
779.08
159.97
946.16
1993.89
1281.44
363.18
240.66
677.60
43.28
669.17
Communication time (s)
Adjacent/grids
Global/grids
Global/orbitals
61.73
823.02
12.57
13.90
497.36
4.27
13.90
143.98
4.27
–
251.90
–
–
101.48
–
47.83
68.85
0.01
http://hpc.sagepub.com/
–
256.81
8.29
Performance
Wait/orbitals (PFLOPS/%)
11.89
–
–
–
–
–
11.89
5.48/51.67
5.32/50.17
6.15/57.93
0.01/1.03
8.14/76.70
0.06/0.60
6.70/63.10
International Journal of High Performance
Computing Applications
The test model was a SiNW with 107,292 atoms. The numbers of grids and orbitals were 576 ! 576 ! 180, and 230,400, respectively. The numbers of
parallel
tasks in grids and orbitals were 27,648 and three, respectively, using 82,944 compute nodes. Each parallel task had 2160 grids and 76,800 orbitals.
Article
The International Journal of High
Performance Computing Applications
1–21
The Author(s) 2013
Reprints and permissions:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/1094342013508163
hpc.sagepub.com
Performance Performance
evaluation of
ultra-largeevaluation
of ultra-largescale
first-principles
electronic
structure
on the
ª
confinement
becomes
prominent.
Thecalculation
quantum code
effects,
K computer
scale first-principles electronic structure
depend
on theAtsushi
crystallographic
of theTaisuke
nano- Boku,
Yukihiro
Jun-Ichi Iwata, Miwako Tsuji, which
Daisuke
Takahashi,
Oshiyama, directions
Kazuo Minami,
calculation code
on Hasegawa,
the K computer
Hikaru Inoue, Yoshito Kitazawa,
Ikuo
Mitsuo Yokokawa
wire axes
andMiyoshi
on the and
cross-sectional
shapes of the nanowires,
Yukihiro Hasegawa et al., International Journal of High Performance Computing Applications published online 17 October 2013
result in substantial modifications to the energy-band
DOI: 10.1177/1094342013508163
Yukihiro Hasegawa1, Jun-Ichi Iwata2, Miwako Tsuji1,
structures and the transport characteristics of SiNW FETs.
3
2
1
Daisuke Takahashi , Atsushi Oshiyama , Kazuo Minami ,
The online version of this article can be found at:
However, knowledge of the effect of the structural mor,
Taisuke Boku3, Hikaru Inoue4, Yoshito Kitazawa5http://hpc.sagepub.com/content/early/2013/10/16/1094342013508163
Ikuo Miyoshi6 and Mitsuo Yokokawa7,1
phology on the energy bands of SiNWs is lacking. In addition, actual nanowires have side-wall roughness. The
2016年5月12日計算科学技術特論Ｂ Published
by: imperfections on the energy bands are
effects
of such
Abstract
Silicon nanowires are potentially useful in next-generation field-effect transistors, and it is important to clarify the electron
unknown.
The advent of reliable first-principles calculastates of silicon nanowires to know the behavior of new devices. Computer simulations are promising tools for
calculating
http://www.sagepublications.com
electron states. Real-space density functional theory (RSDFT) code performs first-principles electronic structure calculations would provide a firm theoretical framework for the
tions. To obtain higher performance, we applied various optimization techniques to the code: multi-level parallelization,
load balance management, sub-mesh/torus allocation, and a message-passing interface library tuned for the design
K computer. of suitable SiNWs of 10,000–100,000 atoms for
We measured and evaluated the performance of the modified RSDFT code on the K computer. A 5.48 petaflops (PFLOPS)
Additional
services
and
information
for
International
Journal
of High
Performance
Computing
be imporfound at:
Our
RSDFT
code enables
usApplications
to achievecan
this
sustained performance was measured for an iteration of a self-consistent field calculation for a 107,292-atomFETs.
Si nanowire
simulation using 82,944 compute nodes, which is 51.67% of the K computer’s peak performance of 10.62 PFLOPS. This
tant
task using the K computer.
scale of simulation enables analysis of the behavior of a silicon nanowire with a diameter of Email
10–20 nm.
Alerts:
http://hpc.sagepub.com/cgi/alerts
We have performed extensive electronic-structure calKeywords
Subscriptions: http://hpc.sagepub.com/subscriptions
culations for [110]- and [100]-SiNWs with various crossK computer, Tofu network, next-generation supercomputer, real-space density functional theory, RSDFT, self-consistent
electron states, silicon nanowire, petaflops, PFLOPS, collective communication
Reprints: http://www.sagepub.com/journalsReprints.nav
sectional shapes using RSDFT with local density approxiFigure 13. Plot of volume versus total energy for crystalline
mation
(LDA) (Perdew and Zunger, 1981). Figure 14
supercomputer (http://www.open-supercomputer.org/;
NaPermissions: http://www.sagepub.com/journalsPermissions.nav
1. Introduction
silicon
of 10,648 (squares) and 21,952 (triangles)
kashima, 2008),atoms.
by incorporating matrix–matrix
multipli-examples of cross-sectional views of the obtained
shows
Computer simulations are essential when clarifying and precations for Gram-Schmidt orthogonalization with a high
dicting the properties of materials that have prospects for
geometry-optimized
[110]-SiNWs. The cross-sectional
cache-hit ratio and thread parallelization with
OpenMP
practical applications. In particular, first-principles electronic
directives for multicore processors (Iwata et al., 2010; Tsuji
3
3
structure-times
calculationsand
based on
functional
theory
(11
14density
-times
supercells
correspond
to
the
dimension
is
approximately
3–6 nm. The sidewalls of the
>> OnlineFirst Version of Record - Oct 17, 2013
(DFT) have been performed on a variety of materials using
10,648and
21,952-atom
models,
respectively).
The
comwires
are
terminated
by
hydrogen
atoms, reflecting certain
diverse software implementations on parallel computers.
RIKEN Advanced Institute for Computational Science, Kobe, Japan
What is This?
Real-space density functional theory (RSDFT) code
Department
of Applied
Physics,
School of Engineering, The
University of
experimental
situations
on
one
hand
and focusing on essenputed
equilibrium
lattice
constants
for
the
two
supercell
(Iwata et al., 2010) is a simulation technique used to per- Tokyo, Tokyo, Japan
form first-principles electronic structure calculations. ‘Real
Center for Computational Sciences, University of Tsukuba, Tsukuba,
tial
features
of
the
energy
bands
on
the
other. Figures 14(b)
models
correspond
to
the
lattice
constant,
5.39
Å,
of
the
space’ means that three-dimensional physical coordinates Japan
Computational Science and Engineering Solution Division, Technical
are discretized, and
the wave
functions,
electronagrees
density, with
and
(d)
show
ellipses
that
extend
along
different crystalloprimitive
unit
cell,
which
the
experimental
Computing Solution Unit, Fujitsu Ltd, Chiba, Japan
and potential field are calculated at the resulting discrete
IT Solution Unit, CAE Simulation Department, Fujitsu Systems East Ltd,
result
(5.43
withinof 1%.
graphic directions. We call the ellipses that have their long
lattice points
or grids.Å)
One advantage
this method is that Nagano, Japan
it is suitable for parallel computations. In fact, the HamiltoPA Project, Next Generation Technical Computing Unit, Fujitsu Ltd,
Prediction
offormulation
electronic
axes along the [001] and the [1-10] directions the [001]nian matrix
of the real-space
is sparse, andstructures,
fast Kawasaki, Japan or correlation
Graduate School of System Informatics, Kobe University, Kobe, Japan
Fourier transform (FFT), which usually requires global combetween
the
electronic
structure
and
a
specific
atomic
ellipses and the [1-10]-ellipses, respectively. Similarly, the
munication traversing all compute nodes of a parallel com- Received June 13, 2013; Accepted for publication September 9, 2013.
puter, is unnecessary
Hamiltonian matrix
operations. application of DFT. To
structure,
isforanother
important
dumbbells shown in Figure 14(c) and (e) are called the
Corresponding author:
RSDFT code has been parallelized to run on parallel Yukihiro Hasegawa, RIKEN Advanced Institute for Computational Science,
investigate
possibility,
we applied
the
present
RSDFT
computers, like thethis
PACS-CS
(http://www.rccp.tsukuba.
[001]-dumbbells
and [1-10]-dumbbells, respectively. Fig7-1-6, Minatojima-minami-machi, Chuo-ku, Kobe, Hyogo 650-0047,
Japan.
ac.jp/PACS-CS/; Boku et al., 2006) and the T2K open
[email protected]
scheme to SiNWs. Si nanowire Email:
field-effect
transistors ure 15 shows the calculated energy bands for each SiNW
(SiNW FETs) are expected to be boosters in post-scaling shown in Figure 14. In all cases, the conduction band minsemiconductor technology (http://www.itrs.net/). Clear imum is located at the zone center !. The second minimum,
scaling of short-channel effects vs. nanowires with fixed which is higher than the !-point minimum by "Eshift, is
gate lengths has been observed (Bangsaruntip et al., located at some point between ! and the zone boundary.
2009). Reducing the dimensions of SiNWs has been The "Eshift is a deciding factor in determining the number
shown to improve short-channel control. Furthermore, of conduction channels and is therefore important when
SiNW FETs with channel dimensions of 5.0 ! 6.3 nm designing FETs. We have found that the "Eshift of the
have been fabricated (Bangsaruntip et al., 2009). A recent [001]-ellipse and dumbbell are smaller than those of the
experiment showed that the optimal dimension of the [1-10]-ellipse and dumbbell, and that the "Eshift of
cross-section of a SiNW FET is approximately 10 nm the [001]-ellipse is smaller than that of the circle. These
(Chen et al., 2010). At
such a small scale, quantum results unequivocally show that, with respect to the number
2016年5月12日計算科学技術
特論Ｂ 1
2
3
4
5
6
7
Downloaded from hpc.sagepub.com at RIKEN on April 30, 2014
Downloaded from hpc.sagepub.com at RIKEN on April 30, 2014
Downloaded from hpc.sagepub.com at RIKEN on April 30, 2014
●
●
●
2016年5月12日計算科学技術特論Ｂ Kohn-Sham方程式
Hϕ i (r) = ε iϕ i (r)
波数：Gによる展開
Hϕ ik (G) = ε iϕ ik (G)
φik: 電子軌道（=波動関数）
i：電子準位（=エネルギーバンド量子数）
G：波数格子
k：k点
2016年5月12日計算科学技術特論Ｂ 42
解くべき方程式
H|ψ> =εn|ψ>
|ψ> i :
Δi+1 <= F( H|ψ>i , |ψ> i )
|ψ>i+1 <= |ψ>i + Δ i+1
|ψ>i+1
H|ψ> |ψ>
|ψ>
|ψ>
|ψ>i
|ψ>
<ψ|ψ>i+1 :
H
:
2016年5月12日計算科学技術特論Ｂ Hϕ ik (G) = ε iϕ ik (G)
iはエネルギーバンド量子数
基本的にエネルギーバンドについて並列化されてい
る
一部，波数：Gについて並列化されている
G
G並列の前にエネルギーバンド並列されている波動
関数をG並列可能なようにトランスバース転送が発生
G並列後にG並列されている波動関数をエネルギーバ
i
i
2016年5月12日計算科学技術特論Ｂンド並列に戻すためなトランスバース転送が発生
このトランスバース転送のコストが大
G
■
O(N3)
O(N2logN)
O(N3)
2016年5月12日計算科学技術特論Ｂ BLAS Level3
LAS! Level3
! 
■
! 
BLAS Level3
BLAS Level3
(
(
(
2)
(
2)
14 [s]
14 [s]
12
12
10
10
8
8
6
6
4
4
2
2
0
origina
23axis:
23axis:
0
16
46
original:0calc.
23axis:0calc.
23axis:0comm.
32 16
64 32
2016年5月12日計算科学技術特論Ｂ 12864
1
■
第 4 章性能最適化技術の応用
演算処理
通信処理
トランスバース処理
(通信以外)
2016年5月12日計算科学技術特論Ｂ Fig. 4.24: PHASE 区間 5 のスケーラピリティ
確保できないということである。良い並列効果を得るには、オリジナルコードを使
限り、ノードあたりの原子数をもっと大きくとる必要がある。区間 8 でも、区間 2
のハードウェアと現状のアプリの並列度のミスマッチが発生していると言える。
■
41
41
41
41
41
41
5637
41
41
0. 2
Fig. 4.25: PHASE 区間 8 のスケーラピリティ
2016年5月12日計算科学技術特論Ｂ第 4 章性能最適化技術の応用
対角化
対角化の処理を実施している区間
9 を HfSiO2、384 原子アモルファス系を入力データと
■
して、16 から 128対角化カーネル(区間９）
並列の低並列度でストロングスケールで測定した。スケーラビリティの
データを Table 4.8 に示す。固有値計算の部分は並列度を上げても全く実行時間の短縮が見
HfSiO2 384原子，アモルファス系で測定．
られない。これは並列対応のライブラリである
ScaLAPACK でなく、逐次対応の LAPACK
固有値計算の部分は並列度を上げても全く実行時間の短縮が見られない．
を使用していることが原因である。つまりこの部分は、完全に非並列部が残存しているこ
ととなっている。逐次対応のLAPACK を使用していることが原因.
完全に非並列部が残存している.
Table 4.8: PHASE 区間 9 のスケーラピリティ
プロセス数
16
32
64
128
区間
区間 9 全体
固有値計算
(lapack-dsyev)
!
ψi′ =
Uij ψj
′
ψi 転置通信
高速化率 [-](全体)
高速化率 [-](固有値
計算含まず)
22.233 21.228 20.881 20.633
20.349 20.189 20.260 20.223
1.584 0.824 0.410 0.210
0.300 0.215 0.211 0.200
1.00
1.05
1.06
1.08
1.00
1.81
3.03
4.60
2016年5月12日計算科学技術特論Ｂ 4.3.8
PHASE の高性能化のための書き換え
並列軸の拡張
4.3.7 節の行列行列積に書き換え可能な処理と FFT を含む処理の項で、PHASE の高並
列上の問題点は、ハードウェアとアプリケーションの並列度のミスマッチであることを述
φik: 電子軌道（=波動関数）
べた。PHASE は (4.4) の固有値方程式を解いている。オリジナルの
PHASE では基本的
ik
i ik
i：電子準位（=エネルギーバンド量子数）
に、固有値方程式の中のエネルギーバンド量子数を表す変数、
i について並列化が実施さ
G：波数格子
れている。また一部分、例えば GramShumit 対角化処理については、波数
(G) による並
列化が実施されている。本来 PHASE の方程式は、すべての部分が完全に、エネルギーバ
k：k点
ンド:i と波数:G について並列化可能である。 3.6 節に述べたように並列軸の拡張により、
ハードウェアとアプリケーションの並列度のミスマッチへの対処が可能になる。今回は並
エネルギーバンド(B)に加え波数(G)
列軸の拡張として、エネルギーバンドと波数について、すべての部分について 2 軸のフル
並列化を実施した。この並列軸の拡張により、
の並列を実装 1000 止まりであった並列度を数万レベルま
で拡張し、原子数よりも大きな並列度で並列化することを目指した。
完全な２軸並列とする
また 4.3.7 節で PHASE の高並列上の問題点は、非並列部の残存であることを示した。
RSDFTと同様の行列積化も実装
Fig. 4.24 に示した、原子数ループの先頭とエネルギーバンド並列部に挟まれた非並列部
は、波数に関する計算を実施している部分である。したがってエネルギーバンドに加えて、
波数 (G) による並列化を実施することにより、この残存している非並列部が並列化可能と
なる。
Hϕ (G) = ε ϕ (G)
2016年5月12日計算科学技術特論Ｂ 50
■二軸並列化
■
非並列部が波数で並列
化できる
2016年5月12日計算科学技術特論Ｂ ■二軸並列化
■
•
•
•
2016年5月12日計算科学技術特論Ｂ 51
■二軸並列化
■
•
バンド
波数
2016年5月12日計算科学技術特論Ｂ ■二軸並列化
■
•
•
•
•
•
波数
バンド
2016年5月12日計算科学技術特論Ｂ ■二軸並列化
• この分野では小規模問題を短時間で計算したいという科学
的要求が高い．
• バンド計算(エネルギー準位など)：1万原子の1回SCF収束
で良い∼100SCF程度．
• 構造緩和(MD)や反応経路探索：外側に原子核の緩和に関す
るループ構造∼100step程度．
• 10,000原子を10PFシステム(８0,000ノード)，また1,000
原子を10,000ノードで計算する事を目指せる．
• ただし二軸並列はメリットとデメリットがあるため実施前
に効果が期待できるか詳細な評価を実施した．
2016年5月12日計算科学技術特論Ｂ [sec]
[s]
300
250
200
[%]
[sec]
[%]
[sec]
[%]
[sec]
[%]
[%]
[%]
[sec]
[%]
5
512.7 100.0 2 23
5
512.7 100.0 2 23
[sec]
[%]
[sec]
[%]
512.7 100.0
[%] 105.3
m_ES_F_transpose_r
105.3
20.5
0
m_ES_F_transpose_r
20.5 [%] 5 0
m_ES_F_transpose_r
105.3
20.5
5
512.7 100.0 2 23
m_ES_W_transpose_r
15.7
3.1
0
m_ES_W_transpose_r
15.7
3.1 m_ES_W_transpose_r
0
15.7
3.1
m_ES_F_transpose_r
105.3
20.5
0
WSW_t
15.2
3.0
36
WSW_t
15.2
3.0
36
WSW_t
15.2
3.0
m_ES_W_transpose_r
15.7
3.1
0
0.2
normalize_bp_and_psi_t
0.9
0.2
3 250.9
normalize_bp_and_psi_t
0.2 normalize_bp_and_psi_t
3 25
WSW_t
15.2
3.0
36 0.9
Gram/Schmidt3orthonormalization
49.2
9.6
W1SW2_t_r
49.2
9.6
5 46
normalize_bp_and_psi_t
0.9
0.2
3 25 49.2
W1SW2_t_r
9.6 W1SW2_t_r
5 46
modify_bp_and_psi_t_r
50.8
9.9
W1SW2_t_r modify_bp_and_psi_t_r
49.2
9.6
5
46
modify_bp_and_psi_t_r
50.8
9.9
4 45
50.8
9.9
4 45
W1SW2_t_r_block
162.0
31.6
modify_bp_and_psi_t_r
50.8
9.9
4 45
W1SW2_t_r_block
162.0
31.6 41 89
W1SW2_t_r_block
162.0
31.6 modify_bp_and_psi_t_r_block
41 89
other
96.2
18.8
W1SW2_t_r_block
162.0
31.6 41 89
M× M
modify_bp_and_psi_t_r_block
96.2
18.8 74 65
modify_bp_and_psi_t_r_block
96.2
18.8
74
65
m_ES_W_transpose_back_r
14.1
2.8
modify_bp_and_psi_t_r_block
96.2
18.8
74
65
M× V
0.3
m_ES_W_transpose_back_r
14.1
2.8
01.3
m_ES_W_transpose_back_r
2.8 m_ES_F_transpose_back_r
0
m_ES_W_transpose_back_r
14.1
2.8
0 14.1
m_ES_F_transpose_back_r
1.3
0.3
0
m_ES_F_transpose_back_r
1.3
0.3
0
m_ES_F_transpose_back_r
1.3
0.3
0
150
行列積
100
行列ベクトル積
50
0
orinigal
original+DGEMM
22axis+1threads
Algorithm
22axis+8threads
2016年5月12日計算科学技術特論Ｂ [%]
[%]
2 23
0
0
36
3 25
5 46
4 45
41 89
74 65
0
0
■二軸並列化
■
• 行列積化されたカーネルに(区間2)ついての結果．
• HfSiO2 384原子アモルファス系のデータ．
• 大幅な性能向上を達成．
通信部
20.0
20.0
18.0
18.0
16.0
14.0
12.0
]ces[間時過経
]ces[間時過経
16.0
14.0
12.0
10.0
8.0
6.0
4.0
2.0
0.0
16
32
64
14 [s]
主要演算ループ
original:0calc.
23axis:0calc.
23axis:0comm.
12
10
8
10.0
8.0
6
6.0
4.0
4
2.0
0.0
2
128
0
プロセス数
16
プロセス数
2016年5月12日計算科学技術特論Ｂ 32
64
プロセス数
128
■二軸並列化
■FFT
• FFTを含むカーネルに(区間8)ついての結果．
• HfSiO2 384原子アモルファス系のデータ．
• 性能向上を達成．
50.0
50.0
40.0
40.0
18 [s]
]ces[間時過経
]ces[間時過経
14
30.0
30.0
12
20.0
10
10.0
10.0
8
0.0
0.0
20.0
16
32
64
original:0other
original:0DGEMM
original:0FFT
2:axis:0other
2:axis:0comm.
2:axis:0DGEMM
2:axis:0FFT
16
6
4
128
2
プロセス数
2016年5月12日計算科学技術特論Ｂプロセス数
0
16
32
プロセス数 64
128
■Scalapack分割数の固定
■
• 対角化はエネルギーバンド数の元を持つ行列が対象
• 行列の大きさに比べて分割数が多すぎる
• 分割数を16 16=256に固定
100 [s]
700[s]
HfSiO4_3072
80
HfSiO4_6144
60
600
250[s]
ScaLAPACK
FFT
BLAS
200
500
400
150
300
40
200
20
100
100
0
0
0
10
20
分割数
30
50
40
0
16x32
16x64
16x96
16x128
プロセス数(バンド×波数)
2016年5月12日計算科学技術特論Ｂ ■
■
16x160
(
Time%[s]
1400
DGEMM
FFT
ScaLAPACK
1200
1000
800
600
400
200
0
48
2016年5月12日計算科学技術特論Ｂ 96
192
384
768 1536 3072 6114 12288
Process
2016年5月12日計算科学技術特論Ｂ

Download Report