プレ・ポスト統合化ビューア GPP View/Editor

Parallel Iterative Solvers for Ill-Conditioned
Problems with Reordering
Kengo Nakajima
Department of Earth & Planetary Science, The University of Tokyo.
 Parallel preconditioned iterative solvers for
 Basically, convergence is faster if the number
of color is larger …
FEM-type applications
 Optimization on the Earth Simulator and other SMP cluster
 Smaller vector length ⇒ poor performance on vector processors.
type architectures
 Flat MPI, OpenMP/MPI Hybrid
 Synchronization overhead of OpenMP
 DJDS provides data locality if the color number increases. On
scalar processors, performance may be improved as color
number increases (both for flat MPI and Hybrid).
 Reordering for parallel/vector processing
 Multicoloring (MC)
30
10.00
1.00
7.50
0.75
GFLOPS
20
GFLOPS
GFLOPS
 RCM, CM-RCM (Cyclic Multicolor)
5.00
10
Hitachi
SR11000
2.50
ES
10
100
10
1000
100
do i= 1, N
k1= index(i-1)+1
k2= index(i)
do k= k1, k2
kk= item(k)
Y(i)= Y(i)+A(k)*x(k)
enddo
enddo
 DJDS with long innermost loops is suitable for vector
processors.
 Reduction type loop of DCRS is more suitable for cache-based
scalar processor because of its localized operation.
Uniform Distributed
Force in z-direction
@ z=Zmin
between MC/RCM/CM-RCM and effect of color
number is not so significant.
30
150
ES
750
100
Uy=0
@ y=Ymin
GFLOPS
1.0E+01
z
 For well-conditioned problems, difference
sec.
●DJDS
■DCRS
▲No Reordering
500
Ux=0
@ x=Xmin
50
10
0
0
0
10
100
10
1000
100
1000
Ny-1
y
Effect of Reordering on ES for
3D Linear-Elastic Problem
(1 SMP node, OpenMP)
Uz=0 @ z=Zmin
uniform meshes
(106nodes, 3x106
DOF)
 Single SMP node with
OpenMP, DJDS
Nx-1
x
 Complicated PGA (Pin Grid Array) model
0
0.00
10
100
ES
Iterations
100
500
50
0
10
100
0
10
1000
100
1000
COLOR#
100
1000
10000
10
100
1000
6.00
Hitachi
SR11000
10000
COLOR#
8.00
Max. ratio: 30:1
6.00
GFLOPS
Hitachi
SR11000
4.00
0.00
10
100
1000
COLOR#
10000
COLOR#
1000
0.00
10
100
COLOR#
efficient convergence for both of vector and
scalar processors.
2.00
0
100
 CM-RCM provides the most robust and
200
100
10
4.00
2.00
0
 Number of
independent sets for
RCM is 2985.
 Min. number of colors
for independent CMRCM is 2381.
1000
COLOR#
200
COLOR#
300
100
20
0
10
10000
10
8.00
100
100
1000
COLOR#
100
10
●MC
●CM-RCM
▲RCM
250
10
0
20
10
sec.
GFLOPS
sec.
100
1000
COLOR#
300
200
100
30
COLOR#
200
10
150
750
ES
0
1000
1000
30
sec.
Iterations
2.00
 956,128 elements
300
300
100
4.00
 But it’s significant for ill-conditioned problems
0
400
200
 61 pins
(3,037,062 DOF)
 Single SMP node with
OpenMP, DJDS
1000
6.00
COLOR#
 1,012,354 nodes
500
Hitachi
SR11000
sec.
DOF: Problem Size
 3D elastic cube with
GFLOPS
1.0E+07
GFLOPS
1.0E+06
100
8.00
GFLOPS
Nz-1
10
COLOR#
sec.
1.0E+05
●MC
●CM-RCM
▲RCM
COLOR#
COLOR#
300
1.0E+04
20
250
1.0E+00
1.0E-01
1000
Colors
Colors
1000
Iterations
GFLOPS
1.0E+02
100
do iv= 1, NCOLORS
!$omp parallel do private (iv0,j,iS,iE,i,k,kk etc.)
do ip= 1, PEsmpTOT
iv0= STACKmc(PEsmpTOT*(iv-1)+ip- 1)
SMP
do j= 1, NLhyp(iv)
parallel
iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip )
!CDIR NODEP
do i= iv0+1, iv0+iE-iS
k= i+iS - iv0
kk= IAL(k)
Vectorized
(Important Computations)
enddo
enddo
enddo
enddo
DCRS: Descending-order
Compressed Row Storage
do j= 1, NCON
do i= 1, NN(j)
k = index(j-1) + i
kk= item (k)
Y(i)= Y(i)+A(k)*x(k)
enddo
enddo
10
1000
Effect of color# and matrix storage (PGA model, 1 SMP node, OpenMP)
CM-RCM
 Matrix storage
DJDS: Descending-order
Jagged Diagonal Storage)
IBM SP3
0.00
Colors
RCM
●DJDS
■DCRS
0.25
0.00
0
MC
0.50
10
100
1000
COLOR#
10000
 complicated geometries
1000