Parallel Iterative Solvers for Ill-Conditioned Problems with Reordering Kengo Nakajima Department of Earth & Planetary Science, The University of Tokyo. Parallel preconditioned iterative solvers for Basically, convergence is faster if the number of color is larger … FEM-type applications Optimization on the Earth Simulator and other SMP cluster Smaller vector length ⇒ poor performance on vector processors. type architectures Flat MPI, OpenMP/MPI Hybrid Synchronization overhead of OpenMP DJDS provides data locality if the color number increases. On scalar processors, performance may be improved as color number increases (both for flat MPI and Hybrid). Reordering for parallel/vector processing Multicoloring (MC) 30 10.00 1.00 7.50 0.75 GFLOPS 20 GFLOPS GFLOPS RCM, CM-RCM (Cyclic Multicolor) 5.00 10 Hitachi SR11000 2.50 ES 10 100 10 1000 100 do i= 1, N k1= index(i-1)+1 k2= index(i) do k= k1, k2 kk= item(k) Y(i)= Y(i)+A(k)*x(k) enddo enddo DJDS with long innermost loops is suitable for vector processors. Reduction type loop of DCRS is more suitable for cache-based scalar processor because of its localized operation. Uniform Distributed Force in z-direction @ z=Zmin between MC/RCM/CM-RCM and effect of color number is not so significant. 30 150 ES 750 100 Uy=0 @ y=Ymin GFLOPS 1.0E+01 z For well-conditioned problems, difference sec. ●DJDS ■DCRS ▲No Reordering 500 Ux=0 @ x=Xmin 50 10 0 0 0 10 100 10 1000 100 1000 Ny-1 y Effect of Reordering on ES for 3D Linear-Elastic Problem (1 SMP node, OpenMP) Uz=0 @ z=Zmin uniform meshes (106nodes, 3x106 DOF) Single SMP node with OpenMP, DJDS Nx-1 x Complicated PGA (Pin Grid Array) model 0 0.00 10 100 ES Iterations 100 500 50 0 10 100 0 10 1000 100 1000 COLOR# 100 1000 10000 10 100 1000 6.00 Hitachi SR11000 10000 COLOR# 8.00 Max. ratio: 30:1 6.00 GFLOPS Hitachi SR11000 4.00 0.00 10 100 1000 COLOR# 10000 COLOR# 1000 0.00 10 100 COLOR# efficient convergence for both of vector and scalar processors. 2.00 0 100 CM-RCM provides the most robust and 200 100 10 4.00 2.00 0 Number of independent sets for RCM is 2985. Min. number of colors for independent CMRCM is 2381. 1000 COLOR# 200 COLOR# 300 100 20 0 10 10000 10 8.00 100 100 1000 COLOR# 100 10 ●MC ●CM-RCM ▲RCM 250 10 0 20 10 sec. GFLOPS sec. 100 1000 COLOR# 300 200 100 30 COLOR# 200 10 150 750 ES 0 1000 1000 30 sec. Iterations 2.00 956,128 elements 300 300 100 4.00 But it’s significant for ill-conditioned problems 0 400 200 61 pins (3,037,062 DOF) Single SMP node with OpenMP, DJDS 1000 6.00 COLOR# 1,012,354 nodes 500 Hitachi SR11000 sec. DOF: Problem Size 3D elastic cube with GFLOPS 1.0E+07 GFLOPS 1.0E+06 100 8.00 GFLOPS Nz-1 10 COLOR# sec. 1.0E+05 ●MC ●CM-RCM ▲RCM COLOR# COLOR# 300 1.0E+04 20 250 1.0E+00 1.0E-01 1000 Colors Colors 1000 Iterations GFLOPS 1.0E+02 100 do iv= 1, NCOLORS !$omp parallel do private (iv0,j,iS,iE,i,k,kk etc.) do ip= 1, PEsmpTOT iv0= STACKmc(PEsmpTOT*(iv-1)+ip- 1) SMP do j= 1, NLhyp(iv) parallel iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1) iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip ) !CDIR NODEP do i= iv0+1, iv0+iE-iS k= i+iS - iv0 kk= IAL(k) Vectorized (Important Computations) enddo enddo enddo enddo DCRS: Descending-order Compressed Row Storage do j= 1, NCON do i= 1, NN(j) k = index(j-1) + i kk= item (k) Y(i)= Y(i)+A(k)*x(k) enddo enddo 10 1000 Effect of color# and matrix storage (PGA model, 1 SMP node, OpenMP) CM-RCM Matrix storage DJDS: Descending-order Jagged Diagonal Storage) IBM SP3 0.00 Colors RCM ●DJDS ■DCRS 0.25 0.00 0 MC 0.50 10 100 1000 COLOR# 10000 complicated geometries 1000
© Copyright 2024 ExpyDoc