Ridgeway Kite
Innovative Technology for Reservoir Engineers

Large Scale Reservoir Simulation utilizing multiple GPUs
Garf Bowen
25th March 2014

Summary
• Introduce
  – RKS
  – Reservoir Simulation
• HPC goals
• Implementation
• Large scale simulations
• Results & future

RKS
• Start-up (April 2013)
  – Long history in Reservoir Simulation
  – Sister company, NITEC – consulting
• Differentiators
  – Massively Parallel Code
  – Multiple Realizations
  – “Unconventional”
  – Coupled surface network

Reservoir Simulation
• Finite Volume
• Unstructured (features)
• Implicit

$\mathbf{R} = \Delta\mathbf{M} - \mathbf{F} = \mathbf{0}$

Driving from London to Manchester…
Check the Ferrari or the traffic jam?
• A lot of code that all needs to go fast
• Challenge is often “not to go slow”
• Can’t just focus on “hot spots”

HPC goals
• “not to go slow”
• Portability CPU/GPU (+clusters)
  – Want to be future proof
• Simplification
  – (massive) parallelization is an opportunity
  – Developer efficiency
  – Same result on any platform

Shuffle Calculate Pattern
[Diagram labels: Scatter I/O from node zero; Calculate “one-to-one”; Shuffle; Gather output]
• All data is on the GPU
• Calculations are embarrassingly parallel
• No indirect addressing
• Ability to time separately

Example – calculate flows
• One flow, two cells
• Different flow, same cell
• One cell involved in multiple flows
• More flows than cells
• Multiple copies – “slots”

Simplicity Returns?
• “one code” kernel, many (independent) calls
• Split to run MPI distributed
• Underlying system - XPL
  – Takes care of running
  – Different modes
  – Different architectures
• Code looks serial again

Maps & MPI

  Src   Dest   Slot
  i1    j1     0
  i2    j2     1
  i3    j3     0
  i4    j4     1
  …     …      …

Maps are defined in “serial” space
Not recommended

  test.exe -cpu
  test.exe -gpu
  mpirun -np 16 test.exe

Simple Example

$x_i = A_i^{-1} r_i \quad \forall i$

• A - n*n small dense matrix
• ~millions of i’s
• LU factorization (partial pivoting)

  template<typename KP>
  struct Testinv {
    __host__ __device__
    Testinv(Args* inArgs, int index, int N) {
      int ia=0;
      // Bind the arguments for element `index`, in declaration order
      mat<double,KP> a(inArgs,ia++,index);
      vec<double,KP> r(inArgs,ia++,index);
      vec<double,KP> x(inArgs,ia++,index);
      mat<double,KP> w(inArgs,ia++,index);
      w = a;        // work on a copy of A
      w.inv();      // invert in place
      x.zero();
      w.mult(r,x);  // x = A^{-1} r
    }
  };

  case rks::TestKernels::TEST_INV:
    calc(inArgs, gpu<Testinv<kp> >, cpu<Testinv<kp> >);
    break;

[Figure: Scaling. log time (secs) against log n for CPU and GPU, with fitted lines y = 2.35x + 2.31 and y = 2.23x + 1.20]

Now add complexity

  well        40    8.4
  jac         40   19.1
  mass        40    1.9
  flow        40   16.5
  flow_     4640   16.0
  norm        40    0.4
  lin         30   52.7   52.5
  ling        30    2.0    2.0
  lins        30   50.0   49.9
  orth-it     30
  norm       219    0.1
  precon     189   48.1
  pressure   189   46.9

  ====================================================
  Comparison between: cpu 1243.630 and gpu 147.960
  ====================================================
  well       1.0    0.08
  jac        1.0   12.62
  mass       1.0   17.93
  flow       1.0   11.66
  flow_      1.0   11.84
  norm       1.0    2.19
  lin        1.0    9.87
  ling       1.0    1.70
  lins       1.0   10.08
  orth-it    1.0   10.10
  norm       1.0   48.40
  precon     1.0    9.17
  pressure   1.0    8.24

Linear Solver Strategy
• Linear Solver Important
  – Communication Mechanism
  – Challenge in parallel environments
  – Like getting “the same” results
• If we can implement a solver in XPL, then we get this for free
• …but we’re only a small company
• And don’t really want to be linear solver experts
• Home grown
  – May not be competitive
• Using
Nvidia’s AmgX
  – Lose the “same” algorithm
  – Performing

Linear Solver
• Home Grown
  – Massively helpful for development
    • Same results for all configurations
  – Challenged algorithmically on difficult problems
• AmgX
  – Many options (pre-coded)
  – Single GPU working well
  – Focussed our effort here
    • MPI programming becomes important

Strategy as problem size increases
• Tesla C2070
  – 6 GB memory
  – Black Oil model, 1 million cells (SPE10 1.2e6 cells)
    • Little incentive to utilize >1 GPU
    • noting people will often run multiple realizations
• Larger model -> cluster
  – Memory constrained

Scaling Test
• Based on SPE10 benchmark
  – Refined model
  – 5 wells
  – ~1 million cells
• We can fit:
  – Base case on one GPU
  – 4 (connected) copies on 4 GPUs
    • Actually require 8 GPUs
      – Extra memory
  – 16 copies on 16/32 GPUs
    • Less challenging scaling than refinement

Memory & Performance
[Figure: Memory. Memory (MB) against processors (1 to 8) for the cases 1E6 - 1 GPU, 1E6 - 2 GPUs and 4E6 - 8 GPUs]
[Figure: Example Performance. Wall clock time (secs) for the cases "1E6-1GPU", "1E6-2GPU" and "4E6-8GPU"]
Lessons:
• Very variable timings
• Instrumentation vital
Future:
• Still working on the 32-way case
• Classical MPI optimization step

Summary & Conclusions
• Shuffle-Calculate pattern
  – Works for us, so far
  – Portable
  – Allowing us to exploit the GPU
  – Using AmgX we’re able to tackle realistic cases requiring multiple GPUs
• Full system
  – Commercial offering early next year

Acknowledgements
• Co-authors: Bachar Zineddin & Tommy Miller
• Jeremy Appleyard, Nvidia
• “The authors would like to acknowledge the work presented here made use of the IRIDIS*/EMERALD* HPC facility provided by the Centre for Innovation.”
• Nvidia for AmgX beta access

Questions?
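The Shuffle-Calculate pattern and the "slots" idea from the flow example can be sketched in plain serial C++. This is only an illustration under stated assumptions, not the XPL implementation: `MapEntry`, `shuffle` and `net_flows` are hypothetical names, the flow law `trans * (p_a - p_b)` is a stand-in for the real flux calculation, and the loops run serially where the real code would launch one kernel call per element. All indirect addressing is confined to the shuffle steps, so each calculate step is one-to-one over flows or over cells.

```cpp
#include <cassert>
#include <vector>

// One row of a shuffle map (compare the Src / Dest / Slot table):
// copy in[src] into copy number `slot` of output element `dest`.
struct MapEntry { int src; int dest; int slot; };

// Shuffle step: pure data movement; all indirect addressing lives here.
// The output holds `nslots` copies per element: out[dest * nslots + slot].
void shuffle(const std::vector<MapEntry>& map, const std::vector<double>& in,
             std::vector<double>& out, int nslots) {
    for (const MapEntry& m : map) out[m.dest * nslots + m.slot] = in[m.src];
}

// Net inflow per cell, for flows q = trans * (p_a - p_b) between cells a and b.
std::vector<double> net_flows(const std::vector<double>& pressure,
                              const std::vector<int>& a,
                              const std::vector<int>& b,
                              const std::vector<double>& trans) {
    const int ncells = static_cast<int>(pressure.size());
    const int nflows = static_cast<int>(trans.size());
    assert(a.size() == trans.size() && b.size() == trans.size());

    // Build both maps once, in "serial" space.
    std::vector<MapEntry> cell_to_flow, flow_to_cell;
    std::vector<int> used(ncells, 0);  // slots already taken in each cell
    for (int f = 0; f < nflows; ++f) {
        cell_to_flow.push_back({a[f], f, 0});
        cell_to_flow.push_back({b[f], f, 1});
        flow_to_cell.push_back({2 * f, a[f], used[a[f]]++});
        flow_to_cell.push_back({2 * f + 1, b[f], used[b[f]]++});
    }
    int nslots = 1;
    for (int n : used) if (n > nslots) nslots = n;

    // Shuffle: give every flow its own copy of the two cell pressures.
    std::vector<double> p_flow(2 * nflows, 0.0);
    shuffle(cell_to_flow, pressure, p_flow, 2);

    // Calculate, one-to-one over flows (no indirect addressing).
    std::vector<double> contrib(2 * nflows, 0.0);
    for (int f = 0; f < nflows; ++f) {
        const double q = trans[f] * (p_flow[2 * f] - p_flow[2 * f + 1]);
        contrib[2 * f] = -q;     // leaves cell a
        contrib[2 * f + 1] = q;  // enters cell b
    }

    // Shuffle: each contribution lands in its own slot, so writes never collide.
    std::vector<double> cell_slots(ncells * nslots, 0.0);
    shuffle(flow_to_cell, contrib, cell_slots, nslots);

    // Calculate, one-to-one over cells: sum the slots.
    std::vector<double> net(ncells, 0.0);
    for (int c = 0; c < ncells; ++c)
        for (int s = 0; s < nslots; ++s) net[c] += cell_slots[c * nslots + s];
    return net;
}
```

For three cells in a line with pressures 3, 2, 0 and two flows (0→1 with `trans` 1, 1→2 with `trans` 2), the middle cell takes part in both flows and therefore needs two slots; the returned net inflows sum to zero.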
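Backup#1 – LU code example

  //
  // Main elimination loop
  //
  for (int j=0; j<m_xdim; j++) {
    //
    // Sum: entries above the diagonal in column j
    //
    for (int i=0; i<j; i++) {
      double sum = (*this)(i,j);
      for (int k=0; k<i; k++) {
        sum = sum - (*this)(i,k)*(*this)(k,j);
      }
      (*this)(i,j) = sum;
    }
    //
    // Max: remaining entries, tracking the best (scaled) pivot row
    //
    aamax = 0.0;
    for (int i=j; i<m_xdim; i++) {
      double sum = (*this)(i,j);
      for (int k=0; k<j; k++) {
        sum = sum - (*this)(i,k)*(*this)(k,j);
      }
      (*this)(i,j) = sum;
      if ( std::fabs(vv[i]*sum)>=aamax ) {
        imax = i;
        aamax = std::fabs(vv[i]*sum);
      }
    }
    //
    // Swap rows j and imax
    //
    if (j!=imax) {
      for (int k=0; k<m_xdim; k++) {
        double dum = (*this)(imax,k);  // was (imax,j), which overwrote row j with one value
        (*this)(imax,k) = (*this)(j,k);
        (*this)(j,k) = dum;
      }
      vv[imax] = vv[j];
    }
    //
    // Store the pivot row; guard against an exactly zero pivot
    //
    piv[j] = imax;
    if ( (*this)(j,j)==0.0 ) {
      (*this)(j,j) = 1e-20;
    }
    //
    // Set: divide the sub-diagonal part of column j by the pivot
    //
    if (j!=m_xdim-1) {  // was j!=m_xdim, which is always true inside this loop
      double dum = 1.0/(*this)(j,j);
      for (int i=j+1; i<m_xdim; i++) {
        (*this)(i,j) = (*this)(i,j)*dum;
      }
    }
  }
  //------ End lu step ----

Backup#2 – Home Grown Solver

$$\begin{pmatrix} A_{ww} & A_{wb} \\ A_{bw} & A_{bb} \end{pmatrix} \begin{pmatrix} x_w \\ x_b \end{pmatrix} = \begin{pmatrix} R_w \\ R_b \end{pmatrix}$$

$$\begin{pmatrix} A_{ww} & 0 \\ A_{bw} & A_{bb}^{*} \end{pmatrix} \begin{pmatrix} I & A_{wb}^{*} \\ 0 & I \end{pmatrix} \begin{pmatrix} x_w \\ x_b \end{pmatrix} = \begin{pmatrix} R_w \\ R_b \end{pmatrix}$$

$$A_{bb}^{*} = A_{bb} - A_{bw} A_{ww}^{-1} A_{wb}$$

Note:

$$(1 - x)^{-1} = 1 + x + x^2 + x^3 + \dots$$

With:

$$x = A_{bw} A_{ww}^{-1} A_{wb} A_{bb}^{-1}$$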
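The series in the note can be tied to the modified block $A_{bb}^{*}$ with one extra step. The following is a sketch rather than the solver as implemented: it assumes the series converges (spectral radius of $x$ below one), it does not say where the series would be truncated in practice, and the identification $A_{wb}^{*} = A_{ww}^{-1} A_{wb}$ is inferred here from multiplying out the two factors, not stated on the slide.

```latex
\begin{align*}
A_{bb}^{*} &= A_{bb} - A_{bw} A_{ww}^{-1} A_{wb}
            = \left(I - A_{bw} A_{ww}^{-1} A_{wb} A_{bb}^{-1}\right) A_{bb}
            = (I - x)\, A_{bb} \\
\left(A_{bb}^{*}\right)^{-1} &= A_{bb}^{-1} (I - x)^{-1}
            = A_{bb}^{-1} \left(I + x + x^{2} + x^{3} + \dots\right) \\
x_b &= \left(A_{bb}^{*}\right)^{-1} \left(R_b - A_{bw} A_{ww}^{-1} R_w\right) \\
x_w &= A_{ww}^{-1} \left(R_w - A_{wb}\, x_b\right)
\end{align*}
```

Read this way, applying $\left(A_{bb}^{*}\right)^{-1}$ needs only solves with $A_{ww}$ and $A_{bb}$ plus products with the coupling blocks, one extra round per retained power of $x$.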