Evaluation of OpenMP dependent tasks with the KASTORS benchmark suite Philippe Virouleau 1 Pierrick Brunet 1 François Broquedis 2 Nathalie Furmento 3 Samuel Thibault 4 Olivier Aumage 1 Thierry Gautier 1 1 Inria - RUNTIME and MOAIS teams 2 Grenoble institute of Technology 3 CNRS 4 University of Bordeaux Monday, September 29th Philippe Virouleau (INRIA) KASTORS Monday, September 29th 1 / 24 Plan 1 OpenMP 4.0 dependent tasks 2 KASTORS benchmarks overview 3 KASTORS performances 4 Ongoing works 5 Conclusion Philippe Virouleau (INRIA) KASTORS Monday, September 29th 2 / 24 OpenMP 4.0 dependent tasks Plan 1 OpenMP 4.0 dependent tasks Example Compiler and runtime support Our contribution 2 KASTORS benchmarks overview 3 KASTORS performances 4 Ongoing works 5 Conclusion Philippe Virouleau (INRIA) KASTORS Monday, September 29th 3 / 24 OpenMP 4.0 dependent tasks Example Reminder on OpenMP dependent tasks How does it work ? Independent tasks void task_example ( void ) { type a , b ; # pragma omp parallel # pragma omp single { # pragma omp task work (& a ) ; # pragma omp task work (& b ) ; # pragma omp taskwait # pragma omp task process ( a ) ; # pragma omp task process ( b ) ; } } Philippe Virouleau (INRIA) KASTORS Monday, September 29th 4 / 24 OpenMP 4.0 dependent tasks Example Reminder on OpenMP dependent tasks How does it work ? Independent tasks void task_example ( void ) { type a , b ; # pragma omp parallel # pragma omp single { # pragma omp task work (& a ) ; # pragma omp task work (& b ) ; # pragma omp taskwait # pragma omp task process ( a ) ; # pragma omp task process ( b ) ; } } Philippe Virouleau (INRIA) Dependent tasks void task_example ( void ) { type a , b ; # pragma omp parallel # pragma omp single { # pragma omp task depend ( out : a ) work (& a ) ; # pragma omp task depend ( out : b ) work (& b ) ; # pragma omp task depend ( in : a ) process ( a ) ; # pragma omp task depend ( in : b ) process ( b ) ; } } KASTORS Monday, September 29th 4 / 24 OpenMP 4.0 dependent tasks Example Reminder on OpenMP dependent tasks How does it work ? Independent tasks void task_example ( void ) { type a , b ; # pragma omp parallel # pragma omp single { # pragma omp task work (& a ) ; # pragma omp task work (& b ) ; # pragma omp taskwait # pragma omp task process ( a ) ; # pragma omp task process ( b ) ; } } Dependent tasks void task_example ( void ) { type a , b ; # pragma omp parallel # pragma omp single { # pragma omp task depend ( out : a ) work (& a ) ; # pragma omp task depend ( out : b ) work (& b ) ; # pragma omp task depend ( in : a ) process ( a ) ; # pragma omp task depend ( in : b ) process ( b ) ; } } Describe how data are used vs Synchronization points. Philippe Virouleau (INRIA) KASTORS Monday, September 29th 4 / 24 OpenMP 4.0 dependent tasks Compiler and runtime support Compiler support GCC/libGOMP • Support for most of OpenMP 4.0 in GCC 4.9 • Compiler passes all dependencies to GOMP_task Clang + libIOMP • Support for most of OpenMP 4.0 in a separate Intel project Clang-omp (Merge with master is ongoing) • Compiler passes all dependencies to __kmpc_omp_task_with_deps Handling of dependencies • Dependencies stored in parent task • New task => look for unresolved dependencies Philippe Virouleau (INRIA) KASTORS Monday, September 29th 5 / 24 OpenMP 4.0 dependent tasks Compiler and runtime support Runtime support GCC/libKOMP [IWOMP’12] • GOMP replacement using Kaapi runtime system (Inria) • Dependent tasks supported • Compiler support from GCC 4.9 OMPSS [BSC] • Compiler support : Mercurium • Runtime support : Nanos Philippe Virouleau (INRIA) KASTORS Monday, September 29th 6 / 24 OpenMP 4.0 dependent tasks Our contribution Motivations Objectives • Evaluate the OpenMP dependent task paradigm (overhead, scalability...) • Compare runtimes Existing benchmark suite • BOTS (Barcelona OpenMP Task Suite [BSC]) • NAS, SPECOMP, RODINIA ... • No benchmark for dependent tasks yet ! Philippe Virouleau (INRIA) KASTORS Monday, September 29th 7 / 24 OpenMP 4.0 dependent tasks Our contribution Motivations Objectives • Evaluate the OpenMP dependent task paradigm (overhead, scalability...) • Compare runtimes Existing benchmark suite • BOTS (Barcelona OpenMP Task Suite [BSC]) • NAS, SPECOMP, RODINIA ... • No benchmark for dependent tasks yet ! KASTORS Benchmark suite for OpenMP dependent tasks ! Philippe Virouleau (INRIA) KASTORS Monday, September 29th 7 / 24 KASTORS benchmarks overview Plan 1 OpenMP 4.0 dependent tasks 2 KASTORS benchmarks overview 3 KASTORS performances 4 Ongoing works 5 Conclusion Philippe Virouleau (INRIA) KASTORS Monday, September 29th 8 / 24 KASTORS benchmarks overview OpenMP Dependent Task Suite Applications • Poisson2D • SparseLU (BOTSa , BSC) • Strassen (BOTS, BSC) • Cholesky (Plasma, ICL/UTK) • QR (Plasma, ICL/UTK) a Barcelona OpenMP Task Suite Philippe Virouleau (INRIA) KASTORS Monday, September 29th 9 / 24 KASTORS benchmarks overview OpenMP Dependent Task Suite Poisson2D Solve Poisson equation on a domain divided into a regular grid SparseLU (BOTS, BSC) LU factorization of a sparse matrix Strassen (BOTS, BSC) Dense matrix multiplication by bloc decomposition Cholesky (Plasma, ICL/UTK) Cholesky tiled matrix factorization (dpotrf) QR (Plasma, ICL/UTK) QR tiled matrix factorization (dgeqrf) Philippe Virouleau (INRIA) KASTORS Monday, September 29th 10 / 24 KASTORS performances Plan 1 OpenMP 4.0 dependent tasks 2 KASTORS benchmarks overview 3 KASTORS performances Set-up GCC/Clang results 4 Ongoing works 5 Conclusion Philippe Virouleau (INRIA) KASTORS Monday, September 29th 11 / 24 KASTORS performances Set-up How we conducted experiments Machines • AMD48 : 48 cores AMD Magny Cours processors, 256GB of memory • INTEL32 : 32 cores Intel Xeon E5-4620 processors, 380GB of memory Philippe Virouleau (INRIA) KASTORS Monday, September 29th 12 / 24 KASTORS performances Set-up How we conducted experiments Machines • AMD48 : 48 cores AMD Magny Cours processors, 256GB of memory • INTEL32 : 32 cores Intel Xeon E5-4620 processors, 380GB of memory Compilers • GCC + libGOMP 4.9, git-mirror commit 6ed3847ffd0 • Clang + libIOMP llvm commit 233b1e3f034 clang-ompa commit 7580e521e51f libIOMP : released 20131209 a http://clang-omp.github.io/ Philippe Virouleau (INRIA) KASTORS Monday, September 29th 12 / 24 KASTORS performances GCC/Clang results SparseLU, Strassen, and Poisson2D on AMD48 A bit more on speed up • Speed-up for each compiler is computed against its respective sequential time. • Times are in seconds Compiler GCC CLANG Philippe Virouleau (INRIA) Poisson2D 5.28 5.42 SparseLU 94.36 95.08 KASTORS Strassen 269.93 234.16 Monday, September 29th 13 / 24 KASTORS performances GCC/Clang results SparseLU, Strassen, and Poisson2D on AMD48 45 GCC-task GCC-taskdep Clang-task Clang-taskdep 40 Speed up 35 30 25 20 15 10 5 0 Philippe Virouleau (INRIA) SparseLU (64x128) Strassen (8192) KASTORS Poisson2D (8192) Monday, September 29th 13 / 24 KASTORS performances GCC/Clang results SparseLU, Strassen, and Poisson2D on AMD48 45 GCC-task GCC-taskdep Clang-task Clang-taskdep Clang-for 40 Speed up 35 30 25 20 15 10 5 0 Philippe Virouleau (INRIA) SparseLU (64x128) Strassen (8192) KASTORS Poisson2D (8192) Monday, September 29th 13 / 24 KASTORS performances GCC/Clang results SparseLU, Strassen, and Poisson2D on AMD48 45 GCC-task GCC-taskdep Clang-task Clang-taskdep Clang-for 40 Speed up 35 30 25 20 15 10 5 0 SparseLU (64x128) Strassen (8192) Poisson2D (8192) Conclusion • Dependent tasks > independent tasks • Poisson2D : Scheduling issue (for is better). Challenge is data locality Philippe Virouleau (INRIA) KASTORS Monday, September 29th 13 / 24 KASTORS performances GCC/Clang results SparseLU, Strassen, and Poisson2D on INTEL192 SparseLU (64x256) taskdep SparseLU task Strassen (8192) taskdep Strassen task 100 Speed up 80 60 40 20 0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 168 176 184 192 Number of cores INTEL192 • 24 Intel Xeon E5-4640 processors (total 192 cores), 756GB of memory Philippe Virouleau (INRIA) KASTORS Monday, September 29th 13 / 24 KASTORS performances GCC/Clang results QR (dgeqrf) and Cholesky (dpotrf) on AMD48 A word about QR and Cholesky • Dense Matrix factorizations • Linear algebra algorithms • Taken from Plasma [UTK] A word about Plasma Execution : • Static scheduling • Dynamic scheduling using Quark (dedicated runtime, dataflow) Philippe Virouleau (INRIA) KASTORS Monday, September 29th 14 / 24 KASTORS performances GCC/Clang results QR (dgeqrf) and Cholesky (dpotrf) on AMD48 250 GigaFlops 200 Plasma Static Plasma Dynamic GCC-OMP Clang-OMP 150 100 50 0 DGEQRF DPOTRF DGEQRF DPOTRF DGEQRF DPOTRF DGEQRF DPOTRF M,B : 2048,128 M,B : 4096,128 M,B : 8192,224 M,B : 16384,256 • Plasma static scheduling better for small sizes • OpenMP runtimes achieve same or better performances than Quark (!) Philippe Virouleau (INRIA) KASTORS Monday, September 29th 15 / 24 Ongoing works Plan 1 OpenMP 4.0 dependent tasks 2 KASTORS benchmarks overview 3 KASTORS performances 4 Ongoing works Next kernel from Plasma KASTORS 5 Conclusion Philippe Virouleau (INRIA) KASTORS Monday, September 29th 16 / 24 Ongoing works Next kernel from Plasma Plasma’s LU factorization (DGETRF) double * A ; int BS , n_blocks ; for ( int i = 0; i < n_blocks ; i ++) { # pragma omp task depend ( out : A [0: N ]) dgemm_f2 ( A [ i * BS ] , ...) ; } # pragma omp task depend ( in : A [0: N ]) swap (A , ...) ; Legend : Execution order : Block written : Depend clause Issues Bad performances due to dependencies What we do... Philippe Virouleau (INRIA) KASTORS Monday, September 29th 17 / 24 Ongoing works Next kernel from Plasma LU (dgetrf) on AMD48 200 Plasma Static Plasma Dynamic GCC-OMP Clang-OMP GigaFlops 150 100 50 0 Philippe Virouleau (INRIA) 2048,128 4096,128 8196,224 16384,256 DGETRF KASTORS Monday, September 29th 18 / 24 Ongoing works Next kernel from Plasma Plasma’s LU factorization (DGETRF) What we want ! Philippe Virouleau (INRIA) KASTORS Monday, September 29th 19 / 24 Ongoing works Next kernel from Plasma Plasma’s LU factorization (DGETRF) double * A ; int BS , n_blocks ; for ( int i = 0; i < n_blocks ; i ++) { # pragma omp task depend ( out : A [ i * BS : BS ]) dgemm_f2 ( A [ i * BS ] , ...) ; } # pragma omp task depend ( in : A [0: N ]) swap (A , ...) ; What we want ! Philippe Virouleau (INRIA) KASTORS Monday, September 29th 19 / 24 Ongoing works Next kernel from Plasma Plasma’s LU factorization (DGETRF) double * A ; int BS , n_blocks ; for ( int i = 0; i < n_blocks ; i ++) { # pragma omp task depend ( out : A [ i * BS : BS ]) dgemm_f2 ( A [ i * BS ] , ...) ; } # pragma omp task depend ( in : A [0: N ]) swap (A , ...) ; What about the norm ? List items used in depend clauses of the same task or sibling tasks must indicate identical storage or disjoint storage. What we want ! Philippe Virouleau (INRIA) KASTORS Monday, September 29th 19 / 24 Ongoing works Next kernel from Plasma Plasma’s LU factorization (DGETRF) double * A ; int BS , n_blocks ; for ( int i = 0; i < n_blocks ; i ++) { # pragma omp task depend ( out : A [ i * BS : BS ]) dgemm_f2 ( A [ i * BS ] , ...) ; } # pragma omp task depend ( in : A [0: N ]) swap (A , ...) ; What about the norm ? List items used in depend clauses of the same task or sibling tasks must indicate identical storage or disjoint storage. What we want ! Issues • Runtime-known number of deps. • Overlapping memory regions Philippe Virouleau (INRIA) KASTORS Monday, September 29th 19 / 24 Ongoing works Next kernel from Plasma Plasma’s LU factorization (DGETRF) double * A ; int BS , n_blocks ; for ( int i = 0; i < n_blocks ; i ++) { # pragma omp task depend ( cw : A [0: N ]) dgemm_f2 ( A [ i * BS ] , ...) ; } Another possible solution Using Concurrent Write (Mode CW) # pragma omp task depend ( in : A [0: N ]) swap (A , ...) ; What we want ! Issues • Runtime-known number of deps. • Overlapping memory regions Philippe Virouleau (INRIA) KASTORS Monday, September 29th 19 / 24 Ongoing works Next kernel from Plasma Plasma’s LU factorization (DGETRF) double * A ; int BS , n_blocks ; for ( int i = 0; i < n_blocks ; i ++) { # pragma omp task depend ( out : A [ i * BS : BS ]) dgemm_f2 ( A [ i * BS ] , ...) ; } # pragma omp task \ depend ( in :{ A [ i * BS : BS ] for i in [0: n_blocks -1]}) swap (A , ...) ; Another original solution Using "variadic" dependencies (...) What we want ! Issues • Runtime-known number of deps. • Overlapping memory regions Philippe Virouleau (INRIA) KASTORS Monday, September 29th 19 / 24 Ongoing works Next kernel from Plasma Plasma’s LU factorization (DGETRF) Some way to deal with it In Atapascan [PACT’98] Use of CW "Concurrent Write" In Quark Mode "INOUT-GATHERV" : wait for all in,out,inout but not for others GATHERV In OMPSS Dependencies on memory regions Philippe Virouleau (INRIA) KASTORS Monday, September 29th 20 / 24 Ongoing works KASTORS Improvements to KASTORS OpenMP • Update KASTORS with OpenMP 4.X specifications Others • Add real application kernels We welcome your suggestion http://kastors.gforge.inria.fr Philippe Virouleau (INRIA) KASTORS Monday, September 29th 21 / 24 Conclusion Plan 1 OpenMP 4.0 dependent tasks 2 KASTORS benchmarks overview 3 KASTORS performances 4 Ongoing works 5 Conclusion Philippe Virouleau (INRIA) KASTORS Monday, September 29th 22 / 24 Conclusion Conclusion Results are quite encouraging • Better performances with dependent tasks • OpenMP 4.0 is a good alternative to dedicated runtimes (eg : Plasma-Quark) On the web • Release 1.0 for IWOMP • http://kastors.gforge.inria.fr The K’Star Inria project • Source-to-source OpenMP compiler targeting Kaapi and StarPU runtimes • Based on Clang Philippe Virouleau (INRIA) KASTORS Monday, September 29th 23 / 24 Conclusion Questions ? Philippe Virouleau (INRIA) KASTORS Monday, September 29th 24 / 24
© Copyright 2024 ExpyDoc