Evaluation of OpenMP dependent tasks with the KASTORS

Evaluation of OpenMP dependent tasks with the
KASTORS benchmark suite
Philippe Virouleau 1 Pierrick Brunet 1 François Broquedis 2
Nathalie Furmento 3 Samuel Thibault 4 Olivier Aumage 1
Thierry Gautier 1
1 Inria
- RUNTIME and MOAIS teams
2 Grenoble
institute of Technology
3 CNRS
4 University
of Bordeaux
Monday, September 29th
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
1 / 24
Plan
1
OpenMP 4.0 dependent tasks
2
KASTORS benchmarks overview
3
KASTORS performances
4
Ongoing works
5
Conclusion
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
2 / 24
OpenMP 4.0 dependent tasks
Plan
1
OpenMP 4.0 dependent tasks
Example
Compiler and runtime support
Our contribution
2
KASTORS benchmarks overview
3
KASTORS performances
4
Ongoing works
5
Conclusion
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
3 / 24
OpenMP 4.0 dependent tasks
Example
Reminder on OpenMP dependent tasks
How does it work ?
Independent tasks
void task_example ( void ) {
type a , b ;
# pragma omp parallel
# pragma omp single
{
# pragma omp task
work (& a ) ;
# pragma omp task
work (& b ) ;
# pragma omp taskwait
# pragma omp task
process ( a ) ;
# pragma omp task
process ( b ) ;
}
}
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
4 / 24
OpenMP 4.0 dependent tasks
Example
Reminder on OpenMP dependent tasks
How does it work ?
Independent tasks
void task_example ( void ) {
type a , b ;
# pragma omp parallel
# pragma omp single
{
# pragma omp task
work (& a ) ;
# pragma omp task
work (& b ) ;
# pragma omp taskwait
# pragma omp task
process ( a ) ;
# pragma omp task
process ( b ) ;
}
}
Philippe Virouleau (INRIA)
Dependent tasks
void task_example ( void ) {
type a , b ;
# pragma omp parallel
# pragma omp single
{
# pragma omp task depend ( out : a )
work (& a ) ;
# pragma omp task depend ( out : b )
work (& b ) ;
# pragma omp task depend ( in : a )
process ( a ) ;
# pragma omp task depend ( in : b )
process ( b ) ;
}
}
KASTORS
Monday, September 29th
4 / 24
OpenMP 4.0 dependent tasks
Example
Reminder on OpenMP dependent tasks
How does it work ?
Independent tasks
void task_example ( void ) {
type a , b ;
# pragma omp parallel
# pragma omp single
{
# pragma omp task
work (& a ) ;
# pragma omp task
work (& b ) ;
# pragma omp taskwait
# pragma omp task
process ( a ) ;
# pragma omp task
process ( b ) ;
}
}
Dependent tasks
void task_example ( void ) {
type a , b ;
# pragma omp parallel
# pragma omp single
{
# pragma omp task depend ( out : a )
work (& a ) ;
# pragma omp task depend ( out : b )
work (& b ) ;
# pragma omp task depend ( in : a )
process ( a ) ;
# pragma omp task depend ( in : b )
process ( b ) ;
}
}
Describe how data are used vs Synchronization points.
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
4 / 24
OpenMP 4.0 dependent tasks
Compiler and runtime support
Compiler support
GCC/libGOMP
• Support for most of OpenMP 4.0 in GCC 4.9
• Compiler passes all dependencies to GOMP_task
Clang + libIOMP
• Support for most of OpenMP 4.0 in a separate Intel project
Clang-omp (Merge with master is ongoing)
• Compiler passes all dependencies to __kmpc_omp_task_with_deps
Handling of dependencies
• Dependencies stored in parent task
• New task => look for unresolved dependencies
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
5 / 24
OpenMP 4.0 dependent tasks
Compiler and runtime support
Runtime support
GCC/libKOMP [IWOMP’12]
• GOMP replacement using Kaapi runtime system (Inria)
• Dependent tasks supported
• Compiler support from GCC 4.9
OMPSS [BSC]
• Compiler support : Mercurium
• Runtime support : Nanos
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
6 / 24
OpenMP 4.0 dependent tasks
Our contribution
Motivations
Objectives
• Evaluate the OpenMP dependent task paradigm (overhead,
scalability...)
• Compare runtimes
Existing benchmark suite
• BOTS (Barcelona OpenMP Task Suite [BSC])
• NAS, SPECOMP, RODINIA ...
• No benchmark for dependent tasks yet !
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
7 / 24
OpenMP 4.0 dependent tasks
Our contribution
Motivations
Objectives
• Evaluate the OpenMP dependent task paradigm (overhead,
scalability...)
• Compare runtimes
Existing benchmark suite
• BOTS (Barcelona OpenMP Task Suite [BSC])
• NAS, SPECOMP, RODINIA ...
• No benchmark for dependent tasks yet !
KASTORS
Benchmark suite for OpenMP dependent tasks !
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
7 / 24
KASTORS benchmarks overview
Plan
1
OpenMP 4.0 dependent tasks
2
KASTORS benchmarks overview
3
KASTORS performances
4
Ongoing works
5
Conclusion
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
8 / 24
KASTORS benchmarks overview
OpenMP Dependent Task Suite
Applications
• Poisson2D
• SparseLU (BOTSa , BSC)
• Strassen (BOTS, BSC)
• Cholesky (Plasma, ICL/UTK)
• QR (Plasma, ICL/UTK)
a
Barcelona OpenMP Task Suite
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
9 / 24
KASTORS benchmarks overview
OpenMP Dependent Task Suite
Poisson2D
Solve Poisson equation on a domain divided into a regular grid
SparseLU (BOTS, BSC)
LU factorization of a sparse matrix
Strassen (BOTS, BSC)
Dense matrix multiplication by bloc decomposition
Cholesky (Plasma, ICL/UTK)
Cholesky tiled matrix factorization (dpotrf)
QR (Plasma, ICL/UTK)
QR tiled matrix factorization (dgeqrf)
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
10 / 24
KASTORS performances
Plan
1
OpenMP 4.0 dependent tasks
2
KASTORS benchmarks overview
3
KASTORS performances
Set-up
GCC/Clang results
4
Ongoing works
5
Conclusion
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
11 / 24
KASTORS performances
Set-up
How we conducted experiments
Machines
• AMD48 : 48 cores AMD Magny Cours processors, 256GB of memory
• INTEL32 : 32 cores Intel Xeon E5-4620 processors, 380GB of memory
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
12 / 24
KASTORS performances
Set-up
How we conducted experiments
Machines
• AMD48 : 48 cores AMD Magny Cours processors, 256GB of memory
• INTEL32 : 32 cores Intel Xeon E5-4620 processors, 380GB of memory
Compilers
• GCC + libGOMP 4.9, git-mirror commit 6ed3847ffd0
• Clang + libIOMP
llvm commit 233b1e3f034
clang-ompa commit 7580e521e51f
libIOMP : released 20131209
a
http://clang-omp.github.io/
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
12 / 24
KASTORS performances
GCC/Clang results
SparseLU, Strassen, and Poisson2D on AMD48
A bit more on speed up
• Speed-up for each compiler is computed against its respective
sequential time.
• Times are in seconds
Compiler
GCC
CLANG
Philippe Virouleau (INRIA)
Poisson2D
5.28
5.42
SparseLU
94.36
95.08
KASTORS
Strassen
269.93
234.16
Monday, September 29th
13 / 24
KASTORS performances
GCC/Clang results
SparseLU, Strassen, and Poisson2D on AMD48
45
GCC-task
GCC-taskdep
Clang-task
Clang-taskdep
40
Speed up
35
30
25
20
15
10
5
0
Philippe Virouleau (INRIA)
SparseLU
(64x128)
Strassen
(8192)
KASTORS
Poisson2D
(8192)
Monday, September 29th
13 / 24
KASTORS performances
GCC/Clang results
SparseLU, Strassen, and Poisson2D on AMD48
45
GCC-task
GCC-taskdep
Clang-task
Clang-taskdep
Clang-for
40
Speed up
35
30
25
20
15
10
5
0
Philippe Virouleau (INRIA)
SparseLU
(64x128)
Strassen
(8192)
KASTORS
Poisson2D
(8192)
Monday, September 29th
13 / 24
KASTORS performances
GCC/Clang results
SparseLU, Strassen, and Poisson2D on AMD48
45
GCC-task
GCC-taskdep
Clang-task
Clang-taskdep
Clang-for
40
Speed up
35
30
25
20
15
10
5
0
SparseLU
(64x128)
Strassen
(8192)
Poisson2D
(8192)
Conclusion
• Dependent tasks > independent tasks
• Poisson2D : Scheduling issue (for is better).
Challenge is data locality
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
13 / 24
KASTORS performances
GCC/Clang results
SparseLU, Strassen, and Poisson2D on INTEL192
SparseLU (64x256) taskdep
SparseLU task
Strassen (8192) taskdep
Strassen task
100
Speed up
80
60
40
20
0
8
16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 168 176 184 192
Number of cores
INTEL192
• 24 Intel Xeon E5-4640 processors (total 192 cores), 756GB of memory
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
13 / 24
KASTORS performances
GCC/Clang results
QR (dgeqrf) and Cholesky (dpotrf) on AMD48
A word about QR and Cholesky
• Dense Matrix factorizations
• Linear algebra algorithms
• Taken from Plasma [UTK]
A word about Plasma
Execution :
• Static scheduling
• Dynamic scheduling using Quark (dedicated runtime, dataflow)
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
14 / 24
KASTORS performances
GCC/Clang results
QR (dgeqrf) and Cholesky (dpotrf) on AMD48
250
GigaFlops
200
Plasma Static
Plasma Dynamic
GCC-OMP
Clang-OMP
150
100
50
0
DGEQRF DPOTRF
DGEQRF DPOTRF
DGEQRF DPOTRF
DGEQRF DPOTRF
M,B : 2048,128
M,B : 4096,128
M,B : 8192,224
M,B : 16384,256
• Plasma static scheduling better for small sizes
• OpenMP runtimes achieve same or better performances than Quark (!)
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
15 / 24
Ongoing works
Plan
1
OpenMP 4.0 dependent tasks
2
KASTORS benchmarks overview
3
KASTORS performances
4
Ongoing works
Next kernel from Plasma
KASTORS
5
Conclusion
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
16 / 24
Ongoing works
Next kernel from Plasma
Plasma’s LU factorization (DGETRF)
double * A ; int BS , n_blocks ;
for ( int i = 0; i < n_blocks ; i ++) {
# pragma omp task depend ( out : A [0: N ])
dgemm_f2 ( A [ i * BS ] , ...) ;
}
# pragma omp task depend ( in : A [0: N ])
swap (A , ...) ;
Legend
: Execution order
: Block written
: Depend clause
Issues
Bad performances due
to dependencies
What we do...
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
17 / 24
Ongoing works
Next kernel from Plasma
LU (dgetrf) on AMD48
200
Plasma Static
Plasma Dynamic
GCC-OMP
Clang-OMP
GigaFlops
150
100
50
0
Philippe Virouleau (INRIA)
2048,128
4096,128
8196,224
16384,256
DGETRF
KASTORS
Monday, September 29th
18 / 24
Ongoing works
Next kernel from Plasma
Plasma’s LU factorization (DGETRF)
What we want !
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
19 / 24
Ongoing works
Next kernel from Plasma
Plasma’s LU factorization (DGETRF)
double * A ; int BS , n_blocks ;
for ( int i = 0; i < n_blocks ; i ++) {
# pragma omp task depend ( out : A [ i * BS : BS ])
dgemm_f2 ( A [ i * BS ] , ...) ;
}
# pragma omp task depend ( in : A [0: N ])
swap (A , ...) ;
What we want !
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
19 / 24
Ongoing works
Next kernel from Plasma
Plasma’s LU factorization (DGETRF)
double * A ; int BS , n_blocks ;
for ( int i = 0; i < n_blocks ; i ++) {
# pragma omp task depend ( out : A [ i * BS : BS ])
dgemm_f2 ( A [ i * BS ] , ...) ;
}
# pragma omp task depend ( in : A [0: N ])
swap (A , ...) ;
What about the norm ?
List items used in depend clauses of the
same task or sibling tasks must indicate
identical storage or disjoint storage.
What we want !
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
19 / 24
Ongoing works
Next kernel from Plasma
Plasma’s LU factorization (DGETRF)
double * A ; int BS , n_blocks ;
for ( int i = 0; i < n_blocks ; i ++) {
# pragma omp task depend ( out : A [ i * BS : BS ])
dgemm_f2 ( A [ i * BS ] , ...) ;
}
# pragma omp task depend ( in : A [0: N ])
swap (A , ...) ;
What about the norm ?
List items used in depend clauses of the
same task or sibling tasks must indicate
identical storage or disjoint storage.
What we want !
Issues
• Runtime-known
number of deps.
• Overlapping
memory regions
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
19 / 24
Ongoing works
Next kernel from Plasma
Plasma’s LU factorization (DGETRF)
double * A ; int BS , n_blocks ;
for ( int i = 0; i < n_blocks ; i ++) {
# pragma omp task depend ( cw : A [0: N ])
dgemm_f2 ( A [ i * BS ] , ...) ;
}
Another possible solution
Using Concurrent Write (Mode
CW)
# pragma omp task depend ( in : A [0: N ])
swap (A , ...) ;
What we want !
Issues
• Runtime-known
number of deps.
• Overlapping
memory regions
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
19 / 24
Ongoing works
Next kernel from Plasma
Plasma’s LU factorization (DGETRF)
double * A ; int BS , n_blocks ;
for ( int i = 0; i < n_blocks ; i ++) {
# pragma omp task depend ( out : A [ i * BS : BS ])
dgemm_f2 ( A [ i * BS ] , ...) ;
}
# pragma omp task \
depend ( in :{ A [ i * BS : BS ] for i in [0: n_blocks -1]})
swap (A , ...) ;
Another original solution
Using "variadic" dependencies
(...)
What we want !
Issues
• Runtime-known
number of deps.
• Overlapping
memory regions
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
19 / 24
Ongoing works
Next kernel from Plasma
Plasma’s LU factorization (DGETRF)
Some way to deal with it
In Atapascan [PACT’98]
Use of CW "Concurrent Write"
In Quark
Mode "INOUT-GATHERV" : wait for all in,out,inout but not for others
GATHERV
In OMPSS
Dependencies on memory regions
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
20 / 24
Ongoing works
KASTORS
Improvements to KASTORS
OpenMP
• Update KASTORS with OpenMP 4.X specifications
Others
• Add real application kernels
We welcome your suggestion
http://kastors.gforge.inria.fr
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
21 / 24
Conclusion
Plan
1
OpenMP 4.0 dependent tasks
2
KASTORS benchmarks overview
3
KASTORS performances
4
Ongoing works
5
Conclusion
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
22 / 24
Conclusion
Conclusion
Results are quite encouraging
• Better performances with dependent tasks
• OpenMP 4.0 is a good alternative to dedicated runtimes (eg :
Plasma-Quark)
On the web
• Release 1.0 for IWOMP
• http://kastors.gforge.inria.fr
The K’Star Inria project
• Source-to-source OpenMP compiler targeting Kaapi and StarPU
runtimes
• Based on Clang
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
23 / 24
Conclusion
Questions ?
Philippe Virouleau (INRIA)
KASTORS
Monday, September 29th
24 / 24