
Automatic Hybrid MPI+OpenMP Code Generation with llc*

Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande
Dept. de E.I.O. y Computación, Universidad de La Laguna, 38271–La Laguna, Spain
{rreyes,ajdorta,falmeida,fsande}@ull.es

* This work was partially funded by the EC (FEDER) and the Spanish MEC (Plan Nacional de I+D+I, TIN2008-06570-C04-03).
Abstract. The evolution of the architecture of massively parallel computers is progressing toward systems with a hierarchical hardware design, where each node is a shared memory system with several multi-core CPUs. There is consensus in the HPC community that the efficiency of the programming paradigms used to exploit massive parallelism must be increased if we are to make full use of the advantages offered by these new architectures. llc is a language where parallelism is expressed using OpenMP compiler directives. In this paper we focus our attention on the automatic generation of hybrid MPI+OpenMP code from the llc annotated source code.
Keywords: MPI, OpenMP, llc, Hybrid parallel programming, multi-core
1 Introduction
The arrival of multi-core processors on the market [?] is producing radical
changes in the architecture of massively parallel computers. High-Performance
Computing (HPC) platforms with many thousands of cores will be deployed in
the coming years [?,?,?]. These new systems present a hierarchical hardware
design: individual nodes in these systems will be heterogeneous multi-threading,
multi-core systems with a deep memory hierarchy. The new HPC systems will
deliver a sustained performance of one to two petaflops for many applications.
Although the computational power of new generation processors has been
increasing continuously, the new systems still lack programmability. Application
programmers will have to address several problems. Some of these problems are
present now in the current architectures, but new ones are arising. The number
of computing nodes is so huge that scalability is becoming a key issue. The
number of control threads executing simultaneously will be much greater than it
is nowadays. Resource sharing between threads on a much larger scale is another
point that will require attention. Consequently, the code will need to exhibit a
greater amount of parallelism.
The intense changes in the newest architectures are arriving so fast that
we are failing to take advantage of the increased power of the systems. The
efficiency of the programming paradigms and tools we use to exploit massive
parallelism must be increased [?]. The efficient use of these new systems is therefore an important challenge. Moreover, the scientists and engineers responsible for the applications are, in general, seldom experts in HPC, and they are usually reluctant either to introduce changes in their codes or to learn new programming paradigms. They need solutions in the form of effective automatic parallelization tools and libraries. When the target architecture is a heterogeneous cluster or a multi-core processor, current parallelization tools usually offer unsatisfactory results, mainly due to the continuous changes, complexity and diversity of these parallel platforms, which pose serious obstacles to the development of efficient and stable solutions.
In this paper, we aim to address the problem of automatic code generation for
hybrid architectures from high level standard languages. Starting from an llc
[?,?] source code, we can generate hybrid MPI/OpenMP code ready to be compiled on a hierarchical architecture using standard tools. As a noteworthy feature, we
can mention that the generation process is quite simple and the code generated
is as efficient as ad-hoc implementations. The rest of the paper is organized as
follows: Section ?? presents the motivation of the work. In Sect. ??, through
the use of an example, we illustrate the llc programming model and expose
the simplicity of our method. Section ?? is dedicated to explaining the translation process that llCoMP performs. Several computational results obtained for
OpenMP, MPI and llc codes are shown and discussed in Sect. ??. Finally, Sect.
?? offers conclusions and comments on future work.
2 Motivation
While MPI [?] and OpenMP [?] are nowadays the de-facto standard tools for
programming distributed and shared memory architectures, respectively, both
should evolve to adapt to the many-core era. According to [?], these new systems
will support a variety of programming models, including the “MPI everywhere”
model, which is the most common today. However, according to [?], whereas
MPI has proven to be an excellent means of expressing program parallelism when
nodes have a small number of cores, future architectures may make this a tough
proposition. In particular, MPI does not provide the programmer with the means
to conserve memory or to directly modify the code to benefit from resource sharing and to avoid its negative implications. Both authors agree, however, that in
the current scenario with the presence of hybrid architectures, one possible step
forward is to systematically combine MPI with OpenMP. Such a combined programming model is usually called a hybrid model [?,?,?], and is based on the
use of multiple threads in each MPI process. Hybrid programming techniques
have clear advantages over the classic approach on the new hardware. They allow for better control of application granularity, and they improve scalability and load balancing. The hybrid programming model has found mixed success to date, with many experiments showing little benefit and others showing promise. The successful cases are generally those in which OpenMP is used to complement MPI, for example by providing better support for load balancing, for adaptive computations, or for sharing large data tables.
llc [?,?] is a high-level parallel language that aims to exploit the best features
of both MPI and OpenMP. llc follows the simplicity of OpenMP and avoids its
well-known drawbacks. While in OpenMP we can start from a sequential code
and parallelize it incrementally through the use of directives, the code cannot be
ported to either distributed memory architectures or to the large new systems. In
llc the code annotated with parallel directives is compiled with llCoMP, the llc
compiler-translator. llCoMP produces an efficient and portable parallel source
code, valid for both shared and distributed memory architectures. Figure ??
illustrates this process. Different directives have been designed and implemented
in llc to support common parallel constructs: forall, sections, pipelines and task
queues.
Fig. 1. llCoMP code generation
The main contribution of this work is the design and development of the new
llCoMP backend that is able to produce hybrid code for multi-core architectures
from the llc annotated source. Some consequences of this contribution are:
– Scientists and engineers can generate efficient code without learning new
material, just by using the well-known OpenMP programming paradigm.
– Through the use of llc, many OpenMP codes can be ported to large multi-/many-core systems without loss of efficiency.
– The new backend of the llc compiler represents an improvement with respect to its former version.
– llc is best suited for non-expert users as a semi-automatic tool for generating
hybrid MPI+OpenMP parallel code.
In our approach, MPI handles inter-node communications, while OpenMP is
used inside each SMP node. The preliminary results we present indicate a clear
improvement over the previous implementation of the compiler when targeting
the new architectures.
3 Example
The OTOSP (One Thread is One Set of Processors) model is the distributed
memory computational model underlying the llc language. It is essential to
know OTOSP in order to understand the llc implementation. The key reference
on the model is [?].
Probably the first example of any parallel API is the “computing π” program. The code in Listing ?? shows the calculation of π in llc. This well-known code is based on the relationship $\pi = \int_0^1 \frac{4}{1+t^2}\,dt$. Splitting the interval $[0,1]$ into N segments of size w, this integral can be approximated as the sum of the areas of the rectangles delimited by the central point of a segment t and the curve value at this point.
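For concreteness, the midpoint-rule sum that the loop in Listing 1 below accumulates is the following (the segment width w of the text is the variable h in the code):

$$\pi = \int_0^1 \frac{4}{1+t^2}\,dt \;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+t_i^2}\, h, \qquad t_i = \Bigl(i + \tfrac{1}{2}\Bigr) h, \quad h = \frac{1}{N}.$$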
When the code is compiled with llCoMP, the iterations of the loop (line 5) are distributed among the processors. The private clause in line 3 is kept only for compatibility with OpenMP, since all storage is private in the OTOSP computing model. The OpenMP reduction clause indicates that all the local values of the variable pi have to be added at the end of the loop. This operation implies a collective communication among all the processors in the OTOSP group and the update of the variable with the result of the reduction operation. Since type analysis has not been included in llCoMP, the type of the reduction variable has to be specified in line 4.
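Conceptually, this collective step is equivalent to a sum all-reduce over the processor group. The following hand-written sketch illustrates the semantics only; the function name is ours, and the code actually emitted by llCoMP is the explicit Recv/Bcast sequence shown later in Listing 2.

#include <mpi.h>

/* Hand-written illustration of what the reduction(+:pi) clause implies in the
 * distributed-memory translation: every processor of the group contributes its
 * local partial sum and all of them obtain the global value.  This is NOT the
 * code generated by llCoMP (see Listing 2 for that). */
double llc_style_sum_reduction(double local_pi, MPI_Comm group)
{
    double pi;
    MPI_Allreduce(&local_pi, &pi, 1, MPI_DOUBLE, MPI_SUM, group);
    return pi;
}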
Notice that an OpenMP code can be adapted to llc with very few changes. In fact, all the OpenMP 2.5 directives and clauses are recognized by llCoMP, and therefore the same source holds three versions of the code: sequential, OpenMP and llc/MPI. We need only choose the proper compiler to obtain the corresponding binary.
1  h = 1.0 / N;
2  pi = 0.0;
3  #pragma omp parallel for private(t) reduction(+:pi)
4  #pragma llc reduction type (double)
5  for (i = 0; i < N; i++) {
6    t = (i + 0.5) * h;
7    pi = pi + 4.0 / (1.0 + t * t);
8  }
9  pi *= h;

Listing 1. Computing π in parallel with llc
We have selected the “computing π” code due to its simplicity to show the
main features of llc. Obviously, algorithms like this one, with a reduced amount
of communications, are those best suited to attain optimal performance when
parallelized with llc. Apart from the parallel for, llc also provides support for the most common parallel constructs: sections, pipelines and farms [?,?].
The code in Listing ?? shows part of the hybrid code automatically generated
by llCoMP for the “computing π” code.
4 The translation process
Studying a wide set of parallel applications, we can conclude that most of them
can be classified according to the parallelization paradigm used [?]. Moreover, the
parallel code itself (data distribution, communications, etc.) is quite similar in
applications following the same paradigm. llCoMP takes advantage of this feature
to generate code. We have named these portions of reusable code patterns, and
llCoMP uses two kinds of patterns in order to process the llc parallel constructs.
Static patterns contain the pure parallel code. They do not depend directly on the application, but rather on the parallelization paradigm specified using the llc and/or OpenMP constructs. These codes are implemented in llCoMP using the target language (i.e., MPI) and they encode operations such as initialization of the parallel environment, resource distribution, data communications, load balancing, etc. The compiler adapts a static pattern to a specific translation using special tags in the pattern, which the compiler fills with information coming from the source code directives.
To ease the compiler code maintenance, static patterns have been split into
several files, each file implementing a specific stage of the entire pattern: initialization, execution, communication and/or finalization. In addition to the general
case, llCoMP implements specialized code for some common situations. When it
detects any of these situations, it uses the optimized code. For this reason, each
of the stages can be supported through different files. The good performance
delivered for the general case can be improved if an optimization is detected.
As a result, each paradigm and its static pattern is implemented by several text
files.
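As an illustration only, a static pattern for the parallel-for paradigm can be pictured as a C/MPI skeleton whose placeholders are filled by the compiler. The @...@ tag names below are invented for this sketch (the real pattern format is internal to llCoMP), and the skeleton is schematic rather than compilable:

/* Schematic static pattern for the parallel-for paradigm.  The @TAG@ markers
 * are hypothetical placeholders used only for this illustration; llCoMP fills
 * its own tags with information taken from the source directives (loop
 * bounds, loop body, reduction handling, ...). */

/* @INIT@       dynamic-pattern code inserted here: buffers, sizes, ...      */

for (i = @START@ + llc_F[llc_grp_save];
     i < @START@ + llc_F[llc_grp_save] + LLC_PROCSGRP(llc_grp_save); i++) {
    @LOOP_BODY@                     /* user loop body copied from the source */
}

/* @REDUCTION@  communication stage, e.g. the Recv/Bcast code of Listing 2   */
/* @FINALIZE@   clean-up of temporary buffers                                */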
The static patterns need some extra code to work. This additional code handles the operations on the data structures needed to build the translation, e.g. the management of the buffers used during communications. This code is sequential and specific to each application, and therefore cannot be embedded in the static patterns. llCoMP uses dynamic patterns to produce this complementary code. The dynamic pattern code is generated by the compiler during the compilation process and stored in temporary files. The static patterns use special marks to indicate to the compiler the point where each temporary file has to be inserted to produce the target code. The dynamic patterns are carefully tuned to optimize parallel performance. For instance, data packing/unpacking greatly reduces communications.
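A minimal hand-written sketch of the packing idea follows; the function and variable names are invented for the example and this is not llCoMP output. Two values are copied into one byte buffer so that a single message replaces two (Listing 2 shows the same packing style in the code actually generated for the π reduction).

#include <mpi.h>
#include <string.h>

/* Hand-written sketch of the packing performed by the dynamic patterns:
 * several values travel in one message instead of one message per value. */
void send_two_packed(double a, double b, int dest, MPI_Comm comm)
{
    char buf[2 * sizeof(double)];
    char *p = buf;

    memcpy(p, &a, sizeof(double));  p += sizeof(double);  /* pack first value  */
    memcpy(p, &b, sizeof(double));                        /* pack second value */

    /* one send instead of two */
    MPI_Send(buf, (int) sizeof(buf), MPI_BYTE, dest, 0, comm);
}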
It is through the combination of both static and dynamic patterns that llCoMP produces the target code. The compilation process is carried out by llCoMP without user intervention: all the information needed is gathered from the parallel constructs in the source code, and from it the compiler selects the best combination of static and dynamic patterns, including any available optimization, to build the target code.
#pragma omp parallel for private(t) reduction(+:pi)
for (i = (0) + llc_F[llc_grp_save];
     i < (0) + llc_F[llc_grp_save] + LLC_PROCSGRP(llc_grp_save); i++) {
  {
    t = (i + 0.5) * h;
    pi = pi + 4.0 / (1.0 + t * t);
  };
}
...
if (LLC_NAME == 0) {
  for (llc_i = 1; llc_i < LLC_NUMPROCESSORS; llc_i++) {
    MPI_Recv(llc_buf, llc_buf_size, MPI_BYTE, llc_i,
             LLC_TAG_REDUCE_DATA, *llc_CurrentGroup, &llc_status);
    llc_buf_ptr = llc_buf;
    pi += (*(double *) llc_buf_ptr);
    llc_buf_ptr += sizeof(double);
  }
  llc_buf_ptr = llc_buf;
  memcpy(llc_buf_ptr, &pi, sizeof(double));
  llc_buf_ptr += sizeof(double);
  MPI_Bcast(llc_buf, llc_buf_size, MPI_BYTE, 0, *llc_CurrentGroup);
}
else {
  llc_buf_ptr = llc_buf;
  memcpy(llc_buf_ptr, &pi, sizeof(double));
  llc_buf_ptr += sizeof(double);
  MPI_Send(llc_buf, llc_buf_size, MPI_BYTE, 0,
           LLC_TAG_REDUCE_DATA, *llc_CurrentGroup);
  MPI_Bcast(llc_buf, llc_buf_size, MPI_BYTE, 0, *llc_CurrentGroup);
  llc_buf_ptr = llc_buf;
  memcpy(&pi, llc_buf_ptr, sizeof(double));
  llc_buf_ptr += sizeof(double);
}

Listing 2. The llc translation of the “computing π” code
The OTOSP computational model guarantees that each processor owns at
least a part of the main dataset. We are therefore able to add a new level of
parallelism by simply re-parallelizing the annotated loop. The tags in the static
patterns allow for the injection of this second parallel level to generate hybrid
code. In this nested level, we use a shared memory approach to parallelization,
using pure OpenMP code for the previously llc-annotated loop. llCoMP builds
the OpenMP code to be injected from the source code.
For the distribution of work between MPI processes and OpenMP threads, we have chosen to use one MPI process per computation node and shared-memory threads across the cores of each node. According to the classification in [?], we use a hybrid masteronly scheme. This allows us to take advantage of the shared memory in the cores, in particular of their shared caches, thus increasing the performance of the loops.
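The structure of the resulting hybrid code can be sketched by hand as follows. This is a simplified illustration of the masteronly scheme, not the output of llCoMP: one MPI process per node owns a contiguous block of iterations, the node's cores share that block through OpenMP, and MPI calls are issued only outside the parallel region, so MPI_THREAD_FUNNELED support is sufficient.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hand-written sketch of the hybrid "masteronly" scheme for the pi loop:
 * MPI between nodes, OpenMP inside each node, communication outside the
 * OpenMP parallel region.  Simplified illustration, not llCoMP output. */
int main(int argc, char *argv[]) {
    const int N = 1000000;
    int rank, size, provided, i, lo, hi;
    double h = 1.0 / N, t, local_pi = 0.0, pi = 0.0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    lo = rank * (N / size);                       /* contiguous block per node */
    hi = (rank == size - 1) ? N : lo + N / size;

    /* second level of parallelism: the node's cores share the block */
    #pragma omp parallel for private(t) reduction(+:local_pi)
    for (i = lo; i < hi; i++) {
        t = (i + 0.5) * h;
        local_pi += 4.0 / (1.0 + t * t);
    }

    /* inter-node communication, performed by the master thread only */
    MPI_Allreduce(&local_pi, &pi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    pi *= h;

    if (rank == 0) printf("pi ~= %.12f\n", pi);
    MPI_Finalize();
    return 0;
}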
Since our target systems in this work are multi-node clusters, the most efficient approach is to use a single processor per node to communicate over the inter-node network, since one processor is already capable of using the network completely, as shown in [?]. The code generated by llCoMP handles the MPI communications and the synchronization of data between nodes. This code would not benefit from using more than one CPU, as it consists mainly of synchronization routines.
5 Experimental Results
In this section we present preliminary results of our implementation, showing the performance obtained for four algorithms on two different multi-core systems.
The algorithms we used are the “computing π” code introduced in Sect. ??, a Molecular Dynamics (MD) code, the Mandelbrot Set computation and the Conjugate Gradient (CG) algorithm. The MD code is an llc implementation of the velocity Verlet algorithm [?] for Molecular Dynamics simulations. It employs an iterative numerical procedure to obtain an approximate solution whose accuracy is determined by the time step of the simulation. The Mandelbrot Set is the convergence domain of the complex series defined by $z_n = z_{n-1}^2 + C$. The exact area of the set is an open question in mathematics. Using a Monte Carlo method, the algorithm computes an estimate of the area of the set. The CG algorithm is one
of the kernels in the NAS Parallel Benchmarks [?]. CG uses the inverse power
method to find an estimate of the largest eigenvalue of a symmetric positive definite sparse matrix with a random pattern of nonzeros. The algorithm performs
a certain number of iterations, and for each of them, the solution z to the linear
system of equations Az = x is approximated using the conjugate gradient (CG)
method, where A denotes the sparse matrix of order N. The source code for all
four algorithms is available at the llc project home page [?].
We evaluated the above algorithms on two different systems, named Tajinaste and Verode. Both systems run Linux CentOS 5.3 with kernel 2.6. The codes were compiled using gcc 4.1 and the OpenMPI MPI implementation. Tajinaste is an IBM cluster with 14 dual-core Opteron processors interconnected with a Gigabit Ethernet network. The results we present for this cluster are limited to 12 cores to guarantee exclusive use of the system. Verode is a two quad-core Opteron system. On this system, we used the same OpenMPI implementation with the shared memory intercommunication device.
Fig. 2. Speed-up of the Mandelbrot set computation on Tajinaste
Figure ?? shows the results for four different implementations of the Mandelbrot set computation algorithm on Tajinaste. The labels in the figure correspond to the following implementations:
– MPI: a pure MPI implementation.
– LLC-MPI: llCoMP generates pure MPI code.
– LLC-HYB: llCoMP generates hybrid OpenMP-MPI code.
– HYB: an ad-hoc hybrid implementation.
In every case the code iterates over 16368 points in the complex plane. This
algorithm is usually taken as an academic example of a situation with an intense
load imbalance: each complex point analyzed requires a widely varying number
of iterations of the main loop to converge. In this situation, Fig. ?? reflects the
benefit of the LLC-HYB approach in relation to the LLC-MPI case. The parallel
execution time improvement is 17% for 14 processors. The llc versions
exhibit results comparable to their ad-hoc counterparts, with a much smaller
coding effort.
Tables ?? and ?? show the results for the MD, computing π and CG algorithms in Tajinaste and Verode, respectively. In both tables, row SEQ shows
the time (seconds) corresponding to the sequential version of the code, while
the remaining data correspond to the speedups achieved by llc implementations. Column MPI corresponds to pure MPI code generation by llCoMP and
column HYB reflects the results using the new llCoMP backend. In the case of the computing π and MD algorithms, no significant improvement is evident. Nevertheless, in the case of the CG code, again the results show the benefit achieved with the hybrid code. The reason behind this improvement is the granularity of the algorithm: the CG code parallelizes several loops, some of them used for initializations. This fine-grain parallelism penalizes the pure MPI code version, while the hybrid translation takes advantage of it.

              MD                  π
#Proc.     MPI      HYB       MPI      HYB
SEQ           38.590              9.184
  1       0.984    0.950     0.994    0.997
  2       1.967    1.903     1.993    1.989
  3       2.948    2.849     2.986    2.968
  4       3.929    3.776     3.975    3.890
  6       5.813    5.600     5.949    5.791
  8       7.637    7.316     7.881    7.744
  9       7.558    8.533     8.775    8.477
 12       9.715   11.303    11.681   11.482

Table 1. Speedup for the MD and computing π algorithms in Tajinaste
              MD                  π                    CG
#Proc.     MPI      HYB       MPI      HYB        MPI      HYB
SEQ           38.590              9.184              198.941
  1       0.817    0.823     0.997    1.004      1.003    0.988
  2       1.613    1.624     1.982    1.987      1.681    1.792
  3       2.421    2.465     2.959    2.953      2.123    2.566
  4       3.226    3.247     3.956    3.970      2.334    3.209
  5       4.027    4.089     4.936    4.942      2.471    3.835
  6       4.857    4.929     5.944    5.944      2.448    4.372
  7       5.670    5.754     6.935    6.831      2.579    4.772
  8       6.451    6.586     6.545    6.862      2.423    5.247

Table 2. Speedup for the MD, computing π and CG algorithms in Verode
6 Conclusions and Future Work
We have explored the potential of using llc and its compiler as a simple method for producing hybrid MPI/OpenMP code to target the new hierarchical multi-/many-core parallel architectures. We are curious as to whether our methodology could also be applied to other compilers or languages. In our approach, the OpenMP directives are translated by directly generating hybrid code, a scheme that, while simple, delivers reasonable performance.
Through the use of llc, most OpenMP codes can be directly ported to the
new hierarchical architectures without loss of efficiency. We believe that llc is
a good choice for non-expert users to exploit massive parallelism. The results
we have shown in this work represent a clear improvement over the previous
implementation of llCoMP. Combining the above facts leads us to believe that the hybrid programming model deserves more thorough study.
Work in progress in our project includes the following issues:
– Enlarge the computational experiments with additional algorithms that benefit more fully from our approach.
– Test our implementations on machines with a larger number of computational nodes.
– Study the applicability of our method to mixed multi-core CPU/GPU architectures.
Acknowledgments
We wish to thank the anonymous reviewers for their suggestions on how to
improve the paper.