Automatic Hybrid MPI+OpenMP Code Generation with llc*

Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande
Dept. de E.I.O. y Computación, Universidad de La Laguna, 38271 La Laguna, Spain
{rreyes,ajdorta,falmeida,fsande}@ull.es

* This work was partially funded by the EC (FEDER) and the Spanish MEC (Plan Nacional de I+D+I, TIN2008-06570-C04-03).

Abstract. The evolution of the architecture of massively parallel computers is progressing toward systems with a hierarchical hardware design, where each node is a shared memory system with several multi-core CPUs. There is consensus in the HPC community on the need to increase the efficiency of the programming paradigms used to exploit massive parallelism if we are to make full use of the advantages offered by these new architectures. llc is a language where parallelism is expressed using OpenMP compiler directives. In this paper we focus our attention on the automatic generation of hybrid MPI+OpenMP code from llc annotated source code.

Keywords: MPI, OpenMP, llc, hybrid parallel programming, multi-core

1 Introduction

The arrival of multi-core processors on the market [?] is producing radical changes in the architecture of massively parallel computers. High-Performance Computing (HPC) platforms with many thousands of cores will be deployed in the coming years [?,?,?]. These new systems present a hierarchical hardware design: individual nodes will be heterogeneous multi-threading, multi-core systems with a deep memory hierarchy, and the new HPC systems will deliver a sustained performance of one to two petaflops for many applications.

Although the computational power of new-generation processors has been increasing continuously, the new systems still lack programmability. Application programmers will have to address several problems. Some of these problems are already present in current architectures, but new ones are arising. The number of computing nodes is so large that scalability is becoming a key issue. The number of control threads executing simultaneously will be much greater than it is nowadays. Resource sharing between threads on a much larger scale is another point that will require attention. Consequently, the code will need to exhibit a greater amount of parallelism.

The intense changes in the newest architectures are arriving so fast that we are failing to take advantage of the increased power of the systems. The efficiency of the programming paradigms and tools we use to exploit massive parallelism must be increased [?]. The efficient use of these new systems is therefore an important challenge. The scientists and engineers responsible for the applications are, in general, seldom experts in HPC, and usually do not want to change their codes or learn new programming paradigms; they need solutions in the form of effective automatic parallelization tools and libraries. When the target architecture is a heterogeneous cluster or a multi-core processor, current parallelization tools usually offer unsatisfactory results, mainly due to the continuous changes, complexity and diversity of these parallel platforms, which pose serious obstacles to the development of efficient and stable solutions.

In this paper we address the problem of automatic code generation for hybrid architectures from high-level standard languages.
Starting from an llc [?,?] source code, we can generate hybrid MPI/OpenMP code ready to be compiled for a hierarchical architecture using standard tools. A noteworthy feature of our approach is that the generation process is quite simple, while the generated code is as efficient as ad-hoc implementations.

The rest of the paper is organized as follows: Section ?? presents the motivation of the work. In Sect. ??, through the use of an example, we illustrate the llc programming model and expose the simplicity of our method. Section ?? is dedicated to explaining the translation process that llCoMP performs. Several computational results obtained for OpenMP, MPI and llc codes are shown and discussed in Sect. ??. Finally, Sect. ?? offers conclusions and comments on future work.

2 Motivation

While MPI [?] and OpenMP [?] are nowadays the de facto standard tools for programming distributed and shared memory architectures, respectively, both must evolve to adapt to the many-core era. According to [?], these new systems will support a variety of programming models, including the "MPI everywhere" model, which is the most common today. However, according to [?], whereas MPI has proven to be an excellent means of expressing program parallelism when nodes have a small number of cores, future architectures may make this a tough proposition. In particular, MPI does not provide the programmer with the means to conserve memory or to directly modify the code to benefit from resource sharing and to avoid its negative implications. Both authors agree, however, that in the current scenario, with the presence of hybrid architectures, one possible step forward is to systematically combine MPI with OpenMP. Such a combined programming model is usually called a hybrid model [?,?,?], and is based on the use of multiple threads in each MPI process. Hybrid programming techniques have clear advantages over the classic approach on the new hardware: they allow for better control of application granularity, and they improve scalability and load balancing. The hybrid programming model has found mixed success to date, with many experiments showing little benefit and others showing promise. The successful cases are usually those where OpenMP is used to complement MPI, for example by providing better support for load balancing, adaptive computations or the sharing of large data tables.

llc [?,?] is a high-level parallel language that aims to exploit the best features of both MPI and OpenMP. llc follows the simplicity of OpenMP and avoids its well-known drawbacks: while in OpenMP we can start from a sequential code and parallelize it incrementally through the use of directives, the resulting code cannot be ported to distributed memory architectures or to the new large-scale systems. In llc, the code annotated with parallel directives is compiled with llCoMP, the llc compiler-translator. llCoMP produces efficient and portable parallel source code, valid for both shared and distributed memory architectures. Figure ?? illustrates this process. Different directives have been designed and implemented in llc to support common parallel constructs: forall, sections, pipelines and task queues.

Fig. 1. llCoMP code generation

The main contribution of this work is the design and development of a new llCoMP backend that is able to produce hybrid code for multi-core architectures from the llc annotated source.
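To fix ideas, the sketch below shows the general shape of such a hybrid code when it is written by hand: one MPI process per node performs the inter-node communication, while OpenMP threads exploit the cores inside the node, and all MPI calls are issued outside the OpenMP parallel regions. It is only an illustration of the model under simple assumptions (a one-dimensional halo exchange and a fixed local array size); it is not code produced by llCoMP.

/* Hand-written illustration of the hybrid model: MPI between nodes,
   OpenMP threads inside each node, MPI calls outside parallel regions. */
#include <mpi.h>

#define LOCAL_N 1024   /* local block size, chosen arbitrarily for the sketch */

void hybrid_step(double u[LOCAL_N + 2], double v[LOCAL_N + 2], MPI_Comm comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
  int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

  /* Inter-node communication: exchange halo cells with the neighbours (MPI only). */
  MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
               &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
  MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
               &u[0], 1, MPI_DOUBLE, left, 1, comm, MPI_STATUS_IGNORE);

  /* Intra-node computation: OpenMP threads share the cores of the node. */
  #pragma omp parallel for
  for (int i = 1; i <= LOCAL_N; i++)
    v[i] = 0.5 * (u[i - 1] + u[i + 1]);
}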
Some consequences of this contribution are:

– Scientists and engineers can generate efficient code without learning new material, just by using the well-known OpenMP programming paradigm.
– Through the use of llc, many OpenMP codes can be ported to large multi/many-core systems without loss of efficiency.
– The new backend of the llc compiler represents an improvement with respect to its former version.
– llc is well suited for non-expert users as a semi-automatic tool for generating hybrid MPI+OpenMP parallel code.

In our approach, MPI handles inter-node communications, while OpenMP is used inside each SMP node. The preliminary results we present indicate a clear improvement over the previous implementation of the compiler when targeting the new architectures.

3 Example

The OTOSP (One Thread is One Set of Processors) model is the distributed memory computational model underlying the llc language. It is essential to know OTOSP in order to understand the llc implementation; the key reference on the model is [?].

Probably the first example of any parallel API is the "computing π" program. The code in Listing ?? shows the calculation of π in llc. This well-known code is based on the relationship $\pi = \int_0^1 \frac{4}{1+t^2}\,dt$. Splitting the interval [0, 1] into N segments of size w, this integral can be approximated as the sum of the areas of the rectangles delimited by the central point t of each segment and the value of the curve at that point.

When the code is compiled by llCoMP, the iterations of the loop (line 5) are distributed among the processors. The clause private in line 3 is kept only for compatibility with OpenMP, since all storage is private in the OTOSP computing model. The OpenMP clause reduction indicates that all the local values of variable pi have to be added at the end of the loop. This operation implies a collective communication among all processors in the OTOSP group and the update of the variable with the result of the reduction operation. Since type analysis has not been included in llCoMP, the type of the reduction variable has to be specified in line 4. Notice that an OpenMP code can be adapted to llc with very few changes. In fact, all the OpenMP 2.5 directives and clauses are recognized by llCoMP and, therefore, we have three versions in the same code: sequential, OpenMP and llc/MPI. We need only choose the proper compiler to obtain the corresponding binary.

1  w = 1.0 / N;
2  pi = 0.0;
3  #pragma omp parallel for private(t) reduction(+:pi)
4  #pragma llc reduction_type(double)
5  for (i = 0; i < N; i++) {
6    t = (i + 0.5) * w;
7    pi = pi + 4.0 / (1.0 + t * t);
8  }
9  pi *= w;

Listing 1. Computing π in parallel with llc

We have selected the "computing π" code because its simplicity makes it easy to show the main features of llc. Obviously, algorithms like this one, with a reduced amount of communication, are those best suited to attain optimal performance when parallelized with llc. Apart from the parallel for, llc also provides support for the most common parallel constructs: sections, pipelines and farms [?,?]. The code in Listing ?? shows part of the hybrid code automatically generated by llCoMP for the "computing π" code.

4 The translation process

Studying a wide set of parallel applications, we can conclude that most of them can be classified according to the parallelization paradigm used [?]. Moreover, the parallel code itself (data distribution, communications, etc.) is quite similar in applications following the same paradigm.
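For instance, a hand-written hybrid version of the π computation of Listing 1 would look essentially the same for any other forall-plus-reduction computation. The following sketch uses MPI_Allreduce and a simple block distribution of iterations; it is our own illustration and not the code emitted by llCoMP, which is shown in Listing 2.

/* Hand-written hybrid "computing pi": block distribution of iterations among
   MPI processes, OpenMP reduction inside each process, one collective at the end. */
#include <mpi.h>

double compute_pi(long N, MPI_Comm comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  double w = 1.0 / (double) N;
  long chunk = N / size;
  long begin = rank * chunk;
  long end   = (rank == size - 1) ? N : begin + chunk;

  double local_pi = 0.0;
  /* second level of parallelism: OpenMP threads inside the node */
  #pragma omp parallel for reduction(+:local_pi)
  for (long i = begin; i < end; i++) {
    double t = (i + 0.5) * w;
    local_pi += 4.0 / (1.0 + t * t);
  }

  /* inter-node reduction handled by MPI */
  double pi = 0.0;
  MPI_Allreduce(&local_pi, &pi, 1, MPI_DOUBLE, MPI_SUM, comm);
  return pi * w;
}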
llCoMP takes advantage of this feature to generate code. We have named these portions of reusable code patterns, and llCoMP uses two kinds of patterns to process the llc parallel constructs.

Static patterns are the pure parallel code. They do not depend directly on the application, but rather on the parallelization paradigm specified using the llc and/or OpenMP constructs. These codes are implemented in llCoMP using the target language (i.e., MPI code), and they encode operations such as initialization of the parallel environment, resource distribution, data communications and load balancing. The compiler adapts a static pattern to a specific translation using special tags in the pattern, which it fills with information coming from the source code directives. To ease the maintenance of the compiler code, static patterns have been split into several files, each implementing a specific stage of the entire pattern: initialization, execution, communication and/or finalization. In addition to the general case, llCoMP implements specialized code for some common situations, and when it detects one of them it uses the optimized code; for this reason, each stage can be supported through different files. The good performance delivered for the general case can thus be improved when an optimization is detected. As a result, each paradigm and its static pattern is implemented by several text files.

The static patterns need some extra code to work. This additional code handles the operations on the data structures needed to build the translation, e.g. the management of the buffers used during communications. This code is sequential and specific to each application, and therefore cannot be embedded in the static patterns. llCoMP uses dynamic patterns to produce this complementary code. The dynamic pattern code is generated by the compiler during the compilation process and stored in temporary files. The static patterns use special marks to indicate to the compiler the point where each temporary file has to be inserted to produce the target code. The dynamic patterns are carefully tuned to optimize parallel performance; for instance, data packing/unpacking greatly reduces communications.

It is through the combination of static and dynamic patterns that llCoMP produces the target code. The compilation process is carried out by llCoMP without user intervention. All the information needed is gathered from the parallel constructs in the source code, and from it the compiler selects the best combination of static and dynamic patterns, including any available optimization, to build the target code.

#pragma omp parallel for private(t) reduction(+:pi)
for (i = (0) + llc_F[llc_grp_save];
     i < (0) + llc_F[llc_grp_save] + LLC_PROCSGRP(llc_grp_save); i++) {
  {
    t = (i + 0.5) * w;
    pi = pi + 4.0 / (1.0 + t * t);
  };
}
...
if (LLC_NAME == 0) {
  for (llc_i = 1; llc_i < LLC_NUMPROCESSORS; llc_i++) {
    MPI_Recv(llc_buf, llc_buf_size, MPI_BYTE, llc_i, LLC_TAG_REDUCE_DATA,
             *llc_CurrentGroup, &llc_status);
    llc_buf_ptr = llc_buf;
    pi += (*(double *) llc_buf_ptr);
    llc_buf_ptr += sizeof(double);
  }
  llc_buf_ptr = llc_buf;
  memcpy(llc_buf_ptr, &pi, sizeof(double));
  llc_buf_ptr += sizeof(double);
  MPI_Bcast(llc_buf, llc_buf_size, MPI_BYTE, 0, *llc_CurrentGroup);
} else {
  llc_buf_ptr = llc_buf;
  memcpy(llc_buf_ptr, &pi, sizeof(double));
  llc_buf_ptr += sizeof(double);
  MPI_Send(llc_buf, llc_buf_size, MPI_BYTE, 0, LLC_TAG_REDUCE_DATA,
           *llc_CurrentGroup);
  MPI_Bcast(llc_buf, llc_buf_size, MPI_BYTE, 0, *llc_CurrentGroup);
  llc_buf_ptr = llc_buf;
  memcpy(&pi, llc_buf_ptr, sizeof(double));
  llc_buf_ptr += sizeof(double);
}

Listing 2. The llc translation of the "computing π" code

The OTOSP computational model guarantees that each processor owns at least a part of the main dataset. We are therefore able to add a new level of parallelism by simply re-parallelizing the annotated loop. The tags in the static patterns allow this second parallel level to be injected in order to generate hybrid code. In this nested level we use a shared memory approach, generating pure OpenMP code for the previously llc-annotated loop; llCoMP builds the OpenMP code to be injected from the source code.

For the distribution of work between MPI processes and OpenMP threads, we have chosen to use one MPI process per computation node and shared memory threads among the cores of each node. According to the classification in [?], we use a hybrid masteronly scheme. This allows us to take advantage of the shared memory available to the cores, in particular of their shared caches, thus increasing the performance of the loops. Since our target systems in this work are multi-node clusters, the most efficient approach is to use a single processor to communicate over the inter-node network, given that a single processor is able to exploit the network completely, as shown in [?]. The llCoMP-generated code handles the MPI communications and the synchronization of data between nodes; this code would not benefit from using more than one CPU, as it consists mainly of synchronization routines.

5 Experimental Results

In this section we present preliminary results of our implementation by showing the performance obtained for four algorithms on two different multi-core systems. The algorithms we used are the "computing π" code introduced in Sect. ??, a Molecular Dynamics (MD) code, the Mandelbrot set computation and the Conjugate Gradient (CG) algorithm.

The MD code is an llc implementation of the velocity Verlet algorithm [?] for Molecular Dynamics simulations. It employs an iterative numerical procedure to obtain an approximate solution whose accuracy is determined by the time step of the simulation. The Mandelbrot set is the convergence domain of the complex sequence defined by $z_n = z_{n-1}^2 + C$; the area of the set is an open question in mathematics, and the algorithm uses a Monte Carlo method to compute an estimate of this area. The CG algorithm is one of the kernels in the NAS Parallel Benchmarks [?]. CG uses the inverse power method to find an estimate of the largest eigenvalue of a symmetric positive definite sparse matrix with a random pattern of nonzeros.
The algorithm performs a certain number of iterations, and in each of them the solution z to the linear system of equations Az = x is approximated using the conjugate gradient (CG) method, where A denotes the sparse matrix of order N. The source code for all four algorithms is available at the llc project home page [?].

We evaluated the above algorithms on two different systems, named Tajinaste and Verode. Both systems run Linux CentOS 5.3 with kernel 2.6, and the codes were compiled using gcc 4.1 and the OpenMPI MPI implementation. Tajinaste is an IBM cluster with 14 dual-core Opteron processors interconnected with a Gigabit Ethernet network. The results we present for this cluster are limited to 12 cores to guarantee exclusive use of the system. Verode is a two quad-core Opteron system; on it we used the same OpenMPI implementation with the shared memory intercommunication device.

Fig. 2. Speed-up of the Mandelbrot set computation on Tajinaste

Figure ?? shows the results for four different implementations of the Mandelbrot set computation on Tajinaste. The labels in the figure correspond to the following implementations:

– MPI: a pure MPI implementation.
– LLC-MPI: llCoMP generates pure MPI code.
– LLC-HYB: llCoMP generates hybrid OpenMP+MPI code.
– HYB: an ad-hoc hybrid implementation.

In every case the code iterates over 16368 points in the complex plane. This algorithm is usually taken as an academic example of a situation with intense load imbalance: each complex point analyzed requires a widely varying number of iterations of the main loop to converge. In this situation, Fig. ?? reflects the benefit of the LLC-HYB approach with respect to the LLC-MPI case: the parallel execution time improves by 17% for 14 processors. The llc versions exhibit results comparable to their ad-hoc counterparts, with a much smaller coding effort.

Tables ?? and ?? show the results for the MD, computing π and CG algorithms on Tajinaste and Verode, respectively. In both tables, row SEQ shows the time (in seconds) of the sequential version of the code, while the remaining rows show the speedups achieved by the llc implementations. Column MPI corresponds to pure MPI code generated by llCoMP, and column HYB reflects the results using the new llCoMP backend.

#Proc.       MD                  π
          MPI      HYB       MPI      HYB
SEQ       38.590             9.184
1          0.984    0.950     0.994    0.997
2          1.967    1.903     1.993    1.989
3          2.948    2.849     2.986    2.968
4          3.929    3.776     3.975    3.890
6          5.813    5.600     5.949    5.791
8          7.637    7.316     7.881    7.744
9          7.558    8.533     8.775    8.477
12         9.715   11.303    11.681   11.482

Table 1. Speedup for the MD and computing π algorithms on Tajinaste

In the case of the computing π and MD algorithms, no significant improvement is evident. Nevertheless, in the case of the CG code, the results again show the benefit achieved with the hybrid code. The reason behind this improvement is the granularity of the algorithm: the CG code parallelizes several loops, some of them used for initializations. This fine-grain parallelism penalizes the pure MPI version, while the hybrid translation takes advantage of it.

#Proc.       MD                  π                   CG
          MPI      HYB       MPI      HYB       MPI       HYB
SEQ       38.590             9.184              198.941
1          0.817    0.823     0.997    1.004      1.003    0.988
2          1.613    1.624     1.982    1.987      1.681    1.792
3          2.421    2.465     2.959    2.953      2.123    2.566
4          3.226    3.247     3.956    3.970      2.334    3.209
5          4.027    4.089     4.936    4.942      2.471    3.835
6          4.857    4.929     5.944    5.944      2.448    4.372
7          5.670    5.754     6.935    6.831      2.579    4.772
8          6.451    6.586     6.545    6.862      2.423    5.247

Table 2. Speedup for the MD, computing π and CG algorithms on Verode
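To make the granularity argument concrete, the following generic sketch of a single conjugate gradient iteration (it is not the NAS CG source) shows the several short parallel loops that every CG step executes. In a pure MPI translation each of these loops implies communication and synchronization among processes, whereas in the hybrid translation they run as inexpensive shared memory loops inside each node.

/* Generic CG iteration: a dense matrix is used here for simplicity.
   Each loop is a separate parallel region, illustrating the fine granularity. */
void cg_iteration(int n, const double *A, double *x, double *r,
                  double *p, double *q, double rho, double *rho_out)
{
  double alpha, beta, dot_pq = 0.0, rho_new = 0.0;

  /* q = A * p : the heaviest loop of the iteration */
  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
    double s = 0.0;
    for (int j = 0; j < n; j++) s += A[i * n + j] * p[j];
    q[i] = s;
  }

  /* alpha = rho / (p . q) : a short reduction loop */
  #pragma omp parallel for reduction(+:dot_pq)
  for (int i = 0; i < n; i++) dot_pq += p[i] * q[i];
  alpha = rho / dot_pq;

  /* x = x + alpha p ; r = r - alpha q ; rho_new = r . r : more short loops */
  #pragma omp parallel for reduction(+:rho_new)
  for (int i = 0; i < n; i++) {
    x[i] += alpha * p[i];
    r[i] -= alpha * q[i];
    rho_new += r[i] * r[i];
  }

  /* p = r + beta p */
  beta = rho_new / rho;
  #pragma omp parallel for
  for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];

  *rho_out = rho_new;
}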
6 Conclusions and Future Work

We have explored the potential of llc and its compiler as a simple method for producing hybrid MPI/OpenMP code targeting the new hierarchical multi/many-core parallel architectures, and we are interested in whether the methodology could also be applied to other compilers or languages. In our approach, the direct generation of hybrid code for the translation of the OpenMP directives combines simplicity with reasonable performance. Through the use of llc, most OpenMP codes can be directly ported to the new hierarchical architectures without loss of efficiency, and we believe that llc is a good choice for non-expert users who want to exploit massive parallelism. The results shown in this work represent a clear improvement over the previous implementation of llCoMP. Taken together, these facts lead us to believe that the hybrid programming model deserves more thorough study.

Work in progress in our project includes the following issues:

– Enlarge the computational experiments with additional algorithms that benefit more fully from our approach.
– Test our implementations on machines with a larger number of computational nodes.
– Study the applicability of our method to mixed multi-core CPU/GPU architectures.

Acknowledgments

We wish to thank the anonymous reviewers for their suggestions on how to improve the paper.