Implementation of Just In Time Value Specialization
for the Optimization of Data Parallel Kernels

PhD Dissertation (Draft 4.0)

Nat Tuck
University of Massachusetts Lowell

October 10, 2014

Abstract

This dissertation explores just-in-time (JIT) specialization as an optimization for OpenCL data-parallel compute kernels. It describes the implementation and performance of two new extensions to OpenCL, Bacon and Specialization Annotated OpenCL (SOCL). Bacon is a replacement interface for OpenCL that provides improved usability and has JIT specialization built in. SOCL is a simple extension to OpenCL that allows JIT specialization to be applied to OpenCL kernels directly. We have developed one implementation of Bacon and two implementations of SOCL, which are described and benchmarked. Both implementations of SOCL show significant speedups on some test programs. Speedups of up to 175% are demonstrated when JIT specialization is applied to kernels executing on graphics processing units (GPUs) from AMD and Nvidia using our Pancake library. With Specializing POCL, speedups of up to 150% are shown in tests on a dual-CPU 24-core workstation.

Extensive work has been done previously on JIT compilers and ahead-of-time specialization. Just in time value specialization is a well known technique, but it is not commonly implemented as an optimization to improve runtime, either in the literature or in existing systems. This work provides JIT value specialization as a general optimization explicitly exposed to the programmer and directly supported by the Bacon and SOCL runtime systems.

Contents

1 Introduction ..... 3

2 Technical Background ..... 6
  2.1 Parallel Processors ..... 6
  2.2 Existing Parallel Programming Systems ..... 14
  2.3 OpenCL: The Open Compute Language ..... 19

3 Just In Time Specialization ..... 28
  3.1 Value Specialization ..... 28
  3.2 Just In Time Compilation ..... 31
  3.3 Existing Uses of Just In Time Specialization ..... 32

4 The Bacon System ..... 34
  4.1 The Bacon C Language ..... 34
  4.2 Kernel Lifecycle ..... 37
  4.3 Implementation ..... 38
  4.4 Ease of Use ..... 39
  4.5 Performance ..... 39

5 JIT Specialization of OpenCL ..... 43
  5.1 Specialization Annotated OpenCL ..... 43
  5.2 Pancake: Implementing SOCL by Specializing OpenCL C ..... 50
  5.3 Serial Specialization with LLVM ..... 57
  5.4 Specializing POCL ..... 77
  5.5 JIT Specialization in Clover ..... 83

6 Conclusions and Future Work ..... 84
  6.1 Conclusions ..... 84
  6.2 Future Work ..... 85

A Benchmarking Hardware and Methods ..... 88
  A.1 Benchmarking Methodology ..... 88
  A.2 Benchmarking Hardware ..... 89

B Code Listings ..... 90
  B.1 OpenCL Listings ..... 90
  B.2 Bacon Listings ..... 98
  B.3 Serial Kernel Listings ..... 107
  B.4 Specializer Listings ..... 110

Bibliography ..... 114
Chapter 1

Introduction

Today, computers are parallel machines. Even cell phones have quad-core processors and highly parallel graphics processors with programmable shaders [1]. A high end commodity workstation can have both 64 CPU cores and several general purpose graphics processing units (GPUs) that together can perform 11264 floating point operations in parallel.¹ Taking advantage of this parallelism is no longer optional for computationally intensive programs. This raises the question: How do we easily write efficient parallel programs?

In a perfect world, we would be able to write our programs in a high-level language like Ruby and have them execute with a full parallel speedup on modern parallel hardware. Unfortunately, most of the programming languages, libraries, and development tools that we have assume serial program execution. Many of the popular high-level languages, including Python, Ruby, JavaScript, and others, are designed so that no more than one thread can be executed in parallel. This makes the problem very difficult, since programmers are understandably attached to their existing programming tools and code bases. Any technique for parallelization that required throwing these out would be a hard sell.

Luckily, many programs have most of their heavy computation requirements isolated in a few relatively small regions of their code. A traditional optimization technique is to separate out these computational hot spots into their own routines, rewrite them in a more efficient (usually lower level) programming language, and then call them from the main program. These rewritten hot spot routines are called "compute kernels." This method works well for adding parallelism to a program; the kernels can be made parallel while the bulk of the program remains serial.

¹ Quad socket AMD Opteron system with four Radeon R9 290X GPUs.

There are many ways to express parallelism in code, all with different benefits and drawbacks. One model that is easy to understand and maps well to a variety of hardware is array-based SPMD (single program multiple data) data parallelism. In this model, one piece of code operates on an entire (possibly multi-dimensional) array by running in parallel once for each element in the array.

The current cross-vendor standard for GPGPU programming is called OpenCL. OpenCL uses this model of writing separate SPMD kernels and executing them from a host program. The OpenCL standard [2] consists of two major pieces. First, it defines a programming language called OpenCL C for writing compute kernels to run on parallel hardware. Second, it defines runtime APIs for C and C++ that allow these kernels to be compiled, loaded, and executed from programs running on a host CPU.

The phrase "portable assembly language" is sometimes used to describe the C programming language. OpenCL is now the portable assembly language of GPU-like parallel processors.

A few years ago, when the potential speedup from parallelism on consumer devices was in the single digits, it made sense to do other optimizations first; with a factor-of-two parallel speedup, there's no reason to rewrite a kernel for parallel execution before rewriting it in hand-optimized C. Until recently, high-level parallel languages didn't really make sense outside of large clusters for high performance computing. Today that trade-off has changed.
According to the Computer Language Benchmarks Game [3], the speedup from using a fast, low-level language like C over a slow high-level language like Ruby is around a factor of 50. Commodity machines can have more than 50 CPU cores, and GPUs can be two orders of magnitude more parallel than that. Today a program written in a language like Ruby that parallelized perfectly could actually execute faster than a similarly optimized C program. But even if thousand-core CPUs become common we would still rather not give up a factor of 50 in performance if we can avoid it.

Luckily, describing computation at a higher level actually provides more information which can be used for aggressive compiler optimization. One very aggressive optimization that maps well to data-parallel kernels is value specialization (sometimes called "partial evaluation" [4]), where routines written for a general case computation are transformed automatically to operate on specific input cases. Computers are fast enough now that we can do this specialization at runtime based on the actual values seen by the program.

This dissertation explores just in time specialization for OpenCL compute kernels. General technical background including hardware, other parallel software systems, and the OpenCL standard is explored in Chapter 2. The history and use of JIT specialization is covered in Chapter 3. Work on Bacon, a programming system with JIT specialization and other features that targets OpenCL as a base platform, is described in Chapter 4.

The main contribution in this dissertation is Specialization Annotated OpenCL (SOCL), which is described in Chapter 5. This is an extension to OpenCL that provides specialization for OpenCL kernels directly. Two implementations of SOCL are described and their performance is analyzed. Finally, conclusions and possible future work are discussed in Chapter 6.

It is my hope that this work will inform the development of future parallel programming systems, and that eventually programmers will be able to get the performance of hand-optimized OpenCL with the programming effort and expressive power of a language like Ruby. Unfortunately, computer systems are complex, and this dissertation presents only a simple compiler optimization rather than the silver bullet for parallel programming.

Chapter 2

Technical Background

2.1 Parallel Processors

This dissertation explores the optimization of programs executed on parallel processors. In this chapter we look at some of the parallel processors that are available, especially in consumer computing devices. First, we show the basic architectures used for processors and how they innately support parallelism.

As of early 2014, common consumer computing devices include laptops, desktops, tablets, and smartphones. Each of these generally contains a central processing unit (CPU) with between two and eight processor cores and a graphics processing unit (GPU) with as many as several thousand programmable shader units. At the higher end of commodity computing devices, servers and workstations can have up to four CPUs and four discrete GPUs.

Desktops, laptops, workstations, and servers generally use Intel-compatible processors produced by either Intel or AMD. Mobile phones and tablets are generally based on processors in the ARM family designed by ARM Holdings and produced under license by various manufacturers.
This isn't a hard rule, with Intel pushing their new Atom processors for phones and tablets, AMD pushing ARM processors for servers, and some inexpensive tablets and laptops available that use MIPS processors.

Model           | Intel Core i7-4960X [5] | AMD FX-9590 [6]
Cores           | 6                       | 8
Threads         | 12                      | 8
Speed           | 3.6 GHz                 | 4.7 GHz
Turbo / Boost   | 4.0 GHz                 | 4.7 GHz
Memory Channels | 4                       | 2
L1 Code         | 32 kB / core            | 64 kB / 2 cores
L1 Data         | 32 kB / core            | 16 kB / 2 cores
L2 Cache        | 256 kB / core           | 2 MB / 2 cores
L3 Cache        | 15 MB                   | 8 MB
GFLOPS          | 172                     | 150
TDP             | 130 W                   | 220 W

Table 2.1: High-End Desktop CPUs

2.1.1 PC CPUs

Intel-compatible microprocessors have been the de-facto standard for personal computers since the release of the IBM PC in 1981. Modern incarnations of this architecture use the 64-bit x86_64 (aka AMD64, Intel64, x64) extension to Intel's x86 instruction set, first introduced in AMD's Athlon 64 chip in 2003. Today, Intel-compatible CPUs are produced almost exclusively by Intel and AMD. Although both companies are moving towards designs with integrated GPUs, they both produce processors without built-in graphics hardware for the enthusiast gaming and graphics workstation markets where discrete GPUs remain essential for good performance.

Two sample desktop processors from Intel's Core and AMD's FX lines are shown in Table 2.1. As of February 2014, these are the most powerful processors marketed to consumers by the two companies.

The primary difference between the Intel Core and AMD FX processor designs is the approach to sharing hardware resources between threads. Intel uses fewer cores with high single thread performance and improves performance for many threads with Hyper-Threading™, their implementation of simultaneous multi-threading (SMT) [7]. In contrast, AMD uses more cores with moderate single thread performance, organizing them as dual core modules that share both L2 cache and 256-bit SIMD units to minimize the increase in die space caused by including more cores.

Model            | AMD Radeon R9 290X [10][11] | Nvidia GeForce GTX Titan Black [12][13]
Shaders          | 2816                        | 2880
Clusters / SMXes | 44                          | 15
Speed            | 1000 MHz                    | 980 MHz
L1 Cache         | 16 kB / cluster             | 64 kB / SMX
L2 Cache         | 1 MB                        | 1.5 MB
RAM              | 4 GB                        | 2 GB
Memory Bus       | 512 bit                     | 384 bit
SP GFLOPS        | 5632                        | 5000
DP GFLOPS        | 704                         | 1700

Table 2.2: Modern High End GPUs

Vector single-instruction multiple-data (SIMD) arithmetic units are a significant source of parallelism in modern CPUs. The latest SIMD extension to the AMD64 instruction set is called VEX, which operates on 256-bit wide registers. This allows data parallel operations to be performed on either four 64-bit values or eight 32-bit values at a time.

Taking both multiple cores and SIMD units into account, modern multi-core CPUs are capable of significant parallelism. A data parallel computation operating on 32-bit values on the Intel Core i7-4960X (with 6 cores and 8-wide SIMD) is potentially 48 times faster than the same computation executing sequentially. Even greater parallelism is available in server and workstation systems. Four 15-core Intel processors have a total of 60 cores and can execute 120 threads simultaneously. With SIMD instructions, such a system can perform 480 floating point additions per clock cycle.

CPU parallelism can be exploited using standard tools. Most programming environments support the use of multiple threads or processes, and some compilers (including GCC [8] and LLVM [9]) can automatically vectorize appropriately structured programs to use SIMD instructions like VEX.
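As a rough illustration (this function is not taken from the dissertation's benchmark programs), a simple element-wise loop like the following is the kind of code that GCC or LLVM can vectorize automatically at higher optimization levels such as -O3, processing eight 32-bit floats per instruction with 256-bit SIMD registers:

    /* Scale a vector: ys[ii] = aa * xs[ii].
     * A vectorizing compiler can turn this loop into 256-bit SIMD
     * operations that handle 8 floats per iteration. */
    void vec_scale(float* ys, const float* xs, float aa, int nn)
    {
        for (int ii = 0; ii < nn; ++ii) {
            ys[ii] = aa * xs[ii];
        }
    }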
High performance math libraries are readily available that exploit CPU parallelism on the x86_64 architecture.

2.1.2 Discrete GPUs

Graphics Processing Units (GPUs) are massively parallel processor arrays that have developed largely to support computer gaming [14]. As the complexity of game graphics has increased, these GPUs have become more programmable. Current generation GPUs can execute arbitrary parallel functions with only a few programmer constraints. This flexibility has allowed these devices to be used for general computation, and the use of GPUs in this way is frequently called General Purpose GPU (GPGPU) computing.

Two high end GPUs, an AMD Radeon R9 and an Nvidia GeForce GTX, are shown in Table 2.2. Both graphics cards can execute nearly 3000 threads in parallel at about 1.0 GHz. This means that they have theoretical floating point performance more than 30 times faster than the fastest CPU available when they were released. Due to higher memory bandwidth and other factors, the speedup may be even higher on an appropriately parallel workload. A parallel program that can run on a GPU offers a potential speedup of three orders of magnitude over a sequential program.

GPUs are more complicated to work with than multi-core CPUs. GPUs are natively SPMD data-parallel processors. They execute threads in batches called "work groups" of 32 or 64 threads (for Nvidia and AMD respectively) that execute the same instruction sequence in lock-step. This makes array data parallelism the simplest way to write programs for execution on this hardware.

Code for execution on a GPU generally needs to be written using a specialized programming language and executed using a vendor-specific execution API. The open standard for GPU programming is OpenCL [2], which is discussed further in Chapter 2.3. In addition, there are a variety of other GPU tools such as Nvidia's CUDA [15] and Microsoft's DirectCompute [16], which are discussed further in Chapter 2.2.

Nvidia also produces a line of parallel accelerator cards under the Tesla brand name. These cards are very similar in performance to the GeForce GTX Titan.

2.1.3 Accelerated Processing Units

Model        | Intel Core i7-4770R [17][18] | AMD A10-7850K [6]
Cores        | 4                            | 8
Threads      | 8                            | 8
CPU Speed    | 3.2 GHz                      | 3.7 GHz
GPU "Cores"  | 40¹                          | 8²
Turbo / Boost| 3.9 GHz                      | 4.0 GHz
L1 Code      | 32 kB / core                 | 64 kB / 2 cores
L1 Data      | 32 kB / core                 | 16 kB / 2 cores
L2 Cache     | 256 kB / core                | 2 MB / 2 cores
L3 Cache     | 6 MB                         | —
L4 Cache     | 128 MB                       | —
CPU GFLOPS³  | 102                          | 59
GPU GFLOPS   | 832                          | 737
TDP          | 65 W                         | 220 W

Table 2.3: Desktop Accelerated Processing Units (APUs)

¹ Each Intel HD Graphics "core" has two 4-wide vector ALUs.
² Each Radeon R7 "core" has 64 "stream cores" (ALUs).
³ Theoretical GFLOPS = vector width * vector units * clock speed.

As we saw with floating-point arithmetic units in the 1980s, widely used coprocessors can be integrated into CPUs as transistors get smaller and cheaper. Both Intel and AMD now produce CPUs with integrated programmable graphics processors (APUs) as their primary mainstream processor offerings. AMD uses a scaled down version of their discrete Radeon graphics processors for their integrated graphics, while Intel has developed their Intel HD Graphics modules solely for integration into their CPUs.

Integrated graphics processors can't offer the performance of discrete graphics cards, but they can provide an order of magnitude more compute power than the CPU cores they are paired with. Table 2.3 shows two APUs: a Core i7 from Intel and an A10 APU from AMD.
The processors provide similar numbers for theoretical GFLOPS, but the Intel processor has a very interesting feature: a huge (128 MB) EDRAM that it uses as an L4 cache. The major performance drawback for integrated graphics processors is the use of system RAM shared with the CPU, which is relatively low bandwidth compared to the high speed graphics RAM on discrete graphics cards. Intel's EDRAM provides only a small performance boost on games, but it may provide a significantly larger benefit for GPGPU program execution.

APUs seem to be the way of the future. I overconfidently predict that discrete graphics cards will be largely extinct in consumer PCs a decade from now.

Model          | Qualcomm Snapdragon 600 8974-AA SoC [19] | Apple A7 SoC [20]
Cores          | 4                                        | 2
CPU Speed      | 2.2 GHz                                  | 1.3 GHz
CPU ISA        | ARMv7                                    | ARMv8-A
GPU Speed      | 450 MHz                                  | 200 MHz
L0 Code + Data | 4 kB + 4 kB                              | —
L1 Code + Data | 16 kB + 16 kB                            | 64 kB
L2 Cache       | 2 MB                                     | 1 MB
L3 Cache       | —                                        | 4 MB
CPU GFLOPS     | 35.2                                     | 10.4
GPU GFLOPS     | 130                                      | 76.8
TDP            | < 5 W                                    | < 5 W

Table 2.4: Mobile Systems on Chip

Programs for APUs are written by writing for the two components separately. The CPU cores are programmed using standard tools, while the graphics processor is programmed using specialized GPU programming tools. The fact that memory is shared between the two components simplifies things somewhat, as current generation APUs can share data structures without copying.

2.1.4 ARM Processors

Smartphones are now the most common type of computer in general use, having overtaken PC shipments in 2011 [21]. Mobile devices, including smartphones and tablets, are powerful general purpose computers with multi-core CPUs and programmable integrated graphics. The main difference from PCs is their use of ARM rather than Intel-compatible CPUs.

Sample specifications for two current mobile systems on a chip (SoCs) from Qualcomm and Apple are shown in Table 2.4. These SoCs are less powerful than modern desktop APUs by about an order of magnitude, while also using about an order of magnitude less power. Like the APUs, the graphics processors on these SoCs have about an order of magnitude more compute power than the CPU component.

Mobile devices are programmed using development tools supplied by the software vendor and provide only minimal support for programming of the GPU. The development tools have improved over time, so full GPU programming support from all vendors is likely in the near future.

Model          | Intel Xeon Phi 7120P [22] | Tilera TILEncore-Gx72 [23][24]
Cores          | 61                        | 72
Speed          | 1.2 GHz                   | 1.0 GHz
L1 Code + Data | 32 kB + 32 kB             | 32 kB + 32 kB
L2 Cache       | 512 kB / core             | 256 kB / core
L3 Cache       | —                         | 18 MB
RAM            | 16 GB                     | 32 GB
GFLOPS         | 1208                      | 576
TDP            | 300 W                     | 95 W

Table 2.5: Parallel Accelerator Cards

2.1.5 Parallel Accelerators

There are a number of parallel accelerators on the market, commonly in the form of PCI Express expansion cards, that can perform parallel computations. These cards are not generally available for retail purchase, but their existence is worth noting because they provide a computational model somewhere between that of a multi-core CPU and a general purpose GPU and will likely inform the design of future commodity computing components.

Table 2.5 shows two current parallel accelerator cards: the Xeon Phi from Intel and the TILEncore-Gx72 from Tilera.

The Xeon Phi is intended primarily for high performance computing. Each core provides 4-way simultaneous multi-threading, allowing the card to execute 244 simultaneous threads.
The Phi executes standard 64-bit x86 code except for the addition of a 512-bit SIMD unit to provide higher data-parallel performance. The cores are connected in a ring topology. The Xeon Phi is programmed using a variant of Intel's standard x86_64 compilers, which are very mature. The co-processor can even run existing x86_64 binaries without the benefit of the extended SIMD instructions.

The Tilera TILEncore is primarily marketed as a realtime network packet processor for applications like deep packet inspection. Towards this end, the card is equipped with four 10Gb Ethernet ports. The cores support SIMD and heavy duty cryptographic acceleration. The core interconnect is a mesh connecting each processor with five links to each of its four neighbors. Tilera's accelerator is programmed using their development kit, which seems to be based on standard tools.

[Figure 2.1: Comparison of the theoretical GFLOPS of sample processors (bar chart covering the CPUs, GPUs, APUs, SoCs, and accelerators discussed in this chapter). The sequential performance of the Core i7 is 7.2 GFLOPS.]

2.1.6 Comparative Performance

Figure 2.1 shows the comparative theoretical maximum floating point performance of the processors discussed in this chapter. For processors with both CPU cores and integrated graphics, this is the total performance of the chip. This chart only shows maximum theoretical GFLOPS and doesn't take into account memory bandwidth, cache performance, how effectively a program was parallelized, or any other practical considerations.

Modern hardware allows parallel programs to execute significantly faster than sequential programs. When compared to GPUs, sequential execution on the fastest CPU in the world barely registers on the chart.

2.2 Existing Parallel Programming Systems

In this section we describe several existing parallel programming systems, separated by the type of hardware they target.

2.2.1 Cluster Programming Systems

Historically, parallel programming largely meant programming for clusters of many machines connected by a network. Work on this sort of system is generally called High Performance Computing (HPC) and largely consists of simulations of physics problems, engineering designs, the weather, financial systems, and other problems that generate enough funding to purchase a large number of machines. Even though a modern cell phone has as much computational power as a cluster from 20 years ago, there's no shortage of larger problems that need a cluster to solve as individual computers get faster and have more RAM and storage space. Many of the older and more mature parallel programming systems target clusters.

MPI

MPI (Message Passing Interface) is an open standard API for parallel program execution on clusters. There are many different MPI implementations, including vendor-specific libraries optimized for specific hardware and widely portable open implementations like MPICH [25]. As shared memory is not generally available on clusters, MPI programs communicate by message passing. This scales better than shared memory but makes some otherwise efficient parallel algorithms no longer feasible to execute. Newer versions of MPI include support for execution on multi-core systems and simulation of shared memory.
MPI can be used to manage the inter-machine communication in systems where other tools are used to manage local parallelism, including clusters that do their primary computation on GPUs.

PVM

PVM [26] (Parallel Virtual Machine) is an open source software package for parallel programming on clusters. It is slightly older than MPI, but provides similar capabilities. PVM connects the machines in the cluster together into a "virtual machine" and provides extensive APIs to support querying and controlling the status of the various nodes and running processes.

ZPL

ZPL [27] (Z-level Programming Language) is a higher level array programming language built on top of MPI. In ZPL, the programmer describes operations on entire arrays or pieces of arrays rather than describing parallel computations by looking at single array elements. ZPL pioneered several parallel programming concepts, including the idea of "regions", which are very similar to the parallel ranges used in modern data-parallel systems.

Chapel

Another high-level parallel programming language of note is Chapel [28]. This language, a conceptual descendant of ZPL, was developed at Cray as an entry in the DARPA funded High Productivity Computing Systems program. It tries to solve a larger piece of the parallel and distributed computing problem than other systems by being a "multi-resolution" language, allowing users to write both high- and low-level routines as part of the same program. This is supported by novel parallel abstractions like zippered iterators and distributed software transactional memory.

2.2.2 Multicore Programming Systems

As described earlier in this chapter, most people now have powerful parallel computers on their desks and in their pockets. The most basic way to exploit the parallelism of multi-CPU systems is to run multiple instances of a program so that the operating system can schedule each instance on a separate CPU core. Unfortunately, this method has some drawbacks in ease of development, communication between instances, and memory usage.

To simplify communication and reduce memory usage, operating systems added the ability for each instance of a running program to have multiple "threads" of execution that could run in parallel. These threads all share the same address space, so they use a smaller total amount of memory and communication can be accomplished by reading and writing the shared memory. Unfortunately, accessing memory concurrently leads to race conditions, and explicit locks are used to guarantee atomic access to shared memory. This means that programmers working with threads need to worry about both race conditions and deadlocks, which can occur when locks are not used correctly. Various tools and techniques have been developed to simplify programming with threads and multiple processes.

OpenMP

OpenMP [29] (Open Multiprocessing) is an open standard API and syntax extension to C, C++, and FORTRAN for writing parallel code for shared memory multiprocessor machines. It is especially popular due to its low-impact work flow. First, the programmer writes and tests a sequential program while keeping parallelism issues in mind. Second, the programmer annotates that program with "pragmas" that expose parallelism to the compiler. Ideally, the program will then run with a good parallel speedup.

OpenMP is supported by a variety of compilers, including GCC, Intel's ICC, Microsoft Visual Studio, IBM XL C/C++, and Oracle Solaris Studio. Another extension to C and C++ similar to OpenMP is Intel's Cilk Plus [30].
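As a minimal sketch of that work flow (the function and loop here are illustrative, not taken from the benchmark programs used later), a sequential reduction can be parallelized with a single pragma and compiled with OpenMP enabled (for example, gcc -fopenmp):

    #include <omp.h>

    /* Sum of squares over xs[0..nn-1], split across all available cores. */
    double sum_squares(const double* xs, long nn)
    {
        double total = 0.0;

        /* The pragma marks the loop as parallel; the reduction clause
         * gives each thread a private copy of total and combines them. */
        #pragma omp parallel for reduction(+:total)
        for (long ii = 0; ii < nn; ++ii) {
            total += xs[ii] * xs[ii];
        }

        return total;
    }

Without OpenMP support the pragma is simply ignored and the function runs sequentially, which is part of what makes this annotation style low-impact.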
Intel Threading Building Blocks

Intel Threading Building Blocks [31] (TBB) is a C++ template library that provides simple patterns that can be used to add parallelism without having to deal with low-level primitives and performance tuning. With TBB, the programmer specifies tasks that can run in parallel and the library automatically schedules their execution on multiple processor cores. The library also provides patterns for parallel loops and pipelines that will generate the tasks to be run in parallel automatically.

Immutable Data

A common technique for avoiding locks in threaded systems is the use of immutable data. If an object in memory never changes, there's no need to worry about race conditions when accessing it.

Many functional programming languages, such as Haskell and Clojure, explicitly enforce the immutability of data. Operations that modify an object in place are disallowed. If a modified version is needed, a new object with the modification is generated from the old version in a single atomic operation. This guarantees that all objects can be safely shared between threads, since objects cannot be changed once they are created.

Immutable data can also be used as a programming pattern in languages that don't explicitly enforce it. Generally, the strategy is that objects are produced in a single thread and then a reference to them is shared with other threads. Once the reference has been shared, the object is not changed. If a modified version of the object is needed, a thread makes a local copy, modifies the copy, and then shares the new version.

The disadvantage to immutable data is the performance penalty that comes from allocating a new object for every operation. In languages like Haskell this can largely be optimized away by the compiler, but in general the cost of memory allocation is a concern due to cache effects and the fact that every piece of memory that is allocated must later be garbage collected.

2.2.3 Low-Level GPU Programming APIs

OpenCL

OpenCL [2] (Open Computing Language) is a cross-vendor standard for data-parallel programming designed primarily to target GPU devices. It was developed by a group led by Apple in 2008 to create an open standard in response to vendor-specific GPU programming APIs from Nvidia and Microsoft. Since then it has gone through three major revisions, with OpenCL 2.0 having been finalized in November of 2013 [32]. OpenCL is described in detail in Chapter 2.3.

CUDA™

CUDA™ [33] is a proprietary API for GPU programming developed by Nvidia. It is somewhat more mature than OpenCL but is otherwise very similar, with a slightly different set of features and restrictions. In addition to a C-based language, the CUDA platform also allows kernels to be written in variants of C++ and FORTRAN. CUDA can only be used to develop for Nvidia GPUs.

Microsoft DirectCompute

Microsoft provides an API for GPU computing for the Windows operating system called DirectCompute [16] as part of their DirectX API package. DirectCompute allows "compute shaders" written in Microsoft's High Level Shader Language (HLSL) to be executed on supported GPUs. DirectCompute provides a subset of the features provided by OpenCL or CUDA. Its primary advantage is its integration with the DirectX API, which is the standard for development of games and other 3D applications on the Windows platform.

OpenGL Compute Shaders

The recent OpenGL 4.3 standard [34] adds support for compute shaders to this standard graphics API.
Shaders are written in the OpenGL Shading Language (GLSL) and are executed through the OpenGL API. Like Microsoft's compute shaders, these are less featureful than OpenCL or CUDA, but are well integrated with the OpenGL API for graphics applications that need simple support for compute acceleration.

Google RenderScript

Google RenderScript [35] is an API for parallel acceleration on the Android platform. It allows kernels to be written in the style of compute shaders and then called from Java code in Android applications. Google has intentionally not provided default support for OpenCL on Android to promote RenderScript, although OpenCL is significantly more featureful.

2.2.4 High-Level GPU Programming Tools

Aparapi

Aparapi [36] is an open source API that allows methods written in Java to be translated into OpenCL and executed on GPU devices at runtime. The methods to be translated must be written in a restricted subset of Java consisting largely of loops and integer operations. The ability to embed GPU kernels directly in the host code in the host language makes it much simpler to take advantage of GPU acceleration than writing separate kernel or shader code in a separate language with a separate execution API.

Clyther

Clyther [37] is a tool like Aparapi, except for the Python programming language rather than Java. One additional feature of note is that since Python is a dynamically typed language, kernels written in Clyther will be type-specialized on the types of their arguments. This allows a kernel to be written once and then executed on arguments of several types.

2.3 OpenCL: The Open Compute Language

OpenCL has two major parts: the OpenCL C programming language and the OpenCL runtime library. Kernels (routines to be run in parallel) are developed in OpenCL C and are loaded and executed at runtime by API calls to the OpenCL library from a host application. This delayed compilation allows OpenCL to be hardware independent; kernels are compiled for devices identified and selected at runtime.

The OpenCL standard further defines a "virtual parallel machine" model for programmers to target with OpenCL kernels. The intent is that this model be general enough to map to all target devices, while having enough of the specific properties of GPUs that it can efficiently target their somewhat peculiar architecture.

The OpenCL model describes how a kernel executes on a device. The kernel is executed in parallel over a 1-, 2-, or 3-dimensional grid-shaped index space called an NDRange. That grid is further divided up into "work groups" in a way that can be specified by the user if it is important for a given kernel. Each executing instance of the kernel (a "thread") is able to access its index, and can thus determine what piece of the computation it is responsible for calculating.

Memory in the OpenCL model is separated into five address spaces. Global memory is shared across all threads in a kernel, but no guarantee is made that writes from one thread will be visible to other threads before the kernel is finished executing. Local memory is shared within a work group and can be synchronized for inter-thread communication by an explicit barrier. Private memory is accessible only to a single thread. As a general guideline, global memory is slow, local memory is faster, and private memory is the fastest. As an alternative to global memory, there are two additional address spaces that may provide special high speed caching: constant memory and images.
Constant memory is read-only during kernel execution. Images are 2D or 3D arrays that can only store data in certain formats and are restricted such that a given image can only be either read or written, but not both, by a given execution of a kernel.

Use of the OpenCL C language that kernels are written in is subject to some major constraints, including a lack of function pointers, recursion, or any sort of dynamic memory allocation or array sizing. The standard seems to be slowly becoming less restrictive with extensions, newer revisions, and improvements in the capabilities of GPU hardware, but it will be several years before OpenCL can be straightforwardly treated like a parallel C.

The OpenCL specification provides a clear description of the OpenCL platform model. It is advisable to read the beginning of Section 3 of the OpenCL Specification [38] to get a more complete picture of the standard, as the remainder of this dissertation describes work based on the OpenCL platform.

2.3.1 A Sample OpenCL Program

To see how OpenCL works, it's useful to walk through how a kernel is compiled and executed using the OpenCL API. Although there are wrappers available for other languages like C++ and Java, OpenCL is natively a C API, so we describe the process from a C host program. Using OpenCL from other languages (e.g., the C++ API) is basically the same. The information in this section is based on Version 1.2 of the OpenCL Specification [39]; detailed documentation on the API functions can be found there.

 1  /* vec_pow:
 2   *   Takes a vector of (n) long integers and raises each one
 3   *   to the (y)'th power in place.
 4   */
 5  kernel void vec_pow(global long* xs, long y, long n)
 6  {
 7      long ii = get_global_id(0);
 8      long x  = xs[ii];
 9      long x1 = 1;
10
11      for (long jj = 0; jj < y; ++jj) {
12          x1 *= x;
13      }
14
15      xs[ii] = x1;
16  }

Listing 2.1: Sample OpenCL Kernel: vec_pow

The kernel that we call is the vec_pow kernel shown in Listing 2.1. A complete host program that loads and runs the vec_pow kernel is shown at the end of this section (and in Appendix B). Even in this minimal example, which takes the first OpenCL device on the system, excludes any error checking, and includes only one buffer taken from an array literal, the process requires 12 conceptual steps and 84 lines of C code. We now describe these steps.

After including the OpenCL header file (line 1), the first step is to select an OpenCL platform. OpenCL allows versions of its runtime library from multiple vendors to co-exist on one machine. The clGetPlatformIDs function (line 14) lists these different implementations. This function also demonstrates the use of a common C memory management idiom: the first call returns how big an array to allocate for the result, and the second call populates that array. Here we blindly take the first platform.

Once a platform has been selected, the second step is to select a parallel device that will be used to execute kernels. The clGetDeviceIDs function (line 20) lists all devices that the selected platform can utilize of the specified type. Here we take the first device of any type with CL_DEVICE_TYPE_ALL, but we could have requested, e.g., a CPU or GPU device specifically with CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU.

The next step is to create an OpenCL context for the selected device and platform using the clCreateContext function (line 25).
A context can contain more than one device, which would allow OpenCL memory buffers to be shared by all of the devices in the context.

Once we have a context, we can create a command queue for our device using the clCreateCommandQueue function (line 29). Actions on this device, like executing a kernel or copying a memory buffer to or from the device, will be added to this queue and then processed asynchronously.

To execute a kernel, we first need to read the kernel source from disk (lines 32-39). It would be convenient if OpenCL provided an API function to do this for us, but instead we do it by hand using standard C I/O functions.

Once we have the kernel source in memory, we call the clCreateProgramWithSource function (line 45) to create an OpenCL program object. We can then use the clBuildProgram function (line 47) to compile it to a device specific binary. The interface to this function is somewhat complicated, expecting an array of source strings and an array with a length for each source string. At this point we could use the clGetProgramBuildInfo function to get the status of the build and any errors or warnings that were encountered so we could print a useful error message.

After building the program, we need to get a reference to the kernel we want to call with the clCreateKernel function (line 50). This takes the name of the kernel as a string and returns an OpenCL kernel object.

To pass an array to a kernel, we need to put the data in an OpenCL buffer object. We can create such an object with the clCreateBuffer function (line 57). In this case we use the CL_MEM_COPY_HOST_PTR flag, which causes data to be copied into the buffer when it is created, but OpenCL also supports other options, like using the host pointer directly with the CL_MEM_USE_HOST_PTR flag (which can avoid copying for some devices but has caching caveats with others), or writing data to the buffer explicitly with the clEnqueueWriteBuffer function.

Once we have a kernel object, we can use the clSetKernelArg function (lines 61-63) to set values for the arguments we want to pass in when the kernel is executed. This function takes the numeric position of the argument, the size of the item being passed, and a pointer to the data. Buffers (and other cl_mem objects like images) are treated a bit specially here: you pass in an OpenCL memory object, but the kernel gets a pointer to the data.

With the arguments set, we can finally enqueue the kernel for execution using the clEnqueueNDRangeKernel function (line 67). This will cause the kernel to get executed in parallel once for each position in a 1D, 2D, or 3D "range". In this case we select a 1-dimensional range of size n, which will result in copies of the kernel being executed such that calls to get_global_id(0) (line 7 in Listing 2.1) will return values from 0 to n - 1, which can then be used directly as indices into the array xs.

Before we can use the results of the kernel, we need to ensure that it has completed executing. We call clFinish (line 70) to guarantee that all asynchronous tasks have finished executing.

Once the kernel has been executed, we can read back the results using the clEnqueueReadBuffer function (line 74) and then print them (lines 77-81).

Sample Host Program for vec_pow

 1  #include <CL/cl.h>
 2  #include <stdio.h>
 3  #include <alloca.h>
 4  #include <string.h>
 5
 6  int main(int argc, char** argv)
 7  {
 8      cl_uint nn;
 9
10      /* Take the first OpenCL platform */
11      clGetPlatformIDs(0, 0, &nn);
12      cl_platform_id platforms[nn];
13      clGetPlatformIDs(nn, platforms, 0);
14      cl_platform_id platform = platforms[0];
15
16      /* Take the default OpenCL device */
17      clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, 0, &nn);
18      cl_device_id devices[nn];
19      clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, nn, devices, 0);
20      cl_device_id device = devices[0];
21
22      /* Create an OpenCL context */
23      long context_props[] = {CL_CONTEXT_PLATFORM, (long) platform, 0};
24      cl_context context = clCreateContext(
25          (cl_context_properties*) context_props,
26          1, &device, 0, 0, 0);
27
28      /* Create an OpenCL command queue */
29      cl_command_queue queue = clCreateCommandQueue(context,
30                                                    device, 0, 0);
31      /* Read in the OpenCL kernel source */
32      char* opencl_source = alloca(1024);
33      FILE* opencl_file = fopen("vec_pow_kernel.cl", "r");
34      nn = 0;
35      while (!feof(opencl_file))
36          nn += fread(opencl_source + nn, 1, 512, opencl_file);
37      opencl_source[nn] = '\n';
38      opencl_source[nn + 1] = 0;
39      fclose(opencl_file);
40
41      /* Compile the OpenCL source and select the kernel we want */
42      size_t length = strlen(opencl_source);
43      const char** sources = (const char**) &opencl_source;
44      const size_t* lengths = (const size_t*) &length;
45      cl_program program = clCreateProgramWithSource(context, 1,
46                                                     sources, lengths, 0);
47      clBuildProgram(program, 1, &device, "", 0, 0);
48
49      /* Select the kernel we want */
50      cl_kernel kernel = clCreateKernel(program, "vec_pow", 0);
51
52      /* Create the input buffer */
53      cl_long n = 5;
54      cl_long xs0[] = {1, 2, 3, 4, 5};
55      cl_mem_flags flags = CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR;
56      size_t xs_size = n * sizeof(cl_long);
57      cl_mem xs = clCreateBuffer(context, flags, xs_size, xs0, 0);
58
59      /* Set the kernel arguments */
60      cl_long y = 3;
61      clSetKernelArg(kernel, 0, sizeof(cl_mem), &xs);
62      clSetKernelArg(kernel, 1, sizeof(y), &y);
63      clSetKernelArg(kernel, 2, sizeof(n), &n);
64
65      /* Execute the kernel over a parallel range. */
66      cl_event kernel_done = 0;
67      clEnqueueNDRangeKernel(queue, kernel, 1, 0, &n, 0, 0, 0,
68                             &kernel_done);
69      /* Wait for the kernel to finish executing */
70      clFinish(queue);
71
72      /* Copy back the data from the buffer. */
73      cl_long xs1[5];
74      clEnqueueReadBuffer(queue, xs, CL_TRUE, 0, xs_size, xs1, 0,
75                          0, 0);
76      /* Print out the result */
77      printf("1 2 3 4 5 cubed is:\n");
78      for (int ii = 0; ii < 5; ++ii) {
79          printf("%ld ", xs1[ii]);
80      }
81      printf("\n");
82
83      return 0;
84  }

2.3.2 OpenCL Implementations

OpenCL is an open standard that is generally implemented by each vendor for execution on their hardware devices. As such, there are a number of implementations of OpenCL available, each supporting different hardware, different software platforms, and different subsets of the OpenCL standard and its extensions. Several of the major implementations of OpenCL are described below.

Apple

Apple was the initial developer of the OpenCL standard as a hardware-independent graphics card programming API for their Macintosh computers. They maintain an implementation [40] of the standard for their OS X operating system. The current version of Apple OpenCL, included in their OS X Mavericks operating system, supports Version 1.2 of the standard on AMD, Intel, and Nvidia graphics hardware and Intel-compatible CPUs.

AMD

AMD provides an implementation of OpenCL called the Accelerated Parallel Programming (APP) SDK [41] that supports their Radeon and FirePro graphics hardware and all Intel-compatible CPUs. The AMD SDK supports Linux and Windows operating systems, leaving support for Mac OS X to Apple. The current version of the SDK (2.9) supports OpenCL 1.2.

Nvidia

Nvidia's CUDA SDK includes an implementation of OpenCL [42] for their GeForce graphics cards and Tesla compute accelerators. CUDA 5.5, the most recent version released, only supports OpenCL 1.1. Nvidia has presumably not updated their OpenCL support to 1.2 in order to promote their proprietary APIs. The Nvidia CUDA SDK supports Linux, Macintosh, and Windows.

Intel

Intel provides an implementation of OpenCL [43] for their CPUs and integrated graphics hardware. The 2013 R3 release supports OpenCL 1.2 for Windows.

Clover

The Mesa project [44], providers of the standard open source OpenGL implementation for Linux, is working on an incomplete open source implementation of OpenCL called Clover (sometimes "Gallium Compute") [45]. It currently provides partial support for OpenCL 1.0 on CPUs and AMD Radeon GPUs released between 2009 and 2013. It is included in the latest stable release of Mesa, but is disabled by default. As GPU computing instruction sets stabilize and the implementation catches up, this is likely to become the standard GPU compute implementation for Linux, but it can currently run only a few OpenCL applications.

Beignet

Intel has been working on an open source OpenCL implementation called Beignet [46] to go along with their Linux graphics driver. Currently it claims support for the OpenCL 1.1 spec on 3rd Generation Intel Core ("Ivy Bridge") integrated graphics, but it has not yet been included in an official release of Mesa.

POCL

Portable Computing Language (POCL) [47] is an open source implementation of OpenCL initially developed to target custom hardware but now providing a complete implementation of OpenCL 1.1 for CPU execution. POCL is described in detail in Chapter 5.4.
Chapter 3

Just In Time Specialization

3.1 Value Specialization

Specialization, also known as partial evaluation, was formalized by Futamura in 1971 [4]:

    Given a procedure π with parameters c_1, ..., c_n, r_1, ..., r_m and values c'_1, ..., c'_n for the first n parameters, a procedure π' specialized on those values can be generated by partially evaluating the procedure. Those parts of the procedure that depend on the known parameters (c_1, ..., c_n) can be evaluated, while the unknown parameters (r_1, ..., r_m) remain unknown.

Futamura used this concept to describe a method of generating compilers from interpreters by specializing the interpreter (the procedure) on an input program (the known values).

This technique is useful as a program optimization, and there have been many projects using it over the last four decades. As this dissertation describes specialization on a language based on C, specialization of C-type languages is most applicable.

An early example of partial evaluation of C programs is described by Anderson in 1991 [48]. This partial evaluator for a subset of C is self-applicable, allowing it to be used for the compiler-generation technique described by Futamura. A speedup of nearly a factor of seven is shown for one test program. A later partial evaluator for C called Tempo [49] performs general ahead-of-time partial evaluation of C code. This system provided more consistent speedups and was used to accelerate HPC kernels. The same primary author, Consel, also did work on specialization at runtime [50].

3.1.1 An Example of Value Specialization

Consider the ipow function shown at the top of Listing 3.1. It performs integer exponentiation, raising xx to the kk'th power, by iterated multiplication. This function implements the general case of exponentiation, allowing xx to be raised to any non-negative integer power.

But what if our application only uses this function to calculate cubes? In this case, we know that every time this function is called the value of kk will be 3. We can write a more efficient function for this application by not handling the general case and instead only calculating cubes. Such a specific function, icube, is shown at the bottom of Listing 3.1.

Generating a function like icube from a function like ipow is an example of specialization. Applying Futamura's definition from Section 3.1:

- The initial procedure, π, is the function ipow.
- It has one parameter with a known value; c_1 is the parameter kk.
- It has one parameter with an unknown value; r_1 is the parameter xx.
- The known value c'_1 for kk is 3.
- The procedure to be generated, π', is icube.

Using Futamura's technique, we can generate π' from π by partial evaluation. In this case, that is a three step process:

1. Remove the parameter kk and replace all instances of that variable in the body with the known value 3. The result of this step is shown in Listing 3.1 as ipow_s1.

2. Since we know how many iterations the loop has (three), we can symbolically evaluate it by replacing it with three copies of its body. As a compiler optimization this is called "unrolling". The result of this step is shown in Listing 3.1 as ipow_s2.

3. Even though we don't know the value of xx, we can apply the rules of algebra to simplify the arithmetic. This will give us our final function icube.
    /* Calculate xx raised to the kk'th power. */
    int ipow(int xx, int kk)
    {
        int yy = 1;

        for (int ii = 0; ii < kk; ++ii) {
            yy *= xx;
        }

        return yy;
    }

    int ipow_s1(int xx)
    {
        int yy = 1;

        /* The one instance of kk has been replaced by the value 3. */
        for (int ii = 0; ii < 3; ++ii) {
            yy *= xx;
        }

        return yy;
    }

    int ipow_s2(int xx)
    {
        int yy = 1;

        /* The loop has been replaced by three copies of its body. */
        yy *= xx;
        yy *= xx;
        yy *= xx;

        return yy;
    }

    /* Just cube a number */
    int icube(int xx)
    {
        return xx * xx * xx;
    }

Listing 3.1: ipow.c: An integer exponentiation kernel

The technical details of how to implement specialization on C code in a production quality compiler are discussed in Chapter 5.3.

3.2 Just In Time Compilation

Programming systems with just-in-time compilation delay compilation until the program is run on the target system. This has the advantage that the program can run on any machine with an appropriate run-time environment, not just the one machine type that the program was compiled for. Just-in-time compilation also involves a performance tradeoff: compiling the program takes time, but compiling on the exact target system may produce more optimized machine code.

3.2.1 Uses of Just In Time Compilation

The most widely recognized use of just-in-time (JIT) compilation techniques is in the Java HotSpot® Virtual Machine (JVM) [51]. The JVM allows Java to make its famous "write once, run anywhere" claim by delaying compilation to native code until runtime, when the target hardware architecture is known and a JIT compiler for that architecture is available.

Java goes so far as to provide different JIT compilers for different use cases. For example, the Java Client VM is tuned for responsiveness, which means it accepts somewhat less optimal code in exchange for shorter JIT pauses. In contrast, the Java Server VM makes the opposite trade-off, allowing more optimization to be done at the cost of longer JIT pauses to ensure faster execution once a steady state is reached.

OpenCL [52] uses JIT compilation in much the same way Java does, except to support a wider range of novel hardware architectures. This is essential for its design goal of supporting GPUs, as the hardware architectures for these processors are still in flux. For example, the recent AMD "Graphics Core Next" GPU architecture is significantly different from the design of previous AMD GPUs, but OpenCL code written for the older hardware will run on it perfectly.

A final notable example of JIT techniques is the research language Self [53]. This language pioneered the technique of using JIT type-specialization to allow for very efficient execution of dynamic programming languages. Although type specialization is different from value specialization, this is one of the most significant examples of JIT specialization providing significant performance benefits in practice, allowing a dynamic, purely object-oriented language to approach within a small factor of optimized C in performance. These same techniques are used in the fast JavaScript runtimes in modern web browsers.

3.3 Existing Uses of Just In Time Specialization

Specialization is a well known technique that is used widely in existing compilers. Some examples of this technique in just in time compilers for data parallel applications are described in this section.

3.3.1 POCL

The POCL OpenCL runtime specializes kernels on work group size [54].
POCL runs each work group in its own thread, and processes the items in a work group by looping through them. Specializing on work group size means that the number of work items per thread (and thus the size of the loop) is a known constant. This simplifies the correct implementation of the OpenCL specification by allowing easier static analysis and transformation of this work-item loop, which is necessary to implement OpenCL's barrier semantics in POCL's one-thread-per-workgroup model. Although this transformation is primarily for implementation simplicity rather than performance, it is likely to provide a performance benefit for simple kernels by allowing this loop to be unrolled.

3.3.2 OpenGL on Mac OS X

Apple uses LLVM to specialize shaders in their Mac OS X implementation of OpenGL. According to a mailing list post by Chris Lattner [55]:

    [Apple's OpenGL implementation does] runtime code specialization within the fixed-function vertex-processing pipeline. Basically, the OpenGL pipeline has many parameters (is fog enabled? do vertices have texture info? etc) which rarely change: executing the fully branchy code swamps the branch predictors and performs poorly. To solve this, the code is precompiled to LLVM .bc form, from which specializations of the code are made, optimized, and JIT compiled as they are needed at runtime.

3.3.3 Copperhead

Copperhead [56][57] is a project that allows data-parallel code to be written directly in Python and executed efficiently on a GPU. Copperhead provides nested parallelism in a functional style, providing parallel versions of traditional operations like map and reduce. These functions execute about half as fast as hand-coded CUDA kernels. For kernels where this approach works and provides sufficient performance, Copperhead or something like it may simply be the best option for development of GPU accelerated kernels. At the moment the main drawback to Copperhead is its dependence on the Nvidia CUDA platform, but the authors anticipate support for other back-ends (including OpenCL) in the future.

3.3.4 SEJITS

SEJITS [58] describes earlier work by some of the same people who work on Copperhead. The project explores JIT compilation of programs written using Copperhead-style parallel embedded domain specific languages. Although the term "specialization" is used in the SEJITS paper to refer to hardware-targeted JIT compilation, value-specific partial evaluation is mentioned as an optimization possibly enabled by a JIT compiling system.

Chapter 4

The Bacon System

Bacon is a GPU programming system intended to make it easier to work with OpenCL. It provides a simplified API for running kernels from C++ host programs and an extension to OpenCL C, called Bacon C, intended to make life easier for kernel developers. In this chapter the first prototype of the Bacon system and its performance characteristics are described. Most of this work was done in 2011 and 2012, and was published in PDPTA 2012 [59].

4.1 The Bacon C Language

Bacon C is based on OpenCL C with extensions for improved usability and to enable the automatic generation of C++ wrapper code. A grammar for the Bacon C language is included in Appendix B (Listing B.2.1). Sample Bacon C kernels that perform matrix multiplication and demonstrate nine of the new features of the language are shown in Listing 4.1 and Listing B.2.2. These nine features are:

1. In-kernel sequential setup
2. In-kernel range specification
3. Kernels can return values
4. Parameterized types
Error handling with assert and fail 34 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 kernel Array2D<f l o a t > mat_mul ( Array2D<f l o a t > aa , Array2D<f l o a t > bb ) { SETUP: g l o b a l Array2D<f l o a t > c c [ aa . rows , bb . c o l s ] ; BODY: @range [ c c . rows , c c . c o l s ] ; } f l o a t sum = 0 . 0 ; a s s e r t ( aa . c o l s == bb . rows , ” M a t r i c e s must have c o m p a t i b l e d i m e n s i o n s . ” ) ; for ( int kk = 0 ; kk < aa . c o l s ; ++kk ) { sum += aa [ $row , kk ] * bb [ kk , $ c o l ] ; } c c [ $row , $ c o l ] = sum ; return c c ; Listing 4.1: Naive Matrix Multiplication in Bacon C 6. Multi-dimensional arrays 7. Variable sized private arrays 8. Specialization on array sizes 9. Specialization on const parameters Bacon preserves the OpenCL single program multiple data (SPMD) computation model. A kernel is executed in parallel over a 1D, 2D, or 3D range. Each executing instance of the kernel can query its position in that range to determine which part of the work it is responsible for performing. For example, the kernel in Listing 4.1 will be executed in parallel once for each element in the output matrix. Each kernel is separated into SETUP and BODY sections. The SETUP section (line 5 of Listing 4.1) is for code that will run serially on the host processor while the BODY section contains the code to be executed in parallel. In practice, the SETUP section is primarily used to declare output arrays that can be returned to the host application and to compute the sizes of these arrays. 35 Unlike OpenCL C kernels, Bacon C kernels can return values (a return type, Array2D<float>, is shown on line 2 in Listing 4.1). Any variable of a simple type can be returned, as can any array passed as an argument or declared in the SETUP section. The return statement occurs in the kernel BODY (line 18 in Listing 4.1) and can be selected conditionally (e.g., by an if statement), but the behavior if different parallel instances of the kernel try to return different values is undefined. Each BODY includes an @range declaration (line 9 in Listing 4.1) that specifies the range it will be executed over in parallel. Within the BODY, the current position in that range is held in special Bacon-specific variables named $col, $row, and $dep for the first, second, and third dimension respectively (line 17 in Listing 4.1). The range is formatted like an array declaration, so a BODY with @range[4] will be executed 4 times in parallel with $col having the values 0, 1, 2, and 3. Since this is only a 1D range, the values of $row and $dep will both be zero in all four instances. Bacon provides parameterized types (line 6 in Listing 4.1) for 1D, 2D, and 3D arrays using C++-style angle bracket syntax. Both declarations and element access use a comma separated list of numbers in square brackets (line 6 in Listing 4.1). The dimensions of these arrays can be accessed using struct-style dot notation. For example, a three by three by three array of integers called “cube” could be allocated with int cube[3,3,3]; and the width of that array could be accessed with cube.cols. Additional error handling is provided through the assert (line 12 in Listing 4.1) and fail keywords which will stop kernel execution and raise exceptions in the host process if triggered. A fail is triggered if execution in any thread reaches that statement, while an assert is only triggered if its associated condition is false. Each Bacon kernel has a set of specialization variables. These fall into two categories. 
First, the dimensions of any arrays passed as arguments to a kernel are always specialization variables. Second, additional specialization variables can be specified explicitly by declaring arguments using the const qualifier (like the blksz argument on line 3 of Listing B.2.2). Whenever a kernel is called with a new set of values for its specialization variables, a specialized version of that kernel is generated and executed. Specialized kernels are cached for future calls with the same set of specialization values.

This specialization, in addition to providing performance benefits, allows variable sized arrays in thread-private memory as long as the array size depends only on const variables and array dimensions. Since OpenCL does not allow any form of in-kernel dynamic memory allocation, this makes it possible for users to write kernels that would have been difficult to write using OpenCL directly (this restriction is loosened in OpenCL 2.0 [38], released in November 2013). The blocked matrix multiply kernel in Listing B.2.2 gives an example of this feature at line 11.

4.2 Kernel Lifecycle

The basic technique of separating out high performance "kernels" from an application and implementing them in a separate language has been used in software development for decades. This even occurs for purely sequential programming. For example, an application written primarily in C++ may have high performance routines written in hand-optimized assembly code or FORTRAN. Traditionally, the host application and kernel code are compiled into separate modules and then linked together before execution. This basic sequence is shown in the first column of Table 4.1.

Table 4.1: Lifecycle of Bacon and OpenCL Kernels

 1. Traditional (e.g. FORTRAN kernel): Developer writes host application (C++) and kernel (e.g. FORTRAN).  OpenCL kernel: Developer writes host application (C++) and kernel (OpenCL C).  Bacon kernel: Developer writes host application (C++) and kernel (Bacon C).
 2. Traditional: —.  OpenCL: Developer writes wrapper code (C++) to load and run kernel.  Bacon: Bacon compiler generates wrapper code (C++) to load and run kernel.
 3. Traditional: Host application and kernel are compiled.  OpenCL: Host application and wrapper code are compiled.  Bacon: Host application and wrapper code are compiled.
 4. Traditional: Host application is run.  OpenCL: Host application is run.  Bacon: Host application is run.
 5. Traditional: —.  OpenCL: —.  Bacon: Input data is read.
 6. Traditional: —.  OpenCL: —.  Bacon: Bacon library generates specialized OpenCL kernel.
 7. Traditional: —.  OpenCL: Target GPU is identified.  Bacon: Target GPU is identified.
 8. Traditional: —.  OpenCL: Kernel is compiled for target GPU.  Bacon: Kernel is compiled for target GPU.
 9. Traditional: Input data is read.  OpenCL: Input data is read.  Bacon: —.
10. Traditional: Kernel is executed on input data.  OpenCL: Kernel is executed on input data.  Bacon: Kernel is executed on input data.

Compiling everything before execution, or ahead-of-time (AOT) compilation, has one major downside: the target hardware is set when the application or module is compiled. Just-in-time (JIT) compilation avoids this problem by delaying compilation until the program is run on a specific machine, allowing the target hardware to be detected dynamically at runtime. OpenCL uses JIT techniques to allow the portability of kernels across the variety of compute-capable GPUs and other parallel acceleration hardware that provide support for the standard. The JIT OpenCL kernel lifecycle is shown in the second column of Table 4.1.

Bacon takes delayed compilation one step further, waiting to compile a kernel until it is actually called and the characteristics of the arguments can be examined.
This allows just-in-time specialization to be performed, as shown in the third column of the table. 4.3 Implementation The Bacon system consists of two pieces: the Bacon compiler and the Bacon runtime library. The compiler runs at application compile time and parses the Bacon C source, generating a C++ wrapper and a serialized abstract syntax tree (AST). The Bacon runtime library is called from the generated wrapper as the host application is running to load the AST, generate specialized OpenCL C code when a kernel is called, and run that code on the GPU using the OpenCL runtime. The system is built using Perl and C++. The Bacon compiler parses the source code using Parse::Yapp [60], a yacc-compatible parser generator for Perl. This constructs the abstract syntax tree as a Perl data structure. The C++ wrapper is then generated by traversing this tree. The generated C++ wrapper provides a C++ function with the kernel’s type signature that can be called from the user’s application. When this function is called, the Bacon runtime library loads the AST and traverses it to generate the specialized OpenCL code. Two optimizations are performed at code generation time directly from the AST without the use of a traditional low-level intermediate representation. The first optimization, constant propagation calculates the values of all the variables that have been marked as const by the programmer. If the value of any of these variables cannot be computed from the specialized arguments to the kernel, the Bacon runtime library will throw an exception. This information is used to construct a symbol table, and references to these variables are replaced with their constant integer values in the generated OpenCL code. 38 Loop unrolling is performed on any loops for which the iteration count and range can be determined after constant propagation. Short loops are fully unrolled. In this case, no loop is passed to the compiler at all. An example of this is shown at the bottom of Listing B.2.2. Longer loops are unrolled by some factor that evenly divides the iteration count. This specialized and optimized OpenCL C code is then passed to the OpenCL compiler provided with the vendor SDK which will perform further optimizations on the generated code, including more aggressive constant propagation and possibly static register-load scheduling enabled by the Bacon preoptimizations. The implementation of Bacon is available publicly under an open source license. The current version can be downloaded from the public git repository.2 4.4 Ease of Use We believe that Bacon provides an improvement in ease of use compared to OpenCL by itself. A number of syntactic improvements are described in Section 4.1. More significantly, the automatic generation of wrapper code performed by the Bacon compiler significantly reduces the number of lines of code that the developer is required to write. As a concrete example of the reduction in lines of code required by Bacon, we consider naive matrix multiplication kernels like the one shown in Listing 4.1. Our OpenCL C version of this kernel is the same length: 17 lines of nonwhitespace code. Unfortunately, even ignoring the application-specific code required to construct the input matrices, it takes 174 lines of non-whitespace C++ code to compile and call this 17 line OpenCL kernel. In contrast, the Bacon version only requires three lines to load, specialize, and call; all the bookkeeping is done either by the automatically generated wrapper code or by the Bacon runtime library. 
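As a rough illustration of what those few lines look like, the sketch below calls the C++ wrapper function that the Bacon compiler generates for the kernel in Listing 4.1. The header name, the Bacon::Array2D spelling, and the helper that builds the inputs are assumptions made for this example; only the idea that the generated wrapper is called like an ordinary C++ function is taken from the description above.

#include "mat_mul.hh"   // hypothetical name for the wrapper generated by the Bacon compiler

void multiply_on_gpu(Bacon::Array2D<float>& aa, Bacon::Array2D<float>& bb)
{
    // One call loads the kernel's AST, specializes it on the argument sizes,
    // compiles the specialized OpenCL C, and runs it on the GPU.
    Bacon::Array2D<float> cc = mat_mul(aa, bb);
    (void) cc;  // result used by the application
}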
2 https://github.com/NatTuck/bacon

4.5 Performance

To evaluate the performance of the Bacon system, we compare the run time of matrix multiplication kernels written in both Bacon C and OpenCL C. These results are summarized in Table 4.2, which shows that Bacon is able to provide measurable performance improvements over similar programs written directly in OpenCL C.

Test                            Time (s)   Speedup
OpenCL - Naive                  11.9       1.0
OpenCL - Hand Vectorized        2.54       4.7
Bacon - Naive                   3.45       3.5
Bacon - Blocked (unroll = 2)    1.97       6.1

Table 4.2: Summary of 4k Matrix Multiplication Performance

Testing was performed on an AMD Radeon HD 5830 GPU. This is a mid-range GPU intended for high definition computer gaming. As of early 2012 this card was already a full hardware generation out of date, but it still has a theoretical parallel compute capacity of 1.7 teraflops, which is more than an order of magnitude greater than a high end CPU like the Intel Core i7 3930 at 154 gigaflops (FLOPS ratings from manufacturer specifications).

Four implementations of matrix multiplication were tested:

Bacon - Naive: A textbook implementation of parallel matrix multiplication. Shown in Listing 4.1.

OpenCL - Naive: An OpenCL implementation equivalent to the naive Bacon code.

Bacon - Blocked: Shown in Listing B.2.2. This generalizes a 2D unrolling of the computation into square blocks.

OpenCL - Hand Vectorized: Hand unrolled to compute 4x4 blocks at once. Explicitly uses OpenCL's native vector types. Based on a sample from the AMD OpenCL SDK.

The execution time of these four kernels was tested on randomly generated 4096x4096 matrices. Each test was performed five times and the average result was taken. The times were very consistent; most tests had a coefficient of variation under one percent. The speedups over the naive OpenCL implementation are shown in Figure 4.1.

Figure 4.1: Speedup on 4k matrix multiplication with unrolled loops over naive OpenCL implementation (speedup versus unroll factor for the simple Bacon and 2x2-blocked kernels).

4.5.1 Discussion

These measurements show that value specialization provides significant speedups over a non-specialized OpenCL kernel. This result can be explained by the fact that providing constant values and unrolled loops at compile time allows the OpenCL C compiler to do extensive constant propagation-based optimizations.

Bacon provides good performance results regardless of whether the code is hand optimized. The naive Bacon kernel beats the naive OpenCL kernel, the hand-optimized Bacon kernel beats the hand-optimized OpenCL kernel, and the naive Bacon kernel even comes close in performance to the optimized OpenCL kernel.

Unlike constant propagation, which provides a clear and consistent performance benefit, loop unrolling is more complicated. Too much unrolling will decrease loop performance, most likely due to the exhaustion of registers on the GPU device. Still, when properly tuned, loop unrolling provides significant speedups, allowing the blocked Bacon kernel to beat the hand vectorized OpenCL kernel by nearly 30 percent.

Somewhat surprisingly, Bacon's loop unrolling is able to beat the vectorized OpenCL code without the explicit use of native vector types. From this we conclude that either vectorization is being done automatically by the AMD OpenCL compiler or the use of vector types doesn't have a significant performance benefit for this matrix multiplication algorithm.
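To make that mechanism concrete, the fragment below sketches the kind of rewrite the Bacon code generator performs before handing a kernel to the vendor's OpenCL compiler. It is an illustration, not the actual generated output; the variable names are invented for the example.

/* Before specialization: the trip count arrives as a kernel argument. */
for (int kk = 0; kk < aa_cols; ++kk)
    sum += aa[row * aa_cols + kk] * bb[kk * bb_cols + col];

/* After specializing on aa_cols = bb_cols = 4096, the emitted OpenCL C
 * contains literal bounds, so the vendor compiler can constant-fold the
 * index arithmetic and unroll the loop. */
for (int kk = 0; kk < 4096; ++kk)
    sum += aa[row * 4096 + kk] * bb[kk * 4096 + col];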
Chapter 5

JIT Specialization of OpenCL

5.1 Specialization Annotated OpenCL

Experimentation with the Bacon prototype described in Chapter 4 showed that just-in-time specialization was a compiler optimization technique worthy of further study. In this chapter we describe Specialization Annotated OpenCL (SOCL), an extension to the OpenCL C language intended to allow JIT specialization of OpenCL compute kernels to be studied directly in isolation from the other features of Bacon.

This chapter has three parts. First, the extension to the OpenCL C language is described. Second, two approaches to implementing SOCL are described. Finally, test cases and benchmarks are presented that are used in later chapters to verify the correctness and test the performance of SOCL implementations.

5.1.1 Specialization Annotations for OpenCL C

SOCL C is standard OpenCL C with the addition of specialization annotations. A specialization annotation is a specially formatted comment placed between the argument list and body of an OpenCL kernel. For example, a matrix multiplication kernel could be specialized on the size of the matrices by being annotated as shown in Listing 5.1.

kernel void mmul(global double* C, global double* A,
                 global double* B, size_t nn, size_t mm)
    /* @spec mmul(nn, mm) */
{
    // kernel body omitted
}

Listing 5.1: mmul.cl: Example specialization annotation

Inside the comment, the annotation has three parts. The string "@spec" marks the comment as a specialization annotation. After the marker is the name of the function. Enclosed in parentheses after the name of the function is the list of arguments that the function will be specialized on.

This annotation format was chosen primarily for ease of implementation and testing. Adding a comment does not change the standard OpenCL code at all, since the comment will be ignored by a standard OpenCL compiler. Further, the annotation is easy to spot visually in the source code and is easy to parse. If SOCL were to be adopted as an extension to production OpenCL implementations it would be more appropriate to use either a "#pragma" directive or a new keyword to mark arguments in the kernel argument list directly.

Figure 5.1: Execution sequence: Standard OpenCL vs SOCL. (With standard OpenCL, the host program hands the kernel source to the OpenCL JIT and then calls the compiled kernel executable code; with SOCL, the kernel arguments feed a specializer that specializes and compiles the kernel, and specialized kernels are cached for future calls.)

SOCL Execution Model

SOCL kernels are executed like standard OpenCL kernels, except specialized code will be generated and executed for each set of values for the specialized arguments. For example, if the kernel from Listing 5.1 is called several times with parameters mm = nn = 8, execution will be equivalent to the code in Listing 5.2. One of two implementations of SOCL described in this dissertation, Pancake, performs specialization by generating code very much like this.

kernel void mmul(global double* C, global double* A, global double* B)
{
    const size_t nn = 8;
    const size_t mm = 8;

    // kernel body omitted
}

Listing 5.2: mmul-codegen.cl: Execution model example.

5.1.2 Implementation Techniques for SOCL

To execute SOCL kernels, it is necessary to generate specialized kernel code at runtime.
Code generation must be performed lazily after the kernel arguments are known. This altered execution sequence is compared to the standard OpenCL process in Figure 5.1. There are two ways to implement this specializer: either specialized OpenCL C source code can be generated and passed to the OpenCL C compiler or specialized object code can be generated during the OpenCL C compilation process. Generation of specialized OpenCL C source code is explored in Chapter 5.2, in a library called Pancake. This technique either requires a modification of the OpenCL runtime, modification of the host program, or trickery with shadowing symbols in dynamic libraries. Pancake requires that the host program be modified by adding a C header. This allows SOCL to be used with OpenCL implementations that can’t be modified, as is common for GPU vendor OpenCL libraries. Generation of specialized object code by modifying an OpenCL compiler is explored in Chapter 5.4, where the open source POCL OpenCL compiler is modified. Compared to specializing the source code, this technique potentially allows for faster compilation, since the OpenCL source code only needs to be parsed once for any number of specializations of a kernel. It also potentially allows for better code to be generated, since the OpenCL C compiler could be made explicitly aware that specialization is occurring. 5.1.3 Benchmarks and Tests for SOCL Testing the correctness and performance of the SOCL implementations described in the next chapters requires OpenCL programs with specialization annotations. Two new programs have been written for this purpose and six existing OpenCL benchmarks have had annotations added. These test programs are described below. Full listings for the kernels of the two new programs are included in Appendix B.1. My expectation is that the highest performance benefits for specialization with SOCL will come from kernels where specialization transforms a tight inner loop from having a variable number of iterations to having a constant number of iterations. This makes the loop easily unrollable. Other kernels may or may 46 not see any performance benefit, depending on the effects of other compiler optimizations such as algebraic simplification. New Test Cases These programs were written specifically as test cases for this project. They are intentionally written as naive implementations of the algorithm, without any hand optimization. This may make performance effects from optimization easier to see compared to the highly optimized test programs taken from hardware benchmark suites. Gaussian Blur (blur) This program applies Gaussian blur to a large (5184x3456) greyscale image. Because a 2D Gaussian blur can be calculated by applying two orthogonal 1D Gaussian blurs to an image in sequence, this program runs two OpenCL kernels: one to blur vertically and one to blur horizontally. The kernels are each specialized on three arguments, the width and height of the image and the radius of the blur effect. The image width and height are only used for index calculations, so their specialization should have minimal effect. The radius is used for the inner count of a tight inner loop, so its specialization should result in a measurable performance benefit. Matrix Multiply (mmul) This program multiplies two square matrices. Although matrix multiply programs are common in OpenCL vendor samples, they are generally optimized to take advantages of features like vector types. 
This program was written specifically to be a simple and easily readable version of parallel matrix multiply. The kernel is specialized on two arguments: the width and height of the matrices. Since the iteration count of the inner loop is dependent on the width and height of the matrices, this test should show measurable performance improvement from specialization. SNU Port of NAS Parallel Benchmarks This is a port of the NASA Advanced Supercomputing Division’s NAS Parallel Benchmark [61] suite for large parallel HPC installations to OpenCL, 47 performed by the Center for Manycore Programming at Seoul National University in Korea [62]. These benchmarks are reasonably complex and require a significant amount of communication in patterns that are not easily implemented in OpenCL, so the SNU developers put significant effort into splitting kernels for synchronization and hand optimizing to get high speedups. Conjugate Gradient (nas-cg) This program is described by the NAS benchmark description as “Solving an unstructured sparse linear system by the conjugate gradient method” with the benchmark goal of testing “irregular memory access and communication”. For the OpenCL implementation this program was split into 13 kernels. The kernels were all specialized on a single argument, n, which remains constant across any single execution of the program. As this argument is not used directly as the bound of any loop, specialization and loop unrolling seem unlikely to provide any performance benefit. Scalar Penta-Diagonal (nas-sp) This program is intended to represent a larger HPC application that involves multi-step processing of data. It performs a fluid dynamics simulation. This program uses 26 OpenCL kernels. Fourteen of these kernels are specialized on parameters that stay nearly constant in a single execution of the program, some of which are used as loop bounds. This program may see benefits from specialization. Nvidia OpenCL Samples These program were released by Nvidia as sample programs to demonstrate the usage of their OpenCL implementation. They are written primarily for clarity to serve as good examples, but include some features intended to improve performance. Black-Scholes Option Pricing (nv-bs) According to Nvidia, “This sample evaluates fair call and put prices for a given set of European options by the Black-Scholes formula.” 48 This kernel is specialized on a single variable, OptN, which is the maximum number of iterations of the algorithm. This is a constant across multiple calls to the kernel in the test program. It specifies the iteration limit of a reasonably tight inner loop, but the loop start varies across work items so the loop may not be possible to unroll. Specialization may or may not provide a performance benefit on this program. DCT (nv-dct) This program performs a Discrete Cosine Transform on an image. The kernel is specialized on the stride, width, and height of the input image, and the block size of the transform, which all remain constant in the test program. The program was initially developed with a constant block size of eight, but this was made a variable to exercise specialization. Since all the loops in the program are a number of iterations equal to the block size, this program should see speedups due to specialization. Rodinia Benchmarks These benchmarks were found in the Rodinia Benchmark Suite [63], from the University of Virginia. This suite is intended to compare the performance of hardware and OpenCL implementations. 
The benchmarks are extracted from real world scientific computing applications, and are heavily hand-optimized for performance.

Mandelbrot (mandelbrot)
This generates a 2048x2048 image of the Mandelbrot set. It has a single OpenCL kernel to generate this image. The kernel is specialized on four arguments: the width and height of the image, the zoom level, and the maximum number of iterations of the tight inner loop. Although the number of iterations of the tight inner loop is bounded by a specialized argument, each instance of the kernel may iterate fewer times. Specialization may not provide a significant performance advantage in this test if the loop cannot be unrolled.

Particle Filter (particlefilter)
This program tries to estimate a current location based on a model of movement and fuzzy location measurements. The main kernel is specialized on the number of particles used, which is constant through an execution of the program. This kernel has a loop based on the number of particles, but the start point is based on the current particle, so the loop may not be possible to unroll. This program may or may not see speedups from specialization.

5.2 Pancake: Implementing SOCL by Specializing OpenCL C

In this chapter we introduce Pancake, the first implementation of Specialization Annotated OpenCL (SOCL) (described in Chapter 5) discussed in this dissertation. Pancake is a C library that, with minimal source code modification, transparently enables specialization of SOCL kernels when using any OpenCL runtime. The use and implementation of Pancake are described, as are the results of testing Pancake with the OpenCL benchmarks described in Chapter 5.

5.2.1 Using Pancake

Pancake can be added to an existing OpenCL program that uses the OpenCL API for either C or C++ in four steps, shown below. Once Pancake has been enabled, all OpenCL kernels will be treated as SOCL kernels and automatically specialized according to any specialization annotations that are present.

1. Modify your OpenCL kernels to add specialization annotations as described in Chapter 5.
2. Include the header "pancake/shim.h" in each file that uses the OpenCL API, after the standard OpenCL header.
3. Link the "pancake" library with the host program executable.
4. Set the environment variable PANCAKE_SPEC when you run your host program. This variable can be left unset to disable specialization.

Table 5.1: Comparison of Standard OpenCL and Pancake Kernel Execution

  clCreateProgram("program.cl"): Standard OpenCL loads the OpenCL C source; Pancake sets the OpenCL C source file name.
  clBuildProgram(flags): Standard OpenCL compiles the source file; Pancake sets the compiler flags.
  clCreateKernel("name"): Standard OpenCL gets a reference to the kernel; Pancake sets the kernel name.
  clSetKernelArg(num, value): Standard OpenCL sets the kernel arguments; Pancake sets the kernel arguments.
  clEnqueueNDRangeKernel: Standard OpenCL runs the kernel; Pancake specializes the kernel, then performs the full standard OpenCL sequence on the specialized source.

5.2.2 How Pancake Works

Pancake operates by defining a sequence of C macros that replace calls to the OpenCL API for C in the host program with calls to functions in the Pancake library. The Pancake library generates specialized OpenCL C code and uses the OpenCL API for C itself to compile and execute it. Because the OpenCL API for C++ is written in terms of the API for C, Pancake will work with C++ programs as well. To specialize SOCL kernels, Pancake delays compilation until the kernel has actually been called.
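The following fragment sketches that macro-interception idea in miniature. It is not Pancake's actual header; the wrapper function name is invented for illustration, and only the OpenCL entry point it renames is real.

/* Sketch of the interception approach (hypothetical names, not the real
 * pancake/shim.h). After the #define, host code that appears to call
 * clEnqueueNDRangeKernel actually calls the wrapper, which can generate
 * and compile a specialized kernel before running it. */
#include <CL/cl.h>

cl_int pancake_clEnqueueNDRangeKernel(cl_command_queue queue, cl_kernel kernel,
                                      cl_uint work_dim,
                                      const size_t *global_work_offset,
                                      const size_t *global_work_size,
                                      const size_t *local_work_size,
                                      cl_uint num_events_in_wait_list,
                                      const cl_event *event_wait_list,
                                      cl_event *event);

#define clEnqueueNDRangeKernel pancake_clEnqueueNDRangeKernel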
The modified kernel compilation process in Pancake compared to standard OpenCL is shown in Table 5.1. To accomplish this, Pancake replaces the OpenCL cl_program and cl_kernel structures with wrappers and intercepts all OpenCL run-time library calls (15 of them) that operate on either structure. Intercepting these calls gives access to sufficient data to generate specialized OpenCL C source and then compile and execute it.

To avoid duplicate compilation work, Pancake caches compiled kernels and re-uses them when all the specialized arguments match. Currently a linked list is used for the cache and old entries are never cleaned up, so Pancake will need additional optimization before it is suitable for use in long-running programs that generate thousands of different specialized kernels.

5.2.3 Pancake Testing

Each benchmark was tested in four different configurations:

1. default - Specialization and unrolling disabled. Equivalent to standard OpenCL behavior.
2. unroll - Loop unrolling is enabled but specialization is not.
3. spec - Specialization is enabled but loop unrolling is not.
4. spec-unroll - Both specialization and unrolling are enabled.

Timings for tests on an AMD Radeon 5830 graphics card are shown in Figure 5.2, while timings with an Nvidia GeForce GTX 650 Ti card are shown in Figure 5.3. A summary of speedups from specialization, comparing default OpenCL behavior to specialization and unrolling, is shown in Table 5.2. These timings were taken according to the procedure described in Appendix A.1. The test machines used are described in Appendix A.2.

Benchmark         AMD Radeon 5830    GeForce GTX 650 Ti
blur              175.5%             16.1%
mandelbrot        3.6%               23.5%
mmul              91.4%              86.7%
nas-cg            -1.6%              -0.4%
nas-sp            0.1%               0.2%
nv-bs             8.0%               1.0%
nv-dct            3.5%               2.0%
particlefilter    0.2%               -2.7%

Table 5.2: Speedups with Pancake

These timings show that specialization with Pancake and vendor GPU runtimes can provide a very large performance improvement on a few especially well suited kernels (e.g., "mmul" runs nearly twice as fast). Otherwise, it provides a small speedup on most of the test cases and a small slowdown on some others. Note that the slowdowns can be eliminated by removing the specialization annotations on those kernels.

Results by Benchmark

Gaussian Blur (blur)
This benchmark saw a speedup as expected. The improvement was much more significant on the AMD GPU than the Nvidia GPU, showing the single largest speedup (175%) observed in the tests. Specialization was essential to the speedup on this benchmark. Attempting to unroll loops without specialization did not work, and in fact caused a significant slowdown on the Nvidia GPU.

Mandelbrot (mandelbrot)
It was difficult to predict an outcome for this benchmark due to the complex loop condition, but it showed a speedup on both GPUs. The effect was significantly larger on the Nvidia GPU. Specialization was essential to the speedup on this benchmark. Attempting to unroll loops without specialization had no effect.

Matrix Multiply (mmul)
This benchmark showed a speedup as expected. The effect was similar on both GPUs: a large speedup of around 90%. This kernel is ideal for specialization, consisting almost entirely of a single tight simple loop that gets a constant trip count from specialization. Specialization was essential to the speedup on this benchmark. Attempting to unroll loops without specialization had no effect on AMD and a very large negative effect on Nvidia.
Interestingly, on AMD specifying unrolling with specialization had a negative effect compared to specialization alone. I don't have a good explanation for this.

Conjugate Gradient (nas-cg)
This benchmark showed no speedup from specialization, as expected. It was possible that it would enable algebraic simplification for a small speedup, but a small slowdown was observed instead. Loop unrolling had very little effect on this benchmark.

Scalar Penta-Diagonal (nas-sp)
This benchmark showed no speedup from specialization. This benchmark looked slightly more promising than "nas-cg", but was not. Unrolling had no effect on this benchmark on the Nvidia GPU. On AMD, specialization alone or unrolling alone provided a small speedup, but this disappeared when both were applied together.

Black-Scholes Option Pricing (nv-bs)
This benchmark showed a small speedup from specialization, which was consistent with the expected possible speedup. The effect was significantly larger on AMD than Nvidia. On Nvidia, unrolling alone provided a larger speedup than unrolling with specialization. On AMD, unrolling alone provided a speedup significantly smaller than specialization and unrolling.

DCT (nv-dct)
This benchmark showed a small speedup from specialization, less of a gain than was expected. The speedup on AMD was somewhat larger than on Nvidia. Specialization was essential to the speedup on this benchmark. Attempting to unroll loops without specialization had no effect on Nvidia and only a very small effect on AMD.

Particle Filter (particlefilter)
No speedup was observed from specialization on this benchmark, with Nvidia showing a small slowdown. This is consistent with the expected result. The results were similar for unrolling or specialization alone.

Figure 5.2: Benchmark Timings for Pancake on AMD Radeon 5830 (bar charts of kernel execution times for each benchmark under the default, default-unroll, default-spec, and default-spec-unroll configurations).

Figure 5.3: Benchmark Timings for Pancake on Nvidia GTX 650 Ti (bar charts of kernel execution times for each benchmark under the default, default-unroll, default-spec, and default-spec-unroll configurations).

5.3 Serial Specialization with LLVM

Now that JIT specialization on OpenCL C source has been examined in both the Bacon prototype and with Pancake, it is useful to explore integrating JIT specialization directly into an OpenCL implementation. The major open source implementations of OpenCL are POCL [64] and Clover [45]. Both of these implementations are based on the LLVM compiler system, so this chapter will examine LLVM and explore how LLVM can be extended with support for specialization.

5.3.1 Overview of LLVM

LLVM [65], formerly the Low Level Virtual Machine, is a collection of tools and libraries for constructing compilers. It includes clang [66], a modern standards-compliant compiler for C, C++, and Objective C that can be used as a drop-in replacement for the GNU GCC compilers for those languages. LLVM has been used as the basis for most implementations of the OpenCL standard, including the implementations from AMD, Apple, and Nvidia, as well as the open source implementations POCL and Clover.

In a traditional optimizing compiler, you have a "front end" which parses source code and outputs some intermediate format code, a "middle end" which takes several passes over that intermediate format code to optimize it, and then a "back end" that outputs executable code for a specific machine. LLVM is structured to provide these as separate modular pieces with a single intermediate code representation to tie them together, called LLVM intermediate representation (LLVM IR). This makes it easy to build front ends for new languages, back ends for new hardware, and new optimization passes that are compatible with all of the existing pieces.

LLVM IR is a single static assignment (SSA) code representation, which simplifies many compiler optimizations. This IR code can be serialized in two formats: LLVM assembly language (file extension .ll) and LLVM bitcode (file extension .bc). LLVM provides numerous tools to work with these files, including llvm-as and llvm-dis to translate between the serialization formats and opt to run built-in and user-supplied optimization passes.

For example, here's how to compile and optimize a Hello World C program one step at a time using LLVM:

$ clang -c -emit-llvm -o hello.bc hello.c
$ opt -licm -sccp -o hello-opt.bc hello.bc
$ llc -o hello.s hello-opt.bc
$ as -o hello hello.s

In this example, the clang command is used to generate LLVM bitcode from the C source code.
Then, the opt command transforms this code by applying two optimizations (loop invariant code motion and sparse conditional constant propagation, in that order). These optimizations do not change the IR code because there are no loops or variables in “Hello World”. Next, the llc command translates the bitcode into machine specific assembly. Finally, that code is assembled into a binary. The clang tool will perform all the other steps itself when given no option flags (by calling the other tools either as libraries or directly), but this example clearly shows the three phases of compilation and how the tools communicate with each other using LLVM IR in bitcode format. Note that the system assembler (as) is used to generate the final executable from the target specific assembly code, as LLVM does not yet include its own assembler. 5.3.2 LLVM Optimization Passes LLVM provides a variety of optimization passes implementing many traditional compiler optimizations. A number of these passes that seem especially applicable for compute kernels are shown in Table 5.3 and Table 5.4. Other optimization passes are not considered, for example cross-procedure optimizations are ignored because compute kernels are assumed to be fully inlined. Information on some of these passes is available in the LLVM documentation [67], while information about other passes is only available as comments in the source code. 5.3.3 Specializing LLVM Bitcode To test the effects of specialization as an optimization, an LLVM module pass [68] was implemented to specialize kernel functions in LLVM bitcode. In this section we describe how that transformation pass works and provides some benchmarks for specialization on simple serial kernels. Producing a kernel specialized on an argument value is a relatively simple process. It would be nearly sufficient to textually add a C preprocessor macro 58 Flag Name – Description -adce Aggressive Dead Code Elimination – Eliminates dead code using a “dead until proven alive” algorithm -bb-vectorize Basic Block Vectorize – Scans each basic block and combines scalar operations into vector operations -correlated-propagation Correlated Value Propagation – Propagates constants across basic block boundaries -dse Dead Store Elimination – Eliminates stores to local variables that will have no effect on the behavior of the program -early-cse Early Common Subexpression Elimination – Eliminates simple redundant computations and canonicalizes some operations -globaldce Global Dead Code Elimination – Eliminates unused global variables and internal functions -globalopt Global Variable Optimizer – Finds read only global variables and marks them as constant and eliminates global variables that are never read (even if they’re written to) -gvn Global Value Numbering – Numbers every value produced in each function and eliminates redundant computation of those values -indvars Canonicalize Induction Variables – Transforms loops into a standard form for future loop transformations if possible -instcombine Instruction Combining – Combines redundant instructions and performs arithmetic simplification that doesn’t require reordering operations -jump-threading Jump Threading – Analyzes branches to conditional branches. If it can prove the conditional will always branch a given way in that thread of code flow, it will replace the first branch to skip the test. -lcssa Loop Closed Single Static Assignment – Adds extra phi nodes at the end of each loop for variables that had phi nodes in inner blocks. 
This simplifies some other loop transformations. -licm Loop Invariant Code Motion – Moves code in a loop that doesn’t depend on the loop iteration out of the loop Table 5.3: LLVM Optimization Flags (A-L) 59 Flag Name – Description -loop-deletion Loop Deletion – Deletes dead loops -loop-reduce Loop Strength Reduction – Decouples the calculation of array indexes in loops from the induction variable -loop-rotate Loop Rotate – Moves the loop condition check to the end of the loop. This is required for the Loop Unroll pass. -loop-simplify Loop Simplify – Guarantees that each loop has one entry, one exit, and one back-edge. This is required for other transformations including Loop Invariant Code Motion and Loop Vectorization. -loop-unroll Loop Unroll – Duplicates the loop body a number of times divisible by the trip count to reduce loop overhead. The loop must be in the form produced by Loop Rotate. This pass has two extra option flags: without -unroll-allow-partial it will only fully unroll loops, and a maximum unroll amount can be specified with -unroll-threshold. -loop-unswitch Loop Unswitch – Moves loop invariant conditionals out of loops, duplicating loops if necessary. This pass depends on Loop Invariant Code Motion. -loop-vectorize Loop Vectorize – Partially unrolls loops and replaces scalar arithmetic across several iterations with vector instructions. -mem2reg Memory To Register – Replaces pointers to the stack with registers and phi nodes where possible -reassociate Arithmetic Reassociate – Rearranges arithmetic to make other optimizations work better and performs arithmetic simplification -sccp Sparse Conditional Constant Propagation – Performs constant propagation and eliminates constant conditional branches -simplifycfg Simplify Control Flow Graph – Combines basic blocks connected by an unconditional branch and removes unused basic blocks -sink Code Sinking – Moves instructions that produce values only used in one branch of a conditional into that branch so that the value isn’t calculated unless it is needed -sroa Scalar Replacement of Aggregates – Splits small structs and arrays into multiple registers Table 5.4: LLVM Optimization Flags (L-Z) 60 to the body of the kernel function that replaced all instances of the argument variable with its constant value, which is how the optimization is generally done by hand in practice. Using an LLVM transformation pass instead has two major advantages. First, it avoids edge cases like variable name shadowing and kernels that mutate their arguments. Second, it avoids re-parsing the kernel source code which speeds up specialization slightly. Specializing a kernel in an LLVM module pass consists of finding the kernel function, iterating through its arguments, and calling the replaceAllUsesWith method on each specialized argument with the appropriate value. The name of the kernel to specialize, the arguments to specialize, and the values to specialize on are specified as command line arguments. To illustrate the effect of specialization on a kernel, we consider the C function ipow in Listing 5.3, which raises an integer xx to an integer power kk by iterated multiplication. 1 2 3 4 5 6 7 8 9 10 int ipow ( int xx , int kk ) { int yy = 1 ; for ( int i i = 0 ; i i < kk ; ++i i ) yy *= xx ; return yy ; } Listing 5.3: ipow.c: An integer exponentiation kernel LLVM assembly code can be generated from the ipow.c source with the clang command as shown below, producing the LLVM assembly code1 in Listing 5.4. 
$ clang -S -emit-llvm -o ipow1.ll ipow.c

The direct translation from C explicitly puts arguments and local variables on the stack. This is necessary in general because the code may use pointers to values on the stack, but it makes the code much less clear and disables some optimization. The mem2reg transformation can be used to move these variables into registers using the opt command shown below. This produces the code shown in Listing 5.5.

1 Data layout and optimizer metadata are not included in LLVM assembly listings for clarity.

$ opt -S -mem2reg -o ipow2.ll ipow1.ll

The specialization pass can be run on this code, specializing the kk argument on the value 6, using the opt command shown below. This produces the LLVM assembly shown in Listing 5.6, with the register %kk replaced by the literal 6 in the loop comparison. The code for the LLVM specialization pass is shown in Listing B.4.1. Note that it is necessary to explicitly load the custom transformation pass from a shared library.

$ opt -S -load=libspec.so -specialize -spec-text="kk=6" \
      -kernel="ipow" -o ipow3.ll ipow2.ll

Now that the code has been specialized, it's possible to apply some further optimizations to take advantage of the specialized value. The most obvious optimization available is loop unrolling. Since the trip count is now known, the loop can be fully unrolled. With LLVM, this requires two transformation passes. First, the loop needs to be rotated to put the conditional at the bottom (-loop-rotate). Then the loop can be unrolled using the -loop-unroll pass. The commands are shown below and the resulting code is shown in Listing 5.7 and Listing 5.8.

$ opt -S -loop-rotate -o ipow4.ll ipow3.ll
$ opt -S -loop-unroll -o ipow5.ll ipow4.ll

The output of loop unrolling has a lot of unnecessary blocks and unconditional branches. This can be cleaned up with -simplifycfg to get the result shown in Listing 5.9.

$ opt -S -simplifycfg -o ipow6.ll ipow5.ll

Finally, the optimization can be taken one step further by combining some of the multiplications with -reassociate, producing the code shown in Listing 5.10.

$ opt -S -reassociate -o ipow7.ll ipow6.ll

This final version is probably the optimal code to compute x^6 by multiplication. This series of transformations clearly demonstrates the power of an optimizing compiler and the promise of specialization to generate high quality special-case code from code written for the general case.

; ModuleID = 'ipow.c'

define i32 @ipow(i32 %xx, i32 %kk) #0 {
entry:
  %xx.addr = alloca i32, align 4
  %kk.addr = alloca i32, align 4
  %yy = alloca i32, align 4
  %ii = alloca i32, align 4
  store i32 %xx, i32* %xx.addr, align 4
  store i32 %kk, i32* %kk.addr, align 4
  store i32 1, i32* %yy, align 4
  store i32 0, i32* %ii, align 4
  br label %for.cond

for.cond:                                   ; preds = %for.inc, %entry
  %0 = load i32* %ii, align 4
  %1 = load i32* %kk.addr, align 4
  %cmp = icmp slt i32 %0, %1
  br i1 %cmp, label %for.body, label %for.end

for.body:                                   ; preds = %for.cond
  %2 = load i32* %xx.addr, align 4
  %3 = load i32* %yy, align 4
  %mul = mul nsw i32 %3, %2
  store i32 %mul, i32* %yy, align 4
  br label %for.inc

for.inc:                                    ; preds = %for.body
  %4 = load i32* %ii, align 4
  %inc = add nsw i32 %4, 1
  store i32 %inc, i32* %ii, align 4
  br label %for.cond

for.end:                                    ; preds = %for.cond
  %5 = load i32* %yy, align 4
  ret i32 %5
}

Listing 5.4: ipow1.ll: Direct translation from C

; ModuleID = 'ipow.bc'

define i32 @ipow(i32 %xx, i32 %kk) #0 {
entry:
  br label %for.cond

for.cond:                                   ; preds = %for.inc, %entry
  %yy.0 = phi i32 [ 1, %entry ], [ %mul, %for.inc ]
  %ii.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %cmp = icmp slt i32 %ii.0, %kk
  br i1 %cmp, label %for.body, label %for.end

for.body:                                   ; preds = %for.cond
  %mul = mul nsw i32 %yy.0, %xx
  br label %for.inc

for.inc:                                    ; preds = %for.body
  %inc = add nsw i32 %ii.0, 1
  br label %for.cond

for.end:                                    ; preds = %for.cond
  ret i32 %yy.0
}

Listing 5.5: ipow2.ll: After the mem2reg transformation.

; ModuleID = 'ipow2.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  br label %for.cond

for.cond:                                   ; preds = %for.inc, %entry
  %yy.0 = phi i32 [ 1, %entry ], [ %mul, %for.inc ]
  %ii.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %cmp = icmp slt i32 %ii.0, 6
  br i1 %cmp, label %for.body, label %for.end

for.body:                                   ; preds = %for.cond
  %mul = mul nsw i32 %yy.0, %xx
  br label %for.inc

for.inc:                                    ; preds = %for.body
  %inc = add nsw i32 %ii.0, 1
  br label %for.cond

for.end:                                    ; preds = %for.cond
  ret i32 %yy.0
}

Listing 5.6: ipow3.ll: After specializing on kk=6

; ModuleID = 'ipow3.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  br label %for.body

for.body:                                   ; preds = %entry, %for.inc
  %ii.02 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %yy.01 = phi i32 [ 1, %entry ], [ %mul, %for.inc ]
  %mul = mul nsw i32 %yy.01, %xx
  br label %for.inc

for.inc:                                    ; preds = %for.body
  %inc = add nsw i32 %ii.02, 1
  %cmp = icmp slt i32 %inc, 6
  br i1 %cmp, label %for.body, label %for.end

for.end:                                    ; preds = %for.inc
  %yy.0.lcssa = phi i32 [ %mul, %for.inc ]
  ret i32 %yy.0.lcssa
}

Listing 5.7: ipow4.ll: After rotating the loop

; ModuleID = 'ipow4.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  br label %for.body

for.body:                                   ; preds = %entry
  br label %for.inc

for.inc:                                    ; preds = %for.body
  %mul.1 = mul nsw i32 %xx, %xx
  br label %for.inc.1

for.inc.1:                                  ; preds = %for.inc
  %mul.2 = mul nsw i32 %mul.1, %xx
  br label %for.inc.2

for.inc.2:                                  ; preds = %for.inc.1
  %mul.3 = mul nsw i32 %mul.2, %xx
  br label %for.inc.3

for.inc.3:                                  ; preds = %for.inc.2
  %mul.4 = mul nsw i32 %mul.3, %xx
  br label %for.inc.4

for.inc.4:                                  ; preds = %for.inc.3
  %mul.5 = mul nsw i32 %mul.4, %xx
  br label %for.inc.5

for.inc.5:                                  ; preds = %for.inc.4
  ret i32 %mul.5
}

Listing 5.8: ipow5.ll: After unrolling the loop

; ModuleID = 'ipow5.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  %mul.1 = mul nsw i32 %xx, %xx
  %mul.2 = mul nsw i32 %mul.1, %xx
  %mul.3 = mul nsw i32 %mul.2, %xx
  %mul.4 = mul nsw i32 %mul.3, %xx
  %mul.5 = mul nsw i32 %mul.4, %xx
  ret i32 %mul.5
}

Listing 5.9: ipow6.ll: After merging excess blocks

; ModuleID = 'ipow6.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  %0 = mul i32 %xx, %xx
  %1 = mul i32 %0, %xx
  %2 = mul i32 %1, %1
  ret i32 %2
}

Listing 5.10: ipow7.ll: After arithmetic simplification

5.3.4 Serial Specialization Testing

To test the effects of value specialization as an optimization for serial kernels, tests have been run on three simple examples. Listings for these kernels are given in Appendix B.3. These tests were run on the Opteron workstation described in Appendix A.2, using the methodology described in Appendix A.1, except that each test was repeated ten times rather than five. In these tests only the kernel function itself is subject to optimization. The rest of the test program is left unoptimized to isolate the effects of optimization on the kernel.

Ten different sets of optimizations are compared in these benchmarks, consisting of combinations of the sets of optimization flags described below. For example, a graph column labeled "spec-O3-re" describes a test where the kernel was specialized, then optimized with the LLVM heavy optimization set ("O3"), then optimized with the Unroll and Reassociate set ("re"). The optimization sets tested were found by manual exploration of the available LLVM optimizations and examination of the resulting assembly code. These are not optimal sets; the search space for LLVM optimization flags is just too big for such sets to be reasonably found. The optimization sets used are as follows:

Specialization (spec): The kernel is specialized on the appropriate parameters.

Not Optimized (notopt): No optimizations are performed.

LLVM Heavy Optimization (O3): LLVM provides three default sets of optimization passes for different levels of optimization: -O1, -O2, and -O3. This (-O3) is the heaviest level of optimization and is equivalent to running the following opt command:

$ opt -targetlibinfo -no-aa -tbaa -basicaa -notti -globalopt -ipsccp -deadargelim \
      -instcombine -simplifycfg -basiccg -prune-eh -inline-cost -inline -functionattrs \
      -argpromotion -sroa -domtree -early-cse -simplify-libcalls -lazy-value-info \
      -jump-threading -correlated-propagation -simplifycfg -instcombine -tailcallelim \
      -simplifycfg -reassociate -domtree -loops -loop-simplify -lcssa -loop-rotate \
      -licm -lcssa -loop-unswitch -instcombine -scalar-evolution -loop-simplify -lcssa \
      -indvars -loop-idiom -loop-deletion -loop-unroll -memdep -gvn -memdep -memcpyopt \
      -sccp -instcombine -lazy-value-info -jump-threading -correlated-propagation \
      -domtree -memdep -dse -adce -simplifycfg -instcombine -strip-dead-prototypes \
      -globaldce -constmerge -preverify -domtree -verify

Note that this list of passes includes analysis passes that would run even if not explicitly specified in the opt command, due to being required by later transformation passes.

Unroll Loops (unroll): This set of optimizations performs constant propagation and loop unrolling, including enabling partially unrolled loops.
$ opt -mem2reg -sccp -loop-rotate -loop-unroll -unroll-allow-partial \ -simplifycfg Unroll and Reassociate (re) This is like “unroll”, except it adds the arithmetic reassociation and simplification pass, which turns out to help a lot on specialized code. $ opt -mem2reg -sccp -loop-rotate -loop-unroll -unroll-allow-partial \ -simplifycfg -reassociate Repeated Integer Exponentiation The integer exponentiation kernel shown in Listing 5.3 is specialized on the exponent (kk = 17) and executed 50 million times. The execution time for this test with various optimization configurations is shown in Figure 5.5, while the optimization time is shown in Figure 5.6. For this simple example, it is possible to provide a detailed explanation for the measured runtimes. First, several optimizations can make this code run faster, so having the unoptimized code run slowest is the expected result. Next, the code shown in Listing 5.11 seems to be the best code that can be generated for raising an integer to the 17th power. To produce this code, it is necessary to specialize the function, fully unroll the loop, and then perform arithmetic simplification, in that order. The two test cases that do so produce the expected code and give the fastest runtimes. The next question is why specialization doesn’t help without arithmetic simplification, or, more specifically, why a fully unrolled loop isn’t faster than the same loop not unrolled. The best explanation seems to be that the CPU is able to issue the independent instructions to multiply and increment the loop variable in parallel and then correctly predict the branch. Experimenting 70 1 2 3 4 5 6 7 8 9 10 11 ; ModuleID = ’ ipow17 . l l ’ d e f i n e i 3 2 @ipow ( i 3 2 %xx , i 3 2 %kk ) { entry : %0 = mul i 3 2 %xx , %xx %1 = mul i 3 2 %0, %0 %2 = mul i 3 2 %1, %1 %3 = mul i 3 2 %2, %2 %mul . 1 6 = mul i 3 2 %3, %xx r e t i 3 2 %mul . 1 6 } Listing 5.11: ipow17-opt.ll: Fully optimized ipow for kk = 17 by adding more independent arithmetic operations to the loop body in ipow and rerunning the benchmarks seems to support this hypothesis: specialization and unrolling provides a speedup over not unrolling when there are more operations in the loop body. Finally, it’s not clear why unrolling without specialization (effectively equivalent to just doing -loop-rotate) runs faster than the full set of optimizations in “O3”. The only difference in the generated assembly code is the use of an “equals” comparison instead of a “less than” comparison. Perhaps this is worse for the CPU branch predictor. The results for optimization time are simpler, as shown in Figure 5.6. Optimizing this kernel takes less than 1/10th of a second and increases with more optimizations. For comparison, linking and assembly for this kernel takes about 0.15 seconds. The “spec-unroll-reassoc” optimizations are slightly faster than “spec-O3-reassoc”, but it’s not a big difference given the linking time. Matrix Multiplication The matrix multiplication kernel in Listing B.3.2 is specialized on the matrix size (NN = 256). Timings are shown in Figure 5.7, while optimization times are shown in Figure 5.8. This is a more complex kernel, so the optimization results aren’t quite as clear as they were with ipow. Any of the optimization sets provides about the same speedup compared to not optimizing, nearly a factor of four. 
For NN = 256, just unrolling and specializing provides a slightly better speedup than the other optimization sets, but this effect is small, and testing with NN = 512 caused it to disappear. Unrolling is not expected to give a significant speedup on a test like this in any case, due to the superscalar architecture of the Opteron CPU that the test is running on [69]. The CPU can identify instruction level parallelism and execute the multiplication and loop increment operations in parallel while successfully predicting the branch for the loop condition, eliminating any benefit from loop unrolling.

The optimization times are significantly higher on this test, in some cases greater than the kernel runtime. This demonstrates a potential benefit to selecting a minimal set of optimizations for JIT compiled compute kernels.

To explore the effects of instruction level parallelism further, this test was also run on a machine with an Intel Atom N270, which is an in-order CPU that issues two instructions at a time [70], compared to the four out-of-order instructions that the Opteron can issue at once [71]. This should result in a larger improvement from specialization and unrolling. The results with the Atom CPU are shown in Figure 5.4. There is a clear speedup from specialization and unrolling, but it is still not as impressive as it was on the GPU devices, because the Atom is still able to issue both instructions in the loop body at once and predict the branch.

Figure 5.4: Run times for mmul serial kernel (NN = 256) on Intel Atom N270

Gaussian Blur

The gaussian blur kernel in Listing B.3.1 is specialized on the image size and blur radius. Run times are shown in Figure 5.9, while optimization times are shown in Figure 5.10.

For the blur kernel, specialization provides a significant speedup when combined with loop unrolling, but does not provide any additional speedup when added to the full set of "O3" optimizations. This makes specialization potentially attractive as a way to improve the speedup gained per unit of optimization time.

5.3.5 Discussion

These test cases show that specialization can provide a significant benefit for an appropriately structured kernel like ipow. Unfortunately, this benefit seems to decrease quickly on more complex kernels, largely due to the superscalar architectures of modern CPUs. AMD and Intel have already optimized tight inner loops in hardware, so there is not a lot of room to optimize them further in software when targeting CPU devices. Medium size loops, with 6-8 instructions per iteration, might see more benefit, but such loops do not seem to be a very common case.

Figure 5.5: Execution times for ipow serial kernel

Figure 5.6: Optimization times for ipow serial kernel
Figure 5.7: Run times for mmul serial kernel (NN = 256)

Figure 5.8: Optimization times for mmul serial kernel (NN = 256)

Figure 5.9: Run times for blur serial kernel

Figure 5.10: Optimization times for blur serial kernel

5.4 Specializing POCL

In this section we describe Specializing POCL, an implementation of SOCL built as an extension to the open source POCL implementation of OpenCL. First, we explain how POCL compiles and executes OpenCL kernels. Next, we describe how JIT specialization was added to this process. Finally, we provide timings for the resulting Specializing POCL system running the OpenCL benchmarks described in Chapter 5.

5.4.1 Execution of an OpenCL kernel with POCL

First, we describe how the POCL [47] (Portable Computing Language) implementation of OpenCL operates when executing kernel code on an AMD64-based multi-core machine. Specifically, we look at how several parts of the OpenCL API described in Chapter 2.3 are implemented, as well as how executable kernel code is generated and scheduled for parallel execution.

POCL is an open source implementation of OpenCL supporting execution on multi-core CPUs and several types of specialized acceleration hardware. It was initially developed [64] by Pekka Jaaskelainen and others at the Tampere University of Technology in Finland and the Universidad Rey Juan Carlos in Spain to study the design of application-specific processor hardware, but it has since been extended for use as a general purpose OpenCL implementation.

POCL is built on the LLVM [65][72] compiler construction toolkit described in Section 5.3.1. In broad strokes, POCL uses the Clang [66] C front end to parse OpenCL C and generate LLVM IR, does most of its work in custom LLVM transformation passes, and then uses an appropriate LLVM back end to generate kernel binaries for execution.

In conventional UNIX fashion, POCL does its work by creating a temporary directory and calling LLVM command line utilities to transform the kernel in several stages. These intermediate files can be examined by setting the POCL_LEAVE_TEMP_DIRS environment variable. We illustrate this process by describing a single run of the vec_pow_host program shown in Listing B.1.2 using POCL. The executable is started with this command:

$ POCL_LEAVE_TEMP_DIRS=1 ./vec_pow_host

When the program calls clCreateProgramWithSource (line 45 in Listing B.1.2), a temporary directory is created and the kernel source is written to program.cl in that directory.
In this case, the code written to this initial temporary file is identical to the code in vec_pow_kernel.cl shown in Listing B.1.1.

The clBuildProgram function (line 47) causes the clang command to be executed to compile the OpenCL C source into an LLVM bitcode file. POCL creates a subdirectory for the target OpenCL device. In this case the device is called "pthread", because the kernel will be executed on a multi-core processor using POSIX threads. Clang outputs the bitcode to a file called program.bc in the new subdirectory.

The LLVM assembly code for the bitcode in program.bc, as generated by the llvm-dis command, is shown in Listing B.1.3. This is a direct translation from the OpenCL C code into LLVM assembly. The two features which mark this code as OpenCL code are the metadata on lines 29 and 31, indicating that vec_pow is an OpenCL kernel, and the addrspace(3) annotation (line 5), which marks the variable xs as a pointer into the OpenCL global address space. This address space information is not relevant to the "pthread" device back end, but it is used to select different types of memory for data storage when a kernel is executed on devices like GPUs with partitioned memory spaces.

The clCreateKernel function (line 50 of Listing B.1.2) causes the LLVM opt command to be executed to run a custom analysis pass that finds various information about the kernel, such as the argument types and the minimum workgroup size. This information is loaded back into the host program by generating a small shared library in C, which is compiled, linked (creating descriptor.so), and dynamically loaded.

Finally, when the clEnqueueNDRangeKernel function is called (line 67), the rest of the work to generate a parallel, executable version of the kernel is performed. First, the llvm-link command is called to link the kernel bitcode against the OpenCL C standard library, generating kernel_linked.bc. Second, several analysis and transformation passes are run on the linked bitcode to enable the kernel to run in parallel. We consider these passes in four broad steps: Flatten, Barriers, Work Item Replication, and Standard Optimizations.

The Flatten transformation inlines as much as it can into the kernel. Since the OpenCL standard doesn't allow recursive calls in kernel code, this is a well defined transformation that allows later passes to consider the kernel computation as a whole without worrying about function calls.

The most difficult part of transforming a kernel for parallel execution is handling barriers. Whenever a work item (a parallel instance of a kernel) calls the barrier function, every other work item in the same work group must also call barrier before execution can proceed beyond the barrier in any of them. Since barriers can occur in conditional blocks and loops, this makes a static serialization of an OpenCL kernel non-trivial in general. Based on text in the OpenCL standard which says that each work item in a group must hit the same sequence of barriers, POCL handles this problem with the simplifying assumption that all work items in a group will branch the same way whenever the branch could cause any of them to hit a barrier [73].

Once barriers have been properly identified and the kernel has been transformed to put them in a canonical form, POCL transforms the kernel into a serialization of the work items in a single work group. Finally, several standard optimization passes are run on the kernel to make it execute faster.

The result of these passes (which is specialized on the workgroup size) is written to a final bitcode file, parallel.bc, which is then compiled to a shared library, parallel.so. Execution of the kernel with the provided arguments is then added to the work queue. Copies of the arguments are made in case another call to the kernel is enqueued before the first kernel call is executed. The kernel is then executed asynchronously by dynamically loading the shared library and executing the parallel kernel work-group function on the arguments provided. One instance is executed in parallel for each work group using POSIX threads.
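As a concrete illustration, the sketch below shows the kind of intermediate files described above that might be left behind in POCL's temporary directory after such a run. The directory name and exact layout are hypothetical (they depend on the POCL version and configuration); only the file names themselves are taken from the description above.

$ POCL_LEAVE_TEMP_DIRS=1 ./vec_pow_host
$ ls -R /tmp/poclXXXXXX              # hypothetical temporary directory name
program.cl                 # OpenCL C source saved at clCreateProgramWithSource time
descriptor.so              # kernel metadata library generated at clCreateKernel time
pthread/program.bc         # bitcode produced by clang for the "pthread" device
pthread/kernel_linked.bc   # kernel linked against the OpenCL C standard library
pthread/parallel.bc        # work-group function, specialized on the workgroup size
pthread/parallel.so        # shared library loaded and run by the pthread device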
5.4.2 Adding Scalar Specialization to POCL

Now that the OpenCL API and its implementation in POCL have been described, the addition of specialization on scalar arguments is relatively simple. POCL already delays final kernel code generation until after the kernel arguments have been specified, so it is sufficient to add a specialization pass that modifies the kernel between the Flatten pass and the Barriers pass described in Section 5.4.1.

In more detail, adding specialization required three modifications to POCL. First, the clCreateKernel function has been modified to read the @spec annotation from the saved program.cl source file and to include information about which arguments should be specialized on in descriptor.so. Second, the clEnqueueNDRangeKernel function has been modified to store the values of the specialized arguments to a temporary file (called spec.info). Finally, an LLVM transformation pass has been written that reads spec.info and replaces each use of a specialized argument in the kernel with its value.

This LLVM specialization pass for POCL kernels is shown in Listing B.4.1. It is the same pass that was used to specialize the serial kernels in Section 5.3. This modification has been tested on several kernels, including the vec_pow kernel shown in Listing B.1.1 and the benchmark kernels described in Chapter 5. The current code can specialize 32 and 64 bit integers and floating point values, which are the most common scalar parameters to OpenCL C kernel functions. The strategy could be further extended to support arbitrary arrays (e.g., to allow a general image convolution function to be specialized on a specific convolution matrix), but that is somewhat more complex and has not been implemented.
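The heart of that pass is small. The fragment below is a condensed sketch of the runOnModule method from Listing B.4.1, with the spec.info parsing and the per-type dispatch elided; shouldSpecialize and valueFor are hypothetical helpers standing in for that elided logic.

// Condensed sketch of the specialization pass in Listing B.4.1: every use of
// a specialized kernel argument is replaced with a constant holding the value
// recorded when the kernel call was enqueued.
bool Specialize::runOnModule(llvm::Module &M) {
    llvm::Function *F = M.getFunction(KernelName);
    if (F == 0)
        return false;                       // no such kernel; nothing to do

    bool changed = false;
    for (auto it = F->arg_begin(); it != F->arg_end(); ++it) {
        if (!shouldSpecialize(*it))         // hypothetical: driven by spec.info
            continue;
        // Integer case shown; Listing B.4.1 handles float and double the same
        // way using ConstantFP::get.
        llvm::Value *cv =
            llvm::ConstantInt::getSigned(it->getType(), valueFor(*it));
        it->replaceAllUsesWith(cv);
        changed = true;
    }
    return changed;
}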
5.4.3 Benchmarks for POCL with Specialization

The eight test benchmarks from Chapter 5 were run using Specializing POCL, comparing no optimization, loop unrolling, and the full "-O3" set of optimizations, both with and without specialization. Tests were run on the 24-core Opteron workstation described in Appendix A.2. A summary of speedups is shown in Table 5.5. Charts with the full results for each benchmark are shown in Figure 5.11.

Benchmark        Spec      Spec + Unroll   Spec + O3
blur             21.5%     -23.4%          -23.0%
mandelbrot       -0.9%      0.3%           -0.6%
mmul             -0.3%     122.6%          -11.0%
nas-cg          -19.6%     -22.2%          -21.4%
nas-sp          193.1%     169.5%          156.2%
nv-bs            -1.1%     -1.9%           -6.6%
nv-dct           -0.0%      30.8%           36.4%
particlefilter   -4.0%     -2.7%           -2.7%

Table 5.5: Speedups with Specializing POCL

These results vary significantly. Only one benchmark got large speedups from specialization across all of the sets of optimization flags. The other benchmarks either saw consistent slowdowns or saw speedups from only one or two sets of optimizations. Given the results from serial specialization, the presence of any significant speedups is surprising.

Results by Benchmark

Gaussian Blur (blur) This benchmark saw a speedup from specialization, but a significant slowdown from any additional optimizations. This optimization slowdown is present even without specialization.

Mandelbrot (mandelbrot) This benchmark showed approximately no effect from specialization. Heavy optimization resulted in a slight slowdown, independent of specialization.

Matrix Multiply (mmul) Matrix multiply got a significant speedup from specialization with loop unrolling. All of the other combinations of optimizations and specialization caused a slowdown.

Conjugate Gradient (nas-cg) This benchmark got a significant slowdown from specialization that remained consistent across the different optimization sets.

Scalar Penta-Diagonal (nas-sp) This benchmark got a significant speedup from specialization that remained consistent across the different optimization sets.

Black-Scholes Option Pricing (nv-bs) Specialization caused a slight slowdown for this benchmark, which was made even worse by heavy optimization.

DCT (nv-dct) Specialization and optimization together caused a significant speedup on this benchmark. Optimization without specialization had almost no effect.

Particle Filter (particlefilter) This benchmark got a slight slowdown from specialization. This is the same result as in the Pancake GPU tests.

Figure 5.11: Benchmark Timings for Specializing POCL (one panel per benchmark: blur, mmul, mandelbrot, nv-bs, nas-cg, nas-sp, nv-dct, and particlefilter; each panel shows kernel execution times for the default, unroll, and O3 optimization sets, with and without specialization)

Figure 5.12: Execution times for the "mmul" benchmark with a modified Clover

5.5 JIT Specialization in Clover

Clover is the OpenCL implementation from the Mesa [44] project, with the goal of providing an open source implementation of the OpenCL standard for AMD and Nvidia GPUs. The implementation is not complete, currently providing only partial support for the standard, and only for AMD GPUs. Of the benchmarks described in Chapter 5, Clover will only run "mmul".
Clover has been modified to use the LLVM specialization pass described in the two previous sections. Rather than implement the @spec annotation for this one benchmark, the specialization values are provided through environment variables. This implementation was sufficient to collect the benchmark runtimes for "mmul" shown in Figure 5.12.

This test demonstrates that JIT specialization can be added to an OpenCL implementation targeting GPUs directly. Specialization provided a speedup only with the "unroll" optimization flags, implying that some of the optimizations performed by "O3" are not appropriate for a GPU target.

Chapter 6

Conclusions and Future Work

6.1 Conclusions

This dissertation has explored the application of just in time value specialization to data parallel compute kernels. Three new implementations of this technique were built and tested: Bacon, Pancake, and Specializing POCL. All three demonstrated significant speedups on especially well-suited kernels.

Bacon was primarily built to increase the usability of OpenCL. It has several features which should have usability benefits, but it is not clear that these benefits warrant the effort of adopting a new language for kernel development or a new and incomplete runtime library. Other approaches, like Aparapi [36], Clyther [37], and Copperhead [57], which allow parallel kernels to be written directly in a high-level host language, are likely to be a better way to approach usability for programming parallel hardware.

Pancake and Specializing POCL implement Specialization Annotated OpenCL (SOCL), allowing OpenCL C kernels to be directly annotated for JIT specialization on argument values. This allowed the effects of this optimization to be tested directly, without the language changes introduced in Bacon.

Testing with Pancake showed that JIT specialization can provide significant performance benefits when running appropriate kernels on GPU devices, even when the specialization is performed before passing the kernels to opaque vendor implementations of OpenCL. The simplicity of GPU architectures allows well known optimizations like loop unrolling and algebraic simplification to have a significant effect on runtime, which makes specialization an effective optimization technique.

Specializing POCL demonstrates that support for SOCL can easily be added to an existing OpenCL implementation. The performance benefits when running on a CPU target were not as significant as those on GPU targets, due to the complexity of the CPU architecture. The benefit could probably be improved, especially by controlling the interleaving of instructions from multiple OpenCL work items, but that would require additional work on coordinating compiler optimizations with scheduling information.

Both implementations of SOCL showed very large speedups on some of the test cases. For example, executing on an AMD GPU with Pancake, the Gaussian Blur test case executed 175% faster than without specialization. With Specializing POCL on a multi-core CPU, the "nas-sp" benchmark showed a speedup of over 150% at every optimization level. It is not clear how to generalize these results to other cases, but that is a tempting target for future work.

Just in time specialization is a promising optimization that could benefit from significant further research, but I would not recommend incorporating SOCL or something similar into existing OpenCL compilers based solely on the work in this dissertation.
6.2 Future Work

There are several ways that the work in this dissertation could be usefully extended. This includes both further work on JIT specialization and work on other compiler techniques that would make parallel kernels execute faster.

6.2.1 Automatic Specialization

SOCL requires explicit annotations specifying which kernel parameters should be specialized on. It should be possible to find appropriate specialization parameters automatically, by a combination of static analysis and runtime profiling. This would allow the benefits of JIT specialization to be realized on appropriate kernels without any additional effort by the programmer.

The most obvious candidate parameters for specialization are those used as loop bounds. These can be identified by simple static analysis. Other candidates could be identified by more complex analyses, such as finding variables that are used in many arithmetic operations.

Once a candidate parameter is identified, the problem is determining whether the kernel is called repeatedly with the same value for that parameter. This could be determined by recording the values used in calls. Once the same value has been used some threshold number of times, it may be worth generating a kernel specialized for that value.

6.2.2 Speculative Optimization

Selecting an appropriate set of optimizations to balance compilation time against kernel runtime is difficult. Worse, optimizing more may not even be better: several of the test cases for Specializing POCL showed a slowdown when compiled with the full ("O3") set of optimizations.

Adaptive optimization can be used to select the best optimizations for a given kernel. The initial run of the kernel can be done with whatever set of optimizations compiles fastest, in parallel with a slower speculative compilation of a more heavily optimized version of the kernel. Once the optimized version has been compiled, the next run can be performed with that version. This process can be repeated several times with different optimizations. Finally, whichever version executes fastest can be used for all further runs.

This technique is applicable to any optimization, but it works especially well with automatic specialization. A specialized kernel can be compiled speculatively for the most promising values of a candidate parameter, and can be discarded if that value is not used again or if the specialized kernel provides no speedup.

Speculative optimization does cause extra kernel compilations to be performed, potentially many of them. But, especially for kernels executing on GPU devices, there may be unused CPU cores that can perform these speculative compilations without any negative impact on program performance.

6.2.3 Improved Scheduling in POCL

POCL currently uses two techniques to schedule work items into work groups: either the blocks between barriers in each work item are interleaved into a single work-group function, or the work items are looped over for each of these blocks.

There are many other options for scheduling, including separating a work group into multiple threads, grouping instructions for instruction level parallelism, and grouping instructions explicitly for vector execution. A detailed analysis of how these scheduling choices interact with specialization could result in significant performance benefits.

Appendix A

Benchmarking Hardware and Methods

A.1 Benchmarking Methodology

In this appendix we describe the methodology used to collect the timing results for SOCL in Chapters 5.2 and 5.4, as well as the timings for serial specialization in Chapter 5.3.

A set of scripts was developed to run the benchmarks with a variety of different optimization settings using different OpenCL runtimes. These scripts set environment variables that controlled the configuration of the OpenCL runtime and the specialization behavior. Each test was run five times for each combination being tested. This is a relatively low number of repetitions, but it was selected as a tradeoff between eliminating bad samples and the overall time taken by the test runs. Some test runs took several hours to complete, and timings were reasonably consistent in practice, so significantly increasing the testing time would not have been worth the small incremental gain in accuracy.
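The sketch below illustrates the general shape of such a driver script. The configuration names match the chart labels used in Chapter 5; the environment variable and program names are placeholders, not the actual scripts used for this dissertation.

# Illustrative sketch only; variable and program names are hypothetical.
for cfg in default default-spec unroll unroll-spec O3 O3-spec; do
    for run in 1 2 3 4 5; do
        # One wall-clock measurement per run for this configuration.
        SOCL_OPT_CONFIG="$cfg" ./run_benchmark >> "times-$cfg.log"
    done
done
# The median of the five times is charted, with min/max as error bars.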
The generated charts show the median times. This removes the effect of outliers while still providing a realistic assessment of the somewhat inconsistent execution times of parallel programs. Error bars show the minimum and maximum times.

A.2 Benchmarking Hardware

The timings in this dissertation were taken on the three test machines shown in Table A.1. These machines were selected to meet several criteria:

• GPU testing needed to be performed on both AMD and Nvidia GPUs.
• A high CPU core count was desired for OpenCL CPU testing. The Opteron workstation has the highest core count I was willing to pay for.
• The Netbook was used to test serial optimization on a relatively simple CPU architecture.

Specs        Workstation                   Desktop                   Netbook
OS           Ubuntu 13.10                  Ubuntu 13.10              Debian 7
CPU          Dual AMD Opteron 6234         AMD Phenom II X3 720      Intel Atom N270
CPU Cores    24                            3                         1
CPU Clock    2.4 GHz                       2.8 GHz                   1.6 GHz
RAM          32 GB                         4 GB                      512 MB
GPU          Nvidia GeForce GTX 650 Ti     AMD Radeon 5830           Integrated Graphics
GPU Shaders  768                           1120                      -
GPU Clock    928 MHz                       800 MHz                   -
GPU RAM      1 GB                          1 GB                      -

Table A.1: Test Hardware

Appendix B

Code Listings

B.1 OpenCL Listings

B.1.1 Sample Kernel: vec_pow

Listing B.1.1: Simple kernel raising the elements of a vector to some power.

/* vec_pow:
 * Take a vector of n long ints and raise each one
 * to the y'th power in place.
 */
kernel void vec_pow(global long *xs, long y, long n)
/* @spec vec_pow(y, n) */
{
    long ii = get_global_id(0);
    long x = xs[ii];
    long x1 = x;

    for (long jj = 0; jj < y; ++jj) {
        x1 *= x;
    }

    xs[ii] = x1;
}

B.1.2 Sample Host Program: vec_pow

Listing B.1.2: Host program to run the vec_pow kernel above

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 #include #include #include #include <CL/ c l . h> <s t d i o . h> <a l l o c a . h> <s t r i n g .
h> int main ( int argc , char* argv ) { c l _ u i n t nn ; /* Take t h e f i r s t OpenCL p l a t f o r m */ c l G e t P l a t f o r m I D s ( 0 , 0 , &nn ) ; c l _ p l a t f o r m _ i d p l a t f o r m s [ nn ] ; c l G e t P l a t f o r m I D s ( nn , p l a t f o r m s , 0 ) ; cl_platform_id platform = platforms [ 0 ] ; /* Take t h e d e f a u l t OpenCL d e v i c e */ c l G e t D e v i c e I D s ( p l a t f o r m , CL_DEVICE_TYPE_ALL, 0 , 0 , &nn ) ; c l _ d e v i c e _ i d d e v i c e s [ nn ] ; c l G e t D e v i c e I D s ( p l a t f o r m , CL_DEVICE_TYPE_ALL, nn , d e v i c e s , 0 ) ; cl_device_id device = devices [ 0 ] ; /* Create an OpenCL c o n t e x t */ long context _props [ ] = {CL_CONTEXT_PLATFORM, ( long ) p l a t f o r m , 0}; cl_context context = clCreateContext ( ( c l _ c o n t e x t _ p r o p e r t i e s * ) context_props , 1 , &d e v i c e , 0 , 0 , 0) ; /* Create an OpenCL command queue */ cl_command_queue queue = clCreateCommandQueue ( c o n t e x t , device , 0 , 0) ; /* Read i n t h e OpenCL k e r n e l s o u r c e */ char* o p e n c l _ s o u r c e = a l l o c a ( 1 0 2 4 ) ; FILE* o p e n c l _ f i l e = f o p e n ( ” vec_pow_kernel . c l ” , ” r ” ) ; nn = 0 ; while ( ! f e o f ( o p e n c l _ f i l e ) ) nn += f r e a d ( o p e n c l _ s o u r c e + nn , 1 , 5 1 2 , o p e n c l _ f i l e ) ; o p e n c l _ s o u r c e [ nn ] = ’ \n ’ ; o p e n c l _ s o u r c e [ nn + 1 ] = 0 ; 91 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 fclose ( opencl_file ) ; /* Compile t h e OpenCL s o u r c e and s e l e c t t h e k e r n e l we want */ size_t length = s t r l e n ( opencl_source ) ; const char ** s o u r c e s = ( const char * * ) &o p e n c l _ s o u r c e ; const s i z e _ t * l e n g t h s = ( const s i z e _ t * ) &l e n g t h ; cl_program program = clCreateProgramWithSource ( context , 1 , sources , lengths , 0) ; clB uild Pro gram ( program , 1 , &d e v i c e , ” ” , 0 , 0 ) ; /* S e l e c t t h e k e r n e l we want */ c l _ k e r n e l k e r n e l = c l C r e a t e K e r n e l ( program , ”vec_pow” , 0 ) ; /* Create t h e i n p u t b u f f e r */ cl_long n = 5; c l _ l o n g xs0 [ ] = { 1 , 2 , 3 , 4 , 5 } ; cl_mem_flags f l a g s = CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR; size_t xs_size = n * sizeof ( cl_long ) ; cl_mem xs = c l C r e a t e B u f f e r ( c o n t e x t , f l a g s , x s _ s i z e , xs0 , 0 ) ; /* S e t t h e k e r n e l arguments */ cl_long y = 3; c l S e t K e r n e l A r g ( k e r n e l , 0 , s i z e o f ( cl_mem ) , &xs ) ; c l S e t K e r n e l A r g ( k e r n e l , 1 , s i z e o f ( y ) , &y ) ; c l S e t K e r n e l A r g ( k e r n e l , 2 , s i z e o f ( n ) , &n ) ; /* E x e c u t e t h e k e r n e l o v e r a p a r a l l e l range . */ c l _ e v e n t kernel_done = 0 ; clEnqueueNDRangeKernel ( queue , k e r n e l , 1 , 0 , &n , 0 , 0 , 0 , &kernel_done ) ; 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 /* Wait f o r t h e Kernel t o f i n i s h e x e c u t i n g */ clFinish () ; /* Copy b a c k t h e d a t a from t h e b u f f e r . 
*/ c l _ l o n g xs1 [ 5 ] ; clEnqueueReadBuffer ( queue , xs , CL_TRUE, 0 , x s _ s i z e , xs1 , 0 , 0 , 0) ; /* P r i n t o u t t h e r e s u l t */ p r i n t f ( ” 1 2 3 4 5 cubed i s : \ n” ) ; for ( int i i = 0 ; i i < 5 ; ++i i ) { p r i n t f ( ”%l d ” , xs1 [ i i ] ) ; } p r i n t f ( ” \n” ) ; } return 0 ; 92 B.1.3 LLVM Assembly Code for vec_pow kernel Listing B.1.3: Results of compiling vec_pow kernel to LLVM assembly 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 ; ModuleID = ’ program . bc ’ t a r g e t d a t a l a y o u t = *** s t r i n g o f data s i z e s *** t a r g e t t r i p l e = ” x86_64−unknown−l i n u x −gnu ” d e f i n e void @vec_pow ( i 6 4 a d d r s p a c e ( 3 ) * n o c a p t u r e %xs , i 6 4 %y , i 6 4 %n ) nounwind uwtable { entry : %c a l l = t a i l c a l l i 6 4 @get_global_id ( i 3 2 0 ) nounwind %a r r a y i d x = g e t e l e m e n t p t r inbounds i 6 4 a d d r s p a c e ( 3 ) * %xs , i 6 4 %c a l l %0 = l o a d i 6 4 a d d r s p a c e ( 3 ) * %a r r a y i d x , a l i g n 8 , ! tbaa ! 1 %cmp2 = icmp s g t i 6 4 %y , 0 br i 1 %cmp2 , l a b e l %for . body , l a b e l %for . end for . body : ; preds = %for . body , %e n t r y %j j . 0 4 = p h i i 6 4 [ %i n c , %for . body ] , [ 0 , %e n t r y ] %x1 . 0 3 = p h i i 6 4 [ %mul , %for . body ] , [ %0, %e n t r y ] %mul = mul nsw i 6 4 %x1 . 0 3 , %0 %i n c = add nsw i 6 4 %j j . 0 4 , 1 %e x i t c o n d = icmp eq i 6 4 %i n c , %y br i 1 %e x i t c o n d , l a b e l %for . end , l a b e l %for . body for . end : ; preds = %for . body , %e n t r y %x1 . 0 . l c s s a = p h i i 6 4 [ %0, %e n t r y ] , [ %mul , %for . body ] s t o r e i 6 4 %x1 . 0 . l c s s a , i 6 4 a d d r s p a c e ( 3 ) * %a r r a y i d x , a l i g n 8 , ! tbaa ! 1 r e t void } d e c l a r e i 6 4 @get_global_id ( i 3 2 ) ! opencl . kernels = ! { ! 0 } !0 !1 !2 !3 = = = = metadata metadata metadata metadata ! { void ( i 6 4 a d d r s p a c e ( 3 ) * , i 6 4 , i 6 4 ) * @vec_pow} ! { metadata ! ” long ” , metadata ! 2 } ! { metadata ! ” omnipotent char ” , metadata ! 3 } ! { metadata ! ” Simple C/C++ TBAA” } 93 B.1.4 Gaussian Blur Kernels Listing B.1.4: blur-kernels.cl: OpenCL kernels to perform gaussian blur 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 kernel void blur_hor ( g l o b a l uchar * im1 , g l o b a l uchar * im0 , g l o b a l f l o a t * bvec , int ww, int hh , int r r ) /* @spec b l u r _ h o r (ww, hh , r r ) */ { int i i = g e t _ g l o b a l _ i d ( 0 ) ; int j j = g e t _ g l o b a l _ i d ( 1 ) ; f l o a t pp = 0 . 0 f ; for ( int kk = −r r ; kk <= r r ; ++kk ) { int j j 0 = clamp ( j j + kk , 0 , ww − 1 ) ; pp += bvec [ kk + r r ] * im0 [ww* i i + j j 0 ] ; } } im1 [ww* i i + j j ] = clamp ( ( int ) round ( pp ) , 0 , 2 5 5 ) ; kernel void bl ur _v er ( g l o b a l uchar * im1 , g l o b a l uchar * im0 , g l o b a l f l o a t * bvec , int ww, int hh , int r r ) /* @spec b l u r _ v e r (ww, hh , r r ) */ { int i i = g e t _ g l o b a l _ i d ( 0 ) ; int j j = g e t _ g l o b a l _ i d ( 1 ) ; f l o a t pp = 0 . 
0 f ; for ( int kk = −r r ; kk <= r r ; ++kk ) { int i i 0 = clamp ( i i + kk , 0 , hh − 1 ) ; pp += bvec [ kk + r r ] * im0 [ww* i i 0 + j j ] ; } } im1 [ww* i i + j j ] = clamp ( ( int ) round ( pp ) , 0 , 2 5 5 ) ; 94 B.1.5 Matrix Multiplication Kernel Listing B.1.5: fmma-kernel.cl: OpenCL kernel to perform matrix multiplication 1 2 /* fmma : F l a o t i n g p o i n t m a t r i x m u l t i p l y and a c c u m u l a t e * Performs a m a t r i x m u l t i p l i c a t i o n i f t h e r e s u l t m a t r i x C i s a l l z e r o s . */ 3 4 k e r n e l void 5 fmma( g l o b a l f l o a t *C, g l o b a l f l o a t *A, g l o b a l f l o a t *B, int nn , int s p i n ) 6 /* @spec fmma( nn , s p i n ) */ 7 { 8 const int xx = g e t _ g l o b a l _ i d ( 0 ) ; 9 const int yy = g e t _ g l o b a l _ i d ( 1 ) ; 10 11 for ( long i t = 0 ; i t < s p i n ; ++i t ) { 12 f l o a t sum = C[ nn * yy + xx ] ; 13 14 for ( int kk = 0 ; kk < nn ; ++kk ) { 15 sum += A[ nn * yy + kk ] * B [ nn * kk + xx ] ; 16 } 17 18 C[ nn * yy + xx ] = sum ; 19 } 20 } B.1.6 Rodinia Mandelbrot Kernel Listing B.1.6: mandelbrot-kernel.cl: OpenCL kernel to generate pictures of the mandelbrot set 1 2 3 4 5 6 /* C o p y r i g h t ( c ) 2009 David B u c c i a r e l l i ( d a v i b u @ i n t e r f r e e . i t ) P e r m i s s i o n i s h e r e b y g r a n t e d , f r e e o f charge , t o any p e r s o n obtaining a copy o f t h i s s o f t w a r e and a s s o c i a t e d documentation f i l e s ( t h e ” Software ” ) , to deal in the Software without r e s t r i c t i o n , including 95 7 8 9 10 11 12 w i t h o u t l i m i t a t i o n t h e r i g h t s t o use , copy , modify , merge , publish , d i s t r i b u t e , s u b l i c e n s e , and/ or s e l l c o p i e s o f t h e S o f t w a r e , and t o p e r m i t p e r s o n s t o whom t h e S o f t w a r e i s f u r n i s h e d t o do so , sub jec t to the following conditions : The above c o p y r i g h t n o t i c e and t h i s p e r m i s s i o n n o t i c e s h a l l be included i n a l l c o p i e s or s u b s t a n t i a l p o r t i o n s o f t h e S o f t w a r e . 13 14 15 THE SOFTWARE IS PROVIDED ”AS IS ” , WITHOUT WARRANTY OF ANY KIND, 16 EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 17 MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 18 IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 19 CLAIM, DAMAGES OR OTHER LIABILITY , WHETHER IN AN ACTION OF CONTRACT, 20 TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 21 SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 22 */ 23 24 __kernel void mandelGPU ( 25 __global int * p i x e l s , 26 const int width , 27 const int h e i g h t , 28 const f l o a t s c a l e , 29 const f l o a t o f f s e t X , 30 const f l o a t o f f s e t Y , 31 const int m a x I t e r a t i o n s 32 ) 33 /* @spec mandelGPU ( width , h e i g h t , s c a l e , o f f s e t X , o f f s e t Y , m a x I t e r a t i o n s ) */ 34 { 35 const int g i d = g e t _ g l o b a l _ i d ( 0 ) ; 36 const int g i d 4 = 4 * g i d ; 37 const int maxSize = max( width , h e i g h t ) ; 38 const f l o a t kx = ( s c a l e / 2 . f ) * width ; 39 const f l o a t ky = ( s c a l e / 2 . 
f ) * h e i g h t ; 40 41 int t ; 42 unsigned int i t e r [ 4 ] ; 43 for ( t = 0 ; t < 4 ; ++t ) { 44 const int t i d = g i d 4 + t ; 45 46 const int screenX = t i d % width ; 47 const int screenY = t i d / width ; 96 48 49 50 51 52 53 // Check i f we have s o m e t h i n g t o do i f ( screenY >= h e i g h t ) return ; const f l o a t x0 = ( ( screenX * s c a l e ) − kx ) / maxSize + o f f s e t X ; const f l o a t y0 = ( ( screenY * s c a l e ) − ky ) / maxSize + o f f s e t Y ; 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 float float float float x = x0 ; y = y0 ; x2 = x * x ; y2 = y * y ; int i i = 0 ; for ( i i = 0 ; ( x2 + y2 <= 4 . f ) && ( i i < m a x I t e r a t i o n s ) ; ++i i ) { y = 2 * x * y + y0 ; x = x2 − y2 + x0 ; } iter [ t ] = ii ; } } x2 = x * x ; y2 = y * y ; i f ( i t e r [ t ] == m a x I t e r a t i o n s ) iter [ t ] = 0; else { i t e r [ t ] = i t e r [ t ] % 512; i f ( i t e r [ t ] > 255) i t e r [ t ] = 511 − i t e r [ t ] ; } p i x e l s [ gid ] = i t e r [ 0 ] ( iter ( iter ( iter | [ 1 ] << 8 ) | [ 2 ] << 1 6 ) | [ 3 ] << 2 4 ) ; 97 B.2 Bacon Listings B.2.1 Bacon Grammar Listing B.2.1: Parse::Yacc grammar for Bacon C without Perl code. 1 # Based on ANSI C yacc grammar o r i g i o n a l l y by J e f f Lee , taken from 2 # h t t p : //www. l y s a t o r . l i u . s e / c /ANSI−C−grammar−y . html 3 4 %s t a r t t r a n s l a t i o n _ u n i t 5 6 /* 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 e x p r e s s i o n s */ primary_expression : CONSTANT | IDENTIFIER | STRING | ’ ( ’ expression ’ ) ’ ; postfix_expression : primary_expression | IDENTIFIER ’ [ ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ] ’ | IDENTIFIER ’ ( ’ ’ ) ’ | IDENTIFIER ’ ( ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ) ’ | IDENTIFIER ’ . ’ IDENTIFIER | p o s t f i x _ e x p r e s s i o n PTR_OP IDENTIFIER | p o s t f i x _ e x p r e s s i o n INC_OP | p o s t f i x _ e x p r e s s i o n DEC_OP ; argument_expression_list : assignment_expression | argument_expression_list ’ , ’ assignment_expression ; unary_expression : postfix_expression | INC_OP u n a r y _ e x p r e s s i o n | DEC_OP u n a r y _ e x p r e s s i o n | unary_operator c a s t _ e x p r e s s i o n | SIZEOF u n a r y _ e x p r e s s i o n 98 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 | SIZEOF ’ ( ’ t y p e _ s p e c i f i e r ’ ) ’ ; unary_operator : ’& ’ | ’* ’ | ’+ ’ | ’− ’ | ’~ ’ | ’! 
’ ; cast_expression : unary_expression | ’ ( ’ type_specifier ’ ) ’ cast_expression ; multiplicative_expression : cast_expression | multiplicative_expression ’* ’ cast_expression | multiplicative_expression ’/ ’ cast_expression | m u l t i p l i c a t i v e _ e x p r e s s i o n ’%’ c a s t _ e x p r e s s i o n ; additive_expression : multiplicative_expression | a d d i t i v e _ e x p r e s s i o n ’+ ’ m u l t i p l i c a t i v e _ e x p r e s s i o n | a d d i t i v e _ e x p r e s s i o n ’− ’ m u l t i p l i c a t i v e _ e x p r e s s i o n ; shift_expression : additive_expression | s h i f t _ e x p r e s s i o n LEFT_OP a d d i t i v e _ e x p r e s s i o n | s h i f t _ e x p r e s s i o n RIGHT_OP a d d i t i v e _ e x p r e s s i o n ; relational_expression : shift_expression | relational_expression | relational_expression | relational_expression | relational_expression ; ’< ’ s h i f t _ e x p r e s s i o n ’> ’ s h i f t _ e x p r e s s i o n LE_OP s h i f t _ e x p r e s s i o n GE_OP s h i f t _ e x p r e s s i o n equality_expression : relational_expression | e q u a l i t y _ e x p r e s s i o n EQ_OP r e l a t i o n a l _ e x p r e s s i o n | e q u a l i t y _ e x p r e s s i o n NE_OP r e l a t i o n a l _ e x p r e s s i o n 99 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 ; and_expression : equality_expression | a n d _ e x p r e s s i o n ’& ’ e q u a l i t y _ e x p r e s s i o n ; exclusive_or_expression : and_expression | exclusive_or_expression ’^ ’ and_expression ; inclusive_or_expression : exclusive_or_expression | inclusive_or_expression ’ | ’ exclusive_or_expression ; logical_and_expression : inclusive_or_expression | l o g i c a l _ a n d _ e x p r e s s i o n AND_OP i n c l u s i v e _ o r _ e x p r e s s i o n ; logical_or_expression : logical_and_expression | l o g i c a l _ o r _ e x p r e s s i o n OR_OP l o g i c a l _ a n d _ e x p r e s s i o n ; conditional_expression : logical_or_expression | logical_or_expression ’? 
’ expression ’ : ’ conditional_expression ; assignment_expression : conditional_expression | unary_expression assignment_operator assignment_expression ; assignment_operator : ’= ’ | MUL_ASSIGN | DIV_ASSIGN | MOD_ASSIGN | ADD_ASSIGN | SUB_ASSIGN | LEFT_ASSIGN | RIGHT_ASSIGN | AND_ASSIGN 100 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 | XOR_ASSIGN | OR_ASSIGN ; expression : assignment_expression | expression ’ , ’ assignment_expression ; constant_expression : conditional_expression ; /* d e c l a r a t i o n s */ declaration : type_specifier declarator_list ’ ; ’ ; declarator_list : declarator | declarator_list ’ , ’ declarator ; declarator : IDENTIFIER | IDENTIFIER ’ [ ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ] ’ | IDENTIFIER ’= ’ i n i t i a l i z e r ; /* t y p e s */ type_qualifier : CONST | VOLATILE ; simple_type : CHAR | UCHAR | SHORT | USHORT | INT | UINT 101 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 | | | | ; LONG ULONG FLOAT DOUBLE parameterized_type : ARRAY | ARRAY2D | ARRAY3D | ARRAY2Z | ARRAY3Z | IMAGE2D | IMAGE3D ; type_specifier : parameterized_type_specifier | simple_type | VOID | pointer_type | type_qualifier type_specifier ; parameterized_type_specifier : p a r a m e t e r i z e d _ t y p e ’< ’ simple_type ’> ’ | s c o p e _ s p e c i f i e r p a r a m e t e r i z e d _ t y p e ’< ’ simple_type ’> ’ ; pointer_type : simple_type ’ * ’ | VOID ’ * ’ | s c o p e _ s p e c i f i e r simple_type ’ * ’ | s c o p e _ s p e c i f i e r VOID ’ * ’ ; scope_specifier : PRIVATE | LOCAL | GLOBAL ; function_declarator : IDENTIFIER ’ ( ’ p a r a m e t e r _ l i s t ’ ) ’ | IDENTIFIER ’ ( ’ ’ ) ’ ; 102 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 parameter_list : parameter_declaration | parameter_list ’ , ’ parameter_declaration ; parameter_declaration : t y p e _ s p e c i f i e r IDENTIFIER ; initializer : assignment_expression | ’{ ’ initializer_list ’} ’ | ’{ ’ initializer_list ’ , ’ ’} ’ ; initializer_list : initializer | initializer_list ; ’ , ’ initializer /* s t a t e m e n t s */ statement : labeled_statement | compound_statement | expression_statement | selection_statement | iteration_statement | jump_statement | error_statement ; labeled_statement : IDENTIFIER ’ : ’ s t a t e m e n t | CASE c o n s t a n t _ e x p r e s s i o n ’ : ’ s t a t e m e n t | DEFAULT ’ : ’ s t a t e m e n t ; compound_statement : ’{ ’ ’} ’ | ’{ ’ statement_list ’} ’ ; statement_list : statement | declaration 103 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 | statement_list statement | statement_list declaration ; expression_statement : ’; ’ | expression ’ ; ’ ; selection_statement : IF ’ ( ’ e x p r e s s i o n ’ ) ’ s t a t e m e n t | IF ’ ( ’ e x p r e s s i o n ’ ) ’ s t a t e m e n t ELSE s t a t e m e n t | SWITCH ’ ( ’ e x p r e s s i o n ’ ) ’ s t a t e m e n t ; for_var_init : expression_statement | t y p e _ s p e c i f i e r IDENTIFIER ’= ’ e x p r e s s i o n _ s t a t e m e n t ; iteration_statement : WHILE ’ ( ’ e x p r e s s i o n 
’ ) ’ s t a t e m e n t | DO s t a t e m e n t WHILE ’ ( ’ e x p r e s s i o n ’ ) ’ ’ ; ’ | FOR ’ ( ’ f o r _ v a r _ i n i t e x p r e s s i o n _ s t a t e m e n t ’ ) ’ s t a t e m e n t | FOR ’ ( ’ f o r _ v a r _ i n i t e x p r e s s i o n _ s t a t e m e n t e x p r e s s i o n ’ ) ’ statement ; jump_statement : GOTO IDENTIFIER ’ ; ’ | CONTINUE ’ ; ’ | BREAK ’ ; ’ | RETURN ’ ; ’ | RETURN e x p r e s s i o n ’ ; ’ ; error_statement : FAIL ’ ( ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ) ’ ’ ; ’ | ASSERT ’ ( ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ) ’ ’ ; ’ ; /* t o p l e v e l d e c l a r a t i o n s */ translation_unit : external_declaration | translation_unit external_declaration 104 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 ; external_declaration : kernel_definition | function_definition | declaration ; kernel_range_spec : RANGE ’ [ ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ] ’ ’ ; ’ ; kernel_group_spec : GROUP ’ [ ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ] ’ ’ ; ’ ; kernel_ranges : kernel_range_spec | kernel_group_spec | k e r n e l _ r a n g e _ s p e c kernel_group_spec ; kernel_body : ’ { ’ SETUP s t a t e m e n t _ l i s t BODY k e r n e l _ r a n g e s s t a t e m e n t _ l i s t ’} ’ | ’ { ’ BODY k e r n e l _ r a n g e s s t a t e m e n t _ l i s t ’ } ’ ; 339 340 341 342 k e r n e l _ d e f i n i t i o n 343 : KERNEL t y p e _ s p e c i f i e r f u n c t i o n _ d e c l a r a t o r kernel_body 344 ; 345 346 f u n c t i o n _ d e f i n i t i o n 347 : t y p e _ s p e c i f i e r f u n c t i o n _ d e c l a r a t o r compound_statement 348 ; 349 350 %% B.2.2 Blocked Matrix Multiply Listing B.2.2: blocked-mm.bc: Matrix multiplication in Bacon with block size as a parameter 1 kernel 105 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 Array2D<f l o a t > blocked_mat_mul ( Array2D<f l o a t > aa , Array2D<f l o a t > bb , const u i n t blksz ) { SETUP: g l o b a l Array2D<f l o a t > c c [ aa . rows , bb . c o l s ] ; BODY: @range [ c c . rows / b l k s z , c c . c o l s / b l k s z ] ; p r i v a t e Array2D<f l o a t > sum [ b l k s z , b l k s z ] ; int i i , j j , kk , gg ; for ( i i = 0 ; i i < b l k s z ; ++i i ) { for ( j j = 0 ; j j < b l k s z ; ++j j ) { sum [ i i , j j ] = 0 . 0 ; } } int b a s e _ i i = $row * b l k s z ; int b a s e _ j j = $ c o l * b l k s z ; int base_kk ; for ( gg = 0 ; gg < aa . c o l s / b l k s z ; ++gg ) { base_kk = gg * b l k s z ; } // These l o o p s a r e shown u n r o l l e d b e l o w . 
for ( i i = 0 ; i i < b l k s z ; ++i i ) { for ( j j = 0 ; j j < b l k s z ; ++j j ) { for ( kk = 0 ; kk < b l k s z ; ++kk ) { sum [ i i , j j ] += aa [ b a s e _ i i + i i , base_kk + kk ] * bb [ base_kk + kk , b a s e _ j j + j j ] ; } } } for ( i i = 0 ; i i < b l k s z ; ++i i ) { for ( j j = 0 ; j j < b l k s z ; ++j j ) { c c [ b a s e _ i i + i i , b a s e _ j j + j j ] = sum [ i i , j j ] ; } } } return c c ; // U n r o l l i n g r e s u l t s f o r b l k s z = 2 106 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 sum [ 0 , 0 ] += * sum [ 0 , 0 ] += * sum [ 0 , 1 ] += * sum [ 0 , 1 ] += * sum [ 1 , 0 ] += * sum [ 1 , 0 ] += * sum [ 1 , 1 ] += * sum [ 1 , 1 ] += * B.3 B.3.1 aa [ b a s e _ i i , base_kk ] bb [ base_kk , b a s e _ j j ] ; aa [ b a s e _ i i , base_kk + 1 ] bb [ base_kk + 1 , b a s e _ j j ] ; aa [ b a s e _ i i , base_kk ] bb [ base_kk , b a s e _ j j + 1 ] ; aa [ b a s e _ i i , base_kk + 1 ] bb [ base_kk + 1 , b a s e _ j j + aa [ b a s e _ i i + 1 , base_kk ] bb [ base_kk , b a s e _ j j ] ; aa [ b a s e _ i i + 1 , base_kk + bb [ base_kk + 1 , b a s e _ j j ] ; aa [ b a s e _ i i + 1 , base_kk ] bb [ base_kk , b a s e _ j j + 1 ] ; aa [ b a s e _ i i + 1 , base_kk + bb [ base_kk + 1 , b a s e _ j j + 1]; 1] 1] 1]; Serial Kernel Listings Serial Gaussian Blur Listing B.3.1: blur.c: Serial implementation of gaussian blur. 1 #include < s t d l i b . h> 2 #include <math . h> 3 4 double 5 g a u s s ( double x , double mu, double sigma ) 6 { 7 double aa = 1 . 0 / ( sigma * s q r t ( 2 . 0 * M_PI) ) ; 8 double bbT = −pow ( x − mu, 2 . 0 ) ; 9 double bbB = 2 * pow ( sigma , 2 . 0 ) ; 10 double bb = bbT / bbB ; 11 return aa * exp ( bb ) ; 12 } 13 14 int 15 clamp ( int xx , int x0 , int x1 ) 16 { 17 i f ( xx < x0 ) return x0 ; 18 i f ( xx > x1 ) return x1 ; 19 return xx ; 107 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 } void b l u r ( byte * im0 , s i z e _ t ww, s i z e _ t hh , int sigma ) { int r r = 3 * sigma ; /* Generate t h e b l u r v e c t o r */ double * bvec = m a l l o c ( ( 2 * r r + 1 ) * s i z e o f ( double ) ) ; double bsum = 0 . 0 ; for ( int kk = −r r ; kk <= r r ; ++kk ) { int i i = kk + r r ; bvec [ i i ] = g a u s s ( kk , 0 . 0 , sigma ) ; bsum += bvec [ i i ] ; } for ( int i i = 0 ; i i < 2 * r r + 1 ; ++i i ) { bvec [ i i ] *= 1 . 0 / bsum ; } byte * im1 = m a l l o c (ww * hh * s i z e o f ( byte ) ) ; /* Blur im0 h o r i z o n t a l l y i n t o im1 */ for ( int i i = 0 ; i i < hh ; ++i i ) { for ( int j j = 0 ; j j < ww; ++j j ) { double p1 = 0 . 0 ; for ( int kk = −r r ; kk <= r r ; ++kk ) { int j j 0 = clamp ( j j + kk , 0 , ww − 1 ) ; p1 += bvec [ kk + r r ] * im0 [ww* i i + j j 0 ] ; } } } im1 [ww* i i + j j ] = clamp ( round ( p1 ) , 0 , 2 5 5 ) ; /* Blur im1 v e r t i c a l l y b a c k i n t o im0 */ for ( int j j = 0 ; j j < ww; ++j j ) { for ( int i i = 0 ; i i < hh ; ++i i ) { double p0 = 0 . 0 ; for ( int kk = −r r ; kk <= r r ; ++kk ) { int i i 0 = clamp ( i i + kk , 0 , hh − 1 ) ; p0 += bvec [ kk + r r ] * im1 [ww* i i 0 + j j ] ; } im0 [ww* i i + j j ] = clamp ( round ( p0 ) , 0 , 2 5 5 ) ; 108 68 69 70 71 72 73 } } } f r e e ( im1 ) ; f r e e ( bvec ) ; B.3.2 Serial Matrix Multiply Listing B.3.2: mmul.c: Serial implementation of matrix multiply. 1 /* A, B, C a r e s q u a r e m a t r i x e s . 
*/ 2 /* C = A * B */ 3 4 void 5 mmul( double* C, double* A, double* B, int nn ) 6 { 7 for ( int i i = 0 ; i i < nn ; ++i i ) { 8 for ( int j j = 0 ; j j < nn ; ++j j ) { 9 double sum = 0 . 0 ; 10 11 for ( int kk = 0 ; kk < nn ; ++kk ) { 12 sum += A[ nn * i i + kk ] * B [ nn * kk + j j ] ; 13 } 14 15 C[ nn * i i + j j ] = sum ; 16 } 17 } 18 } B.3.3 Serial Integer Exponentiation Listing B.3.3: ipow.c: Serial implementation of simple integer exponentiation. 1 2 3 4 5 6 7 int ipow ( int xx , int kk ) { int yy = 1 ; for ( int i i = 0 ; i i < kk ; ++i i ) yy *= xx ; 109 8 9 10 } return yy ; B.4 Specializer Listings B.4.1 LLVM Specialize Pass Listing B.4.1: SpecPass.cc: Specialization as an LLVM module pass 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 #include ” llvm / Pass . h” #include ” llvm / PassSupport . h” #include ” llvm / Support /CommandLine . h” #include #include #include #include #include #include #include #include ” llvm /IR/Argument . h” ” llvm /IR/ Constants . h” ” llvm /IR/ DerivedTypes . h” ” llvm /IR/ Function . h” ” llvm /IR/ G l o b a l V a r i a b l e . h” ” llvm /IR/ I n s t r u c t i o n s . h” ” llvm /IR/Module . h” ” llvm / Support / raw_ostream . h” #include #include #include #include <s t d i o . h> <s t d i n t . h> <map> <s t r i n g > #include ” cake / l s t r i n g . h” #include ” cake / s p e c . h” #include ” cake / u t i l . h” u s i n g namespace llvm ; u s i n g namespace s t d ; c l : : opt<s t r i n g > S p e c I n f o ( ” spec−i n f o ” , c l : : d e s c ( ” Spec I n f o ” ) , c l : : v al ue _d e sc ( ” s p e c i n f o f i l e ” ) , cl : : i n i t ( ”” ) ) ; c l : : opt<s t r i n g > SpecText ( ” spec−t e x t ” , c l : : d e s c ( ” Spec Text ” ) , c l : : v al ue _d e sc ( ” s p e c s p e c s t r i n g ” ) , 110 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 cl : : i n i t ( ”” ) ) ; extern c l : : opt<s t r i n g > KernelName ; namespace { c l a s s S p e c i a l i z e : p u b l i c ModulePass { public : s t a t i c char ID ; S p e c i a l i z e ( ) : ModulePass ( ID ) {} } }; v i r t u a l void g e t A n a l y s i s U s a g e ( A n a l y s i s U s a g e &AU) const ; v i r t u a l b o o l runOnModule ( Module &M) ; char S p e c i a l i z e : : ID = 0 ; s t a t i c R e g i s t e r P a s s <S p e c i a l i z e > X( ” s p e c i a l i z e ” , ” S p e c i a l i z e k e r n e l on a r g s ” ) ; void S p e c i a l i z e : : g e t A n a l y s i s U s a g e ( A n a l y s i s U s a g e& au ) const { // Do n o t h i n g . } static char* type_name ( Type* t t ) { std : : s t r i n g buffer ; llvm : : raw_string_ostream stream ( b u f f e r ) ; t t −>p r i n t ( stream ) ; return l s t r d u p ( stream . s t r ( ) . c _ s t r ( ) ) ; } bool S p e c i a l i z e : : runOnModule ( Module& M) { auto F = M. g e t F u n c t i o n ( KernelName ) ; i f (F == 0 ) { c a r p ( ”No such k e r n e l , g i v i n g up . ” ) ; } /* Read s p e c . i n f o */ 111 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 spec_info * info ; i f ( SpecText . g e t V a l u e ( ) != ” ” ) { p r i n t f ( ” S p e c i a l i z e k e r n e l %s with s t r i n g %s \n” , KernelName . g e t V a l u e ( ) . c _ s t r ( ) , SpecInfo . getValue ( ) . 
c_str ( ) ) ; } int arg_count = F−>a r g _ s i z e ( ) ; char ** arg_names = ( char * * ) a l l o c a ( arg_count * s i z e o f ( char * ) ) ; int i i = 0 ; for ( auto i t = F−>arg_begin ( ) ; i t != F−>arg_end ( ) ; ++i t ) { arg_names [ i i ] = l s t r d u p ( i t −>getName ( ) . s t r ( ) . c _ s t r ( ) ) ; i i += 1 ; } i n f o = p a r s e _ s p e c _ t e x t ( arg_names , arg_count , SpecText . g e t V a l u e ( ) . c _ s t r ( ) ) ; i f ( S p e c I n f o . g e t V a l u e ( ) != ” ” ) { p r i n t f ( ” S p e c i a l i z e k e r n e l %s with f i l e %s \n” , KernelName . g e t V a l u e ( ) . c _ s t r ( ) , SpecInfo . getValue ( ) . c_str ( ) ) ; } i n f o = read_spec_info ( SpecInfo . getValue ( ) . c_str ( ) ) ; /* Find f u n c t i o n a r g s */ p r i n t f ( ” Function a r g s : \ n” ) ; int i i = 0 ; for ( auto i t = F−>arg_begin ( ) ; i t != F−>arg_end ( ) ; ++i t ) { spec_arg s a = i n f o −>a r g s [ i i ++]; Type * t t = i t −>getType ( ) ; s t r i n g name ( i t −>getName ( ) . s t r ( ) ) ; s t r i n g type ( type_name ( t t ) ) ; p r i n t f ( ” − Arg %s , type %s \n” , i t −>getName ( ) . s t r ( ) . c _ s t r ( ) , type . c _ s t r ( ) ) ; i f ( sa . spec ) { i f ( type == ” i 6 4 ” ) { i n t 6 4 _ t vv64 ; i f ( s a . s i z e == s i z e o f ( i n t 6 4 _ t ) ) vv64 = * ( ( i n t 6 4 _ t * ) s a . v a l u e ) ; 112 e l s e i f ( s a . s i z e == s i z e o f ( i n t 3 2 _ t ) ) vv64 = ( i n t 6 4 _ t ) * ( ( i n t 3 2 _ t * ) s a . v a l u e ) ; else c a r p ( ” S i z e mismatch ” ) ; printf (” −−> Spec a s i 6 4 with v a l u e %l d \n” , vv64 ) ; Value * sp_vv64 = C o n s t a n t I n t : : g e t S i g n e d ( t t , vv64 ) ; i t −>r e p l a c e A l l U s e s W i t h ( sp_vv64 ) ; 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 } } } } e l s e i f ( type == ” i 3 2 ” ) { a s s e r t ( s a . s i z e == s i z e o f ( i n t 3 2 _ t ) ) ; i n t 3 2 _ t vv32 = * ( ( i n t 3 2 _ t * ) s a . v a l u e ) ; printf (” −−> Spec a s i 3 2 with v a l u e %d\n” , vv32 ) ; Value * sp_vv32 = C o n s t a n t I n t : : g e t S i g n e d ( t t , vv32 ) ; i t −>r e p l a c e A l l U s e s W i t h ( sp_vv32 ) ; } e l s e i f ( type == ” double ” ) { a s s e r t ( s a . s i z e == s i z e o f ( double ) ) ; double vvd = * ( ( double * ) s a . v a l u e ) ; printf (” −−> Spec a s double with v a l u e %.02 f \n” , vvd ) ; Value * spd = ConstantFP : : g e t ( t t , vvd ) ; i t −>r e p l a c e A l l U s e s W i t h ( spd ) ; } e l s e i f ( type == ” f l o a t ” ) { a s s e r t ( s a . s i z e == s i z e o f ( f l o a t ) ) ; float vvf = *(( float *) sa . value ) ; printf (” −−> Spec a s f l o a t with v a l u e %.02 f \n” , vvf ) ; Value * s p f = ConstantFP : : g e t ( t t , v v f ) ; i t −>r e p l a c e A l l U s e s W i t h ( s p f ) ; } else { char* error_msg = l s p r i n t f ( ”Can ’ t s p e c i a l i z e on a v a l u e o f type %s . ” , type . c _ s t r ( ) ) ; c a r p ( error_msg ) ; } return f a l s e ; 113 Bibliography [1] Tegra 3 Processor. url: http://www.nvidia.com/object/tegra-3processor.html. [2] OpenCL: The open standard for parallel programming of heterogeneous systems. url: http://www.khronos.org/opencl/ (visited on Mar. 1, 2014). [3] The Computer Language Benchmark Game. url: http : / / http : / / shootout.alioth.debian.org/. [4] Y. Futamura. “Partial Evaluation of Computation Process – An Approach to a Compiler-Compiler”. 
Bibliography

[1] Tegra 3 Processor. url: http://www.nvidia.com/object/tegra-3-processor.html.
[2] OpenCL: The open standard for parallel programming of heterogeneous systems. url: http://www.khronos.org/opencl/ (visited on Mar. 1, 2014).
[3] The Computer Language Benchmarks Game. url: http://shootout.alioth.debian.org/.
[4] Y. Futamura. "Partial Evaluation of Computation Process – An Approach to a Compiler-Compiler". In: Systems, Computers, Controls 2.5 (1971), pp. 45–50.
[5] Intel Core i7-4960X Specifications. url: http://ark.intel.com/products/77779 (visited on Feb. 28, 2014).
[6] AMD FX-9590 Specifications. url: http://www.amd.com/us/products/desktop/processors/amdfx/Pages/amdfx.aspx#2 (visited on Sept. 9, 2014). Archived at http://www.webcitation.org/6NR9Ivy3f.
[7] Various. "Intel Hyper-Threading Technology Architecture and Microarchitecture". In: Intel Technology Journal 6.1 (2002).
[8] Auto-Vectorization in GCC. url: http://gcc.gnu.org/projects/tree-ssa/vectorization.html (visited on Mar. 1, 2014).
[9] Auto-Vectorization in LLVM. url: http://llvm.org/docs/Vectorizers.html (visited on Mar. 1, 2014).
[10] AMD R9 Series Specifications. url: http://www.amd.com/us/products/desktop/graphics/R9/Pages/amd-radeon-hd-r9-series.aspx (visited on Feb. 28, 2014).
[11] Radeon R9 290X Review. url: http://www.anandtech.com/show/7457 (visited on Feb. 28, 2014).
[12] GeForce GTX TITAN Black Specifications. url: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-black/specifications (visited on Feb. 28, 2014).
[13] GTX TITAN Review. url: http://www.anandtech.com/show/6760 (visited on Feb. 28, 2014).
[14] The History of the Modern Graphics Processor. url: http://www.techspot.com/article/650-history-of-the-gpu/ (visited on Feb. 28, 2014).
[15] CUDA Parallel Computing Platform. url: http://www.nvidia.com/object/cuda_home_new.html (visited on Mar. 1, 2014).
[16] Compute Shader Overview (Windows). url: http://msdn.microsoft.com/en-us/library/windows/desktop/ff476331(v=vs.85).aspx (visited on Mar. 1, 2014).
[17] Intel Core i7 4770R Specifications. url: http://ark.intel.com/products/76642 (visited on Feb. 28, 2014).
[18] Anandtech Iris Pro Review. url: http://www.anandtech.com/show/6993 (visited on Feb. 28, 2014).
[19] Anandtech Galaxy S 4 Review. url: http://www.anandtech.com/show/6914 (visited on Feb. 28, 2014).
[20] Anandtech iPhone 5S Review. url: http://anandtech.com/show/7335 (visited on Feb. 28, 2014).
[21] Smart phones overtake client PCs in 2011. url: http://www.canalys.com/newsroom/smart-phones-overtake-client-pcs-2011 (visited on Feb. 28, 2014).
[22] Intel Xeon Phi 7120P Specifications. url: http://ark.intel.com/products/75799 (visited on Mar. 1, 2014).
[23] Tilera TILEncore-Gx72 Product Brief. url: http://www.tilera.com/sites/default/files/images/products/TILEncore-Gx72_PB043Rel_1_2_Web.pdf (visited on Mar. 1, 2014).
[24] Tilera TILE-Gx8072 Processor Product Brief. url: http://www.tilera.com/sites/default/files/productbriefs/TILE-Gx8072_PB041-03_WEB.pdf (visited on Mar. 1, 2014).
[25] W. Gropp et al. "A high-performance, portable implementation of the MPI message passing interface standard". In: Parallel Computing 22.6 (1996), pp. 789–828.
[26] PVM: Parallel Virtual Machine. url: http://www.csm.ornl.gov/pvm/ (visited on Mar. 2, 2014).
[27] B.L. Chamberlain et al. "ZPL: A machine independent programming language for parallel computers". In: Software Engineering, IEEE Transactions on 26.3 (2000), pp. 197–211.
[28] B.L. Chamberlain, D. Callahan, and H.P. Zima. "Parallel programmability and the Chapel language". In: International Journal of High Performance Computing Applications 21.3 (2007), pp. 291–312.
[29] L. Dagum and R. Menon. "OpenMP: an industry standard API for shared-memory programming". In: Computational Science & Engineering, IEEE 5.1 (1998), pp. 46–55.
[30] Intel Cilk Plus. url: http://software.intel.com/en-us/intel-cilk-plus (visited on Mar. 2, 2014).
[31] Intel Threading Building Blocks. url: https://www.threadingbuildingblocks.org/ (visited on Mar. 2, 2014).
[32] Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing. url: https://www.khronos.org/news/press/khronos-finalizes-opencl-2.0-specification-for-heterogeneous-computing (visited on Mar. 2, 2014).
[33] Nvidia. "CUDA Programming Guide". 2007.
[34] Khronos Releases OpenGL 4.3 Specification with Major Enhancements. url: https://www.khronos.org/news/press/khronos-releases-opengl-4.3-specification-with-major-enhancements (visited on Mar. 2, 2014).
[35] RenderScript. url: http://developer.android.com/guide/topics/renderscript/compute.html (visited on Mar. 2, 2014).
[36] Aparapi API for data parallel Java. url: http://code.google.com/p/aparapi/ (visited on Mar. 2, 2014).
[37] Clyther Development Documentation. url: http://srossross.github.io/Clyther/ (visited on Mar. 2, 2014).
[38] Aaftab Munshi, ed. The OpenCL Specification 2.0. url: http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf (visited on Mar. 2, 2014).
[39] Aaftab Munshi, ed. The OpenCL Specification 1.2. url: http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf (visited on Mar. 2, 2014).
[40] OpenCL for OS X. url: https://developer.apple.com/opencl/ (visited on Mar. 2, 2014).
[41] AMD Accelerated Parallel Processing SDK. url: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/ (visited on Mar. 2, 2014).
[42] Nvidia CUDA: OpenCL Support. url: https://developer.nvidia.com/opencl (visited on Mar. 2, 2014).
[43] Intel SDK for OpenCL. url: http://software.intel.com/en-us/vcsource/tools/opencl-sdk (visited on Mar. 2, 2014).
[44] The Mesa 3D Graphics Library. url: http://www.mesa3d.org/ (visited on Mar. 2, 2014).
[45] The Mesa GalliumCompute Wiki Page. url: http://dri.freedesktop.org/wiki/GalliumCompute/ (visited on Mar. 2, 2014).
[46] Freedesktop.org Beignet Wiki Page. url: http://www.freedesktop.org/wiki/Software/Beignet/ (visited on Mar. 2, 2014).
[47] Portable Computing Language. url: http://pocl.sourceforge.net/ (visited on Mar. 2, 2014).
[48] L.O. Andersen. "C Program Specialization". MA thesis. DIKU, University of Copenhagen, Denmark, Dec. 1991.
[49] C. Consel et al. "Tempo: Specializing Systems Applications and Beyond". In: ACM Comput. Surv. 30.3es (Sept. 1998). issn: 0360-0300.
[50] C. Consel et al. "A Uniform Approach for Compile-Time and Run-Time Specialization". In: 1996, pp. 54–72.
[51] T. Kotzmann et al. "Design of the Java HotSpot™ client compiler for Java 6". In: ACM Transactions on Architecture and Code Optimization (TACO) 5.1 (2008), p. 7.
[52] Aaftab Munshi, ed. The OpenCL Specification 1.1. url: http://www.khronos.org/opencl/.
[53] C. Chambers. "The design and implementation of the SELF compiler, an optimizing compiler for object-oriented programming languages". PhD thesis. Stanford University, 1992.
[54] POCL Kernel Compiler Design Notes. url: http://pocl.sourceforge.net/docs/html/kernel_compiler.html (visited on Mar. 3, 2014).
[55] A cool use of LLVM at Apple: the OpenGL stack. url: http://lists.cs.uiuc.edu/pipermail/llvmdev/2006-August/006492.html.
[56] B. Catanzaro, M. Garland, and K. Keutzer. "Copperhead: Compiling an embedded data parallel language". In: Principles and Practices of Parallel Programming, PPoPP 11 (2011), pp. 47–56.
[57] Copperhead: Data Parallel Python. url: http://copperhead.github.io/ (visited on Mar. 4, 2014).
[58] B. Catanzaro et al. "SEJITS: Getting productivity and performance with selective embedded JIT specialization". In: First Workshop on Programmable Models for Emerging Architecture at the 18th International Conference on Parallel Architectures and Compilation Techniques. 2009.
[59] Nat Tuck. "Bacon: A GPU Programming System With Just in Time Specialization". In: Proc. PDPTA. 2012.
[60] Francois Desarmenien. Parse::Yapp. url: http://search.cpan.org/~fdesar/.
[61] R.F. Van der Wijngaart and P. Wong. NAS Parallel Benchmarks Version 2.4. Tech. rep. NAS Technical Report NAS-02-007, 2002.
[62] S. Seo, G. Jo, and J. Lee. "Performance characterization of the NAS Parallel Benchmarks in OpenCL". In: Workload Characterization (IISWC), 2011 IEEE International Symposium on. IEEE, 2011, pp. 137–148. url: http://aces.snu.ac.kr/Center_for_Manycore_Programming/SNU_NPB_Suite.html.
[63] The Rodinia Benchmark Suite. url: https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Main_Page.
[64] P.O. Jääskeläinen et al. "OpenCL-based design methodology for application-specific processors". In: Embedded Computer Systems (SAMOS), 2010 International Conference on. IEEE, 2010, pp. 223–230.
[65] The LLVM Compiler Infrastructure Project. url: http://llvm.org/.
[66] clang: a C language family frontend for LLVM. url: http://clang.llvm.org/.
[67] LLVM 3.3 Documentation: Analysis and Transformation Passes. url: http://llvm.org/releases/3.3/docs/Passes.html.
[68] LLVM 3.3 Documentation: Writing an LLVM Pass. url: http://llvm.org/releases/3.3/docs/WritingAnLLVMPass.html.
[69] Norman P. Jouppi and David W. Wall. Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines. Vol. 17. 2. ACM, 1989.
[70] Intel's Atom Architecture: The Journey Begins. url: http://www.anandtech.com/show/2493 (visited on Mar. 14, 2014).
[71] Bulldozer Has Arrived: AMD FX-8150 Processor Review. url: http://www.xbitlabs.com/articles/cpu/display/amd-fx-8150.html (visited on Mar. 15, 2014).
[72] Chris Lattner. "LLVM: An Infrastructure for Multi-Stage Optimization". See http://llvm.cs.uiuc.edu. MA thesis. Urbana, IL: Computer Science Dept., University of Illinois at Urbana-Champaign, Dec. 2002.
[73] Pekka Jääskeläinen. pocl-devel mailing list message on kernel vectorization. url: http://www.mail-archive.com/pocl-devel@lists.sourceforge.net/msg00182.html.