Implementation of Just In Time Value
Specialization for the Optimization of Data
Parallel Kernels
PhD Dissertation (Draft 4.0)
Nat Tuck
University of Massachusetts Lowell
October 10, 2014
Abstract
This dissertation explores just-in-time (JIT) specialization as an optimization
for OpenCL data-parallel compute kernels. It describes the implementation
and performance of two new extensions to OpenCL, Bacon and Specialization
Annotated OpenCL (SOCL). Bacon is a replacement interface for OpenCL
that provides improved usability and has JIT specialization built in. SOCL is
a simple extension to OpenCL that allows JIT specialization to be applied to
OpenCL kernels directly.
We have developed one implementation of Bacon and two implementations
of SOCL, which are described and benchmarked. Both implementations of
SOCL show significant speedups on some test programs. Speedups of up to
175% are demonstrated when JIT specialization is applied to kernels executing
on graphics processing units (GPUs) from AMD and Nvidia using our Pancake
library. With Specializing POCL, speedups of up to 150% are shown in tests
on a dual-CPU 24-core workstation.
Extensive work has been done previously on JIT compilers and ahead-of-time
specialization. Just-in-time value specialization is a well-known technique, but
it is not commonly implemented as an optimization to improve runtime, either in
the literature or in existing systems. This work provides JIT value specialization
as a general optimization explicitly exposed to the programmer and directly
supported by the Bacon and SOCL runtime systems.
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . 6
  2.1 Parallel Processors . . . . . . . . . . . . . . . . . . . . . . . . . 6
  2.2 Existing Parallel Programming Systems . . . . . . . . . . . . . 14
  2.3 OpenCL: The Open Compute Language . . . . . . . . . . . . . 19

3 Just In Time Specialization . . . . . . . . . . . . . . . . . . . . . . 28
  3.1 Value Specialization . . . . . . . . . . . . . . . . . . . . . . . . 28
  3.2 Just In Time Compilation . . . . . . . . . . . . . . . . . . . . . 31
  3.3 Existing Uses of Just In Time Specialization . . . . . . . . . . 32

4 The Bacon System . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
  4.1 The Bacon C Language . . . . . . . . . . . . . . . . . . . . . . 34
  4.2 Kernel Lifecycle . . . . . . . . . . . . . . . . . . . . . . . . . . 37
  4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 38
  4.4 Ease of Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
  4.5 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5 JIT Specialization of OpenCL . . . . . . . . . . . . . . . . . . . . . 43
  5.1 Specialization Annotated OpenCL . . . . . . . . . . . . . . . . 43
  5.2 Pancake: Implementing SOCL by Specializing OpenCL . . . . 50
  5.3 Serial Specialization with LLVM . . . . . . . . . . . . . . . . . 57
  5.4 Specializing POCL . . . . . . . . . . . . . . . . . . . . . . . . . 77
  5.5 JIT Specialization in Clover . . . . . . . . . . . . . . . . . . . . 83

6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . 84
  6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
  6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

A Benchmarking Hardware and Methods . . . . . . . . . . . . . . . . 88
  A.1 Benchmarking Methodology . . . . . . . . . . . . . . . . . . . . 88
  A.2 Benchmarking Hardware . . . . . . . . . . . . . . . . . . . . . . 89

B Code Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
  B.1 OpenCL Listings . . . . . . . . . . . . . . . . . . . . . . . . . . 90
  B.2 Bacon Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
  B.3 Serial Kernel Listings . . . . . . . . . . . . . . . . . . . . . . . 107
  B.4 Specializer Listings . . . . . . . . . . . . . . . . . . . . . . . . . 110

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Chapter 1
Introduction
Today, computers are parallel machines. Even cell phones have quad-core
processors and highly parallel graphics processors with programmable shaders
[1]. A high-end commodity workstation can have both 64 CPU cores and
several general purpose graphics processing units (GPUs) that together can
perform 11,264 floating point operations in parallel (e.g., a quad-socket AMD
Opteron system with four Radeon R9 290X GPUs). Taking advantage of this
parallelism is no longer optional for computationally intensive programs.
This raises the question: How do we easily write efficient parallel programs?
In a perfect world, we would be able to write our programs in a high-level
language like Ruby and have them execute with a full parallel speedup on
modern parallel hardware. Unfortunately, most of the programming languages,
libraries, and development tools that we have assume serial program execution.
Many of the popular high-level languages, including Python, Ruby, JavaScript
and others, are designed so that no more than one thread can be executed
in parallel. This makes the problem very difficult, since programmers are
understandably attached to their existing programming tools and code bases.
Any technique for parallelization that required throwing these out would be a
hard sell.
Luckily, many programs have most of their heavy computation requirements
isolated in a few relatively small regions of their code. A traditional optimization
technique is to separate out these computational hot spots into their own
routines, rewrite them in a more efficient (usually lower-level) programming
language, and then call them from the main program. These rewritten hot spot
routines are called "compute kernels." This method works well for adding
parallelism to a program; the kernels can be made parallel while the bulk of
the program remains serial.
There are many ways to express parallelism in code, all with different benefits
and drawbacks. One model that is easy to understand and maps well to a
variety of hardware is array-based SPMD (single program multiple data) data
parallelism. In this model, one piece of code operates on an entire (possibly
multi-dimensional) array by running in parallel once for each element in the
array.
The current cross-vendor standard for GPGPU programming is called OpenCL.
OpenCL uses this model of writing separate SPMD kernels and executing them
from a host program. The OpenCL standard [2] consists of two major pieces.
First, it defines a programming language called OpenCL C for writing compute kernels to run on parallel hardware. Second, it defines runtime APIs for
C and C++ that allow these kernels to be compiled, loaded, and executed
from programs running on a host CPU.
The phrase “portable assembly language” is sometimes used to describe the
C programming language. OpenCL is now the portable assembly language of
GPU-like parallel processors. A few years ago, when the potential speedup
from parallelism on consumer devices was in the single digits, it made sense
to do other optimizations first; with a factor-of-two parallel speedup, there's
no reason to rewrite a kernel for parallel execution before rewriting it in
hand-optimized C. Until recently, high-level parallel languages didn't really make
sense outside of large clusters for high performance computing.
Today that trade-off has changed. According to the Computer Language
Benchmarks Game [3], the speedup from using a fast, low-level language like
C over a slow high-level language like Ruby is around a factor of 50. Commodity
machines can have more than 50 CPU cores, and GPUs can be two orders of
magnitude more parallel than that. Today, a program written in a language
like Ruby that parallelized perfectly could actually execute faster than an
equivalent sequential C program.
But even if thousand-core CPUs become common, we would still rather not
give up a factor of 50 in performance if we can avoid it. Luckily, describing
computation at a higher level actually provides more information which can be
used for aggressive compiler optimization. One very aggressive optimization
that maps well to data-parallel kernels is value specialization (sometimes called
“partial evaluation” [4]), where routines written for a general case computation
are transformed automatically to operate on specific input cases. Computers
are fast enough now that we can do this specialization at runtime based on
the actual values seen by the program.
This dissertation explores just in time specialization for OpenCL compute kernels.
General technical background, including hardware, other parallel software
systems, and the OpenCL standard, is explored in Chapter 2. The history
and use of JIT specialization is covered in Chapter 3. Work on Bacon, a
programming system with JIT specialization and other features that targets
OpenCL as a base platform, is described in Chapter 4.
The main contribution in this dissertation is Specialization Annotated OpenCL
(SOCL), which is described in Chapter 5. This is an extension to OpenCL that
provides specialization for OpenCL kernels directly. Two implementations of
SOCL are described and their performance is analyzed. Finally, conclusions
and possible future work are discussed in Chapter 6.
It is my hope that this work will inform the development of future parallel
programming systems, and that eventually programmers will be able to get
the performance of hand-optimized OpenCL with the programming effort and
expressive power of a language like Ruby. Unfortunately, computer systems
are complex, and this dissertation presents only a simple compiler optimization
rather than the silver bullet for parallel programming.
Chapter 2
Technical Background

2.1 Parallel Processors
This dissertation explores the optimization of programs executed on parallel
processors. In this chapter we look at some of the parallel processors that are
available, especially in consumer computing devices. First, we show the basic
architectures used for processors and how they innately support parallelism.
As of early 2014, common consumer computing devices include laptops, desktops, tablets, and smartphones. Each of these generally contains a central
processing unit (CPU) with between two and eight processor cores and a graphics processing unit (GPU) with as many as several thousand programmable
shader units. At the higher end of commodity computing devices, servers and
workstations can have up to four CPUs and four discrete GPUs.
Desktops, laptops, workstations, and servers generally use Intel-compatible
processors produced by either Intel or AMD. Mobile phones and tablets are
generally based on processors in the ARM family, designed by ARM Holdings
and produced under license by various manufacturers. This isn't a hard rule,
though, with Intel pushing their new Atom processors for phones and tablets,
AMD pushing ARM processors for servers, and some inexpensive tablets and
laptops using MIPS processors.
                      Intel Core i7-4960X [5]    AMD FX-9590 [6]
  Cores               6                          8
  Threads             12                         8
  Speed               3.6 GHz                    4.7 GHz
  Turbo / Boost       4.0 GHz                    4.7 GHz
  Memory Channels     4                          2
  L1 Code             32 kB / core               64 kB / 2 cores
  L1 Data             32 kB / core               16 kB / 2 cores
  L2 Cache            256 kB / core              2 MB / 2 cores
  L3 Cache            15 MB                      8 MB
  GFLOPS              172                        150
  TDP                 130 W                      220 W

Table 2.1: High-End Desktop CPUs
2.1.1 PC CPUs
Intel-compatible microprocessors have been the de facto standard for personal
computers since the release of the IBM PC in 1981. Modern incarnations of
this architecture use the 64-bit x86_64 (a.k.a. AMD64, Intel64, x64) extension
to Intel's x86 instruction set, first introduced in AMD's Athlon 64 chip in 2003.
Today, Intel-compatible CPUs are produced almost exclusively by Intel and
AMD. Although both companies are moving towards designs with integrated
GPUs, they both produce processors without built in graphics hardware for
the enthusiast gaming and graphics workstation markets where discrete GPUs
remain essential for good performance.
Two sample desktop processors from Intel’s Core and AMD’s FX lines are
shown in Table 2.1. As of February 2014, these are the most powerful processors marketed to consumers by the two companies.
The primary difference between the Intel Core and AMD FX processor designs
is the approach to sharing hardware resources between threads. Intel uses
fewer cores with high single-thread performance and improves performance for
many threads with Hyper-Threading™, their implementation of simultaneous
multi-threading (SMT) [7]. In contrast, AMD uses many cores with moderate
single thread performance, organizing them as dual core modules that share
both L2 cache and 256-bit SIMD units to minimize the increase in die space
caused by including more cores.
                      AMD Radeon R9 290X         Nvidia GeForce GTX
                      [10][11]                   Titan Black [12][13]
  Shaders             2816                       2880
  Clusters / SMXes    44                         15
  Speed               1000 MHz                   980 MHz
  L1 Cache            16 kB / cluster            64 kB / SMX
  L2 Cache            1 MB                       1.5 MB
  RAM                 4 GB                       2 GB
  Memory Bus          512 bit                    384 bit
  SP GFLOPS           5632                       5000
  DP GFLOPS           704                        1700

Table 2.2: Modern High End GPUs
Vector single-instruction multiple-data (SIMD) arithmetic units are a significant
source of parallelism in modern CPUs. The latest SIMD extension to the
AMD64 instruction set is AVX (encoded with the VEX prefix scheme), which
operates on 256-bit wide registers. This allows data-parallel operations to be
performed on either four 64-bit values or eight 32-bit values at a time.
Taking both multiple cores and SIMD units into account, modern multi-core
CPUs are capable of significant parallelism. A data parallel computation operating on 32-bit values on the Intel Core i7 4960X (with 6 cores and 8-wide
SIMD) is potentially 48 times faster than the same computation executing
sequentially.
Even greater parallelism is available in server and workstation systems. Four
15-core Intel processors have a total of 60 cores and can execute 120 threads
simultaneously. With SIMD instructions, such a system can perform 480 floating point additions per clock cycle.
CPU parallelism can be exploited using standard tools. Most programming
environments support the use of multiple threads or processes, and some
compilers (including GCC [8] and LLVM [9]) can automatically vectorize
appropriate loops to use SIMD instructions such as AVX. High performance
math libraries that exploit CPU parallelism on the x86_64 architecture are
readily available.
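As a minimal, hypothetical sketch (not one of the benchmark kernels used later in this dissertation), a loop like the following is a typical candidate for automatic vectorization; compiled with something like gcc -O3 -march=native, each 256-bit AVX instruction can process eight 32-bit floats at once:

/* A simple, dependence-free loop that an auto-vectorizer can turn into
 * 256-bit SIMD operations (eight floats per instruction). */
void
saxpy(int nn, float aa, const float* xs, float* ys)
{
    for (int ii = 0; ii < nn; ++ii)
        ys[ii] = aa * xs[ii] + ys[ii];
}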
2.1.2 Discrete GPUs
Graphics Processing Units (GPUs) are massively parallel processor arrays that
have developed largely to support computer gaming [14]. As the complexity of
game graphics has increased, these GPUs have become more programmable.
Current generation GPUs can execute arbitrary parallel functions with only a
few programmer constraints. This flexibility has allowed these devices to be
used for general computation, and the use of GPUs in this way is frequently
called General Purpose GPU (GPGPU) computing.
Two high end GPUs, an AMD Radeon R9 and an Nvidia GeForce GTX, are
shown in Table 2.2. Both graphics cards can execute nearly 3000 threads in
parallel at about 1.0 GHz. This means that they have theoretical floating
point performance more than 30 times faster than the fastest CPU available
when they were released. Due to higher memory bandwidth and other factors,
the speedup may be even higher on an appropriately parallel workload. A
parallel program that can run on a GPU offers a potential speedup of three
orders of magnitude over a sequential program.
GPUs are more complicated to work with than multi-core CPUs. GPUs are
natively SPMD data-parallel processors. They execute threads in hardware
batches of 32 or 64 threads (called "warps" by Nvidia and "wavefronts" by
AMD, respectively) that execute the same instruction sequence in lock-step.
This makes array data parallelism the simplest way to write programs for
execution on this hardware.
Code for execution on a GPU generally needs to be written using a specialized
programming language and executed using a vendor-specific execution API.
The open standard for GPU programming is OpenCL [2], which is discussed
further in Chapter 2.3. In addition, there are a variety of other GPU tools such
as Nvidia’s Cuda [15] and Microsoft’s DirectCompute [16], which are discussed
further in Chapter 2.2.
Nvidia also produces a line of parallel accelerator cards under the Tesla brand
name. These cards are very similar in performance to the GeForce GTX Titan.
2.1.3 Accelerated Processing Units
                      Intel Core i7-4770R        AMD A10-7850K [6]
                      [17][18]
  Cores               4                          8
  Threads             8                          8
  CPU Speed           3.2 GHz                    3.7 GHz
  GPU "Cores"         40 (1)                     8 (2)
  Turbo / Boost       3.9 GHz                    4.0 GHz
  L1 Code             32 kB / core               64 kB / 2 cores
  L1 Data             32 kB / core               16 kB / 2 cores
  L2 Cache            256 kB / core              2 MB / 2 cores
  L3 Cache            6 MB                       n/a
  L4 Cache            128 MB                     n/a
  CPU GFLOPS (3)      102                        59
  GPU GFLOPS          832                        737
  TDP                 65 W                       220 W

Table 2.3: Desktop Accelerated Processing Units (APUs)

(1) Each Intel HD Graphics "core" has two 4-wide vector ALUs.
(2) Each Radeon R7 "core" has 64 "stream cores" (ALUs).
(3) Theoretical GFLOPS = vector width * vector units * clock speed.
As we saw with floating-point arithmetic units in the 1980s, widely used coprocessors can be integrated into CPUs as transistors get smaller and cheaper.
Both Intel and AMD now produce CPUs with integrated programmable graphics processors (APUs) as their primary mainstream processor offerings.
AMD uses a scaled down version of their discrete Radeon graphics processors
for their integrated graphics, while Intel has developed their Intel HD Graphics
modules solely for integration into their CPUs. Integrated graphics processors
can’t offer the performance of discrete graphics cards, but they can provide an
order of magnitude more compute power than the CPU cores they are paired
with.
Table 2.3 shows two APUs: a Core i7 from Intel and an A10 APU from AMD.
The two processors provide similar theoretical GFLOPS numbers, but the Intel
processor has a very interesting feature: a huge (128 MB) eDRAM that it
uses as an L4 cache. The major performance drawback for integrated graphics
uses as L4 cache. The major performance drawback for integrated graphics
processors is the use of system RAM shared with the CPU, which is relatively
low bandwidth compared to the high speed graphics RAM on discrete graphics
cards. Intel’s EDRAM provides only a small performance boost on games, but
it may provide a significantly larger benefit for GPGPU program execution.
APUs seem to be the way of the future. I overconfidently predict that discrete
graphics cards will be largely extinct in consumer PCs a decade from now.
                      Qualcomm Snapdragon 600    Apple A7 SoC [20]
                      8974-AA SoC [19]
  Cores               4                          2
  CPU Speed           2.2 GHz                    1.3 GHz
  CPU ISA             ARMv7                      ARMv8-A
  GPU Speed           450 MHz                    200 MHz
  L0 Code + Data      4 kB + 4 kB                n/a
  L1 Code + Data      16 kB + 16 kB              64 kB
  L2 Cache            2 MB                       1 MB
  L3 Cache            n/a                        4 MB
  CPU GFLOPS          35.2                       10.4
  GPU GFLOPS          130                        76.8
  TDP                 < 5 W                      < 5 W

Table 2.4: Mobile Systems on Chip
Programs for APUs are written by programming the two components separately.
The CPU cores are programmed using standard tools, while the graphics
processor is programmed using specialized GPU programming tools. The fact that
memory is shared between the two components simplifies things somewhat, as
current generation APUs can share data structures without copying.
2.1.4 ARM Processors
Smartphones are now the most common type of computer in general use,
having overtaken PCs in shipments in 2011 [21]. Mobile devices, including
smartphones and tablets, are powerful general purpose computers with multi-core
CPUs and programmable integrated graphics. The main difference from PCs
is their use of ARM rather than Intel-compatible CPUs.
Sample specifications for two current mobile systems on a chip (SoCs) from
Qualcomm and Apple are shown in Table 2.4. These SoCs are less powerful
than modern desktop APUs by about an order of magnitude, while also using
about an order of magnitude less power. Like the APUs, the graphics processors on these SoCs have about an order of magnitude more compute power
than the CPU component.
Mobile devices are programmed using development tools supplied by the software vendor and provide only minimal support for programming of the GPU.
The development tools have improved over time, so full GPU programming
support from all vendors is likely in the near future.
                      Intel Xeon Phi 7120P [22]  Tilera TILEncore-Gx72 [23][24]
  Cores               61                         72
  Speed               1.2 GHz                    1.0 GHz
  L1 Code + Data      32 kB + 32 kB              32 kB + 32 kB
  L2 Cache            512 kB / core              256 kB / core
  L3 Cache            n/a                        18 MB
  RAM                 16 GB                      32 GB
  GFLOPS              1208                       576
  TDP                 300 W                      95 W

Table 2.5: Parallel Accelerator Cards
2.1.5 Parallel Accelerators
There are a number of parallel accelerators on the market, commonly in the
form of PCI express expansion cards, that can perform parallel computations.
These cards are not generally available for retail purchase, but their existence is
worth noting because they provide a computational model somewhere between
that of a multi-core CPU and a general purpose GPU and will likely inform
the design of future commodity computing components.
Table 2.5 shows two current parallel accelerator cards: The Xeon Phi from
Intel and the TILEncore Gx72 from Tilera.
The Xeon Phi is intended primarily for high performance computing. Each
core provides 4-way simultaneous multi-threading, allowing the card to execute 244 simultaneous threads. The Phi executes standard 64-bit x86 code
except for the addition of a 512-bit SIMD unit to provide higher data-parallel
performance. The cores are connected in a ring topology.
The Xeon Phi is programmed using a variant of Intel’s standard x86_64 compilers, which are very mature. The co-processor can even run existing x86_64
binaries without the benefit of the extended SIMD instructions.
The Tilera TILEncore is primarily marketed as a realtime network packet
processor for applications like deep packet inspection. Towards this end, the
card is equipped with four 10Gb Ethernet ports. The cores support SIMD
and heavy duty cryptographic acceleration. The core interconnect is a mesh
connecting each processor with five links to each of its four neighbors.
Tilera’s accelerator is programmed using their development kit, which seems
to be based on standard tools.
[Figure: bar chart "Example Processor Comparison" showing the theoretical GFLOPS (0 to 6000) of the processors discussed in this chapter, from the sequential Core i7-4770R at the low end up to the Nvidia GeForce GTX Titan Black at the high end.]

Figure 2.1: Comparison of the theoretical GFLOPS of sample processors. The
sequential performance of the Core i7 is 7.2 GFLOPS.
2.1.6 Comparative Performance
Figure 2.1 shows the comparative theoretical maximum floating point performance of the processors discussed in this chapter. For processors with both
CPU cores and integrated graphics, this is the total performance of the chip.
This chart only shows maximum theoretical GFLOPS and doesn’t take into
account memory bandwidth, cache performance, how effectively a program
was parallelized, or any other practical considerations.
Modern hardware allows parallel programs to execute significantly faster than
sequential programs. When compared to GPUs, sequential execution on the
fastest CPU in the world barely registers on the chart.
2.2 Existing Parallel Programming Systems
In this section we describe several existing parallel programming systems, separated by the type of hardware they target.
2.2.1 Cluster Programming Systems
Historically, parallel programming largely meant programming for clusters of
many machines connected by a network. Work on this sort of system is generally called High Performance Computing (HPC) and largely consists of simulations of physics problems, engineering designs, the weather, financial systems,
and other problems that generate enough funding to purchase a large number
of machines. Even though a modern cell phone has as much computational
power as a cluster from 20 years ago, there’s no shortage of larger problems
that need a cluster to solve as individual computers get faster and have more
RAM and storage space. Many of the older and more mature parallel programming systems target clusters.
MPI
MPI (Message Passing Interface) is an open standard API for parallel program execution on clusters. There are many different MPI implementations,
including vendor-specific libraries optimized for specific hardware and widely
portable open implementations like MPICH[25].
As shared memory is not generally available on clusters, MPI programs communicate by message passing. This scales better than shared memory but
makes some otherwise efficient parallel algorithms no longer feasible to execute.
Newer versions of MPI include support for execution on multi-core systems and
simulation of shared memory. MPI can be used to manage the inter-machine
communication in systems where other tools are used to manage local parallelism, including clusters that do their primary computation on GPUs.
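The flavor of MPI programming can be seen in the following hypothetical sketch (not taken from any system described in this dissertation): each process learns its rank, then exchanges a value with its neighbors by explicit messages rather than through shared memory.

/* mpi_ring.c: pass one value around a ring of processes.
 * Build with mpicc and run with, e.g., "mpirun -np 4 ./mpi_ring". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?  */

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;
    long token = rank;
    long recvd = 0;

    /* Communication is explicit message passing, not shared memory. */
    MPI_Sendrecv(&token, 1, MPI_LONG, next, 0,
                 &recvd, 1, MPI_LONG, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %ld from rank %d\n", rank, recvd, prev);
    MPI_Finalize();
    return 0;
}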
PVM
PVM[26] (Parallel Virtual Machine) is an open source software package for
parallel programming on clusters. It is slightly older than MPI, but provides
similar capabilities. PVM connects the machines in the cluster together into
a “virtual machine” and provides extensive APIs to support querying and
controlling the status of the various nodes and running processes.
ZPL
ZPL[27] (Z-level Programming Language) is a higher level array programming
language built on top of MPI. In ZPL, the programmer describes operations on
entire arrays or pieces of arrays rather than describing parallel computations by
looking at single array elements. ZPL pioneered several parallel programming
concepts, including the idea of “regions” which are very similar to the parallel
ranges used in modern data-parallel systems.
Chapel
Another high-level parallel programming language of note is Chapel[28]. This
language, a conceptual descendant of ZPL, was developed at Cray as an entry
in the DARPA funded High Productivity Computing Systems program. It
tries to solve a larger piece of the parallel and distributed computing problem
than other systems by being a “multi-resolution” language, allowing users to
write both high- and low-level routines as part of the same program. This is
supported by novel parallel abstractions like zippered iterators and distributed
software transactional memory.
2.2.2 Multicore Programming Systems
As described in the previous section, most people now have powerful parallel
computers on their desks and in their pockets.
The most basic way to exploit the parallelism of multi-CPU systems is to run
multiple instances of a program so that the operating system can schedule each
instance on a separate CPU core. Unfortunately, this method has some drawbacks in ease of development, communication between instances, and memory
usage.
To simplify communication and reduce memory usage, operating systems added
the ability for each instance of a running program to have multiple “threads” of
execution that could run in parallel. These threads all share the same address
space, so they use a smaller total amount of memory and communication can
be accomplished by reading and writing the shared memory. Unfortunately,
accessing memory concurrently leads to race conditions, and explicit locks are
used to guarantee atomic access to shared memory. This means that programmers working with threads need to worry about both race conditions and
deadlocks, which can occur when locks are not used correctly.
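A minimal, hypothetical example of this model in C (not code from this dissertation) uses POSIX threads and a mutex to protect a shared counter; removing the lock would turn the increment into a race condition:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread does a read-modify-write on shared memory; the mutex
 * makes the increment atomic. */
static void* work(void* arg)
{
    for (int ii = 0; ii < 100000; ++ii) {
        pthread_mutex_lock(&lock);
        counter += 1;
        pthread_mutex_unlock(&lock);
    }
    return 0;
}

int main(void)
{
    pthread_t threads[4];
    for (int ii = 0; ii < 4; ++ii)
        pthread_create(&threads[ii], 0, work, 0);
    for (int ii = 0; ii < 4; ++ii)
        pthread_join(threads[ii], 0);
    printf("counter = %ld\n", counter);
    return 0;
}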
Various tools and techniques have been developed to simplify programming
with threads and multiple processes.
OpenMP
OpenMP[29] (Open Multiprocessing) is an open standard API and syntax
extension to C, C++, and FORTRAN for writing parallel code for shared
memory multiprocessor machines. It is especially popular due to its low-impact
work flow. First, the programmer writes and tests a sequential program while
keeping parallelism issues in mind. Second, the programmer annotates that
program with “pragmas” that expose parallelism to the compiler. Optimally,
the program will then run with a good parallel speedup.
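A hypothetical example of this workflow (not from this dissertation): the loop below is an ordinary sequential loop, and the single pragma is all that is needed to ask the compiler to run its iterations in parallel; building without OpenMP support simply ignores the pragma.

/* Compile with, e.g., "gcc -fopenmp". */
double sum_squares(int nn, const double* xs)
{
    double total = 0.0;

    /* Split the iterations across threads and combine the per-thread
     * partial sums into total. */
    #pragma omp parallel for reduction(+:total)
    for (int ii = 0; ii < nn; ++ii)
        total += xs[ii] * xs[ii];

    return total;
}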
OpenMP is supported by a variety of compilers, including GCC, Intel’s ICC,
Microsoft Visual Studio, IBM XL C/C++, and Oracle Solaris Studio.
Another extension to C and C++ similar to OpenMP is Intel’s Cilk Plus[30].
Intel Threading Building Blocks
Intel Threading Building Blocks[31] (TBB) is a C++ template library to provide simple patterns that can be used to add parallelism without having to
deal with low-level primitives and performance tuning. With TBB, the programmer specifies tasks that can run in parallel and the library automatically
schedules their execution on multiple processor cores. The library also provides patterns for parallel loops and pipelines that will generate the tasks to
be run in parallel automatically.
Immutable Data
A common technique for avoiding locks in threaded systems is the use of
immutable data. If an object in memory never changes, there’s no need to
worry about race conditions when accessing it.
Many functional programming languages, such as Haskell and Clojure, explicitly enforce the immutability of data. Operations that modify an object in
place are disallowed. If a modified version is needed, a new object with the
modification is generated from the old version in a single atomic operation.
This guarantees that all objects can be safely shared between threads, since
objects cannot be changed once they are created.
Immutable data can also be used as a programming pattern in languages that
don’t explicitly enforce it. Generally, the strategy is that objects are produced
in a single thread and then a reference to them is shared with other threads.
Once the reference has been shared, the object is not changed. If a modified
version of the object is needed, a thread makes a local copy, modifies the copy,
and then shares the new version.
The disadvantage to immutable data is the performance penalty that comes
from allocating a new object for every operation. In languages like Haskell
this can largely be optimized away by the compiler, but in general the cost of
memory allocation is a concern due to cache effects and the fact that every
piece of memory that is allocated must later be garbage collected.
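A hypothetical sketch of this pattern in C (the struct and function names are invented for illustration): readers only ever follow a pointer to an object that is never modified, and an update publishes a fresh copy by swapping the pointer.

#include <stdlib.h>
#include <string.h>
#include <stdatomic.h>

struct config { int threads; int verbose; };

/* The currently published (immutable) configuration; initialization
 * and reclamation of old versions are omitted for brevity. */
static _Atomic(struct config*) current_config;

struct config* get_config(void)
{
    return atomic_load(&current_config);   /* readers never lock */
}

void set_threads(int threads)
{
    struct config* old_cfg = atomic_load(&current_config);
    struct config* new_cfg = malloc(sizeof(*new_cfg));
    memcpy(new_cfg, old_cfg, sizeof(*new_cfg));  /* copy ...       */
    new_cfg->threads = threads;                  /* ... modify ... */
    atomic_store(&current_config, new_cfg);      /* ... publish    */
}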
2.2.3 Low-Level GPU Programming APIs
OpenCL
OpenCL [2] (Open Computing Language) is a cross-vendor standard for
data-parallel programming designed primarily to target GPU devices. It was
developed by a group led by Apple in 2008 to create an open standard in response
to vendor-specific GPU programming APIs from Nvidia and Microsoft. Since
then it has gone through three major revisions, with OpenCL 2.0 having been
finalized in November of 2013 [32].
OpenCL is described in detail in Chapter 2.3.
CUDA™
CUDA™ [33] is a proprietary API for GPU programming developed by Nvidia.
It is somewhat more mature than OpenCL but is otherwise very similar, with
a slightly different set of features and restrictions. In addition to a C-based
language, the CUDA platform also allows kernels to be written in variants of
C++ and FORTRAN. CUDA can only be used to develop for Nvidia GPUs.
Microsoft DirectCompute
Microsoft provides an API for GPU computing for the Windows operating system called DirectCompute[16] as part of their DirectX API package. DirectCompute allows “compute shaders” written in Microsoft’s High Level Shader
Language (HLSL) to be executed on supported GPUs.
DirectCompute provides a subset of the features provided by OpenCL or
CUDA. Its primary advantage is its integration with the DirectX API, which
is the standard for development of games and other 3D applications on the
Windows platform.
OpenGL Compute Shaders
The recent OpenGL 4.3 standard[34] adds support for compute shaders to this
standard graphics API. Shaders are written in the OpenGL Shader Language
(GLSL) and are executed through the OpenGL API.
Like Microsoft’s compute shaders, these are less featureful than OpenCL or
CUDA, but are well integrated with the OpenGL API for graphics applications
that need simple support for compute acceleration.
Google RenderScript
Google RenderScript[35] is an API for parallel acceleration on the Android
platform. It allows kernels to be written in the style of compute shaders and
then called from Java code in Android applications. Google has intentionally
not provided default support for OpenCL on Android to promote RenderScript, although OpenCL is significantly more featureful.
2.2.4 High-Level GPU Programming Tools
Aparapi
Aparapi[36] is an open source API that allows methods written in Java to be
translated into OpenCL and executed on GPU devices at runtime. The methods to be translated must be written in a restricted subset of Java consisting
largely of loops and integer operations. The ability to embed GPU kernels
directly in the host code in the host language makes it much simpler to take
advantage of GPU acceleration than writing separate kernel or shader code in
a separate language with a separate execution API.
Clyther
Clyther[37] is a tool like Aparapi, except for the Python programming language rather than Java. One additional feature of note is the fact that since
Python is a dynamically typed language, kernels written in Clyther will be
type-specialized on the types of their arguments. This allows a kernel to be
written once and then executed on arguments of several types.
2.3 OpenCL: The Open Compute Language
OpenCL has two major parts: the OpenCL C programming language and
the OpenCL runtime library. Kernels (routines to be run in parallel) are
developed in OpenCL C and are loaded and executed at runtime by API
calls to the OpenCL library from a host application. This delayed compilation
allows OpenCL to be hardware independent; kernels are compiled for devices
identified and selected at runtime.
The OpenCL standard further defines a “virtual parallel machine” model for
programmers to target with OpenCL kernels. The intent is that this model be
general enough to map to all target devices, while having enough of the specific properties of GPUs that it can efficiently target their somewhat peculiar
architecture.
The OpenCL model describes how a kernel executes on a device. The kernel
is executed in parallel over a 1-, 2-, or 3-dimensional grid-shaped index space
called an NDRange. That grid is further divided up into “work groups” in a
way that can be specified by the user if it is important for a given kernel. Each
executing instance of the kernel (“thread”) is able to access its index, and can
thus determine what piece of the computation it is responsible for calculating.
Memory in the OpenCL model is separated into five address spaces. Global
memory is shared across all threads in a kernel, but no guarantee is made that
writes from one thread will be visible to other threads before the kernel is
finished executing. Local memory is shared within a work group and can be
synchronized for inter-thread communication by an explicit barrier. Private
memory is accessible only to a single thread. As a general guideline, global
memory is slow, local memory is faster, and private memory is the fastest.
As an alternative to global memory, there are two additional address spaces
that may provide special high speed caching: constant memory and images.
Constant memory is read-only during kernel execution. Images are 2D or 3D
arrays that can only store data in certain formats and are restricted such that
a given image can only be either read or written, but not both, by a given
execution of a kernel.
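The address spaces appear directly in kernel code as qualifiers. A hypothetical kernel (not one used in this dissertation) that stages data through local memory might look like this:

/* Each work item copies one element from slow global memory into the
 * work group's local buffer, waits at a barrier so the whole group can
 * see the data, and uses a private variable for its own arithmetic. */
kernel void
scale_rows(global float* data, local float* row, constant float* factors)
{
    size_t gid = get_global_id(0);   /* position in the NDRange        */
    size_t lid = get_local_id(0);    /* position within the work group */

    row[lid] = data[gid];            /* global -> local                */
    barrier(CLK_LOCAL_MEM_FENCE);    /* synchronize the work group     */

    float scaled = row[lid] * factors[0];   /* private temporary       */
    data[gid] = scaled;
}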
The OpenCL C language that kernels are written in is subject to some
major constraints, including a lack of function pointers, recursion, and any sort
of dynamic memory allocation or array sizing. The standard seems to be slowly
becoming less restrictive with extensions, newer revisions, and improvements
in the capabilities of GPU hardware, but it will be several years before OpenCL
can be straightforwardly treated like a parallel C.
The OpenCL specification provides a clear description of the OpenCL platform
model. It is advisable to read the beginning of Section 3 of the OpenCL Specification [38] to get a more complete picture of the standard, as the remainder
of this dissertation describes work based on the OpenCL platform.
2.3.1 A Sample OpenCL Program
To see how OpenCL works, it’s useful to walk through how a kernel is compiled
and executed using the OpenCL API. Although there are wrappers available
for other languages like C++ and Java, OpenCL is natively a C API, so
we describe the process from a C host program. Using OpenCL from other
languages (e.g., the C++ API) is basically the same. The information in
this section is based on Version 1.2 of the OpenCL Specification [39]; detailed
documentation on the API functions can be found there.
 1  /* vec_pow:
 2   *   Takes a vector of (n) long integers and raises each one
 3   *   to the (y)'th power in place.
 4   */
 5  kernel void
 6  vec_pow(global long* xs, long y, long n)
 7  {
 8      long ii = get_global_id(0);
 9      long x  = xs[ii];
10      long x1 = 1;
11      for (long jj = 0; jj < y; ++jj) {
12          x1 *= x;
13      }
14      xs[ii] = x1;
15  }

Listing 2.1: Sample OpenCL Kernel: vec_pow
The kernel that we call is the vec_pow kernel shown in Listing 2.1.
A complete host program that loads and runs the vec_pow kernel is shown in
Appendix B (Listing 2.3.1 on page 23). Even in this minimal example which
takes the first OpenCL device on the system, excludes any error checking, and
includes only one buffer taken from an array literal, this process requires 12
conceptual steps and 84 lines of C code. We now describe these steps.
After including the OpenCL header file (line 1), the first step is to select an
OpenCL platform. OpenCL allows versions of its runtime library from multiple
vendors to co-exist on one machine. The clGetPlatformIDs function (line 14)
lists these different implementations. This function also demonstrates the use
of a common C memory management idiom: the first call returns how big an
array to allocate for the result, and the second call populates that array. Here
we blindly take the first platform.
Once a platform has been selected, the second step is to select a parallel
device that will be used to execute kernels. The clGetDeviceIDs function
(line 20) lists all devices that the selected platform can utilize of the specified
type. Here we take the first device of any type with CL_DEVICE_TYPE_ALL,
but we could have requested, e.g., a CPU or GPU device specifically with
CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU.
The next step is to create an OpenCL context for the selected device and
platform using the clCreateContext function (line 25). A context can contain
more than one device, which would allow OpenCL memory buffers to be shared
by all of the devices in the context.
Once we have a context, we can create a command queue for our device using
the clCreateCommandQueue function (line 29). Actions on this device like
executing a kernel or copying a memory buffer to or from the device will be
added to this queue and then processed asynchronously.
To execute a kernel, we first need to read the kernel source from disk (lines
32-39). It would be convenient if OpenCL provided an API function to do this
for us, but instead we do it by hand using standard C I/O functions.
Once we have the kernel source in memory, we call the clCreateProgramWithSource function (line 45) to create an OpenCL program object. We can
then use the clBuildProgram function (line 47) to compile it to a device specific binary. The interface to this function is somewhat complicated, expecting
an array of source strings and an array with a length for each source string.
At this point we could use the clGetProgramBuildInfo function to get the
status of the build and any errors or warnings that were encountered so we
could print a useful error message.
After building the program, we need to get a reference to the kernel we want
to call with the clCreateKernel function (line 50). This takes the name of
the kernel as a string and returns an OpenCL kernel object.
To pass an array to a kernel, we need to put the data in an OpenCL buffer
object. We can create such an object with the clCreateBuffer function (line
57). In this case we use the CL_MEM_COPY_HOST_PTR flag, which causes data to
be copied into the buffer when it is created, but OpenCL also supports other
options like using the host pointer directly with the CL_MEM_USE_HOST_PTR flag
which can avoid copying for some devices but has caching caveats with others,
or by writing data to the buffer explicitly with the clEnqueueWriteBuffer
function.
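For comparison, a sketch of that explicit-write alternative (a fragment, reusing the context, queue, xs_size, and xs0 names from the sample program): the buffer is created uninitialized, and the host data is then copied into it through the command queue.

/* Alternative to CL_MEM_COPY_HOST_PTR: create an uninitialized buffer,
 * then enqueue an explicit (blocking) write of the host data into it. */
cl_mem xs = clCreateBuffer(context, CL_MEM_READ_WRITE, xs_size, 0, 0);
clEnqueueWriteBuffer(queue, xs, CL_TRUE, 0, xs_size, xs0, 0, 0, 0);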
Once we have a kernel object, we can use the clSetKernelArg function (lines
61-63) to set values for the arguments we want to pass in when the kernel is
executed. This function takes the numeric position of the argument, the size
of the item being passed, and a pointer to the data. Buffers (or other cl_mem
objects like images) are treated a bit specially here – you pass in an OpenCL
memory object but the kernel gets a pointer to the data.
With the arguments set, we can finally enqueue the kernel for execution using
the clEnqueueNDRangeKernel function (line 67). This will cause the kernel to
get executed in parallel once for each position in a 1D, 2D, or 3D ”range”. In
this case we select a 1 dimensional range of size n, which will result in copies
of the kernel being executed such that calls to get_global_id(0) (line 10 in
Listing 2.1) will return values from 0 to n − 1, which can then be used directly
as indices into the array xs.
Before we can use the results of the kernel, we need to ensure that it has
completed executing. We call clFinish (line 70) to guarantee that all asynchronous tasks have finished executing.
Once the kernel has been executed, we can read back the results using the
clEnqueueReadBuffer function (line 74) and then print them (lines 77-81).
Sample Host Program for vec_pow
 1  #include <CL/cl.h>
 2  #include <stdio.h>
 3  #include <alloca.h>
 4  #include <string.h>

 6  int
 7  main(int argc, char* argv[])
 8  {
 9      cl_uint nn;

11      /* Take the first OpenCL platform */
12      clGetPlatformIDs(0, 0, &nn);
13      cl_platform_id platforms[nn];
14      clGetPlatformIDs(nn, platforms, 0);
15      cl_platform_id platform = platforms[0];

17      /* Take the default OpenCL device */
18      clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, 0, &nn);
19      cl_device_id devices[nn];
20      clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, nn, devices, 0);
21      cl_device_id device = devices[0];

23      /* Create an OpenCL context */
24      long context_props[] = {CL_CONTEXT_PLATFORM, (long) platform, 0};
25      cl_context context = clCreateContext(
26          (cl_context_properties*) context_props, 1, &device, 0, 0, 0);

28      /* Create an OpenCL command queue */
29      cl_command_queue queue = clCreateCommandQueue(context, device, 0, 0);

31      /* Read in the OpenCL kernel source */
32      char* opencl_source = alloca(1024);
33      FILE* opencl_file = fopen("vec_pow_kernel.cl", "r");
34      nn = 0;
35      while (!feof(opencl_file))
36          nn += fread(opencl_source + nn, 1, 512, opencl_file);
37      opencl_source[nn] = '\n';
38      opencl_source[nn + 1] = 0;
39      fclose(opencl_file);

41      /* Compile the OpenCL source */
42      size_t length = strlen(opencl_source);
43      const char** sources = (const char**) &opencl_source;
44      const size_t* lengths = (const size_t*) &length;
45      cl_program program = clCreateProgramWithSource(
46          context, 1, sources, lengths, 0);
47      clBuildProgram(program, 1, &device, "", 0, 0);

49      /* Select the kernel we want */
50      cl_kernel kernel = clCreateKernel(program, "vec_pow", 0);

52      /* Create the input buffer */
53      cl_long n = 5;
54      cl_long xs0[] = {1, 2, 3, 4, 5};
55      cl_mem_flags flags = CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR;
56      size_t xs_size = n * sizeof(cl_long);
57      cl_mem xs = clCreateBuffer(context, flags, xs_size, xs0, 0);

59      /* Set the kernel arguments */
60      cl_long y = 3;
61      clSetKernelArg(kernel, 0, sizeof(cl_mem), &xs);
62      clSetKernelArg(kernel, 1, sizeof(y), &y);
63      clSetKernelArg(kernel, 2, sizeof(n), &n);

65      /* Execute the kernel over a parallel range. */
66      cl_event kernel_done = 0;
67      clEnqueueNDRangeKernel(queue, kernel, 1, 0, &n, 0, 0, 0, &kernel_done);

69      /* Wait for the kernel to finish executing */
70      clFinish(queue);

72      /* Copy back the data from the buffer. */
73      cl_long xs1[5];
74      clEnqueueReadBuffer(queue, xs, CL_TRUE, 0, xs_size, xs1, 0, 0, 0);

76      /* Print out the result */
77      printf("1 2 3 4 5 cubed is:\n");
78      for (int ii = 0; ii < 5; ++ii) {
79          printf("%ld ", xs1[ii]);
80      }
81      printf("\n");

83      return 0;
84  }
2.3.2 OpenCL Implementations
OpenCL is an open standard that is generally implemented by each vendor for
execution on their hardware devices. As such, there are a number of implementations
of OpenCL available, each supporting different hardware, different
software platforms, and different subsets of the OpenCL standard and its extensions.
Several of the major implementations of OpenCL are described below.
Apple
Apple was the initial developer of the OpenCL standard, as a hardware-independent
graphics card programming API for their Macintosh computers. They maintain
an implementation [40] of the standard for their OS X operating system.
The current version of Apple OpenCL, included in their OS X Mavericks
operating system, supports Version 1.2 of the standard on AMD, Intel, and
Nvidia graphics hardware and Intel-compatible CPUs.
AMD
AMD provides an implementation of OpenCL that supports their Radeon and
FirePro graphics hardware and all Intel-compatible CPUs called the Accelerated Parallel Programming (APP) SDK [41]. The AMD SDK supports Linux
and Windows operating systems, leaving support for Mac OS X to Apple. The
current version of the SDK (2.9) supports OpenCL 1.2.
Nvidia
Nvidia’s CUDA SDK includes an implementation of OpenCL [42] for their
GeForce Graphics Cards and Tesla compute accelerators. CUDA 5.5, the most
recent version released, only supports OpenCL 1.1. Nvidia has presumably not
updated their OpenCL support to 1.2 to promote their proprietary APIs. The
Nvidia CUDA SDK supports Linux, Macintosh, and Windows.
Intel
Intel provides an implementation of OpenCL [43] for their CPUs and integrated
graphics hardware. The 2013 R3 release supports OpenCL 1.2 for Windows.
Clover
The Mesa project [44], providers of the standard open source OpenGL implementation for Linux, is working on an incomplete open source implementation
of OpenCL called Clover (sometimes “Gallium Compute”) [45]. It currently
provides partial support for OpenCL 1.0 on CPUs and AMD Radeon GPUs
released between 2009 and 2013. It is included in the latest stable release of
Mesa, but is disabled by default.
As GPU computing instruction sets stabilize and the implementation catches
up, this is likely to become the standard GPU compute implementation for
Linux, but it can currently run only a few OpenCL applications.
Beignet
Intel has been working on an open source OpenCL implementation to go along
with their Linux graphics driver called Beignet [46]. Currently it claims support for the OpenCL 1.1 spec on 3rd Generation Intel Core (“Ivy Bridge”)
integrated graphics, but it has not yet been included in an official release of
Mesa.
POCL
Portable Computing Language (POCL) [47] is an open source implementation
of OpenCL initially developed to target custom hardware but now providing
a complete implementation of OpenCL 1.1 for CPU execution. POCL is
described in detail in Chapter 5.4.
Chapter 3
Just In Time Specialization
3.1 Value Specialization
Specialization, also known as partial evaluation, was formalized by Futamura
in 1971 [4]:

    Given a procedure π with parameters c_1, ..., c_n, r_1, ..., r_m and values
    c′_1, ..., c′_n for the first n parameters, a procedure π′ specialized on those
    values can be generated by partially evaluating the procedure.

Those parts of the procedure that depend on the known parameters (c_1, ..., c_n)
can be evaluated, while the unknown parameters (r_1, ..., r_m) remain unknown.
Futamura used this concept to describe a method of generating compilers
from interpreters by specializing the interpreter (the procedure) on an input
program (the known values).
This technique is useful as a program optimization, and there have been many
projects using it over the last four decades. As this dissertation describes
specialization on a language based on C, specialization of C-type languages is
most applicable.
An early example of partial evaluation of C programs is described by Anderson
in 1991 [48]. This partial evaluator for a subset of C is self-applicable, allowing
it to be used for the compiler-generation technique described by Futamura. A
speedup of nearly a factor of seven is shown for one test program.
A later partial evaluator for C called Tempo [49] performs general ahead of time
partial evaluation of C code. This system provided more consistent speedups
and was used to accelerate HPC kernels. The same primary author, Consel,
also did work on specialization at runtime [50].
3.1.1 An Example of Value Specialization
Consider the ipow function shown at the top of Listing 3.1. It performs integer
exponentiation, raising xx to the kk'th power, by iterated multiplication. This
function implements the general case of exponentiation, allowing xx to be raised
to any non-negative integer power.
But what if our application only uses this function to calculate cubes? In this
case, we know that every time this function is called the value of kk will be
3. We can write a more efficient function for this application by not handling
the general case and instead only calculating cubes. Such a specific function,
icube, is shown at the bottom of Listing 3.1.
Generating a function like icube from a function like ipow is an example of
specialization. Applying Futamura’s definition from Section 3.1:
• The initial procedure, π, is the function ipow.
• It has one parameter with a known value; c_1 is the parameter kk.
• It has one parameter with an unknown value; r_1 is the parameter xx.
• The known value for kk, c′_1, is 3.
• The procedure to be generated, π′, is icube.
Using Futamura's technique, we can generate π′ from π by partial evaluation.
In this case, that is a three-step process:
1. Remove the parameter kk and replace all instances of that variable in
the body with the known value 3. The result of this step is shown in
Listing 3.1 as ipow_s1.
2. Since we know how many iterations the loop has (three), we can symbolically evaluate it by replacing it with three copies of its body. As a
compiler optimization this is called “unrolling”. The result of this step
is shown in Listing 3.1 as ipow_s2.
3. Even though we don’t know the value of xx, we can apply the rules of
algebra to simplify the arithmetic. This will give us our final function
icube.
/* Calculate xx raised to the kk'th power. */
int ipow(int xx, int kk)
{
    int yy = 1;
    for (int ii = 0; ii < kk; ++ii) {
        yy *= xx;
    }
    return yy;
}

int ipow_s1(int xx)
{
    int yy = 1;
    /* The one instance of kk has been replaced by the value 3. */
    for (int ii = 0; ii < 3; ++ii) {
        yy *= xx;
    }
    return yy;
}

int ipow_s2(int xx)
{
    int yy = 1;
    /* The loop has been replaced by three copies of its body. */
    yy *= xx;
    yy *= xx;
    yy *= xx;
    return yy;
}

/* Just cube a number */
int icube(int xx)
{
    return xx * xx * xx;
}

Listing 3.1: ipow.c: An integer exponentiation kernel
The technical details of how to implement specialization on C code in a production quality compiler are discussed in Chapter 5.3.
3.2 Just In Time Compilation
Programming systems with just-in-time compilation delay compilation until
the program is run on the target system. This has the advantage that the
program can run on any machine with an appropriate run-time environment,
not just one machine type that the program was compiled for. Just-in-time
compilation also provides a performance tradeoff: compiling the program takes
time, but compiling on the exact target system may produce more optimized
machine code.
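One consequence worth noting (a hypothetical sketch, not the SOCL mechanism described later): because OpenCL compiles kernels at runtime, the host can hand values that are only known at runtime to the kernel compiler, for example as preprocessor definitions in the build options.

#include <stdio.h>
#include <CL/cl.h>

/* Rebuild `program` with the runtime values rows and cols baked in as
 * compile-time constants; a kernel that uses ROWS and COLS then sees
 * constants the compiler can fold and unroll against. */
void build_with_constants(cl_program program, cl_device_id device,
                          long rows, long cols)
{
    char options[64];
    snprintf(options, sizeof(options), "-D ROWS=%ld -D COLS=%ld",
             rows, cols);
    clBuildProgram(program, 1, &device, options, 0, 0);
}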
3.2.1 Uses of Just In Time Compilation
The most widely recognized use of just-in-time (JIT) compilation techniques
is in the Java Hotspot® Virtual Machine (JVM) [51]. The JVM allows Java
to make its famous “write once run anywhere” claim by delaying compilation
to native code until runtime when the target hardware architecture is known
and a JIT compiler for that architecture is available. Java goes so far as to
provide different JIT compilers for different use cases. For example, the Java
Client VM is tuned for responsiveness, which means it accepts somewhat less
optimal code in exchange for shorter JIT pauses. In contrast, the Java Server
VM makes the opposite trade-off, allowing more optimization to be done at
the cost of longer JIT pauses to ensure faster execution once steady state is
reached.
OpenCL [52] uses JIT compilation in much the same way Java does, except
to support a wider range of novel hardware architectures. This is essential for
its design goal of supporting GPUs, as the hardware architectures for these
processors are still in flux. For example, the recent AMD “Graphics Core
Next” GPU architecture is significantly different from the design of previous
AMD GPUs, but OpenCL code written for the older hardware will run on it
perfectly.
A final notable example of JIT techniques is the research language Self [53].
This language pioneered the technique of using JIT type-specialization to allow
for very efficient execution of dynamic programming languages. Although type
specialization is different from value specialization, this is one of the most
significant examples of JIT specialization providing significant performance
benefits in practice, allowing a dynamic purely object-oriented language to
approach within a small factor of optimized C in performance. These same
techniques are used in the fast JavaScript runtimes in modern web browsers.
3.3 Existing Uses of Just In Time Specialization
Specialization is a well known technique that is used widely in existing compilers. Some examples of this technique in just in time compilers for data parallel
applications are described in this section.
3.3.1 POCL
The POCL OpenCL runtime specializes kernels on work group size [54]. POCL
runs each work group in its own thread, and processes the items in a work
group by looping through them. Specializing on workgroup size means that
the number of work items per thread (and thus the size of the loop) is a known
constant. This simplifies the correct implementation of the OpenCL specification by allowing easier static analysis and transformation of this work-item
loop, which is necessary to implement OpenCL’s barrier semantics in POCL’s
one thread per workgroup model. Although this transformation is primarily
for implementation simplicity rather than performance, it is likely to provide
a performance benefit for simple kernels by allowing this loop to be unrolled.
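A hypothetical sketch (not POCL's actual generated code) of why this helps: POCL executes a work group by looping over its work items, and when the group size is a known constant the loop has a fixed trip count that the compiler can unroll.

extern void kernel_body(int item);   /* stand-in for the compiled kernel code */

/* General case: the trip count is only known at runtime. */
void run_group(int wg_size)
{
    for (int item = 0; item < wg_size; ++item)
        kernel_body(item);
}

/* Specialized for a work-group size of 64: the bound is a compile-time
 * constant, so the compiler is free to unroll the loop completely. */
void run_group_64(void)
{
    for (int item = 0; item < 64; ++item)
        kernel_body(item);
}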
3.3.2 OpenGL on Mac OS X
Apple uses LLVM to specialize shaders on their Mac OS X implementation of
OpenGL. According to a mailing list post by Chris Lattner [55]:
[Apple’s OpenGL implementation does] runtime code specialization within the fixed-function vertex-processing pipeline. Basically,
the OpenGL pipeline has many parameters (is fog enabled? do vertices have texture info? etc) which rarely change: executing the
fully branchy code swamps the branch predictors and performs
poorly. To solve this, the code is precompiled to LLVM .bc form,
from which specializations of the code are made, optimized, and
JIT compiled as they are needed at runtime.
3.3.3 Copperhead
Copperhead [56][57] is a project that allows data-parallel code to be written
directly in Python and execute efficiently on a GPU. Copperhead provides
nested parallelism in a functional style, providing parallel versions of traditional operations like map and reduce. These functions execute about half as
fast as hand-coded CUDA kernels. For kernels where this approach works and
provides sufficient performance, Copperhead or something like it may simply be
the best option for development of GPU-accelerated kernels. At the moment
the main drawback to Copperhead is its dependence on the Nvidia CUDA
platform, but the authors anticipate support for other back-ends (including
OpenCL) in the future.
3.3.4
SEJITS
SEJITS [58] describes earlier work by some of the same people who work on
Copperhead. The project explores JIT compilation of programs written using
Copperhead-style parallel embedded domain-specific languages. Although
the term “specialization” is used in the SEJITS paper to refer to hardware-targeted
JIT compilation, value-specific partial evaluation is mentioned as an
optimization possibly enabled by a JIT compiling system.
Chapter 4
The Bacon System
Bacon is a GPU programming system intended to make it easier to work with
OpenCL. It provides a simplified API for running kernels from C++ host
programs and an extension to OpenCL C, called Bacon C, intended to make
life easier for kernel developers.
In this chapter the first prototype of the Bacon system and its performance
characteristics are described. Most of this work was done in 2011 and 2012,
and was published in PDPTA 2012 [59].
4.1
The Bacon C Language
Bacon C is based on OpenCL C with extensions for improved usability and to
enable the automatic generation of C++ wrapper code. A grammar for the
Bacon C language is included in Appendix B (Listing B.2.1). Sample Bacon C
kernels that perform matrix multiplication and demonstrate nine of the new
features of the language are shown in Listing 4.1 and Listing B.2.2. These nine
features are:
1. In-kernel sequential setup
2. In-kernel range specification
3. Kernels can return values
4. Parameterized types
5. Error handling with assert and fail
6. Multi-dimensional arrays

7. Variable sized private arrays

8. Specialization on array sizes

9. Specialization on const parameters

 1  kernel
 2  Array2D<float>
 3  mat_mul(Array2D<float> aa, Array2D<float> bb)
 4  {
 5   SETUP:
 6      global Array2D<float> cc[aa.rows, bb.cols];
 7
 8   BODY:
 9      @range [cc.rows, cc.cols];
10
11      float sum = 0.0;
12      assert(aa.cols == bb.rows,
13             "Matrices must have compatible dimensions.");
14      for (int kk = 0; kk < aa.cols; ++kk) {
15          sum += aa[$row, kk] * bb[kk, $col];
16      }
17      cc[$row, $col] = sum;
18      return cc;
19  }

Listing 4.1: Naive Matrix Multiplication in Bacon C
Bacon preserves the OpenCL single program multiple data (SPMD) computation model. A kernel is executed in parallel over a 1D, 2D, or 3D range. Each
executing instance of the kernel can query its position in that range to determine which part of the work it is responsible for performing. For example, the
kernel in Listing 4.1 will be executed in parallel once for each element in the
output matrix.
Each kernel is separated into SETUP and BODY sections. The SETUP section
(line 5 of Listing 4.1) is for code that will run serially on the host processor
while the BODY section contains the code to be executed in parallel. In
practice, the SETUP section is primarily used to declare output arrays that
can be returned to the host application and to compute the sizes of these
arrays.
Unlike OpenCL C kernels, Bacon C kernels can return values (a return type,
Array2D<float>, is shown on line 2 in Listing 4.1). Any variable of a simple
type can be returned, as can any array passed as an argument or declared in
the SETUP section. The return statement occurs in the kernel BODY (line 18
in Listing 4.1) and can be selected conditionally (e.g., by an if statement), but
the behavior if different parallel instances of the kernel try to return different
values is undefined.
Each BODY includes an @range declaration (line 9 in Listing 4.1) that specifies
the range it will be executed over in parallel. Within the BODY, the current
position in that range is held in special Bacon-specific variables named $col,
$row, and $dep for the first, second, and third dimension respectively (line 17
in Listing 4.1). The range is formatted like an array declaration, so a BODY
with @range[4] will be executed 4 times in parallel with $col having the
values 0, 1, 2, and 3. Since this is only a 1D range, the values of $row and
$dep will both be zero in all four instances.
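As a small illustration of these range variables (a sketch written against the syntax shown in Listing 4.1, not a kernel from the Bacon distribution), a kernel that scales every element of a 2D array runs once per element, with $row and $col selecting the element handled by each instance:

    kernel
    Array2D<float>
    scale(Array2D<float> xs, float factor)
    {
     SETUP:
        global Array2D<float> ys[xs.rows, xs.cols];

     BODY:
        @range [ys.rows, ys.cols];

        ys[$row, $col] = factor * xs[$row, $col];
        return ys;
    }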
Bacon provides parameterized types (line 6 in Listing 4.1) for 1D, 2D, and 3D
arrays using C++-style angle bracket syntax. Both declarations and element
access use a comma separated list of numbers in square brackets (line 6 in
Listing 4.1). The dimensions of these arrays can be accessed using struct-style
dot notation. For example, a three by three by three array of integers called
“cube” could be allocated with int cube[3,3,3]; and the width of that array
could be accessed with cube.cols.
Additional error handling is provided through the assert (line 12 in Listing 4.1) and fail keywords which will stop kernel execution and raise exceptions in the host process if triggered. A fail is triggered if execution in
any thread reaches that statement, while an assert is only triggered if its
associated condition is false.
Each Bacon kernel has a set of specialization variables. These fall into two
categories. First, the dimensions of any arrays passed as arguments to a kernel
are always specialization variables. Second, additional specialization variables
can be specified explicitly by declaring arguments using the const qualifier
(like the blksz argument on line 3 of Listing B.2.2). Whenever a kernel is
called with a new set of values for its specialization variables a specialized
version of that kernel is generated and executed. Specialized kernels are cached
for future calls with the same set of specialization values.
This specialization, in addition to providing performance benefits, allows variable sized arrays in thread-private memory as long as the array size depends
only on const variables and array dimensions. Since OpenCL does not allow
any form of in-kernel dynamic memory allocation, this makes it possible for
users to write kernels that would have been difficult to write using OpenCL
directly (this restriction is loosened in OpenCL 2.0 [38], released in November
2013). The blocked matrix multiply kernel in Listing B.2.2 gives an example
of this feature at line 11.

Step | Traditional (e.g. FORTRAN Kernel) | OpenCL Kernel | Bacon Kernel
1.   | Developer writes host application (C++) and kernel (e.g. FORTRAN). | Developer writes host application (C++) and kernel (OpenCL C). | Developer writes host application (C++) and kernel (Bacon C).
2.   | — | Developer writes wrapper code (C++) to load and run kernel. | Bacon compiler generates wrapper code (C++) to load and run kernel.
3.   | Host application and kernel are compiled. | Host application and wrapper code are compiled. | Host application and wrapper code are compiled.
4.   | Host application is run. | Host application is run. | Host application is run.
5.   | — | — | Input data is read.
6.   | — | — | Bacon library generates specialized OpenCL kernel.
7.   | — | Target GPU is identified. | Target GPU is identified.
8.   | — | Kernel is compiled for target GPU. | Kernel is compiled for target GPU.
9.   | Input data is read. | Input data is read. | —
10.  | Kernel is executed on input data. | Kernel is executed on input data. | Kernel is executed on input data.

Table 4.1: Lifecycle of Bacon and OpenCL Kernels
4.2
Kernel Lifecycle
The basic technique of separating out high performance “kernels” from an
application and implementing them in a separate language has been used in
software development for decades. This even occurs for purely sequential programming. For example, an application written primarily in C++ may have
high performance routines written in hand-optimized assembly code or FORTRAN. Traditionally, the host application and kernel code are compiled into
separate modules and then linked together before execution. This basic sequence is shown in the first column of Table 4.1.
Compiling everything before execution, or ahead-of-time (AOT) compilation,
has one major downside: the target hardware is set when the application or
module is compiled. Just-in-time (JIT) compilation avoids this problem by
delaying compilation until the program is run on a specific machine, allowing
the target hardware to be detected dynamically at runtime. OpenCL uses JIT
techniques to allow the portability of kernels across the variety of compute-capable GPUs and other parallel acceleration hardware that provide support
for the standard. The JIT OpenCL kernel lifecycle is shown in the second
column of Table 4.1.
Bacon takes delayed compilation one step further, waiting to compile a kernel
until it is actually called and the characteristics of the arguments can be examined. This allows just-in-time specialization to be performed, as shown in
the third column of the table.
4.3
Implementation
The Bacon system consists of two pieces: the Bacon compiler and the Bacon
runtime library. The compiler runs at application compile time and parses the
Bacon C source, generating a C++ wrapper and a serialized abstract syntax
tree (AST). The Bacon runtime library is called from the generated wrapper as
the host application is running to load the AST, generate specialized OpenCL
C code when a kernel is called, and run that code on the GPU using the
OpenCL runtime.
The system is built using Perl and C++. The Bacon compiler parses the
source code using Parse::Yapp [60], a yacc-compatible parser generator for
Perl. This constructs the abstract syntax tree as a Perl data structure. The
C++ wrapper is then generated by traversing this tree.
The generated C++ wrapper provides a C++ function with the kernel’s type
signature that can be called from the user’s application. When this function is
called, the Bacon runtime library loads the AST and traverses it to generate
the specialized OpenCL code. Two optimizations are performed at code generation time directly from the AST without the use of a traditional low-level
intermediate representation.
The first optimization, constant propagation, calculates the values of all the
variables that have been marked as const by the programmer. If the value of
any of these variables cannot be computed from the specialized arguments to
the kernel, the Bacon runtime library will throw an exception. This information is used to construct a symbol table, and references to these variables are
replaced with their constant integer values in the generated OpenCL code.
Loop unrolling is performed on any loops for which the iteration count and
range can be determined after constant propagation. Short loops are fully
unrolled. In this case, no loop is passed to the compiler at all. An example
of this is shown at the bottom of Listing B.2.2. Longer loops are unrolled by
some factor that evenly divides the iteration count.
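As an illustration of the effect (the generated code shown here is a sketch, not literal output from the Bacon code generator), a short loop whose trip count becomes a known constant after constant propagation can be emitted with no loop at all:

    /* Before: after constant propagation the trip count (here 4) is known. */
    for (int kk = 0; kk < 4; ++kk)
        sum += aa[row * 4 + kk] * bb[kk * 4 + col];

    /* After full unrolling: no loop is passed to the OpenCL compiler. */
    sum += aa[row * 4 + 0] * bb[0 * 4 + col];
    sum += aa[row * 4 + 1] * bb[1 * 4 + col];
    sum += aa[row * 4 + 2] * bb[2 * 4 + col];
    sum += aa[row * 4 + 3] * bb[3 * 4 + col];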
This specialized and optimized OpenCL C code is then passed to the OpenCL
compiler provided with the vendor SDK, which performs further optimizations on the generated code, including more aggressive constant propagation and possibly static register-load scheduling enabled by the Bacon pre-optimizations.
The implementation of Bacon is available publicly under an open source license. The current version can be downloaded from the public git repository at https://github.com/NatTuck/bacon.
4.4
Ease of Use
We believe that Bacon provides an improvement in ease of use compared to
OpenCL by itself. A number of syntactic improvements are described in Section 4.1. More significantly, the automatic generation of wrapper code performed by the Bacon compiler significantly reduces the number of lines of code
that the developer is required to write.
As a concrete example of the reduction in lines of code required by Bacon, we
consider naive matrix multiplication kernels like the one shown in Listing 4.1.
Our OpenCL C version of this kernel is the same length: 17 lines of non-whitespace code. Unfortunately, even ignoring the application-specific code
required to construct the input matrices, it takes 174 lines of non-whitespace
C++ code to compile and call this 17 line OpenCL kernel. In contrast, the
Bacon version only requires three lines to load, specialize, and call; all the
bookkeeping is done either by the automatically generated wrapper code or
by the Bacon runtime library.
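The generated wrapper interface is not reproduced in this chapter; as a rough sketch of the kind of host code involved (the header name, types, and helper function here are hypothetical, not Bacon's actual generated API), the three lines amount to constructing the argument arrays and calling the generated function:

    // Hypothetical sketch; names do not match the real generated wrapper.
    #include "mat_mul.hh"   // C++ wrapper generated by the Bacon compiler

    Array2D<float> aa = read_matrix("a.dat");   // application-specific input code
    Array2D<float> bb = read_matrix("b.dat");
    Array2D<float> cc = mat_mul(aa, bb);        // loads, specializes, and runs the kernel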
4.5
Performance
To evaluate the performance of the Bacon system, we compare the run time of
matrix multiplication kernels written in both Bacon C and OpenCL C.
Test                         | Time (s) | Speedup
OpenCL - Naive               | 11.9     | 1.0
OpenCL - Hand Vectorized     | 2.54     | 4.7
Bacon - Naive                | 3.45     | 3.5
Bacon - Blocked (unroll = 2) | 1.97     | 6.1

Table 4.2: Summary of 4k Matrix Multiplication Performance
These results are summarized in Table 4.2, which shows that Bacon is able to provide
measurable performance improvements over similar programs written directly
in OpenCL C.
Testing was performed on an AMD Radeon HD 5830 GPU. This is a mid-range
GPU intended for high definition computer gaming. As of early 2012 this
card was already a full hardware generation out of date, but it still has a
theoretical parallel compute capacity of 1.7 teraflops, which is more than an
order of magnitude greater than a high end CPU like the Intel Core i7 3930
at 154 gigaflops (FLOPS ratings from manufacturer specifications).
Four implementations of matrix multiplication were tested:
Bacon - Naive
A textbook implementation of parallel matrix multiplication. Shown in
Listing 4.1.
OpenCL - Naive
An OpenCL implementation equivalent to the naive Bacon code.
Bacon - Blocked
Shown in Listing B.2.2. This generalizes a 2D unrolling of the computation into square blocks.
OpenCL - Hand Vectorized
Hand unrolled to compute 4x4 blocks at once. Explicitly uses OpenCL’s
native vector types. Based on a sample from the AMD OpenCL SDK.
The execution time of these four kernels was tested on randomly generated
4096x4096 matrices. Each test was performed five times and the average result was taken. The times were very consistent; most tests had a coefficient
of variation under one percent. The speedups over the naive OpenCL implementation are shown in Figure 4.1.
Figure 4.1: Speedup on 4k matrix multiplication with unrolled loops over naive OpenCL implementation. (Speedup vs. unroll factor from 1 to 16 for the Simple Bacon and 2x2 Blocks kernels.)
4.5.1
Discussion
These measurements show that value specialization provides significant speedups
over a non-specialized OpenCL kernel. This result can be explained by the
fact that providing constant values and unrolled loops at compile time allows
the OpenCL C compiler to do extensive constant propagation-based optimizations.
Bacon provides good performance results regardless of whether the code is
hand optimized. The naive Bacon kernel beats the naive OpenCL kernel,
the hand-optimized Bacon kernel beats the hand-optimized OpenCL kernel,
and the naive Bacon kernel even comes close in performance to the optimized
OpenCL kernel.
Unlike constant propagation, which provides a clear and consistent performance benefit, loop unrolling is more complicated. Too much unrolling will
decrease loop performance, most likely due to the exhaustion of registers on
the GPU device. Still, when properly tuned, loop unrolling provides significant speedups allowing the blocked Bacon kernel to beat the hand vectorized
OpenCL kernel by nearly 30 percent.
Somewhat surprisingly, Bacon’s loop unrolling is able to beat the vectorized
OpenCL code without the explicit use of native vector types. From this we
conclude that either vectorization is being done automatically by the AMD
OpenCL compiler or the use of vector types doesn’t have a significant performance benefit for this matrix multiplication algorithm.
Chapter 5
JIT Specialization of OpenCL
5.1
Specialization Annotated OpenCL
Experimentation with the Bacon prototype described in Chapter 4 showed
that just-in-time specialization was a compiler optimization technique worthy
of further study. In this chapter we describe Specialization Annotated OpenCL
(SOCL), an extension to the OpenCL C language intended to allow JIT specialization of OpenCL compute kernels to be studied directly in isolation from
the other features of Bacon.
This chapter has three parts. First, the extension to the OpenCL C language
is described. Second, two approaches to implementing SOCL are described.
Finally, test cases and benchmarks are presented that are used in later chapters
to verify the correctness and test the performance of SOCL implementations.
5.1.1
Specialization Annotations for OpenCL C
SOCL C is standard OpenCL C with the addition of specialization annotations.
A specialization annotation is a specially formatted comment placed between
the argument list and body of an OpenCL kernel. For example, a matrix
multiplication kernel could be specialized on the size of the matrices by being
annotated as shown in Listing 5.1.
Inside the comment, the annotation has three parts. The string “@spec” marks
the comment as a specialization annotation. After the marker is the name of
the function. Enclosed in parentheses after the name of the function is the list
of arguments that the function will be specialized on.

Figure 5.1: Execution sequence: Standard OpenCL vs SOCL. (With standard OpenCL, the host program compiles the OpenCL C kernel once using the OpenCL SDK and then calls the resulting executable kernel code. With SOCL, each kernel call passes the kernel arguments to a specializer, which specializes and compiles the kernel before execution; specialized kernels are cached for future calls.)
kernel void mmul(global double* C, global double* A,
                 global double* B, size_t nn, size_t mm)
/* @spec mmul(nn, mm) */
{
    // kernel body omitted
}

Listing 5.1: mmul.cl: Example specialization annotation
This annotation format was chosen primarily for ease of implementation and
testing. Adding a comment does not change the standard OpenCL code at all,
since the comment will be ignored by a standard OpenCL compiler. Further,
the annotation is easy to spot visually in the source code and is easy to parse.
If SOCL were to be adopted as an extension to production OpenCL implementations it would be more appropriate to use either a “#pragma” directive
or a new keyword to mark arguments in the kernel argument list directly.
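For example, a production variant might express the same information as Listing 5.1 with a directive such as the following (purely a sketch of a possible syntax; this form is not part of SOCL as implemented):

    #pragma socl specialize mmul(nn, mm)
    kernel void mmul(global double* C, global double* A,
                     global double* B, size_t nn, size_t mm)
    {
        // kernel body omitted
    }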
SOCL Execution Model
SOCL kernels are executed like standard OpenCL kernels, except specialized
code will be generated and executed for each set of values for the specialized
arguments.
For example, if the kernel from Listing 5.1 is called several times with parameters mm = nn = 8, execution will be equivalent to the code in Listing 5.2.
One of the two implementations of SOCL described in this dissertation, Pancake,
performs specialization by generating code very much like this.
kernel void mmul(global double* C, global double* A,
                 global double* B)
{
    const size_t nn = 8;
    const size_t mm = 8;

    // kernel body omitted
}

Listing 5.2: mmul-codegen.cl: Execution model example.
5.1.2
Implementation Techniques for SOCL
To execute SOCL kernels, it is necessary to generate specialized kernel code
at runtime. Code generation must be performed lazily after the kernel arguments are known. This altered execution sequence is compared to the standard
OpenCL process in Figure 5.1. There are two ways to implement this specializer: either specialized OpenCL C source code can be generated and passed
to the OpenCL C compiler or specialized object code can be generated during
the OpenCL C compilation process.
Generation of specialized OpenCL C source code is explored in Section 5.2, in
a library called Pancake. This technique requires either modification of the
OpenCL runtime, modification of the host program, or trickery with shadowing symbols in dynamic libraries. Pancake requires that the host program be
modified by adding a C header. This allows SOCL to be used with OpenCL implementations that can’t be modified, as is common for GPU vendor OpenCL
libraries.
Generation of specialized object code by modifying an OpenCL compiler is
explored in Section 5.4, where the open source POCL OpenCL compiler is
modified. Compared to specializing the source code, this technique potentially
allows for faster compilation, since the OpenCL source code only needs to be
parsed once for any number of specializations of a kernel. It also potentially
allows for better code to be generated, since the OpenCL C compiler could be
made explicitly aware that specialization is occurring.
5.1.3
Benchmarks and Tests for SOCL
Testing the correctness and performance of the SOCL implementations described in the following sections requires OpenCL programs with specialization
annotations. Two new programs have been written for this purpose and six
existing OpenCL benchmarks have had annotations added.
These test programs are described below. Full listings for the kernels of the
two new programs are included in Appendix B.1.
My expectation is that the highest performance benefits for specialization with
SOCL will come from kernels where specialization transforms a tight inner loop
from having a variable number of iterations to having a constant number of
iterations. This makes the loop easily unrollable. Other kernels may or may
not see any performance benefit, depending on the effects of other compiler
optimizations such as algebraic simplification.
New Test Cases
These programs were written specifically as test cases for this project. They
are intentionally written as naive implementations of the algorithm, without
any hand optimization. This may make performance effects from optimization easier to see compared to the highly optimized test programs taken from
hardware benchmark suites.
Gaussian Blur (blur) This program applies Gaussian blur to a large (5184x3456)
greyscale image. Because a 2D Gaussian blur can be calculated by applying
two orthogonal 1D Gaussian blurs to an image in sequence, this program runs
two OpenCL kernels: one to blur vertically and one to blur horizontally.
The kernels are each specialized on three arguments, the width and height of
the image and the radius of the blur effect. The image width and height are
only used for index calculations, so their specialization should have minimal
effect. The radius is used for the inner count of a tight inner loop, so its
specialization should result in a measurable performance benefit.
Matrix Multiply (mmul) This program multiplies two square matrices.
Although matrix multiply programs are common in OpenCL vendor samples,
they are generally optimized to take advantages of features like vector types.
This program was written specifically to be a simple and easily readable version
of parallel matrix multiply.
The kernel is specialized on two arguments: the width and height of the matrices. Since the iteration count of the inner loop is dependent on the width
and height of the matrices, this test should show measurable performance
improvement from specialization.
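The full kernel is listed in Appendix B.1; a naive body consistent with this description (reproduced here only as a sketch, which may differ in detail from the actual listing) computes one output element per work item, so specializing nn and mm gives the inner loop a constant trip count:

    kernel void mmul(global double* C, global double* A,
                     global double* B, size_t nn, size_t mm)
    /* @spec mmul(nn, mm) */
    {
        size_t row = get_global_id(0);
        size_t col = get_global_id(1);
        double sum = 0.0;

        /* With nn and mm specialized, this trip count is a constant. */
        for (size_t kk = 0; kk < mm; ++kk)
            sum += A[row * mm + kk] * B[kk * nn + col];

        C[row * nn + col] = sum;
    }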
SNU Port of NAS Parallel Benchmarks
This is a port of the NASA Advanced Supercomputing Division’s NAS Parallel Benchmark [61] suite for large parallel HPC installations to OpenCL,
performed by the Center for Manycore Programming at Seoul National University in Korea [62].
These benchmarks are reasonably complex and require a significant amount of
communication in patterns that are not easily implemented in OpenCL, so the
SNU developers put significant effort into splitting kernels for synchronization
and hand optimizing to get high speedups.
Conjugate Gradient (nas-cg) This program is described by the NAS
benchmark description as “Solving an unstructured sparse linear system by
the conjugate gradient method” with the benchmark goal of testing “irregular
memory access and communication”.
For the OpenCL implementation this program was split into 13 kernels. The
kernels were all specialized on a single argument, n, which remains constant
across any single execution of the program. As this argument is not used directly as the bound of any loop, specialization and loop unrolling seem unlikely
to provide any performance benefit.
Scalar Penta-Diagonal (nas-sp) This program is intended to represent a
larger HPC application that involves multi-step processing of data. It performs
a fluid dynamics simulation.
This program uses 26 OpenCL kernels. Fourteen of these kernels are specialized on parameters that stay nearly constant in a single execution of the
program, some of which are used as loop bounds. This program may see
benefits from specialization.
Nvidia OpenCL Samples
These programs were released by Nvidia as sample programs to demonstrate the
usage of their OpenCL implementation. They are written primarily for clarity
to serve as good examples, but include some features intended to improve
performance.
Black-Scholes Option Pricing (nv-bs) According to Nvidia, “This sample evaluates fair call and put prices for a given set of European options by
the Black-Scholes formula.”
This kernel is specialized on a single variable, OptN, which is the maximum
number of iterations of the algorithm. This is a constant across multiple calls
to the kernel in the test program. It specifies the iteration limit of a reasonably
tight inner loop, but the loop start varies across work items so the loop may
not be possible to unroll. Specialization may or may not provide a performance
benefit on this program.
DCT (nv-dct) This program performs a Discrete Cosine Transform on an
image.
The kernel is specialized on the stride, width, and height of the input image,
and the block size of the transform, which all remain constant in the test
program. The program was initially developed with a constant block size of
eight, but this was made a variable to exercise specialization. Since all the
loops in the program run for a number of iterations equal to the block size, this
program should see speedups due to specialization.
Rodinia Benchmarks
These benchmarks were found in the Rodinia Benchmark Suite [63], from the
University of Virginia. This suite is intended to compare the performance of
hardware and OpenCL implementations. The benchmarks are extracted from
real world scientific computing applications, and are heavily hand-optimized
for performance.
Mandelbrot (mandelbrot) This generates a 2048x2048 image of the Mandelbrot set. It has a single OpenCL kernel to generate this image.
The kernel is specialized on four arguments: the width and height of the image,
the zoom level, and the maximum number of iterations of the tight inner
loop. Although the number of iterations of the tight inner loop is bounded
by a specialized argument, each instance of the kernel may iterate fewer times.
Specialization may not provide a significant performance advantage in this test
if the loop cannot be unrolled.
Particle Filter (particlefilter) This program tries to estimate current location based on a model of movement and fuzzy location measurements.
The main kernel is specialized on the number of particles used, which is constant through an execution of the program. This kernel has a loop based on
the number of particles, but the start point is based on the current particle
so the loop may not be possible to unroll. This program may or may not see
speedups from specialization.
5.2
Pancake: Implementing SOCL by Specializing OpenCL C
In this section we introduce Pancake, the first of the two implementations of
Specialization Annotated OpenCL (SOCL, described in Section 5.1) discussed in this
dissertation. Pancake is a C library that, with minimal source code modification, transparently enables specialization of SOCL kernels when using any
OpenCL runtime. The use and implementation of Pancake are described, as
are the results of testing Pancake with the OpenCL benchmarks described in
Section 5.1.3.
5.2.1
Using Pancake
Pancake can be added to an existing OpenCL program that uses the OpenCL
API for either C or C++ in four steps, shown below. Once Pancake has been
enabled, all OpenCL kernels will be treated as SOCL kernels and automatically
specialized according to any specialization annotations that are present.
1. Modify your OpenCL kernels to add specialization annotations as described in Section 5.1.1.
2. Include the header "pancake/shim.h" in each file that uses the OpenCL
API after the standard OpenCL header.
3. Link the “pancake” library with the host program executable.
4. Set the environment variable PANCAKE_SPEC when you run your host
program. This variable can be left unset to disable specialization.
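As a concrete sketch of these steps (the file names, compiler invocation, and the value assigned to PANCAKE_SPEC are illustrative assumptions, not taken from the Pancake documentation):

    /* host.c -- step 2: include the shim after the standard OpenCL header */
    #include <CL/cl.h>
    #include "pancake/shim.h"
    /* ... existing OpenCL host code, unchanged ... */

    # step 3: link the pancake library into the host executable
    $ cc -o host host.c -lpancake -lOpenCL

    # step 4: set PANCAKE_SPEC to enable specialization for this run
    $ PANCAKE_SPEC=1 ./host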
API Function                  | Standard OpenCL                | Pancake
clCreateProgram("program.cl") | Loads OpenCL C source          | Sets OpenCL C source file name
clBuildProgram(flags)         | Compiles the source file       | Sets the compiler flags
clCreateKernel("name")        | Gets a reference to the kernel | Sets the kernel name
clSetKernelArg(num, value)    | Sets the kernel arguments      | Sets the kernel arguments
clEnqueueNDRangeKernel        | Runs the kernel                | Specializes the kernel, then performs the full OpenCL sequence from the second column on the specialized source

Table 5.1: Comparison of Standard OpenCL and Pancake Kernel Execution
5.2.2
How Pancake Works
Pancake operates by defining a sequence of C macros that replace calls to the
OpenCL API for C in the host program with calls to functions in the Pancake
library. The Pancake library generates specialized OpenCL C code and uses
the OpenCL API for C itself to compile and execute it. Because the OpenCL
API for C++ is written in terms of the API for C, Pancake will work with
C++ programs as well.
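A minimal sketch of this interception approach is shown below; the pancake_build_program name is illustrative and the real shim covers all fifteen intercepted calls, but the mechanism is the same: a macro redirects the standard entry point to a Pancake function that records the information instead of acting on it immediately.

    /* Illustrative sketch of the idea behind pancake/shim.h; not the actual header. */
    #include <CL/cl.h>

    /* Replacement that records the build options and defers compilation until
     * the kernel is enqueued and its (specialized) argument values are known. */
    cl_int pancake_build_program(cl_program program, cl_uint num_devices,
                                 const cl_device_id *device_list,
                                 const char *options,
                                 void (*pfn_notify)(cl_program, void *),
                                 void *user_data);

    #define clBuildProgram(prog, ndev, devs, opts, notify, data) \
            pancake_build_program((prog), (ndev), (devs), (opts), (notify), (data))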
To specialize SOCL kernels, Pancake delays compilation until the kernel has
actually been called. The modified kernel compilation process in Pancake
compared to standard OpenCL is shown in Table 5.1.
To accomplish this, Pancake replaces the OpenCL cl_program and cl_kernel
structures with wrappers and intercepts all OpenCL run-time library calls (15
of them) that operate on either structure. Intercepting these calls gives access
to sufficient data to generate specialized OpenCL C source and then compile
and execute it.
To avoid duplicate compilation work, Pancake caches compiled kernels and
re-uses them when all the specialized arguments match. Currently a linked
list is used for the cache and old entries are never cleaned up, so Pancake
will need additional optimization before it is suitable for use in long-running
programs that generate thousands of different specialized kernels.
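The cache described here can be pictured roughly as follows (a sketch only; the structure and function names do not come from the Pancake source):

    /* Illustrative sketch of a specialization cache keyed on the argument
     * values; not taken from the Pancake source.  Entries are kept in a
     * linked list and never evicted, matching the behavior described above. */
    #include <stdlib.h>
    #include <string.h>
    #include <CL/cl.h>

    struct spec_entry {
        char              *spec_key;   /* e.g. "nn=1024;mm=1024" */
        cl_kernel          kernel;     /* compiled, specialized kernel */
        struct spec_entry *next;
    };

    static struct spec_entry *cache = NULL;

    static cl_kernel cache_lookup(const char *spec_key)
    {
        for (struct spec_entry *ee = cache; ee != NULL; ee = ee->next)
            if (strcmp(ee->spec_key, spec_key) == 0)
                return ee->kernel;
        return NULL;  /* miss: caller specializes, compiles, then cache_insert()s */
    }

    static void cache_insert(const char *spec_key, cl_kernel kernel)
    {
        struct spec_entry *ee = malloc(sizeof(*ee));
        ee->spec_key = strdup(spec_key);
        ee->kernel   = kernel;
        ee->next     = cache;
        cache        = ee;
    }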
5.2.3
Pancake Testing
Each benchmark was tested in four different configurations:
1. default - Specialization and unrolling disabled. Equivalent to standard
OpenCL behavior.
2. unroll - Loop unrolling is enabled but specialization is not.

3. spec - Specialization is enabled but loop unrolling is not.

4. spec-unroll - Both specialization and unrolling are enabled.

Benchmark      | AMD Radeon 5830 | GeForce GTX 650 Ti
blur           | 175.5%          | 16.1%
mandelbrot     | 3.6%            | 23.5%
mmul           | 91.4%           | 86.7%
nas-cg         | -1.6%           | -0.4%
nas-sp         | 0.1%            | 0.2%
nv-bs          | 8.0%            | 1.0%
nv-dct         | 3.5%            | 2.0%
particlefilter | 0.2%            | -2.7%

Table 5.2: Speedups with Pancake
Timings for tests on an AMD Radeon 5830 graphics card are shown in Figure 5.2, while timings with an Nvidia GeForce GTX 650 Ti card are shown
in Figure 5.3. A summary of speedups from specialization, comparing default
OpenCL behavior to specialization and unrolling, is shown in Table 5.2. These
timings were taken according to the procedure described in Appendix A.1. The
test machines used are described in Appendix A.2.
These timings show that specialization with Pancake and vendor GPU runtimes can provide a very large performance improvement on a few especially
well suited kernels (e.g., “mmul” runs nearly twice as fast). Otherwise, it
provides a small speedup on most of the test cases and a small slowdown on
some others. Note that the slowdowns can be eliminated by removing the
specialization annotations on those kernels.
Results by Benchmark
Gaussian Blur (blur) This benchmark saw a speedup as expected. The
improvement was much more significant on the AMD GPU than the Nvidia
GPU, showing the single largest speedup (175%) observed in the tests.
Specialization was essential to the speedup on this benchmark. Attempting to
unroll loops without specialization did not work, and in fact caused a significant slowdown on the Nvidia GPU.
Mandelbrot (mandelbrot) It was difficult to predict an outcome for this
benchmark due to the complex loop condition, but it showed a speedup on
both GPUs. The effect was significantly larger on the Nvidia GPU.
Specialization was essential to the speedup on this benchmark. Attempting to
unroll loops without specialization had no effect.
Matrix Multiply (mmul) This benchmark showed a speedup as expected.
The effect was similar on both GPUs: a large speedup of around 90%. This
kernel is ideal for specialization, consisting almost entirely of a single tight
simple loop that gets a constant trip count from specialization.
Specialization was essential to the speedup on this benchmark. Attempting
to unroll loops without specialization had no effect on AMD and a very large
negative effect on Nvidia. Interestingly, on AMD specifying unrolling with
specialization had a negative effect compared to specialization alone. I don’t
have a good explanation for this.
Conjugate Gradient (nas-cg) This benchmark showed no speedup from
specialization, as expected. It was possible that it would enable algebraic
simplification for a small speedup, but a small slowdown was observed instead.
Loop unrolling had very little effect on this benchmark.
Scalar Penta-Diagonal (nas-sp) This benchmark showed no speedup from
specialization. This benchmark looked slightly more promising than “nas-cg”,
but the result was no better.
Unrolling had no effect on this benchmark on the Nvidia GPU. On AMD,
specialization alone or unrolling alone provided a small speedup, but this disappeared when both were applied together.
Black-Scholes Option Pricing (nv-bs) This benchmark showed a small
speedup from specialization, which was consistent with the expected possible
speedup. The effect was significantly larger on AMD than Nvidia.
On Nvidia, unrolling alone provided a larger speedup than unrolling with
specialization. On AMD, unrolling alone provided a speedup significantly
smaller than specialization and unrolling.
DCT (nv-dct) This benchmark showed a small speedup from specialization,
less of a gain than was expected. The speedup on AMD was somewhat larger
than on Nvidia.
Specialization was essential to the speedup on this benchmark. Attempting
to unroll loops without specialization had no effect on Nvidia and only a very
small effect on AMD.
Particle Filter (particlefilter) No speedup was observed from specialization on this benchmark, with Nvidia showing a small slowdown. This is
consistent with the expected result.
The results were similar for unrolling or specialization alone.
Figure 5.2: Benchmark Timings for Pancake on AMD Radeon 5830. (Bar charts of kernel execution time in seconds for each benchmark under the default, default-unroll, default-spec, and default-spec-unroll configurations.)
Figure 5.3: Benchmark Timings for Pancake on Nvidia GTX 650 Ti. (Bar charts of kernel execution time in seconds for each benchmark under the default, default-unroll, default-spec, and default-spec-unroll configurations.)
5.3
Serial Specialization with LLVM
Now that JIT specialization on OpenCL C source has been examined in both
the Bacon prototype and with Pancake, it is useful to explore integrating JIT
specialization directly into an OpenCL implementation.
The major open source implementations of OpenCL are POCL [64] and Clover
[45]. Both of these implementations are based on the LLVM compiler system,
so this section will examine LLVM and explore how LLVM can be extended
with support for specialization.
5.3.1
Overview of LLVM
LLVM [65], formerly the Low Level Virtual Machine, is a collection of tools
and libraries for constructing compilers. It includes clang [66], a modern
standards-compliant compiler for C, C++, and Objective C that can be used
as a drop-in replacement for the GNU GCC compilers for those languages.
LLVM has been used as the basis for most implementations of the OpenCL
standard, including the implementations from AMD, Apple, and Nvidia, as
well as the open source implementations POCL and Clover.
In a traditional optimizing compiler, you have a “front end” which parses
source code and outputs some intermediate format code, a “middle end” which
takes several passes over that intermediate format code to optimize it, and then
a “back end” that outputs executable code for a specific machine. LLVM is
structured to provide these as separate modular pieces with a single intermediate code representation to tie them together called LLVM intermediate
representation (LLVM IR). This makes it easy to build front ends for new
languages, back ends for new hardware, and new optimization passes that are
compatible with all of the existing pieces.
LLVM IR is a single static assignment (SSA) code representation, which simplifies many compiler optimizations. This IR code can be serialized in two
formats: LLVM assembly language (file extension .ll) and LLVM bitcode
(file extension .bc). LLVM provides numerous tools to work with these files,
including llvm-as and llvm-dis to translate between the serialization formats
and opt to run built-in and user-supplied optimization passes.
For example, here’s how to compile and optimize a Hello World C program
one step at a time using LLVM:
$ clang -c -emit-llvm -o hello.bc hello.c
$ opt -licm -sccp -o hello-opt.bc hello.bc
$ llc -o hello.s hello-opt.bc
$ as -o hello hello.s
In this example, the clang command is used to generate LLVM bitcode from
the C source code. Then, the opt command transforms this code by applying
two optimizations (loop invariant code motion and sparse conditional constant
propagation, in that order). These optimizations do not change the IR code
because there are no loops or variables in “Hello World”. Next, the llc command translates the bitcode into machine specific assembly. Finally, that code
is assembled into a binary.
The clang tool will perform all the other steps itself when given no option
flags (by calling the other tools either as libraries or directly), but this example
clearly shows the three phases of compilation and how the tools communicate
with each other using LLVM IR in bitcode format. Note that the system
assembler (as) is used to assemble the target-specific assembly code, as LLVM
does not yet include its own assembler.
5.3.2
LLVM Optimization Passes
LLVM provides a variety of optimization passes implementing many traditional compiler optimizations. A number of these passes that seem especially
applicable for compute kernels are shown in Table 5.3 and Table 5.4. Other
optimization passes are not considered, for example cross-procedure optimizations are ignored because compute kernels are assumed to be fully inlined.
Information on some of these passes is available in the LLVM documentation
[67], while information about other passes is only available as comments in the
source code.
Flag                     | Name – Description
-adce                    | Aggressive Dead Code Elimination – Eliminates dead code using a "dead until proven alive" algorithm
-bb-vectorize            | Basic Block Vectorize – Scans each basic block and combines scalar operations into vector operations
-correlated-propagation  | Correlated Value Propagation – Propagates constants across basic block boundaries
-dse                     | Dead Store Elimination – Eliminates stores to local variables that will have no effect on the behavior of the program
-early-cse               | Early Common Subexpression Elimination – Eliminates simple redundant computations and canonicalizes some operations
-globaldce               | Global Dead Code Elimination – Eliminates unused global variables and internal functions
-globalopt               | Global Variable Optimizer – Finds read only global variables and marks them as constant and eliminates global variables that are never read (even if they're written to)
-gvn                     | Global Value Numbering – Numbers every value produced in each function and eliminates redundant computation of those values
-indvars                 | Canonicalize Induction Variables – Transforms loops into a standard form for future loop transformations if possible
-instcombine             | Instruction Combining – Combines redundant instructions and performs arithmetic simplification that doesn't require reordering operations
-jump-threading          | Jump Threading – Analyzes branches to conditional branches. If it can prove the conditional will always branch a given way in that thread of code flow, it will replace the first branch to skip the test.
-lcssa                   | Loop Closed Single Static Assignment – Adds extra phi nodes at the end of each loop for variables that had phi nodes in inner blocks. This simplifies some other loop transformations.
-licm                    | Loop Invariant Code Motion – Moves code in a loop that doesn't depend on the loop iteration out of the loop

Table 5.3: LLVM Optimization Flags (A-L)

Flag                     | Name – Description
-loop-deletion           | Loop Deletion – Deletes dead loops
-loop-reduce             | Loop Strength Reduction – Decouples the calculation of array indexes in loops from the induction variable
-loop-rotate             | Loop Rotate – Moves the loop condition check to the end of the loop. This is required for the Loop Unroll pass.
-loop-simplify           | Loop Simplify – Guarantees that each loop has one entry, one exit, and one back-edge. This is required for other transformations including Loop Invariant Code Motion and Loop Vectorization.
-loop-unroll             | Loop Unroll – Duplicates the loop body a number of times divisible by the trip count to reduce loop overhead. The loop must be in the form produced by Loop Rotate. This pass has two extra option flags: without -unroll-allow-partial it will only fully unroll loops, and a maximum unroll amount can be specified with -unroll-threshold.
-loop-unswitch           | Loop Unswitch – Moves loop invariant conditionals out of loops, duplicating loops if necessary. This pass depends on Loop Invariant Code Motion.
-loop-vectorize          | Loop Vectorize – Partially unrolls loops and replaces scalar arithmetic across several iterations with vector instructions.
-mem2reg                 | Memory To Register – Replaces pointers to the stack with registers and phi nodes where possible
-reassociate             | Arithmetic Reassociate – Rearranges arithmetic to make other optimizations work better and performs arithmetic simplification
-sccp                    | Sparse Conditional Constant Propagation – Performs constant propagation and eliminates constant conditional branches
-simplifycfg             | Simplify Control Flow Graph – Combines basic blocks connected by an unconditional branch and removes unused basic blocks
-sink                    | Code Sinking – Moves instructions that produce values only used in one branch of a conditional into that branch so that the value isn't calculated unless it is needed
-sroa                    | Scalar Replacement of Aggregates – Splits small structs and arrays into multiple registers

Table 5.4: LLVM Optimization Flags (L-Z)

5.3.3
Specializing LLVM Bitcode
To test the effects of specialization as an optimization, an LLVM module pass
[68] was implemented to specialize kernel functions in LLVM bitcode. In this
section we describe how that transformation pass works and provide some
benchmarks for specialization on simple serial kernels.

Producing a kernel specialized on an argument value is a relatively simple
process. It would be nearly sufficient to textually add a C preprocessor macro
to the body of the kernel function that replaced all instances of the argument
variable with its constant value, which is how the optimization is generally
done by hand in practice. Using an LLVM transformation pass instead has two
major advantages. First, it avoids edge cases like variable name shadowing and
kernels that mutate their arguments. Second, it avoids re-parsing the kernel
source code, which speeds up specialization slightly.
Specializing a kernel in an LLVM module pass consists of finding the kernel
function, iterating through its arguments, and calling the replaceAllUsesWith
method on each specialized argument with the appropriate value. The name
of the kernel to specialize, the arguments to specialize, and the values to specialize on are specified as command line arguments.
To illustrate the effect of specialization on a kernel, we consider the C function
ipow in Listing 5.3, which raises an integer xx to an integer power kk by
iterated multiplication.
int
ipow(int xx, int kk)
{
    int yy = 1;

    for (int ii = 0; ii < kk; ++ii)
        yy *= xx;

    return yy;
}

Listing 5.3: ipow.c: An integer exponentiation kernel
LLVM assembly code can be generated from the ipow.c source with the clang
command shown below, producing the LLVM assembly code in Listing 5.4 (data layout and optimizer metadata are not included in the LLVM assembly listings, for clarity).
$ clang -S -emit-llvm -o ipow1.ll ipow.c
The direct translation from C explicitly puts arguments and local variables
on the stack. This is necessary in general because the code may use pointers
to values on the stack, but makes the code much less clear and disables some
optimization. The mem2reg transformation can be used to move these variables
into registers using the opt command shown below. This produces the code
shown in Listing 5.5.
$ opt -S -mem2reg -o ipow2.ll ipow1.ll
The specialization pass can be run on this code, specializing the kk argument
on the value 6, using the opt command shown below. This produces the LLVM
assembly shown in Listing 5.6, with the register %kk replaced by the literal 6
in the loop's icmp comparison. The code for the LLVM specialization pass is shown in Listing B.4.1.
Note that it is necessary to explicitly load the custom transformation pass
from a shared library.
$ opt -S -load=libspec.so -specialize -spec-text="kk=6" \
-kernel="ipow" -o ipow3.ll ipow2.ll
Now that the code has been specialized, it’s possible to apply some further
optimizations to take advantage of the specialized value. The most obvious
optimization available is loop unrolling. Since the trip count is now known,
the loop can be fully unrolled. With LLVM, this requires two transformation
passes. First, the loop needs to be rotated to put the conditional at the bottom
(-loop-rotate). Then the loop can be unrolled using the -loop-unroll pass.
The commands are shown below and the resulting code is shown in Listing 5.7
and Listing 5.8.
$ opt -S -loop-rotate -o ipow4.ll ipow3.ll
$ opt -S -loop-unroll -o ipow5.ll ipow4.ll
The output of loop unrolling has a lot of unnecessary blocks and unconditional
branches. This can be cleaned up with -simplifycfg to get the result shown
in Listing 5.9.
$ opt -S -simplifycfg -o ipow6.ll ipow5.ll
Finally, the optimization can be taken one step further by combining some
of the multiplications with -reassociate, producing the code shown in Listing 5.10.
$ opt -S -reassociate -o ipow7.ll ipow6.ll
This final version is probably the optimal code to compute x^6 by multiplication.
This series of transformations clearly demonstrates the power of an optimizing
compiler and the promise of specialization to generate high quality special-case
code from code written for the general case.
; ModuleID = 'ipow.c'

define i32 @ipow(i32 %xx, i32 %kk) #0 {
entry:
  %xx.addr = alloca i32, align 4
  %kk.addr = alloca i32, align 4
  %yy = alloca i32, align 4
  %ii = alloca i32, align 4
  store i32 %xx, i32* %xx.addr, align 4
  store i32 %kk, i32* %kk.addr, align 4
  store i32 1, i32* %yy, align 4
  store i32 0, i32* %ii, align 4
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %0 = load i32* %ii, align 4
  %1 = load i32* %kk.addr, align 4
  %cmp = icmp slt i32 %0, %1
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %2 = load i32* %xx.addr, align 4
  %3 = load i32* %yy, align 4
  %mul = mul nsw i32 %3, %2
  store i32 %mul, i32* %yy, align 4
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %4 = load i32* %ii, align 4
  %inc = add nsw i32 %4, 1
  store i32 %inc, i32* %ii, align 4
  br label %for.cond

for.end:                                          ; preds = %for.cond
  %5 = load i32* %yy, align 4
  ret i32 %5
}

Listing 5.4: ipow1.ll: Direct translation from C
; ModuleID = 'ipow.bc'

define i32 @ipow(i32 %xx, i32 %kk) #0 {
entry:
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %yy.0 = phi i32 [ 1, %entry ], [ %mul, %for.inc ]
  %ii.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %cmp = icmp slt i32 %ii.0, %kk
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %mul = mul nsw i32 %yy.0, %xx
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %inc = add nsw i32 %ii.0, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret i32 %yy.0
}

Listing 5.5: ipow2.ll: After the mem2reg transformation.
; ModuleID = 'ipow2.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %yy.0 = phi i32 [ 1, %entry ], [ %mul, %for.inc ]
  %ii.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %cmp = icmp slt i32 %ii.0, 6
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %mul = mul nsw i32 %yy.0, %xx
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %inc = add nsw i32 %ii.0, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret i32 %yy.0
}

Listing 5.6: ipow3.ll: After specializing on kk=6
; ModuleID = 'ipow3.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  br label %for.body

for.body:                                         ; preds = %entry, %for.inc
  %ii.02 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %yy.01 = phi i32 [ 1, %entry ], [ %mul, %for.inc ]
  %mul = mul nsw i32 %yy.01, %xx
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %inc = add nsw i32 %ii.02, 1
  %cmp = icmp slt i32 %inc, 6
  br i1 %cmp, label %for.body, label %for.end

for.end:                                          ; preds = %for.inc
  %yy.0.lcssa = phi i32 [ %mul, %for.inc ]
  ret i32 %yy.0.lcssa
}

Listing 5.7: ipow4.ll: After rotating the loop
; ModuleID = 'ipow4.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  br label %for.body

for.body:                                         ; preds = %entry
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %mul.1 = mul nsw i32 %xx, %xx
  br label %for.inc.1

for.inc.1:                                        ; preds = %for.inc
  %mul.2 = mul nsw i32 %mul.1, %xx
  br label %for.inc.2

for.inc.2:                                        ; preds = %for.inc.1
  %mul.3 = mul nsw i32 %mul.2, %xx
  br label %for.inc.3

for.inc.3:                                        ; preds = %for.inc.2
  %mul.4 = mul nsw i32 %mul.3, %xx
  br label %for.inc.4

for.inc.4:                                        ; preds = %for.inc.3
  %mul.5 = mul nsw i32 %mul.4, %xx
  br label %for.inc.5

for.inc.5:                                        ; preds = %for.inc.4
  ret i32 %mul.5
}

Listing 5.8: ipow5.ll: After unrolling the loop
; ModuleID = 'ipow5.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  %mul.1 = mul nsw i32 %xx, %xx
  %mul.2 = mul nsw i32 %mul.1, %xx
  %mul.3 = mul nsw i32 %mul.2, %xx
  %mul.4 = mul nsw i32 %mul.3, %xx
  %mul.5 = mul nsw i32 %mul.4, %xx
  ret i32 %mul.5
}

Listing 5.9: ipow6.ll: After merging excess blocks
; ModuleID = 'ipow6.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  %0 = mul i32 %xx, %xx
  %1 = mul i32 %0, %xx
  %2 = mul i32 %1, %1
  ret i32 %2
}

Listing 5.10: ipow7.ll: After arithmetic simplification
5.3.4
Serial Specialization Testing
To test the effects of value specialization as an optimization for serial kernels,
tests have been run on three simple examples. Listings for these kernels are
given in Appendix B.3. These tests were run on the Opteron workstation
described in Appendix A.2, using the methodology described in Appendix A.1,
except that each test was repeated ten times rather than five.
In these tests only the kernel function itself is subject to optimization. The
rest of the test program is left unoptimized to isolate the effects of optimization
on the kernel.
Ten different sets of optimizations are compared in these benchmarks, consisting of combinations of sets of optimization flags described below. For example,
a graph column labeled “spec-O3-re” describes a test where the kernel was specialized, then optimized with the LLVM heavy optimization set (“O3”), then
optimized with the Unroll and Reassociate set (“re”).
The optimization sets tested were found by manual exploration of the available
LLVM optimizations and examination of the resulting assembly code. These
are not optimal sets; the search space for LLVM optimization flags is just too
big for such sets to be reasonably found. The optimization sets used are as
follows:
Specialization (spec) The kernel is specialized on the appropriate parameters.
Not Optimized (notopt) No optimizations are performed.
LLVM Heavy Optimization (O3) LLVM provides three default sets of
optimization passes for different levels of optimization: -O1, -O2, and -O3.
This (-O3) is the heaviest level of optimization and is equivalent to running
the following opt command:
$ opt -targetlibinfo -no-aa -tbaa -basicaa -notti -globalopt -ipsccp -deadargelim \
-instcombine -simplifycfg -basiccg -prune-eh -inline-cost -inline -functionattrs \
-argpromotion -sroa -domtree -early-cse -simplify-libcalls -lazy-value-info \
-jump-threading -correlated-propagation -simplifycfg -instcombine -tailcallelim \
-simplifycfg -reassociate -domtree -loops -loop-simplify -lcssa -loop-rotate \
-licm -lcssa -loop-unswitch -instcombine -scalar-evolution -loop-simplify -lcssa \
-indvars -loop-idiom -loop-deletion -loop-unroll -memdep -gvn -memdep -memcpyopt \
-sccp -instcombine -lazy-value-info -jump-threading -correlated-propagation \
-domtree -memdep -dse -adce -simplifycfg -instcombine -strip-dead-prototypes \
-globaldce -constmerge -preverify -domtree -verify
Note that this list of passes includes analysis passes that would run even if
not explicitly specified in the opt command due to being required by later
transformation passes.
Unroll Loops (unroll) This set of optimizations performs constant propagation and loop unrolling, including enabling partially unrolled loops.
$ opt -mem2reg -sccp -loop-rotate -loop-unroll -unroll-allow-partial \
-simplifycfg
Unroll and Reassociate (re) This is like “unroll”, except it adds the arithmetic reassociation and simplification pass, which turns out to help a lot on
specialized code.
$ opt -mem2reg -sccp -loop-rotate -loop-unroll -unroll-allow-partial \
-simplifycfg -reassociate
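For reference, a full run of one serial test configuration amounts to invoking the
specializer pass and then one of the optimization sets above on the kernel bitcode.
The sketch below shows roughly what a "spec-unroll-re" run looks like; the plugin
file name SpecPass.so, the -kernel-name flag, and the "kk=17" value-string format
are illustrative assumptions (only the -specialize pass name and the -spec-text
option are taken from Listing B.4.1).

$ opt -load ./SpecPass.so -specialize -kernel-name ipow -spec-text "kk=17" \
      ipow.ll -S -o ipow-spec.ll
$ opt -mem2reg -sccp -loop-rotate -loop-unroll -unroll-allow-partial \
      -simplifycfg -reassociate ipow-spec.ll -S -o ipow-opt.ll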
Repeated Integer Exponentiation
The integer exponentiation kernel shown in Listing 5.3 is specialized on the
exponent (kk = 17) and executed 50 million times.
The execution time for this test with various optimization configurations is
shown in Figure 5.5, while the optimization time is shown in Figure 5.6.
For this simple example, it is possible to provide a detailed explanation for
the measured runtimes. First, several optimizations can make this code run
faster, so having the unoptimized code run slowest is the expected result.
Next, the code shown in Listing 5.11 seems to be the best code that can be
generated for raising an integer to the 17th power. To produce this code, it
is necessary to specialize the function, fully unroll the loop, and then perform
arithmetic simplification, in that order. The two test cases that do so produce
the expected code and give the fastest runtimes.
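Read as C, that best code computes x to the 17th power with five multiplications
by repeated squaring. The function below is an illustrative hand translation of
Listing 5.11, not compiler output:

int ipow17(int xx)
{
    int x2  = xx * xx;     /* x^2  */
    int x4  = x2 * x2;     /* x^4  */
    int x8  = x4 * x4;     /* x^8  */
    int x16 = x8 * x8;     /* x^16 */
    return x16 * xx;       /* x^17 */
}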
; ModuleID = 'ipow17.ll'

define i32 @ipow(i32 %xx, i32 %kk) {
entry:
  %0 = mul i32 %xx, %xx
  %1 = mul i32 %0, %0
  %2 = mul i32 %1, %1
  %3 = mul i32 %2, %2
  %mul.16 = mul i32 %3, %xx
  ret i32 %mul.16
}

Listing 5.11: ipow17-opt.ll: Fully optimized ipow for kk = 17

The next question is why specialization doesn’t help without arithmetic simplification,
or, more specifically, why a fully unrolled loop isn’t faster than the same loop
not unrolled. The best explanation seems to be that the CPU is able to issue the
independent instructions to multiply and increment the loop variable in parallel and
then correctly predict the branch. Experimenting by adding more independent arithmetic
operations to the loop body in ipow and rerunning the benchmarks seems to support
this hypothesis: specialization and unrolling provide a speedup over not unrolling
when there are more operations in the loop body.
Finally, it’s not clear why unrolling without specialization (effectively equivalent to just doing -loop-rotate) runs faster than the full set of optimizations
in “O3”. The only difference in the generated assembly code is the use of
an “equals” comparison instead of a “less than” comparison. Perhaps this is
worse for the CPU branch predictor.
The results for optimization time are simpler, as shown in Figure 5.6. Optimizing this kernel takes less than a tenth of a second, and the time increases as more optimization passes are applied. For comparison, linking and assembly for this kernel take about
0.15 seconds. The “spec-unroll-reassoc” optimizations are slightly faster than
“spec-O3-reassoc”, but it’s not a big difference given the linking time.
Matrix Multiplication
The matrix multiplication kernel in Listing B.3.2 is specialized on the matrix
size (NN = 256). Timings are shown in Figure 5.7, while optimization times
are shown in Figure 5.8.
This is a more complex kernel, so the optimization results aren’t quite as clear
as they were with ipow. Any of the optimization sets provides about the
same speedup compared to not optimizing, nearly a factor of four. For NN =
256 just unrolling and specializing provides a slightly better speedup than the
other optimization sets, but this effect is small. Testing with NN = 512 caused
this effect to disappear.

Figure 5.4: Run times for mmul serial kernel (NN = 256) on Intel Atom N270
[bar chart of run time in seconds for each optimization flag combination]
Unrolling is not expected to give a significant speedup on a test like this in
any case, due to the superscalar architecture of the Opteron CPU that the
test is running on [69]. The CPU can identify instruction level parallelism and
execute the multiplication and loop increment operations in parallel while successfully predicting the branch for the loop condition, eliminating any benefit
from loop unrolling.
The optimization times are significantly higher on this test, in some cases
greater than the kernel runtime. This demonstrates a potential benefit to
selecting a minimal set of optimizations for JIT-compiled compute kernels.
To explore the effects of instruction level parallelism further, this test was
also run on a machine with an Intel Atom N270, which is an in-order CPU
that issues two instructions at a time [70], compared to the four out-of-order
instructions that the Opteron can issue at once [71]. This should result in a
larger improvement from specialization and unrolling.
The results with the Atom CPU are shown in Figure 5.4. There is a clear
speedup from specialization and unrolling, but it’s still not as impressive as it
was on the GPU devices because the Atom is still issuing both instructions in
the loop body at once and predicting the branch.
Gaussian Blur
The gaussian blur kernel in Listing B.3.1 is specialized on image size and blur
radius. Run times are shown in Figure 5.9, while optimization times are shown
in Figure 5.10.
For the blur kernel, specialization provides a significant additional speedup when combined with loop unrolling, but does not provide an additional speedup when added to the full set
of “O3” optimizations. This makes specialization potentially attractive as a
way to improve speedup per optimization time.
5.3.5 Discussion
These test cases show that specialization can provide a significant benefit for
an appropriately structured kernel like ipow. Unfortunately, this benefit seems
to decrease quickly on more complex kernels, largely due to the superscalar
architectures of modern CPUs. AMD and Intel processors already optimize tight
inner loops in hardware, so there is not a lot of room for optimizing them
further in software when targeting CPU devices.
Medium size loops, with 6-8 instructions per iteration, might see more benefit,
but such loops don’t seem to be a very common case.
Figure 5.5: Execution times for ipow serial kernel.
[bar chart of run time in seconds for each optimization flag combination]

Figure 5.6: Optimization times for ipow serial kernel.
[bar chart of optimization time in seconds for each optimization flag combination]
Figure 5.7: Run times for mmul serial kernel (NN = 256).
[bar chart of run time in seconds for each optimization flag combination]

Figure 5.8: Optimization times for mmul serial kernel (NN = 256).
[bar chart of optimization time in seconds for each optimization flag combination]
Figure 5.9: Run times for blur serial kernel.
[bar chart of run time in seconds for each optimization flag combination]

Figure 5.10: Optimization times for blur serial kernel.
[bar chart of optimization time in seconds for each optimization flag combination]
5.4 Specializing POCL
In this section we describe Specializing POCL, an implementation of SOCL
built as an extension to the open source POCL implementation of OpenCL.
First, we explain how POCL compiles and executes OpenCL kernels. Next,
we describe how JIT specialization was added to this process. Finally, we provide timings for the resulting Specializing POCL system running the OpenCL
benchmarks described in Chapter 5.
5.4.1 Execution of an OpenCL kernel with POCL
First, we will describe how the POCL [47] (Portable Open Compute Language) implementation of OpenCL operates when executing kernel code on an
AMD64-based multi-core machine. Specifically, we look at how several parts
of the OpenCL API described in Chapter 2.3 are implemented as well as how
executable kernel code is generated and scheduled for parallel execution.
POCL is an open source implementation of OpenCL supporting execution on
multi-core CPUs and several types of specialized acceleration hardware. It
was initially developed [64] by Pekka Jaaskelainen and others at the Tampere
University of Technology in Finland and the Universidad Rey Juan Carlos in
Spain to study the design of application specific processor hardware, but has
since been extended for use as a general purpose OpenCL implementation.
POCL is built on the LLVM [65][72] compiler construction toolkit described
in Section 5.3.1. In broad strokes, POCL uses the Clang [66] C front end to
parse OpenCL C and generate LLVM IR, does most of its work in custom
LLVM transformation passes, and then uses an appropriate LLVM back-end
to generate kernel binaries for execution.
In conventional UNIX fashion, POCL does its work by creating a temporary
directory and calling LLVM command line utilities to transform the kernel
in several stages. These intermediate files can be examined by setting the
POCL_LEAVE_TEMP_DIRS environment variable. We illustrate this process by
describing a single run of the vec_pow_host program shown in Listing B.1.2
using POCL.
The executable is started with this command:
$ POCL_LEAVE_TEMP_DIRS=1 ./vec_pow_host
When the program calls clCreateProgramWithSource (line 45 in Listing B.1.2),
a temporary directory is created and the kernel source is written to program.cl
in that directory. In this case, the code written to this initial temporary file
is identical to the code in vec_pow_kernel.cl shown in Listing B.1.1.
The clBuildProgram function (line 47) causes the clang command to be
executed to compile the OpenCL C source into an LLVM bitcode file. POCL
creates a subdirectory for the target OpenCL device. In this case the device is
called “pthread” because the kernel will be executed on a multi-core processor
using POSIX threads. Clang outputs the bitcode to a file called program.bc
in the new subdirectory.
The LLVM assembly code for the bitcode in program.bc as generated by the
llvm-dis command is shown in Listing B.1.3. This is a direct translation from
the OpenCL C code into LLVM assembly. The two features which mark this
code as being OpenCL code are the metadata on lines 29 and 31 indicating
that vec_pow is an OpenCL kernel, and the addrspace(3) annotation (line 5)
which marks the variable xs as being a pointer to the OpenCL global address
space. This address space information is not relevant to the “pthread” device
back end, but is used to select different types of memory for data storage when
a kernel is executed on devices like GPUs with partitioned memory spaces.
The clCreateKernel function (line 50 of Listing B.1.2) causes the LLVM opt
command to be executed to run a custom analysis pass that finds various information about the kernel such as the argument types and minimum workgroup
size. This information is loaded back into the host program by generating a
small shared library in C which is compiled, linked (creating a descriptor.so)
and dynamically loaded.
Finally, when the clEnqueueNDRangeKernel function is called (line 67) the
rest of the work to generate a parallel, executable version of the kernel is
performed. First, the llvm-link command is called to link the kernel bitcode
to the OpenCL C standard library, generating kernel_linked.bc. Second,
several analysis and transformation passes are run on the linked bitcode to
enable the kernel to run in parallel. We consider these passes in four broad
steps: Flatten, Barriers, Work Item Replication, and Standard Optimizations.
The Flatten transformation inlines as much as it can into the kernel. Since
the OpenCL standard doesn’t allow recursive calls in kernel code, this is a
well defined transformation that allows later passes to consider the kernel
computation as a whole without worrying about function calls.
The most difficult part of transforming a kernel for parallel execution is handling
barriers. Whenever a work item (a parallel instance of a kernel) calls
the barrier function, each other work item in the same work group must also
call barrier before execution can proceed beyond the barrier in any of them.
Since barriers can occur in conditional blocks and loops, this makes a static
serialization of an OpenCL kernel non-trivial in general. Based on text in
the OpenCL standard that says that each work item in a group must hit the
same sequence of barriers, POCL handles this problem with the simplifying
assumption that all work items in a group will branch the same way if that
would cause any of them to hit a barrier [73].
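As an illustration of why this assumption matters, consider a hypothetical kernel
(not one of the benchmarks) with a barrier inside a conditional. The transformation
is only correct if every work item in a group evaluates the condition the same way,
which is exactly what the standard's barrier rules require of well-formed kernels:

kernel void
scale_all(global float* data, int do_sync)
{
    int ii = get_global_id(0);

    if (do_sync) {
        /* Legal only if all work items in the group agree on do_sync,
         * so either every work item reaches this barrier or none does. */
        barrier(CLK_GLOBAL_MEM_FENCE);
    }

    data[ii] *= 2.0f;
}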
Once barriers have been properly identified and the kernel has been transformed to put them in a canonical form, POCL transforms the kernel into a
serialization of the work items in a single work group.
Finally, several standard optimization passes are run on the kernel to make it
execute faster. The result of these passes – which is specialized on workgroup
size – is written to a final bitcode file parallel.bc which is then compiled
to a shared library parallel.so. Execution of the kernel with the provided
arguments is then added to the work queue. Copies of the arguments are
made in case another call to the kernel is enqueued before the first kernel call
is executed.
The kernel is then actually executed asynchronously by dynamically loading
the shared library and executing the parallel kernel work-group function on
the arguments provided. One instance is executed in parallel for each work
group using POSIX threads.
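The launch step can be pictured roughly as follows. This is an illustrative sketch,
not POCL's actual data structures: the compiled work-group function is run once per
work group, each on its own POSIX thread.

#include <pthread.h>

typedef void (*wg_func)(void* args, int group_id);

struct wg_task { wg_func fn; void* args; int group_id; };

static void* run_group(void* p)
{
    struct wg_task* t = p;
    t->fn(t->args, t->group_id);    /* execute one work group */
    return 0;
}

void launch_groups(wg_func fn, void* args, int num_groups)
{
    pthread_t tids[num_groups];
    struct wg_task tasks[num_groups];

    for (int g = 0; g < num_groups; ++g) {
        tasks[g] = (struct wg_task){ fn, args, g };
        pthread_create(&tids[g], 0, run_group, &tasks[g]);
    }

    for (int g = 0; g < num_groups; ++g)
        pthread_join(tids[g], 0);
}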
5.4.2 Adding Scalar Specialization to POCL
Now that the OpenCL API and its implementation in POCL have been described, the addition of specialization on scalar arguments is relatively simple.
POCL already delays final kernel code generation until after the kernel arguments have been specified, so it is sufficient to add a specialization pass that
modifies the kernel between the Flatten pass and the Barriers pass described
in Section 5.4.1.
In more detail, adding specialization required three modifications to POCL.
First, the clCreateKernel function has been modified to read the @spec annotation from the saved program.cl source file and include information about
which arguments should be specialized on in the descriptor.so. Second, the
clEnqueueNDRangeKernel function has been modified to store the values of
the specialized arguments to a temporary file (called spec.info). Finally, an
LLVM transformation pass has been written that reads spec.info and replaces each usage of a specialized argument in the kernel with its value. This
LLVM pass for specialization for POCL kernels is shown in Listing B.4.1. It is
the same pass that was used to specialize serial kernels in Section 5.3.
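Conceptually, the effect of the pass on the vec_pow kernel from Listing B.1.1 is the
same as compiling the source below, shown here for the illustrative value y = 3
(n is also specialized but is unused in the kernel body). Every use of the specialized
scalar argument is replaced by its constant, which then allows later passes to fully
unroll the loop:

kernel void
vec_pow(global long* xs, long y, long n)
{
    long ii = get_global_id(0);
    long x = xs[ii];
    long x1 = x;

    /* y has been replaced by the specialization value 3 */
    for (long jj = 0; jj < 3; ++jj) {
        x1 *= x;
    }

    xs[ii] = x1;
}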
This modification has been tested on several test kernels, including the vec_pow
kernel shown in Listing B.1.1 and the benchmark kernels described in Chapter 5.
The current code can specialize 32 and 64 bit integers and floating point values,
which are the most common scalar parameters to OpenCL C functions. The
strategy could be further extended to support arbitrary arrays (e.g., to allow a
general image convolution function to be specialized on a specific convolution
matrix), but that is somewhat more complex and has not been implemented.
5.4.3 Benchmarks for POCL with Specialization
The eight test benchmarks from Chapter 5 were run using Specializing POCL
comparing no optimization, loop unrolling, and the full “-O3” set of optimizations both with and without specialization. Tests were run on the 24-core
Opteron workstation described in Appendix A.2.
A summary of speedups is shown in Table 5.5. Charts with the full results for
each benchmark are shown in Figure 5.11.
These results vary significantly. Only one benchmark got large speedups from
specialization across the various sets of optimization flags. The other benchmarks either saw consistent slowdowns or saw speedups from only one or two
sets of optimizations. Given the results from serial specialization, the presence
of any significant speedups is surprising.
Results by Benchmark
Gaussian Blur (blur) This benchmark saw a speedup from specialization,
but a significant slowdown from any additional optimizations. This optimization slowdown is present even without specialization.
Benchmark         Spec      Spec + Unroll   Spec + O3
blur              21.5%     -23.4%          -23.0%
mandelbrot        -0.9%     0.3%            -0.6%
mmul              -0.3%     122.6%          -11.0%
nas-cg            -19.6%    -22.2%          -21.4%
nas-sp            193.1%    169.5%          156.2%
nv-bs             -1.1%     -1.9%           -6.6%
nv-dct            -0.0%     30.8%           36.4%
particlefilter    -4.0%     -2.7%           -2.7%

Table 5.5: Speedups with Specializing POCL
Mandelbrot (mandelbrot) This benchmark showed approximately no effect
from specialization. Heavy optimization resulted in a slight slowdown, independent of specialization.
Matrix Multiply (mmul) Matrix multiply got a significant speedup from
specialization with loop unrolling. All the other combinations of optimizations
and specializations caused a slowdown.
Conjugate Gradient (nas-cg) This benchmark got a significant slowdown
from specialization that remained consistent across the different optimization
sets.
Scalar Penta-Diagonal (nas-sp) This benchmark got a significant speedup
from specialization that remained consistent across the different optimization
sets.
Black-Scholes Option Pricing (nv-bs) Specialization caused a slight slowdown for this benchmark, which was made even worse by heavy optimization.
DCT (nv-dct) Specialization and optimization caused a significant speedup
on this benchmark. Optimization without specialization had almost no effect.
Particle Filter (particlefilter) This benchmark got a slight slowdown
from specialization. This is the same result as on the Pancake GPU tests.
Figure 5.11: Benchmark Timings for Specializing POCL
[eight bar charts, one per benchmark (blur, mmul, mandelbrot, nv-bs, nas-cg, nas-sp,
nv-dct, particlefilter), showing kernel execution time in seconds for the default,
unroll, and O3 optimization sets, each with and without specialization]
Figure 5.12: Execution times for the “mmul” benchmark with a modified Clover
[bar chart of kernel execution time in seconds for the default, unroll, and O3
configurations, with and without specialization]

5.5 JIT Specialization in Clover
Clover is the OpenCL implementation from the Mesa [44] project, with the
goal of providing an open source implementation of the OpenCL standard
for AMD and Nvidia GPUs. The implementation isn’t complete, currently
providing only partial support of the standard, and only for AMD GPUs. Of
the benchmarks described in Chapter 5, Clover will only run “mmul”.
Clover has been modified to use the LLVM specialization pass described in the
previous two sections. Rather than implement the @spec annotation for a single
benchmark, the specialization values are provided through environment variables.
This implementation was sufficient to collect the benchmark runtimes
for “mmul” shown in Figure 5.12.
This test demonstrates that JIT specialization can be added to an OpenCL implementation targeting GPUs directly. Specialization provided a speedup only
with the “unroll” optimization flags, implying that some of the optimizations
performed by “O3” are not appropriate for a GPU target.
Chapter 6
Conclusions and Future Work
6.1 Conclusions
This dissertation has explored the application of just in time value specialization to data parallel compute kernels. Three new implementations of this
technique were built and tested: Bacon, Pancake, and Specializing POCL. All
three demonstrated significant speedups on kernels especially suited to the technique.
Bacon was primarily built to increase the usability of OpenCL. It has several
features which should have usability benefits, but it is not clear that these
benefits warrant the effort of adopting a new language for kernel development
or a new and incomplete runtime library. Other approaches, like Aparapi [36],
Clyther [37], and Copperhead [57], that allow parallel kernels to be written
directly in a high-level host language are likely to be a better way to approach
usability for programming parallel hardware.
Pancake and Specializing POCL implement Specialization Annotated OpenCL
(SOCL), allowing OpenCL C kernels to be directly annotated for JIT specialization on argument values. This allowed the effects of this optimization to be
tested directly without the language changes introduced in Bacon.
Testing with Pancake showed that JIT specialization could provide significant
performance benefits when running appropriate kernels on GPU devices even
when the specialization was performed before passing the kernels to opaque
vendor implementations of OpenCL. The simplicity of the GPU architectures
allows well known optimizations like loop unrolling and algebraic simplification
to have a significant effect on runtime, which makes specialization an effective
optimization technique.
Specializing POCL demonstrates that support for SOCL can be easily added
to an existing OpenCL implementation. The performance benefits running
on a CPU target were not as significant as those on GPU targets due to the
complexity of the CPU architecture. The benefit could probably be improved,
especially by controlling the interleaving of instructions from multiple OpenCL
work items, but it would require additional work on coordinating compiler
optimizations with scheduling information.
Both implementations of SOCL showed very large speedups on some of the
test cases. For example, executing on an AMD GPU with Pancake, the Gaussian Blur test case executed 175% faster than without specialization. With
Specializing POCL on a multi-core CPU, the “nas-sp” benchmark showed a
speedup of over 150% at any optimization level. It’s not clear how to generalize
these results to other cases, but it is a tempting target for future work.
Just in time specialization is a promising optimization that could benefit from
significant further research, but I would not recommend incorporating SOCL
or something similar into existing OpenCL compilers based solely on the work
in this dissertation.
6.2 Future Work
There are several ways that the work in this dissertation could be usefully
extended. This includes both further work on JIT specialization and work on
other compiler techniques that would make parallel kernels execute faster.
6.2.1 Automatic Specialization
SOCL requires explicit annotations specifying which kernel parameters should
be specialized on. It should be possible to find appropriate specialization
parameters by a combination of static analysis and runtime profiling. This
would allow the benefits of JIT specialization to be realized on appropriate
kernels without any additional effort by the programmer.
The most obvious candidate parameters for specialization are loop variables.
These can be identified by simple static analysis. Other candidates could be
identified by more complex analyses, such as by finding variables that are used
in many arithmetic operations.
Once a candidate parameter is identified, the problem is determining whether
the kernel is called several times with the same value for that parameter. This
could be determined by recording the values used in calls. Once the same value
has been used some threshold number of times, it may be worth generating a
specialized kernel for that value.
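A minimal sketch of such threshold-based value profiling for a single candidate
argument is shown below. All names here are hypothetical; nothing like this exists
in SOCL or POCL today.

#include <stdint.h>

#define SPEC_THRESHOLD 8    /* arbitrary illustrative threshold */

typedef struct {
    int64_t last_value;     /* most recently seen argument value */
    int     repeat_count;   /* consecutive calls that used it */
} spec_profile;

/* Returns 1 when the same value has recurred often enough that generating
 * a kernel specialized on it is likely to pay for the extra compilation. */
static int
should_specialize(spec_profile* p, int64_t value)
{
    if (value == p->last_value) {
        p->repeat_count += 1;
    } else {
        p->last_value = value;
        p->repeat_count = 1;
    }

    return p->repeat_count >= SPEC_THRESHOLD;
}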
6.2.2 Speculative Optimization
Selecting an appropriate set of optimizations to balance compilation time
against kernel runtime is difficult. Worse, optimizing more may not even be
better. Several of the test cases for Specializing POCL showed a slowdown
when compiled with the full (“O3”) set of optimizations.
Adaptive optimization can be used to select the best optimizations for a given
kernel. The initial run of the kernel can be done with whatever set of optimizations will compile fastest, in parallel with a slower speculative compilation of
a more optimized version of the kernel. Once the optimized version has compiled, the next run can be performed with that version. This process can be
repeated several times with different optimizations. Finally, whichever version
executes fastest can be used.
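The selection loop can be sketched as below. This is illustrative only:
compile_kernel() and run_kernel() stand in for the real JIT and launch paths, and a
real runtime would compile the next candidate in parallel with running the current
best rather than sequentially as written here.

typedef struct {
    const char* opt_flags;
    void*       binary;
    double      best_time;
} variant;

const char*
adaptive_select(variant* vs, int nvariants,
                void* (*compile_kernel)(const char* flags),
                double (*run_kernel)(void* binary))
{
    int best = 0;

    for (int v = 0; v < nvariants; ++v) {
        vs[v].binary = compile_kernel(vs[v].opt_flags);
        vs[v].best_time = run_kernel(vs[v].binary);

        if (vs[v].best_time < vs[best].best_time)
            best = v;
    }

    return vs[best].opt_flags;    /* keep whichever version ran fastest */
}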
This technique is applicable to any optimization, but works especially well with
automatic specialization. A specialized kernel can be compiled speculatively
for the most promising values of a candidate parameter, and can be discarded
if that value is not used again or if the specialized kernel provides no speedup.
Speculative optimization does cause extra kernel compilations to be performed,
potentially many of them. But, especially for kernels executing on GPU devices, there may be unused CPU cores that can be used for speculative compilations without any negative impact on program performance.
6.2.3 Improved Scheduling in POCL
POCL currently uses two techniques to schedule work items into work groups.
Either the blocks between barriers in each work item are interleaved into a
single work group function or the work items are looped over for each of these
blocks.
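As a rough illustration of the second technique (a sketch, not code generated by
POCL), each barrier-delimited region of the kernel body becomes a loop over the
work items of the group, which preserves the ordering that the barrier guarantees:

#include <stdio.h>

static void region_a(int wi) { printf("region A on work item %d\n", wi); }
static void region_b(int wi) { printf("region B on work item %d\n", wi); }

/* Serialize one work group of size nn for a kernel split by a single
 * barrier into regions A and B: every work item finishes A before any
 * work item starts B, just as the barrier requires. */
void work_group(int nn)
{
    for (int wi = 0; wi < nn; ++wi)
        region_a(wi);

    for (int wi = 0; wi < nn; ++wi)
        region_b(wi);
}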
There are many other options for scheduling, including separating a work group
into multiple threads, grouping instructions for ILP, and grouping instructions
explicitly for vector execution.
A detailed analysis of how these scheduling choices interact with specialization
could result in significant performance benefits.
Appendix A
Benchmarking Hardware and Methods
A.1 Benchmarking Methodology
In this appendix we describe the methodology used to collect the timing results
for SOCL in Sections 5.2 and 5.4, as well as the timings for serial specialization
in Section 5.3.
A set of scripts was developed to run the benchmarks with a variety of different optimization settings using different OpenCL runtimes. These scripts set
environment variables that controlled the configuration of the OpenCL runtime and specialization behavior.
Each test was run five times for each combination being tested. This is a
relatively low number, but it was selected to optimize the tradeoff between
eliminating bad samples and overall time taken by the test runs. Some test
runs took several hours to complete. Significantly increasing testing time beyond that would not have been worth it for small incremental gains in accuracy,
since timings were reasonably consistent in practice.
The generated charts show the median times. This removes the effect of outliers while providing a realistic assessment of the inconsistent execution time
for parallel programs. Error bars show the minimum and maximum times.
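The reduction from raw samples to the plotted statistics is simple; the following is
an illustrative sketch rather than the actual benchmark scripts.

#include <stdlib.h>

static int cmp_double(const void* a, const void* b)
{
    double x = *(const double*)a;
    double y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Reduce one set of repeated timings to the plotted statistics:
 * the median (bar height) and the minimum and maximum (error bars). */
void summarize(double* times, int n, double* med, double* min, double* max)
{
    qsort(times, n, sizeof(double), cmp_double);
    *min = times[0];
    *max = times[n - 1];
    *med = (n % 2) ? times[n / 2]
                   : 0.5 * (times[n / 2 - 1] + times[n / 2]);
}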
Specs        Workstation                  Desktop                 Netbook
OS           Ubuntu 13.10                 Ubuntu 13.10            Debian 7
CPU          Dual AMD Opteron 6234        AMD Phenom II X3 720    Intel Atom N270
CPU Cores    24                           3                       1
CPU Clock    2.4 GHz                      2.8 GHz                 1.6 GHz
RAM          32 GB                        4 GB                    512 MB
GPU          Nvidia GeForce GTX 650 Ti    AMD Radeon 5830         Integrated Graphics
Shaders      768                          1120                    -
GPU Clock    928 MHz                      800 MHz                 -
GPU RAM      1 GB                         1 GB                    -

Table A.1: Test Hardware
A.2 Benchmarking Hardware
The timings in this dissertation were taken on three test machines, shown in
Table A.1. These machines were selected to meet several criteria:
• GPU testing needed to be performed on both AMD and Nvidia GPUs.
• A high CPU count was desired for OpenCL CPU testing. The Opteron
workstation has the highest core count I was willing to pay for.
• The Netbook was used to test serial optimization on a relatively simple
CPU architecture.
Appendix B
Code Listings
B.1 OpenCL Listings

B.1.1 Sample Kernel: vec_pow
Listing B.1.1: Simple kernel raising the elements of a vector to some power.
/* vec_pow:
 *   Take a vector of n long ints and raise each one
 *   to the y'th power in place.
 */
kernel void
vec_pow(global long* xs, long y, long n)
/* @spec vec_pow(y, n) */
{
    long ii = get_global_id(0);
    long x = xs[ii];
    long x1 = x;

    for (long jj = 0; jj < y; ++jj) {
        x1 *= x;
    }

    xs[ii] = x1;
}
B.1.2 Sample Host Program: vec_pow
Listing B.1.2: Host program to run the vec_pow kernel above
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#include
#include
#include
#include
<CL/ c l . h>
<s t d i o . h>
<a l l o c a . h>
<s t r i n g . h>
int
main ( int argc , char* argv )
{
c l _ u i n t nn ;
/* Take t h e f i r s t OpenCL p l a t f o r m */
c l G e t P l a t f o r m I D s ( 0 , 0 , &nn ) ;
c l _ p l a t f o r m _ i d p l a t f o r m s [ nn ] ;
c l G e t P l a t f o r m I D s ( nn , p l a t f o r m s , 0 ) ;
cl_platform_id platform = platforms [ 0 ] ;
/* Take t h e d e f a u l t OpenCL d e v i c e */
c l G e t D e v i c e I D s ( p l a t f o r m , CL_DEVICE_TYPE_ALL, 0 , 0 , &nn ) ;
c l _ d e v i c e _ i d d e v i c e s [ nn ] ;
c l G e t D e v i c e I D s ( p l a t f o r m , CL_DEVICE_TYPE_ALL, nn , d e v i c e s , 0 ) ;
cl_device_id device = devices [ 0 ] ;
/* Create an OpenCL c o n t e x t */
long context _props [ ] = {CL_CONTEXT_PLATFORM, ( long ) p l a t f o r m ,
0};
cl_context context = clCreateContext (
( c l _ c o n t e x t _ p r o p e r t i e s * ) context_props , 1 , &d e v i c e ,
0 , 0 , 0) ;
/* Create an OpenCL command queue */
cl_command_queue queue = clCreateCommandQueue ( c o n t e x t ,
device , 0 , 0) ;
/* Read i n t h e OpenCL k e r n e l s o u r c e */
char* o p e n c l _ s o u r c e = a l l o c a ( 1 0 2 4 ) ;
FILE* o p e n c l _ f i l e = f o p e n ( ” vec_pow_kernel . c l ” , ” r ” ) ;
nn = 0 ;
while ( ! f e o f ( o p e n c l _ f i l e ) )
nn += f r e a d ( o p e n c l _ s o u r c e + nn , 1 , 5 1 2 , o p e n c l _ f i l e ) ;
o p e n c l _ s o u r c e [ nn ] = ’ \n ’ ;
o p e n c l _ s o u r c e [ nn + 1 ] = 0 ;
91
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
fclose ( opencl_file ) ;
/* Compile t h e OpenCL s o u r c e and s e l e c t t h e k e r n e l we want */
size_t length = s t r l e n ( opencl_source ) ;
const char ** s o u r c e s = ( const char * * ) &o p e n c l _ s o u r c e ;
const s i z e _ t * l e n g t h s = ( const s i z e _ t * ) &l e n g t h ;
cl_program program = clCreateProgramWithSource (
context , 1 , sources , lengths , 0) ;
clB uild Pro gram ( program , 1 , &d e v i c e , ” ” , 0 , 0 ) ;
/* S e l e c t t h e k e r n e l we want */
c l _ k e r n e l k e r n e l = c l C r e a t e K e r n e l ( program , ”vec_pow” , 0 ) ;
/* Create t h e i n p u t b u f f e r */
cl_long n = 5;
c l _ l o n g xs0 [ ] = { 1 , 2 , 3 , 4 , 5 } ;
cl_mem_flags f l a g s = CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR;
size_t xs_size = n * sizeof ( cl_long ) ;
cl_mem xs = c l C r e a t e B u f f e r ( c o n t e x t , f l a g s , x s _ s i z e , xs0 , 0 ) ;
/* S e t t h e k e r n e l arguments */
cl_long y = 3;
c l S e t K e r n e l A r g ( k e r n e l , 0 , s i z e o f ( cl_mem ) , &xs ) ;
c l S e t K e r n e l A r g ( k e r n e l , 1 , s i z e o f ( y ) , &y ) ;
c l S e t K e r n e l A r g ( k e r n e l , 2 , s i z e o f ( n ) , &n ) ;
/* E x e c u t e t h e k e r n e l o v e r a p a r a l l e l range . */
c l _ e v e n t kernel_done = 0 ;
clEnqueueNDRangeKernel ( queue , k e r n e l , 1 , 0 , &n , 0 , 0 , 0 ,
&kernel_done ) ;
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
/* Wait f o r t h e Kernel t o f i n i s h e x e c u t i n g */
clFinish () ;
/* Copy b a c k t h e d a t a from t h e b u f f e r . */
c l _ l o n g xs1 [ 5 ] ;
clEnqueueReadBuffer ( queue , xs , CL_TRUE, 0 , x s _ s i z e , xs1 , 0 ,
0 , 0) ;
/* P r i n t o u t t h e r e s u l t */
p r i n t f ( ” 1 2 3 4 5 cubed i s : \ n” ) ;
for ( int i i = 0 ; i i < 5 ; ++i i ) {
p r i n t f ( ”%l d ” , xs1 [ i i ] ) ;
}
p r i n t f ( ” \n” ) ;
}
return 0 ;
92
B.1.3 LLVM Assembly Code for vec_pow kernel
Listing B.1.3: Results of compiling vec_pow kernel to LLVM assembly
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
; ModuleID = ’ program . bc ’
t a r g e t d a t a l a y o u t = *** s t r i n g o f data s i z e s ***
t a r g e t t r i p l e = ” x86_64−unknown−l i n u x −gnu ”
d e f i n e void @vec_pow ( i 6 4 a d d r s p a c e ( 3 ) * n o c a p t u r e %xs , i 6 4 %y , i 6 4
%n ) nounwind uwtable {
entry :
%c a l l = t a i l c a l l i 6 4 @get_global_id ( i 3 2 0 ) nounwind
%a r r a y i d x = g e t e l e m e n t p t r inbounds i 6 4 a d d r s p a c e ( 3 ) * %xs , i 6 4
%c a l l
%0 = l o a d i 6 4 a d d r s p a c e ( 3 ) * %a r r a y i d x , a l i g n 8 , ! tbaa ! 1
%cmp2 = icmp s g t i 6 4 %y , 0
br i 1 %cmp2 , l a b e l %for . body , l a b e l %for . end
for . body :
; preds =
%for . body , %e n t r y
%j j . 0 4 = p h i i 6 4 [ %i n c , %for . body ] , [ 0 , %e n t r y ]
%x1 . 0 3 = p h i i 6 4 [ %mul , %for . body ] , [ %0, %e n t r y ]
%mul = mul nsw i 6 4 %x1 . 0 3 , %0
%i n c = add nsw i 6 4 %j j . 0 4 , 1
%e x i t c o n d = icmp eq i 6 4 %i n c , %y
br i 1 %e x i t c o n d , l a b e l %for . end , l a b e l %for . body
for . end :
; preds =
%for . body , %e n t r y
%x1 . 0 . l c s s a = p h i i 6 4 [ %0, %e n t r y ] , [ %mul , %for . body ]
s t o r e i 6 4 %x1 . 0 . l c s s a , i 6 4 a d d r s p a c e ( 3 ) * %a r r a y i d x , a l i g n 8 ,
! tbaa ! 1
r e t void
}
d e c l a r e i 6 4 @get_global_id ( i 3 2 )
! opencl . kernels = ! { ! 0 }
!0
!1
!2
!3
=
=
=
=
metadata
metadata
metadata
metadata
! { void ( i 6 4 a d d r s p a c e ( 3 ) * , i 6 4 , i 6 4 ) * @vec_pow}
! { metadata ! ” long ” , metadata ! 2 }
! { metadata ! ” omnipotent char ” , metadata ! 3 }
! { metadata ! ” Simple C/C++ TBAA” }
93
B.1.4 Gaussian Blur Kernels
Listing B.1.4: blur-kernels.cl: OpenCL kernels to perform gaussian blur
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
kernel
void
blur_hor ( g l o b a l uchar * im1 , g l o b a l uchar * im0 , g l o b a l f l o a t * bvec ,
int ww, int hh , int r r )
/* @spec b l u r _ h o r (ww, hh , r r ) */
{
int i i = g e t _ g l o b a l _ i d ( 0 ) ;
int j j = g e t _ g l o b a l _ i d ( 1 ) ;
f l o a t pp = 0 . 0 f ;
for ( int kk = −r r ; kk <= r r ; ++kk ) {
int j j 0 = clamp ( j j + kk , 0 , ww − 1 ) ;
pp += bvec [ kk + r r ] * im0 [ww* i i + j j 0 ] ;
}
}
im1 [ww* i i + j j ] = clamp ( ( int ) round ( pp ) , 0 , 2 5 5 ) ;
kernel
void
bl ur _v er ( g l o b a l uchar * im1 , g l o b a l uchar * im0 , g l o b a l f l o a t * bvec ,
int ww, int hh , int r r )
/* @spec b l u r _ v e r (ww, hh , r r ) */
{
int i i = g e t _ g l o b a l _ i d ( 0 ) ;
int j j = g e t _ g l o b a l _ i d ( 1 ) ;
f l o a t pp = 0 . 0 f ;
for ( int kk = −r r ; kk <= r r ; ++kk ) {
int i i 0 = clamp ( i i + kk , 0 , hh − 1 ) ;
pp += bvec [ kk + r r ] * im0 [ww* i i 0 + j j ] ;
}
}
im1 [ww* i i + j j ] = clamp ( ( int ) round ( pp ) , 0 , 2 5 5 ) ;
94
B.1.5 Matrix Multiplication Kernel
Listing B.1.5: fmma-kernel.cl: OpenCL kernel to perform matrix multiplication
1
2
/* fmma : F l a o t i n g p o i n t m a t r i x m u l t i p l y and a c c u m u l a t e
* Performs a m a t r i x m u l t i p l i c a t i o n i f t h e r e s u l t m a t r i x C i s a l l
z e r o s . */
3
4 k e r n e l void
5 fmma( g l o b a l f l o a t *C, g l o b a l f l o a t *A, g l o b a l f l o a t *B, int nn ,
int s p i n )
6 /* @spec fmma( nn , s p i n ) */
7 {
8
const int xx = g e t _ g l o b a l _ i d ( 0 ) ;
9
const int yy = g e t _ g l o b a l _ i d ( 1 ) ;
10
11
for ( long i t = 0 ; i t < s p i n ; ++i t ) {
12
f l o a t sum = C[ nn * yy + xx ] ;
13
14
for ( int kk = 0 ; kk < nn ; ++kk ) {
15
sum += A[ nn * yy + kk ] * B [ nn * kk + xx ] ;
16
}
17
18
C[ nn * yy + xx ] = sum ;
19
}
20 }
B.1.6 Rodinia Mandelbrot Kernel
Listing B.1.6: mandelbrot-kernel.cl: OpenCL kernel to generate pictures of
the mandelbrot set
1
2
3
4
5
6
/*
C o p y r i g h t ( c ) 2009 David B u c c i a r e l l i ( d a v i b u @ i n t e r f r e e . i t )
P e r m i s s i o n i s h e r e b y g r a n t e d , f r e e o f charge , t o any p e r s o n
obtaining
a copy o f t h i s s o f t w a r e and a s s o c i a t e d documentation f i l e s ( t h e
” Software ” ) , to deal in the Software without r e s t r i c t i o n ,
including
95
7
8
9
10
11
12
w i t h o u t l i m i t a t i o n t h e r i g h t s t o use , copy , modify , merge ,
publish ,
d i s t r i b u t e , s u b l i c e n s e , and/ or s e l l c o p i e s o f t h e S o f t w a r e , and t o
p e r m i t p e r s o n s t o whom t h e S o f t w a r e i s f u r n i s h e d t o do so ,
sub jec t to
the following conditions :
The above c o p y r i g h t n o t i c e and t h i s p e r m i s s i o n n o t i c e s h a l l be
included
i n a l l c o p i e s or s u b s t a n t i a l p o r t i o n s o f t h e S o f t w a r e .
13
14
15 THE SOFTWARE IS PROVIDED ”AS IS ” , WITHOUT WARRANTY OF ANY KIND,
16 EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
17 MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT.
18 IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR
ANY
19 CLAIM, DAMAGES OR OTHER LIABILITY , WHETHER IN AN ACTION OF
CONTRACT,
20 TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
21 SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
22 */
23
24 __kernel void mandelGPU (
25
__global int * p i x e l s ,
26
const int width ,
27
const int h e i g h t ,
28
const f l o a t s c a l e ,
29
const f l o a t o f f s e t X ,
30
const f l o a t o f f s e t Y ,
31
const int m a x I t e r a t i o n s
32
)
33 /* @spec mandelGPU ( width , h e i g h t , s c a l e , o f f s e t X , o f f s e t Y ,
m a x I t e r a t i o n s ) */
34 {
35
const int g i d = g e t _ g l o b a l _ i d ( 0 ) ;
36
const int g i d 4 = 4 * g i d ;
37
const int maxSize = max( width , h e i g h t ) ;
38
const f l o a t kx = ( s c a l e / 2 . f ) * width ;
39
const f l o a t ky = ( s c a l e / 2 . f ) * h e i g h t ;
40
41
int t ;
42
unsigned int i t e r [ 4 ] ;
43
for ( t = 0 ; t < 4 ; ++t ) {
44
const int t i d = g i d 4 + t ;
45
46
const int screenX = t i d % width ;
47
const int screenY = t i d / width ;
96
48
49
50
51
52
53
// Check i f we have s o m e t h i n g t o do
i f ( screenY >= h e i g h t )
return ;
const f l o a t x0 = ( ( screenX * s c a l e ) − kx ) /
maxSize + o f f s e t X ;
const f l o a t y0 = ( ( screenY * s c a l e ) − ky ) /
maxSize + o f f s e t Y ;
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
float
float
float
float
x = x0 ;
y = y0 ;
x2 = x * x ;
y2 = y * y ;
int i i = 0 ;
for ( i i = 0 ; ( x2 + y2 <= 4 . f ) && ( i i <
m a x I t e r a t i o n s ) ; ++i i ) {
y = 2 * x * y + y0 ;
x = x2 − y2 + x0 ;
}
iter [ t ] = ii ;
}
}
x2 = x * x ;
y2 = y * y ;
i f ( i t e r [ t ] == m a x I t e r a t i o n s )
iter [ t ] = 0;
else {
i t e r [ t ] = i t e r [ t ] % 512;
i f ( i t e r [ t ] > 255)
i t e r [ t ] = 511 − i t e r [ t ] ;
}
p i x e l s [ gid ] = i t e r [ 0 ]
( iter
( iter
( iter
|
[ 1 ] << 8 ) |
[ 2 ] << 1 6 ) |
[ 3 ] << 2 4 ) ;
97
B.2 Bacon Listings

B.2.1 Bacon Grammar
Listing B.2.1: Parse::Yacc grammar for Bacon C without Perl code.
1 # Based on ANSI C yacc grammar o r i g i o n a l l y by J e f f Lee , taken from
2 # h t t p : //www. l y s a t o r . l i u . s e / c /ANSI−C−grammar−y . html
3
4 %s t a r t t r a n s l a t i o n _ u n i t
5
6 /*
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
e x p r e s s i o n s */
primary_expression
: CONSTANT
| IDENTIFIER
| STRING
| ’ ( ’ expression ’ ) ’
;
postfix_expression
: primary_expression
| IDENTIFIER ’ [ ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ] ’
| IDENTIFIER ’ ( ’ ’ ) ’
| IDENTIFIER ’ ( ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ) ’
| IDENTIFIER ’ . ’ IDENTIFIER
| p o s t f i x _ e x p r e s s i o n PTR_OP IDENTIFIER
| p o s t f i x _ e x p r e s s i o n INC_OP
| p o s t f i x _ e x p r e s s i o n DEC_OP
;
argument_expression_list
: assignment_expression
| argument_expression_list ’ , ’ assignment_expression
;
unary_expression
: postfix_expression
| INC_OP u n a r y _ e x p r e s s i o n
| DEC_OP u n a r y _ e x p r e s s i o n
| unary_operator c a s t _ e x p r e s s i o n
| SIZEOF u n a r y _ e x p r e s s i o n
98
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
| SIZEOF ’ ( ’ t y p e _ s p e c i f i e r ’ ) ’
;
unary_operator
: ’& ’
| ’* ’
| ’+ ’
| ’− ’
| ’~ ’
| ’! ’
;
cast_expression
: unary_expression
| ’ ( ’ type_specifier ’ ) ’ cast_expression
;
multiplicative_expression
: cast_expression
| multiplicative_expression ’* ’ cast_expression
| multiplicative_expression ’/ ’ cast_expression
| m u l t i p l i c a t i v e _ e x p r e s s i o n ’%’ c a s t _ e x p r e s s i o n
;
additive_expression
: multiplicative_expression
| a d d i t i v e _ e x p r e s s i o n ’+ ’ m u l t i p l i c a t i v e _ e x p r e s s i o n
| a d d i t i v e _ e x p r e s s i o n ’− ’ m u l t i p l i c a t i v e _ e x p r e s s i o n
;
shift_expression
: additive_expression
| s h i f t _ e x p r e s s i o n LEFT_OP a d d i t i v e _ e x p r e s s i o n
| s h i f t _ e x p r e s s i o n RIGHT_OP a d d i t i v e _ e x p r e s s i o n
;
relational_expression
: shift_expression
| relational_expression
| relational_expression
| relational_expression
| relational_expression
;
’< ’ s h i f t _ e x p r e s s i o n
’> ’ s h i f t _ e x p r e s s i o n
LE_OP s h i f t _ e x p r e s s i o n
GE_OP s h i f t _ e x p r e s s i o n
equality_expression
: relational_expression
| e q u a l i t y _ e x p r e s s i o n EQ_OP r e l a t i o n a l _ e x p r e s s i o n
| e q u a l i t y _ e x p r e s s i o n NE_OP r e l a t i o n a l _ e x p r e s s i o n
99
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
;
and_expression
: equality_expression
| a n d _ e x p r e s s i o n ’& ’ e q u a l i t y _ e x p r e s s i o n
;
exclusive_or_expression
: and_expression
| exclusive_or_expression ’^ ’ and_expression
;
inclusive_or_expression
: exclusive_or_expression
| inclusive_or_expression ’ | ’ exclusive_or_expression
;
logical_and_expression
: inclusive_or_expression
| l o g i c a l _ a n d _ e x p r e s s i o n AND_OP i n c l u s i v e _ o r _ e x p r e s s i o n
;
logical_or_expression
: logical_and_expression
| l o g i c a l _ o r _ e x p r e s s i o n OR_OP l o g i c a l _ a n d _ e x p r e s s i o n
;
conditional_expression
: logical_or_expression
| logical_or_expression ’? ’ expression ’ : ’
conditional_expression
;
assignment_expression
: conditional_expression
| unary_expression assignment_operator assignment_expression
;
assignment_operator
: ’= ’
| MUL_ASSIGN
| DIV_ASSIGN
| MOD_ASSIGN
| ADD_ASSIGN
| SUB_ASSIGN
| LEFT_ASSIGN
| RIGHT_ASSIGN
| AND_ASSIGN
100
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
| XOR_ASSIGN
| OR_ASSIGN
;
expression
: assignment_expression
| expression ’ , ’ assignment_expression
;
constant_expression
: conditional_expression
;
/*
d e c l a r a t i o n s */
declaration
: type_specifier declarator_list ’ ; ’
;
declarator_list
: declarator
| declarator_list ’ , ’ declarator
;
declarator
: IDENTIFIER
| IDENTIFIER ’ [ ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ] ’
| IDENTIFIER ’= ’ i n i t i a l i z e r
;
/*
t y p e s */
type_qualifier
: CONST
| VOLATILE
;
simple_type
: CHAR
| UCHAR
| SHORT
| USHORT
| INT
| UINT
101
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
|
|
|
|
;
LONG
ULONG
FLOAT
DOUBLE
parameterized_type
: ARRAY
| ARRAY2D
| ARRAY3D
| ARRAY2Z
| ARRAY3Z
| IMAGE2D
| IMAGE3D
;
type_specifier
: parameterized_type_specifier
| simple_type
| VOID
| pointer_type
| type_qualifier type_specifier
;
parameterized_type_specifier
: p a r a m e t e r i z e d _ t y p e ’< ’ simple_type ’> ’
| s c o p e _ s p e c i f i e r p a r a m e t e r i z e d _ t y p e ’< ’ simple_type ’> ’
;
pointer_type
: simple_type ’ * ’
| VOID ’ * ’
| s c o p e _ s p e c i f i e r simple_type ’ * ’
| s c o p e _ s p e c i f i e r VOID ’ * ’
;
scope_specifier
: PRIVATE
| LOCAL
| GLOBAL
;
function_declarator
: IDENTIFIER ’ ( ’ p a r a m e t e r _ l i s t ’ ) ’
| IDENTIFIER ’ ( ’ ’ ) ’
;
102
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
parameter_list
: parameter_declaration
| parameter_list ’ , ’ parameter_declaration
;
parameter_declaration
: t y p e _ s p e c i f i e r IDENTIFIER
;
initializer
: assignment_expression
| ’{ ’ initializer_list ’} ’
| ’{ ’ initializer_list ’ , ’ ’} ’
;
initializer_list
: initializer
| initializer_list
;
’ , ’ initializer
/*
s t a t e m e n t s */
statement
: labeled_statement
| compound_statement
| expression_statement
| selection_statement
| iteration_statement
| jump_statement
| error_statement
;
labeled_statement
: IDENTIFIER ’ : ’ s t a t e m e n t
| CASE c o n s t a n t _ e x p r e s s i o n ’ : ’ s t a t e m e n t
| DEFAULT ’ : ’ s t a t e m e n t
;
compound_statement
: ’{ ’ ’} ’
| ’{ ’ statement_list ’} ’
;
statement_list
: statement
| declaration
103
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
| statement_list statement
| statement_list declaration
;
expression_statement
: ’; ’
| expression ’ ; ’
;
selection_statement
: IF ’ ( ’ e x p r e s s i o n ’ ) ’ s t a t e m e n t
| IF ’ ( ’ e x p r e s s i o n ’ ) ’ s t a t e m e n t ELSE s t a t e m e n t
| SWITCH ’ ( ’ e x p r e s s i o n ’ ) ’ s t a t e m e n t
;
for_var_init
: expression_statement
| t y p e _ s p e c i f i e r IDENTIFIER ’= ’ e x p r e s s i o n _ s t a t e m e n t
;
iteration_statement
: WHILE ’ ( ’ e x p r e s s i o n ’ ) ’ s t a t e m e n t
| DO s t a t e m e n t WHILE ’ ( ’ e x p r e s s i o n ’ ) ’ ’ ; ’
| FOR ’ ( ’ f o r _ v a r _ i n i t e x p r e s s i o n _ s t a t e m e n t ’ ) ’ s t a t e m e n t
| FOR ’ ( ’ f o r _ v a r _ i n i t e x p r e s s i o n _ s t a t e m e n t e x p r e s s i o n ’ ) ’
statement
;
jump_statement
: GOTO IDENTIFIER ’ ; ’
| CONTINUE ’ ; ’
| BREAK ’ ; ’
| RETURN ’ ; ’
| RETURN e x p r e s s i o n ’ ; ’
;
error_statement
: FAIL ’ ( ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ) ’ ’ ; ’
| ASSERT ’ ( ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ) ’ ’ ; ’
;
/*
t o p l e v e l d e c l a r a t i o n s */
translation_unit
: external_declaration
| translation_unit external_declaration
104
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
;
external_declaration
: kernel_definition
| function_definition
| declaration
;
kernel_range_spec
: RANGE ’ [ ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ] ’ ’ ; ’
;
kernel_group_spec
: GROUP ’ [ ’ a r g u m e n t _ e x p r e s s i o n _ l i s t ’ ] ’ ’ ; ’
;
kernel_ranges
: kernel_range_spec
| kernel_group_spec
| k e r n e l _ r a n g e _ s p e c kernel_group_spec
;
kernel_body
: ’ { ’ SETUP s t a t e m e n t _ l i s t BODY k e r n e l _ r a n g e s s t a t e m e n t _ l i s t
’} ’
| ’ { ’ BODY k e r n e l _ r a n g e s s t a t e m e n t _ l i s t ’ } ’
;
339
340
341
342 k e r n e l _ d e f i n i t i o n
343
: KERNEL t y p e _ s p e c i f i e r f u n c t i o n _ d e c l a r a t o r kernel_body
344
;
345
346 f u n c t i o n _ d e f i n i t i o n
347
: t y p e _ s p e c i f i e r f u n c t i o n _ d e c l a r a t o r compound_statement
348
;
349
350 %%
B.2.2 Blocked Matrix Multiply
Listing B.2.2: blocked-mm.bc: Matrix multiplication in Bacon with block size
as a parameter
1
kernel
105
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
Array2D<f l o a t >
blocked_mat_mul ( Array2D<f l o a t > aa , Array2D<f l o a t > bb , const u i n t
blksz )
{
SETUP:
g l o b a l Array2D<f l o a t > c c [ aa . rows , bb . c o l s ] ;
BODY:
@range [ c c . rows / b l k s z , c c . c o l s / b l k s z ] ;
p r i v a t e Array2D<f l o a t > sum [ b l k s z , b l k s z ] ;
int i i , j j , kk , gg ;
for ( i i = 0 ; i i < b l k s z ; ++i i ) {
for ( j j = 0 ; j j < b l k s z ; ++j j ) {
sum [ i i , j j ] = 0 . 0 ;
}
}
int b a s e _ i i = $row * b l k s z ;
int b a s e _ j j = $ c o l * b l k s z ;
int base_kk ;
for ( gg = 0 ; gg < aa . c o l s / b l k s z ; ++gg ) {
base_kk = gg * b l k s z ;
}
// These l o o p s a r e shown u n r o l l e d b e l o w .
for ( i i = 0 ; i i < b l k s z ; ++i i ) {
for ( j j = 0 ; j j < b l k s z ; ++j j ) {
for ( kk = 0 ; kk < b l k s z ; ++kk ) {
sum [ i i , j j ] += aa [ b a s e _ i i + i i , base_kk + kk ] *
bb [ base_kk + kk , b a s e _ j j + j j ] ;
}
}
}
for ( i i = 0 ; i i < b l k s z ; ++i i ) {
for ( j j = 0 ; j j < b l k s z ; ++j j ) {
c c [ b a s e _ i i + i i , b a s e _ j j + j j ] = sum [ i i , j j ] ;
}
}
}
return c c ;
// U n r o l l i n g r e s u l t s f o r b l k s z = 2
106
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
sum [ 0 , 0 ] +=
*
sum [ 0 , 0 ] +=
*
sum [ 0 , 1 ] +=
*
sum [ 0 , 1 ] +=
*
sum [ 1 , 0 ] +=
*
sum [ 1 , 0 ] +=
*
sum [ 1 , 1 ] +=
*
sum [ 1 , 1 ] +=
*
aa [ b a s e _ i i , base_kk ]
bb [ base_kk , b a s e _ j j ] ;
aa [ b a s e _ i i , base_kk + 1 ]
bb [ base_kk + 1 , b a s e _ j j ] ;
aa [ b a s e _ i i , base_kk ]
bb [ base_kk , b a s e _ j j + 1 ] ;
aa [ b a s e _ i i , base_kk + 1 ]
bb [ base_kk + 1 , b a s e _ j j +
aa [ b a s e _ i i + 1 , base_kk ]
bb [ base_kk , b a s e _ j j ] ;
aa [ b a s e _ i i + 1 , base_kk +
bb [ base_kk + 1 , b a s e _ j j ] ;
aa [ b a s e _ i i + 1 , base_kk ]
bb [ base_kk , b a s e _ j j + 1 ] ;
aa [ b a s e _ i i + 1 , base_kk +
bb [ base_kk + 1 , b a s e _ j j +
1];
1]
1]
1];
B.3 Serial Kernel Listings

B.3.1 Serial Gaussian Blur

Listing B.3.1: blur.c: Serial implementation of gaussian blur.
#include <stdlib.h>
#include <math.h>

double
gauss(double x, double mu, double sigma)
{
    double aa = 1.0 / (sigma * sqrt(2.0 * M_PI));
    double bbT = -pow(x - mu, 2.0);
    double bbB = 2 * pow(sigma, 2.0);
    double bb = bbT / bbB;
    return aa * exp(bb);
}

int
clamp(int xx, int x0, int x1)
{
    if (xx < x0) return x0;
    if (xx > x1) return x1;
    return xx;
}

void
blur(byte* im0, size_t ww, size_t hh, int sigma)
{
    int rr = 3 * sigma;

    /* Generate the blur vector */
    double* bvec = malloc((2 * rr + 1) * sizeof(double));
    double bsum = 0.0;

    for (int kk = -rr; kk <= rr; ++kk) {
        int ii = kk + rr;
        bvec[ii] = gauss(kk, 0.0, sigma);
        bsum += bvec[ii];
    }

    for (int ii = 0; ii < 2 * rr + 1; ++ii) {
        bvec[ii] *= 1.0 / bsum;
    }

    byte* im1 = malloc(ww * hh * sizeof(byte));

    /* Blur im0 horizontally into im1 */
    for (int ii = 0; ii < hh; ++ii) {
        for (int jj = 0; jj < ww; ++jj) {
            double p1 = 0.0;

            for (int kk = -rr; kk <= rr; ++kk) {
                int jj0 = clamp(jj + kk, 0, ww - 1);
                p1 += bvec[kk + rr] * im0[ww*ii + jj0];
            }

            im1[ww*ii + jj] = clamp(round(p1), 0, 255);
        }
    }

    /* Blur im1 vertically back into im0 */
    for (int jj = 0; jj < ww; ++jj) {
        for (int ii = 0; ii < hh; ++ii) {
            double p0 = 0.0;

            for (int kk = -rr; kk <= rr; ++kk) {
                int ii0 = clamp(ii + kk, 0, hh - 1);
                p0 += bvec[kk + rr] * im1[ww*ii0 + jj];
            }

            im0[ww*ii + jj] = clamp(round(p0), 0, 255);
        }
    }

    free(im1);
    free(bvec);
}
B.3.2 Serial Matrix Multiply
Listing B.3.2: mmul.c: Serial implementation of matrix multiply.
/* A, B, C are square matrixes. */
/* C = A * B */

void
mmul(double* C, double* A, double* B, int nn)
{
    for (int ii = 0; ii < nn; ++ii) {
        for (int jj = 0; jj < nn; ++jj) {
            double sum = 0.0;

            for (int kk = 0; kk < nn; ++kk) {
                sum += A[nn * ii + kk] * B[nn * kk + jj];
            }

            C[nn * ii + jj] = sum;
        }
    }
}
B.3.3 Serial Integer Exponentiation
Listing B.3.3: ipow.c: Serial implementation of simple integer exponentiation.
int
ipow(int xx, int kk)
{
    int yy = 1;

    for (int ii = 0; ii < kk; ++ii)
        yy *= xx;

    return yy;
}
B.4 Specializer Listings

B.4.1 LLVM Specialize Pass
Listing B.4.1: SpecPass.cc: Specialization as an LLVM module pass
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#include ” llvm / Pass . h”
#include ” llvm / PassSupport . h”
#include ” llvm / Support /CommandLine . h”
#include
#include
#include
#include
#include
#include
#include
#include
” llvm /IR/Argument . h”
” llvm /IR/ Constants . h”
” llvm /IR/ DerivedTypes . h”
” llvm /IR/ Function . h”
” llvm /IR/ G l o b a l V a r i a b l e . h”
” llvm /IR/ I n s t r u c t i o n s . h”
” llvm /IR/Module . h”
” llvm / Support / raw_ostream . h”
#include
#include
#include
#include
<s t d i o . h>
<s t d i n t . h>
<map>
<s t r i n g >
#include ” cake / l s t r i n g . h”
#include ” cake / s p e c . h”
#include ” cake / u t i l . h”
u s i n g namespace llvm ;
u s i n g namespace s t d ;
c l : : opt<s t r i n g > S p e c I n f o ( ” spec−i n f o ” ,
c l : : d e s c ( ” Spec I n f o ” ) ,
c l : : v al ue _d e sc ( ” s p e c i n f o f i l e ” ) ,
cl : : i n i t ( ”” ) ) ;
c l : : opt<s t r i n g > SpecText ( ” spec−t e x t ” ,
c l : : d e s c ( ” Spec Text ” ) ,
c l : : v al ue _d e sc ( ” s p e c s p e c s t r i n g ” ) ,
110
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
cl : : i n i t ( ”” ) ) ;
extern c l : : opt<s t r i n g > KernelName ;
namespace {
c l a s s S p e c i a l i z e : p u b l i c ModulePass {
public :
s t a t i c char ID ;
S p e c i a l i z e ( ) : ModulePass ( ID ) {}
}
};
v i r t u a l void g e t A n a l y s i s U s a g e ( A n a l y s i s U s a g e &AU) const ;
v i r t u a l b o o l runOnModule ( Module &M) ;
char S p e c i a l i z e : : ID = 0 ;
s t a t i c R e g i s t e r P a s s <S p e c i a l i z e > X( ” s p e c i a l i z e ” , ” S p e c i a l i z e
k e r n e l on a r g s ” ) ;
void
S p e c i a l i z e : : g e t A n a l y s i s U s a g e ( A n a l y s i s U s a g e& au ) const
{
// Do n o t h i n g .
}
static
char*
type_name ( Type* t t )
{
std : : s t r i n g buffer ;
llvm : : raw_string_ostream stream ( b u f f e r ) ;
t t −>p r i n t ( stream ) ;
return l s t r d u p ( stream . s t r ( ) . c _ s t r ( ) ) ;
}
bool
S p e c i a l i z e : : runOnModule ( Module& M)
{
auto F = M. g e t F u n c t i o n ( KernelName ) ;
i f (F == 0 ) {
c a r p ( ”No such k e r n e l , g i v i n g up . ” ) ;
}
/* Read s p e c . i n f o */
111
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
spec_info * info ;
i f ( SpecText . g e t V a l u e ( ) != ” ” ) {
p r i n t f ( ” S p e c i a l i z e k e r n e l %s with s t r i n g %s \n” ,
KernelName . g e t V a l u e ( ) . c _ s t r ( ) ,
SpecInfo . getValue ( ) . c_str ( ) ) ;
}
int arg_count = F−>a r g _ s i z e ( ) ;
char ** arg_names = ( char * * ) a l l o c a ( arg_count *
s i z e o f ( char * ) ) ;
int i i = 0 ;
for ( auto i t = F−>arg_begin ( ) ; i t != F−>arg_end ( ) ; ++i t ) {
arg_names [ i i ] = l s t r d u p ( i t −>getName ( ) . s t r ( ) . c _ s t r ( ) ) ;
i i += 1 ;
}
i n f o = p a r s e _ s p e c _ t e x t ( arg_names , arg_count ,
SpecText . g e t V a l u e ( ) . c _ s t r ( ) ) ;
i f ( S p e c I n f o . g e t V a l u e ( ) != ” ” ) {
p r i n t f ( ” S p e c i a l i z e k e r n e l %s with f i l e %s \n” ,
KernelName . g e t V a l u e ( ) . c _ s t r ( ) ,
SpecInfo . getValue ( ) . c_str ( ) ) ;
}
i n f o = read_spec_info ( SpecInfo . getValue ( ) . c_str ( ) ) ;
/* Find f u n c t i o n a r g s */
p r i n t f ( ” Function a r g s : \ n” ) ;
int i i = 0 ;
for ( auto i t = F−>arg_begin ( ) ; i t != F−>arg_end ( ) ; ++i t ) {
spec_arg s a = i n f o −>a r g s [ i i ++];
Type * t t = i t −>getType ( ) ;
s t r i n g name ( i t −>getName ( ) . s t r ( ) ) ;
s t r i n g type ( type_name ( t t ) ) ;
p r i n t f ( ” − Arg %s , type %s \n” ,
i t −>getName ( ) . s t r ( ) . c _ s t r ( ) ,
type . c _ s t r ( ) ) ;
i f ( sa . spec ) {
i f ( type == ” i 6 4 ” ) {
i n t 6 4 _ t vv64 ;
i f ( s a . s i z e == s i z e o f ( i n t 6 4 _ t ) )
vv64 = * ( ( i n t 6 4 _ t * ) s a . v a l u e ) ;
112
e l s e i f ( s a . s i z e == s i z e o f ( i n t 3 2 _ t ) )
vv64 = ( i n t 6 4 _ t ) * ( ( i n t 3 2 _ t * ) s a . v a l u e ) ;
else
c a r p ( ” S i z e mismatch ” ) ;
printf (”
−−> Spec a s i 6 4 with v a l u e %l d \n” ,
vv64 ) ;
Value * sp_vv64 = C o n s t a n t I n t : : g e t S i g n e d ( t t , vv64 ) ;
i t −>r e p l a c e A l l U s e s W i t h ( sp_vv64 ) ;
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
}
}
}
}
e l s e i f ( type == ” i 3 2 ” ) {
a s s e r t ( s a . s i z e == s i z e o f ( i n t 3 2 _ t ) ) ;
i n t 3 2 _ t vv32 = * ( ( i n t 3 2 _ t * ) s a . v a l u e ) ;
printf (”
−−> Spec a s i 3 2 with v a l u e %d\n” ,
vv32 ) ;
Value * sp_vv32 = C o n s t a n t I n t : : g e t S i g n e d ( t t , vv32 ) ;
i t −>r e p l a c e A l l U s e s W i t h ( sp_vv32 ) ;
}
e l s e i f ( type == ” double ” ) {
a s s e r t ( s a . s i z e == s i z e o f ( double ) ) ;
double vvd = * ( ( double * ) s a . v a l u e ) ;
printf (”
−−> Spec a s double with v a l u e
%.02 f \n” , vvd ) ;
Value * spd = ConstantFP : : g e t ( t t , vvd ) ;
i t −>r e p l a c e A l l U s e s W i t h ( spd ) ;
}
e l s e i f ( type == ” f l o a t ” ) {
a s s e r t ( s a . s i z e == s i z e o f ( f l o a t ) ) ;
float vvf = *(( float *) sa . value ) ;
printf (”
−−> Spec a s f l o a t with v a l u e %.02 f \n” ,
vvf ) ;
Value * s p f = ConstantFP : : g e t ( t t , v v f ) ;
i t −>r e p l a c e A l l U s e s W i t h ( s p f ) ;
}
else {
char* error_msg = l s p r i n t f (
”Can ’ t s p e c i a l i z e on a v a l u e o f type %s . ” ,
type . c _ s t r ( ) ) ;
c a r p ( error_msg ) ;
}
return f a l s e ;
113
Bibliography
[1] Tegra 3 Processor. url: http://www.nvidia.com/object/tegra-3-processor.html.
[2] OpenCL: The open standard for parallel programming of heterogeneous
systems. url: http://www.khronos.org/opencl/ (visited on Mar. 1,
2014).
[3] The Computer Language Benchmark Game. url: http://shootout.alioth.debian.org/.
[4] Y. Futamura. “Partial Evaluation of Computation Process – An Approach to a Compiler-Compiler”. In: Systems, Computers, Controls 2.5
(1971), pp. 45–50.
[5] Intel Core i7-4960X Specifications. url: http://ark.intel.com/products/77779 (visited on Feb. 28, 2014).
[6] AMD FX-9590 Specifications. url: http://www.amd.com/us/products/desktop/processors/amdfx/Pages/amdfx.aspx#2 (visited on Sept. 9, 2014). Archived at: http://www.webcitation.org/6NR9Ivy3f.
[7] Various. “Intel Hyper Threading Technology Architecture and Microarchitecture”. In: Intel Technology Journal. Vol. 06. 1. 2002.
[8] Auto-Vectorization in GCC. url: http://gcc.gnu.org/projects/tree-ssa/vectorization.html (visited on Mar. 1, 2014).
[9] Auto-Vectorization in LLVM. url: http://llvm.org/docs/Vectorizers.
html (visited on Mar. 1, 2014).
[10] AMD R9 Series Specifications. url: http://www.amd.com/us/products/
desktop/graphics/R9/Pages/amd-radeon-hd-r9-series.aspx (visited on Feb. 28, 2014).
[11] Radeon R9 290X Review. url: http://www.anandtech.com/show/7457
(visited on Feb. 28, 2014).
[12] GeForce GTX TITAN Black Specifications. url: http://www.geforce.
com/hardware/desktop-gpus/geforce-gtx-titan-black/specifications
(visited on Feb. 28, 2014).
[13] GTX TITAN Review. url: http://www.anandtech.com/show/6760
(visited on Feb. 28, 2014).
[14] The History of the Modern Graphics Processor. url: http://www.techspot.com/article/650-history-of-the-gpu/ (visited on Feb. 28, 2014).
[15] CUDA Parallel Computing Platform. url: http://www.nvidia.com/
object/cuda_home_new.html (visited on Mar. 1, 2014).
[16] Compute Shader Overview (Windows). url: http://msdn.microsoft.com/en-us/library/windows/desktop/ff476331(v=vs.85).aspx (visited on Mar. 1, 2014).
[17] Intel Core i7 4770R Specifications. url: http://ark.intel.com/products/76642 (visited on Feb. 28, 2014).
[18] Anandtech Iris Pro Review. url: http://www.anandtech.com/show/6993 (visited on Feb. 28, 2014).
[19] Anandtech Galaxy S 4 Review. url: http://www.anandtech.com/show/6914 (visited on Feb. 28, 2014).
[20] Anandtech iPhone 5S Review. url: http://anandtech.com/show/7335 (visited on Feb. 28, 2014).
[21] Smart phones overtake client PCs in 2011. url: http://www.canalys.com/newsroom/smart-phones-overtake-client-pcs-2011 (visited on Feb. 28, 2014).
[22] Intel Xeon Phi 7120P Specifications. url: http://ark.intel.com/products/75799 (visited on Mar. 1, 2014).
[23] Tilera TILEncore-Gx72 Product Brief. url: http://www.tilera.com/sites/default/files/images/products/TILEncore-Gx72_PB043Rel_1_2_Web.pdf (visited on Mar. 1, 2014).
[24] Tilera TILE-Gx8072 Processor Product Brief. url: http://www.tilera.com/sites/default/files/productbriefs/TILE-Gx8072_PB041-03_WEB.pdf (visited on Mar. 1, 2014).
[25] W. Gropp et al. “A high-performance, portable implementation of the
MPI message passing interface standard”. In: Parallel computing 22.6
(1996), pp. 789–828.
[26] PVM: Parallel Virtual Machine. url: http://www.csm.ornl.gov/pvm/
(visited on Mar. 2, 2014).
[27] B.L. Chamberlain et al. “ZPL: A machine independent programming
language for parallel computers”. In: Software Engineering, IEEE Transactions on 26.3 (2000), pp. 197–211.
[28] B.L. Chamberlain, D. Callahan, and H.P. Zima. “Parallel programmability and the Chapel language”. In: International Journal of High Performance Computing Applications 21.3 (2007), pp. 291–312.
[29] L. Dagum and R. Menon. “OpenMP: an industry standard API for
shared-memory programming”. In: Computational Science & Engineering, IEEE 5.1 (1998), pp. 46–55.
[30] Intel Cilk Plus. url: http://software.intel.com/en-us/intel-cilk-plus (visited on Mar. 2, 2014).
[31] Intel Threading Building Blocks. url: https://www.threadingbuildingblocks.
org/ (visited on Mar. 2, 2014).
[32] Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing. url: https://www.khronos.org/news/press/khronos-finalizes-opencl-2.0-specification-for-heterogeneous-computing (visited on Mar. 2, 2014).
[33] Nvidia. “CUDA Programming Guide”. In: (2007).
[34] Khronos Releases OpenGL 4.3 Specification with Major Enhancements. url: https://www.khronos.org/news/press/khronos-releases-opengl-4.3-specification-with-major-enhancements (visited on Mar. 2, 2014).
[35] RenderScript. url: http://developer.android.com/guide/topics/
renderscript/compute.html (visited on Mar. 2, 2014).
[36] Aparapi API for data parallel Java. url: http://code.google.com/p/
aparapi/ (visited on Mar. 2, 2014).
[37] Clyther Development Documentation. url: http://srossross.github.
io/Clyther/ (visited on Mar. 2, 2014).
[38] Aaftab Munshi, ed. The OpenCL Specification 2.0. url: http://www.
khronos.org/registry/cl/specs/opencl-2.0.pdf (visited on Mar. 2,
2014).
[39] Aaftab Munshi, ed. The OpenCL Specification 1.2. url: http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf (visited on Mar. 2, 2014).
[40] OpenCL for OS X. url: https://developer.apple.com/opencl/ (visited on Mar. 2, 2014).
[41] AMD Accelerated Parallel Processing SDK. url: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/ (visited on Mar. 2, 2014).
[42] Nvidia CUDA: OpenCL Support. url: https://developer.nvidia.
com/opencl (visited on Mar. 2, 2014).
[43] Intel SDK for OpenCL. url: http://software.intel.com/en-us/vcsource/tools/opencl-sdk (visited on Mar. 2, 2014).
[44] The Mesa 3D Graphics Library. url: http://www.mesa3d.org/ (visited
on Mar. 2, 2014).
[45] The Mesa GalliumCompute Wiki Page. url: http://dri.freedesktop.
org/wiki/GalliumCompute/ (visited on Mar. 2, 2014).
[46] Freedesktop.org Beignet Wiki Page. url: http://www.freedesktop.
org/wiki/Software/Beignet/ (visited on Mar. 2, 2014).
[47] Portable Computing Language. url: http://pocl.sourceforge.net/
(visited on Mar. 2, 2014).
[48] L.O. Andersen. “C Program Specialization”. MA thesis. DIKU, University of Copenhagen, Denmark, Dec. 1991.
[49] C. Consel et al. “Tempo: Specializing Systems Applications and Beyond
C”. In: ACM Comput. Surv. 30 (3es Sept. 1998). issn: 0360-0300.
[50] C. Consel et al. “A Uniform Approach for Compile-Time and Run-Time
Specialization”. In: 1996, pp. 54–72.
[51] T. Kotzmann et al. “Design of the Java HotSpot™ client compiler for
Java 6”. In: ACM Transactions on Architecture and Code Optimization
(TACO) 5.1 (2008), p. 7.
[52] Aaftab Munshi, ed. The OpenCL Specification 1.1. url: http://www.
khronos.org/opencl/.
[53] C. Chambers. “The design and implementation of the self compiler, an
optimizing compiler for object-oriented programming languages”. PhD
thesis. Stanford University, 1992.
[54] POCL Kernel Compiler Design Notes. url: http://pocl.sourceforge.
net/docs/html/kernel_compiler.html (visited on Mar. 3, 2014).
[55] A cool use of LLVM at Apple: the OpenGL stack. url: http://lists.
cs.uiuc.edu/pipermail/llvmdev/2006-August/006492.html.
[56] B. Catanzaro, M. Garland, and K. Keutzer. “Copperhead: Compiling
an embedded data parallel language”. In: Principles and Practices of
Parallel Programming, PPoPP 11 (2011), pp. 47–56.
[57] Copperhead: Data Parallel Python. url: http://copperhead.github.
io/ (visited on Mar. 4, 2014).
[58] B. Catanzaro et al. “SEJITS: Getting productivity and performance
with selective embedded JIT specialization”. In: First Workshop on Programmable Models for Emerging Architecture at the 18th International
Conference on Parallel Architectures and Compilation Techniques. 2009.
[59] Nat Tuck. “Bacon: A GPU Programming System With Just in Time
Specialization”. In: Proc. PDPTA 3037. 2012.
[60] Francois Desarmenien. Parse::Yapp. url: http://search.cpan.org/
~fdesar/.
[61] R.F. Van der Wijngaart and P. Wong. NAS parallel benchmarks version
2.4. Tech. rep. NAS Technical Report NAS-02-007, 2002.
[62] S. Seo, G. Jo, and J. Lee. “Performance characterization of the NAS Parallel Benchmarks in OpenCL”. In: Workload Characterization (IISWC),
2011 IEEE International Symposium on. IEEE. 2011, pp. 137–148. url:
http://aces.snu.ac.kr/Center_for_Manycore_Programming/SNU_
NPB_Suite.html.
[63] The Rodinia Benchmark Suite. url: https://www.cs.virginia.edu/
~skadron/wiki/rodinia/index.php/Main_Page.
[64] P.O. Jaaskelainen et al. “OpenCL-based design methodology for application-specific processors”. In: Embedded Computer Systems (SAMOS), 2010 International Conference on. IEEE. 2010, pp. 223–230.
[65] The LLVM Compiler Infrastructure Project. url: http://llvm.org/.
[66] clang: a C language family frontend for LLVM. url: http://clang.
llvm.org/.
[67] LLVM 3.3 Documentation: Analysis and Transformation Passes. url:
http://llvm.org/releases/3.3/docs/Passes.html.
[68] LLVM 3.3 Documentation: Writing an LLVM Pass. url: http://llvm.
org/releases/3.3/docs/WritingAnLLVMPass.html.
[69] Norman P Jouppi and David W Wall. Available instruction-level parallelism for superscalar and superpipelined machines. Vol. 17. 2. ACM,
1989.
[70] Intel’s Atom Architecture: The Journey Begins. url: http://www.anandtech.com/show/2493 (visited on Mar. 14, 2014).
[71] Bulldozer Has Arrived: AMD FX-8150 Processor Review. url: http://www.xbitlabs.com/articles/cpu/display/amd-fx-8150.html (visited on Mar. 15, 2014).
[72] Chris Lattner. “LLVM: An Infrastructure for Multi-Stage Optimization”.
See http://llvm.cs.uiuc.edu. MA thesis. Urbana, IL: Computer Science Dept., University of Illinois at Urbana-Champaign, Dec. 2002.
[73] Pekka Jääskeläinen. pocl-devel mailing list message on kernel vectorization. url: http://www.mail-archive.com/pocl-devel@lists.sourceforge.net/msg00182.html.