How to select superinstructions for Ruby

How to select superinstructions
for Ruby
ZAKIROV Salikh*, CHIBA Shigeru*,
and SHIBAYAMA Etsuya**
* Tokyo Institute of Technology,
dept. of Mathematical and Computing Sciences
** Tokyo University, Information Technology Center
Ruby
• Dynamic language
• Becoming popular
recently
• Numeric benchmarks
100—1000 times slower
than equivalent
program in C
Numeric benchmarks marked in red
* http://shootout.alioth.debian.org/
2
Interpreter optimization efforts
• Many techniques to optimize interpreter were
proposed
– Threaded interpretation
– Stack top caching
– Pipelining
– Superinstructions
Focus of this presentation
• Superinstructions
– Merge code of operations executed in sequence
3
Superinstructions (contrived example)
PUSH: // put <imm> argument on stack
stack[sp++] = *pc++;
goto **pc++;
ADD: // add two topmost values on stack
sp--;
stack[sp-1] += stack[sp];
goto **pc++;
Optimizations
applied
PUSH_ADD: // add <imm> to stack top
stack[sp-1] += *pc++;
goto **pc++;
PUSH_ADD: // add <imm> to stack top
stack[sp++] = *pc++;
//goto **pc++;
sp--;
stack[sp-1] += stack[sp];
goto **pc++;
Dispatch eliminated
4
Superinstructions (effects)
• Effects
1. Reduce dispatch overhead
a. Eliminate some jumps
b. Provide more context for indirect branch predictorby
replicating indirect jump instructions
2. Allow more optimizations within VM op
5
Good for reducing dispatch overhead
Superinstructions help when:
• VM operations are small (~10 hwop/vmop)
• Dispatch overhead is high (~50%)
Examples of successful use in prior research
• ANSI C interpreter: 2-3 times improvement
(Proebsting 1995)
• Ocaml: more than 50% improvement (Piumarta
1998)
• Forth: 20-80% improvement (Ertl 2003)
6
Ruby does not fit well
Superinstructions help when:
• VM operations are small (~10 hwop/vmop)
• Dispatch overhead is high (~50%)
Only 1-3%
BUT
misprediction overhead
Hardware profiling
data on Intel Core 2 Duo
on interpreter
dispatch
60-140 hardware
ops per VM op
7
Superinstructions for Ruby
• We experimentally evaluated effect of “naive”
superinstructions on Ruby
– Superinstructions are selected statically
– Frequently occurring in training run combinations
of length 2 selected as superinstructions
– Training run uses the same benchmark
– Superinstructions constructed by concatenating C
source code, C compiler optimizations applied
8
Naive superinstructions effect on Ruby
Limited benefit
4 benchmarks
Normalized execution time
Number of superinstructions used
Unpredictable
effects
9
Branch mispredictions
2 benchmarks: mandelbrot and spectral_norm
Normalized execution time
Number of superinstructions used
10
Branch mispredictions, reordered
2 benchmarks: mandelbrot and spectral_norm
Normalized execution time
Number of superinstructions used, reordered by execution time
11
So why Ruby is slow?
• Profile of numeric benchmarks
Garbage collection
takes significant time
Boxed floating point
values dominate
allocation
12
Floating point value boxing
Typical Ruby 1.9 VM operation
OPT_PLUS:
New “box” object is allocated
VALUE a = *(sp-2);
on each operation
VALUE b = *(sp-1);
/* ... */
if (CLASS_OF(a) == Float && CLASS_OF(b) == Float) {
sp--;
*(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b));
} else {
CALL(1/*argnum*/, PLUS, a);
}
goto **pc++;
13
Proposal: use superinstructions for
boxing optimization
• 2 operation per allocation instead of 1
Boxing of intermediate result
eliminated
OPT_MULT_OPT_PLUS:
VALUE a = *(sp-3);
VALUE b = *(sp-2);
VALUE c = *(sp-1);
/* ... */
if (CLASS_OF(a) == Float && CLASS_OF(b) == Float && CLASS_OF(c) == Float) {
sp-=2;
*(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b)*DOUBLE_VALUE(c));
} else {
CALL(1/*argnum*/, MULT/*method*/, b/*receiver*/);
CALL(1/*argnum*/, PLUS/*method*/, a/*receiver*/);
}
goto **pc++;
14
Implementation
• VM operations that handle floating point values
directly:
–
–
–
–
–
opt_plus
opt_minus
opt_mult
opt_div
opt_mod
• We implemented all 25 combinations of length 2
– Based on Ruby 1.9.1
– Using existing Ruby infrastructure for
superinstructions with some modifications
15
Limitations
• Coding style-sensitive
• Not applicable to other types (e.g. Fixnum,
Bignum, String)
– Fixnum is already unboxed
– Bignum and String cannot be unboxed
• Sequences of 3 arithmetic instructions or
longer virtually non-existent
– No occurrences in the benchmarks
16
Evaluation
• Methodology
– median time of 30 runs
• Reduction in allocation
17
Results
• Up to 22% benefit on numeric benchmarks
• No slowdown on other benchmarks
18
Example: mandelbrot tweak
• Slight modification produces 20%
difference in performance
+
+
ITER.times do
tr = zrzr - zizi + cr
tr = cr + (zrzr - zizi)
ti = 2.0*zr*zi + ci
ti = ci + 2.0*zr*zi
Normalized execution time
– 4 of 9 arithmetic instructions get
merged into 2 superinstructions
– 24% reduction in float allocation
19
Discussion of alternative approaches
• Faster GC would improve performance as well
– Superinstructions still apply, but with reduced
benefit
• Type inference
– Would allow to specialize expressions and
eliminate boxing
– Interoperability with dynamic code is an issue
• Dynamic specialization
– Topic for further research
20
Related work: Tagged values
• Use lower bits of pointers to trigger alternative
handling
• Embed floating point value into higher bits
• Limited to 64-bit platforms, as Ruby uses double
precision 64 bit floating point arithmetic
– Our approach has same effect on 32 and 64 bit
platforms
• Allows to eliminate majority of boxed floats
• Provides 28-35% benefit (on the same
benchmarks)
* Sasada 2008
21
Related work: Lazy boxing
• Java-like language with generics over valuetypes
• Boxing needed to avoid duplication of
template instantiation code for primitive types
• Lazy optimization works by allocating boxed
objects in the stack frame, and moving to
heap as needed
• Relies on static compiler analysis for escape
path detection, and runtime checks
* Owen 2004
22
Related work:Superinstructions
Superinstructions used for code compression
– ANSI C hybrid compiler-interpreter
– Trimedia code compression system
• Superinstructions chosen statically to minimize
code size
* Proebsting 1995
* Hoogerbrugge 1999
Superinstructions used to reduce dispatch overhead
– Forth, Ocaml
• Superinstructions chosen dynamically
* Piumarta 1998
* Ertl 2003
23
Conclusion
• Naive approach to superinstructions does not
produce substantial benefit for Ruby
• Floating point values boxing overhead is a
problem of Ruby
• Superinstructions provide some help (up to 22%)
Future work
• Eliminate float boxing further
– Specializing computation loop
24