A Study of Optimizations for Dynamic Languages

A Study of superinstructions
and dynamic mixin optimizations
08D37074 Salikh ZAKIROV
Chiba laboratory
Advisors: Etsuya SHIBAYAMA
Shigeru CHIBA
Outline
• Introduction
• Superinstructions
• Dynamic mixin optimization
– Evaluation in compiled environment
• Conclusion
2
Dynamic languages
Have existed since the 1970s, but became popular again in the 1990s
• Provide very high productivity
• Performance is worse than in static languages
• This results in limited applicability
• Hardware progress helps
• However, performance is still low
Performance research is important!
3
Dynamic language implementation
• Design trade-off:
– performance
– implementation complexity
– dynamicity
• responsiveness to run-time change
4
Typical implementation approaches
[Chart: implementation techniques plotted between a performance axis and a
dynamicity axis – AOT, JIT, and trace compilers toward the performance end;
inline caching and plain interpretation toward the dynamicity end]
5
The problem
Trade-off of performance with dynamicity
• Popular dynamic languages are slow
– We use Ruby
• Metaprogramming is essential to high
productivity
– Requires dynamicity
• Existing high-performance techniques have low
dynamicity
6
Focus of my work
• Ruby language
– Main implementation is VM interpreter
• Performance improvement
– While keeping high dynamicity
7
Contributions of this work
• Studied the application of superinstructions
to Ruby
– Found a novel approach with higher benefit
• Arithmetic superinstructions
• Proposed inline caching for dynamic mixin
– fine-grained state tracking
– alternate caching
8
Position of superinstructions
• Known for dispatch overhead reduction
– Low benefit for Ruby on modern hardware
• Arithmetic superinstructions
– Novel application of superinstructions
– Benefit on numeric applications
• Response to dynamic updates
– does not differ from the original interpreter
9
Position of dynamic mixin
• A variant of delegation
– Getting popular in recent research and practice
– Very slow with existing techniques
• We proposed novel optimization scheme
– Fine-grained state tracking
– Alternate caching
10
Outline
• Introduction
• Superinstructions
• Dynamic mixin optimization
– Evaluation in compiled environment
• Conclusion
11
Interpreter optimization efforts
• Many techniques for optimizing interpreters have
been proposed
– Threaded interpretation
– Stack top caching
– Pipelining
– Superinstructions
Focus of this presentation
• Superinstructions
– Merge code of operations executed in sequence
12
Superinstructions (example)
Original VM operations:

    PUSH: // put <imm> argument on stack
        stack[sp++] = *pc++;
        goto **pc++;
    ADD: // add two topmost values on stack
        sp--;
        stack[sp-1] += stack[sp];
        goto **pc++;

Naive concatenation (dispatch between the two ops eliminated):

    PUSH_ADD:
        stack[sp++] = *pc++;
        //goto **pc++;
        sp--;
        stack[sp-1] += stack[sp];
        goto **pc++;

After C compiler optimizations are applied:

    PUSH_ADD: // add <imm> to stack top
        stack[sp-1] += *pc++;
        goto **pc++;
13
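The dispatch-saving effect can be sketched in plain Ruby (a toy stack machine, not the actual CRuby VM, which implements its operations in C): the fused push_add handler does the work of push and add with one dispatch instead of two.

```ruby
# Toy stack machine: each loop iteration is one "dispatch".
# The :push_add superinstruction fuses :push and :add.
class TinyVM
  attr_reader :stack, :dispatches

  def initialize
    @stack = []
    @dispatches = 0
  end

  # program is a flat array, e.g. [:push, 1, :push, 2, :add]
  def run(program)
    pc = 0
    while pc < program.length
      @dispatches += 1            # one indirect dispatch per VM op
      case program[pc]
      when :push
        @stack.push(program[pc + 1]); pc += 2
      when :add
        b = @stack.pop; @stack[-1] += b; pc += 1
      when :push_add              # fused superinstruction
        @stack[-1] += program[pc + 1]; pc += 2
      end
    end
    @stack.last
  end
end

plain = TinyVM.new
plain.run([:push, 1, :push, 2, :add])   # => 3, in 3 dispatches
fused = TinyVM.new
fused.run([:push, 1, :push_add, 2])     # => 3, in 2 dispatches
```

The same result is computed either way; the fused program simply dispatches one time less per pair.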
Superinstructions (effects)
• Effects
1. Reduce dispatch overhead
a. Eliminate some jumps
b. Provide more context for indirect branch predictor by
replicating indirect jump instructions
2. Allow more optimizations within VM op
14
Prior research result:
Good for reducing dispatch overhead
Superinstructions help when:
• VM operations are small (~10 hwop/vmop)
• Dispatch overhead is high (~50%)
Examples of successful use in prior research
• ANSI C interpreter: 2-3 times improvement
(Proebsting 1995)
• OCaml: more than 50% improvement (Piumarta
1998)
• Forth: 20-80% improvement (Ertl 2003)
15
Ruby does not fit well
Superinstructions help when:
• VM operations are small (~10 hwop/vmop)
– BUT Ruby has 60-140 hardware ops per VM op
• Dispatch overhead is high (~50%)
– BUT interpreter dispatch is only 1-3% in Ruby, mostly branch
misprediction overhead
(Hardware profiling data on Intel Core 2 Duo)
16
Superinstructions for Ruby
• We experimentally evaluated effect of “naive”
superinstructions on Ruby
– Superinstructions are selected statically
– Combinations of length 2 that occur frequently in a
training run are selected as superinstructions
– Training run uses the same benchmark
– Superinstructions constructed by concatenating C
source code, C compiler optimizations applied
17
Naive superinstructions' effect on Ruby
[Chart: normalized execution time vs. number of superinstructions used,
4 benchmarks]
• Limited benefit
• Unpredictable effects
18
Branch mispredictions
[Chart: normalized execution time vs. number of superinstructions used;
2 benchmarks: mandelbrot and spectral_norm]
19
Branch mispredictions, reordered
[Chart: normalized execution time vs. number of superinstructions used,
reordered by execution time; 2 benchmarks: mandelbrot and spectral_norm]
20
So why is Ruby slow?
• Profile of numeric benchmarks shows:
– Garbage collection takes significant time
– Boxed floating point values dominate allocation
21
Floating point value boxing
Typical Ruby 1.9 VM operation: a new "box" object is allocated on each
operation

    OPT_PLUS:
        VALUE a = *(sp-2);
        VALUE b = *(sp-1);
        /* ... */
        if (CLASS_OF(a) == Float && CLASS_OF(b) == Float) {
            sp--;
            *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b));
        } else {
            CALL(1/*argnum*/, PLUS, a);
        }
        goto **pc++;
22
Proposal: use superinstructions for boxing optimization
• 2 operations per allocation instead of 1
• Boxing of the intermediate result is eliminated

    OPT_MULT_OPT_PLUS:
        VALUE a = *(sp-3);
        VALUE b = *(sp-2);
        VALUE c = *(sp-1);
        /* ... */
        if (CLASS_OF(a) == Float && CLASS_OF(b) == Float && CLASS_OF(c) == Float) {
            sp -= 2;
            *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b)*DOUBLE_VALUE(c));
        } else {
            CALL(1/*argnum*/, MULT/*method*/, b/*receiver*/);
            CALL(1/*argnum*/, PLUS/*method*/, a/*receiver*/);
        }
        goto **pc++;
23
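The allocation-saving effect can be illustrated with a Ruby sketch (the FloatBox class and the mult_plus fused op are hypothetical stand-ins for the VM's boxed floats): computing a + b*c as two separate ops allocates two boxes, while the fused op allocates only one.

```ruby
# Hypothetical model of boxed floats: every arithmetic result is a fresh
# heap "box". A fused a + b*c op skips the box for the intermediate b*c.
class FloatBox
  @@allocated = 0
  attr_reader :value

  def initialize(value)
    @value = value
    @@allocated += 1            # count every box allocation
  end

  def self.allocated; @@allocated; end

  def +(other); FloatBox.new(value + other.value); end
  def *(other); FloatBox.new(value * other.value); end

  # fused op: one result box, no box for the intermediate product
  def self.mult_plus(a, b, c)
    FloatBox.new(a.value + b.value * c.value)
  end
end

a = FloatBox.new(1.0); b = FloatBox.new(2.0); c = FloatBox.new(3.0)

before = FloatBox.allocated
r1 = a + b * c                               # two ops
two_op_boxes = FloatBox.allocated - before   # => 2 boxes

before = FloatBox.allocated
r2 = FloatBox.mult_plus(a, b, c)             # fused op
fused_boxes = FloatBox.allocated - before    # => 1 box
```

Both paths compute 7.0; the fused path halves the allocation for this pair of operations, which is where the GC reduction on the following slides comes from.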
Implementation
• VM operations that handle floating point values
directly:
– opt_plus
– opt_minus
– opt_mult
– opt_div
– opt_mod
• We implemented all 25 combinations of length 2
– Based on Ruby 1.9.1
– Using the existing Ruby infrastructure for
superinstructions, with some modifications
24
Limitations
• Coding style-sensitive
– Can be fixed by adding getVariable
superinstructions
• Not applicable to other types (e.g. Fixnum,
Bignum, String)
– Fixnum is already unboxed
– Bignum and String cannot be unboxed
• Sequences of 3 arithmetic instructions or
longer virtually non-existent
– No occurrences in the benchmarks
25
Results
• 0–22% faster on numeric benchmarks (avg 12%)
• No slowdown on other benchmarks
26
Evaluation
• GC time reduction explains most of the speedup
• The reduction in boxing translates to a reduction in GC count
27
Example: mandelbrot tweak
• A slight modification produces a 20% difference in performance:

    ITER.times do
    -  tr = zrzr - zizi + cr
    +  tr = cr + (zrzr - zizi)
    -  ti = 2.0*zr*zi + ci
    +  ti = ci + 2.0*zr*zi

– 4 of 9 arithmetic instructions get merged into 2 superinstructions
– 24% reduction in float allocation
• Alternative solution: introduce OP-getdynamic-OP superinstructions
[Chart: normalized execution time]
28
Discussion of alternative approaches
• Faster GC
– Superinstructions benefit reduced
• Tagged values
– 64 bit platforms only
• Stack allocation of intermediate results
• Dynamic specialization
• Type inference
29
Summary
• The naive approach to superinstructions does not
produce substantial benefit for Ruby
• Floating point value boxing overhead is a
problem for Ruby
• Superinstructions provide some help (up to
23%, 12% on average)
– implementation is 2000 SLOC, but regular, thus
automatically generatable
• Dynamicity is the same as in the original interpreter
30
Outline
• Introduction
• Superinstructions
• Dynamic mixin optimization
– Evaluation in compiled environment
• Conclusion
31
Mixin
• A code composition technique
[Diagram: the mixin use declaration (Server includes AdditionalSecurity
and inherits from BaseServer) vs. the mixin semantics (AdditionalSecurity
inserted into the hierarchy between Server and BaseServer)]
32
Dynamic mixin
• Temporary change in the class hierarchy
• Available in Ruby, Python, JavaScript
[Diagram: AdditionalSecurity inserted between Server and BaseServer, then
removed again]
33
Dynamic mixin (2)
• Powerful technique of dynamic languages
• Enables
– dynamic patching
– dynamic monitoring
• Can be used to implement
– Aspect-oriented programming
– Context-oriented programming
• Widely used in Ruby, Python
– e.g. Object-Relational Mapping
34
Dynamic mixin in Ruby
• Ruby has dynamic mixin
– but only “install”, no “remove” operation
– because there is uncertainty in “remove”
semantics with transitive module inclusion
• “remove” can be implemented easily
– 23 lines of code
35
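Mixin installation itself is plain, runnable Ruby; only the matching "remove" is missing from the stock implementation:

```ruby
# Dynamic mixin "install" in stock Ruby: including a module at run time
# changes what an existing call site resolves to.
class BaseServer
  def process; "plain"; end
end

module AdditionalSecurity
  def process
    "checked:" + super   # security check, then delegate up the chain
  end
end

class Server < BaseServer; end

s = Server.new
s.process                                        # => "plain"
Server.class_eval { include AdditionalSecurity } # install mixin at run time
s.process                                        # => "checked:plain"
```

The reverse operation, restoring the original hierarchy, is what stock Ruby 1.9 lacks; the thesis adds it in about 23 lines of VM code.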
Target application
• Mixin is installed and removed frequently
• Application server with dynamic features

    class BaseServer
      def process() … end
    end

    module AdditionalSecurity
      def process()
        …      # security check
        super  # delegate to superclass
      end
    end

    class Server < BaseServer
      def process()
        if request.isSensitive()
          Server.class_eval {
            include AdditionalSecurity
          }
        end
        super  # delegate to superclass
        …      # remove mixin
      end
    end
36
Overhead is high
Possible reasons
• Invalidation granularity
– clearing the whole method cache
– invalidating all inline caches
• subsequent calls require a full method lookup
• Inline caching saves just 1 target
– which changes with mixin operations
– even though mixin operations are mostly repeated
37
Our research target
• Improve performance of applications that
frequently use dynamic mixin
– Make invalidation granularity smaller
– Make dynamic dispatch target cacheable in
presence of dynamic mixin operations
38
Proposal
• Reduce granularity of inline cache invalidation
– Fine-grained state tracking
• Cache multiple dispatch targets
– Polymorphic inline caching
• Enable cache reuse on repeated mixin
installation and removal
– Alternate caching
39
Basics: Inline caching
Consider a call site: cat.speak()
Dynamic dispatch implementation (expensive, but the result is mostly
the same):

    method = lookup(cat, ”speak”)
    method(cat)

Inline caching:

    if (cat has type ic.class) {
        ic.method(cat)
    } else {
        ic.method = lookup(cat, ”speak”)
        ic.class = cat.class
        ic.method(cat)
    }

[Diagram: cat is an instance of class Cat, a subclass of Animal, which
defines speak(); the inline cache ic in the executable code stores the
receiver class and the method implementation]
40
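The same mechanism can be sketched in Ruby itself (the CallSite class is a hypothetical model, not CRuby's C implementation): the expensive lookup runs once, and repeat calls hit the cached target as long as the receiver class matches.

```ruby
# Monomorphic inline cache model: one cached (class, method) pair per
# call site; a class mismatch falls back to a counted full lookup.
class CallSite
  attr_reader :lookups

  def initialize(name)
    @name = name
    @cached_class = nil
    @cached_method = nil
    @lookups = 0
  end

  def call(receiver)
    unless receiver.class == @cached_class
      @lookups += 1                               # slow path: full lookup
      @cached_method = receiver.class.instance_method(@name)
      @cached_class  = receiver.class
    end
    @cached_method.bind(receiver).call            # fast path
  end
end

class Cat
  def speak; "meow"; end
end

site = CallSite.new(:speak)
cat  = Cat.new
3.times { site.call(cat) }
site.lookups   # => 1 (only the first call misses)
```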
Inline caching: problem
• What if the method has been overridden?

    if (cat has type ic.class) {
        ic.method(cat)
    } else {
        ic.method = lookup(cat, ”speak”)
        ic.class = cat.class
        ic.method(cat)
    }

[Diagram: the Training mixin overrides speak(); the class check on Cat
still succeeds, so the inline cache keeps calling the stale Animal
speak()]
41
Inline caching: invalidation

    if (cat has type ic.class && state == ic.state) {
        ic.method(cat)
    } else {
        ic.method = lookup(cat, ”speak”)
        ic.class = cat.class; ic.state = state
        ic.method(cat)
    }

Single global state object
• incremented (1 → 2) on any change, e.g. installing the Training mixin
• too coarse invalidation granularity
[Diagram: bumping the global state from 1 to 2 invalidates the cached
entry for cat.speak()]
42
Fine-grained state tracking
• Many state objects
– small invalidation extent
– share as much as possible
• One state object for each family of methods called
from the same call site
• State objects associated with lookup path
– links updated during method lookups
43
Lookup procedure
• Look up normally
– noting the state objects encountered
– starting from the state object in the inline cache
• Choose the last state object
– for the other state objects:
• mark for one-time invalidation
• mark as overridden
• Store the pointer to the state object in every
class on the lookup path
44
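A simplified Ruby model of fine-grained state tracking (state objects are keyed per method name here, while the real scheme associates them with classes along the lookup path): invalidating one method family leaves caches for other families untouched.

```ruby
# One state object per method family; a redefinition bumps only the
# affected family's state, so unrelated inline caches stay valid.
class StateObject
  attr_accessor :value
  def initialize; @value = 0; end
end

class VMModel
  def initialize
    @states = Hash.new { |h, k| h[k] = StateObject.new }
  end
  def state_for(name); @states[name]; end
  def invalidate(name); @states[name].value += 1; end  # e.g. redefinition
end

class FineGrainedSite
  attr_reader :lookups
  def initialize(vm, name)
    @vm, @name = vm, name
    @klass = nil; @method = nil; @pstate = nil; @state = -1
    @lookups = 0
  end

  def call(receiver)
    unless receiver.class == @klass && @pstate && @pstate.value == @state
      @lookups += 1                                    # slow path
      @method = receiver.class.instance_method(@name)
      @klass  = receiver.class
      @pstate = @vm.state_for(@name)                   # note the state object
      @state  = @pstate.value
    end
    @method.bind(receiver).call
  end
end

class Dog
  def speak; "woof"; end
  def walk;  "step"; end
end

vm  = VMModel.new
dog = Dog.new
speak_site = FineGrainedSite.new(vm, :speak)
walk_site  = FineGrainedSite.new(vm, :walk)
speak_site.call(dog); walk_site.call(dog)
vm.invalidate(:speak)       # touches only the speak family
speak_site.call(dog); walk_site.call(dog)
speak_site.lookups          # => 2 (re-looked up after invalidation)
walk_site.lookups           # => 1 (cache still valid)
```

With a single global state (previous slide), the walk site would have been invalidated as well.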
Invariant 1: lookup path
• After any number of lookups
– for any IC = (pstate, state, class, name)
• Either of the following holds:
– IC is invalidated
– for any class’ ∊ lookup_path(class, name)
• (class’, name) ↪ state
• Proof scheme (induction over lookups)
– initial state is invalid
– lookup procedure establishes invariant
– on lookups that update state object links, state is
linked transitively or invalidated
45
Invariant 2: validity of IC
• After any number of updates
– for any IC = (pstate, state, class, name, target)
• Either of the following holds
– IC is invalidated
– target = lookup(class, name)
• Proof scheme (induction over updates)
– any update either
• does not affect the lookup, or
• invalidates the IC
46
State object allocation

    if (cat has type ic.class && ic.pstate.state == ic.state) {
        ic.method(cat)
    } else {
        ic.method, ic.pstate = lookup(cat, ”speak”, ic.pstate)
        ic.class = cat.class; ic.state = state
        ic.method(cat)
    }

[Diagram: Animal defines speak() *1*; Cat has no implementation of speak;
the lookup allocates a state object with value 1, links it from the
classes on the lookup path, and the inline cache stores the method *1*,
the state object pointer (pstate), and the state value 1]
47
Mixin installation

    if (cat has type ic.class && ic.pstate.state == ic.state) {
        ic.method(cat)
    } else {
        ic.method, ic.pstate = lookup(cat, ”speak”, ic.pstate)
        ic.class = cat.class; ic.state = state
        ic.method(cat)
    }

[Diagram: installing the Training mixin (speak() *2*) between Cat and
Animal increments the state object value from 1 to 2; the state check
fails, and the next lookup caches *2*]
48
Mixin removal

    if (cat has type ic.class && ic.pstate.state == ic.state) {
        ic.method(cat)
    } else {
        ic.method, ic.pstate = lookup(cat, ”speak”, ic.pstate)
        ic.class = cat.class; ic.state = state
        ic.method(cat)
    }

[Diagram: removing the Training mixin increments the state object value
from 2 to 3; the state check fails again, and the next lookup caches *1*
once more]
49
Alternate caching
• Detect repetition
• Conflicts are detected by the state check
[Diagram: an alternate cache on the method entries remembers the state
values seen for Cat#speak (3 without the mixin, 4 with Training
installed), so repeated install/remove re-uses the values 3 and 4; the
inline cache contents oscillate between *1* and *2*]
50
Polymorphic caching
• Use multiple entries in the inline cache
[Diagram: the polymorphic inline cache keeps both targets, *1* for state
value 3 and *2* for state value 4, so the PIC now handles the oscillating
state value without further lookups]
51
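A Ruby sketch of polymorphic inline caching combined with alternate state values (mixin install/remove is simulated by defining and removing an override, since stock Ruby cannot un-include a module): after the first install/remove cycle fills two cache entries, further cycles never miss.

```ruby
# PIC keyed by (receiver class, state value): each observed state value
# gets its own entry, so an oscillating state re-uses filled entries.
class PolySite
  attr_reader :lookups

  def initialize(name)
    @name = name
    @entries = {}          # [class, state value] -> cached method
    @lookups = 0
  end

  def call(receiver, state)
    key = [receiver.class, state.value]
    method = @entries[key]
    unless method
      @lookups += 1                                  # full lookup on miss
      method = receiver.class.instance_method(@name)
      @entries[key] = method
    end
    method.bind(receiver).call
  end
end

class Animal; def speak; "meow"; end; end
class Cat < Animal; end

state = Struct.new(:value).new(0)  # stands in for the pstate state object
site  = PolySite.new(:speak)
cat   = Cat.new

results = []
3.times do
  Cat.class_eval { def speak; "MEOW"; end }  # "install mixin": override
  state.value = 1
  results << site.call(cat, state)
  Cat.class_eval { remove_method(:speak) }   # "remove mixin": restore
  state.value = 0
  results << site.call(cat, state)
end
results       # => ["MEOW", "meow", "MEOW", "meow", "MEOW", "meow"]
site.lookups  # => 2; later cycles hit the two cached entries
```

Only two lookups happen in total, which is the alternate-caching claim: repeated mixin operations re-use previously observed state values instead of forcing a miss per switch.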
Invariant 3: validity of alternate caching
• After any number of updates
– for any IC = (pstate, state, class, name, target)
• Either of the following holds
– value(pstate) != state (IC is invalid)
– target = lookup(class, name)
• Proof scheme (induction on updates)
– any update either
• introduces a fresh value(pstate), invalidating the IC, or
• preserves correctness of the cached target
52
State object merge
[Diagram: a second call site animal.speak() on an Animal instance shares
the lookup path with cat.speak(), which installs and removes a mixin in a
loop (while(true) { cat.speak(); …; remove mixin }); the state objects S
and Q are merged, and the entry overridden by the Training mixin is
handled by one-time invalidation]
53
Alternate caching limitations
• Independent mixins do not conflict
• If two mixins override the same method
– their scopes need to be properly nested
[Diagram: state object values for some method over time; properly nested
dynamic mixin scopes re-use values (1, 2, 3, 3, 2, 1), while improper
nesting forces fresh values (4, 5)]
54
Overheads of proposed scheme
• Increased memory use
– 1 state object per polymorphic method family
– additional method entries
– alternate cache
– polymorphic inline cache entries
• Some operations become slower
– Lookup needs to track and update state objects
– Explicit state object checks on method dispatch
55
Generalizations (beyond Ruby)
• Delegation object model
– track arbitrary delegation pointer change
• Thread-local delegation
– allow for thread-local modification of delegation
pointer
– by having thread-local state object values
56
Evaluation
• Implementation based on Ruby 1.9.2
– about 1000 lines of code
• Hardware
– Intel Core i7 860 2.8 GHz
57
Evaluation: microbenchmarks
Single method call overhead:
• Inline cache hit
– state checks: 1% overhead
– polymorphic inline caching: 49% overhead
• Full lookup
– 2x slowdown
58
Dynamic mixin-heavy microbenchmark
[Chart: normalized execution time, smaller is better; baseline 100%, with
the optimized variants at 23%, 17%, and 15%]
59
Evaluation: application
• Application server with a dynamic mixin operation on each request
[Chart: normalized execution time, smaller is better; baseline 100%,
method cache 70%, state checks 58%, fgst 60%, fgst + PIC, and
fgst + PIC + alternate caching 52%]
60
Evaluation
• Fine-grained state tracking considerably
reduces overhead
• Alternate caching brings only small
improvement
– Number of call sites affected by mixin is low
– Lookup cost / inline cache hit cost is low
• about 1.6x on Ruby
61
Related work
• Dependency tracking in Self
– focused on reducing recompilation,
rather than reducing method lookups
• Inline caching for Objective-C
– state object associated with method, no
dynamic mixin support
62
Summary
• We proposed a combination of techniques
– Fine-grained state tracking
– Alternate caching
– Polymorphic inline caching
• To increase efficiency of inline caching
– with frequent dynamic mixin installation and
removal
63
Outline
• Introduction
• Superinstructions
• Dynamic mixin optimization
– Evaluation in compiled environment
• Conclusion
64
Evaluation in compiled environment
• Dynamic mixin optimizations
– applicable to compiled systems too
• In order to confirm this hypothesis
– we implemented a dynamic compiler for the
language IO
– we evaluated the performance of inline caching
65
Idea: efficient dynamic mixin
Repeated dynamic mixin install / removal at a compiled call site server.f()
[Control flow graph: a chain of state guards; if (s == 1), run the inlined
method f1; if (s == 2), run the inlined f2; otherwise handle the inline
cache miss; then continue]
66
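The shape of the emitted code can be modeled in Ruby (the names f1/f2 follow the diagram; the real compiler emits machine code, not Ruby): a chain of cheap state guards over inlined bodies, with a counted fallback for inline cache misses.

```ruby
# Guard-chain model of a compiled call site: each state value guards one
# inlined method version; an unknown state value falls back to the slow
# path and counts as an inline cache miss.
def compiled_call_site(s, miss_counter)
  if s == 1
    "f1 result"           # inlined body of the base method f1
  elsif s == 2
    "f2 result"           # inlined body of the mixin-installed f2
  else
    miss_counter[0] += 1  # inline cache miss: full dispatch
    "slow path result"
  end
end

misses = [0]
compiled_call_site(1, misses)  # => "f1 result"
compiled_call_site(2, misses)  # => "f2 result"
compiled_call_site(3, misses)  # => "slow path result", misses[0] == 1
```

In the common case a mixin switch only changes s, so dispatch stays a compare-and-branch; this is why the measured overhead can be as low as one cycle.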
Dynamic compiler
• Source language – IO
– only a subset is implemented
• Compilation uses profiles collected in PICs
• PICs are initialized by interpreted execution
67
Microbenchmark results
• Overhead of state checks is reasonable (less
than 16 CPU cycles)
• The most common case has just 1 cycle of
overhead
68
Summary and ongoing work
• We evaluated inline caching optimization for
dynamic mixin in compiled environment
• We verified that the optimization is effective
• The most common case has very low
overhead
Future work
• The dynamic mixin switch operation may be
slow – this needs to be addressed in future work
• Full IO language support
69
Conclusion
• We researched two techniques
– Superinstructions for Ruby
– Dynamic mixin optimization
• Our techniques improve performance
• While keeping the advantages
– high dynamicity
– low implementation complexity
70
Publications
• S. Zakirov, S. Chiba, E. Shibayama. How to
select superinstructions, IPSJ PRO 2010.
• S. Zakirov, S. Chiba, E. Shibayama. Optimizing
dynamic dispatch with fine-grained state
tracking, DLS 2010.
71