Slide 1: A Study of Superinstructions and Dynamic Mixin Optimizations
08D37074 Salikh Zakirov, Chiba Laboratory
Advisors: Etsuya Shibayama, Shigeru Chiba

Slide 2: Outline
• Introduction
• Superinstructions
• Dynamic mixin optimization
  – Evaluation in compiled environment
• Conclusion

Slide 3: Dynamic languages
• Have existed since the 1970s, but became popular again in the 1990s
• Provide the highest productivity
• Performance is worse than that of static languages
  – which results in limited applicability
• Hardware progress helps, but performance is still low
• Performance research is important!

Slide 4: Dynamic language implementation
• Design trade-off between:
  – performance
  – implementation complexity
  – dynamicity (responsiveness to run-time change)

Slide 5: Typical implementation approaches
(Chart: performance vs. dynamicity. AOT, JIT, and trace compilers sit at the high-performance end; inline caching and plain interpreters sit at the high-dynamicity end.)

Slide 6: The problem
• Trade-off of performance against dynamicity
• Popular dynamic languages are slow
  – we use Ruby
• Metaprogramming is essential to high productivity
  – and requires dynamicity
• Existing high-performance techniques have low dynamicity

Slide 7: Focus of my work
• The Ruby language
  – the main implementation is a VM interpreter
• Performance improvement
  – while keeping high dynamicity

Slide 8: Contributions of this work
• Researched the application of superinstructions to Ruby
  – found a novel approach with higher benefit: arithmetic superinstructions
• Proposed inline caching for dynamic mixin
  – fine-grained state tracking
  – alternate caching

Slide 9: Position of superinstructions
• Known for dispatch overhead reduction
  – low benefit for Ruby on modern hardware
• Arithmetic superinstructions
  – a novel application of superinstructions
  – benefit on numeric applications
• Response to dynamic updates does not differ from the original interpreter

Slide 10: Position of dynamic mixin
• A variant of delegation
  – getting popular in recent research and practice
  – very slow with existing techniques
• We propose a novel optimization scheme
  – fine-grained state tracking
  – alternate caching

Slide 11: Outline — next section: Superinstructions
Slide 12: Interpreter optimization efforts
• Many techniques for optimizing interpreters have been proposed:
  – threaded interpretation
  – stack-top caching
  – pipelining
  – superinstructions
• Focus of this presentation: superinstructions
  – merge the code of operations executed in sequence

Slide 13: Superinstructions (example)

  PUSH:  // put <imm> argument on stack
      stack[sp++] = *pc++;
      goto **pc++;
  ADD:   // add two topmost values on stack
      sp--;
      stack[sp-1] += stack[sp];
      goto **pc++;

Naive concatenation eliminates one dispatch:

  PUSH_ADD:
      stack[sp++] = *pc++;
      // goto **pc++;        <- dispatch eliminated
      sp--;
      stack[sp-1] += stack[sp];
      goto **pc++;

With C compiler optimizations applied, the merged operation simplifies further:

  PUSH_ADD:  // add <imm> to stack top
      stack[sp-1] += *pc++;
      goto **pc++;

Slide 14: Superinstructions (effects)
1. Reduce dispatch overhead
   a. eliminate some jumps
   b. provide more context to the indirect branch predictor by replicating indirect jump instructions
2. Allow more optimizations within a VM operation

Slide 15: Prior research result: good for reducing dispatch overhead
• Superinstructions help when:
  – VM operations are small (~10 hardware ops per VM op)
  – dispatch overhead is high (~50%)
• Examples of successful use in prior research:
  – ANSI C interpreter: 2–3x improvement (Proebsting 1995)
  – OCaml: more than 50% improvement (Piumarta 1998)
  – Forth: 20–80% improvement (Ertl 2003)

Slide 16: Ruby does not fit well
• Superinstructions help when VM operations are small and dispatch overhead is high
• In Ruby, one VM op takes 60–140 hardware ops, and interpreter dispatch is only 1–3% of execution time (hardware profiling data on an Intel Core 2 Duo)
  – BUT misprediction overhead remains

Slide 17: Superinstructions for Ruby
• We experimentally evaluated the effect of "naive" superinstructions on Ruby
  – superinstructions are selected statically
  – combinations of length 2 that occur frequently in a training run are selected as superinstructions
  – the training run uses the same benchmark
  – superinstructions are constructed by concatenating C source code, with C compiler optimizations applied
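The PUSH/ADD fusion above can be sketched as a toy stack VM in Ruby — a hypothetical model for illustration, not MRI's actual dispatch loop. The `:push_add` superinstruction does the work of `:push` followed by `:add` with one dispatch step instead of two.

```ruby
# Toy stack VM: each loop iteration is one "dispatch". The :push_add
# superinstruction fuses PUSH and ADD, saving one dispatch per pair,
# mirroring the slides' C example (names illustrative).
def run(code)
  stack = []
  pc = 0
  while pc < code.length
    case code[pc]
    when :push      # put <imm> argument on stack
      stack.push(code[pc + 1]); pc += 2
    when :add       # add two topmost values on stack
      b = stack.pop; stack[-1] += b; pc += 1
    when :push_add  # fused: add <imm> to stack top
      stack[-1] += code[pc + 1]; pc += 2
    else
      raise "unknown op: #{code[pc]}"
    end
  end
  stack.last
end

run([:push, 2, :push, 3, :add])  # => 5
run([:push, 2, :push_add, 3])    # => 5, with one fewer dispatch
```

Both programs compute the same result; the second executes two dispatches instead of three, which is the entire effect the naive concatenation achieves.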
Slide 18: Naive superinstructions: effect on Ruby
(Chart: normalized execution time vs. number of superinstructions used, 4 benchmarks.)
• Limited benefit, unpredictable effects

Slide 19: Branch mispredictions
(Chart: normalized execution time vs. number of superinstructions used; 2 benchmarks: mandelbrot and spectral_norm.)

Slide 20: Branch mispredictions, reordered
(Same chart with the superinstruction counts reordered by execution time.)

Slide 21: So why is Ruby slow?
• Profile of numeric benchmarks:
  – garbage collection takes significant time
  – boxed floating-point values dominate allocation

Slide 22: Floating-point value boxing
A typical Ruby 1.9 VM operation; a new "box" object is allocated on each operation:

  OPT_PLUS:
      VALUE a = *(sp-2);
      VALUE b = *(sp-1);
      /* ... */
      if (CLASS_OF(a) == Float && CLASS_OF(b) == Float) {
          sp--;
          *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b));
      } else {
          CALL(1 /*argnum*/, PLUS /*method*/, a /*receiver*/);
      }
      goto **pc++;

Slide 23: Proposal: use superinstructions for boxing optimization
• 2 operations per allocation instead of 1; boxing of the intermediate result is eliminated:

  OPT_MULT_OPT_PLUS:
      VALUE a = *(sp-3);
      VALUE b = *(sp-2);
      VALUE c = *(sp-1);
      /* ... */
      if (CLASS_OF(a) == Float && CLASS_OF(b) == Float && CLASS_OF(c) == Float) {
          sp -= 2;
          *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b)*DOUBLE_VALUE(c));
      } else {
          CALL(1 /*argnum*/, MULT /*method*/, b /*receiver*/);
          CALL(1 /*argnum*/, PLUS /*method*/, a /*receiver*/);
      }
      goto **pc++;

Slide 24: Implementation
• VM operations that handle floating-point values directly: opt_plus, opt_minus, opt_mult, opt_div, opt_mod
• We implemented all 25 combinations of length 2
  – based on Ruby 1.9.1
  – using the existing Ruby infrastructure for superinstructions, with some modifications

Slide 25: Limitations
• Coding-style sensitive
  – can be fixed by adding getVariable superinstructions
• Not applicable to other types (e.g. Fixnum, Bignum, String)
  – Fixnum is already unboxed
  – Bignum and String cannot be unboxed
• Sequences of 3 or more arithmetic instructions are virtually non-existent
  – no occurrences in the benchmarks
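The allocation saving of the fused OPT_MULT_OPT_PLUS above can be made concrete with a toy Ruby model — illustrative only, not MRI code — in which every arithmetic result is wrapped in a heap `Box`, the way floats were boxed in Ruby 1.9. The fused multiply-add boxes only the final result, skipping the intermediate one.

```ruby
# Toy boxing model: each new_box call stands for one heap allocation
# of a Float object. Names (Box, box_plus, ...) are illustrative.
Box = Struct.new(:value)
$allocs = 0

def new_box(v)
  $allocs += 1
  Box.new(v)
end

def box_plus(a, b); new_box(a.value + b.value); end
def box_mult(a, b); new_box(a.value * b.value); end
def box_mult_plus(a, b, c); new_box(a.value + b.value * c.value); end  # fused op

a, b, c = Box.new(1.5), Box.new(2.0), Box.new(3.0)

$allocs = 0
r1 = box_plus(a, box_mult(b, c))   # separate ops: intermediate + result = 2 boxes
separate = $allocs

$allocs = 0
r2 = box_mult_plus(a, b, c)        # superinstruction: only the result = 1 box
fused = $allocs
```

Per fused pair, one allocation is saved; since boxed floats dominate allocation in the numeric benchmarks, this is where the GC reduction on the next slides comes from.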
Slide 26: Results
• 0–22% faster on numeric benchmarks (12% on average)
• No slowdown on other benchmarks

Slide 27: Evaluation
• GC reduction explains most of the speedup
  – the reduction in boxing translates into a reduction in GC count

Slide 28: Example: mandelbrot tweak
• A slight modification produces a 20% difference in performance:

  ITER.times do
  -  tr = zrzr - zizi + cr
  +  tr = cr + (zrzr - zizi)
  -  ti = 2.0*zr*zi + ci
  +  ti = ci + 2.0*zr*zi

  – 4 of 9 arithmetic instructions get merged into 2 superinstructions
  – 24% reduction in float allocation
• Alternative solution: introduce OP-getdynamic-OP superinstructions

Slide 29: Discussion of alternative approaches
• Faster GC — would reduce the benefit of superinstructions
• Tagged values — 64-bit platforms only
• Stack allocation of intermediate results
• Dynamic specialization
• Type inference

Slide 30: Summary
• The naive approach to superinstructions does not produce a substantial benefit for Ruby
• Floating-point value boxing overhead is a problem for Ruby
• Superinstructions provide some help (up to 23%, 12% on average)
  – an implementation of 2000 SLOC
  – regular, thus automatically generatable
• Dynamicity is the same as in the original interpreter

Slide 31: Outline — next section: Dynamic mixin optimization

Slide 32: Mixin
• A code composition technique
(Diagram: mixin use declaration and mixin semantics — AdditionalSecurity inserted between Server and BaseServer.)

Slide 33: Dynamic mixin
• A temporary change in the class hierarchy
• Available in Ruby, Python, JavaScript
(Diagram: AdditionalSecurity inserted into, and removed from, the Server → BaseServer hierarchy.)

Slide 34: Dynamic mixin (2)
• A powerful technique of dynamic languages
• Enables:
  – dynamic patching
  – dynamic monitoring
• Can be used to implement:
  – aspect-oriented programming
  – context-oriented programming
• Widely used in Ruby and Python
  – e.g. Object-Relational Mapping
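The dynamic mixin described above runs as-is in Ruby: including a module at run time splices it into the class hierarchy and immediately changes dispatch. A minimal runnable sketch (method bodies are illustrative stand-ins, not from the thesis):

```ruby
# Dynamic mixin install: including AdditionalSecurity into Server at run
# time places it between Server and BaseServer in the ancestor chain.
class BaseServer
  def process
    "handled"
  end
end

module AdditionalSecurity
  def process
    "checked+" + super  # security check, then delegate up the hierarchy
  end
end

class Server < BaseServer
end

s = Server.new
before = s.process                                # => "handled"
Server.class_eval { include AdditionalSecurity }  # install the mixin at run time
after = s.process                                 # => "checked+handled"
```

Note that stock Ruby offers only the install direction; the "remove" operation discussed on the next slide is something the thesis adds to the implementation.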
Slide 35: Dynamic mixin in Ruby
• Ruby has dynamic mixin
  – but only an "install" operation, no "remove"
  – because there is uncertainty in the "remove" semantics with transitive module inclusion
• "remove" can be implemented easily
  – 23 lines of code

Slide 36: Target application
• A mixin is installed and removed frequently
• An application server with dynamic features:

  class BaseServer
    def process() ... end
  end

  module AdditionalSecurity
    def process()
      ...    # security check
      super  # delegate to superclass
    end
  end

  class Server < BaseServer
    def process()
      if request.isSensitive()
        Server.class_eval { include AdditionalSecurity }
      end
      super  # delegate to superclass
      ...    # remove mixin
    end
  end

Slide 37: Overhead is high
Possible reasons:
• Invalidation granularity
  – clearing the whole method cache
  – invalidating all inline caches
• Subsequent calls require a full method lookup
• Inline caching saves just 1 target
  – which changes with mixin operations
  – even though mixin operations are mostly repeated

Slide 38: Our research target
• Improve the performance of applications that use dynamic mixin frequently
  – make invalidation granularity smaller
  – make the dynamic dispatch target cacheable in the presence of dynamic mixin operations

Slide 39: Proposal
• Reduce the granularity of inline cache invalidation
  – fine-grained state tracking
• Cache multiple dispatch targets
  – polymorphic inline caching
• Enable cache reuse on repeated mixin installation and removal
  – alternate caching

Slide 40: Basics: inline caching
Consider a call site cat.speak(). The dynamic dispatch implementation

  method = lookup(cat, "speak")
  method(cat)

is expensive, but the result is mostly the same, so inline caching stores it at the call site:

  if (cat has type ic.class) {
    ic.method(cat)
  } else {
    ic.method = lookup(cat, "speak")
    ic.class = cat.class
    ic.method(cat)
  }

Slide 41: Inline caching: problem
• What if the method has been overridden?
(Diagram: a subclass Cat overrides Animal's speak(); the inline cache still holds the old target.)
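The inline caching idea above can be simulated in a few lines of Ruby — a sketch of a monomorphic cache at one call site, with illustrative names, not MRI's internal mechanism. The call site remembers the (class, method) pair from its first lookup and reuses the target while the receiver's class matches.

```ruby
# A toy monomorphic inline cache: one cached (class, target) pair per
# call site; @lookups counts how often the slow path runs.
class InlineCache
  attr_reader :lookups

  def initialize
    @klass = nil
    @method = nil
    @lookups = 0
  end

  def call(receiver, name)
    unless receiver.class.equal?(@klass)
      @lookups += 1                       # slow path: full method lookup
      @klass = receiver.class
      @method = @klass.instance_method(name)
    end
    @method.bind(receiver).call           # fast path: cached target
  end
end

class Cat
  def speak; "meow"; end
end

ic = InlineCache.new
cat = Cat.new
results = 3.times.map { ic.call(cat, :speak) }
ic.lookups  # => 1 -- only the first call pays for a lookup
```

As the slide's question implies, this cache is unsound once `speak` is redefined or a mixin is installed — exactly the invalidation problem the following slides address.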
Slide 42: Inline caching: invalidation
A global state counter is checked on every dispatch:

  if (cat has type ic.class && state == ic.state) {
    ic.method(cat)
  } else {
    ic.method = lookup(cat, "speak")
    ic.class = cat.class; ic.state = state
    ic.method(cat)
  }

• A single global state object has too coarse an invalidation granularity

Slide 43: Fine-grained state tracking
• Many state objects
  – small invalidation extent
  – shared as much as possible
• One state object for each family of methods called from the same call site
• State objects are associated with the lookup path
  – links are updated during method lookups

Slide 44: Lookup procedure
• Look up normally
  – noting the state objects encountered
  – starting from the state object in the inline cache
• Choose the last state object
  – other state objects: mark for one-time invalidation, or mark as overridden
• Store a pointer to the state object in every class on the lookup path

Slide 45: Invariant 1: lookup path
• After any number of lookups, for any IC = (pstate, state, class, name), either of the following holds:
  – the IC is invalidated, or
  – for any class' ∊ lookup_path(class, name): (class', name) ↪ state
• Proof scheme (induction over lookups):
  – the initial state is invalid
  – the lookup procedure establishes the invariant
  – on lookups that update state object links, the state is linked transitively or invalidated

Slide 46: Invariant 2: validity of the IC
• After any number of updates, for any IC = (pstate, state, class, name, target), either of the following holds:
  – the IC is invalidated, or
  – target = lookup(class, name)
• Proof scheme (induction over updates): any update either
  – does not affect the lookup, or
  – invalidates the IC

Slide 47: State object allocation
The inline caching code now checks the state object referenced by the cache:

  if (cat has type ic.class && ic.pstate.state == ic.state) {
    ic.method(cat)
  } else {
    ic.method, ic.pstate = lookup(cat, "speak", ic.pstate)
    ic.class = cat.class; ic.state = state
    ic.method(cat)
  }

(Diagram: the state object for speak is allocated during lookup; Cat has no implementation of speak here, so Animal's *1* is found and cached with state 1.)
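The state-check mechanism above can be modeled in Ruby — simplified to one counter per class rather than one state object per method family, with all names illustrative. Method (re)definition bumps the counter, so a stale cache entry fails the cheap state comparison and falls back to a full lookup.

```ruby
# Toy state-tracked inline cache: Klass models a class with a state
# counter; StateIC caches (class, state, target) and revalidates by
# comparing the counter, not by redoing the lookup.
class Klass
  attr_reader :state

  def initialize(parent = nil)
    @methods = {}
    @parent = parent
    @state = 0
  end

  def define(name, &body)
    @methods[name] = body
    @state += 1                     # invalidate caches that saw the old state
  end

  def lookup(name)
    @methods[name] || (@parent && @parent.lookup(name))
  end
end

class StateIC
  attr_reader :lookups

  def initialize
    @klass = @state = @target = nil
    @lookups = 0
  end

  def call(klass, name)
    unless klass.equal?(@klass) && klass.state == @state
      @lookups += 1                 # miss: full lookup, refill the cache
      @target = klass.lookup(name)
      @klass, @state = klass, klass.state
    end
    @target.call                    # hit costs only the state check
  end
end

animal = Klass.new
animal.define(:speak) { "meow" }
ic = StateIC.new
2.times { ic.call(animal, :speak) }   # one lookup, then a cache hit
animal.define(:speak) { "MEOW" }      # bumps state, invalidating the cache
final = ic.call(animal, :speak)       # => "MEOW", one more lookup
```

The thesis's scheme refines this by sharing state objects along lookup paths, so an update invalidates only the caches that could actually observe it.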
Slide 48: Mixin installation
(Diagram: installing a mixin that overrides speak bumps the state object's value from 1 to 2; the inline cache check fails and the new target *2* is looked up and cached.)

Slide 49: Mixin removal
(Diagram: removing the mixin bumps the state value from 2 to 3; the original target *1* must be looked up and cached again.)

Slide 50: Alternate caching
• Detect repetition with an alternate cache
• Conflicts are detected by the state check
(Diagram: as the state value alternates 3, 4, 3, 4, ..., the inline cache contents oscillate between *1* and *2*.)

Slide 51: Polymorphic caching
• Use multiple entries in the inline cache
(Diagram: the PIC now handles the oscillating state value by holding both targets at once.)

Slide 52: Invariant 3: validity of alternate caching
• After any number of updates, for any IC = (pstate, state, class, name, target), either of the following holds:
  – value(pstate) != state (the IC is invalid), or
  – target = lookup(class, name)
• Proof scheme (induction on updates): any update either
  – introduces a fresh value(pstate) for invalidation, or
  – preserves the correctness of the cached target

Slide 53: State object merge
(Diagram: executable code calls animal.speak() and cat.speak() in a loop while a mixin overriding speak is repeatedly installed and removed; state objects are merged where lookup paths overlap, with one-time invalidation when an overriding definition appears.)
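The oscillation that alternate caching exploits can be sketched in Ruby — a toy model with illustrative names, not the thesis implementation. The cache keeps one target per observed state value, so after the first install/remove round trip every dispatch is a cache hit, no matter how often the mixin toggles.

```ruby
# Toy alternate cache: Host flips between two states as a mixin is
# installed and removed; AltIC caches one dispatch target per state
# value, so repeated toggling reuses entries instead of re-looking-up.
class Host
  attr_reader :state

  def initialize
    @base  = -> { "base" }
    @mixin = -> { "mixin+base" }
    @installed = false
    @state = 0                      # 0 = plain, 1 = mixin installed
  end

  def install; @installed = true;  @state = 1; end
  def remove;  @installed = false; @state = 0; end
  def lookup;  @installed ? @mixin : @base; end
end

class AltIC
  attr_reader :lookups

  def initialize
    @entries = {}
    @lookups = 0
  end

  def call(host)
    target = @entries[host.state] ||= begin
      @lookups += 1                 # full lookup only for an unseen state
      host.lookup
    end
    target.call
  end
end

host, ic = Host.new, AltIC.new
4.times do
  ic.call(host)       # plain dispatch
  host.install
  ic.call(host)       # dispatch with mixin
  host.remove
end
ic.lookups  # => 2 -- one lookup per configuration, reused across repetitions
```

In the real scheme the per-state entries live in the polymorphic inline cache, and conflicting updates are caught because they produce a fresh state value (Invariant 3 above).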
Slide 54: Alternate caching limitations
• Independent mixins do not conflict
• If two mixins override the same method, their scopes need to be properly nested
(Diagram: state object values for a method across nested and overlapping dynamic mixin scopes.)

Slide 55: Overheads of the proposed scheme
• Increased memory use:
  – 1 state object per polymorphic method family
  – additional method entries
  – the alternate cache
  – polymorphic inline cache entries
• Some operations become slower:
  – lookup needs to track and update state objects
  – explicit state object checks on method dispatch

Slide 56: Generalizations (beyond Ruby)
• Delegation object model
  – track arbitrary delegation pointer changes
• Thread-local delegation
  – allow thread-local modification of the delegation pointer
  – by giving state objects thread-local values

Slide 57: Evaluation
• Implementation based on Ruby 1.9.2
  – about 1000 lines of code
• Hardware: Intel Core i7 860, 2.8 GHz

Slide 58: Evaluation: microbenchmarks
Single method call overhead:
• Inline cache hit
  – state checks: 1% overhead
  – polymorphic inline caching: 49% overhead
• Full lookup: 2x slowdown

Slide 59: Dynamic mixin-heavy microbenchmark
(Chart, normalized execution time, smaller is better: 100% baseline, reduced to 23%, 17%, and 15% by the proposed techniques.)

Slide 60: Evaluation: application
• An application server with a dynamic mixin operation on each request
(Chart, normalized execution time, smaller is better: baseline 100%; method cache 70%; fine-grained state tracking and fgst + PIC around 58–60%; fgst + PIC + alternate caching 52%.)

Slide 61: Evaluation
• Fine-grained state tracking considerably reduces the overhead
• Alternate caching brings only a small improvement
  – the number of call sites affected by the mixin is low
  – the ratio of lookup cost to inline cache hit cost is low: about 1.6x on Ruby

Slide 62: Related work
• Dependency tracking in Self
  – focused on reducing recompilation rather than reducing method lookups
• Inline caching for Objective-C
  – state object associated with the method; no dynamic mixin support

Slide 63: Summary
• We proposed a combination of techniques:
  – fine-grained state tracking
  – alternate caching
  – polymorphic inline caching
• to increase the efficiency of inline caching under frequent dynamic mixin installation and removal
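The thread-local delegation generalization mentioned above can be illustrated with a small Ruby sketch — entirely hypothetical names, standing in for the idea of giving the delegation pointer a thread-local value so one thread's patch is invisible to others.

```ruby
# Illustrative thread-local delegation: the dispatch target is read
# through a thread-local slot, falling back to a shared default.
class Delegator
  def initialize(default)
    @default = default
  end

  def target
    Thread.current[:delegate] || @default   # thread-local value wins
  end

  def call
    target.call
  end
end

d = Delegator.new(-> { "global" })
main_before = d.call                              # => "global"
patched = Thread.new {
  Thread.current[:delegate] = -> { "patched" }    # visible only in this thread
  d.call
}.value                                           # => "patched"
main_after = d.call                               # => "global" (main thread unaffected)
```

In the thesis's formulation the same effect would come from state objects whose values are per-thread, so a thread-local mixin invalidates and refills caches only for the thread that installed it.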
Slide 64: Outline — next section: Evaluation in compiled environment

Slide 65: Evaluation in compiled environment
• Dynamic mixin optimizations are applicable to compiled systems too
• To confirm this hypothesis:
  – we implemented a dynamic compiler for the language IO
  – we evaluated the performance of inline caching

Slide 66: Idea: efficient dynamic mixin
(Diagram: control flow graph of a compiled call site server.f() under repeated dynamic mixin install/removal — a chain of state guards "if (s == 1)", "if (s == 2)", each dispatching to an inlined method f1 or f2, falling through to the inline cache miss handler.)

Slide 67: Dynamic compiler
• Source language: IO
  – only a subset implemented
• Compilation uses profiles collected in PICs
• PICs are initialized by interpreted execution

Slide 68: Microbenchmark results
• The overhead of state checks is reasonable (less than 16 CPU cycles)
• The most common case has just 1 cycle of overhead

Slide 69: Summary and ongoing work
• We evaluated the inline caching optimization for dynamic mixin in a compiled environment
• We verified that the optimization is effective
• The most common case has very low overhead
Future work:
• The dynamic mixin switch operation may be slow
  – needs to be addressed
• Full IO language support

Slide 70: Conclusion
• We researched two techniques:
  – superinstructions for Ruby
  – dynamic mixin optimization
• Our techniques improve performance
• while keeping the advantages:
  – high dynamicity
  – low implementation complexity

Slide 71: Publications
• S. Zakirov, S. Chiba, E. Shibayama. How to select superinstructions. IPSJ PRO, 2010.
• S. Zakirov, S. Chiba, E. Shibayama. Optimizing dynamic dispatch with fine-grained state tracking. DLS 2010.