The Reuse Cache: Downsizing the Shared Last-Level Cache
Jorge Albericio¹, Pablo Ibáñez², Víctor Viñals², and José M. Llabería³

Modern CMPs
[Figure: modern CMPs with their SLLC labeled: Intel e5 2600 (2013), AMD Orochi (2012), Fujitsu SPARC VIIIfx (2011), Intel Itanium 9500 (2012)]

Inclusive Shared Last-Level Cache (SLLC)
• Conventional: 8MB
• Reuse cache: tags 1/2 of the conventional size, data 1MB (1/8 of the conventional size)
Same average performance* with 84% area savings
*100 multiprogrammed SPEC CPU 2006 workloads in an 8-core CMP

Outline
• Motivation
• The reuse cache
  - Idea
  - Organization
  - Coherence
  - Replacement
• Related work
• Evaluation
• Conclusions

Motivation
[Figure: SLLC data, a few lines with many hits and the rest with zero hits]
Reuse locality
• Recently reused lines are more useful than lines reused earlier
• Lines used more than once are likely to be used many times
Opportunity for a smaller SLLC which stores only reused data

Motivation: live cache fraction evolution
[Figure: fraction of live SLLC lines (0.25 to 0.75) over time (0 to 6000, x100K cycles), one panel each for LRU, NRR, and DRRIP]
• The fraction of live SLLC lines is very small (10-30% with LRU)
  - On average, 78% of lines won't receive any additional access
• State-of-the-art replacement policies do better
Setup: 8MB shared LLC, 16-way, LRU, 8-core CMP, mix #14

Motivation
[Figure: share of total hits vs. share of inserted lines]
• 0.5% of all the loaded lines concentrate close to 50% of the hits
• 5% of all the loaded lines concentrate all the hits
Most of the inserted lines will not experience any hit, because hits concentrate in a few lines
Setup: 8MB shared LLC, 16-way, LRU, 8-core CMP, mix #14

Idea
Data is stored only when reuse is detected
• First access: miss. Only the tag is inserted; the line goes from main memory to the private caches
• Second access: tag hit. The data is inserted (reuse detected)

Organization
• Decoupled tag and data arrays
  - More tag than data entries
[Figure: conventional cache (one data entry per tag) vs. reuse cache (more tag entries than data entries)]

Organization
• Tag array
  - Maintains coherence/inclusion (as in a normal cache)
  - Set-associative, 16-way (as in a normal cache)
  - Each entry has associated data or not (as in a normal cache)
  - New: forward pointers, to indicate the corresponding data array entry
  - Entry fields: tag, state, pointer
• Data array
  - Only stores reused lines
  - Queue or set-associative
  - Reverse pointers: used to update the tag array entry
  - Entry fields: pointer, data, repl.

Coherence
• Conventional coherence protocols can be used with small changes
  - Two types of states have to be considered: tag-only states and tag + data states
  - Transitions between both types of states are triggered by data insertions or evictions
[Diagram: tag-only states to tag + data states on data array insertion (reuse); tag + data states to tag-only states on data eviction]

Replacement
• Tag and data arrays have independent replacement policies
  - Tag array: Not Recently-Reused (NRR) [Albericio et al., TACO'13]; reused and private cache contents are unlikely to be evicted
  - Data array: Clock, for queue organization [Corbató, MIT'68]; Not Recently-Used (NRU), for set associativity

NRU vs NRR
• NRU
  - Exploits temporal locality (approximation to LRU)
  - 1 bit per line
  - Used in Intel i7 and SPARC T2
  - [State diagram: Used, Not used, Insertion]
• NRR [Albericio et al., TACO'13]
  - Exploits reuse locality
  - 1 bit per line
  - Private cache contents are protected
  - [State diagram: Reused, Not reused, Insertion]
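The allocation rule from the Idea slide, together with the decoupled arrays and the two replacement policies, can be sketched as a small model. This is only an illustrative sketch under strong assumptions, not the authors' implementation: the class name `ReuseCacheSketch`, its methods, and the return labels are hypothetical; the tag array is modeled as fully associative rather than 16-way; private caches and coherence states are not modeled; NRR is reduced to a single "reused" bit per tag; and the data array is a Clock-managed queue.

```python
class ReuseCacheSketch:
    """Toy model: tags are allocated on a miss, data only on reuse (hypothetical names)."""

    def __init__(self, tag_entries, data_entries):
        assert 0 < data_entries <= tag_entries  # more tag than data entries
        self.tag_entries = tag_entries
        self.data_entries = data_entries
        # addr -> {"reused": NRR-like bit, "ptr": forward pointer or None (tag-only)}
        self.tags = {}
        # slot -> {"addr": reverse pointer, "ref": Clock reference bit} or None
        self.data = [None] * data_entries
        self.hand = 0  # Clock hand over the data array

    def access(self, addr):
        entry = self.tags.get(addr)
        if entry is None:
            # First access: miss, only the tag is inserted.
            if len(self.tags) >= self.tag_entries:
                self._evict_tag()
            self.tags[addr] = {"reused": False, "ptr": None}
            return "miss"
        entry["reused"] = True
        if entry["ptr"] is None:
            # Tag hit without data: reuse detected, so the data is inserted now.
            entry["ptr"] = self._allocate_data(addr)
            return "tag_hit"
        self.data[entry["ptr"]]["ref"] = True
        return "data_hit"

    def _evict_tag(self):
        # Simplified NRR: prefer a tag whose reused bit is clear.
        victim = next((a for a, e in self.tags.items() if not e["reused"]), None)
        if victim is None:
            # Every tag looks recently reused: clear all bits, take the oldest.
            for e in self.tags.values():
                e["reused"] = False
            victim = next(iter(self.tags))
        ptr = self.tags.pop(victim)["ptr"]
        if ptr is not None:
            self.data[ptr] = None  # follow the forward pointer, free the data

    def _allocate_data(self, addr):
        # Clock: skip (and clear) referenced slots, reuse the first free or
        # unreferenced one.
        while True:
            idx = self.hand
            slot = self.data[idx]
            self.hand = (self.hand + 1) % self.data_entries
            if slot is None:
                break
            if slot["ref"]:
                slot["ref"] = False
                continue
            # Follow the reverse pointer: the owner falls back to tag-only.
            self.tags[slot["addr"]]["ptr"] = None
            break
        self.data[idx] = {"addr": addr, "ref": False}
        return idx


rc = ReuseCacheSketch(tag_entries=4, data_entries=2)
print([rc.access("A") for _ in range(3)])  # ['miss', 'tag_hit', 'data_hit']
```

In this sketch, a stream of lines that are each touched once only ever occupies tag entries, so a line that has already shown reuse keeps its data entry.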
Related work
• Many proposals decouple tag/data arrays
  - Decoupled sectored caches, Seznec, ISCA 1994
  - The pool of subsectors, Rothman and A. Smith, ICS 1999
  - NuRAPID, Chishti et al., MICRO 2003
  - V-way, Qureshi et al., ISCA 2005
• Non-inclusive cache, inclusive directory (NCID), Zhao et al., Computing Frontiers 2010
  - Additional tags to maintain inclusion
  - Selective allocation policy based on set dueling
[Figure labels: Tag, Tag+data, Tag, Data, Set]

Methodology
• Baseline system configuration
  - 8-core CMP system with in-order cores
  - [Diagram: core 0 to core 7, each with an L1 (16KB I$, 16KB D$) and a 256KB L2; crossbar; 8MB inclusive SLLC (LRU, 16-way), Bank 0 to Bank 3; 1 DDR3 channel]
• Simulation
  - Simics full-system simulator
  - Ruby from Multifacet GEMS
• Area and latency
  - CACTI 6.5
• Workloads
  - Multi-programmed: SPEC CPU 2006 (100 random mixes from all the 29 applications)
  - Parallel: from PARSEC and SPLASH-2

Terminology
RC-x/y = reuse cache with tags equivalent to x MB and y MB of data
[Figure: an RC-x/y pairs the tags of a conventional cache of x MB with the data of a conventional cache of y MB]

Size exploration
[Figure: performance over conventional 8MB (y axis 0.9 to 1.15) for RC-x/8 (x = 8, 16, 32, 64), RC-x/4 (x = 4, 8, 16, 32), RC-x/2, RC-x/1, and RC-x/0.5 (x = 2, 4, 8, 16); one configuration is chosen for each data array size]

Size exploration
[Figure: performance over conventional 8MB (0.9 to 1.15) and hardware cost with respect to conventional 8MB (0% to 100%) for RC-16/8, RC-8/4, RC-8/2, RC-4/1, RC-4/0.5; reference: Conv. 8MB LRU]
• This configuration keeps average performance while it reduces size
• Same avg. performance with 84% area savings

Better baseline
[Figure: performance over conventional 8MB (0.9 to 1.15). Annotation: +1.1% perf., -40% HW cost]
[Figure, continued. Annotation: -0.7% perf., -60% HW cost. References: Conv. DRRIP 8MB, Conv. LRU 8MB. Configurations: RC-16/8, RC-8/4, RC-8/2, RC-4/1, RC-4/0.5]

Multiprogrammed
[Figure: performance over conventional 8MB, per workload]
• 9 workloads with more than 5% speedup
• 6 workloads with more than 5% slowdown
• Better in more than half of the mixes

NCID comparison
[Figure: performance over conventional 8MB (0.9 to 1.1), reuse cache vs. NCID, for RC-8/4 vs. NCID-8/4, RC-8/2 vs. NCID-8/2, RC-4/1 vs. NCID-4/1, RC-4/0.5 vs. NCID-4/0.5. Labeled values: 7.0%, 6.4%, 5.2%, 5.3%]

Behavior insight
Percentage of data stored in the cache with respect to the number of tags
• Conventional cache: 100%
• RC-4/1: 5%
Our content selection policy selects the lines that will receive hits: 5% of the loaded lines get all the hits
[Figure: percentage of total hits (50%, 100%) vs. percentage of inserted lines (0.5%, 5%)]

Additional results
• Performance analysis for more sizes
• Set-associative data array
• Per-application study
• Parallel workloads

Conclusions
• A big fraction of the SLLC contents is dead
  - A selective allocation policy is needed
• An SLLC organization is proposed: the reuse cache
  - Very selective line allocation policy, based on reuse
  - Same avg. performance with 84% storage savings
  - Alternative applications: further reduction of energy; more computing elements with the same area
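To give a feel for where a storage saving of this size can come from, the RC-x/y terminology can be turned into a rough bit count. This is only a back-of-the-envelope sketch, not the CACTI 6.5 area model used in the evaluation. The 16-way associativity comes from the slides; the 64-byte line, the 40-bit physical address, the 12 bits of per-tag state, and the single replacement bit per data entry are all assumptions, and the function names are hypothetical.

```python
LINE_BYTES = 64   # assumption: cache line size
ADDR_BITS = 40    # assumption: physical address width
WAYS = 16         # from the slides: 16-way tag array
STATE_BITS = 12   # assumption: coherence state, sharers, replacement bit


def num_lines(size_mb):
    """Number of lines in a cache of size_mb megabytes."""
    return int(size_mb * 2**20) // LINE_BYTES


def log2_exact(n):
    """log2 of a power of two, computed without floating point."""
    assert n > 0 and n & (n - 1) == 0
    return n.bit_length() - 1


def tag_bits(size_mb):
    """Address tag width for a WAYS-way array covering size_mb of lines."""
    sets = num_lines(size_mb) // WAYS
    return ADDR_BITS - log2_exact(LINE_BYTES) - log2_exact(sets)


def conventional_bits(size_mb):
    """Tag + state + data bits of a conventional cache."""
    per_line = tag_bits(size_mb) + STATE_BITS + LINE_BYTES * 8
    return num_lines(size_mb) * per_line


def reuse_cache_bits(tag_mb, data_mb):
    """Bits of an RC-x/y: x MB worth of tags, y MB of data, plus pointers."""
    n_tags, n_data = num_lines(tag_mb), num_lines(data_mb)
    forward_ptr = log2_exact(n_data)   # tag entry -> data entry
    reverse_ptr = log2_exact(n_tags)   # data entry -> tag entry
    tag_array = n_tags * (tag_bits(tag_mb) + STATE_BITS + forward_ptr)
    data_array = n_data * (LINE_BYTES * 8 + reverse_ptr + 1)  # +1 repl. bit
    return tag_array + data_array


savings = 1 - reuse_cache_bits(4, 1) / conventional_bits(8)
print(f"RC-4/1 vs conventional 8MB: {savings:.1%} fewer storage bits")
```

Under these assumed widths the estimate for RC-4/1 comes out at roughly 83%, in the same range as the figure reported in the slides; different address or state widths shift it slightly, since the data array dominates the total.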