A Low Energy Set-Associative I-Cache with Extended BTB
K. Inoue, V. Moshnyaga, and K. Murakami

Introduction
• Increase in cache size
• Power consumed in on-chip caches
   DEC 21164 CPU*: 25%
   StrongARM SA-110 CPU*: 43%
   Bipolar ECL CPU**: 50%
* Kamble et al., “Analytical Energy Dissipation Models for Low Power Caches”, ISLPED’97
** Jouppi et al., “A 300-MHz 115-W 32-b Bipolar ECL Microprocessor”, IEEE Journal of Solid-State Circuits’93

Problem of Conventional Caches
• Conventional cache: the tags and lines of all ways (way0 to way3) are read in parallel in cycle 1
   Fast access but high energy consumption
   Parallel search strategy produces unnecessary way activation!
• Phased cache: the tags are checked in cycle 1, and only the hit way's line is read in cycle 2
   Low energy consumption but slow access

Our Proposal
History-Based Tag-Comparison I-Cache
• Attempts to reduce cache-access energy without performance degradation
• Reuses tag-check results to eliminate unnecessary way activation
• Can achieve a 62% energy reduction with only 0.2% performance degradation

Conventional Tag-Check Scheme
• Cache is updated only when replacement occurs!
   Loaded data stays at the same location at least until the next cache miss takes place
• Programs include a lot of loops!
   A number of instructions are executed repeatedly
[Figure: timeline of one cache-miss interval in which Inst. A is referenced N times. Every reference performs a tag check, and all of them produce completely the same tag-check result, because the cache is stable between misses.]

History-Based Tag-Comparison (HBTC) Scheme
• Attempts to reuse tag-check results produced earlier in the same cache-miss interval!
• A tag-check result can be reused when:
   The target instruction has been referenced before, and
   No cache miss has occurred since the previous reference.
[Figure: timeline of one cache-miss interval. The first reference to A performs a tag check; the following references to A reuse its result.]

Concept of the HBTC Cache
1. Execute an instruction A at time T
   • Perform tag check
   • Save the tag-check result into the extended BTB ([way2] is the hit way)
2. If a cache miss occurs, invalidate all the stored tag-check results
3. Execute the instruction A at time T+X
   • Reuse the tag-check result to activate only the hit way's data sub-array ([way2] is the hit way)

Conventional vs. Phased vs. HBTC
[Figure: tag and line sub-arrays (way0 to way3) activated in cycle 1 and cycle 2 for the conventional cache, the phased cache (cache hit and cache miss), and the HBTC cache (reuse and no reuse).]

HBTC SA I-$ Architecture
[Figure: block diagram with the following components.]
• PC
• BTB (Branch Target Buffer): Branch-Inst. Addr., Target Addr., branch prediction result (Taken / Not Taken)
• WP Table: each entry holds a valid flag and n way pointers
• PBAreg: Branch Inst. Addr. and Pred. Result (address for writing the WP table)
• WP Record Reg. (WPRreg): holds the tag-check results from the I-Cache
• WP Reg. (WPreg): supplies way pointers to the I-Cache
• Mode Controller: inputs are the WP valid flag and the I-Cache miss signal; output is the mode

HBTC I-$ Operation
• Normal Mode (NM): w/ tag checks
• Omitting Mode (OM): w/o tag checks (reuse)
• Tracing Mode (TM): w/ tag checks (tag-check results are preserved in the WPRreg and are stored into the WP table on the next BTB hit)

Mode Transition
[Figure: state diagram of NM, OM, and TM, annotated "All WPs are invalidated!"]
• BTB hit with valid WPs: to OM
• BTB hit with invalid WPs: to TM (the PC and the prediction result are saved into the PBAreg)
• GOtoNM: I-Cache miss, BTB replacement, RAS access, or branch misprediction

HBTC I-$ Operation Example
[Figure sequence, one frame per step: PC, BTB with WP table (entries for Inst. Addr. A and Inst. Addr. B), Mode Controller, PBAreg, WPRreg, WPreg, and a 4-way I-Cache.]
• The PC reaches branch instruction A: BTB hit, predicted taken.
• NO valid WPs are detected!
• The PC and the branch prediction result are saved! (PBAreg holds A and T)
• TM, conventional accesses: the tag-comparison result of each access is stored into the WPRreg (way 1, then way 3, then way 0).
• BTB hit! The WPRreg is stored into the WP-table entry pointed to by the PBAreg.
• Branch instruction A is reached again (BTB hit, predicted taken): valid WPs are detected!
• OM, tag-comparison reuse: the WPreg supplies way 1, then way 3, then way 0.
• No valid WPs in the WPreg! Conventional accesses resume.

Advantages and Disadvantages
[Figure: way activation (way0 to way3) in Omitting Mode (OM) versus Normal Mode (NM) / Tracing Mode (TM).]
☺ Eliminates unnecessary energy consumption w/o performance degradation (during OM)!
☹ BTB energy overhead due to WP-table read accesses
☹ BTB access conflict for invalidating all WPs (causes 1 stall cycle)
☹ BTB access conflict to record WPs (causes 1 stall cycle)

Evaluation – Environment –
• OOO simulation by SimpleScalar
   16 KB 4-way I-cache (32 B line size)
   For others, default parameters were used
• Cache energy model based on [Kamble97] (including the WP-table read-energy overhead)
• Assume that the BTB is accessed only when branch or jump instructions are executed (instructions are pre-decoded)
• Benchmark programs
   SPECint95: 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg
   SPECfp95: 102.swim
   Mediabench: mpeg2encode, mpeg2decode, adpcm_enc, adpcm_dec
[Kamble97] M. B. Kamble and K. Ghose, “Analytical Energy Dissipation Models for Low Power Caches,” ISLPED’97

Evaluation – Energy and Performance –
[Figure: normalized energy (Joule) and normalized execution time (cycle) for 099.go, 124.m88ksim, 126.gcc, 129.comp., 130.li, 132.ijpeg, 102.swim, adpcm(e), adpcm(d), mpeg2(e), and mpeg2(d); # of WPs = 4.]
• 62% Ecache reduction with a 0.2% increase in execution time
• Even in the worst case, about 20% Ecache reduction

Evaluation – Effect of WP invalidation penalty –
[Figure: normalized execution time (cycle) versus WP invalidation penalty (1, 2, 4, 8, 16, 32 cycles) for 126.gcc, 099.go, mpeg2(d), and 132.ijpeg, with the cache miss penalty marked.]
• If the penalty is 4 clock cycles or fewer, the performance overhead is trivial.
• The performance overhead grows once the penalty exceeds 4 clock cycles.
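The two quantities measured in the evaluation, way activations (an energy proxy) and stall cycles caused by WP invalidation, can be made concrete with a toy model. The sketch below is not the authors' simulator: it is a minimal illustration that assumes way pointers are kept in a dictionary keyed by line address instead of in a BTB-attached WP table, that every miss flushes all pointers, and that each flush costs `flush_penalty` stall cycles. The names `ToyHBTCCache`, `run_loop`, and `flush_penalty` are hypothetical.

```python
class ToyHBTCCache:
    """Toy set-associative I-cache that remembers hit ways between misses.

    Simplification (assumption): way pointers live in a dict keyed by line
    address, not in a WP table attached to the BTB as on the slides.
    """

    def __init__(self, num_sets=4, num_ways=4, flush_penalty=1):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.flush_penalty = flush_penalty  # stall cycles per flush (assumed)
        # Resident line address per way, for every set.
        self.lines = [[None] * num_ways for _ in range(num_sets)]
        # LRU order per set, oldest way first.
        self.lru = [list(range(num_ways)) for _ in range(num_sets)]
        self.way_pointers = {}   # line address -> remembered hit way
        self.ways_activated = 0  # proxy for cache-access energy
        self.stall_cycles = 0

    def _touch(self, s, way):
        # Mark `way` as most recently used in set `s`.
        self.lru[s].remove(way)
        self.lru[s].append(way)

    def access(self, line):
        s = line % self.num_sets
        if line in self.way_pointers:
            # Reuse: no tag check, only the remembered way is activated.
            self.ways_activated += 1
            self._touch(s, self.way_pointers[line])
            return "reuse"
        # Conventional access: all ways are activated in parallel.
        self.ways_activated += self.num_ways
        if line in self.lines[s]:
            way = self.lines[s].index(line)
            self.way_pointers[line] = way  # record the tag-check result
            self._touch(s, way)
            return "hit"
        # Miss: replace the LRU way and invalidate every stored pointer.
        victim = self.lru[s][0]
        self.lines[s][victim] = line
        self._touch(s, victim)
        self.way_pointers.clear()
        self.stall_cycles += self.flush_penalty
        return "miss"


def run_loop(cache, loop_lines, iterations):
    """Replay a loop body `iterations` times and count access outcomes."""
    counts = {"miss": 0, "hit": 0, "reuse": 0}
    for _ in range(iterations):
        for line in loop_lines:
            counts[cache.access(line)] += 1
    return counts
```

Replaying a three-line loop ten times with `run_loop(ToyHBTCCache(), [0, 1, 2], 10)` gives 3 misses, 3 recorded hits, and 24 reuses, for 48 way activations where a purely parallel 4-way cache would need 120. These figures come from the toy trace only and say nothing about the benchmark results on the slides.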
Evaluation – Effect of The Number of WPs –
[Figure: normalized energy (Joule) for 126.gcc versus # of way pointers (1, 2, 4, 8, 16, 32), w/ and w/o pre-decoding, split into energy for cache access and energy overhead of BTB.]
• Increasing the number of WPs makes it possible to reuse many tag-check results
• But it produces BTB access energy overhead

Evaluation – Effect of Cache Associativity –
[Figure: energy (Joule) for mpeg2decode versus associativity (1, 2, 4, 8, 16, 32, 64), Conventional and HBTC, broken down into Eothers, Etag, Edata,bl, and Edata,prectl.]
• Conv.: Ecache grows with the increase in associativity
• HBTC: Ecache is reduced with the increase in associativity (n<=4); after that, it starts to increase (n>4)

Conclusions
History-Based Tag-Comparison Instruction Cache
1. Records tag-check results generated by the I-cache into the extended BTB
2. Attempts to reuse them in order to eliminate unnecessary way activation
3. Achieves a 62% I-cache energy reduction with only 0.2% performance degradation!
Future work
• Analyze energy consumption based on real chip design.

Backup Slides (History-based Tag-Comparison Cache)

Evaluation – Comparison with IS Approach –
[Figure: normalized tag-compare count for 099.go, 124.m88ksim, 126.gcc, 129.comp., 130.li, 132.ijpeg, 102.swim, adpcm(e), adpcm(d), mpeg2(e), and mpeg2(d), comparing the Interline Sequential (IS) approach, the History-Based Look-up (HBL) cache, and the combination of IS and HBL.]

Evaluation – Effects of Cache Associativity –
[Figures, one each for 099.go, 126.gcc, 132.ijpeg, and mpeg2decode: energy (Joule) versus associativity (1, 2, 4, 8, 16, 32, 64), Conventional and HBL cache, broken down into Eothers, Etag, Edata,bl, and Edata,prectl; 0.8um CMOS* **]
*) M. B. Kamble and K. Ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10th Int. Conf. on VLSI Design
**) S. J. E. Wilton and N. P. Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5

Evaluation – Effects of # of WPs –
[Figures: normalized energy (Joule) for 126.gcc and 132.ijpeg versus # of way pointers (1, 2, 4, 8, 16, 32), split into energy for cache access and energy overhead at BTB.]
• w/ pre-decoding: BTB access occurs only at branch or jump executions
• w/o pre-decoding: BTB access occurs for all instructions

Evaluation – Effect of WP invalidation penalty –
[Figure: normalized execution time (cycle) versus WP invalidation penalty (1, 2, 4, 8, 16, 32 cycles) for 126.gcc, 099.go, mpeg2(d), and 132.ijpeg, with the cache miss penalty marked.]
[Figure: breakdown of WP invalidations (BTB replacement versus cache miss, 0% to 100%) for each benchmark.]
• If the penalty is 4 clock cycles or fewer, the performance overhead is trivial.
• The performance overhead grows once the penalty exceeds 4 clock cycles.
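As further backup material, the NM / OM / TM behaviour named on the Mode Transition slide can be written down as a small transition function. This is only one possible reading of the slide text, not the authors' controller logic. It assumes that a BTB hit selects OM or TM from any mode, that only an I-cache miss or a BTB replacement flushes all way pointers (the two causes listed in the invalidation breakdown), and that running out of valid way pointers in OM returns to NM. The event names, `next_mode`, and the `wp_exhausted` event are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    NM = "normal"    # tag checks performed
    OM = "omitting"  # tag checks skipped, way pointers reused
    TM = "tracing"   # tag checks performed and their results recorded


# Events the Mode Transition slide lists under GOtoNM.
GO_TO_NM = {"icache_miss", "btb_replacement", "ras_access", "branch_mispredict"}
# Assumption: only these two events also invalidate every stored way pointer.
FLUSH_ALL_WPS = {"icache_miss", "btb_replacement"}


def next_mode(mode, event, wp_valid=False):
    """Return (new_mode, flush_all_wps) for one event.

    `wp_valid` is the valid flag of the WP-table entry read on a BTB hit.
    """
    if event in GO_TO_NM:
        return Mode.NM, event in FLUSH_ALL_WPS
    if event == "btb_hit":
        # Valid way pointers allow reuse; otherwise start tracing.
        return (Mode.OM if wp_valid else Mode.TM), False
    if event == "wp_exhausted" and mode is Mode.OM:
        # No valid WPs left in the WPreg: fall back to conventional accesses.
        return Mode.NM, False
    return mode, False
```

For example, `next_mode(Mode.NM, "btb_hit", wp_valid=False)` selects tracing mode, and `next_mode(Mode.OM, "icache_miss")` returns to normal mode with the flush flag set.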