A Low Energy Set-Associative I-Cache
with Extended BTB
K. Inoue, V. Moshnyaga, and K. Murakami
Introduction
Increase in cache size
Power consumed in on-chip caches
• DEC 21164 CPU*: 25%
• StrongARM SA-110 CPU*: 43%
• Bipolar ECL CPU**: 50%
* Kamble et al., “Analytical Energy Dissipation Models for Low Power Caches”, ISLPED’97
** Jouppi et al., “A 300-MHz 115-W 32-b Bipolar ECL Microprocessor”, IEEE Journal of Solid-State Circuits’93
Problem of Conventional Caches
Conventional Cache
• Fast access but high energy consumption
• Parallel search strategy produces unnecessary way activation!
Phased Cache
• Low energy consumption but slow access
[Figure: Tag and Line sub-arrays of way0 to way3, labeled Cycle 1 for the conventional cache and Cycle 1 / Cycle 2 for the phased cache]
Our Proposal
History-Based Tag-Comparison I-Cache
• Attempts to reduce cache-access energy without performance degradation
• Reuses tag-check results to eliminate unnecessary way activation
• Can achieve a 62% energy reduction with only 0.2% performance degradation
Conventional Tag-Check Scheme
Cache is updated only when replacement occurs!
• Loaded data stays at the same location at least until the next cache miss takes place
Programs include a lot of loops!
• A number of instructions are executed repetitively
[Figure: timeline of a cache-miss interval between two misses (Miss!), during which the cache is stable; Inst. A is referenced N times (Ref. A), and every reference performs a tag check (Tag Check!)]
Completely the same tag-check result!
History-Based Tag-Comparison (HBTC) Scheme
Attempts to reuse tag-check results produced earlier in the same cache-miss interval! A result can be reused when:
• The target instruction has been referenced before, and
• No cache miss has occurred since the previous reference.
[Figure: timeline of a cache-miss interval between two misses (Miss!); the first Ref. A performs a tag check (Tag Check!), and a later Ref. A reuses the result (Reuse!)]
Concept of the HBTC Cache
1. Execute an instruction A at time T
• Perform tag check
• Save the tag-check result into extended BTB
[Figure: Index selects a set across way0 to way3; [way2] is the Hit-way!]
2. If a cache miss occurs, then we invalidate all the stored tag-check results
3. Execute the instruction A at time T+X
• Reuse the tag-check result to activate only the hit-way’s data sub-array
[Figure: Index selects a set across way0 to way3; [way2] is the Hit-way!]
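The three steps above can be sketched as a toy model. This is only an illustration under stated assumptions, not the hardware design: it models a single cache set, uses round-robin replacement, and keeps the saved tag-check results in a plain dictionary instead of the extended BTB. The names `HBTCSketch`, `fetch`, `tag_checks`, and `data_activations` are hypothetical.

```python
class HBTCSketch:
    """Toy model of one cache set that reuses tag-check results until the next miss."""

    def __init__(self, n_ways=4):
        self.ways = [None] * n_ways   # address currently held by each way
        self.history = {}             # address -> hit way (saved tag-check results)
        self.victim = 0               # round-robin replacement pointer (assumption)
        self.tag_checks = 0           # number of accesses that performed a tag check
        self.data_activations = 0     # number of data sub-arrays activated

    def fetch(self, addr):
        if addr in self.history:
            # Step 3: reuse the saved result and activate only the hit way.
            self.data_activations += 1
            return "reuse", self.history[addr]
        # Conventional access: check the tags and activate every way.
        self.tag_checks += 1
        self.data_activations += len(self.ways)
        if addr in self.ways:
            # Step 1: save the tag-check result for later references.
            way = self.ways.index(addr)
            self.history[addr] = way
            return "hit", way
        # Step 2: a miss changes the cache contents, so drop all saved results.
        way = self.victim
        self.victim = (self.victim + 1) % len(self.ways)
        self.ways[way] = addr
        self.history.clear()
        return "miss", way


cache = HBTCSketch()
print([cache.fetch("A")[0] for _ in range(4)])
```

In this sketch the first reference after a miss still performs a tag check, and only later references in the same cache-miss interval are served by reuse.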
Conventional VS. Phased VS. HBTC
[Figure: Tag and Line sub-arrays of way0 to way3 activated in each cycle]
• Conventional: Cycle 1
• Phased, Cache Hit: Cycle 1, Cycle 2
• Phased, Cache Miss: Cycle 1
• HBTC, Reuse: Cycle 1
• HBTC, No Reuse: Cycle 1
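The comparison can be put into numbers with a small helper. The sub-array counts are one reading of the figure, not values stated on the slide: a conventional access reads all tags and all lines in one cycle, a phased access reads all tags first and then at most one line, and an HBTC access with reuse reads only the hit way's line. The function name `access_cost` and its return layout are hypothetical.

```python
def access_cost(scheme, n_ways=4, hit=True, reuse=False):
    """Return (tag sub-arrays, data sub-arrays, cycles) activated for one access."""
    if scheme == "conventional":
        # All tags and all lines are read in parallel.
        return (n_ways, n_ways, 1)
    if scheme == "phased":
        # Tags first; the hit line is read in a second cycle only on a hit.
        return (n_ways, 1, 2) if hit else (n_ways, 0, 1)
    if scheme == "hbtc":
        # With reuse, no tag check is needed and only the hit way is read;
        # without reuse, the access falls back to the conventional behavior.
        return (0, 1, 1) if reuse else (n_ways, n_ways, 1)
    raise ValueError(f"unknown scheme: {scheme}")


for name, kwargs in [("conventional", {}),
                     ("phased", {"hit": True}),
                     ("phased", {"hit": False}),
                     ("hbtc", {"reuse": True}),
                     ("hbtc", {"reuse": False})]:
    print(name, kwargs, access_cost(name, **kwargs))
```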
HBTC SA I-$ Architecture
[Figure: block diagram of the HBTC set-associative I-cache, with the following labeled parts]
• BTB (Branch Target Buffer): Branch-Inst. Addr., Target Addr., and the WP Table
• Entry of the WP Table: valid flag and n way pointers (WPs), for Taken and for Not Taken
• PBAreg: Branch-Inst. Addr. and Pred. Result (address for writing)
• WP Record Reg. (WPRreg): Tag Check Result
• WP Reg. (WPreg)
• Mode Controller: Mode, Miss?
• I-Cache, accessed by the PC
• Branch Prediction Result
HBTC I-$ Operation
Normal Mode (NM): w/ Tag checks
Omitting Mode (OM): w/o Tag checks (Reuse)
Tracing Mode (TM): w/ Tag checks
(tag-check results are held in the WPRreg and stored into the WP-table on the next BTB hit)
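Mode Transition
[Figure: state diagram of NM, OM, and TM]
• NM, on a BTB Hit: go to OM if the WPs are Valid, go to TM if they are Invalid
• TM: the PC and Pred.-result are saved into the PBAreg
• OM or TM to NM (GOtoNM): I-Cache miss, BTB replacement, RAS access, or Branch misprediction
• All WPs are invalidated!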
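The mode transitions can be sketched as a small state function. This is a minimal sketch, not the actual controller logic. The event names are hypothetical labels for the conditions in the diagram, and applying the BTB-hit rule from every mode (not only from NM) is an assumption.

```python
from enum import Enum


class Mode(Enum):
    NM = "normal"     # with tag checks
    OM = "omitting"   # without tag checks (reuse)
    TM = "tracing"    # with tag checks, results are recorded


# Events that force a return to Normal Mode (GOtoNM in the diagram).
GO_TO_NM_EVENTS = {"icache_miss", "btb_replacement", "ras_access", "branch_misprediction"}


def next_mode(mode, event, wp_valid=False):
    """Return the mode after `event`; `wp_valid` is the valid flag read on a BTB hit."""
    if event in GO_TO_NM_EVENTS:
        return Mode.NM
    if event == "btb_hit":
        # Valid way pointers allow reuse; otherwise start tracing.
        return Mode.OM if wp_valid else Mode.TM
    return mode  # any other event leaves the mode unchanged


mode = Mode.NM
for event, valid in [("btb_hit", False), ("btb_hit", True), ("icache_miss", False)]:
    mode = next_mode(mode, event, valid)
    print(event, "->", mode.name)
```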
HBTC I-$ Operation Example
[Figure sequence: the mode-transition diagram (NM, OM, TM) shown next to the Branch Target Buffer with its WP Table (entries Inst. Addr. A and Inst. Addr. B, each with T and N way pointers), the PBAreg, WPRreg, WPreg, Mode Controller, and a 4-way I-Cache (ways 0 1 2 3)]
1. The PC hits BTB entry A, and the branch is predicted Taken.
2. NO valid WPs are detected!
3. PC and branch prediction result are saved! (PBAreg: A, T)
4. Conventional accesses! The tag-comparison result of each access is stored into the WPRreg! (hit ways 1, 3, 0 in the example)
5. BTB Hit! (PC: B) The WPRreg is stored into the WP-Table entry pointed to by the PBAreg!
6. The PC hits BTB entry A again, and the branch is predicted Taken.
7. Valid WPs are detected!
8. Tag-comparison reuse: the WPreg supplies the hit way for each access (ways 1, 3, 0 in the example)
9. No valid WPs in the WPreg!
10. Conventional accesses!
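The recording and reuse of way pointers in this example can be sketched in software. This is only an illustration under assumptions: the WP table is a dictionary keyed by (branch address, prediction), and the class and method names (`WPTableSketch`, `btb_hit`, `trace`, `invalidate_all`) are hypothetical rather than taken from the design.

```python
class WPTableSketch:
    """Toy model of the PBAreg, the WPRreg, and the WP table in the extended BTB."""

    def __init__(self, n_wps=4):
        self.n_wps = n_wps   # way pointers per WP-table entry
        self.table = {}      # (branch addr, prediction) -> recorded way pointers
        self.pba = None      # PBAreg: entry to write on the next BTB hit
        self.wpr = []        # WPRreg: way pointers traced so far

    def btb_hit(self, branch_addr, prediction):
        # Store the traced pointers into the entry pointed to by the PBAreg.
        if self.pba is not None and self.wpr:
            self.table[self.pba] = list(self.wpr)
        self.pba = (branch_addr, prediction)
        self.wpr = []
        # Valid pointers (a list) mean OM; None means TM.
        return self.table.get(self.pba)

    def trace(self, hit_way):
        # Tracing Mode: keep the tag-check result of a conventional access.
        if len(self.wpr) < self.n_wps:
            self.wpr.append(hit_way)

    def invalidate_all(self):
        # All WPs are invalidated.
        self.table.clear()
        self.pba = None
        self.wpr = []


wp = WPTableSketch()
print(wp.btb_hit("A", "T"))   # no valid WPs yet
for way in (1, 3, 0):
    wp.trace(way)
print(wp.btb_hit("B", "N"))   # stores the traced pointers for (A, T)
print(wp.btb_hit("A", "T"))   # valid WPs are found
```

Once the recorded pointers run out, the sketch has nothing more to supply, which corresponds to the "No valid WPs in the WPreg!" step followed by conventional accesses.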
Advantages and Disadvantages
[Figure: way0 to way3 activation in Omitting Mode (OM) and in Normal Mode (NM) / Tracing Mode (TM)]
☺ Eliminates unnecessary energy consumption w/o performance degradation (during OM)!
☹ BTB energy overhead due to WP-table read accesses
☹ BTB access conflict for invalidating all WPs (causes 1 stall cycle)
☹ BTB access conflict to record WPs (causes 1 stall cycle)
Evaluation – Environment –
• OOO simulation by SimpleScalar
16 KB 4-way I-cache (32 B line size)
For others, default parameters were used
• Cache Energy Model based on [Kamble97]
(including the WP-table read-energy overhead)
• Assume that the BTB is accessed only when branch or jump
instructions are executed (instructions are pre-decoded)
Benchmark Programs
SPECint95
099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg
SPECfp95
102.swim
Mediabench
mpeg2encode, mpeg2decode, adpcm_enc, adpcm_dec
[Kamble97] M.B. Kamble and K. Ghose, “Analytical Energy Dissipation Models for Low Power Caches,” ISLPED’97
Evaluation – Energy and Performance –
[Figure: Normalized Energy (Joule) and Normalized Exe. Time (cycle) for each benchmark with # of WPs = 4, annotated with 62% and 0.2%; benchmarks: 099.go, 124.m88ksim, 126.gcc, 129.comp., 130.li, 132.ijpeg, 102.swim, adpcm(e), adpcm(d), mpeg2(e), mpeg2(d)]
• 62% Ecache reduction with a 0.2% increase in Exe. Time
• Even in the worst case, about 20% Ecache reduction
Evaluation – Effect of WP invalidation penalty –
[Figure: Norm. Exe. Time (cycle) vs. WP Invalidation Penalty (cycle) of 1, 2, 4, 8, 16, and 32 for 099.go, 126.gcc, 132.ijpeg, and mpeg2(d); the Cache Miss Penalty is marked on the chart]
• If the penalty is 4 clock cycles or fewer, the performance overhead is trivial.
• The performance overhead grows once the penalty exceeds 4 clock cycles.
Evaluation – Effect of The Number of WPs –
[Figure: Normalized Energy (Joule) for 126.gcc vs. # of Way Pointer (1, 2, 4, 8, 16, 32), w/ Pre-Decoding and w/o Pre-Decoding; bars split into Energy for Cache Access and Energy Overhead of BTB]
• Increasing the number of WPs makes it possible to reuse more tag-check results
• But it also increases the BTB access energy overhead
Evaluation – Effect of Cache Associativity –
[Figure: Energy (Joule) for mpeg2decode vs. Associativity (1, 2, 4, 8, 16, 32, 64), Conventional and HBTC; bars split into Eothers, Etag, Edata,bl, and Edata,prectl]
• Conv.: Ecache grows as associativity increases
• HBTC: Ecache decreases as associativity increases up to n = 4, then starts to increase (n > 4)
Conclusions
History-Based Tag-Comparison Instruction Cache
1. Records tag-check results generated by the I-cache into the extended BTB
2. Attempts to reuse them in order to eliminate unnecessary way activation
3. Achieves a 62% I-cache energy reduction with only 0.2% performance degradation!
Future work
• Analyze energy consumption based on real chip
design.
Backup Slides
(History-based Tag-Comparison Cache)
Evaluation – Comparison with IS Approach –
[Figure: Normalized Tag-Compare Count (0.0 to 0.8) for each benchmark under the Interline Sequential (IS) approach, the History-Based Look-up (HBL) Cache, and the combination of IS and HBL; benchmarks: 099.go, 124.m88ksim, 126.gcc, 129.comp., 130.li, 132.ijpeg, 102.swim, adpcm(e), adpcm(d), mpeg2(e), mpeg2(d)]
Evaluation – Effects of Cache Associativity –
[Figures: Energy (Joule) vs. Associativity (1, 2, 4, 8, 16, 32, 64) for the Conventional and HBL caches, one chart each for 099.go, 126.gcc, 132.ijpeg, and mpeg2decode; bars split into Eothers, Etag, Edata,bl, and Edata,prectl; 0.8um CMOS* **]
*) M.B. Kamble and K. Ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10th Int. Conf. on VLSI Design
**) S.J.E. Wilton and N.P. Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5
Evaluation – Effects of # of WPs –
[Figures: Normalized Energy (Joule) vs. # of Way Pointer (1, 2, 4, 8, 16, 32) for 126.gcc and 132.ijpeg; bars split into Energy for Cache Access and Energy Overhead at BTB; one chart for each case below]
• w/ Pre-Decoding (BTB access occurs only at branch or jump executions)
• w/o Pre-Decoding (BTB access occurs for all instructions)
Evaluation – Effect of WP invalidation penalty –
[Figure: Normalized Exe. Time (cycle) vs. WP Invalidation Penalty (cycle) of 1, 2, 4, 8, 16, and 32 for 099.go, 126.gcc, 132.ijpeg, and mpeg2(d); the Cache Miss Penalty is marked on the chart]
[Figure: Breakdown of WP invalidations (0% to 100%) into BTB Replacement and Cache Miss for each benchmark: 099.go, 124.m88ksim, 126.gcc, 129.comp., 130.li, 132.ijpeg, 102.swim, adpcm(e), adpcm(d), mpeg2(e), mpeg2(d)]
• If the penalty is 4 clock cycles or fewer, the performance overhead is trivial.
• The performance overhead grows once the penalty exceeds 4 clock cycles.