The Reuse Cache
Downsizing the Shared Last-Level Cache

Jorge Albericio, Pablo Ibáñez, Víctor Viñals, and José M. Llabería
Modern CMPs

[Die shots: the shared last-level cache (SLLC) occupies a large fraction of the die in recent CMPs]
•  Intel Xeon E5-2600 (2013)
•  AMD Orochi (2012)
•  Fujitsu SPARC64 VIIIfx (2011)
•  Intel Itanium 9500 (2012)
Inclusive Shared Last-Level Cache (SLLC)

[Figure: an 8MB conventional SLLC next to the reuse cache, which keeps about 1/2 of the tags and 1/8 of the data (1MB)]

Same average performance* with 84% area savings
*100 multiprogrammed SPEC CPU 2006 workloads on an 8-core CMP
Outline

!  Motivation
!  The reuse cache
   •  Idea
   •  Organization
   •  Coherence
   •  Replacement
!  Related work
!  Evaluation
!  Conclusions
Motivation

Reuse locality:
•  Lines used more than once are likely to be used many times
•  Recently reused lines are more useful than lines reused earlier

[Figure: SLLC data — a few lines receive many hits, most receive zero hits]

This is an opportunity for a smaller SLLC that stores only reused data.
Motivation

[Figure: evolution of the live fraction of the cache over time (×100K cycles) under LRU, DRRIP, and NRR — 8MB shared LLC, 16-way, 8-core CMP, mix #14]

!  The fraction of live SLLC lines is very small (10-30% with LRU)
   •  On average, 78% of the lines will not receive any additional access
!  State-of-the-art replacement policies (DRRIP, NRR) do better, but most lines remain dead
Motivation

[Figure: cumulative percentage of total hits vs. percentage of inserted lines — 8MB shared LLC, 16-way, LRU, 8-core CMP, mix #14]

•  5% of all the loaded lines concentrate all the hits
•  0.5% of all the loaded lines concentrate close to 50% of the hits

Most inserted lines will never experience a hit, because hits concentrate in a few lines.
Idea

Data is stored only when reuse is detected (a minimal sketch follows).

[Figure: tag and data arrays between main memory and the private caches]
•  First access (miss): only the tag is inserted; the data bypasses the SLLC on its way to the private caches
•  Second access (tag hit): reuse is detected and the data is inserted
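To make the allocation policy concrete, here is a minimal Python sketch (illustrative only, not the simulator used in the paper; the arrays are modeled as an unbounded set and dict, with no replacement): the first access to a block installs only its tag, and data is allocated only when a later access hits on a tag that still has no data.

```python
# Minimal model of the reuse-cache allocation policy (illustrative only).
# Assumptions: unbounded, fully-associative arrays, block addresses as integers.

def access(addr, tags, data):
    """Handle one SLLC access to block `addr` and report what happened."""
    if addr in data:                 # tag hit with data: a normal SLLC hit
        return "hit: data served from the SLLC"
    if addr in tags:                 # tag hit without data: reuse detected
        data[addr] = f"line {addr:#x} (refetched from memory)"
        return "tag hit: reuse detected, data inserted"
    tags.add(addr)                   # miss: insert the tag only
    return "miss: tag inserted, data bypasses the SLLC"

tags, data = set(), {}
for a in (0x40, 0x80, 0x40, 0x40):
    print(f"{a:#x} -> {access(a, tags, data)}")
# 0x40 -> miss: tag inserted, ...        (first access)
# 0x80 -> miss: tag inserted, ...
# 0x40 -> tag hit: reuse detected, ...   (second access)
# 0x40 -> hit: data served from the SLLC
```

Note that on the second access the data still has to come from memory, since the first access left only a tag behind; that is the price of filtering out never-reused lines.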
Organization

!  Decoupled tag and data arrays
   •  More tag entries than data entries

[Figure: conventional cache (one data entry per tag) vs. reuse cache (tag array larger than data array)]
Organization

!  Tag array
   •  Maintains coherence/inclusion
   •  Set-associative (16-way)
   •  Each entry may or may not have associated data
   •  NEW: forward pointers indicate the corresponding data array entry
!  Data array
   •  Only stores reused lines
   •  Queue or set-associative organization
   •  Reverse pointers to update the corresponding tag array entry

[Figure: 16-way tag array entries hold tag, state, and forward pointer; data array entries hold reverse pointer, data, and replacement state]

(A sketch of the two arrays and their pointers follows.)
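The following Python sketch models the decoupled arrays and both kinds of pointers (illustrative; the entry formats and the helper names `insert_data`/`evict_data` are assumptions, not the paper's exact design). A data eviction follows the reverse pointer to demote the owning tag entry back to a tag-only state.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TagEntry:                  # one entry of the (larger) tag array
    tag: int
    state: str = "tag-only"      # coherence-state family (see the coherence slide)
    fwd: Optional[int] = None    # forward pointer: index of the data array entry

@dataclass
class DataEntry:                 # one entry of the (smaller) data array
    rev: int                     # reverse pointer: index of the owning tag entry
    line: bytes

tag_array = [TagEntry(0x40), TagEntry(0x80), TagEntry(0xC0)]
data_array = []

def insert_data(tag_idx, line):
    """On reuse detection, allocate a data entry and link both pointers."""
    data_array.append(DataEntry(rev=tag_idx, line=line))
    tag_array[tag_idx].fwd = len(data_array) - 1
    tag_array[tag_idx].state = "tag+data"

def evict_data(data_idx):
    """On a data eviction, follow the reverse pointer to demote the tag entry."""
    tag_array[data_array[data_idx].rev].fwd = None
    tag_array[data_array[data_idx].rev].state = "tag-only"

insert_data(1, b"\x00" * 64)     # line 0x80 was reused: it gets a data entry
evict_data(0)                    # later its data is evicted: the tag entry is demoted
print(tag_array[1])              # TagEntry(tag=128, state='tag-only', fwd=None)
```

The reverse pointer means a data-array replacement never has to search the tag array for the entry that must be downgraded.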
Coherence

!  Conventional coherence protocols can be used with small changes
   •  Two types of states have to be considered:
      #  Tag-only states
      #  Tag + data states
   •  Transitions between the two types are triggered by data insertions (reuse) and data evictions

[Figure: a data array insertion (reuse) moves a line from a tag-only state to a tag + data state; a data eviction moves it back]
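A sketch of the two state families layered on a MESI-style protocol (illustrative Python; the state names `MT/ET/ST` and `MD/ED/SD` are assumptions, not the Ruby protocol actually used in the paper). Data insertion on reuse promotes a line to the matching tag + data state; evicting its data entry demotes it back.

```python
# Two families of coherence states (illustrative naming only).
TAG_ONLY = {"MT", "ET", "ST"}   # tracked for coherence/inclusion, no data in the SLLC
TAG_DATA = {"MD", "ED", "SD"}   # same sharing information, plus a data array entry

PROMOTE = {"MT": "MD", "ET": "ED", "ST": "SD"}   # data insertion (reuse detected)
DEMOTE = {v: k for k, v in PROMOTE.items()}      # data entry evicted

def on_data_insertion(state):
    assert state in TAG_ONLY
    return PROMOTE[state]

def on_data_eviction(state):
    assert state in TAG_DATA
    return DEMOTE[state]

s = "ST"                    # shared, tag-only
s = on_data_insertion(s)    # reuse detected      -> "SD"
s = on_data_eviction(s)     # data entry evicted  -> "ST"
print(s)
```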
Replacement

!  Tag and data arrays have independent replacement policies
   •  Tag array
      #  Not Recently Reused (NRR) [Albericio et al., TACO 2013]: reused lines and private-cache contents are unlikely to be evicted
   •  Data array
      #  Clock, for the queue organization [Corbató, MIT 1968] (sketch below)
      #  Not Recently Used (NRU), for the set-associative organization
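For the queue-organized data array, a short Python sketch of Clock (written from the textbook description of the policy; class and parameter names are illustrative). A hand sweeps the queue, clearing reference bits and evicting the first entry whose bit is already clear; here new entries start with the bit clear, matching the insertion point shown on the next slide.

```python
class ClockQueue:
    """Clock replacement for a queue-organized data array (illustrative)."""

    def __init__(self, capacity):
        self.entries = [None] * capacity   # stored block addresses (None = free)
        self.ref = [False] * capacity      # one reference bit per entry
        self.hand = 0

    def touch(self, index):
        """A hit on entry `index` sets its reference bit."""
        self.ref[index] = True

    def insert(self, addr):
        """Sweep with the hand: clear set bits, evict the first clear entry."""
        while self.ref[self.hand]:
            self.ref[self.hand] = False                      # second chance
            self.hand = (self.hand + 1) % len(self.entries)
        victim = self.hand
        self.entries[victim] = addr
        self.ref[victim] = False           # new entries start with the bit clear
        self.hand = (self.hand + 1) % len(self.entries)
        return victim

q = ClockQueue(4)
for a in (0x100, 0x140, 0x180, 0x1C0):
    q.insert(a)
q.touch(0)                 # entry 0 (0x100) is reused, so it gets a second chance
print(q.insert(0x200))     # prints 1: entry 1 (0x140) is evicted, entry 0 survives
```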
NRU vs. NRR

!  NRU
   •  Exploits temporal locality (an approximation of LRU)
   •  1 bit per line, set when the line is used
   •  Used in the Intel i7 and SPARC T2

[Figure: NRU partitions the lines of a set into "used" and "not used"; insertions go to "not used"]

!  NRR
   •  Exploits reuse locality
   •  1 bit per line, set when the line is reused
   •  Private-cache contents are protected

[Figure: NRR partitions the lines into "reused" and "not reused"; insertions go to "not reused"]

(A sketch contrasting the two policies follows.)
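A sketch contrasting the two one-bit policies within a set (illustrative Python; the tie-breaking among candidates and the fallback when every line is protected are assumptions). NRU sets its bit on every access and evicts a line whose bit is clear; NRR sets its bit only on reuse and additionally avoids evicting lines still present in the private caches.

```python
import random

def nru_victim(used):
    """NRU: evict any line whose 'used' bit is clear; if none, clear all bits first."""
    free = [i for i, u in enumerate(used) if not u]
    if not free:
        for i in range(len(used)):
            used[i] = False
        free = list(range(len(used)))
    return random.choice(free)

def nrr_victim(reused, in_private):
    """NRR: prefer lines never reused and not present in any private cache."""
    free = [i for i in range(len(reused)) if not reused[i] and not in_private[i]]
    if not free:                                  # every unprotected line was reused:
        for i in range(len(reused)):
            if not in_private[i]:
                reused[i] = False                 # clear reuse bits of unprotected lines
        free = [i for i in range(len(reused)) if not in_private[i]]
    if not free:                                  # every line is protected (rare):
        free = list(range(len(reused)))           # simplification for this sketch
    return random.choice(free)

# 4-way set: under NRU, lines 1 and 3 were not recently used;
# under NRR, line 2 was reused and line 3 still lives in a private L1/L2.
print(nru_victim([True, False, True, False]))                                 # 1 or 3
print(nrr_victim([False, False, True, False], [False, False, False, True]))   # 0 or 1
```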
Related work

!  Many proposals decouple the tag and data arrays
   •  Decoupled sectored caches, Seznec, ISCA 1994
   •  The pool of subsectors, Rothman and Smith, ICS 1999
   •  NuRAPID, Chishti et al., MICRO 2003
   •  V-Way cache, Qureshi et al., ISCA 2005
!  Non-inclusive cache, inclusive directory (NCID), Zhao et al., Computing Frontiers 2010
   •  Additional tags to maintain inclusion
   •  Selective allocation policy based on set dueling

[Figure: NCID set layout — tag+data ways plus extra tag-only ways]
Methodology

!  Baseline system configuration
   •  8-core CMP with in-order cores
   •  Per core: 16KB I$ and 16KB D$ L1, 256KB L2
   •  Crossbar to an 8MB inclusive SLLC, 16-way, LRU
   •  1 DDR3 channel, 4 banks
!  Simulation
   •  Simics full-system simulator
   •  Ruby, from the GEMS Multifacet toolset
!  Area and latency
   •  CACTI 6.5
!  Workloads
   •  Multiprogrammed: SPEC CPU 2006 (100 random mixes drawn from all 29 applications)
   •  Parallel: PARSEC and SPLASH-2
Terminology

!  RC-x/y = reuse cache with tags equivalent to an x MB conventional cache and y MB of data

[Figure: RC-x/y combines the tag array of a conventional x MB cache with the data array of a conventional y MB cache]
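As a rough sanity check of the sizes involved, a back-of-the-envelope Python sketch for RC-4/1, the configuration highlighted on the summary slide (the tag, state, and pointer widths below are assumptions; the paper's 84% figure comes from CACTI 6.5 area estimates, so the numbers only approximately agree).

```python
LINE = 64                       # bytes per cache line
TAG_BITS, STATE_BITS = 30, 4    # assumed per-entry tag and coherence-state widths

def conventional_bits(data_mb):
    lines = data_mb * 2**20 // LINE
    return lines * (TAG_BITS + STATE_BITS) + lines * LINE * 8

def reuse_cache_bits(tag_equiv_mb, data_mb):
    tag_entries = tag_equiv_mb * 2**20 // LINE    # as many tags as an x MB cache
    data_entries = data_mb * 2**20 // LINE        # but only y MB of data
    fwd_ptr = (data_entries - 1).bit_length()     # forward pointer into the data array
    rev_ptr = (tag_entries - 1).bit_length()      # reverse pointer into the tag array
    tag_array = tag_entries * (TAG_BITS + STATE_BITS + 1 + fwd_ptr)  # +1 pointer-valid bit
    data_array = data_entries * (rev_ptr + LINE * 8)
    return tag_array + data_array

conv = conventional_bits(8)     # conventional 8 MB SLLC
rc41 = reuse_cache_bits(4, 1)   # RC-4/1: tags of a 4 MB cache, 1 MB of data
print(f"storage saving: {1 - rc41 / conv:.0%}")   # ~83% with these assumed widths
```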
Size exploration

[Figure: performance relative to the conventional 8MB cache for RC-x/8, RC-x/4, RC-x/2, RC-x/1, and RC-x/0.5, sweeping the tag array size x (2 to 64 MB equivalent); the chosen configuration for each data array size is marked]
Size exploration

[Figure: performance relative to the conventional 8MB LRU cache (left axis) and hardware cost relative to the conventional 8MB cache (right axis) for RC-16/8, RC-8/4, RC-8/2, RC-4/1, and RC-4/0.5]

The RC-4/1 configuration keeps the average performance of the conventional 8MB cache while reducing hardware cost: same average performance with 84% area savings.
Size exploration with a better baseline

[Figure: performance relative to the conventional 8MB cache for RC-16/8, RC-8/4, RC-8/2, RC-4/1, and RC-4/0.5, compared against both the conventional LRU 8MB and the conventional DRRIP 8MB baselines]

•  +1.1% performance with 40% less hardware cost
•  -0.7% performance with 60% less hardware cost
Multiprogrammed workloads

[Figure: per-mix performance relative to the conventional 8MB cache]
•  Better in more than half of the mixes
•  9 workloads show more than 5% speedup
•  6 workloads show more than 5% slowdown
NCID comparison

[Figure: performance relative to the conventional 8MB cache for the reuse cache vs. NCID at matched sizes — RC-8/4 vs NCID-8/4, RC-8/2 vs NCID-8/2, RC-4/1 vs NCID-4/1, RC-4/0.5 vs NCID-4/0.5; annotated values: 7.0% and 6.4% for the reuse cache, 5.2% and 5.3% for NCID]
Behavior insight

!  Percentage of data stored in the cache with respect to the number of tags
   •  Conventional cache: 100%
   •  RC-4/1: 5%

[Figure: cumulative percentage of total hits vs. percentage of inserted lines — 5% of the loaded lines get all the hits]

The content-selection policy is picking exactly the lines that will receive hits.
Additional results

!  Performance analysis for more sizes
!  Set-associative data array
!  Per-application study
!  Parallel workloads
Conclusions

!  A big fraction of the SLLC contents is dead
   •  A selective allocation policy is needed
!  A new SLLC organization is proposed: the reuse cache
   •  Very selective line allocation policy, based on reuse
   •  Same average performance with 84% storage savings
   •  The savings can alternatively be spent on:
      #  Further reduction of energy
      #  More computing elements in the same area