
LCA: A Memory Link and Cache-Aware Co-scheduling Approach for CMPs
Dr. Konstantinos Nikas
Computing Systems Laboratory
National Technical University of Athens
8/10/2014
A Typical CMP
• Concurrent execution of multiple threads
• Various levels of sharing
  – Caches
  – Prefetchers
  – Memory controllers
  – Memory link
• Sharing affects performance
Co-execution Impact
• Single-thread execution
  – Thread 1 alone on C1: IPC1 = 2.2
  – Thread 2 alone on C2: IPC2 = 1.8
[Figure: two cores (C1, C2) with private L1 caches, a shared LLC, and a memory link to DRAM]
Co-execution Impact
• Dual-thread execution
  – Performance degradation for both threads: IPC1 = 1.9, IPC2 = 1.6
  – Competition on shared resources
    • Shared LLC
    • Memory link
[Figure: both cores active, contending on the shared LLC and the memory link to DRAM]
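To make the degradation concrete, the per-thread slowdown follows directly from the IPC values above; a minimal worked example in Python:

```python
# Worked example: per-thread slowdown computed from the IPC values above.
# A slowdown factor > 1 means the thread runs slower when co-executed.
def slowdown(ipc_solo, ipc_co):
    return ipc_solo / ipc_co

runs = {"thread 1": (2.2, 1.9), "thread 2": (1.8, 1.6)}  # (solo, co-run) IPC
for name, (solo, co) in runs.items():
    print(f"{name}: slowdown {slowdown(solo, co):.2f}x, "
          f"{1 - co / solo:.0%} throughput loss")
# thread 1: slowdown 1.16x, 14% throughput loss
# thread 2: slowdown 1.12x, 11% throughput loss
```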
Co-execution in CMPs
• Initial work focused on cache contention
  – Cache partitioning (Qureshi and Patt, 2006)
  – Cache-fair scheduling (Fedorova et al., 2007)
• In reality there is more than one contention point
  – Cache, memory controller, memory link, prefetchers
→ Contention-Aware Scheduling
Contention-Aware Scheduling
• Miss rate (Blagodurov et al., 2010)
  – Sort apps by LLC miss rate
  – Target: distribute miss rate evenly among the caches
• Memory bandwidth (Merkel et al., 2010)
  – Sort apps by memory bus utilization
  – Combine apps with complementary bandwidth demands
• ADAPT (Pusukuri et al., 2013)
  – Multithreaded programs
  – Cache contention, lock contention, thread latency
  – Trained statistical models
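A minimal sketch of the miss-rate-balancing idea above, assuming per-app LLC miss rates have already been measured (the app names and numbers are hypothetical):

```python
# Sketch of miss-rate balancing (after Blagodurov et al., 2010): sort by
# LLC miss rate, then deal apps out so each shared cache gets a mix of
# high- and low-miss-rate applications. Data below is hypothetical.
def balance_across_caches(apps, num_caches):
    ranked = sorted(apps, key=lambda a: a[1], reverse=True)
    groups = [[] for _ in range(num_caches)]
    for i, app in enumerate(ranked):
        row, col = divmod(i, num_caches)
        # Snake assignment: reverse direction on every other row so the
        # aggregate miss rate per cache stays roughly even.
        cache = col if row % 2 == 0 else num_caches - 1 - col
        groups[cache].append(app)
    return groups

apps = [("mcf", 42.0), ("lbm", 35.0), ("gcc", 8.0), ("povray", 0.5)]
print(balance_across_caches(apps, 2))
# [[('mcf', 42.0), ('povray', 0.5)], [('lbm', 35.0), ('gcc', 8.0)]]
```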
LCA: Motivation
• Why focus only on a single contention point?
• Handle contention on the memory link and the LLC
• Need to inspect the entire memory hierarchy and capture resource utilization
• Application characterization
  – Locate contention on the memory link and the LLC
  – Collect information at execution time
  – No additional hardware support
LCA: Application Classes
• N
  – Activity stays within the core or the private cache
  – No contention on shared resources
• C
  – Heavy activity on the LLC
    • small data sets with heavy reuse
    • codes optimized for the LLC (e.g. blocking)
    • latency-bound codes that benefit from LLC hits
• LC
  – Significant activity on both the LLC and the memory link
• L
  – Memory-link-intensive apps
    • streaming applications
    • large data sets or large reuse distances
LCA: Classification Method
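In the original deck this slide is a figure. As a rough illustration only, a run-time classifier over two measurements (memory-link bandwidth and LLC activity) could look like the sketch below; the metric names and thresholds are assumptions, not the actual LCA method:

```python
# Hypothetical L/C/LC/N classifier. The two inputs would come from
# hardware performance counters sampled at execution time; thresholds
# are illustrative placeholders, not values from the paper.
def classify(link_bw_gbs, llc_accesses_per_kinst,
             link_thresh=1.0, llc_thresh=5.0):
    link_heavy = link_bw_gbs > link_thresh             # stresses memory link
    cache_heavy = llc_accesses_per_kinst > llc_thresh  # stresses shared LLC
    if link_heavy and cache_heavy:
        return "LC"
    if link_heavy:
        return "L"
    if cache_heavy:
        return "C"
    return "N"

print(classify(6.4, 2.0))    # "L"  (streaming-like profile)
print(classify(0.2, 30.0))   # "C"  (cache-resident working set)
```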
LCA: Co-execution Scenarios
• N–*
  – No interference
• L–L
  – Applications compete for the memory link
• L–C
  – Low interference
  – Some slowdown for C
• LC–LC
  – Low interference
• LC–C
  – Catastrophic for C
• L–LC
  – L suffers some interference
  – LC suffers a higher slowdown
• C–C
  – Difficult to predict
  – Low probability for severe contention
LCA: Co-execution Scenarios
LCA: Co-scheduling Algorithm
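The algorithm itself is also presented as a figure. The sketch below is one plausible reading of the scenario list above, encoding pairings as costs (L–C and N–* cheap; L–L and LC–C expensive) and greedily forming the cheapest pairs. It illustrates the idea and is not the published LCA algorithm:

```python
# Class-driven pairing sketch. Costs are my reading of the scenario
# slide, not numbers from the paper; lower cost = better co-runners.
PAIR_COST = {
    frozenset(["L", "C"]): 0,   # stress different resources
    frozenset(["LC"]): 1,       # LC-LC: low interference
    frozenset(["C"]): 1,        # C-C: hard to predict, rarely severe
    frozenset(["L", "LC"]): 2,  # both use the link
    frozenset(["L"]): 3,        # L-L: compete for the memory link
    frozenset(["LC", "C"]): 4,  # catastrophic for C
}

def cost(a, b):
    if "N" in (a, b):
        return 0  # N-*: no interference
    return PAIR_COST.get(frozenset([a, b]), 1)

def pair_apps(apps):
    """Greedily pair (name, class) tuples, cheapest co-execution first.
    With an odd count, the last app is left unpaired."""
    apps = list(apps)
    pairs = []
    while len(apps) > 1:
        a = apps.pop(0)
        best = min(range(len(apps)), key=lambda i: cost(a[1], apps[i][1]))
        pairs.append((a, apps.pop(best)))
    return pairs

apps = [("stream", "L"), ("mcf", "C"), ("lbm", "LC"), ("povray", "N")]
print(pair_apps(apps))
# [(('stream', 'L'), ('mcf', 'C')), (('lbm', 'LC'), ('povray', 'N'))]
```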
LCA: Evaluation
• Intel Xeon E5-4620
  – 8 cores, private L1 & L2, 16MB 16-way shared L3, 64GB memory
• 4 benchmarks for each application class
  – 4 threads
• Compare LCA to:
  – Gang
  – Link bandwidth balance (LBB)
  – LLC miss rate balance (LLC_MRB)
  – Linux scheduler (Completely Fair Scheduler)
  – Optimal
  – Random
LCA: Evaluation
Scheduling in CMPs (so far)
• Create co-execution pairs
  – Time scheduling if there are fewer cores than threads
  – Space scheduling by using multiple pairs
[Figure: pairs of cores (C1–C2, C3–C4) with private L1s, each pair sharing an LLC and the link to DRAM]
Scheduling in CMPs: Challenges
• Scaling
• Different levels of sharing
[Figure: 16 cores (C1…C16) with private L1s sharing a single LLC and DRAM link; a variant adds L2 caches shared by core pairs]
General Co-scheduling Scenario
• n programs with t threads each
• system with p cores, l memory links, c shared caches
• Typical scenario: desktops, oversubscribed servers, etc.
  – state-of-the-art: CFS
[Figure: two sockets, each with its own DRAM link, a shared LLC, L2s shared by core pairs, private L1s, and cores C1…C16]
CFS: Is it really fair?
[Bar chart: normalized CPU time per application under CFS]
CFS: Is it really fair?
[Bar chart: normalized CPU time and normalized throughput per application under CFS]
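The contrast the two charts draw can be restated in a few lines of code: CFS hands out near-equal CPU time, yet normalized throughput (co-run IPC divided by solo IPC) can still diverge sharply. The numbers below are hypothetical:

```python
# Equal CPU time is not equal progress. Hypothetical per-app numbers:
# (CPU time share under CFS, solo IPC, co-run IPC).
apps = {
    "mcf":    (0.25, 1.0, 0.4),   # memory-bound: suffers heavily
    "povray": (0.25, 2.0, 1.9),   # compute-bound: barely affected
}
for name, (share, solo, co) in apps.items():
    print(f"{name}: CPU time {share:.0%}, normalized throughput {co / solo:.2f}")
# mcf: CPU time 25%, normalized throughput 0.40
# povray: CPU time 25%, normalized throughput 0.95
```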
Conclusions – Future Work
• LCA takes into account two contention points
• LCA outperforms other schedulers in terms of average application throughput
• Classification needs to be extended/adjusted to create triplets, quadruplets, n-tuples
• I/O-bound applications?
Trivia
• This work appeared in PACT 2014
  – “LCA: A Memory Link and Cache-Aware Co-Scheduling Approach for CMPs”, A. Haritatos, G. Goumas, N. Anastopoulos, K. Nikas, K. Kourtis, N. Koziris
• This research is funded by project I-PARTS: “Integrating Parallel Run-Time Systems for Efficient Resource Allocation in Multicore Systems” (code 2504) of Action ARISTEIA.