LCA: A Memory Link and Cache-Aware Co-scheduling Approach for CMPs

Dr. Konstantinos Nikas
Computing Systems Laboratory
National Technical University of Athens
8/10/2014

A Typical CMP
• Concurrent execution of multiple threads
• Various levels of sharing
  – Caches
  – Prefetchers
  – Memory controllers
  – Memory link
• Sharing affects performance

Co-execution Impact
[figure: two cores (C1, C2) with private L1 caches sharing an LLC and the memory link to DRAM]
• Single-thread execution
  – Application 1 alone: IPC1 = 2.2
  – Application 2 alone: IPC2 = 1.8
• Dual-thread execution: IPC1 = 1.9, IPC2 = 1.6
  – Performance degradation for both threads
  – Competition on shared resources
    • Shared LLC
    • Memory link

Co-execution in CMPs
• Initial work focused on cache contention
  – Cache partitioning (Qureshi and Patt, 2006)
  – Cache-fair scheduling (Fedorova et al., 2007)
• In reality there is more than one contention point
  – Cache, memory controller, memory link, prefetchers
  → Contention-aware scheduling

Contention-Aware Scheduling
• Miss rate (Blagodurov et al., 2010)
  – Sort apps by LLC miss rate
  – Target: distribute the miss rate evenly among the caches
• Memory bandwidth (Merkel et al., 2010)
  – Sort apps by memory bus utilization
  – Combine apps with complementary bandwidth demands
• ADAPT (Pusukuri et al., 2013)
  – Multithreaded programs
  – Cache contention, lock contention, thread latency
  – Trained statistical models

LCA: Motivation
• Why focus on only a single contention point?
• Handle contention on both the memory link and the LLC
• Need to inspect the entire memory hierarchy and capture resource utilization
• Application characterization
  – Locate contention on the memory link and the LLC
  – Collect information at execution time
  – No additional hardware support

LCA: Application Classes
• N
  – Activity stays within the core or the private cache
  – No contention on shared resources
• C
  – Heavy activity on the LLC
    • Small data sets with heavy reuse
    • Code optimized for the LLC (e.g. blocking)
    • Latency-bound apps that benefit from LLC hits
• LC
  – Significant activity on both the LLC and the memory link
• L
  – Memory-link-intensive apps
    • Streaming applications
    • Large data sets or large reuse distances

LCA: Classification Method
[figure: the classification flow used by LCA]
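The classification method appears only as a figure in the original slides. As a rough illustration of the idea, and not the paper's actual classifier, the sketch below assigns an application to one of the four classes from two runtime measurements, memory-link bandwidth and LLC hits per kilo-instruction, using hypothetical thresholds.

```python
# A minimal sketch of a counter-based classifier in the spirit of LCA's
# four classes. Thresholds and metric choices are illustrative assumptions,
# not the values used in the paper.
from dataclasses import dataclass

LINK_BW_HIGH = 2.0   # GB/s on the memory link (assumed threshold)
LLC_HIT_HIGH = 5.0   # LLC hits per kilo-instruction (assumed threshold)

@dataclass
class Profile:
    """Measurements collected at execution time from performance counters."""
    link_bw_gbs: float   # sustained memory-link bandwidth
    llc_hits_pki: float  # LLC hits per kilo-instruction

def classify(p: Profile) -> str:
    """Map a runtime profile to one of the LCA classes: N, C, LC or L."""
    link_heavy = p.link_bw_gbs >= LINK_BW_HIGH
    cache_heavy = p.llc_hits_pki >= LLC_HIT_HIGH
    if link_heavy and cache_heavy:
        return "LC"  # significant activity on both the LLC and the link
    if link_heavy:
        return "L"   # streams through the LLC straight to memory
    if cache_heavy:
        return "C"   # works mostly out of the LLC
    return "N"       # stays within the core or the private caches

# A streaming app with high link traffic but little LLC reuse lands in L.
print(classify(Profile(link_bw_gbs=6.5, llc_hits_pki=1.2)))  # -> L
```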
LCA: Co-execution Scenarios
• N–*
  – No contention on shared resources
• L–L
  – Applications compete for the memory link
• L–C
  – No interference
• L–LC
  – L suffers some interference
  – LC suffers a higher slowdown
• LC–LC
  – Low interference
• LC–C
  – Catastrophic for C
• C–C
  – Some slowdown for C apps
  – Difficult to predict
  – Low probability of severe contention
[figure: measured slowdowns for each co-execution scenario]

LCA: Co-scheduling Algorithm
[figure: the LCA co-scheduling algorithm; a sketch of a pairing policy in this spirit closes the deck]

LCA: Evaluation
• Intel Xeon E5-4620
  – 8 cores, private L1 & L2, 16MB 16-way shared L3, 64GB memory
• 4 benchmarks for each application class
  – 4 threads
• Compare LCA to:
  – Gang
  – Link bandwidth balance (LBB)
  – LLC miss rate balance (LLC_MRB)
  – Linux scheduler (Completely Fair Scheduler)
  – Optimal
  – Random
[figure: evaluation results]

Scheduling in CMPs (so far)
• Create co-execution pairs
  – Time scheduling if there are fewer cores than threads
  – Space scheduling by using multiple pairs
[figure: a two-core node sharing one LLC; then two such pairs, each with its own LLC]

Scheduling in CMPs: Challenges
• Scaling
• Different levels of sharing
[figure: 16 cores (C1 … C16) sharing an LLC, with intermediate shared L2 caches]

General Co-scheduling Scenario
• n programs with t threads
• System with p cores, l memory links, c shared caches
• Typical scenario: desktops, oversubscribed servers, etc.
  – State of the art: CFS
[figure: two sockets, each with its own DRAM, LLC, shared L2s and 16 cores]

CFS: Is it really fair?
[figure: normalized CPU time, and normalized CPU time vs. normalized throughput, per application under CFS; y-axis from 0 to 1]

Conclusions – Future Work
• LCA takes two contention points into account
• LCA outperforms other schedulers in terms of average application throughput
• The classification needs to be extended/adjusted to create triplets, quadruplets, n-tuples
• I/O-bound applications?

Trivia
• This work appeared in PACT 2014
  – "LCA: A Memory Link and Cache-Aware Co-Scheduling Approach for CMPs", A. Haritatos, G. Goumas, N. Anastopoulos, K. Nikas, K. Kourtis, N. Koziris
• This research is funded by project I-PARTS: "Integrating Parallel Run-Time Systems for Efficient Resource Allocation in Multicore Systems" (code 2504) of Action ARISTEIA.
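The co-scheduling algorithm above is likewise shown only as a figure. The sketch below illustrates how a greedy pairing policy could follow the scenario table: N acts as a flexible filler, L pairs best with C, and LC–C pairings are avoided. The penalty values, helper names, and benchmark class labels are illustrative assumptions, not the paper's algorithm.

```python
# A greedy pairing sketch built on the co-execution scenario table.
# Penalty values are illustrative assumptions ordered by the scenarios
# above (L-C ideal, LC-C catastrophic for C); the actual LCA algorithm
# is specified in the PACT 2014 paper.
from itertools import combinations

PAIR_PENALTY = {
    ("C", "L"): 0,    # no interference: C lives in the LLC, L on the link
    ("LC", "LC"): 1,  # low interference
    ("C", "C"): 2,    # difficult to predict, severe contention unlikely
    ("L", "LC"): 2,   # L suffers a little, LC more
    ("L", "L"): 3,    # both compete for the memory link
    ("C", "LC"): 4,   # catastrophic for C: avoid
}

def penalty(a: str, b: str) -> int:
    """Penalty of co-running two application classes on one LLC/link pair."""
    if a == "N" or b == "N":
        return 1  # N tolerates anything; small cost keeps rare L-C pairs first
    return PAIR_PENALTY[tuple(sorted((a, b)))]

def pair_apps(apps):
    """Greedily form co-execution pairs with the lowest total penalty."""
    apps = list(apps)           # (name, class) tuples
    pairs = []
    while len(apps) > 1:
        i, j = min(combinations(range(len(apps)), 2),
                   key=lambda ij: penalty(apps[ij[0]][1], apps[ij[1]][1]))
        pairs.append((apps[i], apps[j]))
        apps.pop(j)             # j > i, so pop j first to keep i valid
        apps.pop(i)
    return pairs, apps          # any leftover app if the count was odd

# Hypothetical workload; the class labels are assumptions for illustration.
workload = [("lbm", "L"), ("mcf", "LC"), ("povray", "N"), ("omnetpp", "C")]
pairs, leftover = pair_apps(workload)
print(pairs)  # pairs lbm (L) with omnetpp (C), then mcf (LC) with povray (N)
```

A real scheduler would also re-classify applications as they change phase, since LCA collects its information at execution time.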