Comparing Locks
Everything You Always Wanted to Know About Synchronization But Were Afraid to Ask
Tudor David, Rachid Guerraoui, Vasileios Trigonakis
Philip Taffet

Outline
◦ Motivation
◦ Locks
◦ Target platforms
◦ Performance comparison results
◦ Some conclusions
◦ Applications (SSYNC)

Motivation
◦ There are many types of locks and even more types of architectures.
◦ Does locking performance depend on architecture? How?
◦ Which lock is best?

Locks
◦ Test-and-set (TAS)
◦ Test-and-test-and-set (TTAS)
◦ Ticket lock
◦ Array-based lock
◦ MCS lock
◦ CLH lock
◦ Hierarchical CLH lock (HCLH)
◦ Hierarchical ticket lock (HTICKET)
Of these, the ticket, array-based, MCS, CLH, and hierarchical locks are FIFO locks; the array-based, MCS, CLH, and HCLH locks are queueing locks. (TAS, TTAS, and the ticket lock are sketched in code below.)

Queueing Locks: Hierarchical CLH (HCLH)
◦ Local CLH queue per cluster (socket)
◦ One global queue
◦ The qnode at the head of the global queue holds the lock
(A Hierarchical CLH Queue Lock; Victor Luchangco, Dan Nussbaum, Nir Shavit)

Acquiring the HCLH lock
◦ Step 1: Enqueue locally. The thread swaps its qnode onto the tail of its socket's local queue.
◦ Step 2: Spin wait. It spins on its local predecessor's qnode.
◦ Step 3: Combining delay. The thread at the head of the local queue waits briefly so that more local qnodes can accumulate behind it.
◦ Step 4: Splicing. The accumulated local segment is spliced onto the tail of the global queue.
◦ Step 5: Spin wait. The thread spins until its global predecessor releases the lock.
[Figures: snapshots of the local queues on sockets 1 and 2 and of the global queue at each step]
(The single-queue CLH lock that HCLH builds on is sketched in code below.)

Hierarchical Ticket (HTICKET)
◦ Similar concept, with two levels of locking.
◦ When a thread acquires the local lock, it attempts to acquire the global lock.
◦ The global lock is not released until the local queue is empty.
(A simplified two-level ticket sketch follows below.)
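For concreteness, here is a minimal sketch of the three simple locks above using C11 atomics. This is illustrative code written for this summary, not the authors' SSYNC implementation; all names (tas_lock_t, ticket_lock_t, and the functions) are invented for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-set: every waiter hammers the lock word with an atomic
 * exchange, generating coherence traffic on each attempt. */
typedef struct { atomic_bool locked; } tas_lock_t;

static void tas_acquire(tas_lock_t *l) {
    while (atomic_exchange(&l->locked, true))
        ;  /* spin */
}
static void tas_release(tas_lock_t *l) {
    atomic_store(&l->locked, false);
}

/* Test-and-test-and-set: spin on a plain load (a local cache hit once
 * the line is Shared) and attempt the atomic exchange only when the
 * lock looks free, so waiters create far less traffic. */
static void ttas_acquire(tas_lock_t *l) {
    for (;;) {
        while (atomic_load(&l->locked))
            ;  /* read-only spin */
        if (!atomic_exchange(&l->locked, true))
            return;
    }
}

/* Ticket lock: one fetch-and-increment takes a ticket; waiters spin
 * reading now_serving, and hand-off is FIFO. */
typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != me)
        ;  /* spin until our turn */
}
static void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);
}
```

Under contention, TAS waiters keep writing the lock word and force coherence traffic on every attempt; TTAS waiters mostly read a locally cached copy; the ticket lock adds FIFO ordering for the price of a single fetch-and-increment.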
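HCLH composes per-socket CLH queues, so a compact sketch of the plain CLH lock makes the five steps easier to follow: acquiring swaps your qnode into a tail pointer and spins on the predecessor's flag. Again an illustrative sketch with invented names, assuming C11 atomics; real HCLH qnodes also carry a cluster id so the splicing step can tell local nodes apart.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* CLH queue lock: the lock itself is just a tail pointer. Each waiter
 * spins on its predecessor's qnode, i.e. on a distinct cache line. */
typedef struct { atomic_bool locked; } clh_node_t;
typedef struct { _Atomic(clh_node_t *) tail; } clh_lock_t;

/* The tail must start out pointing at a dummy, unlocked qnode. */
void clh_init(clh_lock_t *l, clh_node_t *dummy) {
    atomic_store(&dummy->locked, false);
    atomic_store(&l->tail, dummy);
}

/* Returns the predecessor's qnode, which the caller recycles as its
 * own qnode for the next acquisition (standard CLH protocol). */
clh_node_t *clh_acquire(clh_lock_t *l, clh_node_t *me) {
    atomic_store(&me->locked, true);
    clh_node_t *pred = atomic_exchange(&l->tail, me);
    while (atomic_load(&pred->locked))
        ;  /* spin on the predecessor only */
    return pred;
}

void clh_release(clh_node_t *me) {
    atomic_store(&me->locked, false);
}
```

HCLH keeps one such queue per socket plus a global queue; the splicing step is an atomic swap on the global tail that appends a whole local segment at once.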
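The hierarchical ticket description can likewise be made concrete with a two-level sketch: one ticket lock per socket plus a global ticket lock, where the releaser keeps the global lock on its socket while local waiters remain. This is a loose illustration of the structure, not the paper's HTICKET algorithm; NSOCKETS and the caller-supplied socket id are assumptions, and a production version would bound consecutive local hand-offs so other sockets are not starved.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NSOCKETS 4  /* assumed topology; set to the machine's socket count */

typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} tkt_t;

/* Zero-initialization (e.g. static storage) yields an unlocked lock. */
typedef struct {
    tkt_t global;                    /* arbitrates between sockets      */
    struct {
        tkt_t       tkt;             /* arbitrates within one socket    */
        atomic_bool have_global;     /* does this socket hold the global lock? */
    } local[NSOCKETS];
} htkt_t;

static void tkt_lock(tkt_t *t) {
    unsigned me = atomic_fetch_add(&t->next_ticket, 1);
    while (atomic_load(&t->now_serving) != me)
        ;  /* spin */
}

static void tkt_unlock(tkt_t *t) {
    atomic_fetch_add(&t->now_serving, 1);
}

/* With the local lock held, now_serving is stable, so this safely
 * detects whether another local thread queued behind us. */
static bool tkt_has_waiters(tkt_t *t) {
    return atomic_load(&t->next_ticket) - atomic_load(&t->now_serving) > 1;
}

void htkt_lock(htkt_t *l, int socket) {
    tkt_lock(&l->local[socket].tkt);          /* win among local threads   */
    if (!atomic_load(&l->local[socket].have_global)) {
        tkt_lock(&l->global);                 /* then compete across sockets */
        atomic_store(&l->local[socket].have_global, true);
    }
}

void htkt_unlock(htkt_t *l, int socket) {
    if (tkt_has_waiters(&l->local[socket].tkt)) {
        /* Keep the global lock on this socket: passing the local lock
         * also passes the right to enter the critical section. */
        tkt_unlock(&l->local[socket].tkt);
    } else {
        atomic_store(&l->local[socket].have_global, false);
        tkt_unlock(&l->global);
        tkt_unlock(&l->local[socket].tkt);
    }
}
```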
Systems
◦ Opteron: directory-based cache-coherence protocol; the directory is located in the LLC.
◦ Xeon: broadcast snooping.
◦ Niagara: coherence via a shared L2 cache on the other side of the chip.
◦ Tilera: coherence via a shared L2 cache distributed across the chip.

Opteron
[Figure: Opteron topology; average distance: 1.25 hops]

Xeon
[Figure: Xeon topology, 1 socket = 10 cores; average distance: 1.375 hops]

Niagara
[Figure: eight 8-way multithreaded cores with private L1 caches, connected through a crossbar to a shared L2 cache]
(Niagara: A 32-way Multithreaded Sparc Processor; Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun)

Tilera
[Figure: Tilera TILE-Gx topology]
(http://www.tilera.com/sites/default/files/productbriefs/TILE-Gx8036_PB033-02_web.pdf)

Comparing Lock Types
[Figure: throughput of each lock type on each platform; higher is better]
◦ Conclusion: There is no universally best lock.

Why the variations?
Latency (cycles) of loads, stores, and atomic operations, by state of the cache line and distance. Distances per system are: Opteron (2.1 GHz): same die / same MCM / one hop / two hops; Xeon (2.13 GHz): same die / one hop / two hops; Niagara (1.2 GHz): same core / other core; Tilera (1.2 GHz): one hop / max hops.

Loads:
State       Opteron              Xeon             Niagara    Tilera
Modified    81/161/172/252       109/289/400      3/24       45/65
Owned       83/163/175/254       -                -          -
Exclusive   83/163/175/253       92/273/383       3/24       45/65
Shared      83/164/176/254       44/223/334       3/24       45/65
Invalid     136/237/247/327      355/492/601      176/176    118/162

Stores:
State       Opteron              Xeon             Niagara    Tilera
Modified    83/172/191/273       115/320/431      24/24      57/77
Owned       244/255/286/291      -                -          -
Exclusive   83/171/191/271       115/315/425      24/24      57/77
Shared      246/255/286/296      116/318/428      24/24      86/106

Atomic operations: Compare&Swap (C), Fetch&Increment (F), Test&Set (T), Swap (S); on the Opteron and Xeon all four have the same latency.
State       Opteron (all)        Xeon (all)       Niagara (C/F/T/S)                                 Tilera (C/F/T/S)
Modified    110/197/216/296      120/324/430      same core 71/108/64/95; other core 66/99/55/90    one hop 77/51/70/63; max hops 98/71/89/84
Shared      272/283/312/332      113/312/423      same core 76/99/67/93; other core 66/99/55/90     one hop 124/82/121/95; max hops 142/102/141/115

[Figure: relative latencies of the operations above; higher is better]
The relative performance of atomic primitives and cache operations varies widely across the hardware ⇒ varying performance of locks.

Effect of contention
◦ The amount of contention affects the performance of the lock: e.g., test-and-set is good in a low-contention situation but bad in a high-contention situation.
◦ Should the programmer have to predict how much contention there will be for a given lock?
[Figure: throughput (Mops/s) of TAS, TTAS, TICKET, ARRAY, MUTEX, MCS, CLH, HCLH, and HTICKET under no, low, and high contention, for placements ranging from a single thread to two hops, on the Opteron, Xeon, Niagara, and Tilera]

Main observations
◦ Crossing sockets is killer (2x to 7.5x worse performance vs. intra-socket).
◦ It's hard to avoid cross-socket communication (OS scheduler, incomplete cache directory, etc.).
◦ Loads and stores can be as expensive as atomic operations (non-local access can be a bottleneck).
◦ Intra-socket non-uniformity matters (hierarchical locks scale better on non-uniform systems).
◦ Consider message passing for highly contended data (message passing may be faster).
◦ There's no universally best lock (pick a lock based on architecture and expected contention).
◦ Simple locks are powerful (the ticket lock performs best in many cases).
  ◦ Except when they aren't.

Conclusions
◦ Queueing locks deliver good performance on most platforms, especially under high contention.
◦ A standard, system-dependent library with lock implementations optimized for that platform?
◦ Because low overhead is so important (simple locks are powerful), it's hard to try anything fancy without creating excessive overhead.
◦ Idea: adaptive locks that change type based on the contention they detect (a rough sketch of a milder, policy-level form follows below).
  ◦ Nathan R. Tallent, John M. Mellor-Crummey, and Allan Porterfield. 2010. Analyzing lock contention in multithreaded applications. SIGPLAN Not. 45, 5 (January 2010), 269-280. DOI: 10.1145/1837853.1693489. http://doi.acm.org/10.1145/1837853.1693489
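The adaptive-lock idea is left open here, but a well-known milder form adapts the waiting policy rather than the lock type. A ticket lock observes contention for free: the distance between a thread's ticket and now_serving is exactly the number of waiters ahead, so the thread can back off proportionally instead of hammering the cache line (proportional backoff, due to Mellor-Crummey and Scott). A hedged sketch, with BACKOFF_BASE an assumed tuning constant and cpu_relax an x86-specific hint:

```c
#include <stdatomic.h>

#define BACKOFF_BASE 128  /* assumed tuning constant, platform-dependent */

typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock_t;

static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");  /* x86 spin-wait hint */
#endif                              /* other ISAs: substitute the local hint */
}

/* Ticket lock whose waiting policy adapts to observed contention:
 * my ticket minus now_serving is the number of waiters ahead of me,
 * so back off proportionally instead of re-reading continuously. */
void ticket_lock_adaptive(ticket_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    for (;;) {
        unsigned ahead = me - atomic_load(&l->now_serving);
        if (ahead == 0)
            return;                           /* our turn */
        for (unsigned i = 0; i < ahead * BACKOFF_BASE; i++)
            cpu_relax();                      /* proportional backoff */
    }
}

void ticket_unlock(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);
}
```

Switching the lock type itself at runtime, as the cited work on lock-contention analysis suggests, requires a hand-off protocol between the old and new lock and is correspondingly harder to get right.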