Comparing Locks
Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask
Tudor David, Rachid Guerraoui, Vasileios Trigonakis
Presented by Philip Taffet
Outline
Motivation
Locks
Target Platforms
Performance Comparison Results
Some Conclusions
Applications (SSYNC)
Motivation
There are many types of locks and even more types of architectures.
Does locking performance depend on architecture? How?
Which lock is best?
Locks
Test-and-set (TAS)
Test-and-test-and-set (TTAS)
Ticket lock
Array-based lock
MCS lock
CLH lock
Hierarchical CLH lock (HCLH)
Hierarchical ticket lock (HTICKET)
The ticket and array-based locks grant the lock in FIFO order; MCS, CLH, and the hierarchical variants are queueing locks.
Hierarchical CLH
Local CLH queue per cluster (socket)
One global queue
Qnode at the head of the global queue holds the lock
A Hierarchical CLH Queue Lock; Victor Luchangco, Dan Nussbaum, Nir Shavit
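HCLH builds on the plain CLH queue lock, so it helps to have the base algorithm in mind before the step-by-step walkthrough below. Here is a minimal CLH sketch using C11 atomics; the type and field names are illustrative, not taken from the paper or its code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative types; field names are not from the paper. */
typedef struct clh_node {
    _Atomic bool locked;             /* true while its owner holds or waits */
} clh_node_t;

typedef struct {
    clh_node_t *_Atomic tail;        /* most recently enqueued node */
} clh_lock_t;

/* The lock starts with a dummy node marked unlocked. */
void clh_init(clh_lock_t *lock, clh_node_t *dummy) {
    atomic_store(&dummy->locked, false);
    atomic_store(&lock->tail, dummy);
}

/* Enqueue our node with one swap, then spin only on the predecessor's
   node. Returns the predecessor's node, which the caller recycles as
   its own node for the next acquisition. */
clh_node_t *clh_acquire(clh_lock_t *lock, clh_node_t *my) {
    atomic_store(&my->locked, true);
    clh_node_t *pred = atomic_exchange(&lock->tail, my);
    while (atomic_load(&pred->locked))
        ;                            /* spin on one line, locally cached */
    return pred;
}

void clh_release(clh_node_t *my) {
    atomic_store(&my->locked, false); /* successor observes and proceeds */
}
```

Each waiter spins only on its predecessor's node, so waiting generates little coherence traffic; HCLH applies the same idea per socket and then splices whole local queues into one global queue, as the following steps show.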
Acquire lock - Step 1: Enqueue locally
[Diagram: each socket has its own local CLH queue with a tail pointer; a new qnode C is swapped onto the tail of its socket's local queue, behind nodes A and B]
Acquire lock - Step 2: Spin wait
[Diagram: the newly enqueued node spins on its local predecessor within its socket's queue]
Acquire lock - Step 3: Combining delay
[Diagram: the thread at the head of the local queue waits briefly so that more local nodes (C, D) can accumulate before the queue is spliced]
Acquire lock - Step 4: Splicing
[Diagram: the entire local queue (C, D) is spliced onto the tail of the global queue with a single swap of the global tail pointer]
Acquire lock - Step 5: Spin wait
[Diagram: the head of the spliced segment spins on its global predecessor; the qnode at the head of the global queue holds the lock]
Hierarchical Ticket
Similar concept with two levels of locking
When a thread acquires the local lock, it attempts to acquire the global lock
Global lock not released until the local queue is empty
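As a concrete illustration, here is a minimal hierarchical ticket sketch in C11 atomics. Everything here is an assumption made for the sketch (NUM_SOCKETS, the helper names, the bookkeeping array), and unlike a real HTICKET it never caps how many times one socket hands the lock off locally, so it trades cross-socket fairness for simplicity.

```c
#include <stdatomic.h>

#define NUM_SOCKETS 4                 /* assumption for the sketch */

typedef struct {
    _Atomic unsigned next;            /* next ticket to hand out */
    _Atomic unsigned owner;           /* ticket currently being served */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next, 1);
    while (atomic_load(&l->owner) != me)
        ;                             /* spin until it is our turn */
}

static void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->owner, 1);
}

static unsigned ticket_waiters(ticket_lock_t *l) {
    /* tickets handed out beyond the one currently being served */
    return atomic_load(&l->next) - atomic_load(&l->owner) - 1;
}

static ticket_lock_t global_lock;
static ticket_lock_t local_lock[NUM_SOCKETS];
static int socket_holds_global[NUM_SOCKETS];  /* guarded by local_lock */

void htk_acquire(int socket) {
    ticket_acquire(&local_lock[socket]);
    if (!socket_holds_global[socket]) {       /* local queue was empty */
        ticket_acquire(&global_lock);
        socket_holds_global[socket] = 1;
    }
}

void htk_release(int socket) {
    if (ticket_waiters(&local_lock[socket]) == 0) {
        socket_holds_global[socket] = 0;      /* nobody local: hand back */
        ticket_release(&global_lock);
    }
    ticket_release(&local_lock[socket]);      /* pass lock to next local */
}
```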
Systems
Opteron
◦ Directory-based cache-coherence protocol
◦ Directory located in the LLC
Xeon
◦ Broadcast snooping
Niagara
◦ Coherence via a shared L2 cache on the other side of the chip
Tilera
◦ Coherence via a shared L2 cache distributed across the chip
Opteron: average distance 1.25 hops
Xeon: average distance 1.375 hops
[Diagrams: Opteron and Xeon interconnect topologies; legend: one box = 1 socket = 10 cores]
Niagara
[Diagram: eight 8-way multithreaded cores, each with a private L1, connected by a crossbar to a shared L2 cache]
Niagara: A 32-way Multithreaded Sparc Processor; Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun
Tilera
http://www.tilera.com/sites/default/files/productbriefs/TILE-Gx8036_PB033-02_web.pdf
Comparing Lock Types
[Chart: lock throughput on each platform; higher is better]
Conclusion: There is no universally best lock.
Why the variations?
[Table: latency, in cycles, of loads, stores, and atomic operations (Compare & Swap, Fetch & Increment, Test & Set, Swap) on cache lines in the Modified, Owned, Exclusive, Shared, and Invalid states, broken down by distance (same core, same die, same MCM, one hop, two hops, max hops) on the Opteron (2.1 GHz), Xeon (2.13 GHz), Niagara (1.2 GHz), and Tilera (1.2 GHz)]
Why the variations?
[Chart: throughput of atomic primitives on each platform; higher is better]
The relative performance of atomic primitives and cache operations varies widely across the hardware ⇒ varying performance of locks.
Effect of contention
The amount of contention affects the performance of the lock
◦ E.g., test-and-set is good in a low-contention situation but bad in a high-contention situation (see the TAS vs. TTAS sketch below)
Should the programmer have to predict how much contention there will be for a given lock?
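To make the test-and-set point concrete, the sketch below contrasts TAS and TTAS acquisition using C11 atomics (names are illustrative). Under high contention, every TAS spin iteration is an atomic read-modify-write that bounces the lock's cache line among the waiters, while TTAS waiters spin on plain loads of a locally cached copy and attempt the atomic swap only when the lock looks free.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag flag; } tas_lock_t;

/* TAS: the spin loop itself is an atomic read-modify-write. */
void tas_acquire(tas_lock_t *l) {
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ;                                     /* cache line ping-pongs */
}

void tas_release(tas_lock_t *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}

typedef struct { _Atomic bool held; } ttas_lock_t;

/* TTAS: spin with plain loads, swap only when the lock looks free. */
void ttas_acquire(ttas_lock_t *l) {
    for (;;) {
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ;                                 /* read-only local spin */
        if (!atomic_exchange_explicit(&l->held, true, memory_order_acquire))
            return;                           /* acquired */
    }
}

void ttas_release(ttas_lock_t *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```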
[Charts: lock throughput (Mops/s, higher is better) for TAS, TTAS, TICKET, ARRAY, MUTEX, MCS, CLH, HCLH, and HTICKET under no, low, and high contention, at increasing thread distances (single thread, same core, same die, same MCM, one hop, two hops, max hops) on the Opteron, Xeon, Niagara, and Tilera]
Main observations
Crossing sockets is killer (2x to 7.5x worse performance vs. intra-socket)
It’s hard to avoid cross-socket communication (OS scheduler, incomplete cache directory, etc.)
Loads, stores can be as expensive as atomic operations (Non-local access can be a bottleneck)
Intra-socket non-uniformity matters (Hierarchical locks scale better on non-uniform systems)
Consider message passing for highly contended data (Message passing may be faster)
There’s no universally best lock (Pick a lock based on architecture and expected contention)
Simple locks are powerful (Ticket lock performs best in many cases)
◦ Except when they aren’t
Conclusions
Queueing locks deliver good performance on most platforms, especially under high contention
A standard system-dependent library with optimized lock implementations for each platform?
Because low overhead is so important (simple locks are powerful), it is hard to try anything fancy without creating excessive overhead
Idea: adaptive locks that change type based on the contention they detect
◦ Nathan R. Tallent, John M. Mellor-Crummey, and Allan Porterfield. 2010. Analyzing lock contention in multithreaded applications. SIGPLAN Not. 45, 5 (January 2010), 269-280. DOI: 10.1145/1837853.1693489
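As a purely illustrative footnote on that idea: a ticket lock's state already encodes how many threads currently hold or wait for the lock, which gives an adaptive scheme a cheap contention signal to sample. The sketch below assumes the ticket layout used earlier; the function name is hypothetical.

```c
#include <stdatomic.h>

typedef struct {
    _Atomic unsigned next;    /* next ticket to hand out */
    _Atomic unsigned owner;   /* ticket currently being served */
} ticket_lock_t;

/* Threads that currently hold or wait for the lock; an adaptive lock
   could sample this to switch between, say, a simple spin path and a
   queue-based path. */
unsigned contention_estimate(const ticket_lock_t *l) {
    return atomic_load(&l->next) - atomic_load(&l->owner);
}
```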