A Scalable Ordering Primitive for Multicore Machines
Transcript of A Scalable Ordering Primitive for Multicore Machines
A Scalable Ordering Primitivefor Multicore Machines
Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim
3
Scope of multicore machines
Huge hardwarethread parallelism
...
How are operations executed correctly?Ordering
Becomes scalability bottleneck
4
Example: Read Log Update (RLU)
● Extension of RCU● Modifes objects in a thread’s local log● Clock maintains correct snapshot (old vs new)● Frees objects via epoch-based reclamation
A B C D
D’
E
B’
A log/bufer to store copies (per-thread)
LogLogRLU header
Global Clock (22)
Local Clock(22)
Read on start
Read Log Update (RLU) operation
P
Q
Write Clock(∞)
Global Clock (22)
A B C D
C’D’
E
B’
1. P updates clocks 2. P executes RCU-epoch Waits for Q to fnish
1. P updates clocks 2. P executes RCU-epoch Waits for Q to fnish
Global Clock (23)
Local Clock(22)
Write Clock(23)
Q will read only old objects
RLU commit operation
P
Q
Logical Clock maintains correctness/orderingMaintained via atomic instructions FAA/CAS→
7
Issue with logical clock● RLU sufers from global clock contention
– Cache-line contention due to atomic instructions– Possible to circumvent with our approach
0
30
60
90
120
150
180
0 64 128 192 256
Ops
/use
c
#cores
Phi
Atomic
0
40
80
120
160
0 16 32 48 64 80 96
#cores
ARM
0
30
60
90
120
150
180
0 64 128 192 256
Ops
/use
c
#cores
Phi
Atomic
Ordo
0
40
80
120
160
0 16 32 48 64 80 96
#cores
ARM
How can we achieve orderingwith minimal timestamping overhead?
8
Our proposed ordering primitive: Ordo
● Exposes a monotonically increasing clock– Current hardware already provides– rdtscp (X86), cntvct (ARM), stick (Sparc)
● Relies on a per-core invariant hardware clock– Monotonically increases with constant skew regardless
of dynamic frequency and voltage scaling
9
Challenges with Ordo
● Comparing two clocks– Clocks are not synchronized– Cores receive RESET signal at varying times
● Application:– Modifying algorithms to use Ordo– Able to compare between two timestamps
10
Embracing the invariant clocks
● Measure a global uncertainty window– Ensure a new timestamp once a window is over– Provides a notion of globally synchronized clock
● Measured ofset MUST have the invariant:Measured ofset is greater than the physical ofset
– Physical ofset: ofset due to RESET signal– Measured ofset: physical ofset + one-way delay
11
Calculating global uncertainty window:ORDO_BOUNDARY
● Add one-way delay latency on each path
C1 C2
T(C1): 0
T(C2): 20
C1 C→ 2
20
time
1) Calculate C1 timestamp2) Notify C2 via memory3) Get C2 timestamp4) Repeat steps 1-3 to get
the minimum
12
Calculating global uncertainty window:ORDO_BOUNDARY
● Add one-way delay latency on each path
C1 C2
T(C1): 80
T(C2): 50
C2 C→ 1
30
time
● Repeat prior steps in opposite direction
● Do not know which clockis ahead of the other
13
Calculating global uncertainty window:ORDO_BOUNDARY
● Repeat steps for each pair of cores from C1 to Cn
● The maximum ofset is the ORDO_BOUNDARY
C1 C2
C1 C←→ 2
30
T(C1): 80
T(C2): 50
C1 C→ 2
20
C2 C→ 1
30
C1 C2
T(C1): 0
T(C2): 20
time
14
Ordo application
● Applicable to any timestamp-based algorithm● Expose Ordo API for these algorithms
– get_time(): Current hardware timestamp– cmp_time(t1, t2): Compare two timestamps with
uncertainty, if |t1-t2| < ORDO_BOUNDARY
– new_time(t): Return tnew > (t + ORDO_BOUNDARY)
● Catch: Algorithms should handle uncertainty
15
Algorithms with Ordohandling uncertainty
● Physical to logical timestamping:– Rely on cmp_time() to compare two timestamps– Either defer or revert if comparison is uncertain– Use new_time() to guarantee new time
● Physical timestamping:– Use new_time() to access the global clock
A B C D
D’
E
B’ LogLog
Q’s coreclock (50)
Local Clock(50)
P’s localclock (22)
Read on start
Read Log Update (RLUOrdo) operation
P
Q
Global ofset(30)
Write Clock(∞)
A B C D
C’D’
E
B’
Local Clock(50)
Write Clock(150)
Q will read only old objects
RLUOrdo commit operation
P
Q
Global ofset(30)
1. P updates own clock 2. P executes RCU-epoch Waits for Q to fnish
1. P updates own clock 2. P executes RCU-epoch Waits for Q to fnish
18
Algorithms modifed with Ordo
● RLU● Transactional Locking (TL2) in STM● Database concurrency control: OCC, MVCC● Oplog used in Linux forking functionality
See our paper
19
Evaluation
● Questions:– Measured global ofset (ORDO_BOUNDARY)– Maximum scalability of Ordo– Ordo’s impact on algorithms
● Machines confguration:– 240 core, 8 socket Intel Xeon machine (Xeon)– 256 core, Intel Xeon Phi (Phi)– 96 core, 2 socket ARM machine (ARM)– 32 core, 8 socket AMD machine (AMD)
20
Machine Minimum (ns) Maximum (ns)
Intel Xeon 70 276
Intel Xeon phi 90 270
ARM 100 1,100
AMD 93 203
Ofset between clocks
● Empirically measured ofset after reboots● ORDO_BOUNDARY is the maximum ofset
21
Timestamping with Ordo● Ordo relies on hardware timestamping ● 17.4 – 285.5x faster than atomic increments
0
4
8
12
0 60 120 180 240
Ops
/use
c/co
re Xeon(Atomic)
0
4
8
12
0 64 128 192 256
Phi(Atomic)
0
4
8
12
0 16 32 48 64 80 96
Ops
/use
c/co
re
#core
ARM(Atomic)
0
4
8
12
0 4 8 12 16 20 24 28 32#core
AMD(Atomic)
0
4
8
12
0 60 120 180 240
Ops
/use
c/co
re Xeon(Atomic)
Xeon(Ordo)
0
4
8
12
0 64 128 192 256
Phi(Atomic)
Phi(Ordo)
0
4
8
12
0 16 32 48 64 80 96
Ops
/use
c/co
re
#core
ARM(Atomic)
ARM(Ordo)
0
4
8
12
0 4 8 12 16 20 24 28 32#core
AMD(Atomic)
AMD(Ordo)
22
Scaling RLU with Ordo● RLUOrdo is 2.1x faster on an average● Still sufers from object copy and its locking
0306090
120150
0 60 120 180 240
Ops
/use
c
Xeon
RLU 2% RLU(Ordo) 2%
0306090
120150180
0 64 128 192 256
Phi
0
40
80
120
160
0 16 32 48 64 80 96
Ops
/use
c
#core
ARM0
20
40
60
80
0 8 16 24 32#core
AMD
23
Discussion and limitations
● Simplifes the design and understanding of algorithms
● Not a panacea– Applicable when clock is contentious
● No skew consideration● Thread ID-based timestamp comparison has its
limitation
24
Conclusion
● Ordo is a scalable timestamping primitive– Relies on invariant hardware clocks
● Exposes time-based API to the user● Applied Ordo to fve concurrent algorithms● Improves the scalability of algorithms by at
most 39.7x across architectures
26
Ofset between clocks● Clocks are not synchronized
– 8th socket in Xeon and 2nd socket in ARM– Results remain consistent even after reboots and
measuring after a period of time
0
30
60
90
120
0 30 60 90 120
# core
Xeon
0
75
150
225
0
24
48
72
96
0 24 48 72 96
# core
Arm
0
300
600
900
Of
set b
etw
een
cloc
ks
27
Sensitivity of ORDO_BOUNDARY● Varying ORDO_BOUNDARY from 1/8x – 8x● Cycles increases from 32.2–18K on Xeon machine
0.92
0.96
1.00
1.04
1.08
1-core 1-socket 8-sockets
Nor
mal
ized
thro
ughp
ut
28
Physical timestamping: Oplog● Improves Exim performance by 1.9x at 240 cores
0k
20k
40k
60k
80k
100k
120k
30 60 90 120 150 180 210 240
Mes
sage
s/se
c
#core
Stock
Oplog(Ordo)
29
Scaling database concurrency control● Improves OCC and MVCC by 4.1–39.7x for read-only (YCSB)● OCCOrdo 1.24x faster than Tictoc and Silo (TPC-C)
0306090
120150180
0 60 120 180 240
Txns
/use
c
Xeon
OCCOCC (Ordo)
MVCCMVCC (Ordo)
020406080
100
0 64 128 192
Phi
08
16243240
0 16 32 48 64 80 96
Txns
/use
c
# core
ARM
07
14212835
0 8 16 24 32# core
AMD
30
Cannot use clock synchronization protocols
● No information on minimum bounds on message delivery between/among clocks
● Protocols introduce various errors● Can lead to mis-synchronized clocks
– Larger or smaller than the actual physical ofset
Lead to incorrect implementation ofconcurrent algorithms