Practical Concerns for Scalable Synchronization
Jonathan Walpole (PSU), Paul McKenney (IBM), Tom Hart (University of Toronto)
www.cs.pdx.edu/~walpole, Jonathan Walpole, CS533 Winter 2006
The problem

“i++” is dangerous if “i” is global

CPU 0 and CPU 1 race on the shared variable i:

  CPU 0: load r1,i     CPU 1: load r1,i      (both read i)
  CPU 0: inc r1        CPU 1: inc r1         (both compute i+1)
  CPU 0: store r1,i    CPU 1: store r1,i     (both write i+1)

Result: i ends up at i+1 instead of i+2; one increment is lost.
Question
What is this problem called?
Question

What is this problem called?

What solution could we apply?
The solution – critical sections
Classic multiprocessor solution: spinlocks
– CPU 1 waits for CPU 0 to release the lock
Counts are accurate, but locks have overhead!
spin_lock(&mylock);
i++;
spin_unlock(&mylock);
Question
What are spinlocks built from?
Critical-section efficiency
Lock acquisition (Ta)
Critical section (Tc)
Lock release (Tr)

Critical-section efficiency = Tc / (Ta + Tc + Tr)
Ignoring lock contention and cache conflicts in the critical section
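For concreteness, with illustrative timings (not from the slides) of Ta = 100 ns for acquisition, Tc = 50 ns of useful work, and Tr = 100 ns for release:

```latex
\text{efficiency} = \frac{T_c}{T_a + T_c + T_r} = \frac{50}{100 + 50 + 100} = 0.2
```

so only 20% of the time the lock is held goes to useful work, and the shorter the critical section, the worse the ratio.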
Critical section efficiency

[Figure: critical-section efficiency as a function of critical section size]
Performance of normal instructions
Questions

Have synchronization instructions got faster?
– Relative to normal instructions?
– In absolute terms?

What are the implications of this for the performance of operating systems?

Can we fix this problem by adding more CPUs?
What’s going on?
Taller memory hierarchies
– Memory speeds have not kept up with CPU speeds
– 1984: no caches needed, since instructions were slower than memory accesses
– 2005: 3-4 level cache hierarchies, since instructions are orders of magnitude faster than memory accesses
Why does this matter?
Why does this matter?
Synchronization implies sharing data across CPUs
– normal instructions tend to hit in top-level cache
– synchronization operations tend to miss

Synchronization requires a consistent view of data
– between cache and memory
– across multiple CPUs
– requires CPU-CPU communication

Synchronization instructions see memory latency!
… but that’s not all!
Longer pipelines
– 1984: Many clock cycles per instruction
– 2005: Many instructions per clock cycle
  ● 20-stage pipelines

Out-of-order execution
– Keeps the pipelines full
– Must not reorder the critical section before its lock!

Synchronization instructions stall the pipeline!
Reordering means weak memory consistency

Memory barriers
– Additional synchronization instructions are needed to manage reordering
What is the cost of all this?
Instruction cost, normalized to a normal instruction:

| Instruction | 1.45 GHz IBM POWER4 | 3.06 GHz Intel Xeon |
|---|---|---|
| Normal Instruction | 1.0 | 1.0 |
Atomic increment
| Instruction | 1.45 GHz IBM POWER4 | 3.06 GHz Intel Xeon |
|---|---|---|
| Normal Instruction | 1.0 | 1.0 |
| Atomic Increment | 183.1 | 402.3 |
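The operation being timed here is a hardware-atomic read-modify-write; one way to express it is the GCC/Clang `__atomic` builtins (the thread and iteration counts below are arbitrary):

```c
#include <assert.h>
#include <pthread.h>

static long counter;

/* Each thread performs `iters` atomic increments; the read-modify-write
 * is a single indivisible operation, so no lock is needed, but each one
 * still pays the synchronization cost shown in the table above. */
static void *worker(void *arg) {
    long iters = *(long *)arg;
    for (long n = 0; n < iters; n++)
        __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
    return NULL;
}

long atomic_count(int nthreads, long iters) {
    pthread_t tids[16];
    if (nthreads > 16)
        nthreads = 16;
    counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, &iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return counter;
}
```

Each `__atomic_fetch_add` is correct without a lock, yet, as the table shows, it is still two orders of magnitude slower than a normal instruction.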
Memory barriers
| Instruction | 1.45 GHz IBM POWER4 | 3.06 GHz Intel Xeon |
|---|---|---|
| Normal Instruction | 1.0 | 1.0 |
| Atomic Increment | 183.1 | 402.3 |
| SMP Write Memory Barrier | 328.6 | 0.0 |
| Read Memory Barrier | 328.9 | 402.3 |
| Write Memory Barrier | 400.9 | 0.0 |
Lock acquisition/release with LL/SC
| Instruction | 1.45 GHz IBM POWER4 | 3.06 GHz Intel Xeon |
|---|---|---|
| Normal Instruction | 1.0 | 1.0 |
| Atomic Increment | 183.1 | 402.3 |
| SMP Write Memory Barrier | 328.6 | 0.0 |
| Read Memory Barrier | 328.9 | 402.3 |
| Write Memory Barrier | 400.9 | 0.0 |
| Local Lock Round Trip | 1057.5 | 1138.8 |
Compare & swap unknown values (NBS)
| Instruction | 1.45 GHz IBM POWER4 | 3.06 GHz Intel Xeon |
|---|---|---|
| Normal Instruction | 1.0 | 1.0 |
| Atomic Increment | 183.1 | 402.3 |
| SMP Write Memory Barrier | 328.6 | 0.0 |
| Read Memory Barrier | 328.9 | 402.3 |
| Write Memory Barrier | 400.9 | 0.0 |
| Local Lock Round Trip | 1057.5 | 1138.8 |
| CAS Cache Transfer & Invalidate | 247.1 | 847.1 |
Compare & swap known values (spinlocks)
| Instruction | 1.45 GHz IBM POWER4 | 3.06 GHz Intel Xeon |
|---|---|---|
| Normal Instruction | 1.0 | 1.0 |
| Atomic Increment | 183.1 | 402.3 |
| SMP Write Memory Barrier | 328.6 | 0.0 |
| Read Memory Barrier | 328.9 | 402.3 |
| Write Memory Barrier | 400.9 | 0.0 |
| Local Lock Round Trip | 1057.5 | 1138.8 |
| CAS Cache Transfer & Invalidate | 247.1 | 847.1 |
| CAS Blind Cache Transfer | 257.1 | 993.9 |
The net result?
1984: Lock contention was the main issue
2005: Critical section efficiency is a key issue
Even if the lock is always free when you try to acquire it, performance can still suck!
How has this affected OS design?
Multiprocessor OS designers search for “scalable” synchronization strategies
– reader-writer locking instead of global locking
– data locking and partitioning
– per-CPU reader-writer locking
– non-blocking synchronization

The “common case” is read-mostly access to linked lists and hash tables
– asymmetric strategies favouring readers are good
Review - Global locking
A symmetric approach (also called “code locking”)
– A critical section of code is guarded by a lock
– Only one thread at a time can hold the lock

Examples include
– Monitors
– Java “synchronized” on a global object
– Linux spin_lock() on a global spinlock_t
What is the problem with global locking?
Review - Global locking
Global locking doesn’t scale due to lock contention!
Review - Reader-writer locking
Many readers can concurrently hold the lock
Writers exclude readers and other writers
The result?
– No lock contention in read-mostly scenarios
– So it should scale well, right?
Review - Reader-writer locking

So it should scale well, right? … wrong!
Scalability of reader/writer locking

[Figure: two-CPU timeline in which each read-acquire memory barrier takes far longer than the critical section it protects, while the lock bounces between CPU 0 and CPU 1]

Reader/writer locking does not scale due to critical-section efficiency!
Review - Data locking
A lock per data item instead of one per collection
– Per-hash-bucket locking for hash tables
– CPUs acquire locks for different hash chains in parallel
– CPUs incur memory-latency and pipeline-flush overheads in parallel
Data locking improves scalability by executing critical section “overhead” in parallel
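A per-hash-bucket-locking sketch in C (the table size, hash function, and the -1 not-found convention are illustrative):

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 64

/* Data locking: one lock per hash bucket rather than one lock for the
 * whole table, so CPUs touching different chains do not contend. */
struct node { int key; int val; struct node *next; };

static struct node *buckets[NBUCKETS];
static pthread_mutex_t bucket_lock[NBUCKETS];

void table_init(void) {
    for (int i = 0; i < NBUCKETS; i++)
        pthread_mutex_init(&bucket_lock[i], NULL);
}

static unsigned hash(int key) { return (unsigned)key % NBUCKETS; }

/* Only the target bucket's lock is held, so inserts into different
 * chains proceed in parallel (and pay their lock overhead in parallel). */
void table_insert(int key, int val) {
    unsigned b = hash(key);
    struct node *n = malloc(sizeof *n);
    n->key = key;
    n->val = val;
    pthread_mutex_lock(&bucket_lock[b]);
    n->next = buckets[b];
    buckets[b] = n;
    pthread_mutex_unlock(&bucket_lock[b]);
}

/* Returns the value for key, or -1 if absent. */
int table_lookup(int key) {
    unsigned b = hash(key);
    int v = -1;
    pthread_mutex_lock(&bucket_lock[b]);
    for (struct node *n = buckets[b]; n; n = n->next)
        if (n->key == key) { v = n->val; break; }
    pthread_mutex_unlock(&bucket_lock[b]);
    return v;
}
```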
Review - Per-CPU reader-writer locking
One lock per CPU (called brlock in Linux)
– Readers acquire their own CPU’s lock
– Writers acquire all CPUs’ locks

In read-only workloads CPUs never exchange locks
– no memory latency is incurred

Per-CPU R/W locking improves scalability by removing memory latency from read-lock acquisition in read-mostly scenarios
Scalability comparison
Expected scalability on read-mostly workloads
– Global locking – poor due to lock contention
– R/W locking – poor due to critical-section efficiency
– Data locking – better?
– R/W data locking – better still?
– Per-CPU R/W locking – the best we can do?
Actual scalability
Scalability of locking strategies using read-only workloads in a hash-table benchmark
Measurements taken on a 4-CPU 700 MHz P-III system
Similar results are obtained on more recent CPUs
Scalability on 1.45 GHz POWER4 CPUs
Performance at different update fractions on 8 1.45 GHz POWER4 CPUs
What are the lessons so far?
Avoid lock contention!

Avoid synchronization instructions!
– … especially in the read-path!
How about non-blocking synchronization?
Basic idea – copy & flip pointer (no locks!)
– Read a pointer to a data item
– Create a private copy of the item to update in place
– Swap the old item for the new one using an atomic compare & swap (CAS) instruction on its pointer
– CAS fails if current pointer not equal to initial value
– Retry on failure
NBS should enable fast reads … in theory!
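The copy & flip step can be written with the GCC/Clang compare-and-swap builtin; the `struct item`, its fields, and the retry-loop packaging are illustrative. Note that the sketch never frees the old version: deciding when that is safe is exactly the reclamation problem the next slides discuss.

```c
#include <assert.h>
#include <stdlib.h>

struct item { int a; int b; };

static struct item *current_item;

void item_init(int a, int b) {
    struct item *p = malloc(sizeof *p);
    p->a = a;
    p->b = b;
    current_item = p;
}

/* Readers: just follow the pointer (fast ... in theory). */
int read_a(void) { return __atomic_load_n(&current_item, __ATOMIC_ACQUIRE)->a; }
int read_b(void) { return __atomic_load_n(&current_item, __ATOMIC_ACQUIRE)->b; }

/* Updater: copy, modify the private copy, then atomically flip the
 * pointer; retry if another updater flipped it first. */
void update_b(int newb) {
    for (;;) {
        struct item *oldp = __atomic_load_n(&current_item, __ATOMIC_ACQUIRE);
        struct item *newp = malloc(sizeof *newp);
        *newp = *oldp;              /* private copy of the current version */
        newp->b = newb;             /* update in place on the copy */
        if (__atomic_compare_exchange_n(&current_item, &oldp, newp,
                                        0, __ATOMIC_RELEASE,
                                        __ATOMIC_RELAXED))
            break;                  /* pointer flipped to the new version */
        free(newp);                 /* lost the race: retry */
        /* The displaced old version is deliberately leaked: freeing it
         * here could hijack a concurrent reader still using it. */
    }
}
```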
Problems with NBS in practice
Reusing memory causes problems
– Readers holding references can be hijacked during data structure traversals when memory is reclaimed
– Readers see inconsistent data structures when memory is reused
How and when should memory be reclaimed?
Immediate reclamation?
Immediate reclamation?
In practice, readers must either
– Use LL/SC to test if pointers have changed, or
– Verify that version numbers associated with data structures have not changed (2 memory barriers)
Synchronization instructions slow NBS readers!
Reader-friendly solutions
Never reclaim memory?

Type-stable memory?
– Needs free pool per data structure type
– Readers can still be hijacked to the free pool
– Exposes OS to denial-of-service attacks

Ideally, defer reclaiming memory until it’s safe!
– Defer reclamation of a data item until references to it are no longer held by any thread
How should we defer reclamation?

Wait for a while then delete?
– … but how long should you wait?

Maintain reference counts or per-CPU “hazard pointers” on data that is in use?
– Requires synchronization in read path!

Challenge – deferring destruction without using synchronization instructions in the read path
Quiescent-state-based reclamation

Coding convention:
– Don’t allow a quiescent state to occur in a read-side critical section

Reclamation strategy:
– Only reclaim data after all CPUs in the system have passed through a quiescent state

Example quiescent states:
– Context switch in non-preemptive kernel
– Yield in preemptive kernel
– Return from system call …
Coding conventions for readers
Delineate read-side critical section
– rcu_read_lock() and rcu_read_unlock() primitives
– may compile to nothing on most architectures

Don’t hold references outside critical sections
– Re-traverse data structure to pick up reference

Don’t yield the CPU during critical sections
– Don’t voluntarily yield
– Don’t block, don’t leave the kernel …
Overview of the basic idea
Writers create new versions
– Using locking or NBS to synchronize with each other
– Register call-backs to destroy old versions when safe
  ● call_rcu() primitive registers a call-back with a reclaimer
– Call-backs are deferred and memory reclaimed in batches

Readers do not use synchronization
– While they hold a reference to a version it will not be destroyed
– Completion of read-side critical sections is “inferred” by the reclaimer from observation of quiescent states
Overview of RCU API
Writer: rcu_assign_pointer(), synchronize_rcu(), call_rcu()
Reader: rcu_read_lock(), rcu_dereference()
Reclaimer: runs the call-backs registered via call_rcu()

The API mediates memory consistency of mutable pointers over a collection of versions of immutable objects
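The division of labour can be sketched in user-space C with the kernel primitive names stubbed out; a real implementation compiles rcu_read_lock() to (almost) nothing and makes synchronize_rcu() wait for a grace period, so the stubs below only show who calls what. The `struct config` example is illustrative.

```c
#include <assert.h>
#include <stdlib.h>

/* Stubbed RCU primitives: just enough to show the calling pattern. */
#define rcu_read_lock()           ((void)0)
#define rcu_read_unlock()         ((void)0)
#define rcu_dereference(p)        __atomic_load_n(&(p), __ATOMIC_CONSUME)
#define rcu_assign_pointer(p, v)  __atomic_store_n(&(p), (v), __ATOMIC_RELEASE)
static void synchronize_rcu(void) { /* would wait for a grace period */ }

struct config { int threshold; };
static struct config *global_config;

/* Reader: no locks, no atomic instructions, no memory barriers
 * (on most architectures). */
int read_threshold(void) {
    rcu_read_lock();
    struct config *c = rcu_dereference(global_config);
    int t = c->threshold;
    rcu_read_unlock();
    return t;
}

/* Writer: publish a new version, wait out pre-existing readers,
 * then reclaim the old version. */
void set_threshold(int t) {
    struct config *newc = malloc(sizeof *newc);
    struct config *oldc = global_config;
    newc->threshold = t;
    rcu_assign_pointer(global_config, newc);
    synchronize_rcu();          /* all pre-existing readers are done */
    free(oldc);                 /* now safe to reclaim the old version */
}
```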
Context switch as a quiescent state

[Figure: timelines for CPU 0 and CPU 1. CPU 0 removes an element; CPU 1 runs a sequence of RCU read-side critical sections separated by context switches. A critical section overlapping the removal may hold a reference; sections after CPU 1’s next context switch can’t hold a reference to the old version, but RCU can’t tell until that context switch has occurred]
Grace periods

[Figure: the same two-CPU timeline. After CPU 0 deletes an element, the grace period extends until every CPU has passed through a quiescent state (a context switch); all read-side critical sections that began before the deletion have completed by the end of the grace period]
Quiescent states and grace periods

Example quiescent states
– Context switch (non-preemptive kernels)
– Voluntary context switch (preemptive kernels)
– Kernel entry/exit
– Blocking call

Grace periods
– A period during which every CPU has gone through a quiescent state
Efficient implementation
Choosing good quiescent states
– They should occur anyway
– They should be easy to count
– Not too frequent or infrequent

Recording and dispatching call-backs
– Minimize inter-CPU communication
– Maintain per-CPU queues of call-backs
– Two queues – waiting for grace-period start and end
RCU's data structures

[Figure: each CPU keeps two call-back queues – a ‘next’ queue that call_rcu() appends to, and a ‘current’ queue tagged with a grace-period number. A global grace-period number and a global CPU bitmask track progress; per-CPU counters and counter snapshots mark the end of the previous grace period (if any) and of the current one]
RCU implementations
DYNIX/ptx RCU (data center)

Linux
– Multiple implementations (in 2.5 and 2.6 kernels)
– Preemptible and non-preemptible

Tornado/K42 “generations”
– Preemptive kernel
– Helped generalize usage
Experimental results
How do different combinations of RCU, SMR, NBS and Locking compare?
Hash-table mini-benchmark running on a 1.45 GHz POWER4 system with 8 CPUs

Various workloads
– Read/update fraction
– Hash-table size
– Memory constraints
– Number of CPUs
Scalability with working set in cache
Scalability with large working set
Performance at different update fractions (8 CPUs)
Performance at different update fractions (2 CPUs)
Performance in read-mostly scenarios
Impact of memory constraints
Performance and complexity
When should RCU be used?
– Instead of a simple spinlock?
– Instead of a per-CPU reader-writer lock?

Under what environmental conditions?
– Memory-latency ratio
– Number of CPUs

Under what workloads?
– Fraction of accesses that are updates
– Number of updates per grace period
Analytic results
Compute breakeven update-fraction contours for RCU vs. locking performance, against:
– Number of CPUs (n)
– Updates per grace period (λ)
– Memory-latency ratio (r)

Look at the computed memory-latency ratio at extreme values of λ for n = 4 CPUs
Breakevens for RCU worst case (f vs. r for small λ)
Breakeven for RCU best case (f vs. r for large λ)
Real-world performance and complexity
SysV IPC
– >10x on microbenchmark (8 CPUs)
– 5% for database benchmark (2 CPUs)
– 151 net lines added to the kernel

Directory-entry cache
– +20% in multiuser benchmark (16 CPUs)
– +12% on SPECweb99 (8 CPUs)
– -10% time required to build kernel (16 CPUs)
– 126 net lines added to the kernel
Real-world performance and complexity
Task list
– +10% in multiuser benchmark (16 CPUs)
– 6 net lines added to the kernel
  ● 13 added
  ● 7 deleted
Summary and Conclusions (1)
RCU can provide order-of-magnitude speedups for read-mostly data structures
– RCU optimal when less than 10% of accesses are updates, over a wide range of CPU counts
– RCU projected to remain useful on future CPU architectures

In the Linux 2.6 kernel, RCU provided excellent performance with little added complexity
– Currently over 700 uses of the RCU API in the Linux kernel
Summary and Conclusions (2)
RCU introduces a new model and API for synchronization
– There is additional complexity
– Visual inspection of kernel code has uncovered some subtle bugs in use of RCU API primitives
– Tools to ensure correct use of API primitives are needed
A thought
“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it!”
– Brian Kernighan
Use the right tool for the job!!!
Spare slides
How does RCU address overheads?
Lock contention
– Readers need not acquire locks: no contention!!!
– Writers can still suffer lock contention
  ● But only with each other, and writers are infrequent
  ● Very little contention!!!

Memory latency
– Readers do not perform memory writes
– No need to communicate data among CPUs for cache consistency
  ● Memory latency greatly reduced
How does RCU address overheads?
Pipeline-stall overhead
– On most CPUs, readers do not stall the pipeline due to update ordering or atomic operations

Instruction overhead
– No atomic instructions required for readers
– Readers only need to execute fast instructions