Transcript of CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support.
CMPT 431
Dr. Alexandra Fedorova
Lecture IV: OS Support
CMPT 431 © A. Fedorova
Outline
• Continue discussing OS support for threads and processes
• Alternative distributed systems architectures inspired by limitations of threads
• Support for IPC
• Scalable synchronization
Process/Thread Support: Good Enough?
• Many computer scientists observed limited scalability of MT and MP architectures
(figure: performance of a threaded web server; M. Welsh, SOSP ‘01)
Alternative Web Services Architectures
• Alternative architectures for web services that rely less heavily on threads/processes:
– Single-Process Event-Driven (SPED)
– Asymmetric Multiprocess Event-Driven (AMPED)
– Staged Event-Driven Architecture (SEDA)
Web Services Architecture. Case Study: A Web server
• Sequence of actions at the web server
• Each step can block:
– Socket read/accept can block on network I/O
– File find/read can block for disk I/O
– Send can block on TCP buffer queue
• How do servers overlap blocking and computation?
V. Pai, USENIX ‘99
Multiprocess (MP) or Multithreaded (MT) Architecture: A Review

MP:
• One process performs all steps for a request
• I/O and computation overlap naturally
• The OS switches to a new process when a process blocks

MT:
• One thread performs all steps for a request
• I/O and computation overlap is possible using kernel threads (provided by modern OSs)

V. Pai, USENIX ‘99
Single-Process Event-Driven Architecture (SPED)
• A single process executes the processing steps for all requests
• Uses non-blocking network and disk I/O system calls
• Uses the select system call to check on the status of those operations
• Problem #1: many OSs do not provide non-blocking system calls for disk I/O
• Problem #2: those that do, do not integrate them with select – one cannot check for completion of network and disk I/O simultaneously
V. Pai, USENIX ‘99
Asymmetric Multiprocess Event-Driven Architecture (AMPED)
• AMPED = MP + SPED
• Use the SPED architecture for I/O operations with a non-blocking interface: socket read/write, accept
• Use the MP architecture for I/O operations without a non-blocking interface: file read/write:
– mmap the file
– Use mincore to check if the file is in memory
– If not, spawn a helper process to bring the file into memory
– Communicate with the helper process via IPC
• Flash – a web server implemented using AMPED (V. Pai, et al., USENIX ‘99)
• Matches or exceeds the performance of existing web servers, by up to 50%
Staged Event-Driven Architecture
• Observation: AMPED is good, but it is not easy to control application resources. E.g., which event to process first?
• SEDA: Create a stage for each logical step of processing; Manage each stage separately
• There is a queue of events for each stage, so you can tell how each stage is loaded
• Each stage can be processed by several (a small number of) threads
• Adaptive load shedding – manage queues to control load
– E.g., if the stage that involves disk I/O is the bottleneck, drop the queued up requests or reject new requests
• Dynamic control – adjust the number of threads per stage based on demand
M. Welsh, SOSP ‘01
Outline
• Continue discussing OS support for threads and processes
• Alternative distributed systems architectures inspired by limitations of threads
• Support for IPC
• Support for scalable synchronization
• Distributed operating systems
OS Support for Inter-Process Communication (IPC)
• Cooperating processes or threads need to communicate
• Threads share an address space, so they communicate via shared memory
• What about processes? They do not share an address space. They communicate via:
– Unix pipes
– Memory-mapped files
– Inter-process shared memory
Unix Pipes
A pipe is a communication channel between two processes.

Using a pipe in a shell:

prompt% cat log_file | grep "May 16"

(diagram: cat writes into the pipe; grep reads from it)
Pipes can also be created using the pipe() system call.
Implementation of Pipes
• In Solaris: a data structure containing two vnodes, a lock, and a buffer

(diagram: the pipe structure – a lock, two fnode/vnode pairs, one for each end, and a shared buffer)
• To the user, each end of the pipe is represented by a file descriptor
• The user reads/writes the pipe by reading/writing the file descriptor
• The OS blocks the process reading from an empty pipe
• The OS blocks the process writing into the full pipe (when the buffer is full)
Memory-mapped Files
(diagram: a file mapped into the address spaces of both process A and process B; the mapped regions refer to the same file)
Inter-process Shared Memory
• Inter-process shared memory: a piece of physical memory set up to be shared among processes
• Allocate inter-process shared memory using shmget
• Get permission to use it (attach to it) via shmat
• Disadvantage: shared memory is not cleaned up automatically when processes exit; it must be cleaned up explicitly
Performance of IPC
• IPC involves inter-process context switching
• This is the expensive kind of context switch, because it involves switching address spaces
• The cost of a context switch determines the cost of IPC, and largely depends on the hardware
Outline
• Continue discussing OS support for threads and processes
• Alternative distributed systems architectures inspired by limitations of threads
• Support for IPC
• Support for scalable synchronization
Synchronization
Unsynchronized Access

Thread 1: perform a withdrawal

    if (account_balance >= amount) {      /* step 1 */
        account_balance -= amount;        /* step 3 */
    }

Thread 2: subtract the service fee

    if (account_balance >= service_fee) { /* step 2 */
        account_balance -= service_fee;   /* step 4 */
    }

The account balance has changed between steps 2 and 4!

Synchronized Access

    lock_acquire(account_balance_lock);
    if (account_balance >= amount) {
        account_balance -= amount;
    }
    lock_release(account_balance_lock);

    lock_acquire(account_balance_lock);
    if (account_balance >= service_fee) {
        account_balance -= service_fee;
    }
    lock_release(account_balance_lock);
Synchronization Primitives (SP)
• Synchronization primitives provide atomic access to a critical section
• Types of synchronization primitives:
– mutex
– semaphore
– lock
– condition variable
– etc.
• Synchronization primitives are provided by the OS
• Can also be implemented by a library (e.g., pthreads) or by the application
• Hardware provides special atomic instructions for implementation of synchronization primitives (test-and-set, compare-and-swap, etc.)
Implementation of SP
• Performance of applications that use SPs is determined by the implementation of the SP
• An SP must be scalable – it must continue to perform well as the number of contending threads increases
• We will look at several implementations of locks to understand how to create a scalable implementation
What should you do if you can’t get a lock?
• Keep trying:
– “Spin” or “busy-wait”
– Good if delays are short
• Give up the processor:
– Good if delays are long
– Always good on a uniprocessor
• Systems usually use a combination: spin for a while, then give up the processor
• We will focus on multiprocessors, so we’ll look at spinlock implementations
© Herlihy-Shavit 2007
A Shared Memory Multiprocessor
(diagram: several processors, each with a private cache, connected by a shared bus to memory)
© Herlihy-Shavit 2007
Basic Spinlock
(diagram: threads contend for a spin lock; the holder executes the critical section (CS) and resets the lock upon exit; the lock suffers from contention)

Sequential bottleneck – no parallelism.
© Herlihy-Shavit 2007
Review: Test-and-Set
• We have a boolean value in memory
• Test-and-set (TAS):
– Swap true with the prior value
– The return value tells whether the prior value was true or false
• Can reset just by writing false
© Herlihy-Shavit 2007
TAS
• Provided by the hardware
• Example on SPARC: the ldstub (load-store unsigned byte) assembly instruction – it loads a byte from memory into a register and writes the value 0xFF into the addressed byte, atomically
• TAS can also be expressed in a high-level language. Example in Java:

    public class AtomicBoolean {
        boolean value;
        // Swap old and new values
        public synchronized boolean getAndSet(boolean newValue) {
            boolean prior = value;
            value = newValue;
            return prior;
        }
    }

© Herlihy-Shavit 2007
TAS Locks
• Value of the TAS’ed memory shows the lock state:
– Lock is free: value is false
– Lock is taken: value is true
• Acquire the lock by calling TAS:
– If the result is false, you win
– If the result is true, you lose
• Release lock by writing false
TAS Lock in SPARC Assembly
    spin_lock:
    busy_loop:
            ldstub  [%o0], %o1   ! load old value into %o1, write 0xFF into [%o0]
            tst     %o1          ! test if %o1 equals zero
            bne     busy_loop    ! if %o1 is not zero (old value was true), spin
            nop                  ! delay slot for branch
            retl
            nop                  ! delay slot for branch
TAS Lock in Java
    class TASlock {
        // Initialize lock state to false (unlocked)
        AtomicBoolean state = new AtomicBoolean(false);

        void lock() {
            // While the lock is taken (true), spin
            while (state.getAndSet(true)) {}
        }

        void unlock() {
            // Release the lock – set state to false
            state.set(false);
        }
    }
© Herlihy-Shavit 2007
Performance of TAS Lock
• Experiment:
– N threads on a multiprocessor
– Increment a shared counter 1,000,000 times (total)
– A thread acquires a lock before incrementing the counter
– Each thread does 1,000,000/N increments
• N does not exceed the number of processors, so there is no thread-switching overhead
• How long should it take? How long does it take?
Expected performance

(plot: total time vs. number of threads; the ideal curve is flat – no speedup, because there is no parallelism)

Each thread in turn does: lock_acquire, increment, lock_release – the same as sequential execution.

© Herlihy-Shavit 2007
Actual Performance
(plot: total time vs. number of threads; the TAS lock takes much longer than ideal, and gets worse as threads are added)
© Herlihy-Shavit 2007
Reasons for Bad TAS Lock Performance
• Has to do with cache behaviour on the multiprocessor system
• TAS causes a lot of invalidation misses– This hurts performance
• To understand what this means, let’s review how caches work
Processor Issues Load Request
(diagram: a processor issues a load on the bus; memory responds and the data is installed in that processor’s cache)
© Herlihy-Shavit 2007
Another Processor Issues Load Request
(diagram: a second processor announces “I want data” on the bus; the first cache responds “I got data” and supplies a copy, so both caches now hold the data)
© Herlihy-Shavit 2007
Processor Modifies Data

(diagram: one processor writes its cached copy of the data; the copies in the other caches are now invalid)
© Herlihy-Shavit 2007
Send Invalidation Message to Others
(diagram: the writing processor broadcasts “Invalidate!” on the bus; the other caches lose read permission on their copies. No need to update memory now: a cache can still provide valid data)

© Herlihy-Shavit 2007
Processor Asks for Data
(diagram: a processor whose copy was invalidated announces “I want data” on the bus; the valid data is supplied from the cache holding the modified copy)
© Herlihy-Shavit 2007
Multiprocessor Caches: Summary
• Simultaneous reads and writes of shared data make cached copies invalid
• Invalidation is bad for performance
• On the next data request, the data must be fetched from another cache
• This slows down performance
What This Has to Do with TAS Locks
• Recall that the TAS lock had bad performance (much worse than ideal)
• Invalidations were the cause
• Here is why:
– All spinners do load/store in a loop
– They all read/write the same location
– This causes lots of invalidations
A Solution: Test-And-Test-And-Set Lock
• Wait until the lock “looks” free:
– Spin on the local cache
– No bus use while the lock is busy
• We read the lock instead of TASing it, avoiding repeated invalidations; only when it looks free do we try to acquire it

    class TTASlock {
        AtomicBoolean state = new AtomicBoolean(false);

        void lock() {
            while (true) {
                while (state.get()) {}               // wait until the lock looks free
                if (!state.getAndSet(true)) return;  // now try to acquire it
            }
        }
    }
© Herlihy-Shavit 2007
TTAS Lock Performance
(plot: total time vs. number of threads; the TTAS lock is better than the TAS lock, but still far from ideal)
© Herlihy-Shavit 2007
The Problem with TTAS Lock
• When the lock is released:
– Everyone tries to acquire it
– Everyone does TAS
– There is a storm of invalidations
• Only one processor can use the bus at a time
• So all processors queue up, waiting for the bus, so they can perform the TAS
A Solution: TTAS Lock with Backoff
• Intuition: if I fail to get the lock, there must be contention
• So I should back off before trying again
• Introduce a random “sleep” delay before trying to acquire the lock again
© Herlihy-Shavit 2007
TTAS Lock with Backoff: Performance
(plot: total time vs. number of threads; the backoff lock is below both the TAS and TTAS locks, closest to ideal)
© Herlihy-Shavit 2007
Backoff Locks
• Better performance than TAS and TTAS
• Caveats:
– Performance is sensitive to the choice of the delay parameter
– The delay parameter depends on the number of processors and their speed
– Easy to tune for one platform
– Difficult to write an implementation that will work well across multiple platforms
© Herlihy-Shavit 2007
An Idea
• Avoid useless invalidations– By keeping a queue of threads
• Each thread– Notifies next in line– Without bothering the others
© Herlihy-Shavit 2007
Anderson Queue Lock
(diagram: an array of “flags” spin locations, initially T F F F F F F F, one per thread, and a “next” pointer to the next unused location)

• To acquire, a thread performs getAndIncrement: atomically get the value of “next” and increment the “next” pointer, obtaining a private spin location
• If that location holds TRUE, the lock is acquired; otherwise the thread spins on it
© Herlihy-Shavit 2007
Acquiring a Held Lock
(diagram: the holder’s flag location reads T (“acquired”); a new thread’s getAndIncrement hands it the next location, which is F, so it spins; on release the holder sets that location to T and the spinner becomes the new holder)
© Herlihy-Shavit 2007
Anderson Lock: Performance
(plot: total time vs. number of threads; the Anderson lock is almost ideal, well below the TAS and TTAS locks)

Almost ideal: we avoid all unnecessary invalidations. Portable – no tunable parameters.
© Herlihy-Shavit 2007
Scalable Synchronization: Summary
• Making synchronization primitives scalable is tricky
• Performance is tied to the hardware architecture
• We looked at these spinlocks:
– TAS – poor performance due to invalidations
– TTAS – avoids constant invalidations, but causes a storm of invalidations on lock release
– TTAS with backoff – eliminates the storm of invalidations on release
– Anderson queue lock – completely eliminates all useless invalidations
• One could think of other optimizations…
• For more information, look at the references in the syllabus
Transactional Memory
• Programming with locks is tough
• Yet everyone has to do synchronization – multithreaded programming is driven by the multicore revolution
• Transactional memory: concurrent programming without locks
Coarse vs. Fine Synchronization
Both options shown in one listing – the coarse lock around the whole loop, the fine locks around each counter:

    int update_shared_counters(int *counters, int n_counters)
    {
        int i;
        coarse_lock_acquire(counters_lock);          /* coarse option */
        for (i = 0; i < n_counters; i++) {
            fine_lock_acquire(counter_locks[i]);     /* fine option */
            counters[i]++;
            fine_lock_release(counter_locks[i]);
        }
        coarse_lock_release(counters_lock);
    }

Coarse locks are easy to program, but perform poorly.
Fine locks perform well, but are difficult to program.
Transactional Memory To the Rescue!
• Can we have the best of both worlds?
– Good performance
– Ease of programming
• The answer is: Transactional Memory (TM)
Transactional Memory (TM)
• Programming model:
– Extension to the language
– Runtime and/or hardware support
• Lets you do synchronization without locks
• Performance of fine-grained locks
• Ease of programming of coarse-grained locks
Transactional Memory vs. Locks
    int update_shared_counters(int *counters, int n_counters)
    {
        int i;
        ATOMIC_BEGIN();
        for (i = 0; i < n_counters; i++) {
            counters[i]++;
        }
        ATOMIC_END();
    }

(the coarse and fine lock calls from the previous listing are gone)

Transactional section:
• Looks like a coarse-grained lock
• Acts like a fine-grained lock
• Performance degrades only if there is conflict
The Backend of TM
(diagram: two concurrent transactions – one performs read A, write B, read B, write A, write D; the other performs read C, write C, read E, write E, read D; the conflicting accesses to D force one transaction to Abort! and restart)
State of TM
• Still evolving:
– More work needed to make it usable and well-performing
• It is very real:
– Sun’s new Rock processor has TM support
– Intel is very active
OS Support For Distributed Systems: Summary (I)
• Networking:
– Access to network devices
– Implementation of network protocols: TCP, UDP, IP
• Processes and threads (because many DS components use MP/MT architectures). Must ensure:
– Good load balance
– Good response time
– Minimize context switches
– We looked at how the Solaris time-sharing scheduler does this
OS Support For Distributed Systems: Summary (II)
• Inter-process communication:
– Pipes
– Memory-mapped files
– Inter-process shared memory
• Scalable Synchronization