Transcript of CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support.
CMPT 431
Dr. Alexandra Fedorova
Lecture IV: OS Support
CMPT 431 © A. Fedorova
Outline
• Continue discussing OS support for threads and processes
• Alternative distributed systems architectures inspired by limitations of threads
• Support for IPC
• Scalable synchronization
Process/Thread Support: Good Enough?
• Many computer scientists observed limited scalability of MT and MP architectures
(figure: performance of a threaded web server; M. Welsh, SOSP ‘01)
Alternative Web Services Architectures
• Alternative architectures for web services that rely less heavily on threads/processes:
– Single-Process Event-Driven (SPED)
– Asymmetric Multiprocess Event-Driven (AMPED)
– Staged Event-Driven Architecture (SEDA)
Web Services Architecture. Case Study: A Web server
• Sequence of actions at the web server
• Each step can block:
– Socket read/accept can block on network I/O
– File find/read can block for disk I/O
– Send can block on TCP buffer queue
• How do servers overlap blocking and computation?
V. Pai, USENIX ‘99
Multiprocess (MP) or Multithreaded (MT) Architecture: A Review

MP:
• One process performs all steps for a request
• I/O and computation overlap naturally
• The OS switches to a new process when a process blocks

MT:
• One thread performs all steps for a request
• I/O and computation overlap is possible using kernel threads (provided by modern OSs)

V. Pai, USENIX ‘99
Single-Process Event-Driven Architecture (SPED)
• A single process executes the processing steps for all requests
• Uses non-blocking network and disk I/O system calls
• Uses the select system call to check on the status of those operations
• Problem #1: many OSs do not provide non-blocking system calls for disk I/O
• Problem #2: those that do, do not integrate them with select – one cannot check for completion of network and disk I/O simultaneously
V. Pai, USENIX ‘99
Asymmetric Multiprocess Event-Driven Architecture (AMPED)
• AMPED = MP + SPED
• Use the SPED architecture for I/O operations with a non-blocking interface: socket read/write, accept
• Use the MP architecture for I/O operations without a non-blocking interface: file read/write:
– mmap the file
– Use mincore to check if the file is in memory
– If not, spawn a helper process to bring the file into memory
– Communicate with the helper process via IPC
• Flash – a web server implemented using AMPED (V. Pai, et al., USENIX ‘99)
• Matches or exceeds the performance of existing web servers, by up to 50%
Staged Event-Driven Architecture
• Observation: AMPED is good, but it is not easy to control application resources. E.g., which event to process first?
• SEDA: Create a stage for each logical step of processing; Manage each stage separately
• There is a queue of events for each stage, so you can tell how each stage is loaded
• Each stage can be processed by several (a small number of) threads
• Adaptive load shedding – manage queues to control load
– E.g., if the stage that involves disk I/O is the bottleneck, drop the queued up requests or reject new requests
• Dynamic control – adjust the number of threads per stage based on demand
M. Welsh, SOSP ‘01
Outline
• Continue discussing OS support for threads and processes
• Alternative distributed systems architectures inspired by limitations of threads
• Support for IPC
• Support for scalable synchronization
• Distributed operating systems
OS Support for Inter-Process Communication (IPC)
• Cooperating processes or threads need to communicate
• Threads share an address space, so they communicate via shared memory
• What about processes? They do not share an address space. They communicate via:
– Unix pipes
– Memory-mapped files
– Inter-process shared memory
Unix Pipes
A pipe is a communication channel between two processes.

Using a pipe in a shell:

prompt% cat log_file | grep "May 16"

(diagram: cat writes into the pipe; grep reads from it)
Pipes can also be created using the pipe() system call.
Implementation of Pipes
• In Solaris: a data structure containing two vnodes, a lock, and a buffer

(diagram: the pipe structure – a lock, two fnode/vnode pairs, one for each end, and a shared buffer)
• To the user, each end of the pipe is represented by a file descriptor
• The user reads/writes the pipe by reading/writing the file descriptor
• The OS blocks the process reading from an empty pipe
• The OS blocks the process writing into the full pipe (when the buffer is full)
Memory-mapped Files
(diagram: a file mapped into the address spaces of both process A and process B; the mapped regions refer to the same file)
Inter-process Shared Memory
• Inter-process shared memory: a piece of physical memory set up to be shared among processes
• Allocate inter-process shared memory using shmget
• Get permission to use it (attach to it) via shmat
• Disadvantage: shared memory is not cleaned up automatically when processes exit; it must be cleaned up explicitly
Performance of IPC
• IPC involves inter-process context switching
• This is the expensive kind of context switch, because it involves switching address spaces
• The cost of a context switch determines the cost of IPC, and largely depends on the hardware
Outline
• Continue discussing OS support for threads and processes
• Alternative distributed systems architectures inspired by limitations of threads
• Support for IPC
• Support for scalable synchronization
Synchronization
Unsynchronized Access

Thread 1: perform a withdrawal

    if (account_balance >= amount) {      /* step 1 */
        account_balance -= amount;        /* step 3 */
    }

Thread 2: subtract the service fee

    if (account_balance >= service_fee) { /* step 2 */
        account_balance -= service_fee;   /* step 4 */
    }

The account balance has changed between steps 2 and 4!

Synchronized Access

    lock_acquire(account_balance_lock);
    if (account_balance >= amount) {
        account_balance -= amount;
    }
    lock_release(account_balance_lock);

    lock_acquire(account_balance_lock);
    if (account_balance >= service_fee) {
        account_balance -= service_fee;
    }
    lock_release(account_balance_lock);
Synchronization Primitives (SP)
• Synchronization primitives provide atomic access to a critical section
• Types of synchronization primitives:
– mutex
– semaphore
– lock
– condition variable
– etc.
• Synchronization primitives are provided by the OS
• Can also be implemented by a library (e.g., pthreads) or by the application
• Hardware provides special atomic instructions for implementation of synchronization primitives (test-and-set, compare-and-swap, etc.)
Implementation of SP
• Performance of applications that use SPs is determined by the implementation of the SP
• An SP must be scalable – it must continue to perform well as the number of contending threads increases
• We will look at several implementations of locks to understand how to create a scalable implementation
What should you do if you can’t get a lock?
• Keep trying:
– “Spin” or “busy-wait”
– Good if delays are short
• Give up the processor:
– Good if delays are long
– Always good on a uniprocessor
• Systems usually use a combination: spin for a while, then give up the processor
• We will focus on multiprocessors, so we’ll look at spinlock implementations
© Herlihy-Shavit 2007
A Shared Memory Multiprocessor
(diagram: several processors, each with a private cache, connected by a shared bus to memory)
© Herlihy-Shavit 2007
Basic Spinlock
(diagram: threads contend for a spin lock; the holder executes the critical section (CS) and resets the lock upon exit; the lock suffers from contention)

Sequential bottleneck – no parallelism.
© Herlihy-Shavit 2007
Review: Test-and-Set
• We have a boolean value in memory
• Test-and-set (TAS):
– Swap true with the prior value
– The return value tells whether the prior value was true or false
• Can reset just by writing false
© Herlihy-Shavit 2007
TAS
• Provided by the hardware
• Example on SPARC: the ldstub (load-store unsigned byte) assembly instruction – it loads a byte from memory into a register and writes the value 0xFF into the addressed byte, atomically
• TAS can also be expressed in a high-level language. Example in Java:

    public class AtomicBoolean {
        boolean value;
        // Swap old and new values
        public synchronized boolean getAndSet(boolean newValue) {
            boolean prior = value;
            value = newValue;
            return prior;
        }
    }

© Herlihy-Shavit 2007
TAS Locks
• Value of the TAS’ed memory shows the lock state:
– Lock is free: value is false
– Lock is taken: value is true
• Acquire the lock by calling TAS:
– If the result is false, you win
– If the result is true, you lose
• Release lock by writing false
TAS Lock in SPARC Assembly
    spin_lock:
    busy_loop:
            ldstub  [%o0], %o1   ! load old value into %o1, write 0xFF into [%o0]
            tst     %o1          ! test if %o1 equals zero
            bne     busy_loop    ! if %o1 is not zero (old value was true), spin
            nop                  ! delay slot for branch
            retl
            nop                  ! delay slot for branch
TAS Lock in Java
    class TASlock {
        // Initialize lock state to false (unlocked)
        AtomicBoolean state = new AtomicBoolean(false);

        void lock() {
            // While the lock is taken (true), spin
            while (state.getAndSet(true)) {}
        }

        void unlock() {
            // Release the lock – set state to false
            state.set(false);
        }
    }
© Herlihy-Shavit 2007
Performance of TAS Lock
• Experiment:
– N threads on a multiprocessor
– Increment a shared counter 1,000,000 times (total)
– A thread acquires a lock before incrementing the counter
– Each thread does 1,000,000/N increments
• N does not exceed the number of processors, so there is no thread-switching overhead
• How long should it take? How long does it take?
Expected performance

(plot: total time vs. number of threads; the ideal curve is flat – no speedup, because there is no parallelism)

Each thread in turn does: lock_acquire, increment, lock_release – the same as sequential execution.

© Herlihy-Shavit 2007
Actual Performance
(plot: total time vs. number of threads; the TAS lock takes much longer than ideal, and gets worse as threads are added)
© Herlihy-Shavit 2007
Reasons for Bad TAS Lock Performance
• Has to do with cache behaviour on the multiprocessor system
• TAS causes a lot of invalidation misses– This hurts performance
• To understand what this means, let’s review how caches work
Processor Issues Load Request
(diagram: a processor issues a load on the bus; memory responds and the data is installed in that processor’s cache)
© Herlihy-Shavit 2007
Another Processor Issues Load Request
(diagram: a second processor announces “I want data” on the bus; the first cache responds “I got data” and supplies a copy, so both caches now hold the data)
© Herlihy-Shavit 2007
Processor Modifies Data

(diagram: one processor writes its cached copy of the data; the copies in the other caches are now invalid)
© Herlihy-Shavit 2007
Send Invalidation Message to Others
(diagram: the writing processor broadcasts “Invalidate!” on the bus; the other caches lose read permission on their copies. No need to update memory now: a cache can still provide valid data)

© Herlihy-Shavit 2007
Processor Asks for Data
(diagram: a processor whose copy was invalidated announces “I want data” on the bus; the valid data is supplied from the cache holding the modified copy)
© Herlihy-Shavit 2007
Multiprocessor Caches: Summary
• Simultaneous reads and writes of shared data make cached copies invalid
• Invalidation is bad for performance
• On the next data request, the data must be fetched from another cache
• This slows down performance
What This Has to Do with TAS Locks
• Recall that the TAS lock had bad performance (much worse than ideal)
• Invalidations were the cause
• Here is why:
– All spinners do load/store in a loop
– They all read/write the same location
– This causes lots of invalidations
A Solution: Test-And-Test-And-Set Lock
• Wait until the lock “looks” free:
– Spin on the local cache
– No bus use while the lock is busy
• We read the lock instead of TASing it, avoiding repeated invalidations; only when it looks free do we try to acquire it

    class TTASlock {
        AtomicBoolean state = new AtomicBoolean(false);

        void lock() {
            while (true) {
                while (state.get()) {}               // wait until the lock looks free
                if (!state.getAndSet(true)) return;  // now try to acquire it
            }
        }
    }
© Herlihy-Shavit 2007
TTAS Lock Performance
(plot: total time vs. number of threads; the TTAS lock is better than the TAS lock, but still far from ideal)
© Herlihy-Shavit 2007
The Problem with TTAS Lock
• When the lock is released:
– Everyone tries to acquire it
– Everyone does TAS
– There is a storm of invalidations
• Only one processor can use the bus at a time
• So all processors queue up, waiting for the bus, so they can perform the TAS
A Solution: TTAS Lock with Backoff
• Intuition: if I fail to get the lock, there must be contention
• So I should back off before trying again
• Introduce a random “sleep” delay before trying to acquire the lock again
© Herlihy-Shavit 2007
TTAS Lock with Backoff: Performance
(plot: total time vs. number of threads; the backoff lock is below both the TAS and TTAS locks, closest to ideal)
© Herlihy-Shavit 2007
Backoff Locks
• Better performance than TAS and TTAS
• Caveats:
– Performance is sensitive to the choice of the delay parameter
– The delay parameter depends on the number of processors and their speed
– Easy to tune for one platform
– Difficult to write an implementation that will work well across multiple platforms
© Herlihy-Shavit 2007
An Idea
• Avoid useless invalidations– By keeping a queue of threads
• Each thread– Notifies next in line– Without bothering the others
© Herlihy-Shavit 2007
Anderson Queue Lock
(diagram: an array of “flags” spin locations, initially T F F F F F F F, one per thread, and a “next” pointer to the next unused location)

• To acquire, a thread performs getAndIncrement: atomically get the value of “next” and increment the “next” pointer, obtaining a private spin location
• If that location holds TRUE, the lock is acquired; otherwise the thread spins on it
© Herlihy-Shavit 2007
Acquiring a Held Lock
(diagram: the holder’s flag location reads T (“acquired”); a new thread’s getAndIncrement hands it the next location, which is F, so it spins; on release the holder sets that location to T and the spinner becomes the new holder)
© Herlihy-Shavit 2007
Anderson Lock: Performance
(plot: total time vs. number of threads; the Anderson lock is almost ideal, well below the TAS and TTAS locks)

Almost ideal: we avoid all unnecessary invalidations. Portable – no tunable parameters.
© Herlihy-Shavit 2007
Scalable Synchronization: Summary
• Making synchronization primitives scalable is tricky
• Performance is tied to the hardware architecture
• We looked at these spinlocks:
– TAS – poor performance due to invalidations
– TTAS – avoids constant invalidations, but causes a storm of invalidations on lock release
– TTAS with backoff – eliminates the storm of invalidations on release
– Anderson queue lock – completely eliminates all useless invalidations
• One could think of other optimizations…
• For more information, look at the references in the syllabus
Transactional Memory
• Programming with locks is tough
• Yet everyone has to do synchronization – multithreaded programming is driven by the multicore revolution
• Transactional memory: concurrent programming without locks
Coarse vs. Fine Synchronization
Both options shown in one listing – the coarse lock around the whole loop, the fine locks around each counter:

    int update_shared_counters(int *counters, int n_counters)
    {
        int i;
        coarse_lock_acquire(counters_lock);          /* coarse option */
        for (i = 0; i < n_counters; i++) {
            fine_lock_acquire(counter_locks[i]);     /* fine option */
            counters[i]++;
            fine_lock_release(counter_locks[i]);
        }
        coarse_lock_release(counters_lock);
    }

Coarse locks are easy to program, but perform poorly.
Fine locks perform well, but are difficult to program.
Transactional Memory To the Rescue!
• Can we have the best of both worlds?
– Good performance
– Ease of programming
• The answer is: Transactional Memory (TM)
Transactional Memory (TM)
• Programming model:
– Extension to the language
– Runtime and/or hardware support
• Lets you do synchronization without locks
• Performance of fine-grained locks
• Ease of programming of coarse-grained locks
Transactional Memory vs. Locks
    int update_shared_counters(int *counters, int n_counters)
    {
        int i;
        ATOMIC_BEGIN();
        for (i = 0; i < n_counters; i++) {
            counters[i]++;
        }
        ATOMIC_END();
    }

(the coarse and fine lock calls from the previous listing are gone)

Transactional section:
• Looks like a coarse-grained lock
• Acts like a fine-grained lock
• Performance degrades only if there is conflict
The Backend of TM
(diagram: two concurrent transactions – one performs read A, write B, read B, write A, write D; the other performs read C, write C, read E, write E, read D; the conflicting accesses to D force one transaction to Abort! and restart)
State of TM
• Still evolving:
– More work needed to make it usable and well-performing
• It is very real:
– Sun’s new Rock processor has TM support
– Intel is very active
OS Support For Distributed Systems: Summary (I)
• Networking:
– Access to network devices
– Implementation of network protocols: TCP, UDP, IP
• Processes and threads (because many DS components use MP/MT architectures). Must ensure:
– Good load balance
– Good response time
– Minimize context switches
– We looked at how the Solaris time-sharing scheduler does this
OS Support For Distributed Systems: Summary (II)
• Inter-process communication:
– Pipes
– Memory-mapped files
– Inter-process shared memory
• Scalable Synchronization