Includes slides from course CS162 at UC Berkeley, by prof. Anthony D. Joseph and Ion Stoica, and from course CS194, by prof. Katherine Yelick
Shared Memory Programming: Synchronization primitives
Ing. Andrea Marongiu ([email protected])
• Program is a collection of threads of control
  • Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
  • Also a set of shared variables, e.g., static variables, shared common blocks, or global heap
• Threads communicate implicitly by writing and reading shared variables
• Threads coordinate by synchronizing on shared variables
[Figure: processors P0, P1, …, Pn each hold private memory (e.g., private copies of i: 2, 5, 8) and read/write a variable s in shared memory (s = ..., y = ..s...)]
Shared Memory Programming
static int s = 0;

Thread 1:
  for i = 0, n/2-1
    s = s + sqr(A[i])

Thread 2:
  for i = n/2, n-1
    s = s + sqr(A[i])
• Problem is a race condition on variable s in the program
• A race condition (or data race) occurs when:
  - two processors (or two threads) access the same variable, and at least one does a write
  - the accesses are concurrent (not synchronized), so they could happen simultaneously
Shared Memory code for computing a sum
static int s = 0;

Thread 1:
  …
  compute f(A[i]) and put in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  …

Thread 2:
  …
  compute f(A[i]) and put in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  …
• Assume A = [3,5], f is the square function, and s = 0 initially
• For this program to work, s should be 34 at the end
  • but it may be 34, 9, or 25
• The atomic operations are reads and writes
  • You never see half of one number, but the += operation is not atomic
  • All computations happen in (private) registers
[Figure: A = [3, 5], f = square; depending on how the two threads' loads and stores interleave, the final value of s is 34, 9, or 25]
Shared Memory code for computing a sum
static int s = 0;

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + sqr(A[i])
  s = s + local_s1

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + sqr(A[i])
  s = s + local_s2
• Since addition is associative, it's OK to rearrange order
  • Right?
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - But there is still a race condition on the update of shared s
Shared Memory code for computing a sum
The two updates s = s + local_s1 and s = s + local_s2 must each be ATOMIC.
Atomic Operations
• To understand a concurrent program, we need to know what the underlying indivisible operations are!
• Atomic Operation: an operation that always runs to completion, or not at all
  • It is indivisible: it cannot be stopped in the middle, and its state cannot be modified by someone else in the middle
  • Fundamental building block – if there are no atomic operations, then threads have no way to work together
• On most machines, memory references and assignments (i.e., loads and stores) of words are atomic
Role of Synchronization
• "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
• Types of Synchronization
  • Mutual Exclusion
  • Event synchronization
    • point-to-point
    • group
    • global (barriers)
• How much hardware support?
These are the most used forms of synchronization in shared memory parallel programming.
Motivation: “Too much milk”
• Example: People need to coordinate:

Time   Person A                       Person B
3:00   Look in fridge. Out of milk
3:05   Leave for store
3:10   Arrive at store                Look in fridge. Out of milk
3:15   Buy milk                       Leave for store
3:20   Arrive home, put milk away     Arrive at store
3:25                                  Buy milk
3:30                                  Arrive home, put milk away
Definitions
• Synchronization: using atomic operations to ensure cooperation between threads
  • For now, only loads and stores are atomic
  • hard to build anything useful with only reads and writes
• Mutual Exclusion: ensuring that only one thread does a particular thing at a time
  • One thread excludes the other while doing its task
• Critical Section: piece of code that only one thread can execute at once
  • Critical section and mutual exclusion are two ways of describing the same thing
  • Critical section defines sharing granularity
More Definitions
• Lock: prevents someone from doing something
  • Lock before entering critical section and before accessing shared data
  • Unlock when leaving, after accessing shared data
  • Wait if locked
• Important idea: all synchronization involves waiting
• Example: fix the milk problem by putting a lock on the refrigerator
  • Lock it and take the key if you are going to go buy milk
  • Fixes too much (coarse granularity): roommate angry if he only wants orange juice
Too Much Milk: Correctness properties
• Need to be careful about the correctness of concurrent programs, since they are non-deterministic
  • Always write down the desired behavior first
  • think first, then code
• What are the correctness properties for the "Too much milk" problem?
  • Never more than one person buys
  • Someone buys if needed
• Restrict ourselves to use only atomic load and store operations as building blocks
Too Much Milk: Solution #1
• Use a note to avoid buying too much milk:
  • Leave a note before buying (kind of "lock")
  • Remove note after buying (kind of "unlock")
  • Don't buy if note (wait)
• Suppose a computer tries this (remember, only memory read/write are atomic):

if (noMilk) {
  if (noNote) {
    leave Note;
    buy milk;
    remove note;
  }
}

• Result?
Too Much Milk: Solution #1

Thread A                      Thread B
if (noMilk) {
  if (noNote) {
                              if (noMilk) {
                                if (noNote) {
    leave Note;
    buy milk;
    remove note;
  }
}
                                  leave Note;
                                  buy milk;
                                  remove note;
                                }
                              }

Both threads buy milk! Need to atomically update the lock variable.
How to Implement Lock?
• Lock: prevents someone from accessing something
  • Lock before entering critical section (e.g., before accessing shared data)
  • Unlock when leaving, after accessing shared data
  • Wait if locked
• Important idea: all synchronization involves waiting
  • Should sleep if waiting for a long time
• Hardware atomic instructions?
Examples of hardware atomic instructions

• test&set (&address) {              /* most architectures */
    result = M[address];
    M[address] = 1;
    return result;
  }

• swap (&address, register) {        /* x86 */
    temp = M[address];
    M[address] = register;
    register = temp;
  }

• compare&swap (&address, reg1, reg2) {  /* 68000 */
    if (reg1 == M[address]) {
      M[address] = reg2;
      return success;
    } else {
      return failure;
    }
  }

Atomic operations!
Implementing Locks with test&set
• Simple solution:

int value = 0; // Free

Acquire() {
  while (test&set(value)); // while busy
}

Release() {
  value = 0;
}

• Simple explanation:
  • If the lock is free, test&set reads 0 and sets value = 1, so the lock is now busy. It returns 0, so the while exits
  • If the lock is busy, test&set reads 1 and sets value = 1 (no change). It returns 1, so the while loop continues
  • When we set value = 0, someone else can get the lock

Recall: test&set (&address) { result = M[address]; M[address] = 1; return result; }
Too Much Milk: Solution #2
• Lock.Acquire() – wait until the lock is free, then grab it
• Lock.Release() – unlock, waking up anyone waiting
• Atomic operations – if two threads are waiting for the lock, only one succeeds in grabbing it
• Then, our milk problem is easy:

milklock.Acquire();
if (nomilk)
  buy milk;
milklock.Release();

• Once again, the section of code between Acquire() and Release() is called a "Critical Section"
Shared Memory code for computing a sum

static int s = 0;
static lock lk;

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + sqr(A[i])
  lock(lk);
  s = s + local_s1
  unlock(lk);

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + sqr(A[i])
  lock(lk);
  s = s + local_s2
  unlock(lk);

• Since addition is associative, it's OK to rearrange order
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - The remaining race condition on the update of shared s is now eliminated by the lock
Performance Criteria for Synch. Ops
• Latency (time per op)
  • How long does it take if you always win?
  • Especially relevant under light contention
• Bandwidth (ops per sec)
  • Especially under high contention
  • How long does it take (averaged over threads) when many others are trying for it?
• Traffic
  • How many events on shared resources (bus, crossbar, …)?
• Storage
  • How much memory is required?
• Fairness
  • Can any one thread be "starved" and never get the lock?
Barriers
• Software algorithms implemented using locks, flags, counters
• Hardware barriers
  • Wired-AND line separate from address/data bus
    • Set input high when you arrive, wait for output to be high to leave
  • In practice, multiple wires to allow reuse
  • Useful when barriers are global and very frequent
  • Difficult to support an arbitrary subset of processors
    • even harder with multiple processes per processor
  • Difficult to dynamically change the number and identity of participants
    • e.g. the latter due to process migration
  • Not common today on bus-based machines
struct bar_type {
  int counter;
  struct lock_type lock;
  int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
  LOCK(bar_name.lock);
  if (bar_name.counter == 0)
    bar_name.flag = 0;             /* reset flag if first to reach */
  mycount = ++bar_name.counter;    /* mycount is private */
  UNLOCK(bar_name.lock);
  if (mycount == p) {              /* last to arrive */
    bar_name.counter = 0;          /* reset for next barrier */
    bar_name.flag = 1;             /* release waiters */
  }
  else
    while (bar_name.flag == 0) {}; /* busy wait for release */
}
A Simple Centralized Barrier
• Shared counter maintains the number of processes that have arrived
  • increment when arriving (under lock), check until it reaches numprocs
• Problem?
A Working Centralized Barrier
• Consecutively entering the same barrier doesn't work
  • Must prevent a process from entering until all have left the previous instance
  • Could use another counter, but that increases latency and contention
• Sense reversal: wait for the flag to take a different value in consecutive instances
  • Toggle this value only when all processes reach the barrier
BARRIER (bar_name, p) {
  local_sense = !(local_sense);      /* toggle private sense variable */
  LOCK(bar_name.lock);
  mycount = bar_name.counter++;      /* mycount is private */
  if (bar_name.counter == p) {       /* last to arrive */
    bar_name.counter = 0;            /* reset for next barrier */
    UNLOCK(bar_name.lock);
    bar_name.flag = local_sense;     /* release waiters */
  }
  else {
    UNLOCK(bar_name.lock);
    while (bar_name.flag != local_sense) {};
  }
}
Centralized Barrier Performance
• Latency
  • Centralized has critical path length at least proportional to p
• Traffic
  • About 3p bus transactions
• Storage Cost
  • Very low: centralized counter and flag
• Fairness
  • The same processor should not always be last to exit the barrier
  • No such bias in centralized
• Key problems for the centralized barrier are latency and traffic
  • Especially with distributed memory, traffic goes to the same node
Improved Barrier Algorithm
• Separate gather and release trees
• Advantage: use of ordinary reads/writes instead of locks (array of flags)
• 2x(p-1) messages exchanged over the network
• Valuable in distributed network: communicate along different paths
Master-Slave barrier
• Master core gathers slaves on the barrier and releases them
• Use separate, per-core polling flags for different wait stages
[Figure: centralized barrier – all cores contend on one counter; master-slave barrier – per-core flags remove the contention]
Improved Barrier Algorithm
• Not all messages have the same latency
• Need for a locality-aware implementation
• What if implemented on top of a NUMA (cluster-based) shared memory system?
  • e.g., p2012
[Figure: master-slave barrier mapped onto a NUMA cluster-based system; each cluster contains a PROC, local MEM, and XBAR, with clusters connected through NIs (network interfaces)]
Improved Barrier Algorithm
• Software combining tree
  • Only k processors access the same location, where k is the degree of the tree → little contention
• Separate arrival and exit trees, and use sense reversal
• Valuable in a distributed network: communicate along different paths
• Higher latency (log p steps of work, and O(p) serialized bus transactions)
• Advantage: use of ordinary reads/writes instead of locks
[Figure: centralized barrier (contention at a single counter) vs. software combining tree]
Improved Barrier Algorithm
• Hierarchical synchronization
• Locality-aware implementation
• What if implemented on top of a NUMA (cluster-based) shared memory system?
  • e.g., p2012
[Figure: tree barrier mapped onto a NUMA cluster-based system; each cluster contains a PROC, local MEM, and XBAR, with clusters connected through NIs]
Barrier performance
Parallel programming models
• A programming model is made up of the languages and libraries that create an abstract view of the machine
• Control
  • How is parallelism created?
  • How are dependencies (orderings) enforced?
• Data
  • Can data be shared, or is it all private?
  • How is shared data accessed or private data communicated?
• Synchronization
  • What operations can be used to coordinate parallelism?
  • What are the atomic (indivisible) operations?

Parallel programming models
• In this and the upcoming lectures we will see different programming models and the features each provides with respect to
  • Control
  • Data
  • Synchronization
• Pthreads
• OpenMP
• OpenCL