10 Multicore 07
Transcript of 10 Multicore 07
Multiprocessor/Multicore Systems
Scheduling, Synchronization
Multiprocessors
Definition: A computer system in which two or more CPUs share full access to a common RAM.
Multiprocessor/Multicore Hardware (ex.1)
Bus-based multiprocessors
Multiprocessor/Multicore Hardware (ex.2)
UMA (uniform memory access)
Not/hardly scalable:
• Bus-based architectures -> saturation
• Crossbars too expensive (wiring constraints)

Possible solutions:
• Reduce network traffic by caching
• Clustering -> non-uniform memory latency behaviour (NUMA)
Multiprocessor/Multicore Hardware (ex.3)
NUMA (non-uniform memory access)
• Single address space visible to all CPUs
• Access to remote memory slower than to local
• Cache-controller/MMU determines whether a reference is local or remote
• When caching is involved, it's called CC-NUMA (cache-coherent NUMA)
• Typically read replication (write invalidation)
Cache-coherence
Cache coherence (cont)
Cache-coherency protocols are based on a set of (cache-block) states and state transitions. Two types of protocols:
• write-update
• write-invalidate (suffers from false sharing)
  – Some invalidations are not necessary for correct program execution:

    Processor 1:          Processor 2:
    while (true) do       while (true) do
      A = A + 1             B = B + 1

If A and B are located in the same cache block, a cache miss occurs in each loop iteration due to a ping-pong of invalidations.
On multicores
Reason for multicores: physical limitations can cause significant heat-dissipation and data-synchronization problems.

In addition to operating system (OS) support, adjustments to existing software are required to maximize utilization of the computing resources provided by multi-core processors.

The virtual-machine approach is again in focus.

Intel Core 2 dual-core processor, with CPU-local Level 1 caches + a shared, on-die Level 2 cache.
On multicores (cont)
• Also possible (figure from www.microsoft.com/licensing/highlights/multicore.mspx)
OS Design issues (1): Who executes the OS/scheduler(s)?

• Master/slave architecture: key kernel functions always run on a particular processor
  – Master is responsible for scheduling; a slave sends a service request to the master
  – Disadvantages:
    • Failure of the master brings down the whole system
    • The master can become a performance bottleneck

• Peer architecture: the operating system can execute on any processor
  – Each processor does self-scheduling
  – New issues for the operating system:
    • Make sure two processors do not choose the same process
Master-Slave multiprocessor OS
(figure: master CPU running the OS, slave CPUs running user processes, connected to memory by a bus)
Non-symmetric Peer Multiprocessor OS
Each CPU has its own operating system
(figure: CPUs, each with its own private OS, connected to memory by a bus)
Symmetric Peer Multiprocessor OS
• Symmetric Multiprocessors
  – SMP multiprocessor model

(figure: CPUs sharing a single OS image, connected to memory by a bus)
Scheduling in Multiprocessors
Recall: tightly coupled multiprocessing (SMPs)
– Processors share main memory
– Controlled by operating system

Different degrees of parallelism:
– Independent and coarse-grained parallelism
  • no or very limited synchronization
  • can be supported on a multiprocessor with little change (and a bit of salt)
– Medium-grained parallelism
  • collection of threads; usually interact frequently
– Fine-grained parallelism
  • highly parallel applications; specialized and fragmented area
Design issues 2: Assignment of Processes to Processors
Per-processor ready queues vs. a global ready queue

• Permanently assign each process to a processor
  – Less overhead
  – A processor could be idle while another processor has a backlog

• Have a global ready queue and schedule to any available processor
  – The queue can become a bottleneck
  – Task migration is not cheap
Multiprocessor Scheduling: per-partition RQ

• Space sharing: multiple threads run at the same time across multiple CPUs
Multiprocessor Scheduling: Load sharing / Global ready queue

• Timesharing: note the use of a single data structure for scheduling
Multiprocessor Scheduling, Load Sharing: a problem

• Problem with communication between two threads
  – both belong to process A
  – both running out of phase
Design issues 3: Multiprogramming on processors?

Experience shows:
– Threads running on separate processors (to the extent of dedicating a processor to a thread) yield dramatic gains in performance
– Allocating processors to threads ~ allocating pages to processes (can use a working-set model?)
– The specific scheduling discipline is less important with more than one processor; the decision of "distributing" tasks is more important
Gang Scheduling

• Approach to address the previous problem:
  1. Groups of related threads are scheduled as a unit (a gang)
  2. All members of a gang run simultaneously, on different timeshared CPUs
  3. All gang members start and end their time slices together
Gang Scheduling: another option
Multiprocessor Thread Scheduling Dynamic Scheduling
• Number of threads in a process are altered dynamically by the application
• Programs (through thread libraries) give info to OS to manage parallelism– OS adjusts the load to improve use
• Or os gives info to run-time system about available processors, to adjust # of threads.
• i.e dynamic vesion of partitioning:
Summary: Multiprocessor Thread Scheduling
Load sharing: processes/threads are not assigned to particular processors
• load is distributed evenly across the processors
• needs a central queue; may be a bottleneck
• preempted threads are unlikely to resume execution on the same processor; cache use is less efficient

Gang scheduling: assigns threads to particular processors (simultaneous scheduling of the threads that make up a process)
• useful where performance severely degrades when any part of the application is not running (due to synchronization)
• extreme version: dedicated processor assignment (no multiprogramming of processors)
Multiprocessor Scheduling and Synchronization
Priorities + blocking synchronization may result in:

priority inversion: a low-priority process P holds a lock, a high-priority process waits, and medium-priority processes keep P from completing and releasing the lock quickly (scheduling less efficient). To cope with/avoid this:
– use priority inheritance
– use non-blocking synchronization (wait-free, lock-free, optimistic synchronization)

convoy effect: processes need a resource only for a short time, yet the process holding it may block them for a long time (hence, poor utilization)
– non-blocking synchronization is good here, too
Readers-Writers: non-blocking synchronization
(some slides are adapted from J. Anderson’s slides on same topic)
The Mutual Exclusion Problem: Locking Synchronization

• N processes, each with this structure:

  while true do
    Noncritical Section;
    Entry Section;
    Critical Section;
    Exit Section
  od

• Basic Requirements:
  – Exclusion: Invariant (# of processes in CS ≤ 1).
  – Starvation-freedom: (process i in Entry) leads-to (process i in CS).

• Can implement by "busy waiting" (spin locks) or using kernel calls.
Non-blocking Synchronization
• The problem:
  – Implement a shared object without mutual exclusion.
    • Shared Object: a data structure (e.g., a queue) shared by concurrent processes.
  – Why?
    • To avoid performance problems that result when a lock-holding task is delayed.
    • To avoid priority inversions (more on this later).
Non-blocking Synchronization
• Two variants:
  – Lock-free:
    • Only system-wide progress is guaranteed.
    • Usually implemented using "retry loops."
  – Wait-free:
    • Individual progress is guaranteed.
    • Code for object invocations is purely sequential.
Readers/Writers Problem
[Courtois, et al. 1971.]
• Similar to mutual exclusion, but several readers can execute the critical section at once.
• If a writer is in its critical section, then no other process can be in its critical section.
• Plus: no starvation, fairness.
Solution 1
Readers have “priority”…
Reader::
  P(mutex);
    rc := rc + 1;
    if rc = 1 then P(w) fi;
  V(mutex);
  CS;
  P(mutex);
    rc := rc - 1;
    if rc = 0 then V(w) fi;
  V(mutex)

Writer::
  P(w);
  CS;
  V(w)

"First" reader executes P(w). "Last" one executes V(w).
Solution 2

Writers have "priority": readers should not build a long queue on r, so that writers can overtake => mutex3

Reader::
  P(mutex3);
    P(r);
      P(mutex1);
        rc := rc + 1;
        if rc = 1 then P(w) fi;
      V(mutex1);
    V(r);
  V(mutex3);
  CS;
  P(mutex1);
    rc := rc - 1;
    if rc = 0 then V(w) fi;
  V(mutex1)

Writer::
  P(mutex2);
    wc := wc + 1;
    if wc = 1 then P(r) fi;
  V(mutex2);
  P(w);
  CS;
  V(w);
  P(mutex2);
    wc := wc - 1;
    if wc = 0 then V(r) fi;
  V(mutex2)
Properties
• If several writers try to enter their critical sections, one will execute P(r), blocking readers.
• Works assuming V(r) has the effect of picking a process waiting to execute P(r) to proceed.
• Due to mutex3, if a reader executes V(r) and a writer is at P(r), then the writer is picked to proceed.
Concurrent Reading and Writing[Lamport ‘77]
• Previous solutions to the readers/writers problem use some form of mutual exclusion.
• Lamport considers solutions in which readers and writers access a shared object concurrently.
• Motivation:– Don’t want writers to wait for readers.– Readers/writers solution may be needed to
implement mutual exclusion (circularity problem).
Interesting Factoids
• This is the first ever lock-free algorithm: guarantees consistency without locks
• An algorithm very similar to this is implemented within an embedded controller in Mercedes automobiles!!
The Problem
• Let v be a data item, consisting of one or more digits.
  – For example, v = 256 consists of three digits, "2", "5", and "6".
• Underlying model: Digits can be read and written atomically.
• Objective: Simulate atomic reads and writes of the data item v.
Preliminaries
• Definition: v[i], where i ≥ 0, denotes the i-th value written to v. (v[0] is v's initial value.)

• Note: no concurrent writing of v.

• Partitioning of v: v = v1 … vm.
  – Each vi may consist of multiple digits.

• To read v: read each vi (in some order).

• To write v: write each vi (in some order).
More Preliminaries
(figure: a read r of v1, v2, …, vm proceeding while writes k, k+i, …, l are in progress)

We say: r reads v[k,l].

The value read is consistent if k = l.
Theorem 1
If v is always written from right to left, then a read from left to right obtains a value

  v1[k1,l1] v2[k2,l2] … vm[km,lm]

where k1 ≤ l1 ≤ k2 ≤ l2 ≤ … ≤ km ≤ lm.
Example: v = v1v2v3 = d1d2d3. Writes 0, 1, and 2 each write right to left (wd3, wd2, wd1), overlapping a read that proceeds left to right (read v1 = d1, then v2 = d2, then v3 = d3).

The read reads v1[0,0] v2[1,1] v3[2,2].
Another Example
v = v1 v2, with v1 = d1d2 and v2 = d3d4. Writes 0, 1, and 2 each write right to left, overlapping a read of v1 (rd1, rd2) and then v2 (rd4, rd3).

The read reads v1[0,1] v2[1,2].
Theorem 2
Assume that i ≤ j implies that v[i] ≤ v[j], where v = d1 … dm.

(a) If v is always written from right to left, then a read from left to right obtains a value v[k,l] ≤ v[l].

(b) If v is always written from left to right, then a read from right to left obtains a value v[k,l] ≥ v[k].
Example of (a)
v = d1d2d3. Write 0 stores 398, write 1 stores 399, and write 2 stores 400, each writing right to left (wd3, wd2, wd1). A left-to-right read (d1, then d2, then d3) overlaps them, picking up d1 = 3, d2 = 9, d3 = 0.

Read obtains v[0,2] = 390 < 400 = v[2].
Example of (b)
v = d1d2d3. Write 0 stores 398, write 1 stores 399, and write 2 stores 400, each writing left to right (wd1, wd2, wd3). A right-to-left read (d3, then d2, then d1) overlaps them, picking up d3 = 8, d2 = 9, d1 = 4.

Read obtains v[0,2] = 498 > 398 = v[0].
Readers/Writers Solution
Writer::
  V1 :> V1;
  write D;
  V2 := V1

Reader::
  repeat
    temp := V2;
    read D
  until V1 = temp

:> means "assign larger value".
V1 means "left to right".
V2 means "right to left".
Proof Obligation
• Assume a reader reads V2[k1,l1] D[k2,l2] V1[k3,l3].

• Proof Obligation: V2[k1,l1] = V1[k3,l3] ⇒ k2 = l2.
Proof
By Theorem 2,

  V2[k1,l1] ≤ V2[l1] and V1[k3] ≤ V1[k3,l3].   (1)

Applying Theorem 1 to V2 D V1,

  k1 ≤ l1 ≤ k2 ≤ l2 ≤ k3 ≤ l3.   (2)

By the writer program,

  l1 ≤ k3 ⇒ V2[l1] ≤ V1[k3].   (3)

(1), (2), and (3) imply

  V2[k1,l1] ≤ V2[l1] ≤ V1[k3] ≤ V1[k3,l3].

Hence, V2[k1,l1] = V1[k3,l3]
  ⇒ V2[l1] = V1[k3]
  ⇒ l1 = k3, by the writer's program
  ⇒ k2 = l2, by (2).
Supplemental Reading
• Check:
  – G.L. Peterson, "Concurrent Reading While Writing", ACM TOPLAS, Vol. 5, No. 1, 1983, pp. 46-55.
    • Solves the same problem in a wait-free manner:
      – guarantees consistency without locks, and
      – the unbounded reader loop is eliminated.
    • First paper on wait-free synchronization.

• Now a very rich literature on the topic. Check also:
  – PhD thesis of A. Gidenstam, 2006, CTH
  – PhD thesis of H. Sundell, 2005, CTH
Useful Synchronization Primitives (Usually Necessary in Nonblocking Algorithms)

CAS(var, old, new)
  if var ≠ old then return false fi;
  var := new;
  return true

LL(var)
  establish "link" to var;
  return var

SC(var, val)
  if "link" to var still exists then
    break all current links of all processes;
    var := val;
    return true
  else return false fi

CAS2 extends this to two words.
Another Lock-free Example: Shared Queue

type Qtype = record v: valtype; next: pointer to Qtype end
shared var Tail: pointer to Qtype;
local var old, new: pointer to Qtype

procedure Enqueue(input: valtype)
  new := (input, NIL);
  repeat
    old := Tail
  until CAS2(Tail, old->next, old, NIL, new, new)   -- retry loop

(figure: old points at the node Tail references; after the CAS2 succeeds, new is linked behind old and Tail points to new)
Using Locks in Real-time Systems: The Priority Inversion Problem

Uncontrolled use of locks in RT systems can result in unbounded blocking due to priority inversions.
(figure: timeline of Low-, Med-, and High-priority tasks; Low acquires a shared object at t0, High blocks on it at t1, and Med's computation keeps Low from releasing it, so the inversion is unbounded)

Solution: Limit priority inversions by modifying task priorities.

(figure: the same scenario with modified task priorities; the inversion is now limited to Low's short object access. Legend: shared-object access, priority inversion, computation not involving object accesses)
Dealing with Priority Inversions
• Common Approach: use lock-based schemes that bound their duration (as shown).
  – Examples: priority-inheritance and priority-ceiling protocols.
  – Disadvantages: require kernel support; very inefficient on multiprocessors.

• Alternative: use non-blocking objects.
  – No priority inversions, no kernel support needed.
  – Wait-free algorithms are clearly applicable here.
  – What about lock-free algorithms?
    • Advantage: usually simpler than wait-free algorithms.
    • Disadvantage: access times are potentially unbounded.
    • But for periodic task sets, access times are also predictable! (check further-reading pointers)