
Page 1

Multiprocessors and Multi-computers

• Multi-computers
  – Distributed address space accessible by local processors
  – Requires message passing
  – Programming tends to be more difficult

• Multiprocessors
  – Single address space accessible by all processors
  – Simultaneous access to shared variables can produce inconsistent results
  – Generally programming is more convenient
  – Doesn't scale to more than about sixteen processors

Page 2

Shared Memory Hardware

[Figure: Processors connected to memory modules in a bus configuration and in a crossbar switch configuration]

Page 3

Shared Memory Access

• Critical Section
  – A section of code that needs to be protected from simultaneous access

• Mutual Exclusion
  – The mechanism used to enforce a critical section
  – Locks
  – Semaphores
  – Monitors
  – Condition Variables

[Figure: Processes 1 and 2 simultaneously writing x=1 and x=2 to a shared variable x]

Page 4

Locks

• Single-bit variable
  – 1 = locked, 0 = unlocked
  – "Enter the door and lock the door at entry"

• Spin locks (busy-wait locks)

    while (lock == 1) ;   // spin
    lock = 1;
    // Critical section
    lock = 0;

• Advantages
  – Simple and easy to understand

• Disadvantages
  – Poor use of the CPU if the process does not block while waiting
  – It's easy to skip the lock = 0 statement

• Examples: Pthreads, OpenMP

Locks are the simplest mutual exclusion mechanism

Note: The while test and the setting of lock must be atomic
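To make the atomicity requirement concrete, here is a minimal C11 sketch of a spin lock built on an atomic test-and-set flag (the names enter and leave are illustrative, not from the slides):

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    /* spin until the flag was previously clear; the test and the set happen atomically */
    void enter(void)
    {
        while (atomic_flag_test_and_set(&lock))
            ;   /* busy wait */
    }

    void leave(void)
    {
        atomic_flag_clear(&lock);   /* open the lock */
    }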

Page 5

Semaphores

• Allow limited concurrent access to a critical section
• An integer variable, s, controls the mechanism
• Operations
  – P operation ("passeren", Dutch for "to pass")

      s--;
      while (s < 0) wait();
      // Critical section code

  – V operation ("vrijgeven", Dutch for "to release")

      s++;
      if (s <= 0) unblock a waiting process;

  – Protecting a critical section:

      p(s);
      // Critical section
      v(s);

• Notes
  – Set s = 1 initially for a binary semaphore, which acts like a lock
  – Set s = k > 1 initially if k simultaneous entries are possible
  – Set s = k <= 0 for consumer processes waiting to consume produced data

• Disadvantage
  – Easy to skip the v operation

• Example: UNIX
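For the UNIX example, a minimal sketch using the POSIX semaphore routines sem_init, sem_wait (the P operation), and sem_post (the V operation); the initial value 1 makes it a binary semaphore that behaves like a lock:

    #include <semaphore.h>

    static sem_t s;

    void worker(void)
    {
        sem_wait(&s);        /* P: decrement, block if the count would go negative */
        /* critical section */
        sem_post(&s);        /* V: increment, wake a waiting process if any */
    }

    int main(void)
    {
        sem_init(&s, 0, 1);  /* initial value 1 => binary semaphore */
        worker();
        sem_destroy(&s);
        return 0;
    }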

Page 6

Monitors

• A class mechanism that limits access to a shared resource

    public class doIt {
        public doIt() { /* constructor logic */ }

        public synchronized void critMethod() throws InterruptedException {
            wait();     // wait until another thread signals
            notify();
        }
    }

• Advantage: The most natural mutual exclusion mechanism
• Disadvantage: Requires a language that supports the construct
• Examples: Java, Ada, Modula-2

Page 7

Condition Variables

• Advantages
  – Reduce the overhead associated with checking global conditions
  – Avoid having to frequently "poll" the global variable

• Disadvantage: It's easy to skip the unlock operations

• Example: Pthreads

• Notes
  – wait() unlocks the mutex automatically and re-locks it before returning
  – Threads must already be waiting for a signal when it is issued; otherwise the signal is lost

Example

• Thread 1

    lock(mutex)
    while (c != VALUE) wait(condVar, mutex)
    // Critical section
    unlock(mutex)

• Thread 2

    if (c == VALUE)
        signal(condVar)

A mechanism to guarantee that a global condition holds before critical section entry

Page 8

Deadlock

• Necessary Conditions
  – Circular wait
  – Limited resources
  – Non-preemptive resources
  – Hold and wait

[Figure: Deadly embrace: a cycle of processes P1…Pn, each holding one resource R1…Rn and requesting the next; a two-process deadlock between P1 and P2 over R1 and R2]

Resources needed by a process can never be allocated
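A minimal Pthreads sketch of the two-process deadlock in the figure (the names p1, p2, r1, and r2 are illustrative): each thread holds one lock and requests the other, satisfying hold-and-wait and circular wait, so neither can proceed.

    #include <pthread.h>

    static pthread_mutex_t r1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t r2 = PTHREAD_MUTEX_INITIALIZER;

    static void *p1(void *arg)
    {
        pthread_mutex_lock(&r1);
        pthread_mutex_lock(&r2);   /* blocks forever if p2 already holds r2 */
        pthread_mutex_unlock(&r2);
        pthread_mutex_unlock(&r1);
        return NULL;
    }

    static void *p2(void *arg)
    {
        pthread_mutex_lock(&r2);
        pthread_mutex_lock(&r1);   /* blocks forever if p1 already holds r1 */
        pthread_mutex_unlock(&r1);
        pthread_mutex_unlock(&r2);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);    /* may never return: the threads are deadlocked */
        pthread_join(t2, NULL);
        return 0;
    }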

Page 9

Cache Coherence

• Cache Coherence Protocols
  – Update: All caches are immediately updated with the altered data
  – Invalidate: The altered data is invalidated in all other caches; updates take place only if the data is subsequently referenced

• False Sharing: Cache updates take place because multiple processors access the same cache block but not the same locations

[Figure: Processors 1 and 2 each cache the same block from memory; the block holds both x and y, illustrating false sharing]

Page 10

Multiprocessor Cache Considerations

• Cache coherence in multiprocessor systems
  – Temporal locality is reduced
  – Smart compiler optimizations require knowledge of the cache to alter data layout
  – Optimizations might require special programming
  – Multiple caches exist instead of one

• Shared memory distributed systems
  – NUMA (SGI Origin), COMA (KSR)
  – Hardware or software cache coherence algorithms
  – The performance of software coherence can be inadequate

Page 11

Shared Memory Programming Alternatives

• Heavyweight processes
• Thread programming standards
  – Pthreads
  – Java Threads
• Programming language designed for parallel processing (Ada)
• Library routines used with an existing language
• Modified syntax of an existing language (HP Fortran)
• Compiler directives to specify parallel programs (OpenMP)

Page 12

Heavyweight Processes

• The operating system switches context from one process to another
• Advantage: A blocked process doesn't block the other processes
• Not popular because the overhead is significant
  – Separate process control blocks

UNIX example:

    pid = fork();
    if (pid == 0) {
        // slave (child) code
    } else {
        // parent code
    }
    if (pid == 0)
        exit(0);
    else
        wait(0);

[Figure: Fork-join tree: a root process forks child processes, which may fork further processes; each FORK is eventually matched by a JOIN]

Page 13

Using Heavyweight Processes

• Create a new process

    pid = fork();

• Wait for any child process to complete

    pid = wait(&status);

• Wait for a specific process to complete

    pid = waitpid(pid, &status, options);   // options = 0 for a blocking call

• Exit a process

    exit(status);

• Attach shared memory

    id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    addr = shmat(id, NULL, 0);
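Putting these calls together, a minimal compilable sketch (the variable names and the 0600 permission bits are illustrative) in which a child process writes into a System V shared memory segment and the parent reads the value after wait():

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>

    int main(void)
    {
        /* create and attach a shared integer */
        int id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
        int *shared = (int *) shmat(id, NULL, 0);
        *shared = 0;

        pid_t pid = fork();
        if (pid == 0) {              /* child: write to the shared segment */
            *shared = 42;
            exit(0);
        }
        wait(NULL);                  /* parent: wait for the child to finish */
        printf("child wrote %d\n", *shared);

        shmdt(shared);               /* detach and remove the segment */
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }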

Page 14

Processes and Threads

[Figure: A single-thread process has code, resources, listeners, a heap, and one stack and instruction pointer (IP); a dual-thread process shares code, resources, listeners, and heap, but each thread has its own stack and IP]

Notes:
• Threads can be three orders of magnitude faster than processes
• Thread-safe library routines can be used by multiple concurrent threads
• Synchronization uses shared variables

Page 15

Pthreads – Create and Join

• Spawn an attached thread

    pthread_create(&thread1, NULL, proc1, &arg);
    . . .
    pthread_join(thread1, &status);

• Thread execution

    void *proc1(void *arg)
    {
        // Thread code
        return result;   // pointer delivered to pthread_join's status argument
    }

• Detached threads
  – Join is not needed
  – The OS destroys thread resources when they terminate
  – A parameter in the create call indicates a detached thread

IEEE Portable Operating System Interface Standard (POSIX)

Note: The Pthreads library must be available
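A complete, compilable sketch of the create/join pattern above (compile with -lpthread; the argument passing is illustrative):

    #include <pthread.h>
    #include <stdio.h>

    /* start routine: receives the pointer passed as the last argument of pthread_create */
    void *proc1(void *arg)
    {
        int *value = (int *) arg;
        printf("thread received %d\n", *value);
        return NULL;
    }

    int main(void)
    {
        pthread_t thread1;
        int arg = 7;

        pthread_create(&thread1, NULL, proc1, &arg);  /* spawn an attached thread */
        pthread_join(thread1, NULL);                  /* wait for it to terminate */
        return 0;
    }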

Page 16

Pthread Locks

• Mutex variables
  – Wait until the lock opens before proceeding
  – Thread selection when the lock opens is non-deterministic
  – Only the thread owning the lock can unlock it

• Example: declaring and using a Pthread lock

    pthread_mutex_t mutex;
    pthread_mutex_init(&mutex, NULL);
    pthread_mutex_lock(&mutex);
    // Critical section code
    pthread_mutex_unlock(&mutex);

• Example: test whether a lock is open
  – Returns EBUSY if already locked or 0 if open

    pthread_mutex_trylock(&mutex);
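A short sketch of how the trylock call is typically used, so the thread can do other work instead of blocking (the function name try_enter is illustrative):

    #include <pthread.h>

    void try_enter(pthread_mutex_t *mutex)
    {
        if (pthread_mutex_trylock(mutex) == 0) {
            // lock acquired: critical section code
            pthread_mutex_unlock(mutex);
        } else {
            // EBUSY: the lock is held by another thread; do other work and retry later
        }
    }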

Page 17

Pthread Condition Variables

• Declarations

    pthread_cond_t cond1;
    pthread_mutex_t mutex1;
    int c;   // shared counter guarded by mutex1

• Thread 1

    pthread_mutex_lock(&mutex1);
    while (c != 0)
        pthread_cond_wait(&cond1, &mutex1);
    pthread_mutex_unlock(&mutex1);
    // Code to take action

• Thread 2

    pthread_mutex_lock(&mutex1);
    c--;
    if (c == 0)
        pthread_cond_signal(&cond1);
    pthread_mutex_unlock(&mutex1);

Example Code
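A complete, compilable version of this pattern (the initial counter value of 3 and the thread bodies are illustrative): one thread waits until the counter reaches zero while worker threads decrement it, and the last decrement signals the condition variable.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_cond_t  cond1  = PTHREAD_COND_INITIALIZER;
    static pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
    static int c = 3;                                 /* shared counter */

    /* waits until the counter has been decremented to zero */
    static void *waiter(void *arg)
    {
        pthread_mutex_lock(&mutex1);
        while (c != 0)
            pthread_cond_wait(&cond1, &mutex1);       /* releases mutex1 while waiting */
        pthread_mutex_unlock(&mutex1);
        printf("condition reached\n");
        return NULL;
    }

    /* decrements the counter and signals when it reaches zero */
    static void *worker(void *arg)
    {
        pthread_mutex_lock(&mutex1);
        c--;
        if (c == 0)
            pthread_cond_signal(&cond1);
        pthread_mutex_unlock(&mutex1);
        return NULL;
    }

    int main(void)
    {
        pthread_t w, t[3];
        pthread_create(&w, NULL, waiter, NULL);
        for (int i = 0; i < 3; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        pthread_join(w, NULL);
        for (int i = 0; i < 3; i++)
            pthread_join(t[i], NULL);
        return 0;
    }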

Page 18

Java Threads

• Instantiate and run a thread

    ThreadClass t = new ThreadClass();
    t.start();

• Thread class

    class ThreadClass extends Thread {
        public ThreadClass() { /* constructor */ }

        public void run() {
            while (true) {
                // yield or sleep periodically
                // thread code entered here
            }
        }
    }

Page 19

Modify Existing Language Syntax

• Declaration of a shared memory variable

    shared int x;

• Specify statements to execute concurrently

    par { s1(); s2(); s3(); ... sn(); }

• Iterations assigned to different processors

    forall (i = 0; i < n; i++) { /* code */ }

• Examples: High Performance Fortran and C

Example Constructs

Page 20

Smart Compilers

• Perform Bernstein's Dependency Analysis
  – Ii is the set of memory locations read by a block of code i
  – Oi is the set of memory locations written by a block of code i
  – If Ii ∩ Oj = ∅ and Oi ∩ Oj = ∅ for i ≠ j, the blocks can execute in parallel
  – Question: consider the instructions a = x + y and b = x + z
    • Can these instructions execute in parallel?
    • What are I1, I2, O1, and O2?
    • What are I1 ∩ O2, I2 ∩ O1, and O1 ∩ O2?
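One worked answer, for reference: I1 = {x, y}, O1 = {a}, I2 = {x, z}, O2 = {b}, so I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅; Bernstein's conditions hold and the two instructions can execute in parallel.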

• Other optimizations
  – Relocate data to optimize cache utilization
  – Reorder instructions to optimize register use
  – Loop decomposition example (the dependency distance of two allows the even and odd iterations to run as independent loops):

    Before:  for (i = 3; i <= 20; i++)    a[i] = a[i-2] + 4;

    After:   for (i = 3; i <= 20; i += 2) a[i] = a[i-2] + 4;
             for (i = 4; i <= 20; i += 2) a[i] = a[i-2] + 4;

Page 21

Compiler Directives (OpenMP)

• Consists of compiler directives and library routines
• A recognized industry standard developed in the late 1990s
• Designed for shared memory programming
• Interfaces to C/C++, Fortran, and Java (JOMP)
• Uses a fork-join model based on threads
• Parallel sections of code execute using "teams of threads"
• OpenMP statements have the form
  – C: #pragma omp <directive>
  – JOMP: //omp <directive>

Page 22

OMP Parallel Directive

• Description
  – All threads execute the structured block
  – There is an implicit barrier at the end of the block

    #pragma omp parallel
    {
        // code here
    }

• Example

    #pragma omp parallel private(x, num_threads)
    {
        x = omp_get_thread_num();
        num_threads = omp_get_num_threads();
        a[x] = 10 * num_threads;
    }

• Notes
  – omp_get_thread_num() returns the rank of the current thread
  – omp_get_num_threads() returns the total number of threads in the team
  – a[] is global; x and num_threads are private (local) to each thread
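A complete, compilable version of the example (compile with an OpenMP flag such as gcc -fopenmp; the team size of 4 and the array bound of 64 are illustrative):

    #include <stdio.h>
    #include <omp.h>

    #define MAX_THREADS 64                       /* assumed upper bound on the team size */

    int main(void)
    {
        int a[MAX_THREADS] = {0};

        #pragma omp parallel num_threads(4)
        {
            int x = omp_get_thread_num();            /* rank of this thread */
            int num_threads = omp_get_num_threads(); /* size of the team */
            a[x] = 10 * num_threads;                 /* each thread writes its own element */
        }

        for (int i = 0; i < 4; i++)
            printf("a[%d] = %d\n", i, a[i]);
        return 0;
    }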

Page 23

OMP: Establish Thread Team Size

• Techniques that programs can use
  – Specify in the parallel directive: #pragma omp parallel num_threads(value)
  – Call the library routine: omp_set_num_threads(value)
  – Set the environment variable: OMP_NUM_THREADS
  – A system-dependent default is used if none is specified

• Notes
  – The num_threads clause sets the team size for that directive only
  – The other approaches set a value that remains in effect until it is changed
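A short sketch of the first two techniques (the values 4 and 8 are arbitrary):

    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(4);               /* library routine: applies until changed */

        #pragma omp parallel                  /* this region runs with a team of 4 */
        { /* ... */ }

        #pragma omp parallel num_threads(8)   /* clause: overrides the size for this region only */
        { /* ... */ }

        /* the environment variable, e.g. "export OMP_NUM_THREADS=4" in the shell,
           supplies the default when neither of the above is used */
        return 0;
    }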

Page 24

OMP: Section and For Constructs

• Parallel sections (a team of threads shares the structured blocks)
  – Syntax:

    #pragma omp sections
    {
        #pragma omp section
        { /* block of code */ }
        #pragma omp section
        { /* block of code */ }
        . . .
    }

  – Note: There is an implied barrier at the end of the construct unless nowait appears on the pragma line

• Parallel for loop
  – Syntax:

    #pragma omp for
    for (i = 0; i < MAX; i++) { /* block of code */ }

  – Note: There is an implied barrier at the end of the construct unless nowait appears on the pragma line
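A minimal, compilable example of the parallel for construct (the array size MAX = 8 is illustrative); the iterations are divided among the team of threads:

    #include <stdio.h>
    #include <omp.h>

    #define MAX 8

    int main(void)
    {
        int a[MAX];

        #pragma omp parallel
        {
            #pragma omp for          /* iterations are shared among the team */
            for (int i = 0; i < MAX; i++)
                a[i] = i * i;
        }                            /* implied barrier at the end of the for construct */

        for (int i = 0; i < MAX; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }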

Page 25

OMP: Sequential within Parallel Blocks

• Single: the block is executed by one of the threads
  – Syntax: #pragma omp single { /* code */ }
  – Note: There is an implied barrier at the end of the construct unless nowait appears on the pragma line

• Master: the block is executed by the master thread
  – Syntax: #pragma omp master { /* code */ }
  – Note: There is no implied barrier in this construct

Page 26

OMP: Critical Sections and Synchronization

• Critical sections (one thread at a time)
  – Syntax: #pragma omp critical(name) { /* code */ }
  – Notes
    • The name ties together all critical sections with that name
    • The system uses a global default name if none is specified

• Barrier
  – Syntax: #pragma omp barrier
  – Note: All threads must be able to reach the barrier

• Atomic expression
  – Syntax: #pragma omp atomic, followed by the expression statement

Note: Minimize critical sections, or waiting processors execute serially
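A small, compilable sketch that uses all three constructs (the names update_sum, sum, and count are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int sum = 0, count = 0;

        #pragma omp parallel
        {
            /* one thread at a time executes critical sections with this name */
            #pragma omp critical(update_sum)
            {
                sum += omp_get_thread_num();
            }

            /* atomic protects a single update more cheaply than a critical section */
            #pragma omp atomic
            count++;

            /* every thread must reach the barrier before any thread continues */
            #pragma omp barrier
        }

        printf("sum = %d, count = %d\n", sum, count);
        return 0;
    }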

Page 27

Example Program (summing 1000 numbers)

• Heavyweight UNIX processes (Section 8.7.1)
  – Concurrency: fork, exit, and wait
  – Shared memory: shmget, shmat, shmctl
  – Data sharing: semget, semctl, and semop to create P and V functions

• Lightweight threads (Section 8.7.2)
  – Concurrency: pthread_create and pthread_join
  – Shared memory: automatic
  – Data sharing: pthread_mutex_init, pthread_mutex_lock, and pthread_mutex_unlock

• Java (Section 8.7.3)
  – Concurrency: the Thread extension class
  – Shared memory: automatic
  – Data sharing: synchronized methods provide monitor capability
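As an illustration of the lightweight-threads variant, a minimal sketch (the thread count of 4, the partial_sum routine, and the sample data are illustrative, not the textbook's code) that sums 1000 numbers with pthread_create/pthread_join and a mutex-protected global sum:

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000
    #define NTHREADS 4

    static int data[N];
    static long sum = 0;
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

    /* each thread sums one contiguous slice, then adds its result to the global sum */
    static void *partial_sum(void *arg)
    {
        long id = (long) arg;
        long local = 0;
        for (int i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
            local += data[i];

        pthread_mutex_lock(&sum_lock);
        sum += local;
        pthread_mutex_unlock(&sum_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];

        for (int i = 0; i < N; i++)
            data[i] = 1;                         /* sample data: the sum should be N */

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, partial_sum, (void *) t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);

        printf("sum = %ld\n", sum);
        return 0;
    }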