Communication Models for Parallel Computer Architectures


Page 1: Communication Models for Parallel Computer Architectures

Two distinct models have been proposed for how CPUs in a parallel computer system should communicate.

– In the first model, all CPUs share a common physical memory.
• This kind of system is called a multiprocessor or shared memory system.

– In the second design, each CPU has its own private memory.
• Such a design is called a multicomputer or distributed memory system.

Page 2: Multiprocessors

– Consider a program to find all of the objects in a bit-map image.
• One copy of the image is kept in memory.
• Each CPU runs a single process which inspects one section of the image.
• Some objects occupy multiple sections, so it is essential that each process have access to the entire image (a short sketch of this scheme follows the machine list below).

– Example multiprocessors include:
• Sun Enterprise 10000
• Sequent NUMA-Q
• SGI Origin 2000
• HP/Convex Exemplar
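The following is a minimal pthreads sketch of the shared-memory scheme just described: a single copy of the image is shared by all threads, each thread (standing in for a CPU) scans its own band of rows, and any thread can read any pixel when an object spills across a band boundary. The image dimensions, thread count, and the "nonzero pixel belongs to an object" test are illustrative assumptions, not part of the original example.

    /* Shared-memory scan of a bit-map image: one shared copy, one thread
     * per "CPU". Sizes and the object test are illustrative assumptions. */
    #include <pthread.h>
    #include <stdio.h>

    #define WIDTH  512
    #define HEIGHT 512
    #define NCPUS  4

    static unsigned char image[HEIGHT][WIDTH];  /* the single shared copy of the image */
    static int hits[NCPUS];                     /* per-thread count of object pixels   */

    static void *scan_section(void *arg)
    {
        long id = (long)arg;
        int rows = HEIGHT / NCPUS;
        int start = (int)id * rows;

        for (int y = start; y < start + rows; y++)
            for (int x = 0; x < WIDTH; x++)
                if (image[y][x] != 0)           /* every pixel is reachable if an object spills over */
                    hits[id]++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NCPUS];

        for (long i = 0; i < NCPUS; i++)
            pthread_create(&t[i], NULL, scan_section, (void *)i);
        for (int i = 0; i < NCPUS; i++)
            pthread_join(t[i], NULL);

        int total = 0;
        for (int i = 0; i < NCPUS; i++)
            total += hits[i];
        printf("object pixels found: %d\n", total);
        return 0;
    }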

Page 3: Multiprocessors

Page 4: Multicomputers

– In a multicomputer solving the same problem, each CPU has a section of the image in its local memory.

• If the CPUs need to follow an object across the border, they must request the information from a neighboring CPU.

– This is done via message passing; a short sketch of such an exchange follows below.

– Programming multicomputers is more difficult than programming multiprocessors, but multicomputers are more scalable.

• Building a multicomputer with 10,000 CPUs is straightforward.
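The sketch below illustrates the border exchange just described using MPI point-to-point messages: each rank owns one horizontal strip of the image in its private memory and must exchange border rows with its neighbors explicitly. The strip layout, row width, and use of MPI_Sendrecv are illustrative assumptions; the point is that crossing a section boundary takes a message, not a memory reference.

    /* Each rank owns one horizontal strip of the image in its private memory;
     * to follow an object across a strip boundary it exchanges border rows
     * with its neighbors by message passing. Illustrative sketch only. */
    #include <mpi.h>
    #include <string.h>

    #define WIDTH 1024

    int main(int argc, char **argv)
    {
        int rank, size;
        unsigned char top_row[WIDTH], bottom_row[WIDTH];      /* my border rows         */
        unsigned char from_above[WIDTH], from_below[WIDTH];   /* neighbors' border rows */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        memset(top_row, 0, WIDTH);        /* stand-ins for real pixel data */
        memset(bottom_row, 0, WIDTH);

        if (rank + 1 < size)              /* exchange with the strip below */
            MPI_Sendrecv(bottom_row, WIDTH, MPI_UNSIGNED_CHAR, rank + 1, 0,
                         from_below, WIDTH, MPI_UNSIGNED_CHAR, rank + 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank - 1 >= 0)                /* exchange with the strip above */
            MPI_Sendrecv(top_row, WIDTH, MPI_UNSIGNED_CHAR, rank - 1, 0,
                         from_above, WIDTH, MPI_UNSIGNED_CHAR, rank - 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched with mpirun, each rank holds only its own strip, which matches the private-memory model above.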

Page 5: Multicomputers

Page 6: Multicomputers

– Example multicomputers include:
• IBM SP/2
• Intel/Sandia Option Red
• Wisconsin COW

– Much research focuses on hybrid systems combining the best of both worlds.

– Shared memory might be implemented at a higher level than the hardware.

• The operating system might simulate a shared memory by providing a single system-wide paged shared address space.

– This approach is called DSM (Distributed Shared Memory).

Page 7: Shared Memory

Page 8: Shared Memory

• Each machine has its own virtual memory and its own page table.

• When a CPU does a LOAD or STORE on a page it does not have, a trap to the OS occurs.

• The OS locates the page and asks the CPU currently holding it to unmap the page and send it over the interconnection network.

• When it arrives, the page is mapped in and the faulting instruction restarted (a minimal sketch of this mechanism follows).
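Below is a minimal single-machine sketch of that fault-driven mechanism, using mprotect and a SIGSEGV handler in place of real hardware traps and a real network transfer; the fetched page contents are simply zero-filled here, where a real DSM would receive them from the owning node.

    /* DSM-style page fault handling, simulated in one process: touching an
     * unmapped page traps, the handler "fetches" and maps the page, and the
     * faulting instruction is restarted. The remote fetch is an assumption
     * replaced by a local zero fill. */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static long pagesz;

    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        /* Round the faulting address down to its page boundary. */
        char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
        /* A real DSM would ask the current owner to unmap the page and send it. */
        mprotect(page, pagesz, PROT_READ | PROT_WRITE);
        memset(page, 0, pagesz);   /* stand-in for the received page contents */
    }

    int main(void)
    {
        pagesz = sysconf(_SC_PAGESIZE);
        char *region = mmap(NULL, 4 * pagesz, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa = {0};
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = fault_handler;
        sigaction(SIGSEGV, &sa, NULL);

        region[0] = 42;            /* traps, page is mapped in, store is restarted */
        printf("read back %d\n", region[0]);
        return 0;
    }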

– A third possibility is to have a user-level runtime system implement a form of shared memory.

Page 9: Shared Memory

– The programming language provides a shared memory abstraction implemented by the compiler and runtime system.

• The Linda model is based on the abstraction of a shared space of tuples.

– Processes can input a tuple from the shared tuple space or output a tuple to the shared tuple space (a toy sketch follows).
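Below is a toy, single-process illustration of the tuple-space idea: out() deposits a (key, value) tuple and in() withdraws a matching one, blocking until one is available. Real Linda systems provide these as language-level operations over a distributed tuple space; the fixed-size array and the tuple shape here are assumptions made for brevity.

    /* Toy tuple space illustrating the Linda idea: out() deposits a tuple,
     * in() withdraws a matching one. Single process, fixed size; real Linda
     * systems provide these as language-level operations. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_TUPLES 64

    struct tuple { char key[32]; int value; int used; };

    static struct tuple space[MAX_TUPLES];                    /* the shared tuple space */
    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  avail = PTHREAD_COND_INITIALIZER;

    void out(const char *key, int value)        /* output a tuple to the tuple space */
    {
        pthread_mutex_lock(&lock);
        for (int i = 0; i < MAX_TUPLES; i++)
            if (!space[i].used) {
                strncpy(space[i].key, key, sizeof space[i].key - 1);
                space[i].value = value;
                space[i].used  = 1;
                break;
            }
        pthread_cond_broadcast(&avail);
        pthread_mutex_unlock(&lock);
    }

    int in(const char *key)                     /* input (withdraw) a matching tuple, blocking */
    {
        pthread_mutex_lock(&lock);
        for (;;) {
            for (int i = 0; i < MAX_TUPLES; i++)
                if (space[i].used && strcmp(space[i].key, key) == 0) {
                    int v = space[i].value;
                    space[i].used = 0;
                    pthread_mutex_unlock(&lock);
                    return v;
                }
            pthread_cond_wait(&avail, &lock);   /* wait until some process deposits a tuple */
        }
    }

    int main(void)
    {
        out("answer", 42);
        printf("in(\"answer\") -> %d\n", in("answer"));
        return 0;
    }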

• The Orca model allows shared generic objects.
– Processes can execute object-specific methods on shared objects.
– When a change occurs to the internal state of some object, it is up to the runtime system to simultaneously update all copies of the object.

Page 10: Interconnection Networks

Multicomputers are held together by interconnection networks which move packets between CPUs and memory.

– The CPUs and memory modules of multiprocessors are also interconnected.

– Interconnection networks consist of:
• CPUs
• Memory modules
• Interfaces
• Links
• Switches

Page 11: Interconnection Networks

– The links are the physical channels over which bits move. They can be

• electrical wires or optical fiber

• serial or parallel

• simplex, half-duplex, or full duplex

– The switches are devices with multiple input ports and multiple output ports.

• When a packet arrives at an input port on a switch, some bits are used to select the output port to which the packet is sent.
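A tiny sketch of that selection step is shown below: a few bits of the packet header index one of the switch's output ports. The 2-bit field position is an assumption chosen for illustration, not any particular machine's packet format.

    /* Output-port selection in a 4-port switch: a few header bits name the
     * outgoing port. The 2-bit field used here is an illustrative assumption. */
    #include <stdint.h>
    #include <stdio.h>

    /* Extract the 2-bit output-port field from the packet header. */
    static int select_output_port(uint32_t header)
    {
        return (header >> 30) & 0x3;        /* top two bits name one of 4 ports */
    }

    int main(void)
    {
        uint32_t header = 0x80000000u;      /* top bits 10 -> output port 2 */
        printf("route packet to output port %d\n", select_output_port(header));
        return 0;
    }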

Page 12: Topology

Page 13: Switching

– An interconnection network consists of switches and wires connecting them.

– The following slide shows an example.
• Each switch has four input ports and four output ports.
• In addition each switch has some CPUs and interconnect circuitry.
• The job of the switch is to accept packets arriving on any input port and send each one out on the correct output port.

• Each output port is connected to an input port of another switch by a parallel or serial line.

Page 14: Switching

Page 15: Switching

– Several switching strategies are possible.
• In circuit switching, before a packet is sent, the entire path from the source to the destination is reserved in advance.
– All ports and buffers are claimed, so that when transmission starts, all necessary resources are guaranteed to be available and the bits can move at full speed from the source, through the switches, to the destination.

• In store-and-forward packet switching, no advance reservation is needed.

– The source sends a complete packet to the first switch where it is stored in its entirety.

– The switches may need to buffer packets if an output port is busy.
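To make the contrast concrete, the sketch below uses a simple first-order latency model (the parameter values are illustrative assumptions, not measurements): circuit switching pays a one-time path setup and then streams the packet along the reserved path, while store-and-forward pays the full packet transmission time at every hop.

    /* First-order latency comparison of the two strategies.
     * All parameter values are illustrative assumptions, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        double packet_bits  = 8000.0;   /* 1000-byte packet                        */
        double link_bw_bps  = 1e9;      /* 1 Gbit/s links                          */
        double hops         = 4.0;      /* switches on the path                    */
        double setup_s      = 5e-6;     /* circuit setup time (assumed)            */
        double switch_delay = 1e-6;     /* per-hop store-and-forward overhead      */

        double t_packet = packet_bits / link_bw_bps;

        /* Circuit switching: reserve the path once, then the bits stream through. */
        double circuit = setup_s + t_packet;

        /* Store-and-forward: each switch must hold the whole packet before
         * sending it on, so the packet time is paid again at every hop. */
        double saf = hops * (t_packet + switch_delay);

        printf("circuit switching : %.1f us\n", circuit * 1e6);
        printf("store-and-forward : %.1f us\n", saf * 1e6);
        return 0;
    }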

Page 16: Switching

Page 17: Communication Methods

– When a program is split up into pieces, the pieces (processes) often need to communicate with one another.

– This communication can be done in one of two ways:

• shared variables
• explicit message passing

• Logical sharing of variables is possible even on a multicomputer.

• Message passing is easy to implement on a multiprocessor by simply copying from the sender to the receiver.
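The sketch below shows that copy-based message passing on a multiprocessor: the sender copies its message into a shared one-slot mailbox and the receiver copies it back out, with a mutex and condition variable standing in for whatever synchronization the real system would use. The single-slot design and buffer size are assumptions.

    /* Message passing on a multiprocessor, implemented by copying through a
     * shared one-slot mailbox. The slot size and layout are assumptions. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    static char mailbox[128];                   /* shared buffer the message is copied through */
    static int  full = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    void send_msg(const char *msg)
    {
        pthread_mutex_lock(&lock);
        while (full)
            pthread_cond_wait(&cond, &lock);        /* wait for an empty slot */
        strncpy(mailbox, msg, sizeof mailbox - 1);  /* "send" = copy in       */
        full = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    void recv_msg(char *buf, size_t len)
    {
        pthread_mutex_lock(&lock);
        while (!full)
            pthread_cond_wait(&cond, &lock);        /* wait for a message     */
        strncpy(buf, mailbox, len - 1);             /* "receive" = copy out   */
        buf[len - 1] = '\0';
        full = 0;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
        char buf[sizeof mailbox];
        send_msg("hello from CPU 0");
        recv_msg(buf, sizeof buf);
        printf("received: %s\n", buf);
        return 0;
    }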

Page 18: Communication Methods

Page 19: Taxonomy of Parallel Computers

Although many researchers have tried to come up with a taxonomy of parallel computers, the only one which is widely used is that of Flynn (1972).

This classification is based on two concepts:

– instruction streams
• corresponding to a program counter

– data streams
• consisting of a set of operands

Page 20: Taxonomy of Parallel Computers

Page 21: Taxonomy of Parallel Computers

Page 22: Memory Semantics

– Even though all multiprocessors present the CPUs with the image of a single shared address space, often there are many memory modules present, each holding some portion of the physical memory.

• The CPUs and memories are often interconnected by a complex interconnection network.

• Several CPUs may be attempting to read a memory word at the same time several other CPUs are attempting to write the same word.

• Multiple copies of some blocks may be in caches.

Page 23: Memory Semantics

– Memory semantics can be viewed as a contract between the software and the memory hardware.

• The rules are called consistency models, and many different ones have been proposed and implemented.

• For example, suppose that CPU 0 writes the value 1 to some memory word and a little later CPU 1 writes the value 2 to the same word.

• Now CPU 2 reads the word and gets the value 1.

• Is this an error?

Page 24: Memory Semantics

– The simplest model is strict consistency.
• With this model, any read to a location x always returns the value of the most recent write to x.
• This model is great for programmers, but almost impossible to implement.

– The next best model is called sequential consistency.

• The basic idea is that in the presence of multiple read and write requests, some interleaving of all the requests is chosen by the hardware (nondeterministically), but all CPUs see the same order.
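The sketch below replays the CPU 0 / CPU 1 / CPU 2 scenario from the previous page using C11 atomics; with the default memory_order_seq_cst ordering, the accesses behave sequentially consistently, so CPU 2 may legally read 0, 1, or 2, provided every CPU would agree on a single interleaving of the two writes.

    /* The write/write/read scenario above, expressed with C11 atomics.
     * Default atomic_store/atomic_load use memory_order_seq_cst. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int word;                     /* the shared memory word */

    static void *cpu0(void *arg) { atomic_store(&word, 1); return arg; }
    static void *cpu1(void *arg) { atomic_store(&word, 2); return arg; }

    static void *cpu2(void *arg)
    {
        /* Under sequential consistency this may print 0, 1, or 2; what matters
         * is that all CPUs would agree on one interleaving of the two writes. */
        printf("CPU 2 read %d\n", atomic_load(&word));
        return arg;
    }

    int main(void)
    {
        pthread_t t0, t1, t2;
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_create(&t2, NULL, cpu2, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }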

Page 25: Memory Semantics

Page 26: Memory Semantics

– A looser consistency model, but one that is easier to implement on large multiprocessors, is processor consistency. It has two properties:

• Writes by any CPU are seen by all CPUs in the order they were issued.

• For every memory word, all CPUs see all writes to it in the same order.

• If CPU 1 issues writes with values 1A, 1B, and 1C to some memory location in that sequence, then all other processors see them in that order too.

• Every memory word has an unambiguous value after several CPUs write to it and stop.

Page 27: Memory Semantics

– Weak consistency does not even guarantee that writes from a single CPU are seen in that order.

• One CPU might see 1A before 1B and another CPU might see 1A after 1B.

• However, to add some order, weakly consistent memories provide one or more synchronization variables.

– When a synchronization is executed, all pending writes are finished and no new ones are started until all the old ones are done and the synchronization itself is done.

– In effect a synchronization “flushes the pipeline” and brings the memory to a stable state with no operations pending.

– Time is divided into epochs delimited by the synchronizations.
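The sketch below maps the synchronization idea onto C11 atomics: the two ordinary writes may be observed in either order on their own, but once a reader sees the synchronization variable set, both writes are guaranteed to be complete. The variable names are assumptions; real weakly consistent machines expose this through their own synchronization instructions.

    /* Weak-consistency style synchronization point sketched with a C11 fence:
     * the ordinary writes may be reordered among themselves, but all of them
     * complete before the synchronization variable becomes visible as set. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int a, b;                    /* ordinary shared data, unordered on their own   */
    static atomic_int epoch;            /* plays the role of the synchronization variable */

    static void *writer(void *arg)
    {
        a = 1;                          /* other CPUs may see these two writes  */
        b = 2;                          /* in either order...                   */
        atomic_thread_fence(memory_order_seq_cst);  /* synchronization: drain pending writes */
        atomic_store(&epoch, 1);        /* ...but both are done once epoch is 1 */
        return arg;
    }

    static void *reader(void *arg)
    {
        while (atomic_load(&epoch) == 0)    /* acquiring load of the sync variable */
            ;
        printf("a + b = %d\n", a + b);      /* guaranteed to print 3 */
        return arg;
    }

    int main(void)
    {
        pthread_t tw, tr;
        pthread_create(&tr, NULL, reader, NULL);
        pthread_create(&tw, NULL, writer, NULL);
        pthread_join(tw, NULL);
        pthread_join(tr, NULL);
        return 0;
    }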

Page 28: Memory Semantics

Page 29: Memory Semantics

– Weak consistency has the problem that it is quite inefficient because it must finish off all pending memory operations and hold all new ones until the current ones are done.

– Release consistency improves matters by adopting a model akin to critical sections.

• The idea behind this model is that when a process exits a critical region it is not necessary to force all writes to complete immediately. It is only necessary to make sure that they are done before any process enters the critical region again.

Page 30: Memory Semantics

– In this model, the synchronization operation offered by weak consistency is split into two different operations.

• To read or write a shared data variable, a CPU must first do an acquire operation on the synchronization variable to get exclusive access to the shared data.

• When it is done, the CPU does a release operation on the synchronization variable to indicate that it is finished.
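The sketch below expresses the acquire/release pair with C11 atomics, treating the synchronization variable as a simple spinlock: acquire spins for exclusive access before the shared data is touched, and release publishes the writes made inside the region before anyone can enter it again. The spinlock formulation is one illustrative realization, not the only one.

    /* Release consistency's acquire/release pair, sketched as a C11 spinlock
     * guarding the shared data. Illustrative realization only. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_bool sync_var;        /* the synchronization variable  */
    static int shared_data;             /* the protected shared variable */

    static void acquire(void)
    {
        /* Spin until exclusive access is obtained; acquire ordering makes all
         * writes published by the previous release visible to this CPU. */
        while (atomic_exchange_explicit(&sync_var, true, memory_order_acquire))
            ;
    }

    static void release(void)
    {
        /* Release ordering: all writes made inside the region complete before
         * any other CPU can observe the synchronization variable as free. */
        atomic_store_explicit(&sync_var, false, memory_order_release);
    }

    static void update(int v)
    {
        acquire();                      /* get exclusive access to the shared data   */
        shared_data = v;                /* ordinary write inside the critical region */
        release();                      /* publish it before anyone re-enters        */
    }

    int main(void)
    {
        update(5);
        acquire();
        printf("shared_data = %d\n", shared_data);
        release();
        return 0;
    }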