Parallel Computer Architecture and Interconnect


1b.1

Types of Parallel Computer Architecture

1b.2

Two principal types:

Shared memory multiprocessor

From a strictly hardware point of view, this describes a computer architecture in which all processors have direct (usually bus-based) access to common physical memory. In a programming sense, it describes a model in which parallel tasks all have the same "picture" of memory and can directly address and access the same logical address space.

Distributed memory multicomputer

In hardware, this refers to network-based access to memory that is not common to all processors. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines.

Ref: slides from B. Wilkinson at UNC-Charlotte, 2006, and Kumar, Introduction to Parallel Computing.

Shared Memory Multiprocessor

1b.3

Conventional Computer

1b.4

Virtually all computers have followed a common machine model known as the von Neumann computer, named after the Hungarian mathematician John von Neumann.

A von Neumann computer uses the stored-program concept. The CPU executes a stored program that specifies a sequence of read and write operations on the memory.

Each main memory location is located by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in the address.
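The address range can be checked with a few lines of Python. This is only an illustrative sketch; the function name address_range is made up for this example and does not come from the slides.

```python
def address_range(b: int) -> tuple:
    """Return (lowest, highest) address when an address has b bits."""
    if b <= 0:
        raise ValueError("b must be a positive number of bits")
    return 0, 2**b - 1  # addresses run from 0 to 2^b - 1


if __name__ == "__main__":
    for bits in (16, 32):
        low, high = address_range(bits)
        print(f"{bits}-bit addresses: {low} .. {high}")
```

For example, 16 address bits give 65536 distinct locations, numbered 0 to 65535.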

Shared Memory Multiprocessor System

1b.5

A natural way to extend the single-processor model is to have multiple processors connected to multiple memory modules, such that each processor can access any memory module.

Multiple processors can operate independently but share the same memory resources. Changes in a memory location effected by one processor are visible to all other processors. Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.

UMA and NUMA.

1b.6

Uniform Memory Access (UMA):
Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
Equal access and access times to memory.
Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update.

Non-Uniform Memory Access (NUMA):
Often made by physically linking two or more SMPs.
One SMP can directly access the memory of another SMP.
Not all processors have equal access time to all memories.
Memory access across the link is slower.
If cache coherency is maintained, then it may also be called CC-NUMA (Cache Coherent NUMA).

Shared Memory Computers

1b.7

Advantages:
Global address space provides a user-friendly programming interface to memory.
Data sharing between tasks is both fast and uniform.

Disadvantages:
Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs increases traffic on the shared memory-CPU path.
The programmer is responsible for synchronization constructs that ensure "correct" access to global memory and consistent data results.
Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.

Distributed Memory Computer

1b.8

Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.

When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.

Distributed Memory Computer

1b.9

Advantages:
Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
Cost effectiveness: can use commodity, off-the-shelf processors and networking such as Ethernet.

Disadvantages:
The programmer is responsible for many of the details associated with data communication between processors.
Non-uniform memory access (NUMA) times: data on a remote node takes longer to reach than local data.

Hybrid Computer

1b.10

The largest and fastest computers in the world today employ both shared and distributed memory architectures.

The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.

The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.

Real computer systems have cache memory between the main memory and the processors: Level 1 (L1) cache and Level 2 (L2) cache.

Example Quad Shared Memory Multiprocessor

1b.11

[Figure: four processors, each with its own L1 cache, L2 cache, and bus interface, attached to a processor/memory bus. The bus also connects a memory controller with the shared memory, and an I/O interface leading to the I/O bus.]

Programming Shared Memory Computers

Several possible ways:

1b.12

1. Use threads: the programmer decomposes the program into individual parallel sequences (threads), each able to access the shared and global variables that have been declared.

Each thread has local data but also shares all the resources of the program (a.out). This saves the overhead associated with replicating a program's resources for each thread.

Any thread can execute any subroutine at the same time as other threads.

Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that no more than one thread is updating the same global address at any time.

Example: Pthreads
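The thread model described above can be sketched in Python. This is not Pthreads itself: Python's standard threading module is used here as a stand-in, and the names counter, counter_lock, worker and run are invented for the illustration. The lock plays the role of the synchronization construct that keeps two threads from updating the same global location at once.

```python
import threading

counter = 0                      # shared (global) variable visible to every thread
counter_lock = threading.Lock()  # synchronization construct guarding the update


def worker(increments: int) -> None:
    """Each thread runs this routine and updates the shared counter."""
    global counter
    for _ in range(increments):
        with counter_lock:       # only one thread may update the location at a time
            counter += 1


def run(num_threads: int = 4, increments: int = 10_000) -> int:
    """Start the threads, wait for all of them, and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(increments,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run())
```

With the lock in place the result is always num_threads * increments. Without it, the read-modify-write on counter is not guaranteed to be atomic and updates could be lost.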

1b.13

2. Use library functions and preprocessor compiler directives with a sequential programming language to declare shared variables and specify parallelism.

Portable / multi-platform, including Unix and Windows NT platforms.
Available in C/C++ and Fortran implementations.
Can be very easy and simple to use.

Example: OpenMP, an industry standard. Consists of library functions, compiler directives, and environment variables. Needs an OpenMP compiler.

Programming Distributed Memory Computers

1b.14

Message passing model

Tasks exchange data through communications by sending and receiving messages.

Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.

In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
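The matched send/receive pairing can be illustrated with a small single-machine sketch. It is not MPI: threads and queue.Queue objects from the Python standard library stand in for separate machines and network links, and the Channel class and exchange function are hypothetical names used only here. The point it shows is that the tasks share no variables and every send has a matching receive on the other side.

```python
import queue
import threading


class Channel:
    """Hypothetical one-way channel standing in for a network link."""

    def __init__(self) -> None:
        self._box = queue.Queue()

    def send(self, data) -> None:
        self._box.put(data)

    def recv(self):
        return self._box.get()   # blocks until the matching send has happened


def exchange(values):
    """Master sends data to a worker task and receives the sum back."""
    to_worker, to_master = Channel(), Channel()

    def worker() -> None:
        local = to_worker.recv()      # matching receive for the master's send
        to_master.send(sum(local))    # the result travels back as a message

    task = threading.Thread(target=worker)
    task.start()
    to_worker.send(list(values))      # master sends a copy of its data
    result = to_master.recv()         # matching receive for the worker's send
    task.join()
    return result


if __name__ == "__main__":
    print(exchange([1, 2, 3, 4]))
```

If either receive were left without a matching send, the receiving task would wait forever, which is why the two operations have to be written as a cooperating pair.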

Interconnection Networks

1b.15

Provide mechanisms for data transfer between processors or between processors and memory

A typical network is built from links (physical media such as wires and fibers) and switches (which provide a mapping from inputs to outputs).

Static network: point-to-point links.

Dynamic network: switches and links. Communications are established dynamically among processors and memory.

Interconnection Networks

1b.16

2- and 3-dimensional meshes
Hypercube (not now common)

Using switches:
Crossbar
Trees
Multistage interconnection networks

1b.17

Bus-Based Networks

Ideal for broadcasting. The distance between any two nodes is constant. However, the bounded bandwidth of a bus places limitations on performance as the number of nodes increases. Caches are used to improve access time. Scalable in cost but not in performance.

Crossbar Networks

A grid of p x b switches connects p processors to b memory banks. With b >= p, the network is non-blocking. The lower bound on the total number of switches is Ω(p^2).

Not scalable in terms of cost.
Scalable in terms of performance.

1b.19

Multistage Networks

Multistage networks are an intermediate class that lies between the two extremes above (bus and crossbar).

An omega network consists of log p stages, where p is the number of inputs (processing nodes) and also the number of outputs (memory banks).

1b.20

Between input i and output j of a stage, a link exists if:

j = 2i, for 0 <= i <= p/2 - 1
or
j = 2i + 1 - p, for p/2 <= i <= p - 1

This is a left rotation by one bit of the binary representation of the input i.
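The link rule can be checked against the bit-level view with a short Python sketch. The function names perfect_shuffle and rotate_left are chosen for this example only, and p is assumed to be a power of two. The loop confirms that the two-case formula and a one-bit left rotation of the input's binary representation give the same output.

```python
def perfect_shuffle(i: int, p: int) -> int:
    """Output j linked to input i, for p inputs (p assumed a power of two)."""
    if i < p // 2:
        return 2 * i
    return 2 * i + 1 - p


def rotate_left(i: int, bits: int) -> int:
    """Rotate the bits-wide binary representation of i left by one position."""
    mask = (1 << bits) - 1
    return ((i << 1) & mask) | (i >> (bits - 1))


if __name__ == "__main__":
    p = 8
    bits = p.bit_length() - 1        # log2(p) for a power of two
    for i in range(p):
        j = perfect_shuffle(i, p)
        assert j == rotate_left(i, bits)
        print(f"{i:03b} -> {j:03b}")
```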

1b.21

The p inputs are fed into a set of p/2 switches. Each switch is in one of two connection modes:

1). Pass-through: inputs are sent straight through to the outputs.
2). Cross-over: inputs are crossed over and then sent out.

1b.22

Total number of switches?
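One way to work out the count, assuming p is a power of two: each of the log p stages holds p/2 switches, so an omega network needs (p/2) log p two-input switches, compared with p^2 switching points for a crossbar with b = p. The sketch below only does this arithmetic; the function names are illustrative and not from the slides.

```python
def omega_switches(p: int) -> int:
    """Number of 2x2 switches in an omega network with p inputs (p a power of two)."""
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two, at least 2")
    stages = p.bit_length() - 1      # log2(p) stages
    return stages * (p // 2)         # p/2 switches per stage


def crossbar_switches(p: int, b: int = 0) -> int:
    """Number of switching points in a p x b crossbar (b defaults to p)."""
    return p * (b if b else p)


if __name__ == "__main__":
    for p in (8, 64, 1024):
        print(p, omega_switches(p), crossbar_switches(p))
```

The gap widens quickly as p grows, which is the cost argument for multistage networks. The price is that an omega network is blocking, unlike the crossbar.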

1b.23

Link AB may already be in use by another processor-to-memory pair. A second communication that needs the same link will then be blocked.
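Blocking can be reproduced with a small routing sketch for an omega network. The assumptions are: p is a power of two, each stage applies the perfect shuffle followed by a switch that sets the lowest bit of the line number to the next destination bit (most significant bit first), and a link is named by the hypothetical pair (stage, output line). The two source/destination pairs below are picked for illustration; the labels A and B from the slide's figure are not modelled.

```python
def omega_route(src: int, dst: int, p: int) -> list:
    """Return the (stage, line) links a message occupies from src to dst."""
    bits = p.bit_length() - 1
    mask = p - 1
    pos = src
    links = []
    for stage in range(bits):
        # Perfect shuffle: rotate the line number left by one bit.
        pos = ((pos << 1) & mask) | (pos >> (bits - 1))
        # Destination bit examined at this stage, most significant first.
        want = (dst >> (bits - 1 - stage)) & 1
        # Switch: pass-through if the low bit already matches, else cross-over.
        pos = (pos & ~1) | want
        links.append((stage, pos))
    return links


if __name__ == "__main__":
    first = omega_route(0b010, 0b111, 8)
    second = omega_route(0b110, 0b100, 8)
    print(first, second)
    print("shared links:", set(first) & set(second))
```

Both messages need the same output line of the first stage, so only one of them can proceed at a time even though their destinations differ.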

1b.24

A completely-connected network is good in the sense that any two nodes can exchange a message in a single step. It is similar to a crossbar network because of its non-blocking property.

A star-connected network is similar to a bus-based network. Communication between any pair of nodes is routed through the central processor. The central node is the bottleneck, just like the bus.

1b.26

The total number of nodes is 2^d.

In general, a d-dimensional hypercube is constructed by connecting corresponding nodes of two (d-1) dimensional hypercubes.
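The recursive construction can be sketched in Python. The assumptions are that nodes are labelled 0 to 2^d - 1 and that the second copy of the (d-1)-dimensional cube gets the same labels with the new top bit set; the function name hypercube is made up for this example.

```python
def hypercube(d: int) -> dict:
    """Adjacency sets of a d-dimensional hypercube, built from two (d-1)-cubes."""
    if d == 0:
        return {0: set()}            # a single node with no links
    smaller = hypercube(d - 1)
    high = 1 << (d - 1)              # label bit that distinguishes the second copy
    cube = {}
    for node, neighbours in smaller.items():
        # First copy keeps its links and gains one to its twin in the second copy.
        cube[node] = set(neighbours) | {node | high}
        # Second copy mirrors the links and connects back to the twin.
        cube[node | high] = {n | high for n in neighbours} | {node}
    return cube


if __name__ == "__main__":
    cube = hypercube(3)
    for node in sorted(cube):
        print(f"{node:03b}: {sorted(f'{n:03b}' for n in cube[node])}")
```

Each node ends up with d neighbours, and two nodes are linked exactly when their labels differ in one bit.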

1b.27

Tree-based network

a. A static tree network has a processing node at each node of the tree.
b. A dynamic tree network has switching nodes at the intermediate levels and processing nodes at the leaf level.

To route a message, the source node sends the message up the tree until it reaches the node that is the root of the smallest subtree containing both sender and receiver. The message is then routed down the tree to the receiver.
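The up-then-down routing can be sketched for a dynamic tree. The assumptions, none of which come from the slides, are: a complete binary tree with p leaves (p a power of two), heap-style numbering with the root as node 1 and the children of node k as 2k and 2k + 1, and processing node i sitting at leaf p + i. The function name tree_route is hypothetical.

```python
def tree_route(src: int, dst: int, p: int) -> list:
    """Tree nodes visited by a message from leaf src to leaf dst."""
    a, b = p + src, p + dst          # heap indices of the two leaves
    up, down = [a], [b]
    while a != b:                    # climb from both ends until the paths meet
        a //= 2
        b //= 2
        up.append(a)
        down.append(b)
    # up ends at the root of the smallest common subtree; walk back down to dst.
    return up + down[-2::-1]


if __name__ == "__main__":
    print(tree_route(0, 1, 4))   # neighbouring leaves meet one level up
    print(tree_route(0, 3, 4))   # leaves in different halves meet at the root
```

Messages between leaves in different halves of the tree all pass through the root, so links near the top carry the most traffic.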

1b.28

Cache Coherence

1b.28

In the case of shared-address-space computers, additional hardware is required to keep multiple copies of data consistent with each other.

In particular, when several processors hold copies of the same data, how do we ensure that they all use the same updated value?

If a processor changes the value of its copy, one of two things must happen:
The other copies must be invalidated.
The other copies must be updated.

1b.30

Solid lines represent processor actions and dashed lines represent coherence actions.

A read on invalid data causes a transition to shared, after the remote value has been fetched.

A write on shared data causes a transition to dirty and issues a c_write that marks the other copies invalid.
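The transitions described above can be traced with a simplified simulation. It is a sketch under several assumptions: it models a single shared variable, uses the three states invalid, shared and dirty, and writes a dirty copy back to memory before another processor reads it. The class and method names are invented here and do not describe any particular hardware.

```python
class CoherentMemory:
    """One shared variable cached by several processors, invalidate-on-write."""

    def __init__(self, num_procs: int, value: int = 0) -> None:
        self.memory = value
        self.state = ["invalid"] * num_procs
        self.cache = [None] * num_procs

    def read(self, proc: int) -> int:
        if self.state[proc] == "invalid":
            self._flush_dirty()                 # coherence action: owner writes back
            self.cache[proc] = self.memory      # fetch the current value
            self.state[proc] = "shared"         # invalid -> shared
        return self.cache[proc]

    def write(self, proc: int, value: int) -> None:
        for other in range(len(self.state)):    # c_write: invalidate other copies
            if other != proc:
                self.state[other] = "invalid"
        self.cache[proc] = value
        self.state[proc] = "dirty"              # writer now holds the only valid copy

    def _flush_dirty(self) -> None:
        for p, s in enumerate(self.state):
            if s == "dirty":
                self.memory = self.cache[p]     # write the dirty value back
                self.state[p] = "shared"        # dirty -> shared


if __name__ == "__main__":
    m = CoherentMemory(2, value=1)
    m.read(0)
    m.read(1)
    m.write(0, 3)
    print(m.state)       # processor 1's copy has been invalidated
    print(m.read(1))     # processor 1 re-fetches the updated value
    print(m.state)
```

After the write, processor 1 cannot use its stale copy: its next read finds the line invalid and fetches the new value.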
