
Transcript of PP16-lec4-arch3

Page 1: PP16-lec4-arch3


Parallel Processing sp2016

lec#4

Dr M Shamim Baig

Page 2: PP16-lec4-arch3

Explicitly Parallel Processor architectures:

Task-level Parallelism


Page 3: PP16-lec4-arch3

Elements of (Explicit) Parallel Architectures

• Processor configurations: Instruction/Data Stream based
• Memory configurations:
  - Physical & Logical based
  - Access-Delay based
• Inter-processor communication:
  - Communication-Interface design
  - Data Exchange/Synch approach

Page 4: PP16-lec4-arch3


Example SIMD & MIMD systems

• Variants of SIMD have found use in co-processing units such as the MMX units in Intel processors, DSP chips such as the SHARC, and Nvidia's graphics processors (GPUs).

• Examples of MIMD platforms include current-generation Sun Ultra servers, SGI Origin servers, multiprocessor PCs, workstation clusters & the IBM SP.

Page 5: PP16-lec4-arch3


Ex: Conditional Execution in SIMD Processors

Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.

It is often necessary to selectively turn off operations on certain data items. For this, most SIMD programming paradigms provide an "activity mask", which determines whether or not a processor participates in a computation.
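As a minimal illustration (plain C simulating four processing elements; the array names and data values are invented for this sketch, not from the slides), the two-step masked execution of "if (A[i] > 0) C[i] = A[i]; else C[i] = -A[i];" might look like:

#include <stdio.h>

#define NPROC 4  /* four SIMD processing elements */

int main(void) {
    int A[NPROC] = { 5, -3, 0, 7 };  /* one data item per PE (illustrative values) */
    int C[NPROC];
    int mask[NPROC];                 /* activity mask: 1 = PE participates */

    /* Step 1: every PE evaluates the condition; only PEs where it
       holds stay active and execute the "then" branch. */
    for (int i = 0; i < NPROC; i++) mask[i] = (A[i] > 0);
    for (int i = 0; i < NPROC; i++)
        if (mask[i]) C[i] = A[i];    /* active PEs compute, others idle */

    /* Step 2: the mask is inverted; the previously idle PEs now
       execute the "else" branch. */
    for (int i = 0; i < NPROC; i++)
        if (!mask[i]) C[i] = -A[i];

    for (int i = 0; i < NPROC; i++)
        printf("PE%d: C = %d\n", i, C[i]);
    return 0;
}

Note the cost of SIMD conditionals: the two branches execute in two sequential steps, so some PEs sit idle in each step.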

Page 6: PP16-lec4-arch3

Programming Models: MPMD/SPMD

• There are two programming models for PP:
  - Multiple Program Multiple-Data (MPMD): execute different programs on different processors.
  - Single Program Multiple-Data (SPMD): execute the same program on different processors.

• An SIMD system can execute only one program, which works on different parts of the data. An MIMD system can execute the same or different programs, which also work on different parts of the data.

• Hence SIMD supports only the SPMD programming model. Although MIMD supports both models of programming (MPMD & SPMD), SPMD is the preferred choice due to easier software management (see the sketch below).
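A minimal SPMD sketch (C with MPI; assumes an MPI installation, compiled with mpicc and launched with, e.g., mpirun -np 4): every processor runs the same program, and the rank it obtains selects which part of the data it works on.

#include <stdio.h>
#include <mpi.h>

#define N 16  /* total data items (illustrative; assume size divides N) */

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    /* Same program on every processor; the rank picks this
       process's slice of the data (the SPMD model). */
    int chunk = N / size;
    int lo = rank * chunk, hi = lo + chunk;
    long sum = 0;
    for (int i = lo; i < hi; i++) sum += i;  /* work on own part */

    printf("process %d of %d: partial sum of [%d,%d) = %ld\n",
           rank, size, lo, hi, sum);
    MPI_Finalize();
    return 0;
}

An MPMD version would launch different executables on different processors; SPMD keeps one executable, which is what makes the software easier to manage.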

Page 7: PP16-lec4-arch3

Comparison: SIMD vs MIMD

• Control flow: Synchronous in SIMD vs asynchronous in MIMD.

• Programming model: SIMD supports only the SPMD programming model, while MIMD supports both (SPMD & MPMD) programming models.

• Cost: SIMD computers require less hardware than MIMD computers (single control unit).
  - However, since SIMD processors are specially designed, they tend to be expensive & have long design cycles.
  - In contrast, MIMD processors can be built from inexpensive off-the-shelf components with relatively little effort in a short time.

• Flexibility: SIMD performs very well for specialized, regularly structured applications (e.g. image processing) but not for all applications, while MIMD is more flexible & general purpose.

Page 8: PP16-lec4-arch3

Elements of (Explicit) Parallel Architectures

• Processor configurations: Instruction/Data Stream based
• Memory configurations:
  - Physical & Logical based
  - Access-Delay based
• Inter-processor communication:
  - Communication-Interface design
  - Data Exchange/Synch approach

Page 9: PP16-lec4-arch3

Parallel Platforms: Memory (Physical vs Logical) Configurations

• Physical vs Logical Memory Config
  - Physical Memory config (SM, DM, CSM)
  - Logical Address Space config (SAS, NSAS)
  - Combinations:
    • CSM + SAS (SMP; UMA)
    • DM + SAS (DSM; NUMA)
    • DM + NSAS (Multicomputer/Clusters)

Page 10: PP16-lec4-arch3

Shared Memory (SM) Multiprocessor

• It is important to note the difference between the terms Shared Memory & Shared Address Space.

• The former is the physical memory config, while the latter is the logical memory address view for a program.

• It is possible to provide a Shared Address Space using a physically distributed memory.

• SM-multiprocessor systems are SAS-based, using a physical memory configuration either as CSM or as DM (i.e. DSM), as the thread sketch below illustrates.
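A minimal sketch of the SAS view (C with POSIX threads; the counter and thread count are invented for illustration): both threads read and write the same variable through one logical address, whether the physical memory underneath is CSM or DSM.

#include <stdio.h>
#include <pthread.h>

int shared_counter = 0;  /* one logical address visible to all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* synchronize access to shared data */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);  /* prints 200000 */
    return 0;
}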

Page 11: PP16-lec4-arch3

UMA vs NUMA

• SM-multiprocessors are further categorized, based on memory access delay, as UMA (uniform memory access) & NUMA (non-uniform memory access).

• A UMA system is based on the (CSM + SAS) config, where each processor has the same delay for accessing any memory location.

• A NUMA system is based on the (DM + SAS = DSM) config, where a processor may have different delays for accessing different memory locations.

Page 12: PP16-lec4-arch3


UMA & NUMA Arch Block Diagrams

Typical shared-address-space architectures: (a) Uniform-memory access shared-address-space computer; (b) Uniform-memory-access shared-address-space computer with caches and memories; (c) Non-uniform-memory-access shared-address-space computer with local memory only.

[Block diagrams: UMA (CSM + SAS) vs NUMA (DM + SAS = DSM).]

Both are SM-multiprocessors, differing in memory access delay.

Page 13: PP16-lec4-arch3


Simplistic view of a small shared memory Symmetric Multi Processor (SMP):

(CSM + SAS + Bus)

Examples:
• Dual Pentiums
• Quad Pentiums

[Block diagram: processors connected to shared memory over a common bus.]

Page 14: PP16-lec4-arch3


Quad Pentium Shared Memory SMP

[Block diagram: four processors, each with its own L1 cache, L2 cache and bus interface, connected over a shared processor/memory bus to a memory controller, shared memory and an I/O interface (I/O bus).]

Page 15: PP16-lec4-arch3

Multicomputer (Cluster) Platform

Complete computers P (CU + PE), DM with NSAS & interconnection-network interface at the I/O bus level.

[Block diagram: complete computers (processor + local memory) exchanging messages over an interconnection network.]

• These platforms comprise a set of processors, each with its own (exclusive/distributed) memory.

• Instances of such a view come naturally from non-shared-address-space (NSAS) multicomputers, e.g. clustered workstations.

Page 16: PP16-lec4-arch3

Data Exchange/Synch Approaches: Shared-Data vs Message-Passing

• There are two primary approaches to data exchange/synch in parallel systems:
  - Shared-Data approach
  - Message-Passing approach

• SM-multiprocessors use the Shared-Data approach for data exchange/synch.

• Multicomputers (Clusters) use the Message-Passing approach for data exchange/synch, as the sketch below shows.
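For contrast with the shared-data thread example earlier, a minimal Message-Passing sketch (C with MPI; the value and tag are illustrative): with no shared address space, process 0 must explicitly send the data, and the blocking receive doubles as synchronization. Run with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* No shared memory: the data must travel in a message. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The blocking receive also acts as synchronization. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}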

Page 17: PP16-lec4-arch3

Data Exchange/Synch Platforms: Shared-Memory vs Message-Passing

• Shared-memory platforms have low communication overhead and can support finer grain levels, while message-passing platforms have more communication overhead & are therefore better suited to coarser grain levels.

• SM-multiprocessors are faster, but have poor scalability.

• Message-passing multicomputer platforms are slower, but have higher scalability.

Page 18: PP16-lec4-arch3


Clusters as a Computing Platform

• Clusters: In the early 1990s, a network of computers became a very attractive alternative to the expensive supercomputers used for high-performance computing.

• Several early projects, notably:
  - Berkeley NOW (network of workstations) project.
  - NASA Beowulf project.

Page 19: PP16-lec4-arch3

Advantages of Cluster Computers (NOW-like)

• Very high performance workstations and PCs readily available at low cost.

• Latest processors can easily be incorporated into the system as they become available.

• Easily scalable.

• Existing software can be used or easily modified.

Page 20: PP16-lec4-arch3


Beowulf Clusters*

• A group of interconnected commodity computers achieving high performance with low cost.

• Typically built using commodity interconnects (e.g. high-speed Ethernet) & a commodity OS (e.g. Linux).

* The name Beowulf comes from the NASA Goddard Space Flight Center cluster project.

Page 21: PP16-lec4-arch3

Cluster Interconnects: LAN vs SAN

• LANs: Fast / Gigabit / 10-Gigabit Ethernet
• SANs: Myrinet, Quadrics, InfiniBand

Comparison: LAN vs SAN

• Distance: LANs cover longer distances (km vs m), causing more delay (slower).

• Reliability: LANs are designed for less reliable networks, so they include overhead (error correction etc.) which adds to delays.

• Processing speed: LAN communication uses OS calls, causing more processing delays.

Page 22: PP16-lec4-arch3

Vector/Array Data Processors

• Vector processor: 1-D temporal parallelism using a pipelined arithmetic unit & vector chaining (see the loop sketch below).
  - Float add pipe: compare exponents, align mantissas, add mantissas, normalize.

• Array processor: 1-D spatial parallelism using an ALU array as SIMD.

• Systolic array: combines 2-D spatial parallelism with pipelining (computational wavefront).

[Block diagrams of vector/array & systolic processing not shown.]
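A small sketch (plain C; the loop and array names are invented for illustration) of the kind of loop a vector processor maps onto its pipelines: the multiply and add become two vector instructions, and vector chaining lets each product stream into the add pipeline as soon as it is produced.

#include <stdio.h>

#define N 8  /* vector length (illustrative) */

int main(void) {
    float a[N], b[N], c[N], d[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f; c[i] = 1.0f; }

    /* On a vector machine this loop becomes VMUL t = a*b followed by
       VADD d = t + c; chaining forwards each element of t into the
       add pipeline without waiting for the full vector to finish
       (1-D temporal parallelism). */
    for (int i = 0; i < N; i++)
        d[i] = a[i] * b[i] + c[i];

    for (int i = 0; i < N; i++) printf("%g ", d[i]);
    printf("\n");
    return 0;
}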

Page 23: PP16-lec4-arch3

Summary: Parallel Platforms; Memory & Interconnect Configurations

• Memory Config (Physical vs Logical)
  - Physical Memory config (SM, DM, CSM)
  - Logical Address Space config (SAS, NSAS)
  - Combinations:
    • CSM + SAS (SMP; UMA)
    • DM + SAS (DSM; NUMA)
    • DM + NSAS (Multicomputer/Clusters)

• Interconnection Network:
  o Interface level: memory bus (using MBEU) in SM-multiprocessors (UMA, NUMA) vs I/O bus (using NIU) in multicomputers/clusters
  o Data Exchange/Synch: Shared-Data model vs Message-Passing model