Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South...

Lecture 3TTH 03:30AM-04:45PM

Dr. Jianjun Huhttp://mleg.cse.sc.edu/edu/csc

e569/

CSCE569 Parallel Computing

University of South CarolinaDepartment of Computer Science and Engineering

Outline- Parallel ArchitectureInterconnection networksProcessor arraysMultiprocessorsMulticomputersFlynn’s taxonomy

Interconnection NetworksUses of interconnection networks

Connect processors to shared memoryConnect processors to each other

Interconnection media typesShared mediumSwitched medium

Shared versus Switched Media

Shared MediumAllows only message at a timeMessages are broadcastEach processor “listens” to every messageArbitration is decentralizedCollisions require resending of messagesEthernet is an example

Switched MediumSupports point-to-point messages between

pairs of processorsEach processor has its own path to switchAdvantages over shared media

Allows multiple messages to be sent simultaneously

Allows scaling of network to accommodate increase in processors

Switch Network TopologiesView switched network as a graph

Vertices = processors or switchesEdges = communication paths

Two kinds of topologiesDirectIndirect

Direct TopologyRatio of switch nodes to processor nodes is

1:1Every switch node is connected to

1 processor nodeAt least 1 other switch node

Indirect TopologyRatio of switch nodes to processor nodes is

greater than 1:1Some switches simply connect other

switches

Evaluating Switch TopologiesDiameterBisection widthNumber of edges / nodeConstant edge length? (yes/no)

2-D Mesh NetworkDirect topologySwitches arranged into a 2-D latticeCommunication allowed only between

neighboring switchesVariants allow wraparound connections

between switches on edge of mesh

2-D Meshes

Evaluating 2-D MeshesDiameter: (n1/2)

Bisection width: (n1/2)

Number of edges per switch: 4

Constant edge length? Yes

Binary Tree NetworkIndirect topologyn = 2d processor nodes, n-1 switches

Evaluating Binary Tree NetworkDiameter: 2 log n

Bisection width: 1

Edges / node: 3

Constant edge length? No

Hypertree NetworkIndirect topologyShares low diameter of binary treeGreatly improves bisection widthFrom “front” looks like k-ary tree of height dFrom “side” looks like upside down binary

tree of height d

Hypertree Network

Evaluating 4-ary HypertreeDiameter: log n

Bisection width: n / 2

Edges / node: 6


Butterfly NetworkIndirect topologyn = 2d processor

nodes connectedby n(log n + 1)switching nodes

0 1 2 3 4 5 6 7

3 ,0 3 ,1 3 ,2 3 ,3 3 ,4 3 ,5 3 ,6 3 ,7

2 ,0 2 ,1 2 ,2 2 ,3 2 ,4 2 ,5 2 ,6 2 ,7

1 ,0 1 ,1 1 ,2 1 ,3 1 ,4 1 ,5 1 ,6 1 ,7

0 ,0 0 ,1 0 ,2 0 ,3 0 ,4 0 ,5 0 ,6 0 ,7R ank 0

R ank 1

R ank 2

R ank 3

Butterfly Network Routing

Evaluating Butterfly NetworkDiameter: log n


Edges per node: 4


HypercubeDirectory topology2 x 2 x … x 2 meshNumber of nodes a power of 2Node addresses 0, 1, …, 2k-1Node i connected to k nodes whose

addresses differ from i in exactly one bit position

Hypercube Addressing

0010

0000

0100

0110 0111

1110

0001

0101

1000 1001

0011

1010

1111

1011

11011100

Hypercubes Illustrated

Evaluating Hypercube NetworkDiameter: log n


Edges per node: log n


Shuffle-exchangeDirect topologyNumber of nodes a power of 2Nodes have addresses 0, 1, …, 2k-1Two outgoing links from node i

Shuffle link to node LeftCycle(i)Exchange link to node [xor (i, 1)]

Shuffle-exchange Illustrated

0 1 2 3 4 5 6 7

Shuffle-exchange Addressing

0 0 0 0 0 0 0 1 0 0 1 0 0 0 11 0 1 0 0 0 1 0 1

11 1 0 11 1 11 0 0 0 1 0 0 1 1 0 1 0 1 0 11 11 0 0 11 0 1

0 11 0 0 11 1

Evaluating Shuffle-exchangeDiameter: 2log n - 1

Bisection width: n / log n

Edges per node: 2


Comparing NetworksAll have logarithmic diameter

except 2-D meshHypertree, butterfly, and hypercube have

bisection width n / 2All have constant edges per node except

hypercubeOnly 2-D mesh keeps edge lengths

constant as network size increases

Vector ComputersVector computer: instruction set includes

operations on vectors as well as scalarsTwo ways to implement vector computers

Pipelined vector processor: streams data through pipelined arithmetic units

Processor array: many identical, synchronized arithmetic processing elements

Why Processor Arrays?Historically, high cost of a control unitScientific applications have data parallelism

Processor Array

Data/instruction StorageFront end computer

ProgramData manipulated sequentially

Processor arrayData manipulated in parallel

Processor Array PerformancePerformance: work done per time unitPerformance of processor array

Speed of processing elementsUtilization of processing elements

Performance Example 11024 processorsEach adds a pair of integers in 1 secWhat is performance when adding two

1024-element vectors (one per processor)?

sec/ops10024.1ePerformanc 9sec1

operations1024

Performance Example 2512 processorsEach adds two integers in 1 secPerformance adding two vectors of length

600?

sec/ops103ePerformanc 6sec2

operations600

2-D Processor Interconnection Network

Each VLSI chip has 16 processing elements

Processor Array ShortcomingsNot all problems are data-parallelSpeed drops for conditionally executed

codeDon’t adapt to multiple users wellDo not scale down well to “starter” systemsRely on custom VLSI for processorsExpense of control units has dropped

MultiprocessorsMultiprocessor: multiple-CPU computer with

a shared memorySame address on two different CPUs refers

to the same memory locationAvoid three problems of processor arrays

Can be built from commodity CPUsNaturally support multiple usersMaintain efficiency in conditional code

Centralized MultiprocessorStraightforward extension of uniprocessorAdd CPUs to busAll processors share same primary memoryMemory access time same for all CPUs

Uniform memory access (UMA) multiprocessor

Symmetrical multiprocessor (SMP)

Centralized Multiprocessor

Private and Shared DataPrivate data: items used only by a single

processorShared data: values used by multiple

processorsIn a multiprocessor, processors

communicate via shared data values

Problems Associated with Shared DataCache coherence

Replicating data across multiple caches reduces contention

How to ensure different processors have same value for same address?

SynchronizationMutual exclusionBarrier

Cache-coherence Problem

Cache

CPU A

Cache

CPU B

Memory

7X


CPU A CPU B

Memory

7X

7


CPU A CPU B

Memory

7X

77


CPU A CPU B

Memory

2X

72

Write Invalidate Protocol

CPU A CPU B

7X

77 Cache control monitor


CPU A CPU B

7X

77

Intent to write X


CPU A CPU B

7X

7

Intent to write X


CPU A CPU B

X 2

2

Memory consistencyUse memory consistency model to achieve

memory consistencyDefine allowable behavior of the memory

system used by programmer of parallel programs.

Consistency model guarantee that for any read/write input to the memory system, only allowable outputs are produced..

Sequential Consistency ModelRequire all write operations appear to all

processors in the same orderCan be guaranteed by the following sufficient

conditions:1) Every processor issues memory operations in

program order2) After a processor issues write op, wait until

write is done bf. Issue next op3) After a processor issued Read op, it waits until

this read and write op whose value should be returned to completed.

4) RR RW WW WR 1) XY: X must complete bf. Y

Network embedding/MappingE.g.: How to embed a network into

hypercubeReflected Gray Code (RGC): Mapping nodes

of network A to nodes of network B.Example: Map ring to hypercube

Embed ring/2d mesh to hypercube

Embed ring to hyperCube

Embed 2d mesh to hyperCube

RoutingRouting algorithms depend on network

topology, network contention, and network congestion.

Deterministic or adaptive routing algo. XY routing for 2D meshes:

E-Cube routing for Hypercube

Routing for Omega network

Stage 0 Stage 1 Stage 2

Omega network allows max 8 concurrent messaging.

Blocking network since only 1 path exists for (AB)

Non-blocking network: Benes

For any permutation ofThere exists a switching of the Benes network to realize connection without collision… HOW do u find this configuration?

Distributed MultiprocessorDistribute primary memory among

processorsIncrease aggregate memory bandwidth and

lower average memory access timeAllow greater number of processorsAlso called non-uniform memory access

(NUMA) multiprocessor

Distributed Multiprocessor

MulticomputerDistributed memory multiple-CPU computerSame address on different processors

refers to different physical memory locations

Processors interact through message passing

Commercial multicomputersCommodity clusters

Asymmetrical Multicomputer

Asymmetrical MC AdvantagesBack-end processors dedicated to parallel

computations Easier to understand, model, tune performance

Only a simple back-end operating system needed Easy for a vendor to create

Asymmetrical MC DisadvantagesFront-end computer is a single point of

failureSingle front-end computer limits scalability

of systemPrimitive operating system in back-end

processors makes debugging difficultEvery application requires development of

both front-end and back-end program

Symmetrical Multicomputer

Symmetrical MC AdvantagesAlleviate performance bottleneck caused

by single front-end computerBetter support for debuggingEvery processor executes same program

Symmetrical MC DisadvantagesMore difficult to maintain illusion of single

“parallel computer”No simple way to balance program

development workload among processorsMore difficult to achieve high performance

when multiple processes on each processor

ParPar Cluster, A Mixed Model

Commodity ClusterCo-located computersDedicated to running parallel jobsNo keyboards or displaysIdentical operating systemIdentical local disk imagesAdministered as an entity

Network of WorkstationsDispersed computersFirst priority: person at keyboardParallel jobs run in backgroundDifferent operating systemsDifferent local imagesCheckpointing and restarting important

Flynn’s TaxonomyInstruction streamData streamSingle vs. multipleFour combinations

SISDSIMDMISDMIMD

SISDSingle Instruction, Single DataSingle-CPU systemsNote: co-processors don’t count

FunctionalI/O

Example: PCs

SIMDSingle Instruction, Multiple DataTwo architectures fit this category

Pipelined vector processor(e.g., Cray-1)

Processor array(e.g., Connection Machine)

MISDMultiple

Instruction,Single Data

Example:systolic array

MIMDMultiple Instruction, Multiple DataMultiple-CPU computers

MultiprocessorsMulticomputers

SummaryCommercial parallel computers appeared

in 1980sMultiple-CPU computers now dominateSmall-scale: Centralized multiprocessorsLarge-scale: Distributed memory

architectures (multiprocessors or multicomputers)

Summary (1/2)High performance computing

U.S. governmentCapital-intensive industriesMany companies and research labs

Parallel computersCommercial systemsCommodity-based systems

Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South...

Documents

Transcript of Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South...