Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South...

82
Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu http://mleg.cse.sc.edu/edu /csce569/ CSCE569 Parallel Computing University of South Carolina Department of Computer Science and Engineering

Transcript of Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South...

Page 1: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Lecture 3TTH 03:30AM-04:45PM

Dr. Jianjun Huhttp://mleg.cse.sc.edu/edu/csc

e569/

CSCE569 Parallel Computing

University of South CarolinaDepartment of Computer Science and Engineering

Page 2: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Outline- Parallel ArchitectureInterconnection networksProcessor arraysMultiprocessorsMulticomputersFlynn’s taxonomy

Page 3: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Interconnection NetworksUses of interconnection networks

Connect processors to shared memoryConnect processors to each other

Interconnection media typesShared mediumSwitched medium

Page 4: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Shared versus Switched Media

Page 5: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Shared MediumAllows only message at a timeMessages are broadcastEach processor “listens” to every messageArbitration is decentralizedCollisions require resending of messagesEthernet is an example

Page 6: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Switched MediumSupports point-to-point messages between

pairs of processorsEach processor has its own path to switchAdvantages over shared media

Allows multiple messages to be sent simultaneously

Allows scaling of network to accommodate increase in processors

Page 7: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Switch Network TopologiesView switched network as a graph

Vertices = processors or switchesEdges = communication paths

Two kinds of topologiesDirectIndirect

Page 8: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Direct TopologyRatio of switch nodes to processor nodes is

1:1Every switch node is connected to

1 processor nodeAt least 1 other switch node

Page 9: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Indirect TopologyRatio of switch nodes to processor nodes is

greater than 1:1Some switches simply connect other

switches

Page 10: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Evaluating Switch TopologiesDiameterBisection widthNumber of edges / nodeConstant edge length? (yes/no)

Page 11: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

2-D Mesh NetworkDirect topologySwitches arranged into a 2-D latticeCommunication allowed only between

neighboring switchesVariants allow wraparound connections

between switches on edge of mesh

Page 12: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

2-D Meshes

Page 13: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Evaluating 2-D MeshesDiameter: (n1/2)

Bisection width: (n1/2)

Number of edges per switch: 4

Constant edge length? Yes

Page 14: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Binary Tree NetworkIndirect topologyn = 2d processor nodes, n-1 switches

Page 15: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Evaluating Binary Tree NetworkDiameter: 2 log n

Bisection width: 1

Edges / node: 3

Constant edge length? No

Page 16: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Hypertree NetworkIndirect topologyShares low diameter of binary treeGreatly improves bisection widthFrom “front” looks like k-ary tree of height dFrom “side” looks like upside down binary

tree of height d

Page 17: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Hypertree Network

Page 18: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Evaluating 4-ary HypertreeDiameter: log n

Bisection width: n / 2

Edges / node: 6

Constant edge length? No

Page 19: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Butterfly NetworkIndirect topologyn = 2d processor

nodes connectedby n(log n + 1)switching nodes

0 1 2 3 4 5 6 7

3 ,0 3 ,1 3 ,2 3 ,3 3 ,4 3 ,5 3 ,6 3 ,7

2 ,0 2 ,1 2 ,2 2 ,3 2 ,4 2 ,5 2 ,6 2 ,7

1 ,0 1 ,1 1 ,2 1 ,3 1 ,4 1 ,5 1 ,6 1 ,7

0 ,0 0 ,1 0 ,2 0 ,3 0 ,4 0 ,5 0 ,6 0 ,7R ank 0

R ank 1

R ank 2

R ank 3

Page 20: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Butterfly Network Routing

Page 21: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Evaluating Butterfly NetworkDiameter: log n

Bisection width: n / 2

Edges per node: 4

Constant edge length? No

Page 22: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

HypercubeDirectory topology2 x 2 x … x 2 meshNumber of nodes a power of 2Node addresses 0, 1, …, 2k-1Node i connected to k nodes whose

addresses differ from i in exactly one bit position

Page 23: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Hypercube Addressing

0010

0000

0100

0110 0111

1110

0001

0101

1000 1001

0011

1010

1111

1011

11011100

Page 24: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Hypercubes Illustrated

Page 25: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Evaluating Hypercube NetworkDiameter: log n

Bisection width: n / 2

Edges per node: log n

Constant edge length? No

Page 26: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Shuffle-exchangeDirect topologyNumber of nodes a power of 2Nodes have addresses 0, 1, …, 2k-1Two outgoing links from node i

Shuffle link to node LeftCycle(i)Exchange link to node [xor (i, 1)]

Page 27: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Shuffle-exchange Illustrated

0 1 2 3 4 5 6 7

Page 28: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Shuffle-exchange Addressing

0 0 0 0 0 0 0 1 0 0 1 0 0 0 11 0 1 0 0 0 1 0 1

11 1 0 11 1 11 0 0 0 1 0 0 1 1 0 1 0 1 0 11 11 0 0 11 0 1

0 11 0 0 11 1

Page 29: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Evaluating Shuffle-exchangeDiameter: 2log n - 1

Bisection width: n / log n

Edges per node: 2

Constant edge length? No

Page 30: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Comparing NetworksAll have logarithmic diameter

except 2-D meshHypertree, butterfly, and hypercube have

bisection width n / 2All have constant edges per node except

hypercubeOnly 2-D mesh keeps edge lengths

constant as network size increases

Page 31: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Vector ComputersVector computer: instruction set includes

operations on vectors as well as scalarsTwo ways to implement vector computers

Pipelined vector processor: streams data through pipelined arithmetic units

Processor array: many identical, synchronized arithmetic processing elements

Page 32: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Why Processor Arrays?Historically, high cost of a control unitScientific applications have data parallelism

Page 33: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Processor Array

Page 34: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Data/instruction StorageFront end computer

ProgramData manipulated sequentially

Processor arrayData manipulated in parallel

Page 35: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Processor Array PerformancePerformance: work done per time unitPerformance of processor array

Speed of processing elementsUtilization of processing elements

Page 36: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Performance Example 11024 processorsEach adds a pair of integers in 1 secWhat is performance when adding two

1024-element vectors (one per processor)?

sec/ops10024.1ePerformanc 9sec1

operations1024

Page 37: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Performance Example 2512 processorsEach adds two integers in 1 secPerformance adding two vectors of length

600?

sec/ops103ePerformanc 6sec2

operations600

Page 38: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

2-D Processor Interconnection Network

Each VLSI chip has 16 processing elements

Page 39: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Processor Array ShortcomingsNot all problems are data-parallelSpeed drops for conditionally executed

codeDon’t adapt to multiple users wellDo not scale down well to “starter” systemsRely on custom VLSI for processorsExpense of control units has dropped

Page 40: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

MultiprocessorsMultiprocessor: multiple-CPU computer with

a shared memorySame address on two different CPUs refers

to the same memory locationAvoid three problems of processor arrays

Can be built from commodity CPUsNaturally support multiple usersMaintain efficiency in conditional code

Page 41: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Centralized MultiprocessorStraightforward extension of uniprocessorAdd CPUs to busAll processors share same primary memoryMemory access time same for all CPUs

Uniform memory access (UMA) multiprocessor

Symmetrical multiprocessor (SMP)

Page 42: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Centralized Multiprocessor

Page 43: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Private and Shared DataPrivate data: items used only by a single

processorShared data: values used by multiple

processorsIn a multiprocessor, processors

communicate via shared data values

Page 44: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Problems Associated with Shared DataCache coherence

Replicating data across multiple caches reduces contention

How to ensure different processors have same value for same address?

SynchronizationMutual exclusionBarrier

Page 45: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Cache-coherence Problem

Cache

CPU A

Cache

CPU B

Memory

7X

Page 46: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Cache-coherence Problem

CPU A CPU B

Memory

7X

7

Page 47: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Cache-coherence Problem

CPU A CPU B

Memory

7X

77

Page 48: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Cache-coherence Problem

CPU A CPU B

Memory

2X

72

Page 49: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Write Invalidate Protocol

CPU A CPU B

7X

77 Cache control monitor

Page 50: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Write Invalidate Protocol

CPU A CPU B

7X

77

Intent to write X

Page 51: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Write Invalidate Protocol

CPU A CPU B

7X

7

Intent to write X

Page 52: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Write Invalidate Protocol

CPU A CPU B

X 2

2

Page 53: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Memory consistencyUse memory consistency model to achieve

memory consistencyDefine allowable behavior of the memory

system used by programmer of parallel programs.

Consistency model guarantee that for any read/write input to the memory system, only allowable outputs are produced..

Page 54: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Sequential Consistency ModelRequire all write operations appear to all

processors in the same orderCan be guaranteed by the following sufficient

conditions:1) Every processor issues memory operations in

program order2) After a processor issues write op, wait until

write is done bf. Issue next op3) After a processor issued Read op, it waits until

this read and write op whose value should be returned to completed.

4) RR RW WW WR 1) XY: X must complete bf. Y

Page 55: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Network embedding/MappingE.g.: How to embed a network into

hypercubeReflected Gray Code (RGC): Mapping nodes

of network A to nodes of network B.Example: Map ring to hypercube

Page 56: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Embed ring/2d mesh to hypercube

Page 57: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Embed ring to hyperCube

Page 58: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Embed 2d mesh to hyperCube

Page 59: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

RoutingRouting algorithms depend on network

topology, network contention, and network congestion.

Deterministic or adaptive routing algo. XY routing for 2D meshes:

Page 60: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

E-Cube routing for Hypercube

Page 61: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Routing for Omega network

Stage 0 Stage 1 Stage 2

Page 62: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Omega network allows max 8 concurrent messaging.

Blocking network since only 1 path exists for (AB)

Page 63: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Non-blocking network: Benes

For any permutation ofThere exists a switching of the Benes network to realize connection without collision… HOW do u find this configuration?

Page 64: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Distributed MultiprocessorDistribute primary memory among

processorsIncrease aggregate memory bandwidth and

lower average memory access timeAllow greater number of processorsAlso called non-uniform memory access

(NUMA) multiprocessor

Page 65: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Distributed Multiprocessor

Page 66: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

MulticomputerDistributed memory multiple-CPU computerSame address on different processors

refers to different physical memory locations

Processors interact through message passing

Commercial multicomputersCommodity clusters

Page 67: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Asymmetrical Multicomputer

Page 68: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Asymmetrical MC AdvantagesBack-end processors dedicated to parallel

computations Easier to understand, model, tune performance

Only a simple back-end operating system needed Easy for a vendor to create

Page 69: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Asymmetrical MC DisadvantagesFront-end computer is a single point of

failureSingle front-end computer limits scalability

of systemPrimitive operating system in back-end

processors makes debugging difficultEvery application requires development of

both front-end and back-end program

Page 70: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Symmetrical Multicomputer

Page 71: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Symmetrical MC AdvantagesAlleviate performance bottleneck caused

by single front-end computerBetter support for debuggingEvery processor executes same program

Page 72: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Symmetrical MC DisadvantagesMore difficult to maintain illusion of single

“parallel computer”No simple way to balance program

development workload among processorsMore difficult to achieve high performance

when multiple processes on each processor

Page 73: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

ParPar Cluster, A Mixed Model

Page 74: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Commodity ClusterCo-located computersDedicated to running parallel jobsNo keyboards or displaysIdentical operating systemIdentical local disk imagesAdministered as an entity

Page 75: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Network of WorkstationsDispersed computersFirst priority: person at keyboardParallel jobs run in backgroundDifferent operating systemsDifferent local imagesCheckpointing and restarting important

Page 76: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Flynn’s TaxonomyInstruction streamData streamSingle vs. multipleFour combinations

SISDSIMDMISDMIMD

Page 77: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

SISDSingle Instruction, Single DataSingle-CPU systemsNote: co-processors don’t count

FunctionalI/O

Example: PCs

Page 78: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

SIMDSingle Instruction, Multiple DataTwo architectures fit this category

Pipelined vector processor(e.g., Cray-1)

Processor array(e.g., Connection Machine)

Page 79: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

MISDMultiple

Instruction,Single Data

Example:systolic array

Page 80: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

MIMDMultiple Instruction, Multiple DataMultiple-CPU computers

MultiprocessorsMulticomputers

Page 81: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

SummaryCommercial parallel computers appeared

in 1980sMultiple-CPU computers now dominateSmall-scale: Centralized multiprocessorsLarge-scale: Distributed memory

architectures (multiprocessors or multicomputers)

Page 82: Lecture 3 TTH 03:30AM-04:45PM Dr. Jianjun Hu  CSCE569 Parallel Computing University of South Carolina Department of.

Summary (1/2)High performance computing

U.S. governmentCapital-intensive industriesMany companies and research labs

Parallel computersCommercial systemsCommodity-based systems