
Chapter 2

Parallel Architectures


Outline

• Some chapter references

• Brief review of complexity – Terminology for comparisons

• Interconnection networks

• Processor arrays

• Multiprocessors

• Multicomputers

• Flynn’s Taxonomy – moved to Chpt 1


Some Chapter References

• Selim Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, 1989 (earlier textbook).

• G. C. Fox, What Have We Learnt from Using Real Parallel Machines to Solve Real Problems? Technical Report C3P-522, Cal Tech, December 1989. (Included in part in more recent books co-authored by Fox.)

• A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition, 2003 (first edition 1994), Addison Wesley.

• Harry Jordan, Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003, Ch 1, 3-5.

• F. Thomson Leighton; Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes; 1992; Morgan Kaufmann Publishers.


References - continued

• Gregory Pfister, In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing, 2nd Edition, Ch 2. (Discusses details of some serious problems that MIMDs incur.)

• Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004 (current textbook), Chapter 2.

• Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994, Ch. 1-2.

• Sayed H. Roosta, Parallel Processing & Parallel Algorithms: Theory and Computation, Springer Verlag, 2000, Chpt 1.

• Wilkinson & Allen, Parallel Programming: Techniques and Applications, Prentice Hall, 2nd Edition, 2005, Ch 1-2.


Brief Review Complexity Concepts Needed for Comparisons

• Whenever we define a counting function, we usually characterize the growth rate of that function in terms of complexity classes.

• Technical Definition: We say a function f(n) is in O(g(n)) if (and only if) there are positive constants c and n₀ such that

    0 ≤ f(n) ≤ c·g(n) for all n ≥ n₀

• O(n) is read as "big-oh of n".

• This notation can be used to separate counting functions into complexity classes that characterize the size of the count.

• We can use it for any kind of counting functions such as timings, bisection widths, etc.
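As a quick worked check of the definition (an added example, not on the original slide): take f(n) = 3n + 10 and g(n) = n. Then

    0 ≤ 3n + 10 ≤ 4n for all n ≥ 10,

so the constants c = 4 and n₀ = 10 witness that 3n + 10 is in O(n).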


Big-Oh and Asymptotic Growth Rate

• The big-Oh notation gives an upper bound on the (asymptotic) growth rate of a function

• The statement “f(n) is O(g(n))” means that the growth rate of f(n) is not greater than the growth rate of g(n)

• We can use the big-Oh notation to rank functions according to their growth rate

Assume:                 f(n) is O(g(n))     g(n) is O(f(n))

g(n) grows faster             Yes                 No
f(n) grows faster             No                  Yes
Same growth                   Yes                 Yes


Relatives of Big-Oh

• big-Omega
  – f(n) is Ω(g(n)) if there is a constant c > 0 and an integer constant n₀ ≥ 1 such that

    f(n) ≥ c·g(n) ≥ 0 for n ≥ n₀

  Intuitively, this says that, up to a constant factor, f(n) is asymptotically greater than or equal to g(n).

• big-Theta
  – f(n) is Θ(g(n)) if there are constants c′ > 0 and c″ > 0 and an integer constant n₀ ≥ 1 such that

    0 ≤ c′·g(n) ≤ f(n) ≤ c″·g(n) for n ≥ n₀

  Intuitively, this says that, up to constant factors, f(n) and g(n) are asymptotically the same.

Note: These concepts are covered in algorithm courses


Relatives of Big-Oh

• little-oh
  – f(n) is o(g(n)) if, for any constant c > 0, there is an integer constant n₀ ≥ 0 such that

    0 ≤ f(n) < c·g(n) for n ≥ n₀

  Intuitively, this says f(n) is, up to a constant, asymptotically strictly less than g(n), so f(n) is not Θ(g(n)).

• little-omega
  – f(n) is ω(g(n)) if, for any constant c > 0, there is an integer constant n₀ ≥ 0 such that

    f(n) > c·g(n) ≥ 0 for n ≥ n₀

  Intuitively, this says f(n) is, up to a constant, asymptotically strictly greater than g(n), so f(n) is not Θ(g(n)).

These are not used as much as the earlier definitions, but they round out the picture.


Summary for Intuition for Asymptotic Notation

big-Oh
  – f(n) is O(g(n)) if f(n) is asymptotically less than or equal to g(n)

big-Omega
  – f(n) is Ω(g(n)) if f(n) is asymptotically greater than or equal to g(n)

big-Theta
  – f(n) is Θ(g(n)) if f(n) is asymptotically equal to g(n)

little-oh
  – f(n) is o(g(n)) if f(n) is asymptotically strictly less than g(n)

little-omega
  – f(n) is ω(g(n)) if f(n) is asymptotically strictly greater than g(n)


A CALCULUS DEFINITION OF O, Θ (often easier to use)

Definition: Let f and g be functions defined on the positive integers with nonnegative values.

We say g is in O(f) if and only if

    lim (n→∞) g(n)/f(n) = c

for some nonnegative real number c, i.e., the limit exists and is not infinite.

Definition: We say f is in Θ(g) if and only if

    f is in O(g) and g is in O(f)

Note: Often use L'Hopital's Rule to calculate the limits you need.
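As an added worked example of this definition (not on the original slide): to compare f(n) = 3n³ and g(n) = 19,500,000n,

    lim (n→∞) g(n)/f(n) = lim (n→∞) 19,500,000 / (3n²) = 0,

so g is in O(f) but f is not in O(g); the linear-count program eventually wins, which is exactly the point of the TRS-80 vs. CRAY-YMP table two slides later.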


Why Asymptotic Behavior is Important

• 1) Allows us to compare counts on large sets.

• 2) Helps us understand the maximum size of input that can be handled in a given time, provided we know the environment in which we are running.

• 3) Stresses the fact that even dramatic speedups in hardware cannot overcome the handicap of an asymptotically slow algorithm.


Recall: ORDER WINS OUT (Example from Baase's Algorithms Text)

The TRS-80

Main language support: BASIC - typically a slow-running interpreted language

For more details on the TRS-80 see:

http://mate.kjsl.com/trs80/

The CRAY-YMP

Language used in example: FORTRAN - a fast-running language

For more details on the CRAY-YMP see:

http://ds.dial.pipex.com/town/park/abm64/CrayWWWStuff/Cfaqp1.html#TOC3


                  CRAY YMP                    TRS-80
                  with FORTRAN                with BASIC
                  complexity is 3n³           complexity is 19,500,000n

n = 10            3 microsec                  200 millisec
n = 100           3 millisec                  2 sec
n = 1000          3 sec                       20 sec
n = 2500          50 sec                      50 sec
n = 10000         49 min                      3.2 min
n = 1000000       95 years                    5.4 hours

microsecond (abbr. µsec): one-millionth of a second.
millisecond (abbr. msec): one-thousandth of a second.
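A quick added check of where the two columns cross (not on the original slide): the table is consistent with both counts being running times in nanoseconds, and setting 3n³ = 19,500,000n gives n² = 6,500,000, i.e. n ≈ 2550, matching the row near n = 2500 where both machines take about 50 seconds. Beyond that point the asymptotically faster linear-time program on the slow TRS-80 wins.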


Interconnection Networks

• Uses of interconnection networks
  – Connect processors to shared memory
  – Connect processors to each other

• Interconnection media types
  – Shared medium
  – Switched medium

• Different interconnection networks define different parallel machines.

• The interconnection network's properties influence the type of algorithm used for various machines, as it affects how data is routed.


Shared versus Switched Media

a. With shared medium, one message is sent & all processors listen

b. With switched medium, multiple messages are possible.


Shared Medium

• Allows only one message at a time

• Messages are broadcast

• Each processor "listens" to every message

• Before sending a message, a processor "listens" until the medium is unused

• Collisions require resending of messages

• Ethernet is an example


Switched Medium

• Supports point-to-point messages between pairs of processors

• Each processor is connected to one switch

• Advantages over shared media
  – Allows multiple messages to be sent simultaneously
  – Allows scaling of the network to accommodate the increase in processors


Switch Network Topologies

• View switched network as a graph
  – Vertices = processors or switches
  – Edges = communication paths

• Two kinds of topologies
  – Direct
  – Indirect


Direct Topology

• Ratio of switch nodes to processor nodes is 1:1

• Every switch node is connected to
  – 1 processor node
  – At least 1 other switch node

Indirect Topology

• Ratio of switch nodes to processor nodes is greater than 1:1
• Some switches simply connect to other switches


Terminology for Evaluating Switch Topologies

• We need to evaluate 4 characteristics of a network in order to understand its effectiveness in implementing efficient parallel algorithms on a machine with that network.

• These are
  – The diameter
  – The bisection width
  – The edges per node
  – The constant edge length

• We’ll define these and see how they affect algorithm choice.

• Then we will investigate several different topologies and see how these characteristics are evaluated.


Terminology for Evaluating Switch Topologies

• Diameter
  – Largest distance between two switch nodes.
  – A low diameter is desirable.
  – It puts a lower bound on the complexity of parallel algorithms that require communication between arbitrary pairs of nodes.


Terminology for Evaluating Switch Topologies

• Bisection width
  – The minimum number of edges between switch nodes that must be removed in order to divide the network into two halves (within one node, if the number of nodes is odd).

• High bisection width is desirable.

• In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the algorithm.

• Actually proving what the bisection width of a network is can be quite difficult.


Terminology for Evaluating Switch Topologies

• Number of edges per node
  – It is best if the maximum number of edges per node is a constant independent of network size, as this allows the processor organization to scale more easily to a larger number of nodes.
  – Degree is the maximum number of edges per node.

• Constant edge length? (yes/no)
  – Again, for scalability, it is best if the nodes and edges can be laid out in 3-D space so that the maximum edge length is a constant independent of network size.


Evaluating Switch Topologies

• Many have been proposed and analyzed. We will consider several well-known ones:
  – 2-D mesh
  – linear network
  – binary tree
  – hypertree
  – butterfly
  – hypercube
  – shuffle-exchange

• Several of these (highlighted in yellow on the original slide) have been used in commercial parallel computers.


2-D Meshes

Note: Circles represent switches and squares represent processors in all these slides.


2-D Mesh Network

• Direct topology

• Switches arranged into a 2-D lattice or grid

• Communication allowed only between neighboring switches

• Torus: Variant that includes wraparound connections between switches on edge of mesh


Evaluating 2-D Meshes (assumes the mesh is square)

n = number of processors

• Diameter: Θ(n^(1/2))
  – Places a lower bound on algorithms that require processing with arbitrary nodes sharing data.

• Bisection width: Θ(n^(1/2))
  – Places a lower bound on algorithms that require distribution of data to all nodes.

• Max number of edges per switch: 4 is the degree

• Constant edge length? Yes

• Does this scale well? Yes


Linear Network

• Switches arranged into a 1-D mesh
• Direct topology
• Corresponds to a row or column of a 2-D mesh
• Ring: a variant that allows a wraparound connection between the switches on the ends
• The linear and ring networks have many applications
• Essentially supports a pipeline in both directions
• Although these networks are very simple, they support many optimal algorithms


Evaluating Linear and Ring Networks

• Diameter
  – Linear: n − 1, or Θ(n)
  – Ring: n/2, or Θ(n)

• Bisection width:
  – Linear: 1, or Θ(1)
  – Ring: 2, or Θ(1)

• Degree for switches:– 2

• Constant edge length? – Yes

• Does this scale well?– Yes


Binary Tree Network

• Indirect topology

• n = 2^d processor nodes and 2n − 1 switches, where d = 0, 1, ... is the number of levels

  e.g., 2³ = 8 processors on the bottom and 2n − 1 = 2(8) − 1 = 15 switches


Evaluating Binary Tree Network

• Diameter:
  – 2 log n, or O(log n)
  – Note: this is small

• Bisection width: – 1, the lowest possible number

• Degree: – 3

• Constant edge length? – No

• Does this scale well?– No


Hypertree Network (of degree 4 and depth 2)

(a) Front view: 4-ary tree of height 2

(b) Side view: upside down binary tree of height d

(c) Complete network


Hypertree Network

• Indirect topology
• Note: the degree k and the depth d must be specified.
• This gives, from the front, a k-ary tree of height d.
• From the side, the same network looks like an upside-down binary tree of height d.
• Joining the front and side views yields the complete network.


Evaluating 4-ary Hypertree with Depth d

• A 4-ary hypertree has n = 4^d processors
  – The general formula for a k-ary hypertree is n = k^d

• Diameter is 2d, i.e. Θ(log n)
  – Shares the low diameter of a binary tree

• Bisection width = 2^(d+1)
  – Note: here, 2^(d+1) = 2³ = 8
  – A large value - much better than a binary tree

• Constant edge length? No

• Degree = 6


Butterfly Network

• Indirect topology
• n = 2^d processor nodes connected by n(log n + 1) switching nodes

[Figure: a butterfly network for d = 3 - processors 0 through 7 along the top and four ranks (rows) of switches, rank 0 through rank 3, with switches labeled (rank, column) for columns 0-7.]

A 2³ = 8 processor butterfly network with 8 × 4 = 32 switching nodes.

As complicated as this switching network appears to be, it is really quite simple as it admits a very nice routing algorithm!

The rows of switches are called ranks. Wrapped butterfly: the variant in which the top and bottom ranks are merged into a single rank.


Building the 2³ Butterfly Network

• There are 8 processors.
• There are 4 ranks (i.e. rows) with 8 switches per rank.
• Connections:
  – Node(i, j), for i > 0, is connected to two nodes on rank i − 1, namely node(i − 1, j) and node(i − 1, m), where m is the integer found by flipping the i-th most significant bit in the binary d-bit representation of j.
  – For example, suppose i = 2 and j = 3. Then node(2, 3) is connected to node(1, 3).
  – To get the other connection, 3 = 011 in binary. Flipping the 2nd most significant bit gives 001, so node(2, 3) is also connected to node(1, 1). (NOTE: there is an error on pg 32 of the text on this example.)

• Nodes connected by a cross edge from rank i to rank i + 1 have node numbers that differ only in their (i + 1)-st bit. (A small sketch of this connection rule follows below.)
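A minimal Python sketch of the connection rule just described (ours, not from the slides; the function name is hypothetical):

    # Butterfly connection rule (illustrative only). For 2**d processor columns,
    # node (i, j) with i > 0 connects down to rank i-1 at column j (straight edge)
    # and at column j with its i-th most significant bit flipped (cross edge).

    def butterfly_neighbors(i, j, d):
        """Return the two rank-(i-1) switches that node (i, j) connects to."""
        assert 0 < i <= d and 0 <= j < 2 ** d
        flipped = j ^ (1 << (d - i))      # flip the i-th most significant of d bits
        return (i - 1, j), (i - 1, flipped)

    # Example from the slide: d = 3, node (2, 3) connects to (1, 3) and (1, 1).
    print(butterfly_neighbors(2, 3, 3))   # ((1, 3), (1, 1))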


Why It Is Called a Butterfly Network

• Walk cycles such as node(i, j), node(i−1, j), node(i, m), node(i−1, m), node(i, j), where m is determined by the bit flipping as shown, and you "see" a butterfly:


Butterfly Network Routing

Send a message from processor 2 to processor 5.

Algorithm (sketched in code below): 0 means ship left; 1 means ship right.
1) 5 = 101. Pluck off the leftmost bit, 1, and send "01 msg" to the right.
2) Pluck off the leftmost bit, 0, and send "1 msg" to the left.
3) Pluck off the leftmost bit, 1, and send "msg" to the right.

Each cross edge followed changes address by 1 bit.
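A small illustrative sketch of the bit-plucking rule just described (ours, not from the slides):

    # Butterfly routing (illustrative only): the message carries the destination
    # address; at each rank the leftmost remaining bit is plucked off,
    # 0 meaning "go left/straight" and 1 meaning "go right/cross".

    def butterfly_route(dest, d):
        """Yield the direction taken at each of the d routing steps."""
        bits = format(dest, f"0{d}b")      # destination as a d-bit string
        for b in bits:                     # pluck bits from the left
            yield "right" if b == "1" else "left"

    # Example from the slide: routing to processor 5 = 101 gives right, left, right.
    print(list(butterfly_route(5, 3)))     # ['right', 'left', 'right']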


Evaluating the Butterfly Network with n Processors

• Diameter:
  – log n

• Bisection width:
  – n / 2 * (likely an error; 32/2 = 16)

• Degree:
  – 4 (even for d > 3)

• Constant edge length?
  – No; edge length grows exponentially as the rank size increases


* On pg 442, Leighton gives Θ(n / log n) as the bisection width. Simply remove the cross edges between two successive ranks to create a bisection cut.


Hypercube (also called binary n-cube)

[Figure: a 4-dimensional hypercube - 16 processor/switch nodes labeled with the 4-bit addresses 0000 through 1111, each node joined to the four nodes whose addresses differ from its own in one bit.]

A hypercube with n = 2^d processors & switches, shown for d = 4.


Hypercube (or Binary n-cube) with n = 2^d Processors


• Direct topology
• 2 × 2 × … × 2 mesh
• Number of nodes is a power of 2
• Node addresses are 0, 1, …, n − 1
• Node i is connected to the d = log n nodes whose addresses differ from i in exactly one bit position (see the sketch below)
• Example: node 0111 is connected to 1111, 0011, 0101, and 0110
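A tiny illustrative sketch of that neighbor rule (ours, not from the slides):

    # Hypercube neighbor rule (illustrative only): node i is adjacent to every
    # node obtained by flipping exactly one of its d address bits.

    def hypercube_neighbors(i, d):
        return [i ^ (1 << b) for b in range(d)]

    # Example from the slide: node 0111 in a d = 4 hypercube.
    print([format(x, "04b") for x in hypercube_neighbors(0b0111, 4)])
    # ['0110', '0101', '0011', '1111']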


Growing a Hypercube

Note: For d = 4, it is called a 4-D hypercube, or just a 4-cube.


Evaluating Hypercube Network with n = 2^d nodes


• Diameter: d = log n

• Bisection width: n / 2

• Edges per node: log n

• Constant edge length?
  – No. The length of the longest edge increases as n increases.


Routing on the Hypercube Network


• Example: Send a message from node 2 = 0010 to node 5 = 0101.
• The nodes differ in 3 bits, so the shortest path will be of length 3.
• One path is 0010 → 0110 → 0100 → 0101, obtained by flipping one of the differing bits at each step (a small sketch follows below).

• As with the butterfly network, bit flipping helps you route on this network.
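A short illustrative sketch of this bit-flipping route (ours, not from the slides):

    # Hypercube routing by bit flipping (illustrative only): at each step, flip
    # one bit in which the current address still differs from the destination.

    def hypercube_route(src, dst, d):
        path, cur = [src], src
        for b in reversed(range(d)):       # fix bits from most to least significant
            if (cur ^ dst) & (1 << b):
                cur ^= 1 << b
                path.append(cur)
        return path

    # Example from the slide: 0010 -> 0110 -> 0100 -> 0101.
    print([format(x, "04b") for x in hypercube_route(0b0010, 0b0101, 4)])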


A Perfect Shuffle

• A permutation produced as follows is called a perfect shuffle:

• Given a number of cards that is a power of 2, numbered 0, 1, 2, ..., 2^d − 1, write each card number with d bits. Left-rotating the bits (with wraparound) gives the position of the card after the perfect shuffle.

• Example: For d = 3, card 5 = 101. Left-rotating and wrapping gives 011, so card 5 goes to position 3. Note that card 0 = 000 and card 7 = 111 stay in position. (A one-line sketch of the rotation follows below.)
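A one-line illustrative sketch of the left rotation (ours, not from the slides); the same operation is the LeftCycle(i) shuffle link used by the shuffle-exchange network on the next slides:

    # Perfect shuffle as a left rotation of a d-bit address (illustrative only).

    def left_cycle(i, d):
        return ((i << 1) | (i >> (d - 1))) & ((1 << d) - 1)

    # Example from the slide: card 5 = 101 goes to position 3 = 011 when d = 3.
    print(left_cycle(5, 3))    # 3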


Shuffle-exchange Network with n = 2^d Processors

[Figure: an 8-node shuffle-exchange network, nodes numbered 0 through 7.]

• Direct topology
• Number of nodes is a power of 2
• Nodes have addresses 0, 1, …, 2^d − 1
• Two outgoing links from node i:
  – Shuffle link to node LeftCycle(i)
  – Exchange link between node i and node i + 1, when i is even


Shuffle-exchange Addressing – 16 processors

[Figure: a 16-node shuffle-exchange network with nodes labeled by the 4-bit addresses 0000 through 1111.]

No arrows on line segment means it is bidirectional. Otherwise, you must follow the arrows.

Devising a routing algorithm for this network is interesting and may be a homework problem.


Evaluating the Shuffle-exchange


• Diameter: 2 log n − 1

• Edges per node: 3

• Constant edge length? No

• Bisection width: Θ(n / log n), between n / (2 log n) and 2n / log n *

* See Leighton pg 480


Two Problems with Shuffle-Exchange

• Shuffle-exchange does not expand well
  – A large shuffle-exchange network does not decompose well into smaller, separate shuffle-exchange networks.
  – In a large shuffle-exchange network, a small percentage of nodes will be hot spots
    • They will encounter much heavier traffic

• The above results are in the dissertation of one of Batcher's students.


Comparing Networks (See Table 2.1)

• All have logarithmic diameter except the 2-D mesh

• The 4-ary hypertree, butterfly, and hypercube have bisection width n / 2 (likely untrue for the butterfly; see the earlier note citing Leighton)

• All have a constant number of edges per node except the hypercube

• Only the 2-D mesh, linear, and ring topologies keep edge lengths constant as network size increases

• Shuffle-exchange is a good compromise: a fixed number of edges per node, low diameter, and good bisection width.
  – However, the negative results on the preceding slide also need to be considered.


Alternate Names for SIMDs

• Recall that all active processors of a true SIMD computer must simultaneously access the same memory location.

• The value in the i-th processor can be viewed as the i-th component of a vector.

• SIMD machines are sometimes called vector computers [Jordan,et.al.] or processor arrays [Quinn 94,04] based on their ability to execute vector and matrix operations efficiently.


SIMD Computers

• SIMD computers that focus on vector operations– Support some vector and possibly matrix

operations in hardware– Usually limit or provide less support for non-

vector type operations involving data in the “vector components”.

• General purpose SIMD computers– May also provide some vector and possibly

matrix operations in hardware.– More support for traditional type operations

(e.g., other than for vector/matrix data types).


Pipelined Architectures

• Pipelined architectures are sometimes considered to be SIMD architectures
  – See pg 37 of the textbook & pg 8-9 of Jordan et al.
  – Vector components are entered successively into the first processor in the pipeline.
  – The i-th processor of the pipeline receives the output from the (i−1)-th processor.

• Normal "operations" in each processor are much larger (coarser) in pipelined computers than in true SIMDs

• "Pipelined" execution is somewhat SIMD-like in nature in that explicit synchronization is not required.


Why Processor Arrays?

• Historically, the high cost of control units
• Scientific applications have data parallelism


Data/instruction Storage

• Front-end computer
  – Also called the control unit
  – Holds and runs the program
  – Data manipulated sequentially

• Processor array
  – Data manipulated in parallel


Processor Array Performance

• Performance: work done per time unit

• Performance of processor array– Speed of processing elements– Utilization of processing elements


Performance Example 1

• 1024 processors

• Each adds a pair of integers in 1 µsec (1 microsecond, i.e. one millionth of a second, or 10⁻⁶ second)

• What is the performance when adding two 1024-element vectors (one element pair per processor)?

    Performance = 1024 operations / 10⁻⁶ sec = 1.024 × 10⁹ ops/sec


Performance Example 2

• 512 processors

• Each adds two integers in 1 µsec

• What is the performance when adding two vectors of length 600?

• Since 600 > 512, 88 processors must add two pairs of integers.

• The other 424 processors add only a single pair of integers.
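Working this out (an added check, not shown on the slide): the 88 doubly-loaded processors need two addition steps, so the whole vector addition takes 2 × 10⁻⁶ sec, and

    Performance = 600 operations / (2 × 10⁻⁶ sec) = 3 × 10⁸ ops/sec,

well below the 1.024 × 10⁹ ops/sec of Example 1, even though only 88 of the 512 processors do double duty: the idle second step on the other 424 processors is lost utilization.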


Example of a 2-D Processor Interconnection Network in a Processor Array

Each VLSI chip has 16 processing elements.
Each PE can simultaneously send a value to a neighbor.

PE = processing element


SIMD Execution Style

• The traditional (SIMD, vector, processor array) execution style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]):
  – The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit (or sometimes the host).
  – The front end is a general-purpose CPU that stores the program and the data that is not manipulated in parallel.
  – The front end normally executes the sequential portions of the program.

• Alternately, all PEs needing computation can execute steps synchronously, avoiding the broadcast cost of distributing results.

• Each processing element has a local memory that cannot be directly accessed by the control unit or other processing elements.


SIMD Execution Style

– Collectively, the individual memories of the processing elements (PEs) store the (vector) data that is processed in parallel.

– When the front end encounters an instruction whose operand is a vector, it issues a command to the PEs to perform the instruction in parallel.

– Although the PEs execute in parallel, some units can be allowed to skip particular instructions.


Masking on Processor Arrays

• All the processors work in lockstep except those that are masked out (by setting a mask register).

• The conditional if-then-else is handled differently on processor arrays than in the sequential version (see the sketch below):
  – Every active processor tests to see if its data meets the negation of the boolean condition.
  – If it does, it sets its mask bit so those processors will not participate in the operation initially.
  – Next, the unmasked processors execute the THEN part.
  – Afterwards, the mask bits (for the original set of active processors) are flipped, and the newly unmasked processors perform the ELSE part.
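To make the masking idea concrete, a small illustrative sketch (ours, not from the slides) using NumPy boolean masks as a software stand-in for the hardware mask registers:

    # Data-parallel masking with NumPy (a software analogy to the mask registers
    # on a processor array, not the actual SIMD hardware).
    import numpy as np

    x = np.array([-3, 1, 4, -1, 5, -9])

    mask = x < 0                 # processors whose data meets the condition
    x[mask] = -x[mask]           # "THEN" part runs only where the mask is set
    x[~mask] = x[~mask] * 10     # flip the mask; "ELSE" part runs on the rest

    print(x)                     # [ 3 10 40  1 50  9]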


if (COND) then A else B

[Three figure slides, not reproduced in this transcript, step through the masking process for this statement.]




SIMD Machines

• An early SIMD computer designed for vector and matrix processing was the Illiac IV computer
  – Initial development at the University of Illinois, 1965-70
  – Moved to NASA Ames; completed in 1972 but not fully functional until 1976
  – See Jordan et al., pg 7, and Wikipedia

• The MPP, DAP, the Connection Machines CM-1 and CM-2, and MasPar's MP-1 and MP-2 are examples of SIMD computers
  – See Akl pg 8-12 and [Quinn, 94]

• The CRAY-1 and the Cyber-205 use pipelined arithmetic units to support vector operations and are sometimes called pipelined SIMDs
  – See [Jordan, et al, pg 7], [Quinn 94, pg 61-2], and [Quinn 2004, pg 37]


SIMD Machines

• Quinn [1994, pg 63-67] discusses the CM-2 Connection Machine (with 64K PEs) and the smaller & updated CM-200.

• Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO
  – The ASPRO is a small, second-generation STARAN used by the Navy in its spy planes.

• Professor Batcher is best known architecturally for the MPP, which is held by the Smithsonian Institution and currently displayed at a D.C. airport.


Today’s SIMDs

• SIMD functionality is sometimes embedded in sequential machines.

• Others are being built as part of hybrid architectures.

• Some SIMD and SIMD-like features are included in some multi/many-core processing units.

• Some SIMD-like architectures have been built as special-purpose machines, although some of these could qualify as general purpose.
  – Some of this work has been proprietary.
  – The fact that a parallel computer is SIMD or SIMD-like is often not advertised by the company building it.


A Company that Recently Built an Inexpensive SIMD

• ClearSpeed produced a COTS (commodity off-the-shelf) SIMD board
  – WorldScape has developed some defense and commercial applications for this computer.

• Not a traditional SIMD, as the hardware doesn't tightly synchronize the execution of instructions.
  – The hardware design supports efficient synchronization.

• This machine is programmed like a SIMD.

• The U.S. Navy observed that these machines process radar an order of magnitude faster than others.

• Earlier, quite a bit of information about this was at www.wscape.com and www.clearspeed.com


An Example of a Hybrid SIMD

• Embedded Massively Parallel Accelerators
  – Fuzion 150: 1536 processors on a single chip
  – Systola 1024: PC add-on board with 1024 processors
  – Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan

(This and the next three slides are due to Prabhakar R. Gudla (U of Maryland), at a CMSC 838T presentation, 4/23/2003.)


Hybrid Architecture

[Figure: a hybrid computer consisting of sixteen Systola 1024 boards connected by a high-speed Myrinet switch.]

• Combines the SIMD and MIMD paradigms within one parallel architecture.


Architecture of Systola 1024

[Figure: the Systola 1024 board - interface processors, the ISA (instruction systolic array), RAM NORTH and RAM WEST, and a controller with program memory, attached to the host computer bus.]

• Instruction Systolic Array:
  – 32 × 32 mesh of processing elements
  – wavefront instruction execution


SIMDs Embedded in SISDs

• Intel's Pentium 4 included what they call MMX technology to gain a significant performance boost

• IBM and Motorola incorporated the technology into their G4 PowerPC chip in what they call their Velocity Engine.

• Both MMX technology and the Velocity Engine are the chip manufacturer's name for their proprietary SIMD processors and parallel extensions to their operating code.

• This same approach is used by NVidia and Evans & Sutherland to dramatically accelerate graphics rendering.


Special Purpose SIMDs in the Bioinformatics Arena

• Paracel
  – Acquired by Celera Genomics in 2000
  – Products include the sequence supercomputer GeneMatcher, which has a high-throughput sequence analysis capability
    • Supports over a million processors
  – GeneMatcher was used by Celera in their race with the U.S. government to complete the description of the human genome sequence

• TimeLogic, Inc.
  – Has DeCypher, a reconfigurable SIMD


Advantages of SIMDs

• Reference: [Roosta, pg 10]

• Less hardware than MIMDs, as they have only one control unit.
  – Control units are complex.

• Less memory needed than MIMD
  – Only one copy of the instructions needs to be stored
  – Allows more data to be stored in memory.

• Less startup time in communicating between PEs.


Advantages of SIMDs (cont)

• Single instruction stream and synchronization of PEs make SIMD applications easier to program, understand, & debug.– Similar to sequential programming

• Control flow operations and scalar operations can be executed on the control unit while PEs are executing other instructions.

• MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead.


Advantages of SIMDs (cont)

• During a communication operation between PEs,
  – PEs send data to a neighboring PE in parallel and in lock step
  – No need to create a header with routing information, as "routing" is determined by the program steps
  – the entire communication operation is executed synchronously
  – SIMDs are deterministic & have a much more predictable running time
    • Can normally compute a tight (worst case) upper bound for the time for communication operations.

• Less complex hardware in SIMD, since no message decoder is needed in the PEs
  – MIMDs need a message decoder in each PE.


SIMD Shortcomings(with some rebuttals)

• Claims are from our textbook [i.e., Quinn 2004].– Similar statements are found in [Grama, et. al].

• Claim 1: Not all problems are data-parallel– While true, most problems seem to have a

data parallel solution. – In [Fox, et.al.], the observation was made in

their study of large parallel applications at national labs, that most were data parallel by nature, but often had points where significant branching occurred.


SIMD Shortcomings(with some rebuttals)

• Claim 2: Speed drops for conditionally executed branches
  – MIMD processors can execute multiple branches concurrently.
  – For an if-then-else statement with execution times for the "then" and "else" parts being roughly equal, about ½ of the SIMD processors are idle during its execution
    • With additional branching, the average number of inactive processors can become even higher.
    • With SIMDs, only one of these branches can be executed at a time.
    • This reason justifies the study of multiple SIMDs (or MSIMDs).
  – On many applications, any branching is quite shallow.


SIMD Shortcomings(with some rebuttals)

• Claim 2 (cont): Speed drops for conditionally executed code
  – In [Fox, et al.], the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8.
  – The cost of the extremely simple processors used in a SIMD is extremely low
    • Programmers used to worry about "full utilization of memory" but stopped doing so after memory cost became insignificant overall.


SIMD Shortcomings(with some rebuttals)

• Claim 3: Don't adapt to multiple users well.
  – This is true to some degree for all parallel computers.
  – If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by "sharing".
  – This reason also justifies the study of multiple SIMDs (or MSIMDs).
  – The SIMD architecture has not received the attention that MIMD has received and could greatly benefit from further research.


SIMD Shortcomings(with some rebuttals)

• Claim 4: Do not scale down well to “starter” systems that are affordable.– This point is arguable and its ‘truth’ is likely to

vary rapidly over time– ClearSpeed currently sells a very economical

SIMD board that plugs into a PC.


SIMD Shortcomings(with some rebuttals)

Claim 5: Requires customized VLSI for processors, while the expense of the control units in COTS PCs has dropped.

• Reliance on COTS (commodity, off-the-shelf) parts has dropped the price of MIMDs

• The expense of PCs (with control units) has dropped significantly

• However, reliance on COTS has fueled the success of the "low-level parallelism" provided by clusters and has restricted new, innovative parallel architecture research for well over a decade.


SIMD Shortcomings(with some rebuttals)

Claim 5 (cont.)• There is strong evidence that the period of

continual dramatic increases in speed of PCs and clusters is ending.

• Continued rapid increases in parallel performance in the future will be necessary in order to solve important problems that are beyond our current capabilities

• Additionally, with the appearance of the very economical COTS SIMDs, this claim no longer appears to be relevant.


Multiprocessors

• Multiprocessor: multiple-CPU computer with a shared memory

• Same address on two different CPUs refers to the same memory location

• Avoids three cited criticisms of SIMDs
  – Can be built from commodity CPUs
  – Naturally supports multiple users
  – Maintains efficiency in conditional code


Centralized Multiprocessor


Centralized Multiprocessor

• Straightforward extension of the uniprocessor
• Add CPUs to the bus
• All processors share the same primary memory
• Memory access time is the same for all CPUs
  – Uniform memory access (UMA) multiprocessor

• Also called a symmetrical multiprocessor (SMP)


Private and Shared Data

• Private data: items used only by a single processor

• Shared data: values used by multiple processors

• In a centralized multiprocessor (i.e. SMP), processors communicate via shared data values


Problems Associated with Shared Data

The cache coherence problem

• Replicating data across multiple caches reduces contention among processors for shared data values.

• But how can we ensure that different processors have the same value for the same address?

• The cache coherence problem arises when an obsolete value is still stored in a processor's cache.


Write Invalidate Protocol

The most common solution to cache coherence (a toy software model of the idea follows below):

1. Each CPU's cache controller monitors (snoops) the bus & identifies which cache blocks are requested by other CPUs.

2. A PE gains exclusive control of a data item before performing a "write".

3. Before the "write" occurs, all other copies of the data item cached by other PEs are invalidated.

4. When any other CPU tries to read a memory location from an invalidated cache block,
   – a cache miss occurs
   – it has to retrieve the updated data from memory
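A toy illustrative model of write-invalidate snooping (ours, not from the slides; real hardware tracks per-block states such as MSI/MESI rather than per-value entries):

    # Toy write-invalidate model (illustrative only; write-through for simplicity).

    class Cache:
        def __init__(self):
            self.data = {}                      # address -> value, only if valid

    class Bus:
        def __init__(self, memory):
            self.memory, self.caches = memory, []

        def read(self, cache, addr):
            if addr not in cache.data:          # cache miss: fetch from memory
                cache.data[addr] = self.memory[addr]
            return cache.data[addr]

        def write(self, cache, addr, value):
            for other in self.caches:           # snooping: invalidate other copies
                if other is not cache:
                    other.data.pop(addr, None)
            cache.data[addr] = value
            self.memory[addr] = value

    memory = {"X": 7}
    bus = Bus(memory)
    a, b = Cache(), Cache()
    bus.caches = [a, b]

    bus.read(a, "X"); bus.read(b, "X")          # both caches hold X = 7
    bus.write(b, "X", 2)                        # B writes: A's copy is invalidated
    print(bus.read(a, "X"))                     # A misses and re-reads 2 from memory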

Cache-coherence Problem

[Figure: CPUs A and B, each with its own cache, connected by a bus to a shared memory in which X = 7.]

Cache-coherence Problem

[Figure: CPU A reads X and now caches the value 7; memory still holds X = 7.]

Read from memory is not a problem.

Cache-coherence Problem

[Figure: CPU B also reads X; both caches now hold X = 7.]

Cache-coherence Problem

[Figure: one CPU writes X = 2; memory is updated to 2, but the other CPU's cache still holds the stale value 7.]

Write to main memory is a problem.

Write Invalidate Protocol

[Figure: both caches hold X = 7.]

A cache control monitor snoops the bus to see which cache block is being requested by other processors.

Write Invalidate Protocol

[Figure: the CPU that wants to write broadcasts "Intent to write X"; both caches currently hold X = 7.]

Before a write can occur, all copies of data at that address are declared invalid.

Write Invalidate Protocol

[Figure: the other CPU's copy of X has been invalidated; only the writing CPU's cache still holds X = 7.]

Write Invalidate Protocol

[Figure: the writing CPU updates X to 2 in its cache.]

When another processor tries to read from this location in cache, it receives a cache miss error and will have to refresh from main memory.


Synchronization Required for Shared Data

• Mutual exclusion
  – Definition: At most one process can be engaged in an activity at any time.
  – Example: Only one processor can write to the same address in main memory at the same time.
  – We say that a process must mutually exclude all others while it performs this write.

• Barrier synchronization
  – Definition: Guarantees that no process will proceed beyond a designated point (called the barrier) until every process reaches that point.
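A small illustrative Python sketch of these two primitives in software form (ours, not from the slides; the slides are about shared-memory hardware, this just shows the same ideas with threads):

    # Mutual exclusion and barrier synchronization with Python threads.
    import threading

    counter = 0
    lock = threading.Lock()                     # mutual exclusion
    barrier = threading.Barrier(4)              # barrier for 4 threads

    def worker():
        global counter
        with lock:                              # at most one thread updates at a time
            counter += 1
        barrier.wait()                          # nobody proceeds until all 4 arrive
        # ... code here runs only after every thread has checked in

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)                              # 4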


Distributed Multiprocessor

• Distributes primary memory among processors
• Increases aggregate memory bandwidth and lowers average memory access time
• Allows a greater number of processors
• Also called a non-uniform memory access (NUMA) multiprocessor
  – Local memory access time is fast
  – Non-local memory access time can vary
  – The distributed memories form one logical address space


Distributed Multiprocessors


Cache Coherence

• Some NUMA multiprocessors do not support it in hardware
  – Only instructions and private data are stored in the cache
  – This policy creates a large memory access time variance

• Implementation is more difficult
  – No shared memory bus to "snoop"
  – A directory-based protocol is needed


Directory-based Protocol

• Distributed directory contains information about cacheable memory blocks

• One directory entry for each cache block

• Each entry has– Sharing status– Which processors have copies


Sharing Status

• Uncached -- denoted by "U"
  – Block not in any processor's cache

• Shared -- denoted by "S"
  – Cached by one or more processors
  – Read only

• Exclusive -- denoted by "E"
  – Cached by exactly one processor
  – That processor has written to the block
  – The copy in memory is obsolete
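A compact illustrative sketch (ours, not from the slides; class and method names are hypothetical) of a directory entry and the read/write handling that the step-by-step slides below walk through:

    # Toy directory entry for one memory block, using the slide's states
    # "U" (uncached), "S" (shared), "E" (exclusive). Illustrative only.

    class DirectoryEntry:
        def __init__(self, num_cpus):
            self.state = "U"
            self.present = [0] * num_cpus       # bit vector: which CPUs hold a copy

        def handle_read(self, cpu):
            if self.state == "E":
                owner = self.present.index(1)   # owner switches to shared, writes back
                print(f"CPU {owner}: switch to shared, write block back")
            self.state = "S"
            self.present[cpu] = 1

        def handle_write(self, cpu):
            for other, bit in enumerate(self.present):
                if bit and other != cpu:        # invalidate / take away other copies
                    print(f"CPU {other}: invalidate copy")
            self.present = [0] * len(self.present)
            self.present[cpu] = 1
            self.state = "E"

    # Mirroring the next slides: CPU 0 reads, CPU 2 reads, then CPU 0 writes.
    x = DirectoryEntry(3)
    x.handle_read(0); x.handle_read(2)
    x.handle_write(0)
    print(x.state, x.present)                   # E [1, 0, 0]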

Directory-based Protocol - step 1

[Figure: CPUs 0, 1, and 2, each with its own cache, local memory, and directory, connected by an interconnection network.]

X has value 7 – step 2

[Figure: X = 7 in CPU 0's local memory; the directory entry for X is U (uncached) with bit vector 0 0 0.]

CPU 0 Reads X – step 3

[Figure: CPU 0's read of X misses in its cache; a read miss message goes over the network to the directory, which still shows X as U 0 0 0.]

CPU 0 Reads X – step 4

[Figure: the directory entry for X changes to S (shared) with bit vector 1 0 0.]

CPU 0 Reads X – step 5

[Figure: the block containing X = 7 is delivered to CPU 0's cache.]

CPU 2 Reads X – step 6

[Figure: CPU 2's read of X misses; a read miss message goes to the directory (still S 1 0 0).]

CPU 2 Reads X – step 7

[Figure: the directory entry for X becomes S with bit vector 1 0 1.]

CPU 2 Reads X – step 8

[Figure: X = 7 is now cached by both CPU 0 and CPU 2.]

CPU 0 Writes 6 to X – step 9

[Figure: CPU 0 sends a write miss message to the directory; X is still S 1 0 1, with the value 7 cached by CPU 0 and CPU 2.]

CPU 0 Writes 6 to X – step 10

[Figure: the directory sends an invalidate message for CPU 2's copy of X.]


115

CPU 0 Writes 6 to X – step 11

[Diagram: CPU 2's copy is invalidated; the directory entry becomes "E 1 0 0" (exclusive, owned by CPU 0); CPU 0's cache now holds X = 6; memory still holds the obsolete value 7]


116

CPU 1 Reads X – step 12

[Diagram: CPU 1's read of X misses, so a read-miss message is sent to the directory; directory "E 1 0 0"; CPU 0's cache holds X = 6; memory still 7]


117

CPU 1 Reads X – step 13

[Diagram: the directory sends CPU 0 a "switch to shared" message, asking it to give up exclusive access and supply the current value; directory still "E 1 0 0"; CPU 0's cache holds X = 6; memory still 7]


118

CPU 1 Reads X – step 14

[Diagram: CPU 0 supplies the current value, and memory is updated to X = 6; directory still "E 1 0 0"; CPU 0's cache holds X = 6]


119

CPU 1 Reads X – step 15

[Diagram: the directory entry becomes "S 1 1 0" and the block X = 6 is delivered to CPU 1's cache; CPU 0 and CPU 1 both cache X = 6; memory X = 6]


120

CPU 2 Writes 5 to X – step 16

[Diagram: CPU 2's write sends a write-miss message to the directory; directory still "S 1 1 0"; CPU 0 and CPU 1 cache X = 6; memory X = 6]


121

CPU 2 Writes 5 to X – step 17

[Diagram: the directory sends invalidate messages to the sharers, CPU 0 and CPU 1; directory still "S 1 1 0"; both caches still hold X = 6]


122

CPU 2 Writes 5 to X – step 18

[Diagram: the copies at CPU 0 and CPU 1 are invalidated; the directory entry becomes "E 0 0 1" (exclusive, owned by CPU 2); CPU 2's cache holds X = 5; memory still holds the obsolete value 6]


123

CPU 0 Writes 4 to X – step 19

[Diagram: CPU 0's write sends a write-miss message to the directory; directory "E 0 0 1"; CPU 2's cache holds X = 5; memory still 6]


124

CPU 0 Writes 4 to X – step 20

[Diagram: the directory sends a "take away" message to CPU 2, the current owner, and records CPU 0 as the new owner ("E 1 0 0"); CPU 2's cache still holds X = 5; memory still 6]


125

CPU 0 Writes 4 to X – step 21

[Diagram: CPU 2 writes the block containing X = 5 back over the network, so memory is updated to X = 5; the directory entry remains in the exclusive state]


126

CPU 0 Writes 4 to X – step 22

[Diagram: CPU 2's copy is gone; directory "E 1 0 0" (CPU 0 is the owner); memory holds X = 5]


127

CPU 0 Writes 4 to X – step 23

[Diagram: CPU 0 creates cache block storage for X and receives the block with value 5; directory "E 1 0 0"; memory X = 5]


128

CPU 0 Writes 4 to X – step 24

[Diagram: CPU 0 performs its write, so its cache now holds X = 4; directory "E 1 0 0"; memory still holds X = 5]


129

CPU 0 Writes Back X Block – step 25

[Diagram: CPU 0 sends a data write-back message carrying X = 4 to the home node; directory "E 1 0 0"; CPU 0's cache holds X = 4; memory still holds X = 5]


130

CPU 0 Flushes Cache Block X – step 26

[Diagram: CPU 0's copy is flushed; memory now holds X = 4; the directory entry returns to "U 0 0 0" (uncached)]
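The walkthrough above can be summarized as a small state machine run by the home node's directory controller. The following is a minimal sketch in C of the read-miss and write-miss transitions; the function and type names are hypothetical, and the message sends appear only as comments, since the actual protocol messages depend on the machine and its interconnection network:

    #include <stdio.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } status_t;   /* U, S, E */

    typedef struct {
        status_t status;
        int      sharers[3];   /* bit vector: 1 if CPU i holds a copy */
    } dir_entry_t;

    void handle_read_miss(dir_entry_t *e, int cpu) {
        if (e->status == EXCLUSIVE) {
            /* ask the owner to switch to shared and write the
               current value back, so memory is up to date       */
        }
        e->status = SHARED;
        e->sharers[cpu] = 1;      /* add the requester to the bit vector */
        /* deliver the block from memory to the requesting CPU's cache   */
    }

    void handle_write_miss(dir_entry_t *e, int cpu) {
        for (int i = 0; i < 3; i++) {
            if (e->sharers[i] && i != cpu) {
                /* SHARED: send an invalidate message;
                   EXCLUSIVE: send a "take away" message and have
                   the owner write the block back first           */
                e->sharers[i] = 0;
            }
        }
        e->status = EXCLUSIVE;
        e->sharers[cpu] = 1;      /* the requester becomes the sole owner */
        /* deliver the block to the requesting CPU's cache                */
    }

    int main(void) {
        dir_entry_t x = { UNCACHED, { 0, 0, 0 } };  /* step 2:     X U 0 0 0 */
        handle_read_miss(&x, 0);                    /* steps 3-5:  X S 1 0 0 */
        handle_read_miss(&x, 2);                    /* steps 6-8:  X S 1 0 1 */
        handle_write_miss(&x, 0);                   /* steps 9-11: X E 1 0 0 */
        printf("status=%d sharers=%d%d%d\n",
               x.status, x.sharers[0], x.sharers[1], x.sharers[2]);
        return 0;
    }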


131

Characteristics of Multiprocessors

• Interprocessor communication is done through the memory interface, by read and write instructions.
• Memory may be physically distributed, and reads and writes from different processors may take different amounts of time.
• Congestion and hot spots in the interconnection network may occur.
• Memory latency (i.e., the time to complete a read or write) may be long and variable.
• Most messages through the bus or interconnection network are the size of a single memory word.
• Randomization of requests may be used to reduce the probability of collisions.


132

Multicomputers

• Distributed memory multiple-CPU computer

• Same address on different processors refers to different physical memory locations

• Processors interact through message passing


133

Typically, Two Flavors of Multicomputers

• Commercial multicomputers
  – Custom switch network
  – Low latency (the time it takes to send a message)
  – High bandwidth (data path width) across processors

• Commodity clusters
  – Mass-produced computers, switches, and other equipment
  – Use low-cost components
  – Message latency is higher
  – Communications bandwidth is lower


134

Multicomputer Communication

• Processors are connected by an interconnection network.
• Each processor has a local memory and can only access its own local memory.
• Data is passed between processors using messages, as required by the program.
• Data movement across the network is also asynchronous.
  – A common approach is to use MPI to handle message passing.
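As an illustration of this message-passing style, here is a minimal MPI program in C in which process 0 sends one integer to process 1; the tag, variable names, and value are arbitrary choices for the example:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, x = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

        if (rank == 0) {
            x = 7;
            /* the data is copied into a message, not shared */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received x = %d\n", x);
        }

        MPI_Finalize();
        return 0;
    }

The program must be started with at least two processes (e.g., mpirun -np 2).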


135

Multicomputer Communications (cont)

• Multicomputers can be scaled to larger sizes much more easily than multiprocessors.

• The amount of data transmitted between processors has a huge impact on performance.
  – The distribution of the data among the processors is a very important factor in performance efficiency.


136

Message-Passing Advantages

• No problem with simultaneous access to data.

• Allows different PCs to operate on the same data independently.

• Allows PCs on a network to be easily upgraded when faster processors become available.


137

Disadvantages of Message-Passing

• Programmers must make explicit message-passing calls in the code

• This is low-level programming and is error prone.

• Data is not shared but copied, which increases the total data size.

• Data integrity
  – Difficulty in maintaining correctness of multiple copies of a data item


138

Some Interconnection Network Terminology (1/2)

References: Wilkinson et al. and Grama et al. Also, earlier slides on architecture & networks.

• A link is the connection between two nodes.
• Each node is assumed to contain a switch that routes packets through the node to other nodes without disturbing the processor.
• The link between two nodes can either be bidirectional or consist of two unidirectional links, one in each direction.
• We can assume either a single wire that carries one bit or parallel wires (one wire for each bit in a word).
• These choices do not have a major impact on the concepts presented in this course.


139

Network Terminology (2/2)

• The bandwidth is the number of bits that can be transmitted in unit time (i.e., bits per second).

• The network latency is the time required to transfer a message through the network.
  – The communication latency is the total time required to send a message, including software overhead and interface delay.
  – The message latency, or startup time, is the time required to send a zero-length message.
    • Includes software & hardware overhead, such as choosing a route and packing and unpacking the message.
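As a worked example with illustrative numbers (not from the slides): with a startup time of 50 microseconds and a bandwidth of 1 Gbit/s, sending a 1 KB (8192-bit) message takes roughly 50 + 8192/1000 ≈ 58 microseconds, so for short messages the startup time dominates the total communication latency.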


140

Circuit Switching Message Passing

• The technique establishes a path and allows the entire message to transfer uninterrupted.
• Similar to a telephone connection that is held until the end of the call.
• Links used are not available to other messages until the transfer is complete.
• Latency (message transfer time): if the control packet sent to establish the path is small wrt (with respect to) the message length, the latency is essentially the constant L/B, where L is the message length and B is the bandwidth.


141

Store-and-forward Packet Switching

• The message is divided into "packets" of information.
• Each packet includes source and destination addresses.
• Packets cannot exceed a fixed maximum size (e.g., 1000 bytes).
• A packet is stored in a buffer at a node until it can move on to the next node.
• Different packets typically follow different routes but are re-assembled at the destination as they arrive.
• Movement of packets is asynchronous.
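To make the packet format concrete, a minimal sketch in C of the information a packet might carry (the field names and the 1000-byte maximum payload are illustrative assumptions):

    #define MAX_PAYLOAD 1000   /* fixed maximum packet size, e.g., 1000 bytes */

    typedef struct {
        int  src;                   /* source node address                  */
        int  dest;                  /* destination node address             */
        int  seq;                   /* sequence number, used to re-assemble
                                       packets that arrive out of order     */
        int  length;                /* number of payload bytes in use       */
        char payload[MAX_PAYLOAD];  /* the data being carried               */
    } packet_t;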


142

Packet Switching (cont)

• At each node, the destination information is examined and used to select which node to forward the packet to.

• Routing algorithms (often probabilistic) are used to avoid hot spots and to minimize traffic jams.

• Significant latency is created by storing each packet in each node it reaches.

• Latency increases linearly with the length of the route.


143

Virtual Cut-Through Packet Switching

• Used to reduce the latency.

• Allows a packet to pass through a node without being stored, if the outgoing link is available.

• If the complete path is available, a message can move immediately from source to destination.


144

Wormhole Routing

• An alternative to store-and-forward packet routing

• A message is divided into small units called flits (flow control units).

• Flits are 1–2 bytes in size.

• Flits can be transferred in parallel on links with multiple wires.

• Only the head flit is initially transferred, when the next link becomes available.


145

Wormhole Routing (cont)

• As each flit moves forward, the next flit can move forward.

• The entire path must be reserved for a message, as its flits pull each other along (like the cars of a train).

• Request/acknowledge bit messages are required to coordinate these pull-along moves.
  – See Wilkinson et al.

• Latency: if the head flit is very small compared to the length of the message, then the latency is essentially the constant L/B, with L the message length and B the link bandwidth.
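The practical difference from store-and-forward routing shows up when the two latency models are put side by side. A small sketch in C using the usual textbook approximations (store-and-forward pays the full message time on every hop, while wormhole routing pays the per-hop cost only for the small head flit); the specific numbers are illustrative assumptions:

    #include <stdio.h>

    int main(void) {
        double L    = 8000.0;   /* message length in bits (illustrative)        */
        double B    = 1.0e9;    /* link bandwidth in bits/second (illustrative) */
        double H    = 32.0;     /* head flit / header size in bits              */
        int    hops = 4;        /* number of links on the route                 */

        /* store-and-forward: the whole message is stored at every
           intermediate node, so latency grows linearly with the route */
        double store_forward = hops * (L / B);

        /* wormhole: only the head flit pays the per-hop cost; the rest
           of the message streams behind it, so latency is about L/B   */
        double wormhole = hops * (H / B) + L / B;

        printf("store-and-forward: %.2f us\n", store_forward * 1e6);
        printf("wormhole routing:  %.2f us\n", wormhole * 1e6);
        return 0;
    }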


146

Deadlock

• Routing algorithms are needed to find a path between the nodes.
• Adaptive routing algorithms choose different paths, depending on traffic conditions.
• Livelock is a deadlock-type situation in which a packet continues to go around the network without ever reaching its destination.
• Deadlock: no packet can be forwarded because each is blocked by other stored packets waiting to be forwarded.


147

Asymmetric Multicomputers

• Has a front end that interacts with users and I/O devices.
  – Processors in the back end are used for computation.
  – Programming is similar to SIMDs (i.e., processor arrays).
  – Common with early multicomputers.

• Examples of asymmetrical multicomputers are given in the textbook.


148

Asymmetrical MC Advantages

• Back-end processors dedicated to parallel computations
  – Easier to understand, model, and tune performance

• Only a simple back-end operating system needed
  – Easy for a vendor to create


149

Asymmetrical MC Disadvantages

• Front-end computer is a single point of failure

• Single front-end computer limits scalability of system

• Primitive operating system in back-end processors makes debugging difficult

• Every application requires development of both front-end and back-end program


150

Symmetric Multicomputers

• Every computer executes the same operating system and has identical functionality.

• Users may log into any computer to edit or compile their programs.

• Any or all computers may be involved in the execution of their program.

• During execution of programs, every PE executes the same program.

• When only one PE should execute an operation, an “if” statement is used to select the PE.
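For example, in an SPMD-style MPI program in C, a test on the process rank selects which PE carries out a particular operation (a minimal, illustrative sketch):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* every PE executes this same program ... */

        if (rank == 0) {
            /* ... but only the selected PE performs this operation */
            printf("hello from PE 0\n");
        }

        MPI_Finalize();
        return 0;
    }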


151

Symmetric Multicomputers


152

Symmetrical MC Advantages

• Alleviate performance bottleneck caused by single front-end computer

• Better support for debugging

• Every processor executes same program


153

Symmetrical MC Disadvantages

• More difficult to maintain illusion of single “parallel computer”

• No simple way to balance program development workload among processors

• More difficult to achieve high performance when multiple processes run on each processor
  – Details on next slide


154

Symmetric MC Disadvantages (cont)

• (cont.) More difficult to achieve high performance when multiple processes run on each processor
  – Processes on the same processor compete for the same resources
    • CPU cycles
    • Cache space
    • Memory bandwidth
  – Increased cache misses
    • The cache is "PE oriented" instead of "process oriented"


155

Best Model for Commodity Cluster

• A full-fledged operating system (e.g., Linux) is desirable
  – A feature of symmetric multicomputers

• Desirable to increase cache hits
  – Favors having only a single user process on each PE
  – Favors most nodes being off-limits for program development

• Need a fast network
  – Keep program-development users off the network and have them access the front end by another path
  – Reserve the interconnection network for use by parallel processes

• Overall, a mixed model may be best for commodity clusters


156

Ideal Commodity Cluster Features

• Co-located computers
• Computers dedicated to running a single process at a time, to lower cache misses
• Accessible only from the network
  – No keyboards or displays
• Users access the front end by another route
• Identical operating systems
• Identical local disk images
• Administered as an entity


157

ParPar Cluster, A Mixed Model

• Mixed model

• Incorporates both asymmetrical and symmetrical designs.


158

Network of Workstations

• Dispersed computers
  – Typically located on users' desks
• First priority: response time for the person at the keyboard
• Parallel jobs wait in the background and run when spare CPU cycles are available
• Different operating systems
• Different local images
• Checkpointing and restarting are important
• Typically connected by Ethernet
  – Too slow for commodity-cluster usage


159

A Commodity Cluster vs Network of Workstations

• A commodity cluster contains components of local area networks
  – Commodity computers
  – Switches

• A network of workstations is a dispersed collection of computers
  – Distributed, heterogeneous computers
  – Located on primary users' desks
  – Unused cycles available for parallel use
  – Example: the SETI project


160

Summary

• Commercial parallel computers appeared in the 1980s

• Multiple-CPU computers now dominate

• Small-scale: Centralized multiprocessors

• Large-scale: Distributed-memory architectures
  – Multiprocessors
  – Multicomputers