
Chapter 2

Parallel Architectures


Outline

• Some chapter references

• Brief review of complexity – Terminology for comparisons

• Interconnection networks

• Processor arrays

• Multiprocessors

• Multicomputers

• Flynn’s Taxonomy – moved to Chpt 1


Some Chapter References

• Selim Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, 1989 (earlier textbook).

• G. C. Fox, What Have We Learnt from Using Real Parallel Machines to Solve Real Problems? Technical Report C3P-522, Cal Tech, December 1989. (Included in part in more recent books co-authored by Fox.)

• A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition, 2003 (first edition 1994), Addison Wesley.

• Harry Jordan, Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003, Ch 1, 3-5.

• F. Thomson Leighton; Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes; 1992; Morgan Kaufmann Publishers.


References - continued

• Gregory Pfister, In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing, 2nd Edition, Ch 2. (Discusses details of some serious problems that MIMDs incur.)

• Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004 (current textbook), Chapter 2.

• Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994, Ch. 1-2.

• Sayed H. Roosta, Parallel Processing & Parallel Algorithms: Theory and Computation, Springer Verlag, 2000, Chpt 1.

• Wilkinson & Allen, Parallel Programming: Techniques and Applications, Prentice Hall, 2nd Edition, 2005, Ch 1-2.


Brief Review Complexity Concepts Needed for Comparisons

• Whenever we define a counting function, we usually characterize the growth rate of that function in terms of complexity classes.

• Technical Definition: We say a function f(n) is in O(g(n)) if (and only if) there are positive constants c and n₀ such that

    0 ≤ f(n) ≤ c·g(n) for all n ≥ n₀

• O(n) is read as "big-oh of n".

• This notation can be used to separate counting functions into complexity classes that characterize the size of the count.

• We can use it for any kind of counting functions such as timings, bisection widths, etc.
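As a quick worked check of the definition (an added example, not on the original slide): take f(n) = 3n + 10 and g(n) = n. Then

    0 ≤ 3n + 10 ≤ 4n for all n ≥ 10,

so the constants c = 4 and n₀ = 10 witness that 3n + 10 is in O(n).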


Big-Oh and Asymptotic Growth Rate

• The big-Oh notation gives an upper bound on the (asymptotic) growth rate of a function

• The statement “f(n) is O(g(n))” means that the growth rate of f(n) is not greater than the growth rate of g(n)

• We can use the big-Oh notation to rank functions according to their growth rate

Assume:                 f(n) is O(g(n))     g(n) is O(f(n))

g(n) grows faster             Yes                 No
f(n) grows faster             No                  Yes
Same growth                   Yes                 Yes


Relatives of Big-Oh

• big-Omega
  – f(n) is Ω(g(n)) if there is a constant c > 0 and an integer constant n₀ ≥ 1 such that

    f(n) ≥ c·g(n) ≥ 0 for n ≥ n₀

  Intuitively, this says that, up to a constant factor, f(n) is asymptotically greater than or equal to g(n).

• big-Theta
  – f(n) is Θ(g(n)) if there are constants c′ > 0 and c″ > 0 and an integer constant n₀ ≥ 1 such that

    0 ≤ c′·g(n) ≤ f(n) ≤ c″·g(n) for n ≥ n₀

  Intuitively, this says that, up to constant factors, f(n) and g(n) are asymptotically the same.

Note: These concepts are covered in algorithm courses


Relatives of Big-Oh

• little-oh
  – f(n) is o(g(n)) if, for any constant c > 0, there is an integer constant n₀ ≥ 0 such that

    0 ≤ f(n) < c·g(n) for n ≥ n₀

  Intuitively, this says f(n) is, up to a constant, asymptotically strictly less than g(n), so f(n) is not Θ(g(n)).

• little-omega
  – f(n) is ω(g(n)) if, for any constant c > 0, there is an integer constant n₀ ≥ 0 such that

    f(n) > c·g(n) ≥ 0 for n ≥ n₀

  Intuitively, this says f(n) is, up to a constant, asymptotically strictly greater than g(n), so f(n) is not Θ(g(n)).

These are not used as much as the earlier definitions, but they round out the picture.


Summary for Intuition for Asymptotic Notation

big-Oh
  – f(n) is O(g(n)) if f(n) is asymptotically less than or equal to g(n)

big-Omega
  – f(n) is Ω(g(n)) if f(n) is asymptotically greater than or equal to g(n)

big-Theta
  – f(n) is Θ(g(n)) if f(n) is asymptotically equal to g(n)

little-oh
  – f(n) is o(g(n)) if f(n) is asymptotically strictly less than g(n)

little-omega
  – f(n) is ω(g(n)) if f(n) is asymptotically strictly greater than g(n)


A CALCULUS DEFINITION OF O, Θ (often easier to use)

Definition: Let f and g be functions defined on the positive integers with nonnegative values.

We say g is in O(f) if and only if

    lim (n→∞) g(n)/f(n) = c

for some nonnegative real number c, i.e., the limit exists and is not infinite.

Definition: We say f is in Θ(g) if and only if

    f is in O(g) and g is in O(f)

Note: Often use L'Hopital's Rule to calculate the limits you need.
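As an added worked example of this definition (not on the original slide): to compare f(n) = 3n³ and g(n) = 19,500,000n,

    lim (n→∞) g(n)/f(n) = lim (n→∞) 19,500,000 / (3n²) = 0,

so g is in O(f) but f is not in O(g); the linear-count program eventually wins, which is exactly the point of the TRS-80 vs. CRAY-YMP table two slides later.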


Why Asymptotic Behavior is Important

• 1) Allows us to compare counts on large sets.

• 2) Helps us understand the maximum size of input that can be handled in a given time, provided we know the environment in which we are running.

• 3) Stresses the fact that even dramatic speedups in hardware cannot overcome the handicap of an asymptotically slow algorithm.


Recall: ORDER WINS OUT (Example from Baase's Algorithms Text)

The TRS-80

Main language support: BASIC - typically a slow-running interpreted language

For more details on the TRS-80 see:

http://mate.kjsl.com/trs80/

The CRAY-YMP

Language used in example: FORTRAN - a fast-running language

For more details on the CRAY-YMP see:

http://ds.dial.pipex.com/town/park/abm64/CrayWWWStuff/Cfaqp1.html#TOC3


                  CRAY YMP                    TRS-80
                  with FORTRAN                with BASIC
                  complexity is 3n³           complexity is 19,500,000n

n = 10            3 microsec                  200 millisec
n = 100           3 millisec                  2 sec
n = 1000          3 sec                       20 sec
n = 2500          50 sec                      50 sec
n = 10000         49 min                      3.2 min
n = 1000000       95 years                    5.4 hours

microsecond (abbr. µsec): one-millionth of a second.
millisecond (abbr. msec): one-thousandth of a second.
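A quick added check of where the two columns cross (not on the original slide): the table is consistent with both counts being running times in nanoseconds, and setting 3n³ = 19,500,000n gives n² = 6,500,000, i.e. n ≈ 2550, matching the row near n = 2500 where both machines take about 50 seconds. Beyond that point the asymptotically faster linear-time program on the slow TRS-80 wins.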


Interconnection Networks

• Uses of interconnection networks
  – Connect processors to shared memory
  – Connect processors to each other

• Interconnection media types
  – Shared medium
  – Switched medium

• Different interconnection networks define different parallel machines.

• The interconnection network's properties influence the type of algorithm used for various machines, as it affects how data is routed.


Shared versus Switched Media

a. With shared medium, one message is sent & all processors listen

b. With switched medium, multiple messages are possible.


Shared Medium

• Allows only one message at a time

• Messages are broadcast

• Each processor "listens" to every message

• Before sending a message, a processor "listens" until the medium is unused

• Collisions require resending of messages

• Ethernet is an example


Switched Medium

• Supports point-to-point messages between pairs of processors

• Each processor is connected to one switch

• Advantages over shared media
  – Allows multiple messages to be sent simultaneously
  – Allows scaling of the network to accommodate the increase in processors


Switch Network Topologies

• View switched network as a graph
  – Vertices = processors or switches
  – Edges = communication paths

• Two kinds of topologies
  – Direct
  – Indirect


Direct Topology

• Ratio of switch nodes to processor nodes is 1:1

• Every switch node is connected to
  – 1 processor node
  – At least 1 other switch node

Indirect Topology

• Ratio of switch nodes to processor nodes is greater than 1:1
• Some switches simply connect to other switches


Terminology for Evaluating Switch Topologies

• We need to evaluate 4 characteristics of a network in order to understand its effectiveness in implementing efficient parallel algorithms on a machine with that network.

• These are
  – The diameter
  – The bisection width
  – The edges per node
  – The constant edge length

• We’ll define these and see how they affect algorithm choice.

• Then we will investigate several different topologies and see how these characteristics are evaluated.


Terminology for Evaluating Switch Topologies

• Diameter
  – Largest distance between two switch nodes.
  – A low diameter is desirable.
  – It puts a lower bound on the complexity of parallel algorithms that require communication between arbitrary pairs of nodes.


Terminology for Evaluating Switch Topologies

• Bisection width
  – The minimum number of edges between switch nodes that must be removed in order to divide the network into two halves (within one node, if the number of nodes is odd).

• High bisection width is desirable.

• In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the algorithm.

• Actually proving what the bisection width of a network is can be quite difficult.


Terminology for Evaluating Switch Topologies

• Number of edges per node
  – It is best if the maximum number of edges per node is a constant independent of network size, as this allows the processor organization to scale more easily to a larger number of nodes.
  – Degree is the maximum number of edges per node.

• Constant edge length? (yes/no)
  – Again, for scalability, it is best if the nodes and edges can be laid out in 3-D space so that the maximum edge length is a constant independent of network size.


Evaluating Switch Topologies

• Many have been proposed and analyzed. We will consider several well-known ones:
  – 2-D mesh
  – linear network
  – binary tree
  – hypertree
  – butterfly
  – hypercube
  – shuffle-exchange

• Several of these (highlighted in yellow on the original slide) have been used in commercial parallel computers.


2-D Meshes

Note: Circles represent switches and squares represent processors in all these slides.


2-D Mesh Network

• Direct topology

• Switches arranged into a 2-D lattice or grid

• Communication allowed only between neighboring switches

• Torus: Variant that includes wraparound connections between switches on edge of mesh


Evaluating 2-D Meshes (assumes the mesh is square)

n = number of processors

• Diameter: Θ(n^(1/2))
  – Places a lower bound on algorithms that require processing with arbitrary nodes sharing data.

• Bisection width: Θ(n^(1/2))
  – Places a lower bound on algorithms that require distribution of data to all nodes.

• Max number of edges per switch: 4 is the degree

• Constant edge length? Yes

• Does this scale well? Yes


Linear Network

• Switches arranged into a 1-D mesh
• Direct topology
• Corresponds to a row or column of a 2-D mesh
• Ring: a variant that allows a wraparound connection between the switches on the ends
• The linear and ring networks have many applications
• Essentially supports a pipeline in both directions
• Although these networks are very simple, they support many optimal algorithms


Evaluating Linear and Ring Networks

• Diameter
  – Linear: n − 1, or Θ(n)
  – Ring: n/2, or Θ(n)

• Bisection width:
  – Linear: 1, or Θ(1)
  – Ring: 2, or Θ(1)

• Degree for switches:– 2

• Constant edge length? – Yes

• Does this scale well?– Yes


Binary Tree Network

• Indirect topology

• n = 2^d processor nodes and 2n − 1 switches, where d = 0, 1, ... is the number of levels

  e.g., 2³ = 8 processors on the bottom and 2n − 1 = 2(8) − 1 = 15 switches


Evaluating Binary Tree Network

• Diameter:
  – 2 log n, or O(log n)
  – Note: this is small

• Bisection width: – 1, the lowest possible number

• Degree: – 3

• Constant edge length? – No

• Does this scale well?– No


Hypertree Network (of degree 4 and depth 2)

(a) Front view: 4-ary tree of height 2

(b) Side view: upside down binary tree of height d

(c) Complete network


Hypertree Network

• Indirect topology
• Note: the degree k and the depth d must be specified.
• This gives, from the front, a k-ary tree of height d.
• From the side, the same network looks like an upside-down binary tree of height d.
• Joining the front and side views yields the complete network.


Evaluating 4-ary Hypertree with Depth d

• A 4-ary hypertree has n = 4^d processors
  – The general formula for a k-ary hypertree is n = k^d

• Diameter is 2d, i.e. Θ(log n)
  – Shares the low diameter of a binary tree

• Bisection width = 2^(d+1)
  – Note: here, 2^(d+1) = 2³ = 8
  – A large value - much better than a binary tree

• Constant edge length? No

• Degree = 6


Butterfly Network

• Indirect topology
• n = 2^d processor nodes connected by n(log n + 1) switching nodes

[Figure: a butterfly network for d = 3 - processors 0 through 7 along the top and four ranks (rows) of switches, rank 0 through rank 3, with switches labeled (rank, column) for columns 0-7.]

A 2³ = 8 processor butterfly network with 8 × 4 = 32 switching nodes.

As complicated as this switching network appears to be, it is really quite simple as it admits a very nice routing algorithm!

The rows of switches are called ranks. Wrapped butterfly: the variant in which the top and bottom ranks are merged into a single rank.


Building the 2³ Butterfly Network

• There are 8 processors.
• There are 4 ranks (i.e. rows) with 8 switches per rank.
• Connections:
  – Node(i, j), for i > 0, is connected to two nodes on rank i − 1, namely node(i − 1, j) and node(i − 1, m), where m is the integer found by flipping the i-th most significant bit in the binary d-bit representation of j.
  – For example, suppose i = 2 and j = 3. Then node(2, 3) is connected to node(1, 3).
  – To get the other connection, 3 = 011 in binary. Flipping the 2nd most significant bit gives 001, so node(2, 3) is also connected to node(1, 1). (NOTE: there is an error on pg 32 of the text on this example.)

• Nodes connected by a cross edge from rank i to rank i + 1 have node numbers that differ only in their (i + 1)-st bit. (A small sketch of this connection rule follows below.)
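A minimal Python sketch of the connection rule just described (ours, not from the slides; the function name is hypothetical):

    # Butterfly connection rule (illustrative only). For 2**d processor columns,
    # node (i, j) with i > 0 connects down to rank i-1 at column j (straight edge)
    # and at column j with its i-th most significant bit flipped (cross edge).

    def butterfly_neighbors(i, j, d):
        """Return the two rank-(i-1) switches that node (i, j) connects to."""
        assert 0 < i <= d and 0 <= j < 2 ** d
        flipped = j ^ (1 << (d - i))      # flip the i-th most significant of d bits
        return (i - 1, j), (i - 1, flipped)

    # Example from the slide: d = 3, node (2, 3) connects to (1, 3) and (1, 1).
    print(butterfly_neighbors(2, 3, 3))   # ((1, 3), (1, 1))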


Why It Is Called a Butterfly Network

• Walk cycles such as node(i, j), node(i−1, j), node(i, m), node(i−1, m), node(i, j), where m is determined by the bit flipping as shown, and you "see" a butterfly:


Butterfly Network Routing

Send a message from processor 2 to processor 5.

Algorithm (sketched in code below): 0 means ship left; 1 means ship right.
1) 5 = 101. Pluck off the leftmost bit, 1, and send "01 msg" to the right.
2) Pluck off the leftmost bit, 0, and send "1 msg" to the left.
3) Pluck off the leftmost bit, 1, and send "msg" to the right.

Each cross edge followed changes address by 1 bit.
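A small illustrative sketch of the bit-plucking rule just described (ours, not from the slides):

    # Butterfly routing (illustrative only): the message carries the destination
    # address; at each rank the leftmost remaining bit is plucked off,
    # 0 meaning "go left/straight" and 1 meaning "go right/cross".

    def butterfly_route(dest, d):
        """Yield the direction taken at each of the d routing steps."""
        bits = format(dest, f"0{d}b")      # destination as a d-bit string
        for b in bits:                     # pluck bits from the left
            yield "right" if b == "1" else "left"

    # Example from the slide: routing to processor 5 = 101 gives right, left, right.
    print(list(butterfly_route(5, 3)))     # ['right', 'left', 'right']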


Evaluating the Butterfly Network with n Processors

• Diameter:
  – log n

• Bisection width:
  – n / 2 * (likely an error; 32/2 = 16)

• Degree:
  – 4 (even for d > 3)

• Constant edge length?
  – No; edge length grows exponentially as the rank size increases


* On pg 442, Leighton gives Θ(n / log n) as the bisection width. Simply remove the cross edges between two successive ranks to create a bisection cut.


Hypercube (also called binary n-cube)

[Figure: a 4-dimensional hypercube - 16 processor/switch nodes labeled with the 4-bit addresses 0000 through 1111, each node joined to the four nodes whose addresses differ from its own in one bit.]

A hypercube with n = 2^d processors & switches, shown for d = 4.


Hypercube (or Binary n-cube) with n = 2^d Processors


• Direct topology
• 2 × 2 × … × 2 mesh
• Number of nodes is a power of 2
• Node addresses are 0, 1, …, n − 1
• Node i is connected to the d = log n nodes whose addresses differ from i in exactly one bit position (see the sketch below)
• Example: node 0111 is connected to 1111, 0011, 0101, and 0110
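A tiny illustrative sketch of that neighbor rule (ours, not from the slides):

    # Hypercube neighbor rule (illustrative only): node i is adjacent to every
    # node obtained by flipping exactly one of its d address bits.

    def hypercube_neighbors(i, d):
        return [i ^ (1 << b) for b in range(d)]

    # Example from the slide: node 0111 in a d = 4 hypercube.
    print([format(x, "04b") for x in hypercube_neighbors(0b0111, 4)])
    # ['0110', '0101', '0011', '1111']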


Growing a Hypercube

Note: For d = 4, it is called a 4-D hypercube, or just a 4-cube.


Evaluating Hypercube Network with n = 2^d nodes


• Diameter: d = log n

• Bisection width: n / 2

• Edges per node: log n

• Constant edge length?
  – No. The length of the longest edge increases as n increases.


Routing on the Hypercube Network


• Example: Send a message from node 2 = 0010 to node 5 = 0101.
• The nodes differ in 3 bits, so the shortest path will be of length 3.
• One path is 0010 → 0110 → 0100 → 0101, obtained by flipping one of the differing bits at each step (a small sketch follows below).

• As with the butterfly network, bit flipping helps you route on this network.
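A short illustrative sketch of this bit-flipping route (ours, not from the slides):

    # Hypercube routing by bit flipping (illustrative only): at each step, flip
    # one bit in which the current address still differs from the destination.

    def hypercube_route(src, dst, d):
        path, cur = [src], src
        for b in reversed(range(d)):       # fix bits from most to least significant
            if (cur ^ dst) & (1 << b):
                cur ^= 1 << b
                path.append(cur)
        return path

    # Example from the slide: 0010 -> 0110 -> 0100 -> 0101.
    print([format(x, "04b") for x in hypercube_route(0b0010, 0b0101, 4)])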


A Perfect Shuffle

• A permutation produced as follows is called a perfect shuffle:

• Given a number of cards that is a power of 2, numbered 0, 1, 2, ..., 2^d − 1, write each card number with d bits. Left-rotating the bits (with wraparound) gives the position of the card after the perfect shuffle.

• Example: For d = 3, card 5 = 101. Left-rotating and wrapping gives 011, so card 5 goes to position 3. Note that card 0 = 000 and card 7 = 111 stay in position. (A one-line sketch of the rotation follows below.)
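A one-line illustrative sketch of the left rotation (ours, not from the slides); the same operation is the LeftCycle(i) shuffle link used by the shuffle-exchange network on the next slides:

    # Perfect shuffle as a left rotation of a d-bit address (illustrative only).

    def left_cycle(i, d):
        return ((i << 1) | (i >> (d - 1))) & ((1 << d) - 1)

    # Example from the slide: card 5 = 101 goes to position 3 = 011 when d = 3.
    print(left_cycle(5, 3))    # 3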


Shuffle-exchange Network with n = 2^d Processors

[Figure: an 8-node shuffle-exchange network, nodes numbered 0 through 7.]

• Direct topology
• Number of nodes is a power of 2
• Nodes have addresses 0, 1, …, 2^d − 1
• Two outgoing links from node i:
  – Shuffle link to node LeftCycle(i)
  – Exchange link between node i and node i + 1, when i is even


Shuffle-exchange Addressing – 16 processors

[Figure: a 16-node shuffle-exchange network with nodes labeled by the 4-bit addresses 0000 through 1111.]

No arrows on line segment means it is bidirectional. Otherwise, you must follow the arrows.

Devising a routing algorithm for this network is interesting and may be a homework problem.


Evaluating the Shuffle-exchange


• Diameter: 2 log n − 1

• Edges per node: 3

• Constant edge length? No

• Bisection width: Θ(n / log n), between n / (2 log n) and 2n / log n *

* See Leighton pg 480


Two Problems with Shuffle-Exchange

• Shuffle-exchange does not expand well
  – A large shuffle-exchange network does not decompose well into smaller, separate shuffle-exchange networks.
  – In a large shuffle-exchange network, a small percentage of nodes will be hot spots
    • They will encounter much heavier traffic

• The above results are in the dissertation of one of Batcher's students.


Comparing Networks (See Table 2.1)

• All have logarithmic diameter except the 2-D mesh

• The 4-ary hypertree, butterfly, and hypercube have bisection width n / 2 (likely untrue for the butterfly; see the earlier note citing Leighton)

• All have a constant number of edges per node except the hypercube

• Only the 2-D mesh, linear, and ring topologies keep edge lengths constant as network size increases

• Shuffle-exchange is a good compromise: a fixed number of edges per node, low diameter, and good bisection width.
  – However, the negative results on the preceding slide also need to be considered.


Alternate Names for SIMDs

• Recall that all active processors of a true SIMD computer must simultaneously access the same memory location.

• The value in the i-th processor can be viewed as the i-th component of a vector.

• SIMD machines are sometimes called vector computers [Jordan,et.al.] or processor arrays [Quinn 94,04] based on their ability to execute vector and matrix operations efficiently.


SIMD Computers

• SIMD computers that focus on vector operations– Support some vector and possibly matrix

operations in hardware– Usually limit or provide less support for non-

vector type operations involving data in the “vector components”.

• General purpose SIMD computers– May also provide some vector and possibly

matrix operations in hardware.– More support for traditional type operations

(e.g., other than for vector/matrix data types).


Pipelined Architectures

• Pipelined architectures are sometimes considered to be SIMD architectures
  – See pg 37 of the textbook & pg 8-9 of Jordan et al.
  – Vector components are entered successively into the first processor in the pipeline.
  – The i-th processor of the pipeline receives the output from the (i−1)-th processor.

• Normal "operations" in each processor are much larger (coarser) in pipelined computers than in true SIMDs

• "Pipelined" execution is somewhat SIMD-like in nature in that explicit synchronization is not required.


Why Processor Arrays?

• Historically, the high cost of control units
• Scientific applications have data parallelism


Data/instruction Storage

• Front-end computer
  – Also called the control unit
  – Holds and runs the program
  – Data manipulated sequentially

• Processor array
  – Data manipulated in parallel


Processor Array Performance

• Performance: work done per time unit

• Performance of processor array– Speed of processing elements– Utilization of processing elements


Performance Example 1

• 1024 processors

• Each adds a pair of integers in 1 µsec (1 microsecond, i.e. one millionth of a second, or 10⁻⁶ second)

• What is the performance when adding two 1024-element vectors (one element pair per processor)?

    Performance = 1024 operations / 10⁻⁶ sec = 1.024 × 10⁹ ops/sec


Performance Example 2

• 512 processors

• Each adds two integers in 1 µsec

• What is the performance when adding two vectors of length 600?

• Since 600 > 512, 88 processors must add two pairs of integers.

• The other 424 processors add only a single pair of integers.
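Working this out (an added check, not shown on the slide): the 88 doubly-loaded processors need two addition steps, so the whole vector addition takes 2 × 10⁻⁶ sec, and

    Performance = 600 operations / (2 × 10⁻⁶ sec) = 3 × 10⁸ ops/sec,

well below the 1.024 × 10⁹ ops/sec of Example 1, even though only 88 of the 512 processors do double duty: the idle second step on the other 424 processors is lost utilization.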


Example of a 2-D Processor Interconnection Network in a Processor Array

Each VLSI chip has 16 processing elements.
Each PE can simultaneously send a value to a neighbor.

PE = processing element


SIMD Execution Style

• The traditional (SIMD, vector, processor array) execution style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]):
  – The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit (or sometimes the host).
  – The front end is a general-purpose CPU that stores the program and the data that is not manipulated in parallel.
  – The front end normally executes the sequential portions of the program.

• Alternately, all PEs needing computation can execute steps synchronously, avoiding the broadcast cost of distributing results.

• Each processing element has a local memory that cannot be directly accessed by the control unit or other processing elements.


SIMD Execution Style

– Collectively, the individual memories of the processing elements (PEs) store the (vector) data that is processed in parallel.

– When the front end encounters an instruction whose operand is a vector, it issues a command to the PEs to perform the instruction in parallel.

– Although the PEs execute in parallel, some units can be allowed to skip particular instructions.


Masking on Processor Arrays

• All the processors work in lockstep except those that are masked out (by setting a mask register).

• The conditional if-then-else is handled differently on processor arrays than in the sequential version (see the sketch below):
  – Every active processor tests to see if its data meets the negation of the boolean condition.
  – If it does, it sets its mask bit so those processors will not participate in the operation initially.
  – Next, the unmasked processors execute the THEN part.
  – Afterwards, the mask bits (for the original set of active processors) are flipped, and the newly unmasked processors perform the ELSE part.
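To make the masking idea concrete, a small illustrative sketch (ours, not from the slides) using NumPy boolean masks as a software stand-in for the hardware mask registers:

    # Data-parallel masking with NumPy (a software analogy to the mask registers
    # on a processor array, not the actual SIMD hardware).
    import numpy as np

    x = np.array([-3, 1, 4, -1, 5, -9])

    mask = x < 0                 # processors whose data meets the condition
    x[mask] = -x[mask]           # "THEN" part runs only where the mask is set
    x[~mask] = x[~mask] * 10     # flip the mask; "ELSE" part runs on the rest

    print(x)                     # [ 3 10 40  1 50  9]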


if (COND) then A else B

[Three figure slides, not reproduced in this transcript, step through the masking process for this statement.]




SIMD Machines

• An early SIMD computer designed for vector and matrix processing was the Illiac IV computer
  – Initial development at the University of Illinois, 1965-70
  – Moved to NASA Ames; completed in 1972 but not fully functional until 1976
  – See Jordan et al., pg 7, and Wikipedia

• The MPP, DAP, the Connection Machines CM-1 and CM-2, and MasPar's MP-1 and MP-2 are examples of SIMD computers
  – See Akl pg 8-12 and [Quinn, 94]

• The CRAY-1 and the Cyber-205 use pipelined arithmetic units to support vector operations and are sometimes called pipelined SIMDs
  – See [Jordan, et al, pg 7], [Quinn 94, pg 61-2], and [Quinn 2004, pg 37]


SIMD Machines

• Quinn [1994, pg 63-67] discusses the CM-2 Connection Machine (with 64K PEs) and the smaller & updated CM-200.

• Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO
  – The ASPRO is a small, second-generation STARAN used by the Navy in its spy planes.

• Professor Batcher is best known architecturally for the MPP, which is held by the Smithsonian Institution and currently displayed at a D.C. airport.


Today’s SIMDs

• SIMD functionality is sometimes embedded in sequential machines.

• Others are being built as part of hybrid architectures.

• Some SIMD and SIMD-like features are included in some multi/many-core processing units.

• Some SIMD-like architectures have been built as special-purpose machines, although some of these could qualify as general purpose.
  – Some of this work has been proprietary.
  – The fact that a parallel computer is SIMD or SIMD-like is often not advertised by the company building it.


A Company that Recently Built an Inexpensive SIMD

• ClearSpeed produced a COTS (commodity off-the-shelf) SIMD board
  – WorldScape has developed some defense and commercial applications for this computer.

• Not a traditional SIMD, as the hardware doesn't tightly synchronize the execution of instructions.
  – The hardware design supports efficient synchronization.

• This machine is programmed like a SIMD.

• The U.S. Navy observed that these machines process radar an order of magnitude faster than others.

• Earlier, quite a bit of information about this was at www.wscape.com and www.clearspeed.com


An Example of a Hybrid SIMD

• Embedded Massively Parallel Accelerators
  – Fuzion 150: 1536 processors on a single chip
  – Systola 1024: PC add-on board with 1024 processors
  – Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan

(This and the next three slides are due to Prabhakar R. Gudla (U of Maryland), at a CMSC 838T presentation, 4/23/2003.)


Hybrid Architecture

[Figure: a hybrid computer consisting of sixteen Systola 1024 boards connected by a high-speed Myrinet switch.]

• Combines the SIMD and MIMD paradigms within one parallel architecture.


Architecture of Systola 1024

[Figure: the Systola 1024 board - interface processors, the ISA (instruction systolic array), RAM NORTH and RAM WEST, and a controller with program memory, attached to the host computer bus.]

• Instruction Systolic Array:
  – 32 × 32 mesh of processing elements
  – wavefront instruction execution


SIMDs Embedded in SISDs

• Intel's Pentium 4 included what they call MMX technology to gain a significant performance boost

• IBM and Motorola incorporated the technology into their G4 PowerPC chip in what they call their Velocity Engine.

• Both MMX technology and the Velocity Engine are the chip manufacturer's name for their proprietary SIMD processors and parallel extensions to their operating code.

• This same approach is used by NVidia and Evans & Sutherland to dramatically accelerate graphics rendering.


Special Purpose SIMDs in the Bioinformatics Arena

• Paracel
  – Acquired by Celera Genomics in 2000
  – Products include the sequence supercomputer GeneMatcher, which has a high-throughput sequence analysis capability
    • Supports over a million processors
  – GeneMatcher was used by Celera in their race with the U.S. government to complete the description of the human genome sequence

• TimeLogic, Inc.
  – Has DeCypher, a reconfigurable SIMD


Advantages of SIMDs

• Reference: [Roosta, pg 10]

• Less hardware than MIMDs, as they have only one control unit.
  – Control units are complex.

• Less memory needed than MIMD
  – Only one copy of the instructions needs to be stored
  – Allows more data to be stored in memory.

• Less startup time in communicating between PEs.


Advantages of SIMDs (cont)

• Single instruction stream and synchronization of PEs make SIMD applications easier to program, understand, & debug.– Similar to sequential programming

• Control flow operations and scalar operations can be executed on the control unit while PEs are executing other instructions.

• MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead.


Advantages of SIMDs (cont)

• During a communication operation between PEs,
  – PEs send data to a neighboring PE in parallel and in lock step
  – No need to create a header with routing information, as "routing" is determined by the program steps
  – the entire communication operation is executed synchronously
  – SIMDs are deterministic & have a much more predictable running time
    • Can normally compute a tight (worst case) upper bound for the time for communication operations.

• Less complex hardware in SIMD, since no message decoder is needed in the PEs
  – MIMDs need a message decoder in each PE.


SIMD Shortcomings(with some rebuttals)

• Claims are from our textbook [i.e., Quinn 2004].– Similar statements are found in [Grama, et. al].

• Claim 1: Not all problems are data-parallel– While true, most problems seem to have a

data parallel solution. – In [Fox, et.al.], the observation was made in

their study of large parallel applications at national labs, that most were data parallel by nature, but often had points where significant branching occurred.


SIMD Shortcomings(with some rebuttals)

• Claim 2: Speed drops for conditionally executed branches
  – MIMD processors can execute multiple branches concurrently.
  – For an if-then-else statement with execution times for the "then" and "else" parts being roughly equal, about ½ of the SIMD processors are idle during its execution
    • With additional branching, the average number of inactive processors can become even higher.
    • With SIMDs, only one of these branches can be executed at a time.
    • This reason justifies the study of multiple SIMDs (or MSIMDs).
  – On many applications, any branching is quite shallow.


SIMD Shortcomings(with some rebuttals)

• Claim 2 (cont): Speed drops for conditionally executed code
  – In [Fox, et al.], the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8.
  – The cost of the extremely simple processors used in a SIMD is extremely low
    • Programmers used to worry about "full utilization of memory" but stopped doing so after memory cost became insignificant overall.


SIMD Shortcomings(with some rebuttals)

• Claim 3: Don't adapt to multiple users well.
  – This is true to some degree for all parallel computers.
  – If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by "sharing".
  – This reason also justifies the study of multiple SIMDs (or MSIMDs).
  – The SIMD architecture has not received the attention that MIMD has received and could greatly benefit from further research.


SIMD Shortcomings(with some rebuttals)

• Claim 4: Do not scale down well to “starter” systems that are affordable.– This point is arguable and its ‘truth’ is likely to

vary rapidly over time– ClearSpeed currently sells a very economical

SIMD board that plugs into a PC.


SIMD Shortcomings(with some rebuttals)

Claim 5: Requires customized VLSI for processors, while the expense of the control units in COTS PCs has dropped.

• Reliance on COTS (commodity, off-the-shelf) parts has dropped the price of MIMDs

• The expense of PCs (with control units) has dropped significantly

• However, reliance on COTS has fueled the success of the "low-level parallelism" provided by clusters and has restricted new, innovative parallel architecture research for well over a decade.


SIMD Shortcomings(with some rebuttals)

Claim 5 (cont.)• There is strong evidence that the period of

continual dramatic increases in speed of PCs and clusters is ending.

• Continued rapid increases in parallel performance in the future will be necessary in order to solve important problems that are beyond our current capabilities

• Additionally, with the appearance of the very economical COTS SIMDs, this claim no longer appears to be relevant.


Multiprocessors

• Multiprocessor: multiple-CPU computer with a shared memory

• Same address on two different CPUs refers to the same memory location

• Avoids three cited criticisms of SIMDs
  – Can be built from commodity CPUs
  – Naturally supports multiple users
  – Maintains efficiency in conditional code


Centralized Multiprocessor


Centralized Multiprocessor

• Straightforward extension of the uniprocessor
• Add CPUs to the bus
• All processors share the same primary memory
• Memory access time is the same for all CPUs
  – Uniform memory access (UMA) multiprocessor

• Also called a symmetrical multiprocessor (SMP)


Private and Shared Data

• Private data: items used only by a single processor

• Shared data: values used by multiple processors

• In a centralized multiprocessor (i.e. SMP), processors communicate via shared data values


Problems Associated with Shared Data

The cache coherence problem

• Replicating data across multiple caches reduces contention among processors for shared data values.

• But how can we ensure that different processors have the same value for the same address?

• The cache coherence problem arises when an obsolete value is still stored in a processor's cache.


Write Invalidate Protocol

The most common solution to cache coherence (a toy software model of the idea follows below):

1. Each CPU's cache controller monitors (snoops) the bus & identifies which cache blocks are requested by other CPUs.

2. A PE gains exclusive control of a data item before performing a "write".

3. Before the "write" occurs, all other copies of the data item cached by other PEs are invalidated.

4. When any other CPU tries to read a memory location from an invalidated cache block,
   – a cache miss occurs
   – it has to retrieve the updated data from memory
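A toy illustrative model of write-invalidate snooping (ours, not from the slides; real hardware tracks per-block states such as MSI/MESI rather than per-value entries):

    # Toy write-invalidate model (illustrative only; write-through for simplicity).

    class Cache:
        def __init__(self):
            self.data = {}                      # address -> value, only if valid

    class Bus:
        def __init__(self, memory):
            self.memory, self.caches = memory, []

        def read(self, cache, addr):
            if addr not in cache.data:          # cache miss: fetch from memory
                cache.data[addr] = self.memory[addr]
            return cache.data[addr]

        def write(self, cache, addr, value):
            for other in self.caches:           # snooping: invalidate other copies
                if other is not cache:
                    other.data.pop(addr, None)
            cache.data[addr] = value
            self.memory[addr] = value

    memory = {"X": 7}
    bus = Bus(memory)
    a, b = Cache(), Cache()
    bus.caches = [a, b]

    bus.read(a, "X"); bus.read(b, "X")          # both caches hold X = 7
    bus.write(b, "X", 2)                        # B writes: A's copy is invalidated
    print(bus.read(a, "X"))                     # A misses and re-reads 2 from memory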

Cache-coherence Problem

[Figure: CPUs A and B, each with its own cache, connected by a bus to a shared memory in which X = 7.]

Cache-coherence Problem

[Figure: CPU A reads X and now caches the value 7; memory still holds X = 7.]

Read from memory is not a problem.

Cache-coherence Problem

[Figure: CPU B also reads X; both caches now hold X = 7.]

Cache-coherence Problem

[Figure: one CPU writes X = 2; memory is updated to 2, but the other CPU's cache still holds the stale value 7.]

Write to main memory is a problem.

Write Invalidate Protocol

[Figure: both caches hold X = 7.]

A cache control monitor snoops the bus to see which cache block is being requested by other processors.

Write Invalidate Protocol

[Figure: the CPU that wants to write broadcasts "Intent to write X"; both caches currently hold X = 7.]

Before a write can occur, all copies of data at that address are declared invalid.

Write Invalidate Protocol

[Figure: the other CPU's copy of X has been invalidated; only the writing CPU's cache still holds X = 7.]

Write Invalidate Protocol

[Figure: the writing CPU updates X to 2 in its cache.]

When another processor tries to read from this location in cache, it receives a cache miss error and will have to refresh from main memory.


Synchronization Required for Shared Data

• Mutual exclusion
  – Definition: At most one process can be engaged in an activity at any time.
  – Example: Only one processor can write to the same address in main memory at the same time.
  – We say that a process must mutually exclude all others while it performs this write.

• Barrier synchronization
  – Definition: Guarantees that no process will proceed beyond a designated point (called the barrier) until every process reaches that point.
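A small illustrative Python sketch of these two primitives in software form (ours, not from the slides; the slides are about shared-memory hardware, this just shows the same ideas with threads):

    # Mutual exclusion and barrier synchronization with Python threads.
    import threading

    counter = 0
    lock = threading.Lock()                     # mutual exclusion
    barrier = threading.Barrier(4)              # barrier for 4 threads

    def worker():
        global counter
        with lock:                              # at most one thread updates at a time
            counter += 1
        barrier.wait()                          # nobody proceeds until all 4 arrive
        # ... code here runs only after every thread has checked in

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)                              # 4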


Distributed Multiprocessor

• Distributes primary memory among processors
• Increases aggregate memory bandwidth and lowers average memory access time
• Allows a greater number of processors
• Also called a non-uniform memory access (NUMA) multiprocessor
  – Local memory access time is fast
  – Non-local memory access time can vary
  – The distributed memories form one logical address space


Distributed Multiprocessors


Cache Coherence

• Some NUMA multiprocessors do not support it in hardware
  – Only instructions and private data are stored in the cache
  – This policy creates a large memory access time variance

• Implementation is more difficult
  – No shared memory bus to "snoop"
  – A directory-based protocol is needed


Directory-based Protocol

• Distributed directory contains information about cacheable memory blocks

• One directory entry for each cache block

• Each entry has– Sharing status– Which processors have copies


Sharing Status

• Uncached -- denoted by "U"
  – Block not in any processor's cache

• Shared -- denoted by "S"
  – Cached by one or more processors
  – Read only

• Exclusive -- denoted by "E"
  – Cached by exactly one processor
  – That processor has written to the block
  – The copy in memory is obsolete
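A compact illustrative sketch (ours, not from the slides; class and method names are hypothetical) of a directory entry and the read/write handling that the step-by-step slides below walk through:

    # Toy directory entry for one memory block, using the slide's states
    # "U" (uncached), "S" (shared), "E" (exclusive). Illustrative only.

    class DirectoryEntry:
        def __init__(self, num_cpus):
            self.state = "U"
            self.present = [0] * num_cpus       # bit vector: which CPUs hold a copy

        def handle_read(self, cpu):
            if self.state == "E":
                owner = self.present.index(1)   # owner switches to shared, writes back
                print(f"CPU {owner}: switch to shared, write block back")
            self.state = "S"
            self.present[cpu] = 1

        def handle_write(self, cpu):
            for other, bit in enumerate(self.present):
                if bit and other != cpu:        # invalidate / take away other copies
                    print(f"CPU {other}: invalidate copy")
            self.present = [0] * len(self.present)
            self.present[cpu] = 1
            self.state = "E"

    # Mirroring the next slides: CPU 0 reads, CPU 2 reads, then CPU 0 writes.
    x = DirectoryEntry(3)
    x.handle_read(0); x.handle_read(2)
    x.handle_write(0)
    print(x.state, x.present)                   # E [1, 0, 0]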

Directory-based Protocol - step 1

[Figure: CPUs 0, 1, and 2, each with its own cache, local memory, and directory, connected by an interconnection network.]

X has value 7 – step 2

[Figure: X = 7 in CPU 0's local memory; the directory entry for X is U (uncached) with bit vector 0 0 0.]

CPU 0 Reads X – step 3

[Figure: CPU 0's read of X misses in its cache; a read miss message goes over the network to the directory, which still shows X as U 0 0 0.]

CPU 0 Reads X – step 4

[Figure: the directory entry for X changes to S (shared) with bit vector 1 0 0.]

CPU 0 Reads X – step 5

[Figure: the block containing X = 7 is delivered to CPU 0's cache.]

CPU 2 Reads X – step 6

[Figure: CPU 2's read of X misses; a read miss message goes to the directory (still S 1 0 0).]

CPU 2 Reads X – step 7

[Figure: the directory entry for X becomes S with bit vector 1 0 1.]

CPU 2 Reads X – step 8

[Figure: X = 7 is now cached by both CPU 0 and CPU 2.]

CPU 0 Writes 6 to X – step 9

[Figure: CPU 0 sends a write miss message to the directory; X is still S 1 0 1, with the value 7 cached by CPU 0 and CPU 2.]

CPU 0 Writes 6 to X – step 10

[Figure: the directory sends an invalidate message for CPU 2's copy of X.]


115

CPU 0 Writes 6 to X – step 11

[Diagram: CPU 2's copy is invalidated; the directory entry becomes "E 1 0 0" (exclusive, owned by CPU 0); CPU 0's cache now holds X = 6; memory still holds the obsolete value 7]


116

CPU 1 Reads X – step 12

[Diagram: CPU 1's read of X misses, so a read-miss message is sent to the directory; directory "E 1 0 0"; CPU 0's cache holds X = 6; memory still 7]


117

CPU 1 Reads X – step 13

[Diagram: the directory sends CPU 0 a "switch to shared" message, asking it to give up exclusive access and supply the current value; directory still "E 1 0 0"; CPU 0's cache holds X = 6; memory still 7]


118

CPU 1 Reads X – step 14

[Diagram: CPU 0 supplies the current value, and memory is updated to X = 6; directory still "E 1 0 0"; CPU 0's cache holds X = 6]


119

CPU 1 Reads X – step 15

[Diagram: the directory entry becomes "S 1 1 0" and the block X = 6 is delivered to CPU 1's cache; CPU 0 and CPU 1 both cache X = 6; memory X = 6]


120

CPU 2 Writes 5 to X – step 16

[Diagram: CPU 2's write sends a write-miss message to the directory; directory still "S 1 1 0"; CPU 0 and CPU 1 cache X = 6; memory X = 6]


121

CPU 2 Writes 5 to X – step 17

[Diagram: the directory sends invalidate messages to the sharers, CPU 0 and CPU 1; directory still "S 1 1 0"; both caches still hold X = 6]


122

CPU 2 Writes 5 to X – step 18

[Diagram: the copies at CPU 0 and CPU 1 are invalidated; the directory entry becomes "E 0 0 1" (exclusive, owned by CPU 2); CPU 2's cache holds X = 5; memory still holds the obsolete value 6]


123

CPU 0 Writes 4 to X – step 19

[Diagram: CPU 0's write sends a write-miss message to the directory; directory "E 0 0 1"; CPU 2's cache holds X = 5; memory still 6]


124

CPU 0 Writes 4 to X – step 20

[Diagram: the directory sends a "take away" message to CPU 2, the current owner, and records CPU 0 as the new owner ("E 1 0 0"); CPU 2's cache still holds X = 5; memory still 6]


125

CPU 0 Writes 4 to X – step 21

[Diagram: CPU 2 writes the block containing X = 5 back over the network, so memory is updated to X = 5; the directory entry remains in the exclusive state]


126

CPU 0 Writes 4 to X – step 22

[Diagram: CPU 2's copy is gone; directory "E 1 0 0" (CPU 0 is the owner); memory holds X = 5]


127

CPU 0 Writes 4 to X – step 23

[Diagram: CPU 0 creates cache block storage for X and receives the block with value 5; directory "E 1 0 0"; memory X = 5]


128

CPU 0 Writes 4 to X – step 24

[Diagram: CPU 0 performs its write, so its cache now holds X = 4; directory "E 1 0 0"; memory still holds X = 5]


129

CPU 0 Writes Back X Block – step 25

[Diagram: CPU 0 sends a data write-back message carrying X = 4 to the home node; directory "E 1 0 0"; CPU 0's cache holds X = 4; memory still holds X = 5]


130

CPU 0 Flushes Cache Block X – step 26

[Diagram: CPU 0's copy is flushed; memory now holds X = 4; the directory entry returns to "U 0 0 0" (uncached)]
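The walkthrough above can be summarized as a small state machine run by the home node's directory controller. The following is a minimal sketch in C of the read-miss and write-miss transitions; the function and type names are hypothetical, and the message sends appear only as comments, since the actual protocol messages depend on the machine and its interconnection network:

    #include <stdio.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } status_t;   /* U, S, E */

    typedef struct {
        status_t status;
        int      sharers[3];   /* bit vector: 1 if CPU i holds a copy */
    } dir_entry_t;

    void handle_read_miss(dir_entry_t *e, int cpu) {
        if (e->status == EXCLUSIVE) {
            /* ask the owner to switch to shared and write the
               current value back, so memory is up to date       */
        }
        e->status = SHARED;
        e->sharers[cpu] = 1;      /* add the requester to the bit vector */
        /* deliver the block from memory to the requesting CPU's cache   */
    }

    void handle_write_miss(dir_entry_t *e, int cpu) {
        for (int i = 0; i < 3; i++) {
            if (e->sharers[i] && i != cpu) {
                /* SHARED: send an invalidate message;
                   EXCLUSIVE: send a "take away" message and have
                   the owner write the block back first           */
                e->sharers[i] = 0;
            }
        }
        e->status = EXCLUSIVE;
        e->sharers[cpu] = 1;      /* the requester becomes the sole owner */
        /* deliver the block to the requesting CPU's cache                */
    }

    int main(void) {
        dir_entry_t x = { UNCACHED, { 0, 0, 0 } };  /* step 2:     X U 0 0 0 */
        handle_read_miss(&x, 0);                    /* steps 3-5:  X S 1 0 0 */
        handle_read_miss(&x, 2);                    /* steps 6-8:  X S 1 0 1 */
        handle_write_miss(&x, 0);                   /* steps 9-11: X E 1 0 0 */
        printf("status=%d sharers=%d%d%d\n",
               x.status, x.sharers[0], x.sharers[1], x.sharers[2]);
        return 0;
    }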


131

Characteristics of Multiprocessors

• Interprocessor communication is done through the memory interface, by read and write instructions.
• Memory may be physically distributed, and reads and writes from different processors may take different amounts of time.
• Congestion and hot spots in the interconnection network may occur.
• Memory latency (i.e., the time to complete a read or write) may be long and variable.
• Most messages through the bus or interconnection network are the size of a single memory word.
• Randomization of requests may be used to reduce the probability of collisions.


132

Multicomputers

• Distributed memory multiple-CPU computer

• Same address on different processors refers to different physical memory locations

• Processors interact through message passing


133

Typically, Two Flavors of Multicomputers

• Commercial multicomputers
  – Custom switch network
  – Low latency (the time it takes to send a message)
  – High bandwidth (data path width) across processors

• Commodity clusters
  – Mass-produced computers, switches, and other equipment
  – Use low-cost components
  – Message latency is higher
  – Communications bandwidth is lower


134

Multicomputer Communication

• Processors are connected by an interconnection network.
• Each processor has a local memory and can only access its own local memory.
• Data is passed between processors using messages, as required by the program.
• Data movement across the network is also asynchronous.
  – A common approach is to use MPI to handle message passing.
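As an illustration of this message-passing style, here is a minimal MPI program in C in which process 0 sends one integer to process 1; the tag, variable names, and value are arbitrary choices for the example:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, x = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

        if (rank == 0) {
            x = 7;
            /* the data is copied into a message, not shared */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received x = %d\n", x);
        }

        MPI_Finalize();
        return 0;
    }

The program must be started with at least two processes (e.g., mpirun -np 2).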


135

Multicomputer Communications (cont)

• Multicomputers can be scaled to larger sizes much more easily than multiprocessors.

• The amount of data transmitted between processors has a huge impact on performance.
  – The distribution of the data among the processors is a very important factor in performance efficiency.


136

Message-Passing Advantages

• No problem with simultaneous access to data.

• Allows different PCs to operate on the same data independently.

• Allows PCs on a network to be easily upgraded when faster processors become available.


137

Disadvantages of Message-Passing

• Programmers must make explicit message-passing calls in the code

• This is low-level programming and is error prone.

• Data is not shared but copied, which increases the total data size.

• Data integrity
  – Difficulty in maintaining correctness of multiple copies of a data item


138

Some Interconnection Network Terminology (1/2)

References: Wilkinson et al. and Grama et al. Also, earlier slides on architecture & networks.

• A link is the connection between two nodes.
• Each node is assumed to contain a switch that routes packets through the node to other nodes without disturbing the processor.
• The link between two nodes can either be bidirectional or consist of two unidirectional links, one in each direction.
• We can assume either a single wire that carries one bit or parallel wires (one wire for each bit in a word).
• These choices do not have a major impact on the concepts presented in this course.


139

Network Terminology (2/2)

• The bandwidth is the number of bits that can be transmitted in unit time (i.e., bits per second).

• The network latency is the time required to transfer a message through the network.
  – The communication latency is the total time required to send a message, including software overhead and interface delay.
  – The message latency, or startup time, is the time required to send a zero-length message.
    • Includes software & hardware overhead, such as choosing a route and packing and unpacking the message.
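As a worked example with illustrative numbers (not from the slides): with a startup time of 50 microseconds and a bandwidth of 1 Gbit/s, sending a 1 KB (8192-bit) message takes roughly 50 + 8192/1000 ≈ 58 microseconds, so for short messages the startup time dominates the total communication latency.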


140

Circuit Switching Message Passing

• The technique establishes a path and allows the entire message to transfer uninterrupted.
• Similar to a telephone connection that is held until the end of the call.
• Links used are not available to other messages until the transfer is complete.
• Latency (message transfer time): if the control packet sent to establish the path is small wrt (with respect to) the message length, the latency is essentially the constant L/B, where L is the message length and B is the bandwidth.


141

Store-and-forward Packet Switching

• The message is divided into "packets" of information.
• Each packet includes source and destination addresses.
• Packets cannot exceed a fixed maximum size (e.g., 1000 bytes).
• A packet is stored in a buffer at a node until it can move on to the next node.
• Different packets typically follow different routes but are re-assembled at the destination as they arrive.
• Movement of packets is asynchronous.
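To make the packet format concrete, a minimal sketch in C of the information a packet might carry (the field names and the 1000-byte maximum payload are illustrative assumptions):

    #define MAX_PAYLOAD 1000   /* fixed maximum packet size, e.g., 1000 bytes */

    typedef struct {
        int  src;                   /* source node address                  */
        int  dest;                  /* destination node address             */
        int  seq;                   /* sequence number, used to re-assemble
                                       packets that arrive out of order     */
        int  length;                /* number of payload bytes in use       */
        char payload[MAX_PAYLOAD];  /* the data being carried               */
    } packet_t;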


142

Packet Switching (cont)

• At each node, the destination information is examined and used to select which node to forward the packet to.

• Routing algorithms (often probabilistic) are used to avoid hot spots and to minimize traffic jams.

• Significant latency is created by storing each packet in each node it reaches.

• Latency increases linearly with the length of the route.


143

Virtual Cut-Through Packet Switching

• Used to reduce the latency.

• Allows a packet to pass through a node without being stored, if the outgoing link is available.

• If the complete path is available, a message can move immediately from source to destination.


144

Wormhole Routing

• An alternative to store-and-forward packet routing

• A message is divided into small units called flits (flow control units).

• Flits are 1–2 bytes in size.

• Flits can be transferred in parallel on links with multiple wires.

• Only the head flit is initially transferred, when the next link becomes available.


145

Wormhole Routing (cont)

• As each flit moves forward, the next flit can move forward.

• The entire path must be reserved for a message, as its flits pull each other along (like the cars of a train).

• Request/acknowledge bit messages are required to coordinate these pull-along moves.
  – See Wilkinson et al.

• Latency: if the head flit is very small compared to the length of the message, then the latency is essentially the constant L/B, with L the message length and B the link bandwidth.
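The practical difference from store-and-forward routing shows up when the two latency models are put side by side. A small sketch in C using the usual textbook approximations (store-and-forward pays the full message time on every hop, while wormhole routing pays the per-hop cost only for the small head flit); the specific numbers are illustrative assumptions:

    #include <stdio.h>

    int main(void) {
        double L    = 8000.0;   /* message length in bits (illustrative)        */
        double B    = 1.0e9;    /* link bandwidth in bits/second (illustrative) */
        double H    = 32.0;     /* head flit / header size in bits              */
        int    hops = 4;        /* number of links on the route                 */

        /* store-and-forward: the whole message is stored at every
           intermediate node, so latency grows linearly with the route */
        double store_forward = hops * (L / B);

        /* wormhole: only the head flit pays the per-hop cost; the rest
           of the message streams behind it, so latency is about L/B   */
        double wormhole = hops * (H / B) + L / B;

        printf("store-and-forward: %.2f us\n", store_forward * 1e6);
        printf("wormhole routing:  %.2f us\n", wormhole * 1e6);
        return 0;
    }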


146

Deadlock

• Routing algorithms are needed to find a path between the nodes.
• Adaptive routing algorithms choose different paths, depending on traffic conditions.
• Livelock is a deadlock-type situation in which a packet continues to go around the network without ever reaching its destination.
• Deadlock: no packet can be forwarded because each is blocked by other stored packets waiting to be forwarded.


147

Asymmetric Multicomputers

• Has a front end that interacts with users and I/O devices.
  – Processors in the back end are used for computation.
  – Programming is similar to SIMDs (i.e., processor arrays).
  – Common with early multicomputers.

• Examples of asymmetrical multicomputers are given in the textbook.


148

Asymmetrical MC Advantages

• Back-end processors dedicated to parallel computations
  – Easier to understand, model, and tune performance

• Only a simple back-end operating system needed
  – Easy for a vendor to create


149

Asymmetrical MC Disadvantages

• Front-end computer is a single point of failure

• Single front-end computer limits scalability of system

• Primitive operating system in back-end processors makes debugging difficult

• Every application requires development of both front-end and back-end program


150

Symmetric Multicomputers

• Every computer executes the same operating system and has identical functionality.

• Users may log into any computer to edit or compile their programs.

• Any or all computers may be involved in the execution of their program.

• During execution of programs, every PE executes the same program.

• When only one PE should execute an operation, an “if” statement is used to select the PE.
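For example, in an SPMD-style MPI program in C, a test on the process rank selects which PE carries out a particular operation (a minimal, illustrative sketch):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* every PE executes this same program ... */

        if (rank == 0) {
            /* ... but only the selected PE performs this operation */
            printf("hello from PE 0\n");
        }

        MPI_Finalize();
        return 0;
    }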


151

Symmetric Multicomputers


152

Symmetrical MC Advantages

• Alleviate performance bottleneck caused by single front-end computer

• Better support for debugging

• Every processor executes same program


153

Symmetrical MC Disadvantages

• More difficult to maintain illusion of single “parallel computer”

• No simple way to balance program development workload among processors

• More difficult to achieve high performance when multiple processes run on each processor
  – Details on next slide


154

Symmetric MC Disadvantages (cont)

• (cont.) More difficult to achieve high performance when multiple processes run on each processor
  – Processes on the same processor compete for the same resources
    • CPU cycles
    • Cache space
    • Memory bandwidth
  – Increased cache misses
    • The cache is "PE oriented" instead of "process oriented"


155

Best Model for Commodity Cluster

• A full-fledged operating system (e.g., Linux) is desirable
  – A feature of symmetric multicomputers

• Desirable to increase cache hits
  – Favors having only a single user process on each PE
  – Favors most nodes being off-limits for program development

• Need a fast network
  – Keep program-development users off the network and have them access the front end by another path
  – Reserve the interconnection network for use by parallel processes

• Overall, a mixed model may be best for commodity clusters


156

Ideal Commodity Cluster Features

• Co-located computers
• Computers dedicated to running a single process at a time, to lower cache misses
• Accessible only from the network
  – No keyboards or displays
• Users access the front end by another route
• Identical operating systems
• Identical local disk images
• Administered as an entity


157

ParPar Cluster, A Mixed Model

• Mixed model

• Incorporates both asymmetrical and symmetrical designs.


158

Network of Workstations

• Dispersed computers
  – Typically located on users' desks
• First priority: response time for the person at the keyboard
• Parallel jobs wait in the background and run when spare CPU cycles are available
• Different operating systems
• Different local images
• Checkpointing and restarting are important
• Typically connected by Ethernet
  – Too slow for commodity-cluster usage


159

A Commodity Cluster vs Network of Workstations

• A commodity cluster contains components of local area networks
  – Commodity computers
  – Switches

• A network of workstations is a dispersed collection of computers
  – Distributed, heterogeneous computers
  – Located on primary users' desks
  – Unused cycles available for parallel use
  – Example: the SETI project


160

Summary

• Commercial parallel computers appeared in the 1980s

• Multiple-CPU computers now dominate

• Small-scale: Centralized multiprocessors

• Large-scale: Distributed-memory architectures
  – Multiprocessors
  – Multicomputers