Transcript of Parallel Real-Time Systems Parallel Computing Overview.

Page 1: Parallel Real-Time Systems Parallel Computing Overview.

Parallel Real-Time Systems

Parallel Computing Overview

Page 2: Parallel Real-Time Systems Parallel Computing Overview.

2

References(Will be expanded as needed)

• Website for Parallel & Distributed Computing: www.cs.kent.edu/~jbaker/PDC-F08/
  – Selected slides from "Introduction to Parallel Computing"

• Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004.
  – Chapter 1 is posted on the website.

• Selim Akl, "Parallel Computation: Models and Methods", Prentice Hall, 1997. An updated online version is available on the website.

Page 3: Parallel Real-Time Systems Parallel Computing Overview.

3

Outline

• Why use parallel computing

• Moore’s Law

• Modern parallel computers

• Flynn’s Taxonomy

• Seeking Concurrency

• Data clustering case study

• Programming parallel computers

Page 4: Parallel Real-Time Systems Parallel Computing Overview.

4

Why Use Parallel Computers

• Solve compute-intensive problems faster
  – Make infeasible problems feasible
  – Reduce design time

• Solve larger problems in the same amount of time
  – Improve the answer's precision
  – Reduce design time

• Increase memory size
  – More data can be kept in memory
  – Dramatically reduces the slowdown caused by accessing external storage, which increases computation time

• Gain competitive advantage

Page 5: Parallel Real-Time Systems Parallel Computing Overview.

5

1989 Grand Challenges to Computational Science Categories

• Quantum chemistry, statistical mechanics, and relativistic physics
• Cosmology and astrophysics
• Computational fluid dynamics and turbulence
• Materials design and superconductivity
• Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
• Medicine, and modeling of human organs and bones
• Global weather and environmental modeling

Page 6: Parallel Real-Time Systems Parallel Computing Overview.

6

Weather Prediction

• The atmosphere is divided into 3D cells.
• Data includes temperature, pressure, humidity, wind speed and direction, etc.
  – Recorded at regular time intervals in each cell
• There are about 5×10^3 cells of 1-mile cubes.
• Calculations would take a modern computer over 100 days to perform the calculations needed for a 10-day forecast.
• Details are in Ian Foster's 1995 online textbook
  – Designing and Building Parallel Programs
  – Included in the Parallel Reference List, which will be posted on the website.

Page 7: Parallel Real-Time Systems Parallel Computing Overview.

7

Moore's Law

• In 1965, Gordon Moore [87] observed that the density of chips doubled every year.
  – That is, the chip area needed for a given amount of circuitry is halved yearly.
  – This is an exponential rate of increase.
• By the late 1980s, the doubling period had slowed to 18 months.
• Reduction of the silicon area causes the speed of the processors to increase.
• Moore's law is sometimes stated: "Processor speed doubles every 18 months."

Page 8: Parallel Real-Time Systems Parallel Computing Overview.

8

Microprocessor Revolution

[Figure: speed (log scale) versus time for supercomputers, mainframes, minis, and micros.]

Page 9: Parallel Real-Time Systems Parallel Computing Overview.

9

Some Definitions

• Concurrent – Sequential events or processes which seem to occur or progress at the same time.

• Parallel – Events or processes which occur or progress at the same time.

• Parallel computing provides simultaneous execution of operations within a single parallel computer

• Distributed computing provides simultaneous execution of operations across a number of systems.

Page 10: Parallel Real-Time Systems Parallel Computing Overview.

10

Flynn's Taxonomy

• Best-known classification scheme for parallel computers.
• Depends on the parallelism a computer exhibits in its
  – Instruction stream
  – Data stream
• A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream).
• The instruction stream (I) and the data stream (D) can be either single (S) or multiple (M).
• Four combinations: SISD, SIMD, MISD, MIMD

Page 11: Parallel Real-Time Systems Parallel Computing Overview.

11

SISD

• Single Instruction, Single Data
• The usual sequential computer is the primary example
  – i.e., uniprocessors
  – Note: co-processors don't count as additional processors
• Concurrent processing is allowed
  – Instruction prefetching
  – Pipelined execution of instructions
  – Independent concurrent tasks can execute different sequences of operations.

Page 12: Parallel Real-Time Systems Parallel Computing Overview.

12

SIMD

• Single Instruction, Multiple Data
• One instruction stream is broadcast to all processors.
• Each processor, also called a processing element (or PE), is very simplistic and is essentially an ALU.
  – PEs do not store a copy of the program nor have a program control unit.
• Individual processors can be inhibited from participating in an instruction (based on a data test).

Page 13: Parallel Real-Time Systems Parallel Computing Overview.

13

SIMD (cont.)

• All active processors execute the same instruction synchronously, but on different data.
• On a memory access, all active processors must access the same location in their local memory.
• The data items form an array (or vector), and an instruction can act on the complete array in one cycle.

Page 14: Parallel Real-Time Systems Parallel Computing Overview.

14

SIMD (cont.)

• Quinn calls this architecture a processor array.
• Examples include
  – The STARAN and MPP (Dr. Batcher, architect)
  – The Connection Machine CM-2, built by Thinking Machines.

Page 15: Parallel Real-Time Systems Parallel Computing Overview.

15

How to View a SIMD Machine

• Think of soldiers all in a unit.

• The commander selects certain soldiers as active.– For example, every even numbered row.

• The commander barks out an order to all the active soldiers, who execute the order synchronously.

Page 16: Parallel Real-Time Systems Parallel Computing Overview.

16

MISD

• Multiple Instruction streams, Single Data stream
• Primarily corresponds to multiple redundant computation, say for reliability.
• Quinn argues that a systolic array is an example of a MISD structure (pp. 55-57).
• Some authors include pipelined architectures in this category.
• This category does not receive much attention from most authors, so we won't discuss it further.

Page 17: Parallel Real-Time Systems Parallel Computing Overview.

17

MIMD

• Multiple Instruction, Multiple Data

• Processors are asynchronous and can independently execute different programs on different data sets.

• Communications are handled either
  – through shared memory (multiprocessors)
  – by use of message passing (multicomputers)

• MIMDs are considered by many researchers to include the most powerful, least restricted computers.

Page 18: Parallel Real-Time Systems Parallel Computing Overview.

18

MIMD (cont. 2/4)

• Have major communication costs
  – When compared to SIMDs
  – Internal 'housekeeping activities' are often overlooked
    • Maintaining distributed memory & distributed databases
    • Synchronization or scheduling of tasks
    • Load balancing between processors

• The SPMD method of programming MIMDs
  – All processors execute the same program.
  – SPMD stands for Single Program, Multiple Data.
  – An easy method to program when the number of processors is large.
  – While processors have the same code, each can be executing different parts of it at any point in time (see the sketch below).
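As a rough illustration of the SPMD style (a sketch added here, not from the slides), the C program below uses standard MPI calls so that every processor runs the same program but branches on its rank; the work split shown is an arbitrary choice for illustration.

    /* spmd_example.c - minimal SPMD sketch: one program run by every processor.
       Assumes a working MPI installation (compile with mpicc, run with mpirun). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this processor's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processors */

        /* Same code everywhere, but each processor may be executing a
           different part of it at any point in time, selected by its rank. */
        if (rank == 0)
            printf("Processor 0 of %d: doing coordination work\n", size);
        else
            printf("Processor %d of %d: doing its share of the computation\n",
                   rank, size);

        MPI_Finalize();
        return 0;
    }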

Page 19: Parallel Real-Time Systems Parallel Computing Overview.

19

MIMD (cont. 3/4)

• A more common technique for programming MIMDs is to use multi-tasking
  – The problem solution is broken up into various tasks.
  – Tasks are distributed among processors initially.
  – If new tasks are produced during execution, these may be handled by the parent processor or distributed.
  – Each processor can execute its collection of tasks concurrently.
    • If some of its tasks must wait for results from other tasks or for new data, the processor works on its remaining tasks.
  – Larger programs usually require a load balancing algorithm to rebalance tasks between processors.
  – Dynamic scheduling algorithms may be needed to assign a higher execution priority to time-critical tasks.
    • E.g., tasks on the critical path, more important tasks, or tasks with earlier deadlines.

Page 20: Parallel Real-Time Systems Parallel Computing Overview.

20

MIMD (cont 4/4)

• Recall, there are two principal types of MIMD computers:
  – Multiprocessors (with shared memory)
  – Multicomputers (message passing)

• Both are important and will be covered in greater detail next.

Page 21: Parallel Real-Time Systems Parallel Computing Overview.

21

Multiprocessors (Shared-Memory MIMDs)

• Consist of two types
  – Centralized Multiprocessors
    • Also called UMA (Uniform Memory Access)
    • Symmetric Multiprocessor, or SMP
  – Distributed Multiprocessors
    • Also called NUMA (Nonuniform Memory Access)

Page 22: Parallel Real-Time Systems Parallel Computing Overview.

22

Centralized Multiprocessors (SMPs)

Page 23: Parallel Real-Time Systems Parallel Computing Overview.

23

Centralized Multiprocessors (SMPs)

• Consist of identical CPUs connected by a bus to a common block of memory.
• Each processor requires the same amount of time to access memory.
• Usually limited to a few dozen processors due to memory bandwidth.
• SMPs and clusters of SMPs are currently very popular.

Page 24: Parallel Real-Time Systems Parallel Computing Overview.

24

Distributed Multiprocessors

Page 25: Parallel Real-Time Systems Parallel Computing Overview.

25

Distributed Multiprocessors (or NUMA)

• Have a distributed memory system.
• Each memory location has the same address for all processors.
  – Access time to a given memory location varies considerably for different CPUs.
• Normally, fast caches are used to reduce the problem of different memory access times for different processors.
  – This creates the problem of ensuring that all copies of the same data item in different memory locations are kept identical.

Page 26: Parallel Real-Time Systems Parallel Computing Overview.

26

Multicomputers (Message-Passing MIMDs)

• Processors are connected by a network
  – Usually an interconnection network
  – Also may be connected by Ethernet links or a bus
• Each processor has a local memory and can only access its own local memory.
• Data is passed between processors using messages, when specified by the program.

Page 27: Parallel Real-Time Systems Parallel Computing Overview.

27

Multicomputers (cont)

• Message passing between processors is controlled by a message-passing library (e.g., MPI, PVM); see the sketch below.

• The problem is divided into processes or tasks that can be executed concurrently on individual processors.

• Each processor is normally assigned multiple tasks.
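A minimal message-passing sketch (added here, not from the slides) using standard MPI calls in C: one processor sends an integer to another. The value and message tag are arbitrary illustrative choices.

    /* msg_example.c - one processor sends a value to another via messages.
       Illustrative sketch; assumes at least 2 MPI processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                       /* data held only by processor 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Processor 1 received %d from processor 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }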

Page 28: Parallel Real-Time Systems Parallel Computing Overview.

28

Multiprocessors vs. Multicomputers

• Programming disadvantages of message passing
  – Programmers must make explicit message-passing calls in the code.
  – This is low-level programming and is error prone.
  – Data is not shared but copied, which increases the total data size.
  – Data integrity: difficulty in maintaining correctness of multiple copies of a data item.

Page 29: Parallel Real-Time Systems Parallel Computing Overview.

29

Multiprocessors vs. Multicomputers (cont.)

• Programming advantages of message passing
  – No problem with simultaneous access to data.
  – Allows different PCs to operate on the same data independently.
  – Allows PCs on a network to be easily upgraded when faster processors become available.
• Mixed "distributed shared memory" systems exist
  – An example is a cluster of SMPs.

Page 30: Parallel Real-Time Systems Parallel Computing Overview.

30

Types of Parallel Execution

• Data parallelism

• Control/Job/Functional parallelism

• Pipelining

• Virtual parallelism

Page 31: Parallel Real-Time Systems Parallel Computing Overview.

31

Data Parallelism

• All tasks (or processors) apply the same set of operations to different data.
• Example (an OpenMP version follows this slide):

    for i ← 0 to 99 do
        a[i] ← b[i] + c[i]
    endfor

• The operations may be executed concurrently.
• Accomplished on SIMDs by having all active processors execute the operations synchronously.
• Can be accomplished on MIMDs by assigning 100/p tasks to each processor and having each processor calculate its share asynchronously.
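A sketch (added here, not from the slides) of the loop above written in C with OpenMP; the pragma divides the 100 iterations among the available threads, each computing its share asynchronously, as described above for MIMD data parallelism.

    /* data_parallel.c - the slide's loop expressed with OpenMP (illustrative). */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int a[100], b[100], c[100];
        for (int i = 0; i < 100; i++) { b[i] = i; c[i] = 2 * i; }

        #pragma omp parallel for              /* iterations divided among threads */
        for (int i = 0; i < 100; i++)
            a[i] = b[i] + c[i];

        printf("a[99] = %d\n", a[99]);        /* expect 297 */
        return 0;
    }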

Page 32: Parallel Real-Time Systems Parallel Computing Overview.

32

Supporting MIMD Data Parallelism

• SPMD (single program, multiple data) programming is not really data parallel execution, as processors typically execute different sections of the program concurrently.
• Data parallel programming can be strictly enforced when using SPMD as follows:
  – Processors execute the same block of instructions concurrently but asynchronously.
  – No communication or synchronization occurs within these concurrent instruction blocks.
  – Each instruction block is normally followed by a synchronization and communication block of steps.

Page 33: Parallel Real-Time Systems Parallel Computing Overview.

33

MIMD Data Parallelism (cont.)

• Strict data parallel programming is unusual for MIMDs, as the processors usually execute independently, running their own local program.

Page 34: Parallel Real-Time Systems Parallel Computing Overview.

34

Data Parallelism Features

• Each processor performs the same data computation on different data sets.
• Computations can be performed either synchronously or asynchronously.
• Defn: Grain size is the average number of computations performed between communication or synchronization steps.
  – See Quinn textbook, page 411.
• Data parallel programming usually results in smaller grain size computation.
  – SIMD computation is considered to be fine-grain.
  – MIMD data parallelism is usually considered to be medium-grain.

Page 35: Parallel Real-Time Systems Parallel Computing Overview.

35

Control/Job/Functional Parallelism

• Independent tasks apply different operations to different data elements:

    a ← 2
    b ← 3
    m ← (a + b) / 2
    s ← (a² + b²) / 2
    v ← s − m²

• The first and second statements may execute concurrently.
• The third and fourth statements may execute concurrently.
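A sketch (added here, not from the slides) of how the independent statements above could be run concurrently in C with OpenMP sections; statements 1 and 2 run in parallel, then statements 3 and 4.

    /* control_parallel.c - the slide's example using OpenMP sections (sketch). */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double a, b, m, s, v;

        #pragma omp parallel sections
        {
            #pragma omp section
            a = 2;                           /* statement 1 */
            #pragma omp section
            b = 3;                           /* statement 2 */
        }                                    /* implicit barrier here */

        #pragma omp parallel sections
        {
            #pragma omp section
            m = (a + b) / 2;                 /* statement 3 */
            #pragma omp section
            s = (a * a + b * b) / 2;         /* statement 4 */
        }

        v = s - m * m;                       /* statement 5 depends on 3 and 4 */
        printf("v = %g\n", v);               /* 6.5 - 6.25 = 0.25 */
        return 0;
    }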

Page 36: Parallel Real-Time Systems Parallel Computing Overview.

36

Control Parallelism Features

• Problem is divided into different non-identical tasks

• Tasks are divided between the processors so that their workload is roughly balanced

• Parallelism at the task level is considered to be coarse grained parallelism

Page 37: Parallel Real-Time Systems Parallel Computing Overview.

37

Data Dependence Graph

• Can be used to identify data parallelism and job parallelism.
• See page 11.
• Most realistic jobs contain both parallelisms.
  – Can be viewed as branches in data parallel tasks.
• If there is no path from vertex u to vertex v, then job parallelism can be used to execute the tasks u and v concurrently.
• If larger tasks can be subdivided into smaller identical tasks, data parallelism can be used to execute these concurrently.

Page 38: Parallel Real-Time Systems Parallel Computing Overview.

38

For example, "mow lawn" becomes
• Mow N lawn
• Mow S lawn
• Mow E lawn
• Mow W lawn

If 4 people are available to mow, then data parallelism can be used to do these tasks simultaneously.

Similarly, if several people are available to "edge lawn" and "weed garden", then we can use data parallelism to provide more concurrency.

Page 39: Parallel Real-Time Systems Parallel Computing Overview.

39

Pipelining

• Divide a process into stages

• Produce several items simultaneously

Page 40: Parallel Real-Time Systems Parallel Computing Overview.

40

Compute Partial Sums

Consider the for loop:

    p[0] ← a[0]
    for i ← 1 to 3 do
        p[i] ← p[i-1] + a[i]
    endfor

• This computes the partial sums:

    p[0] ← a[0]
    p[1] ← a[0] + a[1]
    p[2] ← a[0] + a[1] + a[2]
    p[3] ← a[0] + a[1] + a[2] + a[3]

• The loop is not data parallel, as there are dependencies.
• However, we can stage the calculations in order to achieve some parallelism.

Page 41: Parallel Real-Time Systems Parallel Computing Overview.

41

Partial Sums Pipeline

[Figure: a four-stage pipeline. a[0] enters the first stage, and each later stage adds the next a[i], so the stage outputs are p[0], p[1], p[2], and p[3].]
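An MPI sketch (added here, not from the slides) of the partial-sums pipeline: one process per stage, with stage i receiving the running sum from stage i-1, adding a[i], and forwarding the result. The input values and the assumption of exactly 4 processes are illustrative.

    /* pipeline_sums.c - partial-sums pipeline, one MPI process per stage.
       Sketch only; assumes exactly 4 processes and the example inputs a[] below. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, sum;
        int a[4] = {1, 2, 3, 4};                     /* illustrative inputs */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            sum = a[0];                              /* p[0] = a[0] */
        else {
            MPI_Recv(&sum, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += a[rank];                          /* p[i] = p[i-1] + a[i] */
        }
        printf("p[%d] = %d\n", rank, sum);

        if (rank < size - 1)                         /* forward to the next stage */
            MPI_Send(&sum, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }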

Page 42: Parallel Real-Time Systems Parallel Computing Overview.

Virtual Parallelism

• In data parallel applications, it is often simpler to initially design an algorithm or program assuming one data item per processor.
  – Particularly useful for SIMD programming
• If the actual program must handle more data items than there are processors, each processor is given a block of ⌈n/p⌉ or ⌊n/p⌋ data items.
  – Typically requires a routine adjustment to the program.
  – Will result in a slowdown in running time by a factor of at least ⌈n/p⌉.
• Called virtual parallelism, since each processor plays the role of several processors.
• A SIMD computer has been built that automatically converts code to handle ⌈n/p⌉ items per processor.
  – The Wavetracer SIMD computer.

Page 43: Parallel Real-Time Systems Parallel Computing Overview.

Slides from Parallel Architecture Section

See www.cs.kent.edu/~jbaker/PDC-F08/


Page 44: Parallel Real-Time Systems Parallel Computing Overview.

44

References

• Slides in this section are taken from the Parallel Architecture slides at www.cs.kent.edu/~jbaker/PDC-F08/
• Book reference is Chapter 2 of Quinn's textbook.

Page 45: Parallel Real-Time Systems Parallel Computing Overview.

Interconnection Networks

• Uses of interconnection networks
  – Connect processors to shared memory
  – Connect processors to each other
• Different interconnection networks define different parallel machines.
• The interconnection network's properties influence the type of algorithm used for various machines, as they affect how data is routed.

Page 46: Parallel Real-Time Systems Parallel Computing Overview.

Terminology for Evaluating Switch Topologies

• We need to evaluate 4 characteristics of a network in order to help us understand their effectiveness.
• These are
  – The diameter
  – The bisection width
  – The edges per node
  – The constant edge length
• We'll define these and see how they affect algorithm choice.
• Then we will introduce several different interconnection networks.

Page 47: Parallel Real-Time Systems Parallel Computing Overview.

Terminology for Evaluating Switch Topologies

• Diameter – Largest distance between two switch nodes.
  – A low diameter is desirable.
  – It puts a lower bound on the complexity of parallel algorithms which require communication between arbitrary pairs of nodes.

Page 48: Parallel Real-Time Systems Parallel Computing Overview.

Terminology for Evaluating Switch Topologies

• Bisection width – The minimum number of edges between switch nodes that must be removed in order to divide the network into two halves.
  – Or within 1 node of one-half if the number of processors is odd.
• A high bisection width is desirable.
• In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of an algorithm.

Page 49: Parallel Real-Time Systems Parallel Computing Overview.

Terminology for Evaluating Switch Topologies

• Number of edges per node
  – It is best if the maximum number of edges per node is a constant independent of network size, as this allows the processor organization to scale more easily to a larger number of nodes.
  – Degree is the maximum number of edges per node.

• Constant edge length? (yes/no)
  – Again, for scalability, it is best if the nodes and edges can be laid out in 3D space so that the maximum edge length is a constant independent of network size.

Page 50: Parallel Real-Time Systems Parallel Computing Overview.

Three Important Interconnection Networks

• We will consider the following three well-known interconnection networks:
  – 2-D mesh
  – linear network
  – hypercube

• All three of these networks have been used to build commercial parallel computers.

Page 51: Parallel Real-Time Systems Parallel Computing Overview.

2-D Meshes

Note: Circles represent switches and squares represent processors in all these slides.

Page 52: Parallel Real-Time Systems Parallel Computing Overview.

2-D Mesh Network

• Switches are arranged into a 2-D lattice or grid.
• Communication is allowed only between neighboring switches.
• Torus: a variant that includes wraparound connections between switches on the edge of the mesh.

Page 53: Parallel Real-Time Systems Parallel Computing Overview.

Evaluating 2-D Meshes (assumes the mesh is a square)

n = number of processors
• Diameter: Θ(n^(1/2))
  – Places a lower bound on algorithms that require processing with arbitrary nodes sharing data.
• Bisection width: Θ(n^(1/2))
  – Places a lower bound on algorithms that require distribution of data to all nodes.
• Max number of edges per switch: 4 (the degree)
• Constant edge length? Yes
• Does this scale well? Yes
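As an illustrative calculation (added here, not from the slides): a square mesh with n = 16 processors is a 4 x 4 grid of switches, so its diameter is 2(4 - 1) = 6 and its bisection width is 4, consistent with the Θ(n^(1/2)) bounds above.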

Page 54: Parallel Real-Time Systems Parallel Computing Overview.

Linear Network

• Switches arranged into a 1-D mesh
• Corresponds to a row or column of a 2-D mesh
• Ring: a variant that allows a wraparound connection between the switches on the ends.
• The linear and ring networks have many applications.
• Essentially supports a pipeline in both directions.
• Although these networks are very simple, they support many optimal algorithms.

Page 55: Parallel Real-Time Systems Parallel Computing Overview.

Evaluating Linear and Ring Networks

• Diameter
  – Linear: n-1, or Θ(n)
  – Ring: n/2, or Θ(n)
• Bisection width:
  – Linear: 1, or Θ(1)
  – Ring: 2, or Θ(1)
• Degree for switches: 2
• Constant edge length? Yes
• Does this scale well? Yes

Page 56: Parallel Real-Time Systems Parallel Computing Overview.

Hypercube (also called binary n-cube)

[Figure: a hypercube with n = 2^d processors and switches, shown for d = 4, with nodes labeled by 4-bit addresses 0000 through 1111.]

Page 57: Parallel Real-Time Systems Parallel Computing Overview.

Hypercube with n = 2^d Processors

[Figure: the d = 4 hypercube again, with nodes labeled by 4-bit addresses.]

• The number of nodes is a power of 2.
• Node addresses are 0, 1, …, n-1.
• Node i is connected to the d nodes whose addresses differ from i in exactly one bit position.
• Example: node 0111 is connected to 1111, 0011, 0101, and 0110.

Page 58: Parallel Real-Time Systems Parallel Computing Overview.

Growing a Hypercube

Note: For d = 4, it is called a 4-dimensional cube.

Page 59: Parallel Real-Time Systems Parallel Computing Overview.

Evaluating the Hypercube Network with n = 2^d nodes

[Figure: the d = 4 hypercube, with nodes labeled by 4-bit addresses.]

• Diameter: d = log n
• Bisection width: n/2
• Edges per node: log n
• Constant edge length? No. The length of the longest edge increases as n increases.
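As an illustrative check (added here, not from the slides): for d = 4, so n = 16 nodes, the diameter is 4, the bisection width is 16/2 = 8, and each node has log 16 = 4 edges.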

Page 60: Parallel Real-Time Systems Parallel Computing Overview.

MIMD Message-Passing

Slides are still from Parallel Architecture Unit at “Parallel & Distributed Computing” website

Page 61: Parallel Real-Time Systems Parallel Computing Overview.

61

Some Interconnection Network Terminology (1/2)

References: Wilkinson, et al. & Grama, et al. Also, earlier slides on architecture & networks.

• A link is the connection between two nodes.
• A switch that enables packets to be routed through the node to other nodes without disturbing the processor is assumed.
• The link between two nodes can be either bidirectional or use two directional links.
• Can assume either one wire that carries one bit or parallel wires (one wire for each bit in a word).
• The above choices do not have a major impact on the concepts presented in this course.


Page 63: Parallel Real-Time Systems Parallel Computing Overview.

63

Network Terminology (2/2)

• The bandwidth is the number of bits that can be transmitted in unit time (i.e., bits per second).
• The network latency is the time required to transfer a message through the network.
  – The communication latency is the total time required to send a message, including software overhead and interface delay.
  – The message latency or startup time is the time required to send a zero-length message.
    • Includes software & hardware overhead, such as
      – choosing a route
      – packing and unpacking the message

Page 64: Parallel Real-Time Systems Parallel Computing Overview.

64

Store-and-Forward Packet Switching

• The message is divided into "packets" of information.
• Each packet includes source and destination addresses.
• Packets cannot exceed a fixed maximum size (e.g., 1000 bytes).
• A packet is stored in a buffer at a node until it can move to the next node.
• Different packets typically follow different routes but are re-assembled at the destination as the packets arrive.
• Movement of packets is asynchronous.

Page 65: Parallel Real-Time Systems Parallel Computing Overview.

65

Packet Switching (cont)

• At each node, the destination information is examined and used to select which node to forward the packet to.

• Routing algorithms (often probabilistic) are used to avoid hot spots and to minimize traffic jams.

• Significant latency is created by storing each packet in each node it reaches.

• Latency increases linearly with the length of the route.

Page 66: Parallel Real-Time Systems Parallel Computing Overview.

Slides from Performance Analysis

Page 67: Parallel Real-Time Systems Parallel Computing Overview.

67

References on Performance Evaluation

• Slides are from www.cs.kent.edu/~jbaker/PDC-F08/ on the topic of Performance Evaluation.
• Selim Akl, "Parallel Computation: Models and Methods", Prentice Hall, 1997. An updated online version is available through the website.
• Michael Quinn, Parallel Programming in C with MPI and OpenMP, Ch. 7, McGraw Hill, 2004.

Page 68: Parallel Real-Time Systems Parallel Computing Overview.

Outline

• Speedup
• Superlinearity Issues
• Speedup Analysis
• Cost
• Efficiency
• Amdahl's Law
• Gustafson's Law

Page 69: Parallel Real-Time Systems Parallel Computing Overview.

Speedup

• Speedup measures the increase in speed gained from parallelism. The number of PEs is given by n.
• S(n) = ts/tp, where
  – ts is the running time on a single processor, using the fastest known sequential algorithm
  – tp is the running time using a parallel computer.
• In simplest terms,

    Speedup = (Sequential running time) / (Parallel running time)

Page 70: Parallel Real-Time Systems Parallel Computing Overview.

Linear Speedup Usually Optimal

• Speedup is linear if S(n) = Θ(n).
• Claim: The maximum possible speedup for a parallel computer with n PEs is n.
• Usual pseudo-proof (assume ideal conditions):
  – Assume a computation is partitioned perfectly into n processes of equal duration.
  – Assume no overhead is incurred as a result of this partitioning of the computation (e.g., partitioning process, information passing, coordination of processes, etc.).
  – Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation, and the parallel running time will be ts/n.
  – Then the parallel speedup in this "ideal situation" is S(n) = ts/(ts/n) = n.

Page 71: Parallel Real-Time Systems Parallel Computing Overview.

Linear Speedup Usually Optimal (cont)

• This argument shows that for typical problems, linear speedup is optimal.

• The argument is valid for traditional problems, but is invalid for some types of nontraditional problems.

Page 72: Parallel Real-Time Systems Parallel Computing Overview.

Speedup Usually Smaller Than Linear

• Unfortunately, the best speedup possible for most applications is considerably smaller than n.
  – The "ideal conditions" performance mentioned in the earlier argument is usually unattainable.
  – Normally, some parts of programs are sequential and allow only one PE to be active.
  – Sometimes a significant number of processors are idle for certain portions of the program. For example,
    • Some PEs may be waiting to receive or to send data during parts of the program.
    • Congestion may occur during message passing.

Page 73: Parallel Real-Time Systems Parallel Computing Overview.

Superlinear Speedup

• Superlinear speedup occurs when S(n) > n.
  – Occasionally speedup that appears to be superlinear may occur, but it can be explained by other reasons, such as
    • the extra memory in the parallel system.
    • a sub-optimal sequential algorithm being compared to the parallel algorithm.
    • "luck", in the case of an algorithm that has a random aspect in its design (e.g., random selection).

Page 74: Parallel Real-Time Systems Parallel Computing Overview.

Superlinearity (cont.)

• Selim Akl has given a multitude of examples that establish that superlinear algorithms are required for many non-standard problems.
• Examples include "nonstandard" problems involving
  – Real-time requirements, where meeting deadlines is part of the problem requirements.
  – Problems where all data is not initially available, but has to be processed after it arrives.
  – Problems that are natural to solve using parallelism and whose sequential solutions are inefficient.

Page 75: Parallel Real-Time Systems Parallel Computing Overview.

Execution Time for the Parallel Portion

[Figure: time vs. processors. A nontrivial parallel algorithm's computation component is a decreasing function of the number of processors used.]

Page 76: Parallel Real-Time Systems Parallel Computing Overview.

Time for MIMD Communication

[Figure: time vs. processors. A nontrivial parallel algorithm's communication component is an increasing function of the number of processors.]

Page 77: Parallel Real-Time Systems Parallel Computing Overview.

Combining Parallel Computation & MIMD Communication Times

[Figure: time vs. processors. Combining these, we see that for a fixed problem size there is an optimum number of processors that minimizes overall execution time.]

Page 78: Parallel Real-Time Systems Parallel Computing Overview.

MIMD Speedup Plot

[Figure: speedup vs. processors. Speedup reaches a maximum and then drops as the number of processors increases.]

Page 79: Parallel Real-Time Systems Parallel Computing Overview.

Cost

• The cost of a parallel algorithm (or program) is

    Cost = Parallel running time × #processors

• The cost of a parallel algorithm should be compared to the running time of a sequential algorithm.
  – Cost removes the advantage of parallelism by charging for each additional processor.
  – A parallel algorithm whose cost is O(running time of an optimal sequential algorithm) is called cost-optimal.

Page 80: Parallel Real-Time Systems Parallel Computing Overview.

Efficiency

    Efficiency = Sequential execution time / (Processors used × Parallel execution time)

    Efficiency = Speedup / Processors used

    Efficiency = Sequential running time / Cost

For traditional problems, 0 < Efficiency ≤ 1.
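A worked example (added here with illustrative numbers, not from the slides): if a program runs in 100 seconds sequentially and in 20 seconds on 8 processors, then Speedup = 100/20 = 5, Cost = 20 × 8 = 160, and Efficiency = 5/8 ≈ 0.63.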

Page 81: Parallel Real-Time Systems Parallel Computing Overview.

Amdahl’s Law

• Having a detailed understanding of Amdahl’s law is not essential for this course.

• However, having a brief, non-technical introduction to this important law could be useful.

81

Page 82: Parallel Real-Time Systems Parallel Computing Overview.

Amdahl’s Law

Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup achievable by a parallel computer with n processors is

    S(n) ≤ 1 / (f + (1 - f)/n)

• The word "law" is often used by computer scientists when it is an observed phenomenon (e.g., Moore's Law) and not a theorem that has been proven in a strict sense.
• It is easy to prove Amdahl's law for "traditional problems", but it is not valid for "non-traditional problems".

Page 83: Parallel Real-Time Systems Parallel Computing Overview.

Example 1

• Assume 95% of a program's execution time occurs inside a loop that can be executed in parallel.
• Amdahl's law shows that the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs is less than 6:

    Speedup ≤ 1 / (0.05 + (1 - 0.05)/8) ≈ 5.9

Page 84: Parallel Real-Time Systems Parallel Computing Overview.

Example 2

• Assume 5% of a parallel program's execution time is spent within inherently sequential code.
• Amdahl's law shows that the maximum speedup achievable by this program, regardless of how many PEs are used, is

    lim (p→∞) 1 / (0.05 + (1 - 0.05)/p) = 1/0.05 = 20
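As a small illustration (added here, not from the slides), the C sketch below evaluates the Amdahl bound 1/(f + (1 - f)/n) for the 5% sequential fraction used above; the printed bounds approach the limit of 20 as n grows.

    /* amdahl.c - evaluate the Amdahl's law bound for f = 0.05 and growing n. */
    #include <stdio.h>

    static double amdahl_bound(double f, double n)
    {
        return 1.0 / (f + (1.0 - f) / n);    /* S(n) <= 1 / (f + (1-f)/n) */
    }

    int main(void)
    {
        double f = 0.05;                     /* inherently sequential fraction */
        int ns[] = {8, 64, 1024, 1000000};
        for (int i = 0; i < 4; i++)
            printf("n = %7d  bound = %.2f\n", ns[i], amdahl_bound(f, ns[i]));
        /* The output approaches 1/f = 20 as n grows. */
        return 0;
    }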

Page 85: Parallel Real-Time Systems Parallel Computing Overview.

Amdahl's Law

• The argument used in the proof of Amdahl's law assumes that speedup cannot be superlinear, so the proof is invalid for "non-traditional" problems.
• Sometimes Amdahl's law is just stated as

    S(n) ≤ 1/f

• Note that S(n) never exceeds 1/f and approaches 1/f as n increases.

Page 86: Parallel Real-Time Systems Parallel Computing Overview.

Consequences of Amdahl's Limitations to Parallelism

• For a long time, Amdahl's law was viewed as a fatal limit to the usefulness of parallelism.
• A key flaw in these early arguments is that they were unaware of the impact of Gustafson's Law:
  – Gustafson's Law: The proportion of the computations that are sequential normally decreases as the problem size increases.
• Note: Gustafson's law is an "observed phenomenon" and not a theorem.
• The negative impact of Amdahl's law disappears as the problem size increases.

Page 87: Parallel Real-Time Systems Parallel Computing Overview.

Limitations of Amdahl's Law

• It is now generally accepted by parallel computing professionals that Amdahl's law is not a serious limit to the benefit and future of parallel computing.
• Note that Amdahl's law shows that efforts required to further reduce the fraction of code that is sequential may pay off in huge performance gains.


Page 89: Parallel Real-Time Systems Parallel Computing Overview.

Parallel MIMD Algorithm Design

Reference: Chapter 3, Quinn textbook

Page 90: Parallel Real-Time Systems Parallel Computing Overview.

90

References

• Slides at www.cs.kent.edu/~jbaker/PDC-F08/ on Parallel Algorithm Design.

• Chapter 3 of Quinn’s Textbook

Page 91: Parallel Real-Time Systems Parallel Computing Overview.

91

Task/Channel Model

• This model is intended for MIMDs (i.e., multiprocessors and multicomputers) and not for SIMDs.
• Parallel computation = set of tasks
• A task consists of a
  – Program
  – Local memory
  – Collection of I/O ports
• Tasks interact by sending messages through channels.
  – A task can send local data values to other tasks via output ports.
  – A task can receive data values from other tasks via input ports.
• The local memory contains the program's instructions and its private data.

Page 92: Parallel Real-Time Systems Parallel Computing Overview.

92

Task/Channel Model

• A channel is a message queue that connects one task's output port with another task's input port.
• Data values appear at the input port in the same order in which they are placed in the channel's output queue.
• A task is blocked if it tries to receive a value at an input port and the value isn't available.
  – The blocked task must wait until the value is received.
• A task sending a message is never blocked, even if previous messages it has sent on the channel have not been received yet.
• Thus, receiving is a synchronous operation and sending is an asynchronous operation.

Page 93: Parallel Real-Time Systems Parallel Computing Overview.

93

Task/Channel Model

• Local accesses of private data are assumed to be easily distinguished from nonlocal data accesses done over channels.
• Local accesses should be considered much faster than nonlocal accesses.
• In this model:
  – The execution time of a parallel algorithm is the period of time during which a task is active.
  – The starting time of a parallel algorithm is when all tasks simultaneously begin executing.
  – The finishing time of a parallel algorithm is when the last task has stopped executing.

Page 94: Parallel Real-Time Systems Parallel Computing Overview.

94

Task/Channel Model

[Figure: tasks (nodes) connected by channels (directed edges).]

A parallel computation can be viewed as a directed graph.

Page 95: Parallel Real-Time Systems Parallel Computing Overview.

95

Foster's Design Methodology

• Ian Foster has proposed a 4-step process for designing parallel algorithms for machines that fit the task/channel model.
  – Foster's online textbook is a useful resource here.
• It encourages the development of scalable algorithms by delaying machine-dependent considerations until the later steps.
• The 4 design steps are called:
  – Partitioning
  – Communication
  – Agglomeration
  – Mapping

Page 96: Parallel Real-Time Systems Parallel Computing Overview.

96

Foster’s Methodology

Page 97: Parallel Real-Time Systems Parallel Computing Overview.

97

Partitioning

• Partitioning: Dividing the computation and data into pieces
• Domain decomposition – one approach
  – Divide the data into pieces.
  – Determine how to associate computations with the data.
  – Focuses on the largest and most frequently accessed data structure.
• Functional decomposition – another approach
  – Divide the computation into pieces.
  – Determine how to associate data with the computations.
  – This often yields tasks that can be pipelined.

Page 98: Parallel Real-Time Systems Parallel Computing Overview.

98

Example Domain Decompositions

Think of the primitive tasks as processors.

• In the first, each 2-D slice is mapped onto one processor of a system using 3 processors.
• In the second, a 1-D slice is mapped onto a processor.
• In the last, an element is mapped onto a processor.

The last yields more primitive tasks and is usually preferred.

Page 99: Parallel Real-Time Systems Parallel Computing Overview.

99

Example Functional Decomposition

Page 100: Parallel Real-Time Systems Parallel Computing Overview.

100

Partitioning Checklist for Evaluating the Quality of a Partition

• At least 10x more primitive tasks than processors in the target computer
• Minimize redundant computations and redundant data storage
• Primitive tasks are roughly the same size
• The number of tasks is an increasing function of problem size
• Remember – we are talking about MIMDs here, which typically have far fewer processors than SIMDs.

Page 101: Parallel Real-Time Systems Parallel Computing Overview.

101

Foster’s Methodology

Page 102: Parallel Real-Time Systems Parallel Computing Overview.

102

Communication

• Determine the values passed among tasks.
• There are two kinds of communication:
• Local communication
  – A task needs values from a small number of other tasks.
  – Create channels illustrating the data flow.
• Global communication
  – A significant number of tasks contribute data to perform a computation.
  – Don't create channels for them early in the design.

Page 103: Parallel Real-Time Systems Parallel Computing Overview.

103

Communication (cont.)

• Communication is part of the parallel computation overhead, since it is something sequential algorithms do not have to do.
  – Costs are larger if some (MIMD) processors have to be synchronized.
• SIMD algorithms have much smaller communication overhead because
  – Much of the SIMD data movement is between the control unit and the PEs on broadcast/reduction circuits
    • especially true for associative computing
  – Parallel data movement along the interconnection network involves lockstep (i.e., synchronous) moves.

Page 104: Parallel Real-Time Systems Parallel Computing Overview.

104

Communication Checklist for Judging the Quality of Communications

• Communication operations should be balanced among tasks

• Each task communicates with only a small group of neighbors

• Tasks can perform communications concurrently

• Tasks can perform computations concurrently

Page 105: Parallel Real-Time Systems Parallel Computing Overview.

105

Foster’s Methodology

Page 106: Parallel Real-Time Systems Parallel Computing Overview.

106

What We Have Hopefully at This Point – and What We Don't Have

• The first two steps look for parallelism in the problem.
• However, the design obtained at this point probably doesn't map well onto a real machine.
• If the number of tasks greatly exceeds the number of processors, the overhead will be strongly affected by how the tasks are assigned to the processors.
• Now we have to decide what type of computer we are targeting.
  – Is it a centralized multiprocessor or a multicomputer?
  – What communication paths are supported?
  – How must we combine tasks in order to map them effectively onto processors?

Page 107: Parallel Real-Time Systems Parallel Computing Overview.

107

Agglomeration

• Agglomeration: Grouping tasks into larger tasks
• Goals
  – Improve performance
  – Maintain scalability of the program
  – Simplify programming – i.e., reduce software engineering costs.
• In MPI programming, a goal is
  – to lower communication overhead.
  – often to create one agglomerated task per processor.
• By agglomerating primitive tasks that communicate with each other, communication is eliminated, as the needed data is local to a processor.

Page 108: Parallel Real-Time Systems Parallel Computing Overview.

108

Agglomeration Can Improve Performance

• It can eliminate communication between primitive tasks agglomerated into a consolidated task.

• It can combine groups of sending and receiving tasks.

Page 109: Parallel Real-Time Systems Parallel Computing Overview.

109

Scalability

• Assume we are manipulating a 3D matrix of size 8 x 128 x 256, and
  – Our target machine is a centralized multiprocessor with 4 CPUs.
• Suppose we agglomerate the 2nd and 3rd dimensions. Can we run on our target machine?
  – Yes, because we can have tasks which are each responsible for a 2 x 128 x 256 submatrix.
• Suppose we change to a target machine that is a centralized multiprocessor with 8 CPUs. Could our previous design basically work?
  – Yes, because each task could handle a 1 x 128 x 256 submatrix.

Page 110: Parallel Real-Time Systems Parallel Computing Overview.

110

Scalability

• However, what if we go to more than 8 CPUs? Would our design change if we had agglomerated the 2nd and 3rd dimensions for the 8 x 128 x 256 matrix?
  – Yes.
• This says the decision to agglomerate the 2nd and 3rd dimensions in the long run has the drawback that code portability to more CPUs is impaired.

Page 111: Parallel Real-Time Systems Parallel Computing Overview.

111

Reducing Software Engineering Costs

• Software engineering – the study of techniques to bring very large projects in on time and on budget.
• One purpose of agglomeration is to look for places where existing sequential code for a task might already exist.
• Use of that code helps bring down the cost of developing a parallel algorithm from scratch.

Page 112: Parallel Real-Time Systems Parallel Computing Overview.

112

Agglomeration Checklist for Checking the Quality of the Agglomeration

• Locality of the parallel algorithm has increased
• Replicated computations take less time than the communications they replace
• Data replication doesn't affect scalability
• All agglomerated tasks have similar computational and communications costs
• Number of tasks increases with problem size
• Number of tasks is suitable for likely target systems
• Tradeoff between agglomeration and code modification costs is reasonable


Page 114: Parallel Real-Time Systems Parallel Computing Overview.

114

Foster’s Methodology

Page 115: Parallel Real-Time Systems Parallel Computing Overview.

115

Mapping

• Mapping: The process of assigning tasks to processors
• Centralized multiprocessor: mapping is done by the operating system
• Distributed memory system: mapping is done by the user
• Conflicting goals of mapping
  – Maximize processor utilization – i.e., the average percentage of time the system's processors are actively executing tasks necessary for solving the problem.
  – Minimize interprocessor communication

Page 116: Parallel Real-Time Systems Parallel Computing Overview.

116

Mapping Example

(a) is a task/channel graph showing the needed communications over channels.

(b) shows a possible mapping of the tasks to 3 processors.

Page 117: Parallel Real-Time Systems Parallel Computing Overview.

117

Mapping Example

If all tasks require the same amount of time and each CPU has the same capability, this mapping would mean the middle processor will take twice as long as the other two.

Page 118: Parallel Real-Time Systems Parallel Computing Overview.

118

Optimal Mapping

• Optimality is with respect to processor utilization and interprocessor communication.
• Finding an optimal mapping is NP-hard.
• Must rely on heuristics applied either manually or by the operating system.
• It is the interaction of processor utilization and communication that is important.
• For example, with p processors and n tasks, putting all tasks on 1 processor makes interprocessor communication zero, but utilization is 1/p.

Page 119: Parallel Real-Time Systems Parallel Computing Overview.

119

A Mapping Decision Tree (Quinn's Suggestions – Details on pg 72)

• Static number of tasks
  – Structured communication
    • Constant computation time per task
      – Agglomerate tasks to minimize communications
      – Create one task per processor
    • Variable computation time per task
      – Cyclically map tasks to processors
  – Unstructured communication
    • Use a static load balancing algorithm
• Dynamic number of tasks
  – Frequent communication between tasks
    • Use a dynamic load balancing algorithm
  – Many short-lived tasks; no internal communication
    • Use a run-time task-scheduling algorithm

Page 120: Parallel Real-Time Systems Parallel Computing Overview.

120

Mapping Checklist to Judge the Quality of a Mapping

• Consider designs based on one task per processor and multiple tasks per processor.
• Evaluate static and dynamic task allocation.
• If dynamic task allocation is chosen, the task allocator (i.e., manager) should not be a bottleneck to performance.
• If static task allocation is chosen, the ratio of tasks to processors should be at least 10:1.

Page 121: Parallel Real-Time Systems Parallel Computing Overview.

Boundary Value Problem

Example to illustrate use of Foster’s design method

Page 122: Parallel Real-Time Systems Parallel Computing Overview.

122

Boundary Value Problem

[Figure: a rod surrounded by insulation, with ice water at each end.]

Problem: The ends of a rod of length 1 are in contact with ice water at 0° C. The initial temperature at distance x from the end of the rod is 100 sin(πx). (These are the boundary values.)

The rod is surrounded by heavy insulation, so the temperature changes along the length of the rod are a result of heat transfer at the ends of the rod and heat conduction along the length of the rod.

We want to model the temperature at any point on the rod as a function of time.

Page 123: Parallel Real-Time Systems Parallel Computing Overview.

123

• Over time the rod gradually cools.
• A partial differential equation (PDE) models the temperature at any point of the rod at any point in time.
• PDEs can be hard to solve directly, but a method called the finite difference method is one way to approximate a good solution using a computer.
• The derivative of f at a point x is defined by the limit:

    f'(x) = lim (h→0) [f(x + h) - f(x)] / h

• If h is a fixed non-zero value (i.e., we don't take the limit), then the above expression is called a finite difference.

Page 124: Parallel Real-Time Systems Parallel Computing Overview.

124

Finite differences approach differential quotients as h goes to zero.

Thus, we can use finite differences to approximate derivatives.

This is often used in numerical analysis, especially in numerical ordinary differential equations and numerical partial differential equations, which aim at the numerical solution of ordinary and partial differential equations respectively.

The resulting methods are called finite-difference methods.

Page 125: Parallel Real-Time Systems Parallel Computing Overview.

125

An Example of Using a Finite Difference Method for an ODE (Ordinary Differential Equation)

Given f'(x) = 3f(x) + 2, the fact that

    [f(x + h) - f(x)] / h approximates f'(x)

can be used to iteratively calculate an approximation to f.

In our case, a finite difference method finds the temperature at a fixed number of points in the rod at various time intervals.

The smaller the steps in space and time, the better the approximation.

Page 126: Parallel Real-Time Systems Parallel Computing Overview.

126

Rod Cools as Time Progresses

A finite difference method computes these temperature approximations (vertical axis) at various points along the rod (horizontal axis) for different times between 0 and 3.

Page 127: Parallel Real-Time Systems Parallel Computing Overview.

127

The Finite Difference Approximation Requires the Following Data Structure

• A matrix is used where columns represent positions and rows represent time.
• The element u(i,j) contains the temperature at position i on the rod at time j.
• At each end of the rod the temperature is always 0.
• At time 0, the temperature at point x is 100 sin(πx).

Page 128: Parallel Real-Time Systems Parallel Computing Overview.

128

Finite Difference Method Actually Used

• We have seen that for small h, we may approximate f'(x) by

    f'(x) ≈ [f(x + h) - f(x)] / h

• It can be shown that in this case, for small h,

    f''(x) ≈ [f(x + h) - 2f(x) + f(x - h)] / h²

• Let u(i,j) represent the matrix element containing the temperature at position i on the rod at time j.
• Using the above approximations, it is possible to determine a positive value r so that

    u(i,j+1) ≈ r·u(i-1,j) + (1 - 2r)·u(i,j) + r·u(i+1,j)

• In the finite difference method, the algorithm computes the temperatures for the next time period using the above approximation (a sequential C sketch follows this slide).
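A sequential sketch (added here, not from the slides) of the update rule in C; the number of intervals, the number of time steps, and the value of r are illustrative choices only.

    /* heat_fd.c - sequential finite difference sketch for the rod problem.
       u[i] holds the temperature at position i for the current time step. */
    #include <math.h>
    #include <stdio.h>

    #define N 10          /* number of intervals along the rod (illustrative) */
    #define STEPS 100     /* number of time steps (illustrative) */

    int main(void)
    {
        double u[N + 1], unew[N + 1];
        double r = 0.25;                        /* illustrative value of r */

        for (int i = 0; i <= N; i++)            /* initial temperatures */
            u[i] = 100.0 * sin(M_PI * i / (double)N);
        u[0] = u[N] = 0.0;                      /* ends held at 0 degrees C */

        for (int t = 0; t < STEPS; t++) {
            for (int i = 1; i < N; i++)         /* update the interior points */
                unew[i] = r * u[i-1] + (1 - 2*r) * u[i] + r * u[i+1];
            unew[0] = unew[N] = 0.0;
            for (int i = 0; i <= N; i++)        /* advance to the next time step */
                u[i] = unew[i];
        }

        printf("Temperature at the midpoint after %d steps: %f\n", STEPS, u[N/2]);
        return 0;
    }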

Page 129: Parallel Real-Time Systems Parallel Computing Overview.

129

Partitioning Step

• This one is fairly easy to identify initially.
• There is one data item (i.e., a temperature) per grid point in the matrix.
• Let's associate one primitive task with each grid point.
• A primitive task would be the calculation of u(i,j+1) as shown on the last slide.
• This gives us a two-dimensional domain decomposition.

Page 130: Parallel Real-Time Systems Parallel Computing Overview.

130

Communication Step

• Next, we identify the communication pattern between primitive tasks.
• Each interior primitive task needs three incoming and three outgoing channels, because to calculate

    u(i,j+1) = r·u(i-1,j) + (1 - 2r)·u(i,j) + r·u(i+1,j)

  the task needs u(i-1,j), u(i,j), and u(i+1,j) – i.e., 3 incoming channels – and u(i,j+1) will be needed by 3 other tasks – i.e., 3 outgoing channels.
• Tasks on the sides don't need as many channels, but we really need to worry about the interior nodes.

Page 131: Parallel Real-Time Systems Parallel Computing Overview.

131

Agglomeration Step

We now have the task/channel graph (not shown here).

• It should be clear this is not a good situation, even if we had enough processors.
• The top row depends on values from the bottom rows.
• Be careful when designing a parallel algorithm that you don't think you have parallelism when the tasks are actually sequential.

Page 132: Parallel Real-Time Systems Parallel Computing Overview.

132

Collapse the Columns in the 1st Agglomeration Step

The first task/channel graph represents each task as computing one temperature for a given position and time.

The second task/channel graph represents each task as computing the temperature at a particular position for all time steps.

Page 133: Parallel Real-Time Systems Parallel Computing Overview.

133

Mapping Step

• The graph shows only a few intervals. We are using one processor per task.
• For the sake of a good approximation, we may want many more intervals than we have processors.
• We go back to the decision tree on page 72 to see if we can do better when we want more intervals than we have available processors.
• Note: On a large SIMD with an interconnection network (which the ASC emulator doesn't have), we might stop here, as we could possibly have enough processors.

Page 134: Parallel Real-Time Systems Parallel Computing Overview.

134

Use the Decision Tree (See the earlier slide on the Decision Tree or pg 72 of Quinn)

• The number of tasks is static once we decide how many intervals we want to use.
• The communication pattern among the tasks is regular – i.e., structured.
• Each task performs the same computations.
• Therefore, the decision tree says to create one task per processor by agglomerating primitive tasks so that computation workloads are balanced and communication is minimized.
• So, we will associate a contiguous piece of the rod with each task by dividing the rod into n pieces of size h, where n is the number of processors we have.

Comment: We can also decide how to design the algorithm without use of the decision tree.

Page 135: Parallel Real-Time Systems Parallel Computing Overview.

135

Pictorially

Our previous task/channel graph assumed 10 consolidated tasks, one per interval. If we now assume 3 processors, each processor is responsible for a contiguous block of intervals (a rough MPI sketch follows this slide).

Note this maintains the possibility of using some kind of nearest-neighbor interconnection network and eliminates unnecessary communication.

What interconnection networks would work well?
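A rough MPI sketch (added here, not from the slides) of the mapped design: each of p processes owns a contiguous block of the rod and exchanges its edge temperatures with its left and right neighbors every time step. The block size, time-step count, value of r, and initial temperatures are illustrative placeholders.

    /* heat_mpi.c - one contiguous block of the rod per process, with
       nearest-neighbor exchange of edge temperatures each time step (sketch). */
    #include <mpi.h>
    #include <stdio.h>

    #define LOCAL_N 8        /* interior points per process (illustrative) */
    #define STEPS   100

    int main(int argc, char *argv[])
    {
        int rank, p;
        double u[LOCAL_N + 2] = {0};    /* local block plus two ghost cells */
        double unew[LOCAL_N + 2];
        double r = 0.25;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        for (int i = 1; i <= LOCAL_N; i++)
            u[i] = 100.0;               /* placeholder initial temperatures */

        int left  = (rank == 0)     ? MPI_PROC_NULL : rank - 1;
        int right = (rank == p - 1) ? MPI_PROC_NULL : rank + 1;

        for (int t = 0; t < STEPS; t++) {
            /* exchange edge values with neighbors into the ghost cells */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                         &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            for (int i = 1; i <= LOCAL_N; i++)   /* local finite difference step */
                unew[i] = r * u[i-1] + (1 - 2*r) * u[i] + r * u[i+1];
            for (int i = 1; i <= LOCAL_N; i++)
                u[i] = unew[i];
        }

        printf("Process %d: first local temperature = %f\n", rank, u[1]);
        MPI_Finalize();
        return 0;
    }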

Page 136: Parallel Real-Time Systems Parallel Computing Overview.

136

Agglomeration and Mapping

[Figure: the agglomeration of primitive tasks and their mapping onto processors.]

Page 137: Parallel Real-Time Systems Parallel Computing Overview.

End of Unit

• This unit covered an overview of general topics on parallel computing.

• Slides were taken from the website for my "Parallel and Distributed Computing" course.

• This website is at www.cs.kent.edu/~jbaker/PDC-F08/ and can be used for reference, if desired.

137