CSE 8383 - Advanced Computer Architecture

CSE 8383 - Advanced Computer Architecture Week-5 Week of Feb 9, 2004 engr.smu.edu/~rewini/8383


Transcript of CSE 8383 - Advanced Computer Architecture

Page 1: CSE 8383 - Advanced Computer Architecture

CSE 8383 - Advanced Computer Architecture

Week-5, Week of Feb 9, 2004

engr.smu.edu/~rewini/8383

Page 2: CSE 8383 - Advanced Computer Architecture

Contents
- Project/Schedule
- Introduction to Multiprocessors
- Parallelism
- Performance
- PRAM Model
- ….

Page 3: CSE 8383 - Advanced Computer Architecture

Warm Up
- Parallel Numerical Integration
- Parallel Matrix Multiplication

In class: Discuss with your neighbor!
Videotape: Think about it!

What kind of architecture do we need?

Page 4: CSE 8383 - Advanced Computer Architecture

Explicit vs. Implicit Parallelism

Parallel Architecture

Programming Environment

Parallelizer

Sequential program

Parallel program

Page 5: CSE 8383 - Advanced Computer Architecture

Motivation
- One-processor systems are not capable of delivering solutions to some problems in reasonable time
- Multiple processors cooperate to jointly execute a single computational task in order to speed up its execution
- Speed-up versus Quality-up

Page 6: CSE 8383 - Advanced Computer Architecture

Multiprocessing

One-processor

Multiprocessor

Speed-up Quality-up Sharing

Physical limitations

N processors cooperate to solve a single computational task

Page 7: CSE 8383 - Advanced Computer Architecture

Flynn’s Classification - revisited

SISD (single instruction stream over a single data stream)

SIMD (single instruction stream over multiple data streams)

MIMD (multiple instruction streams over multiple data streams)

MISD (multiple instruction streams and a single data stream)

Page 8: CSE 8383 - Advanced Computer Architecture

SISD (single instruction stream over a single data stream)

[Figure: SISD uniprocessor architecture, showing the CU, PU, MU, and I/O connected by the instruction stream (IS) and data stream (DS).]

Captions:

CU = control unit PU = Processing unit

MU = memory unit IS = instruction stream

DS = data stream PE = processing element

LM = Local Memory

Page 9: CSE 8383 - Advanced Computer Architecture

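SIMD (single instruction stream over multiple data streams)

[Figure: SIMD architecture. The CU, with the program loaded from the host, sends the IS to processing elements PE1 to PEn; each PE exchanges a DS with its local memory (LM1 to LMn), and data sets are loaded from the host.]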

Page 10: CSE 8383 - Advanced Computer Architecture

MIMD (multiple instruction streams over multiple data streams)

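[Figure: MIMD architecture (with shared memory), showing control units CU1 to CUn, processing units PU1 to PUn, the shared memory, and I/O, connected by instruction streams (IS) and data streams (DS).]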

Page 11: CSE 8383 - Advanced Computer Architecture

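MISD (multiple instruction streams and a single data stream)

[Figure: MISD architecture (the systolic array), showing control units CU1 to CUn, processing units PU1 to PUn, a memory holding program and data, and I/O, connected by instruction streams (IS) and a data stream (DS).]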

Page 12: CSE 8383 - Advanced Computer Architecture

System Components

Three major components:
- Processors
- Memory Modules
- Interconnection Network

Page 13: CSE 8383 - Advanced Computer Architecture

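Memory Access
- Shared Memory
- Distributed Memory

[Figure: shared memory, with processors (P) attached to a common memory (M), versus distributed memory, with each processor (P) paired with its own memory (M).]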

Page 14: CSE 8383 - Advanced Computer Architecture

Interconnection Network Taxonomy

Interconnection Network
- Static
  - 1-D
  - 2-D
  - HC
- Dynamic
  - Bus-based
    - Single
    - Multiple
  - Switch-based
    - SS
    - MS
    - Crossbar

Page 15: CSE 8383 - Advanced Computer Architecture

MIMD Shared Memory Systems

[Figure: processors (P) connected to memory modules (M) through interconnection networks.]

Page 16: CSE 8383 - Advanced Computer Architecture

Shared Memory
- Single address space
- Communication via read & write
- Synchronization via locks

Page 17: CSE 8383 - Advanced Computer Architecture

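Bus-Based & Switch-Based SM Systems

[Figure: a bus-based system, in which processors (P) with caches (C) share a global memory, and a switch-based system, in which processors with caches reach multiple memory modules (M).]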

Page 18: CSE 8383 - Advanced Computer Architecture

Cache Coherent NUMA

[Figure: four nodes, each with a memory (M), cache (C), and processor (P), attached to an interconnection network.]

Page 19: CSE 8383 - Advanced Computer Architecture

MIMD Distributed Memory Systems

[Figure: four processor (P) and memory (M) pairs connected through interconnection networks.]

Page 20: CSE 8383 - Advanced Computer Architecture

Distributed Memory
- Multiple address spaces
- Communication via send & receive
- Synchronization via messages

Page 21: CSE 8383 - Advanced Computer Architecture

SIMD Computers

[Figure: a von Neumann computer and an array of 16 processor-memory (P, M) pairs linked by some interconnection network.]

Page 22: CSE 8383 - Advanced Computer Architecture

SIMD (Data Parallel)
- Parallel operations within a computation are partitioned spatially rather than temporally
- Scalar instructions vs. array instructions
- Processors are incapable of operating autonomously; they must be driven by the control unit

Page 23: CSE 8383 - Advanced Computer Architecture

Past Trends in Parallel Architecture (inside the box)
- Completely custom designed components (processors, memory, interconnects, I/O)
- Longer R&D time (2-3 years)
- Expensive systems
- Quickly becoming outdated
- Bankrupt companies!!

Page 24: CSE 8383 - Advanced Computer Architecture

New Trends in Parallel Architecture (outside the box)
- Advances in commodity processors and network technology
- Network of PCs and workstations connected via LAN or WAN forms a Parallel System
- Network Computing
- Compete favorably (cost/performance)
- Utilize unused cycles of systems sitting idle

Page 25: CSE 8383 - Advanced Computer Architecture

Clusters

[Figure: three nodes, each with a processor (P), cache (C), memory (M), I/O, and its own OS, joined by an interconnection network, with middleware and a programming environment layered on top.]

Page 26: CSE 8383 - Advanced Computer Architecture

Grids
- Grids are geographically distributed platforms for computation.
- They provide dependable, consistent, pervasive, and inexpensive access to high end computational capabilities.

Page 27: CSE 8383 - Advanced Computer Architecture

Problem

Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such a component. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel 300,000 kilometers per second. What must be the diameter of a round chip so that it can switch 10^9 times per second? What would the diameter be if the switching requirement were 10^12 times per second?
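One way to approach the problem, sketched in Python. It assumes that a signal must cross the full diameter once per switching period, so diameter = signal speed / switching rate. The function name and constant are illustrative, not part of the slides.

```python
# Signal speed given in the problem statement: 300,000 km/s.
SIGNAL_SPEED_KM_PER_S = 300_000


def max_diameter_m(switches_per_second):
    """Largest chip diameter (in meters) that lets a signal cross the chip
    once per switching period, under the problem's simplifying assumptions."""
    speed_m_per_s = SIGNAL_SPEED_KM_PER_S * 1000
    return speed_m_per_s / switches_per_second


if __name__ == "__main__":
    for rate in (1e9, 1e12):
        print(f"{rate:.0e} switches/s -> diameter {max_diameter_m(rate)} m")
```

Under this assumption the diameter comes out at about 0.3 m for 10^9 switches per second and about 0.3 mm for 10^12, which is why physical size limits the speed of a single processor.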

Page 28: CSE 8383 - Advanced Computer Architecture

Grosch’s Law (1960s)
- “To sell a computer for twice as much, it must be four times as fast”
- Vendors skip small speed improvements in favor of waiting for large ones
- Buyers of expensive machines would wait for a twofold improvement in performance for the same price.

Page 29: CSE 8383 - Advanced Computer Architecture

Moore’s Law
- Gordon Moore (cofounder of Intel)
- Processor performance would double every 18 months
- This prediction has held for several decades
- Unlikely that single-processor performance continues to increase indefinitely

Page 30: CSE 8383 - Advanced Computer Architecture

Von Neumann’s bottleneck
- Great mathematician of the 1940s and 1950s
- Single control unit connecting a memory to a processing unit
- Instructions and data are fetched one at a time from memory and fed to the processing unit
- Speed is limited by the rate at which instructions and data are transferred from memory to the processing unit.

Page 31: CSE 8383 - Advanced Computer Architecture

Parallelism
- Multiple CPUs
- Within the CPU
  - One pipeline
  - Multiple pipelines

Page 32: CSE 8383 - Advanced Computer Architecture

Speedup

S = Speed(new) / Speed(old)

S = Work/time(new) / Work/time(old)

S = time(old) / time(new)

S = time(before improvement) / time(after improvement)

Page 33: CSE 8383 - Advanced Computer Architecture

Speedup Time (one CPU): T(1)

Time (n CPUs): T(n)

Speedup: S

S = T(1)/T(n)

Page 34: CSE 8383 - Advanced Computer Architecture

Amdahl’s Law

The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used

Page 35: CSE 8383 - Advanced Computer Architecture

Example

Travel from A to B: one segment must be walked (20 hours); the remaining 200 miles can be covered by a faster mode.

Mode    Speed (miles/hour)   Time (hours)         Speedup S
Walk    4                    50 + 20 = 70         1
Bike    10                   20 + 20 = 40         1.8
Car-1   50                   4 + 20 = 24          2.9
Car-2   120                  1.67 + 20 = 21.67    3.2
Car-3   600                  0.33 + 20 = 20.33    3.4
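The figures in the travel example can be reproduced with a short Python sketch. The function and variable names are illustrative; the distances, speeds, and the 20-hour walking segment come from the slide.

```python
def trip_hours(speed_mph, distance_miles=200, walk_hours=20):
    """Total trip time: the 200-mile stretch at the given speed plus the
    segment that must be walked regardless of the chosen mode."""
    return distance_miles / speed_mph + walk_hours


MODES = [("Walk", 4), ("Bike", 10), ("Car-1", 50), ("Car-2", 120), ("Car-3", 600)]

if __name__ == "__main__":
    baseline = trip_hours(4)  # walking the whole way is the reference time
    for name, speed in MODES:
        t = trip_hours(speed)
        print(f"{name}: {t:.2f} hours, S = {baseline / t:.1f}")
```

However fast the vehicle, the total time never drops below the 20 hours that must be walked, so the speedup stays below 70/20 = 3.5.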

Page 36: CSE 8383 - Advanced Computer Architecture

Amdahl’s Law (1967)

α: The fraction of the program that is naturally serial

(1-α): The fraction of the program that is naturally parallel

Page 37: CSE 8383 - Advanced Computer Architecture

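With α the serial fraction and N the number of processors:

$$S = \frac{T(1)}{T(N)}$$

$$T(N) = \alpha\,T(1) + \frac{T(1)\,(1-\alpha)}{N}$$

$$S = \frac{1}{\alpha + \frac{1-\alpha}{N}} = \frac{N}{\alpha N + (1-\alpha)}$$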
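A minimal Python sketch of the Amdahl speedup formula, S = 1 / (α + (1-α)/N). The function name and the sample values are illustrative only.

```python
def amdahl_speedup(alpha, n):
    """Speedup on n processors when a fraction alpha of the program is serial."""
    if not 0.0 <= alpha <= 1.0 or n < 1:
        raise ValueError("need 0 <= alpha <= 1 and n >= 1")
    return 1.0 / (alpha + (1.0 - alpha) / n)


if __name__ == "__main__":
    # With a 10% serial fraction the speedup approaches, but never reaches, 1/alpha = 10.
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.1, n), 2))
```

As N grows the term (1-α)/N vanishes, so the speedup is bounded by 1/α no matter how many processors are added.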

Page 38: CSE 8383 - Advanced Computer Architecture

Amdahl’s Law

Page 39: CSE 8383 - Advanced Computer Architecture

Gustafson-Barsis Law

N and α are not independent of each other

$$T(N) = 1$$

$$T(1) = \alpha + (1-\alpha)\,N$$

$$S = N - (N-1)\,\alpha$$

α: The fraction of the program that is naturally serial
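The Gustafson-Barsis formula, S = N - (N-1)α, can be sketched the same way. The function name and sample values are illustrative.

```python
def gustafson_speedup(alpha, n):
    """Scaled speedup on n processors, where alpha is the serial fraction
    of the run time measured on the parallel machine (T(N) = 1)."""
    if not 0.0 <= alpha <= 1.0 or n < 1:
        raise ValueError("need 0 <= alpha <= 1 and n >= 1")
    return n - (n - 1) * alpha


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, gustafson_speedup(0.1, n))
```

Because T(1) is assumed to grow with N, this speedup grows linearly with the number of processors for a fixed α.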

Page 40: CSE 8383 - Advanced Computer Architecture

Gustafson-Barsis Law

Page 41: CSE 8383 - Advanced Computer Architecture

Comparison of Amdahl’s Law vs Gustafson-Barsis’ Law

Page 42: CSE 8383 - Advanced Computer Architecture

Example

For I = 1 to 10 do
begin
    S[I] = 0.0;
    for J = 1 to 10 do
        S[I] = S[I] + M[I, J];
    S[I] = S[I]/10;
end
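A runnable Python rendering of the loop above, as a sketch. It uses 0-based indices instead of the slide's 1-based ones, and the name `row_averages` is illustrative.

```python
N = 10  # matrix dimension used on the slide


def row_averages(m):
    """Return S, where S[i] is the average of row i of the N x N matrix m."""
    s = [0.0] * N
    for i in range(N):          # each iteration touches only row i and s[i]
        for j in range(N):
            s[i] += m[i][j]
        s[i] /= N
    return s


if __name__ == "__main__":
    matrix = [[float(i)] * N for i in range(N)]
    print(row_averages(matrix))
```

The iterations of the outer loop do not depend on one another, so each row could be handed to a different processor.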

Page 43: CSE 8383 - Advanced Computer Architecture
Page 44: CSE 8383 - Advanced Computer Architecture

Distributed Computing Performance

Single Program Performance

Multiple Program Performance

Page 45: CSE 8383 - Advanced Computer Architecture

PRAM Model

Page 46: CSE 8383 - Advanced Computer Architecture

What is a Model?
- According to Webster’s Dictionary, a model is “a description or analogy used to help visualize something that cannot be directly observed.”
- According to The Oxford English Dictionary, a model is “a simplified or idealized description or conception of a particular system, situation or process.”

Page 47: CSE 8383 - Advanced Computer Architecture

Why Models?

In general, the purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction.

Maggs, Matheson and Tarjan (1995)

Page 48: CSE 8383 - Advanced Computer Architecture

Models in Problem Solving

Computer scientists use models to help design problem solving tools such as:
- Fast Algorithms
- Effective Programming Environments
- Powerful Execution Engines

Page 49: CSE 8383 - Advanced Computer Architecture

An Interface

A model is an interface separating high level properties from low level ones

[Figure: the MODEL sits between Applications and Architectures; it provides operations and requires implementation.]

Page 50: CSE 8383 - Advanced Computer Architecture

PRAM Model
- Synchronized Read Compute Write Cycle
- EREW, ERCW, CREW, CRCW
- Complexity: T(n), P(n), C(n)

[Figure: processors P1, P2, ..., Pp, each with a private memory, share a global memory and are driven by a common control.]

Page 51: CSE 8383 - Advanced Computer Architecture

The PRAM model and its variations (cont.)

There are different modes for read and write operations in a PRAM.
- Exclusive read (ER)
- Exclusive write (EW)
- Concurrent read (CR)
- Concurrent write (CW)
  - Common
  - Arbitrary
  - Minimum
  - Priority

Based on the different modes described above, the PRAM can be further divided into the following four subclasses.
- EREW-PRAM model
- CREW-PRAM model
- ERCW-PRAM model
- CRCW-PRAM model
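The four concurrent-write variants can be illustrated with a small Python sketch. The slide only names the variants, so the rules below are assumptions: "common" requires all writers to agree, "arbitrary" lets any one writer succeed, "minimum" is taken to mean the smallest value wins, and "priority" is taken to mean the lowest-numbered processor wins. Textbooks differ on the last two; the function name is hypothetical.

```python
import random


def resolve_concurrent_write(writes, policy, rng=random):
    """Pick the value stored when several processors write one cell at once.

    writes: non-empty list of (processor_index, value) pairs.
    policy: "common", "arbitrary", "minimum", or "priority" (rules assumed,
    see the lead-in text).
    """
    if not writes:
        raise ValueError("no writes to resolve")
    if policy == "common":
        values = {value for _, value in writes}
        if len(values) != 1:
            raise ValueError("common CW requires all writers to agree")
        return values.pop()
    if policy == "arbitrary":
        return rng.choice(writes)[1]
    if policy == "minimum":      # assumption: smallest value wins
        return min(value for _, value in writes)
    if policy == "priority":     # assumption: lowest processor index wins
        return min(writes, key=lambda w: w[0])[1]
    raise ValueError(f"unknown policy: {policy}")
```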

Page 52: CSE 8383 - Advanced Computer Architecture

Analysis of Algorithms
- Sequential Algorithms
  - Time Complexity
  - Space Complexity
- An algorithm whose time complexity is bounded by a polynomial is called a polynomial-time algorithm. An algorithm is considered to be efficient if it runs in polynomial time.

Page 53: CSE 8383 - Advanced Computer Architecture

Analysis of Sequential Algorithms

[Figure: the relationships among P, NP, NP-complete, and NP-hard.]

Page 54: CSE 8383 - Advanced Computer Architecture

Analysis of parallel algorithms

The performance of a parallel algorithm is expressed in terms of how fast it is and how many resources it uses when it runs.
- Run time, which is defined as the time elapsed during the execution of the algorithm
- Number of processors the algorithm uses to solve a problem
- The cost of the parallel algorithm, which is the product of the run time and the number of processors
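The cost measure is simply the product of the other two. A tiny Python sketch, with a hypothetical function name and made-up sample numbers:

```python
def parallel_cost(run_time, processors):
    """Cost of a parallel algorithm: run time multiplied by processor count."""
    return run_time * processors


if __name__ == "__main__":
    # Hypothetical algorithm: 10 time steps on 8 processors.
    print(parallel_cost(10, 8))
```

Comparing this product with the run time of the best sequential algorithm shows how much total work the parallel version spends to achieve its shorter run time.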

Page 55: CSE 8383 - Advanced Computer Architecture

Analysis of parallel algorithms

The NC-class and P-completeness

[Figure: the relationships among P, NP, NP-complete, NP-hard, NC, and P-complete (if P ≠ NP and NC ≠ P).]

Page 56: CSE 8383 - Advanced Computer Architecture

Simulating multiple accesses on an EREW PRAM

Broadcasting mechanism:
- P1 reads x and makes it known to P2.
- P1 and P2 make x known to P3 and P4, respectively, in parallel.
- P1, P2, P3 and P4 make x known to P5, P6, P7 and P8, respectively, in parallel.
- These eight processors will make x known to another eight processors, and so on.

Page 57: CSE 8383 - Advanced Computer Architecture

Simulating multiple accesses on an EREW PRAM (cont.)

Simulating concurrent read on an EREW PRAM with eight processors using Algorithm Broadcast_EREW

[Figure: panels (a) to (d) show the value x spreading through the array L: P1 holds it in (a), P2 receives it in (b), P3 and P4 in (c), and P5 to P8 in (d).]

Page 58: CSE 8383 - Advanced Computer Architecture

Simulating multiple accesses on an EREW PRAM (cont.)

Algorithm Broadcast_EREW

Processor P1
    y (in P1’s private memory) ← x
    L[1] ← y
for i = 0 to log p - 1 do
    forall Pj, where 2^i + 1 ≤ j ≤ 2^(i+1) do in parallel
        y (in Pj’s private memory) ← L[j - 2^i]
        L[j] ← y
    endfor
endfor
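Algorithm Broadcast_EREW can be simulated sequentially in Python, as a sketch. It assumes p is a power of two, models each "forall" step as a read phase followed by a write phase, and checks that no two processors read the same cell of L in one step. The function name is illustrative.

```python
def broadcast_erew(x, p):
    """Simulate Broadcast_EREW for p processors (p assumed a power of two).

    Returns the shared array L[1..p] as a Python list of length p.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    shared = [None] * (p + 1)    # shared array L, 1-indexed; slot 0 unused
    private = [None] * (p + 1)   # private y of each processor, 1-indexed
    private[1] = x               # P1 reads x into its private memory
    shared[1] = private[1]
    for i in range(p.bit_length() - 1):              # i = 0 .. log p - 1
        active = range(2**i + 1, 2**(i + 1) + 1)     # 2^i + 1 <= j <= 2^(i+1)
        sources = [j - 2**i for j in active]
        # Exclusive read: every active processor reads a different cell.
        assert len(set(sources)) == len(sources)
        for j, src in zip(active, sources):          # read phase
            private[j] = shared[src]
        for j in active:                             # write phase
            shared[j] = private[j]
    return shared[1:]


if __name__ == "__main__":
    print(broadcast_erew("x", 8))
```

The number of processors that know x doubles in each iteration, so the broadcast finishes in log p steps.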

Page 59: CSE 8383 - Advanced Computer Architecture

Bus-based Shared Memory
- Collection of wires and connectors
- Only one transaction at a time
- Bottleneck!! How can we solve the problem?

[Figure: five processors (P) attached to a global memory over a shared bus.]

Page 60: CSE 8383 - Advanced Computer Architecture

Single Processor caching

[Figure: processor P, a cache holding x, and memory holding x.]

- Hit: data in the cache
- Miss: data is not in the cache
- Hit rate: h
- Miss rate: m = (1-h)

Page 61: CSE 8383 - Advanced Computer Architecture

Writing in the cache

Case            Cache   Memory
Before          x       x
Write through   x’      x’
Write back      x’      x
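The difference between the two write policies can be sketched with a toy single-processor cache in Python. The class and method names are hypothetical, and eviction is used here as the moment a write-back cache copies a modified line to memory.

```python
class ToyCache:
    """Toy cache in front of a dict-based memory (illustrative only)."""

    def __init__(self, memory, write_through):
        self.memory = memory          # address -> value
        self.lines = {}               # cached copies
        self.dirty = set()            # addresses modified but not yet written back
        self.write_through = write_through

    def read(self, addr):
        if addr not in self.lines:    # miss: fetch the value from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value  # memory is updated immediately
        else:
            self.dirty.add(addr)       # memory is updated only on eviction

    def evict(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)
```

With write through, memory holds x’ right after the write; with write back, memory keeps the old x until the line is evicted.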

Page 62: CSE 8383 - Advanced Computer Architecture

Using Caches

[Figure: processors P1, P2, P3, ..., Pn, each with its own cache C1, C2, C3, ..., Cn, attached to a global memory.]

- Cache Coherence problem
- How many processors?

Page 63: CSE 8383 - Advanced Computer Architecture

Group Activity

Variables:
- Number of processors (n)
- Hit rate (h)
- Bus bandwidth (B)
- Processor speed (v)

Condition: n*(1 - h)*v <= B

Maximum number of processors: n = B / ((1 - h)*v)
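The bound can be evaluated with a short Python sketch. The function name and the sample numbers are hypothetical; B and v are assumed to be expressed in the same units (memory references per unit time).

```python
import math


def max_processors(bandwidth, hit_rate, speed):
    """Largest n satisfying n * (1 - hit_rate) * speed <= bandwidth."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit rate must satisfy 0 <= h < 1")
    bus_demand_per_processor = (1.0 - hit_rate) * speed  # misses reach the bus
    return math.floor(bandwidth / bus_demand_per_processor)


if __name__ == "__main__":
    # Hypothetical values: B = 100, h = 0.75, v = 8.
    print(max_processors(100, 0.75, 8))
```

Only cache misses use the bus, so raising the hit rate lets the same bus serve more processors.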

Page 64: CSE 8383 - Advanced Computer Architecture

Cache Coherence

[Figure: copies of x held in memory and in the caches of several of the processors P1, P2, P3, ..., Pn.]

- Multiple copies of x
- What if P1 updates x?

Page 65: CSE 8383 - Advanced Computer Architecture

Cache Coherence Policies

Writing to cache in the 1-processor case:
- Write Through
- Write Back

Writing to cache in the n-processor case:
- Write Update - Write Through
- Write Update - Write Back
- Write Invalidate - Write Through
- Write Invalidate - Write Back

Page 66: CSE 8383 - Advanced Computer Architecture

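Write-invalidate

[Figure: three panels for P1, P2, and P3. Before: the cached copies and memory hold x. Write through: P1 writes x’, the other cached copy is marked invalid (I), and memory holds x’. Write back: P1 writes x’, the other cached copy is marked invalid (I), and memory still holds x.]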

Page 67: CSE 8383 - Advanced Computer Architecture

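Write-Update

[Figure: three panels for P1, P2, and P3. Before: the cached copies and memory hold x. Write through: P1 writes x’, the other cached copy is updated to x’, and memory holds x’. Write back: P1 writes x’, the other cached copy is updated to x’, and memory still holds x.]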
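The write-invalidate and write-update policies, each combined with write through or write back, can be sketched in Python. The function name, the dictionary layout, and the choice of which processors hold a copy are assumptions made for illustration.

```python
def coherent_write(caches, memory, writer, value, policy, write_through):
    """Apply one write and return the new (caches, memory) state.

    caches: dict mapping processor name to its cached value, None if the
            processor holds no copy, or "I" for an invalidated copy.
    policy: "invalidate" or "update".
    """
    if policy not in ("invalidate", "update"):
        raise ValueError(f"unknown policy: {policy}")
    new_caches = dict(caches)
    for proc, copy in caches.items():
        if proc == writer:
            new_caches[proc] = value
        elif copy not in (None, "I"):      # only valid copies are affected
            new_caches[proc] = value if policy == "update" else "I"
    new_memory = value if write_through else memory  # write back defers this
    return new_caches, new_memory


if __name__ == "__main__":
    before = {"P1": "x", "P2": None, "P3": "x"}   # assumed starting state
    for policy in ("invalidate", "update"):
        for wt in (True, False):
            print(policy, wt, coherent_write(before, "x", "P1", "x'", policy, wt))
```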

Page 68: CSE 8383 - Advanced Computer Architecture

Synchronization

[Figure: P1, P2, and P3 each execute Lock ….. unlock around a section guarded by shared locks; while one processor holds the lock, the others wait.]
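The lock/unlock pattern can be sketched with Python threads standing in for processors. The shared counter and the function names are illustrative; `threading.Lock` is the standard-library lock.

```python
import threading

counter = 0                  # shared variable, as in a single address space
lock = threading.Lock()


def worker(iterations):
    """Increment the shared counter, holding the lock for each update."""
    global counter
    for _ in range(iterations):
        with lock:           # lock ... unlock; other threads wait here
            counter += 1


def run(num_threads=3, iterations=10_000):
    """Start the workers, wait for them to finish, and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(iterations,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run())
```

Because every update happens while the lock is held, no increment is lost and the final count equals the number of threads times the number of iterations.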

Page 69: CSE 8383 - Advanced Computer Architecture

Superscalar Parallelism

Scheduling