CSE 8383 - Advanced Computer Architecture
Transcript of CSE 8383 - Advanced Computer Architecture
Week 5: Week of Feb 9, 2004
engr.smu.edu/~rewini/8383
Contents: Project/Schedule, Introduction to Multiprocessors, Parallelism, Performance, PRAM Model, ...
Warm Up Parallel Numerical Integration Parallel Matrix Multiplication
In class: Discuss with your neighbor!
Videotape: Think about it!
What kind of architecture do we need?
Explicit vs. Implicit Parallelism
Parallel Architecture
Programming Environment
A parallelizer transforms a sequential program into a parallel program.
Motivation: One-processor systems are not capable of delivering solutions to some problems in reasonable time. Multiple processors cooperate to jointly execute a single computational task in order to speed up its execution.
Speed-up versus Quality-up
Multiprocessing: N processors cooperate to solve a single computational task.
One-processor systems face physical limitations; a multiprocessor offers speed-up, quality-up, and sharing.
Flynn’s Classification- revisited
SISD (single instruction stream over a single data stream)
SIMD (single instruction stream over multiple data streams)
MIMD (multiple instruction streams over multiple data streams)
MISD (multiple instruction streams and a single data stream)
SISD (single instruction stream over a single data stream)
SISD uniprocessor architecture: the CU sends the instruction stream (IS) to the PU; the PU exchanges the data stream (DS) with the MU; I/O attaches to the MU.
Captions: CU = control unit, PU = processing unit, MU = memory unit, IS = instruction stream, DS = data stream, PE = processing element, LM = local memory
SIMD (single instruction stream over multiple data streams)
SIMD architecture: the CU broadcasts a single instruction stream (the program, loaded from the host) to processing elements PE1 through PEn, each with its own local memory LM1 through LMn; the data sets, also loaded from the host, are distributed across the local memories, and each PE applies the instruction to its own data stream.
MIMD (multiple instruction streams over multiple data streams)
MIMD architecture (with shared memory): control units CU1 through CUn each issue their own instruction stream (IS) to processing units PU1 through PUn, which exchange data streams (DS) with a shared memory; I/O attaches to the shared memory.
MISD (multiple instruction streams and a single data stream)
MISD architecture (the systolic array): control units CU1, CU2, ..., CUn each issue their own instruction stream to processing units PU1, PU2, ..., PUn, while a single data stream from memory (holding program and data) passes through the PUs in sequence, entering and leaving via I/O.
System Components. Three major components:
Processors
Memory Modules
Interconnection Network
Memory Access
Shared memory: all processors P access a common memory M.
Distributed memory: each processor P has its own memory M.
Interconnection Network Taxonomy
Interconnection networks divide into:
Static: 1-D, 2-D, HC (hypercube)
Dynamic: bus-based (single, multiple) or switch-based (SS = single-stage, MS = multistage, crossbar)
MIMD Shared Memory Systems
Processors P connect through interconnection networks to memory modules M.
Shared memory: single address space; communication via read & write; synchronization via locks.
Bus-based & Switch-based SM Systems
Bus-based: processors, each with a cache (P/C), share a bus to a global memory.
Switch-based: processors with caches (P/C) connect through a switch to memory modules M.
Cache Coherent NUMA
Each node couples a processor P, a cache C, and a local memory M; the nodes communicate over an interconnection network.
MIMD Distributed Memory Systems
Each processor P has its own memory M; processor-memory pairs communicate over interconnection networks.
Distributed memory: multiple address spaces; communication via send & receive; synchronization via messages.
SIMD Computers
A grid of processor-memory (PM) pairs, each essentially a small von Neumann computer, connected by some interconnection network.
SIMD (Data Parallel): parallel operations within a computation are partitioned spatially rather than temporally. Scalar instructions vs. array instructions. Processors are incapable of operating autonomously; they must be driven by the control unit.
Past Trends in Parallel Architecture (inside the box)
Completely custom-designed components (processors, memory, interconnects, I/O). Longer R&D time (2-3 years). Expensive systems. Quickly becoming outdated. Bankrupt companies!!
New Trends in Parallel Architecture (outside the box)
Advances in commodity processors and network technology. A network of PCs and workstations connected via LAN or WAN forms a parallel system. Network computing competes favorably (cost/performance) and utilizes the unused cycles of systems sitting idle.
Clusters
Each node contains a processor P, cache C, memory M, I/O, and its own OS; the nodes are joined by an interconnection network, with middleware and a programming environment layered on top.
Grids
Grids are geographically distributed platforms for computation. They provide dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.
Problem: Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such a component. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel 300,000 kilometers per second. What must be the diameter of a round chip so that it can switch 10^9 times per second? What would the diameter be if the switching requirement were 10^12 times per second?
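As a sanity check on the arithmetic, the chip diameter is bounded by the distance a signal can cross between switches, d = c/f:

```python
# Signal speed given in the problem: 300,000 km/s = 3e8 m/s.
# To switch f times per second, a signal must cross the chip in at
# most 1/f seconds, so the diameter can be at most c/f meters.
C = 3e8  # m/s

def max_diameter_m(f):
    """Largest chip diameter (meters) that still allows f switches per second."""
    return C / f

print(max_diameter_m(1e9))   # 0.3 m  -> a 30 cm chip
print(max_diameter_m(1e12))  # 0.0003 m -> a 0.3 mm chip
```

The function name is illustrative; the physical constants come from the problem statement.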
Grosch's Law (1960s)
"To sell a computer for twice as much, it must be four times as fast." Vendors skip small speed improvements in favor of waiting for large ones. Buyers of expensive machines would wait for a twofold improvement in performance for the same price.
Moore's Law
Gordon Moore (cofounder of Intel) predicted that processor performance would double every 18 months. This prediction has held for several decades. It is unlikely that single-processor performance will continue to increase indefinitely.
Von Neumann's Bottleneck
Von Neumann, a great mathematician of the 1940s and 1950s, described a single control unit connecting a memory to a processing unit. Instructions and data are fetched one at a time from memory and fed to the processing unit. Speed is limited by the rate at which instructions and data are transferred from memory to the processing unit.
Parallelism
Multiple CPUs
Within the CPU: one pipeline or multiple pipelines
Speedup
S = Speed(new) / Speed(old)
S = [Work/time(new)] / [Work/time(old)]
S = time(old) / time(new)
S = time(before improvement) / time(after improvement)
Speedup Time (one CPU): T(1)
Time (n CPUs): T(n)
Speedup: S
S = T(1)/T(n)
Amdahl’s Law
The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used
Example
Travel from A to B: 20 hours of the journey must be walked; the remaining 200 miles can be covered by various means.
Walk (4 miles/hour): 50 + 20 = 70 hours, S = 1
Bike (10 miles/hour): 20 + 20 = 40 hours, S = 1.8
Car-1 (50 miles/hour): 4 + 20 = 24 hours, S = 2.9
Car-2 (120 miles/hour): 1.67 + 20 = 21.67 hours, S = 3.2
Car-3 (600 miles/hour): 0.33 + 20 = 20.33 hours, S = 3.4
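The travel figures can be reproduced directly; the 20 hours of walking plays the role of the serial fraction, so no vehicle can push the speedup past 70/20 = 3.5:

```python
# Trip model from the example: 200 miles by some means, plus 20 hours
# that must be walked regardless. Baseline is walking everything at 4 mph.
def trip_hours(speed_mph):
    return 200 / speed_mph + 20

def speedup(speed_mph):
    return trip_hours(4) / trip_hours(speed_mph)

for name, mph in [("Walk", 4), ("Bike", 10), ("Car-1", 50),
                  ("Car-2", 120), ("Car-3", 600)]:
    print(f"{name}: {trip_hours(mph):.2f} hours, S = {speedup(mph):.1f}")
```

Even an infinitely fast vehicle leaves the 20 walked hours, capping S below 3.5.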
Amdahl's Law (1967)
α: the fraction of the program that is naturally serial
(1 − α): the fraction of the program that is naturally parallel
S = T(1) / T(N)
T(N) = T(1)·α + T(1)·(1 − α)/N
S = 1 / (α + (1 − α)/N) = N / (N·α + (1 − α))
Amdahl’s Law
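The closed form is easy to evaluate; as N grows, S approaches 1/α, the ceiling set by the serial fraction:

```python
def amdahl_speedup(alpha, n):
    """Amdahl's law: S = N / (N*alpha + (1 - alpha)) for serial fraction alpha."""
    return n / (n * alpha + (1 - alpha))

print(amdahl_speedup(0.1, 10))     # about 5.26
print(amdahl_speedup(0.1, 10**6))  # close to the 1/alpha = 10 ceiling
```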
Gustafson-Barsis Law
N and α are not independent of each other
T(N) = 1
T(1) = α + (1 − α)·N
S = N − α·(N − 1)
α: the fraction of the program that is naturally serial
Gustafson-Barsis Law
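Under Gustafson-Barsis scaling the problem grows with N, so with the same serial fraction the scaled speedup is nearly linear:

```python
def gustafson_speedup(alpha, n):
    """Gustafson-Barsis law: S = N - alpha*(N - 1) for serial fraction alpha."""
    return n - alpha * (n - 1)

print(gustafson_speedup(0.1, 10))  # 9.1, versus about 5.26 under Amdahl's law
```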
Comparison of Amdahl’s Law vs Gustafson-Barsis’ Law
For I = 1 to 10 do
begin
    S[I] = 0.0;
    for J = 1 to 10 do
        S[I] = S[I] + M[I, J];
    S[I] = S[I] / 10;
end
Example
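Each iteration of the outer loop touches only row I of M and slot I of S, so the ten iterations are independent and can run in parallel. A minimal Python sketch of this (the sample matrix values and the thread-pool choice are illustrative, not from the slides):

```python
from concurrent.futures import ThreadPoolExecutor

M = [[float(i + j) for j in range(10)] for i in range(10)]  # sample 10x10 matrix

def row_average(row):
    s = 0.0
    for x in row:    # inner J loop: accumulate the row sum
        s += x
    return s / 10    # then divide by the row length

# Outer I loop: iterations touch disjoint data, so map them across workers.
with ThreadPoolExecutor(max_workers=10) as pool:
    S = list(pool.map(row_average, M))

print(S[:3])  # [4.5, 5.5, 6.5]
```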
Distributed Computing Performance
Single Program Performance
Multiple Program Performance
PRAM Model
What is a Model?
According to Webster's Dictionary, a model is "a description or analogy used to help visualize something that cannot be directly observed."
According to The Oxford English Dictionary, a model is "a simplified or idealized description or conception of a particular system, situation or process."
Why Models?
In general, the purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction.
Maggs, Matheson and Tarjan (1995)
Models in Problem Solving
Computer scientists use models to help design problem-solving tools such as: fast algorithms, effective programming environments, and powerful execution engines.
A model is an interface separating high-level properties from low-level ones.
A model is an interface between applications and architectures: to the applications above it, the MODEL provides operations; from the architectures below it, it requires an implementation.
PRAM Model
Synchronized Read-Compute-Write cycle
Variants: EREW, ERCW, CREW, CRCW
Complexity measures: T(n), P(n), C(n)
A control unit coordinates processors P1, P2, ..., Pp, each with a private memory, all sharing a global memory.
The PRAM model and its variations (cont.)
There are different modes for read and write operations in a PRAM: exclusive read (ER), exclusive write (EW), concurrent read (CR), concurrent write (CW).
Concurrent writes are resolved by one of: common, arbitrary, minimum, priority.
Based on the different modes described above, the PRAM can be further divided into the following four subclasses: EREW-PRAM, CREW-PRAM, ERCW-PRAM, CRCW-PRAM.
Analysis of Algorithms
Sequential algorithms: time complexity, space complexity
An algorithm whose time complexity is bounded by a polynomial is called a polynomial-time algorithm. An algorithm is considered to be efficient if it runs in polynomial time.
Analysis of Sequential Algorithms
The relationships among P, NP, NP-complete, and NP-hard
Analysis of parallel algorithm
Performance of a parallel algorithm is expressed in terms of how fast it is and how much resources it uses when it runs.
Run time: the time elapsed during the algorithm's execution
Number of processors: how many processors the algorithm uses to solve the problem
Cost of the parallel algorithm: the product of the run time and the number of processors
Analysis of parallel algorithmThe NC-class and P-completeness
The relationships among P, NP, NP-complete, NP-hard, NC, and P-complete (if P ≠ NP and NC ≠ P)
Simulating multiple accesses on an EREW PRAM
Broadcasting mechanism: P1 reads x and makes it known to P2. Then P1 and P2 make x known to P3 and P4, respectively, in parallel. Then P1, P2, P3, and P4 make x known to P5, P6, P7, and P8, respectively, in parallel. These eight processors then make x known to another eight processors, and so on.
Simulating multiple accesses on an EREW PRAM (cont.)
Simulating Concurrent read on EREW PRAM with eight processors using Algorithm Broadcast_EREW
Simulating multiple accesses on an EREW PRAM (cont.) Algorithm Broadcast_EREW
Processor P1:
    y (in P1's private memory) ← x
    L[1] ← y
for i = 0 to (log p) − 1 do
    forall Pj, where 2^i + 1 ≤ j ≤ 2^(i+1), do in parallel
        y (in Pj's private memory) ← L[j − 2^i]
        L[j] ← y
    endfor
endfor
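The algorithm can be checked with a small simulation (a sketch assuming p is a power of two): in round i, processors 2^i + 1 through 2^(i+1) each read a distinct cell of L, so no read or write is ever concurrent.

```python
import math

def broadcast_erew(x, p):
    """Simulate Broadcast_EREW on p = 2^k processors: after log2(p)
    rounds, every processor's cell of the shared array L holds x."""
    L = [None] * (p + 1)   # 1-indexed shared array
    L[1] = x               # P1 reads x and writes it to L[1]
    for i in range(int(math.log2(p))):
        # Pj for 2^i + 1 <= j <= 2^(i+1), in parallel: each reads L[j - 2^i]
        for j in range(2**i + 1, 2**(i + 1) + 1):
            L[j] = L[j - 2**i]  # distinct sources and targets: exclusive R/W
    return L[1:]

print(broadcast_erew("x", 8))  # ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x']
```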
Bus-based Shared Memory
Processors P share a bus (a collection of wires and connectors) to a global memory. Only one transaction at a time: the bus becomes a bottleneck!! How can we solve the problem?
Single-Processor Caching
The processor P keeps a copy of x from memory in its cache.
Hit: data is in the cache
Miss: data is not in the cache
Hit rate: h
Miss rate: m = (1 − h)
Writing in the Cache
Before a write, both the cache and memory hold x.
Write-through: a write updates both the cache and memory (both hold x').
Write-back: a write updates only the cache (cache holds x', memory still holds x).
Using Caches
Processors P1 through Pn, each with a cache C1 through Cn, share a bus to a global memory. This raises the cache coherence problem, and the question: how many processors can the bus support?
Group Activity
Variables: number of processors (n), hit rate (h), bus bandwidth (B), processor speed (v)
Condition: n·(1 − h)·v ≤ B
Maximum number of processors: n = B / ((1 − h)·v)
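Plugging in illustrative numbers (the bandwidth and speed values below are made up for the exercise, not from the slides):

```python
def max_processors(bandwidth, hit_rate, proc_speed):
    """From n*(1 - h)*v <= B: the bus saturates at n = B / ((1 - h) * v)."""
    return bandwidth / ((1 - hit_rate) * proc_speed)

# Hypothetical: the bus moves 500M words/s, hit rate is 0.95, and each
# processor issues 100M memory references/s -> at most 100 processors.
print(max_processors(500e6, 0.95, 100e6))
```

A higher hit rate directly raises the processor count the bus can sustain, which is why caches matter so much in bus-based systems.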
Cache Coherence
Processors P1, P2, P3, ..., Pn each cache a copy of x. With multiple copies of x, what happens if P1 updates x?
Cache Coherence Policies
Writing to cache in the 1-processor case: write-through, write-back.
Writing to cache in the n-processor case: write-update with write-through, write-update with write-back, write-invalidate with write-through, write-invalidate with write-back.
Write-Invalidate
Before: P1, P2, and P3 each cache x. After P1 writes x': with write-through, P1 holds x', the other copies are invalidated (I), and memory is updated to x'; with write-back, P1 holds x', the other copies are invalidated, and memory still holds the stale x.
Write-Update
Before: P1, P2, and P3 each cache x. After P1 writes x': with write-through, every cached copy becomes x' and memory is updated to x'; with write-back, every cached copy becomes x' but memory still holds the stale x.
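The difference between the two policies can be sketched in a few lines (a toy model, not a real coherence protocol implementation): on a write, invalidate marks every other copy invalid, while update pushes the new value to all copies.

```python
# Toy model: a dict maps each processor name to its cached copy of x.
# None marks an invalidated cache line. Names P1..P3 follow the slides.
def write_invalidate(caches, writer, value):
    return {p: (value if p == writer else None) for p in caches}

def write_update(caches, writer, value):
    return {p: value for p in caches}

before = {"P1": "x", "P2": "x", "P3": "x"}
print(write_invalidate(before, "P1", "x'"))  # {'P1': "x'", 'P2': None, 'P3': None}
print(write_update(before, "P1", "x'"))      # {'P1': "x'", 'P2': "x'", 'P3': "x'"}
```

Invalidate trades cheap writes for later misses on the other processors; update keeps all copies fresh at the cost of bus traffic on every write.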
Synchronization
P1, P2, and P3 each execute lock ... unlock around a critical section; while one processor holds the lock, the others must wait.
Superscalar Parallelism
Scheduling