
55:132/22C:160 High Performance Computer Architecture
Spring 2008

Instructor Information
• Instructor: Jon Kuhl (That's me)
  – Office: 4016A SC
  – Office Hours: 10:30-11:30 a.m. MWF (other times by appointment)
  – E-mail: [email protected]
  – Phone: (319) 335-5958
• TA: Prasidha Mohandas
  – Office: 1313 SC
  – Office hours: t.b.d.

Class Info
• Website: www.engineering.uiowa.edu/~hpca
• Texts:
  – Required: Shen and Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw Hill, 2005.
  – Supplemental: Thomas and Moorby, The Verilog Hardware Description Language, Third Edition, Kluwer Academic Publishers, 1996.
  – Additional Reference: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition, Morgan Kaufmann, 2007.

Course Objectives
• Understand quantitative measures for assessing and comparing processor performance
• Understand modern processor design techniques, including:
  – pipelining
  – instruction-level parallelism
  – multi-threading
  – high-performance memory architecture
• Master the use of modern design tools (HDLs) to design and analyze processors
• Do case studies of contemporary processors
• Discuss future trends in processor design


Expected Background
• A previous course in computer architecture/organization covering:
  – Instruction set architecture (ISA)
  – Addressing modes
  – Assembly language
  – Basic computer organization
  – Memory system organization
    • Cache
    • Virtual memory
  – Etc.
• 22c:060 or 55:035 or equivalent

Course Organization
• Homework assignments (several)
• Two projects (design/analysis exercises using the Verilog HDL and the ModelSim simulation environment)
• Two exams:
  – Midterm: Wed. March 12, in class
  – Final: Tues. May 13, 2:15-4:15 p.m.

Course Organization (continued)
• Grading:
  – Exams:
    • Better of midterm/final exam scores: 35%
    • Poorer of midterm/final exam scores: 25%
  – Homework: 10%
  – Projects: 30%

Historical Perspectives
• The decade of the 1970's: "Birth of Microprocessors"
  – Programmable controllers
  – Single-chip microprocessors
  – Personal computers (PC)
• The decade of the 1980's: "Quantitative Architecture"
  – Instruction pipelining
  – Fast cache memories
  – Compiler considerations
  – Workstations
• The decade of the 1990's: "Instruction-Level Parallelism"
  – Superscalar, speculative microarchitectures
  – Aggressive compiler optimizations
  – Low-cost desktop supercomputing


Moore's Law (1965)
• The number of devices that can be integrated on a single piece of silicon will double roughly every 18-24 months.
• Moore's Law has held true for 40 years and will continue to hold for at least another decade.

[Figure: Intel microprocessors, transistor count 1970-2005.]

[Figure: Processor performance, 1987-1997. Performance vs. year for machines including the SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, and DEC Alpha 21264/600.]


Evolution of Single-Chip Micros

                      1970's      1980's      1990's        2010
  Transistor Count    10K-100K    100K-1M     1M-100M       1B
  Clock Frequency     0.2-2 MHz   2-20 MHz    20 MHz-1 GHz  10 GHz
  Instructions/Cycle  < 0.1       0.1-0.9     0.9-2.0       10 (?)
  MIPS/MFLOPS         < 0.2       0.2-20      20-2,000      100,000

Performance Growth in Perspective
• Doubling every 24 months (1971-2007): a total of 260,000X
  – Cars would travel at 25 million MPH and get 5 million miles/gal.
  – Air travel: L.A. to N.Y. in 0.1 seconds
  – Corn yield: 50 million bushels per acre

A Quote from Robert Cringely

“If the automobile had followed the same development as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year killing everyone inside.”

Convergence of Key Enabling Technologies
• VLSI:
  – Submicron CMOS feature sizes: Intel is shipping 45nm chips and has demonstrated 32nm (2x increase in density every 2 years)
  – Metal layers: 3 -> 4 -> 5 -> 6 -> 9 (copper)
  – Power supply voltage: 5V -> 3.3V -> 2.4V -> 1.8V -> 0.8V
• CAD Tools:
  – Interconnect simulation and critical path analysis
  – Clock signal propagation analysis
  – Process simulation and yield analysis/learning
• Microarchitecture:
  – Superpipelined and superscalar machines
  – Speculative and dynamic microarchitectures
  – Simulation tools and emulation systems
• Compilers:
  – Extraction of instruction-level parallelism
  – Aggressive and speculative code scheduling
  – Object code translation and optimization


Instruction Set Processing
• ARCHITECTURE (ISA) -- programmer/compiler view
  – Functional appearance (interface) to user/system programmer
  – Opcodes, addressing modes, architected registers, IEEE floating point
  – Serves as specification for processor design
• IMPLEMENTATION (microarchitecture) -- processor designer view
  – Logical structure or organization that performs the architecture
  – Pipelining, functional units, caches, physical registers
• REALIZATION (chip) -- chip/system designer view
  – Physical structure that embodies the implementation
  – Gates, cells, transistors, wires

Iron Law

  Processor Performance = Time / Program

  Architecture --> Implementation --> Realization
  (Compiler Designer --> Processor Designer --> Chip Designer)

  Time/Program = Instructions/Program  x  Cycles/Instruction  x  Time/Cycle
               = (code size)           x  (CPI)               x  (cycle time)

Iron Law
• Instructions/Program
  – Instructions executed, not static code size
  – Determined by algorithm, compiler, ISA
• Cycles/Instruction
  – Determined by ISA and CPU organization
  – Overlap among instructions reduces this term
• Time/Cycle
  – Determined by technology, organization, clever circuit design
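
As a quick numeric illustration (not part of the original slides), the three Iron Law factors multiply directly; the instruction count, CPI, and cycle time below are made-up values purely for the example:

```python
# Hypothetical numbers, purely illustrative.
instructions = 2_000_000   # dynamic instruction count (instructions/program)
cpi = 1.4                  # average cycles per instruction
cycle_time_ns = 0.5        # nanoseconds per cycle (a 2 GHz clock)

# Iron Law: Time/Program = (Instr/Program) x (Cycles/Instr) x (Time/Cycle)
exec_time_ns = instructions * cpi * cycle_time_ns
print(f"Execution time = {exec_time_ns / 1e6:.2f} ms")   # 1.40 ms
```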

Overall Goal
• Minimize time, which is the product, NOT isolated terms
• Common error: missing terms while devising optimizations
  – E.g., an ISA change decreases instruction count
  – BUT leads to a CPU organization that makes the clock slower
• Bottom line: the terms are inter-related
• This is the crux of the RISC vs. CISC argument


Instruction Set Architecture
• ISA: the boundary between software and hardware
  – Specifies the logical machine that is visible to the programmer
  – Also a functional spec for the processor designers
• What needs to be specified by an ISA:
  – Operations
    • what to perform and what to perform next
  – Temporary operand storage in the CPU
    • accumulator, stacks, registers
  – Number of operands per instruction
  – Operand location
    • where and how to specify the operands
  – Type and size of operands
  – Instruction-to-binary encoding

Operand Storage
• Registers (in processor) vs. memory?
  – faster access
  – shorter address
• Accumulator
  – less hardware
  – high memory traffic
  – likely bottleneck

Operand Storage
• Stack: LIFO (60's - 70's)
  – simple addressing (top of stack implicit)
  – bottleneck while pipelining (why?)
  – note: the Java VM is stack-based
• Registers: 8 to 256 words
  – flexible: temporaries and variables
  – registers must be named
  – code density and a "second" name space

Caches vs. Registers
• Registers
  – faster (no addressing modes, no tags)
  – deterministic (no misses)
  – can replicate for more ports
  – short identifier
  – must save/restore on procedure calls
  – can't take the address of a register (distinct from memory)
  – fixed size (FP, strings, structures)
  – compilers must manage (an advantage?)


Registers vs. Caches
• How many registers? More =>
  – hold operands longer (reducing memory traffic and run time)
  – longer register specifiers (except with register windows)
  – slower registers
  – more state slows context switches

Operands for ALU Instructions
• ALU instructions require operands
• Number of explicit operands:
  – two: Ri := Ri op Rj
  – three: Ri := Rj op Rk
• Operands in registers or memory?
  – any combination: VAX (variable-length instructions)
  – at least one register: IBM 360/370
  – all registers: Cray, RISCs (separate load/store instructions)

VAX Addressing Modes
• register: Ri
• displacement: M[Ri + #n]
• immediate: #n
• register indirect: M[Ri]
• indexed: M[Ri + Rj]
• absolute: M[#n]
• memory indirect: M[M[Ri]]
• auto-increment: M[Ri]; Ri += d
• auto-decrement: M[Ri]; Ri -= d
• scaled: M[Ri + #n + Rj * d]
• update: M[Ri = Ri + #n]
• Modes 1-4 account for 93% of all VAX operands [Clark and Emer]

Operations
• arithmetic and logical: and, add, ...
• data transfer: move, load, store
• control: branch, jump, call
• system: system call, traps
• floating point: add, mul, div, sqrt
• decimal: addd, convert
• string: move, compare
• multimedia? 2D, 3D? e.g., Intel MMX/SSE and Sun VIS


Control Instructions (Branches)
1. Types of branches
   A. Conditional or unconditional
   B. Save PC?
   C. How is the target computed?
      • Single target (immediate, PC + immediate)
      • Multiple targets (register)
2. Branch architectures
   A. Condition code or condition registers
   B. Register

Save or Restore State
• What state?
  – function calls: registers (CISC)
  – system calls: registers, flags, PC, PSW, etc.
• Hardware need not save registers
  – the caller can save the registers in use
  – the callee can save the registers it will use
• Hardware register save
  – IBM STM, VAX CALLS
  – faster?
• Most recent architectures do no register saving
  – Or do implicit register saving with register windows (SPARC)

VAX
• DEC 1977: VAX-11/780
• upward compatible from the PDP-11
• 32-bit words and addresses
• virtual memory
• 16 GPRs (r15 = PC, r14 = SP), condition codes
• extremely orthogonal, memory-memory
• decoded as a byte stream; variable-length instructions
• opcode specifies operation, number of operands, operand types

VAX
• Data types
  – 8, 16, 32, 64, 128 bits
  – character string: 8 bits/char
  – decimal: 4 bits/digit
  – numeric string: 8 bits/digit
• Addressing modes
  – literal: 6 bits
  – 8, 16, 32 bit immediates
  – register, register deferred
  – 8, 16, 32 bit displacements
  – 8, 16, 32 bit displacements deferred
  – indexed (scaled)
  – autoincrement, autodecrement
  – autoincrement deferred


VAX
• Operations
  – data transfer, including string move
  – arithmetic and logical (2 and 3 operands)
  – control (branch, jump, etc.)
    • AOBLEQ
  – function calls save state
  – bit manipulation
  – floating point: add, sub, mul, div, polyf
  – system: exceptions, VM
  – other: crc (cyclic redundancy check), insque (insert in queue)

VAX
• Example: addl3 R1, 737(R2), #456
  – byte 1: addl3 opcode
  – byte 2: mode, R1
  – byte 3: mode, R2
  – bytes 4-5: 737
  – byte 6: mode
  – bytes 7-10: 456
• VAX has too many modes and formats
• The big deal with RISC is not fewer instructions
  – few modes/formats => fast decoding to facilitate pipelining

VAX-11/780
• First implementation of the VAX ISA
  – 84% of instructions simple, 19% branches
  – loop branches 91% taken, other branches 41% taken
  – Operands: register mode 41%, complex addressing 6%
  – Implementation:
    • 10.6 CPI @ 200ns => 0.5 MIPS
    • 50% of time spent decoding; simple instructions only 10% of time
    • memory stalls 2.1 CPI (<< 10.6)

Anatomy of a Modern ISA
• Operations: simple ALU ops, data movement, control transfer
• Temporary operand storage in the CPU: large general-purpose register (GPR) file
• Number of operands per instruction: triadic, A <= B op C
• Operand location: load-store architecture with register indirect addressing
• Type and size of operands: 32/64-bit integers, IEEE floats
• Instruction-to-binary encoding: fixed width, regular fields
• Exceptions: Intel x86, IBM 390 (aka z900)


Dynamic-Static Interface
• Semantic gap between software and hardware
• Placement of the DSI determines how the gap is bridged

[Figure: the DSI (the architecture) sits between the program (software) and the machine (hardware); what is above the DSI is "static", exposed to software, and adds compiler complexity; what is below is "dynamic", hidden in hardware, and adds hardware complexity.]

Dynamic-Static Interface
• A low-level DSI exposes more knowledge of the hardware through the ISA
  – Places a greater burden on the compiler/programmer
• Optimized code becomes specific to the implementation
  – In fact, this happens for a higher-level DSI also

[Figure: spectrum of DSI placements (DSI-1, DSI-2, DSI-3) between an HLL program and the hardware, with example levels labeled DEL, ~CISC, ~VLIW, ~RISC.]

The Role of the Compiler
• Phases to manage complexity:
  – Parsing --> intermediate representation
  – Procedure inlining
  – Jump optimization
  – Common sub-expression elimination
  – Constant propagation
  – Loop optimizations
  – Strength reduction
  – Register allocation
  – Pipeline scheduling
  – Code generation --> assembly code

Performance and Cost
• Which computer is fastest?
• Not so simple:
  – Scientific simulation: FP performance
  – Program development: integer performance
  – Commercial workload: memory, I/O


Performance of Computers
• Want to buy the fastest computer for what you want to do?
  – Workload is all-important
  – Correct measurement and analysis
• Want to design the fastest computer for what the customer wants to pay?
  – Cost is always an important criterion
• Speed is not always the only performance criterion:
  – Power
  – Area

Defining Performance
• What is important to whom?
• Computer system user
  – Minimize elapsed time for a program = time_end - time_start
  – Called response time
• Computer center manager
  – Maximize completion rate = #jobs/second
  – Called throughput

Improve Performance
• Improve (a) response time or (b) throughput?
  – Faster CPU
    • Helps both (a) and (b)
  – Add more CPUs
    • Helps (b), and perhaps (a) due to less queueing

Performance Comparison
• Machine A is n times faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = n
• Machine A is x% faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
• E.g., time(A) = 10s, time(B) = 15s
  – 15/10 = 1.5 => A is 1.5 times faster than B
  – 15/10 = 1.5 => A is 50% faster than B
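
A small sketch (not part of the original slides) applying these two definitions; the 10s/15s times are the example above:

```python
def compare(time_a: float, time_b: float) -> None:
    """Compare two machines given their execution times for the same program."""
    n = time_b / time_a                  # A is n times faster than B
    pct = (time_b / time_a - 1) * 100    # A is pct% faster than B
    print(f"A is {n:.2f} times ({pct:.0f}%) faster than B")

compare(10.0, 15.0)   # -> A is 1.50 times (50%) faster than B
```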


Other Metrics
• MIPS and MFLOPS
• MIPS = instruction count / (execution time x 10^6)
       = clock rate / (CPI x 10^6)
• But MIPS has serious shortcomings

Problems with MIPS
• E.g., without FP hardware, an FP op may take 50 single-cycle instructions
• With FP hardware, only one 2-cycle instruction
• Thus, adding FP hardware:
  – CPI increases (why?):                   50/50 => 2/1
  – Instructions/program decreases (why?):  50 => 1
  – Total execution time decreases:         50 => 2
• BUT, MIPS gets worse!                     50 MIPS => 25 MIPS
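
A quick numeric sketch (not from the slides) of why MIPS is misleading here; the 50 MHz clock is an assumed value chosen purely for illustration:

```python
clock_hz = 50e6   # assumed 50 MHz clock, unchanged before and after

# Without FP hardware: the FP op becomes 50 single-cycle instructions.
instr_before, cycles_before = 50, 50
# With FP hardware: one 2-cycle instruction.
instr_after, cycles_after = 1, 2

def mips(instr, cycles):
    time_s = cycles / clock_hz
    return instr / (time_s * 1e6)

print(mips(instr_before, cycles_before))  # 50.0 MIPS
print(mips(instr_after, cycles_after))    # 25.0 MIPS -- lower, even though execution time dropped 25x
```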

Problems with MIPS
• Ignores the program
• Usually used to quote peak performance
  – Ideal conditions => a "guaranteed not to exceed" number!
• When is MIPS OK?
  – Same compiler, same ISA
  – E.g., the same binary running on a Pentium III and a Pentium 4
  – Why? Instructions/program is constant and can be ignored

Other Metrics
• MFLOPS = FP ops in program / (execution time x 10^6)
• Assumes FP ops are independent of compiler and ISA
  – Often safe for numeric codes: matrix size determines the # of FP ops/program
  – However, not always safe:
    • Missing instructions (e.g., FP divide, sqrt/sin/cos)
    • Optimizing compilers
• Relative MIPS and normalized MFLOPS
  – Normalized to some common baseline machine
  – E.g., VAX MIPS in the 1980s


Iron Law Example
• Machine A: clock 1ns, CPI 2.0, for program X
• Machine B: clock 2ns, CPI 1.2, for program X
• Which is faster, and by how much?

  Time/Program = instr/program x cycles/instr x sec/cycle
  Time(A) = N x 2.0 x 1 = 2N
  Time(B) = N x 1.2 x 2 = 2.4N
  Compare: Time(B)/Time(A) = 2.4N/2N = 1.2

• So Machine A is 20% faster than Machine B for this program
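
A minimal check (not in the original slides) of the same comparison in code; N is an arbitrary instruction count, since it cancels in the ratio:

```python
N = 1_000_000            # arbitrary instruction count; it cancels in the ratio
time_a = N * 2.0 * 1e-9  # CPI 2.0, 1 ns cycle time
time_b = N * 1.2 * 2e-9  # CPI 1.2, 2 ns cycle time
print(time_b / time_a)   # 1.2 -> A is 20% faster than B
```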

Iron Law Example
• Keep clock(A) @ 1ns and clock(B) @ 2ns
• For equal performance, if CPI(B) = 1.2, what is CPI(A)?

  Time(B)/Time(A) = 1 = (N x 1.2 x 2) / (N x CPI(A) x 1)
  CPI(A) = 2.4

Iron Law Example
• Keep CPI(A) = 2.0 and CPI(B) = 1.2
• For equal performance, if clock(B) = 2ns, what is clock(A)?

  Time(B)/Time(A) = 1 = (N x 1.2 x 2) / (N x 2.0 x clock(A))
  clock(A) = 1.2ns

Another Example
• Assume stores can execute in 1 cycle by slowing the clock 15%
• Should this be implemented?

  OP      Freq   Cycles
  ALU     43%    1
  Load    21%    1
  Store   12%    2
  Branch  24%    2


Example: Let's do the math

  OP      Freq   Cycles
  ALU     43%    1
  Load    21%    1
  Store   12%    2
  Branch  24%    2

• Old CPI = 0.43 + 0.21 + 0.12 x 2 + 0.24 x 2 = 1.36
• New CPI = 0.43 + 0.21 + 0.12 + 0.24 x 2 = 1.24
• Speedup = old time / new time = (P x old CPI x T) / (P x new CPI x 1.15T)
          = 1.36 / (1.24 x 1.15) = 0.95
• Answer: Don't make the change
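
As a sanity check (not part of the slides), the same computation in code:

```python
freq   = {"ALU": 0.43, "Load": 0.21, "Store": 0.12, "Branch": 0.24}
old_cy = {"ALU": 1, "Load": 1, "Store": 2, "Branch": 2}
new_cy = {"ALU": 1, "Load": 1, "Store": 1, "Branch": 2}   # stores now take 1 cycle

old_cpi = sum(freq[op] * old_cy[op] for op in freq)       # 1.36
new_cpi = sum(freq[op] * new_cy[op] for op in freq)       # 1.24

# The new clock is 15% slower, so the new cycle time is 1.15x the old one.
speedup = old_cpi / (new_cpi * 1.15)
print(f"speedup = {speedup:.2f}")   # 0.95 < 1, so don't make the change
```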

Which Programs?
• Execution time of what program?
• Best case: you always run the same set of programs
  – Port them and time the whole workload
• In reality, use benchmarks
  – Programs chosen to measure performance
  – Predict performance of the actual workload
  – Saves effort and money
  – Representative? Honest? Benchmarketing...

Types of Benchmarks
• Real programs
  – representative of real workloads
  – only accurate way to characterize performance
  – requires considerable work
• Kernels or microbenchmarks
  – "representative" program fragments
  – good for focusing on individual features, not the big picture
• Instruction mixes
  – instruction frequency of occurrence; used to calculate CPI

Benchmarks: SPEC2000
• System Performance Evaluation Cooperative
  – Formed in the 80s to combat benchmarketing
  – SPEC89, SPEC92, SPEC95, now SPEC2000
• 12 integer and 14 floating-point programs
  – Sun Ultra-5 300MHz reference machine has a score of 100
  – Report the geometric mean of the ratios to the reference machine
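
A small sketch (not from the slides) of how a SPEC-style score is formed from per-benchmark ratios to the reference machine; the ratio values below are invented for illustration:

```python
from math import prod

# Hypothetical per-benchmark ratios: reference time / measured time.
ratios = [4.2, 3.1, 5.6, 2.8]

geo_mean = prod(ratios) ** (1 / len(ratios))
score = 100 * geo_mean          # the reference machine scores 100 by definition
print(f"SPEC-style score ~= {score:.0f}")
```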


Benchmarks: SPEC CINT2000

  Benchmark     Description
  164.gzip      Compression
  175.vpr       FPGA place and route
  176.gcc       C compiler
  181.mcf       Combinatorial optimization
  186.crafty    Chess
  197.parser    Word processing, grammatical analysis
  252.eon       Visualization (ray tracing)
  253.perlbmk   PERL script execution
  254.gap       Group theory interpreter
  255.vortex    Object-oriented database
  256.bzip2     Compression
  300.twolf     Place and route simulator

Benchmarks: SPEC CFP2000

  Benchmark      Description
  168.wupwise    Physics/Quantum Chromodynamics
  171.swim       Shallow water modeling
  172.mgrid      Multi-grid solver: 3D potential field
  173.applu      Parabolic/elliptic PDE
  177.mesa       3-D graphics library
  178.galgel     Computational Fluid Dynamics
  179.art        Image Recognition/Neural Networks
  183.equake     Seismic Wave Propagation Simulation
  187.facerec    Image processing: face recognition
  188.ammp       Computational chemistry
  189.lucas      Number theory/primality testing
  191.fma3d      Finite-element Crash Simulation
  200.sixtrack   High energy nuclear physics accelerator design
  301.apsi       Meteorology: Pollutant distribution

Benchmark Pitfalls
• Benchmark not representative
  – Your workload is I/O bound, so SPECint is useless
• Benchmark is too old
  – Benchmarks age poorly; benchmarketing pressure causes vendors to optimize compiler/hardware/software to the benchmarks
  – Need to be periodically refreshed

Benchmark Pitfalls
• Choosing a benchmark from the wrong application space
  – e.g., in a realtime environment, choosing gcc
• Choosing benchmarks from no application space
  – e.g., synthetic workloads, especially unvalidated ones
• Using toy benchmarks (dhrystone, whetstone)
  – e.g., used to prove the value of RISC in the early 80's
• Mismatch of benchmark properties with the scale of features studied
  – e.g., using SPECINT for large cache studies


Benchmark Pitfalls
• Carelessly scaling benchmarks
  – Truncating benchmarks
  – Using only the first few million instructions
  – Reducing program data size
• Too many easy cases
  – May not show the value of a feature
• Too few easy cases
  – May exaggerate the importance of a feature

Scalar to Superscalar
• Scalar processor: fetches and issues at most one instruction per machine cycle
• Superscalar processor: fetches and issues multiple instructions per machine cycle
• Superscalar can also be defined in terms of how many instructions can complete execution in a given machine cycle
• Note that only a superscalar architecture can achieve a CPI of less than 1

Processor Performance
• In the 1980's (the decade of pipelining): CPI went from 5.0 to 1.15
• In the 1990's (the decade of superscalar): CPI went from 1.15 to 0.5 (best case)

  Processor Performance = Time / Program
  Time/Program = (Instructions/Program)  x  (Cycles/Instruction)  x  (Time/Cycle)
               = (code size)             x  (CPI)                 x  (cycle time)

Amdahl's Law (originally formulated for vector processing)
• f = fraction of the program that is vectorizable
• (1-f) = fraction that is serial
• N = speedup for the vectorizable portion
• Overall speedup:

  Speedup = 1 / ((1 - f) + f/N)

[Figure: execution profile, number of processors vs. time; the serial fraction (1-f) runs on 1 processor and the vectorizable fraction f runs on N processors.]
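
A minimal sketch (not in the slides) of the speedup formula:

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Overall speedup when a fraction f of the work is sped up by a factor n."""
    return 1.0 / ((1.0 - f) + f / n)

print(amdahl_speedup(0.8, 6))   # ~3.0
```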


Amdahl's Law (continued)
• Sequential bottleneck
• Even if N is infinite, performance is limited by the nonvectorizable portion (1-f):

  lim (N -> infinity) 1 / ((1 - f) + f/N) = 1 / (1 - f)

Ramifications of Amdahl's Law
• Consider f = 0.9, (1-f) = 0.1: for N -> infinity, Speedup -> 10
• Consider f = 0.5, (1-f) = 0.5: for N -> infinity, Speedup -> 2
• Consider f = 0.1, (1-f) = 0.9: for N -> infinity, Speedup -> 1.1
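
A short check (not from the slides) of these limiting values:

```python
for f in (0.9, 0.5, 0.1):
    limit = 1.0 / (1.0 - f)          # asymptotic speedup as N -> infinity
    print(f"f = {f}: speedup -> {limit:.1f}")
# f = 0.9: speedup -> 10.0
# f = 0.5: speedup -> 2.0
# f = 0.1: speedup -> 1.1
```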

Maximum Achievable Speedup

[Figure: maximum achievable speedup 1/(1-f) plotted against the parallelizable fraction f (0 to 0.9); the curve rises slowly at first and exceeds 10 as f approaches 0.9.]

Pipelining
• Unpipelined operation (time T per input):
  – Inputs I1, I2, I3, ... produce outputs O1, O2, ...
  – Time required to process K inputs = KT
• Perfect pipeline (N stages, each taking T/N):
  – Time required to process K inputs = (K + N - 1)(T/N)
  – Note: for K >> N, the processing time approaches KT/N

[Figure: an unpipelined block of delay T compared with an N-stage pipeline (Stage 1 ... Stage N, each of delay T/N), with successive inputs I1, I2, I3, ... flowing through the stages.]
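
A small sketch (not in the slides) comparing the two processing times; T, N, and K are arbitrary illustrative values:

```python
T = 10.0     # time for one input, unpipelined (arbitrary units)
N = 5        # pipeline stages
K = 1000     # number of inputs

unpipelined = K * T
pipelined   = (K + N - 1) * (T / N)

print(unpipelined)              # 10000.0
print(pipelined)                # 2008.0 -- close to K*T/N = 2000 since K >> N
print(unpipelined / pipelined)  # ~4.98x speedup, approaching N = 5
```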


Pipelined Performance Model
• g = fraction of time the pipeline is filled
• 1-g = fraction of time the pipeline is not filled (stalled)

[Figure: pipeline depth over time; the pipeline runs at depth N for fraction g of the time and at depth 1 for fraction 1-g.]

Amdahl's Law Applied to Pipelining
• g = fraction of time the pipeline is full
• (1-g) = fraction of time it is not full (stalled)
• N = pipeline depth
• Overall speedup:

  Speedup = 1 / ((1 - g) + g/N)

[Figure: pipeline depth over time; full (depth N) for fraction g, stalled (depth 1) for fraction 1-g.]
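
A short sketch (not from the slides) showing how sensitive pipeline speedup is to the filled fraction g; the depth of 10 and the g values are illustrative:

```python
def pipeline_speedup(g: float, depth: int) -> float:
    """Speedup of a depth-stage pipeline that is filled a fraction g of the time."""
    return 1.0 / ((1.0 - g) + g / depth)

for g in (1.0, 0.95, 0.90, 0.80):
    print(f"g = {g:.2f}: speedup = {pipeline_speedup(g, 10):.2f}")
# g = 1.00: speedup = 10.00
# g = 0.95: speedup = 6.90
# g = 0.90: speedup = 5.26
# g = 0.80: speedup = 3.57
```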

Pipelined Performance Model
• Tyranny of Amdahl's Law [Bob Colwell]
  – When g is even slightly below 100%, a big performance hit results
  – Stalled cycles are the key adversary and must be minimized as much as possible


Superscalar Proposal
• Moderate the tyranny of Amdahl's Law
  – Ease the sequential bottleneck
  – More generally applicable
  – Robust (less sensitive to f)
  – Revised Amdahl's Law, where s = amount of parallelism for non-vectorizable instructions:

  Speedup = 1 / ((1 - f)/s + f/N)
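
A minimal sketch (not in the slides) of the revised formula:

```python
def revised_amdahl(f: float, n: float, s: float) -> float:
    """Speedup when the vector fraction f runs n-wide and the scalar fraction runs s-wide."""
    return 1.0 / ((1.0 - f) / s + f / n)
```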

Motivation for Superscalar [Agerwala and Cocke]

[Figure: speedup vs. vectorizability f for n = 4, 6, 12, 100 and for n = 6 with s = 2; the typical range of f is highlighted.]

• Speedup jumps from 3 to 4.3 for N = 6, f = 0.8 when s = 2 instead of s = 1 (scalar)
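
Checking that claim with the function defined above (not part of the slides):

```python
print(revised_amdahl(f=0.8, n=6, s=1))   # ~3.0  (plain scalar baseline, s = 1)
print(revised_amdahl(f=0.8, n=6, s=2))   # ~4.3
```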

Limits on Instruction-Level Parallelism (ILP)

  Weiss and Smith [1984]       1.58
  Sohi and Vajapeyam [1987]    1.81
  Tjaden and Flynn [1970]      1.86   (Flynn's bottleneck)
  Tjaden and Flynn [1973]      1.96
  Uht [1986]                   2.00
  Smith et al. [1989]          2.00
  Jouppi and Wall [1988]       2.40
  Johnson [1991]               2.50
  Acosta et al. [1986]         2.79
  Wedig [1982]                 3.00
  Butler et al. [1991]         5.8
  Melvin and Patt [1991]       6
  Wall [1991]                  7      (Jouppi disagreed)
  Kuck et al. [1972]           8
  Riseman and Foster [1972]    51     (no control dependences)
  Nicolau and Fisher [1984]    90     (Fisher's optimism)

Superscalar Proposal
• Go beyond the single-instruction pipeline; achieve IPC > 1
• Dispatch multiple instructions per cycle
• Provide a more generally applicable form of concurrency (not just vectors)
• Geared for sequential code that is hard to parallelize otherwise
• Exploit fine-grained or instruction-level parallelism (ILP)


Classifying ILP Machines [Jouppi, DECWRL 1991]
• Baseline scalar RISC
  – Issue parallelism: IP = 1
  – Operation latency: OP = 1
  – Peak IPC = 1

[Figure: successive instructions vs. time in cycles of the baseline machine; each instruction passes through IF, DE, EX, WB, with one instruction issued per cycle.]

Classifying ILP Machines [Jouppi, DECWRL 1991]
• Superpipelined: cycle time = 1/m of baseline
  – Issue parallelism: IP = 1 instruction / minor cycle
  – Operation latency: OP = m minor cycles
  – Peak IPC = m instructions / major cycle (m x speedup?)

[Figure: superpipelined timing diagram; a new instruction enters IF every minor cycle.]

Classifying ILP Machines [Jouppi, DECWRL 1991]
• Superscalar:
  – Issue parallelism: IP = n instructions / cycle
  – Operation latency: OP = 1 cycle
  – Peak IPC = n instructions / cycle (n x speedup?)

[Figure: superscalar timing diagram; n instructions enter IF, DE, EX, WB together each cycle.]

Classifying ILP Machines [Jouppi, DECWRL 1991]
• VLIW: Very Long Instruction Word
  – Issue parallelism: IP = n instructions / cycle
  – Operation latency: OP = 1 cycle
  – Peak IPC = n instructions / cycle = 1 VLIW / cycle

[Figure: VLIW timing diagram; one long instruction is fetched and decoded, then its operations execute in parallel before writeback.]


Classifying ILP Machines [Jouppi, DECWRL 1991]
• Superpipelined-superscalar
  – Issue parallelism: IP = n instructions / minor cycle
  – Operation latency: OP = m minor cycles
  – Peak IPC = n x m instructions / major cycle

[Figure: combined timing diagram; n instructions are issued every minor cycle.]

Superscalar vs. Superpipelined
• Roughly equivalent performance
  – If n = m, then both have about the same IPC
  – Parallelism is exposed in space vs. time

[Figure: superscalar and superpipelined timing diagrams over cycles of the base machine, with IFetch, Dcode, Execute, and Writeback stages.]