55:132/22C:160 High Performance Computer Architecture, Spring 2008

Instructor Information
• Instructor: Jon Kuhl (That's me)
– Office: 4016A SC
– Office Hours: 10:30-11:30 a.m. MWF (other times by appointment)
– E-mail: [email protected]
– Phone: (319) 335-5958
• TA: Prasidha Mohandas
– Office: 1313 SC
– Office hours: t.b.d.
Class Info
• Website: www.engineering.uiowa.edu/~hpca
• Texts:
Required: Shen and Lipasti, Modern Processor Design--Fundamentals of Superscalar Processors, McGraw Hill, 2005.
Supplemental: Thomas and Moorby, The Verilog Hardware Description Language, Third Edition, Kluwer Academic Publishers, 1996.
Additional Reference: Hennessy and Patterson, Computer Architecture--A Quantitative Approach, Fourth Edition, Morgan Kaufmann, 2007.
Course Objectives
• Understand quantitative measures for assessing and comparing processor performance
• Understand modern processor design techniques, including:
– pipelining
– instruction-level parallelism
– multi-threading
– high performance memory architecture
• Master the use of modern design tools (HDLs) to design and analyze processors
• Do case studies of contemporary processors
• Discuss future trends in processor design
Expected Background
• A previous course in computer architecture/organization covering:
– Instruction set architecture (ISA)
– Addressing modes
– Assembly language
– Basic computer organization
– Memory system organization
• cache
• virtual memory
– Etc.
• 22c:060 or 55:035 or equivalent
Course Organization
• Homework assignments--several
• Two projects (design/analysis exercises using the Verilog HDL and ModelSim simulation environment)
• Two exams:
– Midterm--Wed. March 12, in class
– Final--Tues. May 13, 2:15-4:15 p.m.
Course Organization--continued
• Grading:
– Exams:
• Better of midterm/final exam scores: 35%
• Poorer of midterm/final exam scores: 25%
– Homework: 10%
– Projects: 30%
Historical Perspectives
• The Decade of the 1970's: "Birth of Microprocessors"
– Programmable Controller
– Single-Chip Microprocessors
– Personal Computers (PC)
• The Decade of the 1980's: "Quantitative Architecture"
– Instruction Pipelining
– Fast Cache Memories
– Compiler Considerations
– Workstations
• The Decade of the 1990's: "Instruction-Level Parallelism"
– Superscalar, Speculative Microarchitectures
– Aggressive Compiler Optimizations
– Low-Cost Desktop Supercomputing
Moore's Law (1965)
• The number of devices that can be integrated on a single piece of silicon will double roughly every 18-24 months
• Moore's law has held true for 40 years and will continue to hold for at least another decade.
Intel Microprocessors--Transistor Count 1970-2005
[Figure: transistor counts of Intel microprocessors, 1970-2005]
Processor Performance—1987-1997
[Figure: performance (0-1200) vs. year, 1987-1997, for SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, IBM POWER 100, DEC AXP/500, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, and DEC Alpha 21264/600]
Evolution of Single-Chip Micros

                     1970's     1980's     1990's     2010
Transistor Count     10K-100K   100K-1M    1M-100M    1B
Clock Frequency      0.2-2MHz   2-20MHz    20M-1GHz   10GHz
Instructions/Cycle   < 0.1      0.1-0.9    0.9-2.0    10 (?)
MIPS/MFLOPS          < 0.2      0.2-20     20-2,000   100,000
Performance Growth in Perspective
• Doubling every 24 months (1971-2007): total of 260,000X
– Cars would travel at 25 million MPH and get 5 million miles/gal.
– Air travel: L.A. to N.Y. in 0.1 seconds
– Corn yield: 50 million bushels per acre
A Quote from Robert Cringely
“If the automobile had followed the same development as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year killing everyone inside.”
Convergence of Key Enabling Technologies
• VLSI:
– Submicron CMOS feature sizes: Intel is shipping 45nm chips and has demonstrated 32nm (2x increase in density every 2 years)
– Metal layers: 3 -> 4 -> 5 -> 6 -> 9 (copper)
– Power supply voltage: 5v -> 3.3v -> 2.4v -> 1.8v -> 0.8v
• CAD Tools:
– Interconnect simulation and critical path analysis
– Clock signal propagation analysis
– Process simulation and yield analysis/learning
• Microarchitecture:
– Superpipelined and superscalar machines
– Speculative and dynamic microarchitectures
– Simulation tools and emulation systems
• Compilers:
– Extraction of instruction-level parallelism
– Aggressive and speculative code scheduling
– Object code translation and optimization
Instruction Set Processing

ARCHITECTURE (ISA) -- programmer/compiler view
– Functional appearance (interface) to user/system programmer
– Opcodes, addressing modes, architected registers, IEEE floating point
– Serves as specification for processor design

IMPLEMENTATION (µarchitecture) -- processor designer view
– Logical structure or organization that performs the architecture
– Pipelining, functional units, caches, physical registers

REALIZATION (Chip) -- chip/system designer view
– Physical structure that embodies the implementation
– Gates, cells, transistors, wires
Iron Law

Architecture --> Implementation --> Realization
(Compiler Designer)   (Processor Designer)   (Chip Designer)

Processor Performance = 1 / (Time/Program)

Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
               (code size)              (CPI)                   (cycle time)
Iron Law
• Instructions/Program
– Instructions executed, not static code size
– Determined by algorithm, compiler, ISA
• Cycles/Instruction
– Determined by ISA and CPU organization
– Overlap among instructions reduces this term
• Time/Cycle
– Determined by technology, organization, clever circuit design
Overall Goal
• Minimize time, which is the product, NOT isolated terms
• Common error: to miss terms while devising optimizations
– E.g. an ISA change decreases instruction count
– BUT leads to a CPU organization which makes the clock slower
• Bottom line: the terms are inter-related
• This is the crux of the RISC vs. CISC argument
Instruction Set Architecture
• ISA: the boundary between software and hardware
– Specifies the logical machine that is visible to the programmer
– Also a functional spec for the processor designers
• What needs to be specified by an ISA:
– Operations
• what to perform and what to perform next
– Temporary operand storage in the CPU
• accumulator, stacks, registers
– Number of operands per instruction
– Operand location
• where and how to specify the operands
– Type and size of operands
– Instruction-to-binary encoding
Operand Storage
• Registers (in processor vs. memory)?
• faster access
• shorter address
• Accumulator
• less hardware
• high memory traffic
• likely bottleneck
Operand Storage
• Stack - LIFO (60’s - 70’s)
• simple addressing (top of stack implicit)
• bottleneck while pipelining (why?)
• note: JAVA VM stack-based
• Registers - 8 to 256 words
• flexible: temporaries and variables
• registers must be named
• code density and “second” name space
Caches vs. Registers
• Registers
• faster (no addressing modes, no tags)
• deterministic (no misses)
• can replicate for more ports
• short identifier
• must save/restore on procedure calls
• can’t take address of a register (distinct from memory)
• fixed size (FP, strings, structures)
• compilers must manage (an advantage?)
7
Registers vs. Caches
• How many registers? more =>
• hold operands longer (reducing memory traffic + run time)
• longer register specifiers (except with register windows)
• slower registers
• more state slows context switches
Operands for ALU Instructions
• ALU instructions require operands
• Number of explicit operands
• two - ri := ri op rj
• three - ri := rj op rk
• operands in registers or memory
• any combo - VAX - variable length instrs
• at least one register - IBM 360/370
• all registers - Cray, RISCs - separate load/store instructions
VAX Addressing Modes
register:        Ri            displacement:       M[Ri + #n]
immediate:       #n            register indirect:  M[Ri]
indexed:         M[Ri + Rj]    absolute:           M[#n]
memory indirect: M[M[Ri]]      auto-increment:     M[Ri]; Ri += d
auto-decrement:  Ri -= d; M[Ri]
scaled:          M[Ri + #n + Rj * d]
update:          M[Ri = Ri + #n]

• Modes 1-4 account for 93% of all VAX operands [Clark and Emer]
Operations
• arithmetic and logical - and, add, ...
• data transfer - move, load, store
• control - branch, jump, call
• system - system call, traps
• floating point - add, mul, div, sqrt
• decimal - addd, convert
• string - move, compare
• multimedia? 2D, 3D? e.g., Intel MMX/SSE and Sun VIS
Control Instructions (Branches)
1. Types of Branches
A. Conditional or Unconditional
B. Save PC?
C. How is target computed?
• Single target (immediate, PC+immediate)
• Multiple targets (register)
2. Branch Architectures
A. Condition code or condition registers
B. Register
Save or Restore State
• What state?
• function calls: registers (CISC)
• system calls: registers, flags, PC, PSW, etc.
• Hardware need not save registers
• caller can save registers in use
• callee saves registers it will use
• Hardware register save
• IBM STM, VAX CALLS
• faster?
• Most recent architectures do no register saving
– Or do implicit register saving with register windows (SPARC)
VAX
• DEC 1977 VAX-11/780
• upward compatible from PDP-11
• 32-bit words and addresses
• virtual memory
• 16 GPRs (r15 = PC, r14 = SP), CCs
• extremely orthogonal and memory-memory
• decode as byte stream - variable in length
• opcode: operation, #operands, operand types
VAX
• Data types
• 8, 16, 32, 64, 128 bits
• char string - 8 bits/char
• decimal - 4 bits/digit
• numeric string - 8 bits/digit
• Addressing modes
• literal 6 bits
• 8, 16, 32 bit immediates
• register, register deferred
• 8, 16, 32 bit displacements
• 8, 16, 32 bit displacements deferred
• indexed (scaled)
• autoincrement, autodecrement
• autoincrement deferred
VAX
• operations
– data transfer, including string move
– arithmetic and logical (2 and 3 operands)
– control (branch, jump, etc.)
• AOBLEQ
– function calls save state
– bit manipulation
– floating point - add, sub, mul, div, polyf
– system - exception, VM
– other - crc (cyclic redundancy check), insque (insert in Q)
VAX example: addl3 R1, 737(R2), #456
byte 1: addl3 opcode
byte 2: mode, R1
byte 3: mode, R2
bytes 4-5: 737
byte 6: mode
bytes 7-10: 456
• VAX has too many modes and formats
• Big deal with RISC is not fewer instructions
– few modes/formats => fast decoding to facilitate pipelining
VAX 11/780
• First implementation of the VAX ISA
– 84% of instructions simple, 19% branches
– loop branches 91% taken, other branches 41% taken
– Operands: register mode 41%, complex addressing 6%
– Implementation:
• 10.6 CPI @ 200ns => 0.5 MIPS
• 50% of time spent decoding; simple instructions only 10% of time
• memory stalls 2.1 CPI (<< 10.6)
Anatomy of a Modern ISA
• Operations: simple ALU ops, data movement, control transfer
• Temporary operand storage in the CPU: large General Purpose Register (GPR) file
• Number of operands per instruction: triadic, A ⇐ B op C
• Operand location: load-store architecture with register indirect addressing
• Type and size of operands: 32/64-bit integers, IEEE floats
• Instruction-to-binary encoding: fixed width, regular fields
• Exceptions: Intel x86, IBM 390 (aka z900)
Dynamic-Static Interface
• Semantic gap between s/w and h/w
• Placement of the DSI determines how the gap is bridged
[Figure: program (software) above and machine (hardware) below, separated by the architecture (DSI); everything above the DSI is "static" and exposed to software (compiler complexity), everything below is "dynamic" and hidden in hardware (hardware complexity)]
Dynamic-Static Interface
• A low-level DSI exposes more knowledge of the hardware through the ISA
– Places a greater burden on the compiler/programmer
• Optimized code becomes specific to the implementation
– In fact, this happens for higher-level DSIs also
[Figure: possible DSI placements between an HLL program and the hardware: DSI-1 (~DEL), DSI-2 (~CISC, ~VLIW), DSI-3 (~RISC)]
The Role of the Compiler
• Phases to manage complexity:
Parsing --> intermediate representation
Jump optimization
Loop optimizations
Common sub-expression elimination
Procedure inlining
Constant propagation
Strength reduction
Pipeline scheduling
Register allocation
Code generation --> assembly code
Performance and Cost
• Which computer is fastest?
• Not so simple:
– Scientific simulation -- FP performance
– Program development -- integer performance
– Commercial workload -- memory, I/O
Performance of Computers
• Want to buy the fastest computer for what you want to do?
– Workload is all-important
– Correct measurement and analysis
• Want to design the fastest computer for what the customer wants to pay?
– Cost is always an important criterion
• Speed is not always the only performance criterion:
– Power
– Area
Defining Performance
• What is important to whom?
• Computer system user
– Minimize elapsed time for program = time_end - time_start
– Called response time
• Computer center manager
– Maximize completion rate = #jobs/second
– Called throughput
Improve Performance
• Improve (a) response time or (b) throughput?
– Faster CPU
• Helps both (a) and (b)
– Add more CPUs
• Helps (b), and perhaps (a) due to less queuing
Performance Comparison
• Machine A is n times faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = n
• Machine A is x% faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
• E.g. time(A) = 10s, time(B) = 15s
– 15/10 = 1.5 => A is 1.5 times faster than B
– 15/10 = 1.5 => A is 50% faster than B
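The two definitions above can be checked numerically; this is an illustrative sketch (the function names are mine, not from the slides):

```python
# "n times faster" and "x% faster" from measured execution times.
def times_faster(time_a, time_b):
    """n such that A is n times faster than B: n = time(B)/time(A)."""
    return time_b / time_a

def percent_faster(time_a, time_b):
    """x such that A is x% faster than B: time(B)/time(A) = 1 + x/100."""
    return (time_b / time_a - 1) * 100

print(times_faster(10, 15))    # A is 1.5 times faster than B
print(percent_faster(10, 15))  # A is 50% faster than B
```

Note that the same ratio 1.5 is reported either as "1.5 times faster" or "50% faster"; mixing up the two conventions is a common source of confusion.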
Other Metrics
• MIPS and MFLOPS
• MIPS = instruction count / (execution time x 10^6)
       = clock rate / (CPI x 10^6)
• But MIPS has serious shortcomings
Problems with MIPS
• E.g. without FP hardware, an FP op may take 50 single-cycle instructions
• With FP hardware, only one 2-cycle instruction
• Thus, adding FP hardware:
– CPI increases (why?):                  50/50 = 1 => 2/1 = 2
– Instructions/program decreases (why?): 50 => 1
– Total execution time decreases:        50 cycles => 2 cycles
• BUT, MIPS gets worse! 50 MIPS => 2 MIPS
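The paradox can be demonstrated directly: the 50-instruction and 2-cycle figures are from the slide, while the 100 MHz clock is an assumed value for illustration only.

```python
# MIPS paradox sketch: adding FP hardware cuts execution time
# but lowers the MIPS rating.
CLOCK_HZ = 100e6  # assumed clock, for illustration

def mips(instructions, cycles):
    """MIPS = instruction count / (execution time x 10^6)."""
    time_s = cycles / CLOCK_HZ
    return instructions / (time_s * 1e6)

# One FP operation emulated in software: 50 single-cycle instructions.
soft_instr, soft_cycles = 50, 50
# The same FP operation with FP hardware: 1 instruction taking 2 cycles.
hw_instr, hw_cycles = 1, 2

print(mips(soft_instr, soft_cycles))  # software emulation: high MIPS
print(mips(hw_instr, hw_cycles))      # FP hardware: lower MIPS, despite
                                      # finishing 25x sooner (2 vs 50 cycles)
```

Execution time falls from 50 to 2 cycles, yet instructions/cycle falls from 1.0 to 0.5, so the MIPS rating drops even though the machine got faster.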
Problems with MIPS
• Ignores the program
• Usually used to quote peak performance
– Ideal conditions => guaranteed not to exceed!
• When is MIPS ok?
– Same compiler, same ISA
– E.g. same binary running on Pentium-III, IV
– Why? Instr/program is constant and can be ignored
Other Metrics
• MFLOPS = FP ops in program / (execution time x 10^6)
• Assumes FP ops are independent of compiler and ISA
– Often safe for numeric codes: matrix size determines # of FP ops/program
– However, not always safe:
• Missing instructions (e.g. FP divide, sqrt/sin/cos)
• Optimizing compilers
• Relative MIPS and normalized MFLOPS
– Normalized to some common baseline machine
– E.g. VAX MIPS in the 1980s
Iron Law Example
• Machine A: clock 1ns, CPI 2.0, for program x
• Machine B: clock 2ns, CPI 1.2, for program x
• Which is faster, and by how much?
Time/Program = instr/program x cycles/instr x sec/cycle
Time(A) = N x 2.0 x 1 = 2N
Time(B) = N x 1.2 x 2 = 2.4N
Compare: Time(B)/Time(A) = 2.4N/2N = 1.2
• So, Machine A is 20% faster than Machine B for this program
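The comparison above can be sketched with the Iron Law product; the instruction count N is an arbitrary placeholder since it cancels in the ratio:

```python
# Iron Law: time/program = (instr/program) x CPI x (time/cycle).
def exec_time_ns(instr_count, cpi, cycle_ns):
    return instr_count * cpi * cycle_ns

N = 1_000_000                     # placeholder instruction count; cancels out
time_a = exec_time_ns(N, 2.0, 1)  # Machine A: CPI 2.0, 1ns clock
time_b = exec_time_ns(N, 1.2, 2)  # Machine B: CPI 1.2, 2ns clock

print(time_b / time_a)  # ratio 1.2 -> A is 20% faster on this program
```

The point of the exercise: B has the better CPI, but its slower clock more than cancels that advantage; only the full product decides.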
Iron Law Example
• Keep clock(A) @ 1ns and clock(B) @ 2ns
• For equal performance, if CPI(B) = 1.2, what is CPI(A)?
Time(B)/Time(A) = 1 = (N x 1.2 x 2)/(N x CPI(A) x 1)
CPI(A) = 2.4
Iron Law Example
• Keep CPI(A) = 2.0 and CPI(B) = 1.2
• For equal performance, if clock(B) = 2ns, what is clock(A)?
Time(B)/Time(A) = 1 = (N x 1.2 x 2)/(N x 2.0 x clock(A))
clock(A) = 1.2ns
Another Example
• Assume stores can execute in 1 cycle by slowing the clock 15%
• Should this change be implemented?

OP      Freq   Cycles
ALU     43%    1
Load    21%    1
Store   12%    2
Branch  24%    2
Example--Let's do the math:
• Old CPI = 0.43 + 0.21 + 0.12 x 2 + 0.24 x 2 = 1.36
• New CPI = 0.43 + 0.21 + 0.12 + 0.24 x 2 = 1.24
• Speedup = old time/new time = (P x old CPI x T)/(P x new CPI x 1.15T)
          = 1.36/(1.24 x 1.15) = 0.95
• Answer: Don't make the change
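The arithmetic above can be reproduced in a few lines; this is a sketch (the dict layout and function names are my own):

```python
# Evaluate the store optimization: CPI from the instruction mix,
# then speedup after accounting for the 15% slower clock.
mix = {"ALU": (0.43, 1), "Load": (0.21, 1), "Store": (0.12, 2), "Branch": (0.24, 2)}

def cpi(mix):
    """Weighted average cycles per instruction."""
    return sum(freq * cycles for freq, cycles in mix.values())

old_cpi = cpi(mix)                    # 1.36
new_mix = dict(mix, Store=(0.12, 1))  # stores now take 1 cycle
new_cpi = cpi(new_mix)                # 1.24
speedup = old_cpi / (new_cpi * 1.15)  # clock is 15% slower

print(round(speedup, 3))  # < 1, so the "optimization" is a net loss
```

A speedup below 1 means the change slows the machine down: the 15% clock penalty applies to every instruction, while the 1-cycle saving applies only to the 12% that are stores.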
Which Programs?
• Execution time of what program?
• Best case: you always run the same set of programs
– Port them and time the whole workload
• In reality, use benchmarks
– Programs chosen to measure performance
– Predict performance of actual workload
– Saves effort and money
– Representative? Honest? Benchmarketing...
Types of Benchmarks
• Real programs
– representative of real workload
– only accurate way to characterize performance
– requires considerable work
• Kernels or microbenchmarks
– "representative" program fragments
– good for focusing on individual features, not the big picture
• Instruction mixes
– instruction frequency of occurrence; calculate CPI
Benchmarks: SPEC2000
• System Performance Evaluation Cooperative
– Formed in the 80s to combat benchmarketing
– SPEC89, SPEC92, SPEC95, now SPEC2000
• 12 integer and 14 floating-point programs
– Sun Ultra-5 300MHz reference machine has a score of 100
– Report the geometric mean of ratios to the reference machine
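SPEC-style scoring can be sketched as follows: each benchmark contributes a ratio of reference time to measured time, and the overall score is the geometric mean of the ratios (SPEC scales this so that the reference machine scores 100). The benchmark times below are made-up illustrative values, not real SPEC data.

```python
# Geometric mean of per-benchmark speedup ratios, SPEC-style.
import math

ref_times  = [500.0, 800.0, 300.0]  # hypothetical reference-machine times (s)
test_times = [250.0, 400.0, 300.0]  # hypothetical machine under test (s)

ratios = [r / t for r, t in zip(ref_times, test_times)]  # larger = faster
geomean = math.prod(ratios) ** (1 / len(ratios))

print(round(geomean, 3))  # overall speedup over the reference machine
```

The geometric mean is used (rather than the arithmetic mean) so that the final score does not depend on which machine is chosen as the baseline.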
Benchmarks: SPEC CINT2000

Benchmark    Description
164.gzip Compression
175.vpr FPGA place and route
176.gcc C compiler
181.mcf Combinatorial optimization
186.crafty Chess
197.parser Word processing, grammatical analysis
252.eon Visualization (ray tracing)
253.perlbmk PERL script execution
254.gap Group theory interpreter
255.vortex Object-oriented database
256.bzip2 Compression
300.twolf Place and route simulator
Benchmarks: SPEC CFP2000

Benchmark    Description
168.wupwise Physics/Quantum Chromodynamics
171.swim Shallow water modeling
172.mgrid Multi-grid solver: 3D potential field
173.applu Parabolic/elliptic PDE
177.mesa 3-D graphics library
178.galgel Computational Fluid Dynamics
179.art Image Recognition/Neural Networks
183.equake Seismic Wave Propagation Simulation
187.facerec Image processing: face recognition
188.ammp Computational chemistry
189.lucas Number theory/primality testing
191.fma3d Finite-element Crash Simulation
200.sixtrack High energy nuclear physics accelerator design
301.apsi Meteorology: Pollutant distribution
Benchmark Pitfalls
• Benchmark not representative
– If your workload is I/O bound, SPECint is useless
• Benchmark is too old
– Benchmarks age poorly; benchmarketing pressure causes vendors to optimize compiler/hardware/software to the benchmarks
– Need to be periodically refreshed
Benchmark Pitfalls
• Choosing benchmarks from the wrong application space
– e.g., in a realtime environment, choosing gcc
• Choosing benchmarks from no application space
– e.g., synthetic workloads, esp. unvalidated ones
• Using toy benchmarks (dhrystone, whetstone)
– e.g., used to prove the value of RISC in the early 80's
• Mismatch of benchmark properties with scale of features studied
– e.g., using SPECINT for large cache studies
Benchmark Pitfalls
• Carelessly scaling benchmarks– Truncating benchmarks– Using only first few million instructions– Reducing program data size
• Too many easy cases– May not show value of a feature
• Too few easy cases– May exaggerate importance of a feature
Scalar to Superscalar
• Scalar processor: fetches and issues at most one instruction per machine cycle
• Superscalar processor: fetches and issues multiple instructions per machine cycle
• Can also define superscalar in terms of how many instructions can complete execution in a given machine cycle
• Note that only a superscalar architecture can achieve a CPI of less than 1
Processor Performance
• In the 1980's (decade of pipelining):
– CPI: 5.0 => 1.15
• In the 1990's (decade of superscalar):
– CPI: 1.15 => 0.5 (best case)

Processor Performance = 1 / (Time/Program)

Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
               (code size)              (CPI)                   (cycle time)
Amdahl's Law (originally formulated for vector processing)
• f = fraction of program that is vectorizable
• (1-f) = fraction that is serial
• N = speedup for the vectorizable portion
• Overall speedup:

Speedup = 1 / ((1 - f) + f/N)

[Figure: execution timeline -- the serial fraction (1-f) runs on 1 processor, the vectorizable fraction f runs on N processors]
Amdahl's Law--Continued
• Sequential bottleneck
• Even if N is infinite, performance is limited by the nonvectorizable portion (1-f):

lim (N -> infinity) of 1 / ((1 - f) + f/N) = 1 / (1 - f)
Ramifications of Amdahl's Law
• Consider f = 0.9, (1-f) = 0.1: for N -> infinity, Speedup -> 10
• Consider f = 0.5, (1-f) = 0.5: for N -> infinity, Speedup -> 2
• Consider f = 0.1, (1-f) = 0.9: for N -> infinity, Speedup -> 1.1
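The three cases above can be verified with the slide's formula; a very large N stands in for the limit:

```python
# Amdahl's Law as stated on the slides: Speedup = 1 / ((1 - f) + f/N).
def amdahl(f, n):
    return 1.0 / ((1.0 - f) + f / n)

big_n = 10**9  # stand-in for N -> infinity

print(round(amdahl(0.9, big_n), 3))  # approaches 1/0.1 = 10
print(round(amdahl(0.5, big_n), 3))  # approaches 1/0.5 = 2
print(round(amdahl(0.1, big_n), 3))  # approaches 1/0.9, about 1.1
```

Even with unbounded parallel hardware, a 10% serial fraction caps the speedup at 10x; this is the sequential bottleneck.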
Maximum Achievable Speedup
[Figure: asymptotic speedup 1/(1-f) vs. parallelizable fraction f for f = 0 to 0.9; the curve rises slowly from 1 and reaches 10 only as f approaches 0.9]
Pipelining

Unpipelined operation: inputs I1, I2, I3, ... flow through a single unit with processing time T, producing outputs O1, O2, ...; time required to process K inputs = KT.

Perfect pipeline (N stages, T/N per stage):
Stage1 -> Stage2 -> Stage3 -> ... -> StageN
Time required to process K inputs = (K + N - 1)(T/N)

Note: for K >> N, the processing time approaches KT/N
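The two timing formulas above can be compared directly; the latency and stage count below are assumed example values:

```python
# Pipeline timing from the slide: K inputs through an N-stage pipeline
# with total latency T take (K + N - 1) * (T / N).
def unpipelined_time(k, t):
    return k * t

def pipelined_time(k, n, t):
    return (k + n - 1) * (t / n)

T, N = 10.0, 5  # assumed: 10 ns total latency, 5 stages
for k in (1, 5, 1000):
    print(k, unpipelined_time(k, T), pipelined_time(k, N, T))
```

For a single input the pipeline gives no benefit (both take T = 10 ns); for K = 1000 the speedup is 10000/2008, close to the ideal N = 5, illustrating the K >> N note.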
Pipelined Performance Model
• g = fraction of time pipeline is filled
• 1-g = fraction of time pipeline is not filled (stalled)
[Figure: pipeline depth vs. time -- effective depth 1 for the fraction 1-g of the time, depth N for the fraction g]
Amdahl's Law Applied to Pipelining
• g = fraction of time the pipeline is full
• (1-g) = fraction that it is not full
• N = pipeline depth
• Overall speedup:

Speedup = 1 / ((1 - g) + g/N)
Pipelined Performance Model
• Tyranny of Amdahl's Law [Bob Colwell]
– When g is even slightly below 100%, a big performance hit will result
– Stalled cycles are the key adversary and must be minimized as much as possible
Superscalar Proposal
• Moderate the tyranny of Amdahl's Law
– Ease the sequential bottleneck
– More generally applicable
– Robust (less sensitive to f)
– Revised Amdahl's Law, with s = amount of parallelism for non-vectorizable instructions:

Speedup = 1 / ((1 - f)/s + f/N)
Motivation for Superscalar [Agerwala and Cocke]
[Figure: speedup vs. vectorizability f (0 to 1) for n = 4, 6, 12, 100, and for n = 6 with s = 2; the typical range of f is marked]
Speedup jumps from 3 to 4.3 for N = 6, f = 0.8 when s = 2 instead of s = 1 (scalar)
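The slide's data point can be checked against the revised formula:

```python
# Revised Amdahl's Law from the slide: Speedup = 1 / ((1-f)/s + f/N),
# where s is the parallelism applied to the non-vectorizable fraction.
def revised_amdahl(f, n, s=1):
    return 1.0 / ((1.0 - f) / s + f / n)

# Slide's data point: f = 0.8, N = 6.
print(round(revised_amdahl(0.8, 6, s=1), 2))  # plain Amdahl (scalar): 3.0
print(round(revised_amdahl(0.8, 6, s=2), 2))  # s = 2: jumps to about 4.3
```

Speeding up the serial fraction even modestly (s = 2) pays off more than further widening the vector unit, which is exactly the superscalar argument.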
Limits on Instruction Level Parallelism (ILP)
Weiss and Smith [1984] 1.58
Sohi and Vajapeyam [1987] 1.81
Tjaden and Flynn [1970] 1.86 (Flynn’s bottleneck)
Tjaden and Flynn [1973] 1.96
Uht [1986] 2.00
Smith et al. [1989] 2.00
Jouppi and Wall [1988] 2.40
Johnson [1991] 2.50
Acosta et al. [1986] 2.79
Wedig [1982] 3.00
Butler et al. [1991] 5.8
Melvin and Patt [1991] 6
Wall [1991] 7 (Jouppi disagreed)
Kuck et al. [1972] 8
Riseman and Foster [1972] 51 (no control dependences)
Nicolau and Fisher [1984] 90 (Fisher’s optimism)
Superscalar Proposal
• Go beyond the single-instruction pipeline, achieve IPC > 1
• Dispatch multiple instructions per cycle
• Provide a more generally applicable form of concurrency (not just vectors)
• Geared for sequential code that is hard to parallelize otherwise
• Exploit fine-grained or instruction-level parallelism (ILP)
Classifying ILP Machines [Jouppi, DECWRL 1991]
• Baseline scalar RISC
– Issue parallelism = IP = 1
– Operation latency = OP = 1
– Peak IPC = 1
[Figure: successive instructions 1-6 flowing through IF, DE, EX, WB, one issued per baseline cycle; time axis in cycles of the baseline machine]
Classifying ILP Machines [Jouppi, DECWRL 1991]
• Superpipelined: cycle time = 1/m of baseline
– Issue parallelism = IP = 1 inst / minor cycle
– Operation latency = OP = m minor cycles
– Peak IPC = m instr / major cycle (m x speedup?)
[Figure: successive instructions 1-6 issued one per minor cycle through IF, DE, EX, WB]
Classifying ILP Machines [Jouppi, DECWRL 1991]
• Superscalar:
– Issue parallelism = IP = n inst / cycle
– Operation latency = OP = 1 cycle
– Peak IPC = n instr / cycle (n x speedup?)
[Figure: instructions 1-9 issued three at a time per cycle through IF, DE, EX, WB]
Classifying ILP Machines [Jouppi, DECWRL 1991]
• VLIW: Very Long Instruction Word
– Issue parallelism = IP = n inst / cycle
– Operation latency = OP = 1 cycle
– Peak IPC = n instr / cycle = 1 VLIW / cycle
[Figure: one long instruction per cycle through IF, DE, multiple parallel EX units, WB]
Classifying ILP Machines [Jouppi, DECWRL 1991]
• Superpipelined-Superscalar
– Issue parallelism = IP = n inst / minor cycle
– Operation latency = OP = m minor cycles
– Peak IPC = n x m instr / major cycle
[Figure: groups of instructions issued every minor cycle through IF, DE, EX, WB]
Superscalar vs. Superpipelined
• Roughly equivalent performance
– If n = m then both have about the same IPC
– Parallelism exposed in space vs. time
[Figure: superscalar and superpipelined execution compared over cycles 0-13 of the base machine; key: IFetch, Dcode, Execute, Writeback]