331 W08.1Spring 2005 14:332:331 Computer Architecture and Assembly Language Fall 2003 Week 8...
Date posted: 21-Dec-2015
331 W08.1 Spring 2005
14:332:331 Computer Architecture and Assembly Language
Fall 2003
Week 8
[Adapted from Dave Patterson’s UCB CS152 slides and
Mary Jane Irwin’s PSU CSE331 slides]
331 W08.2 Spring 2005
Heads Up
This week’s material
- CPU performance
  - Reading assignment: PH 4
- Building a MIPS datapath
  - Reading assignment: PH 5.1-5.2
Next week’s material
- Single cycle datapath implementation
  - Reading assignment: PH 5.3 and C.1 through C.2
331 W08.3 Spring 2005
Performance
Purchasing perspective: given a collection of machines, which has the
- best performance?
- least cost?
- best performance / cost?
Design perspective: faced with design options, which has the
- best performance improvement?
- least cost?
- best performance / cost?
Both require
- a basis for comparison
- a metric for evaluation
Our goal is to understand the cost and performance implications of architectural choices
331 W08.4 Spring 2005
Two notions of “performance”
° Time to do the task (Execution Time)
– execution time, response time, latency
° Tasks per day, hour, week, sec, ns, ... (Performance)
– throughput, bandwidth
Response time and throughput often are in opposition
Plane              Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours     470          286,700
BAC/Sud Concorde   1350 mph   3 hours       132          178,200
Which has higher performance?
331 W08.5 Spring 2005
Definitions
Performance is in units of things-per-second
- bigger is better
If we are primarily concerned with response time
$$\text{performance}(x) = \frac{1}{\text{execution\_time}(x)}$$
"X is n times faster than Y" means
$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$
331 W08.6 Spring 2005
Example
Time of Concorde vs. Boeing 747?
• Concorde is 1350 mph / 610 mph = 2.2 times faster
• = 6.5 hours / 3 hours
Throughput of Concorde vs. Boeing 747?
• Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
• Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster”
Boeing is 1.6 times (“60%”) faster in terms of throughput
Concorde is 2.2 times (“120%”) faster in terms of flying time
We will focus primarily on execution time for a single job
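The ratios on this slide can be reproduced with a short Python sketch. The dictionary layout and function names are illustrative only; the figures are the ones in the plane table, and pmph is taken to be passengers times mph, which matches the table's throughput column.

```python
# Figures from the plane table; names below are illustrative, not from the slides.
planes = {
    "Boeing 747": {"mph": 610, "hours": 6.5, "passengers": 470},
    "Concorde": {"mph": 1350, "hours": 3.0, "passengers": 132},
}

def throughput_pmph(plane):
    """Passenger-miles per hour: passengers x speed."""
    return plane["passengers"] * plane["mph"]

def times_faster(perf_x, perf_y):
    """n such that 'X is n times faster than Y' (ratio of performances)."""
    return perf_x / perf_y

boeing, concorde = planes["Boeing 747"], planes["Concorde"]

# Response-time view: performance = 1 / execution time.
time_ratio = times_faster(1 / concorde["hours"], 1 / boeing["hours"])

# Throughput view: performance = passenger-miles per hour.
throughput_ratio = times_faster(throughput_pmph(boeing), throughput_pmph(concorde))

print(f"Concorde vs. 747, flying time: {time_ratio:.2f}x")
print(f"747 vs. Concorde, throughput:  {throughput_ratio:.2f}x")
```

Which plane "wins" depends entirely on which of the two notions of performance is being measured.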
331 W08.7 Spring 2005
Basis of Evaluation
Actual Target Workload
- Pros: representative
- Cons: very specific; non-portable; difficult to run, or measure; hard to identify cause
Full Application Benchmarks
- Pros: portable; widely used; improvements useful in reality
- Cons: less representative
Small “Kernel” Benchmarks
- Pros: easy to run, early in design cycle
- Cons: easy to “fool”
Microbenchmarks
- Pros: identify peak capability and potential bottlenecks
- Cons: “peak” may be a long way from application performance
331 W08.8 Spring 2005
SPEC95
Eighteen application benchmarks (with inputs) reflecting a technical computing workload
Eight integer
- go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
Ten floating-point intensive
- tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
Must run with standard compiler flags
- eliminate special undocumented incantations that may not even generate working code for real programs
331 W08.9 Spring 2005
Metrics of performance
Levels of the system, from top to bottom:
- Application
- Programming Language
- Compiler
- ISA
- Datapath, Control
- Function Units
- Transistors, Wires, Pins
Metrics used at these levels:
- Answers per month
- Useful Operations per second
- (millions) of Instructions per second: MIPS
- (millions) of (F.P.) operations per second: MFLOP/s
- Megabytes per second
- Cycles per second (clock rate)
Each metric has a place and a purpose, and each can be misused
331 W08.10 Spring 2005
Aspects of CPU Performance
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
[Table: rows Program, Compiler, Instr. Set Arch., Organization, Technology; columns instr. count, CPI, clock rate]
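As a rough illustration of the CPU time equation, the sketch below multiplies the three factors. The function name and the sample numbers (one billion instructions, CPI of 2.2, 500 MHz clock) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cpu_time(instr_count, cpi, clock_rate_hz):
    """Seconds per program = instructions x cycles/instruction x seconds/cycle."""
    cycle_time = 1.0 / clock_rate_hz  # seconds per cycle
    return instr_count * cpi * cycle_time

# Hypothetical program and machine, for illustration only.
base = cpu_time(instr_count=1e9, cpi=2.2, clock_rate_hz=500e6)

# Doubling the clock rate halves CPU time only if instruction count and CPI stay fixed.
faster_clock = cpu_time(instr_count=1e9, cpi=2.2, clock_rate_hz=1e9)

print(f"base: {base:.2f} s, with 2x clock: {faster_clock:.2f} s")
```

The three factors are rarely independent in practice: a change that improves one often moves another.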
331 W08.11 Spring 2005
CPI
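$$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$
$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}} \quad \text{("instruction frequency")}$$
Invest Resources where time is Spent!
$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$
“Average cycles per instruction”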
331 W08.12 Spring 2005
Example (RISC processor)
Typical Mix, Base Machine (Reg / Reg)
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        .5       23%
Load     20%    5        1.0      45%
Store    10%    3        .3       14%
Branch   20%    2        .4       18%
Total                    2.2
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
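One way to explore the three questions is to recompute the average CPI for each variant, as in the sketch below. It assumes instruction count and clock rate are unchanged by each enhancement, and it models "two ALU instructions at once" as an effective 0.5 cycles per ALU instruction; both are simplifying assumptions rather than something the slide states.

```python
# Instruction mix from the table: op -> (frequency, cycles).
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}

def average_cpi(mix):
    """CPI = sum over instruction classes of frequency x cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())

def with_cycles(mix, op, cycles):
    """Copy of the mix with one class's cycle count changed."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed

base_cpi = average_cpi(BASE_MIX)

# With instruction count and clock rate fixed, speedup is the ratio of CPIs.
speedup_cache = base_cpi / average_cpi(with_cycles(BASE_MIX, "Load", 2))
speedup_branch = base_cpi / average_cpi(with_cycles(BASE_MIX, "Branch", 1))
speedup_dual_alu = base_cpi / average_cpi(with_cycles(BASE_MIX, "ALU", 0.5))

for name, s in [("cache", speedup_cache), ("branch", speedup_branch),
                ("dual ALU", speedup_dual_alu)]:
    print(f"{name}: {s:.3f}x")
```

Under these assumptions the better cache gives roughly 1.375x, branch prediction about 1.10x, and dual ALU issue about 1.13x, which is consistent with loads accounting for the largest share of time in the table.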
331 W08.13 Spring 2005
Amdahl's Law
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$
Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected. Then
$$\text{ExTime(with } E) = \left((1-F) + \frac{F}{S}\right) \times \text{ExTime(without } E)$$
$$\text{Speedup(with } E) = \frac{1}{(1-F) + F/S}$$
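A minimal Python sketch of the speedup formula follows. The function name and the sample values of F and S are illustrative only.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the task is accelerated by factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)

# Illustrative values: half the task sped up 2x, then sped up enormously.
half_doubled = amdahl_speedup(0.5, 2)      # 1 / (0.5 + 0.25)
half_huge = amdahl_speedup(0.5, 1e12)      # approaches 1 / (1 - f) = 2

print(f"F=0.5, S=2:    {half_doubled:.3f}x")
print(f"F=0.5, S=1e12: {half_huge:.3f}x")
```

The second call shows the limit the formula implies: however large S becomes, the speedup cannot exceed 1/(1-F).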
331 W08.14 Spring 2005
Summary: Evaluating Instruction Sets?
Design-time metrics:
° Can it be implemented, in how long, at what cost?
° Can it be programmed? Ease of compilation?
Static Metrics:
° How many bytes does the program occupy in memory?
Dynamic Metrics:
° How many instructions are executed?
° How many bytes does the processor fetch to execute the program?
° How many clocks are required per instruction?
° How "lean" a clock is practical?
Best Metric: Time to execute the program!
NOTE: this depends on instruction set, processor organization, and compilation techniques.
[Figure: CPI, Inst. Count, Cycle Time]
331 W08.15 Spring 2005
Review: Design Principles
Simplicity favors regularity
- fixed size instructions: 32 bits
- only three instruction formats
Good design demands good compromises
- three instruction formats
Smaller is faster
- limited instruction set
- limited number of registers in register file
- limited number of addressing modes
Make the common case fast
- arithmetic operands from the register file (load-store machine)
- allow instructions to contain immediate operands
331 W08.16 Spring 2005
The Processor: Datapath & Control
We're ready to look at an implementation of the MIPS
Simplified to contain only:
- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j
Generic implementation:
- use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
- decode the instruction (and read registers)
- execute the instruction
All instructions (except j) use the ALU after reading the registers
- Why? memory-reference? arithmetic? control flow?
[Figure: Fetch (PC = PC+4) -> Decode -> Exec]
331 W08.17 Spring 2005
Abstract Implementation View
Two types of functional units:
- elements that operate on data values (combinational)
- elements that contain state (sequential)
Single cycle operation
Split memory (Harvard) model: one memory for instructions and one for data
[Figure: PC -> Instruction Memory (Address, Instruction) -> Register File (three Reg Addr inputs, Write Data, two Read Data outputs) -> ALU -> Data Memory (Address, Write Data, Read Data)]
331 W08.18 Spring 2005
Clocking Methodologies
Clocking methodology defines when signals can be read and when they can be written
[Figure: clock waveform marking the falling (negative) edge, the rising (positive) edge, and the cycle time]
clock rate = 1/(cycle time)
- e.g., 10 nsec cycle time = 100 MHz clock rate
- 1 nsec cycle time = 1 GHz clock rate
State element design choices
- level sensitive latch
- master-slave and edge-triggered flip-flops
331 W08.19 Spring 2005
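Review: State Elements
Set-reset latch
[Figure: set-reset latch with inputs R, S and outputs Q, !Q]
R   S   Q(t+1)   !Q(t+1)
1   0   0        1
0   1   1        0
0   0   Q(t)     !Q(t)
1   1   0        0
Level sensitive D latch
- latch is transparent when clock is high (copies input to output)
[Figure: D latch with inputs clock, D and outputs Q, !Q; timing diagram of clock, D, Q]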
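The behavior of the two latches can be written as next-state functions. This is a behavioral sketch in Python that follows the set-reset truth table and the "transparent when clock is high" rule; the function names are made up for illustration and no gate-level timing is modeled.

```python
def sr_latch_next(r, s, q):
    """Return (Q, !Q) at time t+1 for a set-reset latch, per its truth table."""
    if r and s:
        return (0, 0)      # both inputs asserted: both outputs driven low
    if r:
        return (0, 1)      # reset
    if s:
        return (1, 0)      # set
    return (q, 1 - q)      # hold the previous state

def d_latch_next(clock, d, q):
    """Level-sensitive D latch: transparent while clock is high, else holds."""
    return d if clock else q

# Walk a D latch through a few (clock, D) samples, starting from Q = 0.
q = 0
history = []
for clock, d in [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1)]:
    q = d_latch_next(clock, d, q)
    history.append(q)
print(history)
```

While the clock is high the output tracks every change on D, which is the property that leads to the race problem with latch based design.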
331 W08.20 Spring 2005
Review: State Elements, cont’d
Race problem with latch based design ...
Consider the case when D-latch0 holds a 0 and D-latch1 holds a 1 and you want to transfer the contents of D-latch0 to D-latch1 and vice versa
- must have the clock high long enough for the transfer to take place
- must not leave the clock high so long that the transferred data is copied back into the original latch
Two-sided clock constraint
[Figure: D-latch0 and D-latch1 (each with D, clock, Q, !Q) driven by the same clock]
331 W08.21 Spring 2005
Review: State Elements, cont’d
Solution is to use flip-flops that change state (Q) only on clock edge (master-slave)
- master (first D-latch) copies the input when the clock is high (the slave (second D-latch) is locked in its memory state and the output does not change)
- slave copies the master when the clock goes low (the master is now locked in its memory state so changes at the input are not loaded into the master D-latch)
One-sided clock constraint
- must have the clock cycle time long enough to accommodate the worst case delay path
[Figure: two D-latches in series (master then slave), each with D, clock, Q, !Q; the combined flip-flop with D, clock, Q, !Q; timing diagram of clock, D, Q]
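The master-slave arrangement can be sketched behaviorally in Python by chaining two level-sensitive latches, with the master enabled while the clock is high and the slave enabled while it is low. Class and method names are illustrative, and propagation delays are ignored.

```python
class DLatch:
    """Level-sensitive latch: copies d to q only while enabled."""
    def __init__(self, q=0):
        self.q = q

    def update(self, enable, d):
        if enable:
            self.q = d
        return self.q

class MasterSlaveFlipFlop:
    """Two D latches in series; Q changes only when the clock goes low."""
    def __init__(self, q=0):
        self.master = DLatch(q)
        self.slave = DLatch(q)

    def update(self, clock, d):
        self.master.update(clock == 1, d)               # master follows D while clock is high
        self.slave.update(clock == 0, self.master.q)    # slave follows master while clock is low
        return self.slave.q

ff = MasterSlaveFlipFlop()
samples = [(1, 1), (1, 0), (1, 1), (0, 1), (0, 0), (1, 0), (0, 0)]
trace = [ff.update(clock, d) for clock, d in samples]
print(trace)
```

In the trace, Q picks up the value D had just before the clock fell and ignores changes on D while the clock is low, so there is no window in which data can race through both stages.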
331 W08.22 Spring 2005
Our Implementation
An edge-triggered methodology
Typical execution
- read contents of some state elements
- send values through some combinational logic
- write results to one or more state elements
Assumes state elements are written on every clock cycle; if not, need explicit write control signal
- write occurs only when both the write control is asserted and the clock edge occurs
[Figure: State element 1 -> Combinational logic -> State element 2, both clocked, within one clock cycle]
331 W08.23 Spring 2005
Fetching Instructions
Fetching instructions involves
- reading the instruction from the Instruction Memory
- updating the PC to hold the address of the next instruction
PC is updated every cycle, so it does not need an explicit write control signal
Instruction Memory is read every cycle, so it doesn’t need an explicit read control signal
[Figure: PC -> Read Address of Instruction Memory -> Instruction; PC and constant 4 -> Add -> PC]
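The fetch step can be sketched as a function that reads the Instruction Memory at the PC and returns PC + 4. In this Python sketch the instruction memory is modeled as a dictionary from byte address to 32-bit word, and the two sample words (encodings of add $t0, $t1, $t2 and lw $t0, 4($t1)) are hypothetical test data.

```python
MASK32 = 0xFFFFFFFF

def fetch(instr_mem, pc):
    """Return (instruction word at pc, updated PC = pc + 4)."""
    instruction = instr_mem[pc]          # read every cycle, no read control needed
    next_pc = (pc + 4) & MASK32          # PC is written every cycle
    return instruction, next_pc

# Hypothetical program: byte address -> 32-bit instruction word.
instr_mem = {
    0x00: 0x012A4020,   # add $t0, $t1, $t2
    0x04: 0x8D280004,   # lw  $t0, 4($t1)
}

pc = 0x00
first, pc = fetch(instr_mem, pc)
second, pc = fetch(instr_mem, pc)
print(hex(first), hex(second), hex(pc))
```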
331 W08.24 Spring 2005
Decoding Instructions
Decoding instructions involves
- sending the fetched instruction’s opcode and function field bits to the control unit
- reading two values from the Register File
  - Register File addresses are contained in the instruction
[Figure: Instruction -> Control Unit; Instruction -> Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data) -> Read Data 1, Read Data 2]
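Decoding amounts to slicing fixed bit fields out of the 32-bit instruction word. The Python sketch below assumes the standard MIPS field positions (op in bits 31-26, rs 25-21, rt 20-16, rd 15-11, shamt 10-6, funct 5-0); the function names and the register file modeled as a list are illustrative.

```python
def decode(instruction):
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "op": (instruction >> 26) & 0x3F,     # to the control unit
        "rs": (instruction >> 21) & 0x1F,     # Read Addr 1
        "rt": (instruction >> 16) & 0x1F,     # Read Addr 2
        "rd": (instruction >> 11) & 0x1F,     # Write Addr for R format
        "shamt": (instruction >> 6) & 0x1F,
        "funct": instruction & 0x3F,          # to the control unit
        "imm16": instruction & 0xFFFF,        # offset field for I format
    }

def read_registers(regs, fields):
    """Read the two source operands named by rs and rt."""
    return regs[fields["rs"]], regs[fields["rt"]]

regs = [0] * 32
regs[9], regs[10] = 5, 7                      # hypothetical contents of $t1, $t2
fields = decode(0x012A4020)                   # add $t0, $t1, $t2
operands = read_registers(regs, fields)
print(fields, operands)
```

All fields are extracted unconditionally, mirroring hardware that reads the registers before it knows which instruction format applies.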
331 W08.25 Spring 2005
Executing R Format Operations
R format operations (add, sub, slt, and, or)
- perform the indicated (by op and funct) operation on values in rs and rt
- store the result back into the Register File (into location rd)
Note that Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File
R-type:
bits    31-26   25-21   20-16   15-11   10-6    5-0
field   op      rs      rt      rd      shamt   funct
[Figure: Instruction -> Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, RegWrite) -> Read Data 1, Read Data 2 -> ALU (ALU control; outputs overflow, zero) -> Write Data]
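A behavioral sketch of R format execution in Python: the funct field selects the ALU operation and the result is written to rd only when the write control is asserted. The funct values are the standard MIPS encodings for these five instructions; the register file as a Python list, the function names, and the omission of overflow detection are simplifications made for illustration.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

# funct field -> ALU operation (overflow detection omitted in this sketch).
ALU_OPS = {
    0x20: lambda a, b: (a + b) & MASK32,                    # add
    0x22: lambda a, b: (a - b) & MASK32,                    # sub
    0x24: lambda a, b: a & b,                               # and
    0x25: lambda a, b: a | b,                               # or
    0x2A: lambda a, b: int(to_signed(a) < to_signed(b)),    # slt
}

def execute_r_type(regs, rs, rt, rd, funct, reg_write=True):
    """Compute rs op rt; write the result to rd only if reg_write is asserted."""
    result = ALU_OPS[funct](regs[rs], regs[rt])
    if reg_write:
        regs[rd] = result
    return result

regs = [0] * 32
regs[9], regs[10] = 5, 7                     # hypothetical register contents
execute_r_type(regs, rs=9, rt=10, rd=8, funct=0x20)    # add
execute_r_type(regs, rs=9, rt=10, rd=11, funct=0x22)   # sub
execute_r_type(regs, rs=9, rt=10, rd=12, funct=0x2A)   # slt
print(regs[8], hex(regs[11]), regs[12])
```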
331 W08.26 Spring 2005
Executing Load and Store Operations
Load and store operations
- compute a memory address by adding the base register (in rs) to the 16-bit signed offset field in the instruction
  - base register was read from the Register File during decode
  - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value
- store value, read from the Register File during decode, must be written to the Data Memory
- load value, read from the Data Memory, must be stored in the Register File
I-Type:
bits    31-26   25-21   20-16   15-0
field   op      rs      rt      address offset
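The address computation for lw and sw can be sketched as a sign extension followed by an add. In this Python sketch the data memory is a dictionary keyed by byte address and the register file is a list; those choices and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value):
    """Sign extend a 16-bit field to a Python integer (negative if bit 15 is set)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def effective_address(base, offset16):
    """Base register plus sign-extended offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & MASK32

def store_word(regs, data_mem, rs, rt, offset16):
    """sw: write the value in rt to Data Memory at the computed address."""
    data_mem[effective_address(regs[rs], offset16)] = regs[rt]

def load_word(regs, data_mem, rs, rt, offset16):
    """lw: read Data Memory at the computed address into rt."""
    regs[rt] = data_mem[effective_address(regs[rs], offset16)]

regs = [0] * 32
regs[9] = 0x1000            # hypothetical base address in $t1
regs[8] = 0xDEADBEEF        # hypothetical value in $t0
data_mem = {}
store_word(regs, data_mem, rs=9, rt=8, offset16=0xFFFC)   # sw $t0, -4($t1)
load_word(regs, data_mem, rs=9, rt=10, offset16=0xFFFC)   # lw $t2, -4($t1)
print(hex(regs[10]), [hex(a) for a in data_mem])
```

The offset 0xFFFC sign extends to -4, so both accesses land at address 0x0FFC rather than far above the base.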
331 W08.27 Spring 2005
Executing Load and Store Operations, con’t
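[Figure: Instruction -> Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, RegWrite) -> Read Data 1, Read Data 2 -> ALU (ALU control; outputs overflow, zero) -> Data Memory (Address, Write Data, Read Data; MemWrite, MemRead); the instruction's offset field passes through Sign Extend to the ALU]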
331 W08.28 Spring 2005
Executing Branch Operations
Branch operations have to
- compare the operands read from the Register File during decode (rs and rt values) for equality (zero ALU output)
- compute the branch target address by adding the updated PC to the sign extended 16-bit signed offset field in the instruction
  - “base register” is the updated PC
  - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value and then shifted left 2 bits to convert the word offset into a byte offset (a word-aligned address)
I-Type:
bits    31-26   25-21   20-16   15-0
field   op      rs      rt      address offset
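The branch computation can be sketched in Python as follows. The equality test is done by subtracting and checking for zero, mirroring the ALU's zero output; the function names and the sample addresses are illustrative.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value):
    """Sign extend a 16-bit field to a Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc_plus_4, offset16):
    """Updated PC plus the sign-extended offset shifted left by 2 bits."""
    return (pc_plus_4 + (sign_extend16(offset16) << 2)) & MASK32

def next_pc_beq(pc, rs_value, rt_value, offset16):
    """Select the next PC for beq: the target if the operands are equal, else PC + 4."""
    pc_plus_4 = (pc + 4) & MASK32
    zero = ((rs_value - rt_value) & MASK32) == 0    # ALU subtract, zero output
    return branch_target(pc_plus_4, offset16) if zero else pc_plus_4

taken_forward = next_pc_beq(0x100, 1, 1, 3)         # 0x104 + 3 words
taken_backward = next_pc_beq(0x100, 3, 3, 0xFFFF)   # 0x104 - 1 word
not_taken = next_pc_beq(0x100, 3, 4, 5)
print(hex(taken_forward), hex(taken_backward), hex(not_taken))
```

An offset of -1 branches back to the beq itself, because the offset is relative to the already updated PC.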
331 W08.29 Spring 2005
Executing Branch Operations, con’t
[Figure: Instruction -> Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data) -> Read Data 1, Read Data 2 -> ALU (ALU control; zero output to branch control logic); offset field -> Sign Extend (16 to 32) -> Shift left 2 -> Add with PC + 4 -> Branch target address]
331 W08.30 Spring 2005
Executing Jump Operations
Jump operations have to
- replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
J-Type:
bits    31-26   25-0
field   op      jump target address
[Figure: PC -> Read Address of Instruction Memory -> Instruction; low 26 bits of the instruction -> Shift left 2 -> 28 bits, combined with the upper 4 bits of PC + 4 -> Jump address]
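The jump address can be formed with two masks and a shift. This Python sketch takes the upper 4 bits from the updated PC (PC + 4); the function name and sample values are illustrative.

```python
def jump_address(pc_plus_4, instruction):
    """Upper 4 bits of PC + 4 joined with the 26-bit target field shifted left by 2."""
    target26 = instruction & 0x03FFFFFF            # low 26 bits of the instruction
    return (pc_plus_4 & 0xF0000000) | (target26 << 2)

# Hypothetical j instruction: op = 2, target field = 0x10.
j_instruction = (0x02 << 26) | 0x10
address = jump_address(0x10000004, j_instruction)
print(hex(address))
```

Because the target field is shifted left by 2, the two low-order address bits are always zero, so every jump lands on a word boundary.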
331 W08.31 Spring 2005
Our Simple Control Structure
We wait for everything to settle down
- ALU might not produce “right answer” right away
- we use write signals along with the clock edge to determine when to write (to the Register File and the Data Memory)
Cycle time determined by length of the longest path
We are ignoring some details like register setup and hold times