CS152 – Computer Architecture and Engineering Lecture 9...
Transcript of CS152 – Computer Architecture and Engineering Lecture 9...
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
2004-09-28 Dave Patterson
(www.cs.berkeley.edu/~patterson)
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/
CS152 – Computer Architecture andEngineering
Lecture 9 – Performance
1
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Last Time: Microcode, Multi-Cycle
5
CS 152 L09 Multicycle (25) Fall 2004 © UC Regents
Controller Design• The state diagrams that arise define the controller for an
instruction set processor are highly structured• Use this structure to construct a simple “microsequencer” • Control reduces to programming this very simple device
microprogramming
op-code
Map ROM
Micro-PC
Z I Ldatapath control
taken
CS 152 L09 Multicycle (26) Fall 2004 © UC Regents
Our Microsequencer
op-code
Map ROM
Micro-PC
Z I Ldatapath control
taken
CS 152 L09 Multicycle (27) Fall 2004 © UC Regents
Adding the Dispatch ROM
•Sequencer-based control– Called “microPC” or “µPC” vs. state register
Control Value Effect00 Next µaddress = 001 Next µaddress = dispatch ROM 10 Next µaddress = µaddress + 1
ROM:
Opcode
microPC
1
µAddressSelectLogic
Adder
ROM
Mux
0012
R-type 000000 0100BEQ 000100 0011ori 001101 0110LW 100011 1000SW 101011 1011
CS 152 L09 Multicycle (28) Fall 2004 © UC Regents
sequencercontrol
micro-PCµ-sequencer:fetch,dispatch,sequential
DispatchROM
Opcode
Inputs
Microprogramming
µ-Code ROM
To DataPath
DecodeDecode
datapath control
microinstruction (µ)
CS 152 L09 Multicycle (29) Fall 2004 © UC Regents
Microprogramming• Microprogramming is a convenient method for
implementing structured control state diagrams:– Random logic replaced by microPC sequencer and ROM– Each line of ROM called a microinstruction:
contains sequencer control + values for control points– To reduce confusion, normal instruction
(e.g., MIPS addu) called “macroinstruction”– limited state transitions:
branch to zero, next sequential,branch to µinstruction address from dispatch ROM
• Control design reduces to Microprogramming– Part of the design process is to develop a “language” that
describes control and is easy for humans to understand
CS 152 L09 Multicycle (30) Fall 2004 © UC Regents
“Macroinstruction” Interpretation
MainMemory
executionunit
controlmemory
CPU
ADDSUBAND
DATA
.
.
.
User program plus Data
this can change!
AND microsequence
e.g., FetchCalc Operand AddrFetch Operand(s)CalculateSave Answer(s)
one of these ismapped into oneof these
2
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Today’s Lecture - Performance
Measurement: what, why, how
The performance equation
How energy limits performance
Amdahl’s law
3
UC Regents Fall 2004 © UCBCS 152 L09 Performance ()
Performance Measurement(as seen by the customer)
4
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Who (sensibly) upgrades CPUs often?A professional who turns CPU cycles into money, and who is cycle-limited.
Artist tool: animation,
video special effects.
5
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
How to decide to buy a new machine?
Measure After Effects “execution time” on a representative render “workload”
“Night flight”
City map and cloudscomputed
“on the fly” with fractals
CPU intensive Trivial I/O
6
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Interpreting Execution Time
Performance 1Execution Time
= = 2.85 renders/hour
1.5 GHz PB (Y) is N times faster than 1.25 GHz PB (X). N is ?
N =Performance (Y)
Execution Time (Y)
Execution Time (X)Performance (X)
= = 1. 19
PB 1.5 Ghz : 3. 4 renders/hour. PB 1.25 : 2.85 renders/hour.Does artist productivity really increase?
Execution Time: 1265 seconds
PowerBookG41.25 GHz
7
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Execution Time: Time for 1 job to complete
2 CPUs: Execution Time vs Throughput
Throughput: # jobs/hour completed (not serialized)
Could G5 and Opteron have similar Throughput? Why?
Assume G5 MP executiontime faster because AE doesnot use both Opteron CPUs.
1.8xfaster.What does this imply?
2 CPUs vs1 CPU,otherwisesimilar
8
UC Regents Fall 2004 © UCBCS 152 L09 Performance ()
Performance Measurement(as seen by a CPU designer)
Q. Why do we care about After Effect’s performance?
A. We want the CPU we are designing to run it well !
9
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Step 1: Analyze the right measurement!CPU Time:
Time the CPU spends running program under measurement.
Response Time:
Total time: CPU Time + time spent waiting (for disk, I/O, ...).
Guides CPU design
Guides system design
How to measure CPU time?% time <program name>25.77u 0.72s 0:29.17 90.8%
How do designersuse these two numbers?
10
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Administrivia - Adjust Class Time ?We have permission to stay in thisroom past 12:30.
A: Lecture from 11:10 to 12:30B: Lecture from 11:15 to 12:35C: Lecture from 11:20 to 12:40
Does anyone have a class that starts 12:40?
Class time options (all “sharp” time)
11
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Administrivia - Mid-Term is Coming!
Mid-term review session: Sunday 10/10, 7-9 PM, 306 Soda.
Mid-term: Tuesday 10/12, 5:30-8:30 PM, 101 Morgan. No class on Tuesday.
After exam: Pizza at LaVal’s, on us!
12
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Administrivia - This Week’s Deadlines
Homework 2 due 9/29 (tomorrow)!283 Soda, in CS 152 box at 5 PM
Lab 2 Xilinx demo on Friday 10/1Lab 2 due Monday 10/4, 11:59 PMOn Tuesday 10/5, onto the Pipelining Lab!
13
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
CPU time: Proportional to Instruction Count
CPU timeProgram
Machine InstructionsProgram∝
Q. Static count?(lines of program printout)Or dynamic count? (trace of execution)
Rationale: Every additional
instruction you execute takes time.
Q. What type of computer architect influences the number of instructions a given program needs?A. Instruction set architect.
A. Dynamic.
Q. Once ISA is set, who can influence instructioncount?A. Compiler writer,application developer.
14
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
CPU time: Proportional to Clock PeriodQ. What ultimately limitsan architect’s ability to reduce clock period ?
TimeProgram
TimeOne Clock Period∝
A. Clock-to-Q, setup times.
Q. How can architects (not technologists) reduce clock period?A. Shorten the machine critical path.
Rationale: We measure each instruction’s
execution time in “number of cycles”. By shortening the period for
each cycle, we shorten execution time.15
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Completing the performance equation
SecondsProgram
InstructionsProgram
= SecondsCycle
We need all three terms, and only these terms, to
compute CPU Time!When is it OK to compare clock rates?
What factors make the CPI for a program differfrom the underlying CPIof a CPU implementation?
Instruction mix varies
Cache behavior varies.
Branch prediction varies.
“CPI” -- The Average Number of Clock
Cycles Per Instruction For the Program
InstructionCycles
16
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
CPI as an analytical tool to guide design
Multiply
Other A
LU Load
Store
Branch
2221
5
Machine CPI
5 x 30 + 1 x 20 + 2 x 20 + 2 x 10 + 2 x 20
100= 2.7 cycles/instruction
20%Branch
10%Store
20%Load
20%Other ALU
30%Multiply
ProgramInstruction Mix
15%Branch
7%
15%Load
7%
56%Multiply
Where program spends its time
Q. We lower machine multiply CPI, but program runs slower!What mistake(s) did we make?
17
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Amdahl’s Law (of Diminishing Returns)
If enhancement “E” speeds up multiply, but other instructions
are unchanged, what is the maximum speedup S?
17%Branch
8%
17%Load
8%
50%Multiply
Where programspends its time
Smax =1
1 - (% affected / 100 %)1
1 - (50/100)= 2=
Attributed to Gene Amdahl -- “Amdahl’s Law”
What is the lesson of Amdahl’s Law?
Must enhance computers in a balanced way!
18
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Peer Instruction: Amdahl’s Law
ProgramWeWishTo RunOn N CPUs
30%Serial
70%Parallel
The program spends 30%of its time running code that can not be recoded to run in parallel.
CPUs 2 3 4 5 ∞
Speedup
Compute speedup for N = 2, 3, 4, 5, and ∞.
19
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Peer Instruction: Amdahl’s Law
ProgramWeWishTo RunOn N CPUs
S =1
1 - (30 % + (70% / N) ) / 100 %)
30%Serial
70%Parallel
The program spends 30%of its time in serial code.
CPUs 2 3 4 5 ∞
Speedup 1.54 1.85 2.1 2.3 3.3
Compute speedup for N = 2, 3, 4, 5, and ∞.
S(∞)
2 3 # CPUs
20
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Final thoughts: Performance Equation
SecondsProgram
InstructionsProgram= Seconds
Cycle InstructionCycles
Goal is to optimize execution time, notindividualequation
terms.
The CPI of the
program.Reflects
the program’s instruction
mix.
Machinesare
optimizedwith
respect toprogram
workloads.
Clockperiod.
Optimizejointlywith
machineCPI.
21
UC Regents Fall 2004 © UCBCS 152 L09 Performance ()
Energy and Performance
1 Joule = 0.24 calories. 1 calorie raises 1 gram of water 1℃
1 Joule of energy is dissipated by a 1 Amp current flowing through a 1 Ohm resistor for 1 second.
Sad fact: computers turn electrical energy into heat. Computation is a byproduct.
Air or water carries heat away, or chip melts.
1 Watt: 1 Amp flowing through 1 Ohm.
Also, 1 Watt for 1 second.
22
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
IBM Power 4: How does die heat up?
4 dies on amulti-chip
module
2 CPUsper die
23
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
IBM Power 4: Dissipating 115 Watts
Fixedpointunits
Hot spots
Cachelogic
24
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Switching energy: Fundamental Physics
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-)
!"#$%&'(#)*(+,%-$*".(/0
1 2+.$0#$03
1 4546%,"#$3
Vdd
C
1
2C VddE1->0=
21
2C VddE0->1=
2
Vdd
Every logic transition dissipates energy.
Strong result: Independent of technology.State-of-the-art CPUs (90 nm):
Switching energy is 70% of total energy. Remainder: at 90nm, “switches” are “dimmers”!
“leakage” currents
How can we limit switching energy?
65nm: 50/50! 25
CS 152 L09 Performance () UC Regents Fall 2004 © UCB
Conclusions
Customers: measure to buyArchitects: measure for design
Tools: Performance Equation, CPI
Energy:
Amdahl’s Law’s lesson: Balance
1
2C VddE0->1=
2 1
2C VddE1->0=
2
26