Post on 03-Jan-2016
204521 Digital System Architecture, April 20, 2023
Lecture 2a Computer Performance and Cost
Pradondet Nilagupta
(Based on lecture notes by Prof. Randy Katz & Prof. Jan M. Rabaey, UC Berkeley)
Review: What is Computer Architecture?
[Figure: the Computer Architect, surrounded by Technology, Applications, Interfaces, Machine Organization, and Measurement & Evaluation. Other labels: ISA, API, Link, I/O Chan, Regs, IR.]
The Architecture Process
New concepts created
Estimate Cost & Performance
Sort: Good ideas, Mediocre ideas, Bad ideas
Performance Measurement and Evaluation
Many dimensions to computer performance
– CPU execution time, by instruction or sequence: floating point, integer, branch performance
– Cache bandwidth
– Main memory bandwidth
– I/O performance: bandwidth, seeks, pixels or polygons per second
Relative importance depends on applications
[Figure: blocks labeled P, $, M]
Evaluation Tools
Benchmarks, traces, & mixes
– macrobenchmarks & suites: application execution time
– microbenchmarks: measure one aspect of performance
– traces: replay recorded accesses (cache, branch, register)
Simulation at many levels
– ISA, cycle accurate, RTL, gate, circuit: trade fidelity for simulation rate
Area and delay estimation
Analysis
– e.g., queuing theory
[Figure: example instruction mix: MOVE 39%, BR 20%, LOAD 20%, STORE 10%, ALU 11%]
[Figure: example trace: LD 5EA3, ST 31FF, …, LD 1EA2, …]
Benchmarks
Microbenchmarks
– measure one performance dimension: cache bandwidth, main memory bandwidth, procedure call overhead, FP performance
– weighted combination of microbenchmark performance is a good predictor of application performance
– gives insight into the cause of performance bottlenecks
Macrobenchmarks
– application execution time: measures overall performance, but on just one application
[Figure: axes "Perf. Dimensions" and "Applications", with regions labeled Micro and Macro]
Some Warnings about Benchmarks
Benchmarks measure the whole system
– application
– compiler
– operating system
– architecture
– implementation
Popular benchmarks typically reflect yesterday’s programs
– computers need to be designed for tomorrow’s programs
Benchmark timings often very sensitive to
– alignment in cache
– location of data on disk
– values of data
Benchmarks can lead to inbreeding or positive feedback
– if you make an operation fast (slow) it will be used more (less) often
– so you make it faster (slower)
– and it gets used even more (less)
– and so on…
Architectural Performance Laws and Rules of Thumb
Measurement and Evaluation
Architecture is an iterative process:
• Searching the space of possible designs
• Make selections
• Evaluate the selections made
Good measurement tools are required to accurately evaluate the selection.
Measurement Tools
Benchmarks, Traces, Mixes
Cost, delay, area, power estimation
Simulation (many levels)
– ISA, RTL, Gate, Circuit
Queuing Theory
Rules of Thumb
Fundamental Laws
Measuring and Reporting Performance
What do we mean when we say one computer is faster than another?
– the program runs in less time
Response time or execution time
– time until the user sees the output
Throughput
– total amount of work done in a given time
Performance
“Increasing” or “decreasing”?
We say “improve performance” or “improve execution time” when we mean increase performance and decrease execution time.
– improve performance = increase performance
– improve execution time = decrease execution time
Measuring Performance
Definition of time
Wall-clock time, response time, elapsed time
– the latency to complete a task, including disk accesses, memory accesses, I/O activities, and operating system overhead
Does Anybody Really Know What Time it is?
UNIX Time Command: 90.7u 12.9s 2:39 65%
User CPU Time (time spent in program): 90.7 sec
System CPU Time (time spent in OS): 12.9 sec
Elapsed Time (response time): 2:39 = 159 sec
(90.7 + 12.9)/159 * 100 = 65% of elapsed time is CPU time. The remaining 35% of the time is spent in I/O or running other programs.
Example
UNIX time command output: 90.7u 12.9s 2:39 65%
– user CPU time is 90.7 sec
– system CPU time is 12.9 sec
– elapsed time is 2 min 39 sec (159 sec)
– % of elapsed time that is CPU time is (90.7 + 12.9)/159 = 65%
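The percentage reported by the time command can be reproduced with a few lines of Python. This is a minimal sketch; the function name `cpu_time_share` is an illustrative choice, and the numbers are the ones from the example above.

```python
def cpu_time_share(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Return CPU time (user + system) as a fraction of elapsed time."""
    return (user_s + system_s) / elapsed_s


# Values from the example: 90.7u 12.9s 2:39
elapsed = 2 * 60 + 39            # 2 min 39 sec expressed in seconds
share = cpu_time_share(90.7, 12.9, elapsed)

print(f"elapsed = {elapsed} s, CPU share = {share:.0%}")
```

Whatever is left of the elapsed time (1 - share) was spent waiting for I/O or running other programs.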
Time
CPU time
– time the CPU is computing
– not including the time waiting for I/O or running other programs
User CPU time
– CPU time spent in the program
System CPU time
– CPU time spent in the operating system performing tasks requested by the program
CPU time = User CPU time + System CPU time
Performance
System performance
– elapsed time on an unloaded system
CPU performance
– user CPU time on an unloaded system
Programs to Evaluate Processor Performance
Real programs
Kernels
– Time-critical excerpts of real programs
Toy benchmarks
– 10-100 lines of code: Sieve of Eratosthenes, Puzzles, Quicksort
Synthetic benchmarks
– Attempt to match average frequencies of real workloads
– similar to kernels
– Whetstone, Dhrystone
– not even pieces of real programs, whereas kernels might be
Benchmark Suite
A collection of benchmarks that tries to measure processor performance across a variety of applications
– SPECint92, SPECfp92
– see fig. 1.9
Benchmarking Games
Differing configurations used to run the same workload on two systems
Compiler wired to optimize the workload
Test specification written to be biased towards one machine
Synchronized CPU/IO intensive job sequence used
Workload arbitrarily picked
Very small benchmarks used
Benchmarks manually translated to optimize performance
Common Benchmarking Mistakes (I)
Only average behavior represented in test workload
Skewness of device demands ignored
Loading level controlled inappropriately
Caching effects ignored
Buffer sizes not appropriate
Inaccuracies due to sampling ignored
Common Benchmarking Mistakes (II)
Ignoring monitoring overhead
Not validating measurements
Not ensuring same initial conditions
Not measuring transient (cold start) performance
Using device utilizations for performance comparisons
Collecting too much data but doing too little analysis
SPEC: System Performance Evaluation Cooperative
First Round 1989
– 10 programs yielding a single number (“SPECmarks”)
Second Round 1992
– SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
– Compiler flags unlimited. March 93 of DEC 4000 Model 610:
spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c)”
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
Third Round 1995
– new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
– “benchmarks useful for 3 years”
– Single flag setting for all programs: SPECint_base95, SPECfp_base95
How to Summarize Performance
Arithmetic mean (weighted arithmetic mean) tracks execution time: $\frac{1}{n}\sum_i T_i$ or $\sum_i W_i T_i$
Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: $n / \sum_i (1/R_i)$ or $1 / \sum_i (W_i/R_i)$, with weights that sum to 1
Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
– Arithmetic mean impacted by choice of reference machine
Use the geometric mean for comparison: $\left(\prod_i T_i\right)^{1/n}$
– Independent of chosen machine
– but not a good metric for total execution time
SPEC First Round
One program: 99% of time in single line of code
New front-end compiler could improve dramatically
[Figure: bar chart of SPEC Perf (axis 0 to 800) by benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]
Impact of Means on SPECmark89 for IBM 550 (without and with special compiler option)

             Ratio to VAX      Time             Weighted Time
Program      Before   After    Before   After   Before   After
gcc          30       29       49       51      8.91     9.22
espresso     35       34       65       67      7.64     7.86
spice        47       47       510      510     5.69     5.69
doduc        46       49       41       38      5.81     5.45
nasa7        78       144      258      140     3.43     1.86
li           34       34       183      183     7.86     7.86
eqntott      40       40       28       28      6.68     6.68
matrix300    78       730      58       6       3.43     0.37
fpppp        90       87       34       35      2.97     3.07
tomcatv      33       138      20       19      2.01     1.94
Mean         54       72       124      108     54.42    49.99
Mean type    Geometric         Arithmetic       Weighted Arith.
Ratio        1.33              1.16             1.09
The Bottom Line: Performance (and Cost)
• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth

Plane              Speed      DC to Paris   Passengers   Throughput (pass/hr)
Boeing 747         610 mph    6.5 hours     470          72
BAC/Sud Concorde   1350 mph   3 hours       132          44
The Bottom Line: Performance (and Cost)
"X is n times faster than Y" means
$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = n$$
• Performance is the reciprocal of execution time
• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde
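The two comparisons can be worked through with the plane data from the previous slide. This is a small sketch; `times_faster` is an illustrative helper name, and throughput is taken as passengers per flight hour, which is how the table's 72 and 44 are obtained.

```python
def times_faster(ex_time_y: float, ex_time_x: float) -> float:
    """Return n such that X is n times faster than Y: ExTime(Y) / ExTime(X)."""
    return ex_time_y / ex_time_x


# DC to Paris flight times in hours.
time_747, time_concorde = 6.5, 3.0
speed_ratio = times_faster(time_747, time_concorde)

# Throughput in passengers per hour of flight.
throughput_747 = 470 / time_747
throughput_concorde = 132 / time_concorde
throughput_ratio = throughput_747 / throughput_concorde

print(f"Concorde is {speed_ratio:.2f} times faster in execution time")
print(f"Boeing 747 has {throughput_ratio:.2f} times the throughput")
```

Which plane is "faster" therefore depends on whether latency or throughput is the metric of interest.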
Performance Terminology
“X is n% faster than Y” means:
$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = 1 + \frac{n}{100}$$
$$n = \frac{100\,(\text{Performance}(X) - \text{Performance}(Y))}{\text{Performance}(Y)}$$
$$n = \frac{100\,(\text{ExTime}(Y) - \text{ExTime}(X))}{\text{ExTime}(X)}$$
Example
Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?
$$n = \frac{100\,(\text{ExTime}(Y) - \text{ExTime}(X))}{\text{ExTime}(X)} = \frac{100\,(15 - 10)}{10} = 50$$
X is 50% faster than Y.
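The same calculation as a short Python sketch; `percent_faster` is an illustrative name, and the arguments follow the formula's ExTime(Y), ExTime(X) order.

```python
def percent_faster(ex_time_y: float, ex_time_x: float) -> float:
    """Return n such that X is n% faster than Y."""
    return 100.0 * (ex_time_y - ex_time_x) / ex_time_x


# Y takes 15 seconds, X takes 10 seconds.
n = percent_faster(15.0, 10.0)
print(f"X is {n:.0f}% faster than Y")
```

Note that the denominator is the execution time of the faster machine X, so the result is not the same as the percentage reduction in execution time (which would be 33% here).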
Speedup
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime w/o E}}{\text{ExTime w/ E}} = \frac{\text{Performance w/ E}}{\text{Performance w/o E}}$$
Suppose that enhancement E accelerates a fraction $\text{Fraction}_{enhanced}$ of the task by a factor $\text{Speedup}_{enhanced}$, and the remainder of the task is unaffected. Then what is
ExTime(E) = ?
Speedup(E) = ?
Amdahl’s Law
States that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used
Amdahl’s Law
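$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}\right]$$
$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$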
Example of Amdahl’s Law
Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{old}$$
$$\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$
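Amdahl's Law is easy to explore numerically. The sketch below implements the overall speedup formula and evaluates it for the floating point example; `amdahl_speedup` is an illustrative name.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the task is accelerated."""
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


# FP instructions run 2X faster, but account for only 10% of the task.
print(f"{amdahl_speedup(0.10, 2.0):.3f}")

# Upper bound: even an infinitely fast FP unit cannot beat 1 / (1 - 0.10).
print(f"{amdahl_speedup(0.10, float('inf')):.3f}")
```

The second call shows the limit the law describes: the unenhanced 90% of the task bounds the overall speedup no matter how fast the enhanced part becomes.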
Corollary: Make The Common Case Fast
All instructions require an instruction fetch, only a fraction require a data fetch/store.
– Optimize instruction access over data access
Programs exhibit locality
– 90% of time in 10% of code
– Spatial Locality
– Temporal Locality
Access to small memories is faster
– Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories.
[Figure: storage hierarchy: Reg's, Cache, Memory, Disk / Tape]
RISC Philosophy
The simple case is usually the most frequent and the easiest to optimize!
Do simple, fast things in hardware and be sure the rest can be handled correctly in software
Metrics of Performance
[Figure: levels of the system: Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors / Wires / Pins]
Metrics shown alongside the levels:
– Answers per month
– Operations per second
– (millions) of Instructions per second: MIPS
– (millions) of (FP) operations per second: MFLOP/s
– Megabytes per second
– Cycles per second (clock rate)
% of Instructions Responsible for 80-90% of Instructions Executed
[Figure: bar chart, y-axis 0 to 80 (%), benchmarks: compress, li, ear]
Aspects of CPU Performance
$$\text{CPU time (user)} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
[Table: rows Program, Compiler, Instr. Set, Organization, Technology; columns Instr. Cnt, CPI, Clock Rate]
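The CPU time equation translates directly into code. The sketch below is illustrative only: the function name and the workload numbers (instruction count, CPI, clock rate) are hypothetical and do not come from the slides.

```python
def cpu_time_seconds(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instructions x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle


# Hypothetical program: 1 billion instructions, CPI of 1.5, 500 MHz clock.
t = cpu_time_seconds(1e9, 1.5, 500e6)
print(f"CPU time = {t:.2f} s")

# Halving CPI (better organization) or doubling the clock rate (better
# technology) each cut CPU time in half, all else being equal.
print(cpu_time_seconds(1e9, 0.75, 500e6), cpu_time_seconds(1e9, 1.5, 1000e6))
```

Each of the three factors can be attacked separately, which is why the equation is useful for reasoning about where a design change helps.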
Marketing Metrics
MIPS = Instruction Count / (Time * 10^6) = Clock Rate / (CPI * 10^6)
• Machines with different instruction sets?
• Programs with different instruction mixes?
– Dynamic frequency of instructions
• Uncorrelated with performance
MFLOP/s = FP Operations / (Time * 10^6)
• Machine dependent
• Often not where time is spent
Normalized:
add, sub, compare, mult 1
divide, sqrt 4
exp, sin, . . . 8
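A short sketch of why MIPS can mislead when instruction sets differ. The two machines and all their numbers are hypothetical, chosen only to illustrate the point; the function names are illustrative as well.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = Clock Rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def cpu_time_seconds(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction Count * CPI / Clock Rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical machines running the same program at the same 500 MHz clock.
# Machine B uses simpler instructions: lower CPI, but twice as many of them.
machine_a = {"instructions": 1e9, "cpi": 1.5}
machine_b = {"instructions": 2e9, "cpi": 1.0}

for name, m in (("A", machine_a), ("B", machine_b)):
    rate = mips(500e6, m["cpi"])
    time = cpu_time_seconds(m["instructions"], m["cpi"], 500e6)
    print(f"Machine {name}: {rate:.0f} MIPS, {time:.1f} s")
```

In this made-up case Machine B has the higher MIPS rating yet takes longer to finish the program, which is the sense in which MIPS can be uncorrelated with performance.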
Cycles Per Instruction
CPI = Clock Cycles for a Program / Instr. Count (“Average Cycles per Instruction”)
$$\text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$
$$\text{Avg. CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where } F_i = \frac{I_i}{\text{Instruction Count}} \text{ (“Instruction Frequency”)}$$
Invest Resources where time is Spent!
Example: Calculating CPI
Base Machine (Reg / Reg), Typical Mix

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
Total                    1.5

1.5 is the Average CPI for this instruction mix
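The table can be recomputed from the frequencies and cycle counts alone. A minimal sketch, with the mix stored in a dictionary whose layout is an illustrative choice:

```python
# Instruction mix of the base machine: op -> (frequency, cycles).
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# CPI(i) = Freq_i * Cycles_i is each class's contribution to the average CPI.
contribution = {op: freq * cycles for op, (freq, cycles) in mix.items()}
average_cpi = sum(contribution.values())

# Share of execution time spent in each instruction class.
time_share = {op: c / average_cpi for op, c in contribution.items()}

print(f"Average CPI = {average_cpi:.2f}")
for op in mix:
    print(f"{op:7s} CPI(i) = {contribution[op]:.1f}  time = {time_share[op]:.0%}")
```

The time shares show where to invest: ALU operations are half of the instructions but only a third of the time, while loads and branches together take more than half.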
Organizational Trade-offs
[Figure: the system levels (Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors / Wires / Pins) annotated with Instruction Mix, CPI, and Cycle Time]
Trade-off Example
Base Machine (Reg / Reg), Typical Mix

Op       Freq   Cycles
ALU      50%    1
Load     20%    2
Store    10%    2
Branch   20%    2

Add register / memory operations:
– One source operand in memory
– One source operand in register
– Cycle count of 2
Branch cycle count to increase to 3.
What fraction of the loads must be eliminated for this to pay off?
Example Solution
Exec Time = Instr Cnt x CPI x Clock

           Old                            New
Op         Freq   Cycles   Freq x Cycles  Freq     Cycles   Freq x Cycles
ALU        .50    1        .5             .5 - X   1        .5 - X
Load       .20    2        .4             .2 - X   2        .4 - 2X
Store      .10    2        .2             .1       2        .2
Branch     .20    2        .4             .2       3        .6
Reg/Mem                                   X        2        2X
Total      1.00            1.5            1 - X             (1.7 - X)/(1 - X)

CPINew must be normalized to the new instruction frequency: CPINew = CyclesNew / InstructionsNew = (1.7 - X)/(1 - X)

For the change to break even, with the clock unchanged:
Instr CntOld x CPIOld x ClockOld = Instr CntNew x CPINew x ClockNew
1.00 x 1.5 = (1 - X) x (1.7 - X)/(1 - X)
1.5 = 1.7 - X
0.2 = X
Loads are 20% of the mix, so ALL loads must be eliminated for this to be a win!
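The break-even point can be verified numerically. In the sketch below, X is the fraction of the original instruction count replaced by register/memory operations (each one removes a load and an ALU operation); the function names are illustrative, and the clock is assumed unchanged as in the derivation.

```python
def old_cycles() -> float:
    """Cycles per original instruction on the base machine (CPI_old)."""
    return 0.50 * 1 + 0.20 * 2 + 0.10 * 2 + 0.20 * 2


def new_cycles(x: float) -> float:
    """Cycles per ORIGINAL instruction on the new machine.

    Equals Instr Cnt_new x CPI_new = (1 - x) * (1.7 - x) / (1 - x) = 1.7 - x.
    """
    alu = (0.50 - x) * 1      # ALU ops folded into reg/mem ops
    load = (0.20 - x) * 2     # loads eliminated
    store = 0.10 * 2
    branch = 0.20 * 3         # branches now take 3 cycles
    reg_mem = x * 2           # new register/memory ops
    return alu + load + store + branch + reg_mem


# new_cycles is linear in x with slope -1, so the break-even x is the gap at x = 0.
break_even_x = new_cycles(0.0) - old_cycles()
print(f"break-even X = {break_even_x:.2f}")

for x in (0.0, 0.1, 0.2):
    print(f"X = {x:.1f}: new = {new_cycles(x):.2f}, old = {old_cycles():.2f}")
```

Since loads make up only 20% of the mix, X cannot exceed 0.2, so the new design at best ties the base machine.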
How to Summarize Performance
Arithmetic mean (weighted arithmetic mean) tracks execution time: SUM(Ti)/n or SUM(Wi*Ti)
Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/SUM(1/Ri) or 1/SUM(Wi/Ri), with weights that sum to 1
Normalized execution time is handy for scaling performance
But do not take the arithmetic mean of normalized execution time; use the geometric mean: PROD(Ri)^(1/n)
Means
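Arithmetic mean: $\frac{1}{n}\sum_{i=1}^{n} T_i$
– Can be weighted. Represents total execution time
Harmonic mean: $\frac{n}{\sum_{i=1}^{n} 1/R_i}$
Geometric mean: $\left(\prod_{i=1}^{n} \frac{T_i}{T_{ri}}\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \log\frac{T_i}{T_{ri}}\right)$, where $T_{ri}$ is the time of program i on the reference machine
– Consistent independent of reference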
Which Machine is “Better”?
                   Computer A   Computer B   Computer C
Program P1 (sec)   1            10           20
Program P2 (sec)   1000         100          20
Total Time (sec)   1001         110          40
Weighted Arithmetic Mean
Assume three weighting schemes
P1/P2        Comp A   Comp B   Comp C
.5/.5        500.5    55.0     20
.909/.091    91.82    18.18    20
.999/.001    2        10.09    20
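The weighted means can be recomputed from the execution times of the three machines. This is a sketch under one assumption: the .909/.091 row is evaluated with the unrounded weights 10/11 and 1/11, which reproduces the 91.82 entry (the rounded weights would give about 91.91 instead).

```python
def weighted_arithmetic_mean(weights, times):
    """SUM(W_i * T_i), with weights that sum to 1."""
    return sum(w * t for w, t in zip(weights, times))


# Execution times in seconds for (P1, P2) on each computer.
times = {"A": (1, 1000), "B": (10, 100), "C": (20, 20)}

# Three weighting schemes for (P1, P2). The middle one is assumed to be
# the exact fractions behind the rounded .909/.091.
schemes = {
    ".5/.5": (0.5, 0.5),
    ".909/.091": (10 / 11, 1 / 11),
    ".999/.001": (0.999, 0.001),
}

for label, weights in schemes.items():
    row = {m: weighted_arithmetic_mean(weights, t) for m, t in times.items()}
    print(label, {m: round(v, 2) for m, v in row.items()})
```

The ranking of the machines changes with the weights: C wins under equal weighting, B under the middle scheme, and A when P1 dominates the workload.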
Performance Evaluation
Given that sales is a function of performance relative to the competition, there is big investment in improving the product as reported by the performance summary
Good products are created when we have:
– Good benchmarks
– Good ways to summarize performance
If the benchmarks/summary are inadequate, then one must choose between improving the product for real programs vs. improving the product to get more sales; sales almost always wins!
Execution time is the measure of performance