Systems Software & Performance
Transcript of Systems Software & Performance
Performance
Jin-Soo Kim([email protected])
Systems Software &Architecture Lab.
Seoul National University
Fall 2020
CPU Performance
Chap. 1.6, 2.13
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 3
βͺ Measure, analyze, report, and summarize
βͺ Make intelligent choices
βͺ See through the marketing hype
βͺ Key to understanding underlying organizational motivation
βͺ Questions
β’ Why is some hardware better than others for different programs?
β’ What factors of system performance are hardware related?
(e.g., Do we need a new machine or a new operating system?)
β’ How does the machineβs instruction set affect performance?
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 4
βͺ Algorithm
β’ Determines number of operations executed
βͺ Programming language, compiler, architecture
β’ Determine number of machine instructions executed per operation
βͺ Processor and memory system (microarchitecture)
β’ Determine how fast instructions are executed
βͺ I/O system (including OS)
β’ Determines how fast I/O operations are executed
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 5
βͺ Which airplane has the best performance?
0 100 200 300 400 500
Douglas DC-8-50
BAC/Sud Concorde
Boeing 747
Boeing 777
Passenger Capacity
0 2000 4000 6000 8000 10000
Douglas DC-8-50
BAC/Sud Concorde
Boeing 747
Boeing 777
Cruising Range (miles)
0 500 1000 1500
Douglas DC-8-50
BAC/Sud Concorde
Boeing 747
Boeing 777
Cruising Speed (mph)
0 100000 200000 300000 400000
Douglas DC-8-50
BAC/Sud Concorde
Boeing 747
Boeing 777
Passengers x mph
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 6
βͺ Response time (β execution time, latency)
β’ The time between the start and completion of a task
β’ How long does it take for my job to run?
β’ How long must I wait for the database query?
βͺ Throughput (β bandwidth)
β’ The total amount of work done in a given time
β’ How much work is getting done per unit time?
β’ What is the average execution rate?
βͺ What if β¦
β’ We replace the processor with a faster version?
β’ We add more processors?
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 7
βͺ Define
βͺ βX is n times faster than Yβ
βͺ Example: time taken to run a program
β’ 10s on machine A, 15s on machine B
β’ Execution TimeB / Execution TimeA = 15s / 10s = 1.5
β’ Machine A is 1.5 times faster than machine B
πππππππππππ = Ξ€1 πΈπ₯πππ’π‘πππ ππππ
πππππππππππ π
πππππππππππ π=
πΈπ₯πππ’π‘πππ π‘πππ π
πΈπ₯πππ’π‘πππ π‘πππ π= π
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 8
βͺ Elapsed time
β’ Total response time, including all aspects
β e.g., processing, I/O, OS overhead, idle time
β’ Determines system performance
βͺ CPU time
β’ Time spent processing a given job
β Discounts I/O time, other jobsβ shares
β’ Comprises user CPU time and system CPU time
β’ Different programs are affected differently by CPU and system performance
βͺ Our focus: User CPU time
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 9
βͺ Operation of digital hardware governed by a constant-rate clock
βͺ Clock period: duration of a clock cycle
β’ e.g., 250ps = 0.25ns = 250 x 10-12 s
βͺ Clock frequency (rate): cycles per second
β’ e.g., 4.0GHz = 4000MHz = 4.0x109 Hz
Clock (cycles)
Data transferand computation
Update state
Clock period
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 10
βͺ Performance improved by
β’ Reducing number of clock cycles
β’ Increasing clock rate (or decreasing the clock cycle time)
β’ Hardware designer must often trade off clock rate against cycle count
πΆππ ππππ = πΆππ πΆππππ πΆπ¦ππππ Γ πΆππππ πΆπ¦πππ ππππ
=πΆππ πΆππππ πΆπ¦ππππ
πΆππππ π ππ‘π
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 11
βͺ Computer A: 2GHz clock, 10s CPU time
βͺ Designing Computer B
β’ Aim for 6s CPU time
β’ Can do faster clock, but causes 1.2x clock cycles
βͺ How fast must Computer Bβs clock be?
πΆππππ π ππ‘ππ΅ =πΆππππ πΆπ¦ππππ π΅πΆππ πππππ΅
=1.2 Γ πΆππππ πΆπ¦ππππ π΄
6π =1.2 Γ 20 Γ 109
6π = 4πΊπ»π§
πΆππππ πΆπ¦ππππ π΄ = πΆππ πππππ΄ Γ πΆππππ π ππ‘ππ΄ = 10π Γ 2πΊπ»π§ = 20 Γ 109
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 12
βͺ Instruction count for a program
β’ Determined by program, ISA and compiler
βͺ Average cycles per instruction
β’ Determined by CPU hardware
β’ If different instructions have different CPU, average CPI affected by instruction mix
πΆππππ πΆπ¦ππππ = πΌππ π‘ππ’ππ‘πππ πΆππ’ππ‘ Γ CPI (Cycles Per Instruction)
πΆππ ππππ = πΌππ π‘ππ’ππ‘πππ πΆππ’ππ‘ Γ CPI Γ πΆππππ πΆπ¦πππ ππππ
=πΌππ π‘ππ’ππ‘πππ πΆππ’ππ‘ Γ CPI
πΆππππ π ππ‘π
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 13
βͺ Computer A: Cycle time = 250ps, CPI = 2.0
βͺ Computer B: Cycle time = 500ps, CPI = 1.2
βͺ Same ISA
βͺ Which is faster, and by how much?
πΆππ πππππ΄ = πΌππ π‘ππ’ππ‘πππ πΆππ’ππ‘ (πΌπΆ) Γ πΆππΌπ΄ Γ πΆπ¦πππ πππππ΄
= πΌπΆ Γ 2.0 Γ 250ππ = πΌπΆ Γ 500ππ
πΆππ πππππ΅ = πΌππ π‘ππ’ππ‘πππ πΆππ’ππ‘ (πΌπΆ) Γ πΆππΌπ΅ Γ πΆπ¦πππ πππππ΅
= πΌπΆ Γ 1.2 Γ 500ππ = πΌπΆ Γ 600ππ
πΆππ πππππ΅πΆππ πππππ΄
=πΌπΆ Γ 600ππ
πΌπΆ Γ 500ππ = 1.2 A is faster than B by 1.2 times
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 14
βͺ If different instruction classes take different numbers of cycles
βͺ Weighted average CPI
πΆππππ πΆπ¦ππππ =
π=1
π
(πΆππΌπ Γ πΌππ π‘ππ’ππ‘πππ πΆππ’ππ‘π)
πΆππΌ =πΆππππ πΆπ¦ππππ
πΌππ π‘ππ’ππ‘πππ πΆππ’ππ‘=
π=1
π
πΆππΌπ ΓπΌππ π‘ππ’ππ‘πππ πΆππ’ππ‘ππΌππ π‘ππ’ππ‘πππ πΆππ’ππ‘
Relative frequency
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 15
βͺ Alternative compiled code sequences A and B
βͺ Sequence A: IC = 500
β’ Clock cycles
= 200 x 1 + 100 x 2 + 200 x 3
= 1000
β’ Average CPI = 1000 / 500 = 2.0
Class ALU Load/Store Branch
CPI for class 1 2 3
IC in sequence A 200 100 200
IC in sequence B 400 100 100
βͺ Sequence B: IC = 600
β’ Clock cycles
= 400 x 1 + 100 x 2 + 100 x 3
= 900
β’ Average CPI = 900 / 600 = 1.5
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 16
void swap(long long v[], long long k){
long long temp;
temp = v[k];v[k] = v[k+1];v[k+1] = temp;
}
void sort(long long v[], size_t n){
size_t i, j;for (i = 1; i < n; i++)
for (j = i - 1; j >= 0 && v[j] > v[j+1]; j--) {swap(v, j);
}
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 17
βͺ Compiled with gcc on Pentium 4 under Linux
0
0.5
1
1.5
2
2.5
3
none O1 O2 O3
Relative Performance
020000400006000080000
100000120000140000160000180000
none O1 O2 O3
Clock Cycles
0
20000
40000
60000
80000
100000
120000
140000
none O1 O2 O3
Instruction count
0
0.5
1
1.5
2
none O1 O2 O3
CPI
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 18
0
0.5
1
1.5
2
2.5
3
C/none C/O1 C/O2 C/O3 Java/int Java/JIT
Bubblesort Relative Performance
0
0.5
1
1.5
2
2.5
C/none C/O1 C/O2 C/O3 Java/int Java/JIT
Quicksort Relative Performance
0
500
1000
1500
2000
2500
3000
C/none C/O1 C/O2 C/O3 Java/int Java/JIT
Quicksort vs. Bubblesort Speedup
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 19
βͺ Instruction count and CPI are not good performance indicators in
isolation
βͺ Compiler optimizations are sensitive to the algorithm
βͺ Java/JIT compiled code is significantly faster than JVM interpreted
β’ Comparable to optimized C in some cases
βͺ Nothing can fix a dumb algorithm!
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 20
πΆππ ππππ =πΌππ π‘ππ’ππ‘ππππ
πππππππΓ
πΆπ¦ππππ
πΌππ π‘ππ’ππ‘πππΓπππππππ
πΆπ¦πππ
= πΌππ π‘ππ’ππ‘πππ πΆππ’ππ‘ Γ πΆππΌ Γ πΆππππ πΆπ¦πππ ππππ
InstructionCount
CPI Clock Cycle
Algorithm β β³
Programming language β β
Compiler β β
ISA β β β
Microarchitecture β β
Technology β
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 21
βͺ MIPS (Millions of Instructions Per Second) doesnβt account for
β’ Differences in ISAs between computers
β’ Differences in complexity between instructions
β’ CPI varies between programs on a given CPU
ππΌππ =πΌππ π‘ππ’ππ‘πππ πππ’ππ‘
πΈπ₯πππ’π‘πππ π‘πππ Γ 106
=πΌππ π‘ππ’ππ‘πππ πππ’ππ‘
πΌππ π‘ππ’ππ‘πππ πππ’ππ‘ Γ πΆππΌπΆππππ πππ‘π
Γ 106=
πΆππππ πππ‘π
πΆππΌ Γ 106
Benchmarking
Chap. 1.9
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 23
βͺ How to measure the performance?
β’ Performance best determined by running a real application
β’ Use programs typical of expected workload
βͺ Small benchmarks
β’ Nice for architects and designers
β’ Easy to standardize
β’ Can be abused
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 24
βͺ SPEC (Standard Performance Evaluation Corporation)
β’ A non-profit organization that aims to βproduce, establish, maintain and endorse a
standardized setβ of performance benchmarks for computers
β’ CPU, Power, HPC (High-Performance Computing), Web servers, Java, Storage, β¦
β’ http://www.spec.org
βͺ SPEC CPU benchmark
β’ An industry-standardized, CPU-intensive benchmark suite, stressing a systemβs
processor, memory subsystem and compiler
β Companies have agreed on a set of real program and inputs
β Valuable indicator of performance (and compiler technology)
β’ CPU89 β CPU92 β CPU95 β CPU2000 β CPU2006 β CPU2017
β’ Can still be abused
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 25
An embarrassed Intel Corp. acknowledged Friday that a bug in a software
program known as a compiler had led the company to overstate the speed of its
microprocessor chips on an industry benchmark by 10 percent. However,
industry analysts said the coding errorβ¦was a sad commentary on a common
industry practice of βcheatingβ on standardized performance testsβ¦The error was
pointed out to Intel two days ago by a competitor, Motorola β¦came in a test
known as SPECint92β¦Intel acknowledged that it had βoptimizedβ its compiler to
improve its test scores. The company had also said that it did not like the practice
but felt to compelled to make the optimizations because its competitors were
doing the same thingβ¦At the heart of Intelβs problem is the practice of βtuningβ
compiler programs to recognize certain computing problems in the test and then
substituting special handwritten pieces of codeβ¦
Saturday, January 6, 1996 New York Times
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 26
βͺ Elapsed time to execute a selection of programs
β’ Negligible I/O, so focuses on CPU performance
βͺ Normalize relative to reference machine
β’ Sunβs historical βUltra Enterprise 2β introduced in 1997
β’ 296MHz UltraSPARC II processor
βͺ Summarize as geometric mean of performance ratios
β’ CINT2006: 12 integer programs written in C and C++
β’ CFP2006: 17 FP programs written in Fortran and C/C++
π
ΰ·
π=1
π
πΈπ₯πππ’π‘πππ π‘πππ πππ‘ππ π
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 27
Integer Benchmarks (CINT2006) Floating Point Benchmarks (CFP2006)
perlbench C Perl programming language bwaves Fortran Fluid dynamics
bzip2 C Compression gamess Fortran Quantum chemistry
gcc C C compiler milc C Physics: Quantum chromodynamics
mcf C Combinatorial optimization zeusmp Fortran Physics / CFD
gobmk C Artificial intelligence: Go gromacs C/Fortran Biochemistry / Molecular dynamics
hmmer C Search gene sequence cactusADM C/Fortran Physics / General relativity
sjeng C Artificial intelligence: Chess leslie3d Fortran Fluid dynamics
libquantum C Physics: Quantum computing namd C++ Biology / Molecular dynamics
h264ref C Video compression dealII C++ Finite element analysis
omnetpp C++ Discrete event simulation soplex C++ Linear programming, optimization
astar C++ Path-finding algorithms povray C++ Image ray-tracing
xalancbmk C++ XML processing calculix C/Fortran Structural mechanics
GemsFDTD Fortran Computational electromagnetics
tonto Fortran Quantum chemistry
lbm C Fluid dynamics
wrf C/Fortran Weather prediction
sphinx3 C Speech recognition
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 28
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 29
βͺ SPECpower_ssj2008
β’ The first industry-standard SPEC benchmark for evaluating the power and
performance characteristics of server class computers
β’ Initially targets the performance of server-side Java
β’ Power consumption of server at different workload levels (0% ~ 100%)
β Performance: ssj_ops/sec
β Power: Watts (Joules/sec)
ππ£πππππ π π π_πππ πππ πππ‘π‘ = ΰ΅
π=0
10
π π π_πππ π
π=0
10
πππ€πππ
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 30
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 31
βͺ Look back at Xeon power benchmark
β’ At 100% load: 258W
β’ At 50% load: 170W (66%)
β’ At 10% load: 121W (47%)
βͺ Google data center
β’ Mostly operates at 10% - 50% load
β’ At 100% load less than 1% of the time
βͺ Consider designing processors to
make power proportional to load
4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 32
βͺ Performance is specific to a particular program(s)
β’ Total execution time is a consistent summary of the performance
βͺ For a given architecture, performance increases come from
β’ Increases in clock rate (without adverse CPI effects)
β’ Improvements in processor organization that lower CPI
β’ Compiler enhancements that lower CPI and/or instruction count
β’ Algorithm/language choices that affect instruction count
βͺ Pitfall: Using a subset of the performance equation as a performance
metric