Systems Software & Performance

32
Performance Jin-Soo Kim ([email protected]) Systems Software & Architecture Lab. Seoul National University Fall 2020

Transcript of Systems Software & Performance

Page 1: Systems Software & Performance

Performance

Jin-Soo Kim([email protected])

Systems Software &Architecture Lab.

Seoul National University

Fall 2020

Page 2: Systems Software & Performance

CPU Performance

Chap. 1.6, 2.13

Page 3: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 3

β–ͺ Measure, analyze, report, and summarize

β–ͺ Make intelligent choices

β–ͺ See through the marketing hype

β–ͺ Key to understanding underlying organizational motivation

β–ͺ Questions

β€’ Why is some hardware better than others for different programs?

β€’ What factors of system performance are hardware related?

(e.g., Do we need a new machine or a new operating system?)

β€’ How does the machine’s instruction set affect performance?

Page 4: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 4

β–ͺ Algorithm

β€’ Determines number of operations executed

β–ͺ Programming language, compiler, architecture

β€’ Determine number of machine instructions executed per operation

β–ͺ Processor and memory system (microarchitecture)

β€’ Determine how fast instructions are executed

β–ͺ I/O system (including OS)

β€’ Determines how fast I/O operations are executed

Page 5: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 5

β–ͺ Which airplane has the best performance?

0 100 200 300 400 500

Douglas DC-8-50

BAC/Sud Concorde

Boeing 747

Boeing 777

Passenger Capacity

0 2000 4000 6000 8000 10000

Douglas DC-8-50

BAC/Sud Concorde

Boeing 747

Boeing 777

Cruising Range (miles)

0 500 1000 1500

Douglas DC-8-50

BAC/Sud Concorde

Boeing 747

Boeing 777

Cruising Speed (mph)

0 100000 200000 300000 400000

Douglas DC-8-50

BAC/Sud Concorde

Boeing 747

Boeing 777

Passengers x mph

Page 6: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 6

β–ͺ Response time (β‰ˆ execution time, latency)

β€’ The time between the start and completion of a task

β€’ How long does it take for my job to run?

β€’ How long must I wait for the database query?

β–ͺ Throughput (β‰ˆ bandwidth)

β€’ The total amount of work done in a given time

β€’ How much work is getting done per unit time?

β€’ What is the average execution rate?

β–ͺ What if …

β€’ We replace the processor with a faster version?

β€’ We add more processors?

Page 7: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 7

β–ͺ Define

β–ͺ β€œX is n times faster than Y”

β–ͺ Example: time taken to run a program

β€’ 10s on machine A, 15s on machine B

β€’ Execution TimeB / Execution TimeA = 15s / 10s = 1.5

β€’ Machine A is 1.5 times faster than machine B

π‘ƒπ‘’π‘Ÿπ‘“π‘œπ‘Ÿπ‘šπ‘Žπ‘›π‘π‘’ = Ξ€1 𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‡π‘–π‘šπ‘’

π‘ƒπ‘’π‘Ÿπ‘“π‘œπ‘Ÿπ‘šπ‘Žπ‘›π‘π‘’ 𝑋

π‘ƒπ‘’π‘Ÿπ‘“π‘œπ‘Ÿπ‘šπ‘Žπ‘›π‘π‘’ π‘Œ=

𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ π‘Œ

𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ 𝑋= 𝑛

Page 8: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 8

β–ͺ Elapsed time

β€’ Total response time, including all aspects

– e.g., processing, I/O, OS overhead, idle time

β€’ Determines system performance

β–ͺ CPU time

β€’ Time spent processing a given job

– Discounts I/O time, other jobs’ shares

β€’ Comprises user CPU time and system CPU time

β€’ Different programs are affected differently by CPU and system performance

β–ͺ Our focus: User CPU time

Page 9: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 9

β–ͺ Operation of digital hardware governed by a constant-rate clock

β–ͺ Clock period: duration of a clock cycle

β€’ e.g., 250ps = 0.25ns = 250 x 10-12 s

β–ͺ Clock frequency (rate): cycles per second

β€’ e.g., 4.0GHz = 4000MHz = 4.0x109 Hz

Clock (cycles)

Data transferand computation

Update state

Clock period

Page 10: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 10

β–ͺ Performance improved by

β€’ Reducing number of clock cycles

β€’ Increasing clock rate (or decreasing the clock cycle time)

β€’ Hardware designer must often trade off clock rate against cycle count

πΆπ‘ƒπ‘ˆ π‘‡π‘–π‘šπ‘’ = πΆπ‘ƒπ‘ˆ πΆπ‘™π‘œπ‘π‘˜ 𝐢𝑦𝑐𝑙𝑒𝑠 Γ— πΆπ‘™π‘œπ‘π‘˜ 𝐢𝑦𝑐𝑙𝑒 π‘‡π‘–π‘šπ‘’

=πΆπ‘ƒπ‘ˆ πΆπ‘™π‘œπ‘π‘˜ 𝐢𝑦𝑐𝑙𝑒𝑠

πΆπ‘™π‘œπ‘π‘˜ π‘…π‘Žπ‘‘π‘’

Page 11: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 11

β–ͺ Computer A: 2GHz clock, 10s CPU time

β–ͺ Designing Computer B

β€’ Aim for 6s CPU time

β€’ Can do faster clock, but causes 1.2x clock cycles

β–ͺ How fast must Computer B’s clock be?

πΆπ‘™π‘œπ‘π‘˜ π‘…π‘Žπ‘‘π‘’π΅ =πΆπ‘™π‘œπ‘π‘˜ πΆπ‘¦π‘π‘™π‘’π‘ π΅πΆπ‘ƒπ‘ˆ π‘‡π‘–π‘šπ‘’π΅

=1.2 Γ— πΆπ‘™π‘œπ‘π‘˜ 𝐢𝑦𝑐𝑙𝑒𝑠𝐴

6𝑠=1.2 Γ— 20 Γ— 109

6𝑠= 4𝐺𝐻𝑧

πΆπ‘™π‘œπ‘π‘˜ 𝐢𝑦𝑐𝑙𝑒𝑠𝐴 = πΆπ‘ƒπ‘ˆ π‘‡π‘–π‘šπ‘’π΄ Γ— πΆπ‘™π‘œπ‘π‘˜ π‘…π‘Žπ‘‘π‘’π΄ = 10𝑠 Γ— 2𝐺𝐻𝑧 = 20 Γ— 109

Page 12: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 12

β–ͺ Instruction count for a program

β€’ Determined by program, ISA and compiler

β–ͺ Average cycles per instruction

β€’ Determined by CPU hardware

β€’ If different instructions have different CPU, average CPI affected by instruction mix

πΆπ‘™π‘œπ‘π‘˜ 𝐢𝑦𝑐𝑙𝑒𝑠 = πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› πΆπ‘œπ‘’π‘›π‘‘ Γ— CPI (Cycles Per Instruction)

πΆπ‘ƒπ‘ˆ π‘‡π‘–π‘šπ‘’ = πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› πΆπ‘œπ‘’π‘›π‘‘ Γ— CPI Γ— πΆπ‘™π‘œπ‘π‘˜ 𝐢𝑦𝑐𝑙𝑒 π‘‡π‘–π‘šπ‘’

=πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› πΆπ‘œπ‘’π‘›π‘‘ Γ— CPI

πΆπ‘™π‘œπ‘π‘˜ π‘…π‘Žπ‘‘π‘’

Page 13: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 13

β–ͺ Computer A: Cycle time = 250ps, CPI = 2.0

β–ͺ Computer B: Cycle time = 500ps, CPI = 1.2

β–ͺ Same ISA

β–ͺ Which is faster, and by how much?

πΆπ‘ƒπ‘ˆ π‘‡π‘–π‘šπ‘’π΄ = πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› πΆπ‘œπ‘’π‘›π‘‘ (𝐼𝐢) Γ— 𝐢𝑃𝐼𝐴 Γ— 𝐢𝑦𝑐𝑙𝑒 π‘‡π‘–π‘šπ‘’π΄

= 𝐼𝐢 Γ— 2.0 Γ— 250𝑝𝑠 = 𝐼𝐢 Γ— 500𝑝𝑠

πΆπ‘ƒπ‘ˆ π‘‡π‘–π‘šπ‘’π΅ = πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› πΆπ‘œπ‘’π‘›π‘‘ (𝐼𝐢) Γ— 𝐢𝑃𝐼𝐡 Γ— 𝐢𝑦𝑐𝑙𝑒 π‘‡π‘–π‘šπ‘’π΅

= 𝐼𝐢 Γ— 1.2 Γ— 500𝑝𝑠 = 𝐼𝐢 Γ— 600𝑝𝑠

πΆπ‘ƒπ‘ˆ π‘‡π‘–π‘šπ‘’π΅πΆπ‘ƒπ‘ˆ π‘‡π‘–π‘šπ‘’π΄

=𝐼𝐢 Γ— 600𝑝𝑠

𝐼𝐢 Γ— 500𝑝𝑠= 1.2 A is faster than B by 1.2 times

Page 14: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 14

β–ͺ If different instruction classes take different numbers of cycles

β–ͺ Weighted average CPI

πΆπ‘™π‘œπ‘π‘˜ 𝐢𝑦𝑐𝑙𝑒𝑠 =

𝑖=1

𝑛

(𝐢𝑃𝐼𝑖 Γ— πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› πΆπ‘œπ‘’π‘›π‘‘π‘–)

𝐢𝑃𝐼 =πΆπ‘™π‘œπ‘π‘˜ 𝐢𝑦𝑐𝑙𝑒𝑠

πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› πΆπ‘œπ‘’π‘›π‘‘=

𝑖=1

𝑛

𝐢𝑃𝐼𝑖 Γ—πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› πΆπ‘œπ‘’π‘›π‘‘π‘–πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› πΆπ‘œπ‘’π‘›π‘‘

Relative frequency

Page 15: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 15

β–ͺ Alternative compiled code sequences A and B

β–ͺ Sequence A: IC = 500

β€’ Clock cycles

= 200 x 1 + 100 x 2 + 200 x 3

= 1000

β€’ Average CPI = 1000 / 500 = 2.0

Class ALU Load/Store Branch

CPI for class 1 2 3

IC in sequence A 200 100 200

IC in sequence B 400 100 100

β–ͺ Sequence B: IC = 600

β€’ Clock cycles

= 400 x 1 + 100 x 2 + 100 x 3

= 900

β€’ Average CPI = 900 / 600 = 1.5

Page 16: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 16

void swap(long long v[], long long k){

long long temp;

temp = v[k];v[k] = v[k+1];v[k+1] = temp;

}

void sort(long long v[], size_t n){

size_t i, j;for (i = 1; i < n; i++)

for (j = i - 1; j >= 0 && v[j] > v[j+1]; j--) {swap(v, j);

}

Page 17: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 17

β–ͺ Compiled with gcc on Pentium 4 under Linux

0

0.5

1

1.5

2

2.5

3

none O1 O2 O3

Relative Performance

020000400006000080000

100000120000140000160000180000

none O1 O2 O3

Clock Cycles

0

20000

40000

60000

80000

100000

120000

140000

none O1 O2 O3

Instruction count

0

0.5

1

1.5

2

none O1 O2 O3

CPI

Page 18: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 18

0

0.5

1

1.5

2

2.5

3

C/none C/O1 C/O2 C/O3 Java/int Java/JIT

Bubblesort Relative Performance

0

0.5

1

1.5

2

2.5

C/none C/O1 C/O2 C/O3 Java/int Java/JIT

Quicksort Relative Performance

0

500

1000

1500

2000

2500

3000

C/none C/O1 C/O2 C/O3 Java/int Java/JIT

Quicksort vs. Bubblesort Speedup

Page 19: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 19

β–ͺ Instruction count and CPI are not good performance indicators in

isolation

β–ͺ Compiler optimizations are sensitive to the algorithm

β–ͺ Java/JIT compiled code is significantly faster than JVM interpreted

β€’ Comparable to optimized C in some cases

β–ͺ Nothing can fix a dumb algorithm!

Page 20: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 20

πΆπ‘ƒπ‘ˆ π‘‡π‘–π‘šπ‘’ =πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘›π‘ 

π‘ƒπ‘Ÿπ‘œπ‘”π‘Ÿπ‘Žπ‘šΓ—

𝐢𝑦𝑐𝑙𝑒𝑠

πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘›Γ—π‘†π‘’π‘π‘œπ‘›π‘‘π‘ 

𝐢𝑦𝑐𝑙𝑒

= πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› πΆπ‘œπ‘’π‘›π‘‘ Γ— 𝐢𝑃𝐼 Γ— πΆπ‘™π‘œπ‘π‘˜ 𝐢𝑦𝑐𝑙𝑒 π‘‡π‘–π‘šπ‘’

InstructionCount

CPI Clock Cycle

Algorithm β—‹ β–³

Programming language β—‹ β—‹

Compiler β—‹ β—‹

ISA β—‹ β—‹ β—‹

Microarchitecture β—‹ β—‹

Technology β—‹

Page 21: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 21

β–ͺ MIPS (Millions of Instructions Per Second) doesn’t account for

β€’ Differences in ISAs between computers

β€’ Differences in complexity between instructions

β€’ CPI varies between programs on a given CPU

𝑀𝐼𝑃𝑆 =πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› π‘π‘œπ‘’π‘›π‘‘

𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ Γ— 106

=πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› π‘π‘œπ‘’π‘›π‘‘

πΌπ‘›π‘ π‘‘π‘Ÿπ‘’π‘π‘‘π‘–π‘œπ‘› π‘π‘œπ‘’π‘›π‘‘ Γ— πΆπ‘ƒπΌπΆπ‘™π‘œπ‘π‘˜ π‘Ÿπ‘Žπ‘‘π‘’

Γ— 106=

πΆπ‘™π‘œπ‘π‘˜ π‘Ÿπ‘Žπ‘‘π‘’

𝐢𝑃𝐼 Γ— 106

Page 22: Systems Software & Performance

Benchmarking

Chap. 1.9

Page 23: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 23

β–ͺ How to measure the performance?

β€’ Performance best determined by running a real application

β€’ Use programs typical of expected workload

β–ͺ Small benchmarks

β€’ Nice for architects and designers

β€’ Easy to standardize

β€’ Can be abused

Page 24: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 24

β–ͺ SPEC (Standard Performance Evaluation Corporation)

β€’ A non-profit organization that aims to β€œproduce, establish, maintain and endorse a

standardized set” of performance benchmarks for computers

β€’ CPU, Power, HPC (High-Performance Computing), Web servers, Java, Storage, …

β€’ http://www.spec.org

β–ͺ SPEC CPU benchmark

β€’ An industry-standardized, CPU-intensive benchmark suite, stressing a system’s

processor, memory subsystem and compiler

– Companies have agreed on a set of real program and inputs

– Valuable indicator of performance (and compiler technology)

β€’ CPU89 β†’ CPU92 β†’ CPU95 β†’ CPU2000 β†’ CPU2006 β†’ CPU2017

β€’ Can still be abused

Page 25: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 25

An embarrassed Intel Corp. acknowledged Friday that a bug in a software

program known as a compiler had led the company to overstate the speed of its

microprocessor chips on an industry benchmark by 10 percent. However,

industry analysts said the coding error…was a sad commentary on a common

industry practice of β€œcheating” on standardized performance tests…The error was

pointed out to Intel two days ago by a competitor, Motorola …came in a test

known as SPECint92…Intel acknowledged that it had β€œoptimized” its compiler to

improve its test scores. The company had also said that it did not like the practice

but felt to compelled to make the optimizations because its competitors were

doing the same thing…At the heart of Intel’s problem is the practice of β€œtuning”

compiler programs to recognize certain computing problems in the test and then

substituting special handwritten pieces of code…

Saturday, January 6, 1996 New York Times

Page 26: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 26

β–ͺ Elapsed time to execute a selection of programs

β€’ Negligible I/O, so focuses on CPU performance

β–ͺ Normalize relative to reference machine

β€’ Sun’s historical β€œUltra Enterprise 2” introduced in 1997

β€’ 296MHz UltraSPARC II processor

β–ͺ Summarize as geometric mean of performance ratios

β€’ CINT2006: 12 integer programs written in C and C++

β€’ CFP2006: 17 FP programs written in Fortran and C/C++

𝑛

ΰ·‘

𝑖=1

𝑛

𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ π‘Ÿπ‘Žπ‘‘π‘–π‘œ 𝑖

Page 27: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 27

Integer Benchmarks (CINT2006) Floating Point Benchmarks (CFP2006)

perlbench C Perl programming language bwaves Fortran Fluid dynamics

bzip2 C Compression gamess Fortran Quantum chemistry

gcc C C compiler milc C Physics: Quantum chromodynamics

mcf C Combinatorial optimization zeusmp Fortran Physics / CFD

gobmk C Artificial intelligence: Go gromacs C/Fortran Biochemistry / Molecular dynamics

hmmer C Search gene sequence cactusADM C/Fortran Physics / General relativity

sjeng C Artificial intelligence: Chess leslie3d Fortran Fluid dynamics

libquantum C Physics: Quantum computing namd C++ Biology / Molecular dynamics

h264ref C Video compression dealII C++ Finite element analysis

omnetpp C++ Discrete event simulation soplex C++ Linear programming, optimization

astar C++ Path-finding algorithms povray C++ Image ray-tracing

xalancbmk C++ XML processing calculix C/Fortran Structural mechanics

GemsFDTD Fortran Computational electromagnetics

tonto Fortran Quantum chemistry

lbm C Fluid dynamics

wrf C/Fortran Weather prediction

sphinx3 C Speech recognition

Page 28: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 28

Page 29: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 29

β–ͺ SPECpower_ssj2008

β€’ The first industry-standard SPEC benchmark for evaluating the power and

performance characteristics of server class computers

β€’ Initially targets the performance of server-side Java

β€’ Power consumption of server at different workload levels (0% ~ 100%)

– Performance: ssj_ops/sec

– Power: Watts (Joules/sec)

π‘‚π‘£π‘’π‘Ÿπ‘Žπ‘™π‘™ 𝑠𝑠𝑗_π‘œπ‘π‘  π‘π‘’π‘Ÿ π‘Šπ‘Žπ‘‘π‘‘ = ΰ΅™

𝑖=0

10

𝑠𝑠𝑗_π‘œπ‘π‘  𝑖

𝑖=0

10

π‘π‘œπ‘€π‘’π‘Ÿπ‘–

Page 30: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 30

Page 31: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 31

β–ͺ Look back at Xeon power benchmark

β€’ At 100% load: 258W

β€’ At 50% load: 170W (66%)

β€’ At 10% load: 121W (47%)

β–ͺ Google data center

β€’ Mostly operates at 10% - 50% load

β€’ At 100% load less than 1% of the time

β–ͺ Consider designing processors to

make power proportional to load

Page 32: Systems Software & Performance

4190.308: Computer Architecture | Fall 2020 | Jin-Soo Kim ([email protected]) 32

β–ͺ Performance is specific to a particular program(s)

β€’ Total execution time is a consistent summary of the performance

β–ͺ For a given architecture, performance increases come from

β€’ Increases in clock rate (without adverse CPI effects)

β€’ Improvements in processor organization that lower CPI

β€’ Compiler enhancements that lower CPI and/or instruction count

β€’ Algorithm/language choices that affect instruction count

β–ͺ Pitfall: Using a subset of the performance equation as a performance

metric