Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10

Page 1: ELEC 5200-001/6200-001 Computer Architecture and Design, Fall 2014

Performance of a Computer (Chapter 4)

Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.eng.auburn.edu/~vagrawal
[email protected]

Page 2: What is Performance?

Response time: the time between the start and completion of a task.

Throughput: the total amount of work done in a given time.

Some performance measures:
– MIPS (million instructions per second).
– MFLOPS (million floating point operations per second), also GFLOPS, TFLOPS (10^12), etc.
– SPEC (Standard Performance Evaluation Corporation) benchmarks.
– LINPACK benchmarks, floating point computing, used for supercomputers.
– Synthetic benchmarks.

Page 3: Small and Large Numbers

Small                       Large
10^-3    milli    m         10^3     kilo     k
10^-6    micro    μ         10^6     mega     M
10^-9    nano     n         10^9     giga     G
10^-12   pico     p         10^12    tera     T
10^-15   femto    f         10^15    peta     P
10^-18   atto               10^18    exa
10^-21   zepto              10^21    zetta
10^-24   yocto              10^24    yotta

Page 4: Computer Memory Size

Number                         Prefix   bits   bytes
2^10    1,024                  K        Kb     KB
2^20    1,048,576              M        Mb     MB
2^30    1,073,741,824          G        Gb     GB
2^40    1,099,511,627,776      T        Tb     TB

Page 5: Units for Measuring Performance

Time in seconds (s), microseconds (μs), nanoseconds (ns), or picoseconds (ps).

Clock cycle:
– Period of the hardware clock
– Example: one clock cycle means 1 nanosecond for a 1 GHz clock frequency (or 1 GHz clock rate)

CPU time = (CPU clock cycles)/(clock rate)

Cycles per instruction (CPI): average number of clock cycles used to execute a computer instruction.

Page 6: Components of Performance

Components of performance    Units
CPU time for a program       Time (seconds, etc.)
Instruction count            Instructions executed by the program
CPI                          Average number of clock cycles per instruction
Clock cycle time             Time period of clock (seconds, etc.)

Page 7: Time, While You Wait, or Pay For

CPU time is the time taken by the CPU to execute the program. It has two components:
– User CPU time is the time to execute the instructions of the program.
– System CPU time is the time used by the operating system to run the program.

Elapsed time (wall clock time) is the time between the start and end of a program.

Page 8: Example: Unix “time” Command

90.7u 12.9s 2:39 65%

– 90.7u: user CPU time in seconds
– 12.9s: system CPU time in seconds
– 2:39: elapsed time in min:sec
– 65%: CPU time as percent of elapsed time

$$\frac{90.7 + 12.9}{159} \times 100 = 65\%$$
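The percentage above can be reproduced with a minimal Python sketch. It assumes the fields have already been read off the `time` output by hand; the function name `cpu_percent` is illustrative and is not part of any Unix tool.

```python
def cpu_percent(user_s: float, system_s: float, elapsed: str) -> float:
    """Return CPU time as a percentage of elapsed (wall clock) time.

    `elapsed` uses the min:sec form shown on the slide, e.g. "2:39".
    """
    minutes, seconds = elapsed.split(":")
    elapsed_s = int(minutes) * 60 + float(seconds)  # 2:39 -> 159 seconds
    return (user_s + system_s) / elapsed_s * 100.0


percent = cpu_percent(90.7, 12.9, "2:39")
print(f"{percent:.0f}%")  # prints "65%"
```

The remaining 35% of the elapsed time is time the program spent waiting (for I/O, for other processes, and so on) rather than executing on the CPU.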

Page 9: Computing CPU Time

$$
\begin{aligned}
\text{CPU time} &= \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time} \\
&= \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \\
&= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{1 second}}{\text{Clock rate}}
\end{aligned}
$$

Page 10: Comparing Computers C1 and C2

Run the same program on C1 and C2. Suppose both computers execute the same number (N) of instructions:

C1: CPI = 2.0, clock cycle time = 1 ns
CPU time(C1) = N × 2.0 × 1 = 2.0N ns

C2: CPI = 1.2, clock cycle time = 2 ns
CPU time(C2) = N × 1.2 × 2 = 2.4N ns

CPU time(C2)/CPU time(C1) = 2.4N/2.0N = 1.2, therefore, C1 is 1.2 times faster than C2.

Result can vary with the choice of program.
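The comparison can be checked numerically with the relation CPU time = Instruction count × CPI × Clock cycle time. The sketch below is only an illustration: `cpu_time_ns` is a hypothetical helper, and N = 1,000,000 is an arbitrary instruction count (the ratio does not depend on it).

```python
def cpu_time_ns(instruction_count: float, cpi: float, cycle_time_ns: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time (in ns)."""
    return instruction_count * cpi * cycle_time_ns


N = 1_000_000  # hypothetical instruction count, the same on both computers

t_c1 = cpu_time_ns(N, cpi=2.0, cycle_time_ns=1.0)  # 2.0N ns
t_c2 = cpu_time_ns(N, cpi=1.2, cycle_time_ns=2.0)  # 2.4N ns

ratio = t_c2 / t_c1
print(f"C1 is {ratio:.1f} times faster than C2")
```

C2 has the lower CPI, but its longer clock cycle more than cancels that advantage, which is why neither CPI nor clock rate alone decides the comparison.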

Page 11: Comparing Program Codes I & II

Code size for a program:
– Code I has 5 million instructions
– Code II has 6 million instructions
– Code I is more efficient. Is it?

Suppose a computer has three types of instructions: A, B and C.

Instr. type   CPI
A             1
B             2
C             3

        Instruction count in million
Code    Type A   Type B   Type C   Total
I       2        1        2        5
II      4        1        1        6

CPU cycles (code I) = 10 million
CPU cycles (code II) = 9 million
Code II is more efficient.

CPI(I) = 10/5 = 2
CPI(II) = 9/6 = 1.5
Code II is more efficient.

Caution: Code size is a misleading indicator of performance.
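The cycle counts and CPI values for the two codes follow directly from the two tables. A short sketch, with illustrative names (`CPI_BY_TYPE`, `CODES`, `total_cycles`, `average_cpi`) that are not from the lecture:

```python
# Cycles per instruction for each instruction type (from the slide).
CPI_BY_TYPE = {"A": 1, "B": 2, "C": 3}

# Instruction counts in millions for each code (from the slide).
CODES = {
    "I": {"A": 2, "B": 1, "C": 2},
    "II": {"A": 4, "B": 1, "C": 1},
}


def total_cycles(mix: dict) -> int:
    """Total CPU cycles (in millions) for an instruction mix."""
    return sum(count * CPI_BY_TYPE[kind] for kind, count in mix.items())


def average_cpi(mix: dict) -> float:
    """Average CPI = total cycles / total instruction count."""
    return total_cycles(mix) / sum(mix.values())


for name, mix in CODES.items():
    print(f"Code {name}: {sum(mix.values())} M instructions, "
          f"{total_cycles(mix)} M cycles, CPI = {average_cpi(mix):.2f}")
```

Code II executes more instructions but fewer cycles, so on the same clock it finishes first.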

Page 12: Rating of a Computer

MIPS: million instructions per second

$$\text{MIPS} = \frac{\text{Instruction count of a program}}{\text{Execution time} \times 10^{6}}$$

MIPS rating of a computer is relative to a program.

Standard programs for performance rating:
– Synthetic benchmarks
– SPEC benchmarks (Standard Performance Evaluation Corporation)

Page 13: Synthetic Benchmark Programs

Artificial programs that emulate a large set of typical “real” programs.

Whetstone benchmark – Algol and Fortran.

Dhrystone benchmark – Ada and C.

Disadvantages:
– No clear agreement on what a typical instruction mix should be.
– Benchmarks do not produce meaningful results.
– Purpose of rating is defeated when compilers are written to optimize the performance rating.

Page 14: Ada

Lady Augusta Ada Byron, Countess of Lovelace (1815-1852), daughter of Lord Byron (the poet who spent some time in a Swiss jail – in Chillon, not too far from Lausanne...). She was the assistant and patron of Charles Babbage; she wrote programs for his “Analytical Engine.”

An original print from its time. http://www.cs.kuleuven.ac.be/~dirk/ada-belgium/pictures.html

Page 15: Misleading Compilers

Consider a computer with a clock rate of 1 GHz. Two compilers produce the following instruction mixes for a program:

              Instruction count (billions)
Code from     Type A   Type B   Type C   Total   CPU clock cycles   CPI    CPU time* (seconds)   MIPS**
Compiler 1    5        1        1        7       10×10^9            1.43   10                    700
Compiler 2    10       1        1        12      15×10^9            1.25   15                    800

Instruction types – A: 1-cycle, B: 2-cycle, C: 3-cycle

* CPU time = CPU clock cycles/clock rate

** MIPS = (Total instruction count/CPU time) × 10^-6
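The table rows can be recomputed from the instruction mixes, which makes the point of the slide explicit: the code from Compiler 2 earns the higher MIPS rating yet takes longer to run. The sketch below uses illustrative names (`MIXES`, `rate_compiler`, `RESULTS`) that are not from the lecture.

```python
CLOCK_RATE_HZ = 1e9                          # 1 GHz clock, as on the slide
CYCLES_PER_TYPE = {"A": 1, "B": 2, "C": 3}   # A: 1-cycle, B: 2-cycle, C: 3-cycle

# Instruction counts per type (the slide lists them in billions).
MIXES = {
    "Compiler 1": {"A": 5e9, "B": 1e9, "C": 1e9},
    "Compiler 2": {"A": 10e9, "B": 1e9, "C": 1e9},
}


def rate_compiler(mix: dict) -> dict:
    """Compute cycles, CPI, CPU time and MIPS for one instruction mix."""
    instructions = sum(mix.values())
    cycles = sum(count * CYCLES_PER_TYPE[kind] for kind, count in mix.items())
    cpu_time_s = cycles / CLOCK_RATE_HZ
    mips = instructions / (cpu_time_s * 1e6)
    return {
        "cycles": cycles,
        "cpi": cycles / instructions,
        "cpu_time_s": cpu_time_s,
        "mips": mips,
    }


RESULTS = {name: rate_compiler(mix) for name, mix in MIXES.items()}
for name, r in RESULTS.items():
    print(f"{name}: CPI = {r['cpi']:.2f}, "
          f"CPU time = {r['cpu_time_s']:.0f} s, MIPS = {r['mips']:.0f}")
```

Compiler 2 adds five billion cheap 1-cycle instructions, which lowers the average CPI and raises MIPS while increasing the total work.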

Page 16: Peak and Relative MIPS Ratings

Peak MIPS:
– Choose an instruction mix to minimize CPI
– The rating can be too high and unrealistic for general programs

Relative MIPS: Use a reference computer system

$$\text{Relative MIPS} = \frac{\text{Time(ref)}}{\text{Time}} \times \text{MIPS(ref)}$$

Historically, VAX-11/780, believed to have a 1 MIPS performance, was used as reference.
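As a small illustration of the relative MIPS formula, the sketch below assumes a 1 MIPS reference machine. The run times (50 s on the reference, 2 s on the machine being rated) are made-up numbers, and `relative_mips` is a hypothetical helper name.

```python
def relative_mips(time_ref_s: float, time_s: float, mips_ref: float = 1.0) -> float:
    """Relative MIPS = (Time(ref) / Time) x MIPS(ref)."""
    return time_ref_s / time_s * mips_ref


# Hypothetical run times of the same program on both machines.
rating = relative_mips(time_ref_s=50.0, time_s=2.0, mips_ref=1.0)
print(f"Relative MIPS = {rating:.0f}")
```

Because the rating is built from measured run times of one program, it no longer depends on how many instructions the rated machine happens to execute.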

Page 17: Wĕbopēdia on MIPS

Acronym for million instructions per second. An old measure of a computer's speed and power, MIPS measures roughly the number of machine instructions that a computer can execute in one second.

In fact, some people jokingly claim that MIPS really stands for Meaningless Indicator of Performance.

Despite these problems, a MIPS rating can give you a general idea of a computer's speed. The IBM PC/XT computer, for example, is rated at ¼ MIPS, while Pentium-based PCs run at over 100 MIPS.

Page 18: A 1994 MIPS Rating Chart

Computer                    MIPS     Price    $/MIPS
1975 IBM mainframe          10       $10M     1M
1976 Cray-1                 160      $20M     125K
1979 DEC VAX                1        $200K    200K
1981 IBM PC                 0.25     $3K      12K
1984 Sun 2                  1        $10K     10K
1994 Pentium PC             66       $3K      46
1995 Sony PCX video game    500      $500     1
1995 Microunity set-top     1,000    $500     0.5

Source: New York Times, April 20, 1994

Page 20: MFLOPS (megaFLOPS)

Only floating point operations are counted:
– Float, real, double; add, subtract, multiply, divide

MFLOPS rating is relevant in scientific computing. For example, programs like a compiler will measure almost 0 MFLOPS.

Sometimes misleading due to different implementations. For example, a computer that does not have a floating-point divide will register many FLOPS for a division.

$$\text{MFLOPS} = \frac{\text{Number of floating-point operations in a program}}{\text{Execution time} \times 10^{6}}$$
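The caveat about floating-point divide can be made concrete with a sketch. All numbers here are hypothetical: a program performing one million divisions, machine X with a hardware divide (one floating-point operation per division), and machine Y that is assumed to emulate each division with ten simpler floating-point operations.

```python
def mflops(flop_count: float, execution_time_s: float) -> float:
    """MFLOPS = floating-point operations / (execution time x 10^6)."""
    return flop_count / (execution_time_s * 1e6)


DIVISIONS = 1_000_000  # hypothetical workload: one million divisions

# Machine X: hardware divide, 1 operation per division, finishes in 0.01 s.
mflops_x = mflops(DIVISIONS * 1, execution_time_s=0.01)

# Machine Y: no hardware divide, 10 operations per division (assumed),
# finishes the same program in 0.05 s.
mflops_y = mflops(DIVISIONS * 10, execution_time_s=0.05)

print(f"X: {mflops_x:.0f} MFLOPS, Y: {mflops_y:.0f} MFLOPS")
```

Under these assumptions machine Y reports twice the MFLOPS of machine X even though it takes five times longer to finish the same program.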

Page 21: Supercomputer Performance

[Figure: supercomputer performance chart, with axis levels marked Megaflops, Gigaflops, Teraflops, Petaflops, and Exaflops. Source: http://en.wikipedia.org/wiki/Supercomputer]

Page 22: Top Supercomputers, June 2012
www.top500.org

Rank   Name                  Location    Cores       Clock GHz   Max. Pflops   Power MW   Eff. Pflops/MJoule
1      Titan/Cray            Oak Ridge   560,640     2.2         27.11         8.21       3.30
2      Sequoia IBM           USA         1,572,864   1.6         16.30         7.89       2.07
3      K computer Fujitsu    Japan       795,024     2.0         10.50         12.66      0.83
4      Mira IBM              USA         786,432     1.6         8.16          3.95       2.07
5      SuperMUC IBM          Germany     147,456     2.7         2.90          3.52       0.82

N. Leavitt, “Big Iron Moves Toward Exascale Computing,” Computer, vol. 45, no. 11, pp. 14-17, Nov. 2012.
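The efficiency column appears to be the maximum performance divided by the power. Since one watt is one joule per second, Pflops per MW is the same quantity as peta floating-point operations per megajoule. The sketch below rechecks the column under that assumption; the names `SYSTEMS` and `efficiency` are illustrative.

```python
# (name, max Pflops, power in MW, efficiency as listed in the table)
SYSTEMS = [
    ("Titan", 27.11, 8.21, 3.30),
    ("Sequoia", 16.30, 7.89, 2.07),
    ("K computer", 10.50, 12.66, 0.83),
    ("Mira", 8.16, 3.95, 2.07),
    ("SuperMUC", 2.90, 3.52, 0.82),
]


def efficiency(pflops: float, power_mw: float) -> float:
    """Performance per unit power: Pflops / MW."""
    return pflops / power_mw


for name, pflops, power_mw, listed in SYSTEMS:
    computed = efficiency(pflops, power_mw)
    print(f"{name:12s} computed {computed:.2f}, listed {listed:.2f}")
```

Read this way, the table shows that the ranking by performance and the ranking by efficiency differ: the K computer is third in performance but, together with SuperMUC, has the lowest efficiency of the five.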

Page 23: The Future

Erik P. DeBenedictis of Sandia National Laboratories theorizes that a zettaflops (10^21, one sextillion FLOPS) computer is required to accomplish full weather modeling, which could cover a two-week time span accurately. Such systems might be built around 2030.

http://en.wikipedia.org/wiki/Supercomputer

Page 24: Performance

Performance is measured for a given program or a set of programs:

$$\text{Av. execution time} = \frac{1}{n} \sum_{i=1}^{n} \text{Execution time}(\text{program } i)$$

or

$$\text{Av. execution time} = \left[ \prod_{i=1}^{n} \text{Execution time}(\text{program } i) \right]^{1/n}$$

Performance is inverse of execution time:

$$\text{Performance} = \frac{1}{\text{Execution time}}$$

Page 25: Geometric vs. Arithmetic Mean

Reference computer times of n programs: r1, . . . , rn

Times of n programs on the computer under evaluation: T1, . . . , Tn

Normalized times: T1/r1, . . . , Tn/rn

$$\text{Geometric mean} = \left\{ \frac{T_1}{r_1} \cdots \frac{T_n}{r_n} \right\}^{1/n} = \frac{\{T_1 \cdots T_n\}^{1/n}}{\{r_1 \cdots r_n\}^{1/n}} \qquad \text{(used)}$$

$$\text{Arithmetic mean} = \frac{1}{n} \left\{ \frac{T_1}{r_1} + \cdots + \frac{T_n}{r_n} \right\} \neq \frac{\{T_1 + \cdots + T_n\}/n}{\{r_1 + \cdots + r_n\}/n} \qquad \text{(not used)}$$

J. E. Smith, “Characterizing Computer Performance with a Single Number,” Comm. ACM, vol. 31, no. 10, pp. 1202-1206, Oct. 1988.
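The property that distinguishes the two means can be checked on a small example: the geometric mean of normalized times equals the ratio of the geometric means, while the same is not true of the arithmetic mean. The times below are made-up values for two programs, and the helper names are illustrative.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))


def arithmetic_mean(values):
    """Sum of n values divided by n."""
    return sum(values) / len(values)


r = [10.0, 100.0]   # hypothetical reference times r1, r2 (seconds)
T = [5.0, 200.0]    # hypothetical measured times T1, T2 (seconds)
normalized = [t / ref for t, ref in zip(T, r)]   # T1/r1, T2/r2

gm_of_ratios = geometric_mean(normalized)
ratio_of_gms = geometric_mean(T) / geometric_mean(r)

am_of_ratios = arithmetic_mean(normalized)
ratio_of_ams = arithmetic_mean(T) / arithmetic_mean(r)

print(f"geometric:  {gm_of_ratios:.4f} vs {ratio_of_gms:.4f}")
print(f"arithmetic: {am_of_ratios:.4f} vs {ratio_of_ams:.4f}")
```

In this example the two geometric figures agree while the two arithmetic figures do not, so a summary based on the geometric mean does not depend on whether normalization is applied before or after averaging.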

Page 26: SPEC Benchmarks

Standard Performance Evaluation Corporation (SPEC)

SPEC89:
– 10 programs
– SPEC performance ratio relative to VAX-11/780
– One program, matrix300, dropped because compilers could be engineered to improve its performance.
– www.spec.org

Page 27: SPEC89 Performance Ratio for IBM Powerstation 550

[Figure: chart of SPEC89 performance ratios (vertical axis 0 to 800) for the benchmarks gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv, with two series: compiler and enhanced compiler.]

Page 28: SPEC95 Benchmarks

Eight integer and ten floating point programs, SPECint95 and SPECfp95.

Each program run time is normalized with respect to the run time of Sun SPARCstation 10/40; the ratio is called the SPEC ratio.

SPECint95 and SPECfp95 summary measurements are the geometric means of SPEC ratios.

Page 29: SPEC CPU2000 Benchmarks

Twelve integer and 14 floating point programs, CINT2000 and CFP2000.

Each program run time is normalized to obtain a SPEC ratio with respect to the run time on Sun Ultra 5_10 with a 300MHz processor.

CINT2000 and CFP2000 summary measurements are the geometric means of SPEC ratios.

Page 30: Reference CPU: Sun Ultra 5_10 300MHz Processor

[Figure: chart for the reference machine (vertical axis 0 to 3500) over the CINT2000 benchmarks gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf and the CFP2000 benchmarks wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]

Page 31: CINT2000: 3.4GHz Pentium 4, HT Technology (D850MD Motherboard)

[Figure: chart of base and optimized SPEC ratios (vertical axis 0 to 2500) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, and twolf.]

SPECint2000_base = 1341
SPECint2000 = 1389

Source: www.spec.org

Page 32: Two Benchmark Results

Baseline: A uniform configuration not optimized for a specific program:
– Same compiler with same settings and flags used for all benchmarks
– Other restrictions

Peak: Run is optimized for obtaining the peak performance for each benchmark program.

Page 33: CINT2000: 1.7GHz Pentium 4 (D850MD Motherboard)

[Figure: chart of base and optimized SPEC ratios (vertical axis 0 to 1000) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, and twolf.]

SPECint2000_base = 579
SPECint2000 = 588

Source: www.spec.org

Page 34: CFP2000: 1.7GHz Pentium 4 (D850MD Motherboard)

[Figure: chart of base and optimized SPEC ratios (vertical axis 0 to 1400) for wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, and apsi.]

SPECfp2000_base = 648
SPECfp2000 = 659

Source: www.spec.org

Page 35: Additional SPEC Benchmarks

SPECweb99: measures the performance of a computer in a networked environment.

Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by:

$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{Power in watts}} = \text{Program units/joule}$$

Page 36: Energy Efficiency

Efficiency averaged on n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

where Efficiency_i is the efficiency for program i.

Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Eff. of reference computer}}$$
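The three energy efficiency definitions (per program, averaged by geometric mean, and relative to a reference computer) can be chained in a short sketch. The execution times and power figures below are invented for illustration, and the function names are not from SPEC or from the lecture.

```python
import math


def energy_efficiency(execution_time_s: float, power_w: float) -> float:
    """(1 / execution time) / power, i.e. program runs per joule."""
    return (1.0 / execution_time_s) / power_w


def mean_efficiency(efficiencies) -> float:
    """Geometric mean of the per-program efficiencies."""
    return math.prod(efficiencies) ** (1.0 / len(efficiencies))


# Hypothetical (execution time in s, power in W) for two benchmark programs.
computer = [(10.0, 50.0), (20.0, 25.0)]
reference = [(20.0, 50.0), (40.0, 50.0)]

eff_computer = mean_efficiency([energy_efficiency(t, p) for t, p in computer])
eff_reference = mean_efficiency([energy_efficiency(t, p) for t, p in reference])

relative_efficiency = eff_computer / eff_reference
print(f"Relative efficiency = {relative_efficiency:.2f}")
```

Since execution time × power is the energy consumed by one run, each per-program efficiency is simply the reciprocal of the energy per run in joules.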

Page 37: SPEC2000 Relative Energy Efficiency

[Figure: chart of relative energy efficiency (vertical axis 0 to 6) on SPECINT2000 and SPECFP2000 for three processors: Pentium [email protected]/0.6GHz (energy-efficient processor), Pentium [email protected] (reference), and Pentium [email protected]; shown for three modes: always max. clock, laptop adaptive clock, and min. power min. clock.]

Page 38: Energy and Time Perspectives

Clock cycle is the unit of computing work.

Cycle rate, f cycles per second:
– f is the rate of doing computing work
– Hardware speed, similar to mph for a car

Cycle efficiency, η cycles per joule:
– η is the computing work per energy unit
– Hardware efficiency, similar to mpg for a car

Results from recent work:
– A. Shinde, “Managing Performance and Efficiency of a Processor,” MEE Project Report, Auburn Univ., Dec. 2012.
– A. Shinde and V. D. Agrawal, “Managing Performance and Efficiency of a Processor,” Proc. 45th IEEE Southeastern Symposium on System Theory, Baylor Univ., TX, March 2013, pp. 59-62.

Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3838

Energy/Cycle for an 8-bit Adder in 90nm CMOS Technology (PTM)

K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Delay of an 8-bit Adder in 90nm CMOS Technology (PTM)

K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Pentium M Processor

Published data: H. Hanson, K. Rajamani, S. Keckler, F. Rawson, S. Ghiasi, J. Rubio, “Thermal Response to DVFS: Analysis with an Intel Pentium M,” Proc. International Symp. Low Power Electronics and Design, 2007, pp. 219-224.

VDD = 1.2 V
Maximum clock rate = 1.8 GHz
Critical path delay, td = 1/(1.8 GHz) = 555.56 ps
Power consumption = 120 W
Energy per cycle, EPC = 120 W/(1.8 GHz) = 66.67 nJ
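The arithmetic on this slide can be reproduced with a short Python sketch. The helper names are hypothetical. The inputs are the slide's 1.8 GHz clock rate and 120 W power, and the cycle efficiency is taken here as the reciprocal of the energy per cycle.

```python
def critical_path_delay_ps(f_hz):
    """Critical path delay td = 1/f, in picoseconds."""
    return 1e12 / f_hz


def energy_per_cycle_nj(power_w, f_hz):
    """Energy per cycle EPC = P/f, in nanojoules."""
    return power_w / f_hz * 1e9


def cycle_efficiency_mcpj(power_w, f_hz):
    """Cycle efficiency = f/P = 1/EPC, in megacycles per joule."""
    return f_hz / power_w / 1e6


f_hz = 1.8e9      # maximum clock rate from the slide
power_w = 120.0   # power consumption from the slide

print(f"td  = {critical_path_delay_ps(f_hz):.2f} ps")
print(f"EPC = {energy_per_cycle_nj(power_w, f_hz):.2f} nJ")
print(f"eta = {cycle_efficiency_mcpj(power_w, f_hz):.1f} megacycles/joule")
```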

Cycle Efficiency and Frequency for Pentium M

Example of Power Management

For a program that executes in 1.8 billion clock cycles:

| Voltage, VDD | Frequency, f (megacycles/s) | Cycle efficiency, η (megacycles/joule) | Execution time (s) | Total energy consumed (J) | Power, f/η (W) |
|---|---|---|---|---|---|
| 1.2 V | 1800 | 15 | 1.0 | 120 | 120 |
| 0.6 V | 277 | 70 | 6.5 | 25 | 3.96 |
| 200 mV | 54.5 | 660 | 33 | 2.72 | 0.083 |
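The columns of this table follow from three relations: execution time = cycles/f, total energy = cycles/η, and power = f/η. The sketch below recomputes them for the three operating points. The function name `operating_point` is hypothetical. Because the frequency and cycle efficiency entries are rounded, the recomputed values can differ slightly from the tabulated ones.

```python
CYCLES = 1.8e9  # program length in clock cycles, from the slide


def operating_point(f_mhz, eta_mcpj, cycles=CYCLES):
    """Return (execution time in s, energy in J, power in W).

    f_mhz    -- clock frequency in megacycles per second
    eta_mcpj -- cycle efficiency in megacycles per joule
    """
    f = f_mhz * 1e6          # cycles per second
    eta = eta_mcpj * 1e6     # cycles per joule
    time_s = cycles / f      # time = work / rate
    energy_j = cycles / eta  # energy = work / efficiency
    power_w = f / eta        # power = energy / time = f / eta
    return time_s, energy_j, power_w


# (VDD label, f in MHz, eta in megacycles/joule) as listed on the slide.
points = [("1.2 V", 1800, 15), ("0.6 V", 277, 70), ("200 mV", 54.5, 660)]

for vdd, f_mhz, eta in points:
    t, e, p = operating_point(f_mhz, eta)
    print(f"{vdd:>6}: time = {t:6.2f} s, energy = {e:7.2f} J, power = {p:7.3f} W")
```

Lowering the voltage trades execution time for energy: the same 1.8 billion cycles take longer but cost fewer joules.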

Ways of Improving Performance

Increase clock rate.

Improve processor organization for lower CPI
  Pipelining
  Instruction-level parallelism (ILP): MIMD (Scalar)
  Data-parallelism: SIMD (Vector)
  Multiprocessing

Compiler enhancements that lower the instruction count or generate instructions with lower average CPI (e.g., by using simpler instructions).

Limits of Performance

Execution time of a program on a computer is 100 s:
  80 s for multiply operations
  20 s for other operations

Make multiply n times faster:

$$\text{Execution time} = \left( \frac{80}{n} + 20 \right) \text{ seconds}$$

Limit: Even if n = ∞, execution time cannot be reduced below 20 s.

Amdahl’s Law

The execution time of a system, in general, has two fractions: a fraction f_enh that can be sped up by a factor n, and the remaining fraction 1 – f_enh that cannot be improved. Thus, the possible speedup is:

$$\text{Speedup} = \frac{\text{Old time}}{\text{New time}} = \frac{1}{1 - f_{enh} + f_{enh}/n}$$

G. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” Proc. AFIPS Spring Joint Computer Conf., Atlantic City, NJ, April 1967, pp. 483-485.

Gene Myron Amdahl, born 1922
http://en.wikipedia.org/wiki/Gene_Amdahl
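As a small worked sketch of Amdahl's law, the Python below applies the speedup formula to the earlier multiply example (100 s total, 80 s of multiplies, so f_enh = 0.8). The function names are hypothetical and the choice of n values is arbitrary.

```python
def amdahl_speedup(f_enh, n):
    """Speedup when a fraction f_enh of the run time is made n times faster."""
    if not 0.0 <= f_enh <= 1.0:
        raise ValueError("f_enh must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 / ((1.0 - f_enh) + f_enh / n)


def speedup_limit(f_enh):
    """Upper bound on speedup as n grows without limit."""
    return float("inf") if f_enh == 1.0 else 1.0 / (1.0 - f_enh)


old_time_s = 100.0  # total run time in the multiply example
f_enh = 0.8         # 80 s of the 100 s are multiply operations

for n in (2, 4, 10, 100):
    s = amdahl_speedup(f_enh, n)
    print(f"n = {n:3d}: speedup = {s:.3f}, new time = {old_time_s / s:.1f} s")

limit = speedup_limit(f_enh)
print(f"limit: speedup = {limit:.1f}, new time = {old_time_s / limit:.1f} s")
```

The new time for each n agrees with 80/n + 20 seconds, and the limiting case corresponds to the 20 s floor noted in the example.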

Wisconsin Integrally Synchronized Computer (WISC), 1950-51

Parallel Processors: Shared Memory

[Diagram: six processors (P) connected to a single shared memory (M).]

Parallel Processors
Shared Memory, Infinite Bandwidth

N processors
Single processor: non-memory execution time = α
Memory access time = 1 – α
N processor run time, T(N) = 1 – α + α/N

$$\text{Speedup} = \frac{T(1)}{T(N)} = \frac{1}{1 - \alpha + \alpha/N} = \frac{N}{(1 - \alpha)N + \alpha}$$

Maximum speedup = 1/(1 – α), when N = ∞

Run Time

[Figure: normalized run time T(N) = 1 – α + α/N versus number of processors N (1 to 7), showing the constant memory component 1 – α and the shrinking non-memory component α/N.]

Speedup

[Figure: speedup T(1)/T(N) versus number of processors N (1 to 6). Curves: ideal speedup N (α = 1) and N/((1 – α)N + α).]

Example

10% memory accesses, i.e., α = 0.9
Maximum speedup = 1/(1 – α) = 1.0/0.1 = 10, when N = ∞
What is the speedup with 10 processors?
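One way to answer the question is to evaluate the infinite-bandwidth model directly. The sketch below is illustrative only. The function names are hypothetical, and it assumes the run-time model T(N) = 1 – α + α/N given for shared memory with infinite bandwidth.

```python
def run_time(alpha, n):
    """Normalized N-processor run time, T(N) = (1 - alpha) + alpha/N."""
    return (1.0 - alpha) + alpha / n


def speedup(alpha, n):
    """Speedup T(1)/T(N) under the infinite-bandwidth shared-memory model."""
    return run_time(alpha, 1) / run_time(alpha, n)


alpha = 0.9  # 10% of single-processor time is memory access

for n in (1, 2, 10, 100, 1000):
    print(f"N = {n:4d}: T(N) = {run_time(alpha, n):.4f}, speedup = {speedup(alpha, n):.3f}")
```

With 10 processors, T(10) = 0.1 + 0.09 = 0.19, so the speedup is 1/0.19, about 5.26, roughly half of the limiting value of 10.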

Parallel Processors
Shared Memory, Finite Bandwidth

N processors
Single processor: non-memory execution time = α
Memory access time = (1 – α)N
N processor run time, T(N) = (1 – α)N + α/N

$$\text{Speedup} = \frac{1}{(1 - \alpha)N + \alpha/N} = \frac{N}{(1 - \alpha)N^2 + \alpha}$$

Run Time

[Figure: normalized run time T(N) = (1 – α)N + α/N versus number of processors N (1 to 7), showing the growing memory component (1 – α)N and the shrinking non-memory component α/N.]

Minimum Run Time

Minimize N processor run time,

$$T(N) = (1 - \alpha)N + \frac{\alpha}{N}$$

$$\frac{\partial T(N)}{\partial N} = 0$$

$$1 - \alpha - \frac{\alpha}{N^2} = 0, \qquad N = \left[ \frac{\alpha}{1 - \alpha} \right]^{1/2}$$

$$\text{Min. } T(N) = 2 \left[ \alpha (1 - \alpha) \right]^{1/2}, \quad \text{because } \frac{\partial^2 T(N)}{\partial N^2} > 0$$

$$\text{Maximum speedup} = \frac{1}{T(N)} = 0.5 \left[ \alpha (1 - \alpha) \right]^{-1/2}$$

Example: α = 0.9
Maximum speedup = 1.67, when N = 3
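The closed-form optimum can be checked numerically. The sketch below is a hypothetical illustration with made-up function names: it evaluates the finite-bandwidth model T(N) = (1 – α)N + α/N, computes the analytic optimum, and compares it with a brute-force search over integer processor counts.

```python
import math


def run_time(alpha, n):
    """Normalized run time with finite memory bandwidth: (1 - alpha)*N + alpha/N."""
    return (1.0 - alpha) * n + alpha / n


def optimal_n(alpha):
    """Processor count that zeroes dT/dN: sqrt(alpha / (1 - alpha))."""
    return math.sqrt(alpha / (1.0 - alpha))


def min_run_time(alpha):
    """Run time at the optimum: 2 * sqrt(alpha * (1 - alpha))."""
    return 2.0 * math.sqrt(alpha * (1.0 - alpha))


alpha = 0.9
n_star = optimal_n(alpha)
t_min = min_run_time(alpha)

# Brute-force check over integer processor counts 1..20.
best_n = min(range(1, 21), key=lambda n: run_time(alpha, n))

print(f"analytic optimum: N = {n_star:.3f}, T = {t_min:.3f}, speedup = {1.0 / t_min:.3f}")
print(f"brute force:      N = {best_n}, T = {run_time(alpha, best_n):.3f}")
```

Since T(1) = 1 in this normalized model, the speedup is simply 1/T(N). Beyond the optimum, adding processors increases the run time because the memory term (1 – α)N grows faster than α/N shrinks.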

Speedup

[Figure: speedup T(1)/T(N) versus number of processors N (1 to 6). Curves: ideal speedup N and N/((1 – α)N² + α).]

Parallel Processors: Distributed Memory

[Diagram: six processors (P), each with its own memory (M), connected through an interconnection network.]

Parallel Processors
Distributed Memory

N processors
Single processor: non-memory execution time = α
Memory access time = 1 – α, same as single processor
Communication overhead = β(N – 1)
N processor run time, T(N) = β(N – 1) + 1/N

Here the non-memory and memory components together contribute α/N + (1 – α)/N = 1/N, since each processor works on its own memory.

$$\text{Speedup} = \frac{1}{\beta(N - 1) + 1/N} = \frac{N}{\beta N(N - 1) + 1}$$

Minimum Run Time

Minimize N processor run time,

$$T(N) = \beta(N - 1) + \frac{1}{N}$$

$$\frac{\partial T(N)}{\partial N} = 0$$

$$\beta - \frac{1}{N^2} = 0, \qquad N = \beta^{-1/2}$$

$$\text{Min. } T(N) = 2\beta^{1/2} - \beta, \quad \text{because } \frac{\partial^2 T(N)}{\partial N^2} > 0$$

$$\text{Maximum speedup} = \frac{1}{T(N)} = \frac{1}{2\beta^{1/2} - \beta}$$

Example: β = 0.01, Maximum speedup:
  N = 10
  T(N) = 0.19
  Speedup = 5.26
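The distributed-memory example can be verified the same way. This is a minimal sketch with hypothetical function names, assuming the model T(N) = β(N – 1) + 1/N. It computes the analytic optimum for β = 0.01 and confirms it with a search over integer processor counts.

```python
import math


def run_time(beta, n):
    """Normalized run time with communication overhead: beta*(N - 1) + 1/N."""
    return beta * (n - 1) + 1.0 / n


def optimal_n(beta):
    """Processor count that zeroes dT/dN: beta ** -0.5."""
    return 1.0 / math.sqrt(beta)


def max_speedup(beta):
    """Speedup at the optimum: 1 / (2*sqrt(beta) - beta)."""
    return 1.0 / (2.0 * math.sqrt(beta) - beta)


beta = 0.01
n_star = optimal_n(beta)

# Brute-force check over integer processor counts 1..100.
best_n = min(range(1, 101), key=lambda n: run_time(beta, n))

print(f"analytic optimum: N = {n_star:.1f}, speedup = {max_speedup(beta):.2f}")
print(f"brute force:      N = {best_n}, T(N) = {run_time(beta, best_n):.2f}")
```

A smaller β pushes the optimum toward more processors, since N grows as β^(-1/2), but the achievable speedup stays well below the ideal value of N.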

Run Time

[Figure: normalized run time T(N) = β(N – 1) + 1/N versus number of processors N (up to 30), showing the shrinking component 1/N and the growing communication component β(N – 1).]

Speedup

[Figure: speedup T(1)/T(N) versus number of processors N (2 to 12). Curves: ideal speedup N and N/(βN(N – 1) + 1).]

Further Reading

G. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” Proc. AFIPS Spring Joint Computer Conf., Atlantic City, NJ, Apr. 1967, pp. 483-485.

J. L. Gustafson, “Reevaluating Amdahl’s Law,” Comm. ACM, vol. 31, no. 5, pp. 532-533, May 1988.

M. D. Hill and M. R. Marty, “Amdahl’s Law in the Multicore Era,” Computer, vol. 41, no. 7, pp. 33-38, July 2008.

D. H. Woo and H.-H. S. Lee, “Extending Amdahl’s Law for Energy-Efficient Computing in the Many-Core Era,” Computer, vol. 41, no. 12, pp. 24-31, Dec. 2008.

S. M. Pieper, J. M. Paul and M. J. Schulte, “A New Era of Performance Evaluation,” Computer, vol. 40, no. 9, pp. 23-30, Sep. 2007.

S. Gal-On and M. Levy, “Measuring Multicore Performance,” Computer, vol. 41, no. 11, pp. 99-102, Nov. 2008.

S. Williams, A. Waterman and D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” Comm. ACM, vol. 52, no. 4, pp. 65-76, Apr. 2009.

U. Vishkin, “Is Multicore Hardware for General-Purpose Parallel Processing Broken?” Comm. ACM, vol. 57, no. 4, pp. 35-39, Apr. 2014.