331 W08.1Spring 2005 14:332:331 Computer Architecture and Assembly Language Fall 2003 Week 8...

331 W08.1 Spring 2005

14:332:331 Computer Architecture and Assembly Language

Fall 2003

Week 8

[Adapted from Dave Patterson’s UCB CS152 slides and

Mary Jane Irwin’s PSU CSE331 slides]

331 W08.2 Spring 2005

Heads Up

This week’s material
- CPU performance
  - Reading assignment – PH 4
- Building a MIPS datapath
  - Reading assignment – PH 5.1-5.2

Next week’s material
- Single cycle datapath implementation
  - Reading assignment – PH 5.3 and C.1 through C.2

331 W08.3 Spring 2005

Performance

Purchasing perspective: given a collection of machines, which has the
- best performance?
- least cost?
- best performance / cost?

Design perspective: faced with design options, which has the
- best performance improvement?
- least cost?
- best performance / cost?

Both require
- a basis for comparison
- a metric for evaluation

Our goal is to understand the cost and performance implications of architectural choices

331 W08.4 Spring 2005

Two notions of “performance”

° Time to do the task (Execution Time)

– execution time, response time, latency

° Tasks per day, hour, week, sec, ns ... (Performance)

– throughput, bandwidth

Response time and throughput often are in opposition

Plane              Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours     470          286,700
BAC/Sud Concorde   1350 mph   3 hours       132          178,200

(pmph = passengers x mph)

Which has higher performance?

331 W08.5 Spring 2005

Definitions

Performance is in units of things-per-second
- bigger is better

If we are primarily concerned with response time

$$\text{performance}(X) = \frac{1}{\text{execution\_time}(X)}$$

"X is n times faster than Y" means

$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{execution\_time}(Y)}{\text{execution\_time}(X)}$$

331 W08.6 Spring 2005

Example

Time of Concorde vs. Boeing 747?

• Concorde is 1350 mph / 610 mph = 2.2 times faster

• = 6.5 hours / 3 hours

Throughput of Concorde vs. Boeing 747?

• Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”

• Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster”

Boeing is 1.6 times (“60%”) faster in terms of throughput

Concorde is 2.2 times (“120%”) faster in terms of flying time

We will focus primarily on execution time for a single job
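The two comparisons above can be reproduced with a short Python sketch. The function and variable names are illustrative only; the figures come from the slide's table.

```python
def times_faster(perf_x: float, perf_y: float) -> float:
    """Return n such that X is n times faster than Y (ratio of performances)."""
    return perf_x / perf_y

# Figures taken from the slide's table.
boeing = {"mph": 610, "hours": 6.5, "passengers": 470}
concorde = {"mph": 1350, "hours": 3.0, "passengers": 132}

def throughput_pmph(plane: dict) -> int:
    """Passenger-miles per hour: passengers times speed."""
    return plane["passengers"] * plane["mph"]

# Response-time view: performance = 1 / execution time.
latency_ratio = times_faster(1 / concorde["hours"], 1 / boeing["hours"])
# Throughput view: performance = passenger-miles per hour.
throughput_ratio = times_faster(throughput_pmph(boeing), throughput_pmph(concorde))

print(f"Concorde vs. 747 (flying time): {latency_ratio:.2f}x")
print(f"747 vs. Concorde (throughput):  {throughput_ratio:.2f}x")
```

The same `times_faster` ratio gives opposite winners depending on which notion of performance is plugged in, which is the point of the slide.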

331 W08.7 Spring 2005

Basis of Evaluation

Actual Target Workload
  Pros: • representative
  Cons: • very specific • non-portable • difficult to run, or measure • hard to identify cause

Full Application Benchmarks
  Pros: • portable • widely used • improvements useful in reality
  Cons: • less representative

Small “Kernel” Benchmarks
  Pros: • easy to run, early in design cycle
  Cons: • easy to “fool”

Microbenchmarks
  Pros: • identify peak capability and potential bottlenecks
  Cons: • “peak” may be a long way from application performance

331 W08.8 Spring 2005

SPEC95

Eighteen application benchmarks (with inputs) reflecting a technical computing workload

Eight integer
- go, m88ksim, gcc, compress, li, ijpeg, perl, vortex

Ten floating-point intensive
- tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5

Must run with standard compiler flags
- eliminates special undocumented incantations that may not even generate working code for real programs

331 W08.9 Spring 2005

Metrics of performance

[Figure: levels of the system (Application, Programming Language, Compiler, ISA, Datapath/Control, Function Units, Transistors/Wires/Pins) annotated with performance metrics]

- Answers per month
- Useful operations per second
- (Millions of) instructions per second – MIPS
- (Millions of) (F.P.) operations per second – MFLOP/s
- Megabytes per second
- Cycles per second (clock rate)

Each metric has a place and a purpose, and each can be misused

331 W08.10 Spring 2005

Aspects of CPU Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

[Table: which of Program, Compiler, Instr. Set Arch., Organization, and Technology affect instr. count, CPI, and clock rate]
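As a rough illustration of the CPU time product, the following Python sketch multiplies the three factors. The function names and the sample numbers are hypothetical, not taken from the slides.

```python
def cpu_time_seconds(instruction_count: int, cpi: float, cycle_time_s: float) -> float:
    """CPU time = instructions/program x cycles/instruction x seconds/cycle."""
    return instruction_count * cpi * cycle_time_s

def cpu_time_from_rate(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """Same product, written with clock rate = 1 / cycle time."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: one million instructions, CPI of 2.0, 10 ns cycle (100 MHz).
print(cpu_time_seconds(1_000_000, 2.0, 10e-9))
print(cpu_time_from_rate(1_000_000, 2.0, 100e6))
```

Both calls describe the same machine, so they should agree up to floating-point rounding.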

331 W08.11 Spring 2005

CPI

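$$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}}$$

($F_i$ is the "instruction frequency")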

Invest Resources where time is Spent!

CPI = (CPU Time * Clock Rate) / Instruction Count = Clock Cycles / Instruction Count

“Average cycles per instruction”

331 W08.12 Spring 2005

Example (RISC processor)

Base Machine (Reg / Reg), Typical Mix

Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        .5       23%
Load     20%    5        1.0      45%
Store    10%    3        .3       14%
Branch   20%    2        .4       18%
                Total    2.2

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

How does this compare with using branch prediction to shave a cycle off the branch time?

What if two ALU instructions could be executed at once?
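One way to work through these three questions is to recompute the weighted CPI for each scenario. The Python sketch below assumes that instruction count and clock rate stay fixed, so speedup is the ratio of CPIs. It also assumes that "two ALU instructions at once" means perfect pairing, i.e. an effective 0.5 cycles per ALU instruction. The helper names are illustrative.

```python
# Instruction mix from the slide: op -> (frequency, cycles)
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}

def average_cpi(mix: dict) -> float:
    """CPI = sum over instruction classes of F_i * CPI_i."""
    return sum(freq * cycles for freq, cycles in mix.values())

def with_cycles(mix: dict, op: str, cycles: float) -> dict:
    """Return a copy of the mix with one class's cycle count changed."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed

base = average_cpi(BASE_MIX)
scenarios = {
    "load in 2 cycles": with_cycles(BASE_MIX, "Load", 2),
    "branch in 1 cycle": with_cycles(BASE_MIX, "Branch", 1),
    # Assumption: perfect pairing, so each ALU op costs half a cycle.
    "two ALU ops at once": with_cycles(BASE_MIX, "ALU", 0.5),
}
for name, mix in scenarios.items():
    cpi = average_cpi(mix)
    print(f"{name}: CPI {cpi:.2f}, speedup {base / cpi:.3f}")
```

Under these assumptions the faster load gives a CPI of 1.6 (speedup 1.375), the cheaper branch gives 2.0 (speedup 1.1), and the paired ALU operations give 1.95 (speedup about 1.13). Loads are where most of the time is spent, so that is where the investment pays off most.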

331 W08.13 Spring 2005

Amdahl's Law

Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected. Then

$$\text{ExTime}(\text{with } E) = \left((1-F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)$$

$$\text{Speedup}(\text{with } E) = \frac{1}{(1-F) + F/S}$$
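Amdahl's Law is easy to experiment with numerically. The following Python sketch is a direct transcription of the speedup formula; the function name and the sample fraction and factor are hypothetical.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of execution time is sped up by `factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Hypothetical enhancement: 40% of the run time is made 10x faster.
print(amdahl_speedup(0.40, 10))

# Even an unboundedly large factor cannot beat 1 / (1 - F).
print(1.0 / (1.0 - 0.40))
```

The second line printed is the ceiling on speedup for F = 0.4. The unaffected 60% of the task bounds the benefit no matter how large S becomes.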

331 W08.14 Spring 2005

Summary: Evaluating Instruction Sets?

Design-time metrics:

° Can it be implemented, in how long, at what cost?

° Can it be programmed? Ease of compilation?

Static Metrics:

° How many bytes does the program occupy in memory?

Dynamic Metrics:

° How many instructions are executed?

° How many bytes does the processor fetch to execute the program?

° How many clocks are required per instruction?

° How "lean" a clock is practical?

Best Metric: Time to execute the program!

NOTE: this depends on the instruction set, processor organization, and compilation techniques.

[Figure: execution time as the product of Inst. Count, CPI, and Cycle Time]

331 W08.15 Spring 2005

Review: Design Principles

Simplicity favors regularity
- fixed size instructions – 32-bits
- only three instruction formats

Good design demands good compromises
- three instruction formats

Smaller is faster
- limited instruction set
- limited number of registers in register file
- limited number of addressing modes

Make the common case fast
- arithmetic operands from the register file (load-store machine)
- allow instructions to contain immediate operands

331 W08.16 Spring 2005

The Processor: Datapath & Control

We're ready to look at an implementation of the MIPS

Simplified to contain only:
- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j

Generic implementation:
- use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
- decode the instruction (and read registers)
- execute the instruction

[Figure: Fetch (PC = PC+4), Decode, Exec]

All instructions (except j) use the ALU after reading the registers

Why? memory-reference? arithmetic? control flow?

331 W08.17 Spring 2005

Abstract Implementation View

Two types of functional units:
- elements that operate on data values (combinational)
- elements that contain state (sequential)

Single cycle operation

Split memory (Harvard) model - one memory for instructions and one for data

[Figure: abstract datapath with PC, Instruction Memory (Address, Instruction), Register File (three Reg Addr inputs, Write Data, two Read Data outputs), ALU, and Data Memory (Address, Write Data, Read Data)]

331 W08.18 Spring 2005

Clocking Methodologies

Clocking methodology defines when signals can be read and when they can be written

[Figure: clock waveform showing the falling (negative) edge, the rising (positive) edge, and the cycle time]

clock rate = 1/(cycle time)
- e.g., 10 nsec cycle time = 100 MHz clock rate
- 1 nsec cycle time = 1 GHz clock rate

State element design choices
- level sensitive latch
- master-slave and edge-triggered flipflops
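The clock rate relation on this slide can be checked with a small Python sketch. Working in nanoseconds and MHz is a convenience chosen here (1 / 1 ns = 1000 MHz), and the function names are illustrative.

```python
def clock_rate_mhz(cycle_time_ns: float) -> float:
    """Clock rate in MHz for a cycle time in nanoseconds (rate = 1 / cycle time)."""
    if cycle_time_ns <= 0:
        raise ValueError("cycle time must be positive")
    return 1000.0 / cycle_time_ns

def cycle_time_ns(rate_mhz: float) -> float:
    """Cycle time in nanoseconds for a clock rate in MHz."""
    if rate_mhz <= 0:
        raise ValueError("clock rate must be positive")
    return 1000.0 / rate_mhz

print(clock_rate_mhz(10))   # the slide's 10 nsec example, in MHz
print(clock_rate_mhz(1))    # the slide's 1 nsec example, in MHz (1000 MHz = 1 GHz)
```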

331 W08.19 Spring 2005

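Review: State Elements

Set-reset latch

[Figure: set-reset latch with inputs R, S and outputs Q, !Q]

R  S  Q(t+1)  !Q(t+1)
1  0  0       1
0  1  1       0
0  0  Q(t)    !Q(t)
1  1  0       0

Level sensitive D latch
- latch is transparent when clock is high (copies input to output)

[Figure: D latch with inputs clock, D and outputs Q, !Q, with a timing diagram of clock, D, and Q]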
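The behavior of the two latches can be modeled in a few lines of Python. This is a functional sketch of the truth table and the transparency rule, not a gate-level model. It assumes the previous state is a valid complementary pair (Q, !Q), and the function names are illustrative.

```python
def sr_latch(r: int, s: int, q: int) -> tuple[int, int]:
    """Next (Q, !Q) of the set-reset latch, following the slide's truth table."""
    if r and s:
        return 0, 0      # both asserted: both outputs forced low
    if r:
        return 0, 1      # reset
    if s:
        return 1, 0      # set
    return q, 1 - q      # hold (assumes the old state was complementary)

def d_latch(clock: int, d: int, q: int) -> int:
    """Level-sensitive D latch: transparent while clock is high, else holds Q."""
    return d if clock else q

# While the clock is high the output follows D; when it is low, Q is held.
q = 0
for clock, d in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    q = d_latch(clock, d, q)
    print(f"clock={clock} D={d} -> Q={q}")
```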

331 W08.20 Spring 2005

Review: State Elements, con’t

Race problem with latch based design …

[Figure: D-latch0 and D-latch1, each with inputs D and clock and outputs Q and !Q, driven by the same clock]

Consider the case when D-latch0 holds a 0 and D-latch1 holds a 1 and you want to transfer the contents of D-latch0 to D-latch1 and vice versa
- must have the clock high long enough for the transfer to take place
- must not leave the clock high so long that the transferred data is copied back into the original latch

Two-sided clock constraint

331 W08.21 Spring 2005

Review: State Elements, con’t

Solution is to use flipflops that change state (Q) only on clock edge (master-slave)
- master (first D-latch) copies the input when the clock is high (the slave (second D-latch) is locked in its memory state and the output does not change)
- slave copies the master when the clock goes low (the master is now locked in its memory state so changes at the input are not loaded into the master D-latch)

One-sided clock constraint
- must have the clock cycle time long enough to accommodate the worst case delay path

[Figure: two D-latches in series forming a master-slave flipflop with inputs D, clock and outputs Q, !Q, with a timing diagram of clock, D, and Q]
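A master-slave flipflop can be sketched as two latch states updated on opposite clock levels. The Python class below is a behavioral model under that assumption; the class and method names are hypothetical.

```python
class MasterSlaveFlipFlop:
    """Two D latches in series; the visible output Q changes only when the clock goes low."""

    def __init__(self) -> None:
        self.master = 0  # state of the first D latch
        self.slave = 0   # state of the second D latch (the visible Q)

    def step(self, clock: int, d: int) -> int:
        """Apply one clock level and input value, and return Q."""
        if clock:
            self.master = d            # master transparent, slave holds
        else:
            self.slave = self.master   # slave copies master, master holds
        return self.slave

ff = MasterSlaveFlipFlop()
for clock, d in [(1, 1), (0, 1), (0, 0), (1, 0), (0, 0)]:
    print(f"clock={clock} D={d} -> Q={ff.step(clock, d)}")
```

In the trace, the change of D while the clock is low (third step) does not reach Q, which is the property that removes the two-sided clock constraint.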

331 W08.22 Spring 2005

Our Implementation

An edge-triggered methodology

Typical execution
- read contents of some state elements
- send values through some combinational logic
- write results to one or more state elements

[Figure: State element 1 feeds Combinational logic, which feeds State element 2; both are driven by the clock, and the path takes one clock cycle]

Assumes state elements are written on every clock cycle; if not, need explicit write control signal
- write occurs only when both the write control is asserted and the clock edge occurs
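The read, compute, write pattern with an optional write control can be sketched in Python as follows. `ClockedRegister` and `swap_on_edge` are hypothetical names for this illustration. The swap shows that reading all outputs before any write avoids the latch race described earlier.

```python
class ClockedRegister:
    """State element that updates only on a clock edge, and only if write is asserted."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def clock_edge(self, data_in: int, write: bool = True) -> int:
        """Model one clock edge: store data_in when the write control is asserted."""
        if write:
            self.value = data_in
        return self.value

def swap_on_edge(reg_a: ClockedRegister, reg_b: ClockedRegister) -> None:
    """One cycle: read both state elements, then write both on the same edge."""
    a, b = reg_a.value, reg_b.value   # read phase (values settle during the cycle)
    reg_a.clock_edge(b)               # write phase happens at the clock edge
    reg_b.clock_edge(a)

r0, r1 = ClockedRegister(0), ClockedRegister(1)
swap_on_edge(r0, r1)
print(r0.value, r1.value)
```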

331 W08.23 Spring 2005

Fetching Instructions

Fetching instructions involves
- reading the instruction from the Instruction Memory
- updating the PC to hold the address of the next instruction

[Figure: the PC drives the Read Address of the Instruction Memory, which outputs the Instruction; an Add unit computes PC + 4 and feeds it back to the PC]

PC is updated every cycle, so it does not need an explicit write control signal

Instruction Memory is read every cycle, so it doesn’t need an explicit read control signal
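The fetch step can be sketched in Python as a function from the current PC to the fetched instruction and the next PC. Modeling the Instruction Memory as a dictionary keyed by byte address is an assumption made for this illustration, as are the two sample instruction words.

```python
def fetch(instruction_memory: dict[int, int], pc: int) -> tuple[int, int]:
    """Return (instruction, next_pc) for one fetch step."""
    instruction = instruction_memory[pc]   # read every cycle, no read control needed
    next_pc = (pc + 4) & 0xFFFFFFFF        # PC + 4, wrapped to 32 bits
    return instruction, next_pc

# Hypothetical two-instruction program: add $3, $1, $2 then lw $2, 4($1).
program = {0: 0x00221820, 4: 0x8C220004}
pc = 0
while pc in program:
    instruction, pc = fetch(program, pc)
    print(f"fetched 0x{instruction:08X}, next PC = {pc}")
```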

331 W08.24 Spring 2005

Decoding Instructions

Decoding instructions involves
- sending the fetched instruction’s opcode and function field bits to the control unit
- reading two values from the Register File
  - Register File addresses are contained in the instruction

[Figure: the Instruction supplies Read Addr 1, Read Addr 2, and Write Addr to the Register File and feeds the Control Unit; the Register File has a Write Data input and Read Data 1 and Read Data 2 outputs]

331 W08.25 Spring 2005

Executing R Format Operations

R format operations (add, sub, slt, and, or)
- perform the indicated (by op and funct) operation on values in rs and rt
- store the result back into the Register File (into location rd)

[Figure: the Instruction addresses the Register File (Read Addr 1, Read Addr 2, Write Addr); Read Data 1 and Read Data 2 feed the ALU (ALU control input; overflow and zero outputs); the ALU result returns to the Register File Write Data; RegWrite controls the write]

Note that Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File

R-type:  | op      | rs      | rt      | rd      | shamt  | funct |
         | 31..26  | 25..21  | 20..16  | 15..11  | 10..6  | 5..0  |
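To make the R-type field layout concrete, the Python sketch below extracts op, rs, rt, rd, shamt, and funct from a 32-bit word and applies the ALU operation selected by funct. The funct values are the standard MIPS encodings for add, sub, and, or, and slt. The function names are illustrative, and the sketch ignores the overflow output.

```python
MASK32 = 0xFFFFFFFF

def decode_r_type(instruction: int) -> dict[str, int]:
    """Split a 32-bit R-type word into its six fields."""
    return {
        "op": (instruction >> 26) & 0x3F,
        "rs": (instruction >> 21) & 0x1F,
        "rt": (instruction >> 16) & 0x1F,
        "rd": (instruction >> 11) & 0x1F,
        "shamt": (instruction >> 6) & 0x1F,
        "funct": instruction & 0x3F,
    }

def to_signed(value: int) -> int:
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

ALU_OPS = {
    0x20: lambda a, b: a + b,                              # add
    0x22: lambda a, b: a - b,                              # sub
    0x24: lambda a, b: a & b,                              # and
    0x25: lambda a, b: a | b,                              # or
    0x2A: lambda a, b: int(to_signed(a) < to_signed(b)),   # slt
}

def execute_r_type(instruction: int, registers: list[int]) -> int:
    """Read rs and rt, apply the funct operation, write rd (RegWrite asserted)."""
    f = decode_r_type(instruction)
    result = ALU_OPS[f["funct"]](registers[f["rs"]], registers[f["rt"]]) & MASK32
    registers[f["rd"]] = result
    return result

regs = [0] * 32
regs[1], regs[2] = 5, 7
print(decode_r_type(0x00221820))         # add $3, $1, $2
print(execute_r_type(0x00221820, regs))  # $3 = $1 + $2
```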

331 W08.26 Spring 2005

Executing Load and Store Operations

Load and store operations
- compute a memory address by adding the base register (in rs) to the 16-bit signed offset field in the instruction
  - base register was read from the Register File during decode
  - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value
- store value, read from the Register File during decode, must be written to the Data Memory
- load value, read from the Data Memory, must be stored in the Register File

I-Type:  | op      | rs      | rt      | address offset |
         | 31..26  | 25..21  | 20..16  | 15..0          |
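The address computation for lw and sw can be sketched in Python as sign extension followed by a 32-bit add. Modeling the Data Memory as a dictionary keyed by word-aligned byte address is an assumption for this illustration, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value: int) -> int:
    """Sign extend a 16-bit field to a 32-bit two's complement bit pattern."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if (value & 0x8000) else value

def effective_address(instruction: int, registers: list[int]) -> int:
    """Base register (rs) plus the sign-extended 16-bit offset, modulo 2^32."""
    rs = (instruction >> 21) & 0x1F
    offset = sign_extend16(instruction & 0xFFFF)
    return (registers[rs] + offset) & MASK32

def load_word(instruction: int, registers: list[int], data_memory: dict[int, int]) -> None:
    """lw: Data Memory value is written to register rt."""
    rt = (instruction >> 16) & 0x1F
    registers[rt] = data_memory[effective_address(instruction, registers)]

def store_word(instruction: int, registers: list[int], data_memory: dict[int, int]) -> None:
    """sw: register rt is written to the Data Memory."""
    rt = (instruction >> 16) & 0x1F
    data_memory[effective_address(instruction, registers)] = registers[rt]

regs = [0] * 32
regs[1] = 100
memory = {104: 42}
load_word(0x8C220004, regs, memory)    # lw $2, 4($1)
store_word(0xAC22FFFC, regs, memory)   # sw $2, -4($1)
print(regs[2], memory)
```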

331 W08.27 Spring 2005

Executing Load and Store Operations, con’t

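[Figure: load/store datapath. The Instruction addresses the Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data; RegWrite). Read Data 1 and the Sign Extend output feed the ALU (ALU control; overflow and zero outputs). The ALU result is the Data Memory Address, Read Data 2 is the Data Memory Write Data, and the Data Memory Read Data returns to the Register File. MemWrite and MemRead control the Data Memory.]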

331 W08.28 Spring 2005

Executing Branch Operations

Branch operations have to
- compare the operands read from the Register File during decode (rs and rt values) for equality (zero ALU output)
- compute the branch target address by adding the updated PC to the sign extended 16-bit signed offset field in the instruction
  - “base register” is the updated PC
  - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value and then shifted left 2 bits to turn the word offset into a byte offset

I-Type:  | op      | rs      | rt      | address offset |
         | 31..26  | 25..21  | 20..16  | 15..0          |
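The branch target computation can be sketched in Python as follows. The equality test is modeled as a subtraction whose result is checked for zero, mirroring the ALU's zero output. The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value: int) -> int:
    """Sign extend a 16-bit field to a 32-bit two's complement bit pattern."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if (value & 0x8000) else value

def branch_target(pc: int, instruction: int) -> int:
    """Updated PC (PC + 4) plus the sign-extended offset shifted left by 2."""
    updated_pc = (pc + 4) & MASK32
    byte_offset = (sign_extend16(instruction & 0xFFFF) << 2) & MASK32
    return (updated_pc + byte_offset) & MASK32

def next_pc_after_beq(pc: int, instruction: int, registers: list[int]) -> int:
    """Take the branch only if rs - rt is zero (the ALU zero output)."""
    rs = (instruction >> 21) & 0x1F
    rt = (instruction >> 16) & 0x1F
    zero = ((registers[rs] - registers[rt]) & MASK32) == 0
    return branch_target(pc, instruction) if zero else (pc + 4) & MASK32

regs = [0] * 32
regs[1] = regs[2] = 9
print(hex(next_pc_after_beq(0x100, 0x10220003, regs)))  # beq $1, $2, +3 words
print(hex(branch_target(0x100, 0x1022FFFE)))            # offset of -2 words
```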

331 W08.29 Spring 2005

Executing Branch Operations, con’t

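[Figure: branch datapath. The Instruction addresses the Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data). Read Data 1 and Read Data 2 feed the ALU (ALU control), whose zero output goes to the branch control logic. The 16-bit offset passes through Sign Extend (16 to 32) and Shift left 2, and an Add unit sums it with PC + 4 to produce the Branch target address.]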

331 W08.30 Spring 2005

Executing Jump Operations

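Jump operations have to
- replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits

[Figure: jump datapath. The PC drives the Read Address of the Instruction Memory and an Add unit computes PC + 4. The low 26 bits of the Instruction pass through Shift left 2 to give 28 bits, which are combined with 4 bits of the PC to form the Jump address.]

J-Type:  | op      | jump target address |
         | 31..26  | 25..0               |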
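The jump address formation can be sketched in Python as a mask, a shift, and a concatenation. The sketch assumes the upper 4 bits come from the updated PC (PC + 4), as in the textbook datapath. The function name and the sample instruction are illustrative.

```python
def jump_target(pc: int, instruction: int) -> int:
    """Upper 4 bits of PC + 4 concatenated with the 26-bit target field shifted left 2."""
    updated_pc = (pc + 4) & 0xFFFFFFFF
    low28 = (instruction & 0x03FFFFFF) << 2   # 26-bit field becomes a 28-bit byte address
    return (updated_pc & 0xF0000000) | low28

# Hypothetical instruction: j with target field 0x400, fetched at PC 0x10000000.
print(hex(jump_target(0x10000000, 0x08000400)))
```

Because only the low 28 bits are replaced, a jump stays within the 256 MB region selected by the upper 4 bits of the PC.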

331 W08.31 Spring 2005

Our Simple Control Structure

We wait for everything to settle down
- ALU might not produce “right answer” right away
- we use write signals along with the clock edge to determine when to write (to the Register File and the Data Memory)

Cycle time determined by length of the longest path

We are ignoring some details like register setup and hold times
