dapjlhOrg riscv slides ch01hy225/20a/epanalipsi_2_proc.pdf

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface RISC-V Edition Chapter 4 The Processor

Transcript of dapjlhOrg riscv slides ch01hy225/20a/epanalipsi_2_proc.pdf

Page 1:

COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface
RISC-V Edition

Chapter 4: The Processor

Page 2:

[Figure: multi−ported Register File. A Write Address Decoder, enabled by RegWrite, decodes the destination field rd (outputs 00, 01, 10, 11) into the ldEn inputs of the individual 64−bit clocked registers (e.g. register 01 = x1); the 64−bit WriteData is written into the selected register on the clock edge. Two read−port multiplexers, selected by rs1 and rs2, drive the 64−bit outputs ReadData1 and ReadData2.]

Page 3:

Chapter 4 — The Processor — 6

Control & Datapath: single (long) cycle per instruction
Page 4:

Copyright 2020 University of Crete − https://www.csd.uoc.gr/~hy225/20a/copyright.html

Pipelined Datapath & Control Operation, without data or control dependencies, yet

University of Crete

Dept. of Computer Science

CS−225 (HY−225)

Computer Organization

Spring 2020 semester

Slides for §9.3 − 9.5:

§9.3 Pipelined Datapath Operation
§9.4 Control for the Pipelined Datapath
§9.5 Graphical representation: time−work
Page 5:

Pipelined Datapath, Cycle 5

Program:
60: ld  x10, 40(x1)
64: sub x11, x2, x3
68: add x12, x3, x4
72: ld  x13, 48(x1)
76: add x14, x5, x6

[Figure: five−stage datapath (PC and +4 adder; IM; IR and RF with rs1/rs2/rd, we, Din; A, B, Imm pipeline registers; Control with op/f3/f7; ALU; DM with Addr, Din, Dout, we; br/jmp address path), annotated with the operand values (x1 = 100, 700, −100, 400, 300, 130, 14, 48, 80) flowing through the stages during Cycle 5.]

Cycle 5, time−work diagram:
ld  x10, 40(x1): Instr. Fetch | Reg.Rd; Op | ALU | Data Mem. | Write Back
sub x11, x2, x3: Instr. Fetch | Reg.Rd; Op | ALU | Data Mem.
add x12, x3, x4: Instr. Fetch | Reg.Rd; Op | ALU
ld  x13, 48(x1): Instr. Fetch | Reg.Rd; Op
add x14, x5, x6: Instr. Fetch
Page 6:

Pipelined Datapath & Control, Cycle 5

Program:
60: ld  x10, 40(x1)
64: sub x11, x2, x3
68: add x12, x3, x4
72: sd  x13, 48(x1)
76: add x14, x5, x6

[Figure: the same five−stage datapath with its Control signals. Each instruction's destination register (rd) and write−enable travel down the pipeline with it (we = 1 with rd = x10 for the ld, rd = x11 for the sub, rd = x12 for the add; we = 0 for the sd); the per−stage mux−select and write−enable values (0/1) are annotated on the datapath for Cycle 5, with the operand values (x1 = 100, 700, −100, 400, 300, x13 = 130, 14, 48, 76) flowing through the stages.]
Page 7:

Data Dependences (Hazards) in Pipelines

An earlier instruction I1 and a later instruction I2 read or write the same word in some memory or register file:

RAW (Read After Write): a true dependence, as above. I2 needs the new data written by I1, hence it must wait for I1 to write −or at least to generate− the new data.

RAR (Read After Read): not a dependence; the reads can be freely reordered.

WAW (Write After Write): if you want to reorder them, simply abort the write of I1 (if no one reads this word between I1 and I2).

WAR (Write After Read): an 'antidependence'. If you want to do the write (I2) early, just keep a copy of the old data and have I1 read that copy.
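As a minimal sketch (not from the slides; the function name and 'read'/'write' encoding are illustrative assumptions), the four cases above can be written as a lookup over the access types of the earlier instruction I1 and the later instruction I2 touching the same word:

```python
def classify_dependence(i1_access, i2_access):
    """Classify I1 (earlier) vs I2 (later) accesses to the same word;
    each access is either 'read' or 'write'."""
    table = {
        ("write", "read"):  "RAW: true dependence, I2 waits for I1's data",
        ("read",  "read"):  "RAR: not a dependence, reads may be reordered",
        ("write", "write"): "WAW: reorderable by aborting I1's write",
        ("read",  "write"): "WAR: antidependence, keep old copy for I1",
    }
    return table[(i1_access, i2_access)]

# ld x10, 40(x1) followed by sub x11, x10, x3: I1 writes x10, I2 reads it.
print(classify_dependence("write", "read"))  # RAW: true dependence, ...
```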
Page 8:

No Memory Data Hazards in our simple Pipeline

Data Memory accesses are performed 'in−order' in our simple pipeline, i.e. they are not reordered relative to what the program specifies; thus, no dependences of memory word accesses are ever violated.

[Figure: time−work diagram of successive instructions (Instr. Fetch | Reg.Rd; Op | ALU | Data Mem. | Write Back); the Data Mem. stages of successive instructions occur in successive cycles, i.e. in Program Order.]
Page 9:

Register Accesses reordered, with Pipelining

In the initial Program Order, the register accesses (Reg. Read and Reg. Write) of successive instructions occur strictly in sequence. With pipelining, the register accesses of the next 2 or 3 instructions are reordered in time relative to an instruction's Reg. Write, thus causing potential dependence hazards.

For each instruction that writes a destination register, if the next 2 or 3 instructions read that same register, i.e. need its result, we have to do something about it...

[Figure: two time−work diagrams (Instr. Fetch | Reg. Read | ALU | Data Mem. | Reg. Write), one in program order and one pipelined, highlighting the Reg. Read stages of the next 2 or 3 instructions that now precede the earlier instruction's Reg. Write.]
Page 10:

Actual Need−Produce Time vs. from/in−Register Time

ALU Instructions: the output result is actually produced at the ALU stage, earlier than it is written into its 'official' Register position; the inputs are actually needed at the ALU stage, later than they are read from their 'official' position.

Load Instructions (Instr. Fetch | Reg. Read | ALU | Data Mem. | Reg. Write): the actual computation produces the output result at the Data Mem. stage, one stage before it is written; the inputs are needed at the ALU (address) stage, after they are read.

We can 'Bypass' the 'official' loop through the Register File for immediate−use Results. All we care about is actual Results 'Forwarded' from Producer to Consumer instruction.
Page 11:

ALU result to next I; Load result to next−after−next I

From ALU Instructions: an ALU op. result can be forwarded to any subsequent instruction, including the one immediately succeeding it, without it having to wait one extra cycle.

From a Load Instruction: loaded data can NOT be used by the one immediately succeeding instruction without it having to wait one extra cycle.

ALU instructions never stall the pipeline, but Load instructions will do so when immediately followed by a dependent instruction.

[Figure: time diagrams (Instr. Fetch | Reg. Read | ALU | Data Mem. | Reg. Write) showing the ALU result forwarded from the end of the ALU stage into the next instruction's ALU stage, while the load result, available only at the end of the Data Mem. stage, can reach no earlier than the ALU stage of the next−after−next instruction.]
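A minimal sketch of this forwarding rule (the function name and encoding are illustrative, not from the slides): with full forwarding, an ALU result reaches any consumer at distance ≥ 1, while a load result only reaches a consumer at distance ≥ 2:

```python
def needs_stall(producer_op, distance):
    """True when the consumer must wait one cycle: an ALU result forwards
    to any later instruction (distance >= 1); a load's result only reaches
    the next-after-next instruction (distance >= 2)."""
    earliest_use = 2 if producer_op == "ld" else 1
    return distance < earliest_use

print(needs_stall("add", 1))  # False: ALU result forwards to the next I
print(needs_stall("ld", 1))   # True: load-use at distance 1 must wait
print(needs_stall("ld", 2))   # False: load result to next-after-next I
```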
Page 12:

Forwarding Control

[Figure: the pipelined datapath (PC with +4 adder and br/jmp path; IM; IR, Op. Dec., RF with rs1/rs2/rd, we, Din; A, B, Imm pipeline registers; ALU; DM with Addr, Din, Dout, we; write−back) augmented with a Forwarding Control unit. The destination register numbers and write enables of the instructions in the later pipeline stages (rrd3/rwe3, rrd4/rwe4, rrd5/rwe5, and mrd3/mwe3, mrd4/mwe4) are compared against the current instruction's rs1 and rs2; on a match, the fwd.ctrl signals steer the bypass multiplexers at the ALU inputs.]
Page 13:

Distance 1 dependence on LOAD: Wait!

The instruction immediately after a Load wants to use the load'ed data:
60: ld  x10, 40(x1)
64: sub x11, x10, x3
68: add x12, x3, x4
72: sd  x13, 48(x1)
76: add x14, x5, x6

Impossible without losing one cycle: force this instruction (the sub) to wait, i.e. repeat itself on the next cycle. Simple, in−order Pipeline: the next instruction has to wait too!

[Figure: time diagram. The ld reads x1, computes x1 + 40 = 140, reads M[140], and writes x10. Hazard Detect! fires on the sub x10, x3 operands: its register read repeats (Wait/Repeat!), the fetch of 68 repeats, the speculatively started work is aborted (Abort!), and a no−op bubble flows down the pipeline; then x10 − x3 executes with forwarding and writes x11, and add x3, x4 writes x12.]
Page 14:

[Figure: the forwarding datapath augmented with Hazard Detection, shown during the ld/sub example: the ld (rd = x10, address A = 100 plus offset 40) is ahead in the pipeline while sub x11, x10, x3 is at the register read stage. Hazard Detection compares the sub's sources (rs1 = x10, with need.rs1 asserted; need.rs2 for x3) against the load's destination ahead in the pipeline (mrd3 = x10 with mwe3 = 1, alongside rrd3/rwe3, rrd4/rwe4, rrd5/rwe5, mrd4/mwe4); on a match it asserts Wait and deasserts the ldEn (load−enable) inputs of the PC (held at 68) and of the IR, so the sub repeats on the next cycle.]
Page 15:

Instruction Scheduling

This is 'Static' Scheduling, at Compile Time.

Source program:
a = b + c;
e = b − f;

with the variables laid out in memory at offsets from gp: +0: a, +8: b, +16: c, +24: e, +32: f.

Unscheduled code − 2 extra clock cycles lost; two temporary registers (t0, t1) suffice:
ld  t0, 8(gp)
ld  t1, 16(gp)
add t1, t0, t1
sd  t1, 0(gp)
ld  t1, 32(gp)
sub t1, t0, t1
sd  t1, 24(gp)

Scheduled code − no extra clock cycle lost; three temporary registers (t0, t1, t2) needed:
ld  t0, 8(gp)
ld  t1, 16(gp)
ld  t2, 32(gp)
add t1, t0, t1
sd  t1, 0(gp)
sub t1, t0, t2
sd  t1, 24(gp)

The more things you have 'up in the air' (in parallel), the more temporary registers you need in order to 'name' those 'pending' values.

What if the program is?:
a[i] = b + c;
e = b − a[j];

RAW dependence? Does the compiler know for sure if i != j (OK to reorder the sd−ld) or i == j (fwd in reg.)? If unknown to the compiler, static scheduling is impossible => dynamic scheduling at runtime (ooo pipe).
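The 2-cycle difference between the two code versions can be checked with a small stall counter, a sketch under the distance-1 load-use rule from the earlier slides (the (op, dest, sources) triple encoding is an illustrative assumption):

```python
def stalls(prog):
    """Count load-use stall cycles: one whenever an instruction reads a
    register loaded by the immediately preceding instruction."""
    return sum(
        1
        for (op, rd, _), (_, _, srcs) in zip(prog, prog[1:])
        if op == "ld" and rd in srcs
    )

# (op, dest, sources) triples for the two schedules above
unscheduled = [
    ("ld", "t0", ["gp"]), ("ld", "t1", ["gp"]),
    ("add", "t1", ["t0", "t1"]), ("sd", None, ["t1", "gp"]),
    ("ld", "t1", ["gp"]), ("sub", "t1", ["t0", "t1"]),
    ("sd", None, ["t1", "gp"]),
]
scheduled = [
    ("ld", "t0", ["gp"]), ("ld", "t1", ["gp"]), ("ld", "t2", ["gp"]),
    ("add", "t1", ["t0", "t1"]), ("sd", None, ["t1", "gp"]),
    ("sub", "t1", ["t0", "t2"]), ("sd", None, ["t1", "gp"]),
]
print(stalls(unscheduled), stalls(scheduled))  # 2 0
```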
Page 16:

Control Dependences (branch/jump) in Pipelines

Copyright 2020 University of Crete − https://www.csd.uoc.gr/~hy225/20a/copyright.html (slides 1−9); Elsevier (slides 10−11)

'Data Dependence' = next instruction uses data (register/memory) from previous.
'Control Dependence' = which is the next instruction depends on the previous.

Control Dependences arise from 'Control Transfer Instructions (CTI)'. Control Transfer Instructions (CTI) are: Jump and Branch Instructions.
'Jumps' are Unconditional CTI's: they always transfer control.
'Branches' are Conditional CTI's: whether or not they transfer control depends on the result of a data comparison that they have to perform.

Statistics (rough numbers, in a majority of programs, but NOT always so):
− Branches are about 15−16% of all ('dynamically') executed instructions in a program
− about 2/3 of executed branches are 'taken' (successful) = ~10% of all instr.
− about 1/3 of executed branches are not taken (unsuccessful) = ~5% of all instr.
− most backwards branches appear in loops, and they are about 90% taken
− Jumps are about 4−5% of all executed instructions in a program
− procedure calls are about 1%, and returns another ~1%, of all executed instr.
Page 17:

Branch Taken example

36: add ...
40: beq ..., goto 72
44: sd ...
48: and ...
52: or ...
... ...
72: ld ...
76: xor ...

[Figure: time diagram. The add at 36 and the beq at 40 proceed normally (fetch | ALU | DM | WB). While the beq resolves (branch! 2 or more cycles), fetch continues speculatively at PC+4: 44: sd, 48: and, 52: or (the PC reaching 56). This Speculative Execution must be aborted before it causes permanent damage, i.e. before the DM and WB stages. When the branch resolves taken, those instructions become no−ops (Aborted Execution!), and fetching resumes at 72: ld, then 76: xor.]

In our simple pipeline, branch latency is 2 cycles (read registers; compare); with MIPS−style comparisons (beq/bne only) it could even be 1 cycle. In modern processors, branch latency is quite long. The example here has 3−cycle branch latency: each taken branch causes the loss of 3 extra clock cycles. About 2/3 of all executed branches are taken, so this is a heavy loss.
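Plugging in the slides' rough numbers (branches ~15% of executed instructions, about 2/3 of them taken, 3 extra cycles per taken branch in this example), a quick check of how heavy that loss is:

```python
# Cost of taken branches with no prediction, using the slides' numbers.
branch_frac = 0.15   # branches as a fraction of executed instructions
taken_frac = 2 / 3   # fraction of executed branches that are taken
penalty = 3          # extra cycles lost per taken branch in this example

extra_cpi = branch_frac * taken_frac * penalty
print(round(extra_cpi, 2))  # 0.3 extra cycles per instruction
```

A base CPI of 1.0 would grow to about 1.3, a 30% slowdown from taken branches alone.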
Page 18:

Branch Target Buffer (BTB)

A small table − a cache, like a hash table − containing pairs of (instruction) addresses for which there is statistical evidence that their next−PC is something other than PC+4:
− the PC of a jump or branch−likely instruction;
− the Target PC to which this instruction usually went, in the past.

Example contents (for the code above: 36: add; 40: beq ..., goto 72; 44: sd; 48: and; 52: or; ... 72: ld; 76: xor):
(40 → 72), (260 → 200), (88 → 120), (180 → 160), ... ...

This is a 'best approximation' − not necessarily correct information. In parallel with each Fetch, search the fetched instruction's PC value in the BTB. Branches that are believed not−taken are NOT entered into the BTB. Like IM −the Instruction Cache− this will oftentimes 'overflow': old pairs are removed to make room for more recent ones.

May be complemented with a small hardware stack:
− on every call (jal ra, ...), push the return address;
− on every return (jr ra), pop an address and predict jumping to that one.
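A hash-table sketch of the BTB described above (the class name, table size, and modulo indexing are illustrative assumptions; a real BTB is a fixed hardware table probed in parallel with fetch):

```python
class BTB:
    """Branch Target Buffer sketch: a small direct-mapped table of
    (PC, predicted target) pairs; old entries are evicted on conflict."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}  # index -> (pc, predicted target)

    def predict(self, pc):
        """Searched in parallel with fetch: return the predicted next PC
        on a matching entry, else fall through to pc + 4."""
        hit = self.table.get(pc % self.entries)
        return hit[1] if hit and hit[0] == pc else pc + 4

    def update(self, pc, target):
        """Enter a taken branch/jump; believed-not-taken ones stay out."""
        self.table[pc % self.entries] = (pc, target)

btb = BTB()
btb.update(40, 72)      # 40: beq ..., goto 72 was taken
print(btb.predict(40))  # 72: predict taken, fetch the target next
print(btb.predict(44))  # 48: no matching entry, fetch from PC+4
```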
Page 19:

When the BTB prediction is Correct

Code: 36: add ...; 40: beq ..., goto 72; 44: sd ...; 48: and ...; 52: or ...; ... 72: ld ...; 76: xor ...
BTB contents: (40 → 72), (260 → 200), (88 → 120), (180 → 160), ... ...

When a matching BTB entry is found, use its Prediction; else, fetch from PC+4.

[Figure: time diagram. The fetches at 36 and 40 each probe the BTB in parallel (next PC = BTB hit or +4). The probe at 40 hits, so 72: ld and 76: xor are fetched next (Speculative Execution), then 80:, 84:, 88: via +4. The branch resolves taken, as predicted, so execution continues and commits: every instruction flows fetch | op.dec | ALU | DM | WB with no bubbles.]

When Prediction is Correct, NO extra clock cycles are lost!
Page 20:

When the BTB prediction is Wrong

Same code and BTB contents as above. Prediction says: after fetching from 40, fetch from 72. But this time, the branch ends up going the other way: to 44.

[Figure: time diagram. After 40: beq, the pipeline speculatively fetches 72: ld, 76: xor, 80: via the BTB/+4 path. The branch resolves not taken: mispredicted. Abort! Flush the Pipeline!: the speculatively fetched instructions become no−ops before reaching DM and WB, and fetching restarts at 44: sd, then 48: and.]

When Mispredicted, branches cost 3 extra clock cycles in this pipeline.
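With the BTB, only mispredicted branches pay the 3-cycle flush. The branch frequency and flush cost below are from the slides; the 90% prediction accuracy is a hypothetical figure for illustration, not a number the slides give:

```python
# Extra CPI from branch mispredictions in this pipeline.
branch_frac = 0.15       # fraction of executed instructions (slides)
mispredict_rate = 0.10   # assumed: 90% of BTB predictions are correct
flush_penalty = 3        # cycles lost per misprediction (slides)

extra_cpi = branch_frac * mispredict_rate * flush_penalty
print(round(extra_cpi, 3))  # 0.045 extra cycles per instruction
```

Compare with the ~0.3 extra cycles per instruction that taken branches cost without any prediction.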
Page 21:

Chapter 1 — Computer Abstractions and Technology — 28

Relative Performance

Define Performance = 1 / Execution Time. 'X is n times faster than Y' means:

Performance_X / Performance_Y = Execution time_Y / Execution time_X = n

Example: time taken to run a program is 10 s on A, 15 s on B.
Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
So A is 1.5 times faster than B.
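The slide's example, evaluated directly; since Performance = 1 / Execution Time, the performance ratio reduces to the inverse ratio of the times:

```python
# Relative performance with the slide's numbers.
time_a = 10.0  # seconds on machine A
time_b = 15.0  # seconds on machine B

n = time_b / time_a  # = Performance_A / Performance_B
print(n)  # 1.5: A is 1.5 times faster than B
```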

Page 22:

Chapter 1 — Computer Abstractions and Technology — 37

Performance Summary (The BIG Picture)

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

Performance depends on:
− Algorithm: affects IC, possibly CPI
− Programming language: affects IC, CPI
− Compiler: affects IC, CPI
− Instruction set architecture: affects IC, CPI, Tc
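The BIG-picture formula can be evaluated directly; all three factor values below are hypothetical examples for illustration, not from the slides:

```python
# CPU Time = Instructions/Program x Cycles/Instruction x Seconds/Cycle
instructions = 1_000_000   # hypothetical instruction count
cpi = 1.3                  # hypothetical average cycles per instruction
clock_period = 1e-9        # hypothetical 1 ns cycle (1 GHz clock)

cpu_time = instructions * cpi * clock_period
print(round(cpu_time, 6))  # 0.0013 seconds
```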

Page 23:

Chapter 1 — Computer Abstractions and Technology — 35

CPI in More Detail

If different instruction classes take different numbers of cycles:

Clock Cycles = Σ_{i=1..n} (CPI_i × Instruction Count_i)

Weighted average CPI:

CPI = Clock Cycles / Instruction Count = Σ_{i=1..n} (CPI_i × (Instruction Count_i / Instruction Count))

where Instruction Count_i / Instruction Count is the relative frequency of class i.
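The weighted-average formula at work; the instruction mix below is a hypothetical example (the relative frequencies must sum to 1):

```python
# Weighted-average CPI over instruction classes (hypothetical mix).
classes = [
    ("ALU",    1, 0.50),  # (class, CPI_i, relative frequency)
    ("load",   2, 0.20),
    ("store",  2, 0.10),
    ("branch", 3, 0.20),
]

cpi = sum(cpi_i * freq for _, cpi_i, freq in classes)
print(round(cpi, 2))  # 1.7 = 0.5 + 0.4 + 0.2 + 0.6
```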

Page 24:

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 38

Average Access Time (Caches)

Hit time is also important for performance.

Average memory access time (AMAT):
AMAT = Hit time + Miss rate × Miss penalty

Example: CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I−cache miss rate = 5%:
AMAT = 1 + 0.05 × 20 = 2 ns, i.e. 2 cycles per instruction.
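The AMAT example, checked directly with the slide's numbers:

```python
# AMAT = hit time + miss rate x miss penalty, with the slide's numbers.
hit_time = 1        # cycles (= 1 ns with the 1 ns clock)
miss_rate = 0.05    # I-cache miss rate
miss_penalty = 20   # cycles

amat = hit_time + miss_rate * miss_penalty
print(amat)  # 2.0 cycles, i.e. 2 ns per access
```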
Page 25:

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 36

§5.4 Measuring and Improving Cache Performance

Measuring Cache Performance

Components of CPU time:
− Program execution cycles (includes cache hit time)
− Memory stall cycles (mainly from cache misses)

With simplifying assumptions:

Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss penalty
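The second form of the stall-cycle formula, evaluated with hypothetical numbers (only the 20-cycle miss penalty echoes the AMAT slide; the other values are illustrative assumptions):

```python
# Memory stall cycles =
#   Instructions/Program x Misses/Instruction x Miss penalty
instructions = 1_000_000        # hypothetical program size
misses_per_instruction = 0.02   # assumed combined misses per instruction
miss_penalty = 20               # cycles per miss (as in the AMAT example)

stall_cycles = instructions * misses_per_instruction * miss_penalty
print(int(stall_cycles))  # 400000 stall cycles added to execution time
```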