Chapter 7 Digital Design and Computer Architecture, 2 nd Edition Chapter 7 David Money Harris and...

Post on 21-Jan-2016

337 views 21 download

Tags:

Transcript of Chapter 7 Digital Design and Computer Architecture, 2 nd Edition Chapter 7 David Money Harris and...

Chapter 7 <1>

MIC

ROAR

CHIT

ECTU

RE

Digital Design and Computer Architecture, 2nd Edition

Chapter 7

David Money Harris and Sarah L. Harris

Chapter 7 <2>

MIC

ROAR

CHIT

ECTU

RE Chapter 7 :: Topics

• Introduction• Performance Analysis• Single-Cycle Processor• Pipelined Processor• Exceptions• Advanced Microarchitecture

Chapter 7 <3>

MIC

ROAR

CHIT

ECTU

RE• Microarchitecture: the

implementation of an architecture in hardware

• Processor:– Datapath: functional blocks– Control: control signals

Physics

Devices

AnalogCircuits

DigitalCircuits

Logic

Micro-architecture

Architecture

OperatingSystems

ApplicationSoftware

electrons

transistorsdiodes

amplifiersfilters

AND gatesNOT gates

addersmemories

datapathscontrollers

instructionsregisters

device drivers

programs

Introduction

Chapter 7 <4>

MIC

ROAR

CHIT

ECTU

RE• Multiple implementations for a single

architecture:– Single-cycle: Each instruction executes in a

single cycle– Multicycle: Instructions are broken into series of

shorter steps Each instruction executes in n cycles, where n varys according to the instr.

– Pipelined: Each instruction broken up into series of steps & multiple instructions execute at once (Note: AMD and Intel pipelines are different, for the same IA-32 architecture (a.k.a. x86 ISA)

Microarchitecture

Chapter 7 <5>

MIC

ROAR

CHIT

ECTU

RE• Program execution timeExecution Time = (#instructions)(cycles/instruction)(seconds/cycle)

• Definitions:– IC: Instruction Count (= #instructions)– CPI: Cycles/Instruction– clock period: seconds/cycle– IPC: Instructions/Cycle (= 1/CPI)

• Challenge is to satisfy constraints of:– Cost– Power– Performance

Processor Performance

Chapter 7 <6>

MIC

ROAR

CHIT

ECTU

RE• Consider subset of MIPS instructions:

– R-type instructions: and, or, add, sub, slt– Memory instructions: lw, sw– Branch instructions: beq

MIPS Processor

Chapter 7 <7>

MIC

ROAR

CHIT

ECTU

RE• Determines everything about a processor:

– PC and special registers– Register File– Memory

Architectural State

Chapter 7 <8>

MIC

ROAR

CHIT

ECTU

RE

CLK

A RD

InstructionMemory

A1

A3

WD3

RD2

RD1WE3

A2

CLK

RegisterFile

A RD

DataMemory

WD

WEPCPC'

CLK

32 3232 32

32

32

32 32

32

32

5

5

5

MIPS State Elements

Plus the HI and LO registers

Chapter 7 <9>

MIC

ROAR

CHIT

ECTU

RE• Datapath—design it 1st, to make the

instruction actions possible• Control—design it 2nd, to make them

happen

Single-Cycle MIPS Processor

Chapter 7 <10>

MIC

ROAR

CHIT

ECTU

RESTEP 1: Fetch instruction

IM[PC]

CLK

A RD

InstructionMemory

A1

A3

WD3

RD2

RD1WE3

A2

CLK

RegisterFile

A RD

DataMemory

WD

WEPCPC'

Instr

CLK

Single-Cycle Datapath: lw fetch

Chapter 7 <11>

MIC

ROAR

CHIT

ECTU

RESTEP 2: Read source operands from RF

RF[rs] or RF[Instr(25:21)]

Instr

CLK

A RD

InstructionMemory

A1

A3

WD3

RD2

RD1WE3

A2

CLK

RegisterFile

A RD

DataMemory

WD

WEPCPC'

25:21

CLK

Single-Cycle Datapath: lw Register Read

Chapter 7 <12>

MIC

ROAR

CHIT

ECTU

RESTEP 3: Sign-extend the immediate SignExt(immed)

SignImm

CLK

A RD

InstructionMemory

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

A RD

DataMemory

WD

WEPCPC' Instr

25:21

15:0

CLK

Single-Cycle Datapath: lw Immediate

Chapter 7 <13>

MIC

ROAR

CHIT

ECTU

RESTEP 4: Compute the memory address

addr = RF[rs] + SignExt(immed)

SignImm

CLK

A RD

InstructionMemory

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

A RD

DataMemory

WD

WEPCPC' Instr

25:21

15:0

SrcB

ALUResult

SrcA Zero

CLK

ALUControl2:0

ALU

010

Single-Cycle Datapath: lw address

Chapter 7 <14>

MIC

ROAR

CHIT

ECTU

RE• STEP 5: Read data from memory and write

it back to register file: RF[rt] DM[addr]

A1

A3

WD3

RD2

RD1WE3

A2

SignImm

CLK

A RD

InstructionMemory

CLK

Sign Extend

RegisterFile

A RD

DataMemory

WD

WEPCPC' Instr

25:21

15:0

SrcB20:16

ALUResult ReadData

SrcA

RegWrite

Zero

CLK

ALUControl2:0

ALU

0101

Single-Cycle Datapath: lw Memory Read

Chapter 7 <15>

MIC

ROAR

CHIT

ECTU

RESTEP 6: Determine address of next instruction PC PC + 4

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

A RD

DataMemory

WD

WEPCPC' Instr

25:21

15:0

SrcB20:16

ALUResult ReadData

SrcA

PCPlus4

Result

RegWrite

Zero

CLK

ALUControl2:0

ALU

0101

Single-Cycle Datapath: lw PC Increment

Chapter 7 <16>

MIC

ROAR

CHIT

ECTU

REIM[PC]RF[rt] DM[RF[rs] + SignExt(immed)]PC PC + 4

Full RTL Expression for lw

Chapter 7 <17>

MIC

ROAR

CHIT

ECTU

REWrite data in rt to memory: DM[addr]RF[rt]

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

A RD

DataMemory

WD

WEPCPC' Instr

25:21

20:16

15:0

SrcB20:16

ALUResult ReadData

WriteData

SrcA

PCPlus4

Result

MemWriteRegWrite

Zero

CLK

ALUControl2:0

ALU

10100

Single-Cycle Datapath: sw

Chapter 7 <18>

MIC

ROAR

CHIT

ECTU

RE• Read from rs and rt• Write ALUResult to register file• Write to rd (instead of rt) RF[rd] RF[rs] op RF[rt]

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCPC' Instr25:21

20:16

15:0

SrcB

20:16

15:11

ALUResult ReadData

WriteData

SrcA

PCPlus4WriteReg4:0

Result

RegDst MemWrite MemtoRegALUSrcRegWrite

Zero

CLK

ALUControl2:0

ALU

0varies1 001

Single-Cycle Datapath: R-Type

Chapter 7 <19>

MIC

ROAR

CHIT

ECTU

RE• Determine whether values in rs and rt are equal• Calculate branch target address: BTA = PC + 4 + SignExt(immed)<< 2 # <<2 = 4x

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1

PC' Instr25:21

20:16

15:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

RegDst Branch MemWrite MemtoRegALUSrcRegWrite

Zero

PCSrc

CLK

ALUControl2:0

ALU

01100 x0x 1

Single-Cycle Datapath: beq

Chapter 7 <20>

MIC

ROAR

CHIT

ECTU

REIM[PC]if (RF[rs] - RF[rt] == 0) PC BTAelse PC PC + 4

RTL Expression for beq

Chapter 7 <21>

MIC

ROAR

CHIT

ECTU

RE

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

Branch

MemWrite

MemtoReg

ALUSrc

RegWrite

Op

Funct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

Single-Cycle Processor

Chapter 7 <22>

MIC

ROAR

CHIT

ECTU

RE

RegDst

Branch

MemWrite

MemtoReg

ALUSrcOpcode5:0

ControlUnit

ALUControl2:0Funct5:0

MainDecoder

ALUOp1:0

ALUDecoder

RegWrite

Single-Cycle Control

Chapter 7 <23>

MIC

ROAR

CHIT

ECTU

RE

ALU

N N

N

3

A B

Y

F

F2:0 Function

000 A & B

001 A | B

010 A + B

011 not used

100 A & ~B

101 A | ~B

110 A - B

111 SLT

Review: ALU

Chapter 7 <24>

MIC

ROAR

CHIT

ECTU

RE

+

2 01

A B

Cout

Y

3

01

F2

F1:0

[N-1] S

NN

N

N

N NNN

N

2Z

ero

Extend

Review: ALU

Chapter 7 <25>

MIC

ROAR

CHIT

ECTU

REALUOp1:0 Meaning

00 Add (for lw, sw)

01 Subtract (for beq)

10 Look at funct (R-type)

11 Not Used

ALUOp1:0 funct ALUControl2:0

00 X 010 (Add)

X1 X 110 (Subtract)

1X 100000 (add) 010 (Add)

1X 100010 (sub) 110 (Subtract)

1X 100100 (and) 000 (And)

1X 100101 (or) 001 (Or)

1X 101010 (slt) 111 (SLT)

Control Unit: ALU Decoder

Chapter 7 <26>

MIC

ROAR

CHIT

ECTU

REInstruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0

R-type 000000

lw 100011

sw 101011

beq 000100

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

Branch

MemWrite

MemtoReg

ALUSrc

RegWrite

Op

Funct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

Control Unit: Main Decoder

Chapter 7 <27>

MIC

ROAR

CHIT

ECTU

RE

Instruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0

R-type 000000 1 1 0 0 0 0 10lw 100011 1 0 1 0 0 1 00sw 101011 0 X 1 0 1 X 00beq 000100 0 X 0 1 0 X 01

Control Unit: Main Decoder

Chapter 7 <28>

MIC

ROAR

CHIT

ECTU

RE

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

Branch

MemWrite

MemtoReg

ALUSrc

RegWrite

Op

Funct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

0010

01

0

0

1

0

Single-Cycle Datapath: or

Chapter 7 <29>

MIC

ROAR

CHIT

ECTU

RE

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

Branch

MemWrite

MemtoReg

ALUSrc

RegWrite

Op

Funct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

No change to datapath

Extended Functionality: addi

Chapter 7 <30>

MIC

ROAR

CHIT

ECTU

REInstruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0

R-type 000000 1 1 0 0 0 0 10

lw 100011 1 0 1 0 0 1 00

sw 101011 0 X 1 0 1 X 00

beq 000100 0 X 0 1 0 X 01

addi 001000

Main Decoder table: addi

Chapter 7 <31>

MIC

ROAR

CHIT

ECTU

REInstruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0

R-type 000000 1 1 0 0 0 0 10

lw 100011 1 0 1 0 0 1 00

sw 101011 0 X 1 0 1 X 00

beq 000100 0 X 0 1 0 X 01

addi 001000 1 0 1 0 0 0 00

Main Decoder table: addi

Chapter 7 <32>

MIC

ROAR

CHIT

ECTU

RE

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC'

Instr25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

Branch

MemWrite

MemtoReg

ALUSrc

RegWrite

Op

Funct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

0

1

25:0 <<2

27:0 31:28

PCJump

Jump

Extended Functionality: j

Chapter 7 <33>

MIC

ROAR

CHIT

ECTU

RE

Instruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0 Jump

R-type 000000 1 1 0 0 0 0 10 0

lw 100011 1 0 1 0 0 1 00 0

sw 101011 0 X 1 0 1 X 00 0

beq 000100 0 X 0 1 0 X 01 0

j 000010

Main Decoder table: j

Chapter 7 <34>

MIC

ROAR

CHIT

ECTU

RE

Instruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0 Jump

R-type 000000 1 1 0 0 0 0 10 0

lw 100011 1 0 1 0 0 1 00 0

sw 101011 0 X 1 0 1 X 00 0

beq 000100 0 X 0 1 0 X 01 0

j 000010 0 X X X 0 X XX 1

Main Decoder table: j

Chapter 7 <35>

MIC

ROAR

CHIT

ECTU

RE

Program Execution Time = (#instructions)(cycles/instruction)(seconds/cycle) = IC x CPI x TC

Review: Processor Performance

Chapter 7 <36>

MIC

ROAR

CHIT

ECTU

RE

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

Branch

MemWrite

MemtoReg

ALUSrc

RegWrite

Op

Funct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU1

0100

1

0

1

0 0

TC limited by critical path (lw)

Single-Cycle Performance

Chapter 7 <37>

MIC

ROAR

CHIT

ECTU

RE• Single-cycle critical path: Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem

+ tmux + tRFsetup

• Typically, limiting paths are: – memory, ALU, register file – Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup

Single-Cycle Performance

Chapter 7 <38>

MIC

ROAR

CHIT

ECTU

REElement Parameter Delay (ps)Register clock-to-Q tpcq_PC 30

Register setup tsetup 20

Multiplexer tmux 25

ALU tALU 200

Memory read tmem 250

Register file read tRFread 150

Register file setup tRFsetup 20

Tc = ?

Single-Cycle Performance Example

Chapter 7 <39>

MIC

ROAR

CHIT

ECTU

REElement Parameter Delay (ps)Register clock-to-Q tpcq_PC 30

Register setup tsetup 20

Multiplexer tmux 25

ALU tALU 200

Memory read tmem 250

Register file read tRFread 150

Register file setup tRFsetup 20

Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup

= [30 + 2(250) + 150 + 25 + 200 + 20] ps = 925 ps [fclk = 1/0.925 GHz = 1.08 GHz]

Single-Cycle Performance Example

Chapter 7 <40>

MIC

ROAR

CHIT

ECTU

REProgram with IC = 100 billion instructions:

Execution Time = IC x CPI x TC

= (100 × 109)(1)(925 × 10-12 s) = 92.5 seconds

Single-Cycle Performance Example

Chapter 7 <41>

MIC

ROAR

CHIT

ECTU

REPros and cons of single-cycle implementation: + simple design + 1 cycle per every instruction - slow cycle time

limited by longest instruction (lw) - HW: 2 adders + ALU; 2 memories

Evaluation of Single-Cycle Processor

Chapter 7 <42>

MIC

ROAR

CHIT

ECTU

RE

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

Branch

MemWrite

MemtoReg

ALUSrc

RegWrite

Op

Funct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

0

1

25:0 <<2

27:0 31:28

PCJump

Jump

Review: Single-Cycle Processor

Chapter 7 <43>

MIC

ROAR

CHIT

ECTU

RE//------------------------------------------------// David_Harris@hmc.edu 9 November 2005// Top level system including MIPS and memories//------------------------------------------------

module top (input clk, reset, output [31:0] writedata, dataadr, output memwrite);

wire [31:0] pc, instr, readdata;

// instantiate processor and memories mips mips (clk, reset, pc, instr, memwrite, dataadr, writedata, readdata); imem imem (pc[7:2], instr); dmem dmem (clk, memwrite, dataadr, writedata, readdata);

endmodule

Verilog Model

Chapter 7 <44>

MIC

ROAR

CHIT

ECTU

RE//------------------------------------------------// David_Harris@hmc.edu 23 October 2005// External data memory used by MIPS single-cycle processor//------------------------------------------------module dmem (input clk, we, input [31:0] a, wd, output [31:0] rd);

reg [31:0] RAM[63:0]; assign rd = RAM[a[31:2]]; // word-aligned read

always @(posedge clk) if (we) RAM[a[31:2]] <= wd; // word-aligned writeendmodule

Verilog Model of Data Memory

Chapter 7 <45>

MIC

ROAR

CHIT

ECTU

REmodule imem (input [5:0] addr, output reg [31:0] instr);

// imem is modeled as a lookup table, a stored-program byte-addressable ROMalways@(addr) case ({addr, 2'b00})

// address instruction// --------- --------------

8'h00: instr = 32'h20020005;8'h04: instr = 32'h2003000c;8'h08: instr = 32'h2067fff7;8'h0c: instr = 32'h00e22025;8'h10: instr = 32'h00642824;8'h14: instr = 32'h00a42820;8'h18: instr = 32'h10a7000a;8'h1c: instr = 32'h0064202a;8'h20: instr = 32'h10800001;

default: instr = {32{1'bx}}; // unknown instruction endcase

endmodule

Verilog Model of Instr. Memory

Chapter 7 <46>

MIC

ROAR

CHIT

ECTU

REmodule imem (input [5:0] addr, output [31:0] instr);

reg [31:0] RAM[63:0];

// imem is RAM, loaded from memfile.dat file with hex values at startup initial begin $readmemh("memfile.dat", RAM); end

assign instr = RAM[addr]; // instr at RAM[addr] is read out

endmodule

// imem can be created with CoreGen for Xilinx synthesis

Alternate Model of Instr. Memory

Chapter 7 <47>

MIC

ROAR

CHIT

ECTU

RE// single-cycle MIPS processormodule mips (input clk, reset, output [31:0] pc, input [31:0] instr, output memwrite, output [31:0] aluout, writedata, input [31:0] readdata);

wire memtoreg, pcsrc, zero, alusrc, regdst, regwrite, jump; wire [2:0] alucontrol;

controller c (instr[31:26], instr[5:0], zero, memtoreg, memwrite, pcsrc, alusrc, regdst, regwrite, jump, alucontrol);

datapath dp (clk, reset, memtoreg, pcsrc, alusrc, regdst, regwrite, jump, alucontrol, zero, pc, instr, aluout, writedata, readdata);

endmodule

Verilog Model of MIPS Processor

Chapter 7 <48>

MIC

ROAR

CHIT

ECTU

REmodule controller ( input [5:0] op, funct, input zero, output memtoreg, memwrite, output pcsrc, alusrc, output regdst, regwrite, output jump, output [2:0] alucontrol);

wire [1:0] aluop; wire branch;

maindec md (op, regwrite, regdst, alusrc, branch, memwrite, memtoreg, aluop, jump);

aludec ad (funct, aluop, alucontrol);

assign pcsrc = branch & zero;

endmodule

Verilog Model of Controller

Chapter 7 <49>

MIC

ROAR

CHIT

ECTU

REmodule maindec (input [5:0] op, output regwrite, regdst, alusrc, branch,

output memwrite, memtoreg, output [1:0] aluop, output jump); reg [8:0] controls;

assign {regwrite, regdst, alusrc, branch, memwrite, memtoreg, aluop, jump} = controls;

always @(*) case(op) 6'b000000: controls = 9'b110000100; //Rtype 6'b100011: controls = 9'b101001000; //LW 6'b101011: controls = 9'b001010000; //SW 6'b000100: controls = 9'b000100010; //BEQ 6'b001000: controls = 9'b101000000; //ADDI 6'b000010: controls = 9'b000000001; //J default: controls = 9'bxxxxxxxxx; //??? endcaseendmodule

Verilog Model of Main Decoder

Chapter 7 <50>

MIC

ROAR

CHIT

ECTU

REmodule aludec (input [5:0] funct, input [1:0] aluop, output reg [2:0] alucontrol); always @(*) case(aluop) 2'b00: alucontrol = 3'b010; // add 2'b01: alucontrol = 3'b110; // sub default: case(funct) // RTYPE 6'b100000: alucontrol = 3'b010; // ADD 6'b100010: alucontrol = 3'b110; // SUB 6'b100100: alucontrol = 3'b000; // AND 6'b100101: alucontrol = 3'b001; // OR 6'b101010: alucontrol = 3'b111; // SLT default: alucontrol = 3'bxxx; // ??? endcase endcaseendmodule

Verilog Model of ALU Decoder

Chapter 7 <51>

MIC

ROAR

CHIT

ECTU

REmodule datapath (input clk, reset, memtoreg, pcsrc, alusrc, regdst, input regwrite, jump, input [2:0] alucontrol, output zero, output [31:0] pc, input [31:0] instr, output [31:0] aluout, writedata, input [31:0] readdata);

wire [4:0] writereg; wire [31:0] pcnext, pcnextbr, pcplus4, pcbranch; wire [31:0] signimm, signimmsh, srca, srcb, result; // next PC logic flopr #(32) pcreg(clk, reset, pcnext, pc); adder pcadd1(pc, 32'b100, pcplus4); sl2 immsh(signimm, signimmsh); adder pcadd2(pcplus4, signimmsh, pcbranch); mux2 #(32) pcbrmux(pcplus4, pcbranch, pcsrc, pcnextbr); mux2 #(32) pcmux(pcnextbr, {pcplus4[31:28], instr[25:0], 2'b00}, jump, pcnext);

Verilog Model of Datapath

Chapter 7 <52>

MIC

ROAR

CHIT

ECTU

RE// register file logic regfile rf (clk, regwrite, instr[25:21], instr[20:16], writereg, result, srca, writedata);

mux2 #(5) wrmux (instr[20:16], instr[15:11], regdst, writereg); mux2 #(32) resmux (aluout, readdata, memtoreg, result); signext se (instr[15:0], signimm);

// ALU logic mux2 #(32) srcbmux (writedata, signimm, alusrc, srcb); alu alu (srca, srcb, alucontrol, aluout, zero);

endmodule

Verilog Model of Datapath (con’t)

Chapter 7 <53>

MIC

ROAR

CHIT

ECTU

REmodule regfile (input clk, we3, input [4:0] ra1, ra2, wa3, input [31:0] wd3, output [31:0] rd1, rd2);

reg [31:0] rf [31:0];

// three ported register file: read two ports combinationally // write third port on rising edge of clock. Register 0 hardwired to 0

always @(posedge clk) if (we3) rf [wa3] <= wd3;

assign rd1 = (ra1 != 0) ? rf [ra1] : 0; assign rd2 = (ra2 != 0) ? rf[ ra2] : 0;

endmodule

Verilog Model of Register File

Chapter 7 <54>

MIC

ROAR

CHIT

ECTU

REmodule adder (input [31:0] a, b, output [31:0] y); assign y = a + b;endmodule

module sl2 (input [31:0] a, output [31:0] y);// shift left by 2 assign y = {a[29:0], 2'b00}; endmodule

module signext (input [15:0] a, output [31:0] y); assign y = {{16{a[15]}}, a};endmodule

Verilog Models of Other Parts

Chapter 7 <55>

MIC

ROAR

CHIT

ECTU

REmodule flopr #(parameter WIDTH = 8) (input clk, reset, input [WIDTH-1:0] d, output reg [WIDTH-1:0] q); always @(posedge clk, posedge reset) if (reset) q <= 0; else q <= d;endmodule

module flopenr #(parameter WIDTH = 8) (input clk, reset, en, input [WIDTH-1:0] d, output reg [WIDTH-1:0] q); always @(posedge clk, posedge reset) if (reset) q <= 0; else if (en) q <= d;endmodule

module mux2 #(parameter WIDTH = 8) (input [WIDTH-1:0] d0, d1, input s, output [WIDTH-1:0] y); assign y = s ? d1 : d0; endmodule

Verilog for Parameterized Parts

Chapter 7 <56>

MIC

ROAR

CHIT

ECTU

RE• Unscheduled function call to exception handler• Caused by:

– Hardware, also called an interrupt, e.g. keyboard– Software, also called traps, e.g. undefined instruction

• When exception occurs, the processor:– Records cause of exception (Cause register)– Jumps to exception handler (0x80000180)– Returns to program (EPC register)

Review: Exceptions

Chapter 7 <57>

MIC

ROAR

CHIT

ECTU

RE Example Exception

Chapter 7 <58>

MIC

ROAR

CHIT

ECTU

RE• Not part of register file; in Coprocessor 0

– Cause• Records cause of exception• Coprocessor 0 register 13

– EPC (Exception PC)• Records PC where exception occurred• Coprocessor 0 register 14

• Move from Coprocessor 0– mfc0 $t0, Cause (=mfc0 $t0,$13)– Moves contents of Cause into $t0

00000 $t0 (8) Cause (13) 00000000000

mfc0

31:26 25:21 20:16 15:11 10:0

010000

Review: Exception Registers

Chapter 7 <59>

MIC

ROAR

CHIT

ECTU

REException Cause

Hardware Interrupt 0x00000000

System Call 0x00000020

Breakpoint / Divide by 0 0x00000024

Undefined Instruction 0x00000028

Arithmetic Overflow 0x00000030

Extend single-cycle MIPS processor to handle last two types of exceptions

Review: Exception Causes

Chapter 7 <60>

MIC

ROAR

CHIT

ECTU

RE Exception RTLs

Undefined InstructionIM[PC]. . . # problem in decoding (bad op or func) Cause 40 # = 0x28EPC PCPC 0x80000180 #Exception handler address

Arithmetic OverflowIM[PC]. . . # ALU operation overflowsCause 48 # = 0x30EPC PCPC 0x80000180 #Exception handler address

mfc0 instruction (e.g. mfc0 $t1, $13)IM[PC]RF[rt] RFc0[rd]PC PC + 4

Chapter 7 <61>

MIC

ROAR

CHIT

ECTU

RE

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1PC0

1

PC' Instr25:21

20:16

15:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

RegDst BranchMemWrite MemtoReg ALUSrcARegWrite

Zero

PCSrc1:0

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

0

1Data

CLK

CLK

A

B00

01

10

11

4

CLK

ENEN

ALUSrcB1:0IRWriteIorD PCWrite

PCEn

<<2

25:0 (jump)

31:28

27:0

PCJump

00

01

10

11

0x8000 0180

Overflow

CLK

EN

EPCWrite

CLK

EN

CauseWrite

0

1

IntCause

0x30

0x28EPC

Cause

Exception Hardware: EPC & Cause

Never mind the multi-cycle datapath, focus on the exception hardware.

Chapter 7 <62>

MIC

ROAR

CHIT

ECTU

RE

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1PC0

1

PC' Instr25:21

20:16

15:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

RegDst BranchMemWrite MemtoReg1:0 ALUSrcARegWrite

Zero

PCSrc1:0

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

0001

Data

CLK

CLK

A

B00

01

10

11

4

CLK

ENEN

ALUSrcB1:0IRWriteIorD PCWrite

PCEn

<<2

25:0 (jump)

31:28

27:0

PCJump

00

01

10

11

0x8000 0180

CLK

EN

EPCWrite

CLK

EN

CauseWrite

0

1

IntCause

0x30

0x28EPC

Cause

Overflow

...

01101

01110

...15:11

10

C0

Exception Hardware: mfc0

Never mind the multi-cycle datapath, focus on the exception hardware.

Chapter 7 <63>

MIC

ROAR

CHIT

ECTU

RE• Temporal parallelism• Divide single-cycle processor into 5 stages:

– Fetch– Decode– Execute– Memory– Writeback

• Add pipeline registers between stages

Pipelined MIPS Processor

Chapter 7 <64>

MIC

ROAR

CHIT

ECTU

RE

Time (ps)Instr

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead / Write

WriteReg

1

2

0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 1500 1600 1700 1800 19001000

Instr

1

2

3

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead / Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

Single-Cycle

Pipelined

Single-Cycle vs. Pipelined

Chapter 7 <65>

MIC

ROAR

CHIT

ECTU

RE

Time (cycles)

lw $s2, 40($0) RF 40

$0RF

$s2+ DM

RF $t2

$t1RF

$s3+ DM

RF $s5

$s1RF

$s4- DM

RF $t6

$t5RF

$s5& DM

RF 20

$s1RF

$s6+ DM

RF $t4

$t3RF

$s7| DM

add $s3, $t1, $t2

sub $s4, $s1, $s5

and $s5, $t5, $t6

sw $s6, 20($s1)

or $s7, $t3, $t4

1 2 3 4 5 6 7 8 9 10

add

IM

IM

IM

IM

IM

IMlw

sub

and

sw

or

Pipelined Processor Abstraction

Chapter 7 <66>

MIC

ROAR

CHIT

ECTU

RE

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

ALU

WriteRegE4:0

CLK

CLK

CLK

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

Zero

CLK

ALU

Fetch Decode Execute Memory Writeback

Single-Cycle & Pipelined Datapath

Chapter 7 <67>

MIC

ROAR

CHIT

ECTU

RE

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

WriteRegW4:0

ALU

WriteRegE4:0

CLK

CLK

CLK

Fetch Decode Execute Memory Writeback

WriteReg must arrive at same time as Result

Corrected Pipelined Datapath

Chapter 7 <68>

MIC

ROAR

CHIT

ECTU

RE

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4EPCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

ZeroM

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

BranchE BranchM

RegDstE

ALUSrcE

WriteRegE4:0

• Same control unit as single-cycle processor• Control delayed to proper pipeline stage

Pipelined Processor Control

Chapter 7 <69>

MIC

ROAR

CHIT

ECTU

RE• When an instruction depends on result from

instruction that hasn’t completed• Types:

– Data hazard: register value not yet written back to register file

– Control hazard: next instruction not decided yet (caused by branches) or target address not calculated yet (jumps and branches)

Pipeline Hazards

Chapter 7 <70>

MIC

ROAR

CHIT

ECTU

RE

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

Data Hazard

Chapter 7 <71>

MIC

ROAR

CHIT

ECTU

RE

2 SW fixes• Insert nops in code at compile time• Rearrange code at compile time 2 HW fixes• Stall the processor at run time• Forward data at run time

Handling Data Hazards

Chapter 7 <72>

MIC

ROAR

CHIT

ECTU

RE

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

nop

nop

RF RFDMnopIM

RF RFDMnopIM

9 10

• Insert enough nops for result to be ready• Or move independent useful instructions forward

Compile-Time Hazard Elimination

Chapter 7 <73>

MIC

ROAR

CHIT

ECTU

RE

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

Data Forwarding

Chapter 7 <74>

MIC

ROAR

CHIT

ECTU

RE

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

AL

U

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

For

wa

rdA

E

For

wa

rdB

E

20:16RtE

RsD

RdD

RtD

Reg

Wri

teM

Reg

Wri

teW

Hazard Unit

PCPlus4E

BranchE BranchM

ZeroM

Data Forwarding

Chapter 7 <75>

MIC

ROAR

CHIT

ECTU

RE• Forward to Execute stage from either:

– Memory stage or– Writeback stage

• Forwarding logic for ForwardAE:

if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10

else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW)

then ForwardAE = 01 else ForwardAE = 00

Forwarding logic for ForwardBE same, but replace rsE with rtE

Data Forwarding

Chapter 7 <76>

MIC

ROAR

CHIT

ECTU

RE

Time (cycles)

lw $s0, 40($0) RF 40

$0RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

Trouble!

StallingForwarding on a load-use hazard isn’t possible!

Chapter 7 <77>

MIC

ROAR

CHIT

ECTU

RE

Time (cycles)

lw $s0, 40($0) RF 40

$0RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

9

RF $s1

$s0

IMor

Stall

StallingThe HW solution is to stall the pipeline

Chapter 7 <78>

MIC

ROAR

CHIT

ECTU

RE

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

20:16RtE

RsD

RdD

RtD

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Hazard Unit

Flu

shE

PCPlus4E

BranchE BranchM

ZeroM

EN

EN

CLR

Stalling Hardware

Chapter 7 <79>

MIC

ROAR

CHIT

ECTU

RElwstall = ((rsD==rtE) OR (rtD==rtE)) AND MemtoRegE

StallF = StallD = FlushE = lwstall

• By flushing the Execute stage, and stalling Fetch and Decode stages, the instruction flushed will simply be repeated in then next clock cycle, but this time with correct (forwarded) data!

Stalling Logic

Chapter 7 <80>

MIC

ROAR

CHIT

ECTU

RE• beq:

– branch not determined until 4th stage of pipeline– Instructions after the branch are fetched before the

branch occurs– These instructions must be flushed if branch happens

• Branch misprediction penalty– the # of instruction flushed, when branch is taken– may be reduced by determining branch earlier

Control Hazards

Chapter 7 <81>

MIC

ROAR

CHIT

ECTU

RE

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

AL

U

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Sta

llF

Sta

llD

For

wa

rdA

E

For

wa

rdB

E

20:16RtE

RsD

RdD

RtD

Reg

Wri

teM

Reg

Wri

teW

Me

mto

Reg

E

Hazard Unit

Flu

shE

PCPlus4E

BranchE BranchM

ZeroM

EN

EN

CL

R

Control Hazards: Original Pipeline

Chapter 7 <82>

MIC

ROAR

CHIT

ECTU

RE

Time (cycles)

beq $t1, $t2, 40 RF $t2

$t1RF- DM

RF $s1

$s0RF& DM

RF $s0

$s4RF| DM

RF $s5

$s0RF- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

20

24

28

2C

30

...

...

9

Flushthese

instructions

64 slt $t3, $s2, $s3 RF $s3

$s2RF

$t3slt DMIM

slt

Control Hazards

Chapter 7 <83>

MIC

ROAR

CHIT

ECTU

RE

EqualD

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

=

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

20:16RtE

RsD

RdE

RtD

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Hazard Unit

Flu

shE

EN

EN

CLR

CLR

But: introduced another data hazard in Decode stage!

Early Branch Resolution

Chapter 7 <84>

MIC

ROAR

CHIT

ECTU

RE

Time (cycles)

beq $t1, $t2, 40 RF $t2

$t1RF- DM

RF $s1

$s0RF& DMand $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

andIM

IMlw20

24

28

2C

30

...

...

9

Flushthis

instruction

64 slt $t3, $s2, $s3 RF $s3

$s2RF

$t3slt DMIMslt

Early Branch Resolution

Chapter 7 <85>

MIC

ROAR

CHIT

ECTU

RE

EqualD

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

0

1

0

1

=

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

For

war

dAD

For

war

dBD

20:16RtE

RsD

RdD

RtD

Reg

Writ

eE

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Bra

nchD

Hazard Unit

Flu

shE

EN

EN

CLR

CLR

Forwarding to Early-branch HW

Chapter 7 <86>

MIC

ROAR

CHIT

ECTU

RE• Forwarding logic:

ForwardAD = (rsD !=0) AND (rsD == WriteRegM) AND RegWriteM

ForwardBD = (rtD !=0) AND (rtD == WriteRegM) AND RegWriteM

• Stalling logic:branchstall = BranchD AND RegWriteE AND

(WriteRegE == rsD OR WriteRegE == rtD) OR

BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD)

StallF = StallD = FlushE = (lwstall OR branchstall)

Control Forwarding & Stalling Logic

Chapter 7 <87>

MIC

ROAR

CHIT

ECTU

RE• Guess whether branch will be taken

– Backward branches are usually taken (in bottom-tested loops)

– Consider history to improve guess• Good prediction significantly reduces fraction

of branches requiring a flush • Requires HW for history table, etc

Branch Prediction

Chapter 7 <88>

MIC

ROAR

CHIT

ECTU

RE• SPECINT2000 benchmark:

– 25% loads– 10% stores – 11% branches– 2% jumps– 52% R-type

• Suppose:– 40% of loads used by next instruction– 25% of branches mispredicted– All jumps flush next instruction (JTA not ready)

• What is the average CPI?

Pipelined Performance Example

Chapter 7 <89>

MIC

ROAR

CHIT

ECTU

RE• Average CPI is the weighted average of CPIlw , CPIsw ,

CPIbeq , CPIj and CPIR-type

• For pipeline processors, CPI = 1 + # of stall cycles

Load CPI = 1 when no stall, = 2 when load-use occurs (1 stall)– CPIlw = 1(0.6) + 2(0.4) = 1.4– CPIsw = 1Branch CPI = 1 when no stall, = 2 when it mispredicts and stalls– CPIbeq = 1(0.75) + 2(0.25) = 1.25Jump CPI = 2 since it always requires 1 stall– CPIj = 2– CPIR-type = 1

Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15

Calculation of Average CPI

Chapter 7 <90>

MIC

ROAR

CHIT

ECTU

RE• Pipelined processor critical path: Tc = max {

tpcq + tmem + tsetup

2(tRFread + tmux + teq + tAND + tmux + tsetup )

tpcq + tmux + tmux + tALU + tsetup

tpcq + tmemwrite + tsetup

2(tpcq + tmux + tRFwrite) }

Pipelined Performance

Chapter 7 <91>

MIC

ROAR

CHIT

ECTU

REElement Parameter Delay (ps)Register clock-to-Q tpcq_PC 30

Register setup tsetup 20

Multiplexer tmux 25

ALU tALU 200

Memory read tmem 250

Register file read tRFread 150

Register file setup tRFsetup 20

Equality comparator teq 40

AND gate tAND 15

Memory write Tmemwrite 220

Register file write tRFwrite 100 ps

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup )

= 2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps

Pipelined Performance Example

Chapter 7 <92>

MIC

ROAR

CHIT

ECTU

REProgram with IC = 100 billion instructions

Execution Time = IC × CPI × Tc

= (100 × 109)(1.15)(550 × 10-

12) = 63 seconds

Pipelined Performance Example

Chapter 7 <93>

MIC

ROAR

CHIT

ECTU

RE

Processor

Execution Time(seconds)

Speedup(single-cycle as baseline)

Single-cycle 92.5 1

Multicycle 133 0.70

Pipelined 63 1.47

Processor Performance Comparison

Chapter 7 <94>

MIC

ROAR

CHIT

ECTU

RE• Deep Pipelining• Branch Prediction• Superscalar Processors• Out of Order Processors• Register Renaming• SIMD• Multithreading• Multiprocessors

Advanced Microarchitecture

Chapter 7 <95>

MIC

ROAR

CHIT

ECTU

RE• 10-20 stages typical• Number of stages limited by:

– Pipeline hazards– Sequencing overhead– Power– Cost

Deep Pipelining

Chapter 7 <96>

MIC

ROAR

CHIT

ECTU

RE• Ideal pipelined processor: CPI = 1• Branch misprediction increases CPI• Static branch prediction:

– Check direction of branch (forward or backward)– If backward, predict taken– Else, predict not taken

• Dynamic branch prediction:– Keep history of last (several hundred) branches in branch

target buffer, record:• Branch destination• Whether branch was taken

Branch Prediction

Chapter 7 <97>

MIC

ROAR

CHIT

ECTU

RE add $s1, $0, $0 # sum = 0 add $s0, $0, $0 # i = 0 addi $t0, $0, 10 # $t0 = 10for: beq $s0, $t0, done # if i == 10, branch add $s1, $s1, $s0 # sum = sum + i addi $s0, $s0, 1 # increment i j fordone:

Branch Prediction Example

Chapter 7 <98>

MIC

ROAR

CHIT

ECTU

RE• Remembers whether branch was taken the

last time and does the same thing• Mispredicts first and last branch of loop

1-Bit Branch Predictor

Chapter 7 <99>

MIC

ROAR

CHIT

ECTU

RE

Only mispredicts the last branch of the loop

stronglytaken

predicttaken

weaklytaken

predicttaken

weaklynot taken

predictnot taken

stronglynot taken

predictnot taken

taken taken taken

takentakentaken

taken

taken

2-Bit Branch Predictor

Chapter 7 <100>

MIC

ROAR

CHIT

ECTU

RE• Multiple copies of datapath execute multiple

instructions at once• Dependencies make it tricky to issue multiple

instructions at onceCLK CLK CLK CLK

ARD A1

A2RD1A3

WD3WD6

A4A5A6

RD4

RD2RD5

InstructionMemory

RegisterFile Data

Memory

ALU

s

PC

CLK

A1A2

WD1WD2

RD1RD2

Superscalar

Chapter 7 <101>

MIC

ROAR

CHIT

ECTU

RElw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 2or $t3, $s5, $s6sw $s7, 80($t3)

Time (cycles)

1 2 3 4 5 6 7 8

RF40

$s0

RF

$t0+

DMIM

lw

add

lw $t0, 40($s0)

add $t1, $s1, $s2

sub $t2, $s1, $s3

and $t3, $s3, $s4

or $t4, $s1, $s5

sw $s5, 80($s0)

$t1$s2

$s1

+

RF$s3

$s1

RF

$t2-

DMIM

sub

and $t3$s4

$s3

&

RF$s5

$s1

RF

$t4|

DMIM

or

sw80

$s0

+ $s5

Superscalar Example

Chapter 7 <102>

MIC

ROAR

CHIT

ECTU

RElw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 6/5 = 1.17or $t3, $s5, $s6sw $s7, 80($t3)

Stall

Time (cycles)

1 2 3 4 5 6 7 8

RF40

$s0

RF

$t0+

DMIM

lwlw $t0, 40($s0)

add $t1, $t0, $s1

sub $t0, $s2, $s3

and $t2, $s4, $t0

sw $s7, 80($t3)

RF$s1

$t0add

RF$s1

$t0

RF

$t1+

DM

RF$t0

$s4

RF

$t2&

DMIM

and

IMor

and

sub

|$s6

$s5$t3

RF80

$t3

RF+

DM

sw

IM

$s7

9

$s3

$s2

$s3

$s2-

$t0

oror $t3, $s5, $s6

IM

Superscalar with Dependencies

Chapter 7 <103>

MIC

ROAR

CHIT

ECTU

RE• Looks ahead across multiple instructions• Issues as many instructions as possible at once• Issues instructions out of order (as long as no

dependencies)• Dependencies:

– RAW (read after write): one instruction writes, later instruction reads a register

– WAR (write after read): one instruction reads, later instruction writes a register

– WAW (write after write): one instruction writes, later instruction writes a register

Out of Order Processor

Chapter 7 <104>

MIC

ROAR

CHIT

ECTU

RE• Instruction level parallelism (ILP): number

of instruction that can be issued simultaneously (average < 3)

• Scoreboard: table that keeps track of:– Instructions waiting to issue– Available functional units– Dependencies

Out of Order Processor

Chapter 7 <105>

MIC

ROAR

CHIT

ECTU

RElw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 6/4 =

1.5or $t3, $s5, $s6sw $s7, 80($t3) Time (cycles)

1 2 3 4 5 6 7 8

RF40

$s0

RF

$t0+

DMIM

lwlw $t0, 40($s0)

add $t1, $t0, $s1

sub $t0, $s2, $s3

and $t2, $s4, $t0

sw $s7, 80($t3)

or|$s6

$s5$t3

RF80

$t3

RF+

DM

sw $s7

or $t3, $s5, $s6

IM

RF$s1

$t0

RF

$t1+

DMIM

add

sub-$s3

$s2$t0

two cycle latencybetween load anduse of $t0

RAW

WAR

RAW

RF$t0

$s4

RF&

DM

and

IM

$t2

RAW

Out of Order Processor Example

Chapter 7 <106>

MIC

ROAR

CHIT

ECTU

RE

Time (cycles)

1 2 3 4 5 6 7

RF40

$s0

RF

$t0+

DMIM

lwlw $t0, 40($s0)

add $t1, $t0, $s1

sub $r0, $s2, $s3

and $t2, $s4, $r0

sw $s7, 80($t3)

sub-$s3

$s2$r0

RF$r0

$s4

RF&

DM

and

$s7

or $t3, $s5, $s6IM

RF$s1

$t0

RF

$t1+

DMIM

add

sw+80

$t3

RAW

$s6

$s5|

or

2-cycle RAW

RAW

$t2

$t3

lw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 6/3 =

2or $t3, $s5, $s6sw $s7, 80($t3)

Register Renaming

Chapter 7 <107>

MIC

ROAR

CHIT

ECTU

RE• Single Instruction Multiple Data (SIMD)

– Single instruction acts on multiple pieces of data at once– Common application: graphics– Perform short arithmetic operations (also called packed

arithmetic)

• For example, add four 8-bit elementspadd8 $s2, $s0, $s1

a0

0781516232432 Bit position

$s0a1a2a3

b0 $s1b1b2b3

a0 + b0 $s2a1 + b1a2 + b2a3 + b3

+

SIMD

Chapter 7 <108>

MIC

ROAR

CHIT

ECTU

RE• Multithreading

– Wordprocessor: thread for typing, spell checking, printing

• Multiprocessors– Multiple processors (cores) on a single chip

Advanced Architecture Techniques

Chapter 7 <109>

MIC

ROAR

CHIT

ECTU

RE• Process: program running on a computer

– Multiple processes can run at once: e.g., surfing Web, playing music, writing a paper

• Thread: part of a program– Each process has multiple threads: e.g., a word

processor may have threads for typing, spell checking, printing

Threading: Definitions

Chapter 7 <110>

MIC

ROAR

CHIT

ECTU

RE• One thread runs at once• When one thread stalls (for example, waiting

for memory):– Architectural state of that thread stored– Architectural state of waiting thread loaded into

processor and it runs– Called context switching

• Appears to user like all threads running simultaneously

Threads in Conventional Processor

Chapter 7 <111>

MIC

ROAR

CHIT

ECTU

RE• Multiple copies of architectural state• Multiple threads active at once:

– When one thread stalls, another runs immediately– If one thread can’t keep all execution units busy,

another thread can use them

• Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput

Intel calls this “hyperthreading”

Multithreading

Chapter 7 <112>

MIC

ROAR

CHIT

ECTU

RE

• Multiple processors (cores) with a method of communication between them

• Types:– Homogeneous: multiple cores with shared memory– Heterogeneous: separate cores for different tasks (for

example, DSP and CPU in cell phone)– Clusters: each core has own memory system

Multiprocessors