4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16



Instruction Execution

• PC → instruction memory, fetch instruction

• Register numbers → register file, read registers

• Depending on instruction class

• Use ALU to calculate
  - arithmetic result
  - memory address for load/store
  - branch target address

• Access data memory for load/store

• PC ← target address or PC + 4

Consider simplified MIPS:

• lw/sw rt, offset(rs)

• add/sub/and/or/slt rs, rt, rd

• beq rs, rt, offset

Multiplexers

Can’t just join wires together

Use multiplexers


Control


Which Components Are Needed for sub rs, rt, rd?

A 1, 2, 3, 4, 5, 6

B 1, 2, 3, 5, 6

C 1, 2, 5, 6, 7

D 5, 6

E 5, 6, 7

[Figure: datapath with its components numbered 1 to 7]

Which Components Are Needed for sw rt, offset(rs)?

A 5, 7

B 5, 6, 7

C 2, 5, 6, 7

D 1, 2, 3, 5, 7

E 1, 2, 3, 5, 6, 7

[Figure: datapath with its components numbered 1 to 7]

Which Components Are Needed for beq rs, rt, offset?

A 2, 4, 5, 6

B 2, 4, 5, 6, 7

C 1, 2, 3, 4, 5

D 1, 2, 4, 5, 6

E 1, 2, 3, 4, 5, 6

[Figure: datapath with its components numbered 1 to 7]

Two Types of Logic Component

Combinational logic: the output depends only on the current inputs, C = f(A, B).

State element (sequential logic): clocked by clk; the output also depends on the stored state, C = f(A, B, state).

[Figure: examples of combinational logic]

Clocks

Combinational logic transforms data during clock cycles

• Signal must be stable when written to state element

• Longest delay determines clock period

Edge triggering allows a state element to be read and written in the same cycle


State Element: Latch

An (unclocked) S-R latch (set-reset latch), built from a pair of NOR gates:

Event          S  R  Q  Q̅
Initially      0  0  0  1
S becomes 1    1  0  1  0
S becomes 0    0  0  1  0
R becomes 1    0  1  0  1
R becomes 0    0  0  0  1
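The transitions in the table can be reproduced with a small simulation. This is only a sketch: it assumes the usual cross-coupled wiring Q = NOR(R, Q̅) and Q̅ = NOR(S, Q), and the helper names `nor` and `sr_latch` are made up for illustration.

```python
def nor(a, b):
    """2-input NOR gate on 0/1 values."""
    return 0 if (a or b) else 1

def sr_latch(s, r, q, q_bar):
    """Settle a cross-coupled NOR latch, starting from its previous outputs."""
    for _ in range(4):  # a few passes suffice for the feedback loop to settle
        new_q = nor(r, q_bar)
        new_q_bar = nor(s, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar

state = (0, 1)  # initial Q, Q-bar
trace = []
# S becomes 1, S becomes 0, R becomes 1, R becomes 0
for s, r in [(1, 0), (0, 0), (0, 1), (0, 0)]:
    state = sr_latch(s, r, *state)
    trace.append((s, r) + state)

print(trace)  # [(1, 0, 1, 0), (0, 0, 1, 0), (0, 1, 0, 1), (0, 0, 0, 1)]
```

Each tuple is (S, R, Q, Q̅). The latch keeps Q = 1 after S returns to 0, which is what makes it a state element.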


State Element: Flip-Flop

In a clocked memory element, the state changes only on clock edge.

A D flip-flop with input D and clock C using a latch; the C input indicates when the latch should read the input and store it:

Event          D  C  Q  Q̅
Initially      0  0  0  1
D becomes 1    1  0  0  1
C becomes 1    1  1  1  0
C becomes 0    1  0  1  0
D becomes 0    0  0  1  0

This flip-flop is transparent: when C is 1, the value of Q changes as D changes. What happens when Q is connected back to D?

State Element: Flip-Flop With Falling-Edge Trigger

A non-transparent flip-flop changes its output only on a clock edge; either the rising or the falling clock edge can be used:

When C is 1, the first (master) latch is open and outputs D. When C falls, the master is closed, but the second (slave) latch is open and gets its input from the master.

When C changes from 1 to 0, the value of D is stored. In a transparent latch, both the stored value and the output Q change whenever C is high; here they change only when C transitions from 1 to 0.
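The difference between the two elements can be seen in a behavioural sketch. It models each latch at the level of "follow D while open, hold otherwise" rather than at gate level, and the class names `DLatch` and `FallingEdgeFlipFlop` are illustrative assumptions.

```python
class DLatch:
    """Transparent latch: while c is 1 the output follows d, otherwise it holds."""
    def __init__(self):
        self.q = 0

    def update(self, d, c):
        if c == 1:
            self.q = d
        return self.q

class FallingEdgeFlipFlop:
    """Master-slave pair: master is open while c is 1, slave while c is 0."""
    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, d, c):
        m = self.master.update(d, c)
        return self.slave.update(m, 1 - c)

# (D, C) pairs: D toggles while C is high, then C falls with D = 1
inputs = [(0, 0), (1, 0), (1, 1), (0, 1), (1, 1), (1, 0), (0, 0)]
latch, ff = DLatch(), FallingEdgeFlipFlop()
latch_q = [latch.update(d, c) for d, c in inputs]
ff_q = [ff.update(d, c) for d, c in inputs]

print(latch_q)  # [0, 0, 1, 0, 1, 1, 1]
print(ff_q)     # [0, 0, 0, 0, 0, 1, 1]
```

The latch output follows every change of D while C is 1; the flip-flop output changes once, when C falls.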


State Element: Register

A register is like a master-slave flip-flop, except

• N-bit input and output

• Write Enable input

All state elements are clocked by the same clock edge, rising or falling:

The longest delay of combinational logic determines the clock period! We usually omit the clock line in circuits.

[Figure: timing diagram showing setup and hold times around the clock edge, with the data being don't-care otherwise; register symbol with Clk, N-bit Data In, Write Enable, and N-bit Data Out]

Instruction Fetch

We refine all elements of the datapath: registers, ALU, muxes, memories, …

The instruction memory is assumed to be 1 word (32 bits) wide:

[Figure: instruction fetch. The PC is a 32-bit register; an adder increments it by 4 for the next instruction.]

R-Format Instructions (e.g. slt rs, rt, rd)

Read two register operands

Perform arithmetic/logical operation

Write register result

Field  op      rs      rt      rd      shamt   funct
Width  6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
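As an illustration of the field layout, the sketch below extracts the six R-format fields from a 32-bit word with shifts and masks. The function name `decode_r_format` is an assumption, and the example word is the standard encoding of `add $t0, $t1, $t2` (rs = 9, rt = 10, rd = 8).

```python
def decode_r_format(word):
    """Split a 32-bit R-format instruction word into its fields."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31:26, 6 bits
        "rs":    (word >> 21) & 0x1F,  # bits 25:21, 5 bits
        "rt":    (word >> 16) & 0x1F,  # bits 20:16, 5 bits
        "rd":    (word >> 11) & 0x1F,  # bits 15:11, 5 bits
        "shamt": (word >> 6) & 0x1F,   # bits 10:6,  5 bits
        "funct": word & 0x3F,          # bits 5:0,   6 bits
    }

fields = decode_r_format(0x012A4020)
print(fields)
# {'op': 0, 'rs': 9, 'rt': 10, 'rd': 8, 'shamt': 0, 'funct': 32}
```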


Load/Store Instructions, e.g. lw rt, offset(rs)

Read register operands

Calculate address using 16-bit offset: Use ALU, but sign-extend offset

Load: Read memory and update register

Store: Write register value to memory

Field  op      rs      rt      constant or address
Width  6 bits  5 bits  5 bits  16 bits
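The address calculation can be sketched as follows, assuming 32-bit addresses that wrap around. The names `sign_extend16` and `effective_address` and the example base address are illustrative, not taken from the slides.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def effective_address(base, offset16):
    """Address used by lw/sw: register value plus sign-extended offset."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF

print(hex(effective_address(0x10010000, 8)))       # 0x10010008
print(hex(effective_address(0x10010000, 0xFFFC)))  # 0x1000fffc (offset -4)
```

Without sign extension, the offset 0xFFFC would be treated as +65532 instead of -4.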


Branch Instructions, e.g. beq rs, rt, offset

Read register operands

Compare operands:
- use ALU
- subtract
- check Zero output

Calculate target address:
- sign-extend displacement
- shift left 2 places
- add to PC + 4

Field  op      rs      rt      constant or address
Width  6 bits  5 bits  5 bits  16 bits

[Figure annotations: the shift left by 2 just re-routes wires; sign extension replicates the sign-bit wire]
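The three steps of the target calculation (sign-extend, shift left by 2, add to PC + 4) can be written out as a short sketch. The function names and the example PC value are assumptions for illustration.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, offset16):
    """Target of a taken beq: (PC + 4) + (sign-extended offset << 2)."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF

print(hex(branch_target(0x00400010, 3)))       # 0x400020 (3 words forward)
print(hex(branch_target(0x00400010, 0xFFFF)))  # 0x400010 (offset -1: branch to itself)
```

The shift converts a word offset into a byte offset, since every instruction is 4 bytes long.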


Composing the Datapath

First-cut datapath does an instruction in one clock cycle
- Each datapath element can only do one function at a time
- Hence, we need separate instruction and data memories

Use multiplexers where alternate data sources are used for different instructions


ALU Control

Two-bit ALUOp derived from opcode

Combinational logic derives ALU control

opcode  ALUOp  Operation         funct   ALU function      ALU control
lw      00     load word         XXXXXX  add               0010
sw      00     store word        XXXXXX  add               0010
beq     01     branch equal      XXXXXX  subtract          0110
R-type  10     add               100000  add               0010
               subtract          100010  subtract          0110
               AND               100100  AND               0000
               OR                100101  OR                0001
               set-on-less-than  101010  set-on-less-than  0111
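The table amounts to a small combinational function of ALUOp and funct. The sketch below encodes it directly; `alu_control` and `FUNCT_TO_CONTROL` are illustrative names, and only the instructions listed in the table are handled.

```python
# funct field -> 4-bit ALU control, used only for R-type (ALUOp = 10)
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}

def alu_control(alu_op, funct=0):
    """Derive the 4-bit ALU control from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:  # lw/sw: add to form the memory address (funct ignored)
        return 0b0010
    if alu_op == 0b01:  # beq: subtract to compare the operands (funct ignored)
        return 0b0110
    if alu_op == 0b10:  # R-type: the funct field selects the operation
        return FUNCT_TO_CONTROL[funct]
    raise ValueError("ALUOp not covered by the table")

print(format(alu_control(0b00), "04b"))            # 0010
print(format(alu_control(0b10, 0b101010), "04b"))  # 0111
```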


Active Single-Cycle Datapath

Which instruction does the green path represent?

A R-type   B lw   C sw   D beq   E None of the above

Active Single-Cycle Datapath

Which instruction does the green path represent?

A R-type   B lw   C sw   D beq   E None of the above

Active Single-Cycle Datapath

Which instruction does the green path represent?

A R-type   B lw   C sw   D beq   E None of the above

Active Single-Cycle Datapath

Which instruction does the green path represent?

A R-type   B lw   C sw   D beq   E None of the above

Main Control Unit

Control signals derived from instruction

Type        Fields (bit positions)
R-type      0 (31:26)         rs (25:21)  rt (20:16)  rd (15:11)  shamt (10:6)  funct (5:0)
Load/Store  35 or 43 (31:26)  rs (25:21)  rt (20:16)  address (15:0)
Branch      4 (31:26)         rs (25:21)  rt (20:16)  address (15:0)

Figure annotations:
- bits 31:26: opcode
- rs: always read
- rt: read, except for load
- rd (R-type) or rt (load): write for R-type and load
- address: sign-extend and add

Datapath With Control


System Calls in SPIM

SPIM provides operating system-like services through the syscall instruction. Register $v0 contains the call code and $a0 - $a3 the arguments. The program prints “the answer = 5”.

Object File and Memory Layout


Performance of Single-Cycle Datapath

Longest delay determines clock period. Which instruction takes longest?

A R-type   B lw   C sw   D beq   E All take equally long

lw rt, offset(rs): instr. memory → registers → ALU → data memory → registers

While CPI = 1, the clock cycle is too long; even more so if, e.g., floating point is added

Single-cycle path only for very simple instruction sets

Performance improvement by pipelining

Pipelining Analogy

Overlapping execution: parallelism improves performance

With 4 loads, speedup = 8/3.5 = 2.3

With n loads, time in h?

A n + 1.5   B 2n   C 4n   D 2n + 1.5   E None of the above
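The speedup figure can be checked with a short calculation. The sketch assumes four stages of 0.5 hours each, which is consistent with the 8 h and 3.5 h quoted above but is not stated explicitly in this transcript; the function names are illustrative.

```python
def sequential_hours(loads, stages=4, stage_hours=0.5):
    """Each load runs through all stages before the next one starts."""
    return loads * stages * stage_hours

def pipelined_hours(loads, stages=4, stage_hours=0.5):
    """The first load fills the pipeline; each further load adds one stage time."""
    return (stages + loads - 1) * stage_hours

seq, pipe = sequential_hours(4), pipelined_hours(4)
print(seq, pipe, round(seq / pipe, 1))  # 8.0 3.5 2.3
```

Under the same assumptions, the ratio approaches the number of stages as the number of loads grows.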


Five Stage MIPS Pipeline

1. IF: Instruction fetch from memory

2. ID: Instruction decode & register read

3. EX: Execute operation or calculate address

4. MEM: Access memory operand

5. WB: Write result back to register


Pipeline Performance

Assume time for stages is

• 100ps for register read or write

• 200ps for other stages

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps

Pipeline Performance

Pipelined (Tc= 200ps)

Single-cycle (Tc= 800ps)
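To compare the two clocking schemes, the total time for a run of instructions can be computed from the cycle times above. This is a sketch that assumes an ideal five-stage pipeline with no stalls; the function names are illustrative.

```python
def single_cycle_ps(instructions, tc_ps=800):
    """One long clock cycle per instruction."""
    return instructions * tc_ps

def pipelined_ps(instructions, stages=5, tc_ps=200):
    """Fill the pipeline once, then complete one instruction per cycle."""
    return (stages + instructions - 1) * tc_ps

print(single_cycle_ps(3), pipelined_ps(3))  # 2400 1400
print(single_cycle_ps(1_000_000) / pipelined_ps(1_000_000))
```

For long instruction sequences the ratio tends to 800/200 = 4 rather than 5, because the stages are not balanced.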


Throughput vs Latency

The execution time of an individual instruction is its latency.

The throughput is the total amount of work done in a given time.

What is true of a 5-stage pipelined processor compared to a single-cycle processor?

A throughput and latency are unaffected

B throughput increases by a factor of 5, latency is unaffected

C throughput is unaffected, latency increases by a factor of 5

D throughput increases by a factor of 5 and latency increases

E none of the above


Pipeline Speedup

If all stages are balanced, i.e. all take the same time, speedup in the long run is the number of stages.

Latency increases if stages are not balanced; it also increases due to extra overhead.

MIPS ISA is designed for pipelining:

• All instructions are 32 bits: easier to fetch and decode in one cycle (x86: 1- to 17-byte instructions)

• Few and regular instruction formats: can decode and read registers in one step

• Load/store addressing: can calculate address in 3rd stage, access memory in 4th stage

• Alignment of memory operands: memory access takes only one cycle


Hazards

Situations that prevent starting the next instruction in the next cycle:

Structure hazards

A required resource is busy

Data hazard

Need to wait for previous instruction to complete its data read/write

Control hazard

Deciding on control action depends on previous instruction


Structure Hazards

Conflict for use of a resource:

In MIPS pipeline with a single memory

Load/store requires data access

Instruction fetch would have to stall for that cycle: would cause a pipeline “bubble”

Hence, pipelined datapaths require separate instruction/data memories

⇒ or separate instruction/data caches


Data Hazard

An instruction depends on completion of data access by a previous instruction

add $s0, $t0, $t1
sub $t2, $s0, $t3
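The dependence between the two instructions is a read-after-write on $s0. A minimal sketch of detecting it is shown below; it handles only three-register instructions of the form `op dst, src1, src2`, and `parse` and `raw_dependencies` are hypothetical helper names.

```python
def parse(instr):
    """Split 'op dst, src1, src2' into (op, dst, [sources])."""
    op, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[0], regs[1:]

def raw_dependencies(first, second):
    """Registers written by `first` and then read by `second`."""
    _, dst, _ = parse(first)
    _, _, sources = parse(second)
    return [r for r in sources if r == dst]

print(raw_dependencies("add $s0, $t0, $t1", "sub $t2, $s0, $t3"))  # ['$s0']
```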


Forwarding (aka Bypassing)

Use result when it is computed

• Don’t wait for it to be stored in a register

• Requires extra connections in the datapath


Load-Use Data Hazard

Can’t always avoid stalls by forwarding

• If value not computed when needed

• Can’t forward backward in time!


Code Scheduling to Avoid Stalls

Reorder code to avoid use of load result in the next instruction

The C code

A = B + E; C = B + F;

is compiled to:

lw   $t1, 0($t0)
lw   $t2, 4($t0)
add  $t3, $t1, $t2     # stall: $t2 is used right after its load
sw   $t3, 12($t0)
lw   $t4, 8($t0)
add  $t5, $t1, $t4     # stall: $t4 is used right after its load
sw   $t5, 16($t0)

How many cycles?

A 9   B 10   C 11   D 12   E 13

Code Scheduling to Avoid Stalls

Reorder code to avoid use of load result in the next instruction

The C code A = B + E; C = B + F; is compiled to the following, which takes 13 cycles:

lw   $t1, 0($t0)
lw   $t2, 4($t0)
add  $t3, $t1, $t2     # stall: $t2 is used right after its load
sw   $t3, 12($t0)
lw   $t4, 8($t0)
add  $t5, $t1, $t4     # stall: $t4 is used right after its load
sw   $t5, 16($t0)

By reordering instructions, to how many cycles can this be reduced?

A 9   B 10   C 11   D 12   E 13

Code Scheduling to Avoid Stalls

Reorder code to avoid use of load result in the next instruction

The C code A = B + E; C = B + F; is compiled to:

Original order (13 cycles):

lw   $t1, 0($t0)
lw   $t2, 4($t0)
add  $t3, $t1, $t2     # stall: $t2 is used right after its load
sw   $t3, 12($t0)
lw   $t4, 8($t0)
add  $t5, $t1, $t4     # stall: $t4 is used right after its load
sw   $t5, 16($t0)

Reordered (11 cycles):

lw   $t1, 0($t0)
lw   $t2, 4($t0)
lw   $t4, 8($t0)
add  $t3, $t1, $t2
sw   $t3, 12($t0)
add  $t5, $t1, $t4
sw   $t5, 16($t0)
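The cycle counts can be reproduced with a rough model: a five-stage pipeline with forwarding, where the only stall is one cycle when an instruction reads a register loaded by the instruction immediately before it. The helper names and this simplified hazard rule are assumptions made for this sketch.

```python
import re

def registers(text):
    """All register names in an instruction, in order of appearance."""
    return re.findall(r"\$\w+", text)

def count_cycles(program, stages=5):
    """Cycles to run `program`, charging one stall per load-use pair."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.split()[0] != "lw":
            continue
        loaded = registers(prev)[0]  # lw writes its first operand
        # sw reads all of its registers; other instructions read all but the first
        sources = registers(cur) if cur.split()[0] == "sw" else registers(cur)[1:]
        if loaded in sources:
            stalls += 1
    return stages + len(program) - 1 + stalls

original = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2",
            "sw $t3, 12($t0)", "lw $t4, 8($t0)", "add $t5, $t1, $t4",
            "sw $t5, 16($t0)"]
reordered = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)",
             "add $t3, $t1, $t2", "sw $t3, 12($t0)", "add $t5, $t1, $t4",
             "sw $t5, 16($t0)"]

print(count_cycles(original), count_cycles(reordered))  # 13 11
```

Moving the third load up separates both loads from their uses, so the two stall cycles disappear.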


Control Hazards

Branch determines flow of control

• Fetching next instruction depends on branch outcome

• Pipeline can’t always fetch the correct instruction: it is still working on the ID stage of the branch

In MIPS pipeline

• Need to compare registers and compute target early in the pipeline

• Add hardware to do it in ID stage


Stall on Branch

Wait until branch outcome determined before fetching next instruction


Branch Prediction

Longer pipelines can’t readily determine branch outcome early

⇒ stall penalty becomes unacceptable

Predict outcome of branch, only stall if prediction is wrong

In MIPS pipeline

can predict branches not taken

fetch instruction after branch, with no delay


MIPS with Predict Not Taken

Prediction correct

Prediction incorrect


More-Realistic Branch Prediction

Static branch prediction

• Based on typical branch behaviour

• Example: while and if-statement branches
  - Predict backward branches taken
  - Predict forward branches not taken

Dynamic branch prediction

• Hardware measures actual branch behaviour, e.g., records recent history of each branch

• Assume future behaviour will continue the trend; when wrong, stall while re-fetching, and update history
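One common way to record recent history is a 2-bit saturating counter per branch. The slides do not say which scheme is meant, so the following is an assumed example rather than the predictor described here; the class name, the initial counter value, and the branch address are made up.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2  # assumed start: weakly not taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

def mispredictions(outcomes, pc=0x400):
    """Count wrong predictions for one branch over a sequence of outcomes."""
    predictor, wrong = TwoBitPredictor(), 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong

# A loop branch taken 9 times and then falling through, executed twice
loop = ([True] * 9 + [False]) * 2
print(mispredictions(loop))  # 3
```

After warming up, the counter mispredicts only the loop exit, since a single not-taken outcome does not flip the prediction.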


Pipeline Summary

Pipelining improves performance by increasing instruction throughput

• Executes multiple instructions in parallel

• Each instruction has the same latency

Subject to hazards

• Structure, data, control

Instruction set design affects complexity of pipeline implementation


MIPS Pipelined Datapath

[Figure: pipelined datapath; the right-to-left flows from the WB and MEM stages lead to hazards]

IF for Load, Store

“Single-clock-cycle” diagram, highlighting resources used


ID for Load, Store, …


EX for Load


MEM for Load


WB for Load


Corrected Datapath for Load


Multi-Cycle Pipeline Diagram with Resource Usage


Instruction-Level Parallelism (ILP)

Pipelining: executing multiple instructions in parallel

To increase ILP

• Deeper pipeline: less work per stage ⇒ shorter clock cycle

• Multiple issue

• Replicate pipeline stages ⇒ multiple pipelines

• Start multiple instructions per clock cycle

• CPI < 1, so use Instructions Per Cycle (IPC)

• E.g., 4 GHz 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4

• But dependencies reduce this in practice
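The peak figures in the example follow directly from the clock rate and the issue width. A short sketch of the arithmetic, with an illustrative function name:

```python
def peak_rate(clock_hz, issue_width):
    """Return (peak BIPS, peak CPI, peak IPC) for an ideal multiple-issue CPU."""
    instructions_per_second = clock_hz * issue_width
    return instructions_per_second / 1e9, 1 / issue_width, issue_width

bips, cpi, ipc = peak_rate(4e9, 4)
print(bips, cpi, ipc)  # 16.0 0.25 4
```

These are upper bounds; they assume every issue slot is filled in every cycle.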


Multiple Issue

Static multiple issue

• Compiler groups instructions to be issued together

• Packages them into “issue slots”

• Compiler detects and avoids hazards

Dynamic multiple issue

• CPU examines instruction stream and chooses instructions to issue each cycle

• Compiler can help by reordering instructions

• CPU resolves hazards using advanced techniques at runtime


Speculation

“Guess” what to do with an instruction

• Start operation as soon as possible

• Check whether guess was right: if so, complete the operation; if not, roll back and do the right thing

Common to static and dynamic multiple issue

Examples:

• Speculate on branch outcome: roll back if path taken is different

• Speculate on load: roll back if location is updated

Compiler can reorder instructions, e.g., move load before branch

Hardware can look ahead for instructions to execute

• Buffer results until it determines they are actually needed

• Flush buffers on incorrect speculation


Static Multiple Issue

Compiler groups instructions into “issue packets”

• that can be issued on a single cycle, determined by pipeline resources required

Issue packet is a Very Long Instruction Word (VLIW)

• that specifies multiple concurrent operations

Compiler must remove some/all hazards

• Reorder instructions into issue packets

• No dependencies within a packet

• Possibly some dependencies between packets, depending on ISA

• Pad with nop if necessary


MIPS with Static Dual Issue

Two-issue packets:

• 64-bit aligned

• ALU/branch instruction, then load/store instruction

• Pad an unused instruction with nop

Address  Instruction type  Pipeline stages
n        ALU/branch        IF  ID  EX  MEM WB
n + 4    Load/store        IF  ID  EX  MEM WB
n + 8    ALU/branch            IF  ID  EX  MEM WB
n + 12   Load/store            IF  ID  EX  MEM WB
n + 16   ALU/branch                IF  ID  EX  MEM WB
n + 20   Load/store                IF  ID  EX  MEM WB

MIPS with Static Dual Issue


Hazards in the Dual-Issue MIPS

More instructions executing in parallel, but:

EX data hazard

• Forwarding avoided stalls with single-issue

• Now can’t use ALU result in load/store in same packet

add $t0, $s0, $s1
lw  $s2, 0($t0)

• Split into two packets, effectively a stall

Load-use hazard

• Still one cycle use latency, but now two instructions

⇒ More aggressive scheduling required


Scheduling Example

How many cycles are needed for dual-issue MIPS (ALU/branch, then load/store)?

A 3   B 4   C 5   D 6   E 7

Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0

      ALU/branch             Load/store        cycle
Loop: nop                    lw  $t0, 0($s1)   1
      addi $s1, $s1, -4      nop               2
      addu $t0, $t0, $s2     nop               3
      bne  $s1, $zero, Loop  sw  $t0, 4($s1)   4

IPC = 5/4 = 1.25 (cf. peak IPC = 2)

Loop Unrolling

Replicate loop body to expose more parallelism:

Reduces loop-control overhead

Use different registers per replication: register renaming

Avoid loop-carried anti-dependencies

e.g. store followed by a load of the same register

      ALU/branch             Load/store        cycle
Loop: addi $s1, $s1, -16     lw  $t0, 0($s1)   1
      nop                    lw  $t1, 12($s1)  2
      addu $t0, $t0, $s2     lw  $t2, 8($s1)   3
      addu $t1, $t1, $s2     lw  $t3, 4($s1)   4
      addu $t2, $t2, $s2     sw  $t0, 16($s1)  5
      addu $t3, $t3, $s2     sw  $t1, 12($s1)  6
      nop                    sw  $t2, 8($s1)   7
      bne  $s1, $zero, Loop  sw  $t3, 4($s1)   8

Schedule without unrolling, for comparison:

      ALU/branch             Load/store        cycle
Loop: nop                    lw  $t0, 0($s1)   1
      addi $s1, $s1, -4      nop               2
      addu $t0, $t0, $s2     nop               3
      bne  $s1, $zero, Loop  sw  $t0, 4($s1)   4

IPC = 14/8 = 1.75

Closer to 2, but at cost of registers and code size
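The IPC values follow from counting the non-nop slots in each schedule. The sketch below does that count, representing every cycle as an (ALU/branch, load/store) pair of mnemonics; the list encoding and the function name `ipc` are illustrative assumptions.

```python
def ipc(schedule):
    """Useful instructions per cycle for a list of two-slot issue packets."""
    issued = sum(1 for packet in schedule for slot in packet if slot != "nop")
    return issued / len(schedule)

original = [("nop", "lw"), ("addi", "nop"), ("addu", "nop"), ("bne", "sw")]
unrolled = [("addi", "lw"), ("nop", "lw"), ("addu", "lw"), ("addu", "lw"),
            ("addu", "sw"), ("addu", "sw"), ("nop", "sw"), ("bne", "sw")]

print(ipc(original), ipc(unrolled))  # 1.25 1.75
```

Unrolling four iterations leaves only two empty slots in eight cycles, compared with three empty slots in four cycles before.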


Dynamic Multiple Issue

Superscalar processors:

• CPU decides whether to issue 0, 1, 2, … instructions each cycle

• Avoiding structural and data hazards

Avoids the need for compiler scheduling, though it may still help


Dynamic Pipeline Scheduling

Allow the CPU to execute instructions out of order to avoid stalls, but commit result to registers in order:

lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20

Can start sub while addu is waiting for lw

Why not just let the compiler schedule code?

• Not all stalls are predictable, e.g., cache misses

• Can’t always schedule around branches, outcome dynamically determined


Dynamically Scheduled CPU

[Figure: dynamically scheduled CPU, with annotations]

• Instruction fetch and decode unit (in-order issue): preserves dependencies

• Reservation stations: hold pending operands

• Functional units: results also sent to any waiting reservation stations

• Commit unit: reorder buffer for register writes; can supply operands for issued instructions

Does Multiple Issue Work?

Yes, but not as much as we’d like

Programs have real dependencies that limit ILP

Some dependencies are hard to eliminate, e.g. pointers

Some parallelism is hard to expose: limited window size during instruction issue

Memory delays and limited bandwidth: hard to keep pipelines full

Speculation can help if done well


Hardware or Software Based Approach to ILP?

1. Branch prediction

2. Multiple issue

3. VLIW

4. Superscalar

5. Dynamic scheduling

6. Out-of-order execution

7. Speculation

8. Register renaming

A Hardware   B Software   C Both   D Neither   E What was the question?

Power Efficiency

Complexity of dynamic scheduling and speculation requires power

Multiple simpler cores may be better

Microprocessor  Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-order/Speculation  Cores  Power
i486            1989  25MHz       5                1            No                        1      5W
Pentium         1993  66MHz       5                2            No                        1      10W
Pentium Pro     1997  200MHz      10               3            Yes                       1      29W
P4 Willamette   2001  2000MHz     22               3            Yes                       1      75W
P4 Prescott     2004  3600MHz     31               3            Yes                       1      103W
Core            2006  2930MHz     14               4            Yes                       2      75W
UltraSparc III  2003  1950MHz     14               4            No                        1      90W
UltraSparc T1   2005  1200MHz     6                1            No                        8      70W

Cortex A8 and Intel i7

Processor ARM A8 Intel Core i7 920

Market                         Personal Mobile Device  Server, cloud
Thermal design power           2 Watts                 130 Watts
Clock rate                     1 GHz                   2.66 GHz
Cores/Chip                     1                       4
Floating point?                No                      Yes
Multiple issue?                Dynamic                 Dynamic
Peak instructions/clock cycle  2                       4
Pipeline stages                14                      14
Pipeline schedule              Static in-order         Dynamic out-of-order with speculation
Branch prediction              2-level                 2-level
1st level caches/core          32 KiB I, 32 KiB D      32 KiB I, 32 KiB D
2nd level caches/core          128-1024 KiB            256 KiB
3rd level caches (shared)      -                       2-8 MB

ARM Cortex-A8 Pipeline


ARM Cortex-A8 Performance


Core i7 Pipeline


Core i7 Performance


Matrix Multiply: Unrolled C code

    #include <x86intrin.h>
    #define UNROLL (4)

    void dgemm(int n, double* A, double* B, double* C)
    {
      for (int i = 0; i < n; i += UNROLL*4)
        for (int j = 0; j < n; j++) {
          __m256d c[4];
          /* load UNROLL vectors of 4 doubles from column j of C */
          for (int x = 0; x < UNROLL; x++)
            c[x] = _mm256_load_pd(C+i+x*4+j*n);

          for (int k = 0; k < n; k++)
          {
            /* broadcast one element of B to all 4 lanes */
            __m256d b = _mm256_broadcast_sd(B+k+j*n);
            for (int x = 0; x < UNROLL; x++)
              c[x] = _mm256_add_pd(c[x],
                       _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
          }

          /* store the accumulated results back to C */
          for (int x = 0; x < UNROLL; x++)
            _mm256_store_pd(C+i+x*4+j*n, c[x]);
        }
    }

Using gcc -O3 optimization to unroll loop

Using AVX subword parallel instructions on x86


Matrix Multiply: Performance