Microcoded and VLIW Processorscsg.csail.mit.edu/6.823/Lectures/L19handout.pdfld f2, 8(r1) ld f3,...

L19-1MIT 6.823 Spring 2021

Daniel SanchezComputer Science & Artificial Intelligence Lab

M.I.T.

Microcoded and VLIW Processors

April 29, 2021

MIT 6.823 Spring 2021

Hardwired vs Microcoded Processors

• All processors we have seen so far are hardwired: The microarchitecture directly implements all the instructions in the ISA

• Microcoded processors add a layer of interpretation: Each ISA instruction is executed as a sequence of simpler microinstructions

– Simpler implementation– Lower performance than hardwired (CPI > 1)

• Microcoding common until the 80s, still in use today (e.g., complex x86 instructions are decoded into multiple “micro-ops”)

April 29, 2021 L19-2


Microcontrol Unit[Maurice Wilkes, 1954]

April 29, 2021

Embed the control logic state table in a read-only memory array

Matrix A Matrix B

Decoder

Next state

op conditionalcode flip-flop

address

Control lines toALU, MUXs, Registers

L19-3


Microcoded Microarchitecture

April 29, 2021

Memory(RAM)

Datapath

controller(ROM)

AddrData

zero?

busy?

opcode

enMem

MemWrt

holds fixedmicrocode instructions

holds user program written in macrocode

instructions (e.g., MIPS, x86, etc.)

L19-4


A Bus-based Datapath for MIPS

April 29, 2021

Microinstruction: register to register transfer (17 control signals)MA PC means RegSel = PC; enReg=yes; ldMA= yes

B Reg[rt] means

enMem

MA

addr

data

ldMA

Memory

busy

MemWrt

Bus 32

zero?

A B

OpSel ldA ldB

ALU

enALU

ALUcontrol

2

RegWrt

enReg

addr

data

rsrtrd

32(PC)31(Link)

RegSel

32 GPRs+ PC ...

32-bit Reg

3

rsrtrd

ExtSel

IR

Opcode

ldIR

ImmExt

enImm

2

L19-5


Memory Module

• Assumption: Memory operates asynchronously and is slow compared to Reg-to-Reg transfers

April 29, 2021

Enable

Write(1)/Read(0)RAM

din dout

we

addr busy

bus

L19-6


Microcode Controller

April 29, 2021

JumpType =next | spin

| fetch | dispatch| feqz | fnez

Control Signals (17)

Control ROM

address

data

+1

Opcode ext

PC (state)

jumplogic

zero

PC PC+1

absolute

op-group

busy

PCSrcinput encoding reduces

ROM height

next-state encoding reduces ROM width

L19-7


Jump Logic

April 29, 2021

PCSrc = Case JumpTypes

next PC+1

spin if (busy) then PC else PC+1

fetch absolute

dispatch op-group

feqz if (zero) then absolute else PC+1

fnez if (zero) then PC+1 else absolute

L19-8


Instruction Execution

April 29, 2021

Execution of a MIPS instruction involves

1. instruction fetch2. decode and register fetch3. ALU operation4. memory operation (optional)5. write back to register file (optional)

+ the computation of the next instruction address

L19-9


Instruction Fetch

April 29, 2021

State Control points next-state

fetch0 MA PC fetch1 IR Memoryfetch2 A PCfetch3 PC A + 4 ...

ALU0 A Reg[rs] ALU1 B Reg[rt] ALU2 Reg[rd]func(A,B)

ALUi0 A Reg[rs] ALUi1 B sExt16(Imm)ALUi2 Reg[rd] Op(A,B)

L19-10


Load & Store

April 29, 2021


LW0 A Reg[rs] next LW1 B sExt16(Imm) nextLW2 MA A+B nextLW3 Reg[rt] Memory spinLW4 fetch

SW0 A Reg[rs] next SW1 B sExt16(Imm) nextSW2 MA A+B nextSW3 Memory Reg[rt] spinSW4 fetch

L19-11


Branches

April 29, 2021


BEQZ0 A Reg[rs] nextBEQZ1 fnezBEQZ2 A PC nextBEQZ3 B sExt16(Imm


Jumps

April 29, 2021


J0 A PC nextJ1 B IR nextJ2 PC JumpTarg(A,B) fetch

JR0 A Reg[rs] nextJR1 PC A fetch

JAL0 A PC next JAL1 Reg[31] A next JAL2 B IR nextJAL3 PC JumpTarg(A,B) fetch

JALR0 A PC next JALR1 B Reg[rs] nextJALR2 Reg[31] A next JALR3 PC B fetch

L19-13


VAX 11-780 Microcode (1978)

April 29, 2021 L19-14

L19-15MIT 6.823 Spring 2021

Very Long Instruction Word (VLIW) Processors

April 29, 2021


Check instruction dependencies

Superscalar processor

Sequential ISA Bottleneck

April 29, 2021

a = foo(b);

for (i=0, i<

Sequential source code Superscalar compiler

Find independent operations

Schedule operations

Sequential machine code

Schedule execution

L19-16


VLIW: Very Long Instruction Word

• Multiple operations packed into one instruction• Each operation slot is for a fixed function• Constant operation latencies are specified

April 29, 2021

Two Integer Units,Single Cycle Latency

Two Load/Store Units,Three Cycle Latency Two Floating-Point Units,

Four Cycle Latency

Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2Int Op 1

L19-17


VLIW Design Principles

The architecture:

• Allows operation parallelism within an instruction– No cross-operation RAW check

• Provides deterministic latency for all operations– Latency measured in ‘instructions’– No data use allowed before specified latency with no data

interlocks

The compiler:

• Schedules (reorders) to maximize parallel execution• Guarantees intra-instruction parallelism• Schedules to avoid data hazards (no interlocks)

– Typically separates operations with explicit NOPs

April 29, 2021 L19-18


Early VLIW Machines

• FPS AP120B (1976)– scientific attached array processor– first commercial wide instruction machine– hand-coded vector math libraries using software pipelining and

loop unrolling

• Multiflow Trace (1987)– commercialization of ideas from Fisher’s Yale group including

“trace scheduling”– available in configurations with 7, 14, or 28

operations/instruction

– 28 operations packed into a 1024-bit instruction word

• Cydrome Cydra-5 (1987)– 7 operations encoded in 256-bit instruction word– rotating register file

April 29, 2021 L19-19


Loop Execution

April 29, 2021

How many FP ops/cycle?

for (i=0; i


Loop Unrolling

April 29, 2021

for (i=0; i


Scheduling Loop Unrolled Code

April 29, 2021

loop: ld f1, 0(r1)

ld f2, 8(r1)

ld f3, 16(r1)

ld f4, 24(r1)

add r1, 32

fadd f5, f0, f1

fadd f6, f0, f2

fadd f7, f0, f3

fadd f8, f0, f4

sd f5, 0(r2)

sd f6, 8(r2)

sd f7, 16(r2)

sd f8, 24(r2)

add r2, 32

bne r1, r3, loop

Schedule

Int1 Int 2 M1 M2 FP+ FPx

loop:

Unroll 4 ways

ld f1

ld f2

ld f3

ld f4add r1 fadd f5

fadd f6

fadd f7

fadd f8

sd f5

sd f6

sd f7

sd f8add r2 bne

How many FLOPS/cycle?

L19-22


Software Pipelining

April 29, 2021

How many FLOPS/cycle?

loop: ld f1, 0(r1)

ld f2, 8(r1)

ld f3, 16(r1)

ld f4, 24(r1)

add r1, 32

fadd f5, f0, f1

fadd f6, f0, f2

fadd f7, f0, f3

fadd f8, f0, f4

sd f5, 0(r2)

sd f6, 8(r2)

sd f7, 16(r2)

add r2, 32

sd f8, -8(r2)

bne r1, r3, loop

Int1 Int 2 M1 M2 FP+ FPxUnroll 4 ways first

ld f1

ld f2

ld f3

ld f4

fadd f5

fadd f6

fadd f7

fadd f8

sd f5

sd f6

sd f7

sd f8

add r1

add r2

bne

ld f1

ld f2

ld f3

ld f4

fadd f5

fadd f6

fadd f7

fadd f8

sd f5

sd f6

sd f7

sd f8

add r1

add r2

bne

ld f1

ld f2

ld f3

ld f4

fadd f5

fadd f6

fadd f7

fadd f8

sd f5

add r1

loop:iterate

prolog

epilog

L19-23


Software Pipelining vs. Unrolling

April 29, 2021

time

performance

time

performance

Loop Unrolling

Software Pipelining

Startup overhead

Wind-down overhead

Loop Iteration

Loop Iteration

Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

L19-24


What if there are no loops?

• Branches limit basic block size in control-flow intensive irregular code

• Difficult to find ILP in individual basic blocks

April 29, 2021

Basic block

L19-25


Trace Scheduling[Fisher,Ellis]

• Pick string of basic blocks, a trace, that represents most frequent branch path

• Schedule whole “trace” at once• Add fixup code to cope with

branches jumping out of trace

April 29, 2021

How do we know which trace to pick?

L19-26


Problems with “Classic” VLIW

• Knowing branch probabilities– Profiling requires an significant extra step in build process

• Scheduling for statically unpredictable branches– Optimal schedule varies with branch path

• Object code size– Instruction padding wastes instruction memory/cache– Loop unrolling/software pipelining replicates code

• Scheduling memory operations– Caches and/or memory bank conflicts impose statically

unpredictable variability

– Uncertainty about addresses limit code reordering

• Object-code compatibility– Have to recompile all code for every machine, even for two

machines in same generation

April 29, 2021 L19-27


VLIW Instruction Encoding

• Schemes to reduce effect of unused fields– Compressed format in memory, expand on I-cache refill

• used in Multiflow Trace• introduces instruction addressing challenge

– Provide a single-op VLIW instruction• Cydra-5 UniOp instructions

– Mark parallel groups• used in TMS320C6x DSPs, Intel IA-64

April 29, 2021

Group 1 Group 2 Group 3

L19-28


Cydra-5:Memory Latency Register (MLR)

• Problem: Loads have variable latency• Solution: Let software choose desired memory latency

• Compiler schedules code for maximum load-use distance

• Software sets MLR to latency that matches code schedule

• Hardware ensures that loads take exactly MLR cycles to return values into processor pipeline

– Hardware buffers loads that return early– Hardware stalls processor if loads return late

April 29, 2021 L19-29


IA-64 Predicated Execution

April 29, 2021

Problem: Mispredicted branches limit ILP

Solution: Eliminate hard-to-predict branches with predicated execution– Almost all IA-64 instructions can be executed conditionally under predicate– Instruction becomes NOP if predicate register false

Inst 1Inst 2br a==b, b2

Inst 3Inst 4br b3

Inst 5Inst 6

Inst 7Inst 8

b0:

b1:

b2:

b3:

if

else

then

Four basic blocks

Inst 1

Inst 2

p1,p2 cmp(a==b)(p1) Inst 3 || (p2) Inst 5

(p1) Inst 4 || (p2) Inst 6

Inst 7

Inst 8

Predication

One basic block

Mahlke et al, ISCA95: On average >50% branches removed

L19-30


Fully Bypassed Datapath

April 29, 2021

ASrcIRIR IR

PCA

B

Y

R

MD1 MD2

addrinst

InstMemory

0x4

Add

IR ALU

ImmExt

rd1

GPRs

rs1rs2

wswd rd2

we

wdata

addr

wdata

rdataData Memory

we

31

nop

stall

D

E M W

PC for JAL, ...

BSrc

Where does predication fit in?

L19-31


IA-64 Speculative Execution

April 29, 2021

Problem: Branches restrict compiler code motion

Inst 1Inst 2br a==b, b2

Load r1Use r1Inst 3

Can’t move load above branch because might cause spurious

exception

Load.s r1Inst 1Inst 2br a==b, b2

Chk.s r1Use r1Inst 3

Speculative load

never causes

exception, but sets

“poison” bit on destination register

Check for exception in

original home block

jumps to fixup code if

exception detected

Particularly useful for scheduling long latency loads early

Solution: Speculative operations that don’t cause exceptions

L19-32


IA-64 Data Speculation

April 29, 2021

Problem: Possible memory hazards limit code scheduling

Requires associative hardware in address check table

Inst 1Inst 2Store

Load r1Use r1Inst 3

Can’t move load above store because store might be to same

address

Load.a r1Inst 1Inst 2Store

Load.cUse r1Inst 3

Data speculative load

adds address to address

check table

Store invalidates any

matching loads in

address check table

Check if load invalid (or

missing), jump to fixup code

if so

Solution: Instruction-based speculation with hardware monitor to check for pointer hazards

L19-33


Clustered VLIW

• Divide machine into clusters of local register files and local functional units

• Lower bandwidth/higher latency interconnect between clusters

• Software responsible for mapping computations to minimize communication overhead

• Common in commercial embedded processors, examples include TI C6x series DSPs, and HP Lx processor

• Exists in some superscalar processors, e.g., Alpha 21264

April 29, 2021

Cluster Interconnect

Local Regfile

Local Regfile

Memory Interconnect

Cache/Memory Banks

Cluster

L19-34


Limits of Static Scheduling

• Unpredictable branches• Unpredictable memory behavior

(cache misses and dependencies)

• Code size explosion• Compiler complexity

April 29, 2021

Question:

How applicable are the VLIW-inspired techniques to traditional RISC/CISC processor architectures?

L19-35

L19-36MIT 6.823 Spring 2021

Thank you!

Next Lecture: Vector Processors

April 29, 2021

Microcoded and VLIW Processorscsg.csail.mit.edu/6.823/Lectures/L19handout.pdfld f2, 8(r1) ld f3,...

Documents

Transcript of Microcoded and VLIW Processorscsg.csail.mit.edu/6.823/Lectures/L19handout.pdfld f2, 8(r1) ld f3,...