Microcoded and VLIW Processorscsg.csail.mit.edu/6.823/Lectures/L19handout.pdfld f2, 8(r1) ld f3,...

36
L19-1 MIT 6.823 Spring 2021 Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. Microcoded and VLIW Processors April 29, 2021

Transcript of Microcoded and VLIW Processorscsg.csail.mit.edu/6.823/Lectures/L19handout.pdfld f2, 8(r1) ld f3,...

  • L19-1MIT 6.823 Spring 2021

    Daniel SanchezComputer Science & Artificial Intelligence Lab

    M.I.T.

    Microcoded and VLIW Processors

    April 29, 2021

  • MIT 6.823 Spring 2021

    Hardwired vs Microcoded Processors

    • All processors we have seen so far are hardwired: The microarchitecture directly implements all the instructions in the ISA

    • Microcoded processors add a layer of interpretation: Each ISA instruction is executed as a sequence of simpler microinstructions

    – Simpler implementation– Lower performance than hardwired (CPI > 1)

    • Microcoding common until the 80s, still in use today (e.g., complex x86 instructions are decoded into multiple “micro-ops”)

    April 29, 2021 L19-2

  • MIT 6.823 Spring 2021

    Microcontrol Unit[Maurice Wilkes, 1954]

    April 29, 2021

    Embed the control logic state table in a read-only memory array

    Matrix A Matrix B

    Decoder

    Next state

    op conditionalcode flip-flop

    address

    Control lines toALU, MUXs, Registers

    L19-3

  • MIT 6.823 Spring 2021

    Microcoded Microarchitecture

    April 29, 2021

    Memory(RAM)

    Datapath

    controller(ROM)

    AddrData

    zero?

    busy?

    opcode

    enMem

    MemWrt

    holds fixedmicrocode instructions

    holds user program written in macrocode

    instructions (e.g., MIPS, x86, etc.)

    L19-4

  • MIT 6.823 Spring 2021

    A Bus-based Datapath for MIPS

    April 29, 2021

    Microinstruction: register to register transfer (17 control signals)MA PC means RegSel = PC; enReg=yes; ldMA= yes

    B Reg[rt] means

    enMem

    MA

    addr

    data

    ldMA

    Memory

    busy

    MemWrt

    Bus 32

    zero?

    A B

    OpSel ldA ldB

    ALU

    enALU

    ALUcontrol

    2

    RegWrt

    enReg

    addr

    data

    rsrtrd

    32(PC)31(Link)

    RegSel

    32 GPRs+ PC ...

    32-bit Reg

    3

    rsrtrd

    ExtSel

    IR

    Opcode

    ldIR

    ImmExt

    enImm

    2

    L19-5

  • MIT 6.823 Spring 2021

    Memory Module

    • Assumption: Memory operates asynchronously and is slow compared to Reg-to-Reg transfers

    April 29, 2021

    Enable

    Write(1)/Read(0)RAM

    din dout

    we

    addr busy

    bus

    L19-6

  • MIT 6.823 Spring 2021

    Microcode Controller

    April 29, 2021

    JumpType =next | spin

    | fetch | dispatch| feqz | fnez

    Control Signals (17)

    Control ROM

    address

    data

    +1

    Opcode ext

    PC (state)

    jumplogic

    zero

    PC PC+1

    absolute

    op-group

    busy

    PCSrcinput encoding reduces

    ROM height

    next-state encoding reduces ROM width

    L19-7

  • MIT 6.823 Spring 2021

    Jump Logic

    April 29, 2021

    PCSrc = Case JumpTypes

    next PC+1

    spin if (busy) then PC else PC+1

    fetch absolute

    dispatch op-group

    feqz if (zero) then absolute else PC+1

    fnez if (zero) then PC+1 else absolute

    L19-8

  • MIT 6.823 Spring 2021

    Instruction Execution

    April 29, 2021

    Execution of a MIPS instruction involves

    1. instruction fetch2. decode and register fetch3. ALU operation4. memory operation (optional)5. write back to register file (optional)

    + the computation of the next instruction address

    L19-9

  • MIT 6.823 Spring 2021

    Instruction Fetch

    April 29, 2021

    State Control points next-state

    fetch0 MA PC fetch1 IR Memoryfetch2 A PCfetch3 PC A + 4 ...

    ALU0 A Reg[rs] ALU1 B Reg[rt] ALU2 Reg[rd]func(A,B)

    ALUi0 A Reg[rs] ALUi1 B sExt16(Imm)ALUi2 Reg[rd] Op(A,B)

    L19-10

  • MIT 6.823 Spring 2021

    Load & Store

    April 29, 2021

    State Control points next-state

    LW0 A Reg[rs] next LW1 B sExt16(Imm) nextLW2 MA A+B nextLW3 Reg[rt] Memory spinLW4 fetch

    SW0 A Reg[rs] next SW1 B sExt16(Imm) nextSW2 MA A+B nextSW3 Memory Reg[rt] spinSW4 fetch

    L19-11

  • MIT 6.823 Spring 2021

    Branches

    April 29, 2021

    State Control points next-state

    BEQZ0 A Reg[rs] nextBEQZ1 fnezBEQZ2 A PC nextBEQZ3 B sExt16(Imm

  • MIT 6.823 Spring 2021

    Jumps

    April 29, 2021

    State Control points next-state

    J0 A PC nextJ1 B IR nextJ2 PC JumpTarg(A,B) fetch

    JR0 A Reg[rs] nextJR1 PC A fetch

    JAL0 A PC next JAL1 Reg[31] A next JAL2 B IR nextJAL3 PC JumpTarg(A,B) fetch

    JALR0 A PC next JALR1 B Reg[rs] nextJALR2 Reg[31] A next JALR3 PC B fetch

    L19-13

  • MIT 6.823 Spring 2021

    VAX 11-780 Microcode (1978)

    April 29, 2021 L19-14

  • L19-15MIT 6.823 Spring 2021

    Very Long Instruction Word (VLIW) Processors

    April 29, 2021

  • MIT 6.823 Spring 2021

    Check instruction dependencies

    Superscalar processor

    Sequential ISA Bottleneck

    April 29, 2021

    a = foo(b);

    for (i=0, i<

    Sequential source code Superscalar compiler

    Find independent operations

    Schedule operations

    Sequential machine code

    Schedule execution

    L19-16

  • MIT 6.823 Spring 2021

    VLIW: Very Long Instruction Word

    • Multiple operations packed into one instruction• Each operation slot is for a fixed function• Constant operation latencies are specified

    April 29, 2021

    Two Integer Units,Single Cycle Latency

    Two Load/Store Units,Three Cycle Latency Two Floating-Point Units,

    Four Cycle Latency

    Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2Int Op 1

    L19-17

  • MIT 6.823 Spring 2021

    VLIW Design Principles

    The architecture:

    • Allows operation parallelism within an instruction– No cross-operation RAW check

    • Provides deterministic latency for all operations– Latency measured in ‘instructions’– No data use allowed before specified latency with no data

    interlocks

    The compiler:

    • Schedules (reorders) to maximize parallel execution• Guarantees intra-instruction parallelism• Schedules to avoid data hazards (no interlocks)

    – Typically separates operations with explicit NOPs

    April 29, 2021 L19-18

  • MIT 6.823 Spring 2021

    Early VLIW Machines

    • FPS AP120B (1976)– scientific attached array processor– first commercial wide instruction machine– hand-coded vector math libraries using software pipelining and

    loop unrolling

    • Multiflow Trace (1987)– commercialization of ideas from Fisher’s Yale group including

    “trace scheduling”– available in configurations with 7, 14, or 28

    operations/instruction

    – 28 operations packed into a 1024-bit instruction word

    • Cydrome Cydra-5 (1987)– 7 operations encoded in 256-bit instruction word– rotating register file

    April 29, 2021 L19-19

  • MIT 6.823 Spring 2021

    Loop Execution

    April 29, 2021

    How many FP ops/cycle?

    for (i=0; i

  • MIT 6.823 Spring 2021

    Loop Unrolling

    April 29, 2021

    for (i=0; i

  • MIT 6.823 Spring 2021

    Scheduling Loop Unrolled Code

    April 29, 2021

    loop: ld f1, 0(r1)

    ld f2, 8(r1)

    ld f3, 16(r1)

    ld f4, 24(r1)

    add r1, 32

    fadd f5, f0, f1

    fadd f6, f0, f2

    fadd f7, f0, f3

    fadd f8, f0, f4

    sd f5, 0(r2)

    sd f6, 8(r2)

    sd f7, 16(r2)

    sd f8, 24(r2)

    add r2, 32

    bne r1, r3, loop

    Schedule

    Int1 Int 2 M1 M2 FP+ FPx

    loop:

    Unroll 4 ways

    ld f1

    ld f2

    ld f3

    ld f4add r1 fadd f5

    fadd f6

    fadd f7

    fadd f8

    sd f5

    sd f6

    sd f7

    sd f8add r2 bne

    How many FLOPS/cycle?

    L19-22

  • MIT 6.823 Spring 2021

    Software Pipelining

    April 29, 2021

    How many FLOPS/cycle?

    loop: ld f1, 0(r1)

    ld f2, 8(r1)

    ld f3, 16(r1)

    ld f4, 24(r1)

    add r1, 32

    fadd f5, f0, f1

    fadd f6, f0, f2

    fadd f7, f0, f3

    fadd f8, f0, f4

    sd f5, 0(r2)

    sd f6, 8(r2)

    sd f7, 16(r2)

    add r2, 32

    sd f8, -8(r2)

    bne r1, r3, loop

    Int1 Int 2 M1 M2 FP+ FPxUnroll 4 ways first

    ld f1

    ld f2

    ld f3

    ld f4

    fadd f5

    fadd f6

    fadd f7

    fadd f8

    sd f5

    sd f6

    sd f7

    sd f8

    add r1

    add r2

    bne

    ld f1

    ld f2

    ld f3

    ld f4

    fadd f5

    fadd f6

    fadd f7

    fadd f8

    sd f5

    sd f6

    sd f7

    sd f8

    add r1

    add r2

    bne

    ld f1

    ld f2

    ld f3

    ld f4

    fadd f5

    fadd f6

    fadd f7

    fadd f8

    sd f5

    add r1

    loop:iterate

    prolog

    epilog

    L19-23

  • MIT 6.823 Spring 2021

    Software Pipelining vs. Unrolling

    April 29, 2021

    time

    performance

    time

    performance

    Loop Unrolling

    Software Pipelining

    Startup overhead

    Wind-down overhead

    Loop Iteration

    Loop Iteration

    Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

    L19-24

  • MIT 6.823 Spring 2021

    What if there are no loops?

    • Branches limit basic block size in control-flow intensive irregular code

    • Difficult to find ILP in individual basic blocks

    April 29, 2021

    Basic block

    L19-25

  • MIT 6.823 Spring 2021

    Trace Scheduling[Fisher,Ellis]

    • Pick string of basic blocks, a trace, that represents most frequent branch path

    • Schedule whole “trace” at once• Add fixup code to cope with

    branches jumping out of trace

    April 29, 2021

    How do we know which trace to pick?

    L19-26

  • MIT 6.823 Spring 2021

    Problems with “Classic” VLIW

    • Knowing branch probabilities– Profiling requires an significant extra step in build process

    • Scheduling for statically unpredictable branches– Optimal schedule varies with branch path

    • Object code size– Instruction padding wastes instruction memory/cache– Loop unrolling/software pipelining replicates code

    • Scheduling memory operations– Caches and/or memory bank conflicts impose statically

    unpredictable variability

    – Uncertainty about addresses limit code reordering

    • Object-code compatibility– Have to recompile all code for every machine, even for two

    machines in same generation

    April 29, 2021 L19-27

  • MIT 6.823 Spring 2021

    VLIW Instruction Encoding

    • Schemes to reduce effect of unused fields– Compressed format in memory, expand on I-cache refill

    • used in Multiflow Trace• introduces instruction addressing challenge

    – Provide a single-op VLIW instruction• Cydra-5 UniOp instructions

    – Mark parallel groups• used in TMS320C6x DSPs, Intel IA-64

    April 29, 2021

    Group 1 Group 2 Group 3

    L19-28

  • MIT 6.823 Spring 2021

    Cydra-5:Memory Latency Register (MLR)

    • Problem: Loads have variable latency• Solution: Let software choose desired memory latency

    • Compiler schedules code for maximum load-use distance

    • Software sets MLR to latency that matches code schedule

    • Hardware ensures that loads take exactly MLR cycles to return values into processor pipeline

    – Hardware buffers loads that return early– Hardware stalls processor if loads return late

    April 29, 2021 L19-29

  • MIT 6.823 Spring 2021

    IA-64 Predicated Execution

    April 29, 2021

    Problem: Mispredicted branches limit ILP

    Solution: Eliminate hard-to-predict branches with predicated execution– Almost all IA-64 instructions can be executed conditionally under predicate– Instruction becomes NOP if predicate register false

    Inst 1Inst 2br a==b, b2

    Inst 3Inst 4br b3

    Inst 5Inst 6

    Inst 7Inst 8

    b0:

    b1:

    b2:

    b3:

    if

    else

    then

    Four basic blocks

    Inst 1

    Inst 2

    p1,p2 cmp(a==b)(p1) Inst 3 || (p2) Inst 5

    (p1) Inst 4 || (p2) Inst 6

    Inst 7

    Inst 8

    Predication

    One basic block

    Mahlke et al, ISCA95: On average >50% branches removed

    L19-30

  • MIT 6.823 Spring 2021

    Fully Bypassed Datapath

    April 29, 2021

    ASrcIRIR IR

    PCA

    B

    Y

    R

    MD1 MD2

    addrinst

    InstMemory

    0x4

    Add

    IR ALU

    ImmExt

    rd1

    GPRs

    rs1rs2

    wswd rd2

    we

    wdata

    addr

    wdata

    rdataData Memory

    we

    31

    nop

    stall

    D

    E M W

    PC for JAL, ...

    BSrc

    Where does predication fit in?

    L19-31

  • MIT 6.823 Spring 2021

    IA-64 Speculative Execution

    April 29, 2021

    Problem: Branches restrict compiler code motion

    Inst 1Inst 2br a==b, b2

    Load r1Use r1Inst 3

    Can’t move load above branch because might cause spurious

    exception

    Load.s r1Inst 1Inst 2br a==b, b2

    Chk.s r1Use r1Inst 3

    Speculative load

    never causes

    exception, but sets

    “poison” bit on destination register

    Check for exception in

    original home block

    jumps to fixup code if

    exception detected

    Particularly useful for scheduling long latency loads early

    Solution: Speculative operations that don’t cause exceptions

    L19-32

  • MIT 6.823 Spring 2021

    IA-64 Data Speculation

    April 29, 2021

    Problem: Possible memory hazards limit code scheduling

    Requires associative hardware in address check table

    Inst 1Inst 2Store

    Load r1Use r1Inst 3

    Can’t move load above store because store might be to same

    address

    Load.a r1Inst 1Inst 2Store

    Load.cUse r1Inst 3

    Data speculative load

    adds address to address

    check table

    Store invalidates any

    matching loads in

    address check table

    Check if load invalid (or

    missing), jump to fixup code

    if so

    Solution: Instruction-based speculation with hardware monitor to check for pointer hazards

    L19-33

  • MIT 6.823 Spring 2021

    Clustered VLIW

    • Divide machine into clusters of local register files and local functional units

    • Lower bandwidth/higher latency interconnect between clusters

    • Software responsible for mapping computations to minimize communication overhead

    • Common in commercial embedded processors, examples include TI C6x series DSPs, and HP Lx processor

    • Exists in some superscalar processors, e.g., Alpha 21264

    April 29, 2021

    Cluster Interconnect

    Local Regfile

    Local Regfile

    Memory Interconnect

    Cache/Memory Banks

    Cluster

    L19-34

  • MIT 6.823 Spring 2021

    Limits of Static Scheduling

    • Unpredictable branches• Unpredictable memory behavior

    (cache misses and dependencies)

    • Code size explosion• Compiler complexity

    April 29, 2021

    Question:

    How applicable are the VLIW-inspired techniques to traditional RISC/CISC processor architectures?

    L19-35

  • L19-36MIT 6.823 Spring 2021

    Thank you!

    Next Lecture: Vector Processors

    April 29, 2021