Implementation Characterstics of Interconnect Mechanisms...

download Implementation Characterstics of Interconnect Mechanisms anup/homepage/files/presentatio2004-08-12Implementation Characterstics of Interconnect ... Clustered VLIW Processors and Performance

of 42

  • date post

    18-Apr-2018
  • Category

    Documents

  • view

    215
  • download

    2

Embed Size (px)

Transcript of Implementation Characterstics of Interconnect Mechanisms...

  • Implementation Characterstics of InterconnectMechanisms in Clustered VLIW Architectures

    Anup Gangwar

    Embedded Systems Group,Department of Computer Science and Engineering,

    Indian Institute of Technology Delhihttp://embedded.cse.iitd.ernet.in

    August 12, 2004

    (Joint work with M. Balakrishnan, Preeti R. Panda and Anshul Kumar)

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 1/32

  • Outline

    VLIW Features and Typical Datapath Organization

    Clustered VLIW Processors and Performance Evaluation of

    Interconnects

    Objectives of Clock-period Evaluation Excercise

    Architecture of Modeled Processors

    Clock-period Evaluation Flow

    Experimental Results

    Conclusions

    Future work

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 2/32

  • VLIW Architecture and Features

    Compiler extracts parallelism, these have evolved fromhorizontal microcoded architectures

    Latest industry coined acronym, EPIC for Explicitly ParallelInstruction Computing

    Commercial Architectures:

    General Purpose Computing: Intel Itanium

    Embedded Computing: TriMedia, TiC6x, Suns MAJC etc.

    RISC SuperScalar VLIW (4 issue)ADD r1, r2, r3 ADD r1, r2, r3 ADD r1, r2, r3 | SUB r4, r2, r3 | NOP | NOP

    SUB r4, r2, r3 SUB r4, r2, r3 MUL r5, r1, r4 | NOP | NOP | NOP

    MUL r5, r1, r4 MUL r5, r1, r4

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 3/32

  • VLIW Architecture and Features (contd)

    Advantages:Simplified hardware: Suitable for customization

    Less power consumption as compared to SuperScalar processors

    High performance

    Disadvantages:Complicated compiler: limits retargetability

    Code size blow up due to explicit NOPs.

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 4/32

  • Typical Organization of VLIW Processor

    ALU LD/ST ALU ALU LD/ST ALU

    R.F.

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 5/32

  • Outline

    VLIW Features and Typical Datapath Organization

    Clustered VLIW Processors and Performance Evaluation of

    Interconnects

    Objectives of Clock-period Evaluation Excercise

    Architecture of Modeled Processors

    Clock-period Evaluation Flow

    Experimental Results

    Conclusions

    Future work

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 6/32

  • Clustered VLIW Processors

    For N functional units connected to a RF the Area grows as

    N3, Delay as N32 and Power as N3 (Rixner et. al, HPCA 2000)

    Solutions is to break up the RF into a set of smaller RFs

    ....Register File 1 Register File 2 Register File 3

    Memory System

    Interconnection Network

    Cluster 1 Cluster 2 Cluster 3

    FU1 FU2 FU3 FU1 FU2 FU3 FU1 FU2 FU3

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 7/32

  • Clustered VLIW Architectures

    RF-to-RF

    LD/ST ALU LD/ST ALUALU ALU

    R.F. R.F.

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32

  • Clustered VLIW Architectures

    Write Across (1)

    ALU ALU

    R.F. R.F.

    ALULD/ST LD/ST ALU

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32

  • Clustered VLIW Architectures

    Read Across (1)

    ALU ALU

    R.F. R.F.

    LD/ST ALU LD/ST ALU

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32

  • Clustered VLIW Architectures

    Write/Read Across (1)

    ALU ALU

    R.F. R.F.

    LD/ST ALU LD/ST ALU

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32

  • Clustered VLIW Architectures

    Write Across (2)

    ALU ALU

    R.F. R.F.

    ALULD/ST LD/ST ALU

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32

  • Clustered VLIW Architectures

    Read Across (2)

    ALU ALU

    R.F. R.F.

    LD/ST ALU LD/ST ALU

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32

  • Clustered VLIW Architectures

    Write/Read Across (2)

    ALU ALU

    R.F. R.F.

    LD/ST ALU LD/ST ALU

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 8/32

  • Set of Evaluated Benchmarks

    Primary benchmark source is DSPStone and MediaBench

    Pick up common set of benchmarks from proposedMediaBench II

    Benchmarks:

    DSPStone Kernels MediaBench

    Matrix Initialization JPEG Decoder

    IDCT JPEG Encoder

    Biquad MPEG2 Decoder

    Lattice MPEG2 Encoder

    Matrix Multiplication G721 Decoder

    Insert Sort G721 Encoder

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 9/32

  • Evaluation Results

    0

    20

    40

    60

    80

    100

    0 2 4 6 8 10 12 14 16 18

    ILP

    as

    Frac

    tion

    of P

    ure

    VLI

    W (%

    )

    No. of Clusters

    RFWA.1RA.1WR.1WA.2RA.2WR.2

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 10/32

  • Outline

    VLIW Features and Typical Datapath Organization

    Clustered VLIW Processors and Performance Evaluation of

    Interconnects

    Objectives of Clock-period Evaluation Excercise

    Architecture of Modeled Processors

    Clock-period Evaluation Flow

    Experimental Results

    Conclusions

    Future work

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 11/32

  • Objectives of Clock-period Evaluation

    Excercise

    For a comprehensive performance estimate both thenumber of cycles and cycle time itself are necessary

    How does clock-period vary for the various types ofInterconnects

    What is the relative increase in total chip interconnect areabetween different interconnects

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 12/32

  • Outline

    VLIW Features and Typical Datapath Organization

    Clustered VLIW Processors and Performance Evaluation of

    Interconnects

    Objectives of Clock-period Evaluation Excercise

    Architecture of Modeled Processors

    Clock-period Evaluation Flow

    Experimental Results

    Conclusions

    Future work

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 13/32

  • Instruction Set Architecture of Processors

    Mnemonic Explanation Cycles

    NOP No Operation 1ADD Rdest Rsrc1 +Rsrc2 1SUB Rdest Rsrc1 Rsrc2 1MUL Rdest Rsrc1 Rsrc2 3DIV Rdest Rsrc1/Rsrc2 3MOV Rdest Rsrc 1LD Rdest MEM[Rsrc1] 3ST MEM[Rdest ] Rsrc1 3CMP Rdest Rsrc1 > Rsrc2 1BRL Branch if less to Address 3BRA Branch always to Address 3MVIH Load (Immediate) higher order bytes of Rdest 1MVIL Load (Immediate) lower order bytes of Rdest 1RFMV Inter-cluster move for RF-to-RF Architectures 1

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 14/32

  • Pipeline of Processors

    InstructionFetch

    InstructionDecode

    InstructionExecute

    Put Away

    ALU Operations

    InstructionFetch

    InstructionDecode Put Away

    Memory

    Access

    Memory Operations

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 15/32

  • Instruction Encoding of Processors

    4 Bits log2(RF Size) log2(RF Size)

    InstructionSource Source

    Register 2Destination

    RegisterRegister 1

    ALU Operations for WA

    log2(RF Size)+1

    4 Bits

    Instruction

    4 Bits log2(RF Size)

    InstructionSource Source

    Register 2Destination

    RegisterRegister 1

    ALU Operations for RA

    log2(RF Size)log2(RF Size)+1

    RFMV Operation

    log2(RF Size) 4 Bits log2(RF Size)

    SourceRegister 1

    DestinationRegisterCluster

    Destination

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 16/32

  • Detailed Architecture for WA.1

    ALULD/STALU

    Register File

    ALULD/STALU

    Register File

    Dec

    odin

    g Lo

    gic

    Full Bypass Within a Cluster

    Dec

    odin

    g Lo

    gic

    Full Bypass Within a Cluster

    Condition CodeCondition CodeInstruction Instruction

    One ClusterOne Cluster

    To/From OffChip Data Memory

    From OffChip Instruction Memory

    To/From OffChip Data Memory

    From

    Pre

    viou

    s C

    lust

    er

    To N

    ext C

    lust

    er

    Instruction Fetch and Branch Handling Logic

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 17/32

  • Outline

    VLIW Features and Typical Datapath Organization

    Clustered VLIW Processors and Performance Evaluation of

    Interconnects

    Objectives of Clock-period Evaluation Excercise

    Architecture of Modeled Processors

    Clock-period Evaluation Flow

    Experimental Results

    Conclusions

    Future work

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 18/32

  • Evaluation Flow for ASIC Technologies

    ParameterizedRTL VHDL

    Libr

    arie

    sA

    SIC

    Clu

    ster

    Mod

    el G

    ener

    atio

    n

    Synopsys DC

    CadenceSoC Encounter

    Synopsys DC

    CadenceSoC Encounter

    Static TimingAnalysis

    InterconnectClockperiodArea

    Cluster Blackbox Timingand P&R Models

    Ph.D. Thursday Seminar Series http://embedded.cse.iitd.ernet.in Slide 19/32

  • Pin Organization for Clusters

    RF-to-RF Cluster

    Clk

    iram

    _dat

    a

    CC

    Res

    etn

    ldst

    _ram

    _(al