MPSoC Design Flow: Case Study for Hcadlab.cs.ucla.edu/icsoc/protected-dir/IC-DFN_Agenda_Aug...MPSoC...

MPSoC Design Flow: Case Study for H.264. Kai Huang, Institute of VLSI Design, Zhejiang University, China. August, ICDFN 2007. "Simulink-Based MPSoC Design Flow: Case Study of MJPEG and H.264", published at DAC 2007.


  • MPSoC Design Flow: Case Study for H.264 †

    Kai Huang (e-mail address at vlsi.zju.edu.cn)

    Institute of VLSI Design, Zhejiang University, China

    August, ICDFN 2007

    † "Simulink-Based MPSoC Design Flow: Case Study of MJPEG and H.264", published at DAC 2007

  • 2

    Motivation

    System level design: Key solution to complex MPSoC

    SystemC
    A preferred HW/SW codesign language
    Within a wide range of abstraction levels
    Intrinsic low-level language: NOT easy to specify a complex system at algorithm level

    Simulink
    The prevailing environment for modeling and simulating complex systems at algorithm level
    An open issue: Algorithm/Architecture mapping for MPSoC

    [Figure: the system design GAP between the application algorithm (Simulink) and the architecture (SystemC)]

    System Level: Algorithm/Architecture mapping
    Virtual Architecture Level: Protocol Accurate
    Transaction Accurate Level: Synchronization Accurate
    Virtual Prototype Level: Cycle Accurate
    Physical implementation

  • 3

    Objective

    1) System level MPSoC design flow

    Combine Simulink with SystemC: Simulink for high-level algorithm modeling, SystemC for low-level HW/SW design

    Concurrent HW/SW design: seamless refinement across models at different abstraction levels, systematic and automated HW/SW code generation

    2) Case study for multimedia applications

    Feasibility and efficiency of the proposed design flow: system functional validation, HW/SW co-design/co-verification, performance analysis for architecture exploration

  • 4

    Content

    Motivation & Objective

    Simulink-Based MPSoC Design Flow: HW and SW Mixed Models, Design Steps

    Case Study: H.264 Baseline Decoder, Experiment

    Conclusions & Future Works

  • 5

    Overall Design Flow

    [Figure: overall design flow, with labels "Step i" and "Mixed HW and SW Model"]

  • 6

    HW and SW Mixed Models

    Seamless refinement at four abstraction levels, from high to low:

    System level model (Simulink CAAM: Simulink Combined Algorithm and Architecture Model)
    Virtual architecture model (SystemC)
    Transaction accurate model (SystemC)
    Virtual prototype model (SystemC)

    [Figure: hardware design and software design views of CPU subsystems SS1 to SS3, each with application threads (Thread Main 1, Thread Main 2), an HdS layer and ports, connected by abstract channels or an interconnecting bus; annotations: HW-SW codesign, OS and HW codesign, target CPU design]

  • 7

    Step 1: Simulink Modeling

    The application C/C++ code is converted into a set of modular functions:

    User-defined Simulink blocks (e.g. S-function)
    Pre-defined Simulink blocks (e.g. mathematical operation)

  • 8

    Step 2: Application/Architecture Mapping

    [Figure: design flow overview. Application, (1) Simulink modeling, Simulink application model, (2) Application/Architecture mapping, Simulink CAAM, (3) Simulink parsing, Colif CAAM, (4) HW Architecture Gen. and (5) Multithread Code Gen., producing the Virtual Architecture Model, Transaction Accurate Model and Virtual Prototype Model. HW Library: Comp. Subsystems, Comm. Channels. SW Library: Thread Library, HdS Library.]

    [Figure: Simulink application model with functions F0 to F11, Fsw, IAS0, IAS1 and delays (Z-1, Z-k), mapped onto the HW architecture template: CPU 1 SS, CPU 2 SS, CPU 3 SS, GFIFO, Inter-SS Comm]

    Simulink CAAM

    1) Architecture layer: CPU SSs, Inter-SS comm.

    2) Subsystem layer: Threads, Intra-SS comm.

    3) Thread layer: Simulink blocks, links
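The three CAAM layers listed on this slide can be pictured as a nested data structure. The following is a minimal sketch, not the tool's actual representation: the class names, field names, the `check` helper and the block-to-CPU mapping are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Thread:
    """Thread layer: Simulink blocks and the links between them."""
    name: str
    blocks: List[str] = field(default_factory=list)
    links: List[Tuple[str, str]] = field(default_factory=list)  # (src block, dst block)


@dataclass
class CpuSubsystem:
    """Subsystem layer: threads and intra-SS communication."""
    name: str
    threads: List[Thread] = field(default_factory=list)
    intra_ss_comm: List[Tuple[str, str]] = field(default_factory=list)  # (src thread, dst thread)


@dataclass
class Caam:
    """Architecture layer: CPU subsystems and inter-SS communication."""
    subsystems: List[CpuSubsystem] = field(default_factory=list)
    inter_ss_comm: List[Tuple[str, str, str]] = field(default_factory=list)  # (src SS, dst SS, protocol)

    def check(self) -> List[str]:
        """Return dangling references; an empty list means the model is consistent."""
        errors = []
        ss_names = {ss.name for ss in self.subsystems}
        for src, dst, _protocol in self.inter_ss_comm:
            for end in (src, dst):
                if end not in ss_names:
                    errors.append(f"unknown subsystem: {end}")
        for ss in self.subsystems:
            thread_names = {t.name for t in ss.threads}
            for src, dst in ss.intra_ss_comm:
                for end in (src, dst):
                    if end not in thread_names:
                        errors.append(f"{ss.name}: unknown thread {end}")
            for t in ss.threads:
                for src, dst in t.links:
                    for end in (src, dst):
                        if end not in t.blocks:
                            errors.append(f"{ss.name}/{t.name}: unknown block {end}")
        return errors


# Hypothetical mapping of a few blocks onto two CPU subsystems (not taken from the slides).
caam = Caam(
    subsystems=[
        CpuSubsystem("CPU1_SS", threads=[Thread("T0", ["F0", "F1"], [("F0", "F1")])]),
        CpuSubsystem(
            "CPU2_SS",
            threads=[Thread("T1", ["F2"]), Thread("T2", ["F3", "F4"], [("F3", "F4")])],
            intra_ss_comm=[("T1", "T2")],
        ),
    ],
    inter_ss_comm=[("CPU1_SS", "CPU2_SS", "GFIFO")],
)
print(caam.check())
```

Keeping each layer as its own type mirrors the slide's separation of inter-SS communication (architecture layer) from intra-SS communication (subsystem layer).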

  • 9

    Step 3: Simulink Parsing

    [Figure: design flow overview (Application, Simulink application model, Simulink CAAM, Colif CAAM, generated VA/TA/VP models, HW and SW libraries)]

    Colif is an XML-based meta-model with well-defined data structures: modules, channels and ports

    Simulink CAAM to Colif CAAM: one-to-one correspondence, with each Simulink port converted to a Send/Receive block
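To make the idea of an XML meta-model built from modules, channels and ports concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`caam`, `module`, `port`, `channel`, `role`) and the function `to_colif_like_xml` are hypothetical; the real Colif schema is not shown in the slides.

```python
import xml.etree.ElementTree as ET


def to_colif_like_xml(modules, channels):
    """Serialize modules, ports and channels to an XML string.

    modules:  {module name: {port name: "in" | "out"}}
    channels: list of (source "module.port", target "module.port")
    """
    root = ET.Element("caam")
    for mod_name, ports in modules.items():
        mod = ET.SubElement(root, "module", name=mod_name)
        for port_name, direction in ports.items():
            # Each port becomes a send or a receive endpoint.
            role = "send" if direction == "out" else "receive"
            ET.SubElement(mod, "port", name=port_name, role=role)
    for src, dst in channels:
        ET.SubElement(root, "channel", source=src, target=dst)
    return ET.tostring(root, encoding="unicode")


xml_text = to_colif_like_xml(
    {"CPU1_SS": {"out0": "out"}, "CPU2_SS": {"in0": "in"}},
    [("CPU1_SS.out0", "CPU2_SS.in0")],
)
print(xml_text)
```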

  • 10

    Step 4: HW Architecture Generation

    [Figure: design flow overview (Application, Simulink application model, Simulink CAAM, Colif CAAM, generated VA/TA/VP models, HW and SW libraries)]

    [Figure: VP HW platform with a global shared bus]

  • 11

    Step 5: Multithreaded Code Generation

    [Figure: design flow overview (Application, Simulink application model, Simulink CAAM, Colif CAAM, generated VA/TA/VP models, HW and SW libraries)]

    Copy removal and buffer sharing are used for memory optimization
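The slides do not describe how buffer sharing is implemented. As a rough illustration of the general idea only, the sketch below lets buffers whose lifetimes do not overlap reuse the same memory slot, using a simple greedy pass. The function `share_buffers`, the buffer names, sizes and schedule steps are all assumptions; copy removal is not sketched here.

```python
def share_buffers(buffers):
    """Greedily assign buffers to shared memory slots.

    buffers: list of (name, size_in_bytes, first_step, last_step), where the
    steps are positions in a static schedule. Two buffers may share a slot
    only if their [first_step, last_step] lifetimes do not overlap.
    """
    slots = []  # each slot: {"size": int, "free_after": int, "names": [str]}
    for name, size, first, last in sorted(buffers, key=lambda b: b[2]):
        for slot in slots:
            if slot["free_after"] < first:  # previous occupant is already dead
                slot["size"] = max(slot["size"], size)
                slot["free_after"] = last
                slot["names"].append(name)
                break
        else:
            slots.append({"size": size, "free_after": last, "names": [name]})
    return slots


# Hypothetical buffers of a decoding pipeline (sizes and lifetimes are made up).
buffers = [
    ("coeff", 256, 0, 1),
    ("residual", 256, 1, 2),
    ("pred", 256, 2, 3),
    ("recon", 384, 3, 4),
]
slots = share_buffers(buffers)
unshared_bytes = sum(size for _, size, _, _ in buffers)
shared_bytes = sum(slot["size"] for slot in slots)
print(unshared_bytes, shared_bytes)
```

In this made-up example four buffers fit into two slots, because `coeff` and `pred`, and likewise `residual` and `recon`, are never live at the same schedule step.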

  • 12

    Content

    Motivation & Objective

    Simulink-Based MPSoC Design Flow: HW and SW Mixed Models, Design Steps

    Case Study: H.264 Baseline Decoder, Experiment

    Conclusions & Future Works

  • 13

    H.264 Baseline Decoder

    Receives an encoded video bit stream and performs iterative executions of macroblock-level functions:

    VLD: Variable Length Decoder
    IQ: Inverse Quantization
    IT: Inverse Transform
    SC: Spatial Compensation
    MC: Motion Compensation
    REC: Reconstruction
    DF: Deblock Filter

    [Figure: decoder block diagram with Global ctrl and Macroblock VLD, Luma decoding and Chroma decoding. Luma decoding: Luminance VLD; IQ/IT, MC/SC and REC for the 1st to 4th 8x8 Luma blocks; Luminance DF. Chroma decoding: VLD, IQ/IT, MC/SC, REC and DF for Chroma U and Chroma V. Stage labels: VLD, MC/SC, REC & DF. Annotations: 2 times, 4 times, 8x8, 16x16.]

  • 14

    Simulink Model

    [Figure: Simulink model with VLD, Chroma U Decoding, Chroma V Decoding, Luma first to fourth 8x8 block SC/MC, and Luma REC]

    83 S-Functions
    24 delays
    286 data links
    43 if-action subsystems
    5 for-iteration subsystems
    101 pre-defined Simulink blocks

  • 15

    A Simulink CAAM Example

    4 CPU SS
    4 Threads
    Inter-SS COM: GFIFO
    Processor: ARM7

    [Figure: four CPU subsystems connected through GFIFO channels, running Global CTRL & VLD, Chroma Decoding, Luma MC/SC/REC, and Luma DF]

  • 16

    Experiment: Simulation Time

    An experiment decoding 10 frames of QCIF FOREMAN:

    Model | Simulation time | Speed
    VP | 11066 s | 6.5K/s
    TA | 325 s | 224K/s
    VA | 2.8 s | 26.0M/s
    Simulink | 20.0 s | 3.6M/s
    RTW | 0.5 s | 146M/s

    Four ARM7 processors (VP); GFIFO (TA, VP); RTW (a sequential program running on the host machine)

    VP simulation takes too long for debugging the whole system. It's necessary to make use of TA for HW/SW co-simulation.
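The simulation times can be compared with a few lines of arithmetic. This sketch assumes the five time/speed pairs on the slide belong to VP, TA, VA, Simulink and RTW in that order, and that the "/s" figures are simulated cycles per second; neither is stated explicitly. The products of time and speed all land near 72 million, which supports that pairing.

```python
# (simulation time in seconds, speed in assumed simulated cycles per second)
runs = {
    "VP": (11066.0, 6.5e3),
    "TA": (325.0, 224e3),
    "VA": (2.8, 26.0e6),
    "Simulink": (20.0, 3.6e6),
    "RTW": (0.5, 146e6),
}


def simulated_cycles(model):
    """Time multiplied by speed: total simulated cycles under the stated assumption."""
    seconds, rate = runs[model]
    return seconds * rate


def speedup(slow, fast):
    """How many times shorter the wall-clock time of `fast` is than that of `slow`."""
    return runs[slow][0] / runs[fast][0]


for model in runs:
    print(f"{model:9s} {simulated_cycles(model) / 1e6:6.1f} M cycles")
print(f"TA is about {speedup('VP', 'TA'):.0f}x faster than VP")
```

Under these assumptions TA runs roughly 34 times faster than VP, which is the quantitative basis for the slide's recommendation to use TA for HW/SW co-simulation.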

  • 17

    Experiment: Performance Optimization with Different Architectures

    A popular and simple task partition strategy was used in this experiment: computation-based, step by step.

    [Figure: task partition of functions F1 to F8; label SS1]

  • 18

    Experiment Result of H.264 Baseline Decoder with Three Architectures

    Function Block | 6ARM | 4ARM | 2ARM
    Global control and MB VLD | CPU1 | CPU1 | CPU1
    Chroma decoding | CPU2 | CPU2 | CPU2
    Luma VLD | CPU3 | CPU4 |
    Luma DF | CPU4 | CPU3 |
    Luma IQ/IT | CPU5 | CPU4 |
    Luma MC/SC and Luma Rec | CPU6 | |

    (a) Execution Time (total execution cycles): 2ARM 150, 4ARM 102, 6ARM 100

    [Figure: (b) H264_GFIFO_2ARM and (c) H264_GFIFO_4ARM, per-CPU bar charts for CPU1 to CPU4]

    Tradeoff between performance, cost and flexibility:

    - High-performance dedicated processor?

    - Fine-granularity task partition?
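The tradeoff can be quantified from the execution-time chart. This sketch assumes the bar values 150, 102 and 100 correspond to the 2ARM, 4ARM and 6ARM architectures in that order; the unit of the "total execution cycles" axis is not recoverable from the slide, so only ratios are computed.

```python
# Total execution cycles per architecture (assumed pairing, unit unknown).
cycles = {"2ARM": 150, "4ARM": 102, "6ARM": 100}
cpus = {"2ARM": 2, "4ARM": 4, "6ARM": 6}

baseline = "2ARM"
for arch, value in cycles.items():
    speedup = cycles[baseline] / value
    # Speedup divided by the growth in CPU count relative to the baseline.
    scaling_efficiency = speedup / (cpus[arch] / cpus[baseline])
    print(f"{arch}: speedup {speedup:.2f}x, scaling efficiency {scaling_efficiency:.2f}")
```

With these numbers, going from two to four processors gives about a 1.47x speedup, while going from four to six adds almost nothing, which is consistent with the slide's questions about a dedicated processor or a finer-grained task partition.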

  • 19

    Content

    Motivation & Objective

    Simulink-Based MPSoC Design Flow: HW and SW Mixed Models, Design Steps

    Case Study: H.264 Baseline Decoder, Experiment

    Conclusions & Future Works

  • 20

    Conclusions & Future Works

    Proposed a Simulink-based MPSoC design flow for automated concurrent hardware-software design and verification.

    The flow refines the Simulink CAAM into models at three different abstraction levels (VA, TA, VP).

    The H.264 decoder case study shows the feasibility and efficiency of the proposed design flow: functional evaluation, HW/SW codesign, performance analysis for architecture exploration.

    Plan to improve the current design flow: dedicated instruction sets, communication protocol with DMA, automatic design space exploration.

  • 21

    Acknowledgements

    Thanks to:

    Prof. Ahmed Jerraya (CEA-LETI, MINATEC, France)
    Sang-Il Han (Seoul National University, Korea)
    Katalin Popovici (TIMA Laboratory, France)
    Lisane Brisolara (Federal University of Rio Grande do Sul, Brazil)

  • 22

    Thank you very much for your attention!

  • 23

    APPENDIX

  • 24

    Simulink CAAM of M-JPEG Decoder

    [Figure: Architecture Layer, Subsystem Layer, Thread Layer]

    Four threads and three CPUs
    ARM processor and GFIFO/SWFIFO

    7 S-Functions
    7 Delays
    26 Links
    4 IASs

  • 25

    Inter/Intra-Subsystem Communication

    Protocol | Data buffer | Sender sync. addr | Receiver sync. addr | Description
    SWFIFO | Local memory | @ local memory | @ local memory | Software FIFO
    SHM | Local memory | @ local memory | @ local memory | Shared memory
    HWFIFO | Hardware queue | @ hardware FIFO | @ hardware FIFO | Hardware FIFO
    GFIFO | Shared memory | @ mailbox | @ mailbox | Software FIFO via global shared memory
    LFIFO | Local memories | @ mailbox | @ mailbox | Software FIFO via local memories
    DMS | Local memories | @ MSAP | @ MSAP | Distributed memory server with DMA

  • 26

    Code and Data Memory Size of H.264 Decoder

    S1: Single-thread without optimization options
    S2: Single-thread with copy removal
    S3: Single-thread with copy removal and buffer sharing
    M1: Multi-thread without optimization options
    M2: Multi-thread with copy removal
    M3: Multi-thread with copy removal and buffer sharing

    Data memory size (Kbyte), relative to RTW = 100; bars split into Constant, Channel, Buffer

    Configuration | Relative size | Size
    RTW | 100 | 27.0K
    S1 | 98.8 | 26.6K
    S2 | 58.1 | 15.7K
    S3 | 29.1 | 7.9K
    M1 | 110.3 | 29.7K
    M2 | 74.4 | 20.0K
    M3 | 35.3 | 9.5K

    Code memory size (Kbyte), relative to RTW = 100; bars split into App. library, HdS library, Thread+main

    Configuration | Relative size | Size
    RTW | 100 | 79.0K
    S1 | 97.7 | 77.2K
    S2 | 79.0 | 62.4K
    S3 | 78.7 | 62.1K
    M1 | 125.9 | 99.5K
    M2 | 105.3 | 83.2K
    M3 | 105.9 | 83.6K