Design Methodology for Customizable Programmable Processors

Berkeley – Finland Day, Oct. 18, 2002

Prof. Jarmo Takala
Institute of Digital and Computer Systems
Tampere University of Technology
Tampere, Finland

Tel: +358 – 33115 3879; Email: [email protected]


Outline

Motivation
Transport Triggered Architecture (TTA)
Design Methodology for TTAs
Research at TUT
Conclusions


Motivation

Programmable processors are often used in products that perform digital signal processing (DSP)
  Flexibility
  Ease of verification

Traditionally, DSP processor architectures have been developed based on average performance in several benchmark tasks (~100)
  User applications often contain only a subset of the total benchmarks

Efficiency can be improved by customizing the architecture according to the given tasks


Motivation

DSP applications are often hard real-time constrained
  execution should be deterministic
  dynamic run-time behaviours should be avoided

Static scheduling lends itself to DSP

Current design complexities call for increase in designer productivity

High level languages should be used

DSP algorithms contain inherent parallelism

Instruction level parallelism (ILP) should be maximized


What is needed?

Application driven design process with easy design space exploration

Replace hardware complexity by software complexity
  Compiler driven process

Use templated architecture
  Flexible: heterogeneous function units
  Modular: scalability
  Orthogonal: compiler friendly


Choices for Architecture Template

[Figure: ILP Architectures. The stages Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, and Bind Datapaths & Execute are divided between compilation time (software) and run time (hardware) for each architecture class: sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW).]


VLIW Gained Popularity in DSP

[Figure: VLIW CPU block diagram with Instruction Memory, Instruction Fetch, Instruction Decode, function units FU-1 to FU-5, Bypassing Network, Register File, and Data Memory.]


Transport Triggered Architecture

VLIW drawbacks
  Bypass complexity
  Register file complexity
  Register file design restricts FU flexibility
  Operation encoding format restricts FU flexibility

Reverse programming paradigm [H. Corporaal, 94]
  data transport triggers the operation

Instruction set contains only a single instruction: move


From VLIW to TTA

[Figure: two block diagrams side by side. VLIW: Instruction Memory, Instruction Fetch, Instruction Decode, FU-1 to FU-5, Bypassing Network, Register File, Data Memory. TTA: Instruction Fetch, Instruction Decode, Bypassing Network, FU-1 to FU-5, Register File.]


TTA Datapath

[Figure: TTA datapath with two Integer ALUs, a Float ALU, Boolean, Float and Integer register files (RF), two Load/Store Units, an Immediate Unit and an Instruction Unit, each attached through a socket, plus Instruction Memory and Data Memory.]


Function Units

Operands written to operand registers (O)

Operation performed when last operand written to trigger register (T)

Pipeline synchronized with control bits (C)

Standard interface
  FU_ready
  Result_ready
  Global_lock

[Figure: function unit pipeline with operand register (O) and optional shadow register, trigger register (T), logic stages, result register (R), and control bits (C) at each stage.]
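The triggering rule on this slide can be illustrated with a small software model. This is only a sketch under simplifying assumptions: the class `FunctionUnit` and its method names are hypothetical, the operation completes immediately (no pipeline stages or control bits), and the FU_ready, Result_ready and Global_lock signals are not modelled.

```python
import operator


class FunctionUnit:
    """Toy TTA function unit: operand (O), trigger (T) and result (R) registers."""

    def __init__(self, operation):
        self.operation = operation  # two-argument callable (assumed combinational)
        self.o = None  # operand register O
        self.r = None  # result register R

    def write_operand(self, value):
        # A move into O only stores the value; no operation starts.
        self.o = value

    def write_trigger(self, value):
        # A move into T supplies the last operand and starts the operation.
        self.r = self.operation(self.o, value)

    def read_result(self):
        # A move out of R delivers the result to its destination.
        return self.r


mul = FunctionUnit(operator.mul)
mul.write_operand(6)             # r1 -> mul.o
idle_result = mul.read_result()  # nothing has been triggered yet
mul.write_trigger(7)             # r2 -> mul.t
product = mul.read_result()      # mul.r -> r3
print(idle_result, product)      # prints: None 42
```

Only the write to the trigger register has a side effect, which is why the order of the two operand moves matters to the scheduler.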


ILP Architectures

[Figure: division of work between compilation time and run time for sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW) and independence (TTA) architectures, with the stages Frontend, Bind Datapaths and Execute shown.]


TTA Characteristics: HW

Modular
  Can be constructed with standard building blocks

Very flexible and scalable
  FU functionality can be arbitrary
  Supports user defined Special Function Units (SFU)

Lower complexity
  Reduction in the number of register ports
  Reduced bypass complexity
  Reduction in bypass connectivity
  Reduced register pressure
  Trivial decoding (implies long instructions)


TTA Characteristics: SW

Traditional operation-triggered instruction:

    mul r1,r2,r3;

Transport-triggered instruction:

    r1 → mul.o; r2 → mul.t; mul.r → r3;

or

    r1 → mul.o, r2 → mul.t; mul.r → r3;

Resembles dataflow and time-stationary coding
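The mapping from the operation-triggered form to moves can be sketched mechanically. The function `to_moves` below is hypothetical and assumes the operand order used on this slide (two sources followed by the destination); the ASCII `->` stands in for the arrow.

```python
def to_moves(instruction):
    """Split 'op src1,src2,dst;' into three transport-triggered moves."""
    opcode, operands = instruction.rstrip(";").split(None, 1)
    src1, src2, dst = [name.strip() for name in operands.split(",")]
    return [
        f"{src1} -> {opcode}.o",  # first source goes to the operand register
        f"{src2} -> {opcode}.t",  # second source goes to the trigger register
        f"{opcode}.r -> {dst}",   # result register goes to the destination
    ]


moves = to_moves("mul r1,r2,r3;")
print("; ".join(moves) + ";")  # prints: r1 -> mul.o; r2 -> mul.t; mul.r -> r3;
```

Whether the first two moves share one instruction (the comma form) or occupy two is a scheduling decision that depends on the number of buses available.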


TTA Design Tools

Design tools based on the TTA architecture template have been developed at Delft University of Technology (DUT), Delft, the Netherlands
  MOVE project led by Prof. Henk Corporaal
  Fully parametric C/C++ compiler
    buses, connections, function units, register files, etc.
  Design space explorer
  Processor generator


Code Generation Trajectory

[Figure (MOVE Project at DUT): code generation flow with the blocks Application (C/C++), Compiler Frontend (GCC or SUIF), Sequential Code, Sequential Simulator, Profiling Data, Architecture Description, Compiler Backend, Parallel Code, and Parallel Simulator, with I/O attached to both simulators.]


TTA Specific Optimizations

TTA allows extra scheduling optimizations
  E.g., software bypassing

Bypassing can eliminate the need for RF access

However, more difficult to schedule!

Example:

    r1 → add.o, r2 → add.t;
    add.r → r3;
    r3 → sub.o, r4 → sub.t;
    sub.r → r5;

Translates to:

    r1 → add.o, r2 → add.t;
    add.r → sub.o, r4 → sub.t;
    sub.r → r5;
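The rewrite in this example can be sketched as a small pass over a list of instructions. This is an illustration, not the MOVE scheduler: `software_bypass` is a hypothetical helper, each instruction is a list of `(source, destination)` moves, and the caller is assumed to know which registers are read only by the immediately following instruction (`dead_registers`). FU latency and result-register lifetime are ignored.

```python
def software_bypass(program, dead_registers):
    """Forward FU results straight to the next instruction's consumers."""
    result = [list(instr) for instr in program]
    for i in range(len(result) - 1):
        for src, dst in list(result[i]):
            # Only result-register moves into a dead register are candidates.
            if not (src.endswith(".r") and dst in dead_registers):
                continue
            nxt = result[i + 1]
            if not any(s == dst for s, _ in nxt):
                continue
            # Read the FU result directly instead of going through the RF.
            result[i + 1] = [(src if s == dst else s, d) for s, d in nxt]
            result[i].remove((src, dst))
    # Drop instructions that became empty after bypassing.
    return [instr for instr in result if instr]


program = [
    [("r1", "add.o"), ("r2", "add.t")],
    [("add.r", "r3")],
    [("r3", "sub.o"), ("r4", "sub.t")],
    [("sub.r", "r5")],
]
bypassed = software_bypass(program, dead_registers={"r3"})
for instr in bypassed:
    print(", ".join(f"{s} -> {d}" for s, d in instr) + ";")
```

With `r3` marked dead, the write to `r3` and the later read of it collapse into one move, and the program shrinks from four instructions to three, as on the slide.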


Design Space Exploration

[Figure: exploration flow in two phases. Resource Optimization: Application (C/C++), Frontend, Map & Schedule, Simulator and Select Resources, using FU models, cost functions and Resources (Mach), yielding design points. Connectivity Optimization: Map & Schedule, Simulator and Reduce Connections, using the same inputs, yielding a design point.]



Exploration: Resource Optimization

Pareto curve represents the lower bound of the architecture configurations found

Selected architecture for further optimization
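How a Pareto curve is extracted from explored design points can be sketched as follows. The two cost axes (area and cycle count, lower is better for both) and the sample numbers are assumptions made for illustration; the slides do not give the explorer's actual cost functions or data.

```python
def pareto_front(points):
    """Return the (area, cycles) points that no other point dominates."""
    front = []
    best_cycles = float("inf")
    # Sorting by area first means a point is kept only if it is strictly
    # faster than every cheaper-or-equal point seen so far.
    for area, cycles in sorted(set(points)):
        if cycles < best_cycles:
            front.append((area, cycles))
            best_cycles = cycles
    return front


# Hypothetical design points: (area, cycles)
design_points = [(10, 900), (12, 700), (12, 950), (15, 700), (20, 400), (25, 450)]
front = pareto_front(design_points)
print(front)  # prints: [(10, 900), (12, 700), (20, 400)]
```

A configuration for further optimization would then be picked from `front` according to the area or performance budget of the application.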


Exploration: Connectivity Optimization (MOVE Project at DUT)

Reduced connections decrease bus delay

Critical connections have been removed

[Figure: two interconnect diagrams, each showing IRU, ALU, IU and LSU units.]


Topics to be Investigated

Poor code density
  good target for code compression techniques
  a priori information about the application, thus instruction probabilities are known

Estimations
  Power estimation
  Fast estimations with sufficient accuracy

Flexibility, reuse
  Applications may change, thus additional resources need to be assigned although not needed by the original application

Tool-assisted special function unit generation
  Analysis support
  Model creation support
  Characterization support

Parameterized processor generator
  Interconnections, control, etc. may be realized in several ways depending on the target
  Low-power optimizations

Clustered TTAs
  Interprocessor communication schemes

These topics are considered in the FlexDSP Project at TUT
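To show why known instruction probabilities help code compression, the sketch below computes Huffman code lengths for a set of moves. Huffman coding is chosen here only as a familiar example; the slides do not name a compression scheme, and the probabilities are made up.

```python
import heapq


def huffman_code_lengths(probabilities):
    """Map each symbol to its Huffman code length in bits."""
    if len(probabilities) == 1:
        return {symbol: 1 for symbol in probabilities}
    # Heap entries: (probability, tie_breaker, {symbol: depth_so_far})
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]


# Hypothetical move frequencies obtained from profiling one application
move_probabilities = {
    "r1 -> add.o": 0.5,
    "r2 -> add.t": 0.25,
    "add.r -> r3": 0.125,
    "sub.r -> r5": 0.125,
}
lengths = huffman_code_lengths(move_probabilities)
average_bits = sum(move_probabilities[m] * lengths[m] for m in lengths)
print(lengths, average_bits)
```

For these four moves a fixed-width encoding needs 2 bits each, while the variable-length code averages 1.75 bits.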


New Design Environment

[Figure: target of the FlexDSP Project at TUT. Blocks: Functionality (C/C++), Frontend, Operation Analysis, SFU Generation, Design Space Exploration, Resource Constraints, FU models (C, HDL), Cost Functions (area, power, speed), Parametric Compiler with Parallel Object Code, Code Compression, Parametric Processor Generator with HDL Code, and TTA Processor.]


Conclusions

Design methodologies allowing processor customization will improve efficiency in certain application areas, e.g., multimedia, telecom

TTA is a promising candidate for an architectural template for customized processors
  In particular, support for custom function units allows powerful tailoring
  Results of the MOVE project at DUT have already proven the concept
  Parameterized compiler allows tool-assisted design space exploration

Still more research needed on
  Hardware implementations
  Enhanced compiler strategies