A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Roman Lysecky, Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

Introduction: Dynamic Software Optimization

Dynamic optimizations are increasingly common
- Dynamo: dynamic software optimizations
- Transmeta Crusoe, Efficeon: dynamic code morphing
- Just In Time (JIT) compilation: interpreted languages

Advantages
- Transparent optimizations
  - No designer effort
  - No tool restrictions
- Adapts to actual usage

Drawbacks
- Currently limited to software optimizations
- Limited speedup (1.1x to 1.3x common)


Introduction: Hardware/Software Partitioning

[Figure: a profiler identifies the critical regions of the software (SW); those regions move from the processor to an ASIC/FPGA as hardware (HW). Bar charts compare time and energy for SW only vs. HW/SW.]

Benefits
- Speedups of 2X to 10X typical
- Speedups of 800X possible
- Far more potential than dynamic SW optimizations (1.2x)
- Energy reductions of 25% to 95% typical


Introduction: Traditional Hardware/Software Partitioning

[Figure: traditional partitioning flow. Labels: SW, Binary, Profiling, CAD Tools, Netlist, SW Binary.]

- Requires specialized CAD tools
- Non-standard partitioning compilers


Introduction: Binary Hardware/Software Partitioning

Binary partitioning [Stitt/Vahid ICCAD’02]
- Partition application starting from SW binary
- Can be desktop based

Advantages
- Use any standard compiler
- Supports any language
- Supports multiple sources from multiple languages
- Supports assembly/object code
- Supports legacy code

Disadvantage
- Loses some high-level information, so there may be some loss of quality

[Figure: binary partitioning flow. Labels: SW, Standard Compiler, Binary, Profiling, CAD Tools, Netlist, Modified Binary.]



Introduction: Dynamic Hardware/Software Partitioning

Dynamic HW/SW partitioning
- Embed HW/SW partitioning CAD tools on-chip
- Feasible in era of billion-transistor chips

Advantages
- Does not require any special compilers
- Completely transparent
- Bring benefits of HW/SW partitioning to all SW designers

Complements other approaches
- Desktop CAD best from purely technical perspective
- Dynamic opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD

[Figure: dynamic partitioning flow. Labels: SW, Standard Compiler, Binary, Profiling, CAD, Proc., FPGA.]


Introduction: Warp Processors

[Figure: warp processor block diagram. Labels: MIPS/ARM, I$, D$, Configurable Logic, Profiler, Dynamic Partitioning Module (DPM). Numbered callouts 1 to 5 mark the steps below.]

1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic & update software binary
5. Partitioned application executes faster with lower energy consumption
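The profiling step of the warp processor flow can be illustrated with a small software model. This is only a sketch under stated assumptions: the slides do not describe the profiler's internals, so the idea of counting taken backward branches, the function name `find_critical_loop`, and the trace format are all hypothetical.

```python
from collections import Counter


def find_critical_loop(trace):
    """Return (loop_start, loop_end, count) for the most frequent backward branch.

    `trace` is a list of (pc, next_pc) pairs, one per executed instruction.
    Assumption: a taken branch to an earlier address marks the end of a loop body.
    """
    backward = Counter()
    for pc, next_pc in trace:
        if next_pc < pc:
            backward[(next_pc, pc)] += 1
    if not backward:
        return None
    (start, end), count = backward.most_common(1)[0]
    return start, end, count


# Hypothetical trace: a 3-instruction loop at addresses 8..16 runs 4 times,
# surrounded by straight-line code.
trace = [(0, 4), (4, 8)]
for _ in range(4):
    trace += [(8, 12), (12, 16), (16, 8)]
trace[-1] = (16, 20)  # the final iteration falls through instead of branching back
trace.append((20, 24))

print(find_critical_loop(trace))  # (8, 16, 3)
```

The loop found this way would be the candidate handed to the partitioning step; a real on-chip profiler would have to do this with very little memory.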


Warp Processors: Requirements & Tools

[Figure: warp processor block diagram. Labels: ARM, I$, D$, Config. Logic, Profiler, DPM.]

Warp processor architecture and tools
- Basic configurable logic architecture
- Efficient profiling architecture
- On-chip CAD tools for HW/SW partitioning
  - Decompilation
  - Synthesis
  - Technology mapping
  - Placement and routing

[Figure: on-chip CAD flow. Labels: Binary, Decompilation, RT & Logic Synthesis, Technology Mapping, Placement & Routing, Binary, HW.]

Develop new Warp Configurable Logic Architecture (WCLA)


Warp Configurable Logic Architecture: Requirements

Robustness
- Capable of supporting a large set of applications

Simplicity
- Existing FPGAs are too complex for warp processors
- Design goals of FPGAs are much different

Design the configurable fabric by analyzing architectural features for their impact on on-chip CAD tools
- Fast execution
- Very low data memory
- Produce reasonable hardware circuits

Efficient interface to memory


Warp Configurable Logic Architecture

Data address generators (DADG) and loop control hardware (LCH)
- Found in most digital signal processors
- Provide fast loop execution
- Support memory accesses with regular access patterns
- Synthesis of an FSM is not required for many critical loops
- Configurable logic fabric input provides alternative control of loop execution

[Figure: WCLA block diagram. Labels: DADG & LCH, Configurable Logic Fabric, Reg0, Reg1, Reg2, 32-bit MAC. Inset: warp processor with ARM, I$, D$, WCLA, Profiler, DPM.]
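A regular access pattern of the kind a data address generator supports can be modeled as a base address plus a fixed stride. The sketch below is a software analogy only; the parameter names (`base`, `stride`, `count`) and the dictionary used as memory are assumptions for illustration, not the WCLA interface.

```python
def address_stream(base, stride, count):
    """Yield `count` addresses starting at `base`, stepping by `stride` bytes."""
    for i in range(count):
        yield base + i * stride


def run_loop(memory, base, stride, count, body):
    """Model loop control: fetch one operand per iteration and apply `body`."""
    results = []
    for addr in address_stream(base, stride, count):
        results.append(body(memory[addr]))
    return results


# Hypothetical memory holding eight words at 4-byte spacing.
memory = {0x100 + 4 * i: i * 10 for i in range(8)}
doubled = run_loop(memory, base=0x100, stride=4, count=4, body=lambda x: 2 * x)
print(doubled)  # [0, 20, 40, 60]
```

Because the addresses and the iteration count are produced outside `body`, the loop body itself needs no state machine, which mirrors the point that FSM synthesis is not required for many critical loops.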


Warp Configurable Logic Architecture

Integrated 32-bit multiplier-accumulator (MAC)
- Multiplications are frequently found within critical loops
  - Frequently in the form of a multiply-accumulate operation
- Fast, single-cycle multipliers are large and require many interconnections

[Figure: WCLA block diagram. Labels: DADG & LCH, Reg0, Reg1, Reg2, 32-bit MAC. Inset: warp processor with ARM, I$, D$, WCLA, Profiler, DPM.]
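The multiply-accumulate pattern mentioned above is the inner step of loops such as dot products and filters. The following sketch models it in software; truncating to 32 bits with unsigned wraparound is an assumption, since the slides do not state how the MAC handles overflow.

```python
MASK32 = 0xFFFFFFFF


def mac32(acc, a, b):
    """Return (acc + a * b) truncated to 32 bits (unsigned wraparound assumed)."""
    return (acc + a * b) & MASK32


def dot_product(xs, ys):
    """Accumulate pairwise products, one MAC operation per loop iteration."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac32(acc, x, y)
    return acc


print(dot_product([1, 2, 3], [4, 5, 6]))  # 32
```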


Warp Configurable Logic Architecture: Configurable Logic Fabric

[Figure: WCLA with DADG, LCH, and 32-bit MAC alongside the configurable logic fabric, which is drawn as a grid of CLBs and SMs.]

- Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
- Each CLB is directly connected to an SM
- Switch matrix connections
  - Four short wires connect adjacent SMs
  - Four long wires connect every other SM together


Warp Configurable Logic Architecture: Combinational Logic Block Design

[Figure: CLB containing two LUTs, with inputs a b c d e f, outputs o1 o2 o3 o4, and connections to adjacent CLBs.]

Several studies have analyzed the impact of LUT and CLB size on overall design area and delay
- LUTs with 5 to 6 inputs result in best performance
- LUTs with fewer than 3 inputs have much worse performance [Chow, et al. 1999, Singh, et al. 1992]
- CLB cluster sizes of 3 to 20 LUTs are feasible [Marquardt, Betz, Rose 2000]


Warp Configurable Logic Architecture: Combinational Logic Block Design

- Incorporate two 3-input 2-output LUTs
  - Corresponds to four 3-input LUTs
  - Allows for good quality circuits while reducing on-chip CAD tool complexity
- Provide routing resources between adjacent CLBs to support carry chains

FPGAs vs. WCLA
- FPGAs, flexibility: large CLBs, various internal routing resources
- WCLA, simplicity: limited internal routing, reduced technology mapping complexity

[Figure: CLB containing two LUTs, with inputs a b c d e f, outputs o1 o2 o3 o4, and connections to adjacent CLBs.]
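One way to see why a 3-input 2-output LUT pairs naturally with carry chains is that a full adder has exactly three inputs (a, b, carry in) and two outputs (sum, carry out). The sketch below models a LUT as a truth table and chains copies of it into a ripple-carry adder. The full-adder mapping is an illustration chosen here, not something the slides state, and all names are hypothetical.

```python
def make_lut(func):
    """Build the 8-entry truth table of a 3-input, 2-output LUT."""
    table = []
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        table.append(func(a, b, c))
    return table


def lut_eval(table, a, b, c):
    """Look up the two output bits for input bits a, b, c."""
    return table[(a << 2) | (b << 1) | c]


def full_adder(a, b, cin):
    total = a + b + cin
    return total & 1, total >> 1  # (sum, carry_out)


FA_LUT = make_lut(full_adder)


def ripple_add(x, y, width):
    """Add two integers bit by bit, passing the carry from one LUT to the next."""
    carry, result = 0, 0
    for bit in range(width):
        s, carry = lut_eval(FA_LUT, (x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry


print(ripple_add(5, 6, 4))  # (11, 0)
```

In this model the `carry` variable plays the role of the dedicated connection between adjacent CLBs.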


Warp Configurable Logic Architecture: Switch Matrix

[Figure: switch matrix with short channels 0, 1, 2, 3 and long channels 0L, 1L, 2L, 3L on each of its four sides.]

Switch matrix
- SMs are connected using eight channels per side
  - Four short channels
  - Four long channels
- Routes connect wires from different sides using the same channel
- Each short channel is associated with a single long channel
- Wires are routed using a single pair of channels through the configurable logic fabric

FPGAs vs. WCLA
- FPGAs, flexibility: large routing resources, requires complex routing algorithms
- WCLA, simplicity: allows for design of a less complex routing algorithm
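The single-channel-pair restriction is what keeps routing simple: a net only has to find one channel index that is free along its whole path. The sketch below is a heavily simplified, hypothetical model (a single row of SMs, left-to-right nets with `src < dst`, long wires spanning two SMs and available at every position, greedy assignment). It is not the routing algorithm used by the authors' on-chip tools.

```python
NUM_CHANNELS = 4  # four short/long channel pairs per side, per the slide


def segments(src, dst):
    """Split a left-to-right route into long (2-hop) and short (1-hop) segments."""
    segs, pos = [], src
    while dst - pos >= 2:
        segs.append(("long", pos, pos + 2))
        pos += 2
    if pos < dst:
        segs.append(("short", pos, pos + 1))
    return segs


def route(nets):
    """Greedily assign each (src, dst) net to the lowest free channel pair."""
    used = set()  # (channel, kind, start, end) wire segments already taken
    assignment = {}
    for src, dst in nets:
        segs = segments(src, dst)
        for ch in range(NUM_CHANNELS):
            wanted = {(ch, kind, a, b) for kind, a, b in segs}
            if not wanted & used:
                used |= wanted
                assignment[(src, dst)] = ch
                break
        else:
            assignment[(src, dst)] = None  # unroutable in this simplified model
    return assignment


print(segments(0, 5))
print(route([(0, 3), (0, 2), (2, 3)]))  # {(0, 3): 0, (0, 2): 1, (2, 3): 1}
```

Each net stays on one channel index for its entire length, using the short wire or its associated long wire at each step, so the search space per net is only the four channel pairs.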


Results: Benchmarks

- Considered 12 embedded benchmarks from NetBench, MediaBench, EEMBC, and Powerstone
- On average, 53% of total software execution time was spent executing a single critical loop (more speedup is possible if more loops are considered)
- On average, critical loops comprised only 1% of total program size

Benchmark   Total Instructions  Loop Instructions  Loop Execution Time  Loop Size %  Speedup Bound
brev                       992                104                70.0%        10.5%            3.3
g3fax1                    1094                  6                31.4%         0.5%            1.5
g3fax2                    1094                  6                31.2%         0.5%            1.5
url                      13526                 17                79.9%         0.1%              5
logmin                    8968                 38                63.8%         0.4%            2.8
pktflow                  15582                  5                35.5%         0.0%            1.6
canrdr                   15640                  5                35.0%         0.0%            1.5
bitmnp                   17400                165                53.7%         0.9%            2.2
tblook                   15668                 11                76.0%         0.1%            4.2
ttsprk                   16558                 11                59.3%         0.1%            2.5
matrix01                 16694                 22                40.6%         0.1%            1.7
idctrn01                 17106                 13                62.2%         0.1%            2.6
Average:                                                         53.2%         1.1%            3.0
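The per-benchmark Speedup Bound values appear consistent with Amdahl's law applied to the loop execution time fraction, that is, 1 / (1 - f) when the loop is assumed to take no time in hardware. The slides do not state the formula, so this reading is an inference, and the function names below are hypothetical.

```python
def speedup_bound(loop_fraction):
    """Upper bound on overall speedup if the loop took zero time in hardware."""
    return 1.0 / (1.0 - loop_fraction)


def overall_speedup(loop_fraction, loop_speedup):
    """Overall speedup when only the loop is accelerated by `loop_speedup`."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


# Loop execution time fractions taken from the benchmark table.
for name, fraction in [("brev", 0.700), ("url", 0.799), ("g3fax1", 0.314)]:
    print(name, round(speedup_bound(fraction), 1))
```

`overall_speedup` shows how a finite hardware speedup of the loop translates into a smaller whole-program speedup; for example, a loop taking half the run time and running twice as fast gives about 1.33x overall.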


Results: Experimental Setup

Warp processor
- 75 MHz ARM7 processor
- Configurable logic fabric with fixed frequency of 60 MHz
- Used dynamic partitioning CAD tools to map critical region to hardware
  - Executed on an ARM7 processor
  - Active for roughly 10 seconds to perform partitioning

Traditional HW/SW partitioning
- 75 MHz ARM7 processor
- Xilinx Virtex-E FPGA (executing at maximum possible speed)
- Manually partitioned software using VHDL
- VHDL synthesized using Xilinx ISE 4.1 on desktop

[Figure: the two platforms. Warp processor: ARM7, I$, D$, WCLA, Profiler, DPM. Traditional: ARM7, I$, D$, Xilinx Virtex-E FPGA.]


Results: Performance Speedup

[Figure: bar chart of speedup (axis 0 to 5), WCLA vs. Xilinx Virtex-E; one bar is labeled 4.1.]

Average speedup of 2.1 for WCLA vs. 2.2 for Virtex-E


Results: Energy Reduction

[Figure: bar chart of energy reduction (axis 0% to 100%), WCLA vs. Xilinx Virtex-E; one bar is labeled 74%.]

Average energy reduction of 33% for WCLA vs. 36% for Xilinx Virtex-E


Context: UCR’s Research on Configurable SoCs

Self Tuning, Self Configuring Mass Produced ICs

[Figure: configurable SoC block diagram. Labels: MIPS/ARM, I$, D$, WCLA, Profiler, Dynamic Partitioning Module (DPM), Cache Tuner.]

- Efficient on-chip profiling [Gordon-Ross, Vahid]
- Configurable cache [Zhang, Vahid, Najjar]
- Self-tuning cache [Zhang, Vahid, Lysecky]
- Binary decompilation, loop unrolling, alias analysis [Stitt, Vahid]
- Lean on-chip CAD tools [Lysecky, Vahid, Tan]


Conclusions & Future Work

Warp configurable logic fabric
- Supports wide range of embedded systems applications
- Designed specifically to allow development of lean on-chip CAD tools
- Provides excellent results
  - Average speedups of 2.1
  - Average energy reduction of 33%
  - Much better than dynamic software optimizations
  - One loop only; more speedup possible
  - More recent examples since DATE publication: 10x speedups
  - Working towards examples with 100x speedups

Future work
- Partitioning multiple software loops to hardware
- Synthesizing finite state machines (FSMs)
- Improved synthesis, technology mapping, and place and route