A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning
Roman Lysecky, Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.
Lysecky, R., Vahid, F. 2
Introduction: Dynamic Software Optimization
Dynamic optimizations are increasingly common:
  Dynamo: dynamic software optimizations
  Transmeta Crusoe, Efficeon: dynamic code morphing
  Just-in-time (JIT) compilation: interpreted languages
Advantages:
  Transparent optimizations
  No designer effort
  No tool restrictions
  Adapts to actual usage
Drawbacks:
  Currently limited to software optimizations
  Limited speedup (1.1x to 1.3x common)
Introduction: Hardware/Software Partitioning
[Figure: a profiler identifies critical regions of the software; those regions move from the processor to an ASIC/FPGA while the rest stays in software. Bar charts compare time and energy for SW only versus HW/SW.]
Benefits:
  Speedups of 2x to 10x typical
  Speedups of 800x possible
  Far more potential than dynamic SW optimizations (1.2x)
  Energy reductions of 25% to 95% typical
Introduction: Traditional Hardware/Software Partitioning
[Figure: traditional partitioning flow with blocks labeled SW, Binary, Profiling, CAD Tools, Netlist, SW Binary, Processor, and ASIC/FPGA.]
Requires specialized CAD tools
Non-standard partitioning compilers
Introduction: Binary Hardware/Software Partitioning
Binary partitioning [Stitt/Vahid ICCAD’02]:
  Partition application starting from SW binary
  Can be desktop based
Advantages:
  Use any standard compiler
  Supports any language
  Supports multiple sources from multiple languages
  Supports assembly/object code
  Supports legacy code
Disadvantage:
  Loses some high-level information, so there may be some loss of quality
[Figure: binary partitioning flow with blocks labeled SW, Standard Compiler, Binary, Profiling, CAD Tools, Netlist, Modified Binary, Processor, and ASIC/FPGA.]
Introduction: Dynamic Hardware/Software Partitioning
Dynamic HW/SW partitioning:
  Embed HW/SW partitioning CAD tools on-chip
  Feasible in era of billion-transistor chips
Advantages:
  Does not require any special compilers
  Completely transparent
  Brings benefits of HW/SW partitioning to all SW designers
  Complements other approaches:
    Desktop CAD is best from a purely technical perspective
    Dynamic opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD
[Figure: dynamic partitioning flow with blocks labeled SW, Standard Compiler, Binary, Profiling, CAD, FPGA, and Proc.]
Introduction: Warp Processors
[Figure: warp processor with blocks labeled MIPS/ARM, I$, D$, Configurable Logic, Profiler, and Dynamic Part. Module (DPM).]
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic and update software binary
5. Partitioned application executes faster with lower energy consumption
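The profiling step can be illustrated with a small sketch. The slides do not say how the profiler picks critical regions; the version below assumes a common heuristic (counting taken backward branches and treating the most frequent one as the critical loop). The function name and the trace format are hypothetical.

```python
from collections import Counter


def find_critical_loop(branch_trace):
    """Return (loop_start, loop_end, count) for the hottest backward branch.

    branch_trace: iterable of (branch_pc, target_pc) pairs for taken branches.
    Assumption: a taken branch whose target is at or before the branch
    itself is a loop back edge.
    """
    counts = Counter(
        (target, pc) for pc, target in branch_trace if target <= pc
    )
    if not counts:
        return None  # no backward branches, so no loop candidate
    (start, end), hits = counts.most_common(1)[0]
    return start, end, hits


# Hypothetical trace: one hot loop, one cold loop, a few forward branches.
trace = [(0x120, 0x100)] * 500 + [(0x200, 0x1F0)] * 20 + [(0x80, 0x300)] * 3
hot = find_critical_loop(trace)
```

Here `hot` identifies the loop spanning addresses 0x100 to 0x120, which would be the candidate handed to the partitioning step.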
Warp Processors: Requirements & Tools
[Figure: warp processor with blocks labeled ARM, I$, D$, Config. Logic, Profiler, and DPM.]
Warp processor architecture and tools:
  Basic configurable logic architecture
  Efficient profiling architecture
  On-chip CAD tools for HW/SW partitioning:
    Decompilation
    Synthesis
    Technology mapping
    Placement and routing
[Figure: on-chip CAD flow with blocks labeled Binary, Decompilation, RT & Logic Synthesis, Technology Mapping, Placement & Routing, Binary, and HW.]
Develop new Warp Configurable Logic Architecture (WCLA)
Warp Configurable Logic Architecture: Requirements
Robustness:
  Capable of supporting a large set of applications
Simplicity:
  Existing FPGAs are too complex for warp processors
  Design goals of FPGAs are much different
  Design configurable fabric by analyzing architectural features as to their impact on on-chip CAD tools:
    Fast execution
    Very low data memory
    Produce reasonable hardware circuits
Efficient interface to memory
Warp Configurable Logic Architecture
Data address generators (DADG) and loop control hardware (LCH):
  Found in most digital signal processors
  Provide fast loop execution
  Support memory accesses with regular access patterns
  Synthesis of FSM not required for many critical loops
  Configurable logic fabric input provides alternative control of loop execution
[Figure: WCLA with blocks labeled DADG & LCH, Configurable Logic Fabric, Reg0, Reg1, Reg2, and 32-bit MAC, alongside ARM, I$, D$, Profiler, and DPM.]
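A regular access pattern of the kind the DADG and LCH target can be sketched in software. This is only an illustration, assuming a fixed stride and a fixed iteration count; the function name and parameters are hypothetical and do not describe the actual hardware interface.

```python
def address_stream(base, stride, count):
    """Yield `count` addresses starting at `base`, `stride` bytes apart."""
    remaining = count  # loop control: iterations left
    addr = base        # address generator: next address to issue
    while remaining > 0:
        yield addr
        addr += stride
        remaining -= 1


# Five consecutive 32-bit words starting at a hypothetical base address.
addrs = list(address_stream(base=0x1000, stride=4, count=5))
```

Because the address sequence and the trip count are fully determined by three parameters, no state machine has to be synthesized for a loop of this shape.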
Warp Configurable Logic Architecture
Integrated 32-bit multiplier-accumulator (MAC):
  Multiplications are frequently found within critical loops
  Frequently in the form of a multiply-accumulate operation
  Fast, single-cycle multipliers are large and require many interconnections
[Figure: WCLA with blocks labeled DADG & LCH, Configurable Logic Fabric, Reg0, Reg1, Reg2, and 32-bit MAC, alongside ARM, I$, D$, Profiler, and DPM.]
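A multiply-accumulate loop of the kind described above can be sketched as follows. The wraparound to 32 bits is an assumption made for illustration; the slides do not specify overflow behavior, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assumption: accumulator wraps at 32 bits


def mac32(acc, a, b):
    """One multiply-accumulate step: acc + a * b, truncated to 32 bits."""
    return (acc + a * b) & MASK32


def dot_product(xs, ys):
    """Typical critical-loop shape: one MAC per loop iteration."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac32(acc, x, y)
    return acc


result = dot_product([1, 2, 3], [4, 5, 6])
```

Each loop iteration maps to a single MAC operation, which is why a dedicated unit avoids building a large multiplier out of the configurable fabric.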
Warp Configurable Logic Architecture: Configurable Logic Fabric
[Figure: configurable logic fabric drawn as a grid of CLBs and SMs, alongside the DADG, LCH, and 32-bit MAC.]
Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
Each CLB is directly connected to an SM
Switch matrix connections:
  Four short wires connect adjacent SMs
  Four long wires connect every other SM together
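The short and long wire connectivity can be sketched as a neighbor function over a grid of switch matrices. This assumes a rectangular grid and reads "every other SM" as the SM two positions away in the same row or column; both the grid shape and the function name are assumptions for illustration.

```python
def sm_neighbors(row, col, rows, cols):
    """Return {(r, c): wire_kind} for SMs one hop away from (row, col).

    Assumption: short wires reach the adjacent SM (distance 1) and long
    wires reach every other SM (distance 2) along a row or column.
    """
    hops = {}
    for d_row, d_col in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        for distance, kind in ((1, "short"), (2, "long")):
            r, c = row + d_row * distance, col + d_col * distance
            if 0 <= r < rows and 0 <= c < cols:
                hops[(r, c)] = kind
    return hops


# Corner SM of a hypothetical 4x4 grid.
corner = sm_neighbors(0, 0, rows=4, cols=4)
```

A router working on this model only has to choose between two wire kinds per direction, which keeps the search space small compared with a general FPGA routing graph.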
Warp Configurable Logic Architecture: Combinational Logic Block Design
[Figure: CLB containing two LUTs with inputs a, b, c, d, e, f and outputs o1, o2, o3, o4, plus connections to adjacent CLBs.]
Several studies have analyzed the impact of LUT and CLB size on overall design area and delay:
  LUTs with 5 to 6 inputs result in best performance
  LUTs with fewer than 3 inputs have much worse performance [Chow, et al. 1999, Singh, et al. 1992]
  CLB cluster sizes of 3 to 20 LUTs are feasible [Marquardt, Betz, Rose 2000]
Warp Configurable Logic Architecture: Combinational Logic Block Design
Incorporate two 3-input 2-output LUTs:
  Corresponds to four 3-input LUTs
  Allows for good quality circuits while reducing on-chip CAD tool complexity
Provide routing resources between adjacent CLBs to support carry chains
FPGAs versus WCLA:
  FPGAs, flexibility: large CLBs, various internal routing resources
  WCLA, simplicity: limited internal routing, reduced technology mapping complexity
[Figure: CLB containing two LUTs with inputs a, b, c, d, e, f and outputs o1, o2, o3, o4, plus connections to adjacent CLBs.]
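To see why a 3-input 2-output LUT pairs naturally with carry chains, consider a full adder: three inputs (two operand bits and a carry in) and two outputs (sum and carry out). The sketch below models one such LUT as two 8-entry truth tables and chains it into a ripple adder. The class, the table encoding, and the one-LUT-per-bit mapping are assumptions for illustration, not the actual WCLA configuration format.

```python
class Lut3x2:
    """A 3-input, 2-output LUT modeled as two 8-entry truth tables."""

    def __init__(self, table0, table1):
        if len(table0) != 8 or len(table1) != 8:
            raise ValueError("each truth table needs 2**3 = 8 entries")
        self.tables = (tuple(table0), tuple(table1))

    def evaluate(self, a, b, c):
        index = (a << 2) | (b << 1) | c  # inputs form the table address
        return self.tables[0][index], self.tables[1][index]


# All input combinations in the same order as the table address above.
BITS = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
FULL_ADDER = Lut3x2(
    [a ^ b ^ c for a, b, c in BITS],                    # sum output
    [(a & b) | (a & c) | (b & c) for a, b, c in BITS],  # carry output
)


def ripple_add(x, y, width):
    """Add two `width`-bit values, one LUT evaluation per bit position."""
    carry, total = 0, 0
    for i in range(width):
        bit, carry = FULL_ADDER.evaluate((x >> i) & 1, (y >> i) & 1, carry)
        total |= bit << i
    return total, carry


total, carry_out = ripple_add(5, 3, width=4)
```

The carry passed from one iteration to the next stands in for the dedicated routing between adjacent CLBs.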
Warp Configurable Logic Architecture: Switch Matrix
[Figure: switch matrix with short channels 0, 1, 2, 3 and long channels 0L, 1L, 2L, 3L on each of its four sides.]
Switch matrix:
  SM connected using eight channels per side:
    Four short channels
    Four long channels
  Routes connect wires from different sides using the same channel
  Each short channel is associated with a single long channel
  Wires are routed using a single pair of channels through the configurable logic fabric
FPGAs versus WCLA:
  FPGAs, flexibility: large routing resources, requires complex routing algorithms
  WCLA, simplicity: allows for design of less complex routing algorithms
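The channel restriction can be written as a small legality check. The sketch assumes that short channel i is paired with long channel iL (suggested by the labels 0/0L through 3/3L but not stated outright), that a connection inside an SM may switch between the short and long channel of one pair, and that a route never leaves on the side it entered. Side names and function names are hypothetical.

```python
SIDES = ("north", "east", "south", "west")


def channel_pair(channel):
    """Map a channel label such as '2' or '2L' to its pair index (2)."""
    return int(channel.rstrip("L"))


def can_connect(side_a, chan_a, side_b, chan_b):
    """Check whether an SM may join two wires under the assumed rules."""
    if side_a not in SIDES or side_b not in SIDES:
        raise ValueError("unknown side")
    if side_a == side_b:
        return False  # assumption: routes connect different sides only
    return channel_pair(chan_a) == channel_pair(chan_b)


straight = can_connect("north", "2", "south", "2")
to_long = can_connect("north", "2", "east", "2L")
crossed = can_connect("north", "2", "east", "3")
```

Once a route picks its channel pair at the source, every later switch matrix has no channel choice left to make, which is what allows a much simpler routing algorithm.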
Results: Benchmarks
Considered 12 embedded benchmarks from NetBench, MediaBench, EEMBC, and Powerstone
An average of 53% of total software execution time was spent executing a single critical loop (more speedup possible if more loops are considered)
On average, critical loops comprised only 1% of total program size

Benchmark   Total Instructions   Loop Instructions   Loop Execution Time   Loop Size %   Speedup Bound
brev                       992                 104                 70.0%         10.5%             3.3
g3fax1                    1094                   6                 31.4%          0.5%             1.5
g3fax2                    1094                   6                 31.2%          0.5%             1.5
url                      13526                  17                 79.9%          0.1%             5
logmin                    8968                  38                 63.8%          0.4%             2.8
pktflow                  15582                   5                 35.5%          0.0%             1.6
canrdr                   15640                   5                 35.0%          0.0%             1.5
bitmnp                   17400                 165                 53.7%          0.9%             2.2
tblook                   15668                  11                 76.0%          0.1%             4.2
ttsprk                   16558                  11                 59.3%          0.1%             2.5
matrix01                 16694                  22                 40.6%          0.1%             1.7
idctrn01                 17106                  13                 62.2%          0.1%             2.6
Average:                                                           53.2%          1.1%             3.0
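The per-benchmark Speedup Bound values are consistent with the Amdahl's-law limit 1 / (1 - f), where f is the fraction of execution time spent in the critical loop. The slides do not state this formula, so the sketch below is an inference from the numbers; the function names and the sample loop speedup are hypothetical.

```python
def speedup_bound(loop_fraction):
    """Upper bound on overall speedup if the loop took zero time."""
    if not 0.0 <= loop_fraction < 1.0:
        raise ValueError("loop_fraction must be in [0, 1)")
    return 1.0 / (1.0 - loop_fraction)


def overall_speedup(loop_fraction, loop_speedup):
    """Overall speedup when only the loop runs `loop_speedup` times faster."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


brev_bound = round(speedup_bound(0.700), 1)     # brev: 70.0% in the loop
tblook_bound = round(speedup_bound(0.760), 1)   # tblook: 76.0% in the loop
example = overall_speedup(0.5, 2.0)             # hypothetical 2x loop speedup
```

The second function shows why the bound matters: however fast the hardware loop becomes, the software that stays on the processor limits the overall gain.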
Results: Experimental Setup
Warp processor:
  75 MHz ARM7 processor
  Configurable logic fabric with fixed frequency of 60 MHz
  Used dynamic partitioning CAD tools to map critical region to hardware:
    Executed on an ARM7 processor
    Active for roughly 10 seconds to perform partitioning
Traditional HW/SW partitioning:
  75 MHz ARM7 processor
  Xilinx Virtex-E FPGA (executing at maximum possible speed)
  Manually partitioned software using VHDL
  VHDL synthesized using Xilinx ISE 4.1 on desktop
[Figure: two block diagrams, one with ARM7, I$, D$, WCLA, Profiler, and DPM, the other with ARM7, I$, D$, and Xilinx Virtex-E FPGA.]
Results: Performance Speedup
[Figure: bar chart of speedup (axis 0 to 5) for WCLA and Xilinx Virtex-E; one bar carries the data label 4.1.]
Average speedup of 2.1 vs. 2.2 for Virtex-E
Results: Energy Reduction
[Figure: bar chart of energy reduction (axis 0% to 100%) for WCLA and Xilinx Virtex-E; one bar carries the data label 74%.]
Average energy reduction of 33% vs. 36% for Xilinx Virtex-E
Context: UCR’s Research on Configurable SoCs
Self-Tuning, Self-Configuring Mass-Produced ICs
[Figure: SoC with blocks labeled MIPS/ARM, I$, D$, WCLA, Profiler, Dynamic Part. Module (DPM), and Cache Tuner.]
Efficient on-chip profiling [Gordon-Ross, Vahid]
Configurable cache [Zhang, Vahid, Najjar]
Self-tuning cache [Zhang, Vahid, Lysecky]
Binary decompilation, loop unrolling, alias analysis [Stitt, Vahid]
Lean on-chip CAD tools [Lysecky, Vahid, Tan]
Conclusions & Future Work
Warp configurable logic fabric:
  Supports wide range of embedded systems applications
  Designed specifically to allow development of lean on-chip CAD tools
  Provides excellent results:
    Average speedup of 2.1
    Average energy reduction of 33%
    Much better than dynamic software optimizations
    One loop only; more speedup possible
    More recent examples since DATE publication: 10x speedups
    Working towards examples with 100x speedups
Future work:
  Partitioning multiple software loops to hardware
  Synthesizing finite state machines (FSMs)
  Improved synthesis, technology mapping, and place and route