NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture
Wei Zhang†, Li Shang‡ and Niraj K. Jha†
†Dept. of Electrical Engineering, Princeton University
‡Dept. of Electrical and Computer Engineering, Queen’s University
Outline
Temporal logic folding
Background on NRAMs
Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)
NanoMap: design optimization flow
Experimental results
Conclusions
[Figure: input design -> NanoMap -> NATURE]
Temporal Logic Folding
Basic idea: use run-time reconfiguration to realize different functions in the same resource every few cycles
[Figure: example circuit with inputs a, b, c, d, e, f, g, h and three LUTs. LUT1: i = abc'; LUT2: l = (i' + e' + f')h'; LUT3: OUT = d'g' + l. A single LUT with memory (MEM) is configured as LUT1, LUT2 and LUT3 in turn.]
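A minimal sketch of the folding idea on the three-LUT example, assuming a single 4-input LUT whose truth table is swapped each cycle. The helper names (`make_lut`, `lut_eval`, `folded_out`) are illustrative and not from the slides; unused LUT inputs are padded with 0, and intermediate values i and l are held in plain variables standing in for flip-flops.

```python
from itertools import product

def make_lut(func, n_inputs=4):
    """Build a truth table (list of bits) for func over n_inputs bits."""
    return [func(*bits) for bits in product((0, 1), repeat=n_inputs)]

def lut_eval(table, bits):
    """Look up the output for the given input bits (first bit is the MSB)."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return table[index]

# One configuration per folding cycle; the same physical LUT is reused.
CONFIGS = [
    make_lut(lambda a, b, c, _: a & b & (1 - c)),                      # LUT1: i = abc'
    make_lut(lambda i, e, f, h: ((1 - i) | (1 - e) | (1 - f)) & (1 - h)),  # LUT2: l = (i'+e'+f')h'
    make_lut(lambda d, g, l, _: ((1 - d) & (1 - g)) | l),              # LUT3: OUT = d'g' + l
]

def folded_out(a, b, c, d, e, f, g, h):
    """Evaluate OUT in three cycles on one reconfigured LUT."""
    i = lut_eval(CONFIGS[0], (a, b, c, 0))    # cycle 1
    l = lut_eval(CONFIGS[1], (i, e, f, h))    # cycle 2, after reconfiguration
    return lut_eval(CONFIGS[2], (d, g, l, 0)) # cycle 3, after reconfiguration

def direct_out(a, b, c, d, e, f, g, h):
    """Reference: the same function computed without folding."""
    i = a and b and not c
    l = (not i or not e or not f) and not h
    return int((not d and not g) or l)
```

The folded version trades three cycles for a single LUT instead of three, which is the area-delay trade-off the slides describe.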
[Figure: NATURE concept diagram with labels: CMOS fabrication compatible, NRAM-based, run-time reconfiguration, temporal logic folding, design flexibility, logic density]
Overview of NATURE
Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
  Area-delay trade-off flexibility
  More than an order of magnitude increase in logic density
  More than an order of magnitude reduction in area-time product
Comparisons assume NRAMs and CMOS logic implemented in the same technology
Non-volatility: useful in low-power and secure processing
Overview of NATURE (Contd.)
Challenges in nano-circuits/architectures
  Many programmable nanofabrics proposed: nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
  Lack of a mature fabrication process
  Fabrication defects and run-time failures (between 1% and 10%)
Regular, reconfigurable architectures, such as an FPGA, favored
  Facilitates fabrication
  Fault tolerance through reconfiguration
NATURE: fabricatable using a CMOS-compatible fabrication process
NRAMs
Non-volatile nanotube random-access memory (NRAM)
  Mechanically bent or not: determines bistable on/off states
  Same/opposite voltage added to change the state
  CMOS-compatible fabrication process
  10 Gbit NRAMs already fabricated: ready to be commercialized in the near future
Properties of NRAMs
  Non-volatile
  Similar speed to SRAM
  Similar density to DRAM
  Chemically and mechanically stable
NATURE not tied to NRAMs
  Phase change RAM
  Magnetoresistive RAM
  Ferroelectric RAM
[Figure: NRAM™ by Nantero. Source: http://www.nantero.com/nram.html]
Architecture of NATURE
Island-style logic blocks (LBs) connected by various levels of interconnects
An LB contains a super macroblock (SMB) and a local switch matrix
[Figure: NATURE architecture. LBs, each with an SMB and a switch matrix (local routing network), joined through connection blocks and switch blocks; interconnect levels: direct link, length-1 wire, length-4 wire, long wire. S1: switch box between length-1 wires; S2: switch box between length-4 wires.]
Architecture of a Super Macroblock (SMB)
n1 macroblocks (MBs) comprise an SMB: here n1 = 4
[Figure: SMB with four MBs; labels: NRAM, SRAM bits, reconfiguration bits, multiplexers fed from the switch matrix, CLK and global signals, output to interconnect]
Architecture of a Macroblock (MB)
n2 logic elements (LEs) comprise an MB: here n2 = 4
[Figure: MB with four LEs; labels: NRAM, 13-to-5 crossbar, 65 SRAM bits, reconfiguration bits, inputs to MB, CLK and global signals, 8 outputs of MB]
Logic Element (Basic Configuration)
An LE implements a computation and contains:
  An m-input look-up table (LUT)
  l flip-flops
  Input to flip-flop selected between LUT output and a primary input
[Figure: LE with an m-input LUT, D flip-flops (DFFs), an SRAM cell and CLK]
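The LE description can be modelled in a few lines. This is a toy behavioural sketch, not the circuit from the slides: the class name, the `ff_select` encoding ("lut" or "primary") and the single shared primary input are assumptions made for illustration.

```python
class LogicElement:
    """Toy LE: one m-input LUT plus l flip-flops (illustrative model only)."""

    def __init__(self, m, truth_table, ff_select):
        if len(truth_table) != 2 ** m:
            raise ValueError("truth table must have 2**m entries")
        self.m = m
        self.truth_table = list(truth_table)
        self.ff_select = list(ff_select)   # per flip-flop: "lut" or "primary"
        self.ff = [0] * len(ff_select)     # flip-flop contents

    def lut(self, inputs):
        """Combinational LUT output; the first input is the MSB of the index."""
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return self.truth_table[index]

    def clock(self, inputs, primary):
        """One clock edge: each flip-flop latches the LUT output or the primary input."""
        out = self.lut(inputs)
        for k, sel in enumerate(self.ff_select):
            self.ff[k] = out if sel == "lut" else primary
        return out

# Usage: a 2-input AND with two flip-flops, one per input source.
le = LogicElement(m=2, truth_table=[0, 0, 0, 1], ff_select=["lut", "primary"])
and_out = le.clock([1, 1], primary=0)
```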
Folding Levels
Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs
Level-p folding: LE reconfiguration after the execution of p LUT computations
Reconfiguration time: 160 ps
Larger folding level: typically delay decreases and area increases
[Figure: (a) level-1 folding and (b) level-2 folding of the same network of LUT nodes, with a reconfiguration between successive folding cycles]
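A rough sketch of why a larger folding level typically lowers delay. It assumes a folding cycle costs p LUT delays plus one 160 ps reconfiguration and ignores interconnect delay; the 400 ps LUT delay is a made-up value for illustration, and the function names are not from the slides.

```python
import math

RECONFIG_PS = 160  # reconfiguration time quoted on the slide

def folding_cycles(logic_depth, p):
    """Number of folding cycles needed for a given logic depth at level-p folding."""
    return math.ceil(logic_depth / p)

def estimated_delay_ps(logic_depth, p, lut_delay_ps):
    """Assumed model: each folding cycle = p LUT delays + one reconfiguration."""
    cycle_ps = p * lut_delay_ps + RECONFIG_PS
    return folding_cycles(logic_depth, p) * cycle_ps

# Hypothetical circuit: logic depth 8, 400 ps per LUT level.
delays = {p: estimated_delay_ps(8, p, 400) for p in (1, 2, 4)}
```

Under these assumptions fewer reconfigurations are paid as p grows, while each cycle needs more LUTs at once, hence the area increase.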
Design Optimization Flow: NanoMap
Optimize and implement design on NATURE
Integrate temporal logic folding
  Choose a proper folding level
  Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles
Input design specified in register-transfer level (RTL) and/or gate-level VHDL
Motivational Example
Different planes should have the same number of folding stages to guarantee global synchronization
Key issue: how to achieve the optimization objective
  Choose an appropriate folding level
  Assign the logic to folding stages
[Figure: example RTL circuit with registers reg1, reg2, reg3, an adder (+), a multiplier (×), state bits s0 and s1, inputs 1 and 2, and LUT1 to LUT4; annotations: plane, logic in plane, plane cycle, folding stage, folding cycle, level-1 register, level-2 register]
Motivational Example (Contd.)
Example optimization objective
  Minimize circuit delay under an area constraint of 32 LEs
  Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops
[Figure: the same example circuit, annotated: 50 LUTs; 14 flip-flops; 8 LUTs, logic depth 4; 38 LUTs, logic depth 7; plane depth 9]
Iterative Design Flow
Start with initial guess for folding level and iteratively refine it
  Large folding level -> better circuit delay, but large area cost
  Initial #folding stages: ⌈50/32⌉ = 2
  Initial folding level: ⌈9/2⌉ = 5
Partition RTL modules into a series of connected LUT clusters
  Logic depth of each cluster at most equal to the folding level
  Significantly speeds up the mapping procedure
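One way to compute the initial guess for the running example (50 LUTs, 32 LEs, plane depth 9). The sketch assumes the number of folding stages is the LUT count divided by the available LEs, rounded up, and the folding level is the plane depth divided by the stage count, rounded up; `initial_folding` is an illustrative name, and one LUT per LE is taken from the example's stated assumption.

```python
import math

def initial_folding(num_luts, available_les, plane_depth):
    """Return (initial #folding stages, initial folding level)."""
    stages = math.ceil(num_luts / available_les)  # stages needed to fit the LUTs
    level = math.ceil(plane_depth / stages)       # LUT levels executed per stage
    return stages, level

stages, level = initial_folding(num_luts=50, available_les=32, plane_depth=9)
# stages == 2, level == 5
```

If the resulting level violates the area constraint after partitioning, the flow lowers it and tries again.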
Iterative Design Flow (Contd.)
Cluster size should be smaller than the area constraint
[Figure: 4-bit multiplier (inputs a0 to a3 and b0 to b3, product bits P0 to P7) built from full adders (FA) and partitioned into cluster 1 and cluster 2, shown for level-5 folding and for level-4 folding; annotation: 34 LUTs > 32 LUTs]
Solution for the Example
Three folding stages using level-4 folding
32 LEs required for mapping the RTL circuit; area constraint satisfied
Circuit delay = 3 * folding cycle delay
[Figure: LE allocation in folding cycles 1 to 3 among reg1-3, s0/s1, storage 1-4, LUT1-4, add, and multiplier clusters c1 and c2; labels include 4, 6, 8 and 32 LEs]
[Figure: iterative flow: choose folding level -> module partition -> constraint satisfied? -> FDS to balance resource usage -> constraint satisfied? -> solution; a "No" at either check leads to "decrease folding level" and another iteration]
NanoMap: Flow Diagram
[Figure: NanoMap flow with numbered steps 1 to 16 in four phases: logic mapping, temporal clustering, temporal placement, routing. Inputs: input network, module library, user constraint, optimization objective. Steps shown: circuit parameter search; perform logic folding?; folding level computation; RTL module partition; schedule each LUT/LUT cluster using FDS; satisfy area constraints?; map each LUT/LUT cluster to SMBs; fast placement using modified VPR placer; placement routable?; delay estimation; satisfy delay constraints?; refine placement?; final placement using modified VPR placer; final routing using VPR router; output reconfiguration bits.]
Force-Directed Scheduling
Perform FDS on RTL modules partitioned into LUTs/LUT clusters
Iteratively schedule LUT/(LUT cluster) to minimize overall resource usage
Model resource usage as a force: F = Kx
  K: distribution graphs (DGs) that describe the probability of resource usage
Aim of FDS: minimize force, indicating minimum increase in resource usage
LE usage depends on LUT computations and register storage operations: two DGs needed
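A minimal sketch of the force computation, following the classic FDS formulation in which x is the change in a node's scheduling probability. Assumptions: each node has a time frame of feasible folding cycles with uniform probability, only one DG (LUT computations) is modelled although the slide notes that two are needed, and predecessor/successor forces are omitted. All names are illustrative.

```python
def distribution_graph(frames, n_cycles):
    """DG[c] = expected number of nodes scheduled in folding cycle c."""
    dg = [0.0] * n_cycles
    for lo, hi in frames.values():
        prob = 1.0 / (hi - lo + 1)       # uniform over the node's time frame
        for c in range(lo, hi + 1):
            dg[c] += prob
    return dg

def self_force(frames, dg, node, cycle):
    """F = sum over the frame of K * x, with K = DG and x = change in probability."""
    lo, hi = frames[node]
    prob = 1.0 / (hi - lo + 1)
    force = 0.0
    for c in range(lo, hi + 1):
        x = (1.0 if c == cycle else 0.0) - prob
        force += dg[c] * x
    return force

# Hypothetical example: A and B are fixed in cycle 0, C may go in cycle 0 or 1.
frames = {"A": (0, 0), "B": (0, 0), "C": (0, 1)}
dg = distribution_graph(frames, n_cycles=2)
best_cycle = min(range(0, 2), key=lambda c: self_force(frames, dg, "C", c))
```

Here the lowest force places C in cycle 1, the less crowded cycle, which is the balancing effect FDS aims for.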
Temporal Clustering
For each folding stage, a constructive algorithm is used to assign LUTs to LEs and pack LEs into MBs and SMBs
Unpacked LUT with a maximal number of inputs selected as initial seed
New LUTs with high attractions to the seed selected and assigned to the SMB
Attractions depend on timing criticality and input pin sharing
Considers attractions across all the folding cycles
[Figure: LUTs A to F assigned across folding cycle 1 and folding cycle 2]
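The seed-and-attract packing can be sketched as a greedy loop. This simplification handles one folding stage and one packing level, and uses shared input nets as the only attraction term; timing criticality and attractions across folding cycles, which the slide mentions, are left out. The function name and the data layout are assumptions.

```python
def cluster_stage(luts, capacity):
    """Greedy packing. luts maps LUT name -> set of input nets.

    Returns a list of clusters, each a list of LUT names of size <= capacity.
    """
    unpacked = dict(luts)
    clusters = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (name order breaks ties).
        seed = max(sorted(unpacked), key=lambda n: len(unpacked[n]))
        cluster = [seed]
        nets = set(unpacked.pop(seed))
        while unpacked and len(cluster) < capacity:
            # Attraction: number of input nets shared with the cluster so far.
            best = max(sorted(unpacked), key=lambda n: len(unpacked[n] & nets))
            nets |= unpacked.pop(best)
            cluster.append(best)
        clusters.append(cluster)
    return clusters

# Hypothetical stage with four LUTs and room for two LUTs per cluster.
example = {
    "A": {"a", "b", "c", "d"},
    "B": {"a", "b", "c"},
    "C": {"p", "q", "r"},
    "D": {"c", "x"},
}
clusters = cluster_stage(example, capacity=2)
```

With the experimental architecture described later (4 MBs of 4 LEs each), `capacity` would be 16 for an SMB.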
Placement and Routing
VPR (U. Toronto) modified to perform placement and support temporal logic folding
  Simulated annealing approach
  Cost function computed across the folding stages
Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects
[Figure: LUTs C and D placed in SMB1 and SMB4 across folding cycle 1 and folding cycle 2]
Experimental Setup
Instance of architecture:
  4 MBs in an SMB
  4 LEs in an MB
  LEs contain a 4-input LUT and 2 flip-flops
Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs
Results based on 100 nm technology parameters to implement CMOS logic and NRAMs
Experimental Results (Contd.)
[Figure: two bar charts over the benchmarks ex1, FIR, ex2, c5315, Biquad, Paulin and ASPP4, each normalized to no-folding and comparing no folding, k enough and k = 16. Chart 1: delay (ns) for AT optimization. Chart 2: #LE * delay advantage for AT optimization.]
Experimental Results (Contd.)
Improvement under AT optimization for RTL benchmarks

            Reduction   Maximum AT    Average AT    Circuit delay
            in #LEs     improvement   improvement   increase
k enough    14.8X       16.2X         11.0X         31.8%
k = 16      9.2X        9.3X          7.8X          19.4%

LE utilization around 100%
50% reduced need for a deep interconnect hierarchy for level-1 vs. no-folding: indicates that trading interconnect area for NRAM area is advantageous
Experimental Results (Contd.)
Flexibility in choosing the best folding level and performing area-delay trade-offs
Mapping results for typical optimizations using Paulin benchmark as an example

         Opt.    Area const.   Delay const.   Folding
         obj.    (#LEs)        (ns)           level
Case 1   AT      No            No             1
Case 2   Delay   No            No             No
Case 3   Area    No            27             4
Case 4   Delay   210           No             3

[Figure: chart "Mapping results for typical optimizations" showing delay (ns) and area (#LEs) for cases 1 to 4, with axis ticks from 1 to 10000]
Conclusions
NATURE: A new high-performance run-time reconfigurable architecture
NanoMap: an integrated optimization design flow for NATURE
Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages
Can be very useful for cost-conscious embedded systems and improvement of future FPGAs
Non-volatility: helpful in secure and low-power processing