
mrFPGA: A Novel FPGA Architecture with Memristor-Based Reconfiguration

Jason Cong, Bingjun Xiao

Department of Computer Science, University of California, Los Angeles

{cong, xiao}@cs.ucla.edu

Abstract: In this paper, we introduce a novel FPGA architecture with memristor-based reconfiguration (mrFPGA). The proposed architecture is based on the existing CMOS-compatible memristor fabrication process. The programmable interconnects of mrFPGA use only memristors and metal wires so that the interconnects can be fabricated over logic blocks, resulting in significant reduction of overall area and interconnect delay but without using a 3D die-stacking process. Using memristors to build up the interconnects can also provide capacitance shielding from unused routing paths and reduce interconnect delay further. Moreover, we propose an improved architecture that allows adaptive buffer insertion in interconnects to achieve more speedup. Compared to the fixed buffer pattern in conventional FPGAs, the positions of inserted buffers in mrFPGA are optimized on demand. A complete CAD flow is provided for mrFPGA, with an advanced P&R tool named mrVPR that was developed for mrFPGA. The tool can deal with the novel routing structure of mrFPGA, the memristor shielding effect, and the algorithm for optimal buffer insertion. We evaluate the area, performance and power consumption of mrFPGA based on the 20 largest MCNC benchmark circuits. Results show that mrFPGA achieves 5.5x area savings, 2.3x speedup and 1.57x power savings. Further improvement is expected with combination of 3D technologies and mrFPGA.

Keywords: FPGA; memristor; ASIC; reconfiguration

I. INTRODUCTION

The performance/power efficiency of an application implemented in an application specific integrated circuit (ASIC) can be as much as six orders of magnitude higher than its counterpart coded in CPU [1]. However, the rapid increase of non-recurring engineering cost and design cycle of an ASIC in nanometer technologies makes it impractical to implement most applications in ASIC. This makes the Field Programmable Gate Array (FPGA) increasingly popular. An FPGA is an integrated circuit that can be reconfigured to realize a large range of arbitrary functions according to customer demands. The programmability makes the FPGA a quick and reusable hardware implementation platform for specific applications. Compared to ASIC, FPGA has sacrificed its performance to some extent for programmability [2], but FPGA can still have orders of magnitude improvement in performance/power efficiency compared to CPU [1, 3]. The flexibility and performance of FPGA compared to ASIC and CPU make FPGA an important component in the realm of customizable heterogeneous computing platforms [3].

It is well known that the programmable routing structure in FPGA is the principal source of the performance inferiority of FPGA when compared to ASIC [4–7]. It is reported that the programmable interconnects in FPGA can account for up to 90% of the total area [4], up to 80% of the total delay [5, 6] and up to 85% of the total power consumption [7]. The study in [2] measures the gap between FPGA and ASIC. It shows that the area, delay and power consumption of FPGA are ∼21 times, ∼4 times and ∼12 times as much as those of ASIC, respectively. We see that if the FPGA routing structure (the dominant part in FPGA) gains some improvement, the gap between FPGAs and ASICs will be significantly reduced.

Conventional FPGA suffers a lot from its programmable interconnects, partly due to extensive use of SRAM-based programming bits, multiplexers (or pass transistors), and buffers in the interconnects. One SRAM cell contains as many as six transistors, but can store only one bit of data. The low density of SRAM-based storage increases the area overhead of FPGA programmability and consequently leads to longer routing paths and larger interconnect delay. In addition, SRAM is volatile, so it must stay powered to retain the configuration and therefore contributes to excessive power consumption during stand-by.

Emerging technologies, especially emerging non-volatile memory (NVM) technologies, lead to opportunities of circuit improvement. Emerging NVMs typically have a smaller cell size than that of SRAMs. They also have the desirable property of non-volatility, which means that they can be turned off during stand-by to save power. Among them, the memristor, also often recognized as resistive random-access memory (RRAM), is considered to be the most promising one. Since HP Labs presented the first experimental realization of the memristor in 2008 [8], rapid progress on the fabrication of high-quality memristors has been achieved in the past few years. The leading memristor technology currently outperforms many other emerging NVMs in some important aspects and performs similarly to them in the other aspects. It has been demonstrated that the memristor is scalable below 30 nm [9], can be fabricated with device size as small as F² (F is the feature size) with full CMOS compatibility [10], and can be programmed within 5 ns at the 180 nm technology node [11]. Due to the advantages of its device properties over SRAM and other emerging NVMs, numerous research efforts have been made to adopt memristors for system-level integration [12–15]. However, almost all of these studies use memristors as a new type of memory.

This paper presents a novel FPGA architecture with memristor-based reconfiguration (mrFPGA). With the novel use of memristors in mrFPGA interconnects rather than memory cells, the proposed architecture reduces the gap between FPGAs and ASICs in area, delay and power.

This paper is organized as follows. Section II introduces the conventional FPGA architecture and reviews recent research work on FPGAs with emerging technologies. Section III presents the unique structure of the memristor, its electrical characteristics and its circuit model. Section IV describes the architecture of mrFPGA from high-level overview to detailed design. Section V provides a complete CAD flow and evaluation method for mrFPGA. Section VI presents detailed evaluation results of mrFPGA using the largest twenty MCNC benchmarks. Finally, we draw some conclusions in Section VII.


II. BACKGROUND AND RELATED WORK

A. Conventional FPGA

Fig. 1 shows a conventional FPGA architecture. It is made up of an array of tiles, and each tile consists of one logic block (LB), two connection blocks (CBs) and one switch block (SB). Each logic block contains a cluster of basic logic elements (BLEs), typically look-up tables (LUTs), to provide customizable logic functions. LBs are connected to the routing channels through CBs, and the segmented routing channels are connected with each other through SBs. Fig. 1 also depicts two typical circuit designs of CBs and SBs based on multiplexers (MUXs) and buffers described in [16]. The selector pins of each multiplexer are connected to a group of 6-transistor SRAM cells that determine its connectivity. The circuits in Fig. 1 are replicated up to as many times as the number of pins per side of an LB in a single CB, and up to the number of tracks per channel in a single SB. We see that the CBs and SBs that make up the interconnects of FPGAs have much larger area and higher complexity than the direct interconnects of ASIC.

Fig. 1: Conventional FPGA architecture.
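To give a feel for the configuration overhead described above, the following is a minimal sketch that counts the SRAM cells and SRAM transistors needed to program one routing multiplexer. It assumes binary-encoded select lines, which is an illustrative assumption and not a detail taken from [16]; the function name `mux_config_cost` is hypothetical.

```python
from typing import Tuple

TRANSISTORS_PER_SRAM_CELL = 6  # 6-transistor SRAM cell, as stated in the text


def mux_config_cost(num_inputs: int) -> Tuple[int, int]:
    """Return (SRAM cells, SRAM transistors) needed to configure one MUX.

    Assumption (not from the paper): the MUX uses binary-encoded select
    lines, so it needs ceil(log2(num_inputs)) programming bits.
    """
    if num_inputs < 2:
        raise ValueError("a multiplexer needs at least two inputs")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, in exact integers.
    select_bits = (num_inputs - 1).bit_length()
    return select_bits, select_bits * TRANSISTORS_PER_SRAM_CELL


if __name__ == "__main__":
    for n in (2, 4, 9, 16):
        cells, transistors = mux_config_cost(n)
        print(f"{n}-input MUX: {cells} SRAM cells, {transistors} SRAM transistors")
```

The count covers only the programming bits; the multiplexer's own pass transistors and the buffers add further area on top of it.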

B. Recent Work on FPGAs using Emerging Technologies

With the recent development of emerging technologies, a number of novel FPGA architectures based on these technologies have been proposed in the past few years. A 3D FPGA architecture was proposed in [17]. It partitions the transistors in the routing structure and logic structure of a conventional FPGA into multiple active layers and stacks the layers via monolithic stacking, a 3D integration technology the authors propose to use, as shown in Fig. 2a. It provides a 1.7x performance gain due to the smaller tile area and shorter interconnect distance between tiles. However, 3D stacking without a proper thermal solution results in an excessive increase of heat density and a corresponding degradation of performance [19]. Also, applications of monolithic stacking are currently limited because the high temperature required to fabricate transistors in an upper layer is likely to destroy the metal wires and transistors that have already been fabricated in the layers below [18]. In [14] and [18], two FPGA architectures are proposed using emerging NVMs and through-silicon-via (TSV) based 3D integration. Similar to the monolithic stacking in [17], they also move all the transistors in the routing structure of the FPGA to a top die over the logic structure die, with TSV connections between them, as shown in Fig. 2b. In addition, they replace the SRAM in the FPGA with emerging NVMs, such as phase-change RAM (PCRAM) [18] or memristors [14], to reduce the area needed to store one programming bit and the stand-by power, as shown in Fig. 2c. Overall performance gains of 1.1x and 2x before and after 3D integration, respectively, are reported in these two works.

Fig. 2: Recent work on FPGA using emerging technologies. (a) Monolithic stacking [17]. (b) 3D integration with TSVs [18]. (c) Replace SRAMs with emerging NVMs [14]. (d) Nanowire crossbar [19].

However, one potential problem is that the maximum density of TSVs currently achievable might fail to satisfy the dense connections required between the routing structure die and the logic structure die. In [19], a 3D CMOS/nanomaterial hybrid FPGA was proposed. Again, it separates the transistors of interconnects from those of logic blocks and redistributes them into different dies, as shown in Fig. 2d. The difference is that it uses nanowire crossbars and face-to-face 3D integration technology to provide connections between the dies. It provides a 2.6x performance gain compared to the conventional FPGA architecture, with a certain power overhead brought by the large capacitance of the crossbar array. In addition, the crossbar structure may contain a certain percentage of defective components [19]. To summarize, these works try to stack interconnects over logic blocks to reduce area and interconnect delay. However, they all suffer from a common problem: they still require transistors in the interconnects. To achieve the stacked architectures, these works have to reorganize the interconnects and logic blocks into separate dies and introduce 3D die-stacking (or less mature monolithic stacking). Otherwise, the high temperature during fabrication of the transistors in an upper layer would ruin the existing transistors and metal wires below in the same die. Die-stacking introduces undesirable problems in terms of manufacturability, reliability and the limit of vertical integration density. Some researchers replace the pass transistors in the routing structure of FPGA with emerging memory elements integrated in metal layers with back-end-of-line (BEOL) compatible fabrication [20, 21]. But the improvement is limited because pass transistors make up only a small portion of the routing structure. There are also studies on new reconfigurable circuit structures, such as CMOL [22] and CNTFET-based fine-grain cell matrices [23], with the aim of taking the place of FPGA. Much effort is still needed to realize these revolutionary ideas in products.

III. MEMRISTOR TECHNOLOGY

In this work, we propose to use memristors and metal wires but no transistors to build up the routing structure in mrFPGA. We first introduce the memristor technology in this section. CMOS compatibility is the key property for integrating memristors into the mrFPGA architecture, and several works have demonstrated CMOS-compatible memristor fabrication processes [10, 24, 25]. For example, Fig. 3 shows a schematic view and a cross-sectional transmission electron microscopy (TEM) image of a memristor fabricated by a CMOS-compatible process [24]. As shown in Fig. 3a, a memristor is a two-terminal resistance-like device which can be fabricated between the top metal layer and the other metal layers on top of the transistor layer in the same die. As explained in [24, 25], since no high temperature takes place during their memristor fabrication processes, the processes can be embedded after the back-end-of-line process and will not ruin the transistors and metal wires which have been fabricated below the memristors, as shown in Fig. 3b. We also see from Fig. 3b that the structure of the memristor is quite simple, which makes a small cell size easy to achieve. In spite of its simple structure, a memristor successfully stores one bit of data. A memristor can be programmed by applying a specific programming voltage or current at its two terminals to switch its resistance value at normal operation between a low state and a high state. The resistance values at the two states are usually more than two orders of magnitude apart [24], or even up to six orders of magnitude apart [25]. In addition, a memristor retains its programmed resistance value after being powered off, without any supply voltage. With this programmable property, the memristor is used as a promising type of NVM with two different states: low resistance state (LRS) and high resistance state (HRS). In our work, the important thing is that a memristor acts like a programmable switch which can be reconfigured to determine whether its two terminals are connected or not. If the two terminals need to be connected to each other, we just program the memristor to LRS. Otherwise, we program the memristor to HRS. This special mechanism is quite different from those of conventional memories and provides a unique opportunity to build up a new FPGA routing structure with the novel use of memristors.

Fig. 3: Fabrication structure of memristor integrated with CMOS in the same die [24]. (a) Schematic view. (b) TEM image.
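The programmable-switch behaviour described in this section can be sketched in a few lines. This is only an illustrative model: the class `MemristorSwitch`, the helper `path_resistance`, and the two resistance values are hypothetical and were chosen merely so that the states differ by three orders of magnitude; they are not taken from [24] or [25].

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical resistance values (assumption, not from the paper); chosen
# only so that LRS and HRS are three orders of magnitude apart.
R_LRS_OHMS = 1e3
R_HRS_OHMS = 1e6


@dataclass
class MemristorSwitch:
    """Two-terminal switch: LRS connects its terminals, HRS isolates them."""

    connected: bool = False  # False -> HRS (open), True -> LRS (closed)

    def program(self, connect: bool) -> None:
        # The state persists until reprogrammed, mirroring non-volatility.
        self.connected = connect

    @property
    def resistance(self) -> float:
        return R_LRS_OHMS if self.connected else R_HRS_OHMS


def path_resistance(switches: Iterable[MemristorSwitch]) -> float:
    """Series resistance of a routing path through the given switches."""
    return sum(s.resistance for s in switches)


if __name__ == "__main__":
    path = [MemristorSwitch(), MemristorSwitch()]
    for s in path:
        s.program(True)           # connect both hops (LRS)
    print(path_resistance(path))  # two LRS devices in series
    path[1].program(False)        # break the path (HRS)
    print(path_resistance(path))
```

A single device in HRS dominates the series resistance, which is why an HRS memristor effectively disconnects the path.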

IV. mrFPGA STRUCTURE

A. Basic Structure

As discussed in the previous section, there is an opportunity to build an FPGA routing structure based only on memristors and metal wires. Moreover, the memristor-based routing structure of the FPGA can be placed over the transistor layer in the same die, just like the routing structure of ASICs. The basic schematic view of this efficient FPGA architecture with memristor-based reconfiguration (mrFPGA) is shown in Fig. 4. With the organization of logic blocks and programmable interconnects in Fig. 4, the overall mrFPGA architecture becomes a highly compact array as shown in Fig. 5a. From the figure we see that the area of mrFPGA will be reduced

Fig. 4: mrFPGA architecture: The programmable routing structure is built up with metal wires and memristors which are fabricated on top of logic blocks in the same die according to existing memristor fabrication structures.

to the total area of the logic blocks only, which takes only 10% to 20% of the conventional FPGA area [4]. Fig. 5b is a detailed design of the connection blocks and switch blocks in mrFPGA using memristors and metal wires only. In this design we use five metal layers, M5 to M9, without resource conflict. The circuits of logic blocks in mrFPGA use the metal layers below them, i.e., M1 to M4, which are sufficient for practical FPGAs, e.g. Virtex-6 [26]. The memristor layer is located between the M9 and M8 layers. The positions of the layers are in the same order as in the memristor fabrication structure reported in [24], so as to guarantee the feasibility of mrFPGA fabrication. Though memristors are located close to the top layer in this design, electrical signals do not always have to go through all metal layers to reach memristors, since switch blocks are in M5 ∼ M9. The signals need to reach M1 or M2 from the memristor layer only when they access the pins of logic blocks. Moreover, as stated in Section I, the size of a memristor cell can be close to or even smaller than the width of a metal wire [9, 10]. So memristors can easily fit into the design in Fig. 5b without extra area overhead. The memristor array can be programmed efficiently using the V/2 biasing scheme and two-step write operation proposed in [15].

B. Interconnects with Adaptive Buffer Insertion

One potential problem of the basic mrFPGA routing structure is that there are no buffers in the structure. This leads to a quadratic increase of interconnect delay with interconnect distance. In large FPGAs, this problem would undermine the benefit brought by the significant reduction of the overall area and interconnect delay. To address this problem, we propose an improved FPGA architecture which allows adaptive buffer insertion in the interconnects, as shown in Fig. 6. A limited number of buffers are prefabricated in routing channels and can be connected to the tracks in channels via memristors. Whether to insert the prefabricated buffers and which track uses a buffer depend on the placement result of the circuit to be implemented on mrFPGA. For a routing path long enough to exhibit the quadratic increase of RC delay, we connect buffers to the path at proper positions. To get the optimal insertion choice for each buffer in the routing network, we borrow an idea from the buffer placement algorithm for ASICs [27]. This algorithm decides the best positions of buffers in the interconnects of an ASIC using dynamic programming to achieve the minimum interconnect delay. It constructs delay/capacitance pairs which correspond to different buffer options in a routing tree of an ASIC via a depth-first search and produces the optimal option in time complexity of O(B^2), where B is the number of legal buffer positions. For mrFPGA applications, we let the buffer at a certain position be inserted into the routing track if the algorithm decides to


(a) (b)

Fig. 5: The proposed mrFPGA architecture. (a) Overview of mrFPGA where the FPGA area is contributed by logic blocks only, and (b) A detailed design of the connection blocks and switch blocks in mrFPGA using the memristors and metal wire resources in existing memristor fabrication structures.

Fig. 6: An improved mrFPGA architecture with adaptive buffer insertion in the programmable interconnects.

place a buffer at this position. Whether or not to insert a buffer at a given routing node in mrFPGA is thus optimized on demand. In contrast, in a conventional FPGA the connections between buffers and routing tracks are predetermined during fabrication. A case study in Fig. 7 shows the benefit brought by adaptive buffer insertion in mrFPGA compared to the fixed buffer pattern in a conventional FPGA. To simplify the problem, we assume in this case that the RC delay of a wire with the length of one block is $0.5 R_{wire} C_{wire} = 1$ ns, and that the buffers in interconnects are ideal buffers with infinite drive, no input/output capacitance, and a fixed intrinsic delay of 9 ns. The optimal length of a wire between two adjacent buffers would be

$$k = \sqrt{\frac{T_{buffer}}{0.5 R_{wire} C_{wire}}} = 3$$
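As a brief sketch of where this value comes from, under the idealized assumptions stated above (ideal buffers, and wire delay growing quadratically with length measured in blocks): the delay per block of a long line with one buffer every k blocks can be written and minimized as follows. The symbol D(k) is notation introduced here for illustration only.

```latex
\begin{align*}
D(k) &= \frac{0.5\,R_{wire}C_{wire}\,k^{2} + T_{buffer}}{k}
      = 0.5\,R_{wire}C_{wire}\,k + \frac{T_{buffer}}{k},\\
\frac{dD}{dk} &= 0.5\,R_{wire}C_{wire} - \frac{T_{buffer}}{k^{2}} = 0
\;\Longrightarrow\;
k = \sqrt{\frac{T_{buffer}}{0.5\,R_{wire}C_{wire}}}
  = \sqrt{\frac{9\,\mathrm{ns}}{1\,\mathrm{ns}}} = 3 .
\end{align*}
```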

In a conventional FPGA, the pattern of prefabricated buffers in interconnects follows this result and distributes buffers evenly at a distance of three blocks (as shown in Fig. 7). The problem is that the pattern cannot know the starting point of each net (the offset can be 0, 1 or 2), so the patterns are simply staggered among different tracks

Fig. 7: Comparison between fixed buffer pattern in conventional FPGA and adaptive buffer insertion in mrFPGA.

in the same channel as shown in Fig. 7. When the output pin of a logic block happens to be connected to a wire segment that is buffered right after the connection node, like the block “start” in Fig. 7 connected to track3 in the upper channel, the routing result will deviate from the optimal. The problem can be even worse during the switch from one track to another. A timing-driven design, such as [28], always has to buffer the switches between tracks to avoid the potentially large RC delay caused by an unbuffered connection between the two longest unbuffered track segments, in this case as large as 36 ns if, for example, the two track1s in the channels are connected without a buffer. This drives the routing result farther from the optimal. The table in Fig. 7 lists all the possible delays from the logic block “start” to the logic block “end” according to the settings in a conventional FPGA discussed above. The first column is the track ID in the upper channel, and the first row is the track ID in the right channel. The buffers at the output pin of the “start” block and the input pin of the “end” block are not counted in the delay calculation. As shown


in Fig. 7, the worst case can be 22% slower than the best case in a conventional FPGA. In contrast, adaptive buffer insertion in mrFPGA does not suffer from this deviation from the optimal buffer placement in interconnects. We see from Fig. 7 that in mrFPGA buffers are inserted exactly one per three blocks, resulting in a total delay of only 45 ns. This is even better than the best case in a conventional FPGA.
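The on-demand choice of buffer positions can be illustrated with a much-simplified, one-dimensional dynamic program in the spirit of [27]. This is only a sketch, not the algorithm implemented in mrVPR: it assumes a single two-pin path, the ideal buffers of the case study (so only delay, not capacitance, needs to be tracked), and a 9-block path length that is an assumption chosen here because it happens to reproduce the 45 ns total quoted above. The function name `insert_buffers` and its parameters are hypothetical.

```python
def insert_buffers(n_blocks, wire_delay=1.0, t_buf=9.0):
    """Pick buffer positions on a straight path of n_blocks blocks.

    Assumptions (from the simplified case study, not from mrVPR):
    an unbuffered stretch of L blocks costs wire_delay * L**2 ns, and
    each inserted ideal buffer adds t_buf ns. Legal buffer positions
    are the block boundaries 1 .. n_blocks - 1.
    Returns (minimum delay in ns, sorted list of buffer positions).
    """
    inf = float("inf")
    # best[i]: minimum delay from the source up to position i, given that
    # a buffer (or the sink, when i == n_blocks) sits at position i.
    best = [0.0] + [inf] * n_blocks
    prev = [0] * (n_blocks + 1)
    for i in range(1, n_blocks + 1):
        cost_here = t_buf if i < n_blocks else 0.0  # the sink is not a buffer
        for j in range(i):  # j: previous buffer position (0 = source)
            d = best[j] + wire_delay * (i - j) ** 2 + cost_here
            if d < best[i]:
                best[i], prev[i] = d, j
    # Walk back from the sink to recover the chosen buffer positions.
    positions = []
    i = prev[n_blocks]
    while i > 0:
        positions.append(i)
        i = prev[i]
    return best[n_blocks], positions[::-1]


if __name__ == "__main__":
    delay, where = insert_buffers(9)
    print(f"9-block path: {delay} ns with buffers at blocks {where}")
```

The double loop over legal positions mirrors the quadratic cost in the number of buffer positions mentioned in Section IV-B; a full tree-based version would additionally carry capacitance in each candidate.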

V. CAD FLOW AND EVALUATION OF mrFPGA

A. CAD Flow

To implement a circuit design in an FPGA, a complete CAD flow is needed. The input of the flow is the design to be implemented, and the output of the flow includes the placement and routing (P&R) result used for programming the design into the FPGA as well as an estimation of area, delay and power consumption for the implementation of the design in the FPGA. We propose to use a timing-driven CAD flow for our mrFPGA as shown in Fig. 8. A circuit design first goes through the

Fig. 8: CAD flow for mrFPGA.

design tool ABC [29] to perform logic optimization and technology mapping. A mapped netlist consisting of K-LUTs will be obtained. Then the mapped netlist is fed into the design tool T-VPACK [30] to pack LUTs into logic blocks and then into the design tool mrVPR, developed by us, to accomplish P&R. Finally, we use the generated basic-cell (BC) netlist with delay and capacitance information to run the power estimation tool fpgaEVA LP2 [7]. Since we reuse most parts of the existing CAD flow for conventional FPGAs, the development of the mrFPGA CAD flow is quite easy. Only the P&R tool is specifically developed for mrFPGA. We will introduce the features of this advanced tool in the last part of this section.

B. Equivalent Circuit Model for Interconnect

To perform the timing and power analysis for mrFPGA, we first develop an equivalent circuit model for the routing structure of mrFPGA. Fig. 9 shows the equivalent circuit of a representative routing path from the output pin of a logic block to the input pin of another logic block in mrFPGA and compares it with that of a conventional FPGA. In the circuit model, the MUXs in connection blocks and switch blocks are replaced by memristors at low resistance state (LRS). The wire segments are still modeled as distributed RC lines. Note that the overall resistance and capacitance values of wire segments have changed due to the reduction of the tile area in mrFPGA. Also, according to the mrFPGA architecture in Fig. 5a, the length of a wire segment is one tile shorter than its counterpart in a conventional FPGA. The easiest way to see this is to think about

Fig. 9: Extracted equivalent circuit model of the routing structure of mrFPGA and comparison with conventional FPGA.

the wire segment with the length of one tile. In mrFPGA the wire segment with the length of one tile is just the short segment between two logic blocks, whose length is close to zero compared to the tile side. However, the length of one tile does not disappear. It is counted in the switch block according to the mrFPGA architecture shown in Fig. 5b.

C. Shielding Effect of Memristor

During the development of the circuit models for the routing structure of mrFPGA, we found that the memristor has a shielding effect that helps reduce the interconnect delay of mrFPGA further. The shielding effect of memristors at a high resistance state (HRS) removes the downstream capacitance of unused paths from the total capacitance at a node and thus reduces the RC delay to reach this node. Fig. 10 shows an example of the effect in a switch block compared to a conventional FPGA. In the conventional FPGA, in spite of the

(a) Conventional FPGA. (b) mrFPGA.

Fig. 10: Memristors provide capacitance shielding from unused paths.

fact that the lower two paths are not used as shown in Fig. 10a, the downstream capacitances of these two paths are still added to the total capacitance at node A. In contrast, in mrFPGA only the downstream capacitance of a path in use is added to the total capacitance at node A. The huge resistance value of memristors at HRS provides capacitance shielding from unused paths. We use HSPICE simulation to verify whether the resistance of a memristor at HRS is high enough to qualify as an ideal open switch for the shielding effect. In the mrFPGA circuit shown in Fig. 10b, we set R to 1×10^3 Ω, and C and C_load to 1×10^−12 F. For the actual memristor resistance value, we set R_lrs to 1×10^3 Ω according to [24, 25] and sweep R_hrs from 1×10^3 Ω, the same value as R_lrs, to 1×10^6 Ω. At this starting point of the sweep, the three paths are equivalent and the three downstream capacitances


will all be added to node A in the path from node “start” to node “end”, just as in a conventional FPGA. The HSPICE simulation result for the delay from node “start” to node “end” is shown in Fig. 11. We

[Plot: delay (ns), from 3 to 6, versus R_hrs (Ω), from 10^3 to 10^6. Curves: shielded by open switch; shielded by memristor. Marked region: Actual Memristor Region.]

Fig. 11: The impact of the memristor shielding effect on the path delay.

see that as R_hrs increases, the delay approaches the value obtained when the HRS memristors are replaced by ideal open switches. The resistance value of an actual memristor at HRS is more than two orders of magnitude higher than that at LRS, as reported in [24, 25]. We mark the actual memristor region according to this criterion and see that the delay within this region is very close to the ideal situation. This confirms that the shielding effect of memristors can help reduce path delay further in mrFPGA.
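The trend in Fig. 11 can be reproduced qualitatively with a small lumped model. The sketch below is not the HSPICE setup of Fig. 10b, whose exact topology is not restated here; it assumes a unit step driving node A through R, a capacitance C at node A, one used branch through an LRS memristor to C_load, and two identical unused branches through HRS memristors to capacitance C each. The function name and the forward-Euler integration are choices made for this illustration.

```python
def delay_50_percent(r_hrs, r=1e3, r_lrs=1e3, c=1e-12, c_load=1e-12, dt=1e-12):
    """Time (s) for the used path's load to reach 50% of a unit input step.

    Hypothetical lumped model (an assumption, not the circuit of Fig. 10b):
      source --R--> node A (cap c) --R_lrs--> end node (cap c_load)
      node A --R_hrs--> side node (cap c), twice (two unused paths)
    Integrated with forward Euler; dt is far below the fastest time constant.
    """
    g_src = 1.0 / r
    g_on = 1.0 / r_lrs
    g_off = 1.0 / r_hrs  # float("inf") gives 0.0, i.e. an ideal open switch
    v_a = v_end = v_side = 0.0  # the two side nodes are identical by symmetry
    t = 0.0
    while v_end < 0.5:
        # Net current charging each node capacitance.
        i_a = g_src * (1.0 - v_a) - g_on * (v_a - v_end) - 2.0 * g_off * (v_a - v_side)
        i_end = g_on * (v_a - v_end)
        i_side = g_off * (v_a - v_side)
        v_a += dt * i_a / c
        v_end += dt * i_end / c_load
        v_side += dt * i_side / c
        t += dt
    return t


if __name__ == "__main__":
    for r_hrs in (1e3, 1e4, 1e5, 1e6, float("inf")):
        print(f"R_hrs = {r_hrs:g} ohm -> 50% delay = {delay_50_percent(r_hrs) * 1e9:.3f} ns")
```

In this toy model the delay falls as R_hrs grows and becomes nearly indistinguishable from the open-switch case by 10^6 Ω, which is the behavior the section describes; the absolute values differ from Fig. 11 because the circuit is simplified.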

D. mrVPR: an advanced P&R Tool for mrFPGA

To deal with the novel routing structure of mrFPGA in the P&R step of the CAD flow, we develop an advanced tool named mrVPR (VPR for memristor-based reconfiguration) on the basis of the commonly used FPGA P&R tool VPR [30]. The main contributions of this tool are

(a) VPR for conventional FPGA (b) Our mrVPR for mrFPGA

Fig. 12: Overview of routing resource in the two tools VPR and mrVPR.

as follows:

1) mrVPR can deal with the routing graph of mrFPGA as shown in Fig. 12b. In Fig. 12b the gray blocks are I/O blocks and logic blocks of FPGA and the diagonal wires are the routing paths available in switch blocks. We see that the switch blocks and connection blocks in mrVPR are placed over logic blocks and provide connections for logic blocks nearby in a different way from that of a conventional FPGA shown in Fig. 12a.

2) mrVPR can reflect the memristor shielding effect. In VPR all the device capacitances with prefabricated connections to a node are added to the capacitance of that node regardless of whether the paths to which these devices belong are included in the routing result. In mrVPR, the capacitances on a routing path are added to the corresponding routing nodes only when that path is used.

3) mrVPR is integrated with the algorithm to generate the optimal option for adaptive buffer insertion. We have implemented the algorithm in mrVPR and it shows where buffers are needed for insertion in the final routing result as shown in Fig. 13. The positions where buffers are inserted are marked with the black delta shape. We also see that the interconnect view of the routing result in Fig. 13b is quite similar to that of an ASIC. We expect to see the good performance of mrFPGA in the next section.
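The difference described in item 2 can be sketched in a few lines. The data structures below are hypothetical and chosen only for illustration; they are not VPR's or mrVPR's internal routing-graph representation, and the capacitance values are made up.

```python
def node_capacitance(node, fanout, used_edges=None):
    """Sum the downstream capacitance seen at a routing node.

    fanout: dict mapping a node name to a list of (neighbor, capacitance) pairs
            for every prefabricated connection leaving that node.
    used_edges: None models the conventional (VPR-style) accounting, where every
            prefabricated connection loads the node. A set of (node, neighbor)
            pairs models the shielded (mrVPR-style) accounting, where only
            connections on routed paths contribute.
    """
    total = 0.0
    for neighbor, cap in fanout[node]:
        if used_edges is None or (node, neighbor) in used_edges:
            total += cap
    return total


if __name__ == "__main__":
    # Node A fans out to three paths (illustrative capacitances in fF).
    fanout = {"A": [("p0", 2.0), ("p1", 2.0), ("p2", 2.0)]}
    routed = {("A", "p0")}  # only the first path is used by the router
    print("conventional:", node_capacitance("A", fanout))
    print("shielded:    ", node_capacitance("A", fanout, routed))
```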

(a) Routing resource view. (b) Routing result view

Fig. 13: View of the optimal option for buffer insertion in mrVPR.

VI. EXPERIMENTAL RESULTS

A. Settings

We choose the Xilinx Virtex-6 FPGA platform as our conventional FPGA baseline model in experiments and evaluate mrFPGA at the same technology node. The architecture logic specification of Virtex-6 needed by the CAD flow in Fig. 8 is collected from the Virtex-6 FPGA data sheets (Configurable Logic Block part) [26]. The architecture routing specification of Virtex-6, such as the channel width and the distribution of wire segment lengths (Single, Double, Quad, Long and Global), is extracted with the help of FPGA Editor in the Xilinx ISE environment. The RC delay values of each type of wire segment are calculated from the unit resistance and capacitance values in the lookup tables of ITRS2010 [31]. The timing parameters of buffers inserted in the mrFPGA routing structure, including driving resistance, input capacitance and intrinsic delay, are obtained via HSPICE simulation using PTM device models [32, 33]. The memristor model is extracted from the measurement results of the memristors fabricated by the process in [24], the same process on which mrFPGA is based. The 20 largest MCNC benchmark circuits are used as the input of the CAD flow for the conventional FPGA and mrFPGA respectively, and comparisons are made on the output of the CAD flow in terms of area, delay and power.

B. Evaluation Results

Fig. 14 shows the comparison of the tile area in three FPGA architectures: the baseline Virtex-6 FPGA, its mrFPGA counterpart without buffer insertion (a fictitious case to show the impact of buffer insertion) and mrFPGA with buffer insertion. As


(a) Virtex-6 baseline. Total area = 13993 µm^2.

(b) mrFPGA unbuffered. (c) mrFPGA. Total area = 2547 µm^2.

Fig. 14: Area comparison of a single tile in three architectures (unit: µm^2). (LB: logic block; CB: connection block; SB: switch block; Buf: pre-fabricated buffers to insert on demand in mrFPGA).

discussed in the previous sections, in the case of mrFPGA unbuffered, the area contribution of the interconnects is reduced to zero since the interconnects are made up purely of memristors and metal layers located on top of logic blocks. For mrFPGA with buffer insertion, we have to count the area of pre-fabricated buffers in the channels according to the architecture in Fig. 6. The area contributed by buffer insertion in mrFPGA is pessimistically calculated assuming a uniform distribution of pre-fabricated buffers over channels, with the number of buffers per channel set to the maximum number of buffers that mrVPR needs to insert in one channel of mrFPGA over the 20 benchmarks. We see in Fig. 14 that the area overhead brought by pre-fabricated buffers is low compared to the interconnect area of a conventional FPGA. That is mainly due to the on-demand property of the buffer solution in mrFPGA. The results in Fig. 14 show more than 5.5x savings in total area for mrFPGA.

Table I shows the performance comparison between the Virtex-6 baseline and mrFPGA. It shows a small performance degradation before buffer insertion in mrFPGA and a 2.3x speedup after buffer insertion. The total speedup primarily stems from the reduction of interconnect delay. In the routing structure of mrFPGA, the reduction of tile area by ∼5.5x results in a shortening of wire segments in the programmable interconnects by ∼2.35 times. Both the resistance and capacitance values of wire segments then decrease by ∼2.35 times, contributing to a total reduction of the RC delays of the wire segments by ∼5.5 times. The shielding effect of memristors also helps to improve performance. The quadratic increase of the RC delay of a pure RC network cancels out this benefit in mrFPGA without buffer insertion. We see in Table I that for some benchmarks with long interconnect paths between logic blocks, the pure RC delay without buffers is quite large. However, in mrFPGA with buffer insertion the interconnects perform much better than in both the Virtex-6 baseline and mrFPGA unbuffered. The good performance also stems from the adaptive buffer insertion in mrFPGA.

Table II shows the power consumption of the Virtex-6 baseline and mrFPGA respectively. An average of 40% power savings is achieved. The power savings primarily come from the replacement of transistors in the programmable interconnects, such as MUXs and SRAMs, with the simple structure of memristors. Both dynamic power and static power are reduced due to less capacitance on the routing paths and the non-volatility of memristors.

Note that no die-stacking is needed to achieve this degree of area reduction and speedup. If 3D integration technology is introduced in the future to stack several mrFPGAs together, a further improvement of at least 2x in both density and speed, i.e. 11x and 4.6x respectively, is expected according to the experimental results on 3D architecture in [17].
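The scaling argument behind these numbers can be written out explicitly. The short derivation below uses only figures quoted in this section and in Fig. 14; it additionally assumes a roughly square tile, so that wire length scales with the square root of tile area. The symbols A and ℓ (tile area and wire-segment length) are notation introduced here.

```latex
\begin{align*}
\frac{A_{\text{Virtex-6}}}{A_{\text{mrFPGA}}}
  &= \frac{13993\,\mu\mathrm{m}^2}{2547\,\mu\mathrm{m}^2} \approx 5.5,\\
\frac{\ell_{\text{Virtex-6}}}{\ell_{\text{mrFPGA}}}
  &\approx \sqrt{5.5} \approx 2.35,\\
R_{wire} \propto \ell,\quad C_{wire} \propto \ell
  \;&\Longrightarrow\; R_{wire}C_{wire} \propto \ell^{2},
  \quad\text{so the wire RC delay shrinks by } (2.35)^{2} \approx 5.5,\\
\text{3D projection:}\quad 2 \times 5.5 &= 11 \ \text{(density)},
  \qquad 2 \times 2.3 = 4.6 \ \text{(speed)}.
\end{align*}
```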

VII. CONCLUSIONS

This work presents a novel FPGA architecture with memristor-based reconfiguration named mrFPGA. The programmable interconnects of mrFPGA use memristors and metal wires only. The routing structure can therefore be placed over logic blocks in the same die according to the existing memristor fabrication structure. An improved architecture with adaptive buffer insertion is proposed to further reduce the interconnect delay. A complete CAD flow is provided for mrFPGA with an advanced P&R tool named mrVPR developed for mrFPGA. The tool can deal with the novel routing structure of mrFPGA, the memristor shielding effect, and the algorithm for optimal buffer insertion. An evaluation of mrFPGA is done on the 20 largest MCNC benchmark circuits. Results show that mrFPGA achieves 5.5x area savings, a 2.3x speedup and 1.57x power savings.

REFERENCES

[1] P. Schaumont and I. Verbauwhede, “Domain-specific codesign for embedded security,” Computer, vol. 36, no. 4, pp. 68–74, Apr. 2003.

[2] I. Kuon and J. Rose, “Measuring the Gap Between FPGAs and ASICs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 203–215, Feb. 2007.

[3] J. Cong, V. Sarkar, G. Reinman, and A. Bui, “Customizable Domain-Specific Computing,” IEEE Design and Test of Computers, vol. 28, no. 2, pp. 6–15, Mar. 2011.

[4] V. George, “Low Energy Field-Programmable Gate Array,” Ph.D. dissertation, UC Berkeley, 2000.

[5] A. DeHon, “Reconfigurable Architectures for General-Purpose Computing,” Ph.D. dissertation, MIT, Dec. 1996.

[6] E. Ahmed and J. Rose, “The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density,” IEEE Transactions on VLSI Systems, vol. 12, no. 3, pp. 288–298, Mar. 2004.

[7] F. Li et al., “Power modeling and characteristics of field programmable gate arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 11, pp. 1712–1724, Nov. 2005.

[8] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The missing memristor found,” Nature, vol. 453, no. 7191, pp. 80–83, May 2008.

[9] Y. S. Chen et al., “Highly scalable hafnium oxide memory with improvements of resistive distribution and read disturb immunity,” in IEDM Technical Digest, Dec. 2009, pp. 1–4.

[10] C.-h. Wang et al., “Three-Dimensional 4F^2 ReRAM Cell with CMOS Logic Compatible Process,” in IEDM Technical Digest, 2010, pp. 664–667.

[11] S.-s. Sheu et al., “A 5ns Fast Write Multi-Level Non-Volatile 1 K bits RRAM Memory with Advance Write Scheme,” in VLSI Circuits, Symposium on, 2009, pp. 82–83.

[12] M. Mahvash and A. Parker, “A memristor SPICE model for designing memristor circuits,” in MWSCAS, no. 2, 2010, pp. 989–992.

[13] S. Yu, J. Liang, Y. Wu, and H.-S. P. Wong, “Read/write schemes analysis for novel complementary resistive switches in passive crossbar memory arrays,” Nanotechnology, vol. 21, no. 46, p. 465202, Oct. 2010.

[14] M. Liu and W. Wang, “rFPGA: CMOS-nano hybrid FPGA using RRAM components,” in NanoArch, Jun. 2008, pp. 93–98.

[15] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, “Design Implications of Memristor-Based RRAM Cross-Point Structures,” in DATE, 2011.

[16] G. Lemieux and D. Lewis, “Circuit design of routing switches,” in International Symposium on FPGAs. New York, New York, USA: ACM Press, 2002, pp. 19–28.

[17] M. Lin, A. El Gamal, Y.-C. Lu, and S. Wong, “Performance Benefits of Monolithically Stacked 3-D FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 681–229, Feb. 2007.

[18] Y. Chen, J. Zhao, and Y. Xie, “3D-nonFAR: Three-Dimensional Non-Volatile FPGA ARchitecture Using Phase Change Memory,” in ISLPED, 2010, p. 55.

[19] C. Dong, D. Chen, S. Haruehanroengra, and W. Wang, “3-D nFPGA: A Reconfigurable Architecture for 3-D CMOS/Nanomaterial Hybrid Digital Circuits,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 54, no. 11, pp. 2489–2501, Nov. 2007.

[20] C. Chen et al., “Efficient FPGAs using nanoelectromechanical relays,” in International Symposium on FPGAs, 2010, pp. 273–282.

[21] P.-E. Gaillardon et al., “Emerging memory technologies for reconfigurable routing in FPGA architecture,” in ICECS. IEEE, Dec. 2010, pp. 62–65.


TABLE I: Critical Path Delay and Comparison (unit: ns)

          Virtex-6 Baseline           mrFPGA unbuffered           mrFPGA
Benchmark Interconnect  Total         Interconnect  Total         Interconnect  Total
alu4      6.62          9.54          5.79 (1.14x)  8.71 (1.09x)  0.83 (7.89x)  4.37 (2.17x)
apex2     7.84          10.4          6.38 (1.22x)  9.67 (1.08x)  0.94 (8.33x)  4.85 (2.15x)
apex4     6.79          9.71          5.13 (1.32x)  8.05 (1.20x)  1.52 (4.45x)  4.44 (2.18x)
bigkey    10.9          13.3          29.3 (0.37x)  30.5 (0.43x)  2.46 (4.46x)  4.83 (2.76x)
clma      12.0          14.6          14.0 (0.85x)  16.5 (0.88x)  1.95 (6.18x)  5.00 (2.93x)
des       15.1          17.9          34.4 (0.44x)  37.9 (0.47x)  5.29 (2.86x)  7.84 (2.28x)
diffeq    4.11          5.37          2.67 (1.53x)  5.04 (1.06x)  0.36 (11.1x)  3.29 (1.62x)
dsip      9.61          10.9          28.6 (0.33x)  29.9 (0.36x)  3.51 (2.73x)  4.81 (2.26x)
elliptic  5.93          8.36          10.0 (0.58x)  12.4 (0.67x)  1.04 (5.68x)  4.09 (2.04x)
ex1010    14.2          16.8          14.0 (1.01x)  17.3 (0.96x)  2.80 (5.07x)  6.40 (2.62x)
ex5p      7.23          10.1          3.91 (1.84x)  7.14 (1.42x)  0.97 (7.43x)  4.20 (2.41x)
frisc     8.97          9.92          9.75 (0.91x)  10.8 (0.91x)  0.58 (15.4x)  4.68 (2.12x)
misex3    6.20          9.06          5.14 (1.20x)  8.37 (1.08x)  1.00 (6.16x)  4.54 (1.99x)
pdc       11.1          14.4          15.0 (0.74x)  18.0 (0.80x)  2.98 (3.72x)  5.90 (2.45x)
s298      8.39          11.1          5.43 (1.54x)  7.77 (1.42x)  0.80 (10.4x)  4.13 (2.68x)
s38417    6.16          7.36          4.40 (1.40x)  5.91 (1.24x)  0.09 (66.5x)  3.72 (1.97x)
s38584.1  8.28          11.2          18.6 (0.44x)  20.7 (0.53x)  1.42 (5.82x)  4.90 (2.28x)
seq       7.45          10.6          5.09 (1.46x)  7.95 (1.34x)  0.95 (7.76x)  4.49 (2.37x)
spla      10.8          13.8          9.74 (1.11x)  12.6 (1.09x)  2.03 (5.34x)  5.32 (2.60x)
tseng     4.90          7.27          6.14 (0.79x)  8.29 (0.87x)  0.95 (5.13x)  3.32 (2.18x)
average   -             -             - (1.01x)     - (0.94x)     - (9.63x)     - (2.30x)

TABLE II: Power Consumption and Comparison (unit: mW)

          Virtex-6 Baseline           mrFPGA unbuffered           mrFPGA
Benchmark Interconnect  Total         Interconnect  Total         Interconnect  Total
alu4      34.4          52.0          6.87 (5.01x)  24.4 (2.12x)  12.9 (2.66x)  30.5 (1.70x)
apex2     40.1          58.6          7.75 (5.18x)  26.2 (2.23x)  13.1 (3.06x)  31.6 (1.85x)
apex4     26.6          42.9          4.42 (6.02x)  20.6 (2.07x)  7.48 (3.56x)  23.7 (1.80x)
bigkey    49.7          375.          6.19 (8.03x)  332. (1.13x)  13.0 (3.81x)  338. (1.10x)
clma      74.0          137.          9.57 (7.73x)  73.5 (1.87x)  17.3 (4.27x)  81.2 (1.69x)
des       112.          558.          19.0 (5.89x)  465. (1.20x)  35.9 (3.12x)  482. (1.15x)
diffeq    15.8          32.6          2.11 (7.48x)  18.9 (1.72x)  4.09 (3.86x)  20.9 (1.55x)
dsip      79.8          409.          12.0 (6.61x)  341. (1.19x)  21.5 (3.71x)  351. (1.16x)
elliptic  54.0          158.          7.73 (6.98x)  112. (1.41x)  14.1 (3.82x)  118. (1.33x)
ex1010    61.0          109.          8.29 (7.36x)  56.7 (1.93x)  14.6 (4.15x)  63.1 (1.73x)
ex5p      20.8          34.9          3.46 (6.00x)  17.5 (1.98x)  5.90 (3.52x)  20.0 (1.74x)
frisc     28.1          60.5          3.20 (8.79x)  35.5 (1.70x)  5.61 (5.01x)  37.9 (1.59x)
misex3    33.2          50.2          6.38 (5.21x)  23.3 (2.15x)  11.2 (2.96x)  28.1 (1.78x)
pdc       57.0          96.8          8.84 (6.45x)  48.5 (1.99x)  14.9 (3.80x)  54.7 (1.76x)
s298      16.5          29.2          2.60 (6.34x)  15.2 (1.91x)  4.64 (3.56x)  17.3 (1.68x)
s38417    69.7          125.          9.98 (6.98x)  65.5 (1.91x)  17.4 (4.00x)  72.9 (1.71x)
s38584.1  78.4          282.          9.06 (8.65x)  213. (1.32x)  16.4 (4.77x)  220. (1.28x)
seq       37.1          55.6          7.28 (5.10x)  25.7 (2.16x)  12.3 (3.02x)  30.7 (1.80x)
spla      50.9          86.8          8.57 (5.93x)  44.5 (1.95x)  14.1 (3.58x)  50.1 (1.73x)
tseng     20.3          70.7          2.44 (8.32x)  52.7 (1.33x)  4.88 (4.17x)  55.2 (1.28x)
average   -             -             - (6.70x)     - (1.76x)     - (3.72x)     - (1.57x)
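As a quick consistency check, the "average" rows for the mrFPGA total columns appear to be arithmetic means of the per-benchmark ratios. The sketch below recomputes them from the ratios listed in Tables I and II; the list and function names are hypothetical, and the averaging method is an inference from the numbers rather than something stated in the text.

```python
from statistics import mean

# Parenthesized ratios from the "mrFPGA, Total" columns, in table row order
# (alu4, apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010,
#  ex5p, frisc, misex3, pdc, s298, s38417, s38584.1, seq, spla, tseng).
TOTAL_DELAY_SPEEDUP = [
    2.17, 2.15, 2.18, 2.76, 2.93, 2.28, 1.62, 2.26, 2.04, 2.62,
    2.41, 2.12, 1.99, 2.45, 2.68, 1.97, 2.28, 2.37, 2.60, 2.18,
]
TOTAL_POWER_SAVING = [
    1.70, 1.85, 1.80, 1.10, 1.69, 1.15, 1.55, 1.16, 1.33, 1.73,
    1.74, 1.59, 1.78, 1.76, 1.68, 1.71, 1.28, 1.80, 1.73, 1.28,
]


def average_ratio(ratios):
    """Arithmetic mean of per-benchmark improvement ratios."""
    return mean(ratios)


if __name__ == "__main__":
    print(f"mean total-delay speedup: {average_ratio(TOTAL_DELAY_SPEEDUP):.2f}x")
    print(f"mean total-power saving:  {average_ratio(TOTAL_POWER_SAVING):.2f}x")
```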

[22] D. B. Strukov and K. K. Likharev, “CMOL FPGA: a reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices,” Nanotechnology, vol. 16, no. 6, pp. 888–900, Jun. 2005.

[23] P.-E. Gaillardon, F. Clermidy, I. O’Connor, and J. Liu, “Interconnection scheme and associated mapping method of reconfigurable cell matrices based on nanoscale devices,” in NanoArch, Jul. 2009, pp. 69–74.

[24] K. Tsunoda et al., “Low Power and High Speed Switching of Ti-doped NiO ReRAM under the Unipolar Voltage Source of less than 3V,” in IEDM Technical Digest, Dec. 2007, pp. 767–770.

[25] R. Huang et al., “Resistive switching of silicon-rich-oxide featuring high compatibility with CMOS technology for 3D stackable and embedded applications,” Applied Physics A, pp. 927–931, Mar. 2011.

[26] Xilinx, “Virtex-6 FPGA data sheets.” [Online]. Available: http://www.xilinx.com/support/documentation/virtex-6.htm

[27] L. van Ginneken, “Buffer placement in distributed RC-tree networks for minimal Elmore delay,” in ISCAS, 1990, pp. 865–868.

[28] I. Kuon and J. Rose, “Area and delay trade-offs in the circuit and architecture design of FPGAs,” in International Symposium on FPGAs, 2008, pp. 149–158.

[29] “Berkeley Logic Synthesis and Verification Group, ABC: A System for Sequential Synthesis and Verification, Release 61225.” [Online]. Available: http://www.eecs.berkeley.edu/∼alanmi/abc/

[30] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Norwell, MA: Kluwer, 1999.

[31] “International Technology Roadmap for Semiconductors,” 2010. [Online]. Available: http://www.itrs.net/Links/2010ITRS/Home2010.htm

[32] W. Zhao and Y. Cao, “Predictive Technology Model (PTM) website.” [Online]. Available: http://ptm.asu.edu/

[33] W. Zhao and Y. Cao, “New Generation of Predictive Technology Model for Sub-45nm Design Exploration,” in ISQED, 2006, pp. 585–590.