Exploiting the Dynamic Partial Reconfiguration on NoC-Based FPGA

Amr Hassan (1, 2), Hassan Mostafa (2, 3), Hossam A. H. Fahmy (2), Yehea Ismail (3)

(1) Mentor Graphics Corporation
(2) Electronics and Communications Engineering Department, Cairo University, Giza 12613, Egypt
(3) Center for Nano-electronics & Devices, American University in Cairo & Zewail City for Science and Technology, Cairo, Egypt

Abstract—Dynamic Partial Reconfiguration (DPR) of SRAM-based Field Programmable Gate Arrays (FPGAs) has become a feature demanded by many applications because it adds flexibility at runtime. Recently, implementing designs that use DPR has become easier than before. However, the techniques that FPGAs use to perform DPR (such as ICAP and JTAG) encounter a performance bottleneck: only one DPR is allowed at a time. In this paper, we present a state-of-the-art NoC-based FPGA simulator, NoC-DPR, which supports simulation of dynamic partial reconfiguration. The design limitations and performance degradation of using DPR on a NoC-based FPGA are estimated using the NoC-DPR simulator. Experiments are carried out with the simulator to measure the reconfiguration time overhead as the number of simultaneous DPRs on the FPGA fabric increases. It is shown that the reconfiguration time overhead increases exponentially with the number of simultaneous DPRs. However, DPR on a NoC-based FPGA can enhance performance compared to DPR on conventional FPGAs, with some trade-offs.

Index Terms—Dynamic Partial Reconfiguration, Network on Chip, Field Programmable Gate Arrays, System analysis and design

I. INTRODUCTION

Dynamic partial reconfiguration (DPR) is a promising feature for a number of applications mapped onto SRAM-based FPGA (Field Programmable Gate Array) technology that have a dynamic nature at runtime, such as signal processing (including image and video processing) and electronic measurement applications. Furthermore, partially reconfigurable (PR) devices can save chip area and power by programming only the physical resources that are needed in each operation phase, which also allows for static leakage reduction.

Beyond optimizing layout design at the transistor level, the current trend is to increase the number of computing cores operating in parallel in order to maximize the capability of modern designs. Consequently, processing power has increased and data-intensive applications have emerged, which raises the challenge of communication between these cores when they are configured on FPGAs. A prominent communication concept known as Network-on-Chip (NoC) has been adapted for FPGAs to handle this challenge. To investigate these concepts, we developed a tool called NoC-DPR, a cycle-accurate NoC simulator that can be used to simulate NoC-based FPGAs that also support DPR. In NoC-DPR, we integrated a SystemC NoC simulator known as NoCTweak [1] with a SystemC library (ReChannel) [2] that simulates DPR for general-purpose designs. All processing elements (PEs) of the NoC can be reconfigured dynamically to adapt to a new application at runtime.

This paper is organized as follows. In Section II, related work on DPR simulation, NoC simulation, and their combination is reviewed. In Section III, the NoC-DPR simulator architecture is explained. In Section IV, the case study is presented along with the results. Design insights and recommendations for implementing DPR on NoC-based FPGAs are stated in Section V. Finally, Section VI concludes the paper.

II. RELATED WORK

Since designing for dynamic partial reconfiguration on FPGAs is a fairly complex task, more designers are tending to use DPR simulators at early design stages as a proof of concept. Several approaches [3,4,5] have been proposed to model dynamically reconfigurable systems at the system level using SystemC.

The modeling can be done using object-oriented techniques that avoid the limitations of SystemC with respect to dynamic reconfiguration, as in the ReConLib library [6]. The main challenge when modeling dynamic designs using SystemC is the inability to change the system’s module topology during simulation. The ReChannel library [2] is an extension to SystemC rather than a modification of the SystemC kernel, unlike previous projects; it overcomes this SystemC modeling limitation without changing the underlying simulation kernel.

On the other hand, several network-on-chip simulators have been developed recently. Some simulators are developed in C++, such as BookSim by Jiang et al. [7]. BookSim 2.0 adds features for modeling the router microarchitecture, models inter-router channel delay, and provides support for additional traffic models. Other simulators are developed in SystemC, such as Noxim, developed by Palesi et al. [8]. In addition to the former two simulators, NoCTweak, written in SystemC by Tran and Baas [1], supports wormhole routers with both synthetic traffic and embedded application patterns.

Other approaches use a NoC as the backbone of an FPGA system to overcome communication issues, such as Ehliar and Liu [9], who describe an open-source NoC-based FPGA architecture with low area overhead, high throughput, and low latency compared to general NoC performance.

2017 First New Generation of CAS

978-1-5090-6447-2/17 $31.00 © 2017 IEEE

DOI 10.1109/NGCAS.2017.78


III. NOC-DPR SIMULATOR ARCHITECTURE

Our simulator is a command-line tool that models a 2-D mesh network of routers simulated by NoCTweak [1]. Each node consists of a Processing Element (PE), a Network Interface (NI), and an associated router. Each router connects to its four nearest neighboring routers, forming a 2-D mesh network. Using the ReChannel library [2], each PE can be dynamically reconfigured by a special type of data packet generated by a designated node (the master node (0,0)). Data packets are injected into the network through the node's router. Packets are routed through the network of routers by a selected routing algorithm to their destinations, where they are immediately consumed.
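
The mesh structure and the path of a reconfiguration packet from the master node can be illustrated with a minimal Python sketch. It assumes dimension-ordered (XY) routing purely for illustration; the text only says "a selected routing algorithm", and the function names `mesh_neighbors` and `xy_route` are hypothetical, not part of NoCTweak or NoC-DPR.

```python
def mesh_neighbors(x, y, width, height):
    """Return the routers adjacent to (x, y) in a width x height 2-D mesh.

    Interior routers have four neighbors; edge and corner routers have fewer.
    """
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(nx, ny) for nx, ny in candidates
            if 0 <= nx < width and 0 <= ny < height]


def xy_route(src, dst):
    """Assumed XY routing: travel along the X dimension first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


MASTER = (0, 0)  # the master node that issues reconfiguration packets
route = xy_route(MASTER, (2, 1))
hops = len(route) - 1
print(route, hops)
```

The hop count grows with the distance from the master node, which is one reason larger meshes add more network latency on top of the reconfiguration time itself.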

The main consideration when merging a DPR simulation library with a NoC simulator is that all NoC modules must be well defined through a clear hierarchy in SystemC. Consequently, we had to implement a separate NI instead of the one embedded in the NoCTweak simulator, so that DPR is performed on the core process only and not on all node components (PE and NI). However, latency and throughput values change due to this modification, as discussed in detail in Section IV.

A. NoCTweak Simulator

NoCTweak is an open-source 2-D mesh network-on-chip simulator for early exploration of the performance and energy efficiency of on-chip networks. The simulator has been developed using SystemC, a C++ class library, which allows accurate and fast modeling of concurrent hardware modules at cycle-level accuracy [1]. The simulator is composed of a hierarchy of modules (processor (core), network interface (NI), and an associated router) that implement different functionalities of the network and simulation environment. Each of these modules has a well-defined interface that facilitates replacement and customization of module implementations without affecting other parts of the simulated system.

B. ReChannel Simulation Library

To model the dynamic partial reconfiguration process, we can simplify it by treating reconfigurable modules as being “activated” and “deactivated” by conditionally intercepting the communication between the static and reconfigurable parts of the design. This is achieved in the ReChannel library [2] through the concept of switches (portals), as shown in Fig. 1. Portals allow any SystemC channel to be used in a reconfigurable context, which makes modeling reconfigurable systems highly flexible. Because the main modification of the original system that is to be extended with reconfigurability takes place in the interconnection between the static and reconfigurable parts of the system, no changes to existing modules are required. This simplifies interfacing the reconfigurable parts to the static parts. Reconfiguration properties, such as configuration times, can be added to those modules using inheritance and can be chosen through arguments parsed from the command line.

Fig. 1. A portal connecting two reconfigurable modules to a standard SystemC channel
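
The portal idea can be illustrated with a small Python sketch. This is a conceptual toy, not the ReChannel API: the class `Portal` and its methods `register`, `activate`, `deactivate`, and `write` are hypothetical names chosen for this example. It only shows the principle of conditionally intercepting traffic between a static side and whichever reconfigurable module is currently active.

```python
class Portal:
    """Toy switch between a static channel and several reconfigurable modules."""

    def __init__(self):
        self.modules = {}   # module name -> callable that consumes a value
        self.active = None  # name of the currently configured module, if any

    def register(self, name, handler):
        self.modules[name] = handler

    def activate(self, name):
        if name not in self.modules:
            raise KeyError(name)
        self.active = name

    def deactivate(self):
        self.active = None

    def write(self, value):
        """Forward a value from the static side; intercept it if nothing is active."""
        if self.active is None:
            return None
        return self.modules[self.active](value)


portal = Portal()
portal.register("doubler", lambda v: 2 * v)
portal.register("negator", lambda v: -v)

portal.activate("doubler")
a = portal.write(21)   # handled by "doubler"
portal.deactivate()
b = portal.write(21)   # intercepted: no module is configured
portal.activate("negator")
c = portal.write(21)   # handled by "negator"
print(a, b, c)
```

Note that neither handler knows about the portal, which mirrors the point made above: the change is confined to the interconnection, and the existing modules stay untouched.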

IV. RESULTS AND DISCUSSION

Inserting an explicit Network Interface (NI) with an external decoupling buffer between the PE and the network router affects network performance, specifically latency and throughput. The buffer is responsible for storing and synchronizing flits (flow control units, the minimum unit of a message). We measure latency after inserting the NI and compare it to the latency of NoCTweak using a network buffer size of 2 flits, with different flit injection rates and different network sizes. Fig. 2 shows the difference between the latency of the NoCTweak simulator and that of the NoC-DPR simulator: latency can exceed 50k cycles in NoCTweak, whereas it saturates at 20k cycles in NoC-DPR. This is because the NI of the NoC-DPR simulator controls packet generation from the PE according to the NoC state; it therefore sends control signals to and receives them from the PE.

In contrast, in NoCTweak the PE has an infinite buffer that stores all generated flits and injects them directly into the router input buffer without any control. Consequently, its latency has no saturation value, unlike NoC-DPR, whose latency saturates at the lower values of 5k, 6k, and 20k cycles with a 2-flit buffer for 2x2, 3x3, and 14x14 NoC sizes, respectively.
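
The qualitative difference between an unbounded PE buffer and an NI that throttles flit generation can be reproduced with a deliberately simple single-queue model. The sketch below is not NoCTweak or NoC-DPR; the function `simulate` and the rates used are hypothetical, and the whole network is collapsed into one acceptance rate.

```python
from collections import deque


def simulate(cycles, inject_rate, accept_rate, buffer_depth=None):
    """Return the worst flit latency (in cycles) observed during the run.

    buffer_depth=None models an unbounded PE buffer; an integer models an
    NI that stops flit generation while its buffer is full.
    """
    queue = deque()       # generation cycle of each waiting flit
    inject_credit = 0.0
    accept_credit = 0.0
    worst = 0
    for cycle in range(cycles):
        inject_credit += inject_rate
        if inject_credit >= 1.0:
            inject_credit -= 1.0
            if buffer_depth is None or len(queue) < buffer_depth:
                queue.append(cycle)
            # otherwise the NI back-pressures the PE: no flit is generated
        accept_credit += accept_rate
        if accept_credit >= 1.0:
            accept_credit -= 1.0
            if queue:
                worst = max(worst, cycle - queue.popleft())
    return worst


# Overloaded case: the PE offers 0.5 flits/cycle, the network accepts 0.25.
unbounded_short = simulate(1000, 0.5, 0.25)
unbounded_long = simulate(4000, 0.5, 0.25)
throttled_short = simulate(1000, 0.5, 0.25, buffer_depth=2)
throttled_long = simulate(4000, 0.5, 0.25, buffer_depth=2)
print(unbounded_short, unbounded_long, throttled_short, throttled_long)
```

In this toy model the unbounded case keeps growing with simulation length, while the throttled case settles at a fixed value, which is the same saturation behavior reported for the two simulators.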

On the other hand, in the NoCTweak simulator the throughput saturates at specific values, as shown in Fig. 3a, because of the infinite buffer at the PE: the network is loaded with the maximum number of accepted flits at higher FIR (Flit Injection Rate of each PE). In the NoC-DPR simulator, however, the PE stops packet generation when the network is fully loaded. This results in a slight increase in the peak throughput, but the throughput then decreases exponentially because the network is no longer loaded with the maximum number of accepted flits, as shown in Fig. 3b. The saturated throughput values of approximately 0.21 and 0.17 in NoCTweak become peak values of 0.22 and 0.175, at FIRs of 0.22 and 0.2, in the NoC-DPR simulator for 2x2 and 3x3 NoC sizes, respectively.

Fig. 2. Average latency for 2-flit buffer depth for: (a) NoCTweak simulator with infinite PE buffer, (b) NoC-DPR simulator

Our experiment aims to simulate applying DPR on the simulated NoC-based FPGA using different network sizes and different numbers of parallel DPRs, so that we can compare them with respect to Reconfiguration Time (RT). First, we used a Virtex-5 xc5vfx100t FPGA to select different partial reconfiguration regions. We then obtained the bitstream size of each region using the Xilinx ISE tool, and from it the RT. Afterwards, we used these values as input to the NoC-DPR simulator for each experiment. The theoretically estimated RT is calculated using the Partial Reconfiguration Cost Calculator [10], which uses a cost model to estimate RT. We assumed that all PEs of the NoC are different Partial Reconfiguration (PR) regions. Other network resources, such as routers, NIs, and wires, are therefore assumed to be hardwired in the FPGA chip layout to simplify the RT estimate. The main advantage of using a NoC-based FPGA instead of a conventional FPGA is that multiple simultaneous DPRs can be achieved, since we treat each PE as a small FPGA that has its own reconfiguration control unit, like the ICAP (Internal Configuration Access Port) of Xilinx FPGAs. In the following, we investigate which NoC sizes are suitable and how many parallel DPRs can be afforded, and we estimate the effect on RT.
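
As a rough illustration of how a bitstream size relates to an ideal RT, and of why independent configuration ports help, consider the following Python sketch. It is not the cost model of [10]: it only divides the bitstream size by the raw port bandwidth and ignores all NoC and controller overheads. The default 32-bit width and 100 MHz clock are assumptions for this example, the bitstream sizes are hypothetical, and both function names are made up here.

```python
def reconfiguration_time_s(bitstream_bytes, port_width_bits=32, clock_hz=100e6):
    """Ideal time to push one partial bitstream through one configuration port."""
    bytes_per_second = (port_width_bits / 8) * clock_hz
    return bitstream_bytes / bytes_per_second


def parallel_reconfiguration_time_s(bitstream_sizes, ports):
    """Ideal time to load several bitstreams when `ports` loads can overlap.

    Bitstreams are assigned greedily, largest first, to the least-loaded port.
    """
    loads = [0.0] * ports
    for size in sorted(bitstream_sizes, reverse=True):
        least_loaded = loads.index(min(loads))
        loads[least_loaded] += reconfiguration_time_s(size)
    return max(loads)


sizes = [400_000] * 4  # hypothetical partial bitstream sizes in bytes
serial = parallel_reconfiguration_time_s(sizes, ports=1)    # one shared port
parallel = parallel_reconfiguration_time_s(sizes, ports=4)  # one port per PE
print(f"serial: {serial * 1e3:.3f} ms, parallel: {parallel * 1e3:.3f} ms")
```

Under these idealized assumptions four overlapping loads finish in a quarter of the serial time; the experiments that follow quantify how far the simulated RT departs from such an ideal estimate once NoC effects are included.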

Fig. 3. Average throughput for 2-flit buffer depth for: (a) NoCTweak simulator with infinite PE buffer, (b) NoC-DPR simulator

The different numbers of simultaneous DPRs on the NoC are compared based on the correspondence between theoretical and simulated RT for each number, as illustrated in Fig. 4. This metric is also a good indicator of the performance variation from one NoC size to another, as each NoC size has different values of throughput and latency that affect RT. We use a fixed buffer size of 8 flits and an FIR of 0.1 flits/cycle for the following experiments. For small network sizes (RT above 1 msec), the gap between theoretical and experimental RT is narrow. For example, with one DPR at a time, only a slight increase in the gap can be noticed as the network size grows beyond 7x7. In contrast, the simulated RT for five simultaneous DPRs drifts from the theoretical value from the beginning (at small network sizes), and the gap jumps at larger network sizes.

Fig. 5 shows the percentage difference between theoretical and experimental RT using one, two, three, four, and five simultaneous DPRs. For small NoC sizes, from 2x2 to 9x9, the difference is always below 50% because RT is relatively larger than any NoC overhead (latency and throughput effects). Starting from a NoC size of 10x10, however, NoC latency becomes noticeable and can affect RT.

As the number of simultaneous DPRs on the NoC increases, the difference increases markedly. With one DPR at a time, the percentage does not exceed 35% across all NoC sizes, while with three simultaneous DPRs it can reach up to 100% at a large NoC size of 18x18. This is due to the complexity of controlling parallel DPRs and the added delays of NoC latency. Furthermore, other nodes have to be prevented from communicating with the node currently undergoing PR.
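
The percentage difference discussed above can be written down as a one-line computation. The sketch assumes the difference is taken relative to the theoretical RT, which the text does not state explicitly; the function name and the sample values are hypothetical.

```python
def rt_difference_percent(theoretical_rt, simulated_rt):
    """Deviation of simulated RT from theoretical RT, in percent of the latter."""
    if theoretical_rt <= 0:
        raise ValueError("theoretical RT must be positive")
    return 100.0 * (simulated_rt - theoretical_rt) / theoretical_rt


# Hypothetical example: a simulated RT twice the theoretical estimate
# corresponds to a 100% difference.
doubled = rt_difference_percent(1.0, 2.0)
print(doubled)
```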

Fig. 4. Comparison between theoretical and simulated reconfiguration time of multiple DPRs on NoC-based FPGAs

Fig. 5. Percentage difference of reconfiguration time using multiple DPRs on NoC-based FPGAs

V. DESIGN RECOMMENDATIONS

Performing DPR on NoC-based FPGAs requires taking some considerations into account:

• A general NoC simulator cannot be used to simulate a DPR application. We have to guarantee that while one processing element (PE) is performing DPR, other PEs are prevented from sending to or receiving from this PE until the DPR is finished.

• In our simulator, we assume that one PE (0,0) is the master of the DPR process and is responsible for sending reconfiguration packets. When the target PE receives a reconfiguration packet, it starts to perform reconfiguration. When the DPR is finished, the target sends an acknowledge packet back to the master node, which broadcasts to all other nodes that the target PE is available again and that any node can communicate with it.

• The clear advantage of using NoC-based FPGAs instead of conventional FPGAs is the ability to perform multiple DPRs simultaneously, as studied in the previous section.

• The recommended network size and number of simultaneous DPRs can be estimated according to the desired reconfiguration time (see Fig. 4 and Fig. 5), as RT is the main parameter, ahead of the other NoC parameters (latency and throughput), that directly affects the deviation between theoretical and simulated results. For instance, if the reconfiguration time can be 100 msec, we can use up to five simultaneous DPRs and any suitable network size from 2x2 to 9x9. In contrast, if the RT limit is 1 msec, the optimal choice is three simultaneous DPRs and a network size larger than 13x13.
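
The master-driven handshake described in the first two recommendations can be sketched as a small bookkeeping class. This is a simplified illustration under stated assumptions, not the NoC-DPR implementation: the class `DprController`, its methods, and the packet labels are hypothetical, and packets are recorded in a log instead of being routed through a network.

```python
class DprController:
    """Master-side bookkeeping of which PEs are currently being reconfigured."""

    def __init__(self, width, height, master=(0, 0)):
        self.nodes = {(x, y) for x in range(width) for y in range(height)}
        self.master = master
        self.reconfiguring = set()
        self.log = []  # (packet_type, source, destination) tuples

    def start_dpr(self, target):
        """Master sends a reconfiguration packet; the target becomes unreachable."""
        if target not in self.nodes or target == self.master:
            raise ValueError(f"invalid DPR target: {target}")
        self.reconfiguring.add(target)
        self.log.append(("reconfigure", self.master, target))

    def finish_dpr(self, target):
        """Target acknowledges; master tells the other nodes it is reachable again."""
        if target not in self.reconfiguring:
            raise ValueError(f"{target} is not being reconfigured")
        self.reconfiguring.remove(target)
        self.log.append(("acknowledge", target, self.master))
        for node in sorted(self.nodes - {self.master, target}):
            self.log.append(("available", self.master, node))

    def can_communicate(self, src, dst):
        """Traffic is allowed only if neither endpoint is under reconfiguration."""
        return src not in self.reconfiguring and dst not in self.reconfiguring


ctrl = DprController(3, 3)
ctrl.start_dpr((1, 1))
ctrl.start_dpr((2, 2))  # a second, simultaneous DPR
blocked = ctrl.can_communicate((0, 1), (1, 1))
ctrl.finish_dpr((1, 1))
unblocked = ctrl.can_communicate((0, 1), (1, 1))
print(blocked, unblocked, len(ctrl.log))
```

Every completed DPR costs one acknowledge packet plus a broadcast to the remaining nodes, so the control traffic grows with both the mesh size and the number of simultaneous DPRs.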

VI. CONCLUSION

In this paper, a state-of-the-art NoC-DPR simulator was presented and used to derive recommendations on how to use DPR on NoC-based FPGAs and on the optimal NoC size, as stated in the previous section.

NoC-based FPGAs enhance reconfiguration capabilities thanks to their multiple PEs, which can perform multiple DPRs at a time. However, this requires additional resources, such as control units and routers. Thus, the time performance of DPR with a NoC is better than that of DPR on a conventional FPGA for applications that prioritize reconfiguration time over area overhead. Nevertheless, the number of simultaneous DPRs cannot exceed a certain limit for a given NoC size: beyond that limit, no further reduction in reconfiguration time is gained, while more resource overhead is added.

ACKNOWLEDGMENT

This research was funded by NTRA, ITIDA, Cairo University, Zewail City of Science and Technology, AUC, the STDF, Intel, Mentor Graphics, and MCIT.

REFERENCES

[1] Anh T. Tran and Bevan M. Baas. NoCTweak: a Highly Parameterizable Simulator for Early Exploration of Performance and Energy of Networks On-Chip, Dept. Electr. Comput. Eng., Univ. California, July 2012

[2] Raabe A., Hochgurtel S., Zachmann G., and Anlauf J. K. ReChannel: Describing and Simulating Reconfigurable Hardware in SystemC, ACM Transactions on Design Automation of Electronic Systems (TODAES), v.13 n.1, p.1-18, January 2008

[3] Adriatic Consortium. 2002. Advanced methodology for designing reconfigurable SoC and application-targeted IP-entities in wireless communications webpage. http://www.imec.be/adriatic

[4] Benkhermi, I., Benkhelifa, A., Chillet, D., Pillement, S., Prévotet, J.-C., and Verdier, F. 2005. System-Level modelling for reconfigurable SoCs. In the 20th Conference on Design of Circuits and Integrated Systems (DCIS), Lisboa, Portugal

[5] Alisson V. De Brito, Elmar U. K. Melcher, Wilson Rosas, An open-source tool for simulation of partially reconfigurable systems using SystemC, Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, p.434, March 02-03, 2006

[6] Schallenberg, A., Oppenheimer, F., and Nebel, W. 2004. Designing for dynamic partially reconfigurable FPGAs with SystemC and OSSS. In the Forum on Specification and Design Languages, Lille, France

[7] N. Jiang et al., “BookSim interconnection network simulator,” Online, https://nocs.stanford.edu/cgibin/trac.cgi/wiki/Resources/BookSim

[8] R. Palesi et al., “Noxim - the noc simulator,” Online, http://noxim.sourceforge.net/

[9] A. Ehliar and D. Liu, "An FPGA based open source network-on-chip architecture", 17th International Conference on Field Programmable Logic and Applications, FPL, pp. 800-803, IEEE, Amsterdam, Holland, 2007

[10] Kyprianos Papadimitriou, Apostolos Dollas, Scott Hauck, Performance of partial reconfiguration in FPGA systems: A survey and a cost model, ACM Transactions on Reconfigurable Technology and Systems (TRETS), v.4 n.4, p.1-24, December 2011
