Exploiting the Dynamic Partial Reconfiguration on NoC-Based FPGA
Amr Hassan1,2, Hassan Mostafa2,3, Hossam A. H. Fahmy2, Yehea Ismail3
1 Mentor Graphics Corporation
2 Electronics and Communications Engineering Department, Cairo University, Giza 12613, Egypt
3 Center for Nano-electronics & Devices, American University in Cairo & Zewail City for Science and Technology, Cairo, Egypt
Abstract: Dynamic Partial Reconfiguration (DPR) of SRAM-based Field Programmable Gate Arrays (FPGAs) has become a feature demanded by many applications for its ability to add flexibility at runtime. Recently, implementing designs that utilize DPR has become easier than before. However, the techniques that FPGAs use to perform DPR (like ICAP and JTAG) encounter a performance bottleneck: only one DPR is allowed at a time. In this paper, we present a state-of-the-art NoC-based FPGA simulator, NoC-DPR, which supports simulation of dynamic partial reconfiguration. The design limitations and performance degradations of using DPR on a NoC-based FPGA are estimated using the NoC-DPR simulator. Experiments are carried out using the NoC-DPR simulator to measure the reconfiguration time overhead as the number of simultaneous DPRs on the FPGA fabric increases. It is shown that the reconfiguration time overhead increases exponentially with the number of simultaneous DPRs. However, DPR on a NoC-based FPGA can enhance performance compared to DPR on normal FPGAs, with some trade-offs.
Index Terms: Dynamic Partial Reconfiguration, Network on Chip, Field Programmable Gate Arrays, System analysis and design
I. INTRODUCTION
Dynamic partial reconfiguration (DPR) is a promising feature for a number of applications mapped onto SRAM-based Field Programmable Gate Arrays (FPGAs) whose behavior changes at runtime, such as signal processing (including image and video processing) and electronic measurement applications. Furthermore, partially reconfigurable (PR) devices save chip area and power by programming only the physical resources that are needed in each operation phase, which also allows static leakage to be reduced.
Beyond optimizing layout design at the transistor level, the current trend is to increase the number of computing cores working in parallel in order to maximize the capability of modern designs. Consequently, processing power has increased and data-intensive applications have emerged, which raises the challenge of communication between these cores when they are configured on FPGAs. A prominent communication concept known as Network-on-Chip (NoC) has been adapted for FPGAs to handle this challenge. To investigate these concepts we developed NoC-DPR, a cycle-accurate NoC simulator that can be used to simulate NoC-based FPGAs that also support DPR. In NoC-DPR we integrated the SystemC NoC simulator NoCTweak [1] with the SystemC library ReChannel [2], which simulates DPR for general-purpose designs. All processing elements (PEs) of the NoC can be reconfigured dynamically to adapt to a new application at run-time.
This paper is organized as follows. In Section II, the related work on DPR simulation, NoC simulation and the combination of the two is reviewed. In Section III, the NoC-DPR simulator architecture is explained. In Section IV, the case study is presented along with the results. Design insights and recommendations for implementing DPR on NoC-based FPGAs are stated in Section V. Finally, Section VI concludes the paper.
II. RELATED WORK
Since designing for dynamic partial reconfiguration on FPGAs is a fairly complex task, more designers tend to use DPR simulators at early design stages as a proof of concept. Several approaches [3,4,5] have been proposed to model dynamically reconfigurable systems at system level using SystemC.
The modeling can be done using object-oriented techniques while avoiding the limitations of SystemC with respect to dynamic reconfiguration, as in the ReConLib [6] library. The main challenge when modeling dynamic designs using SystemC is the inability to change the system's module topology during simulation. The ReChannel library [2] is an extension to SystemC rather than a modification of the SystemC kernel, unlike previous projects, so it overcomes this modeling limitation without changing the underlying simulation kernel.
On the other hand, several network-on-chip simulators have been developed recently. Some simulators are developed in C++, like BookSim by Jiang et al. [7]. BookSim 2.0 adds features for modeling the router microarchitecture, models inter-router channel delay, and provides support for additional traffic models. Other simulators are developed in SystemC, like Noxim, which was developed by Palesi et al. [8]. In addition to the former two simulators, NoCTweak, written in SystemC by Tran and Baas [1], supports wormhole routers with both synthetic traffic and embedded application patterns.
Other approaches use a NoC as the backbone of an FPGA system to overcome communication issues, like Ehliar and Liu [9], who describe an open-source NoC-based FPGA architecture with low area overhead, high throughput and low latency compared to general NoC performance.
2017 First New Generation of CAS
978-1-5090-6447-2/17 $31.00 © 2017 IEEE
DOI 10.1109/NGCAS.2017.78
III. NOC-DPR SIMULATOR ARCHITECTURE
Our simulator is a command-line tool that models a 2-D mesh network of routers simulated by NoCTweak [1]. Each node consists of a Processing Element (PE), a Network Interface (NI) and an associated router. Each router connects to its four nearest neighboring routers, forming a 2-D mesh network. Using the ReChannel [2] library, each PE can be dynamically reconfigured by a special type of data packet generated by a designated node (the master node (0,0)). Data packets are injected into the network through the source node's router and are routed by a selected routing algorithm to their destinations, where they are immediately consumed.
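The mesh structure described above can be illustrated with a minimal Python sketch. It is not taken from NoCTweak or NoC-DPR: the function names are hypothetical, and dimension-ordered (XY) routing is assumed here purely as one common choice of routing algorithm, since the text only states that a routing algorithm is selected.

```python
def mesh_neighbors(x, y, width, height):
    """Return the routers directly linked to router (x, y) in a 2-D mesh."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    # Routers on the mesh boundary have fewer than four neighbors.
    return [(nx, ny) for nx, ny in candidates
            if 0 <= nx < width and 0 <= ny < height]


def xy_route(src, dst):
    """Assumed XY routing: move along X until aligned, then along Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


MASTER = (0, 0)  # master node that issues reconfiguration packets
route = xy_route(MASTER, (2, 1))
print(route)
print(len(route) - 1, "hops")
```

The hop count of such a route grows with the mesh size, which is one reason packet latency from the master node increases for larger networks.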
The main consideration when merging a DPR simulation library with a NoC simulator is that all NoC modules must be well defined through a clear hierarchy in SystemC. Consequently, we had to implement a separate NI instead of the one embedded in the NoCTweak simulator, so that DPR is performed on the processing core only and not on all node components (PE and NI). However, latency and throughput values change due to this modification, as discussed in detail in Section IV.
A. NoCTweak Simulator
NoCTweak is an open-source 2-D mesh network-on-chip simulator for early exploration of the performance and energy efficiency of on-chip networks. The simulator has been developed using SystemC, a C++ class library, which allows accurate and fast modeling of concurrent hardware modules at cycle-level accuracy [1]. The simulator is composed of a hierarchy of modules (processor (core), network interface (NI) and an associated router) that implement the different functionalities of the network and the simulation environment. Each of these modules has a well-defined interface that facilitates replacement and customization of module implementations without affecting other parts of the simulated system.
B. ReChannel Simulation Library
The dynamic partial reconfiguration process can be modeled in a simplified way: reconfigurable modules are "activated" and "deactivated" by conditionally intercepting the communication between the static and reconfigurable parts of the design. This is achieved in the ReChannel [2] library through the concept of switches (portals), as shown in Fig. 1. Portals allow any SystemC channel to be used in a reconfigurable context, which leads to a highly flexible methodology for modeling reconfigurable systems. Because the main modification of the original system, which is to be extended with reconfigurability, takes place within the interconnection between the static and reconfigurable parts, no changes to existing modules are required. This facilitates interfacing the reconfigurable parts to the static parts. Reconfiguration properties, like configuration times, can be added to those modules using inheritance and can be chosen through arguments passed on the command line.
Fig. 1. A portal connecting two reconfigurable modules to a standard SystemC channel
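The portal idea can be illustrated with a small language-neutral sketch in Python. This is not the ReChannel API: the class and method names (`Portal`, `attach`, `activate`, `write`) are hypothetical and only mimic the behavior described above, where the static side always talks to the portal and the portal forwards or intercepts the call depending on which module is currently active.

```python
class Portal:
    """Hypothetical switch between a static channel and reconfigurable modules."""

    def __init__(self):
        self._modules = {}
        self._active = None  # name of the currently configured module

    def attach(self, name, module):
        self._modules[name] = module

    def activate(self, name):
        if name not in self._modules:
            raise KeyError(f"unknown module: {name}")
        self._active = name

    def deactivate(self):
        self._active = None

    def write(self, value):
        # Intercept the call while no module is configured.
        if self._active is None:
            return None
        return self._modules[self._active].process(value)


class Doubler:
    def process(self, value):
        return 2 * value


class Negator:
    def process(self, value):
        return -value


portal = Portal()
portal.attach("doubler", Doubler())
portal.attach("negator", Negator())

portal.activate("doubler")
print(portal.write(3))   # handled by Doubler
portal.deactivate()      # module is being reconfigured
print(portal.write(3))   # intercepted, no module active
portal.activate("negator")
print(portal.write(3))   # handled by Negator
```

Neither `Doubler` nor `Negator` knows about reconfiguration, which mirrors the point that existing modules need no changes.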
IV. RESULTS AND DISCUSSION
Inserting an explicit Network Interface (NI) with an external decoupling buffer, which stores and synchronizes flits (flow control units, the minimum unit of a message) between the PE and the network router, affects network performance, specifically latency and throughput. We measure latency after inserting the NI and compare it to the latency of NoCTweak using a network buffer size of 2 flits, with different flit injection rates on different network sizes. Fig. 2 shows the difference between the latency of the NoCTweak simulator and that of the NoC-DPR simulator: latency can exceed 50k cycles in NoCTweak, whereas it saturates at 20k cycles in NoC-DPR. This is because the NI of the NoC-DPR simulator controls packet generation from the PE according to the NoC state, and therefore exchanges control signals with the PE.
In contrast, in NoCTweak the PE has an infinite buffer that stores all generated flits and injects them directly into the router input buffer without any control. Consequently, its latency does not saturate, unlike NoC-DPR, whose latency saturates at lower values of 5k, 6k, and 20k cycles with a 2-flit buffer for 2x2, 3x3, and 14x14 NoC sizes, respectively.
On the other hand, in the NoCTweak simulator the throughput saturates at specific values, as shown in Fig. 3a, because with an infinite buffer at the PE the network is loaded with the maximum number of accepted flits at higher FIR (Flit Injection Rate of each PE). In the NoC-DPR simulator, however, the PE stops packet generation when the network is fully loaded. This results in a slight increase in the peak throughput, but the throughput then decreases exponentially because the network is no longer loaded with the maximum number of accepted flits, as shown in Fig. 3b. The saturated throughput values of approximately 0.21 and 0.17 in NoCTweak became peak values of 0.22 and 0.175, reached at FIRs of 0.22 and 0.2, in the NoC-DPR simulator for 2x2 and 3x3 NoC sizes, respectively.
Fig. 2. Average latency for 2-flit buffer depth for: (a) NoCTweak simulator with infinite PE buffer, (b) NoC-DPR simulator
Our experiment aims to simulate applying DPR to the simulated NoC-based FPGA using different network sizes and different numbers of parallel DPRs, so that they can be compared with respect to Reconfiguration Time (RT). First, we used a Virtex-5 xc5vfx100t FPGA to select different partial reconfiguration regions. We then obtained the bitstream size of each region using the Xilinx ISE tool and derived the corresponding RT, and we used these values as input to the NoC-DPR simulator for each experiment. The theoretically estimated RT is calculated using the Partial Reconfiguration Cost Calculator [10], which uses a cost model to estimate RT. We assumed that all PEs of the NoC are different Partial Reconfiguration (PR) regions, while the other network resources (routers, NIs and wires) are assumed to be hardwired in the FPGA chip layout to simplify the RT estimate. The main advantage of using a NoC-based FPGA instead of a normal FPGA is that multiple simultaneous DPRs can be achieved, since we treat each PE as a small FPGA with its own reconfiguration control unit, like the ICAP (Internal Configuration Access Port of Xilinx FPGAs). In the following, we investigate which NoC sizes are suitable, how many parallel DPRs can be afforded, and what the effect on RT is.
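As a rough illustration of how RT relates to bitstream size, the following Python sketch computes a first-order lower bound. It is not the cost model of [10], which accounts for further factors. It assumes that the configuration port is the only bottleneck and uses the commonly quoted peak figures for the Virtex-5 ICAP (32-bit port at 100 MHz) as default parameters. The bitstream sizes are hypothetical and are not measurements from this work.

```python
def estimate_rt_seconds(bitstream_bytes, port_width_bytes=4, port_clock_hz=100e6):
    """First-order RT bound: bitstream size divided by port throughput.

    Defaults assume a 32-bit configuration port clocked at 100 MHz.
    """
    if bitstream_bytes < 0:
        raise ValueError("bitstream size must be non-negative")
    throughput_bytes_per_s = port_width_bytes * port_clock_hz
    return bitstream_bytes / throughput_bytes_per_s


# Hypothetical partial bitstream sizes in bytes (illustration only).
regions = {"small_region": 40_000, "large_region": 400_000}

for name, size in regions.items():
    rt_ms = estimate_rt_seconds(size) * 1e3
    print(f"{name}: {rt_ms:.3f} ms")
```

In a NoC-based FPGA the reconfiguration data additionally travels through the network, so the simulated RT is expected to lie above such a bound.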
Fig. 3. Average throughput for 2-flit buffer depth for: (a) NoCTweak simulator with infinite PE buffer, (b) NoC-DPR simulator
The different numbers of simultaneous DPRs on the NoC are compared based on the correspondence between theoretical and simulated RT for each number, as illustrated in Fig. 4. This metric is also a good indicator of the performance variation from one NoC size to another, as each NoC size has different throughput and latency values that affect RT. We use a fixed buffer size of 8 flits and an FIR of 0.1 flits/cycle for the following experiments. For small network sizes (RT above 1 msec) the gap between theoretical and experimental RT is narrow. For example, with one DPR at a time, only a slight increase in the gap can be noticed as the network size increases beyond 7x7. In contrast, the simulated RT for five simultaneous DPRs drifts from the theoretical value from the beginning (at small network sizes), and the gap jumps at larger network sizes.
Fig. 5 shows the percentage difference between theoretical and experimental RT using one, two, three, four and five simultaneous DPRs. At small NoC sizes, from 2x2 to 9x9, the difference is always below 50% because RT is relatively large compared to the NoC overheads (latency and throughput effects). Starting from NoC size 10x10, on the other hand, NoC latency becomes noticeable and can affect RT.
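The exact formula behind the difference percentage is not stated. A plausible reading, assumed here, is the absolute deviation of the simulated RT from the theoretical RT, normalized by the theoretical value. The following Python sketch uses that assumption with hypothetical RT values.

```python
def difference_percentage(rt_theoretical, rt_simulated):
    """Assumed metric: deviation of simulated RT relative to theoretical RT."""
    if rt_theoretical <= 0:
        raise ValueError("theoretical RT must be positive")
    return abs(rt_simulated - rt_theoretical) / rt_theoretical * 100.0


# Hypothetical RT pairs in milliseconds (theoretical, simulated).
samples = {"case_a": (1.0, 1.35), "case_b": (2.0, 4.0)}

for label, (theory, simulated) in samples.items():
    print(f"{label}: {difference_percentage(theory, simulated):.1f}%")
```

Under this definition, a difference of 100% means that the simulated RT is twice the theoretical estimate.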
As the number of simultaneous DPRs on the NoC increases, the difference increases markedly. When using one DPR at a time, the percentage does not exceed 35% across all NoC sizes, whereas with three simultaneous DPRs it can reach up to 100% at a large NoC size of 18x18. This is due to the complexity of controlling parallel DPRs and the added delays of NoC latency. Furthermore, other nodes have to be prevented from communicating with the node currently under partial reconfiguration.
Fig. 4. Comparison between theoretical and simulated reconfiguration time of multiple DPRs on NoC-based FPGAs
Fig. 5. Difference percentage of reconfiguration time using multiple DPRs on NoC-based FPGAs
V. DESIGN RECOMMENDATIONS
Using DPR on NoC-based FPGAs requires taking some considerations into account:
• A general NoC simulator cannot be used to simulate a DPR application. We have to guarantee that while one processing element (PE) is performing DPR, the other PEs are prevented from sending to or receiving from this PE until the DPR is finished.
• In our simulator we assume that one PE, at (0,0), is the master of the DPR process and is responsible for sending reconfiguration packets. When the target PE receives a reconfiguration packet, it starts to perform the reconfiguration. When the DPR is finished, the target sends an acknowledge packet back to the master node, which broadcasts to all other nodes that the target PE is available again and that any node can communicate with it.
• The clear advantage of using NoC-based FPGAs instead of normal FPGAs is the ability to perform multiple DPRs simultaneously, which was studied in the previous section.
• The recommended network size and number of simultaneous DPRs can be estimated according to the desired reconfiguration time (see Fig. 4 and Fig. 5), as RT is the main parameter, ahead of the other NoC parameters (latency and throughput), that directly affects the deviation between theoretical and simulated results. For instance, if the reconfiguration time can be 100 msec, we can use up to five simultaneous DPRs and any suitable network size from 2x2 to 9x9. In contrast, if the RT limit is 1 msec, the optimal choice is three simultaneous DPRs and a network size larger than 13x13.
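The reconfiguration handshake described in the recommendations above can be sketched as follows. This is a simplified Python illustration, not the NoC-DPR implementation: the class `DprController` and its methods are hypothetical, and the set of locked PEs is kept in one shared object rather than being distributed through network packets.

```python
class DprController:
    """Hypothetical bookkeeping for the master-driven DPR handshake."""

    def __init__(self, master=(0, 0)):
        self.master = master
        self.locked = set()  # PEs currently under reconfiguration
        self.log = []        # (packet type, source, destination)

    def start_reconfiguration(self, target):
        # The master sends a reconfiguration packet; the target becomes unavailable.
        self.locked.add(target)
        self.log.append(("reconfig_packet", self.master, target))

    def can_communicate(self, src, dst):
        # No traffic to or from a PE while it is being reconfigured.
        return src not in self.locked and dst not in self.locked

    def acknowledge(self, target):
        # The target reports completion; the master announces it to all nodes.
        self.log.append(("ack", target, self.master))
        self.locked.discard(target)
        self.log.append(("broadcast_ready", self.master, target))


ctrl = DprController()
ctrl.start_reconfiguration((1, 1))
ctrl.start_reconfiguration((2, 0))       # a second, simultaneous DPR
print(ctrl.can_communicate((0, 1), (1, 1)))  # blocked during DPR
ctrl.acknowledge((1, 1))
print(ctrl.can_communicate((0, 1), (1, 1)))  # allowed again
print(ctrl.can_communicate((0, 1), (2, 0)))  # still blocked
```

Each additional simultaneous DPR adds reconfiguration, acknowledge and broadcast traffic through the master node, which is consistent with the growing control overhead reported in Section IV.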
VI. CONCLUSION
In this paper, a state-of-the-art NoC-DPR simulator was presented and used to derive recommendations on how to use DPR on NoC-based FPGAs and on the optimal NoC size, as stated in the previous section.
NoC-based FPGAs enhance reconfiguration capabilities because their multiple PEs can perform multiple DPRs at a time. However, this requires additional resources such as control units and routers. The time performance of DPR with a NoC is therefore better than that of DPR on a normal FPGA for applications that prioritize reconfiguration time over area overhead. Nevertheless, the number of simultaneous DPRs cannot usefully exceed a certain limit for a given NoC size: beyond it, no further reduction in reconfiguration time is gained and only resource overhead is added.
ACKNOWLEDGMENT
This research was funded by NTRA, ITIDA, Cairo University, Zewail City of Science and Technology, AUC, the STDF, Intel, Mentor Graphics, and MCIT.
REFERENCES
[1] Anh T. Tran and Bevan M. Baas. NoCTweak: a Highly Parameterizable Simulator for Early Exploration of Performance and Energy of Networks On-Chip, Dept. Electr. Comput. Eng., Univ. California, July 2012
[2] Raabe A., Hochgurtel S., Zachmann G., and Anlauf J. K. ReChannel: Describing and Simulating Reconfigurable Hardware in SystemC, ACM Transactions on Design Automation of Electronic Systems (TODAES), v.13 n.1, p.1-18, January 2008
[3] Adriatic Consortium. 2002. Advanced methodology for designing reconfigurable SoC and application-targeted IP-entities in wireless communications webpage. http://www.imec.be/adriatic
[4] Benkhermi, I., Benkhelifa, A., Chillet, D., Pillement, S., Prévotet, J.-C., and Verdier, F. 2005. System-Level modelling for reconfigurable SoCs. In the 20th Conference on Design of Circuits and Integrated Systems (DCIS), Lisboa, Portugal
[5] Alisson V. De Brito, Elmar U. K. Melcher, Wilson Rosas, An open-source tool for simulation of partially reconfigurable systems using SystemC, Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, p.434, March 02-03, 2006
[6] Schallenberg, A., Oppenheimer, F., and Nebel, W. 2004. Designing for dynamic partially reconfigurable FPGAs with SystemC and OSSS. In the Forum on Specification and Design Languages, Lille, France
[7] N. Jiang et al., "BookSim interconnection network simulator," Online, https://nocs.stanford.edu/cgibin/trac.cgi/wiki/Resources/BookSim
[8] R. Palesi et al., "Noxim - the noc simulator," Online, http://noxim.sourceforge.net/
[9] A. Ehliar and D. Liu, "An FPGA based open source network-on-chip architecture", 17th International Conference on Field Programmable Logic and Applications, FPL, pp. 800-803, IEEE, Amsterdam, Holland, 2007
[10] Kyprianos Papadimitriou, Apostolos Dollas, Scott Hauck, Performance of partial reconfiguration in FPGA systems: A survey and a cost model, ACM Transactions on Reconfigurable Technology and Systems (TRETS), v.4 n.4, p.1-24, December 2011