Archives Des Sciences Vol 65, No. 12;Dec 2012
504 ISSN 1661-464X
HW/SW Co-design for FPGA based Video Processing Platform
Yahia SAID (Corresponding author)
Laboratory of Electronics and Microelectronics (EμE)
Faculty of Sciences of Monastir,
University of Monastir 5019, Tunisia
E-mail: [email protected]
Taoufik SAIDANI, Wajdi ELHAMZI, Mohamed ATRI
Laboratory of Electronics and Microelectronics (EμE)
Faculty of Sciences of Monastir,
University of Monastir 5019, Tunisia
E-mail: [email protected], [email protected], [email protected]
Abstract
In this paper we present a Video Processing Platform (VPP) for rapid prototyping based on FPGA (Field
Programmable Gate Arrays) architecture using EDK embedded system and Xilinx System Generator.
This hardware/software co-design platform has been implemented on a Xilinx Spartan 3A DSP FPGA. The
video interface blocks are done in RTL and the MicroBlaze soft processor is used as an embedded video
controller. This paper discusses the architectural building blocks showing the flexibility of the proposed
platform. This flexibility is achieved by using a new design flow based on Xilinx System Generator. This
Video Processing Platform allows custom-processing blocks to be plugged-in to the platform architecture
without modifying the front-end (capturing video data) and back-end (displaying processed output). This
paper presents several examples of video processing applications, such as a Prewitt edge detector and video
wavelet coding that have been realized using the Video Processing Platform (VPP) for real-time video
processing.
Keywords: Field Programmable Gate Arrays (FPGA), Real Time Video Processing, Embedded
Development Kit (EDK), System Generator (XSG).
1. Introduction
Image and video processing are ever-expanding and dynamic areas, with applications reaching into
our everyday life, such as medicine, astronomy, ultrasonic imaging, remote sensing, space exploration,
surveillance, authentication, automated industrial inspection, and many more [1].
Reconfigurable hardware in the form of Field Programmable Gate Arrays (FPGAs) has been proposed as
a way of obtaining high performance for Image Processing, even under real time requirements [2].
Implementing image processing algorithms on reconfigurable hardware minimizes the time-to-market cost,
enables rapid prototyping of complex algorithms and simplifies debugging and verification. Therefore,
FPGAs are an ideal choice for implementation of real time image processing algorithms [3].
With the evolution of FPGA architecture, modern devices embed processors for designing reconfigurable
embedded systems. The design involves a processor, hardware logic IP, and their integration; this is
termed System-on-Chip (SoC) design [4].
The Xilinx Embedded Development Kit (EDK) is offered as an SoC design platform. It provides a rich set
of tools, such as the Software Development Kit (SDK) for developing embedded software applications and
Xilinx Platform Studio (XPS) for hardware development, along with a wide range of embedded processing
Intellectual Property (IP) cores, including processors and peripherals. Integrating all the cores with
processors inside the FPGA leads to a reconfigurable embedded processor system [5].
The introduction of high-level hardware system modeling tools has further accelerated the design of image
processing systems on FPGAs. Xilinx System Generator (XSG) offers a new design methodology that uses a
model-based approach for the design and implementation of Digital Signal Processing (DSP) applications on
FPGAs [6].
XSG is an important design tool, an extension of Simulink consisting of a Simulink library
called the Xilinx blockset that can be mapped directly onto target FPGA hardware. XSG provides the
functionality for performing co-simulation of designs that run both in hardware and in software, which
makes it possible to complete even very long simulations within a much shorter period of time [6]. Figure 1
shows a design flow using XSG.
The software automatically converts the high level system block diagram to RTL. The result can be
synthesized to Xilinx FPGA technology using ISE tools. All of the downstream FPGA implementation
steps including synthesis and place and route are automatically performed to generate an FPGA
programming file.
Figure 1. XSG based design flow for hardware implementation
System Generator provides a system integration platform for the design of video processing system on
FPGAs that allows the RTL, Simulink, MATLAB, and C/C++ components of a DSP system to come
together in a single simulation and implementation environment. It also supports a black box block that
allows RTL to be imported into Simulink and co-simulated.
System Generator constructs the VHDL design of the model, generates a pcore for this model, and integrates
it with the hardware/software platform in the XPS project. The EDK Processor IP block provides an
interface between MicroBlaze and the custom logic being developed in XSG. In this approach, the IP core
export technique is used for designing the SoC system [6].
The Xilinx Embedded Development Kit (EDK) tools make it possible to implement a complete video
processing system on a single FPGA using hardware/software codesign methods. In this approach, custom
image/video processing modules developed in System Generator can be integrated as a dedicated hardware
peripheral to the existing framework.
The objective of this work is to develop a real-time video processing platform (VPP) with input from
a CMOS camera and output to a DVI display, and to verify the resulting video in real time. This platform
enables rapid development of image and video processing algorithms: model-based designs developed
with XSG are converted to hardware blocks that can be incorporated easily into the VPP.
This paper is organized as follows: Section 2 describes the Platform design overview. Section 3 presents
two examples of video processing applications developed with XSG which are a Prewitt edge detector and
video wavelet coding. Finally, a brief conclusion and directions for future work are given in Section 4.
2. Overall Platform Design
The board used for the VPP is the VSK Spartan-3A DSP platform developed by Xilinx [7]. This board carries a
Xilinx Spartan-3A DSP XC3SD3400A-4FGG676C FPGA with 53,712 logic cells, 126 DSP48A slices, and
2,268 Kb of block RAM (BRAM).
The board has an add-on card, the FMC-Video I/O daughter card, that augments the video capabilities of the
Video Processing Platform. The FMC-Video includes a camera interface that allows the capture of data from a
custom camera based on a Micron MT9V022 digital CMOS color image sensor [8].
Images of 742H x 480V pixels, with 8 or 10 bits per pixel at 60 frames per second, are captured by the high
performance MT9V022 image sensor's 10-bit A/D converter and serialized for transmission [9].
The data stream from the camera is a high-speed LVDS stream. It is received and deserialized by a
National DS92LV1212A deserializer, which is capable of carrying LVDS data from a camera with a pixel
rate of 26.6 MHz [8].
This board is ideal for a video processing platform, since it has all the hardware necessary to capture
video and display it on a monitor. Video data are captured from the camera at a resolution of 742x480
progressive at 60 Hz. These data are sent through a Gamma block for correction, and then on to the
Video-to-VFBC block, so that only the active data are sent into the Multi-Port Memory Controller (MPMC).
The default is a 3-frame buffer; a simple sync signal connected between the Video-to-VFBC block and the
Display Controller ensures that the frame being read out is one behind the frame being written into
external memory. The Display Controller then reads data out of memory and passes it to the DVI output.
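The one-frame-behind relationship between the capture and display sides of this pipeline can be illustrated in software. The sketch below is illustrative only: the buffer count follows the 3-frame default described above, but the indexing helper is a hypothetical construction, not taken from the VFBC or Display Controller implementation.

```python
# Minimal sketch of a 3-frame ring buffer in which the display side reads
# one frame behind the frame currently being written by the camera side.
NUM_BUFFERS = 3  # matches the default 3-frame buffer described above

def next_indices(write_idx):
    """Given the buffer being written, return (next_write, read) indices."""
    next_write = (write_idx + 1) % NUM_BUFFERS
    read = (write_idx - 1) % NUM_BUFFERS  # display lags capture by one frame
    return next_write, read

# Simulate a few frame periods: capture advances, display follows one behind.
write_idx = 0
for frame in range(6):
    next_write, read = next_indices(write_idx)
    print(f"frame {frame}: writing buffer {write_idx}, displaying buffer {read}")
    write_idx = next_write
```

Because the reader is always one buffer behind the writer, tearing is avoided without any handshake beyond the simple sync signal.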
We have built a flexible architecture that enables real-time image and video processing. The overview of
the design is given in Figure 2.
Figure 2. Platform design overview
The complete streaming video application includes video interfaces, run-time configurable processing
blocks, and a real-time video processing block. The system is controlled by a MicroBlaze processor [10]
that initializes the VPP peripherals and controls the video processing and frame buffer pipelines by
reading and writing control registers in the system.
The MicroBlaze soft processor core is a 32-bit Harvard Reduced Instruction Set Computer (RISC)
architecture optimized for implementation in Xilinx FPGAs with separate 32-bit instruction and data buses
running at full speed to execute programs and access data from both on-chip and external memory at the
same time [10]. It is used as an embedded video controller in this design. The block diagram of MicroBlaze
is shown in Figure 3.
The peripherals are connected to the Embedded MicroBlaze processor through Processor Local Bus
(PLB). The Processor is connected to dual-port SRAM, called Block RAM (BRAM), using a dedicated
Local Memory Bus (LMB). This bus features separate 32-bit wide channels for program instructions and
program data, using the dual-port feature of the BRAM. The LMB provides single-cycle access to on-chip
dual-port Block RAM.
Figure 3. MicroBlaze core block diagram
The complete video system is created using the Xilinx Embedded Development Kit (EDK) [5] and
System Generator for DSP [6]. The Embedded Development Kit is an integrated development
environment for designing embedded processing systems. System Generator is a system-level modeling
tool from Xilinx that facilitates FPGA hardware design. It can automatically generate accelerator blocks in
the form of a custom peripheral for the embedded video application that allows the MicroBlaze processor to
read and write shared memories in the customized video accelerators.
The synthesis results of the overall system are given in Table 1. The VPP uses only a small fraction of the
FPGA's resources; hence space remains available for additional logic, such as image and video processing
applications.
Table 1. The synthesis results of the proposed platform
Resource Type Used Available %
Slices 7810 23872 33%
Slice Flip Flops 9706 47744 20%
4 input LUTs 11170 47744 24%
bonded IOBs 78 469 17%
BRAMs 64 126 50%
DSP48s 3 126 3%
3. Case Study Using Xilinx System Generator
Two video processing applications have been designed and developed using Xilinx System Generator. A
Prewitt edge detector and video wavelet coding blocks have been designed and tested with VPP, as
previously described. In this section, output images are real-time video results of the different hardware
components generated by System Generator.
3.1 Prewitt Gradient Edge Detector
Edges characterize object boundaries and convey information about object location, shape, size, and
texture. Edge detection therefore has fundamental importance in image processing: because edges delineate
object boundaries, they are useful for segmentation, registration, and identification of objects in a
scene.
Edge detection refers to the process of identifying and locating sharp discontinuities in an image [11]. The
discontinuities are abrupt changes in pixel intensity which characterize boundaries of objects in a scene.
The most well known technique for edge detection involves convolving the image with a 2-D filter, which
is constructed to be sensitive to large gradients in the image while returning values of zero in uniform
regions [12].
Prewitt is a gradient-based edge detection algorithm that performs a 2-D spatial gradient measurement
on the video data. It convolves the original image with two 3x3 kernels. Hence, all of the edges in an
image, regardless of direction, can be detected by summing the two directional edge enhancement
operations.
First, RGB data are converted into grayscale to obtain the image intensity I, using the following equation:

    I = 0.299 R + 0.587 G + 0.114 B    (1)
The kernels are then applied separately to the image intensity to produce separate measurements of the
gradient component in each orientation (called Gx and Gy), as shown in (2).

         [-1  0  +1]                [-1  -1  -1]
    Gx = [-1  0  +1] * I   and Gy = [ 0   0   0] * I    (2)
         [-1  0  +1]                [+1  +1  +1]
These can then be combined to find the absolute magnitude of the gradient at each point and the
orientation of that gradient, as follows:

    |G| = sqrt(Gx^2 + Gy^2) ≈ |Gx| + |Gy|   and   θ = arctan(Gy / Gx)    (3)
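Equations (1)-(3) can be checked with a short software model. The following sketch is a direct, unoptimized reference implementation of the grayscale conversion and the Prewitt gradient; it is not the System Generator pipeline itself, and it uses the hardware-friendly |Gx| + |Gy| approximation of the gradient magnitude.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Luminance conversion per Eq. (1) (standard Rec. 601 weights)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def prewitt(gray):
    """2-D spatial gradient magnitude using the two 3x3 Prewitt kernels."""
    kx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])  # horizontal gradient Gx
    ky = kx.T                                            # vertical gradient Gy
    h, w = gray.shape
    gx = np.zeros_like(gray, dtype=float)
    gy = np.zeros_like(gray, dtype=float)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            window = gray[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(window * kx)
            gy[i, j] = np.sum(window * ky)
    # |G| ~ |Gx| + |Gy|: cheap approximation of sqrt(Gx^2 + Gy^2) in hardware
    return np.abs(gx) + np.abs(gy)
```

A uniform region yields zero response, while a step edge produces a strong magnitude, exactly the behavior Eq. (3) describes.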
The Prewitt edge detector is built as a video processing accelerator, using System Generator for DSP and
Simulink. The design of our filter is shown in Figure 4.
Figure 4. Prewitt IP core in system generator
The System Generator design contains an EDK Processor block that can be exported as an EDK pcore
using the EDK Export Tool compilation target. The export process creates a PLB-based pcore, which is
integrated with the MicroBlaze 32-bit soft RISC processor using Xilinx Platform Studio (XPS) [6].
In the VPP setup, a DVI display shows the edge-detected output from the camera. The experimental setup
for the Prewitt edge detection implementation is presented in Figure 5.
Figure 5. Experimental setup for implementation of edge detection. Input
is from CMOS camera and the output is on a DVI display.
The total resource usage for the system, including the MicroBlaze, bus structure, the Prewitt edge core,
and peripherals, is 9096 slices, equaling 38% of the FPGA's total resources. Table 2 shows the amount of
logic used by the Prewitt edge module alone. The post-synthesis resource usage of this module is 5%, and
it has a post-synthesis estimated maximum frequency of 68.432 MHz.
Table 2. Post-synthesis device utilization for the Prewitt Edge Pcore
Resource Type Used Available %
Slices 1286 23872 6%
Slice Flip Flops 1746 47744 4%
4 input LUTs 1710 47744 4%
bonded IOBs 0 469 0%
BRAMs 5 126 3%
DSP48s 4 126 3%
Maximum Frequency 68.432 MHz
3.2 Discrete Wavelet Transform
The Discrete Wavelet Transform (DWT) is a broadly used digital signal processing technique with
applications in diverse areas such as digital speech recognition, feature extraction, multi-resolution
video processing, and data compression [13]. The DWT, originally implemented through Mallat's filter bank
algorithm [14], has been made more efficient by the development of the lifting scheme, which has been
incorporated into the JPEG 2000 image compression standard.
The lifting scheme operates entirely in the spatial domain and has many advantages over the filter bank
structure, such as lower area, power consumption, and computational complexity.
Lifting has further advantages, such as "in-place" computation of the DWT and integer-to-integer wavelet
transforms, which are useful for lossless coding. The lifting scheme has been developed as a flexible tool
suitable for constructing second-generation wavelets. It is composed of three basic operation stages:
split, predict, and update (Figure 6).
Figure 6. Lifting scheme forward transform
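The split/predict/update structure of Figure 6 can be written down compactly in software. The sketch below implements the reversible integer 5/3 lifting step used by JPEG 2000 (the standard referred to above); the boundary-extension choice and the even-length-input assumption are simplifications made here for brevity.

```python
import numpy as np

def lifting53_forward(x):
    """Forward reversible 5/3 lifting step (JPEG 2000 integer form):
    split into even/odd samples, predict the details, update the averages.
    Assumes an even-length input signal."""
    x = x.astype(int)
    even, odd = x[0::2].copy(), x[1::2].copy()         # split
    # predict: detail d[n] = odd[n] - floor((even[n] + even[n+1]) / 2)
    right = np.append(even[1:], even[-1])              # symmetric edge extension
    d = odd - ((even + right) >> 1)
    # update: approximation s[n] = even[n] + floor((d[n-1] + d[n] + 2) / 4)
    left = np.insert(d[:-1], 0, d[0])                  # symmetric edge extension
    s = even + ((left + d + 2) >> 2)
    return s, d
```

On a constant signal the predict stage yields all-zero details, the hallmark of the "in-place", integer-to-integer computation that makes lifting attractive for lossless coding.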
The lifting-scheme implementation computes a two-level 2D-DWT, which may be realized using filter banks
as shown in Figure 7. The input samples X(n) are passed through two stages of analysis filters.
They are first processed by low-pass (h(n)) and high-pass (g(n)) horizontal filters and subsampled by
two. Subsequently, the outputs (L1, H1) are processed by low-pass and high-pass vertical filters. Note
that L1 and H1 are the outputs of the 1D-DWT, while LL1, LH1, HL1, and HH1 form the one-level
decomposition of the 2D-DWT. From this structure, a separable 2D-DWT with N levels of transformation can
easily be achieved by concatenating 1D-DWT units, with the first stage performing the N transformation
levels on rows and the second performing the N transformation levels on columns. For image compression
purposes, JPEG 2000 recommends an alternating row/column-based structure such as the one presented in
Figure 7. The sub-band decomposition of an image under the standard 2D-DWT with two transformation levels
is presented in Figure 8. "H" and "L" correspond to high-pass and low-pass filter stages, respectively.
Figure 7. Lifting scheme decomposition of 5/3 filter
Figure 8. Subband decomposition for two-level 2D-DWT
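The separable row-then-column structure of Figures 7 and 8 can be sketched as follows. For brevity this sketch uses a simple integer Haar analysis pair as the (h(n), g(n)) filters; the platform's 5/3 lifting filter would slot into the same `forward_1d` role. The function names are illustrative, not taken from the VPP sources.

```python
import numpy as np

def haar_1d(x):
    """Integer Haar analysis pair standing in for the low-pass h(n) and
    high-pass g(n) filters, including the subsample-by-two."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) // 2, even - odd               # (low-pass, high-pass)

def dwt2d_level(img, forward_1d):
    """One level of a separable 2D-DWT: horizontal filtering of every row,
    then vertical filtering of every column, yielding LL, LH, HL, HH."""
    h, w = img.shape
    L = np.empty((h, w // 2), dtype=int)               # row-wise low-pass
    H = np.empty((h, w // 2), dtype=int)               # row-wise high-pass
    for i in range(h):
        L[i], H[i] = forward_1d(img[i])
    bands = {}
    for name, half in (("L", L), ("H", H)):            # column-wise pass
        lo = np.empty((h // 2, w // 2), dtype=int)
        hi = np.empty((h // 2, w // 2), dtype=int)
        for j in range(w // 2):
            lo[:, j], hi[:, j] = forward_1d(half[:, j])
        bands["L" + name], bands["H" + name] = lo, hi
    return bands["LL"], bands["LH"], bands["HL"], bands["HH"]
```

Applying the function again to the LL band produces the second decomposition level of Figure 8, which is exactly the concatenation of 1D-DWT units described above.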
The design of the DWT2D codec in System Generator is shown in Figure 9. Experimental results of the
DWT2D codec implementation are presented in Figure 10.
Figure 9. DWT2D IP core in system generator
Figure 10. Experimental setup for implementation of DWT2D Codec
The total resource usage for the system, including the MicroBlaze, bus structure, the DWT2D codec pcore,
and peripherals, is 8833 slices, equaling 38% of the FPGA's total resources. Table 3 shows the amount of
logic used by the DWT2D codec module alone. The post-synthesis resource usage of this module is 5%, and it
has a post-synthesis estimated maximum frequency of 65.167 MHz.
Table 3. Post-synthesis device utilization for the DWT2D Codec Pcore
Resource Type Used Available %
Slices 1023 23872 5%
Slice Flip Flops 1246 47744 3%
4 input LUTs 1323 47744 3%
bonded IOBs 162 469 4%
BRAMs 3 126 3%
DSP48s 4 126 4%
Maximum Frequency 65.167 MHz
4. Conclusion
Continual growth in the size and functionality of FPGAs over recent years has resulted in an increasing
interest in their use as implementation platforms for image processing applications, particularly real-time
video processing [15].
In this work, we have presented a video processing platform (VPP) for real-time video processing
applications. This platform provides a development environment that allows designers to quickly begin
experimenting with video processing using the Spartan-3A DSP family of FPGAs. An embedded base system
shipped with the VSK [7] provides a familiar starting point from which existing processor-based video
applications can be ported or new designs created. The user can build flexible video processing systems
that include embedded processors and customized video accelerators, and can verify video hardware designs
in a fraction of the time using the hardware co-simulation provided by System Generator.
Two applications have been presented, showing the performance and flexibility of the proposed platform.
For the Prewitt edge detection system architecture, including the MicroBlaze, bus structure, the Prewitt
edge core, and peripherals, the total resource usage is 9096 slices, equaling 38% of the FPGA's total
resources. It has a post-synthesis estimated maximum frequency of 88.547 MHz.
The DWT2D codec system architecture has a maximum frequency of 85.292 MHz and uses 8833 CLB slices at 38%
utilization, so it is possible to implement further parallel processing alongside these architectures on
the same platform.
The Xilinx System Generator tool offers an efficient and straightforward method for transitioning from a
PC-based model in Simulink to a real-time FPGA-based hardware implementation. Custom video accelerator
blocks are captured in the DSP-friendly Simulink modeling environment, converted into custom peripherals
for Platform Studio, and then connected to the embedded system using the processor local bus.
Future work includes the use of the Xilinx System Generator and EDK development tools for the
implementation of a computer vision application, an object detection and tracking system, on the proposed
platform.
References
[1] J. C. Russ, The Image Processing Handbook, Sixth Edition, CRC Press, 2011.
[2] D. Crookes, "Design and implementation of a high level programming environment for FPGA-based image processing," IEEE Proceedings on Vision, Image, and Signal Processing, vol. 4, 2000.
[3] D. V. Rao, S. Patil, N. A. Muthukuma, "Implementation and Evaluation of Image Processing Algorithms on Reconfigurable Architecture using C-based Hardware Descriptive Languages," International Journal of Theoretical and Applied Computer Sciences, pp. 9-34, 2006.
[4] R. Peesapati, S. Sabat, K. Venu, "Automatic IP Core generation in SoC," International Journal of Recent Trends in Engineering, vol. 2, no. 6, 2009.
[5] Xilinx Inc., Embedded System Tools Reference Manual, http://www.xilinx.com
[6] Xilinx System Generator User Guide, http://www.xilinx.com
[7] Spartan-3A DSP FPGA Video Starter Kit User Guide, http://www.xilinx.com
[8] Xtreme DSP Solution FMC-Video Daughter Board Technical Reference Guide, http://www.xilinx.com
[9] Micron MT9V022 CMOS image sensor product brief, http://www.micron.com
[10] MicroBlaze soft processor, http://www.xilinx.com
[11] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679-698, 1986.
[12] S. Behera, M. N. Mohanty, S. Patnaik, "A Comparative Analysis on Edge Detection of Colloid Cyst: A Medical Imaging Approach," Soft Computing Techniques in Vision Science, Studies in Computational Intelligence, Springer, vol. 395, pp. 63-85, 2012.
[13] D. S. Taubman, M. W. Marcellin, JPEG2000: Image Compression Fundamentals, Standards and Practice, Kluwer Academic Publishers, ch. 6, 2002.
[14] S. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, pp. 674-693, 1989.
[15] B. Hutchings, J. Villasenor, "The Flexibility of Configurable Computing," IEEE Signal Processing Magazine, vol. 15, pp. 67-84, 1998.