Real Time Wavelet Video Denoising in FPGA

7/24/2019 Real Time Wavelet Video Denoising in FPGA

1/12

Hindawi Publishing CorporationEURASIP Journal on Embedded SystemsVolume 2006, Article ID 16035, Pages112DOI 10.1155/ES/2006/16035

A Real-Time Wavelet-Domain Video DenoisingImplementation in FPGA

Mihajlo Katona,1 Aleksandra Pizurica,2 Nikola Teslic,1 Vladimir Kovacevic,1 and Wilfried Philips2

1 Chair for Computer Engineering, University of Novi Sad, Fruskogorska 11, 21000 Novi Sad, Serbia and Montenegro2 Department of Telecommunications and Information Processing, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium

Received 15 December 2005; Accepted 13 April 2006

The use of field-programmable gate arrays (FPGAs) for digital signal processing (DSP) has increased with the introduction ofdedicated multipliers, which allow the implementation of complex algorithms. This architecture is especially effective for data-intensive applications with extremes in data throughput. Recent studies prove that the FPGAs offer better solutions for real-timemultiresolution video processing than any available processor, DSP or general-purpose. FPGA design of critically sampled discretewavelet transforms has been thoroughly studied in literature over recent years. Much less research was done towards FPGA designof overcomplete wavelet transforms and advanced wavelet-domain video processing algorithms. This paper describes the paral-lel implementation of an advanced wavelet-domain noise filtering algorithm, which uses a nondecimated wavelet transform andspatially adaptive Bayesian wavelet shrinkage. The implemented arithmetic is decentralized and distributed over two FPGAs. Thestandard composite television video stream is digitalized and used as a source for real-time video sequences. The results demon-strate the effectiveness of the developed scheme for real-time video processing.

Copyright 2006 Mihajlo Katona et al. This is an open access article distributed under the Creative Commons AttributionLicense, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properlycited.

1. INTRODUCTION

Video denoising is important in numerous applications, suchas television broadcasting systems, teleconferencing, videosurveillance, and restoration of old movies. Usually, noise re-duction can significantly improve visual quality of a video aswell as the effectiveness of subsequent processing tasks, likevideo coding.

Noise filters that aim at a high visual quality make use ofboth spatial and temporal redundancy of video. Such filtersare known as spatio-temporal or three-dimensional (3D) fil-

ters. Often 2D spatial filter and 1D temporal filter are appliedseparately, and usually sequentially (because spatial denois-ing facilitates motion detection and estimation). Temporalfiltering part is often realized in a recursive fashion in orderto minimize the memory requirements. Numerous existingapproaches range from lower complexity solutions, like 3Drational [1] and 3D order-statistic [2, 3] algorithms to so-phisticated Bayesian methods based on 3D Markov models[4,5].

Multiresolution video denoising is one of the increas-ingly popular research topics over recent years. Roosmalen etal.[6] proposed video denoising by thresholding the coeffi-cients of a specific 3D multiresolution representation, which

combines 2D steerable pyramid decomposition (of the spa-tial content) and a 1D wavelet decomposition (in time). Re-lated to this, Selesnick and Li [7] investigated wavelet thresh-olding in a nonseparable 3D dual-tree complex wavelet rep-resentation. Rusanovskyy and Egiazarian [8] developed anefficient video denoising method using a 3D sliding windowin the discrete cosine transform domain. Other recent mul-tiresolution schemes employ separable spatial/temporal fil-ters, where the temporal filter is motion adaptive recursivefilter. Such schemes were proposed, for example, by Pizuricaet al. [9] where a motion selective temporal filter follows the

spatial one, and by Zlokolica et al. [10] where a motion-compensated temporal filter precedes the spatial one. Lessresearch was done so far towards hardware design of thesemultiresolution video denoising schemes.

The use of the FPGAs for digital signal processing hasincreased with the introduction of dedicated multipliers,which facilitate the implementation of complex DSP algo-rithms. Such architectures are especially effective for data-intensive applications with extremes in data throughput.With examples for video processing applications Draper et al.[11] present performance comparison of FPGA and general-purpose processors. Similarly, Haj [12] illustrates two dif-ferent wavelet implementations in the FPGAs and compares
http://-/?-http://-/?-


2/12

2 EURASIP Journal on Embedded Systems

these with general-purpose and DSP processors. Both studiescome to the conclusion that the FPGAs are far more suitablefor real-time video processing in the wavelet domain thanany available processor, DSP or general-purpose.

The hardware implementation of the wavelet transformis related to the finite-impulse-response (FIR) filter design.

Recently, the implementation of FIR filters has become quitecommon in the FPGAs. A detailed guide for the FPGA filterdesign is in[13] and techniques for area optimized imple-mentation of FIR filters are presented, for example, in [14].A number of different techniques for implementing the crit-ically sampled discrete wavelet transform (DWT) in the FP-GAs exist [1521] including the implementation of MPEG-4 wavelet-based visual texture compression system [22]. Re-cently, the lifting scheme [2325] is introduced for real-time DWT [20,26] as well as the very-large-scale-integration(VLSI) implementation of the DWT using embedded in-struction codes for symmetric filters [27]. The lifting schemeis attractive for hardware implementations because it re-

places multipliers with shift operations. The FPGA imple-mentations of overcomplete wavelet transforms are muchless studied in literature.

Our initial techniques and results in FPGA implementa-tion of wavelet-domain video denoising are in [28, 29]. Thesetwo studies were focusing on different aspects of the devel-oped system: implementation of the wavelet transform anddistributed computing over the FPGA modules in[28] andcustomization of a wavelet shrinkage function by look-uptables for implementation in read-only-memories (ROMs)[29]. The description was on a more abstract level focusingon the main concepts and not on the details of the architec-tural design.

In this paper, we report a full architectural design ofa real-time FPGA implementation of a video denoising al-gorithm based on an overcomplete (nondecimated) wavelettransform and employing sophisticated locally adaptivewavelet shrinkage. We propose a novel FIR filter design forthe nondecimated wavelet transform based on the algorithmatrous[30]. The implemented spatial/temporal filter is sepa-rable, where a motion-adaptive recursive temporal filter fol-lows the spatial filter as was proposed in [9]. We present anefficient customization of the locally adaptive spatial waveletfilter using a combination of read-only-memories (ROMs)and a dedicated address generation network. We design anefficient implementation of a local window for wavelet pro-cessing using an array of delay elements. Our design of thecomplete denoising scheme distributes computing over twoFPGA modules, which switch their functionality in time:while one module performs the direct wavelet transform ofthe current frame, the other module is busy with the in-verse wavelet transform of the previous frame. After eachtwo frames, the functioning of the two modules is reversed.We present a detailed data flow of the proposed scheme. Forlow-to-moderate noise levels, the designed FPGA implemen-tation yields a minor performance loss compared to the soft-ware version of the algorithm. This proves the potentials ofthe FPGAs for real-time implementations of highly sophisti-cated and complex video processing algorithms.

The paper is organized as follows.Section 2presents anoverview of the proposed FPGA design, including the mem-ory organization (Section 2.1) and data flow (Section 2.2).Section 3 details the FPGA design of the different build-ing blocks in our video denoising scheme. We start withsome preliminaries for the hardware design of the non-

decimated wavelet transform (Section 3.1)and present theproposed pipelined FPGA implementation (Section 3.2).Next, we present the FPGA design of the locally adaptivewavelet shrinkage(Section 3.3) and finally the FPGA imple-mentation of the motion-adaptive recursive temporal filter(Section 3.4).Section 4presents the real-time environmentused in this study. The conclusions are inSection 5.

2. REAL-TIME IMPLEMENTATION WITH FPGA

An overview of our FPGA implementation is illustrated inFigure 1. We use two independent modules working in paral-lel. Each module is implemented in a separate FPGA. While

one module performs the wavelet decomposition of an in-put TV frame, the other module performs the inverse wavelettransform of the previous TV frame. The two modules switchtheir functionality in time. The wavelet-domain denoisingblock is located in front of the inverse wavelet transform.

The proposed distributed algorithm implementationover the two modules allows effective logic decentralizationwith respect to input and output data streams. Namely, whileone FPGA module is handling the input video stream per-forming the wavelet decomposition, the other FPGA mod-ule is reading the wavelet coefficients for denoising, sendingthem to the wavelet reconstruction, and building up the vi-sually improved output video stream.

2.1. Memory organization

The nondecimated wavelet transform demands significantmemory resources. For example, in our implementation withthree decomposition levels we need to store nine frames ofwavelet coefficients for every input frame. In addition, weneed an input memory buffer and an output buffer for iso-lating data accesses from different clock domains.

The input data stream is synchronized with a 13.5 MHzclock. For three decomposition levels the complete waveletdecomposition and reconstruction has to be completed withthe clock of at least 3 13.5 = 40.5 MHz. The set-up ofour hardware platform requires the output data stream at27 MHz. Table 1 lists the required interfaces of the buffersthat are used in the system.

The most critical timing issue is at the memory buffer forstoring the wavelet coefficients. It has to provide simultane-ousreadand writeoptions at 40.5 MHz. Due to lack of theSDRAM controller that supports this timing issue, the wholeprocessing is split in two independent parallel modules. Theidea is to distribute the direct and the inverse wavelet process-ing between these modules. While one module is performingthe wavelet decomposition of the current frame, the othermodule is performing the inverse wavelet transform of the


3/12

Mihajlo Katona et al. 3

Wavelet coefficient RAM Wavelet coefficient RAM

40.5 MHz 40.5 MHz

Denoising Denoising

W 1 W W 1 WW 1

Control module Control module

Outputbuffer

Temporalfilter

Inputbuffer

Control module

27 MHz 13.5 MHz

Figure1: A detail of the FPGA implementation of the proposed wavelet-domain video denoising algorithm.

Table1: Memory interfaces.

Buffers Write port (MHz) Read port (MHz)

Input buffer 13.5 40.5

Wavelet coefficients buffer 40.5 40.5

Output buffer 40.5 27

previous frame. With such organization, one module readsand the other module writes the coefficients. The approxima-tion subband (LL band) during the wavelet decompositionand composition is stored in the onboard SRAM memory.This allows us to use onlyreadorwritememory port duringone frame.

2.2. Data flow

The data flow through all the memory buffers and bothFPGAs in our scheme is shown in Figure 2.The total delayis 4 frames. During the first 20 milliseconds, the input frame

A0 is stored in the input buffer at a clock rate of 13.5 MHz.During the next 20 milliseconds, this frame is read from theinput buffer and is wavelet transformed in a 40.5 MHz clockdomain, with 3 decomposition scales W1(A0),W2(A0), andW3(A0). In parallel to this process, the next frameA1is writ-ten in the input buffer. The following time slot of 20 mil-liseconds is currently not used for processingA0, but is re-served for future additional processing in the wavelet do-

main. Within this period the frame A1 is read from the in-put buffer and is decomposed in its wavelet coefficients. TheframesA0 andA1 are processed by FPGA1. The next inputframe,A2, is written in the input buffer, and is wavelet trans-formed in the next time frame by FPGA2.

The denoising and the inverse wavelet transform of theframeA0 are performed afterwards. During this period thewavelet coefficients of the frameA0 are read from the mem-ory, denoised and the output frame is reconstructed withthe inverse wavelet transformW1(A0). During the last re-construction stage (the reconstruction at the finest waveletscale), the denoised output frame is written to the outputmemory buffer. Parallel to this process, FPGA2 performs the

wavelet decomposition of the frameA2 and the input frameA3is stored in the input buffer.

Finally, 4 20 milliseconds = 80 milliseconds after theframe A0 appeared at the system input (4 frames later), itis read from the output buffer in a 27 MHz clock domainand is sent to the selective recursive temporal filter and to thesystem output afterwards. The output data stream is alignedwith a 100 Hz refresh rate, which means that the same frameis sent twice to the output within one time frame of 20 mil-liseconds. Additionally, FPGA2 performs the wavelet decom-position of the frame A3. Further on,A4 frame is written tothe input buffer and is decomposed in the following timeframe under the control of FPGA1.

In this scheme, the two FPGAs actually switch their func-tionality after each two frames. The FPGA1 performs thewavelet decomposition for two frames, while the FPGA2performs the inverse wavelet transform of the previous twoframes. After two frames, this is reversed.

3. ALGORITHM CUSTOMIZATION FORREAL-TIME PROCESSING

We design an FPGA implementation of a sequantial spa-tial/temporal video denoising scheme from [9], which is de-picted inFigure 3.Note that we use an overcomplete (non-decimated) wavelet transform to guarantee a high-qualityspatial denoising. In this representation, with three decom-position levels the number of the wavelet coefficients is 9times the input image size. Therefore we choose to perform

the temporal filtering in the image domain (after the in-verse wavelet transform) in order to minimize the memoryrequirements.

3.1. The customization of the wavelet transform

While hardware implementations of the orthogonal wavelettransform have been extensively studied in literature [1621, 26, 27], much less research has been done towardsimplementations of the nondecimated wavelet transform.We develop a hardware implementation of the non-decimat-ed wavelet transform based on the algorithm atrous[30] andwith the classical three orientation subbands per scale. This


4/12


Write

Read

In

A0 A1 A2 A3 A4 A5 A6 A7

A

1 A0 A1 A2 A3 A4 A5 A6

Write

Read

Wavelet

FPGA1

W 1 (A

1 ) W 1 (A

1 ) W1 (A0 ) W2 (A0 ) W3 (A0 ) W1 (A1 ) W2 (A1 ) W3 (A1 ) W 1 (A2 ) W 1 (A2 ) W 1 (A3 ) W 1 (A3 ) W1 (A4 ) W2 (A4 ) W3 (A4 ) W1 (A5 ) W2 (A5 ) W3 (A5 ) W 1 (A6 ) W 1 (A6 )

Write

ReadWavelet

FPGA2

Write

Read

Out

W3 (A

1 )W2 (A

1 )W1 (A

1 ) W1 (A0 ) W2 (A0 ) W1 (A1 ) W2 (A1 ) W3 (A2 ) W2 (A2 ) W1 (A2 ) W3 (A3 ) W2 (A3 ) W1 (A3 ) W1 (A4 ) W2 (A4 ) W1 (A5 ) W2 (A5 ) W3 (A6 ) W2 (A6 ) W1 (A6 )

W1 (A

3 )W2 (A

3 )W3 (A

3 ) W 1 (A

2 ) W 1 (A

2 ) W 1 (A

1 ) W 1 (A

1 ) W1 (A0 ) W2 (A0 ) W3 (A0 ) W1 (A1 ) W2 (A1 ) W3 (A1 ) W 1 (A2 ) W 1 (A2 ) W 1 (A3 ) W 1 (A3 ) W1 (A4 )W2 (A4 )W3 (A4 )

W1 (A

3 ) W2 (A

3 ) W3 (A

2 )W2 (A

2 )W1 (A

2 )W3 (A

1 )W2 (A

1 )W1 (A

1 ) W1 (A0 ) W2 (A0 ) W1 (A1 ) W2 (A1 ) W3 (A2 )W2 (A2 )W1 (A2 ) W3 (A3 )W2 (A3 )W1 (A3 ) W1 (A4 )W2 (A4 )

A 3 A 2 A 1 A0 A1 A2 A3 A4

A 4 A 4 A 3 A 3 A 2 A 2 A 1A 1 A0A0 A1A1 A2 A2 A3A3

20 ms 20 ms 20 ms 20 ms 20 ms 20 ms 20 ms 20 ms

FPGA1 fields

FPGA2 fieldsDirect wavelet transformInverse wavelet transform

Wj (Ai)-wavelet decompositions at scale j

W 1(Ai)-wavelet reconstruction of the frameAiAj -processing frame with indexi

Figure2: The data flow of wavelet processing.

2D wavelettransform

Denoising by

waveletshrinkage

Inverse2D wavelettransform

Pixel-basedmotion detector

Selectiverecursive filter

Figure3: The implemented denoising scheme.

algorithm upsamples the wavelet filters at each decomposi-

tion level. In particular, 2j

1 zeros (holes, in French, trous)are inserted between the filter coefficients at the decomposi-tion level j, as it is shown inFigure 4.

We use the SystemC library [31] and a previously devel-oped simulation environment [32,33] to develop a real-timemodel of the wavelet decomposition and reconstruction.Figure 5 shows the simulation model. After a number ofsimulations and tests we have concluded that the real-timewavelet implementation with 16 bit arithmetic gives practi-cally the same results as a referent MATLAB code of the algo-rithm a trous[30]. At a number of input frames there weremore than 97.13% errorless pixels with mean error of 0.0287.Analyzing those figures at the level of bit representation, we

can conclude that maximally 1 bit out of 16 was wrong. The

wrong bit may occur on the bit position 0 shown inFigure 6.Taking into account that input pixels are 8 bit integers we canignore this error.

3.2. The pipelined FPGA implementation ofthe nondecimated wavelet transform

Here we develop an FPGA implementation of a nondeci-mated wavelet transform with three orientation subbandsper scale. We design FIR filters for the algorithm atrous[30]with the Daubechies minimum phase wavelet of length four[34] and we implement the designed FIR filters with dedi-cated multipliers in the Xilinx Virtex2 FPGAs [35].


5/12


Hj (z)

Gj (z)

Gj (z)

Gj (z)HH

HL

LH

LLAj+1

Hj (z)

Aj

Hj (z)

h0(n) = h[0]h[1]h[2]h[3]

h1(n) = h[0]0 h[1]0 h[2]0 h[3]

h2(n) = h[0]000 h[1]000 h[2]000 h[3]

g0(n) =g[0]g[1]g[2]g[3]

g1(n) =g[0]0g[1]0g[2]0g[3]

g2(n) =g[0]000g[1]000g[2]000g[3]

2j

1 trous (holes) 2j

1 trous (holes)

Figure4: The nondecimated 2D discrete wavelet transform.

2Dwavelet

transformation

2D inversewavelet

transformation

Input

data 0

15

14 : 7

6 : 0

0

Output

data

LL

16LH

16HL

HH 16

16

16 16

15

14 : 7

6 : 0 88

Figure5: The developed simulation model for the implementation of the wavelet transform.

Our implementation of the 2D wavelet transform is line-based as shown in Figure 7. We choose the line alignmentin order to preserve the video sequence input format and topipelinethe whole processing in our system. The horizontaland the vertical filtering is performed within one pass of theinput video stream. We avoid using independent horizontaland vertical processing which requires two cycles and an in-ternal memory for storing the output of the horizontal filter-ing. Instead, we use the line-based vertical filtering with as

many internal line buffers as there are taps in the used FIRfilter.

The horizontal and vertical FIR filters differ only in thefilter delay path implementation. The data path of the hor-izontal filter is a register pipeline as shown inFigure 8. Thedata path of the vertical filter is the output of the line buffers.Hence, the vertical FIR filter does not include any delay ele-ments, but only the pipelined filtering arithmetics (multipli-ers and an adder). Pipelining the filtering arithmetics ensuresthe requested timing for data processing and we use this ap-proach both for the horizontal and vertical filters.

The algorithm atrous[30] upsamples the wavelet filtersby inserting 2j 1 zeros between the filter coefficients at the

decomposition level j (seeFigure 4). We implement this fil-ter up-sampling by using a longer filter delay path and theappropriate data selection logic. The required number of theregisters depends on the length of the mother wavelet func-tion and on the number of the decomposition levels used. Weuse a wavelet of length four and three decomposition levels,and hence our horizontal filter in Figure 8 contains 34 = 12registers. Four registers are dedicated to the 4-tap filter and3 times as many are needed to implement the required up-

sampling up to the third decomposition level. Analogously,on the vertical filtering side, each line buffer for vertical fil-tering is able to store up to 4 lines.

For the calculation of the first decomposition level of thewavelet transform, only the first 4 registers d0, d1, d2, and d3inFigure 8are used in the FIR filter register pipeline. At thesecond decomposition level, the wavelet filters have to be up-sampled with 1 zero between the filter coefficients. In our im-plementation, this means that registers d0, d2, d4, and d6 areused for filtering.Figure 8 illustrates the FIR filter configu-ration during the calculation of the wavelet coefficients fromthe third decomposition level. During this period, the d0, d4,d8, and d12 registers are involved in the filtering process.


6/12


0

0

Input

Output

0

X

0

X

0

X

0

X

0

X

0

X

0

X

F E D C B A 9 8 7 F 5 4 3 2 1 0

Figure6: Input and output data format.

Videoinput 4 tap

horizontal FIRL

4 tap

horizontal FIRH

Line buffer

Line buffer

Line buffer

Line buffer

VerticalFIRinputselect

4 tap

vertical FIRLL

4 tap

vertical FIRLH

4 tap

vertical FIRHL

4 tap

vertical FIRHH

LL

LH

HL

HH

Figure7: A block schematic of the developed hardware implementation of the wavelet transform.

We implement the inverse wavelet transform accordingly.The processing is mirrored when compared to the waveletdecomposition: the vertical filtering is done first and the hor-izontal processing afterwards. The FIR filter design is thesame as for the direct wavelet transform, only the filter co-efficients a(0), a(1), a(2), and a(3) inFigure 8are mirrored.

3.3. The wavelet shrinkage customization

Our video denoising scheme employs a spatially adaptivewavelet shrinkage approach of [36]. A brief description ofthis denoising method follows.

Letyldenote the noise-free wavelet coefficient andwlitsobserved noisy version at the spatial position l in a givenwavelet subband. For compactness, we suppressed here theindices that denote the scale and the orientation. The methodof [36] shrinks each wavelet coefficient by a factor whichequals the probability that this coefficient presents a signal ofinterest. The signal of interest is defined as a noise-free signalcomponent that exceeds in magnitude the standard devia-tion of noise . The probability of the presence of a signal

of interest at positionlis estimated based on the coefficientmagnitude |wl| and based on a local spatial activity indicatorzl =

kl |wk|, where lis the neighborhood of the pixel

l(within a squared window) and Nl is the number of theneighboring coefficients. For example, for a 3 3 windowlconsists of the 8 nearest neighbors of the pixell(Nl= 8).

Let H1 denote the hypothesis the signal of interest ispresent: |yl| > and letH0denote the opposite hypothesis:|yl| . The shrinkage estimator of [9] is

yl= P

H1 | wl, zl

wl= ll1 +ll

wl, (1)

where

=P

H1

P

H0 , l= p

wl | H1

p

wl | H0 , l= p

zl | H1

p

zl | H0 .

(2)

p(wl | H0) and p(wl | H1) denote the conditional prob-ability density functions of the noisy coefficients given theabsence and given the presence of a signal of interest. Sim-

ilarly, p(zl | H0) and p(zl | H1) denote the correspondingconditional probability density functions of the local spa-tial activity indicator. The input-output characteristic of thiswavelet denoiser is illustrated inFigure 9.This figure showsthat the coefficients that are small in magnitude are stronglyshrinked towards zero, while the largest ones tend to be leftunchanged. The displayed family of the shrinkage character-istics corresponds to the different values of the local spatialactivity indicator. For the same coefficient magnitude |wl|the input coefficient will be shrunk less if LSAI zl is biggerand vice versa.

We now address the implementation of this shrinkagefunction. Under the Laplacian prior for noise-free data

p(y) = (/2) exp(|y|) we have[9] = exp(T)/(1 exp(T)). The analytical expressions forland lseem toocomplex for the FPGA implementation. We efficiently imple-ment the two likelihood ratioslandlas appropriatelook-up tables, stored in two read-only memories (ROM). Thegeneration of the particular look-up-tables is based on an ex-tensive experimental study, as we explain later in this section.The developed architecture is presented inFigure 10. OneROM memory, containing the look-up table l, is addressedby the coefficient magnitude |wl|, and the other ROM mem-ory, containing the look-up table lis addressed by LSAIzl. For calculating LSAI, we average the coefficient valuesfrom the current line and from the previous two lines within


7/12


x(n)Z 1 Z 1 Z 1 Z 1 Z 1 Z 1 Z 1 Z 1 Z 1 Z 1 Z 1 Z 1 Z 1

d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12

Scale = 3

a(0) a(1) a(2) a(3)

Z 1 Z 1 Z 1 Z 1

+ + +y(n)

Figure 8: The proposed FIR filter implementation of the algorithm atrous for a mother wavelet of length 4 and supporting up to 3 de-composition levels. The particular arithmetic network using the registers d0, d4, d8, and d12 corresponds to the calculation of the waveletcoefficients at the third decomposition level.

150

100

50

0

50

100

150

150

100

50 0 50 100 150

Noisy input coeffi

cient

Different LSAI

Figure9: An illustration of the employed wavelet shrinkage family.

a 3 3 window. The read values from ROMs are multi-plied to produce the generalized likelihood ratio r = ll.We found it more efficient to realize the shrinkage factorr/(1 +r) using another ROM (look-up-table) instead of us-ing the arithmetic operations. The output of this look-up-table denoted here as shrinkage ROM is the desired waveletshrinkage factor. Finally, the output of the shrinkage ROMmultiplies the input coefficient to yield the denoised coeffi-

cient.We denoise in parallel three wavelet bands LH, HL, and

HH at each scale. Different resolution levels (we use three)are processed sequentially as illustrated inFigure 2. The low-pass (LL) band is only delayed for the number of clock peri-ods that are needed for denoising. This delay, which is in ourimplementation 6 clock cycles, ensures the synchronizationof the inputs at the inverse wavelet transform block (see thetiming inFigure 2).

The generation of the appropriate look-up tables for thetwo likelihood ratios resulted from our extensive experi-ments on different test images and different noise-levels asit is described in[29].Figure 11illustrates the likelihood ra-

tio lcalculated from one test image at different noise lev-els. These diagrams show another interpretation of the well-known threshold selection principle in wavelet denoising: awell-chosen threshold value for the wavelet coefficients in-creases with the increase of the noise level. The maximumlikelihood estimate of the threshold T (i.e., the value forwhich p(T | H0) = p(T | H1)) is the abscissa of the pointl = 1.Figure 12displays the likelihood ratio l, in the di-

agonal subband HH at third decomposition level, for 10 dif-ferent frames with fixed noise standard deviations ( = 10and=30). We showed in [29] that from a practical pointof view, the difference between the calculated likelihood ra-tios for different frames is minor, especially for lower noiselevels (up to= 20). Therefore we average the likelihood ra-tios over different frames and store these values as the corre-sponding look-up tables for several different noise levels (=5, 10, 15, and 20). In the denoising procedure, the user selectsthe input noise level, which enables addressing the correct setof the look-up tables. The performance loss of the algorithmdue to simplifications with the generated look-up tables is fordifferent input noise levels shown inFigure 13.These results


8/12


FIFO(line buffer)

FIFO(line buffer)

LSAI coefficientmagnitude window

Z 1

Z 1

Z 1

Z 1

Z 1

Z 1

Z 1

Z 1

Z 1

+

(1 scale)ABS(pixel)

(1

scale)

Energy

8

Address generation

combinatorial network

ROM

ROM

KSI

ETA

Shrinkage

ROM

Output

coefficient

Input

coeffi

cient

Figure10: Block schematic of implemented denoising architecture.

1000

900

800

700

600500

400

300

200

100

00 50 100 150 200 250

HH

Figure 11: Likelihood ratio l for one test frame and 4 differentnoise levels (= 5,10,20,30).

represent peak signal-to-noise ratio (PSNR) values averagedover frames of several different video sequences. For=10the average performance loss was only 0.13 dB (and visually,the differences are difficult to notice) while for = 20 theperformance loss is 0.55 dB and is on most frames becom-ing visually noticeable, but not highly disturbing. For highernoise levels, the performance loss increases.

In the current implementation, the user has to select oneof the available noise levels. With such approach, it is possi-ble that the user will not choose the best possible noise re-duction. If the selected noise level is smaller from the realnoise level in the input signal, some of the noise will remain

in the output signal. On the other hand, if the noise level isover-estimated, the output signal will be blurred without sat-isfying visual effect.

This user intervention can be avoided by implementing anoise level estimator. The output of this block could be usedfor the look-up table selection, which further enables ad-justable noise reduction according to the noise level in inputsignal. For example, a robust wavelet-domain noise estimatorbased on the median absolute deviation [37] can be used forthis purpose or other related wavelet-domain noise estima-tors like [38].

The likelihood ratiosland lare monotonic increasingfunctions. We are currently investigating the approximation

of these functions by a family of piece-wise linear functionsparameterized by the noise standard deviation and by the pa-rameter of the marginal statistical distribution of the noise-free coefficients in a given subband.

3.4. Temporal filtering

A pixel-based motion detector with selective recursive tem-poral filtering is quite simple for hardware implementation.Since we first apply a high quality spatial filtering the noise isalready significantly suppressed and thus a pixel-based mo-tion detection is efficient. In case the motion is detected therecursive filtering is switched off.

Two pixels are involved for temporal filtering at a time:one pixel from the current field and another from the samespatial position in the previous field. We store the two fieldsin the output buffer and read the both required pixel valuesin the same cycle. If the absolute difference between these twopixel values is smaller than the predefined threshold value,no motioncase is assumed and the two pixel values are sub-ject to a weighted averaging, with the weighting factors de-fined in [9]. In the other case, when motion is detected, thecurrent pixel is passed to the output. The block schematic inFigure 14depicts the developed FPGA architecture of the se-lective recursive temporal filter described above. We use the8 bit arithmetic because the filter is located in the time do-main where all the pixels are represented as 8 bit integers.

4. REAL-TIME ENVIRONMENT

In our implementation we use the standard television broad-casting signal as a source of video signal. A common featureof all standard TV broadcasting technologies is that the video

sequence is transmitted in analog domain (this excludes thelatest DVB and HDTV transmission standards). Thus, beforedigital processing of television video sequence the digitaliza-tion is needed. Also, after digital processing the sequence hasto be converted back to the analogue domain in order tobe shown on a standard tube display. This pair of A/D andD/A converters is well known as a codec. The 8 bit codec,with 256 levels of quantization per pixel, is considered suf-ficient from the visual quality point of view.Figure 15showsa block schematic of digital processing for television broad-casting systems.

We use the PAL-B broadcasting standard and 8 bit YUV4 : 2 : 2 codec. The hardware platform set-up consists of


9/12


1000

900

800

700

600500

400

300

200

100

00 50 100 150 200 250

HH

(a)

1000

900

800

700

600500

400

300

200

100

00 50 100 150 200 250 300

HH

(b)

Figure12: Likelihood ratioldisplayed for 10 frames with fixed-noise levels:= 10 (a) and= 30 (b).

50

40

30

20

10

05 10 15 20 30

Standard deviation of added noise

40

.

39

40

.

23

35

.

77

35

.

63

33

.

24

32

.

95

31

.

55

31

.

00

29

.

22

27

.

98

PSNR(dB)

Original with 3 decomposition levels

FPGA implementation

PSNR comparison

Figure 13: Performance of the designed FPGA implementation in comparison with the original software version of the algorithm, whichemploys exact analytical calculation of the involved shrinkage expression.

three separate boards. Each board corresponds to one of theblocks presented inFigure 15:

(i) Micronas IMAS-VPC 1.1 (A/Danalog front-end)[39];

(ii) CHIPit Professional Gold Edition (processing block)[40];

(iii) Micronas IMAS-DDPB 1.0 (D/Aanalog back-end)[41].

We made all the connections among the previously men-tioned boards with a separateinterconnectionboard designedfor this purpose. This interconnection board consists of theinterconnection channels and the voltage adjustments be-tween the CHIPit board (3.3 V level) and the Micronas IMASboards (5 V level).

The processing board consists of two Xilinx Virtex IIFPGAs (XC2V6000-5)[35] and is equipped with plenty ofSDRAM memory (6 banks with 32 bit access made with256 Mbit ICs).

All boards of the used hardware platform are configuredwith the I2C interface. The user is able to set up the needed

noise level in input signal. This is fulfilled with writing ap-propriate value to the corresponding register in the FPGAaccessible via the I2C interface. Appropriate look-up tablewith the averaged likelihood ratio is selected according to thevalue in this register.

5. CONCLUSION

We designed a real-time FPGA implementation of an ad-vanced wavelet-domain video denoising algorithm. The de-veloped hardware architecture is based on innovative techni-cal solutions that allow an implementation of sophisticatedadaptive wavelet denoising in hardware. We believe that theresults reported in this paper can be interesting for a num-ber of industrial applications, including TV broadcastingsystems. Our current implementation has limitations inpractical use due to the required user-intervention for noiselevel estimation. Our future work will integrate the noiselevel estimation to avoid these limitations and to allow au-tomatic adaptation of the denoiser to the noise level changesin the input signal.


10/12


Delay

Pixel fromcurrent field

Pixel fromprevious field

ABS (A-B) A


11/12


[17] M. Nibouche, A. Bouridane, F. Murtagh, and O. Nibouche,FPGA-Based Discrete Wavelet Transforms System, School ofComputer Science, The Queens University of Belfast, Belfast,UK, 2001.

[18] M. A. Trenas, J. Lopez, and E. L. Zapata, FPGA implemen-tation of wavelet packet transform with reconfigurable treestructure, in Proceedings of the 26th Euromicro Conference(EUROMICRO 00), pp. 12441251, Maastricht, The Nether-lands, September 2000.

[19] K. Wiatr and P. Russek, Embedded zero wavelet coefficientcoding method for FPGA implementation of video codec inreal-time systems, in The International Conference on Infor-mation Technology: Coding and Computing (ITCC 00), pp.146151, Las Vegas, Nev, USA, March 2000.

[20] S. G. Mathen, Wavelet transform based adaptive image com-pression on FPGA, M.S.thesis, University of Kansas, Manhat-tan, Kan, USA, 2000.

[21] J. B. Ballagh, An FPGA-based run-time reconfigurable 2-Ddiscrete wavelet transform core, M.S. thesis, Virginia Poly-technic Institute, Blacksburg, Va, USA, 2001.

[22] L. Nachtergaele, B. Vanhoof, M. Peon, G. Lafruit, J. Bor-mans, and I. Bolsens, Implementation of a scalable MPEG-4wavelet-based visual texture compression system, inProceed-ings of the 36th Design Automation Conference (DAC 99), pp.333336, New Orleans, La, USA, June 1999.

[23] A. R. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo,Wavelet transforms that map integers to integers, Appliedand Computational Harmonic Analysis, vol. 5, no. 3, pp. 332369, 1998.

[24] W. Sweldens, Lifting scheme: a new philosophy in biorthog-onal wavelet constructions, in Wavelet Applications in Signaland Image Processing III, vol. 2569 ofProceedings of SPIE, pp.6879, San Diego, Calif, USA, July 1995.

[25] W. Sweldens, Wavelets and the lifting scheme: a 5 minute

tour, Zeitschrift fur Angewandte Mathematik und Mechanik,vol. 76, no. 2, pp. 4144, 1996.

[26] G. Dillen, B. Georis, J.-D. Legat, and O. Cantineau, Com-bined line-based architecture for the 5-3 and 9-7 wavelettransform of JPEG2000,IEEE Transaction on Circuits and Sys-tems for Video Technology, vol. 13, no. 9, pp. 944950, 2003.

[27] B.-F. Wu and Y.-Q. Hu, An efficient VLSI implementationof the discrete wavelet transform using embedded instructioncodes for symmetric filters,IEEE Transactions on Circuits andSystems for Video Technology, vol. 13, no. 9, pp. 936943, 2003.

[28] M. Katona, A. Pizurica, V. Zlokolica, N. Teslic, and W. Philips,Real-time wavelet domain video denoising implemented inFPGA, in Wavelet Applications in Industrial Processing II,vol. 5607 ofProceedings of SPIE, pp. 6370, Philadelphia, Pa,

USA, October 2004.

[29] M. Katona, A. Pizurica, N. Teslic, V. Kovacevic, and W. Philips,FPGA design and implementation of a wavelet-domain videodenoising system, inProceedings of the 7th International Con-

ference on Advanced Concepts for Intelligent Vision Systems(ACIVS 05), J. Blanc-Talon, D. Popescu, W. Philips, and P.Scheunders, Eds., vol. 3708 ofLecture Notes on Computer Sci-ence, pp. 650657, Antwerp, Belgium, September 2005.

[30] S. Mallat andS. Zhong, Characterization of signals from mul-tiscale edges,IEEE Transactions on Pattern Analysis and Ma-chine Intelligence, vol. 14, no. 7, pp. 710732, 1992.

[31] SystemC Version 2.0 Users Guide, SystemC Inc., 2002,http://www.systemc.org.

[32] M. Katona, N. Teslic, V. Kovacevic, and M. Temerinac, Testenvironment for bluetooth baseband inegrated circuit devel-opment, inProceedings of the 5th International Conference onTelecommunications in Modern Satellite, Cable and Broadcast-ing Services (TELSIKS 01), B. D. Milovanovic, Ed., vol. 2, pp.405408, Nis, Yoguslavia, Septmeber 2001.

[33] M. Katona, N. Teslic, and Z. Krajacevic, FPGA design with

SystemC, in The 10th International Conference on MixedDesign of Integrated Circuits and Systems (MIXDES 03), A.Napieralski, Ed., vol. 1, pp. 220223, Lodz, Poland, June 2003.

[34] I. Daubechies,Ten Lectures on Wavelets, SIAM, Philadelphia,Pa, USA, 1992.

[35] Virtex II Platform FPGA: Complete Data Sheet, XILINX Inc.,2004,http://www.xilinx.com.

[36] A. Pizurica and W. Philips, Estimating the probability of thepresence of a signal of interest in multiresolution single- andmultiband image denoising,IEEE Transactions on Image Pro-cessing, vol. 15, no. 3, pp. 654665, 2006.

[37] D. L. Donoho and I. M. Johnstone, Adapting to unknownsmoothness via wavelet shrinkage, Journal of the AmericanStatistical Association, vol. 90, no. 432, pp. 12001224, 1995.

[38] V. Zlokolica, A. Pizurica, and W. Philips, Noise estimationfor video processing based on spatial-temporal gradient his-tograms, to appear inIEEE Signal Processing Letters.

[39] VPC 3205C Video Processor Family, ITT Semiconductors: ITTIntermetall, 1997,http://www.micronas.com.

[40] CHIPit Gold Edition Handbook, ProDesign Electronic & CADLayout, 2003,http://www.prodesigncad.de.

[41] DDPB 3310B Display and Deflection Processor, Micronas Inter-metal, 1998,http://www.micronas.com.

Mihajlo Katona was born in 1974, in Vr-bas, Yugoslavia. In 1999, he received theDiploma degree in computer engineeringand in 2001, M. S. degree in computer

science both from the University of NoviSad (Serbia and Montenegro). In 1999, he

joined the Chair for Computer Engineer-ing at the University of Novi Sad, where heis currently working as a Teaching Assis-tant in the design of complex digital sys-tems. He is currently pursuing his Ph.D. thesis. His research in-terests include digital signal processing, DSP algorithm customiza-tion for hardware implementation, system-on-chip architectures,and FPGA prototyping.

Aleksandra Pizuricawas born in Novi Sad,Yugoslavia, on September 18, 1969. In 1994,she received the Diploma degree in electri-

cal engineering from the University of NoviSad, Yugoslavia, in 1997 the M.S. degree intelecommunications from the University ofBelgrade, Yugoslavia, and in 2002 the Ph.D.degree from the Ghent University, Belgium.Since 1994 until 1997, she was working atthe Department of Telecommunications of the University of NoviSad, and in 1997 she joined the Department of Telecommunica-tions and Information Processing of the Ghent University. She isthe author of 15 papers in international journals and more than50 papers at international scientific conferences. Her research in-terests include image restoration, multiresolution representations,Markov random field models, signal detection and estimation,multimedia applications, and remote sensing.
http://www.systemc.org/http://www.systemc.org/http://www.systemc.org/http://www.xilinx.com/http://www.xilinx.com/http://www.micronas.com/http://www.micronas.com/http://www.prodesigncad.de/http://www.prodesigncad.de/http://www.micronas.com/http://www.micronas.com/http://www.micronas.com/http://www.prodesigncad.de/http://www.micronas.com/http://www.xilinx.com/http://www.systemc.org/http://www.systemc.org/


12/12


Nikola Teslic is a Professor at the Chairfor Computer Engineering, Faculty of En-gineering, University of Novi Sad, Serbiaand Montenegro. In 1995, he received theDiploma degree in electrical engineeringfrom the University of Novi Sad, Yugoslavia,in 1997 the M.S. degree in computer engi-

neering, and in 1999 the Ph.D. degree fromthe University of Novi Sad, Serbia andMon-tenegro. Currently he lectures in the designof complex digital systems and software for TV sets and imageprocessing and DSP architectures and algorithms. His scientificinterests are in the area of computer engineering, especially in thearea of real-time systems, electronic computer-based systems, digi-tal system for audio-video processing. He is the author of 6 papersin international journals and more than 50 papers at internationalscientific conferences.

Vladimir Kovacevic is a Professor and heleads the Chair for Computer Engineering,Faculty of Engineering, University of NoviSad, Yugoslavia. Currently he lectures in

the design of complex digital systems andcomputer systems design. He received hisPh.D. degree at the University of Belgrade.His scientific interests are in the areas ofcomputer engineering, especially in the areaof real-time systems, electronic computer-based systems, large-scale digital system design, computer systemsdesign, communication networks, and systems programming.

WilfriedPhilips was born in Aalst, Belgiumon October 19, 1966. In 1989, he receivedthe Diploma degree in electrical engineer-ing and in 1993 the Ph.D. degree in appliedsciences, both from the University of Ghent,

Belgium. Since November 1997, he is a Lec-turer at the Department of Telecommuni-cations and Information Processing of theUniversity of Ghent. His main research in-terests are image and video restoration andanalysis and data compression. He is the author of more than 50papers in international journals and 200 papers in the proceedingsof international scientific conferences, the Editor of 8 conferenceproceedings and 1 special issue of a journal. He has received 10national and internal awards for his research. He coorganizes 2 in-ternational conferences in the area of image and video processingand computer vision and is a member of the program committeeof several national and international workshops.

Real Time Wavelet Video Denoising in FPGA

Documents

Transcript of Real Time Wavelet Video Denoising in FPGA