
1062 IEICE TRANS. FUNDAMENTALS, VOL.E90–A, NO.5 MAY 2007

PAPER

A Block-Based Architecture for Lifting Scheme Discrete Wavelet Transform

Chung-Hsien YANG†a), Nonmember, Jia-Ching WANG†, Member, Jhing-Fa WANG†, and Chi-Wei CHANG†, Nonmembers

SUMMARY Two-dimensional discrete wavelet transform (DWT) architectures for image processing are conventionally line-based, which makes them simple and low in complexity. However, they suffer from two main shortcomings: the memory required for storing intermediate data and the long latency of computing wavelet coefficients. This work presents a new block-based architecture for computing lifting-based 2-D DWT coefficients. This architecture yields a significantly lower buffer size. Additionally, the latency is reduced from $N^2$ to $3N$ as compared to the line-based architectures. The proposed architecture supports the JPEG2000 default filters and has been realized on an ARM-based ALTERA EPXA10 Development Board at a frequency of 44.33 MHz.

key words: discrete wavelet transform, JPEG2000, lifting scheme, line-based DWT, VLSI

1. Introduction

Over the past decade, the discrete wavelet transform (DWT) has been widely applied in the area of image processing. The DWT is used in the decorrelation step of systems for compressing still pictures. Several research results indicate that wavelets outperform the discrete cosine transform (DCT) in terms of image quality at high compression ratios by avoiding the block distortion problem suffered by DCT-based solutions. The DWT has traditionally been implemented by convolution, which requires both a large number of computations and a large amount of storage. In 1994, the lifting scheme, a new method known to be superior to the conventional convolution-based DWT, was proposed in [1], [2]. In addition to significantly reducing memory and computational complexity, the lifting scheme provides in-place computation of the wavelet coefficients by overwriting the memory locations that contain the input sample values. Furthermore, it requires less hardware and offers faster computation. Therefore, the specification of the DWT kernels in JPEG2000 is provided only in terms of the lifting coefficients and not the convolutional filters.

Memory is an important constraint in many image compression applications. Existing DCT-based compression algorithms, including those defined under the JPEG standard, use memory very efficiently because, if required, they can operate on individual image blocks such that the minimum amount of memory required is very low. Although wavelet-based coders outperform DCT-based coders in terms of compression efficiency, their implementations have not yet matured. Memory efficiency is in fact one of the most important issues to be addressed before wavelet-based techniques can be widely deployed, and it is currently an area of extensive research activity related to the JPEG2000 standard.

Manuscript received December 7, 2005.
Manuscript revised September 29, 2006.
Final manuscript received February 8, 2007.
†The authors are with the Department of Electrical Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan (R.O.C.).
a) E-mail: [email protected]
DOI: 10.1093/ietfec/e90–a.5.1062

In the JPEG2000 verification model [9], the following wavelet filters are proposed: (5, 3) (5-tap low-pass filter, 3-tap high-pass filter), (9, 7), C(13, 7), S(13, 7), (2, 6), (2, 10) and (6, 10). To be compliant with JPEG2000, a codec has to implement the (5, 3) filter in lossless mode and the (9, 7) filter in lossy mode. Some proposed architectures [3]–[7] do not implement all of the filters, and their data paths are line-based, resulting in a large buffer size and the late production of wavelet coefficients. In [3], [4], the DWT is processed by two main modules: a row module and a column module. Another structure was presented in [5] to implement all stages of the transform using a recursive architecture. A direct implementation of the lifting scheme was described in [6], and the architecture in [7] improves upon this direct implementation with a folded structure. All of these methods use a line-based data flow to process the DWT and suffer from large intermediate data storage. This paper proposes a new block-based architecture that implements the lifting scheme DWT and significantly reduces the amount of memory required. This memory efficiency is also advantageous in terms of computation speed: in the proposed system, the enforced “locality” of the filtering operations makes it more likely that strips of the image are loaded into the on-chip memory only once.

The rest of this paper is organized as follows. Section 2 briefly reviews the lifting scheme. Section 3 presents the precision analysis. Section 4 explains the proposed data flow and architectures. Section 5 presents the FPGA implementation results and comparisons with others’ work. Finally, Sect. 6 draws conclusions.

2. Lifting Scheme

The basic concept that underlies the lifting scheme is the factorization of the polyphase matrix of a wavelet filter into a sequence of alternating upper and lower triangular matrices and a diagonal matrix. Let $h(z)$ and $g(z)$ be the low-pass and high-pass analysis filters. The corresponding polyphase matrix is defined as

$$P(z) = \begin{bmatrix} h_e(z) & h_o(z) \\ g_e(z) & g_o(z) \end{bmatrix}, \qquad (1)$$

where $h_e(z)$ contains the even coefficients of $h(z)$, $h_o(z)$ contains the odd coefficients of $h(z)$, $g_e(z)$ contains the even coefficients of $g(z)$, and $g_o(z)$ contains the odd coefficients of $g(z)$. Then, $P(z)$ can be factored into lifting steps as

$$P(z) = \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix} \prod_{i=1}^{m} \begin{bmatrix} 1 & p_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ u_i(z) & 1 \end{bmatrix}. \qquad (2)$$

Fig. 1 Lifting scheme DWT.

Fig. 2 Lifting steps for (a) the (5, 3) filter-bank and (b) the (9, 7) filter-bank.

Copyright © 2007 The Institute of Electronics, Information and Communication Engineers

As shown in Fig. 1, the factorization of $P(z)$ involves three steps:

(1) Prediction step, in which the even samples are multiplied by the time-domain equivalent of $p_i(z)$ and then added to the odd samples;

(2) Update step, in which the updated odd samples are multiplied by the time-domain equivalent of $u_i(z)$ and then added to the even samples;

(3) Scaling step, in which the even samples are multiplied by 1/K and the odd samples by K.

The inverse DWT is performed by traversing the steps in the reverse direction, changing the factor K to 1/K and the factor 1/K to K, and reversing the signs of the coefficients in $p_i(z)$ and $u_i(z)$.
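The three steps and their reversal can be illustrated with a minimal Python sketch for the (9, 7) filter-bank. This is an illustration under stated assumptions, not the authors' implementation: the lifting constants ALPHA, BETA, GAMMA and DELTA are the widely published (9, 7) values and are not listed in the text (only K is), the function names are hypothetical, and out-of-range neighbours are handled by clamping the index, which is assumed here as a stand-in for symmetric extension.

```python
# Widely published (9, 7) lifting constants (assumed; only K appears in the text).
ALPHA, BETA = -1.586134342, -0.05298011854
GAMMA, DELTA = 0.8829110762, 0.4435068522
K = 1.230174104914001


def lift(target, source, weight, shift):
    """Add weight * (two neighbouring source samples) to every target sample.

    shift = +1 pairs target[i] with source[i] and source[i + 1] (prediction);
    shift = -1 pairs target[i] with source[i - 1] and source[i] (update).
    Out-of-range indices are clamped to the nearest valid index.
    """
    last = len(source) - 1
    for i in range(len(target)):
        a = source[min(i, last)]
        b = source[min(max(i + shift, 0), last)]
        target[i] += weight * (a + b)


def forward_97(x):
    """One level of the forward transform; x needs at least two samples."""
    s = [float(v) for v in x[0::2]]  # even samples
    d = [float(v) for v in x[1::2]]  # odd samples
    for weight_p, weight_u in ((ALPHA, BETA), (GAMMA, DELTA)):
        lift(d, s, weight_p, +1)     # prediction step
        lift(s, d, weight_u, -1)     # update step
    # Scaling step: even (low-pass) by 1/K, odd (high-pass) by K.
    return [v / K for v in s], [v * K for v in d]


def inverse_97(low, high):
    """Undo forward_97: swap K and 1/K, reverse the order, flip the signs."""
    s = [v * K for v in low]
    d = [v / K for v in high]
    for weight_p, weight_u in ((GAMMA, DELTA), (ALPHA, BETA)):
        lift(s, d, -weight_u, -1)
        lift(d, s, -weight_p, +1)
    x = [0.0] * (len(s) + len(d))
    x[0::2], x[1::2] = s, d          # interleave even and odd samples
    return x
```

Because each step only adds a function of the other subsequence, the inverse recovers the input up to floating-point rounding, whatever boundary rule is chosen, as long as both directions use the same rule.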

The original 1-D signal $\{s_0^0, d_0^0, s_1^0, d_1^0, s_2^0, d_2^0, \ldots\}$ is split into odd- and even-indexed subsequences, and these values are then modified using alternating prediction and update steps. The computational steps are summarized as

$$d_i^n = d_i^{n-1} + \sum_k p_n(k)\, s_k^{n-1}, \quad n \in [1, 2, \ldots, M], \qquad (3)$$

$$s_i^n = s_i^{n-1} + \sum_k u_n(k)\, d_k^n, \quad n \in [1, 2, \ldots, M], \qquad (4)$$

where $\{s_i^n\}$ and $\{d_i^n\}$ are, respectively, the even and odd sequences, $p_n(k)$ and $u_n(k)$ are, respectively, the prediction and update weights at the nth iteration, and M is the number of lifting iterations. For the (5, 3), C(13, 7), S(13, 7), (2, 6), and (2, 10) filter-banks, M = 1, while for the (9, 7) and (6, 10) filter-banks, M = 2. Equation (3) describes the prediction step, which consists of predicting each odd sample and subtracting the prediction from the odd sample to form the prediction error $\{d_i^n\}$. Equation (4) describes the update step, which consists of updating the even samples by adding to them a linear combination of the already modified odd samples, $\{d_i^n\}$, to form the updated sequence $\{s_i^n\}$. The output of the final prediction step gives the high-pass coefficients up to a scaling factor K, while the output of the final update step gives the low-pass coefficients up to a scaling factor 1/K. For the (9, 7) filter-bank, K = 1.230174104914001. The lifting steps of the (5, 3) filter-bank and the (9, 7) filter-bank [8] are depicted in Fig. 2.

Table 1 Computational complexity comparison between convolution and lifting schemes.

The number of computations required to calculate a high-pass, low-pass pair of wavelet transform coefficients using convolution and the lifting scheme is given in Table 1. The reduction in the number of multiplications for the lifting scheme is significant for odd-tap filters compared with convolution. For even-tap filters, the convolution scheme has fewer or an equal number of multiplications. The number of additions for the lifting scheme is lower for both odd-tap and even-tap filters. Such a reduction in computational complexity makes lifting schemes attractive for both high-throughput and low-power applications.
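As a concrete instance of Eqs. (3) and (4) with M = 1, the following Python sketch applies one level of the (5, 3) lifting transform to a list of integers. It assumes the reversible JPEG2000 form with floor rounding, which the text does not spell out, and it clamps out-of-range neighbours at the signal boundary; the function names are hypothetical.

```python
def forward_53(x):
    """One level of the reversible (5, 3) lifting transform on a list of ints."""
    s = list(x[0::2])  # even samples s^0
    d = list(x[1::2])  # odd samples d^0
    # Prediction, Eq. (3): d_i^1 = d_i^0 - floor((s_i^0 + s_{i+1}^0) / 2)
    for i in range(len(d)):
        right = s[min(i + 1, len(s) - 1)]
        d[i] -= (s[i] + right) // 2
    # Update, Eq. (4): s_i^1 = s_i^0 + floor((d_{i-1}^1 + d_i^1 + 2) / 4)
    for i in range(len(s)):
        left = d[max(i - 1, 0)]
        cur = d[min(i, len(d) - 1)]
        s[i] += (left + cur + 2) // 4
    return s, d


def inverse_53(s, d):
    """Undo forward_53 by running the two steps in reverse with flipped signs."""
    s, d = list(s), list(d)
    for i in range(len(s)):
        left = d[max(i - 1, 0)]
        cur = d[min(i, len(d) - 1)]
        s[i] -= (left + cur + 2) // 4
    for i in range(len(d)):
        right = s[min(i + 1, len(s) - 1)]
        d[i] += (s[i] + right) // 2
    x = [0] * (len(s) + len(d))
    x[0::2], x[1::2] = s, d
    return x
```

Each output pair costs two additions and one shift per step and no general multiplication, which is consistent with the low operation counts that motivate the lifting scheme.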

3. Precision Analysis

The drawback of using a fixed-point data format in application-specific integrated circuit (ASIC) implementations is that precision can be reduced. To overcome this drawback, additional bits are appended to preserve precision, and their number is determined by image quality analysis.

Fig. 3 General lifting-based structures.

The filter coefficients of the seven JPEG2000 filters considered herein range from 0.003906 to 2 [4]. To convert the filter coefficients to integers, they are multiplied by 256. The values of the coefficients then range from 1 to 512, so that 10 bits can be used to represent the coefficients in 2’s complement form. At the end of the multiplication, the product is shifted right by eight bits to yield the required result. Rounding is applied to the individual product terms instead of the result of the filter operation.
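The coefficient scaling and the rounded right shift can be sketched in Python as follows. The helper names are hypothetical, and round-half-up is assumed as the rounding rule, since the text does not state which rule the hardware uses.

```python
import math

COEFF_FRACTION_BITS = 8  # coefficients are scaled by 2**8 = 256


def to_fixed(coeff):
    """Scale a real-valued filter coefficient to an integer (round half up)."""
    return math.floor(coeff * (1 << COEFF_FRACTION_BITS) + 0.5)


def mul_round(sample, coeff_fixed):
    """Multiply a sample by a scaled coefficient, then shift right by eight
    bits with rounding, so that each product term is rounded individually."""
    product = sample * coeff_fixed
    half = 1 << (COEFF_FRACTION_BITS - 1)
    return (product + half) >> COEFF_FRACTION_BITS
```

For example, `to_fixed(0.003906)` gives 1 and `to_fixed(2)` gives 512, matching the stated range, and `mul_round(100, to_fixed(0.25))` gives 25.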

Now we consider the format of the signal values for hardware implementation. The signal values must be shifted left to increase the precision. The extent of the shift is determined by image quality analysis. Consider the general structure of lifting schemes, as indicated in Fig. 3, which computes

$$y = a(x_1 + x_2) + b(x_3 + x_4) + x_5, \qquad (5)$$

where $a$ and $b$ are the coefficients, $x_k$, $1 \le k \le 5$, are the signal inputs, and $y$ is the transformed value. With $A = \mathrm{Round}(256 \times a)$ and $B = \mathrm{Round}(256 \times b)$, Eq. (5) can be expressed as

$$y \approx \frac{1}{256}\left[A(x_1 + x_2) + B(x_3 + x_4)\right] + x_5. \qquad (6)$$

If the input values $x_k$ are shifted left by $S$ extension bits, then

$$y \approx \frac{1}{2^S}\left\{\frac{1}{256}\left[A(2^S x_1 + 2^S x_2) + B(2^S x_3 + 2^S x_4)\right] + 2^S x_5\right\}. \qquad (7)$$

The order of the computation is changed to improve its precision:

$$y(S) = \left\{\frac{1}{2^S}\left\{\left[\frac{A(2^S x_1 + 2^S x_2)}{256}\right]_{\mathrm{round}} + \left[\frac{B(2^S x_3 + 2^S x_4)}{256}\right]_{\mathrm{round}} + 2^S x_5\right\}\right\}_{\mathrm{round}}, \qquad (8)$$

where the subscript round denotes rounding. Rounding is performed as soon as each term has been calculated. The SNR values for different numbers of extension bits, for the Baboon, Lenna, Elaine, and Boat images, after three levels of forward and inverse transforms are given in Table 2. For the given set of images, the extension bit number S was varied to find the value at which the SNR saturates, that is, the value beyond which additional bits yield only a slight SNR improvement. According to Figs. 4 and 5, the SNR improves only slightly when S > 5, so the proposed architecture uses five extension bits for processing the DWT.
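The effect of the extension bits in Eq. (8) can be tried out with a small integer-only Python sketch. The function names are hypothetical and round-half-up is assumed for every rounding operation; the SNR figures reported in the paper are not reproduced here.

```python
def rshift_round(value, bits):
    """Arithmetic right shift with round-half-up; bits == 0 is a no-op."""
    if bits == 0:
        return value
    return (value + (1 << (bits - 1))) >> bits


def lifting_fixed(x, A, B, S):
    """Evaluate Eq. (8) for inputs x = (x1, ..., x5), scaled integer
    coefficients A and B, and S extension bits."""
    x1, x2, x3, x4, x5 = (v << S for v in x)      # shift the inputs left by S
    term_a = rshift_round(A * (x1 + x2), 8)       # [A(...)/256] rounded
    term_b = rshift_round(B * (x3 + x4), 8)       # [B(...)/256] rounded
    return rshift_round(term_a + term_b + x5, S)  # remove the extension bits
```

With a = b = 0.375 (A = B = 96) and x = (1, 0, 1, 0, 0), the exact value of Eq. (5) is 0.75. Without extension bits both product terms round to zero and the result is 0, whereas with S = 5 the result is 1, which is closer to the exact value.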


Table 2 SNR values after three levels of DWT.

Fig. 4 SNR values among different extension bits after three level DWT using (5, 3) filter.

Fig. 5 SNR values among different extension bits after three level DWT using (9, 7) filter.

Once the number of extension bits is chosen, the width of the data path must be determined. This can be done by observing the maximum and minimum values of the forward and inverse transforms at the end of each level. Table 3 presents the maximum and minimum values for the Baboon, Lenna, Elaine and Boat images with five extension bits. This table indicates that 16 bits are required to represent the transformed values in 2’s complement representation.

Table 3 Maximum and minimum values with five extension bits.

The multiplier multiplies a 16-bit number by a 10-bit number and then rounds the 26-bit product, discarding eight LSBs (to account for the increased precision of the filter coefficients) and two MSBs, to form a 16-bit output. (Sixteen bits are required to represent the outputs, and therefore the two MSBs are sign extension bits.)
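The bit-width bookkeeping of the multiplier can be checked with a short Python sketch. It is a behavioural model under assumptions, not the authors' RTL: the names are hypothetical, round-half-up is assumed, and an exception stands in for the case where the two discarded MSBs would not be pure sign extension.

```python
def fits_signed(value, bits):
    """True if value is representable in 2's complement with the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def multiply_16x10(sample, coeff):
    """Model of the 16-bit by 10-bit multiplier with a 16-bit rounded output.

    coeff is the filter coefficient already scaled by 256.
    """
    if not fits_signed(sample, 16):
        raise ValueError("sample does not fit in 16 bits")
    product = sample * coeff        # up to 26 bits wide
    rounded = (product + 128) >> 8  # discard eight LSBs with rounding
    if not fits_signed(rounded, 16):
        # The two MSBs would carry information instead of sign extension.
        raise OverflowError("result does not fit in 16 bits")
    return rounded
```

For instance, `multiply_16x10(1000, 128)` returns 500, while multiplying the largest 16-bit sample by the largest scaled coefficient (512) raises `OverflowError`, which is the situation the data-path width analysis of Table 3 is meant to rule out.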

4. Proposed VLSI Architectures

4.1 Proposed Data Flow Diagram

For each level of the DWT using the line-based method, the filtering along columns is performed after the completion of the filtering along rows, as shown in Fig. 6. For an N × N image, this requires $N^2$ words of intermediate data storage, which may be unreasonable to fit on a single chip even for moderately sized images. While the line-based method can be efficient for 1-D applications, 2-D line-based architectures suffer from the bottleneck that the required memory equals the input data size. Besides this disadvantage, the line-based approach does not lend itself to parallel processing.

The proposed data flow for the DWT does not follow the line-based method; instead, a new block-based approach is presented. When the input image is divided into several blocks, the coefficients of each layer (i.e., LL, LH, HL, HH) can be obtained concurrently within a block. The method can be thought of as a window sliding over the image, and the overlapping design lets the window slide smoothly across the image. The idea behind the overlapping block architecture is to take only as many inputs as are required to compute a set of outputs. For example, a 1-D version requires only L inputs, where L is the filter length, and produces two outputs: a low-pass and a high-pass coefficient. The 2-D case takes $L^2$ inputs and produces four outputs. In general, an n-dimensional transform needs $L^n$ inputs to produce $2^n$ outputs. Figure 7 presents an example of the data flow using a (5, 3) filter-bank. The input image is assumed to be 5 × 5 pixels, and a block of 3 × 3 pixels is used. Three intermediate data are produced in Fig. 7(a). Figure 7(b) shows that the three intermediate data are used to generate the first transformed datum Z (the black circle, for the HH layer) while other intermediate data are generated simultaneously. Once the transform coefficient Z within a block has been calculated, the corresponding intermediate data Y (the gray circle) no longer need to be buffered. In summary, Figs. 7(a) to (k) show how the output data Z are calculated from the input data X.
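The sliding-window idea can be sketched in Python for the HH coefficient of the 5 × 5 example. This is a simplified illustration, not the full data flow of Fig. 7: only the prediction (high-pass) step of the (5, 3) filter is modelled, floor rounding is assumed, indices are 0-based, and the function names are hypothetical.

```python
def hh_from_block(block):
    """Compute one HH coefficient from a 3 x 3 block of integers."""
    # Row filtering: one intermediate value Y per row, three in total.
    y = [row[1] - (row[0] + row[2]) // 2 for row in block]
    # Column filtering of the three intermediate values gives Z (HH layer).
    return y[1] - (y[0] + y[2]) // 2


def sliding_blocks(image, size=3, stride=2):
    """Yield ((row, col), block) for overlapping size x size windows.

    With size 3 and stride 2, adjacent windows share one row or one column.
    """
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            yield (r, c), [row[c:c + size] for row in image[r:r + size]]
```

A 5 × 5 image yields four overlapping windows, one per HH coefficient, and each window needs only its own nine inputs and three intermediate values rather than a full row-filtered image.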


Fig. 6 Procedure of 2-D line-based DWT.

Fig. 7 Data flow diagrams of the proposed DWT transform.

4.2 Proposed Architectures

The proposed block-based architecture for the 2-D DWT is depicted in Fig. 8. The outputs in each level are LL, LH, HL, and HH. The LL data are used for the next level of decomposition. This system has three primary stages. In the first stage, the input data are read and the block controller forms a “block” according to the double-buffer scheme. After a “block” of input data is ready for processing, it is sent to the pipeline register for the next stage.

The second stage is the PE Y controller, which processes the intermediate transform data within a block and stores the data in BUFFER Y. The last stage, the PE Z controller, processes the final transform coefficients. BUFFER Z is used only for 4M filters, because two passes of the one-dimensional transform are calculated in a round. The registers in Fig. 8 are used for storing the second-pass input. The details are discussed in the following subsections.

4.3 Block Controller Modules

The block controller modules read the image input data. BUFFER X is used to store the input data and to segment the image data into sub-blocks. BUFFER X contains two banks (MEM1 and MEM2) to implement the double-buffer scheme. In the first step, data are read from the External Memory into MEM1 (see Fig. 8). In the second step, once MEM1 is full of image data, MEM2 starts reading image data while MEM1 is simultaneously read out to form a “block” for processing. MEM2 then waits until the processing of MEM1 is completed. The third step is the same as the second step but with MEM1 and MEM2 exchanged. The first step is executed only once, after which the second and third steps are performed alternately until the entire image has been processed. A simplified finite state machine of the block controller is shown in Fig. 9.
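The alternation between the two banks can be sketched as a schedule generator in Python. The function name and the tuple layout are hypothetical, and the final drain step, in which the last strip is processed with nothing left to load, is an assumption that the text does not describe explicitly.

```python
def double_buffer_schedule(num_strips):
    """Return a list of (load_bank, load_strip, process_bank, process_strip).

    Strips are numbered from 0, and None marks an idle side. Requires
    num_strips >= 1.
    """
    banks = ("MEM1", "MEM2")
    # Step 1 (executed once): fill MEM1 while nothing is being processed.
    schedule = [(banks[0], 0, None, None)]
    # Steps 2 and 3 alternate: load one bank while the other is processed.
    for k in range(1, num_strips):
        schedule.append((banks[k % 2], k, banks[(k - 1) % 2], k - 1))
    # Drain: process the last loaded strip.
    last = num_strips - 1
    schedule.append((None, None, banks[last % 2], last))
    return schedule
```

For three strips the schedule loads MEM1, then loads MEM2 while MEM1 is processed, then loads MEM1 while MEM2 is processed, and finally processes MEM1.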

4.4 Processing Elements (PE) Modules

Two PE modules are used in our design. The PE Y reads a block of data from BUFFER X, calculates the intermediate data Y, and writes the data into BUFFER Y, while the PE Z reads a block of data from BUFFER Y, calculates the transform data Z, and writes the data into BUFFER Z. The basic computation unit, MAC, is shown in Fig. 10. Figures 11 and 12 show the structures of the processing elements for the 2M and 4M filter banks, respectively. REG1 and REG2 are used for storing the overlapped data of the block in the 2M filter banks. While the 4M filter banks are being processed, all four registers are used to reduce the number of memory accesses. Repeated memory accesses are thus avoided, which reduces the power consumption.

In our algorithm, a block has two frames. In each frame, the processing element calculates a high-pass and low-pass pair of coefficients. The PE Y and PE Z can perform their transforms simultaneously once the PE Z has enough input data to do so. Thus, the computational time can be significantly reduced.

Fig. 8 Proposed architecture of 2-D discrete wavelet transform.

Fig. 9 Illustration of the finite state machine for the block controller.

Fig. 10 Architecture of basic computation unit, MAC.

Fig. 11 Architecture of processing element for 2M filters.

Fig. 12 Architecture of processing element for 4M filters.

4.5 Memory Modules

The double-buffer and overlapping structures are adopted, so the size of each of MEM1 and MEM2 in the proposed block-based architecture is N × 2, where N is the width of the input image. The data in MEM1 (MEM2) are all processed in the PE Y and stored in BUFFER Y. At this point, the PE Z starts to process the other dimension since it has sufficient data for processing. The memory size is much lower than that of the line-based architectures, whose memory requirement is N × [N/2] [3], [4].

BUFFER Y and BUFFER Z each have size N × 4. Referring to Fig. 13, while a row of the intermediate data is being processed, the three other rows can be accessed for simultaneous processing of the other dimension. These four rows can be rewritten circularly.
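The circular reuse of the four rows can be sketched as a small ring buffer in Python. The class and method names are hypothetical, and the sketch models only the addressing, not the timing of the simultaneous accesses shown in Fig. 13.

```python
class RowRingBuffer:
    """An N x 4 row buffer whose rows are overwritten circularly."""

    def __init__(self, width, depth=4):
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]
        self.count = 0  # total number of rows written so far

    def write_row(self, row):
        """Store a new row, overwriting the oldest one once the buffer is full."""
        self.rows[self.count % self.depth] = list(row)
        self.count += 1

    def recent(self, n):
        """Return the n most recently written rows, oldest first."""
        n = min(n, self.count, self.depth)
        start = self.count - n
        return [self.rows[(start + i) % self.depth] for i in range(n)]
```

While a new row is being written into one slot, the three previously written rows remain available through `recent(3)`, which corresponds to the three rows that a 3-tap column filter would read.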

Fig. 13 Organization of BUFFER Y and BUFFER Z.

Table 4 (a) Schedule of PE Y for the (5, 3) filter applied on a 5 × 5 image. (b) Schedule of PE Z for the (5, 3) filter applied on a 5 × 5 image.

Fig. 14 Excalibur device architecture.


4.6 Scheduling

A detailed schedule of the (5, 3) filter-bank has been generated, as shown in Tables 4(a) and (b). In the example of a 5 × 5 image, the input data are x(i, j), where i and j are the vertical and horizontal indices, respectively, with 1 ≤ i, j ≤ 5. In the 11th cycle, the last element required for calculating the first second-dimension coefficient is ready for processing. Thus, in the subsequent cycle, the first horizontal wavelet coefficient, Z(2, 2), can be calculated. Afterwards, a DWT coefficient is generated at every cycle. The total computational time for one level of decomposition on an N × N image, using the (5, 3) filter, is 2 × [N/2] × [N/2] + 2.

5. FPGA Implementation

To realize the proposed architecture, an ALTERA EPXA10 Development Board (ALTERA™ EXCALIBUR™ EPXA10F1020C2) was utilized. Figure 14 shows the system architecture of the embedded stripe and the interfaces to the PLD portion of the device [11]. This architecture promotes maximum integration with minimal system cost and allows the embedded stripe and PLD to be independently optimized for maximum performance. Two AMBA-compliant AHBs ensure that the embedded processor activity is unaffected by peripheral and memory operation. Three bidirectional AHB-to-AHB bridges enable embedded peripherals and PLD-implemented peripherals to exchange data with the embedded processor or with other peripherals. With these interfaces, the performance of the ARM922T is uncompromised and is equivalent to an ASIC implementation on a 0.18-µm CMOS process. The implementation results are summarized in Table 5. The critical path of the system is about 22.557 ns, which means that the maximum operating frequency is roughly 44.33 MHz. As shown in Fig. 8, the critical path is the path between two pipeline registers (through a multiplexer and a PE controller). Figure 15 depicts the schematic view of the whole system. A photo of the prototype system is given in Fig. 16.

Table 5 The implementation results of the FPGA prototype.

Fig. 15 Schematic view of the whole system.

The following compares the buffer size, hardware utilization, and computational time of the proposed architecture with those of other architectures. In the proposed architecture, the buffer memory is significantly reduced, as shown in Table 6. Although the block-based architectures may use more computing time, the work can be divided among many processors. In the proposed architecture, the first wavelet transform coefficient is generated as soon as possible. The total computational time can also be reduced in comparison with those of other architectures, which facilitates quantization in JPEG2000 image compression and represents another advantage of the proposed block-based structure.

Fig. 16 Photo of prototype system.

Table 6 Comparisons of buffer size among different architectures.

6. Conclusions

Line-based DWT architectures are efficient for 1-D applications. In 2-D (or higher-dimensional) transforms, they suffer from two main problems: memory requirements and latency. For example, processing an N × N image requires $N^2$ words for storing intermediate data, which may not fit on a single chip even for moderately sized images. Also, the latency depends on the input size: at least O(N) clock cycles are required to generate the first output. These problems are inherent in line-based architectures.

This paper offers a new data processing path and presents a new VLSI architecture that implements the 2-D lifting scheme DWT with a small memory. The DWT coefficients are computed using a block-based data path. This architecture reduces the latency to 3N and also reduces the total required memory. Finally, the proposed design has been successfully verified on an ARM-based ALTERA EPXA10 Development Board.

References

[1] W. Sweldens, “The lifting scheme: A new philosophy in biorthogonal wavelet constructions,” Proc. SPIE: Wavelet Applications in Signal and Image Processing III, vol.2569, pp.68–79, 1995.

[2] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting schemes,” J. Fourier Analysis and Applications, vol.4, pp.247–269, 1998.

[3] K. Andra, C. Chakrabarti, and T. Acharya, “A VLSI architecture for lifting based wavelet transform,” Proc. IEEE Workshop Signal Process. Syst., pp.70–79, Oct. 2000.

[4] T. Acharya, K. Andra, and C. Chakrabarti, “A VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. Signal Process., vol.50, no.4, pp.966–977, April 2002.

[5] B.F. Cockburn, H. Liao, and M.K. Mandal, “Novel architectures for the lifting-based discrete wavelet transform,” Proc. IEEE Conf. on Electrical and Computer Engineering, vol.2, pp.1020–1025, 2002.

[6] C.-C. Liu, Y.-H. Shiau, and J.-M. Jou, “Design and implementation of a progressive image coding chip based on the lifted wavelet transform,” Proc. 11th VLSI Design/CAD Symposium, pp.49–52, Aug. 2000.

[7] C.-J. Lian, K.-F. Chen, H.-H. Chen, and L.-G. Chen, “Lifting based discrete wavelet transform architecture for JPEG2000,” Proc. IEEE International Symposium on Circuits and Systems, vol.2, pp.445–448, 2001.

[8] J.M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Trans. Signal Process., vol.41, no.12, pp.3445–3462, Dec. 1993.

[9] D. Taubman, “JPEG2000 verification model vm3a,” ISO/IEC JTC1/SC29/WG1 N1143, Feb. 1999.

[10] S. Movva and S. Srinivasan, “A novel architecture for lifting-based discrete wavelet transform for JPEG2000 standard suitable for VLSI implementation,” Proc. 16th International Conference on VLSI Design, pp.202–207, Jan. 2003.

[11] Altera Corporation, Altera Device Package Information Data Sheet, http://www.altera.com/literature/lit-index.html

[12] H. Liao, M. Mandal, and B. Cockburn, “Efficient architecture for 1-D and 2-D lifting-based wavelet transforms,” IEEE Trans. Signal Process., vol.52, no.5, pp.1315–1326, May 2004.

[13] P.-C. Wu and L.-G. Chen, “An efficient architecture for two-dimensional discrete wavelet transform,” IEEE Trans. Circuits Syst. Video Technol., vol.11, no.4, pp.536–545, April 2001.

[14] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “VLSI architecture for forward discrete wavelet transform based on B-spline factorization,” J. VLSI Signal Process., vol.40, no.3, pp.343–353, July 2005.

Chung-Hsien Yang received the B.S. degree in Computer Science and Information Engineering from Tunghai University, Taichung, Taiwan, in 1997 and the M.S. degree in Computer Science and Information Engineering from National Cheng Kung University, Tainan, Taiwan, in 1999. He is a Ph.D. candidate in the Department of Electrical Engineering at National Cheng Kung University. His research areas include stochastic processes and VLSI design.

Jia-Ching Wang received the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1997 and 2002, respectively. His research interests include signal processing and VLSI architecture design. Dr. Wang is an honor member of Phi Tau Phi. He is also a member of IEEE and ACM.

Jhing-Fa Wang is now a Chair Professor at National Cheng Kung University, Tainan, Taiwan. He received his Master and Bachelor degrees in the Department of Electrical Engineering from National Cheng Kung University, Taiwan, in 1979 and 1973, respectively, and his Ph.D. degree in the Department of Computer Science and Electrical Engineering from Stevens Institute of Technology, U.S.A., in 1983. He was elected an IEEE Fellow in 1999 and is now the Chairman of the IEEE Tainan Section. He received outstanding awards from the Institute of Information Industry in 1991 and from the National Science Council of Taiwan in 1990, 1995, and 1997. He has been invited to give a keynote speech at PACLIC 12 (Pacific Asia Conference on Language, Information and Computation), Singapore, and served as the general chairman of the International Symposium on Communication (ISCOM 2001), Taiwan. He has developed a Mandarin speech recognition system called Venus-Dictate, known as a pioneering system in Taiwan. He was an associate editor for IEEE Transactions on Neural Networks and VLSI Systems. He is currently leading a multidisciplinary research group for the development of Advanced Ubiquitous Media for Created Cyberspace. He has published about 91 journal papers and 217 conference papers and obtained 5 patents since 1983. His research areas include wireless content-based media processing, speech recognition and natural language understanding.


Chi-Wei Chang received the B.S. degree in Biomedical Engineering from Chung Yuan Christian University, Chung Li, Taiwan, in 1998, and the M.S. degree in Electrical Engineering from National Cheng Kung University, Tainan, Taiwan, in 2003. His research areas include discrete wavelet transform, image processing and VLSI design.