[Meher:2014][TCSVT] (Reusable) Efficient Integer DCT Architectures for HEVC


168 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 24, NO. 1, JANUARY 2014

Efficient Integer DCT Architectures for HEVC

Pramod Kumar Meher, Senior Member, IEEE, Sang Yoon Park, Member, IEEE, Basant Kumar Mohanty, Senior Member, IEEE, Khoon Seong Lim, and Chuohao Yeo, Member, IEEE

Abstract—In this paper, we present area- and power-efficient architectures for the implementation of the integer discrete cosine transform (DCT) of different lengths to be used in High Efficiency Video Coding (HEVC). We show that an efficient constant matrix-multiplication scheme can be used to derive parallel architectures for 1-D integer DCT of different lengths. We also show that the proposed structure can be reused for DCT of lengths 4, 8, 16, and 32 with a throughput of 32 DCT coefficients per cycle irrespective of the transform size. Moreover, the proposed architecture can be pruned to reduce the complexity of implementation substantially with only a marginal effect on the coding performance. We propose power-efficient structures for folded and full-parallel implementations of the 2-D DCT. From the synthesis results, it is found that the proposed architecture involves nearly 14% less area-delay product (ADP) and 19% less energy per sample (EPS) compared to the direct implementation of the reference algorithm, on average, for integer DCT of lengths 4, 8, 16, and 32. Also, an additional 19% saving in ADP and 20% saving in EPS can be achieved by the proposed pruning algorithm with nearly the same throughput rate. The proposed architecture is found to support ultrahigh-definition 7680 × 4320 at 60 frames/s video, which is one of the applications of HEVC.

Index Terms—Discrete cosine transform (DCT), H.265, High Efficiency Video Coding (HEVC), integer discrete cosine transform (DCT), video coding.

I. Introduction

THE discrete cosine transform (DCT) plays a vital role in video compression due to its near-optimal decorrelation efficiency [1]. Several variations of the integer DCT have been suggested in the last two decades to reduce the computational complexity [2]–[6]. The new H.265/High Efficiency Video Coding (HEVC) standard [7] has recently been finalized and is poised to replace H.264/AVC [8]. Some hardware architectures for the integer DCT for HEVC have also been proposed for its real-time implementation. Ahmed et al. [9] decomposed the DCT matrices into sparse submatrices where the multiplications are avoided by using the lifting scheme.

Shen et al. [10] used the multiplierless multiple constant multiplication (MCM) approach for four-point and eight-point DCT, and used normal multipliers with sharing techniques for 16- and 32-point DCTs. Park et al. [11] used Chen's factorization of the DCT, where the butterfly operation is implemented by a processing element with only shifters, adders, and multiplexors. Budagavi and Sze [12] proposed a unified structure to be used for the forward as well as the inverse transform after matrix decomposition.

Manuscript received August 21, 2012; revised May 17, 2013; accepted June 26, 2013. Date of publication August 6, 2013; date of current version January 3, 2014. (Corresponding author: S. Y. Park.) This paper was recommended by Associate Editor J. Ridge.

P. K. Meher, S. Y. Park, K. S. Lim, and C. Yeo are with the Institute for Infocomm Research, 138632, Singapore (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

B. K. Mohanty is with the Department of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Guna, Madhya Pradesh 473226, India (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2013.2276862

One key feature of HEVC is that it supports DCTs of different sizes, namely 4, 8, 16, and 32. Therefore, the hardware architecture should be flexible enough to compute the DCT of any of these lengths. The existing designs for the conventional DCT based on constant matrix multiplication (CMM) and MCM can provide optimal solutions for the computation of any one of these lengths, but they cannot be reused across lengths to support DCTs of different transform sizes at the same throughput. Considering this issue, we have analyzed the possible implementations of the integer DCT for HEVC in terms of resource requirement and reusability, and based on that, we have derived the proposed algorithm for hardware implementation. We have designed scalable and reusable architectures for the 1-D and 2-D integer DCTs for HEVC that can be reused for any of the prescribed lengths with the same throughput of processing irrespective of the transform size.

In the next section, we present algorithms for hardware implementation of the HEVC integer DCTs of lengths 4, 8, 16, and 32. In Section III, we illustrate the design of the proposed architecture for the implementation of the four-point and eight-point integer DCT, along with a generalized design of the integer DCT of length N, which can be used for the DCT of lengths N = 16 and 32. Moreover, we demonstrate the reusability of the proposed solutions in this section. In Section IV, we propose power-efficient designs of transposition buffers for full-parallel and folded implementations of the 2-D integer DCT. In Section V, we propose a bit-pruning scheme for the implementation of the integer DCT and present the impact of pruning on the forward and inverse transforms. In Section VI, we compare the synthesis results of the proposed architecture with those of existing architectures for HEVC.

II. Algorithm for Hardware Implementation of Integer DCT for HEVC

In the Joint Collaborative Team on Video Coding (JCT-VC), which manages the standardization of HEVC, Core Experiment 10 (CE10) studied the design of core transforms over several meeting cycles [13]. The eventual HEVC transform design [14] involves coefficients of 8-bit size, but does not allow full factorization, unlike other competing proposals [13]. It, however, allows for both matrix multiplication and partial-butterfly implementation. In this section, we use the partial-butterfly algorithm of [14] for the computation of the integer DCT, along with its efficient algorithmic transformation for hardware implementation.

A. Key Features of Integer DCT for HEVC

The N-point integer DCT¹ for HEVC given by [14] can be computed by a partial-butterfly approach using an (N/2)-point DCT and a matrix–vector product of an (N/2) × (N/2) matrix with an (N/2)-point vector as

$$
\begin{bmatrix} y(0) \\ y(2) \\ \vdots \\ y(N-4) \\ y(N-2) \end{bmatrix}
= \mathbf{C}_{N/2}
\begin{bmatrix} a(0) \\ a(1) \\ \vdots \\ a(N/2-2) \\ a(N/2-1) \end{bmatrix}
\tag{1a}
$$

and

$$
\begin{bmatrix} y(1) \\ y(3) \\ \vdots \\ y(N-3) \\ y(N-1) \end{bmatrix}
= \mathbf{M}_{N/2}
\begin{bmatrix} b(0) \\ b(1) \\ \vdots \\ b(N/2-2) \\ b(N/2-1) \end{bmatrix}
\tag{1b}
$$

where

$$
a(i) = x(i) + x(N-i-1), \qquad b(i) = x(i) - x(N-i-1) \tag{2}
$$

for i = 0, 1, ..., N/2 − 1. X = [x(0), x(1), ..., x(N − 1)] is the input vector and Y = [y(0), y(1), ..., y(N − 1)] is the N-point DCT of X. C_{N/2} is the (N/2)-point integer DCT kernel matrix of size (N/2) × (N/2). M_{N/2} is also a matrix of size (N/2) × (N/2), and its (i, j)th entry is defined as

$$
m^{N/2}_{i,j} = c^{N}_{2i+1,\,j} \quad \text{for } 0 \le i, j \le N/2 - 1 \tag{3}
$$

where $c^{N}_{2i+1,j}$ is the (2i + 1, j)th entry of the matrix C_N.

Note that (1a) could be similarly decomposed, recursively, further using C_{N/4} and M_{N/4}. We refer to the direct implementation of the DCT based on (1)–(3) as the reference algorithm in the remainder of this paper.
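To make the reference algorithm concrete, the following is a minimal C sketch (not the paper's hardware description) of the partial-butterfly computation (1)–(3) for N = 8. It uses the kernel C4 from (4) and the matrix M4 formed from the odd rows of C8 in (5) via (3); function and variable names are illustrative, and the inter-stage scaling discussed in Section V is omitted.

```c
/* Hedged sketch of the reference partial-butterfly algorithm (1)-(3) for N = 8. */
#include <stdio.h>

static const int C4[4][4] = {
    { 64,  64,  64,  64},
    { 83,  36, -36, -83},
    { 64, -64, -64,  64},
    { 36, -83,  83, -36}
};

/* M4(i,j) = C8(2i+1, j), per (3): odd rows of C8 restricted to its first four columns. */
static const int M4[4][4] = {
    { 89,  75,  50,  18},
    { 75, -18, -89, -50},
    { 50, -89,  18,  75},
    { 18, -50,  75, -89}
};

void dct8_partial_butterfly(const int x[8], int y[8])
{
    int a[4], b[4];
    /* (2): fold the input into sums and differences. */
    for (int i = 0; i < 4; i++) {
        a[i] = x[i] + x[7 - i];
        b[i] = x[i] - x[7 - i];
    }
    for (int i = 0; i < 4; i++) {
        int se = 0, so = 0;
        for (int j = 0; j < 4; j++) {
            se += C4[i][j] * a[j];   /* (1a): even-indexed outputs from the 4-point kernel */
            so += M4[i][j] * b[j];   /* (1b): odd-indexed outputs from M4                  */
        }
        y[2 * i]     = se;
        y[2 * i + 1] = so;
    }
}

int main(void)
{
    int x[8] = {10, 20, 30, 40, 50, 60, 70, 80}, y[8];
    dct8_partial_butterfly(x, y);
    for (int k = 0; k < 8; k++) printf("y(%d) = %d\n", k, y[k]);
    return 0;
}
```

In a full recursive implementation, the product by C4 in the sketch would itself be decomposed again via (1a) and (1b), which is what the reference algorithm does for lengths 16 and 32.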

B. Hardware-Oriented Algorithm

Direct implementation of (1) requires $N^2/4 + \mathrm{MUL}_{N/2}$ multiplications, $N^2/4 + N/2 + \mathrm{ADD}_{N/2}$ additions, and 2 shifts, where $\mathrm{MUL}_{N/2}$ and $\mathrm{ADD}_{N/2}$ are the numbers of multiplications and additions/subtractions of the (N/2)-point DCT, respectively.

¹While the standard defines the inverse DCT, we show the forward DCT here since it is more directly related to the later parts of this paper. Also, note that the transpose of the inverse DCT can be used as the forward DCT.

Computation of (1) could be treated as a CMM problem [15]–[17]. Since the absolute values of the coefficients in all the rows and columns of the matrix M in (1b) are identical, the CMM problem can be implemented as a set of N/2 MCMs, which results in a highly regular architecture and a low-complexity implementation.

The kernel matrices for four-, eight-, 16-, and 32-point integer DCT for HEVC are given in [14]; the four- and eight-point integer DCT kernels are, respectively,

$$
\mathbf{C}_4 = \begin{bmatrix}
64 & 64 & 64 & 64 \\
83 & 36 & -36 & -83 \\
64 & -64 & -64 & 64 \\
36 & -83 & 83 & -36
\end{bmatrix} \tag{4}
$$

and

$$
\mathbf{C}_8 = \begin{bmatrix}
64 & 64 & 64 & 64 & 64 & 64 & 64 & 64 \\
89 & 75 & 50 & 18 & -18 & -50 & -75 & -89 \\
83 & 36 & -36 & -83 & -83 & -36 & 36 & 83 \\
75 & -18 & -89 & -50 & 50 & 89 & 18 & -75 \\
64 & -64 & -64 & 64 & 64 & -64 & -64 & 64 \\
50 & -89 & 18 & 75 & -75 & -18 & 89 & -50 \\
36 & -83 & 83 & -36 & -36 & 83 & -83 & 36 \\
18 & -50 & 75 & -89 & 89 & -75 & 50 & -18
\end{bmatrix}. \tag{5}
$$

Based on (1) and (2), hardware-oriented algorithms for DCT computation can be derived in three stages as in Table I. For 8-, 16-, and 32-point DCT, the even-indexed coefficients [y(0), y(2), y(4), ..., y(N − 2)] are computed as 4-, 8-, and 16-point DCTs of [a(0), a(1), a(2), ..., a(N/2 − 1)], respectively, according to (1a). In Table II, we have listed the arithmetic complexities of the reference algorithm and the MCM-based algorithm for four-, eight-, 16-, and 32-point DCT. Algorithms for the inverse DCT (IDCT) can also be derived in a similar way.

III. Proposed Architectures for Integer DCT Computation

In this section, we present the proposed architecture for the computation of the integer DCT of length 4. We also present a generalized architecture for integer DCT of higher lengths.

A. Proposed Architecture for Four-Point Integer DCT

The proposed architecture for the four-point integer DCT is shown in Fig. 1(a). It consists of an input adder unit (IAU), a shift-add unit (SAU), and an output adder unit (OAU). The IAU computes a(0), a(1), b(0), and b(1) according to STAGE-1 of the algorithm as described in Table I. The computations of t_{i,36} and t_{i,83} are performed by two SAUs according to STAGE-2 of the algorithm. The computation of t_{0,64} and t_{1,64} does not consume any logic, since the shift operations can be rewired in hardware. The structure of the SAU is shown in Fig. 1(b). The outputs of the SAUs are finally added by the OAU according to STAGE-3 of the algorithm.
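The following is a small, bit-exact C sketch of the same three stages for the four-point case (names are illustrative and the final scaling is omitted): the IAU forms a(i) and b(i), the SAU realizes the products by 36 (= 32 + 4) and 83 (= 64 + 16 + 2 + 1) with shifts and additions, and the OAU combines them; multiplication by 64 reduces to a 6-bit left shift.

```c
/* Sketch of the four-point integer DCT as IAU -> SAU -> OAU (names illustrative). */
#include <stdio.h>

void dct4_shift_add(const int x[4], int y[4])
{
    /* IAU (STAGE-1): input additions/subtractions */
    int a0 = x[0] + x[3], a1 = x[1] + x[2];
    int b0 = x[0] - x[3], b1 = x[1] - x[2];

    /* SAU (STAGE-2): 36*b = (b<<5)+(b<<2); 83*b = (b<<6)+(b<<4)+(b<<1)+b */
    int t0_36 = (b0 << 5) + (b0 << 2);
    int t1_36 = (b1 << 5) + (b1 << 2);
    int t0_83 = (b0 << 6) + (b0 << 4) + (b0 << 1) + b0;
    int t1_83 = (b1 << 6) + (b1 << 4) + (b1 << 1) + b1;

    /* Multiplication by 64 is a pure 6-bit left shift (rewired in hardware). */
    y[0] = (a0 + a1) << 6;
    y[2] = (a0 - a1) << 6;

    /* OAU (STAGE-3): output additions/subtractions */
    y[1] = t0_83 + t1_36;
    y[3] = t0_36 - t1_83;
}

int main(void)
{
    int x[4] = {1, 2, 3, 4}, y[4];
    dct4_shift_add(x, y);
    printf("%d %d %d %d\n", y[0], y[1], y[2], y[3]);   /* 640 -285 0 -25 */
    return 0;
}
```

The result agrees with a direct multiplication by C4 in (4), which is the point of the MCM formulation: no general multiplier is needed for any of the coefficient values.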

B. Proposed Architecture for Integer DCT of Length 8 and Higher-Length DCTs

TABLE I: Three-Stage Hardware-Oriented Algorithms for the Computation of 4-, 8-, 16-, and 32-Point DCT

The generalized architecture for the N-point integer DCT based on the proposed algorithm is shown in Fig. 2. It consists of four units, namely the IAU, the (N/2)-point integer DCT unit, the SAU, and the OAU. The IAU computes a(i) and b(i) for i = 0, 1, ..., N/2 − 1 according to STAGE-1 of the algorithm of Section II-B. The SAU provides the result of multiplication of the input sample with the DCT coefficient according to STAGE-2 of the algorithm. Finally, the OAU generates the output of the DCT from a binary adder tree of log₂N − 1 stages. Fig. 3(a)–(c), respectively, illustrate the structures of the IAU, SAU, and OAU in the case of the eight-point integer DCT. Four SAUs are required to compute t_{i,89}, t_{i,75}, t_{i,50}, and t_{i,18} for i = 0, 1, 2, and 3 according to STAGE-2 of the algorithm. The outputs of the SAUs are finally added by a two-stage adder tree according to STAGE-3 of the algorithm. Structures for the 16- and 32-point integer DCT can be obtained similarly.
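As an illustration of STAGE-2 and STAGE-3 for the eight-point case, the shift-add decompositions of the four odd coefficient magnitudes and the corresponding output additions can be sketched as follows (illustrative C, not the paper's RTL; the even-indexed outputs would come from the embedded four-point DCT of a(i)):

```c
#include <stdio.h>

/* SAU (STAGE-2): products with the odd coefficient magnitudes of C8 in (5),
   each realized with a few shifts and additions.                             */
static int mul89(int b) { return (b << 6) + (b << 4) + (b << 3) + b; }  /* 89 = 64+16+8+1 */
static int mul75(int b) { return (b << 6) + (b << 3) + (b << 1) + b; }  /* 75 = 64+8+2+1  */
static int mul50(int b) { return (b << 5) + (b << 4) + (b << 1); }      /* 50 = 32+16+2   */
static int mul18(int b) { return (b << 4) + (b << 1); }                 /* 18 = 16+2      */

/* OAU (STAGE-3): adder tree combining the SAU outputs according to the sign
   pattern of the odd rows of C8, i.e., the matrix M4 of (3).                 */
void odd_outputs_8pt(const int b[4], int y[8])
{
    y[1] = mul89(b[0]) + mul75(b[1]) + mul50(b[2]) + mul18(b[3]);
    y[3] = mul75(b[0]) - mul18(b[1]) - mul89(b[2]) - mul50(b[3]);
    y[5] = mul50(b[0]) - mul89(b[1]) + mul18(b[2]) + mul75(b[3]);
    y[7] = mul18(b[0]) - mul50(b[1]) + mul75(b[2]) - mul89(b[3]);
}

int main(void)
{
    int x[8] = {5, -3, 7, 1, 0, 2, -4, 6}, b[4], y[8] = {0};
    for (int i = 0; i < 4; i++) b[i] = x[i] - x[7 - i];   /* IAU differences, per (2) */
    odd_outputs_8pt(b, y);
    printf("y(1)=%d y(3)=%d y(5)=%d y(7)=%d\n", y[1], y[3], y[5], y[7]);
    return 0;
}
```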

C. Reusable Architecture for Integer DCT

The proposed reusable architecture for the implementation of DCT of any of the prescribed lengths is shown in Fig. 4(a).

TABLE II: Computational Complexities of MCM-Based Hardware-Oriented Algorithm and Reference Algorithm

Fig. 1. Proposed architecture of four-point integer DCT. (a) Four-point DCT architecture. (b) Structure of SAU.

Fig. 2. Proposed generalized architecture for integer DCT of lengths N = 8, 16, and 32.

Fig. 3. Proposed architecture of eight-point integer DCT and IDCT. (a) Structure of IAU. (b) Structure of SAU. (c) Structure of OAU.

There are two (N/2)-point DCT units in the structure. The input to one (N/2)-point DCT unit is fed through (N/2) 2:1 MUXes that select either [a(0), ..., a(N/2 − 1)] or [x(0), ..., x(N/2 − 1)], depending on whether it is used for the N-point DCT computation or for the DCT of a lower size. The other (N/2)-point DCT unit takes the input [x(N/2), ..., x(N − 1)] when it is used for the computation of the DCT of N/2 points or a lower size; otherwise, the input is reset by an array of (N/2) AND gates to disable this (N/2)-point DCT unit. The output of this (N/2)-point DCT unit is multiplexed with that of the OAU, which is preceded by the SAUs and the IAU of the structure. The N AND gates before the IAU are used to disable the IAU, SAU, and OAU when the architecture is used for (N/2)-point or lower-size DCT computation. The input of the control unit, m_N, is used to decide the size of the DCT computation. Specifically, for N = 32, m_32 is a 2-bit signal that is set to {00}, {01}, {10}, and {11} to compute four-, eight-, 16-, and 32-point DCT, respectively. The control unit generates sel_1 and sel_2, where sel_1 is used as the control signal of the N MUXes and the input of the N AND gates before the IAU. sel_2 is used as the input m_{N/2} to the two lower-size reusable integer DCT units in a recursive manner. The combinational logic of the control units is shown in Fig. 4(b) and (c) for N = 16 and 32, respectively. For N = 8, m_8 is a 1-bit signal that is used as sel_1, while sel_2 is not required since the four-point DCT is the smallest DCT. The proposed structure can compute one 32-point DCT, two 16-point DCTs, four eight-point DCTs, or eight four-point DCTs, while the throughput remains 32 DCT coefficients per cycle irrespective of the desired transform size.

Fig. 4. Proposed reusable architecture of integer DCT. (a) Proposed reusable architecture for N = 8, 16, and 32. (b) Control unit for N = 16. (c) Control unit for N = 32.

IV. Proposed Structure for 2-D Integer DCT

Using its separable property, an (N × N)-point 2-D integer DCT can be computed by the row-column decomposition technique in two distinct stages; a small code sketch follows the two steps below.

1) STAGE-1: The N-point 1-D integer DCT is computed for each column of the (N × N) input matrix to generate an intermediate output matrix of size (N × N).

2) STAGE-2: The N-point 1-D DCT is computed for each row of the (N × N) intermediate output matrix to generate the desired 2-D DCT of size (N × N).
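A minimal C sketch of this row-column decomposition for N = 4, using the kernel C4 of (4) and omitting the inter-stage scaling of Section V (the intermediate matrix T plays the role of the transposition buffer; names are illustrative):

```c
#include <stdio.h>

static const int C4[4][4] = {
    {64, 64, 64, 64}, {83, 36, -36, -83}, {64, -64, -64, 64}, {36, -83, 83, -36}
};

/* One-dimensional four-point integer DCT: y = C4 * x (direct form). */
static void dct4(const int x[4], int y[4])
{
    for (int i = 0; i < 4; i++) {
        y[i] = 0;
        for (int j = 0; j < 4; j++) y[i] += C4[i][j] * x[j];
    }
}

void dct2d_4x4(const int X[4][4], int Z[4][4])
{
    int T[4][4], col[4], out[4];

    /* STAGE-1: 1-D DCT of each column of X fills the columns of T. */
    for (int c = 0; c < 4; c++) {
        for (int r = 0; r < 4; r++) col[r] = X[r][c];
        dct4(col, out);
        for (int r = 0; r < 4; r++) T[r][c] = out[r];
    }
    /* STAGE-2: 1-D DCT of each row of T gives the rows of the 2-D DCT. */
    for (int r = 0; r < 4; r++) dct4(T[r], Z[r]);
}

int main(void)
{
    int X[4][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}}, Z[4][4];
    dct2d_4x4(X, Z);
    for (int r = 0; r < 4; r++)
        printf("%8d %8d %8d %8d\n", Z[r][0], Z[r][1], Z[r][2], Z[r][3]);
    return 0;
}
```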


We present here a folded architecture and a full-parallel architecture for the 2-D integer DCT, along with the necessary transposition buffer to match them without internal data movement.

A. Folded Structure for 2-D Integer DCT

The folded structure for the computation of the (N × N)-point 2-D integer DCT is shown in Fig. 5(a). It consists of one N-point 1-D DCT module and a transposition buffer. The structure of the proposed 4 × 4 transposition buffer is shown in Fig. 5(b). It consists of 16 registers arranged in four rows and four columns. The (N × N) transposition buffer can store N values in any one column of registers by enabling them with one of the enable signals EN_i for i = 0, 1, ..., N − 1. One can select the content of one of the rows of registers through the MUXes.

During the first N successive cycles, the DCT module receives the successive columns of the (N × N) block of input for the computation of STAGE-1, and stores the intermediate results in the registers of successive columns of the transposition buffer. In the next N cycles, the contents of successive rows of the transposition buffer are selected by the MUXes and fed as input to the 1-D DCT module. N MUXes are used at the input of the 1-D DCT module to select either the columns from the input buffer (during the first N cycles) or the rows from the transposition buffer (during the next N cycles).

Fig. 5. Folded structure of (N × N)-point 2-D integer DCT. (a) Folded 2-D DCT architecture. (b) Structure of the transposition buffer for input size 4 × 4.

B. Full-Parallel Structure for 2-D Integer DCT

The full-parallel structure for the (N × N)-point 2-D integer DCT is shown in Fig. 6(a). It consists of two N-point 1-D DCT modules and a transposition buffer. The structure of the 4 × 4 transposition buffer for the full-parallel structure is shown in Fig. 6(b). It consists of 16 register cells (RCs) [shown in Fig. 6(c)] arranged in four rows and four columns. The N × N transposition buffer can store N values in a cycle either row-wise or column-wise by selecting the inputs with the MUXes at the input of the RCs. The output from the RCs can also be collected either row-wise or column-wise. To read the output from the buffer, N (2N − 1):1 MUXes [shown in Fig. 6(d)] are used, where the outputs of the ith row and the ith column of RCs are fed as input to the ith MUX. For the first N successive cycles, the ith MUX provides the output of the N successive RCs on the ith row. In the next N successive cycles, the ith MUX provides the output of the N successive RCs on the ith column. By this arrangement, in the first N cycles we can read the output of N successive columns of RCs, and in the next N cycles we can read the output of N successive rows of RCs. The transposition buffer in this case allows both read and write operations concurrently. If, for N cycles, results are read and stored column-wise, then in the next N successive cycles results are read and stored in the transposition buffer row-wise.

Fig. 6. Full-parallel structure of (N × N)-point 2-D integer DCT. (a) Full-parallel 2-D DCT architecture. (b) Structure of the transposition buffer for input size 4 × 4. (c) Register cell RC_ij. (d) 7-to-1 MUX for the 4 × 4 transposition buffer.

The first 1-D DCT module receives the inputs column-wise from the input buffer. It computes a column of intermediate output and stores it in the transposition buffer. The second 1-D DCT module receives the rows of the intermediate result from the transposition buffer and computes the rows of the 2-D DCT output row-wise.

Suppose that in the first N cycles the intermediate results are stored column-wise and all the columns are filled with intermediate results; then, in the next N cycles, the contents of successive rows of the transposition buffer are selected by the MUXes and fed as input to the 1-D DCT module of the second stage. During this period, the output of the 1-D DCT module of the first stage is stored row-wise. In the next N cycles, results are read and written column-wise. The alternating column-wise and row-wise read and write operations with the transposition buffer continue. The transposition buffer in this case introduces a pipeline latency of N cycles, required to fill the transposition buffer for the first time.
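The alternating read/write scheme can be modeled behaviorally as follows (a sketch of our reading of Fig. 6, not the actual register-cell RTL): in each window of N cycles, the vector about to be overwritten is read out first, and the read/write orientation toggles between row-wise and column-wise from one window to the next, so no data ever moves inside the buffer.

```c
#include <stdio.h>

#define N 4

static int RC[N][N];                 /* the register-cell array of the buffer */

/* One buffer cycle at window position i:
   phase = 0: row-wise window    (read row i, then overwrite row i)
   phase = 1: column-wise window (read column i, then overwrite column i)     */
static void buffer_cycle(int phase, int i, const int wr[N], int rd[N])
{
    for (int k = 0; k < N; k++)                   /* read the previous block's vector first */
        rd[k] = phase ? RC[k][i] : RC[i][k];
    for (int k = 0; k < N; k++) {                 /* then write the fresh stage-1 output    */
        if (phase) RC[k][i] = wr[k];
        else       RC[i][k] = wr[k];
    }
}

int main(void)
{
    int A[N][N], B[N][N], wr[N], rd[N];
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) { A[r][c] = 10 * r + c; B[r][c] = 100 + 10 * r + c; }

    /* Window 0 (column-wise): the columns of block A fill the buffer;
       reads are not yet meaningful (the N-cycle pipeline latency).     */
    for (int c = 0; c < N; c++) {
        for (int r = 0; r < N; r++) wr[r] = A[r][c];
        buffer_cycle(1, c, wr, rd);
    }
    /* Window 1 (row-wise): row i of A is read for the second stage while
       column i of the next block B is written into the row just freed.   */
    for (int i = 0; i < N; i++) {
        for (int r = 0; r < N; r++) wr[r] = B[r][i];
        buffer_cycle(0, i, wr, rd);
        printf("read row %d of A: %d %d %d %d\n", i, rd[0], rd[1], rd[2], rd[3]);
    }
    /* Window 2 (column-wise): reading column i now returns row i of B. */
    for (int i = 0; i < N; i++) {
        for (int r = 0; r < N; r++) wr[r] = 0;    /* next block's data would go here */
        buffer_cycle(1, i, wr, rd);
        printf("read row %d of B: %d %d %d %d\n", i, rd[0], rd[1], rd[2], rd[3]);
    }
    return 0;
}
```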

Fig. 7. Encoding and decoding chain involving the DCT in the HEVC codec. (a) Block diagram of a typical hybrid video encoder, e.g., HEVC. Note that the decoder contains a subset of the blocks in the encoder, except that it has entropy decoding instead of entropy coding. (b) Breakdown of the blocks in the shaded region of (a) for the main profile.

V. Pruned DCT Architecture

Fig. 7(a) illustrates a typical hybrid video encoder, e.g., HEVC, and Fig. 7(b) shows the breakdown of the blocks in the shaded region of Fig. 7(a) for the main profile, where the encoding and decoding chain involves the DCT, quantization, dequantization, and IDCT based on the transform design proposed in [14]. As shown in Fig. 7(b), data before and after each DCT/IDCT transform are constrained to a maximum of 16 bits regardless of the internal bit depth and the DCT length. Therefore, a scaling operation is required to retain the wordlength of the intermediate data. In the main profile, which supports only 8-bit samples, if bit truncations were not performed, the wordlength of the DCT output would be log₂N + 6 bits more than that of the input in order to avoid overflow. The output wordlengths of the first and second forward transforms are therefore scaled down to 16 bits by truncating the least significant log₂N − 1 and log₂N + 6 bits, respectively, as shown in the figure. The resulting coefficients from the inverse transforms are also scaled down by fixed scaling factors of 7 and 12 bits. It should be noted that additional clipping of the log₂N − 1 most significant bits is required to maintain 16 bits after the first inverse transform and its subsequent scaling.
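The following small check (assuming a 9-bit residual for 8-bit samples, which is not stated explicitly above) reproduces this wordlength bookkeeping: with a growth of log₂N + 6 bits per 1-D stage and the stated truncations, both forward-transform outputs remain at 16 bits for every N.

```c
#include <stdio.h>

int main(void)
{
    for (int N = 4, log2N = 2; N <= 32; N *= 2, log2N++) {
        int in1    = 9;              /* residual wordlength for 8-bit samples (assumption)  */
        int shift1 = log2N - 1;      /* LSBs truncated after the first forward transform    */
        int shift2 = log2N + 6;      /* LSBs truncated after the second forward transform   */
        int out1 = in1  + (log2N + 6) - shift1;   /* growth of log2(N)+6 bits per 1-D stage */
        int out2 = out1 + (log2N + 6) - shift2;
        printf("N=%2d: truncate %d then %d LSBs -> %2d-bit and %2d-bit outputs\n",
               N, shift1, shift2, out1, out2);
    }
    return 0;
}
```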

The scaling operation, however, can be integrated with the computation of the transform without significant impact on the coding result. The SAU includes several left-shift operations, as shown in Figs. 1(b) and 3(b), whereas the scaling process is equivalent to performing a right shift. Therefore, by manipulating the shift operations in the SAU circuit, we can optimize the complexity of the proposed DCT structure.


TABLE III: Internal Register Wordlength and the Number of Pruned Bits for the Implementation of Each Transform [DCT-1, DCT-2, IDCT-1, and IDCT-2 in Fig. 7(b)]

Fig. 8. (a) Dot diagram for the pruning of the SAU for the second eight-point forward DCT. (b) Modified structure of the SAU after pruning. (c) Dot diagram for the truncation after the SAU and the OAU to generate y(0) for the eight-point DCT.

Fig. 8(a) shows the dot diagram of the SAU to illustrate the pruning with the example of the second forward DCT of length 8. Each row of the dot diagram contains 17 dots, which represent the output of the IAU or its shifted form (for a 16-bit input wordlength). The final sum without truncation would be 25 bits, but we use only 16 bits in the final sum, and the remaining 9 bits are finally discarded. To reduce the computational complexity, some of the least significant bits (LSBs) in the SAU [in the gray area in Fig. 8(a)] can be pruned. It is noted that the worst-case error of the pruning scheme occurs when all the truncated bits are one. In this case, the sum of the truncated values amounts to 88, which is only 17% of the weight of the LSB of the final output, 2^9. Therefore, the impact of the proposed pruning on the final output is not significant. However, as we prune more bits, the truncation error increases rapidly. Fig. 8(b) shows the modified structure of the SAU after pruning.
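The mechanism can be illustrated as follows (the exact pruned bit positions of Fig. 8(a) are not reproduced here; the prune depth p below is purely illustrative): low-order bit columns of each shifted operand are dropped before the SAU additions, and the worst-case error is bounded by the sum of the dropped weights.

```c
#include <stdio.h>

/* Partial product b << s with the p least significant bit columns pruned. */
static long long pp_pruned(long long b, int s, int p)
{
    long long v = b << s;
    return (v >> p) << p;   /* zero bits 0..p-1 (arithmetic shift keeps the sign) */
}

int main(void)
{
    int b = -23457;          /* a 16-bit signed value from the IAU (illustrative) */
    int p = 5;               /* prune the 5 LSB columns (illustrative depth)      */

    long long exact  = 89LL * b;                      /* 89 = 64 + 16 + 8 + 1 */
    long long pruned = pp_pruned(b, 6, p) + pp_pruned(b, 4, p)
                     + pp_pruned(b, 3, p) + pp_pruned(b, 0, p);

    /* The per-term error is below 2^p, so the total error stays below 4 * 2^p. */
    printf("exact=%lld pruned=%lld error=%lld (< %d)\n",
           exact, pruned, exact - pruned, 4 * (1 << p));
    return 0;
}
```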

The output of the SAU is the product of a DCT coefficient with the input values. In the HEVC transform design [14], 16-bit multipliers are used for all internal multiplications. To have the same precision, the output of the SAU is also scaled down to 16 bits by truncating 4 more LSBs, which is shown in the gray area before the OAU in Fig. 8(c) for the output addition corresponding to the computation of y(0). The LSB of the result of the OAU is truncated again to retain a 16-bit transform output. By careful pruning, the complexity of the SAU and OAU can be significantly reduced without significant impact on the error performance. The overall reduction of the hardware complexity by the pruning scheme is discussed in Section VI.

Table III lists the internal register wordlength and the number of pruned bits for the implementation of each transform [DCT-1, DCT-2, IDCT-1, and IDCT-2 in Fig. 7(b)]. In the case of the second forward eight-point DCT (in the seventh row of Table III), the input wordlength is 16 bits. If no truncation is performed in the processing of the IAU, SAU, and OAU, the output wordlength becomes 25 bits, obtained by summing the values in columns A, B, and C. By the truncation procedure illustrated in Fig. 8, using the numbers of truncated bits listed in columns D, E, and F of Table III, the wordlength of the DCT output becomes 16 bits. Note that the scale factor listed in column H satisfies the configuration in Fig. 7(b), and the wordlengths of the multiplier output and the final DCT output, listed in columns I and J, respectively, are less than or equal to 16 bits, satisfying the constraints described in [14] over all the transforms T1–T4 for N = 4, 8, 16, and 32.

TABLE IV: Experimental Results for Proposed Methods Compared to the HM10.0 Anchor. BD-Rate Results for the Luma and Chroma (Cb and Cr) Components Are Shown as [Y(%)/Cb(%)/Cr(%)]

In Fig. 2, as the odd-indexed output samples y(1), y(3), ..., y(N − 1) are scaled down by the pruning procedure, the even-indexed output samples y(0), y(2), ..., y(N − 2), which come from the (N/2)-point integer DCT, should also be scaled down correspondingly. For example, since the scaling factor of the second eight-point DCT (T2) is 1/2^9, as shown in column H of Table III, the output of the internal four-point DCT should also be scaled down by 1/2^9. Since we have a four-point DCT with scaling factor 1/2^8, as shown in the third row of column H, we can reuse it and perform an additional scaling by 1/2.

To understand the impact of using the pruned transforms within HEVC, we have integrated them into the HEVC reference software, HM10.0,² for both luma and chroma residuals. The proposed pruned forward DCT is a nonnormative change and can be implemented in encoders to produce HEVC-compliant bitstreams. Therefore, we perform one set of experiments, prune-forward (PF), in which the forward DCT used is the proposed pruned version, while the inverse DCT is the same as that defined in the HEVC Final Draft International Standard (FDIS) [7]. In addition, we perform two other sets of experiments: 1) prune-inverse (PI), in which the forward DCT is not pruned but the inverse DCT is pruned, and 2) prune-both (PB), where both forward and inverse DCT are the pruned versions. Both PI and PB are no longer standards compliant, since the pruned inverse DCT is different from what is specified in the HEVC standard. We also note here that both the encoder and decoder use the same pruned inverse transforms, such that any coding loss is due to the loss in accuracy of the inverse transform rather than drift between encoder and decoder.

²Available through SVN via http://hevc.hhi.fraunhofer.de/svn/svnHEVCSoftware/tags/HM-10.0/.


TABLE V: Area, Time, and Power Complexities of Proposed Architectures and Direct Implementation of the Reference Algorithm of Integer DCT for Various Lengths, Based on Synthesis Results Using the TSMC 90-nm CMOS Library

We performed the tests using the JCT-VC HM10.0 common test conditions [18] for the main profile, i.e., on the test sequences³ mandated for the all-intra (AI), random-access (RA), low-delay B (LDB), and low-delay P (LDP) configurations, over QP values of 22, 27, 32, and 37. Using HM10.0 as the anchor, we measure the BD-rate [19] values for each of the PF, PI, and PB experiments to obtain a measure of the coding-performance impact. One can interpret the BD-rate number as the percentage bitrate difference for the same peak signal-to-noise ratio; note that a positive BD-rate number indicates coding loss compared to the anchor. Table IV summarizes the experimental results of using the different combinations of the proposed pruned forward and inverse DCT. Using the pruned forward DCT alone (the PF case) gives an average coding loss over each coding configuration of about 0.1% in the luma component, suggesting that the use of the pruned forward DCT in the encoder has an insignificant coding impact. On the other hand, PI gives average coding losses ranging from 0.8% to 1.0% in the luma component over the different coding configurations. This is expected, since the bit truncation after each transform stage is larger. Finally, PB gives average coding losses ranging from 0.9% to 1.1% in the luma component. It is interesting to note that the coding loss in PB is approximately the sum of that of PF and PI. We have also observed that the coding-performance difference for each of PF, PI, and PB is fairly consistent across the different classes of test sequences.

VI. Implementation Results and Discussions

A. Synthesis Results of 1-D Integer DCT

We have coded the architecture derived from the reference algorithm of Section II, as well as the proposed architectures for different transform lengths, in VHDL, and synthesized them with Synopsys Design Compiler using the TSMC 90-nm General Purpose (GP) CMOS library. The wordlength of the input samples is chosen to be 16 bits.

³The test sequences are classified into Classes A–F. Classes A–D contain sequences with resolutions of 2560 × 1600, 1920 × 1080, 832 × 480, and 416 × 240, respectively. Class E consists of video-conferencing sequences of resolution 1280 × 720, while Class F consists of screen-content videos of resolutions ranging from 832 × 480 to 1280 × 720.

The area, computation time, and power consumption (at a 100-MHz clock frequency) obtained from the synthesis reports are shown in Table V. It is found that the proposed architecture involves nearly 14% less area-delay product (ADP) and 19% less energy per sample (EPS) compared to the direct implementation of the reference algorithm, on average, for integer DCT of lengths 4, 8, 16, and 32. Additional savings of 19% in ADP and 20% in EPS are achieved by the pruning scheme with nearly the same throughput rate. The pruning scheme is more effective for higher-length DCTs, since the percentage of the total area occupied by the SAUs increases as the DCT length increases, and hence more adders are affected by the pruning scheme.

B. Comparison With the Existing Architectures

TABLE VI: Comparison of Different 1-D Integer DCT Architectures With the Proposed Architectures

TABLE VII: 2-D Integer Transform With Folded Structure and Full-Parallel Structure

We have named the proposed reusable integer DCT architecture before applying pruning reusable architecture-1 and that after applying pruning reusable architecture-2. The processing rate of the proposed integer DCT unit is 16 pixels per cycle for the 2-D folded structure, since the 2-D transform of a 32 × 32 block can be obtained in 64 cycles. In order to support 8K ultrahigh-definition (UHD) (7680 × 4320) video at 30 frames/s in the 4:2:0 YUV format, which is one of the applications of HEVC [20], the proposed reusable architectures should work at an operating frequency higher than 94 MHz (7680 × 4320 × 30 × 1.5/16). The computation times of 5.56 ns and 5.27 ns for reusable architectures-1 and -2, respectively (obtained from synthesis without any timing constraint), are sufficient for this application. Also, a computation time of less than 5.358 ns is needed to support 8K UHD at 60 frames/s, which can be achieved with a slight increase in silicon area when we synthesize reusable architecture-1 with the desired timing constraint. Table VI lists the synthesis results of the proposed reusable architectures, as well as those of the existing architectures for HEVC for N = 32, in terms of the gate count normalized by the area of a 2-input NAND gate, the maximum operating frequency, the processing rate, the throughput, and the supported video format. The proposed reusable architecture-2 requires a larger area than the design of [11], but offers much higher throughput. Also, the proposed architectures involve lower gate counts, as well as higher throughput, than the design of [10]. In particular, the designs of [10] and [11] require very high operating frequencies of 761 MHz and 1403 MHz, respectively, in order to support UHD at 60 frames/s, since they compute the 2-D transform of a 32 × 32 block in 261 cycles and 481 cycles, respectively. However, the 187-MHz operating frequency needed by the proposed architecture to support 8K UHD at 60 frames/s can be obtained using TSMC 90-nm or newer technologies, as shown in Table VI. Also, we could obtain the frequency of 94 MHz for UHD at 30 frames/s in the 4:2:0 YUV format using TSMC 0.15-μm or newer technologies.
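As a quick numerical check of the throughput arithmetic above (illustrative only; it assumes the 4:2:0 factor of 1.5 samples per pixel and the 16-coefficients-per-cycle rate of the folded 2-D structure stated above):

```c
#include <stdio.h>

int main(void)
{
    const double W = 7680.0, H = 4320.0, SAMPLES_PER_PIXEL = 1.5, THROUGHPUT = 16.0;

    for (int fps = 30; fps <= 60; fps *= 2) {
        double f_req = W * H * fps * SAMPLES_PER_PIXEL / THROUGHPUT;  /* required clock, Hz */
        printf("8K UHD @ %d fps: %.1f MHz required, max cycle time %.3f ns\n",
               fps, f_req / 1e6, 1e9 / f_req);
    }
    return 0;
}
```

The printed values (about 93.3 MHz and 10.7 ns at 30 frames/s, and 186.6 MHz and 5.358 ns at 60 frames/s) match the 94 MHz, 187 MHz, and 5.358 ns figures quoted above.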

C. Synthesis Results of 2-D Integer DCT

We also synthesized the folded and full-parallel structures for the 2-D integer DCT. We have listed the total gate counts, processing rate, throughput, power consumption, and EPS in Table VII. We set the operating frequency to 187 MHz in both cases to support UHD at 60 frames/s. The 2-D full-parallel structure yields 32 samples in each cycle after an initial latency of 32 cycles, providing double the throughput of the folded structure. However, the full-parallel architecture consumes 1.69 times more power than the folded architecture, since it has two 1-D DCT units and nearly the same transposition-buffer complexity, while its throughput is double that of the folded design. Thus, the full-parallel design involves 15.6% less EPS (1.69/2 ≈ 0.845 of the folded design's EPS).

VII. Summary and Conclusion

In this paper, we have proposed area- and power-efficient architectures for the implementation of the integer DCT of different lengths to be used in HEVC. The computation of the N-point 1-D DCT involves an (N/2)-point 1-D DCT and a vector–matrix multiplication with a constant matrix of size (N/2) × (N/2). We have shown that the MCM-based architecture is highly regular and involves significantly less area-delay complexity and less energy consumption than the direct implementation of matrix multiplication for the odd DCT coefficients. We have used the proposed architecture to derive a reusable architecture for the DCT that can compute the DCT of lengths 4, 8, 16, and 32 with a throughput of 32 output coefficients per cycle. We have also shown that the proposed design can be pruned to reduce the area and power complexity of the forward DCT without significant loss of quality. We have proposed power-efficient architectures for folded and full-parallel implementations of the 2-D DCT, where no data movement takes place within the transposition buffer. From the synthesis results, it is found that the proposed algorithm with pruning involves nearly 30% less ADP and 35% less EPS compared to the reference algorithm, on average, for integer DCT of lengths 4, 8, 16, and 32, with nearly the same throughput rate.

Acknowledgment

The authors would like to thank Dr. A. Tisserand for providing them with the HDL code for the computation of the integer DCT based on their algorithm.

References

[1] N. Ahmed, T. Natarajan, and K. Rao, “Discrete cosine transform,” IEEE Trans. Comput., vol. 100, no. 1, pp. 90–93, Jan. 1974.

[2] W. Cham and Y. Chan, “An order-16 integer cosine transform,” IEEE Trans. Signal Process., vol. 39, no. 5, pp. 1205–1208, May 1991.

[3] Y. Chen, S. Oraintara, and T. Nguyen, “Video compression using integer DCT,” in Proc. IEEE Int. Conf. Image Process., Sep. 2000, pp. 844–845.

[4] J. Wu and Y. Li, “A new type of integer DCT transform radix and its rapid algorithm,” in Proc. Int. Conf. Electric Inform. Control Eng., Apr. 2011, pp. 1063–1066.

[5] A. M. Ahmed and D. D. Day, “Comparison between the cosine and Hartley based naturalness preserving transforms for image watermarking and data hiding,” in Proc. First Canad. Conf. Comput. Robot Vision, May 2004, pp. 247–251.

[6] M. N. Haggag, M. El-Sharkawy, and G. Fahmy, “Efficient fast multiplication-free integer transformation for the 2-D DCT H.265 standard,” in Proc. Int. Conf. Image Process., Sep. 2010, pp. 3769–3772.

[7] B. Bross, W.-J. Han, J.-R. Ohm, G. J. Sullivan, Y.-K. Wang, and T. Wiegand, High Efficiency Video Coding (HEVC) Text Specification Draft 10 (for FDIS and Consent), JCT-VC L1003, Geneva, Switzerland, Jan. 2013.

[8] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.

[9] A. Ahmed, M. U. Shahid, and A. Rehman, “N point DCT VLSI architecture for emerging HEVC standard,” VLSI Design, vol. 2012, Article ID 752024, pp. 1–13, 2012.

[10] S. Shen, W. Shen, Y. Fan, and X. Zeng, “A unified 4/8/16/32-point integer IDCT architecture for multiple video coding standards,” in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2012, pp. 788–793.

[11] J.-S. Park, W.-J. Nam, S.-M. Han, and S. Lee, “2-D large inverse transform (16×16, 32×32) for HEVC (high efficiency video coding),” J. Semicond. Technol. Sci., vol. 12, no. 2, pp. 203–211, Jun. 2012.


[12] M. Budagavi and V. Sze, “Unified forward+inverse transform architecture for HEVC,” in Proc. Int. Conf. Image Process., Sep. 2012, pp. 209–212.

[13] P. Topiwala, M. Budagavi, A. Fuldseth, R. Joshi, and E. Alshina. (2011, Nov.). JCTVC-G040, CE10: Summary Report on Core Transform Design [Online]. Available: http://phenix.int-evry.fr/jct/doc_end_user/documents/7_Geneva/wg11/JCTVC-G040-v3.zip

[14] A. Fuldseth, G. Bjøntegaard, M. Budagavi, and V. Sze. (2011, Nov.). JCTVC-G495, CE10: Core Transform Design for HEVC: Proposal for Current HEVC Transform [Online]. Available: http://phenix.int-evry.fr/jct/doc_end_user/documents/7_Geneva/wg11/JCTVC-G495-v2.zip

[15] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, “Multiple constant multiplications: Efficient and versatile framework and algorithms for exploring common subexpression elimination,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 15, no. 2, pp. 151–165, Feb. 1996.

[16] M. D. Macleod and A. G. Dempster, “Common subexpression elimination algorithm for low-cost multiplierless implementation of matrix multipliers,” Electron. Lett., vol. 40, no. 11, pp. 651–652, May 2004.

[17] N. Boullis and A. Tisserand, “Some optimizations of hardware multiplication by constant matrices,” IEEE Trans. Comput., vol. 54, no. 10, pp. 1271–1282, Oct. 2005.

[18] F. Bossen, Common HM Test Conditions and Software Reference Configurations, JCT-VC L1100, Geneva, Switzerland, Jan. 2013.

[19] G. Bjontegaard, Calculation of Average PSNR Differences Between RD Curves, VCEG-M33, ITU-T SC16/Q6, Austin, TX, USA, Apr. 2001.

[20] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.

Pramod Kumar Meher (SM'03) received the B.Sc. (Honors) and M.Sc. degrees in physics and the Ph.D. degree in science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996, respectively.

Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. He was a Professor of computer applications with Utkal University, Bhubaneswar, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, Berhampur, India, from 1993 to 1997. He has contributed nearly 200 technical papers to various reputed journals and conference proceedings. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image, and video processing; communication; bioinformatics; and intelligent computing.

Dr. Meher served as a speaker for the Distinguished Lecturer Program of the IEEE Circuits and Systems Society from 2011 to 2012 and as an Associate Editor of the IEEE Transactions on Circuits and Systems II: Express Briefs from 2008 to 2011. Currently, he is an Associate Editor for the IEEE Transactions on Circuits and Systems I: Regular Papers, the IEEE Transactions on Very Large Scale Integration (VLSI) Systems, and the Journal of Circuits, Systems, and Signal Processing. He is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He was the recipient of the Samanta Chandrasekhar Award for Excellence in Research in Engineering and Technology for 1999.

Sang Yoon Park (S'03–M'11) received the B.S. degree in electrical engineering and the M.S. and Ph.D. degrees in electrical engineering and computer science from Seoul National University, Seoul, Korea, in 2000, 2002, and 2006, respectively.

He joined the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, as a Research Fellow in 2007. Since 2008, he has been with the Institute for Infocomm Research, Singapore, where he is currently a Research Scientist. His research interests include the design of dedicated and reconfigurable architectures for low-power and high-performance digital signal processing systems.

Basant Kumar Mohanty (M'06–SM'11) received the M.Sc. degree in physics from Sambalpur University, Sambalpur, India, in 1989 and the Ph.D. degree in the field of very large scale integration for digital signal processing from Berhampur University, Berhampur, India, in 2000.

In 2001, he joined the EEE Department, BITS Pilani, Rajasthan, India, as a Lecturer. He then joined the Department of ECE, Mody Institute of Education Research (Deemed University), Rajasthan, India, as an Assistant Professor. In 2003, he joined the Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, India, where he became an Associate Professor in 2005 and a Full Professor in 2007. He has published nearly 50 technical papers. His research interests include the design and implementation of low-power and high-performance digital systems for resource-constrained digital signal processing applications.

Dr. Mohanty is an Associate Editor for the Journal of Circuits, Systems, and Signal Processing. He is a Lifetime Member of the Institution of Electronics and Telecommunication Engineering, New Delhi, India. He was the recipient of the Rashtriya Gaurav Award conferred by the India International Friendship Society, New Delhi, for 2012.

Khoon Seong Lim received the B.Eng. (Honors) and M.Eng. degrees in electrical and electronic engineering from Nanyang Technological University, Singapore, in 1999 and 2001, respectively.

He was a Research and Development DSP Engineer with BITwave Private Ltd., Singapore, focusing on the area of adaptive microphone array beamforming in 2002. He joined the School of Electrical and Electronic Engineering, Nanyang Technological University, as a Research Associate in 2003. Since 2006, he has been with the Institute for Infocomm Research, Singapore, where he is currently a Senior Research Engineer. His research interests include embedded systems, sensor array signal processing, and digital signal processing in general.

Chuohao Yeo (S'05–M'09) received the S.B. degree in electrical science and engineering and the M.Eng. degree in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2002, and the Ph.D. degree in electrical engineering and computer sciences from the University of California, Berkeley (UC Berkeley), CA, USA, in 2009.

From 2005 to 2009, he was a Graduate Student Researcher with the Berkeley Audio Visual Signal Processing and Communication Systems Laboratory, UC Berkeley. Since 2009, he has been a Scientist with the Signal Processing Department, Institute for Infocomm Research, Singapore, where he leads a team that is actively involved in HEVC standardization activities. His research interests include image and video processing, coding and communications, distributed source coding, and computer vision.

Dr. Yeo was a recipient of the Singapore Government Public Service Commission Overseas Merit Scholarship from 1998 to 2002 and Singapore's Agency for Science, Technology and Research National Science Scholarship from 2004 to 2009. He received the Best Student Paper Award at SPIE VCIP 2007 and the Best Short Paper Award at ACM MM 2008.