A Bit-Serial Sum of Absolute Difference Accelerator for Variable Block Size Motion Estimation of H.264

Mohammad Reza H. Fatemi, Hasan F. Ates, Member, IEEE, and Rosli Salleh

2009 Conference on Innovative Technologies in Intelligent Systems and Industrial Applications (CITISIA 2009), Monash University, Sunway campus, Malaysia, 25th & 26th July 2009. 978-1-4244-2887-8/09/$25.00 ©2009 IEEE

Abstract—Bit-serial architectures offer a number of attractive features over their bit-parallel counterparts, such as smaller area cost, lower interconnection density, a reduced number of pins, higher clock frequency, and simpler routing. These features make them suitable for VLSI design and reduce overall production cost. In this paper, we propose the first least significant bit (LSB) bit-serial sum of absolute difference (SAD) hardware accelerator for integer variable block size motion estimation (VBSME) of H.264. This hardware accelerator is based on a previous state-of-the-art bit-parallel architecture, namely propagate partial SAD. To reduce area cost and improve throughput, the pixel truncation technique is adopted. Owing to the bit-serial pipeline architecture and the use of small processing elements, our architecture works at a much higher clock frequency (at least 4 times) and reduces area cost by about 32% compared with its bit-parallel counterpart. The proposed hardware accelerator can be used in different applications, from low bit rate to high bit rate, by trading off the degree of parallelism, using a fast algorithm, or combining both.

I. INTRODUCTION

H.264 [1] is the latest video coding standard, which leads to a 40-60% bit rate reduction in comparison to all previous standards [2]. Many new techniques, such as variable block size motion estimation (VBSME), multiple reference frames, and context-based arithmetic coding [3], are used to increase coding efficiency and video quality at the price of more computational complexity. In H.264, VBSME is used as a powerful motion compensated prediction technique with different block sizes including 4x4, 4x8, 8x4, 8x8, 8x16, 16x8 and 16x16. Compared to the previous fixed block size motion estimation (FBSME), VBSME leads to more accurate predictions and thus provides higher compression efficiency. However, it is computationally expensive and accounts for more than 50% of the total encoding process at the encoder side. Therefore, an efficient hardware accelerator for VBSME is necessary for real-time applications.

To date, several VBSME hardware accelerators have been proposed [4]-[7], targeting different applications from low bit rate to high bit rate. Due to the computational complexity of VBSME and in order to meet the real-time processing requirement, all of these works are based on bit-parallel architectures. However, the steady scaling down of CMOS technology provides new opportunities for bit-serial architectures, and what was once unnoticed and neglected becomes a new and important design strategy. In the deep sub-micron regime and beyond, higher clock frequencies are achievable, which increases the processing capability of bit-serial architectures. In addition, due to their lower interconnection density and area usage, bit-serial architectures suffer much less from the problems associated with interconnection and sub-threshold leakage power than their bit-parallel counterparts. In the deep sub-micron regime, interconnection rather than the MOSFETs will determine the cost, delay, power, reliability and turn-around time of future LSIs [8]. In addition, sub-threshold leakage power used to be insignificant for earlier generations of ICs but is becoming an increasing fraction of the total power [9].

However, because operations and data are spread over time, the design of bit-serial architectures is much more difficult than that of their bit-parallel counterparts. In addition, bit-serial architectures have lower processing capability than their bit-parallel counterparts, and this limitation restricts their use in applications with huge computational complexity. Even for such applications, the designer can still benefit from the attractive features of bit-serial architectures. The solution is to decrease the computational complexity at the algorithm level, to increase the degree of parallelism at the architecture level, or to combine both methods to meet the required performance.

In this paper, we propose a novel cost-efficient SAD hardware accelerator for VBSME, which is the bit-serial counterpart of the architecture in [4]. Owing to its bit-serial structure, the proposed architecture has a low area cost and works at a much higher frequency than previous bit-parallel architectures. In addition, the pixel truncation technique is adopted to further reduce the area and improve the throughput.

This work was supported in part by the Ministry of Higher Education, Malaysia, under Grant FRGS FP094/2007c.

Mohammad Reza Hosseiny Fatemi and Rosli Salleh are with the Department of Computer System & Technology, Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia (e-mail: [email protected] & [email protected]).

Hasan F. Ates is with the Department of Electronics Engineering, Isik University, Sile 34980 Istanbul, Turkey (e-mail: [email protected]).

The rest of the paper is organized as follows. Section II gives a brief background of motion estimation (ME). Section III presents the proposed bit-serial architecture and describes its details. Implementation results and a comparison are given in Section IV. Finally, Section V presents the conclusion.

II. BACKGROUND

Traditionally, in standard video codecs, ME is based on the block matching algorithm (BMA). Among the different block-matching ME algorithms, the full search algorithm generally leads to the best result in terms of finding the optimal motion vector (MV). In addition, due to the regular data flow in the full search algorithm, it is more suitable for hardware implementation. In full search ME, a frame is divided into macroblocks (MBs) of size NxN. Then, each MB in the current frame is matched against a corresponding MB in the reference frame within a search range of [-w, w]. Conventionally, the sum of absolute differences (SAD) is used as the matching criterion; it is defined in (1), where (i, j) is the current search MV, and C(m, n) and R(m+i, n+j) are pixel values in the current MB and the reference MB, respectively. After checking all search locations, the search location with the smallest SAD is selected as the MV of the current MB.
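The full search procedure described above can be sketched in software as follows. This is only an illustrative model, not the hardware of this paper: the function names `sad` and `full_search`, and the convention that the reference search window is stored as an (N+2w) x (N+2w) array whose centre corresponds to displacement (0, 0), are assumptions made for the example.

```python
def sad(cur, window, w, i, j):
    """SAD between the N x N current block and the reference block at displacement (i, j).

    Assumption: `window` is an (N + 2w) x (N + 2w) array, so displacement (0, 0)
    starts at index (w, w).
    """
    n = len(cur)
    return sum(
        abs(cur[m][k] - window[m + i + w][k + j + w])
        for m in range(n)
        for k in range(n)
    )


def full_search(cur, window, w):
    """Check every displacement in [-w, w] x [-w, w]; return (smallest SAD, MV)."""
    best = None
    for i in range(-w, w + 1):
        for j in range(-w, w + 1):
            cost = sad(cur, window, w, i, j)
            if best is None or cost < best[0]:
                best = (cost, (i, j))
    return best


# Toy example: a 4x4 window with distinct values and a 2x2 block cut from it
# at displacement (1, 0), so the search should recover that displacement.
window = [[r * 4 + c for c in range(4)] for r in range(4)]
cur = [row[1:3] for row in window[2:4]]
print(full_search(cur, window, 1))
```

The exhaustive double loop is what makes the data flow regular, and it is also why the number of search points directly scales the cycle count of any hardware realization.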

In H.264, each macroblock is divided into smaller blocks and motion estimation is conducted for each of them, which leads to extra computational complexity compared with fixed block size ME.

There are many ways to implement VBSME in hardware. One way is to design separate architectures that perform ME for the different block sizes, or to increase the degree of parallelism to improve the overall processing capability. However, these methods lead to huge area usage and are not suitable for efficient hardware implementation. One promising solution is to reuse the SADs of the smallest blocks to calculate the SADs of larger blocks [4]. This leads to only a slight increase in hardware cost, while other parameters such as operating frequency and memory requirements remain the same as for fixed block size ME.

Several VBSME hardware architectures have been reported [4]-[7], all of which have been designed with a bit-parallel structure. Among these works, two have superior performance, namely Propagate Partial SAD (PPSAD) [4] and SAD Tree [7]. The first PPSAD architecture was proposed by Huang in [4]. In [4], the partial SADs of 4x4 blocks are produced and propagated in the pipeline. These propagated partial SADs are then summed up to produce the SADs of the bigger block sizes. In [7], an analysis of different ME architectures, including the PPSAD and SAD Tree architectures, is given. Based on this analysis, the PPSAD architecture needs fewer reference pixel registers and has a shorter critical path than the other architectures. After analyzing the PPSAD architecture, we found that it can be efficiently designed in bit-serial form. This new design can provide a low-cost and high-performance architecture for VBSME which is suitable for low and medium resolutions.
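The SAD reuse idea can be illustrated with a short software sketch. It assumes the sixteen 4x4 SADs of one 16x16 macroblock are already available as a 4x4 grid; the function name `vbsme_sads`, the dictionary layout, and the reading of "WxH" as width by height are assumptions of this example rather than details taken from [4].

```python
def vbsme_sads(s):
    """Derive all larger-block SADs from a 4x4 grid `s` of 4x4-block SADs.

    Assumption: block names are width x height, so "8x4" joins two horizontally
    adjacent 4x4 blocks and "4x8" joins two vertically adjacent ones.
    """
    sads = {"4x4": [s[r][c] for r in range(4) for c in range(4)]}
    sads["8x4"] = [s[r][c] + s[r][c + 1] for r in range(4) for c in (0, 2)]
    sads["4x8"] = [s[r][c] + s[r + 1][c] for r in (0, 2) for c in range(4)]
    # 8x8 order: top-left, top-right, bottom-left, bottom-right
    q = [
        s[r][c] + s[r][c + 1] + s[r + 1][c] + s[r + 1][c + 1]
        for r in (0, 2)
        for c in (0, 2)
    ]
    sads["8x8"] = q
    sads["16x8"] = [q[0] + q[1], q[2] + q[3]]  # top and bottom halves
    sads["8x16"] = [q[0] + q[2], q[1] + q[3]]  # left and right halves
    sads["16x16"] = [sum(q)]
    return sads


grid = [[r * 4 + c for c in range(4)] for r in range(4)]
result = vbsme_sads(grid)
print(sum(len(v) for v in result.values()))  # number of SADs per search point
```

Only additions of already computed 4x4 SADs are needed for the larger blocks, which is why the extra hardware cost over fixed block size ME stays small.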

III. BIT-SERIAL ARCHITECTURE DESIGN

In digital signal processing, most algorithms can be implemented in hardware in either bit-parallel or bit-serial form. In a bit-parallel architecture, a word of n bits is processed in one clock cycle, whereas in a bit-serial architecture it is processed in n clock cycles. Consequently, bit-parallel architectures are suitable for high resolutions and bit-serial architectures for low to medium resolutions. However, by increasing the degree of parallelism (using more architectures that work in parallel), higher processing capability can be achieved in both cases. Given the attractive features of bit-serial architectures mentioned in the introduction, bit-serial architectures appear preferable to bit-parallel architectures, provided they can satisfy the processing requirements of the target application.

In our particular case, the full search VBSME algorithm, the main operations are absolute difference and addition. These operations can be implemented in hardware in either bit-parallel or bit-serial form. To benefit from the attractive features of bit-serial architectures, we design a novel bit-serial hardware architecture for the absolute difference operation and embed it into the PPSAD structure. In the following subsections, we describe our bit-serial SAD hardware accelerator for integer VBSME.

A. The Proposed Architecture

The proposed hardware architecture of VBSME is shown in Fig. 1. The main difference between our architecture and that of [4] is that ours is designed in bit-serial form, whereas the architecture of [4] is in bit-parallel form. In Fig. 1, circles stand for processing elements (PEs), and the bits of the pixels of the current MB are stored in the corresponding PEs. Every 16 clock cycles, 16x1 search area pixels are input and broadcast to the PE array. In Fig. 1, the vertical dashed lines show data broadcasting. Each PE is responsible for computing the absolute difference between a current MB pixel and a search area pixel. Each rectangle receives 4x1 absolute differences and computes their sum. The sum is passed down to be added to the sum of the 4x1 absolute differences of the next row. The gray rectangles are registers used as delay elements. In this way, sixteen 4x4 SADs are calculated in parallel. The SADs of the bigger block sizes (i.e., 4x8, 8x4, 8x8, 8x16, 16x8 and 16x16) are then easily computed by using adder trees.
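The way 4x1 row sums accumulate into 4x4 SADs can be mimicked with a purely functional software model. The sketch below ignores the pipeline timing, the delay registers and the bit-serial word format; the function name `sads_4x4` and the list-of-lists block layout are assumptions of the example.

```python
def sads_4x4(cur, ref):
    """Return the 4x4 grid of 4x4-block SADs for two 16x16 blocks.

    Each step adds one 4x1 row sum (four absolute differences) to the partial
    SAD of the 4x4 block it belongs to, as the row adders in the text do.
    """
    partial = [[0] * 4 for _ in range(4)]
    for row in range(16):
        for col4 in range(4):
            row_sum = sum(
                abs(cur[row][4 * col4 + k] - ref[row][4 * col4 + k])
                for k in range(4)
            )
            partial[row // 4][col4] += row_sum
    return partial


cur = [[r for _ in range(16)] for r in range(16)]  # pixel value equals row index
ref = [[0] * 16 for _ in range(16)]
print(sads_4x4(cur, ref)[0])
```

The returned grid is exactly the input a larger-block adder tree would consume.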

Addition and absolute difference are the main operations in our architecture, and the proposed bit-serial hardware architectures for them are shown in Fig. 2 and Fig. 3, respectively. To perform an addition, one bit of each of the A and B inputs is fed into the bit-serial adder in each clock cycle, starting from the least significant bits (LSBs). The produced Cout (carry-out) is kept in a flip-flop and used as Cin (carry-in) for the addition of the next incoming bits. At the beginning of the addition of two new words, Cin must be set to zero because there is no carry into their LSBs.
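A cycle-by-cycle software model of the LSB-first bit-serial addition described above might look as follows. It is a behavioural sketch only; the function name `bit_serial_add` and the choice to return the final carry alongside the n-bit sum are assumptions of the example.

```python
def bit_serial_add(a, b, n):
    """Add two n-bit words one bit per clock cycle, LSB first.

    Returns (n-bit sum, final carry-out).
    """
    carry = 0  # the carry flip-flop is cleared at the start of a new word
    result = 0
    for cycle in range(n):
        ai = (a >> cycle) & 1
        bi = (b >> cycle) & 1
        s = ai ^ bi ^ carry                       # full-adder sum bit
        carry = (ai & bi) | (carry & (ai ^ bi))   # full-adder carry-out, stored for next cycle
        result |= s << cycle
    return result, carry


print(bit_serial_add(5, 3, 4))
```

One full adder and one flip-flop suffice regardless of the word length, at the price of n clock cycles per word.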

$$\mathrm{SAD}(i, j) = \sum_{m=1}^{N} \sum_{n=1}^{N} \left| C(m, n) - R(m+i,\, n+j) \right|, \qquad -w \le i, j \le w \tag{1}$$


[Fig. 1 diagram labels. Legend: processing element, row adder tree, shift register for 4x4. Each pixel is broadcast to 1x16 PEs. Outputs: sixteen 4x4 SADs; eight 8x4 and 4x8 SADs; four 8x8 SADs; two 16x8 SADs and two 8x16 SADs; one 16x16 SAD.]

In terms of time, the addition of two n-bit words takes n clock cycles. To perform the absolute difference operation, one bit of a search point pixel and one bit of a current candidate pixel are fed into the bit-serial absolute difference architecture in each clock cycle, as shown in Fig. 3. To the best of our knowledge, this is the first LSB bit-serial architecture for the sum of absolute difference operation. Note that, to obtain a correct result, the lower input of the 2:1 multiplexer is selected during the 1st to (N-1)th cycles of each new word. In the Nth cycle, the inverse of Co is XORed bitwise with all ABS bits.
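For readers who want to experiment, the following is one possible LSB-first way to compute an absolute difference serially in software. It is explicitly not a model of the circuit in Fig. 3: it uses a two-pass scheme (serial two's complement subtraction, then a conditional negation with an XOR and a half-adder increment), and the name `bit_serial_abs_diff` is an assumption of the example. What it shares with the text is that the sign is only known once the final carry-out is available, so the correction of the result bits has to wait for the last cycle of the word.

```python
def bit_serial_abs_diff(a, b, n):
    """Compute |a - b| for n-bit unsigned words, one bit per step, LSB first."""
    # Pass 1: a - b computed as a + (~b) + 1, i.e. carry-in preset to 1.
    carry = 1
    diff_bits = []
    for cycle in range(n):
        ai = (a >> cycle) & 1
        nbi = ((b >> cycle) & 1) ^ 1
        diff_bits.append(ai ^ nbi ^ carry)
        carry = (ai & nbi) | (carry & (ai ^ nbi))
    negative = carry ^ 1  # final carry-out of 0 means a < b

    # Pass 2: if negative, negate the stored bits (invert, then add 1).
    inc = negative  # half-adder carry chain for the "+1"
    result = 0
    for cycle, d in enumerate(diff_bits):
        x = d ^ negative  # conditional inversion by XOR
        result |= (x ^ inc) << cycle
        inc = x & inc
    return result


print(bit_serial_abs_diff(3, 5, 4))
```

A hardware design would typically avoid the second pass, for example by buffering the difference bits in flip-flops and correcting them together once the carry-out is known.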

Fig. 2. Bit-serial adder hardware architecture.

Fig. 3. Bit-serial absolute difference hardware architecture.

B. Complexity Reduction Technique

Pixel truncation is a powerful technique for reducing the complexity of block matching ME algorithms while maintaining acceptable video quality. The effect of pixel truncation on variable block size integer ME has been studied in [10]. Traditionally, in video coding standards, each pixel is represented by 8 bits. In [10], it has been shown that up to 3 bits can be truncated with only slight video quality degradation. Since each MB contains 256 pixels, the maximum SAD of an MB is 256 x 255 = 65280, which can be represented by 16 bits. As a result, the word length in our bit-serial architecture is 16 bits. By truncating 3 bits, the maximum SAD becomes 256 x 31 = 7936 and the word length is reduced to 13 bits. Consequently, not only is the hardware cost reduced but the throughput is also improved.
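The word-length figures above follow from the largest SAD a macroblock can produce. A minimal check, assuming 256 pixels per MB and unsigned pixels (the helper name `sad_word_length` is hypothetical):

```python
def sad_word_length(pixels_per_mb, bits_per_pixel):
    """Number of bits needed to hold the largest possible SAD of one MB."""
    max_abs_diff = (1 << bits_per_pixel) - 1  # e.g. 255 for 8-bit pixels
    max_sad = pixels_per_mb * max_abs_diff
    return max_sad.bit_length()


for truncated in range(4):
    print(truncated, sad_word_length(256, 8 - truncated))
```

Because a bit-serial datapath spends one clock cycle per bit, shortening the word from 16 to 13 bits shortens every operation by the same proportion.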

[Diagram labels for Figs. 2 and 3: FA, D, Cin, Cout, Ai, Bi, Ri = Ai + Bi, R = A + B; HA, Ci, Ri, XOR, D, 0, clt, Co, ABSi, ABS0.]

Fig. 1. The proposed bit-serial architecture [4].


IV. IMPLEMENTATION RESULTS

The proposed bit-serial architecture has been implemented in Verilog HDL and synthesized with Design Compiler in 0.18 μm technology. The implementation results are listed in Table I. With a maximum clock frequency of 440 MHz, our design can support the CIF format in real time: the proposed architecture needs 13 clock cycles to calculate one search point, and there are 2048 search points within the search range [-16, 16], so our architecture processes one MB in 26624 clock cycles. A CIF frame (352x288) contains 396 MBs, so real-time CIF at 30 fps with a [-16, 16] search range requires 396 x 26624 x 30 ≈ 316.29 MHz.
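The required clock frequency can be reproduced with a few lines of arithmetic. The sketch below takes the 13 cycles per search point and 2048 search points stated in the text as given; the CIF frame size of 352x288 and the helper name `required_clock_hz` are assumptions added for the example.

```python
def required_clock_hz(width, height, fps, search_points, cycles_per_point, mb_size=16):
    """Clock frequency needed to process every macroblock of every frame in real time."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    cycles_per_mb = search_points * cycles_per_point
    return mbs_per_frame * cycles_per_mb * fps


cif = required_clock_hz(352, 288, 30, search_points=2048, cycles_per_point=13)
print(cif / 1e6)  # required clock in MHz
```

Since the result is below the reported maximum clock of 440 MHz, a single instance of the architecture suffices for CIF at 30 fps under these assumptions.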

TABLE I
IMPLEMENTATION RESULTS AND COMPARISON WITH PREVIOUS BIT-PARALLEL ARTS

Design               Ref [4]   Ref [7]   Ours
Technology (μm)      0.35      0.18      0.18
Max Clock (MHz)      66.67     110.8     440
PE Array (K gates)   64        †81.5     44

†Including the current MB, which should be about 15K gates.

In addition to our work, the implementation results of previous works are also given in Table I. These are the bit-parallel counterparts of our architecture, reported in [4] and [7] in 0.35 μm and 0.18 μm technologies, respectively. In terms of clock frequency, our architecture runs about 6.6 and 4 times faster than the architectures of [4] and [7], respectively (440/66.67 ≈ 6.6 and 440/110.8 ≈ 4.0). In addition, it saves about 32% of the area cost in comparison with its bit-parallel counterparts. Although these bit-parallel architectures have higher processing capabilities than ours, the same capabilities can be achieved by using a number of bit-serial architectures working in parallel, with a lower gate count.
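The clock and area ratios can be recomputed directly from the Table I entries. In the sketch below, the gate count used for [7] subtracts the roughly 15K gates attributed to the current MB in the table footnote; that subtraction, and the names `TABLE_I` and `compare`, are assumptions of the example.

```python
# (max clock in MHz, PE array size in K gates), taken from Table I
TABLE_I = {
    "[4]": (66.67, 64.0),
    "[7]": (110.8, 81.5 - 15.0),  # assumption: remove the ~15K gates of the current MB
    "ours": (440.0, 44.0),
}


def compare(ours, other):
    """Return (clock ratio, fractional area saving) of `ours` relative to `other`."""
    clock_ratio = ours[0] / other[0]
    area_saving = 1.0 - ours[1] / other[1]
    return clock_ratio, area_saving


for ref in ("[4]", "[7]"):
    ratio, saving = compare(TABLE_I["ours"], TABLE_I[ref])
    print(ref, round(ratio, 1), round(100 * saving, 1))
```

Note that the comparison with [4] spans different technology nodes (0.35 μm versus 0.18 μm), so its clock ratio is not a like-for-like figure.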

In brief, a bit-serial architecture may be preferable to its bit-parallel counterpart at low to medium resolutions in terms of area cost and the other attractive features mentioned in the introduction. A bit-serial architecture may also be used at higher resolutions by making an intelligent tradeoff: increasing the degree of parallelism, using a fast algorithm, or combining both.

V. CONCLUSION

In this paper, we present the first LSB bit-serial sum of absolute difference accelerator for VBSME. The proposed design benefits greatly from the attractive features of its bit-serial architecture, including low area cost, high clock frequency, low interconnection density, and a reduced number of pins. In addition, the pixel truncation technique is used, which not only reduces the latency and the gate count but also increases the throughput. The proposed architecture is implemented in Verilog HDL and synthesized with Design Compiler. It can support CIF full search VBSME at 30 fps with a clock frequency of 316.29 MHz and 44K gates. The proposed architecture is suitable for low-cost and low-power applications, such as handheld devices with low and medium resolutions.

REFERENCES

[1] Joint Video Team (JVT) of ITU-T VCEG and ISO/IEC MPEG, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, May 2003.

[2] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, “Rate-constrained coder control and comparison of video coding standards,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 688-703, Jul. 2003.

[3] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560-576, Jul. 2003.

[4] Y. W. Huang, T. C. Wang, B. Y. Hsieh, and L. G. Chen, “Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264,” ISCAS 2003, vol. 2, pp. 796-799, 2003.

[5] S. Y. Yap and J. V. McCanny, “A VLSI architecture for variable block size video motion estimation,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 51, pp. 384-389, Oct. 2004.

[6] M. Kim, I. Hwang, and S. I. Chae, “A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264,” ASP-DAC 2005, Shanghai, China, Jan. 18-21, 2005.

[7] C. Y. Chen, S. Y. Chien, Y. W. Huang, T. C. Chen, T. C. Wang, and L. G. Chen, “Analysis and architecture design of variable block-size motion estimation for H.264/AVC,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 53, no. 3, pp. 578-593, 2006.

[8] T. Sakurai, “Design challenges for 0.1um and beyond: embedded tutorial,” ASP-DAC 2000, Yokohama, Japan, pp. 553-558, 2000.

[9] H. F. Dadgour and K. Banerjee, “Design and analysis of hybrid NEMS-CMOS circuits for ultra low-power applications,” DAC 2007, California, USA, pp. 306-311, 2007.

[10] A. Bahari, T. Arslan, and A. T. Erdogan, “Low power variable block size motion estimation using pixel truncation,” ISCAS 2007, New Orleans, USA, pp. 3663-3666, May 27-30, 2007.
