
Architecture design of variable block size motion estimation for full and fast search algorithms in H.264/AVC

Xuanxing Xiong, Yang Song, Ali Akoglu
Reconfigurable Computing Laboratory, Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721, USA

Computers and Electrical Engineering 37 (2011) 285–299. doi:10.1016/j.compeleceng.2011.01.003. © 2011 Elsevier Ltd. All rights reserved.
Article history: Received 9 April 2010; Received in revised form 7 January 2011; Accepted 7 January 2011; Available online 4 February 2011.
Reviews processed and proposed for publication by Editor-in-Chief Dr. M. Malek.
E-mail addresses: [email protected] (X. Xiong), [email protected] (Y. Song), [email protected] (A. Akoglu).

Abstract

Fast search algorithms (FSA) used for variable block size motion estimation follow irregular search (data access) patterns, which is the main challenge in designing hardware architectures for them. In this study, we build a baseline architecture for fast search algorithms using state-of-the-art components available in academia. We improve its performance by introducing: (1) a super 2-dimensional (2-D) random access memory architecture that reads regular and interleaved two-rows or two-columns, as opposed to the one-row or one-column accessibility of the state of the art; (2) a 2-D processing element array with a tuned interconnect that supports the neighborhood connections required by the conventional fast search algorithms and exploits on-chip data reuse. Results show that our design increases system throughput by up to 85.47% and achieves a power reduction of up to 13.83%, with a worst-case area increase of up to 65.53% compared to the baseline architecture.

1. Introduction

Motion estimation (ME) reduces the temporal redundancies of image frames in video coding systems. It is widely used in video coding standards, including H.26x and MPEG-x [1], to achieve a low bit rate while maintaining good video quality. Full search block matching motion estimation (FSBMME) is a particularly popular method due to the high quality encoding it produces. It splits the current frame into a number of macroblocks (MBs) of size N × N, and then for each current MB (CMB), it looks for the reference block (RB) that has the least difference or distortion within the search range (SR) of the reference frame. The difference or distortion is commonly evaluated by the sum of absolute differences (SAD):

SAD(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |CMB(i, j) - RB(i + m, j + n)|,
    -\tfrac{1}{2} SR_H \le m < \tfrac{1}{2} SR_H, \quad -\tfrac{1}{2} SR_V \le n < \tfrac{1}{2} SR_V    (1)

where CMB(i, j) and RB(i + m, j + n) represent the pixel values of the current MB and the reference block, and SAD(m, n) is the total distortion between the current MB and the reference block. SR_H and SR_V indicate the horizontal and vertical search range, respectively.

Although FSBMME has better coding efficiency as it examines all the reference blocks in the search range, it is very compute intensive. For example, with a search range of SR_H = SR_V = 64 and an MB size of 16 × 16, a total of 4096 independent SADs are executed to derive the motion vector of each current MB. Fast search algorithms reduce this computational complexity by searching fewer reference blocks based on a geometrical pattern, such as four step search (4SS) [2], diamond search (DS) [3] and hexagon search (HS) [4].
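To make Eq. (1) concrete, the following is a minimal Python reference model of full-search block matching. This is our own sketch, not the paper's implementation; the names `sad` and `full_search` and the indexing convention are ours.

```python
import numpy as np

def sad(cmb, rb):
    """Sum of absolute differences between a current MB and a reference block."""
    return int(np.abs(cmb.astype(int) - rb.astype(int)).sum())

def full_search(cmb, ref, sr_h, sr_v):
    """Exhaustively evaluate Eq. (1) over the search range and return the
    motion vector (m, n) with minimum SAD.  `ref` is the buffered
    reference-frame window; the candidate block at signed offset (m, n)
    maps to window index (m + sr_h//2, n + sr_v//2)."""
    n_blk = cmb.shape[0]
    best = (None, float("inf"))
    for m in range(-sr_h // 2, sr_h // 2):
        for n in range(-sr_v // 2, sr_v // 2):
            i0, j0 = m + sr_h // 2, n + sr_v // 2
            cand = ref[i0:i0 + n_blk, j0:j0 + n_blk]
            d = sad(cmb, cand)
            if d < best[1]:
                best = ((m, n), d)
    return best
```

With SR_H = SR_V = 64 the two loops visit 64 × 64 = 4096 candidates per MB, matching the count quoted above.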




Some previous video coding standards, such as H.261, adopt fixed block size motion estimation (FBSME), which uses the same block size for both static and moving objects. In order to achieve better coding efficiency, H.264/AVC [5] employs variable block size motion estimation (VBSME), providing the flexibility of dividing each MB into seven types of sub-blocks (4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8 and 16 × 16). Although VBSME is superior to FBSME in terms of coding efficiency, it is more complicated, since the SADs of the different types of sub-blocks must be computed to generate the motion vector. An architecture that supports both variable block size and multiple search algorithms for motion estimation is a necessity to encode different types of video streams and deliver various levels of quality of service for H.264 video coding.

Several application specific integrated circuit (ASIC) based designs [6–10] have been introduced for FSBMME. Based on their topology, we classify them as pipelined 1-dimensional (1-D) or 2-dimensional (2-D) processing element (PE) array architectures, which accumulate the SADs between the current MB and reference blocks. These designs allow data reuse through delay registers and by moving data from one PE to the next. The 1-D architectures [6,7] are scalable in search range and block size with a small area occupancy, while the 2-D architectures [8–10] have better data reusability and a lower memory bandwidth requirement. Among the various 2-D architectures, the 2-D SAD tree architecture by Chen [10], which supports VBSME, is currently the state of the art. Other designs extend the studies on this architecture by applying it to high-definition television (HDTV) [11,12] or exploring power aware design issues [13].

In summary, architectures for the SAD routine of motion estimation should exploit the parallel nature of this computation in order to achieve real time processing. More importantly, the computation of SAD is data intensive, so a highly efficient memory architecture with high memory bandwidth and fast memory access time is indispensable. However, most studies to date have concentrated on the design of PE array architectures [6–10] and data reuse schemes [14,15] for full search. Very few papers address the memory architecture [16–18]. The memory subsystem is an issue for video coding processors and a technical bottleneck for real time processing, particularly for HDTV encoding. In practice, each PE array architecture has different data access patterns, thus requiring a specialized memory architecture. If a PE is unable to access pixel data from the memory on time, the pipelined data path suffers from stalls.

Designing an ASIC architecture to support multiple fast search algorithms is a challenging task. The difficulty arises from the irregular search patterns of fast search algorithms. This irregularity leads to poor inter-candidate data reuse, where a candidate is a reference block in the search range. The design should not only exploit the parallel nature of the SAD routine in ME, but also support this data intensive routine with a memory architecture that has high bandwidth and fast access time. Therefore, the PE array architecture and the memory architecture should evolve together.

There is minimal work on designing architectures for fast search algorithms. Lee introduces an architecture that implements multiple search algorithms by performing full search in the quadrangle search area around each of the parameterized search centers [19]. This full search based technique cannot support the search patterns of the conventional fast search algorithms. Chen proposes a content adaptive parallel variable block size 4-step search (4SS) algorithm and designs a parallel architecture specifically for it in [13]. Chen, in this work, modifies the original 4SS algorithm specifically for hardware implementation, and proposes a 2-D random access memory architecture for the on-chip search range buffer. The resulting architecture is one of the best for fast search motion estimation in academia. However, to our knowledge, there is no architecture that supports the conventional fast search algorithms.

In this paper, we first build a baseline architecture for supporting fast search algorithms using state-of-the-art components available in academia, and show that this baseline architecture is not efficient enough due to bubble cycles. We approach this problem by concurrently designing the memory architecture, the processing element architecture and the on-chip interconnect structure. We introduce the super 2-D random access memory architecture (RAMA) for the on-chip search range buffer to support the irregular search flow and data access patterns of fast search algorithms. The super 2-D RAMA allows reading regular and interleaved two-rows or two-columns, as opposed to the one-row or one-column regular access capability proposed by Chen et al. [13]. We also introduce a 2-D PE array with a tuned interconnect to support the various neighborhood connections required by the conventional fast search algorithms, and employ the adder tree structure in the ME engine to exploit data reuse and support variable block sizes. We finally develop specific search paths for the hardware implementation of each fast search algorithm, exploiting the parallelism of the proposed architecture. A transaction level modeling technique [20] with SystemC [21] is used to evaluate the overall performance of the new design and the baseline architecture on standard video sequences, along with the ASIC synthesis results. We discuss the design trade-offs for the proposed architecture by specifically focusing on the resource overhead and presenting an analytical study of its power efficiency. Results show that the proposed architecture accelerates motion estimation and increases the system throughput by up to 85.47%, and analytically achieves a power reduction of up to 13.83% with a worst case area overhead of 65.53%. Similar to the baseline architecture, the area overhead is mainly attributable to the routing resources, while the logic and 2-D RAMA area overheads are negligible.

The remainder of this paper is organized as follows: In Section 2, we first give an overview of the three conventional fast search algorithms, and then present and analyze the baseline architecture. In Section 3, we present the details of the proposed architecture. In Section 4, we discuss the controller design and the search paths of the fast search algorithms. The SystemC modeling and implementation results are presented in Section 5. Finally, Section 6 draws the conclusion.



2. Fast search algorithms and baseline architecture

2.1. Fast search algorithms

In this subsection, we give a brief overview of three representative fast search algorithms: four step search (4SS), diamond search (DS) and hexagon search (HS), in order to highlight their data access patterns. Basically, fast search algorithms follow three states: initialization, searching and, finally, refinement.

2.1.1. Four step search (4SS)

As Fig. 1a shows, 4SS examines the nine search points on the square search pattern during the initialization state, and returns the best matched point (BMP) with the least amount of distortion. Each search point represents a reference block. If the BMP is exactly the center of the square, the refinement step checks the eight neighboring points of the BMP to derive the final best matched point. Otherwise, the center of the square moves to the BMP, and the unchecked search points of the new square are examined in each iteration of the searching state. This process is performed iteratively until the final refinement state is reached.

2.1.2. Diamond search (DS)

Fig. 1b shows the search pattern of DS. Similar to 4SS, it initially searches nine points, then adjusts the center of the search points based on the BMP of the preceding iteration. These search points form a diamond, giving DS a different data access pattern compared to 4SS. A final refinement step is adopted in DS, in which the four nearby points of the BMP are examined.

2.1.3. Hexagon search (HS)

The search pattern of HS is depicted in Fig. 1c. It employs a hexagon search pattern which contains only seven search points. Similar to 4SS and DS, HS derives the BMP in each iteration and moves the center of the search hexagon to check more search points. The refinement step examines the four horizontal and vertical neighbors of the BMP.
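The three algorithms share the same initialize/search/refine skeleton and differ mainly in their candidate offsets. The following Python sketch is our own illustrative driver, not the paper's controller; the pattern coordinates are our reading of Fig. 1 and should be treated as assumptions.

```python
# Candidate offsets relative to the current search centre (assumed shapes;
# see Fig. 1 of the paper for the exact patterns).
PATTERNS = {
    "4SS": [(dx, dy) for dx in (-2, 0, 2) for dy in (-2, 0, 2)],          # 9 points
    "DS":  [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
            (-1, -1), (-1, 1), (1, -1), (1, 1)],                          # 9 points
    "HS":  [(0, 0), (-2, 0), (2, 0), (-1, -2), (1, -2), (-1, 2), (1, 2)], # 7 points
}

def fast_search(cost, pattern, refine, max_iters=32):
    """Generic initialize/search/refine loop shared by 4SS, DS and HS.
    `cost(p)` returns the SAD of the reference block at point p."""
    centre = (0, 0)
    for _ in range(max_iters):
        pts = [(centre[0] + dx, centre[1] + dy) for dx, dy in pattern]
        bmp = min(pts, key=cost)
        if bmp == centre:          # BMP is the centre: go to refinement
            break
        centre = bmp               # otherwise move the pattern and iterate
    final = [(centre[0] + dx, centre[1] + dy) for dx, dy in refine] + [centre]
    return min(final, key=cost)
```

For instance, with a cost function whose minimum sits at some true motion vector, the driver walks the pattern toward that point and then refines around it; only `pattern` and `refine` change between the three algorithms.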

Fig. 1. Search patterns of fast search algorithms. (a) Four step search (4SS) with 9 search points. (b) Diamond search (DS) with 9 search points. (c) Hexagon search (HS) with 7 search points.


Fig. 2. A search point and its neighboring points.


Compared to full search, fast search algorithms selectively examine a subset of the search points in the search range. Fig. 2 introduces the notation of the neighboring points for a given search point. While scanning the search range, in most cases the next search point is not an immediate neighbor of the current search point. Therefore, the motion estimator often needs to skip its immediate neighbors in order to examine other neighbors. For example, the next search point may often be the secondary neighbor for 4SS, the diagonal neighbor for DS, and the tertiary neighbor for HS. This kind of irregularity leads to uneven data access patterns and poor data reuse between search points, which is the most difficult problem to address when designing the hardware architecture.

2.2. Baseline architecture for fast search algorithms

Fig. 3 shows the conventional architecture of an ME system. The current MB buffer stores the MB pixel data of the current frame. The search range buffer keeps the reference pixels read from the system memory in order to reuse the data, since the search ranges of neighboring current MBs overlap. Such MB level data reuse largely reduces the bandwidth requirement of the system bus. The ME engine computes the SAD and generates the motion vector with the controller, which also manages the buffers. In order to support fast search algorithms, we start with this architecture and build a baseline architecture using the 2-D random access memory architecture (RAMA) of [13] for the on-chip search range buffer, and the 2-D SAD tree architecture of [10] for achieving inter-candidate and intra-candidate data reuse. In the following paragraphs, we highlight the main features of the 2-D RAMA and the 2-D SAD tree, and then illustrate the bubble cycle overhead of the baseline architecture built from these components.

2.2.1. 2-D random access memory architecture

The 2-D RAMA is built from several on-chip search range random access memories (RAMs) to support random one-row and one-column data access. For an N × N MB, this architecture adopts N parallel on-chip RAMs. The search range is then partitioned into blocks of size N × N. The horizontally adjacent pixels within each N × N reference block are stored interleavingly across these RAMs. As shown in Fig. 4 with N = 8, each small square denotes a reference pixel, which is described by its coordinates in the search range, such as A1 or B1. Mi (i = 1, 2, ..., 8) denotes the eight RAMs, and each column of pixels in Fig. 4b is stored in Mi. A row or a column of pixels is described by "the first pixel–the last pixel". For example, the first row of pixels A1, B1, C1, ..., H1 is described by row A1–H1, and one column of pixels A1, A2, A3, ..., A8 is described by column A1–A8. Each row of reference pixels, such as row A4–H4, is stored in different RAMs, and can be accessed in parallel. Likewise, each column of pixels, such as D1–D8, is spread over different RAMs, so this architecture also supports random column data access.
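One plausible software model of such an interleaved bank assignment is a skewed mapping; the skew function below is our assumption (the paper specifies the mapping only through Fig. 4), but it exhibits the same row and column accessibility.

```python
N = 8  # macroblock dimension; the paper uses N parallel RAMs M1..MN

def bank(r, c):
    """A plausible skewed mapping: pixel (r, c) of the search range goes to
    RAM number (r + c) mod N.  Any row fixes r and varies c, and any column
    fixes c and varies r, so each touches all N RAMs exactly once and can be
    read in a single parallel access."""
    return (r + c) % N

def conflict_free(pixels):
    """True if all pixels in the access fall into distinct RAMs."""
    banks = [bank(r, c) for r, c in pixels]
    return len(set(banks)) == len(banks)
```

Note that two adjacent rows necessarily collide in this single-group scheme (16 pixels, only 8 RAMs), which is precisely the limitation the super 2-D RAMA of Section 3 removes.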

Fig. 3. Conventional motion estimation (ME) system architecture.



Fig. 4. (a) Physical location of search range pixels. (b) Data alignment of the 2-D random access memory architecture (assuming macroblock size = 8 × 8 for simplicity).


2.2.2. 2-D SAD tree architecture

Fig. 5 depicts the 2-D SAD tree architecture, which consists of the 2-D current and reference pixel arrays, the 2-D PE array, the adder trees, and the decision unit. The current MB and the reference block are stored in the 16 × 16 current pixel array and the 16 × 16 reference pixel array, respectively. The PE array computes the absolute differences between the current pixel array and the reference pixel array. The 4 × 4 2-D adder trees generate the SADs of 4 × 4 blocks, and the variable block size adder tree accumulates the results of the 4 × 4 2-D adder trees to derive the SADs of larger blocks, such as 4 × 8, 8 × 8, etc. By taking advantage of the adder trees, the SADs of variable block sizes are computed in parallel, so that intra-candidate data reuse is achieved. Moreover, together with the 2-D RAMA, this design supports shifting a row or a column of reference pixels into the 16 × 16 reference pixel array in each cycle. This shift operation achieves inter-candidate data reuse between both horizontally and vertically adjacent reference blocks.
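The intra-candidate reuse of the SAD tree can be illustrated behaviourally: the sixteen 4 × 4 SADs are computed once and then only summed, never recomputed, for the larger partitions. This is a sketch of the arithmetic, not the hardware; 4 × 8, 8 × 4, 16 × 8 and 8 × 16 follow the same pattern as the 8 × 8 case shown.

```python
import numpy as np

def vbs_sads(cur, ref):
    """Behavioural model of the 2-D SAD tree's intra-candidate data reuse:
    the 4x4 SAD grid is computed once and reused for larger block sizes."""
    diff = np.abs(cur.astype(int) - ref.astype(int))
    # sixteen 4x4 leaf SADs, as produced by the 4x4 2-D adder trees
    s44 = diff.reshape(4, 4, 4, 4).sum(axis=(1, 3))
    # the variable-block-size adder tree combines leaf SADs, never pixels
    return {
        "4x4":   s44,                                   # 4x4 grid of leaf SADs
        "8x8":   s44.reshape(2, 2, 2, 2).sum(axis=(1, 3)),
        "16x16": s44.sum(),
    }
```

Every larger SAD is a sum of leaf SADs, so each pixel's absolute difference is computed exactly once per candidate.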

In order to implement the baseline architecture, we employ 16 memories to build the 2-D RAMA, and configure the 2-D SAD tree to support shifting by one pixel in four directions: up, down, left and right. For full search block matching motion estimation, this architecture has a latency of 16 cycles to load 16 rows of reference pixels, and is capable of examining one reference block in each cycle by shifting within the search range. For fast search algorithms, this design is also effective, as it can implement the search patterns shown in Fig. 1 by shifting from one search point to another.

However, since the search pattern of fast search algorithms is irregular, the current reference block and the next reference block are not immediate neighbors in most cases, as illustrated in Fig. 1, although they often share many pixels. Inter-candidate data reuse should still be considered to reduce the on-chip memory access, but it takes several cycles to load the pixels of the next reference block. During that period the 2-D PE array and adder trees would be idle, waiting for the data or running redundant computation for a few cycles, also known as bubble cycles. For example, when the baseline architecture is employed to implement the conventional 4SS shown in Fig. 1a, it takes at least two cycles to load the data for the next reference block in the initialization and searching states (because the minimum distance between two search points is two, at least two rows or two columns of pixels have to be loaded for the next reference block), so one bubble cycle per reference block is incurred, leading to under-utilization of the available PE resources. We observe similar behavior for DS and HS as well. Based on these observations, we propose to concurrently design the memory architecture and the ME engine in order to reduce these bubble cycles and improve the system throughput.

Fig. 5. ME engine: 2-D SAD tree architecture [10].
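The bubble-cycle penalty of the baseline can be stated as a small cycle model; this is our formulation of the counting described above (one row or column loaded per cycle), with function names of our own choosing.

```python
def baseline_bubbles(step):
    """The baseline 2-D RAMA loads one row or column per cycle, so moving the
    reference window `step` pixels takes `step` load cycles; all but one of
    them overlap no useful computation and are bubbles."""
    return step - 1

def points_per_cycle(step):
    """Search points examined per cycle when every move spans `step` pixels."""
    return 1.0 / step
```

For 4SS, where the minimum distance between search points is two, the model yields one bubble per reference block and a throughput of half a search point per cycle, matching the observation above.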

3. Architecture design

We propose our parallel architecture based on the system framework shown in Fig. 3. In order to support full search and fast search algorithms with this system architecture, we introduce a super 2-D random access memory for the on-chip search range buffer and a 2-D PE array with a tuned interconnect to support the immediate, diagonal, and secondary neighborhood connections required by the fast search algorithms, and we employ the adder trees in the ME engine to exploit intra-candidate data reuse and support variable block sizes.

3.1. Super 2-D random access memory architecture

The super 2-D RAMA has two-rows and two-columns accessibility to support fast search algorithms. There are 2N parallel on-chip RAMs for an MB size of N × N. Similar to the 2-D RAMA, the search range is split into many N × N reference blocks, and the reference pixels of each block are stored interleavingly in these RAMs. We double the number of RAMs compared to the 2-D RAMA, and then divide the RAMs into two groups, one for odd row data storage and the other for even row storage. Fig. 6 shows the physical location of pixels and the data alignment of the RAMs. In a real implementation, the MB size is 16 × 16; however, for simplicity, here we assume that the MB size is 4 × 4 and the buffered search range reference pixel array size is 8 × 8. For each 4 × 4 reference block, the odd rows of pixels are mapped into the first group of RAMs, M1–M4, and the even rows are stored in the second RAM group, M5–M8. As illustrated in Fig. 6a, row A1–D1 and row A2–D2 are mapped into RAMs M1–M8, respectively; row A3–D3 and row A4–D4 are mapped in the same way, but their memory locations are interleaved with the previous row of reference pixels within the same memory group by two pixels. By mapping pixels in this way, the memory architecture has flexible accessibility in three ways: single row or single column access, standard two rows or two columns access, and interleaved two rows or two columns access. These three kinds of data accessibility support loading reference pixels for shifting to the immediate neighbor, secondary neighbor, and diagonal neighbor, which are needed in 4SS, DS, and HS.

Fig. 6. Physical location of search range pixels and data alignment of the super 2-D random access memory architecture (assuming macroblock size = 4 × 4, buffered search range pixel array size = 8 × 8 for simplicity). (a) Standard two rows access for loading data of the vertical secondary neighbor. (b) Standard two columns access for loading data of the horizontal secondary neighbor. (c) Interleaved two rows access for loading data of the diagonal neighbor and secondary diagonal neighbor. (d) Interleaved two columns access for loading data of the diagonal neighbor and secondary diagonal neighbor.

3.1.1. Single row or single column access

Similar to the 2-D RAMA, each row and each column of pixels is stored across different RAMs, so each row and each column of pixels can be accessed in parallel, regardless of the physical location of the row or column.

3.1.2. Standard two rows or two columns access

This type of data access fetches two consecutive rows or two consecutive columns in parallel. Because of the pixel mapping technique, two consecutive rows of reference pixels are stored in different RAMs, and they can be accessed at the same time. Moreover, two consecutive rows of pixels at arbitrary locations can be accessed in parallel. As Fig. 6a shows, row A1–D1 and row A2–D2, or, as another example, row D6–G6 and row D7–G7, can be accessed concurrently. Two arbitrary consecutive columns can likewise be accessed with ease. As shown in Fig. 6b, columns A1–A4 and B1–B4, or, as another example, columns F4–F7 and G4–G7, can be accessed at the same time.

3.1.3. Interleaved two rows or two columns access

Similar to the standard two rows or two columns access, the interleaved two rows or two columns can be accessed arbitrarily, as illustrated in Fig. 6c and d.

Table 1 summarizes the accessibility comparison of the super 2-D RAMA and the 2-D RAMA. The super 2-D RAMA has superior accessibility to the 2-D RAMA, as it supports standard and interleaved two rows and two columns access. This superior accessibility is achieved by using more RAMs and a more complicated data alignment scheme. For a fixed size search range buffer, the super 2-D RAMA uses twice the number of RAMs. Therefore, the super 2-D RAMA has a larger area overhead because of the address decoders.
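For illustration, one mapping consistent with the description in Section 3.1 (even-index rows in one RAM group, odd-index rows in the other, and same-group rows two apart skewed by two pixels) can be checked for conflict-free access in software. The exact skew function is our assumption; the paper fixes the mapping only through Fig. 6.

```python
N = 4  # block dimension in the paper's simplified example (16 in practice)

def super_bank(r, c):
    """One possible super 2-D RAMA mapping: rows alternate between two RAM
    groups of N banks each, and same-group rows two apart are skewed by two
    pixels, so consecutive rows and consecutive columns never collide."""
    group = r % 2
    return group * N + (c + 2 * (r // 2)) % N

def all_distinct(pixels):
    """True if every pixel in the access lands in a different RAM."""
    banks = [super_bank(r, c) for r, c in pixels]
    return len(set(banks)) == len(banks)
```

A single row or column uses one group's N banks, while two consecutive rows (or columns) spread across both groups, which is what allows the standard two-rows/two-columns access of Table 1; the two-pixel skew is what makes the interleaved variants possible as well.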

3.2. 2-D PE array with a tuned interconnect and adder trees architecture

We introduce a 2-D PE array architecture with a tuned interconnect to minimize the number of bubble cycles and exploit inter-candidate data reuse for fast search algorithms. The adder trees are adopted for intra-candidate data reuse to support variable block size motion estimation. As shown in Figs. 7 and 8, we integrate the current pixel array and the reference pixel array into the PE array, and introduce three types of interconnects: short interconnects, diagonal interconnects and long interconnects. These interconnects support shifting to the immediate neighbor, diagonal neighbor, and secondary neighbor, respectively. The short interconnects connect a PE with its two horizontally adjacent PEs and two vertically adjacent PEs; the diagonal interconnects connect a PE with its four diagonally adjacent PEs; the long interconnects connect a PE with the PEs that are two hops away in the horizontal and vertical directions. For simplicity, Fig. 8 only shows the interconnect topology of one PE. Each PE takes advantage of these three types of interconnects to get the next reference pixel from, and transmit its reference pixel to, its adjacent PEs. The PEs on the boundary of the 2-D PE array do not have as many adjacent PEs as the internal PEs, and some of their interconnects are connected to the read data bus of the super 2-D random access memory to get the next reference pixel. We co-design the interconnect and the memory architecture so that the standard and interleaved two rows or two columns accessibility of the super 2-D RAMA is supported by the long, short, and diagonal interconnects. This enables the architecture to feed data to the PEs in a pipelined manner and reduces the number of bubble cycles while shifting within the search range.
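The reach of the three interconnect types can be written down as offset sets; the 16 × 16 array size follows the paper, while the helper `sources` and its return convention are ours.

```python
# Offsets reachable through each interconnect type of the tuned 2-D PE array
SHORT    = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # immediate neighbors
DIAGONAL = [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # diagonal neighbors
LONG     = [(-2, 0), (2, 0), (0, -2), (0, 2)]    # secondary neighbors

def sources(r, c, size=16):
    """PE (r, c) can take its next reference pixel from any in-array PE one
    hop away over these links; offsets falling off the array are served by
    the read data bus of the super 2-D RAMA instead."""
    in_array, from_bus = [], 0
    for dr, dc in SHORT + DIAGONAL + LONG:
        rr, cc = r + dr, c + dc
        if 0 <= rr < size and 0 <= cc < size:
            in_array.append((rr, cc))
        else:
            from_bus += 1
    return in_array, from_bus
```

An interior PE reaches all 12 link endpoints inside the array, while a corner PE has only 5 in-array sources and falls back to the memory bus for the remaining 7, which is why the boundary PEs are the ones wired to the RAM read bus.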

3.3. Configuration of the ME engine

With the super 2-D RAMA, the 2-D PE array and the adder trees, our ME engine supports three types of configurations: shift by one pixel to an immediate neighbor, shift by two pixels to a secondary neighbor, and diagonal shift to a diagonal neighbor. The ME engine supports shifting by one or two pixels in four directions (up, down, left and right), as well as diagonal shifting in four directions (top left, top right, bottom left, and bottom right). Table 2 summarizes the differences between the proposed architecture and the baseline architecture. Fig. 9 illustrates some shift patterns of the ME engine.

Table 1. Accessibility comparison.

Accessibility               2-D RAMA [13]    Super 2-D RAMA
Single row                  Random           Random
Single column               Random           Random
Standard two rows           Not available    Random
Standard two columns        Not available    Random
Interleaved two rows        Not available    Random
Interleaved two columns     Not available    Random


Fig. 7. ME engine: 2-D PE array with a tuned interconnect and adder trees for fast search.

Fig. 8. The tuned interconnect topology of one PE in the 2-D PE array architecture for fast search.

Table 2. Architecture comparison.

Architecture                              Baseline               Proposed
Search range buffer                       2-D RAMA [13]          Super 2-D RAMA
2-D register or PE array interconnects    Short interconnects    Short, long and diagonal interconnects
Configurations                            Shift by one pixel     Shift by one or two pixels, and diagonal shift


The super 2-D RAMA is built from 32 RAMs, so that two rows or two columns of reference pixels can be accessed in each cycle. Shifting by one or two pixels is easily achieved using the standard one/two rows/columns memory access types. For the diagonal shift, one row and one column of pixels are loaded to the ME engine, and then a bubble cycle is used to preload a row or a column of pixels. We cannot eliminate this bubble cycle if we shift diagonally only once (shift to a diagonal neighbor). However, if we need to shift diagonally in the same direction twice (shift to a diagonal neighbor and then a secondary diagonal neighbor in the same direction), we can use the interleaved memory access: one bubble cycle is used to preload two interleaved rows, and then two interleaved columns are loaded for shifting. In this way, the ME engine can shift twice continuously in the diagonal direction in only three cycles, thereby eliminating one of the bubble cycles observed in the baseline architecture. The diagonal shift is especially important for DS, since most of its search points are diagonal neighbors. With these configurations, the ME engine can shift freely in the search range to examine the search points efficiently.

Fig. 9. Examples of the ME engine shift patterns when searching in the search range.

4. Algorithm implementation

The architecture presented in Section 3 supports full search and three conventional fast search algorithms. When implementing a specific search algorithm, the controller (Fig. 3) manages the memory and the ME engine appropriately to exploit the parallelism of the architecture. The controller takes full advantage of the super 2-D RAMA and the 2-D PE array to reduce bubble cycles, thus maximizing the system throughput. The following subsections discuss how to implement these search algorithms efficiently with the proposed architecture.

4.1. Full search

Implementing full search motion estimation with this architecture is straightforward. The super 2-D RAMA supports random row and column access, so the candidate blocks in the search range can be examined simply by using the conventional snake search pattern.
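A snake (boustrophedon) scan over the search range can be generated as below; the bounds mirror Eq. (1), and the row-major, direction-alternating order is one plausible instance of the pattern rather than the paper's exact controller sequence:

```python
def snake_scan(sr_h, sr_v):
    """Yield candidate offsets (m, n) row by row, reversing direction on
    alternate rows so consecutive candidates differ by a single shift."""
    cols = list(range(-sr_h // 2, sr_h // 2))
    for i, n in enumerate(range(-sr_v // 2, sr_v // 2)):
        row = cols if i % 2 == 0 else list(reversed(cols))
        for m in row:
            yield (m, n)
```

With SR_H = SR_V = 64 this enumerates the 4096 candidates mentioned in the introduction, each reachable from its predecessor with one shift.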

4.2. Fast search

Fast search algorithms adopt a center-biased search pattern, as shown in Fig. 1. The search points of each iteration are determined by the best matching point of the previous iteration. The candidate reference blocks within an iteration share most of their reference pixels, and even between consecutive iterations, the last search point of the previous iteration shares most of its reference pixels with the first search point of the current iteration. Shifting from one search point to the next therefore takes fewer cycles than loading all the reference pixels of the next search point from memory. Moreover, the shift operation reduces the number of memory accesses, thus saving power. It is therefore efficient to shift within the search range to examine all the search points.
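The data-reuse argument can be made concrete with a quick count: two 16x16 candidate blocks whose positions differ by one pixel share 240 of their 256 reference pixels, so a shift only needs the 16 new pixels of one row or column instead of a full reload. A minimal check:

```python
N = 16  # macroblock dimension

def block_pixels(m, n):
    """Set of reference-pixel coordinates covered by the candidate at (m, n)."""
    return {(m + i, n + j) for i in range(N) for j in range(N)}

a = block_pixels(0, 0)
b = block_pixels(1, 0)   # neighboring candidate, shifted by one pixel
shared = len(a & b)      # 16 * 15 = 240 pixels reused on-chip
new = len(b - a)         # only 16 pixels must be fetched from the buffer
```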

For each iteration of a fast search algorithm, the search points can be evaluated in different orders by shifting along different search paths, resulting in different numbers of bubble cycles. To reduce bubble cycles, we evaluate all the possible search paths for the different states of the fast search algorithms and design the uniform search path shown in Fig. 10. Each fast search algorithm uses specific search paths in its initialization and refinement states to minimize the number of bubble cycles. In the searching state, the search path always originates from the center search point of the last iteration and covers the search points of the current iteration in a counter-clockwise manner. After the ME engine finishes examining all the search points, it shifts back to the center search point of the current iteration, which is also the best matched point of the previous iteration. With this design, the search paths in the searching state are regular and can be implemented easily.

Fig. 10. Uniform search path for fast search algorithms. (a) Four step search (4SS). (b) Diamond search (DS). (c) Hexagon search (HS).

We introduce a fixed number of gap cycles between iterations; these gap cycles cannot be eliminated because the ME engine has to finish the SAD computation and derive the best matched point, and the controller needs to generate the memory address for the next iteration and read the reference pixels from the buffer. The number of gap cycles is determined by the specific implementation. Analytically, the minimum number of gap cycles is two: it takes at least one cycle to calculate the SAD, derive the best matched point and send the memory read address, and another cycle to read the data from the buffer. The cycles spent on shifting back to the center of the current iteration after examining all the search points ($N_{\text{shifting back cycles}}$) overlap with these gap cycles ($N_{\text{gap cycles}}$), so shifting back to the center takes no extra bubble cycles. To guarantee that the shifting back operation does not degrade the performance of the motion estimator, we assume

$N_{\text{gap cycles}} \geq N_{\text{shifting back cycles}}.$  (2)

As shown in Fig. 10, the shifting back paths of the 4SS are the longest: the ME engine needs to load two rows and two columns of pixels to shift back to the center search point. The shifting back operation takes two cycles in this worst case, whereas it takes up to four cycles for the baseline architecture. That is, the maximum $N_{\text{shifting back cycles}}$ of the proposed architecture is two, while that of the baseline architecture is four. Therefore, the minimum $N_{\text{gap cycles}}$ of the proposed architecture and the baseline architecture are two and four, respectively.

4.2.1. Four step search (4SS)

Fig. 10a shows the uniform search path for 4SS. The ME engine employs the shift by one or two pixels as needed, so it takes only one cycle to shift from one search point to the next within each iteration. Bubble cycles arise from the gap cycles between iterations and from the latency of shifting from the center search point of the previous iteration to the first search point of the current iteration.

4.2.2. Diamond search (DS)

Since the search points of DS are often diagonal neighbors, DS is more irregular than 4SS. The uniform search path for DS is illustrated in Fig. 10b. In the initialization state, the search path starts at the upper vertex and covers the search points in a counter-clockwise manner. When the search path returns to the upper vertex after examining the eight search points, it shifts to the center of the diamond to examine the last point. In the refinement state, the two horizontal search points are examined first, and then the two vertical search points are checked. The diagonal portions of the search path are implemented with the diagonal shift configuration of the ME engine. The bubble cycles are the gap cycles between iterations and the extra cycles spent loading the reference pixels of the search points.
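The large-diamond visiting order described above can be sketched as follows. We use (x, y) coordinates with y growing downward, and the specific counter-clockwise ordering is our assumption of one valid instance consistent with the standard DS pattern, not a transcription of Fig. 10b:

```python
# Large diamond search pattern (LDSP): 8 points at city-block distance 2
# from the center, visited from the upper vertex, then the center last.
LDSP_ORDER = [
    (0, -2),                     # upper vertex (start)
    (-1, -1), (-2, 0), (-1, 1),  # left side, moving counter-clockwise
    (0, 2),                      # lower vertex
    (1, 1), (2, 0), (1, -1),     # right side, back toward the start
    (0, 0),                      # finally the center point
]
# Small diamond used in the refinement state: horizontal points first,
# then vertical, as in the uniform search path.
SDSP_ORDER = [(-1, 0), (1, 0), (0, -1), (0, 1)]
```

Note that each consecutive pair among the first eight points is a diagonal neighbor, which is why the diagonal shift configuration matters most for DS.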

4.2.3. Hexagon search (HS)

The search path of HS is depicted in Fig. 10c. It starts at the top-left vertex in the initialization state and covers the original search points in a zigzag manner. After covering all the search points in one iteration, it goes back to the center of the hexagon. In the refinement state, the search path covers the two horizontal search points and then examines the two vertical points. Since HS does not have a diagonal search pattern like DS, the ME engine does not use the diagonal shift for HS. The controller selects the shift by one or two pixels according to the search path. For example, in the initialization state, the ME engine uses the shift by two pixels configuration to move from the first search point to the second; to move to the third search point, it shifts right by one pixel and then down by two pixels.
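The controller's choice of shift configurations for HS can be sketched as a decomposition of the displacement to the next search point into the supported one-cycle axis shifts. The greedy policy below is our own illustration; the paper's controller is a state machine:

```python
def decompose(dx, dy):
    """Break a displacement into axis-aligned shifts of +/-1 or +/-2 pixels,
    largest shifts first; each returned move costs one cycle."""
    moves = []
    for axis in range(2):  # 0 = horizontal, 1 = vertical
        d = (dx, dy)[axis]
        sign = 1 if d >= 0 else -1
        for step in [2] * (abs(d) // 2) + [1] * (abs(d) % 2):
            move = [0, 0]
            move[axis] = sign * step
            moves.append(tuple(move))
    return moves
```

The example in the text, moving right by one pixel and then down by two, corresponds to `decompose(1, 2)` returning `[(1, 0), (0, 2)]`.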

Since the fast search algorithms have different search patterns, they use different shift configurations, interconnects, and on-chip search range buffer access modes. Table 3 compares these algorithms.

5. Experimental results

5.1. Performance evaluation

We employ the transaction level modeling technique for performance evaluation. We develop cycle-accurate transaction level models of the proposed architecture and the baseline architecture using SystemC [21], and use a set of representative standard CIF video sequences [22] to evaluate their performance. As presented in Section 4, the minimum number of gap cycles for the baseline architecture is four, while for the proposed architecture it is only two. We therefore model the baseline architecture with four gap cycles ($N_{\text{gap cycles}} = 4$) and the proposed architecture with two gap cycles ($N_{\text{gap cycles}} = 2$). We use the number of clock cycles spent on the motion estimation process as the primary performance metric, referred to as Runtime, and evaluate the Speedup of the proposed architecture over the baseline architecture.

$\text{Runtime} = \text{the number of clock cycles spent on the motion estimation process}$  (3)

$\text{Speedup} = \dfrac{\text{Runtime}_{\text{baseline}}}{\text{Runtime}_{\text{proposed}}}$  (4)

5.1.1. Full search

Since the new design supports loading two rows or two columns of reference pixels in each cycle, it uses only 8 cycles to load the first reference block in the search range, while the baseline architecture requires 16 cycles. Once the first reference block is loaded, the motion estimator simply shifts within the search range to examine the other reference blocks; after this stage, both architectures examine one reference block per cycle. The Speedup metric for full search is therefore computed as in Eq. (5). For example, when $SR_H = SR_V = 64$, the Speedup is 1.002, indicating performance comparable to the baseline architecture for full search:

$\text{Speedup}_{\text{full search}} = \dfrac{SR_H \times SR_V + 16}{SR_H \times SR_V + 8}$  (5)
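Plugging the stated search range into Eq. (5) reproduces the quoted figure:

```python
def speedup_full_search(sr_h, sr_v):
    """Eq. (5): both designs examine one block per cycle after the initial
    load, which takes 8 cycles (proposed) vs. 16 cycles (baseline)."""
    return (sr_h * sr_v + 16) / (sr_h * sr_v + 8)

print(round(speedup_full_search(64, 64), 3))  # 1.002, as reported
```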

5.1.2. Fast search

When implementing fast search algorithms, the uniform search path presented in Section 4 is employed for both architectures. We use a search range size of $SR_H = SR_V = 64$, with the horizontal and vertical search ranges set to $[-32, 32)$. For each video, the motion estimator performs motion estimation of each macroblock over 50 continuous frames. Table 4 shows the average number of search points per macroblock. Our results are consistent with those reported in [2–4]. As expected, the most static video, "Waterfall", has the minimum average number of search points, while the movement-intensive video "Football" has the maximum. For a given fast search algorithm, as the number of search points increases, more iterations are performed, which leads to larger runtime. Table 5 shows the performance comparison of the proposed architecture and the baseline architecture. Clearly, the proposed architecture largely

Table 3. Shift configuration and interconnect usage.

Algorithm   Shift by 1 pixel to direct neighbor   Shift by 2 pixels to secondary neighbor   Diagonal shift to diagonal neighbor
            (short interconnects)                 (long interconnects)                      (diagonal interconnects)
4SS         Yes                                   Yes                                       No
DS          Yes                                   Yes                                       Yes
HS          Yes                                   Yes                                       No


Table 4. Average number of search points per macroblock ($SR_H = SR_V = 64$).

Video sequences        4SS     DS      HS
Akiyo                  27.25   26.2    18.78
Bridge (close)         17.49   13.61   11.37
Bridge (far)           17.61   13.99   11.56
Bus                    23.34   22.21   16.56
Coastguard             19.68   17.47   13.68
Container              17.13   13.20   11.12
Flower                 19.69   16.81   13.37
Football               31.98   34.73   23.48
Foreman                19.67   16.82   13.12
Hall monitor           17.56   13.88   11.29
Highway                19.24   16.15   12.83
Mobile                 17.55   14.34   11.49
Mother and daughter    26.97   25.94   18.79
News                   17.21   13.32   11.20
Paris                  17.61   13.93   11.53
Silent                 17.79   14.14   11.63
Stefan                 20.10   17.54   13.87
Tempete                17.57   13.87   11.48
Waterfall              17.04   13.08   11.00

Table 5. Runtime (cycles) comparison of the proposed architecture and the baseline architecture ($SR_H = SR_V = 64$).

                       4SS                                DS                                 HS
Video sequences        Baseline    Proposed   Speedup     Baseline    Proposed   Speedup     Baseline    Proposed   Speedup
Akiyo                  1,659,476   908,938    1.8257      1,748,882   978,877    1.7866      1,563,670   899,345    1.7387
Bridge (close)         907,678     533,039    1.7028      929,650     516,749    1.7990      845,714     474,752    1.7814
Bridge (far)           925,406     541,903    1.7077      958,606     533,499    1.7968      866,422     487,918    1.7758
Bus                    1,397,524   777,962    1.7964      1,502,044   841,225    1.7855      1,350,467   773,984    1.7448
Coastguard             1,119,154   638,777    1.7520      1,174,460   654,449    1.7946      1,077,329   614,717    1.7526
Container              882,612     520,506    1.6957      904,296     502,523    1.7995      823,445     462,365    1.7809
Flower                 1,102,766   630,583    1.7488      1,148,152   641,132    1.7908      1,040,682   590,310    1.7629
Football               2,022,436   1,090,418  1.8547      2,333,892   1,310,450  1.7810      1,996,397   1,145,586  1.7427
Foreman                1,080,952   619,676    1.7444      1,158,958   648,156    1.7881      1,012,605   572,133    1.7699
Hall monitor           919,910     539,155    1.7062      948,806     527,701    1.7980      839,075     471,197    1.7807
Highway                1,055,064   606,732    1.7389      1,109,326   619,674    1.7902      987,499     558,703    1.7675
Mobile                 918,584     538,492    1.7058      991,068     552,887    1.7925      858,099     482,247    1.7794
Mother and daughter    1,629,070   893,735    1.8228      1,736,824   972,204    1.7865      1,561,094   896,384    1.7415
News                   889,522     523,961    1.6977      913,568     507,917    1.7987      830,452     466,228    1.7812
Paris                  923,060     540,730    1.7071      954,980     531,477    1.7968      862,535     485,105    1.7780
Silent                 931,564     544,982    1.7093      968,026     538,691    1.7970      869,576     488,335    1.7807
Stefan                 1,135,320   646,860    1.7551      1,191,250   665,018    1.7913      1,091,925   621,711    1.7563
Tempete                916,804     537,602    1.7054      948,974     527,848    1.7978      857,025     481,701    1.7792
Waterfall              875,162     516,781    1.6935      896,388     498,049    1.7998      812,234     455,642    1.7826


accelerates the motion estimation of every video sequence, achieving Speedups from 1.6935 to 1.8547. Assuming the two architectures run at the same clock frequency, the improvement in system throughput is proportional to the Speedup metric. Therefore, the proposed architecture can accelerate the motion estimation process and increase the system throughput by up to 85.47%.

5.2. Overhead analysis

The superior performance in terms of cycle count comes at the cost of resource overhead, which is mainly attributable to the super 2-D RAMA and the new 2-D PE array. To evaluate this overhead, we implement both architectures in Verilog HDL and analyze them with field programmable gate array (FPGA) and ASIC tools.

5.2.1. Super 2-D RAMA

The new memory architecture needs 32 RAMs, while the 2-D RAMA needs only 16, and each RAM of the super 2-D RAMA is half the size of a 2-D RAMA RAM. Since FPGAs use block RAMs to build user-specified memory, the super 2-D RAMA may need more FPGA block RAMs. For ASIC implementations, the memory depth is typically a power of two. Using the level C data reuse discussed in [14], we adopt a search range (SR) buffer size of $2^m$ bytes, where $m$ is the minimum integer that satisfies inequality (6). The Artisan memory compiler with the TSMC 180-nm technology library is used to create the RAMs; as there are 16 or 32 RAMs, each RAM is small, so we build them using register files. Table 6 shows the area comparison of the two memory architectures. The area overhead of the super 2-D RAMA is attributable to the extra address decoders, and it shrinks as the memory size increases. This overhead is the price of better accessibility. Fortunately, the memory architecture offers two potential benefits. First, the designer can save power by using an "enable" signal for each RAM: for one-row or one-column access, only 16 RAMs are enabled, while the other 16 are in the power saving state. Second, the superior data accessibility reduces the number of memory accesses compared to the 2-D RAMA, which in turn helps reduce the power consumption of the memory.

$\text{SR Buffer Size} = 2^m \geq (16 + SR_V) \times SR_H$  (6)
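The buffer sizes listed in Table 6 follow directly from inequality (6); a quick check:

```python
def sr_buffer_size(sr_h, sr_v):
    """Smallest power of two holding (16 + SR_V) * SR_H bytes, per Eq. (6)."""
    size = 1
    while size < (16 + sr_v) * sr_h:
        size *= 2
    return size

for sr_h, sr_v in [(32, 32), (64, 32), (64, 64), (128, 64)]:
    print(sr_h, sr_v, sr_buffer_size(sr_h, sr_v))
# Yields 2048, 4096, 8192 and 16,384 bytes, matching Table 6.
```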

5.2.2. 2-D PE array with a tuned interconnect and adder trees

The proposed 2-D PE array and adder trees, and the 2-D SAD tree [10] (including its adder trees and the current and reference pixel arrays), are synthesized using Xilinx ISE 10.1, with a Xilinx Virtex-5 (XC5VLX30) chosen for prototyping. Table 7 shows the resource comparison. The new architecture has the same number of registers as the 2-D SAD tree architecture, but it uses nearly twice the number of LUTs (look-up tables) because there is a 12-to-1 8-bit multiplexer for each PE. We also synthesized these architectures using Synopsys Design Compiler with the TSMC 180-nm technology library. Table 8 shows the area comparison. The combinational and total cell overheads are attributable to the area of the per-PE multiplexer. The routing area dominates the total area for both architectures, and the proposed architecture has a larger routing area because of the long and diagonal wires. Compared to the 2-D SAD tree architecture, our design increases the area by 66.75%.

This area overhead is a worst case scenario, because these architectures are implemented without any techniques dedicated to reducing the interconnects. In each cycle, only part of the interconnects carry data, so the interconnects can be reduced further during implementation. For example, one can use tri-state buses to connect the PEs, with each PE driving the buses through local tri-state gates. One can also use both the positive and negative clock edges for data transmission, or even double the clock frequency of each PE's input and output. Such techniques would undoubtedly make the area overhead of the 2-D PE array and adder trees more reasonable.

5.2.3. Overall overhead

Besides the SR buffer and the ME engine, the motion estimator contains several other components, as shown in Figs. 3, 5 and 7: the current MB buffer, the controller, the decision unit, and the SAD buffer. Among these, the controller has some area overhead relative to the baseline architecture, since it needs more states to manage the additional shift configurations. However, this has little effect on the overall overhead, because the controller is a small component built from a few state machines. The current MB buffer, the decision unit, and the SAD buffer are identical in both architectures and negligible in size. We therefore concentrate on the SR buffer and the ME engine in the overall overhead analysis for various search range sizes, shown in Table 9.

Compared to the baseline architecture, the super 2-D RAMA has a worst case area overhead of 43.99%, as shown in Table 6. In the context of the whole system, however, the super 2-D RAMA contributes little to the overall overhead, accounting for only 15.53% of the total area at the largest search range size we evaluated. The main area overhead comes from the 2-D PE array with the tuned interconnect.

5.3. Power reduction

To reduce power consumption, one can introduce a power saving state for the motion estimator. Since the new architecture greatly accelerates the motion estimation process, the motion estimator can stay in the power saving state

Table 6. ASIC area comparison of the super 2-D RAMA and the 2-D RAMA.

SR_H   SR_V   SR buffer size (bytes)   2-D RAMA [13] (µm²)   Proposed super 2-D RAMA (µm²)   Overhead (%)
32     32     2048                     278,185               400,548                         43.99
64     32     4096                     434,349               556,370                         28.09
64     64     8192                     757,560               868,698                         14.67
128    64     16,384                   1,397,762             1,515,120                       8.40

Table 7. 2-D PE array and adder trees FPGA resource statistics.

Resources   2-D SAD tree [10]   Proposed   Overhead (%)
Registers   4096                4096       0.00
LUTs        8262                16,454     99.15


Table 8. 2-D PE array and adder trees ASIC area statistics.

Area type          2-D SAD tree [10] (µm²)   Proposed (µm²)   Overhead (%)
Combinational      623,438                   952,237          52.74
Noncombinational   345,035                   315,722          N/A
Net interconnect   3,974,807                 6,974,742        75.47
Total cell         968,468                   1,268,040        30.93
Total              4,943,282                 8,242,703        66.75

Table 9. Total ASIC area statistics.

SR_H   SR_V   Baseline (µm²)   Proposed (µm²)   Percentage of memory area in proposed architecture (%)   Overhead (%)
32     32     5,221,467        8,643,251        4.63                                                     65.53
64     32     5,377,631        8,799,073        6.32                                                     63.62
64     64     5,700,842        9,111,401        9.53                                                     59.83
128    64     6,341,044        9,757,823        15.53                                                    53.88

Table 10. Normalized power reduction of the proposed architecture ($SR_H = SR_V = 64$).

Video sequences        4SS (%)   DS (%)   HS (%)
Akiyo                  12.46     10.54    8.07
Bridge (close)         6.14      11.16    10.28
Bridge (far)           6.41      11.05    9.99
Bus                    11.03     10.49    8.40
Coastguard             8.77      10.94    8.80
Container              5.74      11.18    10.26
Flower                 8.61      10.75    9.34
Football               13.83     10.26    8.29
Foreman                8.37      10.61    9.69
Hall monitor           6.32      11.11    10.24
Highway                8.09      10.72    9.57
Mobile                 6.30      10.84    10.18
Mother and daughter    12.31     10.53    8.23
News                   5.85      11.14    10.27
Paris                  6.37      11.05    10.11
Silent                 6.50      11.06    10.24
Stefan                 8.94      10.77    9.00
Tempete                6.28      11.10    10.17
Waterfall              5.62      11.20    10.34


more often. Our objective is to illustrate the potential power reduction of the proposed architecture with an analytical model. Since the dynamic power of Eq. (7) dominates the total power consumption in a highly parallel system under heavy load, we concentrate on dynamic power and ignore static power and the power consumed in the power saving state:

$P = \frac{1}{2} C_{load} V_{dd}^2 f$  (7)

$\text{Power Reduction} = 1 - \dfrac{1 + \text{Overhead}}{\text{Speedup}}$  (8)

In our model, we assume that the load capacitance $C_{load}$ is proportional to the ASIC area, so the normalized overall power reduction can be derived using Eq. (8). We use the overall area overhead from Table 9; when $SR_H = SR_V = 64$, the overhead of 59.83% leads to a rather pessimistic estimate. Table 10 shows the power reduction of the proposed architecture for the different video sequences; the normalized power reduction reaches up to 13.83%. In addition, one can scale down the clock frequency of the motion estimator to reduce power, and reducing the routing area as discussed in Section 5.2.2 would yield further power reduction.
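Eq. (8) combined with the Table 9 overhead reproduces the headline number. For example, with the 59.83% overhead of the $SR_H = SR_V = 64$ configuration and the best observed 4SS Speedup (Football, 1.8547):

```python
def power_reduction(overhead, speedup):
    """Eq. (8): dynamic power scales with area (via C_load) and with the
    fraction of time spent out of the power saving state (via 1/Speedup)."""
    return 1 - (1 + overhead) / speedup

print(round(100 * power_reduction(0.5983, 1.8547), 2))
# ~13.8%, matching the reported 13.83% to rounding of the inputs.
```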

6. Conclusion

This study reviews architectures for motion estimation in H.264/AVC. We build a baseline architecture for fast search algorithms and propose a parallel architecture that improves upon it, designing the memory architecture, the processing element architecture and the interconnect structure concurrently. To our knowledge, this is the first study to introduce a highly parallel architecture for the conventional fast search algorithms of motion estimation. The architecture employs the super 2-D RAMA, the 2-D PE array with tuned interconnects, and the adder tree structure. Owing to the accessibility of the memory architecture and the tuned interconnects of the 2-D PE array, the proposed architecture effectively enables on-chip data reuse, reduces the bubble cycles incurred when loading the reference pixels of the next search point, and supports the irregular search patterns of fast search algorithms. We demonstrate the superior performance of the new architecture over the baseline, discuss the design tradeoffs with a focus on resource overhead, and present an analytical study of power efficiency. The new architecture delivers up to 85.47% throughput improvement and up to 13.83% power reduction at the cost of a worst case area overhead of 65.53%.

References

[1] Rao KR, Hwang JJ. Techniques and standards for image, video, and audio coding. Upper Saddle River, NJ: Prentice Hall; 1996.
[2] Po L-M, Ma W-C. A novel four-step search algorithm for fast block motion estimation. IEEE Trans Circuits Syst Video Technol 1996;6(3):313–7.
[3] Tham JY, Ranganath S, Ranganath M, Kassim AA. A novel unrestricted center-biased diamond search algorithm for block motion estimation. IEEE Trans Circuits Syst Video Technol 1998;8(4):369–77.
[4] Zhu C, Lin X, Chau L-P. Hexagon-based search pattern for fast block motion estimation. IEEE Trans Circuits Syst Video Technol 2002;12(5):349–55.
[5] Joint Video Team. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Recommendation H.264 and ISO/IEC 14496-10 AVC; May 2003.
[6] Yang KM, Sun MT, Wu L. A family of VLSI designs for the motion compensation block-matching algorithm. IEEE Trans Circuits Syst 1989;36(10):1317–25.
[7] Lai YK, Chen LG. A data-interlacing architecture with two dimensional data-reuse for full-search block-matching algorithm. IEEE Trans Circuits Syst Video Technol 1998;8(2):124–7.
[8] Yeo H, Hu YH. A novel modular systolic array architecture for full-search block matching motion estimation. IEEE Trans Circuits Syst Video Technol 1995;5(5):407–16.
[9] Verma R, Akoglu A. A coarse grained reconfigurable architecture for variable block size motion estimation. In: Proceedings of IEEE international conference on field-programmable technology; 2007. p. 81–8.
[10] Chen C-Y, Chien S-Y, Huang Y-W, Chen T-C, Wang T-C, Chen L-G. Analysis and architecture design of variable block-size motion estimation for H.264/AVC. IEEE Trans Circuits Syst I Reg Papers 2006;53(2):578–93.
[11] Chen T-C, Chien S-Y, Huang Y-W, Tsai C-H, Chen C-Y, Chen T-W, et al. Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder. IEEE Trans Circuits Syst Video Technol 2006;16(6):673–88.
[12] Chen T-C, Fang H-C, Lian C-J, Tsai C-H, Huang Y-W, Chen T-W, et al. IEEE Circuits and Devices Magazine 2006;22(3):22–31.
[13] Chen T-C, Chen Y-H, Tsai S-F, Chien S-Y, Chen L-G. Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC. IEEE Trans Circuits Syst Video Technol 2007;17(5):568–77.
[14] Tuan J-C, Chang T-S, Jen C-W. On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture. IEEE Trans Circuits Syst Video Technol 2002;12(1):61–72.
[15] Chen T-C, Tsai C-Y, Huang Y-W, Chen L-G. Single reference frame multiple current macroblocks scheme for multiple reference frame motion estimation in H.264/AVC. IEEE Trans Circuits Syst Video Technol 2007;17(2):242–7.
[16] Hsia S-C. Efficient memory IP design for HDTV coding applications. IEEE Trans Circuits Syst Video Technol 2003;13(6):465–71.
[17] Miyakoshi J, Murachi Y, Hamamoto M, Iinuma T, Ishihara T, Kawaguchi H, et al. A power- and area-efficient SRAM core architecture for super-parallel video processing. In: Proceedings of IFIP international conference on VLSI; 2006. p. 192–7.
[18] Murachi Y, Kamino T, Miyakoshi J, Kawaguchi H, Yoshimoto M. A power-efficient SRAM core architecture with segmentation-free and rectangular accessibility for super-parallel video processing. In: Proceedings of IEEE international symposium on VLSI DAT; 2008. p. 63–6.
[19] Lee JH, Yoo K. Multi-algorithm targeted low memory bandwidth architecture for H.264/AVC integer-pel motion estimation. In: Proceedings of IEEE international conference on multimedia expo; 2008. p. 701–4.
[20] Cai L, Gajski D. Transaction level modeling: an overview. In: Proceedings of first IEEE/ACM/IFIP international conference on hardware/software codesign and system synthesis; 2003. p. 19–24.
[21] SystemC (Event-driven System Modeling Kernel in C++); 2010. <www.systemc.org>.
[22] Video Sequences; 2010. <http://trace.eas.asu.edu/yuv/>.

Xuanxing Xiong is a Ph.D. student in the Department of Electrical and Computer Engineering at the University of Arizona. He received the B.E. degree in electronic science and technology from the University of Electronic Science and Technology of China (UESTC), Chengdu, PR China, in 2005. His research interest is in designing reconfigurable and parallel architectures for image and video processing.

Yang Song is a Ph.D. student in the Department of Electrical and Computer Engineering at the University of Arizona. He received his B.S. and M.S. in EE from Nanjing University of Science and Technology, PR China, in 2005 and 2007, respectively. His current research focuses on application specific programmable architecture design for video processing algorithms.

Ali Akoglu is an assistant professor in the Department of Electrical and Computer Engineering at the University of Arizona and the director of the Reconfigurable Computing Lab. He received his Ph.D. in Computer Science from Arizona State University in 2005. His research interests cover reconfigurable computing, FPGA CAD tools, and application specific instruction set processor design.