
A SURVEY OF ALGORITHMS AND ARCHITECTURES

FOR H.264 SUB-PIXEL MOTION ESTIMATION*

MOHAMMAD REZA HOSSEINY FATEMI

Faculty of Computer Science and Information Technology,

University of Malaya, 50603 Kuala Lumpur, Malaysia

[email protected]

HASAN F. ATES

Department of Electronics Engineering,

Isik University, Sile 34980 Istanbul, Turkey

[email protected]

ROSLI SALLEH

Faculty of Computer Science and Information Technology,

University of Malaya, 50603 Kuala Lumpur, Malaysia

[email protected]

Received 20 September 2010

Accepted 12 August 2011

This paper reviews recent state-of-the-art H.264 sub-pixel motion estimation (SME) algorithms and architectures. First, H.264 SME is analyzed and the impact of its functionalities on coding performance is investigated. Then, the design space of SME algorithms is explored, covering design problems, approaches, and recent advanced algorithms. Design challenges and strategies of SME hardware architectures are also discussed and promising architectures are surveyed. Further perspectives and future prospects are presented to highlight emerging trends and the outlook of SME designs.

Keywords: Video compression; sub-pixel motion estimation algorithms and architectures; H.264 standard; survey.

1. Introduction

Video coding standards make it possible to store and transmit high-volume raw digital video data in environments with limited storage and transmission bandwidth. The latest block-based video coding standard, H.264,1,2 provides much better compression efficiency and subjective video quality compared with all previous standards

*This paper was recommended by Regional Editor Piero Malcovati.

Journal of Circuits, Systems, and Computers

Vol. 21, No. 3 (2012) 1250016 (22 pages)

© World Scientific Publishing Company

DOI: 10.1142/S0218126612500168


such as MPEG-2,3 H.263,4 and MPEG-4.5 For instance, compared with MPEG-4, H.263, and MPEG-2, H.264 can save 39%, 49%, and 64% of bit rate on average, respectively.6

Higher compression performance and subjective quality in H.264 are achieved by using new functionalities such as motion estimation with variable block-size (VBS) and quarter-pixel accuracy, multiple reference frames (MRF),7 Lagrangian mode decision,8,9 advanced entropy coding,10 and so on.

The heart of the H.264 encoder is motion estimation (ME), which removes temporal redundancy in a sequence of images. In H.264, ME is conducted in two stages. In the first stage, integer motion estimation with different block sizes (16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4) is performed and up to 41 integer motion vectors (IMVs) of all blocks and sub-blocks are determined, whereas up to five reference frames can be searched. Then, SME is started, which is usually conducted in two steps, as shown in Fig. 1. In the first step, half-pixel refinement is performed by searching eight half-pixel positions around the best integer MV. The quarter-pixel refinement is then carried out in the same manner around the best half-pixel position. Finally, mode selection is conducted.
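To make the two-step refinement concrete, the following sketch (in Python, with a hypothetical cost() callback standing in for the SAD/SATD-plus-rate term) searches the eight half-pixel neighbours of the best integer MV and then the eight quarter-pixel neighbours of the best half-pixel position. It is a minimal illustration of the flow in Fig. 1, not the reference-software implementation.

```python
# Minimal sketch of H.264-style two-step sub-pixel refinement.
# Motion vectors are kept in quarter-pixel units, so one integer pixel = 4.
NEIGHBOURS = [(-1, -1), (0, -1), (1, -1), (-1, 0),
              (1, 0), (-1, 1), (0, 1), (1, 1)]

def refine_subpel(best_int_mv, cost):
    """best_int_mv: (x, y) in quarter-pel units; cost(mv) -> matching cost."""
    best_mv, best_cost = best_int_mv, cost(best_int_mv)
    for step in (2, 1):                      # 2 = half-pel pass, 1 = quarter-pel pass
        centre = best_mv
        for dx, dy in NEIGHBOURS:            # eight positions around the current centre
            mv = (centre[0] + dx * step, centre[1] + dy * step)
            c = cost(mv)
            if c < best_cost:
                best_mv, best_cost = mv, c
    return best_mv, best_cost

# Toy usage with an artificial cost surface whose minimum lies at (3, -1).
print(refine_subpel((0, 0), lambda mv: (mv[0] - 3) ** 2 + (mv[1] + 1) ** 2))
```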

"The H.264 SME typically improves coding efficiency between 1 and 4.2 dB in peak signal-to-noise ratio (PSNR) or leads to 19%-50% bit rate reduction".11 However, due to the use of VBS, MRF, the interpolation scheme for producing sub-pixel values, the two sub-pixel search steps, and the matching process, the computational budget and memory access requirement of quarter-pixel accurate ME are highly intensive. Therefore, not only are efficient new fast algorithms demanded to reduce the huge computation and memory access bandwidth of the H.264 SME, but also dedicated hardware architectures are required to accelerate the SME process for real-time applications.

Fig. 1. SME refinements in H.264 (integer-, half-, and quarter-pixel positions).



There exist a few effective surveys on algorithms and architectures for video compression techniques, i.e., Refs. 12-16. These studies mainly focus on former standards such as MPEG-2 and MPEG-4 and do not cover the new functionalities introduced later in H.264, such as SME with VBS, MRF, Lagrangian mode decision, and so forth. Motivated by this gap, this paper tries to provide a comprehensive survey of the most significant SME algorithms and architectures, address the SME design challenges and strategies at the algorithmic and architectural levels, and highlight future research directions of SME design.

The rest of this paper is organized as follows. Section 2 analyzes the H.264 SME and shows the impact of its features on coding performance. Section 3 investigates the design space of SME algorithms, reviewing recent approaches and the concepts behind the state-of-the-art SME algorithms. Section 4 explores the design challenges and choices of SME architectures and surveys advanced SME architectures. Finally, Sec. 5 concludes with emerging trends and perspectives in the research on SME algorithms and architectures and its future prospects.

2. Analysis of the H.264 SME

Design of a new and efficient SME algorithm or hardware architecture relies on a careful analysis and an in-depth understanding of the SME features and characteristics. A thorough understanding of the SME features is essential for providing insight into the design parameters (such as performance, subjective quality, and complexity) and into how to make an optimal tradeoff between them. In this section, we evaluate the performance of the H.264 SME algorithm and investigate the effect of its features on coding performance through several experiments.

Different technical features are used in the H.264 SME, such as VBS, quarter-pixel resolution, and MRF, which improve the coding performance of the encoder at the price of increased computation and memory bandwidth. To evaluate the impact of different SME features on coding performance, we perform several experiments. We use a test set consisting of several common intermediate format (CIF), Quarter CIF (QCIF), 4CIF, and HD1080 image sequences with the following conditions:

(i) Search ranges for QCIF, CIF, 4CIF, and HD1080 videos are 16, 16, 32, and 64, respectively.
(ii) Rate-distortion (R-D) optimization is on.
(iii) The quantization parameter is set to 24, 28, 32, and 36.
(iv) The entropy coding method is CAVLC.
(v) The frame rate of the encoded bit stream is 30 fps.
(vi) The numbers of coded pictures for QCIF, CIF, 4CIF, and HD1080 sequences are 300, 300, 220, and 110, respectively, and the H.264 reference software (version 11)17 is used for the experiments.


Tables 1-4 show the gain in performance when more features are used in the encoder for QCIF, CIF, 4CIF, and HD1080 videos, respectively, for the following cases:

(a) Integer motion estimation (IME) with one reference frame, SAD as the distortion criterion, and only 16×16 blocks.
(b) Increasing motion vector accuracy to quarter-pixel.
(c) Allowing all VBS except blocks below 8×8.
(d) Allowing all VBS except 4×4 blocks.
(e) Using all VBS.
(f) Increasing the number of reference frames to two.
(g) Using three reference frames.
(h) Using five reference frames.
(i) Changing the distortion criterion (i.e., replacing the sum of absolute differences (SAD) with the sum of absolute transformed differences (SATD)).

In Tables 1-4, the average bit rate difference percentage (ΔBR (%)) and the PSNR difference in dB (ΔPSNR (dB)) between two cases are calculated as described in Ref. 18, where case (a) is the reference for the other cases (i.e., cases (b) to (i)). From the simulation results, we can see that when a new feature is added, in most cases a better performance is achieved at the price of extra computation, while the gain in performance is not equal for all image sequences and depends on video content as well as frame size. From the simulation results, the following observations can be made.
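The ΔBR and ΔPSNR values follow the Bjontegaard method of Ref. 18, which fits a third-order polynomial to each rate-distortion curve and integrates the difference over the overlapping range. The sketch below is one common way to implement that calculation in Python/NumPy; the four (bit rate, PSNR) points per curve are placeholders, not the measured data of this paper.

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average PSNR gain (dB) of 'test' over 'ref' (Bjontegaard-style)."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)      # PSNR as a cubic in log-rate
    p_test = np.polyfit(lr_test, psnr_test, 3)
    lo, hi = max(lr_ref.min(), lr_test.min()), min(lr_ref.max(), lr_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (int_test - int_ref) / (hi - lo)

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average bit-rate difference (%) of 'test' over 'ref' at equal quality."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)      # log-rate as a cubic in PSNR
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo, hi = max(psnr_ref.min(), psnr_test.min()), min(psnr_ref.max(), psnr_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (10 ** ((int_test - int_ref) / (hi - lo)) - 1) * 100

# Placeholder R-D points (e.g. QP 24, 28, 32, 36); not the paper's data.
rate_a = np.array([1200.0, 800.0, 500.0, 300.0]); psnr_a = np.array([38.0, 36.0, 34.0, 32.0])
rate_b = np.array([900.0, 600.0, 380.0, 230.0]);  psnr_b = np.array([38.5, 36.6, 34.7, 32.8])
print(bd_psnr(rate_a, psnr_a, rate_b, psnr_b), bd_rate(rate_a, psnr_a, rate_b, psnr_b))
```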

The quarter-pixel motion estimation (QPME) usually provides a considerable improvement over the IME in terms of bit rate saving and video quality.

As for VBS, the more block sizes are used, the better the achieved performance. However, as seen in Tables 1-4, eliminating the 4×4 block has a negligible effect on coding performance while it leads to a large reduction in computation. From the hardware point of view, it can save many clock cycles in the SME process. By using different VBS, a good tradeoff between the video quality and the required clock cycles can be achieved.

Table 1. Effectiveness of the H.264 SME features on coding performance for QCIF videos.

Case | Container ΔBR (%) / ΔPSNR (dB) | Foreman ΔBR (%) / ΔPSNR (dB) | News ΔBR (%) / ΔPSNR (dB)
(a)  | 0 / 0            | 0 / 0            | 0 / 0
(b)  | -35.20 / 1.654   | -37.54 / 2.009   | -20.39 / 1.339
(c)  | -41.93 / 2.043   | -47.51 / 2.788   | -28.54 / 2.014
(d)  | -42.88 / 2.106   | -48.54 / 2.946   | -31.00 / 2.258
(e)  | -43.20 / 2.113   | -48.78 / 2.970   | -31.23 / 2.283
(f)  | -43.31 / 2.132   | -49.92 / 3.188   | -31.32 / 2.302
(g)  | -45.96 / 2.276   | -49.78 / 3.239   | -31.62 / 2.324
(h)  | -46.22 / 2.311   | -50.75 / 3.334   | -31.74 / 2.341
(i)  | -46.76 / 2.355   | -51.01 / 3.385   | -32.41 / 2.422




Table 2. Impact of the H.264 SME features on coding performance for CIF videos.

Case | Flower ΔBR (%) / ΔPSNR (dB) | Mobile ΔBR (%) / ΔPSNR (dB) | Mother-daughter ΔBR (%) / ΔPSNR (dB)
(a)  | 0 / 0            | 0 / 0            | 0 / 0
(b)  | -47.27 / 2.523   | -65.30 / 3.503   | -39.69 / 1.958
(c)  | -56.72 / 3.303   | -69.19 / 3.920   | -44.99 / 2.317
(d)  | -58.20 / 3.479   | -70.06 / 4.038   | -45.73 / 2.381
(e)  | -58.21 / 3.484   | -69.99 / 4.043   | -45.78 / 2.386
(f)  | -59.04 / 3.610   | -73.86 / 4.657   | -46.42 / 2.449
(g)  | -60.00 / 3.724   | -74.84 / 4.910   | -46.55 / 2.463
(h)  | -60.87 / 3.845   | -75.47 / 5.113   | -46.96 / 2.498
(i)  | -61.45 / 3.927   | -76.39 / 5.229   | -47.83 / 2.569

Table 3. Effectiveness of the H.264 SME features on coding performance for 4CIF videos.

Case | City ΔBR (%) / ΔPSNR (dB) | Crew ΔBR (%) / ΔPSNR (dB) | Soccer ΔBR (%) / ΔPSNR (dB)
(a)  | 0 / 0            | 0 / 0            | 0 / 0
(b)  | -57.74 / 3.703   | -18.01 / 1.133   | -36.39 / 1.606
(c)  | -62.62 / 4.266   | -25.96 / 1.770   | -41.54 / 1.900
(d)  | -63.24 / 4.355   | -26.59 / 1.829   | -42.31 / 1.949
(e)  | -63.29 / 4.356   | -26.55 / 1.827   | -42.38 / 1.955
(f)  | -65.37 / 4.832   | -30.28 / 2.170   | -42.90 / 1.998
(g)  | -65.19 / 4.799   | -30.89 / 2.235   | -43.33 / 2.029
(h)  | -65.59 / 4.880   | -31.81 / 2.328   | -43.83 / 2.061
(i)  | -65.56 / 4.933   | -32.70 / 2.438   | -44.05 / 2.082

Table 4. Impact of the H.264 SME features on coding performance for HD1080 videos.

Case | Blue sky ΔBR (%) / ΔPSNR (dB) | Rush hour ΔBR (%) / ΔPSNR (dB) | Station ΔBR (%) / ΔPSNR (dB)
(a)  | 0 / 0            | 0 / 0            | 0 / 0
(b)  | -18.43 / 0.904   | -11.72 / 0.340   | -36.28 / 1.385
(c)  | -23.38 / 1.207   | -20.33 / 0.623   | -40.71 / 1.636
(d)  | -24.20 / 1.231   | -20.65 / 0.636   | -40.52 / 1.622
(e)  | -24.31 / 1.239   | -20.60 / 0.634   | -40.74 / 1.642
(f)  | -23.11 / 1.176   | -23.53 / 0.743   | -39.50 / 1.572
(g)  | -23.23 / 1.182   | -23.96 / 0.762   | -39.29 / 1.565
(h)  | -23.56 / 1.201   | -25.45 / 0.819   | -40.08 / 1.625
(i)  | -23.84 / 1.223   | -25.20 / 0.808   | -37.87 / 1.500


On the subject of MRF motion estimation, using two reference frames usually brings a sensible coding performance gain compared with one reference frame. However, increasing the number of reference frames is not always advantageous. As shown in Tables 1-4, when the number of reference frames is increased to three or five, the additional coding gain becomes smaller, while the computational complexity and memory bandwidth grow in proportion to the number of searched reference frames. As a useful guideline for designers, it is worth mentioning that more than 80% of the selected best reference frames belong to the first two nearest reference frames.19

Regarding the distortion cost, SATD usually yields slightly higher video quality than SAD at the price of increased computation. Moreover, from the hardware point of view, SATD requires a much higher hardware cost than SAD. In summary, SME hardware/software designers can make a good tradeoff between coding performance and computational complexity by using different SME functionalities and considering the aforementioned observations.
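To illustrate the difference between the two distortion criteria, the sketch below computes SAD and a 4×4 Hadamard-based SATD for a residual block in Python/NumPy. The Hadamard matrix and the divide-by-two normalization follow common reference-style implementations; this is an illustrative sketch, not the exact cost function of the reference software.

```python
import numpy as np

H4 = np.array([[1,  1,  1,  1],
               [1,  1, -1, -1],
               [1, -1, -1,  1],
               [1, -1,  1, -1]])            # 4x4 Hadamard matrix (unnormalized)

def sad(cur, ref):
    """Sum of absolute differences of two equally sized blocks."""
    return int(np.abs(cur.astype(int) - ref.astype(int)).sum())

def satd4x4(cur, ref):
    """Sum of absolute transformed differences of a 4x4 block."""
    diff = cur.astype(int) - ref.astype(int)
    t = H4 @ diff @ H4.T                    # 2-D Hadamard transform of the residual
    return int(np.abs(t).sum() // 2)        # common normalization by 2

cur = np.random.randint(0, 256, (4, 4))
ref = np.random.randint(0, 256, (4, 4))
print(sad(cur, ref), satd4x4(cur, ref))
```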

3. Investigation of Algorithms

This section investigates the design space of SME algorithms, reviews the design challenges of SME algorithms and the possible choices to cope with them, and surveys the state-of-the-art SME algorithms.

To design a new and efficient algorithm, designers should take several design issues into consideration. The new algorithm should have lower computational complexity and memory access than the H.264 SME. In addition, it should achieve coding performance close to that of the H.264 SME algorithm. Moreover, it should be a hardware-friendly algorithm for hardwired applications; this issue is discussed in the next section. Designers should carefully consider all the functions in the SME algorithm and the impact of each function on coding performance and computational complexity to identify the main bottlenecks and possible solutions. Although the analysis in the previous section covers some of the aforementioned issues, the remaining SME bottlenecks and solutions are discussed in the following paragraphs.

The VBS feature is one of the main contributors to computation and memory bandwidth. Since in the H.264 SME each block has its own motion vector, each block can be processed separately with multiple reference frames, which increases computational complexity, memory bandwidth, and processing time. A simple but effective way to lower computation, memory access, and processing time is to reduce the number of block sizes before or during the SME stage. In Ref. 20, Song et al. directly remove all block modes below 8×8, which leads to a 42.9% reduction in computational complexity at the price of a small quality drop. However, this technique is not suitable for low or medium frame sizes, where it results in more video quality degradation. In Refs. 21 and 22, different mode filtering methods are introduced that select only some of the best modes for SME refinement, as sketched below. As a result, a lot of computation, memory access, and processing time is saved at the price of a small video quality drop.
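The following sketch illustrates the general idea behind such mode filtering: rank the inter modes by their integer-stage cost and pass only the best N to the sub-pixel refinement. The mode list and cost values are hypothetical; the actual selection rules of Refs. 21 and 22 differ in detail.

```python
# Keep only the N cheapest inter modes (by IME cost) for sub-pixel refinement.
def filter_modes(ime_cost, n_best=3):
    """ime_cost: dict mode -> integer-pel cost; returns the n_best modes."""
    return sorted(ime_cost, key=ime_cost.get)[:n_best]

# Hypothetical IME costs for one macroblock (arbitrary numbers).
ime_cost = {"16x16": 5400, "16x8": 5100, "8x16": 5650,
            "8x8": 4950, "8x4": 5300, "4x8": 5500, "4x4": 5800}
print(filter_modes(ime_cost))   # e.g. ['8x8', '16x8', '16x16']
```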


Interpolation is a computationally intensive process in SME, which requires a lot of memory access as well as computation to produce half- and quarter-pixels. Matching-error surface modeling provides a good way to reduce the computation and memory bandwidth of the interpolation process. "It is generally believed that the fast motion estimation algorithm works best if the error surface inside the search window is unimodal. As shown in Fig. 2, the error surface of the integer pixel motion estimation is not unimodal due to the large search window and complexity of video content. Therefore, the integer motion estimation search would easily be trapped into a local minimum. On the other hand, since the sub-pixels are generated from the interpolation of integer pixels, the correlation inside a sub-pixel search window is much higher than that of the integer pixel search window. Thus, the unimodal error surface will be valid in most cases of the sub-pixel. As a result, the matching error decreases monotonically as the search point moves closer to the global minimum."23 Under the unimodality assumption of the error surface at sub-pixel resolution, the matching costs at sub-pixel precision can be calculated directly from the integer-pixel costs. Accordingly, the interpolation process is avoided and the computation and memory requirements are considerably reduced. Several models of the error surface function have been developed with different levels of accuracy and complexity. In Ref. 24, five mathematical models of the error surface are proposed with distinct performances. Among these models, the first and second, which are based on an inverse matrix solution, have better coding performance and are more suitable for hardware implementation. However, the above-mentioned work only supports half-pixel accuracy. To support quarter-pixel precision, an algorithm is proposed in Ref. 25. It shows that the unimodal assumption does not necessarily hold at all sub-pixel resolutions. That is why the resulting performance of the quarter-pixel interpolation-free algorithm is lower than that of the full search SME.

Fig. 2. Error surface of (a) integer motion estimation (search range: 32) and (b) sub-pixel motion estimation with 1/8-pixel accuracy (Ref. 23).


To compensate for the quality drop, the authors introduce the "complete-system model", which provides near full-search SME performance at the price of increased complexity. Note that since interpolation-free algorithms avoid the interpolation process, they are suitable for low-complexity video encoders as well as low-power applications such as portable devices. However, they are not able to produce residue values and therefore still need an interpolation unit for motion compensation.
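A minimal example of the interpolation-free idea is the separable parabolic model: fit a parabola through the integer-pixel costs on each axis and read off the sub-pixel offset of its minimum. The sketch below shows that model only; the models of Refs. 24 and 25 are more elaborate.

```python
def parabolic_offset(c_m1, c_0, c_p1):
    """Sub-pixel offset of the cost minimum from three integer-pel costs
    measured at offsets -1, 0 and +1. Result is clamped to [-0.5, +0.5]."""
    denom = c_m1 + c_p1 - 2 * c_0
    if denom <= 0:                 # flat or non-convex: keep the integer position
        return 0.0
    off = (c_m1 - c_p1) / (2.0 * denom)
    return max(-0.5, min(0.5, off))

# Example: SAD costs of the best integer MV and its horizontal/vertical neighbours.
dx = parabolic_offset(1200, 1000, 1100)   # horizontal sub-pel offset
dy = parabolic_offset(1300, 1000, 1050)   # vertical sub-pel offset
print(dx, dy)   # fractional MV components estimated without any interpolation
```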

Another way for reducing the computation and memory requirement of the interpolation process is to use an interpolation filter with a reduced complexity. H.264 uses a 6-tap finite impulse response (FIR) filter with coefficients (1/32, -5/32, 20/32, 20/32, -5/32, 1/32) and a 2-tap bilinear filter with coefficients (1/2, 1/2) for producing half-pixel and quarter-pixel positions, respectively. On the other hand, the computation cost and memory requirement of a filter are proportional to its tap length. Accordingly, the half-pixel interpolation is the bottleneck of the interpolation process in the H.264 SME. To address this problem, a 4-tap filter with coefficients (-1/8, 5/8, 5/8, -1/8) is proposed in Ref. 26. Therefore, the computation and memory bandwidth are decreased compared with the standard 6-tap filter.
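For reference, the sketch below applies the standard H.264 6-tap filter to compute a horizontal half-pixel sample, using only shifts and additions for the ×20 and ×5 weights (20x = (x<<4)+(x<<2), 5x = (x<<2)+x). The rounding offset and final clipping follow the usual 8-bit formulation; treat it as an illustrative sketch rather than a drop-in replacement for a conforming interpolator.

```python
def clip255(v):
    return 0 if v < 0 else (255 if v > 255 else v)

def half_pel_h(row, x):
    """Half-pixel sample between integer pixels row[x] and row[x+1]
    using the H.264 6-tap filter (1, -5, 20, 20, -5, 1)/32."""
    e, f, g, h, i, j = row[x - 2:x + 4]
    # multiplication-free weights: 20*g = (g<<4)+(g<<2), 5*f = (f<<2)+f
    acc = e + j - ((f << 2) + f) - ((i << 2) + i) \
          + (g << 4) + (g << 2) + (h << 4) + (h << 2)
    return clip255((acc + 16) >> 5)        # round, scale by 1/32, clip to 8 bits

row = [10, 20, 40, 80, 120, 160, 200, 220]
print(half_pel_h(row, 3))                  # half-pel value between row[3] and row[4]
```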

In addition to the tap length of an interpolation filter, the suitability of its coefficients for hardware/software implementation is another important issue. As seen above, some of the coefficients of the H.264 6-tap FIR filter and of the filter proposed in Ref. 26 are not powers of two. As a result, they need multiplication, addition, subtraction, and shift operations. Among these operations, multiplication is the most expensive one and demands the most implementation overhead. To cater to this issue, a multiplication-free filter with 6-tap length is proposed in Ref. 27. In a later work,28 both the tap-length and the multiplication issues are considered by introducing a multiplication-free 4-tap filter, whose half-pixel interpolation weights are 1/2, 1/8, 1/32, and -1/32. As a result, this filter consists only of additions and shifts, without multiplications, so the computation and memory bandwidth of the half-pixel interpolation are decreased.

Due to the two iterative searches for the half-pixel and quarter-pixel refinements in the H.264 SME, which are usually carried out sequentially, the processing time is increased. From the hardware point of view, this poses a restriction on the clock cycles available for high-resolution video such as HDTV and larger specifications. One solution to this problem is to evaluate all 25 half- and quarter-pixel positions in parallel, as in Refs. 29 and 30. Compared with the conventional two-iteration scheme, the cycle count is halved, so the long computation latency is reduced and the HD-size requirement can be met. However, from the hardware viewpoint, this approach is not well suited to hardware implementation because of its large silicon area requirement. To overcome this problem, a single-iteration algorithm is proposed in Ref. 31. This algorithm predicts the sub-pixel motion vector (SMV) of the current block using the median motion vector of its neighboring blocks.


Then, it examines six positions: (0, 0), the predicted SMV, and four diamond points around the predicted SMV, as shown in Fig. 3. In this way, not only is the long latency of the two-iteration search halved, but the number of search points is also reduced to six, which leads to computation and memory bandwidth reduction. Compared with the interpolate-and-search algorithm adopted in the reference software, the single-iteration algorithm shows a small video quality degradation. Reference 32 adopts the single-iteration technique to reduce the long processing time of high-resolution specifications as well as the computation and memory bandwidth, with small video degradation. In Ref. 33, the authors present single-pass fractional motion estimation, which improves coding performance relative to the single-iteration SME algorithm by searching four additional positions around (0, 0) in the diagonal directions.
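The sketch below captures the single-iteration idea in a few lines of Python: predict the fractional MV from the median of the neighbouring blocks' MVs and evaluate only (0, 0), the prediction, and its four diamond neighbours. The cost() callback and quarter-pel units are assumptions for illustration; the candidate set of Ref. 31 is followed only loosely.

```python
def median_mv(mv_a, mv_b, mv_c):
    """Component-wise median of the left, top and top-right neighbour MVs."""
    med = lambda a, b, c: sorted((a, b, c))[1]
    return (med(mv_a[0], mv_b[0], mv_c[0]), med(mv_a[1], mv_b[1], mv_c[1]))

def single_iteration_sme(pred_smv, cost):
    """Evaluate (0,0), the predicted SMV and its four diamond neighbours
    (all in quarter-pel units) and return the cheapest candidate."""
    px, py = pred_smv
    candidates = {(0, 0), (px, py),
                  (px + 1, py), (px - 1, py),
                  (px, py + 1), (px, py - 1)}
    return min(candidates, key=cost)

# Hypothetical neighbour fractional MVs (quarter-pel) and a toy cost function.
pred = median_mv((2, 0), (1, 1), (3, 0))
print(pred, single_iteration_sme(pred, cost=lambda mv: abs(mv[0] - 2) + abs(mv[1])))
```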

Bit-width reduction is another option for complexity and memory bandwidth saving. The main idea is to represent each pixel with a lower number of bits (usually one or two) using a bit transformation, so that SME can operate on the bit-transformed data and the computation and memory access are reduced. In Ref. 34, Akbulut et al. propose the first bit-transformed SME. After interpolating the reference frame at 8 bits/pixel, they use a 1-bit transform to represent each pixel with a single bit, and the SME refinement is then carried out as usual. However, they still use 8-bit data in the interpolation process, and their method adds extra complexity because the 8-bit data are converted to 1-bit samples using a multi-band-pass filter. Motivated by this drawback, the all-binary SME is introduced in Ref. 35. To reduce the computation of the interpolation, the 8-bit pixel values are transformed before interpolation. Furthermore, by replacing the multi-band-pass filter with the multiplication-free one-bit transform described in Ref. 36, further computation can be saved.

Fig. 3. The search pattern of the single-iteration algorithm, which includes (0, 0), the predicted SMV, and four diamond positions.


Although the above-mentioned algorithms lead to a significant reduction in computational complexity and memory bandwidth, which makes them suitable for real-time low-power video coding systems, they also result in some video quality loss. Last but not least, it should be noted that for motion compensation the residue data must be converted back to the pixel domain, which adds computation.
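As a rough illustration of the bit-width reduction idea, the sketch below binarizes a frame against a local mean and computes the matching cost as the number of differing bits (XOR plus population count). The local-mean threshold is a simplification assumed here for illustration; Refs. 34-36 use specific multi-band-pass or multiplication-free transform kernels.

```python
import numpy as np

def one_bit_transform(frame, k=8):
    """Binarize: 1 where the pixel exceeds the local k x k mean, else 0 (illustrative)."""
    frame = frame.astype(np.float32)
    pad = k // 2
    padded = np.pad(frame, pad, mode="edge")
    local_mean = np.zeros_like(frame)
    h, w = frame.shape
    for y in range(h):                       # straightforward (slow) box filter
        for x in range(w):
            local_mean[y, x] = padded[y:y + k, x:x + k].mean()
    return (frame > local_mean).astype(np.uint8)

def binary_block_cost(cur_bits, ref_bits):
    """Number of mismatching binary pixels (replaces SAD on 8-bit data)."""
    return int(np.count_nonzero(cur_bits ^ ref_bits))

cur = one_bit_transform(np.random.randint(0, 256, (16, 16)))
ref = one_bit_transform(np.random.randint(0, 256, (16, 16)))
print(binary_block_cost(cur, ref))
```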

4. Exploration of SME Architectures

In this section, the design challenges and strategies of the H.264 SME are discussed, and representative SME hardware architectures are surveyed.

4.1. Hardware architecture design: Challenges and strategies

Due to the huge computation and memory access requirements of the H.264 SME, accelerating the SME process with dedicated hardware architectures is of vital importance for real-time applications. However, the design of such architectures is challenging. The reason is that, besides the extraordinarily large computation and memory bandwidth, the SME algorithm in H.264 was developed without architectural considerations. In the H.264 reference software, the SME follows a sequential flow with a great deal of data dependency. From the hardware point of view, sequential processing restricts parallel processing, which in turn reduces the achievable throughput. Moreover, the data dependency increases the memory requirement at the price of added storage space, area cost, and power consumption. Therefore, the data flow of the full search SME algorithm should be carefully reordered and optimized to permit concurrent operations with reduced data dependency, which in turn enables parallel processing under the restrictions of the sequential SME procedure. An example of an analysis of the H.264 SME flow can be found in Ref. 37, where the authors simplify its complex encoding procedure into several encoding loops. In addition, they propose two decomposition techniques to parallelize the SME algorithm while maintaining high hardware utilization and achieving data reuse.

The main building blocks of the H.264 SME are the interpolation unit, the processing unit (responsible for search and cost calculation), and the mode decision unit. Among them, the interpolation unit is computationally intensive, while the processing and mode decision units demand less computation. Therefore, hardware architecture design is performed with particular emphasis on the interpolation unit as the most computationally intensive unit. Nevertheless, the design of efficient hardware architectures for the less demanding units is still important, as they can affect the overall performance of the design.

The main design challenges for the interpolation unit are hardware utilization and the huge memory bandwidth that arise from the VBS feature and from the 6-tap FIR filter used for generating half-pixels.


Due to the seven different block sizes, the interpolation unit should be carefully designed to achieve maximum hardware utilization. The common solution is to use a 4×4 interpolation unit: as all bigger blocks can be decomposed into 4×4 blocks, full hardware utilization can be achieved. However, since each 4×4 block needs a 9×9 interpolation window, 4×4-based interpolation increases the memory bandwidth, as quantified in the small example below. Conversely, the use of a 16×16 interpolation unit leads to the lowest memory bandwidth, since it covers all block sizes, at the price of low hardware utilization. Examples based on 4×4 and 16×16 interpolation are Refs. 38 and 39, respectively.
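The bandwidth penalty of 4×4-based interpolation can be estimated from the interpolation window sizes alone: a 6-tap filter needs 5 extra reference samples per dimension, so a 4×4 block reads a 9×9 window while a 16×16 macroblock read as a whole needs only 21×21. A back-of-the-envelope comparison (ignoring on-chip reuse between neighbouring blocks) is sketched below.

```python
# Reference pixels fetched per macroblock for half-pel interpolation,
# assuming a 6-tap filter (block size + 5 samples per dimension) and no reuse.
def window_pixels(block):
    return (block + 5) ** 2

per_mb_4x4 = 16 * window_pixels(4)     # sixteen 4x4 blocks per macroblock
per_mb_16x16 = window_pixels(16)       # one 16x16 window per macroblock
print(per_mb_4x4, per_mb_16x16, per_mb_4x4 / per_mb_16x16)
# -> 1296 vs 441 pixels, roughly a 2.9x difference in raw reference traffic
```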

As for the FIR filter, previously reported designs in the literature, such as Refs. 40 and 41, are not suitable for the H.264 interpolation unit because of their too low input bandwidth.42 In Ref. 43, an architecture for the H.264 interpolation unit is presented that uses a pipelined 6-tap FIR filter to increase its processing ability. In Ref. 44, by exploiting a mixed use of parallel- and serial-input FIR filters, a high throughput rate and efficient silicon utilization are achieved, so that the proposed design can support SDTV and HDTV applications.

Regarding the SATD unit, which improves coding performance, there are several architectures in the literature. The first SATD architecture is proposed by Wang et al.,45 and contains two 1-D transform units and one transpose register array. It uses the fast algorithm presented in Ref. 46, which contains only addition and shift operations, where the shifts are implemented by wiring, leading to a small hardware cost. In Ref. 47, the authors explore the design space of the Hadamard transform and present five implementation alternatives with different throughput and hardware cost, suitable for different applications. Another notable Hadamard transform architecture, to which interested readers are referred, can be found in Ref. 48.

As for the less computationally demanding units, their main tasks are the absolute-difference operation and the inter mode decision. A detailed discussion of these units is not provided here; interested readers are referred to the following references for more information. Efficient architectures for the absolute-difference operation can be found in Ref. 49. For the inter mode decision, it is difficult to find a detailed architecture design in the literature; however, one detailed example can be found in Ref. 50.

In addition to the hardware architectures for the SME building blocks, efficient memory design is another important issue. The reason is that the SME uses a data-intensive interpolation method to generate sub-pixels within up to five reference frames, which calls for a great deal of memory access (MA) to the frame memory. In Ref. 51, the author calculates the memory access requirement of the interpolate-and-search sub-pixel motion estimation algorithm of the H.264 reference software in terms of the number of bits that one n×m block has to read from memory. Table 5 lists the memory access requirement of the H.264 SME for five specifications, i.e., cases (a) to (e).


As an example, for HD1080 at 30 fps with all block sizes and five reference frames, i.e., case (d), the memory access requirement is 477,067 Mbits. From a hardware viewpoint, this huge memory access leads to high I/O data bus traffic and interconnection requirements, which have a significant effect on the performance of the hardware architecture, and it also results in more power consumption. The common solution to this problem is the data reuse technique.52-54 In this technique, temporal data locality is exploited and previously accessed data is reused later, which reduces data access and, accordingly, power consumption. Data reuse can be performed offline or online. In the offline scenario, frequently accessed data is stored in on-chip memory and recycled in future accesses, so the data traffic from the frame memory is decreased. In the online data reuse scheme, the currently accessed data is shared between different units that work in parallel, and the data access reduction is achieved that way; a simple estimate of the potential saving is sketched below. In Ref. 55, an effective data reuse methodology is introduced, which formulates the SME algorithm as nested loop structures and explores data locality by means of loop analysis. Based on this technique, 95.7% of memory accesses are removed compared with the no-reuse case, which leads to a large saving in memory access power. In Ref. 56, the authors propose a hardware implementation of the interpolation unit that uses a single-step interpolation method along with a data reuse scheme, which saves 25% of memory bandwidth and 45% of processing cycles.
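As a rough, hypothetical illustration of why reuse pays off, the sketch below compares the reference pixels fetched by nine neighbouring search candidates of one 16×16 block when each candidate refetches its own 21×21 interpolation window versus fetching the overlapping windows once as a single union window that is shared on chip. The numbers are purely geometric estimates, not measurements from Refs. 55 or 56.

```python
# Reference traffic for nine refinement candidates of a 16x16 block (6-tap filter).
block, taps_extra, offsets = 16, 5, 1          # candidates at integer offsets -1..+1

no_reuse = 9 * (block + taps_extra) ** 2       # each candidate fetches its own window
with_reuse = (block + taps_extra + 2 * offsets) ** 2   # one shared union window

print(no_reuse, with_reuse, 1 - with_reuse / no_reuse)
# -> 3969 vs 529 pixels, i.e. roughly 87% fewer external reads in this toy model
```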

4.2. Hardware architectures of the H.264 SME

This subsection surveys the state-of-the-art SME architectures and classifies them into full and fast SME architectures. A full SME architecture is based on the full search SME algorithm, whereas a fast SME architecture is a hardware design of a fast SME algorithm. Four full SME architectures (i.e., see Refs. 37-39 and 57-59) and six fast SME architectures (i.e., see Refs. 20, 23, 31, 33, 60 and 61) are selected as representative designs; their details and comparisons are provided in the following paragraphs.


Table 5. The memory access requirement of the interpolate-and-search method for some specifications.51

Case | Specification     | Features                     | MA, interpolate & search (bits)
(a)  | QCIF (15 fps)     | VBS (all), MRF (5)           | 2,915.4 × 10^6
(b)  | CIF (30 fps)      | VBS (all), MRF (5)           | 23,323 × 10^6
(c)  | SD480p (30 fps)   | VBS (all), MRF (5)           | 79,511.2 × 10^6
(d)  | HD1080p (30 fps)  | VBS (all), MRF (5)           | 477,067.3 × 10^6
(e)  | HD1080p (30 fps)  | VBS (8×8 to 16×16), MRF (2)  | 108,090.2 × 10^6


Chen et al.38 propose the first H.264 SME architecture. To cope with the complex sequential procedure of the H.264 SME, they present 4×4 block decomposition and vertical integration techniques that enable mapping of the SME flow to hardware with a regular schedule, full utilization, parallel processing, and reusability. As shown in Fig. 4, the most important parts of Chen's architecture are the interpolation engine, nine 4×4 processing units, and the Lagrangian mode decision engine. The interpolation engine produces half-pixels and quarter-pixels using one-dimensional (1-D) FIR and bilinear filters, respectively. The 4×4 processing units generate residues and SATD values; since there are nine positions in each sub-pixel refinement, nine 4×4 block processing units are used. The Lagrangian mode decision engine is responsible for finding the best mode among the 41 modes of each macroblock (MB). The area cost of the proposed design is 79K gates and it can support the SDTV format at 30 frames per second (fps), with one reference frame and all MB modes, at a clock frequency of 100 MHz. The maximum throughput of the design is 52K MB/s.

Based on Ref. 38, Yang et al.39 introduce a high-performance H.264 SME architecture for HDTV applications with two new techniques. First, differing from Chen's design, which uses a 4-pixel interpolation unit, they propose a 16-pixel interpolation unit to increase processing capability; however, the four-fold increase of the data width of the interpolation and search engines brings only 52.5% clock cycle saving. Second, they replace the 1-D filters of Chen's design with a diagonal filter and fully pipelined 1-D filters in the interpolation unit. As a result, the delay path is shortened and a higher clock frequency can be achieved without a considerable increase in area cost. The clock frequency of Yang's design is 285 MHz, which supports the HD1080p format at 30 fps with one reference frame. The area cost and the maximum throughput of the design are 189K gates and 250K MB/s, respectively.

Fig. 4. Block diagram of Chen’s design.


In Refs. 57 and 58, Wu et al. present a three-engine parallel SME architecture to support high-resolution applications. To cope with the high-bandwidth input data problem, which leads to high memory bandwidth and high hardware cost, they use a reference pixel scheduling method and an efficient memory organization to ensure that each engine can fetch the right reference pixel at the right time with minimum hardware cost. In addition, they dispatch the 41 sub-blocks of each MB to the three engines in a load-balanced and interleaved way to increase throughput. Furthermore, they use a resource sharing scheme for the SATD generators, reducing the number of required SATD generators from three to two. The proposed SME architecture takes 311.7K gates and can support HD1080 at 30 fps at a clock frequency of 154 MHz.

Ruiz and Michell59 present a new hardware architecture for the full search SME algorithm made up of three different pipelined processors: a half-pixel processor, a quarter-pixel processor, and a mode decision processor. Similar to the work of Chen et al.,38 the proposed architecture uses a 4-pixel interpolation unit to increase hardware utilization. To reduce the hardware cost, Ruiz and Michell use SAD as the distortion criterion instead of SATD at the price of a small video quality degradation; in this way the hardware cost of the Hadamard transform is saved (about 85% of the gate count of a 4×4 processing unit, see Ref. 37). The reported hardware cost of the proposed architecture is only 11.4K gates, and it supports HD1080 resolution at a clock frequency of 250 MHz. However, since the proposed architecture uses the 6-tap filter for half-pixel interpolation, its hardware cost should arguably be greater than the reported 11.4K gates: according to the gate count profile in Ref. 37, the hardware cost of a similar interpolation unit is 23.872K gates and the silicon area of a 4×4 processing unit without the Hadamard transform is 5.2K gates, while the mode decision and quarter-pixel parts must be counted as well.

Based on Ref. 38, Wang et al.23 contribute a fast SME architecture. Their architecture is a hardware realization of a fast algorithm with a reduced number of search points; as a result, the number of 4×4 processing units decreases to five instead of the nine in Chen's design. Compared with Chen's work, it saves 40% of the area cost and 14% of the searching time at the price of less than 0.2 dB PSNR quality drop. However, due to the use of 4-pixel interpolation units and 1-D FIR filters, it has limited processing power and can only support the SDTV format, not higher resolutions.

Song et al.20 propose a fractional motion estimation engine for HD1080p resolution. To achieve real-time encoding, they present three techniques. First, they remove all modes below 8×8 and support one reference frame; this saves 88.6% of the computation with a 0.1 dB loss. Second, they use two reuse methods, namely lossless inside-mode and cross-mode reuse,62 which save about 65% of the pixel generation and SATD calculation. Third, they remove the pipeline bubbles between the half-pixel and quarter-pixel stages by optimizing the SME scheduling. The area cost of the proposed engine is 203.2K gates and it can encode HDTV1080p in real time at 30 fps at a clock frequency of 200 MHz.


In Ref. 31, Kuo et al. present an SME architecture for HDTV and the high profile. They propose two techniques to solve the problems associated with HDTV video size and the high profile. First, the proposed architecture is based on the single-iteration algorithm; consequently, its processing time is halved compared with conventional two-step full search algorithms, and its search candidates are reduced to six points, which reduces the number of processing units and thus the hardware cost. Second, since the high profile of H.264 uses the costly 8×8 SATD, they replace it with the 4×4 SATD, which further reduces the area cost. The proposed architecture supports HD720p at 30 fps with a clock frequency of 70 MHz, 62.2K gates, and 0.043 dB PSNR video quality degradation.

In a later work,60 Tsung et al. contribute an SME architecture for quad full high definition (QFHD) that is based on the single-iteration algorithm. To cater to the huge computation and memory bandwidth requirements of the QFHD specification, several optimization techniques are used. First, the 6-tap interpolation filter is replaced by a bilinear filter, saving memory bandwidth and area cost. Second, based on the distributive law applied to SATD, the residues of the quarter-pixel SME are estimated once the half-pixel residues are known, so the sub-pixel calculations are decreased. Third, a cache-based memory63 is used, which reduces bandwidth. The clock frequency and throughput of the proposed design are 280 MHz and 1659K MB/s, which are sufficient for real-time processing of 4096×2160 QFHD video.

Another architecture for QFHD has recently been reported by Kim et al.33 The proposed architecture is based on single-pass fractional motion estimation, which directly searches only the positions surrounding both the predicted fractional motion vector and the search center. To reduce the hardware cost, the authors present a bit clipping scheme applied to the processing units, decreasing the hardware cost by 25%; in this method, the required bit length of the residue data is reduced from nine to six bits. Experimental results show that the bit clipping method has a negligible effect on video quality. In addition, to increase throughput, all block modes below 8×8 are culled in the proposed design. This design can process QFHD video in real time (i.e., 24 fps) using 134K gates at an operating frequency of 250 MHz.

Power-aware design has recently emerged as a rising trend in video coding systems, by which the power consumption of multimedia devices can be adjusted according to conditions such as battery status, working time, and required coding performance. Chen et al.61 propose the first power-aware architecture for the H.264 SME, based on the advanced mode pre-decision (AMPD) algorithm.38 To support power-aware operation, after the IME refinement their algorithm sorts the costs of seven partitions, including the four 8×8, the two 16×8 and 8×16, and the one 16×16 partitions, from low (best) to high (worst). The N (N = 1 to 7) best partitions can then be selected for the SME. When N is seven, the AMPD algorithm has the same performance as the full search SME; when N is one, it gives the lowest performance.


By changing N, the AMPD algorithm provides a trade-off between power consumption and coding performance. The area cost of the proposed design is 125K gates with a maximum operating frequency of 27 MHz. For real-time encoding of CIF 30 fps video with two MRF, the power consumption can be varied between 22.58 and 1.64 mW at the expense of a 0.1-3.9 dB quality drop.

Table 6 compares the aforementioned SME architectures. From this table, three trends can be summarized. First, under similar throughput conditions, a full SME architecture has a higher area cost than a fast SME architecture, which is rewarded with the highest video quality. Second, most SME designs belong to the fast SME class due to stringent real-time constraints. Finally, the single-iteration-based fast SME architectures, such as those of Kuo et al.,31 Tsung et al.,60 and Kim et al.,33 are suitable candidates for high-resolution applications, with a negligible quality drop and reasonable hardware cost.

Finally, it is worth noting that the architectures surveyed in this subsection are based on application-specific integrated circuit (ASIC) designs, whereas there are other design alternatives such as digital signal processors (DSPs), media processors, and field programmable gate arrays (FPGAs). ASIC designs show the best performance with little flexibility, whereas processor- and FPGA-based designs provide better flexibility at lower performance.64 Therefore, hardware engineers can use DSPs, media processors, or FPGAs to implement fast SME designs, especially for small and medium video resolutions that require only low to moderate performance, since fast SME algorithms (such as AMPD38 and the single-iteration/single-pass algorithms31,33) require lower computational complexity and memory bandwidth than full search SME algorithms.

5. Further Perspectives and Future Prospects

With rapid advances in multimedia processing and semiconductor technologies, new multimedia applications and standards will keep emerging, which will unsurprisingly pose new design challenges.

Table 6. Comparison of state-of-the-art SME architectures.

Ref                 | Technology (μm) | Area (K gates) | CLK freq (MHz) | Throughput (MB/s) | Power (mW)   | Video resolution | Quality drop (dB)
Chen et al.38       | 0.18 | 79.3  | 100 | 49K    | —          | SDTV   | 0
Yang et al.39       | 0.18 | 189   | 200 | 250K   | —          | HD1080 | 0
Wu et al.57         | 0.13 | 311.7 | 154 | 250K   | —          | HD1080 | 0
Ruiz and Michell59  | 0.18 | 11.4  | 290 | —      | —          | HD1080 | —
Wang et al.23       | 0.18 | 48    | 100 | 50K    | —          | SDTV   | 0.1-0.2
Song et al.20       | 0.18 | 203.2 | 200 | 250K   | 236        | HD1080 | 0.1
Kuo et al.31        | 0.18 | 62.2  | 70  | 71.3K  | —          | HD720  | 0.043
Tsung et al.60      | 0.09 | 448   | 280 | 1659K  | —          | QFHD   | 0.02
Kim et al.33        | 0.13 | 134   | 250 | 977K   | —          | QFHD   | —
Chen et al.61       | 0.18 | 125   | 27  | 23.7K  | 22.58-1.64 | CIF    | 0.1-3.9


As the SME is an important part of current and future advanced video coding systems, the same scenario will be repeated for it. In the following paragraphs, the SME design challenges posed by some emerging multimedia applications and by the next advanced video coding standard, currently known as "High Efficiency Video Coding" (HEVC)/H.265,65 are reviewed.

Digital video systems will continue to evolve towards higher resolutions to provide more realistic images on displays. Ultra-high definition (UD) TV66,67 is an emerging visual medium, which undoubtedly requires much higher computation and memory bandwidth than current applications. For example, the required throughput of real-time SME for the UD format (7680×4320 pixels) is 16 times that of the HD format (1920×1080 pixels) at the same frame rate.

Another interesting emerging application is three-dimensional (3-D) TV, which provides a 3-D impression of natural scenes by using multiview video sequences. Such sequences are produced by capturing a 3-D scene with multiple cameras simultaneously. SME for multiview sequences suffers from two major problems. First, it demands computation and memory bandwidth that grow directly with the number of capturing cameras. Second, due to the redundancy between nearby cameras, compressing multiview sequences independently with H.264 is not efficient.68 As a result, new SME algorithms may be needed that exploit this inter-view redundancy, which adds complexity to the algorithm flow.

Although "the H.264 SME typically leads to 1-4.2 dB PSNR video quality improvement or 19%-50% bit rate saving",11 the demand for higher coding performance is relentless. Consequently, the development of new SME algorithms will continue. Furthermore, the SME in future video coding standards will arrive with enriched and improved features, which will unsurprisingly increase the computation and memory bandwidth, the data dependency, and the irregularity of the computation flow. For instance, the SME in H.265 is expected to use an adaptive interpolation filter instead of the H.264 interpolation filter with fixed weights to improve motion-compensated prediction. This is because the time variation of the image signal is not taken into account in the H.264 interpolation filters, which limits the prediction efficiency of motion compensation. In adaptive interpolation, a set of interpolation filter coefficients is calculated based on the characteristics and statistics of each image, which improves coding performance at the price of added computation and irregularity.69,70 In addition, a "supermacroblock" structure with sizes up to 64×64 will be used in H.265, which brings added computation and memory access compared with H.264.
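The core of adaptive interpolation is a least-squares (Wiener-style) fit: for a given sub-pixel position, choose the filter taps that minimize the prediction error between the filtered reference samples and the original pixels. The sketch below solves that fit for a 1-D 6-tap filter with NumPy; the data arrays are random placeholders and the formulation is a simplification of the schemes in Refs. 69 and 70.

```python
import numpy as np

def fit_adaptive_filter(ref_windows, originals):
    """Least-squares filter coefficients for one sub-pel position.
    ref_windows: (N, taps) reference samples around each predicted pixel.
    originals:   (N,) original pixel values the prediction should match."""
    coeff, *_ = np.linalg.lstsq(ref_windows.astype(float),
                                originals.astype(float), rcond=None)
    return coeff

# Placeholder training data for one frame (random, for illustration only).
rng = np.random.default_rng(0)
ref_windows = rng.integers(0, 256, size=(1000, 6))
true_filter = np.array([1, -5, 20, 20, -5, 1]) / 32.0   # pretend the best filter is the fixed one
originals = ref_windows @ true_filter + rng.normal(0, 1, 1000)

print(np.round(fit_adaptive_filter(ref_windows, originals), 3))
# With per-frame statistics, the recovered taps adapt to the actual content.
```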

Derived from the above discussion, the SME in emerging and future multimedia applications will be characterized by massive computation, memory bandwidth, data dependency, and irregular computation flow, which will bring new challenges for SME hardware architecture design. On the other hand, continuous advances in microelectronic technology will provide more processing capability, which in turn will facilitate the hardware implementation of SME.


However, conventional architectural approaches may not be able to fulfill the hard real-time constraints of emerging and future applications. As a result, besides the conventional and advanced algorithmic and architectural techniques described in this paper, SME designers may need to develop new and innovative methods for low-cost, low-power, and efficient SME hardware realization. Furthermore, the SME hardware designer should consider new challenges brought by advanced very-large-scale integration (VLSI) technologies, such as high static (leakage) power consumption and interconnection problems. In current and future VLSI designs, as process technology scales down to the deep submicron region (90 nm and beyond), static power, which is proportional to the hardware cost, will grow much faster than dynamic power and will become the main contributor to total power consumption; for more information on static power, interested readers are referred to Refs. 71 and 72. In addition, in the deep submicron region, interconnection becomes a very important design parameter and will determine the cost, delay, power, reliability, and turn-around time of future LSIs rather than the MOSFETs.73 On the other hand, as future SME applications are data intensive and demand high memory bandwidth, parallelism at the hardware level will be a must to meet real-time constraints, and a lot of storage space will be required for data saving. Consequently, the hardware cost and the number of long interconnections will increase, leading to higher static power and interconnection density. Therefore, the use of static power reduction techniques, together with reductions in area cost and memory bandwidth that lower (static) power and interconnection requirements, is vitally important for future advanced SME architectures, especially for portable multimedia devices with limited battery power. Last but not least, it seems that emerging and future SME designs will be based on ASICs rather than other design alternatives (i.e., DSPs, FPGAs, and so on) in order to fulfill the extraordinarily huge computational complexity and memory bandwidth requirements of SME in upcoming video coding systems.

Acknowledgment

The authors would like to thank all reviewers for their helpful suggestions and

comments that improved the quality of the revised manuscript. This work was

supported in part by the Ministry of Higher Education, Malaysia, under Grant

FRGS FP094/2007c.

References

1. Joint Video Team (JVT) of ITU-T VCEG and ISO/IEC MPEG, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC (2003).


2. T. Wiegand, G. J. Sullivan, G. Bjontegaard and A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Trans. Circuits Syst. Video Technol. 13 (2003) 560-576.
3. Information technology — generic coding of moving pictures and associated audio information: Video, ISO/IEC 13818-2 and ITU-T Rec. H.262 (1996).
4. Video Coding for Low Bit Rate Communication, ITU-T Rec. H.263 (1998).
5. Information technology — coding of audio-visual objects — part 2: Visual, ISO/IEC 14496-2 (1999).
6. A. Joch, F. Kossentini, H. Schwarz, T. Wiegand and G. J. Sullivan, Performance comparison of video coding standards using Lagrangian coder control, Proc. IEEE Int. Conf. Image Process. (2002), pp. 501-504.
7. T. Wiegand, X. Zhang and B. Girod, Long-term memory motion-compensated prediction, IEEE Trans. Circuits Syst. Video Technol. 9 (1999) 70-84.
8. G. J. Sullivan and T. Wiegand, Rate-distortion optimization for video compression, IEEE Signal Process. Mag. 15 (1998) 74-90.
9. T. Wiegand, H. Schwarz, A. Joch, F. Kossentini and G. J. Sullivan, Rate-constrained coder control and comparison of video coding standards, IEEE Trans. Circuits Syst. Video Technol. 13 (2003) 688-703.
10. D. Marpe, H. Schwarz and T. Wiegand, Context-adaptive binary arithmetic coding in the H.264/AVC video compression standard, IEEE Trans. Circuits Syst. Video Technol. 13 (2003) 620-636.
11. M. R. H. Fatemi, H. F. Ates and R. Salleh, Fast algorithm analysis and bit-serial architecture design for sub-pixel motion estimation in H.264, J. Circuits Syst. Comput. 19 (2010).
12. P. Pirsch, N. Demassieux and W. Gehrke, VLSI architectures for video compression — a survey, Proc. IEEE 83 (1995) 220-246.
13. F. Dufaux and F. Moscheni, Motion estimation techniques for digital TV: A review and a new contribution, Proc. IEEE 83 (1995) 858-876.
14. P. Pirsch and H.-J. Stolberg, VLSI implementations of image and video multimedia processing systems, IEEE Trans. Circuits Syst. Video Technol. 8 (1998) 878-891.
15. P.-C. Tseng, Y.-C. Chang, Y.-W. Huang, H.-C. Fang, C.-T. Huang and L.-G. Chen, Advances in hardware architectures for image and video coding — a survey, Proc. IEEE 93 (2005) 184-197.
16. Y.-W. Huang, C.-Y. Chen, C.-H. Tsai, C.-F. Shen and L.-G. Chen, Survey on block matching motion estimation algorithms and architectures with new results, J. VLSI Signal Process. 42 (2006) 297-320.
17. Joint Video Team Reference Software version 11.0 (2007), http://iphome.hhi.de/suehring/tml/download/old jm/.
18. G. Bjontegaard, Calculation of average PSNR differences between R-D curves, Document VCEG-M33 (2001).
19. Y.-W. Huang, B.-Y. Hsieh, S.-Y. Chien, S.-Y. Ma and L.-G. Chen, Analysis and complexity reduction of multiple reference frames motion estimation in H.264/AVC, IEEE Trans. Circuits Syst. Video Technol. 16 (2006) 507-522.
20. Y. Song, M. Shao, Z. Liu, S. Li, L. Li, T. Ikenaga and S. Goto, H.264/AVC fractional motion estimation engine with computation reusing in HDTV1080p real-time encoding applications, IEEE SiPS07, Shanghai, China (2007), pp. 509-514.
21. C. L. Su, Yang, W. S. Li, Y. Chen, C. W. Guo and S. Y. Tseng, Low complexity high quality fractional motion estimation algorithm and architecture design for H.264/AVC, Proc. IEEE Asia Pacific Conf. Circuits Syst. (2006), pp. 578-581.


22. A. Abdelazim, M. Yang and C. Grecos, Fast subpixel motion estimation based on the interpolation effect on different block sizes for H.264/AVC, Opt. Eng. Lett. 48 (2009).
23. Y. J. Wang, C. C. Cheng and T. S. Chang, A fast algorithm and its VLSI architecture for fractional motion estimation for H.264/MPEG-4 AVC video coding, IEEE Trans. Circuits Syst. Video Technol. 17 (2007) 578–583.
24. J. W. Suh and J. Jeong, Fast subpixel motion estimation techniques having lower computational complexity, IEEE Trans. Consum. Electron. 50 (2004) 968–973.
25. P. R. Hill, T. K. Chiew, D. R. Bull and C. N. Canagarajah, Interpolation free subpixel accuracy motion estimation, IEEE Trans. Circuits Syst. Video Technol. 16 (2006) 1519–1526.
26. R. Wang, C. Huang, J. Li and Y. Shen, Sub-pixel motion compensation interpolation filter in AVS, Proc. IEEE Int. Conf. Multimedia and Expo 1 (2004) 93–96.
27. C.-Y. Tsai, T.-C. Chen, T.-W. Chen and L.-G. Chen, Bandwidth optimized motion compensation hardware design for H.264/AVC HDTV decoder, Proc. IEEE Midwest Symp. Circuits and Systems, Vol. 2 (2005), pp. 1199–1202.
28. C. J. Hyun and M. H. Sunwoo, Low power complexity-reduced ME and interpolation algorithms for H.264/AVC, J. Signal Process. Syst. 56 (2009) 285–293.
29. N. T. Ta, J. H. Kim and J. R. Choi, Fully parallel fractional motion estimation for H.264/AVC encoder, ICIS 4 (2009) 306–306.
30. Y.-H. Chen et al., An H.264/AVC scalable extension and high profile HDTV 1080p encoder chip, Proc. IEEE SOVC (2008).
31. T.-Y. Kuo, Y.-K. Lin and T.-S. Chang, SIFME: A single iteration fractional-pel motion estimation algorithm and architecture for HDTV sized H.264 video coding, Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Hawaii, USA (2007), pp. 1185–1188.
32. Y.-K. Lin et al., A 242 mW 10 mm2 1080p H.264/AVC high profile encoder chip, Proc. IEEE ISSCC (2008), pp. 314–315.
33. G. Kim, J. Kim and C.-M. Kyung, A low cost single-pass fractional motion estimation architecture using bit clipping for H.264 video codec, ICME (2010), pp. 661–666.
34. O. Akbulut, O. Urhan and S. Ertürk, Fast Sub-Pixel Motion Estimation by Means of One-Bit Transform, Lecture Notes in Computer Science (LNCS), Vol. 4263 (2006), pp. 503–510.
35. A. Celebi, O. Akbulut, O. Urhan, I. Hamzaoglu and S. Erturk, An all binary sub-pixel motion estimation approach and its hardware architecture, IEEE Trans. Consum. Electron. 50 (2008) 1928–1937.
36. S. Ertürk, Multiplication-free one-bit transform for low-complexity block-based motion estimation, IEEE Signal Process. Lett. 14 (2007) 109–112.
37. Y.-H. Chen, T.-C. Chen, S.-Y. Chien, Y.-W. Huang and L.-G. Chen, VLSI architecture design of fractional motion estimation for H.264/AVC, J. Signal Process. Syst. 53 (2008) 335–347.
38. T.-C. Chen, Y.-W. Huang and L.-G. Chen, Fully utilized and reusable architecture for fractional motion estimation of H.264/AVC, IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5 (2004), pp. 9–12.
39. Y. Yang, S. Goto and T. Ikenaga, High performance VLSI architecture of fractional motion estimation in H.264 for HDTV, Proc. IEEE ISCAS (2006), pp. 2605–2608.
40. J. Park, K. Muhammad and K. Roy, High performance FIR filter design based on sharing multiplication, IEEE Trans. VLSI Syst. 11 (2003) 244–253.
41. H. Samueli, On the design of optimal equiripple FIR digital filters for data transmission applications, IEEE Trans. Circuits Syst. 35 (1988) 1542–1546.


42. D. Lei, G. Wen, H. M. Zeng and Z. Z. Ji, An efficient VLSI architecture for MC interpolation in AVC video coding, Int. Multiconf. in Computer Science and Computer Engineering, Las Vegas, Nevada, USA (2004).
43. Y. Song, Z. Liu, S. Goto and T. Ikenaga, A VLSI architecture for motion compensation interpolation in H.264/AVC, IEEE ASICON05, Shanghai, China (2005), pp. 262–265.
44. L. Lu, J. V. McCanny and S. Sezer, Subpixel interpolation architecture for multistandard video motion estimation, IEEE Trans. Circuits Syst. Video Technol. 19 (2009) 1897–1901.
45. T. C. Wang, Y. W. Huang, H. C. Fang and L. G. Chen, Parallel 4×4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264, Proc. ISCAS (2003).
46. H. Malvar, A. Hallapuro, M. Karczewicz and L. Kerofsky, Low-complexity transform and quantization with 16-bit arithmetic for H.26L, Proc. ICIP 2002, Vol. 2 (2002), pp. 489–492.
47. M. S. Porto, T. L. da Silva, R. E. C. Porto, L. V. Agostini, I. V. da Silva and S. Bampi, Design space exploration on the H.264 4×4 Hadamard transform, 23rd NORCHIP Conf. (2005).
48. Z.-Y. Cheng, C.-H. Chen, B.-D. Liu and J.-F. Yang, High throughput 2-D transform architectures for H.264 advanced video coders, Proc. 2004 IEEE Asia-Pacific Conf. Circuits and Systems (2004).
49. J. Vanne, E. Aho, T. D. Hamalainen and K. Kuusilinna, A high-performance sum of absolute difference implementation for motion estimation, IEEE Trans. Circuits Syst. Video Technol. 16 (2006) 876–883.
50. H.-C. Kuo, Efficient VLSI architectures for inter and intra mode decision in high resolution H.264/MPEG-4 part 10 AVC encoding, Master's thesis, National Tsing Hua University, Hsinchu, Taiwan (2006).
51. M. R. H. Fatemi, Algorithm optimization and low cost bit-serial architecture design for integer pixel and sub-pixel motion estimation in H.264/AVC, Ph.D. thesis, University of Malaya, Kuala Lumpur, Malaysia (2012).
52. S. Wuytack, F. Catthoor, L. Nachtergaele and H. D. Man, Power exploration for data dominated video applications, Proc. IEEE Int. Conf. Low Power Electronics and Design (ISLPED) (1996).
53. S. Wuytack, J.-P. Diguet, F. V. M. Catthoor and H. J. D. Man, Formalized methodology for data reuse exploration for low-power hierarchical memory mappings, IEEE Trans. VLSI Syst. 6 (1998) 529–536.
54. D. Soudris, N. D. Zervas, A. Argyriou, M. Dasygenis, K. Tatas, C. E. Goutis and A. Thanailakis, Data-Reuse and Parallel Embedded Architectures for Low-Power, Real-Time Multimedia Applications, Lecture Notes in Computer Science, Vol. 1918 (Springer, 2000), pp. 243–254.
55. Y.-H. Chen, T.-C. Chen, C.-Y. Tsai, S.-F. Tsai and L.-G. Chen, Data reuse exploration for low power motion estimation architecture design in H.264 encoder, J. Signal Process. Syst. 50 (2008) 1–17.
56. Z. Haihua, X. Zhiqi and C. Guanghua, VLSI implementation of sub-pixel interpolator for H.264/AVC encoder, High Density Packaging and Microsystem Integration (2007), pp. 1–3.
57. C.-L. Wu, C.-Y. Kao and Y.-L. Lin, A high-performance three-engine architecture for H.264/AVC fractional motion estimation, IEEE Int. Conf. Multimedia and Expo (2008), pp. 133–136.
58. C.-L. Wu, C.-Y. Kao and Y.-L. Lin, A high-performance three-engine architecture for H.264/AVC fractional motion estimation, IEEE Trans. VLSI Syst. 18 (2010) 662–666.


59. G. A. Ruiz and J. A. Michell, An efficient VLSI architecture of fractional motion estimation in H.264 for HDTV, J. Signal Process. Syst. 60 (2010).
60. P.-K. Tsung, W.-Y. Chen, L.-F. Ding, C.-Y. Tsai, T.-D. Chuang and L.-G. Chen, Single-iteration full-search fractional motion estimation for quad full HD H.264/AVC encoding, Int. Conf. Multimedia & Expo (ICME) (2009).
61. T.-C. Chen, Y.-H. Chen, C.-Y. Tsai and L.-G. Chen, Low power and power aware fractional motion estimation of H.264/AVC for mobile applications, IEEE Int. Symp. Circuits and Systems (ISCAS) (2006).
62. M. Shao, Z. Y. Liu, S. Goto and T. Ikenaga, Lossless VLSI oriented full computation reusing algorithm for H.264/AVC fractional motion estimation, IEICE Trans. Fundamentals 90-A (2007) 756–763.
63. P.-K. Tsung, W.-Y. Chen, L.-F. Ding, S.-Y. Chien and L.-G. Chen, Cache-based integer motion/disparity estimation for quad-HD H.264/AVC and HD multiview video coding, IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP) (2009).
64. Y.-L. Lin (ed.), Essential Issues in SOC Design (Springer, 2006).
65. http://www.vcodex.com/h265.html.
66. M. Sugawara, M. Kanazawa, K. Mitani, H. Shimamoto, T. Yamashita and F. Okano, Ultrahigh-definition video system with 4000 scanning lines, SMPTE Mot. Imag. J. 112 (2003) 339–346.
67. E. Nakasu, Y. Nishida, M. Maeda, M. Kanazawa, S. Yano, M. Sugawara, K. Mitani, K. Hamasaki and Y. Nojiri, Technical development toward implementation of ultrahigh-definition TV system, SMPTE Mot. Imag. J. 116 (2007) 279–286.
68. C. Bilen, A. Aksay and G. B. Akar, A multi-view video codec based on H.264, IEEE Int. Conf. Image Processing (ICIP) (2006), pp. 541–544.
69. S. Wittmann and T. Wedi, Separable adaptive interpolation filter for video coding, Proc. IEEE ICIP, San Diego, CA (2008), pp. 2500–2503.
70. Y. Vatis and J. Ostermann, Adaptive interpolation filter for H.264/AVC, IEEE Trans. Circuits Syst. Video Technol. 19 (2009) 179–192.
71. N. S. Kim et al., Leakage current: Moore's law meets static power, Computer 36 (2003) 68–75.
72. W. M. Elgharbawy and M. A. Bayoumi, Leakage sources and possible solutions in nanometer CMOS technologies, IEEE Circuits Syst. Mag. 5 (2005) 6–17.
73. T. Sakurai, Design challenges for 0.1 μm and beyond: Embedded tutorial, ASP-DAC 2000, Yokohama, Japan (2000), pp. 553–558.
