A SURVEY OF ALGORITHMS AND ARCHITECTURES FOR H.264 SUB-PIXEL MOTION ESTIMATION
Transcript of A SURVEY OF ALGORITHMS AND ARCHITECTURES FOR H.264 SUB-PIXEL MOTION ESTIMATION
A SURVEY OF ALGORITHMS AND ARCHITECTURES
FOR H.264 SUB-PIXEL MOTION ESTIMATION¤
MOHAMMAD REZA HOSSEINY FATEMI
Faculty of Computer Science and Information Technology,
University of Malaya, 50603 Kuala Lumpur, Malaysia
HASAN F. ATES
Department of Electronics Engineering,
Isik University, Sile 34980 Istanbul, Turkey
ROSLI SALLEH
Faculty of Computer Science and Information Technology,
University of Malaya, 50603 Kuala Lumpur, Malaysia
Received 20 September 2010
Accepted 12 August 2011
This paper reviews recent state-of-the-art H.264 sub-pixel motion estimation (SME) algor-
ithms and architectures. First, H.264 SME is analyzed and the impact of its functionalities on
coding performance is investigated. Then, design space of SME algorithms is explored repre-
senting design problems, approaches, and recent advanced algorithms. Besides, design chal-
lenges and strategies of SME hardware architectures are discussed and promising architectures
are surveyed. Further perspectives and future prospects are also presented to highlight
emerging trends and outlook of SME designs.
Keywords: Video compression; sub-pixel motion estimation algorithms and architectures;
h.264 standard; survey.
1. Introduction
Video coding standards make possible storage and transmission of high volume raw
digital video data in limited storage and transmission bandwidth environments. The
latest block-based video coding standard, H.264,1,2 provides a much better com-
pression efficiency and subjective video quality compared with all previous standards
*This paper was recommended by Regional Editor Piero Malcovati.
Journal of Circuits, Systems, and Computers
Vol. 21, No. 3 (2012) 1250016 (22 pages)
#.c World Scientific Publishing Company
DOI: 10.1142/S0218126612500168
1250016-1
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
such as MPEG-2,2 H.263,3 and MPEG-4.5 For instance, compared with MPEG-4,
H.263, and MPEG-2, H.264 can save 39%, 49%, and 64% of bit rate in average,
respectively.6
Higher compression performance and subjective quality in H.264 are achieved by
using new functionalities such as motion estimation with variable block-size (VBS)
and quarter-pixel accuracy, multiple reference frames (MRF),7 Lagrangian mode
decision,8,9 advanced entropy coding,10 and so on.
The heart of H.264 encoder is motion estimation (ME) which is used for removing
of temporal redundancy in a sequence of images. In H.264, ME is conducted in two
stages. In the first stage, integer motion estimation with different block sizes
(16� 16, 16� 8, 8� 16, 8� 8, 8� 4, 4� 8, and 4� 4) is performed and up to 41
integer motion vectors (IMVs) of all blocks and sub-blocks are determined whereas
up to five reference frames can be searched. Then, SME is started which is usually
conducted in two steps, as shown in Fig. 1. In the first step, half-pixel refinement is
performed by searching eight half-pixel positions around the best integer MV. The
quarter-pixel refinement is carried out in the same manner around the best half-
pixel position. Finally, mode selection is conducted.
\The H.264 SME typically improves coding efficiency between 1 and 4.2 dB in
peak signal-to-noise ratio (PSNR) or leads to 19%�50% in bit rate reduction".11
However, due to the use of VBS, MRF, and interpolation scheme for producing sub-
pixel values, two sub-pixel search steps and matching process, the computational
budget and memory access requirement of the quarter-pixel accurate ME are highly
intensive. Therefore, not only efficient and new fast algorithms are demanded to
reduce the huge computation and memory access bandwidth of the H.264 SME but
(-1, -1) (0, -1)
(0, 1)
(-1, 1)
(-1, 0)
(1, -1)
(1, 0)
(1, 1)
Integer pixel Half pixel
Quarter pixel
Fig. 1. SME refinements in H.264.
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-2
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
also dedicated hardware architectures are required to accelerate the SME process
for real-time applications.
There exist a few effective surveys on algorithms and architectures for video
compression techniques i.e., Refs. 12�16. These studies mainly focus on former
standards such as MPEG-2 and MPEG-4 and they do not cover the new func-
tionalities introduced later in H.264 such as SME with VBS, MRF, Lagrangian
mode decision, and so forth. Motivated by above issue, this paper tries to provide a
comprehensive survey about the most significant SME algorithms and archi-
tectures, address the SME design challenges and strategies at algorithmic and
architectural levels, and highlight the future research directions of SME design.
The rest of this paper is organized as follows. Section 2 analyzes the H.264 SME
and shows the impact of its features on coding performance. Section 3 investigates
the design space of SME algorithms reviewing recent approaches and the concepts
behind them in the state-of-the-art SME algorithms. Section 4 explores the design
challenges and choices of SME architectures and surveys the advanced SME
architectures. Finally, Sec. 5 concludes with emerging perspectives trends in the
research on algorithms and architectures of the SME and its future prospects.
2. Analysis of the H.264 SME
Design of a new and an efficient SME algorithm or hardware architecture relies on a
careful analysis and an in-depth understanding of the SME features and charac-
teristics. A thorough understanding of the SME features is essential for providing an
insight into the design parameters (such us performance, subjective quality, and
complexity), and, how to make an optimal tradeoff between them. In this section,
we evaluate the performance of H.264 SME algorithm and investigate the effect of
its features on coding performance through some experiments.
Different technical features are used in the H.264 SME such as VBS, quarter-
pixel resolution, and MRF, which improve the coding performance of the encoder at
the price of increased computation and memory bandwidth. To evaluate the impact
of different SME features on coding performance, we do several experiments. We use
a set test consisting several common intermediate format (CIF), Quarter CIF
(QCIF), 4CIF, and HD1080 image sequences with the following conditions:
(i) Search ranges for QCIF, CIF, 4CIF, and HD1080 videos are 16, 16, 32, and 64,
respectively.
(ii) Rate–Distortion (R–D) optimization is on.
(iii) The quantization parameter is set to 24, 28, 32, and 36.
(iv) Entropy coding method is CAVLC.
(v) Frequency for encoded bit stream is 30.
(vi) Number of coded pictures for QCIF, CIF, 4CIF, and HD1080 sequences are
300, 300, 220, and 110, respectively and the H.264 reference software (version
11)17 is used for experiments.
A Survey of Algorithms and Architectures for H.264 Sub-Pixel Motion Estimation
1250016-3
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
Tables 1�4 show gain in performance when using more features in the encoder for
QCIF, CIF, 4CIF, and HD1080 videos, respectively, and for the following cases.
(a) Integer motion estimation (IME) with 1 reference frame, SAD as distortion
criteria, and only 16� 16 blocks.
(b) Increasing motion vector accuracy to quarter-pixel.
(c) Allowing all VBS except below 8� 8 block.
(d) Allowing all VBS except 4� 4 blocks.
(e) Using all VBS.
(f) Increasing the number of reference frames to two reference frames.
(g) Using three reference frames.
(h) Using five reference frames.
(i) Changing the distortion criterion (i.e., replacing the sum of absolute differences
(SAD) with the sum of absolute transformed differences (SATD)).
In Tables 1�4, the average bit rate difference percentage (�BR (%)) and PSNR
difference in terms of dB (�PSNR (dB)) between two cases are calculated as
described in Ref. 18, where case (a) is considered as the reference for other cases
(i.e., case (b) to case (i)). From the simulation results, we can see that when a new
feather is added, in most cases, a better performance is achieved at the price of extra
computation while the gain in performance is not equal for all image sequences and
depends on video contents as well as frame sizes. From the simulation results, the
following observation can be made.
The quarter-pixel motion estimation (QPME) usually provides a considerable
improvement over the IME in terms of bit rate saving and video quality.
As for VBS, the more the block sizes are used, the better the achievement of the
performances. However, as seen in Tables 1�4, elimination of the 4� 4 block will
have a negligible effect on coding performance whereas it leads to a lot of compu-
tation reduction. From the hardware point of view, it can result in many clock cycles
Table 1. Effectiveness of the H.264 SME features on coding performances in QCIF videos.
Container Foreman News
Case 1 �BR (%) �PSNR (dB) �BR (%) �PSNR (dB) �BR (%) �PSNR (dB)
(a) 0 0 0 0 0 0
(b) �35.20 1.654 �37.54 2.009 �20.39 1.339
(c) �41.93 2.043 �47.51 2.788 �28.54 2.014
(d) �42.88 2.106 �48.54 2.946 31.00 2.258
(e) �43.20 2.113 �48.78 2.970 �31.23 2.283
(f ) �43.31 2.132 �49.92 3.188 �31.32 2.302
(g) �45.96 2.276 �49.78 3.239 �31.62 2.324
(h) �46.22 2.311 �50.75 3.334 �31.74 2.341
(i) �46.76 2.355 �51.01 3.385 �32.41 2.422
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-4
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
reduction in the SME process. By using different VBS, a good tradeoff between the
video quality and the required clock cycles can be achieved.
On the subject of MRF motion estimation, using of two reference frames usually
brings a sensible coding performance compared with one reference frame. However,
Table 2. Impact of the H.264 SME features on coding performances in CIF videos.
Flower Mobile Mother�daughter
Case 2 �BR (%) �PSNR (dB) �BR (%) �PSNR (dB) �BR (%) �PSNR (dB)
(a) 0 0 0 0 0 0
(b) �47.27 2.523 �65.30 3.503 �39.69 1.958
(c) �56.72 3.303 �69.19 3.920 �44.99 2.317
(d) �58.20 3.479 �70.06 4.038 �45.73 2.381
(e) �58.21 3.484 �69.99 4.043 �45.78 2.386
(f ) �59.04 3.610 �73.86 4.657 �46.42 2.449
(g) �60.00 3.724 �74.84 4.910 �46.55 2.463
(h) �60.87 3.845 �75.47 5.113 �46.96 2.498
(i) �61.45 3.927 �76.39 5.229 �47.83 2.569
Table 3. Effectiveness of the H.264 SME features on coding performances in 4CIF videos.
City Crew Soccer
Case 3 �BR (%) �PSNR (dB) �BR (%) �PSNR (dB) �BR (%) �PSNR (dB)
(a) 0 0 0 0 0 0
(b) �57.74 3.703 �18.01 1.133 �36.39 1.606
(c) �62.62 4.266 �25.96 1.770 �41.54 1.900
(d) �63.24 4.355 �26.59 1.829 �42.31 1.949
(e) �63.29 4.356 �26.55 1.827 �42.38 1.955
(f ) �65.37 4.832 �30.28 2.170 �42.90 1.998
(g) �65.19 4.799 �30.89 2.235 �43.33 2.029
(h) �65.59 4.880 �31.81 2.328 �43.83 2.061
(i) �65.56 4.933 �32.70 2.438 �44.05 2.082
Table 4. Impact of the H.264 SME features on coding performances in HD1080 videos.
Blue sky Rush hour Station
Case 4 �BR (%) �PSNR (dB) �BR (%) �PSNR (dB) �BR (%) �PSNR (dB)
(a) 0 0 0 0 0 0
(b) �18.43 0.904 �11.72 0.340 �36.28 1.385
(c) �23.38 1.207 �20.33 0.623 �40.71 1.636
(d) �24.20 1.231 �20.65 0.636 �40.52 1.622
(e) �24.31 1.239 �20.60 0.634 �40.74 1.642
(f ) �23.11 1.176 �23.53 0.743 �39.50 1.572
(g) �23.23 1.182 �23.96 0.762 �39.29 1.565
(h) �23.56 1.201 �25.45 0.819 �40.08 1.625
(i) �23.84 1.223 �25.20 0.808 �37.87 1.500
A Survey of Algorithms and Architectures for H.264 Sub-Pixel Motion Estimation
1250016-5
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
increasing the number of reference frames is not always advantageous. As shown in
Tables 1�4, while the number of reference frames is increased to three or five, a
lower coding performance improvement is achieved. However, the computational
complexity and memory bandwidth are proportional to the number of searched
reference frames. As a useful key point for designers, it is worth to mention that
more than 80% of selected best reference frames belong to the first two nearest
reference frames.19
Regarding distortion cost, SATD usually has a small higher video quality than
SAD at price of an increased computation. However, from the hardware point of
view, SATD requires a much higher hardware cost than SAD. In summary, the
SME hardware/software designers can make a good tradeoff between coding per-
formance and computation complexity by using different SME functionalities and
considering aforementioned observations.
3. Investigation of Algorithms
This section investigates the design space of SME algorithms and reviews the design
challenges of SME algorithms and the possible choices to cope with them. In ad-
dition, it surveys the state-of-the-art SME algorithms.
To design a new and an efficient algorithm, designers should take into con-
sideration several design issues. The new algorithm should have a lower compu-
tational complexity and memory access compared with the H.264 SME. In addition,
it should lead to a close coding performance compared to the H.264 SME algorithm.
Moreover, it should be a hardware friendly algorithm for hardwired applications.
This issue will be discussed in the next section. The designers should carefully
consider all the functions in SME algorithm and the impact of each function on the
coding performance and computational complexity to identify the main bottlenecks
and possible solutions. Although the proposed analysis in previous section covers
some of aforementioned issues, the details of SME bottlenecks and solutions that are
still not given are discussed in the following paragraphs.
The VBS feature is one of the main contributors to the computation and memory
bandwidth. Since in the H.264 SME each block has its own motion vector, it can
processes each block separately with multiple reference frames, which increases the
computational complexity, memory bandwidth, and processing time. A simple but
effective way for lowering computation, memory access, and processing time is to
reduce the number of block sizes before or during SME stage. In Ref. 20, Song et al.
directly remove all block modes below 8� 8 that leads to 42.9% reduction in com-
putation complexity at price of small quality drop. However, this technique is not
suitable for low or medium frame size and results in more video quality degradation.
In Refs. 21 and 22, different mode filtering methods are introduced that select only
some of best modes for SME refinement. As a result, a lot of computation, memory
access, and processing time are saved at the price of a small video quality drop.
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-6
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
Interpolation is a computational intensive process in SME, which requires a lot of
memory access as well as computation to produce half- and quarter-pixels.
Matching error surface function can provide a good way for reducing computation
and memory bandwidth of the interpolation process. \It is generally believed that
the fast motion estimation algorithm works best if the error surface inside the search
window is unimodal. As shown in Fig. 2, the error surface of the integer pixel motion
estimation is not unimodal due to the large search window and complexity of video
content. Therefore, the integer motion estimation search would easily be trapped
into a local minimum. On the other hand, since the sub-pixels are generated from
the interpolation of integer pixels, the correlation inside a sub-pixel search window
is much higher than that of the integer pixel search window. Thus, the unimodal
error surface will be valid in most cases of the sub-pixel. As a result, the matching
error decreases monotonically as the search point moves closer to the global mini-
mum".23 Under unimodality assumption of error surface at sub-pixel resolution, the
matching costs at sub-pixel precision can be calculated. Accordingly, the interp-
olation process is avoided and thus the computation and memory requirement are
considerably saved. Several models for error surface function have been developed
with different levels of accuracies and complexities. In Ref. 24, five mathematical
models of error surface are proposed with distinct performances. Among these
models, the first and second models are based on an inverse matrix solution that
have better coding performances, and are more suitable for hardware implemen-
tations. However, the above-mentioned work only supports half-pixel accuracy. To
support the quarter-pixel precision, an algorithm is proposed in Ref. 25. It shows
that the unimodal assumption does not necessarily work for all sub-pixels resol-
ution. That is why the resulted performance of quarter-pixel interpolation free
(a) (b)
Fig. 2. Error surface of (a) integer motion estimation (search range: 32) (b) sub-pixel motion estimation
with 1/8 pixel accuracy (Ref. 23).
A Survey of Algorithms and Architectures for H.264 Sub-Pixel Motion Estimation
1250016-7
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
algorithm is less than the full search SME. To compensate the quality drop, the
authors introduce the \complete-system model", which provides near full search
SME performance at the price of an increased complexity. Please note that since the
interpolation free algorithms avoid the interpolation process, they are suitable for
using in low complexity video encoders as well as low power applications such as
portable devices. However, they are not able to produce residue values and therefore
need an interpolation unit for motion compensation.
Another way for reducing the computation and memory requirement of the in-
terpolation process is to use an interpolation filter with a reduced complexity. H.264
uses a 6-tap finite impulse response (FIR) filter with (1/32, �5/32, 20/32, 20/32,
�5/32 and 1/32) coefficients and a 2-tap bilinear filter with (1/2, 1/2) coefficients
for producing half-pixel and quarter-pixel positions, respectively. On the other
hand, the computation cost and memory requirement of a filter are proportional to
its tap length. Accordingly, the half-pixel interpolation is the bottleneck of the
interpolation process in the H.264 SME. To address this problem, a 4-tap filter with
(�1/8, 5/8, 5/8, and �1/8) coefficients is proposed in Ref. 26. Therefore, the
computation and memory bandwidth are decreased compared with the standard
6-tap filter.
In addition to the tap length of an interpolation filter, the suitability of its
coefficients for hardware/software implementation is another important issue. As
seen above, some of coefficients in the H.264 6-tap FIR filter and the proposed filter
in Ref. 26 are not a power of two. As a result, they need multiplication, addition,
subtraction, and shift operations. Among these operations, multiplication is the
most expensive one and demands more overhead for implementing. To cater to this
issue, a multiplication free filter with 6-tap length is proposed in Ref. 27. In a later
research,28 both the tap-length and the multiplication issues are considered by
introducing a multiplication free 4-tap filter. In the proposed filter, the weight of
half-pixel interpolation filter are 1/2, 1/8, 1/32, and �1/32. As a result, this filter
consists of additions and shift without multiplications. Therefore, the computation
and memory bandwidth of the proposed half-pixel interpolation filter are decreased.
Due to the existence of two-iterative search for the half-pixel and quarter-pixel
refinements in the H.264 SME that are usually carried out in a sequential manner,
the processing time is increased. From the hardware point of view, this problem
poses a restriction on required clock cycles for high-resolution video such as HDTV
and bigger specifications. One solution for this problem is to calculate all 25 half-
and quarter-pixels in parallel as used in Refs. 29 and 30. Compared with conven-
tional two-iteration scheme, the cycle count in is halved and therefore the long
computation latency is reduced and the HD size requirement can be achieved.
However, from the hardware viewpoint, this approach is not suitable for hardware
implementation due to its large silicon area requirement. To bring a solution for this
problem, single iteration algorithm is proposed in Ref. 31. This algorithm predicts
the sub-pixel motion vector (SMV) of current block using the median motion vector
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-8
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
of neighboring block. Then, it examines six positions including (0, 0), predicted
SMV and four diamond points around predicted SMV, as shown in Fig. 3. In this
way, not only the long latency of two-iterative search is reduced by half, but also the
number of the search points is decreased to six points that leads to computation and
memory bandwidth reduction. Compared with interpolate and search algorithm
which is adopted in reference software, the single iteration algorithm has a small
video quality degradation. Reference 32 adopts the single-iteration technique to
reduce the long processing time in high-resolution specification as well as the
computation and memory bandwidth with small video degradation. In Ref. 33, the
authors present single pass fractional motion estimation to improve the coding
performance relative to single-iteration SME algorithm by searching four additional
positions around the (0, 0) in diagonal direction.
Bit-width reduction is another choice for the complexity and memory bandwidth
saving. The main idea is to represent each pixel with a lower number of bits (usually
one or two bits) using bit transformation. As a result, SME can use bit-transformed
data and, therefore, the computation and memory access are reduced. In Ref. 34,
Akbulut et al. propose the first bit transformed SME. After the interpolation of
reference frame with 8 bits/pixel bit-depth, they use 1-bit transform to represent
each pixel with 1 bit. Then, the SME refinement is carried out as usual. However,
they use 8-bit data at the interpolation process and their method adds an extra
complexity due to transforming 8-bit data to 1-bit samples using multi-band pass
filter. Motivated from above drawback, the all-binary SME is introduced in Ref. 35.
In order to reduce the computation of interpolation, 8-bit pixel values are trans-
formed before the interpolation. Furthermore, by replacing the multi-band pass
filter with multiplication free one-bit transform as described in Ref. 36, further
(-1, -1) (0, -1)
(0, 1)
(-1, 1)
(-1, 0)
(1, -1)
(1, 0)
(1, 1)
Integer pixel Predicted SMV
Diamond position
(0, 0)
Fig. 3. The search pattern of single iteration algorithm that includes (0, 0), predicted SMV, and four
diamond positions.
A Survey of Algorithms and Architectures for H.264 Sub-Pixel Motion Estimation
1250016-9
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
computation can be saved. Although the above-mentioned algorithms lead to sig-
nificant reduction in computational complexity and memory bandwidth, which
make them suitable for real-time low power video coding systems, they result in
video quality loss as well. Finally yet importantly, it should be noted for motion
compensation, residue data should be inverted to pixel domain that results in an
increased computation.
4. Exploration of SME Architectures
In this section, the design challenges and strategies of the H.264 SME are discussed.
Besides, the representative SME hardware architectures are surveyed.
4.1. Hardware architecture design: Challenges and strategies
Due to the huge computation and memory access requirement of the H.264 SME,
acceleration of the SME process by dedicated hardware architectures is of vital
importance for real-time application. However, design of such architectures is a
challenging job. The reason is that besides the extra ordinary huge computation and
memory bandwidth, the SME algorithm in H.264 has been developed without
architectural considerations. In the H.264 reference software, the SME is based on a
sequential flow with a great deal of data dependency. From the hardware point of
view, the sequential operations (processing) restrict the parallel processing, which
in turn reduce the processing power. Besides, the data dependency increases the
memory requirement at the price of an added storage space, area cost, and power
consumption as well. Therefore, the data flow of the full search SME algorithm
should be carefully reordered and optimized to permit concurrent operations with a
reduced data dependency, which consecutively enables the parallel processing under
the restrictions of sequential SME procedure. An example for analysis of the H.264
SME flow can be found in Ref. 37, where the authors simplify its complex encoding
procedure into several encoding loops. In addition, they propose two decomposing
techniques to parallelize the SME algorithm while maintaining high hardware
utilization and achieving data reuse.
The main building blocks of the H.264 SME can be listed as follows: Interpolation
unit, processing unit that is responsible for search and cost calculation, and mode
decision. Among them, the interpolation unit is computationally intensive while
processing and mode decision units demand less computation. Therefore, design of
hardware architecture is performed by a particular emphasis on interpolation unit
as the most computational intensive unit. However, design of efficient hardware
architectures for less computational demanding units is still important as they can
affect the overall performance of the design.
The main design challenges for the interpolation unit are hardware utilization
and huge memory bandwidth that arise from the VBS, and the 6-tap FIR filter
that is used for generating half-pixels. Due to the seven different block sizes, the
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-10
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
interpolation unit should be carefully designed to achieve maximum hardware
utilization. The common solution for this issue is the use of 4� 4 interpolation unit.
As all bigger blocks can be decomposed into 4� 4 blocks, the complete hardware
utilization can be achieved. However, since each 4� 4 block needs a 9� 9 interp-
olation window, the 4� 4-based interpolation increases the memory bandwidth. On
a contrary way, the use of a 16� 16 interpolation unit leads to the lowest memory
bandwidth, which is due to the covering of all block sizes, at the price of a low
hardware utilization. Examples that are based on 4� 4 and 16� 16 interpolations
are Refs. 38 and 39, respectively. As for the FIR filter, the previous reported
research in the literature such as Refs. 40 and 41 are not suitable for the H.264
interpolation unit because of their too low input bandwidth.42 In Ref. 43, an
architecture for the H.264 interpolation unit is presented that has a 6-tap FIR filter
with a pipelined architecture which increases its processing ability. In Ref. 44, by
exploiting the mixed use of parallel and serial-input FIR filters, a high throughput
rate and efficient silicon utilization are achieved so that the proposed design can
support SDTV and HDTV applications.
Regarding SATD unit that improves the coding performance, there are several
architectures in the literature. The first SATD architecture is proposed by Wang
et al.,45 which contains two 1D transform units and one transpose register array.
This architecture uses a fast algorithm presented in Ref. 46, which only contains add
and shift and addition operations where shift operation is done by wiring leading to
small hardware cost. In Ref. 47, present authors explore the design space of
Hadamard transform and present five implementation alternatives with different
performance in terms of throughput and hardware cost which are suitable for
different applications. Another excellent Hadamard transform architecture can be
found in Ref. 48, which the interested readers are referred to it.
On the subject of less computation demanding units, the main tasks in these
units can be considered as absolute operation and inter mode decision. The dis-
cussion on this subject is not provided here; however, the interested readers are
referred to the following references for more information. The efficient architectures
for absolute operation can be found in Ref. 49. As for the inter mode decision, it is
difficult to find an architecture design with details in literature. However, a detailed
example can be found in Ref. 50.
In addition to the hardware architectures for the SME building blocks, efficient
memory design is another important issue. The reason is that the SME uses the
data-intensive interpolation method to generate sub-pixels within up five reference
frames that calls for a great deal of memory access (MA) through the frame memory.
In Ref. 51, the author calculates the memory access requirement of interpolate and
search sub-pixel motion estimation algorithm in H.264 reference software in terms of
the required number of bits that one n�m block has to access memory. Table 5 lists
the memory access requirement of H.264 SME for five specifications i.e., case (a) to
case (e). As an example, for HD1080 @ 30 fps, all block sizes and five reference
A Survey of Algorithms and Architectures for H.264 Sub-Pixel Motion Estimation
1250016-11
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
frames, i.e., case (d), the memory access requirement is 477,067 Mbits. From a
hardware viewpoint, this huge memory access leads to a high I/O data bus traffic
and interconnection requirements, which have a significant effect on the perform-
ance of the hardware architecture. Besides, it results in more power consumption.
The common solution for this problem is data reusing technique.52�54 In this
technique, the temporal data locality is exploited and the previously accessed data
is reused in future, which result in data access reduction. Accordingly, the power
consumption is decreased. The data reuse method can be performed offline or online.
In the offline scenario, frequently accessed data is stored in on-chip memory and will
be recycled offline in future access. In this way, the data access from the frame
memory can be decreased. In the online data reuse scheme, the current accessed
data is shared between different places that work in parallel and, therefore, the data
access reduction is achieved. In Ref. 55, an effective data reuse methodology is
introduced, which formulates the SME algorithm as the nested loops structures and
explores data locality by the use of loop analysis. Based on this technique, 95.7% of
memory access is reduced compared with no data reuse case, which leads to a lot of
saving in memory access power. In Ref. 56, the authors propose a hardware im-
plementation for interpolation unit that uses a single-step interpolation method
along with a data reusing scheme, which can save 25% of memory bandwidth, and
45% of processing cycles.
4.2. Hardware architectures of the H.264 SME
This subsection surveys the state-of-the-art SME architectures and classifies them
into full and fast SME architectures. A full SME architecture is based on the full
search SME algorithm whereas a fast SME architecture is a hardware design of a
fast SME algorithm. Four full SME architectures (i.e., see Refs. 37�39 and 57�59)
and six fast search SME architectures (i.e., see Refs. 20, 23, 31, 33, 60 and 61) are
selected as the representative designs that the details as well as the comparisons of
them are provided in the following paragraphs.
Chen et al.38 propose the first H.264 SME architecture. To bring a solution for
the complex sequential procedure flow of the H.264 SME, they present the 4� 4
block decomposition and vertical integration techniques to enable the mapping of
Table 5. The memory access of the interpolate and search method and the NFM
algorithm for some specifications.51
Case Specifications Features MAinterpolate & search
(a) QCIF (15 fps) VBS (all), MRF (5) 2; 915:4� 106
(b) CIF (30 fps) VBS (all), MRF (5) 23; 323� 106
(c) SD480p (30 fps) VBS (all), MRF (5) 79; 511:2� 106
(d) HD1080p (30 fps) VBS (all), MRF (5) 477; 067:3� 106
(e) HD1080p (30 fps) VBS (8� 8�16� 16), MRF (2) 108; 090:2� 106
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-12
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
the SME flow in hardware with features of regular schedule, full utilization, parallel
processing, and reusability. As shown in Fig. 4, the main important parts of Chen’s
architecture are interpolation engine, nine 4� 4 processing units, and Lagrangian
mode decision engine. The interpolation engine produces half-pixels and quarter-
pixels by using one-dimensional (1-D) FIR and bilinear filters, respectively. The
4� 4 processing units generate residues and SATD values. Since there are nine
positions in each sub-pixel refinement, nine 4� 4 block processing units are used.
The \Lagrangian mode-decision" is responsible for finding the best mode among
41 modes in each macroblock (MB). The area cost of the proposed design is 79K
gates and it can support SDTV format 30 frames per second (fps), one reference
frame and all MB modes at the clock frequency of 100MHz. The maximum per-
formance of design is 52 k MB/s.
Based on Ref. 38, Yang et al.39 introduce a high performance architecture of the
H.264 SME for HDTV application with two new techniques. First, differing from
Chen’s design that uses a 4-pixel interpolation unit, they propose a 16-pixel in-
terpolation to increase processing capability. However, four times increase of the
data width of the interpolation and the search engine only bring 52.5% of clock cycle
saving. Second, they replace Chen’s design 1-D filters with diagonal filter and fully
pipeline 1-D filters in the interpolation unit. As a result, the delay path is decreased
and a higher clock frequency can be achieved without a considerable increase of area
cost. The clock frequency of Yang’s design is 285MHz that supports HD1080p
format in 30 fps for one reference frame. The area cost and the maximum
throughput of the design are 189k gates and 250MB/s, respectively.
Fig. 4. Block diagram of Chen’s design.
A Survey of Algorithms and Architectures for H.264 Sub-Pixel Motion Estimation
1250016-13
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
In Refs. 57 and 58, Wu et al. present a three-engine parallel SME architecture to
support high-resolution application. To cope with the high bandwidth input data
problem, which leads to a high bandwidth memory and a high hardware cost, they
use a reference pixel scheduling method and an efficient memory organization to
ensure that each engine can fetch the right reference pixel at the right time with
minimum hardware cost. In addition, they dispatch 41 sub-blocks of each MB to the
three engines in a load-balancing and interlaced ways to increase the throughput.
Furthermore, they use a resource sharing scheme for SATD generators to reduce the
number of the required SATD generators from three units to two units. The pro-
posed SME architecture takes 311.7k gates and can support HD1080 30 fps at clock
frequency of 154MHz.
Ruiz and Michell59 present a new hardware architecture for full search SME
algorithm which is made up of three different pipeline processors including half-pixel
processor, quarter-pixel processor, and mode decision processor. Similar to the work
of Chen et al.,38 the proposed architecture uses 4-pixel interpolation unit to increase
hardware utilization. To reduce the hardware cost, Ruiz and Michell use SAD as the
distortion criterion instead of SATD at the price of small video degradation. In this
way the hardware cost of Hadamard Transform is saved (i.e., about 85% gate count of
4� 4 processing unit, see Ref. 37). The hardware cost of the proposed architecture is
only 11.4K gates and supports HD1080 resolution at a clock frequency of 250MHz.
However, since the proposed architecture uses 6-tap filter for half-pixel interpolation
filter, it seems that the hardware cost of this architecture should be greater than the
reported one i.e., 11.4K gates. The reason is that according to the gate count profile in
Ref. 37, the hardware cost of similar interpolation unit is 23.872K gates and the silicon
area of 4� 4 processing unit without Hadamard Transform is 5.2K gates whereas the
hardware cost of mode decision and quarter-pixel parts should be considered as well.
Based on Ref. 38, Wang et al.23 contribute a fast SME architecture. Their
architecture is a hardware realization of a fast algorithm with a reduced number of
search points. As a result, the number of 4� 4 processing units decrease to 5 instead
of 9 in Chen’s design. Compared with Chen’s work, it saves 40% and 14% of area
cost and searching time at the price of less than 0.2 dB PSNR quality drop. How-
ever, due to the use of 4-pixel interpolation units and 1-D FIR filters, it has a limited
power processing and can only support SDTV format not higher resolutions.
Song et al.20 propose a fractional motion estimation engine for HD1080p resol-
ution. To achieve real-time encoding, they present three techniques. First, they
remove all modes below 8� 8 block and support one reference frame. Therefore,
88.6% of the computation is saved with 0.1 dB loss. Second, they use two reusing
methods namely lossless inside-mode and cross-mode,62 which save about 65% pixel
generation and SATD calculation. Third, they remove the pipeline bubbles between
the half-pixel and the quarter-pixel stages by optimizing the SME scheduling. The
area cost of the proposed engine is 203.2K gates and it can encode real-time
HDTV1080p at 30 fps under the clock frequency of 200MHz.
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-14
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
In Ref. 31, Kuo et al. present a SME architecture for HDTV and high profile.
They propose two techniques for solving the associated problems with the HDTV
video size and high profile. First, the proposed architecture is based on the single-
iteration algorithm. Consequently, its processing time is halved compared with
conventional two-step full search algorithms. In addition, its search candidates are
reduced to 6 points that leads to reduction of processing units and, thus, hardware
cost as well. Second, since the high profile of H.264 uses the costly 8� 8 SATD, they
replace it with 4� 4 SATD that leads to a reduction in area cost. The proposed
architecture supports HD720p 30 fps with a clock frequency of 70MHz, 62.2K gates,
and 0.043 dB PSNR video quality degradation.
In a later work,60 Tsung et al. contribute an SME architecture for quad full high
definition (QFHD) that is based on the single-iteration algorithm. To cater to the
huge computation and memory bandwidth requirement of the QFHD specification,
several optimization techniques are used. First, the 6-tap interpolation filter is
replaced by a bilinear filter and, thus, memory bandwidth and area cost are saved.
Second, based on the distributive law in SATD, the residues in the quarter-pixel
SME are estimated while the half-pixel residues are known. Therefore, the calcu-
lations of sub-pixels are decreased. Third, a cache-based memory63 is used that
results in bandwidth reduction. The clock frequency and throughput of the proposed
design are 280MHz and 1659K MB/s, which are sufficient for the 4096� 2160
QFHD real-time processing.
Another architecture for QFHD has been recently reported by Kim et al.33 The
proposed architecture is based on single pass fractional motion estimation which
directly searches only surroundings’ positions of both the predicted fractional
motion vector and the search centre. To reduce the hardware cost, authors present
bit clipping scheme that is applied to processing units decreasing the hardware cost
by 25%. In this method, the required bit length of residue data is reduced from nine
to six, which is result of hardware cost reduction in processing units. The exper-
imental results show that bit clipping method has a negligible effect on video
quality. In addition, to increase the throughput all block modes below 8� 8 block
are culled in the proposed design. This design can process real time (i.e., 24 fps)
QFHD video using 134K gates with an operating frequency of 250MHz.
Power aware design has been recently emerged as a rising trend in video coding
systems by which power consumption in multimedia devices can be adjusted
according to certain conditions such as battery status, working time, and coding
performance. Chen et al.61 propose the first power aware architecture for the H.264
SME that is based on advanced mode pre-decision (AMPD) algorithm.38 In order to
support power aware functionality, after the IME refinement, their algorithm stores
the cost of seven partitions including four 8� 8, two 16� 8 and 8� 16, and one
16� 16 partitions from low (best) to high (worst). N (N ¼ 1 � 7) of best partitions
can be selected in the SME. WhenN is equal to seven, the AMPD algorithm has the
same performance with the full search SME. When N is equal one, it gives the
A Survey of Algorithms and Architectures for H.264 Sub-Pixel Motion Estimation
1250016-15
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
lowest performance. By changing N, the AMPD algorithm can provide a trade-off
between power consumption and coding performance. The area cost of the proposed
design is 125K gates with maximum operation frequency of 27MHz. For real-time
encoding of CIF 30 fps videos with two MRF, the power consumption of
22.58�1.64mW can be efficiently varied at the expense of 0.1�3.9 dB quality drop.
Table 6 shows the comparisons of aforementioned SME architectures. From this
table, three trends can be summarized as follows. First, under similar throughput
condition, a full SME architecture has a higher area cost than a fast SME archi-
tecture that is rewarded with the highest video quality. Second, most of the SME
designs belong to the fast SME architecture due to stringent real-time constrains.
Finally, the single-iteration-based fast SME architectures such as Kuo et al.’s,31
Tsung et al.’s,60 and Kim et al.’s33 are suitable candidates for using in high-resol-
ution application with a negligible quality drop and reasonable hardware cost.
Finally yet importantly, it is worth to note that the surveyed architectures in
this subsection are based on application specific integrated circuit (ASIC) designs
whereas there are other design alternatives such as digital signal processors (DSPs),
media processors, and field programmable gate arrays (FPGAs). ASIC designs show
the best performance with little flexibility whereas processor and FPGA-based
designs provide better flexibility and lower performance.64 Therefore, hardware
engineers can use DSPs, media processors or FPGAs for implementation of fast
SME designs, especially for small and medium video resolutions that require low
and moderate performances. The reason is that fast SME algorithms (such as
AMPD38 and single iteration pass algorithms31,33) require lower computational
complexity and memory bandwidth compared with full search SME algorithms.
5. Further Perspectives and Future Prospects
With rapid advances in multimedia processing and semiconductor technologies, new
multimedia applications and standards will keep emerging, which will unsurprisingly
Table 6. Comparisons of the state-of-art SME architectures.
Ref
Technology
(�m)
Area
(K gates)
CLK freq
(MHz)
Throughput
(MB/s)
Power
(mW)
Video
resolution
Quality
drop (dB)
Chen et al.38 0.18 79.3 100 49K — SDTV 0
Yang et al.39 0.18 189 200 250K — HD1080 0
Wu et al.57 0.13 311.7 154 250K — HD1080 0
Ruiz and Michell59 0.18 11.4 290 — — HD1080 —
Wang et al.32 0.18 48 100 50K — SDTV 0.1�0.2
Song et al.20 0.18 203.2 200 250K 236 HD1080 0.1
Kuo et al.31 0.18 62.2 70 71.3K — HD720 0.043
Tsung et al.60 0.09 448 280 1659K — QFHD 0.02
Kim et al.33 0.13 134 250 977K QFHD —
Chen et al.61 0.18 125 27 23.7K 22.58�1.64 CIF 0.1�3.9
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-16
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
pose new design challenges. As the SME is an important part of current and future
advanced video coding systems, the same scenario will be repeated for it. In the
following paragraphs, the SME design challenges in some of the emerging multimedia
applications and the next advanced video coding standard, currently known as \High
Efficiency Video Coding" (HEVC)/H.265,65 are reviewed.
Digital video systems will continue to evolve towards higher resolutions to
provide better realistic images on displays. Ultra-high definition (UD) TV66,67 is an
emerging visual media, which undoubtedly requires a much higher computation and
memory bandwidth than current applications. For example, the required
throughput of real-time SME for the UD format (7680� 4320 pixels) is 16 times the
HD format (1920� 1080 pixels) under the same frame rate condition.
Another interesting emerging application is three-dimensional (3-D) TV that
provides a 3-D impression of 3-D natural scenes by using multiview video sequences.
Such sequences are produced by capturing a 3-D scene using multiple cameras
simultaneously. The SME for multiview sequences suffers from two major problems.
First, it demands a lot of computation and memory bandwidth that are directly
proportional to the number of capturing cameras. Second, due to the redundancy
between closer cameras, compressing multiview sequences independently with
H.264 is not efficient.68 As a result, there may need new SME algorithms to consider
the redundancy problem that cause an added complexity in the algorithm flow.
Although \the H.264 SME typically leads to 1�4.2 dB PSNR video quality
improvement or 19%�50% bit rate saving",11 the demands for higher coding per-
formance are relentless. Consequently, development of new SME algorithms will
ceaselessly continue. Furthermore, the SME with enriched and improved features in
future video coding standards will keep arriving, which will unsurprisingly increase
the computation and memory bandwidth, data dependency and irregularity of
computation flow. For instance, the SME in H.265 will use adaptive interpolation
filter instead the H.264 interpolation filter with fixed weights to improve motion
compensation prediction. That is because the time variation of image signal is not
taken into account in the H.264 interpolation filters and, consequently, the pre-
diction efficiency of motion compensation is limited. In the adaptive interpolation,
based on the characteristic and statistic of each image, a set of coefficients for the
interpolation filter is calculated which improves the coding performance at the price
of an added computation and irregularity.69,70 In addition, \supermacroblock"
structure up to 64� 64 size will be used in H.265 that brings an added computation
and memory access compared with H.264.
Derived from the above discussion, the SME in the emerging and future multi-
media applications will be characterized by massive computation, memory band-
width, data dependency, and irregular computation flow, which will bring new
challenges for the SME hardware architecture design. On the other hand, the
continuous advances of microelectronic technology will provide more processing
capability, which in turn will facilitate the hardware implementation of SME.
A Survey of Algorithms and Architectures for H.264 Sub-Pixel Motion Estimation
1250016-17
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
However, conventional architectural approaches may not able to fulfill the hard
real-time constrains of the emerging and future applications. As a result, besides of
the conventional and current advanced algorithmic and architectural techniques
that have been described in this paper, the SME designers may need to develop new
and innovative methods for low cost, low power, and efficient SME hardware
realization. Furthermore, the SME hardware designer should consider some new
challenges that will be brought by advanced very-large-scale integration (VLSI)
technologies such as high static (leakage) power consumption and interconnection
problems. In current and future VLSI designs, as process technology steadily scaling
down to deep submicron area; that is 90 nm and beyond, the static power that is
proportional to the hardware cost, will grow much faster than dynamic power and it
will be the main contributor to total power consumption. For more information on
static power, interested readers are referred to Refs. 71 and 72. In addition, in deep
submicron area, the interconnection parameter will be a very important design
parameter and it will be determining cost, delay, power, reliability, and turn-
around time of the future LSI’s rather than MOSFET’s.73 On the other hand, as the
future SME applications are data intensive with a high bandwidth memory, par-
allelism at hardware level will be a must for meeting the real-time constrains and a
lot of storage space will be required for data saving. Consequently, the hardware
cost and the number of wide length interconnections will be increased, which will
lead to increasing in static power and interconnection density. Therefore, the use of
static power reduction techniques together with reducing of area cost and memory
bandwidth, which result in (static) power and interconnection reduction, are vitally
important for future advanced SME architectures especially for portable multi-
media devices with limited battery power. Last but not least, it seems that emerging
and future SME designs will be based on ASIC rather than other design alternatives
(i.e., DSP, FPGA, and so on) to fulfill the extraordinary huge computational
complexity and memory bandwidth requirement of SME in upcoming video coding
systems.
Acknowledgment
The authors would like to thank all reviewers for their helpful suggestions and
comments that improved the quality of the revised manuscript. This work was
supported in part by the Ministry of Higher Education, Malaysia, under Grant
FRGS FP094/2007c.
References
1. Joint Video Team (JVT) of ITU-T VCEG and ISO/IEC MPEG, Draft ITU-T Rec-ommendation and Final Draft International Standard of Joint Video Specification,ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC (2003).
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-18
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
2. T. Wiegand, G. J. Sullivan, G. Bjontegaard and A. Luthra, Overview of the H.264/AVC, video coding standard, IEEE Trans. Circuits Syst. Video Technol. 13 (2003)560�576.
3. Information technology — generic coding of moving pictures and associated audio in-formation: Video, ISO/IEC 13 818-2 and ITU-T Rec. H.262 (1996).
4. Video Coding for Low Bit Rate Communication, ITU-T Rec. H.263 (1998).5. Information technology — coding of audio-visual objects — part 2: Visual, ISO/IEC 14
496-2 (1999).6. A. Joch, F. Kossentini, H. Schwarz, T. Wiegand and G. J. Sullivan, Performance
comparison of video coding standards using lagrangian coder control, Proc. IEEE Int.Conf. Image Process. (2002), pp. 501�504.
7. T. Wiegand, X. Zhang and B. Girod, Long-term memory motion-compensated predic-tion, IEEE Trans. Circuits Syst. Video Technol. 9 (1999) 70�84.
8. G. J. Sullivan and T. Wiegand, Rate-distortion optimization for video compression,IEEE Signal Process. Mag. 15 (1998) 74�90.
9. T. Wiegand, H. Schwarz, A. Joch, F. Kossentini and G. J. Sullivan, Rate-constrainedcoder control and comparison of video coding standards, IEEE Trans. Circuits Syst.Video Technol. 13 (2003) 688�703.
10. D. Marpe, H. Schwarz and T. Wiegand, Context-adaptive binary arithmetic coding inthe H.264/AVC video compression standard, IEEE Trans. Circuits Syst. Video Technol.13 (2003) 620�636.
11. M. R. H. Fatemi, H. F. Ates and R. Salleh, Fast algorithm analysis and bit-serialarchitecture design for sub-pixel motion estimation in H.264, J. Circuits Syst. Comput.19 (2010).
12. P. Pirsch, N. Demassieux and W. Gehrke, VLSI architectures for video compression-asurvey, Proc. IEEE 83 (1995) 220�246.
13. F. Dufaux and F. Moscheni, Motion estimation techniques for digital TV: A review and anew contribution, Proc. IEEE 83 (1995) 858�876.
14. P. Pirsch and H.-J. Stolberg, VLSI implementations of image and video multimediaprocessing systems, IEEE Trans. Circuits Syst. Video Technol. 8 (1998) 878�891.
15. P.-C. Tseng, Y.-C. Chang, Y.-W. Huang, H.-C. Fang, C.-T. Huang and L.-G. Chen,Advances in hardware architectures for image and video coding— a survey, Proc. IEEE93 (2005) 184�197.
16. Y.-W. Huang, C.-Y. Chen, C.-H. Tsai, C.-F. Shen and L.-G. Chen, Survey on blockmatching motion estimation algorithms and architectures with new results, J. VLSISignal Process. 42 (2006) 297�320.
17. Joint Video Team Reference Software version 11.0 (2007), http://iphome.hhi.de/suehring/tml/download/old jm/.
18. G. Bjontegaard, Calculation of average PSNR differences between R-D curves, Docu-ment VCEG-M33 (2001).
19. Y.-W. Huang, B.-Y. Hsieh, S.-Y. Chien, S.-Y. Ma and L.-G. Chen, Analysis and com-plexity reduction of multiple reference frames motion estimation in H.264/AVC, IEEETrans. Circuits Syst. Video Technol. 16 (2006) 507�522.
20. Y. Song, M. Shao, Z. Liu, S. Li, L. Li, T. Ikenaga and S. Goto, H.264/AVC fractionalmotion estimation engine with computation reusing in HDTV1080p real-time encodingapplications, IEEE SiPS07, Shanghai, China (2007), pp. 509�514.
21. C. L. Su, Yang, W. S. Li, Y. Chen, C. W. Guo and S. Y. Tseng, Low complexity highquality fractional motion estimation algorithm and architecture design for H.264/AVC,Proc. IEEE Asia Pacific Conf. Circuits Syst. (2006), pp. 578�581.
A Survey of Algorithms and Architectures for H.264 Sub-Pixel Motion Estimation
1250016-19
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
22. A. Abdelazim, M. Yang and C. Grecos, Fast subpixel motion estimation based on theinterpolation effect on different block sizes for H264/AVC, Opt. Eng. Lett. 48 (2009).
23. Y. J. Wang, C. C. Cheng and T. S. Chang, A fast algorithm and its VLSI architecture forfractional motion estimation for H.264/MPEG-4 AVC video coding, IEEE Trans. Cir-cuits Syst. Video Technol. 17 (2007) 578�583.
24. J. W. Suh and J. Jeong, Fast Subpixel motion estimation techniques having lowercomputational complexity, IEEE Trans. Consum. Electron. 50 (2004) 968�973.
25. P. R. Hill, T. K. Chiew, D. R. Bull and C. N. Canagarajah, Interpolation free subpixelaccuracy motion estimation, IEEE Trans. Circuits Syst. Video Technol. 16 (2006)1519�1526.
26. R. Wang, C. Huang, J. Li and Y. Shen, Sub-pixel motion compensation interpolationfilter in AVS, Proc. IEEE Int. Conf. Multimedia and Expo. 1 (2004) 93�96.
27. C.-Y. Tsai, T.-C. Chen, T.-W. Chen and L.-G. Chen, Bandwidth optimized motioncompensation hardware design for H.264/AVC HDTV decoder, Proc. IEEE MidwestSymp., Circuit System, Vol. 2 (2005), pp. 1199�1202.
28. C. J. Hyun and M. H. Sunwoo, Low power complexity-reduced ME and interpolationalgorithms for H.264/AVC, J. Signal Process. Syst. 56 (2009) 285�293.
29. N. T. Ta, J. H. Kim and J. R. Choi, Fully parallel fractional motion estimation forH.264/AVC encoder, ICIS 4 (2009) 306�306.
30. Y.-H. Chen et al., An H.264/AVC scalable extension and high profile HDTV 1080pencoder chip, Proc. IEEE SOVC (2008).
31. T.-Y. Kuo, Y.-K. Lin and T.-S. Chang, SIFME: A single iteration fractional-pel motionestimation algorithm and architecture for HDTV sized H.264 video coding, Proc. IEEEInt. Conf. Acoustics, Speech, and Signal Processing, Hawaii, USA (2007), pp.1185�1188.
32. Y.-K. Lin et al., A 242 mw 10mm2 1080p H.264/AVC high profile encoder chip, Proc.IEEE ISSCC (2008), pp. 314�315.
33. G. Kim, J. Kim and C.-M. Kyung, A low cost single-pass fractional motion estimationarchitecture using bit clipping for H.264 video codec, ICME (2010), pp. 661�666.
34. O. Akbulut, O. Urhan and S. Ertürk, Fast Sub-Pixel Motion Estimation By Meansof One-Bit Transform, Lecture Notes in Computer Science (LNCS), Vol. 4263 (2006),pp. 503�510.
35. A. Celebi, O. Akbulut, O. Urhan, I. Hamzaoglu and S. Erturk, An all binary sub-pixelmotion estimation approach and its hardware architecture, IEEE Trans. Consum.Electron. 50 (2008) 1928�1937.
36. S. Ertürk, Multiplication-free one-bit transform for low-complexity block-based motionestimation, IEEE Signal Process. Lett. 14 (2007) 109�112.
37. Y.-H. Chen, T.-C. Chen, S.-Y. Chien, Y.-W. Huang and L.-G. Chen, VLSI architecturedesign of fractional motion estimation for H.264/AVC, J. Signal Process. Syst. 53 (2008)335�347.
38. T.-C. Chen, Y.-W. Huang and L.-G. Chen, Fully utilized and reusable architecture forfractional motion estimation of H.264/AVC, IEEE Int. Conf. Acoustics, Speech, andSignal Processing (ICASSP) Vol. 5 (2004), pp. 9�12.
39. Y. Yang, S. Goto and T. Ikenaga, High performance VLSI architecture of fractionalmotion estimation in H.264 for HDTV, Proc. IEEE ISCAS (2006), pp. 2605�2608.
40. J. Park, K. Muhammad and K. Roy, High performance FIR filter design based onsharing multiplication, IEEE Trans. VLSI Syst. 11 (2003) 244�253.
41. H. Samueli, On the design of optimal equiripple FIR digital filters for data transmissionapplications, IEEE Trans. Circuits Syst. 35 (1988) 1542�1546.
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-20
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
42. D. Lei, G. Wen, H. M. Zeng and Z. Z. Ji, An efficient VLSI architecture for MCinterpolation in AVC video coding, Int. Multiconf. in Computer Science and ComputerEngineering, Las Vegas, Nevada, USA (2004).
43. Y. Song, Z. Liu, S. Goto and T. Ikenaga, A VLSI architecture for motion compensationinterpolation in H.264/AVC, IEEE ASICON05, Shanghai, China (2005), pp. 262�265.
44. L. Lu, J.V. McCanny and S. Sezer, Subpixel interpolation architecture for multistandardvideo motion estimation, IEEE Trans. Circuits Syst. Video Technol. 19 (2009)1897�1901.
45. T. C. Wang, Y. W. Huang, H. C. Fang and L. G. Chen, Parallel 4x4 2D transform andinverse transform architecture for MPEG-4 AVC/H.264, Proc. of ISCAS (2003).
46. H. Malvar, A. Hallapuro, M. Karczewicz and L. Kerofsky, Low-complexity transformand quantization with 16-bit arithmetic for H.26L, Proc. ICIP 2002 2 (2002) 489�492.
47. M. S. Porto, T. L. da Silva, R. E. C. Porto, L. V. Agostini, I. V. da Silva and S. Bampi,Design space exploration on the H.264 4� 4 Hadamard transform, 23rd NORCHIPConf. (2005).
48. Z.-Y. Cheng, C.-H. Chen, B.-D Liu and J.-F. Yang, High throughput 2-D transformarchitectures for H.264 advanced video coders, Proc. 2004 IEEE Asia-Pacific Conf.Circuits and Systems (2004).
49. J. Vanne, E. Aho, T. D. Hamalainen and K. Kuusilinna, A high-performance sum ofabsolute difference implementation for motion estimation, IEEE Trans. Circuits Syst.Video Technol. 16 (2006) 876�883.
50. H.-C. Kuo, Efficient VLSI architectures for inter and intra mode decision in high res-olution H.264/MPEG-4 part10 AVC encoding, Master thesis, National Tsing HuaUniversity, Hsinchu, Taiwan (2006).
51. M. R. H. Fatemi, Algorithm optimization and low cost bit-serial architecture design forinteger pixel and sub-pixel motion estimation in H.264/AVC, Ph.D. thesis, University ofMalaya, Kuala Lumpur, Malaysia (2012).
52. S. Wuytack, F. Catthoor, L. Nachtergaele and H. D. Man, Power exploration for datadominated video applications, Proc. IEEE Int. Conf. Low Power Electronics and Design(ISLPED) (1996).
53. S. Wuytack, J.-P. Diguet, F. V. M. Catthoor and H. J. D. Man, Formalized methodologyfor data reuse exploration for low-power hierarchical memory mappings, IEEE Trans.VLSI Syst. 6 (1998) 529�536.
54. D. Soudris, N. D. Zervas, A. Argyriou, M. Dasygenis, K. Tatas, C. E. Goutis andA. Thanailakis, Data-Reuse and Parallel Embedded Architectures for Low-Power, Real-Time Multimedia Applications, Lecture Notes in Computer Science, Vol. 1918 (Springer,2000), pp. 243�254.
55. Y.-H. Chen, T.-C. Chen, C.-Y. Tsai, S.-F. Tsai and L.-G. Chen, Data reuse explorationfor low power motion estimation architecture design in H.264 encoder, J. Signal Process.Syst. 50 (2008) 1�17.
56. Z. Haihua, X. Zhiqi and C. Guanghua, VLSI implementation of sub-pixel interpolatorfor H.264/AVC encoder, High Density Packaging and Microsystem Integration (2007),pp. 1�3.
57. C.-L. Wu, C.-Y. Kao and Y.-L. Lin, A high-performance three-engine architecture forH.264/AVC fractional motion estimation, IEEE Int. Conf. Multimedia and Expo (2008),pp. 133�136.
58. C.-L. Wu, C.-Y. Kao and Y.-L. Lin, A high-performance three-engine architecture forH.264/AVC fractional motion estimation, IEEE Trans. VLSI Syst. 18 (2010) 662�666.
A Survey of Algorithms and Architectures for H.264 Sub-Pixel Motion Estimation
1250016-21
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.
59. G. A. Ruiz and J. A. Michell, An efficient VLSI architecture of fractional motion esti-mation in H.264 for HDTV, J. Signal Process. Syst. 60 (2010).
60. P.-K. Tsung, W.-Y. Chen, L.-F. Ding, C.-Y. Tsai, T.-D. Chuang and L.-G. Chen,Single-iteration full-search fractional motion estimation for quad full HD H.264/AVCencoding, Int. Conf. Multimedia & Expo (ICME) (2009).
61. T.-C. Chen, Y.-H. Chen, C.-Y. Tsai and L.-G. Chen, Low power and power awarefractional motion estimation of H.264/AVC for mobile applications, IEEE Int. Symp.Circuits and Systems (ISCAS) (2006).
62. M. Shao, Z. Y. Liu, S. Goto and T. Ikenaga, Lossless VLSI oriented full computationreusing algorithm for H.264/AVC fractional motion estimation, IEIEC Trans. Funda-mentals 90-A (2007) 756�763.
63. P.-K. Tsung, W.-Y. Chen, L.-F. Ding, S.-Y. Chien and L.-G. Chen, Cache-based integermotion/disparity estimation for quad-HD H.264/AVC and HD multiview video coding,IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP) (2009).
64. Y.-L. Lin (ed.), Essential Issues in SOC Design (Springer, 2006).65. http://www.vcodex.com/h265.html.66. M. Sugawara, M. Kanazawa, K. Mitani, H. Shimamoto, T. Yamashita and F. Okano,
Ultrahigh-definition video system with 4000 scanning lines, SMPTE Mot. Imag. J. 112(2003) 339�346.
67. E. Nakasu, Y. Nishida, M. Maeda, M. Kanazawa, S. Yano, M. Sugawara, K. Mitani,K. Hamasaki and Y. Nojiri, Technical development toward implementation of ultrahigh-definition TV system, SMPTE Mot. Imag. J. 116 (2007) 279�286.
68. C. Bilen, A. Aksay and G. B. Akar, A multi-view video codec based on H.264, IEEE Int.Conf. Image Processing (ICIP) (2006), pp. 541�544.
69. S. Wittmann and T.Wedi, Separable adaptive interpolation filter for video coding, Proc.IEEE ICIP, San Diego, CA (2008), pp. 2500�2503.
70. Y. Vatis and J. Ostermann, Adaptive interpolation filter for H.264/AVC, IEEE Trans.Circuits Syst. Video Technol. 19 (2009) 179�192.
71. N. S. Kim et al., Leakage current: Moore’s law meets static power, Computer 36 (2003)68�75.
72. W. M. Elgharbawy and M. A. Bayoumi, Leakage sources and possible solutions innanometer CMOS technologies, IEEE Circuits Syst. Mag. 5 (2005) 6�17.
73. T. Sakurai, Design challenges for 0.1 um and beyond: Embedded tutorial, ASP-DAC2000, Yokohama, Japan (2000), pp. 553�558.
M. R. H. Fatemi, H. F. Ates & R. Salleh
1250016-22
J C
IRC
UIT
SY
ST C
OM
P 20
12.2
1. D
ownl
oade
d fr
om w
ww
.wor
ldsc
ient
ific
.com
by U
NIV
ER
SIT
Y O
F H
ON
G K
ON
G o
n 09
/29/
13. F
or p
erso
nal u
se o
nly.