High Speed Systolic Array Structure for Variable Block Size Motion Estimation

Vinod Reddy {[email protected]}

Abstract:

A new systolic-array-based architecture for variable block size motion estimation is presented in this paper. The proposed architecture is scalable to various block sizes. A high-speed systolic array is designed for sum of absolute differences (SAD) calculation on 4x4 blocks. High speed is achieved by grouping 4 pixels into a single large pixel, since the absolute differences can be calculated simultaneously for all the pixels in a block. Variable block size SADs for an 8x8 block are obtained by operating four systolic arrays in parallel. An efficient row adder tree generates the SADs for the 4x8, 8x4, and 8x8 block sizes by reusing the 4x4 SADs from the outputs of the systolic arrays. The presented architecture also reduces the data bandwidth by reusing pixels efficiently, without reading the same pixels twice from the search window. A VLSI implementation of the systolic array achieved a clock frequency of 500 MHz when synthesized with Synopsys Design Compiler targeting a 90 nm LSI standard cell library.

1. Introduction:

A video sequence usually contains a significant amount of temporal redundancy. The block matching algorithm (BMA), based on motion estimation and compensation, is widely used in video coding standards such as H.26x and MPEG-1, -2, and -4 to remove temporal redundancy. The fixed block size block matching algorithm (FBSBMA) divides the current frame into macroblocks and searches for the best-matching block within a search range in a reference frame. With FBSBMA, if a macroblock contains two objects moving in different directions, the coding performance for that macroblock is worse than when both objects move in the same direction. To compensate for this weakness, the variable block size block matching algorithm (VBSBMA) is adopted in advanced video coding standards. In H.264/Advanced Video Coding (AVC), VBSBMA partitions a macroblock into 7 kinds of blocks: 4x4, 4x8, 8x4, 8x8, 8x16, 16x8, and 16x16, as shown in Fig 1.

Fig 1. Variable block sizes from a 16x16 block.

1.1 Criteria:

In full-search block-matching motion estimation, each N×N-pixel block in the current frame is compared with all the candidate blocks in the search region to determine the best match. The match criterion commonly used is the SAD of the luminance values: the candidate block with the minimum SAD is taken as the best match. The process can be summarized as shown below.

$$\mathrm{SAD}(i,j) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \left| \mathrm{Cur}(k,l) - \mathrm{Ref}(k+i,\, l+j) \right|$$

$$\mathrm{SAD}_{\min} = \min_{(i,j)} \mathrm{SAD}(i,j)$$

2. Proposed Architecture

2.1 Nested loop structure of Fixed Block Size ME

The nested loop structure of fixed block size motion estimation with full search is shown below.

    For m = 0 to 2p
        For n = 0 to 2p
            SAD(m,n) = 0
            For i = 0 to N-1
                For j = 0 to N-1
                    SAD(m,n) = SAD(m,n) + |x(i,j) - y(i+m-p, j+n-p)|
                End j
            End i
            SAD(m,n) = SAD(m,n);
        End n
    End m

Here x(i,j) is the current macroblock and y(i,j) is the reference macroblock.

The inner two loops compute the SAD for a given current block and reference block. The outer two loops implement the full-search motion estimation, computing the SAD for every candidate block in the given search window.
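The loop nest above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the paper: the function names `sad` and `full_search` are hypothetical, and the search window is assumed to be stored as an (N+2p) x (N+2p) array, so that loop index m corresponds to the displacement m - p.

```python
def sad(cur, window, top, left):
    """SAD between the N x N block `cur` and the same-sized region of
    `window` whose top-left corner is at (top, left)."""
    n = len(cur)
    return sum(abs(cur[i][j] - window[top + i][left + j])
               for i in range(n) for j in range(n))


def full_search(cur, window, p):
    """Evaluate all (2p+1)^2 candidates (the two outer loops) and return
    (minimum SAD, vertical displacement, horizontal displacement)."""
    best = None
    for m in range(2 * p + 1):
        for n in range(2 * p + 1):
            cost = sad(cur, window, m, n)      # the two inner loops
            if best is None or cost < best[0]:
                best = (cost, m - p, n - p)    # undo the origin shift by p
    return best


# Small usage example with N = 2 and p = 1 (assumed toy data).
cur = [[1, 2],
       [3, 4]]
window = [[9, 9, 9, 9],
          [9, 9, 9, 9],
          [9, 1, 2, 9],
          [9, 3, 4, 9]]
print(full_search(cur, window, 1))
```

In the toy data the block appears one row below the centre position, so the search reports a zero SAD at that displacement.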

2.2 Loop Level Optimizations

For a block size of 4x4 and p = 2, the loop indices change as shown below.

    For i = 0 to 3
        For j = 0 to 3
            SAD(m,n) = SAD(m,n) + |x(i,j) - y(i+m, j+n)|

These two loops can be represented in matrix form as shown below.

From the dependence graph (DG), we can see that all the processing elements (PEs) can operate independently, given the pixel inputs x(i,j) and y(i,j).

Now consider the third and fourth loops, which correspond to the full-search ME. If we unroll the third loop, corresponding to index n, we have to read a new set of column pixels. This increases the data bandwidth and also the latency, because reading pixels from different columns requires multiple clock cycles.

Hence, by interchanging the third and fourth loops, we can efficiently reuse the pixels already read; we only have to read a new set of row pixels, which can be read in a single cycle, as shown below.

    For n = 0 to 3
        For m = 0 to 3
            For i = 0 to 3
                For j = 0 to 3
                    SAD(m,n) = SAD(m,n) + |x(i,j) - y(i+m, j+n)|
                End j
            End i
        End m
    End n

Unrolling the third loop (index m), the SAD calculation for different iterations of m in matrix form and the corresponding 3D DG are shown below.

Fig 3. SAD calculation for m = {0,1,2} in matrix form, and the corresponding DG for m = 0 and m = 1.

As we can see, after the interchange, for m = 1 we have to load only the new pixels {y40,y41,y42,y43} (shown in red in Fig 3); all the pixels {y10,y11,y12,y13, ..., y31,y32,y33} can be reused from the previous iteration, m = 0. These transfers of old pixels from the previous iteration of m are shown by green arrows in the DG.
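The row reuse described above can be illustrated with a short Python sketch. It is only a software analogy under stated assumptions: the function names are hypothetical, the column offset n is held fixed, and each reference row is "read" once and then contributes to every vertical position m that needs it.

```python
def row_psad(x_row, y_row):
    """Partial SAD of one block row against one reference row."""
    return sum(abs(a - b) for a, b in zip(x_row, y_row))


def sads_with_row_reuse(x, y_rows):
    """Return (list of SAD(m) for all vertical positions m, rows read).
    Reference row r is paired with block row i for position m = r - i."""
    n = len(x)
    positions = len(y_rows) - n + 1
    acc = [0] * positions
    rows_read = 0
    for r, y_row in enumerate(y_rows):   # one new reference row per step
        rows_read += 1
        for i in range(n):
            m = r - i
            if 0 <= m < positions:
                acc[m] += row_psad(x[i], y_row)
    return acc, rows_read


# Assumed toy data: a 4x4 block and 7 reference rows (4 vertical positions).
x = [[4 * i + j for j in range(4)] for i in range(4)]
y = [[(3 * r + 5 * c) % 11 for c in range(4)] for r in range(7)]
sads, rows_read = sads_with_row_reuse(x, y)
print(sads, rows_read)
```

For a 4x4 block and four vertical positions, this reads 7 reference rows, whereas recomputing each position from scratch would fetch 4 rows for each of the 4 positions, 16 in total.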

2.3 High Speed Realization of ME

We use the fact that the absolute differences for all the pixels in a row can be calculated in parallel, and therefore group the SAD calculation of the four pixels in a row into a single large PE that processes four pixels at a time. The corresponding 3D DG for different iterations of m can now be conveniently represented as a 2D DG, as shown below.

The green arrows in the DG on the right show the transfer of old pixels between different iterations of m.

2.4 Systolic Array Design

The systolic array design for the DG shown above is as follows.

The resulting systolic array is fully pipelined, and the cycle time is determined by the PE processing time.

The X vectors are stored internally in the PEs.

Only the Y vectors are read from the search window or external memory, every cycle, as shown below.

    Cycle 1: Y30
    Cycle 2: Y20 Y40
    Cycle 3: Y10 - - Y50
    Cycle 4: Y00 - - Y60

As we can see, the Y vectors are read only once and reused efficiently in the systolic array.
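The properties listed above (stationary X vectors, Y vectors read once, pipelined partial sums) can be mimicked with a small cycle-by-cycle model in Python. This is a sketch of one possible linear schedule, not necessarily the mapping used in the paper's figure: the function names are hypothetical, and the choice of two register stages per hop for Y and one for the partial sum is an assumption made so that the timing works out.

```python
def row_psad(x_row, y_row):
    """Work of one large PE: partial SAD of one row of pixels."""
    return sum(abs(a - b) for a, b in zip(x_row, y_row))


def systolic_sads(x, y_rows):
    """Linear array of n PEs. PE j holds block row n-1-j (stationary X).
    One Y row enters per cycle and moves on by one PE every two cycles;
    the partial sum moves on by one PE every cycle."""
    n = len(x)
    y_line = [None] * (2 * (n - 1) + 1)  # Y delay line; tap 2*j feeds PE j
    p_out = [None] * n                   # registered partial sum of each PE
    results = []
    for t in range(len(y_rows) + n - 1):
        new_row = y_rows[t] if t < len(y_rows) else None
        y_line = [new_row] + y_line[:-1]          # each Y row is read once
        new_p = [None] * n
        for j in range(n):
            y = y_line[2 * j]
            incoming = 0 if j == 0 else p_out[j - 1]
            if y is not None and incoming is not None:
                new_p[j] = incoming + row_psad(x[n - 1 - j], y)
        p_out = new_p
        if p_out[n - 1] is not None:              # a finished SAD leaves the array
            results.append(p_out[n - 1])
    return results


# Assumed toy data: a 4x4 block and 7 reference rows.
x = [[4 * i + j for j in range(4)] for i in range(4)]
y = [[(3 * r + 5 * c) % 11 for c in range(4)] for r in range(7)]
print(systolic_sads(x, y))
```

After the pipeline has filled, the model emits one completed SAD per cycle, in order of increasing vertical position.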

2.5 Variable Block Size Realization for 8x8 Block Size

The systolic array for 4x4 SAD calculation can be reused for variable block size SADs.

The 4x4 X and Y blocks are shown below.

X and Y blocks of size 8x8 can be decomposed into four 4x4 blocks, as shown below.

We can treat a single 8x8 block as four separate 4x4 blocks and calculate the 4x4 SADs simultaneously, as shown in the DG below. Once we have the 4x4 SADs, we can combine them into the 8x4, 4x8, and 8x8 SADs using adder trees.

Fig 4. Extending the DG to a block size of 8x8

The 8x8 DG is mapped to a systolic array as shown below.

Each systolic array calculates 4x4 SADs in parallel, and the row adder trees combine these SADs to generate the variable block size SADs for 8x4, 4x8, and 8x8.

The same idea can be extended to calculate the 41 SADs of a 16x16 block by operating 16 systolic arrays in parallel: each 16x16 block can be treated as 16 separate 4x4 blocks whose SADs are calculated in parallel and then combined to generate the 41 variable block size SADs.
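The count of 41 SADs can be checked with a short Python sketch that combines sixteen 4x4 SADs into all seven partition sizes. The function name `variable_block_sads` is hypothetical, and the block labels are assumed to follow a width x height convention (so "4x8" is one sub-block wide and two tall). The sketch simply re-sums the 4x4 values for every partition, whereas an adder tree in hardware would typically reuse the smaller sums.

```python
def variable_block_sads(s):
    """s[r][c] is the SAD of the 4x4 sub-block in row r, column c of a
    16x16 macroblock. Returns a dict mapping block size to its SADs."""
    def region(r0, c0, h, w):
        # Sum of the 4x4 SADs covering h x w sub-blocks starting at (r0, c0).
        return sum(s[r][c] for r in range(r0, r0 + h) for c in range(c0, c0 + w))

    # (height, width) of each partition, in units of 4x4 sub-blocks.
    shapes = {"4x4": (1, 1), "4x8": (2, 1), "8x4": (1, 2), "8x8": (2, 2),
              "8x16": (4, 2), "16x8": (2, 4), "16x16": (4, 4)}
    out = {}
    for name, (h, w) in shapes.items():
        out[name] = [region(r0, c0, h, w)
                     for r0 in range(0, 4, h) for c0 in range(0, 4, w)]
    return out


# Assumed toy data: the sixteen 4x4 SADs are simply 0..15.
s = [[4 * r + c for c in range(4)] for r in range(4)]
sads = variable_block_sads(s)
print({name: len(values) for name, values in sads.items()})
print(sum(len(values) for values in sads.values()))
```

The per-size counts are 16, 8, 8, 4, 2, 2, and 1, which add up to the 41 SADs mentioned above.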

2.6 Large PE design

Each PE processes 4 pixels at a time. The PE architecture is shown below.

Fig 6. Large PE Design

It can be seen that the absolute differences for four pixels are calculated in parallel in the PE and summed using a row adder tree. The two registers for the Y vector correspond to the 2D delay in the systolic array. PSAD is the sum of the four absolute differences.

The detailed architecture for calculating the absolute difference is shown in Fig 7, where Y’ represents the complement of Y.

Fig 7. Absolute difference
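Since the circuit itself is only available as a figure, the following Python sketch shows one common way to obtain an absolute difference from the complement of Y. It is an assumption about the general technique, not a transcription of Fig 7: the function name `abs_diff_8bit` is hypothetical and 8-bit unsigned pixels are assumed. The subtraction X - Y is formed as X + Y' + 1, and the carry out of the adder decides whether the result must be negated.

```python
MASK = 0xFF  # 8-bit pixels (assumption)


def abs_diff_8bit(x, y):
    """|x - y| for 8-bit unsigned values using complement arithmetic."""
    y_comp = ~y & MASK           # Y' : bitwise complement of Y
    total = x + y_comp + 1       # two's-complement subtraction X - Y
    carry = total >> 8           # carry out is 1 exactly when x >= y
    low = total & MASK
    if carry:
        return low               # difference is already non-negative
    return (~low & MASK) + 1     # negate a negative difference


print(abs_diff_8bit(5, 3), abs_diff_8bit(3, 5), abs_diff_8bit(0, 255))
```

The sketch agrees with the built-in `abs(x - y)` for every pair of 8-bit inputs.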

An efficient row adder tree based on a Wallace tree structure is used to reduce the critical path, since the overall cycle time of the systolic array is determined by the PE processing time.

Fig 8. Row Adder tree to sum the four absolute difference values into a partial SAD.
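The idea behind such a tree can be sketched in Python with carry-save (3:2) compression, the building block of Wallace-style reduction. This is an illustrative model under assumptions, not the circuit of Fig 8: the function names are hypothetical, and the four operands are reduced with two carry-save stages followed by a single carry-propagate addition.

```python
def carry_save(a, b, c):
    """3:2 compressor: returns (s, k) with a + b + c == s + k,
    computed without propagating carries across bit positions."""
    s = a ^ b ^ c
    k = ((a & b) | (b & c) | (a & c)) << 1
    return s, k


def psad(x4, y4):
    """Partial SAD of four pixel pairs via carry-save reduction."""
    d0, d1, d2, d3 = (abs(a - b) for a, b in zip(x4, y4))
    s, k = carry_save(d0, d1, d2)   # first reduction stage
    s, k = carry_save(s, k, d3)     # second reduction stage
    return s + k                    # one carry-propagate addition at the end


print(psad([10, 20, 30, 40], [12, 18, 35, 40]))
```

Only the final addition has a carry chain, which is why this style of reduction shortens the critical path compared with chaining three ordinary adders.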

3. Results

A prototype of the proposed architecture was implemented in Verilog and tested using ModelSim. It was synthesized with Synopsys Design Compiler using a 90 nm LSI standard cell library.

    Design               Area at 333 MHz   Area at 500 MHz
    Large PE             6702 um2          7309 um2
    Systolic Array 4x4   22132 um2         24601 um2
    Systolic Array 8x8   87931 um2         95748 um2

Systolic Array 4x4 corresponds to fixed block size ME with a 4x4 block. Systolic Array 8x8 corresponds to variable block size ME with an 8x8 block; it includes four systolic arrays operating in parallel and the row adder tree for calculating the variable block size SADs.

The table above shows the area figures at clock frequencies of 333 MHz and 500 MHz. The synthesis results show that the proposed variable block size architecture (Systolic Array 8x8) runs at clock frequencies of up to 500 MHz with a silicon area of 95,748 um2.

3.1 Comparison of Existing VBSME Architectures [5]

                   2D Array [6]   1D Array [7]   1D Array [8]   2D Array [5]   Our Work
    Search Range   32 or 16       8              8              Flexible       Flexible
    Technology     0.18um         0.13um         0.18um         0.13um         0.09um
    Max Freq       100MHz         294MHz         266MHz         167MHz         >500MHz
    Gate Count     154K           61K            67.6K          102.2K         96K

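Our design reaches the highest clock frequency among the compared implementations, with a gate count lower than both 2D array designs [5], [6]. Note, however, that our target technology is 90 nm, a smaller process node than the 0.18 um and 0.13 um used by the earlier implementations, so part of the frequency advantage comes from technology scaling.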

4. Conclusion:

A new systolic array architecture for variable block size motion estimation is presented in this paper. Reduced data bandwidth is obtained through algorithm-level optimizations, namely loop interchange and unrolling, which result in efficient reuse of the search window pixels. High-speed operation is achieved by processing 4 pixels simultaneously in the large PE. Fast motion estimation for variable block sizes is achieved by operating four systolic arrays in parallel and reusing the 4x4 SADs to generate the variable block size SADs. Circuit-level optimizations for calculating the absolute difference, together with Wallace tree structures for the row adder trees, result in a clock frequency of 500 MHz.

5. References

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, Overview of the H.264/AVC Video Coding Standard, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, pp. 560-576, July 2003.

[2] T.-C. Chen, S.-Y. Chien, Y.-W. Huang, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, and L.-G. Chen, “Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 6, pp. 673–688, Jun. 2006.

[3] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang, and L.-G. Chen, “Analysis and Architecture Design of Variable Block Size Motion Estimation for H.264/AVC,” IEEE Transactions on Circuits and Systems I, vol. PP, no. 99, 2005.

[4] Zhenyu Liu, Yiqing Huang, Yang Song, Satoshi Goto, Takeshi Ikenaga, “Hardware-Efficient Propagate Partial SAD Architecture for Variable Block Size Motion Estimation in H.264/AVC,” Proceedings of the 17th Great Lakes Symposium on VLSI, pp. 160-163, 2007

[5] Liang Lu, John V. McCanny, and Sakir Sezer, “Systolic Array Based Architecture for Variable Block-Size Motion Estimation”, IEEE Conference on Adaptive Hardware and Systems, AHS 2007

[6] M. Kim, I. Hwang, S. Chae, “A Fast VLSI Architecture for Full-Search Variable Block Size Motion Estimation in MPEG-4 AVC/H.264”, ACM/IEEE ASP-DAC’05, pp. 631-634, 2005.

[7] S. Y. Yap, J. V. McCanny, “A VLSI architecture for variable block size video motion estimation”, IEEE Trans. on CAS-II, Vol. 51, No. 7, pp. 384-389, 2004.

[8] Y. Song, Z. Liu, T. Ikenaga, S. Goto, “VLSI Architecture for Variable Block Size Motion Estimation in H.264/AVC with Low Cost Memory Organization”, IEEE VLSI-DAT’06, pp. 89-92, 2006.
