Comparison of 2D Median Filter Hardware Implementations...

Comparison of 2D Median Filter Hardware Implementations for Real-Time Stereo Video

Jesse Scott, Michael Pusateri, Muhammad Umar Mushtaq

Electronic and Computer Services, Penn State University

149 Hammond Building, University Park, PA 16802

Abstract- The two-dimensional spatial median filter is a core algorithm for impulse noise removal in digital image processing and computer vision. While the literature presents several analyses of median filters optimized for a standard 3x3 pixel neighborhood configuration, a 5x5 neighborhood, useful for imagery exhibiting noise not conforming to the classic “salt and pepper” formation, has received little analysis. Research efforts on hardware implementations of median filters have been devoted primarily toward implementations with low latency and high throughput. We are developing a system that includes stereo visible near infrared sensors; both require a 5x5 median filter to handle intensifier noise. Since the system is a battery powered unit, optimal power usage is a critical requirement in addition to low latency and high throughput. However, optimal power usage for median filtering has received little attention in the literature.

In this paper, we focus on investigating four selected hardware implementations of a 5x5 median filter and compare them on the basis of power efficiency. We also analyze the latency, maximum clock rates, and resource utilization for these implementations. The designs include implementations of merge sort and radix sort-based elimination algorithms, common in software implementation of median filters, and a systolic sorting array and a Batcher sorting network, common hardware sorting techniques. All designs were created in the Altera Quartus-II environment for Stratix-II field programmable gate arrays, and were designed to be fully pipelined, accepting input sets and generating median filter output values every pixel clock pulse. Of the four considered designs, the Batcher network is a clear winner in power efficiency. Also, the Batcher network exceeds the functional and performance requirements for resource usage, latency, and clock rate.

I. INTRODUCTION

Currently, several fully digital night vision systems are under development. Many of these digital imagers are intensified like their analog counterparts. The intensification process introduces scintillation noise with multi-pixel extent. The scintillation noise is distracting to

the user and negatively impacts algorithms dependant on accurate histogram information. Figure 1 provides a scintillation noise example from an intensified imager that is typical of the noise occurring throughout the imagery. This noise was sampled from a uniform intensity region within a raw, high resolution image. When observing the scintillation noise, note that the peak is not impulse-like; it is multi-pixel in extent and not fully saturated.

Because of the power limitations of the system, the scintillation noise removal solution must be formulated with power limitations in mind. Goggle systems utilize battery packs that are small and light weight, limiting energy storage. Although the stereo night vision system uses two matched imagers, the imagery will not have matched scintillation noise. Therefore, the solution must be able to handle random scintillation on both imagers while maintaining deterministic runtime and predictable results.

Figure 1. Example of scintillation noise from an intensified digital imager.

II. BACKGROUND A. Destination platform

The platform for implementation is a field programmable gate array (FPGA) based mobile electronic board contained within the night vision goggle system. In addition to the stereo visible imagers, the FPGA supports two additional sensors, two displays, and miscellaneous controls and indicators. The following operations are performed on all sensor channels: acquisition, non-uniformity correction, bad pixel mapping, optical correction, automatic gain control, noise attenuation, synchronization, multi-spectral fusion, and display. To perform all these operations for each of the four channels requires significant resources within the FPGA restricting the resources available for scintillation noise removal.

The goggle is intended as a wearable device that places the stereo imagers directly over the user’s visual path. Because the goggle replaces the view of the operator, there must be as little latency as possible; the goal was to achieve intra-frame latency from acquisition to display. B. Algorithm selection

Prior to efforts to deploy a solution to the FPGA, significant algorithm development was performed to determine the best approach to attenuate the scintillation noise. During this process, MATLAB was used to process a base collection of raw sensor data. The data set was utilized as a source by a wide range of image processing algorithms. A total of 47 variants were tested for signal-to-noise ratio (SNR) as an analytical metric and the resulting videos were subjectively graded. The results shown in table 1 provide the algorithm description, kernel size, and SNR for selected variants of the 47 tested methods. From table 1, it can be seen that a 5x5 median filter is the top performer analytically. The visual performance of median on the scintillation noise was also judged superior. The top three algorithms are highlighted in table 1: Median 5x5, Wiener 9x9, and Disk 5x5.

Table 1. Summary of SNR metric of several algorithms and kernel sizes for scintillation removal.

III. HARDWARE DESIGNS

In reviewing methods to implement a sort for finding

the median, we found four categories: radix [1][2][3][4], histogram [5][6][7][8], systolic network [9][10][11][12], and sorting network [13][14][15][16][17] methods. These

categories seem representative of most hardware approaches we reviewed. From these categories, we chose to implement four algorithms covering all of the categories except histogram based methods. Histogram methods are inherently memory and time intensive and were not attempted based on this premise.

There are a few outlier methods [18][19]. The approach in [18] is a method using iterative row-column sorting referred to as shear-sort and in [19] is a 3D approach using a recursive method; both are similar to sorting network implementations. A. Merge sort

Merge sort [15] is perhaps the best known divide-and-conquer sorting algorithm implemented in software and is fundamentally a specific implementation of a sorting network. Merge sort has a running time of O(n·log(n)) where n is the number of elements to be sorted. The input array is divided in half, recursively, generating successively smaller arrays until no further divisions can be performed. The two-element arrays are then sorted, followed by the merger of these arrays into successively larger sorted arrays. Since n is constant for our hardware implementation, the division into two-element arrays is predictable and can be hardwired. The merge part of the algorithm is a pipeline where the successively larger output array generates a larger latency for each stage. Each stage is considered an atomic operation, introducing different latencies based on the merge size at the stage. As a result, fresh input data sets cannot be fed to the design without dividing the input clock period by a factor equal to the largest latency introduced, which is 13. A major drawback of the merge sort algorithm is the eventual bottle neck that occurs at the last stage where two n/2 arrays are merged atomically. If the design was reworked to eliminate the atomic nature inherent to Merge Sort, the result would be a sorting network similar to subsection D. Merge sorting performs well in microprocessors that are fundamentally serial, but more parallelism is needed in our application. For all practical purposes, merge sort is not a viable option for median filtering on the FPGA.

Figure 2. A block diagram depicting the radix based elimination design.

B. Radix sort based elimination

Radix sort [2][4] is not a comparison based sort, but an elimination sort. Each stage serves to eliminate values that cannot be the median; the number of stages is determined, not by the number of inputs, but by the bit-width of the inputs. Each stage focuses on a bit index, working from the most to the least significant bit through the successive stages. A flag bit is used for each input to indicate whether it has been eliminated at any of the prior stages. If the flag bit is set, the input is not considered as a potential median. Each stage consists of three substages: the add substage sums the indexed bit at that stage in each input, the compare substage determines if the generated sum is greater than n/2 by setting a threshold bit, and the multiplex stage assigns each input a value for the next stage. If the flag bit has not yet been set, the index bit and the threshold bit are compared. If the two bits match, the input is allowed to proceed to the next stage unchanged. If the index bit is low but the threshold bit is high, the dirty bit is set and the input is assigned the minimum possible value. If the index bit is high but the threshold bit is low, the dirty bit is set and the input is assigned the maximum possible value. If more than one input emerges from the last stage with their dirty bits false, the inputs are identical and the first is used as the median. The latency for the module is three clock cycles per stage, with the number of stages based on the input bit-depth. The basic hardware structure is illustrated in figure 2 and is an implementation for 8 bit pixel depth. The first 8 stages are elimination stages while the final search stage is used to select a non-dirty input. Figure 2 also illustrates the substages of each elimination stage as well as the search stage.

Figure 3. A block diagram depicting the systolic network design

C. Systolic sorting network

Systolic sort [9][10] is one of two designs composed entirely of simple 2x2 sorters. The idea behind the arrangement of the 2x2 sorters is to allow the values to percolate to their correct locations by passing them through enough stages of 2x2 sorters, in effect a pipelined and parallel emulation of bubble sort. Inputs are fed to a stage of 2x2 sorters, numbered 0 to n/2 from top to bottom. Sorter

i generates a smaller and a larger output, with the smaller output feeding sorter i-1 of the next stage and the larger output feeding sorter i+1 of the next stage. After n such stages, a sorted output emerges. Since the entire design is constructed from 2x2 sorters and delay registers, a fresh set of inputs can arrive every clock cycle. Figure 3 depicts the systolic network using 13-12 networks modules. The internal structure of 13-12 networks is also illustrated in figure 3 using sorter (S) blocks and buffer (B) blocks representing a sequential collection of 13 2x2 sorters then 12 2x2 sorters.

Figure 4. A block diagram depicting the Batcher network design and a 4 sorter.

D. Batcher sorting network

Batcher Sort [13][15] is the other design based on 2x2 sorters and is found extensively in networking hardware [16]. The premise behind the algorithm is to ensure that inputs to a sorting block are presorted. In this manner, Batcher sort emulates the principle behind merge sort, but improves on it by utilizing a fully parallel merge mechanism better suited to a hardware implementation. Inputs to the design go to a stage made up of 2x2 sorters, resulting in pairs of inputs being sorted. Groups of two pairs are then fed to a stage comprised of 4x4 sorters. The 4x4 sorter is implemented using a two stage pipeline with a pair of 2x2 sorters in the first stage and a 2x2 sorter with two delay registers in the second stage. Subsequent stages follow a similar approach of building mxm sorters, where m is a power of two, from a pipelined arrangement of 2x2 sorters and delay registers. Since each element can accept fresh inputs every clock cycle, the design can be fully pipelined. Each mxm stage introduces a latency of log(2m), and the number of stages required is the ceiling of log(2n). Figure 4 depicts the five stages of the batcher sorter using 2, 4, 8, 16, and 32 input switches. Figure 4 also depicts how a 4 sorter is built from 2 sorters and buffers. Following the method shown in the diagram, a 4 sorter is 2 stages, an 8 sorter is 3 stages, a 16 sorter is 4 stages, and a 32 sorter is 5 stages, totaling 15 stages of 2 sorters.

E. Evolution to Sorters

All four hardware designs we have discussed meet the goal of sorting 25 values and determining the exact median, but do not filter a 5x5 region. To accomplish this, however, we needed to develop a significant wrapper to allow the modules to serve as a median filter for the imagery from the sensors. Our wrapper design incorporates a buffer storing the five most recent rows of the image data stream; making all the data for a 5x5 image region available for processing. We operate the design with the minimum possible latency, using the most recently input pixel as the lower right corner of the 5x5 region. The latency is dependent on the number of pixels in a row of the imagery, R; it is equal to 2R+3 pixel clock cycles. We handle image borders via a wrap-around method.

IV. RESULTS

Two components of testing were completed during the analysis of the median filter implementation. The first round analyzes sorters alone for a fair comparison. The second round used the top performer from round one and encapsulated it in a wrapper to make a comparison with the available Altera MegaCore. A. Sorter Results

The first round of implementation, testing, and analysis had the scope of the merge, radix, systolic, and Batcher sorting modules alone. As the wrapper is a common element for all designs, it was eliminated from this portion of evaluation to focus on the algorithmic implementations. Each scheme was evaluated using four metrics to determine the modules capability to operate within the scope of the destination application: latency, hardware resources, power dissipation, and clock frequency.

Figure 5. A graph presenting the latency of the 4 sorter designs implemented.

Figure 6. A graph presenting the resource requirements of the 4 sorter designs implemented.

Figure 7. A graph presenting the power dissipation of the 4 sorter designs implemented.

Figure 8. A graph presenting the maximum clock frequency of the 4 sorter designs implemented.

Figures 5 through 8 present graphs comparing the

designs for each metric. The merge sort results were included to provide a point of reference for evaluation.

Figure 5 shows the latency of the Batcher hardware is the lowest of the designs. Figure 6 shows the hardware resources required are the lowest for the Batcher sorter and figure 7 indicates it also has the lowest power dissipation of all four designs. Figure 8 shows that, while Batcher does not have the highest potential clocking frequency, it easily meets the minimum frequency of 50 MHz required by the actual imagers. The Batcher sorting network was chosen for the median filter design because it has the lowest power dissipation and meets all other system requirements

Figure 9. A graph presenting the logic utilization for the Altera and Batcher Filters at select image resolutions.

Figure 10. A graph presenting the register utilization for the Altera and Batcher Filters at select image resolutions.

B. Filter Results

Figures 9 through 12 present an analysis of the complete median filter using the Batcher sorter encased in our filter wrapper verses the Altera Median Filter MegaCore IP. The Altera core utilizes a nondeterministic clocking scheme where the core signals its readiness for a new pixel irregularly, requiring the preceding module to manage the data flow rate to the core. This flow management results in the Altera core taking, on average, more than one pixel clock cycle to produce an output pixel. By contrast, our

design deterministically produces one output pixel for every input pixel. To allow a fair comparison of power utilization, we found the total energy required to process a one frame’s worth of pixels at various frame resolutions. The power used for a frame is directly proportional to the energy with the proportionality constant given by the frame rate.

Figure 11. A graph presenting the memory utilization for the Altera and Batcher Filters at select image resolutions.

Figure 12. A graph presenting the energy dissipation for the Altera and Batcher Filters at select image resolutions.

Figure 9 and 10 show that logic and register

requirements for our design are significantly lower than the Altera IP core. Both designs have increasing resource requirements but are relatively constant for both designs. Figure 11 illustrates that the Altera IP core requires significantly less memory. Careful analysis of this difference revealed that the Altera core was using close to 20% less memory, indicating that the Altera core had forgone the five row buffer for the minimal buffer of four rows plus five pixels. The final graph is figure 12 and presents the energy dissipation of each design. With our current implementation, the even point is the 640x480x60 resolution and our design is best used below this data rate. Our implementation can be significantly improved by

reducing the five row buffer to a four row buffer, reducing memory requirements.

V. Future Work

One of our original design assumptions was a five row buffer would result in a very simple addressing scheme compensating for the additional memory requirements. Our analysis of the comparison to the Altera core caused us to reconsider this assumption and we have begun work on a minimal buffer implementation. We believe that the usage of a minimal buffer will leave our implementation superior to the Altera core at all resolutions.

We also found that the speed of our implementation was limited not by the sorting algorithm, but rather by the buffer. Currently, the buffer includes a single dual port memory controller accessing the whole buffer. This architecture requires the use of a memory clock six times faster than the pixel clock. This memory clock rate limits the design to pixel clock frequencies of around 55 MHz. We are evaluating the resource and power costs associated in dividing the buffer into either two or four banks each with its own controller. This improvement will allow the overall design’s pixel clock rates to be limited by our sorter implementation.

Lastly, in the course of implementation research, we came across the concept of a separable median presented originally in [20] by Tukey. Several workers have successfully ported versions of this concept to hardware [21][22][23]. Our initial evaluation of 5x5 separable median suggest that its SNR improvement and visual quality are both poorer than the 5x5 median, but are still acceptable. We are in the processing of determining if the reduction in resource requirements for a separable median justifies the degradation of image quality.

VI. CONCLUSIONS

Our process of algorithm development via MATLAB and implementation via HDL proved efficient by allowing algorithm evaluation to be performed while abstracted from the algorithm details. The median filter shows significant scintillation noise removal while not drastically affecting the clarity or sharpness of the image. Our HDL implementation provides deterministic, low power, and low latency operation. The analysis of sorters showed the Batcher sorting network is superior to the other designs implemented. Due to our buffer design, our median filter does not outperform the preexisting Altera IP core for imagery with pixel clock rates higher than approximately 18.5 MHz. Improvements are under consideration to make our implementation superior at all clock rates.

VII. REFERENCES [1] E. Ataman, V. Aatre, and K. Wong, "A fast method for real-time

median filtering," IEEE Transactions on Acoustics, Speech and Signal Processing, vol.28, no. 4, pp. 415-421, August 1980.

[2] P.E. Danielsson, "Getting the Median Faster," Computer Graphics and Image Processing, vol.17, no.1, pp. 71-78, September 1981.

[3] V.B. Rao and K. Rao, "A new algorithm for real-time median filtering," IEEE Transactions on Acoustics, Speech and Signal Processing, vol.34, no.6, pp. 1674-1675, December 1986.

[4] K. Benkrid, D. Crookes, and A. Benkrid, "Design and implementation of a novel algorithm for general purpose median filtering on FPGAs," IEEE International Symposium on Circuits and Systems, vol. 4, pp. 425-428, 2002.

[5] T. Huang, G. Yang, and G. Tang, "A fast two-dimensional median filtering algorithm," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no.1, pp. 13-18, February 1979.

[6] Z. Yong and G. Taubin, "Real-Time Median Filtering for Embedded Smart Cameras," IEEE International Conference on Computer Vision Systems, pp. 55-55, January 2006.

[7] B. Weiss, “Fast median and bilateral filtering,” ACM Transactions on Graphics, vol. 25, no. 3, pp. 519-526, July 2006.

[8] S. Perreault and P. Hebert, "Median Filtering in Constant Time," IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2389-2394, September 2007.

[9] C. D. Thompson, "The VLSI Complexity of Sorting," IEEE Transactions on Computers, vol. C-32, no. 12, pp.1171-1184, December 1983.

[10] H. T. Kung, Why systolic architectures?. In Advanced Computer Architecture, D. P. Agrawal, Ed. Los Alamitos, CA: IEEE Computer Society Press, 1986, pp. 300-309.

[11] M. Karaman, L. Onural, and A. Atalar, "Design and implementation of a general-purpose median filter unit in CMOS VLSI," IEEE Journal of Solid-State Circuits, vol. 25, no. 2, pp. 505-513, April 1990.

[12] M. Vega-Rodríguez, J. Sánchez-Pérez, and J. Gómez-Pulido, “An FPGA-based implementation for median filter meeting the real-time requirements of automated visual inspection systems,” Proceedings of the 10th Mediterranean Conference on Control and Automation, August 2007.

[13] K. E. Batcher, “Sorting Networks and their Applications,” Spring Joint Computer Conference of American Federation of Information Processing Societies, vol. 32, pp. 307-314, 1968.

[14] H. S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Transactions on Computers, vol. C-20, no. 2, pp. 153-161, February 1971.

[15] D. E. Knuth, The Art of Computer Programming: Sorting and Searching. Reading, Massachusetts: Addison Wesley, vol. 3, April 1997.

[16] A. Huang, “STARLITE: A Wideband Digital Switch,” IEEE Global Telecommunications Conference, Communications in the Information Age, 1984.

[17] K. Chung, “A fast pipelined median filter network,” Signal Process, vol. 51, no. 2, pp. 133-136, June 1996.

[18] I. D. Scherson and S. Sen, "Parallel sorting in two-dimensional VLSI models of computation," IEEE Transactions on Computers, vol. 38, no. 2, pp. 238-249, February 1989.

[19] J. S. Kim and H. W. Park, "Adaptive 3-D median filtering for restoration of an image sequence corrupted by impulse noise," Signal Processing: Image Communication, vol. 16, no. 7, pp. 657-668, 2001.

[20] J. W. Tukey, “The ninther, a technique for low-effort robust (resistant) location in large samples.” In: Contributions to Survey Sampling and Applied Statistics in Honor of H.O. Hartley, edited by H. A. David, pp. 251-258, New York: Academic Press, 1978.

[21] P. M. Narendra, “A separable median filter for image noise smoothing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 3, pp. 20-29, January 1981.

[22] G. L. Bates and S. Nooshabadi, "FPGA implementation of a median filter,” Proceedings of IEEE Region 10 Annual Conference on Speech and Image Technologies for Computing and Telecommunications, vol. 2, pp. 437-440, December 1997.

[23] D. Cline, K. B. White, and P. K. Egbert, "Fast 8-Bit Median Filtering Based on Separability," IEEE International Conference on Image Processing, vol.5, pp. 281-284, September 2007.

Comparison of 2D Median Filter Hardware Implementations...

Documents

Transcript of Comparison of 2D Median Filter Hardware Implementations...