812 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND ... · Ryan N. Rakvic, Member, IEEE, Bradley J....

812 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 4, NO. 4, DECEMBER 2009

Parallelizing Iris RecognitionRyan N. Rakvic, Member, IEEE, Bradley J. Ulis, Randy P. Broussard, Robert W. Ives, Senior Member, IEEE, and

Neil Steiner

Abstract—Iris recognition is one of the most accurate biometricmethods in use today. However, the iris recognition algorithms arecurrently implemented on general purpose sequential processingsystems, such as generic central processing units (CPUs). In thiswork, we present a more direct and parallel processing alternativeusing field-programmable gate arrays (FPGAs), offering an op-portunity to increase speed and potentially alter the form factorof the resulting system. Within the means of this project, the mosttime-consuming operations of a modern iris recognition algorithmare deconstructed and directly parallelized. In particular, portionsof iris segmentation, template creation, and template matchingare parallelized on an FPGA-based system, with a demonstratedspeedup of 9.6, 324, and 19 times, respectively, when compared toa state-of-the-art CPU-based version. Furthermore, the parallelalgorithm on our FPGA also greatly outperforms our calculatedtheoretical best Intel CPU design. Finally, on a state-of-the-artFPGA, we conclude that a full implementation of a very fast irisrecognition algorithm is more than feasible, providing a potentialsmall form-factor solution.

Index Terms—Biometrics, field-programmable gate arrays(FPGAs), iris recognition, parallel computing.

I. INTRODUCTION

I RIS recognition stands out as one of the most accurate bio-metric methods in use today. Currently, iris recognition al-

gorithms are deployed globally in a variety of systems rangingfrom personal computers to portable iris scanners. The overallperformance, in terms of size, shape, speed, power, and accu-racy, of these systems have become of great interest.

Currently, iris recognition algorithms are deployed globallyin a variety of systems ranging from personal computers toportable iris scanners. These systems use central processing unit(CPU)-based systems for which the algorithm was originallydesigned. CPU-based systems are considered general purposemachines, designed for all types of applications. Therefore,CPU-based systems are known as sequential processing de-vices. Fig. 1 illustrates a simple processor pipeline. Instructionsare first fetched, decoded, and finally executed by arithmeticlogic units, or ALUs. State-of-the-art processors contain more

Manuscript received February 16, 2009; revised July 23, 2009. Firstpublished September 22, 2009; current version published November 18,2009. This work was supported in part by the Department of Defense andin part by the Information Sciences Institute-East. The associate editor coor-dinating the review of this manuscript and approving it for publication wasProf. Vijaya Kumar Bhagavatula.

R. N. Rakvic, B. J. Ulis, R. P. Broussard, and R. W. Ives are with the UnitedStates Naval Academy, Annapolis, MD 21402 USA (e-mail: [email protected];[email protected]; [email protected]; [email protected]).

N. Steiner is with the University of Southern California’s Information Sci-ences Institute, Arlington, VA 22203 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIFS.2009.2032012

Fig. 1. Simple processor pipeline with four ALUs.

than one ALU. Therefore, multiple instructions can be executedin parallel. However, modern processors are limited in thenumber of ALUs they possess, and most of today’s CPUs donot have more than four, as illustrated in Fig. 1. As pointedout in previous papers [24]–[26], this method of processingfor which the recognition algorithm is currently designed isinefficient since parts of the iris recognition algorithm can begreatly parallelized.

The goal of this research is to find and exploit potential par-allelism within iris recognition algorithms. Field-programmablegate arrays (FPGAs) are complex programmable logic devicesthat are essentially a “blank slate” integrated circuit that can beprogrammed with nearly any parallel logic function. They arefully customizable and the designer can prototype, simulate, andimplement a parallel logic function without the costly processof having a new integrated circuit manufactured from scratch.

By programming the hardware directly, performance and en-ergy consumption can be improved while the size and formfactor of the system can be altered. The focus of this researchis on parallelizing key portions of the iris recognition algorithmusing an FPGA. This work demonstrates this by making the fol-lowing contributions:

1) Introduction and calculation of the theoretical best perfor-mance of CPU-based machines executing key componentsof an iris recognition algorithm.

2) Measurements of CPU performance of key componentsof an iris recognition algorithm on a state-of-the-art com-puter.

3) Deconstruction and novel parallelization of three key irisrecognition components: 1) Iris segmentation using localkurtosis, 2) iris template creation via filtering, and 3) tem-plate matching via hamming distance.

4) Evaluation of the benefits of parallelization in terms of per-formance and size.

1556-6013/$26.00 © 2009 IEEE

Authorized licensed use limited to: University of Southern California. Downloaded on January 27, 2010 at 10:54 from IEEE Xplore. Restrictions apply.

RAKVIC et al.: PARALLELIZING IRIS RECOGNITION 813

Fig. 2. Template creation.

II. RIDGE ENERGY DETECTION IRIS RECOGNITION ALGORITHM

The first and most popular iris recognition algorithm was in-troduced by pioneer Dr. J. Daugman [1], [16]. This patentedalgorithm is not available for open use, so an alternative for re-search purposes can be found in the implementation created byL. Masek [17]. There are also many other proposed methods[18]–[22] to perform iris recognition. Ives et al. have developedan alternate iris recognition algorithm, referred to as the ridgeenergy direction (RED) algorithm [2] that will be the focus ofthis work. The RED algorithm has been shown to have compa-rable performance to state-of-the-art algorithms that have beenpreviously published [2]. In particular, the three parts of theRED algorithm have been fine-tuned to produce highly accu-rate iris recognition achieving 99.999% and 99.603% accuracyon the Bath2000 [28] and CASIA I [29] databases, respectively.In comparison, Masek’s implementation [17] of Daugman’s al-gorithm resulted in 99.526% and 99.630% accuracy on the sameBath2000 and CASIA I databases, respectively.

The steps of the RED algorithm are illustrated in Fig. 2. Oneof the first steps in preprocessing the image before extracting itsunique features is to segment the iris from the rest of the image.Several means have been proposed to perform this segmentation[4]–[8], but this algorithm utilizes a different approach usinglocal statistics [15] to aid in determining the outer boundary ofthe iris. After iris segmentation, feature extraction is performedbased on directional filters. The iris is then transformed to polarcoordinates, and the iris image is histogram equalized and thenprocessed by each of two different directional filters (verticallyand horizontally oriented). At every pixel location, the identityof the directional filter which provides the largest value of outputis recorded, and represented with a single bit, thus generatinga template. Templates are then matched using fractional Ham-ming Distance as the measure of closeness. We will now high-light three main components of the RED algorithm that helpdictate performance of the overall algorithm: 1) segmentationof the outer boundary of the iris; 2) iris template generation;and 3) iris template matching.

A. Iris Segmentation With Local Kurtosis

After determining the location of the pupil, the center of thepupil is used as a reference for unwrapping the iris image to

polar coordinates (50 rows by 180 columns). The center of thelimbic boundary is not typically the same as the center of thepupil, so we consider a number of possible centers for the limbicboundary: the pupil center and the pupil center shifted by smallincrements to the left and right. Note that when unwrappingthe iris from the correct center of the outer boundary, in theunwrapped image the outer boundary will appear as a hori-zontal line. For each of these unwrapped images, we computethe local kurtosis [9] in 3 3 neighborhoods of each pixel. Inthe resulting image, we observe that the values near the limbicboundary are relatively stable in small neighborhoods, com-pared with the appearance in the sclera and in the interior ofthe iris. Therefore, to create binary images, we assign a valueof 1 (white) to locations for which kurtosis values within a 33 neighborhood are nearly constant. The correct center of theouter boundary will appear in the unwrapped image that has thelongest nearly horizontal white border. Calculating the kurtosisrepeatedly to aid with iris segmentation is a time-consumingportion of the RED algorithm, and hence we focus on it in thisresearch.

B. Template Generation via Filtering

After determining the inner and outer boundaries and centerof the pupil, the iris is again transformed into polar coordinateswith the center of the pupil as the point of reference, into a 120row by 180 column image. In this process, the radial extent ofthe iris is normalized in order to account for pupil dilation. Eachrow in the unwrapped iris image represents an annular regionsurrounding the pupil, and the columns represent radial infor-mation. Next, we consider the “energy” of the unwrapped irisimage after an adaptive histogram equalization. Here, energyloosely refers to the prominence (pixel values) of the ridgesthat appear in the histogram equalized image: higher value re-flects higher energy. This “energy” image is passed into each oftwo different directional filters (a vertical filter and a horizontalfilter). These filters are used to indicate the presence of strongridges, and the orientation of these ridges.

At every pixel location in the filtered image, the filter whichprovides the largest value of output is recorded and encodedwith one bit to represent the identity of this directional filter.The iris image is thus transformed into a one-bit template thatis the same size as the image in polar coordinates (120 rowsby 180 columns). A template mask is also created during thisfiltering process. If both filter output values are not above a cer-tain threshold, then a mask bit is cleared for that particular pixellocation. The template mask is used to identify pixel locationswhere neither vertical nor horizontal directions are identified.

C. Template Matching via Hamming Distance

The template can now be compared to a stored template usingthe fractional Hamming distance as the measure of closeness(see Fig. 3). The fractional Hamming distance between two tem-plates is defined as

(1)The operator is the EXCLUSIVE OR operation to detect

disagreement between the pairs of bits that represent thedirections in the two templates, represents the binary ANDfunction, and masks A and B identify the values in each template



Fig. 3. New template is compared with each template stored in a database.

that are not corrupted by artifacts such as eyelids/eyelashesand specularities. The denominator of (1) ensures that only thebits that matter are included in the calculation, after artifactsare discounted. Rotation (due to head-tilt) mismatch betweenirises is handled with left–right shifts of the coded one-bittemplate to determine the minimum Hamming distance. Thelower the HD result, the greater the match between the two irisesbeing compared. The fractional Hamming distance between twotemplates is compared to a predetermined threshold value and amatch or nonmatch declaration is made. The time-consumptionof template matching can greatly affect speed performance,since it is tied to the size of the database that it is beingmatched against.

III. FPGA PARALLELIZATION

FPGAs are complex programmable logic devices that areessentially a “blank slate” integrated circuit that can be pro-grammed with nearly any parallel logic function. They are fullycustomizable and the designer can prototype, simulate, andimplement a parallel logic function without the costly processof having a new integrated circuit manufactured from scratch.FPGAs are commonly programmed via VHSIC HardwareDescription Language (VHDL). VHDL statements are inher-ently parallel, not sequential. VHDL allows the programmer todictate the type of hardware that is synthesized on an FPGA.For example, if you would like to have 2048 XOR logic gatesthat execute in parallel, then you program this directly in theVHDL code.

In this research, we will demonstrate a parallelization of threemajor components of the RED iris recognition algorithm: 1) irissegmentation with kurtosis, 2) template creation via filtering,and 3) template matching via hamming distance. We will con-clude this section with a brief discussion of future paralleliza-tions.

A. Iris Segmentation via Kurtosis

As stated previously, kurtosis is used to aid with iris segmen-tation. In the RED algorithm, repeated kurtosis calculations on5 5 elements are made across the size of an image. Kurtosis isoften used in probability theory to represent the “peakedness”of the probability distribution of a real-valued random variable[10]. Intuitively it is known as a measure of the volatilityof volatility. It is described by the following mathematical

function:

(2)

where represents the number of elements (25) for this calcu-lation, represents the mean for those elements, and is therandom variable representing their actual values.

Fig. 4 introduces the C++ code for the kurtosis function, andave represents the average of the block of code. It is calculatedinside a FOR loop, where the loop cycles through a 5 5 ma-trix and the index variable is, therefore, 25. Notice the sequen-tial nature, since each matrix element is added one by one. Thenext FOR loop contains the calculations for variance and fourthmoment by using the calculated ave above. For example, a dif-ference is first calculated followed by four multiplications and asummation to create the fourth moment. This process is also re-peated 25 times sequentially. After the loop, final computationfor the kurtosis function is performed with three divisions and amultiplication.

The Intel Assembly code for this calculation is shown inFig. 4. In general, eight sequential instructions are requiredto perform the loop overhead, following that each calculationrequires multiple instructions. For example, the fourth mo-ment calculation requires three floating point multiplications.Floating point multiplications typically require dozens ofprocessor cycles to execute. The total number of assemblyinstructions required to perform this kurtosis calculation is741, of which 259 are floating point instructions. Instructionexecution bandwidth for a processor is limited by the numberof functional units that it contains. For example, a processorwith four ALU functional units can maximally execute four in-structions per cycle. Therefore, in theory, a processor executingfour instructions per cycle at 4 GHz can maximally execute16-G instructions per second. Therefore, assuming a processorspeed of 4 GHz, four ALUs, and four instructions executed percycle, the total theoretical best case time is 46.3 ns

We have successfully parallelized the kurtosis function inVHDL. Illustrated in Fig. 4 is our VHDL code for the kurtosisfunction. As can be seen, it is also contained in a FOR loop that



Fig. 4. Kurtosis algorithm in (a) C++, (b) Intel Assembly, and (c) VHDL.

appears to be sequential. Each iteration however of a FOR loopis implemented in parallel. Therefore, all 25 iterations of theloop are performed concurrently. Due to the inherent multipli-cation required by the kurtosis algorithm, this compilation andsynthesis results in an implementation that utilizes the limitedon chip multipliers. We will discuss the hardware usage of ouralgorithm in the following section.

In addition to the previously described “intra” kurtosis func-tion parallelization, “inter” kurtosis function parallelization canalso be computed. Here, “inter” kurtosis function paralleliza-tion means computing multiple kurtosis function calculationssimultaneously. We synthesized a different hardware versionconsisting of two kurtosis algorithms executing in parallel todemonstrate feasibility and potential. We will present results



Fig. 5. (a) 9 9 filter computing circular convolution of top left portion of data. (b) Example two-directional filtering. (c) C++ code.

Fig. 6. Illustration of different possible digital filtering methods.

(shown below as “inter” with two copies) and discuss on-chipresource utilization in the following section.

B. Template Creation via Filtering

The RED iris recognition algorithm uses a digital filter togenerate the iris template. The energy data passed from the iris

segmentation process is organized into a rectangular array. Therectangular array is formed into a single-holed torus to help ac-count for edge effects. The filter passes over this periodic arraytaking in 81 (9 9) values at a time (note, in [2], 11 11 is used,although 9 9 provides very similar results). More specifically,the result is computed by first multiplying each filter value by



the corresponding input data value. Then a summation is per-formed, and the result is stored in a memory location that corre-sponds to the centroid of the filter. This process repeats for eachpixel in the input data, stepping right, column-by-column, anddown, row-by-row, until the filtering is complete, as shown inFig. 5(a).

Finally, the template is generated by comparing the resultsof two different directional filters (horizontal and vertical) andwriting a single bit that represents the filter with the highestoutput at the equivalent location. The output of each filter iscompared and for each pixel, a “1” is assigned for strong ver-tical content or a “0” for strong horizontal content. These bits areconcatenated to form a bit vector unique to the “iris signal” thatconveys the identifiable information. This final step is demon-strated in Fig. 5(b). In this example, the two filters are repre-sented in the middle of the diagram, while the image data is onthe left. In this example, the filter provides higher output withthe vertical filter, and hence a 1 bit is stored for this particularpixel location.

As described in prior art [3], there are three methods to com-pute a digital filter, as shown in Fig. 6. The first method uses ageneral purpose processor to compute the digital filter via soft-ware, executing a single instruction, and hence byte (shown as

1) at a time. This approach is acceptable if the digital filteris small; however, as the digital filter and data set to be fil-tered increases, the computational load rises exponentially. Thesecond approach is to develop hardware specifically designedto compute the digital filter but by first loading all the data (all81 bytes, illustrated 81) into an array of computing elements.By loading the data first (buffering the data into the hardware),this approach is similar to the sequential general purpose pro-cessor method due to overhead. Only partial calculations can becomputed until the entire kernel is filled. The third and final ap-proach is to redefine the filtering as a parallel process, allowingthe hardware to take in a single byte of data and compute themaximum influence that data would have on the final results.The method is a truly parallel process and minimizes the needfor overhead and buffering that other sequential processes use.Throughout this paper, the parallel method of approach will beinvestigated and compared in detail to sequential methods. Par-allelizing the RED algorithm filtering can be done simply bydoing the obvious—calculating the result of both filters at thesame time and computing the single bit representation in thetemplate simultaneously. Unfortunately, while this straightfor-ward approach appears to cut out a large amount of intermediatememory, it in fact suffers from a very serious memory bottle-neck. The energy data to compute a single result must first befully loaded into a hardware representation of each filter. As-suming the filters could compute the final template bit in a singleclock cycle, each filter must read 81 different 8-bit values of en-ergy data before any filtering can be done. Therefore, a straight-forward approach to digital filtering in hardware will suffer dueto the overhead associated with waiting for data to be localizednear the computing elements. In this case, since the computingelements are idle until the 81 data values are loaded, the processis fundamentally sequential much like the general purpose pro-cessor. Consequently, a straightforward filtering in hardware isinherently not parallel with a single bank of memory holding allthe energy data.

To parallelize the digital filter, the memory itself must be par-allelized. A possible but impractical option to parallelize the

straightforward digital filter would be to create an 81 portedblock of memory or 81 parallel blocks of memory that couldsupply all energy data to both filters in a single step. Either casewould exceed the memory resources of even the most massiveFPGAs while also potentially creating interconnect issues. Fur-thermore, even if the memory issue was inconsequential to thestraightforward digital filtering, there is no way around the over-head associated with needing at least 81 energy data values andat least nine more properly arranged to step in either the hori-zontal or vertical directions.

There is an alternative way to accomplish the digital filteringthat also offers greater opportunity for parallelization while alsoreducing the bandwidth requirements on memory. Our digitalfiltering is a commutative mathematical operation of the form

or

The resulting function is either function filtered withor filtered with . This concept is critical in modifying thedigital filtering algorithm so that it can more easily be imple-mented in hardware. Typically, the filter is streamed overtop theinput data, but for parallelizing purposes, the input data will bestreamed over the centroid of the filter. This approach is analo-gous to dropping the energy data onto the centroid of the filterand computing the partial product of all results within scope.Commuting the filtering eliminates the overhead associated withbuffering all the energy data into the filter to calculate a singleresult and greatly simplifies the hardware implementation (seeFig. 7).

The parallelized digital filters maximize the computations forany single 8 bits of energy data so that input buffering is kept atan absolute minimum while also offering quite a bit of freedomwith regard to the form of the resulting data. The parallelizeddigital filter computes the partial products and writes the resultsto memory into the appropriately wrapped memory locations.While outside the scope of this paper, it is likely possible tocompute the partial results directly into the finalized template,totally eliminating the need for intermediate memory but thatis left to the discretion of the engineer to handle the filters inwhatever way best fits the application.

The RED algorithm digital filtering is designed and imple-mented in VHDL software. The parallel hardware filter con-sists of 81 accumulator cores arranged in a nine by nine arrayto represent the original filter. The accumulators are groupedwithin the array by row since the energy data from the segmen-tation process (or most other common process) emerges left toright, top to bottom, easing the memory access as the filter arraysteps through the incoming data. Each row is accommodatedwith a block of simple dual ported ram to support all the si-multaneous reading and writing needs of the accumulator corerow. Also among each row is a multiplexer to narrow the con-trol of memory to a single accumulator core and a row con-troller to grant any accumulator within the row memory controland issue the proper control signals to the row’s accumulatorcores. Finally, the array is fitted with a first-in-first-out (FIFO)memory buffer to ease incoming data and a comparator modulethat judges which filter had the highest output and issues theappropriate bit to the finalized template for matching. The full



Fig. 7. 9 9 commutated circular convolution.

Fig. 8. Template generation VHDL system block diagram.

system is shown in Fig. 8. Due to the complexity, the VHDLcode for this portion of our parallelization is not shown here.

C. Template Matching via Hamming Distance

The focus of this section is on parallelizing the templatematching portion of the algorithm. The template matchingportion of the algorithm is important due to its repeated execu-tion (depending on the number of iris comparisons necessary).Illustrated in Fig. 9 is optimized C++ code for computing thefractional HD between two templates. In order to help countthe number of 1’s in a binary number, the code contains alookup table (Value). The lookup table holds the number of1’s that are in that particular 16 bit number. This lookup tableis small enough to fit into the cache and has been shown toexecute faster because it requires less assembly instructionsthan a recursive or loop-based method.

We would like to highlight the sequential nature of this code.For example, since the XOR function is performed 32 bits at atime, a loop (FOR loop denoted below) is necessary. Since itis computing 2048 bits, this loop is executed 64 times. Also,note that the XOR and AND computations are also performed se-quentially. These instructions could be scheduled to execute inparallel, but a modern CPU has a limited number of functionalunits, therefore limiting the amount of parallel execution. Fi-nally, a score is computed as a ratio of the number of differ-

ences between the templates to the total number of bits that arenot masked.

Illustrated in Fig. 9 is the associated assembly code createdfor the hamming distance calculation. The code is compiled fora Xeon Processor, and hence IA-32 assembly code is produced[11]. For each C++ computation, there are at least five assemblylanguage instructions required. For example, for the AND com-putation that is in C++ code, there are four MOV instructionsand one AND instruction. The MOV instructions are required tomove data in and out of registers in preparation for the logical in-struction. This AND instruction is a 32-bitwise computation per-formed by an ALU functional unit in the processor. Illustratedin Fig. 10 are the assembly instructions that are required for theloop instruction. Loop instructions require the aforementionedoverhead. For each iteration of the loop, there is required a totalof 38 assembly instructions. As stated above, this code requires64 loops to perform a template match. Therefore, according tothe following formula, a template match on a perfect 4-GHz 4IPC microprocessor is 151 ns per match:

Ideally, if 2048 matching elements could fit onto the FPGA,then the iris recognition algorithm could be many times faster.Fig. 11 illustrates our VHDL code that implements the HD cal-



Fig. 9. Template matching C++ and intel assembly.

Fig. 10. Overhead assembly instructions for loop.

culation. Here we perform the same function as the aforemen-tioned C++ code. However, we are doing this computation com-pletely in parallel. There are 2048 XOR gates and 4096 AND gatesrequired for this computation. In addition, adders are requiredfor summing and calculating the score.

The iris templates must be stored either in memory on theFPGA or off-chip. In one instance of our implementation, wehave implemented a 2048-bit-wide memory in VHDL. Wehave added this to our code to verify that a small databasecan be stored on chip. One of the two templates compared isreceived from this dual-ported 2048-bit-wide single cycle cacheimplemented on our FPGA. Therefore, once per clock cycle, a2048-bit vector is fetched from on-chip memory, and the HDcalculation is performed. We have successfully implemented

and tested the HD calculation with and without a memorydevice.

D. Future Topics

Segmentation in iris recognition algorithms requires substan-tial computer vision and digital signal processing techniques toaccurately detect the inner and outer boundaries of an iris em-bedded in a bitmap image. Initially, finding the pupil as a largedark object in a bitmap image can be complicated by glare andeccentricities which can easily throw off an algorithm. To mit-igate anomalies in the pupil, a series of dilations and erosionsexpand the black pupil, close gaps created by glare, and returnthe pupil to its original shape. After dilation and erosion, a com-puter can better detect the shape of the pupil as a characteristic



Fig. 11. VHDL implementation of the hamming distance calculation.

Fig. 12. DE2 board from Altera Corporation.

large dark object with a center point and radius, setting it apartfrom eyebrows and eyelashes which remain irregular in shape.

In the RED iris recognition algorithm, dilation and erosionrepresent approximately 30% of the segmentation executiontime in software. A proposed alternative implements the di-lation and erosion as an array of cells within hardware-basedarchitecture such as an FPGA. We expect a parallel FPGAimplementation to be complete after only a few clock cyclescompared to the millions needed in the software dilation anderosion method.

IV. RESULTS AND ANALYSIS

A. Hardware Details

The CPU experiment is executed on an Intel Xeon X5355[12] workstation class machine. The processor is equipped witheight cores, 2.66-GHz clock, and an 8-MB L2 cache. While

there are eight cores available, only one core is used to performthis test, therefore allowing all cache and memory resources forthe code under test. The HD code was compiled under WindowsXP using the Visual Studio software suite. The code has beenfully optimized to enhance performance. Additionally, millionsof matches were executed to ensure that the templates are fullycached in the on-chip L2 cache.

The RED algorithm optimized C++ code time is faster thansome of the times reported in the literature for other commercialiris recognition implementations [27]. For example, the tem-plate matching time published in [27] is 10 m, whereas ourtemplate matching time is 383 ns. We attribute this difference to:1) our use of a faster, more modern CPU operating at 2.66 GHzvice 300 MHz; 2) the RED algorithm is fully optimized; and3) the RED algorithm is different from [27]. An extrapolationof the execution times of [27] on a modern computing environ-ment at 2.66 GHz would leave it still slower, but closer to the



Fig. 13. FPGA versus CPU comparison for iris match execution.

Fig. 14. FPGA hardware usage.

RED execution times. Our goal here is not to provide a directcomparison of the execution times of these two iris recognitionalgorithms. However, this indicates that our C++ code is a rea-sonable target for comparison to the FPGA performance. Fur-thermore, due to iris recognition algorithm similarity, one canreasonably expect similar improvements from application map-ping to FPGA technology for other iris recognition algorithms.

The FPGA experiment is executed on a DE2 [13] board pro-vided by Altera Corporation. The DE2 board includes a Cy-clone-II EP2C35 FPGA chip, as well as the required program-ming interface. The DE2 board is illustrated in Fig. 12. Althoughthe DE2 board is utilized for this research, only the Cyclone-IIchip is necessary to execute our algorithm. The Cyclone-II [14]family is designed for high-performance low-power applica-tions. It contains over 30 000 logic elements (LEs) and over480 000 embedded memory bits. In order to program our VHDLonto the Cyclone-II, we utilize the Altera Quartus software forimplementation of our VHDL program. The Quartus suite in-cludes compilation, synthesis, simulation, and programming en-vironments. We are able to determine the size required of ourprogram on the FPGA, and the resulting execution time.

B. Comparisons

1) Speedup: All VHDL code is fully synthesizable and isdownloaded onto our DE2 for direct hardware execution. Asdiscussed above, our code is fully contained within a “process”statement. The process statement is only initiated when a signalin its sensitivity list changes values. The sensitivity list of ourprocess contains the clock signal and, therefore, the code is ex-ecuted once per clock cycle. In this code, the clock signal isdrawn from our DE2 board which contains a 50-MHz clock.Therefore, every 20 ns, our calculation is computed. Note that

for template generation, this process takes many clock cycles,and therefore, the total time for template generation with ourFPGA is 98.6 us.

Fig. 13 illustrates the execution times and accelerationachieved for our implemented FPGA version on the Cyclone-IIEP2C35 versus a Xeon-based C++ version. For example, theoptimized C++ version takes 96.8 ns per match while the FPGAtakes 20 ns per match for the kurtosis function. For the kurtosisimplementation where we have two concurrent implementa-tions on our FPGA, execution time is cut in half to 10 ns.

The main result of this research is that the implementation ona modest sized FPGA is approximately 9.6, 324, and 19 timesfaster than a state-of-the-art CPU design for the kurtosis, fil-tering, and hamming distance calculations, respectively. Alsoillustrated in Fig. 13 are the theoretical best performances fora future Intel-based machine, assuming a 4-GHz clock speedand four ALUs. The parallel algorithm on our FPGA is approx-imately 2.3, 4.7, and 7.5 times faster than the theoretical bestCPU design for the kurtosis, filtering, and hamming distancecalculations, respectively. If Intel were to design a much fastermicroprocessor with a perfect compiler, it still would be ordersof magnitude slower than an off-the-shelf inexpensive low-endFPGA.

2) Hardware Usage: Fig. 14 presents the utilization of theFPGA LEs for our hardware implementations. For example, ourimplementation of the “inter” with two-copy kurtosis functionparallelization utilizes 9% of our Cyclone-II FPGA. We foundthat the single-copy version utilized 0% of the LEs, but ex-hausted all of the on-chip multipliers. Only 9% of our chip isutilized by the second kurtosis function implementation.

For the template generation code, a much more parallel andhardware intensive approach is utilized, therefore extending the



usage of the Cyclone-II chip. As illustrated, the template gener-ation consumes 71% of the Cyclone-II LEs, and 17% of memorybit availability.

Our implementation of the hamming distance algorithmutilizes 73% of our Cyclone-II FPGA. In terms of on-chipmemory usage, one of the two templates compared is stored inthis dual-ported 2048-bit-wide single-cycle cache implementedon our Cyclone-II FPGA. Each stored template consumes 0.7%of on-chip memory. We have added this to our code to verifythat a small database of approximately 230 can be stored onchip.

3) Projections and Other Issues: The Cyclone-II is not astate-of-the-art design. A projection of the performance andhardware usage of a state-of-the-art Stratix IV at 500-MHz[23] FPGA is given in Figs. 13 and 14, respectively. For ex-ample, given traditional scaling of FPGAs, we anticipate that aStratix-IV at 500 MHz would be able to perform the hammingdistance computation in only 2 ns. Therefore, a state-of-the-artFPGA can outperform a state-of-the-art computer by 90, 1500,and 190 times faster than a state-of-the-art CPU design forthe kurtosis, filtering, and hamming distance calculations,respectively. The hardware usage of the Stratix-IV for our threecomponents is also drastically reduced. For example, for thekurtosis function, parallelism can be increased since each 55 matrix computation will now only consume less than 1%of the chip. All told, the three algorithms now only consume0.9%, 7.1%, and 7.3% of the Stratix-IV for kurtosis, filtering,and hamming distance calculations, respectively. Therefore, afull implementation of a very fast iris recognition algorithm ismore than feasible.

On-chip memory for the Stratix-IV is also much larger with22.4 Mbits of on-chip storage [23]. This is important for bothiris template storage and digital filtering. For example, a data-base consisting of 10 000 irises can be stored on the Stratix-IV.We anticipate this storage scaling trend to continue into the fu-ture, with larger and larger database storage available. If a largerdatabase is necessary, we propose an implementation where aDRAM chip is provided as part of the package, and the on-chipdatabase is concurrently loaded while hamming distances arebeing computed.

Although we have not yet empirically analyzed it, anotheradvantage of this FPGA-based technique is power and energyreduction. FPGAs, in general, consume far less power than theirCPU counterparts. A detailed study of the power and energyadvantages of this implementation is outside the scope of thisstudy and left for future work.

V. CONCLUSION

In this paper, we have provided novel hardware implemen-tations which enabled us to discover that three key portions ofan iris recognition algorithm can be parallelized. The main re-sult is that the implementation on a modest sized FPGA is ap-proximately 9.6, 324, and 19 times faster than a state-of-the-artCPU design for portions of iris segmentation, template creation,and template matching components, respectively. Furthermore,the parallel algorithm on our FPGA also greatly outperformsour calculated theoretical best Intel CPU design. Finally, on astate-of-the-art FPGA, we conclude that a full implementationof a very fast iris recognition algorithm is more than feasible,providing a potential small form-factor solution.

FPGAs have been on an impressive scaling trend over thelast 10 years. They are increasing in capacity, in terms of LEs,increasing in memory, in terms of on-chip bit storage, increasingin speed, and decreasing in power consumption. We expect thisscaling trend to continue in the short term and we are excitedto study the different advantages of FPGAs, including powerconsumption. Given the commercial and military demands of anaccurate, timely, and small iris recognition system, we believean FPGA-based option presents an exciting parallel alternativefor this parallel application.

REFERENCES

[1] J. Daugman, “Probing the uniqueness and randomness of IrisCodes:Results from 200 billion iris pair comparisons,” Proc. IEEE, vol. 94,no. 11, pp. 1927–1935, Nov. 2006.

[2] R. W. Ives, R. Broussard, L. Kennell, R. N. Rakvic, and D. Etter, “Irisrecognition using the ridge energy direction (RED) algorithm,” in Proc.42nd Ann. Asilomar Conf. Signals, Systems and Computers, PacificGrove, CA, Nov. 2008.

[3] R. N. Rakvic, B. J. Ulis, R. P. Broussard, and R. W. Ives, “Iris templategeneration with parallel logic,” in Proc. 42nd Ann. Asilomar Conf. Sig-nals, Systems and Computers, Pacific Grove, CA, Nov. 2008.

[4] J. Huang, Y. Wang, T. Tan, and J. Cui, “A new iris segmentationmethod for recognition,” in Proc. 17th Int. Conf. Pattern Recognition,Aug. 2004, vol. 3, pp. 554–557.

[5] W. Kong and D. Zhang, “Accurate iris segmentation based on novelreflection and eyelash detection model,” in Proc. 2001 Int. Symp. In-telligent Multimedia, Video and Speech Processing, May 2001, pp.263–266.

[6] Q. Tian, Q. Pan, Y. Cheng, and Q. Gao, “Fast algorithm and applica-tion of Hough transform in iris segmentation,” in Proc. 2004 Int. Conf.Machine Learning and Cybernetics, Aug. 2004, vol. 7, pp. 3977–3980.

[7] A. Bachoo and J. Tapamo, “A segmentation method to improve iris-based person identification,” in Proc. 7th AFRICON Conf., Sep. 2004,vol. 1, pp. 403–408.

[8] J. Daugman, “High confidence visual recognition of persons by a testof statistical independence,” IEEE Trans. Pattern Anal. Mach. Intell.,vol. 15, no. 11, pp. 1148–1161, Nov. 1993.

[9] R. W. Ives, L. Kennell, R. Gaunt, and D. Etter, “Iris segmentationfor recognition using local statistics,” in Proc. 39th Ann. AsilomarConf. Signals, Systems and Computers, Pacific Grove, CA, 2005, pp.859–863.

[10] “Engineering Statistics Handbook” 2008 [Online]. Available: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

[11] Intel Corporation [Online]. Available: http://processorfinder.intel.com/details.aspx?sSpec=SL9YM, Jun. 16, 2008

[12] Intel Corporation [Online]. Available: http://www.intel.com/products/processor/manuals/index.htm, Jun. 25, 2008

[13] Altera Corporation [Online]. Available: http://www.altera.com/educa-tion/univ/materials/boards/unv-de2-board.html, Jun. 16, 2008

[14] Altera Corporation [Online]. Available: http://www.altera.com/prod-ucts/devices/cyclone2/cy2-index.jsp, Jun. 16, 2008

[15] L. Kennell, R. W. Ives, and R. M. Gaunt, “Binary morphology and localstatistics applied to iris segmentation for recognition,” presented at the2006 IEEE Int. Conf. Image Processing, Atlanta, GA, Oct. 2006.

[16] J. Daugman, “Statistical richness of visual phase information,” Int. J.Comput. Vision, vol. 45, no. 1, pp. 25–38, 2001.

[17] L. Masek, “Recognition of Human Iris Patterns for Biometric Identi-fication” Masters Thesis, The University of Western Australia, Perth,May 12, 2006 [Online]. Available: www.csse.uwa.edu.au/~pk/student-projects/libor/LiborMasekThesis.pdf

[18] R. P. Wildes, J. C. Asmuth, G. L. Green, S. C. Hsu, R. J. Kolczynski,J. R. Matey, and S. E. McBride, “A machine vision system for irisrecognition,” Mach. Vision Applicat., vol. 9, pp. 1–8, 1996.

[19] W. W. Boles and B. Boashash, “A human identification technique usingimages of the iris and wavelet transform,” IEEE Trans. Signal Process.,vol. 46, no. 4, pp. 1185–1188, Apr. 1998.

[20] L. Ma, T. Tan, Y. Wang, and D. Zhang, “Efficient iris recognition bycharacterizing key local variations,” IEEE Trans. Image Process., vol.13, no. 6, pp. 739–750, Jun. 2004.

[21] Y. Zhu, T. Tan, and Y. Wang, “Biometric personal identification basedon iris patterns,” in Proc. 15th IEEE ICPR, 2000, vol. 2, pp. 801–804.



[22] C. Park, J. Lee, M. Smith, and K. Park, “Iris-based personal authenti-cation using a normalized directional energy feature,” in Proc. 4th Int.Conf. Audio- and Video-Based Biometric Person Authentication, Jan.2003, pp. 224–232.

[23] FPGA J. Sep. 2, 2008 [Online]. Available: http://www.fpgajournal.com/articles_2008/20080520_40nm.htm

[24] R. P. Broussard, R. N. Rakvic, and R. W. Ives, “Accelerating iris tem-plate matching using commodity video graphics adapters,” in 2008IEEE Int. Conf. Biometrics: Theory, Applications and Systems, CrystalCity, VA, Sep. 2008.

[25] N. A. Schmid, M. Ketkar, H. Singh, and B. Cukic, “Performance anal-ysis of iris-based identification system at the matching score level,”IEEE Trans. Inf. Forensics Security, vol. 1, no. 2, pp. 154–168, Jun.2006.

[26] F. Hao, J. Daugman, and P. Zielinski, “A fast search algorithm for alarge fuzzy database,” IEEE Trans. Inf. Forensics Security, vol. 3, no.2, pp. 203–212, Jun. 2008.

[27] J. Daugmann, “How iris recognition works,” IEEE Trans. Circuits Syst.Video Technol., vol. 14, no. 1, pp. 21–30, Jan. 2004.

[28] D. M. Monro, S. Rakshit, and D. Zhang, Iris Image Database Univer-sity of Bath, U.K. [Online]. Available: http://www.bath.ac.uk/elec-eng/pages/sipg/irisweb

[29] CASIA Iris Image Database [Online]. Available: http://www.sinobio-metrics.com

Ryan N. Rakvic (M’01) received the B.S. degreein computer engineering from the University ofMichigan and the M.S. and Ph.D. degrees in com-puter engineering from Carnegie Mellon University.

He is currently an Assistant Professor in theElectrical and Computer Engineering Department,U.S. Naval Academy, Annapolis, MD. Prior tothat, he spent five years in the Computer ResearchLaboratory at Intel Corporation, Santa Clara, CA.During his time at Intel, he invented creative waysto increase the performance of Pentium micropro-

cessors. More recently, he is attempting to utilize the emerging efficiency offield-programmable gate arrays (FPGAs) to outperform a general purposemicroprocessor both in terms of performance and power efficiency.

Bradley J. Ulis received the B.S. degree in elec-trical engineering from the U.S. Naval Academy,Annapolis, MD, and is currently pursuing the M.S.degree in electrical engineering from the U.S. NavalPostgraduate School, Monterey, CA. His researchinterests include field-programmable gate arraydesign and biometric signal processing.

Randy P. Broussard received the B.S. degree inelectrical engineering from Tulane University in1986, the M.S. degree in computer engineering fromFlorida Institute of Technology in 1991, and thePh.D. degree in electrical engineering from the AirForce Institute of Technology in 1997.

He is currently an Assistant Professor in theDepartment of Systems Engineering, U.S. NavalAcademy, where he teaches computer engineering,computer vision, and control systems. His researchincludes the areas of pattern recognition, iris recog-

nition, neural networks, automatic target recognition, computer vision, andbreast cancer detection.

Robert W. Ives (S’96–M’97–SM’03) received theB.S. degree in mathematics from the U.S. NavalAcademy, Annapolis, MD, the M.S. degree in elec-trical engineering from the U.S. Naval PostgraduateSchool, Monterey, CA, and the Ph.D. degree inelectrical engineering from the University of NewMexico, Albuquerque.

He served 20 years in the U.S. Navy, including5 1/2 years as a military researcher at Sandia Na-tional Laboratories, Albuquerque, and three years asan Assistant Professor in the Electrical Engineering

Department, U.S. Naval Postgraduate School. After retiring from the Navy, hereturned to the U.S. Naval Academy, where he is currently an Associate Pro-fessor, and is the Director of the USNA Center for Biometric Signal Processing.His research interests include biometric signal processing, image processing andcompression, adaptive signal processing, information theory, and communica-tions.

Neil Steiner received the B.A. degree from WheatonCollege, the B.S.E.E. degree in electrical engineeringfrom Illinois Institute of Technology, the M.S. degreein electrical engineering from Virginia Tech., and thePh.D. degree in electrical engineering from VirginiaTech. in 2008.

He is a Research Programmer at the Universityof Southern California’s Information SciencesInstitute. His research interests include autonomouscomputing systems, hardware–software interaction,low-level FPGA configurations, and trusted inte-

grated circuits.


812 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND ... · Ryan N. Rakvic, Member, IEEE, Bradley J....

Documents

Transcript of 812 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND ... · Ryan N. Rakvic, Member, IEEE, Bradley J....