
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 9, NO. 2, FEBRUARY 2014 147

An Efficient Parallel Approach for Sclera Vein Recognition

Yong Lin, Eliza Yingzi Du, Senior Member, IEEE, Zhi Zhou, Student Member, IEEE, and N. Luke Thomas

Abstract: Sclera vein recognition has been shown to be a promising method for human identification. However, its matching speed is slow, which could limit its use in real-time applications. To improve the matching efficiency, we propose a new parallel sclera vein recognition method that uses a two-stage parallel approach for registration and matching. First, we designed a rotation- and scale-invariant Y shape descriptor-based feature extraction method to efficiently eliminate most unlikely matches. Second, we developed a weighted polar line sclera descriptor structure that incorporates mask information to reduce GPU memory cost. Third, we designed a coarse-to-fine two-stage matching method. Finally, we developed a mapping scheme to map the subtasks to GPU processing units. The experimental results show that the proposed method achieves a dramatic improvement in processing speed without compromising recognition accuracy.

Index Terms: Sclera vein recognition, sclera feature matching, sclera matching, parallel computing, GPGPU.

I. INTRODUCTION

THE sclera is the opaque, white outer layer of the eye. The blood vessel structure of the sclera is formed randomly and is unique to each person [1, 2], so it can be used for human identification [3-6]. Several researchers have designed sclera vein recognition methods and have shown that sclera vein recognition is promising for human identification. In [4], Crihalmeanu and Ross proposed three approaches for feature registration and matching: a Speeded Up Robust Features (SURF)-based method, minutiae detection, and direct correlation matching. Of these three methods, the SURF method achieves the best accuracy. It takes an average of 1.5 seconds¹ to perform a one-to-one matching using the SURF method. In [3], Zhou et al. proposed a line descriptor-based method for sclera vein recognition. The matching step (including registration) is the most time-consuming step in this sclera vein recognition system, costing about 1.2 seconds to perform a one-to-one matching. Both speeds were measured on a PC with an Intel Core 2 Duo 2.4 GHz processor and 4 GB of DRAM. Currently, sclera vein recognition algorithms [3, 4] are designed for central processing unit (CPU)-based systems. As discussed in [7], CPU-based systems are designed as sequential processing devices, which may not be efficient for data processing that can be parallelized. Because the matching step consumes so much time, sequential sclera vein recognition would be very challenging to implement in a real-time biometric system, especially when there is a large number of templates in the database for matching.

Manuscript received October 28, 2012; revised June 4, 2013 and September 29, 2013; accepted October 26, 2013. Date of publication November 14, 2013; date of current version January 7, 2014. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Patrizio Campisi.

Y. Lin is with the School of Computer Science, Xidian University, Xi'an 710071, China, and also with the Department of Computer Science, Ningxia Normal University, Guyuan 756000, China (e-mail: [email protected]).

E. Y. Du was with Purdue University, Indianapolis, IN 47907 USA. She is now with Qualcomm, Santa Clara, CA 92121 USA (e-mail: [email protected]).

Z. Zhou was with Purdue University, Indianapolis, IN 47907 USA. He is now with Allen Institute for Brain Science, Seattle, WA 98103 USA (e-mail: [email protected]).

N. L. Thomas is with the Biometrics and Pattern Recognition Laboratory, Department of Electrical and Computer Engineering, Indiana University-Purdue University, Indianapolis, IN 46202 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIFS.2013.2291314

¹ This speed is based on our implementation of their method.

General-purpose graphics processing units (GPGPUs, abbreviated here as GPUs) are now widely used for parallel computing to improve computational processing speed and efficiency [8-20]. The highly parallel structure of GPUs makes them more effective than CPUs for data processing that can be performed in parallel. GPUs have been widely used in recognition tasks such as speech recognition [8], text detection [9], handwriting recognition [10], and face recognition [14]. In iris recognition [15], the GPU was used to extract the features, construct descriptors, and match templates. GPUs are also used for object retrieval and image search [16-19]. Park et al. [20] evaluated the performance of image processing algorithms, such as linear feature extraction and multi-view stereo matching, on GPUs. However, these approaches were designed for their specific biometric recognition applications and feature searching methods. Therefore, they may not be efficient for sclera vein recognition.

Compute Unified Device Architecture (CUDA), the computing engine of NVIDIA GPUs, is used in this research. A CUDA-capable GPU is a highly parallel, multithreaded, many-core processor with tremendous computational power [21]. It supports not only a traditional graphics pipeline but also computation on non-graphical data. More importantly, it offers an easier programming platform that outperforms its CPU counterparts in terms of peak arithmetic intensity and memory bandwidth [22].

In this research, the goal is not to develop a unified strategy to parallelize all sclera matching methods, because each method is quite different from the others and would need a customized design. An efficient parallel computing scheme would need different strategies for different sclera vein recognition methods. Rather, the goal is to develop a parallel sclera matching solution for sclera vein recognition based on our sequential line-descriptor method [3] and the CUDA GPU architecture. However, the parallelization strategies developed in this research can be applied to design parallel approaches for other sclera vein recognition methods and can help parallelize general pattern recognition methods.

1556-6013 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Based on the matching approach in [3], there are three challenges in mapping the task of sclera feature matching to the GPU. 1) Mask files are used to calculate the valid overlapping areas of two sclera templates and to align the templates to the same coordinate system. But the mask files are large; they occupy GPU memory and slow down data transfer. Also, some of the processing on the mask files involves convolution, whose performance is difficult to improve on the scalar processing units of CUDA. 2) The procedure of sclera feature matching consists of a pipeline of several computational stages with different memory and processing requirements. There is no uniform mapping scheme applicable to all these stages. 3) When the sclera database is far larger than the number of processing units on the GPU, parallel matching on the GPU is still unable to satisfy the requirement of real-time performance. New designs are necessary to help narrow down the search range. In summary, a naive parallel implementation of the algorithms would not work efficiently.

Note that it is relatively straightforward to port our CUDA C program to AMD-based GPUs using OpenCL. Our CUDA kernels can be directly converted to OpenCL kernels by accounting for the different syntax of various keywords and built-in functions. The mapping strategy is also effective in OpenCL if we regard a thread and a block in CUDA as a work-item and a work-group in OpenCL. Most of our optimization techniques, such as coalesced memory access and prefix sum, work in OpenCL too. Moreover, since CUDA is a data-parallel architecture, an OpenCL implementation of our approach should be programmed in the data-parallel model.

In this research, we first discuss why the naive parallel approach would not work (Section 3). We then propose a new sclera descriptor, the Y shape sclera feature, for efficient registration that speeds up the mapping scheme (Section 4); introduce the weighted polar line (WPL) descriptor, which is better suited to parallel computing and mitigates the mask size issue (Section 5); and develop our coarse-to-fine two-stage matching process to dramatically improve the matching speed (Section 6). These new approaches make the parallel processing possible and efficient. However, it is non-trivial to implement these algorithms in CUDA, so we then develop the implementation schemes that map our algorithms onto CUDA (Section 7). In Section 2, we give a brief introduction to sclera vein recognition. In Section 8, we report experiments using the proposed system. In Section 9, we draw some conclusions.

II. BACKGROUND OF SCLERA VEIN RECOGNITION

A. Overview of Sclera Vein Recognition

A typical sclera vein recognition system includes sclera segmentation, feature enhancement, feature extraction, and feature matching (Figure 1).

Fig. 1. The diagram of a typical sclera vein recognition approach.

Sclera image segmentation is the first step in sclera vein recognition. Several methods have been designed for sclera segmentation [3, 4, 23-25]. Crihalmeanu et al. [25] presented a semi-automated system for sclera segmentation. They used a clustering algorithm to classify the color eye images into three clusters: sclera, iris, and background. Later on, Crihalmeanu and Ross [4] designed a segmentation approach based on a normalized sclera index measure, which includes coarse sclera segmentation, pupil region segmentation, and fine sclera segmentation. Zhou et al. [3] developed a skin tone plus white color-based voting method for sclera segmentation in color images and an Otsu's thresholding-based method for grayscale images.

After sclera segmentation, it is necessary to enhance and extract the sclera features, since the sclera vein patterns often lack contrast and are hard to detect. Zhou et al. [3] used a bank of multi-directional Gabor filters for vascular pattern enhancement. Derakhshani et al. [23] used contrast limited adaptive histogram equalization (CLAHE) to enhance the green color plane of the RGB image, and a multi-scale region growing approach to separate the sclera veins from the image background. Crihalmeanu and Ross [4] applied a selective enhancement filter for blood vessels to extract features from the green component of a color image.

In the feature matching step, Crihalmeanu and Ross [4] proposed three registration and matching approaches: Speeded Up Robust Features (SURF), which is based on interest-point detection; minutiae detection, which is based on minutiae points on the vasculature structure; and direct correlation matching, which relies on image registration. Zhou et al. designed a line descriptor-based feature registration and matching method [3].

B. Overview of the Line Descriptor-Based Sclera Vein Recognition Method

The matching stage of the line descriptor-based method is a bottleneck with regard to matching speed. In this section, we briefly describe the line descriptor-based sclera vein recognition method.

After segmentation, vein patterns are enhanced by a bank of directional Gabor filters. Binary morphological operations are used to thin the detected vein structure down to a single-pixel-wide skeleton and remove the branch points. The line descriptor is used to describe the segments in the vein structure [3]. Figure 2 shows a visual description of the line descriptor. Each segment is described by three quantities: the segment's angle to some reference angle at the iris center, $\theta$; the segment's distance to the iris center, $r$; and the dominant angular orientation of the line segment, $\phi$.

Thus, the descriptor is $S = (\theta \; r \; \phi)^T$. The individual components of the line descriptor are calculated as:

$$\theta = \tan^{-1}\left(\frac{y_l - y_i}{x_l - x_i}\right), \quad r = \sqrt{(y_l - y_i)^2 + (x_l - x_i)^2}, \quad \text{and} \quad \phi = \tan^{-1}\left(\frac{d}{dx} f_{line}(x)\right). \tag{1}$$

Fig. 2. The sketch of parameters of segment descriptor [3].

Here $f_{line}(x)$ is the polynomial approximation of the line segment, $(x_l, y_l)$ is the center point of the line segment, $(x_i, y_i)$ is the center of the detected iris, and $S$ is the line descriptor.
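The three components in (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `line_descriptor` and the coefficient layout (highest power first) are hypothetical, and `atan2` is used in place of a plain arctangent so that a vertical offset does not cause a division by zero, which also means the angle covers the full circle rather than a half-plane.

```python
import math

def line_descriptor(segment_center, iris_center, poly_coeffs):
    """Return (theta, r, phi) for one vessel segment, following Eq. (1).

    poly_coeffs holds the coefficients of f_line(x), highest power first
    (a hypothetical input format; the text does not specify one).
    """
    xl, yl = segment_center
    xi, yi = iris_center
    # Angle of the segment center as seen from the iris center.
    theta = math.atan2(yl - yi, xl - xi)
    # Distance from the iris center to the segment center.
    r = math.hypot(xl - xi, yl - yi)
    # Slope of f_line at x = xl: evaluate the derivative with Horner's rule.
    degree = len(poly_coeffs) - 1
    slope = 0.0
    for k, c in enumerate(poly_coeffs[:-1]):
        slope = slope * xl + c * (degree - k)
    # Dominant orientation of the segment.
    phi = math.atan(slope)
    return theta, r, phi
```

For a segment centered at (3, 4) with the iris at the origin and the fit f_line(x) = x, the sketch gives r = 5 and an orientation of 45 degrees.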

To register the segments of the vascular patterns, a RANSAC-based algorithm is used to estimate the best-fit parameters for registration between the two sclera vascular patterns. The registration algorithm randomly chooses two points, one from the test template and one from the target template. It also randomly chooses a scaling factor and a rotation value, based on a priori knowledge of the database. Using these values, it calculates a fitness value for the registration with these parameters [3].

After sclera template registration, each line segment in the test template is compared to the line segments in the target template for matches. To reduce the effect of segmentation errors, we created the weighting image (Figure 3) from the sclera mask by setting interior pixels in the sclera mask to 1, pixels within some distance of the boundary of the mask to 0.5, and pixels outside the mask to 0.
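The construction of the weighting image can be illustrated with a small pure-Python sketch. The function name `weight_image`, the `border` parameter (the "some distance" of the text, measured here as a square neighborhood), and the choice to treat pixels beyond the image edge as outside the mask are all assumptions made for illustration.

```python
def weight_image(mask, border=1):
    """Build a weighting image from a binary sclera mask.

    mask is a list of rows containing 0 or 1. Pixels outside the mask get 0,
    mask pixels within `border` pixels of a non-mask pixel get 0.5, and the
    remaining interior pixels get 1.
    """
    height, width = len(mask), len(mask[0])
    weights = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue  # outside the sclera: weight stays 0
            near_boundary = False
            for dy in range(-border, border + 1):
                for dx in range(-border, border + 1):
                    ny, nx = y + dy, x + dx
                    outside_image = not (0 <= ny < height and 0 <= nx < width)
                    if outside_image or not mask[ny][nx]:
                        near_boundary = True
            weights[y][x] = 0.5 if near_boundary else 1.0
    return weights
```

Because the weights depend only on the template's own mask, they can be computed once per template rather than at matching time.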

The matching score for two segment descriptors is calculated by [3]:

$$m(S_i, S_j) = \begin{cases} w(S_i)\,w(S_j), & d(S_i, S_j) \le D_{match} \text{ and } |\phi_i - \phi_j| \le \phi_{match} \\ 0, & \text{else,} \end{cases} \tag{2}$$

where $S_i$ and $S_j$ are two segment descriptors, $m(S_i, S_j)$ is the matching score between segments $S_i$ and $S_j$, $d(S_i, S_j)$ is the Euclidean distance between the segment descriptors' center points (from Eq. 6-8), $D_{match}$ is the matching distance threshold, and $\phi_{match}$ is the matching angle threshold. The total matching score, $M$, is the sum of the individual matching scores divided by the maximum matching score for the minimal set between the test and target templates. That is, one of the test or target templates has fewer points, and thus the sum of its descriptors' weights sets the maximum score that can be attained [3].

$$M = \frac{\sum_{(i,j) \in Matches} m(S_i, S_j)}{\min\left(\sum_{i \in Test} w(S_i),\; \sum_{j \in Target} w(S_j)\right)}. \tag{3}$$

Fig. 3. The weighting image [3].

Fig. 4. The module of sclera template matching.

Here, $Matches$ is the set of all pairs that are matching, $Test$ is the set of descriptors in the test template, and $Target$ is the set of descriptors in the target template.
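The per-segment score and the normalized total score can be sketched as follows. The dictionary layout of a descriptor (keys `x`, `y`, `phi`, `w`), the function names, and the threshold values used in any call are hypothetical; the sketch also assumes that both threshold tests are less-than-or-equal comparisons on the center distance and on the absolute orientation difference.

```python
import math

def segment_score(si, sj, d_match, phi_match):
    """Score for one descriptor pair: the product of the two weights when the
    centers and orientations are close enough, otherwise zero."""
    distance = math.hypot(si["x"] - sj["x"], si["y"] - sj["y"])
    if distance <= d_match and abs(si["phi"] - sj["phi"]) <= phi_match:
        return si["w"] * sj["w"]
    return 0.0

def total_score(test, target, matches, d_match, phi_match):
    """Sum of pair scores, normalized by the smaller of the two weight sums."""
    numerator = sum(
        segment_score(test[i], target[j], d_match, phi_match)
        for i, j in matches
    )
    denominator = min(sum(s["w"] for s in test), sum(s["w"] for s in target))
    return numerator / denominator if denominator > 0 else 0.0
```

Normalizing by the smaller weight sum means a template with few descriptors can still reach a score of 1 when all of its descriptors find a partner.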

III. A NAIVE IMPLEMENTATION OF PARALLEL PROCESSING

A naive parallel approach is to directly convert the sequential algorithm to a parallel computation model (Figure 4). Before matching, the mask files are aligned and the overlap of the masks is calculated as a new mask. The descriptors outside the new mask are removed. A binary erosion is performed to generate the boundary area of the new mask. The weight of each descriptor is then calculated according to its position. Most of these common steps, such as mask merging, weight calculation, and descriptor masking, require scanning the mask image pixel by pixel and performing convolution. Computationally, these steps are time-consuming and create a speed bottleneck for sclera matching. Furthermore, the mask file is too large to load onto the GPU without computational delay. As a result, this parallel approach is inefficient.

IV. THE PROPOSED Y SHAPE SCLERA FEATURE FOR EFFICIENT REGISTRATION

Currently, the registration of two sclera images during matching is very time consuming. To improve the efficiency, we propose a new descriptor, the Y shape descriptor, which can greatly improve the efficiency of the coarse registration of two images and can be used to filter out some non-matching pairs before refined matching.

Within the sclera, there can be several layers of veins. The motion of these different layers can cause the blood vessels of the sclera to show different patterns [26]. But within the same layer, blood vessels keep some of their form. As shown in Figure 5, sets of vessel segments combine to create Y shape branches that often belong to the same sclera layer. When the number of branches is more than three, the vessel branches may come from different sclera layers and their pattern will deform with the movement of the eye. Y shape branches are observed to be a stable feature and can be used as a sclera feature descriptor.

Fig. 5. The Y shape vessel branch in sclera.

Fig. 6. The rotation and scale invariant character of Y shape vessel branch.

To detect the Y shape branches in the original template, we search for the nearest-neighbor set of every line segment within a regular distance and classify the angles among these neighbors. If there are two types of angle values in the line segment set, the set may be inferred to be a Y shape structure, and the line segment angles are recorded as a new feature of the sclera.

There are two ways to measure both the orientation and the relationship of every branch of the Y shape vessels: one is to use the angle of every branch to the x axis; the other is to use the angle between each branch and the iris radial direction. The first method needs an additional rotation operation to align the template. In our approach, we employed the second method. As Figure 6 shows, $\phi_1$, $\phi_2$, and $\phi_3$ denote the angles between each branch and the radius from the pupil center. Even when the head tilts, the eye moves, or the camera zooms during image acquisition, $\phi_1$, $\phi_2$, and $\phi_3$ are quite stable. To tolerate errors in the pupil center calculation of the segmentation step, we also recorded the center position $(x, y)$ of the Y shape branches as auxiliary parameters. Our rotation-, shift-, and scale-invariant feature vector is therefore defined as $y(\phi_1, \phi_2, \phi_3, x, y)$.

The Y shape descriptor is generated with reference to the iris center. Therefore, it is automatically aligned to the iris center. It is a rotation- and scale-invariant descriptor.
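The invariance claim can be checked numerically with a short sketch. The function name `y_shape_descriptor`, the representation of each branch as a direction vector, and the sorting of the three angles into a canonical order are assumptions for illustration; the text does not specify how the branches are ordered.

```python
import math

def y_shape_descriptor(center, branch_dirs, pupil_center):
    """Return (phi1, phi2, phi3, x, y) for one Y shape branch point.

    Each angle is measured between a branch direction and the radial
    direction from the pupil center through the branch point.
    """
    cx, cy = center
    px, py = pupil_center
    radial = math.atan2(cy - py, cx - px)
    angles = []
    for dx, dy in branch_dirs:
        relative = math.atan2(dy, dx) - radial
        angles.append(relative % (2 * math.pi))  # wrap into [0, 2*pi)
    # Sorting gives a canonical branch order (an assumption of this sketch).
    return (*sorted(angles), cx, cy)
```

Rotating the whole eye about the pupil center rotates the branch directions and the radial direction by the same amount, so the three relative angles are unchanged; uniform scaling about the pupil center leaves all directions unchanged. The auxiliary center coordinates do change under such transforms.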

V. WPL SCLERA DESCRIPTOR

As discussed in Section 2.2, the line descriptor is extracted from the skeleton of the vessel structure in binary images (Figure 7). The skeleton is then broken into smaller segments. For each segment, a line descriptor is created to record the center and orientation of the segment. This descriptor is expressed as $s(x, y, \phi)$, where $(x, y)$ is the position of the center and $\phi$ is its orientation.

Because of the limitations of segmentation accuracy, the descriptors at the boundary of the sclera area might not be accurate and may contain spur edges resulting from the iris, eyelid, and/or eyelashes. To tolerate such errors, the mask file is designed to indicate whether a line segment belongs to the edge of the sclera or not. However, in a GPU application, using the mask is challenging, since the mask files are large; they occupy GPU memory and slow down data transfer. During matching, a RANSAC-type registration algorithm is used to randomly select corresponding descriptors, and the transform parameters between them are used to generate the affine matrix for the template transform. After every template transform, the mask data must also be transformed, and a new boundary must be calculated to evaluate the weights of the transformed descriptors. This results in too many convolutions in the processor units.

Fig. 7. The line descriptor of the sclera vessel pattern. (a) An eye image. (b) Vessel patterns in sclera. (c) Enhanced sclera vessel patterns. (d) Centers of line segments of vessel patterns.

To reduce the heavy data transfer and computation, we designed the weighted polar line (WPL) descriptor structure, which includes the mask information and can be automatically aligned. We extracted the geometric relationships among the descriptors and stored them as a new descriptor. We use a weighting image in which the weight values are set according to position. The weights of descriptors that lie beyond the sclera are set to 0, those near the sclera boundary to 0.5, and interior descriptors to 1. In our work, the descriptor weights are calculated on their own mask by the CPU, only once. The result is saved as a component of the descriptor. The sclera descriptor thus becomes $s(x, y, \phi, w)$, where $w$ denotes the weight of the point and may take the value 0, 0.5, or 1.

To align two templates, when a template is shifted to another location along the line connecting their centers, all the descriptors of that template must be transformed. Alignment is faster if the two templates have similar reference points. If we use the center of the iris as the reference point, then when two templates are compared, the correspondences are automatically aligned to each other because they share a similar reference point. Every feature vector of the template is a set of line segment descriptors composed of three variables (Figure 8): the segment's angle to the reference line that goes through the iris center, denoted as $\theta$; the distance between the segment's center and the pupil center, denoted as $r$; and the dominant angular orientation of the segment, denoted as $\phi$. To minimize GPU computing, we also convert the descriptor values from polar coordinates to rectangular coordinates in CPU preprocessing. The descriptor vector becomes $s(x, y, r, \theta, \phi, w)$.
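A minimal sketch of the CPU-side preprocessing that builds the six-component descriptor is given below. The `WPL` type and the `make_wpl` helper are hypothetical names, and the sketch assumes that the rectangular coordinates are expressed relative to the iris center, which serves as the origin.

```python
import math
from typing import NamedTuple

class WPL(NamedTuple):
    """Weighted polar line descriptor s(x, y, r, theta, phi, w)."""
    x: float
    y: float
    r: float
    theta: float
    phi: float
    w: float

def make_wpl(r, theta, phi, w):
    """Convert the polar position (r, theta) to rectangular coordinates once,
    on the CPU, so that the GPU kernels do not repeat the trigonometry."""
    return WPL(r * math.cos(theta), r * math.sin(theta), r, theta, phi, w)
```

Storing both the polar and the rectangular form trades a little memory for fewer trigonometric evaluations during matching.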

The left and right parts of the sclera in an eye may have different registration parameters. For example, as an eyeball moves left, the sclera patterns in the left part of the eye may be compressed while those in the right part are stretched. In parallel matching, these two parts are assigned to threads in different warps to allow different deformations. The multiprocessor in CUDA manages threads in groups of 32 parallel threads called warps. We reorganized the descriptors so that those from the same side are saved at contiguous addresses. This meets the requirement of coalesced memory access on the GPU.

Fig. 8. The key elements of descriptor vector.

Fig. 9. Simplified sclera matching steps on GPU.

After reorganizing the structure of the descriptors and adding the mask information to the new descriptor, computation on the mask file is no longer needed on the GPU. Matching with this feature is very fast because the templates do not need to be re-registered after every shift. Thus the cost of data transfer and computation on the GPU is reduced. With matching performed on the new descriptor, the shift parameter generator in Figure 4 is simplified as shown in Figure 9.

VI. COARSE-TO-FINE TWO-STAGE MATCHING PROCESS

To further improve the matching process, we propose a coarse-to-fine two-stage matching process. In the first stage, we match two images coarsely using the Y shape descriptors, which is very fast because no registration is needed. The matching result of this stage helps filter out image pairs with low similarity. After this step, some false positive matches are still possible. In the second stage, we use the WPL descriptor to register the two images for more detailed descriptor matching, including scale and translation invariance. This stage includes the shift transform, affine matrix generation, and final WPL descriptor matching.

Overall, we partitioned the registration and matching processing into four kernels² in CUDA (Figure 10): matching on the Y shape descriptor, shift transformation, affine matrix generation, and final WPL descriptor matching. By combining these two stages, the matching program runs faster and achieves a more accurate score.

A. Stage I: Matching With Y Shape Descriptor

Due to the scale and rotation invariance of the Y shape features, registration is unnecessary before matching on the Y shape descriptor. The whole matching algorithm is listed as Algorithm 1.

² A kernel in CUDA is a function called from the host that runs on the device.

Fig. 10. Two-stage matching scheme.

Algorithm (Kernel) 1 Matching With Y Shape Descriptor

Here, $y_{te_i}$ and $y_{ta_j}$ are the Y shape descriptors of test template $T_{te}$ and target template $T_{ta}$, respectively. $d_\phi$ is the Euclidean distance between the angle elements of two descriptor vectors, defined in (5). $d_{xy}$ is the Euclidean distance between two descriptor centers, defined in (6). $n_i$ and $d_i$ are the number of matched descriptor pairs and the distance between their centers, respectively. $t_\phi$ is a distance threshold and $t_{xy}$ is the threshold that restricts the search area. We set $t_\phi$ to 30 and $t_{xy}$ to 675 in our experiment. Here,

$$d_\phi(y_{te_i}, y_{ta_j}) = \sqrt{(\phi_{i0} - \phi_{j0})^2 + (\phi_{i1} - \phi_{j1})^2 + (\phi_{i2} - \phi_{j2})^2}, \tag{5}$$

and

$$d_{xy}(y_{te_i}, y_{ta_j}) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}. \tag{6}$$

To match two sclera templates, we search the areas near all the Y shape branches. The search area is limited to the corresponding left or right half of the sclera in order to reduce the search range and time. The distance between two branches is defined in (5), where $\phi_{ij}$ is the angle between the $j$th branch and the polar line from the pupil center in descriptor $i$. The number of matched pairs $n_i$ and the distance between the Y shape branch centers $d_i$ are stored as the matching result. We fuse the number of matched branches and the average distance between matched branch centers as (2), using a fusion factor that was set to 30 in our study. $N_i$ and $N_j$ are the total numbers of feature vectors in templates $i$ and $j$, respectively. The decision is regulated by the threshold $t$: if a sclera's matching score is lower than $t$, the sclera is discarded. A sclera with a high matching score is passed on to the next, more precise matching process.
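The coarse stage can be sketched as a nested search over descriptor pairs. Since the listing of Algorithm 1 is not reproduced in the text, the pairing rule below (for each test descriptor, keep the closest target descriptor that lies within the search area and passes the angle test) is an assumed reading, not the authors' exact kernel. The function names are hypothetical, the restriction to the left or right half of the sclera is omitted, and the score fusion step is left out because its formula is not given here.

```python
import math

T_PHI = 30.0   # angle-distance threshold quoted in the text
T_XY = 675.0   # search-area threshold quoted in the text

def d_phi(a, b):
    """Euclidean distance between the three branch angles, as in Eq. (5)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a[:3], b[:3])))

def d_xy(a, b):
    """Euclidean distance between the two branch centers, as in Eq. (6)."""
    return math.hypot(a[3] - b[3], a[4] - b[4])

def coarse_match(test, target):
    """Return (n, mean_distance) for descriptors given as (phi1, phi2, phi3, x, y)."""
    matched, total_distance = 0, 0.0
    for yt in test:
        best = None
        for ya in target:
            center_distance = d_xy(yt, ya)
            if center_distance > T_XY:
                continue  # outside the search area
            if d_phi(yt, ya) <= T_PHI and (best is None or center_distance < best):
                best = center_distance
        if best is not None:
            matched += 1
            total_distance += best
    mean_distance = total_distance / matched if matched else 0.0
    return matched, mean_distance
```

In a GPU version, one thread would run this loop for one target template, which matches the coarse-grained mapping described later in the paper.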

B. Stage II: Fine Matching Using WPL Descriptor

The line segment WPL descriptor reveals more detail of the sclera vessel structure than the Y shape descriptor. The variation of the sclera vessel pattern is nonlinear because: 1) when an eye image is acquired at different gaze angles, the vessel structure appears to shrink or extend nonlinearly because the eyeball is spherical in shape; and 2) the sclera is made up of four layers: episclera, stroma, lamina fusca, and endothelium, and there are slight differences among the movements of these layers. Considering these factors, our registration employs both a single shift transform and a multi-parameter transform that combines shift, rotation, and scale.

1) Shift Parameter Search: As discussed before, segmentation may not be accurate. As a result, the detected iris center may not be very accurate. The shift transform is designed to tolerate possible errors in pupil center detection in the segmentation step. If there is no deformation or only very minor deformation, registration with the shift transform alone would be adequate to achieve an accurate result. We designed Algorithm 2 to obtain the optimized shift parameter. Here, $T_{te}$ is the test template and $s_{te_i}$ is the $i$th WPL descriptor of $T_{te}$; $T_{ta}$ is the target template and $s_{ta_i}$ is the $i$th WPL descriptor of $T_{ta}$; and $d(s_{te_k}, s_{ta_j})$ is the Euclidean distance between descriptors $s_{te_k}$ and $s_{ta_j}$:

$$d(s_{te_k}, s_{ta_j}) = \sqrt{(x_{te_k} - x_{ta_j})^2 + (y_{te_k} - y_{ta_j})^2}. \tag{7}$$

$s_k$ is the shift value of two descriptors, defined as:

$$s_k = (x_{te_k} - x_{ta_j},\; y_{te_k} - y_{ta_j}). \tag{8}$$

Algorithm (Kernel) 2 Shift Parameter Search for Registration

We first randomly select an equal number of segment descriptors $s_{te_k}$ from each quadrant of the test template $T_{te}$ and find the nearest neighbor $s_{ta_j}$ of each in the target template $T_{ta}$. The shift offset between them is recorded as a possible registration shift factor $s_k$. The final offset registration factor is $s_{optim}$, which has the smallest standard deviation among these candidate offsets.
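The shift search can be sketched as follows. The selection criterion is stated only briefly in the text, so this sketch adopts one possible reading as an assumption: each candidate offset is applied to the whole test template, and the candidate whose nearest-neighbor distances have the smallest standard deviation is kept. The function names are hypothetical, and the sampled indices are passed in explicitly (rather than drawn at random per quadrant) so that the example is deterministic.

```python
import math
import statistics

def nearest(point, points):
    """Return the element of points closest to point."""
    return min(points, key=lambda q: math.hypot(point[0] - q[0], point[1] - q[1]))

def shift_search(test_pts, target_pts, sample_indices):
    """Pick the candidate shift s_k whose residual distances vary the least."""
    best_shift, best_spread = None, float("inf")
    for k in sample_indices:
        tx, ty = test_pts[k]
        ax, ay = nearest(test_pts[k], target_pts)
        shift = (tx - ax, ty - ay)  # candidate offset, as in Eq. (8)
        residuals = []
        for x, y in test_pts:
            moved = (x - shift[0], y - shift[1])  # undo the candidate offset
            q = nearest(moved, target_pts)
            residuals.append(math.hypot(moved[0] - q[0], moved[1] - q[1]))
        spread = statistics.pstdev(residuals)
        if spread < best_spread:
            best_shift, best_spread = shift, spread
    return best_shift
```

Because every candidate is evaluated independently of the others, each one can be handled by its own GPU thread, with only the final comparison requiring communication.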

2) Affine Transform Parameter Search: The affine transform is designed to tolerate some deformation of the sclera patterns in the matching step. The affine transform algorithm is shown in Algorithm 3. The shift value in the parameter set is obtained by randomly selecting a descriptor $s_{te}^{(it)}$ and calculating the distance from its nearest neighbor $s_{ta_j}$ in $T_{ta}$. We transform the test template by the matrix in (7). At the end of each iteration, we count the number of matched descriptor pairs between the transformed template and the target template. A distance factor is used to determine whether a pair of descriptors is matched; we set it to 20 pixels in our experiment. After $N$ iterations, the optimized transform parameter set is determined by selecting the maximum matching number $m^{(it)}$.

Algorithm (Kernel) 3 Affine Parameter Search for Registration

Here, $s_{te_i}$, $T_{te}$, $s_{ta_j}$, and $T_{ta}$ are defined as in Algorithm 2. $tr_{shift}^{(it)}$, $\theta^{(it)}$, and $tr_{scale}^{(it)}$ are the parameters of the shift, rotation, and scale transforms generated in the $it$th iteration. $R(\theta^{(it)})$, $T(tr_{shift}^{(it)})$, and $S(tr_{scale}^{(it)})$ are the transform matrices defined as (7). To search for the optimized transform parameters, we iterate $N$ times to generate these parameters. In our experiment, we set the number of iterations to 512.
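The parameter search can be illustrated with a small sketch that scores candidate parameter sets by their match counts. Several points are assumptions made for illustration: the function names are hypothetical, the candidates are supplied by the caller instead of being generated randomly over 512 iterations, coordinates are taken relative to the iris center, and the composition order (scale, then shift, then rotation, which corresponds to the product R T S applied to a column vector) is one reading of the notation in the text.

```python
import math

def transform(point, theta, scale, shift):
    """Apply scale, then shift, then rotation to a 2-D point."""
    xs = scale * point[0] + shift[0]
    ys = scale * point[1] + shift[1]
    return (xs * math.cos(theta) - ys * math.sin(theta),
            xs * math.sin(theta) + ys * math.cos(theta))

def count_matches(test_pts, target_pts, params, tolerance=20.0):
    """Count transformed test points that have a target point within tolerance
    (20 pixels is the value quoted in the text)."""
    theta, scale, shift = params
    matched = 0
    for p in test_pts:
        q = transform(p, theta, scale, shift)
        if min(math.hypot(q[0] - t[0], q[1] - t[1]) for t in target_pts) <= tolerance:
            matched += 1
    return matched

def affine_search(test_pts, target_pts, candidates, tolerance=20.0):
    """Return the candidate (theta, scale, shift) with the most matched pairs."""
    return max(candidates,
               key=lambda c: count_matches(test_pts, target_pts, c, tolerance))
```

Each candidate is scored independently, so the iterations map naturally onto parallel threads, with a final reduction to pick the maximum.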

3) Registration and Matching Algorithm: Using the optimized parameter set determined by Algorithms 2 and 3, the test template is registered and matched simultaneously. The registration and matching algorithm is listed in Algorithm 4. Here, $s_{te_i}$, $T_{te}$, $s_{ta_j}$, and $T_{ta}$ are defined as in Algorithms 2 and 3. $\theta^{(optm)}$, $tr_{shift}^{(optm)}$, $tr_{scale}^{(optm)}$, and $s_{optim}$ are the registration parameters obtained from Algorithms 2 and 3. $R(\theta^{(optm)})\,T(tr_{shift}^{(optm)})\,S(tr_{scale}^{(optm)})$ is the descriptor transform matrix defined in Algorithm 3. $\phi$ is the angle between the segment descriptor and the radius direction. $w$ is the weight of the descriptor, which indicates whether the descriptor is at the edge of the sclera or not. To ensure that the nearest descriptors have a similar orientation, we use a constant factor to check the absolute difference between the two $\phi$ values; in our experiment, we set this factor to 5. The total matching score is the minimal score of the two transformed results divided by the minimal matching score for the test template and the target template.


Fig. 11. The task assignment inside and outside the GPU.

Algorithm (Kernel) 4 Registration and Match

VII. MAPPING THE SUBTASKS TO CUDA

CUDA is a single instruction multiple data (SIMD) system and works as a coprocessor with a CPU. A CUDA GPU consists of many streaming multiprocessors (SMs); the parallel part of the program must be partitioned by the programmer into threads, which are mapped onto these multiprocessors. There are multiple memory spaces in the CUDA memory hierarchy: registers, local memory, shared memory, global memory, constant memory, and texture memory. Registers and shared memory are on-chip and take little time to access. Local memory is private to each thread but, despite its name, resides in off-chip device memory. Only shared memory can be accessed by other threads within the same block; however, its capacity is limited. Global memory, constant memory, and texture memory are off-chip and accessible by all threads, and accessing them is very time consuming. Constant memory and texture memory are read-only and cached.

Mapping algorithms to CUDA to achieve efficient processing is not a trivial task. There are several challenges in CUDA programming: 1) If threads in a warp follow different control paths, all the branches are executed serially. To improve performance, branch divergence within a warp should be avoided. 2) Global memory is slower to access than on-chip memory. To hide this latency even for small instruction sets, on-chip memory should be used in preference to global memory. When global memory access does occur, threads in the same warp should access consecutive words to achieve coalescing. 3) Shared memory is much faster than the local and global memory spaces, but it is organized into banks of equal size. If memory requests from different threads within a warp fall in the same memory bank, the accesses are serialized. To get maximum performance, memory requests should be scheduled to minimize bank conflicts.
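The bank conflict rule in item 3 can be made concrete with a small model. The sketch assumes 32 banks of 4-byte words, with consecutive words assigned to consecutive banks and with threads that read the same word served by a single broadcast; these figures describe a common configuration and are assumptions here, not values stated in the text. The function name `bank_conflict_degree` is hypothetical.

```python
from collections import defaultdict

def bank_conflict_degree(byte_addresses, banks=32, word_bytes=4):
    """Return the largest number of distinct words that fall in one bank.

    A result of 1 means the warp's request is conflict free; a result of k
    means the request is replayed k times in this simplified model.
    """
    words_per_bank = defaultdict(set)
    for address in byte_addresses:
        word = address // word_bytes
        words_per_bank[word % banks].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)
```

With 32 threads reading consecutive 4-byte values the degree is 1, a stride of two words gives 2, and a stride of 32 words places every access in the same bank, giving 32.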

A. Mapping Algorithm to Blocks

Because the proposed registration and matching algorithm has four independent modules, all the modules are converted to different kernels on the GPU. These kernels differ in computation density, so we map them to the GPU with different mapping strategies to fully utilize the computing power of CUDA.

Figure 11 shows our scheme of CPU-GPU task distribution and the partition among blocks and threads. Algorithm 1 is partitioned into coarse-grained parallel subtasks. We create a number of threads in this kernel equal to the number of templates in the database. As the upper portion of the middle column in Figure 11 shows, each target template is assigned to one thread, and each thread compares one pair of templates. In our work, we use an NVIDIA Tesla C2070 as our GPU. The numbers of threads and blocks are both set to 1024, which means we can match our test template with up to 1024 × 1024 target templates at the same time.
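The one-thread-per-template mapping described above can be sketched as simple index arithmetic. The function names below are hypothetical; only the 1024 × 1024 configuration comes from the text.

```python
THREADS_PER_BLOCK = 1024  # value stated in the text
NUM_BLOCKS = 1024         # value stated in the text


def template_index(block_idx, thread_idx, threads_per_block=THREADS_PER_BLOCK):
    """Index of the target template handled by one thread (hypothetical helper)."""
    return block_idx * threads_per_block + thread_idx


def blocks_needed(num_templates, threads_per_block=THREADS_PER_BLOCK):
    """Number of blocks so that every template gets its own thread."""
    return (num_templates + threads_per_block - 1) // threads_per_block


capacity = NUM_BLOCKS * THREADS_PER_BLOCK
print(capacity)                # templates that can be compared in one launch
print(template_index(1, 0))    # first template handled by the second block
print(blocks_needed(1168))     # blocks needed for a 1168-template database
```

A thread whose index is not smaller than the number of templates would simply return without doing any work.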

Algorithms 2-4 are partitioned into fine-grained subtasks, in which each thread processes a section of the descriptors. As the lower portion of the middle column in Figure 11 shows, we assign each target template to one block. Inside a block, each thread corresponds to a set of descriptors in this template. This partition lets every block execute independently, with no data exchange required between different blocks. When all threads have completed their corresponding descriptor fractions, the intermediate results need to be summed or compared. A parallel prefix sum algorithm is used to calculate the sum of the intermediate results, as shown on the right of Figure 11. First, all odd-numbered threads compute the sums of consecutive pairs of results. Then, recursively, every first of i (= 4, 8, 16, 32, 64, ...) threads computes the prefix sum on the new results. The final result is saved in the first address, which has the same variable name as the first intermediate result.
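The summation step described above behaves like a pairwise tree reduction. The following is a minimal sequential emulation, assuming one intermediate result per thread; on a GPU each inner-loop iteration would be executed by one thread, with a block-wide barrier before the stride doubles. The function name is hypothetical.

```python
def tree_reduce_sum(values):
    """Sum values with a pairwise tree; the total ends up in element 0."""
    data = list(values)
    n = len(data)
    if n == 0:
        return 0
    stride = 1
    while stride < n:
        # Elements 0, 2*stride, 4*stride, ... each absorb their right-hand
        # neighbor at distance `stride`. These additions are independent of
        # each other, so they could run in parallel.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
    return data[0]


print(tree_reduce_sum([1, 2, 3, 4, 5, 6, 7, 8]))
print(tree_reduce_sum([1, 2, 3, 4, 5]))
```

With n partial results, the loop over `stride` runs about log2(n) times, which is the reason for preferring this scheme over a single thread adding all values in sequence.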

B. Mapping Inside Block

In shift argument searching, there are two schemes we can choose to map the task: 1) mapping one pair of templates to all the threads in a block, so that every thread takes charge of a fraction of the descriptors and cooperates with the other threads; or 2) assigning a single possible shift offset to a thread, so that all the threads compute independently until the final results are compared across the possible offsets. Because the first scheme requires a great number of sum and synchronization operations in every nearest neighbor searching step, we chose the second method to parallelize shift searching, as Figure 11 shows.

In the affine matrix generator, we mapped an entire parameter set search to a thread; every thread randomly generated a set of parameters and tried them independently. The generated iterations were assigned to all threads. The challenge of this step is that the randomly generated numbers might be correlated among threads. In the step of generating the rotation and scale registration, we used the Mersenne Twister pseudorandom number generator because it uses bitwise arithmetic and has a long period. The Mersenne Twister, like most pseudorandom generators, is iterative, so it is hard to parallelize a single twister state update step among several execution threads. To make sure that the thousands of threads in the launch grid generate uncorrelated random sequences, many simultaneous Mersenne Twisters need to run in parallel with different initial states. But even very different (by any definition) initial state values do not prevent generators that share identical parameters from emitting correlated sequences. To solve this problem and to enable an efficient implementation of the Mersenne Twister on parallel architectures, we used a special offline tool for the dynamic creation of Mersenne Twister parameters, modified from the algorithm developed by Makoto Matsumoto and Takuji Nishimura [27].

Fig. 12. Example image from the UBIRIS database.

Fig. 13. Occupancy on various thread numbers per block.

In the registration and matching step, when searching for the nearest neighbor, a line segment that has already been matched with another should not be used again. In our approach, a flag variable denoting whether the line has been matched is stored in shared memory. To share the flags, all the threads in a block would have to wait on a synchronization operation at every query step. Our solution is to use a single thread in a block to process the matching.
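The flag-based nearest neighbor search can be sketched as a greedy one-to-one matching run by a single worker. This is an illustration only: representing each line segment by a 2-D point, using Euclidean distance, and the `max_dist` threshold are all assumptions, and `greedy_match` is a hypothetical name.

```python
def greedy_match(test_points, target_points, max_dist):
    """Match each test point to its nearest unmatched target point.

    matched[j] plays the role of the flag variable: once target j has been
    paired, it is skipped by all later queries.
    """
    matched = [False] * len(target_points)
    pairs = []
    for i, (tx, ty) in enumerate(test_points):
        best_j, best_d2 = None, max_dist ** 2
        for j, (x, y) in enumerate(target_points):
            if matched[j]:
                continue  # already used by an earlier query
            d2 = (tx - x) ** 2 + (ty - y) ** 2
            if d2 <= best_d2:
                best_j, best_d2 = j, d2
        if best_j is not None:
            matched[best_j] = True
            pairs.append((i, best_j))
    return pairs


test = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.2)]
target = [(0.0, 0.0), (5.0, 5.0)]
print(greedy_match(test, target, max_dist=1.0))
```

Because each query reads and then updates the shared flags, running the queries in order inside one thread avoids a barrier after every query, at the cost of parallelism inside the block.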

C. Memory Management

The bandwidth inside the GPU board is much higher than the bandwidth between host memory and device memory, and data transfer between host and device can lead to long latency. As shown in Figure 11, we load the entire set of target templates from the database up front, without considering when they will be processed. Therefore, there is no data transfer from host to device during the matching procedure.

In global memory, the components of the descriptors y(φ1, φ2, φ3, x, y) and s(x, y, r, θ, w) were stored separately. This guarantees that the consecutive kernels of Algorithms 2 to 4 can access their data at successive addresses. Although such coalesced access reduces latency, frequent global memory access is still a slow way to get data. In our kernel, we loaded the test template into shared memory to accelerate memory access. Because Algorithms 2 to 4 execute different numbers of iterations on the same data, bank conflicts do not happen. To maximize our texture memory space, we set the system cache to the lowest value and bound our target descriptor to texture memory. Using this cacheable memory further accelerated our data access.
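The benefit of storing descriptor components separately can be illustrated by counting how many memory segments one warp touches under two layouts. This is a simplified model, not a measurement: the 128-byte segment size, 4-byte components, and five components per descriptor are assumptions, and the helper names are hypothetical.

```python
SEGMENT_BYTES = 128  # assumption: size of one global-memory transaction
FLOAT_BYTES = 4      # assumption: each component is a 4-byte value
NUM_FIELDS = 5       # assumption: five components per descriptor


def segments_touched(addresses):
    """Number of distinct memory segments covered by a set of byte addresses."""
    return len({a // SEGMENT_BYTES for a in addresses})


def interleaved_address(i, field):
    """All components of descriptor i stored together (array of structures)."""
    return (i * NUM_FIELDS + field) * FLOAT_BYTES


def separate_address(i, field, n):
    """Each component stored in its own array of length n (structure of arrays)."""
    return (field * n + i) * FLOAT_BYTES


n = 1024
warp = range(32)
interleaved = segments_touched(interleaved_address(t, 0) for t in warp)
separate = segments_touched(separate_address(t, 0, n) for t in warp)
print(interleaved, separate)
```

When thread t reads the first component of descriptor t, the separate layout places the 32 values of a warp in consecutive words, whereas the interleaved layout spreads them over several segments.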

VIII. EXPERIMENTAL RESULTS

We used a computer with an Intel i7 950 3.07 GHz processor and an NVIDIA Tesla C2070 graphics card, which has 448 cores, a 1.15 GHz GPU clock, and a 1.5 GHz memory clock. The sclera image database we used is the UBIRIS database, Session 1, a publicly available eye image database acquired in the visible wavelength [28]. There are 1214 images collected from 241 persons in this database (Figure 12). In our study, 46 blurred, blinking, or no-sclera-area images were removed, leaving 1168 images.


Fig. 14. Matching time with various threads numbers per block.

Achieving good performance on a GPU requires keeping the multiprocessors as busy as possible by using a suitable number of threads and blocks. The larger the number of threads used per block, the more templates can be compared simultaneously. Threads in a warp start together at the same program address. When one warp is paused, other warps are executed to hide latency and keep the processing units busy. To switch quickly from one execution context to another, a multiprocessor keeps all warps active by partitioning private registers to every warp. As a result, the numbers of blocks and warps that can reside on the multiprocessor depend on whether enough registers and shared memory are available on the multiprocessor [29]. If we set the number of threads per block to a multiple of the warp size, the maximum number of threads per block should be set to

$$T = \frac{R_{block}}{\operatorname{ceil}(W_{size} R_k,\, G_T)}\, W_{size}, \tag{9}$$

where $T$ is the number of threads per block that lets all the threads reside in the multiprocessor, $R_{block}$ is the total number of registers for a block, $W_{size}$ is the warp size, $R_k$ is the number of registers used by the kernel, $G_T$ is the thread allocation granularity, and $\operatorname{ceil}(x, y)$ denotes $x$ rounded up to the nearest multiple of $y$.

The number of blocks should guarantee that at least two blocks reside in a multiprocessor. The total amount of shared memory $S_{block}$ for a block should be

$$S_{block} = \operatorname{ceil}(S_k,\, G_s), \tag{10}$$

that is, $S_k$ rounded up to the nearest multiple of $G_s$, where $S_k$ is the amount of shared memory used by the kernel and $G_s$ is the shared memory allocation granularity.

We also used the occupancy calculation tools provided by NVIDIA to search for the optimal configuration parameters [21]. Kernel 3 needs 31 registers and 7168 bytes of shared memory. As Figure 13 shows, the maximum occupancy, which is defined as the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps, is achieved when the number of threads is set to 256, 512, or 1024.
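As a worked example, Eqs. (9) and (10) can be evaluated for Kernel 3 under one reading of the formulas, in which the rounding operation rounds its first argument up to a multiple of the granularity. The register and shared memory figures (31 and 7168) come from the text; the register budget of 32768, the granularities of 64 and 128, and the warp size of 32 are assumed values for a Fermi-class device and should be checked against the device documentation.

```python
def ceil_to_multiple(x, granularity):
    """Round x up to the nearest multiple of granularity."""
    return -(-x // granularity) * granularity


W_SIZE = 32      # assumption: warp size
R_K = 31         # registers used by Kernel 3 (from the text)
S_K = 7168       # bytes of shared memory used by Kernel 3 (from the text)
R_BLOCK = 32768  # assumption: registers available to one block
G_T = 64         # assumption: thread (register) allocation granularity
G_S = 128        # assumption: shared memory allocation granularity


def max_threads_per_block(r_block, w_size, r_k, g_t):
    """Eq. (9): registers per warp are rounded up, then warps are counted.

    Integer division is used so that the result is a whole number of warps.
    """
    regs_per_warp = ceil_to_multiple(w_size * r_k, g_t)
    return (r_block // regs_per_warp) * w_size


def shared_memory_per_block(s_k, g_s):
    """Eq. (10): shared memory is allocated in units of g_s bytes."""
    return ceil_to_multiple(s_k, g_s)


print(max_threads_per_block(R_BLOCK, W_SIZE, R_K, G_T))
print(shared_memory_per_block(S_K, G_S))
```

Under these assumed parameters, 31 registers per thread round up to 1024 registers per warp, so 32 warps fit in the register budget.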

Fig. 15. ROC curve of parallel matching.

Using the Y shape descriptor, it takes 56.8 seconds to compare 1168 × 1168 template pairs. The equal error rate (EER) of this stage is 9.93%, which is not very accurate. However, it can be used as a filtering method to select the most likely matching templates to compare in the next step, Stage II.

Figure 14 shows our experimental results with two stages. To balance matching speed and accuracy, we adopted different strategies to select the possible templates after Stage I. Matching with Stage II alone achieves the most accurate result; however, it takes the longest time. In the sequential implementation on a CPU, the number of iterations ranges from 100 to 400 depending on the size of the template. In our implementation on a GPU, the iteration count was set to 512. Consequently, this extends the registration parameter search range and yields a more accurate matching result. As the percentage of templates selected after Stage I increases, the accuracy of the parallel matching result would decrease. Figure 14 shows the ROC curve and the EER of each method, and Figure 15 shows the ROC curve and the GARs when FAR = 0.1% and FAR = 0.01%. A summary of the accuracy and speed is shown in Table I.

TABLE I
PARALLEL MATCHING COMPARED WITH SEQUENTIAL MATCHING

For the sequential method, the EER is 3.386%, the area under the curve (AUC) is 97.5%, and GAR = 92.6% and 86.46% with FAR = 0.1% and FAR = 0.01%, respectively. With parallel computing, if all templates are used for Stage II matching, the parallel approach achieves better recognition accuracy than the sequential method, with EER = 3.052%, AUC = 98.6%, and GAR = 93% (when FAR = 0.1%) and 87.9% (when FAR = 0.01%). At the same time, the proposed parallel computing approach achieves a 769 times speed improvement. If we filter out 23.3% of pairs in Stage I, the speed improves further to 805 times, while the accuracy still beats the sequential method as measured by EER, AUC, and GAR (when FAR = 0.1% and FAR = 0.01%). If we filter out 43.5% of pairs in Stage I, the speed is a 1304 times improvement over the sequential method. The EER is a little higher than that of the sequential method; however, the AUC and GAR are better. If we filter out 61.6% of pairs in Stage I, the EER is 3.637%, which is about 0.3% higher than the EER of the sequential method, and the AUC is 97.4%, which is about 0.1% lower than the AUC of the sequential method. But the GAR is much better: GAR = 93.8% and 89.7% when FAR = 0.1% and 0.01%, respectively. The speed is 1935 times faster than the sequential method. Note that we used a 448-core GPU in this research, which means that the speedup of the proposed method is 4.3 times the number of GPU cores. This shows that the proposed parallel computing method can dramatically improve the speed without compromising the recognition accuracy.
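The per-core figure quoted above follows from dividing the reported speedup by the number of GPU cores. The short calculation below repeats this for each filtering level reported in the text; the dictionary layout and variable names are only illustrative.

```python
GPU_CORES = 448  # number of cores stated in the text

# Percentage of pairs filtered out in Stage I -> reported speedup over the
# sequential method (0.0 means all templates go to Stage II).
speedups = {0.0: 769, 23.3: 805, 43.5: 1304, 61.6: 1935}

per_core = {filtered: s / GPU_CORES for filtered, s in speedups.items()}

for filtered, ratio in per_core.items():
    print(f"{filtered:5.1f}% filtered: {ratio:.2f} x per core")
```

A ratio above 1 means the gain exceeds what the core count alone would explain, which is consistent with the text attributing part of the improvement to Stage I filtering and to the memory mapping.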

IX. CONCLUSION

In this paper, we proposed a new parallel sclera vein recognition method, which employs a two-stage parallel approach for registration and matching. Even though the research focused on developing a parallel sclera matching solution for the sequential line-descriptor method using the CUDA GPU architecture, the parallel strategies developed in this research can be applied to design parallel solutions for other sclera vein recognition methods and general pattern recognition methods. We designed the Y shape descriptor, a new feature extraction method that takes advantage of the GPU structures, to narrow the search range and increase the matching efficiency. We developed the WPL descriptor to incorporate mask information and make it more suitable for parallel computing, which can dramatically reduce data transfer and computation. We then carefully mapped our algorithms to GPU threads and blocks, which is an important step in achieving parallel computation efficiency on a GPU. A work flow with high arithmetic intensity to hide the memory access latency was designed to partition the computation task across the heterogeneous system of CPU and GPU, and even across the threads in the GPU. The proposed method dramatically improves the matching efficiency without compromising recognition accuracy.

ACKNOWLEDGMENT

We would like to thank the associate editor, Dr. Patrizio Campisi, and the anonymous reviewers for their constructive comments. We would also like to acknowledge the Department of Computer Science at the University of Beira Interior for providing the UBIRIS database [28].

REFERENCES

[1] C. W. Oyster, The Human Eye: Structure and Function. Sunderland: Sinauer Associates, 1999.

[2] P. Kaufman and A. Alm, "Clinical application," Adler's Physiology of the Eye, 2003.

[3] Z. Zhou, E. Y. Du, N. L. Thomas, and E. J. Delp, "A new human identification method: Sclera recognition," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 42, no. 3, pp. 571–583, May 2012.

[4] S. Crihalmeanu and A. Ross, "Multispectral scleral patterns for ocular biometric recognition," Pattern Recognit. Lett., vol. 33, no. 14, pp. 1860–1869, Oct. 2012.

[5] Z. Zhou, E. Y. Du, N. L. Thomas, and E. J. Delp, "A comprehensive multimodal eye recognition," Signal, Image Video Process., vol. 7, no. 4, pp. 619–631, Jul. 2013.

[6] Z. Zhou, E. Y. Du, N. L. Thomas, and E. J. Delp, "A comprehensive approach for sclera image quality measure," Int. J. Biometrics, vol. 5, no. 2, pp. 181–198, 2013.

[7] R. N. Rakvic, B. J. Ulis, R. P. Broussard, R. W. Ives, and N. Steiner, "Parallelizing iris recognition," IEEE Trans. Inf. Forensics Security, vol. 4, no. 4, pp. 812–823, Dec. 2009.

[8] P. R. Dixon, T. Oonishi, and S. Furui, "Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition," Comput. Speech Lang., vol. 23, no. 4, pp. 510–526, 2009.

[9] K.-S. Oh and K. Jung, "GPU implementation of neural networks," Pattern Recognit., vol. 37, no. 6, pp. 1311–1314, 2004.

[10] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Comput., vol. 22, no. 12, pp. 3207–3220, 2010.

[11] J. Antikainen, J. Havel, R. Josth, A. Herout, P. Zemcik, and M. Hauta-Kasari, "Nonnegative tensor factorization accelerated using GPGPU," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 7, pp. 1135–1141, Feb. 2011.

[12] C. Cuevas, D. Berjon, F. Moran, and N. Garcia, "Moving object detection for real-time augmented reality applications in a GPGPU," IEEE Trans. Consum. Electron., vol. 58, no. 1, pp. 117–125, Feb. 2012.

[13] Y. Xu, S. Deka, and R. Righetti, "A hybrid CPU-GPGPU approach for real-time elastography," IEEE Trans. Ultrason., Ferroelectr. Freq. Control, vol. 58, no. 12, pp. 2631–2645, Dec. 2011.

[14] G. Poli, J. H. Saito, J. F. Mari, and M. R. Zorzan, "Processing neocognitron of face recognition on high performance environment based on GPU with CUDA architecture," in Proc. 20th Int. Symp. Comput. Archit. High Perform. Comput., 2008, pp. 81–88.

[15] F. Z. Sakr, M. Taher, and A. M. Wahba, "High performance iris recognition system on GPU," in Proc. ICCES, 2011, pp. 237–242.

[16] W. Wenying, Z. Dongming, Z. Yongdong, L. Jintao, and G. Xiaoguang, "Robust spatial matching for object retrieval and its parallel implementation on GPU," IEEE Trans. Multimedia, vol. 13, no. 6, pp. 1308–1318, Dec. 2011.


[17] N. Ichimura, "GPU computing with orientation maps for extracting local invariant features," in Proc. IEEE Comput. CVPRW, Jun. 2010, pp. 1–8.

[18] K. Tsz-Ho, S. Hoi, and C. C. L. Wang, "Fast query for exemplar-based image completion," IEEE Trans. Image Process., vol. 19, no. 12, pp. 3106–3115, Dec. 2010.

[19] X. Hongtao, G. Ke, Z. Yongdong, T. Sheng, L. Jintao, and L. Yizhi, "Efficient feature detection and effective post-verification for large scale near-duplicate image search," IEEE Trans. Multimedia, vol. 13, no. 6, pp. 1319–1332, Dec. 2011.

[20] P. In Kyu, N. Singhal, L. Man Hee, C. Sungdae, and C. W. Kim, "Design and performance evaluation of image processing algorithms on GPUs," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 1, pp. 91–104, Jan. 2011.

[21] NVIDIA CUDA C Programming Guide, NVIDIA Corporation, Santa Clara, CA, USA, 2011.

[22] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," Proc. IEEE, vol. 96, no. 5, pp. 879–899, May 2008.

[23] R. Derakhshani, A. Ross, and S. Crihalmeanu, "A new biometric modality based on conjunctival vasculature," in Proc. Artif. Neural Netw. Eng., 2006, pp. 1–8.

[24] R. Derakhshani and A. Ross, "A texture-based neural network classifier for biometric identification using ocular surface vasculature," in Proc. Int. Joint Conf. Neural Netw., 2007, pp. 2982–2987.

[25] S. Crihalmeanu, A. Ross, and R. Derakhshani, "Enhancement and registration schemes for matching conjunctival vasculature advances in biometrics," in Proc. 3rd IAPR/IEEE Int. Conf. Biometrics, 2009, pp. 1240–1249.

[26] R. Broekhuyse, "The lipid composition of aging sclera and cornea," Int. J. Ophthalmol., vol. 171, no. 1, pp. 82–85, 1975.

[27] M. Matsumoto and T. Nishimura, "Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator," ACM Trans. Model. Comput. Simul., vol. 8, no. 1, pp. 3–30, 1998.

[28] H. Proença and L. A. Alexandre, "UBIRIS: A noisy iris image database," in Proc. 13th Int. Conf. Image Anal. Process., 2005, pp. 970–977.

[29] CUDA C Best Practices Guide, NVIDIA Corporation, Santa Clara, CA, USA, 2011.

Yong Lin received the B.S. degree in physics from Ningxia University, China, in 1995, and the M.S. degree in computer software from the Institute of Software, Chinese Academy of Sciences, China, in 2005. From 2010 to 2012, he was a Visiting Scholar with Indiana University-Purdue University, Indianapolis, IN, USA. Since 1995, he has been with the Department of Computer Science, Ningxia Teachers University. He is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Xidian University, China. His research interests include computer architecture, high performance computing for biometrics, parallel computing, and GPUs.

Eliza Yingzi Du (SM'08) received the Ph.D. degree in electrical engineering from the University of Maryland, Baltimore County, Baltimore, in 2003, and the B.S. and M.S. degrees in electrical engineering from the Beijing University of Posts and Telecommunications, Beijing, China, in 1996 and 1999, respectively. She is currently a Director of engineering with Qualcomm. From 2005 to 2013, she was the Founding Director of the Biometrics and Pattern Recognition Laboratory and a tenured Professor with the Department of Electrical and Computer Engineering, Purdue University, Indianapolis (IUPUI), IN, USA. From 2003 to 2005, she was an Assistant Research Professor with the Electrical Engineering Department, United States Naval Academy.

Her research interests include image processing, pattern recognition, and biometrics. Her research has been funded by the Office of Naval Research, National Institute of Justice, Department of Defense, National Science Foundation, Canada Border Services Agency, Indiana Department of Transportation, and several industry and IUPUI internal grants.

Dr. Du received the Office of Naval Research Young Investigator Award in 2007, the Indiana University Trustee Teaching Award in 2009, the Supervisor of the Year Award at IUPUI in 2009, and the Best Paper Award with her students in the IEEE Workshop on Computational Intelligence in Biometrics: Theory, Algorithms, and Applications in 2009. She is a member of the honor societies Tau Beta Pi and Phi Kappa Phi.

Zhi Zhou (S'08) received the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, in 2013, the M.S. degree in electrical and computer engineering from Indiana University-Purdue University Indianapolis, Indianapolis, IN, USA, in 2008, and the B.S. degree in electrical engineering from the Beijing University of Technology, Beijing, China, in 2005. He is currently a Scientist with the Allen Institute for Brain Science.

His research interests include image processing, biometrics, image analysis, pattern recognition, data mining, machine learning, and data visualization of large volumes of 3-D biological imaging data.

Dr. Zhou received the Best Paper Award in the IEEE Workshop on Computational Intelligence in Biometrics: Theory, Algorithms, and Applications in 2009.

N. Luke Thomas received the B.S. degree in electrical engineering and the M.S. degree in electrical and computer engineering from Indiana University-Purdue University Indianapolis, IN, USA, in 2010. He is currently in industry as a Software Engineer of safety critical engine control systems.

His research interests include algorithm development, biometrics, and pattern recognition.
