
Computer Vision and Image Understanding 98 (2005) 104–123

An embedded system for an eye-detection sensor

Arnon Amir a,*, Lior Zimet b, Alberto Sangiovanni-Vincentelli c, Sean Kao c

a IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA
b University of California Santa Cruz, 1156 High Street, Santa Cruz, CA 95064, USA

c Department of EECS, University of California Berkeley, CA 94720, USA

Received 27 July 2004; accepted 27 July 2004
Available online 1 October 2004

Abstract

Real-time eye detection is important for many HCI applications, including eye-gaze tracking, autostereoscopic displays, video conferencing, face detection, and recognition. Current commercial and research systems use software implementation and require a dedicated computer for the image-processing task—a large, expensive, and complicated-to-use solution. In order to make eye-gaze tracking ubiquitous, the system complexity, size, and price must be substantially reduced. This paper presents a hardware-based embedded system for eye detection, implemented using simple logic gates, with no CPU and no addressable frame buffers. The image-processing algorithm was redesigned to enable a highly parallel, single-pass image-processing implementation. A prototype system uses a CMOS digital imaging sensor and an FPGA for the image processing. It processes 640 × 480 progressive scan frames at a 60 fps rate, and outputs a compact list of sub-pixel accurate (x, y) eye coordinates via USB communication. Experiments with detection of human eyes and synthetic targets are reported. This new logic design, operating at the sensor's pixel clock, is suitable for single-chip eye detection and eye-gaze tracking sensors, thus making an important step towards mass-production, low-cost systems.
© 2004 Published by Elsevier Inc.

Keywords: Eye detection; Eye-gaze tracking; Real-time image processing; FPGA-based image processing; Embedded systems design

1077-3142/$ - see front matter © 2004 Published by Elsevier Inc.

doi:10.1016/j.cviu.2004.07.009

* Corresponding author. Fax: +1 408-927-3033.
E-mail addresses: [email protected] (A. Amir), [email protected] (L. Zimet), [email protected] (A. Sangiovanni-Vincentelli), kaos@


1. Introduction

Eye detection, the task of finding and locating eyes in images, is used in a great number of applications. One example is eye-gaze tracking systems, used in HCI for a variety of applications [11,10,5,21,22], such as pointing and selection [12], activating commands (e.g. [19]), and combinations with other pointing devices [26]. Other examples include real-time systems for face detection, face recognition, iris recognition, eye contact correction in video conferencing, autostereoscopic displays, facial expression recognition, and more. All of these applications depend on eye detection.

Consider, for example, commercial eye-gaze tracking systems. Current implementations of eye-gaze tracking systems are software driven and require a dedicated high-end PC for image processing. Yet even high-end PCs can hardly support the desired high frame rates (about 200–400 fps) and high-resolution images required to maintain accuracy over a wide operational field of view. Even in cases where the performance requirements are met, these solutions are bulky and prohibitively expensive, thus limiting their reach in a very promising market.

Any CPU-based implementation of a real-time image-processing algorithm has two major bottlenecks: data transfer bandwidth and sequential data processing rate. First, video frames are captured, from either an analog or a digital camera, and transferred to the CPU main memory. Moving complete frames around imposes a data-flow bandwidth bottleneck. For example, an IEEE 1394 (Firewire) channel can transfer data at 400 Mbit/s, or 40 Mpixel/s at 10 bits/pixel. At 60 fps, with a frame size of 640 × 480 pixels, our sensor's pixel rate is already 30 Mpixel/s. Faster frame rates or larger frame sizes would require special frame-capturing devices. After a frame is grabbed and moved to memory, the CPU can sequentially process the pixels. A CPU adds large overhead to the actual computation. Its time is split among moving data between memory and registers, applying arithmetic and logic operators on the data, keeping the algorithm flow, handling branches, and fetching code. For example, if each pixel is accessed only 10 times over the entire computation (e.g., applying a single 3 × 3 filter), the CPU would have to run at least 10 times faster than the input pixel rate, and even faster due to CPU overhead. Hence, the CPU has to run much faster than the sensor pixel rate. Last, a relatively long latency is imposed by the need to wait for an entire frame to be captured before it is processed. Device latency is extremely noticeable in interactive user interfaces.
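As a rough illustration of these rates, the following back-of-the-envelope calculation uses the figures quoted above; the 10-accesses-per-pixel count is only an assumed example, not a measured workload.

```python
# Back-of-the-envelope check of the rates discussed above (assumed values).
width, height = 640, 480        # pixels per frame
fps = 60                        # progressive-scan frames per second
bits_per_pixel = 10

pixel_rate = width * height * fps            # raw image data, ~18.4 Mpixel/s
                                             # (the sensor clock, incl. blanking, is 30 MHz)
link_capacity = 400e6 / bits_per_pixel       # IEEE 1394: 400 Mbit/s -> 40 Mpixel/s

accesses_per_pixel = 10                      # e.g., one pass of a 3x3 filter
min_cpu_rate = pixel_rate * accesses_per_pixel

print(f"image pixel rate : {pixel_rate / 1e6:.1f} Mpixel/s")
print(f"1394 capacity    : {link_capacity / 1e6:.1f} Mpixel/s at 10 bits/pixel")
print(f"CPU must sustain : >= {min_cpu_rate / 1e6:.0f} M pixel accesses/s, plus overhead")
```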

The advantage of a software-based implementation lies in its flexibility and short implementation time. A high-volume hardware implementation would certainly reduce unit cost and size, but at the price of increased design time and rigidity. An efficient hardware implementation should offer the same functionality while exploiting hardware's main characteristic: concurrency. A simple mapping of the software implementation into hardware components falls short of the potential benefits offered by a hardware solution. Rather, the algorithms and data structures have to be redesigned to implement the same mathematical computation with maximum concurrency.

Our approach to embedded system design is based on a methodology [20,2] that captures the design at a high level of abstraction where maximum concurrency may be exposed. This representation is then mapped into a particular architecture and evaluated in terms of cost, size, power consumption, and performance. The methodology offers a continuous trade-off among possible implementations including, at the extremes, a full software and a full hardware realization.

This paper presents a prototype of an embedded eye-detection system implemented completely in hardware. The novelty of this paper is the redesign of a sequential pupil-detection algorithm into a synchronous parallel algorithm, capable of fully exploiting the concurrency of hardware implementations. The algorithm includes computation of regions (connected components), shape moments, and simple shape classification—all computed within a single pass over the frame, driven by the real-time pixel clock of the imaging sensor. The output is a list of pupil locations and sizes, passed via a low-bandwidth USB connection to an application. In addition to the elimination of a dedicated PC, and thus major savings in cost, the system has virtually no limit on processing pixel rate, bounded only by the sensor's output pixel rate. Minimal latency is achieved as pixels are processed while being received from the sensor.

A prototype was built using a field programmable gate array (FPGA). It processes 640 × 480 progressive scan frames at 60 fps—a pixel rate 16 times faster than the one reported elsewhere [16,15]. By extending the algorithm to further detect glints and by integrating this logic into the imaging sensor's VLSI, this approach can address the speed, size, complexity, and cost problems of current eye-gaze tracking systems.

2. Prior work on eye detection

Much of the eye-detection literature is associated with face detection and face recognition (see, e.g., [9,27]). Direct eye-detection methods search for eyes without prior information about face location, and can further be classified into passive and active methods. Passive eye detectors work on images taken in natural scenes, without any special illumination, and therefore can be applied to movies, broadcast news, etc. One such example exploits gradient fields and temporal analysis to detect eyes in gray-level video [14].

Active eye-detection methods use special illumination and thus are applicable to real-time situations in controlled environments, such as eye-gaze tracking, iris recognition, and video conferencing. They take advantage of the retro-reflection property of the eye, a property that is rarely seen in other natural objects. When light falls on the eye, part of it is reflected back, through the pupil, in a very narrow beam pointing directly towards the light source. When a light source is located very close to the camera focal axis (on-axis light), the captured image shows a very bright pupil [11,25]. This is often seen as the red-eye effect when using a flash in still photography. When a light source is located away from the camera focal axis (off-axis light), the image shows a dark pupil. This is the reason the flash unit pops up in many camera models. However, neither of these lights alone allows for good discrimination of pupils from other objects, as there are also other bright and dark objects in the scene that generate pupil-like regions in the image.


2.1. Eye detection by frame subtraction using two illuminations

The active eye-detection method used here is based on the subtraction scheme with two synchronized illuminators [23,6,16]. At the core of this approach, two frames are captured, one with on-axis illumination and the other with off-axis illumination. Then the intensity of the off-axis frame is subtracted from the intensity of the on-axis frame. Most of the scene, except the pupils, reflects the same amount of light under both illuminations and subtracts out. Pupils, however, are very bright in one frame and dark in the other, and thus are detected as elliptic regions after applying a threshold to the difference frame. False positives might appear due to specularities from curved shiny surfaces such as glasses, and at the boundaries of moving objects due to the delay between the two frames.

Tomono et al. [23] used a modified three-CCD camera with two narrow bandpass filters on the sensors and a polarizing filter, with two illuminators at the corresponding wavelengths, the on-axis one being polarized. This configuration allowed simultaneous frame capture. It overcame the motion problem and most of the specularity problems, and assisted glint detection. However, it required a more expensive, three-CCD camera.

The frame-subtraction eye-detection scheme has become popular in different applications because of its robustness and simplicity. It has been used for tracking eyes, eyebrows, and other facial features [8,13], for facial expression recognition [7], for tracking head nods and shakes and recognizing head gestures [4], for achieving eye contact in video conferencing [24], for stereoscopic displays [18], and more.

The increasing popularity and multitude of applications suggest that an eye-detection sensor would be of great value. An embedded eye-detection system would simplify the implementation of all of these applications. It would eliminate the need for high-bandwidth image data transfer between the imaging sensor and the PC, remove the high CPU requirements of software-based image processing, and dramatically reduce cost.

The eye-detection algorithm selected here for this challenging new type of implementation is fairly simple and robust. While a software implementation of this algorithm is straightforward, a logic-circuit implementation, with no CPU, operating at a processing rate of a single cycle per pixel, provides plenty of interesting challenges. Alternative algorithms might have other advantages; however, the simplicity of the selected algorithm made it suitable for a logic-only implementation, without the use of any CPU/DSP or microcode of any kind. Some of the lessons learned here may help to implement other, more complicated image-processing algorithms in hardware as well.

2.2. The basic sequential algorithm

The frame-subtraction algorithm for eye detection is summarized in Fig. 1. A detailed software-based implementation was reported elsewhere [16]. The on-axis and off-axis light sources alternate, synchronized with the camera frames. After capturing a pair of frames, subtracting the frame associated with the off-axis light from the one associated with the on-axis light, and thresholding the result, a binary image is obtained.


Fig. 1. Processing steps of the basic sequential algorithm.


The binary image contains regions of significant difference. These regions are detected using a connected components algorithm [1]. The detected regions may include not only pupils but also some false positives, mainly due to object motion and specularities. The non-pupil regions are filtered out in step 6, and the remaining pupil candidates are reported at the output.

The first five steps of the algorithm are applied to frames. They consume most of the computational time and require large memory arrays for frame buffers. In step 5 the detected regions are represented by their properties, including area, bounding box, and first-order moments. The filtering in step 6 rejects regions that do not look like pupils. Pupils are expected to appear as elliptic shapes in a certain size range. Typical motion artifacts appear as elongated shapes of large size and are in general easy to distinguish from pupils (more elaborate eye classification methods were applied in [3]). At the last step, the centroids and sizes of the detected pupils are reported. This output can be used by different applications.
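For concreteness, a minimal software sketch of this sequential pipeline is given below. It is not the implementation of [16]; it assumes two grayscale frames as NumPy arrays, uses SciPy's connected-component labeling, and the threshold and pupil size/shape limits are illustrative values only.

```python
import numpy as np
from scipy import ndimage

def detect_pupils(on_axis, off_axis, threshold=20, min_area=10, max_area=400):
    # Steps 1-3: subtract the off-axis frame from the on-axis frame and threshold.
    diff = on_axis.astype(np.int16) - off_axis.astype(np.int16)
    binary = diff > threshold

    # Step 4: connected components on the binary difference image.
    labels, n = ndimage.label(binary)

    pupils = []
    for region_id in range(1, n + 1):
        ys, xs = np.nonzero(labels == region_id)
        area = xs.size
        w = xs.max() - xs.min() + 1
        h = ys.max() - ys.min() + 1
        # Step 6: reject regions whose size or elongation does not look like a pupil.
        if not (min_area <= area <= max_area) or max(w, h) > 2 * min(w, h):
            continue
        # Step 7: report the centroid (first-order moments divided by area) and size.
        pupils.append((xs.mean(), ys.mean(), area))
    return pupils
```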

3. Synchronous pupil detection algorithm

To allow a synchronous hardware implementation, it was necessary to modify the basic sequential algorithm of Section 2.2. While a software implementation is straightforward, a hardware implementation using only registers and logic gates operating at the pixel-rate clock is far more complicated. The slow clock rate requires any pixel operation to be performed in a single clock cycle and any line operation to be finished within a number of clock cycles equal to the number of pixels per image line. In fact, the sensor's pixel clock is the theoretical lowest possible processing clock. Using the sensor's clock rate has advantages in synchronization and low-cost hardware, and offers the potential to use much faster sensors.

Fig. 2 shows a block diagram of the parallel algorithm. The input is a stream of pixels. There is only a single frame buffer (delay), used to store the previous frame for the frame-to-frame subtraction. The output of the frame subtraction is stored in a line buffer as it is being computed in raster-scan order. The output of the system is a set of (x, y) locations of the centroids of the detected pupils within the frame. This set of locations is updated once per frame.

Connected components are computed using a single-pass algorithm, similar to the region-coloring algorithm [1]. Here, however, the algorithm operates in three parallel stages, running as a pipeline over three lines: when line Li is present in the line buffer it is scanned, and line components (or simply components) are detected. At the same time, components from line Li−1 are being connected (merged) to components from line Li−2, and those line components are being updated in the frame regions table.

The shape properties, including bounding box and moments (step 5, Fig. 1), are computed simultaneously with the computation of the connected components (step 4). This saves time and eliminates the need for the extra frame buffer used between step 4 and step 5 in Fig. 1. Consider the computation of a property f(s) of a region s. The region s is discovered on a line-by-line basis and is composed of multiple line components. For each line component that belongs to s, say s1, the property f(s1) is computed and stored with the line component. When s1 is merged with another component, s2, their properties are combined to reflect the properties of the combined region s = s1 ∪ s2. This type of computation can be applied recursively or iteratively to any shape property that can be computed from its values over parts of the shape. The property f(s) can be computed using the form:

f(s) = g(f(s_1), f(s_2)),

Fig. 2. System block diagram, showing the main hardware components and pipeline processing.


where g(·,·) is an appropriate aggregation function for the property f. For example, aggregation of raw moments is simply the sum,

m^{ij}_s = g(m^{ij}_{s_1}, m^{ij}_{s_2}) = m^{ij}_{s_1} + m^{ij}_{s_2},

whereas for the aggregation of normalized moments, the region areas A_{s_1}, A_{s_2} are needed:

\bar{m}^{ij}_s = g(\bar{m}^{ij}_{s_1}, \bar{m}^{ij}_{s_2}) = (A_{s_1} \bar{m}^{ij}_{s_1} + A_{s_2} \bar{m}^{ij}_{s_2}) / (A_{s_1} + A_{s_2}).

The four properties which define the bounding box, namely X^{min}_s, X^{max}_s, Y^{min}_s, Y^{max}_s, are aggregated using, e.g., X^{min}_s = min(X^{min}_{s_1}, X^{min}_{s_2}).

Properties computed here in this manner include area, first-order moments, and bounding box. Other useful properties that may be computed in this manner are higher-order moments, average color, color histograms, etc.
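A small sketch of this aggregation rule, with illustrative field names covering the properties used in the text (area, first-order moments, and bounding box), is given below; it is not the hardware data structure.

```python
from dataclasses import dataclass

@dataclass
class RegionProps:
    area: int            # M00
    m10: int             # sum of x over the region
    m01: int             # sum of y over the region
    xmin: int
    xmax: int
    ymin: int
    ymax: int

def merge(a: RegionProps, b: RegionProps) -> RegionProps:
    # Raw moments and area simply add; bounding boxes take min/max.
    return RegionProps(
        area=a.area + b.area,
        m10=a.m10 + b.m10,
        m01=a.m01 + b.m01,
        xmin=min(a.xmin, b.xmin), xmax=max(a.xmax, b.xmax),
        ymin=min(a.ymin, b.ymin), ymax=max(a.ymax, b.ymax),
    )

# Normalized (area-weighted) moments would instead combine as
#   m_bar = (a.area * a_m_bar + b.area * b_m_bar) / (a.area + b.area)
```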

In the pipelined architecture, each stage of the pipeline performs its computation on the intermediate results from the previous stage, using the same pixel clock. The memory arrays in the processing stages keep the information of only one or two lines at a time. Additional memory arrays are used to keep component properties and component locations, and to form the pipeline.

To get the effect of bright and dark pupils in consecutive frames, the logic circuit drives control signals to the on-axis and off-axis infrared light sources. These signals are generated from the frame clock coming from the image sensor.

3.1. Pixels subtraction and thresholding

The digital video stream arrives from the sensor at a pixel rate of 30 Mpixel/s. The data are latched in the logic circuit and at the same time pushed into a FIFO frame buffer. The frame buffer acts as a one-full-frame delay. Fig. 3 depicts the operation of the subtraction and threshold module.

Since each consecutive pair of frames is processed, the order of subtraction changes at the beginning of every frame. The on-axis and off-axis illumination signals are used to coordinate the subtraction order. To eliminate the handling of negative numbers, a fixed offset value is added to the gray levels of all pixels of the on-axis frame. The subtraction is done pixel by pixel and the result is then compared to a threshold that may be modified for different lighting conditions and noise environments. The binary result of the threshold operation is kept in a line buffer array that is transferred to the Line Components Properties module on the next line clock.

Fig. 3. Subtraction and thresholding.
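A behavioral sketch of this module follows; the pixel stream and frame delay are modeled as array rows, and the offset and threshold values are illustrative, not those of the prototype.

```python
import numpy as np

OFFSET = 256       # added to the on-axis frame so the difference never goes negative
THRESHOLD = 20     # illustrative value; programmable in the real system

def subtract_line(current_line, delayed_line, current_is_on_axis):
    # The subtraction order flips every frame so that the off-axis frame is
    # always subtracted from the on-axis frame.
    if current_is_on_axis:
        diff = (current_line.astype(np.int32) + OFFSET) - delayed_line
    else:
        diff = (delayed_line.astype(np.int32) + OFFSET) - current_line
    # The binary result is what gets written into the line buffer.
    return (diff - OFFSET) > THRESHOLD
```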


3.2. Detecting line components and their properties

Working at the sensor pixel clock, this module is triggered every line clock to find the component properties in the binary line buffer produced by the subtraction and threshold module. Every "black" pixel in the line is part of a component, and consecutive black pixels are aggregated into one component.

Several properties are calculated for each of these components: the starting pixel, xstart, the ending pixel, xend, and first-order moments. The output data structure from this module is composed of several memory arrays that contain the component properties. For a 640 × 480 frame size, the arrays and their corresponding bit widths are: xstart (10 bits), xend (10 bits), M00 (6 bits), M10 (15 bits), and M01 (15 bits). Higher-order moments would require additional arrays. Preliminary filtering of oversized objects can be done at this stage using the information in M00, since it contains the number of pixels in the component.
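A software sketch of this stage is given below; the bit widths quoted above bound the hardware counters, whereas the Python integers here are unbounded, so that constraint appears only in the comments.

```python
def line_components(binary_line, y):
    """Group consecutive foreground pixels of one thresholded line into
    components and accumulate their partial moments."""
    components = []          # each entry: dict with xstart, xend, M00, M10, M01
    current = None
    for x, px in enumerate(binary_line):
        if px:
            if current is None:
                current = {"xstart": x, "xend": x, "M00": 0, "M10": 0, "M01": 0}
            current["xend"] = x
            current["M00"] += 1      # area contribution (6-bit counter in hardware)
            current["M10"] += x      # first-order moment in x (15 bits)
            current["M01"] += y      # first-order moment in y (15 bits)
        elif current is not None:
            components.append(current)
            current = None
    if current is not None:
        components.append(current)
    return components
```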

3.3. Computing connected components

The line components (and their properties) are merged with components of previous lines to form connected regions in the frame. The connect components module is triggered every line clock and works on two consecutive lines. It receives the xstart and xend arrays and outputs a component identification number (ID) for every component, together with a list of merged components. Since the module works on pairs of consecutive lines, every line is processed twice in this module, first with the line above it and then with the line below. After being processed in this module, the line information is no longer needed and is overwritten by the next line to be processed.

Interestingly, there is never any representation of the boundary of a region throughout the entire process. As regions are scanned line by line through the pipeline, their properties are computed and held. But their shape is never stored, not even as a temporary binary image, as is common practice with sequential implementations. Moreover, if regions or boundaries needed to be stored, then a large data structure, such as a frame buffer, would be required. Filling such a data structure and revisiting it after the frame is processed would cause unnecessary additional complexity and would increase system latency. By eliminating any explicit shape representation in our design, a region is fully detected and all of its properties are available just three line-clocks after it was output from the imaging sensor. If we compare this to a computer-based implementation, then by the time the computer finished grabbing a frame, this system would already have finished processing that frame and reported all the region centers via the USB line.

The processing uses the xstart and xend information to check for neighboring pixels between two line components in the two lines. The following condition is used to detect all (four-)neighbor line components:

(x_C^{start} ≤ x_P^{end}) and (x_C^{end} ≥ x_P^{start}),

where the subscripts P and C refer to the previous and current lines, respectively. Examples of such regions are shown in Fig. 4.

Fig. 4. Four cases of overlap between two consecutive line components are identified by their start–end location relationships. This merging operates on two line-component lists (as opposed to pixel lines).

Using this condition, the algorithm checks for overlap between every component in the current line against all the components in the previous line. In software, this part can easily be implemented as a nested loop. In hardware, the implementation involves two counters with a synchronization mechanism to implement the loops. If a connected component is found, the next internal loop can start from the last connected component in the previous line, saving computation time. Since the module is reset every line clock and each iteration of the loop takes one clock, the number of components per line is limited to the square root of the number of pixels in a line. For a 640 × 480 frame size it can accommodate up to 25 objects per line. A more efficient, linear-time merging algorithm exists, but its implementation in hardware was somewhat more difficult.
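Written as a predicate, the overlap test is a direct transcription of the condition above, using the dictionary fields of the earlier line-component sketch.

```python
def overlaps(c, p):
    """Four-neighbor overlap between a current-line component c and a
    previous-line component p."""
    return c["xstart"] <= p["xend"] and c["xend"] >= p["xstart"]
```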

A component from the current line that is connected to a component from the previous line is assigned the same component ID as the one from the previous line. Components in the current line that are not connected to any component in the previous line are assigned new IDs. In some cases, a component in the current line connects two or more components from the previous line. Fig. 5 shows an example of three consecutive lines with different cases of connected components, including a merge. Obviously, the merged components should all get the same ID. The middle component, initially assigned ID = 3, no longer needs that ID after the merge. This ID can then be reused. This is important in order to minimize the required size of the arrays for component properties. In order to keep the ID list manageable and within the size constraints of the hardware, it was implemented as a linked list (in hardware). The list pointer is updated on every issue of a new ID or whenever a merge occurs. The ID list is kept non-fragmented. To support merging components, the algorithm keeps a merge list that is used in the next module in the pipeline. The merge list has an entry for each component, holding the component ID to be merged with, or zero if no merge is required for this component. The actual region merging is handled by the Regions List module.

Fig. 5. An example in which two regions are merged into one at the time of processing the lower pair of lines. The middle region, which was first assigned ID = 3, is merged with the region of ID = 1, their moments and bounding-box properties are merged, and ID 3 is recycled.
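A sketch of this ID-assignment step for one pair of lines is shown below. It reuses the overlaps predicate above, assumes previous-line components already carry an id field, and omits the hardware linked list and ID recycling for clarity; it shows the intent of the logic, not the RTL.

```python
def connect_lines(prev_comps, curr_comps, next_free_id):
    """Assign IDs to current-line components and record required merges."""
    merge_list = []                      # pairs (id_kept, id_merged_away)
    for c in curr_comps:
        ids = [p["id"] for p in prev_comps if overlaps(c, p)]
        if not ids:
            c["id"] = next_free_id       # a new region starts on this line
            next_free_id += 1
        else:
            keep = min(ids)              # keep the lowest ID
            c["id"] = keep
            # A current-line component touching several previous-line regions
            # causes those regions to merge.
            merge_list += [(keep, other) for other in ids if other != keep]
    return merge_list, next_free_id
```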


3.4. The regions list module

This module maintains the list of regions in the frame. It is triggered every line clock and receives the lists of component IDs, of merged components, and the moments of components that were processed in the connected components module. It keeps a list of regions and their corresponding moments for the entire frame, along with a valid bit for each entry. Since regions form as the system advances through the lines of the frame, the moments from each line are added into the correct entry according to the ID of the component. Moments of components with the same ID as the region are simply added. When two regions merge, the moments of one region are similarly added to those of the other, the result is stored in the region with the lower ID, and the record of the other region is cleared and made available for a new region. The list of regions of the entire frame and their moments is ready at the time the last pixel of the frame is received.
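A corresponding sketch of the region-table update (per-line accumulation plus folding merged entries into the lower ID) follows; the dictionary-based table is only a stand-in for the fixed-size memory array with valid bits.

```python
def update_region_table(table, curr_comps, merge_list):
    """Accumulate line-component moments into the frame-wide region table
    and fold merged regions into the entry with the lower ID."""
    for c in curr_comps:
        rid = c["id"]
        if rid in table:
            for k in ("M00", "M10", "M01"):
                table[rid][k] += c[k]
        else:
            table[rid] = {k: c[k] for k in ("M00", "M10", "M01")}
    for keep, gone in merge_list:
        if gone in table:                     # fold the higher ID into the lower one
            for k in ("M00", "M10", "M01"):
                table[keep][k] += table[gone][k]
            del table[gone]                   # entry freed for reuse
    return table
```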

3.5. Reporting the detected regions

Once the list of moments of the entire frame is ready, the system checks the shape properties of each region and decides whether a region is a valid pupil. The centroid of a region is then computed using:

X_c = M_{10} / M_{00},    Y_c = M_{01} / M_{00}.

This module is triggered every frame clock. In our current implementation it copies the list of moments computed in the previous module, to allow concurrent processing of the next frame, and prepares a list of region centroids. This could also be done per finished region, while still processing the rest of the frame. This compact region list can now be easily transferred to an application system, such as a PC. For a frame of size 640 × 480 pixels, any pixel location can be expressed by 10 + 9 = 19 bits of information. However, since the division operation provides sub-pixel accuracy, we add 3 bits to each coordinate, using a fixed-precision sub-pixel location format. A universal serial bus (USB) connection is used to transfer the pupil locations and sizes to a PC. The valid bit for the centroid list is updated every time the list is ready. This bit is polled over the USB to ensure that correct locations are read from the hardware.
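A sketch of this reporting step, including the 3-bit fixed-point sub-pixel packing described above, is given below; the size limits and the output record layout are illustrative assumptions, not the actual USB format.

```python
def report_regions(table, min_area=10, max_area=400):   # size limits are assumptions
    """Compute centroids from the accumulated moments and encode them with
    3 fractional bits per coordinate (multiply by 8 and round)."""
    records = []
    for rid, r in table.items():
        if not (min_area <= r["M00"] <= max_area):
            continue                                     # reject non-pupil regions
        xc = r["M10"] / r["M00"]
        yc = r["M01"] / r["M00"]
        records.append((int(round(xc * 8)), int(round(yc * 8)), r["M00"]))
    return records
```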

4. Simulink model

The functionality of the system is captured at an abstract level using a rigorously defined model of computation. Models of computation are formalisms that allow capturing important formal properties of algorithms. For example, control algorithms can often be represented as finite-state machines—a formal model that has been the object of intense study over the years. Formal verification of properties such as safety, deadlock freedom, and fairness is possible without resorting to expensive simulation runs on empirical models. Analyzing the design at this level of abstraction, represented by the Simulink step in Fig. 6, allows finding and correcting errors early in the design cycle, rather than waiting for a full implementation to verify the system. Verifying a system at the implementation level often results in long and tedious debugging sessions where functional errors may be well hidden and difficult to disentangle from implementation errors.

In addition to powerful verification techniques, models of computation offer a way of synthesizing efficient implementations. In the case of finite-state machines, a number of tools are available on the market to either generate code or generate Register Transfer Level descriptions that can then be mapped efficiently into a library of standard cells or other hardware components.

4.1. Synchronous dataflow model

This system processes large amounts of data in a pipeline and is synchronized with a pixel clock. A synchronous dataflow model of computation is the most appropriate for this type of application. A synchronous dataflow is composed of actors, or executable modules. Each module consumes tokens on its inputs and produces tokens on its outputs. Actors are connected by edges and transfer tokens in a first-in first-out manner. Furthermore, an actor may only fire when there is a sufficient number of tokens on all of its inputs.

All of the major system modules described in Section 3 are modeled as actors inthe dataflow. Data busses, which convey data between modules, are modeled asedges connecting the actors.
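The firing rule itself can be illustrated with a few lines of Python; this is a generic synchronous-dataflow skeleton for illustration only, not the Simulink model used in the project.

```python
from collections import deque

class Edge:
    """FIFO channel between two actors; tokens are arbitrary Python objects."""
    def __init__(self):
        self.fifo = deque()

class Actor:
    def __init__(self, inputs, outputs, consume, fn):
        self.inputs, self.outputs = inputs, outputs    # lists of Edge
        self.consume = consume                         # tokens consumed per input edge
        self.fn = fn                                   # maps input tokens to output tokens

    def can_fire(self):
        # SDF firing rule: every input edge must hold enough tokens.
        return all(len(e.fifo) >= n for e, n in zip(self.inputs, self.consume))

    def fire(self):
        args = [[e.fifo.popleft() for _ in range(n)]
                for e, n in zip(self.inputs, self.consume)]
        outputs = self.fn(*args)                       # one token list per output edge
        for e, toks in zip(self.outputs, outputs):
            e.fifo.extend(toks)
```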

Several electronic design automation (EDA) tools can be used to analyze and simulate this model of computation. We decided to use Simulink over other tools such as Ptolemy, SystemC, and the Metropolis meta-model because of its ease of use, maturity, and functionality.

Simulink provides a flexible tool that contains a great deal of built-in support as well as versatility. The Matlab toolbox suite, which includes arithmetic libraries as well as powerful numerical and algebraic manipulation, is linked into Simulink, thus providing Simulink with methods for efficiently integrating complex sequential algorithms into pipeline models and allowing synchronization through explicit clock connections.

Fig. 6. First the system behavior model is designed and tested. Then the hardware/software architecture is set, and the functions are mapped onto the selected architecture. Last, this model is refined and converted into the actual system implementation.

The image subtraction and thresholding module processes corresponding pixels from the current and previous frames on a pixel-by-pixel basis. Input tokens for this actor represent pixels from the sensor and the frame buffer, while output tokens represent thresholded pixels.

The connected components module accumulates region data over image lines. This module's actor fires only when it has accumulated enough pixel data for a complete line. It outputs a single token per line, containing all the data structures for the connected components list, including object ID tags, component moments, and whether they are connected to any other components processed on previous lines.

These line tokens are passed to a module that updates the properties of all known regions in the frame. Input is received on a line-by-line basis from the connect components module and regions in the list are updated accordingly. This module accumulates the data about possible pupil regions over an entire frame of data. Thus, one output token is generated per frame. The frame's region data are then passed to an actor that calculates the XY centroids of the regions using a Matlab script. The XY data output is generated for every frame.

The executable model of computation supported by Simulink provides a convenient framework to detect and correct errors in the algorithms and in the data passing. It forces the system to use a consistent timing scheme for passing data between processing modules. Finally, this model provided us with a basis for a hardware implementation, taking advantage of a pipeline scheme that could process different parts of a frame at the same time.

5. Hardware implementation

After verifying the functional specification of the model using Simulink, hardware components are selected and the logic circuit is implemented in HDL, according to the hardware blocks depicted in Fig. 2. A Zoran sensor and Application Development board was used for the prototype implementation, as shown in Fig. 7. This board was at hand and was sufficient for the task: it has the right means for connecting the sensor and has USB connectivity for transferring the processing results. The use of an available hardware development platform minimizes the required hardware work and saves both time and money that would otherwise be required for a new PCB design. The Zoran board contains three Apex20K FPGAs and several other components, much of which is left unused here. The algorithm implementation consists of combinatorial logic and registers only and does not make use of the special features the Altera Apex20K provides, such as PLLs and DSP blocks. The density and timing performance of the Altera device allow real-time operation at the sensor's 30 MHz pixel rate with hardly any optimization of the HDL code. A potentially better implementation, with far fewer components and possibly higher performance, could be achieved by a dedicated PCB design. However, the design of a new dedicated PCB was beyond the scope of this project. Moreover, since the designed logic operates at the pixel clock, the most suitable mass-production implementation would be to eliminate the FPGA altogether and place the entire logic on the same silicon as the CMOS sensor, thus making it a single-chip eye-detection sensor. This would be the most suitable and cost-effective commercial implementation for mass production.

Fig. 7. System prototype. The lens is circled by the on-axis IR LEDs. Most of the development board is left unused, except for one FPGA and a frame buffer.

The image capture device is a Zoran 1.3 Megapixel CMOS sensor, mounted with a 4.8 mm/F2.8 Sunex DSL602 lens providing a 59° field of view. It provides a parallel video stream of 10 bits per pixel at up to 30 Mpixel/s (a 30 MHz pixel clock), along with pixel, line, and frame clocks. In video mode, the sensor can provide up to 60 progressive-scan frames per second at the NTSC resolution of 640 × 480. Control over the sensor is done through an I2C bus that can configure a set of 20 registers in the sensor to change different parameters, such as exposure time, image decimation, analog gain, line integration time, sub-window selection, and more. Each of the two near-infrared light sources is composed of a set of 12 HSDL-4420 subminiature LEDs. This is equivalent to the light source reported elsewhere [16]. The LEDs are driven by power transistors controlled from the logic circuit to synchronize the on-axis and off-axis illumination with the CMOS sensor frame rate. The LEDs illuminate at a near-infrared wavelength of 875 nm and are non-obtrusive to humans, yet they exploit a relatively sensitive region of the imaging sensor spectrum (see Figs. 8 and 9).

The video is streamed to the FPGA where the complete algorithm was implemented. Only one of the three on-board Altera Apex20K devices was used. It provided enough logic gates and operates at the required pixel-rate speed. The FPGA controls the frame buffer FIFO (an Averlogic AL422B device with easy read and write pointer control). Communication with the application device, a PC in this case, is done through a Cypress USB 1.1 controller. The USB provides an easy way to transfer pupil data to the PC for display and verification. It also allows easy implementation of the I2C communication and control of the sensor and of algorithm parameters.

Fig. 8. Results of one experiment. The detected (x, y) coordinates of four linearly moving targets along approx. 1500 frames are marked with blue '+' signs (forming a wide line due to their high plotting density). The deviation from the four superimposed second-degree polynomial fits (red) is hardly noticeable due to the achieved sub-pixel accuracy. See Table 1 for the calculated error rates. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

Fig. 9. Synthetic target ('+' sign) and eye ('.' sign) detections are shown for the 4 ft (left graph, total of 980 frames) and 6 ft (right graph, 870 frames) experiments. The eye-detection trajectory follows the head motion. For this subject, S13, head motion was much more noticeable in the second, 6 ft session. The synthetic targets are static and thus show no trajectory.

Fig. 10. One frame of the system's monitor video output, generated by the FPGA, passed to an on-board VGA converter, and projected at 60 fps on a white screen. This optional video channel is used only for debugging and demonstration purposes and is not required for regular operation. The gray-level input frame is directed to the red channel and detected pupils are superimposed as bright green marks. The image was captured by a digital camera from the projected screen. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

One limitation of the original prototype design was the lack of any video output, e.g., to a monitor. This was later solved using a VGA converter, also found on the development board. While this part has no role in the system operation, it is indispensable during the development and debugging phases. It is the only means to display the captured image, necessary to enable even the most basic operations such as lens focus and aperture adjustments. Since the FPGA is directly connected to the sensor, the video output must originate from the FPGA. This video signal drives the red color channel of the VGA output (see Fig. 10).

6. Experimentation

The prototype system was tested on both synthetic targets and human eyes. The synthetic targets provide accurate and controlled motion, fixed "pupil" size, and no blinking. Experiments with detection of human eyes provide some quantitative and qualitative data from realistic detection scenarios, however with no ground-truth information to compare against.

Synthetic targets consist of circular patches, 7 mm in diameter, made of a red reflective material of the kind used for car safety reflective stickers. These stickers are directional reflectors, mimicking the eye's retro-reflection. Their reflection is in general brighter than that of a human eye, and can be detected by the prototype system at distances up to 12 ft. Two pairs of targets were placed on a piece of cardboard, attached to horizontal bearing tracks, and manually moved along a 25 cm straight line in front of the sensor, at different locations and distances (about 40–60 cm). The system's output (x, y) coordinates of the targets were logged and analyzed. Fig. 8 shows the result of one such test.

Table 1 summarizes the results of 10 experiments, with a total of 15,850 frames. The average false positive (insertion) rate is 1.07%. The average miss (deletion) rate is 0.2%. Position errors are measured by the fluctuation of data points around a second-order polynomial fit to all the point locations of each target in each experiment (ideally a linear fit would do; however, due to lens distortion the trajectories are not linear in image space). The RMS error is 0.9 pixels, and the average absolute error is 0.34 pixels. Note that the position error computation incorporates all outliers.
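The error analysis can be reproduced with a short script of the following form; the variable names are illustrative and the actual logged coordinate data are not included here.

```python
import numpy as np

def trajectory_errors(xs, ys):
    """Fit a second-degree polynomial to one target's trajectory and measure
    the fluctuation of the reported points around it."""
    coeffs = np.polyfit(xs, ys, deg=2)          # second-order fit (lens distortion)
    residuals = ys - np.polyval(coeffs, xs)
    rms = float(np.sqrt(np.mean(residuals ** 2)))
    mean_abs = float(np.mean(np.abs(residuals)))
    return rms, mean_abs
```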

Target 3 has a noticeably higher RMS error than the others. This was later found to be a result of using a rolling shutter, which caused an overlap between the exposure times of pixels of the on-axis and off-axis frames. We were later able to overcome this problem using sub-frame selection of a 640 × 480 region rather than the originally used frame decimation mode. However, a sensor with a fixed exposure would have an advantage in such an eye-detection system.

A second lab experiment was conducted with human subjects. Four male subjects participated, of which two are White (S10, S13) and two are Asian (S11, S14). The experiment was done at two distances, 4 and 6 ft away from the sensor. Each subject was recorded at each distance for three sessions of about 3–5 s each, during which he was asked not to blink and not to move his head. A reference target of two circles was placed on the wall behind the subjects and was expected to be detected in each frame at the same location.

The detection results for human eyes are summarized in Table 2. Although we have only four subjects, two trends can be observed. First, the average eye miss rate at 6 ft was nearly twice as high as at 4 ft. Second, while detection of the Asian subjects' eyes was similar to that of the White subjects' eyes at the 4 ft distance, it was much worse at the 6 ft distance. This qualitative difference between Asian and White subjects is in agreement with the lower intensity of IR reflected from the eyes of Asian people [17]. Note that despite the request not to blink, the eye miss rate does include several blinks, at both distances, which greatly affect the reported results. The reference target, located at an 8 ft distance, was missed a few times, sometimes due to occlusion by the subject's head and in other cases due to noise, but in general was well detected.

Table 1
Experimentation results with synthetic targets

Target    Deletion rate   Insertion rate   Position err (RMS pixels)   Position err (abs pixels)
1         0.00000         0.00151          0.344                       0.251
2         0.00006         0.01154          0.566                       0.290
3         0.00378         0.02415          1.803                       0.474
4         0.00398         0.00543          0.884                       0.358
Average   0.00195         0.01066          0.899                       0.343

Table 2
Experimentation results with detection of human eyes and synthetic targets

                              s10      s11      s13      s14      White    Asian    Average
Age                           24       34       41       28       —        —        —
Race                          White    Asian    White    Asian    —        —        —
Target miss rate (8 ft)       0.0994   0.0398   0.0003   0.1638   0.0498   0.1018   0.0758
Eye miss rate (4 ft)          0.3825   0.2906   0.1633   0.2596   0.2729   0.2751   0.274
Eye miss rate (6 ft)          0.1719   0.6129   0.5552   0.768    0.3636   0.6905   0.527

Our experiment with human eyes was affected by the lack of ground truth. Without tracking, the system misses the eyes during blinks (no tracking between frames is used in any of the experiments). Eye location error cannot be evaluated, as the true eye location is dynamic, changing with head motion and saccades. However, the experiments with both synthetic targets and human eyes together cover both aspects of the system in a quantitative way.

The system was demonstrated at the CVPR 2003 demo session [28]. Several hundred people passed in front of the system during 8 h of demonstration. A typical frame is shown in Fig. 10. The system worked continuously without a single reset, and was able to detect most of the people at distances between 2 and 5 ft.

This prototype system is noticeably less sensitive than the previous, CCD-equipped, software-based implementation we experimented with [16]. One reason is the faster frame rate and the rolling shutter, which impose a shorter exposure time. Another reason is the optics and the lower sensitivity of the CMOS sensor. There is a tradeoff between the lens speed (or aperture), the width of the field of view, the amount of IR illumination, the exposure time (and thus the maximal frame rate), and the overall system sensitivity. In this first prototype we focused on making the processing work and have not exploited the full potential of a dedicated optical design. It is expected that with the continuous improvement of CMOS image sensors it will be possible to increase system performance.

7. Conclusions and future work

The prototype system developed in this work demonstrates the ability to implement eye detection using simple logic. The logic can be pushed into the sensor chip, simplifying the connection between the two and providing a single-chip eye-detection sensor. Eliminating the need for image processing on the CPU is an important step towards mass production of commercial eye detectors. We believe that this out-of-the-box thinking and novel implementation will promote the development of more affordable, yet high-performance, embedded eye-gaze tracking systems.

The current system performs only eye detection, not eye-gaze tracking. In order to support eye-gaze tracking, more features need to be extracted from the image, such as cornea glints (the first Purkinje image). Glint detection may be implemented in a similar manner to the pupil detection. Once the glint and pupil centers are extracted from the frame, a calibration-free eye-gaze computation can be carried out on a computer or a PDA with minimal processing load.

The proposed single-pass, highly parallel connected components algorithm, coupled with the computation of moments and shape properties, is suitable for a software implementation as well. This general-purpose algorithm would make for extremely efficient computation in real-time image analysis for various detection and tracking tasks.

The advantages of a hardware implementation over software on a CPU are lower manufacturing price, the ability to work at very high frame rates with a low processing clock equal to the sensor's pixel clock, and low operational complexity (it does not crash). The disadvantages are lower flexibility for changes, a more complicated development cycle, and limited monitoring ability. This approach has significant value for low-cost, mass-production manufacturing.

Of particular interest is the achieved processing speed. Operating at just a single computation cycle per pixel clock, this logic design achieves the theoretical computational lower bound for this task. Hence, the system is able to process relatively high data rates at a very low processing clock. For example, the current operating pixel rate of 30 MHz, at a dynamic range of 10 bits/pixel, already represents a data rate of 300 Mbps. This rate is quickly approaching the capacity limit of conventional digital input channels such as Firewire (400 Mbps) and USB 2.0 (480 Mbps), which are eliminated here altogether. This highly parallel, low-clock-rate implementation is a proof of concept for the great potential of hardware-based image processing and may result in a very low cost, high performance, mass-production system.

In this paper, we explored one extreme of the possible range of implementations: a full hardware solution. Most prior work was based on software-only implementations. Our design methodology, based on capturing the processing algorithms at a high level of abstraction, is useful for exploring a much wider range of implementations, from a full software realization to intermediate solutions where part of the functionality of the algorithm is implemented in hardware and the rest in software. We wish to compare these intermediate implementations and to use automatic synthesis techniques to optimize the design and its implementations. The Metropolis framework, under development at the Department of EECS of the University of California at Berkeley under the sponsorship of the Gigascale System Research Center, can be used as an alternative to, or in coordination with, Simulink/Matlab to formally verify design properties, to evaluate architectures early in the design process, and to automatically synthesize implementations.

Future work may improve the system in several respects. Working at high frame rates requires very sensitive sensors due to the short exposure time; this limits the distance range of eye detection in the current prototype. This deficit may be addressed by more appropriate optics. Higher-order moments would improve the filtering of false-positive pupil candidates. A fixed shutter would help to reduce the IR illumination cycle and increase the consistency of illumination across the image. Alternating between decimation and windowing at different resolutions could yield high-frame-rate tracking of known eyes together with high-resolution search for new targets. Last, adding the detection of cornea glints to the pupil detector would allow the gaze direction to be computed, hence making it a single-chip eye-gaze tracking sensor.

Acknowledgments

We thank the Zoran CMOS Sensor team and Video IP core team for providing the sensor module and for their support. We are also grateful to Tamar Kariv for the help with the USB drivers and the PC application, and to Ran Kalechman and Sharon Ulman for their helpful advice with the logic synthesis.

References

[1] D.H. Ballard, C.M. Brown, Computer Vision, Prentice Hall, New Jersey, 1982, pp. 150–152.
[2] F. Balarin, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, M. Sgroi, Y. Watanabe, Modeling and designing heterogeneous systems, in: J. Cortadella, A. Yakovlev (Eds.), LNCS: Advances in Concurrency and Hardware Design, Springer-Verlag, Berlin, 2002, pp. 228–273.
[3] A. Cozzi, M. Flickner, J. Mao, S. Vaithyanathan, A comparison of classifiers for real-time eye detection, in: ICANN, 2001, pp. 993–999.
[4] J.W. Davis, S. Vaks, A perceptual user interface for recognizing head gesture acknowledgements, in: IEEE PUI, Orlando, FL, 2001.
[5] A.T. Duchowski (Ed.), Proc. Symp. on Eye Tracking Research and Applications, March 2002, Louisiana.
[6] Y. Ebisawa, S. Satoh, Effectiveness of pupil area detection technique using two light sources and image difference method, in: A.Y.J. Szeto, R.M. Rangayan (Eds.), Proc. 15th Annu. Internat. Conf. of the IEEE Engineering in Medicine and Biology Society, San Diego, CA, 1993, pp. 1268–1269.
[7] I. Haritaoglu, A. Cozzi, D. Koons, M. Flickner, D. Zotkin, Y. Yacoob, Attentive toys, in: Proc. of ICME 2001, Japan, pp. 1124–1127.
[8] A. Haro, I. Essa, M. Flickner, Detecting and tracking eyes by using their physiological properties, in: Proc. Conf. on Computer Vision and Pattern Recognition, June 2000.
[9] E. Hjelm, B.K. Low, Face detection: a survey, Comput. Vision Image Understand. 83 (3) (2001) 236–274.
[10] A. Hyrskykari, Gaze control as an input device, in: Proc. of ACHCI'97, University of Tampere, 1997, pp. 22–27.
[11] T. Hutchinson, K. Reichert, L. Frey, Human–computer interaction using eye-gaze input, IEEE Trans. Syst., Man, Cybernet. 19 (1989) 1527–1533.
[12] R.J.K. Jacob, The use of eye movements in human–computer interaction techniques: what you look at is what you get, ACM Trans. Inform. Syst. 9 (3) (1991) 152–169.
[13] A. Kapoor, R.W. Picard, Real-time, fully automatic upper facial feature tracking, in: Proc. 5th Internat. Conf. on Automatic Face and Gesture Recognition, May 2002.
[14] R. Kothari, J.L. Mitchell, Detection of eye locations in unconstrained visual images, in: Proc. ICPR, vol. I, Lausanne, Switzerland, September 1996, pp. 519–522.
[15] C. Morimoto, D. Koons, A. Amir, M. Flickner, Pupil detection and tracking using multiple light sources, in: Workshop on Advances in Facial Image Analysis and Recognition Technology, Fifth European Conference on Computer Vision (ECCV '98), Freiburg, June 1998.
[16] C. Morimoto, D. Koons, A. Amir, M. Flickner, Pupil detection and tracking using multiple light sources, Image Vision Comput. 18 (4) (2000) 331–336.
[17] K. Nguyen, C. Wagner, D. Koons, M. Flickner, Differences in the infrared bright pupil response of human eyes, in: Proc. Symp. on ETRA 2002: Eye Tracking Research and Applications Symposium, New Orleans, Louisiana, 2002.
[18] K. Perlin, S. Paxia, J.S. Kollin, An autostereoscopic display, in: Proc. of the 27th Annu. Conf. on Computer Graphics and Interactive Techniques, July 2000, pp. 319–326.
[19] D.D. Salvucci, J.R. Anderson, Intelligent gaze-added interfaces, in: Proc. of ACM CHI 2000, New York, 2000, pp. 265–272.
[20] A. Sangiovanni-Vincentelli, Defining platform-based design, EE Design, March 5, 2002.
[21] L. Stark, S. Ellis, Scanpaths revisited: cognitive models direct active looking, in: D.F. Fisher, R.A. Monty, J.W. Senders (Eds.), Eye Movements: Cognition and Visual Perception, Erlbaum, Hillside, NJ, 1981.
[22] R. Stiefelhagen, J. Yang, A. Waibel, A model-based gaze tracking system, in: Proc. of the Joint Symp. on Intelligence and Systems, Washington, DC, 1996.
[23] A. Tomono, M. Iida, Y. Kobayashi, A TV camera system which extracts feature points for non-contact eye movement detection, in: Proc. of the SPIE Optics, Illumination, and Image Sensing for Machine Vision IV, vol. 1194, 1989, pp. 2–12.
[24] R. Vertegaal, I. Weevers, C. Soln, GAZE-2: an attentive video conferencing system, in: Proc. CHI-02, Minneapolis, 2002, pp. 736–737.
[25] L. Young, D. Sheena, Methods and designs: survey of eye movement recording methods, Behav. Res. Methods Instrum. 7 (5) (1975) 397–429.
[26] S. Zhai, C. Morimoto, S. Ihde, Manual and gaze input cascaded (MAGIC) pointing, in: Proc. ACM CHI-99, Pittsburgh, 15–20 May 1999, pp. 246–253.
[27] W.Y. Zhao, R. Chellappa, A. Rosenfeld, P.J. Phillips, Face recognition: a literature survey, UMD CfAR Technical Report CAR-TR-948, 2000.
[28] L. Zimet, S. Kao, A. Amir, A. Sangiovanni-Vincentelli, A real time embedded system for an eye detection sensor, in: Proc. of Computer Vision and Pattern Recognition CVPR-2003 Demo Session, Wisconsin, June 16–22, 2003.