Finger tracking for interaction in augmented environments


Klaus Dorfmüller-Ulhaas, Dieter Schmalstieg
Imagination Computer Services, Austria
Vienna University of Technology, Austria

{klaus, dieter}@ims.tuwien.ac.at

Abstract

Optical tracking systems allow three-dimensional input for virtual environment applications with high precision and without annoying cables. Spontaneous and intuitive interaction is possible through gestures. In this paper, we present a finger tracker that allows gestural interaction and is simple, cheap, fast, robust against occlusion and accurate. It is based on a marked glove, a stereoscopic tracking system and a kinematic 3-d model of the human finger. Within our augmented reality application scenario, the user is able to grab, translate, rotate, and release objects in an intuitive way. We demonstrate our tracking system in an augmented reality chess game allowing a user to interact with virtual objects.

Figure 1. Manipulation of virtual objects by grab and release gestures: Natural interaction is possible using the finger tracker described in this paper together with augmented reality displays. In this image, a user plays chess against the computer by moving virtual chess men with his finger on a real board.

1 Introduction

In order to convey a sense of immersion, a virtual environment (VE) system must not only present a convincing visual rendering of the simulated objects, but also allow the user to manipulate them in a fast, precise, and natural way. Rather than relying on mouse or keyboard, direct manipulation of virtual objects is enabled by employing tracking with six degrees of freedom (6DOF). Frequently, this is done via hand-held props (such as a flying mouse or wand) that are fitted with magnetic trackers. However, this technology can only offer limited quality because it is inherently tethered, inaccurate and susceptible to magnetic interference. Early on, optical tracking has been proposed as an alternative. One main reason why optical tracking is so attractive is that, unlike tethered magnetic tracking, it supports capturing human movement without the need for active sensors, thus allowing interaction without the use of props. In particular, the tracking of hands is relevant, because it allows natural gesturing (Figure 1).

Figure 2 shows a taxonomy of gesturing after Quek [25, 26]. Intentional gestures can be roughly categorized into manipulative (object movement, rotation etc.) and communicative. While the expressive power of gestures is mostly attributed to the communicative family of gestures, it is the manipulative family that is mostly used in virtual environments. The reason is that choreographing 3-d events based on object manipulation is straightforward and immediately useful, while the meaning of communicative gestures is often more subtle and harder to exploit in applications. Also, communicative gesturing, just like any form of communication, relies on a common language that first needs to be mastered by the user before useful interaction is possible (with the exception of deictic gestures [3]). We will now examine how gestures fit into an interaction framework for VEs. We follow the 3-d interaction taxonomies developed by Hand [11] and Bowman [4] that categorize interaction into viewpoint control, selection, manipulation, and system control:

Figure 2. Intentional hand and arm movements can be classified as manipulative or communicative. Communicative gestures can be related to language (symbolic), or be non-linguistic acts. Mimetic acts simulate actions, while deictic acts refer to a specific object. Symbolic gestures either stand for a referential action, or are used as modalizers, often for speech.

• Viewpoint manipulation in virtual environments is best performed with direct head tracking [29].

• Manipulation of virtual objects (rigid objects, i.e., primarily translation and rotation) is a core requirement of most VE applications. Obviously, a very efficient method is direct manipulation with the hand.

• To perform selection in reality, a user stretches out the hand in the direction of the target object, then grabs it for manipulation, which can be tracked as a deictic followed by a mimetic gesture.

• System control describes all access to abstract functions that have no obvious correspondence in the three-dimensional environment. To provide good feedback, visible command objects (3D icons etc.) are often used for system control. Such system icons can be manipulated with gestures similar to those used for normal objects.

Taken together, all input modes relevant for a general virtual environment can be provided by control of a 6DOF cursor and a grab/select command. In this paper, we propose to track the user's index finger via retroreflective markers and use the tip as a cursor. The select command is triggered by bending one's finger to indicate a grab or grab-and-hold (i.e., drag) operation. The simplicity of this approach is also its power. While the potential of optical tracking as a superior tracking technique is generally recognized, its complexity has prevented widespread acceptance. In contrast, our simple approach is at a sweet spot in the space of possible optical tracking approaches, allowing us to develop a finger tracker that is fast, reliable, robust against occlusion, cheap, and accurate, and that can be interfaced easily to any VE and provides all necessary means of interaction through gestures. It combines natural and unobtrusive interaction through gesturing with precise and general purpose interaction in a mostly unrestrained virtual environment. Surprisingly, to our knowledge this particular approach has not been tried yet.

In the following, we discuss related work in section 2, followed by an overview of our approach in section 3, and details on the used finger model in section 4 and the computer vision algorithms in section 5. The presentation is complemented by results in section 6, and section 7 concludes the paper.

2 Related Work

In this section, we give a brief overview of gesture based interaction methods which consider the human hand. As mentioned before, gestures may be classified as manipulative or communicative. Our overview of the literature will concentrate on manipulative gestures, since we are interested in systems that allow the user to grab, translate, rotate and release virtual objects. The interested reader is referred to Pavlovic et al. [14, 24] for a general survey of hand tracking methods and algorithms for hand gesture analysis.

Considering the complexity of shapes of the human hand which may appear in video images, the segmentation of the human hand is the most crucial and time-consuming part a vision based system has to solve. In the case of manipulative gestures, the tracking of the hand should operate in real-time. This is why system developers apply constraints either to the environment or to the appearance of the human hand. We distinguish background and foreground constraints, generally applied to simplify the segmentation process. Background constraint systems often use a uniform (uncluttered) background [28, 5, 21, 18, 6]. Other systems assume a static or temporarily static background so that background subtraction [16, 32, 35] or segmentation by motion [19] can be performed. Unfortunately, using a controlled or known background is problematic or impossible in dynamic virtual and augmented environments where the scene changes over time. Foreground constraint systems detect markers attached to the human hand [7, 23, 22] or classify the human skin color [17, 10, 36]. Such systems assume controlled and static lighting conditions and rely on the assumption that no objects with similar color (e.g., skin/wood) appear in the image. Projection-based virtual environments are typically used with dimmed lights, leading to a decrease in color dynamics, which results in difficulties in identifying the human skin.

Template matching approaches for special hand features like the finger tips restrict the hand in its flexibility of deformation, since the finger tips should always be visible in the camera images [28, 10]. A common approach is to restrict the appearance of the hand to known depth values and to disallow other objects to appear inside the interaction volume [33]. Finally, an infrared camera system can be adapted to acquire optical signals in the temperature range of the human hand [30].

After image segmentation, the hand model plays a fundamental role in the tracking process. We distinguish 3-d hand models and appearance based models. 3-d hand models use articulated structures of the human hand to estimate the hand movements [28, 21], whereas appearance-based models directly link the appearance of the hand movements in visual images to specific gestures [2, 12, 32]. 3-d hand model-based systems often provide higher flexibility, due to the estimation of joint angles, and higher precision. Finally, the form of output from the tracking process determines the scope of possible applications. We classify 2-d systems [2], e.g., for controlling 2-d user interfaces [30], systems working in 3-d by supporting relative 3-d positions [23, 12], and systems which use stereoscopic vision for the most accurate, absolute 3-d positions [28, 32, 33]. Obviously, only absolute 3-d position is useful for our application scenario.

Often not addressed is the necessity of tracking initialization, which means that the user is forced to move the hand to a known position while performing a specific pose. Systems like [2, 12, 28] need this initialization whenever the object detection algorithm loses track. Such an approach is not acceptable for spontaneous and natural interaction in virtual environments. Recently, a new algorithm called Condensation [15] has been developed which tries to overcome the increasing uncertainty of a Kalman filter process. The algorithm is based on random sampling in order to track objects with a best fit over time, mostly independent of discrete time slots where the probability is not optimal. In the case of Active Contours it is necessary to collect alternative states and to prevent the system from losing track, because a re-initialization of the system is computationally expensive and not feasible in real-time. For the purpose of object tracking, we have implemented a Kalman filter able to estimate a residual between the observed and estimated measurements. This residual value is used as a deciding factor for whether the filter has lost track or not.
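As a rough illustration of this residual test, the following sketch (our own code, not from the paper; the function name and threshold value are assumptions) compares an observed measurement with the filter's predicted measurement and declares loss of track when the mean residual exceeds a threshold.

```python
import numpy as np

def track_lost(observed, predicted, threshold=0.02):
    """Decide whether the filter has lost track (illustrative only).

    observed, predicted: stacked 3-d marker positions, shape (N, 3), in metres.
    threshold: maximum tolerated mean residual (2 cm here, an assumed value).
    """
    residual = np.linalg.norm(observed - predicted, axis=1)  # per-marker error
    return residual.mean() > threshold

# Example: the prediction is off by 5 cm on one marker -> track considered lost
obs = np.array([[0.10, 0.20, 0.50], [0.12, 0.20, 0.50]])
pred = np.array([[0.10, 0.20, 0.50], [0.17, 0.20, 0.50]])
print(track_lost(obs, pred))
```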

3 System Overview

We did not find a human hand tracking system which fulfills all of our requirements. Specifically, all purely natural-feature-based tracking systems are either not accurate enough for the purpose of augmented reality or not independent of the environment or application. To overcome these problems, our optical tracking uses retroreflective markers operating with infrared light. The tracking system poses minimal constraints on the environment and can easily be adapted for other virtual reality applications. The proposed design is intended for a fixed working area of reasonable size (1-3 m²) where dextrous interaction can occur. A suitable workspace is defined by the volume above a table - this is useful both in combination with back-projection tables [20] and in augmented reality scenarios [27, 13]. In the following we focus on an outside-in tracking system, since occlusion problems may be less severe than for inside-out tracking and, in addition, a stereo rig with a larger baseline should be more precise for the purpose of 3-d interaction.

We want to require minimal effort in the setup and maintenance of the system. Thus it is a requirement that the system can be used without any special lighting or background. Moreover, a simple calibration procedure is necessary to allow quick installation of the system after the location or environment has changed. To allow for a relatively unrestrained environment, a marked glove is used for real-time separation of the finger from the background. The glove is fitted with retroreflective markers, which are illuminated by an infrared light source. A stereo camera pair with infrared filters blocks out most of the background. The infrared light source is co-located with the camera, so that light emitted in the direction of the retroreflective markers is directly reflected towards the camera, in a fashion similar to [9]. After segmentation of the 2-d marker locations, they are passed on to the marker matching module, where markers are correlated using a method based on epipolar constraints and a kinematic model of the finger. A motion estimator has been added in order to smooth and predict the motion of the user's finger. To this end, the synthesized 3-d position values are used as periodic measurements in a Kalman filter process. The filter itself estimates the parameters of a linearized kinematic model such as velocity, acceleration and angular velocities. These parameter values may be used to predict a future pose of the user's finger.

We will now examine the marker and finger model in more detail, and then discuss the relevant steps in computer vision processing required to transform images from the cameras into 3-d model parameters.

4 Markers and finger model

The intention of the finger tracker is to determine enough information to robustly track the position, orientation and pose of the user's index finger. For real-time determination of these parameters without the need to constrain environmental conditions, we resort to using a marked (but untethered) glove. In the following, we describe our considerations regarding shape and placement of these markers.

Possible marker shapes are shown in figure 3. Round or square reflector blips are features of the surface, which is fine as long as the markers face the camera. However, while interacting in virtual reality, hand and fingers are constantly rotated in space, and markers will often turn away from the camera. In this case, the blip would no longer indicate the real position of the joint.

Figure 3. Shape of markers - in contrast to round blips on the surface (a) that do not always represent the joint position (cross) well if rotated away from the camera, flat rings (b) are always centered at the joint, while convex rings (c) improve upon flat rings in that they have better retroreflective properties. Our final choice are displaced balls (d) that suffer the least from self-occlusion of the fingers.

As an alternative solution, we tried ring-shaped markers composed from small stripes of reflector material that are wrapped around the finger joints. A section of the rings will always face the camera, independent of the rotation of the joint. After some experimentation, the ring markers were modified to have a convex rather than a flat surface (Figure 4(a)). In that way, a portion of the retroreflective surface of the marker will always be oriented towards the camera, allowing for a higher amount of light to be reflected to the camera, thereby making segmentation easier. Unfortunately, our experiments showed that both blip and ring markers suffer from the fact that the joint center cannot easily be determined from the position of the markers due to self-occlusion of the fingers.

We therefore finally settled on using displaced balls on the back of the finger (Figure 4(b)), which have good retroreflective properties and are not often significantly occluded by the fingers themselves. The use of displaced balls was enhanced by connecting the balls with short pieces of wire mounted to hinges in the balls to enforce a fixed, known distance between the balls. The dimensions of these wire rods were chosen to match the distances between finger joints. This makes the glove independent of the user's finger lengths. There is, of course, an offset between the markers and the real joint positions. The real kinematics could be estimated using the user-dependent segment lengths and thickness of the finger; however, our work focuses on a user-independent solution, free of finger calibration, that estimates the pose of the chain of markers. While this "exoskeleton" looks awkward, it has the great advantage that it follows the behavior of the finger as a kinematic chain, but with easily detectable joint centers. Our experience confirmed that it does not affect finger movement or interaction in any noticeable way.

For reconstruction, we employ a 3-d finger model based on a kinematic chain of the finger joints that directly maps onto the markers. As the distance between the markers is known, the system is independent of the actual dimensions of the user's finger (within certain limits), while the soft glove material can be stretched to fit any user. The only remaining user specific parameter is the actual offset from the user's finger tip to the last marker in the chain, which is used to determine the 6DOF "hot spot". To enable a user to interact with his or her finger tip, this offset must be determined. However, we found that most users are willing to accept that the actual hot spot is offset by a small amount from their finger tip, and interaction is not affected.
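As a small illustration, the hot spot can be obtained by extrapolating from the last marker along the distal link direction. This is a sketch of ours; the marker ordering (palm to fingertip) and the assumption that the offset is applied along the last link are not stated explicitly in the paper.

```python
import numpy as np

def hot_spot(markers, tip_offset=0.015):
    """Extrapolate the 6DOF interaction 'hot spot' beyond the last marker.

    markers: (4, 3) array of joint-marker positions, ordered palm to fingertip.
    tip_offset: assumed distance (m) from the last marker to the fingertip.
    """
    direction = markers[3] - markers[2]            # distal link direction
    direction /= np.linalg.norm(direction)         # unit vector
    return markers[3] + tip_offset * direction     # point used as the 6DOF cursor

markers = np.array([[0.0, 0, 0], [0.045, 0, 0], [0.075, 0, 0], [0.095, 0, 0]])
print(hot_spot(markers))   # roughly 1.5 cm beyond the fingertip marker
```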


Figure 4. Gloves fitted with retroreflectivemarkers

5 Computer vision processing

To perform the whole work cycle shown in figure 5, four important operations make up the tracking procedure: calibration, segmentation, marker matching, and motion estimation, which includes the prediction of the model. These operations are described in the following sections.

5.1 Camera Calibration

The calibration must be very easy to perform, because virtual reality users typically are not computer vision experts. Therefore, our system can adaptively calibrate a stereo rig by tracking a single moving point acquired by each of the two cameras. As a result, the calibration data may be entered by just waving a passive reflective marker around. The parameters which are estimated by the calibration procedure are the focal lengths of both cameras, the translation and rotation from one camera to the other, and the structure depth with regard to one camera coordinate frame.


Figure 5. Processing pipeline (cameras → 2-d marker locations → 3-d marker locations → finger model with 8 DoF, with motion estimation, prediction, and projection).

Given a rough estimate of these parameters, the system is able to find a global optimum by comparing a set of measurements of one camera with the transformed and projected measurements of the other camera on the first camera's image plane. Details about this easy-to-use calibration technique can be found in [8].
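The comparison criterion can be sketched as follows: measurements of the second camera are back-projected using the estimated structure depths, transformed into the first camera's frame and projected onto its image plane, and the distance to the first camera's measurements is accumulated. This is our simplified sketch (pinhole model, image coordinates relative to the principal point), not the actual optimization procedure of [8].

```python
import numpy as np

def reprojection_error(pts1, pts2, depths2, f1, f2, R, t):
    """Mean image-plane error for a candidate stereo calibration (our sketch).

    pts1, pts2 : (N, 2) image coordinates of the moving marker in camera 1 / 2,
                 assumed relative to the principal point.
    depths2    : (N,) estimated structure depths along camera 2's rays.
    f1, f2     : focal lengths; R, t: rotation/translation from camera 2 to camera 1.
    """
    # Back-project camera-2 measurements to 3-d using the estimated depths.
    xyz2 = np.column_stack([pts2[:, 0] / f2 * depths2,
                            pts2[:, 1] / f2 * depths2,
                            depths2])
    # Transform into camera 1's frame and project onto its image plane.
    xyz1 = xyz2 @ R.T + t
    proj = f1 * xyz1[:, :2] / xyz1[:, 2:3]
    return np.mean(np.linalg.norm(proj - pts1, axis=1))
```

Minimizing this error over the candidate parameters corresponds to the global optimum mentioned above.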

5.2 Segmentation

The principal task the segmentation process has to perform is the estimation of the center of gravity of each marker. The center of gravity is computed from the weighted contributions of the pixels covered by the markers in the greyscale image. We have implemented a threshold based segmentation, because it is simple and able to work in real-time. Pixel values which are above a given threshold are used to estimate the center of gravity of the marker image. This segmentation is not accurate enough for all purposes, but it works much faster than ellipse fitting algorithms. Later on, we try to compensate for these errors by using a Kalman filter for motion estimation.
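A minimal sketch of this weighted centre-of-gravity computation is given below (our code; the 8-bit threshold value and the use of scipy.ndimage for blob labelling are convenience assumptions, not part of the paper).

```python
import numpy as np
from scipy import ndimage   # used only to label connected bright blobs

def marker_centroids(image, threshold=200):
    """Weighted centre of gravity of each bright blob above a threshold.

    image: 2-d uint8 greyscale array from one infrared camera.
    Returns a list of (x, y) sub-pixel centroids, one per blob.
    """
    mask = image >= threshold                   # pixels bright enough to be a marker
    labels, n = ndimage.label(mask)             # connected components
    centroids = []
    for k in range(1, n + 1):
        ys, xs = np.nonzero(labels == k)
        w = image[ys, xs].astype(float)         # greyscale values as weights
        centroids.append((np.sum(w * xs) / w.sum(),
                          np.sum(w * ys) / w.sum()))
    return centroids
```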

Unlike ring markers, a spherical marker's center of gravity generally matches the joint center very well, which reduces uncertainty and improves the behavior of the Kalman filter described in section 5.4.

5.3 Matching of markers

In addition to the segmentation, we need a mechanism which correlates extracted features of both images. Due to reflections on specular surfaces, noise can be included in the list of segmented features and should be detected by the matching module.

Applying the epipolar constraint does not solve the complete matching problem, which is problematic in cases where the corresponding feature for a marker in the first image is not the feature with the closest distance to the epipolar line in the second image. This can lead to erratic matching that combines image features which are not correlated in reality. Since the epipolar constraint module cannot detect such ambiguous cases based on the distance of a feature from the epipolar line, all matching features which lie within a small neighborhood of the epipolar line must be considered as candidates for 3-d points.
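The candidate test can be illustrated as follows: given a fundamental matrix, every feature in the second image whose distance to the epipolar line of a first-image feature is below a tolerance is kept as a correspondence candidate, not only the nearest one. The fundamental matrix and tolerance are assumed inputs; this is our illustration, not the authors' code.

```python
import numpy as np

def epipolar_candidates(x1, features2, F, tol=2.0):
    """Indices of features in image 2 close to the epipolar line of x1.

    x1       : (2,) pixel position in image 1.
    features2: (N, 2) pixel positions in image 2.
    F        : 3x3 fundamental matrix mapping image-1 points to epipolar lines.
    tol      : maximum point-to-line distance in pixels.
    """
    line = F @ np.array([x1[0], x1[1], 1.0])          # epipolar line a*x + b*y + c = 0
    a, b, _ = line
    pts = np.column_stack([features2, np.ones(len(features2))])
    dist = np.abs(pts @ line) / np.hypot(a, b)         # point-to-line distances
    return np.nonzero(dist <= tol)[0]                  # all candidates, not just the nearest
```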

Detection of correct 3-d points and their assignment to finger joints is done by analysis of the 3-d position values we retrieve with the previously described uncertainty. By using knowledge about the distances between the markers on the user's finger and some further constraints, the system is able to estimate the finger's pose:

• The first constraint is based on the assumption that the marker positions are located approximately in one 3-d plane. While it is indeed possible to move a finger sideways to some degree, this constraint is sufficiently satisfied by the rigid marker skeleton.

• The second constraint is based on the unambiguous sequence of known marker distances.

Figure 6. Marker matching

Figure 6 illustrates the procedure of marker matching. A random 3-d marker position is chosen. In the next step, the algorithm searches for a second marker position located close to the surface of a sphere with a radius determined by the known marker distance. If no such marker position can be found, the algorithm starts over with another arbitrarily chosen 3-d marker position. If a marker can be found close to the sphere's surface, a second sphere is used to find the third marker position, and so on. The procedure is successful if a full path including four markers has been found, if the identified 3-d locations are located within a given threshold of a 3-d plane, and if the shape of the polygon constructed from the joint positions is convex. One additional constraint which enhances the performance of the system is based on knowledge retrieved from the motion prediction, which is described below in section 5.4.
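A greedy version of this sphere search can be sketched as follows (our code; tolerance value is an assumption, and the planarity and convexity checks described above are omitted for brevity).

```python
import numpy as np

def find_marker_chain(points, link_lengths, tol=0.005):
    """Search for a chain of four markers with known link lengths (our sketch).

    points      : (N, 3) reconstructed 3-d candidate positions (may contain noise).
    link_lengths: the three known marker distances along the finger, in metres.
    tol         : how close a candidate must lie to the search sphere's surface.
    Returns indices [m0, m1, m2, m3] of a chain, or None if none is found.
    """
    n = len(points)
    for start in range(n):                      # try every point as the chain start
        chain = [start]
        for d in link_lengths:                  # grow the chain link by link
            last = points[chain[-1]]
            dist = np.linalg.norm(points - last, axis=1)
            # candidates close to the sphere of radius d around the last marker
            cand = [i for i in range(n)
                    if i not in chain and abs(dist[i] - d) <= tol]
            if not cand:
                chain = None
                break
            chain.append(min(cand, key=lambda i: abs(dist[i] - d)))
        if chain is not None:
            return chain        # planarity/convexity checks of the paper omitted here
    return None
```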


Consider the case when a complete path between the measurements could not be estimated in the presence of occlusion. Here, we derive the marker matching from the prediction of the Kalman filter process. The measurements close to the predicted values are taken as the new marker positions, while for occluded markers we may be able to use the predicted values. However, a better solution can be estimated using some biological constraints. If the finger tip marker or the marker furthest from the fingertip is lost, we use a calculated marker position based on the approximation that the first joint angle $\alpha$, close to the finger tip, is two thirds of the second joint angle $\beta$. This constraint may also be used to estimate an inner marker position.

Figure 7. Finger coordinate system (markers $p_0, p_1, p_2, p_3$ along the finger, with the axes $x$, $y$, $z$ near the palm)

For the following sections we assume a coordinate system defined by the finger pose (figure 7). The finger is located in the $xy$-plane and the origin is at point $p_0$, with the $x$-axis pointing in the direction of $p_1$. As is common for kinematic chains, the coordinate system of $p_1$ is defined relative to the reference frame of $p_0$. Analogously, the reference system of $p_2$ is defined relative to $p_1$. $p_3$ is only necessary to define the direction of the $x$-axis in the reference frame of $p_2$. In case the user's finger is bent, the global rotation matrix of the finger at frame $i$ can be calculated as follows:

$$x_i = \frac{p_{1,i} - p_{0,i}}{\left\| p_{1,i} - p_{0,i} \right\|} \quad (1)$$

$$z_i = \frac{x_i \times (p_{2,i} - p_{0,i})}{\left\| x_i \times (p_{2,i} - p_{0,i}) \right\|} \quad (2)$$

$$y_i = z_i \times x_i \quad (3)$$

First, we calculate $x_i$ as the normalized vector from $p_{0,i}$ to $p_{1,i}$. Since all markers should lie in one plane, we can use $p_{2,i}$ to define a second vector used to compute the $z$- and $y$-axis of this coordinate system by applying the cross product of vectors. The result gives the base vectors of the global finger reference frame, which can be combined into a global rotation matrix at time frame $i$. The vectors $x_i$, $y_i$, and $z_i$ form the columns of the matrix:

$$R_i = \begin{pmatrix} x_i & y_i & z_i \end{pmatrix} \quad (4)$$
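The frame construction of equations (1)-(4), as reconstructed above, translates directly into code; note that the marker ordering ($p_0$ on the palm side) follows our reading of figure 7.

```python
import numpy as np

def finger_frame(p):
    """Rotation matrix of the global finger frame from four markers.

    p: (4, 3) marker positions ordered p0 (palm side) ... p3 (fingertip side).
    """
    x = p[1] - p[0]
    x /= np.linalg.norm(x)              # eq. (1): x-axis along the first link
    z = np.cross(x, p[2] - p[0])
    z /= np.linalg.norm(z)              # eq. (2): normal of the finger plane
    y = np.cross(z, x)                  # eq. (3): completes a right-handed frame
    return np.column_stack([x, y, z])   # eq. (4): columns are the base vectors
```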

5.4 Modeling and Estimating Motion Kinematics

For developing a robust finger tracker it is important to achieve good estimates of the finger pose, even though measurements are imprecise and include distortions. Measurements such as the marker positions are assumed to contain white noise. The Kalman filter used in our implementation is responsible for filtering the motion model parameter values. Whenever the system equations¹ do not fit the real motion process well, the residual between real motion and motion model will be interpreted as random system noise. The Kalman filter as used in our implementation is a filter which extracts a kinematic state from periodic noisy measurements rather than a predictor of future marker positions used for speeding up the segmentation. Our implementation uses the Kalman filter in order to enhance the finger pose matching for the current frame. The process of marker and finger pose matching consists of a minimal path search over estimated 3-d point distances and is known to be NP-complete. Searching only a small number of markers does not really suffer from this fact, but even a moderate number of falsely detected marker positions (reflections etc.) can quickly affect the computational performance of the search. The Kalman filter is a good tool to overcome this problem by predicting new 3-d marker positions. Based on this prediction the algorithm can directly select markers in locations likely to contain valid 3-d points. As mentioned before and shown in figure 7, our measurement vector $z_i$ at time frame $i$ includes the locations of the four 3-d marker positions:

$$z_i = \begin{pmatrix} p_{0,i}^T & p_{1,i}^T & p_{2,i}^T & p_{3,i}^T \end{pmatrix}^T \quad (5)$$

These measurements are not correct due to noise from calibration and segmentation errors. We assume this noise is white noise $v_i$ added to the correct measurement $\bar{z}_i$:

$$z_i = \bar{z}_i + v_i \quad (6)$$

Consider figure 8 for the transformation of the marker positions $p_0$, $p_1$, $p_2$ and $p_3$ from time frame $i-1$ to time frame $i$. For each point of the marker model a translation $\Delta T_i$ and rotation $\Delta R_i$ is performed. This incremental and relative rotation is modeled using angular velocities $\omega = (\omega_x\;\; \omega_y\;\; \omega_z)^T$. We apply equation 7,

$$\Delta q_i = \begin{pmatrix} \sqrt{1 - (\omega_x^2 + \omega_y^2 + \omega_z^2)/4} & \omega_x/2 & \omega_y/2 & \omega_z/2 \end{pmatrix}^T \quad (7)$$

introduced by Azarbayejani and Pentland [1], to transform the angular velocities into a quaternion representation of the rotation $\Delta R_i$. For the translational as well as the rotational components it is assumed that each point $p_{k,i-1}$, $k \in \{0,\ldots,3\}$, undergoes a motion with constant angular velocity and constant translational acceleration.

¹ In our case, the motion kinematic equations.



Figure 8. Marker transformation

In other words, we use a linearized kinematic model for motion estimation, which is simple and less computationally intensive than using the accurate model. However, this linearization is only valid for a short period of time. Therefore, real-time motion capture is necessary, and the precision decreases with lower frame rates. Using this linearization, the translation $\Delta T_i$ can be expressed as

$$\Delta T_i = v\,\Delta t + \tfrac{1}{2}\,a\,\Delta t^2 \quad (8)$$

where $v = (v_x\;\; v_y\;\; v_z)^T$ is the translational velocity, $a = (a_x\;\; a_y\;\; a_z)^T$ is the constant translational acceleration, and $\Delta t$ is the time interval $[t_{i-1}, t_i]$. To estimate the finger's motion kinematics, the bending of the joints has been modeled by applying a rotation $R_{\alpha,i}$ for the first joint and $R_{\beta,i}$ for the second joint. $R_{\alpha,i}$ is defined as a rotation around the $z$-axis using the angle $\alpha_i$:

$$R_{\alpha,i} = \begin{pmatrix} \cos\alpha_i & -\sin\alpha_i & 0 \\ \sin\alpha_i & \cos\alpha_i & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad (9)$$

The rotation $R_{\beta,i}$ is defined similarly. Consider once again the incremental transformation shown in figure 8, where the rotation depends on the angular velocity and the translation depends on the translational velocity and translational acceleration as described before. As we assume a linearized motion, we are able to calculate the new marker positions if we know the following parameters, collected in the state vector:

$$s_i = \begin{pmatrix} v^T & a^T & \omega^T & \alpha_i & \beta_i \end{pmatrix}^T \quad (10)$$

These parameters are estimated during a Kalman filter process. Thus, we use the hat notation ($\hat{\cdot}$) for estimated parameters and are able to express the marker movements by the following equations:

$$\hat{p}_{0,i} = p_{0,i-1} + \Delta\hat{T}_i \quad (11)$$

$$\hat{p}_{1,i} = \hat{p}_{0,i} + \Delta\hat{R}_i\,(p_{1,i-1} - p_{0,i-1}) \quad (12)$$

$$\hat{p}_{2,i} = \hat{p}_{1,i} + \Delta\hat{R}_i\,\hat{R}_{\alpha,i}\,(p_{2,i-1} - p_{1,i-1}) \quad (13)$$

$$\hat{p}_{3,i} = \hat{p}_{2,i} + \Delta\hat{R}_i\,\hat{R}_{\alpha,i}\,\hat{R}_{\beta,i}\,(p_{3,i-1} - p_{2,i-1}) \quad (14)$$

These equations can be seen as an estimation process of future measurements at time frame $i$ while the previous measurements given at time frame $i-1$ are known:

$$\hat{z}_i = \begin{pmatrix} \hat{p}_{0,i}^T & \hat{p}_{1,i}^T & \hat{p}_{2,i}^T & \hat{p}_{3,i}^T \end{pmatrix}^T \quad (15)$$
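Since the original equations are garbled in this transcript, equations (11)-(15) above are our reconstruction; the sketch below simply mirrors that reconstruction in code. In particular, expressing the joint rotations about the finger plane normal by conjugating with the previous global finger rotation is our interpretation, not something stated explicitly in the paper.

```python
import numpy as np

def rot_z(angle):
    """Rotation about the z-axis, cf. equation (9)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def quat_to_matrix(q):
    """Rotation matrix of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)]])

def predict_markers(prev, R_prev, v, a, omega, alpha, beta, dt):
    """Predicted marker positions (eqs. 11-14, as reconstructed) for frame i.

    prev   : (4, 3) marker positions at frame i-1.
    R_prev : global finger rotation at frame i-1 (e.g. from finger_frame above).
    v, a   : translational velocity and acceleration; omega: rotation parameters
             of eq. (7); alpha, beta: joint angles; dt: frame interval.
    """
    d_t = np.asarray(v) * dt + 0.5 * np.asarray(a) * dt * dt          # eq. (8)
    w = np.asarray(omega, dtype=float)
    q = np.array([np.sqrt(max(0.0, 1.0 - w @ w / 4.0)), *(w / 2.0)])  # eq. (7)
    d_r = quat_to_matrix(q / np.linalg.norm(q))
    rot_a = R_prev @ rot_z(alpha) @ R_prev.T    # joint rotations expressed in world
    rot_b = R_prev @ rot_z(beta) @ R_prev.T     # coordinates (our interpretation)
    p = np.empty((4, 3))
    p[0] = prev[0] + d_t                                        # eq. (11)
    p[1] = p[0] + d_r @ (prev[1] - prev[0])                     # eq. (12)
    p[2] = p[1] + d_r @ rot_a @ (prev[2] - prev[1])             # eq. (13)
    p[3] = p[2] + d_r @ rot_a @ rot_b @ (prev[3] - prev[2])     # eq. (14)
    return p
```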

Whenever a new measurement is available, the Kalman filter performs a correction step (also called measurement update) to keep the residual between measurements and estimated measurements as low as possible by minimizing the error in a least squares sense. The residual between the measurement $z_i$ and the measurement function $h(\hat{s}_i, z_{i-1})$, which depends on the current estimated state and the last measurement vector, should be minimized; it is given in equation 16:

$$z_i - h(\hat{s}_i, z_{i-1}) = \begin{pmatrix} p_{0,i} - \hat{p}_{0,i} \\ p_{1,i} - \hat{p}_{1,i} \\ p_{2,i} - \hat{p}_{2,i} \\ p_{3,i} - \hat{p}_{3,i} \end{pmatrix} \quad (16)$$

After the measurement update, a new prediction can be performed. This step is also called time update, because this procedure projects the current state forward in time. Considering our application context, a linear transformation of the state vector is applied, which is given by:

$$\hat{v}_{i+1} = \hat{v}_i + \hat{a}_i\,\Delta t$$
$$\hat{a}_{i+1} = \hat{a}_i$$
$$\hat{\omega}_{i+1} = \hat{\omega}_i$$
$$\hat{\alpha}_{i+1} = \hat{\alpha}_i$$
$$\hat{\beta}_{i+1} = \hat{\beta}_i$$

The strength of the Kalman filter is its ability to model noise, allowing the system to filter state values even in noisy environments. The existence of noise is assumed for two different processes:

• The measurement includes white noise such that the expectation value is zero, $E[v_i] = 0$, and the noise included in one measurement is independent of the noise of another measurement:

$$E[v_i\,v_j^T] = \begin{cases} R_i & i = j \\ 0 & i \neq j \end{cases} \quad (17)$$

$R_i$ describes the covariance matrix of the measurement noise at time frame $i$.

7

Page 8: Finger tracking for interaction in augmented environments

• The filter models system noise that results from imprecise system equations. For instance, the linearized kinematic motion model does not describe the real motion exactly. Thus, there is a difference between the linearized and the real motion, which can be modeled as white system noise similar to equation 17. We denote the covariance matrix of the system noise by $Q$.

Figure 9 shows the complete extended Kalman filter process as applied for finger tracking purposes.

Initialization of $\hat{s}_0$, $P_0$, $Q$, and $R_0$

Time Update:

$$\hat{s}_i^- = A\,\hat{s}_{i-1} \quad (18)$$

$$P_i^- = A\,P_{i-1}\,A^T + Q \quad (19)$$

Measurement Update:

$$K_i = P_i^-\,H_i^T \left( H_i\,P_i^-\,H_i^T + \bar{R}_i \right)^{-1} \quad (20)$$

$$\hat{s}_i = \hat{s}_i^- + K_i \left( z_i - h(\hat{s}_i^-, z_{i-1}) \right) \quad (21)$$

$$P_i = \left( I - K_i\,H_i \right) P_i^- \quad (22)$$

Figure 9. The extended Kalman filter

As a first step, an initialization of the filter is necessary. Therefore, the covariance matrix $P_0$ of the state vector, the state vector itself, and the system and measurement noise matrices need to be specified. Afterwards, the state vector and its covariance matrix can be projected forward in time using:

$$A = \begin{pmatrix} I_3 & \Delta t\,I_3 & 0 & 0 & 0 \\ 0 & I_3 & 0 & 0 & 0 \\ 0 & 0 & I_3 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix} \quad (23)$$

The next step is to correct the state and the covariance matrix $P_i$ whenever a new measurement is available. For this, the Kalman gain matrix $K_i$ is calculated, which is used as a relative weighting of the trust in real measurements versus the estimated system state. Since equation 16 is non-linear, we have to apply the extended Kalman filter, which requires calculation of the Jacobian matrix $H_i$,

$$H_i = \frac{\partial h(\hat{s}_i^-, z_{i-1})}{\partial \hat{s}_i} \quad (24)$$

and of the new measurement noise matrix $\bar{R}_i$, which is influenced by the derivative of the function $h(\hat{s}_i, z_{i-1})$ with respect to the previous measurement:

$$\bar{R}_i = \frac{\partial h(\hat{s}_i, z_{i-1})}{\partial z_{i-1}}\; R_{i-1}\; \left( \frac{\partial h(\hat{s}_i, z_{i-1})}{\partial z_{i-1}} \right)^T \quad (25)$$
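Putting the pieces together, a generic extended Kalman filter step with the structure of figure 9 can be sketched as follows (our code; the measurement function h and its Jacobian are passed in as callables corresponding to equations (11)-(16) and (24), and all matrix shapes must match the chosen state layout).

```python
import numpy as np

def ekf_step(s, P, z, A, Q, R_bar, h, H_jac):
    """One predict/correct cycle of the extended Kalman filter (cf. figure 9).

    s, P  : state estimate and its covariance from the previous frame.
    z     : new measurement (stacked 3-d marker positions).
    A, Q  : linear state transition (eq. 23) and system noise covariance.
    R_bar : effective measurement noise covariance (eq. 25).
    h     : measurement function, state -> predicted measurement (eqs. 11-16).
    H_jac : function, state -> Jacobian of h with respect to the state (eq. 24).
    """
    # Time update (eqs. 18, 19): project state and covariance forward.
    s_pred = A @ s
    P_pred = A @ P @ A.T + Q
    # Measurement update (eqs. 20-22): blend prediction and measurement.
    H = H_jac(s_pred)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R_bar)   # Kalman gain
    s_new = s_pred + K @ (z - h(s_pred))
    P_new = (np.eye(len(s)) - K @ H) @ P_pred
    return s_new, P_new
```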

6 Experimental results

Experiments with real sequences of marker based finger motions were done on an Athlon 800 MHz processor using ELTEC's PcEye2 frame grabber board and two PULNiX TM-560 PAL cameras. The finger tracking operates in real-time at 25 frames per second² with an accuracy of 0.5 to 2 mm within a range of one square meter. The angular accuracy is difficult to analyse because it depends on the bending of the user's finger. Analysing the jitter in the rotational values for a bent finger, the angular error is below one degree on average.

We have connected the tracking system via sockets to the Studierstube [31] augmented reality system. The latency of the whole system is about 50 to 100 ms. We can compensate for this latency by using predicted marker positions. However, the accuracy of the system is reduced to 5 mm when predicting 80 ms forward in time.

The application we have used for rendering is a virtual chess application where the chess men are displayed as virtual objects and the chess board is real. In order to grab a virtual chess man, the user has to move his finger to the middle of a square, intersect the marker located at the finger tip with the virtual chess man, and bend the finger in order to grab the virtual object. While holding the finger bent, the user is able to drag (translate and rotate) the chess man and release it by stretching out the finger. This kind of interaction was found to be intuitive and easy to learn because it is similar to a real grab gesture. Compare figure 1(a) for an image of the collision of the user's finger with a chess man and figure 1(b) for an image taken while performing the grab gesture and dragging the virtual object. When fusing real and virtual images, one thing is not perfect with regard to a fully immersive illusion: the placement of virtual objects relative to real objects such as the human hand can be incorrect. Considering figure 10, the grabbed chess man should be located at the finger tip, but it appears on the palm. Future augmented reality systems should handle occlusions of virtual objects. This may be solved by estimating depth values of the real scene; however, this is a time-consuming reconstruction problem, which is currently not solvable in real-time.
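The grab/release decision described above can be illustrated by thresholding the bending of the tracked finger; the angle threshold below is an assumption of ours, not a value given in the paper.

```python
import numpy as np

def is_grabbing(markers, bend_threshold_deg=40.0):
    """Heuristic grab detection from the tracked joint markers (our sketch).

    markers: (4, 3) marker positions ordered from palm to fingertip.
    A strongly bent finger (large angle between the proximal and distal link)
    is interpreted as a grab; a stretched finger releases the object.
    """
    a = markers[1] - markers[0]                 # proximal link
    b = markers[3] - markers[2]                 # distal link
    cosang = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    angle = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
    return angle > bend_threshold_deg

straight = np.array([[0, 0, 0], [0.045, 0, 0], [0.075, 0, 0], [0.095, 0, 0]])
print(is_grabbing(straight))   # False: a stretched finger does not grab
```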

² The frame rate is limited by the update rate of the camera.


Figure 10. Fusion of the real and virtual world

A video of the finger tracking in combination with a chess application can be downloaded from http://www.ims.tuwien.ac.at/pages/research/vr/fingertracker/.

With regard to the robustness of the tracking, a common source of problems for vision-based tracking systems is occlusion. In our tracking environment the cameras are positioned more or less orthogonal to each other. Thus, there is no need for our tracking system to detect four markers in each camera image (see figure 11).

Figure 11. Occlusions of markers (left and right camera image)

We are able to estimate the finger pose if one marker in each camera image plane is lost, or if two markers in one of the images are invisible. With our finger tracking system, interaction with virtual environments is performed wirelessly and more intuitively than with most other commercially available devices. One transiently lost marker causes no dramatic problems, but to allow two-handed interaction, more cameras are needed to remain robust with regard to occlusions.

7 Conclusion and Future Work

We have presented an optical finger tracking system with high precision for three-dimensional input in virtual environments. It allows spontaneous and intuitive interaction through gestures and is simple, cheap, fast and robust against occlusions. The proposed tracking system does not need any initialization for tracking, and there is no need to adapt the finger model to different users, since we are using an "exo-skeleton" model fixed to a glove. The system operates in a relatively unrestrained environment.

Until now, we have not done any evaluation with regard to ergonomics and comfort. However, from a practical point of view the marked glove fits different users as long as the cotton glove suits the user's hand. The "exo-skeleton" seems to disturb the user only in situations in which the sphere markers collide with other real objects. Therefore, the "exo-skeleton" should be designed using smaller sphere markers than in our current implementation. The tracking method can be extended to two-handed user input, but since occlusion is a well-known computer vision problem, multiple cameras will be necessary to resolve hand-hand occlusions. Future plans include investigating how to track the contours of the human hand without restrictions on the complexity of the background.

8 Acknowledgements

This research is supported by the European Communityunder contract no. FMRX-CT-96-0036. We would like tothank the members of the interactive media systems groupat the Vienna University of Technology and in particularGerhard Reitmayr for his support in creating an augmentedvideo using the Studierstube system.

References

[1] A. Azarbayejani and A. P. Pentland. Recursive Estimation of Motion, Structure, and Focal Length. IEEE PAMI 17(6), June 1995.
[2] A. Blake and M. Isard. 3D position, attitude and shape input using video tracking of hands and lips. In Proc. of SIGGRAPH '94.
[3] R. A. Bolt. "Put-That-There": Voice and gesture at the graphics interface. Computer Graphics (SIGGRAPH '80 Proceedings), Vol. 14, No. 3, July 1980, pp. 262-270.
[4] D. Bowman (Ed.). 3D User Interface Design: Fundamental Techniques, Theory, and Practice. SIGGRAPH 2000 course notes No. 36, New Orleans, ACM Press, August 2000.
[5] T. Brown and R. C. Thomas. Finger tracking for the Digital Desk. First Australasian User Interface Conference, AUIC 2000, pp. 11-16.
[6] R. Cipolla and N. J. Hollinghurst. A human-robot interface using pointing with uncalibrated stereo vision. In Computer Vision for Human-Machine Interaction, R. Cipolla and A. Pentland (Eds.), Cambridge University Press, 1998.
[7] R. Cipolla, Y. Okamoto, and Y. Kuno. Robust structure from motion using motion parallax. In Proc. of the Fourth International Conference on Computer Vision, April 1993, pp. 374-382.
[8] K. Dorfmüller and H. Wirth. Real-Time Hand and Head Tracking for Virtual Environments Using Infrared Beacons. In: N. Magnenat-Thalmann and D. Thalmann (Eds.), Modelling and Motion Capture Techniques for Virtual Environments, International Workshop CAPTECH '98, Geneva, Switzerland, November 1998, Proceedings, LNAI 1537, Springer, Heidelberg, 1998.
[9] K. Dorfmüller. An optical tracking system for VR/AR applications. In Proc. of the Eurographics Workshop on Virtual Environments '99, Springer-Verlag Wien New York, Vienna, Austria, 1999, pp. 33-42.
[10] R. O'Hagan and A. Zelinsky. Visual gesture interfaces for virtual environments. First Australasian User Interface Conference, AUIC 2000, pp. 73-80.
[11] C. Hand. A Survey of 3D Interaction Techniques. Computer Graphics Forum, Vol. 16, No. 5, 1997, pp. 269-281.
[12] T. Heap and D. Hogg. Towards 3-D Hand Tracking using a Deformable Model. 2nd International Face and Gesture Recognition Conference, 1996.
[13] N. Hedley, L. Postner, R. May, M. Billinghurst, and H. Kato. Collaborative AR for Geographic Visualization. Proc. Int'l Symposium on Mixed Reality, Yokohama, Japan, March 2001, to appear.
[14] T. S. Huang and V. I. Pavlovic. Hand gesture modeling, analysis, and synthesis. In Proc. of the IEEE International Workshop on Automatic Face and Gesture Recognition, Sept. 1995, pp. 73-79.
[15] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. Int. J. Computer Vision, 1998.
[16] K. Ishibuchi, H. Takemura, and F. Kishino. Real Time Hand Gesture Recognition using 3D Prediction Model. International Conference on Systems, Man, and Cybernetics, Vol. 5, Le Touquet, France, Oct. 1993, pp. 324-328.
[17] C. Jennings. Robust finger tracking with multiple cameras. In Proc. of the International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, 1999, pp. 152-160.
[18] A. Katkere, E. Hunter, D. Kuramura, J. Schlenzig, S. Moezzi, and R. Jain. ROBOGEST: Telepresence using Hand Gestures. Technical report VCL-94-104, Visual Computing Laboratory, University of California, San Diego, Dec. 1994.
[19] I. J. Ko and H. I. Choi. Extracting the hand region with the aid of a tracking facility. Electronic Letters 32(17), August 1996, pp. 1561-1563.
[20] W. Krueger, C. A. Bohn, B. Froehlich, H. Schueth, W. Strauss, and G. Wesche. The Responsive Workbench: A Virtual Work Environment. IEEE Computer, 28(7), 1995, pp. 42-48.
[21] J. J. Kuch and T. S. Huang. Vision based hand modeling and tracking for virtual teleconferencing and telecollaboration. Proc. of the Fifth International Conference on Computer Vision, 1995, pp. 666-671.
[22] F. Lathuiliere and J.-Y. Herve. Visual hand posture tracking in a gripper guiding application. In Proc. of the IEEE International Conference on Robotics and Automation, ICRA '00, Vol. 2, 2000, pp. 1688-1694.
[23] C. Maggioni. A novel gestural input device for virtual reality. Virtual Reality Annual International Symposium, IEEE, 1993, pp. 118-124.
[24] V. I. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human-computer interaction: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, July 1997, pp. 677-695.
[25] F. K. H. Quek. Toward a Vision-Based Hand Gesture Interface. Virtual Reality Software and Technology Conf., Aug. 1994, pp. 17-31.
[26] F. K. H. Quek. Eyes in the Interface. Image and Vision Computing, Vol. 13, Aug. 1995.
[27] R. Raskar, G. Welch, and W. Chen. Tabletop Spatially Augmented Reality: Bringing Physical Models to Life using Projected Imagery. Second Int. Workshop on Augmented Reality (IWAR '99), San Francisco, October 1999.
[28] J. Rehg and T. Kanade. DigitEyes: Vision-based Human Hand Tracking. Technical Report CMU-CS-TR-93-220, Carnegie Mellon University, 1993.
[29] M. Usoh, K. Arthur, M. Whitton, R. Bastos, A. Steed, M. Slater, and F. Brooks Jr. Walking > Walking-in-Place > Flying, in Virtual Environments. SIGGRAPH 1999, Computer Graphics Proceedings, Annual Conference Series, Addison Wesley Longman, Los Angeles, 1999, pp. 359-364.
[30] Y. Sato, Y. Kobayashi, and H. Koike. Fast tracking of hands and fingertips in infrared images for augmented desk interface. In Proc. of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000, pp. 462-467.
[31] D. Schmalstieg, A. Fuhrmann, and G. Hesina. Bridging Multiple User Interface Dimensions with Augmented Reality. Proceedings of the 3rd International Symposium on Augmented Reality (ISAR 2000), Munich, Germany, Oct. 5-6, 2000, pp. 20-30.
[32] J. Segen and S. Kumar. Human-computer interaction using gesture recognition and 3D hand tracking. In Proc. of the International Conference on Image Processing (ICIP 98), Vol. 3, 1998, pp. 188-192.
[33] A. Utsumi and J. Ohya. Multiple-hand-gesture tracking using multiple cameras. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, 1999.
[34] C. Ware and S. Osborne. Exploration and virtual camera control in virtual three dimensional environments. Proceedings of the 1990 Symposium on Interactive 3D Graphics (Snowbird, Utah), Computer Graphics Vol. 24, No. 2, March 1990, pp. 175-183.
[35] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time Tracking of the Human Body. Integration Issues in Large Commercial Media Delivery Systems, A. G. Tescher and V. M. Bove (Eds.), 1996.
[36] A. Wu, M. Shah, and N. Da Vitoria Lobo. A virtual 3D blackboard: 3D finger tracking using a single camera. In Proc. of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000, pp. 536-543.
[37] Z. Zhang and O. Faugeras. 3-D Dynamic Scene Analysis. Springer Series in Information Sciences, Springer-Verlag, Berlin, 1992.
