ASL Recognition Based on a Coupling Between HMMs and 3D Motion Analysis · 2001-12-03 · ASL...

An earlier version of this paper appeared in the proceedings of the International Conference on Computer Vision, pp. 363–369, Mumbai, India, January 4–7, 1998.

ASL Recognition Based on a Coupling Between HMMs and 3D Motion Analysis

Christian Vogler and Dimitris Metaxas
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104-6389
[email protected], [email protected]

Abstract

We present a framework for recognizing isolated and continuous American Sign Language (ASL) sentences from three-dimensional data. The data are obtained by using physics-based three-dimensional tracking methods and then presented as input to Hidden Markov Models (HMMs) for recognition. To improve recognition performance, we model context-dependent HMMs and present a novel method of coupling three-dimensional computer vision methods and HMMs by temporally segmenting the data stream with vision methods. We then use the geometric properties of the segments to constrain the HMM framework for recognition. We show in experiments with a 53-sign vocabulary that three-dimensional features outperform two-dimensional features in recognition performance. Furthermore, we demonstrate that context-dependent modeling and the coupling of vision methods and HMMs improve the accuracy of continuous ASL recognition.

1 Introduction

American Sign Language (ASL) is the primary mode of communication for many deaf people in the USA. It is a highly inflected language with sophisticated grammatical properties, which strongly constrain the order and appearance of signs. Because of the constraints, it provides an appealing test bed for understanding more general principles governing human motion and gesturing, including human-computer gesture interfaces. Such interfaces are essential in virtual reality applications, where the user must be able to manipulate virtual objects by gesturing.
A working ASL recognition system could also facilitate interaction of deaf people with their surroundings.

To date, most attempts at ASL recognition have either used only two-dimensional computer vision methods, or they have used other input devices, such as datagloves, instead of computer vision, to collect input from the signer [18, 3, 23]. In this paper we present a new approach to ASL recognition. First, we use computer vision methods to extract the three-dimensional parameters of a signer’s arm motions. We then use Hidden Markov Models (HMMs) to recognize isolated and continuous ASL utterances from the three-dimensional input. We develop context-dependent modeling of HMMs and methods for coupling the application of HMMs and the application of three-dimensional computer vision methods to improve continuous recognition performance. Our approach attempts to overcome some of the limitations of the previous approaches that use two-dimensional visual input, do not use context-dependent modeling, or do not couple computer vision methods with HMMs [18, 3, 17, 12].

Three-dimensional image-based shape and motion tracking of a human’s arm and hand is difficult because of the complexity of the motions and occlusion effects. Recently, a methodology has been developed [8, 10] that allows three-dimensional tracking of human motion from multiple images. In this paper we augment this methodology to track the three-dimensional motion of a subject’s arms and hands from multiple images. This method is based on the use of deformable models, whose shape and motion fit the given image sequences based on occluding contour information and theorems from projective geometry. The output of this method consists of the three-dimensional motion parameters of the subject’s arms. For efficiency reasons, and because arm movements already carry much of the information needed for recognizing ASL signs, we do not use the hand information in this paper.

Apart from obtaining accurate data, ASL recognition is difficult because there are always statistical variations in the way humans perform motions, even with identical meaning. In addition, in continuous utterances, there are no clear boundaries between individual signs. HMMs provide a framework for capturing statistical variations in both position and duration of the movement, as well as implicit segmentation of the input stream. Furthermore, continuous recognition is complicated by coarticulation effects, that is, the pronunciation¹ of a sign is influenced by the preceding and following signs. Coarticulation effects can be partly alleviated by training context-dependent HMMs.

The theory behind HMMs makes several assumptions that are often not valid in practice. For this reason, we develop a new approach that couples computer vision methods with HMM modeling. It is based on a temporal segmentation process that operates by extracting geometric properties of the three-dimensional computer vision parameters. These properties are obtained independently from the HMM algorithms and are used to impose additional constraints on HMM-based recognition.

¹ By “pronunciation” we mean motion. We follow the terminology of spoken language linguistics where applicable.

To test our algorithms and assumptions, we performed a series of experiments based on a vocabulary consisting of 53 different signs that make extensive use of space. We experimented with both isolated and continuous ASL recognition for both three-dimensional and two-dimensional data. As HMMs require large amounts of training data and the computer vision process is computationally expensive, we used data from an Ascension Technologies Flock of Birds and computer vision processes interchangeably.

Our goal is to discover and analyze a usable framework for both isolated and particularly continuous ASL recognition. We do not address more general gesture recognition topics and signer independence in this paper. Neither do we address the involved aspects of ASL linguistics [19] at this point, but obviously, a viable future ASL recognition system should be able to handle them.

In the following sections, we discuss related work and give an overview of the theory behind the vision methods and HMMs. Afterward, we address the use of HMMs for isolated and continuous ASL recognition, and coupling computer vision processes with the HMM algorithms. Finally, we outline data collection and provide experimentation results for isolated and continuous recognition and the coupling of computer vision and HMMs.

2 Previous Work

Previous work on sign language recognition focuses primarily on fingerspelling recognition and isolated sign recognition. Some work uses neural networks [3, 22]. For this work to apply to continuous ASL recognition, the problem of explicit temporal segmentation must be solved, which is a limitation that HMM-based recognition does not have. Mohammed Waleed Kadous [23] uses PowerGloves to recognize a set of 95 isolated Auslan signs with 80% accuracy, with an emphasis on computationally inexpensive methods. Kirsti Grobel and Marcell Assam [4] use HMMs to recognize isolated signs with 91.3% accuracy out of a 262-sign vocabulary. They extract the features from video recordings of signers wearing colored gloves.

There is very little previous work on continuous ASL recognition. Thad Starner and Alex Pentland [18] use a view-based approach to extract two-dimensional features as input to HMMs with a 40-word vocabulary. Yanghee Nam and KwangYoen Wohn [12] use three-dimensional data as input to HMMs for continuous recognition of a very small set of gestures.

3 Model-based Tracking of a Human’s Arms

In this section we give a brief overview of our formulation that allows three-dimensional arm shape and motion estimation from multiple images [6, 7, 8, 10].

Our approach consists of two parts. The first part [6, 7] consists of an active, integrated approach that reliably identifies the parts of a moving articulated object and estimates their shape and motion from a controlled set of motions that reveal the object’s structure. We use the algorithm developed in [6, 7], which segments the apparent body contour of a moving human into the constituent parts. Initially, a single deformable model is used to fit the image data. As the model deforms to fit the subsequent image contours, which change as the human moves, a novel Human Body Part Identification Algorithm (HBPIA) identifies all the body parts. By applying the HBPIA iteratively over the subsequent frames, all the moving parts are identified. In addition, we have extended this algorithm to allow the estimation of the three-dimensional shape of a subject’s body parts, based on the integration of images taken from three orthogonally placed cameras. We used this methodology to estimate the three-dimensional shape of the subject’s arms shown in the examples in Section 7. It is worth noting that we have recovered the lower arm and the hand as one part, since in our ASL recognition experiments we did not use the motion of the lower arm and the hand relative to each other.

The second part of the algorithm consists of using the extracted three-dimensional shape of the arm to track the three-dimensional position and orientation of a subject’s body parts [8]. To alleviate difficulties arising from occlusion and degenerate views during the unconstrained movement of the arm, we use three calibrated cameras placed in a mutually orthogonal configuration. At every image frame and for each body part, we derive a subset of the cameras that provide the most informative views for tracking. This active and time-varying selection is based on the visibility of a part and the observability of its predicted motion from a certain camera. Once a set of cameras has been selected to track each part, we use concepts from projective geometry to relate points on the occluding contour to points on the three-dimensional shape model. Using a physics-based modeling approach, we transform this correspondence, in addition to two-dimensional forces arising from the discrepancy between the model’s occluding contour and the image data, into generalized forces that are applied to the model to estimate the model’s translational and rotational degrees of freedom. To improve the tracking results further, the dynamic system is embedded within an extended Kalman filter framework, and we use the predicted motion of the model at each frame to establish point correspondences between occluding contours and the three-dimensional model.

We used this two-step approach to track the motion of the subject’s arms performing the ASL gestures, as shown in Section 7. The output of the system is a set of rotation and translation parameters that we use as input to the HMMs and the vision-based segmentation algorithm presented in the following sections.

4 Hidden Markov Models

Hidden Markov Models (HMMs) are a type of statistical model. They have been used successfully in speech recognition, and recently in handwriting, gesture, and sign language recognition. We now give a summary of the basic theory behind HMMs, which is covered in detail in [15].

4.1 Definition of HMMs

An HMM consists of a number $N$ of states $S_1, S_2, \ldots, S_N$, together with transitions between states. The system is in one of the HMM’s states at any given time. At regularly spaced discrete time intervals, the system takes an outgoing transition from its current state to a new state.

Each transition from $S_i$ to $S_j$ has an associated probability $a_{ij}$ of being taken. Hence, $\sum_{j=1}^{N} a_{ij} = 1$. Each state $S_i$ also has an initial probability $\pi_i$ of the system starting in $S_i$. In addition, each state $S_i$ generates output $O \in \Omega$, which is distributed according to a probability distribution function $b_i(O) = P\{\text{Output is } O \mid \text{System is in } S_i\}$. An example is given in Figure 1. The model depicted there is also an example of a left-right model; that is, $a_{ij} > 0$ implies $j \geq i$. In other words, transitions only flow forward from lower states to the same state or higher states, but never backward. This topology is the most commonly used one for modeling processes over time.

[Figure 1 diagram: five states S1–S5 with transitions a11, a12, a23, a24, a34, a45, a55 and output probabilities b1–b5.]

Figure 1: Example left-right HMM with its transition and output probabilities. “Left-right” means that transitions occur only from left to right, and never backward.
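The left-right constraint of Section 4.1 can be illustrated with a minimal Python sketch that builds a transition matrix in which $a_{ij} > 0$ implies $j \geq i$. The function name `left_right_transitions`, the `max_skip` parameter, and the uniform starting values are assumptions made for this illustration and are not taken from the paper.

```python
def left_right_transitions(num_states, max_skip=1):
    """Build a row-stochastic left-right transition matrix.

    State i may move to states i, i+1, ..., i+max_skip (clipped at the last
    state); all permitted transitions in a row start out equally likely.
    """
    a = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        last = min(i + max_skip, num_states - 1)
        width = last - i + 1
        for j in range(i, last + 1):
            a[i][j] = 1.0 / width
    return a


# Five states as in Figure 1; max_skip=2 permits arcs such as a24.
a = left_right_transitions(5, max_skip=2)
# Every row sums to 1, and no transition flows backward.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in a)
assert all(a[i][j] == 0.0 for i in range(5) for j in range(i))
```

The zero entries below the diagonal stay zero under Baum-Welch reestimation, so the topology chosen at initialization is preserved during training.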

4.2 The Three Fundamental HMM Problems

There are three fundamental problems in HMM theory:

(1) For a sequence of observations $O = O_1, O_2, \ldots, O_T$, $O_t \in \Omega$, compute the probability $P(O \mid \lambda)$ that an HMM $\lambda$ generated $O$.

(2) For some $O$ and an HMM $\lambda$, recover the most likely state sequence $Q_1, Q_2, \ldots, Q_T$ that generated $O$.

(3) Adjust the parameters of an HMM $\lambda$ such that they maximize $P(O \mid \lambda)$ for some $O$.

The first problem corresponds to maximum likelihood recognition of an unknown data sequence with a set of HMMs, each of which corresponds to a sign. For each HMM, the probability $P(O \mid \lambda)$ that it generated the unknown sequence is computed, and then the HMM with the highest probability is selected as the recognized sign. For computing $P(O \mid \lambda)$, let $Q = Q_1, Q_2, \ldots, Q_T$ be a state sequence in $\lambda$:

$$\alpha_t(i) = P(O_1, O_2, \ldots, O_t, Q_t = S_i \mid \lambda), \quad 1 \leq i \leq N, \tag{1}$$

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i), \tag{2}$$

$$\alpha_1(i) = \pi_i \, b_i(O_1), \tag{3}$$

$$\alpha_{t+1}(i) = b_i(O_{t+1}) \sum_{j=1}^{N} \alpha_t(j) \, a_{ji}, \quad 1 \leq t \leq T-1. \tag{4}$$

These equations assume that the $O_t$ are independent, and they make the Markov assumption that a transition depends only on the current state, a fundamental limitation of HMMs. This method is the forward pass of the forward-backward algorithm and computes $P(O \mid \lambda)$ in $O(N^2 T)$ time.
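Equations 1–4 translate almost line by line into code. The following is a minimal Python sketch, assuming discrete output symbols for brevity (the paper itself uses continuous Gaussian output densities). The names `forward_probability` and `recognize`, the table `b[i][o]`, and the toy models are hypothetical and serve only as illustration.

```python
def forward_probability(pi, a, b, obs):
    """Return P(O | lambda) using Equations 1-4.

    pi[i]   : initial probability of state i
    a[i][j] : transition probability from state i to state j
    b[i][o] : probability that state i emits the discrete symbol o
    obs     : observation sequence O_1, ..., O_T as symbol indices
    """
    n = len(pi)
    # Equation 3: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    # Equation 4: alpha_{t+1}(i) = b_i(O_{t+1}) * sum_j alpha_t(j) * a_ji
    for o in obs[1:]:
        alpha = [b[i][o] * sum(alpha[j] * a[j][i] for j in range(n))
                 for i in range(n)]
    # Equation 2: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)


def recognize(models, obs):
    """Pick the name of the model with the highest P(O | lambda)."""
    return max(models, key=lambda name: forward_probability(*models[name], obs))


# Two toy left-right models (pi, a, b) that differ only in their outputs.
transitions = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "A": ([1.0, 0.0], transitions, [[0.9, 0.1], [0.1, 0.9]]),
    "B": ([1.0, 0.0], transitions, [[0.1, 0.9], [0.9, 0.1]]),
}
best = recognize(models, [0, 0, 1, 1])  # selects "A"
```

For long sequences the products underflow, so practical implementations typically rescale `alpha` at every frame or work with log probabilities; the sketch omits this for clarity.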

The second problem corresponds to finding the most likely path $Q$ through an HMM $\lambda$, given an observation sequence $O$, and is equivalent to maximizing $P(Q, O \mid \lambda)$. Let

$$\delta_t(i) = \max_{Q_1, \ldots, Q_{t-1}} P(Q_1, Q_2, \ldots, Q_t = S_i, O_1, \ldots, O_t \mid \lambda), \tag{5}$$

$$\delta_{t+1}(i) = b_i(O_{t+1}) \max_{1 \leq j \leq N} \{ \delta_t(j) \, a_{ji} \}, \tag{6}$$

$$\max_{Q} P(Q, O \mid \lambda) = \max_{1 \leq i \leq N} \{ \delta_T(i) \}. \tag{7}$$

$\delta_t(i)$ corresponds to the maximum probability of all state sequences that end up in $S_i$ at time $t$. Equations 6 and 7 follow from Equation 5 by induction on $t$. The Viterbi algorithm is a dynamic programming algorithm that, using Equation 7, computes both the maximum probability $P(Q, O \mid \lambda)$ and the state sequence $Q$ in $O(N^2 T)$ time.
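The recursion of Equations 5–7 can be sketched in a few lines of Python. As before, the sketch assumes discrete output symbols, and the function name `viterbi` and the backpointer bookkeeping are illustrative choices rather than the authors' implementation.

```python
def viterbi(pi, a, b, obs):
    """Return (max_Q P(Q, O | lambda), best state sequence) via Equations 5-7."""
    n = len(pi)
    # Initialization (the t = 1 case of Equation 5): delta_1(i) = pi_i * b_i(O_1)
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        # Equation 6, also remembering which predecessor j achieved the maximum.
        best_prev = [max(range(n), key=lambda j: delta[j] * a[j][i])
                     for i in range(n)]
        delta = [b[i][o] * delta[best_prev[i]] * a[best_prev[i]][i]
                 for i in range(n)]
        backpointers.append(best_prev)
    # Equation 7: pick the best final state, then trace the path backward.
    state = max(range(n), key=lambda i: delta[i])
    best_prob = delta[state]
    path = [state]
    for best_prev in reversed(backpointers):
        state = best_prev[state]
        path.append(state)
    path.reverse()
    return best_prob, path
```

With `pi = [1.0, 0.0]`, `a = [[0.5, 0.5], [0.0, 1.0]]`, `b = [[0.9, 0.1], [0.1, 0.9]]` and `obs = [0, 0, 1, 1]`, the recovered path is `[0, 0, 1, 1]`: the model stays in the first state while symbol 0 is observed and switches to the second state when symbol 1 appears.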

The recovery of the state sequence makes the Viterbi algorithm invaluable for continuous recognition, since it bypasses the difficult problem of segmenting an utterance into its individual parts. Instead, a sequence of HMMs corresponding to individual signs is concatenated into a network, as schematically depicted in Figure 2. Thus, the most likely state sequence recovers the sequence of signs.

[Figure 2 diagram: labels Initial state, HMM 1, HMM 2, ..., HMM n, Final state.]

Figure 2: Concatenation of HMMs into a network

The Viterbi algorithm also has the property that it can be optimized with the beam-searching algorithm. While updating $\delta_{t+1}(i)$, this optimization considers only those states $S_j$ in the HMM network for which $\delta_t(j)$ is above a threshold value. The assumption is that if the probability of a partial path through the network becomes too low, it cannot contribute to the most likely path. Beam-searching is essential for making large-scale applications tractable.
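The pruning step can be illustrated with a small Python sketch. The text only states that scores must lie above a threshold; setting that threshold relative to the best score in the current frame, as done here through the hypothetical `beam_width` parameter, is one common choice and is an assumption of this sketch.

```python
def beam_prune(delta, beam_width):
    """Zero out partial-path scores that fall below a threshold.

    delta      : list of delta_t(j) values for the current frame
    beam_width : ratio in (0, 1]; states scoring below
                 beam_width * max(delta) are dropped (set to 0.0)
    """
    threshold = beam_width * max(delta)
    return [d if d >= threshold else 0.0 for d in delta]


def active_states(delta):
    """Indices of states that survive pruning and still need expansion."""
    return [j for j, d in enumerate(delta) if d > 0.0]


pruned = beam_prune([0.2, 0.001, 0.05], beam_width=0.1)
survivors = active_states(pruned)  # states 0 and 2 remain
```

Only the surviving states need to be considered as predecessors in Equation 6, which is where the savings come from in a large network.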

The third problem corresponds to training the HMMs with data, such that they are able to recognize previously unseen data correctly after the training phase. There exists no analytical solution for maximizing $P(O \mid \lambda)$ for given observation sequences, but an iterative procedure, called the Baum-Welch procedure, maximizes $P(O \mid \lambda)$ locally. In the case of continuous density output probabilities, the reestimation process works as follows.

Define $b_j(O)$ as $b_j(O) = \sum_{k=1}^{M} c_{jk} \, \mathcal{N}(O, \mu_{jk}, \Sigma_{jk})$, where $M$ describes the number of mixtures, $j$ is the state number, $c_{jk}$ describes the weight of mixture $k$ in state $j$, and $\mathcal{N}$ is a Gaussian density with mean $\mu$ and covariance matrix $\Sigma$. Define the backward variable $\beta$ as

$$\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T \mid Q_t = S_i, \lambda), \tag{8}$$

$$\beta_T(i) = 1, \tag{9}$$

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j), \tag{10}$$

$$1 \leq i \leq N, \quad 1 \leq t \leq T-1. \tag{11}$$

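Furthermore, define $\xi$ and $\gamma$ as

$$\xi_t(i, j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j)}{P(O \mid \lambda)}, \tag{12}$$

$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j). \tag{13}$$

$\sum_t \xi_t(i, j)$ can be interpreted as the expected number of transitions from $S_i$ to $S_j$; likewise $\sum_t \gamma_t(i)$ can be interpreted as the expected number of transitions taken from $S_i$. With these interpretations, and with $\gamma_t(j, k)$ denoting the probability of being in state $S_j$ at time $t$ with mixture $k$ accounting for $O_t$ [15], the reestimation formulae for the transitions and output probabilities are

$$\bar{\pi}_i = \gamma_1(i), \tag{14}$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \tag{15}$$

$$\bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j, k)}, \tag{16}$$

$$\bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k) \, O_t}{\sum_{t=1}^{T} \gamma_t(j, k)}, \tag{17}$$

$$\bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k) \, (O_t - \mu_{jk})(O_t - \mu_{jk})^{\top}}{\sum_{t=1}^{T} \gamma_t(j, k)}. \tag{18}$$

Repeated use of this procedure converges to a local maximum of $P(O \mid \lambda)$ [15], typically after 5–10 iterations.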
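To make the reestimation step concrete, the following Python sketch computes the backward variable and applies Equations 12–15 to reestimate the initial and transition probabilities from a single observation sequence. It assumes discrete output symbols and leaves the output probabilities untouched, so the mixture updates of Equations 16–18 are not covered; all function names are hypothetical.

```python
def forward_table(pi, a, b, obs):
    """alpha[t][i] for all frames (Equations 3 and 4)."""
    n = len(pi)
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([b[i][o] * sum(prev[j] * a[j][i] for j in range(n))
                      for i in range(n)])
    return alpha


def backward_table(a, b, obs):
    """beta[t][i] for all frames (Equations 9 and 10)."""
    n = len(a)
    beta = [[1.0] * n]                      # Equation 9: beta_T(i) = 1
    for o in reversed(obs[1:]):             # Equation 10, for t = T-1, ..., 1
        nxt = beta[0]
        beta.insert(0, [sum(a[i][j] * b[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def reestimate_transitions(pi, a, b, obs):
    """One Baum-Welch update of pi and a (requires at least two frames)."""
    n, num_frames = len(pi), len(obs)
    alpha = forward_table(pi, a, b, obs)
    beta = backward_table(a, b, obs)
    prob = sum(alpha[-1])                   # Equation 2
    # Equation 12: xi_t(i, j) for t = 1, ..., T-1
    xi = [[[alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / prob
            for j in range(n)] for i in range(n)]
          for t in range(num_frames - 1)]
    # Equation 13: gamma_t(i) = sum_j xi_t(i, j)
    gamma = [[sum(xi[t][i]) for i in range(n)] for t in range(num_frames - 1)]
    new_pi = gamma[0]                       # Equation 14
    new_a = []
    for i in range(n):                      # Equation 15
        denom = sum(gamma[t][i] for t in range(num_frames - 1))
        # A state that is never visited keeps its previous transition row.
        new_a.append([sum(xi[t][i][j] for t in range(num_frames - 1)) / denom
                      if denom > 0.0 else a[i][j] for j in range(n)])
    return new_pi, new_a
```

Because each update cannot decrease $P(O \mid \lambda)$, calling `reestimate_transitions` repeatedly gives the iterative behavior described above; transitions that start at zero, such as the backward arcs of a left-right model, remain zero.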

5 Use of HMMs for ASL Recognition

In the previous sections we reviewed the extraction of three-dimensional features from computer vision and the HMM theory. We now discuss how they fit in the framework of ASL recognition.

HMMs are an attractive choice for processing three-dimensional sign data, because their state-based nature enables them to describe how a sign changes over time and to capture variations in the duration of signs, by remaining in a state for several time frames.

There are two ways to approach the recognition problem that pose very different research problems. Isolated recognition attempts to recognize one single sign at a time. Hence, it is based on the assumption that each sign can be individually extracted and then individually recognized.

Continuous recognition, on the other hand, attempts to recognize an entire stream of signs, without any artificial pauses or any other form of marked boundaries between the individual signs. Clearly, continuous recognition is desirable for the most natural interaction possible between humans and machines, but it is also much more difficult to tackle than isolated recognition. The next two subsections discuss each of the two approaches in detail.

5.1 Isolated Recognition

Isolated sign recognition assumes that each sign can be extracted individually. This requires clearly marked boundaries between signs. Such a boundary could simply be silence, that is, a brief resting phase after each sign, during which the signer performs no movements. Silence is easily detected through an analysis of the global variance over the hand movements.
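One way to read the variance-based silence detection is sketched below in Python: the variance of the hand position is computed over a short sliding window, and windows whose variance stays below a threshold are treated as silence. The window size, the threshold, and the function names are assumptions for illustration; the paper does not specify how the variance analysis was carried out.

```python
def window_variance(frames, start, size):
    """Total variance of the coordinates over frames[start:start + size]."""
    window = frames[start:start + size]
    dims = len(window[0])
    total = 0.0
    for d in range(dims):
        values = [f[d] for f in window]
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values) / len(values)
    return total


def segment_by_silence(frames, size=3, threshold=1e-4):
    """Return (start, end) window ranges, end exclusive, of non-silent runs."""
    moving = [window_variance(frames, t, size) > threshold
              for t in range(len(frames) - size + 1)]
    segments, start = [], None
    for t, is_moving in enumerate(moving):
        if is_moving and start is None:
            start = t                      # motion begins
        elif not is_moving and start is not None:
            segments.append((start, t))    # motion ends, silence resumes
            start = None
    if start is not None:
        segments.append((start, len(moving)))
    return segments
```

The returned indices refer to window start positions, so a segment boundary is only accurate to within one window length; each extracted segment would then be passed to the maximum likelihood recognizer.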

Once there are clearly marked boundaries between signs, HMM recognition is comparatively straightforward. The recognition process extracts the signal corresponding to each sign individually. It then picks the HMM that yields the maximum likelihood for that signal as the recognized sign.

Training the HMMs to maximize recognition performance is also comparatively straightforward. Initially, all signs in the training set are labeled. For each sign in the dictionary, the training procedure then computes the mean and covariance matrix over the data available for that sign and assigns them uniformly as the initial output probabilities to all states in the corresponding HMM. It also assigns initial transition probabilities uniformly to the HMM’s states. Unlike the initial output probabilities, initial transition probabilities do not influence the performance of the fully trained HMMs greatly.

The training procedure then runs the Viterbi algorithm repeatedly on the training samples, so as to align the training data along the HMM’s states. The aligned data are then used to estimate better output probabilities for each state individually. This realignment yields major improvements in recognition performance, because it increases the chances of the Baum-Welch reestimation algorithm converging to an optimal or a near-optimal maximum. After constructing these bootstrapped HMMs, the training procedure finishes by reestimating each HMM in turn with the Baum-Welch reestimation algorithm outlined in Section 4.2.

By far the most challenging problem in isolated recognition is extracting a feature vector that optimizes recognition performance. Even after obtaining accurate three-dimensional data from our computer vision method described in Section 3, we found that the features used for recognition, and the way that they are represented, greatly influence recognition performance. The experimental results given in Section 8.1 demonstrate how the feature vector affects performance.

There are several reasons why performance is so sensitive to the choice of feature vector. First, some features carry more information than others; for example, three-dimensional features are more reliable than two-dimensional ones. Second, some features are more invariant to changes in orientation and position than others; for example, polar coordinates are more invariant to rotations than Cartesian coordinates [1]. Third, the statistical properties of some features change, depending on the duration of a sign. For this reason, the positions of the hands in three-dimensional space perform better than the velocities of the hands (see also Section 8.2). Fourth, the statistical distribution of the features during the course of a sign seems to play a role. For some features, their distribution fits Gaussian densities naturally, whereas for others it does not.

If the latter explanation holds true, we should see a major improvement in recognition performance from using mixtures of multiple Gaussians as the output probabilities for HMMs, instead of using just one single Gaussian density. However, we did not experiment with multiple mixtures because of the lack of sufficient training data.
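The second point above, that polar coordinates are more invariant to rotations than Cartesian ones, can be checked with a small planar example in Python. The two-dimensional setting and the helper names are assumptions made for brevity; the paper works with three-dimensional data.

```python
import math


def to_polar(x, y):
    """Convert planar Cartesian coordinates to (radius, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)


def rotate(x, y, angle):
    """Rotate the point (x, y) about the origin by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x - s * y, s * x + c * y


x, y = 3.0, 4.0
radius, theta = to_polar(x, y)
rotated_radius, rotated_theta = to_polar(*rotate(x, y, math.radians(30)))
# Both Cartesian coordinates change under the rotation, whereas the radius
# is unchanged and the angle shifts by a constant 30 degrees.
```

A rotation of the signer therefore disturbs only one polar component, and by a constant offset, while it mixes all Cartesian components.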

The number of states and the topology used for the HMMs are also important. Sign language as a time-varying process lends itself naturally to a left-right model topology. Finding the optimum number of states, which depends on the frame rate and on the complexity of the signs involved, is an empirical process. We used the same model topology for all signs, and determined experimentally that for our task a model with 9 states, depicted in Figure 3, was sufficient. The output probabilities were single Gaussian densities with diagonal covariance matrices, because we had insufficient training data for multiple mixtures.

Figure 3: Left-right HMM topology for isolated ASL recognition.

5.2 Continuous Recognition

Continuous sign recognition is much harder than isolated sign recognition. There is no silence between the signs, so the straightforward method of using silence to distinguish boundaries fails. Here HMMs offer the compelling advantage of being able to segment the streams of signs automatically with the Viterbi algorithm. Coarticulation effects further complicate continuous recognition. We now discuss them in detail, before we describe the techniques needed to train HMMs for continuous recognition.

5.2.1 The Coarticulation Problem

Coarticulation means that the pronunciation of a sign is influenced by the preceding and following signs. One of the most visible effects of coarticulation in ASL is that a wide range of movements are inserted between signs.

For example, the sign for “FATHER” is performed by repeatedly tapping the forehead, and the sign for “READ” is performed in neutral space in front of the chest. If these two signs are performed in succession, an extra movement from the forehead to neutral space appears (Figure 4). This phenomenon is called movement epenthesis [5]. We discuss its implications for ASL recognition more thoroughly in [20].

Figure 4: Movement epenthesis. The arrow in the middle picture indicates an extra movement between the signs for “FATHER” and “READ” that is not present in their lexical forms.


Speech recognizers handle coarticulation by training phoneme context-dependent HMMs. They train a separate model for each possible combination of three phonemes in sequence that could occur during natural speech. In principle, the same idea applies to sign language recognition, and we performed some experiments to verify its applicability (see Section 8.3).

A possible way to train context-dependent models for ASL recognition is to use whole signs as the phonological unit in ASL.² Thus, triphone context-dependent models from speech recognition correspond to tri-sign context-dependent models in ASL recognition. In other words, a separate model is trained for each combination of three signs in sequence. The first and the third sign in the sequence form the context for the middle sign, with which the model is associated.

Tri-sign context-dependent modeling, however, is prohibitively expensive, because it requires $O(V^3)$ models overall, where $V$ is the vocabulary size. Collecting the large amount of training data necessary to obtain reliable estimates for the models is intractable even for small vocabulary sizes. This intractability is a negative consequence of using whole signs as the phonological unit. Unlike for speech recognition, which has to handle only approximately 40 classes of allophones, there is no upper bound on the number of models required for ASL recognition with whole signs as the smallest unit.

Therefore, we used only bi-sign context-dependent models, which require a model for every possible combination of two signs. The model is associated with the second sign, and the first sign forms its preceding context.

Bi-sign context-dependent modeling requires $O(V^2)$ models. Although this complexity is an improvement over $O(V^3)$, it is still too large for anything but a small vocabulary. Speech recognizers reduce the number of models required by using the observation that many contexts are very similar. Therefore, they tie the parameters of the models corresponding to similar contexts, such that the transition and output probabilities are shared between these models. This technique significantly reduces the number of distinct models.
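To give a sense of scale, the growth rates above can be evaluated for the 53-sign vocabulary used in the experiments. The figures below are upper bounds under the assumption that every combination of signs is modeled; they ignore sentence-boundary contexts and any grammatical restrictions on which sign sequences actually occur.

```latex
\begin{align*}
\text{context-independent:} \quad & V = 53 \\
\text{bi-sign:} \quad & V^2 = 53^2 = 2809 \\
\text{tri-sign:} \quad & V^3 = 53^3 = 148\,877
\end{align*}
```

Even at this small vocabulary size, moving from bi-sign to tri-sign contexts multiplies the number of models, and hence the training data required, by a factor of 53.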

Parameter tying is also applicable to ASL recognition, but it is not as effective as for speech recognition. The main reason for the reduced effectiveness is that movement epenthesis inserts many movements unrelated to the signs’ lexical forms. The implication is that context-dependent models will work well only with prohibitively large amounts of training data.

In fact, it is questionable whether context-dependent modeling is a good solution to the coarticulation problem in ASL recognition at all. Movement epenthesis is a phonological process in ASL and should be treated as such; that is, the movements induced by epenthesis are separate phonemes. Using context-dependent models to capture them is implausible from a phonological point of view. It seems to make more sense to model the movements explicitly. We follow up on this idea in [20] and show that it leads to better recognition performance.

² This assumption is not correct: Whole signs are not the smallest unit in ASL phonology, but this topic is beyond the scope of this paper.

5.2.2 The Training ProcedureA sign in our datacollectedat naturalsigningspeeds

was between10 and 45 frames long, not counting theframesneededfor the transitionbetweensigns. Becauseof themovementsbetweensigns,theHMM topologymustbemoreflexible thantheonedescribedfor isolatedrecog-nition in Section5.1. Theseconsiderationsled usto usingtheleft-right modelshown in Figure5.

Figure 5: Topology of the context-dependent model. The arcs that skip states allow the modeling of variabilities in the duration of different signs.

As for isolated recognition, we determined the optimal number of states experimentally. For the output probabilities, we chose a single Gaussian density with diagonal covariance, as we had insufficient training data for estimating full-rank covariance matrices.

Training continuous recognition models is much harder than training isolated recognition models, because it is difficult to obtain good initial estimates of the HMM parameters. Viterbi realignment (see Section 5.1) works only if the training data is accurately labeled, including the boundaries between the individual signs. Obtaining these boundaries is very difficult and time-consuming; even humans have trouble determining where a sign ends and the next one starts.

The alternative to using Viterbi realignment is using a flat-start scheme. It consists of computing the global mean and covariance matrix over the entire training data set and assigning these as the initial output probabilities to the HMMs. We used this scheme to initialize the HMMs.
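The flat-start initialization can be sketched in a few lines. The following is a minimal illustration, not the HTK implementation: it assumes a diagonal covariance (consistent with the output densities chosen above), and the function names `flat_start` and `init_models`, the dictionary layout of a model, and the toy data are all hypothetical.

```python
from statistics import fmean


def flat_start(sequences):
    """Return the global per-dimension mean and variance over all frames."""
    frames = [frame for seq in sequences for frame in seq]
    dims = len(frames[0])
    mean = [fmean(f[d] for f in frames) for d in range(dims)]
    var = [fmean((f[d] - mean[d]) ** 2 for f in frames) for d in range(dims)]
    return mean, var


def init_models(signs, num_states, mean, var):
    """Give every state of every sign HMM the same initial Gaussian."""
    return {
        sign: [{"mean": list(mean), "var": list(var)} for _ in range(num_states)]
        for sign in signs
    }


# Toy training set: two sequences of two-dimensional feature frames.
training = [
    [(0.0, 1.0), (2.0, 3.0)],
    [(4.0, 5.0), (6.0, 7.0)],
]
mean, var = flat_start(training)
models = init_models(["FATHER", "I"], 3, mean, var)
```

Because every state starts from identical parameters, the models only become distinct once reestimation uses the sentence transcriptions.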

We then used embedded training [24] to reestimate the HMMs. Each iteration of this procedure concatenates the HMMs corresponding to the individual signs in a training sentence into a single large HMM. It then reestimates the parameters of the large HMM with a single iteration of the Baum-Welch algorithm described in Section 4.2, as usual. The reestimated parameters, however, are not immediately applied to the individual HMMs. Instead, they are pooled in accumulators, and applied to the individual HMMs only after the training procedure has iterated over all sentences in the training set.
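The pooling step can be illustrated with a simplified sketch. It only shows the accumulator idea for a scalar Gaussian mean; the state occupancies that Baum-Welch would compute on the concatenated sentence HMM are supplied here as made-up numbers, and the class and function names are hypothetical.

```python
class GaussianMeanAccumulator:
    """Pools occupancy-weighted statistics for one HMM state."""

    def __init__(self):
        self.weighted_sum = 0.0
        self.occupancy = 0.0

    def add(self, occupancy, observation):
        self.weighted_sum += occupancy * observation
        self.occupancy += occupancy

    def reestimate(self, old_mean):
        # Keep the old value if the state was never visited.
        if self.occupancy > 0:
            return self.weighted_sum / self.occupancy
        return old_mean


def embedded_training_pass(means, sentence_statistics):
    """One pass over all sentences; parameters change only at the end."""
    accumulators = {state: GaussianMeanAccumulator() for state in means}
    for statistics in sentence_statistics:  # one entry per training sentence
        for state, occupancy, observation in statistics:
            accumulators[state].add(occupancy, observation)
    return {state: accumulators[state].reestimate(m) for state, m in means.items()}


# (sign, state index) -> current mean; the values are illustrative only.
means = {("FATHER", 0): 0.0, ("I", 0): 0.0, ("MAIL", 0): 7.0}
stats = [
    [(("FATHER", 0), 1.0, 2.0), (("I", 0), 0.5, 4.0)],
    [(("FATHER", 0), 1.0, 4.0), (("I", 0), 0.5, 8.0)],
]
new_means = embedded_training_pass(means, stats)
```

Deferring the update means that every sentence is aligned against the same model parameters within one pass.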

Hence, embedded training effectively trains all models in parallel with the entire training set. It yields better parameter estimates than training the HMMs independently [24].

In the case of context-independent models, using the flat-start scheme followed by several embedded training runs is all that is necessary to train HMMs for recognition. Context-dependent models are more difficult to train than context-independent models, because the training involves two extra steps. These consist of generating the context-dependent models, and tying the parameters of HMMs with similar contexts (see also Section 5.2.1).

The first extra step, which consists of generating the context-dependent models, requires care, because for context-dependent models there exist far fewer training examples per model than for context-independent models. In this case, embedded training is likely to yield the best parameter estimates for context-dependent models if they have already been initialized with better values than the global mean and covariance matrix from the flat-start scheme.

Therefore, we ran several embedded training runs on the context-independent models and then generated context-dependent models with the same parameters as the context-independent models. It is vital to avoid overtraining the context-independent models by keeping the number of initial training passes low. The probabilities should not have fully converged yet. Otherwise, using context-dependent models actually decreases recognition performance.

The second extra step, which consists of tying the parameters, is also vital to the context-dependent models' performance, especially because of our relative lack of training data. Tying parameters reduces the number of models, as signs with similar contexts then share a common model. As a result, more training data per model becomes available.

Unfortunately, parameter tying is a highly empirical process. Our experiments indicated that tying the transition probabilities properly had the greatest influence on recognition results. We used the ending locations of the signs in the preceding context to decide on the tying. For example, the signs for "BROTHER" and "SISTER" end in the same location. As a result, the two models for a sign occurring after the signs for "BROTHER" or "SISTER," such as "LIKE," can share the same transition probabilities. We also used the ending locations to decide on tying the output probabilities. For our data set, the tying process reduced the number of models to less than one sixth of their original number.
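The tying criterion can be sketched as a grouping of bi-sign models by the ending location of the preceding sign. This is only an illustration: the location labels in `ENDING_LOCATION` are invented for the example (the text states only that "BROTHER" and "SISTER" end in the same location), and `tie_models` is a hypothetical helper, not part of HTK.

```python
from collections import defaultdict

# Hypothetical ending locations; only the BROTHER/SISTER match is from the text.
ENDING_LOCATION = {
    "BROTHER": "location-1",
    "SISTER": "location-1",
    "FATHER": "location-2",
    "I": "location-3",
}


def tie_models(bisign_models):
    """Group (preceding sign, sign) models that may share parameters."""
    tied = defaultdict(list)
    for preceding, sign in bisign_models:
        tied[(ENDING_LOCATION[preceding], sign)].append((preceding, sign))
    return dict(tied)


pairs = [("BROTHER", "LIKE"), ("SISTER", "LIKE"), ("FATHER", "LIKE"), ("I", "LIKE")]
groups = tie_models(pairs)
```

In this toy example four bi-sign models for "LIKE" collapse into three tied models, because the contexts "BROTHER" and "SISTER" fall into the same group.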

6 Coupling of Vision and HMMs

In the preceding section we reviewed how HMMs can be used for ASL recognition. The use of HMMs alone, however, imposes some limitations, one of which is insufficiency of training data, especially while training context-dependent models. Furthermore, the probability theory assumptions underlying the HMM theory, as described in Section 4.2, are often not valid: Successive observations are often not independent, the transition from one state to the next often depends not only on the current state, but also on the state history, and the distribution of observations does not necessarily resemble a normal density.

Another problem is that the HMM theory does not provide for any dynamic weighting of features depending on a sign's context. For example, the invariant features for some signs, such as "I," are the endpoints of their movements with respect to a body part, and the movements are unimportant. For other signs, only the movements are invariant. The parts of the feature set that should be examined and ignored for each class of signs are mutually exclusive.

To alleviate these limitations, we investigated the coupling of the HMM recognition process with an independent computer vision-based motion analysis that temporally segments the signal and extracts its geometric properties. The idea is that a sign can be described in terms of one or more geometric primitives, such as hand movements along a line, in a plane, or a circle. This idea is supported by the existence of transcription systems, such as HamNoSys [14], that base the description of the movements on geometric primitives.

The presence of three-dimensional information is crucial for the coupling to work. In the past, geometric fitting of planes has already been used for rough segmentation [12], but not for providing additional information about the nature of the fits to the HMM recognition process.

6.1 Segmentation of the Signal

To extract the geometric properties of the continuous signal estimated with our computer vision methods, it must first be segmented temporally into its parts. Any change of the type of arm movement is likely to be accompanied by a dip in the velocity. Thus, minima in the absolute values of the velocity vector provide strong hints at segmentation boundaries. However, there are typically many more velocity minima than segmentation boundaries. Thus, the segmentation process must provide facilities to merge adjacent segments.
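The initial segmentation step can be sketched as a search for local minima of the speed profile. The sketch below is an assumption-laden illustration rather than the authors' algorithm: it uses plain finite differences at the 25 frames per second mentioned for the data collection, and the function names and sample trajectory are hypothetical.

```python
import math


def speeds(positions, frame_rate=25.0):
    """Finite-difference speed between consecutive 3D positions."""
    return [math.dist(a, b) * frame_rate for a, b in zip(positions, positions[1:])]


def candidate_boundaries(speed_profile):
    """Indices of local minima of the speed profile."""
    return [
        i
        for i in range(1, len(speed_profile) - 1)
        if speed_profile[i] < speed_profile[i - 1]
        and speed_profile[i] <= speed_profile[i + 1]
    ]


# A toy trajectory along the x axis that slows down twice.
trajectory = [(x, 0.0, 0.0) for x in (0.0, 3.0, 4.0, 6.0, 10.0, 10.5, 13.5)]
profile = speeds(trajectory)
boundaries = candidate_boundaries(profile)
```

Every index returned is only a candidate boundary; as the text notes, many of them are spurious and have to be merged away later.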

After performing initial segmentation based on velocities, our algorithm attempts to fit geometric primitives to the individual segments. These currently consist of lines, planes, and holds3 at a position in space.

3 A hold is a short period of time, during which no hand movements take place.

The fit of a hold is determined by computing the covariance matrix over the segment's position data. If there is little movement, the eigenvalues of the matrix in every direction are small, and consequently its trace is small.

The least-squares fit of a line is governed by

\[ \sum_i d_i^2 = \sum_i \left\| p_i - (p_i \cdot v)\, v \right\|^2 \tag{19} \]

where $d_i$ is the distance of $p_i$ to the line, and $v$ is the line's unit direction vector. Let $P$ be a matrix containing the points $p_i$ in the segments as its row vectors. Minimizing Equation 19 with respect to $v$ corresponds to maximizing $v^T P^T P v$. By Rayleigh's principle, the maximal-eigenvalue eigenvector of $P^T P$ maximizes this equation, which is equivalent to the maximal-eigenvalue eigenvector of the points' covariance matrix. This eigenvector is the line's direction vector. The other two eigenvalues indicate the goodness of fit: the smaller they are with respect to the largest eigenvalue, the better the fit.

The least-squares fit of a plane is governed by

\[ \sum_i d_i^2 = \sum_i \left\| p_i \cdot n \right\|^2 \tag{20} \]

where $d_i$ is the distance of $p_i$ to the plane, and $n$ is the plane's unit normal vector. If $P$ is a matrix containing the points $p_i$ as its row vectors, the minimal-eigenvalue eigenvector of $P^T P$ minimizes Equation 20 with respect to $n$. Hence, minimizing this equation is equivalent to finding the minimal-eigenvalue eigenvector of the points' covariance matrix. The other two eigenvalues indicate the goodness of fit: the larger they are with respect to the smallest eigenvalue, the better the fit.

Using least-squares fitting is based on the assumption that the signal noise term is captured by a normal distribution. If this assumption is not valid, the least-squares estimator is likely to yield poor results, because of its sensitivity to outliers. On the other hand, in three-dimensional space, the least-squares estimator is much easier to compute than more robust estimators. It would be interesting to compare its performance on temporal segmentation to the performance of robust regression estimators [13], such as the least median of squares estimator [2, 11], or the repeated median estimator [16, 9].

After the initial fit, the algorithm pools the primitives into a directed acyclic graph (DAG), schematically depicted in Figure 6. Note that the individual segments are not mutually exclusive; for example, data can fit both a line and a plane.

If the algorithm fails to fit any geometric primitives to some segment, it inserts the segment into the DAG as a "wild card," which is defined conservatively to match any kind of geometric primitive. It then attempts to merge adjacent segments if they are compatible, in an attempt to eliminate spurious segmentation boundaries.

We defined adjacent segments to be compatible for a merge if they shared the same type of geometric primitive in similar orientations, and if the merged segment still fit the same type of geometric primitive as its constituting segments. In addition, we considered a wild card to be compatible with another geometric primitive if this primitive also fit the merged segment.
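The hold, line, and plane fits described in this section all reduce to an eigen-analysis of the 3x3 covariance matrix of a segment's points. The following sketch illustrates this under stated assumptions: it mean-centers the points, uses simple power iteration in place of a full eigen-solver, and the hold threshold, function names, and sample points are hypothetical.

```python
import math


def covariance(points):
    """3x3 covariance matrix of a list of 3D points (mean-centered)."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in points]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(3)] for i in range(3)]


def dominant_eigenvector(m, iterations=200):
    """Power iteration for a symmetric positive semidefinite 3x3 matrix."""
    v = [1.0, 0.7, 0.3]  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero matrix: no preferred direction
        v = [x / norm for x in w]
    return v


def line_direction(points):
    """Maximal-eigenvalue eigenvector of the covariance matrix."""
    return dominant_eigenvector(covariance(points))


def plane_normal(points):
    """Minimal-eigenvalue eigenvector, found via the shifted matrix trace*I - C."""
    c = covariance(points)
    trace = c[0][0] + c[1][1] + c[2][2]
    shifted = [[(trace if i == j else 0.0) - c[i][j] for j in range(3)] for i in range(3)]
    return dominant_eigenvector(shifted)


def is_hold(points, threshold=1e-3):
    """A hold shows little movement, so the covariance trace is small."""
    c = covariance(points)
    return c[0][0] + c[1][1] + c[2][2] < threshold


line_points = [(float(t), float(t), 0.0) for t in range(5)]
plane_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
direction = line_direction(line_points)
normal = plane_normal(plane_points)
```

A fuller version would also return the remaining eigenvalues, since their size relative to the extreme eigenvalue is what the text uses as the goodness-of-fit measure.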

[Figure 6: DAG with arcs labeled Hold, Line, Line, and Plane]

Figure 6: Geometric primitives pooled into a DAG. Circles denote segmentation boundaries. Dotted arcs denote possible null transitions; they are necessary to compensate for spurious segments. Sometimes data can fit multiple geometric primitives; in this DAG the data of the first two segments fit both a hold followed by a line, and a simple line.

The DAG now gives all possible segment sequences that are a valid representation of the signal. If a sequence is to be valid, it must be obtainable by tracing a path through the DAG from the leftmost segmentation boundary to the rightmost segmentation boundary. In the example given in Figure 6 the sequences "Hold, Line, Plane," and "Line, Plane" would both be valid sequences, but "Plane, Plane" would not, because the latter does not lie on any path through that DAG.

This discussion has so far ignored the possibility of spurious segments arising from the vision analysis. That is, the analysis might recognize a segment that should be part of another, but the merge process fails to merge it into another segment. The main reason for the existence of spurious segments is undersampling. If a segment consists of very few samples, it is often impossible to extract reliable information from it. Our algorithm attempts to solve this problem by adding arcs to the DAG from each segmentation boundary to the next (represented by the dotted arcs in Figure 6). Thus, a path through the DAG can optionally skip these spurious segments.
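The notion of a valid sequence, including the optional skip arcs, can be sketched as follows. The edge list mirrors the example described for Figure 6; the representation of the DAG as `(source, target, label)` triples, the use of `None` for null arcs, and the function names are assumptions made for this illustration.

```python
def null_closure(nodes, edges):
    """All boundaries reachable from `nodes` through null (skip) arcs."""
    closure = set(nodes)
    changed = True
    while changed:
        changed = False
        for src, dst, label in edges:
            if label is None and src in closure and dst not in closure:
                closure.add(dst)
                changed = True
    return closure


def is_valid(sequence, edges, start, end):
    """True if `sequence` labels a path from `start` to `end` in the DAG."""
    current = null_closure({start}, edges)
    for primitive in sequence:
        reached = {dst for src, dst, label in edges
                   if src in current and label == primitive}
        current = null_closure(reached, edges)
    return end in current


# Boundaries 0..3; the first two segments fit "Hold, Line" and also one "Line".
FIGURE_6 = [(0, 1, "Hold"), (1, 2, "Line"), (0, 2, "Line"), (2, 3, "Plane")]
# The same DAG with null arcs that allow skipping the first two segments.
WITH_SKIPS = FIGURE_6 + [(0, 1, None), (1, 2, None)]

valid_long = is_valid(["Hold", "Line", "Plane"], FIGURE_6, 0, 3)
valid_short = is_valid(["Line", "Plane"], FIGURE_6, 0, 3)
invalid = is_valid(["Plane", "Plane"], FIGURE_6, 0, 3)
```

With the skip arcs in `WITH_SKIPS`, the single-primitive sequence "Plane" also becomes valid, which is exactly the tolerance for spurious segments that the dotted arcs provide.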

6.2 Using the Motion Analysis with HMMs

Each sign in the vocabulary has associated one or more templates that comprise the sign's geometric primitives with weights of each feature's relative importance. These primitives are matched against those in the DAG. Assuming that the segmentation process yields correct results, the following must be true: If a sequence of signs is represented by the input signal, the sequence of geometric primitives corresponding to the signs must form a path through the DAG. We call such a sequence of signs valid with respect to the computer vision DAG.

This observation suggests an application of the motion analysis as a backup check for the HMM framework. First recognize a candidate sentence from the input signal via the Viterbi algorithm. Then generate all possible sequences of geometric primitives corresponding to the recognized signs and construct another DAG from them. Using dynamic programming, match the two DAGs against each other. If the two DAGs share a common path, accept the candidate sentence as correct. Otherwise, reject the candidate sentence as incorrect.

The justification for this algorithm comes from the following properties of the DAGs: If the two DAGs share a common path, there is a sequence of geometric primitives that forms a path through the computer vision DAG. Furthermore, this sequence of geometric primitives is one of the possible sequences generated from the candidate sentence. Thus, the candidate sentence is valid with respect to the computer vision DAG. Conversely, if no such common path exists, none of the sequences of geometric primitives generated from the candidate sentence forms a path through the computer vision DAG. Thus, the candidate sentence is not valid with respect to the computer vision DAG and should be rejected.
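The acceptance test can be sketched as a search over pairs of nodes, one from each DAG. This is a simplified stand-in for the dynamic programming step, not the authors' implementation: it ignores feature weights, the sign names and templates are hypothetical, and the wild-card symbol `WILD` and null arcs labeled `None` are conventions chosen for the example.

```python
from collections import deque

WILD = "*"  # wild-card primitive in the vision DAG


def sentence_dag(sentence, templates):
    """Chain the alternative templates of each sign into one DAG."""
    edges, node, next_free = [], 0, 1
    for sign in sentence:
        end, next_free = next_free, next_free + 1
        for template in templates[sign]:  # templates are non-empty lists
            src = node
            for k, primitive in enumerate(template):
                if k == len(template) - 1:
                    dst = end
                else:
                    dst, next_free = next_free, next_free + 1
                edges.append((src, dst, primitive))
                src = dst
        node = end
    return edges, 0, node


def share_common_path(vision, candidate):
    """Search pairs (vision node, candidate node) for a joint path."""
    (v_edges, v_start, v_end), (c_edges, c_start, c_end) = vision, candidate
    seen = {(v_start, c_start)}
    queue = deque(seen)
    while queue:
        v, c = queue.popleft()
        if v == v_end and c == c_end:
            return True
        # Null arcs advance only the vision DAG (skipping a spurious segment).
        steps = [(dst, c) for src, dst, lab in v_edges if src == v and lab is None]
        steps += [(vd, cd)
                  for vs, vd, vl in v_edges if vs == v and vl is not None
                  for cs, cd, cl in c_edges if cs == c and vl in (cl, WILD)]
        for state in steps:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False


# Hypothetical templates for three placeholder signs.
TEMPLATES = {"SIGN_A": [["Hold", "Line"], ["Line"]],
             "SIGN_B": [["Plane"]],
             "SIGN_C": [["Line"]]}
VISION = ([(0, 1, "Hold"), (1, 2, "Line"), (0, 2, "Line"), (2, 3, "Plane")], 0, 3)

accepted = share_common_path(VISION, sentence_dag(["SIGN_A", "SIGN_B"], TEMPLATES))
rejected = share_common_path(VISION, sentence_dag(["SIGN_B", "SIGN_B"], TEMPLATES))
```

Each pair of nodes is visited at most once, so the search is bounded by the product of the two DAG sizes.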

6.3 Discussion of the Coupling

The HMM recognition algorithm and the vision matching algorithm complement each other. The advantages of the HMM recognition method are automatic segmentation during both training and recognition, and a fully formalized training procedure. The disadvantages are poor performance in the presence of insufficient training data, no formal way to weight features dynamically, and possible violations of the stochastic independence assumptions.

The advantages of the vision matching method are the possibility of weighting the relative importance of features dynamically, and independence from insufficient training data. A significant disadvantage is that estimating the geometric properties of the signs in the vocabulary requires manual labeling and analysis of the data. Furthermore, segmentation must be done explicitly, which raises the possibility of spurious segments, as described in Section 6.1, or the possibility of missing segments. Coarticulation sometimes also changes the geometric properties of the signal, such that the templates for the correct sequence of signs no longer match the actual signal. Coping with the changes in the geometric properties is an important task for future research.

7 Data Collection

For our experiments we collected data, using both our computer vision system, and an Ascension Technologies Flock of Birds. The reason for using the latter was that it is faster at this point than the computer vision system, and hence more suitable for prototyping.

The computer vision system yields the rotation and translation of each segment of the arm, as described in Section 3. Figure 7 gives an example of the computer vision tracking process. The images show the high accuracy of the computer vision system; in fact, it is comparable to the accuracy achieved by the Flock of Birds system.

The Flock of Birds system consists of a magnet and six sensors that detect their rotation and translation with respect to the magnet at 25 frames per second. We used the data from both systems interchangeably with a simple alignment of coordinate systems. The coordinate system was right-handed, with the origin at the base of the signer's spine and one coordinate axis facing up.

Figure 7: Fitting the three-dimensional models to the signer's arms. From top to bottom, the signs for "FATHER," "I," and "MAIL" are displayed. From left to right, the front, side, and top views are displayed.

We used the 53-sign vocabulary listed in Table 1. Their pronunciations followed the ASL dialect used in the Philadelphia, PA, area. The goals in choosing the vocabulary were to be able to express sentences that could have occurred in a natural conversation, and to make intensive use of the signing space, so as to demonstrate the advantages of three-dimensional data over two-dimensional data. We collected 486 continuous ASL sentences, each between 2 and 12 signs long, with a total of 2345 signs. The only constraints on the order and occurrence of signs were those dictated by the grammar of ASL [19].

Category     Signs used
Nouns        America, Christian, Christmas, book, brother, chair, college,
             family, father, friend, interpreter, language, mail, mother,
             name, paper, president, school, sign, sister, teacher
Pronouns     I, my, you, your, how, what, where, why
Verbs        act, can, give, have, interpret, like, make, read, sit, teach,
             try, visit, want, will, win
Adjectives   deaf, good, happy, relieved, sad
Other        if, from, for, hi

Table 1: The complete 53-sign vocabulary

Furthermore, we collected examples of each sign for isolated recognition. Because part of the data were corrupted during the collection process, we discarded all signs for which we did not have enough intact training examples. This left 656 examples over a range of 40 signs. Each sign had at least 6 examples available for the training set, and 2 examples available for the test set.

8 Experiments

We performed isolated, continuous, and vision-HMM coupled ASL recognition experiments. We used Entropic's Hidden Markov Model Toolkit (HTK) Version 2.02 for training and testing in all of our experiments.

8.1 Isolated Recognition Experiments

The goal of the isolated recognition experiments was to discover a set of features that maximizes HMM recognition performance. We used different features in our experiments, including wrist position coordinates of both hands (denoted by $x, y, z$), wrist position expressed in polar coordinates in the $x$-$y$ plane (denoted by $\rho_{xy}, \theta_{xy}$), polar coordinates in the $x$-$z$ plane (denoted by $\rho_{xz}, \theta_{xz}$), wrist position expressed in spherical coordinates (denoted by $\rho, \theta, \phi$), and wrist orientation angle (denoted by $\alpha$), as well as derivatives of these (denoted by a dot). We also combined several features in some experiments.
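The coordinate features can be derived from the Cartesian wrist positions as sketched below. The angle conventions (azimuth measured with `atan2`, inclination measured from the third axis) are assumptions, since the text does not specify them; the finite-difference derivative at 25 frames per second and all function names are likewise illustrative.

```python
import math


def polar(a, b):
    """Polar coordinates (radius, angle) of a point in the a-b plane."""
    return math.hypot(a, b), math.atan2(b, a)


def spherical(x, y, z):
    """Spherical coordinates (radius, azimuth, inclination); convention assumed."""
    radius = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)
    inclination = math.acos(z / radius) if radius > 0 else 0.0
    return radius, azimuth, inclination


def derivative(series, frame_rate=25.0):
    """Finite-difference derivative of a scalar feature over time."""
    return [(b - a) * frame_rate for a, b in zip(series, series[1:])]


radius_xy, angle_xy = polar(3.0, 4.0)    # polar features in one plane
velocity = derivative([0.0, 1.0, 3.0])   # "dotted" feature from a position track
```

Combined feature vectors, as used in some of the experiments, would simply concatenate the outputs of these functions per frame.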

We ran repeated experiments, several thousand in total, with different features and randomly selected training and test sets on a per-experiment basis. Three quarters of the examples for each sign were in the training set and the rest were in the test set. Each selection yielded 178 test examples per experiment. Some typical results are given in Table 2. In addition, we performed experiments to compare the merits of using three-dimensional coordinates versus two-dimensional coordinates by projecting the coordinates on planes. The results are shown in Table 3.

Features                                                          Mean     Std. dev.  B        W       N
$x, y, z$                                                         98.42%   0.99%      100.0%   93.8%   463
$\rho_{xy}, \theta_{xy}, z$                                       98.72%   0.79%      100.0%   95.5%   494
$\rho_{xy}, \rho_{xz}, \theta_{xy}, \theta_{xz}, x, y, z$         98.78%   0.78%      100.0%   94.9%   882
$\rho, \theta, \phi$                                              96.48%   1.31%      100.0%   93.3%   210
$\dot{x}, \dot{y}, \dot{z}$                                       96.87%   1.21%      100.0%   93.3%   167
$x, y, z, \alpha$                                                 98.25%   0.92%      100.0%   95.5%   167
$\dot{\rho}_{xy}, \dot{\theta}_{xy}, \dot{z}$                     96.28%   1.04%      98.9%    93.8%   120
$\dot{\rho}, \dot{\theta}, \dot{\phi}$                            95.89%   1.29%      98.9%    92.1%   150

Table 2: Results of isolated sign recognition with three-dimensional features. Mean, Std. dev., B, W, and N correspond to the average percentage of correctly recognized signs, standard deviation, best case, worst case, and number of experiments, respectively. All experiments used a test set of 178 signs.

Features                        Mean     Std. dev.  B        W       N
$\rho_{xy}, \theta_{xy}$        98.06%   1.26%      100.0%   94.9%   118
$x, y$                          97.75%   1.20%      100.0%   94.9%   118

Table 3: Results of isolated sign recognition with two-dimensional features. The meaning of the columns is the same as in Table 2.

8.2 Analysis of Isolated Recognition

The low error rates of the best feature sets show that with a good selection of features, the hand movements alone, without hand configuration information, carry sufficient information to discriminate among many different signs. Polar coordinates slightly outperformed Cartesian coordinates. A combination of both yielded the best results, although the difference is not significant. However, the standard deviation of the combined feature set was lowest, indicating that a complex feature vector is more robust than a simple feature vector.

Position coordinates significantly outperformed velocities. The reason for the poor performance of velocity features is that the statistical properties of the velocities change with variations in the sign's duration. In contrast, the statistical properties of position coordinates are largely unaffected by the duration of signs, because HMMs absorb variations in duration through transitions looping back to the same state. Yet, position coordinates have the significant disadvantage that they are not invariant with respect to location. The lack of invariance will cause problems for future applications that attempt to capture commonalities between movements at different locations in space.


Three-dimensional features performed better than two-dimensional features, although the difference is not large. The difference would probably become more significant with a larger vocabulary. The differences in standard deviation, however, indicate that three-dimensional features are more robust than two-dimensional features.

It is an important consequence of the experiments' results that the performance of the feature vectors depends on the actual examples in the training set, all other factors being equal. Thus, only performing a large number of experiments yields reliable estimates of the relative merits of different features.

8.3 Continuous Recognition Experiments

We split the 486 sentences randomly into a training set with 389 examples and a test set with 97 examples (containing 456 signs). Each sign in the vocabulary occurred at least once in the test set. The training and test sets were the same throughout all experiments, and no portion of the test set was used for training in any way. We ran three-dimensional experiments with and without context-dependent HMMs, and two-dimensional experiments (by projecting the data on planes; the results given are the best that we found).

In accordance with the results from isolated experiments that position coordinates perform better than velocities, and that a complex feature vector is more robust than a sparse one, we chose our feature vector to be $(x, y, z, \theta_{xy}, \theta_{xz}, \dot{x}, \dot{y}, \dot{z}, \alpha)$ for both hands. That is, it consisted of Cartesian and polar position coordinates, velocities, and wrist orientation angles. The task grammar was a simple word loop, so every sign was equally likely at any time in the HMM network.

Table 4 shows the experimental results. We use word accuracy as our evaluation criterion. It is computed by subtracting the number of insertion errors from the number of correctly spotted signs and dividing the result by the total number of signs in the test set. The number of words in the result for two-dimensional data is lower than in the other results, because for one sentence the Viterbi beam-searching optimization pruned all paths through the HMM network (see also Section 4.2).
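As a worked check, the word accuracy can be recomputed from the counts reported in Table 4, normalizing by the total number of signs N. The function name `word_accuracy` is hypothetical; the counts are taken from the table.

```python
def word_accuracy(hits, insertions, total):
    """Word accuracy in percent: (H - I) / N."""
    return 100.0 * (hits - insertions) / total


# (H, I, N) for each experiment, as listed in Table 4.
experiments = {
    "3D context independent": (416, 16, 456),
    "3D context dependent": (424, 14, 456),
    "2D context dependent": (394, 16, 452),
}
accuracies = {name: round(word_accuracy(*counts), 2)
              for name, counts in experiments.items()}
```

The rounded values agree with the table for the two context-dependent experiments (89.91% and 83.63%); for the context-independent experiment the formula gives 87.72%, which differs from the reported 87.71% only in the last digit, presumably because of truncation.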

8.4 Analysis of Continuous Recognition

The results are clearly in favor of using three-dimensional data over two-dimensional for continuous recognition. The 6.3 percent difference is large, although, according to our experiences with isolated recognition, one experiment is not enough to estimate the real difference reliably.

Context-dependent models outperformed context-independent models, but the increase in performance was small, probably to a large extent because of insufficient training data; context-dependent modeling requires huge amounts of data to become effective. Also, cross-sign context-dependent modeling for ASL is implausible from a phonological point of view (see Section 5.2.1). The alternative is modeling movement epenthesis directly, and it appears to perform better [20].

Type of experiment        Word accuracy   Details
3D context independent    87.71%          H=416, D=8, S=32, I=16, N=456
3D context dependent      89.91%          H=424, D=6, S=26, I=14, N=456
2D context dependent      83.63%          H=394, D=14, S=44, I=16, N=452

Table 4: Results of continuous recognition experiments. H denotes the number of correct signs, D the number of deletion errors, S the number of substitution errors, I the number of insertion errors, and N the total number of signs in the test set.

More than half of the substitution errors in each experiment were confusions between "I" and "MY," and "YOU" and "YOUR," which differ only in hand configuration. We expect that adding features describing the hand configuration will improve recognition performance significantly.

Repeating the context-dependent experiment with five-best recognition showed that the absence of a strong grammar for constraining the HMM network degrades recognition performance significantly. In many cases, the correct sentence was the only grammatical sentence among the five best candidates. In other cases, all five candidates were ungrammatical.

Unfortunately, using a strong grammar for a test set as diverse as ours is not practical, because the size of an HMM network grows exponentially with the number of rules present in the grammar. Statistical language models, such as bigram models, have proved to be an effective solution to this problem in speech recognition. We show in [20] that bigram language models are promising for ASL recognition as well. However, they require a large corpus of labeled real-world data to become truly effective. Presently, no such corpus exists for ASL.

8.5 Coupling Experiments

To investigate the effects of coupling the three-dimensional motion analysis with the HMM framework, we performed two experiments. In the first experiment, we analyzed all sentences in the test set with our motion analysis, so as to provide an upper bound on its performance. If the motion analysis had worked perfectly, it should have accepted all of these 97 test sentences. In reality, however, it rejected 10 out of these 97 sentences.

A closer look at the 10 rejected sentences revealed that five of these were not recognized correctly by the context-dependent HMMs either. Thus, it is likely that these five sentences were not signed precisely enough during the data collection process. The other five rejected sentences indicate that the motion analysis still needs improvement.

In the second experiment, we ran the coupling algorithm on the actual recognition hypotheses from the context-dependent HMMs in the experiments in Section 8.3. This time, the algorithm also eliminated 10 sentences out of 97. Five of these were correctly rejected; that is, the HMM framework had provided incorrect results for them. Thus, at the current moment, coupling HMMs with motion analysis breaks even with using the HMM framework by itself. The word accuracy achieved by the coupling was 90.10%, which is slightly better than the 89.91% word accuracy achieved by the context-dependent models alone.

As we have used only a small part of the full power of computer vision motion analysis so far, we see these results as evidence that coupling will eventually be able to outperform either method independently.

9 Summary

We have developed a framework for recognizing American Sign Language from three-dimensional data obtained with computer vision techniques. We showed how to collect three-dimensional data from computer vision and use them as input to Hidden Markov Models. We also determined that three-dimensional features are superior to two-dimensional ones.

By using context-dependent modeling, we improved recognition performance. Through coupling vision processes with Hidden Markov Models, we took a first step toward overcoming the limitations of either method by itself.

10 Future work

The collection of a standardized corpus of real-world ASL conversations and story telling should be a high priority for future work. The current lack of such a corpus makes it impossible to compare results from different researchers. Furthermore, it makes the development of statistical language models for ASL difficult. Such language models are necessary for large-scale applications.

Testing the algorithms described in this paper and in [20, 21] with a larger vocabulary is also important. Only then will it be possible to judge how well these algorithms scale.

On the linguistic side of ASL recognition, future work should incorporate facial expressions and other phonological processes in ASL into the recognition framework. It also needs to address how to make use of hand configuration information; using this information effectively seems to be nontrivial. Furthermore, future work has to find ways to use statistical language models, so as to counterbalance the impracticability of using strongly constrained task grammars.

On the computer vision side of ASL recognition, future work should elaborate on the coupling of computer vision and HMMs and make the computer vision analysis more robust. This work should consist of recognizing more different geometric properties, fine-tuning the sign templates, and fine-tuning the dynamic weighting of features based on the properties of each sign that is matched to the signal. It also needs to address coarticulation effects, which it has ignored so far.

It is also necessary to develop an anthropometrically correct model of the human hand, so that the computer vision tracking process can make hand configuration information available to the recognition framework.

Acknowledgments

This work was supported in part by a NSF Career Award NSF-9624604, ONR-DURIP'97 N00014-97-1-0385 and N00014-97-1-0396, ONR Young Investigator Proposal, and NSF IRI-97-01803. Ioannis Kakadiaris helped in obtaining the computer vision samples.

References

[1] L. W. Campbell, D. A. Becker, A. Azarbayejani, A. F. Bobick, and A. Pentland. Invariant features for 3-D gesture recognition. 2nd International Workshop on Face and Gesture Recognition, Killington, VT, 1996.

[2] H. Edelsbrunner and D. L. Souvaine. Computing least median of squares regression lines and guided topological sweep. Journal of the American Statistical Association, 85:115–119, 1990.

[3] R. Erenshteyn and P. Laskov. A multi-stage approach to fingerspelling and gesture recognition. Proceedings of the Workshop on the Integration of Gesture in Language and Speech, pp. 185–194, Wilmington, DE, 1996.

[4] K. Grobel and M. Assam. Isolated sign language recognition using hidden Markov models. SMC'97, pp. 162–167.

[5] S. K. Liddell and R. E. Johnson. American Sign Language: The phonological base. Sign Language Studies, 64:195–277, 1989.

[6] I. A. Kakadiaris, D. Metaxas, and R. Bajcsy. Active part-decomposition, shape and motion estimation of articulated objects: A physics-based approach. CVPR'94, pp. 980–984.

[7] I. A. Kakadiaris and D. Metaxas. 3D human body model acquisition from multiple views. ICCV'95, pp. 618–623.

[8] I. A. Kakadiaris and D. Metaxas. Model based estimation of 3D human motion with occlusion based on active multi-viewpoint selection. CVPR'96, pp. 81–87.

[9] J. Matousek, D. Mount, and N. S. Netanyahu. Efficient randomized algorithms for the repeated median line estimator. To appear in Algorithmica. Available online at http://www.cs.umd.edu/~mount/pubs.html.


[10] D. Metaxas. Physics-based Deformable Models: Applications to Computer Vision, Graphics and Medical Imaging. Kluwer Academic Publishers, November 1996.

[11] D. Mount, N. Netanyahu, K. Romanik, R. Silverman, and A. Y. Wu. A practical approximation algorithm for the LMS line estimator. 8th ACM-SIAM Symposium on Discrete Algorithms, pp. 473–482, 1997.

[12] Y. Nam and K. Y. Wohn. Recognition of space-time hand-gestures using Hidden Markov model. ACM Symposium on Virtual Reality Software and Technology, pp. 51–58, Hong Kong, 1996.

[13] N. S. Netanyahu, V. Philomin, A. Rosenfeld, and A. J. Stromberg. Robust detection of straight and circular road segments in noisy aerial images. Pattern Recognition, 30(10):1673–1686, 1997.

[14] S. Prillwitz et al. HamNoSys. Version 2.0; Hamburg Notation System for Sign Languages. An introductory guide, volume 5 of International Studies on Sign Language and Communication of the Deaf. Signum Verlag, Hamburg, 1989.

[15] L. R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.

[16] A. F. Siegel. Robust regression using repeated medians. Biometrika, 69:242–244, 1982.

[17] J. M. Siskind and Q. Morris. A maximum-likelihood approach to visual event classification. ECCV'96.

[18] T. Starner and A. Pentland. Visual recognition of American Sign Language using Hidden Markov models. International Workshop on Automatic Face and Gesture Recognition, pp. 189–194, Zurich, Switzerland, 1995.

[19] C. Valli and C. Lucas. Linguistics of American Sign Language: An Introduction. Gallaudet University Press, Washington DC, 1995.

[20] C. Vogler and D. Metaxas. Adapting Hidden Markov models for ASL recognition by using three-dimensional computer vision methods. SMC'97, pp. 156–161.

[21] C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. ICCV'98, pp. 363–369.

[22] M. B. Waldron and S. Kim. Isolated ASL sign recognition system for deaf persons. IEEE Transactions on Rehabilitation Engineering, 3(3):261–71, September 1995.

[23] M. Waleed Kadous. Machine recognition of Auslan signs using PowerGloves: Towards large-lexicon recognition of sign language. Proceedings of the Workshop on the Integration of Gesture in Language and Speech, pp. 165–174, Wilmington, DE, 1996.

[24] S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland. The HTK Book (for HTK 2.0). Cambridge University, 1995.
