ASL Recognition Based on a Coupling Between HMMs and 3D Motion Analysis · 2001-12-03 · ASL...

An� earlier versionof this paper appearedin the proceedingsof the Inter national Conferenceon Computer Vision,pp. 363–369,Mumbai, India, January 4–7,1998

ASL RecognitionBasedon a Coupling BetweenHMMs and 3D Motion Analysis

ChristianVoglerandDimitris MetaxasDepartmentof ComputerandInformationScience,Universityof Pennsylvania,Philadelphia,PA 19104-6389

[email protected], [email protected]

AbstractWe present a framework for recognizing isolated and

continuous American Sign Language (ASL) sentences fromthree-dimensional data. The data are obtained by us-ing physics-based three-dimensional tracking methods andthen presented as input to Hidden Markov Models (HMMs)for recognition. To improve recognition performance,we model context-dependent HMMs and present a novelmethod of coupling three-dimensional computer visionmethods and HMMs by temporally segmenting the datastream with vision methods. We then use the geomet-ric properties of the segments to constrain the HMMframework for recognition. We show in experimentswith a 53 sign vocabulary that three-dimensional featuresoutperform two-dimensional features in recognition per-formance. Furthermore, we demonstrate that context-dependent modeling and the coupling of vision methodsand HMMs improve the accuracy of continuous ASL recog-nition.

1 Intr oductionAmericanSignLanguage(ASL) is theprimarymodeof

communicationfor many deafpeoplein the USA. It is ahighly inflectedlanguagewith sophisticatedgrammaticalproperties,which constrainstronglytheorderandappear-anceof signs.Becauseof theconstraints,it providesanap-pealingtestbedfor understandingmoregeneralprinciplesgoverninghumanmotionandgesturing,includinghuman-computergestureinterfaces.Suchinterfacesareessentialin virtual reality applications,wheretheusermustbeableto manipulatevirtual objectsby gesturing.A workingASLrecognitionsystemcouldalsofacilitateinteractionof deafpeoplewith their surroundings.

To date, most attemptsat ASL recognitionhave ei-ther used only two-dimensionalcomputervision meth-ods,or they have usedother input devices,suchasdata-gloves, insteadof computervision, to collect input fromthesigner[18, 3, 23]. In this paperwe presenta new ap-proachto ASL recognition.First, we usecomputervisionmethodsto extract the three-dimensionalparametersof asigner’s armmotions. We thenuseHiddenMarkov Mod-els(HMMs) to recognizeisolatedandcontinuousASL ut-terancesfrom the three-dimensionalinput. We developcontext-dependentmodelingof HMMs and methodsfor

coupling the applicationof HMMs and the applicationof three-dimensionalcomputervision methodsto improvecontinuousrecognitionperformance. Our approachat-temptsto overcomesomeof thelimitationsof thepreviousapproachesthat usetwo-dimensionalvisual input, do notusecontext-dependentmodeling,or do not couplecom-putervisionmethodswith HMMs [18, 3, 17, 12].

Three-dimensionalimage-basedshape and motiontracking of a human’s arm and handis difficult becauseof the complexity of the motionsand occlusioneffects.Recently, a methodologyhasbeendeveloped[8, 10] thatallows three-dimensionaltrackingof humanmotion frommultiple images.In this paperwe augmentthis methodol-ogy to track the three-dimensionalmotion of a subject’sarms and handsfrom multiple images. This methodisbasedon theuseof deformablemodels,whoseshapeandmotion fits the given imagesequencesbasedon occlud-ing contourinformationandtheoremsfrom projective ge-ometry. The outputof this methodconsistsof the three-dimensionalmotionparametersof thesubject’s arms.Forefficiency reasons,and becausearm movementsalreadycarrymuchof theinformationneededfor recognizingASLsigns,wedonotusethehandinformationin thispaper.

Apart from obtainingaccuratedata,ASL recognitionis difficult, becausetherearealwaysstatisticalvariationsin the way humansperformmotions,even with identicalmeaning. In addition, in continuousutterances,therearenoclearboundariesbetweenindividualsigns.HMMs pro-videaframework for capturingstatisticalvariationsin bothpositionanddurationof themovement,aswell asimplicitsegmentationof theinputstream.Furthermore,continuousrecognitionis complicatedby coarticulationeffects,thatis,thepronunciation1 of a signis influencedby theprecedingandfollowing signs. Coarticulationeffectscanbe partlyalleviatedby trainingcontext-dependentHMMs.

The theorybehindHMMs makesseveral assumptionsthatareoftennot valid in practice.For this reason,we de-velopa new approachthatcouplescomputervision meth-odswith HMM modeling. It is basedon a temporalseg-mentationprocessthat operatesby extracting geometricpropertiesof the three-dimensionalcomputervision pa-

1By “pronunciation”we meanmotion. We follow theterminologyofspokenlanguagelinguisticswhereapplicable.

1

rameters. Thesepropertiesare obtainedindependentlyfrom the HMM algorithmsandareusedto imposeaddi-tionalconstraintsonHMM-basedrecognition.

To testouralgorithmsandassumptions,weperformedaseriesof experimentsbasedon a vocabulary consistingof53differentsignsthatmakeextensiveuseof space.Weex-perimentedwith bothisolatedandcontinuousASL recog-nition for both three-dimensionaland two-dimensionaldata. As HMMs require large amountsof training dataandthecomputervisionprocessis computationallyexpen-sive,we useddatafrom anAscensionTechnologiesFlockof Birdsandcomputervisionprocessesinterchangeably.

Ourgoalis to discoverandanalyzeausableframeworkfor bothisolatedandparticularlycontinuousASL recogni-tion. We do not addressmoregeneralgesturerecognitiontopicsandsignerindependencein this paper. Neitherdowe addresstheinvolvedaspectsof ASL linguistics[19] atthis point, but obviously, a viable futureASL recognitionsystemshouldbeableto handlethem.

In the following sections,we discussrelatedwork andgive an overview on the theory behindthe vision meth-odsandHMMs. Afterward,we addresstheuseof HMMsfor isolatedandcontinuousASL recognition,andcouplingcomputervision processeswith theHMM algorithms.Fi-nally, we outlinedatacollectionandprovideexperimenta-tion resultsfor isolatedandcontinuousrecognitionandthecouplingof computervisionandHMMs.

2 PreviousWorkPrevious work on sign languagerecognitionfocuses

primarily on fingerspellingrecognitionand isolatedsignrecognition. Somework usesneural networks [3, 22].For this work to applyto continuousASL recognition,theproblemof explicit temporalsegmentationmustbesolved,whichis alimitation thatHMM-basedrecognitiondoesnothave. MohammedWaleedKadous[23] usesPowerGlovesto recognizeasetof 95 isolatedAuslansignswith 80%ac-curacy, with anemphasison computationallyinexpensivemethods.Kirsti GrobelandMarcellAssam[4] useHMMsto recognizeisolatedsignswith 91.3%accuracy out of a262signvocabulary. They extract thefeaturesfrom videorecordingsof signerswearingcoloredgloves.

Thereis very little previous work on continuousASLrecognition. ThadStarnerand Alex Pentland[18] useaview-basedapproachto extract two-dimensionalfeaturesas input to HMMs with a 40 word vocabulary. YangheeNam andKwangYoenWohn [12] usethree-dimensionaldataasinputtoHMMs for continuousrecognitionof averysmallsetof gestures.

3 Model-basedTracking of a Human’sArmsIn thissectionwegivea brief overview of our formula-

tion thatallows the three-dimensionalarmshapeandmo-

tion estimationfrom multiple images[6, 7, 8, 10].

Our approachconsistsof two parts.Thefirst part[6, 7]consistsof anactive,integratedapproachthatidentifiesre-liably thepartsof amovingarticulatedobjectandestimatestheir shapeand motion from a controlled set of motionsthatrevealtheobject’s structure.We usethealgorithmde-velopedin [6, 7], which segmentstheapparentbodycon-tourof amovinghumaninto theconstituentparts.Initially,a singledeformablemodelis usedin orderto fit theimagedata.As themodeldeformsto fit thedeformed(dueto themotionof thehuman)subsequentimagecontours,a novelHuman Body Part Identification Algorithm (HBPIA) isdevelopedto identify all the body parts. By applyingtheHBPIA iterativelyoverthesubsequentframes,all themov-ing partsareidentified.In addition,we have extendedthisalgorithmto allow theestimationof thethree-dimensionalshapeof asubject’sbodyparts,basedon theintegrationofimagestakenfrom threeorthogonallyplacedcameras.Weusedthis methodologyto estimatethe three-dimensionalshapeof thesubject’s armsshown in theexamplesin Sec-tion 7. It is worthnotingthatwe have recoveredthelowerarmandthehandasonepart,sincein ourASL recognitionexperimentswe did not usethe motion of the lower armandthehandrelative to eachother.

The secondpart of the algorithmconsistsof usingtheextractedthree-dimensionalshapeof the arm to track thethree-dimensionalposition and orientationof a subject’sbodyparts[8]. To alleviatedifficultiesarisingfrom occlu-sionanddegenerateviewsduringtheunconstrainedmove-mentof the arm, we usethreecalibratedcamerasplacedin a mutually orthogonalconfiguration. At every imageframeand for eachbody part, we derive a subsetof thecamerasthatprovidethemostinformativeviewsfor track-ing. This active and time varying selectionis basedonthevisibility of apartandtheobservability of its predictedmotion from a certaincamera.Oncea setof camerashasbeenselectedto trackeachpart,weuseconceptsfrom pro-jective geometryto relatepointson theoccludingcontourto pointson the three-dimensionalshapemodel. Using aphysics-basedmodelingapproach,we transformthis cor-respondence,in addition to two-dimensionalforcesaris-ing from the discrepancy betweenthe model’s occludingcontourand the imagedata, into generalizedforcesthatareappliedto the model to estimatethe model’s transla-tional androtationaldegreesof freedom. To improve thetrackingresultsfurther, the dynamicsystemis embeddedwithin an extendedKalmanfilter framework, andwe usethe predicted motion of the model at eachframe to es-tablishpoint correspondencesbetweenoccludingcontoursandthethree-dimensionalmodel.

We usedthis two-stepapproachto track the motionofthesubject’s armsperformingtheASL gestures,asshown

2

in Section7. Theoutputof thesystemis a setof rotation,�� , andtranslation,�� , parametersthatwe useasinput tothe HMMs and the vision-basedsegmentationalgorithmpresentedin thefollowing sections.

4 Hidden Mark ov ModelsHiddenMarkov Models (HMMs) area type of statis-

tical model. They have beenusedsuccessfullyin speechrecognition,andrecentlyin handwriting,gesture,andsignlanguagerecognition.Wenow giveasummaryof thebasictheorybehindHMMs, which is coveredin detailin [15].

4.1 Definition of HMMsAn HMM consistsof anumber� of states�� , togetherwith transitionsbetweenstates.Thesystemis

in oneof theHMM’ sstatesatany giventime. At regularlyspaceddiscretetimeintervals,thesystemtakesanoutgoingtransitionfrom its currentstateto a new state.

Eachtransitionfrom � � to �� hasanassociatedproba-bility �� of beingtaken. Hence, � � �� . Eachstate� � alsohasan initial probability �� of thesystemstartingin �� . In addition,eachstate � � generatesoutput �! #" ,which is distributedaccordingto a probabilitydistributionfunction $ �&% �(')�+*-, Outputis ��.Systemis in � �0/ . An ex-ampleisgivenin Figure1. Themodeldepictedthereis alsoanexampleof a left-right model; that is, � ��2143 implies57698 � In otherwords,transitionsonly flow forwardfromlower statesto the samestateor higherstates,but neverbackward. This topologyis themostcommonlyusedonefor modelingprocessesover time.

a11 a24

a23a12

55a

b5b4b3b2b1

a34 a45S1 S S S S2 3 4 5

Figure1: Exampleleft-right HMM with its transitionandoutputprobabilities.“Left-right” meansthattransitionsoc-curonly from left to right, andneverbackward.

4.2 The Thr eeFundamentalHMM ProblemsTherearethreefundamentalproblemsin HMM theory:

(1) For a sequenceof observations :;�<:)�=>��?�@�:BA ,: � C" , computethe probability * % :D. E�' that anHMM E generated: .

(2) For some: andanHMM E , recover themostlikelystatesequence� � F�?��?�� A thatgenerated: .

(3) Adjust the parametersof an HMM E suchthat theymaximize* % :D. EG' for some: .

The first problemcorrespondsto maximumlikelihoodrecognitionof an unknown data sequencewith a set ofHMMs, eachof which correspondsto a sign. For eachHMM, the probability * % :D. EG' is computedthat it gener-atedthe unknown sequence,andthenthe HMM with thehighestprobability is selectedastherecognizedsign. Forcomputing * % :D. EG' , let HI�JH � KH � ��?��LH A be a statesequencein E :

MON % 8 'P�#* % :)�=K:B�=?��?K: N KH N �Q� � . EG'4�SR 8 R!�T (1)

* % :D. E�'U� �V�XW � M A % 8 '@ (2)

M � % 8 'P�#��Y$Z� % : � '@ (3)

MON\[ � % 8 'P�#$ �K% : N\[ �?' �V�]W � M�N % 5 '&� �]� �)R!^UR4_!`4� (4)

Theseequationsassumethat the : � areindependent,andthey make the Markov assumptionthat a transition de-pendsonly on the currentstate,a fundamentallimitationof HMMs. This methodis called the forward-backwardalgorithmandcomputes* % :D. E�' in : % � � _)' time.

The secondproblemcorrespondsto finding the mostlikely path H throughan HMM E , given an observationsequence: , andis equivalentto maximizing * % HaK:D. E�' .Let b

N % 8 'P� cad=efhgLi�j�j�j i f�kmlFg * % H � H � ��?�]H N �n� �]K:D. E�'@ (5)bN\[ � % 8 'o�Q$ �]% : N\[ ��'qprcDde�Ls � sG� ,

bN % 5 't� �]�]/ (6)

cad=ef * % HaK:D. E�'u�vcad=e�Zs � sG� ,bA % 8 ' / � (7)b

N % 8 ' correspondsto themaximumprobabilityof all statesequencesthat endup in � � at time ^ . Equations6 and7follow from Equation5 by induction on ^ . The Viterbialgorithmis a dynamicprogrammingalgorithmthat, us-ing Equation7, computesboth the maximumprobability* % HaK:D. E�' andthestatesequenceH in : % � � _B' time.

Therecoveryof thestatesequencemakestheViterbi al-gorithminvaluablefor continuousrecognition,sinceit by-passesthe difficult problemof segmentingthe utterancesinto its individualparts.Instead,asequenceof HMMs cor-respondingto individual signsis concatenatedinto a net-work, as schematicallydepictedin Figure 2. Thus, themostlikely statesequencerecoversthesequenceof signs.

The Viterbi algorithmalsohasthe propertythat it canbe optimizedwith the beam-searchingalgorithm. Whileupdating

bN\[ � % 8 ' , this optimizationconsidersonly those

states� � in the HMM network for which

bN % 5 ' is above

a thresholdvalue.Theassumptionis thatif theprobability

3

HMM 1

HMM 2

HMM n

......

Finalstate

Initialstate

Figure2: Concatenationof HMMs into a network

of a partial paththroughthe network becomestoo low, itcannotcontributeto themostlikely path.Beam-searchingis essentialfor makinglarge-scaleapplicationstractable.

The third problemcorrespondsto training the HMMswith data,suchthat they areableto recognizepreviouslyunseendatacorrectlyafterthetrainingphase.Thereexistsno analyticalsolution for maximizing * % :D. EG' for givenobservation sequences,but an iterative procedure,calledthe Baum-Welch procedure,maximizes * % :D. E�' locally.In thecaseof continuousdensityoutputprobabilities,thereestimationprocessworksasfollows.

Define $]� % :S' as $]� % :S'w�x�Qyz W �|{ � z)} % :-&~G� z Z�O� z ' ,where � describesthenumberof mixtures,

5is thestate

number, { describesthe weight of mixture � in state5,

and } is a Gaussiandensitywith mean~ , andcovariancematrix � . Definethebackwardvariable� as

� N % 8 'P�#* % : N\[ ��: N\[ �=?��?��K:BA�. H N �#� � KEG'Z (8)

�GA % 8 'P�� (9)

� N % 8 '�� V�]W � ��$K� % : N\[ � 't� N\[ � % 5 'Z (10)

�SR 8 R!��SR�^UR4_!`4�� (11)

Furthermore,define� and � as

� N % 8 5 'P� MON % 8 't� �� $ ��% : N\[ �?'0� N\[ � % 5 '* % :D. E�' (12)

� N % 8 'P� �V��W � � N % 8 5 'Z� (13)

� N � N % 8 5 ' canbe interpretedas the expectednumberof transitionsfrom � � to � � ; likewise � N � N % 8 ' canbe in-terpretedastheexpectednumberof transitionstakenfrom� � . With theseinterpretations,the reestimationformulaefor thetransitionsandoutputprobabilitiesare��q�� % 8 '@ (14)

�� A�� N W � � N % 8 5 '� A�� N W � � N % 8 ' (15)

�{ � z � � AN W � � N % 5 &�T'� AN W � � y� W � � N % 5 L�(' (16)

�~G� z � � AN W � � N % 5 &�T'&: N� AN W � � N % 5 ]��' (17)

�� z � � AN W � � N % 5 ]��' % : N `7~G� z ' % : N `�~G� z ' A� AN W � � N % 5 ]��' � (18)

Repeateduseof thisprocedureconvergestoamaximumprobability[15], typically after5–10iterations.

5 Useof HMMs for ASL RecognitionIn the previous sectionwe reviewed the extraction of

three-dimensionalfeaturesfrom computervision and theHMM theory. We now discusshow they fit in the frame-work of ASL recognition.

HMMs are an attractive choice for processingthree-dimensionalsigndata,becausetheirstate-basednatureen-ablesthemto describehow a sign changesover time andto capturevariationsin thedurationof signs,by remainingin a statefor severaltime frames.

Therearetwo waysto approachthe recognitionprob-lem that posevery different researchproblems. Isolatedrecognitionattemptsto recognizeonesinglesignata time.Hence,it is basedon theassumptionthateachsigncanbeindividually extractedandthenindividually recognized.

Continuousrecognition,on theotherhand,attemptstorecognizean entirestreamof signs,without any artificialpausesor any other form of marked boundariesbetweentheindividualsigns.Clearly, continuousrecognitionis de-sirablefor the most natural interactionpossiblebetweenhumansandmachines,but it is alsomuchmoredifficult totacklethanisolatedrecognition.Thenext two subsectionsdiscusseachof thetwo approachesin detail.

5.1 IsolatedRecognitionIsolatedsign recognitionassumesthat eachsign can

be extractedindividually. This requiresclearly markedboundariesbetweensigns. Sucha boundarycould sim-ply besilence,thatis, a brief restingphaseaftereachsign,duringwhich thesignerperformsno movements.Silenceis easilydetectedthroughananalysisof theglobalvarianceoverthehandmovements.

Once there are clearly marked boundariesbetweensigns,HMM recognitionis comparatively straightforward.Therecognitionprocessextractsthesignalcorrespondingtoeachsignindividually. It thenpickstheHMM thatyieldsthe maximumlikelihoodfor that signalasthe recognizedsign.

Training the HMMs to maximize recognitionperfor-manceis alsocomparatively straightforward. Initially, allsigns in the training set are labeled. For eachsign inthe dictionary, the training procedurethen computesthe

4

meanand covariancematrix over the data available forthat sign and assignsthem uniformly as the initial out-put probabilitiesto all statesin the correspondingHMM.It alsoassignsinitial transitionprobabilitiesuniformly tothe HMM’ s states.Unlike the initial outputprobabilities,initial transitionprobabilitiesdo not influencethe perfor-manceof thefully trainedHMMs greatly.

The trainingprocedurethenrunstheViterbi algorithmrepeatedlyon thetrainingsamples,soasto align thetrain-ing dataalong the HMM’ s states. The aligneddataarethenusedto estimatebetteroutputprobabilitiesfor eachstateindividually. This realignmentyieldsmajorimprove-mentsin recognitionperformance,becauseit increasesthechancesof the Baum-Welch reestimationalgorithm con-verging to an optimal or a near-optimal maximum. AfterconstructingthesebootstrappedHMMs, the training pro-cedurefinishesby reestimatingeachHMM in turn withthe Baum-Welch reestimationalgorithmoutlined in Sec-tion 4.2.

Theby far mostchallengingproblemin isolatedrecog-nition is extractinga featurevector that optimizesrecog-nition performance.Even after obtainingaccuratethree-dimensionaldata from our computervision methodde-scribedin Section3, we found that the featuresusedforrecognition— and the way that they are represented—greatly influencerecognitionperformance. The experi-mentalresultsgiven in Section8.1 demonstratehow thefeaturevectoraffectsperformance.

Thereareseveral reasonswhy performanceis so sen-sitive to choosingthe type of featurevector: First, somefeaturescarry more information than others; for exam-ple,three-dimensionalfeaturesaremorereliablethantwo-dimensionalones.Second,somefeaturesaremoreinvari-ant to changesin orientationandpositionthanothers;forexample,polarcoordinatesaremoreinvariantto rotationsthanCartesiancoordinates[1]. Third, thestatisticalprop-ertiesof somefeatureschange,dependingon thedurationof a sign. For this reason,the positionsof the handsinthree-dimensionalspaceperformbetterthanthevelocitiesof the hands(seealso Section8.2). Fourth, the statisti-cal distribution of the featuresduring thecourseof a signseemsto play a role. For somefeatures,their distributionfits Gaussiandensitiesnaturally, whereasfor othersit doesnot.

If thelatterexplanationholdstrue,weshouldseeama-jor improvementin recognitionperformancefrom usingmultiple GaussianmixturesastheoutputprobabilitiesforHMMs, insteadof usingjust onesingleGaussiandensity.However, we did not experimentwith multiple mixturesbecauseof thelackof sufficient trainingdata.

The numberof statesand the topology usedfor theHMMs is alsoimportant.Signlanguageasa time-varying

processlendsitself naturallyto aleft-rightmodeltopology.Findingtheoptimumnumberof states,which dependsontheframerateandonthecomplexity of thesignsinvolved,is anempiricalprocess.We usedthesamemodeltopologyfor all signs,anddeterminedexperimentallythat for ourtaskamodelwith 9 stateswassufficient,whichis depictedin Figure3. TheoutputprobabilitiesweresingleGaussiandensitieswith diagonalcovariancematrices,becausewehadinsufficient trainingdatafor multiplemixtures.

Figure 3: Left-right HMM topology for isolated ASLrecognition.

5.2 ContinuousRecognitionContinuoussignrecognition,ontheotherhand,is much

harderthanisolatedsign recognition.Thereis no silencebetweenthe signs,so the straightforward methodof us-ing silenceto distinguishboundariesfails. Here HMMsoffer the compellingadvantageof beingable to segmentthe streamsof signsautomaticallywith the Viterbi algo-rithm. Coarticulationeffectsfurthercomplicatecontinuousrecognition.We now discussthemin detail,beforewede-scribethetechniquesneededto trainHMMs for continuousrecognition.

5.2.1 The Coarticulation ProblemCoarticulationmeansthatthepronunciationof asignis

influencedby the precedingandfollowing signs. Oneofthe mostvisible effectsof coarticulationin ASL is that awiderangeof movementsareinsertedbetweensigns.

For example,the sign for “FATHER” is performedbyrepeatedlytappingtheforehead,andthesignfor “READ”is performedin neutralspacein front of thechest.If thesetwo signsareperformedin succession,anextramovementfrom theforeheadto neutralspaceappears(Figure4). Thisphenomenonis calledmovementepenthesis[5]. We dis-cussits implicationsfor ASL recognitionmorethoroughlyin [20].

Figure4: Movementepenthesis.Thearrow in themiddlepictureindicatesanextra movementbetweenthesignsfor“FATHER” and“READ” thatis notpresentin their lexicalforms.

5

Speechrecognizershandlecoarticulationby trainingphonemecontext-dependentHMMs. They traina separatemodelfor eachpossiblecombinationof threephonemesinsequencethatcouldoccurduringnaturalspeech.In princi-ple,thesameideaappliestosignlanguagerecognition,andweperformedsomeexperimentsto verify theapplicability,seeSection8.3.

A possibleway to train context-dependentmodelsforASL recognitionis to usewhole signsas the phonologi-cal unit in ASL.2 Thus,triphonecontext-dependentmod-elsfrom speechrecognitioncorrespondto tri-signcontext-dependentmodelsin ASL recognition. In otherwords,aseparatemodel is trained for eachcombinationof threesignsin sequence.The first andthe third sign in the se-quenceform the context for the middle sign, with whichthemodelis associated.

Tri-sign context-dependentmodeling,however, is pro-hibitively expensive, becauseit requires : %�� ' modelsoverall, where � is the vocabulary size. Collectingsucha large amountof training datanecessaryto obtain reli-ableestimatesfor themodelsis intractableevenfor smallvocabulary sizes. This intractability is a negative conse-quenceof usingwholesignsasthephonologicalunit. Un-like for speechrecognition,which hasto handleonly ap-proximately40 classesof allophones,there is no upperboundon the numberof modelsrequiredfor ASL recog-nition with wholesignsasthesmallestunit.

Therefore, we used only bi-sign context-dependentmodels,whichrequireamodelfor everypossiblecombina-tion of two signs.Themodelis associatedwith thesecondsign,andthefirst signformsits precedingcontext.

Bi-sign context-dependentmodeling requires : %�� 'models.Althoughthiscomplexity is animprovementover: %�� ' , it is still too largefor anythingbut a smallvocab-ulary. Speechrecognizersreducethenumberof modelsre-quiredbyusingtheobservationthatmany contextsareverysimilar. Therefore,they tie the parametersof the modelscorrespondingto similar contexts, suchthat the transitionandoutputprobabilitiesaresharedbetweenthesemodels.This techniquesignificantlyreducesthenumberof distinctmodels.

Parametertying is alsoapplicableto ASL recognition,but it is not as effective as for speechrecognition. Themain reasonfor the reducedeffectivenessis that move-mentepenthesisinsertsmany movementsunrelatedto thesigns’ lexical forms. The implication is that context-dependentmodelswill work well only with prohibitivelylargeamountsof trainingdata.

In fact, it is questionablewhethercontext-dependentmodeling is a good solution to the coarticulationprob-

2Thisassumptionis not correct:Wholesignsarenot thesmallestunitin ASL phonology, but this topic is beyondthescopeof thispaper.

lem in ASL recognitionat all. Movementepenthesisisa phonologicalprocessin ASL and shouldbe treatedassuch;thatis, themovementsinducedbyepenthesisaresep-aratephonemes.Usingcontext-dependentmodelsto cap-turethemis implausiblefrom aphonologicalpointof view.It seemsto make moresenseto modelthemovementsex-plicitly. We follow up on this ideain [20] andshow that itleadsto betterrecognitionperformance.

5.2.2 The Training ProcedureA sign in our datacollectedat naturalsigningspeeds

was between10 and 45 frames long, not counting theframesneededfor the transitionbetweensigns. Becauseof themovementsbetweensigns,theHMM topologymustbemoreflexible thantheonedescribedfor isolatedrecog-nition in Section5.1. Theseconsiderationsled usto usingtheleft-right modelshown in Figure5.

Figure5: Topologyof thecontext-dependentmodel. Thearcsthat skip statesallow the modelingof variabilitiesinthedurationof differentsigns.

Like for isolatedrecognition,we determinedthe opti-mal numberof statesexperimentally. For theoutputprob-abilities,wechoseasingleGaussiandensitywith diagonalcovariance,aswehadinsufficienttrainingdatafor estimat-ing full-rank covariancematrices.

Trainingcontinuousrecognitionmodelsis muchharderthantrainingisolatedrecognitionmodels,becauseit is dif-ficult to obtaingoodinitial estimatesof theHMM param-eters.Viterbi realignment(seeSection5.1) worksonly ifthetrainingdatais accuratelylabeled,includingthebound-ariesbetweentheindividualsigns.Obtainingthesebound-ariesis very difficult and time-consuming;even humanshave troubledeterminingwherea sign endsandthe nextonestarts.

The alternative to usingViterbi realignmentis usingaflat-startscheme.It consistsof computingtheglobalmeanandcovariancematrix over theentiretrainingdatasetandassigningtheseas the initial output probabilitiesto theHMMs. We usedthisschemeto initialize theHMMs.

Wethenusedembeddedtraining [24] to reestimatetheHMMs. Eachiterationof this procedureconcatenatestheHMMs correspondingto the individual signsin a trainingsentenceinto a singlelargeHMM. It thenreestimatestheparametersof thelargeHMM with asingleiterationof theBaum-Welchalgorithmdescribedin Section4.2,asusual.Thereestimatedparameters,however, arenot immediatelyappliedto theindividual HMMs. Instead,they arepooled

6

in accumulators,andappliedto theindividualHMMs onlyafter thetrainingprocedurehasiteratedover all sentencesin thetrainingset.

Hence,embeddedtraining effectively trains all mod-els in parallel with the entire training set. It yields bet-ter parameterestimatesthantrainingtheHMMs indepen-dently[24].

In the caseof context-independentmodels,using theflat start schemefollowed by several embeddedtrainingrunsis all thatis necessaryto trainHMMs for recognition.Context-dependentmodelsaremoredifficult to train thancontext-independentmodels,becausethetraininginvolvestwo extra steps. Theseconsistof generatingthe context-dependentmodels,and tying the parametersof HMMswith similar contexts(seealsoSection5.2.1).

The first extra step, which consists of generatingthe context-dependentmodels,requirescare,becauseforcontext-dependentmodels there exist far fewer trainingexamplesper model than for context-independentmod-els. In this case,embeddedtraining is likely to yield thebestparameterestimatesfor context-dependentmodelsifthey have alreadybeeninitialized with bettervaluesthanthe global meanandcovariancematrix from the flat-startscheme.

Therefore,weranseveralembeddedtrainingrunsonthecontext-independentmodelsand then generatedcontext-dependentmodelswith thesameparametersasthecontext-independentmodels. It is vital to avoid overtrainingthecontext-independentmodelsby keepingthenumberof ini-tial trainingpasseslow. Theprobabilitiesshouldnot havefully convergedyet. Otherwise,usingcontext-dependentmodelsactuallydecreasesrecognitionperformance.

The secondextra step,which consistsof tying the pa-rameters,is also vital to the context-dependentmodels’performance,especiallybecauseof our relative lack oftraining data. Tying parametersreducesthe numberofmodels,assignswith similar contexts thensharea com-monmodel.As a result,moretrainingdatapermodelbe-comesavailable.

Unfortunately, parametertying is a highly empiricalprocess.Ourexperimentsindicatedthattying thetransitionprobabilitiesproperlyhadthegreatestinfluenceon recog-nition results. We usedthe endinglocationsof the signsin the precedingcontext to decideon the tying. For ex-ample,the signsfor “BROTHER” and“SISTER” end inthe samelocation. As a result,the two modelsfor a signoccurringafter the signsfor “BROTHER” or “SISTER,”suchas “LIKE, ” cansharethe sametransitionprobabili-ties. We alsousedtheendinglocationsto decideon tyingthe outputprobabilities. For our dataset, the tying pro-cessreducedthe numberof modelsto lessthanonesixthof theiroriginalnumber.

6 Coupling of Vision and HMMsIn the precedingsectionwe reviewedhow HMMs can

be usedfor ASL recognition. The useof HMMs alone,however, imposessomelimitations,oneof which is insuf-ficiency of trainingdata,especiallywhile trainingcontext-dependentmodels.Furthermore,theprobabilitytheoryas-sumptionsunderlying the HMM theory, as describedinSection4.2, areoften not valid: Successive observationsareoftennot independent,the transitionfrom onestatetothe next often dependsnot only on the currentstate,butalsoon the statehistory, and the distribution of observa-tionsdoesnotnecessarilyresembleanormaldensity.

Anotherproblemis thattheHMM theorydoesnot pro-videfor any dynamicweightingof featuresdependingonasign’scontext. For example,theinvariantfeaturesfor somesigns,suchas “I,” are the endpointsof their movementswith respectto a bodypart,andthemovementsareunim-portant.For othersigns,only themovementsareinvariant.The partsof the featureset that shouldbe examinedandignoredfor eachclassof signsaremutuallyexclusive.

To alleviate theselimitations,we investigatedthe cou-pling of the HMM recognitionprocesswith an indepen-dent computervision-basedmotion analysisthat tempo-rally segmentsthe signalandextractsits geometricprop-erties.Theideais thata signcanbedescribedin termsofoneormoregeometricprimitives,suchashandmovementsalonga line, in a plane,or a circle. This ideais supportedby theexistenceof transcriptionsystems,suchastheHam-NoSys[14], thatbasethedescriptionof themovementsongeometricprimitives.

The presenceof three-dimensionalinformationis cru-cial for the couplingto work. In the past,geometricfit-ting of planeshasalreadybeenusedfor rough segmen-tation [12], but not for providing additional informationaboutthe natureof the fits to the HMM recognitionpro-cess.

6.1 Segmentationof the SignalTo extract the geometricpropertiesof the continuous

signalestimatedwith ourcomputervisionmethods,it mustfirst besegmentedtemporallyinto its parts.Any changeofthetypeof armmovementis likely to beaccompaniedbya dip in the velocity. Thus, minima in the absoluteval-uesof thevelocity vectorprovide stronghintsat segmen-tationboundaries.However, therearetypically many morevelocity minimathansegmentationboundaries.Thus,thesegmentationprocessmustprovide facilities to mergead-jacentsegments.

After performinginitial segmentationbasedon veloci-ties, our algorithmattemptsto fit geometricprimitivestothe individual segments.Thesecurrentlyconsistof lines,planes,andholds3 ata positionin space.

3A hold is a shortperiodof time, duringwhich no handmovements

7

The fit of a hold is determinedby computingthe co-variancematrix over thesegment’s positiondata. If thereis little movement,the eigenvaluesof the matrix in everydirectionaresmall,andconsequentlyits traceis small.

Theleast-squaresfit of a line is governedbyV�Q� � �

V� .X. � � ` %�� p@� � ' � .�. � (19)

where� � is thedistanceof � � to theline, and � is theline’sunit direction vector. Let � be a matrix containingthepoints �q� in the segmentsas its row vectors. Minimiz-ing Equation19 with respectto � correspondsto maxi-mizing � A � A � � . By Rayleigh’s principle,themaximal-eigenvalueeigenvectorof � A � maximizesthis equation,which is equivalentto themaximal-eigenvalueeigenvectorof the points’ covariancematrix. This eigenvectoris theline’s directionvector. Theothertwo eigenvaluesindicatethegoodnessof fit — thesmallerthey arewith respecttothelargesteigenvalue,thebetterthefit.

Theleast-squaresfit of a planeis governedbyV� � ��

V� .�. �h��p��.�. � (20)

where � � is the distanceof � � to the plane,and � is theplane’s unit normalvector. If � is a matrix containingthepoints � � asits row vectors,theminimal-eigenvalueeigen-vectorof � A � minimizesEquation20 with respectto � .Hence,minimizing this equationis equivalent to findingthe minimal-eigenvalueeigenvectorof the points’ covari-ancematrix. Theothertwo eigenvaluesindicatethegood-nessof fit — thelargerthey arewith respectto thesmallesteigenvalue,thebetterthefit.

Using least-squaresfitting is basedon the assumptionthat the signalnoiseterm is capturedby a normaldistri-bution. If this assumptionis not valid, the least-squaresestimatoris likely to yield poorresults,becauseof its sen-sitivity to outliers.Ontheotherhand,in three-dimensionalspace,the least-squaresestimatoris mucheasierto com-putethanmorerobust estimators.It would be interestingto compareits performanceon temporalsegmentationtotheperformanceof robustregressionestimators[13], suchastheleastmedianof squaresestimator[2, 11], or there-peatedmedianestimator[16, 9].

After the initial fit, the algorithmpools the primitivesinto a directedacyclic graph (DAG), schematicallyde-pictedin Figure6. Note that the individual segmentsarenot mutually exclusive; for example,datacan fit both aline anda plane.

If the algorithmfails to fit any geometricprimitivestosomesegment, it insertsthe segmentinto the DAG as a

take place.

“wild card,” which is definedconservatively to matchanykind of geometricprimitive. It thenattemptsto mergead-jacentsegmentsif they are compatible,in an attempttoeliminatespurioussegmentationboundaries.

We definedadjacentsegmentsto be compatiblefor amergeif they sharedthesametypeof geometricprimitivein similar orientations,andif the mergedsegmentstill fitthesametypeof geometricprimitiveasits constitutingseg-ments.In addition,we considereda wild cardto becom-patiblewith anothergeometricprimitive if this primitivealsofit themergedsegment.

PlaneLine

Line

Hold

Figure6: Geometricprimitivespooledinto a DAG. Cir-clesdenotesegmentationboundaries.Dottedarcsdenotepossiblenull transitions;they arenecessaryto compensatefor spurioussegments.Sometimesdatacanfit multiplege-ometricprimitives; in this DAG the dataof the first twosegmentsfit both a hold followedby a line, anda simpleline.

TheDAGnow givesall possiblesegmentsequencesthatarea valid representationof thesignal. If a sequenceis tobevalid, it mustbeobtainableby tracingapaththroughtheDAGfromtheleftmostsegmentationboundaryto theright-mostsegmentationboundary. In theexamplegivenin Fig-ure6 thesequences“Hold, Line,Plane,” and“Line, Plane”would both be valid sequences,but “Plane,Plane”wouldnot,becausethelatterdoesnot lie onany paththroughthatDAG.

Thisdiscussionhassofarignoredthepossibilityof spu-rioussegmentsarisingfromthevisionanalysis.Thatis, theanalysismight recognizea segmentthatshouldbepartofanother, but the merge processfails to merge it into an-othersegment. Themain reasonfor theexistenceof spu-rioussegmentsis undersampling.If a segmentconsistsofvery few samples,it is oftenimpossibleto extractreliableinformationfrom it. Our algorithmattemptsto solve thisproblemby addingarcsto theDAG from eachsegmenta-tion boundaryto the next (representedby the dottedarcsin Figure6). Thus,a paththroughtheDAG canoptionallyskip thesespurioussegments.

6.2 Using the Motion Analysiswith HMMsEachsignin thevocabularyhasassociatedoneor more

templatesthat comprisethe sign’s geometricprimitiveswith weightsof eachfeature’s relative importance.These

8

primitivesarematchedagainstthosein theDAG. Assum-ing thatthesegmentationprocessyieldscorrectresults,thefollowing must be true: If a sequenceof signsis repre-sentedby theinputsignal,thesequenceof geometricprim-itivescorrespondingto thesignsmustform a paththroughtheDAG. We call sucha sequenceof signsvalid with re-spectto thecomputervisionDAG.

This observationsuggestsanapplicationof themotionas a backupcheckfor the HMM framework. First rec-ognizea candidatesentencefrom the input signalvia theViterbi algorithm.Thengenerateall possiblesequencesofgeometricprimitivescorrespondingto therecognizedsignsand constructanotherDAG from them. Using dynamicprogramming,matchthetwo DAGsagainsteachother. Ifthetwo DAGssharea commonpath,acceptthecandidatesentenceascorrect. Otherwise,reject the candidatesen-tenceasincorrect.

Thejustificationfor this algorithmcomesfrom thefol-lowing propertiesof the DAGs: If the two DAGsshareacommonpath,thereis a sequenceof geometricprimitivesthat forms a paththroughthecomputervision DAG. Fur-thermore,this sequenceof geometricprimitivesis oneofthe possiblesequencesgeneratedfrom the candidatesen-tence. Thus, the candidatesentenceis valid with respectto thecomputervision DAG. Conversely, if no suchcom-monpathexists,noneof thesequencesof geometricprim-itivesgeneratedfrom thecandidatesentenceformsa paththrough the computervision DAG. Thus, the candidatesentenceis not valid with respectto the computervisionDAG andshouldberejected.

6.3 Discussionof the CouplingTheHMM recognitionalgorithmandthevisionmatch-

ing algorithmcomplementeachother. TheadvantagesoftheHMM recognitionmethodareautomaticsegmentationduring both training andrecognition,anda fully formal-ized training procedure.The disadvantagesarepoor per-formancein the presenceof insufficient training data,noformal way to weight featuresdynamically, andpossibleviolationsof thestochasticindependenceassumptions.

Theadvantagesof thevision matchingmethodarethepossibilityof weightingtherelative importanceof featuresdynamically, and independencefrom insufficient trainingdata. A significantdisadvantageis thatestimatingthege-ometricpropertiesof the signsin the vocabulary requiresmanuallabelingandanalysisof thedata.Furthermore,seg-mentationmustbedoneexplicitly, which raisesthepossi-bility of spurioussegments,asdescribedin Section6.1,orthepossibilityof missingsegments.Coarticulationsome-timesalsochangesthegeometricpropertiesof thesignal,suchthatthetemplatesfor thecorrectsequenceof signnolongermatchthe actualsignal. Copingwith the changesin thegeometricpropertiesis an importanttaskfor future

research.

7 Data CollectionFor our experimentswe collecteddata,usingboth our

computervision system,and an AscensionTechnologiesFlock of Birds. Thereasonfor usingthe latterwasthat itis fasterat this point thanthecomputervision system,andhencemoresuitablefor prototyping.

The computervision systemyields rotation, � � , andtranslation,�� , of eachsegmentof thearm,asdescribedinSection3. Figure7 givesanexampleof thecomputervi-siontrackingprocess.Theimagesshow thehighaccuracyof thecomputervision system;in fact,it is comparabletotheaccuracy achievedby theFlockof Birdssystem.

TheFlockof Birds systemconsistsof a magnetandsixsensorsthat detecttheir rotation, �� , andtranslation, ��|� ,with respectto the magnetat 25 framesper second.Weusedthe datafrom both systemsinterchangeablywith asimplealignmentof coordinatesystems.The coordinatesystemwasright-handed,with theorigin at thebaseof thesigner’sspineandthe � axisfacingup.

Figure 7: Fitting the three-dimensionalmodels to thesigner’s arms. From top to bottom, the signs for “FA-THER,” “I,” and“MAIL ” aredisplayed.Fromleft to right,thefront, side,andtopviewsaredisplayed.

We used the 53-sign vocabulary listed in Table 1.TheirpronunciationsfollowedtheASL dialectusedin thePhiladelphia,PA, area. The goalsin choosingthe vocab-ulary wereto beableto expresssentencesthatcouldhaveoccurredin a naturalconversation,andto make intensiveuseof the signingspace,so asto demonstratethe advan-tagesof three-dimensionaldataovertwo-dimensionaldata.Wecollected486continuousASL sentences,eachbetween

9

Category SignsusedNouns America, Christian, Christmas, book,

brother, chair, college, family, fa-ther, friend, interpreter, language,mail,mother, name,paper, president,school,sign,sister, teacher

Pronouns I, my, you,your, how, what,where,whyVerbs act,can,give,have,interpret,like,make,

read,sit, teach,try, visit, want,will, winAdjectives deaf,good,happy, relieved,sadOther if, from, for, hi

Table1: Thecomplete53signvocabulary

2 and12 signslong, with a total of 2345signs. Theonlyconstraintsontheorderandoccurrenceof signswerethosedictatedby thegrammarof ASL [19].

Furthermore,we collectedexamplesof eachsign forisolatedrecognition. Becausepart of the datawerecor-ruptedduringthecollectionprocess,wediscardedall signsfor whichwedid nothaveenoughintacttrainingexamples.This left 656examplesovera rangeof 40signs.Eachsignhadat least6 examplesavailablefor thetrainingset,and2examplesavailablefor thetestset.

8 ExperimentsWe performedisolated,continuous,and vision-HMM

coupledASL recognitionexperiments.WeusedEntropic’sHidden Markov Model Toolkit (HTK) Version 2.02 fortrainingandtestingin all of ourexperiments.

8.1 IsolatedRecognitionExperimentsThegoalof theisolatedrecognitionexperimentswasto

discover a setof featuresthat maximizesHMM recogni-tion performance.We useddifferentfeaturesin ourexper-iments,includingwrist positioncoordinatesof bothhands(denotedby �O&�|] ), wrist positionexpressedin polarco-ordinatesin the � - � plane (denotedby ¡£¢?¤�]¥�¢£¤ ), polarcoordinatesin the � - plane(denotedby ¡?¢£¦�&¥¢?¦ ), wristposition expressedin sphericalcoordinates(denotedby¡]¥(K§ ), andwrist orientationangle(denotedby

b), aswell

asderivativesof these(denotedby a dot). We alsocom-binedseveralfeaturesin someexperiments.

We ran repeatedexperiments,morethan � 3 3�3�3 total,with differentfeaturesandrandomlyselectedtrainingandtestsetson a per-experimentbasis. Threequartersof theexamplesfor eachsignwerein thetrainingsetandtherestwerein thetestset.Eachselectionyielded178testexam-plesperexperiment.Sometypical resultsaregivenin Ta-ble 2. In addition,we performedexperimentsto comparethe merits of using three-dimensionalcoordinatesversustwo-dimensionalcoordinatesby projectingthecoordinatesonplanes.Theresultsareshown in Table3.

Features ~ ¨ B W N��]�GK 98.42% 0.99% 100.0% 93.8% 463¡ ¢£¤ &¥ ¢£¤ , 98.72% 0.79% 100.0% 95.5% 494¡?¢£¤�&¡£¢?¦ ,¥¢?¤�]¥�¢£¦ ,��]�GK 98.78% 0.78% 100.0% 94.9% 882¡&¥(L§ 96.48% 1.31% 100.0% 93.3% 210©�O ©�| © 96.87% 1.21% 100.0% 93.3% 167��]�GK ªb

98.25% 0.92% 100.0% 95.5% 167©¡ ¢£¤ ©¥ ¢£¤ ,© 96.28% 1.04% 98.9% 93.8% 120©¡« ©¥( ©§ 95.89% 1.29% 98.9% 92.1% 150

Table2: Resultsof isolatedsign recognitionwith three-dimensionalfeatures. ~ , ¨ , B, W, and N correspondtotheaveragepercentageof correctlyrecognizedsigns,stan-darddeviation,bestcase,worstcase,andnumberof exper-iments,respectively. All experimentsusedatestsetof 178signs.

Features ~ ¨ B W N¡ ¢£¤ &¥ ¢?¤ 98.06% 1.26% 100.0% 94.9% 118��&� 97.75% 1.20% 100.0% 94.9% 118

Table 3: Resultsof isolatedsign recognitionwith two-dimensionalfeatures.Themeaningof the columnsis thesameasin Table2.

8.2 Analysisof IsolatedRecognitionThe low error ratesof the bestfeaturesetsshow that

with a good selectionof features,the hand movementsalone,without handconfigurationinformation,carry suf-ficient informationto discriminateamongmany differentsigns. Polarcoordinatesslightly outperformedCartesiancoordinates.A combinationof both yielded the bestre-sults,althoughthedifferenceis not significant. However,thestandarddeviationof thecombinedfeaturesetwaslow-est,indicatingthatacomplex featurevectoris morerobustthana simplefeaturevector.

Positioncoordinatessignificantlyoutperformedveloc-ities. The reasonfor the poor performanceof velocityfeaturesis that the statisticalpropertiesof the velocitieschangewith variationsin thesign’s duration. In contrast,thestatisticalpropertiesof positioncoordinatesarelargelyunaffectedby thedurationof signs,becauseHMMs absorbvariationsin durationthroughtransitionslooping backtothe samestate. Yet, positioncoordinateshave the signif-icantdisadvantagethat they arenot invariantwith respectto location.Thelackof invariancewill causeproblemsforfuture applicationsthat attemptto capturecommonalitiesbetweenmovementsatdifferentlocationsin space.

10

Three-dimensionalfeaturesperformedbetterthantwo-dimensionalfeatures,althoughthedifferenceis not large.The differencewould probablybecomemore significantwith a largervocabulary. The differencesin standardde-viation, however, indicatethat three-dimensionalfeaturesaremorerobustthantwo-dimensionalfeatures.

It is an importantconsequenceof the experiments’re-sults that the performanceof the featurevectorsdependson theactualexamplesin thetrainingset,all otherfactorsbeingequal.Thus,only performinga largenumberof ex-perimentsyieldsreliableestimatesof therelativemeritsofdifferentfeatures.

8.3 ContinuousRecognitionExperimentsWe split the486sentencesrandomlyinto a trainingset

with 389 examplesanda testsetwith 97 examples(con-taining 456 signs). Eachsign in the vocabulary occurredat least once in the test set. The training and test setswerethesamethroughoutall experiments,andno portionof the testsetwasusedfor training in any way. We ranthree-dimensionalexperimentswith andwithout context-dependentHMMs, andtwo-dimensionalexperiments(byprojectingthedataonplanes;theresultsgivenarethebestthatwe found).

In accordancewith the results from isolated experi-mentsthat position coordinatesperform better than ve-locities, and that a complex featurevector is more ro-bust thana sparseone,we choseour featurevectorto be% ��&�|K ª]¥ ¢£¤ &¥ ¢£¦ ©�� ©�| © (

b' for bothhands.Thatis, it con-

sistedof Cartesianandpolarpositioncoordinates,veloci-ties,andwrist orientationangles.Thetaskgrammarwasasimpleword loop, soevery signwasequallylikely at anytime in theHMM network.

Table4 shows the experimentalresults. We usewordaccuracy asourevaluationcriterion.It is computedby sub-tractingthenumberof insertionerrorsfrom thenumberofcorrectlyspottedsigns.Thenumberof wordsin theresultfor two-dimensionaldatais lower thanin theotherresults,becausefor onesentencetheViterbi beam-searchingopti-mizationprunedall pathsthroughtheHMM network (seealsoSection4.2).

8.4 Analysisof ContinuousRecognitionThe resultsare clearly in favor of using three-dimen-

sionaldataover two-dimensionalfor continuousrecogni-tion. The6.3percentdifferenceis large,although,accord-ing toourexperienceswith isolatedrecognition,oneexper-imentis notenoughto estimatetherealdifferencereliably.

Context-dependentmodelsoutperformedcontext-inde-pendentmodels, but the increasein performancewassmall, probably to a large extent becauseof insufficienttrainingdata— context-dependentmodelingrequireshugeamountsof data to becomeeffective. Also, cross-signcontext-dependentmodelingfor ASL is implausiblefrom

Typeof Wordexperiment accuracy Details3D context 87.71% H=416,D=8, S=32independent I=16,N=4563D context 89.91% H=424,D=6, S=26dependent I=14,N=4562D context 83.63% H=394,D=14,S=44dependent I=16,N=452

Table 4: Resultsof continuousrecognitionexperiments.H denotesthe numberof correctsigns,D the numberofdeletionerrors,S the numberof substitutionerrors,I thenumberof insertionerrors,andN thetotalnumberof signsin thetestset.

a phonologicalpoint of view (seeSection5.2.1). Theal-ternative is modelingmovementepenthesisdirectly, anditappearsto performbetter[20].

More thanhalf of thesubstitutionerrorsin eachexperi-mentwereconfusionsbetween“I” and“MY,” and“Y OU”and“Y OUR,” whichdiffer only in handconfiguration.Weexpectthataddingfeaturesdescribingthehandconfigura-tion will improverecognitionperformancesignificantly.

Repeatingthecontext-dependentexperimentwith five-bestrecognitionshowedthattheabsenceof astronggram-marfor constrainingtheHMM network degradesrecogni-tion performancesignificantly. In many cases,thecorrectsentencewastheonlygrammaticalsentenceamongthefivebestcandidates.In othercases,all fivecandidateswereun-grammatical.

Unfortunately, using a strong grammarfor a test setasdiverseasours is not practical,becausethe sizeof anHMM network grows exponentiallywith the numberofrulespresentin thegrammar. Statisticallanguagemodels,suchasbigrammodels,haveprovedto beaneffectivesolu-tion to thisproblemin speechrecognition.Weshow in [20]thatbigramlanguagemodelsarepromisingfor ASL recog-nition aswell. However, they requirea largecorpusof la-beledreal-world datato becometruly effective. Presently,nosuchcorpusexistsfor ASL.

8.5 Coupling ExperimentsTo investigatethe effectsof couplingthe three-dimen-

sionalmotionanalysiswith theHMM framework, weper-formedtwo experiments.In thefirst experiment,we ana-lyzedall sentencesin thetestsetwith ourmotionanalysis,soasto provideanupperboundon its performance.If themotion analysishadworked perfectly, it shouldhave ac-ceptedall of these97 testsentences.In reality, however, itrejected10outof these97sentences.

A closerlook at the10 rejectedsentencesrevealedthatfive of thesewerenot recognizedcorrectlyby thecontext-dependentHMMs either. Thus,it is likely that thesefive

11

sentenceswerenotsignedpreciselyenoughduringthedatacollectionprocess.Theotherfive rejectedsentencesindi-catethatthemotionanalysisstill needsimprovement.

In thesecondexperiment,weranthecouplingalgorithmon the actual recognitionhypothesesfrom the context-dependentHMMs in theexperimentsin Section8.3. Thistime,thealgorithmalsoeliminated10sentencesoutof 97.Five of thesewere correctly rejected;that is, the HMMframework hadprovidedincorrectresultsfor them. Thus,at thecurrentmoment,couplingHMMs with motionanal-ysisbreaksevenwith usingtheHMM framework by itself.Theword accuracy achievedby thecouplingwas90.10%,which is slightly better than the 89.91%word accuracyachievedby thecontext-dependentmodelsalone.

As we have usedonly a small part of the full powerof computervision motion analysisso far, we seetheseresultsasevidencethatcouplingwill eventuallybeabletooutperformeithermethodindependently.

9 SummaryWehavedevelopeda framework for recognizingAmer-

icanSignLanguagefrom three-dimensionaldataobtainedwith computervision techniques.We showedhow to col-lect three-dimensionaldatafrom computervision andusethem as input to Hidden Markov Models. We also de-terminedthat three-dimensionalfeaturesaresuperiorovertwo-dimensionalones.

By using context-dependentmodeling, we improvedrecognitionperformance. Throughcoupling vision pro-cesseswith HiddenMarkov Models,we took a first steptowardovercomingthe limitationsof eithermethodby it-self.

10 Future workThe collection of a standardizedcorpusof real-world

ASL conversationsandstory telling shouldbea high pri-ority for future work. The currentlack of sucha corpusmakesit impossibleto compareresultsfrom differentre-searchers.Furthermore,it makesthe developmentof sta-tistical languagemodelsfor ASL difficult. Suchlanguagemodelsarenecessaryfor large-scaleapplications.

Testing the algorithms describedin this paper andin [20, 21] with alargervocabularyis alsoimportant.Onlythenit will bepossibleto judgehow well thesealgorithmsscale.

On the linguistic sideof ASL recognition,futureworkshouldincorporatefacialexpressionsandotherphonolog-ical processesin ASL into the recognitionframework. Italsoneedsto addresshow to make useof handconfigura-tion information;usingthis informationeffectively seemsto benontrivial. Furthermore,futurework hasto find waysto use statistical languagemodels, so as to counterbal-ancetheimpracticabilityof usingstronglyconstrainedtask

grammars.On thecomputervisionsideof ASL recognition,future

work shouldelaborateon thecouplingof computervisionandHMMs andmake the computervision analysismorerobust. This work shouldconsistof recognizingmoredif-ferentgeometricproperties,fine-tuningthesigntemplates,andfine-tuningthe dynamicweightingof featuresbasedonthepropertiesof eachsignthatis matchedto thesignal.It alsoneedsto addresscoarticulationeffects,which it hasignoredsofar.

It is alsonecessaryto to developananthropometricallycorrectmodelof thehumanhand,sothatthecomputervi-sion trackingprocesscanmake handconfigurationinfor-mationavailableto therecognitionframework.

AcknowledgmentsThisworkwassupportedin partbyaNSFCareerAward

NSF-9624604,ONR-DURIP’97 N00014-97-1-0385 andN00014-97-1-0396, ONR Young InvestigatorProposal,andNSFIRI-97-01803.IoannisKakadiarishelpedin ob-tainingthecomputervisionsamples.

References[1] L. W. Campbell,D. A. Becker, A. Azarbayejani,A. F. Bo-

bick, and A. Pentland. Invariant featuresfor 3-D gesturerecognition.2ndInternationalWorkshoponFaceandGestureRecognition,Killington, VT, 1996.

[2] H. EdelsbrunnerandD. L. Souvaine. Computingleastme-dianof squaresregressionlinesandguidedtopologicalsweep.Journal of the American Statistical Asscoiation, 85:115–119,1990.

[3] R. Erenshteyn andP. Laskov. A multi-stageapproachto fin-gerspellingandgesturerecognition.Proceedingsof theWork-shopon the Integrationof Gesturein LanguageandSpeech,pp.185–194,Wilmington,DE, 1996.

[4] K. GrobelandM. Assam.IsolatedsignlanguagerecognitionusinghiddenMarkov models.SMC’97,pp.162–167.

[5] S. K. Liddell andR. E. Johnson.AmericanSignLanguage:Thephonologicalbase.Sign Language Studies, 64:195–277,1989.

[6] I. A. Kakadiaris,D. Metaxas,and R. Bajcsy. Active part-decomposition,shapeandmotionestimationof articulatedob-jects:A physics-basedapproach.CVPR’94,pp.980–984.

[7] I. A. Kakadiarisand D. Metaxas. 3D humanbody modelacquisitionfrom multipleviews. ICCV’95, pp.618–623.

[8] I. A. KakadiarisandD. Metaxas. Model basedestimationof 3D humanmotion with occlusionbasedon active multi-viewpoint selection.CVPR’96,pp.81–87.

[9] J. Matousek, D. Mount, and N. S. Netanyahu. Effi-centrandomizedalgorithmsfor the repeatedmedianline es-timator. To appearin Algorithmica. Available online athttp://www.cs.umd.edu/˜mount/pubs.html .

12

[10] D. Metaxas.Physics-based Deformable Models: Applica-tions to Computer Vision, Graphics and Medical Imaging.Kluwer AcademicPublishers,November1996.

[11] D. Mount, N. Netanyahu,K. Romanik,R. Silverman,andA. Y. Wu. A practicalapproximationalgorithmfor theLMSline estimator. 8th ACM-SIAM Symposiumon DiscreteAl-gorithms,pp.473–482,1997.

[12] Y. NamandK. Y. Wohn. Recognitionof space-timehand-gesturesusingHiddenMarkov model. ACM SymposiumonVirtual Reality Software and Technology, pp. 51–58,HongKong,1996.

[13] N. S. Netanyahu, V. Philomin, A. Rosenfeld,and A. J.Stromberg. Robust detectionof straight and circular roadsegments in noisy aerial images. Pattern Recognition,30(10):1673–1686,1997.

[14] S. Prillwitz et al. HamNoSys. Version 2.0; Hamburg Nota-tion System for Sign Languages. An introductory guide, vol-ume5 of International Studies on Sign Language and Com-munication of the Deaf. SignumVerlag,Hamburg, 1989.

[15] L. R. Rabiner. A tutorial on HiddenMarkov Modelsandselectedapplicationsin speechrecognition. Proceedings ofthe IEEE, 77(2):257–286,February1989.

[16] A. F. Siegel. Robust regressionusing repeatedmedians.Biometrika, 69:242–244,1982.

[17] J. M. SiskindandQ. Morris. A maximum-likelihoodap-proachto visualeventclassification.ECCV’96.

[18] T. StarnerandA. Pentland.Visualrecognitionof AmericanSign LanguageusingHiddenMarkov models. InternationalWorkshopon AutomaticFaceand GestureRecognition,pp.189–194,Zurich,Switzerland,1995

[19] C. Valli andC. Lucas. Linguistics of American Sign Lan-guage: An Introduction. GallaudetUniversity Press,Wash-ingtonDC, 1995.

[20] C. VoglerandD. Metaxas.AdaptingHiddenMarkov mod-elsfor ASL recognitionby usingthree-dimensionalcomputervisionmethods.SMC’97,pp.156–161.

[21] C. Vogler and D. Metaxas. ASL recognitionbasedon acouplingbetweenHMMs and3D motionanalysis.ICCV’98,pp.363–369.

[22] M. B. WaldronandS. Kim. IsolatedASL signrecognitionsystemfor deafpersons.IEEE Transactions on RehabilitationEngineering, 3(3):261–71,September1995.

[23] M. WaleedKadous. Machinerecognitionof AuslansignsusingPowerGloves:Towardslarge-lexiconrecognitionof signlanguage.Proceedingsof theWorkshopon theIntegrationofGesturein LanguageandSpeech,pp. 165–174,Wilmington,DE, 1996.

[24] S.Young,J.Jansen,J.Odell,D. Ollason,andP. Woodland.The HTK Book (for HTK 2.0). CambridgeUniversity, 1995.

13

ASL Recognition Based on a Coupling Between HMMs and 3D Motion Analysis · 2001-12-03 · ASL...

Documents

Transcript of ASL Recognition Based on a Coupling Between HMMs and 3D Motion Analysis · 2001-12-03 · ASL...