ASL Recognition Based on a Coupling Between HMMs and 3D Motion Analysis · 2001-12-03 · ASL...
Transcript of ASL Recognition Based on a Coupling Between HMMs and 3D Motion Analysis · 2001-12-03 · ASL...
An� earlier versionof this paper appearedin the proceedingsof the Inter national Conferenceon Computer Vision,pp. 363–369,Mumbai, India, January 4–7,1998
ASL RecognitionBasedon a Coupling BetweenHMMs and 3D Motion Analysis
ChristianVoglerandDimitris MetaxasDepartmentof ComputerandInformationScience,Universityof Pennsylvania,Philadelphia,PA 19104-6389
[email protected], [email protected]
AbstractWe present a framework for recognizing isolated and
continuous American Sign Language (ASL) sentences fromthree-dimensional data. The data are obtained by us-ing physics-based three-dimensional tracking methods andthen presented as input to Hidden Markov Models (HMMs)for recognition. To improve recognition performance,we model context-dependent HMMs and present a novelmethod of coupling three-dimensional computer visionmethods and HMMs by temporally segmenting the datastream with vision methods. We then use the geomet-ric properties of the segments to constrain the HMMframework for recognition. We show in experimentswith a 53 sign vocabulary that three-dimensional featuresoutperform two-dimensional features in recognition per-formance. Furthermore, we demonstrate that context-dependent modeling and the coupling of vision methodsand HMMs improve the accuracy of continuous ASL recog-nition.
1 Intr oductionAmericanSignLanguage(ASL) is theprimarymodeof
communicationfor many deafpeoplein the USA. It is ahighly inflectedlanguagewith sophisticatedgrammaticalproperties,which constrainstronglytheorderandappear-anceof signs.Becauseof theconstraints,it providesanap-pealingtestbedfor understandingmoregeneralprinciplesgoverninghumanmotionandgesturing,includinghuman-computergestureinterfaces.Suchinterfacesareessentialin virtual reality applications,wheretheusermustbeableto manipulatevirtual objectsby gesturing.A workingASLrecognitionsystemcouldalsofacilitateinteractionof deafpeoplewith their surroundings.
To date, most attemptsat ASL recognitionhave ei-ther used only two-dimensionalcomputervision meth-ods,or they have usedother input devices,suchasdata-gloves, insteadof computervision, to collect input fromthesigner[18, 3, 23]. In this paperwe presenta new ap-proachto ASL recognition.First, we usecomputervisionmethodsto extract the three-dimensionalparametersof asigner’s armmotions. We thenuseHiddenMarkov Mod-els(HMMs) to recognizeisolatedandcontinuousASL ut-terancesfrom the three-dimensionalinput. We developcontext-dependentmodelingof HMMs and methodsfor
coupling the applicationof HMMs and the applicationof three-dimensionalcomputervision methodsto improvecontinuousrecognitionperformance. Our approachat-temptsto overcomesomeof thelimitationsof thepreviousapproachesthat usetwo-dimensionalvisual input, do notusecontext-dependentmodeling,or do not couplecom-putervisionmethodswith HMMs [18, 3, 17, 12].
Three-dimensionalimage-basedshape and motiontracking of a human’s arm and handis difficult becauseof the complexity of the motionsand occlusioneffects.Recently, a methodologyhasbeendeveloped[8, 10] thatallows three-dimensionaltrackingof humanmotion frommultiple images.In this paperwe augmentthis methodol-ogy to track the three-dimensionalmotion of a subject’sarms and handsfrom multiple images. This methodisbasedon theuseof deformablemodels,whoseshapeandmotion fits the given imagesequencesbasedon occlud-ing contourinformationandtheoremsfrom projective ge-ometry. The outputof this methodconsistsof the three-dimensionalmotionparametersof thesubject’s arms.Forefficiency reasons,and becausearm movementsalreadycarrymuchof theinformationneededfor recognizingASLsigns,wedonotusethehandinformationin thispaper.
Apart from obtainingaccuratedata,ASL recognitionis difficult, becausetherearealwaysstatisticalvariationsin the way humansperformmotions,even with identicalmeaning. In addition, in continuousutterances,therearenoclearboundariesbetweenindividualsigns.HMMs pro-videaframework for capturingstatisticalvariationsin bothpositionanddurationof themovement,aswell asimplicitsegmentationof theinputstream.Furthermore,continuousrecognitionis complicatedby coarticulationeffects,thatis,thepronunciation1 of a signis influencedby theprecedingandfollowing signs. Coarticulationeffectscanbe partlyalleviatedby trainingcontext-dependentHMMs.
The theorybehindHMMs makesseveral assumptionsthatareoftennot valid in practice.For this reason,we de-velopa new approachthatcouplescomputervision meth-odswith HMM modeling. It is basedon a temporalseg-mentationprocessthat operatesby extracting geometricpropertiesof the three-dimensionalcomputervision pa-
1By “pronunciation”we meanmotion. We follow theterminologyofspokenlanguagelinguisticswhereapplicable.
1
rameters. Thesepropertiesare obtainedindependentlyfrom the HMM algorithmsandareusedto imposeaddi-tionalconstraintsonHMM-basedrecognition.
To testouralgorithmsandassumptions,weperformedaseriesof experimentsbasedon a vocabulary consistingof53differentsignsthatmakeextensiveuseof space.Weex-perimentedwith bothisolatedandcontinuousASL recog-nition for both three-dimensionaland two-dimensionaldata. As HMMs require large amountsof training dataandthecomputervisionprocessis computationallyexpen-sive,we useddatafrom anAscensionTechnologiesFlockof Birdsandcomputervisionprocessesinterchangeably.
Ourgoalis to discoverandanalyzeausableframeworkfor bothisolatedandparticularlycontinuousASL recogni-tion. We do not addressmoregeneralgesturerecognitiontopicsandsignerindependencein this paper. Neitherdowe addresstheinvolvedaspectsof ASL linguistics[19] atthis point, but obviously, a viable futureASL recognitionsystemshouldbeableto handlethem.
In the following sections,we discussrelatedwork andgive an overview on the theory behindthe vision meth-odsandHMMs. Afterward,we addresstheuseof HMMsfor isolatedandcontinuousASL recognition,andcouplingcomputervision processeswith theHMM algorithms.Fi-nally, we outlinedatacollectionandprovideexperimenta-tion resultsfor isolatedandcontinuousrecognitionandthecouplingof computervisionandHMMs.
2 PreviousWorkPrevious work on sign languagerecognitionfocuses
primarily on fingerspellingrecognitionand isolatedsignrecognition. Somework usesneural networks [3, 22].For this work to applyto continuousASL recognition,theproblemof explicit temporalsegmentationmustbesolved,whichis alimitation thatHMM-basedrecognitiondoesnothave. MohammedWaleedKadous[23] usesPowerGlovesto recognizeasetof 95 isolatedAuslansignswith 80%ac-curacy, with anemphasison computationallyinexpensivemethods.Kirsti GrobelandMarcellAssam[4] useHMMsto recognizeisolatedsignswith 91.3%accuracy out of a262signvocabulary. They extract thefeaturesfrom videorecordingsof signerswearingcoloredgloves.
Thereis very little previous work on continuousASLrecognition. ThadStarnerand Alex Pentland[18] useaview-basedapproachto extract two-dimensionalfeaturesas input to HMMs with a 40 word vocabulary. YangheeNam andKwangYoenWohn [12] usethree-dimensionaldataasinputtoHMMs for continuousrecognitionof averysmallsetof gestures.
3 Model-basedTracking of a Human’sArmsIn thissectionwegivea brief overview of our formula-
tion thatallows the three-dimensionalarmshapeandmo-
tion estimationfrom multiple images[6, 7, 8, 10].
Our approachconsistsof two parts.Thefirst part[6, 7]consistsof anactive,integratedapproachthatidentifiesre-liably thepartsof amovingarticulatedobjectandestimatestheir shapeand motion from a controlled set of motionsthatrevealtheobject’s structure.We usethealgorithmde-velopedin [6, 7], which segmentstheapparentbodycon-tourof amovinghumaninto theconstituentparts.Initially,a singledeformablemodelis usedin orderto fit theimagedata.As themodeldeformsto fit thedeformed(dueto themotionof thehuman)subsequentimagecontours,a novelHuman Body Part Identification Algorithm (HBPIA) isdevelopedto identify all the body parts. By applyingtheHBPIA iterativelyoverthesubsequentframes,all themov-ing partsareidentified.In addition,we have extendedthisalgorithmto allow theestimationof thethree-dimensionalshapeof asubject’sbodyparts,basedon theintegrationofimagestakenfrom threeorthogonallyplacedcameras.Weusedthis methodologyto estimatethe three-dimensionalshapeof thesubject’s armsshown in theexamplesin Sec-tion 7. It is worthnotingthatwe have recoveredthelowerarmandthehandasonepart,sincein ourASL recognitionexperimentswe did not usethe motion of the lower armandthehandrelative to eachother.
The secondpart of the algorithmconsistsof usingtheextractedthree-dimensionalshapeof the arm to track thethree-dimensionalposition and orientationof a subject’sbodyparts[8]. To alleviatedifficultiesarisingfrom occlu-sionanddegenerateviewsduringtheunconstrainedmove-mentof the arm, we usethreecalibratedcamerasplacedin a mutually orthogonalconfiguration. At every imageframeand for eachbody part, we derive a subsetof thecamerasthatprovidethemostinformativeviewsfor track-ing. This active and time varying selectionis basedonthevisibility of apartandtheobservability of its predictedmotion from a certaincamera.Oncea setof camerashasbeenselectedto trackeachpart,weuseconceptsfrom pro-jective geometryto relatepointson theoccludingcontourto pointson the three-dimensionalshapemodel. Using aphysics-basedmodelingapproach,we transformthis cor-respondence,in addition to two-dimensionalforcesaris-ing from the discrepancy betweenthe model’s occludingcontourand the imagedata, into generalizedforcesthatareappliedto the model to estimatethe model’s transla-tional androtationaldegreesof freedom. To improve thetrackingresultsfurther, the dynamicsystemis embeddedwithin an extendedKalmanfilter framework, andwe usethe predicted motion of the model at eachframe to es-tablishpoint correspondencesbetweenoccludingcontoursandthethree-dimensionalmodel.
We usedthis two-stepapproachto track the motionofthesubject’s armsperformingtheASL gestures,asshown
2
in Section7. Theoutputof thesystemis a setof rotation,��� , andtranslation,��� , parametersthatwe useasinput tothe HMMs and the vision-basedsegmentationalgorithmpresentedin thefollowing sections.
4 Hidden Mark ov ModelsHiddenMarkov Models (HMMs) area type of statis-
tical model. They have beenusedsuccessfullyin speechrecognition,andrecentlyin handwriting,gesture,andsignlanguagerecognition.Wenow giveasummaryof thebasictheorybehindHMMs, which is coveredin detailin [15].
4.1 Definition of HMMsAn HMM consistsof anumber� of states����� ���������� � , togetherwith transitionsbetweenstates.Thesystemis
in oneof theHMM’ sstatesatany giventime. At regularlyspaceddiscretetimeintervals,thesystemtakesanoutgoingtransitionfrom its currentstateto a new state.
Eachtransitionfrom � � to ��� hasanassociatedproba-bility ����� of beingtaken. Hence, � � ��������� . Eachstate� � alsohasan initial probability ��� of thesystemstartingin ��� . In addition,eachstate � � generatesoutput �! #" ,which is distributedaccordingto a probabilitydistributionfunction $ �&% �(')�+*-, Outputis ��.Systemis in � �0/ . An ex-ampleisgivenin Figure1. Themodeldepictedthereis alsoanexampleof a left-right model; that is, � ���2143 implies57698 � In otherwords,transitionsonly flow forwardfromlower statesto the samestateor higherstates,but neverbackward. This topologyis themostcommonlyusedonefor modelingprocessesover time.
a11 a24
a23a12
55a
b5b4b3b2b1
a34 a45S1 S S S S2 3 4 5
Figure1: Exampleleft-right HMM with its transitionandoutputprobabilities.“Left-right” meansthattransitionsoc-curonly from left to right, andneverbackward.
4.2 The Thr eeFundamentalHMM ProblemsTherearethreefundamentalproblemsin HMM theory:
(1) For a sequenceof observations :;�<:)�=>���?�@�:BA ,: � C" , computethe probability * % :D. E�' that anHMM E generated: .
(2) For some: andanHMM E , recover themostlikelystatesequence� � F�?���?�� A thatgenerated: .
(3) Adjust the parametersof an HMM E suchthat theymaximize* % :D. EG' for some: .
The first problemcorrespondsto maximumlikelihoodrecognitionof an unknown data sequencewith a set ofHMMs, eachof which correspondsto a sign. For eachHMM, the probability * % :D. EG' is computedthat it gener-atedthe unknown sequence,andthenthe HMM with thehighestprobability is selectedastherecognizedsign. Forcomputing * % :D. EG' , let HI�JH � KH � ����?��LH A be a statesequencein E :
MON % 8 'P�#* % :)�=K:B�=?�����?K: N KH N �Q� � . EG'4�SR 8 R!�T (1)
* % :D. E�'U� �V�XW � M A % 8 '@ (2)
M � % 8 'P�#���Y$Z� % : � '@ (3)
MON\[ � % 8 'P�#$ �K% : N\[ �?' �V�]W � M�N % 5 '&� �]� �)R!^UR4_!`4� (4)
Theseequationsassumethat the : � areindependent,andthey make the Markov assumptionthat a transition de-pendsonly on the currentstate,a fundamentallimitationof HMMs. This methodis called the forward-backwardalgorithmandcomputes* % :D. E�' in : % � � _)' time.
The secondproblemcorrespondsto finding the mostlikely path H throughan HMM E , given an observationsequence: , andis equivalentto maximizing * % HaK:D. E�' .Let b
N % 8 'P� cad=efhgLi�j�j�j i f�kmlFg * % H � H � ���?�]H N �n� �]K:D. E�'@ (5)bN\[ � % 8 'o�Q$ �]% : N\[ ��'qprcDde�Ls � sG� ,
bN % 5 't� �]�]/ (6)
cad=ef * % HaK:D. E�'u�vcad=e�Zs � sG� ,bA % 8 ' / � (7)b
N % 8 ' correspondsto themaximumprobabilityof all statesequencesthat endup in � � at time ^ . Equations6 and7follow from Equation5 by induction on ^ . The Viterbialgorithmis a dynamicprogrammingalgorithmthat, us-ing Equation7, computesboth the maximumprobability* % HaK:D. E�' andthestatesequenceH in : % � � _B' time.
Therecoveryof thestatesequencemakestheViterbi al-gorithminvaluablefor continuousrecognition,sinceit by-passesthe difficult problemof segmentingthe utterancesinto its individualparts.Instead,asequenceof HMMs cor-respondingto individual signsis concatenatedinto a net-work, as schematicallydepictedin Figure 2. Thus, themostlikely statesequencerecoversthesequenceof signs.
The Viterbi algorithmalsohasthe propertythat it canbe optimizedwith the beam-searchingalgorithm. Whileupdating
bN\[ � % 8 ' , this optimizationconsidersonly those
states� � in the HMM network for which
bN % 5 ' is above
a thresholdvalue.Theassumptionis thatif theprobability
3
HMM 1
HMM 2
HMM n
......
Finalstate
Initialstate
Figure2: Concatenationof HMMs into a network
of a partial paththroughthe network becomestoo low, itcannotcontributeto themostlikely path.Beam-searchingis essentialfor makinglarge-scaleapplicationstractable.
The third problemcorrespondsto training the HMMswith data,suchthat they areableto recognizepreviouslyunseendatacorrectlyafterthetrainingphase.Thereexistsno analyticalsolution for maximizing * % :D. EG' for givenobservation sequences,but an iterative procedure,calledthe Baum-Welch procedure,maximizes * % :D. E�' locally.In thecaseof continuousdensityoutputprobabilities,thereestimationprocessworksasfollows.
Define $]� % :S' as $]� % :S'w�x�Qyz W �|{ � z)} % :-&~G� z Z�O� z ' ,where � describesthenumberof mixtures,
5is thestate
number, { describesthe weight of mixture � in state5,
and } is a Gaussiandensitywith mean~ , andcovariancematrix � . Definethebackwardvariable� as
� N % 8 'P�#* % : N\[ ��: N\[ �=?���?��K:BA�. H N �#� � KEG'Z (8)
�GA % 8 'P���� (9)
� N % 8 '�� �V�]W � ������$K� % : N\[ � 't� N\[ � % 5 'Z (10)
�SR 8 R!����SR�^UR4_!`4��� (11)
Furthermore,define� and � as
� N % 8 5 'P� MON % 8 't� ��� $ ��% : N\[ �?'0� N\[ � % 5 '* % :D. E�' (12)
� N % 8 'P� �V��W � � N % 8 5 'Z� (13)
� N � N % 8 5 ' canbe interpretedas the expectednumberof transitionsfrom � � to � � ; likewise � N � N % 8 ' canbe in-terpretedastheexpectednumberof transitionstakenfrom� � . With theseinterpretations,the reestimationformulaefor thetransitionsandoutputprobabilitiesare����q��� � % 8 '@ (14)
�� ��� � � A�� �N W � � N % 8 5 '� A�� �N W � � N % 8 ' (15)
�{ � z � � AN W � � N % 5 &�T'� AN W � � y� W � � N % 5 L�(' (16)
�~G� z � � AN W � � N % 5 &�T'&: N� AN W � � N % 5 ]��' (17)
�� � z � � AN W � � N % 5 ]��' % : N `7~G� z ' % : N `�~G� z ' A� AN W � � N % 5 ]��' � (18)
Repeateduseof thisprocedureconvergestoamaximumprobability[15], typically after5–10iterations.
5 Useof HMMs for ASL RecognitionIn the previous sectionwe reviewed the extraction of
three-dimensionalfeaturesfrom computervision and theHMM theory. We now discusshow they fit in the frame-work of ASL recognition.
HMMs are an attractive choice for processingthree-dimensionalsigndata,becausetheirstate-basednatureen-ablesthemto describehow a sign changesover time andto capturevariationsin thedurationof signs,by remainingin a statefor severaltime frames.
Therearetwo waysto approachthe recognitionprob-lem that posevery different researchproblems. Isolatedrecognitionattemptsto recognizeonesinglesignata time.Hence,it is basedon theassumptionthateachsigncanbeindividually extractedandthenindividually recognized.
Continuousrecognition,on theotherhand,attemptstorecognizean entirestreamof signs,without any artificialpausesor any other form of marked boundariesbetweentheindividualsigns.Clearly, continuousrecognitionis de-sirablefor the most natural interactionpossiblebetweenhumansandmachines,but it is alsomuchmoredifficult totacklethanisolatedrecognition.Thenext two subsectionsdiscusseachof thetwo approachesin detail.
5.1 IsolatedRecognitionIsolatedsign recognitionassumesthat eachsign can
be extractedindividually. This requiresclearly markedboundariesbetweensigns. Sucha boundarycould sim-ply besilence,thatis, a brief restingphaseaftereachsign,duringwhich thesignerperformsno movements.Silenceis easilydetectedthroughananalysisof theglobalvarianceoverthehandmovements.
Once there are clearly marked boundariesbetweensigns,HMM recognitionis comparatively straightforward.Therecognitionprocessextractsthesignalcorrespondingtoeachsignindividually. It thenpickstheHMM thatyieldsthe maximumlikelihoodfor that signalasthe recognizedsign.
Training the HMMs to maximize recognitionperfor-manceis alsocomparatively straightforward. Initially, allsigns in the training set are labeled. For eachsign inthe dictionary, the training procedurethen computesthe
4
meanand covariancematrix over the data available forthat sign and assignsthem uniformly as the initial out-put probabilitiesto all statesin the correspondingHMM.It alsoassignsinitial transitionprobabilitiesuniformly tothe HMM’ s states.Unlike the initial outputprobabilities,initial transitionprobabilitiesdo not influencethe perfor-manceof thefully trainedHMMs greatly.
The trainingprocedurethenrunstheViterbi algorithmrepeatedlyon thetrainingsamples,soasto align thetrain-ing dataalong the HMM’ s states. The aligneddataarethenusedto estimatebetteroutputprobabilitiesfor eachstateindividually. This realignmentyieldsmajorimprove-mentsin recognitionperformance,becauseit increasesthechancesof the Baum-Welch reestimationalgorithm con-verging to an optimal or a near-optimal maximum. AfterconstructingthesebootstrappedHMMs, the training pro-cedurefinishesby reestimatingeachHMM in turn withthe Baum-Welch reestimationalgorithmoutlined in Sec-tion 4.2.
Theby far mostchallengingproblemin isolatedrecog-nition is extractinga featurevector that optimizesrecog-nition performance.Even after obtainingaccuratethree-dimensionaldata from our computervision methodde-scribedin Section3, we found that the featuresusedforrecognition— and the way that they are represented—greatly influencerecognitionperformance. The experi-mentalresultsgiven in Section8.1 demonstratehow thefeaturevectoraffectsperformance.
Thereareseveral reasonswhy performanceis so sen-sitive to choosingthe type of featurevector: First, somefeaturescarry more information than others; for exam-ple,three-dimensionalfeaturesaremorereliablethantwo-dimensionalones.Second,somefeaturesaremoreinvari-ant to changesin orientationandpositionthanothers;forexample,polarcoordinatesaremoreinvariantto rotationsthanCartesiancoordinates[1]. Third, thestatisticalprop-ertiesof somefeatureschange,dependingon thedurationof a sign. For this reason,the positionsof the handsinthree-dimensionalspaceperformbetterthanthevelocitiesof the hands(seealso Section8.2). Fourth, the statisti-cal distribution of the featuresduring thecourseof a signseemsto play a role. For somefeatures,their distributionfits Gaussiandensitiesnaturally, whereasfor othersit doesnot.
If thelatterexplanationholdstrue,weshouldseeama-jor improvementin recognitionperformancefrom usingmultiple GaussianmixturesastheoutputprobabilitiesforHMMs, insteadof usingjust onesingleGaussiandensity.However, we did not experimentwith multiple mixturesbecauseof thelackof sufficient trainingdata.
The numberof statesand the topology usedfor theHMMs is alsoimportant.Signlanguageasa time-varying
processlendsitself naturallyto aleft-rightmodeltopology.Findingtheoptimumnumberof states,which dependsontheframerateandonthecomplexity of thesignsinvolved,is anempiricalprocess.We usedthesamemodeltopologyfor all signs,anddeterminedexperimentallythat for ourtaskamodelwith 9 stateswassufficient,whichis depictedin Figure3. TheoutputprobabilitiesweresingleGaussiandensitieswith diagonalcovariancematrices,becausewehadinsufficient trainingdatafor multiplemixtures.
Figure 3: Left-right HMM topology for isolated ASLrecognition.
5.2 ContinuousRecognitionContinuoussignrecognition,ontheotherhand,is much
harderthanisolatedsign recognition.Thereis no silencebetweenthe signs,so the straightforward methodof us-ing silenceto distinguishboundariesfails. Here HMMsoffer the compellingadvantageof beingable to segmentthe streamsof signsautomaticallywith the Viterbi algo-rithm. Coarticulationeffectsfurthercomplicatecontinuousrecognition.We now discussthemin detail,beforewede-scribethetechniquesneededto trainHMMs for continuousrecognition.
5.2.1 The Coarticulation ProblemCoarticulationmeansthatthepronunciationof asignis
influencedby the precedingandfollowing signs. Oneofthe mostvisible effectsof coarticulationin ASL is that awiderangeof movementsareinsertedbetweensigns.
For example,the sign for “FATHER” is performedbyrepeatedlytappingtheforehead,andthesignfor “READ”is performedin neutralspacein front of thechest.If thesetwo signsareperformedin succession,anextramovementfrom theforeheadto neutralspaceappears(Figure4). Thisphenomenonis calledmovementepenthesis[5]. We dis-cussits implicationsfor ASL recognitionmorethoroughlyin [20].
Figure4: Movementepenthesis.Thearrow in themiddlepictureindicatesanextra movementbetweenthesignsfor“FATHER” and“READ” thatis notpresentin their lexicalforms.
5
Speechrecognizershandlecoarticulationby trainingphonemecontext-dependentHMMs. They traina separatemodelfor eachpossiblecombinationof threephonemesinsequencethatcouldoccurduringnaturalspeech.In princi-ple,thesameideaappliestosignlanguagerecognition,andweperformedsomeexperimentsto verify theapplicability,seeSection8.3.
A possibleway to train context-dependentmodelsforASL recognitionis to usewhole signsas the phonologi-cal unit in ASL.2 Thus,triphonecontext-dependentmod-elsfrom speechrecognitioncorrespondto tri-signcontext-dependentmodelsin ASL recognition. In otherwords,aseparatemodel is trained for eachcombinationof threesignsin sequence.The first andthe third sign in the se-quenceform the context for the middle sign, with whichthemodelis associated.
Tri-sign context-dependentmodeling,however, is pro-hibitively expensive, becauseit requires : %���� ' modelsoverall, where � is the vocabulary size. Collectingsucha large amountof training datanecessaryto obtain reli-ableestimatesfor themodelsis intractableevenfor smallvocabulary sizes. This intractability is a negative conse-quenceof usingwholesignsasthephonologicalunit. Un-like for speechrecognition,which hasto handleonly ap-proximately40 classesof allophones,there is no upperboundon the numberof modelsrequiredfor ASL recog-nition with wholesignsasthesmallestunit.
Therefore, we used only bi-sign context-dependentmodels,whichrequireamodelfor everypossiblecombina-tion of two signs.Themodelis associatedwith thesecondsign,andthefirst signformsits precedingcontext.
Bi-sign context-dependentmodeling requires : %�� � 'models.Althoughthiscomplexity is animprovementover: %���� ' , it is still too largefor anythingbut a smallvocab-ulary. Speechrecognizersreducethenumberof modelsre-quiredbyusingtheobservationthatmany contextsareverysimilar. Therefore,they tie the parametersof the modelscorrespondingto similar contexts, suchthat the transitionandoutputprobabilitiesaresharedbetweenthesemodels.This techniquesignificantlyreducesthenumberof distinctmodels.
Parametertying is alsoapplicableto ASL recognition,but it is not as effective as for speechrecognition. Themain reasonfor the reducedeffectivenessis that move-mentepenthesisinsertsmany movementsunrelatedto thesigns’ lexical forms. The implication is that context-dependentmodelswill work well only with prohibitivelylargeamountsof trainingdata.
In fact, it is questionablewhethercontext-dependentmodeling is a good solution to the coarticulationprob-
2Thisassumptionis not correct:Wholesignsarenot thesmallestunitin ASL phonology, but this topic is beyondthescopeof thispaper.
lem in ASL recognitionat all. Movementepenthesisisa phonologicalprocessin ASL and shouldbe treatedassuch;thatis, themovementsinducedbyepenthesisaresep-aratephonemes.Usingcontext-dependentmodelsto cap-turethemis implausiblefrom aphonologicalpointof view.It seemsto make moresenseto modelthemovementsex-plicitly. We follow up on this ideain [20] andshow that itleadsto betterrecognitionperformance.
5.2.2 The Training ProcedureA sign in our datacollectedat naturalsigningspeeds
was between10 and 45 frames long, not counting theframesneededfor the transitionbetweensigns. Becauseof themovementsbetweensigns,theHMM topologymustbemoreflexible thantheonedescribedfor isolatedrecog-nition in Section5.1. Theseconsiderationsled usto usingtheleft-right modelshown in Figure5.
Figure5: Topologyof thecontext-dependentmodel. Thearcsthat skip statesallow the modelingof variabilitiesinthedurationof differentsigns.
Like for isolatedrecognition,we determinedthe opti-mal numberof statesexperimentally. For theoutputprob-abilities,wechoseasingleGaussiandensitywith diagonalcovariance,aswehadinsufficienttrainingdatafor estimat-ing full-rank covariancematrices.
Trainingcontinuousrecognitionmodelsis muchharderthantrainingisolatedrecognitionmodels,becauseit is dif-ficult to obtaingoodinitial estimatesof theHMM param-eters.Viterbi realignment(seeSection5.1) worksonly ifthetrainingdatais accuratelylabeled,includingthebound-ariesbetweentheindividualsigns.Obtainingthesebound-ariesis very difficult and time-consuming;even humanshave troubledeterminingwherea sign endsandthe nextonestarts.
The alternative to usingViterbi realignmentis usingaflat-startscheme.It consistsof computingtheglobalmeanandcovariancematrix over theentiretrainingdatasetandassigningtheseas the initial output probabilitiesto theHMMs. We usedthisschemeto initialize theHMMs.
Wethenusedembeddedtraining [24] to reestimatetheHMMs. Eachiterationof this procedureconcatenatestheHMMs correspondingto the individual signsin a trainingsentenceinto a singlelargeHMM. It thenreestimatestheparametersof thelargeHMM with asingleiterationof theBaum-Welchalgorithmdescribedin Section4.2,asusual.Thereestimatedparameters,however, arenot immediatelyappliedto theindividual HMMs. Instead,they arepooled
6
in accumulators,andappliedto theindividualHMMs onlyafter thetrainingprocedurehasiteratedover all sentencesin thetrainingset.
Hence,embeddedtraining effectively trains all mod-els in parallel with the entire training set. It yields bet-ter parameterestimatesthantrainingtheHMMs indepen-dently[24].
In the caseof context-independentmodels,using theflat start schemefollowed by several embeddedtrainingrunsis all thatis necessaryto trainHMMs for recognition.Context-dependentmodelsaremoredifficult to train thancontext-independentmodels,becausethetraininginvolvestwo extra steps. Theseconsistof generatingthe context-dependentmodels,and tying the parametersof HMMswith similar contexts(seealsoSection5.2.1).
The first extra step, which consists of generatingthe context-dependentmodels,requirescare,becauseforcontext-dependentmodels there exist far fewer trainingexamplesper model than for context-independentmod-els. In this case,embeddedtraining is likely to yield thebestparameterestimatesfor context-dependentmodelsifthey have alreadybeeninitialized with bettervaluesthanthe global meanandcovariancematrix from the flat-startscheme.
Therefore,weranseveralembeddedtrainingrunsonthecontext-independentmodelsand then generatedcontext-dependentmodelswith thesameparametersasthecontext-independentmodels. It is vital to avoid overtrainingthecontext-independentmodelsby keepingthenumberof ini-tial trainingpasseslow. Theprobabilitiesshouldnot havefully convergedyet. Otherwise,usingcontext-dependentmodelsactuallydecreasesrecognitionperformance.
The secondextra step,which consistsof tying the pa-rameters,is also vital to the context-dependentmodels’performance,especiallybecauseof our relative lack oftraining data. Tying parametersreducesthe numberofmodels,assignswith similar contexts thensharea com-monmodel.As a result,moretrainingdatapermodelbe-comesavailable.
Unfortunately, parametertying is a highly empiricalprocess.Ourexperimentsindicatedthattying thetransitionprobabilitiesproperlyhadthegreatestinfluenceon recog-nition results. We usedthe endinglocationsof the signsin the precedingcontext to decideon the tying. For ex-ample,the signsfor “BROTHER” and“SISTER” end inthe samelocation. As a result,the two modelsfor a signoccurringafter the signsfor “BROTHER” or “SISTER,”suchas “LIKE, ” cansharethe sametransitionprobabili-ties. We alsousedtheendinglocationsto decideon tyingthe outputprobabilities. For our dataset, the tying pro-cessreducedthe numberof modelsto lessthanonesixthof theiroriginalnumber.
6 Coupling of Vision and HMMsIn the precedingsectionwe reviewedhow HMMs can
be usedfor ASL recognition. The useof HMMs alone,however, imposessomelimitations,oneof which is insuf-ficiency of trainingdata,especiallywhile trainingcontext-dependentmodels.Furthermore,theprobabilitytheoryas-sumptionsunderlying the HMM theory, as describedinSection4.2, areoften not valid: Successive observationsareoftennot independent,the transitionfrom onestatetothe next often dependsnot only on the currentstate,butalsoon the statehistory, and the distribution of observa-tionsdoesnotnecessarilyresembleanormaldensity.
Anotherproblemis thattheHMM theorydoesnot pro-videfor any dynamicweightingof featuresdependingonasign’scontext. For example,theinvariantfeaturesfor somesigns,suchas “I,” are the endpointsof their movementswith respectto a bodypart,andthemovementsareunim-portant.For othersigns,only themovementsareinvariant.The partsof the featureset that shouldbe examinedandignoredfor eachclassof signsaremutuallyexclusive.
To alleviate theselimitations,we investigatedthe cou-pling of the HMM recognitionprocesswith an indepen-dent computervision-basedmotion analysisthat tempo-rally segmentsthe signalandextractsits geometricprop-erties.Theideais thata signcanbedescribedin termsofoneormoregeometricprimitives,suchashandmovementsalonga line, in a plane,or a circle. This ideais supportedby theexistenceof transcriptionsystems,suchastheHam-NoSys[14], thatbasethedescriptionof themovementsongeometricprimitives.
The presenceof three-dimensionalinformationis cru-cial for the couplingto work. In the past,geometricfit-ting of planeshasalreadybeenusedfor rough segmen-tation [12], but not for providing additional informationaboutthe natureof the fits to the HMM recognitionpro-cess.
6.1 Segmentationof the SignalTo extract the geometricpropertiesof the continuous
signalestimatedwith ourcomputervisionmethods,it mustfirst besegmentedtemporallyinto its parts.Any changeofthetypeof armmovementis likely to beaccompaniedbya dip in the velocity. Thus, minima in the absoluteval-uesof thevelocity vectorprovide stronghintsat segmen-tationboundaries.However, therearetypically many morevelocity minimathansegmentationboundaries.Thus,thesegmentationprocessmustprovide facilities to mergead-jacentsegments.
After performinginitial segmentationbasedon veloci-ties, our algorithmattemptsto fit geometricprimitivestothe individual segments.Thesecurrentlyconsistof lines,planes,andholds3 ata positionin space.
3A hold is a shortperiodof time, duringwhich no handmovements
7
The fit of a hold is determinedby computingthe co-variancematrix over thesegment’s positiondata. If thereis little movement,the eigenvaluesof the matrix in everydirectionaresmall,andconsequentlyits traceis small.
Theleast-squaresfit of a line is governedbyV�Q� � �
V� .X. � � ` %�� p@� � ' � .�. � (19)
where� � is thedistanceof � � to theline, and � is theline’sunit direction vector. Let � be a matrix containingthepoints �q� in the segmentsas its row vectors. Minimiz-ing Equation19 with respectto � correspondsto maxi-mizing � A � A � � . By Rayleigh’s principle,themaximal-eigenvalueeigenvectorof � A � maximizesthis equation,which is equivalentto themaximal-eigenvalueeigenvectorof the points’ covariancematrix. This eigenvectoris theline’s directionvector. Theothertwo eigenvaluesindicatethegoodnessof fit — thesmallerthey arewith respecttothelargesteigenvalue,thebetterthefit.
Theleast-squaresfit of a planeis governedbyV� � ���
V� .�. �h��p���.�. � (20)
where � � is the distanceof � � to the plane,and � is theplane’s unit normalvector. If � is a matrix containingthepoints � � asits row vectors,theminimal-eigenvalueeigen-vectorof � A � minimizesEquation20 with respectto � .Hence,minimizing this equationis equivalent to findingthe minimal-eigenvalueeigenvectorof the points’ covari-ancematrix. Theothertwo eigenvaluesindicatethegood-nessof fit — thelargerthey arewith respectto thesmallesteigenvalue,thebetterthefit.
Using least-squaresfitting is basedon the assumptionthat the signalnoiseterm is capturedby a normaldistri-bution. If this assumptionis not valid, the least-squaresestimatoris likely to yield poorresults,becauseof its sen-sitivity to outliers.Ontheotherhand,in three-dimensionalspace,the least-squaresestimatoris mucheasierto com-putethanmorerobust estimators.It would be interestingto compareits performanceon temporalsegmentationtotheperformanceof robustregressionestimators[13], suchastheleastmedianof squaresestimator[2, 11], or there-peatedmedianestimator[16, 9].
After the initial fit, the algorithmpools the primitivesinto a directedacyclic graph (DAG), schematicallyde-pictedin Figure6. Note that the individual segmentsarenot mutually exclusive; for example,datacan fit both aline anda plane.
If the algorithmfails to fit any geometricprimitivestosomesegment, it insertsthe segmentinto the DAG as a
take place.
“wild card,” which is definedconservatively to matchanykind of geometricprimitive. It thenattemptsto mergead-jacentsegmentsif they are compatible,in an attempttoeliminatespurioussegmentationboundaries.
We definedadjacentsegmentsto be compatiblefor amergeif they sharedthesametypeof geometricprimitivein similar orientations,andif the mergedsegmentstill fitthesametypeof geometricprimitiveasits constitutingseg-ments.In addition,we considereda wild cardto becom-patiblewith anothergeometricprimitive if this primitivealsofit themergedsegment.
PlaneLine
Line
Hold
Figure6: Geometricprimitivespooledinto a DAG. Cir-clesdenotesegmentationboundaries.Dottedarcsdenotepossiblenull transitions;they arenecessaryto compensatefor spurioussegments.Sometimesdatacanfit multiplege-ometricprimitives; in this DAG the dataof the first twosegmentsfit both a hold followedby a line, anda simpleline.
TheDAGnow givesall possiblesegmentsequencesthatarea valid representationof thesignal. If a sequenceis tobevalid, it mustbeobtainableby tracingapaththroughtheDAGfromtheleftmostsegmentationboundaryto theright-mostsegmentationboundary. In theexamplegivenin Fig-ure6 thesequences“Hold, Line,Plane,” and“Line, Plane”would both be valid sequences,but “Plane,Plane”wouldnot,becausethelatterdoesnot lie onany paththroughthatDAG.
Thisdiscussionhassofarignoredthepossibilityof spu-rioussegmentsarisingfromthevisionanalysis.Thatis, theanalysismight recognizea segmentthatshouldbepartofanother, but the merge processfails to merge it into an-othersegment. Themain reasonfor theexistenceof spu-rioussegmentsis undersampling.If a segmentconsistsofvery few samples,it is oftenimpossibleto extractreliableinformationfrom it. Our algorithmattemptsto solve thisproblemby addingarcsto theDAG from eachsegmenta-tion boundaryto the next (representedby the dottedarcsin Figure6). Thus,a paththroughtheDAG canoptionallyskip thesespurioussegments.
6.2 Using the Motion Analysiswith HMMsEachsignin thevocabularyhasassociatedoneor more
templatesthat comprisethe sign’s geometricprimitiveswith weightsof eachfeature’s relative importance.These
8
primitivesarematchedagainstthosein theDAG. Assum-ing thatthesegmentationprocessyieldscorrectresults,thefollowing must be true: If a sequenceof signsis repre-sentedby theinputsignal,thesequenceof geometricprim-itivescorrespondingto thesignsmustform a paththroughtheDAG. We call sucha sequenceof signsvalid with re-spectto thecomputervisionDAG.
This observationsuggestsanapplicationof themotionas a backupcheckfor the HMM framework. First rec-ognizea candidatesentencefrom the input signalvia theViterbi algorithm.Thengenerateall possiblesequencesofgeometricprimitivescorrespondingto therecognizedsignsand constructanotherDAG from them. Using dynamicprogramming,matchthetwo DAGsagainsteachother. Ifthetwo DAGssharea commonpath,acceptthecandidatesentenceascorrect. Otherwise,reject the candidatesen-tenceasincorrect.
Thejustificationfor this algorithmcomesfrom thefol-lowing propertiesof the DAGs: If the two DAGsshareacommonpath,thereis a sequenceof geometricprimitivesthat forms a paththroughthecomputervision DAG. Fur-thermore,this sequenceof geometricprimitivesis oneofthe possiblesequencesgeneratedfrom the candidatesen-tence. Thus, the candidatesentenceis valid with respectto thecomputervision DAG. Conversely, if no suchcom-monpathexists,noneof thesequencesof geometricprim-itivesgeneratedfrom thecandidatesentenceformsa paththrough the computervision DAG. Thus, the candidatesentenceis not valid with respectto the computervisionDAG andshouldberejected.
6.3 Discussionof the CouplingTheHMM recognitionalgorithmandthevisionmatch-
ing algorithmcomplementeachother. TheadvantagesoftheHMM recognitionmethodareautomaticsegmentationduring both training andrecognition,anda fully formal-ized training procedure.The disadvantagesarepoor per-formancein the presenceof insufficient training data,noformal way to weight featuresdynamically, andpossibleviolationsof thestochasticindependenceassumptions.
Theadvantagesof thevision matchingmethodarethepossibilityof weightingtherelative importanceof featuresdynamically, and independencefrom insufficient trainingdata. A significantdisadvantageis thatestimatingthege-ometricpropertiesof the signsin the vocabulary requiresmanuallabelingandanalysisof thedata.Furthermore,seg-mentationmustbedoneexplicitly, which raisesthepossi-bility of spurioussegments,asdescribedin Section6.1,orthepossibilityof missingsegments.Coarticulationsome-timesalsochangesthegeometricpropertiesof thesignal,suchthatthetemplatesfor thecorrectsequenceof signnolongermatchthe actualsignal. Copingwith the changesin thegeometricpropertiesis an importanttaskfor future
research.
7 Data CollectionFor our experimentswe collecteddata,usingboth our
computervision system,and an AscensionTechnologiesFlock of Birds. Thereasonfor usingthe latterwasthat itis fasterat this point thanthecomputervision system,andhencemoresuitablefor prototyping.
The computervision systemyields rotation, � � , andtranslation,��� , of eachsegmentof thearm,asdescribedinSection3. Figure7 givesanexampleof thecomputervi-siontrackingprocess.Theimagesshow thehighaccuracyof thecomputervision system;in fact,it is comparabletotheaccuracy achievedby theFlockof Birdssystem.
TheFlockof Birds systemconsistsof a magnetandsixsensorsthat detecttheir rotation, �� � , andtranslation, ��|� ,with respectto the magnetat 25 framesper second.Weusedthe datafrom both systemsinterchangeablywith asimplealignmentof coordinatesystems.The coordinatesystemwasright-handed,with theorigin at thebaseof thesigner’sspineandthe � axisfacingup.
Figure 7: Fitting the three-dimensionalmodels to thesigner’s arms. From top to bottom, the signs for “FA-THER,” “I,” and“MAIL ” aredisplayed.Fromleft to right,thefront, side,andtopviewsaredisplayed.
We used the 53-sign vocabulary listed in Table 1.TheirpronunciationsfollowedtheASL dialectusedin thePhiladelphia,PA, area. The goalsin choosingthe vocab-ulary wereto beableto expresssentencesthatcouldhaveoccurredin a naturalconversation,andto make intensiveuseof the signingspace,so asto demonstratethe advan-tagesof three-dimensionaldataovertwo-dimensionaldata.Wecollected486continuousASL sentences,eachbetween
9
Category SignsusedNouns America, Christian, Christmas, book,
brother, chair, college, family, fa-ther, friend, interpreter, language,mail,mother, name,paper, president,school,sign,sister, teacher
Pronouns I, my, you,your, how, what,where,whyVerbs act,can,give,have,interpret,like,make,
read,sit, teach,try, visit, want,will, winAdjectives deaf,good,happy, relieved,sadOther if, from, for, hi
Table1: Thecomplete53signvocabulary
2 and12 signslong, with a total of 2345signs. Theonlyconstraintsontheorderandoccurrenceof signswerethosedictatedby thegrammarof ASL [19].
Furthermore,we collectedexamplesof eachsign forisolatedrecognition. Becausepart of the datawerecor-ruptedduringthecollectionprocess,wediscardedall signsfor whichwedid nothaveenoughintacttrainingexamples.This left 656examplesovera rangeof 40signs.Eachsignhadat least6 examplesavailablefor thetrainingset,and2examplesavailablefor thetestset.
8 ExperimentsWe performedisolated,continuous,and vision-HMM
coupledASL recognitionexperiments.WeusedEntropic’sHidden Markov Model Toolkit (HTK) Version 2.02 fortrainingandtestingin all of ourexperiments.
8.1 IsolatedRecognitionExperimentsThegoalof theisolatedrecognitionexperimentswasto
discover a setof featuresthat maximizesHMM recogni-tion performance.We useddifferentfeaturesin ourexper-iments,includingwrist positioncoordinatesof bothhands(denotedby �O&�|] ), wrist positionexpressedin polarco-ordinatesin the � - � plane (denotedby ¡£¢?¤�]¥�¢£¤ ), polarcoordinatesin the � - plane(denotedby ¡?¢£¦�&¥¢?¦ ), wristposition expressedin sphericalcoordinates(denotedby¡]¥(K§ ), andwrist orientationangle(denotedby
b), aswell
asderivativesof these(denotedby a dot). We alsocom-binedseveralfeaturesin someexperiments.
We ran repeatedexperiments,morethan � 3 3�3�3 total,with differentfeaturesandrandomlyselectedtrainingandtestsetson a per-experimentbasis. Threequartersof theexamplesfor eachsignwerein thetrainingsetandtherestwerein thetestset.Eachselectionyielded178testexam-plesperexperiment.Sometypical resultsaregivenin Ta-ble 2. In addition,we performedexperimentsto comparethe merits of using three-dimensionalcoordinatesversustwo-dimensionalcoordinatesby projectingthecoordinatesonplanes.Theresultsareshown in Table3.
Features ~ ¨ B W N��]�GK 98.42% 0.99% 100.0% 93.8% 463¡ ¢£¤ &¥ ¢£¤ , 98.72% 0.79% 100.0% 95.5% 494¡?¢£¤�&¡£¢?¦ ,¥¢?¤�]¥�¢£¦ ,��]�GK 98.78% 0.78% 100.0% 94.9% 882¡&¥(L§ 96.48% 1.31% 100.0% 93.3% 210©�O ©�| © 96.87% 1.21% 100.0% 93.3% 167��]�GK ªb
98.25% 0.92% 100.0% 95.5% 167©¡ ¢£¤ ©¥ ¢£¤ ,© 96.28% 1.04% 98.9% 93.8% 120©¡« ©¥( ©§ 95.89% 1.29% 98.9% 92.1% 150
Table2: Resultsof isolatedsign recognitionwith three-dimensionalfeatures. ~ , ¨ , B, W, and N correspondtotheaveragepercentageof correctlyrecognizedsigns,stan-darddeviation,bestcase,worstcase,andnumberof exper-iments,respectively. All experimentsusedatestsetof 178signs.
Features ~ ¨ B W N¡ ¢£¤ &¥ ¢?¤ 98.06% 1.26% 100.0% 94.9% 118��&� 97.75% 1.20% 100.0% 94.9% 118
Table 3: Resultsof isolatedsign recognitionwith two-dimensionalfeatures.Themeaningof the columnsis thesameasin Table2.
8.2 Analysisof IsolatedRecognitionThe low error ratesof the bestfeaturesetsshow that
with a good selectionof features,the hand movementsalone,without handconfigurationinformation,carry suf-ficient informationto discriminateamongmany differentsigns. Polarcoordinatesslightly outperformedCartesiancoordinates.A combinationof both yielded the bestre-sults,althoughthedifferenceis not significant. However,thestandarddeviationof thecombinedfeaturesetwaslow-est,indicatingthatacomplex featurevectoris morerobustthana simplefeaturevector.
Positioncoordinatessignificantlyoutperformedveloc-ities. The reasonfor the poor performanceof velocityfeaturesis that the statisticalpropertiesof the velocitieschangewith variationsin thesign’s duration. In contrast,thestatisticalpropertiesof positioncoordinatesarelargelyunaffectedby thedurationof signs,becauseHMMs absorbvariationsin durationthroughtransitionslooping backtothe samestate. Yet, positioncoordinateshave the signif-icantdisadvantagethat they arenot invariantwith respectto location.Thelackof invariancewill causeproblemsforfuture applicationsthat attemptto capturecommonalitiesbetweenmovementsatdifferentlocationsin space.
10
Three-dimensionalfeaturesperformedbetterthantwo-dimensionalfeatures,althoughthedifferenceis not large.The differencewould probablybecomemore significantwith a largervocabulary. The differencesin standardde-viation, however, indicatethat three-dimensionalfeaturesaremorerobustthantwo-dimensionalfeatures.
It is an importantconsequenceof the experiments’re-sults that the performanceof the featurevectorsdependson theactualexamplesin thetrainingset,all otherfactorsbeingequal.Thus,only performinga largenumberof ex-perimentsyieldsreliableestimatesof therelativemeritsofdifferentfeatures.
8.3 ContinuousRecognitionExperimentsWe split the486sentencesrandomlyinto a trainingset
with 389 examplesanda testsetwith 97 examples(con-taining 456 signs). Eachsign in the vocabulary occurredat least once in the test set. The training and test setswerethesamethroughoutall experiments,andno portionof the testsetwasusedfor training in any way. We ranthree-dimensionalexperimentswith andwithout context-dependentHMMs, andtwo-dimensionalexperiments(byprojectingthedataonplanes;theresultsgivenarethebestthatwe found).
In accordancewith the results from isolated experi-mentsthat position coordinatesperform better than ve-locities, and that a complex featurevector is more ro-bust thana sparseone,we choseour featurevectorto be% ��&�|K ª]¥ ¢£¤ &¥ ¢£¦ ©�� ©�| © (
b' for bothhands.Thatis, it con-
sistedof Cartesianandpolarpositioncoordinates,veloci-ties,andwrist orientationangles.Thetaskgrammarwasasimpleword loop, soevery signwasequallylikely at anytime in theHMM network.
Table4 shows the experimentalresults. We usewordaccuracy asourevaluationcriterion.It is computedby sub-tractingthenumberof insertionerrorsfrom thenumberofcorrectlyspottedsigns.Thenumberof wordsin theresultfor two-dimensionaldatais lower thanin theotherresults,becausefor onesentencetheViterbi beam-searchingopti-mizationprunedall pathsthroughtheHMM network (seealsoSection4.2).
8.4 Analysisof ContinuousRecognitionThe resultsare clearly in favor of using three-dimen-
sionaldataover two-dimensionalfor continuousrecogni-tion. The6.3percentdifferenceis large,although,accord-ing toourexperienceswith isolatedrecognition,oneexper-imentis notenoughto estimatetherealdifferencereliably.
Context-dependentmodelsoutperformedcontext-inde-pendentmodels, but the increasein performancewassmall, probably to a large extent becauseof insufficienttrainingdata— context-dependentmodelingrequireshugeamountsof data to becomeeffective. Also, cross-signcontext-dependentmodelingfor ASL is implausiblefrom
Typeof Wordexperiment accuracy Details3D context 87.71% H=416,D=8, S=32independent I=16,N=4563D context 89.91% H=424,D=6, S=26dependent I=14,N=4562D context 83.63% H=394,D=14,S=44dependent I=16,N=452
Table 4: Resultsof continuousrecognitionexperiments.H denotesthe numberof correctsigns,D the numberofdeletionerrors,S the numberof substitutionerrors,I thenumberof insertionerrors,andN thetotalnumberof signsin thetestset.
a phonologicalpoint of view (seeSection5.2.1). Theal-ternative is modelingmovementepenthesisdirectly, anditappearsto performbetter[20].
More thanhalf of thesubstitutionerrorsin eachexperi-mentwereconfusionsbetween“I” and“MY,” and“Y OU”and“Y OUR,” whichdiffer only in handconfiguration.Weexpectthataddingfeaturesdescribingthehandconfigura-tion will improverecognitionperformancesignificantly.
Repeatingthecontext-dependentexperimentwith five-bestrecognitionshowedthattheabsenceof astronggram-marfor constrainingtheHMM network degradesrecogni-tion performancesignificantly. In many cases,thecorrectsentencewastheonlygrammaticalsentenceamongthefivebestcandidates.In othercases,all fivecandidateswereun-grammatical.
Unfortunately, using a strong grammarfor a test setasdiverseasours is not practical,becausethe sizeof anHMM network grows exponentiallywith the numberofrulespresentin thegrammar. Statisticallanguagemodels,suchasbigrammodels,haveprovedto beaneffectivesolu-tion to thisproblemin speechrecognition.Weshow in [20]thatbigramlanguagemodelsarepromisingfor ASL recog-nition aswell. However, they requirea largecorpusof la-beledreal-world datato becometruly effective. Presently,nosuchcorpusexistsfor ASL.
8.5 Coupling ExperimentsTo investigatethe effectsof couplingthe three-dimen-
sionalmotionanalysiswith theHMM framework, weper-formedtwo experiments.In thefirst experiment,we ana-lyzedall sentencesin thetestsetwith ourmotionanalysis,soasto provideanupperboundon its performance.If themotion analysishadworked perfectly, it shouldhave ac-ceptedall of these97 testsentences.In reality, however, itrejected10outof these97sentences.
A closerlook at the10 rejectedsentencesrevealedthatfive of thesewerenot recognizedcorrectlyby thecontext-dependentHMMs either. Thus,it is likely that thesefive
11
sentenceswerenotsignedpreciselyenoughduringthedatacollectionprocess.Theotherfive rejectedsentencesindi-catethatthemotionanalysisstill needsimprovement.
In thesecondexperiment,weranthecouplingalgorithmon the actual recognitionhypothesesfrom the context-dependentHMMs in theexperimentsin Section8.3. Thistime,thealgorithmalsoeliminated10sentencesoutof 97.Five of thesewere correctly rejected;that is, the HMMframework hadprovidedincorrectresultsfor them. Thus,at thecurrentmoment,couplingHMMs with motionanal-ysisbreaksevenwith usingtheHMM framework by itself.Theword accuracy achievedby thecouplingwas90.10%,which is slightly better than the 89.91%word accuracyachievedby thecontext-dependentmodelsalone.
As we have usedonly a small part of the full powerof computervision motion analysisso far, we seetheseresultsasevidencethatcouplingwill eventuallybeabletooutperformeithermethodindependently.
9 SummaryWehavedevelopeda framework for recognizingAmer-
icanSignLanguagefrom three-dimensionaldataobtainedwith computervision techniques.We showedhow to col-lect three-dimensionaldatafrom computervision andusethem as input to Hidden Markov Models. We also de-terminedthat three-dimensionalfeaturesaresuperiorovertwo-dimensionalones.
By using context-dependentmodeling, we improvedrecognitionperformance. Throughcoupling vision pro-cesseswith HiddenMarkov Models,we took a first steptowardovercomingthe limitationsof eithermethodby it-self.
10 Future workThe collection of a standardizedcorpusof real-world
ASL conversationsandstory telling shouldbea high pri-ority for future work. The currentlack of sucha corpusmakesit impossibleto compareresultsfrom differentre-searchers.Furthermore,it makesthe developmentof sta-tistical languagemodelsfor ASL difficult. Suchlanguagemodelsarenecessaryfor large-scaleapplications.
Testing the algorithms describedin this paper andin [20, 21] with alargervocabularyis alsoimportant.Onlythenit will bepossibleto judgehow well thesealgorithmsscale.
On the linguistic sideof ASL recognition,futureworkshouldincorporatefacialexpressionsandotherphonolog-ical processesin ASL into the recognitionframework. Italsoneedsto addresshow to make useof handconfigura-tion information;usingthis informationeffectively seemsto benontrivial. Furthermore,futurework hasto find waysto use statistical languagemodels, so as to counterbal-ancetheimpracticabilityof usingstronglyconstrainedtask
grammars.On thecomputervisionsideof ASL recognition,future
work shouldelaborateon thecouplingof computervisionandHMMs andmake the computervision analysismorerobust. This work shouldconsistof recognizingmoredif-ferentgeometricproperties,fine-tuningthesigntemplates,andfine-tuningthe dynamicweightingof featuresbasedonthepropertiesof eachsignthatis matchedto thesignal.It alsoneedsto addresscoarticulationeffects,which it hasignoredsofar.
It is alsonecessaryto to developananthropometricallycorrectmodelof thehumanhand,sothatthecomputervi-sion trackingprocesscanmake handconfigurationinfor-mationavailableto therecognitionframework.
AcknowledgmentsThisworkwassupportedin partbyaNSFCareerAward
NSF-9624604,ONR-DURIP’97 N00014-97-1-0385 andN00014-97-1-0396, ONR Young InvestigatorProposal,andNSFIRI-97-01803.IoannisKakadiarishelpedin ob-tainingthecomputervisionsamples.
References[1] L. W. Campbell,D. A. Becker, A. Azarbayejani,A. F. Bo-
bick, and A. Pentland. Invariant featuresfor 3-D gesturerecognition.2ndInternationalWorkshoponFaceandGestureRecognition,Killington, VT, 1996.
[2] H. EdelsbrunnerandD. L. Souvaine. Computingleastme-dianof squaresregressionlinesandguidedtopologicalsweep.Journal of the American Statistical Asscoiation, 85:115–119,1990.
[3] R. Erenshteyn andP. Laskov. A multi-stageapproachto fin-gerspellingandgesturerecognition.Proceedingsof theWork-shopon the Integrationof Gesturein LanguageandSpeech,pp.185–194,Wilmington,DE, 1996.
[4] K. GrobelandM. Assam.IsolatedsignlanguagerecognitionusinghiddenMarkov models.SMC’97,pp.162–167.
[5] S. K. Liddell andR. E. Johnson.AmericanSignLanguage:Thephonologicalbase.Sign Language Studies, 64:195–277,1989.
[6] I. A. Kakadiaris,D. Metaxas,and R. Bajcsy. Active part-decomposition,shapeandmotionestimationof articulatedob-jects:A physics-basedapproach.CVPR’94,pp.980–984.
[7] I. A. Kakadiarisand D. Metaxas. 3D humanbody modelacquisitionfrom multipleviews. ICCV’95, pp.618–623.
[8] I. A. KakadiarisandD. Metaxas. Model basedestimationof 3D humanmotion with occlusionbasedon active multi-viewpoint selection.CVPR’96,pp.81–87.
[9] J. Matousek, D. Mount, and N. S. Netanyahu. Effi-centrandomizedalgorithmsfor the repeatedmedianline es-timator. To appearin Algorithmica. Available online athttp://www.cs.umd.edu/˜mount/pubs.html .
12
[10] D. Metaxas.Physics-based Deformable Models: Applica-tions to Computer Vision, Graphics and Medical Imaging.Kluwer AcademicPublishers,November1996.
[11] D. Mount, N. Netanyahu,K. Romanik,R. Silverman,andA. Y. Wu. A practicalapproximationalgorithmfor theLMSline estimator. 8th ACM-SIAM Symposiumon DiscreteAl-gorithms,pp.473–482,1997.
[12] Y. NamandK. Y. Wohn. Recognitionof space-timehand-gesturesusingHiddenMarkov model. ACM SymposiumonVirtual Reality Software and Technology, pp. 51–58,HongKong,1996.
[13] N. S. Netanyahu, V. Philomin, A. Rosenfeld,and A. J.Stromberg. Robust detectionof straight and circular roadsegments in noisy aerial images. Pattern Recognition,30(10):1673–1686,1997.
[14] S. Prillwitz et al. HamNoSys. Version 2.0; Hamburg Nota-tion System for Sign Languages. An introductory guide, vol-ume5 of International Studies on Sign Language and Com-munication of the Deaf. SignumVerlag,Hamburg, 1989.
[15] L. R. Rabiner. A tutorial on HiddenMarkov Modelsandselectedapplicationsin speechrecognition. Proceedings ofthe IEEE, 77(2):257–286,February1989.
[16] A. F. Siegel. Robust regressionusing repeatedmedians.Biometrika, 69:242–244,1982.
[17] J. M. SiskindandQ. Morris. A maximum-likelihoodap-proachto visualeventclassification.ECCV’96.
[18] T. StarnerandA. Pentland.Visualrecognitionof AmericanSign LanguageusingHiddenMarkov models. InternationalWorkshopon AutomaticFaceand GestureRecognition,pp.189–194,Zurich,Switzerland,1995
[19] C. Valli andC. Lucas. Linguistics of American Sign Lan-guage: An Introduction. GallaudetUniversity Press,Wash-ingtonDC, 1995.
[20] C. VoglerandD. Metaxas.AdaptingHiddenMarkov mod-elsfor ASL recognitionby usingthree-dimensionalcomputervisionmethods.SMC’97,pp.156–161.
[21] C. Vogler and D. Metaxas. ASL recognitionbasedon acouplingbetweenHMMs and3D motionanalysis.ICCV’98,pp.363–369.
[22] M. B. WaldronandS. Kim. IsolatedASL signrecognitionsystemfor deafpersons.IEEE Transactions on RehabilitationEngineering, 3(3):261–71,September1995.
[23] M. WaleedKadous. Machinerecognitionof AuslansignsusingPowerGloves:Towardslarge-lexiconrecognitionof signlanguage.Proceedingsof theWorkshopon theIntegrationofGesturein LanguageandSpeech,pp. 165–174,Wilmington,DE, 1996.
[24] S.Young,J.Jansen,J.Odell,D. Ollason,andP. Woodland.The HTK Book (for HTK 2.0). CambridgeUniversity, 1995.
13