Gesture Agreement Assessment Using Description Vectors

Naveen Madapana, Glebys Gonzalez and Juan Wachs
School of Industrial Engineering, Purdue University, West Lafayette, United States.

Abstract— Participatory design is a popular design technique that involves the end users in the early stages of the design process to obtain user-friendly gestural interfaces. Guessability studies followed by agreement analyses are often used to elicit and comprehend the preferences (or gestures/proposals) of the participants. Previous approaches to assess agreement grouped the gestures into equivalence classes and ignored the integral properties that are shared between them. In this work, we represent the gestures using binary description vectors to allow them to be partially similar. In this context, we introduce a new metric, referred to as soft agreement rate (SAR), to quantify the level of consensus between the participants. In addition, we performed computational experiments to study the behavior of our partial agreement formula and mathematically show that existing agreement metrics are a special case of our approach. Our methodology was evaluated through a gesture elicitation study conducted with a group of neurosurgeons; nevertheless, our formulation can be applied to any other user-elicitation study. Results show that the level of agreement obtained by the SAR metric is 2.64 times higher than that of the existing metrics. In addition to the most agreed-upon gesture, the SAR formulation also provides the most agreed-upon descriptors, which can potentially help the designers arrive at a final gesture set.

I. INTRODUCTION AND RELATED WORK

Gestures offer an intuitive and natural mode of interaction with computing devices. Given the ubiquity of gestural inputs in handheld smart devices and gaming consoles [1], [2], the choice of gestures plays a crucial role in the development of efficient and user-friendly gestural interfaces [3]. In this paper, we utilize the principles of participatory design to obtain the best choices of gestures and further propose a new technique to measure the agreement among the participants.

Participatory design is a popular design technique that aims to involve the end users in the early stages of the design process to obtain high-acceptability interfaces [4]–[6]. It allows the stakeholders, i.e., customers or end users, to provide their requirements so that the developed system complies with their preferences [7], [8]. This technique is especially advantageous in domains requiring particular expertise, e.g., radiology, neurosurgery, construction and aviation, as the experts in those domains have intrinsic knowledge about the environment, which considerably affects the choice of gestures [9]–[11].

Guessability or user-elicitation studies are often conducted to elicit user preferences in the form of proposals for the functionalities of the interface. The nature of the proposals greatly depends on the interface and its mode of operation. For instance, the proposals can be: 1. Word choices in the case of speech-based virtual assistants such as Amazon Echo [12], 2. Choice of symbolic gestures in the case of smartphones or autonomous vehicles [6], [13], and 3. Choice of freehand gestures in the case of gaming consoles [1], [14].

Our work falls under the third category, in which the proposals are the gestures elicited by the end users.

Fig. 1: Gestures that are similar but not equivalent.

Once the participants’ gestures are obtained, agreement analysis is performed to quantify the agreement among users and select the best set of gestures for the final interface. The metrics introduced by Wobbrock et al. [4], [5] have been popularly used in the literature to measure the level of agreement among users. These metrics are referred to as the “agreement index” (A_r) [4] and the “agreement rate” (AR_r) [5] for command r. The metric A_r is inconsistent at the boundary, i.e., it takes a non-zero value when there is no agreement, leading to an overestimation of the agreement. In this regard, Vatavu et al. proposed a new formula for agreement (AR_r) to address this critical issue. Interestingly enough, Stern et al. developed a very similar formula while attempting to quantify the agreement among user-elicited proposals [3]. Since their introduction, these metrics have been widely adopted in several user-elicitation studies to assess agreement [4], [15]–[18].

Existing methods for agreement analysis are based on the formation of equivalence groups or classes among the elicited proposals. Each class consists of proposals that are identical or very similar. Human judgement or visual inspection is often used to determine whether two proposals are identical or non-identical. Hence, two proposals that are similar but not equivalent can potentially be assigned to two different equivalence classes, or vice versa (refer to Fig. 1). In addition, the degree of similarity between the proposals is binary, i.e., a value of one refers to a pair of identical proposals and zero otherwise. Hence, we refer to these methods as “hard classification” approaches.

In contrast, our work allows the proposals to be partially similar by using binary description vectors. First, each proposal is decomposed into a set of atomic properties or descriptors.

arXiv:1910.10140v1 [cs.HC] 22 Oct 2019


In the case of hand gestures, these properties could be the direction of motion, the shape of the trajectory, the orientation of the hand, etc. In other words, a proposal is represented as a binary vector where each element corresponds to the presence (one) or absence (zero) of a particular property. This binary vector is referred to as a soft representation or a description vector. Second, the similarity metrics that are widely used in pattern recognition can be used to measure the extent of similarity. Next, we propose a new metric to measure the level of agreement that implicitly takes into account the description vectors. We refer to these similarity-based agreement methodologies as “soft classification” approaches.
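To make the idea concrete, here is a minimal sketch of such an encoding. The descriptor names are illustrative placeholders of our own, not the paper's actual 55-descriptor set:

```python
# Hypothetical descriptor vocabulary (the paper uses 55 descriptors; see Table I).
DESCRIPTORS = [
    "dominant_hand_left", "dominant_hand_right", "dominant_hand_up",
    "dominant_hand_down", "circular_trajectory", "open_palm", "fist",
]

def encode(present_descriptors):
    """Return a binary description vector: 1 where a descriptor is present, else 0."""
    return [1 if d in present_descriptors else 0 for d in DESCRIPTORS]

# A swipe-left gesture performed with an open palm.
swipe_left = encode({"dominant_hand_left", "open_palm"})
print(swipe_left)  # [1, 0, 0, 0, 0, 1, 0]
```

Two gestures that share some but not all descriptors then yield vectors that are partially, rather than wholly, similar.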

The key contributions of this paper are as follows: 1. A novel methodology to perform agreement analysis for user-elicitation studies by incorporating the description vectors, 2. A mathematical proof showing that the existing agreement metrics are a special case of our approach, and 3. An empirical relationship between the agreement metric and the average percentage of participants that agreed on a particular gesture.

II. METHODOLOGY

Let us start by defining the notation. Let C be the number of commands or referents, d be the number of descriptors, and the subscript r denote the command r. Let P_r be the set of all proposals or gestures, where r = 1, ..., C. Let u_r ≤ |P_r| be the number of unique gestures, and let P_r^i be a subset of proposals that are considered identical. Thus, |P_r^i| is the number of identical proposals in the set i. We use |.| to denote the number of elements in a set and ||.|| to denote the L2 norm of a vector. Let the i-th proposal be represented as a binary description vector S_r^i ∈ {0, 1}^d, where i = 1, 2, ..., N_r. The total number of proposals (N_r) for the command r can be represented as follows (Eq. 1). Note that all referents have the same number of proposals.

N = N_r = \sum_{i=1}^{u_r} |P_r^i| = |P_r|    (1)

The most commonly used agreement metric, defined by Wobbrock et al. [5], is given in Eq. 2, where AR_r is the level of agreement for command r.

AR_r = \frac{1}{N(N-1)} \sum_{i=1}^{u_r} |P_r^i| (|P_r^i| - 1)    (2)
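As a sanity check, Eq. 2 can be computed directly from the sizes of the equivalence classes. The sketch below is our own straightforward Python transcription (the function name and example class sizes are illustrative):

```python
def agreement_rate(class_sizes):
    """AR_r of Eq. 2: class_sizes holds |P_r^i| for each equivalence class i."""
    n = sum(class_sizes)  # total number of proposals N
    return sum(s * (s - 1) for s in class_sizes) / (n * (n - 1))

# 10 proposals: one class of 4 identical gestures plus six singletons.
print(agreement_rate([4, 1, 1, 1, 1, 1, 1]))  # 4*3 / (10*9) ≈ 0.133
```

Note that singleton classes contribute nothing (|P_r^i| − 1 = 0), so AR is driven entirely by repeated proposals.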

A. Soft Agreement Rate (SAR)

To define an agreement index, we need a similarity metric that measures the closeness between two vectors (binary in our case). Popular similarity metrics include, but are not limited to, cosine [19], Jaccard [20] and Hamming similarity [21], which vary between zero and one (one when the vectors are equivalent and zero when they are orthogonal). Note that the Jaccard similarity does not consider zero-zero (a descriptor being absent in both gestures) as an agreement but only considers one-one as an agreement. This complies with the context of agreement analysis, because when two gestures lack a descriptor, it does not necessarily imply an agreement.

Given the sparsity of the description vectors, the cosine distance is a good alternative for measuring the similarity [21]. Hamming similarity cannot be used to measure similarity as it considers zero-zero as an agreement.

We propose an agreement metric referred to as Soft Agreement Rate (SAR). SAR_r is the agreement for command r, which is defined as the mean of the Jaccard similarity J applied to all possible pairwise combinations of binary vectors corresponding to the proposals in P_r (Eq. 3). An overall agreement rate (SAR) is defined as the mean of SAR_r over all referents r. The mathematical representation of SAR relates to AR in terms of considering all possible pairwise combinations. Similar to AR, the proposed metric SAR takes a value of 0 when there is no agreement and a value of 1 when all participants agree on a proposal.

SAR_r = \frac{2}{N(N-1)} \sum_{k=1}^{N} \sum_{j=k+1}^{N} J(S_r^j, S_r^k)    (3)

where S_r^j and S_r^k represent the binary vectors of the j-th and k-th proposals. Since the Jaccard similarity between two zero vectors is not defined, we propose a conditional definition for this similarity metric (Eq. 4). Our methodology can be easily generalized to the cosine similarity.

J(a, b) =
\begin{cases}
0, & \text{if } \|a\| + \|b\| = 0 \\
\dfrac{a \cdot b}{\|a\|^2 + \|b\|^2 - a \cdot b}, & \text{otherwise}
\end{cases}    (4)

where a and b are the binary vectors, and a·b denotes the dot product between the vectors a and b.
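A direct Python transcription of Eq. 3 and Eq. 4 might look as follows. Function names and the three-descriptor example are our own; note that for 0/1 vectors, ||a||² equals the number of ones in a:

```python
from itertools import combinations

def jaccard(a, b):
    """Conditional Jaccard similarity of Eq. 4 for binary vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    s = sum(a) + sum(b)  # equals ||a||^2 + ||b||^2 for 0/1 vectors
    return 0.0 if s == 0 else dot / (s - dot)

def soft_agreement_rate(vectors):
    """SAR_r of Eq. 3: mean Jaccard over all pairwise proposal combinations."""
    pairs = list(combinations(vectors, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Three proposals described over three hypothetical descriptors.
proposals = [[1, 0, 1], [1, 0, 0], [1, 1, 1]]
print(round(soft_agreement_rate(proposals), 3))  # 0.5
```

The three pairwise similarities are 1/2, 2/3 and 1/3, whose mean is 0.5; proposals sharing only some descriptors still contribute partial agreement instead of zero.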

B. Relation to Existing Metrics

In this section, we prove that the metric AR is a special case of our metric SAR. Let SAR^hard denote the soft agreement rate when the gestures are treated as rigid entities and are grouped into equivalence classes instead of being represented as combinations of descriptors. In machine learning, it is common to represent a distinct equivalence class (gestures in this case) as a one-hot (OH) vector.

There are u_r distinct equivalence classes, or unique gestures, for referent r. In this special case, each unique gesture is assigned a distinct OH vector of length u_r. This implies that all the gestures in the set P_r^i are assigned the same OH vector. The Jaccard similarity between two identical OH vectors is one, and between two distinct OH vectors it is zero. This nullifies all of the distinct pairwise OH vector terms; the resulting nonzero combinations are obtained from the pairwise combinations within each subset P_r^i. That is, J(S_r^j, S_r^k) = 0 when proposals j and k lie in different equivalence classes, and J(S_r^j, S_r^k) = 1 when they lie in the same class.

\Rightarrow \sum_{k=1}^{N} \sum_{j=k+1}^{N} J(S_r^j, S_r^k) = \frac{1}{2} \sum_{i=1}^{u_r} |P_r^i| (|P_r^i| - 1)


By substituting the above into Eq. 3, we get SAR^hard_r = AR_r. Interestingly enough, the resulting equation is exactly equal to the one proposed by Vatavu et al. [5]. This proves that our approach is general enough to accommodate both soft and hard representations of gestures.
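This equivalence is easy to verify numerically. The sketch below, under our own naming, builds one-hot vectors for a hard grouping and checks that SAR computed on them matches AR computed from the class sizes:

```python
from itertools import combinations

def jaccard(a, b):
    """Conditional Jaccard similarity of Eq. 4 for binary vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    s = sum(a) + sum(b)
    return 0.0 if s == 0 else dot / (s - dot)

def sar(vectors):
    """SAR_r of Eq. 3: mean pairwise Jaccard over proposal vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def ar(class_sizes):
    """AR_r of Eq. 2 from equivalence-class sizes."""
    n = sum(class_sizes)
    return sum(s * (s - 1) for s in class_sizes) / (n * (n - 1))

# Hard grouping: 10 proposals in three classes of sizes 4, 3 and 3.
sizes = [4, 3, 3]
one_hot = [[1 if j == i else 0 for j in range(len(sizes))]
           for i, size in enumerate(sizes) for _ in range(size)]

assert abs(sar(one_hot) - ar(sizes)) < 1e-12  # SAR on one-hot vectors equals AR
```

Identical one-hot pairs contribute Jaccard 1 and distinct pairs contribute 0, so the double sum collapses exactly to the within-class pair counts of Eq. 2.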

C. Interpretation of Agreement Index

We know that the agreement rate is related to the number of participants that agreed on a particular proposal. Nevertheless, there is no direct theoretical or empirical relation between the agreement value and the average percentage of participants that agreed on a proposal (η). For example, a value of η = 0.40 indicates that 40% of the participants agreed on a proposal. In this section, we present an elegant way of interpreting the level of agreement. Given the agreement rate, the main idea is to estimate the average number of subjects that agreed on a particular proposal.

For instance, assume that we have 10 proposals for a referent r and consider the following two scenarios: a) four proposals are equivalent and the rest are different (AR = 0.13, η = 0.40), and b) six proposals are equivalent and the rest are different (AR = 0.33, η = 0.60). In each of these cases, it is unclear how AR is related to η. Note that η² ≈ AR in these simplistic scenarios. In more complex scenarios, such as when 4 out of 10 subjects picked a particular gesture, 3 out of 10 picked another gesture, and the rest of the subjects picked different gestures, the value of η cannot be quantified directly.

Thus, we propose an empirical relation between η and the level of agreement, which is a numerical approximation of η in terms of the agreement value. Given the quadratic nature (pairwise combinations) of the agreement metrics, we propose that the value of η is approximately equal to the square root of SAR or AR (Eq. 5 and 6). In the case of hard representations, η indicates the average percentage of subjects that agreed on a gesture. However, for soft representations, η gives the average percentage of subjects that agreed on a set of gesture descriptors. Though η is an empirical and approximate measure of an average percentage, it is meant to provide a qualitative interpretation of the agreement values.

\eta^{AR}_r \approx \sqrt{AR_r}    (5)

\eta^{SAR}_r \approx \sqrt{SAR_r}    (6)
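For scenario (a) above (4 of 10 proposals equivalent, true η = 0.40), the approximation can be checked with a couple of lines; this is our own illustrative sketch:

```python
import math

def eta(agreement):
    """Empirical estimate of Eq. 5/6: fraction of agreeing subjects ~ sqrt(agreement)."""
    return math.sqrt(agreement)

ar_a = 4 * 3 / (10 * 9)     # AR for scenario (a) via Eq. 2, ~0.133
print(round(eta(ar_a), 2))  # 0.37, close to the true eta = 0.40
```

The quadratic (pairwise) structure of AR and SAR is what makes the square root a sensible first-order estimate of η.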

III. EXPERIMENTS AND RESULTS

A. Case Study

Our approach was evaluated and tested using the gestures elicited by neurosurgeons (refer to Fig. 2) in a guessability study conducted by Madapana et al. [18]. In their study, a group of nine neurosurgeons (participants) were asked to create gestures (proposals) for each of the 28 commands (referents) present in a radiology image browser. A total of 252 gestures were considered. Next, a subset of 55 gesture descriptors proposed in the literature [22], [23] was used in this paper, as shown in Table I, in order to create the soft representations. In other words, each of the 252 gestures was represented as a 55-dimensional binary description vector.

Two experiments were conducted in order to compute the level of agreement using AR and SAR. In the first experiment, a group of six participants was asked to group the gestures for each referent into equivalence classes. Hence, each equivalence class consists of gestures that are physically similar. This procedure was repeated for all of the 28 referents. We used these groupings to compute the level of agreement using the AR formulation (refer to Eq. 2), as shown in Table II.

TABLE I: A list of gesture descriptors. Note that L, R, U, D, F and B stand for left, right, up, down, forward and backward. CW and CCW stand for clockwise and counterclockwise.

Movement descriptors
  1. Dominant hand: L, R, U, D, F, B, CW, CCW, Iterative, and Circular
  2. Non-dominant hand: (same descriptors as the dominant hand)
  3. Overall hand motion: Inward flow, Outward flow, Circular

Orientation descriptors
  4. Dominant hand: Left, Right, Up, Down, Forward and Backward
  5. Non-dominant hand: (same descriptors as the dominant hand)

State of the hand
  6. Dominant hand: C shape, V shape, Fist, 1, 2, 3, 4 open fingers, open palm
  7. Non-dominant hand: (same descriptors as the dominant hand)

Other descriptors
  8. Avg. position of gesture: Above, Medium and Low

In the second experiment, the same group of six participants was asked to annotate each of the 252 gestures with respect to their gesture descriptors. We developed a web interface that facilitates annotating the descriptors of each gesture. This interface consisted of a window that plays the gesture video and a set of questions asking participants to annotate whether a particular descriptor is present or not. The annotations were automatically parsed to generate the description vectors. We used these annotations to compute the agreement using the SAR metric, as shown in Table II.

B. Distribution of SAR metric

This section presents the probability distribution function PDF(S | d = 55) of the SAR metric obtained by varying the number of subjects (S). This involved forming binary gesture description vectors of dimension d = 55 for each of the gesture proposals elicited by the S subjects. These description vectors were sampled from a Bernoulli distribution with probability P(1) = 0.5. First, the random descriptions were generated, and then the level of agreement using SAR was computed. This procedure was repeated for 10^7 iterations, and the normalized histogram of agreement values was constructed using 100 bins of equal intervals. Fig. 3a shows the PDF of SAR when the number of subjects varies, i.e., PDF(S | d = 55). The shape of the distribution resembles a bell curve with the peak occurring at approximately 0.33. For S = 9 and d = 55, the cumulative probability P(SAR ≤ 0.35) = 0.88, while P(SAR ≤ 0.40) = 0.999.
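The simulation can be sketched as follows. This is a reduced illustration under our own naming (10^3 draws instead of the paper's 10^7, and no histogram), not the authors' code:

```python
import random
from itertools import combinations

def jaccard(a, b):
    """Conditional Jaccard similarity of Eq. 4 for binary vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    s = sum(a) + sum(b)
    return 0.0 if s == 0 else dot / (s - dot)

def sar(vectors):
    """SAR_r of Eq. 3: mean pairwise Jaccard over proposal vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def sample_sar(rng, n_subjects=9, d=55, p_one=0.5):
    """One Monte Carlo draw: each subject's vector is i.i.d. Bernoulli(p_one)."""
    vecs = [[1 if rng.random() < p_one else 0 for _ in range(d)]
            for _ in range(n_subjects)]
    return sar(vecs)

rng = random.Random(0)
samples = [sample_sar(rng) for _ in range(1000)]
mean_sar = sum(samples) / len(samples)  # lands near the ~0.33 peak noted in the text
```

Sweeping p_one toward the sparse regime (e.g., 0.07) reproduces the second experiment, where the distribution shifts toward zero.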


Fig. 2: Gesture elicitation study with neurosurgeons.

TABLE II: Values of agreement (AR and SAR) and the estimated average percentage of subjects that agreed on a particular command or a set of descriptors (η_AR and η_SAR).

Command          AR      η_AR (%)   SAR     η_SAR (%)
Scroll up        0.08    28.87      0.30    54.50
Flip horizontal  0.13    36.51      0.39    62.66
Rotate CW        0.20    44.72      0.32    56.76
Zoom in          0.19    44.10      0.22    46.51
Zoom out         0.23    47.73      0.22    47.08
Panel left       0.15    38.73      0.21    45.79
Pan left         0.13    36.51      0.29    53.47
Ruler measure    0.12    34.96      0.19    43.90
Window open      0.07    25.82      0.23    47.73
Inc. contrast    0.06    24.72      0.36    59.97
Layout           0.09    30.73      0.37    60.47
Preset           0.07    25.82      0.32    56.99
Mean ± Std       0.11 ± 0.10   33.7 ± 7.41   0.29 ± 0.05   54 ± 5.0

Note that these PDFs were constructed assuming that the input data follows a Bernoulli distribution with P(1) = 0.5. However, the actual gesture description data is sparse, with zeroes occurring more frequently than ones (P(1) = 0.07). Hence, we conducted a new set of similar computational experiments to construct the PDF of SAR when the input resembles the actual data (a Bernoulli distribution with P(1) = 0.07). Fig. 3b shows the PDF of our metric when the parameter S is varied. Note that the shape of this distribution no longer resembles a bell curve, and the peak occurs between 0.0 and 0.1. For S = 9, the cumulative probability P(SAR ≤ 0.04) = 0.84, while P(SAR ≤ 0.07) = 0.99.

C. Discussion

The SAR formulation considers the properties of the gestures and hence produces relatively higher agreement rates in comparison to previous approaches. These results are expected, given that the SAR formulation considers the partial similarity between the gestures. This indicates that participants are more likely to agree on high-level properties of the gestures than on the gesture itself. It clearly indicates that the soft representations provide a deeper understanding of the properties that the domain experts agree on, even when the gestures are not identical. Indirectly, the SAR metric focuses on what experts emphasize in the gestures rather than on their plain spatio-temporal appearance.

Our formulation is particularly advantageous when the values of AR are very low, i.e., AR < 0.1 [5]. Given very little agreement, it is very hard to determine the final gesture, as the most agreed-upon gesture is chosen by very few participants. In such cases, system interface designers will benefit greatly from SAR, as it allows them to determine the properties of the final gesture. Similarly, when the difference between SAR and AR is large, as in the case of scroll up, the interface designers are recommended to utilize both methodologies when determining the final gesture lexicon.

Fig. 3: Probability distribution of SAR with varying number of subjects and d = 55 when the input gesture description data is sampled from a Bernoulli distribution with p(1) = 0.50 (left) and 0.93 (right).

IV. CONCLUSION

Existing techniques for agreement analysis strongly rely on the formation of equivalence classes and ignore the integral properties that are shared between the gestures. In this work, we represent gestures as binary description vectors by decomposing the gestures into a set of measurable properties. Next, we propose an agreement metric referred to as Soft Agreement Rate (SAR) that incorporates the description vectors into the agreement analysis. We further show that the existing agreement metrics are a special case of our approach when the equivalence classes themselves act as descriptors. Moreover, we propose an empirical relation between the level of agreement and the average percentage of participants that agreed on a particular gesture or a set of descriptors. We evaluated our approach using a gesture elicitation study conducted with a group of neurosurgeons. Our results show that the SAR metric yields considerably higher agreement rates than the existing metrics. In addition to providing the most agreed-upon gesture, the SAR formulation also provides the most agreed-upon descriptors. This can potentially help the interface designers choose the final gesture when the level of agreement is considerably low.


REFERENCES

[1] H. Istance, A. Hyrskykari, L. Immonen, S. Mansikkamaa, and S. Vickers, “Designing Gaze Gestures for Gaming: An Investigation of Performance,” in Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, ETRA ’10, (New York, NY, USA), pp. 323–330, ACM, 2010.

[2] Y. Wang, T. Yu, L. Shi, and Z. Li, “Using human body gestures as inputs for gaming via depth analysis,” in 2008 IEEE International Conference on Multimedia and Expo, pp. 993–996, June 2008.

[3] H. I. Stern, J. P. Wachs, and Y. Edan, “Optimal Consensus Intuitive Hand Gesture Vocabulary Design,” in 2008 IEEE International Conference on Semantic Computing, pp. 96–103, Aug. 2008.

[4] R.-D. Vatavu, “User-defined Gestures for Free-hand TV Control,” in Proceedings of the 10th European Conference on Interactive TV and Video, EuroITV ’12, (New York, NY, USA), pp. 45–48, ACM, 2012.

[5] R.-D. Vatavu and J. O. Wobbrock, “Formalizing Agreement Analysis for Elicitation Studies: New Measures, Significance Test, and Toolkit,” in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI ’15, (New York, NY, USA), pp. 1325–1334, ACM, 2015.

[6] H. Dong, A. Danesh, N. Figueroa, and A. El Saddik, “An elicitation study on gesture preferences and memorability toward a practical hand-gesture vocabulary for smart televisions,” IEEE Access, vol. 3, pp. 543–555, 2015.

[7] M. J. Muller and S. Kuhn, “Participatory design,” Commun. ACM, vol. 36, pp. 24–28, June 1993.

[8] F. Kensing and J. Blomberg, “Participatory design: Issues and concerns,” Computer Supported Cooperative Work (CSCW), vol. 7, pp. 167–185, Sep. 1998.

[9] M. G. Jacob, Y. T. Li, and J. P. Wachs, “Gestonurse: A multimodal robotic scrub nurse,” in 2012 7th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 153–154, Mar. 2012.

[10] J. Wachs, H. Stern, Y. Edan, M. Gillam, C. Feied, M. Smith, and J. Handler, “Gestix: A Doctor-Computer Sterile Gesture Interface for Dynamic Environments,” in Soft Computing in Industrial Applications, pp. 30–39, Springer, Berlin, Heidelberg, 2007. DOI: 10.1007/978-3-540-70706-6_3.

[11] J. Hettig, A. Mewes, O. Riabikin, M. Skalej, B. Preim, and C. Hansen, “Exploration of 3D medical image data for interventional radiology using myoelectric gesture control,” in Proceedings of the Eurographics Workshop on Visual Computing for Biology and Medicine, pp. 177–185, Eurographics Association, 2015.

[12] S. Wright, Amazon Echo Dot: Amazon Dot Advanced User Guide (2017 Updated) Step-by-Step Instructions to Enrich Your Smart Life! (Amazon Echo, Dot, Echo Dot, Amazon Echo User Manual, Echo Dot Ebook, Amazon Dot). USA: CreateSpace Independent Publishing Platform, 2016.

[13] S. S. Arefin Shimon, C. Lutton, Z. Xu, S. Morrison-Smith, C. Boucher, and J. Ruiz, “Exploring Non-touchscreen Gestures for Smartwatches,” in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, (New York, NY, USA), pp. 3822–3833, ACM, 2016.

[14] Z. Ren, J. Yuan, J. Meng, and Z. Zhang, “Robust Part-Based Hand Gesture Recognition Using Kinect Sensor,” IEEE Transactions on Multimedia, vol. 15, pp. 1110–1120, Aug. 2013.

[15] C. Valdes, D. Eastman, C. Grote, S. Thatte, O. Shaer, A. Mazalek, B. Ullmer, and M. K. Konkel, “Exploring the Design Space of Gestural Interaction with Active Tokens Through User-defined Gestures,” in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems, CHI ’14, (New York, NY, USA), pp. 4107–4116, ACM, 2014.

[16] L. Findlater, B. Lee, and J. Wobbrock, “Beyond QWERTY: augmenting touch screen keyboards with multi-touch gestures for non-alphanumeric input,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2679–2682, ACM, 2012.

[17] G. Bailly, T. Pietrzak, J. Deber, and D. J. Wigdor, “Metamorphe: augmenting hotkey usage with actuated keys,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 563–572, ACM, 2013.

[18] N. Madapana, G. Gonzalez, R. Rodgers, L. Zhang, and J. P. Wachs, “Gestures for picture archiving and communication systems (PACS) operation in the operating room: Is there any standard?,” PLOS ONE, vol. 13, pp. 1–13, June 2018.

[19] H. V. Nguyen and L. Bai, “Cosine Similarity Metric Learning for Face Verification,” in Computer Vision – ACCV 2010, Lecture Notes in Computer Science, pp. 709–720, Springer, Berlin, Heidelberg, Nov. 2010.

[20] S. Niwattanakul, J. Singthongchai, E. Naenudorn, and S. Wanapu, “Using of Jaccard coefficient for keywords similarity,” in Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1, 2013.

[21] R. J. Bayardo, Y. Ma, and R. Srikant, “Scaling Up All Pairs Similarity Search,” in Proceedings of the 16th International Conference on World Wide Web, WWW ’07, (New York, NY, USA), pp. 131–140, ACM, 2007.

[22] N. Madapana and J. Wachs, “Database of gesture attributes: Zero shot learning for gesture recognition,” in 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pp. 1–8, May 2019.

[23] A. Lascarides and M. Stone, “A Formal Semantic Analysis of Gesture,” Journal of Semantics, vol. 26, no. 4, p. 393, 2009.