
Proceedings of NAACL-HLT 2013, pages 1120–1130, Atlanta, Georgia, 9–14 June 2013. ©2013 Association for Computational Linguistics

Learning Whom to Trust with MACE

Dirk Hovy (1), Taylor Berg-Kirkpatrick (2), Ashish Vaswani (1), Eduard Hovy (3)

(1) Information Sciences Institute, University of Southern California, Marina del Rey
(2) Computer Science Division, University of California at Berkeley

(3) Language Technology Institute, Carnegie Mellon University, Pittsburgh
{dirkh,avaswani}@isi.edu, [email protected], [email protected]

Abstract

Non-expert annotation services like Amazon's Mechanical Turk (AMT) are cheap and fast ways to evaluate systems and provide categorical annotations for training data. Unfortunately, some annotators choose bad labels in order to maximize their pay. Manual identification is tedious, so we experiment with an item-response model. It learns in an unsupervised fashion to a) identify which annotators are trustworthy and b) predict the correct underlying labels. We match the performance of more complex state-of-the-art systems and perform well even under adversarial conditions. We show considerable improvements over standard baselines, both for predicted label accuracy and trustworthiness estimates. The latter can be further improved by introducing a prior on model parameters and using Variational Bayes inference. Additionally, we can achieve even higher accuracy by focusing on the instances our model is most confident in (trading in some recall), and by incorporating annotated control instances. Our system, MACE (Multi-Annotator Competence Estimation), is available for download [1].

[1] Available under http://www.isi.edu/publications/licensed-sw/mace/index.html

1 Introduction

Amazon's Mechanical Turk (AMT) is frequently used to evaluate experiments and annotate data in NLP (Callison-Burch et al., 2010; Callison-Burch and Dredze, 2010; Jha et al., 2010; Zaidan and Callison-Burch, 2011). However, some turkers try to maximize their pay by supplying quick answers that have nothing to do with the correct label. We refer to this type of annotator as a spammer. In order to mitigate the effect of spammers, researchers typically collect multiple annotations of the same instance so that they can, later, use de-noising methods to infer the best label. The simplest approach is majority voting, which weights all answers equally. Unfortunately, it is easy for majority voting to go wrong. A common and simple spammer strategy for categorical labeling tasks is to always choose the same (often the first) label. When multiple spammers follow this strategy, the majority can be incorrect. While this specific scenario might seem simple to correct for (remove annotators that always produce the same label), the situation grows trickier when spammers do not annotate consistently, but instead choose labels at random. A more sophisticated approach than simple majority voting is required.

If we knew whom to trust, and when, we could reconstruct the correct labels. Yet, the only way to be sure we know whom to trust is if we knew the correct labels ahead of time. To address this circular problem, we build a generative model of the annotation process that treats the correct labels as latent variables. We then use unsupervised learning to estimate parameters directly from redundant annotations. This is a common approach in the class of unsupervised models called item-response models (Dawid and Skene, 1979; Whitehill et al., 2009; Carpenter, 2008; Raykar and Yu, 2012). While such models have been implemented in other fields (e.g., vision), we are not aware of their availability for NLP tasks (see also Section 6).

Our model includes a binary latent variable that explicitly encodes if and when each annotator is spamming, as well as parameters that model the annotator's specific spamming "strategy". Importantly, the model assumes that labels produced by an annotator when spamming are independent of the true label (though a spammer can still produce the correct label by chance).

In experiments, our model effectively differentiates dutiful annotators from spammers (Section 4), and is able to reconstruct the correct label with high accuracy (Section 5), even under extremely adversarial conditions (Section 5.2). It does not require any annotated instances, but is capable of including varying levels of supervision via token constraints (Section 5.2). We consistently outperform majority voting, and achieve performance equal to that of more complex state-of-the-art models. Additionally, we find that thresholding based on the posterior label entropy can be used to trade off coverage for accuracy in label reconstruction, giving considerable gains (Section 5.1). In tasks where correct answers are more important than answering every instance, e.g., when constructing a new annotated corpus, this feature is extremely valuable. Our contributions are:

• We demonstrate the effectiveness of our model on real-world AMT datasets, matching the accuracy of more complex state-of-the-art systems

• We show how posterior entropy can be used to trade some coverage for considerable gains in accuracy

• We study how various factors affect performance, including the number of annotators, annotator strategy, and available supervision

• We provide MACE (Multi-Annotator Competence Estimation), a Java-based implementation of a simple and scalable unsupervised model that identifies malicious annotators and predicts labels with high accuracy

2 Model

We keep our model as simple as possible so that it can be effectively trained from data where annotator quality is unknown. If the model has too many parameters, unsupervised learning can easily pick up on and exploit coincidental correlations in the data. Thus, we make a modeling assumption that keeps our parameterization simple. We assume that an annotator always produces the correct label when he tries to. While this assumption does not reflect the reality of AMT, it allows us to focus the model's power where it's important: explaining away labels that are not correlated with the correct label.

Figure 1: Graphical model: Annotator j produces label Aij on instance i. Label choice depends on the instance's true label Ti, and on whether j is spamming on i, modeled by the binary variable Sij. N = |instances|, M = |annotators|.

for i = 1 ... N:
    Ti ~ Uniform
    for j = 1 ... M:
        Sij ~ Bernoulli(1 − θj)
        if Sij = 0:
            Aij = Ti
        else:
            Aij ~ Multinomial(ξj)

Figure 2: Generative process: see text for description.

Our model generates the observed annotations as follows: First, for each instance i, we sample the true label Ti from a uniform prior. Then, for each annotator j we draw a binary variable Sij from a Bernoulli distribution with parameter 1 − θj. Sij represents whether or not annotator j is spamming on instance i. We assume that when an annotator is not spamming on an instance, i.e. Sij = 0, he just copies the true label to produce annotation Aij. If Sij = 1, we say that the annotator is spamming on the current instance, and Aij is sampled from a multinomial with parameter vector ξj. Note that in this case the annotation Aij does not depend on the true label Ti. The annotations Aij are observed, while the true labels Ti and the spamming indicators Sij are unobserved. The graphical model is shown in Figure 1 and the generative process is described in Figure 2.
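For concreteness, this generative story can be sketched in a few lines of Python/NumPy. This is an illustration only, not part of the released Java implementation; the array layout and parameter names are our own:

import numpy as np

def sample_annotations(N, M, num_labels, theta, xi, seed=0):
    # theta[j]: probability that annotator j is NOT spamming (trustworthiness)
    # xi[j]:    annotator j's label distribution when spamming (row of an M x num_labels array)
    rng = np.random.default_rng(seed)
    T = rng.integers(num_labels, size=N)                   # Ti ~ Uniform
    A = np.empty((N, M), dtype=int)
    for i in range(N):
        for j in range(M):
            spamming = rng.random() < (1.0 - theta[j])     # Sij ~ Bernoulli(1 - theta_j)
            if spamming:
                A[i, j] = rng.choice(num_labels, p=xi[j])  # Aij ~ Multinomial(xi_j)
            else:
                A[i, j] = T[i]                             # copy the true label
    return T, A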

The model parameters are θj and ξj. θj specifies the probability of trustworthiness for annotator j (i.e., the probability that he is not spamming on any given instance). The learned value of θj will prove useful later when we try to identify reliable annotators (see Section 4). The vector ξj determines how annotator j behaves when he is spamming. An annotator can produce the correct answer even while spamming, but this can happen only by chance, since the annotator must use the same multinomial parameters ξj across all instances. This means that we only learn annotator biases that are not correlated with the correct label, e.g., the strategy of the spammer who always chooses a certain label. This contrasts with previous work where additional parameters are used to model the biases that even dutiful annotators exhibit. Note that an annotator can also choose not to answer, which we can naturally accommodate because the model is generative. We enhance our generative model by adding Beta and Dirichlet priors on θj and ξj, respectively, which allows us to incorporate prior beliefs about our annotators (Section 2.1).

2.1 Learning

We would like to set our model parameters to maximize the probability of the observed data, i.e., the marginal data likelihood:

P(A; \theta, \xi) = \sum_{T,S} \Big[ \prod_{i=1}^{N} P(T_i) \cdot \prod_{j=1}^{M} P(S_{ij}; \theta_j) \cdot P(A_{ij} \mid S_{ij}, T_i; \xi_j) \Big]

where A is the matrix of annotations, S is the matrix of competence indicators, and T is the vector of true labels.
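For illustration, the sketch below computes this log marginal likelihood, summing S out analytically per annotation and T out per item. It assumes an N × M integer matrix A in which −1 marks a missing annotation; this is a convention of the sketch, not of MACE itself:

import numpy as np

def log_marginal_likelihood(A, theta, xi, num_labels):
    # log P(A; theta, xi); xi is an M x num_labels array, theta a length-M vector
    N, M = A.shape
    logL = 0.0
    for i in range(N):
        joint = np.full(num_labels, 1.0 / num_labels)   # uniform prior P(Ti = t)
        for t in range(num_labels):
            for j in range(M):
                a = A[i, j]
                if a < 0:
                    continue                            # annotator j skipped item i
                p = (1.0 - theta[j]) * xi[j, a]         # Sij = 1: spam label drawn from xi_j
                if a == t:
                    p += theta[j]                       # Sij = 0: copied the true label
                joint[t] *= p
        logL += np.log(joint.sum())
    return logL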

We maximize the marginal data likelihood using Expectation Maximization (EM) (Dempster et al., 1977), which has successfully been applied to similar problems (Dawid and Skene, 1979). We initialize EM randomly and run it for 50 iterations. We perform 100 random restarts, and keep the model with the best marginal data likelihood. We smooth the M-step by adding a fixed value δ to the fractional counts before normalizing (Eisner, 2002). We find that smoothing improves accuracy, but, overall, learning is robust to varying δ; we set δ = 0.1/(number of labels).
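A single EM iteration for this model can be sketched as follows. This is again an illustration under the same conventions as above, not the released implementation; MACE's exact smoothing and bookkeeping may differ:

import numpy as np

def em_iteration(A, theta, xi, num_labels, delta=None):
    # One E-step plus smoothed M-step; A is N x M with -1 for missing annotations.
    if delta is None:
        delta = 0.1 / num_labels                  # smoothing value used in the paper
    N, M = A.shape
    exp_copy = np.zeros(M)                        # expected # of non-spam annotations per annotator
    n_annot = np.zeros(M)                         # # of annotations per annotator
    exp_spam_label = np.zeros((M, num_labels))    # expected spam-label counts per annotator
    for i in range(N):
        # joint P(Ti = t, A_i.) for every candidate true label t (as in the likelihood above)
        joint = np.full(num_labels, 1.0 / num_labels)
        for t in range(num_labels):
            for j in range(M):
                a = A[i, j]
                if a < 0:
                    continue
                p = (1.0 - theta[j]) * xi[j, a]
                if a == t:
                    p += theta[j]
                joint[t] *= p
        post_T = joint / joint.sum()              # E-step: P(Ti = t | A)
        for j in range(M):
            a = A[i, j]
            if a < 0:
                continue
            n_annot[j] += 1.0
            for t in range(num_labels):
                spam = (1.0 - theta[j]) * xi[j, a]
                copy = theta[j] if a == t else 0.0
                denom = copy + spam
                p_copy = post_T[t] * copy / denom if denom > 0 else 0.0   # P(Sij = 0, Ti = t | A)
                exp_copy[j] += p_copy
                exp_spam_label[j, a] += post_T[t] - p_copy                # P(Sij = 1, Ti = t | A)
    # M-step with additive smoothing of the fractional counts
    new_theta = (exp_copy + delta) / (n_annot + 2 * delta)
    new_xi = exp_spam_label + delta
    new_xi /= new_xi.sum(axis=1, keepdims=True)
    return new_theta, new_xi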

We observe, however, that the average annotator proficiency is usually high, i.e., most annotators answer correctly. The distribution learned by EM, however, is fairly linear. To improve the correlation between model estimates and true annotator proficiency, we would like to add priors about the annotator behavior into the model. A straightforward approach is to employ Bayesian inference with Beta priors on the proficiency parameters θj. We thus also implement Variational Bayes (VB) training with symmetric Beta priors on θj and symmetric Dirichlet priors on the strategy parameters ξj. Setting the shape parameters of the Beta distribution to 0.5 favors the extremes of the distribution, i.e., either an annotator tried to get the right answer, or simply did not care, but (almost) nobody tried "a little". With VB training, we observe improved correlations over all test sets with no loss in accuracy. The hyper-parameters of the Dirichlet distribution on ξj were clamped to 10.0 for all our experiments with VB training. Our implementation is similar to Johnson (2007), which the reader can refer to for details.
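In such a variational treatment, the only change to the M-step above is how expected counts are turned into parameters: the counts plus the Beta/Dirichlet hyper-parameters are passed through exponentiated digamma functions rather than being normalized directly (cf. Johnson, 2007). A minimal sketch of that update, under our own naming and not necessarily matching MACE's internals:

import numpy as np
from scipy.special import digamma

def vb_normalize(expected_counts, prior):
    # Mean-field update for a Beta/Dirichlet-multinomial parameter: exponentiate
    # digammas of the posterior pseudo-counts instead of dividing by their sum.
    posterior = expected_counts + prior
    return np.exp(digamma(posterior) - digamma(posterior.sum()))

# e.g. theta_j under a symmetric Beta(0.5, 0.5) prior (hypothetical variable names):
# theta_j = vb_normalize(np.array([exp_copy_j, n_annot_j - exp_copy_j]), prior=0.5)[0]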

3 Experiments

We evaluate our method on existing annotated datasets from various AMT tasks. However, we also want to ensure that our model can handle adversarial conditions. Since we have no control over the factors in existing datasets, we create synthetic data for this purpose.

3.1 Natural Data

In order to evaluate our model, we use the datasets from Snow et al. (2008) that use discrete label values (some tasks used continuous values, which we currently do not model). Since they compared AMT annotations to experts, gold annotations exist for these sets. We can thus evaluate the accuracy of the model as well as the proficiency of each annotator. We show results for word sense disambiguation (WSD: 177 items, 34 annotators), recognizing textual entailment (RTE: 800 items, 164 annotators), and recognizing temporal relations (Temporal: 462 items, 76 annotators).

3.2 Synthetic Data

In addition to the datasets above, we generate synthetic data in order to control for different factors. This also allows us to create a gold standard to which we can compare. We generate data sets with 100 items, using two or four possible labels.

For each item, we generate answers from 20 different annotators. The "annotators" are functions that return one of the available labels according to some strategy. Better annotators have a smaller chance of guessing at random.

For various reasons, usually not all annotators see or answer all items. We thus remove a randomly selected subset of answers such that each item is only answered by 10 of the annotators. See Figure 3 for an example annotation of three items.

          annotators
item 1:   –  0  0  1  –  0  –  –  0  –
item 2:   1  –  –  0  –  1  0  –  –  0
item 3:   –  –  0  –  0  1  –  0  –  0

Figure 3: Annotations: 10 annotators on three items, labels {1, 0}, 5 annotations/item. Missing annotations marked '–'.
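A sketch of this data generator follows. It is our own illustration; the strategies and pool sizes mirror the description above, but the actual experimental scripts may differ in detail:

import numpy as np

rng = np.random.default_rng(0)

def good_annotator(true_label, num_labels, accuracy=0.95):
    # answers correctly with the given probability, otherwise guesses uniformly
    return true_label if rng.random() < accuracy else rng.integers(num_labels)

def random_spammer(true_label, num_labels):
    return rng.integers(num_labels)      # picks a label uniformly at random

def minimal_spammer(true_label, num_labels):
    return 0                             # always chooses the first label

def build_dataset(annotators, num_items=100, num_labels=2, answers_per_item=10):
    # every item gets `answers_per_item` answers from a random subset of the pool;
    # missing annotations are marked -1
    true = rng.integers(num_labels, size=num_items)
    A = np.full((num_items, len(annotators)), -1, dtype=int)
    for i in range(num_items):
        chosen = rng.choice(len(annotators), size=answers_per_item, replace=False)
        for j in chosen:
            A[i, j] = annotators[j](true[i], num_labels)
    return true, A

pool = [good_annotator] * 10 + [random_spammer] * 10   # 20 annotators, half reliable
gold, annotations = build_dataset(pool)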

3.3 Evaluations

First, we want to know which annotators to trust. We evaluate whether our model's learned trustworthiness parameters θj can be used to identify these individuals (Section 4).

We then compare the label predicted by our model and by majority voting to the correct label. The results are reported as accuracy (Section 5). Since our model computes posterior entropies for each instance, we can use these as an approximation of the model's confidence in the prediction. If we focus on predictions with high confidence (i.e., low entropy), we hope to see better accuracy, even at the price of leaving some items unanswered. We evaluate this trade-off in Section 5.1. In addition, we investigate the influence of the number of spammers and their strategy on the accuracy of our model (Section 5.2).

4 Identifying Reliable Annotators

One of the distinguishing features of the model is that it uses a parameter for each annotator to estimate whether or not they are spamming. Can we use this parameter to identify trustworthy individuals, to invite them for future tasks, and block untrustworthy ones?

                    RTE    Temporal   WSD
raw agreement       0.78   0.73       0.81
Cohen's κ           0.70   0.80       0.13
G-index             0.76   0.73       0.81
MACE-EM             0.87   0.88       0.44
MACE-VB (0.5,0.5)   0.91   0.90       0.90

Table 1: Correlation with annotator proficiency: Pearson ρ of different methods for various data sets. MACE-VB's trustworthiness parameter (trained with Variational Bayes with α = β = 0.5) correlates best with true annotator proficiency.

It is natural to apply some form of weighting. One approach is to assume that reliable annotators agree more with others than random annotators do. Inter-annotator agreement is thus a good candidate for weighting the answers. There are various measures of inter-annotator agreement.

Tratz and Hovy (2010) compute the average agreement of each annotator and use it as a weight to identify reliable ones. Raw agreement can be directly computed from the data. It is related to majority voting, since it will produce high scores for all members of the majority class. Raw agreement is thus a very simple measure.

In contrast, Cohen's κ corrects the agreement between two annotators for chance agreement. It is widely used for inter-annotator agreement in annotation tasks. We also compute the κ values for each pair of annotators, and average them for each annotator (similar to the approach in Tratz and Hovy (2010)). However, whenever one label is more prevalent (a common case in NLP tasks), κ overestimates the effect of chance agreement (Feinstein and Cicchetti, 1990) and penalizes disproportionately. The G-index (Gwet, 2008) corrects for the number of labels rather than chance agreement.

We compare these measures to our learned trustworthiness parameters θj in terms of their ability to select reliable annotators. A better measure should assign higher scores to annotators who answer correctly more often than others. We thus compare the ratings of each measure to the true proficiency of each annotator, i.e., the percentage of annotated items the annotator answered correctly. Methods that can identify reliable annotators should correlate highly with the annotator's proficiency. Since the methods use different scales, we compute Pearson's ρ as the correlation coefficient, which is scale-invariant. The correlation results are shown in Table 1.
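Concretely, the gold proficiencies and the correlation can be computed as follows. This is an illustrative sketch; theta_learned stands for the trustworthiness estimates produced by training and is not a variable of the released package:

import numpy as np
from scipy.stats import pearsonr

def true_proficiency(A, gold):
    # fraction of answered items each annotator got right;
    # A is N x M with -1 for missing answers, gold is the length-N vector of true labels
    answered = A >= 0
    correct = (A == gold[:, None]) & answered
    return correct.sum(axis=0) / answered.sum(axis=0)

# correlation between learned trust parameters and gold proficiency:
# rho, _ = pearsonr(theta_learned, true_proficiency(annotations, gold))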

The model's θj correlates much more strongly with annotator proficiency than either κ or raw agreement. The variant trained with VB performs consistently better than standard EM training, and yields the best results. This shows that our model detects reliable annotators much better than any of the other measures, which are only loosely correlated with annotator proficiency.

The numbers for WSD also illustrate the low κ score that results when all annotators (correctly) agree on a small number of labels. However, all inter-annotator agreement measures suffer from an even more fundamental problem: removing/ignoring annotators with low agreement will always improve the overall score, irrespective of the quality of their annotations. Worse, there is no natural stopping point: deleting the most egregious outlier always improves agreement, until we have only one annotator with perfect agreement left (Hovy, 2010). In contrast, MACE does not discard any annotators, but weights their contributions differently. We are thus not losing information. This works well even under adversarial conditions (see Section 5.2).

5 Recovering the Correct Answer

                  RTE    Temporal   WSD
majority          0.90   0.93       0.99
Raykar/Yu 2012    0.93   0.94       —
Carpenter 2008    0.93   —          —
MACE-EM/VB        0.93   0.94       0.99
MACE-EM@90        0.95   0.97       0.99
MACE-EM@75        0.95   0.97       1.0
MACE-VB@90        0.96   0.97       1.0
MACE-VB@75        0.98   0.98       1.0

Table 2: Accuracy of different methods on data sets from Snow et al. (2008). MACE-VB uses Variational Bayes training. Results @n use the n% of items the model is most confident in (Section 5.1). Results below the double line trade coverage for accuracy and are thus not comparable to the upper half.

The previous sections showed that our model reliably identifies trustworthy annotators. However, we also want to find the most likely correct answer. Majority voting often fails to find the correct label, and this problem worsens when there are more than two labels. We need to take relative majorities into account or break ties when two or more labels receive the same number of votes. This is deeply unsatisfying.
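For reference, the majority-voting baseline with random tie-breaking can be sketched as follows (our own illustration, using the −1 convention for missing answers):

import numpy as np

def majority_vote(A, num_labels, seed=0):
    # relative-majority baseline; ties between top labels are broken at random
    rng = np.random.default_rng(seed)
    labels = np.empty(A.shape[0], dtype=int)
    for i, row in enumerate(A):
        counts = np.bincount(row[row >= 0], minlength=num_labels)
        winners = np.flatnonzero(counts == counts.max())
        labels[i] = rng.choice(winners)
    return labels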

Table 2 shows the accuracy of our model on various data sets from Snow et al. (2008). The model outperforms majority voting on both the RTE and Temporal recognition sets. It performs as well as majority voting for the WSD task. This last set is somewhat of an exception, though, since almost all annotators are correct all the time, so majority voting is trivially correct. Still, we need to ensure that the model does not perform worse under such conditions. The results for RTE and Temporal data also rival those reported in Raykar and Yu (2012) and Carpenter (2008), yet were achieved with a much simpler model.

Carpenter (2008) models instance difficulty as a parameter. While it seems intuitively useful to model which items are harder than others, it increases the parameter space more than our trustworthiness variable. We achieve comparable performance without modeling difficulty, which greatly simplifies inference. The model of Raykar and Yu (2012) is more similar to our approach, in that it does not model item difficulty. However, it adds an extra step that learns priors from the estimated parameters. In our model, this is part of the training process. For more details on both models, see Section 6.

5.1 Trading Coverage for Accuracy

Sometimes, we want to produce an answer for every item (e.g., when evaluating a data set), and sometimes, we value good answers more than answering all items (e.g., when developing an annotated corpus). Jha et al. (2010) have demonstrated how to achieve better coverage (i.e., answer more items) by relaxing the majority voting constraints. Similarly, we can improve accuracy if we only select high quality annotations, even if this incurs lower coverage.

We provide a parameter in MACE that allows users to set a threshold for this trade-off: the model only returns a label for an instance if it is sufficiently confident in its answer. We approximate the model's confidence by the posterior entropy of each instance. However, entropy depends strongly on the specific makeup of the dataset (number of annotators and labels, etc.), so it is hard for the user to set a specific threshold.

Figure 4: Tradeoff between coverage and accuracy for RTE (left) and Temporal (right). Lower thresholds lead to less coverage, but result in higher accuracy. (Curves shown: MACE-EM, MACE-VB, majority.)

Instead of requiring an exact entropy value, we provide a simple thresholding between 0.0 and 1.0 (setting the threshold to 1.0 will include all items). After training, MACE orders the posterior entropies for all instances and selects the value that covers the selected fraction of the instances. The threshold thus roughly corresponds to coverage. It then only returns answers for instances whose entropy is below the threshold. This procedure is similar to precision/recall curves.
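A sketch of this selection step (our own illustration; posteriors stands for the N × num_labels matrix of label posteriors after training):

import numpy as np

def thresholded_predictions(posteriors, threshold=0.9):
    # keep predictions for roughly the `threshold` fraction of items the model is
    # most confident in; abstained items are marked -1
    entropy = -(posteriors * np.log(posteriors + 1e-12)).sum(axis=1)
    cutoff = np.quantile(entropy, threshold)   # entropy value covering that fraction
    preds = posteriors.argmax(axis=1)
    preds[entropy > cutoff] = -1               # abstain on low-confidence items
    return preds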

Jha et al. (2010) showed the effect of varying the relative majority required, i.e., requiring that at least n out of 10 annotators have to agree to count an item. We use that method as a baseline comparison, evaluating the effect on coverage and accuracy when we vary n from 5 to 10.

Figure 4 shows the tradeoff between coverage and accuracy for two data sets. Lower thresholds produce more accurate answers, but result in lower coverage, as some items are left blank. If we produce answers for all items, we achieve accuracies of 0.93 for RTE and 0.94 for Temporal, but by excluding just the 10% of items in which the model is least confident, we achieve accuracies as high as 0.95 for RTE and 0.97 for Temporal. We omit the results for WSD here, since there is little headroom and they are thus not very informative. Using Variational Bayes inference consistently achieves higher accuracy at the same coverage than the standard implementation. Increasing the required majority also improves accuracy, although not as much, and the loss in coverage is larger and cannot be controlled. In contrast, our method allows us to achieve better accuracy at a smaller, controlled loss in coverage.

5.2 Influence of Strategy, Number of Annotators, and Supervision

Adverse Strategy We showed that our model recovers the correct answer with high accuracy. However, to test whether this is just a function of the annotator pool, we experiment with varying the trustworthiness of the pool. If most annotators answer correctly, majority voting is trivially correct, as is our model. What happens, however, if more and more annotators are unreliable? While some agreement can arise from randomness, majority voting is bound to become worse. Can our model overcome this problem? We set up a second set of experiments to test this, using synthetic data. We choose 20 annotators and vary the number of good annotators among them from 0 to 10 (after which the trivial case sets in). We define a good annotator as one who answers correctly 95% of the time [2]. Adverse annotators select their answers randomly or always choose a certain value (minimal annotators). These are two frequent strategies of spammers.

[2] The best annotators on the Snow data sets actually found the correct answer 100% of the time.

For different numbers of labels and varying percentages of spammers, we measure the accuracy of our model and majority voting on 100 items, averaged over 10 runs for each condition. Figure 5 shows the effect of annotator proficiency on both majority voting and our method for both kinds of spammers. Annotator pool strategy affects majority voting more than our model. Even with few good annotators, our model learns to dismiss the spammers as noise. There is a noticeable point on each graph where MACE diverges from the majority voting line. It thus reaches good accuracy much faster than majority voting, i.e., with fewer good annotators. This divergence point happens earlier with more label values when adverse annotators label randomly. In general, random annotators are easier to deal with than the ones always choosing the first label. Note that in cases where we have a majority of adversarial annotators, VB performs worse than EM, since this condition violates the implicit assumptions we encoded with the priors in VB. Under these conditions, setting different priors to reflect the annotator pool should improve performance.

Figure 5: Influence of adverse annotator strategy on label accuracy (y-axis). The number of possible labels varies between 2 (top row) and 4 (bottom row). Adverse annotators either choose at random (a, random annotators) or always select the first label (b, minimal annotators). MACE needs fewer good annotators to recover the correct answer.

Figure 6: Varying the number of annotators: effect on prediction accuracy (curves: MACE-EM, MACE-VB, majority). Each point averaged over 10 runs. Note the different scale for WSD.

Obviously, both of these pools are extremes: it is unlikely to have so few good or so many malicious annotators. Most pools will be somewhere in between. It does show, however, that our model can pick up on reliable annotators even under very unfavorable conditions. The result has a practical upshot: AMT allows us to require a minimum rating for annotators to work on a task. Higher ratings improve annotation quality, but delay completion, since there are fewer annotators with high ratings. The results in this section suggest that we can find the correct answer even in annotator pools with low overall proficiency. We can thus waive the rating requirement and allow more annotators to work on the task. This considerably speeds up completion.

Number of Annotators Figure 6 shows the effect different numbers of annotators have on accuracy. As we increase the number of annotators, MACE and majority voting achieve better accuracy. We note that majority voting results level off or even drop when going from an odd to an even number. In these cases, the new annotator does not improve accuracy if he goes with the previous majority (i.e., going from 3:2 to 4:2), but can force an error when going against the previous majority (i.e., from 3:2 to 3:3), by creating a tie. MACE-EM and MACE-VB dominate majority voting for RTE and Temporal. For WSD, the picture is less clear: majority voting dominates when there are fewer annotators. Note that the differences are minute, though (within 1 percentage point). For very small pool sizes (< 3), MACE-VB outperforms both other methods.

Amount of Supervision So far, we have treated the task as completely unsupervised. MACE does not require any expert annotations in order to achieve high accuracy. However, we often have annotations for some of the items. These annotated data points are usually used as control items (by removing annotators that answer them incorrectly). If such annotated data is available, we would like to make use of it. We include an option that lets users supply annotations for some of the items, and use this information as token constraints in the E-step of training. In those cases, the model does not need to estimate the correct value, but only has to adjust the trust parameter. This leads to improved performance [3].

[3] If we had annotations for all items, accuracy would be perfect and require no training.
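The mechanism amounts to clamping the label posterior in the E-step whenever a control annotation is available; a minimal sketch (our own illustration, not the MACE internals):

import numpy as np

def constrain_posterior(post_T, gold_label=None):
    # token constraint: if a control annotation exists for this item, fix the
    # posterior over its true label; otherwise keep the inferred posterior
    if gold_label is None:
        return post_T
    clamped = np.zeros_like(post_T)
    clamped[gold_label] = 1.0
    return clamped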

For RTE and Temporal, we explore how performance changes when we vary the amount of supervision in increments of 5% [4]. We average over 10 runs for each value of n, each time supplying annotations for a random set of n items. The baseline uses the annotated label whenever supplied, and otherwise the majority vote, with ties split at random.

[4] Given the high accuracy for the WSD data set even in the fully unsupervised case, we omit the results here.

Figure 7 shows that, unsurprisingly, all methods improve with additional supervision, ultimately reaching perfect accuracy. However, MACE uses the information more effectively, resulting in higher accuracy for a given amount of supervision. This gain is more pronounced when only little supervision is available.

Figure 7: Varying the amount of supervision: effect on prediction accuracy. Each point averaged over 10 runs. MACE uses supervision more efficiently.

6 Related Research

Snow et al. (2008) and Sorokin and Forsyth (2008) showed that Amazon's Mechanical Turk can be used to provide non-expert annotations for NLP tasks. Various models have been proposed for predicting correct annotations from noisy non-expert annotations and for estimating annotator trustworthiness. These models divide naturally into two categories: those that use expert annotations for supervised learning (Snow et al., 2008; Bian et al., 2009), and completely unsupervised ones. Our method falls into the latter category because it learns from the redundant non-expert annotations themselves, and makes no use of expertly annotated data.

Most previous work on unsupervised models belongs to a class called "item-response models", used in psychometrics. The approaches differ with respect to which aspect of the annotation process they choose to focus on, and the type of annotation task they model. For example, many methods explicitly model annotator bias in addition to annotator competence (Dawid and Skene, 1979; Smyth et al., 1995). Our work models annotator bias, but only when the annotator is suspected to be spamming.

Other methods focus modeling power on instance difficulty to learn not only which annotators are good, but which instances are hard (Carpenter, 2008; Whitehill et al., 2009). In machine vision, several models have taken this further by parameterizing difficulty in terms of complex features defined on each pairing of annotator and annotation instance (Welinder et al., 2010; Yan et al., 2010). While such features prove very useful in vision, they are more difficult to define for the categorical problems common to NLP. In addition, several methods are specifically tailored to annotation tasks that involve ranking (Steyvers et al., 2009; Lee et al., 2011), which limits their applicability in NLP.

The method of Raykar and Yu (2012) is most similar to ours. Their goal is to identify and filter out annotators whose annotations are not correlated with the gold label. They define a function of the learned parameters that is useful for identifying these spammers, and then use this function to build a prior. In contrast, we use simple priors, but incorporate a model parameter that explicitly represents the probability that an annotator is spamming. Our simple model achieves the same accuracy on gold label predictions as theirs.

7 Conclusion

We provide a Java-based implementation, MACE, that recovers correct labels with high accuracy, and reliably identifies trustworthy annotators. In addition, it provides a threshold to control the accuracy/coverage trade-off, and can be trained with standard EM or Variational Bayes EM. MACE works fully unsupervised, but can incorporate token constraints via annotated control items. We show that even small amounts of supervision help improve accuracy.

Our model focuses most of its modeling power on learning trustworthiness parameters, which are highly correlated with true annotator reliability (Pearson's ρ of 0.9). We show on real-world and synthetic data sets that our method is more accurate than majority voting, even under adversarial conditions, and as accurate as more complex state-of-the-art systems. Focusing on high-confidence instances improves accuracy considerably. MACE is freely available for download under http://www.isi.edu/publications/licensed-sw/mace/index.html.

Acknowledgements

The authors would like to thank Chris Callison-Burch, Victoria Fossum, Stephan Gouws, Marc Schulder, Nathan Schneider, and Noah Smith for invaluable discussions, as well as the reviewers for their constructive feedback.


References

Jiang Bian, Yandong Liu, Ding Zhou, Eugene Agichtein, and Hongyuan Zha. 2009. Learning to recognize reliable users and content in social media with coupled mutual reinforcement. In Proceedings of the 18th International Conference on World Wide Web, pages 51–60. ACM.

Chris Callison-Burch and Mark Dredze. 2010. Creating speech and language data with Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 1–12, Los Angeles, June. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53, Uppsala, Sweden, July. Association for Computational Linguistics.

Bob Carpenter. 2008. Multilevel Bayesian models of categorical data annotation. Unpublished manuscript.

A. Philip Dawid and Allan M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38.

Jason Eisner. 2002. An interactive spreadsheet for teaching the forward-backward algorithm. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Volume 1, pages 10–18. Association for Computational Linguistics.

Alvan R. Feinstein and Domenic V. Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543–549.

Kilem Li Gwet. 2008. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1):29–48.

Eduard Hovy. 2010. Annotation. A tutorial. In 48th Annual Meeting of the Association for Computational Linguistics.

Mukund Jha, Jacob Andreas, Kapil Thadani, Sara Rosenthal, and Kathleen McKeown. 2010. Corpus creation for new genres: A crowdsourced approach to PP attachment. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 13–20. Association for Computational Linguistics.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 296–305.

Michael D. Lee, Mark Steyvers, Mindy de Young, and Brent J. Miller. 2011. A model-based approach to measuring expertise in ranking tasks. In L. Carlson, C. Holscher, and T. F. Shipley, editors, Proceedings of the 33rd Annual Conference of the Cognitive Science Society, Austin, TX. Cognitive Science Society.

Vikas C. Raykar and Shipeng Yu. 2012. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. Journal of Machine Learning Research, 13:491–518.

Padhraic Smyth, Usama Fayyad, Mike Burl, Pietro Perona, and Pierre Baldi. 1995. Inferring ground truth from subjective labelling of Venus images. Advances in Neural Information Processing Systems, pages 1085–1092.

Rion Snow, Brendan O'Connor, Dan Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics.

Alexander Sorokin and David Forsyth. 2008. Utility data annotation with Amazon Mechanical Turk. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPRW '08, pages 1–8. IEEE.

Mark Steyvers, Michael D. Lee, Brent Miller, and Pernille Hemmer. 2009. The wisdom of crowds in the recollection of order information. Advances in Neural Information Processing Systems, 23.

Stephen Tratz and Eduard Hovy. 2010. A taxonomy, dataset, and classifier for automatic noun compound interpretation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 678–687. Association for Computational Linguistics.

Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. 2010. The multidimensional wisdom of crowds. In Neural Information Processing Systems Conference (NIPS), volume 6.

Jacob Whitehill, Paul Ruvolo, Tingfan Wu, Jacob Bergsma, and Javier Movellan. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Advances in Neural Information Processing Systems, 22:2035–2043.

Yan Yan, Romer Rosales, Glenn Fung, Mark Schmidt, Gerardo Hermosillo, Luca Bogoni, Linda Moy, and Jennifer Dy. 2010. Modeling annotator expertise: Learning when everybody knows a bit of something. In International Conference on Artificial Intelligence and Statistics.

Omar F. Zaidan and Chris Callison-Burch. 2011. Crowdsourcing translation: Professional quality from non-professionals. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1220–1229, Portland, Oregon, USA, June. Association for Computational Linguistics.
