Crowdsourcing for Multimedia Retrieval
+ Crowdsourcing for Multimedia Retrieval
Marco Tagliasacchi, Politecnico di Milano, Italy
+ Outline
- Crowdsourcing applications in multimedia retrieval
- Aggregating annotations
- Aggregating and learning
- Crowdsourcing at work
+ Crowdsourcing applications in multimedia retrieval
+ Crowdsourcing
- Crowdsourcing is an example of human computing
- Use an online community of human workers to complete useful tasks
- The task is outsourced to an undefined public
- Main idea: design tasks that are
  - Easy for humans
  - Hard for machines
+ Crowdsourcing
- Crowdsourcing platforms
  - Paid contributors
    - Amazon Mechanical Turk (www.mturk.com)
    - CrowdFlower (crowdflower.com)
    - oDesk (www.odesk.com)
    - …
  - Volunteers
    - Foldit (www.fold.it)
    - Duolingo (www.duolingo.com)
    - …
+ Applications in multimedia retrieval
- Create annotated data sets for training
  - Reduces both cost and time needed to gather annotations…
  - …but annotations might be noisy!
- Validate the output of multimedia retrieval systems
- Query expansion / reformulation
+ Creating annotated training sets [Sorokin and Forsyth, 2008]
- Collect annotations for computer vision data sets
  - people segmentation
[Slide embeds example results from the paper; the figure shows annotations collected under Protocols 1-4.]
Figure 1. Example results obtained from the annotation experiments. The first column is the implementation of the protocol, the second column shows obtained results, the third column shows some poor annotations we observed. The user interfaces are similar, simple and easy to implement. The total cost of annotating the images shown in this figure was US $0.66.
From the accompanying text: we further assume that the polygon with more vertices is a better annotation and we put it first in the pair. The distribution of scores and a detailed analysis appears in figures 4 and 5. We show all scores ordered from the best (lowest) on the left to the worst (highest) on the right. We select the 5:15:95 percentiles of quality (5 through 95 with step 15) and show the respective annotations. Looking at the images we see that the workers mostly try to accomplish the task. Some of the errors come from sloppy annotations (especially in the heavily underpaid experiment 3, polygonal labeling). Most of the disagreements come from difficult cases, when the question we ask is difficult.
+ Creating annotated training sets [Sorokin and Forsyth, 2008]
- Collect annotations for computer vision data sets
  - people segmentation and pose annotation
+ Creating annotated training sets [Sorokin and Forsyth, 2008]
- Observations:
  - Annotators make errors
  - Quality of annotators is heterogeneous
  - The quality of the annotations depends on the difficulty of the task
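A common first defense against these problems is redundancy: collect several labels per item and keep the majority answer, together with the agreement level as a crude confidence score. A minimal sketch (function and data names are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(labels_per_item):
    """For each item, return the most frequent label and the fraction
    of workers who agreed with it (a crude confidence estimate)."""
    consensus = {}
    for item, labels in labels_per_item.items():
        label, count = Counter(labels).most_common(1)[0]
        consensus[item] = (label, count / len(labels))
    return consensus

# three redundant annotations per image
votes = {
    "img1": ["person", "person", "statue"],
    "img2": ["person", "person", "person"],
}
print(majority_vote(votes))  # 'person' wins for both, with 2/3 and 3/3 agreement
```

This treats all annotators as equally reliable; the heterogeneous-quality observation above is exactly why later sections look at aggregation schemes that also estimate worker quality.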
[Slide embeds quality plots from the paper:]
Experiment 3: trace the boundary of the person. Score: area(XOR)/area(AND); the lower the better. Mean 0.21, std 0.14, median 0.16.
Experiment 4: click on 14 landmarks. Score: mean error in pixels between annotation points; the lower the better. Mean 8.71, std 6.29, median 7.35.
Figure 5. Quality details. We present detailed analysis of annotation quality for experiments 3 and 4. For every image the best fitting pair of annotations is selected. The score of the best pair is shown in the figure. For experiment 3 we score annotations by the area of their symmetric difference (XOR) divided by the area of their union (OR). For experiment 4 we compute the average distance between the marked points. The scores are ordered low (best) to high (worst). For clarity we render annotations at the 5:15:95 percentiles of the score. Blue curve and dots show annotation 1, yellow curve and dots show annotation 2 of the pair. For experiment 3 we additionally assume that the polygon with more vertices is a better annotation, so annotation 1 (blue) always has more vertices.
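The experiment-3 score can be reproduced directly on rasterised annotation masks. The sketch below uses the symmetric-difference-over-union form from the Figure 5 caption (the axis label above quotes area(XOR)/area(AND); swap the denominator accordingly); the masks and shapes are made up for illustration:

```python
import numpy as np

def annotation_disagreement(mask_a, mask_b):
    """Disagreement between two segmentation annotations: area of their
    symmetric difference (XOR) divided by the area of their union (OR).
    0 means identical; lower is better."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # both annotations empty
    return float(np.logical_xor(a, b).sum()) / union

# two rasterised polygon annotations of the same person,
# the second shifted one pixel to the right
a = np.zeros((10, 10), dtype=bool); a[2:8, 2:8] = True
b = np.zeros((10, 10), dtype=bool); b[2:8, 3:9] = True
print(annotation_disagreement(a, b))  # 12 disagreeing pixels over a 42-pixel union
```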
[Per-landmark error plots; panels cover rAnkle, rKnee, lKnee, lAnkle; rWrist, rElbow, lElbow, lWrist; rHip, lHip, rShoulder, lShoulder; Neck, Head.]
Figure 6. Quality details per landmark. We present analysis of annotation quality per landmark in experiment 4. We show scores of the best pair for all annotations between the 35th and 65th percentiles, between points “C” and “E” of experiment 4 in fig. 5. All the plots have the same scale: from image 100 to 200 on the horizontal axis and from 3 pixels to 13 pixels of error on the vertical axis. These graphs show annotators have greater difficulty choosing a consistent location for the hip than for any other landmark; this may be because some place the hip at the point a tailor would use and others mark the waist, or because the location of the hip is difficult to decide under clothing.
+ Creating annotated training sets [Soleymani and Larson, 2010]
- MediaEval 2010 Affect Task
- Use of Amazon Mechanical Turk to annotate the Affect Task Corpus
- 126 videos (2-5 mins in length)
- Annotate:
  - Mood (e.g., pleased, helpless, energetic, etc.)
  - Emotion (e.g., sadness, joy, anger, etc.)
  - Boredom (nine-point rating scale)
  - Liking (nine-point rating scale)
+ Creating annotated training sets [Nowak and Ruger, 2010]
- Crowdsourcing image concepts. 53 concepts, e.g.,
  - Abstract categories: partylife, beach holidays, snow, etc.
  - Time of the day: day, night, no visual cue
  - …
- Subset of 99 images from the ImageCLEF2009 dataset
Place contains three mutually exclusive concepts, namely Indoor, Outdoor and No Visual Place. In contrast, several optional concepts belong to the category Landscape Elements. The task of the annotators was to choose exactly one concept for categories with mutually exclusive concepts and to select all applicable concepts for optional concepts. All photos were annotated at an image-based level. The annotator tagged the whole image with all applicable concepts and then continued with the next image.
Figure 1: Annotation tool that was used for the acquisition of expert annotations.
Fig. 1 shows the annotation tool that was delivered to the annotators. The categories are ordered into the three tabs Holistic Scenes, Representation and Pictured Objects. All optional concepts are represented as check boxes and the mutually exclusive concepts are modelled as radio button groups. The tool verifies that for each category containing mutually exclusive concepts exactly one was selected before storing the annotations and presenting the next image.
3.3 Collecting Data of Non-expert Annotators
The same set of images that was used for the expert annotators was distributed over the online marketplace Amazon Mechanical Turk (www.mturk.com) and annotated by non-experts in the form of mini-jobs. At MTurk these mini-jobs are called HITs (Human Intelligence Tasks). They represent a small piece of work with an allocated price and completion time. The workers at MTurk, called turkers, can choose the HITs they would like to perform and submit the results to MTurk. The requester of the work collects all results from MTurk after they are completed. The workflow of a requester can be described as follows: 1) design a HIT template, 2) distribute the work and fetch results and 3) approve or reject work from turkers. For the design of the HITs, MTurk offers support by providing a web interface, command line tools and developer APIs. The requester can define how many assignments per HIT are needed, how much time is allotted to each HIT and how much to pay per HIT. MTurk offers several ways of assuring quality. Optionally the turkers can be asked to pass a qualification test before working on HITs, multiple workers can be assigned the same HIT and requesters can reject work in case the HITs were not finished correctly. The HIT approval rate each turker achieves by completing HITs can be used as a threshold for authorisation to work.
3.3.1 Design of HIT Template
The design of the HITs at MTurk for the image annotation task is similar to the annotation tool that was provided to the expert annotators (see Sec. 3.2). Each HIT consists of the annotation of one image with all applicable 53 concepts. It is arranged as a question survey and structured into three sections. The section Scene Description and the section Representation each contain four questions; the section Pictured Objects consists of three questions. In front of each section the image to be annotated is presented. The repetition of the image ensures that the turker can see it while answering the questions without scrolling to the top of the document. Fig. 2 illustrates the questions for the section Representation.
Figure 2: Section Representation of the survey.
The turkers see a screen with instructions and the task to fulfil when they start working. As a consequence, the guidelines should be very short and easy to understand. In the annotation experiment the following annotation guidelines were posted to the turkers. These annotation guidelines are far shorter than the guidelines for the expert annotators and do not contain example images.
• Selected concepts should be representative for the content or representation of the whole image.
• Radio button concepts exclude each other. Please annotate with exactly one radio button concept per question.
• Check box concepts represent optional concepts. Please choose all applicable concepts for an image.
• Please make sure that the information is visually depicted in the images (no meta-knowledge)!
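The quality-assurance levers described above (multiple assignments per HIT, an approval-rate gate, rejection of inconsistent work) can be combined into one simple acceptance rule. A sketch with hypothetical data structures, not actual MTurk API objects:

```python
from collections import Counter

def aggregate_hit(assignments, approval_rates, min_rate=0.95, min_agreement=2/3):
    """Decide the label for one HIT from redundant assignments.

    assignments: list of (worker_id, answer) pairs.
    approval_rates: worker_id -> historical HIT approval rate, used as
    an authorisation threshold as described above.
    Returns the consensus answer, or None if no qualified majority exists.
    """
    # gate out workers below the approval-rate threshold
    answers = [ans for worker, ans in assignments
               if approval_rates.get(worker, 0.0) >= min_rate]
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= min_agreement else None

rates = {"w1": 0.99, "w2": 0.97, "w3": 0.50}  # w3 falls below the gate
print(aggregate_hit([("w1", "Outdoor"), ("w2", "Outdoor"), ("w3", "Indoor")], rates))
# -> 'Outdoor' (w3's answer is filtered out, the remaining workers agree)
```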
+ Creating annotated training sets [Nowak and Ruger, 2010]
- Study of expert and non-expert labeling
- Inter-annotation agreement among experts:
  - very high
- Influence of the expert ground truth on concept-based retrieval ranking:
  - very limited
- Inter-annotation agreement among non-experts:
  - high, although not as good as among experts
- Influence of averaged annotations (experts vs. non-experts) on concept-based retrieval ranking:
  - averaging filters out noisy non-expert annotations
+ Creating annotated training sets [Vondrick et al., 2010]
- Crowdsourcing object tracking in video
- Annotators draw bounding boxes
Fig. 2: Our video labeling user interface. All previously labeled entities are shown and the box the user is currently working with is bright orange.
Displaying other workers’ labels unintentionally fostered a sense of community engagement that some of the workers expressed in unsolicited comments.
“Maybe it’s more bizarre that I keep doing these hits for a penny. I must not be the only one who finds them oddly compelling–more and more boxes show up on each hit.” — Anonymous subject
Mechanical Turk does not necessarily ensure quality work is produced. In fact, as a result of the low price of most HITs, many workers attempt to satisfy the HIT with the least amount of effort possible. Therefore it is very important that HITs are structured to produce desired results in a somewhat adversarial environment. One of the key criteria for the design of the UI is to make sure that producing quality work is no harder than doing the minimal amount of work to convince the UI that the HIT is completed. A second important criterion is to build into the evaluation process of a HIT an analysis of the validity of the work. A typical approach is to have multiple workers complete the same task until a statistical test demonstrates consensus on a single answer. A final important criterion is to design the interface so that it is difficult to successfully write an automated bot to get through the UI.
By requiring the user to annotate every key frame or explicitly say there is nothing left to annotate, we reduce the ease with which a worker can just “click-…”
+ Creating annotated training sets [Vondrick et al., 2010]
- Annotators label the enclosing bounding box of an entity every T frames
- Bounding boxes at intermediate time instants are interpolated
- Interesting trade-off between:
  - Cost of MTurk workers
  - Cost of interpolation on Amazon EC2 cloud
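Between two key frames, the simplest gap-filler is per-coordinate linear interpolation of the box corners. Vondrick et al. additionally refine the in-between boxes with visual tracking on EC2; the plain linear version below is only the baseline, and the names are illustrative:

```python
import bisect

def interpolate_box(key_boxes, t):
    """Estimate the (x0, y0, x1, y1) box at frame t from worker-drawn
    boxes at key frames, by linear interpolation of each coordinate."""
    frames = sorted(key_boxes)
    if t <= frames[0]:
        return key_boxes[frames[0]]
    if t >= frames[-1]:
        return key_boxes[frames[-1]]
    i = bisect.bisect_right(frames, t)
    f0, f1 = frames[i - 1], frames[i]
    w = (t - f0) / (f1 - f0)  # position of t between the two key frames
    return tuple((1 - w) * a + w * b
                 for a, b in zip(key_boxes[f0], key_boxes[f1]))

keys = {0: (0, 0, 10, 10), 10: (10, 0, 20, 10)}  # box slides right over 10 frames
print(interpolate_box(keys, 5))  # -> (5.0, 0.0, 15.0, 10.0)
```

Larger T means fewer paid annotations but a longer gap over which linear motion is assumed, which is exactly the human-effort vs. CPU-effort trade-off the slide mentions.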
Fig. 7: Cost trade-off between human effort and CPU cycles, for (a) field drills, (b) basketball players, (c) ball. As the total cost increases, performance will improve. Cost axes are in dollars.
4.3 Performance Cost Trade-off
We now consider our motivating question: how should one divide human effort versus CPU effort so as to maximize track accuracy given a budget of $X? A fixed dollar amount can be spent purely on human annotations, purely on CPU, or some combination. We express this combination as a diagonal line in the ground plane of the 3D plot in Fig. 7. We plot the tracking accuracy as a function of this combination for different $X amounts in Fig. 8. We describe the trade-off further in the caption of Fig. 8.
5 Conclusion
Our motivation thus far has been the use of crowdsource marketplaces as a cost-effective labeling tool. We argue that they also provide an interesting platform for research on interactive vision.
+ Creating annotated training sets [Urbano et al., 2010]
- Goal: evaluation of music information retrieval systems
- Use crowdsourcing as an alternative to experts to create ground truths of partially ordered lists
- Good agreement (92% complete + partial) with experts
answer preference judgments between F and each of the other documents. In this case, every document was judged as more similar, except for G, which was judged equally similar (or dissimilar). Therefore, a new segment appears to the left of F with all the candidates judged more relevant, and G is set up in the same group as F. For the second iteration, in the rightmost segment no judgment is needed because F and G were already compared, and B would be the pivot for the leftmost segment. Incipits A and C are judged similar to B, but D and E are judged as less similar, so they are set up in a segment to the right of B. At the end, there are 3 ordered groups of relevance formed with preference judgments. Note that not all the 21 judgments were needed to arrange and aggregate every incipit (e.g. G is only compared with F).

Table 1. Example of self-organized partially ordered list. Pivots for each segment are marked with *; documents that have already been pivots are marked with '. Segments are separated by ";".

Iteration | Segments                   | Preference Judgments
1         | C, D, E, A, G, B, F*       | C<F, D<F, E<F, A<F, G=F, B<F
2         | C, D, E, A, B* ; F', G     | C=B, D>B, E>B, A=B
3         | B', C, A* ; D, E* ; F', G  | C=A, D=E
4         | (A, B, C), (E, D), (F, G)  | -
With preference judgments, the sample of rankings given to each candidate is less variable than with the original method. Whenever a candidate is preferred over another one, it would be given a rank of 1 and -1 otherwise. In case it was judged equally similar, a rank of 0 would be added to its sample. With the original methodology, on the other hand, the ranks given to an incipit could range from 1 to well beyond 20, which increases the variance of the samples. Note that, with our scheme, the two samples of rankings given to each pair of documents are the opposite and therefore have the same variance. Signed Mann-Whitney U tests can be used again to decide whether two rank samples are different or not. Because the samples are less variable, the effect size is larger, which increases the statistical power of the test and makes it more likely for it to find a true difference where there is one. As a consequence, fewer assessors are needed overall.
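The pivot-based procedure behind Table 1 is essentially a quicksort whose comparisons are crowd preference judgments and whose "equal" outcomes form relevance groups. A sketch, where the judgment function is a stand-in for real worker answers (names and signatures are illustrative, not from the paper):

```python
def partial_order(candidates, prefer):
    """Arrange candidates into ordered relevance groups using pairwise
    preference judgments, quicksort-style.

    prefer(a, pivot) returns '>' if a is judged more similar to the
    query than the pivot, '=' if equally similar, '<' otherwise.
    """
    if not candidates:
        return []
    pivot, rest = candidates[0], candidates[1:]
    more, equal, less = [], [pivot], []
    for c in rest:
        verdict = prefer(c, pivot)
        (more if verdict == '>' else equal if verdict == '=' else less).append(c)
    # most-similar groups first, the pivot's own group in the middle
    return partial_order(more, prefer) + [equal] + partial_order(less, prefer)

# toy ground truth standing in for worker judgments: higher = more similar
scores = {'A': 2, 'B': 2, 'C': 2, 'D': 1, 'E': 1, 'F': 0, 'G': 0}
judge = lambda a, b: '>' if scores[a] > scores[b] else '=' if scores[a] == scores[b] else '<'
print(partial_order(list('CDEAGBF'), judge))
# -> [['C', 'A', 'B'], ['D', 'E'], ['G', 'F']]
```

As in the paper's example, each candidate is only compared against pivots, so far fewer than all 21 pairwise judgments are collected.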
4. CROWDSOURCING PREFERENCES
The use of a crowdsourcing platform seems very appropriate for our purposes. If the reasonable person assumption holds, we could use non-experts to generate a ground truth like this. Because we no longer show the image of the staves, but offer an audio file instead, no music expertise is needed. We have also seen how to use preference judgments to generate partially ordered lists instead of having assessors rank all candidates at once. Therefore, the whole process can be divided into very small and simple tasks where one incipit has to be preferred over the other, which seems perfectly doable for any non-expert. Also, the number of judgments between pairs of documents can be smaller, and given that we use non-experts, the overall cost should be much less. We are not aware of any work examining the feasibility of music-related tasks with crowdsourcing platforms like Amazon Mechanical Turk (AMT), so we decided to use it for our experiments. AMT has been widely used before for tasks related to text IR evaluation. HITs (each of the single tasks assigned to a worker) have traditionally used the English language, but it has been shown recently that workers can also work in other languages such as Spanish [18]. Other multimedia tasks, such as image tagging, have also been proved to be feasible with crowdsourcing [19].
4.1 HIT Design
The use of preference judgments is prone to a very simple HIT design (see Figure 4). We asked workers to listen to the two incipits to compare. Next, they were asked which variation was more similar to the original melody, allowing 3 options: A is more similar, B is more similar, and they are either equally similar or dissimilar. We indicated that if one melody was part of another one, they had to be considered equally similar, so as to comply with the original guidelines. As optional questions, they were asked for their musical background, if any, and for comments or suggestions to give us some feedback.
Figure 4. Example of HIT for music preference judgment.
The evaluation collection used in MIREX 2005 (Eval05 for short) had about 550 short incipits in MIDI format, which we transformed to MP3 files as they are easier to play in a standard web browser. The average duration was 6 seconds, ranging from 1 to 57 seconds. However, many incipits start with rests (see query and incipit C in Figure 2), which would make workers lose a lot of time. Therefore, we trimmed the leading and tailing silence, which resulted in durations from 1 to 26 seconds, with an average of 4 seconds. With these cuts, the average time needed to listen to the 3 files in a HIT at least once was 13 seconds, ranging from 4 to 24 seconds. This decision agrees with the initial guidelines that were given to the experts, as two incipits should be considered equally relevant despite one of them having leading or tailing rests (i.e. one would be just part of the other). We uploaded all these trimmed MP3 files to a private web server, as well as the source of a very simple Flash player to play the queries and candidate incipits. Therefore, our HIT template was designed to display the MP3 players and stream the audio files from our server. We created a batch of HITs for each of the iterations calculated with our methodology, and paid 2 cents for every answer. After downloading the results and analyzing them, we calculated the next preference judgments to perform and uploaded a new batch to AMT.
(Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010), July 23, 2010)
+ Validate the output of MIR systems [Snoek et al., 2010] [Freiburg et al., 2011]
- Search engine for archival rock ’n’ roll concert video
- Use of crowdsourcing to improve, extend and share automatically detected concepts in video fragments
Figure 1: Eleven common concert concepts we detect automatically, and for which we collect user feedback: guitar player, hands, Pinkpop logo, singer, Pinkpop hat, drummer, over the shoulder, close-up, audience, stage, keyboard.
Figure 2: Timeline-based video player where colored dots correspond to automated visual detection results. Users can navigate directly to fragments of interest by interaction with the colored dots, which pop up a feedback overlay as displayed in Figure 3.
since 1970 at Landgraaf, the Netherlands. All music videos have been recorded during the 40-year life cycle of the festival. We cleared copyright for several Dutch and Belgian artists playing at Pinkpop, including gigs from K’s Choice, Junkie XL, and Moke. The amount of footage for each festival year varies from only a summary to almost unabridged concert recordings, even including raw, unpublished footage. The complete video archive contains 94 concerts covering 32 hours in total.
We create detectors for 11 concert concepts following a state-of-the-art implementation [10]. We select the concepts based on frequency, visual detection feasibility, previous mentioning in literature and expected utility for concert video users (summarized in Figure 1). We consider a video fragment a more user-friendly retrieval unit compared to more technically defined shots or keyframes. We create fragment-level detection scores from frame-level scores by aggregating the concept scores of all the frames in the processed videos. The fragment algorithm was designed to find the longest fragments with the highest average scores for a specific concert concept [10]. Users may provide feedback on these automatically detected fragments using our feedback mechanism.
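The frame-to-fragment aggregation step can be sketched as a window search over per-frame concept scores. This is a simplification of the fragment algorithm referenced as [10], not its actual implementation: among windows whose average score reaches a threshold, prefer the longest, breaking ties by the higher average.

```python
def best_fragment(frame_scores, min_avg=0.65):
    """Pick one fragment from per-frame concept scores: among all
    contiguous frame windows whose average score reaches min_avg,
    prefer the longest, breaking ties by the higher average.
    Returns (start_frame, end_frame, average) or None."""
    best = None  # (length, average, start, end), compared lexicographically
    n = len(frame_scores)
    for start in range(n):
        total = 0.0
        for end in range(start, n):
            total += frame_scores[end]
            avg = total / (end - start + 1)
            if avg >= min_avg:
                candidate = (end - start + 1, avg, start, end)
                if best is None or candidate[:2] > best[:2]:
                    best = candidate
    return None if best is None else (best[2], best[3], best[1])

scores = [0.1, 0.9, 0.8, 0.7, 0.2, 0.9]  # invented frame-level detector scores
print(best_fragment(scores))  # frames 1..5, average 0.7
```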
2.2 Feedback Mechanism
The main mode of user interaction with our video search engine is by means of the In-Video Browser, see Figure 2. The timeline-based browser enables users to watch and navigate through a single video concert. Little colored dots on the timeline mark the location of an interesting fragment corresponding to an automatically derived label. To inspect the label, users simply move their mouse cursor over the colored dot. By clicking on the dot, the player instantly starts the specific fragment in the video. If needed, the user can manually select more concept labels in the panel on the left of the video player. To maintain overview, the In-Video browser automatically launches with a maximum of twelve fragments on the timeline interface every time a user starts a concert. These twelve correspond to the most reliable fragment labels. Once the timeline becomes too crowded as a result of multiple selected labels, the user may decide to zoom in on the timeline to retrieve fragments for a specific, smaller part of the video.
Figure 3: Harvesting user feedback for video fragments (top to bottom). The thumbs-up button indicates agreement with the automatically detected label, thumbs-down disagreement. Three key frames represent the visual summary of the fragment. Users may correct wrong labels, adapt fragment boundaries, or suggest additional labels (in Dutch).
An important aspect of the In-Video browser is that the user viewing experience is interrupted as little as possible; the video continues to play while the user interacts with the browser. In the graphical overlay that appears while the fragment is playing, the label is shown together with the
[Bar chart: user-feedback agreement thresholds (>50% to >90%) on the horizontal axis vs. number of video fragments (0-180) on the vertical axis, with series for excluded correct fragment labels and crowdsourcing errors.]
Figure 4: Results for Experiment 2: Quality vs Quantity. Simply relying on a majority vote of the crowd results in most correct fragments, albeit with 23 errors. We observe a best tradeoff between quality and quantity of crowdsourcing visual detectors for a user agreement of 67%.
4.2 Experiment 2: Quality vs Quantity
The question that we tried to answer with this experiment is whether the resulting labels are of sufficient quality compared to expert labels, when aggregated over multiple users. We have in total 510 fragments, where we now assume the expert label to be correct, and investigate for how many of them we would have obtained the same label when imposing a minimum agreement threshold on the crowdsourced labels. We plot the percentage of agreement among user-provided labels versus the number of video fragments in Figure 4. The ground truth shows that the quality of the suggested labels is high. As much as 85% of the automatically suggested labels correspond with the ground truth. If the simple evaluation principle of the majority is used, only 23 fragments have received tags that do not match with the ground truth, which in our case corresponds to a loss of 37 training samples. When we further increase the threshold for a positive or negative agreement, the number of fragments receiving the wrong label is gradually reduced to only 8 fragments, but the number of excluded training samples increases rapidly. For a conservative user agreement of 80%, for example, 119 fragments are ignored. We observe that a threshold of 67% provides a well-chosen balance between the 8 errors and the 422 fragments that can be used as a correction mechanism, or as reliable training examples for a new round of detector learning.
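This quality-vs-quantity analysis can be replayed on any labelled collection: sweep the minimum-agreement threshold and count, per threshold, the surviving wrong labels and the correct fragments that get thrown away. A sketch with invented data (the 510-fragment collection itself is not reproduced here):

```python
def threshold_tradeoff(fragments, thresholds=(0.5, 2/3, 0.8, 0.9)):
    """fragments: list of dicts with 'agreement' (fraction of users who
    agree with the suggested label) and 'correct' (whether the label
    matches the expert ground truth). Returns, per threshold, the number
    of surviving errors and of excluded correct fragments."""
    result = {}
    for th in thresholds:
        kept = [f for f in fragments if f["agreement"] > th]
        errors = sum(1 for f in kept if not f["correct"])
        excluded_correct = sum(1 for f in fragments
                               if f["agreement"] <= th and f["correct"])
        result[th] = {"errors": errors, "excluded_correct": excluded_correct}
    return result

frags = [  # made-up fragments: (user agreement, matches expert label?)
    {"agreement": 0.95, "correct": True},
    {"agreement": 0.70, "correct": True},
    {"agreement": 0.60, "correct": False},
    {"agreement": 0.55, "correct": True},
]
print(threshold_tradeoff(frags))
```

Raising the threshold drives errors toward zero while the pile of discarded correct training samples grows, which is exactly the trade-off the 67% operating point balances in Figure 4.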
5. CONCLUSIONThe main research question of this paper was: can user
tags from crowdsourcing be beneficial to a system that au-tomatically predicts labels for video fragments. We devel-oped a video search engine for a dedicated user communityin the domain of concert video allowing for easy fragment-level crowdsourcing. The user-feedback mechanism of theIn-Video browser made it possible to harvest positive andnegative user judgements on automatically predicted videofragment labels.
For this case study two experiments were conducted. The first experiment showed that users provided enough feedback. Analysis of the collected data proved that users provided feedback on the video-fragment labels without a preference for incorrect labels. The second experiment showed that 85% of the automatically suggested labels correspond with the ground truth. We observe that an aggregation threshold of 67% provides a well-chosen balance between errors in the user judgements and the amount of reliable training examples remaining. If the threshold is enforced, the error rate in the training examples is less than 2%. Within the context of our case study, we conclude that crowdsourcing can be beneficial to enhance and improve automated video content analysis. How the new information can be exploited for incremental learning of visual detectors is an interesting question for future research.
6. ACKNOWLEDGMENTS
We thank our users for providing feedback. We are grateful to the Netherlands Institute for Sound and Vision. This research is supported by the projects: BSIK MultimediaN, FES COMMIT, Images for the Future, and STW SEARCHER.
7. REFERENCES
[1] L. von Ahn. Games with a purpose. IEEE Computer, 39(6):92–94, 2006.
[2] M. Ames and M. Naaman. Why we tag: Motivations for annotation in mobile and online media. In Proc. CHI, 2007.
[3] R. Gligorov, L. B. Baltussen, J. van Ossenbruggen, L. Aroyo, M. Brinkerink, J. Oomen, and A. van Ees. Towards integration of end-user tags with professional annotations. In Proc. Web Science, 2010.
[4] A. Kittur, E. H. Chi, and B. Suh. Crowdsourcing user studies with Mechanical Turk. In Proc. CHI, 2008.
[5] C. Marlow, M. Naaman, D. Boyd, and M. Davis. HT06, tagging paper, taxonomy, Flickr, academic article, to read. In Proc. Hypertext, 2006.
[6] P. Marsden. Crowdsourcing. Contagious Magazine, 18:24–28, 2009.
[7] J. Nielsen. Participation inequality: Encouraging more users to contribute, 2006. http://www.useit.com/alertbox/participation_inequality.html.
[8] D. A. Shamma, R. Shaw, P. L. Shafton, and Y. Liu. Watch what I watch: using community activity to understand content. In Proc. MIR, 2007.
[9] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In Proc. MIR, 2006.
[10] C. G. M. Snoek, B. Freiburg, J. Oomen, and R. Ordelman. Crowdsourcing rock n' roll multimedia retrieval. In Proc. ACM Multimedia, 2010.
[11] C. G. M. Snoek and A. W. M. Smeulders. Visual-concept search solved? IEEE Computer, 43(6):76–78, 2010.
[12] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proc. EMNLP, 2008.
[13] J. Surowiecki. The wisdom of crowds: why the many are smarter than the few. Random House, 2005.
[14] R. van Zwol, L. Garcia, G. Ramirez, B. Sigurbjornsson, and M. Labad. Video tag game. In Proc. WWW, 2008.
Guitar player, Hands, Pinkpop logo, Singer, Pinkpop hat, Drummer, Over the shoulder, Close-up, Audience, Stage, Keyboard
Figure 1: Eleven common concert concepts we detect automatically, and for which we collect user feedback.
Figure 2: Timeline-based video player where colored dots correspond to automated visual detection results. Users can navigate directly to fragments of interest by interaction with the colored dots, which pop up a feedback overlay as displayed in Figure 3.
since 1970 at Landgraaf, the Netherlands. All music videos have been recorded during the 40-year life cycle of the festival. We cleared copyright for several Dutch and Belgian artists playing at Pinkpop, including gigs from K's Choice, Junkie XL, and Moke. The amount of footage for each festival year varies from only a summary to almost unabridged concert recordings, even including raw, unpublished footage. The complete video archive contains 94 concerts covering 32 hours in total.
We create detectors for 11 concert concepts following a state-of-the-art implementation [10]. We select the concepts based on frequency, visual detection feasibility, previous mentioning in literature and expected utility for concert video users (summarized in Figure 1). We consider a video fragment a more user-friendly retrieval unit compared to more technically defined shots or keyframes. We create fragment-level detection scores from frame-level scores by aggregating the concept scores of all the frames in the processed videos. The fragment algorithm was designed to find the longest fragments with the highest average scores for a specific concert concept [10]. Users may provide feedback on these automatically detected fragments using our feedback mechanism.
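One plausible reading of that fragment construction (a sketch, not the published algorithm of [10]): threshold the frame-level concept scores, merge consecutive qualifying frames into fragments, and rank fragments by their average score. The threshold value and function name below are assumptions for illustration.

```python
def fragments_from_frame_scores(scores, min_score=0.5):
    """Group consecutive frames whose concept score reaches min_score into
    fragments and rank fragments by their average score.  `scores` is one
    detector score per frame; returns (start, end, avg) with end exclusive."""
    fragments = []
    start = None
    for t, s in enumerate(scores + [float("-inf")]):  # sentinel flushes last run
        if s >= min_score and start is None:
            start = t                         # a qualifying run begins
        elif s < min_score and start is not None:
            run = scores[start:t]
            fragments.append((start, t, sum(run) / len(run)))
            start = None                      # run ends at frame t
    return sorted(fragments, key=lambda f: f[2], reverse=True)
```

Ranking by average score returns the most reliable fragments first, matching the browser's policy of surfacing only the most reliable fragment labels.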
2.2 Feedback Mechanism
The main mode of user interaction with our video search
engine is by means of the In-Video Browser, see Figure 2. The timeline-based browser enables users to watch and navigate through a single video concert. Little colored dots on the timeline mark the location of an interesting fragment corresponding to an automatically derived label. To inspect the label, users simply move their mouse cursor over the colored dot. By clicking on the dot, the player instantly starts the specific fragment in the video. If needed, the user can manually select more concept labels in the panel on the left of the video player. To maintain overview, the In-Video
Figure 3: Harvesting user feedback for video fragments (top to bottom). The thumbs-up button indicates agreement with the automatically detected label, thumbs-down disagreement. Three key frames represent the visual summary of the fragment. Users may correct wrong labels, adapt fragment boundaries, or suggest additional labels (in Dutch).
browser automatically launches with a maximum of twelve fragments on the timeline interface every time a user starts a concert. These twelve correspond to the most reliable fragment labels. Once the timeline becomes too crowded as a result of multiple selected labels, the user may decide to zoom in on the timeline to retrieve fragments for a specific, smaller part of the video.
An important aspect of the In-Video browser is that the user viewing experience is interrupted as little as possible: the video continues to play while the user interacts with the browser. In the graphical overlay that appears while the fragment is playing, the label is shown together with the
+ Validate the output of MIR systems [Steiner et al., 2011]
n Propose a browser extension to navigate detected events in videos
n Visual events (shot changes)
n Occurrence events (analysis of metadata by means of NLP to detect named entities)
n Interest-based events (click counters on detected visual events)
Crowdsourcing Event Detection in YouTube Videos
through a combination of textual, visual, and behavioral analysis techniques. When a user starts watching a video, three event detection processes start:
Visual Event Detection Process We detect shots in the video by visually analyzing its content [19]. We do this with the help of a browser extension, i.e., the whole process runs on the client-side using the modern HTML5 [12] JavaScript APIs of the <video> and <canvas> elements. As soon as the shots have been detected, we offer the user the choice to quickly jump into a specific shot by clicking on a representative still frame.
Occurrence Event Detection Process We analyze the available video metadata using NLP techniques, as outlined in [18]. The detected named entities are presented to the user in a list, and upon click, via a timeline-like user interface, allow for jumping into one of the shots where the named entity occurs.
Interest-based Event Detection Process As soon as the visual events have been detected, we attach JavaScript event listeners to each of the shots and count clicks on shots as an expression of interest in those shots.
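A common technique behind such shot detection is differencing of per-frame color histograms. The sketch below illustrates that general idea in Python; the extension itself runs in JavaScript against the <canvas> element, and the function name and cutoff value here are illustrative assumptions, not the extension's code.

```python
def detect_shot_boundaries(histograms, cutoff=0.25):
    """Given one normalized color histogram per frame, report a shot
    boundary wherever the L1 distance between consecutive histograms
    exceeds a cutoff (the classic histogram-difference approach)."""
    boundaries = []
    for t in range(1, len(histograms)):
        prev, cur = histograms[t - 1], histograms[t]
        dist = sum(abs(a - b) for a, b in zip(prev, cur))
        if dist > cutoff:
            boundaries.append(t)   # frame t starts a new shot
    return boundaries
```

Each detected boundary yields a representative still frame the user can click to jump into that shot.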
Fig. 2: Screenshot of the YouTube browser extension, showing the three different event types: visual events (video shots below the video), occurrence events (contained named entities and their depiction at the right of the video), and interest-based events (points of interest in the video highlighted with a red background in the bottom left).
+ Validate the output of MIR systems [Goeau et al., 2011]
n Visual plant species identification
n Based on local visual features
n Crowdsourced validation
Figure 1: GUI of the web application.
3. WEB APPLICATION & TAG POOLING
Figure 1 presents the Graphical User Interface of the web application. On the left, the user chooses to load a scan or a photograph; the system then returns and displays on the right the top-3 species with the most similar pictures. On the bottom left part, the user can then either select and validate the top-1 suggested species, choose another species in the list, or even enter a new species name if it is not available. The uploaded image used as query is temporarily stored with its associated species name, so that other users might interact with these new pictures later. So far, this last step is done offline, after professional botanists involved in the project validate the images and their species names. But the aggregation of these uploaded images into the visual knowledge will be integrated automatically in further versions. The species names and pictures are clickable and bring the user to online taxon descriptions from the Tela Botanica web site. In this way, beyond the visual content-based recognition process, the species identification is considered as one way to access richer botanical information like species distribution, complementary pictures, textual descriptions, etc.
4. COLLABORATIVE DATA COLLECTED
The current data was built by several cycles of collaborative data collection and taxonomical validation. Scans of leaves were collected over two seasons, between June and September, in 2009 and 2010, thanks to the work of active contributors from Tela Botanica social networks. The idea of collecting only scans during this first period was to initialize the training data with limited noisy background, so that the online identification tool works sufficiently well to attract new users. Notice that this did not prevent users from submitting unconstrained pictures, since our matching-based approach is relatively robust to such asymmetry between training and query images. The first online application did contain 457 validated scans over 27 species and the link was mostly disseminated through Tela Botanica. It finally allowed the collection of 2228 scans over 55 species. A public version of the application (http://combraille.cirad.fr:8080/demo_plantscan/) was opened in October 2010 (http://www.tela-botanica.org/actu/article3856.html). At the time of writing, 858 images were uploaded and tested by about 25 new users. These images are either scans or photographs with uniform background, or free photographs with natural background, and involve 15 new species beyond the previous set of 55 species. Note that the collected data will be used within the ImageCLEF 2011 plant retrieval task (http://www.imageclef.org/2011/plants).
5. EVALUATION
Performance, in terms of species identification rates, will be shown during the demo, with an offline version connected to a digital camera. It will consist of an engaging demo where anyone can play at shooting fresh cut leaves. Users will notice short response times for identification (around 2 seconds), and observe the relevance of the species suggested in spite of the intra-species visual variability, or cases with occlusions or with non-uniform backgrounds. As a rough guide, a leave-one-out cross-validation (i.e. each scan used one by one as external query) gives an average precision around 0.7 over the 20 first most similar images, and basically gives the correct species at the first rank 9 times out of 10 with the basic k-NN decision rule.
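The leave-one-out protocol with a basic k-NN decision rule can be sketched as follows. The toy 2-D "feature vectors" and function names are hypothetical stand-ins for the actual local visual features used by the system.

```python
from collections import Counter

def knn_species(query, gallery, k=3):
    """Majority vote over the k nearest gallery entries; each entry is
    (feature_vector, species).  Squared Euclidean distance suffices for
    ranking."""
    ranked = sorted(gallery,
                    key=lambda e: sum((q - x) ** 2 for q, x in zip(query, e[0])))
    votes = Counter(species for _, species in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(gallery, k=3):
    """Each scan is used once as the external query against all the others."""
    hits = sum(1 for i, (feat, species) in enumerate(gallery)
               if knn_species(feat, gallery[:i] + gallery[i + 1:], k) == species)
    return hits / len(gallery)

# toy 2-D "features" for two well-separated species
gallery = [([0.0, 0.0], "A"), ([0.1, 0.0], "A"), ([0.0, 0.1], "A"),
           ([1.0, 1.0], "B"), ([1.1, 1.0], "B"), ([1.0, 1.1], "B")]
```

On real leaf scans the identification rate reported above (correct species at rank 1 about 9 times out of 10) comes from exactly this kind of loop over the validated collection.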
6. CONCLUSIONS
This demo represents a first step towards a large-scale crowdsourcing application promoting collaborative enrichment of botanical visual knowledge and its exploitation for helping users to identify biodiversity. The next version will consider a fully autonomous and dynamic application integrating collaborative taxonomical validation. While the application focuses here on an educational subject, the performance obtained and the enthusiasm created during this project are encouraging for addressing other floras and more narrow studies.
7. ACKNOWLEDGMENTS
This research has been conducted with the support of the Agropolis Fondation. Great thanks to all users of Tela Botanica social networks who spent hours to cut, scan and test fresh leaves on our system.

8. ADDITIONAL AUTHORS
Jean-François Molino (IRD, UMR AMAP, Montpellier, France), Philippe Birnbaum (CIRAD, UMR AMAP), Daniel Barthelemy (CIRAD, BIOS, Direction and INRA, UMR AMAP, F-34398) and Nozha Boujemaa (INRIA, Saclay, France).
+ Validate the output of MIR systems [Yan et al., 2010]
n CrowdSearch combines
n Automated image search
n Local processing on mobile phones + backend processing
n Real-time human validation of search results
n Amazon Mechanical Turk
n Studies the trade-off in terms of
n Delay
n Accuracy
n Cost
n More on this later…
man error and bias to maximize accuracy. To balance these tradeoffs, CrowdSearch uses an adaptive algorithm that uses delay and result prediction models of human responses to judiciously use human validation. Once a candidate image is validated, it is returned to the user as a valid search result.
3. CROWDSOURCING FOR SEARCH
In this section, we first provide a background of the Amazon Mechanical Turk (AMT). We then discuss several design choices that we make while using crowdsourcing for image validation, including: 1) how to construct tasks such that they are likely to be answered quickly, 2) how to minimize human error and bias, and 3) how to price a validation task to minimize delay.
Background: We now provide a short primer on the AMT, the crowdsourcing system that we use in this work. AMT is a large-scale crowdsourcing system that has tens of thousands of validators at any time. The key benefit of AMT is that it provides public APIs for automatic posting of tasks and retrieval of results. The AMT APIs enable us to post tasks and specify two parameters: (a) the number of duplicates, i.e. the number of independent validators who we want to work on the particular task, and (b) the reward that a validator obtains for providing responses. A validator works in two phases: (a) they first accept a task once they identify that they would like to work on it, which in turn decrements the number of available duplicates, and (b) once accepted, they need to provide a response within a period specified by the task.
One constraint of the AMT that pertains to CrowdSearch is that the number of duplicates and reward for a task that has been posted cannot be changed at a later point. We keep this practical limitation in mind in designing our system.
Constructing Validation Tasks: How can we construct validation tasks such that they are answered quickly? Our experience with AMT revealed several insights. First, we observed that asking people to tag query images and candidate images directly is not useful since: 1) text tags from crowdsourcing systems are often ambiguous and meaningless (similar conclusions have been reported by other crowdsourcing studies [8]), and 2) tasks involving tagging are unpopular, hence they incur large delay. Second, we found that having a large validation task that presents a number of <query image, candidate image> pairs enlarges human error and bias, since a single individual can bias a large fraction of the validation results.
We settled on a simple format for validation tasks. Each <query image, candidate image> pair is packaged into a task, and a validator is required to provide a simple YES or NO answer: YES if the two images are correctly matched, and NO otherwise. We find that these tasks are often the most popular among validators on AMT.
Minimizing Human Bias and Error: Human error and bias are inevitable in validation results; therefore a central challenge is eliminating human error to achieve high accuracy. We use a simple strategy to deal with this problem: we request several duplicate responses for a validation task from multiple validators, and aggregate the responses using a majority rule. Since AMT does not allow us to dynamically change the number of duplicates for a task, we fix this number for all tasks. In §7.2, we evaluate several aggregation approaches, and show that a majority of five duplicates
Figure 2: Shown are an image search query, candidate images, and duplicate validation results. Each validation task is a Yes/No question about whether the query image and candidate image contain the same object.
is the best strategy and consistently achieves more than 95% search accuracy.
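The majority rule over duplicate responses is simple enough to sketch directly; this is a minimal illustration, not CrowdSearch's code.

```python
def aggregate_duplicates(responses):
    """Majority rule over duplicate YES/NO validator responses, e.g. the
    five duplicates recommended above.  An odd number of duplicates
    avoids ties."""
    yes = sum(1 for r in responses if r.upper() == "YES")
    return "YES" if 2 * yes > len(responses) else "NO"
```

With five duplicates, any two erroneous or adversarial validators are outvoted by the remaining three.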
Pricing Validation Tasks: Crowdsourcing systems allow us to set a monetary reward for each task. Intuitively, a higher price provides more incentive for human validators, and therefore can lead to lower delay. This raises the following question: is it better to spend X cents on a single validation task or to spread it across X validation tasks of price one cent each? We find that it is typically better to have more tasks at a low price than fewer tasks at a high price. There are three reasons for this behavior: 1) since a large fraction of tasks on the AMT offer a reward of only one cent, the expectation of users is that most tasks are quick and low-cost, 2) crowdsourcing systems like the AMT have tens of thousands of human validators, hence posting more tasks reduces the impact of a slow human validator on overall delay, and 3) more responses allow better aggregation to avoid human error and bias. Our experiments with AMT show that the first response in five one-cent tasks is 50-60% faster than a single five-cent task, confirming the intuition that delay is lower when more low-priced tasks are posted.
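The parallel-posting speedup has a simple order-statistics intuition. As a toy model (my assumption, not the paper's measurement): if individual response delays were i.i.d. exponential, the first of k parallel tasks has expected delay mean/k, so several cheap tasks beat one expensive task on latency even before pricing effects.

```python
import random

def expected_first_response(mean_delay_min, k, trials=20000, seed=7):
    """Monte Carlo estimate of the expected delay until the FIRST of k
    parallel tasks gets a response, assuming i.i.d. exponential response
    delays with the given mean.  For exponentials the exact value is
    mean_delay_min / k."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.expovariate(1.0 / mean_delay_min) for _ in range(k))
    return total / trials
```

The measured 50-60% speedup is smaller than this idealized k-fold reduction, which is expected: real AMT delays are not exponential and pricing changes worker behavior.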
4. CROWDSEARCH ALGORITHM
Given a query image and a ranked list of candidate images, the goal of human validation is to identify the correct candidate images from the ranked list. Human validation improves search accuracy, but incurs monetary cost and human processing delay. We first discuss these tradeoffs and then describe how CrowdSearch optimizes overall cost while returning at least one valid candidate image within a user-specified deadline.
4.1 Delay-Cost Tradeoffs
Before presenting the CrowdSearch algorithm, we illustrate the tradeoff between delay and cost by discussing posting schemes that optimize one or the other but not both.
Parallel posting to optimize delay: A scheme that optimizes delay would post all candidate images to the crowdsourcing system at the same time. (We refer to this as parallel posting.) While parallel posting reduces delay, it is expensive in terms of monetary cost. Figure 2 shows an instance where the image search engine returns four candi-
+ Query expansion / reformulation [Harris, 2012]
n Search YouTube user generated content
n Natural language queries are restated and given as input to
n YouTube search interface
n Students
n Crowd in MTurk
Prior work [9] found that inadequate phrasing of a question and/or corresponding answer on knowledge market websites negatively affects utility. Consequently, the ability to effectively search for UGC, particularly on rare or noisy topics, remains a challenge.
Crowdsourcing may provide a viable solution for searching UGC. The use of the crowd as a search strategy is compelling: it introduces diversity of search terms, since different members of the crowd will apply different search strategies based on their familiarity with the search topic. Moreover, the crowd has been shown to provide good quality in studies involving relevance judgments. Even with diversity, we can still expect search quality: some studies on prediction in crowdsourcing systems demonstrate that the reliability of the average of scores predicted by the crowd improves as the size of the crowd increases [10, 11]. Likewise, search quality is expected to improve as the number of searchers in the crowd expands. Crowdsourcing contrasts with knowledge markets in level of engagement; Nielsen mentions in [12] that over 90% of knowledge market group participants fail to contribute; therefore the crowdsourcing aspect introduces some financial incentive to motivate task participation.
The objective in this paper is to examine if the crowd can provide a more precise set of UGC search results, given a query, compared with other multimedia search tools. The contributions of this paper are as follows. First we compare the retrieval performance of different retrieval models in terms of precision on several categories using UGC video requests taken from leading knowledge market websites. We then compare YouTube's own search interface with a search conducted by students as well as a search approach using crowdsourcing. We evaluate our results using two methods: mean average precision determined after applying pooling, and a simple list preference, where the entire lists of videos judged as relevant by each method are compared.
The remainder of the paper is organized as follows. In Section 2 we put our work in the context of previous work. In Section 3 we discuss our experimental setup. Section 4 offers a discussion of the results. We conclude and provide insight into future work in Section 5.
2. RELATED WORK
Even prior to Web 2.0, there has been significant research in multimedia search methods, including several organized competitions that involve traditional search strategies. The popular TRECVid [13] benchmarking competition focuses on the detection of specific features within non-UGC multimedia collections. Wikipedia Retrieval, a task in ImageCLEF [14], involves locating relevant images from the Wikipedia image collection based on a provided text query and several sample images. While Wikipedia Retrieval examines noisy and unstructured textual annotations in Wikipedia multimedia, the semi-structured content evaluated in ImageCLEF is far less noisy and more structured than content searches on YouTube.
Several studies have examined search quality on user-supplied tags in other Web 2.0 applications. Diversity of image tag search results in Flickr using an implicit relevance feedback model is explored by van Zwol et al. [15], concluding that diversity is an important component when retrieval is based on small data sets, such as those found in image tags. Hotho et al. explore folksonomy tagging, which is bound by the same noisy unstructured restrictions as YouTube tags [16], but their study was primarily focused on recommender systems' usage of these tags. Others have examined multimedia search effectiveness on knowledge market websites, such as Chua et al. in [17] and Li et al. in [18]; however, their focus is to locate all content addressing a specific question (e.g. "how to" and "why" question types), whereas the focus of our study is on finding and ranking videos that fulfill a specific search request (e.g., "help find a video").
A few studies have examined the effectiveness of crowds on noisy data searches. Steiner et al. demonstrated searches of event detection methods in YouTube videos at the fragment level [19]. Hsueh et al. examined searches in political blogs in [20] which, although noisy, do not experience the restrictions inherent in multimedia tags. In [21], Yan et al. provided an innovative approach called CrowdSearch, which provided near-real-time assessment of images. Although the authors' focus was on labeling images, their approach could feasibly be extended to locating similar media on YouTube.
Figure 1: Overview of the video retrieval process involving YouTube's search interface, students, and the crowd.
4. RESULTS
4.1 Pooling
Using the pooling evaluation method, we calculate the MAP scores for each of the search efforts. These are given in Table 3. While these scores seem reasonable, this is likely due to two issues: our calculation of ground truth and, for most searches, the fact that only a small percentage of YouTube videos were considered relevant. The crowdsourcing search strategy and the student search strategies performed better than the YouTube search interface as measured by MAP, a result that is statistically significant (two-tailed, p < 0.05).
Since Restated Queries were grouped into three separate categories (easy, medium, and difficult), we evaluated them separately for each search strategy. The results are reported in Table 2.
;.<,)&K&8.#")"&"+0)&#$')8)"'#$%&*+#$'"&9+8&1#"4/""#+$:&&L#8"'2&567&
"4+8)"&9+8&)."@&M/)8#)"&.8)&0/4(&0+8)&4+$"#"')$'&.48+""&"'8.')%#)"&
4+0*.8)1& 3#'(& '(+")& 9+8& 0)1#/0& +8& 1#99#4/,'& ").84()":& & ;(#"& #"&
,#?),@&.&8)"/,'&+9&.&().-#)8&8),#.$4)&9+8&"'/1)$'"&.$1&'()&48+31&+$&
'()& "'.$1.81& B+/;/<)& ").84(& #$')89.4)& 9+8& '()& )."#)8& M/)8#)"2&
,#0#'#$%&'()&.1-.$'.%)"&+9&(/0.$&4+0*/'.'#+$:&&6"&0+8)&1#99#4/,'&
M/)8#)"& .8)& )$4+/$')8)12& '()& -.,/)& +9& (/0.$& 4+0*/'.'#+$&
<)4+0)"&.&0+8)�*+8'.$'&4+$"#1)8.'#+$:&
Second, although the MAP score gap is small between student search and crowdsourcing, we do notice that the five students consistently performed slightly better than the crowd. Each student performed all 25 queries, refining their sources and techniques as they encountered each new query – all five participants performed faster and provided better search results towards the end of their query session than in the beginning (we cannot observe this improvement with the crowd, as each crowd participant provided results for only a single query). The crowd had the smallest deviation in MAP scores across the 3 search categories, primarily because the larger number of people searching reduces the variation, as discussed in [10] and [11].
Third, we can see the value of using human input in these MAP scores, but Table 2 does not take the costs in both time and money into consideration. We make the assumption that YouTube's search has no cost in terms of time and money and use it as our baseline. We kept track of the elapsed time taken by the crowd and for the students as well, so we can evaluate this in aggregate. This is reported in Tables 5 and 6.
To illustrate, in Tables 5 and 6, for Restated Queries classified as "difficult", to obtain an increase in MAP of 0.001 using students, we would expect to spend 0.06 minutes and incur a cost of 4.743 cents. To obtain an equivalent increase in MAP using crowdsourcing, we would expect to spend, on average, 0.02 minutes and incur a cost of 1.111 cents. Note that these numbers represent long term averages. Thus, we observe that using the crowd, as compared with students, requires 20% of the cost and takes two thirds the time, on average, to raise MAP by an equivalent amount. Thus, when obtaining more precise results is our paramount objective, using students or the crowd is expected to provide the best results. If time or financial costs are also a consideration, our results show that using the crowd will provide the best tradeoff between time, financial cost, and precision.
4.2 Simple List Preference
We apply Copeland's pairwise aggregation method, described in [46, 47]; it is a Condorcet method used to evaluate pairwise preferences. Copeland's method examines two lists for a given query in a pairwise fashion and records the assessor's preference as a "victory". Search strategies are ordered by number of victories over each opponent to determine an overall winner. We examine each pairwise preference for the three result lists for all 25 queries. These comparison results are given in Table 7.
From Table 7, we observe that student search is our Condorcet winner, beating all other search strategies in pairwise comparisons. As with the pooling assessment method, there was a slight preference for student search results over the crowdsourcing-supplied video lists. However, when financial costs are disclosed to the assessors along with the scores, crowdsourcing is our Condorcet winner, as observed in Table 8.
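Copeland's pairwise aggregation is easy to sketch: count per-matchup victories and rank strategies by how many opponents they beat. The strategy names and judgement counts below are illustrative toy data, not the study's.

```python
from itertools import combinations

def copeland_winner(strategies, judgements):
    """judgements maps frozenset({a, b}) to the list of per-assessor winners
    for that matchup (every entry must be a or b).  A strategy earns one
    'victory' per matchup it wins; the Copeland winner has most victories."""
    victories = {s: 0 for s in strategies}
    for a, b in combinations(strategies, 2):
        outcomes = judgements[frozenset((a, b))]
        wins_a = sum(1 for w in outcomes if w == a)
        wins_b = len(outcomes) - wins_a
        if wins_a > wins_b:
            victories[a] += 1
        elif wins_b > wins_a:
            victories[b] += 1
    winner = max(victories, key=victories.get)
    return winner, victories

# toy per-matchup assessor judgements for three strategies
judgements = {
    frozenset(("student", "crowd")): ["student", "student", "crowd"],
    frozenset(("student", "youtube")): ["student", "student", "student"],
    frozenset(("crowd", "youtube")): ["crowd", "youtube", "crowd"],
}
```

A Condorcet winner, when one exists, beats every other strategy head to head, which is exactly what the victory tally detects.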
Table 3: Overall MAP scores for each search strategy.
Search Strategy | MAP
Student Search | 0.542
Crowdsourcing | 0.514
YouTube Search | 0.359

Table 2: MAP scores for each search strategy, broken down by search category.
Search Strategy | Easy | Medium | Difficult
Student Search | 0.636 | 0.516 | 0.241
Crowdsourcing | 0.619 | 0.514 | 0.202
YouTube Search | 0.508 | 0.322 | 0.442

Table 7: Pairwise comparisons of list preference using Copeland's pairwise aggregation method, as assessed by the crowd.
Comparison | Result | Winner
Student Search vs. Crowdsourcing | 42 vs. 41 | Student Search
Student Search vs. YouTube Search | 39 vs. 6 | Student Search
Crowdsourcing vs. YouTube Search | 37 vs. 8 | Crowdsourcing

Table 5: Increase in MAP scores over the YouTube search interface divided by additional time taken (in minutes).
Search Strategy | Easy | Medium | Difficult
Student Search | 0.145 | 0.101 | 0.061
Crowdsourcing | 0.059 | 0.066 | 0.020

Table 6: Increase in MAP over the YouTube search interface divided by additional cost (in US cents).
Search Strategy | Easy | Medium | Difficult
Student Search | 1.334 | 1.656 | 4.743
Crowdsourcing | 4.155 | 1.953 | 1.111
+ Aggregating annotations
+ Annotation model
n A set of objects to annotate: i = 1, . . . , I
n A set of annotators: j = 1, . . . , J
n Types of annotations
n Binary
n Categorical (multi-class)
n Numerical
n Other
+ Annotation model
n Annotators and objects form a bipartite graph: annotator j is linked to each object i it annotates
n True labels: y1, y2, y3, . . .
n Annotations: yji ∈ L, the label given by annotator j to object i (e.g. y11, y12, y21, y32, y33, y41, y51, y52)
n Binary: |L| = 2
n Multi-class: |L| > 2
+ Aggregating annotations
n Majority voting (baseline)
n For each object, assign the label that received the largest number of votes
n Aggregating annotations
n [Dawid and Skene, 1979]
n [Snow et al., 2008]
n [Whitehill et al., 2009]
n …
n Aggregating and learning
n [Sheng et al., 2008]
n [Donmez et al., 2009]
n [Raykar et al., 2010]
n …
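The Dawid and Skene [1979] approach listed above jointly estimates annotator quality and true labels with expectation-maximization. Below is a simplified binary sketch (one symmetric accuracy per annotator instead of the full confusion matrices of the original model); the toy data and function name are illustrative.

```python
def dawid_skene_binary(labels, n_items, n_annot, iters=20):
    """EM estimation of true binary labels and annotator accuracies.
    labels: list of (item, annotator, observed label in {0, 1}); every
    item is assumed to have at least one label.  Returns, per item, the
    posterior probability that its true label is 1."""
    # initialize item posteriors with the fraction of positive votes
    votes = [[0, 0] for _ in range(n_items)]
    for i, j, l in labels:
        votes[i][l] += 1
    post = [n1 / (n0 + n1) for n0, n1 in votes]
    for _ in range(iters):
        # M-step: annotator accuracy = expected fraction of correct labels
        num = [1e-6] * n_annot
        den = [2e-6] * n_annot
        for i, j, l in labels:
            num[j] += post[i] if l == 1 else 1 - post[i]
            den[j] += 1
        acc = [n / d for n, d in zip(num, den)]
        prior = sum(post) / len(post)        # class prior for label 1
        # E-step: recompute item posteriors given accuracies and prior
        like1 = [prior] * n_items
        like0 = [1 - prior] * n_items
        for i, j, l in labels:
            like1[i] *= acc[j] if l == 1 else 1 - acc[j]
            like0[i] *= acc[j] if l == 0 else 1 - acc[j]
        post = [p1 / (p0 + p1) for p0, p1 in zip(like0, like1)]
    return post

# toy data: annotators 0 and 1 always correct, annotator 2 always wrong
truth = [1, 0, 1, 1]
toy_labels = []
for i, t in enumerate(truth):
    toy_labels += [(i, 0, t), (i, 1, t), (i, 2, 1 - t)]
```

Unlike plain majority voting, EM discovers that annotator 2 is adversarial and down-weights (in fact inverts) its votes.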
+ Aggrega0ng annota0ons Majority vo0ng
n Assume that n The annotator quality is independent from the object n All annotators have the same quality
n The integrated quality of majority vo0ng using annotators is
P (yji = yi) = pj
pj = p
q = P (yMV = y) =N�
l=0
�2N + 1
i
�p2N+1−i · (1− p)i
I = 2N + 1
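The closed-form expression above is easy to check numerically; a minimal sketch (the function name is ours, not from the slides):

```python
from math import comb

def integrated_quality(p, N):
    """Probability that the majority vote of 2N + 1 annotators,
    each correct independently with probability p, is correct.
    The sum runs over i = number of incorrect labels (at most N)."""
    return sum(comb(2 * N + 1, i) * p ** (2 * N + 1 - i) * (1 - p) ** i
               for i in range(N + 1))
```

For example, with p = 0.7 and three annotators (N = 1), q = 0.7³ + 3·0.7²·0.3 = 0.784 > p; for p < 0.5 the majority vote is worse than a single annotator, as noted in the excerpt below.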
+ Aggregating annotations: Majority voting
repeated-labeling to shift from a lower-q curve to a higher-q curve can, under some settings, improve learning considerably. In order to treat this more formally, we first introduce some terminology and simplifying assumptions.

3.1 Notation and Assumptions

We consider a problem of supervised induction of a (binary) classification model. The setting is the typical one, with some important exceptions. For each training example ⟨y_i, x_i⟩, procuring the unlabeled "feature" portion, x_i, incurs cost C_U. The action of labeling the training example with a label y_i incurs cost C_L. For simplicity, we assume that each cost is constant across all examples. Each example ⟨y_i, x_i⟩ has a true label y_i, but labeling is error-prone. Specifically, each label y_ij comes from a labeler j exhibiting an individual labeling quality p_j, which is Pr(y_ij = y_i); since we consider the case of binary classification, the label assigned by labeler j will be incorrect with probability 1 − p_j.

In the current paper, we work under a set of assumptions that allow us to focus on a certain set of problems that arise when labeling using multiple noisy labelers. First, we assume that Pr(y_ij = y_i | x_i) = Pr(y_ij = y_i) = p_j, that is, individual labeling quality is independent of the specific data point being labeled. We sidestep the issue of knowing p_j: the techniques we present do not rely on this knowledge. Inferring p_j accurately should lead to improved techniques; Dawid and Skene [6] and Smyth et al. [26, 28] have shown how to use an expectation-maximization framework for estimating the quality of labelers when all labelers label all available examples. It seems likely that this work can be adapted to work in a more general setting, and applied to repeated-labeling. We also assume for simplicity that each labeler j only gives one label, but that is not a restrictive assumption in what follows. We further discuss limitations and directions for future research in Section 5.

3.2 Majority Voting and Label Quality

To investigate the relationship between labeler quality, number of labels, and the overall quality of labeling using multiple labelers, we start by considering the case where for induction each repeatedly-labeled example is assigned a single "integrated" label ŷ_i, inferred from the individual y_ij's by majority voting. For simplicity, and to avoid having to break ties, we assume that we always obtain an odd number of labels. The quality q_i = Pr(ŷ_i = y_i) of the integrated label ŷ_i will be called the integrated quality. Where no confusion will arise, we will omit the subscript i for brevity and clarity.

3.2.1 Uniform Labeler Quality

We first consider the case where all labelers exhibit the same quality, that is, p_j = p for all j (we will relax this assumption later). Using 2N + 1 labelers with uniform quality p, the integrated labeling quality q is:
q = Pr(ŷ = y) = Σ_{i=0}^{N} (2N+1 choose i) · p^{2N+1−i} · (1 − p)^i     (1)

which is the sum of the probabilities that we have more correct labels than incorrect (the index i corresponds to the number of incorrect labels).
Not surprisingly, from the formula above, we can infer thatthe integrated quality q is greater than p only when p > 0.5.When p < 0.5, we have an adversarial setting where q < p,and, not surprisingly, the quality decreases as we increase thenumber of labelers.
Figure 2: The relationship between integrated labeling quality, individual quality, and the number of labelers.

Figure 3: Improvement in integrated quality compared to single-labeling, as a function of the number of labelers, for different labeler qualities.

Figure 2 demonstrates the analytical relationship between the integrated quality and the number of labelers, for different individual labeler qualities. As expected, the integrated quality improves with larger numbers of labelers when the individual labeling quality p > 0.5; however, the marginal improvement decreases as the number of labelers increases. Moreover, the benefit of getting more labelers also depends on the underlying value of p. Figure 3 shows how integrated quality q increases compared to the case of single-labeling, for different values of p and for different numbers of labelers. For example, when p = 0.9, there is little benefit when the number of labelers increases from 3 to 11. However, when p = 0.7, going just from single labeling to three labelers increases integrated quality by about 0.1, which in Figure 1 would yield a substantial upward shift in the learning curve (from the q = 0.7 to the q = 0.8 curve); in short, a small amount of repeated-labeling can have a noticeable effect for moderate levels of noise.
Therefore, for cost-effective labeling using multiple noisy labelers we need to consider: (a) the effect of the integrated quality q on learning, and (b) the number of labelers required to increase q under different levels of labeler quality p; we will return to this later, in Section 4.
3.2.2 Different Labeler Quality

If we relax the assumption that p_j = p for all j, and allow labelers to have different qualities, a new question arises: what is preferable, using multiple labelers or using the best individual labeler? A full analysis is beyond the scope (and space limit) of this paper, but let us consider the special case that we have a group of three labelers, where the middle labeling quality is p, the lowest one is p − d, and the highest one is p + d. In this case, the integrated quality q is:

(p − d)·p·(p + d) + (p − d)·p·(1 − (p + d)) + (p − d)·(1 − p)·(p + d) + (1 − (p − d))·p·(p + d) = −2p³ + 2pd² + 3p² − d²
+ Aggregating annotations [Snow et al., 2008]

- Binary labels: y_i^j ∈ {0, 1}
- The true label is estimated by evaluating the posterior log-odds, i.e.,

  log [P(y_i = 1 | y_i^1, ..., y_i^J) / P(y_i = 0 | y_i^1, ..., y_i^J)]

- Applying Bayes' theorem, the posterior log-odds decompose into a likelihood term and a prior term:

  log [P(y_i = 1 | y_i^1, ..., y_i^J) / P(y_i = 0 | y_i^1, ..., y_i^J)] = Σ_j log [P(y_i^j | y_i = 1) / P(y_i^j | y_i = 0)] + log [P(y_i = 1) / P(y_i = 0)]
+ Aggregating annotations [Snow et al., 2008]

- How to estimate P(y_i^j | y_i = 1) and P(y_i^j | y_i = 0)?
- Gold standard:
  - some objects have known labels
  - ask annotators to annotate these objects
  - compute the empirical p.m.f. on the objects with known labels
- Compute the performance of annotator j (independent of the object):

  P(y^j = 1 | y = 1) = (number of correct annotations) / (number of annotations of objects with true label 1)

  P(y_1^j | y_1 = 1) = P(y_2^j | y_2 = 1) = ... = P(y_I^j | y_I = 1) = P(y^j | y = 1)
+ Aggregating annotations [Snow et al., 2008]

- Each annotator's vote is weighted by the log-likelihood ratio for their given response (Naive Bayes):

  log [P(y_i = 1 | y_i^1, ..., y_i^J) / P(y_i = 0 | y_i^1, ..., y_i^J)] = Σ_j log [P(y_i^j | y_i = 1) / P(y_i^j | y_i = 0)] + log [P(y_i = 1) / P(y_i = 0)]

- More reliable annotators are weighted more
- Issue: obtaining a gold standard is costly!
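A minimal sketch of this weighted vote, assuming the per-annotator rates are already known (e.g. estimated on a gold standard); the helper name and the `quality` tuple format are ours, not from the paper:

```python
import math

def log_odds(labels, quality, prior=0.5):
    """Posterior log-odds of y = 1 for one object, given one binary
    label per annotator. `labels[j]` is annotator j's label;
    `quality[j]` = (tpr, tnr) with tpr = P(y^j=1|y=1), tnr = P(y^j=0|y=0),
    both assumed strictly between 0 and 1."""
    lo = math.log(prior / (1 - prior))
    for yj, (tpr, tnr) in zip(labels, quality):
        if yj == 1:
            # P(y^j=1|y=1) / P(y^j=1|y=0)
            lo += math.log(tpr / (1 - tnr))
        else:
            # P(y^j=0|y=1) / P(y^j=0|y=0)
            lo += math.log((1 - tpr) / tnr)
    return lo
```

A positive result means y = 1 is more likely; note how a single high-quality annotator can outvote a noisier one, since the weight grows with the log-likelihood ratio.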
+ Aggregating annotations [Kumar and Lease, 2011]

- With very accurate annotators, it is better to label more examples once
- With very noisy annotators, aggregating labels helps, if annotator accuracies are taken into account

Figure 1: p_{1:w} ∼ U(0.6, 1.0). With very accurate annotators, generating multiple labels (to improve consensus label accuracy) provides little benefit. Instead, labeling effort is better spent single labeling more examples.

Figure 2: p_{1:w} ∼ U(0.4, 0.6). With very noisy annotators, single labeling yields such poor training data that there is no benefit from labeling more examples (i.e. a flat learning rate). MV just aggregates this noise to produce more noise. In contrast, by modeling worker accuracies and weighting their labels appropriately, NB can improve consensus labeling accuracy (and thereby classifier accuracy).

Figure 3: p_{1:w} ∼ U(0.3, 0.7). With greater variance in accuracies vs. Figure 2, NB further improves.

Figure 4: p_{1:w} ∼ U(0.1, 0.7). When average annotator accuracy is below 50%, SL and MV perform exceedingly poorly. However, variance in worker accuracies known to NB allows it to concentrate weight on workers with accuracy over 50% in order to achieve accurate consensus labeling (and thereby classifier accuracy).

Figure 5: p_{1:w} ∼ U(0.2, 0.6). When nearly all annotators typically produce bad labels, failing to "flip" labels from poor annotators dooms all methods to low accuracy.

SL: Single Labeling; MV: Majority Voting; NB: Naive Bayes
+ Aggregating annotations [Dawid and Skene, 1979]

- Multi-class labels: y_i^j ∈ {1, ..., K}
- Each annotator is characterized by the (unknown) error rates

  π_{lk}^j = P(y^j = l | y = k),  k, l = 1, ..., K

- Given a set of observed labels D = {y_i^1, ..., y_i^J}_{i=1}^{I}, estimate
  - the error rates π_{lk}^j
  - the a-posteriori probabilities P(y_i = k | D)
+ Aggregating annotations [Dawid and Skene, 1979]

- For simplicity, consider the case with binary labels: y_i^j ∈ {0, 1}
- Each annotator is characterized by the (unknown) error rates

  P(y_i^j = 1 | y_i = 1) = α_1^j   (true positive rate)
  P(y_i^j = 0 | y_i = 0) = α_0^j   (true negative rate)

- Also assume that the prior is known, i.e., P(y_i = 1) = 1 − P(y_i = 0) = p_i
+ Aggregating annotations [Dawid and Skene, 1979]

[Figure: graphical model with true labels y_1, y_2, ..., y_I, observed labels y_i^j, and annotator accuracies α^1, α^2, α^3, ..., α^J]
+ Aggregating annotations [Dawid and Skene, 1979]

- The likelihood function of the parameters {α_1, α_0} given the observations D = {y_i^1, ..., y_i^J}_{i=1}^{I} factors as

  P(D | α_1, α_0) = Π_{i=1}^{I} P(y_i^1, ..., y_i^J | α_1, α_0)
                  = Π_{i=1}^{I} [ P(y_i^1, ..., y_i^J | y_i = 1, α_1) P(y_i = 1) + P(y_i^1, ..., y_i^J | y_i = 0, α_0) P(y_i = 0) ]
+ Aggregating annotations [Dawid and Skene, 1979]

  P(D | α_1, α_0) = Π_{i=1}^{I} [ P(y_i^1, ..., y_i^J | y_i = 1, α_1) P(y_i = 1) + P(y_i^1, ..., y_i^J | y_i = 0, α_0) P(y_i = 0) ]

- where, assuming annotators are independent given the true label,

  Π_{j=1}^{J} P(y_i^j | y_i = 1, α_1^j) = Π_{j=1}^{J} [α_1^j]^{y_i^j} [1 − α_1^j]^{1 − y_i^j}

  Π_{j=1}^{J} P(y_i^j | y_i = 0, α_0^j) = Π_{j=1}^{J} [α_0^j]^{1 − y_i^j} [1 − α_0^j]^{y_i^j}
+ Aggregating annotations [Dawid and Skene, 1979]

- The parameters θ = {α_1, α_0} are found by maximizing the log-likelihood function

  {α̂_1, α̂_0} = argmax_θ log P(D | θ)

- The solution is based on Expectation-Maximization
- Expectation step: with p_i = P(y_i = 1) (prior),

  a_{1,i} = Π_{j=1}^{J} [α_1^j]^{y_i^j} [1 − α_1^j]^{1 − y_i^j}
  a_{0,i} = Π_{j=1}^{J} [α_0^j]^{1 − y_i^j} [1 − α_0^j]^{y_i^j}

  μ_i = P(y_i = 1 | y_i^1, ..., y_i^J, θ) ∝ P(y_i^1, ..., y_i^J | y_i = 1, θ) P(y_i = 1 | θ)
      = a_{1,i} p_i / (a_{1,i} p_i + a_{0,i} (1 − p_i))
+ Aggregating annotations [Dawid and Skene, 1979]

- Maximization step: the true positive and true negative rates can be estimated in closed form

  α_1^j = Σ_{i=1}^{I} μ_i y_i^j / Σ_{i=1}^{I} μ_i
  α_0^j = Σ_{i=1}^{I} (1 − μ_i)(1 − y_i^j) / Σ_{i=1}^{I} (1 − μ_i)
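The E- and M-steps above combine into a short EM loop; the sketch below follows the slides' simplifications (binary labels, every annotator labels every object, known shared prior). The function name, the majority-vote initialization, and the fixed iteration count are our choices:

```python
import numpy as np

def dawid_skene_binary(Y, prior=0.5, n_iter=50):
    """EM for the binary Dawid-Skene model.
    Y: (I objects x J annotators) array of {0,1} labels, all observed.
    Returns posterior mu_i = P(y_i = 1 | labels), and per-annotator
    true-positive rates a1 and true-negative rates a0.
    Assumes a single known prior p_i = prior for all objects."""
    mu = Y.mean(axis=1)  # initialize posteriors with the vote fraction
    for _ in range(n_iter):
        # M-step: closed-form rate estimates weighted by mu
        a1 = (mu @ Y) / mu.sum()                    # sum_i mu_i y_i^j / sum_i mu_i
        a0 = ((1 - mu) @ (1 - Y)) / (1 - mu).sum()  # sum_i (1-mu_i)(1-y_i^j) / sum_i (1-mu_i)
        a1 = np.clip(a1, 1e-6, 1 - 1e-6)            # avoid log/zero issues
        a0 = np.clip(a0, 1e-6, 1 - 1e-6)
        # E-step: per-object likelihoods a_{1,i}, a_{0,i}, then posterior mu_i
        L1 = np.prod(a1 ** Y * (1 - a1) ** (1 - Y), axis=1)
        L0 = np.prod(a0 ** (1 - Y) * (1 - a0) ** Y, axis=1)
        mu = L1 * prior / (L1 * prior + L0 * (1 - prior))
    return mu, a1, a0
```

Thresholding `mu` at 0.5 gives the aggregated labels; on simulated data with mixed-quality annotators this typically beats plain majority voting, since unreliable annotators end up with near-chance estimated rates and thus little influence.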
+ Aggregating annotations [Tang and Lease, 2011]

- A semi-supervised approach between
  - a supervised approach based on a gold standard: Naive Bayes [Snow et al., 2008]
  - an unsupervised approach: Expectation-Maximization [Dawid and Skene, 1979]
- A very modest amount of supervision can provide significant benefit
Figure 4: Supervised NB vs. unsupervised MV and EM on the synthetic dataset.

Figure 5: Supervised NB vs. unsupervised MV and EM on the MTurk dataset.

4.2 Semi-supervised vs. supervised

In our second set of experiments, we compare our semi-supervised SNB method vs. the supervised NB method, evaluating consensus accuracy achieved across varying amounts of labeled vs. unlabeled training data. Starting from each of the same labeled training size values considered in our first set of experiments for supervised NB, we now consider adding additional unlabeled examples in powers of two as before into the training set, though now we have potentially more data to use (up to 5000 unlabeled examples in the synthetic data, and up to 15758 examples with MTurk). As before, we repeat experiments 10 times and average.

Figure 6: Semi-supervised SNB vs. supervised NB method on the synthetic dataset.

Figure 7: Semi-supervised SNB vs. supervised NB method on the MTurk dataset.

Figure 6 and Figure 7 compare the semi-supervised SNB method with the supervised NB method for synthetic and MTurk data, respectively. Results on both synthetic and MTurk data are quite similar. Each curve in the figures corresponds to an SNB method trained on a different number of (labeled) training examples. The x-axis indicates the number of additional, unlabeled examples used for training. While not shown, a value of x = 0 (no unlabeled data used) in Figure 6 and Figure 7 would correspond exactly to the accuracy achieved by the supervised NB method from Figure 4 and Figure 5, respectively. All curves approach convergence with the full training set (all available labeled and unlabeled data).

Labels for unlabeled examples are automatically estimated by SNB with a given confidence during the training process. Worker labels are then compared to these generated labels and confidence values in order to estimate worker accuracies (in addition to comparing worker labels on expert-labeled examples). Figure 4 and Figure 5 intuitively showed that NB consensus accuracy increases with more labeled training data. Figure 6 and Figure 7 reflect this in the relative starting positions of each learning curve of the SNB method.

Recall that the unsupervised EM method achieved 75.0% consensus accuracy for the synthetic data in Figure 4. From Figure 6 we can see that, with only 256 labeled and 1024
+ Aggregating annotations [Whitehill et al., 2009]

- GLAD (Generative model of Labels, Abilities, and Difficulties)
- Binary labels: y_i^j ∈ {0, 1}
- Annotators have different expertise α_j; each object has inverse difficulty β_i (difficulty is 1/β_i)
- Probability of a correct label:

  p(y_i^j = y_i | α_j, β_i) = 1 / (1 + e^{−α_j β_i})

- More skilled annotators (higher α_j) have a higher probability of labeling correctly
- As the difficulty 1/β_i of the image increases, the probability of the label being correct moves towards 0.5
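The two bullet points follow directly from the logistic form of the response model; a tiny sketch (the helper name is ours):

```python
import math

def p_correct(alpha, beta):
    """GLAD response model: probability that an annotator with ability
    alpha correctly labels an object with inverse difficulty beta
    (1/beta is the difficulty). Sigmoid of the product alpha * beta."""
    return 1.0 / (1.0 + math.exp(-alpha * beta))
```

Higher alpha pushes the probability toward 1 for a fixed object; as beta → 0 (infinitely hard object) the probability tends to 0.5 regardless of skill, and a negative alpha models an adversarial annotator who is wrong more often than chance.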
+ Aggregating annotations [Whitehill et al., 2009]

[Figure: graphical model with object difficulties β_1, β_2, ..., β_I, true labels y_1, y_2, ..., y_I, observed labels y_i^j, and annotator accuracies α_1, α_2, α_3, ..., α_J]
+ Aggregating annotations [Whitehill et al., 2009]

- The observed labels are samples from the random variables {y_i^j}
- The unobserved variables are
  - the true image labels y_i, i = 1, ..., I
  - the object difficulty parameters β_i, i = 1, ..., I
  - the annotator accuracies α_j, j = 1, ..., J
- Goal: find the most likely values of the unobserved variables given the observed data
- Solution: Expectation-Maximization (EM)
+ Aggregating annotations [Whitehill et al., 2009]

- Expectation step: compute the posterior probabilities of all y_i ∈ {0, 1} given the values of α, β from the last M-step
- Writing y_i = {y_{i'}^j | i' = i} for the set of labels observed for object i, by Bayes' theorem and the independence of annotators

  P(y_i | y_i, α, β_i) ∝ P(y_i) Π_j P(y_i^j | y_i, α_j, β_i)

  where

  p(y_i^j | y_i = 1, α_j, β_i) = [1 / (1 + e^{−α_j β_i})]^{y_i^j} [1 − 1 / (1 + e^{−α_j β_i})]^{1 − y_i^j}
+ Aggregating annotations [Whitehill et al., 2009]

- Maximization step: maximize the auxiliary function

  Q(α, β) = E[log p(y_1, ..., y_I, y | α, β)]
          = E[log Π_i p(y_i) Π_j p(y_i^j | y_i, α_j, β_i)]
          = Σ_i E[log p(y_i)] + Σ_{ij} E[log p(y_i^j | y_i, α_j, β_i)]

- where the expectation is with respect to the posterior probabilities of the y_i ∈ {0, 1} computed in the E-step
- The parameters are estimated using gradient ascent:

  (α*, β*) = argmax_{α,β} Q(α, β)
+ Aggregating annotations [Whitehill et al., 2009]

Figure 2: Left: The accuracies of the GLAD model versus simple voting for inferring the underlying class labels on simulation data. Right: The ability of GLAD to recover the true alpha and beta parameters on simulation data.

                    Image Type
Labeler type    Hard    Easy
Good            0.95    1
Bad             0.54    1

We measured performance in terms of proportion of correctly estimated labels. We compared three approaches: (1) our proposed method, GLAD; (2) the method proposed in [5], which models labeler ability but not image difficulty; and (3) Majority Vote. The simulations were repeated 20 times and average performance calculated for the three methods. The results shown below indicated that modeling image difficulty can result in significant performance improvements.

Method               Error
GLAD                 4.5%
Majority Vote        11.2%
Dawid & Skene [5]    8.4%

4.1 Stability of EM under Various Starting Points

Empirically we found that the EM procedure was fairly insensitive to varying the starting point of the parameter values. In a simulation study of 2000 images and 20 labelers, we randomly selected each α_i ∼ U[0, 4] and log(β_j) ∼ U[0, 3], and EM was run until convergence. Over the 50 simulation runs, the average percent-correct of the inferred labels was 85.74%, and the standard deviation of the percent-correct over all the trials was only 0.024%.

5 Empirical Study I: Greebles

As a first test-bed for GLAD using real data obtained from the Mechanical Turk, we posted pictures of 100 "Greebles" [6], which are synthetically generated images that were originally created to study human perceptual expertise. Greebles somewhat resemble human faces and have a "gender": males have horn-like organs that point up, whereas for females the horns point down. See Figure 3 (left) for examples. Each of the 100 Greeble images was labeled by 10 different human coders on the Turk for gender (male/female). Four Greebles of each gender (separate from the 100 labeled images) were given as examples of each class. Shown at a resolution of 48x48 pixels, the task required careful inspection of the images in order to label them correctly. The ground-truth gender values were all known with certainty (since they are rendered objects) and thus provided a means of measuring the accuracy of inferred image labels.
+ Aggregating annotations [Welinder and Perona, 2010]

- Setting similar to [Whitehill et al., 2009], with some differences
  - object difficulty is not explicitly modeled
  - annotator quality distinguishes between the true positive and the true negative rate:

  P(y_i^j = 1 | y_i = 1) = α_1^j
  P(y_i^j = 0 | y_i = 0) = α_0^j
  α^j = [α_0^j, α_1^j]^T

- A prior distribution is set on α^j to capture two kinds of annotators
  - honest annotators (with different qualities, from unreliable to expert)
  - adversarial annotators
+ Aggregating annotations [Welinder and Perona, 2010]

- Batch algorithm
- Expectation step:

  P(y_i | y_i, α) ∝ P(y_i) Π_j P(y_i^j | y_i, α^j)

  P(y_i^j | y_i = 1, α^j) = (α_1^j)^{y_i^j} (1 − α_1^j)^{1 − y_i^j}
  P(y_i^j | y_i = 0, α^j) = (α_0^j)^{1 − y_i^j} (1 − α_0^j)^{y_i^j}

- Maximization step:

  Q(α) = Σ_i E[log P(y_i)] + Σ_{ij} E[log P(y_i^j | y_i, α^j)] + Σ_j log P(α^j)   (last term: the prior)

  α* = argmax_α Q(α)
+ Aggregating annotations [Welinder and Perona, 2010]

Dataset          Images   Assignments   Workers
Presence-1       1,514    15            47
Presence-2       2,401    15            54
Attributes-1     6,033    5             507
Attributes-2     6,033    5             460
Bounding Boxes   911      10            85

Table 1. Summary of the datasets collected from Amazon Mechanical Turk showing the number of images per dataset, the number of labels per image (assignments), and the total number of workers that provided labels. Presence-1/2 are binary labels, and Attributes-1/2 are multi-valued labels.

Figure 7. Comparison between the majority rule, GLAD [14], and our algorithm on synthetic data as the number of assignments per image is increased. The synthetic data is generated by the model in Section 5 from the worker parameters estimated in Figure 5a.
...tions, most annotators provided good labels, except for no. 53 and 58. These two annotators were also the only ones to label all available images. In all three subplots of Figure 6, most workers provide only a few labels, and only some very active annotators label more than 100 images. Our findings in this figure are very similar to the results presented in Figure 6 of [7].

Importance of discrimination: The results in Figure 6 point out the importance of online estimation of a_j and the use of expert- and bot-lists for obtaining labels on MTurk. The expert-list is needed to reduce the number of labels per image, as we can be more sure of the quality of the labels received from experts. Furthermore, without the expert-list to prioritize which annotators to ask first, the image will likely be labeled by a new worker, and thus the estimate of a_j for that worker will be very uncertain. The bot-list is needed to discriminate against sloppy annotators that will otherwise annotate most of the dataset in hope of making easy money, as shown by the outliers (no. 53 and 58) in Figure 6c.

Performance of binary model: We compared the performance of the annotator model applied to binary data, described in Section 5, to two other models of binary data, as the number of available labels per image, m, varied. The first method was a simple majority decision rule and the second method was the GLAD algorithm presented in [14]. Since we did not have access to the ground truth labels of
Figure 8. Error rates vs. the number of labels used per image on the Presence datasets for the online algorithm and the batch version. The ground truth was the estimates when running the batch algorithm with all 15 labels per image available (thus batch will have zero error at 15 labels per image).

the datasets, we generated synthetic data, where we knew the ground truth, as follows: (1) We used our model to estimate a_j for all 47 annotators in the Presence-1 dataset. (2) For each of 2000 target values (half with z_i = 1), we sampled labels from m randomly chosen workers, where the labels were generated according to the estimated a_j and Equation 10. As can be seen from Figure 7, our model achieves a consistently lower error rate on synthetic data.
Online algorithm: We simulated running the online algorithm on the Presence datasets obtained using MTurk and used the result from the batch algorithm as ground truth. When the algorithm requested labels for an image, it was given labels from the dataset (along with an identifier for the worker that provided it) randomly sampled without replacement. If it requested labels from the expert-list for a particular image, it only received such a label if a worker in the expert-list had provided a label for that image; otherwise it was randomly sampled from non-bot-listed workers. A typical run of the algorithm on the Presence-1 dataset is shown in Figure 9. In the first few iterations, the algorithm is pessimistic about the quality of the annotators, and requests up to m = 15 labels per image. As the evidence accumulates, more workers are put in the expert- and bot-lists, and the number of labels requested by the algorithm decreases. Notice in the figure that towards the final iterations, the algorithm samples only 2-3 labels for some images.

To get an idea of the performance of the online algorithm, we compared it to running the batch version from Section 3 with a limited number of labels per image. For the Presence-1 dataset, the error rate of the online algorithm is almost three times lower than the general algorithm when using the same number of labels per image; see Figure 8. For the Presence-2 dataset, twice as many labels per image are needed for the batch algorithm to achieve the same performance as the online version.
Online crowdsourcing: rating annotators and obtaining cost-effective labels
Peter Welinder, Pietro Perona
California Institute of Technology
{welinder,perona}@caltech.edu

Abstract

Labeling large datasets has become faster, cheaper, and easier with the advent of crowdsourcing services like Amazon Mechanical Turk. How can one trust the labels obtained from such services? We propose a model of the labeling process which includes label uncertainty, as well as a multi-dimensional measure of the annotators' ability. From the model we derive an online algorithm that estimates the most likely value of the labels and the annotator abilities. It finds and prioritizes experts when requesting labels, and actively excludes unreliable annotators. Based on labels already obtained, it dynamically chooses which images will be labeled next, and how many labels to request in order to achieve a desired level of confidence. Our algorithm is general and can handle binary, multi-valued, and continuous annotations (e.g. bounding boxes). Experiments on a dataset containing more than 50,000 labels show that our algorithm reduces the number of labels required, and thus the total cost of labeling, by a large factor while keeping error rates low on a variety of datasets.

1. Introduction

Crowdsourcing, the act of outsourcing work to a large crowd of workers, is rapidly changing the way datasets are created. Not long ago, labeling large datasets could take weeks, if not months. It was necessary to train annotators on custom-built interfaces, often in person, and to ensure they were motivated enough to do high quality work. Today, with services such as Amazon Mechanical Turk (MTurk), it is possible to assign annotation jobs to hundreds, even thousands, of computer-literate workers and get results back in a matter of hours. This opens the door to labeling huge datasets with millions of images, which in turn provides great possibilities for training computer vision algorithms.

The quality of the labels obtained from annotators varies. Some annotators provide random or bad quality labels in the hope that they will go unnoticed and still be paid, and yet others may have good intentions but completely misunderstand the task at hand. The standard solution to the problem of "noisy" labels is to assign the same labeling task to many

Figure 1. Examples of binary labels obtained from Amazon Mechanical Turk (see Figure 2 for an example of continuous labels). The boxes show the labels provided by four workers (identified by the number in each box); green indicates that the worker selected the image, red means that he or she did not. The task for each annotator was to select only images that he or she thought contained a Black-chinned Hummingbird. Figure 5 shows the expertise and bias of the workers. Worker 25 has a high false positive rate, and 22 has a high false negative rate. Worker 26 provides inconsistent labels, and 2 is the annotator with the highest accuracy. Photos in the top row were classified to contain a Black-chinned Hummingbird by our algorithm, while the ones in the bottom row were not.

different annotators, in the hope that at least a few of them will provide high quality labels or that a consensus emerges from a great number of labels. In either case, a large number of labels is necessary, and although a single label is cheap, the costs can accumulate quickly.

If one is aiming for a given label quality for the minimum time and money, it makes more sense to dynamically decide on the number of labelers needed. If an expert annotator provides a label, we can probably rely on it being of high quality, and we may not need more labels for that particular task. On the other hand, if an unreliable annotator provides a label, we should probably ask for more labels until we find an expert or until we have enough labels from non-experts to let the majority decide the label.
+ Aggregating annotations [Welinder and Perona, 2010]

- Online algorithm: for each annotator j
  - estimate α^j
  - if the estimate of the annotator quality is reliable (var(α^j) < θ)
    - if annotator j is an expert, add them to the expert-list: E ← E ∪ {j}
    - otherwise, add them to the bot-list: B ← B ∪ {j}
+ Aggregating annotations [Welinder and Perona, 2010]
n Online algorithm (continued)
n For each object i to be annotated
n Compute P(yi) from the available labels yi and the annotator qualities α
n If the estimated label is unreliable (max_{yi} P(yi) < τ), ask the experts in the list E
n If labels cannot be obtained from experts, ask annotators not in the bot-list B
n Stop asking for labels when max_{yi} P(yi) ≥ τ, or when the maximum number of annotations is exceeded
+ Aggregating annotations [Welinder and Perona, 2010]
n The online algorithm reduces the number of annotations per object for the same target error rate
Dataset Images Assignments Workers
Presence-1 1,514 15 47
Presence-2 2,401 15 54
Attributes-1 6,033 5 507
Attributes-2 6,033 5 460
Bounding Boxes 911 10 85
Table 1. Summary of the datasets collected from Amazon Mechanical Turk, showing the number of images per dataset, the number of labels per image (assignments), and the total number of workers that provided labels. Presence-1/2 are binary labels, and Attributes-1/2 are multi-valued labels.
[Figure 7: error rate (log scale, 10⁻³ to 10⁻¹) vs. number of assignments per image (0–10); curves: majority, GLAD, ours (batch)]
Figure 7. Comparison between the majority rule, GLAD [14], and our algorithm on synthetic data as the number of assignments per image is increased. The synthetic data is generated by the model in Section 5 from the worker parameters estimated in Figure 5a.
tions, most annotators provided good labels, except for no. 53 and 58. These two annotators were also the only ones to label all available images. In all three subplots of Figure 6, most workers provide only a few labels, and only some very active annotators label more than 100 images. Our findings in this figure are very similar to the results presented in Figure 6 of [7].
Importance of discrimination: The results in Figure 6 point out the importance of online estimation of aj and the use of expert- and bot-lists for obtaining labels on MTurk. The expert-list is needed to reduce the number of labels per image, as we can be more sure of the quality of the labels received from experts. Furthermore, without the expert-list to prioritize which annotators to ask first, the image will likely be labeled by a new worker, and thus the estimate of aj for that worker will be very uncertain. The bot-list is needed to discriminate against sloppy annotators that will otherwise annotate most of the dataset in the hope of making easy money, as shown by the outliers (no. 53 and 58) in Figure 6c.
Performance of binary model: We compared the performance of the annotator model applied to binary data, described in Section 5, to two other models of binary data, as the number of available labels per image, m, varied. The first method was a simple majority decision rule and the second method was the GLAD algorithm presented in [14]. Since we did not have access to the ground truth labels of
[Figure 8: error rate (0–0.12) vs. labels per image (2–14) on Presence-1 and Presence-2; curves: batch, online]
Figure 8. Error rates vs. the number of labels used per image on the Presence datasets for the online algorithm and the batch version. The ground truth was the estimates when running the batch algorithm with all 15 labels per image available (thus batch will have zero error at 15 labels per image).
the datasets, we generated synthetic data, where we knew the ground truth, as follows: (1) We used our model to estimate aj for all 47 annotators in the Presence-1 dataset. (2) For each of 2000 target values (half with zi = 1), we sampled labels from m randomly chosen workers, where the labels were generated according to the estimated aj and Equation 10. As can be seen from Figure 7, our model achieves a consistently lower error rate on synthetic data.
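The synthetic-data recipe above can be sketched in a few lines. The worker model is an assumption here (the paper's Equation 10 is not reproduced in this excerpt): each hypothetical worker j flips positives with rate fn_j and negatives with rate fp_j.

```python
import random

# Sample m noisy binary labels for a target with true label z, from m
# randomly chosen workers, each described by (false-positive rate,
# false-negative rate). This is an assumed stand-in for the paper's
# per-worker model, not its exact parameterization.
def sample_labels(z, workers, m, rng):
    chosen = rng.sample(workers, m)        # m randomly chosen workers
    labels = []
    for fp, fn in chosen:
        flip = fn if z == 1 else fp        # chance this worker mislabels z
        labels.append(z if rng.random() >= flip else 1 - z)
    return labels
```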
Online algorithm: We simulated running the online algorithm on the Presence datasets obtained using MTurk and used the result from the batch algorithm as ground truth. When the algorithm requested labels for an image, it was given labels from the dataset (along with an identifier for the worker that provided it) randomly sampled without replacement. If it requested labels from the expert-list for a particular image, it only received such a label if a worker in the expert-list had provided a label for that image; otherwise it was randomly sampled from non bot-listed workers. A typical run of the algorithm on the Presence-1 dataset is shown in Figure 9. In the first few iterations, the algorithm is pessimistic about the quality of the annotators, and requests up to m = 15 labels per image. As the evidence accumulates, more workers are put in the expert- and bot-lists, and the number of labels requested by the algorithm decreases. Notice in the figure that towards the final iterations, the algorithm samples only 2–3 labels for some images.
To get an idea of the performance of the online algorithm, we compared it to running the batch version from Section 3 with a limited number of labels per image. For the Presence-1 dataset, the error rate of the online algorithm is almost three times lower than the general algorithm when using the same number of labels per image, see Figure 8. For the Presence-2 dataset, twice as many labels per image are needed for the batch algorithm to achieve the same performance as the online version.
Online crowdsourcing: rating annotators and obtaining cost-effective labels
Peter Welinder, Pietro Perona
California Institute of Technology
{welinder,perona}@caltech.edu
Abstract
Labeling large datasets has become faster, cheaper, and easier with the advent of crowdsourcing services like Amazon Mechanical Turk. How can one trust the labels obtained from such services? We propose a model of the labeling process which includes label uncertainty, as well as a multi-dimensional measure of the annotators’ ability. From the model we derive an online algorithm that estimates the most likely value of the labels and the annotator abilities. It finds and prioritizes experts when requesting labels, and actively excludes unreliable annotators. Based on labels already obtained, it dynamically chooses which images will be labeled next, and how many labels to request in order to achieve a desired level of confidence. Our algorithm is general and can handle binary, multi-valued, and continuous annotations (e.g. bounding boxes). Experiments on a dataset containing more than 50,000 labels show that our algorithm reduces the number of labels required, and thus the total cost of labeling, by a large factor while keeping error rates low on a variety of datasets.
1. Introduction
Crowdsourcing, the act of outsourcing work to a large crowd of workers, is rapidly changing the way datasets are created. Not long ago, labeling large datasets could take weeks, if not months. It was necessary to train annotators on custom-built interfaces, often in person, and to ensure they were motivated enough to do high quality work. Today, with services such as Amazon Mechanical Turk (MTurk), it is possible to assign annotation jobs to hundreds, even thousands, of computer-literate workers and get results back in a matter of hours. This opens the door to labeling huge datasets with millions of images, which in turn provides great possibilities for training computer vision algorithms.
+ Aggregating annotations [Karger et al., 2011]
n Infers labels and annotator qualities
n No prior knowledge of the annotator qualities
n Inspired by belief propagation and message-passing
n Binary labels: yji ∈ {−1, +1}
n Define an I × J matrix A such that Aij = yji
+ Aggregating annotations [Karger et al., 2011]
Annotators    Objects

A =
⎡ +1  +1   −  +1   − ⎤
⎢  −   −  −1   −  −1 ⎥
⎣ +1   −  −1   −  +1 ⎦

(“−” marks object–annotator pairs for which no label was collected)
+ Aggregating annotations [Karger et al., 2011]
Annotators    Objects

x(k)_{i→j} = Σ_{j′ ∈ ∂i\j} A_{ij′} y(k−1)_{j′→i}

n y_{j→i}: reliability of annotator j, estimating object i
n x_{i→j}: estimated (soft) label of object i, using all annotators but j
+ Aggregating annotations [Karger et al., 2011]
Annotators    Objects

y(k)_{j→i} = Σ_{i′ ∈ ∂j\i} A_{i′j} x(k−1)_{i′→j}

n y_{j→i}: reliability of annotator j, estimating object i
+ Aggregating annotations [Karger et al., 2011]
Annotators    Objects

n Final estimate: ŷi = sgn( Σ_{j ∈ ∂i} A_{ij} y_{j→i} )
+ Aggregating annotations [Karger et al., 2011]

sum of the answers weighted by each worker’s reliability:

s_i = sign( Σ_{j ∈ ∂i} A_{ij} y_{j→i} ).

It is understood that when there is a tie we flip a fair coin to make a decision.
Iterative Algorithm
Input: E, {A_ij}_{(i,j)∈E}, k_max
Output: Estimate ŝ({A_ij})
1: For all (i, j) ∈ E do
       Initialize y(0)_{j→i} with random Z_ij ∼ N(1, 1);
2: For k = 1, . . . , k_max do
       For all (i, j) ∈ E do x(k)_{i→j} ← Σ_{j′ ∈ ∂i\j} A_{ij′} y(k−1)_{j′→i};
       For all (i, j) ∈ E do y(k)_{j→i} ← Σ_{i′ ∈ ∂j\i} A_{i′j} x(k)_{i′→j};
3: For all i ∈ [m] do x_i ← Σ_{j ∈ ∂i} A_{ij} y(k_max−1)_{j→i};
4: Output estimate vector ŝ({A_ij}) = [sign(x_i)].
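The listing above translates almost line-for-line into code. A minimal sketch (Python is our choice, not the authors'); one simplification: ties in the final sign are broken toward +1 rather than by a fair coin.

```python
import random
from collections import defaultdict

# A[(i, j)] holds annotator j's ±1 answer for task i; messages y_{j->i}
# are initialized with N(1, 1) noise, as in step 1 of the listing.
def iterative_inference(A, k_max=10, seed=0):
    rng = random.Random(seed)
    tasks = defaultdict(list)    # task i  -> annotators who labeled it (∂i)
    workers = defaultdict(list)  # worker j -> tasks it labeled (∂j)
    for (i, j) in A:
        tasks[i].append(j)
        workers[j].append(i)
    y = {(i, j): rng.gauss(1, 1) for (i, j) in A}
    for _ in range(k_max):
        # x_{i->j}: soft label of task i, excluding annotator j
        x = {(i, j): sum(A[(i, jp)] * y[(i, jp)]
                         for jp in tasks[i] if jp != j)
             for (i, j) in A}
        # y_{j->i}: reliability of annotator j, excluding task i
        y = {(i, j): sum(A[(ip, j)] * x[(ip, j)]
                         for ip in workers[j] if ip != i)
             for (i, j) in A}
    # Final estimate: sign of the reliability-weighted vote
    return {i: 1 if sum(A[(i, j)] * y[(i, j)] for j in tasks[i]) >= 0 else -1
            for i in tasks}
```

Note that the message magnitudes grow multiplicatively with k, so in practice one would normalize them each round; for small k_max the raw sums are fine.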
We emphasize here that our inference algorithm requires no information about the prior distribution of the workers’ quality pj. Our algorithm is inspired by power iteration used to compute the leading singular vectors of a matrix, and we discuss the connection in detail in Section 2.6. While our algorithm is also inspired by the standard Belief Propagation (BP) algorithm for approximating max-marginals [Pea88, YFW03], our algorithm is original and overcomes a few critical limitations of the standard BP. First, the iterative algorithm does not require any knowledge of the prior distribution of pj, whereas the standard BP requires the knowledge of the distribution. Second, there is no efficient way to implement standard BP, since we need to pass sufficient statistics (or messages) which under our general model are distributions over the reals. On the other hand, the iterative algorithm only passes messages that are real numbers regardless of the prior distribution of pj, which makes it easy to implement. Third, the iterative algorithm is provably asymptotically order-optimal. Density evolution is a standard technique to analyze the performance of BP. Although we can write down the density evolution for the standard BP, we cannot describe or compute the densities, analytically or numerically. It is also very simple to write down the density evolution equations (cf. (8)) for our algorithm, but it is not a priori clear how one can analyze the densities in this case either. We develop a novel technique to analyze the densities for our iterative algorithm and prove optimality. This technique could be of independent interest for analyzing a broader class of message-passing algorithms.
2.2 Performance guarantee

We state the main analytical result of this paper: for random (l, r)-regular bipartite graph based task assignments with our iterative inference algorithm, the probability of error decays exponentially in lq, up to a universal constant and for a broad range of the parameters l, r and q. With a reasonable choice of l = r, both scaling like (1/q) log(1/ε), the proposed algorithm is guaranteed to achieve error less than ε for any ε ∈ (0, 1/2). Further, an algorithm-independent lower bound
+ Aggregating and learning
+ Aggregating and learning [Sheng et al., 2008]
n Use several noisy labels to create labeled data for training classifiers
n Training samples ⟨yi, xi⟩, where yi is the true label and xi the feature vector
n Labels might be noisy
[Figure 1: accuracy (40–100) vs. number of examples (mushroom dataset), for label qualities q = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
Figure 1: Learning curves under different quality levels of training data (q is the probability of a label being correct).
depends both on the quality of the training labels and on the number of training examples. Of course if the training labels are uninformative (q = 0.5), no amount of training data helps. As expected, under the same labeling quality, more training examples lead to better performance, and the higher the quality of the training data, the better the performance of the learned model. However, the relationship between the two factors is complex: the marginal increase in performance for a given change along each dimension is quite different for different combinations of values for both dimensions. To this, one must overlay the different costs of acquiring only new labels versus whole new examples, as well as the expected improvement in quality when acquiring multiple new labels.
This paper makes several contributions. First, under gradually weakening assumptions, we assess the impact of repeated-labeling on the quality of the resultant labels, as a function of the number and the individual qualities of the labelers. We derive analytically the conditions under which repeated-labeling will be more or less effective in improving resultant label quality. We then consider the effect of repeated-labeling on the accuracy of supervised modeling. As demonstrated in Figure 1, the relative advantage of increasing the quality of labeling, as compared to acquiring new data points, depends on the position on the learning curves. We show that even if we ignore the cost of obtaining the unlabeled part of a data point, there are times when repeated-labeling is preferable compared to getting labels for unlabeled examples. Furthermore, when we do consider the cost of obtaining the unlabeled portion, repeated-labeling can give considerable advantage.
We present a comprehensive experimental analysis of the relationships between quality, cost, and technique for repeated-labeling. The results show that even a straightforward, round-robin technique for repeated-labeling can give substantial benefit over single-labeling. We then show that selectively choosing the examples to label repeatedly yields substantial extra benefit. A key question is: How should we select data points for repeated-labeling? We present two techniques based on different types of information, each of which improves over round-robin repeated labeling. Then we show that a technique that combines the two types of information is even better.
Although this paper covers a good deal of ground, there is much left to be done to understand how best to label using multiple, noisy labelers; so, the paper closes with a summary of the key limitations, and some suggestions for future work.
2. RELATED WORK

Repeatedly labeling the same data point is practiced in applications where labeling is not perfect (e.g., [27, 28]). We are not aware of a systematic assessment of the relationship between the resultant quality of supervised modeling and the number of, quality of, and method of selection of data points for repeated-labeling. To our knowledge, the typical strategy used in practice is what we call “round-robin” repeated-labeling, where cases are given a fixed number of labels, so we focus considerable attention in the paper on this strategy. A related important problem is how in practice to assess the generalization performance of a learned model with uncertain labels [28], which we do not consider in this paper. Prior research has addressed important problems necessary for a full labeling solution that uses multiple noisy labelers, such as estimating the quality of labelers [6, 26, 28], and learning with uncertain labels [13, 24, 25]. So we treat these topics quickly when they arise, and lean on the prior work.

Repeated-labeling using multiple noisy labelers is different from multiple label classification [3, 15], where one example could have multiple correct class labels. As we discuss in Section 5, repeated-labeling can apply regardless of the number of true class labels. The key difference is whether the labels are noisy. A closely related problem setting is described by Jin and Ghahramani [10]. In their variant of the multiple label classification problem, each example presents itself with a set of mutually exclusive labels, one of which is correct. The setting for repeated-labeling has important differences: labels are acquired (at a cost); the same label may appear many times, and the true label may not appear at all. Again, the level of error in labeling is a key factor.

The consideration of data acquisition costs has seen increasing research attention, both explicitly (e.g., cost-sensitive learning [31], utility-based data mining [19]) and implicitly, as in the case of active learning [5]. Turney [31] provides a short but comprehensive survey of the different sorts of costs that should be considered, including data acquisition costs and labeling costs. Most previous work on cost-sensitive learning does not consider labeling cost, assuming that a fixed set of labeled training examples is given, and that the learner cannot acquire additional information during learning (e.g., [7, 8, 30]).

Active learning [5] focuses on the problem of costly label acquisition, although often the cost is not made explicit. Active learning (cf., optimal experimental design [33]) uses the existing model to help select additional data for which to acquire labels [1, 14, 23]. The usual problem setting for active learning is in direct contrast to the setting we consider for repeated-labeling. For active learning, the assumption is that the cost of labeling is considerably higher than the cost of obtaining unlabeled examples (essentially zero for “pool-based” active learning).

Some previous work studies data acquisition cost explicitly. For example, several authors [11, 12, 16, 17, 22, 32, 37] study the costly acquisition of feature information, assuming that the labels are known in advance. Saar-Tsechansky et al. [22] consider acquiring both costly feature and label information.

None of this prior work considers selectively obtaining multiple labels for data points to improve labeling quality, and the relative advantages and disadvantages for improving model performance. An important difference from the setting for traditional active learning is that labeling strategies that use multiple noisy labelers have access to potentially relevant additional information. The multisets of existing labels intuitively should play a role in determining the examples for which to acquire additional labels. For example, presumably one would be less interested in getting another label for an example that already has a dozen identical labels, than for one with just two, conflicting labels.

3. REPEATED LABELING: THE BASICS

Figure 1 illustrates that the quality of the labels can have a marked effect on classification accuracy. Intuitively, using
q = Probability of a label being correct
+ Aggregating and learning [Sheng et al., 2008]
n When training a classifier, consider two options
n Acquire a new training example ⟨yi, xi⟩
n Get another label for an existing example
n Compare two strategies
n SL – single labeling: acquire additional examples (each with one noisy label)
n MV – majority voting: acquire additional noisy labels for existing examples
+ Aggregating and learning [Sheng et al., 2008]
n When labels are noisy, repeated labeling + majority voting helps
n Otherwise, acquiring additional training samples might be better
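The trade-off in this slide comes down to how fast majority voting cleans up noisy labels. The function below (an illustration, not code from the paper) computes the quality of a majority vote over n independent labels, each correct with probability p; for p > 0.5 it rises toward 1 as n grows, while for p = 0.5 no amount of repetition helps.

```python
from math import comb

def majority_quality(p, n):
    """P(majority of n labels is correct) when each label is correct w.p. p.

    Ties (possible for even n) are broken by a fair coin.
    """
    q = sum(comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(n // 2 + 1, n + 1))      # strict majority correct
    if n % 2 == 0:                                  # exact tie, coin flip
        q += 0.5 * comb(n, n // 2) * (p * (1 - p))**(n // 2)
    return q
```

For example, with p = 0.7 three labels already give a quality of 0.784, and five labels about 0.837, which is exactly the effect exploited by the MV strategy.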
[Figure 5: accuracy (50–100) vs. number of labels acquired (mushroom); (a) p = 0.6, #examples = 100, for MV; (b) p = 0.8, #examples = 50, for MV; curves: SL, MV]
Figure 5: Comparing the increase in accuracy for the mushroom data set as a function of the number of labels acquired, when the cost of an unlabeled example is negligible, i.e., CU = 0. Repeated-labeling with majority vote (MV) starts with an existing set of examples and only acquires additional labels for them, and single labeling (SL) acquires additional examples. Other data sets show similar results.

for a fixed labeler quality. Both MV and SL start with the same number of single-labeled examples. Then, MV starts acquiring additional labels only for the existing examples, while SL acquires new examples and labels them.

Generally, whether to invest in another whole training example or another label depends on the gradient of generalization performance as a function of obtaining another label or a new example. We will return to this when we discuss future work, but for illustration, Figure 5 shows scenarios for our example problem, where each strategy is preferable to the other. From Figure 1 we see that for p = 0.6, and with 100 examples, there is a lot of headroom for repeated-labeling to improve generalization performance by improving the overall labeling quality. Figure 5(a) indeed shows that for p = 0.6, repeated-labeling does improve generalization performance (per label) as compared to single-labeling new examples. On the other hand, for high initial quality or steep sections of the learning curve, repeated-labeling may not compete with single labeling. Figure 5(b) shows that single labeling performs better than repeated-labeling when we have a fixed set of 50 training examples with labeling quality p = 0.8. In particular, repeated-labeling could not further improve its performance after acquiring a certain amount of labels (cf., the q = 1 curve in Figure 1).

The results for other datasets are similar to Figure 5: under noisy labels, and with CU ≪ CL, round-robin repeated-labeling can perform better than single-labeling when there are enough training examples, i.e., after the learning curves are not so steep (cf., Figure 1).
4.2.2 Round-robin Strategies, General Costs

We illustrated above that repeated-labeling is a viable alternative to single-labeling, even when the cost of acquiring the “feature” part of an example is negligible compared to the cost of label acquisition. However, as described in the introduction, often the cost of (noisy) label acquisition CL is low compared to the cost CU of acquiring an unlabeled example. In this case, clearly repeated-labeling should be considered: using multiple labels can shift the learning curve up significantly. To compare any two strategies on equal footing, we calculate generalization performance “per unit cost” of acquired data; we then compare the different strategies for combining multiple labels, under different individual labeling qualities.

We start by defining the data acquisition cost CD:

CD = CU · Tr + CL · NL    (2)

to be the sum of the cost of acquiring Tr unlabeled examples (CU · Tr), plus the cost of acquiring the associated NL labels (CL · NL). For single labeling we have NL = Tr, but for repeated-labeling NL > Tr.

We extend the setting of Section 4.2.1 slightly: repeated-labeling now acquires and labels new examples; single labeling SL is unchanged. Repeated-labeling again is generalized round-robin: for each new example acquired, repeated-labeling acquires a fixed number of labels k, and in this case NL = k · Tr. (In our experiments, k = 5.) Thus, for round-robin repeated-labeling, in these experiments the cost setting can be described compactly by the cost ratio ρ = CU/CL, and in this case CD = ρ · CL · Tr + k · CL · Tr, i.e.,

CD ∝ ρ + k    (3)
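Equation (3) makes the budget trade-off concrete: writing ρ = CU/CL for the cost ratio, each round-robin example costs (ρ + k) label-units. A tiny helper (hypothetical names, not from the paper) for counting how many examples a fixed budget buys under each strategy:

```python
def examples_affordable(budget, rho, k, c_l=1.0):
    """Number of examples Tr affordable under round-robin labeling.

    Each example costs C_L * (rho + k): rho label-units for the unlabeled
    example itself plus k noisy labels. Single labeling is the k = 1 case.
    """
    per_example = c_l * (rho + k)
    return int(budget // per_example)
```

For instance, with a budget of 400 label-units and ρ = 3, single labeling (k = 1) buys 100 examples, while repeated labeling with k = 5 buys 50 examples carrying 250 labels in total.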
We examine two versions of repeated-labeling: repeated-labeling with majority voting (MV) and uncertainty-preserving repeated-labeling (ME), where we generate multiple examples with different weights to preserve the uncertainty of the label multiset, as described in Section 3.3.

Performance of different labeling strategies: Figure 6 plots the generalization accuracy of the models as a function of data acquisition cost. Here ρ = 3, and we see very clearly that for p = 0.6 both versions of repeated-labeling are preferable to single labeling. MV and ME outperform SL consistently (on all but waveform, where MV ties with SL) and, interestingly, the comparative performance of repeated-labeling tends to increase as one spends more on labeling.

Figure 7 shows the effect of the cost ratio ρ, plotting the average improvement per unit cost of MV over SL as a function of ρ. Specifically, for each data set the vertical differences between the curves are averaged across all costs, and then these are averaged across all data sets. The figure shows that the general phenomenon illustrated in Figure 6 is not tied closely to the specific choice of ρ = 3.

Furthermore, from the results in Figure 6, we can see that the uncertainty-preserving repeated-labeling ME always performs at least as well as MV, and in the majority of the cases ME outperforms MV. This is not apparent in all graphs, since Figure 6 only shows the beginning part of the learning curves for MV and ME (because for a given cost, SL uses up training examples quicker than MV and ME). However, as the number of training examples increases further, then (for p = 0.6) ME outperforms MV. For example, Figure 8 illustrates this for the splice dataset, comparing the two techniques for a larger range of costs.

In other results (not shown) we see that when labeling quality is substantially higher (e.g., p = 0.8), repeated-labeling still is increasingly preferable to single labeling as ρ increases; however, we no longer see an advantage for ME over MV. These results suggest that when labeler quality is low, inductive modeling often can benefit from the explicit representation
50556065707580859095
100
100 1100 2100 3100 4100 5100Number of labels (mushroom, p=0.6)
Acc
urac
y
SLML
(a) p = 0.6, #examples = 100, for MV
50556065707580859095
100
50 800 1550 2300 3050 3800 4550 5300Number of labels (mushroom, p=0.8)
Acc
urac
y
SLMV
(b) p = 0.8, #examples = 50, for MV
Figure 5: Comparing the increase in accuracy for themushroom data set as a function of the number oflabels acquired, when the cost of an unlabeled exam-ple is negligible, i.e., CU = 0. Repeated-labeling withmajority vote (MV ) starts with an existing set of ex-amples and only acquires additional labels for them,and single labeling (SL) acquires additional examples.Other data sets show similar results.
for a fixed labeler quality. Both MV and SL start with thesame number of single-labeled examples. Then, MV startsacquiring additional labels only for the existing examples,while SL acquires new examples and labels them.
Generally, whether to invest in another whole training exam-ple or another label depends on the gradient of generalizationperformance as a function of obtaining another label or anew example. We will return to this when we discuss futurework, but for illustration, Figure 5 shows scenarios for ourexample problem, where each strategy is preferable to theother. From Figure 1 we see that for p = 0.6, and with 100examples, there is a lot of headroom for repeated-labeling toimprove generalization performance by improving the overalllabeling quality. Figure 5(a) indeed shows that for p = 0.6,repeated-labeling does improve generalization performance(per label) as compared to single-labeling new examples. Onthe other hand, for high initial quality or steep sections of thelearning curve, repeated-labeling may not compete with sin-gle labeling. Figure 5(b) shows that single labeling performsbetter than repeated-labeling when we have a fixed set of 50training examples with labeling quality p = 0.8. Particularly,repeated-labeling could not further improve its performanceafter acquiring a certain amount of labels (cf., the q = 1 curvein Figure 1).
The results for other datasets are similar to Figure 5: un-der noisy labels, and with CU ! CL, round-robin repeated-labeling can perform better than single-labeling when thereare enough training examples, i.e., after the learning curvesare not so steep (cf., Figure 1).
4.2.2 Round-robin Strategies, General CostsWe illustrated above that repeated-labeling is a viable alter-
native to single-labeling, even when the cost of acquiring the“feature” part of an example is negligible compared to the costof label acquisition. However, as described in the introduction,often the cost of (noisy) label acquisition CL is low comparedto the cost CU of acquiring an unlabeled example. In thiscase, clearly repeated-labeling should be considered: usingmultiple labels can shift the learning curve up significantly.To compare any two strategies on equal footing, we calcu-late generalization performance “per unit cost” of acquireddata; we then compare the di!erent strategies for combiningmultiple labels, under di!erent individual labeling qualities.
We start by defining the data acquisition cost CD:
CD = CU · Tr + CL · NL (2)
to be the sum of the cost of acquiring Tr unlabeled examples(CU · Tr), plus the cost of acquiring the associated NL labels(CL · NL). For single labeling we have NL = Tr, but forrepeated-labeling NL > Tr.
We extend the setting of Section 4.2.1 slightly: repeated-labeling now acquires and labels new examples; single label-ing SL is unchanged. Repeated-labeling again is generalizedround-robin: for each new example acquired, repeated-labelingacquires a fixed number of labels k, and in this case NL = k·Tr.(In our experiments, k = 5.) Thus, for round-robin repeated-labeling, in these experiments the cost setting can be de-scribed compactly by the cost ratio ! = CU
CL, and in this case
CD = ! · CL · Tr + k · CL · Tr, i.e.,
CD " ! + k (3)
We examine two versions of repeated-labeling, repeated-labelingwith majority voting (MV ) and uncertainty-preserving repeated-labeling (ME ), where we generate multiple examples with dif-ferent weights to preserve the uncertainty of the label multisetas described in Section 3.3.
Performance of di!erent labeling strategies: Figure 6plots the generalization accuracy of the models as a function ofdata acquisition cost. Here ! = 3, and we see very clearly thatfor p = 0.6 both versions of repeated-labeling are preferable tosingle labeling. MV and ME outperform SL consistently (onall but waveform, where MV ties with SL) and, interestingly,the comparative performance of repeated-labeling tends toincrease as one spends more on labeling.
Figure 7 shows the e!ect of the cost ratio !, plotting theaverage improvement per unit cost of MV over SL as a functionof !. Specifically, for each data set the vertical di!erencesbetween the curves are averaged across all costs, and thenthese are averaged across all data sets. The figure shows thatthe general phenomenon illustrated in Figure 6 is not tiedclosely to the specific choice of ! = 3.
Furthermore, from the results in Figure 6, we can see that the uncertainty-preserving repeated-labeling ME always performs at least as well as MV, and in the majority of the cases ME outperforms MV. This is not apparent in all graphs, since Figure 6 only shows the beginning part of the learning curves for MV and ME (because for a given cost, SL uses up training examples quicker than MV and ME). However, as the number of training examples increases further, then (for p = 0.6) ME outperforms MV. For example, Figure 8 illustrates this for the splice dataset, comparing the two techniques for a larger range of costs.
In other results (not shown) we see that when labeling quality is substantially higher (e.g., p = 0.8), repeated-labeling still is increasingly preferable to single labeling as ρ increases; however, we no longer see an advantage for ME over MV. These results suggest that when labeler quality is low, inductive modeling often can benefit from the explicit representation
+ Aggregating and learning [Sheng et al., 2008]

n How to select which sample for re-labeling?

n Assume that
n the annotator quality is independent from the object: P(y_i^j = y_i) = p_j
n all annotators have the same quality: p_j = p
n the annotator quality is unknown, i.e., uniformly distributed in [0, 1]

n Let L0 = |{y^j | y^j = 0}| and L1 = |{y^j | y^j = 1}| denote the number of labels equal to 0 or 1 assigned to an object
+ Aggregating and learning [Sheng et al., 2008]

n If y = 1 is the true label, the probability of observing L0 and L1 labels is given by the binomial distribution

P(L0, L1 | p) = C(L0 + L1, L1) · p^L1 · (1 − p)^L0

n The posterior can be expressed as

P(p | L1, L0) = P(L0, L1 | p) P(p) / P(L0, L1) = P(L0, L1 | p) P(p) / ∫₀¹ P(L0, L1 | s) ds = p^L1 (1 − p)^L0 / B(L0 + 1, L1 + 1) = β_p(L0 + 1, L1 + 1)

where B(·, ·) is the Beta function and β_p(·, ·) the Beta distribution density
+ Aggregating and learning [Sheng et al., 2008]

n Let I_p(L0, L1) denote the CDF of the beta distribution

n The uncertainty of an object due to noisy labels is defined as

SLU = min{I_0.5(L0 + 1, L1 + 1), 1 − I_0.5(L0 + 1, L1 + 1)}

[Figure: beta CDFs as a function of p for the label multisets (L0 = 0, L1 = 4), (L0 = 1, L1 = 3), and (L0 = 2, L1 = 2)]
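The score SLU can be computed exactly for integer label counts. The sketch below (illustrative, not the authors' code) evaluates I_0.5 via the binomial-sum identity for the regularized incomplete beta function, so it needs only the standard library:

```python
from math import comb

def beta_cdf_half(a, b):
    """Regularized incomplete beta I_0.5(a, b) for integer a, b >= 1,
    via I_x(a, b) = sum_{j=a}^{n} C(n, j) x^j (1-x)^(n-j) with n = a+b-1;
    at x = 0.5 every term reduces to C(n, j) / 2^n."""
    n = a + b - 1
    return sum(comb(n, j) for j in range(a, n + 1)) / 2 ** n

def label_uncertainty(l0, l1):
    """S_LU = min{I_0.5(L0+1, L1+1), 1 - I_0.5(L0+1, L1+1)}."""
    i = beta_cdf_half(l0 + 1, l1 + 1)
    return min(i, 1.0 - i)
```

A split 2-vs-2 multiset is maximally uncertain (SLU = 0.5), while 0-vs-4 is nearly certain, matching the CDF curves sketched on the slide.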
+ Aggregating and learning [Sheng et al., 2008]

n Different strategies to select the next labeling action
n GRR: Generalized round robin
n Selective repeated labeling
n LU: Label uncertainty
n MU: Model uncertainty (as in active learning)
n LMU: Label and model uncertainty
[Figure 11: twelve panels (bmg, expedia, kr-vs-kp, mushroom, qvc, sick, spambase, splice, thyroid, tic-tac-toe, travelocity, waveform), each plotting accuracy vs. number of labels for GRR, MU, LU, and LMU]

Figure 11: Accuracy as a function of the number of labels acquired for the four selective repeated-labeling strategies for the 12 datasets (p = 0.6).
generalization accuracy averaged over the held-out test sets (as described in Section 4.1). The results (Figure 11) show that the improvements in data quality indeed do accelerate learning. (We report values for p = 0.6, a high-noise setting that can occur in real-life training data.7) Table 2 summarizes the results of the experiments, reporting accuracies averaged across the acquisition iterations for each data set, with the maximum accuracy across all the strategies highlighted in bold, the minimum accuracy italicized, and the grand averages reported at the bottom of the columns.
The results are satisfying. The two methods that incorporate label uncertainty (LU and LMU) are consistently better than round-robin repeated-labeling, achieving higher accuracy for every data set. (Recall that in the previous section, round-robin repeated-labeling was shown to be substantially better than the baseline single labeling in this setting.) The performance of model uncertainty alone (MU), which can be viewed as the active learning baseline, is more variable: in three cases giving the best accuracy, but in other cases not
7 From [20]: "No two experts, of the 5 experts surveyed, agreed upon diagnoses more than 65% of the time. This might be evidence for the differences that exist between sites, as the experts surveyed had gained their expertise at different locations. If not, however, it raises questions about the correctness of the expert data."
even reaching the accuracy of round-robin repeated-labeling. Overall, combining label and model uncertainty (LMU) is the best approach: in these experiments LMU always outperforms round-robin repeated-labeling, and as hypothesized, generally it is better than the strategies based on only one type of uncertainty (in each case, statistically significant by a one-tailed sign test at p < 0.1 or better).
5. CONCLUSIONS, LIMITATIONS, AND FUTURE WORK
Repeated-labeling is a tool that should be considered whenever labeling might be noisy, but can be repeated. We showed that under a wide range of conditions, it can improve both the quality of the labeled data directly, and the quality of the models learned from the data. In particular, selective repeated-labeling seems to be preferable, taking into account both labeling uncertainty and model uncertainty. Also, when quality is low, preserving the uncertainty in the label multisets for learning [25] can give considerable added value.
Our focus in this paper has been on improving data quality for supervised learning; however, the results have implications for data mining generally. We showed that selective repeated-labeling improves the data quality directly and substantially. Presumably, this could be helpful for many data mining applications.
This paper makes important assumptions that should be visited in future work, in order for us to understand practical repeated-labeling and realize its full benefits.
• For most of the work we assumed that all the labelers have the same quality p and that we do not know p. As we showed briefly in Section 3.2.2, differing qualities complicate the picture. On the other hand, good estimates of individual labelers' qualities inferred by observing the assigned labels [6, 26, 28] could allow more sophisticated selective repeated-labeling strategies.
• Intuitively, we might also expect that labelers would exhibit higher quality in exchange for a higher payment. It would be interesting to observe empirically how individual labeler quality varies as we vary CU and CL, and to build models that dynamically increase or decrease the amounts paid to the labelers, depending on the quality requirements of the task. Morrison and Cohen [18] determine the optimal amount to pay for noisy information in a decision-making context, where the amount paid affects the level of noise.
• In our experiments, we introduced noise to existing, benchmark datasets. Future experiments, that use real labelers (e.g., using Mechanical Turk) should give a better understanding on how to better use repeated-labeling strategies in a practical setting. For example, in practice we expect labelers to exhibit different levels of noise and to have correlated errors; moreover, there may not be sufficiently many labelers to achieve very high confidence for any particular example.
• In our analyses we also assumed that the difficulty of labeling an example is constant across examples. In reality, some examples are more difficult to label than others and building a selective repeated-labeling framework that explicitly acknowledges this, and directs resources to more difficult examples, is an important direction for future work. We have not yet explored to what extent techniques like LMU (which are agnostic to the difficulty of
[Figure 9: labeling quality vs. number of labels (waveform, p = 0.6) for ENTROPY and GRR]

Figure 9: What not to do: data quality improvement for an entropy-based selective repeated-labeling strategy vs. round-robin repeated-labeling.
[Figure 10: labeling quality vs. number of labels on waveform, panels (a) p = 0.6 and (b) p = 0.8, curves GRR, MU, LU, and LMU]

Figure 10: The data quality improvement of the four strategies (GRR, LU, MU, and LMU) for the waveform dataset.
α = Lpos + 1, β = Lneg + 1. Thus, we set:

SLU = min{I0.5(α, β), 1 − I0.5(α, β)} (4)
We compare selective repeated-labeling based on SLU to round-robin repeated-labeling (GRR), which we showed to perform well in Section 4.2. To compare repeated-labeling strategies, we followed the experimental procedure of Section 4.2, with the following modification. Since we are asking whether label uncertainty can help with the selection of examples for which to obtain additional labels, each training example starts with three initial labels. Then, each repeated-labeling strategy iteratively selects examples for which it acquires additional labels (two at a time in these experiments).
Comparing selective repeated-labeling using SLU (call that LU) to GRR, we observed similar patterns across all twelve data sets; therefore we only show the results for the waveform dataset (Figure 10; ignore the MU and LMU lines for now, we discuss these techniques in the next section), which are representative. The results indicate that LU performs substantially better than GRR, identifying the examples for which repeated-labeling is more likely to improve quality.
4.3.3 Using Model Uncertainty

A different perspective on the certainty of an example's label can be borrowed from active learning. If a predictive
Data Set     GRR    MU     LU     LMU
bmg          62.97  71.90  64.82  68.93
expedia      80.61  84.72  81.72  85.01
kr-vs-kp     76.75  76.71  81.25  82.55
mushroom     89.07  94.17  92.56  95.52
qvc          64.67  76.12  66.88  74.54
sick         88.50  93.72  91.06  93.75
spambase     72.79  79.52  77.04  80.69
splice       69.76  68.16  73.23  73.06
thyroid      89.54  93.59  92.12  93.97
tic-tac-toe  59.59  62.87  61.96  62.91
travelocity  64.29  73.94  67.18  72.31
waveform     65.34  69.88  66.36  70.24
average      73.65  78.77  76.35  79.46

Table 2: Average accuracies of the four strategies over the 12 datasets, for p = 0.6. For each dataset, the best performance is in boldface and the worst in italics.
model has high confidence in the label of an example, perhaps we should expend our repeated-labeling resources elsewhere.
• Model Uncertainty (MU) applies traditional active learning scoring, ignoring the current multiset of labels. Specifically, for the experiments below the model-uncertainty score is based on learning a set of models, each of which predicts the probability of class membership, yielding the uncertainty score:

SMU = 0.5 − | (1/m) Σ_{i=1}^{m} Pr(+|x, Hi) − 0.5 | (5)
where Pr(+|x, Hi) is the probability of classifying the example x into + by the learned model Hi, and m is the number of learned models. In our experiments, m = 10, and the model set is a random forest [4] (generated by WEKA).
Of course, by ignoring the label set, MU has the complementary problem to LU: even if the model is uncertain about a case, should we acquire more labels if the existing label multiset is very certain about the example's class? The investment in these labels would be wasted, since they would have a small effect on either the integrated labels or the learning.
• Label and Model Uncertainty (LMU) combines the two uncertainty scores to avoid examples where either model is certain. This is done by computing the score SLMU as the geometric average of SLU and SMU. That is:

SLMU = √(SMU · SLU) (6)
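Equations (5) and (6) are straightforward to compute once the per-model class probabilities are available; a minimal sketch with hypothetical function names, not the paper's implementation:

```python
def model_uncertainty(probs):
    """S_MU = 0.5 - |mean(Pr(+|x, H_i)) - 0.5| over the class-membership
    probabilities of m models (e.g. the trees of a random forest)."""
    return 0.5 - abs(sum(probs) / len(probs) - 0.5)

def lmu_score(s_lu, s_mu):
    """S_LMU = sqrt(S_MU * S_LU): the geometric mean is small as soon as
    either the label multiset or the model is already certain."""
    return (s_mu * s_lu) ** 0.5
```

A fully split ensemble (all probabilities 0.5) yields the maximal SMU of 0.5, while a unanimous ensemble yields 0, which drives SLMU to 0 regardless of the label uncertainty.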
Figure 10 demonstrates the improvement in data quality when using model information. We can observe that the LMU model strongly dominates all other strategies. In high-noise settings (p = 0.6) MU also performs well compared to GRR and LU, indicating that when noise is high, using learned models helps to focus the investment in improving quality. In settings with low noise (p = 0.8), LMU continues to dominate, but MU no longer outperforms LU and GRR.
4.3.4 Model Performance with Selective ML

So, finally, let us assess whether selective repeated-labeling accelerates learning (i.e., improves model generalization performance, in addition to data quality). Again, experiments are conducted as described above, except here we compute
+ Aggregating and learning [Donmez et al., 2009]

n Unlike [Sheng et al., 2008], annotators can have different (unknown) qualities

n IEThresh (Interval Estimate Threshold): a strategy to select the annotator with the highest estimated labeling quality

1. Fit logistic regression to the training data ⟨yi, xi⟩, i = 1, ..., I

2. Pick the most uncertain unlabeled instance

x* = argmax_{xi} (1 − max_{y∈{0,1}} P(y | xi))

where P(y | xi) is the a posteriori probability computed by the classifier
+ Aggregating and learning [Donmez et al., 2009]

3. For each annotator j,
n Compute whether she/he agrees with majority voting:

r_i^j = 1 if y_i^j = y_i^MV, 0 otherwise

n Compute the mean μ_j = E[r_i^j] and the sample standard deviation σ_j = std[r_i^j] of the agreement, averaged over multiple objects

n Compute the upper confidence interval of the annotator:

UI_j = μ_j + t_{α/2}^{(I_j − 1)} · σ_j / √I_j

where t_{α/2}^{(I_j − 1)} is the critical value of Student's t-distribution with I_j − 1 degrees of freedom
+ Aggregating and learning [Donmez et al., 2009]

4. Choose all the annotators with the largest upper confidence intervals: {j | UI_j ≥ ε · max_j UI_j}

5. Compute the majority vote y_i^MV of the selected annotators

6. Update the training data: T = T ∪ ⟨y_i^MV, x*⟩

7. Repeat 2-6
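Steps 3-5 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it approximates the Student-t critical value with a normal quantile to stay within the Python standard library, and `history[j]` is a hypothetical container for annotator j's 0/1 agreement record:

```python
from statistics import NormalDist, mean, stdev

def upper_interval(rewards, alpha=0.05):
    """UI_j = mu_j + t_{alpha/2} * sigma_j / sqrt(I_j) for annotator j's
    0/1 agreements with the majority vote; the t critical value is
    approximated here by the normal quantile."""
    n = len(rewards)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mean(rewards) + z * stdev(rewards) / n ** 0.5

def select_annotators(history, eps=0.8):
    """Steps 4-5: query every annotator whose upper interval is within a
    factor eps of the best one."""
    ui = {j: upper_interval(r) for j, r in history.items()}
    best = max(ui.values())
    return [j for j, u in ui.items() if u >= eps * best]
```

Because the interval is wide when an annotator has been queried only a few times, low-sample annotators keep getting explored until their estimates tighten, which is exactly the exploration/exploitation behavior described on the next slide.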
+ Aggregating and learning [Donmez et al., 2009]

n Achieves a trade-off between
n Exploration (at the beginning, to estimate the annotator qualities)
n Exploitation (once the annotator qualities are estimated, ask the more reliable annotators)
[Figure 2 panels: number of times selected vs. true oracle accuracy, image and phoneme datasets]
Figure 2: Number of times each oracle is queried vs. the true oracle accuracy. Each oracle corresponds to a single bar. Each bar is multicolored where each color shows the relative contribution. Blue corresponds to the first 10 iterations, green corresponds to an additional 40 iterations and red corresponds to another additional 100 iterations. The bar height shows the total number of times an oracle is queried for labeling by IEThresh during the first 150 iterations.
Table 1: Properties of six datasets used in the experiments. All are binary classification tasks with varying sizes.

Dataset   Size  +/- Ratio  Dimensions
image     2310  1.33       18
mushroom  8124  1.07       22
spambase  4601  0.65       57
phoneme   5404  0.41       5
ringnorm  7400  0.98       20
svmguide  3089  1.83       4
and to have consistent baselines. The annotator accuracies and the size of each dataset are reported in Table 2.

We compared our method IEThresh against the Repeated and Random baselines on these two datasets. In contrast to the UCI data experiment, there is no training of classifiers for this experiment. Instead, the test set predictions are made directly by AMT labelers. Hence, we randomly selected 50 instances from each dataset to be used by IEThresh to infer estimates for the annotator accuracies. The remaining instances are held out as the test set. The annotator with the best estimated accuracy is evaluated on the test set. The total number of queries is then calculated as a sum of the number of queries issued during inference and the number of queries issued to the chosen annotator during testing. Repeated and Random baselines do not need an inference phase since they do not change their annotator selection mechanism via learning. Hence, they are directly evaluated on the test set. The total number of queries is assigned comparably for IEThresh and Repeated; however, it is equal to the number of test instances for the Random baseline since it queries a single labeler for each instance; thus, there can only be as many queries as the number of test instances.
4.2 Results

Figure 1 compares three methods on six datasets with simulated oracles. The true accuracy of each oracle in Figure 1 is drawn uniformly at random from within the range
Table 2: The size and the annotator accuracies for each AMT dataset.

Data  Size  Annotator Accuracies
TEMP  190   0.44, 0.44, 0.54, 0.92, 0.92, 0.93
RTE   100   0.51, 0.51, 0.58, 0.85, 0.92
Table 3: Performance comparison on RTE data. The last column indicates the total number of queries issued to labelers by each method. IEThresh performs accurately with comparable labeling effort to Repeated.

Method    Accuracy  # Queries
IEThresh  0.92      252
Repeated  0.6       250
Random    0.64      50
[.5, 1]. The figure reports the average classification error with respect to the total number of oracle queries issued by each method. IEThresh is the best performer in all six datasets. In the ringnorm and spambase datasets, IEThresh initially performs slightly worse than the other methods, indicating that oracle reliability requires more sampling in these two datasets. But, after the estimates are settled (which happens in about 200 queries), it outperforms the others, with especially large margins in the spambase dataset. The results reported are statistically significant based on a two-sided paired t-test, where each pair of points on the averaged results is compared.
We also analyzed the effect of filtering less reliable oracles. An ideal filtering mechanism excludes the less accurate oracles early in the process and samples more from the more accurate ones. In Figure 2, we report the number of times each oracle is queried on the image and phoneme datasets. The x-axis shows the true accuracy of each oracle. We consider the first 150 iterations of IEThresh and count the number of times each oracle is selected. Each color corresponds to a different time frame; i.e. blue, green and red correspond to
Iteration counts: 41-150, 11-40, 1-10
+ Aggregating and learning [Dekel and Shamir, 2009]

n In some cases, the number of annotators is of the same order as the number of objects to annotate: I/J = Θ(1)

n Majority voting cannot help

n Estimating the annotator qualities might be problematic

n Goal: prune low-quality annotators, when each one annotates at most one object
+ Aggregating and learning [Dekel and Shamir, 2009]

n Consider a training set ⟨yi, xi⟩, i = 1, ..., I

n Let f(w, xi) denote a binary classifier that assigns a label in {0, 1} to xi

n Let h_j(xi) denote a randomized classifier which represents the way annotator j labels data

n Let S_j denote the set of objects annotated by annotator j

n Prune away any annotator j for which

ε_j = Σ_{i ∈ S_j} 1[h_j(xi) ≠ f(w, xi)] / |S_j| > T

n In words, the method prunes those annotators that are in disagreement with a classifier trained on the labels of all annotators
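The pruning rule follows directly from the definition of ε_j; the container layout below is hypothetical, chosen only for illustration:

```python
def disagreement_rate(annotator_labels, predictions):
    """epsilon_j: fraction of annotator j's objects on which h_j(x_i)
    disagrees with the classifier f(w, x_i) trained on all annotations."""
    ids = list(annotator_labels)
    return sum(annotator_labels[i] != predictions[i] for i in ids) / len(ids)

def prune_annotators(annotations, predictions, threshold):
    """Keep annotator j only if epsilon_j <= T. `annotations[j]` maps the
    object ids in S_j to j's labels; `predictions` maps object ids to the
    labels assigned by f (both hypothetical layouts)."""
    return {j: labels for j, labels in annotations.items()
            if disagreement_rate(labels, predictions) <= threshold}
```

Note the intentional asymmetry: f is trained on everyone's labels first, and only then is each annotator scored against it, so a single annotator with few labels can still be evaluated.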
+ Aggregating and learning [Dekel and Shamir, 2009]

n Example: ε1 = 1/9, ε2 = 0, ε3 = 1
+ Aggregating and learning [Raykar et al., 2010]

n Consider a training set ⟨yi, xi⟩, i = 1, ..., I

n Let f(w, xi) denote a binary classifier that assigns a label in {0, 1} to xi

n Consider the family of linear classifiers: yi = 1 if w^T xi ≥ γ, yi = 0 otherwise

n The probability of the positive class is modeled as a logistic sigmoid:

P(yi = 1 | xi, w) = σ(w^T xi), where σ(z) = 1 / (1 + e^{−z})
+ Aggregating and learning [Raykar et al., 2010]

n Similarly to [Welinder and Perona, 2010], the annotator quality model distinguishes between the true positive and true negative rates:

P(y_i^j = 1 | yi = 1) = α_1^j
P(y_i^j = 0 | yi = 0) = α_0^j

n Goal:
n Given the observed labels and the feature vectors D = {xi, y_i^1, ..., y_i^J}, i = 1, ..., I
n Estimate the unknown parameters θ = {w, α_1, α_0}
+ Aggregating and learning [Raykar et al., 2010]

n The likelihood function of the parameters θ = {w, α_1, α_0} given the observations D = {xi, y_i^1, ..., y_i^J}, i = 1, ..., I is factored as

P(D | θ) = Π_{i=1}^{I} P(y_i^1, ..., y_i^J | xi, θ)
         = Π_{i=1}^{I} [ P(y_i^1, ..., y_i^J | yi = 1, α_1) P(yi = 1 | xi, w) + P(y_i^1, ..., y_i^J | yi = 0, α_0) P(yi = 0 | xi, w) ]
+ Aggregating and learning [Raykar et al., 2010]

P(D | θ) = Π_{i=1}^{I} [ P(y_i^1, ..., y_i^J | yi = 1, α_1) P(yi = 1 | xi, w) + P(y_i^1, ..., y_i^J | yi = 0, α_0) P(yi = 0 | xi, w) ]

where, assuming the annotators label each object independently,

P(y_i^1, ..., y_i^J | yi = 1, α_1) = Π_{j=1}^{J} [α_1^j]^{y_i^j} [1 − α_1^j]^{1 − y_i^j}

P(y_i^1, ..., y_i^J | yi = 0, α_0) = Π_{j=1}^{J} [α_0^j]^{1 − y_i^j} [1 − α_0^j]^{y_i^j}

P(yi = 1 | xi, w) = σ(w^T xi),  P(yi = 0 | xi, w) = 1 − σ(w^T xi)
+ Aggregating and learning [Raykar et al., 2010]

n The parameters are found by maximizing the log-likelihood function:

{α̂_1, α̂_0, ŵ} = argmax_θ log P(D | θ)

n The solution is based on Expectation-Maximization

n Expectation step:

μi = P(yi = 1 | y_i^1, ..., y_i^J, xi, θ) ∝ P(y_i^1, ..., y_i^J | yi = 1, θ) P(yi = 1 | xi, θ)
   = a_{1,i} pi / (a_{1,i} pi + a_{0,i} (1 − pi))

where pi = σ(w^T xi), a_{1,i} = Π_{j=1}^{J} [α_1^j]^{y_i^j} [1 − α_1^j]^{1 − y_i^j}, and a_{0,i} = Π_{j=1}^{J} [α_0^j]^{1 − y_i^j} [1 − α_0^j]^{y_i^j}
+ Aggregating and learning [Raykar et al., 2010]

n Maximization step
n The true positive and true negative rates can be estimated in closed form:

α_1^j = Σ_{i=1}^{I} μi y_i^j / Σ_{i=1}^{I} μi,  α_0^j = Σ_{i=1}^{I} (1 − μi)(1 − y_i^j) / Σ_{i=1}^{I} (1 − μi)

n The classifier can be estimated by means of a gradient-based (Newton-Raphson) update:

w^{t+1} = w^t − η H^{−1} g
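The full EM loop can be sketched as follows. This is an illustrative reconstruction of the two-coin model, not the authors' code: a plain normalized gradient step stands in for the Newton update w^{t+1} = w^t − η H^{−1} g, and the posteriors μ are initialized with the soft majority vote:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def raykar_em(X, Y, n_iter=30, n_grad=50, lr=1.0):
    """EM sketch for the binary model. X: (I, d) features; Y: (I, J) matrix
    of 0/1 annotator labels. Returns the classifier weights w, per-annotator
    true positive rates alpha1 and true negative rates alpha0."""
    I = X.shape[0]
    w = np.zeros(X.shape[1])
    mu = Y.mean(axis=1)  # initialize posteriors with the soft majority vote
    for _ in range(n_iter):
        # M-step: closed-form annotator rates, gradient steps for w
        alpha1 = np.clip((mu @ Y) / mu.sum(), 1e-6, 1 - 1e-6)
        alpha0 = np.clip(((1 - mu) @ (1 - Y)) / (1 - mu).sum(), 1e-6, 1 - 1e-6)
        for _ in range(n_grad):
            w += lr * X.T @ (mu - sigmoid(X @ w)) / I
        # E-step: posterior of the true label given labels and classifier
        p = sigmoid(X @ w)
        a1 = np.prod(alpha1 ** Y * (1 - alpha1) ** (1 - Y), axis=1)
        a0 = np.prod(alpha0 ** (1 - Y) * (1 - alpha0) ** Y, axis=1)
        mu = a1 * p / (a1 * p + a0 * (1 - p))
    return w, alpha1, alpha0
```

On synthetic data with a linear boundary and annotators of differing reliability, the loop recovers both a reasonable classifier and the relative ordering of the annotator qualities.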
+ Aggregating and learning [Raykar et al., 2010]

n Log-odds:

logit(μi) = log( μi / (1 − μi) ) = log( P(yi = 1 | y_i^1, ..., y_i^J, xi, θ) / P(yi = 0 | y_i^1, ..., y_i^J, xi, θ) )
          = c + w^T xi + Σ_{j=1}^{J} y_i^j [logit(α_1^j) + logit(α_0^j)]

n Contribution of the classifier: w^T xi
n Contribution of the annotators: a weighted linear combination of the labels from all annotators
+ Aggregating and learning [Raykar et al., 2010]
LEARNING FROM CROWDS
[Figure 1(a): ROC curves for the classifier. Golden ground truth AUC = 0.915; proposed EM algorithm AUC = 0.913; majority voting baseline AUC = 0.882]

[Figure 1(b): ROC curves for the estimated true labels. Proposed EM algorithm AUC = 0.991; majority voting baseline AUC = 0.962]
Figure 1: Results for the digital mammography data set with annotations from 5 simulated radiologists. (a) The ROC curve of the learnt classifier using the golden ground truth (dotted black line), the majority voting scheme (dashed blue line), and the proposed EM algorithm (solid red line). (b) The ROC curve for the estimated ground truth. The actual sensitivity and specificity of each of the radiologists is marked with a point marker. The end of the dashed blue line shows the estimates of the sensitivity and specificity obtained from the majority voting algorithm. The end of the solid red line shows the estimates from the proposed method. The ellipse plots the contour of one standard deviation.
+ Aggregating and learning [Raykar et al., 2010]

n Extensions
n Bayesian approach, with priors on the true positive and true negative rates
n Adoption of different types of classifiers
n Multi-class classification: yi ∈ {l1, ..., lK}
n Ordinal regression: yi ∈ {l1, ..., lK}, l1 < ... < lK
n Regression: yi ∈ R
+ Aggregating and learning [Yan et al., 2010b]

n Setting similar to [Raykar et al., 2010], with two main differences

n No distinction between true positive and true negative rates: α_1^j = α_0^j, j = 1, ..., J

n The quality of the annotator depends on the object:

α^j(x) = 1 / (1 + e^{−(w^j)^T x})
+ Aggregating and learning [Yan et al., 2010b]

n Log-odds:

logit(μi) = log( μi / (1 − μi) ) = w^T xi + Σ_{j=1}^{J} (−1)^{(1 − y_i^j)} (w^j)^T xi
          = c + w^T xi + Σ_{j=1}^{J} y_i^j (w^j)^T xi

n Contribution of the annotators: a weighted linear combination of the labels from all annotators, where the weights depend on the object difficulty
+ Aggregating and learning [Yan et al., 2011]

n Active learning from crowds
n Which training point to pick?
n Pick the example that is closest to the classifier's separating hyperplane:

i* = argmin_i |w^T xi|

n Which expert to pick?

j* = argmin_j 1 / (1 + e^{−(w^j)^T x_{i*}})
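Both selection rules reduce to a couple of argmin computations. A minimal sketch, with a hypothetical array layout in which the rows of `W_annot` are the per-annotator weight vectors w^j, and following the slide's argmin formulation of the expert-selection rule:

```python
import numpy as np

def pick_example(w, X_unlabeled):
    """i* = argmin_i |w^T x_i|: the unlabeled instance closest to the
    separating hyperplane of the current classifier."""
    return int(np.argmin(np.abs(X_unlabeled @ w)))

def pick_annotator(W_annot, x):
    """j* = argmin_j 1 / (1 + exp(-(w^j)^T x)), evaluated at the
    selected instance x = x_{i*}."""
    scores = 1.0 / (1.0 + np.exp(-(W_annot @ x)))
    return int(np.argmin(scores))
```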
+ Crowdsourcing at work
+ CrowdSearch [Yan et al., 2010]

n CrowdSearch combines
n Automated image search
n Local processing on mobile phones + backend processing
n Real-time human validation of search results
n Amazon Mechanical Turk

n Studies the trade-off in terms of
n Delay
n Accuracy
n Cost
human error and bias to maximize accuracy. To balance these tradeoffs, CrowdSearch uses an adaptive algorithm that uses delay and result prediction models of human responses to judiciously use human validation. Once a candidate image is validated, it is returned to the user as a valid search result.
3. CROWDSOURCING FOR SEARCH

In this section, we first provide a background of the Amazon Mechanical Turk (AMT). We then discuss several design choices that we make while using crowdsourcing for image validation including: 1) how to construct tasks such that they are likely to be answered quickly, 2) how to minimize human error and bias, and 3) how to price a validation task to minimize delay.
Background: We now provide a short primer on the AMT, the crowdsourcing system that we use in this work. AMT is a large-scale crowdsourcing system that has tens of thousands of validators at any time. The key benefit of AMT is that it provides public APIs for automatic posting of tasks and retrieval of results. The AMT APIs enable us to post tasks and specify two parameters: (a) the number of duplicates, i.e. the number of independent validators who we want to work on the particular task, and (b) the reward that a validator obtains for providing responses. A validator works in two phases: (a) they first accept a task once they identify that they would like to work on it, which in turn decrements the number of available duplicates, and (b) once accepted, they need to provide a response within a period specified by the task.
One constraint of the AMT that pertains to CrowdSearch is that the number of duplicates and reward for a task that has been posted cannot be changed at a later point. We keep this practical limitation in mind in designing our system.
Constructing Validation Tasks: How can we construct validation tasks such that they are answered quickly? Our experience with AMT revealed several insights. First, we observed that asking people to tag query images and candidate images directly is not useful since: 1) text tags from crowdsourcing systems are often ambiguous and meaningless (similar conclusions have been reported by other crowdsourcing studies [8]), and 2) tasks involving tagging are unpopular, hence they incur large delay. Second, we found that having a large validation task that presents a number of <query image, candidate image> pairs enlarges human error and bias since a single individual can bias a large fraction of the validation results.
We settled on a simple format for validation tasks. Each <query image, candidate image> pair is packaged into a task, and a validator is required to provide a simple YES or NO answer: YES if the two images are correctly matched, and NO otherwise. We find that these tasks are often the most popular among validators on AMT.
Minimizing Human Bias and Error: Human error and bias is inevitable in validation results, therefore a central challenge is eliminating human error to achieve high accuracy. We use a simple strategy to deal with this problem: we request several duplicate responses for a validation task from multiple validators, and aggregate the responses using a majority rule. Since AMT does not allow us to dynamically change the number of duplicates for a task, we fix this number for all tasks. In §7.2, we evaluate several aggregation approaches, and show that a majority of five duplicates
Figure 2: Shown are an image search query, candidate images, and duplicate validation results. Each validation task is a Yes/No question about whether the query image and candidate image contain the same object.
is the best strategy and consistently achieves more than 95% search accuracy.
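The duplicate-aggregation step described above can be sketched as a simple majority rule over the fixed number of duplicates (an illustrative sketch, not CrowdSearch's actual implementation):

```python
def aggregate(responses, duplicates=5):
    """Majority rule over duplicate YES/NO validations of one
    <query image, candidate image> pair. Returns True/False once a
    majority of the fixed number of duplicates is reached, or None
    while the verdict is still undecided."""
    quorum = duplicates // 2 + 1
    if responses.count('Y') >= quorum:
        return True
    if responses.count('N') >= quorum:
        return False
    return None  # wait for more duplicate responses
```

Returning early as soon as either answer reaches the quorum is what lets the system stop waiting for slow validators once the outcome can no longer change.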
Pricing Validation Tasks: Crowdsourcing systems allow us to set a monetary reward for each task. Intuitively, a higher price provides more incentive for human validators, and therefore can lead to lower delay. This raises the following question: is it better to spend X cents on a single validation task or to spread it across X validation tasks of price one cent each? We find that it is typically better to have more tasks at a low price than fewer tasks at a high price. There are three reasons for this behavior: 1) since a large fraction of tasks on the AMT offer a reward of only one cent, the expectation of users is that most tasks are quick and low-cost, 2) crowdsourcing systems like the AMT have tens of thousands of human validators, hence posting more tasks reduces the impact of a slow human validator on overall delay, and 3) more responses allow better aggregation to avoid human error and bias. Our experiments with AMT show that the first response in five one-cent tasks is 50 - 60% faster than a single five-cent task, confirming the intuition that delay is lower when more low-priced tasks are posted.
4. CROWDSEARCH ALGORITHM

Given a query image and a ranked list of candidate images, the goal of human validation is to identify the correct candidate images from the ranked list. Human validation improves search accuracy, but incurs monetary cost and human processing delay. We first discuss these tradeoffs and then describe how CrowdSearch optimizes overall cost while returning at least one valid candidate image within a user-specified deadline.
4.1 Delay-Cost Tradeoffs

Before presenting the CrowdSearch algorithm, we illustrate the tradeoff between delay and cost by discussing posting schemes that optimize one or the other but not both.

Parallel posting to optimize delay: A scheme that optimizes delay would post all candidate images to the crowdsourcing system at the same time. (We refer to this as parallel posting.) While parallel posting reduces delay, it is expensive in terms of monetary cost. Figure 2 shows an instance where the image search engine returns four candidate images.
+ CrowdSearch [Yan et al., 2010]

n Delay-cost trade-offs
n Parallel posting
n Minimizes delay
n Expensive in terms of monetary cost
n Serial posting
n Posts top-ranked candidates for validation
n Cheap in terms of monetary cost
n Much higher delay

n Adaptive strategy → CrowdSearch
+ CrowdSearch [Yan et al., 2010]

n Example: a candidate image has received the sequence of responses S_i = {'Y', 'N'}

n Enumerate all sequences of responses, i.e.,
S_i^(1) = {'Y', 'N', 'Y'}
S_i^(2) = {'Y', 'N', 'N'}
S_i^(3) = {'Y', 'N', 'Y', 'Y'}
...

n For each sequence S_i^(j), estimate
n The probability of observing S_i^(j) given S_i
n Whether it would lead to success under majority voting
n The probability of obtaining the responses before the deadline

n Estimate the probability of success. If P_succ < τ, post a new candidate
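The enumeration step can be sketched as follows. For illustration this uses an i.i.d. Bernoulli response model with a hypothetical parameter `p_yes`; CrowdSearch itself looks up empirical sequence probabilities learned from training data (its SeqTree) and also folds in deadline predictions, which are omitted here:

```python
from itertools import product

def majority_yes(seq, quorum=3):
    """True if the response sequence contains a YES majority."""
    return seq.count('Y') >= quorum

def success_probability(received, p_yes=0.5, total=5, quorum=3):
    """Enumerate every completion of the received prefix up to `total`
    duplicate responses and sum the probability of the completions whose
    majority verdict is YES."""
    remaining = total - len(received)
    prob = 0.0
    for tail in product('YN', repeat=remaining):
        if majority_yes(list(received) + list(tail), quorum):
            p_tail = 1.0
            for r in tail:
                p_tail *= p_yes if r == 'Y' else 1.0 - p_yes
            prob += p_tail
    return prob
```

Comparing the resulting probability of success against the threshold τ is what triggers posting the next candidate in the adaptive scheme.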
+ CrowdSearch [Yan et al., 2010]

n Predicting validation results
n Training:
n Enumerate all sequences of fixed length (e.g., five)
n Compute the empirical probabilities

n Example:
n Observed sequence
n Sequences that lead to positive results
Figure 5: A SeqTree to Predict Validation Results. The received sequence is 'YNY'; the two sequences that lead to positive results are 'YNYNY' and 'YNYY'. The probability that 'YNYY' occurs given receiving 'YNY' is 0.16/0.25 = 64%.
5.1 Image Search Overview

The image search process contains two major steps: 1) extracting features from a query image, and 2) searching through database images with the features of the query image.
Extracting features from the query image: There are many good features to represent images, such as the Scale-Invariant Feature Transform (SIFT) [9]. While these features capture essential characteristics of images, they are not directly appropriate for search because of their large size. For instance, SIFT features are 128-dimensional vectors and there are several hundred such SIFT vectors for a VGA image. The large size makes it 1) unwieldy and inefficient for search since the data structures are large, and 2) inefficient for communication since no compression gains are achieved by locally computing SIFT features on the phone.
A canonical approach to reduce the size of features is to reduce the dimensionality by clustering. This is enabled by a lookup structure called a “vocabulary tree” that is constructed in an a priori manner by hierarchical k-means clustering of the SIFT features of a training dataset. For example, a vocabulary tree for buildings can be constructed by collecting thousands of training images, extracting their SIFT features and using k-means clustering to build the tree. A vocabulary tree is typically constructed for each category of images, such as faces, buildings, or book covers.
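A toy sketch of vocabulary-tree construction by hierarchical k-means, using 2-D vectors in place of 128-dimensional SIFT descriptors; the branch factor, depth, and all function names are illustrative, not the paper's implementation:

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Plain k-means; enough for a toy vocabulary tree."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return centers, labels

def build_tree(X, k=2, depth=2):
    """Hierarchical k-means: each node holds k centers and k subtrees."""
    if depth == 0 or len(X) < k:
        return None
    centers, labels = kmeans(X, k)
    children = [build_tree(X[labels == j], k, depth - 1) for j in range(k)]
    return {"centers": centers, "children": children}

def visterm(tree, x, k=2):
    """Quantize a descriptor: the root-to-leaf path encodes the visterm id."""
    path, node = 0, tree
    while node is not None:
        j = int(np.argmin(((node["centers"] - x) ** 2).sum(-1)))
        path = path * k + j
        node = node["children"][j]
    return path

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(40, 2))   # 2-D stand-ins for SIFT vectors
tree = build_tree(descriptors, k=2, depth=2)
v = visterm(tree, descriptors[0])
```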
Searching through database images: Once visterms are extracted from an image, they can be used in a manner similar to keywords in text retrieval [2]. The search process uses a data structure called the inverted index that is constructed from the corpus of images in the database. The inverted index is basically a mapping from each visterm to the images in the database containing that visterm. Each visterm is also associated with an inverted document frequency (IDF) score that describes its discriminating power. Given a set of visterms for a query image, the search process is simple: for each visterm, the image search engine looks up the inverted index and computes an IDF score for each of the candidate images. The list of candidates is returned ranked in order of their IDF score.
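The inverted-index lookup with IDF scoring described above can be sketched as follows, with integer visterm ids standing in for real visterms; the toy database and names are illustrative:

```python
import math
from collections import defaultdict

def build_inverted_index(db):
    """db maps image_id -> set of visterms. Returns the inverted index
    (visterm -> images containing it) and per-visterm IDF scores."""
    index = defaultdict(set)
    for img, terms in db.items():
        for t in terms:
            index[t].add(img)
    idf = {t: math.log(len(db) / len(imgs)) for t, imgs in index.items()}
    return index, idf

def search(query_terms, index, idf):
    """Score each candidate image by the summed IDF of shared visterms,
    then return candidates ranked by score."""
    scores = defaultdict(float)
    for t in query_terms:
        for img in index.get(t, ()):
            scores[img] += idf.get(t, 0.0)
    return sorted(scores, key=scores.get, reverse=True)

db = {"a": {1, 2, 3}, "b": {2, 4}, "c": {3, 4, 5}}
index, idf = build_inverted_index(db)
ranked = search({2, 3}, index, idf)
```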
5.2 Implementation Tradeoffs
There are two key questions that arise in determining how
to split image search functionality between the mobile phone and remote server. The first question is whether visterm extraction should be performed on the mobile phone or the remote server. Since visterms are very compact, transmitting visterms from a phone as opposed to the raw image can save time and energy, particularly if more expensive 3G communication is used. However, visterm computation can incur significant delay on the phone due to its resource constraints. In order to reduce this delay, one would need to trade off the resolution of the visterms, thereby impacting search accuracy. Thus, using local computation to extract visterms from a query image saves energy but sacrifices accuracy. Our system chooses the best option for visterm extraction depending on the availability of WiFi connectivity. If only 3G connectivity is available, visterm extraction is performed locally, whereas if WiFi connectivity is available, the raw image is transferred quickly over the WiFi link and visterm extraction is performed at the remote server.
The second question is whether inverted index lookupshould be performed on the phone or the remote server.There are three reasons to choose the latter option: 1) sincevisterms are already extremely compact, the benefit in per-forming inverted index lookup on the phone is limited, 2)having a large inverted index and associated database im-ages on the phone is often not feasible, and 3) having theinverted index on the phone makes it harder to update thedatabase to add new images. For all these reasons, we chooseto use a remote server for inverted index lookup.
6. SYSTEM IMPLEMENTATION
The CrowdSearch system is implemented on Apple iPhones and a backend server at UMass. The component diagram of the CrowdSearch system is shown in Figure 6.
iPhone Client: We designed a simple user interface formobile users to capture query images and issue a searchquery. The screenshot of the user interface is shown in Fig-ure 1. A client can provide an Amazon payments account tofacilitate the use of AMT and pay for validation. There isalso a free mode where validation is not performed and onlythe image search engine results are provided to the user.
To support local image processing on the iPhone, we portedan open-source implementation of the SIFT feature extrac-tion algorithm [26] to the iPhone. We also implemented avocabulary tree lookup algorithm to convert from SIFT fea-tures to visterms. While vocabulary tree lookup is fast andtakes less than five seconds, SIFT takes several minutes toprocess a VGA image due to the lack of floating point sup-port on the iPhone. To reduce the SIFT running time, wetune SIFT parameters to produce fewer SIFT features froman image. This modification comes at the cost of reducedaccuracy for image search but reduces SIFT running time onthe phone to less than 30 seconds. Thus, the overall com-putation time on the iPhone client is roughly 35 seconds.
When the client is connected to the server, it also receivesupdates, such as an updated vocabulary tree or new deadlinerecommendations.
CrowdSearch Server Implementation: The CrowdSearch Server comprises two major components: an automated image search engine and a validation proxy. The image search engine generates a ranked list of candidate images for each
Si = {‘Y’, ‘N’, ‘Y’}
P ({‘Y’, ‘N’, ‘Y’, ‘Y’}) = 0.16/0.25
P ({‘Y’, ‘N’, ‘Y’, ‘N’, ‘Y’}) = 0.03/0.25
+ CrowdSearch [Yan et al., 2010]
n Delay prediction
(a) Overall delay model (b) Inter-arrival delay model
Figure 3: Delay models for overall delay and inter-arrival delay. The overall delay is decoupled into the acceptance and submission delays.
From our inter-arrival delay model, we know that all inter-arrival times are independent. Thus, we can present theprobability density function of Yi,j as the convolution of theinter-arrival times of response pairs from i to j.
Before applying convolution, we first need to consider the condition Y_{i,i+1} ≥ t − t_i. This condition can be removed by applying the law of total probability. We sum over all the possible values t_x of Y_{i,i+1}, noting that the lower bound is t − t_i. For each Y_{i,i+1} = t_x, the remaining part of Y_{i,j}, i.e. Y_{i+1,j}, must be at most D − t_i − t_x. Thus the condition on Y_{i,i+1} can be removed and we have:
P_ij = Σ_{t_x = t − t_i}^{D − t_i} P(Y_{i,i+1} = t_x) · P(Y_{i+1,j} ≤ D − t_i − t_x)    (4)
Now we can apply the convolution directly to Y_{i+1,j}. Let f_{i,j}(t) denote the PDF of the inter-arrival time between responses i and j. The PDF of Y_{i+1,j} can be expressed as:

f_{i+1,j}(t) = (f_{i+1,i+2} ∗ · · · ∗ f_{j−1,j})(t)
Combining this with Equation 4, we have:
P_ij = Σ_{t_x = t − t_i}^{D − t_i} f_{i,i+1}(t_x) · Σ_{t_y = 0}^{D − t_i − t_x} f_{i+1,j}(t_y)    (5)
Now the probability we want to predict has been expressed in terms of the PDFs of the inter-arrival times. Our delay models capture the distribution of all inter-arrival times that we need for computing the above probability: we use the delay model for the first response when i = 0, and the inter-arrival model of adjacent responses when i > 0. Therefore, we can predict the delay of receiving any remaining responses given the time at which partial responses were received.
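A discrete-time sketch of this prediction, following the structure of Equation (5), with assumed shifted-exponential inter-arrival pmfs; parameter values and names are made up for illustration, not the fitted AMT parameters:

```python
import numpy as np

def inter_arrival_pmf(lam, shift, horizon):
    """Discretized shifted-exponential pmf for one inter-arrival time."""
    t = np.arange(horizon + 1)
    p = np.where(t >= shift, lam * np.exp(-lam * (t - shift)), 0.0)
    return p / p.sum()

def prob_jth_by_deadline(pmfs, t, t_i, D):
    """Equation (5), in discrete time: probability that the j-th response
    arrives by deadline D, given that the i-th arrived at t_i and no
    further response has arrived up to the current time t. pmfs[0] is
    the pmf of Y_{i,i+1}; the rest are the remaining inter-arrivals."""
    # P(Y_{i+1,j} = t_y): convolution of the remaining inter-arrival pmfs
    rest = np.array([1.0])
    for f in pmfs[1:]:
        rest = np.convolve(rest, f)
    total = 0.0
    for tx in range(max(t - t_i, 0), D - t_i + 1):   # first inter-arrival
        if tx >= len(pmfs[0]):
            break
        budget = D - t_i - tx                        # time left for the rest
        total += pmfs[0][tx] * rest[:budget + 1].sum()
    return total

pmfs = [inter_arrival_pmf(0.1, 2, 300) for _ in range(2)]
p = prob_jth_by_deadline(pmfs, t=10, t_i=5, D=60)
```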
4.4 Predicting Validation Results
Having presented the delay model, we discuss how to predict the actual content of the incoming responses, i.e. whether each response is a Yes or No. Specifically, given that we have received a sequence S_i, we want to compute the probability of occurrence of each possible sequence S_j that starts with S_i, such that the validation result is positive, i.e., majority(S_j) = Yes.
This prediction can be easily done using a sufficiently large training dataset to study the distribution of all possible result sequences. For the case where the number of duplicates is set to 5, there are 2^5 = 32 different sequence combinations. We can compute the probability that each sequence occurs in the training set by counting the number of its occurrences. We use this probability distribution as our model for predicting validation results. We use the probabilities to construct a probability tree called SeqTree.
Figure 5 shows an example of a SeqTree. It is a binary tree whose leaf nodes are the sequences of length 5. Two leaf nodes that differ only in the last bit have a common parent node whose sequence is the common substring of the two leaf nodes. For example, nodes ‘YNYNN’ and ‘YNYNY’ have a parent node ‘YNYN’. The probability of a parent node is the sum of the probabilities of its children. Following this rule, the SeqTree is built, where each node S_i is associated with a probability p_i that its sequence occurs.
Given the tree, it is easy to predict the probability that S_j occurs given the partial sequence S_i: simply find the nodes that correspond to S_i and S_j respectively, and the probability we want is p_j / p_i.
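A minimal sketch of the SeqTree idea: store the empirical probability of every prefix seen in training, then sum p_j / p_i over full sequences extending the partial one whose majority vote is 'Y'. With the numbers of Figure 5, the probability of 'YNYY' given 'YNY' would come out as 0.16/0.25 = 64%. The training set and all names below are invented for illustration:

```python
from collections import Counter

def build_seqtree(training_sequences):
    """Empirical probability of every prefix in the training set; a
    parent's probability is automatically the sum of its children's."""
    counts = Counter()
    for seq in training_sequences:
        for i in range(len(seq) + 1):
            counts[seq[:i]] += 1
    n = len(training_sequences)
    return {prefix: c / n for prefix, c in counts.items()}

def prob_positive(tree, partial, length=5):
    """P(majority(S_j) = Yes | partial S_i): sum p_j / p_i over the full
    sequences S_j that extend S_i and have a 'Y' majority."""
    p_i = tree.get(partial, 0.0)
    if p_i == 0.0:
        return 0.0
    p_pos = sum(p for seq, p in tree.items()
                if len(seq) == length
                and seq.startswith(partial)
                and seq.count('Y') > length // 2)
    return p_pos / p_i

tree = build_seqtree(['YYYYY', 'YYYYY', 'YYYYY', 'YNNNN'])
p = prob_positive(tree, 'Y')
```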
5. IMAGE SEARCH ENGINE
In this section, we briefly introduce the automated image search engine. Our search engine is designed using image search methods that have been described in prior work including our own [30]. The fundamental idea of the image search engine is to use a set of compact image representations called visterms (visual terms) for efficient communication and search. The compactness of visterms makes them attractive for mobile image search, since they can be communicated from the phone to a remote search engine server at extremely low energy cost. However, extracting visterms from images incurs significant computation overhead and delay on the phone. In this section, we provide an overview of the image search engine, and focus on explaining the tradeoffs that are specific to using it on resource-constrained mobile phones.
Acceptance:    f_a(t) = λ_a e^{−λ_a (t − c_a)}

Submission:    f_s(t) = λ_s e^{−λ_s (t − c_s)}

Overall:       f_o(t) = f_a(t) ∗ f_s(t)

Inter-arrival: f_i(t) = λ_i e^{−λ_i t}
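For illustration, the overall-delay density f_o(t) = f_a(t) ∗ f_s(t) can be evaluated numerically as a discrete convolution of the acceptance and submission densities; the λ and c values below are assumptions, not the parameters fitted on AMT traces:

```python
import numpy as np

def shifted_exp_pmf(lam, c, horizon):
    """Discretized f(t) = lam * exp(-lam * (t - c)) for t >= c."""
    t = np.arange(horizon + 1)
    p = np.where(t >= c, lam * np.exp(-lam * (t - c)), 0.0)
    return p / p.sum()

# Assumed parameters (in seconds), for illustration only.
f_a = shifted_exp_pmf(lam=0.05, c=10, horizon=600)   # acceptance delay
f_s = shifted_exp_pmf(lam=0.02, c=30, horizon=600)   # submission delay
f_o = np.convolve(f_a, f_s)                          # overall: f_a * f_s

# e.g. probability that the first response arrives within 5 minutes
p_5min = f_o[:301].sum()
```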
+ CrowdSearch [Yan et al., 2010]
Figure 7: Precision of automated image search overfour categories of images. These four categoriescover the spectrum of the precision of automatedsearch.
faces and flowers. The precision drops significantly as the length of the ranked list grows, indicating that even top-ranked images suffer from a high error rate. Therefore, we cannot present the results directly to users.
We now evaluate how much human validation can improve image search precision. Figure 8 plots four different schemes for human validation: first-response, majority(3), majority(5), and one-veto (i.e. complete agreement among validators). In each of these cases, the human-validated search scheme returns only the candidate images on the ranked list that are deemed to be correct. Automated image search simply returns the top five images on the ranked list.
The results reveal two key observations. First, considerable improvement in precision is achieved irrespective of which strategy is used. All four validation schemes are considerably better than automated search. For face images, even using a single human validation improves precision by 3 times, whereas the use of a majority(5) scheme improves precision by 5 times. Even for book cover images, majority(5) still improves precision by 30%. In fact, the precision using human validators is also considerably better than the top-ranked response from the automatic search engine. Second, among the four schemes, human validation with majority(5) is easily the best performer and consistently provides accuracy greater than 95% for all image categories. Majority(3) is a close second, but its precision on face and building images is less than 95%. The one-veto scheme also cannot reach 95% precision for face, flower and building images. Using the first response gives the worst precision as it is affected most by human bias and error. Based on the above observations, we conclude that for mobile users who care about search precision, majority(5) is the best validation scheme.
7.3 Accuracy of Delay Models
The inter-arrival time models are central to the CrowdSearch algorithm. We obtain the parameters of the delay models using the training dataset, and validate the parameters against the testing dataset. Both datasets are described in §7.1. We validate the following five models: arrival of the first response, and the inter-arrival times between adjacent responses, from the 1st-2nd pair to the 4th-5th pair (§4.3.1). In this and the following experiments, we set the
Figure 8: Precision of automated image search and human validation with four different validation criteria.
threshold to post the next task to 0.6. In other words, if the probability that at least one of the existing validation tasks is successful is less than 0.6, a new task is triggered.
Figure 9(a) shows the cumulative distribution functions (CDF) for the first response. As described in §4.3.1, this model is derived by the convolution of the acceptance time and submission time distributions. We show that the model parameters for the acceptance, submission, as well as the total delay for the first response fit the testing data very well. Figure 9(b) shows the CDFs of two of the inter-arrival times: between the 1st and 2nd responses, and between the 3rd and 4th responses. (The other inter-arrival times are not shown to avoid clutter.) The scatter points are for the testing dataset and the solid line curves are for our model. Again, the model fits the testing data very well.
While the results were shown visually in Figure 9, we quantify the error between the actual and predicted distributions using the K-L divergence metric in Table 1. The K-L divergence or relative entropy measures the distance between two distributions [3] in bits. Table 1 shows that the distance between our model and the actual data is less than 5 bits for all the models, which is very small. These values are all negative, which indicates that the predicted delay of our model is a little bit larger than the actual delay. This observation indicates that our models are conservative in the sense that they would rather post more tasks than miss the deadline requirement.
The results from Figure 9 show that the model parame-ters remain stable over time and can be used for prediction.In addition, it shows that our model provides an excellentapproximation of the user behavior on a large-scale crowd-sourcing system such as AMT.
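The K-L divergence used above to quantify model fit can be computed as below; the binned distributions are invented stand-ins for the actual delay data, and the direction of the divergence is an assumption:

```python
import math

def kl_divergence_bits(p, q, eps=1e-12):
    """D(p || q) = sum_i p_i * log2(p_i / q_i), in bits."""
    return sum(pi * math.log2(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

# Assumed binned delay distributions, standing in for the real traces.
empirical = [0.5, 0.3, 0.2]
model = [0.45, 0.35, 0.20]
d = kl_divergence_bits(empirical, model)
```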
7.4 CrowdSearch Performance
In this section, we evaluate the CrowdSearch algorithm on
its ability to meet a user-specified deadline while maximizingaccuracy and minimizing overall monetary cost. We com-pare the performance of CrowdSearch against two schemes:parallel posting and serial posting, described in §4.1. Paral-lel posting posts all five candidate results at the same time,
+
n 36-month large-scale integrating project

n partially funded by the European Commission's 7th Framework ICT Programme for Research and Technological Development
n www.cubrikproject.eu
The CUBRIK project
+ Objectives [Fraternali et al., 2012]

n The technical goal of CUbRIK is to build an open search platform grounded on four objectives: n Advance the architecture of multimedia search n Place humans in the loop n Open the search box n Start up a search business ecosystem
+ Objective: Advance the architecture of multimedia search

n Multimedia search: the coordinated result of three main processes:

n Content processing: acquisition, analysis, indexing and knowledge extraction from multimedia content

n Query processing: derivation of an information need from a user and production of a sensible response

n Feedback processing: quality feedback on the appropriateness of search results
+ Objective: Advance the architecture of multimedia search

n Objective: n Content processing, query processing and feedback processing phases will be implemented by means of independent components

n Components are organized in pipelines

n Each application defines ad-hoc pipelines that provide unique multimedia search capabilities in that scenario
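The component-pipeline idea can be sketched as plain function composition; the three stage functions and all names below are hypothetical placeholders, not CUbRIK's actual components or APIs:

```python
from typing import Callable, Dict, List

# A component maps one processing stage's output to the next stage's input.
Component = Callable[[Dict], Dict]

def make_pipeline(components: List[Component]) -> Component:
    """Compose independent components into an ad-hoc pipeline."""
    def run(item: Dict) -> Dict:
        for step in components:
            item = step(item)
        return item
    return run

# Hypothetical content-processing stages: acquisition, analysis, indexing.
def acquire(item):  return {**item, "content": f"raw:{item['source']}"}
def analyze(item):  return {**item, "features": len(item["content"])}
def index(item):    return {**item, "indexed": True}

content_pipeline = make_pipeline([acquire, analyze, index])
result = content_pipeline({"source": "video42"})
```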
+ Objective: Humans in the loop

n Problem: the uncertainty of analysis algorithms leads to low confidence results and conflicting opinions on automatically extracted features

n Solution: humans have superior capacity for understanding the content of audiovisual material n State of the art: humans replace automatic feature extraction processes (human annotations)

n Our contribution: integration of human judgment and algorithms n Goal: improve the performance of multimedia content processing
+ CUbRIK architecture
CUBRIK Large-Scale Integrating Project (Version 1.0, 27 June 2011)
[...] the nature of the multimedia search tasks to the most appropriate human interaction mechanisms.

CUBRIK will address the problem of optimizing multimedia search task execution on top of existing social networking structures. The challenge is to exploit classical approaches of social network analysis (e.g., centrality analysis, cohesive sub-group identification, role inference) in order to optimize the performance of crowdsourcing multimedia search tasks. The general goal is to extract from social networks of different kinds (general purpose, entertainment, professional, etc.) the actors that are most likely to perform a task with maximum quality and in minimum time. Algorithms for evaluating the affinity of tasks to potential human assignees, to maximize the velocity of task spreading through vast crowds, for evaluating the reputation and trust of candidate assignees, and for detecting malicious behaviours (e.g. spam, fraud) need to be designed and customized to the specific context of multimedia search.
B1.1.5. Realising CUBRIK
The approach pursued by CUBRIK is the development of an open-source framework for search applications designers, technology providers, and content owners to manage the complexity of constructing and evolving scalable, multimodal, real-time media search and retrieval applications on top of a collection of components, individuals and communities.
Figure 1.2: CUBRIK Architecture
Figure 1.2 shows a high level overview of the CUBRIK architecture. CUBRIK relies on a framework for executing processes (aka pipelines), consisting of collections of tasks to be executed in a distributed fashion. Each pipeline is described by a workflow of tasks, allocated to executors. Task executors can be software components (e.g., data analysis algorithms, metadata indexing tools, search engines of different nature, result presentation modules, etc.). Tasks can also be allocated to individual human users (e.g., via a gaming interface) or to an entire community (e.g., by a crowdsourcing component).
Different pipelines can be defined for the different processes of a multimedia search application: content analysis and metadata extraction, query processing, and relevance feedback processing. Pipeline descriptions are stored in a process repository, encoded in standard workflow languages [...]
n Problems in automatic logo detection: n Object recognition is affected by the quality of the input set of images

n Uncertain matches, i.e., the ones with a low matching score, might not contain the searched logo
Trademark Logo Detection: problems in automatic logo detection

n Contribution of human computation n Filter the input logos, eliminating the irrelevant ones n Segment the input logos

n Validate the matching results
Trademark Logo Detection: contribution of human computation
+ Human-powered logo detection [Bozzon et al., 2012]

n Goal: integrate human and automatic computation to increase precision and recall w.r.t. fully automatic solutions
Pipeline (from the block diagram): a Logo Name is used to Retrieve Logo Images; the crowd Validates Logo Images, yielding Validated Logo Images; these are Matched against a Video collection; High Confidence Results are emitted directly, while Low Confidence Results go through Validate Low-confidence Results; finally, Join Results and Emit Report merges the Validated Results.

+ Logo Detection
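The split between high- and low-confidence matcher results, with low-confidence results routed to crowd validation, can be sketched as follows; the threshold value and field names are illustrative:

```python
def route_matches(matches, threshold=0.8):
    """Split the automatic matcher's output: high-confidence results are
    emitted directly, low-confidence ones go to the crowd for validation."""
    high = [m for m in matches if m["score"] >= threshold]
    low = [m for m in matches if m["score"] < threshold]
    return high, low

def join_results(high, validated_low):
    """Join high-confidence results with crowd-validated ones."""
    return high + [m for m in validated_low if m.get("crowd_ok")]

matches = [{"id": 1, "score": 0.95}, {"id": 2, "score": 0.40}]
high, low = route_matches(matches)
# the crowd validates `low`, setting m["crowd_ok"] on each match
report = join_results(high, [{**m, "crowd_ok": True} for m in low])
```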
Experimental evaluation
n Three experimental settings: n No human intervention n Logo validation performed by two domain experts n Inclusion of the actual crowd knowledge

n Crowd involvement n 40 people involved n 50 task instances generated n 70 collected answers
Experimental evaluation

[Precision-recall plot comparing the No Crowd, Experts, and Crowd settings for the Aleve, Chunky, and Shout logos]
Experimental evaluation

[Precision-recall plot comparing the No Crowd, Experts, and Crowd settings for the Aleve, Chunky, and Shout logos]

n Precision decreases. Reasons for the wrong inclusions: n Geographical location of the users n Expertise of the involved users
Experimental evaluation

[Precision-recall plot comparing the No Crowd, Experts, and Crowd settings for the Aleve, Chunky, and Shout logos]

n Precision decreases: n Similarity between two logos in the data set
Open issues and future directions

n Reproducibility and experiment design [Paritosh, 2012]

n Expert finding / task allocation

n Beyond textual labels
+
Thanks for your attention www.cubrikproject.eu
+ References 1/3
n [Bozzon et al., 2012] Alessandro Bozzon, Ilio Catallo, Eleonora Ciceri, Piero Fraternali, Davide Martinenghi, Marco Tagliasacchi: A Framework for Crowdsourced Multimedia Processing and Querying. CrowdSearch 2012: 42-47

n [Dawid and Skene, 1979] A. P. Dawid and A. M. Skene, Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm, Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 28, No. 1 (1979), pp. 20-28

n [Dekel and Shamir, 2009] O. Dekel and O. Shamir, Vox Populi: Collecting High-Quality Labels from a Crowd. In Proceedings of COLT, 2009.

n [Donmez et al., 2009] Pinar Donmez, Jaime G. Carbonell, and Jeff Schneider. 2009. Efficiently learning the accuracy of labeling sources for selective sampling. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09)

n [Fraternali et al., 2012] Piero Fraternali, Marco Tagliasacchi, Davide Martinenghi, Alessandro Bozzon, Ilio Catallo, Eleonora Ciceri, Francesco Saverio Nucci, Vincenzo Croce, Ismail Sengör Altingövde, Wolf Siberski, Fausto Giunchiglia, Wolfgang Nejdl, Martha Larson, Ebroul Izquierdo, Petros Daras, Otto Chrons, Ralph Traphöner, Björn Decker, John Lomas, Patrick Aichroth, Jasminko Novak, Ghislain Sillaume, Fernando Sánchez-Figueroa, Carolina Salas-Parra: The CUBRIK project: human-enhanced time-aware multimedia search. WWW (Companion Volume) 2012: 259-262

n [Freiburg et al., 2011] Bauke Freiburg, Jaap Kamps, and Cees G.M. Snoek. 2011. Crowdsourcing visual detectors for video search. In Proceedings of the 19th ACM International Conference on Multimedia (MM '11). ACM, New York, NY, USA, 913-916.

n [Goëau et al., 2011] H. Goëau, A. Joly, S. Selmi, P. Bonnet, E. Mouysset, L. Joyeux, J. Molino, P. Birnbaum, D. Barthelemy, and N. Boujemaa, Visual-based plant species identification from crowdsourced data. In Proceedings of ACM Multimedia, 2011, 813-814.

n [Harris, 2012] Christopher G. Harris, An Evaluation of Search Strategies for User-Generated Video Content, Proceedings of the First International Workshop on Crowdsourcing Web Search, Lyon, France, April 17, 2012

n [Karger et al., 2011] D.R. Karger, S. Oh, and D. Shah, Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems. In Proceedings of CoRR, 2011.

n [Kumar and Lease, 2011] A. Kumar and M. Lease. Modeling annotator accuracies for supervised learning. In WSDM Workshop on Crowdsourcing for Search and Data Mining, 2011.
+ References 2/3
n [Nowak and Rüger, 2010] Stefanie Nowak and Stefan Rüger. 2010. How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In Proceedings of the International Conference on Multimedia Information Retrieval (MIR '10). ACM, New York, NY, USA, 557-566.

n [Paritosh, 2012] Praveen Paritosh, Human Computation Must Be Reproducible, Proceedings of the First International Workshop on Crowdsourcing Web Search, Lyon, France, April 17, 2012

n [Raykar et al., 2010] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. 2010. Learning From Crowds. J. Mach. Learn. Res. 11 (August 2010), 1297-1322.

n [Sheng et al., 2008] Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. 2008. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08). ACM, New York, NY, USA, 614-622.

n [Snoek et al., 2010] Cees G.M. Snoek, Bauke Freiburg, Johan Oomen, and Roeland Ordelman. Crowdsourcing rock n' roll multimedia retrieval. In Proceedings of the International Conference on Multimedia (MM '10). ACM, New York, NY, USA, 1535-1538.

n [Snow et al., 2008] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast---but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 254-263.

n [Soleymani and Larson, 2010] Soleymani, M. and Larson, M. Crowdsourcing for Affective Annotation of Video: Development of a Viewer-reported Boredom Corpus. In Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010)

n [Sorokin and Forsyth, 2008] Sorokin, A.; Forsyth, D., "Utility data annotation with Amazon Mechanical Turk," Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pp. 1-8, 23-28 June 2008

n [Steiner et al., 2011] Thomas Steiner, Ruben Verborgh, Rik Van de Walle, Michael Hausenblas, and Joaquim Gabarró Vallés, Crowdsourcing Event Detection in YouTube Videos, Proceedings of the 1st Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web, 2011

n [Tang and Lease, 2011] Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. In ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.
+ References 3/3
n [Urbano et al., 2010] J. Urbano, J. Morato, M. Marrero, and D. Martín. Crowdsourcing preference judgments for evaluation of music similarity tasks. In Proceedings of the ACM SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010), pages 9-16, Geneva, Switzerland, July 2010.

n [Vondrick et al., 2010] Carl Vondrick, Deva Ramanan, and Donald Patterson. 2010. Efficiently scaling up video annotation with crowdsourced marketplaces. In Proceedings of the 11th European Conference on Computer Vision: Part IV (ECCV'10)

n [Welinder and Perona, 2010] Welinder, P., Perona, P. Online crowdsourcing: rating annotators and obtaining cost-effective labels. Workshop on Advancing Computer Vision with Humans in the Loop at CVPR, 2010

n [Whitehill et al., 2009] Jacob Whitehill, Paul Ruvolo, Jacob Bergsma, Tingfan Wu, and Javier Movellan, "Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise", Advances in Neural Information Processing Systems, 2009.

n [Yan et al., 2010] Tingxin Yan, Vikas Kumar, and Deepak Ganesan. 2010. CrowdSearch: exploiting crowds for accurate real-time image search on mobile phones. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services (MobiSys '10). ACM, New York, NY, USA, 77-90.

n [Yan et al., 2010b] Y. Yan, R. Rosales, G. Fung, M.W. Schmidt, G.H. Valadez, L. Bogoni, L. Moy, and J.G. Dy, Modeling annotator expertise: Learning when everybody knows a bit of something. In Journal of Machine Learning Research - Proceedings Track, 2010, 932-939.

n [Yan et al., 2011] Y. Yan, R. Rosales, G. Fung, and J.G. Dy, Active Learning from Crowds. In Proceedings of ICML, 2011, 1161-1168.