Crowdsourcing for Multimedia Retrieval

+ Crowdsourcing for Multimedia Retrieval. Marco Tagliasacchi, Politecnico di Milano, Italy

description

Lecture by Marco Tagliasacchi (Politecnico di Milano) at the Summer School on Social Media Modeling and Search, a European Chapter of the ACM SIGMM event, supported by the CUbRIK and Social Sensor projects. 10-14 September, Fira, Santorini, Greece.

Transcript of Crowdsourcing for Multimedia Retrieval

Page 1: Crowdsourcing for Multimedia Retrieval

+ Crowdsourcing for Multimedia Retrieval. Marco Tagliasacchi, Politecnico di Milano, Italy

Page 2: Crowdsourcing for Multimedia Retrieval

+  Outline  

n Crowdsourcing applications in multimedia retrieval

n Aggregating annotations

n Aggregating and learning

n Crowdsourcing at work

Page 3: Crowdsourcing for Multimedia Retrieval

+ Crowdsourcing applications in multimedia retrieval

Page 4: Crowdsourcing for Multimedia Retrieval

+  Crowdsourcing  

n Crowdsourcing is an example of human computing

n Use an online community of human workers to complete useful tasks

n The task is outsourced to an undefined public

n Main idea: design tasks that are easy for humans and hard for machines

Page 5: Crowdsourcing for Multimedia Retrieval

+  Crowdsourcing  

n Crowdsourcing platforms

n Paid contributors: Amazon Mechanical Turk (www.mturk.com), CrowdFlower (crowdflower.com), oDesk (www.odesk.com), …

n Volunteers: Foldit (www.fold.it), Duolingo (www.duolingo.com), …

 

Page 6: Crowdsourcing for Multimedia Retrieval

+ Applications in multimedia retrieval

n Create annotated data sets for training

n Reduces both cost and time needed to gather annotations, …but annotations might be noisy!

n Validate the output of multimedia retrieval systems

n Query expansion / reformulation

Page 7: Crowdsourcing for Multimedia Retrieval

+ Creating annotated training sets [Sorokin and Forsyth, 2008]

n Collect annotations for computer vision data sets: people segmentation

[Figure 1 (panels: Protocol 1, Protocol 2, Protocol 3, Protocol 4). Example results obtained from the annotation experiments. The first column is the implementation of the protocol, the second column shows obtained results, the third column shows some poor annotations we observed. The user interfaces are similar, simple and easy to implement. The total cost of annotating the images shown in this figure was US $0.66.]

…further assume that the polygon with more vertices is a better annotation and we put it first in the pair. The distribution of scores and a detailed analysis appears in Figures 4 and 5. We show all scores ordered from the best (lowest) on the left to the worst (highest) on the right. We select 5:15:95 percentiles of quality (5 through 95 with step 15) and show the respective annotations.

Looking at the images we see that the workers mostly try to accomplish the task. Some of the errors come from sloppy annotations (especially in the heavily underpaid experiment 3, polygonal labeling). Most of the disagreements come from difficult cases, when the question we ask is dif…


Page 8: Crowdsourcing for Multimedia Retrieval

+ Creating annotated training sets [Sorokin and Forsyth, 2008]

n Collect annotations for computer vision data sets: people segmentation and pose annotation


Page 9: Crowdsourcing for Multimedia Retrieval

+ Creating annotated training sets [Sorokin and Forsyth, 2008]

n Observations:
n Annotators make errors
n Quality of annotators is heterogeneous
n The quality of the annotations depends on the difficulty of the task

[Figure 5 plots. Experiment 3: trace the boundary of the person; score = area(XOR)/area(AND), the lower the better; mean 0.21, std 0.14, median 0.16. Experiment 4: click on 14 landmarks; mean error in pixels between annotation points, the lower the better; mean 8.71, std 6.29, median 7.35.]

Figure 5. Quality details. We present detailed analysis of annotation quality for experiments 3 and 4. For every image the best fitting pair of annotations is selected. The score of the best pair is shown in the figure. For experiment 3 we score annotations by the area of their symmetric difference (XOR) divided by the area of their union (OR). For experiment 4 we compute the average distance between the marked points. The scores are ordered low (best) to high (worst). For clarity we render annotations at 5:15:95 percentiles of the score. Blue curve and dots show annotation 1, yellow curve and dots show annotation 2 of the pair. For experiment 3 we additionally assume that the polygon with more vertices is a better annotation, so annotation 1 (blue) always has more vertices.
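Both quality scores used in the figure are straightforward to compute. As a minimal illustration (not the authors' code), assuming the two annotations of a pair are available as binary numpy masks (experiment 3) or as arrays of 14 (x, y) landmark coordinates (experiment 4):

    import numpy as np

    def segmentation_disagreement(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
        """Experiment 3 score: area of the symmetric difference (XOR) of the two
        segmentations divided by the area of their union; lower is better."""
        sym_diff = np.logical_xor(mask_a, mask_b).sum()
        union = np.logical_or(mask_a, mask_b).sum()
        return float(sym_diff / union) if union else 0.0

    def landmark_error(points_a: np.ndarray, points_b: np.ndarray) -> float:
        """Experiment 4 score: mean Euclidean distance in pixels between
        corresponding landmarks; each array has shape (14, 2); lower is better."""
        return float(np.linalg.norm(points_a - points_b, axis=1).mean())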


[Figure 6 plots: per-landmark error (3 to 13 pixels, vertical axis) over images 100 to 200 (horizontal axis), for the landmark groups rAnkle/rKnee/lKnee/lAnkle, rWrist/rElbow/lElbow/lWrist, rHip/lHip/rShoulder/lShoulder, and Neck/Head.]

Figure 6. Quality details per landmark. We present analysis of annotation quality per landmark in experiment 4. We show scores of the best pair for all annotations between the 35th and 65th percentiles, between points "C" and "E" of experiment 4 in Fig. 5. All the plots have the same scale: from image 100 to 200 on the horizontal axis and from 3 pixels to 13 pixels of error on the vertical axis. These graphs show annotators have greater difficulty choosing a consistent location for the hip than for any other landmark; this may be because some place the hip at the point a tailor would use and others mark the waist, or because the location of the hip is difficult to decide under clothing.


Page 10: Crowdsourcing for Multimedia Retrieval

+ Creating annotated training sets [Soleymani and Larson, 2010]

n MediaEval 2010 Affect Task

n Use of Amazon Mechanical Turk to annotate the Affect Task Corpus

n 126 videos (2-5 mins in length)

n Annotate:
n Mood (e.g., pleased, helpless, energetic, etc.)
n Emotion (e.g., sadness, joy, anger, etc.)
n Boredom (nine-point rating scale)
n Like (nine-point rating scale)

Page 11: Crowdsourcing for Multimedia Retrieval

+ Creating annotated training sets [Nowak and Rüger, 2010]

n Crowdsourcing image concepts. 53 concepts, e.g.,
n Abstract categories: partylife, beach holidays, snow, etc.
n Time of the day: day, night, no visual cue

n …

n Subset of 99 images from the ImageCLEF2009 dataset

Place contains three mutually exclusive concepts, namely Indoor, Outdoor and No Visual Place. In contrast, several optional concepts belong to the category Landscape Elements. The task of the annotators was to choose exactly one concept for categories with mutually exclusive concepts and to select all applicable concepts for optional concepts. All photos were annotated at an image-based level. The annotator tagged the whole image with all applicable concepts and then continued with the next image.

Figure 1: Annotation tool that was used for the acquisition of expert annotations.

Fig. 1 shows the annotation tool that was delivered to the annotators. The categories are ordered into the three tabs Holistic Scenes, Representation and Pictured Objects. All optional concepts are represented as check boxes and the mutually exclusive concepts are modelled as radio button groups. The tool verifies that for each category containing mutually exclusive concepts exactly one was selected before storing the annotations and presenting the next image.

3.3 Collecting Data of Non-expert Annotators

The same set of images that was used for the expert annotators was distributed over the online marketplace Amazon Mechanical Turk (www.mturk.com) and annotated by non-experts in the form of mini-jobs. At MTurk these mini-jobs are called HITs (Human Intelligence Tasks). They represent a small piece of work with an allocated price and completion time. The workers at MTurk, called turkers, can choose the HITs they would like to perform and submit the results to MTurk. The requester of the work collects all results from MTurk after they are completed. The workflow of a requester can be described as follows: 1) design a HIT template, 2) distribute the work and fetch results, and 3) approve or reject work from turkers. For the design of the HITs, MTurk offers support by providing a web interface, command line tools and developer APIs. The requester can define how many assignments per HIT are needed, how much time is allotted to each HIT and how much to pay per HIT. MTurk offers several ways of assuring quality. Optionally the turkers can be asked to pass a qualification test before working on HITs, multiple workers can be assigned the same HIT, and requesters can reject work in case the HITs were not finished correctly. The HIT approval rate each turker achieves by completing HITs can be used as a threshold for authorisation to work.
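The requester workflow described here (fix the number of assignments per HIT, the reward, and the allotted time, and optionally gate workers by their HIT approval rate) maps directly onto MTurk's developer API. As a rough, hypothetical sketch using today's boto3 client rather than the 2010-era tooling, with placeholder values and a placeholder annotation URL:

    import boto3

    # Sandbox endpoint so test HITs cost no real money; drop it for production.
    mturk = boto3.client(
        "mturk",
        region_name="us-east-1",
        endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
    )

    # ExternalQuestion: the annotation survey itself is served from the requester's own page.
    question_xml = """
    <ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
      <ExternalURL>https://example.org/annotate?image=42</ExternalURL>
      <FrameHeight>800</FrameHeight>
    </ExternalQuestion>"""

    hit = mturk.create_hit(
        Title="Annotate one photo with 53 visual concepts",
        Description="Select all concepts that are visually depicted in the image.",
        Keywords="image, annotation, tagging",
        Reward="0.05",                          # payment per assignment, in USD (placeholder)
        MaxAssignments=3,                       # how many workers annotate the same image
        AssignmentDurationInSeconds=600,        # time allotted to each assignment
        LifetimeInSeconds=3 * 24 * 3600,        # how long the HIT stays on the marketplace
        Question=question_xml,
        QualificationRequirements=[{            # only workers with a high approval rate
            "QualificationTypeId": "000000000000000000L0",  # Worker_PercentAssignmentsApproved
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
            "ActionsGuarded": "Accept",
        }],
    )
    print(hit["HIT"]["HITId"])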

3.3.1 Design of HIT Template

The design of the HITs at MTurk for the image annotation task is similar to the annotation tool that was provided to the expert annotators (see Sec. 3.2). Each HIT consists of the annotation of one image with all applicable 53 concepts. It is arranged as a question survey and structured into three sections. The section Scene Description and the section Representation each contain four questions, the section Pictured Objects consists of three questions. In front of each section the image to be annotated is presented. The repetition of the image ensures that the turker can see it while answering the questions without scrolling to the top of the document. Fig. 2 illustrates the questions for the section Representation.

Figure 2: Section Representation of the survey.

The turkers see a screen with instructions and the task to fulfil when they start working. As a consequence, the guidelines should be very short and easy to understand. In the annotation experiment the following annotation guidelines were posted to the turkers. These annotation guidelines are far shorter than the guidelines for the expert annotators and do not contain example images.

• Selected concepts should be representative for the content or representation of the whole image.

• Radio button concepts exclude each other. Please annotate with exactly one radio button concept per question.

• Check box concepts represent optional concepts. Please choose all applicable concepts for an image.

• Please make sure that the information is visually depicted in the images (no meta-knowledge)!


Page 12: Crowdsourcing for Multimedia Retrieval

+ Creating annotated training sets [Nowak and Rüger, 2010]

n Study of expert and non-expert labeling

n Inter-annotation agreement among experts: very high

n Influence of the expert ground truth on concept-based retrieval ranking: very limited

n Inter-annotation agreement among non-experts: high, although not as good as among experts

n Influence of averaged annotations (experts vs. non-experts) on concept-based retrieval ranking: averaging filters out noisy non-expert annotations
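The averaging step can be pictured with a small sketch (an illustration, not the authors' code): collect the binary judgments of several non-expert annotators for each (image, concept) pair, average them, and threshold to obtain a single consolidated label.

    from collections import defaultdict

    def average_annotations(judgments, threshold=0.5):
        """judgments: iterable of (image_id, concept, label) tuples with label in {0, 1},
        one tuple per non-expert annotator. Returns {(image_id, concept): 0 or 1}
        obtained by averaging the votes and thresholding."""
        votes = defaultdict(list)
        for image_id, concept, label in judgments:
            votes[(image_id, concept)].append(label)
        return {key: int(sum(v) / len(v) >= threshold) for key, v in votes.items()}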

Page 13: Crowdsourcing for Multimedia Retrieval

+ Creating annotated training sets [Vondrick et al., 2010]

n Crowdsourcing object tracking in video

n Annotators draw bounding boxes

Fig. 2: Our video labeling user interface. All previously labeled entities are shown and the box the user is currently working with is bright orange.

Displaying other workers' labels unintentionally fostered a sense of community engagement that some of the workers expressed in unsolicited comments.

"Maybe it's more bizarre that I keep doing these hits for a penny. I must not be the only one who finds them oddly compelling, more and more boxes show up on each hit." (Anonymous subject)

Mechanical Turk does not necessarily ensure quality work is produced. In fact, as a result of the low price of most HITs, many workers attempt to satisfy the HIT with the least amount of effort possible. Therefore it is very important that HITs are structured to produce desired results in a somewhat adversarial environment. One of the key criteria for the design of the UI is to make sure that producing quality work is no harder than doing the minimal amount of work to convince the UI that the HIT is completed. A second important criterion is to build into the evaluation process of a HIT an analysis of the validity of the work. A typical approach is to have multiple workers complete the same task until a statistical test demonstrates consensus on a single answer. A final important criterion is to design the interface so that it is difficult to successfully write an automated bot to get through the UI.

By requiring the user to annotate every key frame or explicitly say there is nothing left to annotate, we reduce the ease with which a worker can just "click-…

Page 14: Crowdsourcing for Multimedia Retrieval

+ Creating annotated training sets [Vondrick et al., 2010]

n Annotators label the enclosing bounding box of an entity every T frames

n Bounding boxes at intermediate time instants are interpolated (see the sketch below)

n Interesting trade-off between the cost of MTurk workers and the cost of interpolation on the Amazon EC2 cloud
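The paper studies interpolation algorithms of varying CPU cost for filling in the frames between worker-labeled key frames; the sketch below is only the simplest conceivable variant (plain linear interpolation of box coordinates), written for illustration rather than taken from the authors' system.

    from dataclasses import dataclass

    @dataclass
    class Box:
        x: float   # top-left corner
        y: float
        w: float   # width and height
        h: float

    def interpolate_boxes(box_a: Box, box_b: Box, t_a: int, t_b: int):
        """Linearly interpolate annotations for the frames strictly between
        two key frames t_a < t_b that were labeled by the workers."""
        boxes = []
        for t in range(t_a + 1, t_b):
            alpha = (t - t_a) / (t_b - t_a)
            boxes.append(Box(
                x=(1 - alpha) * box_a.x + alpha * box_b.x,
                y=(1 - alpha) * box_a.y + alpha * box_b.y,
                w=(1 - alpha) * box_a.w + alpha * box_b.w,
                h=(1 - alpha) * box_a.h + alpha * box_b.h,
            ))
        return boxes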


[Fig. 7 (panels: (a) Field drills, (b) Basketball players, (c) Ball): Cost trade-off between human effort and CPU cycles. As the total cost increases, performance will improve. Cost axes are in dollars.]

4.3 Performance Cost Trade-off

We now consider our motivating question: how should one divide human effort versus CPU effort so as to maximize track accuracy given X$? A fixed dollar amount can be spent only on human annotations, purely on CPU, or some combination. We express this combination as a diagonal line in the ground plane of the 3D plot in Fig. 7. We plot the tracking accuracy as a function of this combination for different X$ amounts in Fig. 8. We describe the trade-off further in the caption of Fig. 8.

5 Conclusion

Our motivation thus far has been the use of crowdsource marketplaces as a cost-effective labeling tool. We argue that they also provide an interesting platform for research on interactive vision. It is clear that state-of-the-art techniques are …

Page 15: Crowdsourcing for Multimedia Retrieval

+ Creating annotated training sets [Urbano et al., 2010]

n Goal: evaluation of music information retrieval systems

n Use crowdsourcing as an alternative to experts to create ground-truths of partially ordered lists

n Good agreement (92% complete + partial) with experts

…answer preference judgments between F and each of the other documents. In this case, every document was judged as more similar, except for G, which was judged equally similar (or dissimilar). Therefore, a new segment appears to the left of F with all the candidates judged more relevant, and G is set up in the same group as F. For the second iteration, in the rightmost segment no judgment is needed because F and G were already compared, and B would be the pivot for the leftmost segment. Incipits A and C are judged similar to B, but D and E are judged as less similar, so they are set up in a segment to the right of B. At the end, there are 3 ordered groups of relevance formed with preference judgments. Note that not all the 21 judgments were needed to arrange and aggregate every incipit (e.g. G is only compared with F).

Table 1. Example of self-organized partially ordered list. Pivots for each segment appear in bold face. Documents that have been pivots already appear underlined.

Iteration   Segments                       Preference judgments
1           C, D, E, A, G, B, F            C<F, D<F, E<F, A<F, G=F, B<F
2           C, D, E, A, B, F, G            C=B, D>B, E>B, A=B
3           B, C, A, D, E, F, G            C=A, D=E
4           (A, B, C), (E, D), (F, G)      -

With preference judgments, the sample of rankings given to each candidate is less variable than with the original method. Whenever a candidate is preferred over another one, it would be given a rank of 1 and -1 otherwise. In case it was judged equally similar, a rank of 0 would be added to its sample. With the original methodology, on the other hand, the ranks given to an incipit could range from 1 to well beyond 20, which increases the variance of the samples. Note that, with our scheme, the two samples of rankings given to each pair of documents are the opposite and therefore have the same variance. Signed Mann-Whitney U tests can be used again to decide whether two rank samples are different or not. Because the samples are less variable, the effect size is larger, which increases the statistical power of the test and makes it more likely for it to find a true difference where there is one. As a consequence, fewer assessors are needed overall.
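The pivot-based construction described above behaves like a quicksort whose comparisons are crowdsourced preference judgments. A toy sketch (not the authors' implementation), where prefer(a, b) stands for the aggregated crowd judgment and returns 1 if a is judged more similar to the query than b, -1 if less similar, and 0 if they are judged equally similar:

    def partial_order(docs, prefer):
        """Arrange candidates into ordered groups of relevance (most similar first)
        using pairwise preference judgments against a pivot, quicksort-style."""
        if not docs:
            return []
        pivot, rest = docs[-1], docs[:-1]
        better = [d for d in rest if prefer(d, pivot) > 0]   # more similar than the pivot
        equal = [d for d in rest if prefer(d, pivot) == 0]   # grouped with the pivot
        worse = [d for d in rest if prefer(d, pivot) < 0]    # less similar than the pivot
        return partial_order(better, prefer) + [[pivot] + equal] + partial_order(worse, prefer)

On the example of Table 1, starting from C, D, E, A, G, B, F with F as the first pivot, this yields the three groups (B, C, A), (E, D) and (F, G), without G ever being compared to anything but F.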

4. CROWDSOURCING PREFERENCES

The use of a crowdsourcing platform seems very appropriate for our purposes. If the reasonable person assumption holds, we could use non-experts to generate a ground truth like these. Because we no longer show the image of the staves, but offer an audio file instead, no music expertise is needed. We have also seen how to use preference judgments to generate partially ordered lists instead of having assessors rank all candidates at once. Therefore, the whole process can be divided into very small and simple tasks where one incipit has to be preferred over the other, which seems perfectly doable for any non-expert. Also, the number of judgments between pairs of documents can be smaller, and given that we use non-experts, the overall cost should be much less. We are not aware of any work examining the feasibility of music-related tasks with crowdsourcing platforms like Amazon Mechanical Turk (AMT), so we decided to use it for our experiments. AMT has been widely used before for tasks related to Text IR evaluation. HITs (each of the single tasks assigned to a worker) have traditionally used the English language, but it has been shown recently that workers can also work in other languages such as Spanish [18]. Other multimedia tasks, such as image tagging, have also been proved to be feasible with crowdsourcing [19].

4.1 HIT Design

The use of preference judgments is prone to have a very simple HIT design (see Figure 4). We asked workers to listen to the … two incipits to compare. Next, they were asked what variation was more similar to the original melody, allowing 3 options: A is more similar, B is more similar, and they are either equally similar or dissimilar. We indicated to them that if one melody was part of another one, they had to be considered equally similar, so as to comply with the original guidelines. As optional questions, they were asked for their musical background, if any, and for comments or suggestions to give us some feedback.

Figure 4. Example of HIT for music preference judgment.

The evaluation collection used in MIREX 2005 (Eval05 for short) had about 550 short incipits in MIDI format, which we transformed to MP3 files as they are easier to play in a standard web browser. The average duration was 6 seconds, ranging from 1 to 57 seconds. However, many incipits start with rests (see query and incipit C in Figure 2), which would make workers lose a lot of time. Therefore, we trimmed the leading and tailing silence, which resulted in durations from 1 to 26 seconds, with an average of 4 seconds. With these cuts, the average time needed to listen to the 3 files in a HIT at least once was 13 seconds, ranging from 4 to 24 seconds. This decision agrees with the initial guidelines that were given to the experts, as two incipits should be considered equally relevant despite one of them having leading or tailing rests (i.e. one would be just part of the other). We uploaded all these trimmed MP3 files to a private web server, as well as the source of a very simple Flash player to play the queries and candidate incipits. Therefore, our HIT template was designed to display the MP3 players and stream the audio files from our server. We created a batch of HITs for each of the iterations calculated with our methodology, and paid every answer with 2 cents of … After downloading the results and analyzing them, we calculated the next preference judgments to perform and uploaded a new batch to AMT, …


Page 16: Crowdsourcing for Multimedia Retrieval

+ Validate the output of MIR systems [Snoek et al., 2010] [Freiburg et al., 2011]

n Search engine for archival rock 'n' roll concert video

n Use of crowdsourcing to improve, extend and share automatically detected concepts in video fragments

[Figure 1 (concepts: Guitar player, Hands, Pinkpop logo, Singer, Pinkpop hat, Drummer, Over the shoulder, Close-up, Audience, Stage, Keyboard): Eleven common concert concepts we detect automatically, and for which we collect user feedback.]

[Figure 2: Timeline-based video player where colored dots correspond to automated visual detection results. Users can navigate directly to fragments of interest by interaction with the colored dots, which pop up a feedback overlay as displayed in Figure 3.]

…since 1970 at Landgraaf, the Netherlands. All music videos have been recorded during the 40 years life cycle of the festival. We cleared copyright for several Dutch and Belgian artists playing at Pinkpop, including gigs from K's Choice, Junkie XL, and Moke. The amount of footage for each festival year varies from only a summary to almost unabridged concert recordings, even including raw, unpublished footage. The complete video archive contains 94 concerts covering 32 hours in total.

We create detectors for 11 concert concepts following a state-of-the-art implementation [10]. We select the concepts based on frequency, visual detection feasibility, previous mentioning in literature and expected utility for concert video users (summarized in Figure 1). We consider a video fragment a more user-friendly retrieval unit compared to more technically defined shots or keyframes. We create fragment-level detection scores from frame-level scores by aggregating the concept scores of all the frames in the processed videos. The fragment algorithm was designed to find the longest fragments with the highest average scores for a specific concert concept [10]. Users may provide feedback on these automatically detected fragments using our feedback mechanism.

2.2 Feedback Mechanism

The main mode of user interaction with our video search engine is by means of the In-Video Browser, see Figure 2. The timeline-based browser enables users to watch and navigate through a single video concert. Little colored dots on the timeline mark the location of an interesting fragment corresponding to an automatically derived label. To inspect the label, users simply move their mouse cursor over the colored dot. By clicking on the dot, the player instantly starts the specific fragment in the video. If needed, the user can manually select more concept labels in the panel on the left of the video player. To maintain overview, the In-Video browser automatically launches with a maximum of twelve fragments on the timeline interface every time a user starts a concert. These twelve correspond to the most reliable fragment labels. Once the timeline becomes too crowded as a result of multiple selected labels, the user may decide to zoom in on the timeline to retrieve fragments for a specific, smaller part of the video.

[Figure 3: Harvesting user feedback for video fragments (top to bottom). The thumbs-up button indicates agreement with the automatically detected label, thumbs-down disagreement. Three key frames represent the visual summary of the fragment. Users may correct wrong labels, adapt fragment boundaries, or suggest additional labels (in Dutch).]

An important aspect of the In-Video browser is that the user viewing experience is interrupted as little as possible; the video continues to play while the user interacts with the browser. In the graphical overlay that appears while the fragment is playing, the label is shown together with the …


[Figure 4 (x-axis: User-Feedback Agreement, >50% to >90%; y-axis: Video Fragments, 0 to 180; series: Excluded correct fragment labels, Crowdsourcing errors): Results for Experiment 2: Quality vs Quantity. Simply relying on a majority vote of the crowd results in most correct fragments, albeit with 23 errors. We observe a best tradeoff between quality and quantity of crowdsourcing visual detectors for a user agreement of 67%.]

4.2 Experiment 2: Quality vs Quantity

The question that we tried to answer with this experiment is whether the resulting labels are of sufficient quality compared to expert labels, when aggregated over multiple users. We have in total 510 fragments, where we now assume the expert label to be correct, and investigate for how many of them we would have obtained the same label when imposing a minimum agreement threshold on the crowdsourced labels. We plot the percentage of agreement among user-provided labels versus the number of video fragments in Figure 4. The ground truth shows that the quality of the suggested labels is high. As much as 85% of the automatically suggested labels correspond with the ground truth. If the simple evaluation principle of the majority is used, only 23 fragments have received tags that do not match with the ground truth, which in our case corresponds to a loss of 37 training samples. When we further increase the threshold for a positive or negative agreement, the number of fragments receiving the wrong label is gradually reduced to 8 fragments only, but the number of excluded training samples increases rapidly. For a conservative user agreement of 80%, for example, 119 fragments are ignored. We observe that a threshold of 67% provides a well-chosen balance between the 8 errors and the 422 fragments that can be used as a correction mechanism, or as reliable training examples for a new round of detector learning.
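As a rough illustration of this thresholding step (not the authors' code), the sketch below keeps only the fragments whose thumbs-up/thumbs-down feedback reaches a minimum agreement, either confirming or rejecting the automatically suggested label.

    def filter_by_agreement(feedback, min_agreement=0.67):
        """feedback maps a fragment id to a list of booleans
        (True = thumbs-up on the suggested label, False = thumbs-down).
        Returns {fragment_id: accepted} for fragments whose majority side
        reaches the agreement threshold; the rest are excluded."""
        decisions = {}
        for fragment_id, votes in feedback.items():
            if not votes:
                continue
            positive = sum(votes) / len(votes)
            if positive >= min_agreement:
                decisions[fragment_id] = True       # suggested label confirmed
            elif (1 - positive) >= min_agreement:
                decisions[fragment_id] = False      # suggested label rejected
            # otherwise: agreement too low, fragment excluded
        return decisions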

5. CONCLUSION

The main research question of this paper was: can user tags from crowdsourcing be beneficial to a system that automatically predicts labels for video fragments. We developed a video search engine for a dedicated user community in the domain of concert video allowing for easy fragment-level crowdsourcing. The user-feedback mechanism of the In-Video browser made it possible to harvest positive and negative user judgements on automatically predicted video fragment labels.

For this case study two experiments are conducted. The first experiment showed that users provided enough feedback. Analysis of the collected data proved that users provided the feedback to the video-fragment labels without a preference for incorrect labels. The second experiment showed that 85% of the automatically suggested labels corresponds with the ground truth. We observe that an aggregation threshold of 67% provides a well-chosen balance between errors in the user judgements and the amount of reliable training examples remaining. If the threshold is enforced, the error rate in the training examples is less than 2%. Within the context of our case study, we conclude that crowdsourcing can be beneficial to enhance and improve automated video content analysis. How the new information can be exploited for incremental learning of visual detectors is an interesting question for future research.


7. REFERENCES

[1] L. Ahn. Games with a purpose. IEEE Computer, 39(6):92-94, 2006.
[2] M. Ames and M. Naaman. Why we tag: Motivations for annotation in mobile and online media. In Proc. CHI, 2007.
[3] R. Gligorov, L. B. Baltussen, J. van Ossenbruggen, L. Aroyo, M. Brinkerink, J. Oomen, and A. van Ees. Towards integration of end-user tags with professional annotations. In Proc. Web Science, 2010.
[4] A. Kittur, E. H. Chi, and B. Suh. Crowdsourcing user studies with Mechanical Turk. In Proc. CHI, 2008.
[5] C. Marlow, M. Naaman, D. Boyd, and M. Davis. HT06, tagging paper, taxonomy, Flickr, academic article, to read. In Proc. Hypertext, 2006.
[6] P. Marsden. Crowdsourcing. Contagious Magazine, 18:24-28, 2009.
[7] J. Nielsen. Participation inequality: Encouraging more users to contribute, 2006. http://www.useit.com/alertbox/participation_inequality.html.
[8] D. A. Shamma, R. Shaw, P. L. Shafton, and Y. Liu. Watch what I watch: using community activity to understand content. In Proc. MIR, 2007.
[9] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In Proc. MIR, 2006.
[10] C. G. M. Snoek, B. Freiburg, J. Oomen, and R. Ordelman. Crowdsourcing rock n' roll multimedia retrieval. In Proc. ACM Multimedia, 2010.
[11] C. G. M. Snoek and A. W. M. Smeulders. Visual-concept search solved? IEEE Computer, 43(6):76-78, 2010.
[12] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proc. EMNLP, 2008.
[13] J. Surowiecki. The wisdom of crowds: why the many are smarter than the few. Random House, 2005.
[14] R. van Zwol, L. Garcia, G. Ramirez, B. Sigurbjornsson, and M. Labad. Video tag game. In Proc. WWW, 2008.



Page 17: Crowdsourcing for Multimedia Retrieval

+ Validate the output of MIR systems [Steiner et al., 2011]

n Propose a browser extension to navigate detected events in videos
n Visual events (shot changes)
n Occurrence events (analysis of metadata by means of NLP to detect named entities)

n Interest-based events (click counters on detected visual events)


through a combination of textual, visual, and behavioral analysis techniques. When a user starts watching a video, three event detection processes start:

Visual Event Detection Process. We detect shots in the video by visually analyzing its content [19]. We do this with the help of a browser extension, i.e., the whole process runs on the client-side using the modern HTML5 [12] JavaScript APIs of the <video> and <canvas> elements. As soon as the shots have been detected, we offer the user the choice to quickly jump into a specific shot by clicking on a representative still frame.

Occurrence Event Detection Process. We analyze the available video metadata using NLP techniques, as outlined in [18]. The detected named entities are presented to the user in a list, and upon click via a timeline-like user interface allow for jumping into one of the shots where the named entity occurs.

Interest-based Event Detection Process. As soon as the visual events have been detected, we attach JavaScript event listeners to each of the shots and count clicks on shots as an expression of interest in those shots.
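The shot detection itself runs client-side in JavaScript on <canvas> frame data, with the specific algorithm given in the authors' reference [19]. A common baseline for this kind of visual event detection, sketched here in Python purely for illustration, is to threshold the difference between colour histograms of consecutive frames:

    import numpy as np

    def detect_shot_boundaries(frames, threshold=0.4, bins=32):
        """Flag frame indices where the colour-histogram difference between
        consecutive frames exceeds a threshold (a simple shot-cut cue)."""
        boundaries = []
        prev_hist = None
        for i, frame in enumerate(frames):          # frame: HxWx3 uint8 array
            hist, _ = np.histogramdd(frame.reshape(-1, 3),
                                     bins=(bins, bins, bins),
                                     range=((0, 256),) * 3)
            hist = hist.ravel() / hist.sum()
            if prev_hist is not None:
                # L1 distance between normalised histograms, in [0, 2]
                if np.abs(hist - prev_hist).sum() > threshold:
                    boundaries.append(i)
            prev_hist = hist
        return boundaries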

Fig. 2: Screenshot of the YouTube browser extension, showing the three different event types: visual events (video shots below the video), occurrence events (contained named entities and their depiction at the right of the video), and interest-based events (points of interest in the video highlighted with a red background in the bottom left).


Page 18: Crowdsourcing for Multimedia Retrieval

+ Validate the output of MIR systems [Goeau et al., 2011]

n Visual plant species identification
n Based on local visual features
n Crowdsourced validation

Figure 1: GUI of the web application.

3. WEB APPLICATION & TAG POOLING

Figure 1 presents the Graphical User Interface of the web application. On the left, the user chooses to load a scan or a photograph, and then the system returns and displays on the right the top-3 species with the most similar pictures. On the bottom left part, the user can then either select and validate the top-1 suggested species, or he can choose another species in the list, or even enter a new species name if it is not available. The uploaded image used as query is temporarily stored with its associated species name. Then other users might interact with these new pictures later. So far, this last step is done offline, after some professional botanists involved in the project validate the images and their species names. But the aggregation of these uploaded images to the visual knowledge will be integrated automatically in further versions. The species names and pictures are clickable and bring the user to online taxon descriptions from the Tela Botanica web site. In this way, beyond the visual content-based recognition process, the species identification is considered as one way to access richer botanical information like species distribution, complementary pictures, textual descriptions, etc.

4. COLLABORATIVE DATA COLLECTED

The current data was built by several cycles of collaborative data collection and taxonomical validation. Scans of leaves were collected over two seasons, between June and September, in 2009 and 2010, thanks to the work of active contributors from Tela Botanica social networks. The idea of collecting only scans during this first period was to initialize the training data with limited noisy background, so that the online identification tool works sufficiently well to attract new users. Notice that this did not prevent users from submitting unconstrained pictures, since our matching-based approach is relatively robust to such asymmetry between training and query images. The first online application did contain 457 validated scans over 27 species and the link was mostly disseminated through Tela Botanica. It finally allowed to collect 2228 scans over 55 species. A public version of the application (http://combraille.cirad.fr:8080/demo_plantscan/) was opened in October 2010 (http://www.tela-botanica.org/actu/article3856.html). At the time of writing, 858 images were uploaded and tested by about 25 new users. These images are either scans or photographs with uniform background, or free photographs with natural background, and involve 15 new species from the previous set of 55 species. Note that the collected data will be used within the ImageCLEF2011 plant retrieval task (http://www.imageclef.org/2011/plants).

5. EVALUATION

Performances, basically in terms of species identification rates, will be actually shown during the demo, with an offline version connected to a digital camera. It will consist of an enjoyable demo where anyone can play to shoot fresh cut leaves. Users would notice short response times for identification (around 2 seconds), and observe the relevance of the species suggested in spite of the intra-species visual variability, or cases with occlusions or with non-uniform backgrounds. As a rough guide, a leave-one-out cross-validation (i.e. each scan used one by one as external query) gives an average precision around 0.7 over the 20 first most similar images, and gives basically the correct species at the first rank 9 times out of 10 with the kNN basic rule of decision.
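The "kNN basic rule of decision" is not spelled out further in the excerpt; read as a simple majority vote over the species labels of the retrieved neighbours, it could look like this (an assumption for illustration, not the authors' code):

    from collections import Counter

    def knn_species(neighbor_species, k=20):
        """Predict the species that occurs most often among the species labels
        of the k most similar reference images returned by the visual search."""
        return Counter(neighbor_species[:k]).most_common(1)[0][0]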

6. CONCLUSIONS

This demo represents a first step to a large scale crowdsourcing application promoting collaborative enrichment of botanical visual knowledge and its exploitation for helping users to identify biodiversity. The next version will consider a fully autonomous and dynamical application integrating collaborative taxonomical validation. If the application focuses here on an educational subject, the performances obtained and the emulation created during this project are encouraging for addressing other floras and more narrow studies.



Page 19: Crowdsourcing for Multimedia Retrieval

+  Validate  the  output  of  MIR  systems  [Yan  et  al.,  2010]  

n CrowdSearch combines
n Automated image search
n Local processing on mobile phones + backend processing
n Real-time human validation of search results
n Amazon Mechanical Turk

n Studies the trade-off in terms of
n Delay
n Accuracy
n Cost

n More  on  this  later…  

…human error and bias to maximize accuracy. To balance these tradeoffs, CrowdSearch uses an adaptive algorithm that uses delay and result prediction models of human responses to judiciously use human validation. Once a candidate image is validated, it is returned to the user as a valid search result.

3. CROWDSOURCING FOR SEARCH

In this section, we first provide a background of the Amazon Mechanical Turk (AMT). We then discuss several design choices that we make while using crowdsourcing for image validation including: 1) how to construct tasks such that they are likely to be answered quickly, 2) how to minimize human error and bias, and 3) how to price a validation task to minimize delay.

Background: We now provide a short primer on the AMT, the crowdsourcing system that we use in this work. AMT is a large-scale crowdsourcing system that has tens of thousands of validators at any time. The key benefit of AMT is that it provides public APIs for automatic posting of tasks and retrieval of results. The AMT APIs enable us to post tasks and specify two parameters: (a) the number of duplicates, i.e. the number of independent validators who we want to work on the particular task, and (b) the reward that a validator obtains for providing responses. A validator works in two phases: (a) they first accept a task once they identify that they would like to work on it, which in turn decrements the number of available duplicates, and (b) once accepted, they need to provide a response within a period specified by the task.

One constraint of the AMT that pertains to CrowdSearch is that the number of duplicates and reward for a task that has been posted cannot be changed at a later point. We take this practical limitation into account in designing our system.

Constructing Validation Tasks: How can we construct validation tasks such that they are answered quickly? Our experience with AMT revealed several insights. First, we observed that asking people to tag query images and candidate images directly is not useful since: 1) text tags from crowdsourcing systems are often ambiguous and meaningless (similar conclusions have been reported by other crowdsourcing studies [8]), and 2) tasks involving tagging are unpopular, hence they incur large delay. Second, we found that having a large validation task that presents a number of <query image, candidate image> pairs enlarges human error and bias since a single individual can bias a large fraction of the validation results.

We settled on a simple format for validation tasks. Each <query image, candidate image> pair is packaged into a task, and a validator is required to provide a simple YES or NO answer: YES if the two images are correctly matched, and NO otherwise. We find that these tasks are often the most popular among validators on AMT.

Minimizing Human Bias and Error: Human error and bias is inevitable in validation results, therefore a central challenge is eliminating human error to achieve high accuracy. We use a simple strategy to deal with this problem: we request several duplicate responses for a validation task from multiple validators, and aggregate the responses using a majority rule. Since AMT does not allow us to dynamically change the number of duplicates for a task, we fix this number for all tasks. In §7.2, we evaluate several aggregation approaches, and show that a majority of five duplicates is the best strategy and consistently achieves more than 95% search accuracy.

[Figure 2 (Query Image, Candidate Images, Duplicate Validation Tasks): Shown are an image search query, candidate images, and duplicate validation results. Each validation task is a Yes/No question about whether the query image and candidate image contain the same object.]
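A minimal sketch of this aggregation step (an illustration of the majority rule, not the CrowdSearch code):

    from collections import Counter

    def accept_candidate(responses):
        """Aggregate the duplicate YES/NO validations collected for one
        <query image, candidate image> task; CrowdSearch uses five duplicates."""
        votes = Counter(responses)        # e.g. ["YES", "YES", "NO", "YES", "YES"]
        return votes["YES"] > votes["NO"]

    def validated_results(candidates, responses_per_candidate):
        """Keep only the candidate images accepted by the crowd, preserving rank order."""
        return [c for c in candidates if accept_candidate(responses_per_candidate[c])]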

Pricing Validation Tasks: Crowdsourcing systems allow us to set a monetary reward for each task. Intuitively, a higher price provides more incentive for human validators, and therefore can lead to lower delay. This raises the following question: is it better to spend X cents on a single validation task or to spread it across X validation tasks of price one cent each? We find that it is typically better to have more tasks at a low price than fewer tasks at a high price. There are three reasons for this behavior: 1) since a large fraction of tasks on the AMT offer a reward of only one cent, the expectation of users is that most tasks are quick and low-cost, 2) crowdsourcing systems like the AMT have tens of thousands of human validators, hence posting more tasks reduces the impact of a slow human validator on overall delay, and 3) more responses allows better aggregation to avoid human error and bias. Our experiments with AMT show that the first response in five one cent tasks is 50 - 60% faster than a single five cent task, confirming the intuition that delay is lower when more low-priced tasks are posted.

4. CROWDSEARCH ALGORITHM

Given a query image and a ranked list of candidate images, the goal of human validation is to identify the correct candidate images from the ranked list. Human validation improves search accuracy, but incurs monetary cost and human processing delay. We first discuss these tradeoffs and then describe how CrowdSearch optimizes overall cost while returning at least one valid candidate image within a user-specified deadline.

4.1 Delay-Cost Tradeoffs

Before presenting the CrowdSearch algorithm, we illustrate the tradeoff between delay and cost by discussing posting schemes that optimize one or the other but not both.

Parallel posting to optimize delay: A scheme that optimizes delay would post all candidate images to the crowdsourcing system at the same time. (We refer to this as parallel posting.) While parallel posting reduces delay, it is expensive in terms of monetary cost. Figure 2 shows an instance where the image search engine returns four candi…

Page 20: Crowdsourcing for Multimedia Retrieval

+ Query expansion / reformulation [Harris, 2012]

n Search YouTube user generated content

n Natural language queries are restated and given as input to
n YouTube search interface
n Students
n Crowd in MTurk


The objective in this paper is to examine if the crowd can provide a more precise set of UGC search results, given a query, compared with other multimedia search tools. The contributions of this paper are as follows. First we compare the retrieval performance of different retrieval models in terms of precision on several categories using UGC video requests taken from leading knowledge market websites. We then compare YouTube's own search interface with a search conducted by students as well as a search approach using crowdsourcing. We evaluate our results using two methods: mean average precision determined after applying pooling, and a simple list preference, where the entire list of videos judged as relevant by each method are compared.

The remainder of the paper is organized as follows. In Section 2 we put our work in the context of previous work. In Section 3 we discuss our experimental setup. Section 4 offers a discussion of the results. We conclude and provide insight into future work in Section 5.

!"! #$%&'$()*+#,)G</$% 12"*2% -*% R/;% O>L@% -./2/% .#3% ;//$% 3"4$")"6#$-% 2/3/#26.% "$%:+9-":/,"#% 3/#26.% :/-.*,3@% "$69+,"$4% 3/</2#9% *24#$"J/,%6*:1/-"-"*$3% -.#-% "$<*9</% -2#,"-"*$#9% 3/#26.% 3-2#-/4"/3>% C./%1*1+9#2%CZG?[",%&KW(%;/$6.:#28"$4%6*:1/-"-"*$%%)*6+3/3%*$%-./%,/-/6-"*$% *)% 31/6")"6% )/#-+2/3% 7"-."$% $*$\AB?% :+9-":/,"#%6*99/6-"*$3>% R"8"1/,"#% Z/-2"/<#9@% #% -#38% "$% U:#4/?MGQ% &KX(%"$<*9</3% 9*6#-"$4% 2/9/<#$-% ":#4/3% )2*:% -./% R"8"1/,"#% ":#4/%6*99/6-"*$% ;#3/,% *$% #% 12*<",/,% -/H-% 0+/2=% #$,% 3/</2#9% 3#:19/%":#4/3>% % R."9/% R"8"1/,"#% Z/-2"/<#9% /H#:"$/3% $*"3=% #$,%+$3-2+6-+2/,% -/H-+#9% #$$*-#-"*$3% "$% R"8"1/,"#% :+9-":/,"#@% -./%3/:"\3-2+6-+2/,%6*$-/$-%/<#9+#-/,%"$%U:#4/?MGQ%"3%)#2%9/33%$*"3=%#$,%:*2/%3-2+6-+2/,%-.#$%6*$-/$-%3/#26./3%*$%S*+C+;/>%

V/</2#9% 3-+,"/3% .#</% /H#:"$/,% 3/#26.% 0+#9"-=% *$% +3/2\3+119"/,%-#43%"$%*-./2%R/;%O>L%#119"6#-"*$3>%%]"</23"-=%*)%":#4/%-#4%3/#26.%2/3+9-3% "$% Q9"682% +3"$4% #$% ":19"6"-% 2/9/<#$6/% )//,;#68% :*,/9% "3%/H19*2/,%;=%<*$%^7*9%!"#$%&#%&KY(@%6*$69+,"$4%-.#-%,"</23"-=%"3%#$%":1*2-#$-% 6*:1*$/$-%7./$% 2/-2"/<#9% "3% ;#3/,%*$% 3:#99% ,#-#% 3/-3@%3+6.% #3% -.*3/% )*+$,% "$% ":#4/% -#43>% % _*-.*% !"#$ %&#% /H19*2/%)*983*$*:=% -#44"$4@% 7."6.% "3% ;*+$,% ;=% -./% 3#:/% $*"3=%+$3-2+6-+2/,%2/3-2"6-"*$3%#3%S*+C+;/%-#43%&K`(@%;+-%-./"2%3-+,=%7#3%12":#2"9=% )*6+3/,% *$% 2/6*::/$,/2% 3=3-/:3% +3#4/% *)% -./3/% -#43>%a-./23% .#</% /H#:"$/,% :+9-":/,"#% 3/#26.% /))/6-"</$/33% *$%8$*79/,4/%:#28/-%7/;3"-/3@%3+6.%#3%?.+#%!"#$%&#%"$%&Kb(%#$,%M"%!"#$%&#%"$%&Kc(D%.*7/</2@%-./"2%)*6+3%"3%-*%9*6#-/%#99%6*$-/$-%#,,2/33"$4%#% 31/6")"6% 0+/3-"*$% d/>4>% e.*7% -*f% #$,% e7.=f% 0+/3-"*$% -=1/3g%7./2/#3% -./% )*6+3%*)%*+2% 3-+,=% "3%*$% )"$,"$4%#$,% 2#$8"$4%<",/*3%-.#-%)+9)"99%#%31/6")"6%3/#26.%2/0+/3-%d/>4>@%e./91%)"$,%#%<",/*fg>%%

h%)/7%3-+,"/3%.#</%/H#:"$/,%-./%/))/6-"</$/33%*)%62*7,3%*$%$*"3=%,#-#% 3/#26./3>% V-/"$/2% !"#$ %&#% ,/:*$3-2#-/,% 3/#26./3% *)% /</$-%,/-/6-"*$%:/-.*,3%"$%S*+C+;/%<",/*3%#-%-./%)2#4:/$-%9/</9%&K'(>%_3+/.%!"#$%&#%/H#:"$/,%3/#26./3% "$%1*9"-"6#9%;9*43% "$%&OL(%7."6.@%#9-.*+4.% $*"3=@% ,*% $*-% /H1/2"/$6/% -./% 2/3-2"6-"*$3% "$./2/$-% "$%:+9-":/,"#% -#43>% % U$% &OK(@% S#$% !"#$ %&>% 12*<",/,% #$% "$$*<#-"</%#112*#6.% 6#99/,% ?2*7,V/#26.@% 7."6.% 12*<",/,% $/#2\2/#9\-":/%#33/33:/$-% *)% ":#4/3>% h9-.*+4.% -./% #+-.*23T% )*6+3% 7#3% *$%9#;/9"$4% ":#4/3@% -./"2% #112*#6.% 6*+9,% )/#3";9=% ;/% /H-/$,/,% -*%9*6#-"$4%3":"9#2%:/,"#%*$%S*+C+;/>%

%-./012)3")+4214.25)67)892)4.:26)1281.24;<)=16>2??).@46<4.@/)A60'0B2C?)?2;1>9).@8217;>2D)?80:2@8?D);@:)892)>165:")

%

!"! #$%&'(%)!"*! +,,-./0)!"#$%& '()& *++,#$%& )-.,/.'#+$& 0)'(+12& 3)& 4.,4/,.')& '()& 567&

"4+8)"&9+8&).4(&+9&'()&").84(&)99+8'":&&;()")&.8)&%#-)$&#$&;.<,)&=:&&

>(#,)&'()")&"4+8)"&"))0&8)."+$.<,)2&#'&#"&,#?),@&1/)&'+&'3+&#""/)"A&

+/8&4.,4/,.'#+$&+9&%8+/$1&'8/'(&.$12&9+8&0+"'&").84()"2&'()8)&3)8)&

+$,@& .& "0.,,& *)84)$'.%)& +9& B+/;/<)& -#1)+"& 3)8)& 4+$"#1)8)1&

8),)-.$':& & ;()& 48+31"+/84#$%& ").84(& "'8.')%@& .$1& '()& "'/1)$'&

").84(& "'8.')%#)"& *)89+80)1& <)'')8& '(.$& '()& B+/;/<)& ").84(&

#$')89.4)& ."& 0)."/8)1& <@& 5672& .& 8)"/,'& '(.'& #"& "'.'#"'#4.,,@&

"#%$#9#4.$'&C'3+&'.#,)12&*DE:EFG:&

H#$4)& I)"'.')1& J/)8#)"& 3)8)& %8+/*)1& #$'+& '(8))& ")*.8.')&

4.')%+8#)"& C)."@2& 0)1#/02& .$1& 1#99#4/,'G2& 3)& )-.,/.')1& '()0&

")*.8.'),@& 9+8& ).4(& ").84(& "'8.')%@:& & ;()& 8)"/,'"& .8)& 8)*+8')1& #$&

;.<,)&K:&

;.<,)&K&8.#")"&"+0)&#$')8)"'#$%&*+#$'"&9+8&1#"4/""#+$:&&L#8"'2&567&

"4+8)"&9+8&)."@&M/)8#)"&.8)&0/4(&0+8)&4+$"#"')$'&.48+""&"'8.')%#)"&

4+0*.8)1& 3#'(& '(+")& 9+8& 0)1#/0& +8& 1#99#4/,'& ").84()":& & ;(#"& #"&

,#?),@&.&8)"/,'&+9&.&().-#)8&8),#.$4)&9+8&"'/1)$'"&.$1&'()&48+31&+$&

'()& "'.$1.81& B+/;/<)& ").84(& #$')89.4)& 9+8& '()& )."#)8& M/)8#)"2&

,#0#'#$%&'()&.1-.$'.%)"&+9&(/0.$&4+0*/'.'#+$:&&6"&0+8)&1#99#4/,'&

M/)8#)"& .8)& )$4+/$')8)12& '()& -.,/)& +9& (/0.$& 4+0*/'.'#+$&

<)4+0)"&.&0+8)&#0*+8'.$'&4+$"#1)8.'#+$:&

H)4+$12& .,'(+/%(& '()& 567& "4+8)& %.*& #"& "0.,,& <)'3))$& "'/1)$'&

").84(& .$1& 48+31"+/84#$%2& 3)& 1+& $+'#4)& '(.'& '()& 9#-)& "'/1)$'"&

4+$"#"')$',@& *)89+80)1& ",#%(',@& <)'')8& '(.$& '()& 48+31:& & N.4(&

"'/1)$'& *)89+80)1& .,,& KF& M/)8#)"2& 8)9#$#$%& '()#8& "+/84)"& .$1&

')4($#M/)"& ."& '()@& )$4+/$')8)1& ).4(& $)3& M/)8@& O& .,,& 9#-)&

*.8'#4#*.$'"& *)89+80)1& 9."')8& .$1& *8+-#1)1& <)'')8& ").84(& 8)"/,'"&

'+3.81"& '()&)$1&+9& '()#8&M/)8@&")""#+$& '(.$& #$&'()&<)%#$$#$%&C3)&

4.$$+'& +<")8-)& '(#"& #0*8+-)0)$'&3#'(& '()& 48+31& ."& ).4(& 48+31&

*.8'#4#*.$'&*8+-#1)1& 8)"/,'"& 9+8&+$,@&.& "#$%,)&M/)8@G:& &;()&48+31&

(.1& '()& "0.,,)"'& 1)-#.'#+$& #$& 567& "4+8)"& .48+""& '()& =& ").84(&

4.')%+8#)"2& *8#0.8#,@& <)4./")& '()& ,.8%)8& $/0<)8& +9& *)+*,)&

").84(#$%&8)1/4)"&'()&-.8#.'#+$2&."&1#"4/"")1&#$&PQER&.$1&PQQR:&

;(#812&3)&4.$&"))& '()&-.,/)&+9&/"#$%&(/0.$&#$*/'& #$& '()")&567&

"4+8)"2&</'&;.<,)&K&1+)"&$+'&'.?)&'()&4+"'"&#$&<+'(&'#0)&.$1&0+$)@&

#$'+& 4+$"#1)8.'#+$:& & >)& 0.?)& '()& .""/0*'#+$& '(.'& B+/;/<)S"&

").84(&(."&$+&4+"'& #$& ')80"&+9& '#0)&.$1&0+$)@&.$1&/")& #'&."&+/8&

<."),#$):& &>)&?)*'& '8.4?&+9& '()& ),.*")1& '#0)& '.?)$&<@& '()&48+31&

.$1&9+8&'()&"'/1)$'"&."&3),,2&"+&3)&4.$&)-.,/.')&'(#"&#$&.%%8)%.'):&&&

;(#"&#"&8)*+8')1&#$&;.<,)"&F&.$1&T:&

;+&#,,/"'8.')2&#$&;.<,)"&F&.$1&T2&9+8&I)"'.')1&J/)8#)"&4,.""#9#)1&."&

U1#99#4/,'V2&'+&+<'.#$&.$&#$48).")&#$&567&+9&E:EEQ&/"#$%&"'/1)$'"2&

3)&3+/,1&)W*)4'&'+&"*)$1&E:ET&0#$/')"&.$1&#$4/8&.&4+"'&+9&X:YX=&

4)$'":& & ;+& +<'.#$& .$& )M/#-.,)$'& #$48).")& #$& 567& /"#$%&

48+31"+/84#$%2& 3)& 3+/,1& )W*)4'& '+& "*)$12& +$& .-)8.%)2& E:EK&

0#$/')"&.$1&#$4/8&.&4+"'&+9&Q:QQQ&4)$'":&&Z+')&'(.'&'()")&$/0<)8"&

8)*8)")$'& ,+$%& ')80& .-)8.%)":& & ;(/"2& 3)& +<")8-)& '(.'& /"#$%& '()&

48+312& ."& 4+0*.8)1&3#'(& "'/1)$'"2& 8)M/#8)"&KE[&+9& '()& 4+"'& .$1&

'.?)"& '3+& '(#81"& '()& '#0)2& +$& .-)8.%)2& '+& 8.#")& 567& <@& .$&

)M/#-.,)$'&.0+/$':&&;(/"2&3()$&+<'.#$#$%&0+8)&*8)4#")&8)"/,'"&#"&

+/8&*.8.0+/$'&+<\)4'#-)2&/"#$%&"'/1)$'"&+8&'()&48+31&#"&)W*)4')1&

'+& *8+-#1)& '()& <)"'& 8)"/,'":& & ]9& '#0)& +8& 9#$.$4#.,& 4+"'"& .8)& .,"+& .&

4+$"#1)8.'#+$2&+/8&8)"/,'"&"(+3&'(.'&/"#$%&'()&48+31&3#,,&*8+-#1)&

'()&<)"'&'8.1)+99&<)'3))$&'#0)2&9#$.$4#.,&4+"'2&.$1&*8)4#"#+$:&&&

!"1! %.23-4)'.56)+748474/94)>)&.**,@&^+*),.$1S"&*.#83#")&.%%8)%.'#+$&0)'(+12&1)"48#<)1& #$&

PXT2& XYR2& #"& .& ^+$1+84)'& 0)'(+1& /")1& '+& )-.,/.')& *.#83#")&

*8)9)8)$4)":& & ^+*),.$1S"& *.#83#")& .%%8)%.'#+$&0)'(+1& )W.0#$)"&

'3+& ,#"'"& 9+8&.&%#-)$&M/)8@& #$&.&*.#83#")& 9."(#+$&.$1& 8)4+81"& '()&

."")""+8S"&*8)9)8)$4)&."&.&U-#4'+8@V:&&H).84(&"'8.')%#)"&.8)&+81)8)1&

<@&$/0<)8&+9&-#4'+8#)"&+-)8&).4(&+**+$)$'&'+&1)')80#$)&.$&+-)8.,,&

3#$$)8:&&>)&)W.0#$)&).4(&*.#83#")&*8)9)8)$4)&9+8&'()&'(8))&8)"/,'&

,#"'"& 9+8& .,,& KF& M/)8#)":& & ;()")& 4+0*.8#"+$& 8)"/,'"& .8)& %#-)$& #$&

;.<,)&Y:&

L8+0& ;.<,)& Y2& 3)& +<")8-)& '(.'& "'/1)$'& ").84(& #"& +/8& ^+$1+84)'&

3#$$)82& <).'#$%& .,,& +'()8& ").84(& "'8.')%#)"& #$& *.#83#")&

4+0*.8#"+$":&&6"&3#'(&'()&*++,#$%&."")""0)$'&0)'(+12&'()8)&3."&.&

",#%('&*8)9)8)$4)&+9&"'/1)$'&").84(&8)"/,'"&+-)8&'()&48+31"+/84#$%&

"/**,#)1&-#1)+&,#"'":&&_+3)-)82&3()$&9#$.$4#.,&4+"'"&.8)&1#"4,+")1&

'+& '()& ."")""+8"& .,+$%& 3#'(& '()& "4+8)"2& 48+31"+/84#$%& #"& +/8&

^+$1+84)'&3#$$)82&."&+<")8-)1&#$&;.<,)&`:&

(:;-4)<")=>47:--)?@+)59,745)8,7)4:9A)54:79A)567:640B")

%4:79A)%67:640B) ?@+)

H'/1)$'&H).84(& E:FXK&

^8+31"+/84#$%& E:FQX&

B+/;/<)&H).84(& E:=Fa&

&

(:;-4)!C)?@+)59,745)8,7)4:9A)54:79A)567:640BD);7,E4/)F,G/);B)54:79A)9:640,7B")

%4:79A)%67:640B) $:5B) ?4F.H2) I.88.9H-6)

H'/1)$'&H).84(& E:T=T& E:FQT& E:KXQ&

^8+31"+/84#$%& E:TQa& E:FQX& E:KEK&

B+/;/<)&H).84(& E:FE`& E:=KK& E:XXK&

&

(:;-4)JC)+:.7G.54)9,23:7.5,/5),8)-.56)3748474/94)H5./0)K,34-:/FL5)3:.7G.54):00740:6.,/)246A,FD):5):554554F);B)6A4)

97,GF")

K,23:7.5,/) #45H-6) M.//47)

H'/1)$'&H).84(&-":&

^8+31"+/84#$%&XK&-":&XQ& H'/1)$'&H).84(&

H'/1)$'&H).84(&-":&

B+/;/<)&H).84(&=a&-":&T& H'/1)$'&H).84(&

^8+31"+/84#$%&-":&

B+/;/<)&H).84(&=Y&-":&`& ^8+31"+/84#$%&

&

(:;-4)NC)O/974:54)./)?@+)59,745),>47)6A4)P,H(H;4)54:79A)./6478:94)F.>.F4F);B):FF.6.,/:-)6.24)6:E4/)Q./)2./H645R")

%4:79A)%67:640B) $:5B) ?4F.H2) I.88.9H-6)

H'/1)$'&H).84(& E:QXF& E:QEQ& E:ETQ&

^8+31"+/84#$%& E:EFa& E:ETT& E:EKE&

&

(:;-4)SC)O/974:54)./)?@+),>47)6A4)P,H(H;4)54:79A)./6478:94)F.>.F4F);B):FF.6.,/:-)9,56)Q./)&%)94/65R")

%4:79A)%67:640B) $:5B) ?4F.H2) I.88.9H-6)

H'/1)$'&H).84(& Q:==X& Q:TFT& X:YX=&

^8+31"+/84#$%& X:QFF& Q:aF=& Q:QQQ&

&

MAP  

Page 21: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations

Page 22: Crowdsourcing for Multimedia Retrieval

+ Annotation model

n A set of objects to annotate, indexed by i = 1, . . . , I

n A set of annotators, indexed by j = 1, . . . , J

n Types of annotations
  n Binary
  n Categorical (multi-class)
  n Numerical
  n Other

Page 23: Crowdsourcing for Multimedia Retrieval

+ Annotation model

[Diagram: a bipartite graph linking annotators to objects. Each object i has an unknown true label y_i; each edge carries an observed annotation y_i^j ∈ L given by annotator j to object i (e.g., y_1^1, y_1^2, y_2^1, y_2^3, . . .). The labels are binary when |L| = 2 and multi-class when |L| > 2.]

Page 24: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations

n Majority voting (baseline)
  n For each object, assign the label that received the largest number of votes

n Aggregating annotations
  n [Dawid and Skene, 1979]
  n [Snow et al., 2008]
  n [Whitehill et al., 2009]
  n …

n Aggregating and learning
  n [Sheng et al., 2008]
  n [Donmez et al., 2009]
  n [Raykar et al., 2010]
  n …

Page 25: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations: Majority voting

n Assume that
  n The annotator quality is independent from the object, P(y_i^j = y_i) = p_j
  n All annotators have the same quality, p_j = p

n The integrated quality of majority voting using I = 2N + 1 annotators is (a numerical sketch follows below)

q = P(y_MV = y) = Σ_{i=0}^{N} C(2N + 1, i) · p^{2N+1−i} · (1 − p)^i

where C(·, ·) is the binomial coefficient and the index i counts the number of incorrect labels.
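A minimal Python sketch of this formula (the function name is illustrative, not from the lecture):

```python
from math import comb

def integrated_quality(p: float, num_annotators: int) -> float:
    """Probability that majority voting over an odd number of independent
    annotators, each correct with probability p, returns the true label.
    The loop index i counts the number of incorrect votes."""
    assert num_annotators % 2 == 1, "use an odd number of annotators to avoid ties"
    N = (num_annotators - 1) // 2
    return sum(comb(num_annotators, i) * p ** (num_annotators - i) * (1 - p) ** i
               for i in range(N + 1))

# q improves with more annotators only when p > 0.5
for I in (1, 3, 5, 11):
    print(I, round(integrated_quality(0.7, I), 3), round(integrated_quality(0.4, I), 3))
```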

Page 26: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations: Majority voting

repeated-labeling to shift from a lower-q curve to a higher-q curve can, under some settings, improve learning considerably. In order to treat this more formally, we first introduce some terminology and simplifying assumptions.

3.1 Notation and Assumptions

We consider a problem of supervised induction of a (binary) classification model. The setting is the typical one, with some important exceptions. For each training example ⟨yi, xi⟩, procuring the unlabeled "feature" portion, xi, incurs cost CU. The action of labeling the training example with a label yi incurs cost CL. For simplicity, we assume that each cost is constant across all examples. Each example ⟨yi, xi⟩ has a true label yi, but labeling is error-prone. Specifically, each label yij comes from a labeler j exhibiting an individual labeling quality pj, which is Pr(yij = yi); since we consider the case of binary classification, the label assigned by labeler j will be incorrect with probability 1 − pj.

In the current paper, we work under a set of assumptions that allow us to focus on a certain set of problems that arise when labeling using multiple noisy labelers. First, we assume that Pr(yij = yi|xi) = Pr(yij = yi) = pj, that is, individual labeling quality is independent of the specific data point being labeled. We sidestep the issue of knowing pj: the techniques we present do not rely on this knowledge. Inferring pj accurately should lead to improved techniques; Dawid and Skene [6] and Smyth et al. [26, 28] have shown how to use an expectation-maximization framework for estimating the quality of labelers when all labelers label all available examples. It seems likely that this work can be adapted to work in a more general setting, and applied to repeated-labeling. We also assume for simplicity that each labeler j only gives one label, but that is not a restrictive assumption in what follows. We further discuss limitations and directions for future research in Section 5.

3.2 Majority Voting and Label Quality

To investigate the relationship between labeler quality, number of labels, and the overall quality of labeling using multiple labelers, we start by considering the case where for induction each repeatedly-labeled example is assigned a single "integrated" label ŷi, inferred from the individual yij's by majority voting. For simplicity, and to avoid having to break ties, we assume that we always obtain an odd number of labels. The quality qi = Pr(ŷi = yi) of the integrated label ŷi will be called the integrated quality. Where no confusion will arise, we will omit the subscript i for brevity and clarity.

3.2.1 Uniform Labeler Quality

We first consider the case where all labelers exhibit the same quality, that is, pj = p for all j (we will relax this assumption later). Using 2N + 1 labelers with uniform quality p, the integrated labeling quality q is:

q = Pr(ŷ = y) = Σ_{i=0}^{N} C(2N + 1, i) · p^{2N+1−i} · (1 − p)^i      (1)

which is the sum of the probabilities that we have more correct labels than incorrect (the index i corresponds to the number of incorrect labels).

Not surprisingly, from the formula above, we can infer that the integrated quality q is greater than p only when p > 0.5. When p < 0.5, we have an adversarial setting where q < p, and, not surprisingly, the quality decreases as we increase the number of labelers.

Figure 2: The relationship between integrated labeling quality, individual quality, and the number of labelers.

Figure 3: Improvement in integrated quality compared to single-labeling, as a function of the number of labelers, for different labeler qualities.

Figure 2 demonstrates the analytical relationship between

the integrated quality and the number of labelers, for different individual labeler qualities. As expected, the integrated quality improves with larger numbers of labelers, when the individual labeling quality p > 0.5; however, the marginal improvement decreases as the number of labelers increases. Moreover, the benefit of getting more labelers also depends on the underlying value of p. Figure 3 shows how integrated quality q increases compared to the case of single-labeling, for different values of p and for different numbers of labelers. For example, when p = 0.9, there is little benefit when the number of labelers increase from 3 to 11. However, when p = 0.7, going just from single labeling to three labelers increases integrated quality by about 0.1, which in Figure 1 would yield a substantial upward shift in the learning curve (from the q = 0.7 to the q = 0.8 curve); in short, a small amount of repeated-labeling can have a noticeable effect for moderate levels of noise.

Therefore, for cost-effective labeling using multiple noisy labelers we need to consider: (a) the effect of the integrated quality q on learning, and (b) the number of labelers required to increase q under different levels of labeler quality p; we will return to this later, in Section 4.

3.2.2 Different Labeler Quality

If we relax the assumption that pj = p for all j, and allow labelers to have different qualities, a new question arises: what is preferable: using multiple labelers or using the best individual labeler? A full analysis is beyond the scope (and space limit) of this paper, but let us consider the special case that we have a group of three labelers, where the middle labeling quality is p, the lowest one is p − d, and the highest one is p + d. In this case, the integrated quality q is:

(p − d) · p · (p + d) + (p − d) · p · (1 − (p + d)) + (p − d) · (1 − p) · (p + d) + (1 − (p − d)) · p · (p + d) = −2p^3 + 2pd^2 + 3p^2 − d^2


Page 27: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Snow et al., 2008]

n Binary labels: y_i^j ∈ {0, 1}

n The true label is estimated evaluating the posterior log-odds, i.e.,

log [ P(y_i = 1 | y_i^1, . . . , y_i^J) / P(y_i = 0 | y_i^1, . . . , y_i^J) ]

n Applying Bayes' theorem,

log [ P(y_i = 1 | y_i^1, . . . , y_i^J) / P(y_i = 0 | y_i^1, . . . , y_i^J) ] = Σ_j log [ P(y_i^j | y_i = 1) / P(y_i^j | y_i = 0) ] + log [ P(y_i = 1) / P(y_i = 0) ]

(the left-hand side is the posterior, the sum over j is the likelihood term, and the last term is the prior)

Page 28: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Snow et al., 2008]

n How to estimate P(y_i^j | y_i = 1) and P(y_i^j | y_i = 0)?

n Gold standard:
  n Some objects have known labels
  n Ask to annotate these objects
  n Compute the empirical p.m.f. for the object(s) with known labels, e.g.,

P(y^j = 1 | y = 1) = (Number of correct annotations) / (Number of annotations of objects with label = 1)

n Compute the performance of annotator j (independent from the object):

P(y_1^j | y_1 = 1) = P(y_2^j | y_2 = 1) = . . . = P(y_I^j | y_I = 1) = P(y^j | y = 1)

Page 29: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Snow et al., 2008]

n Each annotator vote is weighted by the log-likelihood ratio for their given response (Naïve Bayes); a sketch follows below

n More reliable annotators are weighted more

n Issue: Obtaining a gold standard is costly!

log [ P(y_i = 1 | y_i^1, . . . , y_i^J) / P(y_i = 0 | y_i^1, . . . , y_i^J) ] = Σ_j log [ P(y_i^j | y_i = 1) / P(y_i^j | y_i = 0) ] + log [ P(y_i = 1) / P(y_i = 0) ]
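A minimal Python sketch of this gold-standard Naïve Bayes aggregation (the data layout and function names are illustrative; Laplace smoothing is an addition to avoid zero counts, not something discussed on the slides):

```python
import math
from collections import defaultdict

def estimate_likelihoods(gold, annotations, smoothing=1.0):
    """Empirical P(annotator j answers 1 | true label y) from gold-labeled objects.
    gold: {object_id: 0/1 true label}; annotations: {object_id: {annotator_id: 0/1}}."""
    ones = defaultdict(lambda: {0: 0.0, 1: 0.0})
    total = defaultdict(lambda: {0: 0.0, 1: 0.0})
    for obj, y in gold.items():
        for j, label in annotations.get(obj, {}).items():
            ones[j][y] += label
            total[j][y] += 1
    return {j: {y: (ones[j][y] + smoothing) / (total[j][y] + 2 * smoothing) for y in (0, 1)}
            for j in ones}

def naive_bayes_label(obj_labels, says_one, prior1=0.5):
    """Aggregate one object's labels {annotator_id: 0/1} by posterior log-odds."""
    log_odds = math.log(prior1 / (1 - prior1))
    for j, label in obj_labels.items():
        p_given_1 = says_one[j][1] if label == 1 else 1 - says_one[j][1]  # P(label | y = 1)
        p_given_0 = says_one[j][0] if label == 1 else 1 - says_one[j][0]  # P(label | y = 0)
        log_odds += math.log(p_given_1 / p_given_0)
    return 1 if log_odds > 0 else 0
```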

Page 30: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Kumar and Lease, 2011]

n With very accurate annotators, it is better to label more examples once

n With very noisy annotators, aggregating labels helps, if annotator accuracies are taken into account

Figure 1: p1:w ∼ U(0.6, 1.0). With very accurate annotators, generating multiple labels (to improve consensus label accuracy) provides little benefit. Instead, labeling effort is better spent single labeling more examples.

Figure 2: p1:w ∼ U(0.4, 0.6). With very noisy annotators, single labeling yields such poor training data that there is no benefit from labeling more examples (i.e. a flat learning rate). MV just aggregates this noise to produce more noise. In contrast, by modeling worker accuracies and weighting their labels appropriately, NB can improve consensus labeling accuracy (and thereby classifier accuracy).

Figure 3: p1:w ∼ U(0.3, 0.7). With greater variance in accuracies vs. Figure 2, NB further improves.

Figure 4: p1:w ∼ U(0.1, 0.7). When average annotator accuracy is below 50%, SL and MV perform exceedingly poorly. However, variance in worker accuracies known to NB allows it to concentrate weight on workers with accuracy over 50% in order to achieve accurate consensus labeling (and thereby classifier accuracy).

Figure 5: p1:w ∼ U(0.2, 0.6). When nearly all annotators typically produce bad labels, failing to "flip" labels from poor annotators dooms all methods to low accuracy.



pj ∼ U(0.6, 1.0)

pj ∼ U(0.3, 0.7)

SL: Single Labeling; MV: Majority Voting; NB: Naïve Bayes

Page 31: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Dawid and Skene, 1979]

n Multi-class labels: y_i^j ∈ {1, . . . , K}

n Each annotator is characterized by the (unknown) error rates

π_lk^j = P(y^j = l | y = k),   k, l = 1, . . . , K

n Given a set of observed labels D = {y_i^1, . . . , y_i^J}, i = 1, . . . , I, estimate
  n The error rates π_lk^j
  n The a-posteriori probabilities P(y_i = k | D)

Page 32: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Dawid and Skene, 1979]

n For simplicity, consider the case with binary labels, y_i^j ∈ {0, 1}

n Each annotator is characterized by the (unknown) error rates

P(y_i^j = 1 | y_i = 1) = α_1^j   (true positive rate)
P(y_i^j = 0 | y_i = 0) = α_0^j   (true negative rate)

n Also assume that the prior is known, i.e.,

P(y_i = 1) = 1 − P(y_i = 0) = p_i

A short simulation of this generative model is sketched below.
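To make the model concrete, a minimal Python sketch that simulates annotations from it (all names are illustrative, not from the lecture):

```python
import random

def simulate_annotations(num_objects, tpr, tnr, prior=0.5, seed=0):
    """Draw true binary labels y_i ~ Bernoulli(prior) and, for every annotator j,
    an observed label y_i^j that is correct with probability alpha_1^j (if y_i = 1)
    or alpha_0^j (if y_i = 0).  tpr and tnr are lists with one entry per annotator."""
    rng = random.Random(seed)
    truth, observed = [], []
    for _ in range(num_objects):
        y = 1 if rng.random() < prior else 0
        truth.append(y)
        row = []
        for a1, a0 in zip(tpr, tnr):
            correct = rng.random() < (a1 if y == 1 else a0)
            row.append(y if correct else 1 - y)
        observed.append(row)
    return truth, observed

# e.g. three annotators: one expert, one mediocre, one with a high false positive rate
truth, obs = simulate_annotations(1000, tpr=[0.95, 0.7, 0.9], tnr=[0.95, 0.7, 0.5])
```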

Page 33: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Dawid and Skene, 1979]

[Diagram: graphical model of the annotation process. The true labels y_1, . . . , y_I generate the observed labels y_1^1, y_1^2, y_2^1, y_2^2, y_2^3, . . ., which also depend on the annotator accuracies α_1, . . . , α_J.]

Page 34: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Dawid and Skene, 1979]

n The likelihood function of the parameters {α_1, α_0} given the observations D = {y_i^1, . . . , y_i^J}, i = 1, . . . , I, is factored as

P(D | α_1, α_0) = Π_{i=1}^{I} P(y_i^1, . . . , y_i^J | α_1, α_0)
               = Π_{i=1}^{I} [ P(y_i^1, . . . , y_i^J | y_i = 1, α_1) P(y_i = 1) + P(y_i^1, . . . , y_i^J | y_i = 0, α_0) P(y_i = 0) ]

Page 35: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Dawid and Skene, 1979]

P(D | α_1, α_0) = Π_{i=1}^{I} [ P(y_i^1, . . . , y_i^J | y_i = 1, α_1) P(y_i = 1) + P(y_i^1, . . . , y_i^J | y_i = 0, α_0) P(y_i = 0) ]

where, assuming the annotators are independent given the true label,

Π_{j=1}^{J} P(y_i^j | y_i = 1, α_1^j) = Π_{j=1}^{J} [α_1^j]^{y_i^j} [1 − α_1^j]^{1 − y_i^j}

Π_{j=1}^{J} P(y_i^j | y_i = 0, α_0^j) = Π_{j=1}^{J} [α_0^j]^{1 − y_i^j} [1 − α_0^j]^{y_i^j}

Page 36: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Dawid and Skene, 1979]

n The parameters are found by maximizing the log-likelihood function

{α̂_1, α̂_0} = argmax_θ log P(D | θ),   with θ = {α_1, α_0}

n The solution is based on Expectation-Maximization

n Expectation step

µ_i = P(y_i = 1 | y_i^1, . . . , y_i^J, θ) ∝ P(y_i^1, . . . , y_i^J | y_i = 1, θ) P(y_i = 1 | θ) = a_{1,i} p_i / (a_{1,i} p_i + a_{0,i} (1 − p_i))

where p_i = P(y_i = 1) is the prior and

a_{1,i} = Π_{j=1}^{J} [α_1^j]^{y_i^j} [1 − α_1^j]^{1 − y_i^j}

a_{0,i} = Π_{j=1}^{J} [α_0^j]^{1 − y_i^j} [1 − α_0^j]^{y_i^j}

Page 37: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Dawid and Skene, 1979]

n Maximization step
  n The true positive and true negative rates (equivalently, the false negative and false positive rates) can be estimated in closed form; a full EM sketch follows below

α_1^j = Σ_{i=1}^{I} µ_i y_i^j / Σ_{i=1}^{I} µ_i

α_0^j = Σ_{i=1}^{I} (1 − µ_i)(1 − y_i^j) / Σ_{i=1}^{I} (1 − µ_i)
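Putting the E-step and M-step together, a minimal Python sketch of this EM procedure (assuming, for simplicity, that every annotator labels every object and that the prior is known and shared across objects; all names are illustrative):

```python
def dawid_skene_binary_em(labels, prior=0.5, n_iter=50):
    """labels[i][j] in {0, 1}: label given by annotator j to object i (fully observed).
    Returns posteriors mu_i = P(y_i = 1 | labels) and per-annotator true positive
    and true negative rates (alpha_1^j, alpha_0^j)."""
    I, J = len(labels), len(labels[0])
    alpha1 = [0.7] * J                  # initial guesses; any value in (0.5, 1) works here
    alpha0 = [0.7] * J
    mu = [prior] * I
    for _ in range(n_iter):
        # E-step: posterior probability that each object's true label is 1
        for i in range(I):
            a1 = a0 = 1.0
            for j in range(J):
                y = labels[i][j]
                a1 *= alpha1[j] if y == 1 else 1 - alpha1[j]
                a0 *= alpha0[j] if y == 0 else 1 - alpha0[j]
            mu[i] = a1 * prior / (a1 * prior + a0 * (1 - prior))
        # M-step: closed-form updates of the annotator rates
        pos = sum(mu)
        neg = I - pos
        for j in range(J):
            alpha1[j] = sum(mu[i] * labels[i][j] for i in range(I)) / pos
            alpha0[j] = sum((1 - mu[i]) * (1 - labels[i][j]) for i in range(I)) / neg
    return mu, alpha1, alpha0
```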

Page 38: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Tang and Lease, 2011]

n A semi-supervised approach between
  n A supervised approach based on a gold standard: Naïve Bayes [Snow et al., 2008]
  n An unsupervised approach: Expectation-Maximization [Dawid and Skene, 1979]

n A very modest amount of supervision can provide a significant benefit

[Plot: accuracy vs. number of labeled examples for MV, EM, and NB.]
Figure 4: Supervised NB vs. unsupervised MV and EM on the synthetic dataset.

[Plot: accuracy vs. number of labeled examples for MV, EM, and NB.]
Figure 5: Supervised NB vs. unsupervised MV and EM on the MTurk dataset.

4.2 Semi-supervised vs. supervised

In our second set of experiments, we compare our semi-supervised SNB method vs. supervised NB method, evaluating consensus accuracy achieved across varying amount of labeled vs. unlabeled training data. Starting from each of the same labeled training size values considered in our first set of experiments for supervised NB, we now consider adding additional unlabeled examples in powers of two as before into the training set, though now we have potentially more data to use (up to 5000 unlabeled examples in the synthetic data, and up to 15758 examples with MTurk). As before, we repeat experiments 10 times and average.

Figure 6 and Figure 7 compare semi-supervised SNB method with supervised NB method for synthetic and MTurk data, respectively. Results on both synthetic and MTurk data are quite similar. Each curve in the figures corresponds to a SNB method trained on a different number of (labeled) training examples. The x-axis indicates the number of additional,

[Plot: accuracy vs. number of unlabeled examples for SNB trained with 128 to 2048 labeled examples.]
Figure 6: Semi-supervised SNB vs. supervised NB method on the synthetic dataset.

[Plot: accuracy vs. number of unlabeled examples for SNB trained with 128 to 2048 labeled examples.]
Figure 7: Semi-supervised SNB vs. supervised NB method on the MTurk dataset.

unlabeled examples used for training. While not shown, a value of x = 0 (no unlabeled data used) in Figure 6 and Figure 7 would correspond exactly to the accuracy achieved by supervised NB method from Figure 4 and Figure 5, respectively. All curves approach convergence with the full training set (all available labeled and unlabeled data).

Labels for unlabeled examples are automatically estimated by SNB with a given confidence during the training process. Worker labels are then compared to these generated labels and confidence values in order to estimate worker accuracies (in addition to comparing worker labels on expert labeled examples). Figure 4 and Figure 5 intuitively showed that NB consensus accuracy increases with more labeled training data. Figure 6 and Figure 7 reflect this in the relative starting positions of each learning curve of SNB method.

Recall that unsupervised EM method achieved 75.0% consensus accuracy for the synthetic data in Figure 4. From Figure 6 we can see that, with only 256 labeled and 1024

Page 39: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Whitehill et al., 2009]

n Binary labels: y_i^j ∈ {0, 1}

n Annotators have different expertise:

p(y_i^j = y_i | α_j, β_i) = 1 / (1 + e^{−α_j β_i})

n More skilled annotators (higher α_j) have higher probability of labeling correctly

n As the difficulty of the image 1/β_i increases, the probability of the label being correct moves towards 0.5

n GLAD (Generative model of Labels, Abilities, and Difficulties)

Page 40: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Whitehill et al., 2009]

[Diagram: graphical model of GLAD. The object difficulties β_1, . . . , β_I, the annotator accuracies α_1, . . . , α_J, and the true labels y_1, . . . , y_I together generate the observed labels y_1^1, y_1^2, y_2^1, y_2^2, y_2^3, . . . .]

Page 41: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Whitehill et al., 2009]

n The observed labels are samples from the random variables {y_i^j}

n The unobserved variables are
  n The true image labels y_i, i = 1, . . . , I
  n The object difficulty parameters β_i, i = 1, . . . , I
  n The different annotators' accuracies α_j, j = 1, . . . , J

n Goal: find the most likely values of the unobserved variables given the observed data

n Solution: Expectation-Maximization (EM)

Page 42: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Whitehill et al., 2009]

n Expectation step:
  n Compute the posterior probabilities of all y_i ∈ {0, 1}
  n given the α, β values from the last M step

Using Bayes' theorem and the independence of the annotators,

P(y_i | Y_i, α, β_i) ∝ P(y_i) Π_j P(y_i^j | y_i, α_j, β_i)

where Y_i = {y_i^j : j = 1, . . . , J} is the set of labels observed for object i and

p(y_i^j | y_i = 1, α_j, β_i) = [ 1 / (1 + e^{−α_j β_i}) ]^{y_i^j} [ 1 − 1 / (1 + e^{−α_j β_i}) ]^{1 − y_i^j}

Page 43: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Whitehill et al., 2009]

n Maximization step:
  n Maximize the auxiliary function

Q(α, β) = E[ log p(Y_1, . . . , Y_I, y | α, β) ]

  n where the expectation is with respect to the posterior probabilities of all y_i ∈ {0, 1} computed in the E-step:

Q(α, β) = E[ log Π_i p(y_i) Π_j p(y_i^j | y_i, α_j, β_i) ] = Σ_i E[ log p(y_i) ] + Σ_{ij} E[ log p(y_i^j | y_i, α_j, β_i) ]

  n The parameters α, β are estimated using gradient ascent

(α*, β*) = argmax_{α,β} Q(α, β)
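A compact Python sketch of this EM loop (a few analytic gradient-ascent steps per M-step; β is clamped to stay positive, whereas the original paper optimizes log β, so this is a simplification; all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glad_em(labels, prior1=0.5, n_iter=30, lr=0.05, grad_steps=20):
    """labels: (I, J) array of 0/1 annotations (fully observed, for simplicity).
    Returns posteriors mu_i = P(y_i = 1), annotator abilities alpha_j, and
    per-object parameters beta_i (larger beta means an easier object)."""
    I, J = labels.shape
    alpha = np.ones(J)                     # ability; negative values = adversarial
    beta = np.ones(I)                      # 1 / difficulty; kept positive below
    mu = np.full(I, prior1)
    for _ in range(n_iter):
        # E-step: posterior over the true label of each object
        sig = sigmoid(np.outer(beta, alpha))           # P(label is correct)
        agree1 = np.where(labels == 1, sig, 1 - sig)   # P(label | y_i = 1)
        agree0 = np.where(labels == 0, sig, 1 - sig)   # P(label | y_i = 0)
        like1 = prior1 * agree1.prod(axis=1)
        like0 = (1 - prior1) * agree0.prod(axis=1)
        mu = like1 / (like1 + like0)
        # M-step: gradient ascent on Q(alpha, beta)
        for _ in range(grad_steps):
            sig = sigmoid(np.outer(beta, alpha))
            # posterior probability that each observed label agrees with the true label
            e_agree = mu[:, None] * (labels == 1) + (1 - mu)[:, None] * (labels == 0)
            common = e_agree * (1 - sig) - (1 - e_agree) * sig
            alpha += lr * (common * beta[:, None]).sum(axis=0)
            beta += lr * (common * alpha[None, :]).sum(axis=1)
            beta = np.clip(beta, 1e-3, None)
    return mu, alpha, beta
```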

Page 44: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Whitehill et al., 2009]

[Plots: proportion of labels correct vs. number of labelers for GLAD and majority vote (left); correlation between recovered and true alpha and beta parameters vs. number of labelers (right).]
Figure 2: Left: The accuracies of the GLAD model versus simple voting for inferring the underlying class labels on simulation data. Right: The ability of GLAD to recover the true alpha and beta parameters on simulation data.

Simulated labeler accuracy by image type:
Labeler type | Hard image | Easy image
Good | 0.95 | 1
Bad | 0.54 | 1

We measured performance in terms of proportion of correctly estimated labels. We compared three approaches: (1) our proposed method, GLAD; (2) the method proposed in [5], which models labeler ability but not image difficulty; and (3) Majority Vote. The simulations were repeated 20 times and average performance calculated for the three methods. The results shown below indicated that modeling image difficulty can result in significant performance improvements.

Method | Error
GLAD | 4.5%
Majority Vote | 11.2%
Dawid & Skene [5] | 8.4%

4.1 Stability of EM under Various Starting Points

Empirically we found that the EM procedure was fairly insensitive to varying the starting point of the parameter values. In a simulation study of 2000 images and 20 labelers, we randomly selected each αi ∼ U[0, 4] and log(βj) ∼ U[0, 3], and EM was run until convergence. Over the 50 simulation runs, the average percent-correct of the inferred labels was 85.74%, and the standard deviation of the percent-correct over all the trials was only 0.024%.

5 Empirical Study I: Greebles

As a first test-bed for GLAD using real data obtained from the Mechanical Turk, we posted pictures of 100 "Greebles" [6], which are synthetically generated images that were originally created to study human perceptual expertise. Greebles somewhat resemble human faces and have a "gender": Males have horn-like organs that point up, whereas for females the horns point down. See Figure 3 (left) for examples. Each of the 100 Greeble images was labeled by 10 different human coders on the Turk for gender (male/female). Four greebles of each gender (separate from the 100 labeled images) were given as examples of each class. Shown at a resolution of 48x48 pixels, the task required careful inspection of the images in order to label them correctly. The ground-truth gender values were all known with certainty (since they are rendered objects) and thus provided a means of measuring the accuracy of inferred image labels.



Page 45: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Welinder and Perona, 2010]

n Setting similar to [Whitehill et al., 2009], with some differences
  n Object difficulty is not explicitly modeled
  n Annotator quality distinguishes between true positive and true negative rate

P(y_i^j = 1 | y_i = 1) = α_1^j
P(y_i^j = 0 | y_i = 0) = α_0^j
α^j = [α_0^j, α_1^j]^T

n A prior distribution is set on α^j to capture 2 kinds of annotators
  n Honest annotators (with different qualities, from unreliable to experts)
  n Adversarial annotators

Page 46: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Welinder and Perona, 2010]

n Batched algorithm

n Expectation step

P(y_i | Y_i, α) ∝ P(y_i) Π_j P(y_i^j | y_i, α^j)

with

P(y_i^j | y_i = 1, α^j) = (α_1^j)^{y_i^j} (1 − α_1^j)^{1 − y_i^j}
P(y_i^j | y_i = 0, α^j) = (α_0^j)^{1 − y_i^j} (1 − α_0^j)^{y_i^j}

n Maximization step (note the extra prior term on the annotator parameters)

α* = argmax_α Q(α)

Q(α) = Σ_i E[ log P(y_i) ] + Σ_{ij} E[ log P(y_i^j | y_i, α^j) ] + Σ_j log P(α^j)

Page 47: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Welinder and Perona, 2010]

Dataset | Images | Assignments | Workers
Presence-1 | 1,514 | 15 | 47
Presence-2 | 2,401 | 15 | 54
Attributes-1 | 6,033 | 5 | 507
Attributes-2 | 6,033 | 5 | 460
Bounding Boxes | 911 | 10 | 85

Table 1. Summary of the datasets collected from Amazon Mechanical Turk showing the number of images per dataset, the number of labels per image (assignments), and total number of workers that provided labels. Presence-1/2 are binary labels, and Attributes-1/2 are multi-valued labels.

[Plot: error rate (log scale) vs. number of assignments per image for majority, GLAD, and ours (batch).]
Figure 7. Comparison between the majority rule, GLAD [14], and our algorithm on synthetic data as the number of assignments per image is increased. The synthetic data is generated by the model in Section 5 from the worker parameters estimated in Figure 5a.

tions, most annotators provided good labels, except for no. 53 and 58. These two annotators were also the only ones to label all available images. In all three subplots of Figure 6, most workers provide only a few labels, and only some very active annotators label more than 100 images. Our findings in this figure are very similar to the results presented in Figure 6 of [7].

Importance of discrimination: The results in Figure 6 point out the importance of online estimation of aj and the use of expert- and bot-lists for obtaining labels on MTurk. The expert-list is needed to reduce the number of labels per image, as we can be more sure of the quality of the labels received from experts. Furthermore, without the expert-list to prioritize which annotators to ask first, the image will likely be labeled by a new worker, and thus the estimate of aj for that worker will be very uncertain. The bot-list is needed to discriminate against sloppy annotators that will otherwise annotate most of the dataset in hope to make easy money, as shown by the outliers (no. 53 and 58) in Figure 6c.

Performance of binary model: We compared the performance of the annotator model applied to binary data, described in Section 5, to two other models of binary data, as the number of available labels per image, m, varied. The first method was a simple majority decision rule and the second method was the GLAD-algorithm presented in [14]. Since we did not have access to the ground truth labels of

[Plots: error rate vs. labels per image for the batch and online algorithms on Presence-1 and Presence-2.]
Figure 8. Error rates vs. the number of labels used per image on the Presence datasets for the online algorithm and the batch version. The ground truth was the estimates when running the batch algorithm with all 15 labels per image available (thus batch will have zero error at 15 labels per image).

the datasets, we generated synthetic data, where we knew the ground truth, as follows: (1) We used our model to estimate aj for all 47 annotators in the Presence-1 dataset. (2) For each of 2000 target values (half with zi = 1), we sampled labels from m randomly chosen workers, where the labels were generated according to the estimated aj and Equation 10. As can be seen from Figure 7, our model achieves a consistently lower error rate on synthetic data.

Online algorithm: We simulated running the online algorithm on the Presence datasets obtained using MTurk and used the result from the batch algorithm as ground truth. When the algorithm requested labels for an image, it was given labels from the dataset (along with an identifier for the worker that provided it) randomly sampled without replacement. If it requested labels from the expert-list for a particular image, it only received such a label if a worker in the expert-list had provided a label for that image, otherwise it was randomly sampled from non bot-listed workers. A typical run of the algorithm on the Presence-1 dataset is shown in Figure 9. In the first few iterations, the algorithm is pessimistic about the quality of the annotators, and requests up to m = 15 labels per image. As the evidence accumulates, more workers are put in the expert- and bot-lists, and the number of labels requested by the algorithm decreases. Notice in the figure that towards the final iterations, the algorithm samples only 2–3 labels for some images.

To get an idea of the performance of the online algorithm, we compared it to running the batch version from Section 3 with limited number of labels per image. For the Presence-1 dataset, the error rate of the online algorithm is almost three times lower than the general algorithm when using the same number of labels per image, see Figure 8. For the Presence-2 dataset, twice as many labels per image are needed for the batch algorithm to achieve the same performance as the online version.

Online crowdsourcing: rating annotators and obtaining cost-effective labels
Peter Welinder, Pietro Perona, California Institute of Technology, {welinder,perona}@caltech.edu

Abstract

Labeling large datasets has become faster, cheaper, and easier with the advent of crowdsourcing services like Amazon Mechanical Turk. How can one trust the labels obtained from such services? We propose a model of the labeling process which includes label uncertainty, as well as a multi-dimensional measure of the annotators' ability. From the model we derive an online algorithm that estimates the most likely value of the labels and the annotator abilities. It finds and prioritizes experts when requesting labels, and actively excludes unreliable annotators. Based on labels already obtained, it dynamically chooses which images will be labeled next, and how many labels to request in order to achieve a desired level of confidence. Our algorithm is general and can handle binary, multi-valued, and continuous annotations (e.g. bounding boxes). Experiments on a dataset containing more than 50,000 labels show that our algorithm reduces the number of labels required, and thus the total cost of labeling, by a large factor while keeping error rates low on a variety of datasets.

1. Introduction

Crowdsourcing, the act of outsourcing work to a large crowd of workers, is rapidly changing the way datasets are created. Not long ago, labeling large datasets could take weeks, if not months. It was necessary to train annotators on custom-built interfaces, often in person, and to ensure they were motivated enough to do high quality work. Today, with services such as Amazon Mechanical Turk (MTurk), it is possible to assign annotation jobs to hundreds, even thousands, of computer-literate workers and get results back in a matter of hours. This opens the door to labeling huge datasets with millions of images, which in turn provides great possibilities for training computer vision algorithms.

The quality of the labels obtained from annotators varies. Some annotators provide random or bad quality labels in the hope that they will go unnoticed and still be paid, and yet others may have good intentions but completely misunderstand the task at hand. The standard solution to the problem of "noisy" labels is to assign the same labeling task to many different annotators, in the hope that at least a few of them will provide high quality labels or that a consensus emerges from a great number of labels. In either case, a large number of labels is necessary, and although a single label is cheap, the costs can accumulate quickly.

Figure 1. Examples of binary labels obtained from Amazon Mechanical Turk (see Figure 2 for an example of continuous labels). The boxes show the labels provided by four workers (identified by the number in each box); green indicates that the worker selected the image, red means that he or she did not. The task for each annotator was to select only images that he or she thought contained a Black-chinned Hummingbird. Figure 5 shows the expertise and bias of the workers. Worker 25 has a high false positive rate, and 22 has a high false negative rate. Worker 26 provides inconsistent labels, and 2 is the annotator with the highest accuracy. Photos in the top row were classified to contain a Black-chinned Hummingbird by our algorithm, while the ones in the bottom row were not.

If one is aiming for a given label quality for the minimum time and money, it makes more sense to dynamically decide on the number of labelers needed. If an expert annotator provides a label, we can probably rely on it being of high quality, and we may not need more labels for that particular task. On the other hand, if an unreliable annotator provides a label, we should probably ask for more labels until we find an expert or until we have enough labels from non-experts to let the majority decide the label.

Page 48: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Welinder and Perona, 2010]

n Online algorithm
  n For each annotator j
    n Estimate α^j
    n If the estimate of the annotator quality is reliable (var(α^j) < θ)
      n If annotator j is an expert, add it to the expert-list: E ← E ∪ {j}
      n Otherwise, add it to the bot-list: B ← B ∪ {j}

Page 49: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Welinder and Perona, 2010]

n Online algorithm (continued)
  n For each object i to be annotated
    n Compute P(y_i) from the available labels Y_i and α
    n If the estimated label is unreliable (max_{y_i} P(y_i) < τ), ask the experts in the list E
    n If labels cannot be obtained from experts, ask annotators not in the bot-list B
    n Stop asking for labels when max_{y_i} P(y_i) ≥ τ or the maximum number of annotations is exceeded

A control-flow sketch of this loop follows below.
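A Python sketch of the control flow of this online procedure (the helpers `estimate` and `request_label` are hypothetical stand-ins for the batch inference of the previous slides and for the crowdsourcing platform; the thresholds are parameters):

```python
def online_labeling(objects, request_label, estimate, max_labels=15, tau=0.9, theta=0.05):
    """Control-flow sketch of the online algorithm.  Assumed helper signatures:
    - estimate(labels) -> (posterior, annotators): posterior[obj] maps label -> probability;
      annotators[j] is (is_expert, variance) for annotator j.
    - request_label(obj, prefer, exclude) -> (annotator_id, label)."""
    experts, bots = set(), set()
    labels = {obj: {} for obj in objects}
    decisions = {}
    for obj in objects:
        while True:
            posterior, annotators = estimate(labels)
            # move annotators whose quality estimate is reliable to the expert / bot lists
            for j, (is_expert, variance) in annotators.items():
                if variance < theta:
                    (experts if is_expert else bots).add(j)
            confidence = max(posterior[obj].values())
            if confidence >= tau or len(labels[obj]) >= max_labels:
                decisions[obj] = max(posterior[obj], key=posterior[obj].get)
                break
            # prefer experts; otherwise any annotator that is not bot-listed
            j, label = request_label(obj, prefer=experts, exclude=bots)
            labels[obj][j] = label
    return decisions
```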

Page 50: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Welinder and Perona, 2010]

n The online algorithm reduces the number of annotations per object, for the same target error rate

[The slide repeats the Welinder and Perona (2010) excerpt shown on the previous slides; see in particular Figure 8, error rate vs. number of labels per image for the batch and online algorithms on the Presence datasets.]

Page 51: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Karger et al., 2011]

n Infers labels and annotator qualities

n No prior knowledge on annotator qualities

n Inspired by belief propagation and message passing

n Binary labels, y_i^j ∈ {−1, +1}

n Define an I × J matrix A, such that A_ij = y_i^j (entries are left empty where annotator j did not label object i)

Page 52: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Karger et al., 2011]

[Diagram: bipartite graph of objects (rows) and annotators (columns) with edge labels in {+1, −1}.]

A = [ +1  +1   ·  +1   ·
       ·   ·  −1   ·  −1
      +1   ·  −1   ·  +1 ]

("·" marks object-annotator pairs for which no label was collected)

Page 53: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Karger et al., 2011]

[Diagram: the same bipartite graph of objects and annotators; messages flow from annotators to objects.]

x_{i→j}^{(k)} = Σ_{j' ∈ ∂i \ j} A_{ij'} y_{j'→i}^{(k−1)}

x_{i→j}: estimated (soft) label of object i using all annotators but j
y_{j→i}: reliability of annotator j in estimating object i

Page 54: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Karger et al., 2011]

[Diagram: the same bipartite graph; messages flow from objects to annotators.]

y_{j→i}^{(k)} = Σ_{i' ∈ ∂j \ i} A_{i'j} x_{i'→j}^{(k−1)}

y_{j→i}: reliability of annotator j in estimating object i

Page 55: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Karger et al., 2011]

[Diagram: the same bipartite graph, with all incoming messages aggregated at each object.]

Final estimate: ŷ_i = sgn( Σ_{j ∈ ∂i} A_{ij} y_{j→i} )

Page 56: Crowdsourcing for Multimedia Retrieval

+ Aggregating annotations [Karger et al., 2011]

sum of the answers weighted by each worker’s reliability:

si = sign

��

j∈∂iAijyj→i

�.

It is understood that when there is a tie we flip a fair coin to make a decision.

Iterative Algorithm

Input: E, {A_ij}_{(i,j)∈E}, k_max
Output: estimate ŝ({A_ij})
1: For all (i, j) ∈ E do
       Initialize y_{j→i}^{(0)} with random Z_ij ∼ N(1, 1);
2: For k = 1, ..., k_max do
       For all (i, j) ∈ E do x_{i→j}^{(k)} ← Σ_{j′∈∂i\j} A_{ij′} y_{j′→i}^{(k−1)};
       For all (i, j) ∈ E do y_{j→i}^{(k)} ← Σ_{i′∈∂j\i} A_{i′j} x_{i′→j}^{(k)};
3: For all i ∈ [m] do x_i ← Σ_{j∈∂i} A_{ij} y_{j→i}^{(k_max−1)};
4: Output estimate vector ŝ({A_ij}) = [sign(x_i)].
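A minimal NumPy sketch of this iterative estimator (an illustration, not the authors' code): the labels are assumed to sit in a dense objects-by-annotators matrix with 0 marking missing entries, and ties are broken towards +1; the function name and these conventions are mine.

import numpy as np

def karger_iterative(A, k_max=20, seed=0):
    """Estimate binary labels from a label matrix A.

    A[i, j] in {-1, +1} is the label given by annotator j to object i;
    0 marks a missing (object, annotator) pair.
    Returns a vector of estimated labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    mask = (A != 0).astype(float)                 # 1 where an edge (i, j) exists
    # worker messages y_{j->i}, initialized with N(1, 1) noise on the edges
    y = rng.normal(1.0, 1.0, size=A.shape) * mask
    for _ in range(k_max):
        # object messages: x_{i->j} = sum over j' != j of A[i, j'] * y_{j'->i}
        row = (A * y).sum(axis=1, keepdims=True)
        x = (row - A * y) * mask
        # worker messages: y_{j->i} = sum over i' != i of A[i', j] * x_{i'->j}
        col = (A * x).sum(axis=0, keepdims=True)
        y = (col - A * x) * mask
    # final decision: each object's labels weighted by the worker messages
    scores = (A * y).sum(axis=1)
    return np.where(scores >= 0, 1, -1)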

We emphasize here that our inference algorithm requires no information about the prior distribution of the workers’ quality p_j. Our algorithm is inspired by power iteration used to compute the leading singular vectors of a matrix, and we discuss the connection in detail in Section 2.6. While our algorithm is also inspired by the standard Belief Propagation (BP) algorithm for approximating max-marginals [Pea88, YFW03], our algorithm is original and overcomes a few critical limitations of the standard BP. First, the iterative algorithm does not require any knowledge of the prior distribution of p_j, whereas the standard BP requires the knowledge of the distribution. Second, there is no efficient way to implement standard BP, since we need to pass sufficient statistics (or messages) which under our general model are distributions over the reals. On the other hand, the iterative algorithm only passes messages that are real numbers regardless of the prior distribution of p_j, which makes it easy to implement. Third, the iterative algorithm is provably asymptotically order-optimal. Density evolution is a standard technique to analyze the performance of BP. Although we can write down the density evolution for the standard BP, we cannot describe or compute the densities, analytically or numerically. It is also very simple to write down the density evolution equations (cf. (8)) for our algorithm, but it is not a priori clear how one can analyze the densities in this case either. We develop a novel technique to analyze the densities for our iterative algorithm and prove optimality. This technique could be of independent interest to analyzing a broader class of message-passing algorithms.

2.2 Performance guarantee

We state the main analytical result of this paper: for random (l, r)-regular bipartite graph based task assignments with our iterative inference algorithm, the probability of error decays exponentially in lq, up to a universal constant and for a broad range of the parameters l, r and q. With a reasonable choice of l = r and both scaling like (1/q) log(1/ε), the proposed algorithm is guaranteed to achieve error less than ε for any ε ∈ (0, 1/2). Further, an algorithm independent lower bound


Page 57: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning

Page 58: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Sheng et al., 2008]

n  Use  several  noisy  labels  to  create  labeled  data  for  training  classifiers  

n  Training  samples  

n  Labels  might  be  noisy  

⟨y_i, x_i⟩   (y_i: true label, x_i: feature vector)

[Figure 1: Learning curves under different quality levels of training data for the mushroom data set: accuracy vs. number of examples, for q = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5 (q is the probability of a label being correct).]

depends both on the quality of the training labels and on the number of training examples. Of course if the training labels are uninformative (q = 0.5), no amount of training data helps. As expected, under the same labeling quality, more training examples lead to better performance, and the higher the quality of the training data, the better the performance of the learned model. However, the relationship between the two factors is complex: the marginal increase in performance for a given change along each dimension is quite different for different combinations of values for both dimensions. To this, one must overlay the different costs of acquiring only new labels versus whole new examples, as well as the expected improvement in quality when acquiring multiple new labels.

This paper makes several contributions. First, under gradu-ally weakening assumptions, we assess the impact of repeated-labeling on the quality of the resultant labels, as a functionof the number and the individual qualities of the labelers.We derive analytically the conditions under which repeated-labeling will be more or less e!ective in improving resultantlabel quality. We then consider the e!ect of repeated-labelingon the accuracy of supervised modeling. As demonstrated inFigure 1, the relative advantage of increasing the quality of la-beling, as compared to acquiring new data points, depends onthe position on the learning curves. We show that even if weignore the cost of obtaining the unlabeled part of a data point,there are times when repeated-labeling is preferable comparedto getting labels for unlabeled examples. Furthermore, whenwe do consider the cost of obtaining the unlabeled portion,repeated-labeling can give considerable advantage.

We present a comprehensive experimental analysis of therelationships between quality, cost, and technique for repeated-labeling. The results show that even a straightforward, round-robin technique for repeated-labeling can give substantialbenefit over single-labeling. We then show that selectivelychoosing the examples to label repeatedly yields substantialextra benefit. A key question is: How should we select datapoints for repeated-labeling? We present two techniques basedon di!erent types of information, each of which improves overround-robin repeated labeling. Then we show that a techniquethat combines the two types of information is even better.

Although this paper covers a good deal of ground, there ismuch left to be done to understand how best to label usingmultiple, noisy labelers; so, the paper closes with a summaryof the key limitations, and some suggestions for future work.

2. RELATED WORKRepeatedly labeling the same data point is practiced in

applications where labeling is not perfect (e.g., [27, 28]). Weare not aware of a systematic assessment of the relationshipbetween the resultant quality of supervised modeling andthe number of, quality of, and method of selection of datapoints for repeated-labeling. To our knowledge, the typi-

cal strategy used in practice is what we call “round-robin”repeated-labeling, where cases are given a fixed number oflabels—so we focus considerable attention in the paper to thisstrategy. A related important problem is how in practice toassess the generalization performance of a learned model withuncertain labels [28], which we do not consider in this paper.Prior research has addressed important problems necessary fora full labeling solution that uses multiple noisy labelers, suchas estimating the quality of labelers [6, 26, 28], and learningwith uncertain labels [13, 24, 25]. So we treat these topicsquickly when they arise, and lean on the prior work.

Repeated-labeling using multiple noisy labelers is di!erentfrom multiple label classification [3, 15], where one examplecould have multiple correct class labels. As we discuss inSection 5, repeated-labeling can apply regardless of the numberof true class labels. The key di!erence is whether the labelsare noisy. A closely related problem setting is described byJin and Ghahramani [10]. In their variant of the multiplelabel classification problem, each example presents itself witha set mutually exclusive labels, one of which is correct. Thesetting for repeated-labeling has important di!erences: labelsare acquired (at a cost); the same label may appear manytimes, and the true label may not appear at all. Again, thelevel of error in labeling is a key factor.

The consideration of data acquisition costs has seen in-creasing research attention, both explicitly (e.g., cost-sensitivelearning [31], utility-based data mining [19]) and implicitly, asin the case of active learning [5]. Turney [31] provides a shortbut comprehensive survey of the di!erent sorts of costs thatshould be considered, including data acquisition costs andlabeling costs. Most previous work on cost-sensitive learningdoes not consider labeling cost, assuming that a fixed set oflabeled training examples is given, and that the learner cannotacquire additional information during learning (e.g., [7, 8, 30]).

Active learning [5] focuses on the problem of costly labelacquisition, although often the cost is not made explicit. Ac-tive learning (cf., optimal experimental design [33]) uses theexisting model to help select additional data for which toacquire labels [1, 14, 23]. The usual problem setting for activelearning is in direct contrast to the setting we consider forrepeated-labeling. For active learning, the assumption is thatthe cost of labeling is considerably higher than the cost ofobtaining unlabeled examples (essentially zero for “pool-based”active learning).

Some previous work studies data acquisition cost explicitly.For example, several authors [11, 12, 16, 17, 22, 32, 37] studythe costly acquisition of feature information, assuming thatthe labels are known in advance. Saar-Tsechansky et al. [22]consider acquiring both costly feature and label information.

None of this prior work considers selectively obtaining mul-tiple labels for data points to improve labeling quality, and therelative advantages and disadvantages for improving modelperformance. An important di!erence from the setting fortraditional active learning is that labeling strategies that usemultiple noisy labelers have access to potentially relevant addi-tional information. The multisets of existing labels intuitivelyshould play a role in determining the examples for which toacquire additional labels. For example, presumably one wouldbe less interested in getting another label for an example thatalready has a dozen identical labels, than for one with justtwo, conflicting labels.

3. REPEATED LABELING: THE BASICSFigure 1 illustrates that the quality of the labels can have

a marked e!ect on classification accuracy. Intuitively, using


q  =  Probability  of  a  label  being  correct  

Page 59: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Sheng et al., 2008]

n  When training a classifier, consider two options

n  Acquiring  a  new  training  example  

n  Get another label for an existing example

n  Compare two strategies
n  SL - single labeling: acquires additional examples (each with one noisy label)

n  MV - majority voting: acquire additional noisy labels for existing examples

⟨y_i, x_i⟩

Page 60: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Sheng et al., 2008]

n  When labels are noisy, repeated labeling + majority voting helps (see the sketch below)

n  Otherwise, acquiring additional training samples might be better
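Behind the first bullet is the basic arithmetic of majority voting: if each label is independently correct with probability p (the q of Figure 1), the integrated label obtained from a majority of n labels is correct with a probability that grows with n whenever p > 0.5. A small illustrative sketch of that binomial relationship (not code from the paper):

from math import comb

def majority_vote_quality(p, n):
    """Probability that the majority of n independent labels is correct,
    when each label is correct with probability p (ties split at random)."""
    q = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0:
        q += 0.5 * comb(n, n // 2) * (p * (1 - p)) ** (n // 2)
    return q

# with p = 0.7, a single label is correct 70% of the time,
# while the majority of 5 labels is correct about 84% of the time
print(majority_vote_quality(0.7, 1), majority_vote_quality(0.7, 5))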

[Figure 5: Comparing the increase in accuracy for the mushroom data set as a function of the number of labels acquired, when the cost of an unlabeled example is negligible, i.e., C_U = 0. Repeated-labeling with majority vote (MV) starts with an existing set of examples and only acquires additional labels for them, and single labeling (SL) acquires additional examples. Other data sets show similar results. (a) p = 0.6, #examples = 100, for MV; (b) p = 0.8, #examples = 50, for MV.]

for a fixed labeler quality. Both MV and SL start with thesame number of single-labeled examples. Then, MV startsacquiring additional labels only for the existing examples,while SL acquires new examples and labels them.

Generally, whether to invest in another whole training exam-ple or another label depends on the gradient of generalizationperformance as a function of obtaining another label or anew example. We will return to this when we discuss futurework, but for illustration, Figure 5 shows scenarios for ourexample problem, where each strategy is preferable to theother. From Figure 1 we see that for p = 0.6, and with 100examples, there is a lot of headroom for repeated-labeling toimprove generalization performance by improving the overalllabeling quality. Figure 5(a) indeed shows that for p = 0.6,repeated-labeling does improve generalization performance(per label) as compared to single-labeling new examples. Onthe other hand, for high initial quality or steep sections of thelearning curve, repeated-labeling may not compete with sin-gle labeling. Figure 5(b) shows that single labeling performsbetter than repeated-labeling when we have a fixed set of 50training examples with labeling quality p = 0.8. Particularly,repeated-labeling could not further improve its performanceafter acquiring a certain amount of labels (cf., the q = 1 curvein Figure 1).

The results for other datasets are similar to Figure 5: un-der noisy labels, and with CU ! CL, round-robin repeated-labeling can perform better than single-labeling when thereare enough training examples, i.e., after the learning curvesare not so steep (cf., Figure 1).

4.2.2 Round-robin Strategies, General CostsWe illustrated above that repeated-labeling is a viable alter-

native to single-labeling, even when the cost of acquiring the“feature” part of an example is negligible compared to the costof label acquisition. However, as described in the introduction,often the cost of (noisy) label acquisition CL is low comparedto the cost CU of acquiring an unlabeled example. In thiscase, clearly repeated-labeling should be considered: usingmultiple labels can shift the learning curve up significantly.To compare any two strategies on equal footing, we calcu-late generalization performance “per unit cost” of acquireddata; we then compare the di!erent strategies for combiningmultiple labels, under di!erent individual labeling qualities.

We start by defining the data acquisition cost CD:

CD = CU · Tr + CL · NL (2)

to be the sum of the cost of acquiring Tr unlabeled examples(CU · Tr), plus the cost of acquiring the associated NL labels(CL · NL). For single labeling we have NL = Tr, but forrepeated-labeling NL > Tr.

We extend the setting of Section 4.2.1 slightly: repeated-labeling now acquires and labels new examples; single labeling SL is unchanged. Repeated-labeling again is generalized round-robin: for each new example acquired, repeated-labeling acquires a fixed number of labels k, and in this case N_L = k · Tr. (In our experiments, k = 5.) Thus, for round-robin repeated-labeling, in these experiments the cost setting can be described compactly by the cost ratio ρ = C_U / C_L, and in this case C_D = ρ · C_L · Tr + k · C_L · Tr, i.e.,

C_D ∝ ρ + k     (3)

We examine two versions of repeated-labeling, repeated-labelingwith majority voting (MV ) and uncertainty-preserving repeated-labeling (ME ), where we generate multiple examples with dif-ferent weights to preserve the uncertainty of the label multisetas described in Section 3.3.

Performance of di!erent labeling strategies: Figure 6plots the generalization accuracy of the models as a function ofdata acquisition cost. Here ! = 3, and we see very clearly thatfor p = 0.6 both versions of repeated-labeling are preferable tosingle labeling. MV and ME outperform SL consistently (onall but waveform, where MV ties with SL) and, interestingly,the comparative performance of repeated-labeling tends toincrease as one spends more on labeling.

Figure 7 shows the e!ect of the cost ratio !, plotting theaverage improvement per unit cost of MV over SL as a functionof !. Specifically, for each data set the vertical di!erencesbetween the curves are averaged across all costs, and thenthese are averaged across all data sets. The figure shows thatthe general phenomenon illustrated in Figure 6 is not tiedclosely to the specific choice of ! = 3.

Furthermore, from the results in Figure 6, we can see thatthe uncertainty-preserving repeated-labeling ME always per-forms at least as well as MV and in the majority of the casesME outperforms MV. This is not apparent in all graphs, sinceFigure 6 only shows the beginning part of the learning curvesfor MV and ME (because for a given cost, SL uses up trainingexamples quicker than MV and ME ). However, as the numberof training examples increases further, then (for p = 0.6) MEoutperforms MV. For example, Figure 8 illustrates for thesplice dataset, comparing the two techniques for a larger rangeof costs.

In other results (not shown) we see that when labeling qual-ity is substantially higher (e.g., p = 0.8), repeated-labeling stillis increasingly preferable to single labeling as ! increases; how-ever, we no longer see an advantage for ME over MV. Theseresults suggest that when labeler quality is low, inductivemodeling often can benefit from the explicit representation


Page 61: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Sheng et al., 2008]

n  How to select which sample for re-labeling?

n  Assume that
n  the annotator quality is independent of the object
n  all annotators have the same quality
n  the annotator quality is unknown, i.e., uniformly distributed in [0, 1]

n  Let L_0 and L_1 denote the number of labels equal to 0 or 1 assigned to an object

P(y_i^j = y_i) = p_j,   p_j = p

L_0 = |{y^j | y^j = 0}|,   L_1 = |{y^j | y^j = 1}|

Page 62: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Sheng et al., 2008]

n  If y = 1 is the true label, the probability of observing L_0 and L_1 labels is given by the binomial distribution

P(L_0, L_1 | p) = ( (L_0 + L_1) choose L_1 ) p^{L_1} (1 − p)^{L_0}

n  The posterior can be expressed as

P(p | L_1, L_0) = P(L_0, L_1 | p) P(p) / P(L_0, L_1) = P(L_0, L_1 | p) P(p) / ∫_0^1 P(L_0, L_1 | s) ds
                = p^{L_1} (1 − p)^{L_0} / B(L_0 + 1, L_1 + 1) = β_p(L_0 + 1, L_1 + 1)

(B: Beta function; β_p: Beta distribution)

Page 63: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Sheng et al., 2008]

n  Let I_p(L_0, L_1) denote the CDF of the Beta distribution

n  The uncertainty of an object due to noisy labels is defined as

SLU = min{ I_0.5(L_0 + 1, L_1 + 1), 1 − I_0.5(L_0 + 1, L_1 + 1) }

[Figure: CDF of the Beta posterior over p, shown for (L_0 = 0, L_1 = 4), (L_0 = 1, L_1 = 3), and (L_0 = 2, L_1 = 2)]
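A small sketch of this uncertainty score (an illustration assuming SciPy; the function name is mine): I_0.5(L_0 + 1, L_1 + 1) is simply the CDF at 0.5 of the Beta posterior derived on the previous slide (note SciPy's parameter order).

from scipy.stats import beta

def label_uncertainty(L0, L1):
    """SLU: the smaller tail of the Beta posterior over p around 0.5."""
    cdf_half = beta.cdf(0.5, L1 + 1, L0 + 1)   # I_0.5(L0 + 1, L1 + 1)
    return min(cdf_half, 1.0 - cdf_half)

# an object labeled 4-vs-0 is far less uncertain than one labeled 2-vs-2
print(label_uncertainty(0, 4), label_uncertainty(1, 3), label_uncertainty(2, 2))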

Page 64: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Sheng et al., 2008]

n  Different strategies to select the next labeling action
n  GRR: Generalized round robin
n  Selective repeated labeling
n  LU: Label uncertainty
n  MU: Model uncertainty (as in active learning)
n  LMU: Label and model uncertainty

[Figure 11: Accuracy as a function of the number of labels acquired for the four selective repeated-labeling strategies (GRR, MU, LU, LMU) on the 12 datasets (bmg, expedia, kr-vs-kp, mushroom, qvc, sick, spambase, splice, thyroid, tic-tac-toe, travelocity, waveform), p = 0.6.]

generalization accuracy averaged over the held-out test sets(as described in Section 4.1). The results (Figure 11) showthat the improvements in data quality indeed do acceleratelearning. (We report values for p = 0.6, a high-noise settingthat can occur in real-life training data.7) Table 2 summarizesthe results of the experiments, reporting accuracies averagedacross the acquisition iterations for each data set, with themaximum accuracy across all the strategies highlighted inbold, the minimum accuracy italicized, and the grand aver-ages reported at the bottom of the columns.

The results are satisfying. The two methods that incorpo-rate label uncertainty (LU and LMU ) are consistently betterthan round-robin repeated-labeling, achieving higher accu-racy for every data set. (Recall that in the previous section,round-robin repeated-labeling was shown to be substantiallybetter than the baseline single labeling in this setting.) Theperformance of model uncertainty alone (MU ), which can beviewed as the active learning baseline, is more variable: inthree cases giving the best accuracy, but in other cases not

7From [20]: “No two experts, of the 5 experts surveyed, agreed upon

diagnoses more than 65% of the time. This might be evidence forthe di!erences that exist between sites, as the experts surveyed hadgained their expertise at di!erent locations. If not, however, it raisesquestions about the correctness of the expert data.”

even reaching the accuracy of round-robin repeated-labeling.Overall, combining label and model uncertainty (LMU ) isthe best approach: in these experiments LMU always out-performs round-robin repeated-labeling, and as hypothesized,generally it is better than the strategies based on only onetype of uncertainty (in each case, statistically significant by aone-tailed sign test at p < 0.1 or better).

5. CONCLUSIONS, LIMITATIONS, AND FUTURE WORK

Repeated-labeling is a tool that should be considered when-ever labeling might be noisy, but can be repeated. We showedthat under a wide range of conditions, it can improve boththe quality of the labeled data directly, and the quality ofthe models learned from the data. In particular, selectiverepeated-labeling seems to be preferable, taking into accountboth labeling uncertainty and model uncertainty. Also, whenquality is low, preserving the uncertainty in the label multisetsfor learning [25] can give considerable added value.

Our focus in this paper has been on improving data qualityfor supervised learning; however, the results have implica-tions for data mining generally. We showed that selectiverepeated-labeling improves the data quality directly and sub-stantially. Presumably, this could be helpful for many datamining applications.

This paper makes important assumptions that should bevisited in future work, in order for us to understand practicalrepeated-labeling and realize its full benefits.

• For most of the work we assumed that all the labelershave the same quality p and that we do not know p. Aswe showed briefly in Section 3.2.2, di!ering qualities com-plicates the picture. On the other hand, good estimatesof individual labelers’ qualities inferred by observing theassigned labels [6, 26, 28] could allow more sophisticatedselective repeated-labeling strategies.

• Intuitively, we might also expect that labelers wouldexhibit higher quality in exchange for a higher payment.It would be interesting to observe empirically how indi-vidual labeler quality varies as we vary CU and CL, andto build models that dynamically increase or decreasethe amounts paid to the labelers, depending on the qual-ity requirements of the task. Morrison and Cohen [18]determine the optimal amount to pay for noisy infor-mation in a decision-making context, where the amountpaid a!ects the level of noise.

• In our experiments, we introduced noise to existing,benchmark datasets. Future experiments, that use reallabelers (e.g., using Mechanical Turk) should give abetter understanding on how to better use repeated-labeling strategies in a practical setting. For example,in practice we expect labelers to exhibit di!erent levelsof noise and to have correlated errors; moreover, theremay not be su"ciently many labelers to achieve veryhigh confidence for any particular example.

• In our analyses we also assumed that the di"culty of la-beling an example is constant across examples. In reality,some examples are more di"cult to label than others andbuilding a selective repeated-labeling framework that ex-plicitly acknowledges this, and directs resources to moredi"cult examples, is an important direction for futurework. We have not yet explored to what extent tech-niques like LMU (which are agnostic to the di"culty of


[Figure 9: What not to do: data quality improvement for an entropy-based selective repeated-labeling strategy vs. round-robin repeated-labeling (labeling quality vs. number of labels, waveform, p = 0.6).]

[Figure 10: The data quality improvement of the four strategies (GRR, LU, MU, and LMU) for the waveform dataset; (a) p = 0.6, (b) p = 0.8.]

L_pos + 1, β = L_neg + 1. Thus, we set:

S_LU = min{ I_0.5(L_pos, L_neg), 1 − I_0.5(L_pos, L_neg) }     (4)

We compare selective repeated-labeling based on S_LU to round-robin repeated-labeling (GRR), which we showed to perform well in Section 4.2. To compare repeated-labeling strategies, we followed the experimental procedure of Section 4.2, with the following modification. Since we are asking whether label uncertainty can help with the selection of examples for which to obtain additional labels, each training example starts with three initial labels. Then, each repeated-labeling strategy iteratively selects examples for which it acquires additional labels (two at a time in these experiments).

Comparing selective repeated-labeling using S_LU (call that LU) to GRR, we observed similar patterns across all twelve data sets; therefore we only show the results for the waveform dataset (Figure 10; ignore the MU and LMU lines for now, we discuss these techniques in the next section), which are representative. The results indicate that LU performs substantially better than GRR, identifying the examples for which repeated-labeling is more likely to improve quality.

4.3.3 Using Model Uncertainty

A different perspective on the certainty of an example's label can be borrowed from active learning. If a predictive

Data Set      GRR    MU     LU     LMU
bmg           62.97  71.90  64.82  68.93
expedia       80.61  84.72  81.72  85.01
kr-vs-kp      76.75  76.71  81.25  82.55
mushroom      89.07  94.17  92.56  95.52
qvc           64.67  76.12  66.88  74.54
sick          88.50  93.72  91.06  93.75
spambase      72.79  79.52  77.04  80.69
splice        69.76  68.16  73.23  73.06
thyroid       89.54  93.59  92.12  93.97
tic-tac-toe   59.59  62.87  61.96  62.91
travelocity   64.29  73.94  67.18  72.31
waveform      65.34  69.88  66.36  70.24
average       73.65  78.77  76.35  79.46

Table 2: Average accuracies of the four strategies over the 12 datasets, for p = 0.6. For each dataset, the best performance is in boldface and the worst in italics.

model has high confidence in the label of an example, perhaps we should expend our repeated-labeling resources elsewhere.

• Model Uncertainty (MU) applies traditional active learning scoring, ignoring the current multiset of labels. Specifically, for the experiments below the model-uncertainty score is based on learning a set of models, each of which predicts the probability of class membership, yielding the uncertainty score:

S_MU = 0.5 − | (1/m) Σ_{i=1}^{m} Pr(+|x, H_i) − 0.5 |     (5)

where Pr(+|x, H_i) is the probability of classifying the example x into + by the learned model H_i, and m is the number of learned models. In our experiments, m = 10, and the model set is a random forest [4] (generated by WEKA).

Of course, by ignoring the label set, MU has the complementary problem to LU: even if the model is uncertain about a case, should we acquire more labels if the existing label multiset is very certain about the example's class? The investment in these labels would be wasted, since they would have a small effect on either the integrated labels or the learning.

• Label and Model Uncertainty (LMU) combines the two uncertainty scores to avoid examples where either model is certain. This is done by computing the score S_LMU as the geometric average of S_LU and S_MU. That is:

S_LMU = sqrt( S_MU · S_LU )     (6)
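A minimal sketch of these two scores, equations (5) and (6); the function names and the toy numbers are illustrative, not from the paper.

import numpy as np

def model_uncertainty(probs):
    """S_MU = 0.5 - |mean_i Pr(+|x, H_i) - 0.5| for an ensemble of m models."""
    return 0.5 - abs(np.mean(probs) - 0.5)

def label_and_model_uncertainty(s_lu, s_mu):
    """S_LMU: geometric mean of the label and model uncertainty scores."""
    return float(np.sqrt(s_lu * s_mu))

# an ensemble that is undecided about x, combined with the label uncertainty
# of a 3-vs-1 label multiset (S_LU = 0.1875)
s_mu = model_uncertainty([0.45, 0.55, 0.6, 0.4, 0.5, 0.52, 0.48, 0.5, 0.55, 0.45])
print(label_and_model_uncertainty(0.1875, s_mu))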

Figure 10 demonstrates the improvement in data quality when using model information. We can observe that the LMU model strongly dominates all other strategies. In high-noise settings (p = 0.6) MU also performs well compared to GRR and LU, indicating that when noise is high, using learned models helps to focus the investment in improving quality. In settings with low noise (p = 0.8), LMU continues to dominate, but MU no longer outperforms LU and GRR.

4.3.4 Model Performance with Selective ML

So, finally, let us assess whether selective repeated-labeling accelerates learning (i.e., improves model generalization performance, in addition to data quality). Again, experiments are conducted as described above, except here we compute


Page 65: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Donmez et al., 2009]

n  Unlike [Sheng et al., 2008], annotators can have different (unknown) qualities

n  IEThresh (Interval Estimate Threshold): a strategy to select the annotator with the highest estimated labeling quality

1.  Fit logistic regression to training data

2.  Pick  the  most  uncertain  unlabeled  instance  

⟨y_i, x_i⟩, i = 1, ..., I

x* = argmax_{x_i} ( 1 − max_{y ∈ {0,1}} P(y | x_i) )

P(y | x_i): a posteriori probability computed by the classifier

Page 66: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Donmez et al., 2009]

3.  For each annotator,
n  Compute if she/he agrees with majority voting

r_i^j = 1 if y_i^j = y_i^MV, 0 otherwise

n  Compute the mean and the sample standard deviation of the agreement, averaged over multiple objects

μ_j = E[r_i^j],   σ_j = std[r_i^j]

n  Compute the upper confidence interval of the annotator

UI_j = μ_j + t_{α/2}^{(I_j − 1)} · σ_j / √n

(t_{α/2}^{(I_j − 1)} is the critical value of the Student's t-distribution with I_j − 1 degrees of freedom)

Page 67: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Donmez et al., 2009]

4.  Choose  all  annotators  with  largest  upper  confidence  interval  

5.  Compute  the  majority  vote  of  the  selected  annotators  

6.  Update  training  data  

7.  Repeat 2-6

{ j | UI_j ≥ ε · max_j UI_j }     (step 4)

T = T ∪ ⟨y_i^MV, x*⟩     (step 6)

y_i^MV: majority vote of the selected annotators (step 5)
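A compact sketch of the annotator-selection step (steps 3 and 4 above), assuming SciPy for the Student's t critical value; the reward of an annotator on an object is 1 when he or she agreed with the majority vote, and the function names, the handling of barely observed annotators, and the toy history are all illustrative.

import numpy as np
from scipy import stats

def upper_interval(rewards, alpha=0.05):
    """UI_j = mean + t_{alpha/2, n-1} * std / sqrt(n) over an annotator's rewards."""
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    if n < 2:
        return np.inf          # keep exploring annotators we know nothing about
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return r.mean() + t_crit * r.std(ddof=1) / np.sqrt(n)

def select_annotators(reward_history, eps=0.9, alpha=0.05):
    """Keep every annotator whose upper interval is within a factor eps of the best."""
    ui = {j: upper_interval(r, alpha) for j, r in reward_history.items()}
    best = max(ui.values())
    return [j for j, u in ui.items() if u >= eps * best]

# annotator 0 agreed with the majority 9 times out of 10, annotator 1 only 4 times
history = {0: [1] * 9 + [0], 1: [1] * 4 + [0] * 6}
print(select_annotators(history))   # -> [0]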

Page 68: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Donmez et al., 2009]

n  Achieves a trade-off between
n  Exploration (at the beginning, to estimate annotator qualities)
n  Exploitation (once annotator qualities are estimated, ask the more reliable annotators)

[Figure 2: Number of times each oracle is queried vs. the true oracle accuracy, for the image and phoneme datasets. Each oracle corresponds to a single bar. Each bar is multicolored, where each color shows the relative contribution: blue corresponds to the first 10 iterations, green to an additional 40 iterations, and red to another additional 100 iterations. The bar height shows the total number of times an oracle is queried for labeling by IEThresh during the first 150 iterations.]

Table 1: Properties of six datasets used in the experiments. All are binary classification tasks with varying sizes.

Dataset    Size  +/- Ratio  Dimensions
image      2310  1.33       18
mushroom   8124  1.07       22
spambase   4601  0.65       57
phoneme    5404  0.41       5
ringnorm   7400  0.98       20
svmguide   3089  1.83       4

and to have consistent baselines. The annotator accuraciesand the size of each dataset is reported in Table 2.

We compared our method IEThresh against Repeated andRandom baselines on these two datasets. In contrast to theUCI data experiment, there is no training of classifiers forthis experiment. Instead, the test set predictions are madedirectly by AMT labelers. Hence, we randomly selected 50instances from each dataset to be used by IEThresh to inferestimates for the annotator accuracies. The remaining in-stances are held out as the test set. The annotator with thebest estimated accuracy is evaluated on the test set. Thetotal number of queries are then calculated as a sum of thenumber of queries issued during inference and the number ofqueries issued to the chosen annotator during testing. Re-peated and Random baselines do not need an inference phasesince they do not change their annotator selection mecha-nism via learning. Hence, they are directly evaluated on thetest set. The total number of queries is assigned comparablyfor IEThresh and Repeated; however, it is equal to the num-ber of test instances for the Random baseline since it queriesa single labeler for each instance; thus, there can only be asmany queries as the number of test instances.

4.2 Results

Figure 1 compares three methods on six datasets with

simulated oracles. The true accuracy of each oracle in Figure 1 is drawn uniformly at random from within the range

Table 2: The size and the annotator accuracies for each AMT dataset.

Data  Size  Annotator Accuracies
TEMP  190   0.44, 0.44, 0.54, 0.92, 0.92, 0.93
RTE   100   0.51, 0.51, 0.58, 0.85, 0.92

Table 3: Performance comparison on RTE data. The last column indicates the total number of queries issued to labelers by each method. IEThresh performs accurately with comparable labeling effort to Repeated.

Method     Accuracy  # Queries
IEThresh   0.92      252
Repeated   0.6       250
Random     0.64      50

[.5, 1]. The figure reports the average classification errorwith respect to the total number of oracle queries issuedby each method. IEThresh is the best performer in all sixdatasets. In ringnorm and spambase datasets, IEThresh ini-tially performs slightly worse than the other methods, indi-cating that oracle reliability requires more sampling in thesetwo datasets. But, after the estimates are settled (whichhappens in ! 200 queries), it outperforms the others, withespecially large margins in spambase dataset. The resultsreported are statistically significant based on a two-sidedpaired t-test, where each pair of points on the averaged re-sults is compared.

We also analyzed the e!ect of filtering less reliable oracles.An ideal filtering mechanism excludes the less accurate or-acles early in the process and samples more from the moreaccurate ones. In Figure 2, we report the number of timeseach oracle is queried on image and phoneme datasets. Thex-axis shows the true accuracy of each oracle. We considerthe first 150 iterations of IEThresh and count the numberof times each oracle is selected. Each color corresponds to adi!erent time frame; i.e. blue, green and red correspond to


Iteration counts: 41-150, 11-40, 1-10

Page 69: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Dekel and Shamir, 2009]

n  In  some  cases,  the  number  of  annotators  is  of  the  same  order  as  the  number  of  objects  to  annotate  

n  Majority voting cannot help

n  Estimating the annotator qualities might be problematic

n  Goal: prune low-quality annotators, when each one annotates at most one object

I / J = Θ(1)

Page 70: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Dekel and Shamir, 2009]

n  Consider a training set ⟨y_i, x_i⟩, i = 1, ..., I

n  Let f(w, x_i) denote a binary classifier that assigns a label in {0, 1} to x_i

n  Let h_j(x_i) denote a randomized classifier which represents the way annotator j labels data

n  Let S_j denote the set of objects annotated by annotator j

n  Prune away any annotator j for which

ε_j = ( Σ_{i ∈ S_j} 1[h_j(x_i) ≠ f(w, x_i)] ) / |S_j| > T

n  In words, the method prunes those annotators that are in disagreement with a classifier trained on the labels from all annotators
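A minimal sketch of this pruning rule; the data structures (plain dictionaries of labels per annotator) and names are illustrative, not the authors' implementation.

import numpy as np

def prune_annotators(annotator_labels, classifier_labels, threshold):
    """Keep the annotators whose disagreement rate with the classifier f is <= T.

    annotator_labels[j][i] -- label h_j(x_i) given by annotator j to object i
    classifier_labels[i]   -- label f(w, x_i) predicted by the classifier
                              trained on all annotators' labels
    """
    kept = []
    for j, labels in annotator_labels.items():
        eps_j = np.mean([labels[i] != classifier_labels[i] for i in labels])
        if eps_j <= threshold:          # prune only when eps_j > T
            kept.append(j)
    return kept

# annotator 2 contradicts the classifier on every object and is pruned (T = 0.5)
f_pred = {0: 1, 1: 0, 2: 1, 3: 1}
labels = {1: {0: 1, 1: 0, 2: 1}, 2: {0: 0, 1: 1, 3: 0}}
print(prune_annotators(labels, f_pred, threshold=0.5))   # -> [1]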

Page 71: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Dekel and Shamir, 2009]

[Figure: pruning example with three annotators and their disagreement rates]

ε_1 = 1/9,   ε_2 = 0,   ε_3 = 1

Page 72: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Raykar et al., 2010]

n  Consider a training set ⟨y_i, x_i⟩, i = 1, ..., I

n  Let f(w, x_i) denote a binary classifier that assigns a label in {0, 1} to x_i

n  Consider the family of linear classifiers

y_i = 1 if w^T x_i ≥ γ,   y_i = 0 otherwise

n  The probability of the positive class is modeled as a logistic sigmoid

P(y_i = 1 | x_i, w) = σ(w^T x_i),   σ(z) = 1 / (1 + e^{−z})

Page 73: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Raykar et al., 2010]

n  Similarly to [Welinder and Perona, 2010], annotator quality distinguishes between the true positive and true negative rates

n  Goal:
n  Given the observed labels and the feature vectors
n  Estimate the unknown parameters

P(y_i^j = 1 | y_i = 1) = α_1^j     (true positive rate of annotator j)

P(y_i^j = 0 | y_i = 0) = α_0^j     (true negative rate of annotator j)

θ = {w, α_1, α_0},   D = {x_i, y_i^1, ..., y_i^J}_{i=1}^{I}

Page 74: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Raykar et al., 2010]

n  The likelihood function of the parameters θ = {w, α_1, α_0} given the observations D = {x_i, y_i^1, ..., y_i^J}_{i=1}^{I} is factored as

P(D | θ) = Π_{i=1}^{I} P(y_i^1, ..., y_i^J | x_i, θ)
         = Π_{i=1}^{I} [ P(y_i^1, ..., y_i^J | y_i = 1, α_1) P(y_i = 1 | x_i, w)
                        + P(y_i^1, ..., y_i^J | y_i = 0, α_0) P(y_i = 0 | x_i, w) ]

Page 75: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Raykar et al., 2010]

P(D | θ) = Π_{i=1}^{I} [ P(y_i^1, ..., y_i^J | y_i = 1, α_1) P(y_i = 1 | x_i, w)
                        + P(y_i^1, ..., y_i^J | y_i = 0, α_0) P(y_i = 0 | x_i, w) ]

where

P(y_i^1, ..., y_i^J | y_i = 1, α_1) = Π_{j=1}^{J} P(y_i^j | y_i = 1, α_1^j) = Π_{j=1}^{J} [α_1^j]^{y_i^j} [1 − α_1^j]^{1 − y_i^j}

P(y_i^1, ..., y_i^J | y_i = 0, α_0) = Π_{j=1}^{J} P(y_i^j | y_i = 0, α_0^j) = Π_{j=1}^{J} [α_0^j]^{1 − y_i^j} [1 − α_0^j]^{y_i^j}

P(y_i = 1 | x_i, w) = σ(w^T x_i),   P(y_i = 0 | x_i, w) = 1 − σ(w^T x_i)

Page 76: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Raykar et al., 2010]

n  The parameters are found by maximizing the log-likelihood function

{α̂_1, α̂_0, ŵ} = argmax_θ log P(D | θ)

n  The solution is based on Expectation-Maximization

n  Expectation step

μ_i = P(y_i = 1 | y_i^1, ..., y_i^J, x_i, θ)
    ∝ P(y_i^1, ..., y_i^J | y_i = 1, θ) P(y_i = 1 | x_i, θ)
    = a_{1,i} p_i / ( a_{1,i} p_i + a_{0,i} (1 − p_i) )

where

p_i = σ(w^T x_i)
a_{1,i} = Π_{j=1}^{J} [α_1^j]^{y_i^j} [1 − α_1^j]^{1 − y_i^j}
a_{0,i} = Π_{j=1}^{J} [α_0^j]^{1 − y_i^j} [1 − α_0^j]^{y_i^j}

Page 77: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Raykar et al., 2010]

n  Maximization step
n  False positive and false negative rates can be estimated in closed form

α_0^j = Σ_{i=1}^{I} (1 − μ_i)(1 − y_i^j) / Σ_{i=1}^{I} (1 − μ_i)

α_1^j = Σ_{i=1}^{I} μ_i y_i^j / Σ_{i=1}^{I} μ_i

n  The classifier w can be estimated by means of gradient ascent

w_{t+1} = w_t − η H^{−1} g
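A compact NumPy sketch of the whole EM procedure from the last few slides. For simplicity the classifier weights are updated with plain gradient ascent rather than the Newton-style step η H^{-1} g shown above, and the initialization, step sizes, and names are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def em_learn_from_crowds(X, Y, n_iter=50, lr=0.1, inner=100):
    """Two-coin model: X is (I, D) features, Y is (I, J) binary labels from J annotators.

    Returns the classifier weights w, the per-annotator rates alpha1 = P(y^j=1 | y=1)
    and alpha0 = P(y^j=0 | y=0), and the posterior label probabilities mu.
    """
    I, D = X.shape
    w = np.zeros(D)
    mu = Y.mean(axis=1)                          # initialize with the soft majority vote
    for _ in range(n_iter):
        # M-step: alphas in closed form, w by gradient ascent on the expected log-likelihood
        alpha1 = (mu @ Y) / mu.sum()
        alpha0 = ((1 - mu) @ (1 - Y)) / (1 - mu).sum()
        for _ in range(inner):
            w += lr * X.T @ (mu - sigmoid(X @ w)) / I
        # E-step: posterior probability that the true label of each object is 1
        p = sigmoid(X @ w)
        a1 = np.prod(alpha1 ** Y * (1 - alpha1) ** (1 - Y), axis=1)
        a0 = np.prod(alpha0 ** (1 - Y) * (1 - alpha0) ** Y, axis=1)
        mu = a1 * p / (a1 * p + a0 * (1 - p) + 1e-12)
    return w, alpha1, alpha0, mu

A bias term can be handled by appending a constant column to X before calling the function.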

Page 78: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Raykar et al., 2010]

n  Log-odds

logit(μ_i) = log [ μ_i / (1 − μ_i) ]
           = log [ P(y_i = 1 | y_i^1, ..., y_i^J, x_i, θ) / P(y_i = 0 | y_i^1, ..., y_i^J, x_i, θ) ]
           = c + w^T x_i + Σ_{j=1}^{J} y_i^j [ logit(α_1^j) + logit(α_0^j) ]

Contribution of the classifier: w^T x_i
Contribution of the annotators: weighted linear combination of the labels from all annotators

Page 79: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Raykar et al., 2010]


[Figure 1: Results for the digital mammography data set with annotations from 5 simulated radiologists. (a) The ROC curve of the learnt classifier using the golden ground truth (AUC = 0.915), the majority voting scheme (AUC = 0.882), and the proposed EM algorithm (AUC = 0.913). (b) The ROC curve for the estimated true labels: proposed EM algorithm AUC = 0.991, majority voting baseline AUC = 0.962. The actual sensitivity and specificity of each of the radiologists is marked; the ends of the majority-voting and EM curves show the corresponding estimates of sensitivity and specificity, and the ellipse plots the contour of one standard deviation.]


Page 80: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Raykar et al., 2010]

n  Extensions
n  Bayesian approach, with priors on the true positive and true negative rates
n  Adoption of different types of classifiers
n  Multi-class classification: y_i ∈ {l_1, ..., l_K}
n  Ordinal regression: y_i ∈ {l_1, ..., l_K}, l_1 < ... < l_K
n  Regression: y_i ∈ ℝ

Page 81: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Yan et al., 2010b]

n  Setting similar to [Raykar et al., 2010], with two main differences

n  No distinction between true positive and true negative rates

n  The  quality  of  the  annotator  is  dependent  on  the  object  

 

α_1^j = α_0^j,   j = 1, ..., J

α^j(x) = 1 / (1 + e^{−(w^j)^T x})

Page 82: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Yan et al., 2010b]

n  Log-odds

logit(μ_i) = log [ μ_i / (1 − μ_i) ] = w^T x_i + Σ_{j=1}^{J} (−1)^{(1 − y_i^j)} (w^j)^T x_i
           = c + w^T x_i + Σ_{j=1}^{J} y_i^j (w^j)^T x_i

Contribution of the annotators: weighted linear combination of the labels from all annotators; the weights depend on the object difficulty

Page 83: Crowdsourcing for Multimedia Retrieval

+ Aggregating and learning [Yan et al., 2011]

n  Active learning from crowds
n  Which training point to pick?

n  Pick the example that is closest to the classifier separating hyperplane

n  Which  expert  to  pick?  

 

i* = argmin_i | w^T x_i |

j* = argmin_j 1 / ( 1 + e^{−(w^j)^T x_{i*}} )
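A minimal sketch of these two selection rules; W_annot stacks the per-annotator weight vectors w^j from the previous slides, and the names are illustrative.

import numpy as np

def pick_instance_and_annotator(w, W_annot, X_pool):
    """Return the index of the most uncertain instance (closest to the hyperplane)
    and the index of the annotator minimizing the logistic score on that instance,
    following the two selection rules above."""
    i_star = int(np.argmin(np.abs(X_pool @ w)))
    scores = 1.0 / (1.0 + np.exp(-(W_annot @ X_pool[i_star])))
    j_star = int(np.argmin(scores))
    return i_star, j_star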

Page 84: Crowdsourcing for Multimedia Retrieval

+ Crowdsourcing at work

Page 85: Crowdsourcing for Multimedia Retrieval

+ CrowdSearch [Yan et al., 2010]

n  CrowdSearch  combines  n  Automated  image  search  

n  Local processing on mobile phones + backend processing
n  Real-time human validation of search results

n  Amazon  Mechanical  Turk  

n  Studies the trade-off in terms of
n  Delay
n  Accuracy
n  Cost

man error and bias to maximize accuracy. To balance thesetradeo!s, CrowdSearch uses an adaptive algorithm that usesdelay and result prediction models of human responses to ju-diciously use human validation. Once a candidate image isvalidated, it is returned to the user as a valid search result.

3. CROWDSOURCING FOR SEARCH

In this section, we first provide a background of the Amazon Mechanical Turk (AMT). We then discuss several design choices that we make while using crowdsourcing for image validation including: 1) how to construct tasks such that they are likely to be answered quickly, 2) how to minimize human error and bias, and 3) how to price a validation task to minimize delay.

Background: We now provide a short primer on theAMT, the crowdsourcing system that we use in this work.AMT is a large-scale crowdsourcing system that has tensof thousands of validators at any time. The key benefit ofAMT is that it provides public APIs for automatic postingof tasks and retrieval of results. The AMT APIs enable usto post tasks and specify two parameters: (a) the number ofduplicates, i.e. the number of independent validators whowe want to work on the particular task, and (b) the rewardthat a validator obtains for providing responses. A validatorworks in two phases: (a) they first accept a task once theyidentify that they would like to work on it, which in turndecrements the number of available duplicates, and (b) onceaccepted, they need to provide a response within a periodspecified by the task.

One constraint of the AMT that pertains to CrowdSearchis that the number of duplicates and reward for a task thathas been posted cannot be changed at a later point. We takethis practical limitation in mind in designing our system.

Constructing Validation Tasks: How can we constructvalidation tasks such that they are answered quickly? Ourexperience with AMT revealed several insights. First, we ob-served that asking people to tag query images and candidateimages directly is not useful since: 1) text tags from crowd-sourcing systems are often ambiguous and meaningless (sim-ilar conclusions have been reported by other crowdsourcingstudies [8]), and 2) tasks involving tagging are unpopular,hence they incur large delay. Second, we found that havinga large validation task that presents a number of <queryimage, candidate image> pairs enlarges human error andbias since a single individual can bias a large fraction of thevalidation results.

We settled on an a simple format for validation tasks.Each <query image, candidate image> pair is packaged intoa task, and a validator is required to provide a simple YESor NO answer: YES if the two images are correctly matched,and NO otherwise. We find that these tasks are often themost popular among validators on AMT.

Minimizing Human Bias and Error: Human error andbias is inevitable in validation results, therefore a centralchallenge is eliminating human error to achieve high accu-racy. We use a simple strategy to deal with this problem:we request several duplicate responses for a validation taskfrom multiple validators, and aggregate the responses usinga majority rule. Since AMT does not allow us to dynami-cally change the number of duplicates for a task, we fix thisnumber for all tasks. In §7.2, we evaluate several aggrega-tion approaches, and show that a majority of five duplicates


Figure 2: Shown are an image search query, candidate images, and duplicate validation results. Each validation task is a Yes/No question about whether the query image and candidate image contain the same object.

is the best strategy and consistently achieves more than 95% search accuracy.

Pricing Validation Tasks: Crowdsourcing systems allowus to set a monetary reward for each task. Intuitively, ahigher price provides more incentive for human validators,and therefore can lead to lower delay. This raises the fol-lowing question: is it better to spend X cents on a singlevalidation task or to spread it across X validation tasks ofprice one cent each? We find that it is typically better tohave more tasks at a low price than fewer tasks at a highprice. There are three reasons for this behavior: 1) since alarge fraction of tasks on the AMT o!er a reward of only onecent, the expectation of users is that most tasks are quickand low-cost, 2) crowdsourcing systems like the AMT havetens of thousands of human validators, hence posting moretasks reduces the impact of a slow human validator on over-all delay, and 3) more responses allows better aggregationto avoid human error and bias. Our experiments with AMTshow that the first response in five one cent tasks is 50 - 60%faster than a single five cent task, confirming the intuitionthat delay is lower when more low-priced tasks are posted.

4. CROWDSEARCH ALGORITHM

Given a query image and a ranked list of candidate images, the goal of human validation is to identify the correct candidate images from the ranked list. Human validation improves search accuracy, but incurs monetary cost and human processing delay. We first discuss these tradeoffs and then describe how CrowdSearch optimizes overall cost while returning at least one valid candidate image within a user-specified deadline.

4.1 Delay-Cost Tradeoffs

Before presenting the CrowdSearch algorithm, we illustrate the tradeoff between delay and cost by discussing posting schemes that optimize one or the other but not both.

Parallel posting to optimize delay: A scheme that optimizes delay would post all candidate images to the crowdsourcing system at the same time. (We refer to this as parallel posting.) While parallel posting reduces delay, it is expensive in terms of monetary cost. Figure 2 shows an instance where the image search engine returns four candi-

Page 86: Crowdsourcing for Multimedia Retrieval

+ CrowdSearch [Yan et al., 2010]

n  Delay-cost trade-offs
n  Parallel posting
n  Minimizes delay
n  Expensive in terms of monetary cost
n  Serial posting
n  Posts top-ranked candidates for validation
n  Cheap in terms of monetary cost
n  Much higher delay

n  Adaptive strategy → CrowdSearch

Page 87: Crowdsourcing for Multimedia Retrieval

+ CrowdSearch [Yan et al., 2010]

n  Example: a candidate image has received the sequence of responses S_i = {'Y', 'N'}

n  Enumerate all sequences of responses that extend it, i.e.,
S_i^(1) = {'Y', 'N', 'Y'}
S_i^(2) = {'Y', 'N', 'N'}
S_i^(3) = {'Y', 'N', 'Y', 'Y'}
...

n  For each sequence, estimate
n  The probability of observing S_i^(j) given S_i
n  Whether it would lead to success under majority voting
n  The probability of obtaining the responses before the deadline

n  Estimate the probability of success. If P_succ < τ, post a new candidate

Page 88: Crowdsourcing for Multimedia Retrieval

+ CrowdSearch [Yan et al., 2010]

n  Predicting validation results
n  Training:
n  Enumerate all sequences of fixed length (e.g., five)
n  Compute empirical probabilities

n  Example:
n  Observed sequence
n  Sequences that lead to positive results

[Figure 5: A SeqTree to predict validation results. The received sequence is 'YNY'; the two sequences that lead to positive results are 'YNYNY' and 'YNYY'. The probability that 'YNYY' occurs given receiving 'YNY' is 0.16/0.25 = 64%.]
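A toy sketch of the prediction behind the SeqTree: given empirical probabilities of full five-response sequences estimated during training, compute the probability that a candidate currently at prefix 'YNY' ends up accepted by majority vote. It ignores the early-termination and deadline modeling of the real system, and all names and numbers are made up.

def success_probability(prefix, seq_probs, n_labels=5):
    """P(majority says YES | responses so far), from empirical sequence probabilities."""
    consistent = {s: p for s, p in seq_probs.items() if s.startswith(prefix)}
    total = sum(consistent.values())
    if total == 0:
        return 0.0
    winning = sum(p for s, p in consistent.items() if s.count('Y') > n_labels // 2)
    return winning / total

probs = {'YNYYY': 0.10, 'YNYYN': 0.06, 'YNYNY': 0.05, 'YNYNN': 0.04,
         'YNNNN': 0.15, 'YYYYY': 0.60}
print(success_probability('YNY', probs))   # 0.84 with these toy numbers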

5.1 Image Search Overview

The image search process contains two major steps: 1) extracting features from the query image, and 2) searching through database images with the features of the query image.

Extracting features from the query image: There are many good features to represent images, such as the Scale-Invariant Feature Transform (SIFT) [9]. While these features capture essential characteristics of images, they are not directly appropriate for search because of their large size. For instance, SIFT features are 128-dimensional vectors and there are several hundred such SIFT vectors for a VGA image. The large size makes it 1) unwieldy and inefficient for search since the data structures are large, and 2) inefficient for communication since no compression gains are achieved by locally computing SIFT features on the phone.

A canonical approach to reduce the size of features is to reduce the dimensionality by clustering. This is enabled by a lookup structure called a "vocabulary tree" that is constructed in an a priori manner by hierarchical k-means clustering of SIFT features of a training dataset. For example, a vocabulary tree for buildings can be constructed by collecting thousands of training images, extracting their SIFT features and using k-means clustering to build the tree. A vocabulary tree is typically constructed for each category of images, such as faces, buildings, or book covers.

Searching through database images: Once visterms are extracted from an image, they can be used in a manner similar to keywords in text retrieval [2]. The search process uses a data structure called the inverted index that is constructed from the corpus of images in the database. The inverted index is basically a mapping from each visterm to the images in the database containing that visterm. Each visterm is also associated with an inverted document frequency (IDF) score that describes its discriminating power. Given a set of visterms for a query image, the search process is simple: for each visterm, the image search engine looks up the inverted index and computes an IDF score for each of the candidate images. The list of candidates is returned ranked in order of their IDF score.
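The inverted-index lookup just described can be sketched as follows; this is illustrative code under the assumption of integer visterm ids, not the authors' implementation.

import math
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # visterm -> ids of database images containing it
        self.n_images = 0

    def add_image(self, image_id, visterms):
        self.n_images += 1
        for v in set(visterms):
            self.postings[v].add(image_id)

    def search(self, query_visterms, top_k=5):
        scores = defaultdict(float)
        for v in set(query_visterms):
            images = self.postings.get(v)
            if not images:
                continue
            idf = math.log(self.n_images / len(images))   # discriminating power of the visterm
            for image_id in images:
                scores[image_id] += idf
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Example: index.add_image("img_001", [12, 47, 47, 901]); index.search([12, 901])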

5.2 Implementation Tradeoffs

There are two key questions that arise in determining how to split image search functionality between the mobile phone and the remote server. The first question is whether visterm extraction should be performed on the mobile phone or the remote server. Since visterms are very compact, transmitting visterms from a phone as opposed to the raw image can save time and energy, particularly if more expensive 3G communication is used. However, visterm computation can incur significant delay on the phone due to its resource constraints. In order to reduce this delay, one would need to trade off the resolution of the visterms, thereby impacting search accuracy. Thus, using local computation to extract visterms from a query image saves energy but sacrifices accuracy. Our system chooses the best option for visterm extraction depending on the availability of WiFi connectivity. If only 3G connectivity is available, visterm extraction is performed locally, whereas if WiFi connectivity is available, the raw image is transferred quickly over the WiFi link and visterm extraction is performed at the remote server.

The second question is whether inverted index lookup should be performed on the phone or the remote server. There are three reasons to choose the latter option: 1) since visterms are already extremely compact, the benefit of performing inverted index lookup on the phone is limited, 2) having a large inverted index and the associated database images on the phone is often not feasible, and 3) having the inverted index on the phone makes it harder to update the database to add new images. For all these reasons, we choose to use a remote server for inverted index lookup.

6. SYSTEM IMPLEMENTATION

The CrowdSearch system is implemented on Apple iPhones and a backend server at UMass. The component diagram of the CrowdSearch system is shown in Figure 6.

iPhone Client: We designed a simple user interface for mobile users to capture query images and issue a search query. A screenshot of the user interface is shown in Figure 1. A client can provide an Amazon payments account to facilitate the use of AMT and pay for validation. There is also a free mode where validation is not performed and only the image search engine results are provided to the user.

To support local image processing on the iPhone, we ported an open-source implementation of the SIFT feature extraction algorithm [26] to the iPhone. We also implemented a vocabulary tree lookup algorithm to convert from SIFT features to visterms. While vocabulary tree lookup is fast and takes less than five seconds, SIFT takes several minutes to process a VGA image due to the lack of floating point support on the iPhone. To reduce the SIFT running time, we tune SIFT parameters to produce fewer SIFT features from an image. This modification comes at the cost of reduced accuracy for image search but reduces SIFT running time on the phone to less than 30 seconds. Thus, the overall computation time on the iPhone client is roughly 35 seconds.

When the client is connected to the server, it also receives updates, such as an updated vocabulary tree or new deadline recommendations.

CrowdSearch Server Implementation: The CrowdSearch server comprises two major components: the automated image search engine and the validation proxy. The image search engine generates a ranked list of candidate images for each

Example (cf. Figure 5): observed sequence S_i = {'Y', 'N', 'Y'}

P({'Y', 'N', 'Y', 'Y'}) = 0.16/0.25

P({'Y', 'N', 'Y', 'N', 'Y'}) = 0.03/0.25

Page 89: Crowdsourcing for Multimedia Retrieval

+  CrowdSearch  [Yan  et  al.,  2010]  

n  Delay prediction

(a) Overall delay model  (b) Inter-arrival delay model

Figure 3: Delay models for overall delay and inter-arrival delay. The overall delay is decomposed into acceptance and submission delay.

From our inter-arrival delay model, we know that all inter-arrival times are independent. Thus, we can express the probability density function of Y_{i,j} as the convolution of the inter-arrival times of the response pairs from i to j.

Before applying the convolution, we first need to consider the condition Y_{i,i+1} ≥ t − t_i. This condition can be removed by applying the law of total probability. We sum over all possible values of Y_{i,i+1}, noting that the lower bound is t − t_i. For each Y_{i,i+1} = t_x, the remaining part of Y_{i,j}, i.e. Y_{i+1,j}, must be at most D − t_i − t_x. Thus the condition on Y_{i,i+1} can be removed and we have:

P_ij = Σ_{t_x = t − t_i}^{D − t_i} P(Y_{i,i+1} = t_x) · P(Y_{i+1,j} ≤ D − t_i − t_x)    (4)

Now we can apply the convolution directly to Y_{i+1,j}. Let f_{i,j}(t) denote the PDF of the inter-arrival time between responses i and j. The PDF of Y_{i+1,j} can be expressed as:

f_{i+1,j}(t) = (f_{i+1,i+2} ∗ · · · ∗ f_{j−1,j})(t)

Combining this with Equation 4, we have:

P_ij = Σ_{t_x = t − t_i}^{D − t_i} f_{i,i+1}(t_x) · Σ_{t_y = 0}^{D − t_i − t_x} f_{i+1,j}(t_y)    (5)

Now the probability we want to predict has been expressed in terms of the PDFs of the inter-arrival times. Our delay models capture the distribution of all inter-arrival times that we need for computing the above probability: we use the delay model for the first response when i = 0, and the inter-arrival model for adjacent responses when i > 0. Therefore, we can predict the delay of receiving any remaining responses given the time at which partial responses were received.
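A numerical sketch of Equation (5), assuming the exponential inter-arrival models summarized further below and a one-second discretization; the rates and times are invented for illustration and this is not the authors' code.

import math

HORIZON = 600  # seconds; all PDFs are discretized into one-second bins

def exp_pdf(rate):
    """Discretized exponential inter-arrival PDF over [0, HORIZON] seconds."""
    return [rate * math.exp(-rate * t) for t in range(HORIZON + 1)]

def convolve(f, g):
    """Discrete convolution of two per-second PDFs (lists of equal length)."""
    return [sum(f[u] * g[t - u] for u in range(t + 1)) for t in range(len(f))]

def p_arrival_by_deadline(f_next, f_rest, t, t_i, deadline):
    """P_ij from Equation (5): responses i+1..j all arrive before the deadline D."""
    total = 0.0
    for t_x in range(t - t_i, deadline - t_i + 1):
        inner = sum(f_rest[: deadline - t_i - t_x + 1])   # P(Y_{i+1,j} <= D - t_i - t_x)
        total += f_next[t_x] * inner
    return total

# Example: f_{i,i+1} exponential with an assumed mean of 40 s, and f_{i+1,j}
# obtained by convolving two such inter-arrival PDFs (i.e. j = i + 3).
f_next = exp_pdf(1 / 40.0)
f_rest = convolve(exp_pdf(1 / 40.0), exp_pdf(1 / 40.0))
print(p_arrival_by_deadline(f_next, f_rest, t=60, t_i=30, deadline=300))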

4.4 Predicting Validation Results

Having presented the delay model, we discuss how to predict the actual content of the incoming responses, i.e., whether each response is a Yes or a No. Specifically, given that we have received a sequence Si, we want to compute the probability of occurrence of each possible sequence Sj that starts with Si, such that the validation result is positive, i.e., majority(Sj) = Yes.

This prediction can easily be done using a sufficiently large training dataset to study the distribution of all possible result sequences. For the case where the number of duplicates is set to 5, there are 2^5 = 32 different sequence combinations. We can compute the probability that each sequence occurs in the training set by counting the number of its occurrences. We use this probability distribution as our model for predicting validation results. We use the probabilities to construct a probability tree called the SeqTree.

Figure 5 shows an example of a SeqTree. It is a binary tree whose leaf nodes are the sequences of length 5. Two leaf nodes that differ only in the last bit have a common parent node whose sequence is the common substring of the two leaf nodes. For example, nodes 'YNYNN' and 'YNYNY' have a parent node 'YNYN'. The probability of a parent node is the sum of the probabilities of its children. Following this rule, the SeqTree is built, where each node Si is associated with a probability pi that its sequence occurs.

Given the tree, it is easy to predict the probability that Sj occurs given the partial sequence Si. Simply find the nodes that correspond to Si and Sj respectively; the probability we want is pj/pi.
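A compact way to realize the SeqTree idea is to store the empirical probability of every response-sequence prefix seen in training; the sketch below is an illustration with made-up training data, not the paper's data structure or dataset.

from collections import Counter

def build_seqtree(training_sequences):
    """Map every prefix of the training sequences to its empirical probability."""
    counts = Counter()
    for seq in training_sequences:                # e.g. "YNYYN"
        for k in range(1, len(seq) + 1):
            counts[seq[:k]] += 1
    n = len(training_sequences)
    return {prefix: c / n for prefix, c in counts.items()}

def p_given_prefix(seqtree, s_j, s_i):
    """P(Sj occurs | partial sequence Si received) = p_j / p_i."""
    return seqtree.get(s_j, 0.0) / seqtree[s_i]

# Toy usage, analogous to the 0.16/0.25 example in Figure 5:
tree = build_seqtree(["YNYYN", "YNYYY", "YNYNY", "YYNNN", "YNYNN"] * 20)
print(p_given_prefix(tree, "YNYY", "YNY"))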

5. IMAGE SEARCH ENGINE

In this section, we briefly introduce the automated image search engine. Our search engine is designed using image search methods that have been described in prior work, including our own [30]. The fundamental idea of the image search engine is to use a set of compact image representations called visterms (visual terms) for efficient communication and search. The compactness of visterms makes them attractive for mobile image search, since they can be communicated from the phone to a remote search engine server at extremely low energy cost. However, extracting visterms from images incurs significant computation overhead and delay at the phone. In this section, we provide an overview of the image search engine, and focus on explaining the trade-offs that are specific to using it on resource-constrained mobile phones.

Acceptance:     f_a(t) = λ_a e^(−λ_a (t − c_a))

Submission:     f_s(t) = λ_s e^(−λ_s (t − c_s))

Overall:        f_o(t) = f_a(t) ∗ f_s(t)

Inter-arrival:  f_i(t) = λ_i e^(−λ_i t)

Page 90: Crowdsourcing for Multimedia Retrieval

+  CrowdSearch  [Yan  et  al.,  2010]  

Figure 7: Precision of automated image search over four categories of images. These four categories cover the spectrum of the precision of automated search.

faces and flowers. The precision drops significantly as the length of the ranked list grows, indicating that even top-ranked images suffer from a high error rate. Therefore, we cannot present the results directly to users.

We now evaluate how much human validation can improve image search precision. Figure 8 plots four different schemes for human validation: first-response, majority(3), majority(5), and one-veto (i.e., complete agreement among validators). In each of these cases, the human-validated search scheme returns only the candidate images on the ranked list that are deemed to be correct. Automated image search simply returns the top five images on the ranked list.

The results reveal two key observations. First, there is considerable improvement in precision irrespective of which strategy is used. All four validation schemes are considerably better than automated search. For face images, even using a single human validation improves precision by 3 times, whereas the use of a majority(5) scheme improves precision by 5 times. Even for book cover images, majority(5) still improves precision by 30%. In fact, the precision using human validators is also considerably better than the top-ranked response from the automatic search engine. Second, among the four schemes, human validation with majority(5) is easily the best performer and consistently provides accuracy greater than 95% for all image categories. Majority(3) is a close second, but its precision on face and building images is less than 95%. The one-veto scheme also cannot reach 95% precision for face, flower and building images. Using the first response gives the worst precision as it is affected most by human bias and error. Based on the above observations, we conclude that for mobile users who care about search precision, majority(5) is the best validation scheme.
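The four validation criteria compared in Figure 8 reduce to simple aggregation rules over the received 'Y'/'N' answers; the following is an illustrative sketch, not the authors' code.

def first_response(responses):
    """Accept the candidate if the first validator answered 'Y'."""
    return responses[0] == "Y"

def majority(responses, k):
    """Accept if a strict majority of the first k answers is 'Y' (majority(3), majority(5))."""
    votes = responses[:k]
    return votes.count("Y") * 2 > len(votes)

def one_veto(responses):
    """Accept only under complete agreement: a single 'N' rejects the candidate."""
    return all(r == "Y" for r in responses)

responses = ["Y", "N", "Y", "Y", "N"]
print(first_response(responses), majority(responses, 3), majority(responses, 5), one_veto(responses))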

7.3 Accuracy of Delay Models

Figure 8: Precision of automated image search and human validation with four different validation criteria.

The inter-arrival time models are central to the CrowdSearch algorithm. We obtain the parameters of the delay models using the training dataset, and validate the parameters against the testing dataset. Both datasets are described in §7.1. We validate the following five models: the arrival of the first response, and the inter-arrival times between adjacent responses, from the 1st-to-2nd response up to the 4th-to-5th response (§4.3.1). In this and the following experiments, we set the threshold to post the next task to 0.6. In other words, if the probability that at least one of the existing validation tasks is successful is less than 0.6, a new task is triggered.

Figure 9(a) shows the cumulative distribution function (CDF) for the first response. As described in §4.3.1, this model is derived by the convolution of the acceptance time and submission time distributions. We show that the model parameters for the acceptance, the submission, as well as the total delay of the first response fit the testing data very well. Figure 9(b) shows the CDF of two of the inter-arrival times: between the 1st and 2nd responses, and between the 3rd and 4th responses. (The other inter-arrival times are not shown to avoid clutter.) The scatter points are for the testing dataset and the solid line curves are for our model. Again, the model fits the testing data very well.

While the results were shown visually in Figure 9, we quantify the error between the actual and predicted distributions using the K-L divergence metric in Table 1. The K-L divergence, or relative entropy, measures the distance between two distributions [3] in bits. Table 1 shows that the distance between our model and the actual data is less than 5 bits for all the models, which is very small. These values are all negative, which indicates that the predicted delay of our model is a little larger than the actual delay. This observation indicates that our models are conservative in the sense that they would rather post more tasks than miss the deadline requirement.
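For reference, the K-L divergence used above can be computed as follows; the two histograms here are placeholders, not the paper's measured distributions.

import math

def kl_divergence_bits(p_empirical, q_model):
    """D_KL(P || Q) in bits, for discrete distributions over the same delay bins."""
    return sum(p * math.log2(p / q) for p, q in zip(p_empirical, q_model) if p > 0)

print(kl_divergence_bits([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))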

The results from Figure 9 show that the model parameters remain stable over time and can be used for prediction. In addition, they show that our model provides an excellent approximation of user behavior on a large-scale crowdsourcing system such as AMT.

7.4 CrowdSearch Performance

In this section, we evaluate the CrowdSearch algorithm on its ability to meet a user-specified deadline while maximizing accuracy and minimizing overall monetary cost. We compare the performance of CrowdSearch against two schemes: parallel posting and serial posting, described in §4.1. Parallel posting posts all five candidate results at the same time,

Page 91: Crowdsourcing for Multimedia Retrieval

+  

The CUbRIK project

n  36-month large-scale integrating project

n  Partially funded by the European Commission's 7th Framework ICT Programme for Research and Technological Development

n  www.cubrikproject.eu

Page 92: Crowdsourcing for Multimedia Retrieval

+  Objectives [Fraternali et al., 2012]

n  The technical goal of CUbRIK is to build an open search platform grounded on four objectives:
   n  Advance the architecture of multimedia search
   n  Place humans in the loop
   n  Open the search box
   n  Start up a search business ecosystem

Page 93: Crowdsourcing for Multimedia Retrieval

+  Objective: Advance the architecture of multimedia search

n  Multimedia search: the coordinated result of three main processes:

n  Content processing: acquisition, analysis, indexing and knowledge extraction from multimedia content

n  Query processing: derivation of an information need from a user and production of a sensible response

n  Feedback processing: quality feedback on the appropriateness of search results

Page 94: Crowdsourcing for Multimedia Retrieval

+  Objective: Advance the architecture of multimedia search

n  Objective:
   n  Content processing, query processing and feedback processing phases will be implemented by means of independent components

n  Components are organized in pipelines

n  Each application defines ad-hoc pipelines that provide unique multimedia search capabilities in that scenario

Page 95: Crowdsourcing for Multimedia Retrieval

+  Objective: Humans in the loop

n  Problem: the uncertainty of analysis algorithms leads to low confidence results and conflicting opinions on automatically extracted features

n  Solution: humans have superior capacity for understanding the content of audiovisual material
   n  State of the art: humans replace automatic feature extraction processes (human annotations)

n  Our contribution: integration of human judgment and algorithms
   n  Goal: improve the performance of multimedia content processing

Page 96: Crowdsourcing for Multimedia Retrieval

+  CUbRIK  architecture  

!"#$%&'##()$*+,-./%% 01234$56174%.8943219:83%"2;<469%

Version 1.0 - 27 June 2011 Page 10 of 102

9=4% 819>24% ;?% 9=4% @>79:@4A:1% 54126=% 915B5% 9;% 9=4% @;59% 1CC2;C2:194% =>@18% :8942169:;8%@46=18:5@5D%%

*+,-./%E:77%1AA2455%9=4%C2;F74@%;?%;C9:@:G:83%@>79:@4A:1%54126=%915B%4H46>9:;8%;8%9;C%;?%4H:59:83%5;6:17% 849E;2B:83% 592>69>245D% I=4% 6=1774834% :5% 9;% 4HC7;:9% 67155:617% 1CC2;16=45% ;?% 5;6:17% 849E;2B%1817J5:5% K4D3DL% 6489217:9J% 1817J5:5L% 6;=45:M4% 5>F$32;>C% :A489:?:619:;8L% 2;74% :8?424864N% :8% ;2A42% 9;%;C9:@:G4% 9=4%C42?;2@1864%;?%62;E%5;>26:83%@>79:@4A:1%54126=%915B5D%I=4%3484217%3;17% :5% 9;%4H92169%?2;@%5;6:17%849E;2B5%;?%A:??42489%B:8A5%K3484217%C>2C;54L%4894291:8@489L%C2;?455:;817L%496N%9=4%169;25%9=19%124%@;59% 7:B47J% 9;%C42?;2@%1% 915B%E:9=%@1H:@>@%O>17:9J%18A% :8%@:8:@>@%9:@4D%P73;2:9=@5%?;2%4M17>19:83% 9=4% 1??:8:9J% ;?% 915B5% 9;% C;9489:17% =>@18% 155:38445L% 9;% @1H:@:G4% 9=4% M47;6:9J% ;?% 915B%5C241A:83% 9=2;>3=%M159% 62;EA5L% ?;2%4M17>19:83% 9=4% 24C>919:;8%18A% 92>59%;?% 618A:A194%155:38445L% ?;2%A49469:83%@17:6:;>5%F4=1M:;>25%K4D3D%5C1@L%?21>AN%844A%9;%F4%A45:384A%18A%6>59;@:G4A%9;%9=4%5C46:?:6%6;894H9%;?%@>79:@4A:1%54126=D%

B1.1.5. Realising CUBRIK

I=4% 1CC2;16=% C>25>4A% FJ% *+,-./% :5% 9=4% A4M47;C@489% ;?% 18% !"#$%&!'()#* +(,-#.!(/% ?;2% 54126=%1CC7:619:;85% A45:38425L% 946=8;7;3J% C2;M:A425L% 18A% 6;89489% ;E8425% 9;% @18134% 9=4% 6;@C74H:9J% ;?%6;8592>69:83%18A%4M;7M:83%56171F74L%@>79:@;A17L%2417$9:@4%@4A:1%54126=%18A%2492:4M17%1CC7:619:;85%;8%9;C%;?%1%6;77469:;8%;?%6;@C;84895L%:8A:M:A>175%18A%6;@@>8:9:45D%

%

Figure 1.2: CUBRIK Architecture

!:3>24%QD&%5=;E5%1%=:3=%74M47%;M42M:4E%;?%9=4%*+,-./%126=:9469>24D%*+,-./%247:45%;8%1%?21@4E;2B%?;2%4H46>9:83%C2;645545%K1B1%"0"#10$#&NL%6;85:59:83%;?%6;77469:;85%;?%915B5%9;%F4%4H46>94A%:8%1%A:592:F>94A%?15=:;8D%R16=%C:C47:84%:5%A4562:F4A%FJ%1%E;2B?7;E%;?%915B5L%177;6194A%9;%4H46>9;25D%I15B%4H46>9;25%618%F4%5;?9E124%6;@C;84895%K4D3DL%A191%1817J5:5%173;2:9=@5L%@491A191%:8A4H:83%9;;75L%54126=%483:845%;?%A:??42489%819>24L%245>79%C24548919:;8%@;A>745L%496DND%I15B5%618%175;%F4%177;6194A%9;%:8A:M:A>17%=>@18%>5425%K4D3DL%M:1%1%31@:83%:8942?1645N%;2%9;%18%489:24%6;@@>8:9J%K4D3DL%FJ%1%62;EA5;>26:83%6;@C;8489ND%

S:??42489% C:C47:845% 618% F4% A4?:84A% ?;2% 9=4% A:??42489% C2;645545% ;?% 1%@>79:@4A:1% 54126=% 1CC7:619:;8T%6;89489% 1817J5:5% 18A% @491A191% 4H92169:;8L% O>42J% C2;6455:83L% 18A% 2474M1864% ?44AF16B% C2;6455:83D%":C47:84% A4562:C9:;85% 124% 59;24A% :8% 1% C2;6455% 24C;5:9;2JL% 486;A4A% :8% 5918A12A%E;2B?7;E% 7183>1345%

Page 97: Crowdsourcing for Multimedia Retrieval

Trademark Logo Detection: problems in automatic logo detection

n  Problems in automatic logo detection:
   n  Object recognition is affected by the quality of the input set of images
   n  Uncertain matches, i.e., the ones with a low matching score, may not contain the searched logo

Page 98: Crowdsourcing for Multimedia Retrieval

Trademark Logo Detection: contribution of human computation

n  Contribution of human computation
   n  Filter the input logos, eliminating the irrelevant ones
   n  Segment the input logos
   n  Validate the matching results

Page 99: Crowdsourcing for Multimedia Retrieval

+  Human-powered logo detection [Bozzon et al., 2012]

n  Goal: integrate human and automatic computation to increase precision and recall w.r.t. fully automatic solutions (a code sketch of the pipeline follows the diagram below)

[Pipeline diagram] Logo Name → Retrieve Logo Images → Logo Images → Validate Logo Images (crowd task) → Validated Logo Images → Match Logo Images in Videos (Logo Detection over the video collection) → High Confidence Results and Low Confidence Results; Low Confidence Results → Validate Low-confidence Results (crowd task) → Validated Results; both branches → Join Results and Emit Report
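As plain control flow, the pipeline above might look like the following sketch; every function name is a placeholder for an automatic component or a crowdsourced task, and the confidence threshold is invented for illustration.

def logo_detection_pipeline(logo_name, video_collection,
                            retrieve_logo_images, logo_detector, crowd_validate,
                            confidence_threshold=0.8):
    # 1) Retrieve candidate logo images for the given brand name
    logo_images = retrieve_logo_images(logo_name)
    # 2) Crowd task: filter out irrelevant logos and segment the remaining ones
    validated_logos = crowd_validate("filter and segment logos", logo_images)
    # 3) Automatic matching of the validated logos against the video collection
    matches = logo_detector(validated_logos, video_collection)
    high_confidence = [m for m in matches if m["score"] >= confidence_threshold]
    low_confidence = [m for m in matches if m["score"] < confidence_threshold]
    # 4) Crowd task: validate only the low-confidence matches
    validated_low = crowd_validate("validate uncertain matches", low_confidence)
    # 5) Join results and emit the report
    return high_confidence + validated_low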

Page 100: Crowdsourcing for Multimedia Retrieval

Experimental evaluation

n  Three experimental settings:
   n  No human intervention
   n  Logo validation performed by two domain experts
   n  Inclusion of the actual crowd knowledge

n  Crowd involvement
   n  40 people involved
   n  50 task instances generated
   n  70 collected answers

Page 101: Crowdsourcing for Multimedia Retrieval

Experimental evaluation

[Plot of Precision vs. Recall (both 0 to 1) for the three logos Aleve, Chunky and Shout under the three settings: No Crowd, Experts, Crowd.]

Page 102: Crowdsourcing for Multimedia Retrieval

Experimental evaluation

[Same Precision vs. Recall plot for the logos Aleve, Chunky and Shout under the settings No Crowd, Experts, Crowd.]

Precision decreases. Reasons for the wrong inclusions:
•  Geographical location of the users
•  Expertise of the involved users

Page 103: Crowdsourcing for Multimedia Retrieval

Experimental evaluation

[Same Precision vs. Recall plot for the logos Aleve, Chunky and Shout under the settings No Crowd, Experts, Crowd.]

Precision decreases:
•  Similarity between two logos in the data set

Page 104: Crowdsourcing for Multimedia Retrieval

Open issues and future directions

n  Reproducibility and experiment design [Paritosh, 2012]

n  Expert finding / task allocation

n  Beyond textual labels

Page 105: Crowdsourcing for Multimedia Retrieval

+  

Thanks for your attention. www.cubrikproject.eu

   

Page 106: Crowdsourcing for Multimedia Retrieval

+  References  1/3  

n  [Bozzon et al., 2012] Alessandro Bozzon, Ilio Catallo, Eleonora Ciceri, Piero Fraternali, Davide Martinenghi, Marco Tagliasacchi: A Framework for Crowdsourced Multimedia Processing and Querying. CrowdSearch 2012: 42-47

n  [Dawid and Skene, 1979] A. P. Dawid and A. M. Skene, Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm, Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 28, No. 1 (1979), pp. 20-28

n  [Dekel and Shamir, 2009] O. Dekel and O. Shamir, Vox Populi: Collecting High-Quality Labels from a Crowd. In Proceedings of COLT, 2009.

n  [Domnez et al., 2009] Pinar Donmez, Jaime G. Carbonell, and Jeff Schneider. 2009. Efficiently learning the accuracy of labeling sources for selective sampling. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '09)

n  [Fraternali et al., 2012] Piero Fraternali, Marco Tagliasacchi, Davide Martinenghi, Alessandro Bozzon, Ilio Catallo, Eleonora Ciceri, Francesco Saverio Nucci, Vincenzo Croce, Ismail Sengör Altingövde, Wolf Siberski, Fausto Giunchiglia, Wolfgang Nejdl, Martha Larson, Ebroul Izquierdo, Petros Daras, Oro Chrons, Ralph Traphöner, Björn Decker, John Lomas, Patrick Aichroth, Jasminko Novak, Ghislain Sillaume, Fernando Sánchez-Figueroa, Carolina Salas-Parra: The CUBRIK project: human-enhanced time-aware multimedia search. WWW (Companion Volume) 2012: 259-262

n  [Freiburg et al. 2011] Bauke Freiburg, Jaap Kamps, and Cees G.M. Snoek. 2011. Crowdsourcing visual detectors for video search. In Proceedings of the 19th ACM international conference on Multimedia (MM '11). ACM, New York, NY, USA, 913-916.

n  [Goëau et al., 2011] H. Goëau, A. Joly, S. Selmi, P. Bonnet, E. Mouysset, L. Joyeux, J. Molino, P. Birnbaum, D. Barthelemy, and N. Boujemaa, Visual-based plant species identification from crowdsourced data. In Proceedings of ACM Multimedia, 2011, 813-814.

n  [Harris, 2012] Christopher G. Harris, An Evaluation of Search Strategies for User-Generated Video Content, Proceedings of the First International Workshop on Crowdsourcing Web Search, Lyon, France, April 17, 2012

n  [Karger et al., 2011] D.R. Karger, S. Oh, and D. Shah, Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems. In Proceedings of CoRR, 2011.

n  [Kumar and Lease, 2011] A. Kumar and M. Lease. Modeling annotator accuracies for supervised learning. In WSDM Workshop on Crowdsourcing for Search and Data Mining, 2011.

Page 107: Crowdsourcing for Multimedia Retrieval

+  References  2/3  

n  [Nowak and Ruger, 2010] Stefanie Nowak and Stefan Rüger. 2010. How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In Proceedings of the international conference on Multimedia information retrieval (MIR '10). ACM, New York, NY, USA, 557-566.

n  [Paritosh, 2012] Praveen Paritosh, Human Computation Must Be Reproducible, Proceedings of the First International Workshop on Crowdsourcing Web Search, Lyon, France, April 17, 2012

n  [Raykar et al., 2010] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. 2010. Learning From Crowds. J. Mach. Learn. Res. 99 (August 2010), 1297-1322.

n  [Sheng et al., 2008] Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. 2008. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '08). ACM, New York, NY, USA, 614-622.

n  [Snoek et al., 2010] Cees G.M. Snoek, Bauke Freiburg, Johan Oomen, and Roeland Ordelman. Crowdsourcing rock n' roll multimedia retrieval. In Proceedings of the international conference on Multimedia (MM '10). ACM, New York, NY, USA, 1535-1538.

n  [Snow et al., 2008] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast - but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 254-263.

n  [Soleymani and Larson, 2010] Soleymani, M. and Larson, M. Crowdsourcing for Affective Annotation of Video: Development of a Viewer-reported Boredom Corpus. In Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010)

n  [Sorokin and Forsyth, 2008] Sorokin, A.; Forsyth, D., "Utility data annotation with Amazon Mechanical Turk," Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pp. 1-8, 23-28 June 2008

n  [Steiner et al., 2011] Thomas Steiner, Ruben Verborgh, Rik Van de Walle, Michael Hausenblas, and Joaquim Gabarró Vallés, Crowdsourcing Event Detection in YouTube Videos, Proceedings of the 1st Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web, 2011

n  [Tang and Lease, 2011] Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. In ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.

Page 108: Crowdsourcing for Multimedia Retrieval

+  References  3/3  

n  [Urbano et al., 2010] J. Urbano, J. Morato, M. Marrero, and D. Martin. Crowdsourcing preference judgments for evaluation of music similarity tasks. In Proceedings of the ACM SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010), pages 9-16, Geneva, Switzerland, July 2010.

n  [Vondrick et al., 2010] Carl Vondrick, Deva Ramanan, and Donald Patterson. 2010. Efficiently scaling up video annotation with crowdsourced marketplaces. In Proceedings of the 11th European conference on Computer vision: Part IV (ECCV'10)

n  [Welinder and Perona, 2010] Welinder, P., Perona, P. Online crowdsourcing: rating annotators and obtaining cost-effective labels. Workshop on Advancing Computer Vision with Humans in the Loop at CVPR, 2010

n  [Whitehill et al., 2009] Jacob Whitehill, Paul Ruvolo, Jacob Bergsma, Tingfan Wu, and Javier Movellan, "Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise", Advances in Neural Information Processing Systems, 2009.

n  [Yan et al., 2010] Tingxin Yan, Vikas Kumar, and Deepak Ganesan. 2010. CrowdSearch: exploiting crowds for accurate real-time image search on mobile phones. In Proceedings of the 8th international conference on Mobile systems, applications, and services (MobiSys '10). ACM, New York, NY, USA, 77-90.

n  [Yan et al., 2010b] Y. Yan, R. Rosales, G. Fung, M.W. Schmidt, G.H. Valadez, L. Bogoni, L. Moy, and J.G. Dy, Modeling annotator expertise: Learning when everybody knows a bit of something. In Journal of Machine Learning Research - Proceedings Track, 2010, 932-939.

n  [Yan et al., 2011] Y. Yan, R. Rosales, G. Fung, and J.G. Dy, Active Learning from Crowds. In Proceedings of ICML, 2011, 1161-1168.