Crowdsourcing in NLP
What is Crowdsourcing?
• "Crowdsourcing is the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined and large network of people in the form of an open call." (Howe, 2006 – Wired magazine article)
• "Crowdsourcing is a type of participative online activity in which an individual, an institution, a non-profit organization, or company proposes to a group of individuals of varying knowledge, heterogeneity, and number, via a flexible open call, the voluntary undertaking of a task. The undertaking of the task, of variable complexity and modularity, and in which the crowd should participate bringing their work, money, knowledge and/or experience, always entails mutual benefit…" (Estellés-Arolas, 2012 – integrated 40 definitions from the literature, 2008 onward)
Sample tasks (difficult for computers but simple for humans):
• Identify disease mentions in PubMed abstracts
• Classify book reviews as positive or negative
Amazon Mechanical Turk (AMT) – a crowdsourcing platform (launched 2005)
• Requester
– Designs the task and prepares the dataset (i.e., the task items)
– Submits it to AMT, specifying (see the sketch after this list):
• the number of judgments per task item
• the reward paid to each worker per task item
• restrictions on workers (e.g., location, accuracy on previous tasks)
• Workers
– Work on task items; can complete as many or as few as they please
– Get paid small amounts of money (a few cents per task item)
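The requester parameters above (judgments per task item, reward, worker restrictions) map onto AMT's API. Below is a minimal, hypothetical sketch using boto3's MTurk client; the title, reward, file name, and qualification IDs are illustrative assumptions to be checked against the MTurk documentation, not values from the slides.

```python
# Hypothetical sketch: posting one task item (HIT) to Amazon Mechanical Turk via boto3.
# All concrete values below are illustrative placeholders.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# ExternalQuestion/HTMLQuestion XML describing one task item (assumed file name)
question_xml = open("task_item.xml").read()

response = mturk.create_hit(
    Title="Classify a book review as positive or negative",
    Description="Read a short review and pick the sentiment label.",
    Reward="0.05",                      # payment per judgment, in USD
    MaxAssignments=10,                  # number of independent judgments per task item
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=3 * 24 * 3600,
    Question=question_xml,
    QualificationRequirements=[
        {   # restrict to workers located in the US (Amazon's built-in Locale qualification;
            # check the documented QualificationTypeId)
            "QualificationTypeId": "00000000000000000071",
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": "US"}],
        },
        {   # restrict to workers with >= 95% approval rate on previous tasks
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
        },
    ],
)
print("HIT created:", response["HIT"]["HITId"])
```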
About the AMT workers: who are they? (Ross et al., 2010)
• Age – average 31, min 18, max 71, median 27
• Gender – female 55%, male 45%
• Occupation – 38% full-time, 31% part-time, 31% unemployed
• Education – 66% college or higher, 33% students
• Salary – median $20k-$30k
• Country – USA 57%, India 32%, other 11%
How do they search for tasks on AMT? (Chilton et al., 2010)
• Title
• Reward amount
• Date posted – Newest to oldest
• Time allotted
• Number of task items – Most to fewest task items
• Expiration date
Paper 1: Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks
Rion Snow, Brendan O'Connor, Daniel Jurafsky, Andrew Y. Ng. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008
Motivation
• Large-scale annotation – vital to NLP research and to developing new algorithms
• Challenges of expert annotation
– Financially expensive
– Time consuming
• An alternative – explore non-expert annotations (crowdsourcing)
Overview
• 5 natural language tasks (short and simple)
1. Affect Recognition
2. Word Similarity
3. Recognizing Textual Entailment (RTE)
4. Event Temporal Ordering
5. Word Sense Disambiguation
• Method
– Post tasks on Amazon Mechanical Turk
– Request 10 independent judgments (annotations) per task item
• Evaluate performance of non-experts
– Compare annotations with those of experts
– Compare machine learning classifier performance (trained on expert vs. non-expert data)
Task #1: Affect Recognition (original experiment with experts: Strapparava and Mihalcea, 2007, SemEval)
• Given a textual headline, identify and rate emotions – anger [0,100], disgust [0,100], fear [0,100], joy [0,100], sadness [0,100], surprise [0,100], overall valence [-100,100]
• Example headline-annotation pair
Outcry at N Korea 'nuclear test' – (Anger, 30), (Disgust, 30), (Fear, 30), (Joy, 0), (Sadness, 20), (Surprise, 40), (Valence, -50)
• Original experiment
– 1000 headlines extracted from the New York Times, CNN, and Google News
– 6 expert annotators per headline
• Non-expert (crowdsourcing) experiment
– 100-headline sample
– 10 annotations per headline (70 affect labels per headline)
– Paid $2.00 to Amazon for collecting 7000 affect labels
Task #1: Affect Recognition Results (Inter-Annotator Agreement, ITA)
• ITA_task = average of ITA_annotator over all annotators
• ITA_annotator = Pearson correlation between this annotator's labels and the average of the other annotators' labels (a sketch of this computation follows)
Compared with the original expert experiment:
• Individual experts are better labelers than individual non-experts
• Non-expert annotations are good enough to increase the overall quality of the task
Task #1: Affect Recognition Results – how many non-expert labelers are equivalent to one expert labeler?
• Treat n (1, 2, 3, …, 10) non-expert annotators as a single meta-labeler
– Average the labels over all possible subsets of size n (see the sketch after this list)
• Find the minimum number of non-experts (k) needed to rival the performance of an expert
On average, it takes 4 non-experts to produce expert-like performance. For this task, it takes $1.00 to generate 875 expert-equivalent labels.
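The meta-labeler construction can be sketched as follows (an illustration under my own assumptions, not the paper's code): for each n, average the labels of every size-n subset of non-experts and correlate that average with the expert consensus, then find the smallest n that reaches a single expert's agreement level.

```python
# Sketch: treat n non-experts as one "meta-labeler" by averaging their labels,
# then measure agreement with the expert consensus.
import itertools
import numpy as np
from scipy.stats import pearsonr

def meta_labeler_agreement(nonexpert: np.ndarray, expert_mean: np.ndarray, n: int) -> float:
    """nonexpert[i, j]: label from non-expert i on item j;
    expert_mean[j]: averaged expert label on item j.
    Returns the average correlation of size-n averaged subsets with the experts."""
    corrs = []
    for subset in itertools.combinations(range(nonexpert.shape[0]), n):
        meta = nonexpert[list(subset)].mean(axis=0)   # average the subset's labels
        corrs.append(pearsonr(meta, expert_mean)[0])
    return float(np.mean(corrs))

def min_equivalent_nonexperts(nonexpert, expert_mean, expert_ita, max_n=10):
    """Smallest n whose meta-labeler rivals a single expert's agreement (expert_ita)."""
    for n in range(1, max_n + 1):
        if meta_labeler_agreement(nonexpert, expert_mean, n) >= expert_ita:
            return n
    return None
```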
Task #2: Word Similarity (original experiment with experts: Resnik, 1999)
• Given a pair of words, e.g., {boy, lad}
– Provide a numeric similarity judgment in [0,10]
• Original Experiment
– 30 pairs of words
– 10 experts
– ITA = 0.958
• Crowdsourcing Experiment – 30 pairs of words
– 10 annotations per pair
– Paid $0.20 total for 300 annotations
– Task was completed within 11 minutes of posting
– Maximum ITA = 0.952
Task #3: Recognizing Textual Entailment (original experiment: Dagan et al., 2006)
• Determine whether the second sentence can be inferred from the first sentence
S1: Crude Oil Prices Slump
S2: Oil prices drop
– Answer: true
S1: The government announced that it plans to raise oil prices
S2: Oil prices drop
– Answer: false
• Original Experiment – 800 sentence pairs
– ITA = 0.91
• Crowdsourcing experiment
– 800 sentence pairs
– 10 annotations per pair
– For the aggregated response, used simple majority voting with random tie breaking (see the sketch after this list)
– Maximum ITA = 0.89
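A minimal sketch of the aggregation used here, simple majority voting with random tie breaking (the helper name is illustrative, not from the paper):

```python
# Sketch: aggregate 10 binary RTE judgments per sentence pair by majority vote,
# breaking ties at random.
import random
from collections import Counter

def majority_vote(judgments, rng=random):
    """judgments: list of labels (e.g., True/False) for one task item."""
    counts = Counter(judgments)
    top = max(counts.values())
    winners = [label for label, c in counts.items() if c == top]
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# Example: 10 workers judged one sentence pair
print(majority_vote([True, True, False, True, False, True, True, False, True, False]))
```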
Task #5: Word Sense Disambiguation (original experiment: SemEval, Pradhan et al., 2007)
• "Robert E. Lyons III … was appointed president and chief operating officer…" – What is the sense of "president"?
• Executive officer of a firm, corporation, or university
• Head of a country (other than the US)
• Head of the US, President of the United States
• Original experiment
– ITA not reported
– Provides the gold standard
• Crowdsourcing experiment
– 177 examples of the noun "president" covering the 3 senses
– 10 annotations per example
– Results aggregated using majority voting with random tie breaking; accuracy calculated w.r.t. the gold standard
– (In the figure, the red line represents the best system's performance on SemEval Task 17; Cai et al., 2007)
Training Classifiers: Non-Experts vs. Experts (Task #1: Affect Recognition)
• Designed a supervised affect recognition system
• Training the system
• Testing a new headline
Notation: e = emotion; t = token (a word in a headline); H = headline; H_t = set of headlines containing token t (a sketch of such a model follows)
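The slide's equations did not survive the transcript, but the notation suggests a simple bag-of-words model. The sketch below is one plausible instantiation (an assumption, not necessarily the paper's exact formula): a token's score for emotion e is the average gold label of the training headlines containing it, and a new headline's score is the average over its tokens.

```python
# Sketch of a bag-of-words affect scorer consistent with the slide's notation
# (e = emotion, t = token, H = headline, H_t = headlines containing t).
# Illustrative reconstruction only.
from collections import defaultdict

def train(headlines, labels):
    """headlines: list of token lists; labels: list of dicts {emotion: score}.
    Returns token_score[t][e] = average label for emotion e over H_t."""
    sums, counts = defaultdict(lambda: defaultdict(float)), defaultdict(int)
    for tokens, label in zip(headlines, labels):
        for t in set(tokens):
            counts[t] += 1
            for e, v in label.items():
                sums[t][e] += v
    return {t: {e: s / counts[t] for e, s in emo.items()} for t, emo in sums.items()}

def score(token_score, tokens, emotion):
    """Score a new headline for one emotion: average over its known tokens."""
    vals = [token_score[t][emotion] for t in tokens if t in token_score]
    return sum(vals) / len(vals) if vals else 0.0

# Tiny usage example
model = train([["outcry", "at", "nuclear", "test"]], [{"fear": 30, "joy": 0}])
print(score(model, ["nuclear", "outcry"], "fear"))  # 30.0
```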
• Experiments (training: 100 headlines; testing: 900 headlines)
– In most cases, even a single set of non-expert annotations (1-NE) trained the system better than a single expert's annotations
– Possible explanation:
• Individual labelers (experts or non-experts) tend to be biased
• For non-experts, even a single annotation set (1-NE) is assembled from multiple non-expert labelers (because of the nature of crowdsourcing), which may reduce bias
Summary
• Individual experts are better than individual non-experts
• Non-experts improve the quality of annotations
• For many tasks, only a small number (on average, 4) of non-experts per item are needed to equal expert performance
• A system trained on non-expert annotations performed better, probably because non-experts offer more diversity
Paper 2: Validating candidate gene-mutation relations in MEDLINE abstracts via crowdsourcing
John Burger, Emily Doughty, Sam Bayer, et al. Data Integration in the Life Sciences (DILS), 8th International Conference, 2012
Goal
• To identify relationships between genes and mutations from the biomedical literature (mutation grounding)
Challenge
• Abstracts contain multiple mentions of genes and mutations; extracting the correct association is challenging
Method
• Identify all mutation and gene mentions using existing tools
• Identify gene-mutation relationships using crowdsourcing with non-experts
Dataset
• PubMed abstracts
– MeSH terms: Mutation, Mutation AND Polymorphism/Genetic
– Diseases (identified using MetaMap): breast cancer, prostate cancer, autism spectrum disorder
• 810 abstracts – expert-curated (gold standard), containing 1608 gene-mutation relationships
• Working dataset: 250 selected abstracts with 578 gene-mutation pairs
Method
Extracting mentions
• Used the existing tool EMU (Extractor of Mutations) (Doughty et al., 2011)
• Gene identification – string match against the HUGO and NCBI gene databases
• Mutation (SNP) identification – regular expressions
Extracting relationships
• Normalized all mentions
• Generated the cross-product of all gene-mutation mentions within an article
• Total of 1398 candidate gene-mutation pairs – each submitted as a task item to Amazon Mechanical Turk (see the sketch after this list)
Sample task item posted for crowdsourcing (screenshot)
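A rough sketch of the candidate-generation step described above; the mutation regex and the gene list here are simplified stand-ins for the EMU patterns and the HUGO/NCBI string matching, so the specific patterns and symbols are illustrative assumptions.

```python
# Sketch: generate candidate gene-mutation pairs from one abstract.
import itertools
import re

# e.g. protein-level substitutions such as "V600E" or "Arg72Pro" (simplified placeholder regex)
MUTATION_RE = re.compile(
    r"\b(?:[A-Z]\d+[A-Z]"                                             # V600E-style
    r"|(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met"
    r"|Phe|Pro|Ser|Thr|Trp|Tyr|Val)\d+"
    r"(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met"
    r"|Phe|Pro|Ser|Thr|Trp|Tyr|Val))\b"                               # Arg72Pro-style
)

GENE_SYMBOLS = {"BRCA1", "BRCA2", "TP53", "BRAF"}  # placeholder for HUGO/NCBI lookup

def candidate_pairs(abstract: str):
    mutations = set(MUTATION_RE.findall(abstract))
    genes = {tok for tok in re.findall(r"\b[A-Z0-9]{2,}\b", abstract) if tok in GENE_SYMBOLS}
    # cross-product of all gene and mutation mentions: each pair becomes one task item
    return list(itertools.product(sorted(genes), sorted(mutations)))

text = "We identified the BRAF V600E mutation and a TP53 Arg72Pro polymorphism."
print(candidate_pairs(text))  # [('BRAF', 'Arg72Pro'), ('BRAF', 'V600E'), ('TP53', ...)]
```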
Method: Crowdsourcing
• 5 annotations per task item (candidate association)
• 1398 task items + 467 control items
– Control items are hidden tests used to calculate each worker's rating (see the sketch after this list)
• Restricted the task to workers
– located in the United States
– with a 95% rating from previous tasks
• Payment
– 8 cents per abstract to each worker
– Total: $900
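Control items with known answers make it possible to estimate each worker's accuracy. The sketch below is an illustration under my own assumptions (field names are hypothetical), not the paper's implementation.

```python
# Sketch: estimate per-worker accuracy from control items (hidden gold answers).
from collections import defaultdict

def worker_accuracy(responses, gold):
    """responses: iterable of (worker_id, item_id, answer);
    gold: {item_id: answer} for control items only.
    Returns {worker_id: fraction correct on control items}."""
    correct, total = defaultdict(int), defaultdict(int)
    for worker, item, answer in responses:
        if item in gold:                       # only control items count
            total[worker] += 1
            correct[worker] += int(answer == gold[item])
    return {w: correct[w] / total[w] for w in total}
```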
Results
• Mutation recall: 477/550 = 86.7%
• Gene recall: 257/276 = 93.1%
• Out of 250 abstracts, 185 were "perfect" documents (100% recall of both genes and mutations)
• Crowdsourcing results:
– Time: 30 hours
– 58 workers in total
• 12 completed only 1 item
• 22 completed 2-10 items
• 13 completed 11-100 items
• 11 completed 100+ items
Results
Worker accuracy and consensus accuracy
• Consensus accuracy by aggregation method (see the sketch after this list):
– Simple majority voting – 78.4%
– Weighted vote (based on workers' ratings) – 78.3%
– Naïve Bayes classifier (estimates the probability that each response is correct) – 82.1%
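To make the three aggregation schemes concrete, here is a small sketch (illustrative, not the paper's code) of majority voting, accuracy-weighted voting, and a naive-Bayes-style combination that sums per-worker log-odds for a yes/no relation. The worker accuracies could come from the control items in the earlier sketch; all names and values are assumptions.

```python
# Sketch: three ways to turn 5 yes/no judgments per candidate pair into one answer.
import math
from collections import Counter

def majority(votes):
    """votes: list of (worker_id, bool). Simple majority of the boolean answers."""
    return Counter(v for _, v in votes).most_common(1)[0][0]

def weighted_vote(votes, worker_acc):
    """Each vote is weighted by that worker's estimated accuracy."""
    score = sum((1 if v else -1) * worker_acc.get(w, 0.5) for w, v in votes)
    return score > 0

def naive_bayes_vote(votes, worker_acc, prior=0.5):
    """Treat each worker as correct with probability worker_acc[w],
    independently of the others, and combine votes as log-odds."""
    log_odds = math.log(prior / (1 - prior))
    for w, v in votes:
        p = min(max(worker_acc.get(w, 0.5), 1e-3), 1 - 1e-3)  # clamp away from 0/1
        log_odds += math.log(p / (1 - p)) if v else math.log((1 - p) / p)
    return log_odds > 0

votes = [("w1", True), ("w2", True), ("w3", False), ("w4", True), ("w5", False)]
acc = {"w1": 0.9, "w2": 0.6, "w3": 0.8, "w4": 0.55, "w5": 0.7}
print(majority(votes), weighted_vote(votes, acc), naive_bayes_vote(votes, acc))
```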
Conclusion
• It is easy to recruit workers and achieve a fast turnaround time
• Performance varies across workers
• The task required a significant level of biomedical literacy, yet one worker gave 95% accurate responses
• It is important to find new ways to identify qualified workers and to aggregate results