Crowdsourcing in NLP
What is Crowdsourcing?
• "Crowdsourcing is the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined and large network of people in the form of an open call." (Howe, 2006 – Wired magazine article)
• "Crowdsourcing is a type of participative online activity in which an individual, an institution, a non-profit organization, or company proposes to a group of individuals of varying knowledge, heterogeneity, and number, via a flexible open call, the voluntary undertaking of a task. The undertaking of the task, of variable complexity and modularity, and in which the crowd should participate bringing their work, money, knowledge and/or experience, always entails mutual benefit…" (Estellés-Arolas, 2012 – integrated 40 definitions from the literature, 2008 onward)
Sample tasks (difficult for computers but simple for humans):
• Identify disease mentions in PubMed abstracts
• Classify book reviews as positive or negative
Amazon Mechanical Turk (AMT) – a crowdsourcing platform (launched 2005)
• Requester
– Designs the task and prepares the dataset (i.e., the task items)
– Submits it to AMT, specifying (see the sketch after this list):
• the number of judgments per task item
• the reward paid to each worker per task item
• restrictions on workers (e.g., location, accuracy on previous tasks)
• Workers
– Work on task items; can complete as many or as few as they please
– Get paid small amounts of money (a few cents per task item)
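The requester parameters above (judgments per task item, reward, worker restrictions) map onto AMT's API. Below is a minimal, hypothetical sketch using boto3's MTurk client; the title, reward, file name, and qualification IDs are illustrative assumptions to be checked against the MTurk documentation, not values from the slides.

```python
# Hypothetical sketch: posting one task item (HIT) to Amazon Mechanical Turk via boto3.
# All concrete values below are illustrative placeholders.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# ExternalQuestion/HTMLQuestion XML describing one task item (assumed file name)
question_xml = open("task_item.xml").read()

response = mturk.create_hit(
    Title="Classify a book review as positive or negative",
    Description="Read a short review and pick the sentiment label.",
    Reward="0.05",                      # payment per judgment, in USD
    MaxAssignments=10,                  # number of independent judgments per task item
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=3 * 24 * 3600,
    Question=question_xml,
    QualificationRequirements=[
        {   # restrict to workers located in the US (Amazon's built-in Locale qualification;
            # check the documented QualificationTypeId)
            "QualificationTypeId": "00000000000000000071",
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": "US"}],
        },
        {   # restrict to workers with >= 95% approval rate on previous tasks
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
        },
    ],
)
print("HIT created:", response["HIT"]["HITId"])
```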
About the AMT workers: who are they? (Ross et al., 2010)
• Age – average 31, min 18, max 71, median 27
• Gender – female 55%, male 45%
• Occupation – 38% full-time, 31% part-time, 31% unemployed
• Education – 66% college or higher, 33% students
• Salary – median $20k-$30k
• Country – USA 57%, India 32%, other 11%
How do they search for tasks on AMT? (Chilton et al., 2010)
• Title
• Reward amount
• Date posted – Newest to oldest
• Time allotted
• Number of task items – Most to fewest task items
• Expiration date
Paper 1: Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks
Rion Snow, Brendan O'Connor, Daniel Jurafsky, Andrew Y. Ng. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008
Motivation
• Large-scale annotation – vital to NLP research and to developing new algorithms
• Challenges of expert annotation
– Financially expensive
– Time consuming
• An alternative – explore non-expert annotations (crowdsourcing)
Overview
• 5 natural language tasks (short and simple)
1. Affect Recognition
2. Word Similarity
3. Recognizing Textual Entailment (RTE)
4. Event Temporal Ordering
5. Word Sense Disambiguation
• Method
– Post tasks on Amazon Mechanical Turk
– Request 10 independent judgments (annotations) per task item
• Evaluate performance of non-experts
– Compare annotations with those of experts
– Compare machine learning classifier performance (trained on expert vs. non-expert data)
Task #1: Affect Recognition (original experiment with experts: Strapparava and Mihalcea, 2007, SemEval)
• Given a textual headline, identify and rate emotions – anger [0,100], disgust [0,100], fear [0,100], joy [0,100], sadness [0,100], surprise [0,100], overall valence [-100,100]
• Example headline-annotation pair
Outcry at N Korea 'nuclear test' – (Anger, 30), (Disgust, 30), (Fear, 30), (Joy, 0), (Sadness, 20), (Surprise, 40), (Valence, -50)
• Original experiment
– 1000 headlines extracted from the New York Times, CNN, and Google News
– 6 expert annotators per headline
• Non-expert (crowdsourcing) experiment
– 100-headline sample
– 10 annotations per headline (70 affect labels per headline)
– Paid $2.00 to Amazon for collecting 7000 affect labels
Task #1: Affect Recognition Results (Inter-Annotator Agreement, ITA)
• ITA_task = average of ITA_annotator over all annotators
• ITA_annotator = Pearson correlation between this annotator's labels and the average of the other annotators' labels (a sketch of this computation follows)
Compared with the original expert experiment:
• Individual experts are better labelers than individual non-experts
• Non-expert annotations are good enough to increase the overall quality of the task
Task #1: Affect Recognition Results – how many non-expert labelers are equivalent to one expert labeler?
• Treat n (1, 2, 3, …, 10) non-expert annotators as a single meta-labeler
– Average the labels over all possible subsets of size n (see the sketch after this list)
• Find the minimum number of non-experts (k) needed to rival the performance of an expert
On average, it takes 4 non-experts to produce expert-like performance. For this task, it takes $1.00 to generate 875 expert-equivalent labels.
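The meta-labeler construction can be sketched as follows (an illustration under my own assumptions, not the paper's code): for each n, average the labels of every size-n subset of non-experts and correlate that average with the expert consensus, then find the smallest n that reaches a single expert's agreement level.

```python
# Sketch: treat n non-experts as one "meta-labeler" by averaging their labels,
# then measure agreement with the expert consensus.
import itertools
import numpy as np
from scipy.stats import pearsonr

def meta_labeler_agreement(nonexpert: np.ndarray, expert_mean: np.ndarray, n: int) -> float:
    """nonexpert[i, j]: label from non-expert i on item j;
    expert_mean[j]: averaged expert label on item j.
    Returns the average correlation of size-n averaged subsets with the experts."""
    corrs = []
    for subset in itertools.combinations(range(nonexpert.shape[0]), n):
        meta = nonexpert[list(subset)].mean(axis=0)   # average the subset's labels
        corrs.append(pearsonr(meta, expert_mean)[0])
    return float(np.mean(corrs))

def min_equivalent_nonexperts(nonexpert, expert_mean, expert_ita, max_n=10):
    """Smallest n whose meta-labeler rivals a single expert's agreement (expert_ita)."""
    for n in range(1, max_n + 1):
        if meta_labeler_agreement(nonexpert, expert_mean, n) >= expert_ita:
            return n
    return None
```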
Task #2: Word Similarity (original experiment with experts: Resnik, 1999)
• Given a pair of words, e.g., {boy, lad}
– Provide a numeric similarity judgment in [0,10]
• Original Experiment
– 30 pairs of words
– 10 experts
– ITA = 0.958
• Crowdsourcing Experiment – 30 pairs of words
– 10 annotations per pair
– Paid $0.20 total for 300 annotations
– Task was completed within 11 minutes of posting
– Maximum ITA = 0.952
Task #3: Recognizing Textual Entailment (original experiment: Dagan et al., 2006)
• Determine whether the second sentence can be inferred from the first sentence
S1: Crude Oil Prices Slump
S2: Oil prices drop
– Answer: true
S1: The government announced that it plans to raise oil prices
S2: Oil prices drop
– Answer: false
• Original Experiment – 800 sentence pairs
– ITA = 0.91
• Crowdsourcing experiment
– 800 sentence pairs
– 10 annotations per pair
– For the aggregated response, used simple majority voting with random tie breaking (see the sketch after this list)
– Maximum ITA = 0.89
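A minimal sketch of the aggregation used here, simple majority voting with random tie breaking (the helper name is illustrative, not from the paper):

```python
# Sketch: aggregate 10 binary RTE judgments per sentence pair by majority vote,
# breaking ties at random.
import random
from collections import Counter

def majority_vote(judgments, rng=random):
    """judgments: list of labels (e.g., True/False) for one task item."""
    counts = Counter(judgments)
    top = max(counts.values())
    winners = [label for label, c in counts.items() if c == top]
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# Example: 10 workers judged one sentence pair
print(majority_vote([True, True, False, True, False, True, True, False, True, False]))
```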
Task #5: Word Sense Disambiguation (original experiment: SemEval, Pradhan et al., 2007)
• "Robert E. Lyons III … was appointed president and chief operating officer…" – What is the sense of "president"?
• Executive officer of a firm, corporation, or university
• Head of a country (other than the US)
• Head of the US, President of the United States
• Original experiment
– ITA not reported
– Provides the gold standard
• Crowdsourcing experiment
– 177 examples of the noun "president" covering the 3 senses
– 10 annotations per example
– Results aggregated using majority voting with random tie breaking; accuracy calculated w.r.t. the gold standard
– (In the figure, the red line represents the best system's performance on SemEval Task 17; Cai et al., 2007)
Training Classifiers: Non-Experts vs. Experts (Task #1: Affect Recognition)
• Designed a supervised affect recognition system
• Training the system
• Testing a new headline
Notation: e = emotion; t = token (a word in a headline); H = headline; H_t = set of headlines containing token t (a sketch of such a model follows)
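The slide's equations did not survive the transcript, but the notation suggests a simple bag-of-words model. The sketch below is one plausible instantiation (an assumption, not necessarily the paper's exact formula): a token's score for emotion e is the average gold label of the training headlines containing it, and a new headline's score is the average over its tokens.

```python
# Sketch of a bag-of-words affect scorer consistent with the slide's notation
# (e = emotion, t = token, H = headline, H_t = headlines containing t).
# Illustrative reconstruction only.
from collections import defaultdict

def train(headlines, labels):
    """headlines: list of token lists; labels: list of dicts {emotion: score}.
    Returns token_score[t][e] = average label for emotion e over H_t."""
    sums, counts = defaultdict(lambda: defaultdict(float)), defaultdict(int)
    for tokens, label in zip(headlines, labels):
        for t in set(tokens):
            counts[t] += 1
            for e, v in label.items():
                sums[t][e] += v
    return {t: {e: s / counts[t] for e, s in emo.items()} for t, emo in sums.items()}

def score(token_score, tokens, emotion):
    """Score a new headline for one emotion: average over its known tokens."""
    vals = [token_score[t][emotion] for t in tokens if t in token_score]
    return sum(vals) / len(vals) if vals else 0.0

# Tiny usage example
model = train([["outcry", "at", "nuclear", "test"]], [{"fear": 30, "joy": 0}])
print(score(model, ["nuclear", "outcry"], "fear"))  # 30.0
```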
• Experiments (training: 100 headlines; testing: 900 headlines)
– In most cases, even a single set of non-expert annotations (1-NE) trained the system better than a single expert's annotations
– Possible explanation:
• Individual labelers (experts or non-experts) tend to be biased
• For non-experts, even a single annotation set (1-NE) is assembled from multiple non-expert labelers (because of the nature of crowdsourcing), which may reduce bias
Summary
• Individual experts are better than individual non-experts
• Non-experts improve the quality of annotations
• For many tasks, only a small number (on average, 4) of non-experts per item are needed to equal expert performance
• A system trained on non-expert annotations performed better, probably because non-experts offer more diversity
Paper 2: Validating candidate gene-mutation relations in MEDLINE abstracts via crowdsourcing
John Burger, Emily Doughty, Sam Bayer, et al. Data Integration in the Life Sciences (DILS), 8th International Conference, 2012
Goal
• To identify relationships between genes and mutations from the biomedical literature (mutation grounding)
Challenge
• Abstracts contain multiple mentions of genes and mutations; extracting the correct association is challenging
Method
• Identify all mutation and gene mentions using existing tools
• Identify gene-mutation relationships using crowdsourcing with non-experts
Dataset
• PubMed abstracts
– MeSH terms: Mutation, Mutation AND Polymorphism/Genetic
– Diseases (identified using MetaMap): breast cancer, prostate cancer, autism spectrum disorder
• 810 abstracts – expert-curated (gold standard), containing 1608 gene-mutation relationships
• Working dataset: 250 selected abstracts with 578 gene-mutation pairs
Method
Extracting mentions
• Used the existing tool EMU (Extractor of Mutations) (Doughty et al., 2011)
• Gene identification – string match against the HUGO and NCBI gene databases
• Mutation (SNP) identification – regular expressions
Extracting relationships
• Normalized all mentions
• Generated the cross-product of all gene-mutation mentions within an article
• Total of 1398 candidate gene-mutation pairs – each submitted as a task item to Amazon Mechanical Turk (see the sketch after this list)
Sample task item posted for crowdsourcing (screenshot)
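A rough sketch of the candidate-generation step described above; the mutation regex and the gene list here are simplified stand-ins for the EMU patterns and the HUGO/NCBI string matching, so the specific patterns and symbols are illustrative assumptions.

```python
# Sketch: generate candidate gene-mutation pairs from one abstract.
import itertools
import re

# e.g. protein-level substitutions such as "V600E" or "Arg72Pro" (simplified placeholder regex)
MUTATION_RE = re.compile(
    r"\b(?:[A-Z]\d+[A-Z]"                                             # V600E-style
    r"|(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met"
    r"|Phe|Pro|Ser|Thr|Trp|Tyr|Val)\d+"
    r"(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met"
    r"|Phe|Pro|Ser|Thr|Trp|Tyr|Val))\b"                               # Arg72Pro-style
)

GENE_SYMBOLS = {"BRCA1", "BRCA2", "TP53", "BRAF"}  # placeholder for HUGO/NCBI lookup

def candidate_pairs(abstract: str):
    mutations = set(MUTATION_RE.findall(abstract))
    genes = {tok for tok in re.findall(r"\b[A-Z0-9]{2,}\b", abstract) if tok in GENE_SYMBOLS}
    # cross-product of all gene and mutation mentions: each pair becomes one task item
    return list(itertools.product(sorted(genes), sorted(mutations)))

text = "We identified the BRAF V600E mutation and a TP53 Arg72Pro polymorphism."
print(candidate_pairs(text))  # [('BRAF', 'Arg72Pro'), ('BRAF', 'V600E'), ('TP53', ...)]
```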
Method: Crowdsourcing
• 5 annotations per task item (candidate association)
• 1398 task items + 467 control items
– Control items are hidden tests used to calculate each worker's rating (see the sketch after this list)
• Restricted the task to workers
– located in the United States
– with a 95% rating from previous tasks
• Payment
– 8 cents per abstract to each worker
– Total: $900
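Control items with known answers make it possible to estimate each worker's accuracy. The sketch below is an illustration under my own assumptions (field names are hypothetical), not the paper's implementation.

```python
# Sketch: estimate per-worker accuracy from control items (hidden gold answers).
from collections import defaultdict

def worker_accuracy(responses, gold):
    """responses: iterable of (worker_id, item_id, answer);
    gold: {item_id: answer} for control items only.
    Returns {worker_id: fraction correct on control items}."""
    correct, total = defaultdict(int), defaultdict(int)
    for worker, item, answer in responses:
        if item in gold:                       # only control items count
            total[worker] += 1
            correct[worker] += int(answer == gold[item])
    return {w: correct[w] / total[w] for w in total}
```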
Results
• Mutation recall: 477/550 = 86.7%
• Gene recall: 257/276 = 93.1%
• Out of 250 abstracts, 185 were "perfect" documents (100% recall of both genes and mutations)
• Crowdsourcing results:
– Time: 30 hours
– 58 workers in total
• 12 completed only 1 item
• 22 completed 2-10 items
• 13 completed 11-100 items
• 11 completed 100+ items
Results
Worker accuracy and consensus accuracy
• Consensus accuracy by aggregation method (see the sketch after this list):
– Simple majority voting – 78.4%
– Weighted vote (based on workers' ratings) – 78.3%
– Naïve Bayes classifier (estimates the probability that each response is correct) – 82.1%
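To make the three aggregation schemes concrete, here is a small sketch (illustrative, not the paper's code) of majority voting, accuracy-weighted voting, and a naive-Bayes-style combination that sums per-worker log-odds for a yes/no relation. The worker accuracies could come from the control items in the earlier sketch; all names and values are assumptions.

```python
# Sketch: three ways to turn 5 yes/no judgments per candidate pair into one answer.
import math
from collections import Counter

def majority(votes):
    """votes: list of (worker_id, bool). Simple majority of the boolean answers."""
    return Counter(v for _, v in votes).most_common(1)[0][0]

def weighted_vote(votes, worker_acc):
    """Each vote is weighted by that worker's estimated accuracy."""
    score = sum((1 if v else -1) * worker_acc.get(w, 0.5) for w, v in votes)
    return score > 0

def naive_bayes_vote(votes, worker_acc, prior=0.5):
    """Treat each worker as correct with probability worker_acc[w],
    independently of the others, and combine votes as log-odds."""
    log_odds = math.log(prior / (1 - prior))
    for w, v in votes:
        p = min(max(worker_acc.get(w, 0.5), 1e-3), 1 - 1e-3)  # clamp away from 0/1
        log_odds += math.log(p / (1 - p)) if v else math.log((1 - p) / p)
    return log_odds > 0

votes = [("w1", True), ("w2", True), ("w3", False), ("w4", True), ("w5", False)]
acc = {"w1": 0.9, "w2": 0.6, "w3": 0.8, "w4": 0.55, "w5": 0.7}
print(majority(votes), weighted_vote(votes, acc), naive_bayes_vote(votes, acc))
```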
Conclusion
• It is easy to recruit workers and achieve a fast turnaround time
• Performance varies across workers
• The task required a significant level of biomedical literacy, yet one worker gave 95% accurate responses
• It is important to find new ways to identify qualified workers and to aggregate results