Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking
Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling
Presented by Kumar Ashish
Traditional Method
● Requires a trained group of experts to gather information
● Requires precise guidelines
● Problem of scalability
  ○ INEX Book Collection
    ■ 50,239 books
    ■ 83 Prove It topics
    ■ An assessor requires 33 days to judge a single topic when spending 95 minutes each day
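The scalability figures above can be turned into a rough workload estimate. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the total assumes every topic costs the same effort as the one quoted on the slide.

```python
# Figures quoted on the slide.
DAYS_PER_TOPIC = 33
MINUTES_PER_DAY = 95
NUM_TOPICS = 83

# Effort for one topic.
minutes_per_topic = DAYS_PER_TOPIC * MINUTES_PER_DAY
hours_per_topic = minutes_per_topic / 60

# Assumption: all 83 topics need the same effort as the quoted topic.
total_hours = hours_per_topic * NUM_TOPICS

print(f"{minutes_per_topic} minutes = {hours_per_topic} hours per topic")
print(f"{total_hours} hours for all {NUM_TOPICS} topics")
```

Under that assumption a single topic costs a little over 52 assessor hours, which is why expert-only judging does not scale to the full topic set.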
Crowdsourcing
● It is a method of outsourcing work through an open call for contributions from members of the crowd, who are invited to carry out Human Intelligence Tasks (HITs) in exchange for micro-payments, social recognition, or entertainment value.
● It offers a solution to the scalability problem.
Problems with Crowdsourcing
● Suffers from poor output quality
  ○ Workers' dishonest and careless behavior
    ■ Workers motivated by financial gain may aim to complete as many HITs as possible within a given time.
  ○ Poor task design by the task requester
Solution:
● Include trap questions
● Include qualifying questions
● Use a gold standard data set against which agreement can be measured
● Timing controls
● Challenge-response tests (captcha)
● Build redundancy into the task design
● Model annotator quality
Objective
● Investigate the impact of aspects of Human Intelligence Task (HIT) design on the quality of relevance labels provided by the crowd.
  ○ The investigation focuses on three aspects:
    ■ Quality control elements
    ■ Document pooling and sampling for relevance judgements by the crowd
    ■ Document ordering within a HIT for presentation to the workers
Prove It!
● The task aims to investigate effective ways to retrieve relevant parts of books that can aid a user in confirming or refuting a given factual claim.
How?
● Participating systems are required to retrieve and submit a ranked list of book pages per topic that can confirm or refute the topic's claim or that contain information related to the topic.
● Tasks performed by assessors:
  ○ Assessors are free to choose the topic.
  ○ Assessors are free to choose books, which are ordered by their rank.
  ○ Once inside a book, assessors are required to judge all listed pages.
  ○ Each page can take one of four labels:
    ■ Confirms some aspects of the claim
    ■ Refutes some aspects of the claim
    ■ Contains information that is related to the claim
    ■ Irrelevant
Example:
Claim: Imperialistic Foreign Policy led to World War 2
First page: Confirms
Second page: Contains information that relates to the claim
Approach
Experimental Data
Gold Standard:
● INEX 2010 Prove It topics
  ○ The authors use a set of 21 topics with an average of 169 judged pages per topic.
Experiment Design
● Pooling Strategy
● Document Ordering
● HIT Design and Quality Control
● Experimental Grid
● Measures
Pooling Strategy
● Top-n pool:
  ○ The top n pages of the official Prove It runs are selected using a round-robin strategy.
● Rank-boosted pool:
  ○ Pages from the Prove It runs are re-ranked based on their book's highest rank and popularity across all the Best Books runs and the Prove It runs.
● Answer-boosted pool:
  ○ Pages from the Prove It runs are re-ranked based on their content similarity to the topic.
The authors select pages for each HIT by interleaving the three pooling strategies.
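A minimal sketch of round-robin selection, assuming each run (or pool) is simply a ranked list of page identifiers. The function name `round_robin_pool` and the sample page ids are hypothetical, and the slides do not say how duplicates are handled; here a page already taken from an earlier list is skipped. The same routine could be used both to build a top-n pool from several runs and to interleave the three pools.

```python
from itertools import zip_longest


def round_robin_pool(ranked_lists, n):
    """Take pages rank by rank across the lists until n unique pages are pooled."""
    if n <= 0:
        return []
    pool, seen = [], set()
    # zip_longest walks all lists in parallel, padding shorter ones with None.
    for rank_slice in zip_longest(*ranked_lists):
        for page in rank_slice:
            if page is None or page in seen:
                continue  # padding, or a page already pooled from another list
            pool.append(page)
            seen.add(page)
            if len(pool) == n:
                return pool
    return pool  # fewer than n unique pages were available


# Hypothetical runs for one topic.
runs = [["p1", "p2", "p3"], ["p2", "p4"], ["p5", "p1", "p6"]]
print(round_robin_pool(runs, 4))
```

With these sample runs the first four pooled pages are p1, p2, p5 and p4: one page per run at rank 1, then the first unseen page at rank 2.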
Document Ordering
● Biased order
  ○ HITs are constructed by preserving the order of pages produced by a given pooling approach, i.e. by decreasing expected relevance.
● Random order
  ○ HITs are constructed by first inserting the known relevant pages at arbitrary positions in the HITs and then randomly distributing the remaining pages.
Example
Claim: Imperialistic Foreign Policy led to World War 2
Question: What is the relevance label of the document "Fall of Ottoman Empire"?
Order 1:
1. Causes of World War 2
2. World War 2
3. World War 1
4. Fall of Ottoman Empire
5. Indus Valley Civilization
Order 2:
1. Indus Valley Civilization
2. Fall of Ottoman Empire
3. Causes of World War 2
4. World War 1
5. World War 2
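The two ordering strategies can be sketched as follows. This is a simplification under stated assumptions: `build_hit` is a hypothetical helper, the biased variant just keeps the pool order, and the random variant shuffles the whole page list rather than placing known relevant pages separately.

```python
import random


def build_hit(pages, order="biased", seed=None):
    """Return the pages of one HIT in biased (pool) order or in random order."""
    if order == "biased":
        # Keep the pooling order, i.e. decreasing expected relevance.
        return list(pages)
    if order == "random":
        shuffled = list(pages)
        # A dedicated Random instance makes the shuffle reproducible per seed.
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown order: {order!r}")


pool_order = [
    "Causes of World War 2",
    "World War 2",
    "World War 1",
    "Fall of Ottoman Empire",
    "Indus Valley Civilization",
]
print(build_hit(pool_order, "biased"))
print(build_hit(pool_order, "random", seed=7))
```

In the biased order a worker sees the likely relevant pages first, which may influence how later pages are judged; the random order removes that positional cue.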
HIT Design and Quality Control
● The authors devised control mechanisms to verify worker engagement in order to reduce careless behaviour, including the extreme case of dishonest worker behaviour.
● To check the effect of these control mechanisms, the authors devised two types of HITs:
  ○ Full Design (FullD)
  ○ Simple Design (SimpleD)
Full Design (FullD)
■ Warning: "At least 60% of the labels need to agree with expert provided labels in order to qualify for payments"
■ Trap question: "I did not pay attention"
■ Flow control: to reduce the effect of random clicking, each question depends on the answer given to the previous question.
■ Captcha: to detect human input in the online form
■ Qualification: participation is restricted to workers who have completed over 100 HITs at a 95+% approval rate
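Two of the FullD controls are simple threshold checks and can be sketched directly. The function names and the dictionary layout (page id mapped to label) are assumptions made for illustration; only the thresholds (over 100 HITs, 95% approval, 60% agreement) come from the slide.

```python
def is_eligible(completed_hits, approval_rate):
    """Qualification: over 100 completed HITs at an approval rate of at least 95%."""
    return completed_hits > 100 and approval_rate >= 0.95


def qualifies_for_payment(worker_labels, expert_labels, threshold=0.6):
    """Warning rule: at least 60% of labels must agree with the expert labels.

    Only pages that also have an expert label are compared (an assumption).
    """
    shared = [page for page in worker_labels if page in expert_labels]
    if not shared:
        return False  # nothing to compare against
    agreeing = sum(worker_labels[p] == expert_labels[p] for p in shared)
    return agreeing / len(shared) >= threshold


# Hypothetical labels for five pages.
expert = {"p1": "confirms", "p2": "irrelevant", "p3": "related",
          "p4": "irrelevant", "p5": "refutes"}
careful = dict(expert, p5="related")            # 4 of 5 labels agree
careless = {page: "irrelevant" for page in expert}  # 2 of 5 labels agree
print(qualifies_for_payment(careful, expert), qualifies_for_payment(careless, expert))
```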
Example
Simple Design (SimpleD)
● No restrictions on which workers can participate
● Includes only one trap question
● No qualifying test
● No warning
● No captcha
Experimental Grid
● FullD-bias
  ○ Full Design with biased ordering of pages
● FullD-rand
  ○ Full Design with random ordering of pages
● SimpleD-bias
  ○ Simple Design with biased ordering of pages
● SimpleD-rand
  ○ Simple Design with random ordering of pages
The interleaved pooling strategy is common across the experiments.
Measures:
To assess the quality of the crowdsourced (CS) labels, the authors introduce two measures:
● Exact Agreement (EA): agreement on the exact degree of relevance, i.e. CS = GS (Gold Standard)
● Binary Agreement (BA): agreement on whether the page is non-relevant (CS and GS are both Irrelevant) or relevant (CS and GS are each one of Confirms, Refutes, or Contains related information) to the topic's claim
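The two agreement measures can be sketched as below, assuming labels are stored as dictionaries from page id to one of four label strings. The label spellings, function names and sample data are illustrative, and agreement is computed only over pages present in both label sets.

```python
# Labels that count as "relevant" for binary agreement (spellings are assumed).
RELEVANT = {"confirms", "refutes", "related"}


def _shared_pages(cs, gs):
    shared = cs.keys() & gs.keys()
    if not shared:
        raise ValueError("no pages labelled in both CS and GS")
    return shared


def exact_agreement(cs, gs):
    """EA: fraction of shared pages where the CS label equals the GS label."""
    shared = _shared_pages(cs, gs)
    return sum(cs[p] == gs[p] for p in shared) / len(shared)


def binary_agreement(cs, gs):
    """BA: fraction of shared pages where CS and GS agree on relevant vs. irrelevant."""
    shared = _shared_pages(cs, gs)
    return sum((cs[p] in RELEVANT) == (gs[p] in RELEVANT) for p in shared) / len(shared)


cs = {"a": "confirms", "b": "related", "c": "irrelevant", "d": "refutes"}
gs = {"a": "confirms", "b": "confirms", "c": "irrelevant", "d": "irrelevant"}
print(exact_agreement(cs, gs), binary_agreement(cs, gs))
```

In the sample data, page b counts against EA (related vs. confirms) but not against BA, since both labels are on the relevant side. BA can therefore never be lower than EA.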
Analysis and Discussion
Impact on Quality Controls
● FullD HITs yield considerably more labels per HIT per worker than SimpleD.
● Labels collected from FullD HITs agree significantly more with the Gold Standard labels than those from SimpleD HITs
● FullD HITs attract workers who achieve significantly higher agreement levels with the Gold Standard labels.
Impact of Ordering Strategies
● Comparing the biased and random order of pages in FullD and SimpleD shows that the random order of pages produces higher label accuracy
Refining Relevance Labels
● Mean agreement per HIT (with 3 workers per HIT for FullD and 1 worker per HIT for SimpleD) is 62% EA and 69% BA for FullD, and 44% EA and 54% BA for SimpleD.
● After applying majority vote, FullD achieves 74% EA and 78% BA, while SimpleD achieves 61% EA and 68% BA.
When the majority rule is applied, the accuracy of the SimpleD labels improves substantially more than the accuracy of the FullD labels.
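Majority voting over the labels collected for one page can be sketched as below. The slides do not say how ties are broken; returning `None` for a tie is an assumption of this sketch, as is the function name.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label, or None when there is no unique winner."""
    if not labels:
        return None
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    # Assumption: a tie for first place yields no consensus label.
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return None
    return top_label


print(majority_label(["confirms", "confirms", "related"]))
print(majority_label(["confirms", "related"]))
```

The consensus label per page would then be compared against the gold standard in place of the individual worker labels.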
Removing workers with low label accuracy
● Filtering out workers with low accuracy labels increases the GS agreements for remaining labels
● Agreement stays unchanged until the minimum accuracy of workers reaches 40%
● Substantially more workers are removed from SimpleD than FullD
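Filtering by worker accuracy can be sketched as follows, assuming labels arrive as (worker, page, label) tuples and accuracy is exact agreement with the gold standard on the pages it covers. Workers with no gold-standard pages are dropped here, which is a choice made for this sketch rather than something the slides state.

```python
def filter_workers(labels, gold, min_accuracy):
    """Keep only labels from workers whose accuracy on gold pages is >= min_accuracy."""
    judged, correct = {}, {}
    for worker, page, label in labels:
        if page in gold:
            judged[worker] = judged.get(worker, 0) + 1
            correct[worker] = correct.get(worker, 0) + (label == gold[page])
    # Workers without any gold-standard page never enter `judged` and are dropped.
    keep = {w for w in judged if correct[w] / judged[w] >= min_accuracy}
    return [entry for entry in labels if entry[0] in keep]


gold = {"a": "confirms", "b": "irrelevant"}
labels = [
    ("w1", "a", "confirms"), ("w1", "b", "irrelevant"),  # w1: 2 of 2 correct
    ("w2", "a", "irrelevant"), ("w2", "b", "confirms"),  # w2: 0 of 2 correct
]
print(filter_workers(labels, gold, min_accuracy=0.4))
```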
Impact of Pooling Strategies
● The results show no substantial difference between the label accuracy levels of the three pooling strategies
● Answer-boosted pooling leads to the highest number of unique and relevant pages.
Other Factors Impacting Accuracy
● The total number of HITs completed by a worker provides no clue about the level of label accuracy
● The average time spent on a HIT is only weakly correlated with accuracy
● The correlation between EA and the number of labels produced by a worker is strong (dishonest and careless workers tend to skip parts of HITs)
● The structure of the flow questionnaire (Flow) correlates highly with EA accuracy
Impact on System Rankings
● MAP and Bpref
  ○ These metrics characterise the overall ranking, and their comparison provides insight into the impact of unjudged pages.
● P@10 and nDCG@10
  ○ These metrics focus on search performance in the top 10 retrieved pages.
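Two of these metrics are short enough to sketch with their standard textbook definitions: precision at k, and average precision (MAP is its mean over topics). Bpref and nDCG are omitted. The function names and sample data are illustrative, and relevance is treated as binary.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top k retrieved pages that are relevant."""
    return sum(page in relevant for page in ranking[:k]) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant page appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, page in enumerate(ranking, start=1):
        if page in relevant:
            hits += 1
            total += hits / rank
    # Relevant pages that are never retrieved contribute zero.
    return total / len(relevant)


ranking = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}
print(precision_at_k(ranking, relevant, k=2))
print(average_precision(ranking, relevant))
```

Average precision penalises relevant pages that were never retrieved (page c here), whereas P@10 looks only at the top of the list. That difference is why the two families of metrics can react differently to the choice of relevance labels.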
Quality Control
● Agreement between FullD Ranking and INEX Ranking is high across all metrics.
● The SimpleD ranking correlates with the INEX ranking better than the FullD ranking does on MAP and Bpref.
● The P@10 and nDCG@10 metrics strongly differentiate the effect of the two HIT designs on system ranking.
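Agreement between two system rankings is typically summarised by a rank correlation coefficient. The slides do not name the coefficient used, so Kendall's tau (the tau-a variant, where tied pairs count as neither concordant nor discordant) is assumed here as a common choice. The system names and scores are made up.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} mappings over their shared systems."""
    systems = sorted(scores_a.keys() & scores_b.keys())
    pairs = len(systems) * (len(systems) - 1) // 2
    if pairs == 0:
        raise ValueError("need at least two shared systems")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both score sets order the pair the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / pairs


# Hypothetical MAP scores under expert and crowdsourced judgements.
inex = {"sys1": 0.30, "sys2": 0.20, "sys3": 0.10}
crowd = {"sys1": 0.28, "sys2": 0.09, "sys3": 0.15}
print(kendall_tau(inex, crowd))
```

A value near 1 means both sets of judgements rank the systems almost identically; values near 0 indicate little agreement.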
System Rank Correlation between different designs
Impact of Ordering Strategy
● Random ordering of documents in the HITs yields a higher level of label accuracy than biased ordering.
Impact of Biased and random page order on system rank correlation with INEX ranking
Impact of Pooling Strategies
● Rank-boosted pool leads to very high correlations with INEX ranking based on MAP for both the FullD and SimpleD relevance judgements.
Impact of pooling strategy on system rank correlations with INEX ranking
Evaluation of Prove It Systems
● The authors investigate the use of the crowdsourced relevance judgements to evaluate the Prove-It runs of the INEX 2010 Book Track.
● The focus is on the FullD HIT design.
● Systems are evaluated:
  ○ With relevance judgements from FullD HITs only
  ○ By merging FullD relevance judgements with gold standard relevance judgements
● FullD relevance judgements lead to slightly different system rankings from the INEX ranking, since, by design, the crowdsourced document pool included pages outside the GS pool.
● The correlation between extended system rankings based on P@10 (0.17) and nDCG@10 (0.12) using FullD relevance judgements is low.
System rank correlations with ranking over official submissions (top) and extended set (bottom)
Conclusions
● FullD leads to significantly higher label quality
● Random page ordering in HITs leads to significantly higher label accuracy
● Consensus over multiple judgements leads to more reliable labels
● Completion rate of the questionnaire flow and the fraction of obtained labels provide good indicators of label quality
● P@10 and nDCG@10 metrics are more effective in evaluating the effectiveness of crowdsourcing through system rankings.
● Filtering out workers with low label accuracy reduces the pooling effect.
Amazon Mechanical Turk
HIT
Discussions: