Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking

Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling

Presented by Kumar Ashish

Traditional Method

● Requires a trained group of experts to gather information

● Precise guidelines

● Problem of scalability
  ○ INEX Book Collection
    ■ 50,239 books
    ■ 83 Prove-It topics
    ■ An assessor requires 33 days to judge a single topic when spending 95 minutes each day

Crowdsourcing

● Crowdsourcing is a method of outsourcing work through an open call for contributions from members of the crowd, who are invited to carry out Human Intelligence Tasks (HITs) in exchange for micro-payments, social recognition, or entertainment value.

● It offers a solution to the scalability problem.

Problems with Crowdsourcing

● Suffers from poor output quality
  ○ Workers' dishonest and careless behavior
    ■ Workers motivated by financial gain may aim to complete as many HITs as possible within a given time.
  ○ Poor task design by the task requester

Solution:

● Include trap questions
● Include qualifying questions
● Use a gold standard data set for which agreement can be measured
● Timing controls
● Challenge-response tests (CAPTCHA)
● Build redundancy into the task design
● Model annotator quality

Objective

● Investigate the impact of aspects of Human Intelligence Task (HIT) design on the quality of relevance labels provided by the crowd.
  ○ The investigation focuses on three aspects:
    ■ Quality control elements
    ■ Document pooling and sampling for relevance judgements by the crowd
    ■ Document ordering within a HIT for presentation to the workers

Prove It!

● It aims to investigate effective ways to retrieve relevant parts of books that can aid a user in confirming or refuting a given factual claim.

How?

● Participating systems are required to retrieve and submit a ranked list of book pages per topic that can confirm or refute the topic claim, or that contain information related to the topic.

● Task performed by assessors:
  ○ Assessors are free to choose the topic.
  ○ Assessors are free to choose books. The books are ordered based on their rank.
  ○ Once inside a book, assessors are required to judge all listed pages.
  ○ Each page can take one of four values:
    ■ Confirms some aspects of the claim
    ■ Refutes some aspects of the claim
    ■ Contains information related to the claim
    ■ Irrelevant

Example:

Claim: Imperialistic Foreign Policy led to World War 2

First Page: Confirms
Second Page: Contains information that relates to the claim

Approach

Experimental Data

Gold Standard:

● INEX 2010 Prove It topics
  ○ The authors use a set of 21 topics with an average of 169 judged pages per topic.

Experiment Design

● Pooling Strategy
● Document Ordering
● HIT Design and Quality Control
● Experimental Grid
● Measures

Pooling Strategy

● Top-n pool:
  ○ The top n pages of the official Prove It runs are selected using a round-robin strategy.
● Rank-boosted pool:
  ○ Pages from the Prove It runs are re-ranked based on the book's highest rank and popularity across all the Best Books runs and the Prove It runs.
● Answer-boosted pool:
  ○ Pages from the Prove It runs are re-ranked based on their content similarity to the topic.

The authors select pages for each HIT by interleaving the three pooling strategies.
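The round-robin selection used for the top-n pool, and the interleaving of several pools, can both be sketched with one helper. This is a minimal illustration, not the authors' implementation: the function name `round_robin_pool`, the page identifiers, and the rule of skipping duplicates are all assumptions.

```python
from itertools import zip_longest


def round_robin_pool(rankings, limit):
    """Take one page from each ranking in turn until `limit` unique pages are pooled.

    `rankings` is a list of ranked page-id lists (one per run or per pool).
    Pages already pooled are skipped (assumed de-duplication rule).
    Page ids are assumed never to be None.
    """
    if limit <= 0:
        return []
    pool = []
    seen = set()
    # zip_longest yields rank 1 of every list, then rank 2, and so on,
    # padding exhausted lists with None.
    for tier in zip_longest(*rankings):
        for page in tier:
            if page is None or page in seen:
                continue
            seen.add(page)
            pool.append(page)
            if len(pool) == limit:
                return pool
    return pool


# Hypothetical runs for a single topic.
runs = [["p1", "p2", "p3"], ["p2", "p4"], ["p5", "p1", "p6"]]
top_pool = round_robin_pool(runs, limit=4)
```

Passing the three pool rankings (top-n, rank-boosted, answer-boosted) as `rankings` gives one possible reading of "interleaving" the pooling strategies.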

Document Ordering

● Biased order
  ○ HITs are constructed by preserving the order of pages produced by a given pooling approach, i.e. based on decreasing expected relevance.
● Random order
  ○ HITs are constructed by first inserting the known relevant pages at any position in the HITs, and then randomly distributing the remaining pages.
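The difference between the two orderings can be sketched as follows. This is a simplification under stated assumptions: the random variant shuffles the whole pool rather than placing known relevant pages first, and the function name `build_hits`, the `hit_size` parameter, and the fixed seed are hypothetical.

```python
import random


def build_hits(pages, hit_size, order="biased", seed=0):
    """Split a pooled page list into HITs of at most `hit_size` pages.

    order="biased": keep the pool order (decreasing expected relevance).
    order="random": shuffle the pool before splitting (simplified variant).
    """
    pages = list(pages)  # copy so the caller's list is not modified
    if order == "random":
        random.Random(seed).shuffle(pages)
    elif order != "biased":
        raise ValueError("order must be 'biased' or 'random'")
    return [pages[i:i + hit_size] for i in range(0, len(pages), hit_size)]


pool = ["p1", "p2", "p3", "p4", "p5", "p6"]
biased_hits = build_hits(pool, hit_size=2, order="biased")
random_hits = build_hits(pool, hit_size=2, order="random")
```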

Example

Claim: Imperialistic Foreign Policy led to World War 2

Question: What is the relevance label of Document "Fall of Ottoman Empire"?

Order 1:
1. Causes of World War 2
2. World War 2
3. World War 1
4. Fall of Ottoman Empire
5. Indus Valley Civilization

Order 2:
1. Indus Valley Civilization
2. Fall of Ottoman Empire
3. Causes of World War 2
4. World War 1
5. World War 2

HIT Design and Quality Control

● The authors devised control mechanisms to verify worker engagement in order to reduce careless behaviour, including the extreme case of dishonest worker behaviour.

● To check the effect of these control mechanisms, the authors devised two types of HITs:
  ○ Full Design (FullD)
  ○ Simple Design (SimpleD)

Full Design(FullD)

■ Warning: "At least 60% of the labels need to agree with expert provided labels in order to qualify for payments"
■ Trap question: "I did not pay attention"
■ To reduce the effect of random clicking, one can use flow control so that the answer to the next question depends on the answer given to the previous question.
■ CAPTCHA: to detect human input in the online form
■ Participation is restricted to workers who have completed over 100 HITs at a 95+% approval rate
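The two numeric thresholds in the FullD design (the participation restriction and the 60% agreement warning) can be sketched as simple checks. The function names, the dict-based label format, and the choice to measure agreement as exact label matches on pages that have an expert label are assumptions made for illustration only.

```python
MIN_COMPLETED_HITS = 100    # "over 100 HITs"
MIN_APPROVAL_RATE = 0.95    # "95+% approval rate"
MIN_EXPERT_AGREEMENT = 0.60  # "at least 60% of the labels"


def may_participate(completed_hits, approval_rate):
    """Participation restriction: more than 100 HITs at 95% approval or better."""
    return completed_hits > MIN_COMPLETED_HITS and approval_rate >= MIN_APPROVAL_RATE


def qualifies_for_payment(worker_labels, expert_labels):
    """Check the 60% agreement rule on pages that have an expert label.

    Both arguments map page id -> label. Returning False when there is
    no overlap is an arbitrary choice for this sketch.
    """
    shared = [page for page in worker_labels if page in expert_labels]
    if not shared:
        return False
    agreeing = sum(worker_labels[page] == expert_labels[page] for page in shared)
    return agreeing / len(shared) >= MIN_EXPERT_AGREEMENT
```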

Example

Simple Design(SimpleD)

● No restrictions on which workers can participate
● Includes only one trap question
● No qualifying test
● No warning
● No CAPTCHA

Experimental Grid

● FullD-bias
  ○ Full Design with biased ordering of pages
● FullD-rand
  ○ Full Design with random ordering
● SimpleD-bias
  ○ Simple Design with biased ordering of pages
● SimpleD-rand
  ○ Simple Design with random ordering

The interleaved pooling strategy is common across the experiments.

Measures:

To assess the quality of the crowdsourced labels (CS), the authors introduce two measures:

● Exact Agreement (EA): agreement on the exact degree of relevance, i.e. CS = GS (Gold Standard).
● Binary Agreement (BA): agreement on whether the page is non-relevant (CS and GS are both Irrelevant) or relevant (CS and GS are each one of Confirms, Refutes, or Contains related information) to the topic of the claim.
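The two measures can be computed as below. This is a minimal sketch: the label strings, the dict-based input format, and the restriction to pages present in both label sets are assumptions, not details given in the slides.

```python
# Assumed label vocabulary for the four relevance values.
RELEVANT_LABELS = {"confirms", "refutes", "related"}


def _shared_pages(cs, gs):
    shared = [page for page in cs if page in gs]
    if not shared:
        raise ValueError("no pages labelled in both CS and GS")
    return shared


def exact_agreement(cs, gs):
    """EA: fraction of shared pages where CS and GS carry the same label."""
    shared = _shared_pages(cs, gs)
    return sum(cs[p] == gs[p] for p in shared) / len(shared)


def binary_agreement(cs, gs):
    """BA: fraction of shared pages where CS and GS agree on relevant vs. irrelevant."""
    shared = _shared_pages(cs, gs)
    return sum((cs[p] in RELEVANT_LABELS) == (gs[p] in RELEVANT_LABELS)
               for p in shared) / len(shared)


cs = {1: "confirms", 2: "related", 3: "irrelevant", 4: "refutes"}
gs = {1: "confirms", 2: "confirms", 3: "irrelevant", 4: "irrelevant"}
ea = exact_agreement(cs, gs)   # pages 1 and 3 match exactly
ba = binary_agreement(cs, gs)  # pages 1, 2 and 3 agree on relevance
```

BA is never lower than EA on the same pages, because every exact match is also a binary match.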

Analysis and Discussion

Impact on Quality Controls

● FullD HITs yield considerably more labels per HIT per worker than SimpleD.

● Labels collected from FullD HITs agree significantly more with the Gold Standard labels than those from SimpleD HITs.

● FullD HITs attract workers who achieve significantly higher agreement levels with the Gold Standard labels.

Impact of Ordering Strategies

● Comparing biased and random page order in both FullD and SimpleD shows that random ordering of pages produces higher label accuracy.

Refining Relevance Labels

● Mean agreement per HIT (with 3 workers per HIT for FullD and 1 worker per HIT for SimpleD) is 62% EA and 69% BA for FullD, and 44% EA and 54% BA for SimpleD.

● After applying majority vote, FullD achieves 74% EA and 78% BA, while SimpleD achieves 61% EA and 68% BA.

When the majority rule is applied, the accuracy of the SimpleD labels improves substantially more than the accuracy of the FullD labels.
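Majority voting over multiple labels per page can be sketched as follows. The tuple format `(page_id, worker_id, label)` and the rule of dropping pages whose top labels are tied are assumptions; the slides do not say how ties were resolved.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None if the top count is tied or empty."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no majority label (assumed handling)
    return counts[0][0]


def consensus_labels(judgements):
    """Map each page id to its majority label, skipping pages without a majority."""
    by_page = {}
    for page_id, _worker_id, label in judgements:
        by_page.setdefault(page_id, []).append(label)
    consensus = {}
    for page_id, labels in by_page.items():
        winner = majority_vote(labels)
        if winner is not None:
            consensus[page_id] = winner
    return consensus


judgements = [
    ("p1", "w1", "confirms"), ("p1", "w2", "confirms"), ("p1", "w3", "related"),
    ("p2", "w1", "irrelevant"), ("p2", "w2", "related"),
]
labels = consensus_labels(judgements)
```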

Removing workers with low label accuracy

● Filtering out workers with low label accuracy increases the GS agreement of the remaining labels.

● Agreement stays unchanged until the minimum accuracy of workers reaches 40%

● Substantially more workers are removed from SimpleD than FullD
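Filtering by a minimum worker accuracy can be sketched as below. The input format, the use of exact agreement with the gold standard as the accuracy measure, and the decision to treat workers with no gold-labelled pages as having accuracy 0 are assumptions for illustration.

```python
def worker_accuracy(judgements, gold):
    """Return worker id -> fraction of that worker's labels matching the gold standard.

    `judgements` is a list of (page_id, worker_id, label) tuples and `gold`
    maps page id -> gold label. Pages without a gold label are ignored.
    """
    matches, totals = {}, {}
    for page_id, worker_id, label in judgements:
        if page_id not in gold:
            continue
        totals[worker_id] = totals.get(worker_id, 0) + 1
        matches[worker_id] = matches.get(worker_id, 0) + (label == gold[page_id])
    return {worker: matches[worker] / totals[worker] for worker in totals}


def filter_by_accuracy(judgements, gold, min_accuracy):
    """Keep only judgements from workers whose accuracy is at least `min_accuracy`."""
    judgements = list(judgements)
    accuracy = worker_accuracy(judgements, gold)
    # Workers with no gold-labelled pages default to 0.0 (assumed handling).
    return [j for j in judgements if accuracy.get(j[1], 0.0) >= min_accuracy]


gold = {"p1": "confirms", "p2": "irrelevant"}
judgements = [
    ("p1", "w1", "confirms"), ("p2", "w1", "irrelevant"),
    ("p1", "w2", "irrelevant"), ("p2", "w2", "irrelevant"),
    ("p1", "w3", "refutes"), ("p2", "w3", "confirms"),
]
kept = filter_by_accuracy(judgements, gold, min_accuracy=0.4)
```

Sweeping `min_accuracy` upward and recomputing agreement on the kept labels is one way to reproduce the kind of threshold analysis the bullets above describe.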

Impact of Pooling Strategies

● Above table shows that there is no substantial difference between label accuracy levels for the three pooling strategies

● Answer-boosted pooling leads to the highest number of unique and relevant pages.

Other Factors Impacting Accuracy

● The total number of HITs completed by a worker provides no clue about the level of label accuracy.

● Average time spent on a HIT is only weakly correlated with accuracy.

● The correlation between EA and the number of labels produced by workers is strong (dishonest and careless workers tend to skip some parts of HITs).

● The structure of the questionnaire flow (Flow) is highly correlated with EA.

Impact on System Rankings

● MAP and Bpref
  ○ These system rankings characterise the overall ranking, and their comparison provides insights into the impact of unjudged pages.
● P@10 and nDCG@10
  ○ These system rankings focus on search performance in the top 10 retrieved pages.
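The two top-10 metrics can be sketched as below. These are common textbook formulations, not necessarily the exact variants used in the evaluation: unjudged pages are treated as non-relevant, the DCG discount is log2(rank + 1), and the gain values assigned to relevance labels are hypothetical.

```python
import math


def precision_at_k(ranked_pages, relevant, k=10):
    """P@k: fraction of the top k retrieved pages that are relevant."""
    return sum(page in relevant for page in ranked_pages[:k]) / k


def ndcg_at_k(ranked_pages, gains, k=10):
    """nDCG@k with a log2(rank + 1) discount.

    `gains` maps page id -> graded gain; pages missing from it count as 0.
    """
    def dcg(values):
        # enumerate starts at 0, so rank = i + 1 and the discount is log2(i + 2)
        return sum(gain / math.log2(i + 2) for i, gain in enumerate(values))

    actual = dcg([gains.get(page, 0) for page in ranked_pages[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


run = ["a", "b", "c", "d"]
p_at_4 = precision_at_k(run, relevant={"a", "c"}, k=4)
perfect = ndcg_at_k(["a", "c", "b", "d"], gains={"a": 1, "c": 1}, k=4)
```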

Quality Control

● Agreement between FullD Ranking and INEX Ranking is high across all metrics.

● The SimpleD ranking correlates better with the INEX ranking than the FullD ranking does on MAP and Bpref.

● P@10 and nDCG@10 metrics strongly differentiate the effect of two HITs design on system ranking.

System Rank Correlation between different designs
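The slides do not name the rank correlation coefficient. As an illustration only, the sketch below assumes Kendall's tau (the tau-a variant, without tie correction), which is widely used for comparing system rankings in IR evaluation. The system names and scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two system -> score mappings over shared systems."""
    systems = sorted(set(scores_a) & set(scores_b))
    if len(systems) < 2:
        raise ValueError("need at least two shared systems")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both evaluations order the pair the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical scores for three systems under two sets of judgements.
inex_scores = {"sysX": 0.30, "sysY": 0.20, "sysZ": 0.10}
crowd_scores = {"sysX": 0.28, "sysY": 0.09, "sysZ": 0.15}
tau = kendall_tau(inex_scores, crowd_scores)
```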

Impact of Ordering Strategy

● Random ordering of documents in the HITs yields a higher level of label accuracy than biased ordering.

Impact of Biased and random page order on system rank correlation with INEX ranking

Impact of Pooling Strategies

● Rank-boosted pool leads to very high correlations with INEX ranking based on MAP for both the FullD and SimpleD relevance judgements.

Impact of pooling strategy on system rank correlations with INEX ranking

Evaluation of Prove It Systems

● The authors investigate the use of the crowdsourced relevance judgements to evaluate the Prove-It runs of the INEX 2010 Book Track.

● The focus is on the FullD HIT design.
● Systems are evaluated:
  ○ With relevance judgements from FullD HITs only
  ○ By merging FullD relevance judgements with gold standard relevance judgements

● FullD relevance judgements lead to slightly different system rankings from the INEX ranking, since, by design, the crowdsourced document pool included pages outside the GS pool.

● The correlation between extended system rankings based on P@10 (0.17) and nDCG@10 (0.12) using FullD relevance judgements is low.

System rank correlations with ranking over official submissions (top) and extended set (bottom)

Conclusions

● FullD leads to significantly higher label quality.
● Random page ordering in HITs leads to significantly higher label accuracy.
● Consensus over multiple judgements leads to more reliable labels.
● Completion rate of the questionnaire flow and the fraction of obtained labels provide good indicators of label quality.
● P@10 and nDCG@10 metrics are more effective in evaluating the effectiveness of crowdsourcing through system rankings.
● Filtering out workers with low label accuracy reduces the pooling effect.

Amazon Mechanical Turk

Discussions:
