CSCW FamilySearch Indexing


CSCW, SAN ANTONIO, TX, FEB 26, 2013

Derek Hansen, Patrick Schone, Douglas Corey, Matthew Reid, & Jake Gehring

QUALITY CONTROL MECHANISMS FOR CROWDSOURCING: PEER REVIEW, ARBITRATION, & EXPERTISE AT FAMILYSEARCH INDEXING

FamilySearch.org

FamilySearch Indexing (FSI)

FSI in the Broader Landscape

• Crowdsourcing Project: Aggregates discrete tasks completed by volunteers who replace professionals (Howe, 2006; Doan, et al., 2011)

• Human Computation System: Humans use a computational system to work on a problem that may someday be solvable by computers (Quinn & Bederson, 2011)

• Lightweight Peer Production: Largely anonymous contributors independently completing discrete, repetitive tasks provided by authorities (Haythornthwaite, 2009)

Design Challenge: Improve efficiency without sacrificing quality

[Chart: amount of scanned documents over time]

Quality Control Mechanisms

• 9 types of quality control for human computation systems (Quinn & Bederson, 2011)
• Redundancy
• Multi-level review
• Find-Fix-Verify pattern (Bernstein, et al., 2010)
• Weight proposed solutions by reputation of contributor (McCann, et al., 2003)
• Peer or expert oversight (Cosley, et al., 2005)
• Tournament selection approach (Sun, et al., 2011)

A-B-Arbitrate process (A-B-ARB)

[Diagram: A → B → ARB]

Currently Used Mechanism
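A minimal sketch of this flow, using a hypothetical dict-based record format and arbitrate() callable rather than FSI's actual implementation: A and B key the record independently, and only fields where their values disagree are routed to the arbitrator.

```python
# Sketch of the A-B-Arbitrate (A-B-ARB) flow. The record format and the
# arbitrate() callable are hypothetical stand-ins, not FSI's internals.

def resolve_ab_arb(a_record, b_record, arbitrate):
    """a_record, b_record: dicts mapping field name -> transcribed value.
    arbitrate(field, a_val, b_val) returns the final value (a human in FSI)."""
    final = {}
    for field, a_val in a_record.items():
        b_val = b_record.get(field)
        if a_val == b_val:
            final[field] = a_val                           # A and B agree
        else:
            final[field] = arbitrate(field, a_val, b_val)  # disagreement: arbitrate
    return final
```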

Peer review process (A-R-RARB)

[Diagram: A → R → RARB, with labels "Already Filled In" and "Optional?"]

Proposed Mechanism
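By contrast, a sketch of the proposed peer-review flow under the same hypothetical data model: the reviewer R starts from A's already-filled-in entries and submits corrections only where needed, and in the full A-R-RARB variant the fields R changed can be arbitrated once more.

```python
# Sketch of the peer-review flow (A-R, optionally A-R-RARB). The record format
# and the optional rarb() callable are hypothetical illustrations.

def resolve_a_r(a_record, r_corrections, rarb=None):
    """a_record: dict of field -> A's value, shown pre-filled to the reviewer.
    r_corrections: dict of field -> R's corrected value, only for changed fields.
    rarb: optional callable(field, a_val, r_val) -> final value."""
    final = dict(a_record)                  # start from A's work
    for field, r_val in r_corrections.items():
        if rarb is None:
            final[field] = r_val            # plain A-R: accept the review
        else:                               # A-R-RARB: re-arbitrate R's changes
            final[field] = rarb(field, a_record.get(field), r_val)
    return final
```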

Two-Act Play

Act I: Experience

What is the role of experience in quality and efficiency?

Historical data analysis using full US and Canadian Census records from 1920 and earlier

Act II: Quality Control

Is peer review or arbitration better in terms of quality and efficiency?

Field experiment using 2,000 images from the 1930 US Census Data & corresponding truth set

Act I: Experience

Quality is estimated based on A-B agreement (no truth set)

Efficiency is calculated using keystroke-logging data with idle time and outliers removed
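A rough sketch of how these two measures could be computed; the field-level agreement and the 60-second idle cutoff below are illustrative assumptions, not the paper's exact procedure.

```python
from collections import defaultdict

def ab_agreement_by_field(record_pairs):
    """record_pairs: iterable of (a_record, b_record) dicts over many records.
    Returns field name -> share of records where A's and B's values match."""
    agree, total = defaultdict(int), defaultdict(int)
    for a_rec, b_rec in record_pairs:
        for field, a_val in a_rec.items():
            total[field] += 1
            agree[field] += int(a_val == b_rec.get(field))
    return {field: agree[field] / total[field] for field in total}

def active_seconds(keystroke_times, idle_cutoff=60.0):
    """Time on task from keystroke timestamps (in seconds), dropping idle gaps
    longer than idle_cutoff; per-record outliers would be filtered afterwards."""
    times = sorted(keystroke_times)
    gaps = (later - earlier for earlier, later in zip(times, times[1:]))
    return sum(gap for gap in gaps if gap <= idle_cutoff)
```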

A-B agreement by field

A-B agreement by language (1871 Canadian Census)

English: Given Name 79.8%, Surname 66.4%
French: Given Name 62.7%, Surname 48.8%

A-B agreement by experience

Birth Place: All U.S. Censuses

[Chart: agreement rate by A experience (novice ↔ experienced) and B experience (novice ↔ experienced)]

A-B agreement by experience

Given Name: All U.S. Censuses

[Chart: agreement rate by A experience (novice ↔ experienced) and B experience (novice ↔ experienced)]

A-B agreement by experience

Surname: All U.S. Censuses

[Chart: agreement rate by A experience (novice ↔ experienced) and B experience (novice ↔ experienced)]

A-B agreement by experience

Gender: All U.S. Censuses

[Chart: agreement rate by A experience (novice ↔ experienced) and B experience (novice ↔ experienced)]

A-B agreement by experience

Birthplace: English-speaking Canadian Census

[Chart: agreement rate by A experience (novice ↔ experienced) and B experience (novice ↔ experienced)]

Time & keystrokes by experience

Summary & Implications of Act I

Experienced workers are faster and more accurate, and these gains continue even at high levels of experience

- Focus on retention

- Encourage both novices & experts to do more

- Develop interventions to speed up experience gains (e.g., send users common mistakes made by people at their experience level)

Summary & Implications of Act I

Contextual knowledge (e.g., Canadian placenames) and specialized skills (e.g., French language fluency) are needed for some tasks

- Recruit people with existing knowledge & skills

- Provide contextual information when possible (e.g., Canadian placename prompts)

- Don’t remove context (e.g., captcha)

- Allow users to specialize?

Act II: Quality Control

A-B-ARB data from original transcribers (Feb 2011)

A-R-RARB data includes original A data and newly collected R and RARB data from people new to this method (Jan-Feb of 2012)

Truth Set data from a company with an independent audit by FSI experts

Statistical Test: mixed-model logistic regression (accurate or not) with random effects, controlling for expertise
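A simplified illustration of this test: the file and column names are hypothetical, and a plain logistic regression stands in for the paper's mixed model, which additionally includes random effects (e.g., per transcriber).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical field-level results: one row per transcribed field, with columns
#   accurate  (1 if the value matches the truth set, else 0)
#   method    ("A", "A-B-ARB", "A-R", or "A-R-RARB")
#   expertise (the transcriber's prior experience, e.g., batches completed)
df = pd.read_csv("field_level_results.csv")

# Fixed-effects stand-in for the mixed-model logistic regression: accuracy as a
# function of quality-control method, controlling for expertise. Random effects
# (e.g., per worker or per image) would require a mixed-effects GLM instead.
model = smf.logit("accurate ~ C(method) + expertise", data=df).fit()
print(model.summary())
```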

Limitations

• Experience levels of R and RARB were lower than expected, though we did statistically control for this
• Original B data used in A-B-ARB for certain fields was transcribed in a non-standard manner, requiring adjustment

No Need for RARB

• No gains in quality from extra arbitration of peer-reviewed data (A-R = A-R-RARB)
• RARB takes some time, so the process is better off without it

Quality Comparison

• Both methods were statistically better than A alone

• A-B-ARB had slightly lower error rates than A-R

• R “missed” more errors, but also introduced fewer errors

Time Comparison

Summary & Implications of Act II

Peer Review shows considerable efficiency gains with nearly as good quality as Arbitration

- Prime reviewers to find errors (e.g., prompt them with expected # of errors on a page)

- Highlight potential problems (e.g., let A flag tough fields)

- Route difficult pages to experts

- Consider an A-R1-R2 process when high quality is critical

Summary & Implications of Act II

Reviewing reviewers isn't always worth the time

- At least in some contexts, Find-Fix may not need Verify

Quality of different fields varies dramatically

- Use different quality control mechanisms for harder or easier fields

Integrate human and algorithmic transcription

- Use algorithms on easy fields & integrate them into the review process so machine learning can occur

Questions

• Derek Hansen (dlhansen@byu.edu)
• Patrick Schone (BoiseBound@aol.com)
• Douglas Corey (corey@mathed.byu.edu)
• Matthew Reid (matthewreid007@gmail.com)
• Jake Gehring (GehringJG@familysearch.org)