
Crowd-Sourced Text Analysis: Reproducible and Agile Production of Political Data

Kenneth Benoit (LSE), with Drew Conway (NYU), Ben Lauderdale (LSE), Michael Laver (NYU), and Slava Mikhaylov (UCL)


what does it mean to be agile?

reproducible?

crowd-sourcing for text analysis

proof-of-concept application(s)


agile data production

• interactive: shaped by an ongoing exchange between a researcher and his or her needs

• flexible (rather than fixed)
• responsive
• rapid


most secondary data today is not agile

• general-purpose
• attempts to be comprehensive
• resource-intensive
• generates “lock-in”
• produced by a few (trained) experts:

Example - expert-coded policy manifestos in political science


alternative: supplement automated approaches with human decision-making

• complex data-generation projects are broken down into simple, highly targeted tasks

• humans instead of algorithms ensure the validity of natural language processing in a wide range of applications

• retain the reproducibility of automated methods


our solution: draw on the crowd

• crowd-sourcing: outsourcing a task by distributing it in simplified parts to a large, unspecified group

• contrasts with expert coding:
  • less expertise, but in far greater numbers
  • no one sees a whole text: text analysis tasks are served partially and randomly
  • multiple coders per sentence, different coders treated as exchangeable

• we use a scaling model to aggregate judgments into quantities of interest

• very easy to reproduce
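
To make the serving scheme concrete, here is a minimal sketch in Python (names and numbers are illustrative, not the actual deployment code): each sentence is sent to several randomly drawn coders, so sentences are distributed partially and at random rather than text by text.

    # Minimal illustration (not the actual deployment code): distribute text
    # units partially and at random so that each sentence receives several
    # judgments from exchangeable coders.
    import random

    def assign_sentences(n_sentences, coders, judgments_per_sentence=5):
        """Map each sentence index to a random set of distinct coders."""
        return {
            sid: random.sample(coders, judgments_per_sentence)
            for sid in range(n_sentences)
        }

    # e.g. 1,000 sentences served to a pool of 200 crowd coders,
    # five judgments per sentence
    plan = assign_sentences(1000, [f"coder_{j}" for j in range(200)])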


our problem: classifying text units

• the idea: estimate a political party’s policy position through content analysis of its texts

• standard testing domain: party manifestos

• the approach: break the manifestos into sentences and use human judgment to apply pre-defined codes

• experts: all sentences, by one coder
• crowd: lots of coders, only some sentences each
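
As an illustration of what a single crowd judgment might record under a pre-defined coding scheme (the class labels and scale below are a hypothetical simplification, not the exact instrument used):

    # Hypothetical simplification of the coding task: each judgment assigns a
    # sentence to a policy domain and, within domain, a position on a
    # -2..+2 scale; "neither" sentences carry no position.
    from dataclasses import dataclass
    from typing import Optional

    DOMAINS = ("economic", "social", "neither")

    @dataclass
    class Judgment:
        sentence_id: int
        coder_id: str
        domain: str                  # one of DOMAINS
        position: Optional[int]      # -2..+2, or None if domain == "neither"

    def is_valid(j: Judgment) -> bool:
        if j.domain not in DOMAINS:
            return False
        if j.domain == "neither":
            return j.position is None
        return j.position in (-2, -1, 0, 1, 2)

    # one crowd judgment on sentence 42
    print(is_valid(Judgment(42, "coder_7", "economic", -1)))   # True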


example: labelling immigration sentences


findings

• non-experts produce valid results, you just need more of them

• experts are just another form of crowd: experts also have variance

• crowd-sourced data production is reliable - and offers reproducibility

• flexible, bespoke, low-cost
• can work for any data production job that is easily distributed into simple tasks


we used an IRT-type scaling model to estimate position

• also allows for coder effects and sentence difficulty effects

• can estimate uncertainty through posterior simulation (MCMC)
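
In broad strokes, an IRT-type specification along these lines could look as follows (a sketch in illustrative notation, not necessarily the exact model estimated here):

    \begin{aligned}
    y^{*}_{ij} &= \psi_j \,(\theta_{d(i)} + \chi_i) + \varepsilon_{ij},
        \qquad \varepsilon_{ij} \sim \mathrm{Logistic}(0,1), \\
    y_{ij} &= k \quad \text{if} \quad \tau_{k-1} < y^{*}_{ij} \le \tau_{k},
    \end{aligned}

where $y_{ij}$ is coder $j$'s ordinal code for sentence $i$, $\theta_{d(i)}$ is the latent position of the document containing sentence $i$, $\chi_i$ is a sentence difficulty effect, $\psi_j$ a coder sensitivity effect, and the cutpoints $\tau_k$ map the latent scale onto the ordinal codes; sampling the posterior by MCMC then gives uncertainty intervals for the document positions.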


deployment on Crowdflower

• http://crowdflower.com
• a front-end to many crowd-sourcing platforms, not just Mechanical Turk
• uses a quality monitoring system so that you have to maintain an 80% “trust” score or be rejected
• trust maintained through “gold” questions carefully selected and agreed by experts
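
The quality-monitoring logic can be sketched roughly like this (an illustration of the idea only, not CrowdFlower's implementation; the function names are hypothetical):

    # Rough illustration of gold-question screening (not CrowdFlower's actual
    # implementation): a coder's "trust" is the share of expert-agreed gold
    # sentences they code correctly; coders falling below 80% are dropped.
    def trust_score(judgments, gold_answers):
        """judgments: {sentence_id: code} from one coder;
        gold_answers: {sentence_id: expert-agreed code} for gold sentences."""
        seen_gold = [sid for sid in judgments if sid in gold_answers]
        if not seen_gold:
            return None   # coder has not yet hit any gold questions
        correct = sum(judgments[sid] == gold_answers[sid] for sid in seen_gold)
        return correct / len(seen_gold)

    def keep_coder(judgments, gold_answers, threshold=0.80):
        score = trust_score(judgments, gold_answers)
        return score is None or score >= threshold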



Figure 6: Expert and crowd-sourced estimates of economic and social policy codes, all manifestos.

Comparing the positional estimates from the model on these two scales, we see a similar correlation in Figure 9 from the six core manifestos on which we calibrated the total codings. Even for just six manifestos, the correlation between the expert and the crowd codings was 0.99 for the economic scale, and 0.92 for the somewhat noisier social scale.

[Figure 6 panels: scatterplots of expert mean code against crowd mean code for the Economic and Social scales, both axes from -2 to 2.]

overall correlations: 0.96 for the economic dimension, and 0.92 for the social dimension

comparing experts to crowd-coders


correlation of aggregate measures


… our evidence that the crowd-sourced estimates of party policy positions can be used as substitutes for the expert estimates, which is our main concern in this paper.

Figure 3. Expert and crowd-sourced estimates of economic and social policy positions.

Our scaling model provides a theoretically well-grounded way to aggregate all the information in our expert or crowd codes, relating the underlying position of the political text both to the “difficulty” of a particular sentence and to a coder’s propensity to identify the correct policy domain and code the policy position within domain.[24] Because the positions from the scaling model depend on parameters estimated using the full set of coders and codings, changes to the manifesto set can affect the relative scaling. The simple mean of means method, however, is invariant to rescaling and always produces the same results, even for a single manifesto. Comparing crowd-sourced estimates from the model to those produced by a simple averaging of the mean of mean sentence scores, for instance, we find correlations of 0.96 for the economic and 0.97 for the social policy positions of the 18 manifestos. We present both methods as …

[24] For instance, each coder has a domain sensitivity as well as a proclivity to code policy as left or right. We report more fully on diagnostic results for our coders on the basis of the auxiliary model quantity estimates in the supplemental materials.
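
The simple mean of means aggregation referred to above, and the correlation used to compare it with expert scores, can be sketched as follows (the data layout is invented for illustration):

    # Sketch of the "mean of means" aggregation: average codes within each
    # sentence, then average sentence means within each manifesto, and
    # correlate the result with expert-based scores.
    from collections import defaultdict
    from statistics import mean

    def mean_of_means(codes):
        """codes: iterable of (manifesto_id, sentence_id, position_code)."""
        by_sentence = defaultdict(list)
        for man, sent, pos in codes:
            by_sentence[(man, sent)].append(pos)
        by_manifesto = defaultdict(list)
        for (man, _), positions in by_sentence.items():
            by_manifesto[man].append(mean(positions))
        return {man: mean(ms) for man, ms in by_manifesto.items()}

    def pearson_r(x, y):
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        var_x = sum((a - mx) ** 2 for a in x)
        var_y = sum((b - my) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5

    # crowd, expert: {manifesto_id: score} on the same scale
    # r = pearson_r([crowd[m] for m in sorted(crowd)],
    #               [expert[m] for m in sorted(crowd)])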

[Figure 3 panels: crowd manifesto placement against expert manifesto placement, Economic (r = 0.96) and Social (r = 0.92), with points labelled by manifesto year (87, 92, 97, 01, 05, 10) and both axes from -2 to 2.]


how many judgments are needed?

we chose five per text unit


Figure 5. Standard errors of manifesto-level policy estimates as a function of the number of workers, for the oversampled 1987 and 1997 manifestos. Each point is the bootstrapped standard deviation of the mean of means aggregate manifesto scores, computed from sentence-level random n sub-samples from the codes.
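
The bootstrap described in the caption can be sketched as follows (an illustration with an invented data layout, reusing the mean-of-means idea above):

    # Illustrative bootstrap: for n crowd codes per sentence, repeatedly
    # sub-sample n judgments from each sentence, compute the mean-of-means
    # manifesto score, and report the standard deviation across replicates.
    import random
    from statistics import mean, stdev

    def bootstrap_se(sentence_codes, n, reps=500):
        """sentence_codes: {sentence_id: [all position codes for that sentence]}."""
        scores = []
        for _ in range(reps):
            sentence_means = [
                mean(random.sample(codes, min(n, len(codes))))
                for codes in sentence_codes.values()
            ]
            scores.append(mean(sentence_means))   # mean-of-means score
        return stdev(scores)

    # plotting bootstrap_se(codes, n) for n in range(1, 21) traces out
    # curves of the shape shown in Figure 5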

The findings show a clear trend: uncertainty over the crowd-based estimates collapses as we increase the number of workers per sentence. Indeed, the only difference between experts and the crowd is that expert variance is smaller, as we would expect. Our findings vary somewhat with policy area, given the noisier character of social policy estimates, but adding additional crowd-sourced sentence judgments led to convergence with our expert panel of 5-6 coders at around 15 crowd coders. However, the steep decline in the uncertainty of our document estimates leveled out at around five crowd judgments per sentence, at which point the absolute level of error is already low for both policy domains. While increasing the number of unbiased crowd judgments will always give better estimates, we decided on cost-benefit grounds for the second stage of our deployment to continue coding in the crowd until we had obtained five crowd judgments per sentence. This may seem a surprisingly small number, but there are a number of important factors to bear in mind in this context. First, the manifestos comprise about …

[Figure 5 panels: standard error of bootstrapped manifesto estimates against crowd codes per sentence (5 to 20), Economic and Social, with expert and crowd results shown for comparison.]


three other contexts, other languages

• new corpus: EP debate over state aid to uncompetitive coal mines

• 36 speakers, 11 countries, 10 original languages (only one English speech) translated into 22 languages

• we know the vote on the measure: can a text-based measure predict it?
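
One simple way to approach that question is sketched below (a rough version of the kind of comparison shown in Figure 7; speaker identifiers and the data layout are invented): standardize the aggregate crowd score of each speech within a language, then compare speakers who voted for the measure with those who voted against.

    # Sketch: standardize aggregate speech scores within one language, then
    # compare mean standardized scores of "for" and "against" voters.
    from statistics import mean, stdev

    def standardize(scores):
        """scores: {speaker_id: aggregate crowd score} for one language."""
        mu, sd = mean(scores.values()), stdev(scores.values())
        return {spk: (s - mu) / sd for spk, s in scores.items()}

    def separation_by_vote(z_scores, votes):
        """votes: {speaker_id: 'for' or 'against'}."""
        for_group = [z for spk, z in z_scores.items() if votes[spk] == "for"]
        against_group = [z for spk, z in z_scores.items() if votes[spk] == "against"]
        return mean(for_group) - mean(against_group)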



Figure 7. Scored speeches from a debate over state subsidies by vote, from separate crowd-sourced text analysis in six languages. Aggregate scores are standardized for direct comparison.


Correlations of 35 speaker scores:

Language   English   German   Spanish   Italian   Greek   Polish
German      0.96       --       --        --        --      --
Spanish     0.94      0.95      --        --        --      --
Italian     0.92      0.94     0.92       --        --      --
Greek       0.95      0.97     0.95      0.92       --      --
Polish      0.96      0.95     0.94      0.94      0.93     --

                     English   German   Spanish   Italian   Greek    Polish
Sentence N             414      455       418       349      454      437
Total Judgments      3,545    1,855     2,240     1,748    2,396    2,256
Cost               $109.33   $55.26    $54.26    $43.69   $68.03   $59.25
Elapsed Time (hrs)      1        3         3         7        2        1

Table 5. Summary of Results from EP Debate Coding in 6 languages

[Figure 7: mean crowd score by vote (For, n=25; Against, n=6), with separate series for the German, English, Spanish, Greek, Italian, and Polish codings.]


all greek to us