Secondary data analysis with digital trace data

Secondary data analysis with digital trace data: Examples from FLOSS research. Andrea Wiggins, 13 July 2011


Page 1: Secondary data analysis with digital trace data

Secondary data analysis with digital trace data

Examples from FLOSS research

Andrea Wiggins, 13 July 2011

Page 2: Secondary data analysis with digital trace data

Secondary Data Analysis

• Uses existing data produced or collected by someone else, usually for a different purpose

• Databases

• Repositories

• Surveys

• Emails

• Social networks


Page 3: Secondary data analysis with digital trace data

Digital Trace Data

• Records of activity (trace data) undertaken through an online information system (thus digital)

• Increasingly common in studies of online phenomena

• Large volumes of available data

• Can be complete: a census, not a sample

• May be more reliably recorded than other data


Page 4: Secondary data analysis with digital trace data

Characteristics

1. Found data (not produced for research)

2. Event-based data (not summary data)

3. Events occur over time, so it is longitudinal data
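The difference between event-based and summary data can be sketched in a few lines of Python. The event tuples below are invented for illustration; the point is only that summaries (here, monthly counts) can be derived from timestamped events, while the reverse is not possible.

```python
from collections import Counter
from datetime import datetime

# Hypothetical trace events: (timestamp, actor, action). All values are invented.
events = [
    (datetime(2005, 3, 1, 9, 0), "alice", "post"),
    (datetime(2005, 3, 1, 9, 30), "bob", "reply"),
    (datetime(2005, 4, 2, 14, 0), "alice", "reply"),
]

def monthly_counts(events):
    """Collapse timestamped events into per-month activity counts."""
    return Counter(ts.strftime("%Y-%m") for ts, _actor, _action in events)

counts = monthly_counts(events)
```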


Page 5: Secondary data analysis with digital trace data

Requirements

• Understand the original data source

• How it was collected, potential problems

• Limitations of the sample

• What the data describe

• Match with appropriate analysis methods and measures

• New types of data may require new measures

• Theoretical coherence is very important


Page 6: Secondary data analysis with digital trace data

Advantages

• Data may be “complete”

• Usually no response bias (exception: cookies)

• May cover long periods of time and large groups

• Multiple different data types, but mostly textual

• Data are often easy to acquire

• APIs or scraping web pages (with caution)

• Databases, archives, or repositories of research data

• But remember: you usually get what you pay for!


Page 7: Secondary data analysis with digital trace data

Disadvantages

• Often difficult to know limitations of data

• Data may be poorly documented

• Original creator may not be available for comment

• Volume of data can be overwhelming

• Sampling strategies needed, e.g., temporal, random

• Substantial time required for data preparation: 90% of effort

• Exceptions are everywhere and will break analyses, but can only be discovered through trial and error
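The two sampling strategies named above (random and temporal) might look like the following minimal sketch. The record layout, a dict with a "timestamp" key, and the function names are assumptions made for illustration, not part of the original study.

```python
import random
from datetime import datetime, timedelta

def random_sample(records, k, seed=0):
    """Draw up to k records uniformly at random; a fixed seed keeps it repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

def temporal_sample(records, start, end):
    """Keep only records whose timestamp falls in the half-open interval [start, end)."""
    return [r for r in records if start <= r["timestamp"] < end]

# Hypothetical records: one event per day for ten days.
records = [
    {"id": i, "timestamp": datetime(2005, 1, 1) + timedelta(days=i)}
    for i in range(10)
]
picked = random_sample(records, 3)
first_week = temporal_sample(records, datetime(2005, 1, 1), datetime(2005, 1, 8))
```

Fixing the seed is a design choice that makes a sample reproducible when the analysis has to be rerun after a data-preparation fix.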


Page 8: Secondary data analysis with digital trace data

Example: Email Networks

• Data source: email listservs for FLOSS projects

• Analysis approach: create social networks

• Within discussion threads, individuals are nodes, and links are reply-to messages

• Some conceptual issues for interpretation and choice of measures

• Technical challenges

• Temporal aggregation

• Identity resolution
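The network construction described above, with a simple form of identity resolution, could be sketched as follows. The message fields ("id", "sender", "in_reply_to"), the alias table, and the addresses are all hypothetical; real listserv archives need far more cleaning than this.

```python
from collections import Counter

# Hypothetical alias table mapping alternate addresses to one canonical identity.
ALIASES = {"bob@example.org": "bob@example.com"}

def resolve_identity(address):
    """Normalize case and whitespace, then apply the alias table."""
    addr = address.strip().lower()
    return ALIASES.get(addr, addr)

def build_reply_edges(messages):
    """Return a Counter of directed (replier, original_author) edge weights."""
    author_of = {m["id"]: resolve_identity(m["sender"]) for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in author_of:  # skip thread starters and replies to unknown messages
            replier, original = author_of[m["id"]], author_of[parent]
            if replier != original:  # ignore self-replies
                edges[(replier, original)] += 1
    return edges

# Invented example thread.
messages = [
    {"id": 1, "sender": "alice@example.com", "in_reply_to": None},
    {"id": 2, "sender": "Bob@Example.org", "in_reply_to": 1},
    {"id": 3, "sender": "alice@example.com", "in_reply_to": 2},
    {"id": 4, "sender": "bob@example.com", "in_reply_to": 1},
]
edges = build_reply_edges(messages)
```

Without the alias step, the two spellings of Bob's address would appear as two separate nodes, which is the kind of distortion identity resolution is meant to prevent.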


Page 9: Secondary data analysis with digital trace data

Temporal Aggregation

Figures from Howison et al., 2006
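Since the figures are not reproduced in this transcript, here is a minimal sketch of one way to aggregate events temporally, using overlapping sliding windows. The window and step lengths (90 and 30 days) are placeholder values, not necessarily those used in the cited work.

```python
from datetime import datetime, timedelta

def sliding_windows(events, window_days=90, step_days=30):
    """Yield (window_start, events_in_window) for overlapping time windows.

    Each event is a tuple whose first element is a datetime.
    """
    if not events:
        return
    ordered = sorted(events, key=lambda e: e[0])
    start, last = ordered[0][0], ordered[-1][0]
    window = timedelta(days=window_days)
    step = timedelta(days=step_days)
    while start <= last:
        end = start + window
        yield start, [e for e in ordered if start <= e[0] < end]
        start += step

# Invented events at day 0, day 10, and day 100.
t0 = datetime(2005, 1, 1)
events = [(t0, "a"), (t0 + timedelta(days=10), "b"), (t0 + timedelta(days=100), "c")]
windows = list(sliding_windows(events))
```

A network can then be built per window, so that measures are compared across periods rather than computed once over the whole project history.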


Page 10: Secondary data analysis with digital trace data

Network Workflow

Page 11: Secondary data analysis with digital trace data

Network Results

Cleaning up before shutting down

• Observed anomalous patterns in trackers for both projects: periodic centralization spikes

• A single user makes batch bug closings (up to 279!)

– Fire’s (feature request) tracker housekeeping appears to be preparation for project closure

– Gaim’s tracker housekeeping was more regular and repeated

• Different levels of correlation between venues, suggesting different types of interactions

• User venues more decentralized than developer venues, reflecting a greater number of participants

• Overall trend toward decentralization could be the result of different influences
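To make "centralization" concrete, the sketch below computes Freeman's degree centralization for an undirected simple graph. This is one common definition and is an assumption here; the slides do not say which variant (for example, one based on out-degree) the study used.

```python
def degree_centralization(edges):
    """Freeman degree centralization of an undirected simple graph, in [0, 1].

    edges is an iterable of (node_a, node_b) pairs; duplicates and
    self-loops are ignored.
    """
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 3:
        return 0.0  # the measure is undefined for fewer than three nodes
    degrees = [len(v) for v in neighbors.values()]
    max_degree = max(degrees)
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))

# A star (one user linked to everyone else) versus a triangle (evenly spread).
star = [("closer", "u1"), ("closer", "u2"), ("closer", "u3")]
triangle = [("u1", "u2"), ("u2", "u3"), ("u3", "u1")]
```

A single user closing many bugs in one batch produces a star-like pattern in that period, which is why such housekeeping shows up as a centralization spike.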


Page 12: Secondary data analysis with digital trace data

Example: Classification

• Replication of success-tragedy classification

• Classification criteria originally drawn from interviews with community members

• Data extracted from repositories

• Technical challenges

• Merging data from two repositories

• Processing large volume of data in multiple steps
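Merging records from two repositories might be sketched as below, keyed on project name. The repository labels, field names, and the rule that the primary source wins on conflicts are all assumptions for illustration; the slides do not describe the actual merge procedure.

```python
def merge_by_project(primary, secondary):
    """Merge two {project_name: record} dicts; primary values win on conflicts."""
    merged = {}
    for name in primary.keys() | secondary.keys():
        record = dict(secondary.get(name, {}))
        record.update(primary.get(name, {}))
        # Track provenance so gaps and conflicts can be audited later.
        record["sources"] = [
            label
            for label, source in (("primary", primary), ("secondary", secondary))
            if name in source
        ]
        merged[name] = record
    return merged

# Invented example records.
repo_a = {"proj1": {"downloads": 120}, "proj2": {"downloads": 5}}
repo_b = {"proj1": {"downloads": 118, "url": "http://example.org/proj1"}}
merged = merge_by_project(repo_a, repo_b)
```

Recording which sources contributed to each record makes it easier to trace the exceptions that tend to surface only during analysis.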


Page 13: Secondary data analysis with digital trace data

Variables

• Inputs: project names and 5 threshold values for classification tests, e.g. number of downloads

• Project statistics retrieved from repositories

• Founding date

• Data collection date

• Dates for all releases

• Number of downloads

• URL
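The variables listed above could be held in a record like the one sketched here, with derived measures compared against thresholds. The field names, the derived measures, and the threshold value are hypothetical; the slides do not state the actual classification criteria, so no class labels are assigned.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Project:
    name: str
    founded: Optional[date]
    collected: Optional[date]  # data collection date
    releases: list = field(default_factory=list)  # release dates
    downloads: Optional[int] = None
    url: str = ""

def derive_measures(p):
    """Return measures for threshold tests, or None if required data is missing."""
    if p.founded is None or p.collected is None or p.downloads is None:
        return None  # a project like this would be unclassifiable
    releases = sorted(p.releases)
    return {
        "age_days": (p.collected - p.founded).days,
        "release_count": len(releases),
        "days_since_last_release": (p.collected - releases[-1]).days if releases else None,
        "downloads": p.downloads,
    }

def passes(measure, threshold):
    """A threshold test: the measure exists and meets the minimum value."""
    return measure is not None and measure >= threshold

MIN_DOWNLOADS = 10  # hypothetical threshold, not taken from the study

demo = Project("demo", date(2004, 1, 1), date(2006, 1, 1),
               [date(2004, 6, 1), date(2005, 6, 1)], downloads=500)
measures = derive_measures(demo)
```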


Page 14: Secondary data analysis with digital trace data

Classification Workflow

Page 15: Secondary data analysis with digital trace data

Classification Results

Class           Original        Our results     Difference
unclassifiable  3,186           3,296           +110
II              13,342 (12%)    16,252 (14%)    +2,910 (+2%)
IG              10,711 (10%)    12,991 (11%)    +2,280 (+1%)
TI              37,320 (35%)    36,507 (31%)    -813 (-4%)
TG              30,592 (28%)    32,642 (28%)    +2,050 (0%)
SG              15,782 (15%)    16,045 (14%)    +263 (-1%)
other           8,422           0
Total           119,355         117,733
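The counts in the table can be cross-checked with a short script. The numbers are copied from the table; only the variable names are introduced here. The check covers totals and count differences, not the percentages.

```python
# (original count, replication count) per class, copied from the results table.
rows = {
    "unclassifiable": (3186, 3296),
    "II": (13342, 16252),
    "IG": (10711, 12991),
    "TI": (37320, 36507),
    "TG": (30592, 32642),
    "SG": (15782, 16045),
    "other": (8422, 0),
}

# Column totals: sum the first and second element of every row.
original_total, replication_total = (sum(col) for col in zip(*rows.values()))

# Per-class change in counts between the original and the replication.
differences = {name: ours - orig for name, (orig, ours) in rows.items()}
```

Both column sums match the reported totals (119,355 and 117,733), and the computed differences agree with the Difference column for every class that reports one.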


Page 16: Secondary data analysis with digital trace data

Thanks!

• Questions?
