Secondary data analysis with digital trace data
Examples from FLOSS research
Andrea Wiggins, 13 July 2011
Secondary Data Analysis
• Uses existing data produced or collected by someone else, usually for a different purpose
• Databases
• Repositories
• Surveys
• Emails
• Social networks
Digital Trace Data
• Records of activity (trace data) undertaken through an online information system (thus digital)
• Increasingly common in studies of online phenomena
• Large volumes of available data
• Can be complete: a census, not a sample
• May be more reliably recorded than other data
Characteristics
1. Found data (not produced for research)
2. Event-based data (not summary data)
3. Events occur over time, so it is longitudinal data
Requirements
• Understand the original data source
• How it was collected, potential problems
• Limitations of the sample
• What the data describe
• Match with appropriate analysis methods and measures
• New types of data may require new measures
• Theoretical coherence is very important
Advantages
• Data may be “complete”
• Usually no response bias (exception: cookies)
• May cover long periods of time and large groups
• Multiple different data types, but mostly textual
• Data are often easy to acquire
• APIs or scraping web pages (with caution)
• Databases, archives, or repositories of research data
• But remember: you usually get what you pay for!
Disadvantages
• Often difficult to know limitations of data
• Data may be poorly documented
• Original creator may not be available for comment
• Volume of data can be overwhelming
• Sampling strategies needed, e.g., temporal, random
• Substantial time required for data preparation: 90% of effort
• Exceptions are everywhere and will break analyses, but can only be discovered through trial and error
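The two sampling strategies named above (temporal and random) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the studies: the event records, the `time` field, and the function names are all hypothetical.

```python
import random
from datetime import datetime

def temporal_sample(events, start, end):
    """Keep events whose timestamp falls in the half-open window [start, end)."""
    return [e for e in events if start <= e["time"] < end]

def random_sample(events, k, seed=0):
    """Draw up to k events without replacement; a fixed seed keeps the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(events, min(k, len(events)))

# Hypothetical trace: one event per day in January 2006.
events = [{"id": i, "time": datetime(2006, 1, i + 1)} for i in range(31)]

first_week = temporal_sample(events, datetime(2006, 1, 1), datetime(2006, 1, 8))
subset = random_sample(events, 5)
```

Recording the window boundaries and the seed alongside the results makes it possible to redraw the same sample later.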
Example: Email Networks
• Data source: email listservs for FLOSS projects
• Analysis approach: create social networks
• Within discussion threads, individuals are nodes, and links are reply-to messages
• Some conceptual issues for interpretation, choice of measures
• Technical challenges
• Temporal aggregation
• Identity resolution
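A reply network of this kind, with a very simple form of identity resolution, might be built roughly as below. The message fields (`id`, `from`, `in_reply_to`), the alias table, and the addresses are assumptions made for illustration; real listserv archives need header parsing and far more careful matching of identities.

```python
from collections import Counter

# Hypothetical alias table: known alternate addresses mapped to one canonical identity.
ALIASES = {"alice@home.example": "alice@example.org"}

def resolve_identity(address):
    """Normalize an address and map known aliases to a canonical identity."""
    normalized = address.strip().lower()
    return ALIASES.get(normalized, normalized)

def build_reply_edges(messages):
    """Return a Counter of (replier, original sender) pairs weighted by reply count."""
    sender_of = {m["id"]: resolve_identity(m["from"]) for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in sender_of:  # skips thread starters and replies to missing messages
            source, target = sender_of[m["id"]], sender_of[parent]
            if source != target:  # ignore replies to oneself
                edges[(source, target)] += 1
    return edges

# Hypothetical four-message thread.
messages = [
    {"id": "m1", "from": "Alice@example.org", "in_reply_to": None},
    {"id": "m2", "from": "bob@example.org", "in_reply_to": "m1"},
    {"id": "m3", "from": "alice@home.example", "in_reply_to": "m2"},
    {"id": "m4", "from": "bob@example.org", "in_reply_to": "m3"},
]
edges = build_reply_edges(messages)
```

Without the alias table, the two addresses used by the same person would appear as two separate nodes, which changes every node-level measure computed afterwards.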
Temporal Aggregation
Figures from Howison et al., 2006
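One simple way to aggregate event data over time is to bucket timestamped reply edges into fixed-length windows and build one network per window. The sketch below assumes edges are `(timestamp, source, target)` tuples and uses a hypothetical 90-day window; it is not necessarily the aggregation used in the cited figures.

```python
from collections import defaultdict
from datetime import datetime

def aggregate_by_window(timed_edges, origin, window_days):
    """Group (timestamp, source, target) edges into consecutive fixed-length windows."""
    windows = defaultdict(list)
    for timestamp, source, target in timed_edges:
        index = (timestamp - origin).days // window_days  # 0 for the first window
        windows[index].append((source, target))
    return dict(windows)

# Hypothetical reply events.
timed_edges = [
    (datetime(2005, 1, 10), "bob", "alice"),
    (datetime(2005, 3, 20), "carol", "alice"),
    (datetime(2005, 4, 15), "alice", "bob"),
]
windows = aggregate_by_window(timed_edges, datetime(2005, 1, 1), window_days=90)
```

The window length is an analytic choice, so it is worth rerunning the analysis with several lengths to see how sensitive the network measures are to it.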
Network Workflow
Network Results
Cleaning up before shutting down
• Observed anomalous patterns in trackers for both projects: periodic centralization spikes
• A single user makes batch bug closings (up to 279!)
– Fire’s (feature request) tracker housekeeping appears to be preparation for project closure
– Gaim’s tracker housekeeping was more regular and repeated
• Different levels of correlation between venues, suggesting different types of interactions
• User venues more decentralized than developer venues, reflecting greater number of participants
• Overall trend toward decentralization could be result of different influences
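To make the notion of a centralization spike concrete, the sketch below computes Freeman degree centralization for an undirected, unweighted network. Using this particular variant is an assumption, since the slides do not say which centralization measure was applied, and the node names are hypothetical.

```python
def degree_centralization(edges):
    """Freeman degree centralization of an undirected, unweighted network (0 to 1)."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)  # only nodes that appear in at least one edge are counted
    if n < 3:
        return 0.0  # the measure is not defined for fewer than three nodes
    degrees = [len(v) for v in neighbors.values()]
    top = max(degrees)
    # Sum of gaps to the best-connected node, divided by the maximum possible sum,
    # which a star network attains.
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

# One user interacting with everyone else (star) versus evenly spread interaction (ring).
star = [("closer", u) for u in ("a", "b", "c", "d")]
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]

star_score = degree_centralization(star)
ring_score = degree_centralization(ring)
```

A time window dominated by one user touching many items looks like the star case, which is one way a batch of closings could register as a spike when the measure is computed per window.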
Example: Classification
• Replication of success-tragedy classification
• Classification criteria originally drawn from interviews with community members
• Data extracted from repositories
• Technical challenges
• Merging data from two repositories
• Processing large volume of data in multiple steps
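Merging records from two repositories usually comes down to joining on a shared key and keeping track of what fails to match. The sketch below joins on a normalized project name; the field names, the project names, and the choice of key are hypothetical, and real repositories may need a more robust identifier.

```python
def normalize(name):
    """Reduce a project name to a comparison key (lowercase, no surrounding spaces)."""
    return name.strip().lower()

def merge_projects(primary, secondary):
    """Join two lists of project records on normalized name; report unmatched names."""
    by_key = {normalize(r["name"]): r for r in secondary}
    merged, unmatched = [], []
    for record in primary:
        other = by_key.get(normalize(record["name"]))
        if other is None:
            unmatched.append(record["name"])
        else:
            merged.append({**other, **record})  # primary values win when keys clash
    return merged, unmatched

# Hypothetical records from two sources.
repo_a = [{"name": "Alpha", "downloads": 1200}, {"name": "Beta", "downloads": 40}]
repo_b = [{"name": " alpha", "founded": "2004-05-01"}]

merged, unmatched = merge_projects(repo_a, repo_b)
```

Inspecting the unmatched list is a cheap way to surface the exceptions that would otherwise break a later processing step.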
Variables
• Inputs: project names and 5 threshold values for classification tests, e.g. number of downloads
• Project statistics retrieved from repositories
• Founding date
• Data collection date
• Dates for all releases
• Number of downloads
• URL
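A threshold-based classifier over these variables could look roughly like the sketch below. Everything specific in it is an assumption: the five threshold names and values are placeholders, the rules are simplified for illustration, and the label reading (first letter T/S/I for tragedy, success, or indeterminate; second letter I/G for initiation or growth stage) is not spelled out in the slides.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Thresholds:
    # All five values are placeholders, not the thresholds used in the study.
    min_releases: int = 3
    min_downloads: int = 10
    min_release_span_days: int = 180
    max_idle_days_initiation: int = 365
    max_idle_days_growth: int = 365

def classify(founded, collected, releases, downloads, t=Thresholds()):
    """Assign an illustrative stage/outcome label from project statistics."""
    if not releases:
        # Initiation stage: no release yet, so judge by time since founding.
        idle = (collected - founded).days
        return "TI" if idle >= t.max_idle_days_initiation else "II"
    releases = sorted(releases)
    span = (releases[-1] - releases[0]).days
    if (len(releases) >= t.min_releases
            and span >= t.min_release_span_days
            and downloads >= t.min_downloads):
        return "SG"
    # Growth stage without meeting the success tests: judge by time since last release.
    idle = (collected - releases[-1]).days
    return "TG" if idle >= t.max_idle_days_growth else "IG"

# Hypothetical project: three releases over more than a year, 500 downloads.
label = classify(
    founded=date(2004, 1, 1),
    collected=date(2006, 6, 1),
    releases=[date(2004, 6, 1), date(2005, 1, 1), date(2005, 9, 1)],
    downloads=500,
)
```

Keeping the thresholds in one object makes it easy to rerun the classification with different values and compare the resulting class counts.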
Classification Workflow
Classification Results
Class           Original        Our results     Difference
unclassifiable  3 186           3 296           +110
II              13 342 (12%)    16 252 (14%)    +2 910 (+2%)
IG              10 711 (10%)    12 991 (11%)    +2 280 (+1%)
TI              37 320 (35%)    36 507 (31%)    -813 (-4%)
TG              30 592 (28%)    32 642 (28%)    +2 050 (0%)
SG              15 782 (15%)    16 045 (14%)    +263 (-1%)
other           8 422           0
Total           119 355         117 733
Thanks!
• Questions?