My summer of dilettante data-mining
Virgil Griffith - Disruptive Technologist
California Institute of Technology
virgil.gr
Format of Presentation
WikiScanner
Amateur Data-mining is easy, fun, and high-impact. You should try it!
When you edit Wikipedia…
You can edit leaving your username.
OR
You can edit anonymously, using your computer’s IP address instead of a username.
BUT
Sometimes IP addresses can be traced back (it’s tricky though).
Idea!
1) Download ALL of Wikipedia, getting all of the anonymous edits (Free from Wikimedia.)
Found 34.5M anonymous edits, ~21% of Wikipedia
2) Tracing is hard, so buy a database of what organizations own which IP addresses (available from private corporations. $1,000)
2,668,095 different orgs in database.
3) Merge them together!
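The merge step can be sketched in a few lines of Python. This is a minimal illustration and not WikiScanner's actual code: the range table, the organization names, the sample edits, and the function names are all hypothetical, and the ranges are assumed not to overlap. Each IP is converted to an integer and binary-searched against the sorted range starts.

```python
import bisect
import ipaddress

# Hypothetical sample of an IP-ownership database: (first_ip, last_ip, organization).
# Assumption: ranges do not overlap.
RANGES = [
    ("192.0.2.0", "192.0.2.255", "Example Corp"),
    ("198.51.100.0", "198.51.100.127", "Example University"),
]

# Hypothetical anonymous edits: (page title, editor IP).
EDITS = [
    ("Some article", "192.0.2.17"),
    ("Another article", "198.51.100.200"),
]


def build_index(ranges):
    """Convert dotted-quad ranges to integers and sort them by start address."""
    index = sorted(
        (int(ipaddress.ip_address(lo)), int(ipaddress.ip_address(hi)), org)
        for lo, hi, org in ranges
    )
    starts = [lo for lo, _, _ in index]
    return starts, index


def lookup_org(ip, starts, index):
    """Return the organization whose range covers `ip`, or None."""
    n = int(ipaddress.ip_address(ip))
    # Find the last range that starts at or before this address.
    pos = bisect.bisect_right(starts, n) - 1
    if pos >= 0:
        lo, hi, org = index[pos]
        if lo <= n <= hi:
            return org
    return None


def merge_edits(edits, ranges):
    """Attach an organization (or None) to each (page, ip) edit."""
    starts, index = build_index(ranges)
    return [(page, ip, lookup_org(ip, starts, index)) for page, ip in edits]


for row in merge_edits(EDITS, RANGES):
    print(row)
```

Sorting once and using binary search keeps each lookup cheap, which matters when there are tens of millions of edits and millions of ranges.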
What you can do now
Type an organization’s name and see all of the anonymous edits that came from their local network.
Found 187,529 different orgs with at least 1 edit.
See what organizations have edited a particular Wikipedia page.
Vote on the interesting stuff.
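Once edits and organizations are merged into one table, the lookups above are plain SQL. The sketch below uses Python's built-in sqlite3 as a stand-in for a real database server. The table layout, column names, sample rows, and function names are assumptions made for illustration, not WikiScanner's schema.

```python
import sqlite3

# In-memory database holding a hypothetical merged table of anonymous edits.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edits (page TEXT, ip TEXT, org TEXT)")
conn.executemany(
    "INSERT INTO edits VALUES (?, ?, ?)",
    [
        ("Some article", "192.0.2.17", "Example Corp"),
        ("Some article", "198.51.100.5", "Example University"),
        ("Another article", "192.0.2.40", "Example Corp"),
    ],
)


def edits_by_org(conn, org):
    """All anonymous edits that came from one organization's network."""
    rows = conn.execute(
        "SELECT page, ip FROM edits WHERE org = ? ORDER BY page, ip", (org,)
    )
    return rows.fetchall()


def orgs_for_page(conn, page):
    """Organizations that edited a page, with their edit counts."""
    rows = conn.execute(
        "SELECT org, COUNT(*) FROM edits WHERE page = ? "
        "GROUP BY org ORDER BY org",
        (page,),
    )
    return rows.fetchall()


def orgs_with_edits(conn):
    """Number of distinct organizations with at least one edit."""
    return conn.execute("SELECT COUNT(DISTINCT org) FROM edits").fetchone()[0]


print(edits_by_org(conn, "Example Corp"))
print(orgs_for_page(conn, "Some article"))
print(orgs_with_edits(conn))
```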
The Harvest
Different % of anonymous edits by country
Yes, the CIA does in fact edit Wikipedia [1] [2]
FOIA lawsuit filed over Mike Huckabee white-washing. [1]
Dutch princess white-washes connections to drug baron.
Politicians do in fact hire staff to police their pages. [1]
So do corporations. A lot.
Wikimedia wants WikiScanner 2.0
Shinier. More bells and whistles.
New ways to catch people
Vote stacking
Time overlaps
Automatic vanity checks
Linkspam with Google ads
And more! (but I’m not telling)
Amateur Data-mining is fun
Data-Mining: How to
Connect data from multiple sources and repurpose it.
This means
1) Download data from different places
2) Use it for something other than what was intended.
Tools to get you started
MySQL / Python / Ruby
General Architecture for Text Engineering (GATE)
Text Similarity
Dutch data-mining tools
Interesting Sources of Data
Archive.org Wayback Machine
Notable Names Database (NNDB)
Domain Tools
Social Networks: Facebook, OkCupid, etc.
Any unusual dataset has useful purposes.
Using data for things you wouldn’t have guessed
Trust Metrics in Wikipedia text
How many writers made your page?
Google Maps and the impact of climate change
Real-time Wikipedia with Google Maps
Planet Sony
Visualizing Speeches
Repurposing WikiScanner…
Using WikiScanner for social science research
And we’re not even scratching the surface…
Kink by geography: straight from OkCupid
Webpage diffs: Archive.org Wayback Machine
Analyze C-SPAN closed-caption data: TV Card + Text Analyzer
Large-scale mining of .doc metadata: open source tools + Google-mining
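As one illustration of how little code these ideas need, here is a rough sketch of the webpage-diff idea using Python's standard difflib. It assumes the text of two archived snapshots of a page has already been downloaded. The sample strings, labels, and function name are made up, and fetching from the Wayback Machine is not shown.

```python
import difflib


def snapshot_diff(old_text, new_text, old_label="old", new_label="new"):
    """Return a unified diff (list of lines) between two page snapshots."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Hypothetical text from two snapshots of the same page.
OLD = "Our product is the best.\nFounded in 1999.\n"
NEW = "Our product is pretty good.\nFounded in 1999.\n"

for line in snapshot_diff(OLD, NEW, "snapshot-a", "snapshot-b"):
    print(line)
```

Lines starting with `-` were removed between snapshots and lines starting with `+` were added, so quiet rewrites of a page stand out immediately.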
Future work on repurposing online data in interesting ways
Questions?