My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute...

15
My summer of dilettante data-mining Virgil Griffith - Disruptive Technologist California Institute of Technology [email protected] http://virgil.gr

Transcript of My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute...

Page 1: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

My summer of dilettante data-mining

Virgil Griffith - Disruptive Technologist

California Institute of Technology

[email protected]://virgil.gr

Page 2: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

Format of Presentation

WikiScanner

Amateur Data-mining is easy, fun, and high-impact. You should try it!

Page 3: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

When you edit Wikipedia…

You can edit leaving your username.

OR

You can edit anonymously, using your computer’s IP address instead of a username.

BUT

Sometimes IP addresses can be traced back (it’s tricky though).

Page 4: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

Idea!

1) Download ALL of Wikipedia, getting all of the anonymous edits (Free from Wikimedia.)

Found 34.5M anonymous edits, ~21% of Wikipedia

2) Tracing is hard, so buy a database of what organizations own which IP addresses (available from private corporations. $1,000)

2,668,095 different orgs in database.

3) Merge them together!

Page 5: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

What you can do now

Type an organization’s name and see all of the anonymous edits that came from their local network. Found 187,529 different orgs with at

least 1 edit.See what organizations have edited a

particular Wikipedia page.Vote on the interesting stuff

Page 6: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

The Harvest

Different % of anonymous edits by countryYes, the CIA does in fact edit Wikipedia [1] [

2]

FOIA lawsuit filed over Mike Huckabee white-washing. [1]

Dutch princess white-washes connections to drug baron.

Politicians do in fact hire staff to police their pages. [1]

So do corporations. A lot.

Page 7: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

Wikimedia wants WikiScanner 2.0

Shinier. More bells and whistles.

New ways to catch people Vote stacking Time overlaps Automatic vanity checks Linkspam with Google ads And more! (but I’m not telling)

Page 8: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

Amateur Data-mining is fun

Page 9: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

Data-Mining: How to

Connect data from multiple sources and repurpose it.

This means1) Download data from different places2) Use it for something else other than

what was intended.

Page 10: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

Tools to get you started

MySQL / Python/ RubyGeneral Architecture for Text

Engineering (GATE)Text SimilarityDutch data-mining tools

Page 11: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

Interesting Sources of Data

Archive.org Wayback MachineNotable Names Databse (NNDB)Domain ToolsSocial Networks

Facebook, OkCupid, etc.

Any unusual dataset has useful purposes.

Page 12: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

Using data for things you wouldn’t have guessed

Trust Metrics in Wikipedia textHow many writers made your page?Google Maps and the impact of climate c

hangeReal-time Wikipedia with Google MapsPlanet SonyVisualizing Speeches

Repurposing WikiScanner…Using WikiScanner for social science res

earch

Page 13: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

And we’re not even scratching the surface…

Kink by geography Straight from OkCupid

Webpage diffs Archive.org Wayback Machine

Analyze C-SPAN closed-caption data TV Card + Text Analyzer

Large-scale mining of .doc metadata Open source tools + Google-mining

Page 14: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

Future work on repurposing online data in interesting ways

Page 15: My summer of dilettante data- mining Virgil Griffith - Disruptive Technologist California Institute of Technology virgil@caltech.edu .

Questions?