Onlineinfo2012 - Scraping

Transcript
Page 1: Onlineinfo2012 - Scraping

DATA LIBERATION

Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier

Tony Hirst

Department of Communication and Systems

The Open University

Page 2: Onlineinfo2012 - Scraping

data NOT information

Craft by Vicky Hugheston

Page 3: Onlineinfo2012 - Scraping

[Disruptive Innovation?]

Page 5: Onlineinfo2012 - Scraping

“First” generation: data catalogues

Page 6: Onlineinfo2012 - Scraping

Breathing life into data…

Page 7: Onlineinfo2012 - Scraping

=importData("CSV_URL")

Google Sheets

Page 8: Onlineinfo2012 - Scraping

the spreadsheet becomes

A DATABASE
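
Outside a spreadsheet, the same idea can be sketched in Python: load CSV rows into a table, then query them. This is a minimal sketch, not the Google Sheets mechanism; the CSV text, the column names and the `load_csv_into_sqlite` helper are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV text standing in for whatever a CSV URL would return.
CSV_TEXT = """council,year,spend
Milton Keynes,2011,120
Milton Keynes,2012,135
Luton,2012,90
"""

def load_csv_into_sqlite(csv_text, table="sheet"):
    """Load CSV text into an in-memory SQLite table and return the connection."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    conn = sqlite3.connect(":memory:")
    # Column names come from the CSV header; quote them so spaces are tolerated.
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in rows],
    )
    return conn

conn = load_csv_into_sqlite(CSV_TEXT)
# Once the rows are in a table, they can be queried like any database.
totals = conn.execute(
    "SELECT council, SUM(CAST(spend AS INTEGER)) FROM sheet "
    "GROUP BY council ORDER BY council"
).fetchall()
print(totals)
```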

Page 9: Onlineinfo2012 - Scraping

Google Charts

Visualisation API

Page 10: Onlineinfo2012 - Scraping

Google Charts

Visualisation API

Page 11: Onlineinfo2012 - Scraping

Google Charts

Visualisation API

Page 12: Onlineinfo2012 - Scraping

“Second” generation: data management systems

Page 13: Onlineinfo2012 - Scraping

DMS – Data Management System

Page 14: Onlineinfo2012 - Scraping

BUT

Page 15: Onlineinfo2012 - Scraping

There’s lots more data that’s locked up in web pages…

Page 16: Onlineinfo2012 - Scraping

Scraping…

Page 18: Onlineinfo2012 - Scraping

“grabbing web content in a machine readable format and then processing it for your own purposes”

Page 20: Onlineinfo2012 - Scraping

DIY API

Page 22: Onlineinfo2012 - Scraping

Original HTML web page

Accessible web page

Extract Information -> data

Page 23: Onlineinfo2012 - Scraping

Recreating the database that was used to populate a (templated) page

Page 29: Onlineinfo2012 - Scraping

Implied semantics

Page 30: Onlineinfo2012 - Scraping

…quick’n’dirty: =importHTML("pageURL", "table", N)
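
For comparison, here is a rough Python sketch of what "give me the N-th table on the page" involves, using only the standard library. It assumes simple, non-nested tables; `TableExtractor`, `import_html_table` and the sample page are hypothetical names and data, not part of any spreadsheet API.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

def import_html_table(html, n):
    """Return the n-th table on the page, counting from 1."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables[n - 1]

# Made-up page with a layout table followed by a data table.
PAGE = """<html><body>
<table><tr><th>Menu</th></tr></table>
<table>
<tr><th>Name</th><th>Visits</th></tr>
<tr><td>Alice</td><td>3</td></tr>
<tr><td>Bob</td><td>5</td></tr>
</table></body></html>"""

rows = import_html_table(PAGE, 2)
print(rows)
```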

Page 37: Onlineinfo2012 - Scraping

PDF scraping

Page 39: Onlineinfo2012 - Scraping

Scrapers

Views

Scraper SQLite database

SQLite database Scraper
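
The scraper-to-database-to-view pattern can be sketched with Python's built-in `sqlite3` module. This is an illustrative sketch only: the `records` table, its columns and the `save_record` and `view` helpers are assumptions, not the API of any particular scraping platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, title TEXT, price REAL)")

def save_record(conn, record):
    """Store a scraped record, replacing any earlier row with the same id."""
    conn.execute(
        "INSERT OR REPLACE INTO records (id, title, price) "
        "VALUES (:id, :title, :price)",
        record,
    )

def view(conn):
    """A 'view': read the stored data back out for display or reuse."""
    return conn.execute(
        "SELECT id, title, price FROM records ORDER BY id"
    ).fetchall()

# Made-up records from two scraper runs; 'a1' is re-scraped with a new price.
save_record(conn, {"id": "a1", "title": "Widget", "price": 2.5})
save_record(conn, {"id": "a2", "title": "Gadget", "price": 4.0})
save_record(conn, {"id": "a1", "title": "Widget", "price": 3.0})

stored = view(conn)
print(stored)
```

Keying each record on a stable id means the scraper can be re-run without piling up duplicate rows.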

Page 43: Onlineinfo2012 - Scraping

Sometimes the data is spread across different files…

Page 45: Onlineinfo2012 - Scraping

Row based aggregation
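
A minimal sketch of row based aggregation, assuming the files share the same columns. The file names, the data and the `stack_rows` helper are invented for illustration.

```python
import csv
import io

def stack_rows(named_csv_texts):
    """Append the rows of several same-shaped CSV files, tagging each row with its source."""
    combined = []
    for source, text in named_csv_texts.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source"] = source  # remember which file the row came from
            combined.append(row)
    return combined

# Hypothetical yearly files with identical headers.
FILES = {
    "2011.csv": "dept,spend\nParks,10\nRoads,40\n",
    "2012.csv": "dept,spend\nParks,12\nRoads,38\n",
}

rows = stack_rows(FILES)
print(len(rows), rows[0])
```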

Page 46: Onlineinfo2012 - Scraping

Sometimes the data is spread across different websites…

Page 47: Onlineinfo2012 - Scraping

…Normalisation…
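
One way to picture normalisation: different sites label the same thing differently, so map their headers onto one agreed set and tidy the values. The header map, the field names and the postcode rule below are assumptions chosen for the example.

```python
# Hypothetical mapping from headers seen in the wild to one agreed schema.
HEADER_MAP = {
    "name": "name",
    "organisation": "name",
    "org name": "name",
    "postcode": "postcode",
    "post code": "postcode",
}

def normalise_record(record):
    """Rename known columns to the shared schema and tidy their values."""
    out = {}
    for key, value in record.items():
        canonical = HEADER_MAP.get(key.strip().lower())
        if canonical is None:
            continue  # skip columns with no agreed meaning
        value = " ".join(value.split())  # collapse stray whitespace
        if canonical == "postcode":
            # Upper-case and remove spaces so postcodes compare equal as keys.
            value = value.upper().replace(" ", "")
        out[canonical] = value
    return out

site_a = normalise_record({"Org Name": " Open  University ", "Post Code": "mk7 6aa"})
site_b = normalise_record({"organisation": "Open University", "postcode": "MK7 6AA"})
print(site_a, site_b)
```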

Page 49: Onlineinfo2012 - Scraping

Data Enrichment

Page 50: Onlineinfo2012 - Scraping

Column Additions/Annotations
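
Enrichment by adding a column might look like the sketch below: each row gains a value looked up from a second source. The lookup table, the column names and the `annotate` helper are illustrative assumptions.

```python
# Hypothetical lookup used to annotate rows with an extra column.
REGION_BY_COUNCIL = {
    "Milton Keynes": "South East",
    "Luton": "East of England",
}

def annotate(rows, lookup, key, new_column, default=""):
    """Return copies of the rows with a new column filled in from a lookup table."""
    return [{**row, new_column: lookup.get(row[key], default)} for row in rows]

rows = [
    {"council": "Luton", "spend": "90"},
    {"council": "Slough", "spend": "70"},  # not in the lookup, so it gets the default
]
annotated = annotate(rows, REGION_BY_COUNCIL, key="council", new_column="region")
print(annotated)
```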

Page 52: Onlineinfo2012 - Scraping

Sometimes the data is split across different files…

Page 53: Onlineinfo2012 - Scraping

Column based merge
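
A column based merge can be sketched as a join on a shared key column. The sketch assumes both files carry the same `id` values; the data and the `merge_on` helper are made up.

```python
def merge_on(left_rows, right_rows, key):
    """Left join: keep every left row, adding right-hand columns where the key matches."""
    right_index = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_index.get(row[key], {})
        merged.append({**match, **row})  # left values win if a column clashes
    return merged

# Hypothetical files describing the same things in different columns.
names = [{"id": "E1", "name": "Alpha School"}, {"id": "E2", "name": "Beta School"}]
sizes = [{"id": "E1", "pupils": "420"}]

merged = merge_on(names, sizes, key="id")
print(merged)
```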

Page 55: Onlineinfo2012 - Scraping

-> Data cleansing

Page 56: Onlineinfo2012 - Scraping

Clustering…
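
Clustering for data cleansing can be approximated with a key-collision ("fingerprint") approach: values that reduce to the same key are probably variants of one another. This is a simplified sketch of that idea, not a reproduction of any tool's exact algorithm, and the sample values are invented.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Reduce a string to a key that ignores case, punctuation and word order."""
    cleaned = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Group values sharing a fingerprint; keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

values = ["Open University", "open university.", "University, Open", "Oxford University"]
clusters = cluster(values)
print(clusters)
```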

Page 57: Onlineinfo2012 - Scraping

OpenRefine

http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/

/via Martin Hawksey/@mhawksey

Page 58: Onlineinfo2012 - Scraping

OpenRefine

Page 59: Onlineinfo2012 - Scraping

OpenRefine

Page 60: Onlineinfo2012 - Scraping

“Finessing” a common identifier

Page 61: Onlineinfo2012 - Scraping

Common identifiers (common KEYS) make it MUCH easier to JOIN datasets by column

Page 62: Onlineinfo2012 - Scraping

Book Title -> ISBN
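
Looking up an ISBN from a title needs an external catalogue, which is not sketched here. What can be shown is a related finessing step: once ISBNs are in hand, putting them in one form so they work as a join key. The sketch converts ISBN-10 to ISBN-13 using the standard 978 prefix and check-digit rule; the function name is hypothetical.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 (hyphens and spaces allowed) to its 13-digit form."""
    raw = isbn10.replace("-", "").replace(" ", "")
    if len(raw) != 10 or not raw[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    # Drop the old check digit and add the 978 prefix.
    core = "978" + raw[:9]
    # ISBN-13 check digit: weights alternate 1, 3 across the first twelve digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)

key = isbn10_to_isbn13("0-306-40615-2")
print(key)
```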

Page 63: Onlineinfo2012 - Scraping

I am “psychemedia” on Twitter, delicious, slideshare, flickr, etc. etc.

Page 65: Onlineinfo2012 - Scraping

Reconciliation…
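
Reconciliation means matching messy names against an agreed list of entities. As a rough stand-in for a reconciliation service, the sketch below uses fuzzy string matching from Python's standard `difflib`; the authority list, the cutoff and the `reconcile` helper are assumptions.

```python
import difflib

# Hypothetical authority list of preferred names.
AUTHORITY = ["The Open University", "University of Oxford", "University of Cambridge"]

def reconcile(name, authority, cutoff=0.6):
    """Return the closest authority entry for a name, or None if nothing is close enough."""
    by_key = {entry.lower(): entry for entry in authority}
    matches = difflib.get_close_matches(
        name.lower(), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[matches[0]] if matches else None

best = reconcile("Open Univeristy", AUTHORITY)
print(best)
```

A real reconciliation step would normally return several candidates with scores and let a person confirm the match, rather than accepting the top hit blindly.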

Page 66: Onlineinfo2012 - Scraping

OpenRefine

Page 67: Onlineinfo2012 - Scraping

OpenRefine

Page 68: Onlineinfo2012 - Scraping

OpenRefine

Page 69: Onlineinfo2012 - Scraping

OpenRefine

Page 71: Onlineinfo2012 - Scraping

Linked Data™

Page 73: Onlineinfo2012 - Scraping

So who speaks SPARQL?

Diners - Journal Canteen by avlxyz

Page 74: Onlineinfo2012 - Scraping

You DON’T have to…

Page 75: Onlineinfo2012 - Scraping

Just think about how one piece of data might be related to another through a common means of addressing them…
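
The idea needs no SPARQL to try out. In this small sketch, facts are stored as (subject, property, value) triples, and one dataset links to another simply by reusing the same address. The `example.org` URIs, the property names and the helpers are all hypothetical.

```python
# Made-up triples; the shared URI for the organisation is what links the facts.
TRIPLES = [
    ("http://example.org/person/tony", "name", "Tony Hirst"),
    ("http://example.org/person/tony", "memberOf", "http://example.org/org/ou"),
    ("http://example.org/org/ou", "name", "The Open University"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching the given parts; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def organisation_names(triples, person):
    """Follow memberOf links from a person, then read each organisation's name."""
    orgs = [o for _, _, o in match(triples, s=person, p="memberOf")]
    return [name for org in orgs for _, _, name in match(triples, s=org, p="name")]

names = organisation_names(TRIPLES, "http://example.org/person/tony")
print(names)
```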

Page 76: Onlineinfo2012 - Scraping

http://ouseful.info

@psychemedia