Onlineinfo2012 - Scraping

  • date post

    27-Jan-2015
  • Category

    Business

description

Is open data disruptive to data vendors/verticals in the information industry? How can scrapers turn data published as information on the web or in PDFs back into structured data? What business models or publications are built from scraped data?

transcript

  • 1. DATA LIBERATION: Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier. Tony Hirst, Department of Communication and Systems, The Open University
  • 2. data NOT information (by Vick)
  • 3. [Disruptive Innovation?]
  • 4. First generation: data catalogues
  • 5. Breathing life into data
  • 6. =importData(CSV_URL)
  • 7. the spreadsheet becomes A DATABASE
  • 8. Second generation: data management systems
  • 9. There's lots more data that's locked up in web pages
  • 10. Scraping
  • 11. grabbing web content in a machine readable format and then processing it for your own purposes
  • 12. [Diagram: Original HTML web page; Extract information; Accessible web page -> data]
  • 13. Recreating the database that was used to populate a (templated) page
  • 14. quick'n'dirty
  • 15. [Diagram: Scrapers; SQLite databases; Views]
  • 16. Sometimes the data is spread across different files
  • 17. Row based aggregation
  • 18. Sometimes the data is spread across different websites
  • 19. Normalisation
  • 20. Data Enrichment
  • 21. Column Additions/Annotations
  • 22. Sometimes the data is split across different files
  • 23. Column based merge
  • 24. -> Data cleansing
  • 25. Clustering
  • 26. http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/ (via Martin Hawksey/@mhawksey)
  • 27. Finessing a common identifier
  • 28. Common identifiers (common KEYS) make it MUCH easier to JOIN datasets by column
  • 29. Book Title -> ISBN
  • 30. I am psychemedia on Twitter, delicious, slideshare, flickr, etc etc
  • 31. Reconciliation
  • 32. Linked Data
  • 33. So who speaks SPARQL? (Diners - Journal Canteen, by avlxyz)
  • 34. You DON'T have to.
  • 35. Just think about how one piece of data might be related to another through a common means of addressing them
  • 36. http://ouseful.info @psychemedia
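Slides 10 to 15 describe scraping as recreating the database behind a templated page and keeping the result in SQLite. The sketch below is one minimal way that idea might look in Python, using only the standard library. The HTML snippet, the `TableScraper` class, the `scrape_table` helper and the `books` table are all hypothetical names and made-up data, not anything taken from the talk; a real scraper would fetch the page over HTTP and would need to cope with messier markup.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical templated page with made-up values; a real scraper
# would download this HTML rather than hard-code it.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>ISBN</th></tr>
  <tr><td>Example Book A</td><td>9780000000001</td></tr>
  <tr><td>Example Book B</td><td>9780000000002</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of each table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return (header, records) for the first table-like run of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *records = parser.rows
    return header, records


header, records = scrape_table(SAMPLE_HTML)

# Store the recovered rows in SQLite so they can be queried again.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, isbn TEXT)")
conn.executemany("INSERT INTO books VALUES (?, ?)", records)
conn.commit()
```

Once the rows are in SQLite, views or further queries can be built on top of them, which is roughly the scraper, database and views arrangement that slide 15 appears to show.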
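Slides 16 to 29 move from row based aggregation, through finessing a common identifier, to a column based merge on a shared key. The following is a rough sketch of those three steps, again with the Python standard library only. The CSV contents, the column names and the helpers `normalise_isbn` and `read_rows` are illustrative assumptions; the ISBN values are invented and the normalisation rule (keep digits and `X`) is a simplification, not a full ISBN validator.

```python
import csv
import io
import sqlite3

# Made-up data. Two files share the same columns (row based aggregation)...
SALES_2011 = "isbn,copies\n978-0-00-000000-1,10\n"
SALES_2012 = "isbn,copies\n9780000000002,7\n"
# ...and a third file holds extra columns about the same items (column based merge).
TITLES = "ISBN,title\n9780000000001,Example Book A\n9780000000002,Example Book B\n"


def normalise_isbn(raw):
    """Finesse the identifier into one canonical form: digits (and X) only."""
    return "".join(ch for ch in raw.upper() if ch.isdigit() or ch == "X")


def read_rows(text):
    """Parse CSV text into dicts with lower-cased column names."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key.lower(): value for key, value in row.items()} for row in reader]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (isbn TEXT, copies INTEGER)")
conn.execute("CREATE TABLE titles (isbn TEXT PRIMARY KEY, title TEXT)")

# Row based aggregation: append the rows of each file to a single table.
for text in (SALES_2011, SALES_2012):
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(normalise_isbn(r["isbn"]), int(r["copies"])) for r in read_rows(text)],
    )

conn.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [(normalise_isbn(r["isbn"]), r["title"]) for r in read_rows(TITLES)],
)

# Column based merge: JOIN the two tables on the common key.
merged = conn.execute(
    "SELECT t.title, s.isbn, s.copies "
    "FROM sales AS s JOIN titles AS t ON s.isbn = t.isbn "
    "ORDER BY s.isbn"
).fetchall()

for row in merged:
    print(row)
```

Without the normalisation step the hyphenated ISBN in the first file would not match the plain one in the titles file, and the JOIN would silently drop that row, which is the practical reason a good common identifier matters.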