Scraping for Stories

Paul Bradshaw Leanpub.com/scrapingforjournalists * Scraping for stories

Transcript of Scraping for Stories

Page 1: Scraping for Stories

Paul Bradshaw Leanpub.com/scrapingforjournalists

Scraping for stories

Page 2: Scraping for Stories

Why scraping

How to spot opportunities for scraping

Tools and traits: what can be scraped, and how

Why and how

Page 3: Scraping for Stories

Automating the repetitive gathering of data, e.g. multiple tables in one page, webpage tables, multiple spreadsheets, multiple PDFs

What is scraping?
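As a rough illustration of the "multiple tables in one page" case, the sketch below uses Python's standard-library `html.parser` to pull every table cell out of a page. The class name `TableParser`, the helper `scrape_tables` and the sample HTML are all invented for this example; it is one possible approach, not the method used in the talk, and it does not handle nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by row and by table."""

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def scrape_tables(html):
    """Return all tables found in an HTML string as nested lists."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample page with two tables (illustrative data only).
SAMPLE_HTML = """
<table><tr><th>Name</th><th>Jobs</th></tr><tr><td>A</td><td>3</td></tr></table>
<table><tr><th>Name</th><th>Jobs</th></tr><tr><td>B</td><td>5</td></tr></table>
"""

tables = scrape_tables(SAMPLE_HTML)
for table in tables:
    print(table)
```

The same function can be run over many pages in a loop, which is where the "repetitive gathering" gets automated.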

Page 4: Scraping for Stories

Why is a government website carrying fake jobs?

Aron Pilhofer, News Rewired

Page 5: Scraping for Stories

https://www.youtube.com/watch?v=Efr-VEkwWoM

Page 6: Scraping for Stories
Page 7: Scraping for Stories

http://www.mirror.co.uk/news/uk-news/singer-best-vocal-range-uk-4323076

Page 8: Scraping for Stories
Page 9: Scraping for Stories

http://www.private-eye.co.uk/registry

Page 10: Scraping for Stories
Page 11: Scraping for Stories
Page 12: Scraping for Stories
Page 13: Scraping for Stories

Focus.

Page 14: Scraping for Stories

New entries - or disappearing ones

Page 15: Scraping for Stories
Page 16: Scraping for Stories

http://helpmeinvestigate.com/olympics/olympic-torch-relay-youth-target-missed-by-over-1000-places/

Page 17: Scraping for Stories
Page 18: Scraping for Stories
Page 19: Scraping for Stories
Page 20: Scraping for Stories
Page 21: Scraping for Stories
Page 22: Scraping for Stories
Page 23: Scraping for Stories

https://moveplanner.zoopla.co.uk/terms-and-conditions

Page 24: Scraping for Stories
Page 25: Scraping for Stories
Page 26: Scraping for Stories

http://blogs.ft.com/ftdata/2014/06/11/interactive-explore-the-statistical-identity-of-every-team-at-the-world-cup/?

Page 27: Scraping for Stories

http://www.nature.com/news/scientific-publishing-the-inside-track-1.15424

Page 28: Scraping for Stories
Page 29: Scraping for Stories
Page 30: Scraping for Stories

What makes a website suitable for scraping?

Page 31: Scraping for Stories

Page 32: Scraping for Stories

http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs

Page 33: Scraping for Stories

URL parameters

https://stores.sainsburys.co.uk/
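To make the idea of URL parameters concrete, here is a minimal sketch that reads the parameters in the W4MP address shown earlier and then generates variations of it. The helper `set_param` is invented for this example, and `page` is a hypothetical parameter name used only to show the mechanics; the real site may use a different name or none at all.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode


def set_param(url, name, value):
    """Return url with the query parameter `name` set to `value`."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    params[name] = [str(value)]
    new_query = urlencode(params, doseq=True)
    return urlunsplit(parts._replace(query=new_query))


url = "http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs"

# Everything after the "?" is a set of name=value pairs.
current_params = parse_qs(urlsplit(url).query)
print(current_params)

# "page" is a hypothetical parameter, not confirmed to exist on this site.
page_urls = [set_param(url, "page", n) for n in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```

Changing one value at a time and watching what the site returns is usually the quickest way to learn which parameters matter.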

Page 34: Scraping for Stories

HTML <table>
HTML tag(s)
JSON file
XML
TXT
CSV or XLS
PDF
PDF which needs OCR

Document challenges
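The structured formats in that list can often be read directly, with no HTML parsing at all. The sketch below, using only the standard library and made-up sample data, loads the same two records from CSV and from JSON. PDFs, and especially PDFs that need OCR, require third-party tools and are not shown here.

```python
import csv
import io
import json

# Made-up sample data: the same content in two formats.
CSV_TEXT = "name,count\nA,3\nB,5\n"
JSON_TEXT = '[{"name": "A", "count": "3"}, {"name": "B", "count": "5"}]'

# csv.DictReader turns each line into a dict keyed by the header row.
rows_from_csv = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# json.loads turns the text straight into Python lists and dicts.
rows_from_json = json.loads(JSON_TEXT)

print(rows_from_csv)
print(rows_from_json)
```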

Page 35: Scraping for Stories

1 page, changes
>1 pages, ‘next’ links
Pages linked from 1 index
>1 pages, URL pattern
Search results, URL pattern
Search results, uses cookie
Search results, login needed

Hosting challenges
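For the "‘next’ links" case, a scraper can keep following the link until there is none left. The sketch below assumes the link text is literally "Next" and uses a made-up three-page site held in a dictionary in place of real HTTP requests; `NextLinkFinder`, `crawl` and `FAKE_SITE` are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Find the href of the first <a> whose text is 'next' (case-insensitive)."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip().lower()
            if label == "next" and self.next_href is None:
                self.next_href = self._href
            self._href = None


def crawl(start_url, fetch, max_pages=50):
    """Visit start_url, then keep following 'next' links. Returns URLs visited."""
    visited = []
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        finder = NextLinkFinder()
        finder.feed(fetch(url))
        finder.close()
        # Relative links are resolved against the page they appeared on.
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return visited


# A made-up three-page site held in memory, standing in for real HTTP requests.
FAKE_SITE = {
    "http://example.org/list?page=1": '<a href="?page=2">Next</a>',
    "http://example.org/list?page=2": '<a href="/list?page=3">Next</a>',
    "http://example.org/list?page=3": "<p>End of results</p>",
}

visited = crawl("http://example.org/list?page=1", FAKE_SITE.__getitem__)
print(visited)
```

The `max_pages` cap and the check against already-visited URLs are there so that a loop of links cannot keep the scraper running forever.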

Page 36: Scraping for Stories
Page 37: Scraping for Stories

Patterns

Look for structure in a webpage: how are elements distinguished? Think code and text.
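One common pattern in the code is a repeated tag and class name around each item of interest. The sketch below collects the text of every element matching a given tag and class, using only the standard library. The class `ClassTextParser`, the helper `extract` and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class ClassTextParser(HTMLParser):
    """Collect the text inside every <tag> element carrying a given class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.results = []
        self._depth = 0  # greater than 0 while inside a matching element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a match: track nesting of the same tag name.
            if tag == self.tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            self._depth = 1
            self._text = []

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1
            if self._depth == 0:
                self.results.append("".join(self._text).strip())


def extract(html, tag, class_name):
    """Return the text of each matching element in an HTML string."""
    parser = ClassTextParser(tag, class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up listing: each job title is distinguished by <h2 class="title">.
SAMPLE_HTML = """
<div class="job"><h2 class="title">Researcher</h2><p class="employer">Office A</p></div>
<div class="job"><h2 class="title">Intern</h2><p class="employer">Office B</p></div>
"""

titles = extract(SAMPLE_HTML, "h2", "title")
employers = extract(SAMPLE_HTML, "p", "employer")
print(list(zip(titles, employers)))
```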

Page 38: Scraping for Stories

Chrome: right-click > Inspect

Page 39: Scraping for Stories

Inspector: right-click > Copy…

Page 40: Scraping for Stories
Page 41: Scraping for Stories
Page 42: Scraping for Stories
Page 43: Scraping for Stories

A tip about URLs

Page 44: Scraping for Stories

This bit isn’t needed.

Page 45: Scraping for Stories

This bit is just for SEO.

Page 46: Scraping for Stories

You can put anything there. Literally.

Page 47: Scraping for Stories

Do it now: Identify an online source of information you might scrape. Think beyond tables: what about series of pages? Documents?

Page 48: Scraping for Stories

https://onlinejournalismblog.com/2013/09/18/ethics-in-data-journalism-mass-data-gathering-scraping-foi-and-deception/

Page 49: Scraping for Stories

Robots.txt http://www.tcij.org/robots.txt
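A scraper can check a site's robots.txt rules before requesting a page. The sketch below uses Python's standard-library `urllib.robotparser` on made-up rules, not the contents of the real file at the address above; the user-agent name `my-scraper` and the example.org URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, not the real file at any particular site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
""".splitlines()

rules = RobotFileParser()
rules.parse(SAMPLE_ROBOTS)

# can_fetch(user_agent, url) reports whether the rules permit the request.
allowed = rules.can_fetch("my-scraper", "http://example.org/reports/2014.html")
blocked = rules.can_fetch("my-scraper", "http://example.org/private/draft.html")
print(allowed, blocked)
```

To check a live site instead, `set_url()` followed by `read()` on the same object downloads and parses the real file.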

Page 50: Scraping for Stories

Treat like any source: build in "too good to be true" (TGTBT) checks. Seek second sources. Seek right of reply/confirmation.

Data is just a lead

Page 51: Scraping for Stories

http://www.storybench.org/to-scrape-or-not-to-scrape-the-technical-and-ethical-challenges-of-collecting-data-off-the-web/

Page 52: Scraping for Stories

https://www.mediawiki.org/wiki/API:Main_page

Does it have an API?
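Where an API exists, a request is usually just a URL with the right parameters, and the answer comes back as structured data rather than HTML. The sketch below builds a MediaWiki search request. The endpoint is an assumption (English Wikipedia, one of many MediaWiki installations), and `search_url` is a helper invented for this example; no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia's endpoint; other MediaWiki sites have their own.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def search_url(term, limit=10):
    """Build a MediaWiki full-text search request that asks for JSON."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = search_url("data journalism")
print(url)
```

Fetching that address and passing the response to `json.loads` would give a dictionary to work with, which is typically less fragile than parsing the equivalent webpage.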

Page 53: Scraping for Stories

Gives you a long-term insight into the issue. Allows you to spot things being removed and added.

Scheduled scrapes
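Spotting what has been added or removed comes down to comparing one scrape with the previous one. A minimal sketch, assuming each scrape yields a list of entry names; the function `compare_snapshots` and the data are made up for illustration, and the scheduling itself (for example a cron job) is not shown.

```python
def compare_snapshots(previous, current):
    """Return (added, removed) entries between two scrapes, each sorted."""
    previous, current = set(previous), set(current)
    return sorted(current - previous), sorted(previous - current)


# Made-up snapshots from two runs of the same scraper.
monday = ["Company A", "Company B", "Company C"]
tuesday = ["Company B", "Company C", "Company D"]

added, removed = compare_snapshots(monday, tuesday)
print("Added:", added)
print("Removed:", removed)
```

Saving each scrape with its date makes it possible to run this comparison across any two points in time.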

Page 54: Scraping for Stories

Paul Bradshaw Leanpub.com/scrapingforjournalists

Thank you.