Scraping for Stories

Posted on 21-Feb-2017

Paul Bradshaw, Leanpub.com/scrapingforjournalists

Scraping for stories

Why scraping

How to spot opportunities for scraping

Tools and traits: what can be scraped, and how

Why and how

Automating the repetitive gathering of data, e.g.
Multiple tables in one page
Webpage tables
Multiple spreadsheets
Multiple PDFs

What is scraping?
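
As an illustration of the "multiple tables in one page" case, here is a minimal sketch using only Python's standard library. The class name, function name and sample HTML are hypothetical, not taken from the slides, and the sketch assumes simple tables that are not nested inside one another.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # the row being built, or None
        self._cell = None  # text pieces of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def scrape_tables(html):
    """Return a list of tables; each table is a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    return parser.tables


# Hypothetical sample page with two tables.
SAMPLE_HTML = """
<table><tr><th>Name</th><th>Jobs</th></tr><tr><td>A</td><td>3</td></tr></table>
<table><tr><th>Name</th><th>Jobs</th></tr><tr><td>B</td><td>5</td></tr></table>
"""

tables = scrape_tables(SAMPLE_HTML)
```

Each table comes back as a list of rows, so the same loop can gather one table or fifty without any copying and pasting.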

Why is a government website carrying fake jobs?

Aron Pilhofer, News Rewired

https://www.youtube.com/watch?v=Efr-VEkwWoM

http://www.mirror.co.uk/news/uk-news/singer-best-vocal-range-uk-4323076

http://www.private-eye.co.uk/registry

Focus.

New entries - or disappearing ones

http://helpmeinvestigate.com/olympics/olympic-torch-relay-youth-target-missed-by-over-1000-places/

https://moveplanner.zoopla.co.uk/terms-and-conditions

http://blogs.ft.com/ftdata/2014/06/11/interactive-explore-the-statistical-identity-of-every-team-at-the-world-cup/?

http://www.nature.com/news/scientific-publishing-the-inside-track-1.15424

What makes a website suitable for scraping?

http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs

URL parameters
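
A rough sketch of how URL parameters can be read and changed in code, using the w4mpjobs URL above. The helper name `with_params` and the replacement value "example" are hypothetical; whether the site accepts other values for `search` is an assumption that would need checking in a browser.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = "http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs"

# Everything after the "?" is a set of name=value pairs.
params = parse_qs(urlsplit(url).query)


def with_params(url, **changes):
    """Return the same URL with some parameters changed or added."""
    parts = urlsplit(url)
    current = {name: values[0] for name, values in parse_qs(parts.query).items()}
    current.update(changes)
    return urlunsplit(parts._replace(query=urlencode(current)))


# "example" is a made-up value, used only to show the mechanics.
variant = with_params(url, search="example")
```

If changing a value in the address bar changes what the page shows, a scraper can loop through those values in the same way.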

https://stores.sainsburys.co.uk/

HTML <table>
HTML tag(s)
JSON file
XML
TXT
CSV or XLS
PDF
PDF which needs OCR

Document challenges
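
To show why structured formats such as CSV and JSON tend to be less work than PDFs, here is a small sketch with the standard library. The store data is invented for illustration only.

```python
import csv
import io
import json

# Hypothetical data: the same two records in two formats.
CSV_TEXT = "store,town\nA,Leeds\nB,York\n"
JSON_TEXT = '[{"store": "A", "town": "Leeds"}, {"store": "B", "town": "York"}]'

# Each format needs only one call to become a list of records.
rows_from_csv = list(csv.DictReader(io.StringIO(CSV_TEXT)))
rows_from_json = json.loads(JSON_TEXT)
```

PDFs, and especially scanned PDFs that need OCR, have no equivalent one-line reader in the standard library, which is one reason they are treated here as the harder cases.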

1 page, changes
>1 pages, ‘next’ links
pages linked from 1 index
>1 pages, URL pattern
Search results URL pattern
Search results, uses cookie
Search results, login needed

Hosting challenges
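
The "‘next’ links" case can be sketched as a loop that keeps following the link until there is none. This is an assumption-laden outline: the example.org URLs and the `FAKE_SITE` dictionary are hypothetical stand-ins for real pages, and `fetch` is any function that returns the HTML for a URL (a real scraper would download it instead).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Find the href of the first <a> whose text is 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._href = None  # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href and self.next_href is None and data.strip().lower() == "next":
            self.next_href = self._href

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def crawl(start_url, fetch, max_pages=50):
    """Follow 'next' links from start_url and return the URLs visited."""
    visited = []
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        finder = NextLinkFinder()
        finder.feed(fetch(url))
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return visited


# Hypothetical three-page site held in memory.
FAKE_SITE = {
    "http://example.org/list?page=1": '<a href="/list?page=2">Next</a>',
    "http://example.org/list?page=2": '<a href="/list?page=3">next</a>',
    "http://example.org/list?page=3": "<p>End of results</p>",
}

pages = crawl("http://example.org/list?page=1", FAKE_SITE.__getitem__)
```

The `max_pages` limit and the check against already-visited URLs are there so the loop cannot run forever if a site links back to itself.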

Patterns
Look for structure in a webpage: how are elements distinguished?
Think code and text

Chrome: right-click > Inspect

Inspector: right-click > Copy…
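
Once the inspector shows how the elements you want are distinguished, for example by a shared class, that pattern can be turned into code. A minimal sketch, assuming the items are `<li>` tags with a class of "job"; the tag, the class name and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every <tag> that carries a given class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.results = []
        self._depth = 0    # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == self.tag:
                self._depth += 1  # same tag nested inside the match
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self._depth = 1
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1
            if self._depth == 0:
                self.results.append("".join(self._buffer).strip())


# Hypothetical page fragment: two job listings and one advert.
SAMPLE_HTML = (
    '<li class="job featured"><b>Researcher</b> London</li>'
    '<li class="ad">Buy now</li>'
    '<li class="job">Intern</li>'
)

collector = ClassTextCollector("li", "job")
collector.feed(SAMPLE_HTML)
jobs = collector.results
```

The advert is skipped because it does not share the class, which is the point of looking for what distinguishes the elements before writing anything.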

A tip about URLs

This bit isn’t needed.

This bit is just for SEO.

You can put anything there. Literally.

Do it now:
Identify an online source of information you might scrape
Think beyond tables: what about series of pages? Documents?

https://onlinejournalismblog.com/2013/09/18/ethics-in-data-journalism-mass-data-gathering-scraping-foi-and-deception/

Robots.txt http://www.tcij.org/robots.txt
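
A scraper can check robots.txt rules before requesting a page. The sketch below uses Python's standard `urllib.robotparser`; the rules, the "my-scraper" user agent and the example.org URLs are hypothetical and are not the contents of the tcij.org file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, written out here so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL.
allowed = parser.can_fetch("my-scraper", "http://example.org/reports/2017.html")
blocked = parser.can_fetch("my-scraper", "http://example.org/private/list.html")
```

To read a live file instead, the same class offers `set_url()` followed by `read()`.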

Treat like any source:
Build in TGTBT ("too good to be true") checks
Seek second sources
Seek right of reply/confirmation

Data is just a lead

http://www.storybench.org/to-scrape-or-not-to-scrape-the-technical-and-ethical-challenges-of-collecting-data-off-the-web/

https://www.mediawiki.org/wiki/API:Main_page

Does it have an API?
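
Where an API exists, the data usually arrives already structured, so no HTML parsing is needed. A hedged sketch against the MediaWiki API's search module: the search term and the sample response are invented for illustration, and a real response carries more fields than the ones shown.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://www.mediawiki.org/w/api.php"


def build_search_url(term):
    """Build a MediaWiki search query that returns JSON."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def titles_from_response(text):
    """Pull page titles out of a search response."""
    data = json.loads(text)
    return [hit["title"] for hit in data.get("query", {}).get("search", [])]


# Hypothetical, trimmed-down response used so the example runs offline.
SAMPLE_RESPONSE = (
    '{"query": {"search": [{"title": "API:Main page"}, {"title": "API:Search"}]}}'
)

url = build_search_url("web scraping")
titles = titles_from_response(SAMPLE_RESPONSE)
```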

Gives you a long-term insight into the issue
Allows you to spot things being removed and added

Scheduled scrapes
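
Spotting what has been added or removed between two scheduled scrapes comes down to comparing two sets. A minimal sketch; the function name and the job titles are hypothetical.

```python
def compare_snapshots(previous, current):
    """Return the entries added and removed between two scrapes."""
    previous, current = set(previous), set(current)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


# Hypothetical results from two runs of the same scraper.
monday = ["Caseworker", "Intern", "Researcher"]
tuesday = ["Caseworker", "Press officer", "Researcher"]

changes = compare_snapshots(monday, tuesday)
```

Saving each run with its date makes it possible to run this comparison between any two points in time, not just consecutive ones.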

Paul Bradshaw, Leanpub.com/scrapingforjournalists

Thank you.