Scraping for Stories

Paul Bradshaw Leanpub.com/scrapingforjournalists*

Scraping for stories

Why scraping

How to spot opportunities for scraping

Tools and traits: what can be scraped, and how

Why and how

Automating the repetitive gathering of data, e.g. multiple tables in one page, webpage tables, multiple spreadsheets, multiple PDFs

What is scraping?

Why is a government website carrying fake jobs?

Aron Pilhofer, News Rewired

https://www.youtube.com/watch?v=Efr-VEkwWoM

http://www.mirror.co.uk/news/uk-news/singer-best-vocal-range-uk-4323076

http://www.private-eye.co.uk/registry

Focus.

New entries - or disappearing ones

http://helpmeinvestigate.com/olympics/olympic-torch-relay-youth-target-missed-by-over-1000-places/

https://moveplanner.zoopla.co.uk/terms-and-conditions

http://blogs.ft.com/ftdata/2014/06/11/interactive-explore-the-statistical-identity-of-every-team-at-the-world-cup/?

http://www.nature.com/news/scientific-publishing-the-inside-track-1.15424

What makes a website suitable for scraping?

http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs

URL parameters
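A minimal sketch of what "URL parameters" means in practice, using only Python's standard library. It splits the jobs-search URL above into its parts and shows how a scraper might change one parameter. The `page` parameter is a hypothetical illustration; nothing in the slides confirms that this site accepts it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = "http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs"

# Everything after the "?" is the query string: name=value pairs.
params = parse_qs(urlsplit(url).query)


def with_params(url, **changes):
    """Return the URL with query parameters replaced or added."""
    parts = urlsplit(url)
    query = {name: values[0] for name, values in parse_qs(parts.query).items()}
    query.update(changes)
    return urlunsplit(parts._replace(query=urlencode(query)))


# "page" is an assumed parameter name, used here only for illustration.
page_two = with_params(url, page=2)

print(params)
print(page_two)
```

If changing a value in the address bar changes what the page shows, a scraper can usually generate those addresses itself.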

https://stores.sainsburys.co.uk/

HTML <table>, HTML tag(s), JSON file, XML, TXT, CSV or XLS, PDF, PDF which needs OCR

Document challenges
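The easiest case in the list above, an HTML table, can be sketched with Python's built-in `html.parser`. The sample table and its store names are made up for illustration; real pages are messier, and many people would reach for a dedicated HTML library instead.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample data standing in for a downloaded page.
SAMPLE = """
<table>
  <tr><th>Store</th><th>Town</th></tr>
  <tr><td>Example Local</td><td>Birmingham</td></tr>
  <tr><td>Example Superstore</td><td>Leeds</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE)
rows = parser.rows
print(rows)
```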

1 page, changes; >1 pages, ‘next’ links; pages linked from 1 index; >1 pages, URL pattern; search results, URL pattern; search results, uses cookie; search results, login needed

Hosting challenges
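One of the hosting cases above, several pages joined by ‘next’ links, might be handled along these lines. This is a sketch under assumptions: the site is a made-up in-memory dictionary, `fake_fetch` stands in for a real download, and the ‘next’ link is assumed to be marked with `rel="next"`, which many sites do not do.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> tag whose rel attribute is 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "next" and self.next_href is None:
            self.next_href = attrs.get("href")


# Hypothetical three-page site, used instead of real network requests.
FAKE_SITE = {
    "https://example.com/list?page=1": '<p>A</p><a rel="next" href="/list?page=2">Next</a>',
    "https://example.com/list?page=2": '<p>B</p><a rel="next" href="/list?page=3">Next</a>',
    "https://example.com/list?page=3": "<p>C</p>",
}


def fake_fetch(url):
    """Stand-in for downloading a page; returns the stored HTML."""
    return FAKE_SITE[url]


def crawl(start_url, fetch, max_pages=50):
    """Follow 'next' links from start_url and return the URLs visited in order."""
    visited = []
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        finder = NextLinkFinder()
        finder.feed(fetch(url))
        # Relative links such as "/list?page=2" are resolved against the current URL.
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return visited


pages = crawl("https://example.com/list?page=1", fake_fetch)
print(pages)
```

The `max_pages` limit and the check against already visited URLs are there so a looping ‘next’ link cannot keep the scraper running forever.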

Patterns: look for structure in a webpage. How are elements distinguished? Think code and text.

Chrome: right-click > Inspect

Inspector: right-click > Copy…

A tip about URLs

This bit isn’t needed.

This bit is just for SEO.

You can put anything there. Literally.

Do it now: identify an online source of information you might scrape. Think beyond tables: what about series of pages? Documents?

https://onlinejournalismblog.com/2013/09/18/ethics-in-data-journalism-mass-data-gathering-scraping-foi-and-deception/

Robots.txt http://www.tcij.org/robots.txt
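A robots.txt file can be checked in code before scraping. The sketch below uses Python's standard `urllib.robotparser` on a made-up set of rules; these are not the actual contents of the file linked above, and `my-scraper` is a hypothetical user-agent name.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())

allowed = rp.can_fetch("my-scraper", "https://example.com/reports/2014.html")
blocked = rp.can_fetch("my-scraper", "https://example.com/private/notes.html")
delay = rp.crawl_delay("my-scraper")  # seconds the site asks scrapers to wait

print(allowed, blocked, delay)

# To read a live file instead, one would call rp.set_url(...) with the
# site's robots.txt address and then rp.read() before asking can_fetch().
```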

Treat like any source: build in TGTBT (too good to be true) checks; seek second sources; seek right of reply/confirmation

Data is just a lead

http://www.storybench.org/to-scrape-or-not-to-scrape-the-technical-and-ethical-challenges-of-collecting-data-off-the-web/

https://www.mediawiki.org/wiki/API:Main_page

Does it have an API?
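When a site does have an API, the "scraper" often reduces to building a URL and reading structured data back. The sketch below builds a search request for the MediaWiki API linked above and pulls titles out of a response. The sample response is invented to match the general shape of a `list=search` reply; no request is actually sent here.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://www.mediawiki.org/w/api.php"


def search_url(term):
    """Build a MediaWiki API search URL that asks for JSON."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def result_titles(response_text):
    """Extract page titles from the text of a search response."""
    data = json.loads(response_text)
    return [hit["title"] for hit in data.get("query", {}).get("search", [])]


# Made-up response text standing in for what the API would return.
SAMPLE_RESPONSE = '{"query": {"search": [{"title": "API:Main page"}, {"title": "API:Search"}]}}'

request_url = search_url("web scraping")
titles = result_titles(SAMPLE_RESPONSE)

print(request_url)
print(titles)
```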

Gives you a long-term insight into the issue. Allows you to spot things being removed and added.

Scheduled scrapes
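Spotting what has been added or removed between two scheduled scrapes can be as simple as comparing two sets. A minimal sketch, assuming each scrape produces a list of identifiers (the job titles below are invented):

```python
def compare_snapshots(previous, current):
    """Return entries that appeared or disappeared between two scrapes."""
    previous, current = set(previous), set(current)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


# Invented results from two runs of the same scraper.
monday = ["Caseworker", "Press officer", "Researcher"]
tuesday = ["Caseworker", "Researcher", "Parliamentary assistant"]

changes = compare_snapshots(monday, tuesday)
print(changes)
```

In a real setup each run would also save its results with a date, so that later comparisons are possible.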

Paul Bradshaw Leanpub.com/scrapingforjournalists*

Thank you.
