RECSM Summer School: Scraping the web · pablobarbera.com/big-data-upf/slides/02-scraping.pdf · 2017-06-30

RECSM Summer School: Scraping the web

Pablo Barberá

School of International Relations
University of Southern California

pablobarbera.com

Networked Democracy Lab
www.netdem.org

Course website: github.com/pablobarbera/big-data-upf

Scraping the web: what? why?

An increasing amount of data is available on the web:

- Speeches, sentences, biographical information...
- Social media data, newspaper articles, press releases...
- Geographic information, conflict data...

These datasets are often provided in an unstructured format.

Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.

Scraping the web: two approaches

Two different approaches:

1. Screen scraping: extract data from source code of website, with HTML parser and/or regular expressions
   - rvest package in R

2. Web APIs (application programming interfaces): a set of structured HTTP requests that return JSON or XML data
   - httr package to construct API requests
   - Packages specific to each API: weatherData, WDI, Rfacebook... Check CRAN Task View on Web Technologies and Services for more examples
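
As a rough illustration of how the two approaches differ, the sketch below recovers the same record three ways: from an HTML snippet with a parser, from the same snippet with a regular expression, and from a JSON payload of the kind a web API might return. It uses Python's standard library instead of the rvest and httr packages named above, only so that it runs without extra installs. The HTML snippet, the JSON payload, and the class names are made up for illustration.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical page source; real pages are far messier.
HTML_PAGE = """
<html><body>
  <div class="speech"><span class="speaker">A. Smith</span>
  <span class="date">2017-06-30</span></div>
</body></html>
"""

# Hypothetical API response carrying the same record.
JSON_RESPONSE = '{"speaker": "A. Smith", "date": "2017-06-30"}'


class SpanCollector(HTMLParser):
    """Collect the text of every <span>, keyed by its class attribute."""

    def __init__(self):
        super().__init__()
        self.current = None  # class of the <span> we are inside, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.current = dict(attrs).get("class")

    def handle_data(self, data):
        if self.current and data.strip():
            self.record[self.current] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None


def scrape_with_parser(html):
    """Screen scraping with an HTML parser."""
    parser = SpanCollector()
    parser.feed(html)
    return parser.record


def scrape_with_regex(html):
    """Screen scraping with a regular expression (simple markup only)."""
    pairs = re.findall(r'<span class="(\w+)">([^<]*)</span>', html)
    return {name: text.strip() for name, text in pairs}


def read_from_api(payload):
    """Web API route: the response is already structured."""
    return json.loads(payload)


print(scrape_with_parser(HTML_PAGE))
print(scrape_with_regex(HTML_PAGE))
print(read_from_api(JSON_RESPONSE))
```

All three calls should yield the same dictionary; the API route simply needs far less code because the structure is supplied by the server.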

The rules of the game

1. Respect the hosting site’s wishes:
   - First, check if an API exists or if data are available for download
   - Some websites disallow scrapers in their robots.txt files

2. Limit your bandwidth use:
   - Wait one or two seconds after each hit
   - Scrape only what you need, and just once (e.g. store the HTML file on disk, and then parse it)

3. When using APIs, read the documentation:
   - Is there a batch download option?
   - Are there any rate limits?
   - Can you share the data?
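
The first two rules can be sketched in a few lines. The example below checks a robots.txt file before fetching, stores each page on disk so it is downloaded only once, and pauses after every real hit. It is a minimal sketch in Python's standard library, not the course's own code: the robots.txt content, the user-agent name `my-scraper`, the file-naming scheme, and the `download` argument are all assumptions made for illustration.

```python
import time
from pathlib import Path
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from <site>/robots.txt.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""


def allowed(url, robots_txt=ROBOTS_TXT, agent="my-scraper"):
    """Return True if robots.txt permits `agent` to fetch `url`."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(agent, url)


def fetch_once(url, cache_dir, download, delay=1.0):
    """Download `url` only if no copy is on disk yet, then return the HTML.

    `download` is any function mapping a URL to its HTML text,
    for example a small wrapper around urllib.request.urlopen.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Simple, hypothetical naming scheme for the cached files.
    name = url.replace("://", "_").replace("/", "_") + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")  # parse from disk, no new hit
    if not allowed(url):
        raise PermissionError(f"robots.txt disallows {url}")
    html = download(url)
    path.write_text(html, encoding="utf-8")
    time.sleep(delay)  # wait after each real hit
    return html
```

Passing the downloader in as an argument keeps the caching and politeness logic separate from the network code, which also makes it easy to try out with a stand-in function.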

The art of web scraping

Workflow:

1. Learn about structure of website
2. Build prototype code
3. Generalize: functions, loops, debugging
4. Data cleaning
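
One way to read the workflow in code: write a parser for a single page once its structure is known, wrap it in a function, loop over all pages while recording failures, and clean the extracted values. The sketch below follows those steps in Python with made-up page contents; the tag names, the `amount` class, and the currency format are assumptions, not taken from any real site.

```python
import re

# Hypothetical pages standing in for downloaded HTML, keyed by URL.
PAGES = {
    "page1": "<h1> Report A </h1><p class='amount'>Rs. 1,500</p>",
    "page2": "<h1>Report B</h1><p class='amount'>Rs. 200 </p>",
    "page3": "<h1>Broken page</h1>",  # no amount: exercises the error path
}


def parse_page(html):
    """Steps 1-2: the prototype, written once the page structure is known."""
    title = re.search(r"<h1>(.*?)</h1>", html).group(1)
    amount = re.search(r"<p class='amount'>(.*?)</p>", html).group(1)
    return {"title": title, "amount": amount}


def clean(row):
    """Step 4: strip whitespace and turn 'Rs. 1,500' into the integer 1500."""
    digits = re.sub(r"[^\d]", "", row["amount"])
    return {"title": row["title"].strip(), "amount": int(digits)}


def scrape_all(pages):
    """Step 3: loop over pages, recording failures instead of crashing."""
    rows, failed = [], []
    for url, html in pages.items():
        try:
            rows.append(clean(parse_page(html)))
        except AttributeError:  # re.search returned None: element missing
            failed.append(url)
    return rows, failed


rows, failed = scrape_all(PAGES)
print(rows)
print(failed)
```

Keeping a list of failed pages, rather than stopping at the first error, makes the debugging step easier because the problem pages can be inspected afterwards.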

Three main scenarios

1. Data in table format

Three main scenarios

2. Data in unstructured format

www.ipaidabribe.com/reports/paid

Three main scenarios

3. Data hidden behind web forms

Candidates in the 2015 Venezuelan parliamentary election

Three main scenarios

1. Data in table format
   - Automatic extraction with rvest

2. Data in unstructured format
   - Element identification with selectorGadget
   - Automatic extraction with rvest

3. Data hidden behind web forms
   - Automation of web browser behavior with selenium
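
For the first scenario, the idea of automatic table extraction can be sketched as follows: walk the HTML, collect the cells of each row, and use the header row as column names. This is a Python standard-library sketch of the general technique, not a description of how rvest works internally, and the table contents are invented placeholders.

```python
from html.parser import HTMLParser

# Hypothetical table; on a real site this would come from the page source.
HTML_TABLE = """
<table>
  <tr><th>Candidate</th><th>Party</th><th>Votes</th></tr>
  <tr><td>Candidate A</td><td>Party X</td><td>1200</td></tr>
  <tr><td>Candidate B</td><td>Party Y</td><td>950</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self.in_cell = True
            self.rows[-1].append("")

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False


def table_to_records(html):
    """Return one dict per data row, keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = [[cell.strip() for cell in row] for row in parser.rows]
    return [dict(zip(header, row)) for row in body]


records = table_to_records(HTML_TABLE)
print(records)
```

The values come back as strings; converting columns such as the vote counts to numbers belongs to the data cleaning step. The other two scenarios (CSS selectors and browser automation) need third-party tools and are not sketched here.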

APIs

API = Application Programming Interface; a set of structured HTTPS requests that return data in JSON or XML format.

Types of APIs:

1. RESTful APIs: queries for static information at current moment (e.g. user profiles, posts, etc.)
2. Streaming APIs: changes in users’ data in real time (e.g. new tweets, new FB posts...)

Most APIs are rate-limited:
- Restrictions on number of API calls by user/IP address and period of time.
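
A client can stay within a rate limit by tracking when its recent calls were made and sleeping when the quota for the current window is used up. The class below is a minimal sketch of that idea in Python; the class name, its parameters, and the example limit of 2 calls per 0.2 seconds are illustrative assumptions, and real APIs document their own limits.

```python
import time


class RateLimiter:
    """Allow at most `max_calls` calls in any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the class can be tried without waiting
        self.sleep = sleep
        self.calls = []       # timestamps of calls inside the current window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self.clock()
            # Forget calls that have left the current window.
            self.calls = [t for t in self.calls if now - t < self.period]
            if len(self.calls) < self.max_calls:
                break
            # Sleep until the oldest remembered call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
        self.calls.append(now)


# Usage: call wait() before every API request.
limiter = RateLimiter(max_calls=2, period=0.2)
for i in range(3):
    limiter.wait()  # the third iteration pauses briefly
    print("request", i)
```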

Connecting with an API

Constructing a REST API call:
- Baseline URL: https://maps.googleapis.com/maps/api/geocode/json
- Parameters: ?address=barcelona
- Authentication token: &key=XXXXX

Response is often in JSON format.

Authentication:
- Many APIs require an access key or token
- An alternative, open standard is called OAuth
- Connections without sharing username or password, only temporary tokens that can be refreshed
- httr package in R implements most cases (examples)
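
The pieces of the call above can be assembled programmatically. The sketch below builds the request URL from the baseline URL, the parameter, and the placeholder key shown on the slide, then parses a JSON body. It uses Python's standard library rather than httr; the helper name `build_url` is hypothetical, no request is actually sent, and the sample response is a made-up, heavily trimmed payload whose fields may not match the real service.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_url(base, **params):
    """Append URL-encoded query parameters to the baseline URL."""
    return base + "?" + urlencode(params)


# "XXXXX" is the placeholder from the slide, not a working key.
url = build_url(BASE_URL, address="barcelona", key="XXXXX")
print(url)

# Sending the request would look like urllib.request.urlopen(url).read();
# it is left out here so the sketch runs offline.

# Hypothetical response body, for illustration only.
SAMPLE_RESPONSE = '{"status": "OK", "results": [{"name": "Barcelona"}]}'
data = json.loads(SAMPLE_RESPONSE)
print(data["status"], data["results"][0]["name"])
```

Letting `urlencode` build the query string also takes care of escaping spaces and special characters in parameter values.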

Other APIs

See CRAN Web Technologies Task View