Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process...

24
Web Scraping and BLS Federal Committee on Statistical Methodology Research Conference March 7, 2018 Bill Thompson 1

Transcript of Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process...

Page 1: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Web Scraping and BLS

Federal Committee on Statistical Methodology Research Conference March 7, 2018

Bill Thompson

1

Page 2: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Overview

■ Background

■ Experiences at BLS

■ Challenges

■ Next Steps

2 2— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 3: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Data Collection Challenges

■ Reluctant respondent

o Burden

o Trust

■ Data Collection

o Costly to support entire infrastructure

3 3— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 4: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Data Collection Challenges

■ Called to automate our data collection

4 4— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 5: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

How BLS has Addressed Data Collection Challenges

■ Purchased data sets

■ Other Federal surveys

■ Directly from corporation

■ Scraping websites

5— U.S. BUREAU OF LABOR STATISTICS • bls.gov 5

Page 6: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Data Collection Opportunity

■ Information on the internet

■ Anything, well almost anything, we collect manually could be collected automatically

6— U.S. BUREAU OF LABOR STATISTICS • bls.gov 6

Page 7: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Web Scraping

Web scraping is the process of extracting data from websites, typically through the use of a software program that simulates human exploration of a website(s).

7— U.S. BUREAU OF LABOR STATISTICS • bls.gov 7

Page 8: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Getting Web Data

■ Web scraping etc… o Extract data (prices) from web page (HTML)

– Human readable page -> machine readable data

o Using APIs (Application Programming Interfaces) – Code (provided by source) get machine readable data directly

■ Next step--data-traffic efficient applications

8— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 9: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Web Scrape Benefits

■ Reduces collection costs

■ Reduces respondent burden

■ Improves timeliness

■ Increases sample size

■ Increases data quality

■ Allows for evaluation & improvement

9— U.S. BUREAU OF LABOR STATISTICS • bls.gov 9

Page 10: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Web Scraping

■ In the beginning …

– Obtain price and characteristic data for hedonic modeling

– Evaluate as a replacement for data collection

10— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 11: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Web Scraping

■ Then there were multiple efforts… – Retailer web sites, web scraping & API

– Price aggregators

– Obtain data from other statistical agencies

– Location services to refine address information

– Job postings & The Conference Board

11— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 12: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Web Scraping Successes

■ Fatal Work Injuries & Google Alerts

■ Office of Productivity & Technology & Federal Agencies

■ Prices

12 12— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 13: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Census of Fatal Occupational Injuries (CFOI)

■ Fatal Work Injuries & Google Alerts

o Google Alerts

o Developed CFOI Public Data Management System (C-PDMS)

o C-PDMS is being used by for CFOI production purposes.

13 13— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 14: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Office Of Productivity & Technology (OPT)

■ Web Scrape Throughout OPT

o Multifactor Productivity

– 40% of the input data in non-manufacturing & manufacturing

– Bureau of Economic Analysis (BEA)

o Industry Productivity

– United States Geological Survey (USGS)

– & Center Disease Control (CDC)

– & Bureau of Labor Statistics (BLS)

o Reaching out to BEA for more source data into the API

14 14— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 15: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Consumer Price Index & Division of Price and Index Number Research

■ Price research

o A fairly significant amount of research into web scraping activity in potential support of the Consumer Price Index

o Overall objective of automating collection of prices and product information to create price indexes

15 15— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 16: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Consumer Price Index & Division of Price and Index Number Research

■ Current Price Research

o Using the API (with permission) – From retailers to collect prices and characteristic

information on a weekly basis

– From online price aggregator

o Using DPINR’s automated data collection – From retailers (with permission)

16 16— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 17: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

v17— U.S. BUREAU OF LABOR STATISTICS • bls.go

Page 18: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

18— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 19: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

What have we learned?

BLS can systematically harness information on the internet for statistical use…clearly BLS has transitioned out of the ‘proof of concept’ stage.

19 19— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 20: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Challenges

■ Infrastructure

■ Appropriate skills

■ Research

■ And…

20 20— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 21: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Policy Issues

■ Information on website=publicly available data?

■ How does confidentiality apply?

■ Optics & future respondent cooperation matter

21 21— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 22: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

In the Meantime

■ Except for a few situations

o CPI has scaled back effort

o Secure permission prior to web scrape

■ Guidance

o Developing a business plan for each planned use

o Developing a guidance/decision

■ Discussion about current policy

22 22— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 23: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Next Steps

■ Policy concerns

■ Begin to consider a centralized approach

■ Continue to seek out opportunities

23 23— U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 24: Web Scraping and BLS - National Center for Education ...Web Scraping . Web scraping is the process of extracting data from websites, typically through the use of a software program

Contact Information Bill Thompson

Producer Price Index

[email protected]

David Friedman

Associate Commissioner Office of Price & Living

Conditions

[email protected]

24