Scraping from the Web: An Overview That Does Not Contain Too Much Cussing


Description

A high level overview of how we did scraping at EveryBlock.

Transcript of Scraping from the Web: An Overview That Does Not Contain Too Much Cussing

Page 1

SCRAPING FROM THE WEB: An Overview That Does Not Contain Too Much Cussing

Feihong Hsu, ChiPy

February 14, 2013

Page 2

Organization

Definition of scraper

Common types of scrapers

Components of a scraping system

Pro tips

Page 3

What I mean when I say scraper

Any program that retrieves structured data from the web, and then transforms it to conform with a different structure.

Wait, isn’t that just ETL? (extract, transform, load)

Well, sort of, but I don’t want to call it that...

Page 4

Some people would say that “scraping” only applies to web pages. I would argue that getting data from a CSV or JSON file is qualitatively not all that different. So I lump them all together.

Why not ETL? Because ETL implies that there are rules and expectations, and these two things don’t exist in the world of open government data. They can change the structure of their dataset without telling you, or even take the dataset down on a whim. A program that pulls down government data is often going to be a bit hacky by necessity, so “scraper” seems like a good term for that.

Notes

Page 5

Main types of scrapers

CSV

RSS/Atom

JSON

XML

HTML crawler

Web browser

PDF

Database dump

GIS

Mixed

Page 6

CSV

import csv

You should usually use csv.DictReader.

If the column names are all caps, consider making them lowercase.

Watch out for CSV datasets that don’t have the same number of elements on each row.

Page 7

import csv

def get_rows(csv_file):
    reader = csv.reader(open(csv_file))
    # Get the column names, lowercased.
    column_names = tuple(k.lower() for k in next(reader))
    for row in reader:
        yield dict(zip(column_names, row))
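
The bullet on the previous slide recommends csv.DictReader; here is a minimal sketch of the same loop using it. The lowercased keys are carried over from get_rows() above, not something DictReader requires.

import csv

def get_rows_dictreader(csv_file):
    with open(csv_file) as f:
        for row in csv.DictReader(f):
            # DictReader already maps column names to values; just lowercase the keys.
            yield dict((k.lower(), v) for k, v in row.items())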

Page 8

JSON

import json
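
The slide stops at the import, so here is a minimal sketch, assuming the dataset is a JSON array of records served at some URL (the URL and record layout are assumptions, not a specific feed).

import json
import requests

def get_rows(url):
    # Fetch the document and parse it as a JSON array of records.
    for record in json.loads(requests.get(url).text):
        yield record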

Page 9

XML

import lxml.etree

Get rid of namespaces in the input document (a sketch follows below). http://bit.ly/LO5x7H

A lot of XML datasets have a fairly flat structure. In these cases, convert the elements to dictionaries.
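
A minimal sketch of one way to drop namespaces, assuming you just want every tag reduced to its local name (the recipe behind the shortened link may differ in detail).

import lxml.etree

def strip_namespaces(root):
    for el in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(el.tag, str) and '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
    # Drop the now-unused namespace declarations.
    lxml.etree.cleanup_namespaces(root)
    return root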

Page 10

<root>
  <items>
    <item>
      <id>3930277-ac</id>
      <name>Frodo Samwise</name>
      <age>56</age>
      <occupation>Tolkien scholar</occupation>
      <description>Short, with hairy feet</description>
    </item>
    ...
  </items>
</root>

Page 11

import lxml.etree

def get_rows(xml_string):
    tree = lxml.etree.fromstring(xml_string)
    for el in tree.findall('items/item'):
        children = el.getchildren()
        # Keys are element names.
        keys = (c.tag for c in children)
        # Values are element text contents.
        values = (c.text for c in children)
        yield dict(zip(keys, values))

Page 12

HTML

import requests

import lxml.html

I generally use XPath, but pyquery seems fine too.

If the HTML is very funky, use html5lib as the parser.

Sometimes data can be scraped from a chunk of JavaScript embedded in the page.

Page 13

Please don’t use urllib2.

If you do use html5lib for parsing, remember that you can do so from within lxml itself. http://lxml.de/html5parser.html

Notes
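
Putting the last two slides together, a minimal sketch using requests plus lxml.html with an XPath query. The table id and row layout are assumptions about a hypothetical page, not a real dataset.

import requests
import lxml.html

def get_rows(url):
    doc = lxml.html.fromstring(requests.get(url).text)
    # For very funky markup, lxml.html.html5parser offers a fromstring()
    # backed by html5lib that can be swapped in here.
    for tr in doc.xpath('//table[@id="results"]//tr'):
        cells = [td.text_content().strip() for td in tr.xpath('td')]
        if cells:
            yield cells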

Page 14

Web browser

If you need a real browser to scrape the data, it’s often not worth it.

But there are tools out there.

I wrote PunkyBrowster, but I can't really recommend it over ghost.py, which seems to have a better API, supports PySide and Qt, and has a more permissive license (MIT).

Page 15

PDF

Not as hard as it looks.

There are no Python libraries that handle all kinds of PDF documents in the wild.

Use the pdftohtml command to convert the PDF to XML (a sketch follows below).

When debugging, use pdftohtml to generate HTML that you can inspect in the browser.

If the text in the PDF is in tabular format, you can group text cells by proximity.
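
A hedged sketch of the pdftohtml step, assuming the poppler pdftohtml binary is on the path; the output filename is arbitrary.

import subprocess
import lxml.etree

def pdf_to_tree(pdf_path, xml_path='out.xml'):
    # -xml emits one <text> element per chunk of text, with its coordinates as attributes.
    subprocess.check_call(['pdftohtml', '-xml', pdf_path, xml_path])
    return lxml.etree.parse(xml_path)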

Page 16

The “group by proximity” strategy works like this (a rough code sketch follows these notes):

1. Find a text cell that has a very distinct pattern (probably a date cell). This is your “anchor”.

2. Find all cells that have the same row position as the anchor (possibly off by a few pixels).

3. Figure out which grouped cells belong to which fields based on column position.

Notes
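
A rough sketch of those three steps, assuming pdftohtml -xml output where each chunk is a <text> element carrying top and left attributes. The date regex and the pixel tolerance are assumptions.

import re

DATE_RE = re.compile(r'\d{1,2}/\d{1,2}/\d{4}')

def group_rows(tree, tolerance=3):
    cells = tree.findall('.//text')
    for anchor in cells:
        # Step 1: a cell with a very distinct pattern is the anchor.
        if not DATE_RE.search(anchor.text or ''):
            continue
        top = int(anchor.get('top'))
        # Step 2: cells on (roughly) the same row as the anchor.
        row = [c for c in cells if abs(int(c.get('top')) - top) <= tolerance]
        # Step 3: column position decides which field each cell belongs to.
        row.sort(key=lambda c: int(c.get('left')))
        yield [c.text for c in row]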

Page 17

RSS/Atom

import feedparser

Sometimes feedparser can’t handle custom fields, and you’ll have to fall back to lxml.etree.

Unfortunately, plenty of RSS feeds are not compliant XML. Either do some custom munging or try html5lib.
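
A minimal sketch of the feedparser path; field names beyond the common ones vary by feed, so .get() is used defensively.

import feedparser

def get_entries(url):
    feed = feedparser.parse(url)
    for entry in feed.entries:
        yield {
            'title': entry.get('title'),
            'link': entry.get('link'),
            'published': entry.get('published'),
        }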

Page 18

Database dump

If it’s a Microsoft Access file, use mdbtools to dump the data.

Sometimes it’s a ZIP file containing CSV files, each of which corresponds to a separate table dump.

Just load it all into a SQLite database and run queries on it.
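
A hedged sketch of the load-into-SQLite idea, assuming each CSV in the dump has a header row whose names are safe to use as column identifiers. The table and file names in the usage lines are made up.

import csv
import sqlite3

def load_table(conn, table, csv_file):
    with open(csv_file) as f:
        reader = csv.reader(f)
        columns = next(reader)
        conn.execute('CREATE TABLE %s (%s)' % (table, ', '.join(columns)))
        placeholders = ', '.join('?' for _ in columns)
        conn.executemany(
            'INSERT INTO %s VALUES (%s)' % (table, placeholders), reader)
    conn.commit()

conn = sqlite3.connect('dump.db')
load_table(conn, 'permits', 'permits.csv')  # hypothetical table dump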

Page 19

We wrote code that simulated joins using lists of dictionaries. This was painful to write and not so much fun to read. Don’t do this.

Notes

Page 20

GIS

I haven’t worked much with KML or SHP files.

If an organization provides GIS files for download, they usually offer other options as well. Look for those instead.

Page 21

Mixed

This is very common.

For example: an organization offers a CSV download, but you have to scrape their web page to find the link for it.
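
A minimal sketch of that example: scrape the page for the CSV link, then download it. The XPath that identifies the link is an assumption about a hypothetical page.

import requests
import lxml.html

def download_csv(page_url):
    doc = lxml.html.fromstring(requests.get(page_url).text)
    doc.make_links_absolute(page_url)
    # Grab the first link that points at a .csv file.
    csv_url = doc.xpath('//a[contains(@href, ".csv")]/@href')[0]
    return requests.get(csv_url).content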

Page 22

Components of a scraping system

Downloader

Cacher

Raw item retriever

Existing item detector

Item transformer

Status reporter

Page 23

Caching is essential when scraping a dataset that involves a large number of HTML pages. Test runs can take hours if you’re making requests over the network. A good caching system pretty prints the files it downloads so you can more easily inspect them.

Reporting is essential if you’re managing a group of scrapers. Since you KNOW that at least one of your scrapers will be broken at any time, you might as well know which ones are broken. A good reporting mechanism shows when your scrapers break, as well as when the dataset itself has issues (determined heuristically).

Notes
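
A minimal sketch of the downloader/cacher pair, assuming a flat on-disk cache keyed by a hash of the URL (the pretty-printing mentioned above is left out).

import hashlib
import os
import requests

CACHE_DIR = 'cache'

def get(url):
    key = hashlib.sha1(url.encode('utf-8')).hexdigest()
    path = os.path.join(CACHE_DIR, key)
    # Serve from the cache if this URL was already downloaded.
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    text = requests.get(url).text
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    with open(path, 'w') as f:
        f.write(text)
    return text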

Page 24

Steps to writing a scraper

Find the data source

Find the metadata

Analysis (verify the primary key)

Develop

Test

Fix (repeat ∞ times)

Page 25

The Analysis step should also include noting which fields should be lookup fields (see design pattern slide).

The Testing step is always done on real data and has three phases: dry run (nothing added or updated), dry run with lookups (only lookups are added), and production run. I run all three phases on my local instance before deploying to production.

Notes

Page 26

A very useful tool for HTML scraping

Firefinder (http://bit.ly/kr0UOY)

Extension for Firebug

Allows you to test CSS and XPath expressions on any page, and visually inspect the results.

Page 27

Look, it’s Firefinder!

Page 28

Storing scraped data

Don’t create tables before you understand how you want to use the data.

Consider using ZODB (or another nonrelational DB)

Adrian Holovaty’s talk on how EveryBlock avoided creating new tables for each dataset: http://bit.ly/Yl6VAZ (relevant part starts at 7:10)
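
A hedged sketch of the ZODB suggestion: persist scraped items under the root mapping instead of designing tables up front. The 'news_items' key and the item shape are made up.

import ZODB
import ZODB.FileStorage
import transaction

storage = ZODB.FileStorage.FileStorage('scraped.fs')
db = ZODB.DB(storage)
conn = db.open()
root = conn.root()
# Store items under an arbitrary key; no schema has to exist beforehand.
# (For mutable structures you would normally use persistent classes such as
# persistent.mapping.PersistentMapping so changes are tracked.)
root['news_items'] = {'3930277-ac': {'name': 'Frodo Samwise', 'age': 56}}
transaction.commit()
db.close()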

Page 29

Design patterns

If a field contains a finite number of possible values, store those values in a lookup table and reference them, instead of repeating each value on every row.

Make a scraper superclass that incorporates common scraper logic.

Page 30

The scraper superclass will probably have convenience methods for converting dates/times, cleaning HTML, looking for existing items, etc. It should also incorporate the caching and reporting logic.

Notes
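
A rough skeleton of the superclass idea; the method names are illustrative guesses, not EveryBlock's actual API.

class BaseScraper(object):
    """Shared plumbing: the cached downloader, the status reporter, and
    convenience helpers for date/time conversion, HTML cleaning, and
    existing-item lookups would all live here."""

    def update(self):
        # The common loop; subclasses supply get_raw_items() and transform(),
        # while already_exists() and save() are part of the shared plumbing.
        for raw in self.get_raw_items():
            item = self.transform(raw)
            if not self.already_exists(item):
                self.save(item)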

Page 31

Working with government data

Some data sources are only available at certain times of day.

Be careful about rate limiting and IP blocking.

Data scraped from a web page shouldn’t be used for analyzing trends.

When you’re stuck, give them a phone call.

Page 32

If you do manage to find an actual person to talk to you, keep a record of their contact information and do NOT lose it! They are your first line of defense when a dataset you rely on goes down.

Notes

Page 33

Pro tips

When you don’t know what encoding the content is in, use charade, not chardet.

Remember to clean any HTML you intend to display.

If the dataset doesn’t allow filtering by date, it’s a lost cause (unless you just care about historical data).

When your scraper fails, do NOT fix it. If a user complains, consider fixing it.
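
A minimal sketch of the encoding tip; charade exposes the same detect() call as chardet, and the 'replace' error handling is an assumption about how lenient you want to be.

import charade

def decode_guess(raw_bytes):
    # detect() returns a dict with the guessed encoding and a confidence score.
    encoding = charade.detect(raw_bytes)['encoding'] or 'utf-8'
    return raw_bytes.decode(encoding, 'replace')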

Page 34

I am done

Questions?