When RSS Fails: Web Scraping with HTTP

Matthew Turland, Senior Consultant, Blue Parabola LLC, February 27, 2009

description

A brief introduction to the HTTP protocol for use in web scraping, best practices, and availability of PHP-based HTTP client libraries.

Transcript of When RSS Fails: Web Scraping with HTTP

Page 1: When RSS Fails: Web Scraping with HTTP

When RSS Fails: Web Scraping with HTTP

Matthew Turland, Senior Consultant, Blue Parabola LLC, February 27, 2009

Page 2: When RSS Fails: Web Scraping with HTTP

What is Web Scraping?

A 2-Step Process

Page 3: When RSS Fails: Web Scraping with HTTP

Its Goal: Data

Page 4: When RSS Fails: Web Scraping with HTTP

Obtain It

Page 5: When RSS Fails: Web Scraping with HTTP

Transform It

Page 6: When RSS Fails: Web Scraping with HTTP

Automate It

Page 7: When RSS Fails: Web Scraping with HTTP

Step 1: Retrieval

Page 8: When RSS Fails: Web Scraping with HTTP

The Client

Page 9: When RSS Fails: Web Scraping with HTTP

The Server

Page 10: When RSS Fails: Web Scraping with HTTP

The Request

Page 11: When RSS Fails: Web Scraping with HTTP

The Response

Page 12: When RSS Fails: Web Scraping with HTTP

Or In Your Case

Page 13: When RSS Fails: Web Scraping with HTTP

Step 2: Analysis

Page 14: When RSS Fails: Web Scraping with HTTP

Locate Desired Data

Page 15: When RSS Fails: Web Scraping with HTTP

Extract It

Page 16: When RSS Fails: Web Scraping with HTTP

Use It

Page 17: When RSS Fails: Web Scraping with HTTP

2-Step Process

Step 1: Retrieval

GET /some/resource ...

HTTP/1.1 200 OK ... (resource with data you want)

Step 2: Analysis

Raw resource to usable data

So To Recap

Page 18: When RSS Fails: Web Scraping with HTTP

Data Mining: focus in data mining vs. focus in web scraping

Consuming Web Services: web service data formats vs. web scraping data formats

How Is It Different?

Page 19: When RSS Fails: Web Scraping with HTTP

System integration

Crawlers and indexers

Integration testing

What Is It Used For?

Page 20: When RSS Fails: Web Scraping with HTTP

Disadvantages

Page 21: When RSS Fails: Web Scraping with HTTP

One small change to markup...

Page 22: When RSS Fails: Web Scraping with HTTP

... may break your application.

Page 23: When RSS Fails: Web Scraping with HTTP

Or in modern terms...

Page 24: When RSS Fails: Web Scraping with HTTP

Reverse Engineering Required

Page 25: When RSS Fails: Web Scraping with HTTP

Multiple Requests

Page 26: When RSS Fails: Web Scraping with HTTP

No Nice Neat Data Package

Page 27: When RSS Fails: Web Scraping with HTTP

Quite the Opposite, In Fact

Page 28: When RSS Fails: Web Scraping with HTTP

Use one like this:

To do this:

Know enough HTTP to...

Page 29: When RSS Fails: Web Scraping with HTTP

Learn to use and troubleshoot one like this: PEAR::HTTP_Client, pecl_http, Zend_Http_Client

Or roll your own! cURL, Filesystem + Streams

Know enough HTTP to...
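As a rough idea of what "roll your own" with Filesystem + Streams can look like, here is a minimal sketch. The user agent string, timeout, and example.com URL are placeholders, not anything from the talk, and the actual request is left commented out so the snippet runs without network access.

```php
<?php
// Describe the request with an HTTP stream context.
// All option values here are illustrative assumptions.
$context = stream_context_create([
    'http' => [
        'method'        => 'GET',
        'header'        => "User-Agent: ExampleScraper/1.0\r\n",
        'timeout'       => 10,   // seconds
        'ignore_errors' => true, // still return the body on 4xx/5xx responses
    ],
]);

// Sending the request would then be a single filesystem call:
// $body = file_get_contents('http://example.com/', false, $context);

// The context can be inspected to confirm what will be sent.
$options = stream_context_get_options($context);
echo $options['http']['method'], "\n";
```

The client libraries listed above wrap this kind of plumbing; knowing what sits underneath helps when troubleshooting them.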

Page 30: When RSS Fails: Web Scraping with HTTP

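GET /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org

The first line is the request line. GET is the method or operation, /wiki/Main_Page is the URI address for the desired resource, and HTTP/1.1 is the protocol version in use by the client.

The second line is a header. Host is the header name and en.wikipedia.org is the header value.

More headers follow...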

Let's GET Started
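To make the anatomy of a request concrete, the sketch below assembles one as plain text. The build_request() helper is hypothetical (it is not part of PHP or of any library named in this talk); it only illustrates how the request line and headers fit together.

```php
<?php
// Hypothetical helper: assemble the text of an HTTP/1.1 request.
function build_request(string $method, string $uri, array $headers): string
{
    // Request line: method, URI, and protocol version separated by spaces.
    $lines = ["$method $uri HTTP/1.1"];

    // One "Name: value" line per header.
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }

    // Each line ends with CRLF; an empty line ends the header section.
    return implode("\r\n", $lines) . "\r\n\r\n";
}

$request = build_request('GET', '/wiki/Main_Page', ['Host' => 'en.wikipedia.org']);
echo $request;
```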

Page 31: When RSS Fails: Web Scraping with HTTP

1. Uniquely identifies a resource

2. Indicates how to locate a resource

3. Does both and is thus human-usable.

URI

URL

More info in RFC 3986 Sections 1.1.3 and 1.2.2

URI vs URL

Page 32: When RSS Fails: Web Scraping with HTTP

GET in principle: "Let's do this by the book."

GET in reality: "'Safe operation'? Whatever."

Warning about GET

Page 33: When RSS Fails: Web Scraping with HTTP

URL: http://en.wikipedia.org/w/index.php?title=Query_string&action=edit

Query string: title=Query_string&action=edit

Question mark to separate the resource address and query string

Equal signs to separate parameter names and respective values

Ampersands to separate parameter name-value pairs

Parameters: title, action

Values: Query_string, edit

Query Strings
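PHP can do this splitting for you. A minimal sketch using the built-in parse_url() and parse_str() functions on the Wikipedia URL from this slide:

```php
<?php
$url = 'http://en.wikipedia.org/w/index.php?title=Query_string&action=edit';

// parse_url() breaks a URL into components; 'query' is everything
// after the question mark.
$parts = parse_url($url);
$queryString = $parts['query'];

// parse_str() splits the pairs on ampersands and each pair on its
// equal sign, storing the result in $params.
parse_str($queryString, $params);

print_r($params);
```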

Page 34: When RSS Fails: Web Scraping with HTTP

Parameter    Value
first        this is a field
second       is it clear enough (already)?

Query string: first=this+is+a+field&second=is+it+clear+enough+%28already%29%3F

Also called percent encoding.

parse_str, urlencode, urldecode: Handy PHP URL functions

$_SERVER['QUERY_STRING'] / http_build_query($_GET)

More info on URL encoding in RFC 3986 Section 2.1

URL Encoding
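A short sketch of the PHP functions named on this slide, encoding the two example parameters and decoding them again. The variable names are only illustrative.

```php
<?php
$params = [
    'first'  => 'this is a field',
    'second' => 'is it clear enough (already)?',
];

// http_build_query() URL-encodes every name and value and joins the
// pairs with ampersands.
$queryString = http_build_query($params);
echo $queryString, "\n";

// urlencode() encodes a single value; urldecode() reverses it.
$encoded = urlencode($params['second']);
$decoded = urldecode($encoded);

// parse_str() decodes a whole query string back into an array.
parse_str($queryString, $roundTrip);
```

Spaces become plus signs, while the parentheses and question mark become percent-encoded sequences.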

Page 35: When RSS Fails: Web Scraping with HTTP

Most Common HTTP Operations

1. GET

2. POST

...

POST to /w/index.php results in /new/resource -or- /updated/resource

GET /some/resource HTTP/1.1
Header: Value
...

Request body: none

POST /some/resource HTTP/1.1
Header: Value

request body

POST Requests

Page 36: When RSS Fails: Web Scraping with HTTP

POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
Content-Type: application/x-www-form-urlencoded

wpStarttime=20080719022313&wpEdittime=20080719022100...

Blank line separates request headers and body

Content type for data submitted via HTML form (multipart/form-data for file uploads)

Request body... look familiar?

Note: Most browsers have a query string length limit. Lowest known common denominator: IE7, strlen(entire URL) <= 2,048 bytes. This limit is not standardized. It applies to query strings, but not request bodies.

POST Request Example
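The sketch below builds a form-style POST request as text. The form fields, the example.com host, and the Content-Length header are additions for illustration and do not appear on the slide; Content-Length is included because the server needs some way to know where the body ends.

```php
<?php
// Hypothetical form fields, for illustration only.
$fields = ['name' => 'Jane Doe', 'comment' => 'Hello (again)?'];

// The request body uses the same encoding as a query string.
$body = http_build_query($fields);

$request = implode("\r\n", [
    'POST /some/resource HTTP/1.1',
    'Host: example.com',
    'Content-Type: application/x-www-form-urlencoded',
    'Content-Length: ' . strlen($body),
    '',    // blank line separates request headers and body
    $body,
]);

echo $request, "\n";
```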

Page 37: When RSS Fails: Web Scraping with HTTP

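HEAD /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org

Same as GET with two exceptions:

1. The request line uses HEAD as the method

HTTP/1.1 200 OK
Header: Value

2. No response body

HEAD vs GET: a HEAD response has headers only, a GET response has headers and a body

Sometimes headers are all you want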

HEAD Request

Page 38: When RSS Fails: Web Scraping with HTTP

HTTP/1.0 200 OK
Server: Apache
X-Powered-By: PHP/5.2.5
...

[body]

The first line is the status line. HTTP/1.0 is the lowest protocol version required to process the response, 200 is the response status code, and OK is the response status description.

Same header format as requests, but different headers are used (see RFC 2616 Section 14)

Responses
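A minimal sketch of pulling a raw response apart, assuming a well-formed response with CRLF line endings. The parse_response() helper is hypothetical, and it keeps only the last value when a header repeats, which a real client would need to handle differently.

```php
<?php
// Hypothetical helper: split a raw HTTP response into its parts.
function parse_response(string $raw): array
{
    // A blank line separates the status line and headers from the body.
    [$head, $body] = explode("\r\n\r\n", $raw, 2) + [1 => ''];
    $lines = explode("\r\n", $head);

    // Status line: protocol version, status code, status description.
    [$version, $code, $description] = explode(' ', array_shift($lines), 3);

    $headers = [];
    foreach ($lines as $line) {
        [$name, $value] = explode(':', $line, 2);
        // Header names are case-insensitive, so store them in lowercase.
        $headers[strtolower(trim($name))] = trim($value);
    }

    return [
        'version'     => $version,
        'code'        => (int) $code,
        'description' => $description,
        'headers'     => $headers,
        'body'        => $body,
    ];
}

$raw = "HTTP/1.0 200 OK\r\nServer: Apache\r\nX-Powered-By: PHP/5.2.5\r\n\r\n[body]";
$response = parse_response($raw);
print_r($response);
```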

Page 39: When RSS Fails: Web Scraping with HTTP

1xx Informational: Request received, continuing process.

2xx Success: Request received, understood, and accepted.

3xx Redirection: Client must take additional action to complete the request.

4xx Client Error: Request is malformed or could not be fulfilled.

5xx Server Error: Request was valid, but the server failed to process it.

See RFC 2616 Section 10 for more info.

Response Status Codes
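Since the first digit identifies the class, a scraper can branch on the class rather than on every individual code. A small sketch, with status_class() as a hypothetical helper name:

```php
<?php
// Hypothetical helper: map a status code to its class name.
function status_class(int $code): string
{
    // Integer division by 100 leaves the first digit of a 3-digit code.
    switch (intdiv($code, 100)) {
        case 1:
            return 'Informational';
        case 2:
            return 'Success';
        case 3:
            return 'Redirection';
        case 4:
            return 'Client Error';
        case 5:
            return 'Server Error';
        default:
            throw new InvalidArgumentException("Unknown status code: $code");
    }
}

echo status_class(200), "\n";
echo status_class(404), "\n";
```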

Page 40: When RSS Fails: Web Scraping with HTTP

Set-Cookie and Cookie (see RFC 2109 or RFC 2965 for more info)

Location (watch out for infinite loops!)

Last-Modified and If-Modified-Since, OR ETag and If-None-Match, leading to 304 Not Modified

Headers
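Two of these headers lend themselves to a quick sketch: guarding against redirect loops when following Location, and building a conditional request from a cached response. Both functions are hypothetical. $fetch stands in for whichever HTTP client is actually used and is assumed to return a status code plus headers with lowercase names; the in-memory $pages array only simulates a server.

```php
<?php
// Follow Location headers, but give up on loops or long chains.
function follow_redirects(callable $fetch, string $url, int $maxRedirects = 5): array
{
    $visited = [];
    for ($i = 0; $i <= $maxRedirects; $i++) {
        if (isset($visited[$url])) {
            throw new RuntimeException("Redirect loop detected at $url");
        }
        $visited[$url] = true;

        $response = $fetch($url);
        $isRedirect = $response['code'] >= 300 && $response['code'] < 400
            && isset($response['headers']['location']);
        if (!$isRedirect) {
            return $response;
        }
        // Assumes Location is an absolute URL; relative ones need resolving.
        $url = $response['headers']['location'];
    }
    throw new RuntimeException("Gave up after $maxRedirects redirects");
}

// Build conditional request headers from a cached response's headers.
// A 304 Not Modified reply then means the cached copy is still good.
function conditional_headers(array $cachedHeaders): array
{
    if (isset($cachedHeaders['etag'])) {
        return ['If-None-Match' => $cachedHeaders['etag']];
    }
    if (isset($cachedHeaders['last-modified'])) {
        return ['If-Modified-Since' => $cachedHeaders['last-modified']];
    }
    return [];
}

// Simulated server: /a redirects to /b, which succeeds.
$pages = [
    'http://example.com/a' => ['code' => 301, 'headers' => ['location' => 'http://example.com/b']],
    'http://example.com/b' => ['code' => 200, 'headers' => ['etag' => '"abc123"']],
];
$fetch = function (string $url) use ($pages): array {
    return $pages[$url];
};

$final = follow_redirects($fetch, 'http://example.com/a');
$next  = conditional_headers($final['headers']);
```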

Page 41: When RSS Fails: Web Scraping with HTTP

WWW-Authenticate

Authorization

User-Agent

200 OK / 403 Forbidden

See RFC 2617 for more info.

User-Agent:

Some servers perform user agent sniffing

Some clients perform user agent spoofing

More Headers
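As a sketch of what these request headers look like when set by hand, the snippet below builds an Authorization header for the Basic scheme described in RFC 2617, plus a User-Agent header. The credentials and the user agent string are made up for illustration.

```php
<?php
// Hypothetical credentials and user agent string.
$username = 'user';
$password = 'secret';

$headers = [
    // Basic scheme: the word "Basic" followed by base64("username:password").
    'Authorization' => 'Basic ' . base64_encode("$username:$password"),
    'User-Agent'    => 'ExampleScraper/1.0',
];

foreach ($headers as $name => $value) {
    echo "$name: $value\r\n";
}
```

Basic credentials are only encoded, not encrypted, so anyone who can read the request can recover them.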

Page 42: When RSS Fails: Web Scraping with HTTP

Best Practices

Page 43: When RSS Fails: Web Scraping with HTTP

Simulate User Behavior

Page 44: When RSS Fails: Web Scraping with HTTP

Minimize Requests

Page 45: When RSS Fails: Web Scraping with HTTP

Batch Jobs, Non-Peak Hours

Page 46: When RSS Fails: Web Scraping with HTTP

Questions?

No heckling... OK, maybe just a little.

I generally blog about my experiences with web scraping and PHP at http://ishouldbecoding.com.

</shameless_plug>

Thanks for coming!