Web Scraping with PHP

Web Scraping with PHP. Matthew Turland, php|tek 2009 Unconference, May 21, 2009

Transcript of Web Scraping with PHP

Page 1: Web Scraping with PHP

Web Scraping with PHP

Matthew Turland

php|tek 2009 Unconference

May 21, 2009

Page 2: Web Scraping with PHP

What Is It?

Page 3: Web Scraping with PHP

Normal Web Browsing

Page 4: Web Scraping with PHP

Difference #1: Immediate Audience

Page 5: Web Scraping with PHP

Difference #2: Consumption Method

Page 6: Web Scraping with PHP

Why Is It Useful?

Page 7: Web Scraping with PHP

Data Without Web Services

Page 8: Web Scraping with PHP

Integration Testing

Page 9: Web Scraping with PHP

Crawlers

Page 10: Web Scraping with PHP

With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal.

3.14 The Power of Plain Text, The Pragmatic Programmer

Page 11: Web Scraping with PHP

Disadvantages

Page 12: Web Scraping with PHP

Potential Lack of Stability

Page 13: Web Scraping with PHP

Reverse Engineering Required

Page 14: Web Scraping with PHP

More Requests

Page 15: Web Scraping with PHP

No Nice Neat Data Package

Page 16: Web Scraping with PHP

Step #1: Retrieval

Page 17: Web Scraping with PHP

Speaking the Language

Page 18: Web Scraping with PHP

The Web We Weave

GET / HTTP/1.1
User-Agent: ...

HTTP/1.1 200 OK
Content-Type: ...

Page 19: Web Scraping with PHP

Browsing → Requests

<a href="/index.php?foo=bar">Index</a>
GET /index.php?foo=bar HTTP/1.1

<form method="post" action="/index.php">
    <input name="foo" value="bar" />
</form>
POST /index.php HTTP/1.1

foo=bar

Page 20: Web Scraping with PHP

Responses → Rendered Elements

<img src="/intl/en_ALL/images/logo.gif" />

GET /intl/en_ALL/images/logo.gif HTTP/1.1
Host: google.com

HTTP/1.1 200 OK
Content-Type: image/gif
Content-Length: 8558

Page 21: Web Scraping with PHP

Not As Easy As It Looks

Page 22: Web Scraping with PHP

Redirections
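As a rough illustration of handling redirections with PHP's http stream wrapper, the sketch below caps how many redirects a request will follow. The helper name redirectContext is hypothetical; follow_location and max_redirects are standard HTTP context options.

```php
<?php
// Hypothetical helper: build a stream context that caps redirect following.
function redirectContext($maxRedirects)
{
    return stream_context_create(array(
        'http' => array(
            // 1 = follow Location headers, 0 = stop at the first 3xx response
            'follow_location' => $maxRedirects > 0 ? 1 : 0,
            // Upper bound on redirects to follow
            'max_redirects' => $maxRedirects,
        ),
    ));
}

$context = redirectContext(5);
// Example use (not executed here, as it needs network access):
// $body = file_get_contents('http://www.example.com/some/resource', false, $context);
```

Setting a small cap keeps a scraper from looping forever on a misconfigured site.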

Page 23: Web Scraping with PHP

Referer [sic]

Page 24: Web Scraping with PHP

Cookies
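One way to carry cookies from a response into the next request is sketched below. The function name cookieHeaderFrom is hypothetical, and the sketch ignores cookie attributes such as Path, Domain, and Expires, which a real client would need to honor.

```php
<?php
// Hypothetical helper: turn Set-Cookie response headers into one Cookie request header.
function cookieHeaderFrom(array $responseHeaders)
{
    $cookies = array();
    foreach ($responseHeaders as $line) {
        // Header names are case-insensitive
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        $value = trim(substr($line, strlen('Set-Cookie:')));
        // Keep only "name=value"; attributes follow the first ";"
        $parts = explode(';', $value, 2);
        $nameValue = explode('=', trim($parts[0]), 2);
        if (count($nameValue) === 2 && $nameValue[0] !== '') {
            // A later cookie with the same name replaces the earlier one
            $cookies[$nameValue[0]] = $nameValue[1];
        }
    }
    $pairs = array();
    foreach ($cookies as $name => $cookieValue) {
        $pairs[] = $name . '=' . $cookieValue;
    }
    return $pairs ? 'Cookie: ' . implode('; ', $pairs) : '';
}

$header = cookieHeaderFrom(array(
    'HTTP/1.1 200 OK',
    'Set-Cookie: sid=abc123; Path=/; HttpOnly',
    'set-cookie: lang=en',
));
echo $header, PHP_EOL;
```

The returned string can be passed as a 'header' value in a stream context for the follow-up request.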

Page 25: Web Scraping with PHP

User Agent Sniffing
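Sites that check the Referer or User-Agent header can often be handled by sending those headers explicitly. A minimal sketch using the streams extension follows; the helper name browserLikeContext and the user agent string are made up for illustration.

```php
<?php
// Hypothetical helper: send a chosen User-Agent and a Referer header.
function browserLikeContext($userAgent, $referer)
{
    return stream_context_create(array(
        'http' => array(
            // Sent as the User-Agent request header
            'user_agent' => $userAgent,
            // Extra request headers; "Referer" is the spelling the protocol uses
            'header' => array('Referer: ' . $referer),
        ),
    ));
}

$context = browserLikeContext(
    'Mozilla/5.0 (compatible; ExampleScraper/1.0)', // illustrative value
    'http://www.example.com/'
);
// Example use (not executed here, as it needs network access):
// $html = file_get_contents('http://www.example.com/some/resource', false, $context);
```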

Page 26: Web Scraping with PHP

robots.txt
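A deliberately simplified robots.txt check is sketched below. It only reads Disallow rules from "User-agent: *" groups and uses plain prefix matching, with no support for Allow lines or wildcards. The function name isPathAllowed is hypothetical.

```php
<?php
// Hypothetical helper: true if $path is not disallowed for all agents ("*").
function isPathAllowed($robotsTxt, $path)
{
    $applies = false;      // are we inside a group that covers "*"?
    $lastWasAgent = false; // consecutive User-agent lines share one group
    $disallowed = array();
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Strip comments and surrounding whitespace
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            $applies = ($lastWasAgent && $applies) || $value === '*';
            $lastWasAgent = true;
            continue;
        }
        $lastWasAgent = false;
        // An empty Disallow value means nothing is disallowed
        if ($field === 'disallow' && $applies && $value !== '') {
            $disallowed[] = $value;
        }
    }
    foreach ($disallowed as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

$robots = "User-agent: *\nDisallow: /private/ # keep out\n\nUser-agent: BadBot\nDisallow: /\n";
var_dump(isPathAllowed($robots, '/index.php'));
var_dump(isPathAllowed($robots, '/private/data'));
```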

Page 27: Web Scraping with PHP

Caching
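HTTP caching can be respected by sending conditional request headers built from a previously stored response. The sketch below assumes the cached ETag and Last-Modified values are kept in an array with the hypothetical keys 'etag' and 'last_modified'; both function names are also hypothetical.

```php
<?php
// Hypothetical helper: build conditional request headers from cached validators.
function conditionalHeaders(array $cached)
{
    $headers = array();
    if (!empty($cached['etag'])) {
        $headers[] = 'If-None-Match: ' . $cached['etag'];
    }
    if (!empty($cached['last_modified'])) {
        $headers[] = 'If-Modified-Since: ' . $cached['last_modified'];
    }
    return $headers;
}

// Hypothetical helper: true if the status line reports 304 Not Modified.
function isNotModified($statusLine)
{
    return (bool) preg_match('#^HTTP/\d+(\.\d+)?\s+304\b#', $statusLine);
}

$headers = conditionalHeaders(array(
    'etag' => '"abc"',
    'last_modified' => 'Thu, 21 May 2009 12:00:00 GMT',
));
print_r($headers);
```

When the server answers 304, the cached copy can be reused and no body needs to be downloaded or parsed again.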

Page 28: Web Scraping with PHP

HTTP Authentication
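For HTTP Basic authentication, the request carries an Authorization header containing the base64-encoded "user:password" pair. A small sketch follows; the function name basicAuthHeader is hypothetical and the credentials are example values.

```php
<?php
// Hypothetical helper: build the Authorization header for HTTP Basic authentication.
function basicAuthHeader($user, $password)
{
    return 'Authorization: Basic ' . base64_encode($user . ':' . $password);
}

$context = stream_context_create(array(
    'http' => array(
        'header' => basicAuthHeader('Aladdin', 'open sesame'),
    ),
));
// Example use (not executed here, as it needs network access):
// $body = file_get_contents('http://www.example.com/some/resource', false, $context);
```

Basic credentials are only encoded, not encrypted, so this is best used over HTTPS.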

Page 29: Web Scraping with PHP

PHP: Glue for the Web

Page 31: Web Scraping with PHP

Simple Streams Example

$uri = 'http://www.example.com/some/resource';
// GET request
$get = file_get_contents($uri);
// POST request with a URL-encoded form body
$context = stream_context_create(array(
    'http' => array(
        'method' => 'POST',
        'header' => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query(array(
            'var1' => 'value1',
            'var2' => 'value2'
        ))
    )
));
$post = file_get_contents($uri, false, $context);

Page 32: Web Scraping with PHP

pecl_http Example

$http = new HttpRequest($uri);
$http->enableCookies();
$http->setMethod(HTTP_METH_POST);
$http->addPostFields(array('var1' => 'value1'));
$http->setOptions(array(
    'useragent' => 'PHP ' . phpversion(),
    'referer' => 'http://example.com/some/referer'
));
$response = $http->send();
$headers = $response->getHeaders();
$body = $response->getBody();

Page 33: Web Scraping with PHP

pecl_http Request Pooling

$pool = new HttpRequestPool;
foreach ($urls as $url) {
    $request = new HttpRequest($url, HTTP_METH_GET);
    $pool->attach($request);
}
$pool->send();
foreach ($pool as $request) {
    echo $request->getUrl(), PHP_EOL;
    echo $request->getResponseBody(), PHP_EOL;
}

Page 34: Web Scraping with PHP

HTTP Resources

➔ RFC 2616 HyperText Transfer Protocol

➔ RFC 3986 Uniform Resource Identifiers

➔ "HTTP: The Definitive Guide" (ISBN 1565925092)

➔ "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)

➔ "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett

➔ Ben Ramsey's blog series on HTTP

Page 35: Web Scraping with PHP

Step #2: Analysis

Page 36: Web Scraping with PHP

Tidy Extension

$config = array('output-xhtml' => true);
// Parse markup from a string ...
$tidy = tidy_parse_string($markupString, $config);
// ... or from a file
$tidy = tidy_parse_file($markupFilePath, $config);
$output = tidy_get_output($tidy);

Page 37: Web Scraping with PHP

DOM Extension

$doc = new DOMDocument;
// Load markup from a string ...
$doc->loadHTML($htmlString);
// ... or from a file
$doc->loadHTMLFile($htmlFilePath);
// Select elements by tag name ...
$listItems = $doc->getElementsByTagName('li');
// ... or with an XPath query
$xpath = new DOMXPath($doc);
$listItems = $xpath->query('//ul/li');
foreach ($listItems as $listItem) {
    echo $listItem->nodeValue, PHP_EOL;
}

Page 38: Web Scraping with PHP

SimpleXML Extension

// Load markup from a string ...
$sxe = new SimpleXMLElement($markupString);
// ... or from a file (third argument true: first argument is a path or URL)
$sxe = new SimpleXMLElement($filePath, 0, true);
// Access child elements
echo $sxe->body->ul->li[0], PHP_EOL;
$children = $sxe->body->ul->li;
$children = $sxe->body->ul->children();
foreach ($children as $li) {
    echo $li, PHP_EOL;
}
// Access attributes
echo $sxe->body->ul['id'];
$attributes = $sxe->body->ul->attributes();
foreach ($attributes as $name => $value) {
    echo $name, '=', $value, PHP_EOL;
}

Page 39: Web Scraping with PHP

XMLReader Extension

// Read from a string ...
$doc = XMLReader::xml($xmlString);
// ... or from a file
$doc = XMLReader::open($filePath);
// Walk the document node by node and inspect each element
while ($doc->read()) {
    if ($doc->nodeType == XMLReader::ELEMENT) {
        var_dump($doc->localName);
        var_dump($doc->hasValue);
        var_dump($doc->value);
        var_dump($doc->hasAttributes);
        var_dump($doc->getAttribute('id'));
    }
}

Page 40: Web Scraping with PHP

CSS Selector Libraries

➔ phpQuery

➔ Simple HTML DOM Parser

➔ Zend_Dom_Query

$doc1 = phpQuery::newDocumentFile($markupFilePath);
$doc2 = phpQuery::newDocument($markupString);
$listItems = pq('ul > li'); // uses $doc2
$listItems = pq('ul > li', $doc1);

Page 41: Web Scraping with PHP

PCRE Extension
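The slide gives no example for the PCRE extension, so the following is only a sketch of how preg_match_all might pull link targets and link text out of markup. The function name extractLinks and the pattern are illustrative; regular expressions are fragile on real-world HTML and tend to suit small, well-understood fragments.

```php
<?php
// Hypothetical helper: extract href and link text from anchor tags with PCRE.
function extractLinks($html)
{
    // (["\']) captures the opening quote; \1 requires the same closing quote
    $pattern = '#<a\s[^>]*href\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)</a>#is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    $links = array();
    foreach ($matches as $match) {
        $links[] = array(
            'href' => $match[2],
            // Drop any nested tags inside the link text
            'text' => trim(strip_tags($match[3])),
        );
    }
    return $links;
}

$html = '<ul><li><a href="/index.php?foo=bar">Index</a></li>'
      . '<li><a class="x" href=\'/about\'><b>About</b></a></li></ul>';
print_r(extractLinks($html));
```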

Page 42: Web Scraping with PHP

Best Practices

Page 43: Web Scraping with PHP

Approximate Human Behavior

Page 44: Web Scraping with PHP

Minimize Requests

Page 45: Web Scraping with PHP

Batch Jobs, Non-Peak Hours

Page 46: Web Scraping with PHP

Account for Unavailability
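One common way to account for a target site being temporarily unavailable is to retry with an increasing delay. The sketch below is an assumption about how that could look, not something taken from the talk. The function name fetchWithRetry and its parameters are hypothetical, and the fetch callable is expected to return false on failure, as file_get_contents does.

```php
<?php
// Hypothetical helper: call $fetch until it succeeds or attempts run out.
function fetchWithRetry($fetch, $maxAttempts = 3, $baseDelayMs = 500)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = call_user_func($fetch);
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            // Exponential backoff: wait 1x, 2x, 4x ... the base delay
            usleep($baseDelayMs * 1000 * (1 << ($attempt - 1)));
        }
    }
    return false;
}

// Example use (not executed here, as it needs network access):
// $uri = 'http://www.example.com/some/resource';
// $body = fetchWithRetry(function () use ($uri) {
//     return @file_get_contents($uri);
// });
```

Passing the fetch step as a callable keeps the retry logic independent of whichever HTTP client is in use.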

Page 47: Web Scraping with PHP

Aim for Parallelism

Page 48: Web Scraping with PHP

Validate Data
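Scraped values are worth checking before they are stored, since a layout change on the target site can silently produce garbage. A minimal sketch follows; the record fields (title, url, price) and the function name validateRecord are hypothetical.

```php
<?php
// Hypothetical helper: return a list of problems found in one scraped record.
function validateRecord(array $record)
{
    $errors = array();
    if (!isset($record['title']) || trim($record['title']) === '') {
        $errors[] = 'title is empty';
    }
    if (!isset($record['url']) || filter_var($record['url'], FILTER_VALIDATE_URL) === false) {
        $errors[] = 'url is not a valid URL';
    }
    // Expect digits with an optional two-digit fraction, e.g. "19.99"
    if (!isset($record['price']) || !preg_match('/^\d+(\.\d{2})?$/', $record['price'])) {
        $errors[] = 'price is not a decimal amount';
    }
    return $errors;
}

$errors = validateRecord(array(
    'title' => 'Example item',
    'url' => 'http://www.example.com/item/1',
    'price' => '19.99',
));
var_dump($errors);
```

An empty array means the record passed every check; anything else can be logged so that parsing problems surface early.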

Page 49: Web Scraping with PHP

Test, Test, Test!

Page 50: Web Scraping with PHP

Questions

Page 51: Web Scraping with PHP

Please leave a comment!

http://joind.in/event/view/41

Page 52: Web Scraping with PHP

And ping me online!

Matthew Turland

Senior Consultant, Blue Parabola LLC

[email protected]

http://blueparabola.com

[email protected]

http://ishouldbecoding.com

@elazar