Web Scraping with PHP

Post on 11-Jun-2015


Web Scraping with PHP

Matthew Turland

php|tek 2009 Unconference

May 21, 2009

What Is It?

Normal Web Browsing

Difference #1: Immediate Audience

Difference #2: Consumption Method

Why Is It Useful?

Data Without Web Services

Integration Testing

Crawlers

With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal.

3.14 The Power of Plain Text, The Pragmatic Programmer

Disadvantages

Potential Lack of Stability

Reverse Engineering Required

More Requests

No Nice Neat Data Package

Step #1: Retrieval

Speaking the Language

The Web We Weave

GET / HTTP/1.1
User-Agent: ...

HTTP/1.1 200 OK
Content-Type: ...

GET /index.php?foo=bar HTTP/1.1

<a href="/index.php?foo=bar">Index</a>

<form method="post" action="/index.php">
  <input name="foo" value="bar" />
</form>

POST /index.php HTTP/1.1

foo=bar
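The link and the form above differ only in where the parameter travels: in the query string for GET, in the body for POST. A minimal sketch of that mapping follows; the helper name `buildRequest()` and the host `www.example.com` are assumptions made for illustration and do not come from the talk.

```php
<?php
// Hypothetical helper (not from the talk): builds the raw text of an
// HTTP/1.1 request so the mapping from links and forms is visible.
function buildRequest(string $method, string $path, array $params = [], string $host = 'www.example.com'): string
{
    $query = http_build_query($params); // e.g. "foo=bar"
    $lines = [];
    $body = '';

    if ($method === 'GET') {
        // Link parameters and GET form fields travel in the query string.
        $target = $query === '' ? $path : $path . '?' . $query;
        $lines[] = "GET $target HTTP/1.1";
        $lines[] = "Host: $host";
    } else {
        // POST form fields travel URL-encoded in the request body.
        $body = $query;
        $lines[] = "$method $path HTTP/1.1";
        $lines[] = "Host: $host";
        $lines[] = 'Content-Type: application/x-www-form-urlencoded';
        $lines[] = 'Content-Length: ' . strlen($body);
    }

    // A blank line separates the headers from the (possibly empty) body.
    return implode("\r\n", $lines) . "\r\n\r\n" . $body;
}

echo buildRequest('GET', '/index.php', ['foo' => 'bar']), PHP_EOL;
echo buildRequest('POST', '/index.php', ['foo' => 'bar']), PHP_EOL;
```

Printing both requests side by side makes it easier to compare them with what a browser or proxy log shows for the same link and form.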

Browsing → Requests

HTTP/1.1 200 OK
Content-Type: image/gif
Content-Length: 8558

Responses → Rendered Elements

<img src="/intl/en_ALL/images/logo.gif" />

GET /intl/en_ALL/images/logo.gif HTTP/1.1
Host: google.com

Not As Easy As It Looks

Redirections

Referer [sic]

Cookies

User Agent Sniffing

robots.txt

Caching

HTTP Authentication
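Several of the complications above (redirections, the Referer header, cookies, user agent sniffing, HTTP authentication) can be handled through HTTP stream context options. The sketch below only assembles those options; the helper name `scraperContextOptions()`, the user agent string, the credentials, and the cookie values are all assumptions made for illustration. robots.txt handling and caching are not covered.

```php
<?php
// Hypothetical helper (not from the talk): collects HTTP stream context
// options that address several of the complications listed above.
function scraperContextOptions(string $referer, array $cookies, string $user, string $pass): array
{
    // Cookies are sent as a single "name=value; name2=value2" header.
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . rawurlencode($value);
    }

    $headers = [
        'Referer: ' . $referer,
        // HTTP Basic authentication: base64 of "user:password".
        'Authorization: Basic ' . base64_encode($user . ':' . $pass),
    ];
    if ($pairs) {
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }

    return [
        'http' => [
            'method' => 'GET',
            'user_agent' => 'ExampleScraper/0.1 (PHP ' . PHP_VERSION . ')', // assumed name
            'follow_location' => 1, // follow redirections...
            'max_redirects' => 5,   // ...but not indefinitely
            'timeout' => 10,        // seconds
            'header' => $headers,
        ],
    ];
}

$options = scraperContextOptions('http://www.example.com/', ['session' => 'abc123'], 'user', 'secret');
$context = stream_context_create($options);

// A real request would then look like this (left commented out here):
// $body = file_get_contents('http://www.example.com/some/resource', false, $context);
print_r(stream_context_get_options($context));
```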

PHP: Glue for the Web

Simple Streams Example

$uri = 'http://www.example.com/some/resource';
$get = file_get_contents($uri);

$context = stream_context_create(array(
    'http' => array(
        'method' => 'POST',
        'header' => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query(array(
            'var1' => 'value1',
            'var2' => 'value2'
        ))
    )
));
$post = file_get_contents($uri, false, $context);

pecl_http Example

$http = new HttpRequest($uri);
$http->enableCookies();
$http->setMethod(HTTP_METH_POST);
$http->addPostFields(array('var1' => 'value1'));
$http->setOptions(array(
    'useragent' => 'PHP ' . phpversion(),
    'referer' => 'http://example.com/some/referer'
));
$response = $http->send();
$headers = $response->getHeaders();
$body = $response->getBody();

pecl_http Request Pooling

$pool = new HttpRequestPool;
foreach ($urls as $url) {
    $request = new HttpRequest($url, HTTP_METH_GET);
    $pool->attach($request);
}
$pool->send();
foreach ($pool as $request) {
    echo $request->getUrl(), PHP_EOL;
    echo $request->getResponseBody(), PHP_EOL;
}

HTTP Resources

➔ RFC 2616 HyperText Transfer Protocol

➔ RFC 3986 Uniform Resource Identifiers

➔ "HTTP: The Definitive Guide" (ISBN 1565925092)

➔ "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)

➔ "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett

➔ Ben Ramsey's blog series on HTTP

Step #2: Analysis

Tidy Extension

$config = array('output-xhtml' => true);
// Parse markup from a string...
$tidy = tidy_parse_string($markupString, $config);
// ...or from a file
$tidy = tidy_parse_file($markupFilePath, $config);
$output = tidy_get_output($tidy);

DOM Extension

$doc = new DOMDocument;
// Load markup from a string...
$doc->loadHTML($htmlString);
// ...or from a file
$doc->loadHTMLFile($htmlFilePath);

// Select elements by tag name...
$listItems = $doc->getElementsByTagName('li');
// ...or with an XPath expression
$xpath = new DOMXPath($doc);
$listItems = $xpath->query('//ul/li');

foreach ($listItems as $listItem) {
    echo $listItem->nodeValue, PHP_EOL;
}
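Markup found in the wild is rarely valid, and `loadHTML()` reports each problem as a PHP warning unless libxml errors are collected internally. The following is a self-contained sketch that extracts link text and targets from deliberately sloppy markup; the sample HTML is invented for illustration.

```php
<?php
// Sample markup, invented for illustration; note the unclosed <li> tags.
$html = '<html><body><ul id="nav">'
      . '<li><a href="/index.php?foo=bar">Index</a>'
      . '<li><a href="/about.php">About</a>'
      . '</ul></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Map each link's text to its href, for every anchor inside the nav list.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//ul[@id="nav"]//a') as $anchor) {
    $links[trim($anchor->nodeValue)] = $anchor->getAttribute('href');
}
print_r($links);
```

The descendant axis (`//a`) is used rather than a fixed `li/a` path so the query does not depend on exactly how the parser repairs the list.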

SimpleXML Extension

// Load markup from a string...
$sxe = new SimpleXMLElement($markupString);
// ...or from a file (third argument: the first is a path or URL)
$sxe = new SimpleXMLElement($filePath, 0, true);

// Access child elements
echo $sxe->body->ul->li[0], PHP_EOL;
$children = $sxe->body->ul->li;
$children = $sxe->body->ul->children();
foreach ($children as $li) {
    echo $li, PHP_EOL;
}

// Access attributes
echo $sxe->body->ul['id'];
$attributes = $sxe->body->ul->attributes();
foreach ($attributes as $name => $value) {
    echo $name, '=', $value, PHP_EOL;
}

XMLReader Extension

// Read from a string...
$doc = XMLReader::xml($xmlString);
// ...or from a file
$doc = XMLReader::open($filePath);

while ($doc->read()) {
    if ($doc->nodeType == XMLReader::ELEMENT) {
        var_dump($doc->localName);
        var_dump($doc->hasValue);
        var_dump($doc->value);
        var_dump($doc->hasAttributes);
        var_dump($doc->getAttribute('id'));
    }
}

CSS Selector Libraries

➔ phpQuery

➔ Simple HTML DOM Parser

➔ Zend_Dom_Query

$doc1 = phpQuery::newDocumentFile($markupFilePath);
$doc2 = phpQuery::newDocument($markupString);
$listItems = pq('ul > li'); // uses $doc2
$listItems = pq('ul > li', $doc1);

PCRE Extension
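The transcript preserves no code for this slide. As a rough sketch of where PCRE tends to fit, the example below pulls small, regular fragments (here, prices) out of a string with `preg_match_all()`; the sample markup and the pattern are invented for illustration.

```php
<?php
// Sample markup, invented for illustration.
$html = '<ul><li>Price: $19.99</li><li>Price: $5.00</li></ul>';

// Capture the numeric part of each price.
// PREG_SET_ORDER returns one array per match, with groups inside it.
preg_match_all('/Price:\s*\$(\d+\.\d{2})/', $html, $matches, PREG_SET_ORDER);

$prices = [];
foreach ($matches as $match) {
    $prices[] = (float) $match[1]; // $match[1] is the first capture group
}
print_r($prices);
```

Patterns like this are best kept for short, predictable fragments; nested markup is usually easier to handle with the DOM-based extensions above.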

Best Practices

Approximate Human Behavior

Minimize Requests

Batch Jobs, Non-Peak Hours

Account for Unavailability

Aim for Parallelism

Validate Data

Test, Test, Test!
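"Account for Unavailability" can be sketched as a retry loop that pauses longer after each failure. The helper name `fetchWithRetry()`, its defaults, and the flaky stand-in fetcher are assumptions made for illustration; a real scraper would pass a closure wrapping something like `file_get_contents($uri)`.

```php
<?php
// Hypothetical helper (not from the talk): calls $fetch until it returns
// something other than false, waiting longer after each failed attempt.
function fetchWithRetry(callable $fetch, int $maxAttempts = 3, int $baseDelayMs = 500)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch();
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            // Exponential backoff: base, 2x base, 4x base, ...
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
    return false; // every attempt failed
}

// Stand-in for a real fetch: fails twice, then succeeds.
$calls = 0;
$flaky = function () use (&$calls) {
    $calls++;
    return $calls < 3 ? false : '<html>ok</html>';
};

$body = fetchWithRetry($flaky, 5, 1);
echo $calls, ' attempts, body: ', $body, PHP_EOL;
```

Passing the fetcher as a callable also helps with "Test, Test, Test!": the retry logic can be exercised without touching the network.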

Questions

Please leave a comment!

http://joind.in/event/view/41

And ping me online!

Matthew Turland

Senior Consultant, Blue Parabola LLC

matthew@blueparabola.com

http://blueparabola.com

matt@ishouldbecoding.com

http://ishouldbecoding.com

@elazar