Transcript of Web Scraping with PHP
- 1. Web Scraping with PHP Matthew Turland September 16,
2008
- 2. Everyone acquainted?
- Lead Programmer for surgiSYS, LLC
- Blog: http://ishouldbecoding.com
- 3. What is Web Scraping? 2 Stage Process Stage 1 : Retrieval
GET /some/resource ... HTTP/1.1 200 OK ... Resource with data you want
Stage 2 : Analysis Raw resource Usable data
- 4. How is it different from... Data Mining Focus in data mining
Focus in web scraping Consuming Web Services Web service data
formats Web scraping data formats
- 5. Potential Applications What Data source When Web service is
unavailable or data access is one-time only. Crawlers and indexers
Remote data search offers no capabilities for search or data
source integration. Integration testing Applications must be tested
by simulating client behavior and ensuring responses are consistent
with requirements.
- 6. Disadvantages == vs
- 7. Legal Concerns TOS TOU EUA Original source Illegal syndicate
IANAL!
- 8. Retrieval GET /some/resource ... sizeof( ) == sizeof( ) if( )
require ;
- 9. The Low-Down on HTTP
- 10. Know enough HTTP to... Use one like this: To do this:
- 11. Know enough HTTP to... PEAR::HTTP_Client pecl_http
Zend_Http_Client Learn to use and troubleshoot one like this: Or
roll your own! cURL Filesystem+Streams
- 12.
- GET /wiki/Main_Page HTTP/1.1
Let's GET Started method or operation URI address for
the desired resource protocol version in use by the client header name
header value request line header more headers follow...
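As a rough illustration of the labels on this slide, the sketch below splits a raw request into its request line (method, URI, protocol version) and its headers. The function name `parse_raw_request` is hypothetical, and the `Host` header is an assumption added so the request is valid HTTP/1.1; neither appears on the slide.

```php
<?php
// Hypothetical helper: split a raw HTTP request into its labeled parts.
function parse_raw_request($raw)
{
    // Headers end at the first blank line; anything after it is the body.
    $parts = explode("\r\n\r\n", $raw, 2);
    $lines = explode("\r\n", $parts[0]);

    // Request line: method, URI, protocol version.
    list($method, $uri, $protocol) = explode(' ', array_shift($lines), 3);

    $headers = array();
    foreach ($lines as $line) {
        // Each header line has the form "Name: value".
        list($name, $value) = explode(':', $line, 2);
        $headers[trim($name)] = trim($value);
    }

    return array(
        'method'   => $method,
        'uri'      => $uri,
        'protocol' => $protocol,
        'headers'  => $headers,
        'body'     => isset($parts[1]) ? $parts[1] : '',
    );
}

// The Host header is an assumption; the slide only shows the request line.
$request = parse_raw_request(
    "GET /wiki/Main_Page HTTP/1.1\r\n" .
    "Host: en.wikipedia.org\r\n" .
    "\r\n"
);
```

Here `$request['method']`, `$request['uri']`, and `$request['protocol']` hold the three pieces of the request line, and `$request['headers']` maps header names to values.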
- 13. Warning about GET In principle: "Let's do this by the
book." GET In reality: "' Safe operation '? Whatever." GET
- 14. URI vs URL 1. Uniquely identifies a resource 2. Indicates
how to locate a resource 3. Does both and is thus human-usable. URI
URL More info in RFC 3986 Sections 1.1.3 and 1.2.2
- 15. Query Strings http://en.wikipedia.org/w/index.php?
title=Query_string&action=edit URL Query String Question mark to
separate the resource address and query string Equal signs to
separate parameter names and respective values Ampersands to
separate parameter name-value pairs. Parameter Value
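A minimal sketch of taking the slide's URL apart with PHP's built-in `parse_url()` and `parse_str()`. The variable names are illustrative only.

```php
<?php
$url = 'http://en.wikipedia.org/w/index.php?title=Query_string&action=edit';

// Everything after the question mark is the query string.
$query = parse_url($url, PHP_URL_QUERY);

// Split on ampersands and equal signs into parameter => value pairs.
parse_str($query, $params);
```

After this runs, `$params` is an associative array keyed by parameter name.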
- 16. URL Encoding Parameter Value first second this is a field
was it clear enough (already)? Query String
first=this+is+a+field&second=was+it+clear+enough+%28already%29%3F Also
called percent encoding. urlencode and urldecode: Handy PHP URL
functions $_SERVER['QUERY_STRING'] / http_build_query($_GET)
More info on URL encoding in RFC 3986 Section 2.1
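The encoding described on this slide can be tried out with a short sketch like the following, which feeds the two parameter values from the slide's table through `http_build_query()` and round-trips one of them through `urlencode()` and `urldecode()`. The explicit `'&'` separator argument is a precaution against a non-default `arg_separator.output` setting.

```php
<?php
$fields = array(
    'first'  => 'this is a field',
    'second' => 'was it clear enough (already)?',
);

// Encode every name and value, and join the pairs with ampersands.
$queryString = http_build_query($fields, '', '&');

// Encode and decode a single value.
$encoded = urlencode($fields['second']);
$decoded = urldecode($encoded);
```

Spaces become plus signs, and reserved characters such as parentheses and the question mark become percent-encoded sequences.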
- 17. POST Requests Most Common HTTP Operations 1. GET 2. POST
... /w/index.php POST /new/resource -or- /updated/resource GET
/some/resource HTTP/1.1 Header: Value ... POST /some/resource
HTTP/1.1 Header: Value request body none
- 18. POST Request Example
- POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
- Content-Type: application/x-www-form-urlencoded
-
wpStarttime=20080719022313&wpEdittime=20080719022100...
Blank line separates request headers and body Content type for data
submitted via HTML form (multipart/form-data for file uploads)
Request body ... look familiar? Note: Most browsers have a query
string length limit. Lowest known common denominator: IE7
strlen(entire URL) array( 'method' => 'POST', 'header' =>
'Content-Type: ' . 'application/x-www-form-urlencoded', 'content'
=> http_build_query(array( 'var1' => 'value1', 'var2' =>
'value2' )) ))); // Last 2 parameters here also apply to fopen()
$post = file_get_contents($uri, false, $context);
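The fragment above is cut off in this transcript. A minimal sketch of a complete streams-based POST follows, on the assumption that the missing portion was a `stream_context_create()` call wrapping the options in an `'http'` group (the `$context` variable and the `method`/`header`/`content` keys suggest as much). The target URI in the comments is hypothetical.

```php
<?php
// Assumption: the options shown on the slide belong to the 'http' wrapper
// group of a stream context.
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query(array(
            'var1' => 'value1',
            'var2' => 'value2',
        ), '', '&'),
    ),
));

// Inspect what the HTTP wrapper would send, without touching the network.
$options = stream_context_get_options($context);

// Hypothetical target; uncomment to actually send the request.
// $uri  = 'http://example.com/form.php';
// $post = file_get_contents($uri, false, $context);
```

Keeping the context separate from the `file_get_contents()` call makes it possible to check the request settings before anything is sent.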
- 26. Streams Resources
- Language Reference > Context options and parameters
- Appendices > List of Supported Protocols/Wrappers
- php|architect's Definitive Guide to PHP Streams (ETA late 2008
/ early 2009)
- 27. pecl_http Examples
$http = new HttpRequest($uri);
$http->enableCookies();
$http->setMethod(HTTP_METH_POST); // or HTTP_METH_GET
$http->addPostFields($postData);
$http->setOptions(array(
  'httpauth' => $username . ':' . $password,
  'httpauthtype' => HTTP_AUTH_BASIC,
  'useragent' => 'PHP ' . phpversion(),
  'referer' => 'http://example.com/some/referer',
  'range' => array(array(1, 5), array(10, 15))
));
$response = $http->send();
$headers = $response->getHeaders();
$body = $response->getBody();
See PHP Manual for more info.
- 28. PEAR::HTTP_Client Examples
$cookiejar = new HTTP_Client_CookieManager();
$request = new HTTP_Request($uri);
$request->setMethod(HTTP_REQUEST_METHOD_POST);
$request->setBasicAuth($username, $password);
$request->addHeader('User-Agent', $userAgent);
$request->addHeader('Referer', $referrer);
$request->addHeader('Range', 'bytes=2-3,5-6');
foreach ($postData as $key => $value) {
  $request->addPostData($key, $value);
}
$request->sendRequest();
$cookiejar->updateCookies($request);
$request = new HTTP_Request($otheruri);
$cookiejar->passCookies($request);
$response = $request->sendRequest();
$headers = $request->getResponseHeader();
$body = $request->getResponseBody();
See PEAR Manual and API Docs for more info.
- 29. Zend_Http_Client Examples
$client = new Zend_Http_Client($uri);
$client->setMethod(Zend_Http_Client::POST);
$client->setAuth($username, $password);
$client->setHeaders('User-Agent', $userAgent);
$client->setHeaders(array(
  'Referer' => $referrer,
  'Range' => 'bytes=2-3,5-6'
));
$client->setParameterPost($postData);
$client->setCookieJar();
$client->request();
$client->setUri($otheruri);
$client->setMethod(Zend_Http_Client::GET);
$response = $client->request();
$headers = $response->getHeaders();
$body = $response->getBody();
See ZF Manual for more info.
- 30. cURL Examples Fatal error: Allowed memory size of n00b
bytes exhausted (tried to allocate 1337 bytes) in /this/slide.php on
line 1 See PHP Manual, Context Options, or my php|architect article
for more info. Just kidding. Really, the equivalent cURL code for
the previous examples is so verbose that it won't fit on one slide
and I don't think it's deserving of multiple slides.
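For readers who want to see it anyway, here is one possible sketch of a roughly equivalent cURL request. It is not from the talk: the helper names `build_curl_options` and `curl_post` are hypothetical, the option values mirror the earlier examples, and it assumes the curl extension is loaded.

```php
<?php
// Hypothetical helper: collect cURL options that mirror the earlier examples.
function build_curl_options($uri, array $postData, $username, $password)
{
    return array(
        CURLOPT_URL            => $uri,
        CURLOPT_RETURNTRANSFER => true,  // return the response instead of printing it
        CURLOPT_HEADER         => true,  // include response headers in the result
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($postData, '', '&'),
        CURLOPT_HTTPAUTH       => CURLAUTH_BASIC,
        CURLOPT_USERPWD        => $username . ':' . $password,
        CURLOPT_USERAGENT      => 'PHP ' . phpversion(),
        CURLOPT_REFERER        => 'http://example.com/some/referer',
        CURLOPT_RANGE          => '2-3,5-6',
        CURLOPT_COOKIEFILE     => '',    // empty string turns on in-memory cookie handling
    );
}

// Hypothetical helper: send the request and split headers from body.
function curl_post($uri, array $postData, $username, $password)
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($uri, $postData, $username, $password));

    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException(curl_error($ch));
    }

    // The header block occupies the first CURLINFO_HEADER_SIZE bytes.
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    return array(
        'headers' => substr($response, 0, $headerSize),
        'body'    => substr($response, $headerSize),
    );
}
```

Separating option building from execution keeps the verbose part in one place and lets the options be inspected without making a request.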
- 31. HTTP Resources
- RFC 2616 HyperText Transfer Protocol
- RFC 3986 Uniform Resource Identifiers
- "HTTP: The Definitive Guide" (ISBN 1565925092)
- "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN
1565928628)
- "HTTP Developer's Handbook" (ISBN 0672324547) by Chris
Shiflett
- Ben Ramsey's blog series on HTTP
- 32. Analysis Raw resource Usable data DOM XMLReader SimpleXML
XSL tidy PCRE String functions JSON ctype XML Parser
- 33. Cleanup
- tidy is good for correcting markup malformations.
- String functions and PCRE can be used for manual cleanup prior
to using a parsing extension.
- DOM is generally forgiving when parsing malformed markup. It
generates warnings that can be suppressed.
- Save a static copy of your target, use a validator on the input
(ex: W3C Markup Validator), fix validation errors manually, and
write code to automatically apply fixes.
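The point about DOM being forgiving can be sketched as follows. The markup is a made-up example with unclosed list items, and `libxml_use_internal_errors()` is one common way to keep parser warnings from being emitted.

```php
<?php
// Deliberately malformed sample markup: neither <li> is closed.
$html = '<html><body><ul><li>One<li>Two</ul></body></html>';

// Collect any parser warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser repairs the structure, so both items are reachable.
$items = array();
foreach ($doc->getElementsByTagName('li') as $li) {
    $items[] = trim($li->textContent);
}
```

If the warnings are of interest, `libxml_get_errors()` can be read before clearing them.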
- 34. Parsing
- DOM and SimpleXML are tree-based parsers that store the entire
document in memory to provide full access.
- XMLReader is a pull-based parser that iterates over nodes in
the document and is less memory-intensive.
- SAX is also stream-based, but it is push-based: the parser invokes event callbacks as it reads.
- JSON can be used to parse isolated JavaScript data.
- Nothing "official" for CSS. Find something like CSSTidy.
- PCRE can be used for parsing. Last resort, though.
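A small sketch contrasting the approaches listed above, using a made-up XML document and JSON literal. SimpleXML loads the whole tree, XMLReader walks it node by node, and `json_decode()` handles isolated JavaScript data.

```php
<?php
// Hypothetical sample document.
$xml = '<items><item id="1">Foo</item><item id="2">Bar</item></items>';

// Tree-based: SimpleXML loads the whole document, then allows random access.
$tree = simplexml_load_string($xml);
$second = (string) $tree->item[1];

// Pull-based: XMLReader advances one node at a time.
$reader = new XMLReader();
$reader->XML($xml);
$ids = array();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        $ids[] = $reader->getAttribute('id');
    }
}
$reader->close();

// JSON: decode an isolated JavaScript data literal into a PHP array.
$data = json_decode('{"title": "Main Page", "views": 42}', true);
```

For a document this small the difference is negligible; the memory advantage of the pull-based approach only matters for large inputs.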
- 35. Validation
- Make as few assumptions (and as many assertions) about the
target as possible.
- Validation provides additional sanity checks for your
application.
- PCRE can be used to form pattern-based assertions about
extracted data.
- ctype can be used to form primitive type-based assertions.
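As a minimal sketch of the PCRE and ctype points above, the following checks two made-up extracted values. The price format and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical values extracted from a target page.
$price = '$19.99';
$quantity = '3';

// Pattern-based assertion: a dollar sign, digits, and exactly two decimals.
$priceOk = preg_match('/^\$\d+\.\d{2}$/', $price) === 1;

// Primitive type-based assertion: the string contains only digits.
$quantityOk = ctype_digit($quantity);

// A value that has drifted from the expected format fails the check.
$badQuantityOk = ctype_digit('3 items');
```

Checks like these are cheap enough to run on every extracted field before the data is used.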
- 36. Transformation
- XSL can be used to extract data from an XML-compatible document
and retrofit it to a format defined by an XSL template.
- To my knowledge, this capability is unfortunately unique to
XML-compatible data.
- Use components like template engines to separate formatting of
data from retrieval/analysis logic.
- 37. Abstraction
- Remain in keeping with the DRY principle.
- Develop components that can be reused across projects.
Ex: DomQuery, Zend_Dom.
- Make an effort to minimize application-specific logic. This
applies to both retrieval and analysis.
- 38. Assertions
- Apply to long-term real-time web scraping applications.
- Affirm conditions of behavior and output of the target
application.
- Use in the application during runtime to avoid Bad Things (tm)
happening when the target application changes.
- Include in unit tests of the application. You are using unit
tests, right?
- 39. Testing
- Write tests on target application output stored in local files
that can be run sans internet during development.
- If possible/feasible/appropriate, write "live tests" that
actively test using assertions on the target application.
- Run live tests when the target appears to have changed (because
your web scraping application breaks).
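The assertion and offline-testing advice above might look something like this in practice. The function name `extract_heading`, the `h1#firstHeading` selector, and the inline markup standing in for a locally saved copy of the target are all assumptions for illustration.

```php
<?php
// Hypothetical extractor: return the page heading, failing loudly if the
// target's markup no longer matches what the scraper expects.
function extract_heading($html)
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//h1[@id="firstHeading"]');

    // Runtime assertion: exactly one matching heading must exist.
    if ($nodes->length !== 1) {
        throw new UnexpectedValueException(
            'Expected exactly one h1#firstHeading, found ' . $nodes->length
        );
    }

    return trim($nodes->item(0)->textContent);
}

// Stand-in for target output stored in a local file, so no network is needed.
$fixture = '<html><body><h1 id="firstHeading">Main Page</h1></body></html>';
$heading = extract_heading($fixture);
```

The same function can be pointed at a live response in a "live test"; the exception then signals that the target has changed rather than letting bad data through.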
- 40. Questions?
- No heckling... OK, maybe just a little.
- I will hang around afterward if you have questions, points for
discussion, or just want to say hi. It's cool, I don't bite or have
cooties or anything. I have business cards too.
- I generally blog about my experiences with web scraping and PHP
at http://ishouldbecoding.com.