Web Scraping with PHP
Transcript of Web Scraping with PHP
- 1. Web Scraping with PHP Matthew Turland September 16, 2008
- 2. Everyone acquainted?
- Lead Programmer for surgiSYS, LLC
- PHP Community member
- Blog: http://ishouldbecoding.com
- 3. What is Web Scraping? A 2-stage process.
- Stage 1: Retrieval. GET /some/resource ... HTTP/1.1 200 OK ... Resource with data you want
- Stage 2: Analysis. Raw resource, usable data
- 4. How is it different from... Data Mining Focus in data mining Focus in web scraping Consuming Web Services Web service data formats Web scraping data formats
- 5. Potential Applications (what and when)
- Data source: when a web service is unavailable or data access is one-time only.
- Crawlers and indexers: when the remote data source offers no capabilities for search or data source integration.
- Integration testing: when applications must be tested by simulating client behavior and ensuring responses are consistent with requirements.
- 6. Disadvantages == vs
- 7. Legal Concerns TOS TOU EUA Original source Illegal syndicate IANAL!
- 8. Retrieval GET /some/resource ... sizeof( ) == sizeof( ) if( ) require ;
- 9. The Low-Down on HTTP
- 10. Know enough HTTP to... Use one like this: To do this:
- 11. Know enough HTTP to... Learn to use and troubleshoot one like this: PEAR::HTTP_Client, pecl_http, Zend_Http_Client. Or roll your own! cURL, Filesystem + Streams
- 12.
- GET /wiki/Main_Page HTTP/1.1
- Host: en.wikipedia.org
- 13. Warning about GET. In principle: "Let's do this by the book." In reality: "'Safe operation'? Whatever."
- 14. URI vs URL 1. Uniquely identifies a resource 2. Indicates how to locate a resource 3. Does both and is thus human-usable. URI URL More info in RFC 3986 Sections 1.1.3 and 1.2.2
- 15. Query Strings http://en.wikipedia.org/w/index.php?title=Query_string&action=edit URL Query String. Question mark to separate the resource address and query string. Equal signs to separate parameter names and respective values. Ampersands to separate parameter name-value pairs. Parameter Value
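A minimal sketch of taking the slide's URL apart in PHP, assuming the built-in parse_url() and parse_str() functions are the tools of choice. The variable names are illustrative and not from the talk.

```php
<?php
// URL from the slide above.
$url = 'http://en.wikipedia.org/w/index.php?title=Query_string&action=edit';

// Everything after the question mark, without the question mark itself.
$queryString = parse_url($url, PHP_URL_QUERY);

// Split on ampersands and equal signs into an associative array.
parse_str($queryString, $params);

echo $queryString, PHP_EOL;      // title=Query_string&action=edit
echo $params['title'], PHP_EOL;  // Query_string
echo $params['action'], PHP_EOL; // edit
```

parse_str() also URL-decodes the values, so the array holds the data as the server would see it.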
- 16. URL Encoding Parameter Value: first = "this is a field", second = "was it clear enough (already)?" Query String: first=this+is+a+field&second=was+it+clear+%28already%29%3F Also called percent encoding. urlencode and urldecode: handy PHP URL functions. $_SERVER['QUERY_STRING'] / http_build_query($_GET). More info on URL encoding in RFC 3986 Section 2.1
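As a rough illustration, the query string on this slide can be reproduced with http_build_query(). The field values are copied from the encoded string shown on the slide; the array and variable names are assumptions for the example.

```php
<?php
// Field values matching the slide's encoded query string.
$fields = [
    'first'  => 'this is a field',
    'second' => 'was it clear (already)?',
];

// http_build_query() URL-encodes each value and joins the pairs with "&".
$query = http_build_query($fields);
echo $query, PHP_EOL;
// first=this+is+a+field&second=was+it+clear+%28already%29%3F

// urlencode() and urldecode() handle a single value at a time.
$encoded = urlencode($fields['second']);
echo $encoded, PHP_EOL;            // was+it+clear+%28already%29%3F
echo urldecode($encoded), PHP_EOL; // was it clear (already)?
```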
- 17. POST Requests Most Common HTTP Operations 1. GET 2. POST ... /w/index.php POST /new/resource -or- /updated/resource GET /some/resource HTTP/1.1 Header: Value ... POST /some/resource HTTP/1.1 Header: Value request body none
- 18. POST Request Example
- POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
- Content-Type: application/x-www-form-urlencoded
- wpStarttime=20080719022313&wpEdittime=20080719022100...
- 26. Streams Resources
- Language Reference > Context options and parameters
- HTTP context options
- Context parameters
- Appendices > List of Supported Protocols/Wrappers
- HTTP and HTTPS
- php|architect's Definitive Guide to PHP Streams (ETA late 2008 / early 2009)
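For a sense of what the HTTP context options look like in practice, here is a minimal sketch that builds a stream context for a form-style POST. The function name, parameters, and the example.com URL are hypothetical; the option keys (method, header, user_agent, content, timeout, ignore_errors) are the documented HTTP context options.

```php
<?php
/**
 * Build an HTTP stream context for a form-style POST request.
 * The function name and parameters are illustrative, not from the talk.
 */
function buildPostContext(array $postData, string $userAgent)
{
    $body = http_build_query($postData);

    return stream_context_create([
        'http' => [
            'method'        => 'POST',
            'header'        => "Content-Type: application/x-www-form-urlencoded\r\n"
                             . 'Content-Length: ' . strlen($body) . "\r\n",
            'user_agent'    => $userAgent,
            'content'       => $body,
            'timeout'       => 10,
            'ignore_errors' => true, // still return the body on 4xx/5xx responses
        ],
    ]);
}

$context = buildPostContext(['q' => 'web scraping'], 'PHP ' . phpversion());

// Inspect the options without touching the network.
$options = stream_context_get_options($context);
echo $options['http']['method'], ' ', $options['http']['content'], PHP_EOL;

// To actually send the request (placeholder URL, requires allow_url_fopen):
// $responseBody = file_get_contents('http://example.com/search', false, $context);
```

Keeping the context construction in its own function makes it possible to check the request settings offline, which fits the testing advice later in the talk.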
- 27. pecl_http Examples
    $http = new HttpRequest($uri);
    $http->enableCookies();
    $http->setMethod(HTTP_METH_POST); // or HTTP_METH_GET
    $http->addPostFields($postData);
    $http->setOptions(array(
        'httpauth' => $username . ':' . $password,
        'httpauthtype' => HTTP_AUTH_BASIC,
        'useragent' => 'PHP ' . phpversion(),
        'referer' => 'http://example.com/some/referer',
        'range' => array(array(1, 5), array(10, 15))
    ));
    $response = $http->send();
    $headers = $response->getHeaders();
    $body = $response->getBody();
  See the PHP Manual for more info.
- 28. PEAR::HTTP_Client Examples
    $cookiejar = new HTTP_Client_CookieManager();
    $request = new HTTP_Request($uri);
    $request->setMethod(HTTP_REQUEST_METHOD_POST);
    $request->setBasicAuth($username, $password);
    $request->addHeader('User-Agent', $userAgent);
    $request->addHeader('Referer', $referrer);
    $request->addHeader('Range', 'bytes=2-3,5-6');
    foreach ($postData as $key => $value) {
        $request->addPostData($key, $value);
    }
    $request->sendRequest();
    // Carry cookies from the first response over to the next request.
    $cookiejar->updateCookies($request);
    $request = new HTTP_Request($otheruri);
    $cookiejar->passCookies($request);
    $response = $request->sendRequest();
    $headers = $request->getResponseHeader();
    $body = $request->getResponseBody();
  See the PEAR Manual and API Docs for more info.
- 29. Zend_Http_Client Examples
    $client = new Zend_Http_Client($uri);
    $client->setMethod(Zend_Http_Client::POST);
    $client->setAuth($username, $password);
    $client->setHeaders('User-Agent', $userAgent);
    $client->setHeaders(array(
        'Referer' => $referrer,
        'Range' => 'bytes=2-3,5-6'
    ));
    $client->setParameterPost($postData);
    $client->setCookieJar();
    $client->request();
    // Reuse the same client (and its cookie jar) for a follow-up GET.
    $client->setUri($otheruri);
    $client->setMethod(Zend_Http_Client::GET);
    $response = $client->request();
    $headers = $response->getHeaders();
    $body = $response->getBody();
  See the ZF Manual for more info.
- 30. cURL Examples Fatal error: Allowed memory size of n00b bytes exhausted (tried to allocate 1337 bytes) in /this/slide.php on line 1. Just kidding. Really, the equivalent cURL code for the previous examples is so verbose that it won't fit on one slide and I don't think it's deserving of multiple slides. See the PHP Manual, Context Options, or my php|architect article for more info.
- 31. HTTP Resources
- RFC 2616 HyperText Transfer Protocol
- RFC 3986 Uniform Resource Identifiers
- "HTTP: The Definitive Guide" (ISBN 1565925092)
- "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
- "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
- Ben Ramsey's blog series on HTTP
- 32. Analysis Raw resource Usable data DOM XMLReader SimpleXML XSL tidy PCRE String functions JSON ctype XML Parser
- 33. Cleanup
- tidy is good for correcting markup malformations. *
- String functions and PCRE can be used for manual cleanup prior to using a parsing extension.
- DOM is generally forgiving when parsing malformed markup. It generates warnings that can be suppressed.
- Save a static copy of your target, run a validator on it (e.g., the W3C Markup Validator), fix the validation errors manually, and then write code to apply the same fixes automatically.
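The point about DOM being forgiving can be sketched as follows. This assumes the DOM extension is available and uses libxml_use_internal_errors() to collect parser warnings quietly rather than suppressing them with @; the sample markup is made up for the example.

```php
<?php
// Deliberately sloppy markup: the paragraphs are never closed.
$html = '<html><body><p>First<p>Second</body></html>';

// Collect libxml warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Keep any parser complaints around for logging, then reset the buffer.
$problems = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

// The parser closes each <p> on its own, giving two sibling paragraphs.
print_r($paragraphs);
```

Logging the contents of $problems, rather than discarding them, can give an early hint that the target's markup has changed.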
- 34. Parsing
- DOM and SimpleXML are tree-based parsers that store the entire document in memory to provide full access.
- XMLReader is a pull-based parser that iterates over nodes in the document and is less memory-intensive.
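- SAX (the XML Parser extension) also streams through the document, but it is push-based: the parser invokes your event callbacks instead of letting you pull nodes.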
- JSON can be used to parse isolated JavaScript data.
- Nothing "official" for CSS. Find something like CSSTidy.
- PCRE can be used for parsing. Last resort, though.
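To make the XMLReader bullet concrete, here is a small sketch of node-by-node iteration. The feed markup and element names are invented for the example; a real document would typically be opened from a file or URL rather than a string.

```php
<?php
// Small stand-in feed; a real one could be far too large to load with DOM.
$xml = <<<XML
<feed>
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$titles = [];
// read() advances one node at a time, so the whole tree is never built.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        $titles[] = $reader->readString();
    }
}
$reader->close();

print_r($titles);
```

The trade-off is that the code only ever sees the current node, so anything that needs random access to the document is easier with DOM or SimpleXML.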
- 35. Validation
- Make as few assumptions (and as many assertions) about the target as possible.
- Validation provides additional sanity checks for your application.
- PCRE can be used to form pattern-based assertions about extracted data.
- ctype can be used to form primitive type-based assertions.
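A minimal sketch of the two kinds of assertions mentioned above, using ctype for a type check and PCRE for a pattern check. The record fields and expected formats are hypothetical.

```php
<?php
/**
 * Check one scraped record before it goes any further.
 * The field names and formats are hypothetical.
 */
function validateRecord(array $record): array
{
    $errors = [];

    // ctype: primitive type check, digits only (an empty string fails too).
    if (!ctype_digit($record['quantity'])) {
        $errors[] = 'quantity is not a whole number';
    }

    // PCRE: pattern check for an ISO-style date such as 2008-09-16.
    if (preg_match('/\A\d{4}-\d{2}-\d{2}\z/', $record['date']) !== 1) {
        $errors[] = 'date is not in YYYY-MM-DD form';
    }

    return $errors;
}

$good = validateRecord(['quantity' => '42', 'date' => '2008-09-16']);
$bad  = validateRecord(['quantity' => '4 2', 'date' => 'Sept 16']);

echo count($good), PHP_EOL; // 0
echo count($bad), PHP_EOL;  // 2
```

Returning a list of problems instead of a boolean makes it easier to log exactly which assumption about the target stopped holding.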
- 36. Transformation
- XSL can be used to extract data from an XML-compatible document and retrofit it to a format defined by an XSL template.
- To my knowledge, this capability is unfortunately unique to XML-compatible data.
- Use components like template engines to separate formatting of data from retrieval/analysis logic.
- 37. Abstraction
- Remain in keeping with the DRY principle.
- Develop components that can be reused across projects. Ex: DomQuery, Zend_Dom.
- Make an effort to minimize application-specific logic. This applies to both retrieval and analysis.
- 38. Assertions
- Apply to long-term real-time web scraping applications.
- Affirm conditions of behavior and output of the target application.
- Use in the application during runtime to avoid Bad Things (tm) happening when the target application changes.
- Include in unit tests of the application. You are using unit tests, right?
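One way a runtime assertion might look, sketched with DOM and XPath. The function name, the XPath expression, and the sample markup are all hypothetical; the idea is only that the scraper fails loudly when the target no longer matches what it was written against.

```php
<?php
/**
 * Extract the page heading, failing loudly if the markup no longer
 * matches expectations. The XPath and markup are hypothetical.
 */
function extractHeading(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//h1[@id="title"]');

    // Assertion: exactly one matching heading is expected.
    if ($nodes === false || $nodes->length !== 1) {
        throw new RuntimeException('Target markup changed: expected one h1#title');
    }

    return trim($nodes->item(0)->textContent);
}

echo extractHeading('<html><body><h1 id="title"> Hello </h1></body></html>'), PHP_EOL;

try {
    extractHeading('<html><body><h2>Redesigned</h2></body></html>');
} catch (RuntimeException $e) {
    echo 'Caught: ', $e->getMessage(), PHP_EOL;
}
```

The same function can be called from a unit test against a saved copy of the page, so the runtime check and the test share one set of assumptions.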
- 39. Testing
- Write tests on target application output stored in local files that can be run sans internet during development.
- If possible/feasible/appropriate, write "live tests" that actively test using assertions on the target application.
- Run live tests when the target appears to have changed (because your web scraping application breaks).
- 40. Questions?
- No heckling... OK, maybe just a little.
- I will hang around afterward if you have questions, points for discussion, or just want to say hi. It's cool, I don't bite or have cooties or anything. I have business cards too.
- I generally blog about my experiences with web scraping and PHP at http://ishouldbecoding.com.
- Thanks for coming!