1. Web Scraping with PHP Matthew Turland September 16,
2008
2. Everyone acquainted?
Lead Programmer for surgiSYS, LLC
PHP Community member
Blog: http://ishouldbecoding.com
3. What is Web Scraping? 2 Stage Process
Stage 1: Retrieval. GET /some/resource ... HTTP/1.1 200 OK ... Resource with data you want
Stage 2: Analysis. Raw resource -> Usable data
4. How is it different from... Data Mining Focus in data mining
Focus in web scraping Consuming Web Services Web service data
formats Web scraping data formats
5. Potential Applications
What: Data source. When: Web service is unavailable or data access is one-time only.
What: Crawlers and indexers. When: Remote data source offers no capabilities for search or data source integration.
What: Integration testing. When: Applications must be tested by simulating client behavior and ensuring responses are consistent with requirements.
6. Disadvantages == vs
7. Legal Concerns TOS TOU EUA Original source Illegal syndicate
IANAL!
10. Know enough HTTP to... Use one like this: To do this:
11. Know enough HTTP to... PEAR::HTTP_Client pecl_http
Zend_Http_Client Learn to use and troubleshoot one like this: Or
roll your own! cURL Filesystem+Streams
12.
GET /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
Let's GET Started
Request line: method or operation, URI address for the desired resource, protocol version in use by the client
Header: header name, header value
More headers follow...
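To make the anatomy above concrete, here is a minimal sketch that assembles the same request as a string. The function name buildGetRequest and the extra Connection header are my own illustration, not something from the slides, and nothing is sent over the network.

```php
<?php
// Hypothetical helper (not from the slides): assembles the text of an
// HTTP/1.1 GET request from its parts.
function buildGetRequest(string $host, string $path, array $headers = []): string
{
    // Request line: method, URI of the desired resource, protocol version.
    $lines = ["GET $path HTTP/1.1"];
    // Host is the one header HTTP/1.1 requires on every request.
    $lines[] = "Host: $host";
    // Any further headers follow as "Name: value" lines.
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }
    // Each line ends in CRLF; an empty line marks the end of the headers.
    return implode("\r\n", $lines) . "\r\n\r\n";
}

$request = buildGetRequest('en.wikipedia.org', '/wiki/Main_Page', ['Connection' => 'close']);
echo $request;
```

Writing the request out by hand like this is mostly useful for troubleshooting: it is what a client library sends on your behalf.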
13. Warning about GET
In principle: "Let's do this by the book." GET
In reality: "'Safe operation'? Whatever." GET
14. URI vs URL
1. Uniquely identifies a resource
2. Indicates how to locate a resource
3. Does both and is thus human-usable.
URI URL
More info in RFC 3986 Sections 1.1.3 and 1.2.2
15. Query Strings
http://en.wikipedia.org/w/index.php?title=Query_string&action=edit
URL Query String
Question mark to separate the resource address and query string
Equal signs to separate parameter names and respective values
Ampersands to separate parameter name-value pairs.
Parameter Value
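A short sketch of taking that URL apart with PHP's built-in parse_url and parse_str. The variable names are mine; the URL is the one from the slide.

```php
<?php
$url = 'http://en.wikipedia.org/w/index.php?title=Query_string&action=edit';

// parse_url splits the URL at the question mark (among other places).
$parts = parse_url($url);

// parse_str splits the query string on ampersands and equal signs
// and fills $params with the name-value pairs.
parse_str($parts['query'], $params);

echo $parts['path'], "\n";      // resource address
echo $parts['query'], "\n";     // query string
foreach ($params as $name => $value) {
    echo "$name = $value\n";
}
```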
16. URL Encoding
Parameter: first. Value: this is a field
Parameter: second. Value: was it clear enough (already)?
Query String: first=this+is+a+field&second=was+it+clear+enough+%28already%29%3F
Also called percent encoding.
Handy PHP URL functions: urlencode and urldecode, $_SERVER['QUERY_STRING'] / http_build_query($_GET)
More info on URL encoding in RFC 3986 Section 2.1
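Here is a minimal sketch of those functions applied to the two example fields. The variable names are mine, and I pass '&' explicitly to http_build_query so the result does not depend on the arg_separator.output ini setting.

```php
<?php
$fields = [
    'first'  => 'this is a field',
    'second' => 'was it clear enough (already)?',
];

// Builds a complete, encoded query string from name-value pairs.
$query = http_build_query($fields, '', '&');

// urlencode handles a single value; urldecode reverses it.
$encoded = urlencode($fields['second']);
$decoded = urldecode($encoded);

// parse_str decodes a whole query string back into an array.
parse_str($query, $roundTrip);

echo $query, "\n";
echo $encoded, "\n";
```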
17. POST Requests
Most Common HTTP Operations: 1. GET 2. POST ...
/w/index.php POST /new/resource -or- /updated/resource
GET /some/resource HTTP/1.1 Header: Value ... (request body: none)
POST /some/resource HTTP/1.1 Header: Value (request body)
18. POST Request Example
POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
Blank line separates request headers and body
Content type for data submitted via HTML form (multipart/form-data for file uploads)
Request body ... look familiar?
Note: Most browsers have a query string length limit. Lowest known common denominator: IE7
strlen(entire URL)

$context = stream_context_create(array(
    'http' => array(
        'method' => 'POST',
        'header' => 'Content-Type: ' .
            'application/x-www-form-urlencoded',
        'content' => http_build_query(array(
            'var1' => 'value1',
            'var2' => 'value2'
        ))
    )
));
// Last 2 parameters here also apply to fopen()
$post = file_get_contents($uri, false, $context);
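When a streams-based POST misbehaves, it helps to look at what the context actually holds before sending anything. The sketch below wraps the context setup in a helper and reads it back with stream_context_get_options. The helper name buildPostContext is my own, and no request is made.

```php
<?php
// Hypothetical helper (not from the slides): builds an HTTP stream
// context that POSTs the given fields as a URL-encoded form body.
function buildPostContext(array $fields)
{
    return stream_context_create([
        'http' => [
            'method'  => 'POST',
            'header'  => 'Content-Type: application/x-www-form-urlencoded',
            'content' => http_build_query($fields, '', '&'),
        ],
    ]);
}

$context = buildPostContext(['var1' => 'value1', 'var2' => 'value2']);

// Read the options back to confirm what would be sent.
$options = stream_context_get_options($context);
print_r($options['http']);
```

Passing the resulting context as the third argument to file_get_contents, as in the slide's code, is what actually performs the request.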
26. Streams Resources
Language Reference > Context options and parameters
HTTP context options
Context parameters
Appendices > List of Supported Protocols/Wrappers
HTTP and HTTPS
php|architect's Definitive Guide to PHP Streams (ETA late 2008
/ early 2009)
30. cURL Examples
Fatal error: Allowed memory size of n00b bytes exhausted (tried to allocate 1337 bytes) in /this/slide.php on line 1
See PHP Manual, Context Options, or my php|architect article for more info.
Just kidding. Really, the equivalent cURL code for the previous examples is so verbose that it won't fit on one slide and I don't think it's deserving of multiple slides.
31. HTTP Resources
RFC 2616 HyperText Transfer Protocol
RFC 3986 Uniform Resource Identifiers
"HTTP: The Definitive Guide" (ISBN 1565925092)
"HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN
1565928628)
32. Analysis Raw resource Usable data DOM XMLReader SimpleXML
XSL tidy PCRE String functions JSON ctype XML Parser
33. Cleanup
tidy is good for correcting markup malformations. *
String functions and PCRE can be used for manual cleanup prior
to using a parsing extension.
DOM is generally forgiving when parsing malformed markup. It
generates warnings that can be suppressed.
Save a static copy of your target, use a validator on the input
(ex: W3C Markup Validator), fix validation errors manually, and
write code to automatically apply fixes.
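A minimal sketch of the DOM point above: loading sloppy markup while collecting libxml's complaints instead of letting them surface as PHP warnings. The markup is invented for illustration (unclosed list items plus a stray closing tag), and it assumes the DOM extension is available.

```php
<?php
// Invented sample markup: unclosed <li> elements and a stray </span>.
$html = '<ul><li>One<li>Two</span></ul>';

// Collect parser errors internally rather than emitting warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Keep the errors for logging if wanted, then restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$items = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $items[] = trim($li->textContent);
}

print_r($items);
```

I prefer libxml_use_internal_errors over the @ operator here because the errors stay available for inspection instead of being thrown away.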
34. Parsing
DOM and SimpleXML are tree-based parsers that store the entire
document in memory to provide full access.
XMLReader is a pull-based parser that iterates over nodes in
the document and is less memory-intensive.
SAX (the XML Parser extension) is also stream-based, but it is push-based: the parser drives and reports nodes through event-based callbacks.
JSON can be used to parse isolated JavaScript data.
Nothing "official" for CSS. Find something like CSSTidy.
PCRE can be used for parsing. Last resort, though.
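To show the tree-based versus pull-based difference side by side, here is a small sketch that reads the same titles with SimpleXML and with XMLReader. The XML document is made up for the example, and it assumes both extensions are installed.

```php
<?php
// Invented sample document.
$xml = '<books><book id="1"><title>A</title></book>'
     . '<book id="2"><title>B</title></book></books>';

// Tree-based: the whole document is parsed into memory first,
// then navigated freely.
$tree = simplexml_load_string($xml);
$treeTitles = [];
foreach ($tree->book as $book) {
    $treeTitles[] = (string) $book->title;
}

// Pull-based: the code asks for one node at a time and keeps
// only what it needs.
$reader = new XMLReader();
$reader->XML($xml);
$pullTitles = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        $pullTitles[] = $reader->readString();
    }
}
$reader->close();

print_r($treeTitles);
print_r($pullTitles);
```

On a document this small the choice makes no difference. The memory savings of XMLReader only matter for large resources.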
35. Validation
Make as few assumptions (and as many assertions) about the
target as possible.
Validation provides additional sanity checks for your
application.
PCRE can be used to form pattern-based assertions about
extracted data.
ctype can be used to form primitive type-based assertions.
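The two kinds of assertions above might look like this in practice. The function names and the price format are assumptions of mine, chosen only to illustrate the idea.

```php
<?php
// Hypothetical check: a scraped price must look like "19.99".
// \A and \z anchor the whole string, so a trailing newline fails.
function isValidPrice(string $value): bool
{
    return preg_match('/\A\d+\.\d{2}\z/', $value) === 1;
}

// Hypothetical check: a scraped ID must consist of digits only.
// ctype_digit returns false for an empty string.
function isValidId(string $value): bool
{
    return ctype_digit($value);
}

var_dump(isValidPrice('19.99'), isValidId('12345'));
```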
36. Transformation
XSL can be used to extract data from an XML-compatible document
and retrofit it to a format defined by an XSL template.
To my knowledge, this capability is unfortunately unique to
XML-compatible data.
Use components like template engines to separate formatting of
data from retrieval/analysis logic.
37. Abstraction
Remain in keeping with the DRY principle.
Develop components that can be reused across projects.
Ex: DomQuery, Zend_Dom.
Make an effort to minimize application-specific logic. This
applies to both retrieval and analysis.
38. Assertions
Apply to long-term real-time web scraping applications.
Affirm conditions of behavior and output of the target
application.
Use in the application during runtime to avoid Bad Things (tm)
happening when the target application changes.
Include in unit tests of the application. You are using unit
tests, right?
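One way to use an assertion at runtime is to fail loudly the moment the target stops looking the way the scraper expects. The sketch below is only an illustration: extractTitle is a hypothetical function, and a regular expression stands in for whatever parsing the real application does.

```php
<?php
// Hypothetical extraction step that asserts its own assumptions.
function extractTitle(string $html): string
{
    // Assumption about the target: the page contains a <title> element.
    if (preg_match('#<title>(.*?)</title>#is', $html, $match) !== 1) {
        throw new UnexpectedValueException('Expected a <title> element in the target page');
    }
    $title = trim($match[1]);
    // Assumption about the output: the title is never empty.
    if ($title === '') {
        throw new UnexpectedValueException('Target page has an empty <title>');
    }
    return $title;
}

try {
    echo extractTitle('<html><head><title> Example </title></head></html>'), "\n";
    extractTitle('<html><head></head></html>');
} catch (UnexpectedValueException $e) {
    // Stop here instead of storing bad data downstream.
    echo 'Target changed: ', $e->getMessage(), "\n";
}
```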
39. Testing
Write tests on target application output stored in local files
that can be run sans internet during development.
If possible/feasible/appropriate, write "live tests" that
actively test using assertions on the target application.
Run live tests when the target appears to have changed (because
your web scraping application breaks).
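A rough sketch of the offline half of this: the analysis code runs against a saved copy of the target, so no network is involved. The function extractLinks and the fixture markup are invented for the example. A real suite would keep fixtures in the repository and use a test framework rather than a temp file and a bare assert.

```php
<?php
// Hypothetical analysis function under test: collects href values.
function extractLinks(string $html): array
{
    preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
    return $matches[1];
}

// Stand-in for a stored copy of the target application's output.
$fixture = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($fixture, '<p><a href="/one">One</a> <a class="x" href="/two">Two</a></p>');

// The test reads the local file, never the live site.
$links = extractLinks(file_get_contents($fixture));
unlink($fixture);

print_r($links);
```

A live test would call the same function on freshly retrieved content and apply the same assertions.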
40. Questions?
No heckling... OK, maybe just a little.
I will hang around afterward if you have questions, points for
discussion, or just want to say hi. It's cool, I don't bite or have
cooties or anything. I have business cards too.
I generally blog about my experiences with web scraping and PHP
at http://ishouldbecoding.com.