Web::Scraper

Practical Web Scraping

with Web::Scraper

Tatsuhiko Miyagawa miyagawa@gmail.com

Six Apart, Ltd. / Shibuya Perl Mongers

YAPC::Europe 2007 Vienna

Tatsuhiko Miyagawa 2007/08/28 YAPC::Europe 2007

Tatsuhiko Miyagawa

CPAN: MIYAGAWA

abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords

Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny

Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton

Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC

Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager

Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era

Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode

File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent

HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote

Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp

Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find

PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable

Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript

Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon

Text::Emoticon::GoogleTalk Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scraper

WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery

WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog

XML::Atom::Stream XML::Liberal

http://code.sixapart.com/

Practical Web Scraping

with Web::Scraper

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.

http://en.wikipedia.org/wiki/Screen_scraping

"Screen-scraping is so 1999!"

RSS is metadata, not a complete HTML replacement

Practical Web Scraping

with Web::Scraper

What's wrong with LWP & Regexp?

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46

It works!

WWW::MySpace 0.70

WWW::Search::Ebay 2.231

WWW::Mixi 0.50

It works …

There are 3 problems (at least)

(1) Fragile

Easy to break even with slight HTML changes (like newlines, order of attributes, etc.)
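
The fragility can be reproduced without touching the network. The following is a minimal sketch, not taken from the talk: it runs the same pattern as the one-liner above against three renderings of the element. The two altered renderings (`extra_attribute`, `added_newline`) and the helper `extract_time` are hypothetical, made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the one-liner above, unchanged.
my $pattern = qr{<strong id="ctu">(.*?)</strong>};

# Hypothetical renderings of the same element; only "original" matches
# the markup shown in the slides.
my %variants = (
    original        => '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    extra_attribute => '<strong class="big" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    added_newline   => "<strong id=\"ctu\">\nMonday, August 27, 2007 at 12:49:46\n</strong>",
);

# Returns the captured text, or undef when the pattern does not match.
sub extract_time {
    my ($html) = @_;
    return $html =~ $pattern ? $1 : undef;
}

for my $name (sort keys %variants) {
    my $time = extract_time($variants{$name});
    printf "%-16s %s\n", $name, defined $time ? "matched: $time" : "no match";
}
```

Only the original rendering matches. An extra attribute breaks the literal opening tag, and a newline inside the element defeats `.` because the pattern has no `/s` modifier.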

(2) Hard to maintain

Regular-expression-based scrapers are good only when they're used in write-only scripts

(3) Improper HTML & encoding handling

<span class="message">I &hearts; Vienna</span>

> perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
I &hearts; Vienna

> perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
I ♥ Vienna

<span class="message"> ウィーンが大好き! </span>

> perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
ウィーンが大好き!
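
The "Wide character" warning above appears because a decoded character string is printed to a filehandle that has no encoding layer. A minimal sketch of one common remedy follows, assuming both input and output are UTF-8: decode once on input, encode once on output. The `$bytes` value is a stand-in for content fetched over the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 literals
use Encode qw(decode_utf8 encode_utf8);

# Stand-in for the raw bytes of a fetched page (assumption: UTF-8).
my $bytes = encode_utf8('<span class="message"> ウィーンが大好き! </span>');

# Decode at the input boundary so the regex works on characters.
my $content = decode_utf8($bytes);
my ($message) = $content =~ m{<span class="message">\s*(.*?)\s*</span>};

# Encode at the output boundary; with this layer print no longer warns.
binmode STDOUT, ':encoding(UTF-8)';
print "$message\n";
printf "%d characters, %d bytes\n",
    length($message), length(encode_utf8($message));
```

Even with the warning gone, every such one-liner still has to remember both boundaries, which is the third problem the slides describe.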

The "right" way of screen-scraping

(1), (2) Maintainable, less fragile

Use XPath and CSS Selectors

XPath

HTML::TreeBuilder::XPath

XML::LibXML

XPath

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;

# Monday, August 27, 2007 at 12:49:46

CSS Selectors

"XPath for HTML coders"

"XPath for people who hate XML"

CSS Selectors

body { font-size: 12px; }

div.article { padding: 1em }

span#count { color: #fff }

XPath: //strong[@id="ctu"]

CSS Selector: strong#ctu

CSS Selectors

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath "strong#ctu";
print $tree->findnodes($xpath)->shift->as_text;

# Monday, August 27, 2007 at 12:49:46
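
The chained `->shift->as_text` call dies when the selector matches nothing. A minimal sketch of a guarded variant follows, assuming the page is already in a string. The helper name `first_text` and the enclosing `<table>` markup are hypothetical; the module calls are the ones used on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Hypothetical helper: text of the first element matching a CSS selector,
# or undef when nothing matches.
sub first_text {
    my ($html, $selector) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my ($node) = $tree->findnodes(selector_to_xpath($selector));
    my $text = $node ? $node->as_text : undef;
    $tree->delete;    # free the parse tree explicitly
    return $text;
}

# The <td> from the slide, wrapped in a table so it is a complete fragment.
my $html = '<table><tr><td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: '
         . '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /></td></tr></table>';

print first_text($html, 'strong#ctu'), "\n";
print defined first_text($html, 'strong#nope') ? "found\n" : "not found\n";
```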

Complete Script

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;

Robust, maintainable, and sane character handling

Example (before)

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46

Example (after)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;

but … long and boring

Practical Web Scraping

with Web::Scraper

Web scraping toolkit inspired by scrapi.rb

DSL-ish

Example (before)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;

Example (after)

#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use URI;

my $s = scraper {
    process "strong#ctu", time => 'TEXT';
    result 'time';
};

my $uri = URI->new("http://timeanddate.com/worldclock/");
print $s->scrape($uri);
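
Because the live page may change or be unreachable, the same scraper can be tried against a string. This is a minimal sketch, assuming that `scrape` accepts a reference to an HTML string in place of a URI object (the module documents this form); the `$html` fragment is a hypothetical stand-in for the fetched page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical stand-in for the fetched page.
my $html = '<p>Current time: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong></p>';

my $s = scraper {
    process "strong#ctu", time => 'TEXT';
    result 'time';
};

# Pass a scalar reference instead of a URI: no network access is needed.
my $time = $s->scrape(\$html);
print "$time\n";
```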

Basics

use Web::Scraper;

my $s = scraper {
    # DSL goes here
};

my $res = $s->scrape($uri);

process

process $selector,
    $key => $what,
    …;

$selector: CSS Selector or XPath (starts with /)

$key: key for the result hash

append "[]" for looping

$what:

'@attr'
'TEXT'
Web::Scraper
sub { … }
Hash reference

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>

process "ul.sites > li > a",
    'urls[]' => '@href';

# { urls => [ … ] }

process '//ul[@class="sites"]/li/a',
    'names[]' => 'TEXT';

# { names => [ 'OpenGuides', … ] }

process "ul.sites > li",
    'sites[]' => scraper {
        process 'a',
            link => '@href', name => 'TEXT';
    };

# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };

process "ul.sites > li > a",
    'sites[]' => sub {
        # $_ is HTML::Element
        +{ link => $_->attr('href'), name => $_->as_text };
    };

# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };

process "ul.sites > li > a",
    'sites[]' => {
        link => '@href', name => 'TEXT',
    };

# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };
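
The three forms above (nested scraper, callback, hash reference) are meant to build the same structure. A minimal sketch that checks this against the sample list follows, assuming the HTML is passed to `scrape` as a string reference. The `%variant` hash and the `summarize` helper are hypothetical names used only for this comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

# One scraper per form shown in the slides.
my %variant = (
    nested_scraper => scraper {
        process "ul.sites > li", 'sites[]' => scraper {
            process 'a', link => '@href', name => 'TEXT';
        };
    },
    callback => scraper {
        process "ul.sites > li > a", 'sites[]' => sub {
            +{ link => $_->attr('href'), name => $_->as_text };
        };
    },
    hash_ref => scraper {
        process "ul.sites > li > a", 'sites[]' => {
            link => '@href', name => 'TEXT',
        };
    },
);

# Hypothetical helper: flatten a result into one comparable string.
sub summarize {
    my ($scraper) = @_;
    my $res = $scraper->scrape(\$html);
    return join "; ", map { "$_->{name} <$_->{link}>" } @{ $res->{sites} };
}

print "$_: ", summarize($variant{$_}), "\n" for sort keys %variant;
```

The hash-reference form is the shortest. The callback form is the one to reach for when a value needs post-processing that `'@attr'` and `'TEXT'` cannot express.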

result

result;        # get stash as hashref (default)
result @keys;  # get stash as hashref containing @keys
result $key;   # get value of stash $key;

my $s = scraper {
    process …;
    process …;
    result 'foo', 'bar';
};
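
A minimal sketch of the three `result` forms side by side, run against a string rather than a live page. The HTML fragment, the keys `title`, `lead` and `note`, and the `build` helper are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical page fragment.
my $html = '<div><h1 id="title">Vienna</h1>'
         . '<p class="lead">YAPC::Europe 2007</p>'
         . '<p class="note">Bring a laptop</p></div>';

# Hypothetical helper: the same scraper, differing only in the result call.
sub build {
    my @keys = @_;
    return scraper {
        process 'h1#title', title => 'TEXT';
        process 'p.lead',   lead  => 'TEXT';
        process 'p.note',   note  => 'TEXT';
        result @keys;
    };
}

my $whole  = build()->scrape(\$html);                  # hashref, all keys
my $subset = build('title', 'lead')->scrape(\$html);   # hashref, two keys
my $single = build('title')->scrape(\$html);           # plain value

print join(',', sort keys %$whole),  "\n";
print join(',', sort keys %$subset), "\n";
print "$single\n";
```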

More Examples

Thumbnail URLs on Flickr set

#!/usr/bin/perl
use strict;
use Data::Dumper;
use Web::Scraper;
use URI;

my $url = "http://flickr.com/photos/bulknews/sets/72157601700510359/";

my $s = scraper {
    process "a.image_link img", "thumbs[]" => '@src';
};

warn Dumper $s->scrape( URI->new($url) );

<span class="vcard"> <a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson"> <img alt="Cal Henderson" class="photo fn" height="24" id="profile-image" src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a></span>

<span class="vcard">…</span>

Twitter Friends

#!/usr/bin/perl
use strict;
use Web::Scraper;
use URI;
use Data::Dumper;

my $url = "http://twitter.com/miyagawa";

my $s = scraper {
    process "span.vcard a", "people[]" => '@title';
};

warn Dumper $s->scrape( URI->new($url) );

Twitter Friends (complex)

#!/usr/bin/perl
use strict;
use Web::Scraper;
use URI;
use Data::Dumper;

my $url = "http://twitter.com/miyagawa";

my $s = scraper {
    process "span.vcard", "people[]" => scraper {
        process "a", link => '@href', name => '@title';
        process "img", thumb => '@src';
    };
};

warn Dumper $s->scrape( URI->new($url) );
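
The 2007 Twitter markup this script targets is unlikely to be served any more, so here is a minimal offline sketch of the same nested scraper run against an inline fragment. The image URL on `example.com` and the second, image-less entry ("Example Person") are hypothetical. The sketch also assumes that a `process` call with no matching node simply leaves its key unset.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# First entry mirrors the vcard markup shown earlier (image URL is made up);
# the second entry is hypothetical and has no <img>.
my $html = <<'HTML';
<span class="vcard"><a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson"><img alt="Cal Henderson" class="photo fn" height="24" src="http://example.com/cal.gif" width="24" /></a></span>
<span class="vcard"><a href="http://twitter.com/example" class="url" rel="contact" title="Example Person">Example Person</a></span>
HTML

my $s = scraper {
    process "span.vcard", "people[]" => scraper {
        process "a",   link  => '@href', name => '@title';
        process "img", thumb => '@src';
    };
    result 'people';
};

my $people = $s->scrape(\$html);

for my $p (@$people) {
    # thumb is absent when the vcard has no image
    my $thumb = defined $p->{thumb} ? $p->{thumb} : '(no thumbnail)';
    print "$p->{name} <$p->{link}> $thumb\n";
}
```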

Tools

> cpan Web::Scraper

comes with 'scraper' CLI

> scraper http://example.com/
scraper> process "a", "links[]" => '@href';
scraper> d
$VAR1 = {
  links => [
    'http://example.org/',
    'http://example.net/',
  ],
};
scraper> y
---
links:
  - http://example.org/
  - http://example.net/

> scraper /path/to/foo.html

> GET http://example.com/ | scraper

TODO

Web::Scraper needs documentation

More examples to put in eg/ directory

Integrate with WWW::Mechanize and Test::WWW::Declare

XPath Auto-suggestion

off of DOM + element

DOM + XPath => Element

DOM + Element => XPath?

(Template::Extract?)

Questions?

Thank you

http://search.cpan.org/dist/Web-Scraper

http://www.slideshare.net/miyagawa/webscraper