Web::Scraper

Practical Web Scraping

with Web::Scraper

Tatsuhiko Miyagawa miyagawa@gmail.com

Six Apart, Ltd. / Shibuya Perl Mongers

YAPC::Europe 2007 Vienna

Tatsuhiko Miyagawa 2007/08/28 YAPC::Europe 2007

Tatsuhiko Miyagawa

CPAN: MIYAGAWA

abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords

Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny

Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton

Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC

Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager

Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era

Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode

File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent

HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote

Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp

Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find

PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable

Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript

Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon

Text::Emoticon::GoogleTalk Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape

WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery

WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog

XML::Atom::Stream XML::Liberal

http://code.sixapart.com/


Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.

http://en.wikipedia.org/wiki/Screen_scraping


"Screen-scraping is so 1999!"

RSS is metadata, not a complete HTML replacement


What's wrong with LWP & Regexp?

<td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46

It works!

WWW::MySpace 0.70

WWW::Search::Ebay 2.231

WWW::Mixi 0.50

It works …

There are 3 problems (at least)

(1) Fragile

Easy to break even with slight HTML changes (like newlines, order of attributes, etc.)
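To make the fragility concrete, here is a minimal sketch using only core Perl. The three markup variants and the pattern are assumptions made up for illustration (modelled on the `strong#ctu` element used later in the talk), not markup taken from the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical markup: the same element written three slightly different ways.
my %variants = (
    original        => '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    extra_attribute => '<strong class="big" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    added_newline   => qq{<strong id="ctu">\nMonday, August 27, 2007 at 12:49:46\n</strong>},
);

# The kind of pattern a quick one-liner tends to use (an assumption, see above).
my $pattern = qr{<strong id="ctu">(.*?)</strong>};

# Record which variants the pattern still matches.
my %matched;
for my $name (sort keys %variants) {
    $matched{$name} = $variants{$name} =~ $pattern ? 1 : 0;
    printf "%-16s %s\n", $name, $matched{$name} ? "matched" : "NOT matched";
}
```

Only the first variant matches: the extra attribute changes the literal opening tag, and `.` does not cross newlines unless the `/s` modifier is added.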

(2) Hard to maintain

Regular expression based scrapers are good only when they're used in write-only scripts

(3) Improper HTML & encoding handling

I &hearts; Vienna

> perl -e '$c =~ m@(.*?)@ and print $1'
I &hearts; Vienna

I &hearts; Vienna

> perl -MHTML::Entities -e '$c =~ m@(.*?)@ and print decode_entities($1)'
I ♥ Vienna

ウィーンが大好き！

> perl -MHTML::Entities -MEncode -e '$c =~ m@(.*?)@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
ウィーンが大好き！
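The "Wide character" warning appears because decoded characters are printed to a filehandle that expects bytes. A minimal sketch of the decode-on-input, encode-on-output round trip, using only the core Encode module; the byte string is an assumption standing in for an HTTP response body (it spells "ウィーン" in UTF-8).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Raw UTF-8 bytes as they might arrive over the wire (hypothetical sample data).
my $bytes = "\xe3\x82\xa6\xe3\x82\xa3\xe3\x83\xbc\xe3\x83\xb3";

# Decode bytes into Perl characters before doing any text processing.
my $chars = decode_utf8($bytes);

# Encode back to bytes right before output, so print sees no wide characters.
my $out = encode_utf8($chars);

print length($bytes), " bytes, ", length($chars), " characters\n";
print $out, "\n";
```

An alternative to the explicit `encode_utf8` call is to set an encoding layer on the handle with `binmode STDOUT, ':encoding(UTF-8)'` and print the decoded string directly.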

The "right" way of screen-scraping

(1), (2) Maintainable, less fragile

Use XPath and CSS Selectors

HTML::TreeBuilder::XPath

XML::LibXML

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;

# Monday, August 27, 2007 at 12:49:46

CSS Selectors

"XPath for HTML coders"

"XPath for people who hate XML"

CSS Selectors

body { font-size: 12px; }

div.article { padding: 1em }

span#count { color: #fff }

XPath: //strong[@id="ctu"]

CSS Selector: strong#ctu
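A small sketch of the selector-to-XPath mapping, assuming HTML::Selector::XPath is installed from CPAN. It prints the XPath that the module generates for the selectors shown on these slides; the exact XPath strings depend on the module version, so none are quoted here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Selector::XPath qw(selector_to_xpath);

# CSS selectors taken from the slides.
my @selectors = ('body', 'div.article', 'span#count', 'strong#ctu', 'ul.sites > li > a');

# Convert each selector and keep the result for inspection.
my %xpath_for = map { $_ => selector_to_xpath($_) } @selectors;

printf "%-20s %s\n", $_, $xpath_for{$_} for @selectors;
```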

CSS Selectors

use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath "strong#ctu";
print $tree->findnodes($xpath)->shift->as_text;

# Monday, August 27, 2007 at 12:49:46

Complete Script

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;

Robust, maintainable, and sane character handling

Example (before)

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46

Example (after)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

but … long and boring

with Web::Scraper

Web scraping toolkit inspired by scrapi.rb

DSL-ish

Example (before)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

Example (after)

#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use URI;

my $s = scraper {
    process "strong#ctu", time => 'TEXT';
    result 'time';
};

my $uri = URI->new("http://timeanddate.com/worldclock/");
print $s->scrape($uri);

Basics

use Web::Scraper;

my $s = scraper {
    # DSL goes here
};

my $res = $s->scrape($uri);

process

process $selector,

$key => $what,

$selector: CSS Selector or XPath (starts with /)

$key: key for the result hash; append "[]" for looping

$what: '@attr', 'TEXT', Web::Scraper, sub { … }, or hash reference

<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>

process "ul.sites > li > a",

'urls[]' => '@href';

# { urls => [ … ] }

process '//ul[@class="sites"]/li/a',

'names[]' => 'TEXT';

# { names => [ 'OpenGuides', … ] }

process "ul.sites > li",
    'sites[]' => scraper {
        process 'a',
            link => '@href', name => 'TEXT';
    };

# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };

    'sites[]' => sub {
        # $_ is HTML::Element
        +{ link => $_->attr('href'), name => $_->as_text };
    };

# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };

    'sites[]' => {
        link => '@href', name => 'TEXT',
    };

# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };
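The fragments above can be tried without network access by handing `scrape` an HTML string instead of a URI. This is a minimal sketch, assuming Web::Scraper is installed from CPAN; it reuses the `ul.sites` markup from the slides and combines the 'TEXT' and nested-scraper forms in one scraper.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Sample markup from the slides, passed to scrape() as a plain string.
my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # 'names[]' collects the text of every matching link.
    process "ul.sites > li > a", 'names[]' => 'TEXT';

    # 'sites[]' runs a nested scraper once per <li>, producing one hash each.
    process "ul.sites > li", 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
};

my $res = $s->scrape($html);

print "$_\n" for @{ $res->{names} };
print "$_->{name}: $_->{link}\n" for @{ $res->{sites} };
```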

result

result;        # get stash as hashref (default)
result @keys;  # get stash as hashref containing @keys
result $key;   # get value of stash $key

my $s = scraper {
    process …;
    process …;
    result 'foo', 'bar';
};
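A short sketch of how `result` changes what `scrape` returns, assuming Web::Scraper is installed from CPAN. The HTML fragment and the key names `title` and `author` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical page fragment, not from the talk.
my $html = '<h1>Vienna</h1><p class="by">miyagawa</p>';

# No result call: scrape returns the whole stash as a hashref.
my $all = scraper {
    process 'h1',   title  => 'TEXT';
    process 'p.by', author => 'TEXT';
};

# result with a single key: scrape returns just that value.
my $one = scraper {
    process 'h1',   title  => 'TEXT';
    process 'p.by', author => 'TEXT';
    result 'title';
};

my $stash = $all->scrape($html);
my $title = $one->scrape($html);

print "stash keys: ", join(", ", sort keys %$stash), "\n";
print "title: $title\n";
```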

More Examples

Thumbnail URLs on Flickr set

#!/usr/bin/perl

use strict;

use Data::Dumper;

use Web::Scraper;

use URI;

my $url = "http://flickr.com/photos/bulknews/sets/72157601700510359/";

my $s = scraper {
    process "a.image_link img", "thumbs[]" => '@src';
};

warn Dumper $s->scrape( URI->new($url) );

…

Twitter Friends

#!/usr/bin/perl

use strict;

use Web::Scraper;

use URI;

use Data::Dumper;

my $url = "http://twitter.com/miyagawa";

my $s = scraper {
    process "span.vcard a", "people[]" => '@title';
};

warn Dumper $s->scrape( URI->new($url) );

Twitter Friends (complex)

#!/usr/bin/perl

use strict;

use Web::Scraper;

use URI;

use Data::Dumper;

my $url = "http://twitter.com/miyagawa";

my $s = scraper {
    process "span.vcard", "people[]" => scraper {
        process "a", link => '@href', name => '@title';
        process "img", thumb => '@src';
    };
};

warn Dumper $s->scrape( URI->new($url) );

> cpan Web::Scraper

comes with 'scraper' CLI

> scraper http://example.com/

scraper> process "a", "links[]" => '@href';

scraper> d

$VAR1 = {
  links => [
    'http://example.org/',
    'http://example.net/',
  ],
};

scraper> y

links:

- http://example.org/

- http://example.net/

> scraper /path/to/foo.html

> GET http://example.com/ | scraper

Web::Scraper needs documentation

More examples to put in eg/ directory

Integrate with WWW::Mechanize and Test::WWW::Declare

XPath auto-suggestion off of DOM + element

DOM + XPath => Element

DOM + Element => XPath?

(Template::Extract?)

Questions?

Thank you

http://search.cpan.org/dist/Web-Scraper

http://www.slideshare.net/miyagawa/webscraper
