
a tutorial on crawling tools

jianguo lu

September 25, 2014


Outline

1 overview

2 wget

3 crawler4j

4 nutch


overview

type of crawlers

classify crawling according to the type of the web:

surface web crawler: obtains web pages by following hyperlinks
deep web crawler
programmable web APIs

classify crawling according to the content:

general purpose, e.g., google, archive
focused crawlers, e.g.,
  academic crawlers (google scholar)
  social networks (twitter, weibo)

surface web and deep web are intertwined

most large web sites provide both surface web and deep web (e.g., google scholar, twitter)


overview

tools for surface web crawling

command line

wget (www get), preinstalled on Ubuntu (our CS machines)
curl (client URL), preinstalled on OS X

Simple crawling APIs

Java: crawler4j: http://code.google.com/p/crawler4j/
Python: scrapy: http://scrapy.org/

large scale crawling

Heritrix, the crawler for archive.org
nutch


wget

get a webpage (in java)

import java.net.*;
import java.io.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.oracle.com/");
        BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
        in.close();
    }
}

limitations

how to analyze the page to get other URLs (see the sketch below)
how to control the process
how deep to crawl
how often to send the request
...
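As a rough illustration of the first limitation, the fetched HTML can be scanned for href attributes to discover further URLs. The sketch below is not from the slides: the class name LinkExtractor and the regular expression are illustrative only, and a real crawler would use a proper HTML parser rather than a regex.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // very rough pattern for href attributes; a real crawler would use an HTML parser
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        URL page = new URL("http://www.oracle.com/");
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(page.openStream()))) {
            String line;
            while ((line = in.readLine()) != null)
                html.append(line).append('\n');
        }
        // print every URL found in an href attribute of the fetched page
        Matcher m = HREF.matcher(html);
        while (m.find())
            System.out.println(m.group(1));
    }
}

Running this against the seed page prints the candidate links a crawler would then enqueue; this bookkeeping is exactly what wget, crawler4j, and nutch handle for you.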


wget

wget

stands for www get
developed in 1996
preinstalled on most Linux-like machines
example to download a single file:

$ wget http://www.openss7.org/repos/tarballs/strx25-0.9.2.1.tar.bz2
Saving to: 'strx25-0.9.2.1.tar.bz2.1'
31% [=================>          ] 1,213,592   68.2K/s  eta 34s


wget

get more pages

get a single page

wget http://www.example.com/index.html

supports http, ftp, etc., e.g.

wget ftp://ftp.gnu.org/pub/gnu/wget/wget-latest.tar.gz

More complex usage includes automatic download of multiple URLs into a directory hierarchy.

wget -e robots=off -r -l1 --no-parent -A.gif ftp://www.example.com/dir/


wget

Recursive retrieval using -r

the program begins following links from the website and downloading them too.

http://activehistory.ca/papers/ has a link to http://activehistory.ca/papers/historypaper-9/, so it will download that too if we use recursive retrieval.

it will also follow any other links: if there were a link to http://uwo.ca somewhere on that page, it would follow that and download it as well.

By default, -r sends wget to a depth of five sites after the first one, i.e., it follows links up to a limit of five clicks after the first website.


wget

not beyond the last parent directory, using --no-parent

The double dash indicates the full-text form of an option. Such options also have a short version; this one can be invoked as -np.

wget should follow links, but not beyond the last parent directory.

it won't go anywhere that is not part of the http://activehistory.ca/papers/ hierarchy


wget

how far you want to go

The default: follow each link and carry on to a limit of five pages away from the first page.

wget -l 2, which takes us to a depth of two web-pages.

Note this is a lower-case L, not a number 1.


wget

politeness

-w 10

adds a ten-second wait between server requests
you can shorten this, as ten seconds is quite long
you can also use --random-wait to let wget choose a random number of seconds to wait:

wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com

--limit-rate=20k

limits the maximum download speed to 20kb/s
opinion varies on what a good limit rate is, but you are probably good up to about 200kb/s


wget

some sites are protective

if the robots.txt does not allow you to crawl anything

use robots=off:
wget -r -p -e robots=off http://www.example.com


wget

mask user agent

if a web site checks a browser identity

$ wget -r -p -e robots=off -U mozilla http://www.example.com

$ wget --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" URL-TO-DOWNLOAD
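The same idea applies when fetching pages from Java instead of wget: set the User-Agent header on the connection before reading. The following is a minimal sketch under that assumption; the class name MaskedFetch and the chosen agent string are illustrative, not taken from the slides.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class MaskedFetch {
    public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("http://www.example.com/").openConnection();
        // present a browser-like identity; some servers block the default Java agent string
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null)
                System.out.println(line);
        }
    }
}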


wget

Increase Total Number of Retry Attempts

By default wget retries 20 times to make the download successful.

If the internet connection has problems, you may want to increase the number of tries:

wget --tries=75 DOWNLOAD-URL


wget

Download Multiple Files / URLs Using Wget -i

First, store all the download URLs in a text file, one per line.

Next, give download-file-list.txt as an argument to wget using the -i option, as shown below.

$ cat > download-file-list.txt
URL1
URL2
URL3
URL4

$ wget -i download-file-list.txt


wget

Download Only Certain File Types Using wget -r -A

You can use this in the following situations:

Download all images from a website
Download all videos from a website
Download all PDF files from a website

$ wget -r -A.pdf http://url-to-webpage-with-pdfs/


wget

download a directory

task: download all the files under the papers directory of ActiveHistory.ca.

wget -r --no-parent -w 2 --limit-rate=20k http://activehistory.ca/papers/

Note that the trailing slash on the URL is critical:

if you omit it, wget will think that papers is a file rather than a directory.

When it is done, you should have a directory labeled ActiveHistory.ca that contains the /papers/ sub-directory perfectly mirrored on your system.

This directory will appear in the location that you ran the command from on your command line.

Links will be replaced with internal links to the other pages you've downloaded, so you can actually have a fully working ActiveHistory.ca site on your computer.


wget

mirror a website using -m

If you want to mirror an entire website, there is a built-in option in wget.

This option, -m, means mirror and is especially useful for backing up an entire website.

it looks at the time stamps and does not repeat the download if the file in the local system is recent.

it supports infinite recursion (it will go as many layers into the site as necessary).

The command for mirroring ActiveHistory.ca would be:

wget -m -w 2 --limit-rate=20k http://activehistory.ca


wget

download in the background

$ wget -b http://www.openss7.org/repos/tarballs/strx25-0.9.2.1.tar.bz2
Continuing in background, pid 1984.
Output will be written to 'wget-log'.


crawler4j

crawler4j

public class BasicCrawlController {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = args[0];
        int numberOfCrawlers = Integer.parseInt(args[1]);

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setPolitenessDelay(1000);
        config.setMaxDepthOfCrawling(2);
        config.setMaxPagesToFetch(1000);
        config.setResumableCrawling(false);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.ics.uci.edu/");
        controller.start(BasicCrawler.class, numberOfCrawlers);
    }
}
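The controller above refers to a BasicCrawler class that the slide does not show. Below is a minimal sketch of what it might look like, assuming the crawler4j 3.x API (later versions pass the referring Page to shouldVisit); the filter pattern and seed domain are illustrative only.

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class BasicCrawler extends WebCrawler {
    // illustrative filter: skip obvious binary and style/script resources
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|ico|pdf|zip|gz))$");

    // decide whether a discovered URL should be scheduled for fetching
    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
    }

    // called once for every page that has been fetched and parsed
    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println("  outgoing links: " + html.getOutgoingUrls().size());
        }
    }
}

shouldVisit keeps the crawl inside the seed site and skips binary files; visit is where page content would actually be processed or stored.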


nutch

nutch overview

Apache Nutch is an open source Web crawler written in Java.

it can find Web page hyperlinks in an automated manner and reduce lots of maintenance work,

for example checking broken links,

and create a copy of all the visited pages for searching over.


nutch

Install Nutch

Option 1: Setup Nutch from a binary distribution

Download a binary package (apache-nutch-1.X-bin.zip).
Unzip your binary Nutch package. There should be a folder apache-nutch-1.X.
cd apache-nutch-1.X/

$NUTCH_RUNTIME_HOME refers to the current directory (apache-nutch-1.X/).


nutch

Set up Nutch from a source distribution

Advanced users may also use the source distribution:

Download a source package (apache-nutch-1.X-src.zip)

Unzip

cd apache-nutch-1.X/

Run ant in this folder (cf. RunNutchInEclipse)

Now there is a directory runtime/local which contains a ready-to-use Nutch installation. When the source distribution is used, ${NUTCH_RUNTIME_HOME} refers to apache-nutch-1.X/runtime/local/. Note that config files should be modified in apache-nutch-1.X/runtime/local/conf/.

ant clean will remove this directory (keep copies of modified config files)


nutch

Verify your Nutch installation

run "bin/nutch". You can confirm a correct installation if you see something similar to the following:

Usage: nutch COMMAND where command is one of:

crawl       one-step crawler for intranets (DEPRECATED)
readdb      read / dump crawl db
mergedb     merge crawldb-s, with optional filtering
readlinkdb  read / dump link db
inject      inject new urls into the database
generate    generate new segments to fetch from crawl db
freegen     generate new segments to fetch from text files
fetch       fetch a segment's pages


nutch

Some troubleshooting tips

Run the following command if you are seeing "Permission denied":
chmod +x bin/nutch

Set up JAVA_HOME if you are seeing JAVA_HOME not set.

On Mac, you can run the following command or add it to ~/.bashrc:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home

On Debian or Ubuntu, you can run the following command or add it to ~/.bashrc:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")


nutch

Crawl your first website

Nutch requires two configuration changes before a website can be crawled:

Customize your crawl properties

provide a name for your crawler so that external servers can recognize it

Set a seed list of URLs to crawl


nutch

resources

tutorial for wget: http://programminghistorian.org/lessons/automated-downloading-with-wget

Sun Yang, A Comprehensive Study of the Regulation and Behavior of Web Crawlers, PhD thesis, 2008.

Data Mining the Web Via Crawling, by Kate Matsudaira, July 26, 2012. http://cacm.acm.org/blogs/blog-cacm/153780-data-mining-the-web-via-crawling/fulltext

How to crawl a quarter billion webpages in 40 hours, by Michael Nielsen, August 10, 2012 (crawling using Amazon EC2). http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/

Beautiful Soup, a screen-scraping tool: http://www.crummy.com/software/BeautifulSoup/
