Web Crawling and Data Gathering with Apache Nutch


Apache Nutch Presentation by Steve Watt at Data Day Austin 2011

Transcript of Web Crawling and Data Gathering with Apache Nutch

Apache Nutch

Web Crawling and Data Gathering

Steve Watt - @wattsteve
IBM Big Data Lead
Data Day Austin


Topics

Introduction

The Big Data Analytics Ecosystem

Load Tooling

How is Crawl data being used?

Web Crawling - Considerations

Apache Nutch Overview

Apache Nutch Crawl Lifecycle, Setup and Demos


The Offline (Analytics) Big Data Ecosystem

(Diagram) Load Tooling brings Web Content and Your Content into Hadoop; Data Catalogs, Analytics Tooling and Export Tooling sit on top of it to Find, Analyze, Visualize and Consume the data.


Load Tooling - Data Gathering Patterns and Enablers

Web Content

– Downloading – Amazon Public DataSets / InfoChimps

– Stream Harvesting – Collecta / Roll-your-own (Twitter4J)

– API Harvesting – Roll your own (Facebook REST Query)

– Web Crawling – Nutch

Your Content

– Copy from FileSystem

– Load from Database - SQOOP

– Event Collection Frameworks - Scribe and Flume


How is Crawl data being used?

Build your own search engine

– Built-in Lucene indexes for querying

– Solr integration for multi-faceted search

Analytics

– Selective filtering and extraction with data from a single provider

– Joining datasets from multiple providers for further analytics

– Event Portal example: Is Austin really a startup town?

Extension of the mashup paradigm - “Content Providers cannot predict how their data will be re-purposed”


Web Crawling - Considerations

Robots.txt

Facebook lawsuit against API Harvester

“No Crawling without written approval” in Mint.com Terms of Use

What if the web had as many crawlers as Apache Web Servers?
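As a quick refresher (not on the original slide), robots.txt is how a site tells well-behaved crawlers what they may fetch; a minimal, purely illustrative example:

User-agent: *
Disallow: /private/

Nutch honors robots.txt by default, so paths excluded this way are skipped during the fetch phase.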


Apache Nutch – What is it?

Apache Nutch Project – nutch.apache.org – Hadoop + Web Crawler + Lucene

A Hadoop-based web crawler? How does that work?


Apache Nutch Overview

Seeds and Crawl Filters

Crawl Depths

Fetch Lists and Partitioning

Segments - Segment Reading using Hadoop

Indexing / Lucene

Web Application for Querying

Apache Nutch - Web Application

Crawl Lifecycle

Inject → Generate → Fetch → CrawlDB Update → LinkDB → Index → Dedup → Merge
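For orientation (not on the original slide), each lifecycle phase corresponds roughly to a bin/nutch subcommand in Nutch 1.x. The paths below are illustrative, and exact arguments should be checked against the bin/nutch usage output:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/<segment>
bin/nutch dedup crawl/indexes
bin/nutch merge crawl/index crawl/indexes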

Single Process Web Crawling

- Create the seed file and copy it into a “urls” directory

- Export JAVA_HOME

- Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (Usually via domain)

- Edit the conf/nutch-site.xml and specify an http.agent.name

- bin/nutch crawl urls -dir crawl -depth 2
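To make the two config edits concrete, here is a minimal sketch; the domain and agent name are illustrative, not taken from the talk.

In conf/crawl-urlfilter.txt, constrain the crawl to your target domain:

+^http://([a-z0-9]*\.)*example.com/

In conf/nutch-site.xml, identify your crawler:

<property>
  <name>http.agent.name</name>
  <value>my-test-crawler</value>
</property>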

D E M O

Distributed Web Crawling

- The Nutch distribution is overkill if you already have a Hadoop cluster. It's also not how you would really integrate with Hadoop these days, but there is some history to consider. The Nutch Wiki has a Distributed Setup guide.

- Why orchestrate your crawl?

- How?

– Create the seed file and copy it into a “urls” directory, then copy the directory up to HDFS

– Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (Usually via domain)

– Copy conf/nutch-site.xml, conf/nutch-default.xml, conf/nutch-conf.xml & conf/crawl-urlfilter.txt to the Hadoop conf directory.

– Restart Hadoop so the new files are picked up in the classpath
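A sketch of the HDFS copy step using standard Hadoop shell commands (paths are illustrative):

bin/hadoop fs -put urls urls
bin/hadoop fs -ls urls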

Distributed Web Crawling

- Code Review: org.apache.nutch.crawl.Crawl

- Orchestrated Crawl Example (Step 1 - Inject):

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls
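The remaining phases can be driven the same way. The class names below are the standard Nutch 1.x job classes and the arguments are illustrative, so treat this as a sketch rather than the exact commands from the demo:

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments
bin/hadoop jar nutch-1.2.0.job org.apache.nutch.fetcher.Fetcher crawl/segments/<segment>
bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.CrawlDb crawl/crawldb crawl/segments/<segment>
bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.LinkDb crawl/linkdb -dir crawl/segments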

D E M O

Segment Reading


Segment Readers

The SegmentReader class is not all that useful. But here it is anyway:

– bin/nutch readseg -list crawl/segments/20110128170617

– bin/nutch readseg -dump crawl/segments/20110128170617 dumpdir

What you really want to do is process each crawled page in M/R as an individual record

– SequenceFileInputFormatters over Nutch HDFS Segments FTW

– RecordReader returns Content Objects as Value

Code Walkthrough
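Along the lines of that walkthrough, here is a minimal sketch of a MapReduce job that reads a segment's content directory as SequenceFile records, one crawled page per map() call. It assumes the Nutch 1.2-era (old mapred) API; the class name and paths are illustrative, not the demo code itself:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.nutch.protocol.Content;

public class SegmentContentReader {

  // One crawled page per map() call: key = page URL, value = Nutch Content object.
  public static class PageMapper extends MapReduceBase
      implements Mapper<Text, Content, Text, Text> {
    public void map(Text url, Content content,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Emit the URL and its content type; a real job would work on
      // content.getContent(), the raw fetched bytes.
      output.collect(url, new Text(content.getContentType()));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SegmentContentReader.class);
    job.setJobName("read-nutch-segment-content");

    // args[0]: a segment content dir, e.g. crawl/segments/20110128170617/content
    // args[1]: output dir (both illustrative)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setInputFormat(SequenceFileInputFormat.class); // segment data is stored as SequenceFiles
    job.setMapperClass(PageMapper.class);
    job.setReducerClass(IdentityReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    JobClient.runJob(job);
  }
}

Run it with bin/hadoop jar against the segment's content directory, the same way as the Injector example above.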

D E M O

Thanks

Questions?

Steve Watt - swatt@us.ibm.com

Twitter: @wattsteve
Blog: stevewatt.blogspot.com

austinhug.blogspot.com