Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Transcript
Page 1: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Introduction of the parallel distributed crawler with dynamic HTML scraping support

Page 2: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Before that,

Kei Shiratsuchi (白土 慧)

✓id: kei-s

✓@kei_s

Page 3: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Before that,

✓Born in Sapporo, now living in Tokyo

✓RubyKaigi2009 staff, w/ Ruby Sapporo (Ruby札幌)

✓I like Ruby & JavaScript

Page 4: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Sponsored by Science and Engineering

Page 5: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

I want ...

Page 6: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

A lot of web data!!

Page 7: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

How to gather it?

✓Web API

✓HTML Scraping

How to scrape HTML?

✓Mechanize & Nokogiri (鋸, "saw")

Page 8: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

But,

How about Dynamic HTML?

Page 9: Introduce of the parallel distributed Crawler with scraping Dynamic HTML
Page 10: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

in HTML source
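
To make the problem concrete, here is a minimal Ruby sketch. The sample page, the "rendered" DOM, and the `image_sources` helper are all invented for illustration, and the regular expression only stands in for a real parser such as Nokogiri. It shows that scanning the raw HTML source finds no `<img>` elements when a script creates them in the browser.

```ruby
# Hypothetical page: the image list is built by JavaScript at load time,
# so the raw source contains no <img> tags.
STATIC_SOURCE = <<~HTML
  <html>
    <body>
      <div id="results"></div>
      <script>
        var urls = ["a.png", "b.png"];
        var box = document.getElementById("results");
        for (var i = 0; i < urls.length; i++) {
          var img = document.createElement("img");
          img.src = urls[i];
          box.appendChild(img);
        }
      </script>
    </body>
  </html>
HTML

# Collect the src attributes of <img> tags that appear literally in the markup.
def image_sources(html)
  html.scan(/<img[^>]*\bsrc="([^"]*)"/i).flatten
end

# What a browser would hold in its DOM after running the script (assumed).
RENDERED_DOM = STATIC_SOURCE.sub(
  '<div id="results"></div>',
  '<div id="results"><img src="a.png"><img src="b.png"></div>'
)

puts image_sources(STATIC_SOURCE).inspect  # prints []
puts image_sources(RENDERED_DOM).inspect   # prints ["a.png", "b.png"]
```

A static scraper only ever sees the first string; the second exists only inside something that evaluates the script, which is why the client side of the crawler has to behave like a browser.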

Page 11: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

The parallel distributed crawler with scraping of dynamic HTML

Greasi

http://github.com/kei-s/greasi

Page 12: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

DEMO

Page 13: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Outline of Crawler

[Diagram: the Server and the Clients exchange URLs and data; each Client accesses the Web Page and reads its DOM.]

Page 14: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Server Side : Requirements

✓Receive and Store data

Page 15: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

My Choice

Page 16: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Client Side : Requirements

✓Evaluate Dynamic HTML like browser

What would you choose?

Page 17: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

My Choice

Page 18: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

How does it work?

Page 19: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Server Side: Code Snippets

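require 'rubygems'
require 'sinatra'

# Each client POSTs the page it visited together with the scraped data.
# store and process are not shown on the slide.
post '/' do
  url = params[:url]
  data = params[:data]
  store(url, data)
  next_url = process(url)
  next_url  # the response body tells the client where to go next
end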
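
The slide does not show `store` or `process`. As a rough sketch of what they might look like, the following keeps everything in memory. The names `STORE`, `FRONTIER`, and `VISITED` and the example URLs are assumptions made for this illustration, not Greasi's actual implementation.

```ruby
require 'set'

# Hypothetical in-memory state; a real crawler would likely persist this.
STORE    = {}                        # url => scraped data
FRONTIER = ['http://example.com/a',  # URLs waiting to be crawled (made up)
            'http://example.com/b']
VISITED  = Set.new

# Save the data a client scraped from url.
def store(url, data)
  STORE[url] = data
end

# Mark url as done and return the next unvisited URL,
# or an empty string when nothing is left.
def process(url)
  VISITED << url
  next_url = FRONTIER.shift
  next_url = FRONTIER.shift while next_url && VISITED.include?(next_url)
  next_url.to_s
end
```

Returning a plain string matters here because the route's return value becomes the response body, and the client navigates to whatever that body contains.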

Page 20: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Client Side: Code Snippets

// ==UserScript==
// @name greasi_scraper
// @namespace http://libelabo.jp/
// @include http://images.google.co.jp/*
// @require http://ajax.googleapis.com/ajax/libs/jquery/1.3.1/jquery.min.js
// ==/UserScript==

// Send the scraped data to the server, then move on to the URL it returns.
function postData(data) {
  var postData = $.param({url: location.href, data: JSON.stringify(data)});
  GM_xmlhttpRequest({
    method: "POST",
    url: "http://libelabo.jp/greasi/",
    headers: {'Content-type': 'application/x-www-form-urlencoded'},
    data: postData,
    onload: function(xhr) { location.href = xhr.responseText; }
  });
}
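
On the receiving end, the `data` field arrives as a JSON string inside a form-encoded body. Sinatra already form-decodes `params`, so only the JSON step would be needed there; the sketch below does both steps by hand with the standard library so that it runs without Sinatra. The `decode_post` helper and the sample payload are assumptions for illustration.

```ruby
require 'uri'
require 'json'

# Decode an application/x-www-form-urlencoded body like the one the
# userscript sends, and return [page_url, parsed_data].
def decode_post(body)
  params = URI.decode_www_form(body).to_h
  [params['url'], JSON.parse(params['data'])]
end

# Build a body similar to what $.param produces (the payload is made up).
payload = { 'images' => ['http://example.com/1.png'] }
body = URI.encode_www_form(
  'url'  => 'http://images.google.co.jp/images?q=ruby',
  'data' => JSON.generate(payload)
)

page_url, data = decode_post(body)
puts page_url
puts data['images'].first
```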

Page 21: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

How to Parallelize?

Page 22: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

How to Parallelize?

Add Tabs :)
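
Adding tabs works as long as the server hands each URL to exactly one client. The slides do not show how that is coordinated, so the following is only a sketch under that assumption: each Ruby thread stands in for one browser tab, and a thread-safe `Queue` plays the role of the server's URL list. The `crawl_in_parallel` name and the example URLs are hypothetical.

```ruby
# Each thread stands in for one browser tab (an assumption for illustration).
# Queue is thread-safe, so every URL is handed to exactly one tab.
def crawl_in_parallel(urls, tab_count)
  frontier = Queue.new
  urls.each { |u| frontier << u }
  frontier.close                      # no more URLs will be added

  results = Queue.new
  tabs = Array.new(tab_count) do |tab_id|
    Thread.new do
      # pop returns nil once the closed queue is empty
      while (url = frontier.pop)
        results << [tab_id, url]      # a real tab would scrape and POST here
      end
    end
  end
  tabs.each(&:join)

  Array.new(results.size) { results.pop }
end

urls = (1..10).map { |i| "http://example.com/page#{i}" }
done = crawl_in_parallel(urls, 3)
puts done.map(&:last).sort == urls.sort  # prints true
```

Which tab handles which URL varies from run to run, but every URL is processed once, and raising the tab count needs no change on the server side.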

Page 23: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

How to Distribute?

Page 24: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

How to Distribute?

Install Firefox :)

Page 25: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Summary

✓With Nice Products,

Page 26: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Summary

✓Make it quickly and easily (“サクッと”)!

✓Use it quickly and easily (“サクッと”)!

Page 27: Introduce of the parallel distributed Crawler with scraping Dynamic HTML

Thank you!