Introduction to the Parallel Distributed Crawler with Dynamic HTML Scraping

Description

Introduction to a parallel distributed crawler that supports dynamic HTML scraping. Sapporo RubyKaigi 02 (札幌Ruby会議02).

Transcript

Page 1:

Introduction to the parallel distributed crawler with dynamic HTML scraping

Page 2:

Before that,

Kei Shiratsuchi (シラツチ ケイ)

✓ 白土 慧
✓ id: kei-s
✓ @kei_s

Page 3:

Before that,

✓ Born in Sapporo, now living in Tokyo
✓ RubyKaigi2009 staff, with Ruby Sapporo (Ruby札幌)
✓ I like Ruby & JavaScript

Page 4:

Science and Engineering

Sponsored by

Page 5:

I want ...

Page 6:

A lot of web data!!

Page 7:

How to gather it?

✓ Web API
✓ HTML Scraping

How to scrape HTML?

✓ Mechanize & Nokogiri (鋸)

Page 8:

But,

What about Dynamic HTML?

Page 9:

Page 10:

in HTML source

Page 11:

The parallel distributed crawler with dynamic HTML scraping

Greasi
http://github.com/kei-s/greasi

Page 12:

DEMO

Page 13:

Outline of Crawler

[Diagram: the Server sends a URL to each of the Clients; each Client accesses the Web Page, reads its DOM, and sends the data back to the Server.]

Page 14:

Server Side: Requirements

✓ Receive and store data

Page 15:

My Choice

Page 16:

Client Side: Requirements

✓ Evaluate dynamic HTML like a browser

What would you choose?

Page 17:

My Choice

Page 18:

How does it work?

Page 19:

Server Side: Code Snippets

require 'rubygems'
require 'sinatra'

# Each client POSTs the page it visited and the data it scraped.
post '/' do
  url = params[:url]
  data = params[:data]
  store(url, data)          # helper not shown on the slide
  next_url = process(url)   # helper not shown on the slide
  next_url                  # the response body is the next URL to visit
end
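The server snippet calls `store` and `process` without showing them. A minimal sketch of what they might look like, assuming an in-memory Hash for the scraped data and a plain list of pending URLs. `STORE`, `VISITED`, `PENDING`, and the example.com URLs are hypothetical and are not taken from Greasi itself.

```ruby
require 'json'
require 'set'

# Hypothetical in-memory storage; the slides do not show how data is kept.
STORE = {}
# Hypothetical bookkeeping of URLs that clients have already reported.
VISITED = Set.new
# Hypothetical list of URLs still waiting to be crawled.
PENDING = [
  'http://example.com/page/2',
  'http://example.com/page/3'
]

# Save the scraped data for a URL. The client sends it JSON-encoded,
# so it is decoded before being stored.
def store(url, data)
  STORE[url] = JSON.parse(data)
end

# Pick the URL the client should visit next: the first pending URL that
# has not been visited yet, or an empty string when nothing is left.
def process(url)
  VISITED << url
  next_url = PENDING.find { |candidate| !VISITED.include?(candidate) }
  PENDING.delete(next_url)
  next_url.to_s
end
```

With these definitions, each POST hands the client one pending URL until the list runs out. How the client should react to an empty response is not covered by the slides.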

Page 20:

Client Side: Code Snippets

// ==UserScript==
// @name        greasi_scraper
// @namespace   http://libelabo.jp/
// @include     http://images.google.co.jp/*
// @require     http://ajax.googleapis.com/ajax/libs/jquery/1.3.1/jquery.min.js
// ==/UserScript==

// Send the scraped data to the server, then move on to the URL it returns.
function postData(data) {
  var body = $.param({url: location.href, data: JSON.stringify(data)});
  GM_xmlhttpRequest({
    method: "POST",
    url: "http://libelabo.jp/greasi/",
    headers: {'Content-type': 'application/x-www-form-urlencoded'},
    data: body,
    onload: function(xhr) {
      // The response body is the next URL to crawl.
      location.href = xhr.responseText;
    }
  });
}
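The client sends a form-encoded body with two fields, `url` and `data`, where `data` holds a JSON string. A rough sketch of that wire format in Ruby, using only the standard library, may help when reading the server side. The helper names `encode_post_body` and `decode_post_body` and the example values are assumptions for illustration, not part of Greasi.

```ruby
require 'json'
require 'uri'

# Build the same kind of body the userscript sends: two form fields,
# "url" and "data", with the scraped data JSON-encoded inside "data".
def encode_post_body(page_url, scraped)
  URI.encode_www_form({ 'url' => page_url, 'data' => JSON.generate(scraped) })
end

# Reverse step, roughly what happens before store(url, data) runs:
# split the form fields, then parse the JSON held in "data".
def decode_post_body(body)
  fields = URI.decode_www_form(body).to_h
  [fields['url'], JSON.parse(fields['data'])]
end

# Made-up example values.
body = encode_post_body('http://images.google.co.jp/images?q=ruby',
                        { 'images' => ['http://example.com/a.png'] })
url, data = decode_post_body(body)
puts url
puts data['images'].first
```

In the Sinatra snippet the form decoding is already done by the framework, so `params[:data]` arrives as the raw JSON string and only the JSON parsing step would remain.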

Page 21:

How to Parallelize?

Page 22:

How to Parallelize?

Add Tabs :)

Page 23:

How to Distribute?

Page 24:

How to Distribute?

Install Firefox :)
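Adding tabs or Firefox instances means several clients ask the same server for URLs at once, so the pool of pending URLs has to be safe to share. A minimal sketch of that idea, assuming Ruby's thread-safe `Queue` holds the pending URLs and threads stand in for browser tabs. The `crawl_with` helper and the example.com URLs are hypothetical, and this is a simulation rather than how Greasi is implemented.

```ruby
# Simulate several clients (tabs or Firefox instances) pulling URLs from
# one shared pool. Threads stand in for browsers in this sketch.
def crawl_with(client_count, urls)
  pending = Queue.new          # thread-safe FIFO from Ruby's core library
  urls.each { |u| pending << u }
  results = Queue.new          # collects [client_id, url] pairs safely

  clients = Array.new(client_count) do |id|
    Thread.new do
      loop do
        url = begin
          pending.pop(true)    # non-blocking pop; raises ThreadError if empty
        rescue ThreadError
          break                # nothing left, so this client stops
        end
        results << [id, url]   # a real client would scrape the page here
      end
    end
  end
  clients.each(&:join)

  # Drain the results queue into a plain Array.
  Array.new(results.size) { results.pop }
end

urls = (1..10).map { |n| "http://example.com/page/#{n}" }
visited = crawl_with(3, urls)
puts visited.map(&:last).sort == urls.sort
```

Because every URL is popped from the shared queue exactly once, adding more clients changes only how the work is split, not which pages get crawled.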

Page 25:

Summary

✓ With nice products,

Page 26:

Summary

✓ Make it quickly (“サクッと”)!
✓ Use it quickly (“サクッと”)!

Page 27:

Thank you!