Introducing a Parallel Distributed Crawler with Dynamic HTML Scraping

Posted on 10-May-2015

Introducing a parallel distributed crawler that supports dynamic HTML scraping. Sapporo RubyKaigi 02

Transcript of Introducing a Parallel Distributed Crawler with Dynamic HTML Scraping

Introducing the parallel distributed crawler with dynamic HTML scraping

Before that,

✓Kei Shiratsuchi (白土 慧)

✓id: kei-s

✓@kei_s

✓Born in Sapporo, now living in Tokyo

✓RubyKaigi2009 staff w/ Ruby Sapporo

✓I like Ruby & JavaScript

Sponsored by Science and Engineering

I want ...

A lot of web data!!

How to gather it?

✓Web API

✓HTML Scraping

How to scrape HTML?

✓Mechanize & Nokogiri (鋸)

But,

How about Dynamic HTML?

in HTML source

The parallel distributed crawler with dynamic HTML scraping

Greasi

http://github.com/kei-s/greasi

DEMO

Outline of Crawler

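[Diagram: the server sends a URL to each client; the client accesses the web page, reads data from the DOM, and sends the data back to the server, which replies with the next URL.]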

Server Side: Requirements

✓Receive and Store data

My Choice

Client Side: Requirements

✓Evaluate Dynamic HTML like a browser

What would you choose?

My Choice

How does it work?

Server Side: Code Snippets

require 'rubygems'
require 'sinatra'

post '/' do
  url = params[:url]
  data = params[:data]
  store(url, data)
  next_url = process(url)
  next_url
end
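The slide does not show `store` or `process`. A minimal in-memory sketch of what they might do is given here, assuming the server keeps a list of URLs still to visit; the `CrawlState` class, its seed list, and the empty-string "nothing left" reply are all hypothetical and not taken from Greasi.

```ruby
require 'set'

# Hypothetical stand-in for the state behind store/process in the route above.
class CrawlState
  attr_reader :pages

  def initialize(seed_urls)
    @frontier = seed_urls.dup   # URLs still to hand out (assumed seed list)
    @seen     = Set.new         # URLs a client has already reported
    @pages    = {}              # url => data as posted by the client
  end

  # Keep the payload a client posted for a URL.
  def store(url, data)
    @pages[url] = data
  end

  # Mark the reported URL as done and return the next unseen URL.
  # Returns an empty string when the frontier is exhausted.
  def process(url)
    @seen << url
    next_url = @frontier.shift
    next_url = @frontier.shift while next_url && @seen.include?(next_url)
    next_url.to_s
  end
end
```

In the Sinatra route, `store(url, data)` and `process(url)` would then delegate to a single shared `CrawlState` instance. A real deployment would also have to decide what the client should do when no URL is left.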

Client Side: Code Snippets

// ==UserScript==
// @name greasi_scraper
// @namespace http://libelabo.jp/
// @include http://images.google.co.jp/*
// @require http://ajax.googleapis.com/ajax/libs/jquery/1.3.1/jquery.min.js
// ==/UserScript==

function postData(data) {
  var postData = $.param({url: location.href, data: JSON.stringify(data)});
  GM_xmlhttpRequest({
    method: "POST",
    url: "http://libelabo.jp/greasi/",
    headers: {'Content-type':'application/x-www-form-urlencoded'},
    data: postData,
    onload: function(xhr){ location.href = xhr.responseText }
  });
}
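The userscript sends a form-encoded body with two fields, `url` and `data`, where `data` is a JSON string. A small Ruby sketch of decoding such a body on the receiving side follows; the `decode_post` helper and the sample payload are illustrative assumptions, not part of the original code.

```ruby
require 'uri'
require 'json'

# Decode a form-encoded POST body of the shape the userscript sends:
#   url=<page URL>&data=<JSON string>
# decode_post is a hypothetical helper name.
def decode_post(body)
  fields = URI.decode_www_form(body).to_h
  { url: fields['url'], data: JSON.parse(fields['data']) }
end

# Build a sample body the same way $.param would (made-up payload).
body = URI.encode_www_form(
  'url'  => 'http://images.google.co.jp/images?q=ruby',
  'data' => JSON.generate('titles' => ['a', 'b'])
)

decoded = decode_post(body)
puts decoded[:url]
puts decoded[:data]['titles'].inspect
```

Inside a Sinatra route the form decoding is already done, so `params[:data]` arrives as the JSON string and only the `JSON.parse` step would be needed.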

How to Parallelize?

Add Tabs :)
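Adding tabs works because every tab asks the same server for its next URL, so the server's URL list acts as a shared work queue. The idea can be simulated in Ruby, with one thread standing in for each browser tab and a `Queue` standing in for the server; `crawl_in_parallel` and the block passed to it are hypothetical names for illustration only.

```ruby
require 'thread'

# Simulate N tabs pulling URLs from one shared queue.
# The block plays the role of "load the page and scrape the DOM".
def crawl_in_parallel(urls, tabs, &scrape)
  queue = Queue.new
  urls.each { |u| queue << u }
  results = Queue.new            # thread-safe collection of results

  workers = Array.new(tabs) do |tab_id|
    Thread.new do
      loop do
        url = begin
          queue.pop(true)        # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break                  # no URLs left, this tab stops
        end
        results << [tab_id, url, scrape.call(url)]
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

urls = (1..6).map { |i| "http://example.com/page#{i}" }
out  = crawl_in_parallel(urls, 3) { |url| url.length }
out.each { |tab, url, data| puts "tab #{tab}: #{url} -> #{data}" }
```

Each URL is handed to exactly one tab, which is the property the real server would also need so that tabs do not duplicate work.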

How to Distribute?

Install Firefox :)

Summary

✓With nice products,

✓Make it quickly and easily (“サクッと”)!

✓Use it quickly and easily (“サクッと”)!

Thank you!