Introduction to the Parallel Distributed Crawler with Dynamic HTML Scraping
kei-shiratsuchi
Category: Technology
Introducing a parallel distributed crawler that supports dynamic HTML scraping
Before that,
Kei Shiratsuchi (白土 慧)
✓ id: kei-s
✓ @kei_s
✓ Born in Sapporo, now living in Tokyo
✓ RubyKaigi2009 staff, with Ruby Sapporo
✓ I like Ruby & JavaScript
Sponsored by Science and Engineering
I want ...
A lot of web data!!
How to gather it?
✓ Web API
✓ HTML Scraping
How to scrape HTML?
✓Mechanize & Nokogiri (鋸)
But,
What about dynamic HTML?
in HTML source
The parallel distributed crawler that scrapes dynamic HTML
Greasi
http://github.com/kei-s/greasi
DEMO
Outline of Crawler
[Diagram: Server and Clients. The Server sends each Client a URL; the Client accesses the Web Page, reads its DOM, and sends the data back to the Server.]
Server Side : Requirements
✓Receive and Store data
My Choice
Client Side : Requirements
✓Evaluate Dynamic HTML like browser
What would you choose?
My Choice
How does it work?
Server Side: Code Snippets
require 'rubygems'
require 'sinatra'

post '/' do
  url = params[:url]
  data = params[:data]
  store(url, data)
  next_url = process(url)
  next_url
end
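The snippet above calls store and process without showing them. As a rough illustration only, here is a minimal sketch of what such helpers could look like, assuming everything is kept in memory. The names SCRAPED, FRONTIER and LOCK are hypothetical and do not come from Greasi, and the empty-string return value is just a placeholder for "no more work".

```ruby
require 'thread'

# Hypothetical in-memory state; the talk does not show how Greasi keeps its data.
SCRAPED  = {}          # url => data posted by a client
FRONTIER = Queue.new   # URLs still waiting to be visited
LOCK     = Mutex.new   # guards SCRAPED when several clients post at once

# Record what a client scraped from `url`.
def store(url, data)
  LOCK.synchronize { SCRAPED[url] = data }
end

# Pick the URL the posting client should visit next.
# Returns an empty string (a placeholder) when the frontier is empty.
def process(_url)
  FRONTIER.pop(true)   # non-blocking pop
rescue ThreadError     # raised by Queue#pop(true) on an empty queue
  ''
end
```

With helpers like these, seeding the crawl would just mean pushing start URLs onto FRONTIER before the first client posts. The client script would also need to check for the empty response before navigating.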
Client Side: Code Snippets
// ==UserScript==
// @name greasi_scraper
// @namespace http://libelabo.jp/
// @include http://images.google.co.jp/*
// @require http://ajax.googleapis.com/ajax/libs/jquery/1.3.1/jquery.min.js
// ==/UserScript==

function postData(data) {
  var postData = $.param({url: location.href, data: JSON.stringify(data)});
  GM_xmlhttpRequest({
    method: "POST",
    url: "http://libelabo.jp/greasi/",
    headers: {'Content-type':'application/x-www-form-urlencoded'},
    data: postData,
    onload: function(xhr){ location.href = xhr.responseText }
  });
}
How to Parallelize?
Add Tabs :)
How to Distribute?
Install Firefox :)
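Why do more tabs or more Firefox installs scale the crawl? Each tab is an independent client that asks the server for its next URL, so the server only has to hand every URL out once. The sketch below simulates that idea with threads standing in for browser tabs. It is an illustration under that assumption, not Greasi's actual code, and crawl_with_clients is a hypothetical name.

```ruby
require 'thread'

# Simulate `client_count` independent clients (tabs or machines)
# draining one shared frontier of URLs.
# Returns an array of [client_id, url] pairs, one per URL.
def crawl_with_clients(urls, client_count)
  frontier = Queue.new
  urls.each { |u| frontier << u }
  visited = Queue.new   # thread-safe log of who visited what

  clients = Array.new(client_count) do |id|
    Thread.new do
      loop do
        url = begin
          frontier.pop(true)   # non-blocking: ask for the next URL
        rescue ThreadError     # frontier is empty, this client is done
          break
        end
        visited << [id, url]   # stands in for "scrape and post the data"
      end
    end
  end
  clients.each(&:join)

  Array.new(visited.size) { visited.pop }
end
```

Calling crawl_with_clients(%w[a b c d e], 3) visits every URL exactly once, although which client gets which URL varies from run to run.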
Summary
✓ With nice products,
✓ Make it “サクッと” (quick and easy)!
✓ Use it “サクッと” (quick and easy)!
Thank you!