Introduction of vertical crawler

Introduction of Crawler Speaker: Jinglun

Transcript of Introduction of vertical crawler

Page 1: Introduction of vertical crawler

Introduction of Crawler

Speaker: Jinglun

Page 2: Introduction of vertical crawler

Target

• Understand the concept of a crawler
• Design and implement a crawler with good performance and flexibility

Page 3: Introduction of vertical crawler

Agenda

• Crawler Introduction
• Design
• More Challenges
• Source Code (if time permits)

Page 4: Introduction of vertical crawler

What Is a Crawler?

Page 5: Introduction of vertical crawler

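What Is a Crawler?

• Web crawler (general purpose, follows links across the whole web)
• Vertical crawler (focused on specific sites or one kind of data)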

Page 6: Introduction of vertical crawler

Example

• Get mappings from source code to yicu
  – Step 1: Find product links from codesearch
  – Step 2: Find source code links from svn
  – Step 3: Find mappings from the source code

Page 7: Introduction of vertical crawler

Requirements

• Get mappings from source code to yicu

Page 8: Introduction of vertical crawler

Format Requirements

• Business
  – Features: Find which source code uses our libs
  – Qualities: 1. High performance 2. Flexibility 3. Easy to maintain
  – Constraints: 1. Thousands of pages 2. One machine 3. Several days
• Users: Only me for now
• Developers: Only me

Page 9: Introduction of vertical crawler

Architecture

• Two directions
• Two dimensions

Page 10: Introduction of vertical crawler

Process

Page 11: Introduction of vertical crawler

Components

Page 12: Introduction of vertical crawler

Layers

View

Control

Module

General Components

Business Bus

Page 13: Introduction of vertical crawler

Layers

• Crawler (View and Control)
• Downloader, Extractor (Module)
• Storage (General components)
• The business bus is not used
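One possible way to express this layering in code is sketched below in PHP. The interface names and method signatures (`Downloader::fetch`, `Extractor::extract`, `Storage::save`) are assumptions made for illustration and are not taken from the talk. The fake downloader and in-memory storage exist only so the sketch runs without a network.

```php
<?php
// Hypothetical interfaces: names and signatures are illustrative assumptions.
interface Downloader { public function fetch(string $url): string; }
interface Extractor  { public function extract(string $page): array; }
interface Storage    { public function save(string $key, string $data): void; }

// The crawler plays the View/Control role: it only coordinates the other parts.
final class Crawler
{
    private $downloader;
    private $extractor;
    private $storage;

    public function __construct(Downloader $downloader, Extractor $extractor, Storage $storage)
    {
        $this->downloader = $downloader;
        $this->extractor  = $extractor;
        $this->storage    = $storage;
    }

    public function run(array $seedUrls): void
    {
        foreach ($seedUrls as $url) {
            $page = $this->downloader->fetch($url);
            $this->storage->save("page:$url", $page);
            $meta = $this->extractor->extract($page);
            $this->storage->save("meta:$url", json_encode($meta));
        }
    }
}

// Stand-ins used only for this demonstration.
final class FakeDownloader implements Downloader
{
    public function fetch(string $url): string
    {
        return "<a href=\"$url/next\">next</a>";
    }
}

final class HrefExtractor implements Extractor
{
    public function extract(string $page): array
    {
        preg_match_all('/href="([^"]*)"/', $page, $matches);
        return $matches[1];
    }
}

final class MemoryStorage implements Storage
{
    public $items = [];

    public function save(string $key, string $data): void
    {
        $this->items[$key] = $data;
    }
}

$storage = new MemoryStorage();
$crawler = new Crawler(new FakeDownloader(), new HrefExtractor(), $storage);
$crawler->run(['http://example.test']);
```

Because the crawler depends only on the three interfaces, a real cURL downloader or a file-based storage could replace the fakes without touching `Crawler::run`.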

Page 14: Introduction of vertical crawler

Other views

• Running view
• Deploy view
• Data view
• Develop view
• …

Page 16: Introduction of vertical crawler

Develop View

• Trunk/
  – Src/
  – Test/
  – Bin/

Page 17: Introduction of vertical crawler

Review Design

• Business
  – Features: Find which source code uses our libs
  – Qualities: 1. High performance 2. Flexibility 3. Easy to maintain
  – Constraints: 1. Thousands of pages 2. One machine 3. Several days
• Users: Only me for now
• Developers: Only me

Page 18: Introduction of vertical crawler

Solutions

• Crawler
• Downloader
• Storage
• Extractor

Page 19: Introduction of vertical crawler

Solutions

• Crawler
• Downloader (One API)
• Storage
• Extractor
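A downloader with a single entry point might look like the following PHP sketch. It assumes the cURL extension is available. The function names `download` and `splitResponse`, the timeout parameter, and the shape of the returned array are illustrative assumptions, not the speaker's actual API.

```php
<?php
// Split a raw response (headers followed by body) at the header size
// reported by cURL. Pure function, so it can be checked without a network.
function splitResponse(string $raw, int $headerSize): array
{
    return [
        'header' => substr($raw, 0, $headerSize),
        'body'   => (string) substr($raw, $headerSize),
    ];
}

// The single API of the downloader: URL in, header/body/info out.
// Returns null on failure so the caller can record the URL and move on.
function download(string $url, int $timeoutSeconds = 30): ?array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the response as a string
        CURLOPT_HEADER         => true,  // include response headers in it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $raw  = curl_exec($ch);
    $info = curl_getinfo($ch);
    if ($raw === false) {
        return null;
    }
    return splitResponse($raw, $info['header_size']) + ['info' => $info];
}

// Offline check of the splitting step with a hand-written response.
$raw   = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>";
$parts = splitResponse($raw, strpos($raw, "\r\n\r\n") + 4);
```

Hiding cURL behind one function keeps the rest of the crawler independent of how pages are fetched, which also makes it easy to swap in a version that reads saved pages from disk.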

Page 20: Introduction of vertical crawler

Crawler

DFS or BFS?

Page 21: Introduction of vertical crawler

Crawler

DFS or BFS? BFS
1. The deeper a page is, the less important it tends to be
2. Many paths can lead to a given page
3. Simpler to extend to a distributed crawler
4. More efficient to develop
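The breadth-first order can be sketched with a FIFO queue and a visited set. In this PHP sketch the link graph is a hard-coded array standing in for real pages, and `bfsCrawl`, `$getLinks` and `$maxDepth` are illustrative names rather than anything from the talk. The visited set addresses point 2: a page reachable by several paths is still processed only once.

```php
<?php
// Breadth-first traversal: pages are visited level by level from the seeds.
// $getLinks is any callable that returns the outgoing links of a URL.
function bfsCrawl(array $seeds, callable $getLinks, int $maxDepth): array
{
    $queue   = new SplQueue();
    $visited = [];   // url => true, for constant-time duplicate checks
    $order   = [];   // URLs in the order they were processed

    foreach ($seeds as $seed) {
        if (!isset($visited[$seed])) {
            $visited[$seed] = true;
            $queue->enqueue([$seed, 0]);
        }
    }

    while (!$queue->isEmpty()) {
        [$url, $depth] = $queue->dequeue();
        $order[] = $url;
        if ($depth >= $maxDepth) {
            continue;   // do not expand pages at the depth limit
        }
        foreach ($getLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $visited[$link] = true;
                $queue->enqueue([$link, $depth + 1]);
            }
        }
    }
    return $order;
}

// Toy link graph (an assumption, for demonstration only).
$graph = ['a' => ['b', 'c'], 'b' => ['d'], 'c' => ['d'], 'd' => ['e']];
$getLinks = function (string $url) use ($graph): array {
    return $graph[$url] ?? [];
};
$order = bfsCrawl(['a'], $getLinks, 2);
```

With a depth limit of 2, page `d` is reached through both `b` and `c` but processed once, and `e` is never visited because it lies one level beyond the limit.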

Page 22: Introduction of vertical crawler

Crawler

foreach ($seed_urls as $url) {
    $page = GetPage($url);       // download, or read from file
    Save($page);                 // keep the raw page
    $meta_info = Parse($page);   // extract the meta info
    Save($meta_info);
}

Page 23: Introduction of vertical crawler

Storage

• What should be stored?
  – URLs?
  – Meta info?
  – HTTP response (header, body, curl_info)?
• How should it be stored?
  – MySQL?
  – File system?

Page 24: Introduction of vertical crawler

Storage

Hash table
• Index
• Physical store

Page 25: Introduction of vertical crawler

Storage

Hash table
• Index
  – md5
• Physical store
  – Header file, body file, curl_info file

Page 26: Introduction of vertical crawler

Storage

• Basedir/
  – header_093e3575e895287cf471e6d5f5028446
    body_093e3575e895287cf471e6d5f5028446
    info_093e3575e895287cf471e6d5f5028446
  – header_66e7c612cf23049fa731c831bcee9048
    body_66e7c612cf23049fa731c831bcee9048
    info_66e7c612cf23049fa731c831bcee9048
  – …
  – meta_info.txt
  – failed_download_url.txt
  – failed_extract_page.txt
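The layout above can be sketched as a small PHP class that uses `md5($url)` as the index and writes three files per page. The class name `PageStore`, its method names, and the choice of JSON for the info file are assumptions for illustration. The slides do not say which URL string is hashed or how the info file is encoded.

```php
<?php
// File-system page store keyed by md5(url). Class and method names are
// hypothetical; JSON for the info file is an assumed encoding.
final class PageStore
{
    private $baseDir;

    public function __construct(string $baseDir)
    {
        if (!is_dir($baseDir) && !mkdir($baseDir, 0777, true)) {
            throw new RuntimeException("Cannot create $baseDir");
        }
        $this->baseDir = rtrim($baseDir, '/');
    }

    // Build e.g. <baseDir>/body_<md5 of url>.
    private function path(string $kind, string $url): string
    {
        return $this->baseDir . '/' . $kind . '_' . md5($url);
    }

    // The hash lookup: a page is stored if its body file exists.
    public function has(string $url): bool
    {
        return is_file($this->path('body', $url));
    }

    public function save(string $url, string $header, string $body, array $info): void
    {
        file_put_contents($this->path('header', $url), $header);
        file_put_contents($this->path('body', $url), $body);
        file_put_contents($this->path('info', $url), json_encode($info));
    }

    public function loadBody(string $url): ?string
    {
        return $this->has($url) ? file_get_contents($this->path('body', $url)) : null;
    }
}

// Demonstration in a throwaway temp directory.
$store = new PageStore(sys_get_temp_dir() . '/crawler_demo_' . uniqid());
$store->save('http://example.test/', "HTTP/1.1 200 OK\r\n", '<html></html>', ['http_code' => 200]);
```

Since the file name is derived from the URL alone, checking whether a page was already downloaded is a single `is_file` call, which lets an interrupted crawl resume without a database.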

Page 27: Introduction of vertical crawler

Extractor

• DOM tree
• Regular expressions
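Both approaches can be compared on the task of pulling links out of a page. The PHP sketch below uses the built-in `DOMDocument` and `DOMXPath` classes for the DOM route and `preg_match_all` for the regular-expression route. The function names and the sample HTML are assumptions for illustration.

```php
<?php
// DOM-tree extraction: parse the HTML, then query it with XPath.
function extractLinksDom(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);  // tolerate sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ((new DOMXPath($doc))->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Regular-expression extraction: quick to write, but this pattern only
// matches double-quoted href attributes.
function extractLinksRegex(string $html): array
{
    preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
    return $matches[1];
}

$html = '<html><body><a href="/a">A</a> <a class="x" href="/b">B</a></body></html>';
$domLinks   = extractLinksDom($html);
$regexLinks = extractLinksRegex($html);
```

On well-formed input the two agree. The DOM route tends to cope better with unusual markup, while a regular expression is often enough for a single known page template.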

Page 28: Introduction of vertical crawler

Review Design

• Business
  – Features: Find which source code uses our libs
  – Qualities: 1. High performance 2. Flexibility 3. Easy to maintain
  – Constraints: 1. Thousands of pages 2. One machine 3. Several days
• Users: Only me for now
• Developers: Only me

Page 29: Introduction of vertical crawler

Performance Issue?

• Multiple threads?
• Multiple processes?
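If the multi-process route is taken, one simple scheme is to split the URL list into disjoint shards and start one crawler process per shard. The PHP sketch below shows only the sharding step. The function name `shardUrls`, the use of `crc32`, and the idea of giving each process a worker index are assumptions, not the design presented in the talk.

```php
<?php
// Assign each URL to exactly one of $workers shards by hashing it.
// The mask keeps the hash non-negative on 32-bit builds of PHP.
function shardUrls(array $urls, int $workers): array
{
    $shards = array_fill(0, $workers, []);
    foreach ($urls as $url) {
        $index = (crc32($url) & 0x7fffffff) % $workers;
        $shards[$index][] = $url;
    }
    return $shards;
}

$urls = [
    'http://example.test/a',
    'http://example.test/b',
    'http://example.test/c',
    'http://example.test/d',
];
$shards = shardUrls($urls, 3);
// A worker process started with index $i would then crawl only $shards[$i].
```

Hashing the URL, rather than slicing the list by position, means a given URL always lands on the same worker, so two processes never download the same page even when the input list changes order.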

Page 30: Introduction of vertical crawler

Data Analysis

• Vim
• Shell (awk, sed)
• Regular expressions

Page 31: Introduction of vertical crawler

More Challenges

• Distributed crawling
• Noise
• Duplication
• Quick updates
• Concurrency and performance
• …