Crawling the Hidden Web


Page 1: Crawling the Hidden Web

Crawling the Hidden Web

Sriram Raghavan, Hector Garcia-Molina

Computer Science Department, Stanford University

Reviewed by Pankaj Kumar

Page 2: Crawling the Hidden Web

Introduction

What are web crawlers? Programs that traverse the Web graph in a structured manner, retrieving web pages.

Are they really crawling the whole web graph?

Their target: the Publicly Indexable Web (PIW)

They are missing something…

04/21/23 Crawling Hidden Web 2

Page 3: Crawling the Hidden Web

What about results that can only be obtained through:
• Search forms
• Web pages that need authorization

Let’s face the truth:
• Size of the hidden web with respect to the PIW
• High-quality information is present out there. Examples: Patents & Trademark Office, news media


Page 4: Crawling the Hidden Web

Now…

The Goal:
• To create a web crawler that can crawl and extract information from hidden databases.
• Indexing, analysis and mining of hidden web content.

But the path is not easy:
• Automatic parsing and processing of form-based interfaces.
• Providing input to the forms, in the form of search queries.


Page 5: Crawling the Hidden Web

Our approach:

a. Task-specificity
• Resource Discovery (NOT the focus of this paper)
• Content Extraction

b. Human Assistance: it is critical, as it
• enables the crawler to use relevant values.
• gathers additional potential values.


Page 6: Crawling the Hidden Web

Hidden Web Crawlers

A new operational model, developed at Stanford University.

First of all…
• How a user interacts with a web form:


Page 7: Crawling the Hidden Web

• Now, how a crawler should interact with a web form:

• Wait… what is this all about? Let’s understand the terminology first. That will help us.

Page 8: Crawling the Hidden Web

Terminologies:

• Form Page: The actual web page containing the form.
• Response Page: The page received in response to a form submission.
• Internal Form Representation: Created by the crawler for a given web form, F.
    F = ({E1, E2, …, En}, S, M)
• Task-specific Database (D): The information that the crawler needs to formulate search queries relevant to its task.
• Matching Function: Implements the “Match” algorithm to produce value assignments for the form elements.
    Match(({E1, E2, …, En}, S, M), D) = [E1 ← v1, E2 ← v2, …, En ← vn]
• Response Analysis: Receives the response to a form submission and stores it in the crawler’s repository.
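The form representation and matching function above can be sketched in Python. This is a minimal illustration, not HiWE's implementation: the class names, the contents of S and M (an action URL, a method, the form page URL), and the idea of keying the database D by lower-cased label are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Optional, Tuple


@dataclass
class FormElement:
    label: str                          # label text extracted from the form page
    domain: Optional[List[str]] = None  # finite choices (e.g. a select list); None for free text


@dataclass
class Form:
    elements: List[FormElement]  # {E1, ..., En}
    submission: Dict[str, str]   # S (assumed contents: action URL, method)
    meta: Dict[str, str]         # M (assumed contents: URL of the form page)


def match(form: Form, database: Dict[str, List[str]]) -> List[List[Tuple[str, str]]]:
    """Toy Match: return every combination of candidate values, one per element."""
    per_element = []
    for element in form.elements:
        if element.domain is not None:
            values = element.domain  # finite domain: use the form's own choices
        else:
            # free-text element: look the label up in the task-specific database
            values = database.get(element.label.strip().lower(), [])
        per_element.append([(element.label, v) for v in values])
    # If any element has no candidate value, the product is empty (no assignment).
    return [list(combo) for combo in product(*per_element)]


form = Form(
    elements=[
        FormElement("Company"),
        FormElement("Document type", domain=["news", "reports"]),
    ],
    submission={"action": "http://example.com/search", "method": "GET"},
    meta={"form_page": "http://example.com/"},
)
database = {"company": ["Intel", "AMD"]}  # hypothetical task-specific database D
assignments = match(form, database)
print(len(assignments))  # 2 company values x 2 document types = 4
```

Each inner list corresponds to one value assignment [E1 ← v1, …, En ← vn], that is, one candidate form submission.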


Page 9: Crawling the Hidden Web

Submission Efficiency (Performance):

Let
• Ntotal = total number of forms submitted by the crawler,
• Nsuccess = number of submissions that result in a response page containing one or more search results, and
• Nvalid = number of semantically correct form submissions.

Then:
a. Strict Submission Efficiency: SEstrict = Nsuccess / Ntotal
b. Lenient Submission Efficiency: SElenient = Nvalid / Ntotal
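As a quick worked example of the two ratios, the sketch below computes both measures from raw counts. The function name and the counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def submission_efficiency(n_total: int, n_success: int, n_valid: int):
    """Return (SE_strict, SE_lenient) as fractions of all submissions."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    se_strict = n_success / n_total   # submissions that returned search results
    se_lenient = n_valid / n_total    # submissions that were semantically correct
    return se_strict, se_lenient


# Hypothetical counts: 200 submissions, 150 with results, 170 semantically correct.
se_strict, se_lenient = submission_efficiency(200, 150, 170)
print(se_strict, se_lenient)  # 0.75 0.85
```

A semantically correct submission can still return no results, which is why the two measures can differ.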


Page 10: Crawling the Hidden Web

HiWE: Hidden Web Exposer

HiWE Architecture:


Page 11: Crawling the Hidden Web

But how does this fit into our operational model?

• Form Representation
• Task-specific Database (LVS Table, i.e. the Label Value Set table)
• Matching Function
• Computing Weights


Page 12: Crawling the Hidden Web

LITE: Layout-based Information Extraction Technique

What is it? A technique in which page layout aids label extraction.
• Prune the form page.
• Approximately lay out the pruned page using a custom layout engine.
• Identify and rank the candidates.
• The highest-ranked candidate is the label associated with the form element.
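The last two steps (rank the candidates, pick the top one) can be sketched as follows. The slides do not say how candidates are ranked, so this example assumes the simplest possible criterion: distance between the candidate text and the form element in the laid-out page. The `Box` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional


@dataclass
class Box:
    text: str   # text of the candidate (or a name for the form element)
    x: float    # horizontal position in the approximate layout
    y: float    # vertical position in the approximate layout


def rank_candidates(element: Box, candidates: List[Box]) -> List[Box]:
    """Order candidate text pieces by distance to the form element (assumed criterion)."""
    return sorted(candidates, key=lambda c: hypot(c.x - element.x, c.y - element.y))


def extract_label(element: Box, candidates: List[Box]) -> Optional[str]:
    """Return the text of the highest-ranked candidate, or None if there is none."""
    ranked = rank_candidates(element, candidates)
    return ranked[0].text if ranked else None


text_box = Box("input:company", x=100, y=50)
candidates = [
    Box("Search our archive", x=100, y=200),  # page heading, far from the input
    Box("Company name:", x=20, y=50),         # text just left of the input
]
label = extract_label(text_box, candidates)
print(label)  # Company name:
```

A fuller ranking would likely combine distance with other layout cues; distance alone is used here only to keep the sketch short.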


Page 13: Crawling the Hidden Web

Experiments

Task Description: Collect Web pages containing “News articles, reports, press releases, and white papers relating to the semiconductor industry, dated sometime in the last ten years”.

• Parameter values:

Parameter                                            Value
Number of sites visited                              50
Number of forms encountered                          218
Number of forms chosen for submission                94
Label matching threshold (σ)                         0.75
Minimum form size (α)                                3
Value assignment ranking function                    ρfuz
Minimum acceptable value assignment rank (ρmin)      0.6


Page 14: Crawling the Hidden Web

Effect of Value Assignment Ranking function (ρfuz, ρavg and ρprob):

Ranking Function    Ntotal    Nsuccess    SEstrict (%)
ρfuz                3214      2853        88.8
ρavg                3760      3126        83.1
ρprob               4316      2810        65.1

Label Extraction (accuracy):
a. LITE: 93%
b. Heuristic purely based on textual analysis: 72%
c. Heuristic based on extensive manual observation: 83%
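The slides name the three ranking functions but do not define them. The sketch below assumes that each value in an assignment carries a weight in [0, 1] from the LVS table, and that ρfuz takes the minimum weight, ρavg the mean, and ρprob the probabilistic combination 1 − Π(1 − w). These formulas are an assumption, not taken from the slides. An assignment is kept only if its rank reaches ρmin.

```python
from math import prod
from typing import Sequence


def rho_fuz(weights: Sequence[float]) -> float:
    """Fuzzy conjunction (assumed): the weakest value decides the rank."""
    return min(weights)


def rho_avg(weights: Sequence[float]) -> float:
    """Average (assumed): the mean of the individual weights."""
    return sum(weights) / len(weights)


def rho_prob(weights: Sequence[float]) -> float:
    """Probabilistic (assumed): 1 minus the product of (1 - w)."""
    return 1 - prod(1 - w for w in weights)


def accept(rank: float, rho_min: float = 0.6) -> bool:
    """Keep an assignment only if its rank is at least rho_min."""
    return rank >= rho_min


weights = [0.9, 0.5, 0.8]  # hypothetical weights for a three-element assignment
for rho in (rho_fuz, rho_avg, rho_prob):
    rank = rho(weights)
    print(rho.__name__, round(rank, 3), accept(rank))
```

Under these assumed definitions ρfuz is the most conservative of the three and ρprob the most permissive, which is consistent with the Ntotal column above growing from ρfuz to ρprob.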


Page 15: Crawling the Hidden Web

Effect of α:

Effect of crawler input to LVS table:


Page 16: Crawling the Hidden Web

Pros and Cons…

Pros
• More information is crawled
• Quality of information is very high
• More focused results
• Crawler input increases the number of successful submissions

Cons
• Crawling becomes slower
• Task-specific Database can limit the accuracy of results
• Unable to handle even simple dependencies between form elements
• Lack of support for partially filled-out forms


Page 17: Crawling the Hidden Web

Where does our course fit in here…?

In Content Extraction
• Given the set of resources, i.e. sites and databases, automate the information retrieval

In Label Matching (Matching Function)
• Label Normalization
• Edit Distance Calculation

In the LITE-based heuristic for extracting labels
• Identify and Rank Candidates

In maintaining the Crawler’s repository
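The two label matching steps listed above, normalization and edit distance, can be sketched as follows. The normalization here (lower-casing, dropping punctuation, collapsing whitespace) is a simplified assumption; a real crawler may also stem words or remove stop words. The edit distance is the standard Levenshtein distance.

```python
import re


def normalize_label(label: str) -> str:
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    label = re.sub(r"[^a-z0-9\s]", " ", label.lower())
    return " ".join(label.split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


form_label = normalize_label("  Company Name: ")
table_label = normalize_label("company names")
print(form_label, "|", table_label, "|", edit_distance(form_label, table_label))
```

A form label would then be matched to the closest label in the crawler's table, subject to whatever matching threshold the crawler is configured with.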


Page 18: Crawling the Hidden Web

Related Works…

• J. Madhavan et al., VLDB, 2008, Google's Deep Web Crawl
• J. Madhavan et al., CIDR, Jan. 2009, Harnessing the Deep Web: Present and Future
• Manuel Álvarez, Juan Raposo, Fidel Cacheda and Alberto Pan, Aug. 2006, A Task-specific Approach for Crawling the Deep Web
• Lu Jiang, Zhaohui Wu, Qian Feng, Jun Liu, Qinghua Zheng, Efficient Deep Web Crawling Using Reinforcement Learning
• Manuel Álvarez et al., Crawling the Content Hidden Behind Web Forms
• Yongquan Dong, Qingzhong Li, 2012, A Deep Web Crawling Approach Based on Query Harvest Model
• Alexandros Ntoulas, Petros Zerfos, Junghoo Cho, Downloading Hidden Web Content
• Rosy Madaan, Ashutosh Dixit, A.K. Sharma, Komal Kumar Bhatia, 2010, A Framework for Incremental Hidden Web Crawler
• Ping Wu, Ji-Rong Wen, Huan Liu, Wei-Ying Ma, Query Selection Techniques for Efficient Crawling of Structured Web Sources
• http://deepweb.us/


Page 19: Crawling the Hidden Web

So… what’s the “Conclusion”?

• Traditional crawlers’ limitations
• Issues related to extending crawlers to access the “Hidden Web”
• Need for a narrow application focus
• Promising results of HiWE
• Limitations (of HiWE):
  • Inability to handle simple dependencies between form elements
  • Lack of support for partially filled-out forms
