Crawling the Hidden Web

Transcript of Crawling the Hidden Web

Page 1: Crawling the Hidden Web

Crawling the Hidden Web

by

Michael Weinberg
[email protected]

Internet DB Seminar,

The Hebrew University of Jerusalem,

School of Computer Science and Engineering,

December 2001

Page 2: Crawling the Hidden Web

Agenda

– Hidden Web - what is it all about?
– Generic model for a hidden Web crawler
– HiWE (Hidden Web Exposer)
– LITE – Layout-based Information Extraction Technique
– Results from experiments conducted to test these techniques

Page 3: Crawling the Hidden Web

Web Crawlers

Automatically traverse the Web graph, building a local repository of the portion of the Web that they visit

Traditionally, crawlers have only targeted a portion of the Web called the publicly indexable Web (PIW)

PIW – the set of pages reachable purely by following hypertext links, ignoring search forms and pages that require authentication

Page 4: Crawling the Hidden Web

The Hidden Web

Recent studies show that a significant fraction of Web content in fact lies outside the PIW

Large portions of the Web are ‘hidden’ behind search forms in searchable databases

HTML pages are dynamically generated in response to queries submitted via the search forms

Also referred to as the ‘Deep’ Web

Page 5: Crawling the Hidden Web

The Hidden Web Growth

The Hidden Web continues to grow, as organizations with large amounts of high-quality information place their content online, providing web-accessible search facilities over existing databases

For example:
– Census Bureau
– Patents and Trademarks Office
– News media companies

InvisibleWeb.com lists over 10,000 such databases

Page 6: Crawling the Hidden Web

Surface Web

Page 7: Crawling the Hidden Web

Deep Web

Page 8: Crawling the Hidden Web

Deep Web Content Distribution

Page 9: Crawling the Hidden Web

Deep Web Stats

– The Deep Web is 500 times larger than the PIW!
– Contains 7,500 terabytes of information (March 2000)
– More than 200,000 Deep Web sites exist
– Sixty of the largest Deep Web sites collectively contain about 750 terabytes of information
– 95% of the Deep Web is publicly accessible (no fees)
– Google indexes about 16% of the PIW, so we search about 0.03% of the pages available today

Page 10: Crawling the Hidden Web

The Problem

Hidden Web contains large amounts of high-quality information

The information is buried on dynamically generated sites

Search engines that use traditional crawlers never find this information

Page 11: Crawling the Hidden Web

The Solution

Build a hidden Web crawler that:
– can crawl and extract content from hidden databases
– enables indexing, analysis, and mining of hidden Web content

The content extracted by such crawlers can be used to categorize and classify the hidden databases

Page 12: Crawling the Hidden Web

Challenges

There are significant technical challenges in designing a hidden Web crawler:
– It must interact with forms that were designed primarily for human consumption
– It must provide input in the form of search queries
– How do we equip the crawler with input values for use in constructing search queries?

To address these challenges, we adopt the task-specific, human-assisted approach

Page 13: Crawling the Hidden Web

Task-Specificity

Extract content based on the requirements of a particular application or task

For example, consider a market analyst interested in press releases, articles, etc… pertaining to the semiconductor industry, and dated sometime in the last ten years

Page 14: Crawling the Hidden Web

Human-Assistance

Human-assistance is critical to ensure that the crawler issues queries that are relevant to the particular task

For instance, in the semiconductor example, the market analyst may provide the crawler with lists of companies or products that are of interest

The crawler will be able to gather additional potential company and product names as it processes a number of pages

Page 15: Crawling the Hidden Web

Two Steps

There are two steps in achieving our goal:
– Resource discovery – identify sites and databases that are likely to be relevant to the task
– Content extraction – actually visit the identified sites to submit queries and extract the hidden pages

In this presentation we do not directly address the resource discovery problem

Page 16: Crawling the Hidden Web

Hidden Web Crawlers

Page 17: Crawling the Hidden Web

User Form Interaction

[Diagram: a user interacts with a Web query front-end backed by a hidden database]
(1) Download form
(2) View form
(3) Fill-out form
(4) Submit form
(5) Download response
(6) View result

Page 18: Crawling the Hidden Web

Operation Model

Our model of a hidden Web crawler consists of four components:
– Internal Form Representation
– Task-specific database
– Matching function
– Response Analysis

Form Page – the page containing the search form
Response Page – the page received in response to a form submission
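As a concrete illustration of how the four components interact, here is a minimal, runnable sketch of the crawl loop they imply; all names and the stub logic are illustrative, not taken from the paper:

```python
def analyze_form(page_url: str) -> dict:
    """Form analysis: build a (stubbed) internal form representation."""
    return {"elements": ["company"], "submit_url": page_url}

def match(form: dict, task_db: dict) -> list:
    """Matching function: one value assignment per candidate value."""
    return [{e: v} for e in form["elements"] for v in task_db.get(e, [])]

def submit(form: dict, assignment: dict) -> str:
    """Form submission: return the (stubbed) response page."""
    return "response page for " + str(assignment)

repository = []                                   # stores downloaded pages
task_db = {"company": ["IBM", "HP"]}              # task-specific database
form = analyze_form("http://example.com/search")  # form page -> representation
for assignment in match(form, task_db):           # set of value assignments
    repository.append(submit(form, assignment))   # response analysis would also
                                                  # classify each stored page
```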

Page 19: Crawling the Hidden Web

Generic Operational Model

[Diagram: the hidden Web crawler downloads a form page from the Web query front-end; form analysis produces the Internal Form Representation; matching it against the task-specific database yields a set of value-assignments; each form submission produces a response page from the hidden database, which is downloaded into the repository and passed to Response Analysis]

Page 20: Crawling the Hidden Web

Internal Form Representation

Form $F = (\{E_1, E_2, \ldots, E_n\}, S, M)$, where $\{E_1, E_2, \ldots, E_n\}$ is a set of $n$ form elements

S – submission information associated with the form:
– submission URL
– internal identifiers for each form element

M – meta-information about the form:
– web-site hosting the form
– set of pages pointing to this form page
– other text on the page besides the form
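A plain-Python sketch of this representation may help; the class and field names below are illustrative, not HiWE's actual code:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FormElement:
    identifier: str                 # internal identifier, part of S
    label: Optional[str] = None     # descriptive text, if extracted
    domain: Optional[list] = None   # finite domain values; None means infinite

@dataclass
class Form:
    elements: list                  # [FormElement, ...]: E1, ..., En
    submission_url: str             # part of S
    meta: dict = field(default_factory=dict)  # M: hosting site, linking pages, page text
```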

Page 21: Crawling the Hidden Web

Task-specific Database

The crawler is equipped with a task-specific database D

Contains the necessary information to formulate queries relevant to the particular task

In the ‘market analyst’ example, D could contain lists of semiconductor company and product names

The actual format and organization of D are specific to a particular crawler implementation

HiWE uses a set of labeled fuzzy sets

Page 22: Crawling the Hidden Web

Matching Function

Matching algorithm properties:
– Input: internal form representation and current contents of the database D
– Output: set of value assignments
– $\mathrm{Match}((\{E_1, \ldots, E_n\}, S, M), D) = [(E_1, v_1), \ldots, (E_n, v_n)]$ associates value $v_i$ with element $E_i$

Page 23: Crawling the Hidden Web

Response Analysis

Module that stores the response page in the repository

Attempts to distinguish between pages containing search results and pages containing error messages

This feedback is used to tune the matching function

Page 24: Crawling the Hidden Web

Traditional Performance Metrics

Traditional crawler performance metrics:
– Crawling speed
– Scalability
– Page importance
– Freshness

These metrics are relevant to hidden Web crawlers, but do not capture the fundamental challenges in dealing with the Hidden Web

Page 25: Crawling the Hidden Web

New Performance Metrics

Coverage metric:
– ‘Relevant’ pages extracted / ‘relevant’ pages present in the targeted hidden databases
– Problem: difficult to estimate how much of the hidden content is relevant to the task

Page 26: Crawling the Hidden Web

New Performance Metrics

Submission efficiency (strict): $SE_{strict} = N_{success} / N_{total}$
– $N_{total}$: the total number of forms that the crawler submits
– $N_{success}$: the number of submissions that result in a response page with one or more search results
– Problem: the crawler is penalized if the database didn't contain any relevant search results

Page 27: Crawling the Hidden Web

New Performance Metrics

Submission efficiency (lenient): $SE_{lenient} = N_{valid} / N_{total}$
– $N_{valid}$: the number of semantically correct form submissions
– Penalizes the crawler only if a form submission is semantically incorrect
– Problem: difficult to evaluate, since a manual comparison is needed to decide whether a submission is semantically correct
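As a toy illustration of the two submission-efficiency metrics (the counts below are made up for the example, not results from the paper):

```python
def submission_efficiency(n_counted: int, n_total: int) -> float:
    """Ratio of counted submissions to total submissions."""
    return n_counted / n_total if n_total else 0.0

n_total, n_success, n_valid = 200, 140, 170   # hypothetical crawl counts
se_strict = submission_efficiency(n_success, n_total)    # 0.70
se_lenient = submission_efficiency(n_valid, n_total)     # 0.85 (N_valid >= N_success)
```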

Page 28: Crawling the Hidden Web

Design Issues

– What information about each form element $E_i$ should the crawler collect?
– What meta-information is likely to be useful?
– How should the task-specific database be organized, updated and accessed?
– What Match function is likely to maximize submission efficiency?
– How to use the response analysis module to tune the Match function?

Page 29: Crawling the Hidden Web

HiWE: Hidden Web Exposer

Page 30: Crawling the Hidden Web

Basic Idea

– Extract descriptive information (a label) for each element of a form
– The task-specific database is organized in terms of categories, each of which is also associated with labels
– The matching function attempts to match form labels to database categories to compute a set of candidate value assignments

Page 31: Crawling the Hidden Web

HiWE Architecture

[Diagram: the Crawl Manager drives the crawl from the URL List (URL 1 ... URL N); the Parser extracts links from crawled pages; the Form Analyzer, Form Processor and Response Analyzer handle form submission, response and feedback against the WWW; the LVS Manager maintains the LVS Table of (Label, Value-Set) pairs, fed by custom data sources]

Page 32: Crawling the Hidden Web

HiWE's Main Modules

URL List:
– contains all the URLs the crawler has discovered so far

Crawl Manager:
– controls the entire crawling process

Parser:
– extracts hypertext links from the crawled pages and adds them to the URL list

Form Analyzer, Form Processor, Response Analyzer:
– together implement the form processing and submission operations

Page 33: Crawling the Hidden Web

HiWE's Main Modules

LVS Manager:
– manages additions and accesses to the LVS table

LVS Table:
– HiWE's implementation of the task-specific database

Page 34: Crawling the Hidden Web

HiWE's Form Representation

Form $F = (\{E_1, E_2, \ldots, E_n\}, S, \emptyset)$
– The third component of F is an empty set, since the current implementation of HiWE does not collect any meta-information about the form

For each element $E_i$, HiWE collects a domain $Dom(E_i)$ and a label $label(E_i)$

Page 35: Crawling the Hidden Web

HiWE's Form Representation

Domain of an element:
– The set of values which can be associated with the corresponding form element
– May be a finite set (e.g., the domain of a selection list)
– May be an infinite set (e.g., the domain of a text box)

Label of an element:
– The descriptive information associated with the element, if any
– Most forms include some descriptive text to help users understand the semantics of the element

Page 36: Crawling the Hidden Web

Form Representation – Figure

Element E1: Label(E1) = "Document Type", Dom(E1) = {Articles, Press Releases, Reports}
Element E2: Label(E2) = "Company Name", Dom(E2) = {s | s is a text string}
Element E3: Label(E3) = "Sector", Dom(E3) = {Entertainment, Automobile, Information Technology, Construction}

Page 37: Crawling the Hidden Web

HiWE's Task-specific Database

Task-specific information is organized in terms of a finite set of concepts or categories

Each concept has one or more labels and an associated set of values

For example, the label ‘Company Name’ could be associated with the set of values {‘IBM’, ‘Microsoft’, ‘HP’, …}

Page 38: Crawling the Hidden Web

HiWE's Task-specific Database

The concepts are organized in a table called the Label Value Set (LVS) table

Each entry in the LVS table is of the form (L, V):
– L: a label
– $V = \{v_1, \ldots, v_n\}$: a fuzzy set of values
– The fuzzy set V has an associated membership function $M_V$ that assigns a weight in the range [0, 1] to each member of the set
– $M_V(v_i)$ is a measure of the crawler's confidence that the assignment of $v_i$ to E is semantically meaningful

Page 39: Crawling the Hidden Web

HiWE's Matching Function

For elements with a finite domain:
– The set of possible values is fixed and can be exhaustively enumerated
– For example, given Element E1 with Label(E1) = "Document Type" and Dom(E1) = {Articles, Press Releases, Reports}, the crawler can first retrieve all relevant articles, then all relevant press releases and finally all relevant reports

Page 40: Crawling the Hidden Web

HiWE's Matching Function

For elements with an infinite domain:
– HiWE textually matches the labels of these elements with labels in the LVS table
– For example, if a textbox element has the label “Enter State”, which best matches an LVS entry with the label “State”, the values associated with that LVS entry (e.g., “California”) can be used to fill the textbox
– How do we match form labels with LVS labels?

Page 41: Crawling the Hidden Web

Label Matching

Two steps in matching form labels with LVS labels:
1. Normalization: includes conversion to a common case and standard style
2. Use of an approximate string matching algorithm to compute minimum edit distances

HiWE employs D. Lopresti and A. Tomkins' string matching algorithm, which takes word reordering into account

Page 42: Crawling the Hidden Web

Label Matching

Let LabelMatch($E_i$) denote the LVS entry with the minimum edit distance to label($E_i$)

A threshold $\sigma$ is used: if all LVS entries are more than $\sigma$ edit operations away from label($E_i$), then LabelMatch($E_i$) = nil
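The sketch below illustrates the two matching steps. Note that the paper uses the Lopresti-Tomkins block edit-distance algorithm, which handles word reordering; as a stand-in, this sketch uses a plain Levenshtein distance, so the threshold behaves somewhat differently:

```python
def normalize(label: str) -> str:
    """Step 1: collapse whitespace and fold case."""
    return " ".join(label.lower().split())

def edit_distance(a: str, b: str) -> int:
    """Step 2 stand-in: classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def label_match(form_label: str, lvs_labels: list, sigma: int):
    """Return the closest LVS label, or None if all are > sigma edits away."""
    dist, best = min((edit_distance(normalize(form_label), normalize(l)), l)
                     for l in lvs_labels)
    return best if dist <= sigma else None

# "Enter State" is 6 edits from "State" under plain Levenshtein, so with
# a (tunable) threshold of 6 the "State" entry's values can be used.
print(label_match("Enter  State", ["State", "Company Name"], sigma=6))  # -> "State"
```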

Page 43: Crawling the Hidden Web

Label Matching

For each element $E_i$, compute $(E_i, V_i)$:
– If $E_i$ has an infinite domain and (L, V) is the closest matching LVS entry, then $V_i = V$ and $M_{V_i} = M_V$
– If $E_i$ has a finite domain, then $V_i = Dom(E_i)$ and $M_{V_i}(x) = 1$ for all $x \in V_i$

The set of value assignments is computed as the product of all the $V_i$'s:
$\mathrm{Match}(F, LVS) = \{[(E_1, v_1), \ldots, (E_n, v_n)] : v_i \in V_i,\ i = 1 \ldots n\}$

Too many assignments?
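A sketch of this Cartesian-product construction (function and variable names are illustrative):

```python
from itertools import product

def value_assignments(value_sets: dict) -> list:
    """Cartesian product of the per-element value sets V_i."""
    names = list(value_sets)
    return [dict(zip(names, combo))
            for combo in product(*(value_sets[n] for n in names))]

# 3 document types x 2 companies -> 6 candidate submissions; this growth
# is why HiWE ranks assignments and submits only high-quality ones.
sets = {"Document Type": ["Articles", "Press Releases", "Reports"],
        "Company Name": ["IBM", "Microsoft"]}
assignments = value_assignments(sets)   # len(assignments) == 6
```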

Page 44: Crawling the Hidden Web

Ranking Value Assignments

HiWE employs an aggregation function to compute a rank for each value assignment

Uses a configurable parameter, a minimum acceptable value assignment rank ($\rho_{min}$)

The intent is to improve submission efficiency by only using ‘high-quality’ value assignments

We will show three possible aggregation functions

Page 45: Crawling the Hidden Web

Fuzzy Conjunction

The rank of a value assignment is the minimum of the weights of all the constituent values:
$\rho_{fuz}([(E_1, v_1), \ldots, (E_n, v_n)]) = \min_{i = 1 \ldots n} M_{V_i}(v_i)$

Very conservative in assigning ranks; assigns a high rank only if each individual weight is high

Page 46: Crawling the Hidden Web

Average

The rank of a value assignment is the average of the weights of the constituent values:
$\rho_{avg}([(E_1, v_1), \ldots, (E_n, v_n)]) = \frac{1}{n} \sum_{i = 1 \ldots n} M_{V_i}(v_i)$

Less conservative than fuzzy conjunction

Page 47: Crawling the Hidden Web

Probabilistic

This ranking function treats weights as probabilities:
– $M_{V_i}(v_i)$ is the likelihood that the choice of $v_i$ is useful, and $1 - M_{V_i}(v_i)$ is the likelihood that it is not

The likelihood of a value assignment being useful is:
$\rho_{prob}([(E_1, v_1), \ldots, (E_n, v_n)]) = 1 - \prod_{i = 1 \ldots n} (1 - M_{V_i}(v_i))$

Assigns a low rank only if all the individual weights are very low
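The three aggregation functions are straightforward to express over the list of constituent weights $M_{V_i}(v_i)$ of one assignment; a sketch with hypothetical example weights:

```python
from math import prod

def rho_fuz(weights: list) -> float:
    """Fuzzy conjunction: the minimum constituent weight."""
    return min(weights)

def rho_avg(weights: list) -> float:
    """Average: the mean of the constituent weights."""
    return sum(weights) / len(weights)

def rho_prob(weights: list) -> float:
    """Probabilistic: 1 - product of (1 - w); low only if all w are low."""
    return 1 - prod(1 - w for w in weights)

w = [0.9, 0.4, 0.8]   # hypothetical weights M_Vi(vi) for one assignment
# rho_fuz(w) = 0.4, rho_avg(w) = 0.7, rho_prob(w) = 0.988
```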

Page 48: Crawling the Hidden Web

Populating the LVS Table

HiWE supports a variety of mechanisms for adding entries to the LVS table:
– Explicit initialization
– Built-in entries
– Wrapped data sources
– Crawling experience

Page 49: Crawling the Hidden Web

Explicit Initialization

Supply labels and associated value sets at startup time

Useful for equipping the crawler with the labels it is most likely to encounter

In the ‘semiconductor’ example, we supply HiWE with a list of relevant company names and associate the list with the labels ‘Company’ and ‘Company Name’

Page 50: Crawling the Hidden Web

Built-in Entries

HiWE has built-in entries for commonly used concepts:
– Dates and times
– Names of months
– Days of the week

Page 51: Crawling the Hidden Web

Wrapped Data Sources

The LVS Manager can query data sources through a well-defined interface

The data source must be ‘wrapped’ by a program that supports two kinds of queries:
– Given a set of labels, return a value set
– Given a set of values, return other values that belong to the same value set
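A sketch of what such a wrapper interface could look like; the Protocol and method names are illustrative, not HiWE's actual API:

```python
from typing import Protocol

class WrappedDataSource(Protocol):
    """Illustrative wrapper interface for an external data source."""

    def values_for_labels(self, labels: set) -> set:
        """Query 1: given a set of labels, return a value set."""
        ...

    def related_values(self, values: set) -> set:
        """Query 2: given a set of values, return other values
        that belong to the same value set."""
        ...
```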

Page 52: Crawling the Hidden Web

HiWE Architecture

[The architecture diagram from Page 31 is shown again]

Page 53: Crawling the Hidden Web

Crawling Experience

Finite domain form elements are a useful source of labels and associated value sets

HiWE adds this information to the LVS table

This is effective when a similar label is associated with a finite domain element in one form and with an infinite domain element in another

Page 54: Crawling the Hidden Web

Computing Weights

A new value added to the LVS table must be assigned a suitable weight

Explicitly initialized and built-in values have fixed weights

Values obtained from external data sources or through the crawler's own activity are assigned weights that vary with time

Page 55: Crawling the Hidden Web

Initial Weights

For external data sources, weights are computed by the respective wrappers

For values directly gathered by the crawler:
– Consider a finite domain element E with domain Dom(E)
– $M_{Dom(E)}(x) = 1$ iff $x \in Dom(E)$
– Three cases arise when incorporating Dom(E) into the LVS table

Page 56: Crawling the Hidden Web

Updating LVS – Case 1

The crawler successfully extracts label(E) and computes LabelMatch(E) = (L, V):
– Replace the entry (L, V) by the entry $(L, V \cup Dom(E))$, with $M_{V \cup Dom(E)}(x) = \max(M_V(x), M_{Dom(E)}(x))$
– Intuitively, Dom(E) provides new elements to the value set and ‘boosts’ the weights of existing elements

Page 57: Crawling the Hidden Web

Updating LVS – Case 2

The crawler successfully extracts label(E), but LabelMatch(E) = nil:
– A new entry (label(E), Dom(E)) is created in the LVS table

Page 58: Crawling the Hidden Web

Updating LVS – Case 3

The crawler cannot extract label(E):
– For each entry (L, V), compute a score: $s(L, V) = \frac{\sum_{x \in Dom(E)} M_V(x)}{|Dom(E)|}$
– Identify the entry $(L_{max}, V_{max})$ with the maximum score $s_{max}$
– Replace the entry $(L_{max}, V_{max})$ with the new entry $(L_{max}, V_{max} \cup Dom(E))$
– Confidence of the new values: each $x \in Dom(E)$ is added with weight $s_{max} \cdot M_{Dom(E)}(x)$
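A sketch tying the three cases together, reusing the illustrative LVSEntry structure from the earlier sketch; the helper names and control flow are assumptions, not HiWE's code:

```python
from dataclasses import dataclass

@dataclass
class LVSEntry:
    label: str
    values: dict   # fuzzy set: value -> membership weight in [0, 1]

def update_lvs(lvs_table: list, label, dom: set, matched) -> None:
    """Incorporate a finite domain Dom(E) under the three update cases."""
    if label is not None and matched is not None:
        # Case 1: merge Dom(E) into the matched entry, boosting weights
        # (values gathered by the crawler arrive with M_Dom(E)(x) = 1).
        for x in dom:
            matched.values[x] = max(matched.values.get(x, 0.0), 1.0)
    elif label is not None:
        # Case 2: no matching entry; create (label(E), Dom(E)) afresh.
        lvs_table.append(LVSEntry(label, {x: 1.0 for x in dom}))
    else:
        # Case 3: no label; pick the entry whose values best cover Dom(E).
        def score(entry):
            return sum(entry.values.get(x, 0.0) for x in dom) / len(dom)
        best = max(lvs_table, key=score)
        s_max = score(best)
        for x in dom:
            best.values.setdefault(x, s_max)   # new values get weight s_max
```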

Page 59: Crawling the Hidden Web

Configuring HiWE

Initialization of the crawling activity includes:
– Set of sites to crawl
– Explicit initialization for the LVS table
– Set of data sources
– Label matching threshold ($\sigma$)
– Minimum acceptable value assignment rank ($\rho_{min}$)
– Value assignment aggregation function ($\rho$)

Page 60: Crawling the Hidden Web

Introducing LITE

LITE: Layout-based Information Extraction Technique

The physical layout of a page is also used to aid in extraction

For example, a piece of text that is physically adjacent to a form element is very likely a description of that element

Unfortunately, this semantic association is not always reflected in the underlying HTML of the Web page

Page 61: Crawling the Hidden Web

Layout-based Information Extraction Technique

Page 62: Crawling the Hidden Web

The Challenge

Accurate extraction of the labels and domains of form elements:
– Elements that are visually close on the screen may be separated arbitrarily in the actual HTML text
– Even when HTML provides a facility for expressing semantic relationships, it is not used in the majority of pages
– Accurate page layout is a complex process
– Even a crude approximate layout of portions of a page can yield very useful semantic information

Page 63: Crawling the Hidden Web

Form Analysis in HiWE

The LITE-based heuristic (see the sketch after this list):
– Prune the form page and isolate the elements which directly influence the layout
– Approximately lay out the pruned page using a custom layout engine
– Identify the pieces of text that are physically closest to the form element (these are the candidates)
– Rank each candidate using a variety of measures
– Choose the highest-ranked candidate as the label
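A toy sketch of the proximity step only: given approximate (x, y) positions produced by a layout engine, pick the candidate text physically closest to the element. The candidate-ranking measures are omitted, and all positions and strings here are hypothetical:

```python
import math

def closest_label(element_pos: tuple, candidates: dict) -> str:
    """Return the candidate text nearest to the element's layout position."""
    ex, ey = element_pos
    return min(candidates,
               key=lambda text: math.hypot(candidates[text][0] - ex,
                                           candidates[text][1] - ey))

# Text laid out just left of a textbox at (200, 120) beats distant page text.
candidates = {"Company Name": (120, 118), "Search our archive": (40, 20)}
label = closest_label((200, 120), candidates)   # -> "Company Name"
```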

Page 64: Crawling the Hidden Web

Pruning Before Partial Layout

Page 65: Crawling the Hidden Web

LITE – Figure

[Diagram: a DOM parser produces a DOM representation of the form page; through the DOM API the page is pruned, the pruned page is partially laid out, and the layout yields the list of elements, submission info, and labels & domain values that make up the Internal Form Representation]

Key idea in LITE: physical page layout embeds significant semantic information

Page 66: Crawling the Hidden Web

Experiments

A number of experiments were conducted to study the performance of HiWE

We will see how performance depends on:
– Minimum form size
– Crawler input to the LVS table
– Different ranking functions

Page 67: Crawling the Hidden Web

Parameter Values for Task 1

Task 1: news articles, reports, press releases and white papers relating to the semiconductor industry, dated sometime in the last ten years

Page 68: Crawling the Hidden Web

Variation of Performance with Minimum Form Size

Page 69: Crawling the Hidden Web

Effect of Crawler Input to LVS

Page 70: Crawling the Hidden Web

Different Ranking Functions

When using $\rho_{fuz}$ and $\rho_{avg}$, the crawler's submission efficiency is mostly above 80%

$\rho_{prob}$ performs poorly

$\rho_{avg}$ submits more forms than $\rho_{fuz}$ (it is less conservative)

Page 71: Crawling the Hidden Web

Label Extraction

The LITE-based heuristic achieved an overall accuracy of 93%

The test set was manually analyzed

Page 72: Crawling the Hidden Web

Conclusion

– Addressed the problem of extending current-day crawlers to build repositories that include pages from the ‘Hidden Web’
– Presented a simple operational model of a hidden Web crawler
– Described the implementation of a prototype crawler – HiWE
– Introduced a technique for layout-based information extraction

Page 73: Crawling the Hidden Web

Bibliography

– S. Raghavan and H. Garcia-Molina. Crawling the Hidden Web. Stanford University, 2001.
– BrightPlanet.com white papers.
– D. Lopresti and A. Tomkins. Block edit models for approximate string matching.