Automatic Query Generation for Hidden Web Crawling

Paper Number 350

ABSTRACT

An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to the Hidden Web pages, search engines cannot discover and index such pages and thus do not return them in the results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users.

In this paper, we study how we can build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only "entry point" to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. Here, we provide a theoretical framework to investigate the query generation problem for the Hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on 4 real Hidden Web sites and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries.

1. INTRODUCTION

Recent studies show that a significant fraction of Web content cannot be reached by following links [8, 13]. In particular, a large part of the Web is "hidden" behind search forms and is reachable only when users type in a set of keywords, or queries, to the forms. These pages are often referred to as the Hidden Web [18] or the Deep Web [8], because search engines typically cannot index the pages and do not return them in their results (thus, the pages are essentially "hidden" from an average Web user).



According to many studies, the size of the Hidden Web increases rapidly as more organizations put their valuable content online through an easy-to-use Web interface [8, 13]. For example, PubMed [3] hosts a large database of research papers in medicine and provides a keyword search interface to the papers. In addition, many news Web sites allow users to access their archived articles through search interfaces. In [13], Chang et al. estimate that well over 100,000 Hidden-Web sites currently exist on the Web.

The content provided by many Hidden-Web sites is often of very high quality and can be extremely valuable to many users [8]. For example, PubMed hosts many high-quality papers that were selected through careful peer-review processes. The patent documents available at the US Patent and Trademark Office [5] are of immense importance to inventors who want to examine the "prior art."

In this paper, we study how we can build a Hidden-Web crawler that can automatically download pages from the Hidden Web, so that search engines can index them. Existing crawlers1 rely on the hyperlinks on the Web to discover pages, so current search engines cannot index the Hidden-Web pages (due to the lack of links). We believe that an effective Hidden-Web crawler can have a tremendous impact on how users search for information on the Web:

• Tapping into unexplored information: The Hidden-Web crawler will allow an average Web user to easily explore the vast amount of information that is mostly "hidden" at the present moment. Since a majority of Web users rely on search engines to discover pages, when pages are not indexed by search engines, they are unlikely to be viewed by many Web users. Unless users go directly to Hidden-Web sites and issue queries there, they cannot access the pages at the sites.

• Improving user experience: Even if a user is aware of a number of Hidden-Web sites, the user still has to waste a significant amount of time and effort, visiting all of the potentially relevant sites, querying each of them and exploring the results. By making the Hidden-Web pages searchable at a central location, we can significantly reduce the user's wasted time and effort in searching the Hidden Web.

1 Crawlers are the programs that traverse the Web automatically and download pages for search engines.


• Reducing potential bias: Due to the heavy reliance of many Web users on search engines, search engines greatly influence how users perceive the Web [24]. Users do not necessarily perceive what actually exists on the Web, but what is indexed by search engines [24]. Our Hidden-Web crawler will make a larger fraction of the Web accessible through search engines and, thus, it can minimize the potential bias introduced by them.

Given that the only "entry" to Hidden-Web pages is through search forms, one of the core challenges in implementing an effective Hidden-Web crawler is how the crawler can automatically generate queries. Clearly, when the search form lists all possible values for a query (e.g., through a drop-down list), the solution is straightforward: we exhaustively issue all possible queries, one query at a time. When the query form has a "free text" input, however, an infinite number of queries are possible, so we cannot exhaustively issue all possible queries. In this case, which queries should we pick? Can the crawler automatically come up with meaningful queries without understanding the semantics of the search form?

In this paper, we provide a theoretical framework to investigate the Hidden-Web crawling problem and propose effective ways of generating queries automatically. We also evaluate our proposed solutions through experiments conducted on real Hidden-Web sites. In summary, this paper makes the following contributions:

• We present a formal framework to study the problem of Hidden-Web crawling (Section 2).

• We investigate a number of crawling policies for the Hidden Web, including the optimal policy that can potentially download the maximum number of pages through the minimum number of interactions. Unfortunately, we show that the optimal policy is NP-hard and cannot be implemented in practice (Section 2.2).

• We propose a new adaptive policy that approximates the optimal policy. Our adaptive policy examines the pages returned from previous queries and adapts its query-selection policy automatically based on them (Section 3).

• We evaluate various crawling policies through experiments on real Web sites. Our experiments show the relative advantages of the various crawling policies and demonstrate their potential. The results from our experiments are very promising. In one experiment, for example, our adaptive policy downloaded more than 90% of the pages within PubMed (which contains 14 million documents) after it issued fewer than 100 queries.

Figure 1: A single-attribute search interface

1.1 Related Work

In the study most related to ours, Raghavan and Garcia-Molina [25] present an architectural model for a Hidden-Web crawler. The main focus of this work is to learn Hidden-Web query interfaces, not to generate queries automatically. The potential queries are either provided manually by users or collected from the query interfaces. In contrast, our main focus is to generate queries automatically without any human intervention.

In [11, 12], Callan and Connell try to learn the language model of a text database by issuing a number of sample queries. Although both our work and [11, 12] use the idea of automatically issuing queries to a database and examining the results, the main goals are different. [11, 12] try to acquire an accurate language model by collecting a uniform random sample from the database. In contrast, our goal is to retrieve the maximum number of documents using minimum resources.

Cope et al. [16] propose a method to automatically detect whether a particular Web page contains a search form. This work is complementary to ours; once we detect search interfaces on the Web using the method in [16], we may use our proposed algorithms to download pages automatically from those Web sites.

Reference [7] reports methods to estimate what fraction of a text database can eventually be acquired by issuing queries to the database. In [6] the authors study query-based techniques that can extract relational data from large text databases. Again, these works study orthogonal issues and are complementary to our work.

There exists a large body of work studying how to identify the most relevant database given a user query [21, 20, 15, 22, 19]. This body of work is often referred to as meta-searching or the database selection problem over the Hidden Web. For example, [20] suggests the use of focused probing to classify databases into a topical category, so that given a query, a relevant database can be selected based on its topical category. [15] presents a system that can construct topic-specific hierarchies from text documents using a rule-based classifier. Our vision is different from this body of work in that we intend to download and index the Hidden-Web pages at a central location in advance, so that users can access all the information at their convenience from a single location.

2. FRAMEWORK

In this section, we present a formal framework for the study of the Hidden-Web crawling problem. In Section 2.1, we describe our assumptions on Hidden-Web sites and explain how users interact with the sites. Based on this interaction model, we present a high-level algorithm for a Hidden-Web crawler in Section 2.2. Finally, in Section 2.3, we formalize the Hidden-Web crawling problem.

2.1 Hidden-Web database model

(a) List of matching pages for query “liver”. (b) The first matching page for “liver”.

Figure 3: Pages from the PubMed Web site.

Figure 2: A multi-attribute search interface

There exists a variety of Hidden-Web sources that provide information on a multitude of topics. Depending on the type of information, we may categorize a Hidden-Web site either as a textual database or a structured database. A textual database is a site that mainly contains plain-text documents, such as PubMed and LexisNexis (an online database of legal documents [1]). Since plain-text documents do not usually have well-defined structure, most textual databases provide a simple search interface where users type a list of keywords in a single search box (Figure 1). In contrast, a structured database often contains multi-attribute relational data (e.g., a book on the Amazon Web site may have the fields title='Harry Potter', author='J.K. Rowling', and isbn='0590353403') and supports multi-attribute search interfaces (Figure 2). In this paper, we will mainly focus on textual databases that support single-attribute keyword queries. We discuss how we can extend our ideas for the textual database to multi-attribute structured databases in Section 5.1.


We also assume that users need to take the following steps in order to access pages in a Hidden-Web database.

1. Step 1. First, the user issues a query, say "liver," through the search interface provided by the Web site (such as the one shown in Figure 1).

2. Step 2. Shortly after the user issues the query, she is presented with a result index page. That is, the Web site returns a list of links to potentially relevant Web pages, as shown in Figure 3(a).

3. Step 3. From the list in the result index page, the user identifies the pages that look "interesting" and follows the links. Clicking on a link leads the user to the actual Web page, such as the one shown in Figure 3(b), that the user wants to look at.

Algorithm 2.1 Crawling a Hidden Web site
Procedure
(1) while ( there are available resources ) do
      // select a term to send to the site
(2)   qi = SelectTerm()
      // send query and acquire result index page
(3)   R(qi) = QueryWebSite( qi )
      // download the pages of interest
(4)   Download( R(qi) )
(5) done

Figure 4: Algorithm for crawling a Hidden Web site.


2.2 A generic Hidden Web crawling algorithm

Given that the only "entry" to the pages in a Hidden-Web site is its search form, a Hidden-Web crawler should follow the three steps described in the previous section. That is, the crawler has to generate a query, issue it to the Web site, download the result index page, and follow the links to download the actual pages. In most cases, a crawler has limited time and network resources, so the crawler repeats these steps until it uses up its resources.

In Figure 4 we show the generic algorithm for a Hidden-Web crawler. For simplicity, we assume that the Hidden-Web crawler issues single-term queries only.2 The crawler first decides which query term it is going to use (Step (2)), issues the query, and retrieves the result index page (Step (3)). Finally, based on the links found on the result index page, it downloads the Hidden-Web pages from the site (Step (4)). This same process is repeated until all the available resources are used up (Step (1)).
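For concreteness, the crawling loop of Figure 4 can be written as a short driver. The sketch below is our own illustration, not code from the paper: the function names mirror the steps of Algorithm 2.1, but the actual selection, querying, and downloading logic is site- and policy-specific.

```python
def crawl_hidden_web_site(select_term, query_web_site, download, has_resources):
    """Generic Hidden-Web crawling loop of Figure 4 (Algorithm 2.1).

    select_term()          -> next query keyword (policy-dependent), or None
    query_web_site(term)   -> result index page R(qi) for the query
    download(result_index) -> fetches the pages linked from R(qi)
    has_resources()        -> False once the time/bandwidth budget is exhausted
    """
    while has_resources():                   # Step (1)
        qi = select_term()                   # Step (2): pick the next query term
        if qi is None:                       # no candidate terms left
            break
        result_index = query_web_site(qi)    # Step (3): issue query, get result index page
        download(result_index)               # Step (4): download the pages of interest
```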

2 For most Web sites that assume "AND" for multi-keyword queries, single-term queries return the maximum results. Extending our work to multi-keyword queries is straightforward.

Figure 5: A set-formalization of the optimal query selection problem.


Given this algorithm, we can see that the most critical decision that a crawler has to make is what query to issue next. If the crawler can issue successful queries that return many matching pages, the crawler can finish its crawling early on using minimum resources. In contrast, if the crawler issues completely irrelevant queries that do not return any matching pages, it may waste all of its resources simply issuing queries without ever retrieving actual pages. Therefore, how the crawler selects the next query can greatly affect its effectiveness. In the next section, we formalize this query selection problem.

2.3 Problem formalization

Theoretically, the problem of query selection can be formalized as follows. We assume that the crawler downloads pages from a Web site that has a set of pages S (the rectangle in Figure 5). We represent each Web page in S as a vertex in a graph (the dots in Figure 5). We also represent each potential query qi that a crawler can issue as a hyperedge in the graph (the circles in Figure 5). A hyperedge qi connects all the vertices (pages) that are returned when the crawler issues qi to the site. Each hyperedge is associated with a weight that represents the cost of issuing the query. Under this formalization, our goal is to find the set of hyperedges (queries) that cover the maximum number of vertices (Web pages) with the minimum total weight (cost). This problem is equivalent to the set-covering problem in graph theory [17].
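To make the formalization concrete, once the hyperedges are known, the coverage and cost of any candidate set of queries can be computed directly. The toy snippet below is purely illustrative, with made-up page ids and costs; in practice, of course, the result sets are not known in advance, which is exactly the first difficulty discussed next.

```python
# Each query (hyperedge) maps to the set of page ids (vertices) it returns,
# together with the cost (weight) of issuing it. Toy numbers only.
results = {"liver": {1, 2, 3, 5}, "kidney": {2, 4}, "cancer": {3, 4, 5, 6}}
costs   = {"liver": 1.0, "kidney": 1.0, "cancer": 1.0}

def coverage_and_cost(queries):
    """Number of distinct pages covered and total weight of a set of queries."""
    covered = set().union(*(results[q] for q in queries)) if queries else set()
    return len(covered), sum(costs[q] for q in queries)

print(coverage_and_cost(["liver", "cancer"]))   # (6, 2.0): all six toy pages covered
```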

There are two main difficulties with this formalization. First, in a practical situation, the crawler does not know which Web pages will be returned by which queries, so the hyperedges of S are not known in advance. Without knowing the hyperedges, the crawler cannot decide which queries to pick to maximize the coverage. Second, the set-covering problem is known to be NP-Hard [17], so an efficient algorithm to solve this problem in polynomial time has yet to be found.

To overcome these difficulties, we need the following two solutions: First, we need a way to predict how many pages will be returned for each query qi. Second, we need an approximation algorithm that can select semi-optimal queries at a reasonable computational cost. Later, in Section 3, we present our ideas for these solutions.



2.3.1 Performance Metric

Before we present our ideas for the query selection problem, we briefly discuss some of our notation and the cost/performance metrics.

Given a query qi, we use P(qi) to denote the fraction of pages that we will get back if we issue query qi to the site. For example, if a Web site has 10,000 pages in total, and 3,000 pages are returned for the query qi = "medicine", then P(qi) = 0.3. We use P(q1 ∧ q2) to represent the fraction of pages that are returned from both q1 and q2 (i.e., the intersection of P(q1) and P(q2)). Similarly, we use P(q1 ∨ q2) to represent the fraction of pages that are returned from either q1 or q2 (i.e., the union of P(q1) and P(q2)).

We also use Cost(qi) to represent the cost of issuing the query qi. Depending on the scenario, the cost can be measured in time, network bandwidth, the number of interactions with the site, or a combination of these. The query cost may also consist of a number of factors, including the cost of submitting the query to the site, retrieving the result index page (Figure 3(a)), and downloading the actual pages (Figure 3(b)). As we will see later, our proposed algorithms are independent of the exact cost function. As an example, we show how the query cost may be computed in one scenario.

Example 1 We assume that submitting a query incurs a fixed cost of cq, that the cost of downloading the result index page is proportional to the number of documents matching the query (with per-document cost cr), and that downloading a matching document incurs a fixed cost of cd. Then the overall cost of query qi is

Cost(qi) = cq + cr P(qi) + cd P(qi).    (1)

The first term is the fixed cost of submitting qi, the second term is the cost of downloading the result index page, and the third term is the cost of downloading all matching documents. In certain cases, some of the documents matching qi may have already been downloaded from previous queries. In this case, the crawler may skip downloading these documents and the cost of qi becomes

Cost(qi) = cq + cr P(qi) + cd Punq(qi).    (2)

Here, we use Punq(qi) to represent the fraction of new documents from qi that have not been retrieved by previous queries. Later, in Section 3.1, we study how we can estimate P(qi) and Punq(qi) to estimate the cost of qi.

Since our algorithms are independent of the exact cost function, we assume a generic cost function Cost(qi) in this paper. When we need a concrete cost function, however, we will use Equation 2.
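Under the concrete cost model of Equation 2, the cost estimate is a one-line computation. The function below is a hedged sketch with illustrative constants of our own choosing; cq, cr, and cd are in whatever units the crawler tracks, and the probabilities would come from the estimators of Section 3.1.

```python
def estimated_cost(p_qi, p_new, c_q=1.0, c_r=0.01, c_d=0.1):
    """Equation 2: Cost(qi) = cq + cr * P(qi) + cd * Punq(qi).

    p_qi:  estimated fraction of the site's pages matching qi
    p_new: estimated fraction of pages matching qi that are not yet downloaded
    """
    return c_q + c_r * p_qi + c_d * p_new
```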

Given this notation, we can formalize the goal of a Hidden-Web crawler as follows:

Problem 1 Find the set of queries q1, . . . , qn that maximizes

P(q1 ∨ · · · ∨ qn)

under the constraint

Σ_{i=1}^{n} Cost(qi) ≤ t.

Here, t is the maximum download resource that the crawler has.

3. KEYWORD SELECTION

How should a crawler select the queries to issue? Given that the goal is to download the maximum number of unique documents from a textual database, we may consider one of the following options:

• Random: We select random keywords from, say, an English dictionary and issue them to the database. The hope is that a random query will return a reasonable number of matching documents.

• Generic-frequency: We analyze a generic document corpus collected elsewhere (say, from the Web) and obtain the generic frequency distribution of each keyword. Based on this generic distribution, we start with the most frequent keyword, issue it to the Hidden-Web database and retrieve the result. We then continue to the second-most frequent keyword and repeat this process until we exhaust all download resources. The hope is that the frequent keywords in a generic corpus will also be frequent in the Hidden-Web database, returning many matching documents.

• Adaptive: We analyze the documents returned from the previous queries issued to the Hidden-Web database and estimate which keyword is most likely to return the most documents. Based on this analysis, we issue the most "promising" query, and repeat the process.

Among these three general policies, we may consider the random policy as the base comparison point, since it is expected to perform the worst. Between the generic-frequency and the adaptive policies, both may show similar performance if the crawled database has a generic document collection without a specialized topic. The adaptive policy, however, may perform significantly better than the generic-frequency policy if the database has a very specialized collection that is different from the generic corpus. We will experimentally compare these three policies in Section 4.

While the first two policies (the random and generic-frequency policies) are easy to implement, to implement the adaptive policy we need to understand how we can analyze the downloaded pages to identify the most "promising" query. We address this issue in the rest of this section.

3.1 Estimating the number of matching pages

In order to identify the most promising query, we need to estimate how many unique documents we will download if we issue the query qi as the next query. That is, assuming that we have issued queries q1, . . . , qi−1, we need to estimate P(q1 ∨ · · · ∨ qi−1 ∨ qi) for every potential next query qi and compare these values. In estimating this number, we note that we can rewrite P(q1 ∨ · · · ∨ qi−1 ∨ qi) as:

P((q1 ∨ · · · ∨ qi−1) ∨ qi)
  = P(q1 ∨ · · · ∨ qi−1) + P(qi) − P((q1 ∨ · · · ∨ qi−1) ∧ qi)
  = P(q1 ∨ · · · ∨ qi−1) + P(qi) − P(q1 ∨ · · · ∨ qi−1) P(qi|q1 ∨ · · · ∨ qi−1)    (3)

In the above formula, note that we can precisely measure P(q1 ∨ · · · ∨ qi−1) and P(qi | q1 ∨ · · · ∨ qi−1) by analyzing previously downloaded pages: We know P(q1 ∨ · · · ∨ qi−1), the union of all pages downloaded from q1, . . . , qi−1, since we have already issued q1, . . . , qi−1 and downloaded the matching pages.3 We can also measure P(qi | q1 ∨ · · · ∨ qi−1), the probability that qi appears in the pages from q1, . . . , qi−1, by counting how many times qi appears in the pages from q1, . . . , qi−1. Therefore, we only need to estimate P(qi) to evaluate P(q1 ∨ · · · ∨ qi). We may consider a number of different ways to estimate P(qi), including the following:

1. Independence estimator: We assume that the appearance of qi is independent of q1, . . . , qi−1. That is, we assume that P(qi) = P(qi|q1 ∨ · · · ∨ qi−1).

2. Zipf estimator: In [20], Ipeirotis et al. proposed a mechanism to estimate how many times a particular term occurs in the entire corpus based on a subset of documents from the corpus. Their method exploits the fact that the frequency of terms inside text collections follows a power-law distribution [27, 23]. That is, if we rank every term based on its occurrence frequency (with the most frequent term having rank 1, the second most frequent rank 2, etc.), then the frequency f of a term inside the text collection is given by:

f = α(r + β)^−γ    (4)

where r is the rank of the term and α, β, and γ are constants that depend on the text collection. We illustrate how we can use the above equation to estimate the P(qi) values through the following example:

Example 2 Assume that we have already issued queries q1 = disk, q2 = java, q3 = computer to a text database. The database contains 100,000 documents in total, and for each query q1, q2, and q3, the database returned 2,000, 1,100, and 6,000 pages, respectively. We show P(qi) for each query in the second column of Figure 6(a). For example, P(computer) = 6,000/100,000 = 0.06. We assume that the documents retrieved from q1, q2 and q3 do not have a significant overlap, and that we downloaded a total of 8,000 unique documents from the three queries. Given this, P(computer|q1 ∨ q2 ∨ q3) = 6,000/8,000 = 0.75, etc. We show P(qi|q1 ∨ q2 ∨ q3) for every query in the third column of Figure 6(a).

3 For exact estimation, we need to know the total number of pages in the site. However, since we need to compare only relative values among queries, we do not actually need to know the total number of pages.

Within the retrieved documents, we can count how many times other terms tk appear (i.e., P(tk|q1 ∨ q2 ∨ q3)); we also show these numbers in the third column. For example, data appears in 3,200 documents out of the 8,000 retrieved documents, so P(data|q1 ∨ q2 ∨ q3) = 0.4. However, we do not know exactly how many times the term data appears in the entire database, so the value of P(data) is unknown. We use question marks to represent unknown values. Our goal is to estimate these unknown values.

For estimation, we first rank each term tk based on its P(tk|q1 ∨ q2 ∨ q3) value. The rank of each term is shown in the fourth column of Figure 6(a). For example, the term computer appears most frequently within the downloaded pages, so its rank is 1.

Now, given Equation 4, we know that the rank R(tk) and the frequency P(tk) of a term roughly satisfy the following equation:

P(tk) = α[R(tk) + β]^−γ.    (5)

Then, using the known P(tk) values from previous queries, we can compute the constants α, β, and γ. That is, given the P(computer), P(disk), and P(java) values and their ranks, we can find the best-fitting curve for Equation 5. This curve fitting yields the constants α = 0.08, β = 0.25, and γ = 1.15. In Figure 6(b), we show the fitted curve together with the rank of each term.

Once we have obtained these constants, we can estimate all unknown P(tk) values. For example, since R(data) is 2, we can estimate that

P(data) = α[R(data) + β]^−γ = 0.08 (2 + 0.25)^−1.15 = 0.031.
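The curve fitting of Example 2 can be reproduced with a few lines of code. The sketch below is our own illustration under the assumption that SciPy is available; variable names are ours, and the fitted constants will only approximately match those quoted above.

```python
import numpy as np
from scipy.optimize import curve_fit

# Known values from Example 2: ranks within the downloaded pages and the
# corresponding P(tk) known from previously issued queries.
ranks = np.array([1.0, 3.0, 5.0])       # computer, disk, java
freqs = np.array([0.06, 0.02, 0.011])   # their P(tk) over the whole database

def zipf(r, alpha, beta, gamma):
    # Equation 5: P(tk) = alpha * (R(tk) + beta) ** (-gamma)
    return alpha * (r + beta) ** (-gamma)

# Fit alpha, beta, gamma to the three known (rank, frequency) pairs.
(alpha, beta, gamma), _ = curve_fit(zipf, ranks, freqs, p0=[0.1, 0.5, 1.0],
                                    bounds=([0, 0, 0], [np.inf, np.inf, np.inf]))

# Estimate an unknown value, e.g. P(data) with R(data) = 2.
print(zipf(2.0, alpha, beta, gamma))    # roughly 0.03 with the constants of Example 2
```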

After we estimate the P(qi) and P(qi|q1 ∨ · · · ∨ qi−1) values, we can calculate P(q1 ∨ · · · ∨ qi). In Section 3.3, we explain how we can efficiently compute P(qi|q1 ∨ · · · ∨ qi−1) by maintaining a succinct summary table. In the next section, we first examine how we can use this value to decide which query we should issue next to the Hidden-Web site.

3.2 Greedy query selection

The goal of the Hidden-Web crawler is to download the maximum number of unique documents from a database using its limited download resources. Given this goal, the Hidden-Web crawler has to take two factors into account: 1) the number of new documents that can be obtained from the query qi, and 2) the cost of issuing the query qi. For example, if two queries, qi and qj, incur the same cost, but qi returns more new pages than qj, then qi is more desirable than qj.

q1 = disk, q2 = java, q3 = computer

Term tk      P(tk)    P(tk|q1 ∨ q2 ∨ q3)    R(tk)
computer     0.06     0.75                  1
data         ?        0.4                   2
disk         0.02     0.25                  3
memory       ?        0.2                   4
java         0.011    0.13                  5
. . .        . . .    . . .                 . . .

(a) Probabilities of terms after queries q1, q2, q3.


(b) Zipf curve of example data.

Figure 6: Fitting a Zipf distribution in order to estimate P(qi).

Similarly, if qi and qj return the same number of new documents, but qi incurs less cost than qj, then qi is more desirable. Based on this observation, the Hidden-Web crawler may use the following efficiency metric to quantify the desirability of the query qi:

Efficiency(qi) = Punq(qi) / Cost(qi)

Here, Punq(qi) represents the amount of new documents returned for qi (the pages that have not been returned for previous queries), and Cost(qi) represents the cost of issuing the query qi.

Intuitively, the efficiency of qi measures how many new documents are retrieved per unit cost, and can be used as an indicator of how well our resources are spent when issuing qi. Thus, the Hidden-Web crawler can estimate the efficiency of every candidate qi and select the one with the highest value. By using its resources more efficiently, the crawler may eventually download the maximum number of unique documents. In Figure 7, we show the query selection function that uses the concept of efficiency. In principle, this algorithm takes a greedy approach and tries to maximize the "potential gain" in every step.

We can estimate the efficiency of every query using the estimation method described in Section 3.1.

Algorithm 3.1 Greedy SelectTerm()
Parameters:
  Tadp: The list of potential query keywords
Procedure
[2] Foreach tk in Tadp do
[3]   Estimate Efficiency(tk) = Punq(tk) / Cost(tk)
[4] done
[5] return tk with maximum Efficiency(tk)

Figure 7: Algorithm for selecting the next query term.

That is, the size of the new documents from the query qi, Punq(qi), is

Punq(qi) = P(q1 ∨ · · · ∨ qi−1 ∨ qi) − P(q1 ∨ · · · ∨ qi−1)
         = P(qi) − P(q1 ∨ · · · ∨ qi−1) P(qi|q1 ∨ · · · ∨ qi−1)

from Equation 3, where P(qi) can be estimated using one of the methods described in Section 3.1. We can also estimate Cost(qi) similarly. For example, if Cost(qi) is given by

Cost(qi) = cq + cr P(qi) + cd Punq(qi)

(Equation 2), we can estimate Cost(qi) by estimating P(qi) and Punq(qi).
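Under the simplifications used later in the experiments (unit query cost and the independence estimator), Punq(tk) = P(tk|q1 ∨ · · · ∨ qi−1) · (1 − P(q1 ∨ · · · ∨ qi−1)), so ranking candidates by estimated efficiency reduces to ranking them by P(tk|q1 ∨ · · · ∨ qi−1). The sketch below is our own illustration of SelectTerm() under those assumptions; the dictionary is the query statistics table of Section 3.3, not code from the paper.

```python
def select_next_term(stats, total_downloaded, issued):
    """Greedy SelectTerm() under unit query cost and the independence estimator.

    In that case Punq(tk) = P(tk|prev) * (1 - P(prev)), and since (1 - P(prev))
    is the same for every candidate, ranking by efficiency reduces to ranking
    by P(tk | q1 v ... v qi-1), i.e. by the fraction of already-downloaded
    documents that contain tk.

    stats: dict term -> number of downloaded documents containing the term
    total_downloaded: number of unique documents downloaded so far
    issued: set of terms already issued as queries
    """
    candidates = {t: c for t, c in stats.items() if t not in issued}
    if not candidates or total_downloaded == 0:
        return None
    # P(tk | previous queries) = count / total downloaded documents
    return max(candidates, key=lambda t: candidates[t] / total_downloaded)
```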

3.3 The query statistics table

In estimating the efficiency of queries, we found that we need to measure P(qi|q1 ∨ · · · ∨ qi−1) for every potential query qi. In this section, we explain how we can compute these values efficiently by maintaining a small table that we call the query statistics table.

The main idea for the query statistics table is that P(qi|q1 ∨ · · · ∨ qi−1) can be measured by counting how many times the keyword qi appears within the documents downloaded from q1, . . . , qi−1. We record these counts in a table, as shown in Figure 8(a). The left column of the table contains all potential query terms and the right column contains the number of previously downloaded documents containing the respective term. For example, the table in Figure 8(a) shows that we have downloaded 50 documents so far, and the term model appears in 10 of these documents. Given this number, we can compute that P(model|q1 ∨ · · · ∨ qi−1) = 10/50 = 0.2.

We note that the query statistics table needs to be updated whenever we issue a new query qi and download more documents. This update can be done efficiently, as we illustrate in the following example.

Example 3 After examining the query statistics table of Figure 8(a), we have decided to use the term "computer" as our next query qi. From the new query qi = "computer," we downloaded 20 more new pages. Out of these, 12 contain the keyword "model" and 18 the keyword "disk." The table in Figure 8(b) shows the frequency of each term in the newly downloaded pages.

We can update the old table (Figure 8(a)) to include this new information by simply adding the corresponding entries of Figures 8(a) and (b). The result is shown in Figure 8(c).

(a) After q1, . . . , qi−1 (Total pages: 50)
Term tk      N(tk)
model        10
computer     38
digital      50

(b) New from qi = computer (New pages: 20)
Term tk      N(tk)
model        12
computer     20
disk         18

(c) After q1, . . . , qi (Total pages: 50 + 20 = 70)
Term tk      N(tk)
model        10 + 12 = 22
computer     38 + 20 = 58
disk         0 + 18 = 18
digital      50 + 0 = 50

Figure 8: Updating the query statistics table.

For example, the keyword "model" now exists in 10 + 12 = 22 pages within the pages retrieved from q1, . . . , qi. According to this new table, P(model|q1 ∨ · · · ∨ qi) is now 22/70 ≈ 0.3.
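The update of Example 3 is simply element-wise addition of per-document term counts. The dictionary-based sketch below is our own illustration; the toy documents are constructed only to reproduce the counts of Example 3 and assume that every new result page contains the issued query term.

```python
from collections import Counter

def update_stats(stats, total_downloaded, new_docs):
    """Update the query statistics table after downloading new_docs,
    the previously-unseen documents returned by the latest query.

    stats: Counter mapping term -> number of downloaded documents containing it
    total_downloaded: number of unique documents downloaded before this query
    new_docs: iterable of documents, each represented as a set of terms
    """
    for doc_terms in new_docs:
        for term in doc_terms:
            stats[term] += 1          # add per-document counts (Figure 8(b) -> 8(c))
    return stats, total_downloaded + len(new_docs)

# Example 3: 50 documents downloaded so far, then 20 new pages from "computer".
stats = Counter({"model": 10, "computer": 38, "digital": 50})
new_docs = ([{"computer", "model", "disk"}] * 12     # 12 new pages contain "model"
            + [{"computer", "disk"}] * 6             # 18 new pages contain "disk" in total
            + [{"computer"}] * 2)
stats, total = update_stats(stats, 50, new_docs)
print(stats["model"] / total)   # P(model | q1 v ... v qi), about 22/70
```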

3.4 Crawling sites that do not return all results

In certain cases, when a query matches a large number of pages, the Hidden-Web site returns only a portion of those pages. For example, the Open Directory Project [2] allows the users to see only up to 10,000 results after they issue a query. Obviously, this kind of limitation has an immediate effect on our Hidden-Web crawler. First, since we can only retrieve up to a specific number of pages per query, our crawler will need to issue more queries (and potentially use up more resources) in order to download all the pages. Second, the query selection method that we presented in Section 3.2 assumes that for every potential query qi, we can find P(qi|q1 ∨ · · · ∨ qi−1). That is, for every query qi we can find the fraction of documents in the whole text database that contain qi together with at least one of q1, . . . , qi−1.

However, if the text database returned only a portion of the results for any of q1, . . . , qi−1, then the value P(qi|q1 ∨ · · · ∨ qi−1) is not accurate and may affect our decision for the next query qi, and potentially the performance of our crawler. Since we cannot retrieve more results per query than the maximum number the Web site allows, our crawler has no choice but to submit more queries. However, there is a way to estimate the correct value of P(qi|q1 ∨ · · · ∨ qi−1) in the case where the Web site returns only a portion of the results.

Again, assume that the Hidden-Web site we are currently crawling is represented as the rectangle in Figure 9 and its pages as points in the figure. Assume that we have already issued queries q1, . . . , qi−1, which returned a number of results less than the maximum number that the site allows, and that we have therefore downloaded all the pages for these queries (the large circle in Figure 9). That is, at this point, our estimate of P(qi|q1 ∨ · · · ∨ qi−1) is accurate.

Figure 9: A Web site that does not return all the results.

Now assume that we submit query qi to the Web site, but due to a limit on the number of results that we get back, we retrieve the set q′i (the small circle in Figure 9) instead of the full set qi (the dashed circle in Figure 9). We now need to update our query statistics table so that it has accurate information for the next step. That is, although we got the set q′i back, for every potential query qi+1 we need to find P(qi+1|q1 ∨ · · · ∨ qi):


P(qi+1|q1 ∨ · · · ∨ qi) = [ P(qi+1 ∧ (q1 ∨ · · · ∨ qi−1)) + P(qi+1 ∧ qi) − P(qi+1 ∧ qi ∧ (q1 ∨ · · · ∨ qi−1)) ] / P(q1 ∨ · · · ∨ qi)    (6)

In the previous equation, we can find P(q1 ∨ · · · ∨ qi) by estimating P(qi) with the methods shown in Section 3. Additionally, we can calculate P(qi+1 ∧ (q1 ∨ · · · ∨ qi−1)) and P(qi+1 ∧ qi ∧ (q1 ∨ · · · ∨ qi−1)) by directly examining the documents that we have downloaded from queries q1, . . . , qi−1. The term P(qi+1 ∧ qi), however, is unknown and we need to estimate it. Assuming that q′i is a random sample of qi, then:

P(qi+1 ∧ qi) / P(qi+1 ∧ q′i) = P(qi) / P(q′i)    (7)

From Equation 7 we can calculate P(qi+1 ∧ qi), and after we substitute this value into Equation 6 we can find P(qi+1|q1 ∨ · · · ∨ qi).
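A hedged sketch of this correction is shown below. It is our own illustration of how Equations 6 and 7 combine: the directly measurable quantities come from counting over the crawler's local document store, while P(qi) and P(q1 ∨ · · · ∨ qi) would come from the estimators of Section 3.1.

```python
def corrected_conditional(p_next_and_prev, p_next_and_qi_sample, p_next_and_qi_and_prev,
                          p_qi_est, p_qi_sample, p_prev_union_qi):
    """Estimate P(q_{i+1} | q1 v ... v qi) when only a sample q'_i of qi was returned.

    Measured directly on downloaded documents:
      p_next_and_prev        = P(q_{i+1} ^ (q1 v ... v q_{i-1}))
      p_next_and_qi_sample   = P(q_{i+1} ^ q'_i)
      p_next_and_qi_and_prev = P(q_{i+1} ^ qi ^ (q1 v ... v q_{i-1}))
    Estimated quantities:
      p_qi_est        = estimated P(qi), e.g. from the Zipf estimator
      p_qi_sample     = P(q'_i), the fraction actually retrieved for qi
      p_prev_union_qi = estimated P(q1 v ... v qi)
    """
    # Equation 7: scale the sample-based intersection up to the full result set of qi.
    p_next_and_qi = p_next_and_qi_sample * (p_qi_est / p_qi_sample)
    # Equation 6
    return (p_next_and_prev + p_next_and_qi - p_next_and_qi_and_prev) / p_prev_union_qi
```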

4. EXPERIMENTAL EVALUATION

In this section we experimentally evaluate the performance of the various algorithms for Hidden-Web crawling presented in this paper. Our goal is to validate our theoretical analysis through real-world experiments, by crawling popular Hidden-Web sites of textual databases. Since the number of documents that are discovered and downloaded from a textual database depends on the selection of the words that will be issued as queries to the search interface of each site, we compare the various selection policies described in Section 3, namely the random, generic-frequency, and adaptive algorithms.

The adaptive algorithm learns new keywords and terms from the documents that it downloads, and its selection process is driven by the cost model described in Section 3.2. To make our experiment and its analysis simple, we choose a unit cost for every query. That is, our goal is to maximize the number of downloaded pages by issuing the least number of queries. In addition, we use the independence estimator (Section 3.1) to estimate P(qi) from downloaded pages.

For the generic-frequency policy, we compute the frequency distribution of words that appear in a 5.5-million-Web-page corpus downloaded from 154 Web sites of various topics [4]. Keywords are selected in decreasing order of the frequency with which they appear in this document set, with the most frequent one selected first, followed by the second-most frequent keyword, and so on.4

Regarding the random policy, we use the same set of words collected from the Web corpus, but in this case, instead of selecting keywords based on their relative frequency, we choose them randomly (with uniform probability). In order to further investigate how the quality of the potential query-term list affects the random-based algorithm, we construct two sets: one with the 16,000 most frequent words of the term collection used in the generic-frequency policy (hereafter, the random policy with the set of 16,000 words will be referred to as random-16K), and another set with the 1,000,000 most frequent words of the same collection (hereafter referred to as random-1M). The former set contains frequent words that appear in a large number of documents (at least 10,000 in our collection), and can therefore be considered to consist of "high-quality" terms. The latter set, though, contains a much larger collection of words, among which some are bogus and meaningless (e.g., "xxzyz").
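As an illustration of how such query-term pools might be built from a background corpus (our own sketch; the paper does not specify the implementation, and the corpus-loading step is left abstract):

```python
import random
from collections import Counter

def build_term_lists(documents, top_generic=1_000_000, top_quality=16_000):
    """Build the generic-frequency list and the two random-policy pools.

    documents: iterable of token lists from the background corpus
    Returns (generic_frequency_list, random_16k_pool, random_1m_pool).
    """
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))                 # document frequency: one count per document
    ranked = [t for t, _ in df.most_common()]  # most frequent first
    generic_frequency = ranked                 # issue in this order for generic-frequency
    random_16k = ranked[:top_quality]          # "high-quality" pool
    random_1m = ranked[:top_generic]           # large pool, may include bogus tokens
    return generic_frequency, random_16k, random_1m

def random_policy_next(pool, issued):
    """Pick the next query uniformly at random from the pool, without repeats."""
    remaining = [t for t in pool if t not in issued]
    return random.choice(remaining) if remaining else None
```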

The experiments were conducted by employing each of the aforementioned algorithms (adaptive, generic-frequency, random-16K, and random-1M) to crawl and download content from three Hidden-Web sites: the PubMed Medical Library,5 Amazon,6 and the Open Directory Project.7 According to the information on PubMed's Web site, its collection contains approximately 14 million abstracts of biomedical articles. We consider these abstracts as the "documents" in the site, and in each iteration of the adaptive policy, we use these abstracts as input to the algorithm. Thus our goal is to "discover" as many unique abstracts as possible by repeatedly querying the Web query interface provided by PubMed. The Hidden-Web crawl of the PubMed Web site can be considered topic-specific, since all abstracts within PubMed are related to the fields of medicine and biology.

In the case of the Amazon Web site, we are interested in downloading all the hidden pages that contain information on books. Querying Amazon is performed through the Software Developer's Kit that Amazon provides for interfacing to its Web site, which returns results in XML form. The generic "keyword" field is used for searching.

4 We did not manually exclude stop words (e.g., the, is, of, etc.) from the keyword list. As it turns out, all Web sites except PubMed return matching documents for stop words such as "the."
5 http://www.pubmed.org
6 http://www.amazon.com
7 http://dmoz.org

Figure 10: Coverage of policies for PubMed. (Plot: cumulative fraction of unique documents vs. query number, for the adaptive, generic-frequency, random-16K, and random-1M policies.)

As input to the adaptive policy, we extract the product description and the text of customer reviews when present in the XML reply. Since Amazon does not provide any information on how many books it has in its catalogue, we use random sampling on the 10-digit ISBN numbers of the books to estimate the size of the collection. Out of the 10,000 random ISBN numbers queried, 46 are found in the Amazon catalogue; therefore the size of its book collection is estimated to be 46/10,000 · 10^10 = 4.6 million books. It is also worth noting here that Amazon poses an upper limit on the number of results (books in our case) returned by each query, which is set to 32,000.

As for the third Hidden-Web site, the Open Directory Project (hereafter also referred to as dmoz), the site maintains links to 3.8 million sites together with a brief summary of each listed site. The links are searchable through a keyword-search interface. We consider each indexed link together with its brief summary as a document of the dmoz site, and we provide the short summaries to the adaptive algorithm to drive the selection of new keywords for querying. On the dmoz Web site, we perform two Hidden-Web crawls: the first is on its generic collection of 3.8 million indexed sites, regardless of the category into which they fall. The other crawl is performed specifically on the Arts section of dmoz (http://dmoz.org/Arts), which comprises approximately 429,000 indexed sites that are relevant to Arts, making this crawl topic-specific, as in PubMed. Like Amazon, dmoz also enforces an upper limit on the number of returned results, which is 10,000 links with their summaries.

4.1 Comparison of policies

The first question that we seek to answer concerns the evolution of the coverage metric as we submit queries to the sites. That is, what fraction of the collection of documents stored in a Hidden-Web site can we download as we continuously query for new words selected using the policies described above? More formally, we are interested in the value of P(q1 ∨ · · · ∨ qi−1 ∨ qi) after we submit queries q1, . . . , qi, as i increases.

In Figures 10, 11, 12, and 13 we present the coverage metric for each policy, as a function of the query number, for the Web sites of PubMed, Amazon, general dmoz, and the Arts-specific dmoz, respectively.

Figure 11: Coverage of policies for Amazon. (Plot: cumulative fraction of unique documents vs. query number, for the adaptive, generic-frequency, random-16K, and random-1M policies.)

Figure 12: Coverage of policies for general dmoz. (Plot: cumulative fraction of unique documents vs. query number, for the adaptive, generic-frequency, random-16K, and random-1M policies.)

On the y-axis the fraction of the total documents downloaded from the Web site is plotted, while the x-axis represents the query number. A first observation from these graphs is that, in general, the generic-frequency and the adaptive policies perform much better than the random-based algorithms. In all of the figures, the graphs for random-1M and random-16K are significantly below those of the other policies.

Figure 13: Coverage of policies for the Arts section of dmoz. (Plot: cumulative fraction of unique documents vs. query number, for the adaptive, generic-frequency, random-16K, and random-1M policies.)

Figure 14: Convergence of the adaptive algorithm using different initial queries for crawling the PubMed Web site. (Plot: cumulative fraction of unique documents vs. query number, for the seed queries pubmed, data, information, and return.)


Between the generic-frequency and the adaptive policies, we can see that the latter outperforms the former when the site is topic-specific. For example, for the PubMed site (Figure 10), the adaptive algorithm issues only 83 queries to download almost 80% of the documents stored in PubMed, while the generic-frequency algorithm requires 106 queries for the same coverage. For the dmoz/Arts crawl (Figure 13), the difference is even more substantial: the adaptive policy is able to download 99.98% of the total sites indexed in the Directory by issuing 471 queries, while the frequency-based algorithm is much less effective for the same number of queries, discovering only 72% of the total number of indexed sites. The adaptive algorithm, by examining the contents of the pages that it downloads at each iteration, is able to identify the topic of the site as expressed by the words that appear most frequently in the result set. Consequently, it is able to select words for subsequent queries that are more relevant to the site than those preferred by the generic-frequency policy, which are drawn from a large, generic collection. Table 1 shows a sample of 10 keywords out of the 211 chosen and submitted to the PubMed Web site by the adaptive algorithm but not by the other policies. For each keyword, we present the iteration at which it was selected, along with the number of results that it returned. As one can see from the table, these keywords are highly relevant to the topics of medicine and biology of the PubMed Medical Library, and match numerous articles stored on its Web site.

In both cases examined in Figures 10 and 13, the random-based policies perform much worse than the adaptive algorithm and the generic-frequency policy. It is worth noting, however, that the random-based policy with the small, carefully selected set of 16,000 "quality" words manages to download a considerable fraction, 42.5%, of the PubMed Web site after 200 queries, while the coverage for the Arts section of dmoz reaches 22.7% after 471 queried keywords. On the other hand, the random-based approach that makes use of the vast collection of 1,000,000 words, among which a large number are bogus keywords, fails to download even a mere 1% of the total collection after submitting the same number of query words.

Iteration   Keyword     Number of Results
23          department  2,719,031
34          patients    1,934,428
53          clinical    1,198,322
67          treatment   4,034,565
69          medical     1,368,200
70          hospital    503,307
146         disease     1,520,908
172         protein     2,620,938
174         molecular   951,639
185         diagnosis   4,276,518

Table 1: Sample of keywords queried to PubMed exclusively by the adaptive policy


For the generic collections of Amazon and dmoz, shown in Figures 11 and 12 respectively, we get mixed results: the generic-frequency policy shows slightly better performance than the adaptive policy for the Amazon site (Figure 11), while the adaptive method clearly outperforms the generic-frequency policy for the general dmoz site (Figure 12). A closer look at the log files of the two Hidden-Web crawlers reveals the main reason: Amazon was functioning in a very flaky way when the adaptive crawler visited it, resulting in a large number of lost results. Thus, we suspect that the slightly poorer performance of the adaptive policy is due to this experimental variance. We are currently running another experiment to verify whether this is indeed the case. Aside from this experimental variance, the Amazon result indicates that if the collection and the words that a Hidden-Web site contains are generic enough, then the generic-frequency approach may be a good candidate algorithm for effective crawling.

As in the case of topic-specific Hidden-Web sites, the random-based policies also exhibit poor performance compared to the other two algorithms when crawling generic sites: for the Amazon Web site, random-16K succeeds in downloading almost 36.7% after issuing 775 queries, while for the generic collection of dmoz, the fraction of the collection of links downloaded is 13.5% after the 770th query. Finally, as expected, random-1M is even worse than random-16K, downloading only 14.5% of Amazon and 0.3% of the generic dmoz.

In summary, the adaptive algorithm performs remarkably well in all cases: it is able to discover and download most of the documents stored in Hidden-Web sites by issuing the fewest queries. When the collection refers to a specific topic, it is able to identify the keywords most relevant to the topic of the site and consequently ask for terms that are most likely to return a large number of results. The generic-frequency policy also proves quite effective, though less so than the adaptive policy: it is able to retrieve a relatively large portion of the collection quickly, and when the site is not topic-specific, its effectiveness can reach that of the adaptive policy (e.g., Amazon). Finally, the random policy performs poorly in general, and should not be preferred.


4.2 Impact of the initial query

An interesting issue that deserves further examination is whether the initial choice of the keyword used as the first query issued by the adaptive algorithm affects its effectiveness in subsequent iterations. The choice of this keyword is not made by the adaptive algorithm itself and has to be set manually, since its query statistics tables have not been populated yet. Thus, the selection is generally arbitrary, so for the purpose of fully automating the whole process, some additional investigation seems necessary.

For this reason, we initiated three adaptive Hidden-Web crawlers targeting the PubMed Web site with different seed words: the word "data", which returns 1,344,999 results; the word "information", which reports 308,474 documents; and the word "return", which retrieves 29,707 pages, out of 14 million. These keywords represent varying degrees of term popularity in PubMed, with the first being of high popularity, the second of medium, and the third of low popularity. We also show results for the keyword "pubmed", used in the coverage experiments, which returns 695 articles. As we can see from Figure 14, after a small number of queries, all four crawlers download roughly the same fraction of the collection, regardless of their starting point: their coverages are roughly equivalent from the 25th query onward. Eventually, all four crawlers use the same set of terms for their queries, regardless of the initial query. In this specific experiment, from the 36th query onward, all four crawlers use the same terms for their queries in each iteration, or the same terms are used with an offset of one or two query numbers.

4.3 Other observations

Regarding the number of queries required to download a considerable portion of the pages stored in a Hidden-Web site, a comparison of Figures 10, 11, and 12 shows that when a site sets an upper limit on the number of results that it returns for a given query, the algorithms need more iterations to increase their coverage. For example, fewer than 100 queries are issued by the adaptive policy in order to download 80% of PubMed's collection, while for the same fraction, more than 700 are required for the Amazon and generic-dmoz Web sites, which impose limits of 32,000 and 10,000 results, respectively. This essentially means that the net effect of choosing a very "successful" keyword (in terms of the number of pages that it matches) is limited by the cap enforced by the sites, and the improvement from the generic-frequency or the adaptive policy is less dramatic.

Another interesting side effect of the adaptive algorithm is its support for multilingual Hidden-Web sites, without any additional modification. In our attempt to explain the significantly better performance of the adaptive algorithm on the generic dmoz Web site (which is not topic-specific), we noticed that the adaptive algorithm issues non-English queries to the site.

This is because the adaptive algorithm learns its vocabulary from the pages it downloads, and is therefore able to discover frequent words that do not necessarily belong to the English dictionary. On the contrary, the generic-frequency algorithm is restricted by the language of the corpus that was analyzed before the crawl. In our case, the generic-frequency algorithm used a 5.5-million-page English corpus, so as a result it was limited to queries that could match only English words.

5. CONCLUSION AND FUTURE WORK

Traditional crawlers normally follow links on the Web to discover and download pages, and therefore cannot reach the Hidden-Web pages, which are accessible only through query interfaces. In this paper, we studied how we can build a Hidden-Web crawler that can automatically query a Hidden-Web site and download pages from it. We proposed three different query generation policies for the Hidden Web: a policy that picks queries at random from a list of keywords, a policy that picks queries based on their frequency in a generic text collection, and a policy that adaptively picks a good query based on the content of the pages downloaded from the Hidden-Web site. Experimental evaluation on 4 real Hidden-Web sites shows that our policies have great potential. In particular, in certain cases the adaptive policy can download more than 90% of a Hidden-Web site after issuing approximately 100 queries. Given these results, we believe that our work provides a potential mechanism to improve the search-engine coverage of the Web and the user experience of Web search.

5.1 Future Work

We briefly discuss some future research avenues.

Multi-attribute Databases: We are currently investigating how to extend our ideas to structured multi-attribute databases. While generating queries for multi-attribute databases is clearly a more difficult problem, we may exploit the following observation to address it: when a site supports multi-attribute queries, the site often returns pages that contain values for each of the query attributes. For example, when an online bookstore supports queries on title, author and isbn, the pages returned from a query typically contain the title, author and ISBN of the corresponding books. Thus, if we can analyze the returned pages and extract the values for each field (e.g., title = 'Harry Potter', author = 'J.K. Rowling', etc.), we can apply the same idea that we used for the textual database: estimate the frequency of each attribute value and pick the most promising one. The main challenge is to automatically segment the returned pages so that we can identify the sections of the pages that present the values corresponding to each attribute. Since many Web sites follow limited formatting styles in presenting multiple attributes (for example, most book titles are preceded by the label "Title:"), we believe we may learn page-segmentation rules automatically from a small set of training examples.
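As a toy illustration of the kind of label-based page segmentation described above (entirely our own sketch: the labels and regular expressions are hypothetical, and a real system would learn such rules from training examples rather than hard-code them):

```python
import re

# Hypothetical label-based extraction rules for a book result page.
FIELD_PATTERNS = {
    "title":  re.compile(r"Title:\s*(.+)"),
    "author": re.compile(r"Author:\s*(.+)"),
    "isbn":   re.compile(r"ISBN:\s*([0-9Xx-]+)"),
}

def extract_attributes(page_text):
    """Pull labeled attribute values out of a result page, one per field."""
    values = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(page_text)
        if match:
            values[field] = match.group(1).strip()
    return values

page = "Title: Harry Potter\nAuthor: J.K. Rowling\nISBN: 0590353403"
print(extract_attributes(page))   # {'title': 'Harry Potter', 'author': 'J.K. Rowling', 'isbn': '0590353403'}
```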

Other Practical Issues: In addition to the automatic query generation problem, there are many practical issues to be addressed to build a fully automatic Hidden-Web crawler. For example, in this paper we assumed that the crawler already knows all query interfaces for Hidden-Web sites. But how can the crawler discover the query interfaces? The method proposed in [16] may be a good starting point.

In addition, some Hidden-Web sites return their results in batches of, say, 20 pages, so the user has to click on a "next" button in order to see more results. In this case, a fully automatic Hidden-Web crawler should recognize that the first result index page contains only a partial result and "press" the next button automatically.
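A hedged sketch of such result-page pagination is shown below; it is our own illustration, and fetch_page, extract_links, and find_next_url are hypothetical placeholders for site-specific logic.

```python
def collect_all_result_links(first_index_url, fetch_page, extract_links, find_next_url,
                             max_pages=1000):
    """Follow "next" links from a result index page until no more batches remain.

    fetch_page(url)      -> html of a result index page
    extract_links(html)  -> list of result URLs on that page
    find_next_url(html)  -> URL of the next batch, or None on the last batch
    """
    links, url, visited = [], first_index_url, 0
    while url is not None and visited < max_pages:
        html = fetch_page(url)
        links.extend(extract_links(html))   # result links on this batch
        url = find_next_url(html)           # None when the last batch is reached
        visited += 1
    return links
```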

Finally, some Hidden-Web sites may contain an infinite number of Hidden-Web pages which do not contribute much significant content (e.g., a calendar with links for every day). In this case the Hidden-Web crawler should be able to detect that the site does not have much more new content and stop downloading pages from the site. Page similarity detection algorithms may be useful for this purpose [10, 9, 26, 14].

6. REFERENCES

[1] LexisNexis: law, public records, company data, government, academic and business news sources. http://www.lexisnexis.com.
[2] The Open Directory Project. http://www.dmoz.org.
[3] PubMed Medical Library. http://www.pubmed.org.
[4] Reference hidden to maintain anonymity.
[5] United States Patent and Trademark Office. http://www.uspto.gov.

[6] E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. In 19th International Conference on Data Engineering (ICDE), 2003.
[7] E. Agichtein, P. Ipeirotis, and L. Gravano. Modeling query-based access to text databases. In Proceedings of the Sixth International Workshop on the Web and Databases (WebDB 2003), 2003.
[8] M. K. Bergman. The deep web: Surfacing hidden value. http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/deepwebwhitepaper.pdf, 2001.
[9] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In SIGMOD Conference, May 1995.
[10] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of the International World-Wide Web Conference, April 1997.
[11] J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD Conference, 1999.
[12] J. P. Callan and M. E. Connell. Query-based sampling of text databases. Information Systems, 19(2):97–130, 2001.
[13] K. C.-C. Chang, B. He, C. Li, and Z. Zhang. Structured databases on the web: Observations and implications. Technical report, UIUC, 2003.
[14] J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. In SIGMOD Conference, Dallas, Texas, May 2000.
[15] W. Cohen and Y. Singer. Learning to query the web. In Proceedings of the AAAI Workshop on Internet-Based Information Systems, 1996.
[16] J. Cope, N. Craswell, and D. Hawking. Automated discovery of search interfaces on the web. In Proceedings of the Fourteenth Australasian Database Conference on Database Technologies 2003, pages 181–189. Australian Computer Society, Inc., 2003.
[17] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms, Second Edition. MIT Press/McGraw-Hill, 2001.
[18] D. Florescu, A. Y. Levy, and A. O. Mendelzon. Database techniques for the world-wide web: A survey. SIGMOD Record, 27(3):59–74, 1998.
[19] B. He and K. C.-C. Chang. Statistical schema matching across web query interfaces. In SIGMOD Conference, 2003.
[20] P. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. In Proceedings of the Twenty-eighth International Conference on Very Large Databases, 2002.
[21] P. G. Ipeirotis, L. Gravano, and M. Sahami. Probe, count, and classify: Categorizing hidden web databases. In SIGMOD Conference, 2001.

[22] V. Z. Liu, R. C. Luo, J. Cho, and W. W. Chu. DPro: A probabilistic approach for hidden web database selection using dynamic probing. In 20th International Conference on Data Engineering (ICDE), 2004.

[23] B. B. Mandelbrot. Fractal Geometry of Nature. W. H. Freeman & Co., 1998.
[24] S. Olsen. Does search engine's power threaten web's independence? Available at http://news.com.com/2009-1023-963618.html, October 2002.
[25] S. Raghavan and H. Garcia-Molina. Crawling the hidden web. In Proceedings of the Twenty-seventh International Conference on Very Large Databases, 2001.
[26] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, June 1995.
[27] G. K. Zipf. Human Behavior and the Principle of Least-Effort. Addison-Wesley, Cambridge, MA, 1949.