
Building Efficient and Effective Metasearch Engines

WEIYI MENG

State University of New York at Binghamton

CLEMENT YU

University of Illinois at Chicago

AND

KING-LUP LIU

DePaul University

Frequently a user’s information needs are stored in the databases of multiple search engines. It is inconvenient and inefficient for an ordinary user to invoke multiple search engines and identify useful documents from the returned results. To support unified access to multiple search engines, a metasearch engine can be constructed. When a metasearch engine receives a query from a user, it invokes the underlying search engines to retrieve useful information for the user. Metasearch engines have other benefits as a search tool such as increasing the search coverage of the Web and improving the scalability of the search. In this article, we survey techniques that have been proposed to tackle several underlying challenges for building a good metasearch engine. Among the main challenges, the database selection problem is to identify search engines that are likely to return useful documents to a given query. The document selection problem is to determine what documents to retrieve from each identified search engine. The result merging problem is to combine the documents returned from multiple search engines. We will also point out some problems that need to be further researched.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems—Distributed databases; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process; Selection process; H.3.4 [Information Storage and Retrieval]: Systems and Software—Information networks

General Terms: Design, Experimentation, Performance

Additional Key Words and Phrases: Collection fusion, distributed collection, distributed information retrieval, information resource discovery, metasearch

This work was supported in part by the following National Science Foundation (NSF) grants: IIS-9902872, IIS-9902792, and EIA-9911099.

Authors’ addresses: W. Meng, Department of Computer Science, State University of New York at Binghamton, Binghamton, NY 13902; email: [email protected]; C. Yu, Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607; email: [email protected]; K.-L. Liu, School of Computer Science, Telecommunications and Information Systems, DePaul University, Chicago, IL 60604; email: [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 2002 ACM 0360-0300/02/0300-0048 $5.00

ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 48–89.


1. INTRODUCTION

The Web has become a vast information resource in recent years. Millions of people use the Web on a regular basis and the number is increasing rapidly. Most data on the Web is in the form of text or image. In this survey, we concentrate on the search of text data.

Finding desired data on the Web in a timely and cost-effective way is a problem of wide interest. In the last several years, many search engines have been created to help Web users find desired information. Each search engine has a text database that is defined by the set of documents that can be searched by the search engine. When there is no confusion, the term database and the phrase search engine will be used interchangeably in this survey. Usually, an index for all documents in the database is created in advance. For each term that represents a content word or a combination of several (usually adjacent) content words, this index can quickly identify the documents that contain the term. Google, AltaVista, Excite, Lycos, and HotBot are all popular search engines on the Web.

Two types of search engines exist. General-purpose search engines aim at providing the capability to search all pages on the Web. The search engines we mentioned in the previous paragraph are a few of the well-known ones. Special-purpose search engines, on the other hand, focus on documents in confined domains such as documents in an organization or in a specific subject area. For example, the Cora search engine (cora.whizbang.com) focuses on computer science research papers and Medical World Search (www.mwsearch.com) is a search engine for medical information. Most organizations and business sites have installed search engines for their pages. It is believed that hundreds of thousands of special-purpose search engines currently exist on the Web [Bergman 2000].

The amount of data on the Web is huge. It is believed that by February of 1999, there were already more than 800 million publicly indexable Web pages [Lawrence and Lee Giles 1999] and the number is well over 2 billion now (Google has indexed over 2 billion pages) and is increasing at a very high rate. Many believe that employing a single general-purpose search engine for all data on the Web is unrealistic [Hawking and Thistlewaite 1999; Sugiura and Etzioni 2000; Wu et al. 2001]. First, its processing power may not scale to the rapidly increasing and virtually unlimited amount of data. Second, gathering all the data on the Web and keeping it reasonably up-to-date are extremely difficult if not impossible objectives. Programs (e.g., Web robots) used by major search engines to gather data automatically may slow down local servers and are increasingly unpopular. Furthermore, many sites may not allow their documents to be indexed but instead may allow the documents to be accessed through their search engines only (these sites are part of the so-called deep Web [Bergman 2000]). Consequently, we have to live with the reality of having a large number of special-purpose search engines that each covers a portion of the Web.

A metasearch engine is a system that provides unified access to multiple existing search engines. A metasearch engine does not maintain its own index of documents. However, a sophisticated metasearch engine may maintain information about the contents of its underlying search engines to provide better service. In a nutshell, when a metasearch engine receives a user query, it first passes the query (with necessary reformatting) to the appropriate underlying search engines, and then collects and reorganizes the results received from them. A simple two-level architecture of a metasearch engine is depicted in Figure 1. This two-level architecture can be generalized to a hierarchy of more than two levels when the number of underlying search engines becomes large [Baumgarten 1997; Gravano and Garcia-Molina 1995; Sheldon et al. 1994; Yu et al. 1999b].

There are a number of reasons for the development of a metasearch engine and we discuss these reasons below.


Fig. 1. A simple metasearch architecture.

(1) Increase the search coverage of the Web. A recent study [Lawrence and Lee Giles 1999] indicated that the coverage of the Web by individual major general-purpose search engines has been decreasing steadily. This is mainly due to the fact that the Web has been increasing at a much faster rate than the indexing capability of any single search engine. By combining the coverages of multiple search engines through a metasearch engine, a much higher percentage of the Web can be searched. While the largest general-purpose search engines index less than 2 billion Web pages, all special-purpose search engines combined may index up to 500 billion Web pages [Bergman 2000].

(2) Solve the scalability of searching the Web. As we mentioned earlier, the approach of employing a single general-purpose search engine for the entire Web has poor scalability. In contrast, if a metasearch engine on top of all the special-purpose search engines can be created as an alternative to search the entire Web, then the problems associated with employing a single general-purpose search engine will either disappear or be significantly alleviated. The size of a typical special-purpose search engine is much smaller than that of a major general-purpose search engine. Therefore, it is much easier for it to keep its index data more up to date (i.e., updating of index data to reflect the changes of documents can be carried out more frequently). It is also much easier to build the necessary hardware and software infrastructure for a special-purpose search engine. As a result, the metasearch engine approach for searching the entire Web is likely to be significantly more scalable than the centralized general-purpose search engine approach.

(3) Facilitate the invocation of multiple search engines. The information needed by a user is frequently stored in the databases of multiple search engines. As an example, consider the case when a user wants to find the best 10 newspaper articles about a special event. It is likely that the desired articles are scattered across the databases of a number of newspapers. The user can send his/her query to every newspaper database and examine the retrieved articles from each database to identify the 10 best articles. This is a formidable task. First, the user will have to identify the sites of the newspapers. Second, the user will need to send the query to each of these databases. Since different databases may accept queries in different formats, the user will have to format the query correctly for each database. Third, there will be no overall quality ranking among the articles returned from these databases even though the retrieved articles from each individual database may be ranked. As a result, it will be difficult for the user, without reading the contents of the articles, to determine which articles are likely to be among the most useful ones. If there are a large number of databases, each returning some articles to the user, then the user will simply be overwhelmed. If a metasearch engine on top of these local search engines is built, then the user only needs to submit one query to invoke all local search engines via the metasearch engine. A good metasearch engine can rank the documents returned from different search engines properly. Clearly, such a metasearch engine makes the user’s task much easier.

(4) Improve the retrieval effectiveness. Consider the scenario where a user needs to find documents in a specific subject area. Suppose that there is a special-purpose search engine for this subject area and there is also a general-purpose search engine that contains all the documents indexed by the special-purpose search engine in addition to many documents unrelated to this subject area. It is usually true that if the user submits the same query to both of the two search engines, the user is likely to obtain better results from the special-purpose search engine than the general-purpose search engine. In other words, the existence of a large number of unrelated documents in the general-purpose search engine may hinder the retrieval of desired documents. In text retrieval, documents in the same collection can be grouped into clusters such that the documents in the same cluster are more related than documents across different clusters. When evaluating a query, clusters related to the query can be identified first and then the search can be carried out for these clusters. This method has been shown to improve the retrieval effectiveness of the system [Xu and Croft 1999]. For documents on the Web, the databases in different special-purpose search engines are natural clusters. As a result, if for any given query submitted to the metasearch engine, the search can be restricted to only special-purpose search engines related to the query, then it is likely that better retrieval effectiveness can be achieved using the metasearch engine than using a general-purpose search engine. While it may be possible for a general-purpose search engine to cluster its documents to improve retrieval effectiveness, the quality of these clusters may not be as good as the ones corresponding to special-purpose search engines. Furthermore, constructing and maintaining the clusters consumes more resources of the general-purpose search engine.

This article has three objectives. First, we review the main technical issues in building a good metasearch engine. Second, we survey different proposed techniques for tackling these issues. Third, we point out new challenges and research directions in the metasearch engine area.

The rest of the article is organized as follows. In Section 2, we provide a short overview of some basic concepts on information retrieval (IR). These concepts are important for the discussions in this article. In Section 3, we outline the main software components of a metasearch engine. In Section 4, we discuss how the autonomy of different local search engines, as well as the heterogeneities among them, may affect the building of a good metasearch engine. In Section 5, we survey reported techniques for the database selection problem (i.e., determining which databases to search for a given user query). In Section 6, we survey known methods for the document selection problem (i.e., determining what documents to retrieve from each selected database for a user query). In Section 7, we report different techniques for the result merging problem (i.e., combining results returned from different local databases into a single ranked list). In Section 8, we present some new challenges for building a good metasearch engine.

2. BASIC INFORMATION RETRIEVAL

Information retrieval deals with techniques for finding relevant (useful) documents for any given query from a collection of documents. Documents are typically preprocessed and represented in a form that facilitates efficient and accurate retrieval. In this section, we first overview some basic concepts in classical information retrieval and then point out several features specifically associated with Web search engines.

2.1. Classical Information Retrieval

The contents of a document may be represented by the words contained in it. Some words such as “a,” “of,” and “is” do not contain semantic information. These words are called stop words and are usually not used for document representation. The remaining words are content words and can be used to represent the document. Variations of the same word may be mapped to the same term. For example, the words “beauty,” “beautiful,” and “beautify” can be denoted by the term “beaut.” This can be achieved by a stemming program. After removing stop words and stemming, each document can be logically represented by a vector of n terms [Salton and McGill 1983; Yu and Meng 1998], where n is the total number of distinct terms in the set of all documents in a document collection.

Suppose the document d is represented by the vector (d_1, ..., d_i, ..., d_n), where d_i is a number (weight) indicating the importance of the ith term in representing the contents of the document d. Most of the entries in the vector will be zero because most terms are absent from any given document. When a term is present in a document, the weight assigned to the term is usually based on two factors. The term frequency (tf) of a term in a document is the number of times the term occurs in the document. Intuitively, the higher the term frequency of a term is, the more important the term is in representing the contents of the document. As a consequence, the term frequency weight (tfw) of the term in the document is usually a monotonically increasing function of its term frequency. The second factor affecting the weight of a term is the document frequency (df), which is the number of documents having the term. Usually, the higher the document frequency of a term is, the less important the term is in differentiating documents having the term from documents not having it. Thus, the weight of a term based on its document frequency is usually monotonically decreasing and is called the inverse document frequency weight (idfw). The weight of a term in a document can be the product of its term frequency weight and its inverse document frequency weight, that is, tfw ∗ idfw.
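To make this concrete, the following sketch (in Python) computes such a weight under one common instantiation, namely tfw = tf and idfw = log(N/df); as noted above, other monotonic functions may be used instead.

import math

# A sketch of tf-idf term weighting, assuming tfw = tf and
# idfw = log(N / df); other monotonic variants are possible.
def term_weight(tf, df, num_docs):
    """Weight of a term in a document: tfw * idfw."""
    if tf == 0 or df == 0:
        return 0.0
    tfw = tf                          # grows with the term frequency
    idfw = math.log(num_docs / df)    # shrinks as more documents have the term
    return tfw * idfw

# Example: a term occurring 3 times in a document, present in 10
# of 10,000 documents, gets weight 3 * ln(1000), about 20.7.
print(term_weight(3, 10, 10000))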

A query is simply a question written in text.1 It can be transformed into an n-dimensional vector as well. Specifically, the noncontent words are eliminated by comparing the words in the query against the stop word list. Then, words in the query are mapped into terms and, finally, terms are weighted based on term frequency and/or document frequency information.

After the vectors of all documents and a query are formed, document vectors which are close to the query vector are retrieved. A similarity function can be used to measure the degree of closeness between two vectors. One simple function is the dot product function, dot(q, d) = Σ_{i=1}^{n} q_i ∗ d_i, where q = (q_1, ..., q_n) is the vector of a query and d = (d_1, ..., d_n) is the vector of a document. The dot product function is a weighted sum of the terms in common between the two vectors. The dot product function tends to favor long documents having many terms, because the chance of having more terms in common between a document and a given query is higher for a longer document than a shorter document. In order that all documents have a fair chance of being retrieved, the cosine function can be utilized. It is given by dot(q, d)/(|q| · |d|), where |q| and |d| denote, respectively, the lengths of the query vector and the document vector. The cosine function [Salton and McGill 1983] between two vectors is really the cosine of the angle between the two vectors and it always returns a value between 0 and 1 when the weights are nonnegative. It gets the value 0 if there is no term in common between the query and the document; its value is 1 if the query and the document vectors are identical or one vector is a positive constant multiple of the other.
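The two functions can be sketched directly from these definitions; the vectors are assumed to be given as lists of nonnegative term weights.

import math

# A sketch of the dot product and cosine similarity functions over
# n-dimensional term vectors, following the definitions above.
def dot(q, d):
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """dot(q, d) / (|q| * |d|); lies in [0, 1] for nonnegative weights."""
    norm_q = math.sqrt(dot(q, q))
    norm_d = math.sqrt(dot(d, d))
    if norm_q == 0 or norm_d == 0:
        return 0.0                    # an empty vector shares no terms
    return dot(q, d) / (norm_q * norm_d)

# A query and a longer document sharing a single term:
print(cosine([1, 1, 0], [2, 0, 5]))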

A common measure for retrieval effectiveness is recall and precision. For a given query submitted by a user, suppose that the set of relevant documents with respect to the query in the document collection can be determined. The two quantities recall and precision can be defined as follows:

recall = (the number of retrieved relevant documents) / (the number of relevant documents),   (1)

precision = (the number of retrieved relevant documents) / (the number of retrieved documents).   (2)

1 We note that Boolean queries are also supported by many IR systems. In this article, we concentrate on vector space queries only unless other types of queries are explicitly identified. A study of 51,473 real user queries submitted to the Excite search engine indicated that less than 10% of these queries are Boolean queries [Jansen et al. 1998].

To evaluate the effectiveness of a text retrieval system, a set of test queries is used. For each query, the set of relevant documents is identified in advance. For each such query, a precision value for each distinct recall value is obtained. When these sets of recall-precision values are averaged over the set of test queries, an average recall-precision curve is obtained. This curve is used as the measure of the effectiveness of the system.
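Equations (1) and (2) translate directly into the following sketch, which assumes the set of relevant documents is known in advance, as in the test-query setting just described.

# A sketch of Equations (1) and (2); the relevant set is assumed to
# be known in advance, as for the test queries described above.
def recall_precision(retrieved, relevant):
    retrieved_relevant = len(set(retrieved) & set(relevant))
    recall = retrieved_relevant / len(relevant)
    precision = retrieved_relevant / len(retrieved)
    return recall, precision

# Example: 3 of the 5 retrieved documents are relevant, and the
# collection holds 6 relevant documents: recall 0.5, precision 0.6.
print(recall_precision(["d1", "d2", "d3", "d4", "d5"],
                       ["d1", "d2", "d3", "d6", "d7", "d8"]))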

An ideal information retrieval system retrieves all relevant documents and nothing else (i.e., both recall and precision equal to 1). In practice, this is not possible, as a user’s needs may be incorrectly or imprecisely specified by his/her query and the user’s concept of relevance varies over time and is difficult to capture. Thus, the retrieval of documents is implemented by employing some similarity function that approximates the degrees of relevance of documents with respect to a given query. Relevance information due to previous retrieval results may be utilized by systems with learning capabilities to improve retrieval effectiveness. In the remaining portion of this paper, we shall restrict ourselves to the use of similarity functions in achieving high retrieval effectiveness, except for certain situations where users’ feedback information is incorporated.

2.2. Web Search Engines

A Web search engine is essentially an information retrieval system for Web pages. However, Web pages have several features that are not usually associated with documents in traditional IR systems and these features have been explored by search engine developers to improve the retrieval effectiveness of search engines.

The first special feature of Web pages is that they are highly tagged documents. At present, most Web pages are in HTML format. In the foreseeable future, XML documents may be widely used. These tags often convey rich information regarding the terms used in documents. For example, a term appearing in the title of a document or emphasized with a special font can provide a hint that the term is rather important in indicating the contents of the document. Tag information has been used by a number of search engines such as Google and AltaVista to better determine the importance of a term in representing the contents of a page. For example, a term occurring in the title or the header of a page may be considered to be more important than the same term occurring in the main text. As another example, a term typed in a special font such as bold face and large fonts is likely to be more important than the same term not in any special font. Studies have indicated that the higher weights assigned to terms due to their locations or their special fonts or tags can yield higher retrieval effectiveness than schemes which do not take advantage of the location or tag information [Cutler et al. 1997].
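As an illustration only, a search engine might implement this idea by scaling a term’s occurrence count according to where the term appears; the tag multipliers below are hypothetical, since the values used by commercial engines are not published.

# Hypothetical multipliers for term occurrences by tag/location; the
# actual values used by engines such as Google are not published.
TAG_MULTIPLIERS = {"title": 4.0, "h1": 3.0, "b": 2.0, "body": 1.0}

def weighted_tf(occurrences):
    """occurrences: list of (tag, count) pairs for one term in one page."""
    return sum(TAG_MULTIPLIERS.get(tag, 1.0) * count
               for tag, count in occurrences)

# A term appearing once in the title and twice in plain text counts
# as 4 + 2 = 6 weighted occurrences under these multipliers.
print(weighted_tf([("title", 1), ("body", 2)]))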

The second special feature of Web pages is that they are extensively linked. A link from page A to page B provides a convenient path for a Web user to navigate from page A to page B. Careful analysis can reveal that such a simple link could contain several pieces of information that may be made use of to improve retrieval effectiveness. First, such a link indicates a good likelihood that the contents of the two pages are related. Second, the author of page A values the contents of page B. The linkage information has been used to compute the global importance (i.e., PageRank) of Web pages based on whether a page is pointed to by many pages and/or by important pages [Page et al. 1998]. This has been successfully used in the Google search engine to improve retrieval effectiveness. The linkage information has also been used to compute the authority (the degree of importance) of Web pages with respect to a given topic [Kleinberg 1998]. IBM’s Clever Project aims to develop a search engine that employs the technique of computing the authorities of Web pages for a given query [Chakrabarti et al. 1999].

Another way to utilize the linkage information is as follows. When a page A has a link to page B, a set of terms known as anchor terms is usually associated with the link. The purpose of using the anchor terms is to provide information regarding the contents of page B to facilitate the navigation by human users. The anchor terms often provide related terms or synonyms to the terms used to index page B. To utilize such valuable information, several search engines like Google [Brin and Page 1998] and WWWW [McBryan 1994] have suggested also using anchor terms to represent linked pages (e.g., page B). In general, a Web page may be linked by many other Web pages and has many associated anchor terms.
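A minimal sketch of this idea follows: while crawling, the anchor terms of each link are accumulated under the URL of the linked page so that they can later be added to that page’s representation. The function and variable names are illustrative, not taken from any of the cited systems.

from collections import defaultdict

# Map each linked page's URL to the anchor terms that point at it.
anchor_index = defaultdict(list)

def record_link(source_url, target_url, anchor_text):
    """Called for each link encountered while crawling source_url."""
    anchor_index[target_url].extend(anchor_text.lower().split())

# When page B is indexed, anchor_index[url_of_B] supplies extra
# terms (often synonyms) contributed by the pages linking to B.
record_link("http://a.example", "http://b.example", "Perl manual")
print(anchor_index["http://b.example"])   # ['perl', 'manual']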

3. METASEARCH ENGINE COMPONENTS

In a typical session of using a metasearch engine, a user submits a query to the metasearch engine through a user-friendly interface. The metasearch engine then sends the user query to a number of underlying search engines (which will be called component search engines in this article). Different component search engines may accept queries in different formats. The user query may thus need to be translated to an appropriate format for each local system. After the retrieval results from the local search engines are received, the metasearch engine merges the results into a single ranked list and presents the merged result, possibly only the top portion of the merged result, to the user. The result could be a list of documents or more likely a list of document identifiers (e.g., URLs for Web pages on the Web) with possibly short companion descriptions. In this article, we use “documents” and “document identifiers” interchangeably unless it is important to distinguish them.

Now let us introduce the concept of potentially useful documents.

Definition 1. Suppose there is a similarity function that computes the similarities between documents and any given query and the similarity of a document with a given query approximates the degree of the relevance of the document to the “average user” who submits the query. For a given query, a document d is said to be potentially useful if it satisfies one of the following conditions:

(1) If m documents are desired in the final result for some positive integer m, then the similarity between d and the query is among the m highest of all similarities between all documents and the query.

(2) If every document whose similarity with the query exceeds a prespecified threshold is desired, then the similarity between d and the query is greater than the threshold.

In a metasearch engine environment, different component search engines may employ different similarity functions. For a given query and a document, their similarities computed by different local similarity functions are likely to be different and incomparable. To overcome this problem, the similarities in the above definition are computed using a similarity function defined in the metasearch engine. In other words, global similarities are used.
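Definition 1, together with the use of global similarities just described, can be sketched as follows; the two branches correspond to Conditions 1 and 2, and the names (including global_sim) are illustrative.

# A sketch of Definition 1, assuming a global similarity function
# global_sim(document, query) is available in the metasearch engine.
def potentially_useful(docs, query, global_sim, m=None, threshold=None):
    """Documents satisfying Definition 1 under the global similarity.

    Condition 1: the m documents with the highest global similarities.
    Condition 2: every document whose similarity exceeds the threshold.
    """
    ranked = sorted(docs, key=lambda d: global_sim(d, query), reverse=True)
    if m is not None:
        return ranked[:m]
    return [d for d in ranked if global_sim(d, query) > threshold]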

Note that, in principle, the two conditions in Definition 1 are mutually translatable. In other words, for a given m in Condition 1, a threshold in Condition 2 can be determined such that the number of documents whose similarities exceed the threshold is m, and vice versa. However, in practice, the translation can only be done when substantial statistical information about the text database is available. Usually, a user specifies the number of documents he or she would like to view. The system uses a threshold to determine what documents should be retrieved and displays only the desired number of documents to the user.

Fig. 2. Metasearch software component architecture.

The goal of text retrieval is to maximize the retrieval effectiveness while minimizing the cost. For a centralized retrieval system, this can be implemented by retrieving as many potentially useful documents as possible while retrieving as few nonpotentially useful documents as possible. In a metasearch engine environment, the implementation should be carried out at two levels. First, we should select as many potentially useful databases (these databases contain potentially useful documents) to search as possible while minimizing the search of useless databases. Second, for each selected database, we should retrieve as many potentially useful documents as possible while minimizing the retrieval of useless documents.

A reference software component architecture of a metasearch engine is illustrated in Figure 2. The numbers on the edges indicate the sequence of actions for a query to be processed. We now discuss the functionality of each software component and the interactions among these components.

Database selector: If the number of component search engines in a metasearch engine is small, it may be reasonable to send each user query to all of them. However, if the number is large, say in the thousands, then sending each query to all component search engines is no longer a reasonable strategy. This is because in this case, a large percentage of the local databases will be useless with respect to the query. Suppose a user is interested in only the 10 best matched documents for a query. Clearly, the 10 desired documents are contained in at most 10 databases. Consequently, if the number of databases is much larger than 10, then a large number of databases will be useless with respect to this query. Sending a query to the search engines of useless databases has several problems. First, dispatching the query to useless databases wastes the resources at the metasearch engine site. Second, transmitting the query to useless component search engines from the metasearch engine and transmitting useless documents from these search engines to the metasearch engine would incur unnecessary network traffic. Third, evaluating a query against useless component databases would waste resources at these local systems. Fourth, if a large number of documents were returned from useless databases, more effort would be needed by the metasearch engine to identify useful documents. Therefore, it is important to send each user query to only potentially useful databases. The problem of identifying potentially useful databases to search for a given query is known as the database selection problem. The software component database selector is responsible for identifying potentially useful databases for each user query. A good database selector should correctly identify as many potentially useful databases as possible while minimizing wrongly identifying useless databases as potentially useful ones. Techniques for database selection will be covered in Section 5.

Document selector: For each search engine selected by the database selector, the component document selector determines what documents to retrieve from the database of the search engine. The goal is to retrieve as many potentially useful documents from the search engine as possible while minimizing the retrieval of useless documents. If a large number of useless documents were returned from a search engine, more effort would be needed by the metasearch engine to identify potentially useful documents. Several factors may affect the selection of documents to retrieve from a component search engine such as the number of potentially useful documents in the database and the similarity function used by the component system. These factors help determine either the number of documents that should be retrieved from the component search engine or a local similarity threshold such that only those documents whose local similarity with the given query is higher than or equal to the threshold should be retrieved from the component search engine. Different methods for selecting documents to retrieve from local search engines will be described in Section 6.

Query dispatcher: The query dispatcher is responsible for establishing a connection with the server of each selected search engine and passing the query to it. HTTP (HyperText Transfer Protocol) is used for the connection and data transfer (sending queries and receiving results). Each search engine has its own requirements on the HTTP request method (e.g., the GET method or the POST method) and query format (e.g., the specific query box name). The query dispatcher must follow the requirements of each search engine correctly. Note that, in general, the query sent to a particular search engine may or may not be the same as that received by the metasearch engine. In other words, the original query may be translated to a new query before being sent to a search engine. The translation of Boolean queries across heterogeneous information sources is studied in Chang and Garcia-Molina [1999].

For vector space queries, query translation is usually as straightforward as just retaining all the terms in the user query. There are two exceptions, however. First, the relative weights of query terms in the original user query may be adjusted before the query is sent to a component search engine. This is to adjust the relative importance of different query terms, which can be accomplished by repeating some query terms an appropriate number of times. Second, the number of documents to be retrieved from a component search engine may be different from that desired by the user. For example, suppose as part of a query, a user of the metasearch engine indicates that m documents should be retrieved. The document selector may decide that k documents should be retrieved from a particular component search engine. In this case, the number k, usually different from m, should be part of the translated query to be sent to the component search engine.
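Both adjustments can be sketched as follows; the repetition counts and the per-engine document count k are illustrative, and a real dispatcher must of course also follow each engine’s request format.

# A sketch of vector space query translation: term repetition raises
# a term's relative weight, and k is the per-engine document count.
def translate_query(terms, weights, k):
    """terms: query terms; weights: desired relative integer weights;
    k: number of documents to request from this component engine."""
    expanded = []
    for term, weight in zip(terms, weights):
        expanded.extend([term] * weight)   # repetition boosts the term
    return {"query": " ".join(expanded), "num_results": k}

# Ask one engine for 20 documents, with "fusion" twice as important:
print(translate_query(["collection", "fusion"], [1, 2], 20))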

Result merger: After the results from selected component search engines are returned to the metasearch engine, the result merger combines the results into a single ranked list. The top m documents in the list are then forwarded to the user interface to be displayed, where m is the number of documents desired by the user. A good result merger should rank all returned documents in descending order of their global similarities with the user query. Different result merging techniques will be discussed in Section 7.
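A minimal sketch of this step, assuming each returned document can be assigned an estimated global similarity (how such estimates are obtained is the subject of Section 7):

import heapq

# A sketch of result merging: pool the returned documents and rank
# them by (estimated) global similarity, keeping the top m.
def merge_results(result_lists, query, global_sim, m):
    """result_lists: one list of documents per selected search engine."""
    pooled = [doc for results in result_lists for doc in results]
    return heapq.nlargest(m, pooled, key=lambda d: global_sim(d, query))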

In the remaining discussions, we will concentrate on the following three main components, namely, the database selector, the document selector, and the result merger. Except for the query translation problem, the component query dispatcher will not be discussed further in this survey. Query translation for Boolean queries will not be discussed in this article as we focus on vector space queries only. More discussions on query translation for vector space queries will be provided at appropriate places while discussing other software components.

4. SOURCES OF CHALLENGES

In this section, we first review the environment in which a metasearch engine is to be built and then analyze why such an environment causes tremendous difficulties in building an effective and efficient metasearch engine.

Component search engines that participate in a metasearch engine are often built and maintained independently. Each search engine decides the set of documents it wants to index and provide search service to. It also decides how documents should be represented/indexed and when the index should be updated. Similarities between documents and user queries are computed using a similarity function. It is completely up to each search engine to decide what similarity function to use. Commercial search engines often regard the similarity functions they use and other implementational decisions as proprietary information and do not make them available to the general public.

As a direct consequence of the autonomy of component search engines, a number of heterogeneities exist. In this section, we first identify major heterogeneities that are unique in the metasearch engine environment. Heterogeneities that are common to other autonomous systems (e.g., multidatabase systems) such as different OS platforms will not be described. Then we discuss the impact of these heterogeneities as well as the autonomy of component search engines on building an effective and efficient metasearch engine.

4.1. Heterogeneous Environment

The following heterogeneities can be identified among autonomous component search engines [Meng et al. 1999b].

Indexing method: Different search engines may have different ways to determine what terms should be used to represent a given document. For example, some may consider all terms in the document (i.e., full-text indexing) while others may use only a subset of the terms (i.e., partial-text indexing). Lycos [Mauldin 1997], for example, employs partial-text indexing in order to save storage space and be more scalable. Some search engines on the Web use the anchor terms in a Web page to index the referenced Web page [Brin and Page 1998; Cutler et al. 1997; McBryan 1994] while most other search engines do not. Other examples of different indexing techniques involve whether or not to remove stopwords and whether or not to perform stemming. Furthermore, different stopword lists and stemming algorithms may be used by different search engines.

Document term weighting scheme: Different methods exist for determining the weight of a term in a document. For example, one method is to use the term frequency weight and another is to use the product of the term frequency weight and the inverse document frequency weight (see Section 2). Several variations of these schemes exist [Salton 1989]. There are also systems that distinguish different occurrences of the same term [Boyan et al. 1996; Cutler et al. 1997; Wade et al. 1989] or different fonts of the same term [Brin and Page 1998]. For example, the occurrence of a term appearing in the title of a Web page may be considered to be more important than another occurrence of the same term not appearing in the title.

Query term weighting scheme: In the vector space model for text retrieval, a query can be considered as a special document (a very short document typically). It is possible for a term to appear multiple times in a query. Different query term weighting schemes may utilize the frequency of a term in a query differently for computing the weight of the term in the query.


Similarity function: Different search engines may employ different similarity functions to measure the similarity between a user query and a document. Some popular similarity functions were mentioned in Section 2 but other similarity functions (see, for example, Robertson et al. [1999]; Singhal et al. [1996]) are also possible.

Document database: The text databases of different search engines may differ at two levels. The first level is the domain (subject area) of a database. For example, one database may contain medical documents (e.g., www.medisearch.co.uk) and another may contain legal documents (e.g., lawcrawler.lp.findlaw.com). In this case, the two databases can be said to have different domains. In practice, the domain of a database may not be easily determined since some databases may contain documents from multiple domains. Furthermore, a domain may be further divided into multiple subdomains. The second level is the set of documents. Even when two databases have the same domain, the sets of documents in the two databases can still be substantially different or even disjoint. For example, Echidna Medical Search (www.drsref.com.au) and Medisearch (www.medisearch.co.uk) are both search engines for medical information but the former is for Web pages from Australia and the latter for those from the United Kingdom.

Document version: Documents in a database may be modified. This is especially true in the World Wide Web environment where Web pages can often be modified at the wish of their authors. Typically, when a Web page is modified, those search engines that indexed the Web page will not be notified of the modification. Some search engines use robots to detect modified pages and reindex them. However, due to the high cost and/or the enormous amount of work involved, attempts to revisit a page can only be made periodically (say from one week to one month). As a result, depending on when a document is fetched (or refetched) and indexed (or reindexed), its representation in a search engine may be based on an older version or a newer version of the document. Since local search engines are autonomous, it is highly likely that different systems may have indexed different versions of the same document (in the case of WWW, the Web page can still be uniquely identified by its URL).

Result presentation: Almost all search engines present their retrieval results in descending order of local similarities/ranking scores. However, some search engines also provide the similarities of returned documents (e.g., FirstGov (www.firstgov.gov) and Northern Light) while some do not (e.g., AltaVista and Google).

In addition to heterogeneities between component search engines, there are also heterogeneities between the metasearch engine and the local systems. For example, the metasearch engine uses a global similarity function to compute the global similarities of documents. It is very likely that the global similarity function is different from the similarity functions in some (or even all) component search engines.

4.2. Impact of Heterogeneities

In this subsection, we show that the autonomy of and the heterogeneities among different component search engines and between the metasearch engine and the component search engines have a profound impact on how to evaluate global queries in a metasearch engine.

(1) In order to estimate the usefulness of a database to a given query, the database selector needs to know some information about the database that characterizes the contents of the database. We call the characteristic information about a database the representative of the database. In the metasearch engine environment, different types of database representatives for different search engines may be available to the metasearch engine. For cooperative search engines, they may provide database representatives desired by the database selector. For uncooperative search engines that follow a certain standard, say the proposed STARTS standard [Gravano et al. 1997], the database representatives may be obtained from the information that can be provided by these search engines such as the document frequency and the average document term weight of any query term. But the representatives may not contain certain information desired by a particular database selector. For uncooperative search engines that do not follow any standard, their representatives may have to be extracted from past retrieval experiences (e.g., SavvySearch [Dreilinger and Howe 1997]) or from sampled documents (e.g., Callan et al. [1999]; Callan [2000]).

There are two major challenges in developing good database selection algorithms. One is to identify appropriate database representatives. A good representative should permit fast and accurate estimation of database usefulness. At the same time, a good representative should have a small size in comparison to the size of the database and should be easy to obtain and maintain. As we will see in Section 5, proposed database selection algorithms often employ different types of representatives. The second challenge is to develop ways to obtain the desired representatives. As mentioned above, a number of solutions exist depending on whether a search engine follows some standard or is cooperative. The issue of obtaining the desired representatives will not be discussed further in this article.

(2) The challenges of the document selection problem and the result merging problem lie mainly in the fact that the same document may have different global and local similarities with a given query due to various heterogeneities. For example, for a given query q submitted by a global user, whether or not a document d in a component database D is potentially useful depends on the global similarity of d with q. It is highly likely that the similarity function and/or the term weighting scheme in D are different from the global ones. As a result, the local similarity of d is likely to be different from the global similarity of d. In fact, even when the same term weighting scheme and the same similarity function are used locally and globally, the global similarity and the local similarity of d may still be different because the similarity computation may make use of certain database-specific information (such as the document frequencies of terms). This means that a globally highly ranked document in D may not be a locally highly ranked document in D. Suppose the globally top-ranked document d is ranked ith locally for some i ≥ 1. In order to retrieve d from D, the local system may have to also retrieve all documents that have a higher local similarity than that of d (text retrieval systems are generally incapable of retrieving lower-ranked documents without first retrieving higher-ranked ones). It is quite possible that some of the documents that are ranked higher than d locally are not potentially useful based on their global similarities. The main challenge for document selection is to develop methods that can maximize the retrieval of potentially useful documents while minimizing the retrieval of useless documents from component search engines. The main challenge for result merging is to find ways to estimate the global similarities of documents so that documents returned from different component search engines can be properly merged.

In the next several sections, we examine the techniques that have been proposed to deal with the problems of database selection, document selection, and result merging.

5. DATABASE SELECTION

When a metasearch engine receives a query from a user, it invokes the database selector to select component search engines to send the query to. A good database selection algorithm should identify potentially useful databases accurately. Many approaches have been proposed to tackle the database selection problem. These approaches differ on the database representatives they use to indicate the contents of each database, the measures they use to indicate the usefulness of each database with respect to a given query, and the techniques they employ to estimate the usefulness. We classify these approaches into the following three categories.

Rough representative approaches: In these approaches, the contents of a local database are often represented by a few selected key words or paragraphs. Such a representative is only capable of providing a very general idea on what a database is about, and consequently database selection methods using rough database representatives are not very accurate in estimating the true usefulness of databases with respect to a given query. Rough representatives are often manually generated.

Statistical representative approaches: These approaches usually represent the contents of a database using rather detailed statistical information. Typically, the representative of a database contains some statistical information for each term in the database such as the document frequency of the term and the average weight of the term among all documents that have the term. Detailed statistics allow more accurate estimation of database usefulness with respect to any user query. Scalability of such approaches is an important issue due to the amount of information that needs to be stored for each database.

Learning-based approaches: In these approaches, the knowledge about which databases are likely to return useful documents to what types of queries is learned from past retrieval experiences. Such knowledge is then used to determine the usefulness of databases for future queries. The retrieval experiences could be obtained through the use of training queries before the database selection algorithm is put to use and/or through the real user queries while database selection is in active use. The obtained experiences against a database will be saved as the representative of the database.

In the following subsections, we survey and discuss different database selection approaches based on the above classification.

5.1. Rough Representative Approaches

As mentioned earlier, a rough representative of a database uses only a few key words or a few sentences to describe the contents of the database. It is only capable of providing a very general idea on what the database is about.

In ALIWEB [Koster 1994], an often human-generated representative in a fixed format is used to represent the contents of each local database or a site. An example of the representative used to describe a site containing files for the Perl Programming Language is as follows (www.nexor.com/site.idx):

Template-Type:  DOCUMENT
Title:          Perl
URI:            /public/perl/perl.html
Description:    Information on the Perl Programming Language. Includes a local Hypertext Perl Manual, and the latest FAQ in Hypertext.
Keywords:       perl, perl-faq, language
Author-Handle:  [email protected]


The user query is matched with the representative of each component database to determine how suitable a database is for the query. The match can be against one or more fields (e.g., title, description, etc.) of the representatives based on the user’s choice. Component databases are ranked based on how closely they match with the query. The user then selects component databases to search from a ranked list of component databases, one database at a time. Note that ALIWEB is not a full-blown metasearch engine as it only allows users to select one database to search at a time and it does not perform result merging.

Similar to ALIWEB, descriptive representations of the contents of component databases are also used in WAIS [Kahle and Medlar 1991]. For a given query, the descriptions are used to rank component databases according to how similar they are to the query. The user then selects component databases to search for the desired documents. In WAIS, more than one local database can be searched at the same time.

In Search Broker [Manber and Bigot 1997; Manber and Bigot 1998], each database is manually assigned one or two words as the subject or category keywords. Each user query consists of two parts: the subject part and the regular query part. When a query is received by the system, the subject part of the query is used to identify the component search engines covering the same subject and the regular query part is used to search documents from the identified search engines.

In NetSerf [Chakravarthy and Haase 1995], the text description of the contents of a database is transformed into a structured representative. The transformation is performed manually and WordNet [Miller 1990] is used in the transformation process to disambiguate topical words. As an example, the description “World facts listed by country” for the World Factbook archive is transformed into the following structured representation [Chakravarthy and Haase 1995]:

topic: country
synset: [nation, nationality, land, country, a_people]
synset: [state, nation, country, land, commonwealth, res_publica, body_politic]
synset: [country, state, land, nation]
info-type: facts

Each word in WordNet has one or more synsets with each containing a set of synonyms that together defines a meaning. The topical word “country” has four synsets of which three are considered to be relevant, and are therefore used. The one synset (i.e., [rural area, country]) whose meaning does not match the intended meaning of the “country” in the above description (i.e., “World facts listed by country”) is omitted. Each user query is a sentence and is automatically converted into a structured and disambiguated representation similar to a database representation using a combination of several techniques. However, not all queries can be handled. The query representation is then matched with the representatives of local databases in order to identify potentially useful databases [Chakravarthy and Haase 1995].

While most rough database representatives are generated with human involvement, there exist automatically generated rough database representatives. In Q-Pilot [Sugiura and Etzioni 2000], each database is represented by a vector of terms with weights. The terms can either be obtained from the interface page of the search engine or from the pages that have links to the search engine. In the former case, all content words in the interface page are considered and the weights are the term frequencies. In the latter case, only terms that appear in the same line as the link to the search engine are used and the weight of each term is the document frequency of the term (i.e., the number of back link documents that contributed the term).


The main appeal of rough representative approaches is that the representatives can be obtained relatively easily and they require little storage space. If all component search engines are highly specialized with diversified topics and their contents can be easily summarized, then these approaches may work reasonably well. On the other hand, it is unlikely that the short description of a database can represent the database sufficiently comprehensively, especially when the database contains documents of diverse interests. As a result, missing potentially useful databases can occur easily with these approaches. To alleviate this problem, most such approaches involve users in the database selection process. For example, in ALIWEB and WAIS, users will make the final decision on which databases to select based on the preliminary selections by the metasearch engine. In Search Broker, users are required to specify the subject areas for their queries. As users often do not know the component databases well, their involvement in the database selection process can easily miss useful databases. Rough representative approaches are considered to be inadequate for large-scale metasearch engines.

5.2. Statistical Representative Approaches

A statistical representative of a database typically takes every term in every document in the database into consideration and keeps one or more pieces of statistical information for each such term. As a result, if done properly, a database selection approach employing this type of database representative may detect the existence of individual potentially useful documents for any given query. A large number of approaches based on statistical representatives have been proposed. In this subsection, we describe five such approaches.

5.2.1. D-WISE Approach. WISE (Web Index and Search Engine) is a centralized search engine [Yuwono and Lee 1996]. D-WISE is a proposed metasearch engine with a number of underlying search engines (i.e., distributed WISE) [Yuwono and Lee 1997]. In D-WISE, the representative of a component search engine consists of the document frequency of each term in the component database as well as the number of documents in the database. Therefore, the representative of a database with n distinct terms will contain n + 1 quantities (the n document frequencies and the cardinality of the database) in addition to the n terms. Let n_i denote the number of documents in the ith component database and df_ij be the document frequency of term t_j in the ith database.

Suppose q is a user query. The representatives of all databases are used to compute the ranking score of each component search engine with respect to q. The scores measure the relative usefulness of all databases with respect to q. If the score of database A is higher than that of database B, then database A will be judged to be more relevant to q than database B. The ranking scores are computed as follows. First, the cue validity of each query term, say term t_j, for the ith component database, CV_ij, is computed using the following formula:

$$CV_{ij} = \frac{df_{ij}/n_i}{df_{ij}/n_i + \left(\sum_{k \ne i}^{N} df_{kj}\right) \Big/ \left(\sum_{k \ne i}^{N} n_k\right)}, \qquad (3)$$

where N is the total number of component databases in the metasearch engine. Intuitively, CV_ij measures the percentage of the documents in the ith database that contain term t_j relative to that in all other databases. If the ith database has a higher percentage of documents containing t_j in comparison to other databases, then CV_ij tends to have a larger value. Next, the variance of the CV_ij's of each query term t_j over all component databases, CVV_j, is computed as follows:

$$CVV_j = \frac{\sum_{i=1}^{N} (CV_{ij} - ACV_j)^2}{N}, \qquad (4)$$

where ACV_j is the average of all CV_ij's over all component databases. The value CVV_j measures the skew of the distribution of term t_j across all component databases. For two terms t_u and t_v, if CVV_u is larger than CVV_v, then term t_u is more useful for distinguishing different component databases than term t_v. As an extreme case, if every database had the same percentage of documents containing a term, then the term would not be very useful for database selection (the CVV of the term would be zero in this case). Finally, the ranking score of component database i with respect to query q is computed by

$$r_i = \sum_{j=1}^{M} CVV_j \cdot df_{ij}, \qquad (5)$$

where M is the number of terms in the query. It can be seen that the ranking score of database i is the sum of the document frequencies of all query terms in the database, weighted by each query term's CVV (recall that the value of CVV for a term reflects the distinguishing power of the term). Intuitively, the ranking scores provide clues to where useful query terms are concentrated. If a database has many useful query terms, each having a higher percentage of documents than other databases, then the ranking score of the database will be high. After the ranking scores of all databases are computed with respect to a given query, the databases with the highest scores will be selected for search for this query.
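The following sketch shows how formulas (3)-(5) fit together. It is a minimal illustration under our own assumptions about the input format (per-database dictionaries of document frequencies and a list of database sizes), not code from D-WISE.

def dwise_scores(query_terms, df, n):
    # df[i] maps terms to document frequencies in database i; n[i] is the
    # number of documents in database i.
    N = len(n)
    scores = [0.0] * N
    for t in query_terms:
        # Cue validity CV_ij of term t for each database i (formula (3)).
        cv = []
        for i in range(N):
            own = df[i].get(t, 0) / n[i]
            others_df = sum(df[k].get(t, 0) for k in range(N) if k != i)
            others_n = sum(n[k] for k in range(N) if k != i)
            denom = own + (others_df / others_n if others_n else 0.0)
            cv.append(own / denom if denom else 0.0)
        # Variance CVV_j of the cue validities (formula (4)).
        acv = sum(cv) / N
        cvv = sum((v - acv) ** 2 for v in cv) / N
        # Accumulate CVV_j * df_ij into each database's score (formula (5)).
        for i in range(N):
            scores[i] += cvv * df[i].get(t, 0)
    return scores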

The representative of a database in D-WISE contains one quantity, namely the document frequency, per distinct term in the database, plus one additional quantity, namely the cardinality, for the entire database. As a result, this approach is easily scalable. The computation is also simple. However, there are two problems with this approach. First, the ranking scores are relative scores. As a result, it is difficult to determine the real value of a database with respect to a given query. If there are no good databases for a given query, then even the first-ranked database will have very little value. On the other hand, if there are many good databases for another query, then even the 10th-ranked database can be very useful. Relative ranking scores are not very useful in differentiating these situations. Second, the accuracy of this approach is questionable, as it does not distinguish a document containing, say, one occurrence of a term from a document containing 100 occurrences of the same term.

5.2.2. CORI Net Approach. In the Collection Retrieval Inference Network (CORI Net) approach [Callan et al. 1995], the representative of a database consists of two pieces of information for each distinct term in the database: the document frequency and the database frequency. The latter is the number of component databases containing the term. Note that if a term appears in multiple databases, only one database frequency needs to be stored in the metasearch engine to save space.

In CORI Net, for a given query q, a document ranking technique known as inference network [Turtle and Croft 1991], used in the INQUERY document retrieval system [Callan et al. 1992], is extended to rank all component databases with respect to q. The extension is mostly conceptual, and the main idea is to visualize the representative of a database as a (super) document and the set of all representatives as a collection/database of super documents. This is explained below. The representative of a database may be conceptually considered as a super document containing all distinct terms in the database. If a term appears in k documents in the database, we repeat the term k times in the super document. As a result, the document frequency of a term in the database becomes the term frequency of the term in the super document. The set of all super documents of the component databases in the metasearch engine forms a database of super documents. Let D denote this database of all super documents. Note that the database frequency of a term becomes the document frequency of the term in D. Therefore, from the representatives of component databases, we can obtain the term frequency and document frequency of each term in each super document. In principle, the tfw·idfw (term frequency weight times inverse document frequency weight) formula could now be used to compute the weight of each term in each super document so as to represent each super document as a vector of weights. Furthermore, a similarity function such as the cosine function could be used to compute the similarities (ranking scores) of all super documents (i.e., database representatives) with respect to query q, and these similarities could then be used to rank all component databases. The approach actually employed in CORI Net is an inference network-based probabilistic approach.

In CORI Net, the ranking score of a database with respect to query q is an estimated belief that the database contains useful documents. The belief is essentially the combined probability that the database contains useful documents due to each query term. More specifically, the belief is computed as follows. Suppose the user query contains k terms t_1, ..., t_k. Let N be the number of databases in the metasearch engine. Let df_ij be the document frequency of the jth term in the ith component database D_i and dbf_j be the database frequency of the jth term. First, the belief that D_i contains useful documents due to the jth query term is computed by

$$p(t_j \mid D_i) = c_1 + (1 - c_1) \cdot T_{ij} \cdot I_j, \qquad (6)$$

where
$$T_{ij} = c_2 + (1 - c_2) \cdot \frac{df_{ij}}{df_{ij} + K}$$
is a formula for computing the term frequency weight of the jth term in the super document corresponding to D_i, and
$$I_j = \frac{\log\!\left(\frac{N + 0.5}{dbf_j}\right)}{\log(N + 1.0)}$$
is a formula for computing the inverse document frequency weight of the jth term based on all super documents. In the above formulas, c_1 and c_2 are constants between 0 and 1, and K = c_3 · ((1 − c_4) + c_4 · dw_i/adw) is a function of the size of database D_i, with c_3 and c_4 being two constants, dw_i being the number of words in D_i, and adw being the average number of words in a database. The values of these constants (c_1, c_2, c_3, and c_4) can be determined empirically by performing experiments on actual test collections [Callan et al. 1995]. Note that the value of p(t_j | D_i) is essentially the tfw · idfw weight of term t_j in the super document corresponding to database D_i. Next, the significance of term t_j in representing query q, denoted p(q | t_j), can be estimated, for example, to be the query term weight of t_j in q. Finally, the belief that database D_i contains useful documents with respect to query q, or the ranking score of D_i with respect to q, can be estimated to be

$$r_i = p(q \mid D_i) = \sum_{j=1}^{k} p(q \mid t_j) \cdot p(t_j \mid D_i). \qquad (7)$$

In CORI Net, the representative of a database contains slightly more than one piece of information per term (i.e., the document frequency plus the shared database frequency across all databases). Therefore, the CORI Net approach also has rather good scalability. The information for representing each component database can also be obtained and maintained easily. An advantage of the CORI Net approach is that the same method can be used to compute the ranking score of a document against a query as well as the ranking score of a database (through the database representative or super document) against a query. Recently, it was shown in Xu and Callan [1998] that if phrase information is collected and stored in each database representative and queries are expanded based on a technique called local context analysis [Xu and Croft 1996], then the CORI Net approach can select useful databases more accurately.
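A minimal sketch of the belief computation in formulas (6) and (7) follows, assuming p(q | t_j) = 1/k for a k-term query; the default constants are placeholders, since the actual values are tuned empirically, and the division guard on dbf is our own addition.

import math

def cori_score(query_terms, df_i, dbf, N, dw_i, adw,
               c1=0.4, c2=0.4, c3=150.0, c4=0.75):
    # df_i maps terms to document frequencies in database D_i; dbf maps terms
    # to database frequencies; dw_i and adw are the database's word count and
    # the average word count over all databases.
    K = c3 * ((1 - c4) + c4 * dw_i / adw)  # size-dependent smoothing factor
    k = len(query_terms)
    score = 0.0
    for t in query_terms:
        df = df_i.get(t, 0)
        T = c2 + (1 - c2) * df / (df + K)                 # term frequency part
        I = math.log((N + 0.5) / max(dbf.get(t, 0), 1)) / math.log(N + 1.0)
        belief = c1 + (1 - c1) * T * I                    # formula (6)
        score += belief / k                               # p(q | t_j) = 1/k in (7)
    return score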

5.2.3. gGlOSS Approach. The gGlOSS (generalized Glossary Of Servers' Server) system is a research prototype [Gravano and Garcia-Molina 1995]. In gGlOSS, each component database is represented by a set of pairs (df_i, W_i), where df_i is the document frequency of the ith term and W_i is the sum of the weights of the ith term over all documents in the component database. A threshold is associated with each query in gGlOSS to indicate that only documents whose similarities with the query are higher than the threshold are of interest. The usefulness of a component database with respect to a query in gGlOSS is defined to be the sum of those similarities of the documents in the component database with the query that are higher than the threshold associated with the query. The usefulness of a component database is used as the ranking score of the database. In gGlOSS, two estimation methods are employed based on two assumptions. One is the high-correlation assumption (for any given database, if query term t_i appears in at least as many documents as query term t_j, then every document containing term t_j also contains term t_i) and the other is the disjoint assumption (for a given database, for any two terms t_i and t_j, the set of documents containing term t_i is disjoint from the set of documents containing term t_j).

We now discuss the two estimation methods for a component database D. Suppose q = (q_1, ..., q_k) is a query and T is the associated threshold, where q_i is the weight of term t_i in q.

High-correlation case: Let the terms be arranged in ascending order of document frequency, i.e., df_i ≤ df_j for any i < j, where df_i is the document frequency of term t_i. This means that every document containing t_i also contains t_j for any j > i. There are df_1 documents having similarity $\sum_{i=1}^{k} q_i \cdot W_i / df_i$ with q. In general, there are df_j − df_{j−1} documents having similarity $\sum_{i=j}^{k} q_i \cdot W_i / df_i$ with q, 1 ≤ j ≤ k, where df_0 is defined to be 0. Let p be an integer between 1 and k that satisfies $\sum_{i=p}^{k} q_i \cdot W_i / df_i > T$ and $\sum_{i=p+1}^{k} q_i \cdot W_i / df_i \le T$. Then the estimated usefulness of this database is
$$usefulness(D, q, T) = \sum_{j=1}^{p} \left[ (df_j - df_{j-1}) \cdot \sum_{i=j}^{k} \frac{q_i \cdot W_i}{df_i} \right] = \sum_{j=1}^{p} q_j \cdot W_j + df_p \cdot \sum_{j=p+1}^{k} \frac{q_j \cdot W_j}{df_j}.$$

Disjoint case: By the disjoint assumption, each document can contain at most one query term. Thus, there are df_i documents that contain term t_i, and the similarity of each of these df_i documents with query q is q_i · W_i / df_i. Therefore, the estimated usefulness of this database is
$$usefulness(D, q, T) = \sum_{\substack{1 \le i \le k \\ (df_i > 0) \wedge (q_i \cdot W_i / df_i > T)}} df_i \cdot \frac{q_i \cdot W_i}{df_i} = \sum_{\substack{1 \le i \le k \\ (df_i > 0) \wedge (q_i \cdot W_i / df_i > T)}} q_i \cdot W_i.$$
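Both estimators can be implemented directly from the formulas above. The sketch below is illustrative; the input format (parallel lists of query weights, document frequencies, and weight sums) is our own assumption.

def usefulness_high_correlation(q, df, W, T):
    # Sort terms by ascending document frequency, as the derivation assumes.
    order = sorted(range(len(q)), key=lambda i: df[i])
    q = [q[i] for i in order]
    df = [df[i] for i in order]
    W = [W[i] for i in order]
    k = len(q)
    total, prev_df = 0.0, 0
    for j in range(k):
        # Similarity shared by the df_j - df_{j-1} documents introduced here.
        sim = sum(q[i] * W[i] / df[i] for i in range(j, k) if df[i] > 0)
        if sim <= T:
            break
        total += (df[j] - prev_df) * sim
        prev_df = df[j]
    return total

def usefulness_disjoint(q, df, W, T):
    # Each document contains at most one query term, so the df_i documents
    # containing t_i each contribute q_i * W_i / df_i; summing gives q_i * W_i.
    return sum(q[i] * W[i]
               for i in range(len(q))
               if df[i] > 0 and q[i] * W[i] / df[i] > T)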

In gGlOSS, the usefulness of a database is sensitive to the similarity threshold used. As a result, gGlOSS can differentiate a database with many moderately similar documents from a database with a few highly similar documents. This is not possible in D-WISE and CORI Net. However, the two assumptions used in gGlOSS are somewhat too restrictive. As a result, the estimated database usefulness may be inaccurate. It can be shown that, when the threshold T is not too large, the estimation formula based on the high-correlation assumption tends to overestimate the usefulness, and the estimation formula based on the disjoint assumption tends to underestimate it. Since the two estimates tend to form upper and lower bounds of the true usefulness, the two methods are more useful when used together than when used separately. For a given database, the size of the database representative in gGlOSS is twice the size of that in D-WISE. The computation for estimating the database usefulness in gGlOSS can be carried out efficiently.

5.2.4. Estimating the Number of Potentially Useful Documents. One database usefulness measure used is "the number of potentially useful documents with respect to a given query in a database." This measure can be very useful for search services that charge a fee for each search. For example, the Chicago Tribune Newspaper Company charges a certain fee for retrieving archival newspaper articles. Suppose the fee is independent of the number of retrieved documents. In this case, from the user's perspective, a component system which contains a large number of similar documents, but not necessarily the most similar documents, is preferable to another component system containing just a few most similar documents. On the other hand, if a fee is charged for each retrieved document, then the component system having the few most similar documents will be preferred. This type of charging policy can be incorporated into the database selector of a metasearch engine if the number of potentially useful documents in a database with respect to a given query can be estimated.

Let D be a component database, sim(q, d) be the global similarity between a query q and a document d in D, and T be a similarity threshold. The number of potentially useful documents in D with respect to q can be defined precisely as follows:

$$NoDoc(D, q, T) = cardinality(\{d \mid d \in D \text{ and } sim(q, d) > T\}). \qquad (8)$$

If NoDoc(D, q, T) can be accurately estimated for each database with respect to a given query, then the database selector can simply select those databases with the most potentially useful documents to search for this query.

In Meng et al. [1998], a generating-function based method is proposed to estimate NoDoc(D, q, T) when the global similarity function is the dot product function (the widely used cosine function is a special case of the dot product function with each term weight divided by the document/query length). In this method, the representative of a database with n distinct terms consists of n pairs {(p_i, w_i)}, i = 1, ..., n, where p_i is the probability that term t_i appears in a document in D (note that p_i is simply the document frequency of term t_i in the database divided by the number of documents in the database) and w_i is the average of the weights of t_i in the set of documents containing t_i. Let (q_1, q_2, ..., q_k) be the query vector of query q, where q_i is the weight of query term t_i.

Consider the following generating function:
$$\left(p_1 X^{w_1 q_1} + (1 - p_1)\right) \cdot \left(p_2 X^{w_2 q_2} + (1 - p_2)\right) \cdots \left(p_k X^{w_k q_k} + (1 - p_k)\right). \qquad (9)$$

After the generating function (9) is expanded and the terms with the same X's are combined, we obtain
$$a_1 X^{b_1} + a_2 X^{b_2} + \cdots + a_c X^{b_c}, \quad b_1 > b_2 > \cdots > b_c. \qquad (10)$$

It can be shown that, if the terms are independent and the weight of term t_i whenever present in a document is w_i, as given in the database representative (1 ≤ i ≤ k), then a_i is the probability that a document in the database has similarity b_i with q [Meng et al. 1998]. Therefore, if database D contains N documents, then N · a_i is the expected number of documents that have similarity b_i with query q. For a given similarity threshold T, let C be the largest integer satisfying b_C > T. Then NoDoc(D, q, T) can be estimated by the following formula:

$$NoDoc(D, q, T) = \sum_{i=1}^{C} N \cdot a_i = N \sum_{i=1}^{C} a_i. \qquad (11)$$
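The generating function (9) can be expanded with a simple dictionary-based polynomial multiplication, as sketched below. This is our own illustration of the method, not code from Meng et al. [1998], and it inherits the exponential blow-up in the number of query terms noted later in this subsection.

from collections import defaultdict

def nodoc_estimate(terms, T, num_docs):
    # terms: list of (p_i, w_i, q_i) triples taken from the database
    # representative and the query; T: similarity threshold; num_docs: N.
    poly = {0.0: 1.0}  # maps exponent b to coefficient a
    for p, w, q in terms:
        new_poly = defaultdict(float)
        for b, a in poly.items():
            new_poly[b + w * q] += a * p   # the term occurs in the document
            new_poly[b] += a * (1 - p)     # the term does not occur
        poly = dict(new_poly)
    # Formula (11): N times the total probability of similarities above T.
    return num_docs * sum(a for b, a in poly.items() if b > T)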


The above solution has two restrictive assumptions. The first is the term independence assumption and the second is the uniform term weight assumption (i.e., the weights of a term in all documents containing the term are the same, namely the average weight). These assumptions reduce the accuracy of the database usefulness estimation. One way to address the term independence assumption is to utilize covariances between term pairs, term triplets, and so on, and to incorporate them into the generating function (9) [Meng et al. 1998]. The problem with this approach is that the storage overhead for representing a component database may become too large, because a very large number of covariances may be associated with each component database. A remedy is to use only significant covariances (those whose absolute values are significantly greater than zero). Another way to incorporate dependencies between terms is to combine certain adjacent terms into a single term [Liu et al. 2001]. This is similar to recognizing phrases.

In Meng et al. [1999a], a method known as the subrange-based estimation method is proposed to deal with the uniform term weight assumption. This method partitions the actual weights of a term t_i in the set of documents having the term into a number of disjoint subranges of possibly different lengths. For each subrange, the median of the weights in the subrange is estimated based on the assumption that the weight distribution of the term is normal (hence, the standard deviation of the weights of the term needs to be added to the database representative). Then the weights of t_i that fall in a given subrange are approximated by the median of the weights in the subrange. With this weight approximation, for a query containing term t_i, the polynomial p_i · X^{w_i q_i} + (1 − p_i) in the generating function (9) is replaced by the following polynomial:
$$p_{i1} X^{wm_{i1} q_i} + p_{i2} X^{wm_{i2} q_i} + \cdots + p_{il} X^{wm_{il} q_i} + (1 - p_i), \qquad (12)$$

where p_ij is the probability that term t_i occurs in a document and has a weight in the jth subrange, wm_ij is the median of the weights of t_i in the jth subrange, j = 1, ..., l, and l is the number of subranges used. After the generating function has been obtained, the rest of the estimation process is identical to that described earlier. It was shown in Meng et al. [1999a] that if the maximum normalized weight of each term is used in the highest subrange, the estimation accuracy of the database usefulness can be drastically improved.

The above methods [Liu et al. 2001; Meng et al. 1998; Meng et al. 1999a], while able to produce accurate estimates, have a large storage overhead. Furthermore, the computational complexity of expanding the generating function is exponential in the number of query terms. As a result, they are more suitable for short queries.

5.2.5. Estimating the Similarity of the Most Similar Document. Another useful measure is the global similarity of the most similar document in a database with respect to a given query. On one hand, this measure indicates the best we can expect from a database, as no other documents in the database can have higher similarities with the query. On the other hand, for a given query, this measure can be used to rank databases optimally for retrieving the m most similar documents across all databases.

Suppose a user wants the metasearch engine to find the m most similar documents to his/her query q across M component databases D_1, D_2, ..., D_M. The following definition defines an optimal order of these databases for the query.

Definition 2. A set of M databases is said to be optimally ranked in the order [D_1, D_2, ..., D_M] with respect to query q if there exists a k such that D_1, D_2, ..., D_k contain the m most similar documents and each D_i, 1 ≤ i ≤ k, contains at least one of the m most similar documents.

Intuitively, the ordering is optimal because whenever the m most similar documents to the query are desired, it is sufficient to examine the first k databases. A necessary and sufficient condition for the databases D_1, D_2, ..., D_M to be optimally ranked in the order [D_1, D_2, ..., D_M] with respect to query q is msim(q, D_1) > msim(q, D_2) > ··· > msim(q, D_M) [Yu et al. 1999b], where msim(q, D_i) is the global similarity of the most similar document in database D_i with the query q. Knowing an optimal rank of the databases with respect to query q, the database selector can select the top-ranked databases to search for q.

The challenge here is how to estimate msim(q, D) for query q and any database D. One method is to utilize Expression (10) for D. We can scan this expression in descending order of the exponents until $\sum_{i=1}^{r} a_i \cdot N$ is approximately 1 for some r, where N is the number of documents in D. The exponent b_r is an estimate of msim(q, D), as the expected number of documents in D with similarity greater than or equal to b_r is approximately 1. The drawback of this solution is that it requires a large database representative and the computation is of high complexity.

A more efficient method to estimate msim(q, D) is proposed in Yu et al. [1999b]. In this method, there are two types of representatives. There is a global representative for all component databases: for each distinct term t_i, the global inverse document frequency weight (gidf_i) is stored in this representative. There is also a local representative for each component database D: for each distinct term t_i in D, a pair of quantities (mnw_i, anw_i) is stored, where mnw_i and anw_i are the maximum normalized weight and the average normalized weight of term t_i, respectively. Suppose d_i is the weight of t_i in a document d. Then the normalized weight of t_i in d is d_i/|d|, where |d| denotes the length of d. The maximum normalized weight and the average normalized weight of t_i in database D are, respectively, the maximum and the average of the normalized weights of t_i over all documents in D. Suppose q = (q_1, ..., q_k) is the query vector. Then msim(q, D) can be estimated as follows:

$$msim(q, D) = \max_{1 \le i \le k} \left\{ q_i \cdot gidf_i \cdot mnw_i + \sum_{j=1, j \ne i}^{k} q_j \cdot gidf_j \cdot anw_j \right\} \Big/ |q|. \qquad (13)$$

The intuition for this estimate is that the most similar document in a database is likely to have the maximum normalized weight of the ith query term, for some i. This yields the first half of the expression within the braces. For each of the other query terms, the document is assumed to take the average normalized weight, which yields the second half. Then the maximum is taken over all i, since the most similar document may have the maximum normalized weight of any one of the k query terms. Normalization by the query length, |q|, yields a value less than or equal to 1. The underlying assumption of Formula (13) is that the terms in each query are independent. Dependencies between terms can be captured to a certain extent by storing the same statistics (i.e., mnw's and anw's) for phrases in the database representatives, that is, by treating each phrase as a term.

In this method, each database is represented by two quantities per term, plus the global representative shared by all databases, and the computation has linear complexity.
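A minimal sketch of estimate (13) follows; the input arrays are parallel lists over the query terms and are our own assumed representation.

import math

def estimate_msim(q, gidf, mnw, anw):
    # q, gidf, mnw, anw are parallel lists over the k query terms.
    k = len(q)
    q_len = math.sqrt(sum(w * w for w in q))  # query length |q|
    best = 0.0
    for i in range(k):
        # Assume the most similar document has the maximum normalized weight
        # for term i and the average normalized weight for the other terms.
        s = q[i] * gidf[i] * mnw[i] + sum(
            q[j] * gidf[j] * anw[j] for j in range(k) if j != i)
        best = max(best, s)
    return best / q_len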

The maximum normalized weight of a term is typically two or more orders of magnitude larger than the average normalized weight of the term, as the latter is computed over all documents, including those not containing the term. This observation implies that in Formula (13), if all query terms have the same tf weight (a reasonable assumption, as in a typical query each term appears once), gidf_i · mnw_i is likely to dominate $\sum_{j=1, j \ne i}^{k} gidf_j \cdot anw_j$, especially when the number of terms, k, in a query is small (which is typically true in the Internet environment [Jansen et al. 1998; Kirsch 1998]). In other words, the rank of database D with respect to a given query q is largely determined by the value of max_{1≤i≤k} {q_i · gidf_i · mnw_i}. This leads to the following more scalable formula to estimate msim(q, D) [Wu et al. 2001]: max_{1≤i≤k} {q_i · am_i}/|q|, where am_i = gidf_i · mnw_i is the adjusted maximum normalized weight of term t_i in D. This formula requires only one piece of information, namely am_i, to be kept in the database representative for each distinct term in the database.

5.3. Learning-Based Approaches

These approaches predict the usefulness of a database for new queries based on the retrieval experiences with the database from past queries. The retrieval experiences may be obtained in a number of ways. First, training queries can be used, and the retrieval knowledge of each component database with respect to these training queries can be obtained in advance (i.e., before the database selector is enabled). This type of approach will be called the static learning approach, as in such an approach the retrieval knowledge, once learned, will not be changed. The weakness of static learning is that it cannot adapt to changes in database contents and query patterns. Second, real user queries (in contrast to training queries) can be used, and the retrieval knowledge can be accumulated gradually and updated continuously. This type of approach will be referred to as the dynamic learning approach. The problem with dynamic learning is that it may take a while to obtain sufficient knowledge useful to the database selector. Third, static learning and dynamic learning can be combined to form a combined-learning approach. In such an approach, initial knowledge may be obtained from training queries, but the knowledge is updated continuously based on real user queries. Combined learning can overcome the weaknesses of the other two learning approaches. In this subsection, we introduce several learning-based database selection methods.

5.3.1. MRDD Approach. The MRDD (Modeling Relevant Document Distribution) approach [Voorhees et al. 1995b] is a static learning approach. During learning, a set of training queries is utilized. Each training query is submitted to every component database. From the documents returned by a database for a given query, all relevant documents are identified, and a vector reflecting the distribution of the relevant documents is obtained and stored. Specifically, the vector has the format 〈r_1, r_2, ..., r_s〉, where r_i is a positive integer indicating that r_i top-ranked documents must be retrieved from the database in order to obtain i relevant documents for the query. As an example, suppose for a training query q and a component database D, 100 documents are retrieved in the order (d_1, d_2, ..., d_100). Among these documents, d_1, d_4, d_10, d_17, and d_30 are identified to be relevant. Then the corresponding distribution vector is 〈r_1, r_2, r_3, r_4, r_5〉 = 〈1, 4, 10, 17, 30〉.

With the vectors for all training queries and all databases obtained, the database selector is ready to select databases for user queries. When a user query is received, it is compared against all training queries and the k most similar training queries are identified (k = 8 performed well, as reported in Voorhees et al. [1995b]). Next, for each database D, the average relevant document distribution vector over the k vectors corresponding to the k most similar training queries and D is obtained. Finally, the average distribution vectors are used to select the databases to search and the documents to retrieve. The selection tries to maximize the precision for each recall point.

Example 1. Suppose for a given query q, the following three average distribution vectors have been obtained for three component databases:

D1: 〈1, 4, 6, 7, 10, 12, 17〉
D2: 〈3, 5, 7, 9, 15, 20〉
D3: 〈2, 3, 6, 9, 11, 16〉


Consider the case when three relevant documents are to be retrieved. To maximize the precision (i.e., to reduce the retrieval of irrelevant documents), one document should be retrieved from D1 and three documents should be retrieved from D3 (two of the three are expected to be relevant). In other words, databases D1 and D3 should be selected. This selection yields a precision of 0.75, as three out of the four retrieved documents are relevant.
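One way to realize this selection is a greedy allocation over the average distribution vectors: repeatedly take the next relevant document from whichever database can supply it at the smallest additional retrieval cost. The sketch below is our own illustration of this idea, not code from the MRDD paper; on the vectors of Example 1 it reproduces the selection above.

def mrdd_select(vectors, wanted_relevant):
    # vectors[D][i] = number of top documents to retrieve from D to get
    # i + 1 relevant ones; taken[D] = relevant documents planned from D.
    taken = {D: 0 for D in vectors}

    def extra_cost(D):
        # Additional documents needed for one more relevant document from D.
        v, t = vectors[D], taken[D]
        if t >= len(v):
            return float("inf")
        return v[t] - (v[t - 1] if t > 0 else 0)

    for _ in range(wanted_relevant):
        taken[min(vectors, key=extra_cost)] += 1
    # Translate relevant-document counts into documents to retrieve.
    return {D: (vectors[D][t - 1] if t > 0 else 0) for D, t in taken.items()}

# Example 1: mrdd_select({"D1": [1, 4, 6, 7, 10, 12, 17],
#                         "D2": [3, 5, 7, 9, 15, 20],
#                         "D3": [2, 3, 6, 9, 11, 16]}, 3)
# returns {"D1": 1, "D2": 0, "D3": 3}, matching the selection above.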

In the MRDD approach, the representative of a component database is the set of distribution vectors for all training queries. The main weakness of this approach is that the learning has to be carried out manually for each training query. In addition, it may be difficult to identify appropriate training queries, and the learned knowledge may become less accurate when the contents of the component databases change.

5.3.2. SavvySearch Approach. SavvySearch (www.search.com) is a metasearch engine employing the dynamic learning approach. In SavvySearch [Dreilinger and Howe 1997], the ranking score of a component search engine with respect to a query is computed based on the past retrieval experience of using the terms in the query. More specifically, for each search engine, a weight vector (w_1, ..., w_m) is maintained by the database selector, where each w_i corresponds to the ith term in the database of the search engine. Initially, all weights are zero. When a query containing term t_i is used to retrieve documents from a component database D, the weight w_i is adjusted according to the retrieval result. If no document is returned by the search engine, the weight is reduced by 1/k, where k is the number of terms in the query. On the other hand, if at least one returned document is read/clicked by the user (no relevance judgment is needed from the user), then the weight is increased by 1/k. Intuitively, a large positive w_i indicates that the database D responded well to term t_i in the past, and a large negative w_i indicates that D responded poorly to t_i.

SavvySearch also tracks the recent performance of each search engine in terms of h, the average number of documents returned for the most recent five queries, and r, the average response time for the most recent five queries sent to the component search engine. If h is below a threshold T_h (the default is 1), then a penalty $p_h = (T_h - h)^2 / T_h^2$ for the search engine is computed. Similarly, if the average response time r is greater than a threshold T_r (the default is 15 seconds), then a penalty $p_r = (r - T_r)^2 / (r_o - T_r)^2$ is computed, where r_o = 45 (seconds) is the maximum allowed response time before a timeout.

For a new query q with terms t_1, ..., t_k, the ranking score of database D is computed by

$$r(q, D) = \frac{\sum_{i=1}^{k} w_{t_i} \cdot \log(N/f_i)}{\sqrt{\sum_{i=1}^{k} |w_{t_i}|}} - (p_h + p_r), \qquad (14)$$

where log(N/f_i) is the inverse database frequency weight of term t_i, N is the number of databases, and f_i is the number of databases having a positive weight value for term t_i.
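The score (14) can be computed directly from the learned weight vectors, as sketched below; the input format and the zero-denominator guard are our own assumptions.

import math

def savvy_score(query_terms, w, N, f, ph=0.0, pr=0.0):
    # w maps terms to this database's learned weights; f[t] is the number of
    # databases with a positive weight for t; ph and pr are the penalties.
    num = sum(w.get(t, 0.0) * math.log(N / f[t])
              for t in query_terms if f.get(t, 0) > 0)
    den = math.sqrt(sum(abs(w.get(t, 0.0)) for t in query_terms))
    return (num / den if den else 0.0) - (ph + pr)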

The overhead of storing the representative information for each local search engine in SavvySearch is moderate. (Essentially there is just one piece of information for each term, i.e., the weight. Only terms that have been used in previous queries need to be considered.) Moderate effort is needed to maintain the information. One weakness of SavvySearch is that it will not work well for new query terms or query terms that have been used only very few times. In addition, the user feedback process employed by SavvySearch is not rigorous and could easily lead to the misidentification of useful databases. Search engine users may have the tendency to check out top-ranked documents for their queries regardless of whether or not these documents are actually useful. This means that term weights in the database representative can easily be modified in a way not consistent with the meaning of the weights. As a result, it is possible that the weight of a term for a database does not sufficiently reflect how well the database will respond to the term.

5.3.3. ProFusion Approach. ProFusion (www.profusion.com) is a metasearch engine employing the combined learning approach. In ProFusion [Fan and Gauch 1999; Gauch et al. 1996], 13 preset categories are utilized in the learning process. The 13 categories are "Science and Engineering," "Computer Science," "Travel," "Medical and Biotechnology," "Business and Finance," "Social and Religion," "Society, Law and Government," "Animals and Environment," "History," "Recreation and Entertainment," "Art," "Music," and "Food." A set of terms is associated with each category to reflect the topic of the category. For each category, a set of training queries is identified. The reason for using these categories and dedicated training queries is to learn how well each component database will respond to queries in different categories. For a given category C and a given component database D, each associated training query is submitted to D. From the top 10 retrieved documents, relevant documents are identified. Then a score reflecting the performance of D with respect to the query and the category C is computed by $c \cdot \frac{\sum_{i=1}^{10} N_i}{10} \cdot \frac{R}{10}$, where c is a constant; N_i is set to 1/i if the ith-ranked document is relevant and to 0 if the document is not relevant; and R is the number of relevant documents among the 10 retrieved documents. It can be seen that this formula captures both the rank order of each relevant document and the precision of the top 10 retrieved documents. Finally, the scores of all training queries associated with the category C are averaged for database D, and this average is the confidence factor of the database with respect to the category. At the end of the training, there is a confidence factor for each database with respect to each of the 13 categories.

When a user query q is received by the metasearch engine, q is first mapped to one or more categories. The query q is mapped to a category C if at least one term in q belongs to the set of terms associated with C. The databases are then ranked based on the sum of the confidence factors of each database with respect to the mapped categories. Let this sum of the confidence factors of a database with respect to q be called the ranking score of the database for q. In ProFusion, the three databases with the largest ranking scores are selected to search for a given query.

In ProFusion, documents retrieved from the selected search engines are ranked based on the product of the local similarity of a document and the ranking score of its database. Let d in database D be the first document read/clicked by the user. If d is not the top-ranked document, then the ranking score of D should be increased, while the ranking scores of those databases whose documents are ranked higher than d should be reduced. This is carried out by proportionally adjusting the confidence factors of D in the mapped categories. For example, suppose for a query q and a database D, two categories C1 and C2 are selected and the corresponding confidence factors are 0.6 and 0.4, respectively. To increase the ranking score of database D by x, the confidence factors of D in C1 and C2 are increased by 0.6x and 0.4x, respectively. This ranking score adjustment policy tends to move d higher in the rank if the same query is processed in the future. The rationale behind this policy is that if the ranking scores were perfect, then the top-ranked document would be the first to be read by the user.
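The proportional adjustment can be sketched as follows; the function is our own illustration of the policy described above.

def adjust_confidence(conf, mapped_categories, x):
    # conf maps categories to this database's confidence factors; the raise
    # x is split across the mapped categories in proportion to their factors.
    total = sum(conf[c] for c in mapped_categories)
    for c in mapped_categories:
        conf[c] += x * conf[c] / total
    return conf

# Example from the text: adjust_confidence({"C1": 0.6, "C2": 0.4},
# ["C1", "C2"], x) adds 0.6x to C1 and 0.4x to C2.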

ProFusion combines static learning and dynamic learning and, as a result, overcomes some problems associated with employing static learning or dynamic learning alone. ProFusion has the following shortcomings. First, the static learning part is still done mostly manually; that is, selecting training queries and identifying relevant documents are carried out manually. Second, the higher-ranked documents from the same database as the first clicked document will remain higher-ranked documents after the adjustment of confidence factors, even though they are of no interest to the user. This is a situation where the learning strategy does not help retrieve better documents for a repeating query. Third, the employed dynamic learning method seems to be too simplistic. For example, very little user feedback information is used, and the tendency of users to select the highest-ranked document regardless of the relevance of the document is not taken into consideration. One way to alleviate this problem is to use the first clicked document that was read for a "significant" amount of time.

6. DOCUMENT SELECTION

After the database selector has chosen the component databases for a given query, the next task is to determine what documents to retrieve from each selected database. A naive approach is to let each selected component search engine return all documents that are retrieved from the search engine for the query. The problem with this approach is that too many documents may be retrieved from the component systems unnecessarily. As a result, this approach will not only lead to higher communication cost but also require more effort from the result merger to identify the best matched documents. This naive approach will not be further discussed in this section.

As noted previously, a component search engine typically retrieves documents in descending order of local similarities. Consequently, the problem of selecting what documents to retrieve from a component database can be translated into one of the following two problems:

(1) Determine the number of documents to retrieve from the component database. If k documents are to be retrieved from a component database, then the k documents with the largest local similarities will be retrieved.

(2) Determine a local threshold for the component database such that a document from the component database is retrieved only if its local similarity with the query exceeds the threshold.

Both problems have been tackled in existing or proposed metasearch engines. For either problem, the goal is always to retrieve all, or as many as possible, potentially useful documents from each component database while minimizing the retrieval of useless documents. We classify the proposed approaches for the document selection problem into the following four categories:

User determination: The metasearch engine lets the global user determine how many documents to retrieve from each component database.

Weighted allocation: The number of documents to retrieve from a component database depends on the ranking score (or the rank) of the component database relative to the ranking scores (or ranks) of other component databases. As a result, proportionally more documents are retrieved from component databases that are ranked higher or have higher ranking scores.

Learning-based approaches: These approaches determine the number of documents to retrieve from a component database based on past retrieval experiences with the component database.

Guaranteed retrieval: This type of approach aims at guaranteeing the retrieval of all potentially useful documents with respect to any given query.

In the following subsections, we survey and discuss approaches from each of these categories.

6.1. User Determination

In MetaCrawler [Selberg and Etzioni 1995; 1997] and SavvySearch [Dreilinger and Howe 1997], the maximum number of documents to be returned from each component database can be customized by the user. Different numbers can be used for different queries. If a user does not select a number, then a query-independent default number set by the metasearch engine will be used. This approach may be reasonable if the number of component databases is small and the user is reasonably familiar with all of them. In this case, the user can choose an appropriate number of documents to retrieve for each component database and can afford to do so.

If the number of component databases is large, then this method has a serious problem. In this case, it is likely that the user will not be capable of selecting an appropriate number for each component database. Consequently, the user will be forced to choose one number and apply that number to all selected component databases. As the numbers of useful documents in different databases with respect to a given query are likely to be different, this method may retrieve too many useless documents from some component systems while retrieving too few useful documents from others. If m documents are to be retrieved from N selected databases, the number of documents to retrieve from each database may be set to ⌈m/N⌉ or slightly higher.

6.2. Weighted Allocation

For a given query, each component database has a rank (i.e., 1st, 2nd, ...) and a ranking score as determined by the database selection algorithm. Both the rank information and the ranking score information can be used to determine the number of documents to retrieve from different component systems. In principle, weighted allocation approaches attempt to retrieve more documents from component search engines that are ranked higher (or have larger ranking scores).

In D-WISE [Yuwono and Lee 1997], the ranking score information is used. For a given query q, let r_i be the ranking score of component database D_i, i = 1, ..., N, where N is the number of selected component databases for the query. Suppose m documents across all selected component databases are desired. Then the number of documents to retrieve from database D_i is $m \cdot r_i / \sum_{j=1}^{N} r_j$.

In CORI Net [Callan et al. 1995], the rank information is used. Specifically, if a total of m documents are to be retrieved from N component databases, then $m \cdot \frac{2(1 + N - i)}{N(N + 1)}$ documents will be retrieved from the ith-ranked component database, i = 1, ..., N (note that $\sum_{i=1}^{N} \frac{2(1 + N - i)}{N(N + 1)} = 1$). In CORI Net, m could be chosen to be larger than the number of desired documents specified by the global user in order to reduce the likelihood of missing useful documents.
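Both allocation rules are straightforward to implement; the sketch below is illustrative and leaves rounding to integer document counts to the caller.

def allocate_by_score(m, scores):
    # D-WISE style: database i receives m * r_i / sum of all r_j.
    total = sum(scores)
    return [m * r / total for r in scores]

def allocate_by_rank(m, N):
    # CORI Net style: the ith-ranked database (i = 1..N) receives
    # m * 2(1 + N - i) / (N(N + 1)); the fractions sum to m.
    return [m * 2 * (1 + N - i) / (N * (N + 1)) for i in range(1, N + 1)]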

As a special case of the weighted allocation approach, if the ranking score of a component database is the estimated number of potentially useful documents in the database, then the ranking score of a component database can be used directly as the number of documents to retrieve from the database.

Weighted allocation is a reasonably flexible and easy-to-implement approach based on a good intuition (i.e., retrieve more documents from more highly ranked local databases).

6.3. Learning-Based Approaches

It is possible to learn how many documents to retrieve from a component database for a given query from past retrieval experiences with similar queries. The following are two learning-based approaches [Towell et al. 1995; Voorhees et al. 1995a; Voorhees et al. 1995b; Voorhees 1996; Voorhees and Tong 1997].

In Section 5.3, we introduced a learning-based method, namely MRDD (Modeling Relevant Document Distribution), for database selection. In fact, this method combines the selection of databases with the determination of what documents to retrieve from them. For a given query q, after the average distribution vectors have been obtained for all databases, the decision on what documents to retrieve from these databases is made to maximize the overall precision. In Example 1, when three relevant documents are desired from the given three databases, this method retrieves the top one document from database D1 and the top three documents from D3.

The second method, QC (Query Clustering), also performs document selection based on past retrieval experiences. Again, a set of training queries is utilized. In the training phase, for each component database, the training queries are grouped into a number of clusters. Two queries are placed in the same cluster if the number of common documents retrieved by the two queries is large. Next, the centroid of each query cluster is computed by averaging the vectors of the queries in the cluster. Furthermore, for each component database, a weight is computed for each cluster based on the average number of relevant documents among the top T retrieved documents (T = 8 performed well, as reported in Voorhees et al. [1995b]) for each query in the query cluster. For a given database, the weight of a cluster indicates how well the database responds to queries in the cluster. When a user query is received, for each component database, the query cluster whose centroid is most similar to the query is selected. Then the weights associated with all selected query clusters across all databases are used to determine the number of documents to retrieve from each database. Suppose w_i is the weight associated with the selected query cluster for component database D_i and m is the total number of documents desired. Then the number of documents to retrieve from database D_i is $m \cdot w_i / \sum_{j=1}^{N} w_j$, where N is the number of component databases. It can be seen that this method is essentially a weighted allocation method in which the weight of a database for a given query is the learned weight of the selected query cluster for the database.

For user queries that have very similar training queries, the above approaches may produce very good results. However, these approaches also have serious weaknesses that may prevent them from being used widely. First, they may not be suitable in environments where new component search engines are frequently added to the metasearch engine, because new training needs to be conducted whenever a new search engine is added. Second, it may not be easy to determine what training queries are appropriate to use. On the one hand, we would like to have some similar training queries for each potential user query. On the other hand, having too many training queries would consume a lot of resources. Third, it is too time consuming for users to identify relevant documents for a wide variety of training queries.

6.4. Guaranteed Retrieval

Since the similarity function used in a component database may be different from that used in the metasearch engine, it is possible for a document with a low local similarity to have a high global similarity, and vice versa. In fact, even when the global and local similarity functions are identical, this mismatch between local and global similarities may still occur due to the use of database-specific statistical information in these functions. For example, the document frequency of a term in a component system is probably very different from that across all systems (i.e., the global document frequency). Consequently, if a component system only returns documents with high local similarities, globally potentially useful documents, as determined based on global similarities, may be missed. The guaranteed retrieval approach tries to ensure that all globally potentially useful documents are retrieved even when the global and local document similarities do not match. Note that none of the approaches in earlier subsections belongs to the guaranteed retrieval category, because they do not take global similarities into consideration.

Many applications, especially those in the medical and legal fields, often need to retrieve all documents (cases) that are similar to a given query (case). For these applications, guaranteed retrieval approaches that can also minimize the retrieval of useless documents would be appropriate. In this subsection, we introduce some proposed techniques in the guaranteed retrieval category.

6.4.1. Query Modification. Under certain conditions, a global query can be modified before it is submitted to a component database so that the component system yields the global similarities for the returned documents. This technique is called query modification [Meng et al. 1998]. It is essentially a query translation method for vector queries. Clearly, if a component system can be tricked into returning documents in descending order of global similarities, guaranteeing the retrieval of the globally most similar documents becomes trivial.

Let D be a component database. Consider the case when both the local and the global similarity functions are the cosine function [Salton and McGill 1983]. Note that although the same similarity function is used globally and locally, the same document may still have different global and local similarities due to the use of different local and global document frequencies of terms. Let d = (w_1, ..., w_r) be the weight vector of a document in D. Suppose each w_i is computed using only information in d (such as term frequency), while a query may use both the term frequency and the inverse document frequency information. The idf information for each term in D is incorporated into the similarity computation by modifying each query before it is processed [Buckley et al. 1993]. Consider a user query q = (q_1, ..., q_r), where q_j is the weight of term t_j in the query, j = 1, ..., r. It is assumed that q_j is either assigned by the user or computed using the term frequency of t_j in the query. When the component system receives the query q, it first incorporates the local idf weight of each query term by modifying query q to

$$q' = (q_1 \cdot l_1, \ldots, q_r \cdot l_r) \qquad (15)$$

and then evaluates the modified query, where l_j is the local idf weight of term t_j in component system D, j = 1, ..., r. As a result, when the cosine function is used, the local similarity of d with q in D is $sim_D(q, d) = \left(\sum_{j=1}^{r} q_j \cdot l_j \cdot w_j\right) / (|q'| \cdot |d|)$, where |q'| and |d| are the lengths of q' and d, respectively.

Let l'_j be the global idf weight of term t_j. Then, when the cosine function is used, the global similarity of d with q should be $sim_G(q, d) = \left(\sum_{j=1}^{r} q_j \cdot l'_j \cdot w_j\right) / (|q''| \cdot |d|)$, where $q'' = (q_1 \cdot l'_1, \ldots, q_r \cdot l'_r)$.

In order to trick the component system D into computing the global similarity for d, the following procedure is used. When query q = (q_1, ..., q_r) is received by the metasearch engine, it is first modified to $q^* = (q_1 \cdot (l'_1/l_1), \ldots, q_r \cdot (l'_r/l_r))$. Then the modified query q* is sent to the component database D for evaluation. According to (15), after D receives q*, it further modifies q* to $(q_1 \cdot (l'_1/l_1) \cdot l_1, \ldots, q_r \cdot (l'_r/l_r) \cdot l_r) = (q_1 \cdot l'_1, \ldots, q_r \cdot l'_r) = q''$. Finally, q'' is evaluated by D, which computes the global similarity of d with q.
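The whole trick reduces to one componentwise rescaling of the query vector, as the following minimal sketch illustrates (our own illustration, assuming the local and global idf weights are known and nonzero).

def modify_query(q, local_idf, global_idf):
    # Divide out each term's local idf l_j and multiply in the global idf
    # l'_j; the component system's own multiplication by l_j then yields
    # q_j * l'_j, the weights the global similarity function expects.
    return [q_j * (g / l) for q_j, l, g in zip(q, local_idf, global_idf)]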

Unfortunately, query modification is not a technique that can work for arbitrary combinations of local and global similarity functions. In general, we still need to deal with situations where documents have different local and global similarities. Furthermore, this approach requires knowledge of the similarity function and the term weighting formula used in a component system. This information is likely to be proprietary and may not be easily available. A study of discovering such information based on sampling queries is reported in Liu et al. [2000].

6.4.2. Computing the Tightest Local Threshold. For a given query q, suppose the metasearch engine sets a threshold T and uses a global similarity function G such that any document d that satisfies G(q, d) > T is to be retrieved (i.e., the document is considered to be potentially useful). The problem is to determine a proper threshold T' for each selected component database D such that all potentially useful documents that exist in D can be retrieved using its local similarity function L. That is, if G(q, d) > T, then L(q, d) > T' for any document d in D. Note that in order to guarantee that all potentially useful documents will be retrieved from D, some unwanted documents from D may also have to be retrieved. The challenge is to minimize the number of documents to retrieve from D while still guaranteeing that all potentially useful documents from D are retrieved. In other words, it is desirable to determine the tightest (largest) local threshold T' such that if G(q, d) > T, then L(q, d) > T'.


In Gravano and Garcia-Molina [1997], it is shown that if (1) the similarities computed by G and L are between 0 and 1, and (2) G and L are related by the inequality G(q, d) − ε ≤ L(q, d), where ε is a constant satisfying 0 ≤ ε < 1, then a local threshold T' can be determined. However, the local threshold determined using the method in Gravano and Garcia-Molina [1997] is often not tight.

In Meng et al. [1998], several techniques were proposed to find the tightest local threshold for some popular pairs of similarity functions. For a given global similarity threshold T, let L(T) denote the tightest local threshold for a given component database D. Then one way to determine L(T) is as follows:

(1) Find the function f(t), the minimum of the local similarity function L(q, d) over all documents d in D, subject to t = G(q, d). In this step, t is fixed and d varies over all possible documents in D.

(2) Minimize f(t) in the range t ≥ T. This minimum of f(t) is the desired L(T).

Let {t_i} be the set of terms in the query q. If both L(q, d) and G(q, d) are differentiable with respect to the weight w_i of each term t_i of document d, then finding f(t) in Step 1 above can generally be achieved using the method of Lagrange multipliers from calculus [Widder 1989]. Once f(t) is found, its minimum value in the range t ≥ T can usually be computed easily. In particular, if f(t) is nondecreasing, then L(T) is simply f(T). The example below illustrates this method.

Example 2. Let d = (w_1, ..., w_r) be a document and q = (u_1, ..., u_r) be a query. Let the global similarity function be $G(q, d) = \sum_{i=1}^{r} u_i \cdot w_i$ and the local similarity function be $L(q, d) = \left(\sum_{i=1}^{r} u_i^p w_i^p\right)^{1/p}$ (known as the p-norm in Salton and McGill [1983]), p ≥ 1.

Step 1 is to find f(t), which requires us to minimize $\left(\sum_{i=1}^{r} u_i^p w_i^p\right)^{1/p}$ subject to $\sum_{i=1}^{r} u_i \cdot w_i = t$. Using the Lagrange method, f(t) is found to be $t \cdot r^{(1/p - 1)}$. As this function is an increasing function of t, for a global threshold T, the tightest local threshold L(T) is then $T \cdot r^{(1/p - 1)}$.

While this method may provide the tightest local threshold for certain combinations of local and global similarity functions, it has two weaknesses. First, a separate solution needs to be found for each different pair of similarity functions, and it is not clear whether a solution can always be found. Second, it requires that the local similarity function be known.

7. RESULT MERGING

To provide local system transparency to the global users, the results returned from the component search engines should be combined into a single result. Ideally, documents in the merged result should be ranked in descending order of global similarities. However, such an ideal merge is very hard to achieve due to the various heterogeneities among the component systems. Usually, documents returned from each component search engine are ranked based on those documents' local ranking scores or similarities. Some component search engines make the local similarities of returned documents available to the user, while other search engines do not. For example, Google and AltaVista do not provide local similarities, while Northern Light and FirstGov do. Local similarities returned from different component search engines, even when made available, may be incomparable due to the heterogeneities among these search engines. Furthermore, the local similarities and the global similarities of the same document may be quite different.

The challenge here is to merge the documents returned from different search engines into a single ranked list in a reasonable manner in the absence of local similarities and/or in the presence of incomparable similarities. A further complication is that some documents may be returned from multiple component search engines. The question is whether and how this should affect the ranking of these documents.


Existing result merging approaches can be classified into the following two types:

Local similarity adjustment: This type of approach adjusts local similarities using additional information such as the quality of component databases. A variation is to convert local document ranks to similarities.

Global similarity estimation: This type of approach attempts to compute or estimate the true global similarities of the returned documents.

The first type is usually easier to implement, but the merged ranking may be inaccurate as the merge is not based on the true global similarities of returned documents. The second type is more rigorous and has the potential to achieve the ideal merge. However, it typically needs more information from local systems. The two types of approaches are discussed in the following subsections.

7.1. Local Similarity Adjustment

Three cases can be identified depending on the degree of overlap among the selected databases for a given query.

Case 1: These databases are pairwise disjoint or nearly disjoint. This occurs when disjoint special-purpose search engines or those with minimal overlap are selected.

Case 2: The selected databases overlap but are not identical. An example of this situation is when several general-purpose search engines are selected.

Case 3: These databases are identical.

Case 3 usually does not occur in a metasearch engine environment. Instead, it occurs when multiple ranking techniques are applied to the same collection of documents in order to improve the retrieval effectiveness. The result merging problem in this case is also known as data fusion [Vogt and Cottrell 1999]. Data fusion has been studied extensively in the last decade. One special property of the data fusion problem is that every document will be ranked or scored by each employed ranking technique. A number of functions have been proposed to combine individual ranking scores of the same document, including min, max, average, sum, weighted average, and other linear combination functions [Cottrell and Belew 1994; Fox and Shaw 1994; Lee 1997; Vogt and Cottrell 1999]. One of the most effective functions for data fusion is known as CombMNZ, which, for each document, sums its individual scores and then multiplies the sum by the number of nonzero scores [Lee 1997]. This function emphasizes those documents that are ranked high by multiple systems. More data fusion techniques are surveyed in Croft [2000].
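As a concrete illustration of the combination step, the following is a minimal sketch of CombMNZ; the per-system score dictionaries are a hypothetical input format.

```python
# A minimal sketch of the CombMNZ fusion function described above: for each
# document, sum its scores and multiply by the number of nonzero scores.
from collections import defaultdict

def comb_mnz(score_lists):
    """score_lists: one {doc_id: score} dict per ranking technique."""
    total = defaultdict(float)
    nonzero = defaultdict(int)
    for scores in score_lists:
        for doc, s in scores.items():
            total[doc] += s
            if s > 0:
                nonzero[doc] += 1
    fused = {doc: total[doc] * nonzero[doc] for doc in total}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Example: d2 is scored by both systems and is boosted above d1.
print(comb_mnz([{"d1": 0.9, "d2": 0.6}, {"d2": 0.7, "d3": 0.5}]))
# [('d2', 2.6), ('d1', 0.9), ('d3', 0.5)]
```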

We now consider more likely scenarios in a metasearch engine context, namely, where the selected databases are not identical. We first consider the case where the selected databases are disjoint. In this case, all returned documents will be unique. Let us first assume that all returned documents have local similarities attached. It is possible that different search engines normalize their local similarities to different ranges. For example, one search engine may normalize its similarities between 0 and 1 and another between 0 and 1,000. In this case, all local similarities should be renormalized to a common range, say [0, 1], to improve the comparability of these local similarities [Dreilinger and Howe 1997; Selberg and Etzioni 1997]. In the following, we assume that all local similarities have been normalized based on a common range.
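The renormalization itself is a simple linear rescaling; a minimal sketch, with hypothetical native score ranges, is shown below.

```python
# A minimal sketch of renormalizing local similarities to a common [0, 1]
# range; the per-engine native score ranges are hypothetical examples.
def renormalize(similarity, lo, hi):
    """Map a local similarity from the engine's native range [lo, hi] to [0, 1]."""
    return (similarity - lo) / (hi - lo)

print(renormalize(750.0, 0.0, 1000.0))   # engine scoring in [0, 1000] -> 0.75
print(renormalize(0.75, 0.0, 1.0))       # engine already in [0, 1]    -> 0.75
```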

When database selection is performed for a given query, the usefulness or quality of each database is estimated and represented as a score. The database scores can be used to adjust the local similarities. The idea is to give preference to documents from highly ranked databases. In CORI Net [Callan et al. 1995], the adjustment works as follows. Let s be the ranking score of component database D and s̄ be the average of the scores of all databases searched. Then the weight w = 1 + N · (s − s̄)/s̄ is assigned to D, where N is the number of component databases searched for the given query. Clearly, if s > s̄, then w will be greater than 1; furthermore, the larger the difference is, the larger the weight will be. On the other hand, if s < s̄, then w will be smaller than 1; moreover, the larger the difference is, the smaller the weight will be. Let x be the local similarity of document d from D. Then the adjusted similarity of d is computed as w · x. The result merger lists returned documents in descending order of adjusted similarities. Based on the way the weight of a database is computed, it is clear that documents from higher-ranked databases have a better chance to be ranked higher in the merged result.
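A minimal sketch of this CORI Net-style adjustment follows; the database scores and local result lists are hypothetical inputs.

```python
# A minimal sketch of the CORI Net-style similarity adjustment described
# above: w = 1 + N * (s - s_bar) / s_bar, adjusted similarity = w * x.
def cori_adjust(db_scores, local_results):
    """db_scores: {db: ranking score}; local_results: {db: [(doc, local_sim), ...]}."""
    n = len(db_scores)
    s_bar = sum(db_scores.values()) / n
    merged = []
    for db, results in local_results.items():
        w = 1 + n * (db_scores[db] - s_bar) / s_bar   # weight of database db
        merged.extend((doc, w * x) for doc, x in results)
    return sorted(merged, key=lambda kv: kv[1], reverse=True)

db_scores = {"D1": 0.8, "D2": 0.4}                    # s_bar = 0.6
local_results = {"D1": [("a", 0.5)], "D2": [("b", 0.9)]}
print(cori_adjust(db_scores, local_results))          # a (~0.833) ranks above b (0.3)
```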

A similar method is used in ProFusion [Gauch et al. 1996]. For a given query, a ranking score is calculated for each database (see the discussion on ProFusion in Section 5.3.3). The adjusted similarity of a document d from a database D is the product of the local similarity of d and the ranking score of D.

Now let us consider the situation where the local similarities of the returned documents from some component search engines are not available. In this case, one of the following two approaches could be applied to tackle the merging problem. Again, we assume that no document is returned from multiple search engines, that is, all returned documents are unique.

(1) Use the local document rank information directly to perform the merge. Local similarities, if available, are ignored in this approach. First, the searched databases are arranged in descending order of the usefulness or quality scores obtained during the database selection step. Next, a round-robin method based on the database order and the local document rank order is used to merge the local document lists. Specifically, the first document in the merged list is the top-ranked document from the highest-ranked database, and the second document in the merged list is the top-ranked document from the second-highest-ranked database. After the top-ranked documents from all searched databases have been selected, the next document in the merged list will be the second-highest-ranked document in the highest-ranked database, and the process continues until the desired number of documents are included in the merged list. One weakness of this solution is that it does not take into consideration the differences between the database scores (i.e., only the order information is utilized).

A randomized version of the above method is proposed in Voorhees et al. [1995b]. Recall that in the MRDD database selection method, we first determine how many documents to retrieve from each component database for a given query to maximize the precision of the retrieval. Suppose the desired number of documents have been retrieved from each selected component database and N local document lists have been obtained, where N is the number of selected component databases. Let Li be the local document list for database Di. To select the next document to be placed in the merged list, the rolling of a die is simulated. The die has N faces corresponding to the N local lists. Suppose n is the total number of documents yet to be selected and ni documents are still in the list Li. The die is biased such that the probability that the face corresponding to Li will be up when the die is rolled is ni/n. When the face for Li is up, the current top-ranked document in the list Li is selected as the next-highest-ranked document in the merged list. After the selection, the selected document is removed from Li, and both ni and n are reduced by 1. The probabilities are also updated accordingly. In this way, the retrieved documents are ranked based on the probabilistic model. A sketch of the round-robin merge and of this biased-die variant appears below.
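The following minimal sketches illustrate both schemes just described: the deterministic round-robin merge and the biased-die variant of Voorhees et al. [1995b]. The document lists are hypothetical and assumed to be in descending local-rank order.

```python
# Minimal sketches of the two rank-based merging schemes described above.
import random

def round_robin_merge(local_lists):
    """local_lists must be ordered by descending database score."""
    merged, depth = [], 0
    while any(len(l) > depth for l in local_lists):
        merged.extend(l[depth] for l in local_lists if len(l) > depth)
        depth += 1
    return merged

def biased_die_merge(local_lists, seed=None):
    """Pick list Li with probability ni/n, where ni is what remains in Li."""
    rng = random.Random(seed)
    lists = [list(l) for l in local_lists]       # copy so inputs are not consumed
    merged = []
    while any(lists):
        i = rng.choices(range(len(lists)), weights=[len(l) for l in lists])[0]
        merged.append(lists[i].pop(0))           # take that list's current top doc
    return merged

print(round_robin_merge([["a1", "a2"], ["b1"]]))              # ['a1', 'b1', 'a2']
print(biased_die_merge([["a1", "a2", "a3"], ["b1"]], seed=7))
```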

(2) Convert local document ranks to similarities. In D-WISE [Yuwono and Lee 1997], the following method is employed. For a given query, suppose ri is the ranking score of database Di, rmin is the lowest database ranking score (i.e., rmin = min{ri}), r is the local rank of a document from database Di, and g is the converted similarity of the document. The conversion function is g = 1 − (r − 1) · Fi, where Fi is defined to be rmin/(m · ri) and m is the number of documents desired across all searched databases. Intuitively, this conversion function has the following properties. First, all top-ranked documents from local systems have the same converted similarity 1. This implies that all top-ranked documents from local systems are considered to be equally potentially useful. Second, Fi models the distance between the converted similarities of two consecutively ranked documents in database Di. In other words, the difference between the converted similarities of the jth- and the (j + 1)th-ranked documents from database Di is Fi. The distance is larger for databases with smaller ranking scores. As a result, if the rank of a document d in a higher-ranked database is the same as the rank of a document d′ in a lower-ranked database, but neither d nor d′ is top-ranked, then the converted similarity of d will be higher than that of d′. In addition, this method tends to select more documents from databases with higher scores into the merged result.

As an example, consider two databases D1 and D2. Suppose r1 = 0.2 and r2 = 0.5. Furthermore, suppose four documents are desired. Then we have rmin = 0.2, F1 = 0.25, and F2 = 0.1. Based on the above conversion function, the top three ranked documents from D1 will have converted similarities 1, 0.75, and 0.5, respectively, and the top three ranked documents from D2 will have converted similarities 1, 0.9, and 0.8, respectively. As a result, the merged list will contain three documents from D2 and one document from D1. The documents are ranked in descending order of converted similarities in the merged list.
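A minimal sketch of this conversion, reproducing the numbers of the example above (the document identifiers are hypothetical):

```python
# A minimal sketch of the D-WISE rank-to-similarity conversion:
# g = 1 - (r - 1) * Fi, with Fi = r_min / (m * ri).
def dwise_merge(db_scores, local_ranked_docs, m):
    """db_scores: {db: ri}; local_ranked_docs: {db: [doc, ...]} in local rank order."""
    r_min = min(db_scores.values())
    converted = []
    for db, docs in local_ranked_docs.items():
        f_i = r_min / (m * db_scores[db])        # similarity gap for this database
        for rank, doc in enumerate(docs, start=1):
            converted.append((doc, 1 - (rank - 1) * f_i))
    converted.sort(key=lambda kv: kv[1], reverse=True)
    return converted[:m]

merged = dwise_merge({"D1": 0.2, "D2": 0.5},
                     {"D1": ["a1", "a2", "a3"], "D2": ["b1", "b2", "b3"]}, m=4)
print(merged)   # [('a1', 1.0), ('b1', 1.0), ('b2', 0.9), ('b3', 0.8)]
```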

Now let us consider the situation where the selected databases overlap. For documents that are returned by a single search engine, the similarity adjustment techniques discussed above can be applied. We now consider how to deal with documents that are returned by multiple search engines. First, each local similarity can be adjusted using the techniques discussed above. Next, the adjusted similarities for the same document can be combined in a certain way to produce an overall adjusted similarity for the document. The combination can be carried out by utilizing one of the combination functions proposed for data fusion. Indeed, this has been practiced by some metasearch engines. For example, the max function is used in ProFusion [Gauch et al. 1996] and the sum function is used in MetaCrawler [Selberg and Etzioni 1997]. It should be pointed out that an effective combination function in data fusion may not necessarily be effective in a metasearch engine environment. In data fusion, if a document is not retrieved by a retrieval technique, then it is because the document is not considered useful by the technique. In contrast, in a metasearch engine, there are two possible reasons for a document not to be retrieved by a selected search engine. The first is the same as in the data fusion case, namely, the document is not considered sufficiently useful by the search engine. The second is that the document is not indexed by the search engine. In this case, the document did not have a chance to be judged for its usefulness by the search engine. Clearly, a document that is not retrieved due to the second reason will be put at a disadvantage if a combination function such as sum or CombMNZ is used. Finding an effective combination function in a metasearch engine environment is an area that still needs further research.

7.2. Global Similarity Estimation

Under certain conditions, it is possible to compute or estimate the global similarities of returned documents. The following methods have been reported.

7.2.1. Document Fetching. That a document is returned by a search engine typically means that the URL of the document is returned. Sometimes, additional information associated with the document, such as a short summary or the first couple of sentences, is also returned. But the document itself is typically not returned.

The document fetching method down-loads returned documents from their localservers and computes or estimates theirglobal similarities in the metasearch en-gine. Consider the case in which the globalsimilarity function is the cosine functionand the global document frequency of eachterm is known to the metasearch engine(note that if local databases have little orno overlap, then the global document fre-quency of a term can be computed or ap-proximated as the sum of the local doc-ument frequencies of the term). After adocument is downloaded, the term fre-quency of each term in the document canbe obtained. As a result, all statisticsneeded to compute the global similarityof the document will be available and theglobal similarity can be computed. TheInquirus metasearch engine ranks docu-ments returned from different search en-gines based on analyzing the contents ofdownloaded documents and a ranking for-mula that combines similarity and prox-imity matches is employed [Lawrence andLee Giles 1998].

A document-fetching-based method that combines document selection and result merging is reported in Yu et al. [1999b]. Suppose that the m most similar documents across all databases with respect to a given query are desired for some positive integer m. In Section 5.2.5, we introduced a method to rank databases in descending order of the similarity of the most similar document in each database for a given query. Such a rank is an optimal rank for retrieving the m most similar documents. This rank can also be used to perform document selection as follows.

First, for some small positive integer s (e.g., s can start from 2), each of the s top-ranked databases is searched to obtain the actual global similarity of its most similar document. This may require downloading some documents from these databases. Let min_sim be the minimum of these s similarities. Next, from these s databases, retrieve all documents whose actual global similarities are greater than or equal to the tentative threshold min_sim. The tightest local threshold for each of these s databases could be determined and used here. If m or more documents have been retrieved, then this process stops. Otherwise, the next top-ranked database (i.e., the (s + 1)th-ranked database) is considered and its most similar document is retrieved. The actual global similarity of this document is then compared with min_sim, and the minimum of the two similarities is used as a new global threshold to retrieve all documents from these s + 1 databases whose actual global similarities are greater than or equal to this threshold. This process is repeated until m or more documents are retrieved. Retrieved documents are ranked in descending order of their actual global similarities. A potential problem with this approach is that the same database may be searched multiple times. This problem can be relieved to some extent by retrieving and caching a larger number of documents when searching a database.
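A minimal sketch of this iterative procedure follows; the two callbacks (top_global_sim and docs_at_or_above) are assumptions standing in for the actual interaction with component search engines, including any document downloading needed to obtain global similarities.

```python
# A minimal sketch of the iterative document selection/merging procedure of
# Yu et al. [1999b]; the two callbacks are hypothetical stand-ins for real
# search engine interactions (downloading and globally scoring documents).
def fetch_merge(ranked_dbs, top_global_sim, docs_at_or_above, m, s=2):
    """ranked_dbs: databases in (estimated) optimal order.
    top_global_sim(db): global similarity of db's most similar document.
    docs_at_or_above(db, threshold): [(doc, global_sim), ...] from db."""
    threshold = min(top_global_sim(db) for db in ranked_dbs[:s])
    while True:
        results = []
        for db in ranked_dbs[:s]:                # may re-search earlier databases
            results.extend(docs_at_or_above(db, threshold))
        if len(results) >= m or s == len(ranked_dbs):
            return sorted(results, key=lambda kv: kv[1], reverse=True)[:m]
        s += 1                                   # consider the next ranked database
        threshold = min(threshold, top_global_sim(ranked_dbs[s - 1]))
```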

This method has the following two properties [Yu et al. 1999b]. First, if the databases are ranked optimally, then all of the m most similar documents can be retrieved while accessing at most one unnecessary database, for any m. Second, for any single-term query, the optimal rank of databases can be achieved and, as a result, the m most similar documents will be retrieved.

Downloading documents and analyzing them on the fly can be an expensive undertaking, especially when the number of documents to be downloaded is large and the documents are large. A number of remedies have been proposed. First, downloading from different local systems can be carried out in parallel. Second, some documents can be analyzed first and displayed to the user so that further analysis can be done while the user reads the initial results [Lawrence and Lee Giles 1998]. The initially displayed results may not be correctly ranked, and the overall rank needs to be adjusted as more documents are analyzed. Third, we may consider downloading only the beginning portion of each (large) document to analyze [Craswell et al. 1999].

On the other hand, downloading-based approaches also have some clear advantages [Lawrence and Lee Giles 1998]. First, when trying to download documents, obsolete URLs can be identified. As a result, documents with dead URLs can be removed from the final result list. Second, by analyzing downloaded documents, documents will be ranked based on their current contents. In contrast, local similarities may be computed based on old versions of these documents. Third, query terms in downloaded documents can be highlighted when displayed to the user.

7.2.2. Use of Discovered Knowledge. As discussed previously, one difficulty with result merging is that local document similarities may be incomparable because different component search engines may index documents differently and may compute similarities using different methods (term weighting schemes, similarity functions, etc.). If the specific document indexing and similarity computation methods used in different component search engines can be discovered, for example, using the techniques proposed in Liu et al. [2000], then we will be in a better position to figure out (1) which local similarities are reasonably comparable; (2) how to adjust some local similarities so that they become more comparable with others; and (3) how to derive global similarities from local similarities. This is illustrated by the following example [Meng et al. 1999b].

Example 3. Suppose it is discovered that all the component search engines selected to answer a given user query employ the same methods to index local documents and to compute local similarities, and no collection-dependent statistics such as the idf information are used. Then the similarities from these local search engines can be considered comparable. As a result, these similarities can be used directly to merge the returned documents.

If the only difference among these component search engines is that some remove stopwords and some do not (or the stopword lists are different), then a query may be adjusted to generate more comparable local similarities. For instance, suppose a term t in query q is a stopword in component search engine E1 but not in component search engine E2. In order to generate more comparable similarities, we can remove t from q and submit the modified query to E2 (it does not matter whether the original q or the modified q is submitted to E1).

If the idf information is also used, then we need to either adjust the local similarities or compute the global similarities directly to overcome the problem that the global idf and the local idf's of a term may be different. Consider the following two cases. It is assumed that both the local similarity function and the global similarity function are the cosine function.

Case 1: Query q consists of a single term t. The similarity of q with a document d in a component database can be computed by

sim(d, q) = (qtf_t(q) · lidf_t · dtf_t(d)) / (|q| · |d|),

where qtf_t(q) and dtf_t(d) are the tf weights of term t in q and in d, respectively, and lidf_t is the local idf weight of t. If the local idf formula has been discovered and the global document frequency of t is known, then this local similarity can be adjusted to the global similarity by multiplying it by gidf_t / lidf_t, where gidf_t is the global idf weight of t.

Case 2: Query q has multiple terms t_1, . . . , t_k. The global similarity between d and q in this case is

s = Σ_{i=1}^{k} qtf_{t_i}(q) · gidf_{t_i} · dtf_{t_i}(d) / (|q| · |d|)
  = Σ_{i=1}^{k} (qtf_{t_i}(q) / |q|) · (dtf_{t_i}(d) / |d|) · gidf_{t_i}.

Clearly, qtf_{t_i}(q)/|q| and gidf_{t_i}, i = 1, . . . , k, can all be computed by the metasearch engine, as the formulas for computing them are known. Therefore, in order to find s, we need to find dtf_{t_i}(d)/|d|, i = 1, . . . , k. To find dtf_{t_i}(d)/|d| for a given term t_i without downloading document d, we can submit t_i as a single-term query. Let

s_i = sim(d, t_i) = (qtf_{t_i}(t_i) · lidf_{t_i} · dtf_{t_i}(d)) / (|t_i| · |d|)

be the local similarity returned. Then

dtf_{t_i}(d)/|d| = (s_i · |t_i|) / (qtf_{t_i}(t_i) · lidf_{t_i}).    (16)

Note that the expression on the right-hand side of the above formula can be computed by the metasearch engine when all the local formulas are known (i.e., have been discovered). In summary, k additional single-term queries can be used to compute the global similarities between q and all documents retrieved by q.
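A minimal sketch built around Equation (16) follows; the local_search probe interface and the discovered weighting functions (qtf, lidf, gidf, and the query-length function q_len) are assumptions for illustration.

```python
# A minimal sketch of the probe-query method built around Equation (16);
# the callbacks model discovered local formulas and the local search engine.
import math

def global_similarity(query_terms, doc_id, local_search, qtf, lidf, gidf, q_len):
    """Estimate the global similarity of doc_id without downloading it.
    local_search(t) -> {doc_id: local similarity for the single-term query t}."""
    q = " ".join(query_terms)
    s = 0.0
    for t in query_terms:
        s_i = local_search(t).get(doc_id, 0.0)       # probe: submit t alone
        denom = qtf(t, t) * lidf(t)
        # Equation (16): dtf_t(d)/|d| = s_i * |t| / (qtf_t(t) * lidf_t)
        dtf_over_dlen = s_i * q_len(t) / denom if denom else 0.0
        s += (qtf(t, q) / q_len(q)) * dtf_over_dlen * gidf(t)
    return s

# Hypothetical discovered formulas for a toy engine: raw-tf query weights,
# lidf = 1 for every term, and query length = sqrt(number of terms).
demo = global_similarity(
    ["apple", "pie"], "doc1",
    local_search=lambda t: {"doc1": 0.4},
    qtf=lambda t, text: text.split().count(t),
    lidf=lambda t: 1.0,
    gidf=lambda t: 2.0,
    q_len=lambda text: math.sqrt(len(text.split())),
)
print(demo)
```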

8. NEW CHALLENGES

As discussed in previous sections, much progress has been made toward finding efficient and accurate solutions to the problem of processing queries in a metasearch engine environment. However, as an emerging area, many outstanding problems remain to be solved. In this section, we list a few worthwhile challenges in this area.

(1) Integrate local systems employing different indexing techniques. Using different indexing techniques in different local systems can have a serious impact on the compatibility of local similarities. Careful observation reveals that using different indexing techniques can in fact affect the estimation accuracy in each of the three software components (i.e., database selection, document selection, and result merging). New studies need to be carried out to investigate more precisely what impact it poses and how to overcome or alleviate the impact. Previous studies have largely focused on different local similarity functions and local term weighting schemes.

(2) Integrate local systems supporting different types of queries (e.g., Boolean queries versus vector space queries). Most of our discussions in this article are based on queries in the vector space model [Salton and McGill 1983]. There exist metasearch engines that use Boolean queries [French et al. 1995; Li and Danzig 1997; NCSTRL n.d.], and a number of works on dealing with Boolean queries in a metasearch engine have been reported [Gravano et al. 1994; Li and Danzig 1997; Sheldon et al. 1994]. Since very different methods may be used to rank documents for Boolean queries (traditional Boolean retrieval systems do not even rank retrieved documents) and vector space queries, we are likely to face many new problems when integrating local systems that support both Boolean queries and vector space queries.

(3) Discover knowledge about component search engines. Many local systems are not willing to provide sufficient design and statistical information about their systems; they consider such information proprietary. However, without sufficient information about a local system, the usefulness of the local system with respect to a given query may not be estimated accurately. One possible solution to this dilemma is to develop tools that can learn, through probe queries, the indexing terms a local system uses, certain statistical information about these terms, and the similarity function used. These learning or knowledge discovery tools can be used to facilitate not only the addition of new component search engines to an existing metasearch engine but also the detection of major upgrades or changes to existing component systems. Some preliminary work in this area has started to be reported. Using a sampling technique to generate approximate database representatives for CORI Net is reported in Callan et al. [1999]. In Liu et al. [2000], a technique is proposed to discover how term weights are assigned in component search engines. New techniques need to be developed to discover knowledge about component search engines more accurately and more efficiently.

(4) Develop more effective result merging methods. Up to now, most result merging methods that have undergone extensive experimental evaluation are those proposed for data fusion. These methods may be unsuitable in the metasearch engine environment, where the databases of different component search engines are not identical. New methods that take into consideration the special characteristics of the metasearch engine environment need to be designed and evaluated. One such special characteristic is that when a document is not retrieved by a search engine, it may be because the document is not indexed by the search engine.

(5) Study the appropriate cooperation between a metasearch engine and the local systems. There are two extreme ways to build a metasearch engine. One is to impose an interface on top of autonomous component search engines. In this case, no cooperation from these local systems can be expected. The other is to invite local systems to join a metasearch engine. In this case, the developer of the metasearch engine may set conditions, such as what similarity function(s) must be used and what information about the component databases must be provided, that must be satisfied for a local system to join the metasearch engine. Many possibilities exist between the two extremes. This means that, in a practical environment, different types of database representatives will likely be available to the metasearch engine. How to use different types of database representatives to estimate comparable database usefulnesses is still a largely untouched problem. An interesting issue is to come up with guidelines on what information from local systems is useful to facilitate the construction of a metasearch engine. Search engine developers may use such guidelines to design or upgrade their search engines. Multiple levels of compliance should be allowed, with different compliance levels guaranteeing different levels of estimation accuracy. A serious initial effort in this regard can be found in Gravano et al. [1997].

(6) Incorporate new indexing and weighting techniques to build better metasearch engines. Some new indexing and term weighting techniques have been developed for search engines for HTML documents. For example, some search engines (e.g., WWWW [McBryan 1994], Google [Brin and Page 1998], and Webor [Cutler et al. 1997]) use anchor terms in a Web page to index the Web page that is hyperlinked by the URL associated with the anchor. The rationale is that when authors of Web pages add a hyperlink to another Web page p, they include in the anchor tag a description of p in addition to its URL. These descriptions have the potential of being very important for the retrieval of p because they reflect the perception of these authors about the contents of p. As another example, some search engines also compute the weight of a term according to its position in the Web page and its font type. In SIBRIS [Wade et al. 1989], the weight of a term in a page is increased if the term appears in the title of the page. A similar method is also employed in AltaVista, HotBot, and Yahoo. Google [Brin and Page 1998] assigns higher weights to terms in larger or bold fonts. It is known that co-occurrences and proximities of terms have a significant influence on the relevance of documents. An interesting problem is how to incorporate these new techniques into the entire retrieval process and into the database representatives so that better metasearch engines can be built.


(7) Improve the effectiveness of metasearch. Most existing techniques rank databases and documents based on the similarities between the query and the documents in each database. Similarities are computed based on the match of terms in the query and documents. Studies in information retrieval indicate that when queries have a large number of terms, a correlation between highly similar documents and relevant documents exists, provided appropriate similarity functions and term weighting schemes, such as the cosine function and the tfw ∗ idfw weight formula, are used. However, for queries that are short, as is typical in the Internet environment [Jansen et al. 1998; Kirsch 1998], the above correlation is weak. The reason is that for a long query, the terms in the query provide context to each other to help disambiguate the meanings of different terms. In a short query, the particular meaning of a term often cannot be identified correctly. In summary, a document similar to a short query may not be useful to the user who submitted the query because the matching terms may have different meanings. Clearly, the same problem also exists for search engines. Methods need to be developed to address this issue. The following are some promising ideas. First, incorporate the importance of a document as determined by linkages between documents (e.g., PageRank [Page et al. 1998] and authority [Kleinberg 1998]) with the similarity of the document to a query [Yu et al. 2001]. Second, associate databases with concepts [Fan and Gauch 1999; Ipeirotis et al. 2001; Meng et al. 2001]. When a query is received by the metasearch engine, it is first mapped to a number of appropriate concepts, and then those databases associated with the mapped concepts are used for database selection. The concepts associated with a database/query provide some context for terms in the database/query. As a result, the meanings of terms can be more accurately determined. Third, user profiles may be utilized to support personalized metasearch. Fourth, collaborative filtering (CF) has been shown to be very effective for recommending useful documents [Konstan et al. 1997] and is employed by the DirectHit search engine (www.directhit.com). The CF technique may also be useful for recommending databases to search for a given query.

(8) Decide where to place the software components of a metasearch engine. In Section 3, we identified the major software components for building a good metasearch engine. One issue that we have not discussed is where these components should be placed. An implicit assumption used in this article is that all components are placed at the site of the metasearch engine. However, valid alternatives exist. For example, instead of having the database selector at the global site, we could distribute it to all local sites. The representative of each local database can also be stored locally. In this scenario, each user query will be dispatched to all local sites for database selection. Each site then estimates the usefulness of its database with respect to the query to determine whether its local search engine should be invoked for the query. Although this placement of the database selector incurs a higher communication cost, it also has some appealing advantages. First, the estimation of database usefulness can now be carried out in parallel. Next, as database representatives are stored locally, the scalability issue becomes much less significant than when centralized database selection is employed. Other components such as the document selector may also have alternative placements. We need to investigate the pros and cons of different placements of these software components. New research issues may arise from these investigations.

(9) Create a standard testbed to evaluate the proposed techniques for database selection, document selection, and result merging. This is an urgent need. Although most papers that report these techniques include some experimental results, it is hard to draw general conclusions from these results due to the limitations of the documents and queries used. Some studies use various portions of some old TREC collections to conduct experiments [Callan et al. 1995; Voorhees et al. 1995b; Xu and Callan 1998] so that the information about the relevance of documents to each query can be utilized. However, the old TREC collections have several limitations. First, the number of queries that can be used for different portions of the TREC collections is small (from 50 to 250). Second, these queries tend to be much longer on average than typical queries encountered in the Internet environment [Abdulla et al. 1997; Jansen et al. 1998]. Third, the documents do not reflect the more structured and more extensively hyperlinked nature of Web documents. In Gravano and Garcia-Molina [1995], Meng et al. [1998, 1999a], and Yu et al. [1999a], a collection of more than 6,000 real Internet queries is used. However, the database collection is small and there is no document relevance information. An ideal testbed should have a large collection of databases of various sizes, contents, and structures, and a large collection of queries of various lengths with the relevant documents for each query identified. Recently, a testbed based on partitioning some old TREC collections into hundreds of databases has been proposed for evaluating metasearch techniques [French et al. 1998, 1999]. However, this testbed is far from ideal due to the problems inherited from the TREC collections used. Two new TREC collections consisting of Web documents (i.e., WT10g and VLC2; WT10g is a 10GB subset of the 100GB VLC2) have been created recently. The test queries are also typical Internet queries. It is possible that good testbeds can be derived from them.

(10) Extend metasearch techniques to different types of data sources. Information sources on the Web often contain multimedia data such as text, images, and video. Most work in metasearch deals with only text sources or the text aspect of multimedia sources. Database selection techniques have also been investigated for other media types. For example, selecting image databases in a metasearch context was studied in Chang et al. [1998]. As another example, for data sources that can be described by attributes, such as book title and author name, a necessary and sufficient condition for ranking databases optimally was given in Kirk et al. [1995]. The database selection method in Liu [1999] also considered only data sources of mostly structured data. But there is a lack of research on providing metasearch capabilities for mixed media or multimedia sources.

The above list of challenges is by no means complete. New problems will arise with a deeper understanding of the issues in metasearch.

9. CONCLUSIONS

With the increase in the number of search engines and digital libraries on the World Wide Web, providing easy, efficient, and effective access to text information from multiple sources has become increasingly necessary. In this article, we presented an overview of existing metasearch techniques. Our overview concentrated on the problems of database selection, document selection, and result merging. A wide variety of techniques for each of these problems was surveyed and analyzed. We also discussed the causes that make these problems very challenging. The causes include various heterogeneities among different component search engines, due to the independent implementations of these search engines, and the lack of information about these implementations, because they are mostly proprietary.

Our survey and investigation seem to indicate that better solutions to each of the three main problems, namely database selection, document selection, and result merging, require more information/knowledge about the component search engines, such as more detailed database representatives, the underlying similarity functions, term weighting schemes, indexing methods, and so on. There are currently no sufficiently efficient methods to find such information without the cooperation of the underlying search engines. A possible scenario is that we will need good solutions based on different degrees of knowledge about each local search engine, which we will then apply accordingly.

Another important issue is the scalability of the solutions. Ultimately, we need to develop solutions that can scale in two orthogonal dimensions: data and access. Specifically, a good solution must scale to thousands of databases, with many of them containing millions of documents, and to millions of accesses a day. None of the proposed solutions has been evaluated under these conditions.

ACKNOWLEDGMENTS

We are very grateful to the anonymous reviewers and the editor of the article, Michael Franklin, for their invaluable suggestions and constructive comments. We would also like to thank Leslie Lander for reading the manuscript and providing suggestions that have improved its quality.

REFERENCES

ABDULLA, G., LIU, B., SAAD, R., AND FOX, E. 1997. Characterizing World Wide Web queries. Technical Report TR-97-04, Virginia Tech.

BAUMGARTEN, C. 1997. A probabilistic model for distributed information retrieval. In Proceedings of the ACM SIGIR Conference (Philadelphia, PA, July 1997), 258–266.

BERGMAN, M. 2000. The deep Web: surfacing the hidden value. BrightPlanet, www.completeplanet.com/Tutorials/DeepWeb/index.asp.

BOYAN, J., FREITAG, D., AND JOACHIMS, T. 1996. A machine learning architecture for optimizing web search engines. In AAAI Workshop on Internet-Based Information Systems (Portland, OR, 1996).

BRIN, S. AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh World Wide Web Conference (Brisbane, Australia, April 1998), 107–117.

BUCKLEY, C., SALTON, G., AND ALLAN, J. 1993. Automatic retrieval with locality information using SMART. In Proceedings of the First Text REtrieval Conference, NIST Special Publication 500-207 (March), 59–72.

CALLAN, J. 2000. Distributed information retrieval. In Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval, W. Bruce Croft, ed. Kluwer Academic Publishers, 127–150.

CALLAN, J., CONNELL, M., AND DU, A. 1999. Automatic discovery of language models for text databases. In Proceedings of the ACM SIGMOD Conference (Philadelphia, PA, June 1999), 479–490.

CALLAN, J., CROFT, B., AND HARDING, S. 1992. The INQUERY retrieval system. In Proceedings of the Third DEXA Conference (Valencia, Spain, 1992), 78–83.

CALLAN, J., LU, Z., AND CROFT, W. 1995. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR Conference (Seattle, WA, July 1995), 21–28.

CHAKRABARTI, S., DOM, B., KUMAR, S., RAGHAVAN, P., RAJAGOPALAN, S., TOMKINS, A., GIBSON, D., AND KLEINBERG, J. 1999. Mining the Web's link structure. IEEE Comput. 32, 8 (Aug.), 60–67.

CHAKRAVARTHY, A. AND HAASE, K. 1995. NetSerf: using semantic knowledge to find Internet information archives. In Proceedings of the ACM SIGIR Conference (Seattle, WA, July 1995), 4–11.

CHANG, C. AND GARCIA-MOLINA, H. 1999. Mind your vocabulary: query mapping across heterogeneous information sources. In Proceedings of the ACM SIGMOD Conference (Philadelphia, PA, June 1999), 335–346.

CHANG, W., MURTHY, D., ZHANG, A., AND SYEDA-MAHMOOD, T. 1998. Global integration of visual databases. In Proceedings of the IEEE International Conference on Data Engineering (Orlando, FL, Feb. 1998), 542–549.

COTTRELL, G. AND BELEW, R. 1994. Automatic combination of multiple ranked retrieval systems. In Proceedings of the ACM SIGIR Conference (Dublin, Ireland, July 1994), 173–181.

CRASWELL, N., HAWKING, D., AND THISTLEWAITE, P. 1999. Merging results from isolated search engines. In Proceedings of the Tenth Australasian Database Conference (Auckland, New Zealand, Jan. 1999), 189–200.

CROFT, W. 2000. Combining approaches to information retrieval. In Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval, W. Bruce Croft, ed. Kluwer Academic Publishers, 1–36.

CUTLER, M., SHIH, Y., AND MENG, W. 1997. Using the structures of HTML documents to improve retrieval. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (Monterey, CA, Dec. 1997), 241–251.

DREILINGER, D. AND HOWE, A. 1997. Experiences with selecting search engines using metasearch. ACM Trans. Inform. Syst. 15, 3 (July), 195–222.

FAN, Y. AND GAUCH, S. 1999. Adaptive agents for information gathering from multiple, distributed information sources. In Proceedings of the 1999 AAAI Symposium on Intelligent Agents in Cyberspace (Stanford University, Palo Alto, CA, March 1999), 40–46.

FOX, E. AND SHAW, J. 1994. Combination of multiple searches. In Proceedings of the Second Text REtrieval Conference (Gaithersburg, MD, Aug. 1994), 243–252.

FRENCH, J., FOX, E., MALY, K., AND SELMAN, A. 1995. Wide area technical report service: technical reports online. Commun. ACM 38, 4 (April), 45–46.

FRENCH, J., POWELL, A., CALLAN, J., VILES, C., EMMITT, T., PREY, K., AND MOU, Y. 1999. Comparing the performance of database selection algorithms. In Proceedings of the ACM SIGIR Conference (Berkeley, CA, Aug. 1999), 238–245.

FRENCH, J., POWELL, A., AND VILES, C. 1998. Evaluating database selection techniques: a testbed and experiment. In Proceedings of the ACM SIGIR Conference (Melbourne, Australia, Aug. 1998), 121–129.

GAUCH, S., WANG, G., AND GOMEZ, M. 1996. ProFusion: intelligent fusion from multiple, distributed search engines. J. Univers. Comput. Sci. 2, 9, 637–649.

GRAVANO, L., CHANG, C., GARCIA-MOLINA, H., AND PAEPCKE, A. 1997. STARTS: Stanford proposal for Internet meta-searching. In Proceedings of the ACM SIGMOD Conference (Tucson, AZ, May 1997), 207–218.

GRAVANO, L. AND GARCIA-MOLINA, H. 1995. Generalizing GlOSS to vector-space databases and broker hierarchies. In Proceedings of the International Conference on Very Large Data Bases (Zurich, Switzerland, Sept. 1995), 78–89.

GRAVANO, L. AND GARCIA-MOLINA, H. 1997. Merging ranks from heterogeneous Internet sources. In Proceedings of the International Conference on Very Large Data Bases (Athens, Greece, Aug. 1997), 196–205.

GRAVANO, L., GARCIA-MOLINA, H., AND TOMASIC, A. 1994. The effectiveness of GlOSS for the text database discovery problem. In Proceedings of the ACM SIGMOD Conference (Minneapolis, MN, May 1994), 126–137.

HAWKING, D. AND THISTLEWAITE, P. 1999. Methods for information server selection. ACM Trans. Inform. Syst. 17, 1 (Jan.), 40–76.

IPEIROTIS, P., GRAVANO, L., AND SAHAMI, M. 2001. Probe, count, and classify: categorizing hidden-Web databases. In Proceedings of the ACM SIGMOD Conference (Santa Barbara, CA, 2001), 67–78.

JANSEN, B., SPINK, A., BATEMAN, J., AND SARACEVIC, T. 1998. Real life information retrieval: a study of user queries on the Web. ACM SIGIR Forum 32, 1, 5–17.

KAHLE, B. AND MEDLAR, A. 1991. An information system for corporate users: wide area information servers. Technical Report TMC199, Thinking Machine Corporation (April).

KIRK, T., LEVY, A., SAGIV, Y., AND SRIVASTAVA, D. 1995. The information manifold. In AAAI Spring Symposium on Information Gathering in Distributed Heterogeneous Environments (1995).

KIRSCH, S. 1998. Internet search: Infoseek's experiences searching the Internet. ACM SIGIR Forum 32, 2, 3–7.

KLEINBERG, J. 1998. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (San Francisco, CA, Jan. 1998), 668–677.

KONSTAN, J., MILLER, B., MALTZ, D., HERLOCKER, J., GORDON, L., AND RIEDL, J. 1997. GroupLens: applying collaborative filtering to Usenet news. Commun. ACM 40, 3, 77–87.

KOSTER, M. 1994. ALIWEB: Archie-like indexing in the Web. Comput. Netw. and ISDN Syst. 27, 2, 175–182.

LAWRENCE, S. AND LEE GILES, C. 1998. Inquirus, the NECI meta search engine. In Proceedings of the Seventh International World Wide Web Conference (Brisbane, Australia, April 1998), 95–105.

LAWRENCE, S. AND LEE GILES, C. 1999. Accessibility of information on the Web. Nature 400, 107–109.

LEE, J.-H. 1997. Analyses of multiple evidence combination. In Proceedings of the ACM SIGIR Conference (Philadelphia, PA, July 1997), 267–276.

LI, S. AND DANZIG, P. 1997. Boolean similarity measures for resource discovery. IEEE Trans. Knowl. Data Eng. 9, 6 (Nov.), 863–876.

LIU, K., MENG, W., YU, C., AND RISHE, N. 2000. Discovery of similarity computations of search engines. In Proceedings of the Ninth ACM International Conference on Information and Knowledge Management (Washington, DC, Nov. 2000), 290–297.

LIU, K., YU, C., MENG, W., WU, W., AND RISHE, N. 2001. A statistical method for estimating the usefulness of text databases. IEEE Trans. Knowl. Data Eng. To appear.

LIU, L. 1999. Query routing in large-scale digital library systems. In Proceedings of the IEEE International Conference on Data Engineering (Sydney, Australia, March 1999), 154–163.

MANBER, U. AND BIGOT, P. 1997. The search broker. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (Monterey, CA, Dec. 1997), 231–239.

MANBER, U. AND BIGOT, P. 1998. Connecting diverse web search facilities. Data Eng. Bull. 21, 2 (June), 21–27.

MAULDIN, M. 1997. Lycos: design choices in an Internet search service. IEEE Expert 12, 1 (Feb.), 1–8.

MCBRYAN, O. 1994. GENVL and WWWW: tools for taming the Web. In Proceedings of the First World Wide Web Conference (Geneva, Switzerland, May 1994), 79–90.

MENG, W., LIU, K., YU, C., WANG, X., CHANG, Y., AND RISHE, N. 1998. Determining text databases to search in the Internet. In Proceedings of the International Conference on Very Large Data Bases (New York, NY, Aug. 1998), 14–25.

MENG, W., LIU, K., YU, C., WU, W., AND RISHE, N. 1999a. Estimating the usefulness of search engines. In Proceedings of the IEEE International Conference on Data Engineering (Sydney, Australia, March 1999), 146–153.

MENG, W., WANG, W., SUN, H., AND YU, C. 2001. Concept hierarchy based text database categorization. Int. J. Knowl. Inform. Syst. To appear.

MENG, W., YU, C., AND LIU, K. 1999b. Detection of heterogeneities in a multiple text database environment. In Proceedings of the Fourth IFCIS Conference on Cooperative Information Systems (Edinburgh, Scotland, Sept. 1999), 22–33.

MILLER, G. 1990. WordNet: an on-line lexical database. Int. J. Lexicography 3, 4, 235–312.

NCSTRL. n.d. Networked Computer Science Technical Reference Library. At Web site http://cstr.cs.cornell.edu.

PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1998. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford University, Palo Alto, CA.

ROBERTSON, S., WALKER, S., AND BEAULIEU, M. 1999. Okapi at TREC-7: automatic ad hoc, filtering, VLC, and interactive track. In Proceedings of the Seventh Text REtrieval Conference (Gaithersburg, MD, Nov. 1999), 253–264.

SALTON, G. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison Wesley, Reading, MA.

SALTON, G. AND MCGILL, M. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY.

SELBERG, E. AND ETZIONI, O. 1995. Multi-service search and comparison using the MetaCrawler. In Proceedings of the Fourth World Wide Web Conference (Boston, MA, Dec. 1995), 195–208.

SELBERG, E. AND ETZIONI, O. 1997. The MetaCrawler architecture for resource aggregation on the Web. IEEE Expert 12, 1, 8–14.

SHELDON, M., DUDA, A., WEISS, R., O'TOOLE, J., AND GIFFORD, D. 1994. A content routing system for distributed information servers. In Proceedings of the Fourth International Conference on Extending Database Technology (Cambridge, England, March 1994), 109–122.

SINGHAL, A., BUCKLEY, C., AND MITRA, M. 1996. Pivoted document length normalization. In Proceedings of the ACM SIGIR Conference (Zurich, Switzerland, Aug. 1996), 21–29.

SUGIURA, A. AND ETZIONI, O. 2000. Query routing for Web search engines: architecture and experiments. In Proceedings of the Ninth World Wide Web Conference (Amsterdam, The Netherlands, May 2000), 417–429.

TOWELL, G., VOORHEES, E., GUPTA, N., AND JOHNSON-LAIRD, B. 1995. Learning collection fusion strategies for information retrieval. In Proceedings of the 12th International Conference on Machine Learning (Tahoe City, CA, July 1995), 540–548.

TURTLE, H. AND CROFT, B. 1991. Evaluation of an inference network-based retrieval model. ACM Trans. Inform. Syst. 9, 3 (July), 187–222.

VOGT, C. AND COTTRELL, G. 1999. Fusion via a linear combination of scores. Inform. Retr. 1, 3, 151–173.

VOORHEES, E. 1996. Siemens TREC-4 report: further experiments with database merging. In Proceedings of the Fourth Text REtrieval Conference (Gaithersburg, MD, Nov. 1996), 121–130.

VOORHEES, E., GUPTA, N., AND JOHNSON-LAIRD, B. 1995a. The collection fusion problem. In Proceedings of the Third Text REtrieval Conference (Gaithersburg, MD, Nov. 1995), 95–104.

VOORHEES, E., GUPTA, N., AND JOHNSON-LAIRD, B. 1995b. Learning collection fusion strategies. In Proceedings of the ACM SIGIR Conference (Seattle, WA, July 1995), 172–179.

VOORHEES, E. AND TONG, R. 1997. Multiple search engines in database merging. In Proceedings of the Second ACM International Conference on Digital Libraries (Philadelphia, PA, July 1997), 93–102.

WADE, S., WILLETT, P., AND BAWDEN, D. 1989. SIBRIS: the sandwich interactive browsing and ranking information system. J. Inform. Sci. 15, 249–260.

WIDDER, D. 1989. Advanced Calculus, 2nd ed. Dover Publications, Inc., New York, NY.

WU, Z., MENG, W., YU, C., AND LI, Z. 2001. Towards a highly-scalable and effective metasearch engine. In Proceedings of the Tenth World Wide Web Conference (Hong Kong, May 2001), 386–395.

XU, J. AND CALLAN, J. 1998. Effective retrieval with distributed collections. In Proceedings of the ACM SIGIR Conference (Melbourne, Australia, 1998), 112–120.

XU, J. AND CROFT, B. 1996. Query expansion using local and global document analysis. In Proceedings of the ACM SIGIR Conference (Zurich, Switzerland, Aug. 1996), 4–11.

XU, J. AND CROFT, B. 1999. Cluster-based language models for distributed retrieval. In Proceedings of the ACM SIGIR Conference (Berkeley, CA, Aug. 1999), 254–261.

YU, C., LIU, K., WU, W., MENG, W., AND RISHE, N. 1999a. Finding the most similar documents across multiple text databases. In Proceedings of the IEEE Conference on Advances in Digital Libraries (Baltimore, MD, May 1999), 150–162.

YU, C. AND MENG, W. 1998. Principles of Database Query Processing for Advanced Applications. Morgan Kaufmann Publishers, San Francisco, CA.


YU, C., MENG, W., LIU, K., WU, W., AND RISHE, N. 1999b. Efficient and effective metasearch for a large number of text databases. In Proceedings of the Eighth ACM International Conference on Information and Knowledge Management (Kansas City, MO, Nov. 1999), 217–224.

YU, C., MENG, W., WU, W., AND LIU, K. 2001. Efficient and effective metasearch for text databases incorporating linkages among documents. In Proceedings of the ACM SIGMOD Conference (Santa Barbara, CA, May 2001), 187–198.

YUWONO, B. AND LEE, D. 1996. Search and ranking algorithms for locating resources on the World Wide Web. In Proceedings of the IEEE International Conference on Data Engineering (New Orleans, LA, Feb. 1996), 164–177.

YUWONO, B. AND LEE, D. 1997. Server ranking for distributed text resource systems on the Internet. In Proceedings of the Fifth International Conference on Database Systems for Advanced Applications (Melbourne, Australia, April 1997), 391–400.

Received March 1999; revised October 2000; accepted May 2001
