
Decision Support Systems 42 (2006) 1697–1714
www.elsevier.com/locate/dss

Supporting non-English Web searching: An experiment on the Spanish business and the Arabic medical intelligence portals

Wingyan Chung a,⁎, Alfonso Bonillas b,1, Guanpi Lai b,1, Wei Xi b,1, Hsinchun Chen b,1

a Department of Information and Decision Sciences, College of Business Administration, The University of Texas at El Paso, 500 W. University Avenue, El Paso, TX 79968, USA

b Artificial Intelligence Lab, Department of Management Information Systems, The University of Arizona, 1130 East Helen Street, McClelland Hall 430, Tucson, AZ 85721, USA

Received 3 March 2005; received in revised form 19 February 2006; accepted 22 February 2006
Available online 27 June 2006

Abstract

Although non-English-speaking online populations are growing rapidly, support for searching non-English Web content is much weaker than for English content. Prior research has implicitly assumed English to be the primary language used on the Web, but this is not the case for many non-English-speaking regions. This research proposes a language-independent approach that uses meta-searching, statistical language processing, summarization, categorization, and visualization techniques to build high-quality domain-specific collections and to support searching and browsing of non-English information. Based on this approach, we developed SBizPort and AMedPort for the Spanish business and Arabic medical domains respectively. Experimental results showed that the portals achieved significantly better search accuracy, information quality, and overall satisfaction than benchmark search engines. Subjects strongly favored the portals' search and browse functionality and user interface. This research thus contributes to developing and validating a useful approach to non-English Web searching and providing an example of supporting decision-making in non-English Web domains.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Internet; Web; Searching; Browsing; Business intelligence; Medical intelligence; Spanish; Arabic; Non-English Web searching; Web portal; Mutual information; Summarization; Categorization; Visualization; Kohonen self-organizing map

1. Introduction

The Internet has gained popularity worldwide and is estimated to continue to grow as access to Web content in different languages increases. A report published in September 2004 shows that the majority (64.8%) of the world's online population consists of non-English speakers [13]. Moreover, that population was estimated to grow significantly in the near future to 820 million while the size of the English-speaking online population was predicted to remain at 300 million [12]. For instance, there are more than 3.5 million Internet users in the Arab world [1], where Arabic Web content is estimated to double every year [28]. The Spanish-speaking online population has exceeded 9 million and Latin America is estimated to have the fastest growing population in the world in the coming decades [2].

⁎ Corresponding author. Tel.: +1 915 747 5496; fax: +1 915 747 5126.
E-mail address: [email protected] (W. Chung).
1 Tel.: +1 520 621 2748; fax: +1 520 621 2433.

0167-9236/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.dss.2006.02.015

These statistics suggest a growing need for better support for Web searching in some non-English languages that individuals and organizations use on a daily basis. Non-English-speaking Internet users use their native language to search for useful information, and such searching typically happens across different regions where the language is used, as is the case for members of multinational organizations (MNOs) that have operations in multiple regions using the same language. These MNOs increasingly rely on the Internet when seeking information. An example might be searching for opportunities to expand a business in Latin America. A medical institution in an Arab region may need to discover efficient ways to collect, analyze, and disseminate massive information about different regions in the Middle East and North Africa.

Despite growing needs for non-English Web searching, most existing technologies have been developed for English-speaking users and fail to address the needs of non-English Web searching. Current search engines in Spanish and Arabic, for example, lack search and analysis capabilities. In particular, these search engines lack high-quality collections to support searching across different regions. Better approaches to overcoming these problems would provide system developers with insights to enhance non-English Web searching. To address the needs, we propose in this paper a language-independent approach to building intelligent portals in non-English languages. Our goal was to develop and validate the approach to non-English Web searching. Based on the approach, we developed two Web search portals for the Spanish business and Arabic medical domains. We empirically studied the way these portals support decision-making and the related issues using native Spanish and Arab subjects.

The rest of the paper is structured as follows. Section 2 surveys previous research in non-English Web searching and search support in different languages. Section 3 presents a language-independent approach to supporting non-English Web searching and the two Web portals developed using the approach. Section 4 describes the methodology for evaluating the portals. Section 5 reports and discusses the findings. Section 6 concludes the paper and discusses future directions.

2. Literature review

Since the inception of the Internet, English has been the dominant language for communication on the Web. Prior research about Web searching has assumed implicitly that technologies are developed for English-speaking users. However, as more non-English-speaking users have adopted Internet technologies, other languages have gained popularity. It therefore is useful to review previous research in information seeking on the Web in a multilingual world. In particular, we also review developments of Web search technologies for the Spanish-speaking and Arabic-speaking regions.

2.1. Information seeking on the Web

Researchers who have studied information seeking on the Web have described the process of information seeking as consisting of various stages of problem identification, problem definition, problem resolution, and solution presentation [39]. Variations of this process model can be found in the literature [18,22,35].

Two major information-seeking activities are searching and browsing. Prior research has considered searching to include behaviors ranging from goal-directed information searching, where the user has a specific target in mind, to more serendipitous or exploratory information browsing when no specific goal is present besides the intention to explore the information repository [35]. In directed searching, the user first decomposes his goal into smaller problems, then expresses his needs as concepts and higher level semantics, formulates queries using such supports as Boolean query languages and syntax-directed editors, and finally evaluates the results by serial search or systematic sampling. In exploratory browsing, the user first transforms his general information need into a problem. He then (1) articulates that need as search terms or hyperlinks that appear on the system interface; (2) searches using the terms or explores the hyperlinks using such browse supports as automatic summarization, clustering and visualization tools, and Web directories; and (3) finally evaluates the results by scanning through them.

2.1.1. Support for Web searching and browsing

To support Web searching and browsing, various types of information technologies have been proposed. Meta-searching has been found to be a promising method [4] to alleviate biases of search results from different search engines [26] by sending queries to multiple search engines and collating the set of top-ranked results from each engine. In addition, post-retrieval analysis provides added value to results returned by search engines. Previews and overviews of retrieved Web pages are important elements in post-retrieval analysis. A preview is extracted from, and acts as a surrogate for, a single object of interest [14]. Document summarization techniques provide previews of individual Web pages in the form of indicative summaries [10], query-biased summaries [36], or generic summaries [25]. An overview is constructed from and represents a collection of objects of interest [14]. Document categorization techniques such as the self-organizing map algorithm [17] have been used to categorize and search Web pages [5]. Document visualization techniques also have been used to amplify human cognition in browsing Internet search results [20,23]. Despite the potential advantages of meta-searching and information previews and overviews, they rarely have been applied to non-English search engines. One such application is the CBizPort that helps users to search, browse, categorize, and summarize Chinese business information [7] but does not contain a domain-specific collection and lacks useful functionality such as visualization.

2.1.2. Information quality

Information quality, a multifaceted concept, is considered to be an important aspect of evaluating the quality of a Web site [21] and is one that has been explored by Wang and Strong [38], who evaluated information quality using a set of 16 dimensions that were tested in [33]. These dimensions were for the most part used in evaluating the quality of information of organizations or companies, not the quality of information obtained from search engines. Although Marsico and Levialdi [24] have developed a Web site evaluation methodology that considers a site's information quality, their methodology was designed for evaluating general Web sites (e.g., travel information Web sites) and does not consider the special requirements of non-English Web searching.

There have been studies on cross-regional use of Chinese search engines (e.g., [7]), but because Chinese is mainly used in three geographically close regions (mainland China, Hong Kong, and Taiwan), its regional characteristics are less apparent than those of Spanish and Arabic, which are used across continents and widely separated regions. Unfortunately, no attempts have been made to study the cross-regional impacts of Spanish and Arabic search engines, evaluation of which could improve understanding of optimal design of search engines and portals.

2.2. Search engines for Spanish-speaking and Arabic-speaking regions

As more non-English-speaking people use the Internet to search and browse information, major search engines have attempted to expand their services for non-English speakers. Regional search engines that provide more localized searching have begun to emerge. In addition to English, these search engines typically accept queries in a user's native language and return pages from the regions being served. A survey of major search engines in Spanish and Arabic, two widely used languages that are gaining popularity on the Web, follows.

2.2.1. Spanish search engines

Major search engines have been developed for Spanish, the second most popular language in the United States and the primary language for Spain and some 22 Latin American countries. Terra (http://www.terra.com/) offers its services to more than 3.1 million Internet users in Europe and the Americas. A Gallup poll in 2002 reported Terra to be the most popular search engine in Spain; Wanadoo (http://www.wanadoo.com/), a subsidiary of France Telecom, was rated second [11]. Currently, Terra serves more than 3 million Internet users in Spain, Latin America, the United States, and many European countries. Supporting Web searching in English and French as well as Spanish, Wanadoo is currently the leading Internet service provider in France and the United Kingdom with 9.3 million customers in June 2004.

Spanish search engines serving Latin America include Yahoo Español, Ahijuna, Auyantepui, Quepasa, Bacan, and Conexcol. Yahoo Español (Spain, http://espanol.yahoo.com/), the Spanish version of Yahoo, provides a human-compiled Web directory developed by about 150 editors who categorized over one million listed sites. YahooES also supplements its results with those from Inktomi and Google. Inktomi matches appear after all YahooES matches have been shown. Established in 1995, BIWE (Buscador en Internet para la web en Español, http://www.biwe.com/) is one of the earliest search engines for searching Spanish information on the Web. BIWE supports searching of news, products, images, and other information and provides a variety of services including a Web directory, email, entertainment, and market information for Hispanics. Headquartered in the United States, Quepasa (http://www.quepasa.com/) was launched in 1997 and is a bilingual Web portal (Spanish and English) serving Hispanic populations in the United States and Latin America. It uses proprietary Web search technologies to reduce the number of irrelevant results by utilizing terms most frequently used and documents most frequently viewed [32]. Quepasa also offers other services such as news, email, online radio, chat, online translation, forums, and Web hosting.

The following Spanish search engines primarily serve their own or adjacent regions. Launched in 1998, Ahijuna (Argentina, http://www.ahijuna.com.ar/) provides searching services of Argentina Web sites and other Spanish Web sites. It contains a Web directory with 14 categories having a total of 7578 hyperlinks. Based in Venezuela, Auyantepui (http://www.auyantepui.com/) provides a searchable Web directory of Spanish sites. It grew from 14 categories listing 117 Web sites in 1996 to 550 categories with over 18,000 Web sites in 2002. Launched in 1998, Conexcol (Colombia, http://www.conexcol.com/) provides a searchable Web directory containing 14 categories having 400 subcategories and 13,214 Web sites' URLs. With more than 150,000 unique visitors per month, it is one of the top four most visited sites in Colombia. Bacan (Ecuador, http://www.bacan.com/), a major search engine in Ecuador, began its operations in 1996. It provides services such as news, email, online chat, entertainment, and shopping guides.

Table 1
Comparing major Spanish search engines

Column groups: Spain; Latin America
Search engines covered: Terra (Spain); Wanadoo (France); Auyantepui (Venezuela); Ascinsa (Peru); Conexcol (Colombia)

Content (Web pages and news on)
IT: ✓ ✓ ✓ ✓ ✓
Business: ✓ ✓ ✓ ✓ ✓
Government: ✓ ✓ ✓ ✓
Financial: ✓ ✓ ✓ ✓ ✓
Medical: ✓ ✓ ✓ ✓ ✓
Other Latin American countries: ✓ ✓
General: ✓ ✓ ✓ ✓ ✓
Size of collection (in the order listed): Very good; Very good; Fair; Good; Good

Functionality
Links to related resources: ✓ ✓ ✓ ✓ ✓
Membership services: ✓ ✓ ✓
Newsgroup search: ✓ ✓ ✓ ✓ ✓
Web directory: ✓ ✓ ✓
Search for Web sites: ✓ ✓ ✓ ✓ ✓
Search stock prices: ✓ ✓
Filtering for adult content: none
Online translation tool: none
Search for news: ✓ ✓ ✓ ✓ ✓
Multimedia search (image, music, software, etc.): ✓ ✓ ✓ ✓ ✓
User interface (in the order listed): Very good; Very good; Fair; Fair; Fair

Every month Bacan has 80,000 individual visitors and generates over 2 million hits. Ascinsa Internet (http://www.ascinsa.com/) is widely used in Peru and contains Web sites from Latin American countries and the United States. It provides services such as Internet access, email, Web page design, domain registration, and Web hosting, among others. It also contains a directory listed by countries and then by domains.

Table 1 summarizes the content and functionality of major Spanish search engines. Although different types of information are provided, these search engines typically present results as a long textual list and lack post-retrieval analysis capabilities. Moreover, except for some large portals such as Yahoo Español, BIWE, and Terra, most Spanish search engines serve a few regions rather than an entire Spanish-speaking community.

Table 1 (continued)

Search engines covered: Bacan (Ecuador); Quepasa (Mexico and U.S.); YahooES (Spain); Ahijuna (Argentina); BIWE (Spain)

Content (Web pages and news on)
IT: ✓ ✓ ✓ ✓ ✓
Business: ✓ ✓ ✓ ✓ ✓
Government: ✓ ✓ ✓ ✓
Financial: ✓ ✓ ✓ ✓ ✓
Medical: ✓ ✓ ✓ ✓ ✓
Other Latin American countries: ✓ ✓
General: ✓ ✓ ✓ ✓ ✓
Size of collection (in the order listed): Fair; Very good; Very good; Fair; Very good

Functionality
Links to related resources: ✓ ✓ ✓ ✓ ✓
Membership services: ✓ ✓ ✓ ✓ ✓
Newsgroup search: ✓ ✓ ✓ ✓ ✓
Web directory: ✓ ✓ ✓ ✓
Search for Web sites: ✓ ✓ ✓ ✓ ✓
Search stock prices: ✓
Filtering for adult content: ✓
Online translation tool: ✓
Search for news: ✓ ✓ ✓ ✓ ✓
Multimedia search (image, music, software, etc.): ✓ ✓ ✓ ✓ ✓
User interface (in the order listed): Good; Very good; Very good; Fair; Very good

Table 2
Comparing major Arabic search engines

Search engines covered: Ajeeb; Albawaba; Albahhar; Ayna

Content
Business: ✓ ✓ ✓ ✓
Government: ✓ ✓ ✓ ✓
Financial: ✓ ✓ ✓ ✓
Medical: ✓ ✓ ✓ ✓
General: ✓ ✓ ✓ ✓
Size of collection (in the order listed): Very good; Good; Very good; Very good

Functionality
Encoding conversion (utf8-CP1256): ✓
Links to related resources: ✓ ✓ ✓ ✓
Membership services: ✓ ✓ ✓
Web directory (in the order listed): Very good; Very good; Fair; Good
Search for Web sites: ✓ ✓ ✓ ✓
Search by time period: ✓
Search for news: ✓ ✓ ✓ ✓
Languages of the search database (in the order listed): English/Arabic; English/Arabic; English/Arabic; English/Arabic
Cross-regional search support: ✓ ✓ ✓ ✓
User interface (in the order listed): Very good; Very good; Good; Good
System reliability (in the order listed): Fair; Fair; Poor; Very good

2.2.2. Arabic search engines

Arabic is spoken by more than 284 million people in about 22 countries. Although Arabic is the fifth most frequently spoken language in the world, the Arabic Web is still in its infancy, constituting less than 1% of the total Web content and having a low 2.2% penetration rate [1]. The cross-regional use of Arabic and the exponential growth of the Arabic Web [28] nevertheless have highlighted the necessity of providing better Web searching and browsing.

Four major search engines offer the Arab World comprehensive services and extensive content coverage. Ajeeb (http://www.ajeeb.com/) is a bilingual Web portal (English/Arabic) launched in 2000 by Sakhr Software Company. Its database contains over one million searchable Arabic Web pages, which can be translated to English using the online version of Sakhr's machine-translation software. In addition, Ajeeb has a multilingual dictionary and is known for its large Web directory, “Dalil Ajeeb,” which the company claims is the world's largest online Arabic directory. Ajeeb has launched Johaina, an automatic tool that gathers news from many Middle Eastern and worldwide news agencies. Using Sakhr's “IDRISI” search engine, Johaina gathers mainly Middle East related news and categorizes it into primary and secondary topic categories. Albawaba.com (http://www.albawaba.com/) is a consumer portal offering comprehensive services including news, sports, entertainment, e-mail, and online chatting. The portal supports searching for both Arabic and English pages and the results are classified according to language and relevancy. Albawaba also provides meta-searching of other search engines (Google, Yahoo, Excite, Alltheweb, Dogpile) and a comprehensive directory of all Arab countries. Launched in 2000, UAE-based Albahhar (http://www.albahhar.com/) searches its 1.25 million Arabic Web pages and provides Arabic speakers a wide range of other online services such as news, online chatting, and entertainment. Based in New Hampshire, Ayna (http://www.ayna.com/) is a Web portal providing an Arabic Web directory, an Arabic search engine, and other services such as a bilingual (English/Arabic) email system, chat, greeting cards, personal homepage hosting, and personal commercial classifieds. In July 2001, Ayna had over 700,000 registered users and provided access to more than 25 million pages per month. Due to Ayna's popularity, Alexa Research ranks it among the top three leading Web sites in the Arab World.

Table 2 compares the content and functionality of the Arabic search engines mentioned above. Despite their rich content and comprehensive services, Arabic search engines lack post-retrieval capability and their contents tend to be general, offering limited resources to serve domain-specific needs. They also fall short of supporting advanced search and browse functions. For example, none of them supports categorization or visualization of search results.

2.3. Summary

Because existing search engines in Spanish and Arabic typically lack analysis capabilities, they limit users' ability to understand retrieved results. The collections searched by these search engines are often region-specific, so they do not provide a comprehensive understanding of the environment where they are operating. Major English search engines such as Google provide searching of non-English resources but fall short of covering domain- and region-specific information. There is a need for better approaches to overcoming these problems and to providing high-quality information to multinational organization users. We therefore propose a language-independent approach to address the following questions:

1. How can a language-independent approach support Web searching in non-English languages that are widely used across different regions?

2. How well do portals developed by the approach perform in comparison with existing search engines, in terms of accuracy, precision, recall, and user satisfaction?

3. Compared with existing search engines, what is the information quality of the portals developed by the approach?

Table 3
Topical coverage details of the two portals

Topic: Scenario
SBizPort: The user searches for Spanish Web pages about electronic commerce.
AMedPort: The user searches for Arabic medical information about excretion.

Topic: Search page
SBizPort: The query “electronic commerce” is used (Fig. 1(a)).
AMedPort: The query “excretion” is used (Fig. 2(a)).

Topic: Result page
SBizPort: Approximately 40 results from 4 meta-searchers (top 10 from each) are displayed (Fig. 1(b)).
AMedPort: Approximately 30 results from 4 meta-searchers (top 10 or fewer from each) are displayed (Fig. 2(b)). Examples of the results include “middle ear infection” and “skin symptoms of diabetes.”

Topic: Categorizer
SBizPort: The categorizer groups retrieved Web pages into 20 folders, among which are folders labeled “customs agent,” “electronic commerce,” and “foreign commerce” (Fig. 1(c)).
AMedPort: Examples of categories include “children's education” and “sports.” The user selects the seventh category titled “special education,” within which he browses 2 Web pages about “questions and answers” (Fig. 2(c)).

Topic: Summarizer
SBizPort: The summarizer provides a 3-sentence summary of the circled result that contains information on an e-commerce event held in 2000 in Caracas, Venezuela (Fig. 1(d)).
AMedPort: The 3-sentence summary is listed on the left while the original page about petrochemical information is displayed on the right (Fig. 2(d)).

Topic: Visualizer
SBizPort: The SOM visualizer categorizes about 40 Web pages onto 2 regions labeled “foreign commerce” and “international commerce” and displays hyperlinks on the right (Fig. 1(e)).
AMedPort: The SOM visualizer categorizes about 30 Web pages onto 6 regions and displays hyperlinks on the right (Fig. 2(e)). Examples of the regions include “special education” and “sports.”

3. A language-independent approach

In this section, we describe a language-independent approach to supporting non-English Web searching. The approach uses meta-searching, statistical language processing, summarization, categorization, and visualization techniques to build high-quality domain-specific collections and to support searching and browsing of non-English information on the Web. Specifically, we used the approach to build two Web portals providing domain-specific collections for non-English Web searching and post-retrieval analysis for the Spanish business and Arabic medical domains. Because the implementation of the approach requires no (or minimal) customization to the portals' languages, it allows system developers to easily adapt the development to new languages and domains.

3.1. The SBizPort and AMedPort

The chosen domains of the two portals, Spanish Business Intelligence Portal (SBizPort) and Arabic Medical Intelligence Portal (AMedPort), represent important segments of the Web of interest to individual users and multinational organizations. Given the growing Spanish-speaking populations in the United States, Spain, and Latin America, businesses actively expand their opportunities by seeking information on the Web. Meanwhile, the growing Arabic online population and medical professionals seek a comprehensive, one-stop Web portal through which to communicate medical information among different Arab regions. The SBizPort and AMedPort were developed to address these growing needs. In addition to providing relevant information, the portals support intelligence gathering and analysis, where intelligence is defined as the product of acquisition, interpretation, collation, assessment, and exploitation of information in the respective domains [6]. The intelligence is presented in the forms of collated search results obtained from various high-quality information sources, Web page summaries, categorized search results, and visual maps showing clusters of Web pages. Table 3 provides topical coverage details of the two portals, screen shots of which are shown in Figs. 1 and 2.

3.2. Steps in the approach

Our approach consists of five major steps, which are described in the context of building SBizPort and AMedPort.

Fig. 1. Screen shots of SBizPort.
(a) Search page: the user types in “comercio electronico” to search for BI information about electronic commerce in Spanish-speaking regions; the page provides Search, Categorize, and Visualize buttons.
(b) Result page: retrieved pages' titles and abstracts are listed; the user can click to summarize a result in 3 or 5 sentences.
(c) Categorizer: retrieved pages are categorized into folders labeled by key phrases.
(d) Summarizer: the summary is listed on the left while the original page is on the right.
(e) Visualizer: the SOM visualizer categorizes about 40 Web pages onto 2 regions and displays hyperlinks on the right.


3.2.1. Collection building and searching

Figs. 1(a), 1(b), 2(a), and 2(b) show the search and result pages of the two portals. On the search page, a user can input keywords and choose whether to search, organize, or visualize the results. The user can input multiple keywords separated by line breaks and can choose among a number of carefully selected information sources from the Spanish or Arab regions by checking the boxes. The result page lists search results according to the information sources selected by the user.

Fig. 2. Screen shots of AMedPort.
(a) Search page: the user types in “excretion” to search for medical information about the topic in Arab regions; the page provides a virtual Arabic keyboard and buttons to search for, categorize, and visualize results.
(b) Result page: retrieved pages' titles and abstracts are listed; the user can click to summarize a result in 3 or 5 sentences.
(c) Categorizer: retrieved pages are categorized into folders labeled by key phrases.
(d) Summarizer: the summary is listed on the left while the original page is on the right.
(e) Visualizer: the SOM visualizer categorizes about 30 Web pages onto 6 regions and displays hyperlinks on the right.

To provide high-quality information, we manually analyzed the existing information sources in the two domains. For the Spanish business domain, key business categories such as e-commerce, international business, and competitive intelligence were searched to obtain seed URLs (translated into English) that were used for domain spidering/collecting of Web pages. More than 183 seed URLs were obtained. A Web crawler then followed these URLs to collect pages automatically. The pages were then automatically indexed and stored in our database. In addition to domain spidering, we performed meta-spidering of six major search engines (Yahoo ES, Ahijuna, Conexcol, Ambdirecto, Auyantepui, and Teoma) using queries translated from English queries that previously had been used to build an English business intelligence search portal [23]. We chose these search engines because of their rich Spanish business content. The Spanish business collection obtained from this method contained more than 476,084 Web pages covering more than 22 countries.
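As a rough illustration of domain spidering from seed URLs, the following Python sketch performs a breadth-first crawl with a depth limit. It is not the portals' crawler: the link graph `LINKS`, the function name `spider`, and the `max_depth` parameter are hypothetical, and an in-memory dictionary stands in for fetching real pages.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for the Web:
# each URL maps to the URLs that its page links to.
LINKS = {
    "seed-a": ["page-1", "page-2"],
    "seed-b": ["page-2", "page-3"],
    "page-1": ["page-4"],
    "page-2": [],
    "page-3": ["seed-a"],
    "page-4": ["page-5"],
    "page-5": [],
}

def spider(seeds, links, max_depth=2):
    """Breadth-first crawl from the seed URLs; returns pages in visit order."""
    seeds = list(dict.fromkeys(seeds))      # drop duplicate seeds, keep order
    visited = set(seeds)
    queue = deque((url, 0) for url in seeds)
    collected = []
    while queue:
        url, depth = queue.popleft()
        collected.append(url)               # "collect" the page for indexing
        if depth == max_depth:
            continue                        # do not follow links any deeper
        for target in links.get(url, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return collected

pages = spider(["seed-a", "seed-b"], LINKS, max_depth=2)
```

With the sample graph, `page-5` lies three links away from the nearest seed and is therefore left out when `max_depth=2`.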

Similarly, our Arabic medical collection was built by using 105 seed URLs collected from seven major search engines (Google, Yahoo, AlltheWeb, Ajeeb, ArabVista, AltaVista, and DMOZ) and by meta-spidering these URLs using keywords from an Arabic medical glossary [16]. The results were then filtered depending on their number and quality. The resulting Arabic medical collection contained more than 220,000 Web pages covering more than 22 countries.

Apart from searching its own database, the SBizPort supports meta-searching two domain-specific databases (SBizPort collection and AMBDirecto) and six Spanish general search engines (Yahoo Español, Terra, Ahijuna, Auyantepui, Bacan, and Ascinsa). The AMedPort supports meta-searching three domain-specific databases (AMedPort, Sehha.com, and ArabMedmag.com) and three Arabic general search engines (Ba7th.com, ArabVista.com, and Ayna). These meta-search engines were chosen because of their rich content and domain-specific coverage. A virtual keyboard provided for AMedPort facilitates input (see Fig. 2(a)).
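The meta-searching step can be pictured as sending one query to every selected source and keeping the top-ranked results from each. The Python sketch below is a minimal illustration under stated assumptions: `meta_search`, `engine_a`, `engine_b`, and the `example` result strings are hypothetical stand-ins, not the interfaces of the search engines named above.

```python
def meta_search(query, engines, top_k=10):
    """Send one query to every selected engine and keep each engine's top_k hits."""
    collated = {}
    for name, search in engines.items():
        ranked = search(query)           # each engine returns its own ranked list
        collated[name] = ranked[:top_k]  # results stay grouped by source
    return collated

# Hypothetical stand-ins for real search back ends.
def engine_a(query):
    return [f"a.example/{query}/{i}" for i in range(1, 26)]

def engine_b(query):
    return [f"b.example/{query}/{i}" for i in range(1, 4)]

results = meta_search("comercio", {"EngineA": engine_a, "EngineB": engine_b})
total = sum(len(hits) for hits in results.values())
```

Grouping the hits by source mirrors a result page that lists results according to the information sources the user selected; an engine that returns fewer than `top_k` hits simply contributes what it has.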

3.2.2. Summarizer

The SBizPort and AMedPort summarizers were modified from an English summarizer that uses sentence-selection heuristics to rank text segments [25]. These heuristics strive to reduce redundancy of information in a query-based summary [3]. The summarization takes place in three main steps: (1) sentence evaluation, (2) segmentation or topic identification, and (3) segment ranking and extraction. First, a Web page to be summarized is fetched from the remote server and parsed to extract its full text. All sentences are extracted by identifying punctuation serving as periods. Important information such as presence of cue phrases (e.g., “therefore,” “in summary” in the respective languages), sentence lengths and positions are also extracted for ranking the sentences. Second, we use the Text-Tiling algorithm [15] to analyze the Web page and determine topic boundaries. A Jaccard similarity function is used to compare the similarity of different blocks of sentences. Third, we rank document segments identified in the previous step according to the ranking scores obtained in the first step and key sentences are extracted as summary. The summarizer can summarize Web pages flexibly, using three or five sentences. Users can invoke it by clicking the number of sentences for summarization under each result. A new window (shown in Figs. 1(d) and 2(d)) then displays the summary and the original Web page.
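To make the block-comparison idea in the second step concrete, the Python sketch below computes a Jaccard similarity between the blocks of sentences on either side of each sentence gap and treats the gap with the lowest similarity as the most likely topic boundary. It is a simplified illustration only, not the Text-Tiling algorithm as published nor the portals' summarizer; the function names, the `block_size` parameter, whitespace tokenization, and the sample sentences are all assumptions.

```python
def jaccard(block_a, block_b):
    """Jaccard similarity of the word sets of two blocks of sentences."""
    words_a = {w.lower() for s in block_a for w in s.split()}
    words_b = {w.lower() for s in block_b for w in s.split()}
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def gap_similarities(sentences, block_size=2):
    """Similarity between the blocks before and after each sentence gap."""
    scores = []
    for gap in range(1, len(sentences)):
        before = sentences[max(0, gap - block_size):gap]
        after = sentences[gap:gap + block_size]
        scores.append(jaccard(before, after))
    return scores

# Hypothetical sample text with an obvious topic shift after the second sentence.
sentences = [
    "the portal searches business pages",
    "the portal ranks business pages",
    "users read medical summaries",
    "users browse medical summaries",
]
scores = gap_similarities(sentences)
# Index of the sentence at which a new topic most likely starts.
boundary = scores.index(min(scores)) + 1
```

In this example the two blocks around the middle gap share no words, so the boundary falls at the third sentence. Segment ranking with cue phrases, sentence length, and position would follow as a separate step and is not shown.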

3.2.3. Categorizer

The SBizPort and AMedPort categorizers organize the Web pages (related to the query shown on top) into 20 (or fewer) folders labeled by the key phrases appearing most frequently in the page summaries or titles (see Figs. 1(c) and 2(c)). Each categorizer relies on a phrase lexicon in the relevant language to extract phrases from Web page summaries obtained from meta-searching or searching our collections. To create the lexicons, we collected a large number of Web pages in the two domains. From each collection of pages, we extracted meaningful phrases by using the mutual information approach, a statistical method that identifies significant patterns as meaningful phrases from a large amount of text in any language [30]. The approach is an iterative process of identifying significant lexical patterns by examining the frequencies of word co-occurrences in a large amount of text.

The mutual information (MI) algorithm is used in the approach to compute how frequently a pattern appears in the corpus, relative to its sub-patterns. Based on the algorithm, the MI of a pattern c ($MI_c$) can be found by

$$MI_c = \frac{f_c}{f_{left} + f_{right} - f_c}$$

where f stands for the frequency of a set of words. Intuitively, $MI_c$ represents the probability of co-occurrence of pattern c, relative to its left sub-pattern and right sub-pattern. Phrases with high MI are likely to be extracted and used in automatic indexing. For example, if the Spanish phrase “gerencia del conocimiento” (knowledge management) appears in the corpus 100 times, the left sub-pattern (gerencia del) appears 110 times and the right sub-pattern (del conocimiento) appears 105 times, then the mutual information (MI) for the pattern “gerencia del conocimiento” is 100 / (110 + 105 − 100) = 0.87. In addition, we employed an updateable PAT-tree data structure developed in [30] that supports online frequency update after removing extracted patterns to facilitate subsequent extraction. Repetitive removal of sub-patterns therefore is not necessary. In addition, we used a stop word list and manual filtering to refine the results obtained.
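The MI formula and the worked example above can be checked with a short sketch. The function names and the plain n-gram counter are illustrative assumptions; the portals use an updateable PAT-tree [30] rather than the simple counter shown here.

```python
from collections import Counter


def mutual_information(f_pattern, f_left, f_right):
    """MI_c = f_c / (f_left + f_right - f_c)."""
    denominator = f_left + f_right - f_pattern
    if denominator <= 0:
        raise ValueError("sub-pattern frequencies must be at least the pattern frequency")
    return f_pattern / denominator


def ngram_counts(tokens, max_n=3):
    """Count every word n-gram of length 1..max_n (a stand-in for the PAT-tree)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def pattern_mi(counts, pattern):
    """MI of a pattern; left = all but the last word, right = all but the first."""
    pattern = tuple(pattern)
    return mutual_information(counts[pattern], counts[pattern[:-1]], counts[pattern[1:]])


# Worked example from the text: 100 / (110 + 105 - 100)
print(round(mutual_information(100, 110, 105), 2))  # 0.87
```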

Using the approach, we extracted 19,417 phrases from the SBizPort collection and 68,079 phrases from the AMedPort collection. The categorizer then uses these phrases to categorize the Web pages non-exclusively (see Figs. 1(c) and 2(c)).
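A minimal sketch of non-exclusive categorization is given below. The matching rule (case-insensitive substring search), the function name `categorize`, and the sample data are assumptions for illustration; the paper does not specify how lexicon phrases are matched against summaries and titles.

```python
from collections import Counter


def categorize(pages, lexicon, max_folders=20):
    """Group pages into folders labeled by the most frequent lexicon phrases.

    pages:   dict mapping a page id to its title plus summary text
    lexicon: iterable of key phrases
    A page may appear in several folders (non-exclusive categorization).
    """
    page_phrases = {}
    doc_freq = Counter()
    for page_id, text in pages.items():
        lowered = text.lower()
        found = {p for p in lexicon if p.lower() in lowered}  # substring match (assumption)
        page_phrases[page_id] = found
        doc_freq.update(found)
    top_phrases = [p for p, _ in doc_freq.most_common(max_folders)]
    return {
        phrase: sorted(pid for pid, found in page_phrases.items() if phrase in found)
        for phrase in top_phrases
    }


# Hypothetical pages and lexicon
pages = {
    "p1": "Gerencia del conocimiento en bancos",
    "p2": "Bancos y comercio exterior",
    "p3": "Comercio exterior y bancos",
}
lexicon = ["gerencia del conocimiento", "bancos", "comercio exterior"]
folders = categorize(pages, lexicon, max_folders=2)
```

Here page `p2` lands in both the “bancos” and the “comercio exterior” folders, which is what non-exclusive categorization means.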

3.2.4. Visualizer

The resulting portals also support visualization of retrieved Web pages, using a Kohonen self-organizing map (SOM) algorithm [17] to categorize and place Web pages onto a two-dimensional jigsaw map [23] (see Figs. 1(e) and 2(e)). SOM is a neural network algorithm that has been used in image processing and pattern recognition applications. When applied to automatic categorization and visualization of Web pages, SOM assigns similar pages to adjacent regions, with each region labeled by the most frequently occurring phrases extracted by the mutual information approach described above. The larger a region on the map, the more Web pages are assigned to it. Users can click on a region to see a list of pages on the right and can open pages by clicking the link-embedded titles.
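The core of a Kohonen SOM can be sketched in a few lines. This is a generic textbook version, not the customized algorithm used in the portals: the grid size, the linear decay schedules, the Gaussian neighborhood, and all names are assumptions, and pages are assumed to be already represented as numeric phrase vectors.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Return the grid node whose weight vector is closest to `vector`."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], vector)),
    )


def train_som(vectors, rows, cols, epochs=50, learning_rate=0.5, seed=0):
    """Train a rows x cols map; returns a dict {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    start_radius = max(rows, cols) / 2
    for epoch in range(epochs):
        decay = 1 - epoch / epochs               # linear decay (assumption)
        rate = learning_rate * decay
        radius = max(start_radius * decay, 0.5)
        for vector in vectors:
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                grid_dist_sq = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
                for i in range(dim):
                    # Pull the winner and its grid neighbors toward the input
                    w[i] += rate * influence * (vector[i] - w[i])
    return weights


def map_pages(weights, page_vectors):
    """Assign each page to the map node (region) that best matches it."""
    regions = {}
    for page_id, vector in page_vectors.items():
        regions.setdefault(best_matching_unit(weights, vector), []).append(page_id)
    return regions
```

Because neighboring nodes are updated together, pages with similar phrase vectors tend to end up in the same or adjacent regions; counting the pages per region gives the region sizes shown on the map.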

3.2.5. Web directory

In addition, each of our portals provides a Web directory of the resources in its specific domain. Organized in a hierarchical manner, the directory was built from a combination of human identification and meta-searching. The Spanish business directory contains 295 categories and the Arabic medical directory contains 232 categories. Both have a depth of 5 levels.

3.3. Enhancements of the approach

We believe that the proposed approach offers benefits and new enhancements in five aspects: (1) New integration of existing techniques: Although some of the techniques used in the approach have been studied in prior work, we have not found a comprehensive approach that addresses the problem of information quality on the Web and the need for Web searching in languages used in widely separated geographic regions (e.g., Spanish and Arabic). By integrating human analysis with existing techniques for text processing, our approach was developed to alleviate information overload in searching and browsing Web content in non-English languages. For example, we have customized the Kohonen self-organizing map algorithm to the Spanish business and Arabic medical domains to support dynamic visualization of Web pages. This integration of the visualization technique has been enhanced from our previous work [6,23] by considering languages (Spanish and Arabic) used widely in a multitude of geographic regions and by applying the technique to non-English domains. To our knowledge, there has been no previous attempt to integrate the technique into an application similar to the portals described here. (2) Collection building: Previous work on building Web collections typically focuses on English content due to the more abundant resources available. To deal with the challenge of supporting non-English Web searching, our proposed approach was used to build non-English Web collections encompassing wide arrays of geographic regions and content providers. For example, the SBizPort collection was built from spidering more than 183 Spanish business Web sites located in such regions as Argentina, Bolivia, Central America, Chile, Colombia, Ecuador, Spain, Mexico, Paraguay, Peru, Uruguay, and Venezuela. The AMedPort collection covered Web resources obtained from such regions as Saudi Arabia, Bahrain, Lebanon, Tunisia, Kuwait, Egypt, United Arab Emirates, Switzerland, United Kingdom, USA, Russia, and Canada. While existing search engines in those regions mainly provide regional services, the SBizPort and AMedPort collections respectively serve the entire communities that use Spanish and Arabic in Web searching. The collections also represent new advances over the English business collection built in [23] and the lack of its own Web collection in [6]. (3) Language processing: To extract meaningful phrases as input for the categorizer and visualizer, we used the mutual information technique that considered the co-occurrence of terms in a large corpus (see Section 3.2.3).
Because the approach used the probabilities of the terms appearing in the corpus rather than their linguistic patterns as the criterion for extraction, the technique was statistics-based and hence different from linguistic techniques used in previous research (e.g., [23]). Compared with our previous work [6] (in which the system only served three closely located geographic regions: China, Taiwan, and Hong Kong), we have enhanced the performance of this technique by using a large number of Web pages from different regions as our corpus and by testing the technique in the two chosen languages. (4) User interface customization: The user interfaces of SBizPort and AMedPort were specially designed to reflect industry features and to address language-specific needs. For example, AMedPort provides a virtual keyboard to assist in the input of the right-to-left Arabic language. The images in the SBizPort user interface are related to major industries in Latin America. (5) Application domains: This research has extended to domains such as Arabic medicine and Spanish business that are less explored in prior work. As the online populations in these two languages will grow significantly (see Section 1), this work helps system developers to easily customize their development to the particular language they consider. We believe that our approach can help multinational organizations to search effectively for non-English information on the Web.

Table 4
A summary of the experimental setup

| System | Scenario | Task | Task type |
|---|---|---|---|
| 1 | First | 1 and 2 | Search |
| | | 3 | Browse |
| 2 | Second | 4 and 5 | Search |
| | | 6 | Browse |

The systems and scenarios were randomly assigned to subjects.

4. Evaluation methodology

In this section, we describe our methodology for evaluating the usability of the Web portals developed by our approach. Our evaluation objectives are: (1) to study how the Web portals developed by our approach can assist searching and browsing of specialized domains on the Web; (2) to compare our portals with existing search engines in order to understand the effectiveness and efficiency of our portals; and (3) to evaluate the information quality and user satisfaction achieved by using our portals.

To achieve objective (1), we invited human subjects to use our portals to search and browse the Spanish business or Arabic medical domains, two specialized domains that do not have as much coverage on the Web as their English counterparts. To achieve objective (2), we selected BIWE and Ayna as benchmarks against which to compare SBizPort and AMedPort because of their comprehensive coverage and functionality. BIWE (http://www.biwe.com/) is a major Spanish search engine providing information for the Spanish-speaking community. It also has a detailed Web directory for users to browse topics in which they are interested. Compared with other Spanish search engines, BIWE's services are more comprehensive and are targeted more closely at Hispanics. As one of the most visited Arab Internet hubs, Ayna (http://www.ayna.com/) serves Arabic-speaking people of the Middle East and North Africa. Unlike many Arabic search engines, Ayna is more stable and reliable, and thus serves as a good benchmark to support a fair comparison with AMedPort. To achieve objective (3), we asked subjects to provide subjective ratings and comments on information quality and user satisfaction.

4.1. Experimental design

We designed scenario-based search and browse tasks consistent with Text Retrieval Conference standards [37] to evaluate the performance of our Web portals. For example, a scenario for testing SBizPort was “America Online (AOL) in Latin America,” where a search task was “When was AOL Latin America launched in the United States?” and a browse task was “Find the URLs of financial portals where you can find stock quotes on America Online.” In a scenario for testing AMedPort, “Prevention and treatment of cancer,” a search task was “Give the name of one vitamin that helps to prevent cancer,” and a browse task was “Find articles about healthy diet and cancer prevention.” To further validate the relevance of tasks, before conducting the actual experiment we did a pilot test with three subjects for each portal.

We recruited 19 Spanish students and 11 Arab students as volunteer subjects to evaluate the performance of SBizPort and AMedPort. In each one-hour experiment, we introduced two systems (our portal and the benchmark system) to a subject and randomly assigned different scenarios to evaluate the systems. Each scenario contained two search tasks and one browse task. To test the impact of the domain-specific collection, we asked the subjects not to use the collection in the first task when using our portal but to use it in the second task. In the third task, we asked the subjects to use the SOM visualizer when using our portal and to use the available browse tools (e.g., hyperlinks, Web directory) when using the benchmark search engine (see Table 4). Although we did not impose any time limit on completing the tasks, we found that each subject spent an average of three minutes to finish a search task and eight minutes to finish a browse task. The order in which the systems were used was randomly assigned to avoid bias due to sequence of use.

After using a system, a subject filled in a post-session questionnaire with his or her ratings and comments on the system. The experimenter recorded all verbal comments and behavioral observations, which were later analyzed using protocol analysis [9]. Upon finishing the study, each subject also filled in a post-study questionnaire to rate each system in terms of information quality and overall satisfaction and to provide additional feedback.



The questionnaire was developed based on the user satisfaction measures used in [8,19]. We asked the subjects to rate their satisfaction with each system on a seven-point Likert scale.

To measure information quality, we modified the 16-dimension construct developed in [38] by dropping the “security” dimension, which is not relevant because the information provided by the systems is already public. To accommodate the different levels of importance in the remaining 15 dimensions, we invited two experts to provide ratings on the relative importance of different dimensions in the two domains (see Table 5). The Spanish business expert is a senior executive of a management consulting company in Mexico. Being a native Spanish speaker, he had 24 years of experience in business development, raising capital, negotiations, finance, and strategic planning. He also had worked as the Vice President of Business Development for the Gallup Organization in Mexico. The Arabic medical expert is an Arab microbiology Ph.D. student at a major research university in the United States. These experts provided answers that we used to judge subjects' performances in the tasks.

The subjects also provided demographic information, which was kept confidential in accordance with the Institutional Review Board Guidebook [31].

Table 5
Definitions of 15 dimensions of information quality and expert ratings

| Dimension | Definition | Expert rating^a: Spanish | Expert rating^a: Arab |
|---|---|---|---|
| Presentation quality and clarity | | | |
| Accessibility | The extent to which information is available, or easily and quickly retrievable | 3 | 3 |
| Concise representation | The extent to which information is compactly represented | 3 | 3 |
| Consistent representation | The extent to which information is presented in the same format | 3 | 3 |
| Ease of manipulation | The extent to which information is easy to manipulate and apply to different tasks | 3 | 2 |
| Interpretability | The extent to which information is in appropriate languages, symbols, and units, and the definitions are clear | 2 | 3 |
| Coverage and reliability | | | |
| Appropriate amount of information | The extent to which the volume of information is appropriate for the task at hand | 2 | 3 |
| Believability | The extent to which information is regarded as true and credible | 2 | 2 |
| Completeness | The extent to which information is not missing and is of sufficient breadth and depth for the task at hand | 3 | 3 |
| Free-of-error | The extent to which information is correct and reliable | 2 | 3 |
| Objectivity | The extent to which information is unbiased, unprejudiced, and impartial | 2 | 3 |
| Usability and analysis quality | | | |
| Relevancy | The extent to which information is applicable and helpful for the task at hand | 3 | 3 |
| Reputation | The extent to which information is highly regarded in terms of its source or content | 3 | 3 |
| Timeliness | The extent to which information is sufficiently up-to-date for the task at hand | 3 | 3 |
| Understandability | The extent to which information is easily comprehended | 3 | 2 |
| Value-added | The extent to which information is beneficial and provides advantages from its use | 3 | 3 |

a Expert rating: 3 = extremely important, 2 = very important, 1 = important.

4.2. Hypothesis testing

Because the Web portals developed by our approach encompassed Web resources from different Spanish or Arab regions, we believed that they would provide richer content and higher usability than those of benchmark systems. Users could thus find relevant results more quickly from our portals. With respect to the two domains, we tested the following five sets of hypotheses, none of which had been explored in previous research.

H1. Using a domain-specific collection in SBizPort/AMedPort enables users to achieve higher effectiveness and efficiency than performing search tasks without its support.

H2. SBizPort/AMedPort enables users to achieve higher effectiveness and efficiency than relying on benchmark search engines for searching.

H3. The use of SOM visualizer in SBizPort/AMedPort enables users to achieve higher effectiveness and efficiency than using benchmark search engines to perform browse tasks.

H4. SBizPort/AMedPort users achieve a higher overall satisfaction than users of a benchmark search engine.


H5. SBizPort/AMedPort provides higher information quality than a benchmark search engine.

To test H1, we compared the performances of using (task 2) and not using (task 1) our domain-specific collections. To test H2, we compared the search performances of our portal and the benchmark search engine. To test H3, we compared the browse performances of using our portal's SOM visualizer and the benchmark search engine's browse support tools. Because previous research [7] conducted a focused evaluation of the use of the summarizer and categorizer to support Web searching and browsing, we did not repeat the evaluation of these tools here. To test H4 and H5, we compared subjects' ratings on the aforementioned aspects. As each subject was asked to perform similar tasks using the two systems, we used a one-factor repeated-measures design, which gives greater precision than designs that employ only between-subjects factors [27].
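The gain in precision from a repeated-measures design can be illustrated with a paired comparison, in which each subject serves as his or her own control. The sketch below is an assumption, not the analysis reported here: the paper does not name the test statistic it used, and the ratings are hypothetical. With only two conditions, a one-factor repeated-measures analysis is equivalent to a paired t-test on the within-subject differences.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two matched samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length with at least two subjects")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical satisfaction ratings from four subjects (1 = best, 7 = worst)
benchmark = [2, 3, 4, 5]
portal = [1, 1, 2, 2]
t_stat, df = paired_t(benchmark, portal)
```

Between-subject variation (some subjects rate everything harshly, others leniently) cancels out in the differences, which is why the within-subject comparison is more precise than comparing two independent groups.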

4.3. Performance measure

We recorded the time the subject spent on each task to measure the efficiency of using a system. We also measured the effectiveness of using a system by the following formulae:

$$\text{Accuracy} = \frac{\text{Number of correctly answered parts}}{\text{Total number of parts}}$$

$$\text{Precision} = \frac{\text{Number of relevant URLs identified by the subject}}{\text{Number of all URLs identified by the subject}}$$

$$\text{Recall} = \frac{\text{Number of relevant URLs identified by the subject}}{\text{Number of relevant URLs identified by the expert}}$$

$$F\ \text{value} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

Accuracy reflects how well a system finds correct answers for search tasks. To measure the browse task performance, we used precision, recall, and F value. Precision reflected how well the portal helped users find relevant results and avoid irrelevant results. Recall reflected how well the portal helped users find all the relevant results that had been identified by experts. F value was used to balance recall and precision simultaneously [34], reflecting the performances achieved by the expert and by subjects.
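The four measures translate directly into code. In the sketch below, a URL is treated as relevant when it appears in the expert's list; that reading, the function names, and the sample URLs are assumptions for illustration.

```python
def accuracy(correct_parts, total_parts):
    """Share of answer parts that were correct in a search task."""
    return correct_parts / total_parts if total_parts else 0.0


def precision(subject_urls, expert_urls):
    """Relevant URLs found by the subject / all URLs found by the subject."""
    subject, expert = set(subject_urls), set(expert_urls)
    return len(subject & expert) / len(subject) if subject else 0.0


def recall(subject_urls, expert_urls):
    """Relevant URLs found by the subject / relevant URLs found by the expert."""
    subject, expert = set(subject_urls), set(expert_urls)
    return len(subject & expert) / len(expert) if expert else 0.0


def f_value(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * r * p / (r + p) if (r + p) else 0.0


# Hypothetical browse task: the subject lists 4 URLs, 2 of them on the expert's list of 6
subject = ["u1", "u2", "u3", "u4"]
expert = ["u1", "u2", "u5", "u6", "u7", "u8"]
p = precision(subject, expert)   # 2 / 4
r = recall(subject, expert)      # 2 / 6
f = f_value(p, r)
```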

5. Experimental results and discussions

In this section, we report and discuss the results of our user evaluation study. Table 6 summarizes the means and standard deviations of various performance measures. Table 7 shows the p-values and results of testing various hypotheses. Table 8 summarizes subjects' demographic profiles.

5.1. SBizPort performance

5.1.1. Search performance

Using SBizPort's domain-specific collection achieved higher mean accuracy and lower mean efficiency than not using it. However, the differences were not significant. The figures show that employing our domain-specific collection resulted in performance comparable to that achieved by using all the meta-search engines in combination, suggesting the comprehensive nature of our collection. Because the differences were not significant, H1 was not confirmed; we believe that the SBizPort collection should be further enhanced to provide more comprehensive results in a shorter time.

Comparing our portal with the benchmark search engine, we found that the mean accuracy of SBizPort was significantly higher than that of BIWE, while there was no significant difference between the efficiencies achieved by the two systems. We believe that SBizPort's ability to provide comprehensive, high-quality information from many sources helped users get accurate results. However, the efficiency of SBizPort was not significantly better than that of BIWE. Because SBizPort is a research prototype, it lacks the professional operations of BIWE. Therefore, H2 was partially confirmed.

5.1.2. Browse performance

We found that SBizPort achieved a higher mean precision, recall, and F value than BIWE. However, only the difference in F value was significant at a 5% alpha-error level; the difference in recall was significant only at a 6% alpha-error level. The results show that SBizPort's browse support tools and SOM visualizer could enable users to find more relevant results than BIWE. However, there is still room for improvement in terms of efficiency and precision. Therefore, H3 was partially confirmed.

5.1.3. User ratings and comments

Subjects rated SBizPort more favorably than BIWE in terms of information quality and overall satisfaction (see Table 6). The mean differences between the two systems' ratings ranged from 0.6 to 1.5 and were all significant at a 5% alpha-error level. Subjects were very satisfied with SBizPort. We believe that several aspects of SBizPort contributed to its good performance: the high-quality meta-searchers and domain-specific collection used in SBizPort, the useful browse support tools, and the comprehensive content coverage. H4 and H5 were confirmed.

Table 6
Means and standard deviations of different measures

| Measure | SBizPort Mean^a | SBizPort S.D. | BIWE Mean^a | BIWE S.D. | AMedPort Mean^a | AMedPort S.D. | Ayna Mean^a | Ayna S.D. |
|---|---|---|---|---|---|---|---|---|
| Task 1, search performance^b: Accuracy^c | 0.87 | 0.33 | 0.55 | 0.50 | 0.64 | 0.50 | 0.23 | 0.41 |
| Task 1, search performance^b: Efficiency^d | 131 | 43 | 149 | 48 | 141 | 45 | 146 | 37 |
| Task 2, search performance^b: Accuracy | 0.95 | 0.23 | 0.55 | 0.50 | 0.50 | 0.50 | 0.18 | 0.40 |
| Task 2, search performance^b: Efficiency^d | 134 | 59 | 151 | 37 | 141 | 45 | 174 | 19 |
| Task 3, browse performance^e: Precision | 0.87 | 0.29 | 0.86 | 0.34 | 0.43 | 0.37 | 0.27 | 0.41 |
| Task 3, browse performance^e: Recall | 0.21 | 0.14 | 0.13 | 0.085 | 0.26 | 0.21 | 0.12 | 0.18 |
| Task 3, browse performance^e: F value | 0.78 | 0.38 | 0.48 | 0.49 | 0.24 | 0.23 | 0.11 | 0.21 |
| Task 3, browse performance^e: Efficiency^d | 288 | 63 | 285 | 24 | 289 | 26 | 300 | 24 |
| Information quality (overall) | 2.1 | 0.66 | 2.9 | 1.07 | 2.6 | 1.1 | 4.7 | 1.0 |
| – Presentation quality and clarity | 2.3 | 0.78 | 2.9 | 1.3 | 2.4 | 1.0 | 4.5 | 1.2 |
| – Coverage and reliability | 2.2 | 0.63 | 3.0 | 1.1 | 2.9 | 1.3 | 5.0 | 0.87 |
| – Usability and analysis quality | 1.98 | 0.76 | 2.9 | 1.1 | 2.4 | 1.2 | 4.6 | 1.2 |
| Overall satisfaction | 1.8 | 0.76 | 3.1 | 1.7 | 2.2 | 1.3 | 4.9 | 1.8 |

a The range of rating is from 1 to 7, with 1 being the best.
b When using our portals, the subjects were asked not to use our domain-specific collection in task 1 but used it in task 2.
c In task 1, the “SBizPort” or “AMedPort” column refers to using domain-specific collection and the right column (“Benchmark”) refers to not using domain-specific collection.
d Efficiency was measured by the time (in seconds) used.
e In task 3, the subjects were asked to use the SOM visualizer when using our portals and could use all available browse tools when using the benchmark search engines.

The subjects provided many positive comments on SBizPort's search and browse capabilities. Twelve subjects agreed that SBizPort was very useful for searching Spanish business information. For instance, subject #s10 said that SBizPort “is very useful for searching,” and “(the information) is clear.” Subject #s1 said “For specific topics (SBizPort) gave out specific results, making the searches better than other search engines.” The subjects also liked the browse support tools provided by SBizPort. A majority of seventeen subjects commented positively on them. For example, subject #s6 said that SBizPort was “really nice to have different functions and have a catalog.” Subject #s18 said that the browse tools “made it easy to view retrieved data.” Regarding the search performance, fifteen subjects commented that SBizPort did a good job or had a greater variety than the benchmark search engine. For example, subject #s7 said: “(SBizPort) gives lots of pages related to what I look for from different countries.” Subject #s10 said “(SBizPort) looks with more information and (is) able to provide in detail.” However, five subjects complained about the low speed of the system, especially when retrieving information from many meta-searchers.

On the other hand, the subjects were unhappy with BIWE's lack of relevance and clarity in searching and browsing. For example, subject #s7 said that BIWE “gives irrelevant pages (of) other countries I'm not interested in.” Subject #s9 said that it was “time-consuming” to use BIWE. Moreover, most users did not like the presence of pop-up advertisements when using BIWE. Nevertheless, six subjects said that BIWE was useful for searching Spanish business information. Three subjects commented that the system was easy to use and fast.

Table 7
p-values of testing various hypotheses (alpha error = 0.05)

| Hypothesis | Measure | SBizPort vs. BIWE: Effectiveness | SBizPort vs. BIWE: Efficiency^a | AMedPort vs. Ayna: Effectiveness | AMedPort vs. Ayna: Efficiency^a | Result |
|---|---|---|---|---|---|---|
| H1 | Accuracy | 0.42 | 0.85 | 0.54 | 0.83 | Not confirmed |
| H2 | Accuracy | 0.002⁎ | 0.30 | 0.046⁎ | 0.011⁎ | Partially confirmed |
| H3 | Precision | 0.89 | 0.84 | 0.22 | 0.31 | Partially confirmed |
| | Recall | 0.06 | | 0.09 | | |
| | F value | 0.035⁎ | | 0.07 | | |
| H4 | Satisfaction | 0.005⁎ | | 0.000⁎ | | Confirmed |
| H5 | Information quality (overall) | 0.009⁎ | | 0.000⁎ | | Confirmed |

a Efficiency was measured by the time (in seconds) used.
⁎ p values ≤ 0.05.

Table 8
Subjects' demographic profile

| Demographic information | Spanish subjects (total: 19) | Arab subjects (total: 11) |
|---|---|---|
| Country of origin | Mexico (12), USA (3), Panama (1), Puerto Rico (1), Colombia (1), Peru (1) | Lebanon (7), Morocco (1), Iraq (1), Mauritania (1), Jordan (1) |
| Education | Undergraduate (13), bachelor earned (2), master earned (3), doctorate earned (1) | Undergraduate (3), associate degree (1), bachelor earned (2), master earned (5) |
| Age range | 18–25 (14), 26–30 (2), 31–35 (2), 41–50 (1) | 18–25 (6), 26–30 (3), 36–40 (1), 41–50 (1) |
| Gender | Female (10), male (9) | Female (3), male (8) |
| Hours of using computer per week | <5 (1), 5–10 (2), 10–15 (1), 15–20 (3), 20–25 (9), 30–35 (1), >40 (2) | 5–10 (1), 10–15 (3), 15–20 (1), 20–25 (2), 25–30 (1), 30–35 (1), >40 (2) |

5.2. AMedPort performance

5.2.1. Search performance

Using AMedPort's collection resulted in higher mean accuracy and efficiency than not using it. However, similarly to SBizPort, the differences were not significant. We believe that the AMedPort collection should be improved to provide more comprehensive results to users in a shorter time. H1 was not confirmed.

Comparing our portal with the benchmark search engine, we found that the mean accuracy and efficiency of AMedPort were significantly higher than those of Ayna. We believe that, like SBizPort, AMedPort provided comprehensive, high-quality information from many sources and helped users find correct results in a shorter time. H2 was confirmed.

5.2.2. Browse performance

Contrary to our expectation, AMedPort achieved performance comparable to that of Ayna, as shown by insignificant differences in precision, recall, and F value. Yet, at 7% and 10% alpha-error levels, AMedPort achieved better F value and recall, respectively. AMedPort therefore needs further fine-tuning to achieve better browse performance. H3 was not confirmed.

5.2.3. User ratings and comments

Similarly to SBizPort, AMedPort received significantly better ratings than the benchmark search engine in terms of information quality and overall satisfaction. The mean differences ranged from 2.1 to 2.8 and were all significant at a 5% alpha-error level. We believe that AMedPort's good performance can be attributed to its high-quality meta-searchers and domain-specific collection and its useful browse support tools. H4 and H5 were confirmed.

Subjects' verbal comments show greater satisfaction with AMedPort than with Ayna. Nine (out of eleven) subjects said that AMedPort was useful or provided more topics and information. For instance, subject #a7 said AMedPort was “helpful in cross-referencing information from specific to general.” Subject #a5 said AMedPort was “very useful because it does meta-searching.” Subject #a2 said that AMedPort was “very easy to use for Arabs.” In contrast, Ayna received many negative comments from subjects because of its lack of relevant results and confusing interface. For example, subject #a2 said that Ayna was “very clumsy, disorganized, (and) very brief.” Subject #a8 said she “couldn't easily access it” and subject #a9 said Ayna was “hard to use.”

5.3. Discussion

The encouraging results from our experiment demonstrate that the proposed approach is useful to support non-English Web searching and browsing. Although we applied the approach to building two portals in different domains and languages, the experimental results are surprisingly similar. We believe that this was because similar procedures were used to develop the portals and ensured high information quality, comprehensive content coverage, useful functionality, and a user-friendly interface. These important components help users who need to search for information from widely scattered regions in a language used by a multitude of countries and places. The results may also imply applicability of the proposed approach to building portals in other domains and languages. Given that the Internet will likely become more and more internationalized [29], the proposed approach is expected to benefit a wide range of domains and users.

Looking more closely into the findings, we observed that the performance differences between the two Arabic search engines are generally larger than those between the two Spanish search engines. This may be due to the relatively weaker Internet development in Arabic-speaking regions. However, as Arabic gains importance on the Internet, we expect the demand for better searching and browsing will grow significantly. Meanwhile, the performance of existing Spanish search engines is expected to lag behind the rapidly growing Hispanic and Latino populations. Our proposed approach may possibly fill some of the needs.

Compared with previous research (such as [7,23]), our experimental findings provide insights into non-English Web searching in languages that are used in widely separated regions. New empirical findings and developments are provided in this work. For example, this research is the first attempt to use and to empirically study the SOM visualizer in supporting non-English Web searching. The Web collections provided by SBizPort and AMedPort are also much larger and contain more diverse regional information than the one developed in [23]. Meta-searching and post-retrieval analysis are new applications in Spanish and Arabic. While major languages like English and Chinese will still be important on the Web, the notion of a “multilingual Web” is expected to draw attention from practitioners and researchers in the future. This research will likely shed light on some system development and decision support issues for non-English Web searching.

6. Conclusions and future directions

As non-English speakers increasingly use the Web to seek information, there is a need for better support of searching the Web across different regions. However, support for Internet searching in non-English speaking regions is much weaker than in English-speaking regions. This research proposes a language-independent approach to building Web search portals to support non-English Web searching. Based on the approach, we developed two portals, SBizPort and AMedPort, for the Spanish business and Arabic medical domains, respectively. Experimental results show that the two portals significantly outperformed the benchmark search engines in terms of search accuracy and user ratings on information quality and overall satisfaction. The two portals also achieved precision and recall comparable to those of benchmark search engines. Subjects much preferred our portals to the benchmark search engines in many types of usage. We therefore conclude that the proposed approach is useful in supporting non-English Web searching. This research thus contributes to developing and validating a useful approach to non-English Web searching and providing an example of supporting Web searching in different non-English domains.

This study was limited in several ways. Our two research prototype portals have speed and stability that are not as good as those of commercial search engines like the chosen benchmarks. Several subjects complained about the slow responses of our systems. We also have been limited by the scarcity of prior work on non-English Web searching, which has prevented a more comprehensive review of the topic that might have offered better criteria for designing our approach. As for the user study, we had difficulty in recruiting native speakers as our subjects. Future work should consider expanding the sample size to establish a higher statistical confidence in the experimental results.

We are pursuing several directions to extend our research. As the notion of a “multilingual Web” continues to draw attention, we are developing scalable techniques to collect and analyze information in different languages and to relate diverse content meaningfully in order to produce intelligence. For instance, multinational corporations (MNCs) typically provide Web site information in different languages. Analyzing MNCs' relationships with their multinational stakeholders could help provide a holistic picture of how they stand in the international arena. Other domains that we will explore include Spanish medical and Arabic business domains. The resulting business intelligence from stakeholders will serve to guide global development strategies. Another challenging area is the digital archiving of multilingual data from heterogeneous sources, often scattered in different regions. We will investigate techniques and methods to facilitate such a process and better support non-English Web searching. Furthermore, we will develop and validate new visualization techniques to support browsing and comprehending massive multilingual information on the Web.

Acknowledgments

This research was partly supported by funding from the National Science Foundation Knowledge Discovery and Dissemination (KDD) program #9983304, June 2003–March 2004 and October 2003–March 2004, and from the University Research Institute Grant Program of the University of Texas at El Paso. We are grateful to our project members and the experts and the student subjects who participated in the user study.

References

[1] R. Abbi, Internet in the Arab world, UNESCO Observatory onthe Information Society 3 (2002).

[2] P. Caramelli, The current and future rapid growth of older peoplein Latin America: implications in psychogeriatrics (keynote pre-sentation), Proceedings of the Eleventh International Congress,International Psychogeriatric Association, Chicago, IL, 2003.

[3] J. Carbonell, J. Goldstein, The use of MMR: diversity-basedreranking for reordering documents and producing summaries,


Proceedings of the 21st Annual International ACM-SIGIRConference on Research and Development in InformationRetrieval, ACMPress, Melbourne, Australia, 1998, pp. 335–336.

[4] H. Chen, H. Fan, M. Chau, D. Zeng, MetaSpider: meta-searchingand categorization on the web, Journal of the American Society forInformation Science and Technology 52 (13) (2001) 1134–1147.

[5] H. Chen, A. Houston, R. Sewell, B. Schatz, Internet browsingand searching: user evaluation of category map and conceptspace techniques, Journal of the American Society forInformation Science, Special Issue on AI Techniques forEmerging Information Systems Applications 49 (7) (1998)582–603.

[6] W. Chung, H. Chen, J.F. Nunamaker, A visual framework forknowledge discovery on the Web: an empirical study on businessintelligence exploration, Journal of Management InformationSystems 21 (4) (2005) 57–84.

[7] W. Chung, Y. Zhang, Z. Huang, G. Wang, T.-H. Ong, H. Chen,Internet searching and browsing in a multilingual world: anexperiment on the Chinese Business Intelligence Portal (CBiz-Port), Journal of the American Society for Information Scienceand Technology 55 (9) (2004) 818–831.

[8] F.D. Davis, Perceived usefulness, perceived ease of use, and useracceptance of information technology, MIS Quarterly 13 (3)(1989) 319–340.

[9] K.A. Ericsson, H.A. Simon, Protocol Analysis: Verbal Reports asData, MIT Press, Cambridge, MA, 1993.

[10] T. Firmin, M.J. Chrzanowski, An Evaluation of Automatic TextSummarization Systems, The MIT Press, Cambridge, 1999.

[11] Gallup, Encuesta Sobre Portales 2002, http://aui.es/estadi/gallup/gallup_portales_2002.htm, 2002.

[12] Global Reach, Evolution of non-English online populations,http://global-reach.biz/globstats/evol.html, 2004.

[13] Global Reach, Global internet statistics (by language),http://www.glreach.com/globstats/, 2004.

[14] S. Greene, G. Marchionini, C. Plaisant, B. Shneiderman,Previews and overviews in digital libraries: designing surrogatesto support visual information seeking, Journal of the AmericanSociety for Information Science 51 (4) (2000) 380–393.

[15] M.A. Hearst, Multi-paragraph segmentation of expository text,Proceedings of the 32nd Annual Meeting of the Association forComputational Linguistics, Morgan Kaufmann Publishers, LasCruces, New Mexico, 1994, pp. 9–16.

[16] Y.K. Hitti, Hitti's Medical Dictionary English–Arabic, Librairiedu Liban, Beirut, 1972.

[17] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin,1995.

[18] C. Kuhlthau, Longitudinal case studies of the information searchprocess of users in libraries, Library and Information ScienceResearch 10 (3) (1998) 257–304.

[19] J.R. Lewis, IBM computer usability satisfaction questionnaires:psychometric evaluation and instructions for use, InternationalJournal of Human–Computer Interaction 7 (1) (1995) 57–78.

[20] X. Lin, Map displays for information retrieval, Journal of theAmerican Society for Information Science 48 (1) (1997) 40–54.

[21] E. Loiacono, WebQual™: a web site quality instrument,Proceedings of International Conference on Information Systems(ICIS) Doctoral Consortium, (Charlotte, NC, USA), 2002.

[22] G. Marchionini, Information Seeking in Electronic Environ-ments, Cambridge University Press, New York, 1995.

[23] B. Marshall, D. McDonald, H. Chen, W. Chung, EBizPort:collecting and analyzing business intelligence information,Journal of the American Society for Information Science andTechnology 55 (10) (2004) 873–891.

[24] M.D. Marsico, S. Levialdi, Evaluating web sites: exploitinguser's expectations, International Journal of Human–ComputerStudies 60 (3) (2004) 381–416.

[25] D. McDonald, H. Chen, Using sentence selection heuristics torank text segments in TXTRACTOR, Proceedings of the SecondACM/IEEE–CS Joint Conference on Digital Libraries, ACM/IEEE–CS, Portland, OR, USA, 2002, pp. 28–35.

[26] A. Mowshowitz, A. Kawaguchi, Bias on the web, Communica-tions of the ACM 45 (9) (2002) 56–60.

[27] J. Myers, A. Well, Research Design and Statistical Analysis,Lawrence Erlbaum Associates, Publishers, Hillsdale, NJ, USA,1995.

[28] L. Norton, The Expanding Universe: Internet Adoption in theArab Region, World Markets Research Centre, 2001, p. 3.

[29] E.T. O'Neill, B.F. Lavoie, R. Bennett, Trends in the evolution ofthe public web 1998–2002, Digital Library Magazine 9 (4)(2003).

[30] T.-H. Ong, H. Chen, Updateable PAT-array approach for Chinesekey phrase extraction using mutual information: a linguisticfoundation for knowledge management, Proceedings of theSecond Asian Digital Library Conference, Taipei, Taiwan, 1999,pp. 63–84.

[31] R.L. Penslar, Institutional Review Board Guidebook, Office forHuman Research Protection, U.S. Department of Health andHuman Services, http://ohrp.osophs.dhhs.gov/irb/irb_guidebook.htm, 2001.

[32] J. Peterson, Quepasa Announces Agreement to Acquire VayalaCorporation Hispanic PR Wire–Business Wire, Phoenix, 2002.

[33] L.L. Pipino, Y.W. Lee, R.Y. Wang, Data quality assessment,Communications of the ACM 45 (4) (2002) 211–218.

[34] W.M.J. Shaw, R. Burgin, P. Howell, Performance standards andevaluations in information retrieval test collections: cluster-basedretrieval models, Information Processing and Management 33 (1)(1997) 1–14.

[35] A.G. Sutcliffe, M. Ennis, Towards a cognitive theory ofinformation retrieval, Interacting with Computers (SpecialEdition on HCI and Information Retrieval) 10 (1998) 321–351.

[36] A. Tombros, M. Sanderson, Advantages of query biasedsummaries in information retrieval, Proceedings of the 21stAnnual International ACM-SIGIR Conference on Research andDevelopment in Information Retrieval, (Melbourne, Australia),ACM Press, 1998, pp. 2–10.

[37] E. Voorhees, D. Harman, Overview of the sixth text retrievalconference (TREC-6), NIST Special Publication 500-240: TheSixth Text Retrieval Conference (TREC-6), National Institute ofStandards and Technology, Gaithersburg, MD, USA, 1997.

[38] R.Y. Wang, D.M. Strong, Beyond accuracy: what data qualitymeans to data consumers, Journal of Management InformationSystems 12 (4) (1996) 5–34.

[39] T.D. Wilson, Models of information behavior research, Journal ofDocumentation 55 (3) (1999) 249–270.



Wingyan Chung is Assistant Professor of CIS in the Department of Information and Decision Sciences at The University of Texas at El Paso. He received his Ph.D. in Management Information Systems from The University of Arizona, and M.S. in information and technology management and BBA in business administration from The Chinese University of Hong Kong. His research interests include knowledge management, Web analysis and mining, data and text mining, information visualization, and human-computer interaction. He has published in leading journals such as Communications of the ACM, Journal of Management Information Systems, IEEE Computer, International Journal of Human-Computer Studies, and Decision Support Systems. Contact him at [email protected].

Alfonso A. Bonillas received his B.S. in Systems Engineering at the University of Arizona. His main interests are Web development, systems optimization, programming, and database management. Contact him at [email protected].

Guanpi (Greg) Lai is a doctoral student in the Systems and Industrial Engineering (SIE) Department at the University of Arizona. He received his B.S. in Computer Science from Tsinghua University, China and M.S. in Industrial Engineering from the University of Arizona. His research interests include embedded systems' task scheduling, intelligent control (automobile, home automation), data mining, and data visualization. Contact him at [email protected].

Wei Xi received her master's degree in Management Information Systems from the University of Arizona in 2004 and her B.A. in English from Xi'an Foreign Languages University, China (1995). She joined the AI Lab in Spring 2003. Her areas of interest include Web programming and database management. Contact her at [email protected].

Hsinchun Chen is McClelland Professor of MIS at the Eller College of the University of Arizona and Andersen Consulting Professor of the Year (1999). He received the Ph.D. degree in Information Systems from New York University in 1989, MBA in Finance from SUNY-Buffalo in 1985, and BS in Management Science from the National Chiao-Tung University in Taiwan. He is author/editor of 10 books and more than 130 journal articles covering intelligence analysis, biomedical informatics, data/text/Web mining, digital library, knowledge management, and Web computing. Contact him at [email protected].
