Greek Web Search Engines: An Evaluative and Comparative Study
A study submitted in partial fulfilment of the requirements for the degree of Master of Science in Information Systems
at
THE UNIVERSITY OF SHEFFIELD
by
Panteleimon Lilis
September 2002
Abstract
The present study is a first attempt to evaluate the overall performance of
Greek Web search engines and to compare them with a class-leading search engine, Google.
To do so, a methodology was designed and developed specifically for the purposes of the
comparison. More specifically, the three Web search engines were evaluated in terms of
precision, relative recall, validity of links, response time, interface and on-line
documentation; these criteria were selected and developed for the needs of the present
study. The queries were likewise developed for this particular study: twenty queries
were used, so that the query subject matter would be as wide as possible. In addition,
some of the queries were in Greek, some in English and some combined Greek and English
keywords, in order to assess how well the selected search engines cope with a variety of
different linguistic characteristics. The first ten pages of returned results from each
search engine were evaluated against the criteria described above. Results are presented
in tables, and the comparison was based on averaging, that is, on finding the mean score
for each search engine on each criterion. The conclusion is that the performance of the
three selected search engines is rather poor. Google was, in general, found to have a
better overall performance than Anazitisis and Trinity, while the two Greek search
engines performed almost identically.
List of Tables
List of Charts
1. INTRODUCTION
   1.1. World Wide Web search engines
   1.2. Aims and objectives of the current study
2. THE SELECTED SEARCH ENGINES
   2.1. Search engine selection
   2.2. Features of the selected search engines
      2.2.1. Google
      2.2.2. Anazitisis
      2.2.3. Trinity
3. LITERATURE REVIEW
   3.1. Introduction
   3.2. Review of comparative-experimental studies
4. METHODOLOGY
   4.1. Introduction
   4.2. Setting up the evaluation criteria
   4.3. Text REtrieval Conference (TREC)
   4.4. The influence of TREC
   4.5. Development of the test environment
   4.6. Sample queries suite
      4.6.1. Number of queries
      4.6.2. Query subject matter
      4.6.3. Query formulation and search expression
      4.6.4. Further analysis over the sample queries
   4.7. Evaluation of returned pages
      4.7.1. Document cut-off
      4.7.2. Measures of evaluation specific to the current study
      4.7.3. Precision
      4.7.4. Recall
      4.7.5. Response time
      4.7.6. Validity of links
      4.7.7. Interface
      4.7.8. On-line documentation
   4.8. Possible drawbacks, inconsistencies and bias of the specific methodology
5. RESULTS
   5.1. Calculations
      5.1.1. Averaging
      5.1.2. Precision scores
      5.1.3. Recall scores
      5.1.4. Response time
      5.1.5. Validity of links
6. ANALYSIS AND INTERPRETATION OF THE RESULTS
   6.1. Evaluation of the overall performance of the tested search engines
      6.1.1. Precision ratio
      6.1.2. Recall ratio
      6.1.3. Response time
      6.1.4. Validity of links
      6.1.5. Interface
      6.1.6. On-line documentation
7. CONCLUSIONS
   7.1. Limitations of the current study
   7.2. Some future recommendations
BIBLIOGRAPHY
APPENDIX - SEARCH ENGINES INTERFACE
List of Tables
Table 1: Precision scores
Table 2: Recall scores
Table 3: Response time
Table 4: Invalid links

List of Charts
Chart 1: Mean precision performance
Chart 2: Mean recall
Chart 3: Mean response time
Chart 4: Validity of links
1. INTRODUCTION
1.1. World Wide Web search engines
According to Chu and Rosenthal (1996), the World Wide Web has gained so
much popularity that it is the second most popular Internet application after e-mail.
The Web is used for a variety of purposes by many people around the world.
However, it can be argued that the Web is used for two main purposes (Clarke,
1997). The first is the publishing of information. Indeed, the fact that information
on the Web can be accessed by many people at the same time has made the
Web the world's largest information medium.
The second use of the Web is information retrieval (Clarke, 1997). More
specifically, in many respects the Web can be described as a huge
information storage system. In reality, however, its unstructured
and ever-changing nature has made information searching and retrieval
a very difficult task (Declan, 2000). Web search engines were developed to overcome
this difficulty by assisting the ordinary Web user in searching for and retrieving the
required information.

Web search engines came into existence in 1994, and since then at least
twelve have been developed for use on the Web. Search engines have variously been
referred to as search tools, search services, indexes, Web databases and search engines.
In the present study the term that will be used most is search engines, since this is
also the case for the majority of the studies reviewed.
1.2. Aims and objectives of the current study
This dissertation aims to evaluate two of the most popular Greek
search engines (Anazitisis and Trinity) and to compare them with a class-leading
Web search engine, Google. The main reason for conducting such a study is
the fact that no similar study has ever been conducted in Greece, meaning that there
is no particular information about how each of the Greek search engines performs.
Moreover, articles dealing with Greek search engines are very limited, and
all of them are reviews and thus descriptive in nature. This is because
Greek Web search engines are recent in comparison to search engines such as
Google or Alta Vista, and thus the relevant literature is very immature.

However, recent developments in the Greek search engines (Anazitisis; a new
ranking algorithm in Trinity) have raised some concern in the Greek Web community
about the performance of these search services. Thus, another reason for comparing
the two Greek search engines with Google is that this will give a measure
of how developed Anazitisis and Trinity really are, as opposed to what they claim.
After all, as Chu and Rosenthal (1996) state, the sheer number of such services
invites further research.
In order to achieve this aim, a methodology had to be designed and
developed. This required exploring and examining the relevant
literature so as to identify the necessary criteria and the appropriate test
environment to be developed. It is important to note that the methodology is the
most important part of the present study: its completeness ensures the objectivity
of the results and minimizes the risk of introducing bias, both conscious and
unconscious, as well as inconsistencies.
Furthermore, the researcher decided to design and develop a methodology
specifically for this study, since the comparison required a number of different
criteria and search engine features to be examined and evaluated thoroughly.
For example, of the queries employed, some were in Greek, some in English
and some combined Greek and English search keywords. Another example is that
the on-line documentation of each search engine was employed as an evaluation
measure, for reasons discussed in more detail in the methodology section.
2. THE SELECTED SEARCH ENGINES
2.1. Search engine selection
The researcher of the current study decided to select only three search
engines to test and evaluate. It can be argued that this number is rather small
in relation to some of the reviewed studies. However, the constraint on the number
of search engines selected was considered necessary for the following reasons.
First, it would allow a greater number of queries to be used, so that the subject
matter of the queries could be as wide as possible. Second, it would allow a larger
number of evaluation criteria to be employed in assessing the overall performance
of the selected search engines. Many of the reviewed studies are limited to the
usual measures of precision, recall and interface; but since the present study
attempts to examine, evaluate and compare the selected search engines in terms of
their overall performance, a larger number of evaluation criteria was called for.
The idea was to select two of the most popular and well-respected Greek
search engines and compare them with one class-leading search engine,
Google. The first Greek search engine selected is Trinity. It is one of the
most respected Greek search engines and is used by the most popular
portal of the Greek Web, www.in.gr. The second Greek search engine selected
is Anazitisis, a product of one of the most popular ISPs in Greece,
OTEnet. The selection of Anazitisis was based on the fact that it is a very new
search engine which gained popularity in a very short time. Anazitisis boasts
advanced ranking algorithms, impressive special features and special software
designed particularly to increase noticeably its searching and retrieval performance
in the Greek language. The characteristics and features of Google, Anazitisis and
Trinity are considered in more detail in the following section.
2.2. Features of the selected search engines
2.2.1. Google

The Google Web search engine was founded by Sergey Brin and
Lawrence Page, two graduate students in computer science at Stanford University in
California. In less than a year, their search engine became the most
popular on the Web, yielding more precise results for most queries than conventional
search engines. Google's database is huge; according to many sites and
resources on the Web, it must be the biggest search engine database in the
world. Google claims that its database holds over two million pages, but it may be
counting pages which are not fully indexed.
One distinguishing characteristic of Google is its searching and retrieval
speed or, more formally, its very low response time. According to Google's homepage
this can be attributed partly to the efficiency of its search algorithm and partly to the
thousands of low-cost PCs that have been networked together (so as to form a
powerful computing grid) to create a very fast search engine. The other most
distinguishing characteristic is its ranking algorithm.
As far as its ranking algorithm is concerned, Google is unique among Web
search engines. More specifically, Google's ranking algorithm is based on how
many other pages link to each page, along with other factors such as the proximity of
the search keywords or phrases in the documents. It uses not only the number of
other pages that link to a page, but also the importance of those linking pages,
which is in turn evaluated from the links to each of them. This means that no one
is able to artificially influence the ranking of his or her page in Google, something
which is quite possible in some other search engines and directories. This innovative
approach takes its inspiration from the citation analyses used in scientific literature
(Declan, 2000) and is based on the principle of "bibliographical coupling" (Skandali,
1990).
Google embodies these principles in its ranking algorithm, "PageRank",
which has been the topic of many discussions, although so far there is no clear
evidence of how exactly it works. In general, the PageRank (PR) is calculated for
every webpage in Google's database. The calculation of the PR for a page is based
on the quantity and quality of the webpages that contain links to that page.
According to the co-founders of Google, Sergey Brin and Lawrence Page, the PR of
a webpage is calculated using this formula:

PR(A) = (1 - d) + d * SUM(PR(I) / C(I))

where:

PR(A) is the PageRank of page A;
d is the damping factor, usually set to 0.85;
PR(I) is the PageRank of a page I that contains a link to page A;
C(I) is the number of outbound links on page I;
PR(I) / C(I) is the PR value that page A receives from page I;
SUM(PR(I) / C(I)) is the sum of the PR values page A receives from all pages that link to it.

More explicitly, the PR of page A is determined by the PR of every page I that has a
link to page A. For every such page I, the PR of I is divided by
the number of links from page I. These values are summed and multiplied by the
damping factor 0.85; finally 0.15, i.e. 1 - d, is added to the result, and this number
represents the PR of page A (Declan, 2000).
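The iterative calculation described above can be sketched in a few lines of Python. This is only an illustration of the published formula; the four-page link graph is an invented example, and a real implementation would differ greatly in scale and detail.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * SUM(PR(I) / C(I)).

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(I)/C(I) over every page I that links to `page`.
            incoming = sum(pr[i] / len(links[i])
                           for i in pages if page in links[i])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Invented four-page Web: C receives the most (and best) links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

On this toy graph, page C ends up with the highest PR, while page D, to which no page links, settles at the minimum value of 1 - d = 0.15, exactly as the formula predicts.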
Google allows the user to search in either simple or advanced
mode. Each mode has a different entry screen and provides different functions and
search options. The simple interface is a single search box with two search buttons:
"Google Search" and "I'm Feeling Lucky". The latter automatically displays the
page deemed most relevant rather than displaying a list of results. The advanced
interface provides boxes for the following search options: "all the words", "exact
phrase", "any of the words" and "without the words"; pull-down menus to limit by
location on the page (anywhere, title or URL), language and domain; radio buttons to
filter results using "SafeSearch"; and search boxes that allow the user to search for
pages that are similar to, or link to, a given URL. Apart from these, Google also
supports major Romanised and non-Romanised languages and translation into English
from major European languages. However, Google does not support truncation and is
not case sensitive.
2.2.2. Anazitisis

Anazitisis is the most recent of the Greek search engines. In fact, Anazitisis
is part of the on-line products provided by OTEnet, one of the most popular and well
respected ISPs in Greece. Unfortunately, the researcher did not have much
information about Anazitisis, because its administrators were not interested
in contributing to the present research. Thus, much of the information given in the
present study about Anazitisis is based partly on information found on the Greek
Web and partly on the researcher's personal experience with the engine.
Anazitisis became fully operational a year ago, and in that time it has become
very popular among Greek users. Its popularity rests largely on the advanced search
features and capabilities it claims to support. More specifically, the engine employs
the SDK, a linguistic software tool developed by AltaVista especially for the Greek
language, which Anazitisis uses to increase its searching and retrieval capabilities
in Greek. Far more impressive, however, is "normalisation", a special feature of
Anazitisis designed and developed to cope with the various forms, or more precisely
the various grammatical "cases", in which a Greek word can appear within a sentence
(e.g. nominative case, accusative case etc.). With this particular characteristic,
Anazitisis claims to increase its precision and recall considerably when a query
containing Greek search keywords is submitted.
There is no particular information about how Anazitisis ranks Web pages,
and no information at all about its Web robot and its capabilities. However, it can
be argued that its ranking is based on a combination of two sets of criteria, the
first dynamic and the second static. The first set includes criteria such as the
presence and number of keywords in the title, in the first line of the text, in the
main body of the page or in the Meta tags of the HTML code. The second set
includes criteria such as the popularity of a specific Web page (measured by the
number of pages that link to it) and the proportion of text it contains (more text
is taken as an indication that the page is more valuable and thus more informative).
Anazitisis supports full Boolean searching by inserting the appropriate
operator before a word ("+" for the "AND" operator and "-" for the "NOT" operator)
and phrase searching (by using quotes "…"). Truncation is also supported, by
inserting the wildcard "*" at the end of the truncated word. Finally, it is possible
to search in specific fields of the Meta tags of the HTML code, such as the title
or the URL. The user can enter his or her queries either in the simple mode or in
the advanced mode, where further functions are supported, such as searching only
governmental sites.
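The operator conventions just described can be illustrated with a small helper that assembles such a query string. The function name and the example terms are hypothetical, and the sketch assumes the conventional reading of "-" as an exclusion operator; only the "+", quote and "*" conventions come from the description above.

```python
def build_query(required=(), excluded=(), phrase=None, truncated=()):
    """Assemble a query string using the operator conventions
    described above: "+" before required words, "-" before
    excluded words, quotes for an exact phrase, and a trailing
    "*" for truncated stems.  Purely illustrative."""
    parts = []
    parts += [f"+{w}" for w in required]          # words that must appear
    parts += [f"-{w}" for w in excluded]          # words to exclude
    if phrase:
        parts.append(f'"{phrase}"')               # exact phrase
    parts += [f"{stem}*" for stem in truncated]   # truncated stems
    return " ".join(parts)

q = build_query(required=["Athens"], excluded=["hotel"],
                phrase="Olympic Games", truncated=["museum"])
# q == '+Athens -hotel "Olympic Games" museum*'
```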
2.2.3. Trinity

Unfortunately, as in the case of Anazitisis, the amount of information
available for the second Greek search engine, Trinity, was very limited. To begin
with, Trinity was developed by Phaistos Networks SA, a Greek company active in
the area of Internet and Web applications. Trinity became fully operational in
1997. It is the basic search engine used by the most popular Greek portal,
www.in.gr, and this may be one of the reasons for Trinity's popularity.
Trinity also operates in two modes, simple and advanced. Each mode
has a different entry screen, because different functions are supported. In the
simple mode the user can only submit his or her query, while in the advanced mode
special operators are supported to help the user increase either precision or
recall. As far as search options are concerned, Trinity provides only limited
options in comparison to Anazitisis and Google. More specifically, Trinity supports
Boolean searching only partially (just two operators: AND, OR), along with phrase
searching.
As far as the ranking of documents is concerned, the information available
for Trinity is very limited and very general. However, it appears that Trinity's
ranking algorithm is based on analysing the popularity of a Web page, measured by
the number of other pages that link to it; it appears that the basic principle of
Google's PageRank is being adopted by more and more Web search engines. Other
criteria taken into consideration by Trinity's ranking algorithm are word proximity
and URL analysis. There is no particular information about the latter criterion, but
from the researcher's personal experience with Trinity a possible explanation is
that what is measured is the proximity of the URL of a specific Web page to the
keywords of the user's query. Trinity claims to employ a very fast Web robot,
Septera, but no further information is given about it. The number of pages indexed
in its database is also unknown.
3. LITERATURE REVIEW
3.1. Introduction
The literature review section of this study attempts to explore
and provide some critical evaluation of the relevant literature in the area of Web
search engine evaluation and comparison. It should be noted that it was
decided to include in this section only those studies that can be considered
experimental or comparative. It was felt better to conduct an in-depth review of
some of the most important previous studies rather than to examine everything that
has been written on Web search engine evaluation. Some other studies, such as
Scoville (1996) or Randall (1996), which were found to be reviews rather than
experimental or comparative studies, were not included in this section. This
selection was considered necessary because the researcher of the current study
wanted to explore and examine in depth how each researcher conducted his or her
study, which evaluation measures were employed, what methodology was applied,
and the reasoning behind each of these steps. The design and development of the
methodology employed in the present study is based on and influenced by some of
the studies reviewed in this section. However, it should be noted that conclusions
and findings derived from non-comparative studies (reviews) were also used
extensively within the present study, particularly in the design and development
of the methodology employed.
Unfortunately, no significant studies have been conducted in Greece on the
evaluation and comparison of Web search services. There are only a few, which are
descriptive in nature and thus considered reviews rather than experimental or
comparative studies. Some of these Greek studies (Papathanasiou and Kanarelis,
2001) were used by the researcher to acquire further information about the two
Greek search engines, Anazitisis and Trinity, since their administrators were not
interested in contributing to the present study.
3.2. Review of comparative-experimental studies.
To begin with, a very comprehensive methodology, practical and analytical
in its approach, for evaluating the performance of Web search engines is presented
by Chu and Rosenthal (1996). They examined and evaluated three Web search engines,
namely Alta Vista, Lycos and Excite, in an attempt to develop a feasible
methodology for evaluating all Web search engines.

In order to test and evaluate the performance of the selected search engines,
the authors of this study developed and used ten sample search queries. The
queries were selected and constructed so as to test the various features of each
search engine: some were phrases, while others required Boolean logic, truncation
or field-searching capabilities. Nine of the ten queries were extracted from real
reference questions.
The authors evaluated the performance of each search engine in
terms of precision of results, response time, output options, documentation and
interface, paying special attention to the criterion of precision. They downloaded
the first 10 documents for each query and assessed their precision, giving a score
of 1 to highly relevant documents, 0.5 to fairly relevant documents and 0 to
irrelevant documents. After assessing the precision score for each query, they
calculated the average precision score over all 10 results for each search engine.
The conclusion of their study was that, among the three selected search engines,
Alta Vista is the one with the highest precision in its returned results. As far as
the other two search engines are
concerned, they offer a plethora of features that users can take advantage of,
such as the concept search of Excite or the very good documentation and interface
of Lycos. The weak point of this study was that there was no attempt to evaluate
recall. Chu and Rosenthal (1996) rationalise their decision to deliberately omit
the evaluation criterion of recall by arguing that it is not possible to calculate
how many relevant items exist for a particular query in the huge and ever-changing
Web.
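The Chu and Rosenthal scoring scheme can be sketched as follows. The relevance judgements and per-query figures below are invented placeholders; only the 1 / 0.5 / 0 scale and the averaging over the first ten hits come from the study.

```python
# Relevance judgements for the first 10 hits of one query, on
# Chu and Rosenthal's scale: 1 = highly relevant, 0.5 = fairly
# relevant, 0 = irrelevant.  (Invented example data.)
judgements = [1, 0.5, 0, 1, 0.5, 0.5, 0, 0, 1, 0]

def precision_score(scores):
    """Mean relevance score over the evaluated hits."""
    return sum(scores) / len(scores)

query_precision = precision_score(judgements)  # 4.5 / 10 = 0.45

# The engine's overall figure is then the mean over all queries
# (per-query values here are invented).
per_query = {"q1": 0.45, "q2": 0.60, "q3": 0.30}
engine_mean = sum(per_query.values()) / len(per_query)
```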
Ding and Marchionini (1996) conducted a comparative study of the
performance of three popular Web search engines: InfoSeek, Lycos and Open Text.
To evaluate the selected search engines they used five queries: three randomly
selected from a question set for Dialog online searching exercises in an
information science class, and two formulated on the basis of their personal
interests. According to the authors of this study, all five queries were
open-ended. In order to get the best search, syntax specific to each search engine
was used for each query.
The selected search engines were evaluated for precision of the returned
results, duplication in the retrieved sets, invalid links and the degree of overlap
between search engines. This evaluation was performed by analysing the first
twenty hits that each search engine returned. Ding and Marchionini (1996) used
a six-point scale to rate the relevance and quality of the three search
services. More specifically, the measures they defined for the purposes of their
study were precision, salience and relevance concentration.
As far as the measure of precision is concerned, the authors distinguished
three types of precision, in order to record the statistically significant
differences in the precision variants between the search engines and to reveal
whether, and to what degree, a complex query can affect the precision performance
of each search engine. The measure of salience reports the sum of the ratings of
all twenty hits for one search engine divided by the sum of the ratings for all
three search engines. The last measure, relevance concentration, reports the ratio
of "good" items in the first ten hits to the number of "good" items in the first
twenty.
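The salience and relevance-concentration ratios just described can be sketched as below. The ratings are invented placeholder data, and the "good" cut-off on the six-point scale is an assumption; only the definitions of the two ratios come from the study.

```python
def salience(ratings_by_engine, engine):
    """Sum of one engine's hit ratings over the sum for all engines."""
    total = sum(sum(r) for r in ratings_by_engine.values())
    return sum(ratings_by_engine[engine]) / total

def relevance_concentration(ratings, threshold=3):
    """Ratio of 'good' items in the first 10 hits to 'good' items
    in the first 20.  Here 'good' means rating >= threshold, an
    assumed cut-off on the six-point scale."""
    good_10 = sum(1 for r in ratings[:10] if r >= threshold)
    good_20 = sum(1 for r in ratings[:20] if r >= threshold)
    return good_10 / good_20 if good_20 else 0.0

# Invented mini-example with two engines and tiny rating lists.
ratings_by_engine = {"X": [1, 2], "Y": [3, 4]}
sal_x = salience(ratings_by_engine, "X")  # 3 / 10 = 0.3

# Invented ratings (0-5 scale) for one engine's first 20 hits.
hits = [5, 4, 1, 3, 0, 2, 5, 1, 0, 3,
        1, 0, 4, 0, 2, 1, 0, 3, 0, 1]
rc = relevance_concentration(hits)  # 5 good in top 10, 7 in top 20
```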
Ding and Marchionini (1996) concluded that the performance of all three
Web search engines was very similar, but that in terms of mean precision and
salience Lycos and Open Text could be considered superior to InfoSeek. The
limitations of their study were that only five queries were used and that the
selected search services were not assessed for response time, accessibility or
recall.
Among the studies conducted from 1995 to 1997, Venditto's (1996) study is
considered to be the most quantitative. Venditto (1996) selected, examined and
evaluated seven search engines which at the time were considered very popular
for their overall performance. The selected Web search engines were: Alta Vista,
InfoSeek, Lycos, Open Text, WebCrawler and WWW Worm. In this study twelve
search terms were employed over a period of two weeks; the problem is that
Venditto (1996) did not report exactly how many queries were used.
The seven search engines were assessed for relevance over the first
twenty-five results returned for each query. Apart from that, each of the selected
search engines was tested on how capable it is of coping with complex query
statements. To do so, known sites were first identified for a given subject, a
search query was then formulated in natural language, and finally it was examined
how many of the predetermined sites each search engine managed to retrieve. This
is a very interesting approach, but it can be argued that it introduces
inconsistencies and bias, since it does not record crucial information about the
test environment. In addition, the currency of the search engines was examined by
employing a query that reflected news events important at that particular time.
Venditto (1996) concluded that all seven search engines performed well
when submitting simple queries. With complex queries, however, some of the search
engines performed poorly. According to Venditto (1996) this suggests that the
relevance ranking methods employed in certain search engines were not very
effective and that the relevance of each hit was partly based on the site's relative
popularity. As far as the relevance results are concerned, InfoSeek was found to be
the best, while Alta Vista produced the most comprehensive results. However,
Venditto (1996) did not report the exact statistics of his study.
Zorn et al. (1996) conducted a comparative study which aimed to examine
and evaluate the advanced search features of four Web search engines. They decided
to select Alta Vista, InfoSeek, Lycos and Open Text Index. These search engines
were selected for their popularity and for the advanced search features that each
claims to support.
Zorn et al. assessed these particular Web search engines for complex Boolean
logic, limiting retrieval by fields, proximity searching, phrase searching, duplicate
detection and truncation. In their study they devised and employed three sample
searches which exercised the advanced search features of each search engine.
While they provide a rather detailed discussion of the performance of each search
engine, the number of searches employed is too small to support a sound analysis
of the results and findings. Also, there appears to be no quantitative evaluation of
relevance.
In their conclusion they argue that no single Web search engine can be
considered the best, since each one has its own weaknesses and strengths.
However, they found Lycos and Alta Vista to have the best performance as far as
the number of URLs retrieved is concerned.
The study conducted by Tomaiuolo and Packer (1996) can be considered
one of the most quantitative. The authors selected five search engines (Magellan,
Alta Vista, Point, Lycos and InfoSeek) and assessed them by employing two hundred
(200) queries. The selection of the particular search engines was based on the
popularity of each search engine; Magellan and Point were selected in particular
because they reviewed and evaluated the Web pages that they indexed.
The subject matter of the queries was based on undergraduate topics. The
document cut-off was set at the first ten hits, which Tomaiuolo and Packer (1996)
evaluated for relevance. The total number of pages that each search engine
returned for each individual query was also recorded. The authors relied on the
microaverage method in order to produce a mean relevance ratio for each search
engine. As far as
their findings are concerned, they found that Alta Vista had the best relevance
performance followed by Lycos, InfoSeek, Point and Magellan. They also noted that
some of the tested search services (Point and Magellan) failed to retrieve at least ten
hits for some of the queries employed.
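The microaverage method referred to above can be sketched as follows. This is a minimal illustration with hypothetical counts: in contrast to a macroaverage, which averages the per-query precision ratios, the microaverage pools the counts across queries before dividing.

```python
# A minimal sketch of microaverage precision (hypothetical counts, not
# the study's data).

def microaverage_precision(per_query_counts):
    """Each entry is (relevant_in_cutoff, hits_actually_retrieved).
    Relevant and retrieved counts are summed over all queries first,
    then divided once."""
    relevant = sum(r for r, _ in per_query_counts)
    retrieved = sum(n for _, n in per_query_counts)
    return relevant / retrieved if retrieved else 0.0

# Three hypothetical queries; the third returned fewer than ten hits,
# as Point and Magellan sometimes did.
counts = [(7, 10), (4, 10), (3, 6)]
print(microaverage_precision(counts))  # 14 / 26 ≈ 0.538
```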
Lindop et al. (1997), writing for the U.K. edition of PC Magazine,
conducted a "lab test" involving the review and comparison of eleven search
engines. The test suite that was developed for evaluating the selected
search engines was undoubtedly subject to various types of bias. More specifically,
the testing methodology involved a team of testers who evaluated the selected
search engines by carrying out simple keyword searches or advanced searches, such
as Boolean or field searching.
Apart from the problem of an inadequate and biased test suite, no record was
kept of crucial information about how the searching was conducted; for example,
the number of searches carried out was never revealed. The inadequacy of the
methodology is also revealed by the fact that the hits that each search engine
returned were never formally assessed for relevance. Instead of using a set of
criteria for assessing the retrieval performance of the search engines, the testing
team kept only a record of the number of results retrieved for their queries and a
record of their impressions of the search refinement and online documentation.
The evaluation procedure of this "lab test" ended with a usability score
based on the testers' satisfaction in using each search engine. Additionally, each
search engine was awarded a score for every feature it supported from a list, such as
Boolean searching, proximity or field searching. The testing team concluded that
Alta Vista was the best search engine, especially in terms of usability and additional
features.
One of the most comprehensive and complete studies was carried out by
Leighton and Srivastava (1997). Their study employed one of the most carefully
designed methodologies for assessing the performance of the selected
Web search services. In their study Leighton and Srivastava (1997) selected five
Web search engines namely, Excite, Alta Vista, InfoSeek, HotBot and Lycos and
assessed them using fifteen (15) queries.
The development of the test suite that they employed in their study is
considered one of the most complete. More specifically, they used a combination of
structured and unstructured queries and tried to make the subject matter of the
queries as wide as possible. Relevance categories were also developed before
the evaluation of the pages took place, in order to avoid or minimize possible bias.
Moreover, they devised a method to "blind" the pages that each search engine
returned, so that the evaluator could not know which page came from which search
engine. To do so, a script written in Perl was employed, which automatically fetched
the results that each search engine had produced and hid the name of each search
engine. Thus, the evaluator assessed the results without knowing which search
engine had produced them. Their findings are contained in a very detailed report,
in which they also recorded every possible detail about the environment where the
test took place.
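The original blinding step was implemented in Perl; the same idea can be sketched in Python. This is a simplified illustration (the page fetching is omitted, and the engine names and URLs are hypothetical):

```python
import random

def blind_results(results_by_engine, seed=0):
    """Pool each engine's result URLs, strip the engine names, and shuffle,
    so the evaluator cannot tell which page came from which engine.
    A private key maps each blinded id back to its source engine so the
    judgments can be attributed after scoring."""
    pooled = [(engine, url)
              for engine, urls in results_by_engine.items()
              for url in urls]
    random.Random(seed).shuffle(pooled)   # fixed seed keeps the run repeatable
    blinded = {f"page-{i:03d}": url for i, (_, url) in enumerate(pooled)}
    key = {f"page-{i:03d}": engine for i, (engine, _) in enumerate(pooled)}
    return blinded, key

blinded, key = blind_results({
    "EngineA": ["http://example.org/1", "http://example.org/2"],
    "EngineB": ["http://example.org/3"],
})
# The evaluator sees only `blinded`; `key` is consulted after judging.
```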
Apart from that, they also conducted several experiments on the same data in
order to compare the selected search engines using a variety of definitions. In their
study, Friedman's randomized block design was used to perform multiple
comparisons for significance. Analysis showed that Alta Vista, Excite and
InfoSeek were the top three search services, with their relative rank changing
depending on how one interpreted the concept of "relevant". Correspondence
analysis showed that Lycos performed better on short, unstructured queries, while
HotBot performed better on structured queries.
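As a rough sketch of how such a Friedman test works (this is the generic textbook computation, not the authors' code; the precision scores below are hypothetical): each query acts as a block, the engines are ranked within each query, and the chi-square statistic is computed from the rank sums.

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic for k engines scored on n queries.
    `scores` is a list of n rows, one score per engine per row; ties
    within a row receive the average of the tied ranks."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])  # ascending
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1                      # extend over tied values
            avg = (i + j) / 2 + 1           # average of ranks i+1 .. j+1
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for e in range(k):
            rank_sums[e] += ranks[e]
    return (12 / (n * k * (k + 1)) * sum(R * R for R in rank_sums)
            - 3 * n * (k + 1))

# Four hypothetical queries, three engines (columns).
scores = [[0.8, 0.5, 0.3],
          [0.7, 0.6, 0.2],
          [0.9, 0.4, 0.4],
          [0.6, 0.5, 0.1]]
print(friedman_statistic(scores))  # 7.125
```

For large enough n the statistic is compared against a chi-square distribution with k − 1 degrees of freedom; for small samples, as here, exact Friedman tables are the safer reference.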
Another very comprehensive study is that of Clarke (1997), who employed an
experimental methodology designed to estimate the precision and recall of World
Wide Web search engines. The search engines selected for this purpose were
Alta Vista, Lycos and Excite. Clarke (1997) employed a TREC-inspired methodology
in order to estimate the recall of the selected Web search engines.
Clarke (1997) evaluated only relative or comparative recall, since
determining absolute recall is impossible in the huge and ever-changing Web.
Relative recall was determined by conducting a second search for known pages in
each search engine. A "pool" of relevant documents was thus identified for each
individual query, and each search engine was measured on the basis of how many
relevant documents it managed to retrieve from this "pool". The first ten pages of the
returned results in each search engine were evaluated for relevance using a three-point
scale. Results were presented in tables and the Friedman nonparametric
statistical test was performed in order to determine the significance of the results.
Clarke (1997) found that Alta Vista achieved the best mean precision score and
Lycos the worst, although the precision performance of Alta Vista was significantly
different only from that of Lycos. Excite achieved the best mean recall performance
and Alta Vista the worst, although there was no significant difference in the recall
performance of the three search engines. Clarke (1997) concluded that Alta Vista
was marginally the best search engine, which was in agreement with previous
studies. The main conclusion, according to Clarke (1997), is that it is possible to
apply the pooled recall approach to estimate the relative recall of Web search
engines.
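The pooled relative-recall approach can be sketched in a few lines. The page identifiers below are hypothetical; in the actual procedure the pool would be built from the relevant pages found across the engines for one query.

```python
def relative_recall(relevant_by_engine):
    """Pooled relative recall: the relevance 'pool' for a query is the
    union of relevant pages retrieved by any engine; each engine's
    relative recall is the share of that pool it managed to retrieve."""
    pool = set().union(*relevant_by_engine.values())
    if not pool:
        return {engine: 0.0 for engine in relevant_by_engine}
    return {engine: len(pages) / len(pool)
            for engine, pages in relevant_by_engine.items()}

# Relevant pages retrieved per engine for one hypothetical query.
scores = relative_recall({
    "AltaVista": {"a", "b", "c"},
    "Lycos":     {"b", "d"},
    "Excite":    {"a", "b", "c", "d", "e"},
})
print(scores)  # pool = {a,b,c,d,e}: AltaVista 0.6, Lycos 0.4, Excite 1.0
```

Note that absolute recall would need the full set of relevant pages on the Web, which is unknowable; the pool stands in for it, so the scores are only comparable between the pooled engines.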
A very different comparative study of Web search engines was conducted by
Courtois and Berry (1999). Their aim was to test how five major Web search
services retrieve and rank documents in answer to a user's search query. Their
main rationale for conducting such a comparison is that each search engine ranks
or sorts its results according to a specific set of criteria, namely its ranking
algorithm.
Furthermore, according to the authors, result ranking has a major impact on
users' overall satisfaction with Web search engines and on the way they retrieve
relevant documents from the results list. Courtois and Berry (1999) note that the
majority of analogous studies share a common methodology, which consists of
examining and evaluating the relevance of the first 10 or 15 hits returned by the
search engine. While the authors recognize that this is an effective and feasible
methodology for determining precision, they argue that, in their experience, this is
not how users actually make use of their results lists. Furthermore, the authors
attempt to justify their different methodological approach by explaining that most
users are likely to scan and retrieve only selected documents. However, this is a
rather weak point of their study, since it is based only on their personal experience
and on a similar study conducted by Koll (1993).
On this basis, Courtois and Berry (1999) developed a test suite
and an appropriate methodology, which consisted of three criteria for testing
relevance ranking:
1. All Terms: Are documents that contain all search terms ranked higher
than documents that do not contain all search terms?
2. Proximity: For documents that contain all search terms, are
documents that contain search terms as a contiguous phrase ranked
higher than documents that do not?
3. Location: For documents that contain all search terms, are documents
that contain search terms in the title, headings, or metatags ranked
higher than documents that contain terms only within the body of the
document?
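The first of these criteria can be sketched as a simple check over a ranked results list. This is a simplified illustration with made-up documents and terms; the Proximity and Location tests would follow the same pattern, testing phrase adjacency and term position instead of mere presence.

```python
def contains_all(doc_text, terms):
    """True if the document contains every search term (case-insensitive
    substring match; a real test would tokenize properly)."""
    text = doc_text.lower()
    return all(t.lower() in text for t in terms)

def all_terms_ordered(ranked_docs, terms):
    """Courtois and Berry's 'All Terms' criterion, simplified: documents
    containing all search terms must be ranked above every document that
    lacks at least one term."""
    seen_incomplete = False
    for doc in ranked_docs:
        if not contains_all(doc, terms):
            seen_incomplete = True
        elif seen_incomplete:
            return False   # a complete document ranked below an incomplete one
    return True

docs = ["greek search engines compared",   # contains both terms
        "search engines overview",         # lacks 'greek'
        "greek web directory"]             # lacks 'search'
terms = ["greek", "search"]
print(all_terms_ordered(docs, terms))                  # True
print(all_terms_ordered(list(reversed(docs)), terms))  # False
```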
For comparison they selected five search engines which scored highly in many
comparison tests conducted by popular computer magazines, namely AltaVista,
HotBot, Excite, Infoseek, and Lycos. They identified and selected 12 search queries
to test the particular search engines and they downloaded the first 100 hits of each
search.
Based on a further analysis of the downloaded documents against the
above test criteria, the authors concluded that the ranking performance of
all the engines was generally good. In the Proximity and Location tests, most of the
search engines performed worse than in the All Terms test, which raises
interesting questions about the ranking algorithm of each search engine.
Ultimately, Courtois and Berry (1999) suggest a very interesting methodology
for evaluating the quality and reliability of the ranking algorithms of Web
search engines, and their results raise some serious considerations from
the perspective of the end user.
Another experimental-comparative study is the one conducted by Gordon and
Pathak (1999). In their study, Gordon and Pathak (1999) distinguish between two
types of search engine evaluation: testimonials, encompassing informal and
impressionistic appraisals as well as feature-list comparisons, and shootouts, which
correspond more closely to traditional information retrieval effectiveness
experiments.
According to this definition, Gordon and Pathak (1999) presented a table of
twelve earlier shootout studies, but identified only three (including their own) which
made use of "appropriate experimental design and evaluation". Of these, that of
Gordon and Pathak (1999) is the most comprehensive and most recent. Gordon and
Pathak obtained thirty-three (33) real information needs from volunteers among the
faculty members of a university business school. These were recorded in
considerable detail and passed to skilled search intermediaries, who were given the
task of generating near-optimal queries for each of eight search engines through an
interactive, iterative process. The top twenty (20) results produced by each of the
engines in response to the final queries were then printed and returned to the
originating faculty member for assessment on a four-point relevance scale.
Gordon and Pathak (1999) presented a list of evaluation features which
they claim should be present to maximise the accuracy and meaningfulness of an
evaluation. Very briefly, these features can be listed as follows:
1. Searches should be motivated by genuine user need.
2. If a search intermediary is employed, the primary searcher’s information need
should be captured as fully as possible and transmitted in full to the
intermediary.
3. A large number of search topics must be used.
4. Most major search engines should be included.
5. The most effective combination of specific features of each search engine
should be exploited. This means that the queries submitted to the engines
need not be the same.
6. Relevance judgments must be made by the individual who needs the
information.
7. Experiments should be well designed and conducted.
8. The search topics should represent the range of information needs both with
respect to subject and to type of results wanted.
Of these features, some are very interesting, but others are debatable, such as
feature 5. It appears that Gordon and Pathak (1999) question the general
practice of evaluating Web search engines on the results produced by a set of
query words without special syntax or operators. However, it is more
reasonable to compare the quality of results produced by search engines given
identical input queries in this plain form, rather than attempting to find the best
query for each search engine and then comparing them. After all, typical users
avoid using special operators or special syntax in their queries. Gordon and Pathak
(1999) concluded that search effectiveness was generally low, that there were
significant differences between engines and that the ranking of engines was to some
extent dependent upon the strictness of the relevance criterion.
4. METHODOLOGY
4.1. Introduction
According to Van House et al. (1990) evaluation is the process of identifying
and collecting data about specific services or activities, establishing criteria by which
their success can be assessed and determining both the quality of the service or the
activity and the degree to which the service or activity accomplishes stated goals and
objectives.
The process of evaluation, as defined above, is widely used for
traditional databases, CD-ROMs and other online information retrieval systems in
order to assess their overall quality and performance. However, the evaluation of the
performance of Web search engines is a new area within the context of information
retrieval. According to Dong and Su (1997), studies concerning the assessment of
Web search services began in 1995. The review of previous literature has
revealed that the majority of such studies were conducted between 1995 and
1997. It has also revealed that, in general, three types of methodology
have been employed in assessing the performance of Web search services: actual
tests with data collection and analysis, evaluative comments with examples of simple
searches, and reviews of the functions of different search engines without examples
or other kinds of tests.
Further to this, it could be argued that methodologies making use only
of simple tests and reviews of search engine functions were employed mostly
in the earlier studies, while the majority of more recent studies employ
actual tests with data collection and analysis of the overall performance of the Web
search services.
In their study, Gordon and Pathak (1999) distinguish between only two types
of search engine evaluation methodology: testimonials and shootouts. Testimonials
are generally conducted by the trade press or by computer industry organizations that
“test drive” and then compare search engines on the basis of speed, ease of use,
interface design or other features that are readily apparent to users of the search
engine. Another type of testimonial evaluation comes from looking at the more
technical features of search engines and making comparisons among them on that
basis. Such testimonials are based on features like the set of search capabilities
different engines have, the completeness of their coverage or the rate at which newly
developed pages are indexed and made available for searching.
Although testimonials can give users some useful information
for deciding which search engine to employ, they can only indirectly
suggest which search engines are most effective in retrieving relevant Web pages.
For an overall evaluation of the performance of Web search engines, shootout
methodologies appear more appropriate. More specifically, in shootouts, different
search engines are actually used to retrieve Web pages and their effectiveness in doing
so is compared. Shootouts resemble the typical information retrieval evaluations that
take place in laboratory settings to compare different retrieval algorithms, despite the
fact that Internet shootouts often consider only the first 10 to 20 documents retrieved,
whereas traditional information retrieval studies often consider many more (Gordon
and Pathak, 1999).
4.2. Setting up the evaluation criteria
The special features of Web search engines in indexing technique, resource
coverage, relevance ranking, search strategy, hyperlinks and interface have led some
information retrieval researchers to the conclusion that the evaluation measures
should differ from those used for traditional online databases and CD-ROMs. While
this seems sensible, it does not necessarily mean that the measures and criteria
employed for the evaluation of traditional online systems are inadequate for the
evaluation of interactive information retrieval services, and specifically Web search
engines. The six criteria that Lancaster and Fayen (1973) once listed
(1. Coverage, 2. Recall, 3. Precision, 4. Response time, 5. User effort and 6. Form of
output) for the evaluation of information retrieval systems are still quite applicable to
modern interactive information retrieval systems, despite the fact that they were
set up three decades ago. In their study, Chu and Rosenthal (1996) employed a set of
criteria based on those listed by Lancaster and Fayen (1973). Their justification for
employing criteria and evaluation measures used in traditional
online information retrieval systems is that the Web can also be described as
an information storage and retrieval system, characterized by its enormous
size, hypermedia structure and distributed architecture.
Furthermore, the review of previous studies has revealed that the vast
majority employ evaluation criteria, such as precision, output options and
response time, which are commonly used in the assessment of traditional online
information retrieval systems (Chu and Rosenthal, 1996; Winship, 1995).
Ultimately, Su (1992) stated that, in general, criteria for evaluating interactive
information retrieval systems include relevance, utility, efficiency and user
satisfaction, but the truth is that there is no agreement as to which of the existing
criteria or measures are the most appropriate for evaluating interactive information
retrieval performance.
4.3. Text REtrieval Conference (TREC)
At this point it is appropriate to point out the importance and the contribution
of the Text REtrieval Conference (TREC) to information retrieval research. Some of
the previous studies concerning the evaluation of Web search engines used TREC-inspired
methods (Clarke, 1997; Hawking et al., 2001).
The first Text REtrieval Conference was held in November 1992 at the
National Institute of Standards and Technology (NIST) (Harman, 1993). The purpose
of the conference was to bring together researchers from the field of information
retrieval to discuss the results of their systems on a new large test collection
(the TIPSTER collection). TREC gave researchers the opportunity to compare
results on the same data using the same evaluation methods. Moreover, it
represented a breakthrough in cross-system assessment in the field of information
retrieval. It was the first time that most of these researchers had used such a large
test collection, and it therefore required a major effort by all of them to scale up their
retrieval techniques (Harman, 1995).
The overall goal of the TREC programme is to encourage research in
information retrieval using large test collections. It is hoped that by providing a very
large test collection and encouraging interaction among researchers in a friendly
evaluation forum, new impetus in information retrieval will be generated. Moreover,
it was hoped that the participation of groups with commercial information retrieval
systems would lead to increased technology transfer between research laboratories
and commercial products (Harman, 1995).
In the second TREC, which took place in August 1993, two types of
retrieval were tested: retrieval using an “ad hoc” query, such as a researcher might
use in a library environment, and retrieval using a “routed” query. Routed queries
are queries which are extracted from specified topics and then tested against a set
of “training” documents in which the relevant documents have been identified.
Through this process an optimal query is generated, which can then be tested
against new data (Beaulieu et al., 1996). In contrast, “ad hoc” queries are new
queries which are run against an existing set of data without prior knowledge of the
relevant documents. The assessment of the results was based on the traditional recall
and precision criteria. The queries employed in the present study were all “ad hoc”.
4.4. The influence of TREC
It appears that the main concern of the majority of the previous studies was
the efficiency of the methodology, the evaluation criteria and the development of the
test suite. In the current study, much effort was devoted to making the methodology
and the evaluation measures as efficient as possible, so as to produce an accurate and
meaningful evaluation of the quality of the results returned by the selected Web
search engines. Therefore, the methodology and evaluation criteria employed in this
study were influenced by those used in previous studies and by the TREC experiments.
4.5. Development of the test environment
The main purpose of the current study is to assess the overall performance of
the selected Greek search engines and to compare their performance with that of an
excellent performer, Google. To do so, a wide range of evaluation criteria should
ideally be considered. However, the limited time available for this study resulted in
selecting only those criteria which were considered to best reflect the
overall performance of each search engine, together with those which previous
studies had found to be of significant importance, such as interface and precision
(Dong and Su, 1997).
Another concern was the time proximity of the searching. According to
Leighton and Srivastava (1997), the goal of close time proximity of the searching
should be taken into consideration in order to ensure the objectivity and accuracy of
the evaluation of the returned results. The closer in time that a query is executed on
each of the selected search engines, the better. The rationale behind this tactic is
that if a relevant page were made available between the time one engine was
searched and the time a second was searched, this would result in an unfair situation
in which the second search engine would have the opportunity to have the new page
indexed and consequently retrieved. According to previous researchers who have
conducted similar studies (Chu and Rosenthal, 1995; Ding and Marchionini, 1996;
Leighton and Srivastava, 1997), close time proximity characterizes the quality
of the methodology and evaluation procedures employed in such studies.
According to them, the ideal situation would be for each query to be executed on all
the selected search engines simultaneously.
In the current study, all three search engines were searched for a
particular query on the same day, and each query was performed on each search
engine within one hour at most. A second round of searches was conducted
immediately after the first for each query in order to evaluate recall, which will
be discussed in more detail later. Again, the time limit was considered to
be of major importance, since the selected search engines claim to update their
database indexes on a weekly or even daily basis.
Related to close time proximity was the goal of checking the pages cited in
the results from the Web search engines as quickly as possible after the results had
been obtained. This was considered to be as important as the goal of close time
proximity of the searching. The reason is that the longer one waits after the results
have been obtained, the more likely it is that some pages which were active during
the searching will have been removed from the Web; the tested search engine would
thus be assessed unfairly by the evaluator (Leighton and Srivastava, 1997).
The relevance judgements were performed immediately, and it took only about
thirty minutes to evaluate the first ten returned results for each search engine. The
precision scores were also assigned within these thirty minutes. For efficiency,
the results were saved as “.htm” files in case there should be a need
to check the URLs again or to validate assessments. The relevance judgements were
another area of great concern during the evaluation of the selected Web search
engines. Relevance judgement is the weak point of the evaluation procedure in the
majority of similar studies, the main problem being the person responsible for
assessing the relevance of the returned results.
Moreover, in most of these studies the author or authors were themselves the
evaluators of the returned results. At this step of the evaluation procedure, bias,
both conscious and unconscious, can enter and distort the objectivity and accuracy of
the relevance judgements and thus the precision scores of the tested search engines.
For example, if the subject matter of the selected queries is wide, there is a serious
concern over whether the evaluator has an adequate knowledge background to assess
the relevance of the returned results. In order to overcome this flaw, many researchers
decided to select queries with narrow subject matter (Clarke, 1997). While this
approach makes accurate and meaningful relevance judgements possible,
there is always the danger that the returned pages will come
from only one portion of the Web.
In the present study, the evaluation procedure employed is inspired
by the earlier study of Gordon and Pathak (1999), which introduced intermediary
researchers or evaluators in an attempt to circumvent the risk of distorted relevance
judgments. More specifically, the returned results were assessed by six
fellow students from the Department of Information Studies at the University of
Sheffield, each with an adequate knowledge background in a specific subject area
(economics, software engineering, librarianship, history and archaeology,
mathematics and management). Furthermore, many of the queries used in the current
study were real reference questions drawn from their research dissertations for their
Master’s degrees, so they were evaluating the results both as researchers and as
end-users. As mentioned previously, this procedure was considered necessary and was
employed in order to ensure the highest possible degree of accuracy and objectivity
in the evaluation of the results.
Searches were carried out on PCs at the St. George I.T. Centre at the
University of Sheffield. Access to the World Wide Web was provided through
the University’s LAN (Local Area Network), and the Web browser used was the
latest version of Internet Explorer (version 5.5).
4.6. Sample queries suite
As already described, this methodology involves a set of sample
queries that are employed in order to test and assess the overall performance of
the selected Web search engines. The development of the sample query suite
is thus a sensitive step in the development of the evaluation procedure, which can
potentially affect the measured performance of the tested search engines (Ding and
Marchionini, 1996).
4.6.1. Number of queries
As mentioned, the searching procedure was designed in such a way as
to minimize the possibility of favouring the search engine examined first or the
one examined last. Thus, a compromise between the number of queries employed
and the number of documents assessed for each search engine was considered
crucial, so that each individual query could be processed within a
reasonable time limit. This time limit was compulsory in order to ensure that the
search engines’ indexes would not change during the evaluation of each
individual query. Taking this into consideration, twenty was regarded as a
feasible number of queries to evaluate in a rather limited time. The initial
intention, before designing the evaluation methodology, was to use at
least 25 queries, so that the subject matter of the queries would be as wide as
possible. However, problems arose in keeping to the required time limit, and so
the number was limited to 20 queries.
4.6.2. Query subject matter
Previous studies suffered from considerable bias regarding the subject matter of the
queries employed. Many researchers decided to use a wide subject matter
(Chu and Rosenthal, 1995; Ding and Marchionini, 1996), but there is always the
question of how capable the evaluator is of making accurate and meaningful relevance
judgments on topics which require an appropriate knowledge
background. Other researchers (Clarke, 1997), in order to avoid this risk,
deliberately employed queries with narrow subject matter. While this allowed them
to make accurate and meaningful relevance judgements, it could be argued that, on
the other hand, only one portion of the Web was tested for retrieval (Clarke, 1997).
In the current study an effort is made to avoid such a risk by increasing the
number of evaluators to six, each with a different knowledge background.
Moreover, most of the queries used in the current study were extracted from real
reference questions used in the evaluators’ research dissertations. The
evaluation procedure described was thus designed to ensure that the relevance
judgments of the evaluators would be as accurate as possible, while allowing
the subject matter of the queries to be as wide as possible.
Apart from that, the subject matter of some queries involves issues of Greek
culture and civilization. The reason behind this approach is that Greek search
engines are used every day by hundreds of users searching the Greek domain for
information that mostly involves such issues. It is therefore of particular
interest to test the two most popular Greek search engines (Anazitisis and
Trinity) on this particular subject matter, to evaluate
their performance and compare it with a class-leading search engine such as Google.
Of course, it can be argued that such a comparison is somewhat unfair to
Google: the Web robots of the tested Greek search engines are primarily focused
on indexing the Greek Web, so an advantage for Anazitisis and Trinity on these
particular queries can be expected. While this may be true, the comparison was
considered necessary for the purposes of the present evaluation. Moreover, this
study also attempts to explore the capabilities and limits of the tested search
engines and to examine how each copes with queries combining Greek and English.
4.6.3. Query formulation and search expression
The decision on query formulation and on the search expression to be entered in
each search engine proved to be quite difficult. Previous studies have suffered
from considerable bias here. Some researchers who conducted similar studies
(Chu and Rosenthal, 1996; Tomaiuolo and Packer, 1996) decided to compile the
selected queries using syntax specific to each of the tested search engines, or
to use the so-called “advanced mode” feature where it was available. In their
study, however, Leighton and Srivastava (1997) tried to be more systematic and
carefully examined, before conducting the actual test, what search expression
should be submitted to the selected search engines. They decided to use simple
queries, of the kind an ordinary user would enter, because, according to them,
such queries force the search engine to do more of the work, ranking the
results by its own algorithm rather than by the constraints specified by
operators.
Moreover, according to Hawking et al. (2001), all well-known public search
engines are designed to produce a list of results when a set of simple queries
(without operators or special syntax) is typed into the search box provided by
the engine’s primary interface. It therefore seems more realistic to compare the
quality of the results returned by search engines given identical input queries
in this particular form. Furthermore, the examination of query logs
(Silverstein et al., 1999) has revealed that most users do not use any form of
query operators, or the “advanced mode” where it is provided. Additionally,
Silverstein et al. (1999) concluded that in most cases, when users attempted to enter queries using
query operators, they made many errors. Thus, while studies that adopt the
approach of trying to find the best query formulation for each search engine are
very interesting, they also introduce conscious and unconscious bias, which can
lead to unfair comparisons among the tested search engines. The query
formulation employed in the current study therefore consists of “simple
queries”, in an attempt to minimize the unfairness that could enter at this step.
4.6.4. Further analysis of the sample queries
As previously described, the queries employed in the current study have been
extracted from real reference questions, with an effort made to keep their
subject matter as wide as possible. The fact that this study attempts to examine
and evaluate the most popular Greek search engines imposes some requirements on
the queries employed. More specifically, the subject matter of some of the
sample queries relates to Greek culture, history and civilization. Some queries
are entirely in Greek, while others contain a combination of Greek and English
search keywords. The inclusion of queries related to Greek civilization was
considered of significant importance, since Greek search engines are used
precisely to unearth information on such issues.
Apart from that, this study also examines how the selected Greek search engines
cope with a combination of Greek and English words within the same query. This
area is considered particularly important for the current study, because the
first Greek search engines faced a variety of problems with mixed Greek and
English words within an individual query. It is therefore very interesting to
examine how much the selected Greek search engines have improved in such a
“sensitive” area, in which they claim to have overcome the problems of the
past. Furthermore, it is also interesting to examine the search performance of
Google in another language, Greek, since it claims to be capable of handling it.
Further to this point, problems can arise from the complicated and
sophisticated grammatical and syntactical structure of the Greek language. More
precisely, many words such as nouns take a variety of forms within a sentence,
called cases (e.g. nominative, accusative, etc.). The same holds for many other
words such as adjectives and pronouns. This variation of word forms within a
Greek sentence is used to alter the meaning of the sentence or to mark a
particular relationship between its words. In any case, the possible variations
of word forms in Greek are far more complicated than in English, and thus
obstruct the task of Web search engines.
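As a toy illustration of this difficulty (and emphatically not the actual technique used by Anazitisis, Trinity or Google), consider how exact keyword matching treats three inflected forms of the Greek word for “machine”, and how even a crude suffix-stripping rule can conflate them. The suffix list below is illustrative only and is not a real Greek stemmer:

```python
# Toy illustration of Greek inflection obstructing exact keyword matching.
# The suffix list and stripping rule are illustrative assumptions only;
# they are not what any of the tested search engines actually implement.

GREEK_SUFFIXES = ["ές", "ής", "ή"]  # a few inflectional endings, longest first

def crude_stem(word: str) -> str:
    """Strip the first matching inflectional suffix, if any."""
    for suffix in GREEK_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Nominative singular, nominative plural and genitive singular of "machine".
forms = ["μηχανή", "μηχανές", "μηχανής"]

# Exact matching sees three distinct index terms...
print(len(set(forms)))                         # 3
# ...while suffix stripping maps them all onto one stem.
print(sorted({crude_stem(w) for w in forms}))  # ['μηχαν']
```

A search engine that indexes only the exact surface forms will miss a page containing “μηχανές” when the user types “μηχανή”; conflating inflected forms, by whatever means, is what the engines tested here claim to have addressed.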
Problems such as the one described above plagued the users of the first
Greek search engines. This language-related problem is of crucial importance,
since it rendered the first Greek search engines unable to perform their tasks
successfully. Unfortunately, there is no firm evidence to document this, because
until now there has been no significant study evaluating the performance of
Greek search engines. As illustrated in the literature review section, the few
studies that examine Greek search engines are all of a descriptive nature.
Modern Greek search engines, such as those tested in the current study, claim
to have overcome this problem by utilizing special software (e.g. Anazitisis) or
special language techniques in the ranking algorithm (Trinity).
The sample-query part of the test suite has therefore been developed in the way
just described, in order to assess how the advanced algorithms and special
features of the selected Greek search engines and Google cope with the
combination of Greek and English words within an individual query. The list of
the twenty sample queries employed in the current study is given below. The
English translation of the Greek queries, and of the queries combining Greek
and English words, appears in parentheses and italics.
1. Internet Banking strategy
2. PEST factors in internet banking
3. Public libraries and learning disabilities
4. Customer relationship management
5. Object-Oriented management with UML
6. Customer-centric culture
7. Telecommunications infrastructure in Greece and Spain
8. Information retrieval and Web search engines
9. Backpropagation models of neural networks for information retrieval
10. Information modelling and SSADM methodology
11. Μηχανές αναζήτησης στο Web (Search engines on the Web)
12. Ο κόσµος του Internet (The world of the Internet)
13. Οι εκδόσεις του OECD (Publications of the OECD)
14. Αρχιτεκτονική επεξεργαστών Risk στους προσωπικούς υπολογιστές (Risk architecture CPUs in personal computers)
15. Μεγάλοι έλληνες ρεµπέτες (Great Greek folklore musicians)
16. Ελληνική επανάσταση 1821 (Greek independence war of 1821)
17. Εκδόσεις της τράπεζας Ελλάδος (Publications of the Bank of Greece)
18. O ελληνικός στοχασµός κατά το 19ο αιώνα (Greek philosophical meditation during the 19th century)
19. Νεοελληνικός διαφωτισµός και Ρήγας Φεραίος (Modern Greek enlightenment and Rigas Feraios [personal name])
20. Αντικειµενοστρεφή συστήµατα βάσεων δεδοµένων (Object-oriented database systems)
4.7. Evaluation of returned pages
4.7.1. Document cut-off
The application of a document cut-off in the evaluation of on-line information
retrieval systems is regarded as a necessary step before the actual evaluation
takes place. A decision must be taken on the number of pages to be assessed
against the predefined measures of evaluation. The necessity of a document
cut-off stems from the fact that the output of these systems can run to hundreds
or even thousands of returned pages. The same holds for Web search engines: in
most cases their returned results can easily reach thousands of Web pages,
which might still not be a large volume of information considering the vast and
ever-changing Web.
In the current study, it was decided to evaluate the first ten hits of the
results list produced by each search engine. This decision was based on
personal experience and on observing the behaviour of fellow students towards
the results lists produced by search engines. Virtually all of them tended to
browse and examine only the first ten, or more rarely the first twenty, hits
returned by the search engine. This approach to the document cut-off also seems
to be supported by the vast majority of researchers who have conducted similar
studies. Chu and Rosenthal (1996), Scoville (1996) and Tomaiuolo and Packer
(1996) tested the search engines on the first ten results, while others, such
as Ding and Marchionini (1996) or Gauch and Wang (1996), evaluated the search
engines on the basis of the first twenty results. The common practice in
similar studies is thus to examine and evaluate search engines on the first
ten, and sometimes twenty, returned results. The time factor should also be
regarded as another major reason behind this common practice of document
cut-off.
Since all the selected search engines display their results in descending order
of relevance, calculated in one way or another, it is considered that this
cut-off should not critically affect the validity of the current study.
4.7.2. Measures of evaluation specific to the current study
As previously described, and as the relevant literature suggests, the
evaluation criteria employed in studies assessing the overall performance of
on-line information retrieval systems can be considered perhaps one of the
weakest parts of those studies. The underlying reason is that there is no
common agreement as to which of the existing criteria or measures are the most
appropriate for evaluating interactive information retrieval performance (Su,
1992). This is easily confirmed by the review of previous studies.
More specifically, owing to the special features of Web search engines
(interface, hyperlink structure, etc.), every researcher attempted to use
criteria somewhat different from the traditional ones. For example, Taubes
(1995) considered reliability, completeness and speed as the measures of
evaluation; Winship (1996) argued that record structure and search techniques
carried greater significance than retrieval performance; and others have
suggested that a powerful and usable interface, together with the quantity,
precision and readability of returned results, are the most important criteria
for evaluating and rating search engines.
It should be clarified that the researchers who employed a rather different set
of evaluation criteria did not necessarily reject the traditional ones. In
fact, most of them included some of the traditional criteria in the overall set
of evaluation measures employed in their studies; it is simply that other
criteria, such as the interface, search techniques, record structure, ease of
use, etc., were considered to carry greater significance than the traditional
ones when assigning an overall performance score to the tested search engine.
According to Dong and Su (1997), traditional criteria such as precision and
response time remain the most commonly used in the vast majority of studies
comparing and evaluating the performance of Web search engines.
In the current study, an effort was made to employ those criteria that would
best reveal the overall performance of each of the tested search engines. At an
earlier stage there was a thought of employing a wider set of criteria to
assess the overall performance of the selected search engines, the motivation
being that the more evaluation measures employed the better, since such an
exhaustive test would reveal almost every strength or weakness of the tested
search engines. While this reasoning has some validity, the review of previous
studies did not reveal any study employing a very large number of evaluation
measures. The truth is that such a study risks being superficial, especially in
the present case where the available time was very limited. It was felt best to
conduct a feasible test of some of the most representative performance measures
rather than employing a large number of criteria to evaluate every feature and
aspect of each search engine. So it was decided to evaluate the selected search
engines on the basis of precision, relative recall, response time, validity of
links, interface and documentation.
4.7.3. Precision
Precision and recall together constitute the two most important traditional
measures of retrieval effectiveness (Saracevic, 1975). Precision, or the
precision ratio, is defined as the proportion of retrieved documents that are
judged relevant, i.e. the number of relevant documents retrieved divided by the
total number of
documents retrieved. Cheong (1996) likewise describes the percentage of
retrieved documents that the user judges relevant as a measure of the
signal-to-noise ratio in certain kinds of system.
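The ratio just defined can be sketched in a few lines of code; the document identifiers and relevance judgements below are hypothetical, chosen only to make the arithmetic concrete:

```python
# Minimal sketch of the classic precision ratio defined above:
# relevant documents retrieved divided by total documents retrieved.
# The identifiers and judgements are invented for illustration.

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are judged relevant."""
    relevant_retrieved = [doc for doc in retrieved if doc in relevant]
    return len(relevant_retrieved) / len(retrieved)

retrieved = ["d1", "d2", "d3", "d4", "d5"]  # hypothetical result list
relevant = {"d1", "d3"}                     # hypothetical relevance judgements
print(precision(retrieved, relevant))       # 0.4
```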
According to Dong and Su (1997), precision is considered very important in
comparing and evaluating the performance of search engines for two reasons. The
first is that each search engine employs its own methods and techniques for
collecting and indexing documents, and the fields indexed also differ from one
engine to another; precision thus offers a way to identify which indexing
method or technique is the most efficient. The second reason is that automatic,
machine-produced indexing, despite the sophisticated modern techniques and
algorithms employed, cannot always cope successfully with words used in various
contexts, resulting in the indexing of non-relevant items. It can therefore be
argued that the relevance of the output to a user’s query is an important
indicator for assessing the quality and intelligence of an individual search
engine.
Dong and Su (1997) note that while precision has been widely used as a
criterion for describing the relevance of search results, only a few of the
studies conducted between 1995 and 1996 (Chu and Rosenthal, 1996; Ding and
Marchionini, 1996) applied the precision criterion using a “standardised
formula” (Dong and Su, 1997). The problem with this statement is that Dong and
Su (1997) do not properly clarify what they mean by the term “standardised
formula”. The fact remains that the traditional way of evaluating precision
needs to be reconsidered when evaluating Web search engines. More specifically,
in all the previous studies attempting to assess the performance of Web search
tools, the precision measure was calculated on the basis of the first ten or
twenty hits returned by the search engine.
In the current study it was felt best to assess the precision of the returned
Web pages on a three-point scale, as Clarke (1997) did in his study. More
specifically, a score of 1 was given to very relevant documents, 0.5 to
somewhat relevant documents, and 0 to documents that were not relevant. Since
six evaluators were used in the current study, every page was examined
thoroughly before assigning a precision score. Also, all the links in every page were
examined and not just one or two initial links.
In the case that a page was consisted of a whole set of links, every link was
also examined thoroughly. If these links, by following them, could lead to useful
information resources, then a score of 0.5 was assigned to this page (Leighton, 1995).
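The scoring scheme just described, combined with the ten-hit document cut-off of section 4.7.1, can be sketched as follows; the judgements in the example are invented, not figures from the study:

```python
# Sketch of the precision computation described above: each of the first
# ten hits receives a relevance score of 1 (very relevant), 0.5 (somewhat
# relevant) or 0 (not relevant), and the scores are averaged.
# The example judgements below are invented, not results from the study.

DOCUMENT_CUT_OFF = 10  # only the first ten hits are judged

def precision_at_cut_off(scores):
    """Mean three-point relevance score over the first ten results."""
    judged = scores[:DOCUMENT_CUT_OFF]
    return sum(judged) / len(judged)

# Hypothetical judgements for one query on one search engine:
scores = [1, 1, 0.5, 0, 1, 0.5, 0, 0, 1, 0.5]
print(precision_at_cut_off(scores))  # 0.55
```

Averaging such scores per query, and then across the twenty queries, yields the mean precision rates per search engine reported in the results tables.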