
Greek Web Search Engines: An Evaluative and Comparative Study

    A study submitted in partial fulfilment of the requirements for the degree of Master of Science in Information Systems

    at

THE UNIVERSITY OF SHEFFIELD

    by

    Panteleimon Lilis

    September 2002

Abstract

The present study is a first attempt to evaluate the overall performance of Greek Web search engines and to compare them with a class-leading search engine, Google. To do so, a particular methodology was designed and developed for the purposes of the comparison. More specifically, the three Web search engines were evaluated in terms of precision, relative recall, validity of links, response time, interface and on-line documentation. These criteria were developed and employed specifically for the needs of the present study. The queries were likewise developed for this study: twenty queries were used so that the query subject matter would be as wide as possible. In addition, some of the queries were in Greek, some in English and some combined Greek and English keywords, in order to assess how capable the selected search engines are of coping with a variety of linguistic characteristics. The first ten pages of returned results from each search engine were evaluated on the basis of the criteria described above. Results are presented in tables, and the comparison was based on averaging, that is, finding the mean rate for each search engine on each criterion. The conclusion is that the performance of the three selected search engines is rather poor. Google was, in general, found to have a better overall performance than Anazitisis and Trinity, while the two Greek search engines performed almost identically.

Contents

List of Tables
List of Charts
1. INTRODUCTION
   1.1. World Wide Web search engines
   1.2. Aims and objectives of the current study
2. THE SELECTED SEARCH ENGINES
   2.1. Search engine selection
   2.2. Features of the selected search engines
      2.2.1. Google
      2.2.2. Anazitisis
      2.2.3. Trinity
3. LITERATURE REVIEW
   3.1. Introduction
   3.2. Review of comparative-experimental studies
4. METHODOLOGY
   4.1. Introduction
   4.2. Setting up the evaluation criteria
   4.3. Text REtrieval Conference (TREC)
   4.4. The influence of TREC
   4.5. Development of the test environment
   4.6. Sample queries suite
      4.6.1. Number of queries
      4.6.2. Query subject matter
      4.6.3. Query formulation and search expression
      4.6.4. Further analysis over the sample queries
   4.7. Evaluation of returned pages
      4.7.1. Document cut-off
      4.7.2. Measures of evaluation specific to the current study
      4.7.3. Precision
      4.7.4. Recall
      4.7.5. Response time
      4.7.6. Validity of links
      4.7.7. Interface
      4.7.8. On-line documentation
   4.8. Possible drawbacks, inconsistencies and bias of the specific methodology
5. RESULTS
   5.1. Calculations
      5.1.1. Averaging
      5.1.2. Precision scores
      5.1.3. Recall scores
      5.1.4. Response time
      5.1.5. Validity of links
6. ANALYSIS AND INTERPRETATION OF THE RESULTS
   6.1. Evaluation of the overall performance of the tested search engines
      6.1.1. Precision ratio
      6.1.2. Recall ratio
      6.1.3. Response time
      6.1.4. Validity of links
      6.1.5. Interface
      6.1.6. On-line documentation
7. CONCLUSIONS
   7.1. Limitations of the current study
   7.2. Some future recommendations
BIBLIOGRAPHY
APPENDIX - SEARCH ENGINES INTERFACE

List of Tables
Table 1: Precision scores
Table 2: Recall scores
Table 3: Response time
Table 4: Invalid links

List of Charts
Chart 1: Mean precision performance
Chart 2: Mean recall
Chart 3: Mean response time
Chart 4: Validity of links

1. INTRODUCTION

    1.1. World Wide Web search engines

According to Chu and Rosenthal (1996), the World Wide Web has gained so much popularity that it is the second most popular Internet application after e-mail. The Web is used for a variety of purposes by many people around the world. However, it can be argued that the Web is used for two main purposes (Clarke, 1997). The first is the publishing of information: the fact that information on the Web can be accessed by many people at the same time has made the Web the world's largest information medium.

The second use of the Web is for information retrieval (Clarke, 1997). More specifically, in many respects the Web can be described as a huge information storage system. The reality, however, is that its unstructured and ever-changing nature has made searching for and retrieving information a very difficult task (Declan, 2000). Web search engines were developed to overcome this difficulty by assisting the ordinary Web user in searching for and retrieving the required information.

Web search engines came into existence in 1994, and since then at least twelve have been developed for use on the Web. They have variously been referred to as search tools, search services, indexes, Web databases and search engines. In the present study the term used most often is search engines, since this is also the case in the majority of the studies reviewed.

    1.2. Aims and objectives of the current study

This dissertation aims to evaluate two of the most popular Greek search engines (Anazitisis and Trinity) and to compare them with a class-leading Web search engine, Google. The main reason for conducting such a study is that no similar study has ever been conducted in Greece, meaning that there is no particular information about how each of the Greek search engines performs. Moreover, articles dealing with Greek search engines are very limited, and all of them are reviews and thus descriptive in nature. This is because Greek Web search engines are recent in comparison to search engines such as Google or Alta Vista, so the relevant literature is very immature.

However, recent developments in the Greek search engines (Anazitisis, a new ranking algorithm in Trinity) have raised some concern in the Greek Web community about the performance of these search services. Thus, another reason for comparing two Greek search engines with Google is that this will give a measure of how developed Anazitisis and Trinity really are, as opposed to how developed they claim to be. After all, as Chu and Rosenthal (1996) state, the sheer number of such services invites further research.

In order to achieve this aim, a methodology had to be designed and developed. This required exploring and examining the relevant literature so as to identify the required criteria and the appropriate test environment to be developed. It is very important to mention that the methodology is the most important part of the present study. Its completeness ensures the objectivity of the results and minimizes the risk of introducing bias, both conscious and unconscious, as well as inconsistencies.

Furthermore, the researcher decided to design and develop a methodology specifically for this study, since for the purposes of the comparison a number of different criteria and search engine features needed to be examined and evaluated thoroughly. For example, some of the queries employed were in Greek, some in English and some were a combination of Greek and English search keywords; another example is that the on-line documentation of each search engine was employed as an evaluation measure, for reasons discussed in more detail in the methodology section.

2. THE SELECTED SEARCH ENGINES

    2.1. Search engine selection

The researcher of the current study decided to select only three search engines to test and evaluate. It can be argued that this number is rather small in relation to some of the reviewed studies. This constraint on the number of search engines was considered necessary for the following reasons. First, it would allow a greater number of queries to be used, so that the subject matter of the queries could be as wide as possible. Second, it would allow a larger number of evaluation criteria to be employed in order to assess the overall performance of the selected search engines. Many of the reviewed studies are limited to the usual measures of precision, recall and interface; but the fact that the present study attempts to examine, evaluate and compare the selected search engines in terms of their overall performance suggests that a larger number of evaluation criteria should be employed.

The idea was to select two of the most popular and well-respected Greek search engines and compare them with a class-leading search engine such as Google. The first Greek search engine selected is Trinity. It is one of the most well-respected Greek search engines and is used by the most popular portal of the Greek Web, www.in.gr. The second Greek search engine selected is Anazitisis, which is a product of one of the most popular ISPs in Greece, OTEnet. The selection of Anazitisis was based on the fact that it is a very new search engine which gained popularity in a very short time. Anazitisis boasts that it employs advanced ranking algorithms and impressive special features, along with special software designed particularly to increase its searching and retrieval performance in the Greek language noticeably. The characteristics and features of Google, Anazitisis and Trinity are considered in more detail in the following section.


2.2. Features of the selected search engines

2.2.1. Google

The Google Web search engine was founded by Sergey Brin and Lawrence Page, two graduate students in computer science at Stanford University in California. In less than a year, their search engine became the most popular on the Web, yielding more precise results for most queries than conventional search engines. Google's database is very large and, according to many sites and resources on the Web, must be the biggest search engine database in the world. Google claims that its database contains over two billion pages, but it may be counting pages which are not fully indexed.

One distinguishing characteristic of Google is its searching and retrieval speed or, more formally, its very low response time. According to Google's homepage, this can be attributed in part to the efficiency of its search algorithm and in part to the thousands of low-cost PCs that have been networked together (so as to form a powerful computing grid) to create a very fast search engine. The other most distinguishing characteristic is its ranking algorithm.

As far as its ranking algorithm is concerned, Google is a unique search engine on the World Wide Web. More specifically, Google's ranking algorithm is based on how many other pages link to each page, along with other factors such as the proximity of the search keywords or phrases in the documents. It uses not only the number of other pages that link to a page, but also the importance of those linking pages, which is in turn evaluated by the links pointing to each of them. This simply means that there is no way for anyone to influence the ranking of his or her page in Google, something which is quite possible in some other search engines and directories. This innovative approach takes its inspiration from the citation analyses used in scientific literature (Declan, 2000) and is based on the principle of "bibliographical coupling" (Skandali, 1990).

Google embodies these principles in its ranking algorithm, "PageRank", which has been the topic of many discussions, although so far there is no clear evidence of how exactly it works. In general, the PageRank (PR) is calculated for every webpage that exists in Google's database. The calculation of the PR for a page is based on the quantity and quality of the webpages that contain links to that particular page. According to the co-founders of Google, Sergey Brin and Lawrence Page, the PR of a webpage is calculated using the following formula:

PR(A) = (1 - d) + d * SUM(PR(I->A) / C(I)), where:

PR(A) is the PageRank of page A,
d is the damping factor, usually set to 0.85,
PR(I->A) is the PageRank of a page I containing a link to page A,
C(I) is the number of links off page I,
PR(I->A) / C(I) is the PR value that page A receives from page I, and
SUM(PR(I->A) / C(I)) is the sum of all PR values that page A receives from pages with links to page A.

More explicitly, the PR of page A is determined by the PR of every page I that has a link to page A. For every page I that points to page A, the PR of page I is divided by the number of links from page I. These values are summed and multiplied by 0.85. Finally, 0.15 is added to this result, and this number represents the PR of page A (Declan, 2000).
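
To make the calculation concrete, the following short Python sketch iterates the formula above over a tiny hypothetical link graph (the pages, the link structure and the iteration count are assumptions made purely for illustration; they are not taken from Google's actual implementation):

# A minimal, illustrative PageRank iteration over a toy link graph.
links = {
    "A": ["B", "C"],  # page A links to B and C
    "B": ["A"],       # page B links to A
    "C": ["A", "B"],  # page C links to A and B
}
d = 0.85                             # damping factor, as in the formula above
pr = {page: 1.0 for page in links}   # initial PR values

for _ in range(20):                  # repeat until the values settle
    new_pr = {}
    for page in links:
        # sum PR(I)/C(I) over every page I that links to this page
        incoming = sum(pr[i] / len(links[i]) for i in links if page in links[i])
        new_pr[page] = (1 - d) + d * incoming
    pr = new_pr

print(pr)  # approximate PageRank value for each page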

    Google allows the user to search either in the simple or in the advanced

    mode. Each mode has a different entry screen and provides different functions and

search options. The simple interface is a single search box with two search buttons: "Google Search" and "I'm Feeling Lucky". The latter automatically displays the

    page deemed most relevant rather than displaying a list of results. The advanced

    interface provides boxes for the following search options: for "all the words", "exact

    phrase", "any of the words", and "without the words", pull-down menus to limit by

    location on the page (anywhere, title or URL), language and domain, radio buttons to

    filter results using "SafeSearch", and search boxes that allow you to search for pages

    that are similar to or link to a given URL. Apart from these, Google also supports

    major Romanized and non-Romanized languages and translation to English from

    major European languages. However, Google does not support truncation and it is

    not case sensitive.


2.2.2. Anazitisis

Anazitisis is the most recent of the Greek search engines. In fact, Anazitisis is part of the on-line products provided by OTEnet, one of the most popular and well-respected ISPs in Greece. Unfortunately, the researcher did not have enough information about Anazitisis, because its administrators showed no interest in contributing to the present research. Thus, much of the information presented about Anazitisis in this study is based partly on information found on the Greek Web and partly on the researcher's personal experience with Anazitisis.

Anazitisis became fully operational a year ago, and during this time it has become very popular among Greek users. Its popularity is largely based on the fact that it boasts advanced search features and capabilities. More specifically, the search engine employs the SDK, a linguistic software tool developed by AltaVista especially for the Greek language. The SDK is used by Anazitisis to increase its searching and retrieval capabilities in the Greek language. Far more impressive, however, is "normalisation", a special feature of Anazitisis designed and developed to cope with the various forms, or more precisely the various grammatical "cases", in which a Greek word can appear within a sentence (e.g. nominative case, accusative case, etc.). With this particular characteristic, Anazitisis claims to increase its precision and recall considerably when a query containing Greek search keywords is submitted.

There is no particular information about how Anazitisis ranks Web pages, and there is no information at all about its Web robot and its capabilities. However, it can be argued that its ranking is based on a combination of two sets of criteria, the first considered dynamic and the second static. The first set includes criteria such as the presence and number of keywords in the title, in the first line of the text, in the main body of the page or in the Meta tags of the HTML code. The second set includes criteria such as the popularity of a specific Web page (measured by the number of pages that link to it) and the percentage of text it contains (more text is regarded as an indication that the page is likely to be more valuable and thus more informative).

Anazitisis supports full Boolean searching by inserting the appropriate operator before a word ("+" for the "AND" operator and "-" for the "OR" operator) and phrase searching (by using quotes "…"). Truncation is also supported, by inserting the wildcard "*" at the end of the truncated word. Finally, it is also possible to search specific fields of the Meta tags of the HTML code, such as the title or the URL. The user can enter his or her queries either in the simple mode or in the advanced mode, where further functions are supported, such as searching only governmental sites.

2.2.3. Trinity

Unfortunately, as in the case of Anazitisis, the amount of information available for the second Greek search engine, Trinity, was very limited. To begin with, Trinity was developed by Phaistos Networks SA, a Greek company active in the area of Internet and Web applications. Trinity became fully operational in 1997. Trinity is the basic search engine used by the most popular Greek portal, www.in.gr, and this might also be one of the reasons for Trinity's popularity.

Trinity also operates in two modes, simple and advanced. Each mode has a different entry screen, because different functions are supported. In the simple mode the user can only submit his or her query, while in the advanced mode special operators are supported to help the user increase either precision or recall. As far as search options are concerned, Trinity provides only limited options in comparison to Anazitisis and Google. More specifically, Trinity supports Boolean searching only partially (just two operators: AND, OR), along with phrase searching.

As far as the ranking of documents is concerned, the information available for Trinity is very limited and very general. However, it appears that Trinity's ranking algorithm is based on an analysis of the popularity of a Web page, which is measured by analysing the number of other pages that link to that page. It appears, then, that the basic principle of Google's PageRank is being employed by more Web search engines. Other criteria taken into consideration by Trinity's ranking algorithm are word proximity and URL analysis. There is no particular information about the latter criterion, but from the researcher's personal experience with Trinity a possible explanation is that what is measured is the proximity of the URL of a specific Web page to the keywords of the user's query. Trinity claims to employ a very fast Web robot, Septera, but no further information is given about it. The number of pages indexed in its database is also unknown.

    3. LITERATURE REVIEW

    3.1. Introduction

The literature review section of this study attempts to explore and provide some critical evaluation of the relevant literature in the area of Web search engine evaluation and comparison. It should be noted that it was decided to include in this section only those studies that can be considered experimental or comparative. It was felt best to conduct an in-depth review of some of the most important previous studies rather than examine everything that has been written on Web search engine evaluation. Other studies, such as Scoville (1996) or Randall (1996), which were found to be reviews rather than experimental or comparative studies, were not included in this section. This selection was considered necessary because the researcher wanted to explore and examine in depth how each researcher conducted his or her study, which evaluation measures were employed, what methodology was applied and the reasoning behind each of these steps. The design and development of the methodology employed in the present study is based on and influenced by some of the studies reviewed in this section. However, it should be noted that conclusions and findings derived from non-comparative studies (reviews) were also used extensively within the present study, particularly for the design and development of the methodology employed.

Unfortunately, no significant studies have been conducted in Greece on the evaluation and comparison of Web search services. There are only a few, which are of a descriptive nature and are thus considered reviews rather than experimental or comparative studies. Some of these Greek studies (Papathanasiou and Kanarelis, 2001) were used by the researcher to acquire further information about the two Greek search engines, Anazitisis and Trinity, since their administrators showed no interest in contributing to the present study.

    3.2. Review of comparative-experimental studies.

To begin with, a very comprehensive methodology, practical and analytical in its approach, for evaluating the performance of Web search engines is presented by Chu and Rosenthal (1996). They examined and evaluated three Web search engines, namely Alta Vista, Lycos and Excite, in their attempt to develop a feasible methodology for evaluating all Web search engines.

In order to test and evaluate the performance of the selected search engines, the authors developed and used ten (10) sample search queries. The search queries were selected and constructed in such a way as to test the various features of each search engine. Some of the search queries were phrases, while others required Boolean logic, truncation or field searching capabilities. Nine out of the ten search queries were extracted from real reference questions.

The authors evaluated the performance of each search engine in terms of precision of results, response time, output options, documentation and interface, paying special attention to the criterion of precision. They downloaded the first 10 documents for each query and assessed their precision, giving a score of 1 to highly relevant documents, 0.5 to fairly relevant documents and 0 to irrelevant documents. After assessing the precision score for each query, they calculated the average precision score across all ten queries for each search engine.
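
As an illustration of this scoring and averaging scheme, the short Python sketch below computes a mean precision score from hypothetical relevance judgements (the query labels and scores are invented for the example and are not data from Chu and Rosenthal's study):

# Hypothetical relevance judgements for the first 10 hits of two queries:
# 1 = highly relevant, 0.5 = fairly relevant, 0 = irrelevant.
judgements = {
    "query 1": [1, 1, 0.5, 0, 0, 1, 0.5, 0, 0, 0],
    "query 2": [0.5, 0, 0, 1, 0, 0, 0, 0.5, 0, 0],
}

# Precision for one query = sum of the scores / number of hits examined.
precision_per_query = {q: sum(s) / len(s) for q, s in judgements.items()}

# Mean precision for the search engine = average of the per-query scores.
mean_precision = sum(precision_per_query.values()) / len(precision_per_query)

print(precision_per_query)
print(round(mean_precision, 2))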

The conclusion of their study was that, among the three selected search engines, Alta Vista is the one with the highest precision in its returned results. As far as the other two search engines are concerned, they offer a plethora of features that users can take advantage of, such as the concept search of Excite or the very good documentation and interface of Lycos. The weak point of this study was that there was no attempt to evaluate recall. Chu and Rosenthal (1996) rationalize their decision to deliberately omit the criterion of recall by arguing that it is not possible to calculate how many relevant items exist for a particular query in the huge and ever-changing Web.

Ding and Marchionini (1996) conducted a comparative study of the performance of three popular Web search engines: InfoSeek, Lycos and Open Text. To evaluate the performance of the selected search engines, they used five queries, three of which were randomly selected from a question set for Dialog online searching exercises in an information science class; the other two queries were formulated on the basis of their personal interests. According to the authors, all five queries were open-ended. In order to get the best search, syntax specific to each search engine was used for each query.

The selected search engines were evaluated for precision of the returned results, duplication in the retrieved sets, invalid links and the degree of overlap between search engines. This evaluation was performed by analysing the first twenty hits that every search engine returned. Ding and Marchionini (1996) used a six-point scale to rate the relevance and the quality of the three search services. More specifically, the measures they defined for the purposes of their study were precision, salience and relevance concentration.

As far as the measure of precision is concerned, the authors distinguished three types of precision, in order to record statistically significant differences in the precision variants between the search engines and to reveal whether, and to what degree, a complex query can affect the precision performance of each search engine. The measure of salience reported the sum of the ratings of all twenty hits for each search engine as a proportion of the sum of the ratings for all three search engines. The last measure, relevance concentration, reported the ratio of "good" items in the first ten hits to the number of "good" items in the first twenty.
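
For example (hypothetical figures): if the ratings of one engine's twenty hits summed to 40 while the ratings across all three engines summed to 100, its salience would be 40/100 = 0.4; and if 6 of its first ten hits were rated "good" compared with 9 "good" items among its first twenty, its relevance concentration would be 6/9 ≈ 0.67.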

Ding and Marchionini (1996) concluded that the performance of all three Web search engines was very similar, but in terms of mean precision and salience Lycos and Open Text were considered superior to InfoSeek. The limitations of their study were that they used only five queries and that they did not assess the selected search services for response time, accessibility and recall.

Among the studies conducted from 1995 to 1997, Venditto's (1996) study is considered to be the most quantitative. Venditto (1996) selected, examined and evaluated seven search engines which at that time were considered very popular for their overall performance. The selected Web search engines were Alta Vista, Infoseek, Lycos, Open Text, WebCrawler and WWW Worm. In this study twelve search terms were employed over a period of two weeks. The problem was that Venditto (1996) did not report how many queries were used.

The seven search engines were assessed for the relevance of the first twenty-five results returned for each query. In addition, each of the selected search engines was tested on how capable it is of coping with complex query statements. To do so, known sites were identified for a given subject, a search query was then formulated in natural language, and finally it was examined how many of the identified sites each search engine managed to retrieve. This is a very interesting approach, but it can be argued that it introduces many inconsistencies and much bias, since it does not record crucial information about the test environment. The currency of the search engines was also examined, by employing a query that reflected news events which were important at that particular time.

Venditto (1996) concluded that all seven search engines performed well when simple queries were submitted, but with complex queries some of the search engines performed poorly. According to Venditto (1996), this suggests that the relevance ranking methods employed in certain search engines were not very effective and that the relevance of each hit was partly based on the site's relative popularity. As far as the relevance results are concerned, InfoSeek was found to be the best, while Alta Vista produced the most comprehensive results. However, Venditto (1996) did not report the exact statistics of his study.

Zorn et al. (1996) conducted a comparative study which aimed to examine

    and evaluate the advanced search features of four Web search engines. They decided

    to select Alta Vista, InfoSeek, Lycos and Open Text Index. These search engines

    were selected for their popularity and for the advanced search features that each

    claims to support.

Zorn et al. assessed these Web search engines for complex Boolean logic, limiting retrieval by fields, proximity, phrase searching, duplicate detection and truncation. In their study they devised and employed three sample searches which involved and required the advanced search features of each search engine. While they provide a rather detailed discussion of the performance of each search engine, the number of searches employed is too small to support an analysis of the results and findings. Also, it appears that there was no quantitative evaluation of relevancy.

In their conclusion they argue that no single Web search engine can be considered the best, since each one has its own weaknesses and strengths. However, they found Lycos and Alta Vista to have the best performance as far as the number of URLs is concerned.

The study conducted by Tomaiuolo and Packer (1996) can be considered the most quantitative. The authors selected five search engines (Magellan, Alta Vista, Point, Lycos and InfoSeek) and assessed them by employing two hundred (200) queries. The selection of these search engines was based on the popularity of each, although Magellan and Point were selected because they had the ability to review and evaluate the Web pages that they index.

The subject matter of the queries employed was based on undergraduate topics. The document cut-off was set at the first ten hits, and Tomaiuolo and Packer (1996) evaluated these first ten hits returned by each search engine for relevance. The total number of pages that each search engine returned for each individual query was also recorded. The authors relied on the microaverage method to produce a mean relevance ratio for each search engine. As far as their findings are concerned, Alta Vista had the best relevance performance, followed by Lycos, InfoSeek, Point and Magellan. They also noted that some of the tested search services (Point and Magellan) failed to retrieve at least ten hits for some of the queries employed.

Lindop et al. (1997), writing for the U.K. edition of PC Magazine, conducted a "lab test" which involved the review and comparison of 11 search engines. The test suite, which was developed for the purposes of evaluating the selected search engines, was undoubtedly subject to various types of bias. More specifically, the testing methodology involved a team of testers who evaluated the selected search engines by carrying out simple keyword searches or advanced searching, such as Boolean or field searching.

Apart from the problem of an inadequate and biased test suite, no record was kept of crucial information about how the searching was conducted; for instance, the number of searches carried out was never revealed. The problematic and inadequate methodology is also revealed by the fact that the hits returned by each search engine were never formally assessed for relevancy. Instead of using a set of criteria for assessing the retrieval performance of the search engines, the testing team kept only a record of the number of results retrieved for their queries and another record of their impressions of the search refinement and online documentation.

The evaluation procedure of this "lab test" ended with a usability score based on the testers' satisfaction in using each search engine. Additionally, each search engine was awarded a score for every feature it supported from a list, such as Boolean searching, proximity or field searching. The testing team concluded that Alta Vista was the best search engine, especially in terms of usability and additional features.

One of the most comprehensive and complete studies was carried out by Leighton and Srivastava (1997). Their study employed one of the most carefully designed and developed methodologies for assessing the performance of the selected Web search services. Leighton and Srivastava (1997) selected five Web search engines, namely Excite, Alta Vista, InfoSeek, HotBot and Lycos, and assessed them using fifteen (15) queries.

The test suite that they employed in their study is considered one of the most complete. More specifically, they used a combination of structured and unstructured queries and tried to make the subject matter of the queries as wide as possible. Relevancy categories were also developed before the evaluation of the pages took place, in order to avoid or minimize possible bias. Moreover, they devised a method to "blind" the pages that each search engine returned, so that the evaluator could not know which page came from which search engine. To do so, a script written in PERL was employed; it automatically fetched the results that each search engine had produced and hid the name of each search engine. Thus, the evaluator assessed the results without knowing which search engine had produced them. Their findings are contained in a very detailed report, in which they also recorded every possible detail about the environment in which the test took place.

In addition, they conducted several experiments on the same data in order to compare the selected search engines using a variety of definitions. In their study, Friedman's randomized block design was used to perform multiple comparisons for significance. The analysis showed that Alta Vista, Excite and InfoSeek were the top three search services, with their relative rank changing depending on how one interpreted the concept of "relevant". Correspondence analysis showed that Lycos performed better on short, unstructured queries, while HotBot performed better on structured queries.

Another very comprehensive study is that of Clarke (1997), who employed an experimental methodology designed to estimate the precision and recall of World Wide Web search engines. The search engines selected for this purpose were Alta Vista, Lycos and Excite. Clarke (1997) employed a TREC-inspired methodology in order to estimate the recall of the selected Web search engines.

Clarke (1997) evaluated only relative or comparative recall, since determining absolute recall is impossible in the huge and ever-changing Web. Relative recall was determined by conducting a second search for known pages in each search engine: a "pool" of relevant documents was identified for each individual query, and each search engine was measured on the basis of how many relevant documents it managed to retrieve from this "pool".
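
A minimal sketch of this pooled, relative recall calculation is given below; the pool and the per-engine result sets are hypothetical and serve only to illustrate the arithmetic described by Clarke (1997):

# Pool of relevant pages for one query, built from the results of all engines.
pool = {"url1", "url2", "url3", "url4", "url5"}

# Relevant pages that each engine actually retrieved for that query (hypothetical).
retrieved = {
    "Engine A": {"url1", "url2", "url4"},
    "Engine B": {"url2"},
    "Engine C": {"url1", "url3", "url4", "url5"},
}

# Relative recall = relevant pages retrieved / size of the relevant pool.
for engine, pages in retrieved.items():
    relative_recall = len(pages & pool) / len(pool)
    print(engine, round(relative_recall, 2))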

The first ten pages of the returned results in each search engine were evaluated for relevance using a three-point scale. Results were presented in tables, and the Friedman nonparametric statistical test was performed to determine the significance of the results. Clarke (1997) found that Alta Vista achieved the best mean precision score and Lycos the worst, although Alta Vista's precision performance was significantly different only from that of Lycos. Excite achieved the best mean recall performance and Alta Vista the worst, although there was no significant difference in the recall performance of the three search engines. Clarke (1997) concluded that Alta Vista was marginally the best search engine, in agreement with previous studies. The main conclusion, according to Clarke (1997), is that it is possible to apply the pooled recall approach to estimate the relative recall of Web search engines.

A very different comparative study of Web search engines was conducted by Courtois and Berry (1999). Their aim was to test how five major Web search services retrieve and rank documents in answer to users' search queries. Their main rationale for conducting such a comparison is that each search engine ranks or sorts its results according to a specific set of criteria, namely its ranking algorithm.

Furthermore, according to the authors, result ranking has a major impact on users' overall satisfaction with Web search engines and on the way they retrieve the relevant documents from the results list. Courtois and Berry (1999) note that the majority of analogous studies share a common methodology, which consists of examining and evaluating the relevancy of the first 10 or 15 hits returned by a search engine. While the authors recognize that this is an effective and feasible methodology for determining precision, they argue that, according to their experience, this is not the way users actually make use of their results list. Furthermore, the authors attempt to justify their different methodological approach by explaining that most users are likely to scan and retrieve only selected documents. However, this is a rather weak point of their study, since it is based only on their personal experience and on a similar study conducted by Koll (1993).

On this basis, Courtois and Berry (1999) developed a test suite and an appropriate methodology, which consisted of three criteria for testing relevance ranking (a short illustrative sketch of these checks is given after the list):

    1. All Terms: Are documents that contain all search terms ranked higher

    than documents that do not contain all search terms?

    2. Proximity: For documents that contain all search terms, are

    documents that contain search terms as a contiguous phrase ranked

    higher than documents that do not?

    3. Location: For documents that contain all search terms, are documents

    that contain search terms in the title, headings, or metatags ranked

    higher than documents that contain terms only within the body of the

    document?
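
The three criteria can be expressed as simple checks on a retrieved document. The Python sketch below is a rough illustration of such checks (the document fields and the naive contiguous-phrase test are assumptions made for this example, not the exact procedure Courtois and Berry used):

def all_terms(doc, terms):
    # All Terms: does the document contain every search term?
    text = (doc["title"] + " " + doc["body"]).lower()
    return all(t.lower() in text for t in terms)

def proximity(doc, terms):
    # Proximity: do the search terms appear as one contiguous phrase?
    text = (doc["title"] + " " + doc["body"]).lower()
    return " ".join(terms).lower() in text

def location(doc, terms):
    # Location: does a prominent field (here, the title) contain all the terms?
    return all(t.lower() in doc["title"].lower() for t in terms)

# A hypothetical retrieved document split into two fields.
doc = {"title": "Greek Web search engines", "body": "An evaluation of Greek search services."}
terms = ["Greek", "search", "engines"]
print(all_terms(doc, terms), proximity(doc, terms), location(doc, terms))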

    For comparison they selected five search engines which scored highly in many

    comparison tests conducted by popular computer magazines, namely AltaVista,

    HotBot, Excite, Infoseek, and Lycos. They identified and selected 12 search queries

    to test the particular search engines and they downloaded the first 100 hits of each

    search.

Based on a further analysis of the downloaded documents against the above test criteria, the authors concluded that the ranking performance of all the engines was generally good. In the Proximity and Location tests most of the search engines performed worse than in the All Terms test, which raises some very interesting questions about the ranking algorithm of each search engine. Ultimately, Courtois and Berry (1999) suggest a very interesting methodology for evaluating the quality and reliability of the ranking algorithms of Web search engines, and their results raise some serious considerations from the perspective of the end user.

Another experimental-comparative study is the one conducted by Gordon and Pathak (1999). In their study, Gordon and Pathak (1999) distinguish between two types of search engine evaluation: testimonials, encompassing informal and impressionistic appraisals and feature-list comparisons; and shootouts, which correspond more closely to traditional information retrieval effectiveness experiments.

Following this definition, Gordon and Pathak (1999) presented a table of twelve earlier shootout studies, but identified only three (including their own) which make use of "appropriate experimental design and evaluation". Of these, that of Gordon and Pathak (1999) is the most comprehensive and the most recent. Gordon and

    Pathak obtained thirty-three (33) real information needs from volunteers among the

    faculty members in a university business school. These were recorded in

    considerable detail and passed to skilled search intermediaries who were given the

    task of generating near-optimal queries for each of eight search engines by an

    interactive, iterative process. The top twenty (20) results produced by each of the

    engines in response to the final queries were then printed and returned to the

originating faculty member for assessment on a four-point relevance scale.

    Gordon and Pathak (1999) presented a list of seven evaluation features which

    they claim should be present to maximise accuracy and meaningfulness of

evaluation. Very briefly, these features can be listed as follows:

    1. Searches should be motivated by genuine user need.

    2. If a search intermediary is employed, the primary searcher’s information need

should be captured as fully as possible and transmitted in full to the intermediary.

    3. A large number of search topics must be used.

    4. Most major search engines should be included.

    5. The most effective combination of specific features of each search engine

    should be exploited. This means that the queries submitted to the engines

    need not be the same.

    6. Relevance judgments must be made by the individual who needs the

    information.

    7. Experiments should be well designed and conducted.

8. The search topics should represent the range of information needs both with

    respect to subject and to type of results wanted.

Of these features, some are very interesting, but others are debatable, such as feature 5. It appears that Gordon and Pathak (1999) are questioning the general practice of evaluating Web search engines on the basis of the results produced by a set of query words without special syntax or special operators. However, it is more reasonable to compare the quality of results produced by search engines given identical input queries in this particular form, rather than attempting to find the best search query for each search engine and then comparing them. After all, typical users avoid using special operators or special syntax in their queries. Gordon and Pathak concluded that search effectiveness was generally low, that there were significant differences between engines, and that the ranking of engines was to some extent dependent upon the strictness of the relevance criterion.

4. METHODOLOGY

    4.1. Introduction

    According to Van House et al. (1990) evaluation is the process of identifying

    and collecting data about specific services or activities, establishing criteria by which

    their success can be assessed and determining both the quality of the service or the

    activity and the degree to which the service or activity accomplishes stated goals and

    objectives.

    The process of evaluation, as defined above, is being widely used in

    traditional databases, CD-ROMs and other online information retrieval systems in

    order to assess their overall quality and performance. However, evaluation of

    performance of Web search engines is a new area within the context of information

retrieval. According to Dong and Su (1997), studies concerning the assessment of Web search services began in 1995. The review of previous literature has

    revealed that the majority of such studies have been conducted between 1995 and

    1997. This review has also revealed that in general three types of methodologies

    have been employed in assessing the performance of Web search services: actual

    tests with data collection and analysis, evaluative comments with examples of simple

    searches and review of functions of different search engines without examples or

    some other kind of tests.

Further to this, it could be argued that methodologies making use only of simple tests and reviews of search engine functions were employed mostly in the earlier studies, while the majority of the more recent studies employ actual tests, with data collection and analysis, to assess the overall performance of Web search services.

In their study, Gordon and Pathak (1999) distinguish between only two types of search engine evaluation methodologies: testimonials and shootouts. Testimonials are generally conducted by the trade press or by computer industry organizations that "test drive" and then compare search engines on the basis of speed, ease of use, interface design or other features that are readily apparent to users of the search engine. Another type of testimonial evaluation comes from looking at the more

    technical features of search engines and making comparisons among them on that

    basis. Such testimonials are based on features like the set of search capabilities

    different engines have, the completeness of their coverage or the rate at which newly

    developed pages are indexed and made available for searching.

Despite the fact that testimonials can give users some useful information when deciding which search engine to employ, they can only indirectly suggest which search engines are most effective in retrieving relevant Web pages. For an overall evaluation of the performance of Web search engines, shootout methodologies appear to be more appropriate. More specifically, in shootouts different search engines are actually used to retrieve Web pages and their effectiveness in doing so is compared. Shootouts resemble the typical information retrieval evaluations that take place in laboratory settings to compare different retrieval algorithms, although Internet shootouts often consider only the first 10 to 20 documents retrieved, whereas traditional information retrieval studies often consider many more (Gordon and Pathak, 1999).

    4.2. Setting up the evaluation criteria

The special features of Web search engines in indexing technique, resource coverage, relevance ranking, search strategy, hyperlinks and interface have led some information retrieval researchers to the conclusion that the evaluation measures should be different from those used for traditional online databases and CD-ROMs. While this seems sensible, it does not necessarily mean that measures and criteria employed for the evaluation of traditional online systems are inadequate for the evaluation of interactive information retrieval services, specifically Web search engines. The six criteria that Lancaster and Fayen (1973) once listed for the evaluation of information retrieval systems (1. Coverage, 2. Recall, 3. Precision, 4. Response time, 5. User effort and 6. Form of output) are still quite applicable to modern, interactive information retrieval systems, despite the fact that they were set out three decades ago. In their study, Chu and Rosenthal (1996) employed a set of criteria based on those listed by Lancaster and Fayen (1973). Their justification for employing criteria and evaluation measures used in traditional online information retrieval systems is that the Web can also be described as an information storage and retrieval system characterized by its enormous size, hypermedia structure and distributed architecture.

Furthermore, the review of previous studies has revealed that the vast majority of them employ in their methodologies evaluation criteria such as precision, output options and response time, which are commonly used in the assessment of traditional online information retrieval systems (Chu and Rosenthal, 1996; Winship, 1995). Ultimately, Su (1992) stated that, in general, criteria for evaluating interactive information retrieval systems include relevance, utility, efficiency and user satisfaction, but the truth is that there is no agreement as to which of the existing criteria or measures are the most appropriate for evaluating interactive information retrieval performance.

    4.3. Text REtrieval Conference (TREC)

At this point it is appropriate to point out the importance and the contribution of the Text REtrieval Conference (TREC) to information retrieval research. Some of the previous studies concerning the evaluation of Web search engines used TREC-inspired methods (Clarke, 1997; Hawking et al., 2001).

The first Text REtrieval Conference was held in November 1992 at the National Institute of Standards and Technology (NIST) (Harman, 1993). The purpose of the conference was to bring together researchers from the field of information retrieval to discuss the results of their systems on a new large test collection (the TIPSTER collection). TREC gave researchers the opportunity to compare results on the same data using the same evaluation methods. Moreover, it represented a breakthrough in cross-system assessment in the field of information retrieval. It was the first time that most of these researchers had used such a large test collection, which therefore required a major effort by all of them to scale up their retrieval techniques (Harman, 1995).

The overall goal of the TREC programme is to encourage research in

information retrieval using large test collections. It is hoped that by providing a very large test collection and encouraging interaction among researchers in a friendly evaluation forum, new impetus in information retrieval will be generated. Moreover, it was

    hoped that the participation of groups with commercial information retrieval systems

    would lead to an increased technological transfer between the research laboratories

    and the commercial products (Harman, 1995).

    In the second TREC, which took place in August of 1993, two types of

    retrieval were tested: retrieval using an “ad hoc” query such as a researcher may use

    in a library environment and retrieval using a “routed” query. Routed queries are

    considered to be queries which are extracted from specified topics and then tested

    against a set of “training” documents where relevant documents are identified. With

    this process an optimal query is generated and can be tested against new data

(Beaulieu et al., 1996). In contrast, “ad hoc” queries are considered to be new queries which are tested against an existing set of data without prior knowledge of relevant

    documents. The assessment of the results was based on traditional recall and

    precision criteria. The queries employed in the present study were all “ad hoc”.

    4.4. The influence of TREC

    It appears that the main concern of the majority of the previous studies was

    the efficiency of the methodology, the evaluation criteria and the development of the

test suite. In the current study much effort was made to ensure that the methodology and the evaluation measures were as efficient as possible, so as to produce an accurate and meaningful evaluation of the quality of the results returned by the selected Web

    search engines. Therefore, the methodology and evaluation criteria employed in this

    study were influenced by those used in previous studies and by TREC experiments.

4.5. Development of the test environment

    The main purpose of the current study is to assess the overall performance of

    the selected Greek search engines and to compare their performance with an

excellent performer, Google. To do so, a wide range of evaluation criteria should ideally be considered. However, the limited time available for this study meant that only those criteria were selected which were considered to reflect best the overall performance of each search engine, together with those criteria regarded as being of significant importance in previous studies, such as interface and precision (Dong and Su, 1997).

    Another concern was the time proximity of the searching. According to

    Leighton and Srivastava (1997) the goal of close time proximity of the searching

    should be taken into consideration in order to ensure the objectivity and accuracy of

    the evaluation of the returned results. The closer in time that a query is executed on

each of the selected search engines, the better. The rationale behind this tactic is that if a relevant page were to become available between the time one engine was searched and the time a second was searched, this would result in an unfair situation where the second search engine would have the opportunity to have this new page indexed and consequently retrieved. According to previous researchers who had conducted similar studies (Chu and Rosenthal, 1995; Ding and Marchionini, 1996; Leighton and Srivastava, 1997), close time proximity characterizes the quality of the methodology and evaluation procedures employed in such studies. According to them, the ideal situation would be for each query to be executed on all the selected search engines simultaneously.

In the current study, all three search engines were searched for a particular query on the same day, and each query was performed on each search engine within at most an hour. A second round of searches was conducted immediately after the first for each query in order to evaluate recall, which will be discussed later in more detail. Again, the time limit was considered to be of major importance, since the selected search engines claim to update their database indexes on a weekly or even a daily basis.
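
The following minimal sketch (illustrative only: the query URL patterns and file-naming scheme are assumptions made for illustration, not the actual interfaces of Google, Anazitisis or Trinity, nor the exact procedure followed in the study) shows the kind of back-to-back routine described above, in which a single query is submitted to all three engines in immediate succession and the first results page of each is saved for later checking.

    # Illustrative sketch in Python. The URL patterns below are hypothetical
    # placeholders; they are not the real search interfaces of the three engines.
    import time
    import urllib.parse
    import urllib.request

    ENGINES = {
        "google": "https://www.example-google.test/search?q={q}",          # assumed
        "anazitisis": "https://www.example-anazitisis.test/search?q={q}",  # assumed
        "trinity": "https://www.example-trinity.test/search?q={q}",        # assumed
    }

    def run_query_on_all_engines(query):
        """Submit one query to every engine back to back and save each results page."""
        for name, pattern in ENGINES.items():
            url = pattern.format(q=urllib.parse.quote(query))
            stamp = time.strftime("%Y%m%d-%H%M%S")  # records how close in time the runs are
            with urllib.request.urlopen(url) as response:
                html = response.read()
            # Save the first results page as an .htm file for later verification.
            with open("{0}_{1}.htm".format(name, stamp), "wb") as out:
                out.write(html)
            print(name, stamp)

    run_query_on_all_engines("Internet Banking strategy")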

Similar to the point of close time proximity, there was also the goal of checking the pages cited in the results from the Web search engines as quickly as possible after the results had been obtained. This was considered to be of the same importance as the previous goal of close time proximity of the searching. The reason behind this objective is that the longer one waits after the results have been obtained, the more likely it is that some pages which were active during the searching will have been removed from the Web. Thus the tested search engine would be assessed unfairly by the evaluator (Leighton and Srivastava, 1997).

The relevance judgements were performed immediately, and it only took about thirty minutes to evaluate the first ten returned results for each search engine. The precision scores were also assigned within those thirty minutes. For efficiency reasons the results were saved as “.htm” files in case there was a need to check the URLs again or to validate assessments. The relevance judgements were

    also another area of great concern during the evaluation of the selected Web search

    engines. The relevance judgment is the weak point of the evaluation procedure in the

    majority of similar studies. The main problem of these studies is the person who is

    responsible for assessing the relevance of the returned results.

    Moreover, in most of these studies, the author or the authors were the

    evaluators of the returned results. During this step of the evaluation procedure bias,

    both conscious and unconscious, can enter and distort the objectivity and accuracy of

    the relevance judgment and thus the precision score of the tested search engines. For

    example, if the subject matter of the selected queries is wide there is a serious

concern over whether the evaluator has an adequate knowledge background to assess the

    relevance of the returned results. In order to overcome this flaw, many researchers

    decided to select queries with narrow subject matter (Clarke, 1997). While this

    approach makes possible the extraction of accurate and meaningful relevance

    judgements, nevertheless there is always the peril that the returned pages would be

    from only one portion of the Web.

In the present study, the evaluation procedure employed was inspired by the previous study of Gordon and Pathak (1999), which introduced intermediate researchers or evaluators in an attempt to circumvent the risk of distorted relevance judgments. More specifically, the returned results were assessed by six fellow students from the Department of Information Studies at the University of Sheffield, each with an adequate knowledge background in a specific subject area (economics, software engineering, librarianship, history and archaeology, mathematics and management).

Furthermore, many of the queries used in the current study were real reference questions drawn from the research dissertations for their Master’s degrees. Thus, they were evaluating the results both as researchers and as end-users. As

    it was mentioned previously, this procedure was considered necessary and was

    employed in order to ensure the highest possible degree of accuracy and objectivity

    in the evaluation of the results.

Searches were carried out on PCs at the St. George I.T. centre at the University of Sheffield. Access to the World Wide Web was provided through the University’s LAN (Local Area Network), and the Web browser used was the latest version of Internet Explorer (version 5.5).

    4.6. Sample queries suite

    This methodology, as it has already been described, involves a set of sample

    queries that will be employed in order to test and assess the overall performance of

    the selected Web search engines. Thus, the development of the sample queries suite

    is a sensitive step within the development of the evaluation procedure, which can

    potentially affect the performance of the tested search engines (Ding and

    Marchionini, 1996).

4.6.1. Number of queries

As mentioned above, the searching procedure was designed in such a way as to minimize the possibility of favouring the search engine examined first or the

    one examined last. Thus, a compromise between the number of queries employed

and the number of documents assessed for each search engine was considered to be of crucial importance, so that each individual query could be processed within a reasonable time limit. This time limit was necessary in order to ensure that the

search engine’s indexes would not change during the evaluation process of each

individual query. Taking this into consideration, twenty was regarded as a feasible number of queries to be evaluated in the rather limited time. The initial intention, before designing the evaluation methodology, was to use at least 25 queries, in order for the subject matter of the queries to be as wide as possible. However, problems were found in keeping to the required time limit, and so the number was limited to 20 queries.
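
As a rough indication of the workload implied by this choice (simple arithmetic based on the twenty queries, the three search engines and the first-ten cut-off described in section 4.7.1), each round of searching requires

\[ 20 \ \text{queries} \times 3 \ \text{search engines} \times 10 \ \text{results} = 600 \ \text{pages to be examined.} \]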

4.6.2. Query subject matter

Previous studies suffered from a lot of bias regarding the subject matter of the

    queries employed. Many of the researchers decided to use a wide subject matter

    (Chu and Rosenthal, 1995; Ding and Marchionini, 1996) but there is always the

question of how capable the evaluator is of making accurate and meaningful relevance judgments on subject topics which require an appropriate knowledge background. Other researchers (Clarke, 1997), in order to avoid this risk, deliberately employed queries with narrow subject matter. While this allowed them to make accurate and meaningful relevance judgements, on the other hand it could be argued that only one portion of the Web was tested for retrieval (Clarke, 1997).

In the current study an effort was made to avoid such a risk by increasing the number of evaluators to six, each of whom has a different knowledge background. Moreover, most of the queries used in the current study were extracted from real reference questions used in the evaluators’ research dissertations. The described evaluation procedure was thus designed to ensure that the relevance judgments from the evaluators would be as accurate as possible while allowing the subject matter of the queries to be as wide as possible.

    Apart from that, some of the queries’ subject matter involves issues of Greek

    culture and civilization. The underlying reason behind this approach is that Greek

    search engines are being used every day by hundreds of users searching the Greek

    domain for information which mostly involves issues of Greek culture and

    civilization. So it would be very interesting to test these two most popular Greek

    search engines (Anazitisis and Trinity) in this particular subject matter, to evaluate

their performance and compare it with a class-leading search engine such as Google. Of course, it can be argued that such a comparison may be quite unfair to Google, due to the fact that the Web robots of the tested Greek search engines are primarily focused on indexing the Greek Web, and so an advantage for Anazitisis and Trinity in these particular queries can be expected. While this may be true, it should be borne in mind that this was considered necessary for the purposes of the evaluation in this particular study. Moreover, this study also attempts to explore the capabilities and the

    limits of the tested search engines and examine how each search engine can cope

    with queries consisting of Greek and English together.

4.6.3. Query formulation and search expression

The decision over the query formulation and the search expression that would be entered in each search engine proved to be quite difficult. Previous studies have suffered from a lot of bias here. Some of the researchers who have conducted similar

    studies (Chu and Rosenthal, 1996; Tomaiuolo and Parker, 1995) decided to compile

    the selected queries and use syntax specific to each of the tested search engines or

    use the so-called “advanced mode” feature, if it was available. However, in their

    study, Leighton and Srivastava (1997) tried to be more systematic and carefully

examined, before conducting the actual test, what search expression should be submitted to the selected search engines. They decided to use simple queries such as those that an ordinary user would enter because, according to them, this kind of query forces the search engine to do more of the work, ranking the results by its own algorithm rather than by the constraints specified by the operators.

    Moreover, according to Hawking et al. (2001) all well-known public search

    engines are designed to produce a list of results when a set of simple queries (without

operators or special syntax) is typed into the search box provided by the primary

    interface of the search engine. So, it seems to be more realistic to compare the

    quality of the results returned by search engines given identical input queries in this

    particular form. Furthermore, the examination of query logs (Silverstein et al., 1999)

    has revealed that most of the users do not use any form of query operators or the

    “advanced mode” if this is provided. Additionally, Silverstein et al. (1999)

    concluded that in most of the cases, when the users attempted to enter queries using

query operators they were making a lot of errors. Thus, while studies which adopt

    the approach of trying to find the best query formulation for each search engine are

    very interesting, they also introduce conscious and unconscious bias, which can lead

to unfair comparison among the tested search engines. The query formulation employed in the current study therefore consists of “simple queries”, in an attempt to minimize and isolate possible unfairness that can enter at this step.

4.6.4. Further analysis of the sample queries

As has previously been described, the queries employed in the current study have been extracted from real reference questions, and an effort has been made to keep their subject matter as wide as possible. The fact that this study attempts to examine and evaluate the most popular Greek search engines imposes some requirements on the queries employed. More specifically, the subject

    matter of some of the sample queries is related to the Greek culture, history and

    civilization. Some queries are entirely in Greek language, while some other queries

    contain a combination of Greek and English search keywords. The introduction of

queries such as those related to issues of Greek civilization was considered to be of significant importance, since Greek search engines are used daily to unearth information on precisely these issues.

Apart from that, another area that this study examines is how the selected Greek search engines cope with the combination of Greek and English words within the same query. This area was considered to be of particular importance for the current study, because the first Greek search engines faced a variety of problems with the combination of Greek and English words within an individual query. So it is very interesting to examine how much the selected Greek search engines have improved in such a “sensitive” area, in which they claim to have overcome the problems of the past. Furthermore, it is also very interesting to examine the searching performance of Google in another language, Greek, since it claims to be capable of doing so.

    Further to this point, problems can arise from the complicated and

sophisticated grammatical and syntactical structure of the Greek language. More precisely, many words, such as nouns, can be used within a sentence in a variety of forms, called cases (e.g. the nominative case, the accusative case, etc.). This is also the situation for many other words, such as adjectives and pronouns. This variation in the forms of particular words within a sentence in the Greek language is used to alter the meaning of an individual sentence or to illustrate a special relationship between the words within it. In every case, the possible variations in the forms of words in the Greek language are far more complicated compared to English, and thus obstruct the task of the Web search engines.
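
As a concrete illustration of this difficulty (a minimal sketch in Python, not part of the original study; the suffix list is a toy assumption and far simpler than any real Greek stemmer), exact keyword matching fails on inflected forms, whereas even a crude normalization of endings allows the genitive form used in query 17 to match the nominative form that may appear in a page:

    # Illustrative only: why exact matching fails on inflected Greek forms.
    # The suffix list is a toy assumption, not a real Greek stemming algorithm.
    GREEK_SUFFIXES = ["ας", "ες", "ης", "ος", "ου", "ων", "α", "η", "ο"]

    def crude_stem(word):
        """Strip one common inflectional ending from a Greek word (very rough)."""
        word = word.lower()
        for suffix in sorted(GREEK_SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    query_term = "τράπεζας"    # genitive case, as in query 17
    page_term = "τράπεζα"      # nominative case, as it may appear in a Web page

    print(query_term == page_term)                          # False: exact match fails
    print(crude_stem(query_term) == crude_stem(page_term))  # True: both reduce to "τράπεζ"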

    Problems such as the one previously described pestered the users of the first

Greek search engines. This specific language-related problem is of crucial importance, since it rendered the first Greek search engines unable to perform their tasks successfully. Unfortunately, there is no firm evidence to support this, because until now there has been no significant study evaluating the performance of the Greek search engines. As has already been

illustrated in the literature review section, only a very small number of studies have attempted to examine the Greek search engines, and all of them are of a descriptive nature. The modern Greek search engines, such as those selected for testing in the current study, claim to have overcome this problem by utilizing special

    software (e.g. Anazitisis) or special language techniques in the ranking algorithm

    (Trinity).

So, the sample queries part of the test suite has been developed specifically in the way just described, in order to test and assess how the advanced algorithms and special features of the selected Greek search engines and of Google cope with the combination of Greek and English words within an individual query. The list of the twenty sample queries employed in the current study is given below. The English translation of the Greek queries, and of the queries that use a combination of Greek and English words, is given in parentheses.

1. Internet Banking strategy
2. PEST factors in internet banking
3. Public libraries and learning disabilities
4. Customer relationship management
5. Object-Oriented management with UML
6. Customer-centric culture
7. Telecommunications infrastructure in Greece and Spain
8. Information retrieval and Web search engines
9. Backpropagation models of neural networks for information retrieval
10. Information modelling and SSADM methodology
11. Μηχανές αναζήτησης στο Web (Search engines in the Web)
12. Ο κόσμος του Internet (The world of the Internet)
13. Οι εκδόσεις του OECD (Publications of the OECD)
14. Αρχιτεκτονική επεξεργαστών Risk στους προσωπικούς υπολογιστές (Risk architecture CPU in personal computers)
15. Μεγάλοι έλληνες ρεμπέτες (Great Greek folklore musicians)
16. Ελληνική επανάσταση 1821 (Greek independence war 1821)
17. Εκδόσεις της τράπεζας Ελλάδος (Publications of the Bank of Greece)
18. O ελληνικός στοχασμός κατά το 19ο αιώνα (Greek philosophical meditation during the 19th century)
19. Nεοελληνικός διαφωτισμός και Ρήγας Φεραίος (Modern Greek enlightenment and Rigas Feraios [personal name])
20. Αντικειμενοστρεφή συστήματα βάσεων δεδομένων (object oriented database systems)

    4.7. Evaluation of returned pages

4.7.1. Document cut-off

The application of document cut-off practice in the evaluation procedure of

    on-line information retrieval systems is regarded as a necessary step before the actual

evaluation procedure takes place. A decision should be taken on the number of pages that will be assessed according to some predefined measures of evaluation. The necessity of document cut-off in evaluating the performance of on-line information retrieval systems is based on the reality that the output of these systems can run to hundreds or even thousands of returned pages. This is also the case for Web search engines. In most cases their returned results can easily amount to thousands of Web pages, which might not even be a large volume of information when taking into consideration the vast and ever-changing Web.

    In the current study, it was decided to evaluate the first ten hits of the results

list produced by each search engine. This decision was based on personal experience and on observing the behaviour of my fellow students towards the results lists produced by search engines. Virtually all of them tended to browse and examine only the first ten, or rarely the first twenty, hits returned by the search engine. This approach to the document cut-off practice also seems to be supported

by the vast majority of researchers who conducted similar studies. Chu and Rosenthal (1996), Scoville (1996) and Tomaiuolo and Packer (1996) tested the search engines on the first ten results, while others such as Ding and Marchionini (1996) or Gauch and Wang (1996) evaluated the search engines based on the first twenty results. So the common practice in similar studies is to examine and evaluate the search engines on the first ten, and sometimes twenty, returned results. The time factor should also be regarded as another major reason behind this common practice of document cut-off.

Since all the selected search engines display their results in a list in descending order of relevance, calculated in one way or another, it is considered that

    this should not critically affect the validity of the current study.

4.7.2. Measures of evaluation specific to the current study

As has previously been described and as can be inferred from the relevant literature, the evaluation criteria employed in studies attempting to assess the overall performance

    of on-line information retrieval systems can be considered as perhaps one of the

    weakest parts of these studies. The underlying reason behind this fact is that there is

    no common agreement as to which of the existing criteria or measures are the most

    appropriate for evaluating interactive information retrieval performance (Su, 1992).

    This statement can easily be supported by the review of previous studies.

More specifically, owing to the special features of Web search engines (interface, hyperlink structure, etc.), every researcher attempted to use criteria which were somewhat different from the traditional ones. For example, Taubes (1995) considered reliability, completeness and speed as the measures in the evaluation, Winship (1996) argued that record structure and search techniques had greater significance than retrieval performance, and others suggested that a powerful and usable interface, together with the quantity, precision and readability of the returned results, are the most important criteria for evaluating and rating search engines.

It should be clarified here that the fact that some researchers attempted to employ a rather different set of evaluation criteria in their studies does not necessarily mean that they rejected the traditional ones. In fact most of them included some of the traditional criteria in the whole set of evaluation measures employed in their study. It is simply that some other criteria, such as interface, search techniques, record structure and ease of use, were considered to have greater significance than the traditional ones when giving an overall performance

    score to the tested search engine. According to Dong and Su (1997), specific

    traditional criteria such as precision and response time are the most commonly used

    in the vast majority of studies comparing and evaluating the performance of Web

    search engines.

In the current study there was an effort to employ those criteria that would best reveal the overall performance of each of the tested search engines. At the earlier stages of the current study consideration was given to employing a wider set of criteria in order to assess the overall performance of the selected search engines. The motivation behind this thought was that the greater the number of evaluation measures employed the better, since such an exhaustive test would reveal almost every strength or weakness of the tested search engines. While this speculation seems to have some validity, the review of previous studies did not reveal any study employing a very large number of evaluation measures. The truth is that such a study risks being superficial, especially in the present case where the available time period was very limited. It was felt best to conduct a feasible test of some of the

    most representative performance measures rather than employing a large number of

    criteria to evaluate every feature and aspect of the search engine. So it was decided

    to evaluate the selected search engines on the basis of precision, recall, response

    time, validity of links, interface and documentation.

4.7.3. Precision

Precision and recall constitute the two most important traditional measures of retrieval effectiveness (Saracevic, 1975). Precision, or precision ratio, is defined as the proportion of retrieved documents that are judged relevant, meaning the number of relevant documents retrieved divided by the total number of documents retrieved. Cheong (1996) also notes that the percentage of the retrieved documents that the user judges relevant serves as a measure of the signal-to-noise ratio in certain kinds of system.
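
Expressed as a formula, this is simply the standard definition restated rather than a new measure introduced by the study; in the present study the denominator is, in practice, the first ten hits evaluated for each query (see section 4.7.1):

\[ \text{Precision} = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}} \]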

According to Dong and Su (1997), precision is considered very important in comparing and evaluating the performance of a search engine for two reasons. The first is that each search engine employs its own methods and techniques in collecting and indexing documents, and the fields of indexing also differ from one search engine to another. Thus, precision offers a way to identify which method or technique of indexing is the most efficient. The second reason is that automatic or machine-produced indexing, despite the modern sophisticated techniques or algorithms that are employed, cannot always cope successfully with words used in various contexts, resulting in the indexing of non-relevant items. Therefore, it can be argued that the relevance of the output to a user’s query can be an important indicator for assessing the quality and intelligence of an individual search engine.

Dong and Su (1997) argue that while precision has been widely used as a criterion in describing the relevance of search results in many studies, only a few of the studies conducted between the years 1995 and 1996 (Chu and Rosenthal, 1996; Ding and Marchionini, 1996) applied the precision criterion using a “standardised formula” (Dong and Su, 1997). The problem with this statement by Dong and Su (1997) is that they do not clarify properly what they mean by the term “standardised formula”. The fact is that the traditional way of evaluating precision needs to be reconsidered when dealing with the evaluation of Web search engines. More specifically, in all the previous studies which attempted to assess the performance of Web search tools, the calculation of the precision measure was performed on the basis of the first ten or twenty hits returned by the search engine.

In the current study it was felt best to assess the precision of the returned Web pages on a three-point scale, as Clarke (1997) did in his own study. More specifically, a score of 1 was given to very relevant documents, a score of 0.5 to somewhat relevant documents, and 0 to documents that were not relevant. Since six evaluators were used in the current study, every page was examined thoroughly before a precision score was assigned. Also, all the links in every page were examined, and not just one or two initial links.

In cases where a page consisted of a whole set of links, every link was also examined thoroughly. If these links, when followed, led to useful information resources, then a score of 0.5 was assigned to the page (Leighton, 1995).
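
As a minimal sketch of how the three-point scale can be turned into a precision score for the first ten hits (the function name and the example judgements below are hypothetical illustrations, not data from the study; one natural way to aggregate such scores is simply to take their mean over the ten hits):

    # Three-point relevance scale used in the study: 1 = very relevant,
    # 0.5 = somewhat relevant, 0 = not relevant.
    # The judgements below are invented purely for illustration.
    def precision_at_10(judgements):
        """Mean relevance score over the first ten returned hits."""
        first_ten = judgements[:10]
        return sum(first_ten) / len(first_ten)

    # Hypothetical judgements for one query on one search engine.
    example_judgements = [1, 1, 0.5, 0, 1, 0.5, 0, 0, 1, 0.5]
    print(precision_at_10(example_judgements))  # 0.55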