Greek Web Search Engines: An Evaluative and Comparative Study
A study submitted in partial fulfilment of the requirements for the degree of Master of Science in Information Systems
at
THE UNIVERSITY OF SHEFFIELD
by
Panteleimon Lilis
September 2002
Abstract
The present study is a first attempt to evaluate the overall performance of
Greek Web search engines and to compare them with a class-leading search engine, Google.
To do so, a methodology was designed and developed specifically for the purposes of the
comparison. More specifically, the three Web search engines were evaluated in terms of
precision, relative recall, validity of links, response time, interface and on-line
documentation; these criteria were selected and developed for the needs of the present
study. The queries were likewise developed for this particular study: twenty queries
were used, so that the query subject matter would be as wide as possible. In addition,
some of the queries were in Greek, some in English and some combined Greek and English
keywords, in order to assess how well the selected search engines cope with a variety of
different linguistic characteristics. The first ten pages of returned results from each
search engine were evaluated against the criteria described above. Results are presented
in tables, and the comparison was based on averaging, that is, on finding the mean score
for each search engine on each criterion. The conclusion is that the performance of the
three selected search engines is rather poor. Google was, in general, found to have a
better overall performance than Anazitisis and Trinity, while the two Greek search
engines performed almost identically.
List of Tables
List of Charts
1. INTRODUCTION
   1.1. World Wide Web search engines
   1.2. Aims and objectives of the current study
2. THE SELECTED SEARCH ENGINES
   2.1. Search engine selection
   2.2. Features of the selected search engines
      2.2.1. Google
      2.2.2. Anazitisis
      2.2.3. Trinity
3. LITERATURE REVIEW
   3.1. Introduction
   3.2. Review of comparative-experimental studies
4. METHODOLOGY
   4.1. Introduction
   4.2. Setting up the evaluation criteria
   4.3. Text REtrieval Conference (TREC)
   4.4. The influence of TREC
   4.5. Development of the test environment
   4.6. Sample queries suite
      4.6.1. Number of queries
      4.6.2. Query subject matter
      4.6.3. Query formulation and search expression
      4.6.4. Further analysis over the sample queries
   4.7. Evaluation of returned pages
      4.7.1. Document cut-off
      4.7.2. Measures of evaluation specific to the current study
      4.7.3. Precision
      4.7.4. Recall
      4.7.5. Response time
      4.7.6. Validity of links
      4.7.7. Interface
      4.7.8. On-line documentation
   4.8. Possible drawbacks, inconsistencies and bias of the specific methodology
5. RESULTS
   5.1. Calculations
      5.1.1. Averaging
      5.1.2. Precision scores
      5.1.3. Recall scores
      5.1.4. Response time
      5.1.5. Validity of links
6. ANALYSIS AND INTERPRETATION OF THE RESULTS
   6.1. Evaluation of the overall performance of the tested search engines
      6.1.1. Precision ratio
      6.1.2. Recall ratio
      6.1.3. Response time
      6.1.4. Validity of links
      6.1.5. Interface
      6.1.6. On-line documentation
7. CONCLUSIONS
   7.1. Limitations of the current study
   7.2. Some future recommendations
BIBLIOGRAPHY
APPENDIX - SEARCH ENGINES INTERFACE
List of Tables
Table 1: Precision scores
Table 2: Recall scores
Table 3: Response time
Table 4: Invalid links

List of Charts
Chart 1: Mean precision performance
Chart 2: Mean recall
Chart 3: Mean response time
Chart 4: Validity of links
1. INTRODUCTION
1.1. World Wide Web search engines
According to Chu and Rosenthal (1996), the World Wide Web has gained so
much popularity that it is the second most popular Internet application after e-mail.
The Web is used for a variety of purposes by many people around the world.
However, it can be argued that the Web is used for two main purposes (Clarke,
1997). The first is the publishing of information. Indeed, the fact that information
on the Web can be accessed by many people at the same time has made the
Web the world's largest information medium.
The second use of the Web is information retrieval (Clarke, 1997). More
specifically, in many respects the Web can be described as a huge
information storage system. In reality, however, its unstructured
and ever-changing nature has made information searching and retrieval
a very difficult task (Declan, 2000). Web search engines were developed to overcome
this difficulty by assisting the ordinary Web user in searching for and retrieving the
required information.

Web search engines came into existence in 1994, and since then at least
twelve have been developed for use on the Web. Search engines have variously been
referred to as search tools, search services, indexes, Web databases and search engines.
In the present study the term that will be used most is search engines, since this is
also the case for the majority of the studies reviewed.
1.2. Aims and objectives of the current study
This dissertation aims to evaluate two of the most popular Greek
search engines (Anazitisis and Trinity) and to compare them with a class-leading
Web search engine, Google. The main reason for conducting such a study is
the fact that no similar study has ever been conducted in Greece, meaning that there
is no particular information about how each of the Greek search engines performs.
Moreover, articles dealing with Greek search engines are very limited, and
all of them are reviews and thus descriptive in nature. This is because
Greek Web search engines are recent in comparison to search engines such as
Google or Alta Vista, and thus the relevant literature is very immature.

However, recent developments in the Greek search engines (Anazitisis; a new
ranking algorithm in Trinity) have raised some concern in the Greek Web community
about the performance of these search services. Thus, another reason for comparing
the two Greek search engines with Google is that this will give a measure
of how developed Anazitisis and Trinity really are, as opposed to what they claim.
After all, as Chu and Rosenthal (1996) state, the sheer number of such services
invites further research.
In order to achieve this aim, a methodology had to be designed and
developed. This required exploring and examining the relevant
literature so as to identify the necessary criteria and the appropriate test
environment to be developed. It is important to note that the methodology is the
most important part of the present study: its completeness ensures the objectivity
of the results and minimizes the risk of introducing bias, both conscious and
unconscious, as well as inconsistencies.
Furthermore, the researcher decided to design and develop a methodology
specifically for this study, since the comparison required a number of different
criteria and search engine features to be examined and evaluated thoroughly.
For example, of the queries employed, some were in Greek, some in English
and some combined Greek and English search keywords. Another example is that
the on-line documentation of each search engine was employed as an evaluation
measure, for reasons discussed in more detail in the methodology section.
2. THE SELECTED SEARCH ENGINES
2.1. Search engine selection
The researcher of the current study decided to select only three search
engines to test and evaluate. It can be argued that this number is rather small
in relation to some of the reviewed studies. However, the constraint on the number
of search engines selected was considered necessary for the following reasons.
First, it would allow a greater number of queries to be used, so that the subject
matter of the queries could be as wide as possible. Second, it would allow a larger
number of evaluation criteria to be employed in assessing the overall performance
of the selected search engines. Many of the reviewed studies are limited to the
usual measures of precision, recall and interface; but since the present study
attempts to examine, evaluate and compare the selected search engines in terms of
their overall performance, a larger number of evaluation criteria was called for.
The idea was to select two of the most popular and well-respected Greek
search engines and compare them with one class-leading search engine,
Google. The first Greek search engine selected is Trinity. It is one of the
most respected Greek search engines and is used by the most popular
portal of the Greek Web, www.in.gr. The second Greek search engine selected
is Anazitisis, a product of one of the most popular ISPs in Greece,
OTEnet. The selection of Anazitisis was based on the fact that it is a very new
search engine which gained popularity in a very short time. Anazitisis boasts
advanced ranking algorithms, impressive special features and special software
designed particularly to increase noticeably its searching and retrieval performance
in the Greek language. The characteristics and features of Google, Anazitisis and
Trinity are considered in more detail in the following section.
2.2. Features of the selected search engines
2.2.1. Google

The Google Web search engine was founded by Sergey Brin and
Lawrence Page, two graduate students in computer science at Stanford University in
California. In less than a year, their search engine became the most
popular on the Web, yielding more precise results for most queries than conventional
search engines. Google's database is huge; according to many sites and
resources on the Web, it must be the biggest search engine database in the
world. Google claims that its database holds over two million pages, but it may be
counting pages which are not fully indexed.
One distinguishing characteristic of Google is its searching and retrieval
speed or, more formally, its very low response time. According to Google's homepage
this can be attributed partly to the efficiency of its search algorithm and partly to the
thousands of low-cost PCs that have been networked together (so as to form a
powerful computing grid) to create a very fast search engine. The other most
distinguishing characteristic is its ranking algorithm.
As far as its ranking algorithm is concerned, Google is unique among Web
search engines. More specifically, Google's ranking algorithm is based on how
many other pages link to each page, along with other factors such as the proximity of
the search keywords or phrases in the documents. It uses not only the number of
other pages that link to a page, but also the importance of those linking pages,
which is in turn evaluated from the links to each of them. This means that no one
is able to artificially influence the ranking of his or her page in Google, something
which is quite possible in some other search engines and directories. This innovative
approach takes its inspiration from the citation analyses used in scientific literature
(Declan, 2000) and is based on the principle of "bibliographical coupling" (Skandali,
1990).
Google embodies these principles in its ranking algorithm, "PageRank",
which has been the topic of many discussions, although so far there is no clear
evidence of how exactly it works. In general, the PageRank (PR) is calculated for
every webpage in Google's database. The calculation of the PR for a page is based
on the quantity and quality of the webpages that contain links to that page.
According to the co-founders of Google, Sergey Brin and Lawrence Page, the PR of
a webpage is calculated using this formula:

PR(A) = (1 - d) + d * SUM(PR(I) / C(I))

where:

PR(A) is the PageRank of page A;
d is the damping factor, usually set to 0.85;
PR(I) is the PageRank of a page I that contains a link to page A;
C(I) is the number of outbound links on page I;
PR(I) / C(I) is the PR value that page A receives from page I;
SUM(PR(I) / C(I)) is the sum of the PR values page A receives from all pages that link to it.

More explicitly, the PR of page A is determined by the PR of every page I that has a
link to page A. For every such page I, the PR of I is divided by
the number of links from page I. These values are summed and multiplied by the
damping factor 0.85; finally 0.15, i.e. 1 - d, is added to the result, and this number
represents the PR of page A (Declan, 2000).
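The iterative calculation described above can be sketched in a few lines of Python. This is only an illustration of the published formula; the four-page link graph is an invented example, and a real implementation would differ greatly in scale and detail.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * SUM(PR(I) / C(I)).

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(I)/C(I) over every page I that links to `page`.
            incoming = sum(pr[i] / len(links[i])
                           for i in pages if page in links[i])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Invented four-page Web: C receives the most (and best) links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

On this toy graph, page C ends up with the highest PR, while page D, to which no page links, settles at the minimum value of 1 - d = 0.15, exactly as the formula predicts.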
Google allows the user to search in either simple or advanced
mode. Each mode has a different entry screen and provides different functions and
search options. The simple interface is a single search box with two search buttons:
"Google Search" and "I'm Feeling Lucky". The latter automatically displays the
page deemed most relevant rather than displaying a list of results. The advanced
interface provides boxes for the following search options: "all the words", "exact
phrase", "any of the words" and "without the words"; pull-down menus to limit by
location on the page (anywhere, title or URL), language and domain; radio buttons to
filter results using "SafeSearch"; and search boxes that allow the user to search for
pages that are similar to, or link to, a given URL. Apart from these, Google also
supports major Romanised and non-Romanised languages and translation into English
from major European languages. However, Google does not support truncation and is
not case sensitive.
2.2.2. Anazitisis

Anazitisis is the most recent of the Greek search engines. In fact, Anazitisis
is part of the on-line products provided by OTEnet, one of the most popular and well
respected ISPs in Greece. Unfortunately, the researcher did not have much
information about Anazitisis, because its administrators were not interested
in contributing to the present research. Thus, much of the information given in the
present study about Anazitisis is based partly on information found on the Greek
Web and partly on the researcher's personal experience with the engine.
Anazitisis became fully operational a year ago, and in that time it has become
very popular among Greek users. Its popularity rests largely on the advanced search
features and capabilities it claims to support. More specifically, the engine employs
the SDK, a linguistic software tool developed by AltaVista especially for the Greek
language, which Anazitisis uses to increase its searching and retrieval capabilities
in Greek. Far more impressive, however, is "normalisation", a special feature of
Anazitisis designed and developed to cope with the various forms, or more precisely
the various grammatical "cases", in which a Greek word can appear within a sentence
(e.g. nominative case, accusative case etc.). With this particular characteristic,
Anazitisis claims to increase its precision and recall considerably when a query
containing Greek search keywords is submitted.
There is no particular information about how Anazitisis ranks Web pages,
and no information at all about its Web robot and its capabilities. However, it can
be argued that its ranking is based on a combination of two sets of criteria, the
first dynamic and the second static. The first set includes criteria such as the
presence and number of keywords in the title, in the first line of the text, in the
main body of the page or in the Meta tags of the HTML code. The second set
includes criteria such as the popularity of a specific Web page (measured by the
number of pages that link to it) and the proportion of text it contains (more text
is taken as an indication that the page is more valuable and thus more informative).
Anazitisis supports full Boolean searching by inserting the appropriate
operator before a word ("+" for the "AND" operator and "-" for the "NOT" operator)
and phrase searching (by using quotes "…"). Truncation is also supported, by
inserting the wildcard "*" at the end of the truncated word. Finally, it is possible
to search in specific fields of the Meta tags of the HTML code, such as the title
or the URL. The user can enter his or her queries either in the simple mode or in
the advanced mode, where further functions are supported, such as searching only
governmental sites.
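The operator conventions just described can be illustrated with a small helper that assembles such a query string. The function name and the example terms are hypothetical, and the sketch assumes the conventional reading of "-" as an exclusion operator; only the "+", quote and "*" conventions come from the description above.

```python
def build_query(required=(), excluded=(), phrase=None, truncated=()):
    """Assemble a query string using the operator conventions
    described above: "+" before required words, "-" before
    excluded words, quotes for an exact phrase, and a trailing
    "*" for truncated stems.  Purely illustrative."""
    parts = []
    parts += [f"+{w}" for w in required]          # words that must appear
    parts += [f"-{w}" for w in excluded]          # words to exclude
    if phrase:
        parts.append(f'"{phrase}"')               # exact phrase
    parts += [f"{stem}*" for stem in truncated]   # truncated stems
    return " ".join(parts)

q = build_query(required=["Athens"], excluded=["hotel"],
                phrase="Olympic Games", truncated=["museum"])
# q == '+Athens -hotel "Olympic Games" museum*'
```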
2.2.3. Trinity

Unfortunately, as in the case of Anazitisis, the amount of information
available for the second Greek search engine, Trinity, was very limited. To begin
with, Trinity was developed by Phaistos Networks SA, a Greek company active in
the area of Internet and Web applications. Trinity became fully operational in
1997. It is the basic search engine used by the most popular Greek portal,
www.in.gr, and this may be one of the reasons for Trinity's popularity.
Trinity also operates in two modes, simple and advanced. Each mode
has a different entry screen, because different functions are supported. In the
simple mode the user can only submit his or her query, while in the advanced mode
special operators are supported to help the user increase either precision or
recall. As far as search options are concerned, Trinity provides only limited
options in comparison to Anazitisis and Google. More specifically, Trinity supports
Boolean searching only partially (just two operators: AND, OR), along with phrase
searching.
As far as the ranking of documents is concerned, the information available
for Trinity is very limited and very general. However, it appears that Trinity's
ranking algorithm is based on analysing the popularity of a Web page, measured by
the number of other pages that link to it; it appears that the basic principle of
Google's PageRank is being adopted by more and more Web search engines. Other
criteria taken into consideration by Trinity's ranking algorithm are word proximity
and URL analysis. There is no particular information about the latter criterion, but
from the researcher's personal experience with Trinity a possible explanation is
that what is measured is the proximity of the URL of a specific Web page to the
keywords of the user's query. Trinity claims to employ a very fast Web robot,
Septera, but no further information is given about it. The number of pages indexed
in its database is also unknown.
3. LITERATURE REVIEW
3.1. Introduction
The literature review section of this study attempts to explore
and provide some critical evaluation of the relevant literature in the area of Web
search engine evaluation and comparison. It should be noted that it was
decided to include in this section only those studies that can be considered
experimental or comparative. It was felt better to conduct an in-depth review of
some of the most important previous studies rather than to examine everything that
has been written on Web search engine evaluation. Some other studies, such as
Scoville (1996) or Randall (1996), which were found to be reviews rather than
experimental or comparative studies, were not included in this section. This
selection was considered necessary because the researcher of the current study
wanted to explore and examine in depth how each researcher conducted his or her
study, which evaluation measures were employed, what methodology was applied,
and the reasoning behind each of these steps. The design and development of the
methodology employed in the present study is based on and influenced by some of
the studies reviewed in this section. However, it should be noted that conclusions
and findings derived from non-comparative studies (reviews) were also used
extensively within the present study, particularly in the design and development
of the methodology employed.
Unfortunately, no significant studies have been conducted in Greece on the
evaluation and comparison of Web search services. There are only a few, which are
descriptive in nature and thus considered reviews rather than experimental or
comparative studies. Some of these Greek studies (Papathanasiou and Kanarelis,
2001) were used by the researcher to acquire further information about the two
Greek search engines, Anazitisis and Trinity, since their administrators were not
interested in contributing to the present study.
3.2. Review of comparative-experimental studies.
To begin with, a very comprehensive methodology, practical and analytical
in its approach, for evaluating the performance of Web search engines is presented
by Chu and Rosenthal (1996). They examined and evaluated three Web search engines,
namely Alta Vista, Lycos and Excite, in an attempt to develop a feasible
methodology for evaluating all Web search engines.

In order to test and evaluate the performance of the selected search engines,
the authors of this study developed and used ten sample search queries. The
queries were selected and constructed so as to test the various features of each
search engine: some were phrases, while others required Boolean logic, truncation
or field-searching capabilities. Nine of the ten queries were extracted from real
reference questions.
The authors evaluated the performance of each search engine in
terms of precision of results, response time, output options, documentation and
interface, paying special attention to the criterion of precision. They downloaded
the first 10 documents for each query and assessed their precision, giving a score
of 1 to highly relevant documents, 0.5 to fairly relevant documents and 0 to
irrelevant documents. After assessing the precision score for each query, they
calculated the average precision score over all 10 results for each search engine.
The conclusion of their study was that, among the three selected search engines,
Alta Vista is the one with the highest precision in its returned results. As far as
the other two search engines are
concerned, they offer a plethora of features that users can take advantage of,
such as the concept search of Excite or the very good documentation and interface
of Lycos. The weak point of this study was that there was no attempt to evaluate
recall. Chu and Rosenthal (1996) rationalise their decision to deliberately omit
the evaluation criterion of recall by arguing that it is not possible to calculate
how many relevant items exist for a particular query in the huge and ever-changing
Web.
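The Chu and Rosenthal scoring scheme can be sketched as follows. The relevance judgements and per-query figures below are invented placeholders; only the 1 / 0.5 / 0 scale and the averaging over the first ten hits come from the study.

```python
# Relevance judgements for the first 10 hits of one query, on
# Chu and Rosenthal's scale: 1 = highly relevant, 0.5 = fairly
# relevant, 0 = irrelevant.  (Invented example data.)
judgements = [1, 0.5, 0, 1, 0.5, 0.5, 0, 0, 1, 0]

def precision_score(scores):
    """Mean relevance score over the evaluated hits."""
    return sum(scores) / len(scores)

query_precision = precision_score(judgements)  # 4.5 / 10 = 0.45

# The engine's overall figure is then the mean over all queries
# (per-query values here are invented).
per_query = {"q1": 0.45, "q2": 0.60, "q3": 0.30}
engine_mean = sum(per_query.values()) / len(per_query)
```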
Ding and Marchionini (1996) conducted a comparative study of the
performance of three popular Web search engines: InfoSeek, Lycos and Open Text.
To evaluate the selected search engines they used five queries: three randomly
selected from a question set for Dialog online searching exercises in an
information science class, and two formulated on the basis of their personal
interests. According to the authors of this study, all five queries were
open-ended. In order to get the best search, syntax specific to each search engine
was used for each query.
The selected search engines were evaluated for precision of the returned
results, duplication in the retrieved sets, invalid links and the degree of overlap
between search engines. This evaluation was performed by analysing the first
twenty hits that each search engine returned. Ding and Marchionini (1996) used
a six-point scale to rate the relevance and quality of the three search
services. More specifically, the measures they defined for the purposes of their
study were precision, salience and relevance concentration.
As far as the measure of precision is concerned, the authors distinguished
three types of precision, in order to record the statistically significant
differences in the precision variants between the search engines and to reveal
whether, and to what degree, a complex query can affect the precision performance
of each search engine. The measure of salience reports the sum of the ratings of
all twenty hits for one search engine divided by the sum of the ratings for all
three search engines. The last measure, relevance concentration, reports the ratio
of "good" items in the first ten hits to the number of "good" items in the first
twenty.
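The salience and relevance-concentration ratios just described can be sketched as below. The ratings are invented placeholder data, and the "good" cut-off on the six-point scale is an assumption; only the definitions of the two ratios come from the study.

```python
def salience(ratings_by_engine, engine):
    """Sum of one engine's hit ratings over the sum for all engines."""
    total = sum(sum(r) for r in ratings_by_engine.values())
    return sum(ratings_by_engine[engine]) / total

def relevance_concentration(ratings, threshold=3):
    """Ratio of 'good' items in the first 10 hits to 'good' items
    in the first 20.  Here 'good' means rating >= threshold, an
    assumed cut-off on the six-point scale."""
    good_10 = sum(1 for r in ratings[:10] if r >= threshold)
    good_20 = sum(1 for r in ratings[:20] if r >= threshold)
    return good_10 / good_20 if good_20 else 0.0

# Invented mini-example with two engines and tiny rating lists.
ratings_by_engine = {"X": [1, 2], "Y": [3, 4]}
sal_x = salience(ratings_by_engine, "X")  # 3 / 10 = 0.3

# Invented ratings (0-5 scale) for one engine's first 20 hits.
hits = [5, 4, 1, 3, 0, 2, 5, 1, 0, 3,
        1, 0, 4, 0, 2, 1, 0, 3, 0, 1]
rc = relevance_concentration(hits)  # 5 good in top 10, 7 in top 20
```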
Ding and Marchionini (1996) concluded that the performance of all three
Web search engines was very similar, but that in terms of mean precision and
salience Lycos and Open Text could be considered superior to InfoSeek. The
limitations of their study were that only five queries were used and that the
selected search services were not assessed for response time, accessibility or
recall.
Among the studies conducted from 1995 to 1997, Venditto's (1996) study is
considered to be the most quantitative. Venditto (1996) selected, examined and
evaluated seven search engines which at the time were considered very popular
for their overall performance. The selected Web search engines were: Alta Vista,
InfoSeek, Lycos, Open Text, WebCrawler and WWW Worm. In this study twelve
search terms were employed over a period of two weeks; the problem is that
Venditto (1996) did not report exactly how many queries were used.
The seven search engines were assessed for relevance over the first
twenty-five results returned for each query. Apart from that, each of the selected
search engines was tested on how capable it is of coping with complex query
statements. To do so, known sites were first identified for a given subject, a
search query was then formulated in natural language, and finally it was examined
how many of the predetermined sites each search engine managed to retrieve. This
is a very interesting approach, but it can be argued that it introduces
inconsistencies and bias, since it does not record crucial information about the
test environment. In addition, the currency of the search engines was examined by
employing a query that reflected news events important at that particular time.
Venditto (1996) concluded that all seven search engines performed well
when submitting simple queries. With complex queries, however, some of the search
engines performed poorly. According to Venditto (1996) this suggests that the
relevance ranking methods employed in certain search engines were not very
effective and that the relevance of each hit was partly based on the site's relative
popularity. As far as the relevance results are concerned, InfoSeek was found to be
the best, while Alta Vista produced the most comprehensive results. However,
Venditto (1996) did not report the exact statistics of his study.
Zorn et al. (1996) conducted a comparative study which aimed to examine
and evaluate the advanced search features of four Web search engines. They decided
to select Alta Vista, InfoSeek, Lycos and Open Text Index. These search engines
were selected for their popularity and for the advanced search features that each
claims to support.
Zorn et al. assessed these particular Web search engines for complex Boolean
logic, limiting retrieval by fields, proximity searching, phrase searching, duplicate
detection and truncation. In their study they devised and employed three sample
searches which exercised the advanced search features of each search engine.
While they provide a rather detailed discussion of the performance of each search
engine, the number of searches employed is too small to support a sound analysis
of the results and findings. Also, there appears to be no quantitative evaluation of
relevance.
In their conclusion they argue that no single Web search engine can be
considered the best, since each one has its own weaknesses and strengths.
However, they found Lycos and Alta Vista to have the best performance as far as
the number of URLs retrieved is concerned.
The study conducted by Tomaiuolo and Packer (1996) can be considered
one of the most quantitative. The authors selected five search engines (Magellan,
Alta Vista, Point, Lycos and InfoSeek) and assessed them by employing two hundred
(200) queries. The selection of the particular search engines was based on the
popularity of each search engine; Magellan and Point were selected in particular
because they reviewed and evaluated the Web pages that they indexed.
The subject matter of the queries was based on undergraduate topics. The
document cut-off was set at the first ten hits, which Tomaiuolo and Packer (1996)
evaluated for relevance. The total number of pages that each search engine
returned for each individual query was also recorded. The authors relied on the
microaverage method in order to produce a mean relevance ratio for each search
engine. As far as
their findings are concerned, they found that Alta Vista had the best relevance
performance followed by Lycos, InfoSeek, Point and Magellan. They also noted that
some of the tested search services (Point and Magellan) failed to retrieve at least ten
hits for some of the queries employed.
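The microaverage method referred to above can be sketched as follows. This is a minimal illustration with hypothetical counts: in contrast to a macroaverage, which averages the per-query precision ratios, the microaverage pools the counts across queries before dividing.

```python
# A minimal sketch of microaverage precision (hypothetical counts, not
# the study's data).

def microaverage_precision(per_query_counts):
    """Each entry is (relevant_in_cutoff, hits_actually_retrieved).
    Relevant and retrieved counts are summed over all queries first,
    then divided once."""
    relevant = sum(r for r, _ in per_query_counts)
    retrieved = sum(n for _, n in per_query_counts)
    return relevant / retrieved if retrieved else 0.0

# Three hypothetical queries; the third returned fewer than ten hits,
# as Point and Magellan sometimes did.
counts = [(7, 10), (4, 10), (3, 6)]
print(microaverage_precision(counts))  # 14 / 26 ≈ 0.538
```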
Lindop et al. (1997), writing for the U.K. edition of PC Magazine,
conducted a "lab test" involving the review and comparison of eleven search
engines. The test suite that was developed for evaluating the selected
search engines was undoubtedly subject to various types of bias. More specifically,
the testing methodology involved a team of testers who evaluated the selected
search engines by carrying out simple keyword searches or advanced searches, such
as Boolean or field searching.
Apart from the problem of an inadequate and biased test suite, no record was
kept of crucial information about how the searching was conducted; for example,
the number of searches carried out was never revealed. The inadequacy of the
methodology is also revealed by the fact that the hits that each search engine
returned were never formally assessed for relevance. Instead of using a set of
criteria for assessing the retrieval performance of the search engines, the testing
team kept only a record of the number of results retrieved for their queries and a
record of their impressions of the search refinement and online documentation.
The evaluation procedure of this "lab test" ended with a usability score
based on the testers' satisfaction in using each search engine. Additionally, each
search engine was awarded a score for every feature it supported from a list, such as
Boolean searching, proximity or field searching. The testing team concluded that
Alta Vista was the best search engine, especially in terms of usability and additional
features.
One of the most comprehensive and complete studies was carried out by
Leighton and Srivastava (1997). Their study employed one of the most carefully
designed methodologies for assessing the performance of the selected
Web search services. In their study Leighton and Srivastava (1997) selected five
Web search engines namely, Excite, Alta Vista, InfoSeek, HotBot and Lycos and
assessed them using fifteen (15) queries.
The development of the test suite that they employed in their study is
considered one of the most complete. More specifically, they used a combination of
structured and unstructured queries and tried to make the subject matter of the
queries as wide as possible. Relevance categories were also developed before
the evaluation of the pages took place, in order to avoid or minimize possible bias.
Moreover, they devised a method to "blind" the pages that each search engine
returned, so that the evaluator could not know which page came from which search
engine. To do so, a script written in Perl was employed, which automatically fetched
the results that each search engine had produced and hid the name of each search
engine. Thus, the evaluator assessed the results without knowing which search
engine had produced them. Their findings are contained in a very detailed report,
in which they also recorded every possible detail about the environment where the
test took place.
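The original blinding step was implemented in Perl; the same idea can be sketched in Python. This is a simplified illustration (the page fetching is omitted, and the engine names and URLs are hypothetical):

```python
import random

def blind_results(results_by_engine, seed=0):
    """Pool each engine's result URLs, strip the engine names, and shuffle,
    so the evaluator cannot tell which page came from which engine.
    A private key maps each blinded id back to its source engine so the
    judgments can be attributed after scoring."""
    pooled = [(engine, url)
              for engine, urls in results_by_engine.items()
              for url in urls]
    random.Random(seed).shuffle(pooled)   # fixed seed keeps the run repeatable
    blinded = {f"page-{i:03d}": url for i, (_, url) in enumerate(pooled)}
    key = {f"page-{i:03d}": engine for i, (engine, _) in enumerate(pooled)}
    return blinded, key

blinded, key = blind_results({
    "EngineA": ["http://example.org/1", "http://example.org/2"],
    "EngineB": ["http://example.org/3"],
})
# The evaluator sees only `blinded`; `key` is consulted after judging.
```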
Apart from that, they also conducted several experiments on the same data in
order to compare the selected search engines using a variety of definitions. In their
study, Friedman's randomized block design was used to perform multiple
comparisons for significance. Analysis showed that Alta Vista, Excite and
InfoSeek were the top three search services, with their relative rank changing
depending on how one interpreted the concept of "relevant". Correspondence
analysis showed that Lycos performed better on short, unstructured queries, while
HotBot performed better on structured queries.
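As a rough sketch of how such a Friedman test works (this is the generic textbook computation, not the authors' code; the precision scores below are hypothetical): each query acts as a block, the engines are ranked within each query, and the chi-square statistic is computed from the rank sums.

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic for k engines scored on n queries.
    `scores` is a list of n rows, one score per engine per row; ties
    within a row receive the average of the tied ranks."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])  # ascending
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1                      # extend over tied values
            avg = (i + j) / 2 + 1           # average of ranks i+1 .. j+1
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for e in range(k):
            rank_sums[e] += ranks[e]
    return (12 / (n * k * (k + 1)) * sum(R * R for R in rank_sums)
            - 3 * n * (k + 1))

# Four hypothetical queries, three engines (columns).
scores = [[0.8, 0.5, 0.3],
          [0.7, 0.6, 0.2],
          [0.9, 0.4, 0.4],
          [0.6, 0.5, 0.1]]
print(friedman_statistic(scores))  # 7.125
```

For large enough n the statistic is compared against a chi-square distribution with k − 1 degrees of freedom; for small samples, as here, exact Friedman tables are the safer reference.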
Another very comprehensive study is that of Clarke (1997), who employed an
experimental methodology designed to estimate the precision and recall of World
Wide Web search engines. The search engines selected for this purpose were
Alta Vista, Lycos and Excite. Clarke (1997) employed a TREC-inspired methodology
in order to estimate the recall of the selected Web search engines.
Clarke (1997) evaluated only relative or comparative recall, since
determining absolute recall is impossible in the huge and ever-changing Web.
Relative recall was determined by conducting a second search for known pages in
each search engine. A "pool" of relevant documents was thus identified for each
individual query, and each search engine was measured on the basis of how many
relevant documents it managed to retrieve from this "pool". The first ten pages of the
returned results in each search engine were evaluated for relevance using a three-point
scale. Results were presented in tables and the Friedman nonparametric
statistical test was performed in order to determine the significance of the results.
Clarke (1997) found that Alta Vista achieved the best mean precision score and
Lycos the worst, although the precision performance of Alta Vista was significantly
different only from that of Lycos. Excite achieved the best mean recall performance
and Alta Vista the worst, although there was no significant difference in the recall
performance of the three search engines. Clarke (1997) concluded that Alta Vista
was marginally the best search engine, which was in agreement with previous
studies. The main conclusion, according to Clarke (1997), is that it is possible to
apply the pooled recall approach to estimate the relative recall of Web search
engines.
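The pooled relative-recall approach can be sketched in a few lines. The page identifiers below are hypothetical; in the actual procedure the pool would be built from the relevant pages found across the engines for one query.

```python
def relative_recall(relevant_by_engine):
    """Pooled relative recall: the relevance 'pool' for a query is the
    union of relevant pages retrieved by any engine; each engine's
    relative recall is the share of that pool it managed to retrieve."""
    pool = set().union(*relevant_by_engine.values())
    if not pool:
        return {engine: 0.0 for engine in relevant_by_engine}
    return {engine: len(pages) / len(pool)
            for engine, pages in relevant_by_engine.items()}

# Relevant pages retrieved per engine for one hypothetical query.
scores = relative_recall({
    "AltaVista": {"a", "b", "c"},
    "Lycos":     {"b", "d"},
    "Excite":    {"a", "b", "c", "d", "e"},
})
print(scores)  # pool = {a,b,c,d,e}: AltaVista 0.6, Lycos 0.4, Excite 1.0
```

Note that absolute recall would need the full set of relevant pages on the Web, which is unknowable; the pool stands in for it, so the scores are only comparable between the pooled engines.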
A very different comparative study of Web search engines was conducted by
Courtois and Berry (1999). Their aim was to test how five major Web search
services retrieve and rank documents in answer to a user's search query. Their
main rationale for conducting such a comparison is that each search engine ranks
or sorts its results according to a specific set of criteria, namely its ranking
algorithm.
Furthermore, according to the authors, result ranking has a major impact on
users' overall satisfaction with Web search engines and on the way they retrieve
relevant documents from the results list. Courtois and Berry (1999) note that the
majority of analogous studies share a common methodology, which consists of
examining and evaluating the relevance of the first 10 or 15 hits returned by the
search engine. While the authors recognize that this is an effective and feasible
methodology for determining precision, they argue that, in their experience, this is
not how users actually make use of their results lists. Furthermore, the authors
attempt to justify their different methodological approach by explaining that most
users are likely to scan and retrieve only selected documents. However, this is a
rather weak point of their study, since it is based only on their personal experience
and on a similar study conducted by Koll (1993).
On this basis, Courtois and Berry (1999) developed a test suite
and an appropriate methodology, which consisted of three criteria for testing
relevance ranking:
1. All Terms: Are documents that contain all search terms ranked higher
than documents that do not contain all search terms?
2. Proximity: For documents that contain all search terms, are
documents that contain search terms as a contiguous phrase ranked
higher than documents that do not?
3. Location: For documents that contain all search terms, are documents
that contain search terms in the title, headings, or metatags ranked
higher than documents that contain terms only within the body of the
document?
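The first of these criteria can be sketched as a simple check over a ranked results list. This is a simplified illustration with made-up documents and terms; the Proximity and Location tests would follow the same pattern, testing phrase adjacency and term position instead of mere presence.

```python
def contains_all(doc_text, terms):
    """True if the document contains every search term (case-insensitive
    substring match; a real test would tokenize properly)."""
    text = doc_text.lower()
    return all(t.lower() in text for t in terms)

def all_terms_ordered(ranked_docs, terms):
    """Courtois and Berry's 'All Terms' criterion, simplified: documents
    containing all search terms must be ranked above every document that
    lacks at least one term."""
    seen_incomplete = False
    for doc in ranked_docs:
        if not contains_all(doc, terms):
            seen_incomplete = True
        elif seen_incomplete:
            return False   # a complete document ranked below an incomplete one
    return True

docs = ["greek search engines compared",   # contains both terms
        "search engines overview",         # lacks 'greek'
        "greek web directory"]             # lacks 'search'
terms = ["greek", "search"]
print(all_terms_ordered(docs, terms))                  # True
print(all_terms_ordered(list(reversed(docs)), terms))  # False
```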
For comparison they selected five search engines which scored highly in many
comparison tests conducted by popular computer magazines, namely AltaVista,
HotBot, Excite, Infoseek, and Lycos. They identified and selected 12 search queries
to test the particular search engines and they downloaded the first 100 hits of each
search.
Based on a further analysis of the downloaded documents against the
above test criteria, the authors concluded that the ranking performance of
all the engines was generally good. In the Proximity and Location tests, most of the
search engines performed worse than in the All Terms test, which raises
interesting questions about the ranking algorithm of each search engine.
Ultimately, Courtois and Berry (1999) suggest a very interesting methodology
for evaluating the quality and reliability of the ranking algorithms of Web
search engines, and their results raise some serious considerations from
the perspective of the end user.
Another experimental-comparative study is the one conducted by Gordon and
Pathak (1999). In their study, Gordon and Pathak (1999) distinguish between two
types of search engine evaluation: testimonials, encompassing informal and
impressionistic appraisals as well as feature-list comparisons, and shootouts, which
correspond more closely to traditional information retrieval effectiveness
experiments.
According to this definition, Gordon and Pathak (1999) presented a table of
twelve earlier shootout studies, but identified only three (including their own) which
made use of "appropriate experimental design and evaluation". Of these, that of
Gordon and Pathak (1999) is the most comprehensive and most recent. Gordon and
Pathak obtained thirty-three (33) real information needs from volunteers among the
faculty members of a university business school. These were recorded in
considerable detail and passed to skilled search intermediaries, who were given the
task of generating near-optimal queries for each of eight search engines through an
interactive, iterative process. The top twenty (20) results produced by each of the
engines in response to the final queries were then printed and returned to the
originating faculty member for assessment on a four-point relevance scale.
Gordon and Pathak (1999) presented a list of evaluation features which
they claim should be present to maximise the accuracy and meaningfulness of an
evaluation. Very briefly, these features can be listed as follows:
1. Searches should be motivated by genuine user need.
2. If a search intermediary is employed, the primary searcher’s information need
should be captured as fully as possible and transmitted in full to the
intermediary.
3. A large number of search topics must be used.
4. Most major search engines should be included.
5. The most effective combination of specific features of each search engine
should be exploited. This means that the queries submitted to the engines
need not be the same.
6. Relevance judgments must be made by the individual who needs the
information.
7. Experiments should be well designed and conducted.
8. The search topics should represent the range of information needs both with
respect to subject and to type of results wanted.
Of these features, some are very interesting, but others are debatable, such as
feature 5. It appears that Gordon and Pathak (1999) question the general
practice of evaluating Web search engines on the results produced by a set of
query words without special syntax or operators. However, it is more
reasonable to compare the quality of results produced by search engines given
identical input queries in this plain form, rather than attempting to find the best
query for each search engine and then comparing them. After all, typical users
avoid using special operators or special syntax in their queries. Gordon and Pathak
(1999) concluded that search effectiveness was generally low, that there were
significant differences between engines and that the ranking of engines was to some
extent dependent upon the strictness of the relevance criterion.
4. METHODOLOGY
4.1. Introduction
According to Van House et al. (1990) evaluation is the process of identifying
and collecting data about specific services or activities, establishing criteria by which
their success can be assessed and determining both the quality of the service or the
activity and the degree to which the service or activity accomplishes stated goals and
objectives.
The process of evaluation, as defined above, is widely used for
traditional databases, CD-ROMs and other online information retrieval systems in
order to assess their overall quality and performance. However, the evaluation of the
performance of Web search engines is a new area within the context of information
retrieval. According to Dong and Su (1997), studies concerning the assessment of
Web search services began in 1995. The review of previous literature has
revealed that the majority of such studies were conducted between 1995 and
1997. It has also revealed that, in general, three types of methodology
have been employed in assessing the performance of Web search services: actual
tests with data collection and analysis, evaluative comments with examples of simple
searches, and reviews of the functions of different search engines without examples
or other kinds of tests.
Further to this, it could be argued that methodologies making use only
of simple tests and reviews of search engine functions were employed mostly
in the earlier studies, while the majority of more recent studies employ
actual tests with data collection and analysis of the overall performance of the Web
search services.
In their study, Gordon and Pathak (1999) distinguish between only two types
of search engine evaluation methodology: testimonials and shootouts. Testimonials
are generally conducted by the trade press or by computer industry organizations that
“test drive” and then compare search engines on the basis of speed, ease of use,
interface design or other features that are readily apparent to users of the search
engine. Another type of testimonial evaluation comes from looking at the more
technical features of search engines and making comparisons among them on that
basis. Such testimonials are based on features like the set of search capabilities
different engines have, the completeness of their coverage or the rate at which newly
developed pages are indexed and made available for searching.
Although testimonials can give users some useful information
for deciding which search engine to employ, they can only indirectly
suggest which search engines are most effective in retrieving relevant Web pages.
For an overall evaluation of the performance of Web search engines, shootout
methodologies appear more appropriate. More specifically, in shootouts, different
search engines are actually used to retrieve Web pages and their effectiveness in doing
so is compared. Shootouts resemble the typical information retrieval evaluations that
take place in laboratory settings to compare different retrieval algorithms, despite the
fact that Internet shootouts often consider only the first 10 to 20 documents retrieved,
whereas traditional information retrieval studies often consider many more (Gordon
and Pathak, 1999).
4.2. Setting up the evaluation criteria
The special features of Web search engines in indexing technique, resource
coverage, relevance ranking, search strategy, hyperlinks and interface have led some
information retrieval researchers to the conclusion that the evaluation measures
should differ from those used for traditional online databases and CD-ROMs. While
this seems sensible, it does not necessarily mean that the measures and criteria
employed for the evaluation of traditional online systems are inadequate for the
evaluation of interactive information retrieval services, and specifically Web search
engines. The six criteria that Lancaster and Fayen (1973) once listed
(1. Coverage, 2. Recall, 3. Precision, 4. Response time, 5. User effort and 6. Form of
output) for the evaluation of information retrieval systems are still quite applicable to
modern interactive information retrieval systems, despite the fact that they were
set up three decades ago. In their study, Chu and Rosenthal (1996) employed a set of
criteria based on those listed by Lancaster and Fayen (1973). Their justification for
employing criteria and evaluation measures used in traditional
online information retrieval systems is that the Web can also be described as
an information storage and retrieval system, characterized by its enormous
size, hypermedia structure and distributed architecture.
Furthermore, the review of previous studies has revealed that the vast
majority employ evaluation criteria, such as precision, output options and
response time, which are commonly used in the assessment of traditional online
information retrieval systems (Chu and Rosenthal, 1996; Winship, 1995).
Ultimately, Su (1992) stated that, in general, criteria for evaluating interactive
information retrieval systems include relevance, utility, efficiency and user
satisfaction, but the truth is that there is no agreement as to which of the existing
criteria or measures are the most appropriate for evaluating interactive information
retrieval performance.
4.3. Text REtrieval Conference (TREC)
At this point it is appropriate to point out the importance and the contribution
of the Text REtrieval Conference (TREC) to information retrieval research. Some of
the previous studies concerning the evaluation of Web search engines used TREC-inspired
methods (Clarke, 1997; Hawking et al., 2001).
The first Text REtrieval Conference was held in November 1992 at the
National Institute of Standards and Technology (NIST) (Harman, 1993). The purpose
of the conference was to bring together researchers from the field of information
retrieval to discuss the results of their systems on a new large test collection
(the TIPSTER collection). TREC gave researchers the opportunity to compare
results on the same data using the same evaluation methods. Moreover, it
represented a breakthrough in cross-system assessment in the field of information
retrieval. It was the first time that most of these researchers had used such a large
test collection, and it therefore required a major effort by all of them to scale up their
retrieval techniques (Harman, 1995).
The overall goal of the TREC programme is to encourage research in
information retrieval using large test collections. It is hoped that by providing a very
large test collection and encouraging interaction among researchers in a friendly
evaluation forum, new impetus in information retrieval will be generated. Moreover,
it was hoped that the participation of groups with commercial information retrieval
systems would lead to increased technology transfer between research laboratories
and commercial products (Harman, 1995).
In the second TREC, which took place in August 1993, two types of
retrieval were tested: retrieval using an “ad hoc” query, such as a researcher might
use in a library environment, and retrieval using a “routed” query. Routed queries
are queries which are extracted from specified topics and then tested against a set
of “training” documents in which the relevant documents have been identified.
Through this process an optimal query is generated, which can then be tested
against new data (Beaulieu et al., 1996). In contrast, “ad hoc” queries are new
queries which are run against an existing set of data without prior knowledge of the
relevant documents. The assessment of the results was based on the traditional recall
and precision criteria. The queries employed in the present study were all “ad hoc”.
4.4. The influence of TREC
It appears that the main concern of the majority of the previous studies was
the efficiency of the methodology, the evaluation criteria and the development of the
test suite. In the current study, much effort was devoted to making the methodology
and the evaluation measures as efficient as possible, so as to produce an accurate and
meaningful evaluation of the quality of the results returned by the selected Web
search engines. Therefore, the methodology and evaluation criteria employed in this
study were influenced by those used in previous studies and by the TREC experiments.
4.5. Development of the test environment
The main purpose of the current study is to assess the overall performance of
the selected Greek search engines and to compare their performance with that of an
excellent performer, Google. To do so, a wide range of evaluation criteria should
ideally be considered. However, the limited time available for this study resulted in
selecting only those criteria which were considered to best reflect the
overall performance of each search engine, together with those which previous
studies had found to be of significant importance, such as interface and precision
(Dong and Su, 1997).
Another concern was the time proximity of the searching. According to
Leighton and Srivastava (1997), the goal of close time proximity of the searching
should be taken into consideration in order to ensure the objectivity and accuracy of
the evaluation of the returned results. The closer in time that a query is executed on
each of the selected search engines, the better. The rationale behind this tactic is
that if a relevant page were made available between the time one engine was
searched and the time a second was searched, this would result in an unfair situation
in which the second search engine would have the opportunity to have the new page
indexed and consequently retrieved. According to previous researchers who have
conducted similar studies (Chu and Rosenthal, 1995; Ding and Marchionini, 1996;
Leighton and Srivastava, 1997), close time proximity characterizes the quality
of the methodology and evaluation procedures employed in such studies.
According to them, the ideal situation would be for each query to be executed on all
the selected search engines simultaneously.
In the current study, all three search engines were searched for a
particular query on the same day, and each query was performed on each search
engine within one hour at most. A second round of searches was conducted
immediately after the first for each query in order to evaluate recall, which will
be discussed in more detail later. Again, the time limit was considered to
be of major importance, since the selected search engines claim to update their
database indexes on a weekly or even daily basis.
Related to close time proximity was the goal of checking the pages cited in
the results from the Web search engines as quickly as possible after the results had
been obtained. This was considered to be as important as the goal of close time
proximity of the searching. The reason is that the longer one waits after the results
have been obtained, the more likely it is that some pages which were active during
the searching will have been removed from the Web; the tested search engine would
thus be assessed unfairly by the evaluator (Leighton and Srivastava, 1997).
The relevance judgements were performed immediately, and it took only about
thirty minutes to evaluate the first ten returned results for each search engine. The
precision scores were also assigned within these thirty minutes. For efficiency,
the results were saved as “.htm” files in case there should be a need
to check the URLs again or to validate assessments. The relevance judgements were
another area of great concern during the evaluation of the selected Web search
engines. Relevance judgement is the weak point of the evaluation procedure in the
majority of similar studies, the main problem being the person responsible for
assessing the relevance of the returned results.
Moreover, in most of these studies the author or authors were themselves the
evaluators of the returned results. At this step of the evaluation procedure, bias,
both conscious and unconscious, can enter and distort the objectivity and accuracy of
the relevance judgements and thus the precision scores of the tested search engines.
For example, if the subject matter of the selected queries is wide, there is a serious
concern over whether the evaluator has an adequate knowledge background to assess
the relevance of the returned results. In order to overcome this flaw, many researchers
decided to select queries with narrow subject matter (Clarke, 1997). While this
approach makes accurate and meaningful relevance judgements possible,
there is always the danger that the returned pages will come
from only one portion of the Web.
In the present study, the evaluation procedure employed is inspired
by the earlier study of Gordon and Pathak (1999), which introduced intermediary
researchers or evaluators in an attempt to circumvent the risk of distorted relevance
judgments. More specifically, the returned results were assessed by six
fellow students from the Department of Information Studies at the University of
Sheffield, each with an adequate knowledge background in a specific subject area
(economics, software engineering, librarianship, history and archaeology,
mathematics and management). Furthermore, many of the queries used in the current
study were real reference questions drawn from their research dissertations for their
Master’s degrees, so they were evaluating the results both as researchers and as
end-users. As mentioned previously, this procedure was considered necessary and was
employed in order to ensure the highest possible degree of accuracy and objectivity
in the evaluation of the results.
Searches were carried out on PCs at the St. George I.T. Centre at the
University of Sheffield. Access to the World Wide Web was provided through
the University’s LAN (Local Area Network), and the Web browser used was the
latest version of Internet Explorer (version 5.5).
4.6. Sample queries suite
As already described, this methodology involves a set of sample
queries that are employed in order to test and assess the overall performance of
the selected Web search engines. The development of the sample query suite
is thus a sensitive step in the development of the evaluation procedure, which can
potentially affect the measured performance of the tested search engines (Ding and
Marchionini, 1996).
4.6.1. Number of queries
As mentioned, the searching procedure was designed in such a way as
to minimize the possibility of favouring the search engine examined first or the
one examined last. Thus, a compromise between the number of queries employed
and the number of documents assessed for each search engine was considered
crucial, so that each individual query could be processed within a
reasonable time limit. This time limit was compulsory in order to ensure that the
search engines’ indexes would not change during the evaluation of each
individual query. Taking this into consideration, twenty was regarded as a
feasible number of queries to evaluate in a rather limited time. The initial
intention, before designing the evaluation methodology, was to use at
least 25 queries, so that the subject matter of the queries would be as wide as
possible. However, problems arose in keeping to the required time limit, and so
the number was limited to 20 queries.
4.6.2. Query subject matter
Previous studies suffered from considerable bias regarding the subject matter of the
queries employed. Many researchers decided to use a wide subject matter
(Chu and Rosenthal, 1995; Ding and Marchionini, 1996), but there is always the
question of how capable the evaluator is of making accurate and meaningful relevance
judgments on topics which require an appropriate knowledge
background. Other researchers (Clarke, 1997), in order to avoid this risk,
deliberately employed queries with narrow subject matter. While this allowed them
to make accurate and meaningful relevance judgements, it could be argued that, on
the other hand, only one portion of the Web was tested for retrieval (Clarke, 1997).
In the current study an effort is made to avoid such a risk by increasing the
number of evaluators to six, each with a different knowledge background.
Moreover, most of the queries used in the current study were extracted from real
reference questions used in the evaluators’ research dissertations. The
evaluation procedure described was thus designed to ensure that the relevance
judgments of the evaluators would be as accurate as possible, while allowing
the subject matter of the queries to be as wide as possible.
Apart from that, the subject matter of some queries involves issues of Greek
culture and civilization. The reason behind this approach is that Greek search
engines are used every day by hundreds of users searching the Greek domain for
information that mostly involves such issues. It is therefore of particular
interest to test the two most popular Greek search engines (Anazitisis and
Trinity) on this particular subject matter, to evaluate
their performance and compare it with a class-leading search engine such as Google.
Of course, it can be argued that such a comparison is somewhat unfair to
Google: the Web robots of the tested Greek search engines are primarily focused
on indexing the Greek Web, so an advantage for Anazitisis and Trinity on these
particular queries can be expected. While this may be true, the comparison was
considered necessary for the purposes of the present evaluation. Moreover, this
study also attempts to explore the capabilities and limits of the tested search
engines and to examine how each copes with queries combining Greek and English.
4.6.3. Query formulation and search expression
The decision on query formulation and on the search expression to be entered in
each search engine proved to be quite difficult. Previous studies have suffered
from considerable bias here. Some researchers who conducted similar studies
(Chu and Rosenthal, 1996; Tomaiuolo and Packer, 1996) decided to compile the
selected queries using syntax specific to each of the tested search engines, or
to use the so-called “advanced mode” feature where it was available. In their
study, however, Leighton and Srivastava (1997) tried to be more systematic and
carefully examined, before conducting the actual test, what search expression
should be submitted to the selected search engines. They decided to use simple
queries, of the kind an ordinary user would enter, because, according to them,
such queries force the search engine to do more of the work, ranking the
results by its own algorithm rather than by the constraints specified by
operators.
Moreover, according to Hawking et al. (2001), all well-known public search
engines are designed to produce a list of results when a set of simple queries
(without operators or special syntax) is typed into the search box provided by
the engine’s primary interface. It therefore seems more realistic to compare the
quality of the results returned by search engines given identical input queries
in this particular form. Furthermore, the examination of query logs
(Silverstein et al., 1999) has revealed that most users do not use any form of
query operators, or the “advanced mode” where it is provided. Additionally,
Silverstein et al. (1999) concluded that in most cases, when users attempted to enter queries using
query operators, they made many errors. Thus, while studies that adopt the
approach of trying to find the best query formulation for each search engine are
very interesting, they also introduce conscious and unconscious bias, which can
lead to unfair comparisons among the tested search engines. The query
formulation employed in the current study therefore consists of “simple
queries”, in an attempt to minimize the unfairness that could enter at this step.
4.6.4. Further analysis of the sample queries
As previously described, the queries employed in the current study have been
extracted from real reference questions, with an effort made to keep their
subject matter as wide as possible. The fact that this study attempts to examine
and evaluate the most popular Greek search engines imposes some requirements on
the queries employed. More specifically, the subject matter of some of the
sample queries relates to Greek culture, history and civilization. Some queries
are entirely in Greek, while others contain a combination of Greek and English
search keywords. The inclusion of queries related to Greek civilization was
considered of significant importance, since Greek search engines are used
precisely to unearth information on such issues.
Apart from that, this study also examines how the selected Greek search engines
cope with a combination of Greek and English words within the same query. This
area is considered particularly important for the current study, because the
first Greek search engines faced a variety of problems with mixed Greek and
English words within an individual query. It is therefore very interesting to
examine how much the selected Greek search engines have improved in such a
“sensitive” area, in which they claim to have overcome the problems of the
past. Furthermore, it is also interesting to examine the search performance of
Google in another language, Greek, since it claims to be capable of handling it.
Further to this point, problems can arise from the complicated and
sophisticated grammatical and syntactical structure of the Greek language. More
precisely, many words such as nouns take a variety of forms within a sentence,
called cases (e.g. nominative, accusative, etc.). The same holds for many other
words such as adjectives and pronouns. This variation of word forms within a
Greek sentence is used to alter the meaning of the sentence or to mark a
particular relationship between its words. In any case, the possible variations
of word forms in Greek are far more complicated than in English, and thus
obstruct the task of Web search engines.
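As a toy illustration of this difficulty (and emphatically not the actual technique used by Anazitisis, Trinity or Google), consider how exact keyword matching treats three inflected forms of the Greek word for “machine”, and how even a crude suffix-stripping rule can conflate them. The suffix list below is illustrative only and is not a real Greek stemmer:

```python
# Toy illustration of Greek inflection obstructing exact keyword matching.
# The suffix list and stripping rule are illustrative assumptions only;
# they are not what any of the tested search engines actually implement.

GREEK_SUFFIXES = ["ές", "ής", "ή"]  # a few inflectional endings, longest first

def crude_stem(word: str) -> str:
    """Strip the first matching inflectional suffix, if any."""
    for suffix in GREEK_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Nominative singular, nominative plural and genitive singular of "machine".
forms = ["μηχανή", "μηχανές", "μηχανής"]

# Exact matching sees three distinct index terms...
print(len(set(forms)))                         # 3
# ...while suffix stripping maps them all onto one stem.
print(sorted({crude_stem(w) for w in forms}))  # ['μηχαν']
```

A search engine that indexes only the exact surface forms will miss a page containing “μηχανές” when the user types “μηχανή”; conflating inflected forms, by whatever means, is what the engines tested here claim to have addressed.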
Problems such as the one described above plagued the users of the first
Greek search engines. This language-related problem is of crucial importance,
since it rendered the first Greek search engines unable to perform their tasks
successfully. Unfortunately, there is no firm evidence to document this, because
until now there has been no significant study evaluating the performance of
Greek search engines. As illustrated in the literature review section, the few
studies that examine Greek search engines are all of a descriptive nature.
Modern Greek search engines, such as those tested in the current study, claim
to have overcome this problem by utilizing special software (e.g. Anazitisis) or
special language techniques in the ranking algorithm (Trinity).
The sample-query part of the test suite has therefore been developed in the way
just described, in order to assess how the advanced algorithms and special
features of the selected Greek search engines and Google cope with the
combination of Greek and English words within an individual query. The list of
the twenty sample queries employed in the current study is given below. The
English translation of the Greek queries, and of the queries combining Greek
and English words, appears in parentheses and italics.
1. Internet Banking strategy
2. PEST factors in internet banking
3. Public libraries and learning disabilities
4. Customer relationship management
5. Object-Oriented management with UML
6. Customer-centric culture
7. Telecommunications infrastructure in Greece and Spain
8. Information retrieval and Web search engines
9. Backpropagation models of neural networks for information retrieval
10. Information modelling and SSADM methodology
11. Μηχανές αναζήτησης στο Web (Search engines on the Web)
12. Ο κόσµος του Internet (The world of the Internet)
13. Οι εκδόσεις του OECD (Publications of the OECD)
14. Αρχιτεκτονική επεξεργαστών Risk στους προσωπικούς υπολογιστές (Risk architecture CPUs in personal computers)
15. Μεγάλοι έλληνες ρεµπέτες (Great Greek folklore musicians)
16. Ελληνική επανάσταση 1821 (Greek independence war of 1821)
17. Εκδόσεις της τράπεζας Ελλάδος (Publications of the Bank of Greece)
18. O ελληνικός στοχασµός κατά το 19ο αιώνα (Greek philosophical meditation during the 19th century)
19. Νεοελληνικός διαφωτισµός και Ρήγας Φεραίος (Modern Greek enlightenment and Rigas Feraios [personal name])
20. Αντικειµενοστρεφή συστήµατα βάσεων δεδοµένων (Object-oriented database systems)
4.7. Evaluation of returned pages
4.7.1. Document cut-off
The application of a document cut-off in the evaluation of on-line information
retrieval systems is regarded as a necessary step before the actual evaluation
takes place. A decision must be taken on the number of pages to be assessed
against the predefined measures of evaluation. The necessity of a document
cut-off stems from the fact that the output of these systems can run to hundreds
or even thousands of returned pages. The same holds for Web search engines: in
most cases their returned results can easily reach thousands of Web pages,
which might still not be a large volume of information considering the vast and
ever-changing Web.
In the current study, it was decided to evaluate the first ten hits of the
results list produced by each search engine. This decision was based on
personal experience and on observing the behaviour of fellow students towards
the results lists produced by search engines. Virtually all of them tended to
browse and examine only the first ten, or more rarely the first twenty, hits
returned by the search engine. This approach to the document cut-off also seems
to be supported by the vast majority of researchers who have conducted similar
studies. Chu and Rosenthal (1996), Scoville (1996) and Tomaiuolo and Packer
(1996) tested the search engines on the first ten results, while others, such
as Ding and Marchionini (1996) or Gauch and Wang (1996), evaluated the search
engines on the basis of the first twenty results. The common practice in
similar studies is thus to examine and evaluate search engines on the first
ten, and sometimes twenty, returned results. The time factor should also be
regarded as another major reason behind this common practice of document
cut-off.
Since all the selected search engines display their results in descending order
of relevance, calculated in one way or another, it is considered that this
cut-off should not critically affect the validity of the current study.
4.7.2. Measures of evaluation specific to the current study
As previously described, and as the relevant literature suggests, the
evaluation criteria employed in studies assessing the overall performance of
on-line information retrieval systems can be considered perhaps one of the
weakest parts of those studies. The underlying reason is that there is no
common agreement as to which of the existing criteria or measures are the most
appropriate for evaluating interactive information retrieval performance (Su,
1992). This is easily confirmed by the review of previous studies.
More specifically, owing to the special features of Web search engines
(interface, hyperlink structure, etc.), every researcher attempted to use
criteria somewhat different from the traditional ones. For example, Taubes
(1995) considered reliability, completeness and speed as the measures of
evaluation; Winship (1996) argued that record structure and search techniques
carried greater significance than retrieval performance; and others have
suggested that a powerful and usable interface, together with the quantity,
precision and readability of returned results, are the most important criteria
for evaluating and rating search engines.
It should be clarified that the researchers who employed a rather different set
of evaluation criteria did not necessarily reject the traditional ones. In
fact, most of them included some of the traditional criteria in the overall set
of evaluation measures employed in their studies; it is simply that other
criteria, such as the interface, search techniques, record structure, ease of
use, etc., were considered to carry greater significance than the traditional
ones when assigning an overall performance score to the tested search engine.
According to Dong and Su (1997), traditional criteria such as precision and
response time remain the most commonly used in the vast majority of studies
comparing and evaluating the performance of Web search engines.
In the current study, an effort was made to employ those criteria that would
best reveal the overall performance of each of the tested search engines. At an
earlier stage there was a thought of employing a wider set of criteria to
assess the overall performance of the selected search engines, the motivation
being that the more evaluation measures employed the better, since such an
exhaustive test would reveal almost every strength or weakness of the tested
search engines. While this reasoning has some validity, the review of previous
studies did not reveal any study employing a very large number of evaluation
measures. The truth is that such a study risks being superficial, especially in
the present case where the available time was very limited. It was felt best to
conduct a feasible test of some of the most representative performance measures
rather than employing a large number of criteria to evaluate every feature and
aspect of each search engine. So it was decided to evaluate the selected search
engines on the basis of precision, relative recall, response time, validity of
links, interface and documentation.
4.7.3. Precision
Precision and recall together constitute the two most important traditional
measures of retrieval effectiveness (Saracevic, 1975). Precision, or the
precision ratio, is defined as the proportion of retrieved documents that are
judged relevant, i.e. the number of relevant documents retrieved divided by the
total number of
documents retrieved. Cheong (1996) likewise describes the percentage of
retrieved documents that the user judges relevant as a measure of the
signal-to-noise ratio in certain kinds of system.
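The ratio just defined can be sketched in a few lines of code; the document identifiers and relevance judgements below are hypothetical, chosen only to make the arithmetic concrete:

```python
# Minimal sketch of the classic precision ratio defined above:
# relevant documents retrieved divided by total documents retrieved.
# The identifiers and judgements are invented for illustration.

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are judged relevant."""
    relevant_retrieved = [doc for doc in retrieved if doc in relevant]
    return len(relevant_retrieved) / len(retrieved)

retrieved = ["d1", "d2", "d3", "d4", "d5"]  # hypothetical result list
relevant = {"d1", "d3"}                     # hypothetical relevance judgements
print(precision(retrieved, relevant))       # 0.4
```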
According to Dong and Su (1997), precision is considered very important in
comparing and evaluating the performance of search engines for two reasons. The
first is that each search engine employs its own methods and techniques for
collecting and indexing documents, and the fields indexed also differ from one
engine to another; precision thus offers a way to identify which indexing
method or technique is the most efficient. The second reason is that automatic,
machine-produced indexing, despite the sophisticated modern techniques and
algorithms employed, cannot always cope successfully with words used in various
contexts, resulting in the indexing of non-relevant items. It can therefore be
argued that the relevance of the output to a user’s query is an important
indicator for assessing the quality and intelligence of an individual search
engine.
Dong and Su (1997) note that while precision has been widely used as a
criterion for describing the relevance of search results, only a few of the
studies conducted between 1995 and 1996 (Chu and Rosenthal, 1996; Ding and
Marchionini, 1996) applied the precision criterion using a “standardised
formula” (Dong and Su, 1997). The problem with this statement is that Dong and
Su (1997) do not properly clarify what they mean by the term “standardised
formula”. The fact remains that the traditional way of evaluating precision
needs to be reconsidered when evaluating Web search engines. More specifically,
in all the previous studies attempting to assess the performance of Web search
tools, the precision measure was calculated on the basis of the first ten or
twenty hits returned by the search engine.
In the current study it was felt best to assess the precision of the returned
Web pages on a three-point scale, as Clarke (1997) did in his study. More
specifically, a score of 1 was given to very relevant documents, 0.5 to
somewhat relevant documents, and 0 to documents that were not relevant. Since
six evaluators were used in the current study, every page was examined
thoroughly before assigning a precision score. Also, all the links in every page were
examined and not just one or two initial links.
In the case that a page was consisted of a whole set of links, every link was
also examined thoroughly. If these links, by following them, could lead to useful
information resources, then a score of 0.5 was assigned to this page (Leighton, 1995).
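The scoring scheme just described, combined with the ten-hit document cut-off of section 4.7.1, can be sketched as follows; the judgements in the example are invented, not figures from the study:

```python
# Sketch of the precision computation described above: each of the first
# ten hits receives a relevance score of 1 (very relevant), 0.5 (somewhat
# relevant) or 0 (not relevant), and the scores are averaged.
# The example judgements below are invented, not results from the study.

DOCUMENT_CUT_OFF = 10  # only the first ten hits are judged

def precision_at_cut_off(scores):
    """Mean three-point relevance score over the first ten results."""
    judged = scores[:DOCUMENT_CUT_OFF]
    return sum(judged) / len(judged)

# Hypothetical judgements for one query on one search engine:
scores = [1, 1, 0.5, 0, 1, 0.5, 0, 0, 1, 0.5]
print(precision_at_cut_off(scores))  # 0.55
```

Averaging such scores per query, and then across the twenty queries, yields the mean precision rates per search engine reported in the results tables.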