Tefko Saracevic 1
search engines
digital libraries
[email protected]; http://comminfo.rutgers.edu/~tefko/
Central ideas
Search engines
    While the structure & basic operation of search engines are similar
    • a great number & variety exist beyond Google, with their own features, many of them in specialized domains
Digital libraries
    They have rich & varied resources of use in accessing & searching
    • a variety of databases & reference tools in many domains
    • accessing of journals for delivery of full texts in all fields
As a searcher you are also using
Knowing searching = also knowing these resources
ToC
1. Search engines
2. Digital libraries
1. Search engines
Definitions. How they work. Diversity
dictionary definitions
search
    COMPUTING (transitive verb) to examine a computer file, disk, database, or network for particular information
engine
    something that supplies the driving force or energy to a movement, system, or trend
search engine
    a computer program that searches for particular keywords and returns a list of documents in which they were found, especially a commercial service that scans documents on the Internet
about definition of search engines
• oh well … search engines do not search only for keywords, some search for other stuff as well
• and they are really not “engines” in the classical sense
    but then a mouse is not a “mouse”
use of search engines … among others
How Search Engines Work (Sherman 2003)
[Figure: diagram with the components Your Browser, Search Engine, Database, Indexer, Crawler and The Web (URL1 to URL4). Sample query “Eggs?” returns ranked results: Eggs - 90%, Eggo - 81%, Ego - 40%, Huh? - 10%; sample page “All About Eggs” by S. I. Am.]
how do search engines work? elaboration
• crawlers, spiders: go out to find content
    in various ways go through the web looking for new & changed sites
    periodic, not for each query
        no search engine works in real time
    some search engines do it for themselves, others not
        they buy content from other companies
    for a number of reasons crawlers do not cover all of the web – just a fraction
        what is not covered is the “invisible web”
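A minimal sketch of the crawling idea, assuming a hypothetical toy web held in a dictionary (TOY_WEB, crawl and the URL names are illustration names, not any real engine's crawler). It follows links breadth-first from a seed page and shows why a page that nothing links to stays “invisible”:

```python
from collections import deque

# Hypothetical toy web: each URL maps to the URLs it links to.
TOY_WEB = {
    "url1": ["url2", "url3"],
    "url2": ["url4"],
    "url3": [],
    "url4": ["url1"],
    "url5": [],  # no page links here, so a link-following crawler never finds it
}


def crawl(web, seeds, max_pages):
    """Breadth-first crawl that stops once max_pages pages have been fetched."""
    seen = set()
    queue = deque(seeds)
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen or url not in web:
            continue  # already fetched, or the link is dead
        seen.add(url)
        queue.extend(web[url])  # schedule the outgoing links
    return seen


covered = crawl(TOY_WEB, ["url1"], max_pages=10)
invisible = set(TOY_WEB) - covered  # what the crawler never reached
```

A real crawler also re-visits pages periodically to pick up changes; the page limit here only stands in for the practical limits on coverage.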
elaboration …
• organizing content: labeling, arranging
    indexing for searching – automatic
        keywords and other fields
        arranging by URL popularity - PageRank, as in Google
    classifying as a directory
        mostly human handpicked & classified
• as a result of different organization we have basically several kinds of search engines:
    search – input is a query that is then searched & results displayed
    directory – classified content – a class is displayed
    fused: directories now also have search capabilities & vice versa
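Automatic keyword indexing can be sketched as an inverted index: a table from each word to the documents that contain it. This is only an illustration under assumptions; the documents, tokenize and build_index are hypothetical names, and real engines index many more fields:

```python
import re
from collections import defaultdict

# Hypothetical mini collection: document id -> text.
DOCS = {
    "d1": "All about eggs",
    "d2": "Eggo waffles and eggs",
    "d3": "The ego and the id",
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return dict(index)


index = build_index(DOCS)
```

Looking up a word is then a single dictionary access, which is why the index is built ahead of time rather than for each query.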
elaboration (cont.)
• databases, caches: storing content
    humongous files, usually distributed over many computers
• query processor: searching, retrieval, display
    takes your query as input
        engines have differing rules for how it is handled
    displays ranked output
        some engines also cluster output and provide visualization
• at the other end is your browser
    in addition to Explorer a number of others exist
        Mozilla Firefox, for instance – became quite popular
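A query processor that returns ranked output can be sketched as below. The scoring rule (count how often the query words occur in each document) is an assumption chosen for simplicity; actual engines use their own, mostly secret, rules. DOCS, tokenize and search are hypothetical names:

```python
import re
from collections import Counter

# Hypothetical mini collection: document id -> text.
DOCS = {
    "d1": "eggs eggs and more eggs",
    "d2": "eggs and toast",
    "d3": "toast only",
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, docs):
    """Return (doc_id, score) pairs, best first; score = query-word occurrences."""
    terms = tokenize(query)
    results = []
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score > 0:
            results.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results


ranked = search("eggs toast", DOCS)
```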
elaboration … similarities, differences
• all search engines have these basic parts in common
• BUT the actual processes – methods how they do it – are based on various algorithms & they differ
    most are proprietary, with details kept secret, but based on well known principles from information retrieval or classification
    to some extent Google is an exception – they published their original method, but nothing further
case of Google
• developed by Sergey Brin and Lawrence Page while students at Stanford
    in the beginning it ran on Stanford computers
• basic approach has been described in their famous paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine”
    well written, simple language, has their pictures
    in the acknowledgement they cite the support of NSF’s Digital Library Initiative, i.e. initially, Google came out of government sponsored research
    they describe their method PageRank - based on ranking hyperlinks as in citation indexing
    “We chose our system name, Google, because it is a common spelling of googol, or ten to the hundredth power”
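The link-ranking idea can be sketched as a small power iteration. This is a toy illustration, not Google's production method: it uses the normalized form of the formula (ranks sum to 1), a damping factor of 0.85, and a made-up three-page link graph; pagerank and LINKS are hypothetical names:

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}


def pagerank(links, damping=0.85, iterations=100):
    """Iteratively pass each page's rank along its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank(LINKS)
```

In this graph page c is linked from both a and b, so it ends up ranked highest, much as a heavily cited paper does in citation indexing.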
coverage differences
• no engine covers more than a fraction of the WWW
    estimates: none more than 16%
    hard (even impossible) to discern & compare coverage, but they differ substantially in what they cover
• in addition:
    many national search engines
        own coverage, orientation, governance
    many specialized or domain search engines
        own coverage geared to subject of interest
    many comprehensive sources independent of search engines
        some have compilations of evaluated web sources
searching differences
• substantial differences among search engines on searching, retrieval, display
    need to know how they work & differ with respect to
        defaults in searching a query
        searching of phrases, case sensitivity, categories
        searching of different fields, formats, types of resources
        advanced search capabilities and features
        possibilities for refinement, using relevance feedback
        display options
        personalization options
• Greg Notess’ chart & features describe differences
business model differences
several business models
• public good - have independent budget
    e.g. PubMed, Librarians’ Index to Internet
• earn revenue from provision of information
    all commercial search engines
• using search engines to promote their other activities
    e.g. telephone directories
sponsorship differences
• need to understand treatment of sponsorship – sponsors influence what engines search & how they display results
    some list results from sponsored sites separately, so you are reasonably clear what is there - what is sponsored & what is not
    some have display-per-pay - showing first the sites that paid most & do not even tell you that
    some have pay per update of sites
• imperative to find sources that explain these models for different engines, to know what is covered & what you are getting
limitations
• every search engine has limitations as to
    coverage
        meta engines just follow coverage limitations & have more of their own – have to be careful in their use
    search capabilities
    finding quality information
• some have compromised search with economics, becoming little more than advertisers
• but search engines are also many times victims of spamdexing, affecting what is included and how it is ranked
spamming a search engine
• use of techniques that push rankings higher than they belong is also called spamdexing
    methods typically include textual as well as link-based techniques
    like e-mail spam, search engine spam is a form of adversarial information retrieval
        the conflicting goals: accurate results sought by search providers & high positioning in the ranking sought by content providers
• search engines are constantly battling this with their own special (& secret) tools
search engine features, reviews, tutorials
• Search Engine Showdown
    lists, reviews, follows search engines, blog – look at Chart
    by Greg Notess (librarian) – book Teaching Web Search Skills has live links
• Recommended search engines by UC Berkeley
    library workshop; lists features, evaluates
• Search Basics: Web Search Essentials
    among others, has a large section on search engines
• Search features chart
    with explanations
how to find a search engine?
• resources that list or categorize engines
    Search Engine Guide
        engines categorized by topic; other engine information
    Search Engine Colossus
        international directory of search engines by country, topic
        from 351 countries and territories; engines in many languages
    Phil Bradley’s country based search engines
        “currently a total of 4,017 search engines and 222 countries, territories, islands and regions”
all questions are not created equal
• what engine, what resource to use for what kind of question or information need?
    An exhaustive classification in:
        Finding information: search engines by Phil Bradley
    Sources for different topics:
        Choose the Best Search for Your Information Need by NoodleTools
    List of capabilities for major search engines:
        Best Search Tools Chart by Infopeople
meta search engines
• meta engines search multiple engines
    getting combined results from a variety of engines
• do not have their own databases
    but have their own business models affecting results
• a number of techniques used
    interesting ones: clustering, statistical analyses
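Combining results from several engines can be sketched as follows. The merging rule used here (sum of reciprocal ranks, so a URL listed high by several engines rises to the top) is an assumption for illustration, not the method of any named meta engine; the engine names, URLs and merge_results are hypothetical:

```python
from collections import defaultdict

# Hypothetical ranked result lists returned by two underlying engines.
RESULTS = {
    "engine_a": ["x.example/eggs", "y.example/omelet"],
    "engine_b": ["y.example/omelet", "z.example/toast"],
}


def merge_results(results_by_engine):
    """Merge ranked lists; each URL scores the sum of 1/position over engines."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for engine, urls in results_by_engine.items():
        for position, url in enumerate(urls, start=1):
            scores[url] += 1.0 / position
            sources[url].append(engine)  # remember which engines returned it
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return [(url, sorted(sources[url])) for url in ranked]


merged = merge_results(RESULTS)
```

Keeping the list of source engines with each URL is what lets a meta engine show where a result came from, so overlap between engines can be compared.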
sample of meta engines - with organized results
Dogpile
    results from a number of leading search engines; gives source, so overlap can be compared; has SearchSpy - listing searches that were performed
Surfwax
    gives text sources & linking to sources; for some terms gives related terms to focus
Turbo10
    provides results in clusters; engines searched can be edited
Clusty
    results grouped by topics or clusters for further sources
meta search engines (cont.)
• large directory
    Complete Planet
        directory of over 70,000 databases & specialty engines; classified
• results with graphical displays
    Kartoo
        results in display by topics of query
• new kid on the block (not a meta engine, but a search engine)
    Cuil
        Claim: “Cuil searches more pages on the Web than anyone else – three times as many as Google and ten times as many as Microsoft”. Well … I do not know if it holds.
multilingual
• English still the major language but declining, now slightly over 50%
• multilingual retrieval search engines
    Euroseek
        searches in a number of languages
    All the Web
        results in 45 languages
where to find out?
• information about search engines in sources that have updates, news, tips for searching and more – a MUST for searchers:
    Search Engine Watch
        ratings, news, statistics, charts, explanations, tutorials
    Search Engine Showdown
        “The users’ guide to web searching” - run by a librarian, news links, ratings
    Virtual Chase
        a site about “Teaching Legal Professionals How To Do Research” - this section has very good tips and links for consideration of quality on the web
where? …
SiteLines
    a blog, written by Rita Vine, a professional librarian & web search trainer; many evaluations in archive
ResourceShelf
    “Resources and News for Information Professionals,” edited by Gary Price, a librarian & author of Invisible Web – has extensive archive
Websearch About
    not evaluative, but provides news, capabilities, sources, articles about web searching
art of searching search engines
definition
• digital libraries are viewed from several perspectives
    technical: “Digital library is a managed collection of information, with associated services, where information is stored in digital format and accessible over a network.” (Arms, 2000)
    institutional: “Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities.” (Waters, 1998)
a bit of context
• digital libraries have a short but volatile history
    research & development took off by the start/mid 1990’s
    in the next decade phenomenal growth worldwide
    large investment in research, development, keeping up
• number of communities involved
    computer science, primarily in research
    library & information science: operations, studies of users, use, usability
    many subjects: digital libraries in their domain
• diversity is large
    many institutions, e.g. museums, developed own
libraries & digital resources
• libraries (particularly research, academic & special) invested massive & ongoing funding toward
    electronic journals
    databases
    reference sources
    digitization of parts of collection
• thus becoming in effect digital libraries – or more accurately hybrid libraries
    with graphic and digital versions or types of resources
RUL has substantial holdings & expenditures in all of these
emphasis here
• on large academic or research digital libraries that also are related to searching
    including provision of search capabilities & access to
        databases
        electronic journals that provide full text of articles after a search
        digital reference sources
• such libraries have also become search portals of sorts, essential for their users in education, research & related activities
sample
New York Public Library Digital Collections
    A gateway to rare and unique collections in digitized form & to databases. Access to most searchable databases requires library card number
U California Berkeley Digital Library SUNsite
    digital collections and services
The British Library
    “The world’s knowledge.” Includes “Services for library and information Professionals.”
Los Angeles Public Library Kids’ Path
    resources for children; search through directory
sample …
New Zealand Digital Library
    searching of a number of digital collections, incl. humanitarian and UN collections; provision of free software for digital libraries
Public Library of Science
    “PLoS is a nonprofit organization of scientists and physicians committed to making the world's scientific and medical literature a public resource.” Publishes open access journals
Closer to home: New Brunswick Free Public Library
    has online resources, databases (some require library PIN), historical archives and more
    example of great many public libraries that have databases for searching
Rutgers libraries – digital components
• strategic planning in developing digital access
• rich & complex content of digital resources
    several hundred indexes & databases for searching
    some 20,000 electronic journals
    thousand & more digital reference sources
    subject research guides
    Searchpath & other tutorials
    electronic reserve
• affected teaching, learning, research by the whole community
some critical issues for searching
• no way yet to do effective federated searching in digital libraries (to search several indexes at the same time)
    RUL has Searchlight – searches only 8 major databases
    each source has to be searched separately
    most have very different search features, capabilities
• finding items in indexes does not mean that you are always able to get full text
• thus, searching is time-consuming, chaotic
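One reason federated searching is hard is that each source expects its own query syntax. A minimal sketch of the translation step, assuming two made-up sources with invented syntaxes (index_a, index_b and all function names are hypothetical and do not describe Searchlight or any real database):

```python
def to_index_a(terms):
    """Hypothetical source A expects terms joined with an explicit AND."""
    return " AND ".join(terms)


def to_index_b(terms):
    """Hypothetical source B expects each required term prefixed with '+'."""
    return " ".join("+" + term for term in terms)


# One translator per source; a real library would need one per database.
TRANSLATORS = {
    "index_a": to_index_a,
    "index_b": to_index_b,
}


def federate(terms):
    """Rewrite one list of search terms into the syntax of every source."""
    return {name: translate(terms) for name, translate in TRANSLATORS.items()}


queries = federate(["digital", "libraries"])
```

Translating the query is only the first step; the results from each source would still have to be fetched, de-duplicated and ranked together.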
where to find out?
• information about digital libraries for searching
    LibWeb
        Webjunction, formerly U California, Berkeley
        “lists currently over 7900 pages from libraries in over 146 countries”
    Digital Library Federation
        “a consortium of libraries and related agencies that are pioneering the use of electronic-information technologies to extend their collections and services”
    D-Lib Magazine
        “a solely electronic publication with a primary focus on digital library research and development, including but not limited to new technologies, applications, and contextual social and economic issues”
where? …
Ariadne (UK)
    “to report on information service developments and information networking issues worldwide, keeping the busy practitioner abreast of current digital library initiatives”
Journal of Digital Information
    “Publishing papers on the management, presentation and uses of information in digital environments”
Tool Kit for the Expert Web Searcher
    one of the wikis by Library Information and Technology Association, a division of the American Library Association
Expert Web Search Tips
    one of many informative articles from the Living Internet
in conclusion
• search engines are great but you have to KNOW what is under the hood
    as to coverage, business model, search features, outputs …
    they are NOT for every kind of information need
• digital libraries are great for searching but you have to KNOW requirements for searching different resources that are included
    as yet federated searching is limited
art of searching digital libraries
more
and rewards …