October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson
Cheshire II: Recent Additions &
Cheshire III: Design and System Overview
Ray R. Larson
School of Information Management and Systems
University of California, Berkeley

Overview
• Cheshire II
  – Feature overview
  – Current usage
  – Recent additions
    • Distributed Search and Indexing
    • Geographic Operators and Search Ranking
    • XML Schemas and Element Retrieval
    • MySQL and PostgreSQL interfaces
    • CORI, Okapi BM-25 ranking algorithms
    • Result Set sorting, merging and ranking operators, bitmapped indexes
• Cheshire III Design and Development

Overview of Cheshire II
• It supports SGML and XML
• It is a client/server application
• Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP, SDLIP also implemented
• Server supports a Relational Database Gateway
• Supports Boolean searching of all servers
• Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
• Search engine supports “nearest neighbor” searches and relevance feedback
• GUI interface on X window displays and Windows NT
• WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
• Scriptable clients using Tcl and (new) Python
• Store SGML/XML as files or “Datastore” database

Current Usage
• Over 100 Databases in the UK, including
  – AHDS/History Data Service
  – Mersey Libraries
  – ZETOC
  – Archives Hub
    • Distributed Archives Hub
  – JISC Resource Discovery Network (RDN)
    • (OAI-PMH Harvesting with Cheshire Search)
  – Planned use with TEL being developed by the BL
• Also being used at Harvard and Berkeley
• California Sheet Music Project
• Los Alamos National Lab (genomics metadata)

Distributed Search

The Problem
• The Digital Library vision -- Access to everyone for “all human knowledge”
• Lyman and Varian’s estimates of the “Dark Web”
• Hundreds or Thousands of servers with databases ranging widely in content, topic, format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the “best” ones to search?
    • Which resource to search first?
    • Which to search next if more is wanted?
  – Topical/domain constraints on the search selections
  – Variable contents of database (metadata only, full text, multimedia…)

Distributed Search Tasks
• Resource Description
  – How to collect metadata about digital libraries and their collections or databases
• Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

An Approach for Distributed Resource Discovery
• Distributed resource representation and discovery
  – New approach to building resource descriptions based on Z39.50
  – Instead of using broadcast search across resources we are using two Z39.50 Services
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
  – How efficiently can we build distributed indexes?
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Can we build hierarchies of servers (general/meta-topical/individual)?

Z39.50 Explain
• Explain supports searches for
  – Server-Level metadata
    • Server Name
    • IP Addresses
    • Ports
  – Database-Level metadata
    • Database name
    • Search attributes (indexes and combinations)
  – Support metadata (record syntaxes, etc.)

Z39.50 SCAN
• Originally intended to support Browsing
• Query for
  – Database
  – Attributes plus Term (i.e., index and start point)
  – Step Size
  – Number of terms to retrieve
  – Position in Response set
• Results
  – Number of terms returned
  – List of Terms and their frequency in the database (for the given attribute combination)

Z39.50 SCAN Results
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos

    % zscan title cat 1 20 1
    {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} …

    % zscan topic cat 1 20 1
    {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} …
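
A client harvesting these responses has to turn the brace-delimited list into terms and counts. The following is a minimal Python sketch of that step, assuming the response arrives as a single string shaped like the output above; `parse_scan_response` is a hypothetical helper for illustration, not part of the Cheshire client API.

```python
import re


def parse_scan_response(response: str):
    """Split a brace-delimited SCAN response into header fields and term counts."""
    header_match = re.match(r"\s*\{SCAN\s+((?:\{[^{}]*\}\s*)*)\}", response)
    if header_match is None:
        raise ValueError("response does not start with a {SCAN ...} header")
    # Header fields look like {Status 0}{Terms 20}{StepSize 1}{Position 1}.
    header = {
        key: int(value)
        for key, value in re.findall(r"\{(\w+)\s+(-?\d+)\}", header_match.group(1))
    }
    # Everything after the header is a run of {term frequency} groups.
    body = response[header_match.end():]
    terms = {
        term: int(freq)
        for term, freq in re.findall(r"\{([^{}]+?)\s+(\d+)\}", body)
    }
    return header, terms


sample = (
    "{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}"
    "{cat 27}{cat-fight 1}{catalan 19}{catalogu 37}"
)
header, terms = parse_scan_response(sample)
print(header["Terms"], terms["catalan"])
```

A dictionary keeps the term order of the response (Python dictionaries preserve insertion order), which matters because SCAN returns terms in index order.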

Resource Index Creation
• For all servers, or a topical subset…
  – Get Explain information
  – For each index
    • Use SCAN to extract terms and frequency
    • Add term + freq + source index + database metadata to the XML “Collection Document” for the resource
  – Planned extensions:
    • Post-Process indexes (especially Geo Names, etc.) for special types of data
      – e.g. create “geographical coverage” indexes
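
The harvesting loop above can be sketched in Python as follows. This is an illustration only: the slides do not give the layout of the Collection Document, so the element and attribute names (`collection`, `index`, `term`, `freq`) are assumptions, and the per-index term counts are taken as already harvested with SCAN.

```python
import xml.etree.ElementTree as ET


def build_collection_document(db_name, host, scans):
    """Build one XML document describing a remote database.

    scans maps an index name to {term: frequency}, as extracted with SCAN.
    All element and attribute names here are hypothetical.
    """
    root = ET.Element("collection", {"database": db_name, "host": host})
    for index_name, term_freqs in scans.items():
        index_el = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in sorted(term_freqs.items()):
            # Each entry records term + frequency under its source index.
            term_el = ET.SubElement(index_el, "term", {"freq": str(freq)})
            term_el.text = term
    return root


doc = build_collection_document(
    "bibfile",
    "sherlock.berkeley.edu",
    {"title": {"cat": 27, "catalan": 19}, "topic": {"cat": 706}},
)
print(ET.tostring(doc, encoding="unicode"))
```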

MetaSearch Approach
[Figure: a MetaSearch Server holding the Distributed Index sends mapped Explain and Scan queries over the Internet to three remote search engines (DB 1 and DB 2, DB 3 and DB 4, DB 5 and DB 6); each remote site maps the incoming query and maps its results back.]

Known Issues and Problems
• Not all Z39.50 Servers support SCAN or Explain
• Solutions that appear to work well:
  – Probing for attributes instead of Explain (e.g. DC attributes or analogs)
  – We also support OAI and can extract OAI metadata for servers that support OAI
  – Query-based sampling (Callan)
• Collection Documents are static and need to be replaced when the associated collection changes

Evaluation
• Test Environment
  – TREC Tipster data (approx. 3 GB)
  – Partitioned into 236 smaller collections based on source and date by month (no DOE)
    • High size variability (from 1 to thousands of records)
    • Same database as used in other distributed search studies by J. French and J. Callan among others
  – Used TREC topics 51-150 for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks)

Harvesting Efficiency
• Tested using the databases on the previous slide + the full FT database (210,158 records ~ 600 MB)
• Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network
• Average of 14.07 seconds
• Also tested larger databases (e.g. the TREC FT database, ~600 MB with 7 indexes, was harvested in 131 seconds)

Our Collection Ranking Approach
• We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
• Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results)

Probabilistic Retrieval: Logistic Regression
Probability of relevance for a given index is based on logistic regression from a sample set of documents to determine the values of the coefficients (TREC). At retrieval the probability estimate is obtained by:

\[ P(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i \]

Probabilistic Retrieval: Logistic Regression attributes

\[
\begin{aligned}
X_1 &= \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j} && \text{Average Absolute Query Frequency} \\
X_2 &= \sqrt{QL} && \text{Query Length} \\
X_3 &= \frac{1}{M} \sum_{j=1}^{M} \log CAF_{t_j} && \text{Average Absolute Collection Frequency} \\
X_4 &= \sqrt{CL / 10} && \text{Collection size estimate} \\
X_5 &= \frac{1}{M} \sum_{j=1}^{M} \log ICF_{t_j} && \text{Average Inverse Collection Frequency} \\
ICF_{t_j} &= \frac{N - n_{t_j}}{n_{t_j}} && \text{Inverse Document Frequency } (N = \text{Number of collections}) \\
X_6 &= \log M
\end{aligned}
\]

M = Number of Terms in common between query and document
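
To make the six attributes concrete, here is a small Python sketch that computes them for one query and one collection representative and then forms the linear combination c_0 + Σ c_i X_i. The function names, argument layout, and any coefficient values passed in are assumptions for illustration; the slides do not list the fitted coefficients. The square roots in X_2 and X_4 follow the published Berkeley formulation and should be treated as an assumption here, as should reading QL as the total number of query term occurrences.

```python
import math


def collection_features(query_tf, coll_tf, coll_len, n_with_term, n_collections):
    """Return [X1..X6] for one query/collection pair, or None if no terms match.

    query_tf:      {term: frequency in the query}            (QAF)
    coll_tf:       {term: frequency in the collection}       (CAF)
    coll_len:      collection size estimate                  (CL)
    n_with_term:   {term: number of collections containing it}
    n_collections: total number of collections               (N)
    """
    common = [t for t in query_tf if coll_tf.get(t, 0) > 0]
    m = len(common)
    if m == 0:
        return None  # no shared terms, so the averages are undefined
    for t in common:
        if not 0 < n_with_term[t] < n_collections:
            raise ValueError("ICF needs 0 < n < N for term %r" % t)
    x1 = sum(math.log(query_tf[t]) for t in common) / m
    x2 = math.sqrt(sum(query_tf.values()))
    x3 = sum(math.log(coll_tf[t]) for t in common) / m
    x4 = math.sqrt(coll_len / 10)
    x5 = sum(
        math.log((n_collections - n_with_term[t]) / n_with_term[t]) for t in common
    ) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]


def score(features, coefficients):
    """Linear combination c0 + sum(ci * Xi); coefficients = [c0, c1, ..., c6]."""
    c0, *cs = coefficients
    return c0 + sum(c * x for c, x in zip(cs, features))


feats = collection_features(
    {"cat": 1, "dog": 2}, {"cat": 10, "fish": 3}, 1000, {"cat": 4}, 20
)
print(feats)
```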

Evaluation
• Effectiveness
  – Tested using the collection representatives described above (as harvested from over the network) and the TIPSTER relevance judgements
  – Testing by comparing our approach to known algorithms for ranking collections
  – Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX)
  – Recall analog (How many of the Rel docs occurred in the top n databases – averaged)

Titles only (short query)
[Figure: results plot for title-only (short) queries; axis label R̂]

Future
• Logically Clustering servers by topic
• Meta-Meta Servers (treating the MetaSearch database as just another database)

Distributed Metadata Servers
[Figure: hierarchy of server types labeled Replicated servers, Meta-Topical Servers, General Servers, and Database Servers]

Geographic Operators and Search Ranking

The GEO Operations
• Operators established for the GEO Z39.50 profile
• Implemented using special operations on indexes
• Indexing allows extraction of geographic coordinates and dates from SGML/XML data in a variety of formats
• Normalized internal representation in indexes
• Search using geographic and time elements as primary or limiting search elements

The GEO Operations
• X-based interfaces permit (simple) map drawing and search
• Interface to MapServer for web-based map searching

GEO Geographic operators

Operator  Name              Description
>=<       Overlap           Search region and data overlap
>#<       Fully Enclosed    Data fully enclosed in search region
<#>       Encloses          Data fully encloses search region
<>#       Fully Outside     Data outside of search region
++        Near              Data is near search region
:<:       Before            Data date is before search date
:<=:      Before or During  Data date is before or during search date
:>=:      During or After   Data date is during or after search date
:>:       After             Data date is after search date
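
The four region operators can be read as predicates on bounding boxes. The Python sketch below illustrates that reading under simplifying assumptions: regions are axis-aligned latitude/longitude boxes that do not cross the antimeridian, and touching edges count as overlap. The `Box` type and function names are hypothetical; this is not the Cheshire index implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    west: float
    south: float
    east: float
    north: float


def overlaps(data: Box, search: Box) -> bool:
    """>=< : the data region and the search region share some area or edge."""
    return (
        data.west <= search.east
        and data.east >= search.west
        and data.south <= search.north
        and data.north >= search.south
    )


def fully_enclosed(data: Box, search: Box) -> bool:
    """>#< : the data region lies entirely inside the search region."""
    return (
        search.west <= data.west
        and data.east <= search.east
        and search.south <= data.south
        and data.north <= search.north
    )


def encloses(data: Box, search: Box) -> bool:
    """<#> : the data region entirely contains the search region."""
    return fully_enclosed(search, data)


def fully_outside(data: Box, search: Box) -> bool:
    """<># : the data region and the search region do not touch at all."""
    return not overlaps(data, search)


search_region = Box(west=-125.0, south=32.0, east=-114.0, north=42.0)
record_region = Box(west=-123.0, south=37.0, east=-121.0, north=38.5)
print(overlaps(record_region, search_region), fully_enclosed(record_region, search_region))
```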

Overlaps search
[Figure]

Fully Enclosed Search
[Figure]

Map-Based Search
[Figure]

GeoSearch Web Interface
[Figure]

XML Schemas and Element Retrieval

XML Schema Support
• XML Schemas can now be used to define the data contents
• Tested with a wide variety of schemas including METS (with various supporting schemas)

XML Element Extraction
• A new search “ElementSetName” is XML_ELEMENT_
• Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
• The matching elements are extracted from the records matching the search and delivered in a simple format.

XML Extraction

    % zselect sherlock
    372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
    % zfind topic mathematics
    {OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
    % zset recsyntax XML
    % zset elementset XML_ELEMENT_Fld245
    % zdisplay
    {OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {<RESULT_DATA DOCID="1"><ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"><Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245></ITEM><RESULT_DATA> … etc…
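
A client receiving records in this format could pull out the document id, the XPath of each extracted element, and its text as sketched below in Python. The sample string is a hand-made, well-formed record modeled on the output above, with a placeholder title; it assumes each record is wrapped in a properly closed RESULT_DATA element, which the truncated transcript does not show.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed record modeled on the zdisplay output above.
sample = (
    '<RESULT_DATA DOCID="1">'
    '<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">'
    '<Fld245 AddEnty="No" NFChars="0"><a>Example title</a></Fld245>'
    "</ITEM></RESULT_DATA>"
)


def extract_items(result_xml):
    """Return (docid, xpath, text) for every ITEM in one RESULT_DATA record."""
    root = ET.fromstring(result_xml)
    rows = []
    for item in root.iter("ITEM"):
        # Concatenate all text under the extracted element.
        text = "".join(item.itertext()).strip()
        rows.append((root.get("DOCID"), item.get("XPATH"), text))
    return rows


for docid, xpath, text in extract_items(sample):
    print(docid, xpath, text)
```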

MySQL and PostgreSQL

RDBMS Support
• There are two reasons for RDBMS support
  – IR systems are not meant for LOTS of update transactions
  – Some applications need to have access to both relational data and text data via Z39.50
• Both MySQL and PostgreSQL are popular open source RDBMS and either can now be used via Cheshire
  – Z39.50 mappings to RDBMS columns
  – “ZQL” submission of SQL as Z39.50 Type 0 query

Protocol Support

Protocols
• In Cheshire II most protocols (except Z39.50) are implemented using scripting
• Example scripts to support the following are included in the distribution
  – OAI
  – SRW (Python version)
  – SOAP
  – SDLIP

CORI, Okapi BM-25 ranking algorithms

Why additional ranking methods
• CORI is extremely hard to beat as a distributed search method
• OKAPI BM-25 is now the “default” retrieval algorithm in experimental IR
• New operators (later) let us mix and match ranking methods and Boolean operations

CORI ranking

\[
\begin{aligned}
T &= \frac{df}{df + 50 + 150 \cdot cw / \overline{cw}} \\
I &= \frac{\log\left(\frac{|DB| + 0.5}{cf}\right)}{\log\left(|DB| + 1.0\right)} \\
p(r_k \mid db_i) &= 0.4 + 0.6 \cdot T \cdot I
\end{aligned}
\]

where:
• df is the number of documents containing r_k
• cf is the number of databases containing r_k
• |DB| is the number of databases being ranked
• cw is the number of words in db_i
• \overline{cw} is the average cw of the databases being ranked
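
As a worked illustration of the CORI belief formula, the Python sketch below computes p(r_k | db_i) for one term and one database. The function and argument names are chosen here for readability and are not taken from any Cheshire source; the constants 50, 150, 0.4 and 0.6 are the ones on the slide.

```python
import math


def cori_belief(df, cf, cw, avg_cw, num_dbs):
    """Belief that database db_i is relevant to term r_k.

    df:      number of documents in db_i containing the term
    cf:      number of databases containing the term (must be > 0)
    cw:      number of words in db_i
    avg_cw:  average cw over the databases being ranked
    num_dbs: number of databases being ranked, |DB|
    """
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_dbs + 0.5) / cf) / math.log(num_dbs + 1.0)
    return 0.4 + 0.6 * t * i


# A term that is frequent in this database but present in few databases
# scores higher than one that is rare here.
print(cori_belief(df=200, cf=1, cw=100, avg_cw=100, num_dbs=10))
print(cori_belief(df=5, cf=1, cw=100, avg_cw=100, num_dbs=10))
```

With df = 0 the T component vanishes and the belief falls back to the default value of 0.4, which is why 0.4 acts as the floor for every database.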

Okapi BM25

\[
\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}
\]

• Where:
• Q is a query containing terms T
• K is k_1((1 - b) + b \cdot dl / avdl)
• k_1, b and k_3 are parameters, usually 1.2, 0.75 and 7-1000
• tf is the frequency of the term in a specific document
• qtf is the frequency of the term in a topic from which Q was derived
• dl and avdl are the document length and the average document length measured in some convenient unit
• w^{(1)} is the Robertson-Sparck Jones weight:

\[
w^{(1)} = \log \frac{\left(\dfrac{r + 0.5}{R - r + 0.5}\right)}{\left(\dfrac{n - r + 0.5}{N - n - R + r + 0.5}\right)}
\]
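
The BM25 sum can be written out directly. The Python sketch below is an illustration under stated assumptions: N is the number of documents in the collection, n the number containing the term, and R and r the numbers of known relevant documents overall and containing the term; these meanings are the standard ones but are not spelled out on the slide. With no relevance information (R = r = 0) the weight reduces to log((N - n + 0.5) / (n + 0.5)). Function names and the default k_3 = 7 are choices made here.

```python
import math


def rsj_weight(n, N, r=0, R=0):
    """Robertson-Sparck Jones weight w(1) for one term."""
    return math.log(
        ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )


def bm25_score(query_tf, doc_tf, dl, avdl, df, N, k1=1.2, b=0.75, k3=7.0):
    """Score one document against a query.

    query_tf: {term: qtf}, doc_tf: {term: tf}, df: {term: n}, N: collection size.
    """
    K = k1 * ((1 - b) + b * dl / avdl)
    total = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(df[term], N)
        total += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return total


print(bm25_score({"cat": 1}, {"cat": 3}, dl=120, avdl=100, df={"cat": 10}, N=1000))
```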

Result Set sorting, merging and ranking operators, bitmapped indexes

Sorting
• Support for Z39.50 Sort functions
• Merge multiple resultsets and sort new set
  – Sort by index name/key (ATTRIBUTE)
  – Sort by rank (ELEMENTS)
    • Merges ranked results and Boolean results
  – Sort by XML/SGML Tag contents (TAG)

Merging and Ranking Operators
• Extends the capabilities of merging to include merger operations in queries like Boolean operators
• Fuzzy Logic Operators
  – !FUZZY_AND
  – !FUZZY_OR
  – !FUZZY_NOT
• Restrict components to particular parents
• Merge Operators
  – !MERGE_SUM
  – !MERGE_MEAN
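
The slides name these operators without giving their definitions. The Python sketch below shows one common interpretation, offered as an assumption rather than as Cheshire's actual behavior: result sets are maps from document id to a score in [0, 1], the fuzzy operators use the standard min / max / complement rules, and the merge operators add or average scores, counting a missing document as 0.

```python
def fuzzy_and(a, b):
    """Minimum of the two scores; documents missing from either set drop out."""
    return {d: min(a[d], b[d]) for d in a.keys() & b.keys()}


def fuzzy_or(a, b):
    """Maximum of the two scores over the union of documents."""
    return {d: max(a.get(d, 0.0), b.get(d, 0.0)) for d in a.keys() | b.keys()}


def fuzzy_not(a, universe):
    """Complement of each score, taken over an explicit set of document ids."""
    return {d: 1.0 - a.get(d, 0.0) for d in universe}


def merge_sum(*result_sets):
    """Add up the scores each document received across all result sets."""
    merged = {}
    for rs in result_sets:
        for d, s in rs.items():
            merged[d] = merged.get(d, 0.0) + s
    return merged


def merge_mean(*result_sets):
    """Average over all input sets, counting absence as a score of 0."""
    return {d: s / len(result_sets) for d, s in merge_sum(*result_sets).items()}


ranked = {"d1": 0.5, "d2": 0.25}
other = {"d2": 0.75, "d3": 1.0}
print(fuzzy_and(ranked, other), merge_mean(ranked, other))
```

A Boolean result set fits the same shape if every matching document is given a score of 1.0, which is one way ranked and Boolean results could be combined.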

Bitmapped Indexes
• Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value
• Only one bit per record stored in the index
• Processed on a demand basis so only blocks with the bits needed to resolve a query are fetched
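
The idea of one bit per record per value can be shown with a toy in-memory version in Python, using arbitrary-precision integers as bit vectors. This is a sketch of the general technique only: the class and method names are invented here, and it does not model the on-demand block fetching described above.

```python
from collections import defaultdict


class BitmapIndex:
    """Maps each distinct field value to a bit vector with one bit per record."""

    def __init__(self):
        self._bitmaps = defaultdict(int)

    def add(self, record_id, value):
        # Set the bit for this record in the bitmap of its value.
        self._bitmaps[value] |= 1 << record_id

    def bitmap(self, value):
        return self._bitmaps.get(value, 0)

    @staticmethod
    def record_ids(bits):
        """List the record ids whose bits are set."""
        ids = []
        position = 0
        while bits:
            if bits & 1:
                ids.append(position)
            bits >>= 1
            position += 1
        return ids


language = BitmapIndex()
fmt = BitmapIndex()
for rec_id, (lang, kind) in enumerate(
    [("eng", "map"), ("fre", "map"), ("eng", "book"), ("eng", "map")]
):
    language.add(rec_id, lang)
    fmt.add(rec_id, kind)

# Boolean AND and OR become single bitwise operations on the bit vectors.
english_maps = language.bitmap("eng") & fmt.bitmap("map")
print(BitmapIndex.record_ids(english_maps))
```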

Cheshire III Design and Development

Cheshire III Goals
• Retain or reproduce (and refine) all Cheshire II features
  – “Spring cleaning” of code base
  – Add Full Unicode Support
  – Store most system and content data in the database
• Permit easy and efficient integration in Web Services
• Use threaded server for economy of resource usage
• Enhanced Multiprotocol support
• Support for distributed processing (i.e., GRID clusters)
• Enhance expandability and “drop in” functionality
• Interfaces and/or APIs for Java, Python, C/C++

Cheshire II Design Overview
[Figure: component diagram with boxes labeled XML DOCS, XML DIRECTORY, INDEX CLUSTER, INDEX CHESHIRE, CONT, BUILD ASSOC, ZSERVER, CONFIG, COMPONENT DEFINITION, INDEX(S), ASSOC, and CLUSTER EXTENSION]

Cheshire III Server Overview
[Figure: architecture of the Cheshire III SERVER. Labeled components: API, INDEXING, XSLT RECORD TRANSFORMS, SEARCH, PROTOCOL HANDLER, DB API, REMOTE SYSTEMS (any protocol), XML CONFIG & Metadata INFO, INDEXES, LOCAL DB, STAFF UI, CONFIG, NETWORK, RESULT SETS, SCAN, USER INFO, CONFIG & CONTROL, ACCESS INFO, AUTHENTICATION, CLUSTERING, Normalization, APACHE INTERFACE, SERVER CONTROL, Client, User/Clients. Labeled interfaces: Native calls, Z39.50, SOAP, OAI, JDBC, Fetch ID, Put ID, OpenURL, UDDI, WSRP, SRW, OGIS.]

Retain Features
• The intent is to permit all of the types of indexing, searching and record formatting available now, while making it easier to add new capabilities
• The new system will also support full UNICODE for content and for metadata
• Store metadata and content in the database (including config information, etc.)

Permit easy integration of Web Services
• The assumption is that the web server will be the central server mechanism in the future.
• The new design relies on the session handling, threading and load management tools available in Apache (2.0.40+)
• The Cheshire server is dynamically loaded as part of the Web Server

Multiprotocol Support
• The Web server handles the network issues and passes requests in various protocols along to the Cheshire Server.
• Individual Protocol “plugins” and the Protocol Handler convert search, display, and metadata requests in a particular protocol to the internal Cheshire III control language, and convert outgoing messages and data to the appropriate protocol form
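
The plugin arrangement described above can be sketched roughly as follows. Everything here is illustrative: `InternalRequest`, `PLUGINS`, `plugin`, `handle` and the `query-string` protocol name are hypothetical, and the real Cheshire III control language is not shown in these slides. The point is only that each plugin maps one wire format to a shared internal form, and the handler dispatches on the protocol name.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict
from urllib.parse import parse_qs


@dataclass
class InternalRequest:
    """Hypothetical stand-in for a request in the internal control language."""
    operation: str  # e.g. "search", "display" or "metadata"
    arguments: Dict[str, str] = field(default_factory=dict)


# Registry mapping a protocol name to the plugin that parses its raw requests.
PLUGINS: Dict[str, Callable[[str], InternalRequest]] = {}


def plugin(protocol: str):
    """Decorator that registers a parser function for one protocol."""
    def register(func):
        PLUGINS[protocol] = func
        return func
    return register


@plugin("query-string")
def parse_query_string(raw: str) -> InternalRequest:
    # Illustrative only: treats the incoming request as a URL query string.
    params = {key: values[0] for key, values in parse_qs(raw).items()}
    operation = params.pop("operation", "search")
    return InternalRequest(operation, params)


def handle(protocol: str, raw: str) -> InternalRequest:
    """Protocol handler: dispatch the raw request to the matching plugin."""
    if protocol not in PLUGINS:
        raise ValueError(f"no plugin registered for protocol {protocol!r}")
    return PLUGINS[protocol](raw)


request = handle("query-string", "operation=search&query=cheshire&limit=10")
print(request)
```

Supporting another protocol then means registering one more parser; the search and display code behind the handler only ever sees `InternalRequest` objects.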

Distributed Processing (RESEARCH)
• The server will support protocols for interchange of partial results and collection statistics with a single “Master” controlling the actions of a large number of “Slave” servers
• These will run in parallel in a GRID environment
• This is still “research” but will probably be using “Storage Grid” technology from SDSC with our own applications
• Non-Grid use of the same protocols, etc. will be possible (but definitely slower)
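
One way to picture the exchange of partial results and collection statistics is the sketch below, in which a master pools the statistics reported by each slave, rescores the partial hits with a global term weight, and keeps the best few. This is an assumption-laden illustration, not the Cheshire III protocol: `SlaveReport`, `merge_reports` and the smoothed TF-IDF weighting are all hypothetical choices made for the example.

```python
import heapq
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SlaveReport:
    """Hypothetical message one slave sends back to the master."""
    docs: int                                # documents held by this slave
    df: Dict[str, int]                       # per-term document frequency
    hits: List[Tuple[str, Dict[str, int]]]   # (doc id, {term: term frequency})


def merge_reports(reports: List[SlaveReport], terms: List[str], k: int) -> List[Tuple[float, str]]:
    """Master step: pool statistics, rescore hits globally, return the top k."""
    total_docs = sum(r.docs for r in reports)
    global_df = {t: sum(r.df.get(t, 0) for r in reports) for t in terms}
    # Smoothed inverse document frequency computed from the pooled statistics.
    idf = {t: math.log((total_docs + 1) / (global_df[t] + 1)) + 1.0 for t in terms}

    scored = []
    for report in reports:
        for doc_id, tf in report.hits:
            score = sum(tf.get(t, 0) * idf[t] for t in terms)
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


reports = [
    SlaveReport(docs=100, df={"xml": 10, "grid": 1},
                hits=[("a1", {"xml": 2}), ("a2", {"grid": 1})]),
    SlaveReport(docs=50, df={"xml": 40}, hits=[("b1", {"xml": 3})]),
]
top = merge_reports(reports, ["xml", "grid"], k=2)
print([doc_id for _, doc_id in top])
```

Pooling the statistics first matters because a term that is rare on one server may be common overall; scoring each slave's hits with purely local weights would make the merged ranking inconsistent.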

Enhanced Expandability
• Clearly defined APIs for interacting with the server will permit easy addition of new functionality, or replacement or upgrade of existing functionality
• Interactive user interface for database configuration and setup
– We want to make it easier for a user/administrator to create and manage the database
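
A minimal sketch of what a clearly defined extension API buys: if the server depends only on an abstract interface, a component can be added, replaced or upgraded without touching the callers. The class names `Normalizer`, `LowercaseNormalizer`, `StopwordNormalizer` and `Server` are hypothetical and chosen for illustration; they are not the Cheshire III API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Normalizer(ABC):
    """Hypothetical extension point: turns raw text into index terms."""

    @abstractmethod
    def normalize(self, text: str) -> List[str]:
        ...


class LowercaseNormalizer(Normalizer):
    def normalize(self, text: str) -> List[str]:
        return text.lower().split()


class StopwordNormalizer(LowercaseNormalizer):
    """Upgrade that reuses the existing behaviour and drops common words."""

    def __init__(self, stopwords: List[str]):
        self.stopwords = set(stopwords)

    def normalize(self, text: str) -> List[str]:
        return [t for t in super().normalize(text) if t not in self.stopwords]


class Server:
    """Holds named components; callers depend only on the Normalizer API."""

    def __init__(self) -> None:
        self.components: Dict[str, Normalizer] = {}

    def install(self, name: str, component: Normalizer) -> None:
        # Installing under an existing name replaces the old component.
        self.components[name] = component

    def index_terms(self, name: str, text: str) -> List[str]:
        return self.components[name].normalize(text)


server = Server()
server.install("keyword", LowercaseNormalizer())
before = server.index_terms("keyword", "The Cheshire Server")
server.install("keyword", StopwordNormalizer(["the"]))
after = server.index_terms("keyword", "The Cheshire Server")
print(before, after)
```

The call to `index_terms` is identical before and after the upgrade; only the installed component changes.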

Multilingual APIs
• The system is being developed in a multilingual environment.
• We will include the ability to interface with (at a minimum) Java, Python and C/C++ applications.
• APIs for developing new functions will be available in these languages as well

Development
• Currently work is going on here (RRL) and (primarily) in the UK
• We have incomplete (Alpha) versions of the system, but haven’t been distributing them in their current form (changing constantly)
• First release version is expected in mid-’04

For More Information
• http://Cheshire.berkeley.edu
• ftp://Cheshire.berkeley.edu for source
• Contact [email protected]