October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson

Cheshire II: Recent Additions & Cheshire III: Design and System Overview

Ray R. Larson
School of Information Management and Systems
University of California, Berkeley

Overview

• Cheshire II
  – Feature overview
  – Current usage
  – Recent Additions
    • Distributed Search and Indexing
    • Geographic Operators and Search Ranking
    • XML Schemas and Element Retrieval
    • MySQL and PostgreSQL interfaces
    • CORI, Okapi BM-25 ranking algorithms
    • Result Set sorting, merging and ranking operators, bitmapped indexes
• Cheshire III Design and Development

Overview of Cheshire II

• It supports SGML and XML
• It is a client/server application
• Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP, SDLIP also implemented
• Server supports a Relational Database Gateway
• Supports Boolean searching of all servers
• Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
• Search engine supports “nearest neighbor” searches and relevance feedback
• GUI interface on X window displays and Windows NT
• WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
• Scriptable clients using Tcl and (new) Python
• Store SGML/XML as files or “Datastore” database

Current Usage

• Over 100 Databases in the UK, including
  – AHDS/History Data Service
  – Mersey Libraries
  – ZETOC
  – Archives Hub
    • Distributed Archives Hub
  – JISC Resource Discovery Network (RDN)
    • (OAI-PMH Harvesting with Cheshire Search)
  – Planned use with TEL being developed by the BL
• Also being used at Harvard and Berkeley
• California Sheet Music Project
• Los Alamos National Lab (genomics metadata)

Distributed Search

The Problem

• The Digital Library vision -- Access to everyone for “all human knowledge”
• Lyman and Varian’s estimates of the “Dark Web”
• Hundreds or Thousands of servers with databases ranging widely in content, topic, format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the “best” ones to search?
    • Which resource to search first?
    • Which to search next if more is wanted?
  – Topical/domain constraints on the search selections
  – Variable contents of database (metadata only, full text, multimedia…)

Distributed Search Tasks

• Resource Description
  – How to collect metadata about digital libraries and their collections or databases
• Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

An Approach for Distributed Resource Discovery

• Distributed resource representation and discovery
  – New approach to building resource descriptions based on Z39.50
  – Instead of using broadcast search across resources we are using two Z39.50 Services
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
  – How efficiently can we build distributed indexes?
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Can we build hierarchies of servers (general/meta-topical/individual)?

Z39.50 Explain

• Explain supports searches for
  – Server-Level metadata
    • Server Name
    • IP Addresses
    • Ports
  – Database-Level metadata
    • Database name
    • Search attributes (indexes and combinations)
  – Support metadata (record syntaxes, etc.)

Z39.50 SCAN

• Originally intended to support Browsing
• Query for
  – Database
  – Attributes plus Term (i.e., index and start point)
  – Step Size
  – Number of terms to retrieve
  – Position in Response set
• Results
  – Number of terms returned
  – List of Terms and their frequency in the database (for the given attribute combination)

Z39.50 SCAN Results

% zscan title cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 27}{cat-fight 1}{catalan 19}{catalogu 37}{catalonia 8}{catalyt 2}{catania 1}{cataract 1}{catch 173}{catch-all 3}{catch-up 2} …

% zscan topic cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 706}{cat-and-mouse 19}{cat-burglar 1}{cat-carrying 1}{cat-egory 1}{cat-fight 1}{cat-gut 1}{cat-litter 1}{cat-lovers 2}{cat-pee 1}{cat-run 1}{cat-scanners 1} …

Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
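A client that harvests SCAN output has to turn the brace-delimited reply into term and frequency pairs. The following is a minimal sketch of that step, assuming the reply has exactly the shape shown in the transcript above and that terms contain no spaces; the function name `parse_scan_result` is hypothetical and not part of Cheshire.

```python
import re

def parse_scan_result(raw):
    """Split a SCAN reply into a header dict and a list of (term, frequency) pairs."""
    # Header looks like: {SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}
    header_match = re.match(r"\{SCAN\s*((?:\{\w+ \d+\})+)\}", raw)
    if header_match is None:
        raise ValueError("not a SCAN result")
    header = {key: int(value)
              for key, value in re.findall(r"\{(\w+) (\d+)\}", header_match.group(1))}
    # Everything after the header is a run of {term frequency} groups.
    body = raw[header_match.end():]
    terms = [(term, int(freq)) for term, freq in re.findall(r"\{(\S+) (\d+)\}", body)]
    return header, terms

sample = "{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 27}{cat-fight 1}{catalan 19}"
header, terms = parse_scan_result(sample)
```

Multi-word terms would need a real Tcl list parser rather than a regular expression.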

Resource Index Creation

• For all servers, or a topical subset…
  – Get Explain information
  – For each index
    • Use SCAN to extract terms and frequency
    • Add term + freq + source index + database metadata to the XML “Collection Document” for the resource
  – Planned extensions:
    • Post-Process indexes (especially Geo Names, etc.) for special types of data
      – e.g. create “geographical coverage” indexes
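The harvesting loop above ends by writing terms, frequencies and their source index into an XML collection document. A rough sketch of that last step follows. The element and attribute names (`collection`, `index`, `term`, `freq`) are invented for illustration and are not the actual Cheshire collection document schema.

```python
import xml.etree.ElementTree as ET

def build_collection_document(db_name, harvested):
    """Build an XML collection document.

    harvested maps an index name to the (term, frequency) pairs that SCAN
    returned for that index. All element names here are hypothetical.
    """
    root = ET.Element("collection", {"database": db_name})
    for index_name, pairs in harvested.items():
        index_el = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in pairs:
            term_el = ET.SubElement(index_el, "term", {"freq": str(freq)})
            term_el.text = term
    return root

doc = build_collection_document(
    "bibfile",
    {"title": [("cat", 27), ("catalan", 19)], "topic": [("cat", 706)]},
)
xml_text = ET.tostring(doc, encoding="unicode")
```

Keeping the source index on each term is what later allows per-index estimates to be combined into one score for the resource.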

MetaSearch Approach

[Figure: MetaSearch architecture diagram. Labels: MetaSearch Server, Map Explain and Scan Queries, Map Query, Map Results, Distributed Index, Internet, three Search Engines, DB 1 through DB 6.]

Known Issues and Problems

• Not all Z39.50 Servers support SCAN or Explain
• Solutions that appear to work well:
  – Probing for attributes instead of explain (e.g. DC attributes or analogs)
  – We also support OAI and can extract OAI metadata for servers that support OAI
  – Query-based sampling (Callan)
• Collection Documents are static and need to be replaced when the associated collection changes

Evaluation

• Test Environment
  – TREC Tipster data (approx. 3 GB)
  – Partitioned into 236 smaller collections based on source and date by month (no DOE)
    • High size variability (from 1 to thousands of records)
    • Same database as used in other distributed search studies by J. French and J. Callan among others
  – Used TREC topics 51-150 for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks)

Harvesting Efficiency

• Tested using the databases on the previous slide + the full FT database (210,158 records, ~600 MB)
• Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network
• Average of 14.07 seconds
• Also tested larger databases (e.g. the TREC FT database, ~600 MB with 7 indexes, was harvested in 131 seconds)

Our Collection Ranking Approach

• We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
• Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results)

Probabilistic Retrieval: Logistic Regression

Probability of relevance for a given index is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained by:

\[ P(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i \]
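As a sketch of how such an estimate could be computed at retrieval time: the slide writes the weighted sum directly as the probability estimate. In a standard logistic regression model the weighted sum is the log odds of relevance, so the code below applies the logistic transform to it. That extra step is an assumption here, not something stated on the slide, and the coefficient values must come from training (none are given in the talk).

```python
import math

def relevance_probability(c0, coefficients, attributes):
    """Estimate P(R | Q, C) from an intercept, six coefficients and six attributes."""
    if len(coefficients) != len(attributes):
        raise ValueError("need one coefficient per attribute")
    # Weighted sum c0 + sum(c_i * X_i), treated here as the log odds of relevance.
    log_odds = c0 + sum(c * x for c, x in zip(coefficients, attributes))
    # Logistic transform maps log odds onto a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))

# Placeholder coefficients, for illustration only.
p_neutral = relevance_probability(0.0, [0.0] * 6, [1.0] * 6)
p_higher = relevance_probability(0.0, [0.5] * 6, [1.0] * 6)
```

Because the transform is monotonic, ranking collections by the weighted sum or by the transformed probability gives the same order.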

Probabilistic Retrieval: Logistic Regression attributes

\[ X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j} \]  Average Absolute Query Frequency

\[ X_2 = \sqrt{QL} \]  Query Length

\[ X_3 = \frac{1}{M} \sum_{j=1}^{M} \log CAF_{t_j} \]  Average Absolute Collection Frequency

\[ X_4 = \sqrt{\frac{CL}{10}} \]  Collection size estimate

\[ X_5 = \frac{1}{M} \sum_{j=1}^{M} \log ICF_{t_j} \]  Average Inverse Collection Frequency

\[ ICF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}} \]  Inverse Document Frequency (N = Number of collections)

\[ X_6 = \log M \]  M = Number of Terms in common between query and document
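The six attributes can be computed from a handful of per-term statistics. The sketch below is one reading of the slide's formulas: it assumes X2 and X4 are square roots (of QL and of CL/10), and that n for a term is the number of collections containing it. Both are assumptions, since the slide text does not spell them out. All parameter names are hypothetical.

```python
import math

def collection_attributes(qaf, caf, n_containing, query_length,
                          collection_length, num_collections):
    """Return [X1..X6] for one query/collection pair.

    qaf, caf and n_containing are dicts keyed by the M matching terms:
    query frequency, collection frequency, and (assumed) number of
    collections containing each term.
    """
    terms = list(qaf)
    m = len(terms)
    if m == 0:
        raise ValueError("no terms in common between query and collection")
    x1 = sum(math.log(qaf[t]) for t in terms) / m
    x2 = math.sqrt(query_length)
    x3 = sum(math.log(caf[t]) for t in terms) / m
    x4 = math.sqrt(collection_length / 10)
    # ICF = (N - n) / n; a term present in every collection gives log(0) and fails.
    icf = {t: (num_collections - n_containing[t]) / n_containing[t] for t in terms}
    x5 = sum(math.log(icf[t]) for t in terms) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]

features = collection_attributes(
    qaf={"cat": 1}, caf={"cat": 1}, n_containing={"cat": 1},
    query_length=4, collection_length=10, num_collections=3,
)
```

A production version would need a policy for terms that occur in every collection, since their ICF is zero.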

Evaluation

• Effectiveness
  – Tested using the collection representatives described above (as harvested from over the network) and the TIPSTER relevance judgements
  – Testing by comparing our approach to known algorithms for ranking collections
  – Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX)
  – Recall analog (How many of the Rel docs occurred in the top n databases – averaged)
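The recall analog in the last bullet can be sketched as follows, assuming it means the fraction of a query's relevant documents that fall inside the top n ranked databases, averaged over queries. The function and variable names are hypothetical.

```python
def recall_at_n(ranked_collections, relevant_counts, n):
    """Fraction of all relevant documents held by the top n ranked collections."""
    total = sum(relevant_counts.values())
    if total == 0:
        return 0.0
    found = sum(relevant_counts.get(name, 0) for name in ranked_collections[:n])
    return found / total

def mean_recall_at_n(runs, n):
    """Average recall_at_n over (ranking, relevant_counts) pairs, one per query."""
    return sum(recall_at_n(ranking, counts, n) for ranking, counts in runs) / len(runs)

ranking = ["a", "b", "c"]
counts = {"a": 2, "b": 0, "c": 6}   # relevant documents per collection for one query
r1 = recall_at_n(ranking, counts, 1)
r3 = recall_at_n(ranking, counts, 3)
```

An optimal relevance-based ranking would order collections by their relevant document counts, which gives the upper bound for this measure at every n.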

Titles only (short query)

Future

• Logically Clustering servers by topic
• Meta-Meta Servers (treating the MetaSearch database as just another database)

Distributed Metadata Servers

[Figure: diagram of distributed metadata servers. Labels: Replicated servers, Meta-Topical Servers, General Servers, Database Servers.]

Geographic Operators and Search Ranking

The GEO Operations

• Operators established for the GEO Z39.50 profile
• Implemented using special operations on indexes
• Indexing allows extraction of geographic coordinates and dates from SGML/XML data in a variety of formats
• Normalized internal representation in indexes
• Search using geographic and time elements as primary or limiting search elements

The GEO Operations (continued)

• X-based interfaces permit (simple) map drawing and search
• Interface to MapServer for web-based map searching

GEO Geographic operators

Operator  Name              Meaning
>=<       Overlap           Search region and data overlap
>#<       Fully Enclosed    Data fully enclosed in search region
<#>       Encloses          Data fully encloses search region
<>#       Fully Outside     Data outside of search region
++        Near              Data is near search region
:<:       Before            Data date is before search date
:<=:      Before or During  Data date is before or during search date
:>=:      During or After   Data date is during or after search date
:>:       After             Data date is after search date
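The four region operators can be illustrated with bounding-box predicates. This is only a sketch under simplifying assumptions: regions are axis-aligned boxes, touching edges count as overlap, and boxes never cross the antimeridian. The talk does not describe how Cheshire represents regions internally, so the `Box` type and function names are hypothetical.

```python
from typing import NamedTuple

class Box(NamedTuple):
    west: float
    south: float
    east: float
    north: float

def overlaps(data, search):
    """>=< : data and search region share at least one point."""
    return (data.west <= search.east and search.west <= data.east
            and data.south <= search.north and search.south <= data.north)

def fully_enclosed(data, search):
    """>#< : data lies entirely inside the search region."""
    return (search.west <= data.west and data.east <= search.east
            and search.south <= data.south and data.north <= search.north)

def encloses(data, search):
    """<#> : data entirely contains the search region."""
    return fully_enclosed(search, data)

def fully_outside(data, search):
    """<># : data and search region do not intersect."""
    return not overlaps(data, search)

search_region = Box(0, 0, 10, 10)
inside = Box(2, 2, 4, 4)
crossing = Box(8, 8, 12, 12)
far_away = Box(20, 20, 30, 30)
```

The Near operator is left out because it needs a distance threshold that the slide does not specify.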

Overlaps search

Fully Enclosed Search

Map-Based Search

GeoSearch Web Interface

XML Schemas and Element Retrieval

XML Schema Support

• XML Schemas can now be used to define the data contents
• Tested with a wide variety of schemas including METS (with various supporting schemas)

XML Element Extraction

• A new search “ElementSetName” is XML_ELEMENT_
• Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
• The matching elements are extracted from the records matching the search and delivered in a simple format.

XML Extraction

% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {<RESULT_DATA DOCID="1"><ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"><Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245></ITEM><RESULT_DATA> … etc…
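A client receiving this reply would typically pull the document ID, the XPath and the element text out of each RESULT_DATA record. The sketch below assumes each record is well-formed XML with a proper closing tag (the transcript above shows an unclosed form, which may be a transcription slip) and uses a made-up title in place of the real record text. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of the reply above; the title text is invented.
sample = (
    '<RESULT_DATA DOCID="1">'
    '<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">'
    '<Fld245 AddEnty="No" NFChars="0"><a>Example title</a></Fld245>'
    '</ITEM></RESULT_DATA>'
)

def extract_items(result_xml):
    """Return (doc_id, xpath, text) for every ITEM in one RESULT_DATA record."""
    root = ET.fromstring(result_xml)
    doc_id = root.get("DOCID")
    return [
        (doc_id, item.get("XPATH"), "".join(item.itertext()).strip())
        for item in root.iter("ITEM")
    ]

items = extract_items(sample)
```

Returning the XPath alongside the text lets the caller tell apart several matching elements from the same record.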

MySQL and PostgreSQL

RDBMS Support

• There are two reasons for RDBMS support
  – IR systems are not meant for LOTS of update transactions
  – Some applications need to have access to both relational data and text data via Z39.50
• Both MySQL and PostgreSQL are popular open source RDBMS and either can now be used via Cheshire
  – Z39.50 mappings to RDBMS columns
  – “ZQL” submission of SQL as Z39.50 Type 0 query
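The idea of mapping Z39.50 search attributes onto RDBMS columns can be sketched as a lookup table that turns a Use attribute and a term into a parameterized SQL query. This is an illustration only: it uses Python's built-in sqlite3 as a stand-in for MySQL or PostgreSQL, the table and column names are invented, and it does not reflect how Cheshire configures its own mappings. The Bib-1 Use attribute numbers 4 (title) and 1003 (author) are standard.

```python
import sqlite3

# Bib-1 Use attribute -> column name. The column names are hypothetical.
USE_ATTRIBUTE_TO_COLUMN = {4: "title", 1003: "author"}

def search(conn, use_attribute, term):
    """Run a substring search on the column mapped to the given Use attribute."""
    column = USE_ATTRIBUTE_TO_COLUMN[use_attribute]  # KeyError if unmapped
    # The column name comes from the fixed table above; the term is bound as a parameter.
    sql = f"SELECT id, title, author FROM records WHERE {column} LIKE ? ORDER BY id"
    return conn.execute(sql, (f"%{term}%",)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO records (title, author) VALUES (?, ?)",
    [("Cats of Berkeley", "Smith"), ("Catalan grammar", "Jones")],
)
hits = search(conn, 4, "Catalan")
```

Restricting column names to a fixed mapping keeps user input out of the SQL text, which matters more once raw SQL can also arrive through a Type 0 query.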

Protocol Support

Protocols

• In Cheshire II most protocols (except Z39.50) are implemented using scripting
• Example scripts to support the following are included in the distribution
  – OAI
  – SRW (Python version)
  – SOAP
  – SDLIP

CORI, Okapi BM-25 ranking algorithms

Why additional ranking methods

• CORI is extremely hard to beat as a distributed search method
• OKAPI BM-25 is now the “default” retrieval algorithm in experimental IR
• New operators (later) let us mix and match ranking methods and Boolean operations

Page 42: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

CORI ranking

$$T = \frac{df}{df + 50 + 150 \cdot cw / \overline{cw}}$$

$$I = \frac{\log\left(\dfrac{|DB| + 0.5}{cf}\right)}{\log\left(|DB| + 1.0\right)}$$

$$p(r_k \mid db_i) = 0.4 + 0.6 \cdot T \cdot I$$

where:

• $df$ is the number of documents containing $r_k$
• $cf$ is the number of databases containing $r_k$
• $|DB|$ is the number of databases being ranked
• $cw$ is the number of words in $db_i$
• $\overline{cw}$ is the average $cw$ of the databases being ranked
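The CORI belief formula above can be sketched in a few lines of Python. This is only an illustration, not Cheshire code: the function names, the per-database statistics layout, and the choice to sum term beliefs into a single database score are all assumptions made for the example.

```python
import math


def cori_belief(df, cf, num_dbs, cw, avg_cw):
    """Belief p(r_k | db_i) for one term and one database, as on the slide."""
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_dbs + 0.5) / cf) / math.log(num_dbs + 1.0)
    return 0.4 + 0.6 * t * i


def rank_databases(query_terms, stats):
    """Rank databases by summed term beliefs (summing is an assumption).

    stats maps a database name to {"cw": word count, "df": {term: doc count}}.
    """
    num_dbs = len(stats)
    avg_cw = sum(s["cw"] for s in stats.values()) / num_dbs
    scores = {}
    for name, s in stats.items():
        total = 0.0
        for term in query_terms:
            # cf: number of databases that contain the term at all
            cf = sum(1 for other in stats.values() if other["df"].get(term, 0) > 0)
            if cf == 0:
                continue  # term unknown everywhere, contributes nothing
            total += cori_belief(s["df"].get(term, 0), cf, num_dbs, s["cw"], avg_cw)
        scores[name] = total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


example_stats = {
    "a": {"cw": 1000, "df": {"grid": 50}},
    "b": {"cw": 1000, "df": {"grid": 5}},
    "c": {"cw": 1000, "df": {}},
}
ranking = rank_databases(["grid"], example_stats)
print([name for name, _ in ranking])
```

A database that does not contain the term at all still receives the default belief of 0.4, because T is zero in that case.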

Page 43: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Okapi BM25

$$\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \, \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$

• Where:
• $Q$ is a query containing terms $T$
• $K$ is $k_1((1 - b) + b \cdot dl / avdl)$
• $k_1$, $b$ and $k_3$ are parameters, usually 1.2, 0.75 and 7-1000
• $tf$ is the frequency of the term in a specific document
• $qtf$ is the frequency of the term in a topic from which $Q$ was derived
• $dl$ and $avdl$ are the document length and the average document length measured in some convenient unit
• $w^{(1)}$ is the Robertson-Sparck Jones weight:

$$w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}$$
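As a rough illustration of the BM25 formula above, the following Python sketch scores one document against a query. The function names and dictionary-based inputs are assumptions for the example, not Cheshire's interface. The slide does not define N, n, R and r; the sketch uses their usual meaning in the Robertson-Sparck Jones weight (N documents in the collection, n of them containing the term, R known relevant documents, r of those containing the term) and sets R = r = 0 when no relevance information is available.

```python
import math


def rsj_weight(N, n, R=0, r=0):
    """Robertson-Sparck Jones weight w(1); R = r = 0 means no relevance data."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))


def bm25_score(query_tf, doc_tf, dl, avdl, N, doc_freq, k1=1.2, b=0.75, k3=7.0):
    """Sum the BM25 contribution of every query term found in the document.

    query_tf: {term: qtf}, doc_tf: {term: tf}, doc_freq: {term: n}.
    """
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        n = doc_freq.get(term, 0)
        if tf == 0 or n == 0:
            continue  # term absent from this document or from the collection
        doc_part = (k1 + 1) * tf / (K + tf)
        query_part = (k3 + 1) * qtf / (k3 + qtf)
        score += rsj_weight(N, n) * doc_part * query_part
    return score


score = bm25_score({"grid": 1}, {"grid": 1}, dl=100, avdl=100,
                   N=1000, doc_freq={"grid": 10})
print(round(score, 4))
```

With a document of average length, tf = 1 and qtf = 1, both fractions equal 1, so the score reduces to the term weight alone.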

Page 44: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Result Set sorting, merging and ranking operators, bitmapped indexes

Page 45: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Sorting

• Support for Z39.50 Sort functions

• Merge multiple resultsets and sort new set
– Sort by index name/key (ATTRIBUTE)
– Sort by rank (ELEMENTS)
  • Merges ranked results and Boolean results
– Sort by XML/SGML Tag contents (TAG)

Page 46: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Merging and Ranking Operators

• Extends the capabilities of merging to include merger operations in queries like Boolean operators

• Fuzzy Logic Operators
– !FUZZY_AND
– !FUZZY_OR
– !FUZZY_NOT

• Restrict components to particular parents

• Merge Operators
– !MERGE_SUM
– !MERGE_MEAN
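The slide names the operators but does not define them. The sketch below shows one plausible reading, assuming textbook fuzzy-logic semantics (AND as minimum, OR as maximum, NOT as complement) and simple score arithmetic for the merge operators. Cheshire's actual definitions may differ. Result sets are modelled here as hypothetical dictionaries from document id to a score in [0, 1], with a missing document treated as score 0.

```python
def _combine(a, b, fn):
    """Apply fn to the two scores of every document in either result set."""
    merged = {}
    for doc_id in a.keys() | b.keys():
        score = fn(a.get(doc_id, 0.0), b.get(doc_id, 0.0))
        if score > 0.0:  # drop documents whose combined score is zero
            merged[doc_id] = score
    return merged


def fuzzy_and(a, b):
    return _combine(a, b, min)


def fuzzy_or(a, b):
    return _combine(a, b, max)


def fuzzy_not(a, b):
    # "a AND NOT b": keep a's score only to the extent b does not match
    return _combine(a, b, lambda x, y: min(x, 1.0 - y))


def merge_sum(a, b):
    return _combine(a, b, lambda x, y: x + y)


def merge_mean(a, b):
    return _combine(a, b, lambda x, y: (x + y) / 2.0)


ranked = {"d1": 0.9, "d2": 0.4}
other = {"d2": 0.7, "d3": 0.2}
print(fuzzy_and(ranked, other))
print(sorted(fuzzy_or(ranked, other)))
```

Under these assumptions a Boolean result set can take part in the same operations by giving every matching document a score of 1.0.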

Page 47: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Bitmapped Indexes

• Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value

• Only one bit per record stored in the index

• Processed on a demand basis so only blocks with the bits needed to resolve a query are fetched
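To make the idea concrete, here is a minimal in-memory sketch of a bitmap index, using Python integers as bit vectors with one bit per record. The field values and function names are invented for the example, and the sketch does not model the on-demand block fetching described above.

```python
from collections import defaultdict


def build_bitmaps(values):
    """Return {value: bitmap}; bit i is set when record i has that value."""
    bitmaps = defaultdict(int)
    for record_id, value in enumerate(values):
        bitmaps[value] |= 1 << record_id
    return dict(bitmaps)


def matching_records(bitmap):
    """List the record ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two low-cardinality fields over the same five records (made-up data).
language = build_bitmaps(["eng", "fre", "eng", "ger", "eng"])
doc_type = build_bitmaps(["book", "book", "map", "book", "map"])

# Boolean AND and OR become single bitwise operations.
english_maps = language["eng"] & doc_type["map"]
french_or_german = language["fre"] | language["ger"]

print(matching_records(english_maps))      # [2, 4]
print(matching_records(french_or_german))  # [1, 3]
```

Each distinct value costs one bit per record, which is why the approach suits fields with few values and many records per value.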

Page 48: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Cheshire III Design and Development

Page 49: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Cheshire III Goals

• Retain or reproduce (and refine) all Cheshire II features
– “Spring cleaning” of code base
– Add Full Unicode Support
– Store most system and content data in the database

• Permit easy and efficient integration in Web Services

• Use threaded server for economy of resource usage

• Enhanced Multiprotocol support

• Support for distributed processing (i.e. GRID clusters)

• Enhance expandability and “drop in” functionality

• Interfaces and/or APIs for Java, Python, C/C++

Page 50: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Cheshire II Design Overview

[Diagram: Cheshire II design overview. Labels shown: XML DOCS, XML DIRECTORY, INDEX CLUSTER, INDEX CHESHIRE, CONT, BUILD ASSOC, ZSERVER, CONFIG, COMPONENT DEFINITION, INDEX(S), ASSOC, CLUSTER EXTENSION]

Page 51: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Cheshire III Server Overview

[Diagram: Cheshire III SERVER architecture. Labels shown: API; INDEXING; SEARCH; SCAN; CLUSTERING; Normalization; XSLT RECORD TRANSFORMS; PROTOCOL HANDLER; DB API; LOCAL DB; INDEXES; RESULT SETS; XML CONFIG & Metadata INFO; USER INFO; ACCESS INFO; AUTHENTICATION; CONFIG & CONTROL; SERVER CONTROL; STAFF UI; CONFIG; APACHE INTERFACE; NETWORK; Client; User/Clients; REMOTE SYSTEMS (any protocol); Native calls; Fetch ID; Put ID; and the protocols Z39.50, SOAP, OAI, SRW, OpenURL, UDDI, WSRP, OGIS, JDBC]

Page 53: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Retain Features

• The intent is to permit all of the types of indexing, searching and record formatting available now, while making it easier to add new capabilities

• The new system will also support full UNICODE for content and for metadata

• Store metadata and content in the database (including config information, etc.)

Page 54: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Permit easy integration of Web Services

• The assumption is that the web server will be the central server mechanism in the future.

• The new design relies on the session handling, threading and load management tools available in Apache (2.0.40+)

• The Cheshire server is dynamically loaded as part of the Web Server

Page 55: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Multiprotocol Support

• The Web server handles the network issues and passes requests in various protocols along to the Cheshire Server.

• Individual Protocol “plugins” and the Protocol Handler convert search, display, and metadata requests in a particular protocol to the internal Cheshire III control language, and convert outgoing messages and data to the appropriate protocol form
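The plugin arrangement described above might look roughly like the following sketch. Everything in it is hypothetical: the class names, the two toy wire formats, and the `InternalRequest` stand-in for the internal control language are invented for illustration and are not the Cheshire III API.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs


@dataclass
class InternalRequest:
    """Stand-in for the internal control language (hypothetical shape)."""
    operation: str
    query: str


class UrlQueryPlugin:
    """Toy protocol: requests arrive as 'operation=search&query=cat'."""

    def decode(self, raw):
        fields = parse_qs(raw)
        return InternalRequest(fields["operation"][0], fields["query"][0])

    def encode(self, hits):
        return "hits=" + str(len(hits))


class LinePlugin:
    """Toy protocol: requests arrive as 'SEARCH cat'."""

    def decode(self, raw):
        operation, query = raw.split(" ", 1)
        return InternalRequest(operation.lower(), query)

    def encode(self, hits):
        return "OK " + " ".join(hits)


class ProtocolHandler:
    """Routes each request through the plugin registered for its protocol."""

    def __init__(self, engine):
        self._engine = engine
        self._plugins = {}

    def register(self, name, plugin):
        self._plugins[name] = plugin

    def handle(self, name, raw):
        plugin = self._plugins[name]
        request = plugin.decode(raw)                 # protocol form -> internal form
        return plugin.encode(self._engine(request))  # internal result -> protocol form


DOCS = {"d1": "the cheshire cat", "d2": "cat and dog", "d3": "grid cluster"}


def toy_engine(request):
    """Minimal search engine: exact word match over DOCS."""
    if request.operation != "search":
        raise ValueError("unsupported operation: " + request.operation)
    return sorted(d for d, text in DOCS.items() if request.query in text.split())


handler = ProtocolHandler(toy_engine)
handler.register("url", UrlQueryPlugin())
handler.register("line", LinePlugin())
print(handler.handle("url", "operation=search&query=cat"))  # hits=2
print(handler.handle("line", "SEARCH cat"))                 # OK d1 d2
```

The point of the design is that the search engine sees only the internal request form, so supporting another protocol means registering one more plugin rather than changing the engine.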

Page 56: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Distributed Processing (RESEARCH)

• The server will support protocols for interchange of partial results and collection statistics with a single “Master” controlling the actions of a large number of “Slave” servers

• These will run in parallel in a GRID environment

• This is still “research” but will probably be using “Storage Grid” technology from SDSC with our own applications

• Non-Grid use of the same protocols, etc. will be possible (but definitely slower)

Page 57: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Enhanced Expandability

• Clearly defined APIs for interacting with the server will permit easy addition of new functionality, or replacement or upgrade of existing functionality

• Interactive user interface for database configuration and setup
– We want to make it easier for a user/administrator to create and manage the database

Page 58: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Multilingual APIs

• The system is being developed in a multilingual environment.

• We will include the ability to interface with (at a minimum) Java, Python and C/C++ applications.

• APIs for developing new functions will be available in these languages as well

Page 59: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

Development

• Currently work is going on here (RRL) and (primarily) in the UK

• We have incomplete (Alpha) versions of the system, but haven’t been distributing it in the current form (changing constantly)

• First release version is expected in mid-’04

Page 60: October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School.

For More Information

• http://Cheshire.berkeley.edu

• ftp://Cheshire.berkeley.edu for source

• Contact [email protected]