Using a Registry to Disclose and Discover Resources for Metasearching
When worlds collide Metasearching meets central indexes
When worlds collide
Metasearching meets central indexes
Mike Taylor – [email protected]
Index Data – http://indexdata.com/
Search
When worlds collide: metasearching and central indexes Mike Taylor – [email protected]
Data
Problem solved!
Search
Data, Data, Data
? ?
Metasearch
Magic box
Data, Data, Data, Data
Searching
360 Search, EHIS (EBSCO), MetaLib, Pazpar2 (open source)
A.K.A. federated search
A.K.A. distributed search
A.K.A. broadcast search
Back to the sad searcher
Data, Data, Data
? ?
Central index
Data, Data, Data, Data
Fat database
Harvesting
Summon, WorldCat, Primo Central, MasterKey
A.K.A. local index
A.K.A. discovery services
A.K.A. vertical search
We need a controlled vocabulary!
Metasearch = Federated search = Distributed search = Broadcast search
Central index = Local index = Discovery services = Vertical search (if you ever heard anything so dumb)
Which approach is better?
Central indexing compared with metasearching:
- requires harvesting infrastructure
- requires lots of local storage
- requires co-operation from services to be harvested
- does not have access to all searchable data
- will always be somewhat out of date
- is faster at search time (or SHOULD be)
- allows data to be normalised (e.g. dates extracted)
- allows for better relevance ranking
- can provide pre-baked facets
- may have access to some data that is not searchable
Let's do both!
Magic box
Data, Data, Data, Data
Searching
Data, Data, Data, Data
Fat database
Harvesting
“Integrated Search”
Metasearch hides the complexity
Magic box
Data, Data, Data, Data
Searching
Metasearch: nine tenths under the surface
Metasearch: what you see looks beautiful
Problems that need solving
A. Problems with pure metasearching
B. How those problems change when you add a central index
Problems with metasearching
Examples based on Index Data's suite:
Pazpar2 is a free metasearching engine with a stupid name
http://indexdata.com/pazpar2/
MasterKey is a non-open suite that wraps it
http://indexdata.com/masterkey/
MasterKey is only one way to use Pazpar2
Also integrated into other vendors' UIs.
Problems with metasearching #1: No data server at all!
Data is often only in a user-facing Web UI
Must be made available via a standard protocol
Option 1: build a gateway in Perl
http://indexdata.com/simpleserver/
Option 2: MasterKey Connect (non-open)
http://indexdata.com/connector-framework
Problems with metasearching #2: data server is crap^H^H^H^Hsuboptimal
Catalogs searchable using ANSI/NISO Z39.50
Support is very nominal in some cases
IRSpy probes behaviour
http://irspy.indexdata.com
MasterKey target profiles describe behaviour
Problems with metasearching #3: Data servers don't support relevance
Pazpar2 does its own relevance ranking
(Part of merging/deduplication)
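As a rough illustration of ranking done in the engine as part of merging and deduplication, here is a minimal Python sketch. It is not Pazpar2's actual algorithm: the record fields, the `merge_key` rule and the scoring (query-term matches in the title, then number of duplicate copies) are all assumptions made for the example.

```python
import re
from collections import defaultdict


def words(text):
    """Lower-case a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def merge_key(record):
    """Crude deduplication key (hypothetical rule): normalised title plus author."""
    return " ".join(words(record.get("title", "")) + words(record.get("author", "")))


def merge_and_rank(records, query):
    """Cluster duplicate records, then order the clusters by a simple score."""
    clusters = defaultdict(list)
    for record in records:
        clusters[merge_key(record)].append(record)

    terms = words(query)

    def score(cluster):
        title_words = words(cluster[0].get("title", ""))
        matches = sum(title_words.count(term) for term in terms)
        # Query-term matches first, then how many copies were merged.
        return (matches, len(cluster))

    return sorted(clusters.values(), key=score, reverse=True)


records = [
    {"title": "The Unix Programming Environment", "author": "Kernighan", "server": "A"},
    {"title": "The UNIX programming environment.", "author": "Kernighan", "server": "B"},
    {"title": "The C Programming Language", "author": "Kernighan", "server": "A"},
]
ranked = merge_and_rank(records, "unix programming")
for cluster in ranked:
    print(cluster[0]["title"], "held by", [r["server"] for r in cluster])
```

The two Unix records differ only in case and punctuation, so they fall into one cluster, which is why ranking and deduplication are naturally done in the same pass.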
Problems with metasearching #4: Data servers don't return facets
Pazpar2 calculates its own facets
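Calculating facets in the engine amounts to counting field values across the records that have been retrieved. A minimal Python sketch of the idea, assuming records are plain dictionaries (the field names and sample data are invented for illustration, and this is not Pazpar2's own code):

```python
from collections import Counter


def facet_counts(records, field):
    """Count how often each value of `field` occurs across the records."""
    counts = Counter()
    for record in records:
        values = record.get(field, [])
        if isinstance(values, str):
            values = [values]  # treat a single value as a one-item list
        counts.update(values)
    return counts


records = [
    {"author": ["Kernighan", "Ritchie"], "date": "1978"},
    {"author": ["Kernighan", "Pike"], "date": "1984"},
    {"author": "Thompson", "date": "1978"},
]
print(facet_counts(records, "author"))
print(facet_counts(records, "date"))
```

The counts only cover records actually fetched, not every hit on the remote server, which is why they need scaling before they can be compared with counts from a central index.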
There is a lot of magic in the magic box:
- Searching
- Sorting
- Merging
- Deduplication
- Relevance
- Facet generation
- Time travel...
Magic box: Pazpar2
Data, Data, Data, Data
Remember, our engine is free:
http://indexdata.com/pazpar2/
What happens when we add a central index?
Problems with integrated search #1: No data server at all!
Data is often only in a user-facing Web UI
You can't harvest Google
You just can't
Problems with integrated search #2: data server is crap^H^H^H^Hsuboptimal
Repositories harvestable using OAI-PMH (an even worse name than pazpar2)
Support is very nominal in some cases
OAI-PMH client must be very tolerant
Extensive data-cleaning is usually required
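One modest reading of "tolerant" is a harvester that never crashes on a bad response: it skips what it cannot use and carries on. The Python sketch below parses a single OAI-PMH ListRecords response under that assumption. The function name and the choice to return only identifiers are illustrative; a real harvester would also fetch pages over HTTP, follow the resumption token and clean the metadata.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def parse_list_records(xml_text):
    """Return (identifiers, resumption_token); never raise on bad input."""
    try:
        root = ET.fromstring(xml_text)
    except (ET.ParseError, ValueError):
        return [], None  # malformed response: report nothing harvested

    identifiers = []
    for record in root.iter(f"{OAI}record"):
        header = record.find(f"{OAI}header")
        if header is None or header.get("status") == "deleted":
            continue  # skip records with no header or marked as deleted
        identifier = (header.findtext(f"{OAI}identifier") or "").strip()
        if identifier:
            identifiers.append(identifier)

    token = (root.findtext(f".//{OAI}resumptionToken") or "").strip()
    return identifiers, token or None


sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>
<record><header><identifier> oai:example:1 </identifier></header></record>
<record><header status="deleted"><identifier>oai:example:2</identifier></header></record>
<record></record>
<resumptionToken>abc</resumptionToken>
</ListRecords></OAI-PMH>"""
print(parse_list_records(sample))
print(parse_list_records("<broken"))
```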
Problems with integrated search #3: Central index does support relevance
Returned records carry relevance scores
Must be merged with records scored by engine
Requires score normalisation into same range
Existing ordering may be used in merge
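The points above can be sketched in Python as follows. This is only one possible scheme, not necessarily what Pazpar2 or MasterKey does: scores are assumed non-negative and are scaled by each list's maximum into the range 0 to 1, and an unscored but ordered list is given pseudo-scores from its positions. All function names and sample values are made up for the example.

```python
def normalise_scores(hits):
    """Scale (record, score) pairs into [0, 1] by dividing by the top score."""
    if not hits:
        return []
    top = max(score for _, score in hits)
    if top <= 0:
        return [(record, 0.0) for record, _ in hits]
    return [(record, score / top) for record, score in hits]


def scores_from_order(records):
    """Give an ordered but unscored list pseudo-scores from 1.0 down to 0.0."""
    n = len(records)
    if n == 1:
        return [(records[0], 1.0)]
    return [(record, 1.0 - i / (n - 1)) for i, record in enumerate(records)]


def merge(*scored_lists):
    """Combine normalised lists and sort by score, best first (stable sort)."""
    combined = [hit for hits in scored_lists for hit in hits]
    return sorted(combined, key=lambda hit: hit[1], reverse=True)


central = normalise_scores([("c1", 12.0), ("c2", 6.0), ("c3", 3.0)])
engine = normalise_scores([("m1", 4.0), ("m2", 3.0)])
merged = merge(central, engine)
print([record for record, _ in merged])
```

Because the two sources use different scales (here 12.0 versus 4.0 for their best hits), merging the raw scores would push every central-index record above every metasearched one; scaling first lets them interleave.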
(Diagram labels: Unranked #1, Sort; Unranked #2, Sort; Ranked #1; Ranked #2; Solr; Merged)
Problems with integrated search #4: Central index does return facets
Lists of field values with occurrence counts:
Author: Kernighan 27, Pike 13, Ritchie 7, Thompson 4
Title: C 7, Unix 35, Programming 16
Date: 1977 5, 1978 4, 1979 2, 1981 2
Lists are returned or calculated for each server:
Server 1 (central index) (all facets from 2000 hits): Cat 68, Dinosaur 162, Fish 145, Frog 19
Server 2 (metasearch) (1000 hits, 100 records): Cat 7, Dog 10, Dinosaur 87, Fish 23
Metasearched counts normalised by total hit-count
Server 2 (metasearch) (normalised to 1000 hits): Cat 70, Dog 100, Dinosaur 870, Fish 230
Facet lists are merged
Servers 1+2 (integrated) (as though for all records in result sets):
Cat 68+70 = 138
Dog 0+100 = 100
Dinosaur 162+870 = 1032
Fish 145+230 = 375
Frog 19+0 = 19
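The arithmetic on these slides can be reproduced in a few lines of Python. The function name and the use of `Counter` are choices made for this sketch; the numbers are the ones from the slides (Server 2 returned 100 records out of 1000 hits, so its counts are scaled by 10).

```python
from collections import Counter


def normalise_facets(counts, records_fetched, total_hits):
    """Scale counts seen in the fetched records up to the full hit count."""
    factor = total_hits / records_fetched
    return Counter({value: round(n * factor) for value, n in counts.items()})


# Server 1: central index, facets already cover all 2000 hits.
central = Counter({"Cat": 68, "Dinosaur": 162, "Fish": 145, "Frog": 19})

# Server 2: metasearch, facets counted from 100 records out of 1000 hits.
metasearch = Counter({"Cat": 7, "Dog": 10, "Dinosaur": 87, "Fish": 23})

scaled = normalise_facets(metasearch, records_fetched=100, total_hits=1000)
merged = central + scaled  # Counter addition sums counts per facet value

for value, count in sorted(merged.items()):
    print(value, count)
```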
Fringe benefit: facet-count normalisation is also useful when doing pure metasearching.
Summary of search issues

Issue              Metasearch solution                     Central index solution
No data server     Build gateways; MasterKey Connect       ---
Bad data server    Probe capabilities; profile targets     Tolerant harvester; data-cleaning
Relevance scores   Magic engine; normalise scores          Ingest from server
Facets             Magic engine; normalise counts          Ingest from server