DL:Lesson 5 Classification Schemas Luca Dini [email protected]
description
Transcript of DL:Lesson 5 Classification Schemas Luca Dini [email protected]
DL:Lesson 5Classification Schemas
Luca [email protected]
Web: Crawling
“central”index
?
Metadata harvesting
metadata
metadata
AuthorTitleAbstractIdentifer
?
Metasearching
MetasearchEngine
?
What is metasearching?
Given many document sources and a query, a metasearcher:
– Finds the good sources for the query– Evaluates the query at these sources– Merges the results from these sources
Metasearcher
<%
%>
UnindexedDocuments Legacy
Database /WAIS / etc.
ExistingWebApplication
Main Issues
How to query different types of sources? How to combine results and rankings from multiple
data sources?
Metasearcher
<%
%>
grep ‘biomedical’*.txt
SELECT title FROMarticles . . .
http://…/getTitle?title=‘biomedical’&…
Other Issues
How to choose among multiple data sources? How to get metadata about multiple data
sources?
Metasearcher
<%
%>
cat*.txt
SELECT SCHEMA…….
Best:http://….?getMetaDataWorst:“Hi. What do you have?”
Cost/Functionality
Cost of acceptance
Metadata Harvesting
SDLIP/STARTS
Z39.50
Function
Z39.50
http://www.loc.gov/z3950/agency/
Goals
• Permits one computer, the client, to search and retrieve information on another, the database server
• Important both technically and for its wide use in library systems
• Most development has concentrated on bibliographic data
• Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records
Principles
Abstract view of database searching.
• Server stores a set of databases with searchable indexes
• Interactions are based on a session
• The client opens a connection with the server, carries out a sequence of interactions and then closes the connection.
• During the course of the session, both the server and the client remember the state of their interaction.
The results
Z39.50
• The server carries out the search and builds a results set
• Server saves the results set.
• Subsequent message from the client can reference the result set.
• Thus the client can modify a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching entire database.
Services
init -- client connects to the server and exchanges initial information, e.g., preferred message size
explain -- client inquires of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options
search -- client presents a query to a database choices of syntax for specifying searches
• only Boolean queries widely implemented • one or more records may be returned to the client
Services
manipulation of results sets -- e.g., sort or delete
present -- requests the server to send specified records from the results set to the client in a specified format
• options:
for controlling content and formats
for managing large records or large results sets
Example
In the database named "Books" find all records for which the access point title that contains the value "evangeline" and the access point author contains the value "longfellow.“
Z39.50 defines a rich variety of search access points that can be extended by implementers
Problems
Very difficult to implement– There are freely available implementations, but they are
complex
Outdated assumptions– Searching is expensive computationally – Bandwidth is limited (ASN.1 compression)
Originally designed for bibliographic record retrieval, and not full documents or other objects
“Overspecified” (Almost) Nobody Implements Explain! Assumes questionable user model (stateful)
Simple Digital Library Interoperability Protocol
http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/
SDLIP
Compromise between a full-scale, all encompassing search middleware design such as Z39.50 and the “anything goes” approach typical for ad-hoc search interface design on web
Support for stateful and stateless operation by the server
Support for thin clients, such as handheld devices Developed jointly by Stanford, Berkeley, and UC
Santa Barbara
SDLIP – Search Middleware
Interfaces
Interfaces
Search Interface – defines simple query language, protocol can then include other languages
Result Interface – parking meter metaphor supports varying notions of results sets
Source Metadata Interface – provides extension mechanism through discovery server capabilities
Result access interface
This interface allows client applications to access the set of result documents, wherever that set is maintained
Four services:– getSessionInfo– getDocs– extendStateTimeout– cancelRequest
Source metadata interface
Provides information about the service and server itself, such as– Collections served– Collection metadata/content information– Searchable properties
Three operations– getInterface– getSubcollectionInfo– getPropertyInfo
OAI
Cost
Functionality
HTTPGoogle
Z39.50SGML
DublinCore
MetadataHarvesting
Metadata
metadata
AuthorTitleAbstractIdentifer
History
Increasing interest in alternative scholarly publishing solutions – e.g., LANL arXiv
Increasing impact through federation UPS Mtg., Sante Fe, October 1999
– Representatives of various ePrint, library, publishing, communities
– Goal: definition of an interoperability framework among ePrint providers
– Result: Santa Fe Convention, interoperability through metadata harvesting
Umbrella model
ReferenceLibraries
PublishersE-Print
Archives
…that can be exploited by different communities
Museums
Key Technical features
Deploy now technology – 80/20 rule Two-party model – providers (data providers) and
consumers (service providers) Simple HTTP encoding XML schema for some degree of protocol
conformance Extensibility
– Multiple item-level metadata– Collection level metadata
Roles
DiscoveryCurrent
AwarenessPreservation
Service Providers
Data Providers
Meta
data
harv
estin
g
Key Features
definitions & concepts– repository– record– identifier– datestamp– set
protocol features– HTTP encoding– metadata prefix &
schema– flow control
protocol requests– supporting requests– harvesting requests
Record
<record><header>
<identifier>oai:eg:001</identifier><datestamp>1999-01-01</datestamp>
</header><metadata>
<dc xmlns=“http://purl.org/dc”><title>My Example</title>
</dc></metadata><about>
<ea xmlns=“http://www.arXiv.org/ea”<usage>No restrictions</usage>
</ea></about>
</record>
protocol support
format-specificmetadata
community-specific
record data
Identifiers
oai-identifier = oai:archive-identifier:record-identifier
Registered URI
Scheme
Archive Idendifier:
Registered within OAI
Unique ID within archive:
(syntax is archive-specific)
example = oai:ncstrl:ncstrl.cornellcs/TR94-1418
locally unique key for extracting a record from a repository
Identify
•Repository name •Base-URL
• Admin e-mail• OAI protocol version
• Description Container
repos i tory
harves ter
service provider data provider
ListMetadataFormats
REPEAT• Format prefix
• Format XML schema/REPEAT
repos i tory
harves ter
service provider data provider
ListSets
REPEAT• Set Specification
• Set Name/REPEAT
repos i tory
harves ter
service provider data provider
* from=a * until=b * set=klmListRecords * metadataPrefix=oai_dc
REPEAT• Identifier
• Datestamp• Metadata
•About Container/REPEAT
repos i tory
harves ter
service provider data provider
REPEAT• Identifier
• Datestamp/REPEAT
repos i tory
* from=a * until=bListIdentifiers * set=klmh
arves ter
service provider data provider
* identifier=oai:mlib:123a GetRecord * metadataPrefix=oai_dc
• Identifier• Datestamp
• Metadata• About
repos i tory
harves ter
service provider data provider
http://www.google.com.tw/webmasters/sitemaps/docs/en/other.html#oai
http://www.nla.gov.au/digicoll/oai/getRecord.html
oai_dc http://www.nla.gov.au/digicoll/oai/ http://www.nla.gov.au/digicoll/oai/
listMetadataFormats.html (oai:nla.gov.au:nla.pic-an22111591)