The Brave New World Of Repositoriesinfo.tuwien.ac.at/elpub2006/presentations/Sompel_keynote.pdf ·...
The Brave New World of Repositories
ElPub 2006, Bansko, Bulgaria, June 15 2006
Herbert Van de Sompel
Research Library
Los Alamos National Laboratory, USA
This work was supported by NSF award number IIS-0430906 (Pathways)
[Figure: Repository interfaces Obtain, Harvest, and Put]
Acknowledgments
The reported material is based on the following work:
o The NSF-funded Pathways project, a collaboration between the Information Science group at Cornell University (PIs: Carl Lagoze, Sandy Payette, Simeon Warner) and the LANL Digital Library Research & Prototyping Team (PI: Herbert Van de Sompel).
- See http://www.infosci.cornell.edu/pathways/
o The LANL aDORe repository effort.
- See http://dx.doi.org/10.1093/comjnl/bxh114
- See http://african.lanl.gov/aDORe/
o The PhD thesis by Jeroen Bekaert (advisor: Herbert Van de Sompel) regarding protocol-based interfaces for Open Archival Information Systems (OAIS).
- See http://hdl.handle.net/1854/4833
Terminology
Digital Object: A data structure whose principal components are digital data and key-metadata. Digital data can be a Datastream or a Digital Object, i.e. a Digital Object may have one or more other Digital Objects as nested components. Key-metadata must include an identifier for the Digital Object.
Datastream: An ordered sequence of bytes.
Data Model: An abstraction for Digital Objects such that each Digital Object can be seen as an instance of the class defined by a Data Model. Example Data Models include the Pathways Core model, the MPEG-21 Digital Item Declaration model, etc.
Surrogate: A serialization of a Digital Object according to a Data Model.
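The definitions above can be read as a small recursive data structure. The following Python sketch is only an illustration under assumptions: the class and field names (`Datastream`, `DigitalObject`, `components`, `metadata`) and the example identifiers are hypothetical and are not taken from the Pathways Core model.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Union


@dataclass
class Datastream:
    """An ordered sequence of bytes."""
    data: bytes


@dataclass
class DigitalObject:
    """Digital data plus key-metadata; the identifier is mandatory."""
    identifier: str
    components: List[Union[Datastream, "DigitalObject"]] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.identifier:
            raise ValueError("key-metadata must include an identifier")

    def datastreams(self) -> Iterator[Datastream]:
        """Yield every Datastream, descending into nested Digital Objects."""
        for component in self.components:
            if isinstance(component, Datastream):
                yield component
            else:
                yield from component.datastreams()


# Hypothetical compound object: a publication nesting a paper and a dataset.
paper = DigitalObject("info:example/paper", [Datastream(b"%PDF...")])
dataset = DigitalObject("info:example/dataset", [Datastream(b"1,2,3")])
publication = DigitalObject("info:example/publication", [paper, dataset])
streams = list(publication.datastreams())
```

Here `streams` holds the two Datastreams of the nested objects, which is the sense in which a Digital Object "may have one or more other Digital Objects as nested components".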
Terminology
Repository: a networked system that provides services pertaining to a collection of Digital Objects.
Obtain interface: a Repository interface that supports the request of services pertaining to individual Digital Objects (including their component Datastreams).
Harvest interface: a Repository interface that exposes Surrogates for incremental collecting/harvesting.
Put interface: a Repository interface that supports submission of one or more Surrogates into the Repository, thereby facilitating the addition of Digital Objects to the collection of the Repository.
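As a rough illustration of how the three interfaces relate, the sketch below implements them on a toy in-memory Repository. Everything in it is an assumption made for the example: the class name `InMemoryRepository`, the use of plain dicts as Surrogates, and the sequence number used to make Harvest incremental (the slides do not say how "incremental" is realized).

```python
from typing import Dict, List, Tuple


class InMemoryRepository:
    """Toy Repository exposing Obtain, Harvest and Put (hypothetical names)."""

    def __init__(self, repository_id: str) -> None:
        self.repository_id = repository_id
        self._surrogates: Dict[str, dict] = {}
        self._log: List[Tuple[int, str]] = []  # (sequence number, identifier)

    def put(self, surrogate: dict) -> int:
        """Put: submit a Surrogate, adding a Digital Object to the collection."""
        identifier = surrogate["id"]
        self._surrogates[identifier] = dict(surrogate)
        sequence = len(self._log) + 1
        self._log.append((sequence, identifier))
        return sequence

    def obtain(self, identifier: str) -> dict:
        """Obtain: request the Surrogate of one individual Digital Object."""
        return dict(self._surrogates[identifier])

    def harvest(self, since: int = 0) -> List[dict]:
        """Harvest: expose Surrogates added after sequence number `since`."""
        return [dict(self._surrogates[i]) for seq, i in self._log if seq > since]


repo = InMemoryRepository("repo:example")
first = repo.put({"id": "obj:1", "semantic": "paper"})
repo.put({"id": "obj:2", "semantic": "dataset"})
new_since_first = repo.harvest(since=first)
```

A harvester that remembers the last sequence number it saw only receives what was added afterwards, which is the point of an incremental Harvest interface.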
The Repository model
"Pattern Recognition: The 2003 OCLC Environmental Scan" http://www.oclc.org/membership/escan/toc.htm
Context: The Repository model
An environment consisting of Digital Object Repositories with a Long Life Expectation:
o Scholarly repositories
- Institutional repositories
- Discipline-oriented repositories
- Publisher’s repositories
- Dataset repositories
- …
o Cultural heritage repositories
o Preservation archives
o Educational repositories
Context: compound Digital Objects
[Figure: Digital Objects with identifiers (id)]
Objects of the scholarly communication system are increasingly compound in nature, simultaneously consisting of:
• Multiple media types
• Multiple content types
o Papers,
o Datasets,
o Simulations,
o Software,
o Dynamic knowledge representations,
o Machine-readable chemical structures
Context: compound Digital Objects
Compound Digital Objects:
o Have a persistent identifier
o Have multiple content streams and properties about those content streams
o Can contain other Digital Objects
o This doesn’t readily map to the Web world
o It does map to the world of rich media (cf. MPEG-21)
Aim: Digital Object use and re-use
• We must leverage the value of the materials that become available in those distributed Repositories.
• Think about these Repositories as active nodes in a global environment, not as passive local nodes
o These Repositories are about facilitating the use and re-use of materials in many contexts
o These Repositories are the starting point of value chains
http://www.technorati.com
• Value chains emerging from simple RSS technology
Augmenting interoperability across Repositories
• We need appropriate technologies to facilitate the emergence of such cross-Repository value chains
• We need to augment interoperability across Repositories
• Motivations:
1. To facilitate the emergence of richer cross-Repository services
2. To facilitate scholarly communication workflows across Repositories
Motivation 1 : Richer cross-Repository services
• Distributed Repositories provide source materials for cross-Repository overlay services such as discovery services
• The manner in which those materials are exposed must allow for the seamless emergence of rich and meaningful services
Motivation 1 : Richer cross-Repository services
[Figure: selective collecting by a service]
Richer cross-Repository services : Scenario
Scenario 1: Chemical search engine
• A search engine monitors scholarly repositories but is only interested in making machine-readable chemical structures contained in Digital Objects available from those repositories searchable.
• This constitutes re-use of (part of) the Digital Objects by a service overlaid upon the monitored repositories.
• And, of course, a chemical compound discovered via the search engine can be cited in some new paper, i.e. the value chain does not stop here.
Motivation 1 : Richer cross-Repository services
[Figure: selective collecting by a service]
Need: digital object representation, harvesting interface, datastream semantics
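To make the three needs concrete, here is a minimal sketch of selective collecting for the chemical search engine scenario. It assumes harvested Surrogates arrive as dicts that list their datastreams with a `semantic` label and a `location`; those field names, the label `chemical-structure`, and the URLs are all hypothetical, since the slides do not define a vocabulary.

```python
from typing import Dict, Iterable, List

# Hypothetical semantic label; the slides do not define a vocabulary.
CHEMICAL = "chemical-structure"


def select_datastreams(surrogates: Iterable[dict], wanted: str) -> List[Dict[str, str]]:
    """Return object id and location for every datastream of the wanted type."""
    selected = []
    for surrogate in surrogates:
        for stream in surrogate.get("datastreams", []):
            if stream.get("semantic") == wanted:
                selected.append({"object": surrogate["id"], "location": stream["location"]})
    return selected


# Stand-in for the result of a Harvest request against two repositories.
harvested = [
    {"id": "obj:1", "datastreams": [
        {"semantic": "paper", "location": "http://repo.example/obj1/paper.pdf"},
        {"semantic": CHEMICAL, "location": "http://repo.example/obj1/structure.cml"},
    ]},
    {"id": "obj:2", "datastreams": [
        {"semantic": "dataset", "location": "http://repo.example/obj2/data.csv"},
    ]},
]
to_index = select_datastreams(harvested, CHEMICAL)
```

Only the chemical structure of `obj:1` is selected; the service never needs to transfer the paper or the dataset.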
Motivation 2 : Scholarly communication workflow
• Distributed Repositories at the basis of a digital scholarly communication system
• Scholarly communication as a global workflow (value chain) across those Repositories
• Digital Objects from Repositories are the subject of the workflow; they are used and re-used in many contexts.
Motivation 2 : Scholarly communication workflow
[Figure: Digital Objects (id); recombine & add value]
Scholarly communication workflow : Scenarios
Scenario 2: Citation
• An author writes a paper (to be Put into her institutional repository) and cites 10 papers available from other repositories.
• A citation to a paper is a type of re-use of the cited paper in a new context.
• And, of course, the new paper can be cited too, i.e. the value chain does not stop here.
Scholarly communication workflow : Scenarios
Scenario 3: Overlay journal
• The editor of an overlay journal selects papers from 3 different repositories for inclusion in the next issue of the overlay journal.
• Each of those articles is being re-used in a new context, with value being added.
• And, the overlay journal can be mirrored for preservation purposes, i.e. the value chain does not stop here.
Motivation 2 : Scholarly communication workflow
[Figure: Digital Objects (id); recombine & add value]
Need: digital object representation, obtain interface, put interface
Scholarly communication workflow : Scenarios
Scenario 4: eScience
• A researcher uses datasets from 2 different dataset repositories, performs operations on those, and creates a publication that contains a resulting new dataset and an accompanying paper, and deposits this publication in her institutional repository.
• This constitutes re-use of the original datasets, and value added through the creation of the new publication.
• And, of course, the new dataset can be re-used too, i.e. the value chain does not stop here.
Value chains starting in Repositories : Observations
Currently:
• The Scenarios can only be realized in idiosyncratic ways.
• The connection between a new object and the ones it builds on is lost in the workflow, and can only be re-instantiated through fuzzy data post processing.
Augmenting interoperability across Repositories
[Figure: DSpace, Fedora, aDORe, ePrints, arXiv, Nature. Layers: Individual Data Models and Services; Shared Data Model and Services]
Considerations re interoperable framework
• Scholarly communication is a long-term endeavor:
o Need abstract definitions of Repository interfaces that can be instantiated on the basis of various technologies as time goes by
o Repository interfaces need to work with whichever type of identifier (current and future) because Repositories will use whichever type of identifier
• Value chains do not require transfer of all digital object content
o The content that needs to be transferred depends on the nature of the value chain
• Recording a chain of evidence of a value chain requires fine granularity of identification
o Not only identifier of the digital object but also of the repository
Augmenting interoperability across Repositories
[Figure: DSpace, Fedora, aDORe, ePrints, arXiv, Nature. Layers: Individual Data Models and Services; shared data model with Obtain, Harvest, and Put interfaces]
Augmenting interoperability across Repositories
Pathways Core Data Model for Cross-Repository services
Bekaert, Jeroen, Xiaoming Liu, Herbert Van de Sompel, Sandy Payette, Carl Lagoze, and Simeon Warner. Pathways Core: A Data Model for Cross-Repository Services. 2006. Poster for JCDL 2006. http://public.lanl.gov/herbertv/papers/pathways_core_poster_submit.pdf
Augmenting interoperability across Repositories
Pathways Core Surrogates (currently XML/RDF)
• A Surrogate is available for every Digital Object
• A Surrogate expresses access points and properties of a Digital Object, e.g.:
o Location of content streams
o providerInfo: the keys necessary to Obtain a fresh Surrogate at some later point in time:
- (Repository identifier, preferredIdentifier, versionKey)
o Lineage: A Surrogate expresses its predecessor(s)
- == providerInfo in previous life
o semantic: A Surrogate expresses the type of content.
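The properties listed on this slide can be pictured as fields of a record. The sketch below is an assumption-laden illustration, not the Pathways Core schema: the Python names (`ProviderInfo`, `Surrogate`, `derive`) and the example identifiers are hypothetical, and the actual Surrogates are XML/RDF documents rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ProviderInfo:
    """Keys needed to Obtain a fresh Surrogate at some later point in time."""
    repository_id: str
    preferred_identifier: str
    version_key: str


@dataclass
class Surrogate:
    provider_info: ProviderInfo
    semantic: str  # type of content
    locations: List[str] = field(default_factory=list)  # content stream locations
    lineage: List[ProviderInfo] = field(default_factory=list)  # predecessor(s)


def derive(sources: List[Surrogate], new_info: ProviderInfo, semantic: str) -> Surrogate:
    """Create a Surrogate whose Lineage is the providerInfo of its sources."""
    return Surrogate(new_info, semantic, lineage=[s.provider_info for s in sources])


a = Surrogate(ProviderInfo("repo:A", "obj:1", "v1"), "paper")
b = Surrogate(ProviderInfo("repo:B", "obj:9", "v3"), "dataset")
c = derive([a, b], ProviderInfo("repo:C", "obj:42", "v1"), "publication")
```

The `derive` helper mirrors the slide's remark that Lineage equals "providerInfo in previous life": the new Surrogate records exactly the keys under which its predecessors can be Obtained again.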
Augmenting interoperability across Repositories
Pathways Core Surrogates (currently XML/RDF)
• It would be wonderful if Surrogates had no Intellectual Property strings attached, i.e. Surrogates can flow freely, independent of business models of the underlying content
o Push IP issues out of the core level of interoperability
Augmenting interoperability across Repositories
Obtain interface: a Repository interface that supports the request of services pertaining to individual Digital Objects (including their component Datastreams). The core service is the request of a Surrogate for a Digital Object.
Harvest interface: a Repository interface that exposes Surrogates for incremental collecting/harvesting.
Put interface: a Repository interface that supports submission of one or more Surrogates into the Repository, thereby facilitating the addition of Digital Objects to the collection of the Repository.
Surrogate is at the core of the value chain
[Figure: Digital Objects (id) reached via Obtain, recombined with value added, then Put; labels: Lineage, providerInfo]
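One consequence of recording Lineage is that the chain of evidence can be followed back across Repositories instead of being reconstructed by fuzzy post processing. The sketch below illustrates that idea only; the dict layout, the repository and object identifiers, and the functions `obtain` and `trace` are hypothetical, and the versionKey is left out to keep the example short.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (Repository identifier, object identifier)

# Hypothetical Repositories: each maps an identifier to a Surrogate (a dict).
repositories: Dict[str, Dict[str, dict]] = {
    "repo:data": {"ds:1": {"lineage": []}, "ds:2": {"lineage": []}},
    "repo:inst": {"pub:7": {"lineage": [("repo:data", "ds:1"), ("repo:data", "ds:2")]}},
    "repo:journal": {"issue:3": {"lineage": [("repo:inst", "pub:7")]}},
}


def obtain(key: Key) -> dict:
    """Stand-in for an Obtain request addressed to the right Repository."""
    repository_id, identifier = key
    return repositories[repository_id][identifier]


def trace(key: Key) -> List[Key]:
    """Follow Lineage back through every predecessor, depth first."""
    seen: List[Key] = []
    stack = list(obtain(key)["lineage"])
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(obtain(current)["lineage"])
    return seen


ancestors = trace(("repo:journal", "issue:3"))
```

Starting from the overlay journal issue, `trace` reaches the institutional publication and, through it, both source datasets. Each step needs the Repository identifier as well as the object identifier, which is why the identification has to be that fine-grained.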
Basis for a Network of Linked Digital Objects
[Figure: Repo1 with interfaces Obtain1, Harvest1, Put1; Repo2 with interfaces Obtain2, Harvest2, Put2; a service]
[Figure: Repo1 with interfaces Obtain1, Harvest1, Put1; Repo2 with interfaces Obtain2, Harvest2, Put2]
Service Registry
provider | Obtain | Harvest | Put
Repo1 | Obtain1 | Harvest1 | Put1
Repo2 | Obtain2 | Harvest2 | Put2
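A Service Registry of this kind can be sketched as a lookup from provider and interface name to an endpoint. The endpoint URLs and the `resolve` function below are hypothetical; the slide only names the interfaces per Repository.

```python
from typing import Dict

# Hypothetical endpoint URLs; the slide only names the interfaces.
REGISTRY: Dict[str, Dict[str, str]] = {
    "Repo1": {"Obtain": "http://repo1.example/obtain",
              "Harvest": "http://repo1.example/harvest",
              "Put": "http://repo1.example/put"},
    "Repo2": {"Obtain": "http://repo2.example/obtain",
              "Harvest": "http://repo2.example/harvest",
              "Put": "http://repo2.example/put"},
}


def resolve(provider: str, interface: str) -> str:
    """Look up the endpoint of one interface of one Repository."""
    try:
        return REGISTRY[provider][interface]
    except KeyError:
        raise LookupError(f"no {interface} interface registered for {provider}") from None


harvest_url = resolve("Repo2", "Harvest")
```

With such a registry, a Repository identifier found in a Surrogate's providerInfo is enough for a service to find out where to send an Obtain, Harvest, or Put request.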
Meeting in NYC, April 20-21 2006
• Supported by Microsoft, Mellon Foundation, Coalition for Networked Information, Digital Library Federation, JISC
• Representatives from institutional Repository projects, scholarly content Repositories, Registry projects, various projects that touch on interoperability
• See http://msc.mellon.org/Meetings/Interop/ for Agenda, Participants, Topics & Goals, Terminology, Presentations, Prototype demonstration.
• Report available by the end of June 2006
Demonstration
• Overlay journal Scenario combined with Search engine Scenario
• Surrogates compliant with Pathways Core Data Model, expressed in RDF/XML.
• Obtain interfaces (OpenURL Application) at:
o an aDORe repository
o arXiv
o a DSpace repository
o a Fedora repository
• Harvest interfaces (OAI-PMH) at:
o an aDORe repository
o arXiv
o a Fedora repository
• Put interface at a Fedora repository
• MS Live Clipboard functionality in user interfaces of arXiv, Fedora, and the overlay search engine
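Since the Harvest interfaces in the demonstration are based on OAI-PMH, an incremental harvest request can be sketched as an ordinary OAI-PMH `ListRecords` call with a `from` date. The base URL below is hypothetical, and the metadata prefix used for Pathways Core Surrogates in the demonstration is not given in the slides, so the example falls back on `oai_dc`, which OAI-PMH requires every repository to support.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str,
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request, optionally incremental via `from`."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # only records changed on or after this date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical base URL and date; oai_dc is the mandatory OAI-PMH format.
url = list_records_url("http://repo.example/oai", "oai_dc", from_date="2006-04-20")
```

A harvester would repeat this request periodically, each time moving the `from` date forward to the time of its previous harvest.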
Demonstration
• Acknowledgments:
o Carl Lagoze, Sandy Payette, Simeon Warner, Chris Wilper at Cornell University
o Rob Tansley at HP
o Luda Balakireva, Xiaoming Liu, Herbert Van de Sompel, Zhiwu Xie at the Los Alamos National Laboratory
Demonstration
[Figure: Digital Objects (id); Obtain with Live Clipboard Copy; Put with Live Clipboard Paste and Submit]
Questions, Comments, Flames