Post on 29-Dec-2015
Breaking down the walls
Moving libraries from collectors to portals
Carl Lagoze
Cornell University
lagoze@cs.cornell.edu
The Library should selectively adopt the portal model for targeted program areas. By creating links from the Library’s Web site, this approach would make available the ever-increasing body of research materials distributed across the Internet. The Library would be responsible for carefully selecting and arranging for access to licensed commercial resources for its users, but it would not house local copies of materials or assume responsibility for long-term preservation.
LC21: Digital Strategy for the Library of Congress, page 5
Some of the most fundamental aspects of library operations entail the existence of a border, across which objects of information are transferred and maintained. Such a parameter, demarcating a single, distributed digital library (the "control zone"), needs to be created and managed by the academic library community at the earliest opportunity.
Ross Atkinson, Library Quarterly, 1996
Towards a Virtual Control Zone
Why distributed collections?
• Scale of the Web
• Prevalence of new publishing models and agents
• Increasing complexity of licensing and access management
• Dynamic nature of content
Towards Hybrid Portals
• Traditional portal (e.g., Yahoo!)
– linkage without responsibility
• Hybrid portal
– assertion of (some semblance of) a curatorial role over linked objects
New models have cultural/organizational ramifications…
• Performance and ranking metrics – "bigger is better"
• Levels of confidence
• Trust
…that can be assisted by new technical foundations
• Digital object architectures
– that enable aggregating and customizing content for local access and management
• Metadata frameworks
– that model changes of objects and their management over time
• OAI Harvesting Protocol
– for exchange of structured information
• Preservation models
– that enable non-cooperative and cooperative offsite monitoring
Digital Object Architectures:
aggregating & localizing distributed content
Acknowledgements:
– Naomi Dushay
– Sandy Payette
– Thornton Staples (U. Va.)
– Ross Wayland (U. Va.)
From Mediators to Value-Added Surrogates
• Wiederhold – mediators between raw data and end-user applications for integration and transformation
• Paepcke – mediators as foundation for digital library interoperability
• Payette and Lagoze – mediators (V-A surrogates) to aggregate and create a localized service layer for distributed resources
FEDORA Digital Object Model
Establishing a Virtual Control Zone
V-A Surrogate Applications
• Access management
– Shared responsibility among trusted partners
• Enhanced and customized functionality
– Examples: reference linking, format translation, special needs
• Preservation
– Monitoring "significant" events and acting on them
[Diagram: DigitalObject A seen through two context brokers]
Structural characteristics of DigitalObject A:
• RealAudio video
• PowerPoint presentation
• SMIL synchronization metadata
ContextBroker A (via tools) presents DigitalObject A as:
• View Slides
• View Video
• View synchronized presentation using applet
ContextBroker B (via tools) presents DigitalObject A as:
• Get Transcript of Audio
• Search for keyword
• Get Slides translated to French
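The diagram's central idea, one digital object offered through different behavior sets by different context brokers, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `DigitalObject` and `ContextBroker`, the `behaviors` and `invoke` methods, and the placeholder component contents are all hypothetical and are not taken from the FEDORA prototype.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DigitalObject:
    """A digital object with named structural components (placeholder content)."""
    name: str
    components: Dict[str, str]


@dataclass
class ContextBroker:
    """Hypothetical broker: maps behavior names to tools that act on an object."""
    name: str
    tools: Dict[str, Callable[[DigitalObject], str]] = field(default_factory=dict)

    def behaviors(self) -> List[str]:
        # The behaviors this broker advertises, in alphabetical order.
        return sorted(self.tools)

    def invoke(self, behavior: str, obj: DigitalObject) -> str:
        if behavior not in self.tools:
            raise KeyError(f"{self.name} does not offer {behavior!r}")
        return self.tools[behavior](obj)


# One object, with the three structural components named on the slide.
lecture = DigitalObject(
    name="DigitalObject A",
    components={
        "video": "<RealAudio video>",
        "slides": "<PowerPoint presentation>",
        "sync": "<SMIL synchronization metadata>",
    },
)

# Two brokers expose different behaviors over the same object.
broker_a = ContextBroker("ContextBroker A", {
    "View Slides": lambda o: o.components["slides"],
    "View Video": lambda o: o.components["video"],
})
broker_b = ContextBroker("ContextBroker B", {
    "Search for keyword": lambda o: f"keyword search over {o.name}",
})
```

Calling `broker_a.behaviors()` and `broker_b.behaviors()` on the same `lecture` object returns different lists, which is the point of the value-added surrogate: the object's content stays where it is while each context adds its own service layer.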
Where we are now…
• Ongoing FEDORA reference prototype
– http://www.cs.cornell.edu/cdlrg/FEDORA.html
– Policy enforcement research
– Content mediation
• Proposed joint deployment with University of Virginia
– Open source scalable implementation of FEDORA architecture
– Testing and deployment with a number of research library partners
Event-Aware Metadata Frameworks:
describing changes over time
• Acknowledgements:
– Dan Brickley (ILRT, Bristol)
– Martin Doerr (FORTH, Crete)
– Jane Hunter (DSTC, Brisbane)
Distributed Content: The Metadata Challenge
• From fixed, contained physical artifacts to fluid, distributed digital objects
• Need for basis of trust and authenticity in network environment
• Decentralization and specialization of resource description and need for mapping formalisms
Multi-entity nature of object description
[Figure labels: photographer, camera type, software, computer artist]
Attribute/Value approaches to metadata…
Hamlet has a creator Shakespeare
(Hamlet = subject; has a = implied verb; creator = metadata noun; Shakespeare = literal; playwright = metadata adjective)
The playwright of Hamlet was Shakespeare
[Graph: R1 –dc:title→ “Hamlet”; R1 –dc:creator.playwright→ “Shakespeare”]
…run into problems for richer descriptions…
Hamlet has a creator Stratford
(birthplace = metadata adjective)
The playwright of Hamlet was Shakespeare, who was born in Stratford
[Graph: R1 –dc:creator.playwright→ “Shakespeare”; R1 –dc:creator.birthplace→ “Stratford”]
…because of their failure to model entity distinctions
[Graph: R1 –title→ “Hamlet”; R1 –creator→ R2; R2 –name→ “Shakespeare”; R2 –birthplace→ “Stratford”]
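The contrast drawn over the last three slides can be written down as two small sets of subject/predicate/object triples. The following Python sketch is illustrative only: the tuple representation and the helper names `objects` and `creator_birthplaces` are assumptions made for this example, while the resource names (R1, R2) and property labels come from the slides.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Flat attribute/value description: every property hangs off the play R1,
# so "birthplace" is attached to Hamlet rather than to its creator.
flat: List[Triple] = [
    ("R1", "dc:title", "Hamlet"),
    ("R1", "dc:creator.playwright", "Shakespeare"),
    ("R1", "dc:creator.birthplace", "Stratford"),
]

# Entity-aware description: the creator is its own resource R2,
# which gives name and birthplace a proper attachment point.
graph: List[Triple] = [
    ("R1", "title", "Hamlet"),
    ("R1", "creator", "R2"),
    ("R2", "name", "Shakespeare"),
    ("R2", "birthplace", "Stratford"),
]


def objects(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def creator_birthplaces(triples: List[Triple], resource: str) -> List[str]:
    """Follow resource -> creator -> birthplace through the graph."""
    places: List[str] = []
    for creator in objects(triples, resource, "creator"):
        places.extend(objects(triples, creator, "birthplace"))
    return places
```

With the entity-aware graph, `creator_birthplaces(graph, "R1")` reaches Stratford by way of R2; in the flat version the same value can only be read off R1 itself, which is where the description goes wrong.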
ABC/Harmony Event-aware metadata model
• Recognizing inherent lifecycle aspects of description (esp. of digital content)
• Modeling incorporates time (events and situations) as first-class objects
– Supplies clear attachment points for agents, roles, occurrent properties
• Resource description as a “story-telling” activity
Resource-centric Metadata
Title Anna Karenina
Author Leo Tolstoy
Illustrator Orest Vereisky
Translator Margaret Wettlin
Date Created 1877
Date Translated 1978
Description Adultery & Depression
Birthplace Moscow
Birthdate 1828
?
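[Figure: event-aware description graph for “Anna Karenina”. Labels appearing in the graph:]
• “Anna Karenina”; “Tragic adultery and the search for meaningful love”
• “creation”, “1877”, “Russian”
• “author”, “Leo Tolstoy”, “Moscow”, “1828”
• “translation”, “1978”, “English”
• “translator”, “Margaret Wettlin”
• “illustrator”, “Orest Vereisky”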
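One plausible reading of the event-aware graph, sketched in Python: dates and roles attach to events, and biographical facts attach to agents, rather than everything being a property of the book. The class names, field names, and the assignment of “Russian” to the creation event and “English” to the translation event are assumptions made for this sketch; the illustrator is left out because the labels alone do not say which event that role belongs to.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Agent:
    """A person; biographical facts live here, not on the book."""
    name: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Event:
    """An event in the life of a resource, with agents acting in roles."""
    kind: str
    year: str
    language: str              # language of what the event produced (assumed)
    roles: Dict[str, Agent]    # role name -> agent


tolstoy = Agent("Leo Tolstoy", {"birthplace": "Moscow", "birthdate": "1828"})
wettlin = Agent("Margaret Wettlin")

anna_karenina: List[Event] = [
    Event("creation", "1877", "Russian", {"author": tolstoy}),
    Event("translation", "1978", "English", {"translator": wettlin}),
]


def event_for_role(events: List[Event], role: str) -> Optional[Event]:
    """Return the first event in which some agent plays the given role."""
    for event in events:
        if role in event.roles:
            return event
    return None
```

Asking `event_for_role(anna_karenina, "translator")` returns the 1978 translation event, so "Date Translated" no longer needs to be a separate field on the book, and Tolstoy's birthplace is unambiguously a fact about Tolstoy.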
Queries over descriptive graphs
List details of events where Lagoze is a participating agent
SELECT ?title, ?type, ?time, ?place, ?name
FROM http://ilrt.org/discovery/harmony/oai.rdf
WHERE (web::type ?event abc::Event)
      (abc::context ?event ?context)
      …..
AND ?name ~ lagoze
USING web FOR http://www.w3.org/1999/02/22-rdf-syntax-ns#
Rudolf Squish – http://swordfish.rdfweb.org/rdfquery
Where we are now
• Stabilization of model
• Collaboration with museum/CIDOC community for joint modeling principles
• Plans
– RDF API for model elements
– UI for metadata creation
– Query engine testing
Open Archives Initiative: facilitating exchange of structured information
• Acknowledgements:
– Herbert Van de Sompel
– OAI Steering and Technical Committees
Open Archives Initiative
• Testing the hypotheses
– exposing metadata in various forms will facilitate creation of value-added services
– key to deployable DL infrastructure is low entry cost
– individual communities can/will customize common infrastructure
Where we’ve come from
• Late 1999 Santa Fe UPS meeting – increase impact of eprint initiatives through federation
• Santa Fe Convention – metadata harvesting among eprint archives
• Increasing interest outside the eprint community
– Research libraries
– Museums
– Publishers
Progress over the past year
• OAI workshops at US and EC DL conferences
• Organizational stability
– Executive committee and steering committee
• September 2000 technical meeting
– Reframe and rethink technical solutions for broader domain
• Extensive testing and refinement of technical infrastructure
Technical Infrastructure – key technical features
• Deploy-now technology – 80/20 rule
• Two-party model – providers and consumers
• Simple HTTP encoding
• XML schema for some degree of protocol conformance
• Extensibility
– Multiple item-level metadata
– Collection-level metadata
OAI protocol requests
Supporting protocol requests:
• Identify
• ListMetadataFormats
• ListSets
Harvesting protocol requests:
• ListRecords
• ListIdentifiers
• GetRecord
[Diagram: a harvester (service provider) issues protocol requests to a repository (data provider)]
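Because the protocol uses a simple HTTP encoding, a harvester's requests are just URLs carrying a `verb` argument plus a few parameters. The Python sketch below shows how such a URL might be assembled with the standard library. It is a rough illustration, not a conforming client: the repository address is hypothetical, the helper `build_request` is invented for this example, and the parameter names `identifier` and `metadataPrefix` are taken from the published OAI protocol specification rather than from these slides.

```python
from urllib.parse import urlencode

# The six request types listed on the slide.
SUPPORTING = ("Identify", "ListMetadataFormats", "ListSets")
HARVESTING = ("ListRecords", "ListIdentifiers", "GetRecord")


def build_request(base_url: str, verb: str, **params: str) -> str:
    """Encode one protocol request as an HTTP GET URL.

    Parameters are sorted by name so that the resulting URL is deterministic.
    """
    if verb not in SUPPORTING + HARVESTING:
        raise ValueError(f"unknown protocol request: {verb}")
    query = [("verb", verb)] + sorted(params.items())
    return f"{base_url}?{urlencode(query)}"


# Hypothetical repository address, for illustration only.
BASE_URL = "http://repository.example.org/oai"

identify_url = build_request(BASE_URL, "Identify")
record_url = build_request(
    BASE_URL,
    "GetRecord",
    identifier="oai:example:1",
    metadataPrefix="oai_dc",
)
```

A service provider would fetch these URLs and parse the XML responses; keeping the request side this small is what makes the entry cost for data providers and harvesters low.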
Where we are now
• “Stable” 1.0 protocol specification
• Hopefully, self-documenting infrastructure
– http://www.openarchives.org
• 27 registered data providers
• Increasing number of tools available
• Research initiatives
– NSF-funded NSDL
– EC-funded Cyclades
– Andrew W. Mellon service proposals
– EC-funded community building
Where do we go from here
• Controlling the stampede
• Maintaining the organizational model – lean and mean while encouraging community-specific exploitation
• Encouraging testing, especially through deployment and service development
• Encouraging metadata diversification – this isn’t just about Dublin Core!!!
– Preservation
– Document access
– Authentication
OAI & Metadata Research
• Dictionary of metadata terms (Tom Baker)
• Mandating usage rules has only limited effectiveness
• Compiling usage of those terms is vital to machine understanding and interoperability
– Provide context heuristics for search engine and indexer processing
• Large-scale deployment of OAI and web crawling enables (partial) automation of usage compilation (e.g., data mining of term usage)
Preservation Models: monitoring threats to distributed content
• Acknowledgements:
– Bill Arms
– Peter Botticelli (CUL)
– Anne Kenney (CUL)
Preservation & Remote Control
• Organization Issues
– “assured preservation” may not be possible without direct custodial control.
– what are the levels of acceptability and for which types of resources?
• Technical Issues
– what are the technologies for remote control at the various levels of assurance deemed acceptable by the library?
– what is the probability of a reasonable level of preservation in the context of such technologies?
Cost vs. Functionality
[Figure: axes labeled cost and functionality; labeled item: Open Archival Information System (OAIS)]
Leveraging Current Work
• Event-based metadata
• Metadata harvesting
• Longevity and threats to digital resources
Level 0 Experiment
[Diagram: three Web sites → selective Web crawling → event records → policy enforcer with policy/action pairs (P1/A1, P2/A2, P3/A3) → actions]
Level 1 Experiment
[Diagram: two Web sites and OAI data providers. The Web sites are monitored by selective Web crawling; the OAI data providers return preservation metadata in response to OAI protocol requests. Both feed event records to a policy enforcer with policy/action pairs (P1/A1, P2/A2, P3/A3), which issues actions]
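The policy enforcer in both experiments pairs policies with actions (P1/A1, P2/A2, P3/A3) and applies them to incoming event records. A minimal Python sketch of that loop follows. Everything concrete in it is hypothetical: the record fields, the event kinds "missing" and "changed", the action names, and the example URLs are invented for illustration and do not describe the experiments' actual policies.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EventRecord:
    """One observation about a monitored resource (fields are hypothetical)."""
    url: str
    kind: str  # e.g. "changed", "missing", "unchanged"


@dataclass
class Policy:
    """A policy Pn paired with the action An it triggers."""
    name: str
    matches: Callable[[EventRecord], bool]
    action: str


def enforce(policies: List[Policy], events: List[EventRecord]) -> List[Tuple[str, str, str]]:
    """Return (policy, action, url) for every policy that an event triggers."""
    triggered = []
    for event in events:
        for policy in policies:
            if policy.matches(event):
                triggered.append((policy.name, policy.action, event.url))
    return triggered


policies = [
    Policy("P1", lambda e: e.kind == "missing", "notify-curator"),
    Policy("P2", lambda e: e.kind == "changed", "capture-new-copy"),
]

events = [
    EventRecord("http://site-a.example.org/report.pdf", "changed"),
    EventRecord("http://site-b.example.org/data.xml", "unchanged"),
    EventRecord("http://site-c.example.org/page.html", "missing"),
]

actions = enforce(policies, events)
```

The enforcer does not care where an event record came from, so the same loop serves the Level 0 case (records produced by crawling alone) and the Level 1 case (records that also come from preservation metadata harvested from cooperating providers).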
One of Six Core Integration Demonstration Projects for the NSDL
How Big might the NSDL be?
The NSDL aims to be comprehensive -- all branches of science, all levels of education, very broadly defined.
Five year targets:
1,000,000 different users
10,000,000 digital objects
100,000 independent sites
Requires: low-cost, scalable technology; automated collection building and maintenance
Levels of Interoperability: Metadata Harvesting
Agreements on simple protocol and metadata standard(s)
Example:
Metadata harvesting protocol of the Open Archives Initiative (MHP)
• Moderate-quality services
• Low cost of entry to participating sites
Moderately large numbers of loosely collaborating sites
Promising but still an emerging approach
Levels of Interoperability: Gathering
Robots gather collections automatically with no participation from individual sites
Examples:
Web search services (e.g., Google)
CiteSeer (a.k.a. ResearchIndex)
• Restricted but useful services
• Zero cost of entry to gathered sites
Very large numbers of independent sites
Only suitable for open access collections