The Open Archives Initiative Movement
Kurt Maly
Old Dominion University
Norfolk, Virginia, USA
[email protected]
http://www.cs.odu.edu/~maly
Brazilian DL international conference Política de Informação em Bibliotecas Digitais
Campinas, Brazil
March 19-22, 2003
Outline*
• Open Archives Initiative - history and summary description
• OAI services
• Why the OAI-PMH is not important
• Defining the OAI-PMH data model
• More interesting services (DP9, Celestial, Kepler)
* Slides from Herbert Van de Sompel & Carl Lagoze & Michael Nelson included
herbert van de sompel
The Open Archives Initiative has been set up to create a forum to discuss and solve matters of interoperability between preprint solutions, as a way to promote their global acceptance. Paul Ginsparg, Rick Luce & Herbert Van de Sompel
OAI origin
preprint solutions
• Santa Fe meeting: improve accessibility of preprints
• by improving searchability
• via the provision of an interoperability spec
Core concepts of Santa Fe convention
• low-barrier interoperability
• data-provider & service-provider model
• metadata harvesting model
• shared metadata format and parallel, community-specific metadata formats
• acceptable use
Dienst subset
OAMS
XML reply
HTTP based
Gentlemen’s agreement
metadata harvesting
[Figure: metadata (Author, Title, Abstract, Identifier) exposed for e-prints]
interest from other communities
• Digital Library Federation meetings ~ research library community has many materials for which they would like to ‘expose’ metadata
• OAI San Antonio meeting: ~ interest from librarians, publishers, others, ...
resulting actions: organizational
• establish organizational stability for the OAI:
• institutional backing from CNI & DLF
• steering committee: policy guidance
• technical committee: technical specifications
• executive group: day to day coordination
• workshops: public dissemination, feedback
resulting actions: technical
• [09/2000] revise specifications to allow adoption beyond preprints: technical committee
• [09/2000-01/2001] compile new specifications: editing by Carl and Herbert
• [11/2000-01/2001] alpha-test specifications: oai-alpha group
• [01/2001] discontinue the Santa Fe Convention
• [01/2001] release version 1.0 of the OAI protocol
• [07/2001] version 1.1
• [06/2002] version 2.0
core concepts in OAI 1.0
• low-barrier interoperability
• data-provider & service-provider model
• metadata harvesting model
• shared metadata format and parallel, community-specific metadata formats
• acceptable use
• flexibility
OAI 1.0 protocol
Dublin Core
HTTP based
Community specific
Reply • XML Schema
• Self contained
low-barrier interop umbrella
[Figure: metadata (Author, Title, Abstract, Identifier) exposed for OPAC, image, FTXT, A&I, and e-print collections]
communication re OAI
• lists: subscribe via http://www.openarchives.org
• oai-general list [replaces UPS list; UPS subscribers will be moved]
• oai-implementers list
• web: http://www.openarchives.org
• FAQ: http://www.openarchives.org/faq.htm
• mail: [email protected]
• freeze specifications for 12-18 months:
• stable for experimentation; not definitive
• minimize risk for early adopters
• maximize chances for future interoperability across communities
revision of specifications
The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.
new OAI mission statement
The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication. Continued support of this work remains a cornerstone of the Open Archives program.
new OAI mission statement
The fundamental technological framework and standards that are developing to support this work are, however, independent of both the type of content offered and the economic mechanisms surrounding that content, and promise to have much broader relevance in opening up access to a range of digital materials.
[...]
new OAI mission statement
OAI-PMH Meeting History
Topic                                    OAI Open Day, Washington DC (1/2001)    CERN meeting (10/2002)
Protocol definition, development tools   4                                       1
DPs, retrofitting existing DLs           5                                       4
SPs, new services                        1                                       11
Socio-Economic-Political Issues          0                                       6
Shift of Topics
• From the protocol itself, supporting & debugging tools and how to retrofit (existing) DLs…
• …to building (new) services that use the OAI-PMH as a core technology and reporting on their impact to the institution/community
NTRS
• http://ntrs.nasa.gov/
• metadata harvesting replacement for http://techreports.larc.nasa.gov/cgi-bin/NTRS
– previous NTRS was based on distributed searching
– hierarchical harvesting
• (nigh) publicly available
Arc
• http://arc.cs.odu.edu/
• harvests all known archives
• first end-user service provider
• source available through SourceForge
• hierarchical harvesting
NCSTRL
• http://www.ncstrl.org/
• metadata harvesting replacement for Dienst-based NCSTRL
• based on Arc
• computer science metadata
Archon
• http://archon.cs.odu.edu/
• physics metadata
• based on Arc
• features:
– citation indexing
– equation-based searching
Torii
• http://torii.sissa.it/
• physics metadata
• features
– personalization
– recommendations
– WAP access
iCite
• http://icite.sissa.it/
• physics metadata
• features
– citation based access to arXiv metadata
my.OAI
• http://www.myoai.com/
• covers all registered metadata
• features
– result sets
– personalization
– many other advanced features
Cyclades
• http://www.ercim.org/cyclades
• scientific metadata
• features
– personalization
– recommendations
– collaboration
• status?
Public Knowledge Project
• http://www.pkp.ubc.ca/harvester/
• domain-specific filtering of harvested metadata (?)
Perseus
• http://www.perseus.tufts.edu/
• they claim to harvest all DPs, but only humanities related DPs appear in the pull down menu
Service Providers
• It is clear that SPs are proliferating, despite (because of?) the inherent bias toward DPs in the protocol
– easy to be a DP -> many DPs -> SPs eventually emerge
– hard to be a DP -> SPs starve
– currently 5x more DPs than SPs
• SPs are beginning to offer increasingly sophisticated services
– competitive market originally envisioned for SPs is emerging
Why The OAI-PMH is NOT Important
• Users don’t care
• OAI-PMH is middleware
– if done right, the uninterested user should never have to know
[Figure: “OAI Inside”]
• Using the OAI-PMH does not ensure a good SP
• OAI-PMH is (or is becoming) HTTP for DLs
– few people get excited about HTTP now
• HTTP & OAI-PMH are core technologies whose presence is now assumed
Other Uses For the OAI-PMH
• Assumptions:
– Traditional DLs / SPs will continue on their present path of increasing sophistication
• citation indexing, search results viz, personalization, recommendations, subject-based filtering, etc.
– growth rates remain the same (5x DPs as SPs)
• Premise: OAI-PMH is applicable to any scenario that needs to update / synchronize distributed state
– Future opportunities are possible by creatively interpreting the OAI-PMH data model
[Figure: resource (David) -> item (all available metadata about David) -> records (Dublin Core metadata, MARC metadata, SPECTRUM metadata)]
item = identifier
record = identifier + metadata format + datestamp
set-membership is item-level property
OAI-PMH Data Model
Typical Values
• repository
– collection of publications
• resource
– scholarly publication
• item
– all metadata (DC + MARC)
• record
– a single metadata format
• datestamp
– last update / addition of a record
• metadata format
– bibliographic metadata format
• set
– originating institution or subject categories
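The data model above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of any OAI-PMH toolkit: the class names, the `changed_since` helper, and the example identifier are assumptions made here. It shows that an item is just an identifier, that a record is the triple identifier + metadata format + datestamp, and that set membership hangs off the item.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp."""
    identifier: str
    metadata_prefix: str  # e.g. "oai_dc" or "marc"
    datestamp: date       # last update / addition of this record
    metadata: dict


@dataclass
class Item:
    """item = identifier; it groups all records about one resource."""
    identifier: str
    sets: set = field(default_factory=set)       # set membership is an item-level property
    records: dict = field(default_factory=dict)  # metadata prefix -> Record

    def add_record(self, prefix, datestamp, metadata):
        self.records[prefix] = Record(self.identifier, prefix, datestamp, metadata)


def changed_since(items, prefix, since):
    """Records in one metadata format whose datestamp is on or after `since`."""
    return [
        item.records[prefix]
        for item in items
        if prefix in item.records and item.records[prefix].datestamp >= since
    ]


# Hypothetical example data (identifier and metadata are made up).
item = Item("oai:example.org:report-1", sets={"physics"})
item.add_record("oai_dc", date(2003, 1, 10), {"title": "An example report"})
item.add_record("marc", date(2003, 3, 1), {"245": "An example report"})

fresh_marc = changed_since([item], "marc", date(2003, 2, 1))
fresh_dc = changed_since([item], "oai_dc", date(2003, 2, 1))
```

Filtering by datestamp per metadata format is what lets a harvester synchronize incrementally: only the MARC record changed after 1 February, so only that record is returned.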
Interesting Services
• DP9
– gateway to expose repository contents in HTML suitable for web crawlers
• Celestial
– OAI “cache”, also 1.1 -> 2.0 converter
• Static (mini-) repositories
– XML files, based on OLAC work
• OpenURL metadata format registries
– record = metadata format
DP9 Formatting
• Format of URLs
– http://arc.cs.odu.edu:8080/dp9/getrecord.jsp?identifier=oai:NACA:1917:naca-report-10&prefix=oai_dc
– http://arc.cs.odu.edu:8080/dp9/getrecord/oai_dc/oai:NACA:1917:naca-report-10
• HTML Meta tags
– Some crawlers (such as Inktomi) use the HTML meta tags to index a Web page; DP9 also maps Dublin Core metadata to corresponding HTML meta tags.
– For pages that are designed exclusively for robot navigation, a noindex robots meta tag is used
– X-FORWARDED-FOR header to distinguish between different users coming in via a proxy
Slide from Liu
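The two URL formats and the meta-tag mapping can be illustrated with a short Python sketch. The base URL and identifier come from the slide; the function names are hypothetical, and the "DC.<element>" naming of the meta tags is an assumption about the mapping, not DP9's confirmed output.

```python
from html import escape
from urllib.parse import urlencode

DP9_BASE = "http://arc.cs.odu.edu:8080/dp9"  # taken from the example URLs on the slide

ROBOTS_NOINDEX = '<meta name="robots" content="noindex">'  # for robot-navigation pages


def query_style_url(identifier, prefix="oai_dc"):
    """First URL format: identifier and prefix as query parameters."""
    # safe=":" leaves the colons of the OAI identifier unescaped, as on the slide
    query = urlencode({"identifier": identifier, "prefix": prefix}, safe=":")
    return f"{DP9_BASE}/getrecord.jsp?{query}"


def path_style_url(identifier, prefix="oai_dc"):
    """Second URL format: prefix and identifier as path segments."""
    return f"{DP9_BASE}/getrecord/{prefix}/{identifier}"


def dc_to_meta_tags(dc):
    """Map Dublin Core elements to HTML meta tags (assumed 'DC.<element>' names)."""
    return [
        f'<meta name="DC.{escape(name)}" content="{escape(value)}">'
        for name, value in dc.items()
    ]


record_id = "oai:NACA:1917:naca-report-10"
url_a = query_style_url(record_id)
url_b = path_style_url(record_id)
tags = dc_to_meta_tags({"title": "Reports & Notes"})  # made-up title
```

The path-style form contains no query string, which is presumably why it suits crawlers that skip URLs with parameters.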
Celestial
• Developed by Brody @ Southampton
– http://celestial.eprints.org/
– designed to complement DP9
– see Liu, Brody, et al., D-Lib Magazine 8(11)
• Where DP9 is a non-caching proxy, Celestial caches the metadata records
– can off-load work from individual archives, higher availability
– can harvest 1.1, 2.0; exports in 2.0
“Static” Repositories
• Premise: a repository does not wish to have an executing program on its site, so it has a “static” XML file with some of the OAI-PMH responses in place
– Design still being discussed
• accessed through a proxy
• could be a low functionality node, or the XML file could be produced by a process and moved outside a firewall
• Based on OLAC work by Bird & Simons
– http://www.language-archives.org/
Original Kepler Framework
• Support "personal data providers" or "archivelets"
• An archivelet is a self-contained, self-installing software package that easily allows a researcher to create and maintain a small, OAI-PMH-compliant archive.
• The general public has seamless access to the totality of all such published materials.
Enhanced Kepler Framework (EKF)
• Improve the scalability and service reliability by the concept of buddy nodes and SuperNodes.
• Extend OAI-PMH with Push model and hybrid push/pull model.
• Rapid discovery of content as soon as it is published.
• Works with firewalls and network address translation (NAT) proxies.
• Support community-based installation and integration.
Motivation behind Kepler
• The success of Peer-to-Peer (P2P) networks.
• The vision of author self-archiving.
• Efficient repository synchronization technology defined by OAI-PMH.
Peer to Peer Network
• File sharing P2P networks such as Napster, Gnutella, Freenet.
• LOCKSS (Lots Of Copies Keep Stuff Safe) provides long-term preservation of scientific journals.
• Recent arrival of FastTrack and openFT:
– A 2-tier system: SuperNodes and Nodes to solve the scalability problem.
– Kazaa (based on FastTrack technology) claims 20M downloads and scales well.
Author Self-Archiving
• Subject based: ArXiv.org is a very successful subject-based self-archiving service. Since its inception in 1991, nearly 200K documents have been submitted.
• Institutional Based: Eprints software from Southampton.
• Personal Based: Personal homepage (indexed by researchindex) and Kepler (indexed by OAI-PMH-compliant services such as Arc).
Problems of the Original Kepler Framework
• Centralized Registration Server (LDAP): Increases the complexity of installing the Kepler server side software.
• Open Protocol: The archivelet uses a non-standard protocol for registration, thus inhibiting the development of third party applications.
• Security and NAT (Network Address Translation): In many instances, an archivelet is behind a firewall or NAT proxy, which makes it difficult for the service provider to harvest the archivelet.
Problems of the Original Kepler Framework
• Availability: Archivelets are extremely unstable; some use dynamic IP addresses.
• Freshness: There is a large number of archivelets, each with sparse changes. This doesn’t fit well with OAI-PMH’s “poll”-based model.
• Full text vs. Metadata: As the archivelet is not up all the time, it is desirable to harvest full-text documents as well to improve the availability of full-text to end-users.
Improvement through EKF
• “push” and a hybrid “push/pull” model to address the scalability, security and freshness problems.
• SuperNode and Buddy node to improve scalability and server availability.
• The protocol in EKF is open and we hope it will inspire third-party development of Kepler tools
EKF- Push/Pull model
• Pull – Retrieval without prior coordination (e.g., as used by current robots and OAI-PMH)
• Hybrid Push/Pull – Retrieval after notification
• Push – Notification followed by a provider push.
EKF- Why extend OAI-PMH?
• "Update Overhead" problem.
– Frequent crawling has to be done to synchronize the data providers and service providers.
– It is inefficient if the data providers seldom change during a harvest interval.
– On the other hand, without frequent crawling, service providers may become inconsistent with data providers.
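A back-of-the-envelope calculation makes the update overhead concrete. The workload numbers below are purely hypothetical (they are not measurements from Kepler or Arc); the sketch only compares how many harvest requests a service provider issues when it polls every archivelet against how many it issues when each harvest is triggered by a notification.

```python
def polling_requests(num_providers, polls_per_day, days):
    """Harvest requests issued when the service provider polls every provider."""
    return num_providers * polls_per_day * days


def notified_requests(num_providers, days_between_updates, days):
    """Harvest requests issued when each harvest follows an update notification."""
    return num_providers * days // days_between_updates


# Hypothetical workload: 1000 archivelets, each changing once every 10 days,
# observed over 30 days, with one poll per archivelet per day.
providers, days = 1000, 30
poll = polling_requests(providers, polls_per_day=1, days=days)
notify = notified_requests(providers, days_between_updates=10, days=days)
wasted = poll - notify  # polls that found nothing new
```

Under these assumptions 27,000 of the 30,000 polls return no new records; polling less often would cut that waste but leave the service provider stale for longer.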
Extension of OAI-PMH in Service Provider Side
• The AddFriend and Notify verbs support the hybrid push/pull model.
– The AddFriend verb informs the service provider of the existence of a data provider.
– The Notify verb informs the service provider that a data provider is up/down or that new data is available.
• A PushMetadata verb is added to support the push model.
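How a service provider might react to the three verbs can be sketched as a small in-memory model. This is a sketch under stated assumptions, not the Kepler implementation: the `SuperNode` class, its attribute names, and the example id and URL are hypothetical; only the verb names and the start/stop/update events come from the slides.

```python
class SuperNode:
    """Hypothetical in-memory model of a service provider handling the new verbs."""

    def __init__(self):
        self.friends = {}        # archivelet id -> {"base_url": str, "online": bool}
        self.harvest_queue = []  # ids waiting for an ordinary OAI-PMH pull
        self.pushed = []         # metadata received through PushMetadata

    def add_friend(self, archivelet_id, base_url):
        # AddFriend: learn that a data provider exists
        self.friends[archivelet_id] = {"base_url": base_url, "online": False}

    def notify(self, archivelet_id, event):
        # Notify: the data provider reports that it is up/down or has new data
        friend = self.friends.get(archivelet_id)
        if friend is None:
            raise KeyError(f"unknown archivelet: {archivelet_id}")
        if event == "start":
            friend["online"] = True
        elif event == "stop":
            friend["online"] = False
        elif event == "update":
            # hybrid push/pull: the notification only schedules a later pull
            if archivelet_id not in self.harvest_queue:
                self.harvest_queue.append(archivelet_id)
        else:
            raise ValueError(f"unknown event: {event}")

    def push_metadata(self, contents):
        # PushMetadata: pure push, the provider sends the metadata itself
        self.pushed.append(contents)


node = SuperNode()
node.add_friend("archivelet-1", "http://host.example.org:8080/oai")  # made-up values
node.notify("archivelet-1", "start")
node.notify("archivelet-1", "update")
node.notify("archivelet-1", "update")  # duplicate notifications are coalesced
```

Coalescing repeated update notifications into one queued harvest is a design choice of this sketch; it keeps a chatty archivelet from triggering redundant pulls.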
Design and Implementation
• Loose Name Space Management
– Use email address to uniquely identify archivelet.
– Avoid the effort of maintaining a global namespace.
• Sample
– oai:kepler.cs.odu.edu:[email protected]/doc1
Archivelet
• Components
– File system based, OAI-PMH-compliant repository.
– Publication tool.
– A simple extended HTTP server which supports the OAI-PMH protocol and push/pull model.
• It might act as a SuperNode and BuddyNode at the same time.
SuperNode
• A SuperNode has all the functionalities of an archivelet.
• The SuperNode collects all documents and metadata from archivelets in its friends list, and builds value-added services over these harvested data.
• SuperNode is typically deployed at an institution with a high quality network.
Protocol Syntax
• Add a Friend: Request to be added as a friend
– ?verb=AddFriend&id=&baseURL=
• Notify: used for major events of an archivelet, including startup, shutdown and document update.
– ?verb=Notify&event=[start/stop/update]&id=&baseURL=
• PushMetadata:
– ?verb=PushMetadata&contents=
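The request syntax above can be exercised with Python's standard library. The verb and parameter names (`verb`, `id`, `event`, `baseURL`, `contents`) are taken from the slide; the `kepler_request` helper, the SuperNode URL, and the archivelet values are hypothetical, and whether Kepler sends these as GET query strings exactly like this is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

NOTIFY_EVENTS = ("start", "stop", "update")  # events listed on the slide


def kepler_request(service_url, verb, **params):
    """Build a request URL of the form <service_url>?verb=...&key=value..."""
    if verb == "Notify" and params.get("event") not in NOTIFY_EVENTS:
        raise ValueError("Notify needs event=start, stop or update")
    return f"{service_url}?{urlencode({'verb': verb, **params})}"


# Hypothetical SuperNode endpoint and archivelet values.
SUPERNODE = "http://supernode.example.org/kepler"
ARCHIVELET_URL = "http://host.example.org:8080/oai"

add_friend = kepler_request(SUPERNODE, "AddFriend", id="archivelet-1", baseURL=ARCHIVELET_URL)
notify = kepler_request(SUPERNODE, "Notify", event="update", id="archivelet-1", baseURL=ARCHIVELET_URL)
push = kepler_request(SUPERNODE, "PushMetadata", contents="<record/>")

# Round-trip check: the receiving side can recover the parameters.
parsed = parse_qs(urlsplit(notify).query)
```

`urlencode` percent-encodes the base URL and the pushed contents, so the parameters survive the round trip through `parse_qs` unchanged.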
Optional Implementation
• Features outside of the kernel protocol, but which may operate as a “hook” to attract more usage of Kepler.
– Cache full text document.
– Query service in archivelet.
– Security and Spoofing Issues
Conclusions
• Protocol / transport gateways
– Dienst <-> OAI
• DOG, http://www.cs.odu.edu/~tharriso/DOG/
– Z39.50
• ZMARCO (UIUC)
– SOAP
• prototypes @ VT (Suleman) & ODU (Zubair)
OAI-PMH Will Have Arrived When:
• general web robots issue OAI-PMH verbs
– …DP9 will no longer be needed
– requires shift in “control”: harvester or repository?
• mod_oai is developed and is included in the default Apache configuration
• OAI-PMH fades into the background
– similar to TCP/IP, http, XML, etc.
– next year’s workshop is on OpenURL