The Open Archives Initiative Movement Kurt Maly Old Dominion University Norfolk Virginia, USA [email protected] http://www.cs.odu.edu/~maly Brazilian DL international conference Política de Informação em Bibliotecas Digitais Campinas, Brazil March 19-22, 2003

The Open Archives Initiative Movement

Kurt Maly

Old Dominion University

Norfolk, Virginia, USA

[email protected]

http://www.cs.odu.edu/~maly

Brazilian DL international conference Política de Informação em Bibliotecas Digitais

Campinas, Brazil

March 19-22, 2003

Outline*

• Open Archives Initiative: history and summary description

• OAI services

• Why the OAI-PMH is not important

• Defining the OAI-PMH data model

• More interesting services (DP9, Celestial, Kepler)

* Slides from Herbert Van de Sompel, Carl Lagoze & Michael Nelson included

herbert van de sompel

The Open Archives Initiative has been set up to create a forum to discuss and solve matters of interoperability between preprint solutions, as a way to promote their global acceptance.

Paul Ginsparg, Rick Luce & Herbert Van de Sompel

OAI origin

[Figure: “e-print accessibility” diagram; labels: e-print (repeated), preprint solutions]

• Santa Fe meeting: improve accessibility of preprints

• by improving searchability

• via the provision of an interoperability spec

Core concepts of Santa Fe convention

• low-barrier interoperability

• data-provider & service-provider model

• metadata harvesting model

• shared metadata format and parallel, community-specific metadata formats

• acceptable use

[Slide callouts: Dienst subset; OAMS; XML reply; HTTP based; gentlemen’s agreement]

metadata harvesting

[Figure: metadata harvesting diagram; labels: metadata (Author, Title, Abstract, Identifier), e-print (repeated)]

interest from other communities

• Digital Library Federation meetings: the research library community has many materials for which it would like to ‘expose’ metadata

• OAI San Antonio meeting: interest from librarians, publishers, others, ...

resulting actions: organizational

• establish organizational stability for the OAI:

• institutional backing from CNI & DLF

• steering committee: policy guidance

• technical committee: technical specifications

• executive group: day to day coordination

• workshops: public dissemination, feedback

resulting actions: technical

• [09/2000] revise specifications to allow adoption beyond preprints: technical committee

• [09/2000-01/2001] compile new specifications: editing by Carl and Herbert

• [11/2000-01/2001] alpha-test specifications: oai-alpha group

• [01/2001] discontinue the Santa Fe Convention

• [01/2001] release version 1.0 of the OAI protocol

• [07/2001] version 1.1

• [06/2002] version 2.0

core concepts in OAI 1.0

• low-barrier interoperability

• data-provider & service-provider model

• metadata harvesting model

• shared metadata format and parallel, community-specific metadata formats

• acceptable use

• flexibility

[Slide callouts: OAI 1.0 protocol; Dublin Core; HTTP based; community specific; reply: XML Schema, self-contained]

low-barrier interop umbrella

[Figure: low-barrier interop umbrella diagram; labels: metadata (Author, Title, Abstract, Identifier), OPAC, image, FTXT, A&I, e-print]

communication re OAI

• lists: subscribe via http://www.openarchives.org

• oai-general list [replaces UPS list; UPS subscribers will be moved]

• oai-implementers list

• web: http://www.openarchives.org

• FAQ: http://www.openarchives.org/faq.htm

• mail: [email protected]

revision of specifications

• freeze specifications for 12-18 months:
– stable for experimentation; not definitive
– minimize risk for early adopters
– maximize chances for future interoperability across communities

http://www.openarchives.org

[email protected]

new OAI mission statement

The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.

The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication. Continued support of this work remains a cornerstone of the Open Archives program.

The fundamental technological framework and standards that are developing to support this work are, however, independent of both the type of content offered and the economic mechanisms surrounding that content, and promise to have much broader relevance in opening up access to a range of digital materials.

[...]

OAI-PMH Meeting History

Topic                                     OAI Open Day, Washington DC (1/2001)   CERN meeting (10/2002)
Protocol definition, development tools    4                                      1
DPs, retrofitting existing DLs            5                                      4
SPs, new services                         1                                      11
Socio-Economic-Political Issues           0                                      6

Shift of Topics

• From the protocol itself, supporting & debugging tools and how to retrofit (existing) DLs…

• …to building (new) services that use the OAI-PMH as a core technology and reporting on their impact to the institution/community

NTRS

• http://ntrs.nasa.gov/

• metadata harvesting replacement for http://techreports.larc.nasa.gov/cgi-bin/NTRS
– previous NTRS was based on distributed searching
– hierarchical harvesting

• (nigh) publicly available

Arc

• http://arc.cs.odu.edu/

• harvests all known archives

• first end-user service provider

• source available through SourceForge

• hierarchical harvesting

NCSTRL

• http://www.ncstrl.org/

• metadata harvesting replacement for Dienst-based NCSTRL

• based on Arc

• computer science metadata

Archon

• http://archon.cs.odu.edu/

• physics metadata

• based on Arc

• features:
– citation indexing
– equation-based searching

Torii

• http://torii.sissa.it/

• physics metadata

• features
– personalization
– recommendations
– WAP access

iCite

• http://icite.sissa.it/

• physics metadata

• features
– citation based access to arXiv metadata

my.OAI

• http://www.myoai.com/

• covers all registered metadata

• features
– result sets
– personalization
– many other advanced features

Cyclades

• http://www.ercim.org/cyclades

• scientific metadata

• features
– personalization
– recommendations
– collaboration

• status?

citebase

• http://citebase.eprints.org/

• arXiv metadata

• citation based indexing, reporting

OAIster

• http://oaister.umdl.umich.edu/

• harvests all known archives

Public Knowledge Project

• http://www.pkp.ubc.ca/harvester/

• domain-specific filtering of harvested metadata (?)

Perseus

• http://www.perseus.tufts.edu/

• they claim to harvest all DPs, but only humanities related DPs appear in the pull down menu

Service Providers

• It is clear that SPs are proliferating, despite (because of?) the inherent bias toward DPs in the protocol
– easy to be a DP -> many DPs -> SPs eventually emerge
– hard to be a DP -> SPs starve
– currently 5x more DPs than SPs

• SPs are beginning to offer increasingly sophisticated services
– competitive market originally envisioned for SPs is emerging

Why The OAI-PMH is NOT Important

• Users don’t care

• OAI-PMH is middleware
– if done right, the uninterested user should never have to know

[Figure: “OAI Inside”]

• Using the OAI-PMH does not ensure a good SP

• OAI-PMH is (or is becoming) HTTP for DLs
– few people get excited about HTTP now

• HTTP & OAI-PMH are core technologies whose presence is now assumed

Other Uses For the OAI-PMH

• Assumptions:
– Traditional DLs / SPs will continue on their present path of increasing sophistication
  • citation indexing, search result visualization, personalization, recommendations, subject-based filtering, etc.
– growth rates remain the same (5x as many DPs as SPs)

• Premise: OAI-PMH is applicable to any scenario that needs to update / synchronize distributed state
– Future opportunities are possible by creatively interpreting the OAI-PMH data model

OAI-PMH Data Model

[Figure: a resource (David); an item holding all available metadata about David; records in Dublin Core metadata, MARC metadata, and SPECTRUM metadata]

item = identifier

record = identifier + metadata format + datestamp

set-membership is item-level property

Typical Values

• repository
– collection of publications

• resource
– scholarly publication

• item
– all metadata (DC + MARC)

• record
– a single metadata format

• datestamp
– last update / addition of a record

• metadata format
– bibliographic metadata format

• set
– originating institution or subject categories
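The relationships in the data model (item = identifier; record = identifier + metadata format + datestamp; set membership at the item level) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, field names, the identifier `oai:example.org:david`, and the placeholder metadata strings are hypothetical and do not come from the OAI-PMH specification or from any implementation mentioned in these slides.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class Record:
    """One metadata format's view of an item."""
    identifier: str        # shared with the owning item
    metadata_prefix: str   # e.g. "oai_dc" or "marc"
    datestamp: str         # last update / addition of this record
    metadata: str          # serialized metadata (placeholder text here)


@dataclass
class Item:
    """All metadata about one resource, addressed by a single identifier."""
    identifier: str
    sets: Set[str] = field(default_factory=set)              # set membership is item-level
    records: Dict[str, Record] = field(default_factory=dict)  # keyed by metadata format

    def add_record(self, metadata_prefix: str, datestamp: str, metadata: str) -> Record:
        # A record is fully determined by identifier + metadata format + datestamp.
        record = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = record
        return record


# Hypothetical item with two records in different metadata formats.
item = Item("oai:example.org:david", sets={"sculpture"})
item.add_record("oai_dc", "2003-03-19", "<dc>...</dc>")
item.add_record("marc", "2003-03-20", "<marc>...</marc>")
```

Keeping `sets` on the item rather than on each record mirrors the slide's point that set membership is an item-level property.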

Interesting Services

• DP9
– gateway to expose repository contents in HTML suitable for web crawlers

• Celestial
– OAI “cache”, also 1.1 -> 2.0 converter

• Static (mini-) repositories
– XML files, based on OLAC work

• OpenURL metadata format registries
– record = metadata format

DP9 Architecture

see Liu et al., JCDL 2002; http://dlib.cs.odu.edu/dp9

Slide from Liu

DP9 Formatting

• Format of URLs
– http://arc.cs.odu.edu:8080/dp9/getrecord.jsp?identifier=oai:NACA:1917:naca-report-10&prefix=oai_dc
– http://arc.cs.odu.edu:8080/dp9/getrecord/oai_dc/oai:NACA:1917:naca-report-10

• HTML Meta tags
– Some crawlers (such as Inktomi) use the HTML meta tags to index Web pages; DP9 also maps Dublin Core metadata to corresponding HTML meta tags.
– For pages that are designed exclusively for robot navigation, a noindex robots meta tag is used
– The X-FORWARDED-FOR header is used to distinguish between different users coming in via a proxy

Slide from Liu
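A rough sketch of the Dublin Core to HTML meta tag mapping described above. The `DC.<element>` naming follows the common convention for embedding Dublin Core in HTML; whether DP9 emits exactly these names is an assumption, and the function name `dc_to_meta_tags` and its `robots_only` parameter are hypothetical.

```python
from html import escape
from typing import Dict, List


def dc_to_meta_tags(dc: Dict[str, List[str]], robots_only: bool = False) -> str:
    """Render Dublin Core elements as HTML meta tags, one tag per value."""
    lines = []
    for element, values in dc.items():
        for value in values:
            # Escape quotes and angle brackets so the attribute stays well-formed.
            name = escape("DC." + element, quote=True)
            content = escape(value, quote=True)
            lines.append(f'<meta name="{name}" content="{content}">')
    if robots_only:
        # Pages meant only for robot navigation ask crawlers not to index them.
        lines.append('<meta name="robots" content="noindex">')
    return "\n".join(lines)


tags = dc_to_meta_tags({"title": ['Report "10"'], "creator": ["NACA"]}, robots_only=True)
```

A gateway would place the returned string inside the `<head>` of the generated record page.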

Celestial

• Developed by Brody @ Southampton
– http://celestial.eprints.org/
– designed to complement DP9
– see Liu, Brody, et al., D-Lib Magazine 8(11)

• Where DP9 is a non-caching proxy, Celestial caches the metadata records
– can off-load work from individual archives, higher availability
– can harvest 1.1, 2.0; exports in 2.0

“Static” Repositories

• Premise: a repository does not wish to have an executing program on its site, so it has a “static” XML file with some of the OAI-PMH responses in place
– Design still being discussed
  • accessed through a proxy
  • could be a low functionality node, or the XML file could be produced by a process and moved outside a firewall

• Based on OLAC work by Bird & Simons
– http://www.language-archives.org/

Original Kepler Framework

• Support "personal data providers" or "archivelets"

• An archivelet is a self-contained, self-installing software package that easily allows a researcher to create and maintain a small, OAI-PMH-compliant archive.

• The general public has seamless access to the totality of all such published materials.

Enhanced Kepler Framework (EKF)

• Improve scalability and service reliability through the concept of buddy nodes and SuperNodes.

• Extend OAI-PMH with a push model and a hybrid push/pull model.

• Rapid discovery of content as soon as it is published.

• Works with firewalls and network address translation (NAT) proxies.

• Support community-based installation and integration.

Motivation behind Kepler

• The success of Peer-to-Peer (P2P) networks.

• The vision of author self-archiving.

• Efficient repository synchronization technology defined by OAI-PMH.

Peer to Peer Network

• File sharing P2P networks such as Napster, Gnutella, Freenet.

• LOCKSS (Lots Of Copies Keep Stuff Safe) provides long-term preservation of scientific journals.

• Recent arrival of FastTrack and openFT:
– A 2-tier system: SuperNodes and Nodes to solve the scalability problem.
– Kazaa (based on FastTrack technology) claims 20M downloads and scales well.

Author Self-Archiving

• Subject based: arXiv.org is a very successful subject-based self-archiving service. Since its inception in 1991, nearly 200K documents have been submitted.

• Institution based: Eprints software from Southampton.

• Personal: personal homepages (indexed by ResearchIndex) and Kepler (indexed by OAI-PMH-compliant services such as Arc).

[Figures: Original Kepler Framework]

Problems of the Original Kepler Framework

• Centralized Registration Server (LDAP): Increases the complexity of installing the Kepler server side software.

• Open Protocol: The archivelet uses a non-standard protocol for registration, thus inhibiting the development of third party applications.

• Security and NAT (Network Address Translation): In many instances, an archivelet is behind a firewall or NAT proxy, which makes it difficult for the service provider to harvest the archivelet.

Problems of the Original Kepler Framework (continued)

• Availability: Archivelets are extremely unstable, and some use dynamic IP addresses.

• Freshness: There is a large number of archivelets with sparse changes, which does not fit well with OAI-PMH’s “poll”-based mode.

• Full text vs. metadata: As an archivelet is not up all the time, it is desirable to harvest full-text documents as well, to improve the availability of full text to end-users.

Improvement through EKF

• “Push” and hybrid “push/pull” models to address the scalability, security and freshness problems.

• SuperNodes and buddy nodes to improve scalability and server availability.

• The protocol in EKF is open, and we hope it will inspire third-party development of Kepler tools.

EKF-Architecture

EKF- Push/Pull model

• Pull: retrieval without prior coordination (e.g., as used by current robots and OAI-PMH)

• Hybrid Push/Pull: retrieval after notification

• Push: notification followed by a provider push

EKF-Push/Pull Model

EKF- Why extend OAI-PMH?

• "Update Overhead" problem.
– Frequent crawling has to be done to synchronize the data providers and service providers.
– It is inefficient if the data providers seldom change during a harvest interval.
– On the other hand, without frequent crawling, service providers may become inconsistent with data providers.
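To make the overhead concrete, here is a minimal sketch of the request a pull-based harvester issues on every polling cycle. The `ListRecords` verb and its `metadataPrefix` and `from` arguments are part of OAI-PMH; the base URL, the function name `list_records_url`, and the dates are illustrative assumptions.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request, optionally limited by datestamp."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed on or after this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical archivelet endpoint polled since the last successful harvest.
url = list_records_url("http://example.org/oai", from_date="2003-03-01")
```

Even with the `from` argument, the harvester must still send this request to every data provider on every cycle just to learn that nothing changed, which is the inefficiency the bullets describe.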

Extension of OAI-PMH on the Service Provider Side

• The AddFriend and Notify verbs support the hybrid push/pull model.
– The AddFriend verb informs the service provider of the existence of a data provider.
– The Notify verb informs the service provider that a data provider is up/down or that new data is available.

• A PushMetadata verb is added to support the push model.

Design and Implementation

• Loose Name Space Management
– Use email address to uniquely identify archivelet.
– Avoid the effort of maintaining a global namespace.

• Sample
– oai:kepler.cs.odu.edu:[email protected]/doc1
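A small sketch of how such an identifier might be composed from an author's email address. The layout `oai:<authority>:<email>/<document>` is read off the sample above; the function name `archivelet_identifier` and the example email address are hypothetical.

```python
def archivelet_identifier(email: str, document: str,
                          authority: str = "kepler.cs.odu.edu") -> str:
    """Compose an item identifier scoped by the archivelet owner's email address."""
    if "@" not in email:
        # The email address is what keeps identifiers unique without a global registry.
        raise ValueError("an email address is required to scope the identifier")
    return f"oai:{authority}:{email}/{document}"


identifier = archivelet_identifier("author@example.org", "doc1")
```

Because each author already owns a unique email address, no central service has to hand out or track names.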

Archivelet

• Components
– File-system-based, OAI-PMH-compliant repository.
– Publication tool.
– A simple extended HTTP server which supports the OAI-PMH protocol and push/pull model.

• It might act as a SuperNode and BuddyNode at the same time.

SuperNode

• A SuperNode has all the functionalities of an archivelet.

• The SuperNode collects all documents and metadata from archivelets in its friends list, and builds value-added services over these harvested data.

• SuperNode is typically deployed at an institution with a high quality network.

Protocol Syntax

• Add a Friend: request to be added as a friend
– ?verb=AddFriend&id=&baseURL=

• Notify: used for major events of an archivelet, including startup, shutdown and document update.
– ?verb=Notify&event=[start/stop/update]&id=&baseURL=

• PushMetadata:
– ?verb=PushMetadata&contents=
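The three request forms above can be assembled with ordinary query-string encoding. The verb and parameter names (`AddFriend`, `Notify`, `PushMetadata`, `id`, `baseURL`, `event`, `contents`) are taken from the slide; the helper names, the service provider URL, and the sample values are assumptions made for illustration, and no claim is made about how Kepler itself builds these requests.

```python
from urllib.parse import urlencode

# Event names listed on the slide for the Notify verb.
EVENTS = ("start", "stop", "update")


def add_friend_url(sp_url: str, archivelet_id: str, base_url: str) -> str:
    """Ask a service provider to add this archivelet to its friends list."""
    return sp_url + "?" + urlencode(
        {"verb": "AddFriend", "id": archivelet_id, "baseURL": base_url})


def notify_url(sp_url: str, event: str, archivelet_id: str, base_url: str) -> str:
    """Tell a service provider that the archivelet started, stopped, or has new data."""
    if event not in EVENTS:
        raise ValueError(f"event must be one of {EVENTS}")
    return sp_url + "?" + urlencode(
        {"verb": "Notify", "event": event, "id": archivelet_id, "baseURL": base_url})


def push_metadata_url(sp_url: str, contents: str) -> str:
    """Push metadata directly to the service provider (push model)."""
    return sp_url + "?" + urlencode({"verb": "PushMetadata", "contents": contents})


# Hypothetical SuperNode endpoint and archivelet address.
notify = notify_url("http://sp.example.org/kepler", "update",
                    "author@example.org", "http://10.0.0.5:8080/oai")
```

In the hybrid model the service provider would answer a `Notify` with an ordinary pull harvest; in the pure push model the archivelet sends `PushMetadata` itself, which also works when it sits behind a firewall or NAT proxy.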

Optional Implementation

• Features outside of the kernel protocol that may operate as a “hook” to attract more use of Kepler.
– Cache full-text documents.
– Query service in archivelet.
– Security and spoofing issues

Conclusions

• Protocol / transport gateways
– Dienst <-> OAI
  • DOG, http://www.cs.odu.edu/~tharriso/DOG/
– Z39.50
  • ZMARCO (UIUC)
– SOAP
  • prototypes @ VT (Suleman) & ODU (Zubair)

OAI-PMH Will Have Arrived When:

• general web robots issue OAI-PMH verbs
– …DP9 will no longer be needed
– requires shift in “control”: harvester or repository?

• mod_oai is developed and is included in the default Apache configuration

• OAI-PMH fades into the background
– similar to TCP/IP, HTTP, XML, etc.
– next year’s workshop is on OpenURL

Conclusions

• DPs continue to proliferate
– and spawn SPs!

• SPs are / are becoming a competitive market
– e.g., at least 10 different interfaces to arXiv metadata
– growing sophistication of services
– differentiation of SPs will be on features that have little to nothing to do with OAI-PMH