Herbert Van de Sompel & Carl Lagoze: the OAI Protocol for Metadata Harvesting, an update

Page 1

herbert van de sompel & carl lagoze

Herbert Van de Sompel Los Alamos National Laboratory – Research Library

Carl Lagoze Cornell University – Computer Science

the OAI Protocol for Metadata Harvesting

an update

Page 2

origins & evolution of OAI-PMH

process leading to OAI-PMH v.2.0

what’s new in OAI-PMH v.2.0?

what’s next?

Page 3

evolution towards OAI-PMH v.2.0

Santa Fe Convention [02/2000]

OAI-PMH 1.0 [01/2001]

OAI-PMH 2.0 [06/2002]

Page 4

            Santa Fe convention   OAI-PMH v.1.0/1.1         OAI-PMH v.2.0
about       eprints               document-like objects     resources
metadata    OAMS                  unqualified Dublin Core   unqualified Dublin Core
transport   HTTP                  HTTP                      HTTP
responses   XML                   XML                       XML
requests    HTTP GET/POST         HTTP GET/POST             HTTP GET/POST
verbs       Dienst                OAI-PMH                   OAI-PMH
nature      experimental          experimental              stable
model       metadata harvesting   metadata harvesting       metadata harvesting

Page 5

Santa Fe Convention [02/2000]

• goal: optimize discovery of e-prints

• input:

    • the UPS prototype

    • RePEc data provider / service provider model

    • Dienst protocol

    • deliberations at Santa Fe meeting [10/99]

Page 6

Santa Fe Convention [02/2000]

• low-barrier interoperability specification

• metadata harvesting model: data provider / service provider

• focus on eprints (e.g. OAMS format)

• Dienst subset

• HTTP based

• XML responses

• experimental

Page 7

OAI-PMH v.1.0 [01/2001]

• goal: optimize discovery of document-like objects

• input:

    • SFC

    • DLF meetings on metadata harvesting

    • deliberations at Cornell meeting [09/00]

    • alpha test group of OAI-PMH v.1.0

Page 8

OAI-PMH v.1.0 [01/2001]

• low-barrier interoperability specification

• metadata harvesting model: data provider / service provider

• focus on document-like objects

• autonomous protocol

• HTTP based

• XML responses

• unqualified Dublin Core

• experimental: 12-18 months

Page 9

OAI-PMH v.2.0 [06/2002]

• goal: recurrent exchange of metadata about resources between systems

• input:

    • OAI-PMH v.1.0

    • feedback on OAI-implementers

    • deliberations by OAI-tech [09/01 -]

    • alpha test group of OAI-PMH v.2.0 [03/02 -]

Page 10

OAI-PMH v.2.0 [06/2002]

• low-barrier interoperability specification

• metadata harvesting model: data provider / service provider

• metadata about resources

• autonomous protocol

• HTTP based

• XML responses

• unqualified Dublin Core

• stable

Page 11

process leading to OAI-PMH v.2.0

creation of OAI-tech

pre-alpha phase

alpha phase

beta phase

Page 12

creation of OAI-tech [06/01]

• created for 1 year period

• charge:

    • review functionality and nature of OAI-PMH v.1.0

    • investigate extensions

    • release stable version of OAI-PMH by 05/02

    • determine need for infrastructure to support broad adoption of the protocol

• communication: listserv, SourceForge, conference calls

Page 13

OAI-tech

US representatives

Thomas Krichel (Long Island U) - Jeff Young (OCLC) - Tim Cole (U of Illinois at Urbana-Champaign) - Hussein Suleman (Virginia Tech) - Simeon Warner (Cornell U) - Michael Nelson (NASA) - Caroline Arms (LoC) - Muhammad Zubair (Old Dominion U) - Steven Bird (U Penn.)

European representatives

Andy Powell (Bath U. & UKOLN) - Mogens Sandfaer (DTV) - Thomas Baron (CERN) - Les Carr (U of Southampton)

Page 14

pre-alpha phase [09/01 – 02/02]

• review process by OAI-tech:

    • identification of issues

    • conference call to filter/combine issues

    • white paper per issue

    • on-line discussion per white paper

    • proposal for resolution of issue by OAI-exec

    • discussion of proposal & closure of issue

    • conference call to resolve open issues

Page 15

pre-alpha phase [02/02]

• creation of revised protocol document

• in-person meeting Lagoze - Van de Sompel - Nelson - Warner

• autonomous decisions

• internal vetting of protocol document

Page 16

alpha phase [02/02 – 05/02]

• alpha-1 release to OAI-tech March 1st 2002

• OAI-tech extended with alpha testers

• discussions/implementations by OAI-tech

• ongoing revision of protocol document

Page 17

OAI-PMH 2.0 alpha testers (1/2)

• The British Library

• Cornell U. -- NSDL project & e-print arXiv

• Ex Libris

• FS Consulting Inc -- harvester for my.OAI

• Humboldt-Universität zu Berlin

• InQuirion Pty Ltd, RMIT University

• Library of Congress

• NASA

• OCLC

Page 18

OAI-PMH 2.0 alpha testers (2/2)

• Old Dominion U. -- ARC, DP9

• U. of Illinois at Urbana-Champaign

• U. of Southampton -- OAIA, CiteBase, eprints.org

• UCLA, Johns Hopkins U., Indiana U., NYU -- sheet music collection

• UKOLN, U. of Bath -- RDN

• Virginia Tech -- repository explorer

Page 19

beta phase [05/02]

• beta release on May 1st 2002 to:

    • registered data providers and service providers

    • interested parties

• fine tuning of protocol document

• preparation for the release of 2.0 conformant tools by alpha testers

Page 20

what’s new in OAI-PMH v.2.0?

corrections

new functionality

general changes to improve solidity of protocol

quick recap

Page 21

[figure: the harvester (service provider) sends Requests to the repository (data provider), which returns Replies; label: 6 OAI-PMH]

Page 22

Supporting protocol requests:

    • Identify

    • ListMetadataFormats

    • ListSets

Harvesting protocol requests:

    • ListRecords

    • ListIdentifiers

    • GetRecord

[figure: harvester (service provider) and repository (data provider)]
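
A minimal Python sketch of how a harvester might issue these six requests over HTTP GET. The helper name `build_request` is hypothetical, and the arXiv base URL is borrowed from the sample responses elsewhere in these slides; only the verb names come from the protocol.

```python
from urllib.parse import urlencode

# The six OAI-PMH v.2.0 verbs named on the slide.
SUPPORTING_VERBS = ("Identify", "ListMetadataFormats", "ListSets")
HARVESTING_VERBS = ("ListRecords", "ListIdentifiers", "GetRecord")


def build_request(base_url: str, verb: str, **arguments: str) -> str:
    """Return an HTTP GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in SUPPORTING_VERBS + HARVESTING_VERBS:
        raise ValueError(f"{verb} is not an OAI-PMH verb")
    # urlencode percent-escapes reserved characters such as ':' and '/'.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


print(build_request("http://arXiv.org/oai2", "Identify"))
print(build_request("http://arXiv.org/oai2", "GetRecord",
                    identifier="oai:arXiv:cs/0112017",
                    metadataPrefix="oai_dc"))
```

The sketch only builds URLs; sending them and handling the XML reply is left to the caller.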

Page 23

Records

• Datestamp

• Identifier

• Set

[figure: harvester (service provider) and repository (data provider)]

Page 24

general changes

• clear distinction between protocol and periphery

• fixed protocol document

• extensible implementation guidelines:

    • e.g. sample metadata formats, description containers, about containers

    • allows for OAI guidelines and community guidelines

Page 25

general changes

• clear separation of OAI-PMH and HTTP

• OAI-PMH error handling

    • all OK at HTTP level? => 200 OK

    • something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)

Page 26

general changes

• notion of item has become prominent

• resource / item / record

• metadata can be disseminated from item

• item == identifier

• record == identifier, datestamp, metadataPrefix
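
One way to picture the item / record distinction is the Python sketch below: an item is named by its identifier alone, while each record disseminated from it is pinned down by identifier, datestamp and metadataPrefix. The class and function names are hypothetical, and the datestamps and the second metadata format are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A record is pinned down by identifier, datestamp and metadataPrefix."""
    identifier: str
    datestamp: str
    metadata_prefix: str


def records_of_item(identifier: str, datestamps: dict[str, str]) -> list[Record]:
    """List the records one item can disseminate.

    `datestamps` maps each metadataPrefix the item supports to the
    datestamp of that dissemination; it is an in-memory stand-in for a
    repository, not part of the protocol.
    """
    return [Record(identifier, stamp, prefix)
            for prefix, stamp in sorted(datestamps.items())]


# One item, two metadata formats, hence two records.
records = records_of_item("oai:arXiv:cs/0112017",
                          {"oai_dc": "2001-12-14", "olac": "2001-12-20"})
for record in records:
    print(record)
```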

Page 27

general changes

• better definitions of harvester, repository, item, unique identifier, record, datestamp, set

• oai_dc schema builds on DCMI XML Schema for unqualified Dublin Core

• usage of must, must not etc. as in RFC 2119

• wording on response compression

Page 28

general changes

• all protocol responses can be validated with a single XML Schema

    • easier for data providers

    • no redundancy in type definitions

    • SOAP-ready

    • clean for error handling

Page 29

response no errors

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord"… …>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata> ….. </metadata>
    </record>
  </GetRecord>
</OAI-PMH>

Page 30

response with error

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
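
Because a protocol-level problem still arrives with HTTP 200 OK, a harvester has to look inside the XML for `error` elements. The Python sketch below shows one way to do that; the function names are hypothetical, and the sample mirrors the slide, which omits the namespace declaration a real repository would send.

```python
import xml.etree.ElementTree as ET

# Error response modelled on the slide (no namespace, as on the slide).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def find_errors(response_xml: bytes) -> list[tuple[str, str]]:
    """Return (code, message) pairs for every <error> child of the root."""
    root = ET.fromstring(response_xml)
    return [(element.get("code", ""), (element.text or "").strip())
            for element in root
            if local_name(element.tag) == "error"]


for code, message in find_errors(SAMPLE):
    print(f"{code}: {message}")
```

Matching on the local name keeps the sketch working whether or not the response declares a namespace.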

Page 31

corrections

• all dates/times are UTC, encoded in ISO8601, Z-notation

1957-03-20T20:30:00.00Z
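
A small Python sketch of producing such a UTC, Z-notation timestamp from a local time. The function name is hypothetical, and the output uses whole seconds (the YYYY-MM-DDThh:mm:ssZ form named under harvesting granularity) rather than the fractional seconds shown above.

```python
from datetime import datetime, timedelta, timezone


def to_utc_datestamp(moment: datetime) -> str:
    """Convert an aware datetime to UTC and format it in Z-notation."""
    if moment.tzinfo is None:
        # Without a time zone the UTC instant cannot be determined.
        raise ValueError("naive datetime: time zone unknown")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# 15:30 at UTC-5 is 20:30 UTC on the same day.
local = datetime(1957, 3, 20, 15, 30, tzinfo=timezone(timedelta(hours=-5)))
print(to_utc_datestamp(local))
```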

Page 32

corrections

• idempotency of resumptionToken: return same incomplete list when rT is reissued

    • while no changes occur in the repo: strict

    • while changes occur in the repo: all items with unchanged datestamp

• expirationDate attribute for rT
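
Idempotency matters to harvesters because a request that failed in transit can simply be reissued with the same resumptionToken. The Python sketch below illustrates that loop under stated assumptions: `fetch` is a caller-supplied function (hypothetical) that sends the request parameters and returns the response bytes, and the two-page repository is invented so the example runs without a network.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def fetch_with_retry(fetch: Callable[[dict], bytes], params: dict,
                     attempts: int = 3) -> bytes:
    """Reissue the identical request after a transport failure."""
    for attempt in range(attempts):
        try:
            return fetch(params)
        except OSError:
            if attempt == attempts - 1:
                raise
    raise ValueError("attempts must be at least 1")


def harvest_identifiers(fetch: Callable[[dict], bytes],
                        metadata_prefix: str) -> Iterator[str]:
    """Yield identifiers, following resumptionTokens until the list ends."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch_with_retry(fetch, params))
        token = ""
        for element in root.iter():
            name = local_name(element.tag)
            if name == "identifier":
                yield element.text
            elif name == "resumptionToken":
                token = (element.text or "").strip()
        if not token:
            return  # an absent or empty token ends the list
        # A follow-up request carries only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}


# Invented two-page repository; the first request for page 2 fails once.
PAGES = {
    None: b"<OAI-PMH><ListIdentifiers>"
          b"<header><identifier>oai:x:1</identifier></header>"
          b"<header><identifier>oai:x:2</identifier></header>"
          b"<resumptionToken>page2</resumptionToken>"
          b"</ListIdentifiers></OAI-PMH>",
    "page2": b"<OAI-PMH><ListIdentifiers>"
             b"<header><identifier>oai:x:3</identifier></header>"
             b"<resumptionToken/>"
             b"</ListIdentifiers></OAI-PMH>",
}
failures = {"page2": 1}


def fake_fetch(params: dict) -> bytes:
    token = params.get("resumptionToken")
    if failures.get(token, 0) > 0:
        failures[token] -= 1
        raise OSError("simulated dropped connection")
    return PAGES[token]


print(list(harvest_identifiers(fake_fetch, "oai_dc")))
```

The sketch ignores the expirationDate attribute; a fuller harvester would restart the list request once a token has expired.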

Page 33

new functionality

• harvesting granularity

    • mandatory support of YYYY-MM-DD

    • optional support of YYYY-MM-DDThh:mm:ssZ

    • granularity of from and until must be the same
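
A Python sketch of how a harvester might check a from/until pair against these rules before sending it. The function names are hypothetical, and the patterns check only the shape of a datestamp, not whether it is a real calendar date.

```python
import re

# The two granularities named on the slide.
DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"
_PATTERNS = {
    DAY: re.compile(r"\d{4}-\d{2}-\d{2}"),
    SECONDS: re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"),
}


def granularity_of(stamp: str) -> str:
    """Return the granularity whose shape the datestamp matches."""
    for name, pattern in _PATTERNS.items():
        if pattern.fullmatch(stamp):
            return name
    raise ValueError(f"unrecognised datestamp: {stamp!r}")


def check_range(from_: str, until: str,
                repository_granularity: str = DAY) -> None:
    """Raise ValueError if a from/until pair breaks the rules above."""
    used = granularity_of(from_)
    if granularity_of(until) != used:
        raise ValueError("from and until must use the same granularity")
    if used == SECONDS and repository_granularity != SECONDS:
        raise ValueError("repository only supports day granularity")


check_range("2001-01-01", "2001-12-31")
check_range("2001-01-01T00:00:00Z", "2001-12-31T23:59:59Z", SECONDS)
print("both ranges accepted")
```

The `repository_granularity` value would come from the granularity element of the Identify response.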

Page 34

new functionality

• Identify more expressive

<Identify>
  <repositoryName>Library of Congress 1</repositoryName>
  <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
  <protocolVersion>2.0</protocolVersion>
  <adminEmail>[email protected]</adminEmail>
  <adminEmail>[email protected]</adminEmail>
  <deletedRecord>transient</deletedRecord>
  <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
  <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  <compression>deflate</compression>

Page 35

new functionality

• header contains set membership of item

<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata> ….. </metadata>
</record>
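
With set membership in the header, a harvester can sort records into sets without a separate request per set. A minimal Python sketch of reading such a header follows; the function name is hypothetical and, as on the slide, the sample carries no namespace declaration.

```python
import xml.etree.ElementTree as ET

# Header copied from the slide's sample record.
HEADER = """<header>
  <identifier>oai:arXiv:cs/0112017</identifier>
  <datestamp>2001-12-14</datestamp>
  <setSpec>cs</setSpec>
  <setSpec>math</setSpec>
</header>"""


def parse_header(xml_text: str) -> dict:
    """Extract identifier, datestamp and all setSpec values from a header."""
    header = ET.fromstring(xml_text)
    return {
        "identifier": header.findtext("identifier"),
        "datestamp": header.findtext("datestamp"),
        "sets": [element.text for element in header.findall("setSpec")],
    }


print(parse_header(HEADER))
```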

Page 36

new functionality

• ListIdentifiers returns headers

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="…" …>http://arXiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
    ……

Page 37

new functionality

• ListIdentifiers mandates metadataPrefix as argument

http://www.perseus.tufts.edu/cgi-bin/pdataprov?
    verb=ListIdentifiers
    &metadataPrefix=olac
    &from=2001-01-01
    &until=2001-01-01
    &set=Perseus:collection:PersInfo

Page 38

new functionality

• character set for metadataPrefix and setSpec extended to URL-safe characters

A-Z a-z 0-9 _ ! ' $ ( ) + - . *
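
A Python sketch of checking values against this character set. It uses exactly the characters listed on the slide, reading the quote character as a plain apostrophe, so it should be compared with the final protocol document before being relied on. Treating ":" as the separator between setSpec levels is an assumption drawn from examples such as Perseus:collection:PersInfo; the function names are hypothetical.

```python
import re

# One or more characters from the slide's list.
SEGMENT = re.compile(r"[A-Za-z0-9_!'$()+\-.*]+")


def is_valid_metadata_prefix(value: str) -> bool:
    """True if the whole value uses only the listed characters."""
    return SEGMENT.fullmatch(value) is not None


def is_valid_set_spec(value: str) -> bool:
    """True if every ':'-separated level is a non-empty valid segment."""
    return all(SEGMENT.fullmatch(part) for part in value.split(":"))


for candidate in ("oai_dc", "olac", "oai dc"):
    print(candidate, is_valid_metadata_prefix(candidate))
for candidate in ("Perseus:collection:PersInfo", "physic:"):
    print(candidate, is_valid_set_spec(candidate))
```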

Page 39

in the periphery

• introduction of provenance container to facilitate tracing of harvesting history

<about>
  <provenance>
    <originDescription>
      <baseURL>http://an.oa.org</baseURL>
      <identifier>oai:r1:plog/9801001</identifier>
      <datestamp>2001-08-13T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate>2001-08-15T12:01:30Z</harvestDate>
    </originDescription>
    <originDescription> … … … </originDescription>
  </provenance>
</about>

Page 40

in the periphery

• introduction of friends container to facilitate discovery of repositories

<description>
  <Friends>
    <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
    <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
    <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
    <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
  </Friends>
</description>
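
Since each repository can list the base URLs of its friends, a service provider could discover repositories by following those links from a known starting point. The Python sketch below illustrates the idea as a breadth-first walk; `friends_of` is a hypothetical caller-supplied function that would, in practice, read the friends container from a repository's Identify response, and the example graph is invented.

```python
from collections import deque
from typing import Callable, Iterable


def discover(seed: str,
             friends_of: Callable[[str], Iterable[str]]) -> list[str]:
    """Breadth-first walk over friends links, visiting each base URL once."""
    seen = [seed]
    queue = deque([seed])
    while queue:
        for friend in friends_of(queue.popleft()):
            if friend not in seen:
                seen.append(friend)
                queue.append(friend)
    return seen


# Invented friends graph standing in for real Identify responses.
FRIENDS = {
    "http://a.example/oai": ["http://b.example/oai", "http://c.example/oai"],
    "http://b.example/oai": ["http://a.example/oai"],
    "http://c.example/oai": ["http://d.example/oai"],
}

print(discover("http://a.example/oai", lambda url: FRIENDS.get(url, [])))
```

Tracking visited base URLs keeps the walk from looping when two repositories list each other.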

Page 41

in the periphery

• revision of oai-identifier

• guidelines for collection-level and set-level metadata

Page 42

future

adoption

communities

OAI-PMH

Page 43

the OAI-PMH

• release of OAI-PMH v.2.0 [06/2002]

• no backwards compatibility with v.1.0/1.1

• stable

• migration process for registered repos

• ? formal standardization ?

• ? SOAP version ~ web services framework [SOAP, WSDL, UDDI] ?

Page 44

communities

• proliferation of community-specific add-ons for:

    • collection & set level metadata

    • expressive metadata formats (e.g. qualified DC XML Schema)

    • shared set-structures

    • machine readable rights (about the metadata)

Page 45

adoption

• evolution

    • from talking about OAI-PMH

    • to talking about projects that use OAI-PMH

    • to talking about projects and failing to mention they use OAI-PMH

=> OAI-PMH becomes part of the infrastructure

Page 46

I just wanted to report what I consider an OAI success. I discovered that RLG had harvested records for two of the American Memory collections I had made available and integrated them into their Cultural Materials Initiative service without the need for a single e-mail or phone call. They reported that it was working very well for them.

[Caroline Arms, Library of Congress]

Page 47

http://www.openarchives.org

[email protected]

Page 48

indicators of adoption of OAI-PMH

tools

structural support

service providers

data providers

Page 49

data providers

• 49 registered repositories [11/2001]

• 65 registered repositories [03/2002]

• 5+ million records

• many unregistered repositories

Page 50

service providers

• Arc: cross-searching of registered repositories [Old Dominion U] [ http://arc.cs.odu.edu ]

• OLAC: cross-searching of Language Archive Community repositories [ http://www.language-archives.org/index.html ]

Page 51

service providers

• Scirus scientific search engine [Elsevier] [ http://www.scirus.com ]

• my.OAI: user-tailorable cross-searching of registered repositories [FS Consulting, Inc.] [ http://www.myoai.com ]

• growing interest from web search engines

Page 52

OAI-PMH tools

• Repository Explorer: interactive exploration of repositories [Virginia Tech] [ http://www.purl.org/NET/oai_explorer ]

• eprints.org: generic OAI-PMH compliant repository software [U of Southampton] [ http://www.eprints.org ]

• ALCME repository and harvester software [OCLC] [ http://alcme.oclc.org/index.html ]

Page 53

exploration

• Kepler [Old Dominion U]

    • your personal OAI data provider: Kepler archivelet

    • the Kepler service provider harvests from archivelets that register

    • archivelet downloadable

    • http://www.dlib.org/dlib/april01/maly/04maly.html

Page 54

exploration

• DP9 [Old Dominion U]

    • provides entry page to repositories for web-crawlers

    • provides bookmarkable URL for OAI record

    • provides resolution of OAI identifier into metadata

    • software downloadable

Page 55

structural support

• CNI & DLF support the day-to-day operation of the OAI Executive

Page 56

structural support

• Metadata Harvesting Initiative of the Mellon Foundation

• NSF funded NSDL

• UK FAIR call for proposals to support disclosure of institutional assets (papers, learning materials, etc.)

• several EC projects exploring/supporting usage of OAI-PMH: TEL, Leaf, Cyclades, OA Forum, Figaro