OAI Protocol for Metadata Harvesting Tim Brody Intelligence, Agents, Multimedia Group University of...

Post on 27-Mar-2015


OAI Protocol for Metadata Harvesting

Tim Brody
Intelligence, Agents, Multimedia Group
University of Southampton
OpCit – http://opcit.eprints.org/

www.ecs.soton.ac.uk

BCS Metadata Meeting, London 29th May 2002

(Many slides borrowed from Michael L. Nelson)

OAI 2.0

• Public, stable version not released yet … (but very close)
  – Beta released mid-May
  – Public release scheduled: 1st June

• 2.0 implementations in the pipeline
  – British Library, Cornell Univ, Ex Libris, my.OAI, Humboldt Univ, InQuirion Pty Ltd, Library of Congress, NASA, OCLC, Old Dominion Univ, U. of Illinois, U. of Southampton, UCLA, Johns Hopkins U., Indiana U., NYU, UKOLN, Virginia Tech

Open Archives Initiative

The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!)

Archive defined as a “collection of stuff”, not the archivist’s definition of “archive”. “Repository” used in most OAI documents.

OAI is happening at break-neck speed...

Metadata Harvesting

• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
  – Resources remain at remote repositories

[Figure: a user's search for “cfd applications” runs against a local copy of metadata that was harvested offline from several remote nodes. Each node is independently maintained. All searching, browsing, etc. is performed on the local metadata copy; individual nodes can still support direct user interaction.]

Metadata Harvesting

• Repositories (archives etc.) = low implementation cost

• Services = higher implementation cost

• Similar to web search model
  – DP9 gateway makes it exactly the same

          | Santa Fe convention | OAI-PMH v.1.0/1.1       | OAI-PMH v.2.0
about     | eprints             | document-like objects   | resources
metadata  | OAMS                | unqualified Dublin Core | unqualified Dublin Core
transport | HTTP                | HTTP                    | HTTP
responses | XML                 | XML                     | XML
requests  | HTTP GET/POST       | HTTP GET/POST           | HTTP GET/POST
verbs     | Dienst              | OAI-PMH                 | OAI-PMH
nature    | experimental        | experimental            | stable
model     | metadata harvesting | metadata harvesting     | metadata harvesting

OAI-PMH v.2.0 [06/2002]

• Goal: recurrent exchange of metadata about resources between systems

• Input:
  – OAI-PMH v.1.0 [01/01 – 09/02]
  – feedback on OAI-implementers
  – deliberations by OAI-tech [09/01 -]
  – alpha test group of OAI-PMH v.2.0 [03/02 -]

• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• metadata about resources
• autonomous protocol
• distinction between protocol and periphery
• community-specific extensions
• HTTP based
• XML responses
• unqualified Dublin Core
• stable (1.0 characterized as experimental)

OAI-PMH v.2.0 [06/2002]

OAI Data Model:

Resources / Items / Records

[Figure: a resource (David); an item holding all available metadata about David; and that item's records in the Dublin Core, MARC, and SPECTRUM metadata formats.]

item = identifier

record = identifier + metadata format + datestamp
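The two definitions above can be sketched as plain data classes. This is only an illustration of the item/record relationship, not code from any OAI toolkit: the class names, the `add` helper, and the "marc" prefix are assumptions made for the example ("oai_dc" is the prefix OAI-PMH uses for unqualified Dublin Core).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp."""
    identifier: str        # identifier of the item the record belongs to
    metadata_prefix: str   # metadata format, e.g. "oai_dc"
    datestamp: str         # date of last change, e.g. "2001-12-14"
    metadata: str = ""     # serialized metadata (left empty in this sketch)


@dataclass
class Item:
    """item = identifier; it can hold one record per metadata format."""
    identifier: str
    records: dict = field(default_factory=dict)

    def add(self, metadata_prefix: str, datestamp: str, metadata: str = "") -> Record:
        # Hypothetical helper: store the record under its metadata format.
        record = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = record
        return record


# One item, exposed in two metadata formats.
item = Item("oai:arXiv:cs/0112017")
item.add("oai_dc", "2001-12-14")
item.add("marc", "2001-12-14")   # "marc" is an illustrative prefix
```

Every record shares the identifier of its item; only the metadata format (and possibly the datestamp) differs between records.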

Overview of OAI Verbs

Verb                | Function
Identify            | description of archive
ListMetadataFormats | metadata formats supported by archive
ListSets            | sets defined by archive
ListIdentifiers     | OAI unique ids contained in archive
ListRecords         | listing of N records
GetRecord           | listing of a single record

archival metadata

harvesting verbs

most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
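Since requests are plain HTTP GET/POST, a verb and its arguments can be combined into a request URL. The following is a minimal sketch; the function name `build_request` is an assumption for illustration, and the base URL and identifier are the ones that appear in the sample responses in these slides.

```python
from urllib.parse import urlencode

# The six verbs listed in the overview table.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_request(base_url, verb, arguments=None):
    """Return a GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"{verb} is not a valid OAI-PMH verb")
    params = {"verb": verb}
    # Arguments are passed as a dict because "from" is a Python keyword.
    params.update(arguments or {})
    return f"{base_url}?{urlencode(params)}"


url = build_request(
    "http://arXiv.org/oai2",
    "GetRecord",
    {"identifier": "oai:arXiv:cs/0112017", "metadataPrefix": "oai_dc"},
)
print(url)
```

The same arguments could be sent as the body of an HTTP POST instead of a query string.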

Identify

1.1
• Arguments
  – none
• Errors
  – none

2.0
• Arguments
  – none
• Errors
  – badArgument

ListMetadataFormats

1.1
• Arguments
  – identifier (OPTIONAL)
• Errors
  – id does not exist

2.0
• Arguments
  – identifier (OPTIONAL)
• Errors
  – badArgument
  – noMetadataFormats
  – idDoesNotExist

ListSets

1.1
• Arguments
  – resumptionToken (EXCLUSIVE)
• Errors
  – no set hierarchy

2.0
• Arguments
  – resumptionToken (EXCLUSIVE)
• Errors
  – badArgument
  – badResumptionToken
  – noSetHierarchy

ListIdentifiers

1.1
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
• Errors
  – no records match

2.0
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
  – metadataPrefix (REQUIRED)
• Errors
  – badArgument
  – cannotDisseminateFormat
  – badResumptionToken
  – noSetHierarchy
  – noRecordsMatch

ListRecords

1.1
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
  – metadataPrefix (REQUIRED)
• Errors
  – no records match
  – metadata format cannot be disseminated

2.0
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
  – metadataPrefix (REQUIRED)
• Errors
  – noRecordsMatch
  – cannotDisseminateFormat
  – badResumptionToken
  – noSetHierarchy
  – badArgument
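The OPTIONAL / EXCLUSIVE / REQUIRED labels on the 2.0 ListRecords arguments imply a small set of checks on the repository side. The sketch below is one reading of those labels, assuming that an EXCLUSIVE resumptionToken must be the only argument and that metadataPrefix is required otherwise; the function name is hypothetical.

```python
# Arguments listed for ListRecords in the 2.0 column.
ALLOWED = {"from", "until", "set", "resumptionToken", "metadataPrefix"}


def check_list_records_args(args):
    """Return None if the arguments are acceptable, else "badArgument"."""
    if set(args) - ALLOWED:
        return "badArgument"          # unknown argument
    if "resumptionToken" in args:
        # EXCLUSIVE: no other argument may accompany the token.
        return None if len(args) == 1 else "badArgument"
    if "metadataPrefix" not in args:
        return "badArgument"          # REQUIRED argument is missing
    return None


print(check_list_records_args({"metadataPrefix": "oai_dc", "from": "2002-01-01"}))
print(check_list_records_args({"resumptionToken": "abc", "set": "cs"}))
```

The other error codes (cannotDisseminateFormat, noSetHierarchy, noRecordsMatch, badResumptionToken) depend on repository contents and are not covered by this syntactic check.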

GetRecord

1.1
• Arguments
  – identifier (REQUIRED)
  – metadataPrefix (REQUIRED)
• Errors
  – id does not exist
  – metadata format cannot be disseminated

2.0
• Arguments
  – identifier (REQUIRED)
  – metadataPrefix (REQUIRED)
• Errors
  – badArgument
  – cannotDisseminateFormat
  – idDoesNotExist

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord"… …>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        …..
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>

response no errors

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>

response with error
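A harvester has to tell these two kinds of response apart. The sketch below parses a response with the Python standard library and collects any error elements and record identifiers. The sample strings are trimmed versions of the responses above, and the function names are assumptions for illustration. Real responses carry XML namespaces, which the slides omit, so tags are compared by local name only.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an optional "{namespace}" prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def parse_response(xml_text):
    """Return (errors, identifiers) found in one OAI-PMH response."""
    root = ET.fromstring(xml_text)
    # Error elements are direct children of the root element.
    errors = [
        (el.get("code"), (el.text or "").strip())
        for el in root
        if local(el.tag) == "error"
    ]
    # Identifiers sit inside record headers, at any depth.
    identifiers = [
        (el.text or "").strip()
        for el in root.iter()
        if local(el.tag) == "identifier"
    ]
    return errors, identifiers


OK = """<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord">http://arXiv.org/oai2</request>
  <GetRecord><record><header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
  </header></record></GetRecord>
</OAI-PMH>"""

ERR = """<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

print(parse_response(OK))
print(parse_response(ERR))
```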

• Idempotency of resumptionToken: return same incomplete list when the resumptionToken is re-issued
  – while no changes occur in the repository: strict
  – while changes occur in the repository: all items with unchanged datestamp
• new attributes for the resumptionToken:
  – expirationDate
  – completeListSize
  – cursor

resumptionToken Flow-Control
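On the harvester side, flow control amounts to a loop that re-issues the request with the returned resumptionToken until the repository sends an empty token. The sketch below assumes a caller-supplied `fetch` function that takes the request arguments and returns the response XML; `fetch`, `fake_fetch`, the token value "page2", and the `oai:example:` identifiers are all made up so the example runs without a network.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an optional "{namespace}" prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def harvest_identifiers(fetch, metadata_prefix):
    """Yield identifiers from every incomplete list of a ListIdentifiers harvest."""
    arguments = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(arguments))
        token = None
        for el in root.iter():
            name = local(el.tag)
            if name == "identifier":
                yield (el.text or "").strip()
            elif name == "resumptionToken":
                token = (el.text or "").strip()
        if not token:
            return  # no token, or an empty one: the list is complete
        # resumptionToken is EXCLUSIVE: send it without the other arguments.
        arguments = {"verb": "ListIdentifiers", "resumptionToken": token}


# Two canned responses standing in for a repository (illustrative data).
PAGES = {
    None: '<OAI-PMH><ListIdentifiers>'
          '<header><identifier>oai:example:1</identifier></header>'
          '<resumptionToken completeListSize="2" cursor="0">page2</resumptionToken>'
          '</ListIdentifiers></OAI-PMH>',
    "page2": '<OAI-PMH><ListIdentifiers>'
             '<header><identifier>oai:example:2</identifier></header>'
             '<resumptionToken completeListSize="2" cursor="1"/>'
             '</ListIdentifiers></OAI-PMH>',
}


def fake_fetch(arguments):
    return PAGES[arguments.get("resumptionToken")]


ids = list(harvest_identifiers(fake_fetch, "oai_dc"))
print(ids)
```

A production harvester would also check for error elements such as badResumptionToken and honour expirationDate before re-issuing a token.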

• evolution

• from talking about OAI-PMH

• to talking about projects that use OAI-PMH

• to talking about projects and failing to mention they use OAI-PMH

• => OAI-PMH becomes part of the infrastructure

Adoption

• 49 registered repositories [11/2001]

• 65 registered repositories [03/2002]

• 77 registered repositories [05/2002]

• 5+ million records

• many unregistered repositories

• private implementations (e.g. RDN)

Data Providers (a.k.a. repositories)

• Arc: cross-searching of registered repositories [ http://arc.cs.odu.edu ]

• CiteBase: research literature search + citation ranking [ http://citebase.eprints.org ]

• OLAC: cross-searching of Language Archive Community repositories [ http://www.language-archives.org/index.html ]

Service Providers

• Scirus scientific search engine [Elsevier] [ http://www.scirus.com ]

• my.OAI: user-tailorable cross-searching of registered repositories [FS Consulting, Inc.] [ http://www.myoai.com ]

• Growing interest from web search engines

Service Providers

• Repository Explorer: interactive exploration of repositories [Virginia Tech] [ http://www.purl.org/NET/oai_explorer ]

• eprints.org: generic OAI-PMH compliant repository software [U of Southampton] [ http://www.eprints.org ]

• ALCME repository and harvester software [OCLC] [ http://alcme.oclc.org/index.html ]

• APIs, other tools @ www.openarchives.org

OAI-PMH tools

http://www.openarchives.org/

openarchives@openarchives.org