OAIster: Metadata Pointing to Digital Objects
Kat Hagedorn
Metadata Harvesting/DLXS Librarian
University of Michigan Libraries
February 18, 2004
background
• One-year Mellon grant project to test the feasibility of making OAI-enabled metadata for digital objects accessible to the public
• Digital Library Production Service at University of Michigan Libraries began work in December 2001
• Launched in June 2002
highlights
• Any audience
• Any subject matter
• Any format
• Freely accessible
• No dead ends
• One-stop shopping
…retrieving the “hidden web”
tool we borrowed
• University of Illinois Urbana-Champaign open-source OAI protocol harvester
• Java edition for our Unix environment
• Worked collaboratively to iron out kinks
– resumptionToken / retryAfter
– inexplicable kill
– bogus records in MySQL table
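The resumptionToken / retryAfter behaviour mentioned above can be illustrated with a minimal Python sketch of an OAI-PMH ListRecords loop. This is not the UIUC harvester's code (that tool is Java). The names `harvest` and `fetch` are hypothetical, `fetch(params)` is assumed to return a `(status, headers, body)` tuple, and `Retry-After` is assumed to be given in whole seconds.

```python
import time
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, metadata_prefix="oai_dc", sleep=time.sleep, max_retries=3):
    """Collect every <record> element by following resumptionTokens.

    `fetch(params)` is a hypothetical callable returning (status, headers, body).
    """
    records = []
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    retries = 0
    while True:
        status, headers, body = fetch(params)
        if status == 503 and retries < max_retries:
            # The repository asks us to wait, then repeat the same request.
            # Assumption: Retry-After holds a number of seconds, not a date.
            sleep(int(headers.get("Retry-After", "1")))
            retries += 1
            continue
        if status != 200:
            raise RuntimeError(f"harvest failed with HTTP status {status}")
        retries = 0
        root = ET.fromstring(body)
        records.extend(root.iter(OAI_NS + "record"))
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return records  # a missing or empty token marks the last page
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing `fetch` and `sleep` in as arguments keeps the loop testable without a live repository.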
development environment
• Digital Library Extension Service (DLXS)
• Develop open-source middleware and license XPAT search engine for building and mounting digital libraries
• Middleware consists of document classes, i.e., Text, Image, Bib, FindAid
• Originally designed to make SGML encoded texts available online
tool we developed
• Runs in DLXS environment using BibClass
• Current BibClass web templates modified
• Additional Java-based transformation tool to:
– Concatenate DC metadata records
– Filter out no-digital-object records
– Count records
– Convert from UTF-8 to ISO-8859-1
– Transform DC records into BibClass records using XSLT
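The filtering, counting, and re-encoding steps listed above can be sketched in Python as follows. The actual tool is Java-based and uses XSLT, so this is only a stand-in. The test for "has a digital object" is an assumption (a `dc:identifier` that looks like a URL), and the names `has_object_link` and `filter_and_encode` are hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"


def has_object_link(record):
    """Assumed heuristic: some dc:identifier holds an http(s) or ftp URL."""
    for ident in record.iter(DC_NS + "identifier"):
        text = (ident.text or "").strip().lower()
        if text.startswith(("http://", "https://", "ftp://")):
            return True
    return False


def filter_and_encode(records):
    """Return (kept_count, dropped_count, latin1_bytes) for a list of record elements."""
    kept = [r for r in records if has_object_link(r)]
    # Concatenate the surviving records under one wrapper element.
    wrapper = ET.Element("records")
    wrapper.extend(kept)
    # Serialize as ISO-8859-1; ElementTree writes characters outside that
    # character set as numeric character references.
    data = ET.tostring(wrapper, encoding="iso-8859-1")
    return len(kept), len(records) - len(kept), data
```

The two counts give a quick per-repository check on how many harvested records were dropped for lacking a link.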
system design
[Diagram. Components shown: OAI-enabled DC records; non-OAI-enabled DC records; UIUC harvester; record storage; XSLT transformation tool; XSL stylesheets (per source type); BibClass indexes; search interface (XPAT)]
result
• One place to look for digital objects
• Big
– 3,016,251 metadata records
– 267 institutions (as of last week…)
• Popular
– Averages 3300 search sessions / month
– Picked up in March ’03: average 3500 now
– 43,894 searches in one year (June 2002 – July 2003)
repositories: e.g.,
• arXiv Eprint Archive: math and physics pre- and post-prints
• Online Archive of California: manuscripts, photographs, and works of art held in institutions across California
• Sammelpunkt, Elektronisch Archivierte Theorie: archive of philosophical publications
• British Women Romantic Poets Project: collection of poems written by British women between 1789 and 1832
repositories: stats
• As of February ’04, out of 267 repositories…
• International and U.S.
– U.S.: 50.5% (135)
– Intl: 49.5% (132)
• By subject
– Humanities: 24% (65)
– Science: 30% (81)
– Mixed: 46% (121)
• E-prints and pre-prints
– Using eprints.org software: 39% (104)
– Not using eprints.org software: 61% (163)
major issues encountered
• Metadata variation
• Records not leading to digital objects
• Access restrictions on digital objects described in records
• Duplicate records for a single digital object
issue: metadata variation
• With more records, users need more restrictions
• Consistent metadata needed to facilitate these restrictions
• One option: normalization of data
issue: metadata variation
• Type: the obvious quick win
– 240 metadata values mapped to four generic values (text, image, audio, video)
– e.g.,
audio, sound = audio
motion, animation, newsreels, etc. = video
watercolour, watercolor, slides, etc. = image
article, articles, booklet, diss, story, etc. = text
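The type mapping above amounts to a lookup table. A minimal Python sketch, using only the example values shown on this slide (the full 240-value table is not reproduced here, and the names `TYPE_MAP` and `normalize_type` are hypothetical):

```python
# Raw dc:type values (lower-cased) mapped to the four generic values.
TYPE_MAP = {
    "audio": "audio", "sound": "audio",
    "motion": "video", "animation": "video", "newsreels": "video",
    "watercolour": "image", "watercolor": "image", "slides": "image",
    "article": "text", "articles": "text", "booklet": "text",
    "diss": "text", "story": "text",
}


def normalize_type(value):
    """Map a raw type value to a generic type, or None if it is unmapped."""
    return TYPE_MAP.get(value.strip().lower())
```

Returning `None` for unmapped values, rather than guessing, leaves those records searchable but outside any type restriction.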
issue: metadata variation
• Date: where to begin?
– Most records with at least one date
– Some records include up to seven dates
– No consistent style of date
• Subject: out of context, what meaning?
– Many records with at least one subject element
– But over 100 records with more than 50 subjects
– And one record with 1000!
issue: metadata variation
• Sample date values
<date>2-12-01</date>
<date>2002-01-01</date>
<date>0000-00-00</date>
<date>1822</date>
<date>between 1827 and 1833</date>
<date>18--?</date>
<date>November 13, 1947</date>
<date>SEP 1958</date>
<date>235 bce</date>
<date>Summer, 1948</date>
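One conservative first pass over values like these is to pull out only unambiguous four-digit years and leave everything else unnormalized. The Python sketch below is an illustration, not a method OAIster is known to use; `find_years` is a hypothetical name, and treating `0000` as a placeholder is an assumption.

```python
import re

# Exactly four digits, not embedded in a longer run of digits.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")


def find_years(value):
    """Return the four-digit years found in a date string, skipping 0000."""
    return [int(m) for m in YEAR.findall(value) if int(m) > 0]
```

Values such as `2-12-01`, `18--?`, and `235 bce` yield an empty list and would still need manual rules or review.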
issue: metadata variation
• Sample subject values
<subject>30,51,52</subject>
<subject>1852, Apr. 22. E[veritt] Judson, letter to Philuta [Judson].</subject>
<subject>Slavery--United States--Controversial literature</subject>
<subject>view of interior with John Henry sculpture</subject>
<subject>Particles (Nuclear physics) -- Research.</subject>
issue: no digital objects
• Some records contain links to further description of digital object
• But not the digital object itself
• Culling difficult
• One option: add explanatory text to site
• Or, unfortunately, spot-check and remove repositories with this issue
issue: access restrictions
• No records where metadata itself is restricted in use (as far as we know!)
• Definitely some records where objects are restricted to licensed users
• One option: add explanatory text to site
• Or sub-set OAIster into free and “partially” free repositories
issue: duplicate records
• Two records harvested, different identifiers, same object described and pointed to
• Two records harvested inadvertently through aggregators and original repositories
issue: duplicate records
• Need algorithm to automate de-duplication
• Were duplicates to be identified, how to deal with the issue?
– Suppress?
– Group?
– Flag?
• So far, not addressed in OAIster
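As one possible starting point for automated de-duplication, records could be grouped by the URL of the object they point to. The Python sketch below is not something OAIster implements. The normalization rules are assumptions, the names `normalize_url` and `group_duplicates` are hypothetical, and it would miss duplicates that reach the same object through different URLs.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case scheme and host, drop any trailing slash and fragment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def group_duplicates(records):
    """records: iterable of (oai_identifier, object_url) pairs.

    Returns {normalized_url: [identifiers]} for URLs shared by 2+ records.
    """
    groups = defaultdict(list)
    for oai_id, url in records:
        groups[normalize_url(url)].append(oai_id)
    return {url: ids for url, ids in groups.items() if len(ids) > 1}
```

Grouping rather than deleting keeps the suppress, group, or flag decision open for a later step.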
future of OAIster
• Advanced searching
• Grouping to aid browsing
• Further normalization of data
• Handling duplicate records
• Saving/emailing/downloading records
• Collaboration with other services: search, instructional…
• More user testing…
current state of protocol
• Popular
• As Peter Suber says:
– “…no other single idea or technology in the [open-source movement] has enjoyed this density of endorsement and adoption in a six month period.”
• Data providers over one year:
– June ’02: 56 repositories / 274,062 records
– June ’03: 187 repositories / 1,246,953 records
– Over three-fold increase for repositories
– Over four-fold increase for records
future of protocol
• Branching out
– DC required vs. highly recommended
– Use of OAI in closed environments
– Static repository protocol
– OAI-rights committee
• OAI evangelism
contact info
• Kat Hagedorn
• University of Michigan Libraries, Digital Library Production Service
• [email protected]
• http://www.oaister.org/