Introduction to metadata Jenn Riley Metadata Librarian Digital Library Program.
An introduction to metadata in digital projects Jenn Riley Metadata Librarian L566 Fall 2006.
An introduction to metadata in digital projects
Jenn Riley
Metadata Librarian
L566 Fall 2006
10/17/06 L566 Fall 2006 2
Topics we’ll cover
  Choosing descriptive metadata standards
  Choosing controlled vocabularies
  Using controlled vocabularies to enhance searching and browsing
  Wrapping it all up
Choosing descriptive metadata standards
Descriptive metadata
  Enables users to find relevant materials
  Used by many different knowledge domains
  Many potential representations
  Controlled by
    Data structure standards
    Data content standards
    Syntax encoding schemes
    Vocabulary encoding schemes
Some data structure standards
  Dublin Core (DC)
    Unqualified (simple)
    Qualified
  MAchine Readable Cataloging (MARC)
  MARC in XML (MARCXML)
  Metadata Object Description Schema (MODS)
How do I pick one? (1)
  Institution
    Nature of holding institution
    Resources available for metadata creation
    What others in the community are doing
    Formats supported by your delivery software
  The standard
    Purpose
    Structure
    Context
    History
How do I pick one? (2)
  Materials
    Genre
    Format
    Likely audiences
    What metadata already exists for these materials
  Project goals
    Robustness needed for the given materials and users
    Describing multiple versions
    Mechanisms for providing relationships between records
    Plan for interoperability, including repeatability of elements
  More information on handout
Dublin Core (DC)
  15-element set
  National and international standard
    2001: Released as ANSI/NISO Z39.85
    2003: Released as ISO 15836
  Maintained by the Dublin Core Metadata Initiative (DCMI)
  Other players
    DCMI Working Groups
    DC Usage Board
DCMI mission
  The Dublin Core Metadata Initiative provides simple standards to facilitate the finding, sharing and management of information.
  DCMI does this by:
    Developing and maintaining international standards for describing resources
    Supporting a worldwide community of users and developers
    Promoting widespread use of Dublin Core solutions
DC Principles
  “Core” across all knowledge domains
  No element required
  All elements repeatable
  1:1 principle
DCMI Abstract Model
  Released in 2005
  “A reference model against which particular DC encoding guidelines can be compared”
  Heavily influenced by RDF thinking
  New XML and RDF encodings under development to conform to the abstract model
  Two schools of thought on its development
    Clarifies model underlying the metadata standard
    Overly complicates a standard intended to be simple
DC encodings
  HTML <meta>
  XML
  RDF
  [Spreadsheets]
  [Databases]
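As a rough illustration of the XML encoding, the sketch below builds a simple DC record with Python's standard library. The `dc:` element names and the namespace URI come from the DC Element Set; the wrapper element `record` and all sample values are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_simple_dc(fields):
    """Build an XML element holding simple DC statements.

    `fields` is a list of (element_name, value) pairs. A list is used
    instead of a dict because every DC element is optional and repeatable.
    """
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Invented sample values, for illustration only.
sample = build_simple_dc([
    ("title", "Street scene"),
    ("creator", "Example, Photographer"),
    ("subject", "Streets"),
    ("subject", "Automobiles"),
    ("date", "1941-05-03"),
])
xml_text = ET.tostring(sample, encoding="unicode")
print(xml_text)
```

The repeated `subject` element and the absence of most of the 15 elements reflect the "all elements repeatable" and "no element required" principles.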
Content/value standards for DC
  None required
  Some elements recommend a content or value standard as a best practice
    Relation
    Source
    Subject
    Type
    Coverage
    Date
    Format
    Language
    Identifier
Some limitations of simple DC
  Can’t indicate a main title vs. other subordinate titles
  No method for specifying creator roles
  W3CDTF format can’t indicate date ranges or uncertainty
  Can’t by itself provide robust record relationships
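The date limitation can be made concrete with a small check. This is a sketch only: the regular expression approximates the shape of W3CDTF values (year, year-month, full date, or date plus time with a time zone) and does not validate ranges such as month numbers. The sample values are invented.

```python
import re

# Shape of the W3CDTF profile of ISO 8601: a year, optionally followed by
# month, day, and a time that must carry a time zone designator.
W3CDTF = re.compile(
    r"\d{4}"
    r"(-\d{2}"
    r"(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?"
    r"(Z|[+-]\d{2}:\d{2}))?"
    r")?)?"
)


def is_w3cdtf(value):
    """Check the syntactic shape only; a month of 13 would still pass."""
    return W3CDTF.fullmatch(value) is not None


candidates = [
    "1941", "1941-05", "1941-05-03", "1997-07-16T19:20+01:00",
    "1938-1969",  # a date range
    "ca. 1941",   # an approximate date
    "1941?",      # an uncertain date
]
results = {value: is_w3cdtf(value) for value in candidates}
for value, ok in results.items():
    print(f"{value!r}: {'valid' if ok else 'not expressible'}")
```

The last three values are the kinds of dates that archival and image collections often need, and none of them fits the format.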
Good times to use DC
  Cross-collection searching
  Cross-domain discovery
  Metadata sharing
  Describing some types of simple resources
  Metadata creation by novices
|  | DC [record] | QDC [record] [collection] | MARC [record] [collection] | MARCXML [record] | MODS [record] [collection] |
|---|---|---|---|---|---|
| Record format | XML, RDF, (X)HTML |  |  |  |  |
| Field labels | Text |  |  |  |  |
| Reliance on AACR | None |  |  |  |  |
| Common method of creation | By novices, by specialists, and by derivation |  |  |  |  |
Qualified Dublin Core (QDC)
  Adds some increased specificity to Unqualified Dublin Core
  Same governance structure as DC
  Same encodings as DC
  Same content/value standards as DC
  Listed in DCMI Terms
  Additional principles
    Extensibility
    Dumb-down principle
Types of DC qualifiers
  Additional elements
  Element refinements
  Encoding schemes
    Vocabulary encoding schemes
    Syntax encoding schemes
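One way to picture element refinements and the dumb-down principle is a lookup from each refinement to the simple DC element it refines. The sketch below is an illustration under stated assumptions: the `REFINES` table lists only a handful of DCMI refinements, the triple layout `(term, value, scheme)` is a convention invented here, and the sample values are made up.

```python
# A few DCMI element refinements mapped to the element each one refines.
REFINES = {
    "alternative": "title",
    "created": "date",
    "issued": "date",
    "spatial": "coverage",
    "temporal": "coverage",
    "isPartOf": "relation",
    "abstract": "description",
}


def dumb_down(statements):
    """Reduce qualified DC statements to simple DC.

    `statements` is a list of (term, value, scheme) triples, where `scheme`
    names an encoding scheme or is None. Refinements fall back to the
    element they refine, and the encoding scheme is dropped; the value
    itself is kept unchanged.
    """
    return [(REFINES.get(term, term), value) for term, value, _scheme in statements]


# Invented sample values, for illustration only.
qualified = [
    ("title", "Street scene", None),
    ("alternative", "View from the corner", None),
    ("created", "1941-05-03", "W3CDTF"),
    ("spatial", "Chicago", "TGN"),
]
simple = dumb_down(qualified)
print(simple)
```

A consumer that understands only simple DC can still use the record, at the cost of no longer knowing which title is the alternative one or which kind of date it holds.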
DC qualifier status
  Recommended
  Conforming
  Obsolete
  Registered
Limitations of QDC
  Widely misunderstood
  No method for specifying creator roles
  W3CDTF format can’t indicate date ranges or uncertainty
  Split across 3 XML schemas
Best times to use QDC
  More specificity needed than simple DC, but not a fundamentally different approach to description
  Want to share DC with others, but need a few extensions for your local environment
  Describing some types of simple resources
  Metadata creation by novices
|  | DC [record] | QDC [record] [collection] | MARC [record] [collection] | MARCXML [record] | MODS [record] [collection] |
|---|---|---|---|---|---|
| Record format | XML, RDF, (X)HTML | XML, RDF, (X)HTML |  |  |  |
| Field labels | Text | Text |  |  |  |
| Reliance on AACR | None | None |  |  |  |
| Common method of creation | By novices, by specialists, and by derivation | By novices, by specialists, and by derivation |  |  |  |
MAchine Readable Cataloging (MARC)
  Format for the records in IUCAT, WorldCat and other library catalogs
  Used for library metadata since 1960s
    Adopted as national standard in 1971
    Adopted as international standard in 1973
  Maintained by:
    Network Development and MARC Standards Office at the Library of Congress
    Standards and the Support Office at the National Library of Canada
More about MARC
  Actually a family of MARC standards throughout the world
    U.S. & Canada use MARC21
    MARC Bibliographic is for descriptive metadata
  Structured as a binary interchange format
    ANSI/NISO Z39.2
    ISO 2709
  Field names
    Numeric fields
    Alphabetic subfields
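To show what the interchange structure looks like in practice, here is a minimal sketch that writes and then reads one ISO 2709 record: a 24-character leader, a directory of 12-character entries, and the field data. The helper names `build_record` and `parse_record`, the fixed leader values, and the sample field contents are assumptions for this example; production work would normally use an established MARC library instead.

```python
FIELD_END = b"\x1e"    # field terminator
RECORD_END = b"\x1d"   # record terminator
SUBFIELD = b"\x1f"     # subfield delimiter


def build_record(fields):
    """Serialize [(tag, data_bytes), ...] as one ISO 2709 record."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FIELD_END
        # Directory entry: 3-char tag, 4-digit length, 5-digit start offset.
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    directory += FIELD_END
    base = 24 + len(directory)      # offset where field data begins
    total = base + len(body) + 1    # +1 for the record terminator
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + body + RECORD_END


def parse_record(raw):
    """Return (leader, fields) for one ISO 2709 record."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])
    directory = raw[24:base - 1]    # drop the directory's terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length - 1]  # drop terminator
        if tag < "010":             # control fields have no subfields
            fields.append((tag, data.decode("utf-8")))
        else:
            indicators = data[:2].decode("ascii")
            subfields = [
                (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
                for chunk in data[2:].split(SUBFIELD)[1:]
            ]
            fields.append((tag, indicators, subfields))
    return leader, fields


# Invented sample content: numeric field tags, alphabetic subfield codes.
raw = build_record([
    ("001", b"example-0001"),
    ("245", b"10" + SUBFIELD + b"aStreet scene /" + SUBFIELD + b"cA. Photographer."),
])
leader, fields = parse_record(raw)
print(leader)
print(fields)
```

Field 245 with subfields `a` and `c` is the numeric-field, alphabetic-subfield naming the slide refers to.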
Content/value standards for MARC
  None required by the format itself
  But US record creation practice relies heavily on:
    AACR2r
    ISBD
    LCNAF
    LCSH
Limitations of MARC
  Use of all its potential is time-consuming
  OPACs don’t make full use of all possible data
  OPACs virtually the only systems to use MARC data
  Requires highly-trained staff to create
  Local practice differs greatly
Good times to use MARC
  Integration with other records in OPAC
  Resources are like those traditionally found in library catalogs
  Maximum compatibility with other libraries is needed
  Have expert catalogers for metadata creation
|  | DC [record] | QDC [record] [collection] | MARC [record] [collection] | MARCXML [record] | MODS [record] [collection] |
|---|---|---|---|---|---|
| Record format | XML, RDF, (X)HTML | XML, RDF, (X)HTML | ISO 2709 [ANSI Z39.2] |  |  |
| Field labels | Text | Text | Numeric |  |  |
| Reliance on AACR | None | None | Strong |  |  |
| Common method of creation | By novices, by specialists, and by derivation | By novices, by specialists, and by derivation | By specialists |  |  |
MARC in XML (MARCXML)
  Copies the exact structure of MARC21 in an XML syntax
    Numeric fields
    Alphabetic subfields
  Implicit assumption that content/value standards are the same as in MARC
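The "exact structure" point can be sketched by serializing MARC-style fields as MARCXML with the standard library. The element and attribute names (`record`, `leader`, `controlfield`, `datafield`, `subfield`, `tag`, `ind1`, `ind2`, `code`) and the namespace follow the MARCXML schema; the function name `to_marcxml`, the tuple layout of the input, and the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)


def to_marcxml(leader, fields):
    """Build a MARCXML record element.

    Control fields are (tag, value) pairs; data fields are
    (tag, indicators, [(subfield_code, value), ...]) triples.
    """
    record = ET.Element(f"{{{MARC_NS}}}record")
    ET.SubElement(record, f"{{{MARC_NS}}}leader").text = leader
    for field in fields:
        if len(field) == 2:
            tag, value = field
            el = ET.SubElement(record, f"{{{MARC_NS}}}controlfield", tag=tag)
            el.text = value
        else:
            tag, indicators, subfields = field
            el = ET.SubElement(
                record, f"{{{MARC_NS}}}datafield",
                tag=tag, ind1=indicators[0], ind2=indicators[1],
            )
            for code, value in subfields:
                sub = ET.SubElement(el, f"{{{MARC_NS}}}subfield", code=code)
                sub.text = value
    return record


# Invented sample record, for illustration only.
record = to_marcxml(
    "00000nam a2200000 a 4500",
    [
        ("001", "example-0001"),
        ("245", "10", [("a", "Street scene /"), ("c", "A. Photographer.")]),
    ],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The numeric tag and single-letter subfield codes survive as attribute values, which is why the result is faithful to MARC but verbose and hard to key by hand.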
Limitations of MARCXML
  Not appropriate for direct data entry
  Extremely verbose syntax
  Full content validation requires tools external to XML Schema conformance
Good times to use MARCXML
  As a transition format between a MARC record and another XML-encoded metadata format
  Materials lend themselves to library-type description
  Need more robustness than DC offers
  Want XML representation to store within larger digital object but need lossless conversion to MARC
|  | DC [record] | QDC [record] [collection] | MARC [record] [collection] | MARCXML [record] | MODS [record] [collection] |
|---|---|---|---|---|---|
| Record format | XML, RDF, (X)HTML | XML, RDF, (X)HTML | ISO 2709 [ANSI Z39.2] | XML |  |
| Field labels | Text | Text | Numeric | Numeric |  |
| Reliance on AACR | None | None | Strong | Strong |  |
| Common method of creation | By novices, by specialists, and by derivation | By novices, by specialists, and by derivation | By specialists | By derivation |  |
Metadata Object Description Schema (MODS)
  Developed and managed by the Library of Congress Network Development and MARC Standards Office
  First released for trial use June 2002
  MODS 3.2 released June 2006
  “Schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications.”
Differences between MODS and MARC
  MODS is “MARC-like” but intended to be simpler
  Textual tag names
  Encoded in XML
  Some specific changes
    Some regrouping of elements
    Removes some elements
    Adds some elements
Content/value standards for MODS
  Some elements indicate a given content/value standard should be used
    Generally follows MARC/AACR2/ISBD conventions
    But not all enforced by the MODS XML schema
  Authority attribute available on some elements
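For comparison with MARC's numeric tags, the sketch below assembles a small MODS fragment with textual element names and an `authority` attribute. The element names and the MODS version 3 namespace are from the schema; the sample values are invented, and the authority code `lctgm` (commonly used for TGM I) is included as an illustrative assumption, not a claim about any particular record.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(name):
    """Qualify a local element name with the MODS namespace."""
    return f"{{{MODS_NS}}}{name}"


mods = ET.Element(q("mods"))

# Textual tag names instead of MARC's numeric ones.
title_info = ET.SubElement(mods, q("titleInfo"))
ET.SubElement(title_info, q("title")).text = "Street scene"

# Unlike simple DC, a name can carry a role.
name = ET.SubElement(mods, q("name"), type="personal")
ET.SubElement(name, q("namePart")).text = "Example, Photographer"
role = ET.SubElement(name, q("role"))
ET.SubElement(role, q("roleTerm"), type="text").text = "photographer"

# The authority attribute records which vocabulary a value came from.
# The schema does not check the term against that vocabulary.
subject = ET.SubElement(mods, q("subject"), authority="lctgm")
ET.SubElement(subject, q("topic")).text = "Streets"

xml_text = ET.tostring(mods, encoding="unicode")
print(xml_text)
```

The `role` element addresses the "no method for specifying creator roles" limitation listed for DC and QDC.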
Limitations of MODS
  No lossless round-trip conversion from and to MARC
  Still largely implemented by library community only
  Some semantics of MARC lost
  Format still growing to meet the needs of the digital library community
Good times to use MODS
  Materials lend themselves to library-type description
  Want to reach both library and non-library audiences
  Need more robustness than DC offers
  Want XML representation to store within larger digital object
|  | DC [record] | QDC [record] [collection] | MARC [record] [collection] | MARCXML [record] | MODS [record] [collection] |
|---|---|---|---|---|---|
| Record format | XML, RDF, (X)HTML | XML, RDF, (X)HTML | ISO 2709 [ANSI Z39.2] | XML | XML |
| Field labels | Text | Text | Numeric | Numeric | Text |
| Reliance on AACR | None | None | Strong | Strong | Implied |
| Common method of creation | By novices, by specialists, and by derivation | By novices, by specialists, and by derivation | By specialists | By derivation | By specialists and by derivation |
Picking a format
  Consider all options
  Match format to the types of discovery you want to support
  Your choice has to fit in your larger technological infrastructure
    Realize the constraints you’re operating under
    Or, expand infrastructure!
  Don’t have to choose just one, can use several for different purposes
Mapping between metadata formats
  Also called “crosswalking”
  To create “views” of metadata for specific purposes
  Mapping from robust format to more general format is common
  Mapping from general format to more robust format is ineffective
Types of mapping logic
  Mapping the complete contents of one field to another
  Splitting multiple values in a single local field into multiple fields in the target schema
  Translating anomalous local practices into a more generally useful value
  Splitting data in one field into two or more fields
  Transforming data values
  Boilerplate values to include in output schema
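The six kinds of mapping logic listed above can be sketched as one small crosswalk from a local record to simple DC. Everything about the local side is hypothetical: the field names (`caption`, `keywords`, `fmt`, `where_when`), the format codes, and the sample record are invented for this example. `StillImage` is a term from the DCMI Type Vocabulary.

```python
from datetime import datetime

# Hypothetical local codes that only make sense inside one institution.
FORMAT_CODES = {"sl": "Slide", "neg": "Negative"}


def crosswalk(local):
    """Map a hypothetical local record (a dict) to simple DC pairs."""
    out = []

    # 1. Map the complete contents of one field to another.
    out.append(("title", local["caption"]))

    # 2. Split multiple values in one local field into repeated target fields.
    for term in local["keywords"].split(";"):
        if term.strip():
            out.append(("subject", term.strip()))

    # 3. Translate an anomalous local practice into a generally useful value.
    out.append(("format", FORMAT_CODES.get(local["fmt"], local["fmt"])))

    # 4. Split data in one field into two different target fields.
    place, raw_date = [part.strip() for part in local["where_when"].split(";", 1)]
    out.append(("coverage", place))

    # 5. Transform a data value (US-style date to W3CDTF).
    iso_date = datetime.strptime(raw_date, "%m/%d/%Y").strftime("%Y-%m-%d")
    out.append(("date", iso_date))

    # 6. Add a boilerplate value that is the same for every record.
    out.append(("type", "StillImage"))
    return out


local_record = {
    "caption": "Street scene",
    "keywords": "Streets; Automobiles",
    "fmt": "sl",
    "where_when": "Chicago; 05/03/1941",
}
dc_record = crosswalk(local_record)
for element, value in dc_record:
    print(f"dc:{element} = {value}")
```

Going the other way, from the DC pairs back to the local record, would require guessing which `subject` came from which source and what the original code was, which is why mapping from a general format to a more robust one is described as ineffective.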
Common mapping pitfalls
  Cramming in too much information
  Leaving in trailing punctuation
  Missing context of records
  Meaningless placeholder data
  ALWAYS remember the purpose of the metadata you are creating!
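Two of these pitfalls, trailing punctuation and placeholder data, lend themselves to a small cleanup step during mapping. The sketch below is one possible approach, not a complete rule set: the list of placeholder strings and the choice of punctuation marks to strip are assumptions, and trailing periods are deliberately left alone because they often belong to abbreviations.

```python
import re

# Assumed examples of values that carry no information for a searcher.
PLACEHOLDERS = {"", "unknown", "n/a", "none", "[s.n.]", "[s.l.]"}

# ISBD-style separators that are often left at the end of a mapped value.
TRAILING_PUNCTUATION = re.compile(r"\s*[/:;,]\s*$")


def clean_value(value):
    """Return a cleaned value, or None if nothing meaningful remains."""
    cleaned = TRAILING_PUNCTUATION.sub("", value.strip())
    if cleaned.lower() in PLACEHOLDERS:
        return None
    return cleaned


samples = ["Street scene /", "Chicago :", "Unknown", "Chicago, Ill.", "  n/a "]
cleaned = [clean_value(s) for s in samples]
print(cleaned)
```

Values that come back as `None` are better omitted from the output record than passed along.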
No, really, which one do I pick?
  It depends. Sorry.
  Be as robust as you can afford
  Plan for future uses of the metadata you create
  Leverage existing expertise as much as possible
  Focus on content and value standards as much as possible
More information
  Dublin Core
    DC Element Set version 1.1
    DCMI Metadata Terms
  MODS
  MARC
  MARCXML
Break time!
Choosing controlled vocabularies
Some characteristics of CVs
  Also known as “vocabulary encoding schemes”
  Enumerated lists of all possible choices for a field value
  Often organized into a syndetic structure
  Usually intended to be human-readable
CVs in libraries
  Many library CVs grow constantly with catalogers contributing new terms
  Many library CVs use content standards to dictate the form of headings
  Fields that use CVs are said to be under “authority control”
Traditional uses of CVs in library catalog records
  Collocation
  Disambiguation
  Interoperability
  BROWSING! (Although this isn’t used much in libraries…)
Other considerations
  Human cataloging using CVs is expensive
  Developing and maintaining CVs is expensive
  Current library systems usually rely on the same string being present in all records rather than true relational structures linking records to CV terms
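The difference between string matching and a relational link can be sketched in a few lines. All names, identifiers and records here are invented; the point is only the shape of the two data models.

```python
# String-matching approach: every record repeats the heading text, so
# collocation works only when the strings are byte-for-byte identical.
string_records = [
    {"id": 1, "creator": "Example, Pat, 1900-1980"},
    {"id": 2, "creator": "Example, Pat, 1900-1980"},
    {"id": 3, "creator": "Example, Pat"},  # variant string, not collocated
]


def find_by_string(records, heading):
    return [r["id"] for r in records if r["creator"] == heading]


# Relational approach: records store an identifier, and the preferred
# heading and its variants live once, in the vocabulary itself.
authority = {
    "n001": {
        "preferred": "Example, Pat, 1900-1980",
        "variants": ["Example, Pat", "Example, P."],
    },
}
linked_records = [
    {"id": 1, "creator_id": "n001"},
    {"id": 2, "creator_id": "n001"},
    {"id": 3, "creator_id": "n001"},
]


def find_by_name(records, vocabulary, name):
    """Resolve any known form of a name to its identifier, then match on it."""
    ids = {
        key for key, entry in vocabulary.items()
        if name == entry["preferred"] or name in entry["variants"]
    }
    return [r["id"] for r in records if r["creator_id"] in ids]


print(find_by_string(string_records, "Example, Pat, 1900-1980"))
print(find_by_name(linked_records, authority, "Example, P."))
```

With the link in place, changing a preferred heading means editing one vocabulary entry instead of every record that uses it.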
When a controlled vocabulary is useful
  User browsing of a small number of categories each with a large number of members
  When many different things have the same label
  When recall is a priority for a given access point
Some common fields using CVs
  Names
  Places
  “Subjects”
Names
  Seeking works by or about a certain individual is frequent
  Individuals are often known by many different names
  Many different individuals have the same name
  Name authority lists often create uniqueness by adding qualifiers
  Some example vocabularies:
    Library of Congress Name Authority File (LCNAF)
    Getty Union List of Artist Names (ULAN)
Places
  Common in libraries to control place names in subjects, but not publication places
  Many different places with the same name
  Often organized hierarchically
  Commonly used vocabularies:
    Library of Congress Subject Headings (LCSH)
    Getty Thesaurus of Geographic Names (TGN)
    GEONet Names Server
“Subjects”
  Libraries traditionally group topic, location, genre, form, time period and other related concepts all under “subject”
  Often organized into a rich syndetic structure
  General rule is to apply the most specific heading applicable
  Involves subjective judgment on the part of the individual assigning the heading
Deciding which fields to place under authority control
  Consider your budgetary constraints
  Learn about the functionalities possible in your system
  Identify appropriate vocabularies that meet defined needs
  Develop a clear plan for how the fields with controlled values will be used
Using controlled vocabularies to enhance searching and browsing
Case Study: Cushman Collection
  Funded with an Institute of Museum & Library Services (IMLS) grant
  ~15,000 color slides taken between 1938 and 1969
  Cushman provided a significant amount of description
  Additional metadata created to enhance genre, subject and geographic access
Metadata for the Cushman Collection
  Cushman’s description
    Dates
    Location
    Names
  TGM I – LC Thesaurus for Graphic Materials: Subject Terms
  TGM II – LC Thesaurus for Graphic Materials: Genre & Physical Characteristics
  TGN – Getty Thesaurus of Geographic Names
  We wanted to use this high-quality metadata to improve on past search systems
TGM I: Subject Terms Strengths and Weaknesses
  Strengths include:
    Pre-defined relationships between concepts
    Some lead-in vocabulary
  Weaknesses include:
    Syndetic relationship lacking for new terms
    Language not user-friendly
    Not enough lead-in vocabulary
    Form and number of top-level categories not useful for a browse structure
User studies performed
  Two types
    Group walkthroughs of prototypes
    Task scenario study
  Some functionality suggested by the studies
    Refinement while searching
    Search suggestions
    Faceted browsing
    Browsing on subject terms at all levels
    CV interaction
Browsing Image Collections
  Research shows:
    Browsing is exploratory (Bawden)
    Guided, flexible browsing in context works (Flamenco and SI Art Image Browser projects)
  Our usability studies show:
    Structure is important
    Contents should be easily exposed
    Flexible and combinatorial browsing is desired
    Browsing cultivates searching
Searching Image Collections
  Research shows:
    Using thesaurus structure helps searching (Greenberg)
      Automatic expansion of synonyms and narrower terms
      User-initiated expansion of broader and related terms
  Our usability studies show:
    Referencing an A-Z list with no lead-in terms for searching is NOT helpful at all
    Concerns about word choice
    Iterative reformulation of queries in context is desired
Cushman Specifications: Browsing
  Date
  Genre
  Subjects (hierarchical)
    Retrieval of all records with narrower terms
  Location (hierarchical)
  Combination of categories
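The browsing specification can be sketched as a filter that expands a subject to all of its narrower terms and lets categories be combined. This is an illustration only, not the Cushman system's implementation: the hierarchy fragment in `NARROWER`, the record fields, and the sample records are all hypothetical.

```python
# Hypothetical fragment of a subject hierarchy: term -> narrower terms.
NARROWER = {
    "Transportation": ["Vehicles"],
    "Vehicles": ["Automobiles", "Bicycles"],
}


def with_narrower(term):
    """Return the term plus every term below it in the hierarchy."""
    found, stack = {term}, [term]
    while stack:
        for child in NARROWER.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def browse(records, subject=None, genre=None, year=None):
    """Return ids of records matching every category that was supplied."""
    wanted = with_narrower(subject) if subject is not None else None
    hits = []
    for record in records:
        if wanted is not None and not wanted & set(record["subjects"]):
            continue
        if genre is not None and record["genre"] != genre:
            continue
        if year is not None and record["year"] != year:
            continue
        hits.append(record["id"])
    return hits


# Invented sample records.
records = [
    {"id": 1, "subjects": ["Automobiles"], "genre": "Slides", "year": 1941},
    {"id": 2, "subjects": ["Bicycles"], "genre": "Slides", "year": 1952},
    {"id": 3, "subjects": ["Gardens"], "genre": "Slides", "year": 1941},
]
print(browse(records, subject="Vehicles"))
print(browse(records, subject="Transportation", year=1941))
```

Browsing on "Vehicles" finds records indexed only with "Automobiles" or "Bicycles", which is what makes browsing at every level of the hierarchy useful when catalogers apply the most specific heading.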
Cushman Specifications: Searching
  Integrated search against BOTH “free-text” descriptions and thesaurus
  Mapping from lead-in vocabulary
  Retrieval of all records with narrower terms
  User-initiated broadening and narrowing
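A minimal sketch of the first three search requirements follows: the query is checked against free-text descriptions and, in the same pass, mapped through lead-in vocabulary to an authorized term that is then expanded to its narrower terms. It is not the Cushman implementation; the lead-in entries, the hierarchy fragment, and the records are invented, and user-initiated broadening is left out.

```python
# Hypothetical lead-in ("use") references and hierarchy fragment.
LEAD_IN = {"cars": "Automobiles", "autos": "Automobiles"}
NARROWER = {"Vehicles": ["Automobiles", "Bicycles"]}


def with_narrower(term):
    """Return the term plus every term below it in the hierarchy."""
    found, stack = {term}, [term]
    while stack:
        for child in NARROWER.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def authorized_terms():
    terms = set(NARROWER)
    for children in NARROWER.values():
        terms.update(children)
    return terms


def search(records, query):
    """Match the query against free text and against the thesaurus."""
    needle = query.strip().lower()
    by_lowercase = {term.lower(): term for term in authorized_terms()}
    # Lead-in vocabulary first, then a direct match on an authorized term.
    start = LEAD_IN.get(needle) or by_lowercase.get(needle)
    terms = with_narrower(start) if start else set()
    return [
        r["id"] for r in records
        if terms & set(r["subjects"]) or needle in r["description"].lower()
    ]


# Invented sample records.
records = [
    {"id": 1, "description": "Parked cars on State Street", "subjects": ["Automobiles"]},
    {"id": 2, "description": "Boy with bike", "subjects": ["Bicycles"]},
    {"id": 3, "description": "Old Ford in a field", "subjects": ["Automobiles"]},
    {"id": 4, "description": "Rose garden", "subjects": ["Gardens"]},
]
print(search(records, "cars"))
print(search(records, "vehicles"))
print(search(records, "rose"))
```

Record 3 never mentions "cars" in its description and is found only through the lead-in mapping, while "rose" matches nothing in the thesaurus and is found only through free text.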
Wrapping it all up
What next?
  After choosing metadata standards and controlled vocabularies
    Figure out where metadata creation fits in the overall workflow
    Write metadata creation guidelines
    Design and implement a metadata creation process
And there’s more
  Other types of metadata
    Content markup
    Technical metadata
    Rights metadata
    Preservation metadata
    Structural metadata
  Specialized metadata standards
  When to create a local metadata format
In a grant proposal (1)
  Give specific information on all the decisions you’ve made
    Metadata standards
    Controlled vocabularies
    Metadata creation workflow
    Discovery functionality the metadata will support
  Describe what metadata already exists for these materials
In a grant proposal (2)
  Indicate who will do the metadata creation work
  Give reasonable cost estimates
  The more planning you do, the more likely you are to
    Receive funding
    Complete the project on schedule
    Complete the project within your budget
That’s all for today!
  [email protected]
  Presentation slides: <http://www.dlib.indiana.edu/~jenlrile/presentations/slis/06fall/l566/l566.ppt>
  Handout: <http://www.dlib.indiana.edu/~jenlrile/presentations/slis/06fall/l566/handout.doc>