An Introduction to Repositories

Thornton Staples

Director of Community Strategy and Alliances

Director of the Fedora Project

Creating a digital library is not a process of moving the traditional library online.

Increasingly, it’s more about the care and feeding of the web!

Creating digital surrogates of paper collections is only the beginning

• Surrogate collections are an important step!

• Collecting born-digital materials is rapidly coming upon us

• Simple institutional repository approaches are good but only scratch the surface

• Complex scholarly and scientific projects are the biggest challenge

[Diagram: The Rossetti Archive. Labels: Text, Artwork, Work, Images]

Repositories are designed to be flexible and adaptable

• Relational databases are too rigid

• Need to be able to add new content types and media easily

• Need to be able to handle arbitrary complexity in relatively simple ways

• Above all, it all needs to be durable over a very long time!

[Diagram, labels in order of appearance: Preservation and Archiving, Scholars Workbench, Institutional Repository, Data Curation Solutions; The Repository (content abstraction); RAID arrays, tape libraries, cloud storage]

Repositories are the foundation for many applications

• A set of abstractions that can be used to represent different kinds of data

• Manages the actual content beneath the surface

• Negotiates the connection between access and storage

• Designed to make data “durable” over the long term

Access is the core purpose of a repository

• Searching is important but it is not the only thing

• Finding is the point of searching!

• The point of finding is very often to use the resource that you have found, for analysis or reuse

• New digital resources that reuse found objects depend on continuing access for validity

Any unit of content may have more than one context

• Within one collection

  – An architectural image may be related to more than one building

• Across collections

  – Special collections images may be art objects

• Across repositories

  – Born-digital publications will almost always cross institutional boundaries

Authenticity and fidelity

• What is an authoritative digital surrogate of a real object?

• When is a copy of an original surrogate exact?

• A born-digital object has nothing to compare against

• Digital “fingerprints” must be captured and managed as metadata

• When formats change, objects will not have all the same technical characteristics…

Making complex digital information “durable” is a very hard problem

• Durability implies that digital content is directly in use and sustained long-term

• A history of the changes to the encoding and state of content must be reliably provided

• A meaningful context for any unit of content may be one of many and must be sustained

• Replication appears to be our best friend and the cloud looks like an answer

Management is the core function of a repository

• Repositories are designed to keep everything as stable as possible while providing flexible access

• Managing things such that when they aren’t changing they are reliably the same

• Accounting for migration for technical reasons

• Disaster preparedness (lots of copies!)

• Must respect legal and policy issues

Repository abstractions provide a durability framework for managing.

• Content is “unitized” as information objects that combine data, metadata, policies, relationships and the history of the object.

• Complex digital resources are formally defined graphs of related objects.

• The public view of the content is presented as virtual data components.

A data object is one unit of content

[Diagram: a data object with a Persistent ID; Reserved Datastreams labeled DC, RELS-EXT, AUDIT, and POLICY; and Custom Datastreams labeled 1, 2, n (any type, any number)]
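As a rough illustration of the object model on this slide, the sketch below models a data object in Python as a persistent ID plus a set of named datastreams. The class and method names (`DataObject`, `Datastream`, `add_datastream`, `custom_ids`), the default MIME type, and the PID `demo:1` are all hypothetical and chosen for illustration; this is not the Fedora API. Only the reserved datastream names come from the slide.

```python
from dataclasses import dataclass, field

# Reserved datastream IDs named on the slide.
RESERVED_IDS = ("DC", "RELS-EXT", "AUDIT", "POLICY")


@dataclass
class Datastream:
    """One component of an object: an ID, a MIME type, and its bytes."""
    ds_id: str
    mime_type: str
    content: bytes = b""


@dataclass
class DataObject:
    """Hypothetical model of one unit of content (not the Fedora API)."""
    pid: str
    datastreams: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every object starts with the reserved datastreams, initially empty.
        # The "text/xml" MIME type is an assumption made for this sketch.
        for ds_id in RESERVED_IDS:
            self.datastreams.setdefault(ds_id, Datastream(ds_id, "text/xml"))

    def add_datastream(self, ds: Datastream) -> None:
        """Attach a custom datastream (any type, any number)."""
        if ds.ds_id in RESERVED_IDS:
            raise ValueError(f"{ds.ds_id} is a reserved datastream ID")
        self.datastreams[ds.ds_id] = ds

    def custom_ids(self) -> list:
        """IDs of the custom (non-reserved) datastreams, in insertion order."""
        return [i for i in self.datastreams if i not in RESERVED_IDS]


obj = DataObject(pid="demo:1")  # made-up PID
obj.add_datastream(Datastream("MASTER", "image/jp2"))
obj.add_datastream(Datastream("MODS", "text/xml"))
print(obj.custom_ids())
```

Keeping the reserved and custom datastreams in one mapping keyed by ID is only one possible design; it makes "any type, any number" cheap to support while still letting the object refuse to overwrite a reserved slot.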

Files are stored on disk and managed directly

• Versioning is necessary

• Checksums for each file provide assurance that the file has not changed

• Can be managed by the repository or as remote files
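The checksum bullet can be made concrete with a small fixity check. This is a minimal sketch using Python's standard `hashlib`; the function names `file_checksum` and `verify_fixity`, the choice of SHA-256, and the throwaway demo file are assumptions for illustration, not a description of how any particular repository stores or verifies checksums.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: str, recorded: str, algorithm: str = "sha256") -> bool:
    """True if the file still matches the checksum recorded as metadata."""
    return file_checksum(path, algorithm) == recorded


# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"master image bytes")
    demo_path = tmp.name

recorded_checksum = file_checksum(demo_path)   # would be stored at ingest time
unchanged = verify_fixity(demo_path, recorded_checksum)

with open(demo_path, "ab") as handle:          # simulate a silent change
    handle.write(b"!")
changed_detected = not verify_fixity(demo_path, recorded_checksum)

os.remove(demo_path)
print(unchanged, changed_detected)
```

The same check works whether the file is managed by the repository or held remotely, as long as the recorded checksum travels with the object's metadata.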

Virtual datastreams provide the access abstraction

• Can be simply retrieving a stored component

• Views of the content can be derived on demand, for different formats and resolutions

• Other data products can be derived on demand, e.g. tiles from a JPEG2000 file

• By providing an abstract view of the content you break the dependence on the stored files
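One way to picture the access abstraction is a registry of named views, each derived on demand from whatever is actually stored. The sketch below is a toy under stated assumptions: the stored record, the view names (`citation`, `title-only`), and the helpers `view` and `get_view` are all invented for illustration, and a plain dictionary stands in for the stored file.

```python
from typing import Callable, Dict

# Stored component: an in-memory stand-in for what is actually kept on disk.
stored = {
    "title": "Sample Photograph",
    "creator": "Unknown",
    "date": "1900",
}

# Each virtual datastream is a named function that derives a view on demand.
ViewFn = Callable[[Dict[str, str]], str]
views: Dict[str, ViewFn] = {}


def view(name: str) -> Callable[[ViewFn], ViewFn]:
    """Register a function as the virtual datastream called `name`."""
    def register(fn: ViewFn) -> ViewFn:
        views[name] = fn
        return fn
    return register


@view("citation")
def citation(record: Dict[str, str]) -> str:
    return f'{record["creator"]}. "{record["title"]}." {record["date"]}.'


@view("title-only")
def title_only(record: Dict[str, str]) -> str:
    return record["title"]


def get_view(name: str) -> str:
    """Callers ask for a view by name and never touch the stored form."""
    return views[name](stored)


print(get_view("citation"))
```

Because callers only know view names, the stored form could later be migrated to another format without changing anything on the access side, which is the point of the last bullet.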

[Diagram labels: Pid, system Meta, MODS, JP2000; Thumb, Screen, Master, Custom Size, Dublin Core, MODS, Citation; MODS File, JPEG2000 File; Content Access; Content Management]

Descriptive metadata is about the content of the resource

• Indexed for searching

• Also used for rendering user experiences

• Some standards in use:

  – Dublin Core: general

  – MODS: bibliographic

  – VRA Core: cultural heritage

  – FGDC: GIS datasets

  – DDI: social science datasets

Administrative metadata is more about the encoding and use

• Metadata about the object generally, like checksums

• Technical metadata about the specifics of the encoding of each format

• Event metadata, about what happens to an object over its lifetime; audit trails

• Policy metadata, like access restrictions and credit lines

Relationships Among Objects

• Describes adjacency relationships among objects, among units of content

• Can be done by explicitly listing IDs in XML, using METS for example

• Or using RDF:

  PID – typeOfRelationship – relatedObjectPID

• Can be used to assemble complex resources and aggregations of objects

• Explicit and implicit aggregations
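The PID, relationship type, related PID pattern can be sketched as plain triples. The example below uses Python tuples rather than an RDF library so that it stays self-contained; the PIDs, the relationship names, and the helper functions `related_subjects` and `aggregate` are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

# Each relationship is a (subject PID, relationship type, object PID) triple.
# All PIDs and relationship names here are made up for illustration.
triples = [
    ("demo:page1", "isMemberOf", "demo:book1"),
    ("demo:page2", "isMemberOf", "demo:book1"),
    ("demo:book1", "isMemberOfCollection", "demo:texts"),
]


def related_subjects(relationship: str, target: str) -> list:
    """PIDs that point at `target` with the given relationship, sorted."""
    return sorted(s for s, p, o in triples if p == relationship and o == target)


def aggregate(triples_in) -> dict:
    """Group subjects under each (relationship, object) pair."""
    groups = defaultdict(list)
    for s, p, o in triples_in:
        groups[(p, o)].append(s)
    return dict(groups)


pages = related_subjects("isMemberOf", "demo:book1")
print(pages)
```

Here the book is an aggregation that is never listed anywhere as a whole: it is assembled by following the relationships that its pages assert.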

Text Collections

[Diagram labels: Texts, Modern English Collection, Page Images]

Establishing and Enforcing Policies

• Policies must be established for the entire life-cycle of the information

  – Ownership and workflow policies

  – Access and use policies

  – Policies associated with sustaining (or not!)

• Policies must be expressed for end users

• Policies must also be expressed for machine access
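To show what expressing the same policy for both end users and machines might look like, here is a deliberately small sketch. The `AccessPolicy` class, its fields, the role names, and the credit line are all hypothetical; real repositories typically use a dedicated policy language, which this does not attempt to model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical machine-readable policy attached to an object."""
    allowed_roles: frozenset
    credit_line: str = ""

    def permits(self, role: str) -> bool:
        """Machine-facing check: may a caller with this role access the object?"""
        return role in self.allowed_roles

    def describe(self) -> str:
        """Human-readable statement of the same rule, for end users."""
        roles = ", ".join(sorted(self.allowed_roles))
        return f"Access limited to: {roles}. Credit line: {self.credit_line}"


policy = AccessPolicy(
    allowed_roles=frozenset({"staff", "faculty"}),  # made-up roles
    credit_line="Courtesy of the Library",          # made-up credit line
)
print(policy.describe())
print(policy.permits("public"))
```

Deriving both the check and the description from one record keeps the two expressions of the policy from drifting apart.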

Indexing

• In a repository there is no “catalog”; the repository is the catalog

• Many indexes can be created for many reasons

• Either metadata or full content, or both

• Ontology-based indexes are rapidly becoming more feasible

• Keeping indexes updated is the trick

Indexing as a harvesting service

[Diagram: Fedora Repository Service with boxes labeled GSearch, OAI, Ingest, Simple JMS, Blacklight, and More…. Captions: “repository publishes events”; “services listen and consume events or other messages”]
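The publish-and-listen pattern on this slide can be sketched with an in-process event bus feeding a toy index. This is an assumption-laden illustration: the slide mentions JMS, whereas the `EventBus` below is a plain Python stand-in, and the event fields (`type`, `pid`, `title`), the event type names, and the `SearchIndex` class are invented for the example.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process stand-in for a message broker."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener) -> None:
        self._listeners.append(listener)

    def publish(self, event: dict) -> None:
        for listener in self._listeners:
            listener(event)


class SearchIndex:
    """Toy inverted index kept current by consuming repository events."""

    def __init__(self):
        self.terms = defaultdict(set)

    def handle(self, event: dict) -> None:
        pid = event["pid"]
        # Drop stale entries first so updates and deletes stay consistent.
        for pids in self.terms.values():
            pids.discard(pid)
        if event["type"] in ("ingest", "modify"):
            for word in event["title"].lower().split():
                self.terms[word].add(pid)

    def search(self, word: str) -> list:
        return sorted(self.terms.get(word.lower(), ()))


bus = EventBus()
index = SearchIndex()
bus.subscribe(index.handle)

bus.publish({"type": "ingest", "pid": "demo:1", "title": "Rossetti Poems"})
bus.publish({"type": "ingest", "pid": "demo:2", "title": "Rossetti Paintings"})
bus.publish({"type": "modify", "pid": "demo:2", "title": "Collected Paintings"})
bus.publish({"type": "purge", "pid": "demo:1"})
print(index.search("paintings"))
```

Because the index only reacts to events, any number of additional listeners could be subscribed without the repository knowing about them, which is one way to keep indexes updated.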