Agenda, Day 2 · Stanford University Libraries May 2011. Agenda • Stanford University ... •Some...

Post on 12-Jul-2020


Agenda, Day 2

08:30 – 08:35 Review of objectives and agenda

08:35 – 09:30 Infrastructure and tools

09:30 – 10:30 Case study: preservation activities at CDL

10:30 – 11:00 Morning break

11:00 – 12:00 Case study: preservation activities at Portico

12:00 – 12:30 Preservation initiatives and organizations: DataNet, DCC, DPC, IIPC, NDSA, OPF

12:30 – 14:00 Lunch

14:00 – Case study: preservation activities at Stanford

– 15:00 Other preservation resources

15:00 – 15:30 Afternoon break

15:30 – 16:00 Format characterization

16:00 – 16:30 Characterization in preservation workflows

16:30 – 17:00 Questions and discussion

Digital Preservation at Stanford University

Tom Cramer
Chief Technology Strategist
Stanford University Libraries
May 2011

Agenda

• Stanford University

• First & Second Generation Digital Library

• Digitization Efforts

• The Stanford Digital Repository

– Preservation Core

– Management

– Access

Stanford University

“The University of Stanford”?

Leland Stanford Junior University

Stanford University
• 15,000 students
– 8,000 graduate
– 7,000 undergraduate
• 2,000 faculty
• 35,000 total university community
• $3.4 billion annual operating budget
• $17.2 billion endowment
• Roots of Silicon Valley
• One of the world’s leading research universities

Stanford’s Digital Library c. 2007

Typical of all first generation digital libraries?

1st Generation Digital Libraries

• Small scale digitization, largely focused on text & images

• Purpose built systems for specific content types – application focus

• Highly theoretical approach to digital preservation

• Anemic UIs

2nd Generation Digital Libraries
• Large scale digitization
• With more content types
• Multi-pathway workflows
• Content use & reuse in an integrated environment
• Pragmatic approach to digital preservation & full lifecycle of objects
• Infrastructure & service focus

Digitization Trends -- Drivers

• Boutique → Large scale
• Text & image → text, image, audio, video, software and more
• Refresh of 1st generation delivery systems with contemporary UIs

Digitization Trends -- Responses

Replacing individual, handwrought schemes with workflow-based systems, largely automated, with QA, exception handling and reporting that work for multiple content streams.

Management of the full lifecycle of an object, from physical object management through capture, preservation & access

Digitization at SULAIR

1. Robotic Book Scanning Lab
2. Rare Book Scanning Lab
3. Map Scanning Lab
4. High End Imaging Lab
5. Multipurpose (Sheet Feed, et al) Lab
6. Media Preservation Lab
7. Digital Forensics Lab

Stanford’s Legacy Media Counts

More than 20,000 handheld media objects in Special Collections alone

Legacy Media & Digital Forensics

• Files, operating systems & software
• mss, correspondence, images, records, data, etc.
• Steps:
– Extraction
– Forensic analysis
– Archival processing & description
– Access & emulation
• Paradigm shift for archivists, donors

Lifecycle Management = Integration


Digitization & file processing are the easiest parts of any digitization initiative. Description, file management, collection management, access, and a holistic workflow uniting all the pieces are the real challenge.

Preservation at Stanford

• SDR has been in production since Dec 2006
• Now a second generation preservation system
• One component in a larger ecosystem of digital library infrastructure

[Timeline figure, 1997 to 2010: need identified (1997); “Dark Cave” concept; NDIIPP prototype; redesign; 1.0 in production; 2.0 conceived; 2.0 in production]

Three Major Areas of Preservation Needs
• Digital Library
– Legacy collections
– Digitized collections
– Licensed, locally loaded content
– Born digital collections
• Institutional Repository
– Research data
– Publications, dissertations
– Learning objects, university assets
• External Depositors
– Publishers
– Discipline-specific repositories
– Reciprocal deposits with peer institutions

Google Books (100s of TB)
Manuscripts (75 TB)
Media (50 TB)
Geospatial Data (10 TB)
~30 other digi projects (15 TB)
Purchased collections (25 TB)

E.g., Google-Scanned Books

Download, process and preserve 8 million volumes in SDR for...
• local indexing,
• text mining,
• selective delivery, and
• long-term access.

E.g., Monterey Jazz Festival

•Festival founded in 1958: longest running jazz festival in the world.

•Rich collection of recordings from inception, spanning over 50 years, in varying states of condition & decay.

•Archives held at Stanford’s Archive of Recorded Sound

•~800 audio recordings, 1.6 TB audio files in SDR

•~250 video recordings, 22 TB video files in SDR

Access:
- Complete database of digital recordings online at collections.stanford.edu/mjf
- Access via on-site visit to ARS
- New commercial releases on MJF Records

E.g., National Geospatial Digital Archive

• Some 27,000 “at risk” geospatial objects

• TIFFs, GeoTIFFs, Shapefiles, Digital Elevation Models, Digital Orthophoto Quadrangle files

E.g., Preserving Virtual Worlds

Stanford University Libraries Second Life Open House, 31 July 2009

E.g., Forensically Extracted Born Digital Files

•Digital Forensics lab extracting original computer files from legacy media

•Actively building pipeline from extraction to preservation store

•Support for both immediate and deferred archival processing & description

E.g., Electronic Theses and Dissertations

NSF Policy Position on Data Archiving 1

“NSF's policy position on data is straightforward:

1 National Science Foundation, Cyberinfrastructure Council. Cyberinfrastructure Vision for 21st Century Discovery. March 2007.

NSF and NIH Grants to Stanford

SDR 2.0 today

• 100+ TB of unique content
• 300+ TB of managed data
• 200,000+ objects
• 62,000,000 files
• 7 content types: books, images, audio, video, manuscripts, GIS data, software
• Integrated component of larger environment

2008: SDR 1.0 In Production & Working, BUT…

• Custom code, maintained by evolving & smaller team
– No reuse of code within Stanford, or larger community
• Bottlenecks
– Needed to be quicker to add new content types
– Needed to be quicker to add new collections
– Needed to decompose code into more granular components
• Largely a stand-alone system
– Lacked flexible Management services for streamlined, continuous content deposit workflows
– “Dark Archive”: no Access services for rich, self-service patron access

SDR 1.0 Architecture: Strongly Rooted in OAIS

SDR 2.0: New Technical Architecture
• Adopt Fedora as a metadata management system
– Clean mapping of new data model to Fedora content models
– Reuse same design pattern, core technology as in DOR
• Support for parallelized & asynchronous operations
– Multiple ingest streams to increase throughput
– Decompose one process (e.g., “ingest”) into discrete, loosely coupled operations (“checksum”, “package”, “transfer”)
• Adopt a RESTful architecture & common workflow service
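The idea of splitting “ingest” into loosely coupled steps and running several ingest streams at once can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SDR code: the step names come from the slide, but the function bodies, the dict-based object shape, the `store` dict standing in for the storage layer, and the `streams` parameter are all hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Each step is a small function that takes and returns a plain dict, so
# steps can be reordered, retried, or handed to separate workers.

def checksum(obj):
    digest = hashlib.sha256(obj["payload"]).hexdigest()
    return {**obj, "sha256": digest}

def package(obj):
    # Hypothetical packaging step: record the id and size with the payload.
    return {**obj, "package": {"id": obj["id"], "size": len(obj["payload"])}}

def transfer(obj, store):
    # Hypothetical transfer step: `store` stands in for the storage layer.
    store[obj["id"]] = obj
    return obj

def ingest(obj, store):
    """Run the three discrete operations in order for one object."""
    return transfer(package(checksum(obj)), store)

def ingest_all(objects, store, streams=4):
    """Ingest objects concurrently; `streams` is the number of parallel ingest streams."""
    with ThreadPoolExecutor(max_workers=streams) as pool:
        return list(pool.map(lambda o: ingest(o, store), objects))

objects = [{"id": f"obj{i}", "payload": f"content {i}".encode()} for i in range(8)]
store = {}
results = ingest_all(objects, store)
print(len(store), results[0]["package"])
```

Because each step only depends on its input dict, raising `streams` adds throughput without changing any step, which is the property the slide is after.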

SDR 2.0: New Technical Architecture

SDR 2.0: Robots & “WorkDo” Service

Complex Systems from Atomic Pieces

SDR 2.0: Revised Data Model
SDR 1.x’s METS-based SIP, AIP and DIP had many issues:
– Each Transfer Manifest was content & collection specific → Doesn’t scale
– Transfer manifests require too much interpretation and analysis to change, augment
– Too complex: Stanford METS structure breaks apart related data across the object
– Wraps (somewhat dynamic) metadata with (mostly static) data files in same envelope
– Recursive nature of transfer manifest makes versioning self-referential, complex
– No one speaks METS natively: depositors, SDR & clients all forced to perform translation at handshakes

Content Structures and Flavors of Metadata

• Flexible data model can take any type of data, packaged in “bags”
– A “bag” is a directory with standardized top-level structure and syntax
• Minimizes analysis & processing required on ingest
• Preserves options for future processing & transformations based on future needs

Each object has seven discrete metadata files:
– Identity metadata
– Descriptive metadata
– Content metadata (aka structural metadata)
– Technical metadata
– Rights metadata
– Source metadata
– Provenance metadata

SDR Deposits: Content Transfer via BagIt

druid/bagit-info.txt
    :
    Stanford-Content-Metadata: data/metadata/contentMetadata
    Stanford-Identity-Metadata: data/metadata/identityMetadata
    Stanford-Provenance-Metadata: data/metadata/provenanceMetadata
/data
    /metadata
        /contentMetadata
        /descMetadata
        /identityMetadata
        /provenanceMetadata
        /rightsMetadata
        /sourceMetadata
        /technicalMetadata
    /content
        /file1
        /file2
        :
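The directory layout on this slide can be reproduced with a short Python sketch using only the standard library. It is an assumption-laden illustration, not the SDR deposit tool: the identifier `druid-example`, the helper names `build_bag` and `list_bag`, and the empty placeholder metadata files are hypothetical, and `content/` is assumed to sit under `data/`. The result is also not a complete BagIt bag, since the BagIt specification additionally requires a `bagit.txt` declaration and a payload manifest, which the slide excerpt does not show.

```python
import tempfile
from pathlib import Path

# The seven metadata files named on the slide.
METADATA_FILES = [
    "contentMetadata", "descMetadata", "identityMetadata", "provenanceMetadata",
    "rightsMetadata", "sourceMetadata", "technicalMetadata",
]

def build_bag(root, druid, content):
    """Create <root>/<druid>/ with the slide's layout and return its path.

    `content` maps file names to bytes. Metadata files are written as empty
    placeholders; a real deposit would carry actual metadata.
    """
    bag = Path(root) / druid
    metadata_dir = bag / "data" / "metadata"
    content_dir = bag / "data" / "content"
    metadata_dir.mkdir(parents=True)
    content_dir.mkdir(parents=True)

    for name in METADATA_FILES:
        (metadata_dir / name).write_text("")
    for name, payload in content.items():
        (content_dir / name).write_bytes(payload)

    # Tag file pointing at selected metadata files, as in the slide excerpt.
    tags = {
        "Stanford-Content-Metadata": "data/metadata/contentMetadata",
        "Stanford-Identity-Metadata": "data/metadata/identityMetadata",
        "Stanford-Provenance-Metadata": "data/metadata/provenanceMetadata",
    }
    lines = [f"{key}: {value}" for key, value in tags.items()]
    (bag / "bagit-info.txt").write_text("\n".join(lines) + "\n")
    return bag

def list_bag(bag):
    """Return every file in the bag as sorted POSIX-style relative paths."""
    return sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file())

with tempfile.TemporaryDirectory() as tmp:
    bag = build_bag(tmp, "druid-example", {"file1": b"first", "file2": b"second"})
    listing = list_bag(bag)
    tag_text = (bag / "bagit-info.txt").read_text()

print("\n".join(listing))
```

Keeping the metadata files separate from the content files means either side can be updated or versioned without rewriting the other, which matches the complaint about the earlier single-envelope approach.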

Lessons Learned Over 5 Years

• Custom code, maintained by evolving & smaller team, was inefficient & unsustainable
– Adopted Fedora for metadata management, Hydra for application framework
– Shared technology & design patterns with rest of digital library ecosystem
– APIs for management, ingest, retrieval, reporting
• Bottlenecks
– Need to be quicker to add new content types & collections: simplify the data model, support “Zip & SIP”
– Need to increase the throughput to the storage layer led to parallelization of processes
• Need to refine & hone the SDR service model
– Complement Preservation with robust Management & Access services

Preservation Is One Leg of a Stool

• Preservation without Access is pointless
– Further, all signs indicate that it is not economically viable
• Access without Preservation is myopic
• Robust Management services are a prerequisite for accessioning, archiving and providing access to content
– The “pre-ingest” phenomenon

Can one system handle it all? or

Stanford’s Digital Library Ecosystem

Three Spheres: Management, Preservation and Access

Digitization, Deposit & Management

Preservation

Discovery & Delivery

Stanford Digital Repository (SDR): content agnostic, preservation repository

Specialty applications provide context-specific, user-facing deposit, and access services tailored to content types and disciplines

SDR in Stanford’s DL Ecosystem

Library Management Applications

EEMS (acquiring born digital content), digitization workflow, etc.

Institutional Repository

ETDs, open access articles, faculty “papers”, research data, web sites, etc.

SULAIR Digital Stacks

Delivery for text, images, mss, media, data, & curated collections

National Geospatial Digital Archive(NGDA)

Geospatial data

and SDR provides “back-office” preservation services: replication, auditing, migration, and retrieval in a secure, sustainable, scalable stewardship environment

E.g., Parker Manuscripts

•559 Anglo-Saxon manuscripts, 200,000 pages

•For each page:

22 MB JPEG2000 delivery surrogate
110 MB submaster TIFF
220 MB master TIFF

SDR – Preservation Core

Parker.stanford.edu: Rich web application, tailored for general public, medievalists
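A back-of-the-envelope calculation shows what these per-page figures imply for storage. The sketch below assumes each of the 200,000 pages has exactly one copy of each derivative listed and uses decimal units (1 TB = 1,000,000 MB); the slides do not state either assumption.

```python
PAGES = 200_000
SIZES_MB = {
    "JPEG2000 delivery surrogate": 22,
    "submaster TIFF": 110,
    "master TIFF": 220,
}

per_page_mb = sum(SIZES_MB.values())        # MB stored for one page
total_tb = PAGES * per_page_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB

print(f"{per_page_mb} MB per page, about {total_tb:.1f} TB in total")
```

Under those assumptions the collection comes to roughly 70 TB, which is of the same order as the 75 TB listed for manuscripts earlier in the deck, though the slides do not say how that figure was derived.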

Separation of Concerns

• Scoped repository: differentiation between preservation (provided by SDR) and
… content management (provided by DOR)
… access (provided by the Digital Stacks apps)
• Implications:
– Reduces pressure on SDR to be all things to all depositors, for all content
– Reinforces need to provide managed & secure storage at scale
– Reinforces requirement to focus on fixity and integrity services
– Emphasizes need to integrate SDR to management & access services through stable APIs
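A fixity service, at its simplest, records a checksum for every file and later re-computes the checksums to detect change or loss. The Python sketch below illustrates that idea only; it is not SDR’s implementation, and the function names, the SHA-256 choice, and the report format are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large masters do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def record_fixity(root):
    """Return {relative path: digest} for every file under `root`."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def audit(root, expected):
    """Compare current files against recorded digests."""
    root = Path(root)
    report = {"ok": [], "changed": [], "missing": []}
    for rel, digest in expected.items():
        path = root / rel
        if not path.is_file():
            report["missing"].append(rel)
        elif sha256_of(path) != digest:
            report["changed"].append(rel)
        else:
            report["ok"].append(rel)
    return report

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.tif").write_bytes(b"page one")
    (root / "b.tif").write_bytes(b"page two")
    (root / "c.tif").write_bytes(b"page three")
    baseline = record_fixity(root)

    (root / "b.tif").write_bytes(b"page 2wo")  # simulate silent corruption
    (root / "c.tif").unlink()                  # simulate a lost file
    report = audit(root, baseline)

print(report)
```

In practice the recorded digests would live apart from the files they describe, so that a single storage fault cannot damage both.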

Management: Hydra-based Applications

Under Development…
• SDR’s Front End – Institutional Repository for Stanford
• Hypatia – Archival Arrangement, Description & Access
• SDR Preservation Core Administrative Application

ETDs – Electronic Theses & Dissertations

SALT – Self-Archiving Legacy Toolkit

EEMs – Everyday Electronic Materials

Hydra

• Joint development project among Stanford, University of Virginia, University of Hull and Fedora Commons

• Based on Fedora, Active Fedora and Ruby on Rails

• Reuse Blacklight & Solr for search & browse within a Hydra application

Fundamental Assumption #1

No single system can provide the full range of repository-based solutions for a given institution’s needs,

…yet sustainable solutions require a common repository infrastructure.

For instance…

An ETD solution…
- Single PDF
- With auxiliary data files
- Simple, prescribed workflow
- Integrated with student administration system
- Streamlined UI for depositors, reviewers & readers

A digitization workflow system…
- Potentially hundreds of files per object
- Complex, branching workflow
- Sophisticated operator (back office) interfaces

A general purpose institutional repository
- Heterogeneous file types
- Simple to complex objects
- General purpose user interfaces

Distinct Application Needs

More than one dozen distinct repository application needs across three institutions.

• Electronic theses & dissertations
• Open access articles
• Data curation application(s)
• General purpose institutional repository
• Manuscript & archival collection delivery
• Library materials accessioning tools
• Digitization workflow system
• And more...

Shared, Primitive Functions
• Deposit – uploading simple or multipart objects, singly or in bulk
• Manage – editing an object’s content, metadata and permissions
• Search – full text and fielded search supporting both user discovery and administration
• Browse – sequential viewing of objects by collection, attribute or ad hoc filtering
• Deliver – viewing, downloading & disseminating objects through user and machine interfaces

Hydra Philosophy -- Technical
• Tailored applications and workflows for different content types, contexts and user interactions
• A common repository infrastructure
• Flexible, atomistic data models
• Modular, “Lego brick” services
• Library of user interaction widgets
• Easily skinned UI

One body, many heads

Fundamental Assumption #2

No single institution can resource the development of a full range of solutions on its own,

…yet each needs the flexibility to tailor solutions to local demands and workflows.

Hydra Philosophy -- Community
• An open architecture, with many contributors to a common core
• Collaboratively built “solution bundles” that can be adapted and modified to suit local needs
• A community of developers and adopters extending and enhancing the core
• “If you want to go fast, go alone. If you want to go far, go together.”

One body, many heads

Electronic Theses and Dissertations

• Automatic deposit to library as part of degree conferral
• Built in digital collection building
• Better access for patrons
• Reduced expenses for students, University, library processing
• Increased visibility of and access to Stanford research via catalog & Google
• Built in preservation through Stanford Digital Repository

Electronic Theses & Dissertations (ETD)

EEMs: Accessioning Born Digital Materials

Browser widget enables selector to capture the PDF, plus URL, title, author, copyright status, payment information, and comments, and route to Acquisitions.

EEMs: Accessioning Born Digital Materials

Dashboard enables item processing, ultimately leading to preservation in SDR and access via the catalog.

SALT: Digital Archives


• Archiving unstructured and semi-structured data

• Allow access to semi-processed information,
- with strong access & visibility controls
- leveraging full text & entity extraction
• Ongoing enrichment of the archive
- through self-annotation by the donor
- through crowd-sourcing description and organization

Component Based Architecture
• Fedora as a metadata store
• Well structured file system as data store
• Solr index for rapid data access
• Blacklight & Hydra: app logic & presentation
• Atomic Services
– “Robots”: simple, autonomous scripts, providing small units of work in reusable packages
– “Services” provide common operations that support workflows across the environment
• “WorkDo”: lightweight workflow to orchestrate cascade of services
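The division of labor between robots and a lightweight workflow service can be sketched as follows. This is a toy model built on assumptions, not the WorkDo service or its API: the `Workflow` class, the status values (“waiting”, “completed”, “error”), and the `run_robot` helper are hypothetical names chosen for the example. The workflow only tracks which step each object has reached, while each robot asks for ready objects, performs one small unit of work, and reports back.

```python
class Workflow:
    """Toy workflow service: tracks per-object step status and hands out work."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.status = {}  # object id -> {step name: "waiting" | "completed" | "error"}

    def register(self, object_id):
        self.status[object_id] = {step: "waiting" for step in self.steps}

    def ready(self, step):
        """Ids whose earlier steps are all completed and whose `step` is still waiting."""
        earlier = self.steps[: self.steps.index(step)]
        return [
            oid for oid, st in self.status.items()
            if st[step] == "waiting" and all(st[e] == "completed" for e in earlier)
        ]

    def update(self, object_id, step, state):
        self.status[object_id][step] = state

def run_robot(workflow, step, work):
    """A robot: fetch ready objects, do one unit of work on each, report the outcome."""
    for oid in workflow.ready(step):
        try:
            work(oid)
            workflow.update(oid, step, "completed")
        except Exception:
            workflow.update(oid, step, "error")

log = []

def make_work(step, fail_on=None):
    """Build a stand-in unit of work that logs its call and optionally fails for one id."""
    def work(oid):
        if oid == fail_on:
            raise RuntimeError(f"{step} failed for {oid}")
        log.append((step, oid))
    return work

wf = Workflow(["checksum", "package", "transfer"])
for oid in ["obj1", "obj2"]:
    wf.register(oid)

run_robot(wf, "checksum", make_work("checksum"))
run_robot(wf, "package", make_work("package", fail_on="obj2"))
run_robot(wf, "transfer", make_work("transfer"))

print(wf.status)
```

Here the failure of one step for `obj2` leaves its later step waiting without blocking `obj1`, which is the kind of exception handling a workflow-based system needs when many content streams share the same robots.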

DOR & Digital Stacks Architecture

Digital Library Ecosystem

Growth in Disk and Computing at SULAIR

Stanford’s Digital Library, 2011

The next generation of digital libraries will be complex ecosystems made up of simple components.

Separate systems for digitization, management, preservation and access will enable pieces to be mixed and matched, supporting content streams from a variety of sources, and access by a variety of communities, services and tools.

Photo by Alun Salt. Used under CC Attribution-ShareAlike 2.0 Generic license.

LOCKSS
• Lots of Copies Keeps Stuff Safe
• Originated at Stanford University
• Peer-to-peer, decentralized digital preservation system
• Focus is on scholarly articles
– 7100 e-journal titles, 470 publishers
– Collects web-based content
– Preserves it locally
– Provides 100% post-cancellation access
– Done with publisher permission

LOCKSS

Capture & Replication

LOCKSS

Audit & Healing
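The general idea behind auditing and healing replicated copies can be illustrated with a deliberately simplified majority vote over content hashes. This is a toy sketch and not the actual LOCKSS polling protocol, which is considerably more elaborate; the peer names, the `audit_and_heal` function, and the strict-majority rule are assumptions made for the example.

```python
import hashlib
from collections import Counter

def digest(content):
    return hashlib.sha256(content).hexdigest()

def audit_and_heal(copies):
    """Compare peers' copies by hash and overwrite minority copies with the majority content.

    `copies` maps a peer name to the bytes it holds. Returns the list of
    peers that were repaired. Raises ValueError if no strict majority agrees.
    """
    votes = Counter(digest(c) for c in copies.values())
    winner, count = votes.most_common(1)[0]
    if count * 2 <= len(copies):
        raise ValueError("no majority; cannot decide which copy is good")

    good = next(c for c in copies.values() if digest(c) == winner)
    repaired = []
    for peer, content in copies.items():
        if digest(content) != winner:
            copies[peer] = good
            repaired.append(peer)
    return repaired

copies = {
    "library-a": b"article text",
    "library-b": b"article text",
    "library-c": b"article t3xt",  # damaged copy
    "library-d": b"article text",
}
repaired = audit_and_heal(copies)
print(repaired)
```

The sketch shows why many independently held copies matter: a damaged copy can only be identified and repaired when enough other copies exist to outvote it.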

LOCKSS
• Commodity Hardware & Open source software & Appliance = very low cost
• Follows traditional model of library-based distribution and preservation
– Lots of Copies
– Locally Managed Copies
• Publisher permissions ensure legal coverage
• Extensible to other collections

LOCKSS
• CLOCKSS: Controlled LOCKSS
– Not-for-profit archive for ensuring access to orphaned scholarly content
– One dozen major publishers + libraries
• Private LOCKSS Networks
– Alabama Digital Preservation Network
– Arizona State Library, Archive & Public Records
– Council of Prairie & Pacific University Libraries Consortium
– Data Preservation Alliance for the Social Sciences
– Digital Commons – Berkeley Electronic Press
– MetaArchive Cooperative Project
– Digital Federal Depository Library Program