Digital Preservation Cloud Services for Libraries and Archives
DLF 2011
Baltimore, MD
Quyen L. Nguyen – NARA
Outline
Introduction
LDPaaS
Levels of Service and Cost Model
Related Work
Conclusion
Oct. 31, 2011, 2011 DLF Forum
Functional Requirements
Need for Long-Term Digital Preservation
– Policy mandates: retention of governments’ records
– Knowledge function: preserve digitized books and born-digital materials
– History-oriented mandates: preservation of cultural heritage
Challenges
– Rapid growth of digital objects that require archiving.
– Data heterogeneity
Desired System Characteristics
Dynamic Scalability
– Increase as well as decrease
Cost-effective Maintainability
– Operation cost
– Patches: COTS, security.
Evolvability
– Technology refresh
– New features and services
Cloud Computing Characteristics
Elasticity
– Computing and storage resources
– Three levels of cloud services: IaaS, PaaS, and SaaS.
– Quick Provisioning (e.g. Cloud Market [3])
– Pay-as-you-go
Cost-efficient Maintenance
– Economies of scale
– Maximizing utilization of computing resources
Evolvability by configuration
OAIS Reference Model
LDPaaS
Long-term Digital Preservation as a Cloud Service
– Encompass major OAIS functionalities
– Not only storage service,
– But also preservation service according to customer’s
policies: retention period, preservation level, and access
level.
Beneficial to Cloud Service Consumer
– Relieve records owners from the burden of engineering
and provisioning preservation infrastructure
Beneficial to Cloud Service Provider
– Realize economies of scale by sharing unused computing resources
Ingest Provisioning Challenges
Unpredictability due to business policies
– Uneven flow of transfer volume
– Varying object sizes, and hence varying object counts per transfer
– Various object types
Cloud Computing benefits:
– Computation resources: file format identification and application of integrity seal
– Storage resources: Ingest processing Buffer Space
Access Provisioning Challenges
Unpredictability of publishing
– Volume of publishable data sets
Spikiness of Access request load
Access types: Storage Delivery Networks vs Content
Delivery Networks.
Cloud Computing benefits
– Computation: access-time visualization, zooming, conversion to
access format
– Storage: High-efficiency Access disk cache
Preservation Provisioning Challenges
Prominent preservation methods:
– Bit-level: error detection and correction capabilities
– Transformation: computing resources for transformation processes; storage serves as a scratchpad for transformation
– Emulation: virtual machine requirements
Cloud Computing benefits
– Computation: Execution of Preservation Algorithms
– Storage: Preservation Processing Buffer Space
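Bit-level error detection is commonly done by recomputing a checksum and comparing it with one recorded earlier. The sketch below is only an illustration under stated assumptions: SHA-256 is an assumed digest choice, the function names are hypothetical, and "correction" is reduced to picking the first replica whose digest still matches.

```python
import hashlib
from typing import List, Optional


def digest(data: bytes) -> str:
    """Assumed fixity value: the SHA-256 hex digest of the object's bytes."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, recorded_digest: str) -> bool:
    """Error detection: does the stored copy still match its recorded digest?"""
    return digest(data) == recorded_digest


def repair_from_replicas(replicas: List[bytes], recorded_digest: str) -> Optional[bytes]:
    """Error correction by replacement: return the first intact replica, if any."""
    for copy in replicas:
        if is_intact(copy, recorded_digest):
            return copy
    return None


if __name__ == "__main__":
    original = b"archival record"
    recorded = digest(original)
    damaged = b"archival recorc"
    print(is_intact(damaged, recorded))
    print(repair_from_replicas([damaged, original], recorded) == original)
```

Each check reads the whole object, which is the kind of bursty computation the slide assigns to cloud resources.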
Storage Provisioning Challenges
It is all about Storage capacity
Scale of storage requirement and the role it may be best suited to:
– Hyper large-scale: Cloud Provider
– Moderate-to-small-scale: Cloud Consumer
Could there be a Community Cloud?
Software Paradigms
Structural, Object-oriented, SOA, Cloud
(Diagram label: Virtualization)
System Architecture
SOA-based Ingest Process
Ingest, with component services:
– Virus Scan
– File Format Identification
– Metadata Extraction
– Integrity Seal
– Move to Preservation Storage
Tools named in the diagram: DROID, JHOVE
• Ingest Process implemented as composite service
• Could be implemented by BPEL.
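The composite-service idea can be sketched as an ordered chain of small steps applied to each transferred object. This is an illustration only, not the presented system: every step function is a hypothetical stand-in (a real deployment would call an antivirus engine and tools such as DROID or JHOVE, possibly orchestrated with BPEL), format identification is reduced to a magic-byte lookup, and SHA-256 is an assumed choice of integrity seal.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IngestItem:
    """One transferred object plus the metadata accumulated during ingest."""
    name: str
    payload: bytes
    metadata: Dict[str, str] = field(default_factory=dict)
    location: str = "ingest-buffer"


def virus_scan(item: IngestItem) -> IngestItem:
    # Placeholder: a real service would delegate to an antivirus engine.
    item.metadata["virus_scan"] = "skipped-in-sketch"
    return item


def identify_format(item: IngestItem) -> IngestItem:
    # Toy magic-byte lookup standing in for a format identification tool.
    signatures = {b"%PDF": "PDF", b"\x89PNG": "PNG"}
    item.metadata["format"] = next(
        (fmt for magic, fmt in signatures.items() if item.payload.startswith(magic)),
        "unknown",
    )
    return item


def extract_metadata(item: IngestItem) -> IngestItem:
    item.metadata["size_bytes"] = str(len(item.payload))
    return item


def apply_integrity_seal(item: IngestItem) -> IngestItem:
    # Assumption: the seal is a SHA-256 digest of the payload.
    item.metadata["seal_sha256"] = hashlib.sha256(item.payload).hexdigest()
    return item


def move_to_preservation_storage(item: IngestItem) -> IngestItem:
    item.location = "preservation-storage"
    return item


Step = Callable[[IngestItem], IngestItem]

INGEST_STEPS: List[Step] = [
    virus_scan, identify_format, extract_metadata,
    apply_integrity_seal, move_to_preservation_storage,
]


def ingest(item: IngestItem, steps: List[Step] = INGEST_STEPS) -> IngestItem:
    """Composite service: run each component service in order."""
    for step in steps:
        item = step(item)
    return item


if __name__ == "__main__":
    result = ingest(IngestItem("report.pdf", b"%PDF-1.4 example"))
    print(result.location, result.metadata["format"])
```

Because the composite is just a list of steps, a shorter list could stand for a reduced ingest offering in which only some component services run.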
LDPaaS Levels of Service
Service Levels
Ingest
IL1: Transfer Only
IL2: With Format Identification
IL3: Metadata Extraction
Preservation
PL1: Bit
PL2: Content
PL3: Content, Behavior & Formatting
Discovery
DL1: Metadata search
DL2: Full content search
Access
AL1: Passive Viewer
AL2: Interactive Viewer
AL3: Content Mining
Storage
SL1: Delayed Access - Near-Line Storage
SL2: Rapid Access - High Performance Storage
Content Server
CL1: Just-in-Time Active
CL2: Always Active
Level of Service Definitions
Definition 1.
Each Content Server has a set of LoS formalized by the following 6-tuple:
C = (CL, IL, PL, DL, AL, SL).
Definition 2.
Since a customer can have one or more Content Servers, a customer’s
SLA is specified by the n-tuple:
L = (C1, …, Cn), if the customer has signed up for n Content Servers,
with each Ci being a 6-tuple defined according to Definition 1.
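The two definitions map naturally onto a small data structure. The sketch below is one possible encoding, not part of the presentation: the class name `ContentServerLoS`, the validation table, and the `SLA` alias are hypothetical, while the level codes are the ones listed under Service Levels.

```python
from dataclasses import dataclass
from typing import Tuple

# Allowed level codes per service, taken from the Service Levels list.
ALLOWED_LEVELS = {
    "CL": {"CL1", "CL2"},
    "IL": {"IL1", "IL2", "IL3"},
    "PL": {"PL1", "PL2", "PL3"},
    "DL": {"DL1", "DL2"},
    "AL": {"AL1", "AL2", "AL3"},
    "SL": {"SL1", "SL2"},
}


@dataclass(frozen=True)
class ContentServerLoS:
    """Definition 1: C = (CL, IL, PL, DL, AL, SL)."""
    CL: str
    IL: str
    PL: str
    DL: str
    AL: str
    SL: str

    def __post_init__(self) -> None:
        # Reject codes that are not defined for the corresponding service.
        for service, allowed in ALLOWED_LEVELS.items():
            value = getattr(self, service)
            if value not in allowed:
                raise ValueError(f"{value!r} is not a valid {service} level")

    def as_tuple(self) -> Tuple[str, ...]:
        return (self.CL, self.IL, self.PL, self.DL, self.AL, self.SL)


# Definition 2: an SLA is the n-tuple of a customer's Content Servers.
SLA = Tuple[ContentServerLoS, ...]

if __name__ == "__main__":
    c1 = ContentServerLoS("CL1", "IL2", "PL2", "DL1", "AL2", "SL2")
    sla: SLA = (c1,)
    print(sla[0].as_tuple())
```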
LoS - Example 1
Digital Library Repository
Define Content Server C1 by C1 = (CL1, IL2, PL2, DL1, AL2, SL2)
Content Server CL1 - Active Just-in-Time - this repository is sporadically used
Ingest Service IL2 - File Format Identification
Preservation Service PL2 - Preservation at the Content Level
Discovery Service DL1 - Metadata Search
Access Service AL2 - Interactive Viewer is provided for access
Storage Service SL2 - Rapid Access, High Performance Disk - the volume is static
LoS - Example 2
Digital Library Repository for Research Publications
Two Sets of Records Stored in Two Different Content Servers: C1 and C2
C1 - Relatively Small Volume of High-Demand Digital Assets
C1 = (CL1, IL2, PL3, DL1, AL1, SL2)
CL1 - Active Just-in-Time Content Server
IL2 - File Format Identification
PL3 - Preservation at the Content and Formatting Level
DL1 - Metadata Search
AL1 - Passive Viewer
SL2 - High Performance, Rapid Access Storage
C2 - Backend Repository, Volume Increasing with Time
C2 = (CL2, IL2, PL3, DL2, AL1, SL1)
CL2 - Always Active Content Server
IL2 - File Format Identification
PL3 - Preservation at the Content and Formatting Level
DL2 - Full Content Search
AL1 - Passive Viewer
SL1 - Delayed Access Storage
LoS - Example 3
Sarbanes-Oxley Act Compliance Business Archive
Retain and Preserve Records in a Sliding Time Window of Seven Years
C1 = (CL1, IL2, PL1, DL2, AL1, SL1)
PL1 - Preservation Service at the Bit Level
Retention Period of Seven Years – Elaborate Preservation not Needed
SL1 - Delayed Access Storage
Archive Intended for Audit Purposes Only - Rapid Access to Data not Essential
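A sliding retention window amounts to a date comparison that is re-evaluated as time passes. The sketch below assumes each record carries a creation date and that the window is measured in calendar years back from the current date; the function names and the "retain"/"expired" labels are hypothetical, and only the seven-year period comes from the example.

```python
from datetime import date
from typing import Dict, List

RETENTION_YEARS = 7  # from the example; everything else here is assumed


def window_start(today: date, years: int = RETENTION_YEARS) -> date:
    """Earliest creation date that is still inside the retention window."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # today is Feb 29 and the target year is not a leap year
        return today.replace(year=today.year - years, day=28)


def split_by_retention(records: Dict[str, date], today: date) -> Dict[str, List[str]]:
    """Partition record ids into those to retain and those past the window."""
    start = window_start(today)
    result: Dict[str, List[str]] = {"retain": [], "expired": []}
    for record_id, created in sorted(records.items()):
        result["retain" if created >= start else "expired"].append(record_id)
    return result


if __name__ == "__main__":
    sample = {"ledger-a": date(2003, 5, 1), "ledger-b": date(2009, 5, 1)}
    print(split_by_retention(sample, date(2011, 10, 31)))
```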
Cost Model
Cost is one of the crucial elements in Cloud Computing
Let O = (V, N) be a body of N digital objects with total volume V
Cost(O, Service) depends on the level of service.
– Function of V or N or both.
Examples:
fIL1 - Utilization Cost for Digital Object Transfer, varies with V
fIL2 - File Type Identification, varies with N
fIL3 - Metadata Extraction, varies with N
totalCost(O, C) = Σ Cost(O, Service), summed over
Service ∈ {Content Server, Ingest, Preservation, Discovery, Access, Storage}
Cost Model Example
Let C1 = (CL2, IL2, PL1, DL1, AL1, SL1). Assume:
fCL2(V, N) = 20 V + 100 N;
fIL2(N) = 10 N;
fPL1(V) = 20 V;
fDL1(N) = 30 N;
fAL1(V) = 30 V;
fSL1(V) = 40 V.
Then totalCost(O, C1) = 110 V + 140 N.
For Set O1 of Objects with V1 = 10 GB and N1 = 10^6
totalCost(O1, C1) = 110 × 10 + 140 × 10^6 = 140,001,100
For Set O2 of Objects with V2 = 10^3 GB and N2 = 10^2
totalCost(O2, C1) = 110 × 10^3 + 140 × 10^2 = 124,000
Note: totalCost(O2, C1) < totalCost(O1, C1), although V2 > V1, because the per-object terms dominate
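The example can be checked mechanically by summing the per-level cost functions over the Content Server tuple. This is a minimal sketch, assuming the slide's illustrative coefficients as listed and reading N1, V2 and N2 as powers of ten (10^6, 10^3, 10^2); the names `COST_FUNCTIONS` and `total_cost` are hypothetical.

```python
from typing import Callable, Dict, Tuple

CostFn = Callable[[float, float], float]  # (V in GB, N objects) -> cost

# Per-level cost functions as assumed in the example.
COST_FUNCTIONS: Dict[str, CostFn] = {
    "CL2": lambda v, n: 20 * v + 100 * n,
    "IL2": lambda v, n: 10 * n,
    "PL1": lambda v, n: 20 * v,
    "DL1": lambda v, n: 30 * n,
    "AL1": lambda v, n: 30 * v,
    "SL1": lambda v, n: 40 * v,
}


def total_cost(volume_gb: float, num_objects: float, los: Tuple[str, ...]) -> float:
    """Sum the cost of every service level in the Content Server tuple."""
    return sum(COST_FUNCTIONS[level](volume_gb, num_objects) for level in los)


if __name__ == "__main__":
    c1 = ("CL2", "IL2", "PL1", "DL1", "AL1", "SL1")
    cost_o1 = total_cost(10, 10**6, c1)     # O1: small volume, many objects
    cost_o2 = total_cost(10**3, 10**2, c1)  # O2: large volume, few objects
    print(cost_o2 < cost_o1)
```

Keeping the functions in a lookup table means a different price schedule only changes the table, not the summation.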
Related Work
CiteSeer study by Teregowda [2]:
– Examine each service in the architecture stack in terms of feasibility
and cost of migrating and hosting in the Cloud.
– Possible integration with Cloud Storage thanks to current virtualized
storage component.
DuraCloud [5]:
– Open source platform for digital libraries and archives
– Adapters to commercially available Cloud Storage services
Strategies and SLAs for bit-level preservation by Zierau [6]:
– Various sub-levels of bit-preservation.
www.cloudpreservation.com: archives and indexes data
from websites and social networks.
www.ltdprm.org/ - Long-Term Digital Retention and
Preservation Reference Model: cloud-based digital archive.
Conclusion
Proposed LDPaaS concept: why is it useful?
– Beneficial to large organizations
– Beneficial to small organizations
Notional cost model useful for establishing a price
model associated with published SLA set.
Contend that Cloud Storage Service vendors can
augment their portfolios to provide LDPaaS.
Community Cloud for Preservation
– Environment for more collaboration and sharing
References
1. Michael Armbrust et al. “A View of Cloud Computing”. Communications of the ACM,
Volume 53, No 4, April 2010.
2. P. Teregowda, B. Urgaonkar, and C. L. Giles. “Cloud Computing: A Digital
Libraries Perspective”. 2010 IEEE 3rd International Conference on Cloud Computing,
Miami, FL, July 2010.
3. Stephen Abrams, Patricia Cruse, and John Kunze. “Preservation Is Not a Place”.
The International Journal of Digital Curation, Issue 1, Volume 4, 2009.
4. Steve Hitchcock, David Tarrant, Adrian Brown, Ben O’Steen, Neil Jefferies, and
Leslie Carr. “Towards Smart Storage for Repository Preservation Services”. The
International Journal of Digital Curation, Issue 1, Volume 5, 2010.
5. DuraCloud. Available: http://www.duraspace.org/duracloud.php.
6. Eld Zierau, Ulla Bogvad Kejser, and Hannes Kulovits. “Evaluation of Bit Preservation
Strategies”. 7th International Conference on Preservation of Digital Objects
(iPRES2010), Sep. 19-24, 2010, Vienna, Austria.
Disclaimer
The content of this presentation is the personal opinion of
the author and does not necessarily reflect any position of
the U.S. Government or the National Archives and Records
Administration.
Thank You!
Any questions?
mailto:quyen.nguyen@nara.gov