An Introduction to the Merritt Curation Repository
University of California Curation Center Team, California Digital Library
June 9, 2011
UC3 Summer Webinar Series
First, a word about the webinar series…
• A forum for timely topics of interest to the UC community
– Highlighting projects, services, and developments in the areas of digital preservation, web archiving, and data curation
– Intended to raise awareness of issues, and provide information on useful resources and services available to the UC community
– 2nd and 4th Thursday of the month, and as scheduled, featuring UC3 staff and UC librarians, content managers, and technologists
Teleconference +1 (866) 740-1260, access code 9879016#
Webconference http://bit.ly/jdjMAP
First, a word about the webinar series…
• Some logistics…
– Participant phones will be muted during the formal presentation, but we will be monitoring the online chat
– Slides, Q & A, and web and voice recordings will be posted after each presentation
– Schedule available at http://www.cdlib.org/uc3/uc3webinars.html
– Please suggest additional [email protected]
– Take the short survey http://www.surveymonkey.com/s/XSGWP8R
Now on with the show…
• Today’s topic is an introduction to the Merritt curation repository
– Who is it for?
– What can it do?
– Why use it?
– What does it cost?
– Next steps?
– Q & A
What keeps you up at night?
Are there standards or best practices I should be aware of?
How much will it cost?
How can I transfer my content to an appropriate curation environment?
How do I know my content is safe?
What’s the best strategy to ensure permanent availability?
Do I need to create new derivatives just for preservation purposes?
How can I get a persistent reference to my content?
What if my content needs to evolve over time?
Can I control who can see my content?
I have a good discovery platform; how can I add preservation services?
“There’s an app for that”
Automatic replication and high-availability redundancy
Periodic fixity audit
Simple submission UI/API
METS “feeder” duplicates existing DPR workflow
Model free
No packaging, format, or metadata requirements
Strongly versioned
Integration with EZID and DataCite
Curator-defined access control rules
Modular micro-services “toolkit”
UC3 consultation
Storage at $1.04/GB/year
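The periodic fixity audit listed above can be illustrated with a minimal sketch: recompute each file's digest and compare it with the digest recorded at ingest. The function name, the choice of SHA-256, and the in-memory dictionaries are assumptions made for illustration; the slides do not describe how Merritt implements its audit.

```python
import hashlib


def audit_fixity(stored_files, recorded_digests):
    """Return the sorted names of files whose digest no longer matches.

    stored_files: dict mapping a file name to its current bytes
        (hypothetical stand-in for repository storage).
    recorded_digests: dict mapping a file name to the SHA-256 hex digest
        recorded when the file was ingested.
    """
    failures = []
    for name, content in stored_files.items():
        current = hashlib.sha256(content).hexdigest()
        if current != recorded_digests.get(name):
            failures.append(name)
    return sorted(failures)


# Record digests at "ingest", then simulate silent corruption of one file.
original = {"a.tif": b"master image", "b.wav": b"sound recording"}
recorded = {name: hashlib.sha256(data).hexdigest() for name, data in original.items()}
damaged = {**original, "b.wav": b"sound recordinG"}

print(audit_fixity(original, recorded))  # []
print(audit_fixity(damaged, recorded))   # ['b.wav']
```

In a replicated setting, a file that fails the audit could presumably be repaired from another copy, which is one reason replication and fixity checking are usually paired.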
Merritt repository
• Merritt is available for use by all members of the UC community
– Libraries/archives/museums
– ORU/MRUs
– Faculty/staff
• Centrally hosted by UC3/CDL on behalf of the UC community
– Economies of scale
– Shared experience and expertise
Mediated through campus libraries
Modes of use: dark archive
• Pro-active preservation, but no expectation of direct end user access
– Legacy DPR content contributed by campus libraries
– Cultural heritage texts, master images, sound, moving image, data sets
– All DPR content will be automatically migrated to Merritt
Modes of use: bright archive
• Provide preservation and end user access
– NIH Healthy Pathways project on bio-demographics
• Multi-institutional: UC Davis, University of Colorado, University of Virginia, Syddansk University (Denmark)
• Need to restrict access to project partners initially, with eventual public access
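The access pattern described above (partners only at first, public later) can be sketched as a simple rule. This is an illustration only: the function, the group names, and the release date are hypothetical and do not reflect Merritt's actual access control interface.

```python
from datetime import date


def can_read(user_groups, partner_groups, public_from, today):
    """Decide read access for a collection with a planned public release.

    user_groups / partner_groups: sets of group names (hypothetical).
    public_from: date when the collection opens to everyone, or None if
        no release date has been set yet.
    """
    if user_groups & partner_groups:
        return True  # project partners can always read
    return public_from is not None and today >= public_from


partners = {"uc-davis", "colorado", "virginia", "syddansk"}
release = date(2013, 1, 1)  # made-up release date for the example

print(can_read({"uc-davis"}, partners, release, date(2011, 6, 9)))  # True
print(can_read({"public"}, partners, release, date(2011, 6, 9)))    # False
print(can_read({"public"}, partners, release, date(2013, 6, 1)))    # True
```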
Modes of use: bright archive
• Content discovery: search
Modes of use: bright archive
• Content discovery: browse
Modes of use: preservation “back end”
• Preservation only; content discovery/delivery provided by well-known external systems
– Using direct hooks into Merritt to retrieve content
– eScholarship: open access publishing
– Open Context: archaeological data publishing
– Investigating integration with Islandora/Drupal and Alfresco
Modes of use: distributed data grids
• DataONE “Enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it”
More information
• Online help http://merritt.cdlib.org/help
• FAQ http://merritt.cdlib.org/docs/merritt_handout.pdf
• User’s guide http://merritt.cdlib.org/docs/merritt_user_guide.pdf
• UC3 contact http://www.cdlib.org/uc3/[email protected]
Merritt cost model
• UC3 provides technical infrastructure, data center hosting, staff, monitoring, maintenance, enhancements, help, outreach, consultation, etc.
• Contributors are charged only for storage used, at the UC3 recovery rate of $1.04/GB/year
• Developing an “endowment” model: Pay once, preserve forever
• Will soon extend model for non-UC contributors
How does this compare?
• Cost of a physical book in RLF † $ 4.62/year
• Cost of a digital book in HathiTrust ‡ $ 0.15/year
• Cost of a digital book in Merritt $ 0.06/year
† Gary Lawrence (2007) Internal analysis, CDL; ‡ Paul Courant and Matthew Nielsen (2010), On the cost of keeping a book, HathiTrust.
Average collection sizes and costs
Collection        Objects   Size       Annual cost
CA DOE reports      8,000    12.0 GB   $  12.48
Cal Cultures          420    65.6 GB   $  68.22
eScholarship       46,425   118.6 GB   $ 123.34
A “cost calculator” spreadsheet is available at http://www.cdlib.org/uc3/docs/Merritt-cost-calculator-v3.xlsx
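The arithmetic behind these figures is simply size times rate. A minimal sketch follows, assuming the flat $1.04/GB/year rate from the cost model and rounding to the nearest cent; the function name and the rounding mode are assumptions, and the spreadsheet itself may work differently.

```python
from decimal import Decimal, ROUND_HALF_UP

RATE_PER_GB_YEAR = Decimal("1.04")  # UC3 recovery rate quoted in the slides


def annual_cost(size_gb):
    """Annual storage cost in dollars for size_gb gigabytes, rounded to cents."""
    cost = Decimal(str(size_gb)) * RATE_PER_GB_YEAR
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


collections = [("CA DOE reports", 12.0), ("Cal Cultures", 65.6), ("eScholarship", 118.6)]
for name, size_gb in collections:
    print(f"{name}: ${annual_cost(size_gb)}")
# CA DOE reports: $12.48
# Cal Cultures: $68.22
# eScholarship: $123.34
```

The three results match the annual costs listed for the sample collections.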
Average ETD size and cost
Campus            ETD titles   Size      Annual cost
Berkeley                 797   12.4 GB   $ 12.88
Davis                    837   13.0 GB   $ 13.52
Irvine                   390    6.1 GB   $  6.30
Los Angeles              720   11.2 GB   $ 11.63
Riverside                192    2.9 GB   $  3.10
San Diego                558    8.7 GB   $  9.02
San Francisco *          560    8.7 GB   $  9.05
Santa Barbara            325    5.0 GB   $  5.25
Santa Cruz               155    2.4 GB   $  2.50
Based on 2009 holdings in ProQuest
* UCSF based on total ETD holdings in Merritt
Average research data size and cost
• Almost 50% of all research data is less than 1 GB
Source: Science 331:6018 (February 11, 2011): 692-693 <DOI: 10.1126/science.331.6018.692>
Size             Percentage   Annual cost
< 1 GB           48.3 %       < $ 1.04
1 – 100 GB       32.0 %       $ 1.04 – 104.00
100 GB – 1 TB    12.1 %       $ 104.00 – 1,040.00
> 1 TB            7.6 %       > $ 1,040.00
Next steps
• UC3 is working with campus partners to determine ongoing development and collection priorities
Replication
IdM/Authn/Authz
Ingest, Access Inventory, Queuing
Storage and Identity
Technology watch
Metadata standards
Policy and business model
Data management guidelines
Object and collection modeling
New content acquisition
Next steps
In production
• Model-free objects
• Submission via UI and API
• Persistent identifiers
• Format identification
• Version provenance
• Automated replication
• Automated fixity audit
• Role-based access control
• Collections
• Semantic index and search
• Object/version/file download
In progress
• Simplified update
• Enhanced characterization (JHOVE2)
• Faceted search and browse (XTF)
• CMS/DAMS-like function (Islandora)
In planning
• Simplified batch
• UCTrust integration
• Linked data
• Transformation
• Notification
• Annotation
• Support for NGTS/DLSTF recommendations
We welcome your feedback on needs and priorities!
http://www.cdlib.org/uc3/[email protected]
Simplified update
• Variant form of object update requiring the submission of only the changed components
• Client-side tools to simplify the creation of batch manifests
#%checkm_0.7
#%profile | http://uc3.cdlib.org/registry/ingest/mani
#%prefix | mrt: | http://merritt.cdlib.org/terms#
#%prefix | nfo: | http://www.semanticdesktop.org/onto
#%fields | nfo:fileUrl | nfo:hashAlgorithm | nfo:hash
http://merritt.cdlib.org/samples/goldenDragon.jpg | m
http://merritt.cdlib.org/samples/tumbleBug.jpg | md5
http://merritt.cdlib.org/samples/generalDrapery.jpg |
http://merritt.cdlib.org/samples/generalDrapery.jpg |
#%eof
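A client-side tool of the kind mentioned above might generate such a manifest along the following lines. This is a sketch under stated assumptions: the sample manifest is clipped on the slide, so the profile URL is passed in rather than guessed, only the three fields visible in the #%fields line are written, and the function name and example URLs are hypothetical.

```python
import hashlib

# Field names as they appear in the slide's "#%fields" line.
FIELDS = ["nfo:fileUrl", "nfo:hashAlgorithm", "nfo:hash"]


def build_manifest(files, profile_url, prefixes=None):
    """Return manifest lines in the pipe-delimited layout shown on the slide.

    files: iterable of (file_url, content_bytes) pairs.
    profile_url: manifest profile URL (clipped on the slide, so supplied
        by the caller).
    prefixes: optional dict of prefix -> namespace URL for #%prefix lines.
    """
    lines = ["#%checkm_0.7", f"#%profile | {profile_url}"]
    for prefix, namespace in (prefixes or {}).items():
        lines.append(f"#%prefix | {prefix} | {namespace}")
    lines.append("#%fields | " + " | ".join(FIELDS))
    for url, content in files:
        digest = hashlib.md5(content).hexdigest()
        lines.append(f"{url} | md5 | {digest}")
    lines.append("#%eof")
    return lines


manifest = build_manifest(
    [("http://example.org/sample.jpg", b"hello")],  # placeholder URL and content
    profile_url="http://example.org/placeholder-profile",
    prefixes={"mrt:": "http://merritt.cdlib.org/terms#"},
)
print("\n".join(manifest))
```

Computing the digest on the client lets the repository verify that each file arrived intact when it fetches the URL.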
Enhanced characterization
• JHOVE2 next-generation framework for format-aware characterization http://jhove2.org/
– Automated extraction and inference of extensive technical metadata significant for preservation analysis and planning
"Module": {
  "scope": "ICCModule",
  "Header": {
    "scope": "ICCHeader",
    "ProfileSize": { "unit": "byte", "value": 60960 },
    "ProfileVersionNumber": "4.2.0.0",
    "ProfileDeviceClass_raw": "spac",
    "ProfileDeviceClass_descriptive": "ColorSpace Conversion profile",
    "ColourSpace_raw": "RGB ",
    "ColourSpace_descriptive": "rgbData",
    "ProfileConnectionSpace_raw": "Lab ",
    "ProfileConnectionSpace_descriptive": "labData"
Enhanced discovery via XTF
• eXtensible Text Framework http://xtf.cdlib.org/
– CDL developed/supported open source discovery platform
– Robust, scalable faceted search and browse
CMS/DAMS-like function
• Many campuses are looking for CMS/DAMS solutions
• Investigating integration with Islandora to provide a Drupal CMS/DAMS front-end to Merritt
http://islandora.ca/
http://drupal.org/
Questions?
Upcoming webinars
Date/time                       Topic
Wednesday, June 15, 12:30 pm    Data Sharing by Scientists: Practices and Perceptions (Carol Tenopir, Univ. Tennessee; Mike Frame, USGS)
Thursday, June 30, 2:00 pm      The Data Management Planning Tool (DMP Tool) (Trisha Cruse, UC3)
Thursday, July 14, 2:00 pm      Data as Publication (John Kunze, UC3; Catherine Mitchell, CDL Publishing Program)
Thursday, July 28, 2:00 pm      Merritt: Depositing Content and Providing Access
Thursday, August 11, 2:00 pm    DCXL (Data Curation Excel)
http://www.cdlib.org/uc3/uc3webinars.html
Please take the webinar survey http://www.surveymonkey.com/s/XSGWP8R
For more information
UC Curation Center
http://www.cdlib.org/uc3
http://www.cdlib.org/uc3/[email protected]
Stephen Abrams, Lisa Colvin, Patricia Cruse, Scott Fisher, Erik Hetzner, Greg Janée, John Kunze, Margaret Low, David Loy, Mark Reyes, Tracy Seneca, Joan Starr, Marisa Strong, Perry Willett
UC3 webinar series
http://www.cdlib.org/uc3/uc3webinars.html
Merritt repository
http://merritt.cdlib.org/
http://merritt.cdlib.org/help
http://merritt.cdlib.org/docs/merritt_handout.pdf
http://merritt.cdlib.org/docs/merritt_user_guide.pdf