Post on 27-Mar-2015
HATHI TRUST A Shared Digital Repository
HathiTrust Digital Library
Cooperation for Preservation
Outline
• About HathiTrust
– Mission & Goals
• Background
• What we do
– Services
• How we do it
– Governance
– Partnership & Resources
– Technology
• Future Directions
What is HathiTrust
• Shared Digital Repository
– Launched 2008 by 25 institutions (now 26)
– Initial focus on digitized book and journal content
– Expanding to non-book/non-journal, born digital
– “Light” archive
• Collaboration
– Preservation and access
– Print collections
– Local services
– Public Good
History
• Michigan Digitization Project 2004
• “…U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered in cooperation with partner research libraries such as the institutions in the Digital Library Federation…”
History
• Collective Agreement with CIC Announced in June 2007
• CIC agreed to establish a shared digital repository
History
The Partners
• When announced in October 2008, partners included:
– University of California system
– CIC (Committee on Institutional Cooperation)
– University of Virginia
University of Chicago
University of Illinois
Indiana University
University of Iowa
University of Michigan
Michigan State University
University of Minnesota
Northwestern University
Ohio State University
Pennsylvania State University
Purdue University
University of Wisconsin-Madison
Columbia University
The Name
• The meaning behind the name
– Hathi (hah-tee): Hindi for elephant
– Big, strong
– Never forgets, wise
– Secure
– Trustworthy
Content Distribution
As of February 1:
5,323,716 - Total
764,481 - Public Domain
Content Growth
Services
How we do it
Governance
[Diagram: HathiTrust governance. Bodies shown: Executive Committee; Strategic Advisory Board. Responsibility labels: Budget/Finances, Decision-making, Policy, Planning.]
Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Director of Libraries, UW - Madison
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM
Strategic Advisory Board
• Ed Van Gemert (Chair), Director of Libraries, UW - Madison
• John Butler, Associate University Librarian for Information Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R. Bruce Miller, University Librarian, UC - Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)
Partnership & Resources (1)
• Funded for an initial 5 years with base funding from partners
• Budget – separately held within UMich budget system, managed by the Executive Committee
• Cost Model – Per GB cost of storage per year with a one-time fee on new content to build a capital fund
• Review in 3rd yr of each 5 yr period
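As a rough illustration of the cost model bullet above, the sketch below adds a yearly per-GB storage charge to a one-time charge on newly deposited content. The function name, the parameter names, and the idea that the one-time fee is also charged per GB are assumptions made for illustration; the slides give no actual rates, so the figures in the usage note are placeholders only.

```python
def annual_partner_cost(stored_gb, new_gb, rate_per_gb_year, one_time_fee_per_gb):
    """Hypothetical yearly charge for one partner.

    stored_gb           -- total GB the partner has in the repository
    new_gb              -- GB deposited for the first time this year
    rate_per_gb_year    -- assumed recurring storage rate (placeholder)
    one_time_fee_per_gb -- assumed capital-fund fee on new content (placeholder)
    """
    storage_charge = stored_gb * rate_per_gb_year      # recurring storage cost
    capital_charge = new_gb * one_time_fee_per_gb      # one-time fee, new content only
    return storage_charge + capital_charge
```

For example, with placeholder rates, `annual_partner_cost(1000, 200, 0.5, 0.25)` charges 1000 GB of storage plus a one-time fee on the 200 GB that are new.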
Partnership & Resources (2)
• Staff/Expertise – highly integrated
– Project managers, IT and communications staff, copyright experts, administrators (UM, Indiana and UC taking the lead)
• Working groups
• UM recently hired a Digital Preservation Librarian
• Shared development space
Financial contributions of partners
HathiTrust Functional Framework
Partnership & Resources (3)
• Toward a Cloud Library
– CLIR, Mellon Foundation
– OCLC Research, NYU, HathiTrust, ReCAP Libraries
• Objective: Characterize the near-term opportunity for externalizing management of academic research collections leveraging capacity of large-scale shared print and digital repositories*
• Outcomes: opportunity and risk assessment based on aggregate collection analysis; draft service agreement enabling generic consumer library to selectively outsource preservation and access of low-use research collections to large-scale print and digital repositories
*From the RLG Partner Update January 7, 2010
Partnership & Resources (4)
• CRL TRAC Audit
– Portico and HathiTrust assessments timely
– “Certification will augment CRL’s strategic archiving of print, and support a responsible transition to electronic-only formats where appropriate.”
– Work with UC to design shared print journal archiving effort
– “With this hybrid strategy CRL hopes to enable its community to accelerate the shift to electronic-only resources in a careful and responsible manner.”
* http://www.crl.edu/archiving-preservation/digital-archives/certification-and-assessment-digital-repositories
Partnership & Resources (5)
• New cost model
• Based on benefits to institutions
– Public Domain
– In-copyright
• Volumes “held”
Partnership & Resources (6)
• Timeline:
– Implement in 2013
– Accept new partners now with costs based on overlap calculations
• Requirements:
– Print holdings database
– Update mechanisms
– Manual remediation
Technology - OAIS
[Diagram: HathiTrust technology mapped to OAIS. Recoverable labels, grouped as they appear:
GRIN; Internal Data Loading
Google; [OCA]; In-house Conversion
MARC record extensions (Aleph); Rights DB
Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; [Solr]
METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums
METS object; PNG; OCR; PDF
Isilon; Site Replication; TSM; MD5 checksum validation
GROOVE (JHOVE)]
Technology – Architecture
• Inbound validation, standards-based object storage and related metadata
• Storage in Ann Arbor and Indianapolis
• Encrypted backup to 3rd location
• Rights database for rights metadata
• Online catalog as source and storage for descriptive metadata
Technology - Ingest
• Automatic validation in GROOVE
– Check barcode check digit using Luhn algorithm
– Fixity check on JPEG2000, TIFF, UTF-8 using MD5
– Well-formedness and embedded metadata check on JPEG2000, TIFF, UTF-8 using JHOVE
• Creation of METS and PREMIS
• Isilon storage
• Simple filesystem layout
– One directory per volume, zip file and METS file
– Use of a namespace allows for conflicting identifiers
– Namespaces for institutions and, if needed, types of identifiers within the institution
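Two of the validation steps listed above, the Luhn check-digit test and the MD5 fixity check, can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not GROOVE's actual code; the function names are hypothetical, and it assumes barcodes are plain digit strings whose last digit is a standard Luhn check digit.

```python
import hashlib


def luhn_valid(barcode: str) -> bool:
    """Return True if the all-digit string passes the standard Luhn check."""
    if not (barcode.isascii() and barcode.isdigit()):
        return False
    total = 0
    # Walk right to left; the rightmost digit is the check digit.
    for position, char in enumerate(reversed(barcode)):
        digit = int(char)
        if position % 2 == 1:      # double every second digit
            digit *= 2
            if digit > 9:          # same as summing the two digits
                digit -= 9
        total += digit
    return total % 10 == 0


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute a file's MD5 hex digest, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: str, expected_md5: str) -> bool:
    """Compare the computed digest with the checksum supplied for the file."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file whose recomputed digest differs from the supplied checksum would fail ingest under this scheme. Reading in chunks keeps memory use flat for large page images.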
Technology - Repository
• Why METS?
– Can serve as Archival Information Package and a Dissemination Information Package
– Designed to record the relationship between pieces of complex digital objects
– Can be created automatically as texts are loaded or reloaded
– Preservation actions (PREMIS)
Technology – METS Object
• What’s there?
– metsHdr with an ID and CREATEDATE
– 2 dmdSecs: MARCXML and mdRef
– amdSec containing one techMD with PREMIS metadata
– fileSec with 4 fileGrps (zip, images, OCR, hOCR)
– Physical structMap tying together files with metadata (page numbers and features)
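The section inventory above can be made concrete with an empty METS skeleton. The sketch below uses Python's standard `xml.etree.ElementTree`. The element and attribute names come from the METS schema, while the function name, the ID values, and the `USE` labels are placeholders chosen for illustration rather than HathiTrust's actual values. The output is not schema-validated.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("METS", METS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton(object_id: str) -> ET.Element:
    root = ET.Element(q("mets"), {"OBJID": object_id})

    # metsHdr with an ID and CREATEDATE
    created = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    ET.SubElement(root, q("metsHdr"), {"ID": "HDR1", "CREATEDATE": created})

    # Two dmdSecs: one wrapping MARCXML, one pointing elsewhere via mdRef
    dmd_wrap = ET.SubElement(root, q("dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd_wrap, q("mdWrap"), {"MDTYPE": "MARC"})
    ET.SubElement(wrap, q("xmlData"))  # the MARCXML record would go here
    dmd_ref = ET.SubElement(root, q("dmdSec"), {"ID": "DMD2"})
    ET.SubElement(dmd_ref, q("mdRef"), {"LOCTYPE": "OTHER", "MDTYPE": "MARC"})

    # amdSec containing one techMD that wraps PREMIS metadata
    amd = ET.SubElement(root, q("amdSec"))
    tech = ET.SubElement(amd, q("techMD"), {"ID": "TMD1"})
    premis = ET.SubElement(tech, q("mdWrap"), {"MDTYPE": "PREMIS"})
    ET.SubElement(premis, q("xmlData"))  # PREMIS events would go here

    # fileSec with four fileGrps (USE labels are placeholders)
    file_sec = ET.SubElement(root, q("fileSec"))
    for use in ("zip", "images", "OCR", "hOCR"):
        ET.SubElement(file_sec, q("fileGrp"), {"USE": use})

    # Physical structMap; one child div per page would link files together
    struct_map = ET.SubElement(root, q("structMap"), {"TYPE": "physical"})
    ET.SubElement(struct_map, q("div"), {"TYPE": "volume"})
    return root
```

Calling `ET.tostring(build_mets_skeleton("example.001"), encoding="unicode")` serializes the skeleton, which shows the order of sections at a glance.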
Technology – METS Object
Future Directions
Future Directions (1)
Future Directions (2)
Links
• Catalog, Full-text search, and Collection Builder
– http://catalog.hathitrust.org
• METS and PREMIS implementation
– http://www.hathitrust.org/preservation
• Technical profile:
– http://www.hathitrust.org/technology
• Technical flow diagram
– http://www.hathitrust.org/documents/HathiTrust-PASIG-200910.pdf
– http://www.hathitrust.org/documents/HathiTrust-PASIG-notes-200910.pdf
• Rights management
– http://www.hathitrust.org/rights_management
• TRAC
– http://www.hathitrust.org/accountability
Thank You!
hathitrust-info@umich.edu
jjyork@umich.edu
http://www.hathitrust.org