National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center

Digital Libraries, Data Grids, and Persistent Archives

Reagan W. Moore
San Diego Supercomputer Center
10100 John Jay Hopkins Dr, La Jolla CA 92093
Phone: +1-858-534-5073
E-mail: [email protected]

Presented at the THIC Meeting at the Hilton San Diego/Del Mar, Del Mar CA 92014-1901, on January 22, 2002


Page 2

Digital Libraries, Data Grids, and Persistent Archives

Reagan W. Moore
San Diego Supercomputer Center

[email protected]
http://www.npaci.edu/DICE/

Page 3

Data and Knowledge Systems Group

Graduate Students
• A. Bagchi
• S. Bansal
• A. Behere
• R. Bharath
• S. Bharath
• M. Kulrul
• L. Sui

Undergraduate Interns
• N. Cotofana
• M. Shumaker
• J. Trang
• L. Yin
• +/- NN

Staff
• Reagan Moore
• Ilkai Altintas
• Chaitan Baru
• Sheau Yen Chen
• Charles Cowart
• Amarnath Gupta
• George Kremenek
• Bertram Ludäscher
• Richard Marciano
• XuFei Qian
• Roman Olshanowsky
• Arcot Rajasekar
• Abe Singer
• Michael Wan
• Ilya Zaslavsky
• Bing Zhu

Page 4

Accessing Data

• How do you access storage systems at remote sites in someone else’s administration domain?

• How do you organize distributed data into a cohesive collection with global, persistent identifiers?

Page 5

Information Management Projects

• Digital Libraries
  – CDL - AMICO
  – DARPA/USPTO - patent digital library
  – NLM Visible Embryo digital library - GMU
  – NSF Digital Library Initiative, Phase II - UCSB, Stanford
  – NSF NPACI Digital Sky - Caltech 2MASS sky survey
  – NSF NSDL - UCAR / Columbia / Cornell / UCSB

• Data Grid Environments
  – DOE Data Visualization Corridor - LLNL
  – DOE Particle Physics Data Grid - Stanford, Caltech
  – NASA Information Power Grid - NASA Ames
  – NIH Biomedical Informatics Research Network
  – NSF Grid Physics Network - U Florida
  – NSF National Virtual Observatory - Johns Hopkins University / Caltech
  – NSF Southern California Earthquake Center - ISI

• Persistent Archives
  – NARA Persistent Archive
  – NHPRC - Archivist workbench

Page 6

Specifying Levels of Abstraction

• Technology management becomes simpler if the persistent archive infrastructure operates on abstractions, rather than an explicit physical implementation of a resource

• Can we abstract
  – Digital objects
  – Storage

Page 7

Technology Management

[Figure: layered diagram. Labels: Application, Operating System, Storage System, Display System, Digital Object; abstraction layers: Storage System Abstraction, Display System Abstraction, Digital Object Abstraction]

Page 8

Types of Digital Entity Abstractions

• Logical representation
  – What does the digital entity represent?
  – What is the associated meaning?

• Physical representation
  – What is the physical structure of the digital entity?

Page 9

Levels of Abstraction for Bits

Abstraction for Digital Entity
  – Logical: I-nodes
  – Physical: Track / Sector
  – Digital Entity: Bit Stream

Abstraction for Repository
  – Logical: File Name
  – Physical: File System (NFS/AFS/NTFS)
  – Repository: Disk

Page 10

Managing Distributed Storage

• Separate the organization of digital objects from their physical storage
  – Logical Name Space to manage attributes about the digital objects
  – Data handling system to manage interactions with remote storage systems

• Create storage abstraction layer

• Storage Resource Broker (SRB) provides data management system
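The separation described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SRB interface: `InMemoryStore`, `LogicalNameSpace`, and every method, resource, and path name below are hypothetical. The point is only that clients use the logical name, while the mapping to a storage resource and physical path can change underneath.

```python
class InMemoryStore:
    """Hypothetical stand-in for one storage system (file system, archive, ...)."""

    def __init__(self):
        self._blobs = {}

    def write(self, physical_path, data):
        self._blobs[physical_path] = data

    def read(self, physical_path):
        return self._blobs[physical_path]

    def delete(self, physical_path):
        del self._blobs[physical_path]


class LogicalNameSpace:
    """Maps each logical name to a storage resource, a physical path, and attributes."""

    def __init__(self):
        self._resources = {}  # resource name -> storage driver
        self._entries = {}    # logical name -> resource, path, attributes

    def add_resource(self, name, driver):
        self._resources[name] = driver

    def put(self, logical_name, resource, physical_path, data, **attributes):
        self._resources[resource].write(physical_path, data)
        self._entries[logical_name] = {
            "resource": resource,
            "path": physical_path,
            "attributes": dict(attributes),
        }

    def get(self, logical_name):
        entry = self._entries[logical_name]
        return self._resources[entry["resource"]].read(entry["path"])

    def where(self, logical_name):
        entry = self._entries[logical_name]
        return entry["resource"], entry["path"]

    def migrate(self, logical_name, new_resource, new_path):
        """Move the bits to another resource; the logical name stays the same."""
        entry = self._entries[logical_name]
        data = self._resources[entry["resource"]].read(entry["path"])
        self._resources[new_resource].write(new_path, data)
        self._resources[entry["resource"]].delete(entry["path"])
        entry["resource"], entry["path"] = new_resource, new_path


ns = LogicalNameSpace()
ns.add_resource("disk-cache", InMemoryStore())
ns.add_resource("archive", InMemoryStore())
ns.put("/demo/object-0001", "disk-cache", "/cache/a1.img", b"pixels", project="demo")
ns.migrate("/demo/object-0001", "archive", "/tape/0007/a1.img")
print(ns.where("/demo/object-0001"), ns.get("/demo/object-0001"))
```

After the migration the caller still asks for `/demo/object-0001`; only the catalog entry changed.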

Page 11

Levels of Abstraction for Data

Abstraction for Digital Entity
  – Logical: Data Model (units, semantics)
  – Physical: Encoding Format (syntax, structure)
  – Digital Entity: Files

Abstraction for Repository
  – Logical: Name Space
  – Physical: Data Handling System - SRB/MCAT
  – Repository: File System, Archive

Page 12

Visible Embryo Project

[Figure: network diagram of participating sites. Labels: SDSC, Los Angeles, Oakland, OHSU, Vegas, DC POP, WRL, HSCC, GST, ASX200, AFIP: Collab WS, NT WS, Eolas, ATD Net, NIC, UIC, Startap, GMU, MSWS, BEN, JHU, Image Generation, Disk Cache, Archive; link labels: OC-3, OC-12, DS3, Abilene, VBNS, 100 Gbit]

Page 13

Disaster Response

• Support replicas - provide multiple copies of a data set stored at multiple sites, but accessed by the same logical file name

• On access, map from logical file name to the physical file name. If the file is not accessible, automatically fail over to a replica.
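The fail-over behaviour described above can be sketched as follows. This is an illustration only, assuming a replica catalog that maps one logical file name to an ordered list of physical locations. The names `read_with_failover`, `AllReplicasFailed`, `demo_fetch`, and the site paths are hypothetical and are not SRB calls.

```python
class AllReplicasFailed(Exception):
    """Raised when no replica of a logical file could be read."""


def read_with_failover(logical_name, replica_catalog, fetch):
    """Return (data, location) from the first replica that can be read.

    replica_catalog maps a logical file name to an ordered list of physical
    locations; fetch(location) returns bytes or raises OSError.
    """
    failures = []
    for location in replica_catalog.get(logical_name, []):
        try:
            return fetch(location), location
        except OSError as exc:
            # This copy is not accessible: remember why and try the next one.
            failures.append(f"{location}: {exc}")
    detail = "; ".join(failures) or "no replicas registered"
    raise AllReplicasFailed(f"{logical_name}: {detail}")


# Illustrative data: two hypothetical sites hold the same data set.
CATALOG = {
    "/survey/tile-42": ["siteA:/disk/tile42.dat", "siteB:/archive/tile42.dat"],
}
ONLINE = {"siteB:/archive/tile42.dat": b"tile bytes"}  # siteA is unreachable


def demo_fetch(location):
    if location not in ONLINE:
        raise ConnectionError(f"cannot reach {location}")
    return ONLINE[location]


data, used = read_with_failover("/survey/tile-42", CATALOG, demo_fetch)
print(used)
```

The caller never names a site. It asks for the logical file name and learns afterwards which replica answered.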

Page 14

SDSC Storage Resource Broker & Meta-data Catalog

[Figure: SRB architecture diagram. Labels by group:
  – Clients: Application; C, C++, Libraries; Linux I/O; Unix Shell; DLL / Python; Java, NT Browsers; Web; Prolog Predicate
  – Servers: Consistency Management / Authorization-Authentication; Logical Name Space; Latency Management; Data Transport; Metadata Transport; Prime Server
  – Storage Abstraction: Archives (HPSS, ADSM, UniTree, DMF); Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OSX); HRM
  – Catalog Abstraction: Databases (DB2, Oracle, Sybase)]

Page 15

Information Management - Logical Name Space

• Set of attributes to describe digital entities that are registered into the logical name space

• SRB metadata - Unix file system semantics
• Provenance metadata - Dublin Core
• Resource metadata - User access control lists
• Discipline metadata - User defined attributes

• Each digital entity may have unique attributes
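One way to picture the four attribute sets is as separate dictionaries attached to each registered entity. The sketch below is an assumption-laden illustration, not the MCAT schema: the class `DigitalEntity`, its field names, the helper `select`, and all sample values are hypothetical. The provenance keys use Dublin Core element names (`creator`, `date`).

```python
from dataclasses import dataclass, field


@dataclass
class DigitalEntity:
    logical_name: str
    system: dict = field(default_factory=dict)      # Unix-style: owner, size, ...
    provenance: dict = field(default_factory=dict)  # Dublin Core elements
    access: dict = field(default_factory=dict)      # user -> set of permissions
    discipline: dict = field(default_factory=dict)  # user-defined attributes

    def can(self, user, permission):
        """Check the access control list for one user and one permission."""
        return permission in self.access.get(user, set())


def select(entities, category, name, value):
    """Logical names of entities whose attribute `name` in `category` equals `value`."""
    return [e.logical_name for e in entities
            if getattr(e, category).get(name) == value]


entities = [
    DigitalEntity(
        "/sky/tile-001",
        system={"owner": "alice", "size": 2048},
        provenance={"creator": "Survey Team", "date": "2001-11-05"},
        access={"alice": {"read", "write"}, "bob": {"read"}},
        discipline={"band": "J"},
    ),
    DigitalEntity(
        "/embryo/section-17",
        system={"owner": "carol", "size": 512},
        provenance={"creator": "Imaging Lab"},
        access={"carol": {"read", "write"}},
        discipline={"stain": "H&E", "stage": 12},
    ),
]

print(select(entities, "discipline", "band", "J"))
print(entities[0].can("bob", "write"))
```

The two entities carry different discipline attributes, which is the case the last bullet describes: nothing forces every entity to share one fixed attribute list.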

Page 16

Information Management

• Abstraction layer for interacting with information repositories
  – Manage the schema and physical table structures of a database
  – Extensible schema
  – User defined attributes

• Extensible Metadata CATalog (EMCAT) manages collections

• mySRB.html interface supports dynamic collection creation
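An extensible schema with user-defined attributes is often modelled as an entity table plus an attribute-value table, so that new attributes need no change to the table structure. The sketch below uses Python's standard `sqlite3` module to show that idea. It is an assumption made for illustration, not the EMCAT design, and the table names, function names, and sample collections are hypothetical.

```python
import sqlite3


def open_catalog():
    """Create an in-memory catalog with an entity table and an attribute table."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE entity (
            id INTEGER PRIMARY KEY,
            collection TEXT NOT NULL,
            logical_name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE attribute (
            entity_id INTEGER NOT NULL REFERENCES entity(id),
            name TEXT NOT NULL,
            value TEXT NOT NULL
        );
        """
    )
    return db


def register(db, collection, logical_name, **attributes):
    """Register an entity with any set of user-defined attributes."""
    cur = db.execute(
        "INSERT INTO entity (collection, logical_name) VALUES (?, ?)",
        (collection, logical_name),
    )
    db.executemany(
        "INSERT INTO attribute (entity_id, name, value) VALUES (?, ?, ?)",
        [(cur.lastrowid, k, str(v)) for k, v in attributes.items()],
    )
    return cur.lastrowid


def find(db, name, value):
    """Logical names of all entities that carry the given attribute value."""
    rows = db.execute(
        "SELECT e.logical_name FROM entity e "
        "JOIN attribute a ON a.entity_id = e.id "
        "WHERE a.name = ? AND a.value = ? ORDER BY e.logical_name",
        (name, str(value)),
    )
    return [row[0] for row in rows]


db = open_catalog()
register(db, "sky", "/sky/tile-001", band="J", exposure_s=7.8)
register(db, "sky", "/sky/tile-002", band="H")
register(db, "patents", "/patents/0001", classification="G06F")
print(find(db, "band", "J"))
```

Creating a new collection with its own attributes is then just a matter of registering entities under a new collection name, which is the kind of dynamic collection creation the slide mentions.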

Page 17

Levels of Abstraction for Information

Abstraction for Digital Entity
  – Logical: Collection Schema
  – Physical: XML Syntax
  – Digital Entity: Metadata Attributes

Abstraction for Repository
  – Logical: Database Schema
  – Physical: EMCAT/CWM
  – Repository: Database

Page 18

National Virtual Observatory Data Grid

[Figure: layered architecture diagram with components numbered 1 to 7. Numbered labels: 1. Portals and Workbenches; 2. Knowledge & Resource Management; 4. Grid (Security, Caching, Replication, Backup, Scheduling); 7. Derived Collections. Other labels: Compute Resources, Catalogs, Data Archives; Information Discovery, Metadata delivery, Data Discovery, Data Delivery; Catalog Mediator, Data mediator; Bulk Data Analysis, Catalog Analysis, Metadata View, Data View; Standard Metadata format, Data model, Wire format; Catalog/Image Specific Access; Standard APIs and Protocols; Concept space]

Page 19

Knowledge Management - Discovery across Collections

• Characterization of relationships between attributes
  – Semantic / logical - cross-walks
  – Procedural / temporal - records management
  – Structural / spatial - GIS

• Abstraction layer for knowledge repositories

• Mapping from collection attributes to discipline concepts

• Model-based Mediation supports mapping from knowledge relationships to rule-based inference engines
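A cross-walk of the semantic kind listed above can be illustrated as a table that maps each collection's own attribute names onto shared discipline concepts, so that one query runs across collections. The sketch below covers only that simple case and says nothing about how model-based mediation or rule-based inference is implemented. The collection names, attribute names, and the functions `to_concepts` and `discover` are all hypothetical.

```python
# Hypothetical cross-walk: collection attribute name -> discipline concept.
CROSSWALK = {
    "survey_a": {"ra_deg": "right_ascension", "dec_deg": "declination",
                 "jmag": "j_magnitude"},
    "survey_b": {"RA": "right_ascension", "DEC": "declination",
                 "J": "j_magnitude"},
}


def to_concepts(collection, record, crosswalk=CROSSWALK):
    """Rename a record's attributes to shared concepts; unmapped ones are dropped."""
    mapping = crosswalk[collection]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def discover(collections, concept, predicate, crosswalk=CROSSWALK):
    """Return (collection, concept-level record) pairs where predicate(value) holds."""
    hits = []
    for name, records in collections.items():
        for record in records:
            view = to_concepts(name, record, crosswalk)
            if concept in view and predicate(view[concept]):
                hits.append((name, view))
    return hits


collections = {
    "survey_a": [{"ra_deg": 10.5, "dec_deg": -3.0, "jmag": 12.1, "plate": 7}],
    "survey_b": [{"RA": 10.6, "DEC": -3.1, "J": 15.4},
                 {"RA": 80.0, "DEC": 22.0, "J": 11.0}],
}

bright = discover(collections, "j_magnitude", lambda m: m < 13.0)
for name, view in bright:
    print(name, view)
```

The query is phrased once, in terms of the concept `j_magnitude`, and each collection answers it through its own attribute name.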

Page 20

Levels of Abstraction for Knowledge

Abstraction for Digital Entity
  – Logical: Relationship Schema
  – Physical: ER/UML/XMI/RDF syntax
  – Digital Entity: Concept Space (ontology instance)

Abstraction for Repository
  – Logical: Knowledge Repository Schema
  – Physical: Model-based Mediation System
  – Repository: Knowledge Repository

Page 21

NSDL

[Figure: NSDL architecture diagram. Labels: User Interfaces; Usage Enhancement; Collection Building; Portals & Clients; CI Services (visualization, discussion, personalization, topic-map registry, query transform); Core Services (annotation, metadata normalizing); Core Collection-Building Services (persistent storage, metadata harvesting); NSDL Services; Other NSDL Services; NSDL Collections; Referenced Items & Collections; Metadata & data access-based services; Core NSDL Bus (Meta-data delivery, Data delivery, Query, Global Ids, Security, Network); Virtual Collections & Mediators; Information about collections; Delivery, Presentation, Aggregation - Channels]

Page 22

ERA Concept model

[Figure: ERA: Archival Components Concept. Labels: Mediation of Information using XML; Storage Resource Broker/Extensible Meta-data CATalog; Grid Security Infrastructure; Accessioning Workbench (Accession, Verify, Wrap & Containerize, Describe); Reference Workbench (Query, Rebuild, Present); Archival Repository; Metadata; Collection; Disks; Tapes; Order Fulfillment System; Archival Research Catalog; Records Schedules; Internet]

Page 23

Further Information

• Academic: http://www.npaci.edu/DICE

• Commercial - Storage Resource [email protected]