Modern Data Management Overview Storage Resource Broker Reagan W. Moore [email protected] .

36
Modern Data Management Modern Data Management Overview Overview Storage Resource Broker Storage Resource Broker Reagan W. Moore Reagan W. Moore [email protected] http://www.sdsc.edu/srb http://www.sdsc.edu/srb
  • date post

    20-Dec-2015
  • Category

    Documents

  • view

    217
  • download

    1

Transcript of Modern Data Management Overview Storage Resource Broker Reagan W. Moore [email protected] .

Modern Data Management Modern Data Management OverviewOverview

Storage Resource BrokerStorage Resource Broker

Reagan W. MooreReagan W. Moore

[email protected]

http://www.sdsc.edu/srbhttp://www.sdsc.edu/srb

TopicsTopics

• Data management evolution• Shared collections• Digital Libraries• Persistent Archives

• Building shared collections• Project level / National level / International

• Demonstration of shared collections• Access to collections at SDSC

Types of Data ManagementTypes of Data Management

• File system (AFS)• Provides caching at remote sites, uses single authentication system

• Backup system (Veritas)• Provides time-based copies of data, tools for re-loading backups

• Database system (Oracle 10g IFS)• Can link metadata to files on an Internet File System

• Archive system (HPSS)• Manages data stored on tape, supports parallel I/O streams

• Persistent object environment (Avaki)• Provides vaults for storing objects

• Globus toolkit• Provides differentiated services for building a data grid

Data Management EnvironmentsData Management Environments

• Data grids• Manage shared collections

• Digital libraries• Provide discovery, browsing, presentation services on

top of collections

• Persistent archives• Manage technology evolution while the authenticity

and integrity of the assembled collection is preserved

• Real-time sensor networks• Manage access to real-time data streams from

thousands of sensors

Generic InfrastructureGeneric Infrastructure

• Can a single system provide all of the features needed to implement each type of data management system, while supporting access across administrative domains and managing data stored in multiple types of storage systems?

• Answer is data grid technology

Types of Data ManagementTypes of Data Management

• File system (AFS)• Data grid manages replication, parallel I/O, containers

• Backup system (Veritas)• Data grid supports replicas, versions, and snapshots of files and

containers

• Database system (Oracle 10g IFS)• Data grid virtualizes catalogs - schema extension, bulk metadata load

• Archive system (HPSS)• Data grid integrates access across archives and file systems

• Persistent object environment (Avaki)• Data grid manages user-defined metadata and collection hierarchy

• Globus toolkit - set of differentiated services• Data grid manages consistent state information

Shared CollectionsShared Collections

• Purpose of SRB data grid is to enable the creation of a collection that is shared between academic institutions• Register digital entity into the shared collection• Assign owner, access controls• Assign descriptive, provenance metadata• Manage state information

• Audit trails, versions, replicas, backups, locks• Size, checksum, validation date, synchronization date, …

• Manage interactions with storage systems• Unix file systems, Windows file systems, tape archives, …

• Manage interactions with preferred access mechanisms• Web browser, Java, WSDL, C library, …

SRBserver

SRB agent

SRBserver

Federated Server ArchitectureFederated Server Architecture

MCAT

Read Application

SRB agent

1

2

34

6

5

Logical NameOr

Attribute Condition

1.Logical-to-Physical mapping2.Identification of Replicas3.Access & Audit Control

Peer-to-peer

Brokering

Server(s) SpawningData

Access

Parallel Data Access

R1R2

5/6

Generic InfrastructureGeneric Infrastructure

• Digital libraries now build upon data grids to manage distributed collections• DSpace digital library - MIT and Hewlitt Packard• Fedora digitial library - Cornell University and University

of Virginia

• Persistent archives build upon data grids to manage technology evolution• NARA research prototype persistent archive• California Digital Library - Digital Preservation Repository• NSF National Science Digital Library persistent archive

Southern California Earthquake CenterSouthern California Earthquake Center

SCEC Community

Library

Select Receiver (Lat/Lon)

OutputTime HistorySeismograms

Select ScenarioFault Model

Source Model

•Intuitive User Interface–Pull-Down Query Menus –Graphical Selection of Source Model–Clickable LA Basin Map (Olsen)–Seismogram/History extraction (Olsen)

•Access SCEC Digital Library

–Data stored in a data grid–Annotated by modelers–Standard naming convention–Automated extraction of selected data and metadata–Management of visualizations

SCEC Digital LibrarySCEC Digital Library

Terashake Data HandlingTerashake Data Handling

• Simulate 7.7 magnitude earthquake on San Andreas fault• 50 Terabytes in a simulation• Move 10 Terabytes per day

• Post-Processing of wave field• Movies of seismic wave propagation• Seismogram formatting for interactive on-

line analysis• Velocity magnitude• Displacement vector field• Cumulative peak maps• Statistics used in visualizations• Register derived data products into

SCEC digital library

HumidityClimateEcologicalWirelessOceanography

Wind SpeedClimateEcologicalWirelessOceanography

SeismicGeophysics

ROADNet Sensor Network Data Integration

Fire startRain start

Frank Vernon - UCSD/SIOFrank Vernon - UCSD/SIO

ChileChile June 13, 2005June 13, 2005

Mw 7.9

Frank Vernon - UCSD/SIOFrank Vernon - UCSD/SIO

National Science Digital LibraryNational Science Digital Library

• URLs for educational material for all grade levels registered into repository at Cornell

• SDSC crawls the URLs, registers the web pages into a SRB data grid, builds a persistent archive• 750,000 URLs• 13 million web pages• About 3 TBs of data

Worldwide University Network Data GridWorldwide University Network Data Grid

• SDSC• Manchester• Southampton• White Rose• NCSA• U. Bergen

• A functioning, general purpose international Data Grid for academic collaborations

Manchester-SDSC mirror

KEK Data GridKEK Data Grid

• Japan• Taiwan• South Korea• Australia• Poland• US

• A functioning, general purpose international Data Grid for high-energy physics

Manchester-SDSC mirror

BaBar High-energy PhysicsBaBar High-energy Physics

• Stanford Linear Accelerator

• Lyon, France• Rome, Italy• San Diego• RAL, UK

• A functioning international Data Grid for high-energy physics

Manchester-SDSC mirror

Moved over 100 TBs of dataMoved over 100 TBs of data

Astronomy Data GridAstronomy Data Grid

• Chile• Tucson, Arizona• NCSA, Illinois

• A functioning international Data Grid for Astronomy Manchester-SDSC mirror

Moved over 400,000 imagesMoved over 400,000 images

International Institutions (2005)International Institutions (2005)

Project InstitutionData mangement project British Antarctic Survey, UKeMinerals Cambridge e-Science Center, UKSickkids Hospital in Toronto CanadaWelsh e-Science Centre Cardiff University, UKAustralian Partnership for Advanced Computing Data Grid Victoria, Australia

Australian Partnership for Advanced Computing Data GridCommonwealth Scientific and Industrial Research Organization, Australia

Australian Partnership for Advanced Computing Data Grid University of Technology, AustraliaCenter for Advanced Studies, Research, and Development ItalyLIACS(Leiden Inst. Of Comp. Sci) Leiden University,The NetherlandsAustralian Partnership for Advanced Computing Data Grid Melbourne, AustraliaMonash E-Research Grid Monash University, AustraliaComputational Modelling University of Queensland, AustraliaVirtual Tissue Bank Osaka University, JapanCybermedia Center Osaka University, JapanBelfast e-Science Centre Queen's University, UKInformation Technology Department Sejong University, South KoreaNanyang Centre for Supercomputing SingaporeNational University (Biology data grid) SingaporeProtein structure prediction Taiwan University, Taiwan

Storage Resource Broker Collections at SDSC (8/2/2005)GBs of

datastored

Numberof files

Userswith

ACLsData Grid Ź Ź Ź

NSF/ITR - National Virtual Observatory 53,862 9,536,751 100NSF - National Partnership for Advanced Computational Infrastructure 36,149 7,539,180 380

Static collections Š Hayden planetarium 8,013 161,352 227

Pzone Š public collections 12,998 6,707,952 68

NSF/NPACI - Biology and Environmental collections 40,155 76,083 67

NSF/NPACI Š Joint Center for Structural Genomics 15,731 1,577,260 55

NSF - TeraGrid, ENZO Cosmology simulations 176,730 2,125,945 3,267

NIH - Biomedical Informatics Research Network 10,561 7,596,888 303

Digital Library Ź Ź Ź

NSF/NPACI - Long Term Ecological Reserve 256 9,033 36

NSF/NPACI - Grid Portal 2,620 53,048 460

NIH - Alliance for Cell Signaling microarray data 741 84,594 21

NSF - National Science Digital Library SIO Explorer collection 2,733 1,083,998 27

NSF/ITR - Southern California Earthquake Center 131,010 2,702,421 73

Persistent Archive Ź Ź Ź

NHPRC Persistent Archive Testbed (Kentucky, Ohio, Michigan, Minnesota) 100 382,186 28

UCSD Libraries archive 4,147 408,050 29

NARA- Research Prototype Persistent Archive 1,478 893,434 58

NSF - National Science Digital Library persistent archive 3,600 27,034,150 136

TOTAL 501 TB 68 million 5,335

Unix Shell

NT Browser,Kepler Actors

OAI,WSDL,(WSRF)

HTTP,DSpace,

OpenDAP,GridFTP

Archives - Tape,Sam-QFS, DMF,

HPSS, ADSM,UniTree, ADS

Databases -DB2, Oracle,

Sybase, Postgres, mySQL, Informix

File SystemsUnix, NT,Mac OSX

Application

ORB

Storage Repository AbstractionDatabase Abstraction

Databases -DB2, Oracle, Sybase,

Postgres, mySQL,Informix

CLibrary,

Java

Logical Name Space

LatencyManagement

DataTransport

MetadataTransport

Consistency & Metadata Management / Authorization, Authentication, Audit

Linux I/OC++

DLL /Python,

Perl, Windows

Federation Management

Storage Resource Broker 3.3.1Storage Resource Broker 3.3.1

SRB ObjectivesSRB Objectives

• Automate all aspects of data discovery, access, management, analysis, preservation• Security paramount• Distributed data

• Provide distributed data support for• Data sharing - data grids• Data publication - digital libraries• Data preservation - persistent archives• Data collections - Real time sensor data

SRB DevelopersSRB Developers

Reagan Moore Reagan Moore - PI- PIMichael Wan Michael Wan - SRB Architect- SRB ArchitectArcot Rajasekar Arcot Rajasekar - SRB Manager- SRB ManagerWayne Schroeder Wayne Schroeder - SRB Productization- SRB ProductizationCharlie CowartCharlie Cowart - inQ- inQLucas Gilbert Lucas Gilbert - Jargon- JargonBing Zhu Bing Zhu - Perl, Python, Windows- Perl, Python, WindowsAntoine de Torcy Antoine de Torcy - mySRB web browser- mySRB web browserSheau-Yen Chen Sheau-Yen Chen - SRB Administration- SRB AdministrationGeorge KremenekGeorge Kremenek - SRB Collections- SRB CollectionsArun Jagatheesan Arun Jagatheesan - Matrix workflow- Matrix workflowMarcio Faerman Marcio Faerman - SCEC Application- SCEC ApplicationSifang Lu Sifang Lu - ROADnet Application- ROADnet ApplicationRichard Marciano Richard Marciano - SALT persistent archives- SALT persistent archives

75 FTE-years of support75 FTE-years of supportAbout 300,000 lines of CAbout 300,000 lines of C

HistoryHistory• 1995 - DARPA Massive Data Analysis Systems• 1997 - DARPA/USPTO Distributed Object Computation Testbed• 1998 - NSF National Partnership for Advanced Computational Infrastructure

• 1998 - DOE Accelerated Strategic Computing Initiative data grid• 1999 - NARA persistent archive• 2000 - NASA Information Power Grid• 2001 - NLM Digital Embryo digital library• 2001 - DOE Particle Physics data grid• 2001 - NSF Grid Physics Network data grid• 2001 - NSF National Virtual Observatory data grid• 2002 - NSF National Science Digital Library persistent archive• 2003 - NSF Southern California Earthquake Center digital library• 2003 - NIH Biomedical Informatics Research Network data grid• 2003 - NSF Real-time Observatories, Applications, and Data management Network

• 2004 - NSF ITR, Constraint based data systems• 2005 - LC Digital Preservation Lifecycle Management• 2005 - LC National Digital Information Infrastructure and Preservation program

DevelopmentDevelopment

• SRB 1.1.8 - December 15, 2000• Basic distributed data management system• Metadata Catalog

• SRB 2.0 - February 18, 2003• Parallel I/O support• Bulk operations

• SRB 3.0 - August 30, 2003• Federation of data grids

• SRB 3.3.1 - April 6, 2005• Feature requests (extensible schema)

SRB Latency ManagementSRB Latency Management

ReplicationServer-initiated I/O

StreamingParallel I/O

CachingClient-initiated I/O

Remote Proxies,Staging

Data AggregationContainers

SourceDestination

Prefetch

NetworkDestinationNetwork

Latency Management -Latency Management -Bulk OperationsBulk Operations

• Bulk register• Create a logical name for a file• Load context (metadata)

• Bulk load• Create a copy of the file on a data grid storage repository

• Bulk unload• Provide containers to hold small files and pointers to

each file location

• Bulk delete• Trash can

• Sticky bits for access control, …

Logical Name SpacesLogical Name Spaces

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Access Methods (C library, Unix, Web Browser)

Data access directly between

application and storage

repository using names

required by the local

repository

Logical Name SpacesLogical Name Spaces

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection

Data Access Methods (C library, Unix, Web Browser)

Data is organized as a shared collection

Federation Between Data GridsFederation Between Data Grids

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection B

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection A

Access controls and consistency constraints on cross registration of digital entities

Types of Risk Types of Risk

• Media failure• Replicate data onto multiple media

• Vendor specific systemic errors• Replicate data onto multiple vendor products

• Operational error• Replicate data onto a second administrative domain

• Natural disaster• Replicate data to a geographically remote site

• Malicious user• Replicate data to a deep archive

How Many ReplicasHow Many Replicas

• Three sites minimize risk• Primary site

• Supports interactive user access to data

• Secondary site• Supports interactive user access when first site is

down• Provides 2nd media copy, located at a remote site,

uses different vendor product, independent administrative procedures

• Deep archive• Provides 3rd media copy, staging environment for

data ingestion, no user access

State of the Art TechnologyState of the Art Technology

• Grid - workflow virtualization• Support execution of jobs (processes) across

multiple compute servers

• Data grid - data virtualization• Manage a shared collection that is distributed

across multiple storage servers

• Semantic grid - information virtualization• Create a common understanding of information

(metadata) across multiple collections.

For More InformationFor More Information

Reagan W. Moore

San Diego Supercomputer Center

[email protected]

http://www.sdsc.edu/srb/