Data Management


Page 1: Data Management

Data Management

Azizol Abdullah, FSKTM

Page 2: Data Management

What is Data Management?

It depends on:
Storage systems
Data transport mechanisms
Replication management
Metadata management
Publishing and curation of data

Page 3: Data Management

What is data management? (cont.)

Storage systems:
Disk arrays
Network caches (e.g., DPSS)
Hierarchical storage systems (e.g., HPSS)

Efficient data transport mechanisms:
Striped
Parallel
Secure
Reliable
Third-party transfers

Page 4: Data Management

What is data management? (cont.)

Replication management:
Associate files into collections
Mechanisms for reliably copying collections, propagating updates to collections, and selecting among replicas

Metadata management:
Associate attributes that describe data
Select data based on attributes

Publishing and curation of data:
“Official” versions of important collections
Digital libraries

Page 5: Data Management

Data-Intensive Applications: Physics

CERN Large Hadron Collider

Several petabytes of data per year
Starting in 2005
Continuing for 15 to 20 years

Replication scenario:
Copy of everything at CERN (Tier 0)
Subsets at national centers (Tier 1)
Smaller regional centers (Tier 2)
Individual researchers will have copies

Page 6: Data Management
Page 7: Data Management

The Large Hadron Collider (LHC) experiment

Page 8: Data Management

The CERN structure

Page 9: Data Management

GriPhyN Overview (www.griphyn.org)

5-year, $12.5M NSF ITR proposal to realize the concept of virtual data, via:

Key research areas:
Virtual data technologies (information models, management of virtual data software, etc.)
Request planning and scheduling (including policy representation and enforcement)
Task execution (including agent computing, fault management, etc.)
Development of the Virtual Data Toolkit (VDT)

Four applications: ATLAS, CMS, LIGO, SDSS

Page 10: Data Management

GriPhyN Participants

Computer Science: U.Chicago, USC/ISI, UW-Madison, UCSD, UCB, Indiana, Northwestern, Florida

Toolkit Development: U.Chicago, USC/ISI, UW-Madison, Caltech

Applications: ATLAS (Indiana), CMS (Caltech), LIGO (UW-Milwaukee, UT-B, Caltech), SDSS (JHU)

Unfunded collaborators: UIC (STAR-TAP), ANL, LBNL, Harvard, U.Penn

Page 11: Data Management

The Petascale Virtual Data Grid (PVDG) Model

Data suppliers publish data to the Grid.

Users request raw or derived data from the Grid, without needing to know:
Where the data is located
Whether the data is stored or computed

Users can easily determine:
What it will cost to obtain the data
The quality of derived data

PVDG serves requests efficiently, subject to global and local policy constraints

Page 12: Data Management

PVDG Scenario

[Figure: user requests flow among major archive facilities, network caches & regional centers, and local sites]

User requests may be satisfied via a combination of data access and computation at local, regional, and central sites

Page 13: Data Management

Other Application Scenarios

Climate community
Terabyte-scale climate model datasets:
Collected measurements
Simulation results
Must support sharing, remote access, and analysis of datasets

Distance visualization
Remote navigation through large datasets, with local and/or remote computing

Page 14: Data Management

Data-intensive computing

The term data-intensive computing describes applications that are I/O bound: they devote the largest fraction of their execution time to the movement of data. Such applications can be identified by evaluating their “computational bandwidth”, the number of bytes of data processed per floating-point operation. On vector supercomputers, applications that sustain high performance usually access about 7 bytes of data from memory for every floating-point operation.
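To make the metric concrete, here is a minimal sketch of the computational-bandwidth calculation; the profile numbers are hypothetical stand-ins for values a profiler or hardware counters would report.

```python
# Minimal sketch: estimating "computational bandwidth" for an application.
# The profile numbers below are hypothetical; real values would come from
# hardware counters or a profiler.

bytes_moved = 2.8e12   # total bytes read/written during the run (hypothetical)
flops = 4.0e11         # total floating-point operations (hypothetical)

bytes_per_flop = bytes_moved / flops
print(f"computational bandwidth: {bytes_per_flop:.1f} bytes/flop")

# By the slide's rule of thumb (~7 bytes/flop sustained on vector
# supercomputers), a ratio this high marks the application as data-intensive.
print("data-intensive" if bytes_per_flop >= 7 else "compute-bound")
```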

Page 15: Data Management

Storage Systems: Disk Arrays

What is a disk array? A collection of disks

Advantages:

Higher capacity
Many small, inexpensive disks

Higher throughput
Higher bandwidth (Mbytes/sec) on large transfers
Higher I/O rate (transactions/sec) on small transfers

Page 16: Data Management

Trends in Magnetic Disks

Capacity increases: 60% per year
Cost falling at a similar rate ($/MB or $/GB)
Evolving to smaller physical sizes: 14in, 5.25in, 3.5in, 2.5in, 1.0in, … ?

Put lots of small disks together

Problem: RELIABILITY
Reliability of N disks = reliability of one disk divided by N

Page 17: Data Management

Key Concepts in Disk Arrays: Striping for High Performance

Interleave data from a single file across multiple disks.

Fine-grained interleaving: every file is spread across all disks; any access involves all disks.

Coarse-grained interleaving: interleave in large blocks; small accesses may be satisfied by a single disk.
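As an illustration of the two interleaving styles, here is a small sketch of striped block placement; the disk count and stripe unit are illustrative parameters, not values from any particular array.

```python
# Minimal sketch of striped block placement across a disk array.

def locate(block: int, disk_count: int, stripe_unit: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk)."""
    stripe = block // stripe_unit            # which stripe unit the block falls in
    disk = stripe % disk_count               # stripe units rotate round-robin
    offset = (stripe // disk_count) * stripe_unit + block % stripe_unit
    return disk, offset

# Fine-grained interleaving: a stripe unit of 1 block spreads every file
# across all disks, so any sizable access touches every disk.
print([locate(b, disk_count=4, stripe_unit=1)[0] for b in range(8)])
# -> [0, 1, 2, 3, 0, 1, 2, 3]

# Coarse-grained interleaving: a large stripe unit lets a small access
# (here blocks 0-3) be satisfied by a single disk.
print([locate(b, disk_count=4, stripe_unit=4)[0] for b in range(8)])
# -> [0, 0, 0, 0, 1, 1, 1, 1]
```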

Page 18: Data Management

Key Concepts in Disk Arrays

Redundancy: maintain extra information in the disk array
Duplication
Parity
Reed-Solomon error-correction codes
Others

When a disk fails: use the redundancy information to reconstruct the data on the failed disk.
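A minimal sketch of the parity idea (used by RAID levels 3-5, covered below): parity is the bit-wise XOR of the data blocks, so XOR-ing the surviving blocks with the parity block regenerates a lost one. The block values are toy data.

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Data blocks on three disks (toy values).
blocks = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"]

# A parity disk holds the bit-wise XOR of all data blocks.
parity = reduce(xor_blocks, blocks)

# Disk 1 fails; reconstruct its block from the survivors plus parity.
survivors = [blk for i, blk in enumerate(blocks) if i != 1]
rebuilt = reduce(xor_blocks, survivors + [parity])

assert rebuilt == blocks[1]
print(rebuilt.hex())  # abcd
```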

Page 19: Data Management

RAID “Levels” (Redundant Arrays of Inexpensive Disks)

Defined by combinations of striping and redundancy (six RAID levels).

RAID Level 1: Mirroring or Shadowing
Maintain a complete copy of each disk
Very reliable
High cost: twice the number of disks
Great performance: a read may go to the disk with the faster access time

Page 20: Data Management

RAID “Levels” (cont.)

RAID Level 2: Memory Style Error Detection and Correction

Not really implemented in practice

Based on DRAM-style Hamming codes

In disk systems, detection is not needed (a failed disk identifies itself)

Can use less expensive correction schemes

Page 21: Data Management

RAID “Levels” (cont.)

RAID Level 3: Fine-grained Interleaving and Parity

Many commercial RAIDs

Calculate parity bit-wise across the disks in the array (using exclusive-OR logic)

Maintain a separate parity disk; update it on write operations

When a disk fails, use the other data disks and the parity disk to reconstruct the data on the lost disk

Fine-grained interleaving: all disks are involved in any access to the array

Page 22: Data Management

RAID “Levels” (cont.)

RAID Level 4: Large Block Interleaving and Parity

Similar to level 3, but interleave on larger blocks

Small accesses may be satisfied by a single disk

Supports higher rate of small I/Os

Parity disk may become a bottleneck with multiple concurrent I/Os

Page 23: Data Management

RAID “Levels” (cont.)

RAID Level 5: Large Block Interleaving and Distributed Parity

Similar to level 4, but distributes parity blocks throughout all disks in the array

Page 24: Data Management

RAID Levels (cont.)

RAID Level 6: Reed-Solomon Error-Correction Codes
Protection against two disk failures

Page 25: Data Management

RAID Levels (cont.)

Disks are getting so cheap that we can consider massive storage systems composed entirely of disks. No tape!!

Page 26: Data Management

DPSS: Distributed Parallel Storage System

Produced by Lawrence Berkeley National Laboratory

“Cache”: provides storage that is
Faster than typical local disk
Temporary

“Virtual disk”: appears to be a single large, random-access, block-oriented I/O device

Isolates the application from the tertiary storage system:
Acts as a large buffer between slow tertiary storage and high-performance network connections
(“Impedance matching”)

Page 27: Data Management

Features of DPSS

Components:

DPSS block servers
Typically low-cost workstations
Each with several disk controllers, and several disks per controller

DPSS master process
Data requests are sent from the client to the master process
Determines which DPSS block server stores the requested blocks
Forwards the request to that block server

Note: servers can be anywhere on the network (a distributed cache)

Page 28: Data Management

Features of DPSS (cont.)

Client API library
Supports a variety of I/O semantics: dpssOpen(), dpssRead(), dpssWrite(), dpssLSeek(), dpssClose()

Application controls data layout in the cache
For typical applications that read sequentially: stripe blocks of data across the servers in round-robin fashion

DPSS client library is multi-threaded
Number of client threads equals the number of DPSS servers: client speed scales with server speed
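A self-contained sketch of the request path just described, with in-memory stand-ins for the master process and block servers; the server names and the round-robin layout rule are illustrative assumptions, and this is not the actual dpss* C API.

```python
# Toy model of the DPSS request path: a client asks the master which
# block server holds a block, then fetches the block from that server.

SERVERS = ["srv0", "srv1", "srv2"]                 # hypothetical block servers
store = {name: {} for name in SERVERS}             # in-memory stand-ins

def layout(block_no: int) -> str:
    """Round-robin striping: a typical layout for sequential readers."""
    return SERVERS[block_no % len(SERVERS)]

def master_lookup(block_no: int) -> str:
    """Master process: map a requested block to the server that stores it."""
    return layout(block_no)

# Load ("stripe") six blocks of a file into the cache.
for i in range(6):
    store[layout(i)][i] = f"block-{i}".encode()

def client_read(block_no: int) -> bytes:
    """Client read: consult the master, then fetch from the named server."""
    server = master_lookup(block_no)
    return store[server][block_no]

print([client_read(i).decode() for i in range(6)])
```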

Page 29: Data Management

Features of DPSS (cont.)

Optimized for a relatively small number of large files
Several thousand files
Greater than 50 MB each

DPSS blocks are available as soon as they are placed in the cache
Good for staging large files to/from tertiary storage
Don't have to wait for a large transfer to complete

Dynamically reconfigurable
Add or remove servers or disks on the fly

Page 30: Data Management

Features of DPSS (cont.)

Agent-based performance monitoring system

Client library automatically sets the TCP buffer size to the optimal value
Uses information published by the monitoring system

Load balancing
Supports replication of files on multiple servers
The DPSS master uses status information stored in an LDAP directory to select the replica that will give the fastest response

Page 31: Data Management

Hierarchical Storage System

Fast disk cache in front of larger, slower storage

Works on the same principle as other hierarchies:
Level-1 and level-2 caches: minimize off-chip memory accesses
Virtual memory systems: minimize page faults to disk

Goal:
Keep popular material in faster storage
Keep most of the material on cheaper, slower storage
Locality: 10% of the material gets 90% of the accesses

Page 32: Data Management

Hierarchical Storage System (cont.)

Problem with tertiary storage (especially tape): very slow. Tape seek times can be a minute or more…

Page 33: Data Management

Data Management

GridFTP

Page 34: Data Management

Motivation….

The GridFTP protocol was born out of the realization that the Grid environment needed a fast, secure, efficient, and reliable transport mechanism.

Existing distributed data storage systems:
DPSS, HPSS: focus on high-performance access; utilize parallel data transfer and striping
DFS: focus on high-volume usage, dataset replication, local caching
SRB: connects heterogeneous data collections; uniform client interface; metadata queries

Page 35: Data Management

Motivation…. (cont.)

Problems:

Incompatible (and proprietary) protocols
Each requires a custom client
Each partitions the available data sets and storage devices

Each protocol has only a subset of the desired functionality

Page 36: Data Management

A Common, Secure, Efficient Data Access Protocol

Common, extensible transfer protocol
A common protocol means all systems can interoperate

Decouple low-level data transfer mechanisms from the storage service

Advantages:
New, specialized storage systems are automatically compatible with existing systems
Existing systems gain richer data transfer functionality

Interfaces to many storage systems: HPSS, DPSS, file systems
Plan for SRB integration

Page 37: Data Management

Access/Transport Protocol Requirements

A suite of communication libraries and related tools that support:
GSI and Kerberos security
Third-party transfers
Parameter set/negotiate
Partial file access
Reliability/restart
Large file support
Data channel reuse
Integrated instrumentation
Logging/audit trail
Parallel transfers
Striping (cf. DPSS)
Policy-based access control
Server-side computation
Proxies (firewall, load balancing)

All based on a standard, widely deployed protocol

Page 38: Data Management

And The Protocol Is … GridFTP

Why FTP?
Ubiquity enables interoperation with many commodity tools
Already supports many desired features, and is easily extended to support others
Well understood and supported

We use the term GridFTP to refer to:
The transfer protocol that meets these requirements
The family of tools that implement the protocol

Note: GridFTP > FTP. Despite the name, GridFTP is not restricted to file transfer!

Page 39: Data Management

GridFTP: Basic Approach

The FTP protocol is defined by several IETF RFCs.

Start with the most commonly used subset
Standard FTP: get/put etc., third-party transfer

Implement standard but often unused features
GSS binding, extended directory listing, simple restart

Extend in various ways, while preserving interoperability with existing servers
Striped/parallel data channels, partial file access, automatic and manual TCP buffer setting, progress monitoring, extended restart

Page 40: Data Management

3rd Party Transfer

[Figure: Computer A directs a transfer of data from Computer B to Computer C]

1: A sends a transfer request to B
2: B initiates the transfer; A may disconnect
3: C receives the file
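For context, the figure's control flow mirrors the classic FTP third-party transfer that GridFTP secures with GSSAPI. A hedged sketch using Python's ftplib, assuming placeholder hosts, credentials, and paths, and servers that still permit unsecured third-party (FXP-style) transfers; many modern servers refuse them.

```python
# Sketch of classic FTP third-party transfer control: the client (A)
# holds two control connections while data flows directly from B to C.

from ftplib import FTP
import re

src = FTP("serverB.example.org")      # will send the file  (placeholder host)
dst = FTP("serverC.example.org")      # will receive it     (placeholder host)
src.login("user", "password")
dst.login("user", "password")

# Put the destination in passive mode and parse the address it advertises,
# e.g. "227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)".
resp = dst.sendcmd("PASV")
nums = re.search(r"\((.+)\)", resp).group(1)

# Tell the source to connect its data channel to that address.
src.sendcmd("PORT " + nums)

# Start the transfer on both ends (paths are placeholders).
dst.sendcmd("STOR /incoming/data.bin")
src.sendcmd("RETR /outgoing/data.bin")

# Read the "transfer complete" replies on both control connections.
print(dst.getresp())
print(src.getresp())
src.quit()
dst.quit()
```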

Page 41: Data Management

The GridFTP Family of Tools

Provide the following features:

Grid Security Infrastructure (GSI) and Kerberos support:
Robust and flexible authentication, integrity, and confidentiality features are critical when transferring or accessing files.
GridFTP supports both GSI and Kerberos authentication, with user-controlled settings for various levels of data integrity and/or confidentiality.

Third-party control of data transfer:
To manage large data sets for large distributed communities, it is necessary to provide third-party control of transfers between storage servers.
GridFTP provides this capability by adding GSSAPI security to the existing third-party transfer capability defined in the FTP standard.

Page 42: Data Management

The GridFTP Family of Tools (cont.)

Parallel data transfer:
On wide-area links, using multiple TCP streams can improve aggregate bandwidth over a single TCP stream.
This is required both between a single client and a single server, and between two servers.
GridFTP supports parallel data transfer through FTP command extensions and data channel extensions.

Striped data transfer:
Partitioning data across multiple servers can further improve aggregate bandwidth.
GridFTP supports striped data transfers through extensions defined in the Grid Forum draft.
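A minimal sketch of the partitioning behind parallel and striped transfers: split the file into one contiguous byte range per stream and move the ranges concurrently. The transfer function is a stand-in, not GridFTP's API.

```python
from concurrent.futures import ThreadPoolExecutor

def ranges(size: int, streams: int):
    """Split [0, size) into one contiguous (offset, length) range per stream."""
    base, extra = divmod(size, streams)
    offset = 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)
        yield offset, length
        offset += length

def transfer_range(offset: int, length: int) -> int:
    # Stand-in for a partial-file transfer (e.g. a restart-marker offset
    # plus RETR on one of several parallel data channels).
    return length

size, streams = 1_000_003, 4
with ThreadPoolExecutor(max_workers=streams) as pool:
    moved = sum(pool.map(lambda r: transfer_range(*r), ranges(size, streams)))
assert moved == size
print(f"moved {moved} bytes over {streams} streams")
```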

Page 43: Data Management

The GridFTP Family of Tools (cont.)

Partial file transfer:
Many applications require the transfer of partial files.
However, standard FTP requires the application to transfer the entire file, or the remainder of a file starting at a particular offset.
GridFTP introduces new FTP commands to support transfers of regions of a file.

Support for reliable data transfer:
Reliable transfer is important for many applications that manage data.
Fault recovery methods are needed for handling transient network failures, server outages, etc.
The FTP standard includes basic features for restarting failed transfers, but they are not widely implemented.
The GridFTP protocol exploits these features and substantially extends them.

Page 44: Data Management

The GridFTP Family of Tools (cont.)

Manual control of TCP buffer size:
This is a critical parameter for achieving maximum bandwidth with TCP/IP.
The protocol also supports automatic buffer-size tuning, but we have not yet implemented it in our code.
We are talking with both NCSA and LANL to see if it makes sense to integrate work they are doing in this area into our code.

Integrated instrumentation:
The protocol calls for restart and performance markers to be sent back.
How often they are sent is not yet specified; this is something we intend to address shortly.
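A small sketch of manual buffer sizing: set the socket buffers to the bandwidth-delay product before connecting. The link bandwidth and round-trip time below are assumed values for illustration.

```python
import socket

bandwidth_bps = 622_000_000      # e.g. an OC-12 wide-area link (assumed)
rtt_s = 0.060                    # 60 ms round-trip time (assumed)
bdp_bytes = int(bandwidth_bps / 8 * rtt_s)   # ~4.7 MB bandwidth-delay product

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Must be set before connect() for TCP window scaling to take effect.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)

print(f"requested {bdp_bytes} byte buffers;",
      "granted", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
# The OS may clamp the request to its configured maximum (e.g.
# net.core.rmem_max on Linux), which is why the granted size is checked.
```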

Page 45: Data Management

Why Did We Need a New Transport Protocol?

The requirement was a transport protocol that met the following criteria:

Targeted at bulk data transport: We saw this as a protocol to move lots of data (100s of Megabytes and above)

Based on industry standards: i.e., a clear, well defined, published, nonproprietary protocol.

Secure: Allowed for authentication, authorization, integrity, and privacy

Fast and Efficient: This meant employing multiple levels of parallelism and minimizing overhead.

Page 46: Data Management

Why Did We Need a New Transport Protocol? (cont.)

Robust: The protocol must be able to tolerate system failures gracefully.

Allowed 3rd party transfers: We believe much of the traffic will be generated by automated systems such as schedulers.

Integrated instrumentation: The protocol must provide feedback on operational status so that intelligent actions can be taken during transfers.

Easily extensible: both in terms of standards-body approval and in technical, architectural, and coding terms.

Page 47: Data Management

Data Management

Replication Management

Page 48: Data Management

The Motivation…

Data-intensive, high-performance computing applications require efficient management and transfer of terabytes or petabytes of information in wide-area, distributed computing environments.

Examples of such applications include experimental analyses and simulations in scientific disciplines such as:
High-energy physics
Climate modeling
Earthquake engineering
Astronomy

Page 49: Data Management

The Motivation… (cont.)

In such applications, massive datasets must be shared by a community of hundreds or thousands of researchers distributed worldwide.

These researchers need to transfer large subsets of these datasets to local sites or other remote resources for processing.

They may create local copies or replicas to overcome long wide-area data transfer latencies.

Page 50: Data Management

The Motivation… (cont.)

Once multiple copies of files are distributed at multiple locations, researchers need a service:
to locate copies
to determine whether to access an existing copy or create a new one
to meet the performance needs of their applications

Page 51: Data Management

The Problem…..

“Enable a geographically distributed community [of thousands] to perform sophisticated, computationally intensive analyses on Petabytes of data”

Page 52: Data Management

Replica Management Service

In a high-performance, distributed computing environment, a replica management service:
manages the copying process
manages the placement of files

The goal: to optimize the performance of data-intensive applications.

This architecture consists of two parts:
A replica catalog: a repository where information can be registered about logical files, collections of files, and the physical locations where subsets of collections are stored
A set of registration and query operations that are supported by the replica management service

Page 53: Data Management

Replica Management Service (cont.)

The replica management service can be used by higher-level services such as replica selection and automatic creation of new replicas to satisfy application performance requirements.

Page 54: Data Management

Example: CERN Large Hadron Collider

Multiple petabytes of data per year

Copy of everything at CERN (Tier 0)
Subsets at national centers (Tier 1)
Smaller regional centers (Tier 2)
Individual researchers have copies

How to keep track of all copies?
Select among available copies or create a new copy?

Page 55: Data Management
Page 56: Data Management

An Approach to Replica Management

Identify replica cataloging and reliable replication as two fundamental services

Layer on other Grid services: GSI, transport, MDS Information Service

Use LDAP as the catalog format and protocol, for consistency

These services can be used as building blocks for higher-level services

Page 57: Data Management

The Replica Catalog: An Information Service

Registers new copies of files and collections
Responds to queries about existing replicas
Maintains a mapping between logical names for files and collections and one or more physical locations
Uses the LDAP protocol

Accessed by higher-level tools that perform:
Selection of replicas based on performance, using Information Services (MDS, NWS)
Dynamic creation of replicas in response to demand

Page 58: Data Management

Replica Catalog Structure: A Climate Modeling Example

[Figure: a Replica Catalog tree of logical collections, logical files, and locations]

Replica Catalog
Logical collection “CO2 measurements 1998”
Logical files: Jan 1998, Feb 1998 (size: 1468762), … (each logical file records its parent collection)
Location jupiter.isi.edu: filenames Mar 1998, Jun 1998, Oct 1998; protocol: gsiftp; URL constructor: gsiftp://jupiter.isi.edu/nfs/v6/climate
Location sprite.llnl.gov: filenames Jan 1998 … Dec 1998; protocol: ftp; URL constructor: ftp://sprite.llnl.gov/pub/pcmdi
Logical collection “CO2 measurements 1999”
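A toy model of the structure in this figure, with logical-to-physical resolution over it; the dictionary layout is illustrative, and the file lists elided in the figure are left incomplete.

```python
# Logical collection -> logical files and the locations (replicas) holding them.
# Values are taken from the figure above; the layout itself is invented.

catalog = {
    "CO2 measurements 1998": {
        "logical_files": {"Jan 1998": {}, "Feb 1998": {"size": 1468762}},
        "locations": [
            {"host": "jupiter.isi.edu",
             "url_constructor": "gsiftp://jupiter.isi.edu/nfs/v6/climate",
             "files": ["Mar 1998", "Jun 1998", "Oct 1998"]},
            {"host": "sprite.llnl.gov",
             "url_constructor": "ftp://sprite.llnl.gov/pub/pcmdi",
             # The figure elides the months between Jan and Dec.
             "files": ["Jan 1998", "Dec 1998"]},
        ],
    },
}

def replicas(collection: str, logical_file: str):
    """Resolve a logical file name to the physical URLs that hold it."""
    for loc in catalog[collection]["locations"]:
        if logical_file in loc["files"]:
            # Toy naming rule for the physical file component.
            yield f'{loc["url_constructor"]}/{logical_file.replace(" ", "")}'

print(list(replicas("CO2 measurements 1998", "Jan 1998")))
```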

Page 59: Data Management

Example: Components of the Globus Replica Manager

Replica catalog definition
LDAP object classes for representing logical-to-physical mappings in an LDAP catalog

Low-level replica catalog API
globus_replica_catalog library
Manipulates the replica catalog: add, delete, etc.

High-level reliable replication API
globus_replica_manager library
Combines calls to file transfer operations with calls to low-level API functions: create, destroy, etc.

Page 60: Data Management

Replica Selection Relies on Information Services

The replica catalog identifies all existing copies of files or collections
Select among them based on performance

Consult other Information Services:
Network Weather Service: network performance between source and destination
Information Service for Storage Systems: file system capacity and performance

Wide variety of selection algorithms
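One possible selection policy, sketched under assumed numbers: rank replicas by predicted transfer time using bandwidth forecasts of the kind NWS provides, discounted by storage-system load. All values below are invented for illustration.

```python
replicas = [
    {"url": "gsiftp://jupiter.isi.edu/...", "mbit_s": 45.0, "busy": 0.2},
    {"url": "ftp://sprite.llnl.gov/...",    "mbit_s": 80.0, "busy": 0.7},
]

def predicted_seconds(replica: dict, size_mbit: float) -> float:
    # Effective bandwidth shrinks as the storage server gets busier.
    effective = replica["mbit_s"] * (1.0 - replica["busy"])
    return size_mbit / effective

size_mbit = 8_000.0   # a 1 GB file
best = min(replicas, key=lambda r: predicted_seconds(r, size_mbit))
print(best["url"], f"{predicted_seconds(best, size_mbit):.0f}s")
# The nominally slower but less-loaded replica wins here.
```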

Page 61: Data Management

Dynamic Replica Creation and Information Services

The application manager needs to guarantee a certain level of performance:
Bandwidth from source to destination
Rate of accesses

Using information services (NWS, MDS):
Determine that existing replicas can't provide that performance
Identify a location at which to create a new replica with the desired capacity and performance

Data distribution services

Page 62: Data Management

Relationship of Replica Manager and Metadata Catalogs

Metadata Services: Information Services that describe data contents

The Replica Management Service interacts with a variety of metadata catalogs:
Globus: a simple set of object classes
MCAT
Community-defined metadata catalogs using a common set of attributes

Page 63: Data Management

Relationship of Replica Manager and Metadata Catalogs (cont.)

The metadata service produces the logical names needed by the replica catalog:
Logical collections
Logical files

Page 64: Data Management

A Model Architecture for Data Grids

[Figure: the application presents an attribute specification to the Metadata Catalog and receives a logical collection and logical file name; the Replica Catalog maps these to multiple locations; Replica Selection, using performance information and predictions from NWS and MDS, chooses among Replica Locations 1-3 (tape library, disk cache, disk array); the selected replica is then accessed with gsiftp commands]

Page 65: Data Management

Outstanding Issues for Replica Management

The early architecture assumed a read-only workload
What update models should we support?

What high-level operations are needed?
Combine storage and catalog operations

Relationship to databases
Replicating the replica catalog itself
Alternate catalog views: files may belong to more than one logical collection

Page 66: Data Management

Data Management

Metadata Management

Page 67: Data Management

The Motivation…

To manage large data sets efficiently, metadata, that is, descriptive information about the data, must also be managed.

There are various types of metadata, and it is likely that a range of metadata services will exist in Grid environments that are specialized for particular types of metadata cataloguing and discovery.

Page 68: Data Management

What is metadata?

Information that describes the contents of data
UNIX-style file system metadata: file size, access permissions, creation time, modification time, etc.

Descriptions of the meaning of files:
A satellite image of South America?
Results of a Monte Carlo simulation run by physicists?
Today's PowerPoint lecture?

Information needed to read/interpret the bits

Page 69: Data Management

What is metadata? (cont.)

Provenance information
When was the data created? By whom?
Under what experimental conditions?
Using what application software, operating system software, and hardware configuration?
What input files and parameter settings?

Makes experimental results repeatable, or at least understandable

Page 70: Data Management

Attributes and Schema

Typically store metadata as attribute: value pairs associated with data objects (files, database objects, etc.)
LDAP catalogs
Relational databases
Dublin Core: mechanisms to associate descriptive fields with documents
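A toy illustration of attribute: value metadata and attribute-based selection; the attribute names echo the provenance examples above and are not a standard schema.

```python
# Attribute:value metadata attached to a data object (all values invented).
metadata = {
    "logical_name": "climate/jan1998.nc",   # hypothetical identifier
    "size_bytes": 1468762,
    "created": "1998-02-03T11:20:00Z",
    "creator": "pcmdi-pipeline",
    "software": "ccm3.6",                   # application that produced it
    "input_files": ["obs/jan1998.raw"],
}

def matches(md: dict, **wanted) -> bool:
    """Attribute-based selection: does this object match a descriptive query?"""
    return all(md.get(k) == v for k, v in wanted.items())

print(matches(metadata, creator="pcmdi-pipeline"))  # True
print(matches(metadata, software="ccm4"))           # False
```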

Page 71: Data Management

The Metadata Services

Allow scientists to record information about:
the creation
transformation
meaning
quality of data items

Query for data items based on descriptive attributes.

Accurate identification of desired data items is essential for correct analysis of experimental and simulation results.

In the past, scientists have largely relied on ad hoc methods (descriptive file and directory names, lab notebooks, etc.) to record information about data items.

Page 72: Data Management

The Metadata Services (cont.)

However, these methods do not scale to terabyte and petabyte data sets consisting of millions of data items.

Extensible, reliable, high performance Grid services are required to support registration and query of metadata information.

Page 73: Data Management

Type of Metadata Service

There are various types of metadata, and a range of metadata services will exist in Grid environments, specialized for different types of metadata cataloguing and discovery.

Some metadata relate to the physical characteristics of data objects, such as:
their size
access permissions
owners
modification information

Example: replication metadata describes the relationship between logical data identifiers and one or more physical instances of the data.

Page 74: Data Management

Type of Metadata Service (cont.)

Other metadata attributes describe the contents of data items, allowing the data to be interpreted.

Examples:
Climate modeling data sets may have associated metadata attributes that include variables such as temperature, surface pressure, precipitation, and cloud cover.
High-energy physics metadata might include information about the period of time during which events were detected in a particle collider.

Page 75: Data Management

Type of Metadata Service (cont.)

A special case of descriptive metadata is provenance information:
records how data items were created and transformed
describes what experimental apparatus, simulation, or analysis software produced the data item

This provenance information can be used to track a series of analyses or transformation steps on a data item, or to reproduce a data item.

Page 76: Data Management

Type of Metadata Service (cont.)

These types of metadata need to be recorded separately because data characterization and data discovery are diverse activities with different requirements for performance, reliability, and consistency.

Page 77: Data Management
Page 78: Data Management

Type of Metadata Service (cont.)

Physical metadata
Information about the characteristics of data on physical storage systems, as well as replica location metadata.
Services that maintain physical metadata include file systems and database management systems.

Domain-independent metadata
General metadata attributes that apply to data items regardless of the application domain or virtual organization in which the data sets are created and shared.

Page 79: Data Management

Type of Metadata Service (cont.)

Domain-specific metadata
Defined by metadata ontologies that are developed by application communities.
For example, physicists or earthquake engineers may agree on a common set of terms and metrics that are useful for characterizing shared data sets, and represent these using a common set of metadata attributes.

Virtual organization metadata
Multiple scientific or corporate institutions may define an additional set of metadata attribute conventions for characterizing data sets.

User metadata
Users may want to associate metadata attributes such as annotations with data items or collections.

Page 80: Data Management

The Role of the Metadata Management Service

Services that maintain mappings between logical name attributes for data items and other descriptive metadata attributes, and respond to queries about those mappings.

In particular, Metadata Services support domain-independent, domain-specific, virtual organization, and user metadata attributes.

Metadata Services play a key role in the publication, discovery, and access of data sets.

The Metadata Catalog Service (MCS) provides a mechanism:
for storing and accessing descriptive metadata
for allowing users to query for data items based on desired attributes

Page 81: Data Management

Publication

The process by which data sets and their associated attributes are stored and made accessible to a user community.

Many data-intensive scientific applications publish data sets as a community.

For example, when the results of a scientific experiment are obtained, they are calibrated, put into a standard format, and made available, or published, to the community.

Page 82: Data Management

Publication (cont.)

Part of the publication process is to associate the data set with domain-independent, domain-dependent, and virtual organization metadata attributes.

Subsequent to publication, users may:
annotate the data set with their own observations, using user attributes, and make it available to a controlled subset of the community
perform analyses of published data sets to produce new data sets
organize published data items of interest into customized views

Page 83: Data Management

Discovery and Access

The process of identifying data items of interest to the user.

The Metadata Service allows users to discover data sets based on the value of descriptive attributes, rather than requiring them to know about specific names or physical locations of data items.

The Metadata Service forms one component of data discovery and access in a Grid.

Page 84: Data Management
Page 85: Data Management

Discovery and Access (cont.)

The scenario:

1. A client application first queries the Metadata Service to find data sets with particular attribute values.
2. The Metadata Service responds with a list of logical name attributes for the data items with matching attributes.
3. The client queries the Replica Location Service.
4. The replica service returns a list of physical locations for the data content identified by the logical names.
5. The client selects replicas for access and contacts the storage systems where the data items reside.
6. The desired data sets are returned using the GridFTP protocol.
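An end-to-end sketch of the six steps, with in-memory stand-ins for the Metadata Service, the Replica Location Service, and the GridFTP retrieval; all names and attribute values are invented.

```python
metadata_service = {               # logical name -> descriptive attributes
    "run-042": {"band_hz": (100, 200), "type": "spectrum"},
    "run-043": {"band_hz": (300, 400), "type": "spectrum"},
}
replica_service = {                # logical name -> physical replicas
    "run-042": ["gsiftp://siteA/data/run-042", "gsiftp://siteB/data/run-042"],
}

def discover(**attrs):                                   # steps 1-2
    return [name for name, md in metadata_service.items()
            if all(md.get(k) == v for k, v in attrs.items())]

def locate(logical_name: str):                           # steps 3-4
    return replica_service.get(logical_name, [])

def fetch(replicas: list) -> str:                        # steps 5-6
    chosen = replicas[0]           # a real client would rank the replicas
    return f"<contents of {chosen}>"   # stand-in for a GridFTP retrieval

for name in discover(band_hz=(100, 200)):
    print(name, "->", fetch(locate(name)))
```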

Page 86: Data Management

Metadata Service Components

It is a specialized service that includes the following components:
A data model that includes mechanisms for aggregation of metadata mappings
A standard schema for domain-independent metadata attributes, with extensibility for additional user-defined attributes
A set of standard service behaviors
Query mechanisms for accessing the database
A set of standard interfaces and APIs for storing and accessing metadata
A set of policies for consistency, access control and authorization, and auditing

Page 87: Data Management

MCAT: A Metadata Catalog

From the Storage Resource Broker project at the San Diego Supercomputer Center

One of the most sophisticated metadata catalogs

Allows content-based searching: specify attributes of the data at a high level and it identifies data objects that match

Page 88: Data Management

MCS: A Metadata Catalog Service for Grids

The initial implementation of MCS assumes a file-based data model.

The most basic item in MCS data model is the logical file, which is uniquely identified by a logical name.

Page 89: Data Management

Data Model for MCS

Logical file: uniquely identified by a logical name.

Logical collections: user-defined aggregations that can consist of zero or more logical files and/or other logical collections.

Logical views: aggregations that can consist of zero or more logical files, collections, and/or other logical views.
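A sketch of this data model as plain Python classes; the field choices follow the definitions above, and none of this is the actual MCS API.

```python
from dataclasses import dataclass, field

@dataclass
class LogicalFile:
    logical_name: str                     # unique within the service namespace

@dataclass
class LogicalCollection:
    name: str
    files: list[LogicalFile] = field(default_factory=list)
    subcollections: list["LogicalCollection"] = field(default_factory=list)

@dataclass
class LogicalView:
    name: str
    # May aggregate files, collections, and other views.
    members: list = field(default_factory=list)

jan = LogicalFile("climate/jan1998")
c98 = LogicalCollection("CO2 measurements 1998", files=[jan])
view = LogicalView("my-favorites", members=[jan, c98])
print(view)
```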

Page 90: Data Management

MCS Schema

The MCS schema can be divided into the following logical categories:
Logical file metadata
Logical collection metadata
Logical view metadata
Authorization metadata
Other metadata (user, audit, user-defined, annotation, etc.)

Each category is described on the following slides, beginning with the metadata attributes associated with the logical file.

Page 91: Data Management

Logical File Metadata

Metadata attributes associated with the logical file include the following:

The logical file name attribute specifies a name that is unique within the namespace managed by the Metadata Service.
The data type attribute describes the data item type, for example, whether the file format is binary, html, XML, etc.
The valid attribute indicates whether a data item is currently valid.
The version attribute allows us to distinguish among versions of a logical file.
The collection identifier attribute allows us to associate a logical file with exactly one logical collection.
The container identifier and container service attributes allow us to specify the external container service that groups objects together, and to associate a container identifier with the logical file mapping.

Page 92: Data Management

Logical File Metadata (cont.)

The creator and last modifier attributes record the identities of the logical file's creator and last modifier.
The creation time and last modification time attributes record when those actions occurred.
A master copy attribute can contain the physical location of the definitive or master copy of the file, for use by higher-level data consistency services.
The audit attribute specifies whether audit information is to be recorded for this logical file.

Page 93: Data Management

Logical collection metadata

Logical collection metadata attributes include:
the collection name and a description of the collection
a contents attribute, consisting of the list of logical files and other logical collections that compose this collection
a text description of the collection, information about the creator and modifiers of the collection, and audit information
a parent attribute that records the identifier of the parent collection

Page 94: Data Management

Logical view metadata

Logical view metadata attributes include:
the logical view name and description
information about the logical files, logical collections, and other logical views that compose this logical view
attributes describing the creator and modifiers of the view, and audit information

Page 95: Data Management

Authorization metadata

Authorization metadata attributes are used in addition to (or in the absence of) an external authorization service such as the Community Authorization Service (CAS).

They are used to specify access privileges on logical files, collections, and views.

Authorization information must be maintained for individual users.

If an external Community Authorization Service (CAS) is used, then authorization information must also be maintained for the CAS.

Page 96: Data Management

Authorization metadata (cont.)

Access permissions may be defined:
on the MCS itself (e.g., permission to add logical files to the MCS)
on a logical file (e.g., permission to modify the file's attributes)
on a logical collection (e.g., permission to add a logical file to the logical collection)
on a logical view (e.g., permission to list the contents of a logical view)

Access permissions specified on a logical collection apply to all logical files that compose that logical collection (and its subcollections).
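A sketch of the inheritance rule in the last point: a permission granted on a logical collection applies to the files it contains and to its subcollections. The ACL layout is an illustrative assumption, not the MCS authorization schema.

```python
acls = {                       # object name -> {user: set of permissions}
    "collection:1998": {"alice": {"read", "add_file"}},
    "file:jan1998": {"bob": {"read"}},
}
parents = {                    # object -> enclosing collection (if any)
    "file:jan1998": "collection:1998",
    "collection:1998": None,
}

def allowed(user: str, obj: str, perm: str) -> bool:
    """A permission holds if granted directly or on any ancestor collection."""
    node = obj
    while node is not None:
        if perm in acls.get(node, {}).get(user, set()):
            return True
        node = parents.get(node)
    return False

print(allowed("alice", "file:jan1998", "read"))   # True, via the collection
print(allowed("bob", "collection:1998", "read"))  # False, file-level grant only
```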

Page 97: Data Management

Other metadata in MCS

User metadata: describes users, with contact information: name, description, institution, address, phone, and email.

Audit metadata: used to record actions performed via the metadata service (e.g., the user who performed the audited action and the timestamp at which the audited operation was performed).

User-defined metadata attributes: extensibility of the MCS schema beyond the predefined attributes is provided by allowing users to define new attributes.

Page 98: Data Management

Other metadata in MCS (cont.)

Annotation attributes

Creation and transformation history metadata

External catalog metadata: the host name and IP address at which an external catalog may be accessed

Page 99: Data Management

Application Experiences: The Pegasus/LIGO Application

Pegasus (Planning for Execution in Grids) is a system developed within the GriPhyN project.

Pegasus is used to map complex application workflows onto the available Grid resources.

It is used in the Laser Interferometer Gravitational-Wave Observatory (LIGO) project, which seeks to directly detect the gravitational waves predicted by Einstein's theory of relativity.

Page 100: Data Management

Application Experiences: The Pegasus/LIGO Application (cont.)

Pegasus uses MCS to discover existing application data products.

When the Pegasus planner receives a user request to retrieve data with particular metadata attributes, it queries the MCS to find all logical files with the corresponding properties. For example, a user might request all logical files corresponding to a particular frequency band, and the MCS will return a list of the relevant files to the Pegasus planner.

Page 101: Data Management

Application Experiences: The Pegasus/LIGO Application (cont.)

When the workflow generated by the Pegasus planner results in the creation of new application data products, Pegasus uses the Metadata Catalog Service to record the metadata attributes associated with those newly materialized data products.

Among the data products created by LIGO analyses are time series data, frequency spectra and the results of pulsar searches.

Attributes that describe these data products, including the type of data and the duration of data measurements, are stored in the MCS.

Page 102: Data Management

A Model Architecture for Data Grids

[Figure, repeated from Page 64: the application presents an attribute specification to the Metadata Catalog and receives a logical collection and logical file name; the Replica Catalog maps these to multiple locations; Replica Selection, using performance information and predictions from NWS and MDS, chooses among Replica Locations 1-3 (tape library, disk cache, disk array); the selected replica is then accessed with gsiftp commands]