CYBERINFRASTRUCTURE FOR THE GEOSCIENCES Data Replication Service Sandeep Chandra GEON Systems Group...

20
www.geongrid.org CYBERINFRASTRUCTURE FOR THE GEOSCIENCES Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center

Transcript of CYBERINFRASTRUCTURE FOR THE GEOSCIENCES Data Replication Service Sandeep Chandra GEON Systems Group...

Page 1: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Data Replication Service

Sandeep Chandra

GEON Systems Group

San Diego Supercomputer Center

Page 2: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Outline

• Motivation

• Data Replication Service (DRS)

• Components for DRS

– RLS, GridFTP, RFT

• DRS Deployment

• DRS setup on GEON

• Next Steps

Page 3: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Motivation

• Science domains spend considerable effort collecting and managing large amounts of data

• Science domains develop customized data management services that vary with the type of application

• Common data management requirements – Publish and replicate large datasets– Register data replicas in catalogs and discover them– Perform metadata-based discovery of datasets– May require ability to validate correctness of replicas

Page 4: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Motivation (cont.)

• These systems demand considerable resources to design, implement & maintain– Typically cannot be re-used by other applications

• Need for a long-term solution– Generalize functionality provided by these data

management systems– Provide suite of application-independent services

• Design and build on lower-level grid services– Globus Reliable File Transfer (RFT) service– Replica Location Service (RLS)– GridFTP

Page 5: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

A possible solution:Data Replication System (DRS)• Higher level data management service based

on low level data management components like RLS and RFT

• The primary functionality is to – Allow users to identify a set of desired files

existing in their grid environment – Make local replicas of those data files by

transferring files from one or more source locations

– Register the new replicas in a Replica Location Service

Page 6: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Replica Location Service (RLS)

• A simple registry that keeps track of where replicas exist on physical storage systems.

• Users or services register files in RLS when the files are created.

• Query RLS servers to find these replicas.• RLS can be a distributed registry, consisting

of multiple servers at different sites. • Distributed RLS increases the overall scale

and store more mappings than would be possible in a single, centralized catalog.

Page 7: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

RLS (cont.)• A logical file name is a unique

identifier for the contents of a file.• A physical file name is the location

of a copy of the file on a storage system.

• RLS maintains mappings between logical file names and one or more physical file names of replicas.

• Users can provide a logical file name to an RLS server and ask for all the registered physical file names of replicas.

• Users can also query an RLS server to find the logical file name associated with a particular physical file location.

XYZ replica 1 XYZ replica 2 XYZ replica 3

Logical File Name XYZ

Site 1 Site 2 Site 3

Page 8: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

RLS (cont.)• Two servers: LRI, LRC• LRC stores mappings between

logical names for data items and the physical locations of replicas.

• Query the LRC to discover replicas associated with a logical name.

• RLI server collects information about the logical name mappings stored in one or more LRCs.

• RLI returns a list of all the LRCs it is aware of that contain mappings for the logical name contained in a query.

• The client then queries these LRCs to find the physical locations of replicas.

RLI RLI RLI

LRC LRC LRC LRC

Local Replica Catalogs (LRC)

Replica Location Index (RLI) Nodes

Page 9: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

RLS in Context

• The RLS is one component in a layered data management architecture

• Consistency management provided by higher-level services

Replica Consistency Management Services

Replica Location ServiceReliable Data Transfer

Service

GridFTP

MetadataService

Reliable Replication Service

Page 10: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

GridFTP• The GridFTP protocol provides for the secure,

robust, fast and efficient transfer of (especially bulk) data.

• Globus Toolkit provides the most commonly used implementation of the protocol, though others exist.

• The Globus Toolkit provides– server implementation called globus-gridftp-server– scriptable command line client called globus-url-

copy– a set of development libraries for custom clients

Page 11: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Reliable File Transfer (RFT)

• A WSRF compliant web service that provides “job scheduler” like functionality for data movement.

• You provide a list of source and destination URLs (including directories or files), then the service writes your job description into a database and moves the files on your behalf.

Page 12: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

RFT (cont.)• Accepts SOAP description of a desired transfer• Service methods are provided for querying the

transfer status• WSRF tools to subscribe for notifications of state

change events• Supports all the same options as globus-url-copy

(buffer size, etc)• Increased reliability because state is stored in a

database• Supports concurrency, multiple files transferred for

better performance

Page 13: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Web Service Container

Data Replication

Service

Replicator Resource

Reliable File

Transfer Service

RFT Resource

Local Replica Catalog

Replica Location

Index

GridFTP Server

Delegation Service

Delegated Credential

Local Site

Globus Services

• WSRF Services– Data Replication Service– Delegation Service– Reliable File Transfer

Service

• Pre WSRF Components– Replica Location Service

(Local Replica Catalog, Replica Location Index)

– GridFTP Server

Page 14: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

GridFTP Server

DRSService

RFTService

Create aTransferrequest

DRS Deployment• Local storage system• GridFTP server for file

transfer• Replica Location Service:

– LRCs stores mappings from logical names to storage locations

– RLI collects state summaries from LRCs

• RFT: WSRF service to perform data transfer

• DRS: The master replication service

DatabaseSite

StorageSystem

ReplicaLocation

Index

LocationReplicaCatalog

Page 15: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Web Service Container

Data Replication

Service

Replicator Resource

Reliable File

Transfer Service

RFT Resource

Local Replica Catalog

GridFTP Server

Delegation Service

Delegated Credential

Local Site

Web Service Container

Data Replication

Service

Replicator Resource

Reliable File

Transfer Service

RFT Resource

Local Replica Catalog

Replica Location

Index

GridFTP Server

Delegation Service

Delegated Credential

Remote Sites 1…N

Client

RequestFile

1

2

3

4 5

6

Replica Location

Index

7

8

9

10

11

12

13

Page 16: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

DRS Functionality

• Initiate a DRS Request• Create a delegated credential (Delegate Authority)• Create a Replicator resource (Replication Service)• Monitor Replicator resource (Status)• Discover replicas of files in RLS, select among replicas• Start data transfer to local site with RFT service• Check status• Register new replicas in RLS catalogs• Allow client inspection of DRS results• Destroy Replicator resource

Page 17: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Geon DRS Test Setup

ASU SDSC

GridFTP Server

DRSService

RFTService

Create aTransferrequest

DatabaseSite

StorageSystem

ReplicaLocation

Index

ReplicaLocationCatalog

GridFTP Server

DRSService

RFTService

Create aTransferrequest

DatabaseSite

StorageSystem

ReplicaLocation

Index

ReplicaLocationCatalog

Data Transfer

Globus Container Globus Container

Page 18: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Next Tasks

• Transfer LIDAR data from ASU to SDSC resource. (HPSS, etc)

• Extend the testbed to include more nodes.

• Benchmarking data movement.

• Package DRS and components with GEON software stack version 2.0

Page 19: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Acknowledgement

• Ann Chervenak & Robert Schuler (ISI)

• www.globus.org (slides)

Page 20: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES  Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.

www.geongrid.orgCYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Questions?