1 integrated Rule Oriented Data System Tutorial: iRODS Capabilities.


integrated Rule Oriented Data System

Tutorial: iRODS Capabilities


Outline

Introduction to iRODS capabilities

Data-driven science and full Data Life Cycle

Policy-based Management of Distributed Data

Scaling: petabytes, 100s of millions of files

Enabling unified sharable "virtual" collections

Enabling data grids (sharing), digital libraries (publishing), persistent archives (preservation)

Unified Data Space: Interoperate via Federation


Introduction to iRODS Capabilities


Data Driven Science

• Enable new science through collaborative research on shared data collections
  • Management of entire scientific data life cycle from data analysis pipelines to long-term sustainability of reference collections
• Implement national scale data cyber-infrastructure
  • Federation of exemplar data management technologies in exemplar research initiatives
• Creation of production data management systems
  • Proven technology implemented in extant data grids
• Integrate “live” research data collections into education initiatives
  • Policy-based data management across distributed data

Diagram: Data Life Cycle (Project, Shared Collection, Processing Pipeline, Digital Library, Reference Collection, Federation)


Data are Inherently Distributed

• Distributed sources
  • Projects span multiple institutions, nations
• Distributed analysis platforms
  • Grid computing
• Distributed data storage
  • Minimize risk of data loss, optimize access
• Distributed users
  • Caching of data near user
• Multiple stages of data life cycle
  • Data repurposing for use in broader context


Cloud Storage

Institutional Repositories

Federal Repositories

Carolina Digital Repository

Texas Digital Library

National Climatic Data Center

National Optical Astronomy Observatory


Data Processing Pipelines

Preservation Environment

Ocean Observatories Initiative

NARA Transcontinental Persistent Archive Prototype

Carolina Digital Repository

Large Synoptic Survey Telescope

Digital Library

Texas Digital Library

French National Library

Data Grid

Teragrid

Temporal Dynamics of Learning Center

Australian Research Collaboration Service

Taiwan National Archive


Data Life Cycle

Project Collection | Private | Local Policy
Data Grid | Shared | Distribution Policy
Digital Library | Published | Description Policy
Data Processing Pipeline | Analyzed | Service Policy
Reference Collection | Preserved | Representation Policy
Federation | Sustained | Re-purposing Policy

Each stage adds new policies for a broader community

Virtualize the stages of data life cycle through evolution of policies

Interoperability across data life cycle representations

Each stage of the data life cycle re-purposes the original collection


Tracing the Data Life Cycle

• Collection Creation using a Data Grid
  • Data manipulation / Data ingestion
• Processing Pipelines
  • Pipeline processing / Environment administration
• Data Grid
  • Policy display / Micro-service display / State information display / Replication
• Digital Library
  • Access / Containers / Metadata browsing / Visualization
• Preservation Environment
  • Validation / Audit / Federation / Deep Archive / SHAMAN


Goal - Generic Infrastructure

• Manage all stages of the data life cycle
  • Data organization
  • Data processing pipelines
  • Collection creation
  • Data sharing
  • Data publication
  • Data preservation
• Create reference collection against which future information and knowledge is compared
  • Each stage uses similar storage, arrangement, description, and access mechanisms


Concept Roadmap

• Purpose - reason a collection is assembled
• Properties - attributes needed to ensure the purpose
• Policies - enforce and maintain required properties
• Procedures - computer functions to implement Policies
• State information - results of applying procedures (iCAT)
• Assessment criteria - validate that state information conforms to desired purpose
• Federation - interoperate with shared logical name spaces
• These are the required elements for data life cycle virtualization
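The roadmap elements can be read as a loop: a policy applies a procedure, the procedure records state information, and an assessment checks that state against the required property. The following is a minimal Python sketch of that loop, assuming a hypothetical "two copies per file" property. Every name in it (Catalog, replicate, replication_policy, assess, the site names) is illustrative and is not part of iRODS.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Stand-in for state information (the role iCAT plays); hypothetical."""
    replicas: dict = field(default_factory=dict)  # logical name -> list of sites


REQUIRED_COPIES = 2  # the property; value assumed for illustration


def replicate(catalog, name, site):
    """Procedure: record one more copy of `name` at `site`."""
    sites = catalog.replicas.setdefault(name, [])
    if site not in sites:
        sites.append(site)


def replication_policy(catalog, name, sites):
    """Policy: apply the procedure until the required property holds."""
    for site in sites:
        if len(catalog.replicas.get(name, [])) >= REQUIRED_COPIES:
            break
        replicate(catalog, name, site)


def assess(catalog):
    """Assessment criterion: names whose state violates the property."""
    return sorted(n for n, s in catalog.replicas.items() if len(s) < REQUIRED_COPIES)


catalog = Catalog()
replication_policy(catalog, "survey.dat", ["siteA", "siteB", "siteC"])
replication_policy(catalog, "notes.txt", ["siteA"])
violations = assess(catalog)
print(violations)
```

The assessment only reads state information, so it can be rerun at any time without touching the data itself.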


Policy-based Management

• Each data life cycle stage is driven by extensions of management policies to address broader user communities
  • Data arrangement <-> Project policies
  • Data analysis <-> Processing pipeline standards
  • Data sharing <-> Research collaborations
  • Data publication <-> Discipline standards
  • Data preservation <-> Reference collection
• Reference collections need to be preserved and interpretable by future generations, the most stringent standard
• Data grids - integrated Rule Oriented Data System


iRODS - Policy-based Management

• Turn Policies into computer-actionable Rules
• Compose Rules by chaining Micro-services
• Manage state information (in iCAT metadata catalog) as attributes on namespaces:
  • Files / collections / users / resources / rules
• Validate assessment criteria
  • Queries on state information, parsing audit trails
• Automate administrative functions
  • Enable scaling to today's massive collections
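As a rough illustration of composing a Rule by chaining Micro-services, the Python sketch below treats a rule as a condition plus an ordered list of small functions that share one context. The function names, the /archive/ path prefix, and the context layout are assumptions made for this example; they are not iRODS micro-services or iRODS rule syntax.

```python
import hashlib


def compute_checksum(ctx):
    """Hypothetical micro-service: store a checksum of the data in the context."""
    ctx["checksum"] = hashlib.sha256(ctx["data"]).hexdigest()


def record_size(ctx):
    """Hypothetical micro-service: store the data size as state information."""
    ctx["size"] = len(ctx["data"])


def make_rule(condition, chain):
    """Build a rule: run each micro-service in `chain`, in order, if `condition` holds."""
    def rule(ctx):
        if not condition(ctx):
            return False
        for micro_service in chain:
            micro_service(ctx)
        return True
    return rule


# Rule: for objects under /archive/, compute a checksum and then record the size.
on_archive_put = make_rule(
    condition=lambda ctx: ctx["path"].startswith("/archive/"),
    chain=[compute_checksum, record_size],
)

ctx = {"path": "/archive/run1.dat", "data": b"abc"}
fired = on_archive_put(ctx)
skipped = on_archive_put({"path": "/scratch/tmp.dat", "data": b"abc"})
```

Keeping each step small and independent is what makes the chain reusable: a different rule can combine the same functions under a different condition.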


Overview of iRODS Architecture

User with Client: can search, access, add, and manage data and metadata. Access distributed data with a Web-based browser, the iRODS GUI, or command-line clients.

iRODS Data System
• iRODS Data Server: disk, tape, etc.
• iRODS Metadata Catalog: tracks information
• iRODS Rule Engine: tracks policies


iput ../src/irm.c - Checks 10 Policy hooks when file put into iRODS

brick14:10900:ApplyRule#116:: acChkHostAccessControl
brick14:10900:GotRule#117:: acChkHostAccessControl
brick14:10900:ApplyRule#118:: acSetPublicUserPolicy
brick14:10900:GotRule#119:: acSetPublicUserPolicy
brick14:10900:ApplyRule#120:: acAclPolicy
brick14:10900:GotRule#121:: acAclPolicy
brick14:10900:ApplyRule#122:: acSetRescSchemeForCreate
brick14:10900:GotRule#123:: acSetRescSchemeForCreate
brick14:10900:execMicroSrvc#124:: msiSetDefaultResc(demoResc,null)
brick14:10900:ApplyRule#125:: acRescQuotaPolicy
brick14:10900:GotRule#126:: acRescQuotaPolicy
brick14:10900:execMicroSrvc#127:: msiSetRescQuotaPolicy(off)
brick14:10900:ApplyRule#128:: acSetVaultPathPolicy
brick14:10900:GotRule#129:: acSetVaultPathPolicy
brick14:10900:execMicroSrvc#130:: msiSetGraftPathScheme(no,1)
brick14:10900:ApplyRule#131:: acPreProcForModifyDataObjMeta
brick14:10900:GotRule#132:: acPreProcForModifyDataObjMeta
brick14:10900:ApplyRule#133:: acPostProcForModifyDataObjMeta
brick14:10900:GotRule#134:: acPostProcForModifyDataObjMeta
brick14:10900:ApplyRule#135:: acPostProcForCreate
brick14:10900:GotRule#136:: acPostProcForCreate
brick14:10900:ApplyRule#137:: acPostProcForPut
brick14:10900:GotRule#138:: acPostProcForPut
brick14:10900:GotRule#139:: acPostProcForPut
brick14:10900:GotRule#140:: acPostProcForPut
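A trace like this is easier to audit once each entry is split into fields. The Python sketch below parses a few entries copied from the trace and separates policy hooks (ApplyRule) from micro-service calls (execMicroSrvc). The field names host and proc are an assumed reading of the first two colon-separated values; the source does not say what they mean.

```python
import re

# A few entries copied from the trace, one per line.
TRACE = """\
brick14:10900:ApplyRule#116:: acChkHostAccessControl
brick14:10900:GotRule#117:: acChkHostAccessControl
brick14:10900:ApplyRule#122:: acSetRescSchemeForCreate
brick14:10900:GotRule#123:: acSetRescSchemeForCreate
brick14:10900:execMicroSrvc#124:: msiSetDefaultResc(demoResc,null)
brick14:10900:ApplyRule#137:: acPostProcForPut
brick14:10900:GotRule#138:: acPostProcForPut
"""

# host and proc are assumed interpretations of the first two fields.
ENTRY = re.compile(
    r"^(?P<host>[^:]+):(?P<proc>\d+):(?P<event>\w+)#(?P<seq>\d+):: (?P<detail>.+)$"
)


def parse_trace(text):
    """Return one dict per trace entry; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


entries = parse_trace(TRACE)
hooks = [e["detail"] for e in entries if e["event"] == "ApplyRule"]
micro_services = [e["detail"] for e in entries if e["event"] == "execMicroSrvc"]
print(hooks)
print(micro_services)
```

Applied to the full trace, counting the ApplyRule entries is one way to check the number of policy hooks stated in the slide title.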


Scale of iRODS Data Grid

• Number of files
  • Desktop to 10s to 100s of millions of files
• Size of data
  • Desktop to 100s of terabytes to petabytes of data
• Number of policy enforcement points
  • 64 actions define when policies are checked
• System state information
  • 112 metadata attributes of system information per file
• Number of functions
  • 185 composable iRODS Micro-services
• Number of storage systems that are linked
  • Desktop to 10s to 100 storage resources
• Number of data grids that can interoperate
  • Federation of 10s of data grids


iRODS Shows Unified “Virtual Collection”

The iRODS Data System can install in a “layer” over existing or new data, letting you view, manage, and share part or all of diverse data in a unified Collection.

User with Client: views and manages data; sees a single “Virtual Collection”

• My Data: disk, tape, database, filesystem, etc.
• Project Data: disk, tape, database, filesystem, etc.
• Reference Data: remote disk, tape, filesystem, etc.


Organize Distributed Data into a Sharable "Virtual" Collection

• Project repository
  • MotifNet - manage collection of analysis products
• Institutional repository
  • Carolina Digital Repository for UNC collections
• Regional collaboration
  • RENCI Data Grid linking resources across North Carolina
• National collaboration
  • NSF Temporal Dynamics of Learning Center
  • Australian Research Collaboration Service
• National Library
  • French National Library
• National Archive
  • NARA Transcontinental Persistent Archive Prototype, Taiwan
• International collaboration
  • BaBar High Energy Physics (SLAC-IN2P3)
  • National Optical Astronomy Observatory (Chile-US)


Infrastructure Independence

• Manage properties of the collection independently of the choice of technology
  • Access, authentication, authorization, description, location, distribution, replication, integrity, retention
• Enforce policies globally at all storage locations
  • Rule Engine resident at each storage site
• Apply procedures at each remote storage site
  • Chain encapsulated operations into workflows
• Infrastructure independence enables evolution to new technology without interruption
  • Integrate new access methods, new storage systems, new network protocols, new authentication systems


Data Virtualization

Diagram labels: Data Grid; Access Interface; Standard Micro-services; Standard Operations; Storage Protocol; Storage System

• Map from actions requested by access method to standard set of iRODS Micro-services.
• Map standard Micro-services to standard operations.
• Map the operations to protocol supported by operating system.
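The three mappings on this slide can be pictured as lookup tables applied in sequence. The Python sketch below is only an illustration under stated assumptions: every action, micro-service, operation, and protocol name in the tables is a hypothetical placeholder, not an iRODS identifier.

```python
# All table entries below are hypothetical placeholders, not iRODS identifiers.

# Layer 1: action requested by an access method -> standard micro-services.
ACTION_TO_MICRO_SERVICES = {
    "web_upload": ["store_object", "register_object"],
    "cli_get": ["fetch_object"],
}

# Layer 2: standard micro-service -> standard operations.
MICRO_SERVICE_TO_OPERATIONS = {
    "store_object": ["open", "write", "close"],
    "register_object": ["catalog_insert"],
    "fetch_object": ["open", "read", "close"],
}

# Layer 3: standard operation -> call in a given storage protocol.
OPERATION_TO_PROTOCOL = {
    "posix": {"open": "open()", "read": "read()", "write": "write()", "close": "close()"},
}


def resolve(action, protocol):
    """Expand an action into protocol-level calls by walking the three maps in order."""
    calls = []
    for micro_service in ACTION_TO_MICRO_SERVICES[action]:
        for operation in MICRO_SERVICE_TO_OPERATIONS[micro_service]:
            # Operations with no storage-level equivalent pass through unchanged.
            calls.append(OPERATION_TO_PROTOCOL[protocol].get(operation, operation))
    return calls


calls = resolve("cli_get", "posix")
print(calls)
```

In this picture, supporting a new storage system means adding one entry to the last table, and supporting a new access method means adding one entry to the first; the middle layer stays fixed.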


Data Grid Security

• Manage global name spaces for:
  • {users, files, storage}
• Assign access controls as constraints imposed between two logical name spaces
  • Access controls remain invariant as files are moved within the data grid
  • Controls on: Files / Storage systems / Metadata
• Authenticate each user access
  • PKI, Kerberos, challenge-response, Shibboleth
  • Use internal or external identity management system
• Authorize all operations
  • ACLs (Access Control Lists) on users and groups
  • Separate condition for execution of each Rule
  • Internal approval flags (e.g. IRB) within a Rule
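One way to see why access controls stay invariant when files move is to key them on logical names only. The Python sketch below is a toy model under that assumption; the Grid class, the user names, the resource names, and the example path are hypothetical and do not describe the iRODS implementation.

```python
class Grid:
    """Toy data grid; a hypothetical illustration, not the iRODS implementation."""

    def __init__(self):
        self.location = {}  # logical file name -> storage resource name
        self.acl = {}       # (user name, logical file name) -> set of permissions

    def put(self, owner, name, resource):
        """Store a file and give its owner read and write access."""
        self.location[name] = resource
        self.acl[(owner, name)] = {"read", "write"}

    def grant(self, user, name, permission):
        """Add one permission for a user on a logical file name."""
        self.acl.setdefault((user, name), set()).add(permission)

    def move(self, name, resource):
        """Change the physical location only; the ACL is keyed by logical names."""
        self.location[name] = resource

    def can(self, user, name, permission):
        """Check a permission using the two logical name spaces alone."""
        return permission in self.acl.get((user, name), set())


path = "/zone/home/alice/run1.dat"  # hypothetical logical name
grid = Grid()
grid.put("alice", path, "disk1")
grid.grant("bob", path, "read")
grid.move(path, "tape7")
print(grid.can("bob", path, "read"), grid.location[path])
```

Because the storage resource never appears in the ACL key, moving the file from disk1 to tape7 leaves every permission untouched.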


NOAO Zone Architecture

Archive

Telescope Telescope


Ocean Observatories Initiative

Diagram labels: Sensors; Cloud Computing; External Repositories; Cloud Storage Cache; Message Bus; Aggregate sensor data in cache; SuperComputer; Event Detection; Remote locations; Simulations; Digital Library; Archive; Clients; Remote Users; iRODS Data Grid; Multiple Protocols

Large-scale workflows from real-time data to steerable instruments and the digital library.


Access: Data Grid Clients

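API | Client | Developer
Browser | DCAPE | UNC
Browser | iExplore | DICE-Bing Zhu
Browser | JUX | IN2P3
Browser | Peta Web browser | PetaShare
Digital Library | Akubra/iRODS | DICE
Digital Library | Dspace | MIT
Digital Library | Fedora on Fuse | IN2P3
Digital Library | Fedora/iRODS module | DICE
Digital Library | Islandora | DICE
File System | Davis - Webdav | ARCS
File System | Dropbox / iDrop | DICE-Mike Conway
File System | FUSE | IN2P3, DICE
File System | FUSE optimization | PetaShare
File System | OpenDAP | ARCS
File System | PetaFS (Fuse) | Petashare - LSU
File System | Petashell (Parrot) | PetaShare
Grid | GridFTP - Griffin | ARCS
Grid | Jsaga | IN2P3
Grid | Parrot | Notre Dame-Doug Thain
Grid | Saga | KEK
I/O Libraries | PHP - DICE | DICE-Bing Zhu
I/O Libraries | C API | DICE-Mike Wan
I/O Libraries | C I/O library | DICE-Wayne Schroeder
I/O Libraries | Jargon | DICE-Mike Conway
I/O Libraries | Pyrods - Python | SHAMAN-Jerome Fusillier
Portal | iDrop | DICE-Mike Conway
Portal | EnginFrame | NICE / RENCI
Tools | Archive tools-NOAO | NOAO
Tools | Big Board visualization | RENCI
Tools | File-format-identifier | GA Tech
Tools | icommands | DICE
Tools | Pcommands | PetaShare
Tools | Resource Monitoring | IN2P3
Tools | Sync-package | Academia Sinica
Tools | URSpace | Teldap - Academia Sinica
Web Service | VOSpace | NVOA
Web Service | Shibboleth | King's College
Workflows | Kepler | DICE
Workflows | Stork | LSU
Workflows | Taverna | RENCI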


iRODS Distributed Data Management


Towards a Unified Data Space

• Sharing data across Space
  • Organize data as a shared "virtual" Collection
  • Define unifying properties for the Collection
• Sharing data across Time
  • Preservation is communication with the future
  • Preservation validates communication from the past
• Managing full Data Life Cycle
  • Evolution of the Policies that govern a data Collection at each stage of the life cycle
  • From data creation, to collection, to publication, to reference collection, to analysis pipeline


Intellectual Property

• Given generic infrastructure, intellectual property resides in the Policies and procedures that manage the Collection
  • Consistency of the Policies
  • Capabilities of the procedures
  • Automation of internal Policy assessment
  • Validation of desired Collection properties
  • Automation of administrative tasks
• Interacting with DataDirect Networks, HP, IBM, and Microsoft on commercial application of open source technology.


Societal Impact

• Many communities are assembling digital holdings that represent an emerging consensus:
  • Common meaning associated with the data
  • Common interpretation of the data
  • Common data manipulation mechanisms
• The development of a consensus is described as
  • Socialization of Collections
  • An example is Trans-border Urban Planning


Federation of Collections

Social consensus for sharing data, policies, methods, practice
• Each community controls their own collection Policies
  • Policies enforced at each storage location
• Explicit computer-actionable rules control type of federation interactions
  • e.g. peer-to-peer, central archive, master-slave data distribution, chained data grids, deep archives

Interoperability mechanisms support technology integration
• Community specific clients
• Bulk data export / import
• Cross registration of data
• Structured information resource drivers


Data Grid Federation

• Motivation
  • Improve performance, scalability, and independence
• To initiate Federation, each Data Grid administrator establishes trust and creates a remote user
  • iadmin mkzone B remote Host:Port
  • iadmin mkuser rods#B rodsuser
• Use cases
  • Chained data grids - National Optical Astronomy Observatory
  • Master-slave data grids - NIH BIRN
  • Central archive - UK e-Science
  • Deep archive - NARA TPAP
  • Replication - NSF Teragrid
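The mkuser command on this slide registers a user under the zone-qualified name rods#B. The Python sketch below shows how a name of that form can be split into user and zone to decide whether a request comes from a remote zone. The helper names, and the rule that a bare name belongs to the local zone, are assumptions made for illustration.

```python
def split_user(qualified, local_zone):
    """Split 'name#zone' into (name, zone); a bare name is assumed to be local."""
    name, sep, zone = qualified.partition("#")
    return (name, zone if sep else local_zone)


def is_remote(qualified, local_zone):
    """True when the user's home zone differs from the zone handling the request."""
    return split_user(qualified, local_zone)[1] != local_zone


# Zone A has registered the remote user rods#B, as in the mkuser command.
remote_user = split_user("rods#B", local_zone="A")
print(remote_user, is_remote("rods#B", "A"), is_remote("rods", "A"))
```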


Accessing Data in Federated iRODS

Federated irodsUser (use iRODS clients)

Federated irodsUsers can upload, download, replicate, share, manage & track access to some or all data (depending on access permissions) in either zone.

Diagram callouts: “Gets data to user”, “With access permissions”, “Finds the data”

Two federated iRODS data grids
• iRODS/ICAT system at University of North Carolina at Chapel Hill (renci zone)
• iRODS/ICAT system at University of Texas at Austin (tacc zone)


Development Team

• DICE team
  • Arcot Rajasekar - iRODS Development Lead
  • Mike Wan - iRODS Chief Architect
  • Wayne Schroeder - iRODS Product Mgr., Sr. Developer
  • Bing Zhu - Fedora, Windows
  • Mike Conway - Java (Jargon)
  • Paul Tooby - Documentation, Foundation
  • Sheau-Yen Chen - Data Grid Administration
  • Reagan Moore - PI
• Preservation
  • Richard Marciano - Preservation Development Lead
  • Chien-Yi Hou - Preservation Micro-services
  • Antoine de Torcy - Preservation Micro-services


Foundation

• Data Intensive Cyber Environments Foundation
• Nonprofit open source software development
• Promotes use of iRODS technology
• Supports standards efforts, intellectual property
• Coordinates international development efforts
  • IN2P3 - quota and monitoring system
  • King’s College London - Shibboleth
  • Australian Research Collaboration Services - WebDAV
  • Academia Sinica - SRM interface
• More information: http://diceresearch.org


iRODS Wiki

• More information…
  • http://irods.diceresearch.org
  • Descriptions, tutorials, documentation
  • Publications / presentations
  • Download of iRODS open source software
  • Performance tests
  • irods-chat page