From Clusters to Grids


October 2003 – Linköping, Sweden

Andrew Grimshaw
Department of Computer Science, University of Virginia
CTO & Founder, Avaki Corporation


Agenda

• Grid Computing Background

• Legion

• Existing Systems & Standards

• Summary


Grid Computing


First: What is a Grid System?

A Grid system is a collection of distributed resources connected by a network.

Examples of distributed resources:
• Desktop and handheld hosts
• Devices with embedded processing resources, such as digital cameras and phones
• Tera-scale supercomputers


What is a Grid?

A grid enables users to collaborate securely by sharing processing, applications, and data across heterogeneous systems and administrative domains, for faster application execution and easier access to data.

• Compute Grids
• Data Grids

A grid is all about gathering resources together and making them accessible to users and applications.


What are the characteristics of a Grid system?

• Numerous resources
• Ownership by mutually distrustful organizations and individuals
• Potentially faulty resources
• Different security requirements and policies
• Heterogeneous resources
• Geographic separation
• Different resource management policies
• Connection by heterogeneous, multi-level networks


Technical Requirements of a Successful Grid Architecture

• Simple
• Secure
• Scalable
• Extensible
• Site autonomy
• Persistence & I/O
• Multi-language
• Legacy support
• Single namespace
• Transparency
• Heterogeneity
• Fault-tolerance & exception management

Success requires an integrated solution AND flexible policy. Manage complexity!


Implication: Complexity is THE Critical Challenge

How should complexity be addressed?


High-level versus low-level solutions

[Chart: robustness versus time and cost to develop, comparing a "sockets & shells" approach with an integrated solution.]

A low-level, "sockets & shells" approach is low in robustness and high in time and cost to develop. An integrated approach is high in robustness and low in time and cost to develop. As application complexity increases, the differences between the two approaches increase dramatically.


The Importance of Integration in a Grid Architecture

If separate pieces are used, the programmer must integrate them.

If some pieces are missing, the programmer must develop enough of the missing pieces to support the application.

Bottom line: both cases raise the bar by putting the cognitive burden on the programmer.


Misconceptions about Grids

• Grids are simple cycle aggregation
• The state of the art is essentially scheduling and queuing for CPU cluster management
• These definitions sell short the promise of Grid technology
• Avaki believes grids are not just about aggregating and scheduling CPU cycles, but also about:
  • Virtualizing many types of resources, internally and across domains
  • Empowering anyone to have secure access to any and all resources through easy administration


Compute Grid Categories

• Sons of SETI@home
  • United Devices, Entropia, Data Synapse
  • Low-end desktop cycle aggregation
  • A hard sell in corporate America
• Cluster load management
  • LSF, PBS, SGE
  • High-end; great for managing local clusters, but not well proven in multi-cluster environments
• As soon as you go outside the local cluster to cross-domain, multi-cluster operation, the game changes dramatically with the introduction of three major issues:
  • Data
  • Security
  • Administration

To address these issues, you need a fully integrated solution, or a toolkit to build one.


Typical Grid Scenarios

• Desktop cycle aggregation
  • Desktop only
  • United Devices, Entropia, Data Synapse
• Cluster & departmental grids
  • Single owner, platform, domain, file system, and location
  • Sun SGE, Platform LSF, PBS
• Enterprise grids
  • Single enterprise; multiple owners, platforms, domains, file systems, locations, and security policies
  • Sun SGE EE, Platform MultiCluster
• Global grids
  • Multiple enterprises, owners, platforms, domains, file systems, locations, and security policies
  • Legion, Avaki, Globus


What are grids being used for today?

Typical adopters have:
• Multiple sites with multiple data sources (public and private)
• A need for secure access to data and applications for sharing
• Partnership relationships with other organizations: internal, partners, or customers
• Computationally challenging applications
• R&D groups distributed across companies, networks, and geographies
• Large files to stage
• A desire to utilize and leverage heterogeneous compute resources
• A need for accounting of resources
• A need to handle multiple queuing systems
• An interest in purchasing compute cycles for spikes in demand


Legion


Legion Grid Software

Wide-area access to data, processing, and application resources in a single, uniform operating environment that is secure and easy to administer.

Legion Grid capabilities:
• Wide-area data access
• Distributed processing
• Global naming
• Policy-based administration
• Resource accounting
• Fine-grained security
• Automatic failure detection and recovery

[Diagram: users and applications reaching desktops, servers, data, and clusters across departments, partners, and vendors through the Legion Grid, layered over local load management and queuing.]


Legion Combines Data and Compute Grid

[Diagram: users and applications accessing desktops, servers, data servers, and clusters across departments, partners, and vendors through a single Legion Grid, layered over local load management and queuing.]


The Legion Data Grid


Data Grid

Wide-area access to data at its source location, based on business policies, eliminating manual copying and the errors caused by accessing out-of-date copies.

Data Grid capabilities:
• Federates multiple data sources
• Provides global naming
• Works with local and virtual file systems: NFS, XFS, CIFS
• Accesses data in DAS, NAS, and SAN
• Uses standard interfaces
• Caches data locally (a sketch of global naming with caching follows)

[Diagram: users and applications accessing federated data sources across departments, partners, and vendors through the Legion Grid.]


Data Grid Share

Data is mapped into the Grid namespace via Legion ExportDir. The Legion Data Grid transparently handles client and application requests, maps them to the global namespace, and returns the data (a conceptual sketch follows).

[Diagram: Linux, NT, and Solaris shares at a tools vendor, research center, headquarters, and informatics partner, exported into one grid namespace for users and applications.]
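As a rough model of what exporting a directory amounts to (hypothetical names and data structures; not the ExportDir implementation), the sketch below publishes local directories under grid names and routes lookups to whichever host holds the data:

```python
# Grid namespace: global directory name -> (host, local path).
GRID_NAMESPACE: dict[str, tuple[str, str]] = {}

def export_dir(host: str, local_path: str, grid_path: str) -> None:
    """Publish a local directory under a location-independent grid name."""
    GRID_NAMESPACE[grid_path] = (host, local_path)

def lookup(grid_file: str) -> tuple[str, str]:
    """Route a grid file name to the host and local path that serve it."""
    for grid_path, (host, local) in GRID_NAMESPACE.items():
        if grid_file.startswith(grid_path + "/"):
            return host, local + grid_file[len(grid_path):]
    raise FileNotFoundError(grid_file)

export_dir("hq-1", "/data/genomes", "/grid/hq/genomes")
export_dir("rd-2", "/scratch/runs", "/grid/rd/runs")
print(lookup("/grid/hq/genomes/sequence_a"))   # ('hq-1', '/data/genomes/sequence_a')
```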


Data Grid Access

• Access files using the standard NFS protocol or Legion commands
  • NFS security issues eliminated
  • Caches exploit semantics
• Access files using a global name
• Access based on specified privileges

[Diagram: users and applications at a tools vendor, research center, headquarters, and informatics partner reaching files such as sequence_a, sequence_b, and sequence_c, and applications such as BLAST, through an access point with fine-grained security.]


Data Grid Access using virtual NFS (Legion-NFS)

Complexity = servers + clients
• Clients mount the grid
• Servers share files to the grid
• Clients access data using the NFS protocol
• Wide-area access to data outside the administrative domain, with fine-grained security


Keeping Data in the Grid

• Legion storage servers
  • Data is copied into Legion storage servers that execute on a set of hosts
  • The particular set of hosts used is a configuration option; here, five hosts are used
• Access to the different files is completely independent and asynchronous
• Very high sustained read/write bandwidth is possible using commodity resources (see the sketch below)

[Diagram: a directory tree with files a through h distributed across the local disks of five storage hosts.]
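A minimal sketch of why independent, per-file storage servers yield high aggregate bandwidth: each file lives on one of several hosts, so concurrent reads fan out across disks. The host names and placement rule are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

HOSTS = ["host1", "host2", "host3", "host4", "host5"]   # the configured set of hosts

def place(filename: str) -> str:
    """Assign each file to a storage host (here, by hashing its name)."""
    return HOSTS[hash(filename) % len(HOSTS)]

def read_file(filename: str) -> bytes:
    # Stand-in for a remote read from the file's storage server.
    return f"<contents of {filename} from {place(filename)}>".encode()

files = ["a", "b", "c", "d", "e", "f", "g", "h"]
# Accesses are independent and asynchronous, so they proceed in parallel,
# each hitting a different disk rather than queuing behind one server.
with ThreadPoolExecutor(max_workers=len(files)) as pool:
    contents = list(pool.map(read_file, files))
```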


I/O Performance

[Chart: large-read aggregate bandwidth, 0–200 MB/s, versus number of readers (1, 10, 20, 30, 40, 50), for NFS, lnfsd, and LegionFS.]

Read performance in NFS, Legion-NFS, and the Legion I/O libraries. The x-axis indicates the number of clients that simultaneously perform 1 MB reads on 10 MB files; the y-axis indicates total read bandwidth. All results are the average of multiple runs. All clients ran on 400 MHz Intel machines; the NFS server ran on an 800 MHz Intel server.
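The kind of measurement behind this chart can be sketched as follows (our reconstruction of the setup, not the original harness): N concurrent clients each read a 10 MB file in 1 MB chunks, and aggregate bandwidth is total bytes over wall-clock time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20            # 1 MB reads, as in the experiment

def read_all(path: str) -> int:
    """Read one file in 1 MB chunks; return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    return total

def aggregate_bandwidth(paths: list[str]) -> float:
    """Total MB/s across all concurrent readers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        total_bytes = sum(pool.map(read_all, paths))
    return total_bytes / (1 << 20) / (time.perf_counter() - start)

# Hypothetical usage with 50 pre-created 10 MB test files on the grid mount:
# print(aggregate_bandwidth([f"/mnt/grid/test_{i}" for i in range(50)]))
```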


Data Grid Benefits

• Easy, convenient, wide-area access to data, regardless of location, administrative domain, or platform
• Eliminates time-consuming copying and obtaining accounts on machines where data resides
• Provides access to the most recent data available
• Eliminates confusion and errors caused by inconsistent naming of data
• Caches remote data for improved performance
• Requires no changes to legacy or commercial applications
• Protects data with fine-grained security and limits access privileges to those required
• Eases data administration and management
• Eases migration to new storage technologies


The Legion Compute Grid


Compute Grid

Wide-area access to processing resources based on business policies, managing utilization of processing resources for fast, efficient job completion.

Compute Grid capabilities:
• Job scheduling and priority-based queuing (sketched below)
• Easy integration with third-party load management and queuing software
• Automatic staging of data and applications
• Efficient processing of both sequential and parallel applications
• Failure detection and recovery
• Usage accounting

[Diagram: users and applications submitting work to desktops, servers, data servers, and clusters across departments, partners, and vendors through the Legion Grid.]
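Priority-based queuing can be sketched with a simple heap. This is a generic illustration of the technique, not Legion's scheduler:

```python
import heapq
import itertools

class JobQueue:
    """Dequeue jobs by priority; ties break in submission (FIFO) order."""
    def __init__(self):
        self._heap = []
        self._order = itertools.count()

    def submit(self, priority: int, job: str) -> None:
        # heapq is a min-heap, so negate priority: higher numbers run first.
        heapq.heappush(self._heap, (-priority, next(self._order), job))

    def next_job(self) -> str:
        return heapq.heappop(self._heap)[2]

q = JobQueue()
q.submit(1, "archive-sync")
q.submit(5, "urgent-blast-run")
q.submit(1, "nightly-report")
print(q.next_job())   # urgent-blast-run
```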


Compute Grid Access

The grid:
• Locates resources
• Authenticates and grants access privileges
• Stages applications and data
• Detects failures and recovers
• Writes output to the specified location
• Accounts for usage

A runnable sketch of these steps follows the diagram.

[Diagram: logins and submissions from a tools vendor, research center, headquarters, and informatics partner flowing through scheduling, queuing, usage management, accounting, and recovery to Solaris, NT, and Linux servers and clusters, under fine-grained security.]
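The per-job steps above, expressed as a linear sketch. Every function here is a hypothetical stand-in for work the grid performs transparently on submission:

```python
import random

def locate(app):              return random.choice(["hq-1", "rd-2", "pm-1"])
def authenticate(user, host): return f"token({user}@{host})"
def stage(app, data, host):   print(f"staging {app} and {data} -> {host}")
def execute(app, host, tok):  return {"output": f"{app} results", "cpu_hours": 2.5}
def account(user, host, use): print(f"charging {user} on {host}: {use} CPU-hours")

def run_grid_job(user, app, data, outputs):
    host = locate(app)                       # locate a suitable resource
    token = authenticate(user, host)         # authenticate, grant privileges
    stage(app, data, host)                   # stage application and data
    try:
        result = execute(app, host, token)   # run the job
    except OSError:
        return run_grid_job(user, app, data, outputs)   # detect failure, recover
    outputs[app] = result["output"]          # write output to specified location
    account(user, host, result["cpu_hours"]) # account for usage
    return result

results = {}
run_grid_job("alice", "BLAST", "sequence_a", results)
```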


Tools (all are cross-platform)

• MPI
• P-space studies (multi-run)
• Parallel C++
• Parallel object-based Fortran
• CORBA binding
• Object migration
• Accounting
• legion_make (remote builds)
• Fault-tolerant MPI libraries
• Post-mortem debugger
• "Console" objects
• Parallel 2D file objects
• Collections



Related Work

• Avaki
• All of the distributed systems literature
• Globus
• AFS/DFS
• LSF, PBS, …
• Global Grid Forum: OGSA


Avaki Company Background

• Grid pioneers: a Legion spin-off
• Over $20M capitalization
• The only commercial grid software provider with a solution that addresses data access, security, and compute power challenges
• A leader in standards efforts

[Diagram: logos of partners, standards organizations, and customers.]


AFS/DFS comparison with the Legion Data Grid

• AFS presumes that all files are kept in AFS; there is no federation with other file systems. Legion allows data to be kept in Legion or in an NFS, XFS, PFS, or Samba file system.
• AFS presumes that all sites use Kerberos and that realms "trust" each other. Legion assumes nothing about the local authentication mechanism, and there is no need for cross-realm trust.
• AFS semantics are fixed (copy on open). Legion can support multiple semantics; the default is Unix semantics.
• AFS is volume-oriented (sub-trees). Legion can be volume-oriented or file-oriented.
• AFS caching semantics are not extensible. Legion caching semantics are extensible.


Legion & Globus GT2

Projects with many common goals:
• Metacomputing (or the "Grid")
• Middleware for wide-area systems
• Heterogeneous resource sets
• Disjoint administrative domains
• High-performance, large-scale applications


Legion Specific Goals

• Shared collaborative environment including shared file system

• Fault-tolerance and high-availability

• Both HPC applications and distributed applications

• Complete security model including access control

• Extensible

• Integrated - create a meta-operating system


Many "Similar" Features

• Resource management support
• Message-passing libraries (e.g., MPI)
• Distributed I/O facilities (Globus GASS/remote I/O vs. the Avaki Data Grid)
• Security infrastructure


Globus

• The "toolkit" approach: provide services as separate libraries (e.g., Nexus, GASS, LDAP)
• Pros:
  • Decoupled architecture: easy to add new services into the mix
  • Low buy-in: use only what you like! (Although in practice all the pieces use each other.)
• Cons:
  • No unifying abstractions: a very complex environment to learn in full, and composition of services becomes difficult as the number of services grows
  • Interfaces keep changing due to an ever-evolving design
  • Does not cover the space of problems


Standards: GGF

Background:
• Grid standards are now being developed at the Global Grid Forum (GGF)
• The in-development Open Grid Services Infrastructure (OGSI) standard will extend Web Services (SOAP/XML, WSDL, etc.) with:
  • Names and a two-level naming scheme
  • Factories and lifetime management (sketched below)
  • A mandatory set of interfaces, e.g., discovery interfaces
• OGSA: the Open Grid Services Architecture
  • The over-arching architecture
  • Still in development
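A rough sketch of the factory-plus-lifetime-management idea (our illustration, not OGSI's actual interfaces): a factory creates transient service instances that expire unless clients renew their lifetime.

```python
import time
import uuid

class ServiceInstance:
    """A transient grid service with a client-managed lifetime."""
    def __init__(self, ttl: float):
        # Abstract, location-independent name; in OGSI's two-level scheme,
        # a handle like this would be resolved to a concrete reference.
        self.handle = f"handle:{uuid.uuid4()}"
        self._expires = time.time() + ttl

    def renew(self, ttl: float) -> None:
        self._expires = time.time() + ttl        # client keeps the instance alive

    def alive(self) -> bool:
        return time.time() < self._expires       # unrenewed instances expire

class Factory:
    """Creates service instances on request (the 'factory' pattern)."""
    def create(self, ttl: float = 60.0) -> ServiceInstance:
        return ServiceInstance(ttl)

svc = Factory().create(ttl=30.0)
print(svc.handle, svc.alive())   # a fresh instance is alive until its lease lapses
```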


Summary

• Grids are about resource federation and sharing
• Grids are here today. They are being used in production computing in industry to solve real problems and provide real value.
  • Compute Grids
  • Data Grids
• We believe that users want high-level abstractions; they don't want to think about the grid
  • This requires low activation energy and legacy support
• There are a number of challenges still to be solved, and different applications and organizations want to solve them differently
  • Policy heterogeneity
  • Strong separation of policy and mechanism
• There are several areas where really good policies are still lacking
  • Scheduling
  • Security and security-policy interactions
  • Failure recovery (and the interaction of different policies)