From Clusters to Grids
October, 2003 – Linkoping, Sweden
Andrew Grimshaw
Department of Computer Science, University of Virginia
CTO & Founder, Avaki Corporation
2
Agenda
• Grid Computing Background
• Legion
• Existing Systems & Standards
• Summary
3
Grid Computing
4
First: What is a Grid System?
A Grid system is a collection
of distributed resources
connected by a network
Examples of Distributed Resources:
• Desktops
• Handheld hosts
• Devices with embedded processing resources, such as digital cameras and phones
• Tera-scale supercomputers
5
What is a Grid?
A grid enables users to collaborate securely by sharing processing, applications, and data across heterogeneous systems and administrative domains for collaboration, faster application execution and easier access to data.
• Compute Grids
• Data Grids
A grid is all about gathering together resources and making them accessible to users and applications.
6
What are the characteristics of a Grid system?
• Numerous Resources
• Ownership by Mutually Distrustful Organizations & Individuals
• Potentially Faulty Resources
• Different Security Requirements & Policies Required
• Resources are Heterogeneous
• Geographically Separated
• Different Resource Management Policies
• Connected by Heterogeneous, Multi-Level Networks
8
Technical Requirements of a Successful Grid Architecture
• Simple
• Secure
• Scalable
• Extensible
• Site Autonomy
• Persistence & I/O
• Multi-Language
• Legacy Support
• Single Namespace
• Transparency
• Heterogeneity
• Fault-tolerance & Exception Management
Success requires an integrated solution AND flexible policy
Manage Complexity!!
9
Implication: Complexity is THE Critical Challenge
How should complexity be addressed?
10
High-level versus low-level solutions
[Figure: two panels, “Sockets & Shells” and “Integrated Solution”, each rating Robustness and Time & Cost on a Low-to-High scale.]
A low-level or “socket & shell” approach is low in robustness and high in time and cost to develop.
An integrated approach is high in robustness and low in time and cost to develop.
As application complexity increases, the differences between the systems increase dramatically.
11
The Importance of Integration in a Grid Architecture
If separate pieces are used, then the programmer must integrate the solutions.
If all the pieces are not present, then the programmer must develop enough of the missing pieces to support the application.
Bottom Line: Both raise the bar by putting the cognitive burden on the programmer.
12
Misconceptions about Grids
• Simple cycle aggregation
• State of the art is essentially scheduling and queuing for CPU cluster management
• These definitions are selling short the promise of Grid technology
• AVAKI believes grids are not just about aggregating and scheduling CPU cycles but also …
  - Virtualizing many types of resources, internally and across domains
  - Empowering anyone to have secure access to any and all resources through easy administration
13
Compute Grids Categories
• Sons of SETI@home
  - United Devices, Entropia, Data Synapse
  - Low-end, desktop cycle aggregation
  - Hard sell in corporate America
• Cluster Load Management
  - LSF, PBS, SGE
  - High end, great for management of local clusters but not well proven in multi-cluster environments
• As soon as you go outside of the local cluster to cross-domain multi-cluster, the game changes dramatically with the introduction of three major issues:
  - Data
  - Security
  - Administration
To address these issues, you need a fully-integrated solution, or a toolkit to build one
14
Typical Grid Scenarios
Desktop Cycle Aggregation
• Desktop only
• United Devices, Entropia, Data Synapse
Cluster & Departmental Grids
• Single owner, platform, domain, file system and location
• SUN SGE, Platform LSF, PBS
Enterprise Grids
• Single enterprise; multiple owners, platforms, domains, file systems, locations, and security policies
• SUN SGE EE, Platform Multi-cluster
Global Grids
• Multiple enterprises, owners, platforms, domains, file systems, locations, and security policies
• Legion, Avaki, Globus
15
What are grids being used for today?
• Multiple sites with multiple data sources (public and private)
• Need secure access to data and applications for sharing
• Have partnership relationships with other organizations: internal, partners, or customers
• Computationally challenging applications
• Distributed R&D groups across company, networks and geographies
• Staging large files
• Want to utilize and leverage heterogeneous compute resources
• Need for accounting of resources
• Need to handle multiple queuing systems
• Considering purchasing compute cycles for spikes in demand
16
Legion
17
Legion Grid Software
Wide-area access to data, processing and application resources in a single, uniform operating environment that is secure and easy to administer
Legion Grid Capabilities
• Wide-area data access
• Distributed processing
• Global naming
• Policy-based administration
• Resource accounting
• Fine-grained security
• Automatic failure detection and recovery
[Diagram: Users and Applications connect through the Legion Grid to desktops, servers, applications, data, clusters, and load management & queuing systems at Department A, Department B, a Partner, and a Vendor.]
18
Legion Combines Data and Compute Grid
[Diagram: Users and Applications connect through the Legion Grid to desktops, servers, applications, data, clusters, and load management & queuing systems at Department A, Department B, a Partner, and a Vendor.]
19
The Legion Data Grid
20
Data Grid
Wide-area access to data at its source location based on business policies, eliminating manual copying and errors caused by accessing out-of-date copies
Data Grid Capabilities
• Federates multiple data sources
• Provides global naming
• Works with local and virtual file systems – NFS, XFS, CIFS
• Accesses data in DAS, NAS, SAN
• Uses standard interfaces
• Caches data locally
[Diagram: Users and Applications connect through the Legion Grid to desktops, servers, applications, data, and clusters at Department A, Department B, a Partner, and a Vendor.]
21
Data Grid Share
[Diagram: Users and Applications access Linux, NT, and Solaris hosts at the Informatics Partner, Headquarters, Research Center, and Tools Vendor.]
Data mapped to Grid namespace via Legion ExportDir
Legion Data Grid transparently handles client and application requests, maps them to the global namespace, and returns the data
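The mapping step described on this slide can be pictured as a table from grid paths to (host, local directory) pairs. The following is only an illustrative sketch: `GridNamespace`, `export_dir`, and `resolve` are hypothetical names rather than Legion's actual ExportDir interface, and the host names and paths are invented for the example.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class GridNamespace:
    """Toy global namespace: grid path prefix -> (host, local directory)."""
    exports: dict = field(default_factory=dict)

    def export_dir(self, host: str, local_dir: str, grid_path: str) -> None:
        # Record that a local directory on `host` is visible at `grid_path`.
        self.exports[PurePosixPath(grid_path)] = (host, PurePosixPath(local_dir))

    def resolve(self, grid_path: str):
        # Pick the longest exported prefix that contains the requested path.
        path = PurePosixPath(grid_path)
        by_depth = sorted(self.exports, key=lambda p: len(p.parts), reverse=True)
        for prefix in by_depth:
            if path == prefix or prefix in path.parents:
                host, local_dir = self.exports[prefix]
                return host, str(local_dir / path.relative_to(prefix))
        raise FileNotFoundError(grid_path)


ns = GridNamespace()
# Hypothetical hosts and directories, chosen only for illustration.
ns.export_dir("hq-cluster", "/data/sequences", "/grid/hq/sequences")
ns.export_dir("partner-linux", "/home/tools", "/grid/partner/tools")
print(ns.resolve("/grid/hq/sequences/sequence_a"))
```

A client only ever names `/grid/...` paths; which machine and directory actually hold the bytes stays behind the lookup.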
22
Data Grid Access
[Diagram: Users and Applications pass through an Access Point with Fine-grained Security. Labels: Server RD-2, App_A, PM-1, Cluster HQ-1, BLAST cluster, sequence_a, sequence_b, sequence_c. Sites: Informatics Partner, Headquarters, Research Center, Tools Vendor.]
• Access files using standard NFS protocol or Legion commands
  - NFS security issues eliminated
  - Caches exploit semantics
• Access files using global name
• Access based on specified privileges
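One way to read "access based on specified privileges" is a per-file, per-user privilege table that is checked on every request and denies by default. The sketch below assumes that reading; the `Privilege` flags, the `ACL` table, and the user names are hypothetical and do not describe Legion's real security model.

```python
from enum import Flag


class Privilege(Flag):
    NONE = 0
    READ = 1
    WRITE = 2
    EXECUTE = 4


# Hypothetical table keyed by (global name, user); absent entries mean "denied".
ACL = {
    ("/grid/hq/sequence_a", "alice"): Privilege.READ | Privilege.WRITE,
    ("/grid/hq/sequence_a", "bob"): Privilege.READ,
}


def is_allowed(user: str, global_name: str, wanted: Privilege) -> bool:
    # Grant only if every requested privilege bit is present for this user.
    granted = ACL.get((global_name, user), Privilege.NONE)
    return (granted & wanted) == wanted


print(is_allowed("bob", "/grid/hq/sequence_a", Privilege.READ))
print(is_allowed("bob", "/grid/hq/sequence_a", Privilege.WRITE))
```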
23
Data Grid Access using virtual NFS
[Diagram: Legion-NFS with Fine-grained Security connecting Department A, Department B, and a Partner; files sequence_a and sequence_c.]
Complexity = Servers + Clients
• Clients mount grid
• Servers share files to grid
• Clients access data using NFS protocol
• Wide-area access to data outside administrative domain
24
Keeping Data in the grid
• Legion storage servers
  - Data is copied into Legion storage servers that execute on a set of hosts.
  - The particular set of hosts used is a configuration option - here five hosts are used
• Access to the different files is completely independent and asynchronous
• Very high sustained read/write bandwidth is possible using commodity resources
[Diagram: a directory tree rooted at / containing files a through h, stored across five local disks.]
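The idea of independent, asynchronous access to files spread over several storage hosts can be sketched as follows. The slide does not say how files are assigned to hosts, so the hash-based `place` function is an assumption, and `read_file` is a stand-in that does no real I/O.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

HOSTS = [f"host{i}" for i in range(1, 6)]  # five hosts, as on the slide
FILES = list("abcdefgh")                   # files a through h


def place(name, hosts=HOSTS):
    # Assumption: a stable hash picks the host, so every client agrees.
    return hosts[zlib.crc32(name.encode()) % len(hosts)]


def read_file(name):
    host = place(name)
    # Placeholder for the real transfer from the storage server on `host`.
    return name, host


# Each read is an independent task; none waits on another file's server.
with ThreadPoolExecutor(max_workers=len(FILES)) as pool:
    placement = dict(pool.map(read_file, FILES))

for name, host in sorted(placement.items()):
    print(name, "->", host)
```

Because no single server sits on every request path, aggregate bandwidth can grow with the number of hosts, which is the point the slide makes about commodity resources.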
25
I/O Performance
[Chart: Large Read Aggregate Bandwidth. x axis: number of readers (1 to 50); y axis: bandwidth (MB/sec, 0 to 200); series: NFS, lnfsd, LegionFS.]
Read performance in NFS, Legion-NFS, and Legion I/O libraries. The x axis indicates the number of clients that simultaneously perform 1 MB reads on 10 MB files, and the y axis indicates total read bandwidth. All results are the average of multiple runs. All clients ran on 400 MHz Intel machines; the NFS server ran on an 800 MHz Intel server.
26
Data Grid Benefits
• Easy, convenient, wide-area access to data – regardless of location, administrative domain or platform
• Eliminates time-consuming copying and obtaining accounts on machines where data resides
• Provides access to the most recent data available
• Eliminates confusion and errors caused by inconsistent naming of data
• Caches remote data for improved performance
• Requires no changes to legacy or commercial applications
• Protects data with fine-grained security and limits access privileges to those required
• Eases data administration and management
• Eases migration to new storage technologies
27
The Legion Compute Grid
28
Compute Grid
Wide-area access to processing resources based on business policies, managing utilization of processing resources for fast, efficient job completion
Compute Grid Capabilities
• Job scheduling and priority-based queuing
• Easy integration with third party load management and queuing software
• Automatic staging of data and applications
• Efficient processing of both sequential and parallel applications
• Failure detection and recovery
• Usage accounting
[Diagram: Users and Applications connect through the Legion Grid to desktops, servers, applications, data, and clusters at Department A, Department B, a Partner, and a Vendor.]
29
Compute Grid Access
[Diagram: Users and Applications log in and submit work through Fine-grained Security. Labels: Solaris Server RD-2, NT Server PM-1, Data Cluster HQ-1, Linux Cluster running BLAST, App_A, Data. Sites: Informatics Partner, Headquarters, Research Center, Tools Vendor. Grid layer: Scheduling, Queuing, Usage Management, Accounting, Recovery.]
• The grid:
  - Locates resources
  - Authenticates and grants access privileges
  - Stages applications and data
  - Detects failures and recovers
  - Writes output to specified location
  - Accounts for usage
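The six things the grid does for a submitted job can be lined up as one small control loop. This is a minimal sketch under stated assumptions: `Host`, `Job`, and `run_job` are hypothetical names, failure is simulated with a `healthy` flag, and nothing here reflects Legion's real scheduler or its APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    platform: str
    healthy: bool = True  # simulated health, for illustration only


@dataclass
class Job:
    user: str
    app: str
    platform: str
    log: list = field(default_factory=list)


def run_job(job, hosts, authorized_users, usage):
    # Authenticate and grant access privileges.
    if job.user not in authorized_users:
        raise PermissionError(job.user)
    # Locate resources that match the job's platform.
    candidates = [h for h in hosts if h.platform == job.platform]
    for host in candidates:
        # Stage the application and its data.
        job.log.append(f"stage {job.app} to {host.name}")
        # Detect failure and recover by trying the next candidate.
        if not host.healthy:
            job.log.append(f"failure on {host.name}, retrying elsewhere")
            continue
        # Write output, then account for usage.
        job.log.append(f"output written from {host.name}")
        usage[job.user] = usage.get(job.user, 0) + 1
        return host.name
    raise RuntimeError("no usable host")


hosts = [Host("rd-2", "solaris", healthy=False), Host("hq-1", "linux"), Host("pm-1", "solaris")]
usage = {}
job = Job(user="alice", app="blast", platform="solaris")
print(run_job(job, hosts, {"alice"}, usage), usage)
```

The user submits once; the retry on the unhealthy host and the usage record both happen without the user's involvement.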
30
Tools - All are cross-platform
• MPI
• P-space studies - multi-run
• Parallel C++
• Parallel object-based Fortran
• CORBA binding
• Object migration
• Accounting
• legion_make - remote builds
• Fault-tolerant MPI libraries
• post-mortem debugger
• “console” objects
• parallel 2D file objects
• Collections
31
One Favorite
32
Related Work
33
Related Work
• Avaki
• All distributed systems literature
• Globus
• AFS/DFS
• LSF, PBS, ….
• Global Grid Forum - OGSA
34
Avaki Company Background
• Grid Pioneers - a Legion spin-off
• Over $20M capitalization
• The only commercial grid software provider with a solution that addresses data access, security, and compute power challenges
• Standards efforts leader
[Logos: Partners, Standards Organizations, Customers]
35
AFS/DFS comparison with Legion Data Grid
• AFS presumes that all files are kept in AFS, with no federation with other file systems. Legion allows data to be kept in Legion, or in an NFS, XFS, PFS, or Samba file system.
• AFS presumes that all sites use Kerberos and that realms “trust” each other. Legion assumes nothing about the local authentication mechanism, and there is no need for cross-realm trust.
• AFS semantics are fixed (copy on open). Legion can support multiple semantics; the default is Unix semantics.
• AFS is volume oriented (sub-trees). Legion can be volume oriented or file oriented.
• AFS caching semantics are not extensible. Legion caching semantics are extensible.
36
Legion & Globus GT2
• Projects with many common goals:
  - Metacomputing (or the “Grid”)
  - Middleware for wide-area systems
  - Heterogeneous resource sets
  - Disjoint administrative domains
  - High-performance, large-scale applications
37
Legion Specific Goals
• Shared collaborative environment including shared file system
• Fault-tolerance and high-availability
• Both HPC applications and distributed applications
• Complete security model including access control
• Extensible
• Integrated - create a meta-operating system
38
Many “Similar” Features
• Resource Management Support
• Message-passing libraries
  - e.g., MPI
• Distributed I/O Facilities
  - Globus GASS/remote I/O vs. Avaki Data Grid
• Security Infrastructure
39
Globus
• The “toolkit” approach
  - Provide services as separate libraries
  - E.g. Nexus, GASS, LDAP
• Pros:
  - Decoupled architecture: easy to add new services into the mix
  - Low buy-in: use only what you like!
    - In practice all the pieces use each other
• Cons:
  - No unifying abstractions
    - very complex environment to learn in full
    - composition of services difficult as number of services grows
  - Interfaces keep changing due to ever evolving design
  - Does not cover space of problems
40
Standards: GGF
Background:
• Grid standards are now being developed at the Global Grid Forum (GGF)
• In-development standard, Open Grid Services Infrastructure (OGSI) will extend Web Services (SOAP/XML, WSDL, etc.)
  - Names and a two level name scheme
  - Factories and lifetime management
  - Mandatory set of interfaces, e.g., discovery interfaces
• OGSA – Open Grid Services Architecture
  - Over-arching architecture
  - Still in development
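The combination of a two-level name scheme, factories, and lifetime management can be illustrated with a toy registry: a factory hands out a stable handle, the handle is resolved to a current reference, and an instance disappears unless its lifetime is extended. This is a sketch of the general idea only. `ServiceFactory` and its methods are hypothetical, the endpoint URL is invented, and OGSI's real interfaces are defined as Web Service descriptions, not as this Python class.

```python
import itertools


class ServiceFactory:
    """Toy factory: creates service instances with a bounded lifetime."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._instances = {}  # handle -> (reference, expiry time)

    def create(self, endpoint, lifetime, now):
        # First level: a stable handle that outlives any one location.
        handle = f"handle-{next(self._ids)}"
        self._instances[handle] = (endpoint, now + lifetime)
        return handle

    def resolve(self, handle, now):
        # Second level: the current reference, valid only while unexpired.
        reference, expires_at = self._instances[handle]
        if now >= expires_at:
            del self._instances[handle]
            raise LookupError(f"{handle} has expired")
        return reference

    def extend(self, handle, lifetime, now):
        # Lifetime management: a client keeps the instance alive explicitly.
        reference = self.resolve(handle, now)
        self._instances[handle] = (reference, now + lifetime)


factory = ServiceFactory()
h = factory.create("http://host-a.example/blast", lifetime=60, now=0)
print(factory.resolve(h, now=30))
```

Time is passed in explicitly (`now`) so the expiry behaviour is easy to follow; a real system would use wall-clock time.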
41
Summary
• Grids are about resource federation and sharing
• Grids are here today. They are being used in production computing in industry to solve real problems and provide real value.
  - Compute Grids
  - Data Grids
• We believe that users want high-level abstractions - and don’t want to think about the grid.
• Need low activation energy and legacy support
• There are a number of challenges to be solved - and different applications and organizations want to solve them differently
  - Policy heterogeneity
  - Strong separation of policy and mechanism
• Several areas where really good policies are still lacking
  - Scheduling
  - Security and security policy interactions
  - Failure recovery (and the interaction of different policies)
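Separation of policy and mechanism, applied to scheduling, might look like the sketch below: the mechanism (`schedule`) filters eligible hosts and never changes, while each organization plugs in its own policy function. All names, fields, and numbers are hypothetical and chosen for illustration.

```python
from typing import Callable, Sequence

# A policy takes a job and the eligible hosts and picks one host.
Policy = Callable[[dict, Sequence[dict]], dict]


def least_loaded(job: dict, hosts: Sequence[dict]) -> dict:
    return min(hosts, key=lambda h: h["load"])


def cheapest(job: dict, hosts: Sequence[dict]) -> dict:
    return min(hosts, key=lambda h: h["cost"])


def schedule(job: dict, hosts: Sequence[dict], policy: Policy) -> str:
    # Mechanism: the same filtering and dispatch path for every policy.
    eligible = [h for h in hosts if h["platform"] == job["platform"]]
    if not eligible:
        raise LookupError("no eligible host")
    return policy(job, eligible)["name"]


hosts = [
    {"name": "hq-1", "platform": "linux", "load": 0.9, "cost": 1},
    {"name": "rd-2", "platform": "linux", "load": 0.2, "cost": 5},
    {"name": "pm-1", "platform": "nt", "load": 0.1, "cost": 2},
]
job = {"app": "blast", "platform": "linux"}
print(schedule(job, hosts, least_loaded))
print(schedule(job, hosts, cheapest))
```

Two sites can share the same grid software and still disagree about what "best host" means, which is the policy heterogeneity the summary refers to.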