
The Anatomy of the Grid: Enabling Scalable Virtual Organizations

Acknowledgement to: Ian Foster, Mathematics and Computer Science Division

Argonne National Laboratory

John Dyer, TERENA

[email protected]

Grids are “hot” … but what are they really about?

Presentation Agenda

• Problem statement
• Architecture
• Globus Toolkit
• Futures

The Grid Problem

Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations

Elements of the Problem

• Resource sharing
  • Computers, storage, sensors, networks, …
  • Sharing always conditional: issues of trust, policy, negotiation, payment, … (cost vs. performance)
• Coordinated problem solving
  • Beyond client-server: distributed data analysis, computation, collaboration, …
• Dynamic, multi-institutional virtual organizations
  • Community overlays on classic organizational structures
  • Large or small, static or dynamic

Computational Astrophysics

• Solved Einstein’s equations for gravitational waves
• Tightly coupled; communication required
• Must communicate 30 MB/step between machines

[Figure: the run spanned the SDSC IBM SP (1024 processors; 5×12×17 = 1020 used) and the NCSA Origin Array (256+128+128 processors; 5×12×(4+2+2) = 480 used), with Gig-E at 100 MB/s within each site but an OC-12 line delivering only 2.5 MB/s between them]
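A rough calculation shows why the wide-area link dominates: at 30 MB/step, the intra-site Gig-E moves a step’s data in 30/100 = 0.3 s, while the OC-12 line needs 30/2.5 = 12 s, a 40x penalty that makes the inter-site link the bottleneck for this tightly coupled run.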

Data Grids for High Energy Physics

[Figure: the LHC tiered computing model. Raw detector data (~PB/s) flows into the online system; the CERN Computer Centre (Tier 0) hosts the offline processor farm (~20 TIPS); Tier 1 regional centres (FermiLab ~4 TIPS; France, Italy, Germany) connect at ~622 Mb/s (or air freight, deprecated); Tier 2 centres (~1 TIPS each, e.g. Caltech) connect onward at ~622 Mb/s; institute servers (~0.25 TIPS) hold the physics data cache; physicist workstations (Tier 4) attach at ~1 MB/s; links of ~100 MB/s join the online system and the processing farm]

• There is a “bunch crossing” every 25 ns
• There are ~100 “triggers” per second
• Each triggered event is ~1 MB in size
• Physicists work on analysis “channels”; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server

1 TIPS is approximately 25,000 SpecInt95 equivalents

Image courtesy Harvey Newman, Caltech
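These numbers are self-consistent: ~100 triggers per second at ~1 MB per event gives the ~100 MB/s flow out of the online system, and sustained running at that rate accumulates data on the petabyte scale, which is why the caching and tiered distribution above matter.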

Network for Earthquake Engineering Simulation

• NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other

• On-demand access to experiments, data streams, computing, archives, collaboration

NEESgrid: Argonne, Michigan, NCSA, UIUC, USC

Grid Applications: Mathematicians

• Community: an informal collaboration of mathematicians and computer scientists
• Condor-G delivers 3.46×10^8 CPU-seconds in 7 days (6×10^5 seconds of real time)
• Peak of 1,009 processors across 8 sites in the U.S. and Italy
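As a sanity check on these figures: 3.46×10^8 CPU-seconds over 6×10^5 wall-clock seconds works out to roughly 577 processors busy on average, against the reported peak of 1,009.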

MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin

Grid Architecture

Isn’t it just the Next Generation Internet, so why bother?

Why Discuss Architecture?

• Descriptive
  • Provide a common vocabulary for use when describing Grid systems
• Guidance
  • Identify key areas in which services are required: a FRAMEWORK
• Prescriptive
  • Define standards, but within the existing standards framework
  • GGF working with IETF, Internet2, etc.

What Sorts of Standards?

• Need for interoperability when different groups want to share resources
  • E.g., IP lets me talk to your computer, but how do we establish & maintain sharing?
  • How do I discover, authenticate, authorize, and describe what I want to do?
• Need for shared infrastructure services to avoid repeated development and installation, e.g.
  • One port/service for remote access to computing, not one per tool/application
  • X.509 enables sharing of Certificate Authorities
• MIDDLEWARE!

In Defining Grid Architecture, We Must Address . . .

• Development of Grid protocols & services
  • Protocol-mediated access to remote resources
  • New services: e.g., resource brokering
  • Mostly (extensions to) existing protocols
• Development of Grid APIs & SDKs
  • Facilitate application development by supplying higher-level abstractions
• The model is the Internet and Web

The Role of Grid Services (Middleware) and Tools

[Figure: Grid services such as information services, resource management, and fault detection sit as a middleware layer on top of the network, supporting higher-level tools such as collaboration tools, data management tools, and distributed simulation]

Grid Architecture: Status

• No “official” standards exist
• But:
  • Globus Toolkit has emerged as the de facto standard for several important Connectivity, Resource, and Collective protocols
  • GGF has an architecture working group
  • Technical specifications are being developed for architecture elements: e.g., security, data, resource management, information
  • Internet drafts have been submitted in the security area

Layered Grid Architecture

• Application: DOES THE SCIENCE
• Collective: JOB MANAGEMENT (directory, discovery, monitoring)
• Resource: NEGOTIATION & CONTROL (sharing and controlling resources)
• Connectivity: COMMS & AUTHENTICATION (single sign-on, trust, …)
• Fabric: ALL PHYSICAL RESOURCES (net, CPUs, storage, sensors)

[Figure: the layered Grid architecture drawn beside the Internet Protocol Architecture (Application / Transport / Internet / Link) to highlight the analogy]

Toolkits & Components

• CONDOR: harnessing the processing capacity of idle workstations (www.cs.wisc.edu/condor/)
• LEGION: developing an object-oriented framework for grid applications (www.cs.virginia.edu/~legion)
• Globus Toolkit: SDK and APIs (www.globus.org/)

Architecture: Fabric Layer

• Just what you would expect: the diverse mix of resources that may be shared
  • Individual computers, Condor pools, file systems, archives, metadata catalogs, networks, sensors, etc.
• Few constraints on low-level technology: connectivity and resource-level protocols
• The Globus Toolkit provides a few selected components (e.g., bandwidth broker)

Architecture: Connectivity

• Communication
  • Internet protocols: IP, DNS, routing, etc.
• Security: Grid Security Infrastructure (GSI)
  • Uniform authentication & authorization mechanisms in a multi-institutional setting
  • Single sign-on, delegation, identity mapping
  • Public-key technology, SSL, X.509, GSS-API (several Internet drafts document extensions)
  • Supporting infrastructure: Certificate Authorities, key management, etc.
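To make the Connectivity-layer security ideas concrete, here is a minimal sketch of certificate-based mutual authentication over SSL/TLS, the mechanism GSI builds on. This is not the real GSI protocol (proxy certificates, delegation, and the GSS-API bindings are omitted), and the host name and file paths are hypothetical:

```python
import socket
import ssl

# Hedged sketch: X.509 mutual authentication in the style that GSI layers on
# SSL. Paths and host are illustrative placeholders, not real GSI artifacts.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_verify_locations("ca-bundle.pem")           # trusted Certificate Authorities
ctx.load_cert_chain("usercert.pem", "userkey.pem")   # the user's X.509 credential

with socket.create_connection(("gridresource.example.org", 2811)) as sock:
    with ctx.wrap_socket(sock, server_hostname="gridresource.example.org") as tls:
        # At this point both ends have proven their identity with certificates
        # issued by mutually trusted CAs: the basis of cross-site trust.
        print("authenticated peer:", tls.getpeercert()["subject"])
```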

GSI Futures

• Scalability in numbers of users & resources
• Credential management
  • Online credential repositories
  • Account management
• Authorization
  • Policy languages
  • Community authorization
• Protection against compromised resources
  • Restricted delegation, smartcards

Architecture: Resources

• Resource management: remote allocation, reservation, monitoring, and control of (compute) resources: GRAM (access & management)
• Data access: GridFTP
  • High-performance data access & transport
• Information
  • GRIP (cf. LDAP; see the sketch after this list)
  • GRRP (Registration Protocol)
  • Access to structure & state information
• Others emerging: catalog access, code repository access, accounting, …
• All integrated with GSI
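Since GRIP is modelled on LDAP, a minimal sketch of what an enquiry looks like can be given with an ordinary LDAP client. The server name, port, object class, and attribute names below are assumptions for illustration (MDS deployments used an LDAP-style tree rooted near “Mds-Vo-name=local, o=Grid”), not a verbatim GRIP session:

```python
from ldap3 import Server, Connection, ALL

# Hedged sketch of a GRIP-style enquiry via LDAP. Host, port, filter, and
# attribute names are illustrative assumptions, not the exact MDS schema.
server = Server("giis.example.org", port=2135, get_info=ALL)
conn = Connection(server, auto_bind=True)   # anonymous bind, for illustration

conn.search(
    search_base="Mds-Vo-name=local, o=Grid",
    search_filter="(objectClass=MdsHost)",         # hypothetical object class
    attributes=["Mds-Host-hn", "Mds-Cpu-Free"],    # hypothetical attributes
)
for entry in conn.entries:
    print(entry)
```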

GRAM Resource Management Protocol

• Grid Resource Allocation & Management
  • Allocation, monitoring, and control of computations
• Simple HTTP-based RPC
  • Job request: returns an opaque, transferable “job contact” string for access to the job
  • Job cancel, job status, job signal
• Event notification (callbacks) for state changes
• Protocol/server address robustness (exactly-once execution), authentication, authorization
• Servers for most schedulers; C and Java APIs
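The shape of that HTTP-based RPC can be sketched as follows. The gatekeeper URL, the job description, and the reply format are hypothetical stand-ins; real GRAM submits an RSL resource description over a GSI-authenticated connection:

```python
import urllib.request

# Hedged sketch of the GRAM interaction pattern: submit a job, get back an
# opaque "job contact" string, and use it for later status/cancel requests.
GATEKEEPER = "https://gatekeeper.example.org:2119/jobmanager"  # hypothetical

def submit(executable: str) -> str:
    rsl = f'&(executable="{executable}")(count=4)'  # RSL-like request, illustrative
    req = urllib.request.Request(GATEKEEPER, data=rsl.encode(), method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()                 # the opaque job contact

def status(job_contact: str) -> str:
    with urllib.request.urlopen(job_contact + "/status") as resp:
        return resp.read().decode()                 # e.g. PENDING / ACTIVE / DONE

# job = submit("/bin/hostname"); print(status(job))
```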

Data Access & Transfer

• GridFTP: extended version of the popular FTP protocol for Grid data access and transfer
• Secure, efficient, reliable, flexible, extensible, parallel, concurrent; e.g.:
  • Third-party data transfers, partial file transfers
  • Parallelism, striping (e.g., on PVFS)
  • Reliable, recoverable data transfers
• Reference implementations
  • Existing clients and servers: wuftpd, ncftp
  • Flexible, extensible libraries
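One of the plain-FTP features GridFTP builds on, the restart marker, already supports partial and recoverable transfers, as this minimal sketch shows. The host and path are hypothetical, and real GridFTP adds GSI security, parallel streams, striping, and third-party transfers on top:

```python
from ftplib import FTP_TLS

# Hedged sketch: resumed/partial file retrieval with FTP's REST mechanism,
# one building block that GridFTP extends. Host and path are illustrative.
def fetch_from_offset(host: str, path: str, offset: int, out: str) -> None:
    ftp = FTP_TLS(host)
    ftp.login()        # anonymous login, for illustration
    ftp.prot_p()       # protect the data channel as well
    with open(out, "wb") as f:
        # rest=offset resumes the retrieval mid-file instead of at byte 0
        ftp.retrbinary(f"RETR {path}", f.write, rest=offset)
    ftp.quit()

# fetch_from_offset("ftp.example.org", "/data/events.dat", 1_000_000, "tail.dat")
```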

Architecture: Collective

• Bringing the underlying resources together to provide the requested services
• Resource brokers (e.g., Condor Matchmaker): resource discovery and allocation (see the sketch after this list)
• Replica management and replica selection: optimize aggregate data access performance
• Co-reservation and co-allocation services: end-to-end performance
• Etc., etc.
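The broker idea can be sketched in a few lines in the spirit of Condor’s Matchmaker: jobs advertise requirements, resources advertise properties, and the broker pairs them. The resource entries and attribute names below are illustrative inventions; real ClassAds are far more expressive:

```python
from typing import Optional

# Hedged sketch of matchmaking. Resource entries and attribute names are
# hypothetical; a real matchmaker evaluates rich, two-sided ClassAd expressions.
resources = [
    {"name": "sp.sdsc.example.org", "cpus": 64, "mem_gb": 32},
    {"name": "origin.ncsa.example.org", "cpus": 128, "mem_gb": 64},
]

def match(job: dict) -> Optional[dict]:
    """Return the first resource satisfying every requirement of the job."""
    for r in resources:
        if r["cpus"] >= job["min_cpus"] and r["mem_gb"] >= job["min_mem_gb"]:
            return r
    return None

print(match({"min_cpus": 100, "min_mem_gb": 16}))  # -> the 128-CPU entry
```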

Globus Toolkit Solution

• Registration & enquiry protocols, information models, query languages
• Provides standard interfaces to sensors
• Supports different “directory” structures supporting various discovery/access strategies

Karl Czajkowski, Steve Fitzgerald, and others
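The registration half of this picture is soft state: a resource periodically re-announces itself to a directory, and its entry expires when the announcements stop. The directory URL, payload, and TTL below are hypothetical, a sketch of the pattern rather than the actual GRRP wire format:

```python
import threading
import time
import urllib.request

# Hedged sketch of GRRP-style soft-state registration. URL, payload, and TTL
# are illustrative; real notifications carry richer metadata and security.
DIRECTORY = "http://giis.example.org:8080/register"   # hypothetical endpoint
TTL_SECONDS = 60

def announce(resource_name: str) -> None:
    body = f"name={resource_name}&ttl={TTL_SECONDS}".encode()
    urllib.request.urlopen(DIRECTORY, data=body)      # POST one registration

def keep_registered(resource_name: str) -> None:
    # Re-register at half the TTL so the entry stays fresh while the resource
    # is alive; if the resource dies, silence lets the entry lapse naturally.
    while True:
        announce(resource_name)
        time.sleep(TTL_SECONDS / 2)

threading.Thread(target=keep_registered, args=("sp.sdsc.example.org",), daemon=True).start()
```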

Grid Futures

Major Grid Projects (Name; URL & Sponsors; Focus)

• Access Grid (www.mcs.anl.gov/FL/accessgrid; DOE, NSF): create & deploy group collaboration systems using commodity technologies
• BlueGrid (IBM): Grid testbed linking IBM laboratories
• DISCOM (www.cs.sandia.gov/discom; DOE Defense Programs): create an operational Grid providing access to resources at three U.S. DOE weapons laboratories
• DOE Science Grid (sciencegrid.org; DOE Office of Science): create an operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities
• Earth System Grid, ESG (earthsystemgrid.org; DOE Office of Science): delivery and analysis of large climate model datasets for the climate research community
• European Union (EU) DataGrid (eu-datagrid.org; European Union): create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics


Major Grid Projects (continued)

• EuroGrid, Grid Interoperability (GRIP) (eurogrid.org; European Union): create technologies for remote access to supercomputer resources & simulation codes; in GRIP, integrate with Globus
• Fusion Collaboratory (fusiongrid.org; DOE Office of Science): create a national computational collaboratory for fusion research
• Globus Project (globus.org; DARPA, DOE, NSF, NASA, Microsoft): research on Grid technologies; development and support of the Globus Toolkit; application and deployment
• GridLab (gridlab.org; European Union): Grid technologies and applications
• GridPP (gridpp.ac.uk; U.K. eScience): create & apply an operational grid within the U.K. for particle physics research
• Grid Research Integration Development & Support Center (grids-center.org; NSF): integration, deployment, and support of the NSF Middleware Infrastructure for research & education


Major Grid Projects (continued)

• Grid Application Development Software (hipersoft.rice.edu/grads; NSF): research into program development technologies for Grid applications
• Grid Physics Network (griphyn.org; NSF): technology R&D for data analysis in physics experiments: ATLAS, CMS, LIGO, SDSS
• Information Power Grid (ipg.nasa.gov; NASA): create and apply a production Grid for aerosciences and other NASA missions
• International Virtual Data Grid Laboratory (ivdgl.org; NSF): create an international Data Grid to enable large-scale experimentation on Grid technologies & applications
• Network for Earthquake Engineering Simulation Grid (neesgrid.org; NSF): create and apply a production Grid for earthquake engineering
• Particle Physics Data Grid (ppdg.net; DOE Science): create and apply production Grids for data analysis in high energy and nuclear physics experiments


Major Grid Projects (continued)

• TeraGrid (teragrid.org; NSF): U.S. science infrastructure linking four major resource sites at 40 Gb/s
• UK Grid Support Center (grid-support.ac.uk; U.K. eScience): support center for Grid projects within the U.K.
• Unicore (BMBFT): technologies for remote access to supercomputers


Also many technology R&D projects: e.g., Condor, NetSolve, Ninf, NWS

See also www.gridforum.org

The 13.6 TF TeraGrid: Computing at 40 Gb/s

[Figure: the four TeraGrid/DTF sites, each with local site resources, archival storage (HPSS, UniTree), and external network connections: NCSA/PACI (8 TF, 240 TB), SDSC (4.1 TF, 225 TB), Caltech, and Argonne]

TeraGrid/DTF: NCSA, SDSC, Caltech, Argonne (www.teragrid.org)

International Virtual Data Grid Laboratory

[Figure: world map of iVDGL sites: Tier 0/1 facilities, Tier 2 facilities, and Tier 3 facilities, connected by 10 Gb/s, 2.5 Gb/s, 622 Mb/s, and other links]

U.S. PIs: Avery, Foster, Gardner, Newman, Szalay (www.ivdgl.org)

Problem Evolution

• Past to present: O(10^2) high-end systems; Mb/s networks; centralized (or entirely local) control
  • I-WAY (1995): 17 sites, week-long; 155 Mb/s
  • GUSTO (1998): 80 sites, long-term experiment
  • NASA IPG, NSF NTG: O(10) sites, production
• Present: O(10^4-10^6) data systems and computers; Gb/s networks; scaling, decentralized control
  • Scalable resource discovery; restricted delegation; community policy
  • GriPhyN Data Grid: 100s of sites, O(10^4) computers; complex policies
• Future: O(10^6-10^9) data sources, sensors, computers; Tb/s networks; highly flexible policy and control

The Future

• We don’t build or buy “computers” anymore; we borrow or lease required resources
  • When I walk into a room, need to solve a problem, need to communicate
• A “computer” is a dynamically, often collaboratively constructed collection of processors, data sources, sensors, and networks
  • Similar observations apply to software

And Thus …

• Reduced barriers to access mean that we do much more computing, and more interesting computing, than today => Many more components (& services); massive parallelism

• All resources are owned by others => Sharing (for fun or profit) is fundamental; trust, policy, negotiation, payment

• All computing is performed on unfamiliar systems => Dynamic behaviors, discovery, adaptivity, failure

The Global Grid Forum

• Merger of the (US) Grid Forum & the European eGrid forum
• Cooperative forum of working groups, open to all who show up
• Meets every four months, alternating between the US and Europe
  • GGF1: Amsterdam, NL
  • GGF2: Washington, US
  • GGF3: Frascati, IT

http://www.gridforum.org

Global Grid Forum History

• 1998: GF BOF (Orlando)
• 1999: GF1 (San Jose, NASA Ames); GF2 (Chicago, Northwestern); eGrid and GF BOFs (Portland)
• 2000: GF3 (San Diego, SDSC); eGrid1 (Poznan, PSNC); GF4 (Redmond, Microsoft); eGrid2 (Munich, Europar); GF5 (Boston, Sun); Global GF BOF (Dallas); Asia-Pacific GF Planning (Yokohama)
• 2001: GGF-1 (Amsterdam, WTCW); GGF-2 (Washington, DC, DOD-MSRC); GGF-3 (Rome, INFN), 7-10 October 2001
• 2002: GGF-4 (Toronto, NRC), 17-20 February 2002; GGF-5 (Edinburgh), 21-24 July 2002, jointly with HPDC (24-26 July)

GGF Areas (Working Groups • Research Groups)

• Grid Information Services: Grid Object Specification; Grid Notification Framework; Metacomputing Directory Services; Relational Database Information Services
• Scheduling and Resource Management: Advanced Reservation; Scheduling Dictionary; Scheduler Attributes
• Security: Grid Security Infrastructure; Grid Certificate Policy
• Performance: Performance
• Architectures: JINI; Grid Protocol Architecture
• Data: GridFTP; Replica
• Applications, Programming Models, and User Environments: APPS, GUS, GCE, APM

Summary

• The Grid problem: Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations

• Grid architecture: Emphasize protocol and service definition to enable interoperability and resource sharing

• Globus Toolkit: a source of protocol and API definitions, plus reference implementations

• See: globus.org, griphyn.org, gridforum.org