Chapter 4:- Introduction to Grid and its Evolution
Prepared By:- NITIN PANDYA Assistant Professor SVBIT.
Overview
  Background: What is the Grid?
  Related technologies
  Grid applications
  Communities
  Grid Tools
  Case Studies
What is a Grid?
  Many definitions exist in the literature
  Early definition (Foster and Kesselman, 1998): “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities”
  Kleinrock, 1969: “We will probably see the spread of ‘computer utilities’, which, like present electric and telephone utilities, will service individual homes and offices across the country.”
Grid computing (1)
  “Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations” (I. Foster)
Grid computing (2)
  Information grid: large-scale access to distributed data (the Web)
  Data grid: management and processing of very large distributed data sets
  Computing grid: meta computer
Parallelism vs. grids: a few reminders
  Grids date back “only” to 1996
  Parallelism is older (first classification in 1972)
  Motivations:
    need more computing power (weather forecasting, atomic simulation, genomics…)
    need more storage capacity (petabytes and more)
    in a word: improve performance!
  Three ways to do so:
    Work harder --> use faster hardware
    Work smarter --> optimize algorithms
    Get help --> use more computers!
Measuring performance
  Ideally, performance grows linearly with the number of processors
  Speed-up:
    if $T_S$ is the best time to process a problem sequentially, then the parallel processing time with $P$ processors should ideally be $T_P = T_S / P$
    speedup: $S = T_S / T_P$
    the speedup is limited by Amdahl's law: any parallel program has a purely sequential part $F$ and a parallelizable part $T_{\parallel}$, so $T_S = F + T_{\parallel}$
    thus $S = \dfrac{F + T_{\parallel}}{F + T_{\parallel}/P} < P$ whenever $F > 0$ and $P > 1$
  Scale-up:
    if $T_{P,S}$ is the time to solve a problem of size $S$ with $P$ processors (here $S$ is the problem size, not the speedup), then $T_{P,S}$ should also be the time to solve a problem of size $n \cdot S$ with $n \cdot P$ processors
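The speedup bound can be checked numerically. The sketch below is only an illustration of the formula above; the function names and the sample values (10 time units of sequential work, 90 of parallelizable work) are assumptions chosen for the example, not figures from the slides.

```python
def amdahl_speedup(sequential_time, parallel_time, processors):
    """Speedup S = (F + T_par) / (F + T_par / P) for P processors."""
    if processors < 1:
        raise ValueError("processors must be >= 1")
    total = sequential_time + parallel_time
    return total / (sequential_time + parallel_time / processors)


def speedup_limit(sequential_time, parallel_time):
    """Upper bound on the speedup as the number of processors grows without limit."""
    if sequential_time == 0:
        return float("inf")  # fully parallel program: no bound from the sequential part
    return (sequential_time + parallel_time) / sequential_time


if __name__ == "__main__":
    F, T_PAR = 10.0, 90.0  # assumed example workload
    for p in (1, 2, 8, 64, 1024):
        print(p, round(amdahl_speedup(F, T_PAR, p), 2))
    print("limit:", speedup_limit(F, T_PAR))
```

With these assumed values the speedup never exceeds 10, however many processors are added, because the sequential part always costs 10 time units.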
Why do we need Grids?
  Many large-scale problems cannot be solved by a single computer
  Globally distributed data and resources
Background: Related technologies
  Cluster computing
  Peer-to-peer computing
  Internet computing
Cluster computing
  Idea: put some PCs together and get them to communicate
  Cheaper to build than a mainframe supercomputer
  Different sizes of clusters
  Scalable – can grow a cluster by adding more PCs
Cluster Architecture
Peer-to-Peer computing
  Connect to other computers
  Can access files from any computer on the network
  Allows data sharing without going through central server
  Decentralized approach also useful for Grid
Peer to Peer architecture
Internet computing
  Idea: many idle PCs on the Internet
  Can perform other computations while not being used
  “Cycle scavenging” – rely on getting free time on other people’s computers
  Example: SETI@home
  What are advantages/disadvantages of cycle scavenging?
Some Grid Applications
  Distributed supercomputing
  High-throughput computing
  On-demand computing
  Data-intensive computing
  Collaborative computing
Grid Users
  Many levels of users
    Grid developers
    Tool developers
    Application developers
    End users
    System administrators
Some Grid challenges
  Data movement
  Data replication
  Resource management
  Job submission
Computational grid
  “Hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities” (I. Foster)
  Performance criteria:
    security
    reliability
    computing power
    latency
    throughput
    scalability
    services
Grid characteristics
  Large scale
  Heterogeneity
  Multiple administration domains
  Autonomy… and coordination
  Dynamicity
  Flexibility
  Extensibility
  Security
Levels of cooperation in a computing grid
  End system (computer, disk, sensor…): multithreading, local I/O
  Cluster: synchronous communications, DSM, parallel I/O, parallel processing
  Intranet/Organization: heterogeneity, distributed admin, distributed FS and databases, load balancing, access control
  Internet/Grid: global supervision, brokers, negotiation, cooperation…
Basic services
  Authentication/Authorization/Traceability
  Activity control (monitoring)
  Resource discovery
  Resource brokering
  Scheduling
  Job submission, data access/migration and execution
  Accounting
Layered Grid Architecture (by analogy to Internet architecture), from I. Foster
  Grid layers, top to bottom:
    Application
    Collective, “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services
    Resource, “Sharing single resources”: negotiating access, controlling use
    Connectivity, “Talking to things”: communication (Internet protocols) & security
    Fabric, “Controlling things locally”: access to, & control of, resources
  Internet Protocol Architecture layers, top to bottom:
    Application
    Transport
    Internet
    Link
Resources
  Description
  Advertising
  Cataloging
  Matching
  Claiming
  Reserving
  Checkpointing
Resource management (1)
  Services and protocols depend on the infrastructure
  Some parameters:
    stability of the infrastructure (same set of resources or not)
    freshness of the resource availability information
    reservation facilities
    multiple resource or single resource brokering
  Example of request: I need from 10 to 100 CE each with at least 512 MB RAM and a computing power of 150 Mflops
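To make the example request concrete, here is a minimal matching sketch. It assumes that CE stands for computing element and that "a computing power of 150 Mflops" means at least 150 Mflops; the class and function names are hypothetical and do not correspond to any Grid middleware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputingElement:
    name: str
    ram_mb: int
    mflops: float


@dataclass(frozen=True)
class Request:
    min_count: int      # fewest elements that make the request worthwhile
    max_count: int      # most elements the requester can use
    min_ram_mb: int
    min_mflops: float


def match(request, available):
    """Return up to max_count suitable elements, or [] if fewer than min_count qualify."""
    suitable = [
        ce for ce in available
        if ce.ram_mb >= request.min_ram_mb and ce.mflops >= request.min_mflops
    ]
    if len(suitable) < request.min_count:
        return []  # the request cannot be satisfied at all
    return suitable[:request.max_count]


# The request from the slide: 10 to 100 CE, >= 512 MB RAM, >= 150 Mflops.
slide_request = Request(min_count=10, max_count=100, min_ram_mb=512, min_mflops=150.0)
```

A real broker would also have to deal with stale availability information and reservations, which are the parameters listed above; this sketch ignores both.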
Resource management and scheduling (1)
  Levels of scheduling:
    job scheduling (global level; perf: throughput)
    resource scheduling (perf: fairness, utilization)
    application scheduling (perf: response time, speedup, produced data…)
  Mapping/Scheduling process:
    resource discovery and selection
    assignment of tasks to computing resources
    data distribution
    task scheduling on the computing resources
    (communication scheduling)
Resource management and scheduling (2)
  Individual perfs are not necessarily consistent with the global (system) perf!
  Grid problems:
    predictions are not definitive: dynamicity!
    heterogeneous platforms
    checkpointing and migration
A Resource Management System Example (Globus)
  [Figure: Globus resource management architecture. Components shown: Application, Broker (RSL specialization), Information Service (queries & info), Co-allocator, three GRAM instances, and local resource managers (LSF, Condor, NQE). Requests are labelled RSL, ground RSL, and simple ground RSL.]
  RSL: Resource Specification Language
  NQE: Network Queuing Environment (batch management; developed by Cray Research)
  LSF: Load Sharing Facility (task scheduling and load balancing; developed by Platform Computing)
Resource information (1)
  What is to be stored?
    virtual organizations, people, computing resources, software packages, communication resources, event producers, devices…
    what about data?
  A key issue in such dynamic environments
  A first approach: (distributed) directory (LDAP)
    easy to use
    tree structure
    distribution
    static
    mostly read; not efficient for updating
    hierarchical
    poor procedural language
Resource information (2)
  Goal:
    dynamicity
    complex relationships
    frequent updates
    complex queries
  A second approach: (relational) database
Programming on the grid: potential programming models
  Message passing (PVM, MPI)
  Distributed Shared Memory
  Data Parallelism (HPF, HPC++)
  Task Parallelism (Condor)
  Client/server - RPC
  Agents
  Integration system (CORBA, DCOM, RMI)
Program execution: issues
  Parallelize the program with the right job structure, communication patterns/procedures, algorithms
  Discover the available resources
  Select the suitable resources
  Allocate or reserve these resources
  Migrate the data
  Initiate computations
  Monitor the executions; checkpoints?
  React to changes
  Collect results
Data management
  It was long neglected, though it is a key issue!
  Issues:
    indexing
    retrieval
    replication
    caching
    traceability (auditing)
  And security!
Some Grid-Related Projects
  Globus
  Condor
  Nimrod-G
Globus Grid Toolkit
  Open source toolkit for building Grid systems and applications
  Enabling technology for the Grid
  Share computing power, databases, and other tools securely online
  Facilities for:
    Resource monitoring
    Resource discovery
    Resource management
    Security
    File management
Data Management in Globus Toolkit
  Data movement
    GridFTP
    Reliable File Transfer (RFT)
  Data replication
    Replica Location Service (RLS)
    Data Replication Service (DRS)
GridFTP
  High performance, secure, reliable data transfer protocol
  Optimized for wide area networks
  Superset of Internet FTP protocol
  Features:
    Multiple data channels for parallel transfers
    Partial file transfers
    Third party transfers
    Reusable data channels
    Command pipelining
More GridFTP features
  Auto tuning of parameters
  Striping
    Transfer data in parallel among multiple senders and receivers instead of just one
  Extended block mode
    Send data in blocks
    Know block size and offset
    Data can arrive out of order
    Allows multiple streams
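The idea behind extended block mode, that each block carries its own offset so that arrival order no longer matters, can be sketched as follows. This is an illustration of the principle only, not the GridFTP wire format; the function names are hypothetical, and the completeness check assumes blocks do not overlap.

```python
def split_into_blocks(data, block_size):
    """Cut data into (offset, payload) blocks of at most block_size bytes."""
    return [(i, data[i:i + block_size]) for i in range(0, len(data), block_size)]


def reassemble(blocks, total_size):
    """Rebuild the original bytes from (offset, payload) blocks received in any order."""
    buffer = bytearray(total_size)
    received = 0
    for offset, payload in blocks:
        end = offset + len(payload)
        if offset < 0 or end > total_size:
            raise ValueError(f"block at offset {offset} does not fit in the file")
        buffer[offset:end] = payload  # the offset tells us where the block belongs
        received += len(payload)
    if received != total_size:
        # Assumes non-overlapping blocks, so byte counts reveal gaps or duplicates.
        raise ValueError("missing or duplicated data")
    return bytes(buffer)
```

Because every block is self-describing, blocks sent over several parallel streams can be written into place as they arrive, which is what makes multiple streams possible.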
Striping Architecture
  Use “Striped” servers
Limitations of GridFTP
  Not a web service protocol (does not employ SOAP, WSDL, etc.)
  Requires client to maintain open socket connection throughout transfer
    Inconvenient for long transfers
  Cannot recover from client failures
GridFTP
Reliable File Transfer (RFT)
  Web service with “job-scheduler” functionality for data movement
  User provides source and destination URLs
  Service writes job description to a database and moves files
  Service methods for querying transfer status
RFT
Replica Location Service (RLS)
  Registry to keep track of where replicas exist on physical storage system
  Users or services register files in RLS when files created
  Distributed registry
    May consist of multiple servers at different sites
    Increase scale
    Fault tolerance
Replica Location Service (RLS)
  Logical file name – unique identifier for contents of file
  Physical file name – location of copy of file on storage system
  User can provide logical name and ask for replicas
  Or query to find logical name associated with physical file location
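The two lookups described above (logical name to replicas, and physical location back to logical name) amount to a two-way mapping. The sketch below is a single in-memory stand-in for that idea; the class name, method names, and example URLs are hypothetical and are not the RLS API, and a real RLS is distributed across several servers.

```python
from collections import defaultdict


class ReplicaCatalog:
    """In-memory stand-in for a replica registry (not the real RLS interface)."""

    def __init__(self):
        self._replicas = defaultdict(set)  # logical name -> set of physical names
        self._logical_of = {}              # physical name -> logical name

    def register(self, logical_name, physical_name):
        """Record that physical_name holds a copy of logical_name."""
        existing = self._logical_of.get(physical_name)
        if existing is not None and existing != logical_name:
            raise ValueError(f"{physical_name} is already registered for {existing}")
        self._replicas[logical_name].add(physical_name)
        self._logical_of[physical_name] = logical_name

    def replicas(self, logical_name):
        """All known physical locations for a logical file name."""
        return sorted(self._replicas.get(logical_name, ()))

    def logical_name(self, physical_name):
        """The logical name associated with a physical location, or None."""
        return self._logical_of.get(physical_name)


catalog = ReplicaCatalog()
catalog.register("climate-run-42.nc", "gsiftp://site-a.example.org/data/climate-run-42.nc")
catalog.register("climate-run-42.nc", "gsiftp://site-b.example.org/store/climate-run-42.nc")
```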
Data Replication Service (DRS)
  Pull-based replication capability
  Implemented as a web service
  Higher-level data management service built on top of RFT and RLS
  Goal: ensure that a specified set of files exists on a storage site
    First, query RLS to locate desired files
    Next, create transfer request using RFT
    Finally, register new replicas with RLS
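The three DRS steps (locate, transfer, register) can be sketched with simple stand-ins. Here a plain dictionary plays the role of RLS and a caller-supplied function plays the role of RFT; the function name, the URL layout, and the choice of the first known replica as the source are all assumptions made for illustration, not the behaviour of the real services.

```python
def replicate(logical_names, target_site, catalog, transfer):
    """Ensure each logical file has a replica under target_site.

    catalog:  dict mapping logical name -> list of physical URLs (stand-in for RLS)
    transfer: callable(source_url, destination_url) that moves a file (stand-in for RFT)
    Returns the list of newly created replica URLs.
    """
    created = []
    for name in logical_names:
        # Step 1: query the catalog to locate existing copies.
        locations = catalog.get(name, [])
        if not locations:
            raise LookupError(f"no replica registered for {name!r}")
        if any(url.startswith(target_site) for url in locations):
            continue  # already present on the target site
        # Step 2: request a transfer from an existing copy (first one, by assumption).
        destination = f"{target_site}/{name}"
        transfer(locations[0], destination)
        # Step 3: register the new replica.
        catalog[name] = locations + [destination]
        created.append(destination)
    return created
```

Calling the function a second time with the same arguments creates nothing new, which matches the stated goal of ensuring that the files exist on the site rather than copying them unconditionally.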
Condor
  Original goal: high-throughput computing
  Harvest wasted CPU power from other machines
  Can also be used on a dedicated cluster
  Condor-G – Condor interface to Globus resources
Earth System Grid
  Provide climate studies scientists with access to large datasets
  Data generated by computational models – requires massive computational power
  Most scientists work with subsets of the data
  Requires access to local copies of data
ESG Infrastructure
  Archival storage systems and disk storage systems at several sites
  Storage resource managers and GridFTP servers to provide access to storage systems
  Metadata catalog services
  Replica location services
  Web portal user interface
Earth System Grid
Earth System Grid Interface
Laser Interferometer Gravitational Wave Observatory (LIGO)
  Instruments at two sites to detect gravitational waves
  Each experiment run produces millions of files
  Scientists at other sites want these datasets on local storage
  LIGO deploys RLS servers at each site to register local mappings and collect info about mappings at other sites
Large Scale Data Replication for LIGO
  Goal: detection of gravitational waves
  Three interferometers at two sites
  Generate 1 TB of data daily
  Need to replicate this data across 9 sites to make it available to scientists
  Scientists need to learn where data items are, and how to access them
LIGO
LIGO Solution
  Lightweight data replicator (LDR)
  Uses parallel data streams, tunable TCP windows, and tunable write/read buffers
  Tracks where copies of specific files can be found
  Stores descriptive information (metadata) in a database
    Can select files based on description rather than filename
TeraGrid
  NSF high-performance computing facility
  Nine distributed sites, each with different capabilities, e.g., computation power, archiving facilities, visualization software
  Applications may require more than one site
  Data sizes on the order of gigabytes or terabytes
TeraGrid
TeraGrid
  Solution: use GridFTP and RFT with front end command line tool (tgcp)
  Benefits of system:
    Simple user interface
    High performance data transfer capability
    Ability to recover from both client and server software failures
    Extensible configuration
TGCP Details
  Idea: hide low level GridFTP commands from users
  Copy file smallfile.dat in a working directory to another system:
    tgcp smallfile.dat tg-login.sdsc.teragrid.org:/users/ux454332
  GridFTP command:
    globus-url-copy -p 8 -tcp-bs 1198372 \
      gsiftp://tg-gridftprr.uc.teragrid.org:2811/home/navarro/smallfile.dat \
      gsiftp://tg-login.sdsc.teragrid.org:2811/users/ux454332/smallfile.dat
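How a front end might expand a short copy request into the long GridFTP command can be sketched as below. This is not how tgcp is actually implemented; the function name and its parameters are hypothetical, and only the `-p` and `-tcp-bs` options and the URLs shown in the slide are taken from the source. The function only builds the argument list and does not run anything.

```python
import posixpath


def build_globus_url_copy(src_host, src_path, dst_host, dst_dir,
                          port=2811, parallel_streams=8, tcp_buffer_bytes=1198372):
    """Build the argument list for a globus-url-copy call like the one on the slide."""
    file_name = posixpath.basename(src_path)
    src_url = f"gsiftp://{src_host}:{port}{src_path}"
    dst_url = f"gsiftp://{dst_host}:{port}{posixpath.join(dst_dir, file_name)}"
    return [
        "globus-url-copy",
        "-p", str(parallel_streams),       # number of parallel data streams
        "-tcp-bs", str(tcp_buffer_bytes),  # TCP buffer size in bytes
        src_url,
        dst_url,
    ]


command = build_globus_url_copy(
    "tg-gridftprr.uc.teragrid.org", "/home/navarro/smallfile.dat",
    "tg-login.sdsc.teragrid.org", "/users/ux454332",
)
print(" ".join(command))
```

Keeping the tuning values as defaults in one place is the point of such a wrapper: users name the file and the destination, and the transfer parameters stay out of sight.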
The reality
  We have spent a lot of time talking about “The Grid”
  There is “the Web” and “the Internet”
  Is there a single Grid?
The reality
  Many types of Grids exist
  Private vs. public
  Regional vs. Global
  All-purpose vs. particular scientific problem