
Copyright 2006, Jeffrey K. Hollingsworth

Grid Computing

Jeffrey K. Hollingsworth

[email protected]

Department of Computer Science, University of Maryland, College Park, MD 20742


The Need for GRIDS

Many Computation-Bound Jobs
– Simulations
• Financial
• Electronic Design
• Science
– Data Mining

Large-Scale Collaboration
– Sharing of large data sets
– Coupled, communicating simulation codes


Available Resources - Desktops

Networks of Workstations
– Workstations have high processing power

– Connected via high speed network (100Mbps+)

– Long idle time (50-60%) and low resource usage

Goal: Run CPU-intensive programs using idle periods (a minimal harvesting-loop sketch follows this list)

• while owner is away: send guest job and run

• when owner returns: stop and migrate guest job away

– Examples: Condor (University of Wisconsin)
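The harvesting goal above can be pictured as a small control loop. The following is only an illustrative sketch, not Condor's actual mechanism: is_owner_active(), start_guest_job(), and migrate_away() are placeholder hooks introduced here, and the thresholds are made up.

import subprocess
import time

IDLE_BEFORE_START = 300   # seconds of idleness before sending a guest job (illustrative)
POLL_INTERVAL = 5         # seconds between checks

def is_owner_active():
    # Placeholder hook: a real harvester checks keyboard/mouse idle time and CPU load.
    return False

def start_guest_job(cmd):
    # Launch the guest job as an ordinary child process.
    return subprocess.Popen(cmd)

def migrate_away(guest):
    # Placeholder: a real system (e.g., Condor) would checkpoint the job and
    # restart it elsewhere; here we simply stop it.
    guest.terminate()
    guest.wait()

def harvest(cmd):
    # e.g. harvest(["./guest_job"])
    idle_for, guest = 0, None
    while True:
        if is_owner_active():
            idle_for = 0
            if guest is not None:              # owner returned: get out of the way
                migrate_away(guest)
                guest = None
        else:
            idle_for += POLL_INTERVAL
            if guest is None and idle_for >= IDLE_BEFORE_START:
                guest = start_guest_job(cmd)   # idle long enough: run the guest job
        time.sleep(POLL_INTERVAL)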


Computational Grids

Environment
– Collection of semi-autonomous computers
– Geographically distributed
– Goal: Use these systems as a coordinated resource
– Heterogeneous: processors, networks, OS

Target Applications
– Large-scale programs: running for 100-1,000s of seconds
– Significant need to access long-term storage

Needs
– Coordinated access (scheduling)
– Specific time requests (reservations)
– Scalable system software (1,000s of nodes)


Two Models of Grid Nodes

Harvested Nodes (Desktop)
– Computers on desktops
– Have a primary user who has priority
– Participate in the grid when resources are free

Dedicated Nodes (Data Center)
– Dedicated to computation-bound jobs
– Various policies
• May participate in the grid 24/7
• May only participate when load is low


Available Processing Power

[Figure: Available Memory. Probability vs. memory size (MB), with curves for all, idle, and non-idle periods.]

– Memory is available: 30 MB is available 70% of the time

[Figure: CPU Usage. Cumulative distribution of CPU usage (%), with curves for all, idle, and non-idle periods.]

– CPU usage is low: 10% or less 75% of the time


OS Support for Harvested Grid Computing

Need To Manage Resources Differently
– Scheduler
• Normally designed to be fair
• Need strict priority
– Virtual Memory
• Need priority for local jobs

– File systems

Virtual Machines make things easier
– Provide isolation
– Manage resources


Starvation Level CPU Scheduling

Original Linux CPU Scheduler
– Run-time scheduling priority
• nice value & remaining time quanta
• T_i = 20 - nice_level + (1/2) * T_(i-1)
– Even nice'd guest processes can still be scheduled

Modified Linux CPU Scheduler (a simulation sketch follows below)
– If runnable host processes exist
• Schedule the host process with the highest priority
– Only when no host process is runnable
• Schedule a guest process
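The modified policy is easy to mimic in a small user-level simulation. This is only a sketch of the idea, not the actual kernel change: the Proc class and the function names are introduced here for illustration, and only the quanta formula and the host-before-guest rule come from the slide.

from dataclasses import dataclass

@dataclass
class Proc:
    name: str
    kind: str            # "host" or "guest"
    nice_level: int
    quanta: float = 0.0
    runnable: bool = True

def recompute_quanta(p):
    # Slide's recalculation rule: T_i = 20 - nice_level + (1/2) * T_(i-1)
    p.quanta = 20 - p.nice_level + 0.5 * p.quanta
    return p.quanta

def pick_next(procs):
    runnable = [p for p in procs if p.runnable]
    hosts = [p for p in runnable if p.kind == "host"]
    if hosts:
        # Strict priority: if any host process is runnable, one of them runs.
        return max(hosts, key=lambda p: p.quanta)
    guests = [p for p in runnable if p.kind == "guest"]
    return max(guests, key=lambda p: p.quanta) if guests else None

if __name__ == "__main__":
    procs = [Proc("editor", "host", 0), Proc("sor", "guest", 19)]
    for p in procs:
        recompute_quanta(p)
    print(pick_next(procs).name)        # editor: a runnable host always wins
    procs[0].runnable = False           # host blocks (machine goes idle)
    print(pick_next(procs).name)        # sor: the guest runs only now

Under the original scheduler a heavily nice'd guest still receives some time slices; under the modified policy it is starved whenever any host process is runnable.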


Prioritized Page Replacement

New page replacement algorithm

Adaptive Page-Out Speed
– When a host job steals a guest's page, page out multiple guest pages faster

[Figure: main memory pages divided by the High Limit and Low Limit thresholds into regions labeled "Priority to Host Job", "Based only on LRU", and "Priority to Guest Job".]

– No limit on taking free pages
– High Limit: maximum number of pages the guest can hold
– Low Limit: minimum number of pages the guest can hold
(A sketch of this replacement policy follows below.)
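Here is a rough sketch of how the victim-selection policy might look; it is one reading of the slide, not the authors' kernel code. HIGH_LIMIT, LOW_LIMIT, BURST, and the pages mapping are illustrative assumptions.

HIGH_LIMIT = 70 * 256     # 70 MB in 4 KB pages: the most the guest may hold (illustrative)
LOW_LIMIT = 50 * 256      # 50 MB in 4 KB pages: the least the guest keeps (illustrative)
BURST = 8                 # extra guest pages paged out when the host steals one (illustrative)

def choose_victims(pages, faulting_owner):
    """Pick resident pages to evict on a fault with no free frames.

    `pages` maps a page id to (owner, last_use_time), owner is "host" or
    "guest"; assumes at least one resident page.
    """
    guest_held = sum(1 for owner, _ in pages.values() if owner == "guest")

    def lru(owner=None):
        # Least-recently-used page, optionally restricted to one owner.
        candidates = [(t, p) for p, (o, t) in pages.items()
                      if owner is None or o == owner]
        return min(candidates)[1] if candidates else None

    if guest_held > HIGH_LIMIT:
        victim = lru("guest")        # guest above the High Limit: host has priority
    elif guest_held <= LOW_LIMIT:
        victim = lru("host")         # guest at its protected minimum
    else:
        victim = lru()               # between the limits: based only on LRU
    if victim is None:
        victim = lru()

    victims = [victim]
    if faulting_owner == "host" and pages[victim][0] == "guest":
        # Adaptive page-out speed: when a host job steals a guest page,
        # page out several more guest pages in the same pass.
        extras = sorted((t, p) for p, (o, t) in pages.items()
                        if o == "guest" and p != victim)
        victims += [p for _, p in extras[:BURST]]
    return victims

choose_victims() would be called on a page fault when no free frame exists; free frames are always handed out without restriction, matching the "no limit on taking free pages" rule.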


Micro Test: Prioritized Memory Page Replacement

– Total available memory: 179 MB
– Memory thresholds: High Limit (70 MB), Low Limit (50 MB)

[Figure: host job and guest job resident memory (MB) over time (sec), with the High Limit and Low Limit thresholds marked.]

– Guest job starts at 20 s, acquiring 128 MB
– Host job starts at 38 s, touching 150 MB
– Host job becomes I/O intensive at 90 s
– Host job finishes at 130 s


Application Evaluation - Setup

Experiment Environment
– Linux PC Cluster
• 8 Pentium II PCs, Linux 2.0.32
• Connected by a 1.2 Gbps Myrinet

Local Workload for host jobs
– Emulate an interactive local user
• MUSBUS interactive workload benchmark
• Typical programming environment

Guest jobs
– Run DSM parallel applications (CVM)
– SOR, Water, and FFT

Metrics
– Guest job performance, host workload slowdown


Application Evaluation - Host Slowdown

Run DSM Parallel Applications
– 3 host workloads: 7%, 13%, 24% (CPU usage)
– Host workload slowdown
– For equal priority:
• Significant slowdown
• Slowdown increases with load

– No Slowdown with Linger Priority

[Figure: Host Slowdown. MUSBUS slowdown (0-20%) vs. MUSBUS utilization (7%, 13%, 25%) for sor, water, and fft guest jobs, each run with linger (-l) and equal (-e) priority.]


Application Evaluation - Guest Performance

Run DSM Parallel Applications
– Guest job slowdown
– Slowdown proportional to MUSBUS usage
– Running the guest at the same priority as the host provides little benefit to the guest job

[Figure: Guest Slowdown. Application slowdown (0-40%) vs. MUSBUS utilization (7%, 13%, 25%) for sor, water, and fft, each run with linger (-l) and equal (-e) priority.]


Unique Grid Infrastructure

Applies to both Harvested and Dedicated nodes

Resource Monitoring
– Finding available resources
– Need both CPUs and bandwidth

Scheduling
– Policies for sharing resources among organizations

Security
– Protect nodes from guest jobs
– Protect jobs on foreign nodes


Security

Goals
– Don't require explicit accounts on each computer
– Provide controlled access
• Define policies on what jobs run where
• Authenticate access

Techniques
– Certificates
– Single account on the system for all grid jobs (a mapping sketch follows below)
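As an illustration of the second technique, here is a tiny sketch that maps an authenticated certificate subject to one shared local account and applies a simple site policy. It is not any particular toolkit's API; LOCAL_GRID_ACCOUNT, SITE_POLICY, map_subject, and the example subject names are all assumptions.

LOCAL_GRID_ACCOUNT = "griduser"    # one shared local account for all grid jobs

# Site policy: which authenticated certificate subjects may run jobs here.
# In a real deployment this would be administrator-managed, signed policy.
SITE_POLICY = {
    "/O=Grid/OU=ExampleLab/CN=Alice Researcher": True,
    "/O=Grid/OU=Elsewhere/CN=Unknown User": False,
}

def map_subject(subject_dn):
    """Return the local account a grid job should run under, or None to reject."""
    if SITE_POLICY.get(subject_dn, False):
        return LOCAL_GRID_ACCOUNT      # no per-user accounts on this machine
    return None                        # not authorized by this site's policy

if __name__ == "__main__":
    print(map_subject("/O=Grid/OU=ExampleLab/CN=Alice Researcher"))  # griduser
    print(map_subject("/O=Grid/OU=Elsewhere/CN=Unknown User"))       # None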


Resource Monitoring

Need to find available resources
– CPU cycles
• With appropriate OS/system software
• With sufficient memory & temporary disk
– Network bandwidth
• Between nodes running a parallel job
• To the remote file system

Issues
– Time-varying availability
– Passive vs. active monitoring (a passive-monitoring sketch follows below)
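A passive monitor can get most of what it needs from the operating system's own counters. The sketch below assumes a Linux node with the standard /proc/loadavg and /proc/meminfo files; the thresholds and the node_is_idle() helper are illustrative, loosely based on the idle-desktop numbers earlier in the talk.

def cpu_load_1min(path="/proc/loadavg"):
    # First field of /proc/loadavg is the 1-minute load average.
    with open(path) as f:
        return float(f.read().split()[0])

def free_memory_mb(path="/proc/meminfo"):
    # /proc/meminfo reports "MemFree: <kB> kB"; convert to MB.
    with open(path) as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1]) // 1024
    return 0

def node_is_idle(max_load=0.10, min_free_mb=30):
    # Illustrative thresholds: roughly "CPU mostly idle and ~30 MB free",
    # in the spirit of the desktop measurements shown earlier.
    return cpu_load_1min() <= max_load and free_memory_mb() >= min_free_mb

if __name__ == "__main__":
    print("load:", cpu_load_1min(),
          "free MB:", free_memory_mb(),
          "idle:", node_is_idle())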


Ganglia Toolkit

Courtesy of NPACI, SDSC, and UC Berkeley


NetLogger

Courtesy of Brian Tierney, LBL


Scheduling

Need to allocate resources on the Grid

Each site might:
– Accept jobs from remote sites
– Send jobs to other sites

Need to accommodate co-scheduling
– A single job that spans multiple sites

Need for reservations
– Time-certain allocation of resources


Scheduling Parallel Jobs

Scheduling Constraints
– Different jobs use different numbers of nodes
– Jobs provide an estimate of runtime
– Jobs run from a few minutes to a few weeks

Typical Approach
– One parallel job per node
• Called space-sharing
– Batch-style scheduling used

• Even a single user often has more processes than can run at once

• Need to have many nodes at once for a job


Typical Parallel Scheduler

Packs jobs into a schedule by
– Required number of nodes
– Estimated runtime

Backfills with smaller jobs when
– Holes develop due to early job termination
(A backfilling sketch follows below.)
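The sketch below is a much-simplified simulation of that approach (space-sharing with conservative backfilling): the head of the queue gets a reservation, and smaller jobs may jump ahead only if they finish before it. The Job fields come from the slide's constraints; the code itself is an illustration, not any production scheduler.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int
    est_runtime: float      # user-supplied runtime estimate (seconds)

def schedule(queue, total_nodes, running):
    """running: list of (finish_time, nodes). Returns the jobs started now (t=0)."""
    started = []
    free = total_nodes - sum(n for _, n in running)

    # Start jobs in FCFS order while they fit (space-sharing: whole nodes).
    while queue and queue[0].nodes <= free:
        job = queue.pop(0)
        free -= job.nodes
        running.append((job.est_runtime, job.nodes))
        started.append(job)

    if not queue:
        return started

    # Reserve a start time for the head job: earliest time enough nodes free up.
    head = queue[0]
    avail, reserve_at = free, 0.0
    for finish, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            reserve_at = finish
            break

    # Backfill: later jobs may start now if they fit and won't delay the head.
    for job in list(queue[1:]):
        if job.nodes <= free and job.est_runtime <= reserve_at:
            queue.remove(job)
            free -= job.nodes
            running.append((job.est_runtime, job.nodes))
            started.append(job)
    return started

if __name__ == "__main__":
    q = [Job("big", 64, 3600), Job("small", 8, 600), Job("tiny", 4, 120)]
    # 64-node machine with a 32-node job already running for another 1800 s:
    print([j.name for j in schedule(q, total_nodes=64, running=[(1800, 32)])])
    # -> ['small', 'tiny']: both fit in the hole and finish before "big" can start.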


Imprecise Calendars

Data structure to manage grid scheduling
– permits allocations of time to applications
– uses hierarchical representation
• each level maintains a calendar for its managed nodes
– allows multiple temporal resolutions

Key Features:
– allows reservations
– supports co-scheduling across semi-autonomous sites
• a site can refuse an individual remote job
• small jobs don't need inter-site coordination
(A data-structure sketch follows below.)
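Below is a rough sketch of what one level of such a calendar might look like; the names (CalendarNode, Slot, reserve, make_calendar) and the slot sizes are illustrative assumptions, not the actual implementation. It shows only the two ideas named above: coarser slots further in the future, and reservations that a site can simply refuse.

from dataclasses import dataclass, field

@dataclass
class Slot:
    start: float        # seconds from now
    length: float       # slot duration; grows for far-future slots
    nodes_free: int

@dataclass
class CalendarNode:
    name: str
    slots: list = field(default_factory=list)
    children: list = field(default_factory=list)   # lower-level calendars (sites, node groups)

    def reserve(self, start, length, nodes):
        """Try to reserve `nodes` for [start, start+length); True on success."""
        for slot in self.slots:
            covers = (slot.start <= start and
                      start + length <= slot.start + slot.length)
            if covers and slot.nodes_free >= nodes:
                slot.nodes_free -= nodes
                return True
        # A site is semi-autonomous: if no local slot fits, it simply refuses.
        return False

def make_calendar(name, nodes, horizon=4 * 3600, finest=900):
    """Near-term slots at the finest resolution, doubling in length further out."""
    cal, t, length = CalendarNode(name), 0.0, float(finest)
    while t < horizon:
        cal.slots.append(Slot(t, length, nodes))
        t += length
        length *= 2
    return cal

if __name__ == "__main__":
    site = make_calendar("cluster-A", nodes=64)
    print(site.reserve(start=0, length=900, nodes=32))   # True
    print(site.reserve(start=0, length=900, nodes=64))   # False (only 32 nodes left)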


Multiple Time/Space Resolutions

[Figure: refining a calendar entry in time and space. A slot split between T_A and Free at 1-hour resolution is refined into 30-minute and then 15-minute slots (refine time), and then into per-node allocations where T_A holds full nodes alongside Free 7.5-minute slots (refine space).]

Parameters
– number and sizes of slots
– packing density

Have multiple time-scales at once
– near events at finest temporal resolution


Evaluation

Approach
– use traces of job submission to real clusters
– simulate different scheduling policies
• imprecise calendars
• traditional back-filling schedulers

Metrics for comparison
– job completion time
• aggregate and by job size
– node utilization


Comparison with Partitioned Cluster: Delay of Jobs

[Figure: mean wait time by cluster (32, 32, 64, 128, 256, 512 nodes, and all) for combined scheduling vs. separate backfill.]

Based on job data from LANL
Treat each cluster as a trading partner


Balance of Trade

                     Utilization              Trading
  #   Queue Size   Separate   Comb.     Supply     Use    Balance
  1       32         18.0%    18.1%      170.5    88.4      -82.0
  2       32         21.5%    21.7%      162.3    85.4      -77.0
  3       64         24.7%    24.9%      281.0    54.2     -226.9
  4      128         36.4%    36.3%       64.3   456.1      391.8
  5      256         38.9%    38.9%      136.2    84.6      -51.6
  6      512         38.8%    38.8%       52.7    98.3       45.6

Jobs are allowed to split across partitions
Significant shift in work from the 128-node partition


Large Cluster of Clusters: All Jobs

[Figure: mean job delay for all jobs and for each of clusters 1-10, separate vs. combined scheduling.]

Each cluster has 336 nodes
– jobs < 1/3 of nodes and < 12 node-hours scheduled locally
– jobs were not split between nodes

Data is one month of jobs per node
Workload from CTC SP-2


Balance of Trade: Large Clusters

         Two Level                   Combined Queues
  #      Avg. Util    Supply       Use     Balance    Local    Util.
  1        34.5%      32,001    10,605      21,395   15,698    67.7%
  2        72.9%      28,168    31,493     (3,326)   24,162    74.2%
  3        79.3%      25,588    34,713     (9,125)   25,816    72.9%
  4        70.8%      27,098    32,713     (5,615)   21,319    68.7%
  5        55.2%      22,493    16,054       6,439   26,082    68.9%
  6        65.4%      26,152    28,778     (2,626)   21,162    67.1%
  7        63.6%      25,882    20,516       5,366   28,026    76.5%
  8        72.9%      27,489    32,055     (4,566)   23,562    72.4%
  9        61.2%      22,881    21,111       1,770   25,570    68.7%
 10        77.3%      26,949    36,662     (9,713)   22,306    69.9%
All        65.3%     264,701   264,701           0   25,813    70.7%


Social, Political, and Corporate Barriers

"It's my computer"
– Even if the employer purchased it

Tragedy of the Commons
– Who will buy resources

Chargeback concerns
– HW purchased for one project used by another

Data Security Concerns
– You want to run our critical jobs where?


Globus Toolkit

Collection of Tools
– Security
– Scheduling
– Grid-aware parallel programming

Designed for
– Confederation of dedicated clusters
– Support for parallel programs


Condor

Core of tightly coupled tools
– Monitoring of nodes
– Scheduling (including batch queues)
– Checkpointing of jobs

Designed for
– Harvested resources (dedicated nodes too)
– Parameter sweeps using many serial program runs


Layout of the Condor Pool

[Diagram: a Condor pool. The Central Manager runs the Master, Collector, negotiator, and schedd daemons; each Cluster Node runs a Master and startd; each Desktop runs a Master, startd, and schedd.]

Courtesy of Condor Group, University of Wisconsin


Conclusion

What the Grid is
– An approach to improve computation utilization
– Support for data migration for large-scale computation
– Several families of tools
– Tools to enable collaboration

What the Grid is not
– Free cycles from heaven


Grid Resources

Books
– The Grid 2: Blueprint for a New Computing Infrastructure
• Foster & Kesselman, eds.
– Grid Computing: Making the Global Infrastructure a Reality
• Berman, Fox & Hey, eds.

Software Distributions
– Condor: www.cs.wisc.edu/condor
– Globus: www.globus.org