
Copyright 2006, Jeffrey K. Hollingsworth

Grid Computing

Jeffrey K. Hollingsworth

[email protected]

Department of Computer Science, University of Maryland, College Park, MD 20742


The Need for GRIDS

Many Computation-Bound Jobs
– Simulations
• Financial
• Electronic Design
• Science
– Data Mining

Large-Scale Collaboration
– Sharing of large data sets
– Coupled, communicating simulation codes


Available Resources - Desktops

Networks of Workstations
– Workstations have high processing power

– Connected via high speed network (100Mbps+)

– Long idle time (50-60%) and low resource usage

Goal: Run CPU-intensive programs using idle periods (a minimal harvesting-loop sketch follows this list)

• while owner is away: send guest job and run

• when owner returns: stop and migrate guest job away

– Examples: Condor (University of Wisconsin)
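The harvesting goal above can be pictured as a small control loop. The following is only an illustrative sketch, not Condor's actual mechanism: is_owner_active(), start_guest_job(), and migrate_away() are placeholder hooks introduced here, and the thresholds are made up.

import subprocess
import time

IDLE_BEFORE_START = 300   # seconds of idleness before sending a guest job (illustrative)
POLL_INTERVAL = 5         # seconds between checks

def is_owner_active():
    # Placeholder hook: a real harvester checks keyboard/mouse idle time and CPU load.
    return False

def start_guest_job(cmd):
    # Launch the guest job as an ordinary child process.
    return subprocess.Popen(cmd)

def migrate_away(guest):
    # Placeholder: a real system (e.g., Condor) would checkpoint the job and
    # restart it elsewhere; here we simply stop it.
    guest.terminate()
    guest.wait()

def harvest(cmd):
    # e.g. harvest(["./guest_job"])
    idle_for, guest = 0, None
    while True:
        if is_owner_active():
            idle_for = 0
            if guest is not None:              # owner returned: get out of the way
                migrate_away(guest)
                guest = None
        else:
            idle_for += POLL_INTERVAL
            if guest is None and idle_for >= IDLE_BEFORE_START:
                guest = start_guest_job(cmd)   # idle long enough: run the guest job
        time.sleep(POLL_INTERVAL)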


Computational Grids

Environment
– Collection of semi-autonomous computers
– Geographically distributed
– Goal: Use these systems as a coordinated resource
– Heterogeneous: processors, networks, OS

Target Applications
– Large-scale programs: running for 100-1,000s of seconds
– Significant need to access long-term storage

Needs
– Coordinated access (scheduling)
– Specific time requests (reservations)
– Scalable system software (1,000s of nodes)


Two Models of Grid Nodes

Harvested Nodes (Desktop)
– Computers on desktops
– Have a primary user who has priority
– Participate in the grid when resources are free

Dedicated Nodes (Data Center)
– Dedicated to computation-bound jobs
– Various policies
• May participate in the grid 24/7
• May only participate when load is low


Available Processing Power

[Figure: Available Memory. Probability vs. memory size (MB), with curves for all, idle, and non-idle periods.]

– Memory is available: 30 MB is available 70% of the time

[Figure: CPU Usage. Cumulative distribution of CPU usage (%), with curves for all, idle, and non-idle periods.]

– CPU usage is low: 10% or less 75% of the time


OS Support for Harvested Grid Computing

Need To Manage Resources Differently
– Scheduler
• Normally designed to be fair
• Need strict priority
– Virtual Memory
• Need priority for local jobs

– File systems

Virtual Machines make things easier
– Provide isolation
– Manage resources


Starvation Level CPU Scheduling

Original Linux CPU Scheduler
– Run-time scheduling priority
• nice value & remaining time quanta
• T_i = 20 - nice_level + (1/2) * T_(i-1)
– Even nice'd guest processes can still be scheduled

Modified Linux CPU Scheduler (a simulation sketch follows below)
– If runnable host processes exist
• Schedule the host process with the highest priority
– Only when no host process is runnable
• Schedule a guest process
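The modified policy is easy to mimic in a small user-level simulation. This is only a sketch of the idea, not the actual kernel change: the Proc class and the function names are introduced here for illustration, and only the quanta formula and the host-before-guest rule come from the slide.

from dataclasses import dataclass

@dataclass
class Proc:
    name: str
    kind: str            # "host" or "guest"
    nice_level: int
    quanta: float = 0.0
    runnable: bool = True

def recompute_quanta(p):
    # Slide's recalculation rule: T_i = 20 - nice_level + (1/2) * T_(i-1)
    p.quanta = 20 - p.nice_level + 0.5 * p.quanta
    return p.quanta

def pick_next(procs):
    runnable = [p for p in procs if p.runnable]
    hosts = [p for p in runnable if p.kind == "host"]
    if hosts:
        # Strict priority: if any host process is runnable, one of them runs.
        return max(hosts, key=lambda p: p.quanta)
    guests = [p for p in runnable if p.kind == "guest"]
    return max(guests, key=lambda p: p.quanta) if guests else None

if __name__ == "__main__":
    procs = [Proc("editor", "host", 0), Proc("sor", "guest", 19)]
    for p in procs:
        recompute_quanta(p)
    print(pick_next(procs).name)        # editor: a runnable host always wins
    procs[0].runnable = False           # host blocks (machine goes idle)
    print(pick_next(procs).name)        # sor: the guest runs only now

Under the original scheduler a heavily nice'd guest still receives some time slices; under the modified policy it is starved whenever any host process is runnable.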


Prioritized Page Replacement

New page replacement algorithm

Adaptive Page-Out Speed
– When a host job steals a guest's page, page out multiple guest pages faster

[Figure: main memory pages divided by the High Limit and Low Limit thresholds into regions labeled "Priority to Host Job", "Based only on LRU", and "Priority to Guest Job".]

– No limit on taking free pages
– High Limit: maximum number of pages the guest can hold
– Low Limit: minimum number of pages the guest can hold
(A sketch of this replacement policy follows below.)
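Here is a rough sketch of how the victim-selection policy might look; it is one reading of the slide, not the authors' kernel code. HIGH_LIMIT, LOW_LIMIT, BURST, and the pages mapping are illustrative assumptions.

HIGH_LIMIT = 70 * 256     # 70 MB in 4 KB pages: the most the guest may hold (illustrative)
LOW_LIMIT = 50 * 256      # 50 MB in 4 KB pages: the least the guest keeps (illustrative)
BURST = 8                 # extra guest pages paged out when the host steals one (illustrative)

def choose_victims(pages, faulting_owner):
    """Pick resident pages to evict on a fault with no free frames.

    `pages` maps a page id to (owner, last_use_time), owner is "host" or
    "guest"; assumes at least one resident page.
    """
    guest_held = sum(1 for owner, _ in pages.values() if owner == "guest")

    def lru(owner=None):
        # Least-recently-used page, optionally restricted to one owner.
        candidates = [(t, p) for p, (o, t) in pages.items()
                      if owner is None or o == owner]
        return min(candidates)[1] if candidates else None

    if guest_held > HIGH_LIMIT:
        victim = lru("guest")        # guest above the High Limit: host has priority
    elif guest_held <= LOW_LIMIT:
        victim = lru("host")         # guest at its protected minimum
    else:
        victim = lru()               # between the limits: based only on LRU
    if victim is None:
        victim = lru()

    victims = [victim]
    if faulting_owner == "host" and pages[victim][0] == "guest":
        # Adaptive page-out speed: when a host job steals a guest page,
        # page out several more guest pages in the same pass.
        extras = sorted((t, p) for p, (o, t) in pages.items()
                        if o == "guest" and p != victim)
        victims += [p for _, p in extras[:BURST]]
    return victims

choose_victims() would be called on a page fault when no free frame exists; free frames are always handed out without restriction, matching the "no limit on taking free pages" rule.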


Micro Test: Prioritized Memory Page Replacement

– Total available memory: 179 MB
– Memory thresholds: High Limit (70 MB), Low Limit (50 MB)

[Figure: host job and guest job resident memory (MB) over time (sec), with the High Limit and Low Limit thresholds marked.]

– Guest job starts at 20 s, acquiring 128 MB
– Host job starts at 38 s, touching 150 MB
– Host job becomes I/O intensive at 90 s
– Host job finishes at 130 s


Application Evaluation - Setup

Experiment Environment
– Linux PC Cluster
• 8 Pentium II PCs, Linux 2.0.32
• Connected by a 1.2 Gbps Myrinet

Local Workload for host jobs
– Emulate an interactive local user
• MUSBUS interactive workload benchmark
• Typical programming environment

Guest jobs
– Run DSM parallel applications (CVM)
– SOR, Water, and FFT

Metrics
– Guest job performance, host workload slowdown


Application Evaluation - Host Slowdown

Run DSM Parallel Applications
– 3 host workloads: 7%, 13%, 24% (CPU usage)
– Host workload slowdown
– For equal priority:
• Significant slowdown
• Slowdown increases with load

– No Slowdown with Linger Priority

[Figure: Host Slowdown. MUSBUS slowdown (0-20%) vs. MUSBUS utilization (7%, 13%, 25%) for sor, water, and fft guest jobs, each run with linger (-l) and equal (-e) priority.]


Application Evaluation - Guest Performance

Run DSM Parallel Applications
– Guest job slowdown
– Slowdown proportional to MUSBUS usage
– Running the guest at the same priority as the host provides little benefit to the guest job

[Figure: Guest Slowdown. Application slowdown (0-40%) vs. MUSBUS utilization (7%, 13%, 25%) for sor, water, and fft, each run with linger (-l) and equal (-e) priority.]


Unique Grid Infrastructure

Applies to both Harvested and Dedicated nodes

Resource Monitoring
– Finding available resources
– Need both CPUs and bandwidth

Scheduling
– Policies for sharing resources among organizations

Security
– Protect nodes from guest jobs
– Protect jobs on foreign nodes


Security

Goals
– Don't require explicit accounts on each computer
– Provide controlled access
• Define policies on what jobs run where
• Authenticate access

Techniques
– Certificates
– Single account on the system for all grid jobs (a mapping sketch follows below)
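As an illustration of the second technique, here is a tiny sketch that maps an authenticated certificate subject to one shared local account and applies a simple site policy. It is not any particular toolkit's API; LOCAL_GRID_ACCOUNT, SITE_POLICY, map_subject, and the example subject names are all assumptions.

LOCAL_GRID_ACCOUNT = "griduser"    # one shared local account for all grid jobs

# Site policy: which authenticated certificate subjects may run jobs here.
# In a real deployment this would be administrator-managed, signed policy.
SITE_POLICY = {
    "/O=Grid/OU=ExampleLab/CN=Alice Researcher": True,
    "/O=Grid/OU=Elsewhere/CN=Unknown User": False,
}

def map_subject(subject_dn):
    """Return the local account a grid job should run under, or None to reject."""
    if SITE_POLICY.get(subject_dn, False):
        return LOCAL_GRID_ACCOUNT      # no per-user accounts on this machine
    return None                        # not authorized by this site's policy

if __name__ == "__main__":
    print(map_subject("/O=Grid/OU=ExampleLab/CN=Alice Researcher"))  # griduser
    print(map_subject("/O=Grid/OU=Elsewhere/CN=Unknown User"))       # None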


Resource Monitoring

Need to find available resources
– CPU cycles
• With appropriate OS/system software
• With sufficient memory & temporary disk
– Network bandwidth
• Between nodes running a parallel job
• To the remote file system

Issues
– Time-varying availability
– Passive vs. active monitoring (a passive-monitoring sketch follows below)
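A passive monitor can get most of what it needs from the operating system's own counters. The sketch below assumes a Linux node with the standard /proc/loadavg and /proc/meminfo files; the thresholds and the node_is_idle() helper are illustrative, loosely based on the idle-desktop numbers earlier in the talk.

def cpu_load_1min(path="/proc/loadavg"):
    # First field of /proc/loadavg is the 1-minute load average.
    with open(path) as f:
        return float(f.read().split()[0])

def free_memory_mb(path="/proc/meminfo"):
    # /proc/meminfo reports "MemFree: <kB> kB"; convert to MB.
    with open(path) as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1]) // 1024
    return 0

def node_is_idle(max_load=0.10, min_free_mb=30):
    # Illustrative thresholds: roughly "CPU mostly idle and ~30 MB free",
    # in the spirit of the desktop measurements shown earlier.
    return cpu_load_1min() <= max_load and free_memory_mb() >= min_free_mb

if __name__ == "__main__":
    print("load:", cpu_load_1min(),
          "free MB:", free_memory_mb(),
          "idle:", node_is_idle())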


Ganglia Toolkit

Courtesy of NPACI, SDSC, and UC Berkeley


NetLogger

Courtesy of Brian Tierney, LBL


Scheduling

Need to allocate resources on the Grid

Each site might:
– Accept jobs from remote sites
– Send jobs to other sites

Need to accommodate co-scheduling
– A single job that spans multiple sites

Need for reservations
– Time-certain allocation of resources


Scheduling Parallel Jobs

Scheduling Constraints
– Different jobs use different numbers of nodes
– Jobs provide an estimate of runtime
– Jobs run from a few minutes to a few weeks

Typical Approach
– One parallel job per node
• Called space-sharing
– Batch-style scheduling used

• Even a single user often has more processes than can run at once

• Need to have many nodes at once for a job


Typical Parallel Scheduler

Packs jobs into a schedule by
– Required number of nodes
– Estimated runtime

Backfills with smaller jobs when
– Holes develop due to early job termination
(A backfilling sketch follows below.)
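The sketch below is a much-simplified simulation of that approach (space-sharing with conservative backfilling): the head of the queue gets a reservation, and smaller jobs may jump ahead only if they finish before it. The Job fields come from the slide's constraints; the code itself is an illustration, not any production scheduler.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int
    est_runtime: float      # user-supplied runtime estimate (seconds)

def schedule(queue, total_nodes, running):
    """running: list of (finish_time, nodes). Returns the jobs started now (t=0)."""
    started = []
    free = total_nodes - sum(n for _, n in running)

    # Start jobs in FCFS order while they fit (space-sharing: whole nodes).
    while queue and queue[0].nodes <= free:
        job = queue.pop(0)
        free -= job.nodes
        running.append((job.est_runtime, job.nodes))
        started.append(job)

    if not queue:
        return started

    # Reserve a start time for the head job: earliest time enough nodes free up.
    head = queue[0]
    avail, reserve_at = free, 0.0
    for finish, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            reserve_at = finish
            break

    # Backfill: later jobs may start now if they fit and won't delay the head.
    for job in list(queue[1:]):
        if job.nodes <= free and job.est_runtime <= reserve_at:
            queue.remove(job)
            free -= job.nodes
            running.append((job.est_runtime, job.nodes))
            started.append(job)
    return started

if __name__ == "__main__":
    q = [Job("big", 64, 3600), Job("small", 8, 600), Job("tiny", 4, 120)]
    # 64-node machine with a 32-node job already running for another 1800 s:
    print([j.name for j in schedule(q, total_nodes=64, running=[(1800, 32)])])
    # -> ['small', 'tiny']: both fit in the hole and finish before "big" can start.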


Imprecise Calendars

Data structure to manage grid scheduling
– permits allocations of time to applications
– uses hierarchical representation
• each level maintains a calendar for its managed nodes
– allows multiple temporal resolutions

Key Features:
– allows reservations
– supports co-scheduling across semi-autonomous sites
• a site can refuse an individual remote job
• small jobs don't need inter-site coordination
(A data-structure sketch follows below.)
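Below is a rough sketch of what one level of such a calendar might look like; the names (CalendarNode, Slot, reserve, make_calendar) and the slot sizes are illustrative assumptions, not the actual implementation. It shows only the two ideas named above: coarser slots further in the future, and reservations that a site can simply refuse.

from dataclasses import dataclass, field

@dataclass
class Slot:
    start: float        # seconds from now
    length: float       # slot duration; grows for far-future slots
    nodes_free: int

@dataclass
class CalendarNode:
    name: str
    slots: list = field(default_factory=list)
    children: list = field(default_factory=list)   # lower-level calendars (sites, node groups)

    def reserve(self, start, length, nodes):
        """Try to reserve `nodes` for [start, start+length); True on success."""
        for slot in self.slots:
            covers = (slot.start <= start and
                      start + length <= slot.start + slot.length)
            if covers and slot.nodes_free >= nodes:
                slot.nodes_free -= nodes
                return True
        # A site is semi-autonomous: if no local slot fits, it simply refuses.
        return False

def make_calendar(name, nodes, horizon=4 * 3600, finest=900):
    """Near-term slots at the finest resolution, doubling in length further out."""
    cal, t, length = CalendarNode(name), 0.0, float(finest)
    while t < horizon:
        cal.slots.append(Slot(t, length, nodes))
        t += length
        length *= 2
    return cal

if __name__ == "__main__":
    site = make_calendar("cluster-A", nodes=64)
    print(site.reserve(start=0, length=900, nodes=32))   # True
    print(site.reserve(start=0, length=900, nodes=64))   # False (only 32 nodes left)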


Multiple Time/Space Resolutions

[Figure: refining a calendar entry in time and space. A slot split between T_A and Free at 1-hour resolution is refined into 30-minute and then 15-minute slots (refine time), and then into per-node allocations where T_A holds full nodes alongside Free 7.5-minute slots (refine space).]

Parameters
– number and sizes of slots
– packing density

Have multiple time-scales at once
– near events at finest temporal resolution


Evaluation

Approach
– use traces of job submission to real clusters
– simulate different scheduling policies
• imprecise calendars
• traditional back-filling schedulers

Metrics for comparison
– job completion time
• aggregate and by job size
– node utilization


Comparison with Partitioned Cluster: Delay of Jobs

[Figure: mean wait time by cluster (32, 32, 64, 128, 256, 512 nodes, and all) for combined scheduling vs. separate backfill.]

Based on job data from LANL
Treat each cluster as a trading partner


Balance of Trade

                     Utilization              Trading
  #   Queue Size   Separate   Comb.     Supply     Use    Balance
  1       32         18.0%    18.1%      170.5    88.4      -82.0
  2       32         21.5%    21.7%      162.3    85.4      -77.0
  3       64         24.7%    24.9%      281.0    54.2     -226.9
  4      128         36.4%    36.3%       64.3   456.1      391.8
  5      256         38.9%    38.9%      136.2    84.6      -51.6
  6      512         38.8%    38.8%       52.7    98.3       45.6

Jobs are allowed to split across partitions
Significant shift in work from the 128-node partition


Large Cluster of Clusters: All Jobs

[Figure: mean job delay for all jobs and for each of clusters 1-10, separate vs. combined scheduling.]

Each cluster has 336 nodes
– jobs < 1/3 of nodes and < 12 node-hours scheduled locally
– jobs were not split between nodes

Data is one month of jobs per node
Workload from CTC SP-2


Balance of Trade: Large Clusters

         Two Level                   Combined Queues
  #      Avg. Util    Supply       Use     Balance    Local    Util.
  1        34.5%      32,001    10,605      21,395   15,698    67.7%
  2        72.9%      28,168    31,493     (3,326)   24,162    74.2%
  3        79.3%      25,588    34,713     (9,125)   25,816    72.9%
  4        70.8%      27,098    32,713     (5,615)   21,319    68.7%
  5        55.2%      22,493    16,054       6,439   26,082    68.9%
  6        65.4%      26,152    28,778     (2,626)   21,162    67.1%
  7        63.6%      25,882    20,516       5,366   28,026    76.5%
  8        72.9%      27,489    32,055     (4,566)   23,562    72.4%
  9        61.2%      22,881    21,111       1,770   25,570    68.7%
 10        77.3%      26,949    36,662     (9,713)   22,306    69.9%
All        65.3%     264,701   264,701           0   25,813    70.7%


Social, Political, and Corporate Barriers

"It's my computer"
– Even if the employer purchased it

Tragedy of the Commons
– Who will buy resources

Chargeback concerns
– HW purchased for one project used by another

Data Security Concerns
– You want to run our critical jobs where?


Globus Toolkit

Collection of Tools
– Security
– Scheduling
– Grid-aware parallel programming

Designed for
– Confederation of dedicated clusters
– Support for parallel programs


Condor

Core of tightly coupled tools
– Monitoring of nodes
– Scheduling (including batch queues)
– Checkpointing of jobs

Designed for
– Harvested resources (dedicated nodes too)
– Parameter sweeps using many serial program runs


Layout of the Condor Pool

[Diagram: a Condor pool. The Central Manager runs the Master, Collector, negotiator, and schedd daemons; each Cluster Node runs a Master and startd; each Desktop runs a Master, startd, and schedd.]

Courtesy of Condor Group, University of Wisconsin


Conclusion

What the Grid is
– An approach to improve computation utilization
– Support for data migration for large-scale computation
– Several families of tools
– Tools to enable collaboration

What the Grid is not
– Free cycles from heaven


Grid Resources

Books
– The Grid 2: Blueprint for a New Computing Infrastructure
• Foster & Kesselman, eds.
– Grid Computing: Making the Global Infrastructure a Reality
• Berman, Fox & Hey, eds.

Software Distributions
– Condor: www.cs.wisc.edu/condor
– Globus: www.globus.org