Copyright 2006, Jeffrey K. Hollingsworth
Grid Computing
Jeffrey K. Hollingsworth
Department of Computer Science, University of Maryland, College Park, MD 20742
University of Maryland
The Need for GRIDS
Many Computation-Bound Jobs
– Simulations
  • Financial
  • Electronic Design
  • Science
– Data Mining
Large-Scale Collaboration
– Sharing of large data sets
– Coupled, communicating simulation codes
Available Resources - Desktops
Networks of Workstations
– Workstations have high processing power
– Connected via high-speed networks (100Mbps+)
– Long idle time (50-60%) and low resource usage
Goal: Run CPU-intensive programs during idle periods
• While the owner is away: send a guest job and run it
• When the owner returns: stop and migrate the guest job away
– Example: Condor (University of Wisconsin)
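The harvest-and-migrate policy above can be sketched as a tiny state machine. The class name, return values, and the 5-minute idle threshold are illustrative assumptions, not Condor's actual implementation:

```python
IDLE_THRESHOLD_SECS = 300   # assumed: owner counts as "away" after 5 idle minutes


class HarvestingAgent:
    """Toy sketch of a cycle-harvesting agent (illustrative names only)."""

    def __init__(self):
        self.guest_running = False

    def step(self, idle_seconds):
        """One scheduling decision: start the guest job when the owner is
        away; stop and migrate it away when the owner returns."""
        if idle_seconds >= IDLE_THRESHOLD_SECS:
            if not self.guest_running:
                self.guest_running = True      # send guest job and run
                return "start_guest"
            return "guest_running"
        if self.guest_running:
            self.guest_running = False         # stop and migrate guest away
            return "migrate_guest"
        return "idle"
```

A real harvester must also checkpoint the guest before eviction so the migrated job can resume elsewhere rather than restart.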
Computational Grids Environment
– Collection of semi-autonomous computers
– Geographically distributed
– Goal: use these systems as a coordinated resource
– Heterogeneous: processors, networks, OS
Target Applications
– Large-scale programs: running for 100s to 1,000s of seconds
– Significant need to access long-term storage
Needs
– Coordinated access (scheduling)
– Specific time requests (reservations)
– Scalable system software (1,000s of nodes)
Two Models of Grid Nodes
Harvested Nodes (Desktop)
– Computers on desktops
– Have a primary user who has priority
– Participate in the Grid when resources are free
Dedicated Nodes (Data Center)
– Dedicated to computation-bound jobs
– Various policies
  • May participate in the grid 24/7
  • May participate only when load is low
Available Processing Power
[Figure: Available Memory — probability vs. memory size (MB), curves for all, idle, and non-idle machines]
– Memory is available: 30MB is free 70% of the time
[Figure: CPU Usage — cumulative distribution vs. CPU usage (%), curves for all, idle, and non-idle machines]
– CPU usage is low: 10% or less 75% of the time
OS Support for Harvested Grid Computing
Need to Manage Resources Differently
– Scheduler
  • Normally designed to be fair
  • Need strict priority
– Virtual memory
  • Need priority for local jobs
– File systems
Virtual Machines make things easier
– Provide isolation
– Manage resources
Starvation Level CPU Scheduling
Original Linux CPU Scheduler
– Run-time scheduling priority
  • nice value & remaining time quanta
  • T_i = 20 - nice_level + 1/2 * T_{i-1}
– Even niced (low-priority) processes still get scheduled, so guest jobs slow the host
Modified Linux CPU Scheduler
– If runnable host processes exist
  • Schedule the host process with the highest priority
– Only when no host process is runnable
  • Schedule a guest process
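The host-first rule can be sketched as below; the data layout and function names are illustrative, not the actual Linux 2.0 scheduler code:

```python
def quantum(nice_level, prev_quantum):
    """Original Linux run-time priority recurrence from the slide:
    T_i = 20 - nice_level + 1/2 * T_{i-1}."""
    return 20 - nice_level + prev_quantum / 2


def pick_next(processes):
    """Starvation-level scheduling sketch: a guest runs only when no host
    process is runnable. Each process is a dict with 'kind' ('host' or
    'guest'), 'runnable', and 'priority' (higher wins)."""
    runnable = [p for p in processes if p["runnable"]]
    hosts = [p for p in runnable if p["kind"] == "host"]
    pool = hosts if hosts else runnable  # strict priority: hosts always first
    return max(pool, key=lambda p: p["priority"]) if pool else None
```

Note the contrast with the original scheduler: under the recurrence above, a niced guest still accumulates quanta and runs; under `pick_next`, a guest with any priority loses to every runnable host process.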
Prioritized Page Replacement
New page replacement algorithm
– Adaptive page-out speed: when a host job steals a guest's page, page out multiple guest pages faster
[Diagram: main memory pages with two thresholds — above the High Limit the host job has priority, below the Low Limit the guest job has priority, and in between replacement is based only on LRU]
– No limit on taking free pages
– High Limit: maximum pages the guest can hold
– Low Limit: minimum pages the guest can hold
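The threshold logic can be sketched as follows; this is an illustrative model, not the actual kernel patch, and the burst size is an assumed parameter:

```python
def choose_victim(guest_pages, high_limit, low_limit, guest_is_lru):
    """Prioritized replacement sketch: above High Limit the host has
    priority (evict guest pages); below Low Limit the guest keeps its
    minimum working set (evict host pages); in between, plain LRU.
    `guest_is_lru` says whether the least-recently-used page is a
    guest page."""
    if guest_pages > high_limit:
        return "guest"
    if guest_pages < low_limit:
        return "host"
    return "guest" if guest_is_lru else "host"


def pages_to_reclaim(victim, base=1, burst=4):
    """Adaptive page-out speed: when the host steals from the guest,
    free several guest pages at once so the host ramps up quickly
    (the burst size of 4 is an assumption for illustration)."""
    return burst if victim == "guest" else base
```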
Micro Test: Prioritized Memory Page Replacement
– Total available memory: 179MB
– Memory thresholds: High Limit (70MB), Low Limit (50MB)
[Figure: host and guest job memory (MB) vs. time (sec), with the High and Low Limit thresholds marked]
– Guest job starts at 20 sec, acquiring 128MB
– Host job starts at 38 sec, touching 150MB
– Host job becomes I/O intensive at 90 sec
– Host job finishes at 130 sec
Application Evaluation - Setup
Experiment Environment
– Linux PC cluster
  • 8 Pentium II PCs, Linux 2.0.32
  • Connected by a 1.2Gbps Myrinet
Local workload for host jobs
– Emulate an interactive local user
  • MUSBUS interactive workload benchmark
  • Typical programming environment
Guest jobs
– Run DSM parallel applications (CVM)
– SOR, Water, and FFT
Metrics
– Guest job performance, host workload slowdown
Application Evaluation - Host Slowdown
Run DSM Parallel Applications
– 3 host workloads: 7%, 13%, 25% (CPU usage)
– Host workload slowdown
  • For equal priority: significant slowdown, increasing with load
  • No slowdown with linger priority
[Figure: Host Slowdown — musbus slowdown (0-20%) vs. musbus utilization (7%, 13%, 25%) for SOR, Water, and FFT at linger (-l) and equal (-e) priority]
Application Evaluation - Guest Performance
Run DSM Parallel Applications
– Guest job slowdown
– Slowdown proportional to musbus usage
– Running the guest at the same priority as the host provides little benefit to the guest job
[Figure: Guest Slowdown — application slowdown (0-40%) vs. musbus utilization (7%, 13%, 25%) for SOR, Water, and FFT at linger (-l) and equal (-e) priority]
Unique Grid Infrastructure
Applies to both harvested and dedicated nodes
Resource Monitoring
– Finding available resources
– Need both CPUs and bandwidth
Scheduling
– Policies for sharing resources among organizations
Security
– Protect nodes from guest jobs
– Protect jobs on foreign nodes
Security
Goals
– Don't require explicit accounts on each computer
– Provide controlled access
  • Define policies on what jobs run where
  • Authenticate access
Techniques
– Certificates
– A single account on each system for all grid jobs
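The single-account technique can be sketched as a gridmap-style lookup, modeled loosely on the grid-mapfile idea used by Globus. The entries, function name, and mapping format below are made up for illustration:

```python
# Assumed format: certificate subject name -> shared local account.
# These entries are fictitious examples.
GRIDMAP = {
    "/O=Grid/CN=Alice": "griduser",
    "/O=Grid/CN=Bob": "griduser",
}


def authorize(cert_subject):
    """Map an authenticated certificate identity onto the single shared
    local account; unknown identities get no access (None)."""
    return GRIDMAP.get(cert_subject)
```

This is what avoids per-user accounts on every machine: authentication happens against the certificate, and authorization is just membership in the map.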
Resource Monitoring
Need to find available resources
– CPU cycles
  • With appropriate OS/system software
  • With sufficient memory & temporary disk
– Network bandwidth
  • Between nodes running a parallel job
  • To the remote file system
Issues
– Time-varying availability
– Passive vs. active monitoring
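A passive monitor might classify a node as available by combining the two measurements shown earlier in the talk (the 10% CPU figure and the 30MB memory figure). The function below is an illustrative sketch, not part of any real monitoring system:

```python
def node_available(cpu_samples, free_mem_mb, min_mem_mb=30, cpu_cutoff=0.10):
    """Declare a node available for guest work when every recent CPU-usage
    sample (fractions in [0, 1]) is at or below the cutoff and enough
    memory is free. Thresholds echo the earlier measurements but are
    assumptions here, not universal constants."""
    if free_mem_mb < min_mem_mb:
        return False
    return all(sample <= cpu_cutoff for sample in cpu_samples)
```

Passive monitoring like this reads existing usage counters; active monitoring would instead inject probe jobs or traffic, trading extra load for fresher data.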
Scheduling
Need to allocate resources on the Grid
Each site might:
– Accept jobs from remote sites
– Send jobs to other sites
Need to accommodate co-scheduling
– A single job that spans multiple sites
Need for reservations
– Time-certain allocation of resources
Scheduling Parallel Jobs
Scheduling Constraints
– Different jobs use different numbers of nodes
– Jobs provide an estimate of their runtime
– Jobs run from a few minutes to a few weeks
Typical Approach
– One parallel job per node
  • Called space-sharing
– Batch-style scheduling used
  • Even a single user often has more processes than can run at once
  • A job needs many nodes at once
Typical Parallel Scheduler
Packs jobs into a schedule by
– Required number of nodes
– Estimated runtime
Backfills with smaller jobs when
– Holes develop due to early job termination
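The backfilling step can be sketched as a first-fit scan over the queue: a waiting job jumps into the current hole only if it fits on the free nodes and finishes before the hole closes. This is a simplified, illustrative version; production schedulers (e.g. EASY backfilling) additionally guarantee that the queue's head job is never delayed:

```python
def backfill(queue, free_nodes, time_to_next_reservation):
    """Fill a scheduling hole with small jobs. `queue` holds tuples of
    (name, nodes_needed, estimated_runtime); a job is backfilled only
    if it fits in the remaining free nodes and its estimated runtime
    ends before the next reserved start time."""
    scheduled = []
    for name, nodes, runtime in queue:
        if nodes <= free_nodes and runtime <= time_to_next_reservation:
            scheduled.append(name)
            free_nodes -= nodes   # the hole shrinks as jobs are placed
    return scheduled
```

Early terminations enlarge the hole, which is exactly when this scan is re-run in the slide's description.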
Imprecise Calendars
Data structure to manage scheduling grids
– Permits allocations of time to applications
– Uses a hierarchical representation
  • Each level maintains a calendar for its managed nodes
– Allows multiple temporal resolutions
Key Features:
– Allows reservations
– Supports co-scheduling across semi-autonomous sites
  • A site can refuse an individual remote job
  • Small jobs don't need inter-site coordination
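The hierarchical, multi-resolution idea can be sketched as a slot tree: a coarse slot is either reserved whole or refined into finer-grained children. This is an illustrative data structure under assumed names, not the paper's implementation:

```python
class CalendarSlot:
    """One interval in an imprecise calendar. A slot either holds a
    reservation (owner set), is free, or has been refined into two
    half-length child slots at the next finer temporal resolution."""

    def __init__(self, start, length, owner=None):
        self.start, self.length, self.owner = start, length, owner
        self.children = []

    def refine(self):
        """Split this slot into two half-length slots (finer resolution)."""
        half = self.length / 2
        self.children = [CalendarSlot(self.start, half),
                         CalendarSlot(self.start + half, half)]
        return self.children

    def reserve(self, owner):
        """Reserve the whole slot; fails if it is taken or refined."""
        if self.owner is None and not self.children:
            self.owner = owner
            return True
        return False
```

Because each site keeps its own slot tree, it can refuse a remote reservation simply by returning False, matching the semi-autonomy feature above.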
Multiple Time/Space Resolutions
[Diagram: a calendar refined in both time and space — a 1-hour slot split between T_A and Free is refined into 30-minute slots, then 15-minute slots, and finally 7.5-minute slots that are either fully allocated to T_A or free]
Parameters
– Number and sizes of slots
– Packing density
Have multiple time-scales at once
– Near events kept at the finest temporal resolution
Evaluation
Approach
– Use traces of job submissions to real clusters
– Simulate different scheduling policies
  • Imprecise calendars
  • Traditional back-filling schedulers
Metrics for comparison
– Job completion time
  • Aggregate and by job size
– Node utilization
Comparison with Partitioned Cluster - Delay of Jobs
[Figure: mean wait time by cluster size (32, 32, 64, 128, 256, 512, all) for the combined schedule vs. separate backfill]
Based on job data from LANL
Treat each cluster as a trading partner
Balance of Trade

                Utilization              Trading
Queue  Size   Separate  Comb.    Supply    Use     Balance
  1     32     18.0%    18.1%     170.5     88.4     -82.0
  2     32     21.5%    21.7%     162.3     85.4     -77.0
  3     64     24.7%    24.9%     281.0     54.2    -226.9
  4    128     36.4%    36.3%      64.3    456.1     391.8
  5    256     38.9%    38.9%     136.2     84.6     -51.6
  6    512     38.8%    38.8%      52.7     98.3      45.6

Jobs are allowed to split across partitions
Significant shift in work from the 128-node partition
Large Cluster of Clusters - All Jobs
[Figure: mean job delay (0-30) by cluster number (All, 1-10), separate vs. combined scheduling]
Each cluster has 336 nodes
– Jobs < 1/3 of nodes and < 12 node-hours scheduled locally
– Jobs were not split between nodes
Data is one month of jobs per node
Workload from CTC SP-2
Balance of Trade: Large Clusters
              ———————— Two Level ————————    Combined Queues
  #   Avg. Util   Supply      Use    Balance    Local    Util
  1     34.5%     32,001   10,605    21,395    15,698   67.7%
  2     72.9%     28,168   31,493    (3,326)   24,162   74.2%
  3     79.3%     25,588   34,713    (9,125)   25,816   72.9%
  4     70.8%     27,098   32,713    (5,615)   21,319   68.7%
  5     55.2%     22,493   16,054     6,439    26,082   68.9%
  6     65.4%     26,152   28,778    (2,626)   21,162   67.1%
  7     63.6%     25,882   20,516     5,366    28,026   76.5%
  8     72.9%     27,489   32,055    (4,566)   23,562   72.4%
  9     61.2%     22,881   21,111     1,770    25,570   68.7%
 10     77.3%     26,949   36,662    (9,713)   22,306   69.9%
All     65.3%    264,701  264,701         0    25,813   70.7%
Social, Political, and Corporate Barriers
“It’s my computer”
– Even if the employer purchased it
Tragedy of the Commons
– Who will buy resources?
Chargeback concerns
– HW purchased for one project used by another
Data security concerns
– You want to run our critical jobs where?
Globus Toolkit
Collection of Tools
– Security
– Scheduling
– Grid-aware parallel programming
Designed for
– Confederation of dedicated clusters
– Support for parallel programs
Condor
Core of tightly coupled tools
– Monitoring of nodes
– Scheduling (including batch queues)
– Checkpointing of jobs
Designed for
– Harvested resources (dedicated nodes too)
– Parameter sweeps using many serial program runs
Layout of the Condor Pool
[Diagram: a Condor pool — the Central Manager runs master, collector, and negotiator daemons; each cluster node runs a master and a startd; each desktop runs a master, a startd, and a schedd]
Courtesy of Condor Group, University of Wisconsin
Conclusion
What the Grid is
– An approach to improve computation utilization
– Support for data migration for large-scale computation
– Several families of tools
– Tools to enable collaboration
What the Grid is not
– Free cycles from heaven