Towards Autonomic Grids
Cecile Germain-Renaud
Laboratoire de Recherche en Informatique
Universite Paris-Sud - CNRS - INRIA
e-science infrastructures
2003 NSF Atkins Report: Revolutionizing Science and Engineering through Cyberinfrastructure
Grids of computational centers
Comprehensive libraries of digital objects
Well-curated collections of scientific data
Online instruments and vast sensor arrays
Convenient software toolkits
The largest (circ. 26 km), fastest (14 TeV), coldest (1.9 K), emptiest ($10^{-13}$ atm) machine.
Storage and analysis of 15 PB/year
The largest (40000 CPUs), most complex (200 VOs), most distributed (250 sites), most used (300K jobs/day) computing machine
How we configure our grids
Courtesy James Casey talk @EGEE09
Outline
1 The grid ecosystem
2 Grids and Autonomic Computing
3 The Grid Observatory
4 Learning grid models: On-line fault detection, Model Selection
5 Model-free policies: Policy evaluation, Reinforcement learning for responsive grids
The grid ecosystem Grids and Autonomic Computing The Grid Observatory Learning grid models Model-free policies
e-science infrastructures
The classical definition of grids
A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high computational capabilities.
I. Foster, C. Kesselman, The Grid, 1998
An old dream
UCLA press release on the creation of Arpanet, 1969
The niches in the ecosystem
Grids are not about technology, but about sharing
Ian Foster's definition, 2000
Grids are defined by coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.
The sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form a virtual organization.
Consumers: large-scale international collaborations
Different users with differentiated requirements across and within the collaborations
Providers: national and regional institutions
Organized in National Grid Initiatives, coordinated by EGI
Operators: local sites, with temporary EU support (EGI-Inspire)
Configuration, prioritization, monitoring, accounting, ...
Do Datacenters and Cloud make Grid obsolete?
*-aaS
Courtesy William Vambenepe - slides from the Cloud Connect keynote Freeing SaaS from Cloud
Grids and Clouds
IaaS: on-demand, elastic, virtualization-based provisioning
A single-objective optimization target: pay less by turning on and off at the minute rather than days or weeks scale
Convergence path: Grids over Clouds or Clouds of Grids?
EU project Stratuslab
SaaS: the core of the IT process lies in deploying and orchestrating heterogeneous software components, and having them "in the cloud" does not help much
Autonomic Computing
Computing systems that manage themselves in accordance with high-level objectives from humans
Kephart and Chess, The Vision of Autonomic Computing, IEEE Computer, 2003
Autonomic vision & manifesto: http://www.research.ibm.com/autonomic/manifesto/
Relation with Machine Learning: I. Rish tutorial @ECML 2006
Self-managing system with the ability of
Self-healing: detect, diagnose and repair failures
Self-configuring: automatically incorporate and configure components
Self-optimizing: ensure the optimal functioning wrt high-level requirements
Self-protecting: anticipate and defend against security breaches
On dynamical non-steady state systems
Autonomic Grids
Emerging behaviour as the result of sites' and stakeholders' decisions
Coupled usage: Virtual Organizations, community software and activity
Feedback loops in the middleware
Incomplete and noisy information
We need
Inference of models for middleware components and applications, users and usage profiles, user interactions, inconsistencies
Self-configuration and self-optimization for management policies
Self-healing across middleware and applications
Goals
Grid digital assets curation
Collecting verifiable digital assets
Providing digital asset search and retrieval
Certification of the trustworthiness and integrity of the collection content
Semantic and ontological continuity and comparability of the collection
Building the domain knowledge
Dimensionality and volume reduction: getting rid of the massive redundancy in operational logs
Answering operational issues
Descriptive/generative/predictive models
Design and validation of model-free policies
Support and collaborations
Methods
Focused on EGEE/EGI
The best approximation of the current needs of e-science
Extensive monitoring facilities
Traces were discarded after operational usage, and in any case not available to the scientific community
Now available without grid certificate
www.grid-observatory.org
Grids are complex systems
Users/Files/Clients worker nodes graph display with AVIZ GraphDice
Users in green, file groups in purple. Rightmost is most "active".
And also [Lovro Iliasic PhD Computational Grids as Complex Networks]
Issues
Large non-stationary system
Courtesy M. Lassnig et al. Austrian Grid Symp. 09
Trends
Academic events
Scientific events
Software events
On-line fault detection
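Abrupt changepoint detection
Page-Hinkley statistics - jumps in the mean
$p_t$: observations drawn from a changing distribution
$\bar{p}_t = \frac{1}{t} \sum_{\ell=1}^{t} p_\ell$
$m_t = \sum_{\ell=1}^{t} (p_\ell - \bar{p}_\ell + \delta)$
$M_t = \max\{m_\ell,\ \ell = 1 \ldots t\}$
$PH_t = M_t - m_t$
CUSUM test: if $PH_t > \lambda$, a change is detected
First Application
Blackhole detection
Validation requires expert interpretation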
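The Page-Hinkley test above can be sketched in a few lines. This is a minimal illustration, not the Grid Observatory implementation: the function name `page_hinkley` and the default values of `delta` and `lam` are assumptions chosen for the example. As written on the slide (max of $m_\ell$ minus current $m_t$), the statistic reacts to a drop in the mean.

```python
def page_hinkley(stream, delta=0.005, lam=5.0):
    """Return the zero-based index of the first detected change, or None.

    Tracks m_t = sum_{l<=t} (p_l - mean_l + delta) and M_t = max_l m_l,
    and reports a change as soon as PH_t = M_t - m_t exceeds lam.
    delta and lam are illustrative defaults, not values from the talk.
    """
    total = 0.0              # running sum of the observations
    m = 0.0                  # cumulative deviation m_t
    m_max = float("-inf")    # running maximum M_t
    for t, p in enumerate(stream, start=1):
        total += p
        mean = total / t     # empirical mean up to time t
        m += p - mean + delta
        m_max = max(m_max, m)
        if m_max - m > lam:  # PH_t > lambda: change detected
            return t - 1
    return None


if __name__ == "__main__":
    # Synthetic signal: the mean drops from 1.0 to 0.0 at index 50.
    signal = [1.0] * 50 + [0.0] * 50
    print("change detected at index", page_hinkley(signal))
```

A larger `lam` lowers the false-alarm rate at the price of a longer detection delay, which is why the threshold is worth adapting.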
StrAP: On-line clustering aka Streaming
Affinity Propagation (AP) [Frey2007]
statistical physics algorithm for clustering (based on message passing)
a cluster = an exemplar (akin to k-centers)
the model = set of {exemplar, frequency}
Why AP?
Traceability: real jobs as exemplars, because of categorical variables, e.g., user id, queue name, etc.
No prior knowledge of K, the number of clusters
Quasi-optimality wrt. information loss -> stability [Meila2006]
From AP to Large-scale Data Streaming
1 SCALABILITY: from $O(N^2 \log N)$ to $O(N^{\frac{h+2}{h+1}})$
Hierarchical Affinity Propagation
negligible information loss (proof in the paper)
2 Non-stationary distribution
various Virtual Organizations
number and expertise of users
Streaming AP (StrAP)
Adaptive change detection test
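Self-adapt $\lambda$ ≡ an optimization problem
BIC: $F_\lambda = \frac{1}{|C|} \sum_{i=1}^{|C|} \left( \frac{1}{n_i} \sum_{e_j \in C_i} d(e_j, e_i^*) \right) + \varphi \frac{\rho}{2} \log N + \eta\, O_t$
∝ loss + size of model + fraction of outliers
OPTIMIZATION:
ε-greedy search from a finite set of $\lambda$ values
$\lambda = \arg\min_\lambda \{E(F_\lambda)\}$, over candidates $\lambda_1, \lambda_2, \lambda_3, \lambda_4, \ldots$ with estimates $E(F_{\lambda_1}), E(F_{\lambda_2}), E(F_{\lambda_3}), E(F_{\lambda_4}), \ldots$
Gaussian Process Regression based on $\{\lambda_i, F_{\lambda_i}\}$
a continuous value of $\lambda$ is generated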
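The ε-greedy search over a finite set of λ values can be sketched as follows. This is an assumption-laden illustration: `epsilon_greedy_lambda`, the `evaluate` callback (standing in for one measurement of $F_\lambda$ on the stream) and the parameter defaults are hypothetical names introduced here, and the Gaussian Process variant is not covered.

```python
import random


def epsilon_greedy_lambda(candidates, evaluate, rounds=200, epsilon=0.1, seed=0):
    """Pick the threshold with the lowest average cost E(F_lambda).

    candidates: finite list of lambda values.
    evaluate:   hypothetical callback returning one observed cost F_lambda.
    """
    rng = random.Random(seed)
    sums = {lam: 0.0 for lam in candidates}
    counts = {lam: 0 for lam in candidates}

    def mean_cost(lam):
        # Untried candidates are treated as infinitely costly.
        return sums[lam] / counts[lam] if counts[lam] else float("inf")

    for _ in range(rounds):
        untried = [lam for lam in candidates if counts[lam] == 0]
        if untried:
            lam = untried[0]                      # try every value once
        elif rng.random() < epsilon:
            lam = rng.choice(candidates)          # explore
        else:
            lam = min(candidates, key=mean_cost)  # exploit the best estimate
        sums[lam] += evaluate(lam)
        counts[lam] += 1
    return min(candidates, key=mean_cost)


if __name__ == "__main__":
    # Toy cost with a minimum at lambda = 3, for illustration only.
    best = epsilon_greedy_lambda([1.0, 2.0, 3.0, 4.0, 5.0], lambda lam: (lam - 3.0) ** 2)
    print("selected lambda:", best)
```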
G-StrAP: A Grid Dashboard
Online Monitoring
[Figure: bar charts of the percentage of jobs assigned (%) to each cluster and to the reservoir; each exemplar is shown as a job vector. Annotation: LogMonitor is getting clogged]
Off-line Analysis
Model Selection
The Piecewise Autoregressive model
AR process: $X_t = \gamma + \phi_1 X_{t-1} + \ldots + \phi_p X_{t-p} + \varepsilon_t$
The model parameters for piecewise AR
Number of segments $m$
Breakpoint locations/segment sizes $(n_j)_{j=1 \ldots m}$
AR orders $(p_j)_{j=1 \ldots m}$
AR parameters $(\Psi_j)_{j=1 \ldots m}$
Very large model space
Segment 1, $0 < t \le 512$: $X_t = 0.9 X_{t-1} + \varepsilon_t$
Segment 2, $512 < t \le 768$: $X_t = 1.69 X_{t-1} - 0.81 X_{t-2} + \varepsilon_t$
Segment 3, $768 < t \le 1024$: $X_t = 1.32 X_{t-1} - 0.81 X_{t-2} + \varepsilon_t$
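To get a feel for the three-segment example, the process can be simulated directly. The sketch below assumes zero-mean Gaussian noise with unit variance and $\gamma = 0$, neither of which is stated on the slide; `simulate_par` and `SEGMENTS` are names made up for the illustration.

```python
import random

# (last time index of the segment, AR coefficients phi_1..phi_p), as on the slide.
SEGMENTS = [
    (512, [0.9]),
    (768, [1.69, -0.81]),
    (1024, [1.32, -0.81]),
]


def simulate_par(segments=SEGMENTS, sigma=1.0, seed=0):
    """Simulate a piecewise autoregressive series.

    Noise is assumed Gaussian with standard deviation sigma (an assumption);
    missing past values at the very start are treated as zero.
    """
    rng = random.Random(seed)
    x = []
    t = 0
    for end, phis in segments:
        while t < end:
            value = rng.gauss(0.0, sigma)        # innovation epsilon_t
            for lag, phi in enumerate(phis, start=1):
                if t - lag >= 0:
                    value += phi * x[t - lag]    # phi_lag * X_{t-lag}
            x.append(value)
            t += 1
    return x


if __name__ == "__main__":
    series = simulate_par()
    print(len(series), "samples; first values:", [round(v, 2) for v in series[:5]])
```

The breakpoints at 512 and 768 are not visible as jumps in the mean; what changes is the dependence structure, which is why a mean-shift detector is not enough here.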
Minimum Description Length model selection for PAR
[Davis, Lee, Rodriguez-Yam, J. American Statist. Assoc. 2006.]
The MDL principle: the best-fitting model is the one that produces the shortest code length that completely describes the observed data $y$
$CL_F(y) = CL_F(F) + CL_F(e|F)$
$CL_F(F)$: description of the model
$CL_F(e|F)$: description of the residuals - what is not explained by the model
$CL = \log m + (m+1) \log n + \sum_{j=1}^{m+1} \left( \log p_j + \frac{p_j + 2}{2} \log n_j + \frac{n_j}{2} \log(2\pi\sigma_j^2) \right)$
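The code length can be evaluated for a candidate segmentation as sketched below. Several choices here are assumptions rather than statements from the talk: natural logarithms are used throughout, $\log 0$ is taken as 0 (for $m = 0$ or $p_j = 0$), the formula is read as indexing $m + 1$ segments, and `par_code_length` is a hypothetical name.

```python
import math


def par_code_length(n, segments):
    """Code length of a piecewise AR fit.

    n:        total number of observations.
    segments: list of (n_j, p_j, sigma2_j) = segment size, AR order,
              residual variance; the list has m + 1 entries.
    """
    m = len(segments) - 1                      # number of breakpoints
    cl = (math.log(m) if m > 0 else 0.0) + (m + 1) * math.log(n)
    for n_j, p_j, sigma2_j in segments:
        cl += math.log(p_j) if p_j > 0 else 0.0          # cost of the AR order
        cl += (p_j + 2) / 2 * math.log(n_j)              # cost of the parameters
        cl += n_j / 2 * math.log(2 * math.pi * sigma2_j) # cost of the residuals
    return cl


if __name__ == "__main__":
    one_segment = par_code_length(1024, [(1024, 2, 1.8)])
    three_segments = par_code_length(
        1024, [(512, 1, 1.0), (256, 2, 1.0), (256, 2, 1.0)]
    )
    print("one segment:", round(one_segment, 1))
    print("three segments:", round(three_segments, 1))
```

Adding segments only pays off when the drop in residual variance outweighs the extra model description, which is how MDL keeps the number of breakpoints small.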
Results on the workload processes
The amount of unterminated work in the system
Smoothed workload difference
Typically low-order AR models
Long segments
CE     No. of segments   Segment start [days]   Segment end [days]   Smallest root abs. value   Ljung-Box test on residuals (p-value)
CE-A   18                158.91                 196.53               1.5915                     0.05
CE-B   19                109.61                 160.65               2.1563                     0.04
CE-C   17                104.86                 149.31               5.5711                     0.21
CE-D   27                151.39                 190.16               1.1062                     0.05
[T. Elteto et al. Discovering Piecewise Linear Models of Grid Workload, CCGrid 2010]
Model validation
PAR: Ljung-Box test - whiteness of the AR residuals
Stability: Bootstrapping - stable breakpoints
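The whiteness check relies on the Ljung-Box statistic $Q = n(n+2) \sum_{k=1}^{h} \hat{r}_k^2 / (n-k)$, where $\hat{r}_k$ is the lag-$k$ sample autocorrelation of the residuals. A minimal sketch of the statistic alone follows; the name `ljung_box_q` and the default `max_lag` are assumptions, and turning $Q$ into a p-value (a chi-square tail probability) is left out to keep the example dependency-free.

```python
def ljung_box_q(residuals, max_lag=10):
    """Ljung-Box Q statistic over lags 1..max_lag.

    Large values indicate autocorrelated (non-white) residuals.
    """
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)       # n times the sample variance
    acc = 0.0
    for k in range(1, max_lag + 1):
        # Lag-k sample autocorrelation.
        r_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        acc += r_k * r_k / (n - k)
    return n * (n + 2) * acc


if __name__ == "__main__":
    # A strictly alternating series is as far from white noise as it gets.
    print("Q =", round(ljung_box_q([1.0, -1.0] * 50, max_lag=1), 2))
```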
Model reconciliation – bootstrap aggregation
Outcome: a simple and robust model describing the essential part of the workload process.
Policy evaluation
Evaluation of the matchmaking scheduling policy
ART: Actual Response Time = queuing delay at the CE
ERT: Expected Response Time, Copernican principle, gLite
Question: how good is the prediction?
Question: what is your definition of a good predictor?
Root Mean Squared Error?
Close statistical distribution, at normal regime, in the tail?
Correlation of time series?
ROC (Receiver Operating Characteristic): cost-benefit relation
Heterogeneous data
Overall
The distributions are not consistent
RMSE: Atlas 7.94E4, Biomed 7.2E3
Correlation (subsampling at 900 s) is not convincing
A la BQP (Batch Queue Predictor): how often does the prediction lie within a reasonable distance of the actual? Modified because BQP considers only upper bounds
ERT is a classifier; the classes are intervals of the value range
Intervals of exponentially increasing size
ROC: True Positive Rate vs False Positive Rate
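Treating ERT as a classifier over exponentially growing intervals can be sketched as below. The interval boundaries (1 s, then doubling) and the names `interval_class` and `roc_point` are assumptions made for the illustration; the talk does not give the actual bin edges.

```python
def interval_class(seconds, base=2.0):
    """Index of the interval containing `seconds`.

    Class 0 is [0, 1), class 1 is [1, base), class 2 is [base, base^2), ...
    The 1 s starting point and the base are illustrative assumptions.
    """
    k, upper = 0, 1.0
    while seconds >= upper:
        upper *= base
        k += 1
    return k


def roc_point(predicted, actual, k):
    """(TPR, FPR) of the one-vs-rest classifier 'the delay falls in class k'."""
    tp = fp = fn = tn = 0
    for ert, art in zip(predicted, actual):
        said_k = interval_class(ert) == k   # what ERT predicted
        was_k = interval_class(art) == k    # what ART turned out to be
        if said_k and was_k:
            tp += 1
        elif said_k:
            fp += 1
        elif was_k:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


if __name__ == "__main__":
    ert = [1.5, 3.0, 3.0, 10.0]   # made-up predictions, in seconds
    art = [1.2, 1.9, 3.5, 12.0]   # made-up observed delays, in seconds
    print(roc_point(ert, art, k=1))
```

One such point per class gives the true positive versus false positive picture without penalizing a prediction for being off by a few seconds inside a long delay.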
Reinforcement learning for responsive grids
Reinforcement learning for resource provisioning in grids
A multi-objective scheduling and dimensioning problem
Users: Differentiated QoS
Stakeholders: Fairness
Administrators: Utilization
[Figure: distribution of job execution times; x-axis: Execution time [s] (log scale, 10^0 to 10^6), y-axis: Probability; curves: all data, atlas, biomed]
Goals
Elastic resource provisioning: the context is Grids over Clouds - Infrastructure as a Service (IaaS)
Realistic hypotheses: organized sharing and mutualization, no central control
Autonomics: Model-free policies and configuration-free implementations
Formalisation
The scheduling MDP
State: descriptive variables of a site (queue, cluster)
Action: descriptive variables of a job (VO, execution time)
The dimensioning MDP
Action: number of computing nodes to maintain in activity
Policy learning
SARSA algorithm
Continuous state-action space: non-linear regression of $Q : (s, a) \to r$
Neural Network and Echo State Network
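The SARSA update itself is short. The sketch below is a deliberate simplification: it uses a lookup table for $Q$, whereas the slide describes a continuous state-action space handled by neural-network or Echo State Network regression. The state and action labels, `alpha`, `gamma` and both function names are illustrative assumptions.

```python
import random
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


def epsilon_greedy(q, s, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(s, a)])


if __name__ == "__main__":
    rng = random.Random(0)
    q = defaultdict(float)                    # tabular Q, for illustration only
    actions = ["schedule_atlas_job", "schedule_biomed_job"]  # hypothetical labels
    s, a = "short_queue", epsilon_greedy(q, "short_queue", actions, 0.1, rng)
    s_next = "long_queue"
    a_next = epsilon_greedy(q, s_next, actions, 0.1, rng)
    print(sarsa_update(q, s, a, r=1.0, s_next=s_next, a_next=a_next))
```

Because the next action is the one actually chosen by the policy, SARSA evaluates the policy it follows, which suits a production scheduler that cannot afford off-policy exploration.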
The Rewards
The Responsiveness utility for job $j$ is
$W_j = \frac{\text{execution time}_j}{\text{execution time}_j + \text{waiting time}_j}$ (1)
The Fairness utility for job $j$ is
$F_j = 1 - \frac{\max_k (w_k - S_{kj})^+}{M}$ (2)
where $x^+ = x$ if $x > 0$ and 0 otherwise, $w_k$ is the target share of VO $k$, and $S_{kj}$ is the share received by VO $k$ up to the election of job $j$.
The Utilization reward $U_n$ at time $T_n$ is
$U_n = \frac{f_n}{\sum_{k=0}^{n} P_k (T_{k+1} - T_k)}$ (3)
where $(T_1, \ldots, T_N)$ are the instants of decision making, $P_k$ is the number of processors allocated in the interval $[T_k, T_{k+1}]$ for $1 \le n < N$, and $f_n$ is the sum of the execution times of jobs completed at time $T_n$.
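The three rewards translate directly into code. In this sketch the function names are hypothetical, the normalization constant $M$ is passed in as a parameter because the slide does not define it, and the numbers in the usage section are made up.

```python
def responsiveness(execution_time, waiting_time):
    """Equation (1): fraction of the response time spent executing."""
    return execution_time / (execution_time + waiting_time)


def fairness(target_shares, received_shares, m):
    """Equation (2): 1 minus the largest share deficit, scaled by m.

    m is the normalization constant M, not defined on the slide.
    """
    worst_deficit = max(
        max(w - s, 0.0) for w, s in zip(target_shares, received_shares)
    )
    return 1.0 - worst_deficit / m


def utilization(completed_work, processors, instants):
    """Equation (3): completed execution time over allocated processor time.

    processors[k] is P_k, allocated over [instants[k], instants[k + 1]].
    """
    allocated = sum(
        p * (t1 - t0) for p, t0, t1 in zip(processors, instants, instants[1:])
    )
    return completed_work / allocated


if __name__ == "__main__":
    print(responsiveness(30.0, 10.0))                  # job ran 30 s, waited 10 s
    print(fairness([0.5, 0.5], [0.3, 0.7], m=0.5))     # one VO is under-served
    print(utilization(80.0, [4, 4], [0.0, 10.0, 20.0]))
```

All three values lie in [0, 1] for sensible inputs, which makes them easy to combine into a single weighted reward.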
Experimental results on EGEE traces
[Figure: Queuing delays - interactive jobs - Rigid. CDF vs. queueing delay (sec, log scale); curves: EGEE-INTER, ORA-INTER-0.5, ORA-INTER-1.0, EST-INTER-0.5, EST-INTER-1.0]
[Figure: Dynamics of the fairshare - All jobs - Rigid. Fairshare difference vs. arrival times (sec); curve: ELA-ORA-0.5 − EGEE]
[Figure: Queuing delays - interactive jobs - Elastic. CDF vs. queueing delay (sec, log scale); curves: ELA-ORA-0.5, ELA-ORA-1.0, ELA-EST-0.5, ELA-EST-1.0, RIG-ORA-0.5, RIG-ORA-1.0]
[J. Perez et al. JoGC 8/3 Sep. 2010]
Conclusion