T.J. Watson Research Center
Apr 2, 2006 | FeBID 2006 © 2006 IBM Corporation
Control of Large Scale Computing Systems
Yixin Diao, Joseph Hellerstein, Sujay Parekh
Multi-tier e-Commerce System
[Figure labels: Work Classes; Owners ($$); Tiers: Web, Application, Data, Storage; SLA; Objective]
Scaling Factors
Policy
  - Single owner
      - Has global objective (implicit or explicit)
          o E.g.: maximize revenue
  - Multiple owners
      - E-Commerce Tx requiring billing, credit card, shipping system, etc.
      - Global arbitration?
  - Single or Multiple Objectives
      - Arising from service classes
          o E.g.: {Gold, Silver, Bronze} or {Browse, Buy}
Target System
  - Number of inputs/outputs
      - SISO (easier)
      - MIMO (harder)
          o MIMO-C
          o MIMO-D: Controlled resources are distributed
  - Key: decomposability
Control Problems – Examples
Policy (columns): 1 Objective, 1 Owner; N Objectives, 1 Owner; N Objectives, M Owners
Target System (rows), with examples:
  SISO:   Lotus Notes; TCP RED; Voltage Scaling
  MISO:   DB2 Utilities
  MIMO-C: DB2 Self-Tuning Memory
          Web server QoS; Web server CPU/MEM
  MIMO-D: EUCON, D-EUCON; Web cluster performance; zOS WLM; Multi-tier system
          Storage systems; Utility & Grid Computing; Networks
Examples – Another view
[Figure: feedback loop with blocks Controller, Target System, Sensor and signals Reference Input, Error, Controls, Metrics, State]

1. Domino Server
     Reference input: RT Goal
     Controls: MaxUsers
     Metrics: Response Time
     State: Queue Length
2. Apache Server
     Reference input: Per-class RT Goal
     Controls: Per-class Thread Pool
     Metrics: Per-class RT
3. Websphere Cluster
     Reference input: Per-class RT Goal
     Controls: Per-class Seat Allocation
     Metrics: Per-class RT
4. ATM Network
     Reference input: Per-class Bandwidth
     Controls: Per-class Rate Control
     Metrics: Per-class Bandwidth
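The loop in example 1 (reference input = RT goal, control = MaxUsers, metric = response time) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the linear plant model, the parameter values, and the function name simulate_integral_control are hypothetical and not taken from the slides, and the integral controller is a generic textbook choice, not necessarily what was used on the Domino server.

```python
def simulate_integral_control(rt_goal, steps=50, ki=0.5,
                              base_rt=0.2, rt_per_user=0.01,
                              initial_users=10.0):
    """Run an integral controller against a hypothetical static plant.

    Assumed plant (not from the slides): response_time = base_rt + rt_per_user * max_users.
    Returns a list of (max_users, response_time, error) tuples, one per step.
    """
    max_users = initial_users
    history = []
    for _ in range(steps):
        # Sensor: measure the metric produced by the target system.
        response_time = base_rt + rt_per_user * max_users
        # Error: reference input minus measured metric.
        error = rt_goal - response_time
        history.append((max_users, response_time, error))
        # Controller: integral action. Scaling the gain by the assumed plant
        # slope makes the error shrink by a factor (1 - ki) each step.
        max_users = max(0.0, max_users + (ki / rt_per_user) * error)
    return history


history = simulate_integral_control(rt_goal=1.0)
final_users, final_rt, final_error = history[-1]
```

With these assumed parameters the control input settles at the MaxUsers value for which the modeled response time equals the goal. A real server would have dynamics, noise, and saturation that this static model ignores.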
Control Strategies
(A) Centralized (MIMO-C)
  - Good for: co-located resources, LANs, single owner
  - E.g.: Cluster resource allocation
  - MIMO control techniques
(B) Distributed (MIMO-D)
  - Good for: separable control loops
  - Challenge: Delays, Information hiding, Policy design
  - E.g.: Per-tier resource allocation
Policy Authority Example
[Figure labels: Policy Authority; End-to-End Goal; Per-tier Goals; tiers: Web, Application, Data]
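One simple way a policy authority might turn an end-to-end goal into per-tier goals is a proportional split. The sketch below is an assumption for illustration, not the scheme from the talk: it assumes the tiers are visited in series so per-tier response times add up, and the function name split_end_to_end_goal and the weights are hypothetical.

```python
def split_end_to_end_goal(end_to_end_goal_ms, tier_weights):
    """Split an end-to-end response-time goal across tiers in proportion to weights.

    Assumes (hypothetically) that tiers are in series, so per-tier goals that
    sum to the end-to-end goal imply the end-to-end goal is met.
    """
    total = sum(tier_weights.values())
    if total <= 0:
        raise ValueError("tier weights must sum to a positive value")
    return {tier: end_to_end_goal_ms * weight / total
            for tier, weight in tier_weights.items()}


# Hypothetical weights, e.g. each tier's share of observed latency.
per_tier_goals = split_end_to_end_goal(
    500.0, {"Web": 1.0, "Application": 3.0, "Data": 1.0})
```

Each tier's local controller could then use its entry in per_tier_goals as its reference input, while the policy authority adjusts the weights over time.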
Control Strategies (Cont’d)
(C) Hierarchical
Challenges: Policy related
Policy decomposition/transformation
  - Translating high-level (business) goals into IT-level goals for controller
  - Create a library of "standard" schemes/patterns
Metric translation
  - You say "tomayto", I say "tomahto"
  - E.g., how is throughput measured? What time scale?
  - Business-level KPIs may be difficult/impossible to measure
Multiple owners
  - Arbitration between multiple owners
  - Use SLAs, Utility-based framework
  - Design the proper incentive schemes (Game Theory)
Too many actuators
  - Non-unique control configurations
  - Use implied objectives, constraints, projection algorithms
SLA Utility Example
SLA
  - SLA goals, e.g.: RT < 500 ms (90th percentile)
  - Profit: Income / request, e.g.: $0.02 / request
  - Loss: Penalty / SLA violation, e.g.: $100 / minute
Individual objective
  - U = Profit - Loss
Design control strategy to maximize U
  - E(U) over (in)finite horizon
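The utility U = Profit - Loss can be worked through numerically with the example figures from the slide. The sketch below rests on assumptions not stated in the source: the penalty is read as $100 for each minute in which the 90th-percentile response time is not below 500 ms, the percentile uses the nearest-rank method, and all function names are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sla_utility(requests_served, violation_minutes,
                income_per_request=0.02, penalty_per_minute=100.0):
    """U = Profit - Loss, using the slide's example rates as defaults."""
    profit = requests_served * income_per_request
    loss = violation_minutes * penalty_per_minute
    return profit - loss


def minute_utility(response_times_ms, goal_ms=500.0, pct=90):
    """Utility of one minute of traffic (assumed: one penalty per violating minute)."""
    violated = bool(response_times_ms) and (
        percentile_nearest_rank(response_times_ms, pct) >= goal_ms)
    return sla_utility(len(response_times_ms), 1 if violated else 0)


# One million requests with three violating minutes.
u = sla_utility(1_000_000, 3)
```

Summing minute_utility over a window of minutes gives a finite-horizon utility that a controller could be tuned to maximize. Taking an expectation over a workload model, as in E(U), is outside this sketch.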
Challenges: System related
Interacting flows of requests
  - Layered queueing effects
      o Not well captured by queueing models
Multiple Time Constants
  - Web tier has different dynamics than Storage subsystem
  - Discrete or continuous time models?
Distributed computing issues
  - Delays
  - Information hiding / monitoring overheads
Availability of sensors & actuators
  - Software abstractions, political or IP boundaries
  - Open Source systems can help
Ensuring global stability / optimality
  - Local stability proofs are easier
Conclusion
Controlling large-scale computing systems is a challenge
  - Many problems inherited from scale
  - How do computing systems make it easier/harder to manage scale?
Some ideas how to think about scale
  - Target system
  - Policy
System & Policy architecture drives Control architecture
Inter-disciplinary approaches needed
  - Control + {Queueing theory, Game Theory, …}