T.J. Watson Research Center Apr 2, 2006 | FeBID 2006 © 2006 IBM Corporation http://w3.ibm.com/ibm/presentations Control of Large Scale Computing Systems Yixin Diao Joseph Hellerstein Sujay Parekh


T.J. Watson Research Center

Apr 2, 2006 | FeBID 2006 © 2006 IBM Corporation

Control of Large Scale Computing Systems

Yixin Diao, Joseph Hellerstein, Sujay Parekh


Multi-tier e-Commerce System

[Figure: multi-tier e-commerce system. Labels: Work Classes; Owners; Tiers (Web, Application, Data, Storage); SLA; Objective.]


Scaling Factors

Policy
  Single owner
    Has global objective (implicit or explicit)
      o e.g.: maximize revenue
  Multiple owner
    E-Commerce Tx requiring billing, credit card, shipping system, etc.
    Global arbitration?
  Single or Multiple Objectives
    Arising from service classes
      o e.g.: {Gold, Silver, Bronze} or {Browse, Buy}

Target System
  Number of inputs/outputs
    SISO (easier)
    MIMO (harder)
      o MIMO-C
      o MIMO-D: Controlled resources are distributed
  Key: decomposability


Control Problems – Examples

Policy (columns): 1 Objective, 1 Owner | N Objectives, 1 Owner | N Objectives, M Owners

Target System (rows):
  SISO:   Lotus Notes; TCP RED; Voltage Scaling
  MISO:   DB2 Utilities
  MIMO-C: DB2 Self-Tuning Memory; Web server QoS; Web server CPU/MEM
  MIMO-D: EUCON, D-EUCON; Web cluster performance; zOS WLM; Multi-tier system; Storage systems; Utility & Grid Computing; Networks


Examples – Another view

[Figure: feedback control loop. Block labels: Reference Input, Error, Controller, Controls, Target System, Metrics, Sensor, State.]

Target System:
  1. Domino Server
  2. Apache Server
  3. Websphere Cluster
  4. ATM Network

Reference Input:
  1. RT Goal
  2. Per-class RT Goal
  3. Per-class RT Goal
  4. Per-class Bandwidth

Controls:
  1. MaxUsers
  2. Per-class Thread Pool
  3. Per-class Seat Allocation
  4. Per-class Rate Control

Metrics:
  1. Response Time
  2. Per-class RT
  3. Per-class RT
  4. Per-class Bandwidth

State:
  1. Queue Length
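
The loop in example 1 (RT Goal as reference, MaxUsers as control, Response Time as metric) can be sketched as a simple integral controller. This is a minimal illustration only, not the controller used in the cited systems: the gain, the starting value, and the linear `toy_server_response_time` model are all made-up assumptions.

```python
def integral_controller(gain, initial_control):
    """Return a step function implementing u(k+1) = u(k) + gain * error(k)."""
    control = initial_control

    def step(reference, measured):
        nonlocal control
        error = reference - measured      # Error = Reference Input - Metrics
        control = control + gain * error  # accumulate the error into the control
        return control

    return step


def toy_server_response_time(max_users):
    """Hypothetical target system: response time grows linearly with MaxUsers."""
    return 0.01 * max_users  # seconds; an assumed model, not measured data


def run_loop(rt_goal=1.0, steps=50):
    """Close the loop: sense response time, then adjust MaxUsers toward the goal."""
    step = integral_controller(gain=20.0, initial_control=10.0)
    max_users = 10.0
    for _ in range(steps):
        measured_rt = toy_server_response_time(max_users)  # Sensor
        max_users = step(rt_goal, measured_rt)              # Controller
    return max_users, toy_server_response_time(max_users)


if __name__ == "__main__":
    users, rt = run_loop()
    print(f"MaxUsers settles near {users:.1f}, response time near {rt:.3f} s")
```

With this toy model the control settles where the measured response time equals the goal; a real server would need an identified model and a gain chosen from it.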


Control Strategies

(A) Centralized (MIMO-C)
  Good for: co-located resources, LANs, single owner
  e.g.: Cluster resource allocation
  MIMO control techniques

(B) Distributed (MIMO-D)
  Good for: separable control loops
  Challenge: Delays, Information hiding, Policy design
  e.g.: Per-tier resource allocation


Policy Authority Example

[Figure: Policy Authority example. Labels: Policy Authority; End-to-End Goal; Per-tier Goals; tiers: Web, Application, Data.]
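
One way a policy authority might turn an end-to-end goal into per-tier goals is a proportional split. The sketch below is an assumption for illustration, not the scheme from the talk: it presumes the tiers are visited in series so that per-tier response times add up, and the function name, weights, and 500 ms figure are hypothetical.

```python
def split_end_to_end_goal(end_to_end_rt_ms, tier_weights):
    """Split an end-to-end response-time goal across tiers by relative weight.

    Assumes serial tiers, so the per-tier goals sum to the end-to-end goal.
    """
    total = sum(tier_weights.values())
    if total <= 0:
        raise ValueError("tier weights must sum to a positive value")
    return {
        tier: end_to_end_rt_ms * weight / total
        for tier, weight in tier_weights.items()
    }


# Hypothetical weights, e.g. taken from each tier's share of observed latency.
per_tier_goals = split_end_to_end_goal(
    500.0, {"Web": 1, "Application": 2, "Data": 2}
)

if __name__ == "__main__":
    for tier, goal in per_tier_goals.items():
        print(f"{tier}: {goal:.0f} ms")
```

Each per-tier goal can then serve as the reference input of that tier's own control loop.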


Control Strategies (Cont’d)

(C) Hierarchical


Challenges: Policy related

Policy decomposition/transformation
  Translating high-level (business) goals into IT-level goals for controller
  Create a library of “standard” schemes/patterns

Metric translation
  You say “tomayto”, I say “tomahto”
  e.g., how is throughput measured? What time scale?
  Business-level KPIs may be difficult/impossible to measure

Multiple owners
  Arbitration between multiple owners
    Use SLAs, Utility-based framework
    Design the proper incentive schemes (Game Theory)

Too many actuators
  Non-unique control configurations
    Use implied objectives, constraints, projection algorithms


SLA Utility Example

SLA
  SLA goals, e.g.: RT < 500 ms (90th %ile)
  Profit: Income / request, e.g.: $0.02/request
  Loss: Penalty / SLA violation, e.g.: $100 / minute

Individual objective
  U = Profit - Loss

Design control strategy to maximize U
  E(U) over (in)finite horizon
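
The utility U = Profit - Loss can be evaluated for an observed trace as a rough sketch. The default rates below are the example figures from the slide; the per-minute trace format, the function name, and the sample numbers are assumptions made for illustration. A control strategy would aim to maximize the expectation of this quantity, whereas the code only computes the realized value.

```python
def sla_utility(requests_per_minute, p90_rt_ms_per_minute,
                income_per_request=0.02, penalty_per_minute=100.0,
                rt_goal_ms=500.0):
    """Realized utility U = Profit - Loss over a trace of one-minute intervals."""
    if len(requests_per_minute) != len(p90_rt_ms_per_minute):
        raise ValueError("both traces must cover the same minutes")
    # Profit: income earned per served request.
    profit = income_per_request * sum(requests_per_minute)
    # Loss: a penalty for each minute whose 90th-percentile RT misses RT < goal.
    violating_minutes = sum(1 for rt in p90_rt_ms_per_minute if rt >= rt_goal_ms)
    loss = penalty_per_minute * violating_minutes
    return profit - loss


# Hypothetical three-minute trace: the second minute violates the SLA.
utility = sla_utility([6000, 6000, 6000], [420.0, 510.0, 480.0])

if __name__ == "__main__":
    print(f"U = ${utility:.2f}")
```

In this made-up trace, one violating minute costs $100 against roughly $360 of income, which shows why a controller may prefer to shed load rather than miss the response-time goal.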


Challenges: System related

Interacting flows of requests
  Layered Queueing effects
    not well captured by Queueing models

Multiple Time Constants
  Web tier has different dynamics than Storage subsystem
  Discrete or continuous time models?

Distributed computing issues
  Delays
  Information hiding / monitoring overheads

Availability of sensors & actuators
  Software abstractions, political or IP boundaries
  Open Source systems can help

Ensuring global stability / optimality
  Local stability proofs are easier
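
To illustrate why a local stability check is comparatively easy, the sketch below tests a single loop in isolation. It assumes a first-order discrete-time model y(k+1) = a*y(k) + b*u(k) under integral control u(k+1) = u(k) + ki*(r - y(k)); the model form and the parameter values are assumptions, not taken from the talk. The closed-loop characteristic polynomial is then z^2 - (1 + a)z + (a + b*ki), and the loop is stable when both roots lie inside the unit circle. Nothing here says anything about the stability of several interacting loops.

```python
import cmath


def closed_loop_poles(a, b, ki):
    """Poles of y(k+1) = a*y(k) + b*u(k) with u(k+1) = u(k) + ki*(r - y(k))."""
    # Characteristic polynomial: z^2 + p*z + q with the coefficients below.
    p = -(1.0 + a)
    q = a + b * ki
    disc = cmath.sqrt(p * p - 4.0 * q)  # complex sqrt handles oscillatory poles
    return ((-p + disc) / 2.0, (-p - disc) / 2.0)


def is_locally_stable(a, b, ki):
    """True when every closed-loop pole lies strictly inside the unit circle."""
    return all(abs(z) < 1.0 for z in closed_loop_poles(a, b, ki))


if __name__ == "__main__":
    for ki in (0.1, 1.0):
        print(f"ki={ki}: stable={is_locally_stable(0.5, 1.0, ki)}")
```

With the assumed a = 0.5 and b = 1.0, a small gain keeps the poles inside the unit circle while a large gain pushes them out, which is the usual trade-off between responsiveness and stability.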


Conclusion

Controlling large-scale computing systems is a challenge
  Many problems inherited from scale
  How do computing systems make it easier/harder to manage scale?

Some ideas how to think about scale
  Target system
  Policy

System & Policy architecture drives Control architecture

Inter-disciplinary approaches needed
  Control + {Queueing theory, Game Theory, …}