Operational Insight: Concepts and Examples (w/o Presenter Notes)

Post on 12-Aug-2015


Operational Insight

June 15, 2015

Roy Rapoport

@royrapoport / linkedin.com/in/royrapoport / rrapoport@netflix.com

Oh, The Places We’ll Go!

John Boyd

Observe

Orient

Decide

Act

OODA

“This approach favors agility over raw power in dealing with human opponents in any endeavor” - Wikipedia

This Is What We Do

OODA KPI

Speed

Effort

Reliability

Winning

Speed

Effort

Reliability

Implications … for Observation (aka measurement, telemetry, metrics)

• Make It Easy
• Make It Scalable
• Make It Pluggable
• (Eventually) Ruthlessly Cull

“What decision will this help me make?”
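One way to read "Make It Easy" and "Make It Pluggable" is sketched below: recording a metric takes a single line at the call site, and backends are attached as interchangeable sinks. This is a minimal illustration only. The `Metrics` and `InMemorySink` classes, the tag names, and the region value are all assumptions made for this example; they are not taken from the talk or from any real telemetry library.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A "sink" is any callable that accepts (name, tags, value).
# Hypothetical interface, invented for this sketch.
Sink = Callable[[str, Dict[str, str], float], None]


class Metrics:
    """Tiny metrics front end: one call to record, any number of sinks."""

    def __init__(self) -> None:
        self._sinks: List[Sink] = []

    def add_sink(self, sink: Sink) -> None:
        # Pluggable: new backends are added without touching call sites.
        self._sinks.append(sink)

    def record(self, name: str, value: float = 1.0, **tags: str) -> None:
        # Easy: a single line at the call site, with free-form tags.
        for sink in self._sinks:
            sink(name, dict(tags), value)


class InMemorySink:
    """Example sink that sums values per (name, sorted tags) key."""

    def __init__(self) -> None:
        self.totals: Dict[Tuple[str, Tuple[Tuple[str, str], ...]], float] = defaultdict(float)

    def __call__(self, name: str, tags: Dict[str, str], value: float) -> None:
        key = (name, tuple(sorted(tags.items())))
        self.totals[key] += value


metrics = Metrics()
memory = InMemorySink()
metrics.add_sink(memory)

# Made-up sample events.
metrics.record("requests", region="us-east-1", status="200")
metrics.record("requests", region="us-east-1", status="200")
metrics.record("requests", region="us-east-1", status="500")
```

Culling a metric later then means deleting one `record` call, which keeps the "What decision will this help me make?" test cheap to apply.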

A Joke

52

48

% of servers in major region with an even IP address

Implications … for Orientation (aka graphing, visualization)

• First-class product
• Different decisions require different viz
• Low cognitive load better than
  • High refresh rates
  • Deep data density

Better Like This …

Or Better Like That …

Implications … for Decisions (aka alerting, real-time analytics, etc)

• You already have (some of) this
• Incremental improvement
• Sky’s the limit
  • For benefits
  • For cost

Implications … for Action

1. Humans beat bureaucracy
2. Machines beat humans
3. Repeatability beats one-offs
4. Start with humans
5. If IFTTT, deprecate humans

Repeatable machine processes TROUNCE one-off human bureaucracy
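Point 5 can be pictured as a table of "if this, then that" rules that a machine evaluates the same way every time. The sketch below is one hypothetical shape for such a table. The `Rule` class, the `error_rate` field, the 0.05 threshold, and the action text are all invented for illustration and do not describe any system mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Observation = Dict[str, float]


@dataclass
class Rule:
    """One 'if this, then that' pair. All names are illustrative."""
    name: str
    condition: Callable[[Observation], bool]   # the "if this"
    action: Callable[[Observation], str]       # the "then that"


def run_rules(rules: List[Rule], observation: Observation) -> List[str]:
    """Apply every matching rule and return a log of the actions taken."""
    log = []
    for rule in rules:
        if rule.condition(observation):
            log.append(f"{rule.name}: {rule.action(observation)}")
    return log


rules = [
    Rule(
        name="high-error-rate",
        condition=lambda o: o["error_rate"] > 0.05,  # made-up threshold
        action=lambda o: "terminate instance",       # placeholder action
    ),
]

actions = run_rules(rules, {"error_rate": 0.12})
quiet = run_rules(rules, {"error_rate": 0.01})
```

Once a human decision can be written down as a rule like this, it is repeatable by construction, which is the property points 2 and 3 ask for.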

Decision: Do I Have Enough Instances?
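The transcript does not say how this decision is made. As a rough sketch under stated assumptions, one common back-of-the-envelope check divides expected peak load by per-instance capacity and adds headroom. The function names, the requests-per-second framing, and the 25% default headroom are assumptions for this example, not figures from the talk.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float, headroom: float = 0.25) -> int:
    """Instances required to serve peak_rps with a fractional safety margin."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance)


def have_enough(current: int, peak_rps: float, rps_per_instance: float) -> bool:
    """True when the current fleet covers peak load plus default headroom."""
    return current >= instances_needed(peak_rps, rps_per_instance)
```

For example, with a made-up peak of 10000 requests per second and 100 requests per second per instance, `instances_needed` asks for 125 instances, so a fleet of 100 would not be enough.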

Decision: Is My Canary Good?

Been there. Done that. Manually. Artisanally.

• Started in the Data Center
• Manual, dashboard-driven

Been there. Done that. Manually.

[Dashboard panels: CPU, Requests, Errors]

Been there. Done that. Manually.

• Context vs Precision
• No …
  • Repeatability
  • Trending
• Manual effort is manual

So Now What?

• Automate Analysis
• Took Some Effort
• Approach and analytics
• Presentation matters
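To make "Automate Analysis" concrete, the sketch below compares a canary against a baseline on the same three signals the manual dashboard showed (CPU, requests, errors) and turns the comparison into a score. This is a deliberately simple stand-in, not the analysis used in the talk: the relative-difference test, the 10% tolerance, the 0.9 pass mark, and all sample numbers are assumptions.

```python
from statistics import mean
from typing import Dict, List

Samples = Dict[str, List[float]]


def canary_score(baseline: Samples, canary: Samples, tolerance: float = 0.10) -> float:
    """Fraction of shared metrics whose canary mean is within `tolerance`
    (relative difference) of the baseline mean."""
    shared = sorted(set(baseline) & set(canary))
    if not shared:
        raise ValueError("no metrics in common")
    passed = 0
    for name in shared:
        base, can = mean(baseline[name]), mean(canary[name])
        if base == 0:
            ok = can == 0  # no relative difference exists; require an exact match
        else:
            ok = abs(can - base) / abs(base) <= tolerance
        if ok:
            passed += 1
    return passed / len(shared)


def canary_is_good(baseline: Samples, canary: Samples, min_score: float = 0.9) -> bool:
    """Repeatable yes/no answer to 'Is my canary good?'."""
    return canary_score(baseline, canary) >= min_score


# Made-up sample data: the canary matches on CPU and requests but errors more.
baseline = {"cpu": [40, 42, 41], "requests": [1000, 1010, 990], "errors": [5, 6, 4]}
canary = {"cpu": [41, 43, 42], "requests": [1005, 995, 1000], "errors": [12, 11, 13]}
score = canary_score(baseline, canary)
```

A numeric score gives the repeatability and trending that the manual, dashboard-driven approach lacked, at the cost of some of the context a human reading graphs would bring.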

[Diagram, shown over several builds. Components: Version Control System, Build & Deployment System, Automated Canary Analysis, Pretty Pictures, Customers. Server groups: 1000 servers @ 1.0.1, joined in successive builds by 1 server @ 1.0.2, then 10 servers @ 1.0.2, then 1000 servers @ 1.0.2.]
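The server counts on these slides (1, then 10, then 1000 servers @ 1.0.2) suggest a staged rollout gated by a check at each step. The loop below is a minimal sketch of that idea. The `staged_rollout` function, the `canary_ok` callback, and the decision to roll back on the first failed check are assumptions for illustration; the transcript does not describe what happens on failure.

```python
from typing import Callable, List, Sequence


def staged_rollout(stages: Sequence[int], canary_ok: Callable[[int], bool]) -> List[str]:
    """Walk through rollout stages, stopping at the first failed check."""
    log = []
    for size in stages:
        log.append(f"deploy new version to {size} server(s)")
        if not canary_ok(size):
            # Assumed policy: stop and roll back as soon as a check fails.
            log.append(f"check failed at {size} server(s); roll back")
            return log
        log.append(f"check passed at {size} server(s)")
    log.append("rollout complete")
    return log


STAGES = (1, 10, 1000)  # stage sizes as they appear on the slides

# Two made-up scenarios: every check passes, or the check fails at 10 servers.
happy = staged_rollout(STAGES, lambda size: True)
halted = staged_rollout(STAGES, lambda size: size < 10)
```

In practice `canary_ok` would wrap an automated analysis step rather than a lambda, so each stage is judged the same way every time.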

Just The Stats (4-Week View)

6309 canary analysis cycles

16% of canaries failed

Decision: Do I Have an Outlier?

Outlier Detection

Would You Like to Play a Game?

Spot the Outlier

The Outlier Is “A”
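"Spot the Outlier" can also be handed to a machine. The transcript does not name the detection method, so the sketch below swaps in a simple, well-known one: a modified z-score based on the median absolute deviation (MAD), with the conventional 0.6745 scaling constant and 3.5 cutoff. The function name, the server labels, and the error-rate values are made up for this example.

```python
from statistics import median
from typing import Dict, List


def find_outliers(values: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return names whose modified z-score (based on the median absolute
    deviation) exceeds `threshold`."""
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # The median deviation is zero, so fall back to flagging any
        # value that differs from the median at all.
        return sorted(name for name, v in values.items() if v != med)
    return sorted(
        name for name, v in values.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Made-up per-server error rates; server "A" is far from its peers.
error_rates = {"A": 9.7, "B": 1.1, "C": 0.9, "D": 1.0, "E": 1.2}
outliers = find_outliers(error_rates)
```

Median-based statistics are used here because a single extreme server would otherwise drag a mean and standard deviation toward itself and hide the very outlier being sought.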

Just The Stats (4-Week View)

739 Server Terminations

In a Nutshell

Observe: Need This First
http://bit.ly/nflx-atlas-2013
http://metrics20.org

Orient: Understand the decision
http://bit.ly/nflx-qcon-aca-2014

Decide: Make it easier for humans

Act: Make machines do it

Higher speed, lower effort, higher reliability

Questions, Attributions, Feedback

@royrapoport / rsr@netflix.com / linkedin.com/in/royrapoport