Failure the-good-parts

FAILURE The Good Parts Viktor Klang Director of Engineering

Transcript of Failure the-good-parts

Page 1: Failure the-good-parts

FAILURE The Good Parts

Viktor Klang Director of Engineering

Page 2: Failure the-good-parts

“Build powerful, concurrent, resilient & distributed software more easily.”

Page 3: Failure the-good-parts

FAILURE The Bad Parts

Page 4: Failure the-good-parts

Ariane 5 - 4 June 1996

๏ 10 years of research

๏ $7 billion invested

๏ Exploded within a minute of take-off

๏ Loss estimate $370 million

๏ Why?

๏ Trying to stuff a 64-bit float into a 16-bit int

๏ o_O + wat
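The Ariane 5 bullet describes an unchecked narrowing conversion. A minimal Scala sketch of the checked alternative follows; `toShortChecked` is a hypothetical helper invented for this illustration, not a library function. On the JVM an unchecked `toShort` silently wraps around, whereas in the flight software the conversion raised an exception that was not handled, so the code shows the general idea, not the original Ada.

```scala
object NarrowingSketch {
  // Hypothetical helper: refuse to narrow a Double that does not fit in a Short.
  // The out-of-range case becomes an ordinary value the caller has to deal with.
  def toShortChecked(value: Double): Either[String, Short] =
    if (value.isNaN || value < Short.MinValue || value > Short.MaxValue)
      Left(s"$value does not fit into a 16-bit signed integer")
    else
      Right(value.toShort) // in range, so only the fractional part is dropped
}
```

Calling `NarrowingSketch.toShortChecked(40000.0)` yields a `Left`, while a bare `40000.0.toShort` returns a negative number without any warning.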

Page 5: Failure the-good-parts

“Failure is an option. A Some(failure) to be exact.” – me

Page 6: Failure the-good-parts

Failure Recovery

Page 7: Failure the-good-parts

#define Failure
#undef Failure

Page 8: Failure the-good-parts

Software fails

Page 9: Failure the-good-parts

Runtime

๏VM (OpenJDK Issue Tracker)

๏OS

๏Drivers

๏Firmware

Page 10: Failure the-good-parts

Runtime

๏Overload/Exhaustion

๏Stack

๏Heap

๏FDs

๏…

๏Starvation

Page 11: Failure the-good-parts

Hardware fails

Page 12: Failure the-good-parts

CPUs

"Related instructions that are affected by the bug are

FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR, FPREM, and FPREM1.

The instructions FPTAN and FPATAN are also susceptible"

http://en.wikipedia.org/wiki/Pentium_FDIV_bug

Page 13: Failure the-good-parts

RAM

Page 14: Failure the-good-parts

DRAM Errors in the Wild: A Large-Scale Field Study

Bianca Schroeder Dept. of Computer Science

University of Toronto Toronto, Canada

Eduardo Pinheiro Google Inc.

Mountain View, CA

Wolf-Dietrich Weber Google Inc.

Mountain View, CA

Page 15: Failure the-good-parts

DRAM Errors in the wild

๏Memory errors were between 15 and 120 times (!) more common than had previously been assumed.

๏More than 90% of the problems with a given platform were caused by about 20% of the machines that had errors.

Page 16: Failure the-good-parts

DRAM Errors in the wild

(Credit: Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber)

http://news.cnet.com/8301-30685_3-10370026-264.html

Page 17: Failure the-good-parts

DRAM Errors in the wild

๏Temperature didn't seem to make a big difference.

๏Irreparable problems were more common than transient problems.

๏Increased number of errors with age, setting in as early as 10-18 months in the field.

Page 18: Failure the-good-parts

HDDs

Page 19: Failure the-good-parts

Failure Trends in a Large Disk Drive Population

Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso

Google Inc. 1600 Amphitheatre Pkwy Mountain View, CA 94043

{edpin,wolf,luiz}@google.com

Page 20: Failure the-good-parts

Failure Trends by age

Page 21: Failure the-good-parts

Failure Trends by utilization and age

Page 22: Failure the-good-parts

The Network is Reliable

LOL

Kyle Kingsbury's blog:

http://aphyr.com/posts/288-the-network-is-reliable

Page 23: Failure the-good-parts

Wetware fails

Page 24: Failure the-good-parts

“An expert is a man who has made all the mistakes which can be made, in a narrow field.” – Niels Bohr

Page 25: Failure the-good-parts

Assumptions are bad

Page 26: Failure the-good-parts

Quiz

val result = something(x,y)

Page 27: Failure the-good-parts

Validation vs Failure

๏ Failure is unintentional

๏ Validation is intentional

Page 28: Failure the-good-parts

Flows of information

๏ Results & Validation

๏ Failures & Recovery

๏ Don't complect them!
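One way to keep the two flows apart in Scala is to return validation verdicts as values and let failures travel as exceptions to whoever is responsible for recovery. The sketch below uses a vending machine purely as an illustration; every name in it (`VendingSketch`, `Rejection`, `Item`, `buy`, `dispense`, the price table, the `jammed` flag) is an assumption made for this example and not code from the talk.

```scala
object VendingSketch {
  sealed trait Rejection
  case object InsufficientFunds extends Rejection
  case object UnknownProduct extends Rejection

  final case class Item(name: String)

  private val prices = Map("coffee" -> 2, "tea" -> 1)

  // Failure is unintentional: a jammed dispenser is not the customer's problem.
  // It leaves this code as an exception, for whoever looks after the machine.
  private def dispense(product: String, jammed: Boolean): Item =
    if (jammed) throw new IllegalStateException("dispenser jammed") else Item(product)

  // Validation is intentional: a rejection is an expected answer and goes
  // back to the caller as a plain value.
  def buy(product: String, coins: Int, jammed: Boolean = false): Either[Rejection, Item] =
    prices.get(product) match {
      case None                         => Left(UnknownProduct)
      case Some(price) if coins < price => Left(InsufficientFunds)
      case Some(_)                      => Right(dispense(product, jammed))
    }
}
```

The customer only ever sees `Left(...)` or `Right(...)`; the jam never shows up in the return type, which is the point of not mixing the two flows.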

Page 29: Failure the-good-parts

The Little Vending Machine That Could

Page 30: Failure the-good-parts

[Diagram: Failure, Validation, Handled]

Page 31: Failure the-good-parts

Outcome awareness

Known-Unknowns Unknown-Unknowns

Known-Knowns Unknown-Knowns

Page 32: Failure the-good-parts

Failure awareness

Known-Unknowns Unknown-Unknowns

Known-Knowns Unknown-Knowns

Page 33: Failure the-good-parts

Possibilities

๏ Result

๏ Invalid input

๏ Illegal value

๏ Illegal value combination

๏ Capability/Dependency violation

๏ Nothing

๏ Uninvoked

๏ Response lost
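The possibilities can be written down as a sealed type so that the compiler checks that every case is handled. This is only a sketch: the grouping (illegal values under invalid input, uninvoked and lost responses under "nothing") is one reading of the flattened list, and all type and method names are hypothetical.

```scala
object OutcomeSketch {
  sealed trait Outcome
  final case class Result(value: Int) extends Outcome

  sealed trait InvalidInput extends Outcome
  final case class IllegalValue(name: String) extends InvalidInput
  final case class IllegalCombination(names: List[String]) extends InvalidInput

  final case class DependencyViolation(dependency: String) extends Outcome

  // "Nothing": from the caller's side both cases look identical, there is no reply.
  sealed trait NoReply extends Outcome
  case object Uninvoked extends NoReply
  case object ResponseLost extends NoReply

  // Because the hierarchy is sealed, the compiler warns if a case is left out.
  def nextStep(outcome: Outcome): String = outcome match {
    case Result(_)              => "use the value"
    case _: InvalidInput        => "fix the request"
    case DependencyViolation(_) => "provide the missing capability or dependency"
    case _: NoReply             => "time out, then decide whether a retry is safe"
  }
}
```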

Page 34: Failure the-good-parts

“Program testing can be used to show the presence of bugs, but never to show their absence!” – Edsger Dijkstra

Page 35: Failure the-good-parts

Testing & Checking

๏ Testing is good for

๏ Known-Knowns

๏ Checking is good for

๏ Unknown-Knowns

๏ Known-Unknowns

๏ Unknown-Unknowns

๏ Conclusion

๏ Use both!

Page 36: Failure the-good-parts

Quiz

val result = println(x,y)

Page 37: Failure the-good-parts

Death & Delay & Distributed Programs

๏ There is no apparent difference between death and delay in a distributed system

๏ "Distributed programming is all about retries and timeouts"

๏ Without distribution you'll always have a SPOF

๏ … but the more hardware you have, the higher the risk of failures
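A minimal sketch of "retries and timeouts" with the standard library's `Future`. The helper name `callWithRetry` and its parameters are assumptions made for this example, and it blocks with `Await` only to stay short. Since a timeout cannot tell a dead callee from a slow one, a retried request may end up running twice, so the sketch is only safe for requests that tolerate that.

```scala
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.duration._
import scala.util.{Failure, Try}

object RetrySketch {
  // Hypothetical helper: give each attempt `timeout` to answer and retry
  // on timeout until `attemptsLeft` is used up. Other errors are returned as-is.
  @annotation.tailrec
  def callWithRetry[A](attemptsLeft: Int, timeout: FiniteDuration)(call: () => Future[A]): Try[A] =
    Try(Await.result(call(), timeout)) match {
      case Failure(_: TimeoutException) if attemptsLeft > 1 =>
        callWithRetry(attemptsLeft - 1, timeout)(call)
      case other => other
    }
}
```

When the last attempt also times out, the caller gets the `Failure` holding the `TimeoutException` and has to decide what "no answer" means for it.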

Page 38: Failure the-good-parts

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” – Leslie Lamport

Page 39: Failure the-good-parts

Traditional Blocking RPC

๏What if: Request is lost

๏What if: Response is lost

๏Caller is held hostage by the Callee

๏… Stockholm Syndrome anyone?

http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf

Page 40: Failure the-good-parts

Defensive programming

๏ "Paranoid programming"

๏ Mixes concerns

๏ Unclear responsibilities

๏ At best gives a false sense of security

๏ Yields systems that fail extraordinarily

Page 41: Failure the-good-parts

try {
  val breakfast =
    try {
      prepare(new Breakfast)
    } catch {
      case ex: OutOfJamError => …
    } finally {
      …
    }
  eat(breakfast)
} catch {
  case ex: BreakfastOverflowError => …
} finally {
  …
}

Page 42: Failure the-good-parts

Yes We Can Make Failure Management Fun

Page 43: Failure the-good-parts

Distribution

Page 44: Failure the-good-parts

Replication & Failover

Page 45: Failure the-good-parts

CircuitBreakers

Page 46: Failure the-good-parts

CircuitBreakers

๏Benefits

๏Relieves pressure on failing parts

๏Are self-healing

๏Can be operated manually
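The three benefits map onto a small state machine. The sketch below is a hand-rolled, single-threaded illustration and not the API of any library; `Breaker`, `trip`, `reset`, `maxFailures` and `resetAfterMillis` are names chosen for this example, and the clock is passed in so the behaviour can be checked without sleeping.

```scala
import scala.util.{Failure, Success, Try}

object CircuitBreakerSketch {
  sealed trait State
  case object Closed extends State
  case object Open extends State
  case object HalfOpen extends State

  final class BreakerOpenException extends RuntimeException("circuit breaker is open")

  // Not thread-safe: a sketch of the state transitions only.
  final class Breaker(maxFailures: Int, resetAfterMillis: Long, now: () => Long) {
    private var failures = 0
    private var openedAt = 0L
    private var forcedOpen = false
    private var current: State = Closed

    // Self-healing: after the reset period an open breaker lets one call probe.
    def state: State = {
      if (current == Open && !forcedOpen && now() - openedAt >= resetAfterMillis)
        current = HalfOpen
      current
    }

    // Manual operation: an operator can force the breaker open or closed.
    def trip(): Unit = { current = Open; forcedOpen = true }
    def reset(): Unit = { current = Closed; failures = 0; forcedOpen = false }

    def call[A](body: => A): Try[A] = state match {
      case Open =>
        // Relieves pressure: the failing part is not invoked at all.
        Failure(new BreakerOpenException)
      case _ =>
        val result = Try(body)
        result match {
          case Success(_) => reset()
          case Failure(_) =>
            failures += 1
            if (current == HalfOpen || failures >= maxFailures) {
              current = Open
              openedAt = now()
            }
        }
        result
    }
  }
}
```

A caller wraps each request as `breaker.call(riskyOperation())` and treats a `BreakerOpenException` as "do not bother asking right now".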

Page 47: Failure the-good-parts

Supervisors

๏ Components dealing with the failure of subcomponents

๏ Decouples failure from validation

๏ Makes it obvious who is responsible for what
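A minimal sketch of that division of responsibility, without any actor library: the worker answers with a result or a validation message, and anything it throws goes to the supervisor, which decides what happens to the worker. `Worker`, `Supervisor`, `Directive`, `submit` and the `Restart`/`Stop` choices are hypothetical names for this illustration.

```scala
import scala.util.{Failure, Success, Try}

object SupervisorSketch {
  sealed trait Directive
  case object Restart extends Directive
  case object Stop extends Directive

  // The worker reports validation verdicts as values (Left) and results as Right.
  trait Worker { def handle(input: String): Either[String, Int] }

  final class Supervisor(newWorker: () => Worker, decide: Throwable => Directive) {
    private var worker: Option[Worker] = Some(newWorker())
    var restarts = 0

    // The caller sees a result, a validation message, or no reply at all.
    // Failures are handled here and never reach the caller.
    def submit(input: String): Option[Either[String, Int]] =
      worker.flatMap { w =>
        Try(w.handle(input)) match {
          case Success(reply) => Some(reply)
          case Failure(cause) =>
            decide(cause) match {
              case Restart =>
                worker = Some(newWorker())
                restarts += 1
              case Stop =>
                worker = None
            }
            None
        }
      }
  }
}
```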

Page 48: Failure the-good-parts

Supervisors

[Diagram: Service, Supervisor, Input, Result/Validation, Failures / Recovery]

Page 49: Failure the-good-parts

Supervision

“Quis custodiet ipsos custodes?” (Who will guard the guards themselves?) – Decimus Iunius Iuvenalis

Page 50: Failure the-good-parts

Bulkheading

๏Compartmentalization

๏Prevent failures from cascading

๏Plays well with redundancy & failover
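One common way to compartmentalize on the JVM is to give each dependency its own small, bounded thread pool, so that a stuck dependency can only exhaust its own threads. The sketch below assumes that approach; the `compartment` helper is a hypothetical name.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._

object BulkheadSketch {
  // Hypothetical helper: one fixed-size pool per dependency.
  // Work submitted to one compartment never borrows threads from another.
  def compartment(threads: Int): ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(threads))
}
```

With one compartment per downstream service, calls made on a healthy compartment still complete while every thread of another compartment is blocked. The pools must be shut down explicitly when they are no longer needed.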

Page 51: Failure the-good-parts

Graceful degradation

“An escalator can never break: it can only become stairs. You should never see an Escalator Temporarily Out Of Order sign, just Escalator Temporarily Stairs. Sorry for the convenience.” – Mitch Hedberg
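In code, becoming stairs usually means falling back to a simpler answer when an optional part fails. A minimal sketch, in which the page, the recommendation call and the static `bestsellers` fallback are all made up for the illustration:

```scala
import scala.util.Try

object DegradationSketch {
  final case class Page(items: List[String], personalised: Boolean)

  // Static fallback that needs no remote call.
  val bestsellers: List[String] = List("bestseller 1", "bestseller 2")

  // If the recommendation call fails, serve the page anyway with less on it.
  def render(recommend: () => List[String]): Page =
    Try(recommend())
      .map(items => Page(items, personalised = true))
      .getOrElse(Page(bestsellers, personalised = false))
}
```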

Page 52: Failure the-good-parts

My crystal ball

Page 53: Failure the-good-parts

Microservices

๏ Do one thing well

๏ Concurrent & Compartmentalized

๏ Location transparent

๏ Typed endpoints producing typed streams of data

๏ Exhibit compositionality

๏ Are async and non-blocking

๏ Support backpressure & flow control

Page 54: Failure the-good-parts

Summary

๏Failure management

๏… is not Validation

๏… need not be boring

๏… is not optional

๏There are real consequences

๏… and there are ways to avoid them!

Page 55: Failure the-good-parts

“Don't worry, be happy.” – Bobby McFerrin

Attribution: Steve Jurvetson

Page 56: Failure the-good-parts

Thank you!

๏ @viktorklang on Twitter

๏ Want to know more?

๏ http://akka.io

๏ http://typesafe.com

๏ http://reactivemanifesto.org

Page 57: Failure the-good-parts

End of transmission…