Architecting for failure - Why are distributed systems hard?

Architecting for Failure Why are distributed systems so hard? Markus Eisele

Transcript of Architecting for failure - Why are distributed systems hard?

Page 1: Architecting for failure - Why are distributed systems hard?

Architecting for Failure Why are distributed systems so hard?

Markus Eisele

Page 2: Architecting for failure - Why are distributed systems hard?

@myfear

Page 3: Architecting for failure - Why are distributed systems hard?

Evolution

Page 4: Architecting for failure - Why are distributed systems hard?

“Big Iron” (Centralized)
• Extreme Uptime (99.999)
• Vertical Scaling
• Custom Hardware
• Hardware High Availability
• Centralized

“Enterprise” (Shared)
• Designed for availability (99.9)
• Commodity Hardware
• Replicated

“Cloud” (Self Service)
• Designed for failure (99.999)
• Horizontal Scaling
• Virtualized / Cloud
• Software High Availability
• Distributed

Page 5: Architecting for failure - Why are distributed systems hard?

[Chart: number of enterprise projects over time (60s, 80s, 90s, 2000, 2014, 2016, 2020, 2030) for Mainframe, Enterprise, and Cloud]

Distribution of projects over time. Disclaimer: my personal prediction!

Page 6: Architecting for failure - Why are distributed systems hard?

Today’s biggest problem?

Page 7: Architecting for failure - Why are distributed systems hard?

• High Infrastructure Cost: 11%
• Awful Downtime: 9%
• Meeting Demand: 21%
• Release Frequency: 20%
• Developer Velocity: 39%

Page 8: Architecting for failure - Why are distributed systems hard?

Meeting demands.

http://www.internetlivestats.com/internet-users/

J2EE

Spring

RoR

Akka

Reactive Manifesto

Microservices

Page 9: Architecting for failure - Why are distributed systems hard?

What the hell is “Developer Velocity” anyway?

Page 10: Architecting for failure - Why are distributed systems hard?

Release frequency!!

bit.ly/helloworldmsa

Page 11: Architecting for failure - Why are distributed systems hard?

And this is why we have Microservices..

Page 12: Architecting for failure - Why are distributed systems hard?

Scale, Deploy, Develop Independently

Page 13: Architecting for failure - Why are distributed systems hard?
Page 14: Architecting for failure - Why are distributed systems hard?

REQ: Building and Scaling Microservices

• Lightweight runtime
• Cross-Service Security
• Transaction Management
• Service Scaling
• Load Balancing
• SLAs
• Flexible Deployment
• Configuration
• Service Discovery
• Service Versions
• Monitoring
• Governance
• Asynchronous communication
• Non-blocking I/O
• Streaming Data
• Polyglot Services
• Modularity (Service definition)
• High performance persistence (CQRS)
• Event handling / messaging (ES)
• Eventual consistency
• API Management
• Health check and recovery

Page 15: Architecting for failure - Why are distributed systems hard?

If the components do not compose cleanly, then all you are doing is shifting complexity from inside a component to the connections between components. Not just does this just move complexity around, it moves it to a place that's less explicit and harder to control.

Martin Fowler

https://martinfowler.com/articles/microservices.html

Page 16: Architecting for failure - Why are distributed systems hard?

How do we handle “failures” in centralized or shared infrastructures?

Page 17: Architecting for failure - Why are distributed systems hard?
Page 18: Architecting for failure - Why are distributed systems hard?

Why did Application Server become a thing?

• Network and Threading
• Two Phase Commit (2PC)
• Shared resources
• Manageability
• Clustering supports scalability, performance, and availability.
• Programming models
• Standardization

https://antoniogoncalves.org/2013/07/03/monster-component-in-java-ee-7/

Page 19: Architecting for failure - Why are distributed systems hard?

Checked vs. Unchecked Exceptions

If a client can reasonably be expected to recover from an exception, make it a checked exception. If a client cannot do anything to recover from the exception, make it an unchecked exception.

https://docs.oracle.com/javase/tutorial/essential/exceptions/runtime.html
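
The rule of thumb quoted above can be sketched in a few lines of Java. This is only an illustration: the class names (BidTooLowException, AuctionMisconfiguredException, Auction, ExceptionDemo) are made up for this example and do not come from the talk or from any library.

```java
/** Checked: the caller can reasonably recover, for example by asking for a higher bid. */
class BidTooLowException extends Exception {
    BidTooLowException(int offered, int minimum) {
        super("Bid " + offered + " is below the minimum of " + minimum);
    }
}

/** Unchecked: a programming or configuration error the caller cannot fix at runtime. */
class AuctionMisconfiguredException extends RuntimeException {
    AuctionMisconfiguredException(String message) {
        super(message);
    }
}

class Auction {
    private final int minimumBid;

    Auction(int minimumBid) {
        if (minimumBid < 0) {
            throw new AuctionMisconfiguredException("minimumBid must not be negative");
        }
        this.minimumBid = minimumBid;
    }

    /** The compiler forces every caller to deal with BidTooLowException. */
    int placeBid(int amount) throws BidTooLowException {
        if (amount < minimumBid) {
            throw new BidTooLowException(amount, minimumBid);
        }
        return amount;
    }
}

class ExceptionDemo {
    /** Recovers from the checked exception by turning it into a user-facing message. */
    static String tryBid(Auction auction, int amount) {
        try {
            return "accepted " + auction.placeBid(amount);
        } catch (BidTooLowException e) {
            return "rejected: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        Auction auction = new Auction(10);
        System.out.println(tryBid(auction, 20));
        System.out.println(tryBid(auction, 5));
    }
}
```

The unchecked exception is deliberately not caught in tryBid: in a centralized application it would bubble up to a global handler.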

Page 20: Architecting for failure - Why are distributed systems hard?

It wasn’t easy – but manageable.

https://docs.oracle.com/javase/tutorial/essential/exceptions/runtime.html

• MVC handles checked
• Global exception handlers handle unchecked
• Centralized log files

Page 21: Architecting for failure - Why are distributed systems hard?
Page 22: Architecting for failure - Why are distributed systems hard?

'If it ain't broke, don't fix it!' Bert Lance 1977.

Page 23: Architecting for failure - Why are distributed systems hard?

What is different for Microservices?

Page 24: Architecting for failure - Why are distributed systems hard?

Microservices are Distributed Systems.

Page 25: Architecting for failure - Why are distributed systems hard?
Page 26: Architecting for failure - Why are distributed systems hard?
Page 27: Architecting for failure - Why are distributed systems hard?

What is Lagom?

• Reactive Microservices Framework for the JVM
• Focused on right sized services
• Asynchronous I/O and communication as first class priorities
• Highly productive development environment
• Takes you all the way to production
• https://github.com/lagom/online-auction-java

Page 28: Architecting for failure - Why are distributed systems hard?

Protect Yourself

with Circuit Breakers

Page 29: Architecting for failure - Why are distributed systems hard?

CircuitBreakers

Page 30: Architecting for failure - Why are distributed systems hard?

CircuitBreakers

Page 31: Architecting for failure - Why are distributed systems hard?

CircuitBreakers

Page 32: Architecting for failure - Why are distributed systems hard?

CircuitBreakers

Page 33: Architecting for failure - Why are distributed systems hard?

Circuit Breakers

default Descriptor descriptor() {
    return named("item").withCalls(
        pathCall("/api/item", this::createItem),
        restCall(Method.POST, "/api/item/:id/start", this::startAuction),
        pathCall("/api/item/:id", this::getItem),
        restCall(Method.PUT, "/api/item/:id", this::updateItem),
        pathCall("/api/item?userId&status", this::getItemsForUser)
    ).withCircuitBreaker(CircuitBreaker.identifiedBy("item"));
}
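
To make the idea behind a circuit breaker concrete, here is a minimal plain-Java sketch of the usual closed / open / half-open state machine. It is not Lagom's implementation. The class name SimpleCircuitBreaker and the parameters maxFailures and resetTimeoutMillis are assumptions made for this illustration, and the sketch is single-threaded on purpose.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal, single-threaded circuit breaker sketch (hypothetical, not Lagom's implementation). */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int maxFailures;          // consecutive failures before the breaker opens
    private final long resetTimeoutMillis;  // how long to stay open before allowing a trial call
    private final LongSupplier clock;       // injectable clock so the behaviour is easy to test

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int maxFailures, long resetTimeoutMillis, LongSupplier clock) {
        this.maxFailures = maxFailures;
        this.resetTimeoutMillis = resetTimeoutMillis;
        this.clock = clock;
    }

    State state() {
        return state;
    }

    /** Runs the call, or returns the fallback immediately while the breaker is open. */
    <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < resetTimeoutMillis) {
                return fallback.get(); // fail fast and leave the struggling service alone
            }
            state = State.HALF_OPEN;   // timeout elapsed: allow a single trial call
        }
        try {
            T result = protectedCall.get();
            failures = 0;
            state = State.CLOSED;      // success closes the breaker again
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= maxFailures) {
                state = State.OPEN;    // trip (or re-trip) the breaker
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

A caller would wrap each remote call, for example breaker.call(() -> client.fetch(), () -> cachedValue), where client and cachedValue are placeholders. A production breaker also needs thread safety and call timeouts, which this sketch leaves out.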

Page 34: Architecting for failure - Why are distributed systems hard?

Degraded beats

Unavailable

Page 35: Architecting for failure - Why are distributed systems hard?

Degraded > Unavailable

Search

Bid

Item

Page 36: Architecting for failure - Why are distributed systems hard?

Degraded > Unavailable

Search

Bid

Item

Page 37: Architecting for failure - Why are distributed systems hard?

CompletionStage<PSequence<Bid>> bidHistoryFuture =
    bidService.getBids(itemUuid)
        .invoke()
        .exceptionally(error -> {
            // Degrade to an empty bid history instead of failing the whole request.
            log.warn("Bidding service failed to load", error);
            return TreePVector.empty();
        });

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CompletionStage.html#exceptionally-java.util.function.Function-
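
The same degradation idea can be tried without Lagom, using only the JDK. In the sketch below, BidHistoryFallback, loadBids and loadBidsOrEmpty are hypothetical names, and the "remote call" is simulated with a CompletableFuture that either completes normally or exceptionally.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

class BidHistoryFallback {
    /** Stand-in for a remote call; fails when the (simulated) bid service is down. */
    static CompletionStage<List<String>> loadBids(boolean serviceUp) {
        CompletableFuture<List<String>> result = new CompletableFuture<>();
        if (serviceUp) {
            result.complete(Arrays.asList("alice: 10", "bob: 12"));
        } else {
            result.completeExceptionally(new IllegalStateException("bid service unavailable"));
        }
        return result;
    }

    /** Degrades to an empty bid history instead of failing the whole page. */
    static CompletionStage<List<String>> loadBidsOrEmpty(boolean serviceUp) {
        return loadBids(serviceUp).exceptionally(error -> {
            System.err.println("Bidding service failed to load: " + error.getMessage());
            return Collections.<String>emptyList();
        });
    }

    public static void main(String[] args) {
        // Service up: the real bids. Service down: an empty list, not an exception.
        System.out.println(loadBidsOrEmpty(true).toCompletableFuture().join());
        System.out.println(loadBidsOrEmpty(false).toCompletableFuture().join());
    }
}
```

The item page can still render when the bid service is down; it just shows no bid history.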

Page 38: Architecting for failure - Why are distributed systems hard?

Bulkheading (Kind of Important)
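
One common way to build a bulkhead is to cap the number of concurrent calls into a dependency, so that a slow dependency cannot use up every thread in the caller. The sketch below uses a JDK Semaphore for that. The Bulkhead class and its method names are assumptions for illustration only; the talk does not prescribe this particular mechanism.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Semaphore-based bulkhead sketch: at most maxConcurrentCalls run at the same time. */
class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the task if a permit is free, otherwise returns the rejection value at once. */
    <T> T call(Supplier<T> task, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get(); // compartment is full: shed load instead of queueing
        }
        try {
            return task.get();
        } finally {
            permits.release();     // always hand the permit back, even on failure
        }
    }
}
```

Giving each downstream service its own Bulkhead instance keeps a failure in one compartment from sinking the others.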

Page 39: Architecting for failure - Why are distributed systems hard?
Page 40: Architecting for failure - Why are distributed systems hard?

Duplication isn’t a bad thing

Page 41: Architecting for failure - Why are distributed systems hard?

Degraded > Unavailable

Search

Bid

Item

Page 42: Architecting for failure - Why are distributed systems hard?

Publish/Subscribe

Topic<BidEvent> bidEvents();

default Descriptor descriptor() {
    return named("bidding").withCalls(
        pathCall("/api/item/:id/bids", this::placeBid),
        pathCall("/api/item/:id/bids", this::getBids)
    ).publishing(
        topic("bidding-BidEvent", this::bidEvents)
    );
}

Page 43: Architecting for failure - Why are distributed systems hard?

Publish/Subscribe

Topic<BidEvent> bidEventTopic = biddingService.bidEvents();
bidEventTopic.subscribe()
    .atLeastOnce(Flow.<BidEvent>create()
        .map(this::toDocument)
        .mapAsync(1, indexedStore::store));

Page 44: Architecting for failure - Why are distributed systems hard?

Always have a plan B.

Page 45: Architecting for failure - Why are distributed systems hard?

What you can do..

• Fallback pattern (cache instead of DB)
• The cost of resilience should be accuracy or latency.
• CAP Theorem: you can't have all three of consistency, availability, and partition tolerance. Partitions will happen, so your choice is to sacrifice availability or consistency.

https://codahale.com/you-cant-sacrifice-partition-tolerance/
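
The fallback pattern from this slide (cache instead of DB) might look roughly like the sketch below. CachedFallbackReader is a hypothetical name, the "database" is any function that may throw, and the cache is a plain in-memory map of last known values. The price of the fallback is accuracy, since the cached value may be stale.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Read path that serves the last known value when the primary store fails. */
class CachedFallbackReader<K, V> {
    private final Function<K, V> primary;                           // e.g. a database query, may throw
    private final Map<K, V> lastKnown = new ConcurrentHashMap<>();  // last successfully read values

    CachedFallbackReader(Function<K, V> primary) {
        this.primary = primary;
    }

    /** Returns fresh data when possible, otherwise possibly stale data, otherwise empty. */
    Optional<V> read(K key) {
        try {
            V fresh = primary.apply(key);
            if (fresh != null) {
                lastKnown.put(key, fresh); // remember it for the next outage
            }
            return Optional.ofNullable(fresh);
        } catch (RuntimeException e) {
            // Plan B: trade accuracy for availability and serve what we saw last.
            return Optional.ofNullable(lastKnown.get(key));
        }
    }
}
```

Keys that were never read successfully still come back empty during an outage, so callers have to handle the degraded case explicitly.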

Page 46: Architecting for failure - Why are distributed systems hard?

Do you remember?

Page 47: Architecting for failure - Why are distributed systems hard?

8 fallacies of distributed computing

1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

Page 48: Architecting for failure - Why are distributed systems hard?

Lessons learned.

Page 49: Architecting for failure - Why are distributed systems hard?

Some things to remember.

• Distributed systems are different because they fail often.
• Writing robust distributed systems costs more than writing robust single-machine systems.
• Robust, open source distributed systems are much less common than robust, single-machine systems.
• Coordination is very hard.
• “It’s slow” is the hardest problem you’ll ever debug.
• Find ways to be partially available.

https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Page 50: Architecting for failure - Why are distributed systems hard?

Where do we go from here?

Page 51: Architecting for failure - Why are distributed systems hard?

http://www.ofbizian.com/2016/07/from-fragile-to-antifragile-software.html

Page 52: Architecting for failure - Why are distributed systems hard?
Page 53: Architecting for failure - Why are distributed systems hard?

Next Steps! Download and try Lagom!

Project Site: http://www.lightbend.com/lagom

GitHub Repo: https://github.com/lagom

Documentation: http://www.lagomframework.com/documentation/1.3.x/java/Home.html

Example: https://github.com/lagom/online-auction-java

Page 54: Architecting for failure - Why are distributed systems hard?

Written for architects and developers that must quickly gain a fundamental understanding of microservice-based architectures, this free O’Reilly report explores the journey from SOA to microservices, discusses approaches to dismantling your monolith, and reviews the key tenets of a Reactive microservice:

• Isolate all the Things
• Act Autonomously
• Do One Thing, and Do It Well
• Own Your State, Exclusively
• Embrace Asynchronous Message-Passing
• Stay Mobile, but Addressable
• Collaborate as Systems to Solve Problems

http://bit.ly/ReactiveMicroservice

Page 55: Architecting for failure - Why are distributed systems hard?

The detailed example in this report is based on Lagom, a new framework that helps you follow the requirements for building distributed, reactive systems.

• Get an overview of the Reactive Programming model and basic requirements for developing reactive microservices

• Learn how to create base services, expose endpoints, and then connect them with a simple, web-based user interface

• Understand how to deal with persistence, state, and clients

• Use integration technologies to start a successful migration away from legacy systems

http://bit.ly/DevelopReactiveMicroservice