Sistemas Distribuidos: ¿Para Qué?€¦ · Sistemas Distribuidos: ¿Para Qué? Mariano Guerra...
Transcript of Sistemas Distribuidos: ¿Para Qué?€¦ · Sistemas Distribuidos: ¿Para Qué? Mariano Guerra...
Sistemas Distribuidos: ¿Para Qué?Mariano Guerra @warianoguerra
instadeq.com @instadeq
Erlang Factory Light
Buenos Aires 2017
1 / 73
2 / 73
3 / 73
Disclaimer
4 / 73
New boring CRUD app, how can we make it exciting?
5 / 73
6 / 73
What could possibly go wrong?Well ... let's see
7 / 73
A = foo()B = bar(A)do_something(A, B)
8 / 73
Tracing/MetricsDapper, a Large-Scale Distributed Systems TracingInfrastructure
Open Tracing/zipkinErlang gh:project-fifo/otters
9 / 73
LogsCollection, Centralization, Search, Correlation
ELK, Splunk, flume, fluentd, "hadoop"
10 / 73
Timeouts, Retries, Exponential Backo�gh:Netflix/Histrix
Latency and Fault Tolerance
Stop cascading failures.
Fallbacks and graceful degradation.
Circuit Breakers.
11 / 73
Timeouts, Retries, Exponential Backo�gh:Netflix/Histrix
Realtime Operations
Realtime monitoring and configuration changes.
Watch service and property changes take effect immediately.
12 / 73
Timeouts, Retries, Exponential Backo�gh:Netflix/Histrix
Concurrency
Parallel execution.
13 / 73
New Kinds of ErrorsLeaky Abstractions
Timeouts
Transport Errors
Encoding/Parsing Errors
API Versioning
14 / 73
Beware of the Distributed Monolith
15 / 73
Fallacies of Distributed ComputingThe network is reliable.Latency is zero.Bandwidth is infinite.The network is secure.Topology doesn't change.There is one administrator.Transport cost is zero.The network is homogeneous.
16 / 73
Partial Failure
17 / 73
Gray FailureWhen at least one app makes the observation
That the system is unhealthy
But the observer observes
That the system is healthy.
18 / 73
Two Generals ProblemPitfalls and design challenges of attempting to coordinate an
action by communicating over an unreliable link.
19 / 73
First computer communication problem to be proved to beunsolvable.
20 / 73
FLP Imposibility ResultIn an asynchronous network
21 / 73
FLP Imposibility ResultIn an asynchronous network
Where messages may be delayed but not lost
22 / 73
FLP Imposibility ResultIn an asynchronous network
Where messages may be delayed but not lost
There is no consensus algorithm that is guaranteed toterminate in every execution for all starting conditions
23 / 73
FLP Imposibility ResultIn an asynchronous network
Where messages may be delayed but not lost
There is no consensus algorithm that is guaranteed toterminate in every execution for all starting conditions
If at least one node may fail-stop.
24 / 73
It's about consensus
Deals with ‘faulty nodes’
Nodes that aren’t receiving the messages that are being sent tothem are failed (Exempt from having to achieve consensus)
Partitioned node in FLP does not have to achieve consensus,
Since it is considered failed, but the same node in CAP must
25 / 73
Therefore it’s not possible to say whether a processor hascrashed or is simply taking a long time to respond.
The FLP result shows that in an asynchronous setting, whereonly one processor might crash,
There is no distributed algorithm that solves the consensusproblem.
26 / 73
CAP theorem
27 / 73
CAP theorem
28 / 73
Brewer’s Conjecture and the Feasibility of Consistent,Available, Partition-Tolerant Web ServicesIn an asynchronous network
Where messages may be lost
It is impossible to implement a sequentially consistent atomicread/write register
That responds eventually to every request under everypattern of message loss.
29 / 73
It's about serialized atomic objects (registers)
Deals with ‘partitions’
Nodes that aren’t receiving the messages that are being sent tothem are only partitioned
A CAP solution requires that any live node be able to correctlyserve requests, even if it has not received any messages
30 / 73
That A, it doesn't mean what you thing it meansAvailability refers to a liveness property of an algorithmwhere every request to a non-failing node must eventuallyreturn a valid response.
Not Availability of the system, the system can fail for otherreasons.
Don’t Settle For Eventual Consistency
The Limits of the CAP Theorem
31 / 73
Scalability! But at what COST?COST of distributed systems: the Configuration that
Outperforms a Single Thread
32 / 73
33 / 73
34 / 73
Scalability! But at what COST?Many published systems have unbounded COST
(They never outperform the best single threaded application)
Others are orders of magnitude slower even when usinghundreds of cores.
35 / 73
36 / 73
Lord, grant me the strength to accept systems that can run inone node/process,
37 / 73
Lord, grant me the strength to accept systems that can run inone node/process,
the courage to handle the distributed systems I have to,
38 / 73
Lord, grant me the strength to accept systems that can run inone node/process,
the courage to handle the distributed systems I have to,
and the wisdom to know the difference.
39 / 73
But ...
40 / 73
Cars are distributed systems"Analysts predict that the on-board computing power of anormal saloon will increase by 100x from 2016 to 2025,powering ADAS and IVI functions."
"ADAS combines information from the many sensors dottedall over the car, feeding into a large processing unit thatmakes sense of the data, and makes decisions in real time.These sensors include radar, lidar, ultrasonic and cameras"
The future of automotive is coming faster than you think
41 / 73
What is this?"Each core has an integrated network interface and a router,with each router connected to the four routers around it ....There are algorithms designed to reduce congestion, ... but abasic system will have buffers and queues and will knowhow busy the local network congestion is."
42 / 73
What is this?"Each core has an integrated network interface and a router,with each router connected to the four routers around it ....There are algorithms designed to reduce congestion, ... but abasic system will have buffers and queues and will knowhow busy the local network congestion is."
Your new CPU
Intel Skylake-X: The New Core-to-Core CommunicationParadigm
43 / 73
You need at least 2 computers for availability
44 / 73
Other Distributed SystemsData Store*Mobile AppsData Processing Pipeline*IoTMicroservicesCurrency*Games*
45 / 73
What to do about those?We have to answer at least this questions:
Who Does This?Who Knows This?When did This Happen?
And detect problems as soon as posible
46 / 73
Who Does This?
47 / 73
Amazon's Dynamo Paper
48 / 73
Shameless PlugThe Little Riak Core Book
49 / 73
Distributed Virtual ActorsOrleans - Virtual Actors
Erlang: gh:SpaceTime-IoT/erleans
50 / 73
Spark - Resilient Distributed Datasets: A Fault-TolerantAbstraction for In-Memory Cluster Computing
Spark Streaming - Discretized Streams: Fault-TolerantStreaming Computation at Scale
52 / 73
Who Knows This?
53 / 73
ConsensusPaxos
ZAB
Raft (Vis 1, Vis 2)
More
54 / 73
HyParView a membership protocol for reliable gossip-basedbroadcast
Erlang: lasp/partisan
Epidemic Broadcast Trees
Erlang: plumtree
55 / 73
Convergent and Commutative Replicated Data Types
56 / 73
Lasp
Lasp is a suite of libraries aimed at providing acomprehensive programming system for planetary scale Elixirand Erlang applications.
@cmeik
58 / 73
Merkle TreesGit, IPFS, riak_core_metadata AAE
How does node 3 gets the values broadcasted while he wasdown?
Gossip protocols, Epidemic Broadcast and EventualConsistency in Practice
59 / 73
60 / 73
Distributed Hash Tables
Bittorrent, IPFS, Erlang global*
Also: Consistent Hashing
gh:jlouis/dht
61 / 73
When did This Happen?
62 / 73
Time, clocks and the ordering of events in a distributedsystem
Virtual Time and Global States of Distributed Systems
Dotted Version Vectors: Efficient Causality Tracking forDistributed Key-Value Stores
ACM: Why Logical Clocks are Easy
63 / 73
Atomic Clocks!Inside Cloud Spanner and the CAP Theorem
64 / 73
Avoiding Problems
65 / 73
Simple testing can prevent most critical failures
Early detection of configuration errors to reduce failuredamage
66 / 73
Quick Check
Concuerror
TLA+
Lineage-driven Fault Injection
Jepsen
Simian Army
67 / 73
Learn TLA
TLA+ in Practice and Theory Part 1: The Principles of TLA+
68 / 73
A = foo()B = bar(A)do_something(A, B)
Life was simpler, wasn't it?
Let's enjoy single node systems... while we can :)
69 / 73
ResourcesAphyr's Distributed Systems Class Notes
Designing Data-Intensive Applications
70 / 73
PeopleAdrian Colyer @adriancolyer
Aphyr @aphyr
Christopher Meiklejohn @cmeik
Eric Brewer @eric_brewer
Peter Alvaro @palvaro
Peter Bailis @pbailis
71 / 73
Thanks
72 / 73