Joe armstrong erlanga_languageforprogrammingreliablesystems
-
Upload
home -
Category
Technology
-
view
356 -
download
0
description
Transcript of Joe armstrong erlanga_languageforprogrammingreliablesystems
![Page 1: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/1.jpg)
Systems that never stop (and Erlang)
Joe Armstrong
![Page 2: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/2.jpg)
How can we get
10 nines reliability?
![Page 3: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/3.jpg)
SIX LAWS
![Page 4: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/4.jpg)
ONE
ISOLATION
![Page 5: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/5.jpg)
ISOLATION
10 nines = 99.99999999% availability
P(fail) = 10-10
If P(fail | one computer) = 10-3 then P(fail | four computers) = 10-12
Fixed
![Page 6: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/6.jpg)
TWO
CONCURRENCY
![Page 7: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/7.jpg)
Concurrency
World is concurrent
Need at least TWO computers to make a non-stop sytem
TWO computer is concurrent and distributed
![Page 8: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/8.jpg)
“My first message is that concurrency
is best regarded as a program structuring principle”
Structured concurrent programming – Tony Hoare
Redmond, July 2001
![Page 9: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/9.jpg)
THREE
MUSTDETECT FAILURES
![Page 10: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/10.jpg)
Failure detection
If you can’t detect a failure you can’t fix it
Must work across machine boundariesthe entire machine might fail
Implies distributed error handling, no shared state, asynchronous messaging
![Page 11: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/11.jpg)
FOUR
FAULT IDENTIFICATION
![Page 12: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/12.jpg)
Failure Identification
Fault detection is not enough - you must no why the failure occurred
Implies that you have sufficient information for post hock debugging
![Page 13: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/13.jpg)
FIVE
LIVE CODE
UPGRADE
![Page 14: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/14.jpg)
Live code upgrade
Must upgrade software while it is running
Want zero down time
![Page 15: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/15.jpg)
SIX
STABLE STORAGE
![Page 16: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/16.jpg)
Stable storage
Must store stuff forever
No backup necessary - storage just works
Implies multiple copies, distribution, ...
Must keep crash reports
![Page 17: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/17.jpg)
HISTORY
Those who cannot learn from history are doomed to repeat it.
George Santayana
![Page 18: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/18.jpg)
GRAYAs with hardware, the key to software fault-tolerance is tohierarchically decompose large systems into modules, each module beinga unit of service and a unit of failure. A failure of a module doesnot propagate beyond the module.
...
The process achieves fault containment by sharing no state withother processes; its only contact with other processes is via messagescarried by a kernel message system
- Jim Gray- Why do computers stop and what can be done about it- Technical Report, 85.7 - Tandem Computers,1985
![Page 19: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/19.jpg)
SCHNEIDERHalt on failure in the event of an error a processorshould halt instead of performing a possibly erroneous operation.
Failure status property when a processor fails,other processors in the system must be informed. The reason for failure must be communicated.
Stable Storage Property The storage of a processor should be partitioned into stable storage (which survives a processor crash) and volatile storage which is lost if a processor crashes.
SchneiderACM Computing Surveys 22(4):229-319, 1990
![Page 20: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/20.jpg)
GRAY Fault containment through fail-fast software modules. Process-pairs to tolerant hardware and transient software faults. Transaction mechanisms to provide data and message integrity. Transaction mechanisms combined with process-pairs to ease
exception handling and tolerate software fault Software modularity through processes and messages.
![Page 21: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/21.jpg)
KAYFolks --
Just a gentle reminder that I took some pains at the last OOPSLA to try to remind everyone that Smalltalk is not only NOT its syntax or the class library, it is not even about classes. I'm sorry that I long ago coined the term "objects" for this topic because it gets many people to focus on the lesser idea.
The big idea is "messaging" -- that is what the kernal of Smalltalk/Squeak is all about (and it's something that was never quite completed in our Xerox PARC phase)....
http://lists.squeakfoundation.org/pipermail/squeak-dev/1998-October/017019.html
![Page 22: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/22.jpg)
GRAYSoftware modularity through processes and messages. As with hardware, the key to software fault-tolerance is to hierarchically decompose large systems into modules, each module being a unit of service and a unit of failure. A failure of a module does not propagate beyond the module.
![Page 23: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/23.jpg)
Fail FastThe process approach to fault isolation advocates that the process software be fail-fast, it should either function correctly or itshould detect the fault, signal failure and stop operating.
Processes are made fail-fast by defensive programming. They checkall their inputs, intermediate results and data structures as a matterof course. If any error is detected, they signal a failure and stop. Inthe terminology of [Cristian], fail-fast software has small faultdetection latency.
GrayWhy ...
![Page 24: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/24.jpg)
Fail EarlyA fault in a software system can cause one or more errors. The latency time which is the interval between the existence of the fault and the occurrence of the error can be very high, which complicates the backwards analysis of an error ...
For an effective error handling we must detect errors and failures as early as possible
Renzel - Error Handling for Business Information Systems,
Software Design and Management, GmbH & Co. KG, München, 2003
![Page 25: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/25.jpg)
ARMSTRONG Processes are the units of error encapsulation. Errors
occurring in a process will not affect other processes in the system. We call this property strong isolation.
Processes do what they are supposed to do or fail as soon as possible.
Failure and the reason for failure can be detected by remote processes.
Processes share no state, but communicate by message passing.
ArmstrongMaking reliable systems in the presence of software errors
PhD Thesis, KTH, 2003
![Page 26: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/26.jpg)
COMMERCIAL BREAK
![Page 27: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/27.jpg)
![Page 28: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/28.jpg)
Joe’s 2’nd theorem
Whatever Joe starts talking about, He will end up talking about Erlang
![Page 29: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/29.jpg)
Erlang was designed
to programfault-tolerant
systems
![Page 30: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/30.jpg)
Erlang
Concurrent programming Functional
programming
Fault tolerance
ConcurrencyOriented
programming
Multicore
![Page 31: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/31.jpg)
Erlang Very light-weight processes Very fast message passing Total separation between processes Automatic marshalling/demarshalling Fast sequential code Strict functional code Dynamic typing Transparent distribution Compose sequential AND concurrent code
![Page 32: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/32.jpg)
Properties
No sharing Hot code replacement Pure message passing No locks Lots of computers (= fault tolerant scalable ...) Functional programming (no side effects)
![Page 33: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/33.jpg)
What is COP?
➡ Large numbers of processes➡ Complete isolation between processes➡ Location transparency➡ No Sharing of data➡ Pure message passing systems
Machine
Process
Message
![Page 34: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/34.jpg)
Thread Safety
Erlang programs are automatically thread safe if they don't use an external resource.
![Page 35: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/35.jpg)
Functional
If you call thesame function twice with
the same argumentsit should return the same value
“jolly good” Joe Armstrong
![Page 36: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/36.jpg)
No Mutable State Mutable state needs locks
No mutable state = no locks = programmers bliss
![Page 37: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/37.jpg)
Multicore ready
![Page 38: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/38.jpg)
The rise of the cores 2 cores won't hurt you 4 cores will hurt a little 8 cores will hurt a bit 16 will start hurting 32 cores will hurt a lot (2009) ... 1 M cores ouch (2019) (complete paradigm shift)
1997 1 Tflop = 850 KW 2007 1 Tflop = 24 W (factor 35,000) 2017 1 Tflop = ?
![Page 39: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/39.jpg)
LAWS
![Page 40: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/40.jpg)
ISOLATION CONCURRENCY
Pid = spawn(.....)Pid = spawn(Node, ....)
Pid ! Message receive Pattern1 -> Actions1; Pattern2 -> Actions2; ...end
![Page 41: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/41.jpg)
FAULT IDENTIFICATION
link(Pid), receive {Pid, ‘EXIT’, Why} -> ... end
![Page 42: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/42.jpg)
LIVE CODE UPGRADE
Can upgrade code while its running
Existing processes continue to use original code, new processes run new code - no mixups of namespaces
Sophisticated roll-forward, roll-back, roll-back-on-error functions in OTP libraries
Properly designed systems can be rolled-forward and back with no loss of service. Not easy, but possible
![Page 43: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/43.jpg)
STABLE STORAGE Performed in libraries
mnesia:transaction( fun() -> Val = mnesia:read(Key), mnesia:write({Key,Val}), ... end)
![Page 44: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/44.jpg)
Projects CouchDB Amazon SimpleDB Mochiweb (facebook chat) Scalaris Nitrogren Ejabberd (xmpp) Rabbit MQ (amqp) ....
![Page 45: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/45.jpg)
Companies
Ericsson Amazon Tail-f Kreditor Synapse ...
![Page 46: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/46.jpg)
Books
![Page 47: Joe armstrong erlanga_languageforprogrammingreliablesystems](https://reader033.fdocuments.in/reader033/viewer/2022052906/5588fbb7d8b42a4f1a8b462c/html5/thumbnails/47.jpg)
THE END