Preparing for distributed system failures using akka #ScalaMatsuri

Copyright © 2017 TIS Inc. All rights reserved. Preparing for distributed system failures using Akka 2017.2.25 Scala Matsuri Yugo Maede @yugolf

Transcript of Preparing for distributed system failures using akka #ScalaMatsuri

Page 1: Preparing for distributed system failures using akka #ScalaMatsuri

Preparing for distributed system failures using Akka

2017.2.25 Scala Matsuri

Yugo Maede @yugolf

Page 2: Preparing for distributed system failures using akka #ScalaMatsuri

Who am I?

@yugolf

TIS Inc. provides a "Reactive Systems Consulting Service":

- support PoC projects
- review designs
- review code
- etc.

https://twitter.com/okapies/status/781439220330164225

Page 3: Preparing for distributed system failures using akka #ScalaMatsuri

Today's Topics

- What are Architectural Safety Measures in a distributed system?
- How to realize them with Akka

Page 4: Preparing for distributed system failures using akka #ScalaMatsuri

Microservices mean distributed systems

from Monolith to Microservices

Page 5: Preparing for distributed system failures using akka #ScalaMatsuri

"Moore's law is dead" means "distributed systems are the beginning"

The end of Moore's law, that is, the limit of CPU performance, marks the dawn of distributed systems.

Page 6: Preparing for distributed system failures using akka #ScalaMatsuri

Confronting distributed systems

Building large-scale systems requires distributed systems.

There is no business success without distributed systems.

Page 7: Preparing for distributed system failures using akka #ScalaMatsuri

Building a distributed system is not easy

- increasing the number of servers means increasing the number of failure points
- we face a new enemy: the network

Page 8: Preparing for distributed system failures using akka #ScalaMatsuri

Architectural Safety Measures

Define Cross-Functional Requirements:

- availability
- response time and latency

Page 9: Preparing for distributed system failures using akka #ScalaMatsuri

Systems based on failure

- we need Antifragile Organizations
- we need systems designed on the assumption that failures happen

Page 10: Preparing for distributed system failures using akka #ScalaMatsuri

Architectural Safety Measures need

- timeout
- bulkhead
- circuit breaker
- ...

Page 11: Preparing for distributed system failures using akka #ScalaMatsuri

Akka is here

Akka has tools to deal with distributed system failures.

Page 12: Preparing for distributed system failures using akka #ScalaMatsuri

Akka Actor

An actor processes messages in order of arrival, which makes asynchronous processing simple to implement.

[Figure: participant actors send $10 messages to a host actor. Labels: mailbox ($10, $10, $10), status ($30).]

Page 13: Preparing for distributed system failures using akka #ScalaMatsuri

Supervisor Hierarchy

let it crash

A supervisor monitors its child actors and handles their failures.

[Figure: a supervisor supervises two child actors; a child signals failure to the supervisor.]

The supervisor responds with one of:

- restart
- resume
- stop
- escalate

Page 14: Preparing for distributed system failures using akka #ScalaMatsuri

timeout

Page 15: Preparing for distributed system failures using akka #ScalaMatsuri

request-response needs timeout

[Figure: request, response]

A response may be slow, or may never come back.

Page 16: Preparing for distributed system failures using akka #ScalaMatsuri

message passing

- tell (fire and forget): !
- ask: ? (with a timeout, here 1s)

Prefer tell (fire and forget). When you use ask, set the timeout appropriately.

Page 17: Preparing for distributed system failures using akka #ScalaMatsuri

timeout configuration for ask

    import akka.pattern.{ask, AskTimeoutException}
    import akka.util.Timeout
    import scala.concurrent.duration._
    import scala.util.{Failure, Success}
    import system.dispatcher

    // Every ask below fails with AskTimeoutException if no reply arrives in 5 seconds.
    implicit val timeout: Timeout = Timeout(5.seconds)

    val response = kitchen ? KitchenActor.DripCoffee(count)
    response.mapTo[OrderCompleted] onComplete {
      case Success(result) =>
        log.info(s"success: ${result.message}")
      case Failure(e: AskTimeoutException) =>
        log.info(s"failure: ${e.getMessage}")
      case Failure(t) =>
        log.info(s"failure: ${t.getMessage}")
    }

Page 18: Preparing for distributed system failures using akka #ScalaMatsuri

if a receiver has a problem

What happens to an ask (?, timeout 1s) when the receiver has a problem?

Page 19: Preparing for distributed system failures using akka #ScalaMatsuri

if a receiver has a problem

The supervisor never returns the failure to the sender. It handles the failure itself with one of:

- restart
- resume
- stop
- escalate

Page 20: Preparing for distributed system failures using akka #ScalaMatsuri

if a receiver has a problem

timeout!

Because no response comes back, the sender needs a timeout (here ask with 1s).

Page 21: Preparing for distributed system failures using akka #ScalaMatsuri

implementation of the ask pattern 1/2

? delegates to internalAsk:

    def ?(message: Any)(implicit timeout: Timeout, sender: ActorRef = Actor.noSender): Future[Any] =
      internalAsk(message, timeout, sender)

    private[pattern] def internalAsk(message: Any, timeout: Timeout, sender: ActorRef): Future[Any] =
      actorSel.anchor match {
        case ref: InternalActorRef ⇒
          if (timeout.duration.length <= 0)
            Future.failed[Any](
              new IllegalArgumentException(s"""Timeout length must not be negative, question not sent to [$actorSel]. Sender[$sender] sent the message of type "${message.getClass.getName}"."""))
          else {
            val a = PromiseActorRef(ref.provider, timeout, targetName = actorSel, message.getClass.getName, sender)
            actorSel.tell(message, a)
            a.result.future
          }
        case _ ⇒
          Future.failed[Any](new IllegalArgumentException(s"""Unsupported recipient ActorRef type, question not sent to [$actorSel]. Sender[$sender] sent the message of type "${message.getClass.getName}"."""))
      }

Page 22: Preparing for distributed system failures using akka #ScalaMatsuri

implementation of the ask pattern 2/2

akka.pattern.PromiseActorRef

    def apply(provider: ActorRefProvider, timeout: Timeout, targetName: Any, messageClassName: String, sender: ActorRef = Actor.noSender): PromiseActorRef = {
      val result = Promise[Any]()
      val scheduler = provider.guardian.underlying.system.scheduler
      val a = new PromiseActorRef(provider, result, messageClassName)
      implicit val ec = a.internalCallingThreadExecutionContext
      val f = scheduler.scheduleOnce(timeout.duration) {
        result tryComplete Failure(
          new AskTimeoutException(s"""Ask timed out on [$targetName] after [${timeout.duration.toMillis} ms]. Sender[$sender] sent message of type "${a.messageClassName}"."""))
      }
      result.future onComplete { _ ⇒ try a.stop() finally f.cancel() }
      a
    }

A scheduler task is registered; when the timeout elapses it completes the promise with an AskTimeoutException.
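The same "promise plus scheduled failure" idea can be sketched without Akka. The following is only an illustration using the Scala standard library: TimeoutSketch, withTimeout and the daemon-thread scheduler are hypothetical names that stand in for PromiseActorRef and the actor system's scheduler, and a plain TimeoutException stands in for AskTimeoutException.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, ThreadFactory, TimeUnit, TimeoutException}
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Failure

object TimeoutSketch {
  // Single daemon thread used only to fire timeouts (stands in for Akka's scheduler).
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "timeout-sketch")
        t.setDaemon(true)
        t
      }
    })

  /** Completes with the result of `f`, or fails with TimeoutException after `timeout`. */
  def withTimeout[A](f: Future[A], timeout: FiniteDuration): Future[A] = {
    val result = Promise[A]()
    // Schedule the failure; whichever side completes the promise first wins.
    val task = scheduler.schedule(new Runnable {
      def run(): Unit = {
        result.tryComplete(Failure(new TimeoutException(s"timed out after ${timeout.toMillis} ms")))
        ()
      }
    }, timeout.toMillis, TimeUnit.MILLISECONDS)
    f.onComplete(t => result.tryComplete(t))
    // Cancel the pending timer once a result of either kind is in.
    result.future.onComplete(_ => task.cancel(false))
    result.future
  }
}
```

As in the Akka code above, the timer is cancelled as soon as the promise completes, so a fast reply does not leave a stale task behind.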

Page 23: Preparing for distributed system failures using akka #ScalaMatsuri

circuit breaker

Page 24: Preparing for distributed system failures using akka #ScalaMatsuri

a receiver is down

The service we want to query may be down.

Page 25: Preparing for distributed system failures using akka #ScalaMatsuri

response latency will rise

- normal: 100ms
- abnormal (timeout=1s): 1s

Responses degrade, and the resulting overload makes the performance degradation spread.

Page 26: Preparing for distributed system failures using akka #ScalaMatsuri

apply circuit breaker

Use a circuit breaker so that requests are not sent to a service that is down.

Page 27: Preparing for distributed system failures using akka #ScalaMatsuri

what is a circuit breaker

Once the failures reach a certain threshold, the circuit breaker trips and further connections are suppressed.

https://martinfowler.com/bliki/CircuitBreaker.html

Page 28: Preparing for distributed system failures using akka #ScalaMatsuri

circuit breaker has three statuses

http://doc.akka.io/docs/akka/current/common/circuitbreaker.html

- Closed: messages can be sent
- Open: messages cannot be sent
- Half-Open: after the reset timeout, one trial call is let through
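The state transitions can be sketched as a small state machine. This is a simplified illustration in plain Scala, not Akka's CircuitBreaker: the Breaker class, its parameters and the explicit nowMs clock are hypothetical, and the Half-Open state follows the description in the Akka documentation linked above.

```scala
object CircuitBreakerSketch {
  sealed trait State
  case object Closed extends State
  case object Open extends State
  case object HalfOpen extends State

  /** Hypothetical minimal breaker; the caller passes the current time to keep it deterministic. */
  final class Breaker(maxFailures: Int, resetTimeoutMs: Long) {
    private var state: State = Closed
    private var failures = 0
    private var openedAt = 0L

    def currentState: State = state

    /** Runs `call` unless the breaker is open; returns None when the call is rejected or fails. */
    def attempt[A](nowMs: Long)(call: => A): Option[A] = {
      // After the reset timeout, let one trial call through.
      if (state == Open && nowMs - openedAt >= resetTimeoutMs) state = HalfOpen
      if (state == Open) None
      else
        try {
          val result = call
          failures = 0
          state = Closed
          Some(result)
        } catch {
          case scala.util.control.NonFatal(_) =>
            failures += 1
            // A failed trial call, or too many failures, (re)opens the breaker.
            if (state == HalfOpen || failures >= maxFailures) {
              state = Open
              openedAt = nowMs
            }
            None
        }
    }
  }
}
```

While the breaker is open, the call body is never evaluated, which is what keeps a caller from waiting on a service that is known to be down.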

Page 29: Preparing for distributed system failures using akka #ScalaMatsuri

decrease the latency

- normal: 100ms
- abnormal (timeout=1s), breaker Closed: 1s
- abnormal, breaker Open: x ms

Stop making pointless requests so that the extra latency never occurs.

Page 30: Preparing for distributed system failures using akka #ScalaMatsuri

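apply circuit breaker: implementation

http://doc.akka.io/docs/akka/current/common/circuitbreaker.html

    import akka.pattern.{CircuitBreaker, pipe}
    import scala.concurrent.Future
    import scala.concurrent.duration._
    import context.dispatcher

    // Opens after 5 failures; a call slower than 10 seconds counts as a failure.
    val breaker = new CircuitBreaker(
      context.system.scheduler,
      maxFailures = 5,
      callTimeout = 10.seconds,
      resetTimeout = 1.minute
    ).onOpen(notifyMeOnOpen())

    def receive = {
      case "dangerousCall" =>
        breaker.withCircuitBreaker(Future(dangerousCall)) pipeTo sender()
    }

After 5 failures the breaker becomes Open and no messages are sent for 1 minute.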

Page 31: Preparing for distributed system failures using akka #ScalaMatsuri

block threads

Blocking calls exhaust the threads, and the latency propagates.

[Figure: two services whose threads are occupied by blocking calls]

Page 32: Preparing for distributed system failures using akka #ScalaMatsuri

prevention of propagation

Cutting off the failing service keeps the problem from propagating upstream.

[Figure: two services with their threads; the blocking service is cut off]

Page 33: Preparing for distributed system failures using akka #ScalaMatsuri

CAP trade-off

- read: return old information vs don't return anything
- write: just do my work vs need to synchronize with others

[Figure labels: cache, push]

Is it acceptable to return old information? Is it fine to proceed without synchronizing with others?

Page 34: Preparing for distributed system failures using akka #ScalaMatsuri

rate limiting

A rate limiter protects the service from a burst of requests from the same client.

Example: no more than 100 requests in any 3 sec interval
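One way to enforce a rule like "no more than 100 requests in any 3 sec interval" is a per-client sliding window. The sketch below is plain Scala and is only one possible approach; SlidingWindowLimiter, its parameters and the explicit nowMs clock are hypothetical, not an Akka API.

```scala
import scala.collection.mutable

object RateLimiterSketch {
  /** Hypothetical sliding-window limiter: at most `maxRequests` per `windowMs` for each client. */
  final class SlidingWindowLimiter(maxRequests: Int, windowMs: Long) {
    // Timestamps of accepted requests, oldest first, kept per client.
    private val accepted = mutable.Map.empty[String, mutable.Queue[Long]]

    /** Returns true and records the request if the client is under its limit. */
    def allow(clientId: String, nowMs: Long): Boolean = {
      val q = accepted.getOrElseUpdate(clientId, mutable.Queue.empty[Long])
      // Forget requests that have slid out of the window.
      while (q.nonEmpty && nowMs - q.head >= windowMs) q.dequeue()
      if (q.size < maxRequests) {
        q.enqueue(nowMs)
        true
      } else false
    }
  }
}
```

Because the window is tracked per client id, one noisy client is throttled without affecting the others.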

Page 35: Preparing for distributed system failures using akka #ScalaMatsuri

bulkhead

Page 36: Preparing for distributed system failures using akka #ScalaMatsuri

Even if there is damage next door, are you OK?

The unlucky case: an unrelated neighbor goes down and you are affected too.

Page 37: Preparing for distributed system failures using akka #ScalaMatsuri

bulkhead blocks the damage

Put a bulkhead between the actor that blocks threads and the actors that would otherwise be affected.

[Figure: two separate sets of threads; the blocking is confined to one of them]

Page 38: Preparing for distributed system failures using akka #ScalaMatsuri

isolating the blocking calls to actors

    // Run the blocking actor on its own dispatcher so it cannot starve other actors.
    val blockingActor = context.actorOf(
      Props[BlockingActor].withDispatcher("blocking-actor-dispatcher"),
      "blocking-actor")

    class BlockingActor extends Actor {
      def receive = {
        case GetCustomer(id) =>
          // calling database
          …
      }
    }

Isolate blocking code together with its actor so that it does not share resources with the rest.

Page 39: Preparing for distributed system failures using akka #ScalaMatsuri

blocking in a Future

    Future { // blocking }

A blocking Future starves the dispatcher of threads.

Page 40: Preparing for distributed system failures using akka #ScalaMatsuri

using the default dispatcher

http://www.slideshare.net/ktoso/zen-of-akka#44

Page 41: Preparing for distributed system failures using akka #ScalaMatsuri

isolating the blocking Future

    Future { // blocking }

Isolate the blocking work on its own threads.

[Figure: two separate sets of threads]
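Outside of an actor system, the same isolation can be sketched with a dedicated ExecutionContext from the standard library. This is an illustration only: BlockingIsolationSketch, the pool size of 4 and the thread names are assumptions, and Thread.sleep stands in for a real blocking call. In Akka the equivalent would be a dispatcher defined in configuration.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BlockingIsolationSketch {
  private val counter = new AtomicInteger(0)

  // Daemon threads with a recognizable name, so the pool is easy to spot in thread dumps.
  private val factory = new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, s"blocking-pool-${counter.incrementAndGet()}")
      t.setDaemon(true)
      t
    }
  }

  // Dedicated pool for blocking work; the size 4 is an arbitrary choice for this sketch.
  val blockingEc: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4, factory))

  /** Runs a blocking call on the dedicated pool and reports the thread it ran on. */
  def blockingCall(sleepMs: Long): Future[String] =
    Future {
      Thread.sleep(sleepMs) // stands in for a blocking database or HTTP call
      Thread.currentThread().getName
    }(blockingEc)

  /** Small usage example: waits for one blocking call and returns its thread name. */
  def demo(): String = Await.result(blockingCall(10L), 2.seconds)
}
```

Passing the execution context explicitly makes it visible at the call site that the blocking work does not run on the default pool.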

Page 42: Preparing for distributed system failures using akka #ScalaMatsuri

using a dedicated dispatcher

http://www.slideshare.net/ktoso/zen-of-akka#44

Page 43: Preparing for distributed system failures using akka #ScalaMatsuri

CQRS: Command and Query Responsibility Segregation

Separate commands from queries.

- command: write
- query: read

Page 44: Preparing for distributed system failures using akka #ScalaMatsuri

cluster

Page 45: Preparing for distributed system failures using akka #ScalaMatsuri

hardware will fail

If there are 365 machines, each failing once a year, then on average one machine fails every day.

Wouldn't a machine break even when it's hosted on the cloud?
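The arithmetic behind the 365-machine example can be checked with a short sketch. The function names are hypothetical, and the second function additionally assumes that machines fail independently and that failures are spread evenly over the year, which the slide does not state.

```scala
object FailureRateSketch {
  /** Expected number of machine failures per day across the fleet. */
  def expectedFailuresPerDay(machines: Int, failuresPerMachinePerYear: Double): Double =
    machines * failuresPerMachinePerYear / 365.0

  /** Probability that at least one machine fails on a given day (independence assumed). */
  def probAtLeastOneFailureToday(machines: Int, failuresPerMachinePerYear: Double): Double =
    1.0 - math.pow(1.0 - failuresPerMachinePerYear / 365.0, machines.toDouble)
}
```

Under those assumptions the expected value is one failure per day, while the chance that a particular day sees at least one failure is roughly 63%.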

Page 46: Preparing for distributed system failures using akka #ScalaMatsuri

availability of AWS

Example: a site that tracks the availability of AWS

https://cloudharmony.com/status-of-compute-and-storage-and-cdn-and-dns-for-aws

Page 47: Preparing for distributed system failures using akka #ScalaMatsuri

preparing for failure of hardware

- minimize single points of failure
- allow recovery of state by persisting it

Page 48: Preparing for distributed system failures using akka #ScalaMatsuri

Cluster

Cluster members monitor each other by sending heartbeats, and detect failures that way.

[Figure: a cluster of node1, node2, node3 and node4]

Page 49: Preparing for distributed system failures using akka #ScalaMatsuri

recovering state

State can be restored by replaying the events that were persisted (akka-persistence).

[Figure: in a cluster of node1 to node4, events are persisted and later replayed to rebuild the state]

Page 50: Preparing for distributed system failures using akka #ScalaMatsuri

the database may be down or overloaded

While the persistence store has not yet recovered, do not retry blindly.

[Figure: node3 and node4 keep attempting replay while the db has not started yet]

Page 51: Preparing for distributed system failures using akka #ScalaMatsuri

BackoffSupervisor

http://doc.akka.io/docs/akka/current/general/supervision.html#Delayed_restarts_with_the_BackoffSupervisor_pattern

    import akka.actor.Props
    import akka.pattern.{Backoff, BackoffSupervisor}
    import scala.concurrent.duration._

    val childProps = Props(classOf[EchoActor])

    // Restart the child after it stops, waiting longer after each attempt.
    val supervisor = BackoffSupervisor.props(
      Backoff.onStop(
        childProps,
        childName = "myEcho",
        minBackoff = 3.seconds,
        maxBackoff = 30.seconds,
        randomFactor = 0.2 // adds 20% "noise" to vary the intervals slightly
      ))

    system.actorOf(supervisor, name = "echoSupervisor")

The child is started again at increasing intervals of 3, 6, 12, ... seconds.
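The growth of the intervals can be sketched as a doubling delay capped at maxBackoff. This is a deterministic simplification for illustration: BackoffSketch.delay is a hypothetical helper, it leaves out the random factor, and it is not the code Akka runs internally.

```scala
import scala.concurrent.duration._

object BackoffSketch {
  /** Delay before restart number `restartCount` (0-based): minBackoff * 2^n, capped at maxBackoff. */
  def delay(restartCount: Int, minBackoff: FiniteDuration, maxBackoff: FiniteDuration): FiniteDuration = {
    // Cap the exponent so the multiplication stays well within range.
    val factor = math.pow(2.0, math.min(restartCount, 30).toDouble)
    val millis = math.min(minBackoff.toMillis * factor, maxBackoff.toMillis.toDouble)
    millis.toLong.millis
  }
}
```

With minBackoff = 3 seconds and maxBackoff = 30 seconds, the first five delays are 3, 6, 12, 24 and 30 seconds; the cap is what stops the wait from growing without bound.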

Page 52: Preparing for distributed system failures using akka #ScalaMatsuri

split brain resolver

Page 53: Preparing for distributed system failures using akka #ScalaMatsuri

In the case of network partitions

The network connection between nodes can be cut.

[Figure: a cluster of node1, node2, node3 and node4]

Page 54: Preparing for distributed system failures using akka #ScalaMatsuri

using split brain resolver

Apply a split brain resolver to keep the partitioned clusters consistent.

[Figure: a partition splits node1 to node5 into Cluster1 and Cluster2]

Page 55: Preparing for distributed system failures using akka #ScalaMatsuri

strategy 1/4: Static Quorum

quorum-size = 3

Which side can survive? The side whose number of nodes is quorum-size or more.

[Figure: node1 to node5 split by a partition]

Page 56: Preparing for distributed system failures using akka #ScalaMatsuri

strategy 2/4: Keep Majority

Which side can survive? The side that holds more than 50% of the nodes.

[Figure: node1 to node5 split by a partition]
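The Static Quorum and Keep Majority rules can each be written as a one-line predicate. The sketch below only restates the rules as given on the slides: the function names and the use of plain node-name sets are assumptions, and details of the real resolver (such as how an exact tie is broken) are left out.

```scala
object QuorumStrategiesSketch {
  /** Static Quorum: a side survives if it still sees at least `quorumSize` nodes. */
  def staticQuorumSurvives(reachable: Set[String], quorumSize: Int): Boolean =
    reachable.size >= quorumSize

  /** Keep Majority: a side survives if it holds more than 50% of all known members. */
  def keepMajoritySurvives(reachable: Set[String], allMembers: Set[String]): Boolean =
    reachable.size * 2 > allMembers.size
}
```

For five nodes split three against two, both rules keep the three-node side and shut down the two-node side (with quorum-size = 3).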

Page 57: Preparing for distributed system failures using akka #ScalaMatsuri

strategy 3/4: Keep Oldest

Which side can survive? The side that contains the oldest node.

[Figure: node1 to node5 split by a partition; node1 is the oldest]

Page 58: Preparing for distributed system failures using akka #ScalaMatsuri

strategy 4/4: Keep Referee

Which side can survive? The side that contains the given referee node.

address = "akka.tcp://system@node1:port"

[Figure: node1 to node5 split by a partition; node1 is the referee]
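Keep Oldest and Keep Referee both reduce to a membership check on one distinguished node. The sketch below restates the slide rules only: the Member record and its upNumber field are hypothetical stand-ins for however the cluster tracks member age, and the referee address is the placeholder string from the slide.

```scala
object MemberStrategiesSketch {
  /** Hypothetical member record: a smaller `upNumber` means the node joined earlier. */
  final case class Member(address: String, upNumber: Int)

  /** Keep Oldest: a side survives if it contains the oldest member of the whole cluster. */
  def keepOldestSurvives(reachable: Set[Member], allMembers: Set[Member]): Boolean =
    allMembers.nonEmpty && reachable.contains(allMembers.minBy(_.upNumber))

  /** Keep Referee: a side survives if it contains the configured referee node. */
  def keepRefereeSurvives(reachable: Set[Member], refereeAddress: String): Boolean =
    reachable.exists(_.address == refereeAddress)
}
```

Unlike the size-based strategies, these two can keep the smaller side alive, as long as it holds the distinguished node.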

Page 59: Preparing for distributed system failures using akka #ScalaMatsuri

SBR is included in Lightbend Reactive Platform

- Lightbend Reactive Platform: http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html
- akka-cluster-custom-downing: https://github.com/TanUkkii007/akka-cluster-custom-downing

Page 60: Preparing for distributed system failures using akka #ScalaMatsuri

idempotence

Page 61: Preparing for distributed system failures using akka #ScalaMatsuri

Failed to receive ack message

If the ack is not received and the message is resent, the order is placed twice.

[Figure: "coffee please!" is sent as Order(coffee,1), then resent as a second Order(coffee,1)]

Page 62: Preparing for distributed system failures using akka #ScalaMatsuri

idempotence

Design for idempotence, so that applying a message multiple times is not harmful and consistency is maintained.

[Figure: "coffee, please!" is sent as Order(id1, coffee, 1), then resent with the same id]
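Giving each order an id lets the receiver recognize a resend. The sketch below is one simple way to do that in plain Scala; OrderBook and its methods are hypothetical, and a real system would also need to persist and eventually expire the set of ids it has seen.

```scala
object IdempotentOrdersSketch {
  final case class Order(id: String, item: String, quantity: Int)

  /** Hypothetical order book that ignores an order whose id it has already applied. */
  final class OrderBook {
    private var seen = Set.empty[String]
    private var totals = Map.empty[String, Int]

    /** Returns true if the order was applied, false if it was a duplicate. */
    def receive(order: Order): Boolean =
      if (seen.contains(order.id)) false
      else {
        seen = seen + order.id
        totals = totals.updated(order.item, totals.getOrElse(order.item, 0) + order.quantity)
        true
      }

    /** Total quantity ordered so far for an item. */
    def total(item: String): Int = totals.getOrElse(item, 0)
  }
}
```

Receiving Order(id1, coffee, 1) twice leaves the coffee total at 1, so the sender can safely resend whenever an ack is lost.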

Page 63: Preparing for distributed system failures using akka #ScalaMatsuri

summary

Page 64: Preparing for distributed system failures using akka #ScalaMatsuri

summary

- Microservices mean distributed systems
- define Cross-Functional Requirements
- design for failure

Failures will happen, so accept them.

Page 65: Preparing for distributed system failures using akka #ScalaMatsuri

summary

- timeout
- circuit breaker
- bulkhead
- cluster
- backoff
- split brain resolver
- ...

by using Akka

Akka comes with a toolkit for dealing with failures in distributed systems.

Page 66: Preparing for distributed system failures using akka #ScalaMatsuri

reference materials

- Building Microservices
- Reactive Design Patterns
- Reactive Application Development
- Effective Akka
- http://akka.io/

Page 67: Preparing for distributed system failures using akka #ScalaMatsuri

now translating

We are looking for volunteers to help translate akka.io! Please join us on Gitter.

https://gitter.im/akka-ja/akka-doc-ja

https://github.com/akka-ja/akka-doc-ja/

Page 68: Preparing for distributed system failures using akka #ScalaMatsuri

THANK YOU