Preparing for distributed system failures using akka #ScalaMatsuri
Copyright © 2017 TIS Inc. All rights reserved.
Preparing for distributed system failures using Akka
2017.2.25 Scala Matsuri
Yugo Maede @yugolf

Who am I?
@yugolf
TIS Inc. provides the "Reactive Systems Consulting Service":
- support for PoC projects
- design reviews
- code reviews, etc.
https://twitter.com/okapies/status/781439220330164225

Today's Topics
- What are Architectural Safety Measures in a distributed system?
- How to realize them with Akka

Microservices mean distributed systems
Moving from a monolith to microservices means building a distributed system.

"Moore's law is dead" means "distributed systems are the beginning"
CPU performance has reached its limits, so the end of Moore's law is the dawn of distributed systems.

confronting distributed systems
Building large-scale systems requires distributed systems.
There is no business success without distributed systems.

building a distributed system is not easy
- Increasing the number of servers increases the number of failure points.
- A new enemy appears: the network.

Architectural Safety Measures
Define cross-functional requirements:
- availability
- response time and latency

systems based on failure
- Antifragile organizations are needed.
- Systems designed on the assumption that failures happen are needed.

Architectural Safety Measures need
- timeout
- bulkhead
- circuit breaker
- ...

Akka is here
Akka has tools to deal with distributed system failures.

Akka Actor
An actor processes messages in order of arrival, which makes asynchronous processing simple to implement.
[Diagram: participants send $10 messages to the host actor; three $10 messages wait in the host's mailbox, and the host's status is $30.]
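
The mailbox behaviour can be sketched in plain Scala. This is a toy model, not Akka: a single-threaded executor stands in for the mailbox, and `HostModel` and its methods are hypothetical names.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Toy "actor": a single-threaded executor plays the role of the mailbox,
// so messages are handled one at a time, in the order they were sent.
final class HostModel {
  private val mailbox = Executors.newSingleThreadExecutor()
  // Written only by the mailbox thread; volatile so other threads can read it.
  @volatile private var processed = Vector.empty[Int]

  // Fire-and-forget send: enqueue the message and return immediately.
  def tell(amount: Int): Unit =
    mailbox.execute(new Runnable {
      def run(): Unit = processed = processed :+ amount
    })

  // Let the mailbox drain, then return the messages in processing order.
  def stopAndGetProcessed(): Vector[Int] = {
    mailbox.shutdown()
    mailbox.awaitTermination(5, TimeUnit.SECONDS)
    processed
  }
}
```

Because only one thread ever touches the state, the model needs no locks, which is the same property that keeps actor code simple.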

Supervisor Hierarchy
Let it crash: the supervisor supervises its child actors and handles their failures.
[Diagram: a supervisor supervises two child actors; a child signals its failure to the supervisor.]
The supervisor responds with one of:
- restart
- resume
- stop
- escalate
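
A supervisor's policy is essentially a mapping from a failure to one of these four directives. The sketch below models that mapping in plain Scala; the directive objects are stand-ins that mirror Akka's names, and the exception-to-directive choices are a hypothetical policy, not a recommendation from the talk.

```scala
// Plain-Scala stand-ins for the four directives named on the slide.
sealed trait Directive
case object Resume extends Directive
case object Restart extends Directive
case object Stop extends Directive
case object Escalate extends Directive

object SupervisorModel {
  // Hypothetical policy: choose a directive from the kind of failure.
  def decide(failure: Throwable): Directive = failure match {
    case _: ArithmeticException      => Resume   // keep the child's state and carry on
    case _: IllegalStateException    => Restart  // replace the child with a fresh instance
    case _: IllegalArgumentException => Stop     // stop the child permanently
    case _                           => Escalate // let the supervisor's own parent decide
  }
}
```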

timeout

request-response needs timeout
The response may be slow, or it may never come back.
[Diagram: a request is sent, but the response is lost (☓).]

message passing
- tell (!): fire and forget
- ask (?): request-response with a timeout, for example 1 s
Use tell (fire and forget); when you use ask, set the timeout appropriately.

timeout configuration
Setting the timeout for ask:

import akka.pattern.{ask, AskTimeoutException}
import akka.util.Timeout
import scala.concurrent.duration._
import scala.util.{Failure, Success}
import system.dispatcher // ExecutionContext for the Future callbacks

// Every ask (?) in scope gives up after 5 seconds.
implicit val timeout = Timeout(5.seconds)

val response = kitchen ? KitchenActor.DripCoffee(count)
response.mapTo[OrderCompleted] onComplete {
  case Success(result) => log.info(s"success: ${result.message}")
  // No reply arrived within the timeout.
  case Failure(e: AskTimeoutException) => log.info(s"failure: ${e.getMessage}")
  case Failure(t) => log.info(s"failure: ${t.getMessage}")
}

if a receiver has a problem
- The sender asks (?) with a 1 s timeout. What happens if the receiver has a problem?
- The receiver's supervisor handles the failure (restart, resume, stop, or escalate) and never returns the failure to the sender.
- Because no response comes back (☓), the sender needs a timeout.

implementation of the ask pattern 1/2

?:

def ?(message: Any)(implicit timeout: Timeout, sender: ActorRef = Actor.noSender): Future[Any] =
  internalAsk(message, timeout, sender)

internalAsk:

private[pattern] def internalAsk(message: Any, timeout: Timeout, sender: ActorRef): Future[Any] = actorSel.anchor match {
  case ref: InternalActorRef ⇒
    if (timeout.duration.length <= 0)
      Future.failed[Any](
        new IllegalArgumentException(s"""Timeout length must not be negative, question not sent to [$actorSel]. Sender[$sender] sent the message of type "${message.getClass.getName}"."""))
    else {
      val a = PromiseActorRef(ref.provider, timeout, targetName = actorSel, message.getClass.getName, sender)
      actorSel.tell(message, a)
      a.result.future
    }
  case _ ⇒ Future.failed[Any](new IllegalArgumentException(s"""Unsupported recipient ActorRef type, question not sent to [$actorSel]. Sender[$sender] sent the message of type "${message.getClass.getName}"."""))
}

implementation of the ask pattern 2/2
A task is scheduled; when the timeout elapses, it fails the Future with an AskTimeoutException.

akka.pattern.PromiseActorRef:

def apply(provider: ActorRefProvider, timeout: Timeout, targetName: Any, messageClassName: String, sender: ActorRef = Actor.noSender): PromiseActorRef = {
  val result = Promise[Any]()
  val scheduler = provider.guardian.underlying.system.scheduler
  val a = new PromiseActorRef(provider, result, messageClassName)
  implicit val ec = a.internalCallingThreadExecutionContext
  val f = scheduler.scheduleOnce(timeout.duration) {
    result tryComplete Failure(
      new AskTimeoutException(s"""Ask timed out on [$targetName] after [${timeout.duration.toMillis} ms]. Sender[$sender] sent message of type "${a.messageClassName}"."""))
  }
  result.future onComplete { _ ⇒ try a.stop() finally f.cancel() }
  a
}
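
The same idea (a promise raced against a scheduled failure) can be sketched with the Scala and Java standard libraries alone. `TimeoutModel` and `withTimeout` are hypothetical names, and a JDK scheduled executor stands in for the actor system's scheduler.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit, TimeoutException}
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration.FiniteDuration

object TimeoutModel {
  // Daemon scheduler thread, so it never keeps the JVM alive.
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "timeout-scheduler")
      t.setDaemon(true)
      t
    }
  })

  def withTimeout[T](reply: Future[T], timeout: FiniteDuration): Future[T] = {
    val ec = ExecutionContext.global
    val result = Promise[T]()
    // Schedule the failure; tryFailure does nothing if the reply already won.
    val timer = scheduler.schedule(new Runnable {
      def run(): Unit = {
        result.tryFailure(new TimeoutException(s"timed out after ${timeout.toMillis} ms"))
        ()
      }
    }, timeout.toMillis, TimeUnit.MILLISECONDS)
    // The real reply completes the promise unless the timer fired first.
    reply.onComplete(r => result.tryComplete(r))(ec)
    // Whoever wins, the timer is no longer needed.
    result.future.onComplete(_ => timer.cancel(false))(ec)
    result.future
  }
}
```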

circuit breaker

a receiver is down
The service you want to call may be down.

response latency will rise
- normal: 100 ms
- abnormal (timeout = 1 s): 1 s
Responses degrade, and the resulting overload spreads the performance degradation.

apply circuit breaker
Use a circuit breaker so that a service that is down is no longer called.

what is a circuit breaker
https://martinfowler.com/bliki/CircuitBreaker.html
Once the failures reach a certain threshold, the circuit breaker trips and further calls are blocked.

circuit breaker has three states
http://doc.akka.io/docs/akka/current/common/circuitbreaker.html
- Closed: messages can be sent
- Open: messages cannot be sent
- Half-Open: after the reset timeout, a trial call is let through to decide between Closed and Open

decrease the latency
Stop pointless calls so that they add no latency.
- normal: 100 ms
- abnormal (timeout = 1 s), breaker Closed: 1 s
- abnormal, breaker Open: x ms

apply circuit breaker: implementation
http://doc.akka.io/docs/akka/current/common/circuitbreaker.html
After 5 failures the breaker becomes Open and lets no messages through for 1 minute.

val breaker =
  new CircuitBreaker(
    context.system.scheduler,
    maxFailures = 5,
    callTimeout = 10.seconds,
    resetTimeout = 1.minute).onOpen(notifyMeOnOpen())

def receive = {
  case "dangerousCall" =>
    breaker.withCircuitBreaker(Future(dangerousCall)) pipeTo sender()
}
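
The state transitions can be sketched as a small state machine in plain Scala. This is a single-threaded model, not Akka's `CircuitBreaker`: `BreakerModel` is a hypothetical name, the clock is injected so the example stays deterministic, and `callTimeout` is not modeled.

```scala
import scala.util.{Failure, Success, Try}

object BreakerModel {
  sealed trait State
  case object Closed extends State
  case object Open extends State
  case object HalfOpen extends State
  final class OpenException extends RuntimeException("circuit breaker is open")
}

// `now` returns the current time in milliseconds (injected for testability).
final class BreakerModel(maxFailures: Int, resetTimeoutMillis: Long, now: () => Long) {
  import BreakerModel._
  private var failures = 0
  private var openedAt = 0L
  private var current: State = Closed

  def state: State = {
    // Once resetTimeout has passed, an Open breaker allows one trial call.
    if (current == Open && now() - openedAt >= resetTimeoutMillis) current = HalfOpen
    current
  }

  def call[T](body: => T): Try[T] = state match {
    case Open => Failure(new OpenException) // fail fast: the body is never run
    case _ =>
      val result = Try(body)
      result match {
        case Success(_) =>
          failures = 0
          current = Closed
        case Failure(_) =>
          failures += 1
          // A failed trial call, or too many failures, opens the breaker.
          if (current == HalfOpen || failures >= maxFailures) {
            current = Open
            openedAt = now()
          }
      }
      result
  }
}
```

While the breaker is Open, callers get an immediate failure instead of waiting for a timeout, which is where the latency saving comes from.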

block threads
Blocking calls exhaust the threads, and the latency propagates to callers.
[Diagram: blocking calls occupy the threads of each service along the call chain.]

prevention of propagation
Cutting off the failing service keeps the problem from propagating upstream.

CAP trade-off
- read: return old information vs. don't return anything. Is it acceptable to return old information?
- write: just do my own work vs. synchronize with others. Is it acceptable to proceed without synchronizing with others?
[Diagram labels: cache, push]

rate limiting
A rate limiter protects the service from a burst of requests from the same client.
Example: no more than 100 requests in any 3 sec interval
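
One way to enforce "no more than N requests in any interval" is a sliding-window log per client. The sketch below is a single-threaded model in plain Scala; `RateLimiterModel` is a hypothetical name, the clock is passed in explicitly, and idle clients are never evicted.

```scala
import scala.collection.mutable

// At most `maxRequests` accepted requests per client in any window of `windowMillis`.
final class RateLimiterModel(maxRequests: Int, windowMillis: Long) {
  // Timestamps of the accepted requests, per client, oldest first.
  private val accepted = mutable.Map.empty[String, mutable.Queue[Long]]

  def tryAcquire(clientId: String, nowMillis: Long): Boolean = {
    val times = accepted.getOrElseUpdate(clientId, mutable.Queue.empty[Long])
    // Forget requests that have left the window.
    while (times.nonEmpty && nowMillis - times.head >= windowMillis) times.dequeue()
    if (times.size < maxRequests) {
      times.enqueue(nowMillis)
      true
    } else false
  }
}
```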

bulkhead

Even if there is damage next door, are you OK?
It is bad luck to be affected when an unrelated neighbor goes down.

bulkhead blocks the damage
Put a bulkhead between the actor that blocks threads and the actors that would otherwise be affected.
[Diagram: the blocking actor runs on its own threads, separate from the threads of the other actors.]

isolating the blocking calls to actors
Isolate blocking code together with its actor so that it shares no resources with the rest.

// Run the blocking actor on its own dispatcher.
val blockingActor = context.actorOf(
  Props[BlockingActor].withDispatcher("blocking-actor-dispatcher"),
  "blocking-actor")

class BlockingActor extends Actor {
  def receive = {
    case GetCustomer(id) =>
      // calling database
      …
  }
}

the blocking in Future
Future { // blocking }
A Future that blocks exhausts the dispatcher's threads.

using the default dispatcher
http://www.slideshare.net/ktoso/zen-of-akka#44

isolating the blocking Future
Future { // blocking }
Isolate the blocking work on its own threads.

using a dedicated dispatcher
http://www.slideshare.net/ktoso/zen-of-akka#44
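
Outside of Akka, the same isolation can be sketched with a dedicated `ExecutionContext`. In the model below `BulkheadModel`, its method names, and the pool size of 4 are assumptions; the sleep stands in for a blocking database call.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, Future}

object BulkheadModel {
  private val counter = new AtomicInteger(0)

  // Small fixed pool reserved for blocking work (size chosen arbitrarily).
  val blockingEc: ExecutionContext = ExecutionContext.fromExecutorService(
    Executors.newFixedThreadPool(4, new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"blocking-pool-${counter.incrementAndGet()}")
        t.setDaemon(true)
        t
      }
    }))

  // Blocking call, explicitly run on the dedicated pool.
  def blockingLookup(id: Int): Future[String] =
    Future {
      Thread.sleep(200) // stands in for a blocking database call
      s"customer-$id on ${Thread.currentThread().getName}"
    }(blockingEc)

  // Non-blocking work stays on the global pool and is not starved.
  def quickWork(n: Int): Future[Int] =
    Future(n + 1)(ExecutionContext.global)
}
```

Even when more blocking lookups are queued than the dedicated pool has threads, `quickWork` still completes promptly because it never competes for those threads.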

CQRS: Command and Query Responsibility Segregation
Separate commands from queries:
- command: write
- query: read

cluster

hardware will fail
If each of 365 machines fails once a year, then on average one machine fails every day.
Won't a machine break even when it is hosted in the cloud?

availability of AWS
Example: a site that tracks the availability of AWS
https://cloudharmony.com/status-of-compute-and-storage-and-cdn-and-dns-for-aws

preparing for failure of hardware
- minimize single points of failure
- persist state so that it can be recovered

monitor each other by sending heartbeats
Cluster members (node1 to node4) send heartbeats to one another to detect failures.
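
Heartbeat-based detection can be sketched with a simple deadline rule: a node is suspected once its last heartbeat is older than an acceptable pause. This is a deliberate simplification (Akka Cluster uses a phi accrual failure detector rather than a fixed deadline), and `HeartbeatModel` is a hypothetical name.

```scala
// A node is reported unreachable when no heartbeat has been seen
// for longer than `acceptablePauseMillis`.
final class HeartbeatModel(acceptablePauseMillis: Long) {
  private var lastSeen = Map.empty[String, Long]

  // Record a heartbeat received from `node` at time `nowMillis`.
  def heartbeat(node: String, nowMillis: Long): Unit =
    lastSeen = lastSeen.updated(node, nowMillis)

  // Nodes whose last heartbeat is too old at time `nowMillis`.
  def unreachable(nowMillis: Long): Set[String] =
    lastSeen.collect {
      case (node, seen) if nowMillis - seen > acceptablePauseMillis => node
    }.toSet
}
```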

recovering state
With akka-persistence, state can be restored by replaying the events that were persisted.
[Diagram: a cluster of node1 to node4; events are persisted, then replayed to rebuild the state.]
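
Replay amounts to folding the persisted events over an empty state. The sketch below shows that idea in plain Scala without akka-persistence; the event types and `ReplayModel` are hypothetical, and the journal is just an in-memory sequence.

```scala
object ReplayModel {
  sealed trait Event
  final case class Ordered(item: String, count: Int) extends Event
  final case class Served(item: String, count: Int) extends Event

  // State: outstanding cups per item.
  type State = Map[String, Int]

  // Apply one event to the state; must be deterministic so replay is repeatable.
  def applyEvent(state: State, event: Event): State = event match {
    case Ordered(item, n) => state.updated(item, state.getOrElse(item, 0) + n)
    case Served(item, n)  => state.updated(item, state.getOrElse(item, 0) - n)
  }

  // Recovery: a left fold of the journal, starting from the empty state.
  def replay(journal: Seq[Event]): State =
    journal.foldLeft(Map.empty[String, Int])(applyEvent)
}
```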

the database may be down or overloaded
Do not retry blindly while the persistence store has not yet recovered from its failure.
[Diagram: node3 and node4 keep attempting replay although the db has not started yet.]

BackoffSupervisor
http://doc.akka.io/docs/akka/current/general/supervision.html#Delayed_restarts_with_the_BackoffSupervisor_pattern
The supervisor tries to start the child again at increasing intervals of 3, 6, 12, ... seconds.

val childProps = Props(classOf[EchoActor])

val supervisor = BackoffSupervisor.props(
  Backoff.onStop(
    childProps,
    childName = "myEcho",
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2 // adds 20% "noise" to vary the intervals slightly
  ))

system.actorOf(supervisor, name = "echoSupervisor")
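
The 3, 6, 12, ... sequence follows from doubling the minimum backoff on each restart and capping it at the maximum. A sketch of that arithmetic, assuming this doubling rule; `BackoffModel.delay` is a hypothetical helper, and the random jitter is passed in explicitly so the function stays pure.

```scala
import scala.concurrent.duration._

object BackoffModel {
  // Delay before restart number `restartCount` (0-based):
  // minBackoff * 2^restartCount, capped at maxBackoff,
  // then stretched by jitter * randomFactor, with jitter in [0, 1).
  def delay(restartCount: Int, minBackoff: FiniteDuration, maxBackoff: FiniteDuration,
            randomFactor: Double, jitter: Double): FiniteDuration = {
    val doubled = minBackoff.toMillis * math.pow(2, restartCount)
    val capped = math.min(doubled, maxBackoff.toMillis.toDouble)
    (capped * (1.0 + jitter * randomFactor)).toLong.millis
  }
}
```

With the slide's settings and no jitter, the delays are 3 s, 6 s, 12 s, 24 s and then stay at the 30 s cap. The jitter keeps many restarting actors from hitting the recovering database at the same instant.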

split brain resolver

In the case of network partitions
The network between the cluster nodes (node1 to node4) can be cut.

using split brain resolver
When a partition divides the nodes into Cluster1 and Cluster2, apply a split brain resolver to keep them consistent.
[Diagram: node1 to node5 divided between Cluster1 and Cluster2.]

strategy 1/4: Static Quorum
quorum-size = 3
Which side can survive? The side whose number of nodes is quorum-size or more.
[Diagram: node1 to node5 split by a partition.]
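
The static quorum rule reduces to a size check on each side of the partition. A minimal sketch: `StaticQuorumModel` is a hypothetical name, and the split used in the usage note is an assumed example, not the one drawn on the slide.

```scala
object StaticQuorumModel {
  // A side survives only if it still sees at least `quorumSize` members.
  def survives(reachableMembers: Set[String], quorumSize: Int): Boolean =
    reachableMembers.size >= quorumSize
}
```

With quorum-size = 3 and five nodes split into {node1, node2} and {node3, node4, node5}, only the second side survives. The rule is safe only while the cluster has at most quorum-size * 2 - 1 members; with more, both sides could reach the quorum.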

strategy 2/4: Keep Majority
Which side can survive? The side that holds more than 50% of the nodes.
[Diagram: node1 to node5 split by a partition.]
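
The majority rule as stated on the slide can be sketched as a strict comparison. `KeepMajorityModel` is a hypothetical name; an even split leaves no survivor in this model, whereas a real resolver needs a tie-breaker for that case.

```scala
object KeepMajorityModel {
  // A side survives only if it holds strictly more than half of all members.
  def survives(reachable: Int, totalMembers: Int): Boolean =
    reachable * 2 > totalMembers
}
```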

strategy 3/4: Keep Oldest
Which side can survive? The side that contains the oldest node (node1 in the diagram).
[Diagram: node1 to node5 split by a partition; node1 is marked as the oldest.]
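
A sketch of the keep-oldest decision, assuming each member carries the time it joined. `KeepOldestModel`, `Member`, and `joinedAtMillis` are hypothetical, and the set of all members is assumed to be non-empty.

```scala
object KeepOldestModel {
  final case class Member(name: String, joinedAtMillis: Long)

  // A side survives only if it contains the member that joined first.
  // Assumes `allMembers` is non-empty.
  def survives(reachable: Set[Member], allMembers: Set[Member]): Boolean = {
    val oldest = allMembers.minBy(_.joinedAtMillis)
    reachable.contains(oldest)
  }
}
```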

strategy 4/4: Keep Referee
Which side can survive? The side that contains the given referee node.
address = "akka.tcp://system@node1:port"
[Diagram: node1 to node5 split by a partition; node1 is the referee.]
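
The referee rule is a membership check against one configured address. In this sketch `KeepRefereeModel` is a hypothetical name, and the addresses in the usage note, including port 2552, are made up for illustration.

```scala
object KeepRefereeModel {
  // A side survives only if it can still reach the configured referee node.
  def survives(reachableAddresses: Set[String], refereeAddress: String): Boolean =
    reachableAddresses.contains(refereeAddress)
}
```

For example, with the referee at `akka.tcp://system@node1:2552`, the side that contains that address survives no matter how few nodes it has, so the referee itself becomes a single point of failure.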

SBR is included in Lightbend Reactive Platform
- Lightbend Reactive Platform: http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html
- akka-cluster-custom-downing: https://github.com/TanUkkii007/akka-cluster-custom-downing

idempotence

Failed to receive ack message
"coffee please!" is sent as Order(coffee, 1).
If the ack is not received and the message is resent, Order(coffee, 1) arrives twice and becomes a duplicate order.

idempotence
"coffee, please!" is sent as Order(id1, coffee, 1).
Keep the system consistent with an idempotent design, so that receiving a message several times causes no problem: applying Order(id1, coffee, 1) multiple times is not harmful.
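
One common way to make order handling idempotent is to remember the ids already applied and ignore repeats. A sketch in plain Scala: `IdempotentOrders` and `Kitchen` are hypothetical names, and the set of seen ids is kept in memory only.

```scala
object IdempotentOrders {
  final case class Order(id: String, item: String, count: Int)

  final class Kitchen {
    private var seen = Set.empty[String]
    private var cups = Map.empty[String, Int]

    // Always acknowledges (returns true), but applies each order id only once,
    // so a resend after a lost ack does not create a second order.
    def receive(order: Order): Boolean = {
      if (!seen.contains(order.id)) {
        seen += order.id
        cups = cups.updated(order.item, cups.getOrElse(order.item, 0) + order.count)
      }
      true
    }

    // Total cups ordered so far for `item`.
    def total(item: String): Int = cups.getOrElse(item, 0)
  }
}
```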

summary
- Microservices mean distributed systems
- define Cross-Functional Requirements
- design for failure
Failures will happen, so accept them.

By using Akka: Akka provides a toolkit for dealing with failures in distributed systems.
- timeout
- circuit breaker
- bulkhead
- cluster
- backoff
- split brain resolver
- ...

reference materials
- Building Microservices
- Reactive Design Patterns
- Reactive Application Development
- Effective Akka
- http://akka.io/

now translating
Contributors wanted for the Japanese translation of akka.io! Please join us on Gitter.
- https://gitter.im/akka-ja/akka-doc-ja
- https://github.com/akka-ja/akka-doc-ja/

THANK YOU