![Page 1: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/1.jpg)
Reliability and Timeliness Analysis of Fault-tolerant Distributed Publish/Subscribe Systems
Thad Pongthawornkamol, Klara Nahrstedt, University of Illinois at Urbana-Champaign
Guijun Wang, Boeing Research and Technology
![Page 2: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/2.jpg)
Publish / Subscribe Systems
● A pub/sub system is an interest-based communication paradigm.
● Each user can be either a publisher or a subscriber.
● The pub/sub broker network handles routing / matching / recovery.
[Figure: publishers (P) send events through the pub / sub broker network to subscribers (S).]
![Page 5: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/5.jpg)
Goal : Pub / Sub Performance Analysis
● Question : Given a publish / subscribe network, how can we predict the reliability / timeliness perceived by each subscriber?
● Several factors affect a subscriber's QoS:
○ Traffic load and middleware capacity
○ Pub / sub broker network failure and recovery
○ Subscriber mobility
● This paper focuses on broker network failure and recovery.
![Page 6: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/6.jpg)
Goal : Pub / Sub Performance Analysis
This paper proposes an analytical model that :
● captures failure / recovery behavior of publish / subscribe middleware.
● predicts reliability and timeliness perceived at each subscriber.
● supports several commonly used publish / subscribe fault tolerance algorithms.
The proposed analytical model can be used in :
● subscriber admission control
● broker network planning
● fault-tolerant publish / subscribe protocol selection
![Page 7: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/7.jpg)
Outline
● Motivation
● Model & Assumptions
● Reliability / Timeliness Analysis
● Results
● Conclusion
![Page 8: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/8.jpg)
Model : Subscriber Real-time Reliability
● Each published event has a lifetime (i.e., the period after publication beyond which the event is expired). In this paper, we assume all events have the same lifetime value D.
● Subscriber real-time reliability = the fraction of events of the subscriber's interest that are delivered to the subscriber before they expire.
[Figure: three events are published through the pub / sub middleware at t = 10s, 20s, and 30s with lifetime D = 5s; two events reach subscriber S, at t = 26s and t = 34s; real-time reliability = 0.33.]
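The definition above can be sketched in a few lines of Python. This is only an illustration: the function name and the pairing of publish times with delivery times are assumptions, since the slide does not state which arrival belongs to which event. The pairing below (first event lost, second late, third on time) is one reading that reproduces the slide's value.

```python
from typing import Optional, Sequence


def real_time_reliability(
    publish_times: Sequence[float],
    delivery_times: Sequence[Optional[float]],
    lifetime: float,
) -> float:
    """Fraction of published events delivered before they expire.

    delivery_times[i] is None when event i never reached the subscriber.
    """
    if not publish_times:
        raise ValueError("need at least one published event")
    on_time = sum(
        1
        for sent, received in zip(publish_times, delivery_times)
        if received is not None and received - sent < lifetime
    )
    return on_time / len(publish_times)


# Assumed pairing: the t = 10s event is lost, the t = 20s event arrives at
# t = 26s (6s > D, expired), the t = 30s event arrives at t = 34s (on time).
print(round(real_time_reliability([10, 20, 30], [None, 26, 34], 5), 2))  # 0.33
```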
![Page 9: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/9.jpg)
Analytical Framework
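[Figure: the analyzer takes the pub / sub network as input and outputs a predicted real-time reliability for each subscriber S (e.g., 0.99, 0.85, 0.94).]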
![Page 10: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/10.jpg)
Model : System Components
Component : Known Variables
● Publishers (P)
○ Each publisher's topic τP
○ Each publisher's average publishing rate λP (events / second)
● Subscribers (S)
○ Each subscriber's topic τS
● Brokers / Links
○ Each broker's failure rate γB (exponentially distributed)
○ Each broker's recovery rate σB (exponentially distributed)
○ Each link's failure rate γL (exponentially distributed)
○ Each link's recovery rate σL (exponentially distributed)
![Page 11: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/11.jpg)
Assumption : Pub/Sub Routing
● Upon joining, a new subscriber subscribes to its local broker.
● The local broker stores the subscription in its routing table and propagates the subscription to other brokers.
● The model supports any pub/sub routing protocol that has the path consistency property (i.e., it always uses the same broker path to route events from a given publisher to a given subscriber).
[Figure: events routed from publishers (P) through the broker network to subscribers (S).]
![Page 14: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/14.jpg)
Reliability / Timeliness Analysis
● Question : Given the entire publish / subscribe graph and each component's parameters, how can we estimate each subscriber's real-time reliability?
● Answer : Assuming the path consistency property, estimate the pair-wise real-time reliability between each publisher - subscriber pair.
● Subscriber real-time reliability is then the average of the pair-wise reliabilities between the subscriber and all publishers with the same topic, weighted by each publisher's publishing rate.
Example : two publishers P1, P2 and one subscriber S
λP1 = 2 events / sec, λP2 = 1 event / sec
rP1S = 0.9, rP2S = 0.8
rS = (0.9*2 + 0.8*1) / (2 + 1) = 0.87
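The rate-weighted average in the example can be written as a small helper. This is a minimal sketch: the function name and the list-of-pairs input format are illustrative choices, not something taken from the paper.

```python
from typing import Sequence, Tuple


def subscriber_reliability(pairs: Sequence[Tuple[float, float]]) -> float:
    """Rate-weighted average of pair-wise reliabilities.

    pairs holds one (publishing_rate, pairwise_reliability) tuple for each
    publisher whose topic matches the subscriber's topic.
    """
    total_rate = sum(rate for rate, _ in pairs)
    if total_rate <= 0:
        raise ValueError("total publishing rate must be positive")
    return sum(rate * r for rate, r in pairs) / total_rate


# Numbers from the slide: λP1 = 2, rP1S = 0.9, λP2 = 1, rP2S = 0.8.
print(round(subscriber_reliability([(2.0, 0.9), (1.0, 0.8)]), 2))  # 0.87
```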
![Page 15: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/15.jpg)
Pair-wise Reliability : Basic Routing
● In the basic protocol, an event is lost if at least one component along the path fails.
● Each broker B has availability aB = (1/γB) / (1/γB + 1/σB), i.e., the mean time to failure divided by the sum of the mean time to failure and the mean time to recovery.
● Each link L has availability aL = (1/γL) / (1/γL + 1/σL).
● Pair-wise reliability is the product of the availabilities of all components along the path.
Example : a path from P to S whose components have availabilities 0.95, 0.9, 0.85, 0.97, 0.99
rPS = 0.95 * 0.9 * 0.85 * 0.97 * 0.99 = 0.70
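The basic-routing computation can be sketched as follows. The sketch takes availability to be MTBF / (MTBF + MTTR), with MTBF = 1/γ for failure rate γ and MTTR = 1/σ for recovery rate σ. The function names are hypothetical.

```python
import math
from typing import Iterable


def availability(failure_rate: float, recovery_rate: float) -> float:
    """Steady-state availability: mean up time over mean cycle time."""
    mtbf = 1.0 / failure_rate   # mean time between failures
    mttr = 1.0 / recovery_rate  # mean time to recovery
    return mtbf / (mtbf + mttr)


def basic_path_reliability(availabilities: Iterable[float]) -> float:
    """An event survives only if every broker and link on the path is up."""
    return math.prod(availabilities)


# Path from the slide: five components between P and S.
print(round(basic_path_reliability([0.95, 0.9, 0.85, 0.97, 0.99]), 2))  # 0.7
```

The product form assumes that components fail independently of one another.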
![Page 16: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/16.jpg)
Event Retransmission ([Chand & Felber '04][Esposito et al. '09])
● In the retransmission protocol, each broker stores every incoming event in its persistent storage before sending an acknowledgement back to the sender.
● The broker keeps retransmitting the event until it receives an acknowledgement from the next hop, then discards the buffered event.
● In the retransmission protocol, an event is never lost at a broker or link. However, an event may expire due to buffering delay.
[Figure: an event is forwarded hop by hop from P to S; each hop returns an ACK to the previous sender.]
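The store, acknowledge, and retransmit behavior described above can be illustrated with a toy model. Everything here is hypothetical: the `Broker` class, the in-memory dictionary standing in for persistent storage, and the `link_up` callback are not taken from the cited protocols.

```python
from typing import Callable, Dict


class Broker:
    """Toy broker: store an event, ACK it, retransmit until the next hop ACKs."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stored: Dict[int, str] = {}  # stands in for persistent storage

    def receive(self, event_id: int, payload: str) -> bool:
        self.stored[event_id] = payload  # persist first ...
        return True                      # ... then acknowledge

    def forward(self, event_id: int, next_hop: "Broker",
                link_up: Callable[[], bool], max_attempts: int = 100) -> int:
        """Retransmit until ACKed; return the number of attempts used."""
        payload = self.stored[event_id]
        for attempt in range(1, max_attempts + 1):
            if link_up() and next_hop.receive(event_id, payload):
                del self.stored[event_id]  # discard only after the ACK
                return attempt
        raise TimeoutError("no acknowledgement within max_attempts")


outages = iter([False, False, True])  # link is down for the first two attempts
b1, b2 = Broker("B1"), Broker("B2")
b1.receive(1, "event")
attempts = b1.forward(1, b2, lambda: next(outages))
print(attempts, 1 in b1.stored, 1 in b2.stored)  # 3 False True
```

The event is never dropped in this model. The only cost of an outage is the extra attempts, which correspond to the buffering delay that can make an event expire.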
![Page 17: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/17.jpg)
Pair-wise Reliability : Retransmission
● To compute path reliability in the retransmission protocol, we compute the probability that the end-to-end delivery delay is less than the event lifetime.
Example : path P → B1 → B2 → B3 → S with per-hop delays dPB1, dB1B2, dB2B3, dB3S and end-to-end delay dPS
rPS = P[dPS < D] = P[dPB1 + dB1B2 + dB2B3 + dB3S < D]
● Assuming all broker / link failure and recovery durations are exponentially distributed, we can estimate the per-hop delivery delay distribution using Markov theory (see paper for proof).
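The probability P[dPS < D] can also be approximated numerically. The sketch below is a Monte Carlo stand-in, not the paper's Markov derivation, and it uses a deliberately simplified per-hop model that is an assumption of this example. A hop is up with its steady-state availability and then adds no delay. Otherwise the event waits for the remaining repair time, which is exponential with the recovery rate because of memorylessness. Transmission time and failures during forwarding are ignored.

```python
import random
from typing import Sequence, Tuple


def retransmission_reliability(
    hops: Sequence[Tuple[float, float]],
    lifetime: float,
    trials: int = 20000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of P[sum of per-hop delays < lifetime].

    hops holds one (failure_rate, recovery_rate) pair per hop.
    """
    rng = random.Random(seed)
    on_time = 0
    for _ in range(trials):
        delay = 0.0
        for failure_rate, recovery_rate in hops:
            up = recovery_rate / (failure_rate + recovery_rate)
            if rng.random() >= up:
                # Hop is down: wait out the remaining (memoryless) repair time.
                delay += rng.expovariate(recovery_rate)
        if delay < lifetime:
            on_time += 1
    return on_time / trials


# Hypothetical path: four hops, each with availability 0.9 and a
# 60000-second failure / recovery period; event lifetime of one hour.
path = [(1 / 54000, 1 / 6000)] * 4
print(retransmission_reliability(path, lifetime=3600.0))
```

For a single hop this model has the closed form a + (1 - a)(1 - exp(-σD)), which gives a quick sanity check on the simulation.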
![Page 18: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/18.jpg)
Multi-path Routing ([Chand & Felber '04][Jaeger '07][Kazemzadeh & Jacobsen '09])
● Brokers run a failure detection and new path discovery protocol.
● If the next hop fails, the broker forwards the event to an alternative neighbor.
● Assuming a relatively fast discovery protocol, the event is always delivered on time as long as the publisher and subscriber are connected.
[Figure: broker graph with multiple paths between P and S.]
![Page 22: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/22.jpg)
Pair-wise Reliability : Multi-path Routing
● Pair-wise reliability between a publisher and a subscriber with multi-path routing is equal to the probability that the publisher and subscriber are connected.
● Computing this connection probability in a general graph is NP-hard.
● Instead, estimate a lower bound by reducing the graph into multiple independent paths.
[Figure: broker graph between P and S reduced to independent paths.]
![Page 23: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/23.jpg)
Pair-wise Reliability : Multi-path Routing (Cont.)
rPS ≥ P[at least one path is connected]
= 1 - P[all paths are disconnected]
= 1 - (1 - r1)(1 - r2)(1 - r3)
where r1, r2, r3 are the reliabilities of the three independent paths between P and S in the reduced graph. r1, r2, r3 can be computed using the reliability analysis for the basic routing protocol.
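The lower bound generalizes directly to any number of independent paths. A minimal sketch follows, with a hypothetical function name and made-up path reliabilities of 0.7 each:

```python
import math
from typing import Iterable


def multipath_lower_bound(path_reliabilities: Iterable[float]) -> float:
    """1 - P[all independent paths are disconnected]."""
    return 1.0 - math.prod(1.0 - r for r in path_reliabilities)


# Three independent paths, each with an illustrative reliability of 0.7.
print(round(multipath_lower_bound([0.7, 0.7, 0.7]), 3))  # 0.973
```

Because the reduction discards edges that are shared between paths, the real connection probability can only be higher than this value.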
![Page 24: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/24.jpg)
Retransmission + Multi-path Routing
● Retransmission and multi-path routing can be combined.
● Use retransmission on the default forwarding path and opportunistic forwarding on alternate paths.
● An event is not lost even when the publisher and subscriber are disconnected.
[Figure: an event forwarded from P toward S over the broker graph.]
![Page 26: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/26.jpg)
Retransmission + Multi-path Routing (Cont.)
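rPS = P[d < D] + P[d > D] * (1 - (1 - r1)(1 - r2))
where d is the delivery delay on the default (retransmission) path from P to S, and r1, r2 are the reliabilities of the alternate paths.
r1, r2 can be computed using the reliability analysis for the basic routing protocol.
P[d < D] can be computed using the reliability analysis for the retransmission protocol.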
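The combined formula can be evaluated with a short helper. This sketch assumes that P[d > D] = 1 - P[d < D] (the delay equals D with probability zero). The function name and the input numbers are illustrative only.

```python
import math
from typing import Iterable


def hybrid_reliability(p_on_time: float, alternate_paths: Iterable[float]) -> float:
    """P[d < D] + P[d > D] * P[at least one alternate path is connected].

    p_on_time is P[d < D] on the default retransmission path; alternate_paths
    holds the basic-routing reliabilities of the alternate paths.
    """
    p_any_alternate = 1.0 - math.prod(1.0 - r for r in alternate_paths)
    return p_on_time + (1.0 - p_on_time) * p_any_alternate


# Made-up inputs: P[d < D] = 0.9, alternate paths with r1 = 0.7 and r2 = 0.6.
print(round(hybrid_reliability(0.9, [0.7, 0.6]), 3))  # 0.988
```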
![Page 28: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/28.jpg)
Evaluation Setting
● NS-2 network simulator, simulating 10-broker networks.
● Period (MTBF + MTTR) is set to 60k seconds (approximately 17 hours) for brokers and links.
● Each link has availability set to 0.99 (hence MTBF = 0.99 * 17 hours, MTTR = 0.01 * 17 hours).
● Two sets of brokers (observed from data traces).
○ Low-end brokers ([0.9, 0.95] availability range)
○ High-end brokers ([0.99, 0.999] availability range)
● Event lifetime set to 3600 seconds (1 hour).
● Four protocols (basic, retransmission, multi-path, retransmission + multi-path)
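The link settings above translate into failure and recovery rates as follows. This is a sketch of the arithmetic only; the function name is hypothetical, and the conversion assumes rate = 1 / mean duration.

```python
from typing import Tuple


def failure_recovery_rates(availability: float, period_s: float) -> Tuple[float, float]:
    """Split a failure / recovery period into (failure_rate, recovery_rate).

    MTBF = availability * period and MTTR = (1 - availability) * period;
    each rate is the reciprocal of the corresponding mean duration.
    """
    if not 0.0 < availability < 1.0:
        raise ValueError("availability must be strictly between 0 and 1")
    mtbf = availability * period_s
    mttr = (1.0 - availability) * period_s
    return 1.0 / mtbf, 1.0 / mttr


# Link setting from the slide: availability 0.99 over a 60000-second period.
gamma_l, sigma_l = failure_recovery_rates(0.99, 60_000)
print(round(1 / gamma_l), round(1 / sigma_l))  # 59400 600
```

So a link is up for 59400 seconds and down for 600 seconds on average, and the one-hour event lifetime is six times the mean link outage.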
![Page 30: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/30.jpg)
Results (Tree topology)
● Each dot in the graph represents one subscriber.
● The retransmission protocol provides an order-of-magnitude improvement over the basic protocol.
[Figure panels: Basic w/ low-end brokers; Basic w/ high-end brokers; Retrans w/ low-end brokers; Retrans w/ low-end brokers]
![Page 31: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/31.jpg)
Results (Random Low-end Broker Graph)
● Average node degree = 4
● Reliability ordering : basic routing < retransmission < multi-path < hybrid (retransmission + multi-path)
[Figure: plots for Basic, Retrans, Multi-path, Retrans + Multi-path]
![Page 32: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/32.jpg)
Results (Random High-end Broker Graph)
● The retransmission protocol is better than multi-path routing.
● Combining retransmission with multi-path routing does not improve reliability very much.
[Figure: plots for Basic, Multi-path, Retrans, Retrans + Multi-path]
![Page 33: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/33.jpg)
Conclusions
● Our work presents an analytical model to predict reliability and timeliness in distributed publish / subscribe systems that abstracts
○ broker / link failure and recovery
○ several commonly used fault tolerance schemes.
● Evaluation results suggest that different fault tolerance schemes perform differently based on
○ Broker network quality
○ Event lifetime
○ Graph connectivity
● The proposed analytical model can be used as a building block for
○ subscriber admission control
○ broker network planning
○ fault-tolerant publish / subscribe protocol selection
![Page 34: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/34.jpg)
Pub / Sub Performance Analysis
● Question : Given a publish / subscribe network, how can we predict the reliability / timeliness perceived by each subscriber?
● Several factors affect a subscriber's QoS:
○ Traffic load and middleware capacity (ICAC'10) [1]
○ Pub / sub broker network failure and recovery (ICAC'13) [2]
○ Subscriber mobility (PhD thesis) [3]
[1] Pongthawornkamol et al., "Probabilistic QoS modeling for reliability/timeliness prediction in distributed content-based publish/subscribe systems over best-effort networks", ICAC 2010.
[2] Pongthawornkamol et al., "Reliability and Timeliness Analysis of Fault-tolerant Distributed Publish/Subscribe Systems", ICAC 2013.
[3] Pongthawornkamol et al., "Reliability and timeliness analysis of content-based publish/subscribe systems", Ph.D. Thesis.
![Page 35: Fault-tolerant Distributed Publish/Subscribe Reliability ... · the event is expired after being published). In this paper, we assume all events have the same lifetime value D. Subscriber](https://reader033.fdocuments.in/reader033/viewer/2022042302/5ecd6143eaac6c5f67389c66/html5/thumbnails/35.jpg)
Thank you !