PagerDuty: Span the WAN? Yes you can!
Transcript of PagerDuty: Span the WAN? Yes you can!
![Page 2: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/2.jpg)
2015-10-01 MAKING PAGERDUTY MORE RELIABLE USING PXC
Span the WAN. Why?
![Page 3: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/3.jpg)
![Page 4: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/4.jpg)
2015-10-01 SPAN THE WAN? YES YOU CAN!
![Page 5: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/5.jpg)
PagerDuty: some history
• Monolithic Ruby on Rails + MySQL
• Hosted in AWS us-east-1
• AWS outages in 2010 and 2011
• …including correlated multi-AZ failures
• PagerDuty was heavily impacted
• Needed resiliency to this failure mode
![Page 6: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/6.jpg)
Design goals
• Continuity during a DC drop (AZ or Region)
• No operator intervention
• Can’t lose data
• Can’t delay data (shelf life)
• Timely notifications - always
  • Measured in tens of seconds
![Page 7: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/7.jpg)
Design decisions
• Masterless: peer-based & clustered
• Can’t tolerate staleness: synchronous WAN replication
• Manage state: consistent reads
• Opted to use Cassandra
• …despite many of Cassandra’s features not being relevant
![Page 8: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/8.jpg)
How Cassandra is often used
• Massive throughput
• Lots of data
• Horizontally scalable
• Eventually consistent
• High write:read ratio
• High performance individual operations
![Page 9: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/9.jpg)
Essential Cassandra features for PagerDuty
• Quorum operations
• Tuneable consistency
• Synchronous WAN replication
![Page 10: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/10.jpg)
WAN-spanning system design
![Page 11: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/11.jpg)
System architecture
[Diagram: Shared cross-DC datastore (Cassandra); Distributed Coordination (ZooKeeper); Clustered Application]
![Page 12: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/12.jpg)
Quorum consistency systems
• Each item replicated N times
• Writes: require W of N replicas
• Reads: require R of N replicas
• W + R <= N: read can miss a write
• W + R > N: read can’t miss a write
[Diagram labels: WRITE, READ]
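The overlap rule above can be checked by brute force. The following is a minimal sketch, not anything from the talk: `quorums_always_overlap` is a hypothetical helper that enumerates every possible set of W write replicas and R read replicas and checks that each pair shares at least one replica.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Return True if every W-replica write set shares at least one
    replica with every R-replica read set, for N replicas in total."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # non-empty intersection is truthy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# Compare the W + R > N rule against the exhaustive check.
for n, w, r in [(5, 3, 3), (5, 2, 3), (3, 2, 2), (3, 1, 1)]:
    rule = w + r > n
    checked = quorums_always_overlap(n, w, r)
    print(f"N={n} W={w} R={r}: W+R>N is {rule}, overlap guaranteed: {checked}")
```

For every configuration tried here the two columns agree: when W + R > N a read set cannot avoid the replicas that acknowledged the latest write, and when W + R <= N it can.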
![Page 13: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/13.jpg)
Cassandra setup
• Replication factor: N=5
• Three DCs
• DC-aware placement strategy
• W=3: all writes hit multiple DCs
• R=3: all reads hit multiple DCs
• 3 + 3 > 5: consistent reads
[Diagram: Cass 1 to Cass 5 placed across DC-A, DC-B, and DC-C]
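A short sketch of why this layout gives the stated properties. The placement below is an assumption: later slides imply Cass 1 and Cass 2 sit in DC-A, but which of Cass 3 to Cass 5 sits in DC-B versus DC-C is not stated, so the split used here is hypothetical. The helper names are illustrative only.

```python
from itertools import combinations

# Assumed 2/2/1 placement; the DC-B / DC-C split is a guess.
PLACEMENT = {
    "Cass 1": "DC-A",
    "Cass 2": "DC-A",
    "Cass 3": "DC-B",
    "Cass 4": "DC-B",
    "Cass 5": "DC-C",
}
QUORUM = 3  # W = R = 3


def min_dcs_touched(placement, quorum):
    """Smallest number of distinct DCs spanned by any quorum of replicas."""
    return min(
        len({placement[node] for node in group})
        for group in combinations(placement, quorum)
    )


def quorum_survives_loss_of(placement, quorum, lost_dc):
    """True if enough replicas remain for a quorum after one DC disappears."""
    remaining = [node for node, dc in placement.items() if dc != lost_dc]
    return len(remaining) >= quorum


print("every quorum spans at least", min_dcs_touched(PLACEMENT, QUORUM), "DCs")
for dc in sorted(set(PLACEMENT.values())):
    print(f"lose {dc}: quorum still possible =", quorum_survives_loss_of(PLACEMENT, QUORUM, dc))
```

Because no DC holds three replicas, every quorum of three spans at least two DCs, and because no DC holds more than two, any single DC can vanish while three replicas remain.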
![Page 14: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/14.jpg)
Data layer summary
• Data safe against DC failure
• Consistent reads (of acknowledged writes)
• Expensive multi-DC writes & reads
• Managing state: No ACID transactions!
• Enforce “transactions” in the application layer
![Page 15: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/15.jpg)
Application layer: “transactions”
• Sequence of logic and Cassandra operations
• Implement sequence as idempotent
• Failure is not an option
• Enforce transaction ordering
• Expect (some) (transient) inconsistencies
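One way to read "implement sequence as idempotent" is sketched below. This is an illustration under assumptions, not PagerDuty's code: `LossyAckStore` is a hypothetical in-memory stand-in for the datastore (not a Cassandra client), and `record_notification` and `run_until_acknowledged` are made-up names. The store applies a write but sometimes loses the acknowledgement, which is the case where a blind retry would duplicate work unless the step is idempotent.

```python
class LossyAckStore:
    """Hypothetical in-memory datastore. The first `lost_acks` writes are
    applied, but their acknowledgement is lost (raised as a timeout)."""

    def __init__(self, lost_acks):
        self.rows = {}
        self.lost_acks = lost_acks

    def put(self, key, value):
        self.rows[key] = value
        if self.lost_acks > 0:
            self.lost_acks -= 1
            raise TimeoutError("write applied, acknowledgement lost")


def record_notification(store, event_id, channel):
    # The row key is derived only from the inputs, so replaying this step
    # overwrites the same row instead of creating a duplicate.
    store.put(("notified", event_id, channel), True)


def run_until_acknowledged(step, *args, max_attempts=5):
    """Replay a step until the store acknowledges it; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        try:
            step(*args)
            return attempt
        except TimeoutError:
            continue  # safe to replay because the step is idempotent
    raise RuntimeError("step was never acknowledged")


store = LossyAckStore(lost_acks=2)
attempts = run_until_acknowledged(record_notification, store, "evt-1", "sms")
print("attempts:", attempts, "rows stored:", len(store.rows))
```

The step runs three times here but leaves a single row, so the retry loop can keep going until it succeeds without worrying about what earlier attempts did.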
![Page 16: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/16.jpg)
Tales from production
![Page 17: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/17.jpg)
What about the network?
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C, with inter-DC links labeled 24 ms, 24 ms, and 3 ms]
• Network diversity limits DC choices
• Result? Uneven network latencies
![Page 18: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/18.jpg)
…and how you should think of the network
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C, with inter-DC links labeled 24 ms, 24 ms, and 3 ms]
![Page 19: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/19.jpg)
Reads and writes
[Diagram: a client and replicas R1 to R5 across DC-A, DC-B, and DC-C]
![Page 20: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/20.jpg)
Read and write performance
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C, with inter-DC links labeled 24 ms, 24 ms, and 3 ms]
• R = W = 3 means always hitting replicas in two DCs (by design)
• Reads coordinated from DC-B or DC-C nodes will take >3 ms
• Reads coordinated from DC-A nodes will take >24 ms
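The two latency floors above follow from the topology: a quorum read completes only when the third-fastest replica has answered. The sketch below makes that arithmetic explicit. It assumes a placement the slides only partly give (Cass 1 and Cass 2 in DC-A; the DC-B / DC-C split of Cass 3 to Cass 5 is a guess), treats the slide's link figures as the network cost of reaching a remote replica, and ignores processing time. `quorum_read_floor_ms` is a hypothetical helper.

```python
# Assumed placement; the DC-B / DC-C split is not stated in the slides.
PLACEMENT = {
    "Cass 1": "DC-A",
    "Cass 2": "DC-A",
    "Cass 3": "DC-B",
    "Cass 4": "DC-B",
    "Cass 5": "DC-C",
}

# Inter-DC link latencies in milliseconds, as labeled on the slides.
LINK_MS = {
    frozenset(["DC-A", "DC-B"]): 24,
    frozenset(["DC-A", "DC-C"]): 24,
    frozenset(["DC-B", "DC-C"]): 3,
}


def latency_ms(dc_from, dc_to):
    """Network latency between two DCs; same-DC traffic is treated as free."""
    return 0 if dc_from == dc_to else LINK_MS[frozenset([dc_from, dc_to])]


def quorum_read_floor_ms(coordinator, r=3):
    """Network latency to the r-th nearest replica: a lower bound on read time."""
    home = PLACEMENT[coordinator]
    latencies = sorted(latency_ms(home, dc) for dc in PLACEMENT.values())
    return latencies[r - 1]


for node in PLACEMENT:
    print(f"{node} ({PLACEMENT[node]}): quorum read needs at least {quorum_read_floor_ms(node)} ms")
```

Under these assumptions a DC-A coordinator has only two local replicas and must wait on a 24 ms link for the third, while DC-B and DC-C coordinators can complete a quorum over the 3 ms link.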
![Page 21: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/21.jpg)
Another latency effect? Per-node read volume
![Page 22: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/22.jpg)
Per-node read volume: why so skewed?
![Page 23: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/23.jpg)
Writes: Which replicas are involved? All 5
[Diagram: a client and replicas R1 to R5 across DC-A, DC-B, and DC-C]
![Page 24: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/24.jpg)
Writes: per-node volume
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C, with inter-DC links labeled 24 ms, 24 ms, and 3 ms]
• N=5, so there is a write op on each replica
• All replicas experience the same per-node write load
![Page 25: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/25.jpg)
Reads: Which replicas are involved? Only 3!
[Diagram: a client and replicas R1 to R5 across DC-A, DC-B, and DC-C]
![Page 26: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/26.jpg)
Reads: per-node volume
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C, with inter-DC links labeled 24 ms, 24 ms, and 3 ms]
• Coordinator chooses R fastest replicas (R=3)
• Network latency steers to the nearest replicas
![Page 27: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/27.jpg)
Reads: per-node volume (Cass 3 as coord)
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C, with inter-DC links labeled 24 ms, 24 ms, and 3 ms]
• Chooses Cass 3, 4, and 5
• Same when Cass 4 or Cass 5 coordinates
![Page 28: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/28.jpg)
Reads: per-node volume (Cass 1 as coord)
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C, with inter-DC links labeled 24 ms, 24 ms, and 3 ms]
• Hits Cass 1, Cass 2, and (randomly) one of Cass 3, 4, or 5
• Same when Cass 2 coordinates
![Page 29: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/29.jpg)
Reads: per-node volume, uniform coord usage

| Coordinator node | Cass 1 | Cass 2 | Cass 3 | Cass 4 | Cass 5 |
|---|---|---|---|---|---|
| Cass 1 | 1 | 1 | 0.33 | 0.33 | 0.33 |
| Cass 2 | 1 | 1 | 0.33 | 0.33 | 0.33 |
| Cass 3 | 0 | 0 | 1 | 1 | 1 |
| Cass 4 | 0 | 0 | 1 | 1 | 1 |
| Cass 5 | 0 | 0 | 1 | 1 | 1 |
| Total requests | 2 | 2 | 3.66 | 3.66 | 3.66 |
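The table can be reproduced from the topology with a small model. This is a sketch under assumptions: the coordinator always reads from its R nearest replicas, ties at the cutoff latency are broken uniformly at random, and the DC-B / DC-C split of Cass 3 to Cass 5 is a guess (the slides only imply that Cass 1 and Cass 2 are in DC-A). `read_share` is a hypothetical helper, not Cassandra's actual replica selection logic.

```python
# Assumed placement; the DC-B / DC-C split is not stated in the slides.
PLACEMENT = {
    "Cass 1": "DC-A",
    "Cass 2": "DC-A",
    "Cass 3": "DC-B",
    "Cass 4": "DC-B",
    "Cass 5": "DC-C",
}

# Inter-DC link latencies in milliseconds, as labeled on the slides.
LINK_MS = {
    frozenset(["DC-A", "DC-B"]): 24,
    frozenset(["DC-A", "DC-C"]): 24,
    frozenset(["DC-B", "DC-C"]): 3,
}


def latency_ms(dc_from, dc_to):
    """Network latency between two DCs; same-DC traffic is treated as free."""
    return 0 if dc_from == dc_to else LINK_MS[frozenset([dc_from, dc_to])]


def read_share(coordinator, r=3):
    """Expected reads per replica for one request handled by `coordinator`."""
    home = PLACEMENT[coordinator]
    lat = {node: latency_ms(home, dc) for node, dc in PLACEMENT.items()}
    cutoff = sorted(lat.values())[r - 1]  # latency of the r-th nearest replica
    certain = [n for n in lat if lat[n] < cutoff]
    tied = [n for n in lat if lat[n] == cutoff]
    open_slots = r - len(certain)  # slots shared evenly among tied replicas
    return {
        n: 1.0 if n in certain else (open_slots / len(tied) if n in tied else 0.0)
        for n in lat
    }


nodes = list(PLACEMENT)
# One request per coordinator, i.e. uniform coordinator usage.
totals = {n: sum(read_share(coord)[n] for coord in nodes) for n in nodes}
for node, total in totals.items():
    print(f"{node}: {total:.2f}")
```

The exact total for Cass 3 to Cass 5 is 3 + 2/3, which the slide shows as 3.66 after rounding each cell down to 0.33. The DC-A nodes see roughly 55% of the read volume of the others, which is the skew the earlier slides ask about.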
![Page 30: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/30.jpg)
Per-node read volume: reality vs. theory
![Page 31: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/31.jpg)
What about scaling out?
• Asymmetrical per-node read volumes
• So each DC has different CPU and disk IO needs
• Different node size?
• Different per-DC node count?
• What about DC degradation or loss?
• End up with same-sized nodes
![Page 32: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/32.jpg)
When a data center vanishes…
![Page 33: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/33.jpg)
Major outage: DC-C (May, 2015)
• All hosts unreachable for ~5 hours
![Page 34: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/34.jpg)
Seamless data center migration (August 2015)
• Moved DC-C fleet from one provider to another
• Remove old node; add new node
• No application-level migration needed
• Zero customer impact
![Page 35: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/35.jpg)
DC-A to DC-B fiber cut (September 2015)
• DC-A to DC-B network latency rose from 24 ms to 200 ms and stayed there for 48 hours
• All Cass ops now take 24 ms
[Chart label: FIBER CUT EAST-1]
![Page 36: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/36.jpg)
And back to where we started
![Page 37: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/37.jpg)
What have we learned?
• WAN-spanning synchronous replication is a thing
• Data layer consistent reads are practical
• Application layer consequences for managing state
• Network topology affects:
  • Request performance
  • Per-node load
• Trade off latency for reliability
![Page 38: PagerDuty: Span the WAN? Yes you can!](https://reader031.fdocuments.in/reader031/viewer/2022030309/58f1f2f41a28aba92d8b45cf/html5/thumbnails/38.jpg)
Span the WAN?
Yes you can!