The events must flow - bigdatadays.ru · Lessons learnt from evolving the Spotify’s event...
The events must flow: Lessons learnt from evolving Spotify’s event delivery system
Agenda
● Introduction to Spotify’s event delivery system
● System generations
● Lessons learnt
What is event delivery?
Collect & Deliver
250 billion events per day
Spotify’s event delivery
Distributed Storage
3 Generations
● Based on sftp
● Based on Kafka 0.7
● Based on Google Cloud Pub/Sub
1st Generation: Based on sftp
2008
Thousands of users
Few millions
5 markets
● Royalty payments and reporting
● Trend tracking
● Billboard charts
● Data-driven features
1st Generation Architecture
[Diagram labels: service, syslog, sftp, Collector, Event 1, Event 2, M/R Post Process]
1st Generation Pros
● Event delivery is not a critical system
● Decoupling between
  ○ Producers
  ○ Delivery system
  ○ Consumers
● Simple
  ○ Mainly copying files
1st Generation Cons
● Some silent failure scenarios
● Service level objectives (SLO)
● Single point of failure
● Pets instead of cattle
Learning 1
Simple systems can get you a long way... until they break!
2nd Generation: Based on Kafka 0.7
2013
24 million
~15 billion events per day
55 markets
● Royalty payments and reporting
● Trend tracking
● Billboard charts
● Data-driven features
2nd Generation Architecture
[Diagram labels: service, syslog, Publisher, Consumer (data, ack), Event 1, Event 2, M/R Post Process, Grouper, real-time checks, liveness]
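The data and ack labels between Publisher and Consumer in this architecture can be read as an end-to-end acknowledgement scheme. A minimal Python sketch of that idea, assuming the publisher keeps every batch until the consumer confirms it has been persisted. The class and method names (`Publisher`, `Consumer`, `send_pending`) are hypothetical illustrations, not Spotify's code or the Kafka 0.7 API:

```python
from dataclasses import dataclass, field


@dataclass
class Consumer:
    """Hypothetical consumer: persists a batch, then acknowledges it."""
    stored: dict = field(default_factory=dict)
    fail_next: bool = False  # simulate a lost batch or a lost ack

    def receive(self, batch_id: int, events: list) -> bool:
        if self.fail_next:
            self.fail_next = False
            return False  # no ack: the publisher must retry
        self.stored[batch_id] = events  # idempotent write keyed by batch id
        return True  # ack only after the batch is persisted


@dataclass
class Publisher:
    """Hypothetical publisher: keeps batches until they are acked."""
    consumer: Consumer
    unacked: dict = field(default_factory=dict)
    next_id: int = 0

    def publish(self, events: list) -> int:
        batch_id = self.next_id
        self.next_id += 1
        self.unacked[batch_id] = events
        return batch_id

    def send_pending(self) -> None:
        # Try every unacked batch once; release it locally only on ack.
        for batch_id, events in list(self.unacked.items()):
            if self.consumer.receive(batch_id, events):
                del self.unacked[batch_id]


consumer = Consumer(fail_next=True)
publisher = Publisher(consumer)
publisher.publish(["Event 1", "Event 2"])
publisher.send_pending()  # first attempt is not acked
pending_after_first = len(publisher.unacked)
publisher.send_pending()  # retry succeeds and the batch is released
pending_after_retry = len(publisher.unacked)
```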
2nd Generation Pros
● Support real-time use cases (~)
● Increased scalability
2nd Generation Cons
● Complicated to operate
  ○ End-to-end acks hide bottlenecks
  ○ 100% timely delivery
● Benefits are not visible to the downstream consumers
  ○ No improvements to SLOs
  ○ System was about moving bits
Learning 2
You might focus on the wrong issue without owning the whole problem
2nd Generation: Incremental improvements
● One team owns event delivery*
● SLO improvements
  ○ End-to-end monitoring
  ○ Rewrite of slow components
  ○ Interface simplifications
Motivations for 3rd Generation
● Create a simpler system
  ○ pub/sub system with replication
● Event isolation
● Focus on the data that counts
  ○ Only transmit valid events
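The goal of only transmitting valid events can be pictured as a filter placed in front of the publisher. A minimal sketch, assuming events are dictionaries checked against a per-type set of required fields. The schemas, field names, and the `split_valid` helper are hypothetical and do not describe Spotify's actual event format:

```python
# Hypothetical schemas: event type -> required field names.
SCHEMAS = {
    "SongPlayed": {"user_id", "track_id", "timestamp"},
    "AdClicked": {"user_id", "ad_id", "timestamp"},
}


def is_valid(event: dict) -> bool:
    """Valid means: the type is known and no required field is missing."""
    required = SCHEMAS.get(event.get("type"))
    return required is not None and required.issubset(event.keys())


def split_valid(events: list) -> tuple:
    """Separate events to transmit from events to reject before publishing."""
    valid = [e for e in events if is_valid(e)]
    rejected = [e for e in events if not is_valid(e)]
    return valid, rejected


events = [
    {"type": "SongPlayed", "user_id": 1, "track_id": "t1", "timestamp": 100},
    {"type": "SongPlayed", "user_id": 2},  # missing required fields
    {"type": "Unknown", "user_id": 3},     # unknown event type
]
valid, rejected = split_valid(events)
```

Rejecting malformed events at the edge means the rest of the pipeline spends its capacity only on data that downstream consumers can use.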
2nd Generation Architecture
[Diagram labels: service, syslog, Publisher, Consumer (data, ack), Event 1, Event 2, M/R Post Process]
2015
Learning 3
Make sure you have a reliable and stable system before focusing on a new one
Learning 4
Get support by offloading cross-cutting concerns to other teams
3rd Generation: Based on Google Cloud Pub/Sub
2016
100 million
~120 billion events per day
59 markets
● Royalty payments and reporting
● Trend tracking
● Billboard charts
● Data-driven features
3rd Generation Architecture
[Diagram labels: service, syslog, Publisher, Service, Consumer (×3), Event 1, Event 2, Event 3, each followed by M/R Dedup]
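The M/R Dedup step attached to each event type suggests that duplicate messages can reach the consumers and are removed afterwards, one event type at a time. A minimal sketch of that idea, assuming each message carries a unique `id` and a `type`. Both field names and the `dedup_by_type` helper are hypothetical, and this in-memory version only illustrates the logic, not how the batch jobs were actually written:

```python
from collections import defaultdict


def dedup_by_type(messages: list) -> dict:
    """Group messages by event type and keep the first copy of each id."""
    seen = defaultdict(set)     # event type -> ids already emitted
    output = defaultdict(list)  # event type -> deduplicated messages
    for msg in messages:
        event_type, msg_id = msg["type"], msg["id"]
        if msg_id in seen[event_type]:
            continue  # duplicate delivery, drop it
        seen[event_type].add(msg_id)
        output[event_type].append(msg)
    return dict(output)


messages = [
    {"type": "Event 1", "id": "a"},
    {"type": "Event 2", "id": "b"},
    {"type": "Event 1", "id": "a"},  # redelivered copy
    {"type": "Event 1", "id": "c"},
]
result = dedup_by_type(messages)
```

Keeping a separate output per event type fits the event isolation motivation: a problem with one event type need not hold back the others.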
3rd Generation Pros
● Simplified system
● Easier to operate
  ○ Managed infrastructure
● Next order of magnitude of scale
● Events, not bits
  ○ Isolation
  ○ Ownership
3rd Generation Cons
● Death by a thousand papercuts
  ○ Early-stage tools
  ○ All events were treated equally
● Still relying on syslog, enforcing stateful components
Learning 5
Improve your life by focusing on what matters
The journey continues: relying on syslog is still a problem
Lessons learnt recap
1. Keep it simple
2. Own the whole problem
3. Stabilize before focusing on a new one
4. Offload cross-cutting concerns to other teams
5. Focus on what matters
Thank you!
Want to join the band? http://spoti.fi/jobs