Who am I?
§ Lev Popov § Service Reliability Engineer in Spotify § Joined Spotify in 2014 § Previous QIK – Skype – Microsoft
§ Background in services and networks operations
Some Numbers
• Over 60 million MAU (monthly active users) • Over 15 million paying subscribers • Over 30 million tracks • Over 1.5 billion playlists • Over 20.000 songs added per day
Capacity We Own
• 4 Data Centers • Over 7000 bare metal servers • Many different services • Pushing an average of 35GBps to the Internet • 24/7/365
Service
Service
Service
Service
Dev owner
In the beginning was the…
Dev owner
Ops owner
Dev owner
Ops owner
Opera0ons team
Dev owner
On-‐call Monitoring
Build systems
Backups DB Networks …
Operations Team in 2011
Thin group of 5 people
• Over 10 million users • Over 2 million paying subscribers • 12 Countries • Over 15 million tracks • Over 400 million playlists • 3 datacenters • Over 1300 servers
Operations Team Now
? • Over 60 million users • Over 15 million paying subscribers • 58 Countries • Over 30 million tracks • Over 1.5 billion playlists • 4 datacenters • Over 5000 servers
Operations Team Now
No team
• Over 60 million users • Over 15 million paying subscribers • 58 Countries • Over 30 million tracks • Over 1.5 billion playlists • 4 datacenters • Over 5000 servers
How We Scale
• Service oriented architecture Separate services for separate features
• UNIX way Small simple programs doing one thing well
• KISS principle Simple applications are easier to scale
Scaling Agile
• Squad is similar to a scrum team
• Designed to feel like a small startup
• Self organizing teams
• Autonomy to decide their own way of working
Service Dev owner
Service
Can we scale that?
Service
Dev owner
Ops owner
Service
Dev owner
Ops owner
Opera0ons team
Dev owner
On-‐call Monitoring
Build systems
Backups DB Networks …
Ops in Squads Background
Impossible to scale a central operations team • Understaffed • Difficult to find generalists
We believe that operation has to sit close to development
Our bet for autonomy • Break dependencies • End to end responsibility
Timeline
Dev Dev
Backend Infrastructure
I/O
Opera0ons
SRE
Internal IT
Opera0ons in Squads
2008 Early 2011 Mid 2012 Sep 2013
Infrastructure Operations
feature squad
feature squad
feature squad
feature squad
IO Tribe
networks conf mgmt containers
feature squad
enable + support
product area
squad
Core SRE
Core SRE
IO Tribe
Major Incidents Scalability Issues Systems Design Problems
Teaching Best Prac0ces in General
squad squad squad squad
Incident Management
Incident Postmortem Remedia0on
Incident Manager On-‐Call
Everybody involved in an incident
Postmortems
• Plan for post-mortems • Keep it close in time • Record the project details • Involve everyone • Get it in writing • Record successes as well as failures • It's not for punishment • Create an action plan • Make it available
On-call follows the sun
Stockholm New York
Stockholm New York
Stockholm New York L0
SA Product Owners L1
SA Lead L2
19 CET
01 EST
19 CET
01 EST
07 CET 07 CET
13 EST 13 EST
19 CET
13 EST
Areas of Improvement
• The expectations we place on squads are sometimes unclear
• Communication between feature teams and infrastructure teams
• It’s hard to measure ops in squads success
• Abandoned services and other ownership issues
Top Related