Keeping Pinterest Running (sysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...)
Keeping Pinterest Running
Joe Gordon
2 February 2016
What is Pinterest?
Software v. Service
Software v. Service
● Stable branches
● Drivers and configurations
● Support matrix
● Dependency versions
● Developers support their own service
  ○ On call rotation
  ○ Aligns incentives
  ○ Monitoring & alerting built in from day one
● Testing against production traffic
SRE at Pinterest
What do SREs focus on?
Operational Maturity
Operational Excellence
Operational Excellence
Visibility
Insight into the system
Visibility
● Data Driven
● Cornerstone for many things we do
  ○ Measure and enforce SLA (Service Level Agreement)
  ○ Debug issues
  ○ Capacity planning
● Time series data - TSDB
● Metrics
  ○ System
  ○ Service
  ○ Dependencies
  ○ Latencies
● Alerting
● ELK stack for real-time log collection
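As a rough illustration of the "measure and enforce SLA" bullet, the sketch below checks one window of request samples against an availability target and a p99 latency target. The sample format, the thresholds, and the function names are assumptions made for this example; the slides do not describe how the SLA is actually computed.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float
    ok: bool  # True if the request succeeded


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100], values must be non-empty."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_sla(samples, min_availability=0.999, max_p99_ms=500.0):
    """Return (availability, p99_ms, met) for one non-empty window of samples.

    The default targets are placeholders, not real SLA numbers.
    """
    availability = sum(s.ok for s in samples) / len(samples)
    p99 = percentile([s.latency_ms for s in samples], 99)
    met = availability >= min_availability and p99 <= max_p99_ms
    return availability, p99, met
```

In practice the samples would come from the time series database rather than an in-memory list, and a failed check would feed the alerting path.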
Deployments
Deployment Requirements
● No impact to end user
● Change history
● Easy
Staging and Canary
Canary in a coal mine. Rabbit in a sarin gas plant.
Canary vs Staging
Staging
Teletraan
deploy system
Teletraan Features
● Rollback
● Hotfix
● Rolling deploy
● Staging and testing
● Visibility & Usability
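To make the "rolling deploy" and "rollback" bullets concrete, here is a minimal sketch of a batch-by-batch deploy that restores the previous build on every touched host if a batch fails its health check. It is not Teletraan's implementation; the callables `get_build`, `set_build`, and `healthy`, and the batch size, are hypothetical stand-ins supplied by the caller.

```python
def rolling_deploy(hosts, new_build, get_build, set_build, healthy, batch_size=2):
    """Deploy new_build to hosts in batches; roll everything back on failure.

    get_build(host) returns the build a host currently runs,
    set_build(host, build) installs a build, healthy(host) returns a bool.
    Returns True if every host ended up on new_build.
    """
    previous = {}  # host -> build it ran before this deploy touched it
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            previous[host] = get_build(host)
            set_build(host, new_build)
        if not all(healthy(host) for host in batch):
            # Roll back every host touched so far, including earlier batches.
            for host, old_build in previous.items():
                set_build(host, old_build)
            return False
    return True
```

Deploying in small batches limits how many hosts can be unhealthy at once, which is one way to approach the "no impact to end user" requirement.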
Teletraan Design
● client-server model
● PRE/POST-DOWNLOAD
● PRE/POST-RESTART
● RESTART
● RBAC
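The stage names above suggest a per-host agent that runs user-supplied hooks around the download and restart steps. The sketch below shows one way such a hook runner could look. The stage order, the underscore spelling of the names, and the `run_deploy` function are assumptions for illustration; the download itself is left out, and this is not the real Teletraan agent.

```python
# Assumed order of the hook stages named on the slide.
STAGES = ["PRE_DOWNLOAD", "POST_DOWNLOAD", "PRE_RESTART", "RESTART", "POST_RESTART"]


def run_deploy(hooks):
    """Run each stage's hook in order and stop at the first failure.

    hooks maps a stage name to a zero-argument callable returning True on
    success. Stages without a hook are skipped.
    Returns (succeeded, list_of_stages_that_ran).
    """
    ran = []
    for stage in STAGES:
        hook = hooks.get(stage)
        if hook is None:
            continue
        ran.append(stage)
        if not hook():
            return False, ran
    return True, ran
```

A service owner might, for example, use a PRE_RESTART hook to drain traffic and a POST_RESTART hook to run a health check before the host is considered done.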
Teletraan Advanced Features
● Pause/Resume
● Acceptance Testing
● Auto Deploy
● Autoscaling
[Diagram: three stages, Staging, Canary, Production; each has a Test step, with "Auto Promote" and "Promote" labels]
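The Staging, Canary, Production flow can be read as a promotion pipeline in which a build only moves forward when the tests for its current stage pass. The sketch below treats every promotion as automatic, which is an assumption; the slide may distinguish automatic from manual promotion. The `promote` function and its callables are hypothetical.

```python
PIPELINE = ["staging", "canary", "production"]


def promote(build, run_tests, deploy):
    """Deploy build stage by stage, promoting only while tests pass.

    deploy(stage, build) installs the build in a stage;
    run_tests(stage, build) returns True if acceptance tests pass there.
    Returns the list of stages where the build was deployed and passed.
    """
    passed = []
    for stage in PIPELINE:
        deploy(stage, build)
        if not run_tests(stage, build):
            break  # stop promotion; later stages never see this build
        passed.append(stage)
    return passed
```

A build that fails in canary never reaches production, so only the small canary slice of traffic is exposed to it.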
Postmortems
Learn from our mistakes
Postmortems
● Blameless
● Incident Manager
● Impact
● Outage Type
● Method of Detection
● Timeline
● Root Cause
● Restoration Details
● Actionable Items
Production Readiness Review
Pre-mortem?
Production Readiness Review
● Dependencies
● Define an SLA
● Alerting
● Capacity planning
● Testing
● On call rotation
● Decider to turn feature off if needed
● Incremental launch plan
● Rate limiting
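The "decider" and "incremental launch plan" items can be combined in a single mechanism: a per-feature percentage that can be raised step by step or set to zero to turn the feature off. The sketch below is a minimal in-memory version under that assumption. The `Decider` class, its method names, and the hashing scheme are hypothetical and are not Pinterest's decider.

```python
import zlib


class Decider:
    """In-memory feature switch with a per-feature rollout percentage."""

    def __init__(self):
        self._percent = {}  # feature name -> rollout percentage, 0..100

    def set_percent(self, feature, percent):
        """Set the rollout percentage; 0 turns the feature off for everyone."""
        self._percent[feature] = max(0, min(100, percent))

    def is_enabled(self, feature, user_id):
        """Stable decision: a given user always hashes to the same bucket."""
        bucket = zlib.crc32(f"{feature}:{user_id}".encode()) % 100
        return bucket < self._percent.get(feature, 0)
```

Because the bucket is derived from a hash of the feature and user, raising the percentage only adds users; nobody who already has the feature loses it until the percentage is lowered.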
Public Cloud Issues
Public Cloud Issues
● “If you get an InsufficientInstanceCapacity error when you try to launch an instance, AWS does not currently have enough available capacity to service your request.”
  ○ Cloud is not infinite
  ○ Reserved instances
  ○ Capacity planning
● RequestLimitExceeded: “The maximum request rate permitted by the Amazon EC2 APIs has been exceeded for your account.”
  ○ Includes DescribeInstances
  ○ Use internal mirror (powered by Elasticsearch)
● Noisy Neighbors
● Rightsizing
● Ownership
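The slide's remedy for RequestLimitExceeded is an internal mirror of instance data. A complementary, commonly used client-side technique is to retry throttled calls with exponential backoff and jitter, sketched below. The exception class here is a local stand-in rather than a real SDK type, and the function name, delays, and attempt count are assumptions.

```python
import random
import time


class RequestLimitExceeded(Exception):
    """Stand-in for the API throttling error; not a real SDK class."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RequestLimitExceeded with full-jitter backoff.

    The wait before retry k is a random value in [0, min(max_delay,
    base_delay * 2**k)]. The last failure is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RequestLimitExceeded:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The random jitter spreads retries from many clients over time, so they do not all hit the rate limit again at the same moment.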
Open Sourced Tools
Open Sourced Tools
● mysql_utils
  ○ MySQL Management Tools for the Cloud
● thrift-tools
  ○ thrift-tools is a library and a set of tools to introspect Apache Thrift traffic
● secor
  ○ Secor is a service implementing Kafka log persistence
● pymemcache
  ○ A comprehensive, fast, pure-Python memcached client
● pinrepo
  ○ Artifact Repo
● Teletraan
More at: https://github.com/pinterest