ScaleOut your team - Building a technology team for scale in a DevOps culture
Scale Out your team: Building your technology team for scale
Shai Peretz, SVP Platform Engineering & Operations
Provide people with the most interesting, relevant and trusted content
Audience First.
Our Lighthouse
Widget Examples
- Traffic: >25 billion page views per month, >8 billion recommendations per day
- Reach: >550M users globally
- Data: multiple petabytes (distributed)
- Servers: >4,000 physical nodes
- Monitoring: >4M metrics per minute
- Team: ~130 engineers (Dev + Ops)
- High growth rate
Scaling Vectors
- We design and build our own data centers (optimization, cost)
- Colocation (fewer clouds on the horizon :)
- Active/Active approach
- Rely on external services when needed (DNS, CDN)
Operational Decisions
- No SPOF (n+x)
- Vendor diversity
- Flexible architecture
- Commodity hardware
- Scale out: no central devices
- Open source
Design guidelines
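The "No SPOF (n+x)" guideline can be read as simple arithmetic: n nodes carry the peak load and x more are kept as spares. The sketch below is only an illustration under that reading; the function names and the example figures are hypothetical and do not come from the talk.

```python
import math


def nodes_required(peak_load: float, per_node_capacity: float, spare_nodes: int) -> int:
    """Return n + x: the nodes needed to carry peak load (n) plus spares (x)."""
    if per_node_capacity <= 0:
        raise ValueError("per_node_capacity must be positive")
    if peak_load < 0 or spare_nodes < 0:
        raise ValueError("peak_load and spare_nodes must be non-negative")
    n = math.ceil(peak_load / per_node_capacity)
    return n + spare_nodes


def survives_failures(total_nodes: int, peak_load: float,
                      per_node_capacity: float, failures: int) -> bool:
    """Check whether the remaining nodes can still carry the peak load."""
    remaining = total_nodes - failures
    return remaining * per_node_capacity >= peak_load


# Hypothetical figures: 10,000 requests/s at peak, 1,200 requests/s per node, x = 2.
cluster_size = nodes_required(10_000, 1_200, spare_nodes=2)
```

With these made-up numbers the cluster tolerates the loss of any two nodes, but not three.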
- Automate using Chef
- Configuration as code (source control)
- Log changes automatically
Configuration Management
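To make "configuration as code" and "log changes automatically" concrete, here is a minimal sketch of the underlying idea: a desired state is declared as data, applied idempotently, and every change is recorded. This is plain Python for illustration only, not Chef (Chef recipes are written in Ruby); the `converge` function and the setting names are hypothetical.

```python
def converge(current: dict, desired: dict, change_log: list) -> dict:
    """Return a copy of `current` updated to match `desired`.

    Every setting that actually changes is appended to `change_log`,
    so running the function twice logs nothing the second time.
    """
    result = dict(current)
    for key, want in desired.items():
        have = result.get(key)
        if have != want:
            change_log.append(f"{key}: {have!r} -> {want!r}")
            result[key] = want
    return result


# Hypothetical settings; in practice the desired state would live in source control.
log = []
state = converge({"max_connections": 100}, {"max_connections": 200, "gzip": True}, log)
state = converge(state, {"max_connections": 200, "gzip": True}, log)  # no-op
```

The second call changes nothing and logs nothing, which is the property that makes repeated automated runs safe.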
- Architecture: tolerance
- Service owner responsibility
- War rooms (Ops + Dev)
- Production Party
- Open communication with business
Disaster Recovery
Ops:
- Facilities
- Network and Infra
- Visibility
- Data systems
- Production Engineering
Platform Group Structure
Engineering:
- Data delivery & processing
- App Infrastructure
- Build/Dev tools
- Ops tools

- Skilled Ops engineers
- Sit with product development teams
- Product/Business KPIs
- What? From product. How? From Ops.
- PE team lead: sync, training, implementations
- Two-way communication
Production Engineering
- Very short release cycles (>100 per day)
- Micro services
- Easy to find issue (fix or rollback)
- Automated deployment process
- Testing & monitoring
- Work procedures and culture
Continuous Deployment
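The "fix or rollback" point can be sketched as a deploy step followed by a health check, with an automatic return to the previous version when the check fails. This is a minimal illustration, not the deployment process described in the talk; `deploy_with_rollback` and its `deploy` and `health_check` parameters are hypothetical stand-ins for whatever tooling a team actually uses.

```python
from typing import Callable


def deploy_with_rollback(deploy: Callable[[str], None],
                         health_check: Callable[[], bool],
                         current_version: str,
                         new_version: str) -> str:
    """Deploy `new_version`; roll back to `current_version` if unhealthy.

    Returns the version that is live when the function finishes.
    """
    deploy(new_version)
    if health_check():
        return new_version
    # Health check failed: redeploy the last known-good version.
    deploy(current_version)
    return current_version


# Usage with fake callables: record every deploy and simulate a failed check.
deployed = []
live = deploy_with_rollback(deployed.append, lambda: False, "v1", "v2")
```

Small, frequent releases keep the difference between `current_version` and `new_version` small, which is what makes a rollback like this cheap.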
Ownership & Trust
Product Developers own their Services
Platform owns Infrastructure, Hardware & Network services
- Ownership
- Trust
- Good communication
- Learning
Values
- Face to Face
- Sync and Share
- HipChat: always on
- Open channels
Communication
- Prevention (anomaly detection, trends)
- MTTD, MTTR, MTTS

Technology will eventually fail, we promise to fix it ASAP!
Stability Goals
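The talk does not say which anomaly detection method is used for prevention. As one simple possibility, the sketch below flags a metric sample when it deviates from the mean of the preceding window by more than a chosen number of standard deviations (a rolling z-score). The function name, window size, and threshold are assumptions made for this example.

```python
from statistics import mean, pstdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes of samples that deviate strongly from recent history.

    Each sample is compared with the `window` samples before it. A sample is
    flagged when its distance from the window mean exceeds `threshold`
    standard deviations. If the window is perfectly flat, any different
    value is flagged.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up latency samples with one obvious spike at index 6.
samples = [10, 11, 10, 9, 10, 11, 50, 10, 10]
spikes = find_anomalies(samples)
```

A detector this simple is noisy on seasonal traffic, so it should be read as a starting point rather than a production design.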
- Graph everything
- Self serve
- Combination of internal and external tools:
  - Collectd/Graphite/Nagios
  - Logstash/Elasticsearch/Kibana
  - New Relic/Boundary
  - Keynote/Pingdom/Catchpoint
- Dashboards: Graphitus/Grafana
- Escalation of critical alerts via PagerDuty
Visibility
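"Graph everything" and "self serve" are easier when emitting a metric takes one line of code. As an illustration, Graphite's Carbon daemon accepts a plaintext format of `<metric path> <value> <timestamp>` per line, by default on TCP port 2003. The helper names and the metric path below are hypothetical, and the talk does not describe how metrics are actually shipped.

```python
import socket
import time
from typing import Optional


def graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one sample in Graphite's plaintext protocol."""
    if not path or any(ch.isspace() for ch in path):
        raise ValueError("metric path must be non-empty and contain no whitespace")
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_metric(host: str, port: int, path: str, value: float) -> None:
    """Send a single sample to a Carbon plaintext listener over TCP."""
    line = graphite_line(path, value)
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


# Formatting only; sending requires a reachable Carbon host, for example
# send_metric("graphite.example.internal", 2003, "web.requests.count", 42)
example = graphite_line("web.requests.count", 42, 1700000000)
```

The host name in the comment is a placeholder, not a real endpoint.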
Prevention
Immune system
Unit tests (10k every 10m)
Integration and Regression
Self tests
Monitoring system
Alerts
Keys to success:
- Self serve
- Eliminate false alarms (signal-to-noise ratio)
- Automatic full coverage
Immune System
To NOC or not to NOC?
Mean Time To Detect
- Ops on shift
- Engineer on call
- Escalation policy on PagerDuty
- Manage on HipChat/Phone
Mean Time To Recover
- Escalate only critical issues
- Measure time to resolve
- Blameless learning from events (Take-ins)
- Respect your team's sleep!
Mean Time To Sleep
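Measuring time to detect and time to resolve only needs three timestamps per incident. The sketch below assumes MTTD runs from the start of a failure to its detection and MTTR from detection to resolution; definitions vary between teams, and the talk does not specify which one is used. The `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the failure began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def _mean(deltas):
    """Average a non-empty list of timedeltas."""
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: failure start to detection."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time to recover: detection to resolution (assumed definition)."""
    return _mean([i.resolved - i.detected for i in incidents])


# Two made-up incidents.
t0 = datetime(2024, 1, 1, 12, 0)
history = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=34)),
    Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=26)),
]
```

Tracking both numbers separately shows whether effort should go into faster detection (monitoring) or faster recovery (rollback, automation).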
- After each event
- Blameless
- Action items
- Publish
- Follow up
Take-ins
- Order 2-3 times a year
- Load testing + Prediction
- Elasticity for engineering
- Automatic provisioning
Capacity planning
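When hardware is ordered only a few times a year, the prediction step has to estimate how long the current capacity will last. One simple way, assumed here purely for illustration, is to fit a straight line to recent peak-load measurements and see when it crosses the capacity limit. The function names and the sample numbers are hypothetical.

```python
import math


def linear_fit(ys):
    """Least-squares line through points (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept).
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two measurements")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_full(ys, capacity):
    """Periods after the last measurement until the trend reaches `capacity`.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    crossing = (capacity - intercept) / slope  # x where the line hits capacity
    last = len(ys) - 1
    return max(0, math.ceil(crossing - last))


# Made-up monthly peak loads against a capacity of 200 units.
months_left = periods_until_full([100, 110, 120, 130], capacity=200)
```

A linear trend is a rough assumption for a high-growth service; load testing supplies the capacity figure that the trend is compared against.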
- Weekly tech talks
- IL TechTalks (+ techtalk week)
- Reversim podcast and summit
- Internal/External lectures
- Sunday School
Learning
Do you guys ever work??
Two weeks dedicated to the needs of the technical teams
Quality Time
Thank You
Shai Peretz, SVP Platform Engineering & Operations
And Yes, we are hiring…