ScaleOut your team - Building a technology team for scale in a DevOps culture

Scale Out your team Building your technology team for scale

Shai Peretz, SVP Platform Engineering & Operations

Provide people with the most interesting, relevant and trusted content

Audience First.

Our Lighthouse

Widget Examples

- Traffic: >25 billion page views per month, >8 billion recommendations per day
- Reach: >550M users globally
- Data: multiple petabytes (distributed)
- Servers: >4,000 physical nodes
- Monitoring: >4M metrics per minute
- Team: ~130 engineers (Dev + Ops)
- High growth rate

Scaling Vectors

- We design and build our own data centers (optimization, cost)
- Colocation (fewer clouds on the horizon :)
- Active/Active approach
- Rely on external services when needed (DNS, CDN)

Operational Decisions

- No SPOF (n+x)
- Vendor diversity
- Flexible architecture
- Commodity hardware
- Scale out – no central devices
- Open source

Design guidelines
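
One way to read the "No SPOF (n+x)" guideline is as a sizing check: a pool needs n nodes to carry peak load and keeps x spares, so losing up to x nodes still leaves enough capacity. The sketch below is only an illustration of that arithmetic; the function names and the numbers are hypothetical and do not come from the talk.

```python
import math


def nodes_required(peak_load: float, per_node_capacity: float, spares: int) -> int:
    """Return n + x: nodes needed to serve peak load (n) plus spare nodes (x)."""
    if per_node_capacity <= 0:
        raise ValueError("per_node_capacity must be positive")
    n = math.ceil(peak_load / per_node_capacity)
    return n + spares


def survives_failures(total_nodes: int, peak_load: float,
                      per_node_capacity: float, failures: int) -> bool:
    """True if the nodes left after `failures` losses can still serve peak load."""
    remaining = total_nodes - failures
    return remaining * per_node_capacity >= peak_load


# Hypothetical pool: 9,000 requests/s at peak, 1,000 requests/s per node, x = 2.
pool_size = nodes_required(9000, 1000, spares=2)
```

With these made-up numbers the pool tolerates two simultaneous node failures but not three, which is the property the n+x notation describes.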

- Automate using Chef
- Configuration as code (source control)
- Log changes automatically

Configuration Management

- Architecture – tolerance
- Service owner responsibility
- War rooms (Ops + Dev)
- Production Party
- Open communication with business

Disaster Recovery

Ops:
- Facilities
- Network and Infra
- Visibility
- Data systems
- Production Engineering

Platform Group Structure

Engineering:
- Data delivery & processing
- App Infrastructure
- Build/Dev tools
- Ops tools

- Skilled Ops engineers
- Sit with product development teams
- Product/Business KPIs
- What? – from Product; How? – from Ops
- PE team lead – sync, training, implementations
- Two-way communication

Production Engineering

- Very short release cycles (>100 releases per day)
- Microservices
- Easy to find issues (fix or roll back)
- Automated deployment process
- Testing & monitoring
- Work procedures and culture

Continuous Deployment

Ownership & Trust

Product Developers own their Services

Platform owns Infrastructure, Hardware & Network services

- Ownership
- Trust
- Good communication
- Learning

Values

- Face to face
- Sync and share
- HipChat – always on
- Open channels

Communication

- Prevention (anomaly detection, trends)
- MTTD, MTTR, MTTS (Mean Time To Detect, Recover, Sleep)

Technology will eventually fail; we promise to fix it ASAP!

Stability Goals
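
As a rough illustration of how MTTD and MTTR could be tracked, the sketch below averages detection and recovery times over a list of incident records. The `Incident` fields, the sample timestamps, and the choice to measure recovery from the moment of detection are assumptions made for this example; the talk does not specify how the metrics are computed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when service was restored


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean Time To Detect: average of (detected - started)."""
    return timedelta(seconds=mean(
        (i.detected - i.started).total_seconds() for i in incidents))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time To Recover, measured here from detection (an assumption)."""
    return timedelta(seconds=mean(
        (i.recovered - i.detected).total_seconds() for i in incidents))


# Hypothetical incident log, not data from the talk.
t0 = datetime(2015, 1, 1, 12, 0)
sample_incidents = [
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=12)),
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=24)),
]
```

Tracking both numbers separately shows whether an improvement came from faster detection (monitoring) or faster recovery (rollback and on-call response).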

- Graph everything
- Self serve
- Combination of internal and external tools:
  - Collectd/Graphite/Nagios
  - Logstash/ElasticSearch/Kibana
  - New Relic/Boundary
  - Keynote/Pingdom/Catchpoint
- Dashboards – Graphitus/Grafana
- Escalation of critical alerts via PagerDuty

Visibility

Prevention

Immune system

Unit tests (10k every 10m)

Integration and Regression

Self tests

Monitoring system

Alerts

Keys to success:
- Self serve
- Eliminate false alarms (signal-to-noise ratio)
- Automatic full coverage

Immune System
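
To make "eliminate false alarms" measurable, one option is to record for each alert whether it needed action, then flag the checks whose false-alarm rate is too high. This is a minimal sketch under that assumption; the check names, the 0.5 threshold, and the function names are hypothetical.

```python
from collections import defaultdict


def false_alarm_rates(alerts):
    """alerts: iterable of (check_name, was_actionable) pairs.

    Returns a dict mapping each check to the fraction of its alerts
    that were not actionable.
    """
    totals = defaultdict(int)
    false_alarms = defaultdict(int)
    for check, actionable in alerts:
        totals[check] += 1
        if not actionable:
            false_alarms[check] += 1
    return {check: false_alarms[check] / totals[check] for check in totals}


def noisy_checks(alerts, max_false_rate=0.5):
    """Names of checks whose false-alarm rate exceeds max_false_rate, sorted."""
    rates = false_alarm_rates(alerts)
    return sorted(check for check, rate in rates.items() if rate > max_false_rate)


# Hypothetical alert history.
history = [
    ("disk_full", True), ("disk_full", True),
    ("cpu_spike", False), ("cpu_spike", False), ("cpu_spike", True),
]
```

Checks returned by `noisy_checks` are candidates for retuning or removal, so that the alerts that do page someone stay trustworthy.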

To NOC or not to NOC?

Mean Time To Detect

- Ops on shift
- Engineer on call
- Escalation policy on PagerDuty
- Manage on HipChat/phone

Mean Time To Recover

- Escalate only critical issues
- Measure time to resolve
- Blameless learning from events (Take-Ins)
- Respect your team’s sleep!

Mean Time To Sleep

- After each event
- Blameless
- Action items
- Publish
- Follow up

Take-Ins

- Order 2-3 times a year
- Load testing + prediction
- Elasticity for engineering
- Automatic provisioning

Capacity planning
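
Ordering only 2-3 times a year means the prediction has to look several months ahead. A simple way to sketch the "prediction" step is a least-squares trend line over past monthly peaks, extended until it reaches the capacity measured by load testing. The linear model, the function names, and the numbers below are assumptions for illustration, not the method described in the talk.

```python
def fit_line(ys):
    """Least-squares slope and intercept for evenly spaced samples x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def months_until_full(monthly_peaks, capacity):
    """Months after the last sample until the trend line reaches capacity.

    Returns None when the trend is flat or falling, 0.0 if already over capacity.
    """
    slope, intercept = fit_line(monthly_peaks)
    if slope <= 0:
        return None
    month_full = (capacity - intercept) / slope
    last_month = len(monthly_peaks) - 1
    return max(0.0, month_full - last_month)


# Hypothetical peaks (e.g. thousands of requests/s) and a load-tested capacity.
runway = months_until_full([100, 110, 120, 130], capacity=160)
```

If the runway is shorter than the gap to the next planned order plus delivery lead time, the order has to be pulled forward or sized up.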

- Weekly tech talks
- IL TechTalks (+ TechTalk week)
- Reversim podcast and summit
- Internal/external lectures
- Sunday School

Learning

Do you guys ever work??

Two weeks dedicated to the needs of the technical teams

Quality Time

Thank You

Shai Peretz, SVP Platform Engineering & Operations

And Yes, we are hiring…
