Monitoring at section.io - Operational Intelligence Meetup May 2016

Page 1: Monitoring at section.io - Operational Intelligence Meetup May 2016

Monitoring at section.io

Operational visibility for both the platform and our users

Page 2: Monitoring at section.io - Operational Intelligence Meetup May 2016

A modern CDN

• Runs on your local machine and pre-production
• Configuration and deployment via git
• Fast global cache management
• HTTPS and HTTP/2 by default

Page 3: Monitoring at section.io - Operational Intelligence Meetup May 2016

Open platform

• Integrates with popular open-source tools
• API driven
• Near real-time log access
• Consistent operational interface

Page 4: Monitoring at section.io - Operational Intelligence Meetup May 2016

Containers

• Delivery Proxies
• Varnish Cache
• ModSecurity

• Kibana
• Graphite
• Umpire

Page 5: Monitoring at section.io - Operational Intelligence Meetup May 2016

Gathering data

• Web access logs, syslog, performance data
• Docker Volumes
• Elastic Beats
• Log rotation

Page 6: Monitoring at section.io - Operational Intelligence Meetup May 2016

Log volume

• 600 million web access logs per week
• 60,000 log entries processed per minute
• 7 days of logs are searchable
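The weekly and per-minute figures on this slide agree with each other. The following is only an arithmetic check, assuming the weekly total is spread evenly across the week (real traffic will peak and dip):

```python
# Sanity-check the slide's figures: 600 million logs per week vs 60,000 per minute.
LOGS_PER_WEEK = 600_000_000
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def average_per_minute(logs_per_week: int) -> float:
    """Average rate, assuming an even spread over the week (an assumption)."""
    return logs_per_week / MINUTES_PER_WEEK


rate = average_per_minute(LOGS_PER_WEEK)
print(f"{rate:,.0f} log entries per minute on average")
```

The average comes out just under 60,000 per minute, which matches the rounded figure on the slide.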

Page 7: Monitoring at section.io - Operational Intelligence Meetup May 2016

Log flow

[Diagram: Delivery networks; Logstash receivers; redis; Logstash processors; Logstash senders; redis; Ops Elasticsearch cluster; Apps Elasticsearch cluster; StatsD, Carbon]

Between about 5 seconds and 2 minutes
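The flow includes StatsD and Carbon alongside the Elasticsearch clusters. As a rough illustration of how a metric derived from a log entry could be handed to StatsD, here is a minimal sketch of the plain-text StatsD line format sent over UDP. The metric names, host, and port are assumptions for illustration, not section.io's actual configuration, and in the talk's pipeline this work would be done by Logstash rather than hand-written code.

```python
import socket
from typing import List


def statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format one metric in the StatsD text protocol: <name>:<value>|<type>."""
    if metric_type not in {"c", "ms", "g"}:  # counter, timer, gauge
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metrics(lines: List[str], host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send newline-separated metrics in one UDP datagram (fire and forget)."""
    payload = "\n".join(lines).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names derived from a single web access log entry.
lines = [
    statsd_line("example.www.status.200", 1, "c"),
    statsd_line("example.www.response_time", 42, "ms"),
]
```

Calling `send_metrics(lines)` would emit both metrics to a StatsD daemon listening on the assumed address.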

Page 8: Monitoring at section.io - Operational Intelligence Meetup May 2016

Log visibility

• Kibana
• Elasticsearch API
• Traces
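For the Elasticsearch API route, a query body can be built programmatically. The sketch below constructs a query DSL body for recent server-error responses. The `status` field name and the `logstash-*` index pattern are assumptions (the latter is Logstash's conventional default), not details confirmed by the talk.

```python
import json


def recent_errors_query(minutes: int = 15, min_status: int = 500, size: int = 20) -> dict:
    """Build an Elasticsearch query DSL body for recent server-error responses."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    # Relative date math: only entries from the last N minutes.
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                    # `status` is a hypothetical field name for the HTTP status code.
                    {"range": {"status": {"gte": min_status}}},
                ]
            }
        },
    }


body = json.dumps(recent_errors_query(), indent=2)
print(body)  # would be POSTed to <cluster>/logstash-*/_search (assumed index pattern)
```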

Page 9: Monitoring at section.io - Operational Intelligence Meetup May 2016

Beyond logs

• Metrics can optimise common log queries
• Metrics retention:
    • 1 minute granularity for 1 month
    • 1 hour granularity for 13 months

• Graphite, Tessera, and Grafana
• Heroku Umpire
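The retention policy on this slide maps naturally onto Graphite's `precision:duration` retention syntax. The sketch below renders it and counts the data points each archive holds per metric. Treating 1 month as 30 days and 13 months as 395 days is an assumption; the talk does not give exact durations.

```python
from typing import List, Tuple

# (seconds per point, days kept); the day counts are assumptions.
ARCHIVES: List[Tuple[int, int]] = [
    (60, 30),     # 1 minute granularity for roughly 1 month
    (3600, 395),  # 1 hour granularity for roughly 13 months
]


def points_per_archive(archives: List[Tuple[int, int]]) -> List[int]:
    """Number of data points each archive stores for a single metric."""
    return [days * 86400 // step for step, days in archives]


def retention_string(archives: List[Tuple[int, int]]) -> str:
    """Render archives in Graphite's `precision:duration` retention form."""
    parts = []
    for step, days in archives:
        precision = f"{step // 3600}h" if step % 3600 == 0 else f"{step // 60}m"
        parts.append(f"{precision}:{days}d")
    return ",".join(parts)


print(retention_string(ARCHIVES))
print(points_per_archive(ARCHIVES))
```

Under these assumptions the hourly archive covers more than ten times the time span with roughly a fifth of the points, which is why the coarser granularity is affordable for long-term trends.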

Page 10: Monitoring at section.io - Operational Intelligence Meetup May 2016

Platform monitoring

• CPU utilisation, memory usage, disk space
• Traffic: connections, requests, packets, bytes
    • By partition, node, geo-region, and domain
    • By HTTP response status code

• Log latency, queue depth, processing rate
• Message counts, errors, processing time
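Log latency can be thought of as the gap between when an event occurred and when it became available downstream. The sketch below is one way to measure it, assuming both timestamps are available as ISO 8601 strings. The 120-second limit is taken from the "between about 5 seconds and 2 minutes" range on the log flow slide; the function names are hypothetical.

```python
from datetime import datetime

MAX_EXPECTED_LATENCY_S = 120  # upper end of the "5 seconds to 2 minutes" range


def log_latency_seconds(event_time: str, indexed_time: str) -> float:
    """Seconds between a log event occurring and being indexed (ISO 8601 inputs)."""
    occurred = datetime.fromisoformat(event_time)
    indexed = datetime.fromisoformat(indexed_time)
    return (indexed - occurred).total_seconds()


def is_lagging(latency_s: float, limit_s: float = MAX_EXPECTED_LATENCY_S) -> bool:
    """True when latency exceeds the expected upper bound."""
    return latency_s > limit_s


latency = log_latency_seconds("2016-05-01T10:00:00+00:00", "2016-05-01T10:00:45+00:00")
print(latency, is_lagging(latency))
```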

Page 11: Monitoring at section.io - Operational Intelligence Meetup May 2016

Website monitoring

• Cache hit, miss, pass
    • By content-type

• Response time (median, mean, upper 95%)
• WAF intercepts
    • By rule
    • By country
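The three response-time statistics behave differently when outliers are present. The sketch below computes them from raw samples using a nearest-rank percentile. This is an assumption about the method: the talk does not say how "upper 95%" is calculated, and tools such as StatsD may use a slightly different rule.

```python
import math
import statistics
from typing import Dict, Sequence


def summarise_response_times(samples_ms: Sequence[float], percentile: int = 95) -> Dict[str, float]:
    """Median, mean, and nearest-rank upper percentile of response times."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest rank: smallest value with at least `percentile`% of samples at or below it.
    rank = max(1, math.ceil(len(ordered) * percentile / 100))
    return {
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        f"upper_{percentile}": ordered[rank - 1],
    }


# Nineteen ordinary responses and one slow outlier (made-up sample data).
samples = [12, 15, 11, 14, 13, 16, 12, 13, 14, 15,
           11, 12, 13, 14, 15, 16, 12, 13, 14, 250]
print(summarise_response_times(samples))
```

With this sample data the single slow response pulls the mean well above the median, while the upper 95% value stays close to typical responses.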

Page 12: Monitoring at section.io - Operational Intelligence Meetup May 2016

Internal processes

• Every staff member does on-call
• Every alert is actionable
• Every incident feeds the product backlog

Page 13: Monitoring at section.io - Operational Intelligence Meetup May 2016

Beyond today

• Yelp Elastalert
• Custom log fields
• A `tail -f` UI
• Automated anomaly detection

Page 14: Monitoring at section.io - Operational Intelligence Meetup May 2016

Thank you

Jason Stangroome

Twitter: @jstangroome
https://blog.stangroome.com
https://www.section.io/blog