Enterprise Drupal Application & Hosting Infrastructure Level Monitoring

Post on 06-Apr-2017

Daniel Kanchev

Senior Site Reliability Engineer

@dvkanchev

Enterprise Drupal Hosting Characteristics

○ Consists of multiple servers

○ Provides high availability

○ Offers auto scalability

○ Requires multiple services to work as expected

○ Really expensive

○ Nobody wants to manage this sh*t :)

Hosting Types Complexity

○ Shared Hosting Service

○ Single Virtual Server

○ Single Dedicated Server

○ PaaS

○ Custom Private/Public Clouds

○ ElasticSearch/Solr

○ Redis/Memcached

○ GraphQL

○ MongoDB

○ Nodejs

○ Gearman

○ CI systems

One Monitoring To Rule Them All

• Website Monitoring

• Hosting Infrastructure Monitoring

Website Monitoring Architecture

[Diagram: the website is monitored from three locations (London, Amsterdam, Munich); one build of the slide shows a "503 ISE" response.]

Incidents

○ Critical Incident - website is down from all locations

○ Major Incident - website is down from a single location; MySQL replication is broken; PHP fatal errors recorded in the logs; read-only file system issue

○ Minor Incident - Memcached/Redis on a single server is down

○ Notice Incident - web node X is running out of space; PHP warnings recorded in the logs
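The website-level part of this classification can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's tooling: the `Severity` enum and `classify_website_incident` function are hypothetical names, and treating "down from some but not all locations" as Major is an assumption, since the slide only names the single-location case.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    MAJOR = "major"
    CRITICAL = "critical"


def classify_website_incident(results: dict[str, bool]) -> Severity:
    """Map per-location probe results (True = site reachable) to a severity."""
    if not results:
        raise ValueError("no probe results supplied")
    failed = [location for location, reachable in results.items() if not reachable]
    if not failed:
        return Severity.OK
    if len(failed) == len(results):
        # Down from all locations: Critical Incident.
        return Severity.CRITICAL
    # Down from one location (assumption: also from several, but not all): Major Incident.
    return Severity.MAJOR
```

For example, `classify_website_incident({"London": True, "Amsterdam": False, "Munich": True})` yields `Severity.MAJOR`.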

Core Principles

○ Log all events and archive them. Write postmortem reports

○ Check every single incident - even minor ones and notices

○ Define performance limits and regularly check reports

○ Beware of cascade failures

○ Always strive to go back to pre-incident state

○ Check one thing at a time and return “OK” or “Failure”
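The last principle, one check per concern with a binary result, might look like the following sketch. The function names and the 10% free-space threshold are assumptions chosen for illustration; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil

OK = "OK"
FAILURE = "Failure"


def evaluate_free_space(free_bytes: int, total_bytes: int, min_free_ratio: float) -> str:
    """Return OK if the free share of the disk meets the threshold, else Failure."""
    if total_bytes <= 0:
        return FAILURE
    return OK if free_bytes / total_bytes >= min_free_ratio else FAILURE


def check_disk_space(path: str = "/", min_free_ratio: float = 0.10) -> str:
    """Check exactly one thing: free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return evaluate_free_space(usage.free, usage.total, min_free_ratio)
```

Keeping the decision in a pure function makes the threshold easy to test, while the wrapper stays small enough that a failure points at a single cause.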

Examples

○ 1 of 5 app servers goes down

○ The failed server's 20% share of the load shifts to the other 4 (about 25% more each)

○ Redis caches are invalidated - overload

○ Varnish is restarted by a system administrator to apply a configuration change

○ App servers start to return 503 errors

○ MySQL master goes down

○ MySQL slave 1 takes over and at this moment there is no downtime

○ MySQL slave 2 is behind the new master

○ The new MySQL master goes down too; the result is a broken or outdated DB
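The arithmetic behind the app-server example can be worked through with a short sketch. It assumes traffic is balanced evenly across servers, which the slides do not state, and `load_increase_after_failure` is a hypothetical helper name. Under that assumption, the failed server's 20% share of total traffic is spread over the four survivors, so each survivor's own load grows by 25%.

```python
def load_increase_after_failure(total_servers: int, failed_servers: int) -> float:
    """Relative load increase per surviving server, assuming even load balancing."""
    if total_servers <= 0 or not 0 <= failed_servers < total_servers:
        raise ValueError("need at least one surviving server")
    survivors = total_servers - failed_servers
    # Each server goes from 1/total of the traffic to 1/survivors of it.
    return total_servers / survivors - 1.0
```

The increase is not linear: losing a second server out of five raises the load on each of the remaining three by roughly two thirds, which is how a single failure can turn into a cascade.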

KEY TAKEAWAYS

1. Embrace Failure and Design for Failure

2. Automate Recovery

3. Log all incidents and analyse them

4. Measure and graph the performance of all components

5. Regularly break things on purpose in order to test
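As one possible illustration of takeaway 4, the sketch below emits a metric in the StatsD plain-text line format, the tool introduced in Etsy's "Measure Everything" post. Whether the talk itself used StatsD is not stated; the metric name, the helper names, and the collector address are assumptions (8125 is the conventional StatsD UDP port).

```python
import socket


def format_statsd_metric(name: str, value: float, metric_type: str) -> bytes:
    """Build a StatsD line: <name>:<value>|<type>, with c=counter, ms=timer, g=gauge."""
    if metric_type not in {"c", "ms", "g"}:
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget over UDP, so a missing collector never blocks the application."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A call such as `send_metric(format_statsd_metric("web1.response_time", 320, "ms"))` would record one timing sample, which a graphing backend can then plot per component.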

RESOURCES

Injecting Failure at Netflix - goo.gl/YE1sEY

What is SRE - goo.gl/2lI8E0

SRE book - goo.gl/bfL2At

Netflix Open Source Software - https://netflix.github.io/

Etsy “Measure Everything” - goo.gl/CPVUT5

JOIN US FOR CONTRIBUTION SPRINTS

First Time Sprinter Workshop - 9:00-12:00 - Room Wicklow 2A

Mentored Core Sprint - 9:00-18:00 - Wicklow Hall 2B

General Sprints - 9:00-18:00 - Wicklow Hall 2A

Evaluate This Session

THANK YOU!

events.drupal.org/dublin2016/schedule

WHAT DID YOU THINK?