A Tale of a Server Architecture (Frozen Rails 2012)


Ville Lautanala's talk from Frozen Rails 2012: how Flowdock uses chef and ZooKeeper to manage a set of distributed services.


A Tale of a Server Architecture

Ville Lautanala (@lautis)

WHO AM I @lautis

Flowdock is a team collaboration app with software developers as the primary target audience. Right-hand side: chat; left-hand side: an inbox or activity stream for your team. If you’ve read a Node.js tutorial, you probably already know the kind of architecture this needs.

Facts

• Single page JavaScript front-end

• WebSocket based communication layer

• Three replicated databases

• Running on dedicated servers in Germany

• 99.98% availability

WebSockets == no third-party load balancers/PaaS for us. 99.99% according to the CEO, but I’m being conservative.

Goal: beat your hosting provider in uptime

Have a good uptime on unreliable hardware.

We don’t want to wake up at night to fix our app like the guy in this picture. The founders previously ran a hosting company.

This is not an exact science, every app is different.

Architecture Archaeology

We haven’t always been doing this well.

Flowdock 2010

[Stack diagram: Apache in front of Rails, with PostgreSQL and MongoDB as databases and a separate Messages backend]

Simple stack, but the messaging part quickly became hairy. It had HTTP streaming, Twitter integration and an e-mail server. Lots of brittle state.

Divide and Conquer

Nice strategy for building your SOA, sorting lists and taking over the world.

[Stack diagram: GeoDNS and Stunnel in front of HAproxy; behind it Rails, an API, HTTP Streaming, a WebSocket API, IRC and RSS services, and a Message Backend; MongoDB, Redis and PostgreSQL as data stores]

These are all separate processes. More components, but this has enabled us to easily add new features to individual components.

Separated concerns...

but many parts to configure

So, you need to set up boxes...

Chef: Infrastructure as (Ruby) Code

Chef lets you automate server configuration with Ruby code.

Chef at Flowdock

• Firewall configuration

• Distribute SSH host keys

• User setup

• Join mesh-based VPN

• And app/server specific stuff

Firewall setup is based on an IP whitelist: only nodes registered in Chef can access private services. SSH host key distribution prevents MITM attacks. We have a mesh-based VPN, which is automatically configured based on Chef data.
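As a rough illustration, the whitelist can be generated from Chef’s node index. A minimal sketch (hypothetical template and file names, not Flowdock’s actual cookbook):

# Collect the address of every node registered in Chef
# and render them into a firewall whitelist.
whitelisted_ips = search(:node, "*:*").map { |n| n["ipaddress"] }.compact.sort.uniq

template "/etc/iptables.d/whitelist.rules" do
  source "whitelist.rules.erb" # hypothetical template
  variables :ips => whitelisted_ips
  owner "root"
  mode "0600"
  notifies :run, "execute[reload-firewall]"
end

execute "reload-firewall" do
  command "iptables-restore < /etc/iptables.d/whitelist.rules"
  action :nothing
end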

• Cookbooks

• Recipes

• Roles

Chef server

A centralized Chef server that nodes communicate with and get updates from.
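Nodes typically join the cluster by being bootstrapped with knife, roughly like this (hypothetical host and node name; SSH credentials omitted):

$ knife bootstrap 10.0.0.1 -N imaginary-server -r 'role[qa]'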

include_recipe "flowdock::users"
package "ruby"

%w{port listen_to flowdock_domain}.each do |e|
  template "#{node[:flowdock][:oulu][:envdir]}/#{e.upcase}" do
    source "envdir_file.erb"
    variables :value => node[:flowdock][:oulu][e]
    owner "oulu"
    mode "0600"
  end
end

runit_service "oulu" do
  options :use_config => true
end

cookbooks/flowdock/oulu.rb

Recipe for our IRC server

roles/rails.rb

name "rails"
description "Rails Box"
run_list(
  "recipe[nginx]",
  "recipe[passenger]"
)
override_attributes(
  passenger: { version: "3.0.7" }
)

Roles are written in the same Ruby DSL as recipes. Each node can be assigned any number of roles. Override attributes can be used to override recipe attributes.
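Assigning a role to a node is a knife one-liner, e.g. (hypothetical node name):

$ knife node run_list add imaginary-server 'role[rails]'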

Managing Chef cluster

$ knife cookbook upload -a -o cookbooks

Managing Chef cluster

$ knife search node role:flowdock-app-server
Node Name: imaginary-server
Environment: qa
FQDN: imaginary-server.flowdock.dmz
IP: 10.0.0.1
Run List: role[qa], role[flowdock-app-server], role[web-server]
Roles: qa, flowdock-app-server, web-server
Recipes: ubuntu, firewall, chef, flowdock, unicorn, haproxy
Platform: ubuntu 12.04
Tags:

Managing Chef cluster

$ knife ssh 'role:qa' 'echo "lol"'
imaginary-server lol
qa-db1 lol
qa-db2 lol

The most useful command: triggering a Chef run on servers.
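Assuming chef-client runs via sudo, that looks something like:

$ knife ssh 'role:qa' 'sudo chef-client'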

Testing Chef Recipes

• Use Chef environments to isolate changes (see the sketch after this list)

• Run chef-client on throw-away VMs

• cucumber-chef

sous-chef could be used to automate VM setup. Our experience with cucumber-chef and sous-chef is limited. You also need to monitor things, e.g. that Chef runs have finished on nodes and that backups are really taken.
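An environment is just another Ruby DSL file. A minimal sketch of pinning a cookbook version in QA (hypothetical version number):

# environments/qa.rb: changes to the flowdock cookbook only
# reach QA once this constraint is bumped.
name "qa"
description "QA environment"
cookbook "flowdock", "= 1.2.0"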

Automatic Failover: Avoiding Single Points of Failure

MongoDB works flawlessly as failover is built-in, but how to handle Redis?

HAproxy: TCP/HTTP Load Balancer with Failover Handling

HAproxy provides easy failover for Rails instances
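For example, a backend where a spare server only receives traffic when the primaries fail their health checks. A minimal sketch, not our actual configuration (addresses and health-check path are made up):

backend rails
  mode http
  option httpchk GET /up
  server app1 10.0.0.10:8080 check
  server app2 10.0.0.11:8080 check
  server spare 10.0.0.12:8080 check backup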

MongoDB has automatic failover built-in

MongoDB might have many problems, but failover isn’t one of them. Drivers are always connected to the master.

Redis and Postgres have replication, but failover is manual

Not only do you need to promote a new master automatically, you also need to change the application configuration.

ZooKeeper

Distributed coordination

Each operation has to be agreed on by a majority of servers. Eventual consistency.

require 'zk'

$queue = Queue.new
zk = ZK.new
zk.register('/hello_world') do |event|
  # need to reset watch
  data = zk.get('/hello_world', watch: true).first
  # do stuff
  $queue.push(:event)
end

zk.create('/hello_world', 'sup?')
$queue.pop # Handle local synchronization
zk.set('/hello_world', 'omg, update')

Using the high-level zk gem. The block is run every time the value is updated. The ZK gem also has locks and other coordination primitives implemented.

zk = ZK.new

zk.with_lock('/lock', :wait => 5.0) do |lock|
  # do stuff
  # others have to wait
end

Redis master failover using ZooKeeper

gem install redis_failover

...but we needed it in 3 programming languages.
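On the Ruby side the client discovers the current master through ZooKeeper instead of a hard-coded address. A minimal sketch (hypothetical ZooKeeper addresses):

require 'redis_failover'

# Looks up the current Redis master via ZooKeeper and
# transparently reconnects on failover.
redis = RedisFailover::Client.new(:zkservers => 'zk1:2181,zk2:2181,zk3:2181')
redis.set('greeting', 'hello')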

Redis Failover

[Diagram: two Node Managers monitor the Redis nodes and update ZooKeeper; the apps watch ZooKeeper for the current master]

Our apps might not use redis_failover or read ZK directly; a script restarts the app when the ZK data changes. HAproxy or DNS based solutions are also possible, but this gives us more control over the app restart.
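A minimal sketch of such a restart script using the zk gem and runit (hypothetical ZK path and service name):

require 'zk'

zk = ZK.new

zk.register('/redis/master') do |event|
  zk.get('/redis/master', watch: true) # re-arm the watch
  system('sv restart app')             # restart the runit-managed service
end

zk.get('/redis/master', watch: true)   # set the initial watch
sleep                                  # keep the watcher process alive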

Postgres failover with pgpool-II and ZooKeeper

pgpool manages the pg cluster; queries can be distributed to slaves. I’m afraid of pgpool: the configuration and monitoring scripts are really scary.

Postgres Failover

[Diagram: the app connects through pgpool to the PostgreSQL master and slave; a pgpool monitor coordinates via ZooKeeper]

ZooKeeper-based pgpool monitoring is used to provide redundancy for pgpool itself. If pgpool fails, the app needs to reconnect to a new server.

Zoos are kept. A similar scheme can be used for other master-slave replication setups, e.g. handling Twitter integration failover.
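For singleton services like the Twitter integration, the same ZooKeeper lock primitive works as a crude leader election. A minimal sketch (run_twitter_integration is a hypothetical stand-in):

require 'zk'

zk = ZK.new

# Blocks until this node holds the lock; whoever holds it
# is the only node running the integration.
zk.with_lock('/locks/twitter-integration') do
  run_twitter_integration
end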

REMEMBER TO TEST

Test your failover

You might only need failover a few times a year.

I’m not sure all of our stuff is top-notch, but even the complicated pieces have had their one-time use cases.

Chef vs ZooKeeper

Chef                 | ZooKeeper
---------------------|--------------------------------
Configuration files  | Dynamic configuration variables
Server bootstrap     | Failover handling

Chef writes long configuration files; ZooKeeper only contains a few variables. Chef bootstraps servers and keeps them up to date; ZooKeeper is used to elect master nodes in master-slave scenarios.

Mesh-based VPN between boxes

Encrypted MongoDB traffic between masters and slaves. Saved the day a few times when there have been routing issues between data centers.

SSL endpoints in AWS

There were routing issues between our German ISP and Comcast. Moving the SSL front ends closer to the client fixed this and reduced latency: the front page loads 150 ms faster.

Winning

We don’t need to worry about waking up at night. The whole team could go sailing and be without internet access at the same time.

Lessons learned

What have we learned?

WebSockets are cool, but make your life harder

Heroku, Amazon Elastic Load Balancer, CloudFlare and Google App Engine don’t work with WebSockets. If you only need to stream data to the client, HTTP event streaming (Server-Sent Events) is a better choice.

♫ Let it crash ♫

Make your app crash; at least then you are there to fix things.

Questions?

Thanks!