Stack Exchange Infrastructure - LISA 14

The Stack Exchange Infrastructure Vroom Vroom


This is my presentation about the Stack Exchange (Stack Overflow, Server Fault, Super User) Infrastructure that I presented at LISA 14


The Stack Exchange Infrastructure

Vroom Vroom

inet.perf.profile

• SRE Generalist @ Stack Exchange

• @GABeech

• http://brokenhaze.com

• http://stackexchange.com

A brief Overview

• 560 million page views a month

• 34TB of data transferred a month

• 1665 rps (2250 peak) across the web farm

• WISC(HER)

Windows, IIS, SQL Server, C#, HAProxy, Elastic Search, Redis

Our First Priority is Performance

Nobody likes a slow site, least of all us. When your site is slow, people leave.

Make your site fast, and the people will stay!

Good write-up on moz.com: http://moz.com/blog/site-speed-are-you-fast-does-it-matter

Why do I bring up performance in an infra talk? Simple: it drives our design decisions.

The Performance toolkit

• Mini Profiler

• OpServer (https://github.com/opserver/Opserver)

• Client Timings (http://teststackoverflow.com/)

Mini Profiler

Shown to every Dev/SRE on every page

Oneboxed in our chat system
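The idea of showing step timings on every page can be sketched in a few lines. This is not MiniProfiler's API (MiniProfiler is a .NET library); `StepProfiler` and the step names are hypothetical and exist only to illustrate per-request step timing.

```python
import time
from contextlib import contextmanager


class StepProfiler:
    """Hypothetical per-request profiler: records how long each named step takes."""

    def __init__(self):
        self.timings = []  # list of (step name, elapsed milliseconds)

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the timing even if the step raised an exception.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.timings.append((name, elapsed_ms))

    def report(self):
        return "\n".join(f"{name}: {ms:.1f} ms" for name, ms in self.timings)


profiler = StepProfiler()
with profiler.step("sql query"):
    time.sleep(0.01)   # stand-in for real work
with profiler.step("render"):
    time.sleep(0.005)  # stand-in for real work
print(profiler.report())
```

Rendering a report like this at the bottom of every page is what keeps slow steps visible to the people who can fix them.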

OpServer

Bubbles up problems

OpServer HAproxy

OpServer Redis

OpServer SQL

Client Timings

How well are we actually doing when _you_ load the page?

You can’t be fast if you are not up

• Highly Redundant network

• Datacenter, ISP, Edge, Core, Server, Port

The actual design starts now.

4 different providers, selected for different characteristics

Router redundancy: hot/standby

HSRP/BGP on “T2”

Full BGP tables and HSRP on T1

Load Balancers

• HAProxy

• 2 Servers (Hot/Standby)

• Multiple Tiers (HAProxy Processes)

4B requests/month

3000 req/sec peak

10% CPU, 18% peak

Between 600 and 700 concurrent connections (EST, TIME_WAIT, etc.)

Multiple processes allow for granular restarts and segregation of faults

SSL termination done on the LB

Websockets: the weird connection; long-lived TCP, not HTTP

Request flow (diagram): request in; is it HTTP? Yes: on to the servers. No: terminate HTTPS, after which it is HTTP.

SSL Termination

• Terminated at LB

• Feature added to HAProxy 1.5

• See: http://brokenhaze.com/blog/2014/03/25/how-stack-exchange-gets-the-most-out-of-haproxy/

Source port exhaustion: use 127.0.0.0/8 to resolve

Server only running at ~12% CPU

We don’t run full SSL everywhere yet
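The arithmetic behind source port exhaustion can be sketched as follows. A TCP connection is identified by source IP, source port, destination IP and destination port, so one source address talking to one destination address and port can hold at most one connection per ephemeral port, and sockets in TIME_WAIT count against that limit. Spreading the traffic over more addresses in 127.0.0.0/8 multiplies the ceiling. The port range below is an assumed common Linux default, and the address counts are arbitrary illustrations, not figures from the talk.

```python
import ipaddress

# Assumption: a common Linux default for net.ipv4.ip_local_port_range.
EPHEMERAL_PORTS = range(32768, 61000)


def max_connections(source_addresses, dest_endpoints=1):
    """Upper bound on simultaneous TCP connections (TIME_WAIT included)."""
    return source_addresses * len(EPHEMERAL_PORTS) * dest_endpoints


loopback = ipaddress.ip_network("127.0.0.0/8")
print(loopback.num_addresses)  # 16777216 usable loopback addresses
print(max_connections(1))      # 28232 with a single address pair
print(max_connections(4))      # 112928 with four source addresses
```

Every additional loopback address adds a full port range, which is why the whole /8 is useful here.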

Web Servers


• IIS

• 9 Production (2 Test/Dev)

• Dell R610’s

• 32GB Memory

• 2xE5-5640

185 req/s, 250 peak

15% CPU usage, 20% peak
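A quick check ties these numbers back to the overview. It assumes the 185 and 250 req/s figures are per server, which the slide does not state explicitly; the "one server down" case is an added illustration.

```python
SERVERS = 9            # production web servers
AVG_PER_SERVER = 185   # req/s, assumed to be per server
PEAK_PER_SERVER = 250  # req/s, assumed to be per server

farm_avg = SERVERS * AVG_PER_SERVER
farm_peak = SERVERS * PEAK_PER_SERVER
print(farm_avg, farm_peak)  # 1665 2250, matching the overview slide

# Illustration: average load per server if one server is out of rotation.
per_server_one_down = farm_avg / (SERVERS - 1)
print(f"{per_server_one_down:.0f} req/s")  # 208 req/s
```

At 15% CPU per server, losing a server or two barely moves the load on the rest.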

Data Tier

• MS SQL Server

• 4 Servers

• 2 Always-On Clusters

• Each Cluster 1 RW, 1 RO

(SO) 343M queries per day

(SO) Peak of 7500 queries/second

(SE) 216M queries per day

(SE) Peak of 3200 queries/second

CPU use: SO 8%, peak 15%; SE 10%, peak 20%
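The daily totals convert to average per-second rates, which shows how far the peaks sit above the average. This is only arithmetic on the figures above; the helper name is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(per_day):
    """Convert a daily total into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


clusters = [("SO", 343_000_000, 7500), ("SE", 216_000_000, 3200)]
for name, per_day, peak in clusters:
    avg = average_per_second(per_day)
    print(f"{name}: avg {avg:.0f} q/s, peak/avg {peak / avg:.2f}")
# SO: avg 3970 q/s, peak/avg 1.89
# SE: avg 2500 q/s, peak/avg 1.28
```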

Caching Tier

• Redis

• 2 Servers

• Hot / Standby configuration

3.65B operations a day

Peak 60,000/s

3% CPU usage

Tag Engine

• Our Special index of SO

• Tagging is hard

• Written by Marc Gravell

• http://blog.marcgravell.com/2014/04/technical-debt-case-study-tags.html

3 servers, 32 GB RAM

3644 req/s

3% CPU, 10% peak

Replaced full-text search in SQL Server

Spins up a full copy of SO/SE

Cool thing: can be upgraded with 0 downtime

Elastic Search

• 203GB Index

• 3 Machines

• 42M searches/day

2 others, not prod:

Machine learning

Logstash (300TB)

Deployment

• Git

• TeamCity

• Custom Powershell Scripts

TeamCity monitors our development Git repository

Dev auto-builds (deploy to Meta)

When the build is verified, a dev triggers the prod build

Prod build copies artifacts from the dev build
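The promotion flow above can be sketched roughly like this. It is a toy model, not the actual TeamCity configuration or PowerShell scripts; `Build`, `dev_build` and `promote_to_prod` are hypothetical names. The point it illustrates is that prod reuses the artifacts of the verified dev build instead of building again.

```python
from dataclasses import dataclass


@dataclass
class Build:
    commit: str
    artifacts: tuple
    verified: bool = False


def dev_build(commit):
    """Hypothetical: build triggered automatically by a change in the dev repo."""
    return Build(commit=commit, artifacts=(f"site-{commit}.zip",))


def promote_to_prod(build):
    """Hypothetical: prod copies the dev build's artifacts, no rebuild."""
    if not build.verified:
        raise ValueError("dev build must be verified before promoting")
    return Build(commit=build.commit, artifacts=build.artifacts, verified=True)


dev = dev_build("abc123")
dev.verified = True          # a developer checks the build on Meta
prod = promote_to_prod(dev)
print(prod.artifacts == dev.artifacts)  # True: the same bits that were tested
```

Copying artifacts means what runs in production is byte-for-byte what was verified on Meta.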

So what does this get you

• 52 ms homepage render time

• 33 ms questions page render time

Always See our Performance

• http://stackexchange.com/performance

Thank You!

Contact:

@GABeech [email protected]

Office Hours: Wednesday, November 12th

(today…) 2:00pm - 3:30pm

LISA Lab