Scaling at Showyou: Operations

Introduction Storage Processing Monitoring Review Scaling at Showyou Operations September 26, 2011

description

Architecture/operations slides from the Scaling at Showyou talk. From the same talk, John's Riak backend, Mecha: http://www.slideshare.net/jmuellerleile/scaling-with-riak-at-showyou

Transcript of Scaling at Showyou: Operations

Page 1: Scaling at Showyou: Operations

Introduction Storage Processing Monitoring Review

Scaling at Showyou: Operations

September 26, 2011

Page 2: Scaling at Showyou: Operations

I’m Kyle Kingsbury

Handle: aphyr
Code: http://github.com/aphyr
Email: [email protected]
Backend, API, ops

Page 3: Scaling at Showyou: Operations

What the hell is Showyou?

Page 4: Scaling at Showyou: Operations
Page 5: Scaling at Showyou: Operations

Nontrivial complexity

Page 6: Scaling at Showyou: Operations

Challenges

- Scanning social networks
- Feeds
- Search
- Trends
- Responsive client experience
- Everything fails all the time

Page 7: Scaling at Showyou: Operations

Page 8: Scaling at Showyou: Operations
Page 9: Scaling at Showyou: Operations

Storage

Page 10: Scaling at Showyou: Operations
Page 11: Scaling at Showyou: Operations

We left MySQL

- Changing the schema requires downtime
- Crashes
- Master-slave lag
- Slow restarts
- Node replacements difficult
- Fully normalized queries complex, slow

Page 12: Scaling at Showyou: Operations

Page 13: Scaling at Showyou: Operations

MySQL does scale

But there are tradeoffs

Page 14: Scaling at Showyou: Operations

Riak

- Key/value store
- Homogeneous
- Scales linearly with nodes
- Excellent durability/recoverability
- Eventually consistent

Page 15: Scaling at Showyou: Operations

We use Riak as our durable datastore

- Users, feeds, videos, etc
- Highly denormalized
- Limited MR queries (feeds, etc)
- Latency-bounded MR jobs are Erlang
- Hot-deployable
- Extensive use of conflict resolution
- Made possible by Risky
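To make "conflict resolution" concrete: when two clients write the same key concurrently, Riak can return both versions (siblings) and leave the merge to the application. The sketch below is not Risky's API (Risky is a Ruby library); it is a minimal Python illustration, and the record shape (`items`, `updated_at`) and the union-merge rule are assumptions chosen for the example.

```python
from functools import reduce

def merge_siblings(siblings):
    """Resolve conflicting versions of a feed-like object.

    Hypothetical rule: union the item sets and keep the newest timestamp.
    """
    def merge(a, b):
        return {
            "items": a["items"] | b["items"],
            "updated_at": max(a["updated_at"], b["updated_at"]),
        }
    return reduce(merge, siblings)

# Two versions written concurrently by different clients (made-up data).
siblings = [
    {"items": {"v1", "v2"}, "updated_at": 100},
    {"items": {"v2", "v3"}, "updated_at": 105},
]
resolved = merge_siblings(siblings)
print(sorted(resolved["items"]), resolved["updated_at"])
```

A union merge never loses an addition, but it cannot express a deletion; a real resolver would need tombstones or a similar mechanism for removals.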

Page 16: Scaling at Showyou: Operations

Riak at Showyou

- 51 million keys (153 M replicated)
- 100 GB of data (300 GB replicated)
- 260 gets/sec (baseline)
- 75 puts/sec (baseline)
- Capable of over 3000 ops/sec
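These figures imply a replication factor of 3 (which matches Riak's default n_val) and an average object size of roughly 2 KB. A quick back-of-the-envelope check, assuming "GB" means 10^9 bytes since the slide does not say:

```python
# Back-of-the-envelope check of the cluster figures quoted on the slide.
keys = 51_000_000
replicated_keys = 153_000_000
data_bytes = 100 * 10**9          # assumption: "GB" means 10^9 bytes
replicated_bytes = 300 * 10**9

replication_factor = replicated_keys / keys
avg_object_bytes = data_bytes / keys
baseline_ops = 260 + 75           # gets/sec + puts/sec
headroom = 3000 / baseline_ops    # "over 3000 ops/sec" relative to baseline

print(replication_factor)
print(round(avg_object_bytes))
print(round(headroom, 1))
```

On these numbers the cluster runs at roughly a ninth of its stated capacity at baseline load.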

Page 17: Scaling at Showyou: Operations

SSDs are amazing

WD 7200RPM

- 100 ops/sec
- 95%: 100-300ms

Micron RealSSD P300

- 1000+ ops/sec
- 95%: 3-5ms

Page 18: Scaling at Showyou: Operations

When Riak fails,

- Another node takes up the slack
- Clients connected to that node reconnect to others
- Typically no service interruption
- However, latencies may rise
- Especially for MR jobs
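The client-side half of this behaviour can be sketched as "try the next node". This is an illustration only, not the Showyou client: the node names and the `get` callable are hypothetical stand-ins for a real Riak connection.

```python
def fetch(key, nodes, get):
    """Try each node in turn and return the first successful response."""
    last_error = None
    for node in nodes:
        try:
            return get(node, key)
        except ConnectionError as err:
            last_error = err      # remember the failure, move to the next node
    raise ConnectionError(f"all nodes failed for {key!r}") from last_error

def fake_get(node, key):
    # Hypothetical transport: pretend the first node is down.
    if node == "riak1":
        raise ConnectionError("riak1 is down")
    return f"{key}@{node}"

value = fetch("user:42", ["riak1", "riak2", "riak3"], fake_get)
print(value)
```

Each failed attempt adds to request latency, which is one reason latencies rise while a node is down.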

Page 19: Scaling at Showyou: Operations

Riak has downsides

- Difficult to debug
- Membership changes are dangerous
- Significantly slower than MySQL
- (Bitcask) All keys must fit in memory
- Mapreduce is only appropriate for known keys
- List-keys can take down your cluster

Long story short: it’s only a KV store

Page 20: Scaling at Showyou: Operations

+Redis

Page 21: Scaling at Showyou: Operations

We use Redis for fast, temporary state

- List of users
- List of videos
- Counters
- Queues

Incredibly fast, excellent primitives

Page 22: Scaling at Showyou: Operations

When Redis fails,

- Daemons using those indexes pause
- Frontend service continues
- Bitcask scanners and incremental updaters repair any lost data

Eventually consistent.
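The repair step can be pictured as a scan over the durable store that re-derives any index entries the fast store has lost. The sketch below uses plain dicts as stand-ins for Riak and Redis; the record shape and the "videos per user" index are assumptions made for illustration.

```python
def repair_index(durable, index):
    """Re-derive missing index entries from the durable store.

    `durable` maps video id -> record; `index` maps user id -> set of video ids.
    Returns the number of entries that had to be restored.
    """
    fixes = 0
    for video_id, record in durable.items():
        owned = index.setdefault(record["user"], set())
        if video_id not in owned:
            owned.add(video_id)
            fixes += 1
    return fixes

durable = {
    "v1": {"user": "alice"},
    "v2": {"user": "bob"},
    "v3": {"user": "alice"},
}
index = {"alice": {"v1"}}            # the index after losing some entries
fixed = repair_index(durable, index)
print(fixed, index)
```

Because the scan is idempotent, it can run continuously: a second pass over a healthy index restores nothing.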

Page 23: Scaling at Showyou: Operations

Page 24: Scaling at Showyou: Operations

We also use SOLR extensively

- Supplements Riak
- Complex indices
- Full-text search
- Analytics

More on that later...

Page 25: Scaling at Showyou: Operations

Processing

Page 26: Scaling at Showyou: Operations
Page 27: Scaling at Showyou: Operations

Do one thing well

Lots of small processes handling well-defined tasks

- Easier to debug
- Easier to test
- Simplifies parallelism
- Simplifies error handling
- Less likely to cause total system failure

Page 28: Scaling at Showyou: Operations

Minimize Shared State

- Vector clocks for concurrent modification
- Queues for message passing
- Riak for durable storage
- Redis for fast synchronous state
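For readers unfamiliar with vector clocks, the core idea is a per-actor counter map that lets two versions be ordered, or recognised as concurrent. This is a generic textbook sketch, not Riak's internal representation; the actor names are made up.

```python
def compare(a, b):
    """Compare two vector clocks (dicts of actor -> counter).

    Returns "equal", "before" (a happened before b), "after", or "concurrent".
    """
    actors = set(a) | set(b)
    a_le_b = all(a.get(x, 0) <= b.get(x, 0) for x in actors)
    b_le_a = all(b.get(x, 0) <= a.get(x, 0) for x in actors)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

def merge(a, b):
    """Clock that descends from both inputs: the per-actor maximum."""
    return {x: max(a.get(x, 0), b.get(x, 0)) for x in set(a) | set(b)}

print(compare({"n1": 1}, {"n1": 2}))
print(compare({"n1": 2}, {"n1": 1, "n2": 1}))
```

Only the "concurrent" case needs application-level conflict resolution; in the other cases one version simply supersedes the other.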

Page 29: Scaling at Showyou: Operations

Crash by Default

- Someone else will take your work
- Repair constantly
- Assume everybody is out to kill you

Page 30: Scaling at Showyou: Operations

Distribute

- Multiple threads, processes, hosts
- Failover IPs with Heartbeat
- Rolling restarts mean frequent deploys and nobody notices
- Losing a node is no big deal
- Scaling out is easy

Page 31: Scaling at Showyou: Operations
Page 32: Scaling at Showyou: Operations

Monitoring

Page 33: Scaling at Showyou: Operations
Page 34: Scaling at Showyou: Operations

UState: A state aggregator

Page 35: Scaling at Showyou: Operations

Receive states over protobufs

Host: backend1.showyou.com
Service: feed merger rate
Time: unix epoch seconds
State: ok
Metric: 12.5
Description: 12.5 feed items/sec
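The fields on this slide map naturally onto a small record type. The sketch below uses a Python dataclass as a stand-in; it is not UState's actual protobuf schema, and the field types are assumptions inferred from the example values.

```python
from dataclasses import dataclass, asdict
import time

@dataclass(frozen=True)
class State:
    host: str
    service: str
    time: int          # unix epoch seconds
    state: str         # e.g. "ok", "warning", "critical"
    metric: float
    description: str

# The example state from the slide, stamped with the current time.
s = State(
    host="backend1.showyou.com",
    service="feed merger rate",
    time=int(time.time()),
    state="ok",
    metric=12.5,
    description="12.5 feed items/sec",
)
print(asdict(s))
```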

Page 36: Scaling at Showyou: Operations

Query states

- state = "warning" or state = "critical"
- service =~ "api %" and host != null
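The two queries above can be read as ordinary predicates over state records. The sketch below is not UState's query engine; it assumes that `%` in a `=~` pattern matches any run of characters (as in SQL LIKE), and the sample states, including `backend2.showyou.com`, are made up.

```python
import re

def like(pattern, value):
    """Match `value` against a pattern where % stands for any run of characters."""
    if value is None:
        return False
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value) is not None

states = [
    {"host": "backend1.showyou.com", "service": "api latency", "state": "ok"},
    {"host": None, "service": "api errors", "state": "warning"},
    {"host": "backend2.showyou.com", "service": "feed merger rate", "state": "critical"},
]

# state = "warning" or state = "critical"
problems = [s for s in states if s["state"] in ("warning", "critical")]

# service =~ "api %" and host != null
api_with_host = [
    s for s in states
    if like("api %", s["service"]) and s["host"] is not None
]

print([s["service"] for s in problems])
print([s["service"] for s in api_with_host])
```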

Page 37: Scaling at Showyou: Operations

- Combine states together (sum, average, ...)
- Send email on changes
- Forward to another UState server
- Forward to Graphite
- Dashboard
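Combining states can be sketched as folding the metrics of several per-host states into one aggregate state. This is an illustration rather than UState's implementation: the "worst state wins" rule, the severity ordering, and the sample values are all assumptions.

```python
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}

def average(xs):
    return sum(xs) / len(xs)

def combine(states, service, fold):
    """Fold several states into one; the aggregate takes the worst input state."""
    worst = max((s["state"] for s in states), key=SEVERITY.__getitem__)
    return {
        "host": None,                       # the aggregate belongs to no single host
        "service": service,
        "state": worst,
        "metric": fold([s["metric"] for s in states]),
    }

per_host = [
    {"host": "backend1.showyou.com", "state": "ok", "metric": 12.5},
    {"host": "backend2.showyou.com", "state": "warning", "metric": 7.5},
]
total = combine(per_host, "feed merger rate total", sum)
mean = combine(per_host, "feed merger rate mean", average)
print(total["metric"], mean["metric"], total["state"])
```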

Page 38: Scaling at Showyou: Operations
Page 39: Scaling at Showyou: Operations
Page 40: Scaling at Showyou: Operations

Understand application behavior

Page 41: Scaling at Showyou: Operations
Page 42: Scaling at Showyou: Operations

When can we...?

Page 43: Scaling at Showyou: Operations
Page 44: Scaling at Showyou: Operations

It’s 23:15 PST.

Do you know where YOUR database is?

Page 45: Scaling at Showyou: Operations

Page 46: Scaling at Showyou: Operations
Page 47: Scaling at Showyou: Operations
Page 48: Scaling at Showyou: Operations
Page 49: Scaling at Showyou: Operations
Page 50: Scaling at Showyou: Operations
Page 51: Scaling at Showyou: Operations
Page 52: Scaling at Showyou: Operations
Page 53: Scaling at Showyou: Operations

http://github.com/aphyr/ustate

Page 54: Scaling at Showyou: Operations

Recap

- Robust, discrete components
- Highly distributed
- Message passing
- Eventual consistency
- Comprehensive monitoring

Page 55: Scaling at Showyou: Operations

Thanks!

- Basho (esp. Pharkmillups!)
- Formspring
- Bump