The challenges of live events scalability

Post on 05-Jul-2015


THE CHALLENGES OF SUPPORTING ONLINE LIVE EVENTS WITH TV PARTICIPATION NUMBERS

Presentation for B.Sc students from IDC

By Guy Tomer, November 2011

A STARTUP PERSPECTIVE

Hello

• I’m Guy Tomer

• Founding and working in start-ups for the last 13 years

• Founder & CTO of attracTV for the last 4 years

• This presentation is about:

• Building a scalable system for “a lot” of users

• More specifically, handling usage peaks of live TV events on the internet

• Even more specifically – how we tackle it as a small start-up

attracTV

Web-based self-service solution and tools for managing viewers’ engagement and interaction on the online screen

Social · Information · Advertisement · eCommerce

Our Use Case – MTV European Music Awards

• One of the biggest online live streams ever

• Can’t expose precise numbers, but:

• 7 digits (> 1,000,000) – number of streams

• 6 digits (> 100,000) – number of concurrent users

• 5 digits (> 10,000) – number of users joining every minute at peak

• In addition:

• International event, 20 sites, viewers from > 150 countries

• 9 languages

What Are The Challenges

1. Scaling for these numbers

2. Handling very steep ramp-up

3. Big data

4. High availability

5. Testing & preparing for such numbers

6. The cost of the above – how to do it and still make money

We’ll discuss mainly 1, 5 & 6

Some Big “Internet Scale” Examples

• Google uses about 900,000 servers

• (MapReduce) Google sorted a ten-petabyte input set in 6 hours and 27 minutes on 8,000 computers

• Facebook serves 1 trillion pages per month

• (2010) 30 billion – pieces of content (links, notes, photos, etc.) shared on Facebook per month

• (2010) 2 billion – the number of videos watched per day on YouTube

• Akamai, the “CDN to the stars”, has 95,811 servers (Q2 2011) across 1,000 networks in 70 countries

Challenge 1 – Handling The Scale

• We are prepared for 400,000 concurrent viewers

• HTTP polling every N seconds, 10 <= N <= 30

• This means ~20,000 HTTP R/S (requests per second)

• For comparison:

• Stack Overflow recently reported 800 R/S

• Sify.com (leading portal in India) reported 3,900 R/S

• Jobs' death resulted in a record-breaking 10,000 tweets/s (they do have a lot more requests, that’s just to feel the scale)
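The ~20,000 R/S figure can be reproduced with simple arithmetic. The sketch below assumes every viewer sends exactly one request per polling interval and uses 20 seconds (the midpoint of the 10 to 30 second range) as the interval; the function name is only illustrative.

```python
def requests_per_second(concurrent_viewers: int, poll_interval_s: float) -> float:
    """Each viewer sends one HTTP request per polling interval."""
    return concurrent_viewers / poll_interval_s


VIEWERS = 400_000  # prepared capacity from the slide

# Compare the two ends of the polling range with the 20-second midpoint.
for interval_s in (10, 20, 30):
    rate = requests_per_second(VIEWERS, interval_s)
    print(f"poll every {interval_s:>2}s -> {rate:,.0f} R/S")
```

The polling interval is the main knob: moving from 10 to 30 seconds cuts the request rate to a third without changing the number of viewers.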

What Is Scalability

• From Wikipedia: “Scalability is the ability of a system, network, or process, to handle growing amounts of work in a graceful manner or its ability to be enlarged to accommodate that growth.”

Performance ≠ Scalability

The fact that your code runs very fast for X users doesn’t mean your architecture supports 100*X users.

Vertical Scalability (scale up)

• “Get a bigger server”

• “Use faster CPUs”

• Cons

• Can only help so much (with bad scale/$ value)

• A server twice as fast is more than twice as expensive

• Pros

• Easier to manage fewer computers

• Can use virtualization

Horizontal Scalability (scale out)

• “Just add another box” (or another thousand or ...)

• Plan the architecture right first, do micro-optimizations later

• Pros

• Theoretically unlimited

• Works well with the elasticity of cloud services

• Cons

• More complex to manage

• More complex programming models

Challenge #2 – Steep Ramp-up

• Live event: everyone comes at the same time

• The fact that a car can drive 250 km/h doesn’t mean it can do 0-100 km/h in 4 seconds

[Charts: standard website example (Wikimedia) vs. steep ramp-up]

Challenge #3 – Big Data

• From Wikipedia: “Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools”

• One of the biggest hypes in the industry today

• During this event we had ~10,000,000 records written to our analytics system per hour

• We’re not “Big Data” yet but it’s coming

Challenge #4 – High Availability

“High availability refers to a system or component that is continuously operational for a desirably long length of time.”

• We need to meet a service level of 99.9%

• Backup and failover systems are expensive

• The cloud helps here
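To get a feel for what 99.9% allows, the sketch below turns the percentage into a downtime budget. The measurement periods (a 12-hour event and a 30-day month) are assumptions chosen for illustration; the slides do not say over which period the service level is measured.

```python
def downtime_budget_s(availability_pct: float, period_s: float) -> float:
    """Seconds of downtime allowed in a period at the given availability."""
    return period_s * (1 - availability_pct / 100)


HOUR_S = 3600

# Assumed measurement periods, for illustration only.
event_budget = downtime_budget_s(99.9, 12 * HOUR_S)
month_budget = downtime_budget_s(99.9, 30 * 24 * HOUR_S)

print(f"12-hour event: {event_budget:.1f} s of downtime allowed")
print(f"30-day month:  {month_budget / 60:.1f} min of downtime allowed")
```

Measured over a single live event, the budget is well under a minute, which is why failover has to be automatic rather than manual.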

High availability in the cloud

Challenge #5 – Testing

• Simulating hundreds of thousands of concurrent users is not trivial

• Requires tens of strong servers

• Very difficult to collect the data

• The cloud helps here

Challenge #6 – Handling The Costs Of Such An Event (Hint: Elasticity)

• For production we used ~50 servers, each with 4 cores at 2 GHz and 15 GB RAM (m1.xl)

• Some options (rough estimates) for this are:

• Buy - ~$3,500 per box = $175,000. Not for us…

• Dedicated server for a month - ~$1,000 per instance = $50,000

• VPS (virtual private server) monthly - ~$300 per box = $15,000

• Solution: cloud on-demand (Amazon AWS) - ~$500 per instance = $25,000 for a month… BUT there is no need to take it for a month. We activate it on demand for 12 hours and it costs $416!
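The numbers above can be checked with a few lines of arithmetic. This is a rough sketch using the slide's estimates; the 720-hour month (30 days) and linear per-hour billing are assumptions made to prorate the monthly on-demand price.

```python
SERVERS = 50
HOURS_PER_MONTH = 720  # assumption: 30-day month


def fleet_cost(price_per_server: float, servers: int = SERVERS) -> float:
    """Cost of paying the full price for every server."""
    return price_per_server * servers


def on_demand_cost(monthly_price: float, hours: float,
                   servers: int = SERVERS) -> float:
    """Cost of renting the fleet for only `hours`, prorated from a monthly price."""
    return monthly_price / HOURS_PER_MONTH * hours * servers


print(f"Buy:               ${fleet_cost(3500):,.0f}")
print(f"Dedicated (month): ${fleet_cost(1000):,.0f}")
print(f"VPS (month):       ${fleet_cost(300):,.0f}")
print(f"Cloud (month):     ${fleet_cost(500):,.0f}")
print(f"Cloud (12 hours):  ${on_demand_cost(500, 12):,.2f}")
```

The cloud's monthly price is not the cheapest of the four, but paying only for the 12 hours of the event brings the bill down to roughly $416.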

Our #1 Lesson - Think Horizontal!

• Why not vertical?

• We don’t want it to be our business’s bottleneck at any point in time

• We don’t want to buy giant servers

• We wanted a cheap start

• We want elasticity

• We don’t want to buy anything at this point

• How? (deserves a separate lecture)

• Everything in the architecture

• No state shared between the web/app servers

• No relation between the # of users and the load on the database

Lesson #2 – KISS

• Keep It Simple, Stupid:

• Your system architecture

• Your code

• Your features

• Your business model

• If you don’t, you won’t scale (from personal experience)

Hug out all the complexity in your system

Lesson #3 – Load Test Everything, Focus On Real World Usage Patterns

• We did massive stress testing

• We launched tens of servers just for stress testing

• Automated with JMeter and monitored the same way as production

Why?

• The only way to test your scaling capabilities

• Reading the code and running manual tests are irrelevant for this

• Measure the capacity of a single app server

• Test the specific ramp-up scenario because

• Example: 1 app server = 5,000 users; we need to support 200,000 users, so we need to prepare at least 40 servers
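The sizing rule in the example is a ceiling division of target users by measured per-server capacity. A minimal sketch follows; the `headroom` parameter is an added assumption (a safety margin on top of the target), not something the slides specify.

```python
import math


def servers_needed(target_users: int, users_per_server: int,
                   headroom: float = 0.0) -> int:
    """Servers required for the target load, rounded up.

    `headroom` is a fractional safety margin (0.25 = plan for 25% extra users).
    """
    return math.ceil(target_users * (1 + headroom) / users_per_server)


print(servers_needed(200_000, 5_000))                 # no margin
print(servers_needed(200_000, 5_000, headroom=0.25))  # 25% margin
```

The per-server capacity has to come from a load test of a single app server under the real usage pattern; the arithmetic is only as good as that measurement.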

Lesson #4 – S*t Happens, Don’t Save On Real-Time Monitoring and Support

• We had a series of successful big events before this one

• We launched tens of servers just for the stress testing

• And yet we had two problems during the event

Why?

• Murphy is always (eventually) right…

• Because of a feature no one uses (see lesson #2 - KISS) that wasn’t active in the stress tests

• The specific usage of 9 languages caused unexpected load (see lesson #3 – stress real world scenarios)

Luckily the whole team was in monitoring mode and the issues were quickly handled on the fly.

Lesson #5 – Use The Cloud (startups)

• It’s Elastic, pay on demand

• Flexible when you don’t know your parameters

• Solution for affordable High Availability & Testing

• Focus on development

• I am not getting paid by Amazon – check others as well!

Summary - What To Remember?

• Scalability is the ability of a system to handle a growing amount of work with additional resources

• Think horizontal

• Keep It Simple (Stupid) – everything

• Stress test everything, focus on real world scenarios

• Monitor and provide real-time support

• Cloud is great for start-ups

The End

• Questions? Comments? Consulting Preguntas?

问题 ?

• Just Shy? Think you should be working in attracTV?

Contact me:

guy.tomer@gmail.com

www.guytomer.com

Special Thanks (presentations, websites I “borrowed” from)

• Ask Bjørn Hansen (http://groups.google.com/group/scalable)

• High Scalability blog http://highscalability.com/

• http://royal.pingdom.com

• Google images

• Entourage (http://www.hbo.com/entourage/index.html)