Humans by the hundred

Humans By The Hundred: Scaling Big Data for Big Team Growth

Transcript of Humans by the hundred

Page 1: Humans by the hundred

Humans By The Hundred: Scaling Big Data for Big Team Growth

Page 2: Humans by the hundred

$ whoami
SRE Manager at Yelp
CWRU Alum
Pittsburgh native
<3 Web Operations
Just a dude

Page 3: Humans by the hundred

Yelp’s Mission:
Connecting people with great local businesses.

Page 4: Humans by the hundred

Yelp Stats:
As of Q2 2015

83M 3268%83M

Page 5: Humans by the hundred

What is Yelp?
Many sites: www, m, biz, api
Mobile apps
Partner platform
Hundreds of developers
Thousands of servers

Page 6: Humans by the hundred

Why Am I Here?

Page 7: Humans by the hundred
Page 8: Humans by the hundred

DATA

Page 9: Humans by the hundred

This talk is about people

Page 10: Humans by the hundred
Page 11: Humans by the hundred
Page 12: Humans by the hundred
Page 13: Humans by the hundred
Page 14: Humans by the hundred
Page 15: Humans by the hundred
Page 16: Humans by the hundred
Page 17: Humans by the hundred

The Goal

Page 18: Humans by the hundred

Iterate as fast as possible

Page 19: Humans by the hundred

Regardless of how many people are participating

Page 20: Humans by the hundred

Deployment

Page 21: Humans by the hundred

How It Starts

Page 22: Humans by the hundred

Deployment: the early days
Get a few people together in slack/irc/etc.
Merge up the code
Run the tests
Manually test it in stage
Cross your fingers

Page 23: Humans by the hundred
Page 24: Humans by the hundred
Page 25: Humans by the hundred

Things get slower...
Tests take longer to run
More hosts = longer downloads
More developers = more eyeballs
More features = more code

Page 26: Humans by the hundred

The Problem: Humans Are Fallible

Page 27: Humans by the hundred

The Problem: Humans Are Fallible

“…oh @$#&”

Page 28: Humans by the hundred
Page 29: Humans by the hundred

The Problem, With Math
Assume:
Every change has a chance of success: 98%
    That means no test failures, no reverts, etc.
Every deploy has a number of changes: n
Any failure in the pipeline invalidates the deploy
Let’s figure out the probability of a successful deployment: p

Page 30: Humans by the hundred

The Problem, With Math
Only you
    p = .98 (98%)
You and a friend
    p = .98 * .98 = .96 (96%)
You and nine co-workers
    p = .98 * .98 * .98 * … * .98 = .82 (82%)

Page 31: Humans by the hundred

The Problem, With Math

p = (.98)^n

Page 32: Humans by the hundred

The Problem, With Math

p = (.98)^n

exponential decay!
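
The decay is easy to tabulate. The following is a minimal sketch that takes the slide's 98% per-change success rate and assumes changes fail independently of one another, which is what multiplying the probabilities implies. The function name `deploy_success_probability` and the larger values of n are illustrative, not from the talk.

```python
def deploy_success_probability(per_change_success: float, num_changes: int) -> float:
    """Probability that every change in a deploy succeeds.

    Assumes each change succeeds independently with the same probability,
    so the combined probability is per_change_success ** num_changes.
    """
    return per_change_success ** num_changes


# n = 1, 2 and 10 match the slide; 50 and 100 are extra illustrative points.
for n in (1, 2, 10, 50, 100):
    p = deploy_success_probability(0.98, n)
    print(f"n = {n:>3}: p = {p:.2f}")
```

With ten changes the result rounds to .82, as on the slide. By a hundred changes, most deploys would be expected to fail under these assumptions.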

Page 33: Humans by the hundred
Page 34: Humans by the hundred

This doesn’t scale!
More developers = more changes
More changes = longer deploys
Longer deploys = less time to develop
Less time to develop = slower to iterate
Slower to iterate != the goal

Page 35: Humans by the hundred

Mitigating Exponential Decay

p = (.98)^n

Page 36: Humans by the hundred

Mitigating Exponential Decay

p = (.98)^n

Page 37: Humans by the hundred
Page 38: Humans by the hundred

Making it harder to screw up
Write more tests
Write better tests
Get better code reviews
Get better infrastructure
Switch programming languages
Use better tools

Page 39: Humans by the hundred

Just write better software and stop making mistakes!

Page 40: Humans by the hundred

PROBLEM SOLVED

Page 41: Humans by the hundred
Page 42: Humans by the hundred

The Real World
Testing builds confidence in our changes
Testing does not protect you from failure
Better tools, tests, and infrastructure can raise our success rates

Page 43: Humans by the hundred

Mitigating Exponential Decay

p = (.98)^n

Page 44: Humans by the hundred

Mitigating Exponential Decay

p = (.98)^n

Page 45: Humans by the hundred

Service-Oriented Architecture
Large monolith → smaller services
Services communicate over network
    Usually HTTP, but you can do RPC, SOAP, etc.
Service = independent code base
Independent deployments

Page 46: Humans by the hundred

Service-Oriented Architecture
Benefits
Smaller code bases = upper bound to n
Failure domains become isolated
Technology independence
Federated responsibility
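
To see why an upper bound on n matters, here is a rough sketch that spreads a fixed number of changes across independently deployed services. The total of 100 changes, the even split, and the helper names are hypothetical. Only the 98% per-change success rate comes from the talk.

```python
import math


def deploy_success_probability(per_change_success: float, num_changes: int) -> float:
    """Chance that all changes in one deploy succeed, assuming independence."""
    return per_change_success ** num_changes


def changes_per_deploy(total_changes: int, num_services: int) -> int:
    """Largest number of changes any one service carries under an even split."""
    return math.ceil(total_changes / num_services)


TOTAL_CHANGES = 100  # hypothetical number of changes waiting to ship

for num_services in (1, 5, 20):
    n = changes_per_deploy(TOTAL_CHANGES, num_services)
    p = deploy_success_probability(0.98, n)
    print(f"{num_services:>2} service(s): n = {n:>3} per deploy, p = {p:.2f}")
```

The per-change success rate never moves in this sketch. Only the number of changes bundled into each deploy shrinks, and a failed deploy now blocks one service rather than everyone.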

Page 47: Humans by the hundred

Service-Oriented Architecture
Drawbacks
Everything becomes decoupled
Function calls start looking like HTTP requests
Versioning can be a nightmare
Tracking dependencies is hard
Data consistency becomes challenging
End-to-end testing becomes hard(er), if not impossible

Page 48: Humans by the hundred

SOA scales people, not code.

Page 49: Humans by the hundred

Conquering SOA
With the monolith, it’s easy to focus on mean time between failures (MTBF)

Page 50: Humans by the hundred

Conquering SOA
In a SOA, focus on mean time to recovery (MTTR)

Page 51: Humans by the hundred

Conquering SOA
Fail fast
Anticipate failure
Leverage iteration speed to recover fast

Page 52: Humans by the hundred

Conquering SOA
Treat everything as distributed
    That means everything will fail
Use timeouts, retries
Find ways to degrade gracefully
Fail fast & isolated
Don’t rely on synchronous processes
Prepare for eventual consistency
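
One way to combine timeouts, retries, and graceful degradation is a small wrapper around each remote call. This is only a sketch: `call_with_retries`, `always_times_out`, and the backoff values are hypothetical names and numbers, and the wrapped operation is assumed to enforce its own timeout and raise `TimeoutError` or `ConnectionError` when it fails.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    fallback: T,
    attempts: int = 3,
    backoff_seconds: float = 0.1,
) -> T:
    """Run `operation`, retrying on timeout or connection errors.

    `operation` is expected to enforce its own timeout (for example by
    passing a timeout to whatever client it uses). If every attempt fails,
    return `fallback` so the caller can degrade gracefully.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            # Back off exponentially, but do not sleep after the last attempt.
            if attempt < attempts - 1:
                time.sleep(backoff_seconds * 2 ** attempt)
    return fallback


def always_times_out() -> list:
    # Hypothetical stand-in for a call to a remote service that is down.
    raise TimeoutError("service did not answer in time")


# The caller gets an empty list back instead of an unhandled exception.
print(call_with_retries(always_times_out, fallback=[], backoff_seconds=0.01))
```

Capping the number of attempts keeps the failure fast, and the fallback value keeps it isolated: the caller renders with less data rather than failing along with its dependency.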

Page 53: Humans by the hundred

Reaping the Benefits
Smaller failure domains
Fewer people & changes to manage
Deploys get smaller
Deploys get faster
Deploys become continuous

Page 54: Humans by the hundred

Reaping the Benefits
Smaller changes
    means smaller code reviews
    means faster validation
    means smaller blast radius
    means faster iteration

Page 55: Humans by the hundred

Continuous Delivery
Everyone works against master branch
Master is deployed when commits added
    Deployment gated by tests
Monitoring knows something is wrong before you do!

Page 56: Humans by the hundred

PROBLEM SOLVED

Page 57: Humans by the hundred

Testing

Page 58: Humans by the hundred

Tests are hard to get right.

Page 59: Humans by the hundred
Page 60: Humans by the hundred
Page 61: Humans by the hundred
Page 62: Humans by the hundred
Page 63: Humans by the hundred
Page 64: Humans by the hundred
Page 65: Humans by the hundred

How can we do better?

Page 66: Humans by the hundred
Page 67: Humans by the hundred

“Not Recommended” Tests

Page 68: Humans by the hundred

“Not Recommended” Tests
If a test fails on master:
    a feature is broken on the live website, or
    your test sucks and you should ditch it
In either case, we disable it
Ticket is created
Developers can fix it later or just bin it and start fresh
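
The policy above could be sketched roughly as follows. The `SuiteState` structure, the ticket wording, and the test names are all made up for illustration. The talk does not describe how Yelp implements this.

```python
from dataclasses import dataclass, field


@dataclass
class SuiteState:
    """Hypothetical bookkeeping for which tests gate a deploy."""
    enabled: set[str]
    disabled: set[str] = field(default_factory=set)
    tickets: list[str] = field(default_factory=list)


def quarantine_failures(suite: SuiteState, failed_on_master: set[str]) -> SuiteState:
    """Disable every enabled test that failed on master and file a ticket for it."""
    for name in sorted(failed_on_master):
        if name in suite.enabled:
            suite.enabled.remove(name)
            suite.disabled.add(name)
            # Either the feature or the test is broken; a human decides later.
            suite.tickets.append(f"Fix or delete disabled test: {name}")
    return suite


suite = SuiteState(enabled={"test_login", "test_search", "test_checkout"})
quarantine_failures(suite, failed_on_master={"test_search"})
print(sorted(suite.enabled))
print(sorted(suite.disabled))
print(suite.tickets)
```

Disabling first and investigating later keeps master deployable, and the ticket makes sure the failure is not forgotten.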

Page 69: Humans by the hundred

Reliable tests >> test coverage.

Page 70: Humans by the hundred

Don’t always run all the tests!

Page 71: Humans by the hundred

Tests of external services should be monitoring

Page 72: Humans by the hundred

Define your boundaries.

Page 73: Humans by the hundred

yelp.com/dataset_challenge
● 61K businesses
● 61K checkin-sets
● 481K business attributes
● 1.6M reviews
● 366K users
● 2.8M edge social-graph
● 495K tips

Your academic project, research or visualizations, submitted by Dec 31, 2015 = $5,000 prize + $1,000 for publication + $500 for presenting*

*See full terms on website

Academic dataset from 10 cities in 4 countries!

Page 74: Humans by the hundred

@YelpEngineering

YelpEngineers

engineeringblog.yelp.com

github.com/yelp

Page 75: Humans by the hundred

yelp.com/careers

Page 76: Humans by the hundred

Questions?