Humans by the hundred
yelp-engineering
Humans By The Hundred
Scaling Big Data for Big Team Growth
$ whoami
SRE Manager at Yelp
CWRU Alum
Pittsburgh native
<3 Web Operations
Just a dude
Yelp’s Mission:
Connecting people with great local businesses.
Yelp Stats:
As of Q2 2015
[Figure: four headline statistics, 83M, 32, 68%, 83M; labels not recoverable]
What is Yelp?
Many sites: www, m, biz, api
Mobile apps
Partner platform
Hundreds of developers
Thousands of servers
Why Am I Here?
DATA
This talk is about people
The Goal
Iterate as fast as possible
Regardless of how many people are participating
Deployment
How It Starts
Deployment: the early days
Get a few people together in slack/irc/etc.
Merge up the code
Run the tests
Manually test it in stage
Cross your fingers
Things get slower...
Tests take longer to run
More hosts = longer downloads
More developers = more eyeballs
More features = more code
The Problem: Humans Are Fallible
“…oh @$#&”
The Problem, With Math
Assume:
Every change has a chance of success: 98%
  That means no test failures, no reverts, etc.
Every deploy has a number of changes: n
Any failure in the pipeline invalidates the deploy
Let’s figure out the probability of a successful deployment: p
The Problem, With Math
Only you
  p = .98 (98%)
You and a friend
  p = .98 * .98 = .96 (96%)
You and nine co-workers
  p = .98 * .98 * .98 * … * .98 = .82 (82%)
The Problem, With Math
p = (.98)^n
exponential decay!
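The decay is easy to check numerically. The following is a minimal sketch that uses the talk's 98% per-change figure and assumes, as the slides do, that changes fail independently; the function name and the sample values of n are illustrative only.

```python
def deploy_success_probability(n_changes: int, per_change_success: float = 0.98) -> float:
    """Probability that a deploy containing n independent changes has no failure."""
    return per_change_success ** n_changes


# Sample deploy sizes: 1, 2 and 10 match the slides; 50 and 100 are extra.
for n in (1, 2, 10, 50, 100):
    p = deploy_success_probability(n)
    print(f"n = {n:>3}: p = {p:.2f}")
```

With these assumptions, a deploy that bundles 100 changes succeeds only about 13% of the time.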
This doesn’t scale!
More developers = more changes
More changes = longer deploys
Longer deploys = less time to develop
Less time to develop = slower to iterate
Slower to iterate != the goal
Mitigating Exponential Decay
p = (.98)^n
Making it harder to screw up
Write more tests
Write better tests
Get better code reviews
Get better infrastructure
Switch programming languages
Use better tools
Just write better software and stop making mistakes!
PROBLEM SOLVED
The Real World
Testing builds confidence in our changes
Testing does not protect you from failure
Better tools, tests, and infrastructure can raise our success rates
Mitigating Exponential Decay
p = (.98)^n
Service-Oriented Architecture
Large monolith → smaller services
Services communicate over network
  Usually HTTP, but you can do RPC, SOAP, etc.
Service = independent code base
Independent deployments
Service-Oriented Architecture
Benefits
Smaller code bases = upper bound on n
Failure domains become isolated
Technology independence
Federated responsibility
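To see why bounding n matters, here is a rough sketch that reuses the talk's 98% assumption. The even split of changes across services and the numbers (40 changes, 8 services) are hypothetical and chosen only for illustration.

```python
PER_CHANGE_SUCCESS = 0.98  # the talk's assumed per-change success rate


def deploy_success(n_changes: int) -> float:
    """Probability that a single deploy of n independent changes succeeds."""
    return PER_CHANGE_SUCCESS ** n_changes


def per_service_success(total_changes: int, n_services: int) -> float:
    """Success probability of one service's deploy.

    Assumption: changes are spread evenly across services; ceiling division
    gives the size of the largest per-service deploy.
    """
    changes_per_service = -(-total_changes // n_services)
    return deploy_success(changes_per_service)


monolith = deploy_success(40)          # all 40 changes in one deploy
service = per_service_success(40, 8)   # 5 changes per service deploy
print(f"monolith deploy: {monolith:.2f}, single service deploy: {service:.2f}")
```

This does not make the 40 changes any safer overall. It only means a failure invalidates one small deploy rather than everyone's work.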
Service-Oriented Architecture
Drawbacks
Everything becomes decoupled
Function calls start looking like HTTP requests
Versioning can be a nightmare
Tracking dependencies is hard
Data consistency becomes challenging
End-to-end testing becomes hard(er), if not impossible
SOA scales people, not code.
Conquering SOA
With the monolith, it’s easy to focus on mean time between failures (MTBF)
Conquering SOA
In a SOA, focus on mean time to recovery (MTTR)
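One way to see the trade-off between MTBF and MTTR is the standard steady-state availability formula, MTBF / (MTBF + MTTR). The formula is not from the slides, and the hour values below are made-up numbers for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical: failures are rare, but each one takes 10 hours to recover from.
rare_but_slow = availability(mtbf_hours=1000, mttr_hours=10)

# Hypothetical: failures are 10x more frequent, but recovery takes 6 minutes.
frequent_but_fast = availability(mtbf_hours=100, mttr_hours=0.1)

print(f"rare but slow:     {rare_but_slow:.4f}")
print(f"frequent but fast: {frequent_but_fast:.4f}")
```

Under these made-up numbers, the system that fails more often but recovers quickly is up more of the time.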
Conquering SOA
Fail fast
Anticipate failure
Leverage iteration speed to recover fast
Conquering SOA
Treat everything as distributed
  That means everything will fail
Use timeouts, retries
Find ways to degrade gracefully
Fail fast & isolated
Don’t rely on synchronous processes
Prepare for eventual consistency
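As an illustration of timeouts, retries, and graceful degradation, here is a minimal sketch using only Python's standard library. The function name, the default values, and the `opener` parameter (there so the network call can be swapped out) are assumptions for this example, not anything described in the talk.

```python
import time
import urllib.request


def fetch_with_retries(url, timeout=0.5, attempts=3, backoff=0.1,
                       fallback=None, opener=urllib.request.urlopen):
    """Fetch a URL with a per-request timeout and a bounded number of retries.

    If every attempt fails, return `fallback` instead of raising, so the
    caller can degrade gracefully (for example, render a page without the
    optional data).
    """
    for attempt in range(attempts):
        try:
            # The timeout keeps a slow dependency from stalling the caller.
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            if attempt < attempts - 1:
                # Exponential backoff before the next attempt.
                time.sleep(backoff * 2 ** attempt)
    return fallback
```

A caller might use `fetch_with_retries(some_service_url, fallback=b"[]")` and treat the empty result as "feature unavailable" rather than failing the whole request. Retrying is only safe for idempotent requests.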
Reaping the Benefits
Smaller failure domains
Fewer people & changes to manage
Deploys get smaller
Deploys get faster
Deploys become continuous
Reaping the Benefits
Smaller changes
  means smaller code reviews
  means faster validation
  means smaller blast radius
  means faster iteration
Continuous Delivery
Everyone works against the master branch
Master is deployed when commits are added
Deployment gated by tests
Monitoring knows something is wrong before you do!
PROBLEM SOLVED
Testing
Tests are hard to get right.
How can we do better?
“Not Recommended” Tests
“Not Recommended” Tests
If a test fails on master:
  a feature is broken on the live website, or
  your test sucks and you should ditch it
In either case, we disable it
Ticket is created
Developers can fix it later or just bin it and start fresh
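The workflow above can be sketched as a small in-memory quarantine list. The class, method names, and ticket format are hypothetical and do not describe Yelp's actual tooling; a real setup would talk to a test runner and a ticket tracker.

```python
from dataclasses import dataclass, field


@dataclass
class QuarantineList:
    """Tracks tests disabled after failing on master (hypothetical sketch)."""

    disabled: dict = field(default_factory=dict)  # test name -> ticket id
    next_ticket: int = 1

    def record_master_failure(self, test_name: str) -> str:
        """Disable a test that failed on master and open a ticket for it."""
        if test_name in self.disabled:
            return self.disabled[test_name]  # already disabled; reuse ticket
        ticket = f"TICKET-{self.next_ticket}"  # placeholder ticket id
        self.next_ticket += 1
        self.disabled[test_name] = ticket
        return ticket

    def should_run(self, test_name: str) -> bool:
        """Only tests that are not quarantined gate a deploy."""
        return test_name not in self.disabled

    def reinstate(self, test_name: str) -> None:
        """Re-enable a test once it has been fixed or rewritten."""
        self.disabled.pop(test_name, None)


quarantine = QuarantineList()
ticket = quarantine.record_master_failure("test_search_ranking")
print(ticket, quarantine.should_run("test_search_ranking"))
```

The point of the design is that a flaky test stops blocking everyone immediately, while the ticket keeps the follow-up visible.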
Reliable tests >> test coverage.
Don’t always run all the tests!
Tests of external services should be monitoring
Define your boundaries.
yelp.com/dataset_challenge
● 61K businesses
● 61K checkin-sets
● 481K business attributes
● 1.6M reviews
● 366K users
● 2.8M-edge social graph
● 495K tips
Your academic project, research or visualizations, submitted by Dec 31, 2015 = $5,000 prize + $1,000 for publication + $500 for presenting*
*See full terms on website
Academic dataset from 10 cities in 4 countries!
@YelpEngineering
YelpEngineers
engineeringblog.yelp.com
github.com/yelp
yelp.com/careers
Questions?