Building a Modern Website for Scale
Sid Anand
QCon NY 2013
About Me

Current Life
- LinkedIn Search, Network, and Analytics (SNA): Search Infrastructure

In a Previous Life
- LinkedIn, Data Infrastructure, Architect
- Netflix, Cloud Database Architect
- eBay, Web Development, Research Lab, & Search Engine

And Many Years Prior
- Studying Distributed Systems at Cornell University

Twitter: @r39132
Our Mission

Connect the world's professionals to make them more productive and successful.
Over 200M Members and Counting

[Chart: LinkedIn Members (Millions), 2004-2012: 2, 4, 8, 17, 32, 55, 90, 145, 200+]

The world's largest professional network, growing at more than 2 members/sec.

Source: http://press.linkedin.com/about
The World's Largest Professional Network

- >88% of Fortune 100 companies use LinkedIn Talent Solutions to hire
- >2.9M Company Pages
- >5.7B professional searches in 2012
- 19 languages
- >30M students and NCGs (the fastest growing demographic)
- Over 64% of members are now international

Source: http://press.linkedin.com/about
Other Company Facts

- Headquartered in Mountain View, Calif., with offices around the world
- As of June 1, 2013, LinkedIn has ~3,700 full-time employees located around the world

Source: http://press.linkedin.com/about
Agenda

- Company Overview
- Serving Architecture
- How Does LinkedIn Scale?
  - Web Services
  - Databases
  - Messaging
  - Other
- Q & A
Serving Architecture
Overview

- Our site runs primarily on Java, with some use of Scala for specific infrastructure
- The presentation tier is an exception: it runs on everything!
- What runs on Scala?
  - Network Graph Engine
  - Kafka
  - Some front ends (Play)
- Most of our services run on Jetty
Presentation Tier

[Diagram: the Frontier presentation tier fronting Play, SpringMVC, NodeJS, JRuby, Grails, Django, and USSR (Chrome V8 JS engine)]

Our presentation tier is composed of ATS (Apache Traffic Server) with 2 plugins:
- Fizzy
  - A content aggregator that unifies content across a diverse set of front-ends
  - Open-source JS templating framework
- USSR (a.k.a. Unified Server-Side Rendering)
  - Packages Google Chrome's V8 JS engine as an ATS plugin
[Diagram: a web page request flows through the Presentation Tier to Business Service Tier (BST) clusters, then to Data Service Tier (DST) clusters, backed by Data Infrastructure: Oracle master/slave, Memcached, Hadoop, and other stores]

- A web page requests information A and B
- Presentation Tier: a thin layer focused on building the UI. It assembles the page by making parallel requests to BST services
- Business Service Tier: encapsulates business logic. Can call other BST clusters and its own DST cluster
- Data Service Tier: encapsulates DAL logic and is concerned with one Oracle schema
- Data Infrastructure: concerned with the persistent storage of and easy access to data
Serving Architecture : Other

[Diagram: data change events from Oracle or Espresso flow to the Search Index, Graph Index, Read Replicas, and Standardization services]

As I will discuss later, data that is committed to databases also needs to be made available to a host of other online serving systems:
- Search
- Standardization services: these provide canonical names for your titles, companies, schools, skills, fields of study, etc.
- Graph engine
- Recommender systems

This data change feed needs to be scalable, reliable, and fast. [Databus]
Serving Architecture : Hadoop

How do we use Hadoop to serve?
- Hadoop is central to our analytic infrastructure
- We ship data streams into Hadoop from our primary databases via Databus & from applications via Kafka
- Hadoop jobs take daily or hourly dumps of this data and compute data files that Voldemort can load
- Voldemort loads these files and serves them on the site

Voldemort : RO Store Usage at LinkedIn
- People You May Know
- LinkedIn Skills
- Related Searches
- Viewers of this profile also viewed
- Events you may be interested in
- Jobs you may be interested in

A sketch of reading from a Voldemort read-only store follows.
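As a rough illustration of the last step (serving precomputed Hadoop output out of Voldemort), here is a minimal read sketch using the open-source Voldemort client. The bootstrap URL, the store name "pymk", and the key format are assumptions for illustration, not LinkedIn's actual configuration.

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;

public class VoldemortReadSketch {
    public static void main(String[] args) {
        // Bootstrap against a Voldemort cluster that has loaded the Hadoop-built RO store.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://voldemort-host:6666")); // hypothetical host

        // "pymk" is a hypothetical read-only store name (e.g. People You May Know results).
        StoreClient<String, Object> client = factory.getStoreClient("pymk");

        // Fetch the precomputed recommendations for a member key (hypothetical key format).
        Object recommendations = client.getValue("member:12345");
        System.out.println(recommendations);
    }
}
```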
How Does LinkedIn Scale?
Scaling Web Services
Problem
- How do 150+ web services communicate with each other to fulfill user requests in the most efficient and fault-tolerant manner?
- How do they handle slow downstream dependencies?

For illustration's sake, consider the following scenario:
- Service B has 2 hosts
- Service C has 2 hosts
- A machine in Service B sends a web request to a machine in Service C

[Diagram: services A, B, and C, each with 2 hosts]
What sorts of failure modes are we concerned about?
- A machine in Service C has a long GC pause
- ... or calls a service that has a long GC pause
- ... or calls a service that calls a service that has a long GC pause (see where I am going?)
- A machine in Service C or in its downstream dependencies may be slow for any reason, not just GC (e.g. bottlenecks on CPU, IO, or memory, lock contention)

Goal : Given all of this, how can we ensure high uptime?
Hint : Pick the right architecture and implement best practices on top of it!
In the early days, LinkedIn made a big bet on Spring and Spring RPC.

Issues
1. Spring RPC is difficult to debug
   - You cannot call the service using simple command-line tools like curl
   - Since the RPC call is implemented as a binary payload over HTTP, HTTP access logs are not very useful
2. A Spring RPC-based architecture leads to high MTTR
   - Spring RPC is not flexible and pluggable -- we cannot use custom client-side load-balancing strategies or custom fault-tolerance features
   - Instead, all we can do is put all of our service nodes behind a hardware load balancer & pray!
   - If a Service C node experiences a slowness issue, a NOC engineer needs to be alerted and then manually remove it from the LB (MTTR > 30 minutes)

[Diagram: Service B nodes calling Service C nodes through a hardware load balancer (LB)]
Solution
A better solution is one that we see often in both cloud-based architectures and NoSQL systems: Dynamic Discovery + client-side load balancing.

- Step 1 : Service C nodes announce their availability to serve traffic to a ZooKeeper (ZK) registry
- Step 2 : Service B nodes get updates from ZK
- Step 3 : Service B nodes route traffic to Service C nodes (a sketch follows)
With this new paradigm for discovering services and routing requests to them, we can incorporate additional fault-tolerance measures.
Best Practices - Fault-tolerance Support

1. No client should wait indefinitely for a response from a service
   - Issues
     - Waiting causes a traffic jam: all upstream clients end up also getting blocked
     - Each service has a fixed number of Jetty or Tomcat threads. Once those are all tied up waiting, no new requests can be handled
   - Solution (a sketch follows)
     - After a configurable timeout, return
     - Store different SLAs in ZK for each REST end-point. In other words, all calls are not the same and should not have the same read timeout
2. Isolate calls to back-ends from one another
   - Issues
     - You depend on responses from independent services A and B. If A slows down, will you still be able to serve B?
   - Details
     - This is a common use-case for federated services and for shard-aggregators:
       - E.g. Search at LinkedIn is federated and will call people-search, job-search, group-search, etc. in parallel
       - E.g. People-search is itself sharded, so an additional shard-aggregation step needs to happen across 100s of shards
   - Solution (a sketch follows)
     - Use async requests or independent ExecutorServices for sync requests (one per shard or vertical)
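A minimal sketch of the executor-per-vertical idea with synchronous calls. The verticals, pool sizes, and timeouts are assumptions; the point is that a slow vertical can exhaust only its own pool and time budget, not the whole page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;

public class FederatedSearchSketch {

    // One bounded pool per (hypothetical) vertical; a slow people-search can
    // exhaust only its own pool, never starve job-search or group-search.
    private final ExecutorService peoplePool = Executors.newFixedThreadPool(16);
    private final ExecutorService jobsPool   = Executors.newFixedThreadPool(16);
    private final ExecutorService groupsPool = Executors.newFixedThreadPool(16);

    List<String> search(String query) throws InterruptedException {
        Future<String> people = peoplePool.submit(() -> callVertical("people-search", query));
        Future<String> jobs   = jobsPool.submit(() -> callVertical("job-search", query));
        Future<String> groups = groupsPool.submit(() -> callVertical("group-search", query));

        // Each vertical gets its own timeout; a timed-out vertical degrades to an
        // empty result instead of blocking the whole response.
        return Arrays.asList(resultOrEmpty(people, 75), resultOrEmpty(jobs, 75), resultOrEmpty(groups, 75));
    }

    private String resultOrEmpty(Future<String> f, long timeoutMs) throws InterruptedException {
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (ExecutionException | TimeoutException e) {
            f.cancel(true);
            return "";
        }
    }

    private String callVertical(String vertical, String query) {
        // Placeholder for the real downstream call.
        return vertical + " results for " + query;
    }
}
```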
3. Cancel unnecessary work
   - Issues
     - Work issued down the call-graph is unnecessary if the clients at the top of the call-graph have already timed out
     - Imagine that as a call reaches half-way down your call-tree, the caller at the root times out. You will still issue work down the remaining half-depth of your tree unless you cancel it!
   - Solution - a possible approach (a sketch follows)
     - The root of the call-tree adds (tree-UUID, inProgress status) to Memcached
     - All services pass the tree-UUID down the call-tree (e.g. as a custom HTTP request header)
     - Servlet filters at each hop check whether inProgress == false. If so, they immediately respond with an empty response
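A minimal sketch of such a servlet filter. The header name, key format, stored value ("false" when the root has timed out), and the use of the spymemcached client are assumptions for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import net.spy.memcached.MemcachedClient;

public class CallTreeCancellationFilter implements Filter {

    private static final String TREE_UUID_HEADER = "X-Call-Tree-UUID"; // hypothetical header
    private MemcachedClient memcached;

    @Override
    public void init(FilterConfig config) throws ServletException {
        try {
            memcached = new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
        } catch (IOException e) {
            throw new ServletException(e);
        }
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String treeUuid = ((HttpServletRequest) req).getHeader(TREE_UUID_HEADER);
        if (treeUuid != null) {
            Object status = memcached.get("calltree:" + treeUuid); // hypothetical key format
            if ("false".equals(status)) { // root has timed out: inProgress == false
                ((HttpServletResponse) res).setStatus(HttpServletResponse.SC_NO_CONTENT);
                return; // short-circuit: skip the remaining work down the call-tree
            }
        }
        chain.doFilter(req, res);
    }

    @Override
    public void destroy() {
        memcached.shutdown();
    }
}
```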
4. Avoid sending requests to hosts that are GCing
   - Issues
     - If a client sends a web request to a host in Service C and that host is experiencing a GC pause, the client will wait 50-200 ms, depending on the read timeout for the request
     - During that GC pause, other requests will also be sent to that node before they all eventually time out
   - Solution
     - Send a GC scout request before every real web request
Why is this a good idea? Scout requests are cheap and add negligible overhead for requests.

- Step 1 : A Service B node sends a cheap 1 msec TCP request to a dedicated scout Netty port
- Step 2 : If the scout request comes back within 1 msec, send the real request to the Tomcat or Jetty port
- Step 3 : Else, repeat with a different host in Service C (a sketch of this check follows)
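A minimal client-side sketch of the scout check. The scout port, timeout, and one-byte ping protocol are assumptions; the real logic lives in LinkedIn's client-side routing layer.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

public class GcScoutClient {

    private static final int SCOUT_PORT = 9999;     // hypothetical dedicated Netty scout port
    private static final int SCOUT_TIMEOUT_MS = 1;  // the scout must answer within ~1 msec

    /** Returns the first Service C host whose scout port answers within the budget. */
    static String pickHealthyHost(List<String> serviceCHosts) {
        for (String host : serviceCHosts) {
            if (scoutOk(host)) {
                return host; // send the real request to this host's Tomcat/Jetty port
            }
        }
        return null; // every candidate looked paused; caller can fail fast or retry later
    }

    private static boolean scoutOk(String host) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, SCOUT_PORT), SCOUT_TIMEOUT_MS);
            socket.setSoTimeout(SCOUT_TIMEOUT_MS);
            socket.getOutputStream().write(1);            // minimal ping byte (hypothetical protocol)
            return socket.getInputStream().read() != -1;  // an app-level reply stalls during a GC pause
        } catch (IOException e) {
            return false; // timeout or refusal: treat the host as unhealthy for this request
        }
    }
}
```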
5. Services should protect themselves from traffic bursts
   - Issues
     - Service nodes should protect themselves from being overwhelmed by requests
     - This will also protect their downstream servers from being overwhelmed
     - Simply setting the Tomcat or Jetty thread pool size is not always an option. Often, these are not configurable per application
   - Solution (a sketch follows)
     - Use a sliding-window counter. If the counter exceeds a configured threshold, return immediately with a 503 (Service Unavailable)
     - Set the threshold below the Tomcat or Jetty thread pool size
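A minimal sketch of a sliding-window throttle as a servlet filter. The window size and threshold are illustrative; the threshold should sit below the container's worker thread pool size.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicIntegerArray;
import javax.servlet.*;
import javax.servlet.http.HttpServletResponse;

public class SlidingWindowThrottleFilter implements Filter {

    private static final int WINDOW_BUCKETS = 10;           // 10 x 100 ms = 1 second window
    private static final long BUCKET_MS = 100;
    private static final int MAX_REQUESTS_PER_WINDOW = 150;  // keep below the worker pool size

    private final AtomicIntegerArray buckets = new AtomicIntegerArray(WINDOW_BUCKETS);
    private long lastBucketEpoch = 0;

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (countInWindow() > MAX_REQUESTS_PER_WINDOW) {
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return; // shed load before tying up a worker thread
        }
        chain.doFilter(req, res);
    }

    private synchronized int countInWindow() {
        long nowEpoch = System.currentTimeMillis() / BUCKET_MS;
        // Zero out buckets that have rotated out of the window since the last request.
        for (long e = Math.max(lastBucketEpoch + 1, nowEpoch - WINDOW_BUCKETS + 1); e <= nowEpoch; e++) {
            buckets.set((int) (e % WINDOW_BUCKETS), 0);
        }
        lastBucketEpoch = nowEpoch;
        buckets.incrementAndGet((int) (nowEpoch % WINDOW_BUCKETS));
        int total = 0;
        for (int i = 0; i < WINDOW_BUCKETS; i++) total += buckets.get(i);
        return total;
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}
```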
Espresso : Overview

Problem
- What do we do when we run out of QPS capacity on an Oracle database server?
- You can only buy yourself out of this problem so far (i.e. buy a bigger box)
- Read replicas and memcached will help scale reads, but not writes!

Solution - Espresso
You need a horizontally-scalable database!

Espresso is LinkedIn's newest NoSQL store. It offers the following features:
- Horizontal scalability
- Works on commodity hardware
- Document-centric: Avro documents supporting rich nested data models; schema evolution is drama-free
- Extensions for Lucene indexing
- Supports transactions (within a partition, e.g. memberId)
- Supports conditional reads & writes using standard HTTP headers (e.g. If-Modified-Since), as in the example below
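A rough sketch of a conditional read against Espresso's REST interface. The host name and URI layout are assumptions for illustration; only the If-Modified-Since / 304 mechanics, which are standard HTTP semantics, are the point.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class EspressoConditionalReadSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical router host and /database/table/key layout.
        URL url = new URL("http://espresso-router:12345/MemberDB/Profiles/12345");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Only return the document if it changed since the client's cached copy.
        conn.setRequestProperty("If-Modified-Since", "Tue, 11 Jun 2013 00:00:00 GMT");

        int status = conn.getResponseCode();
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            System.out.println("304: cached copy is still fresh");
        } else if (status == HttpURLConnection.HTTP_OK) {
            System.out.println("200: a new version of the document was returned");
        }
    }
}
```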
Why not use open source? We needed:
- A change capture stream (e.g. Databus)
- Backup-restore
- A mature storage engine (InnoDB)
Espresso : Architecture

Components
- Request Routing Tier
  - Consults the Cluster Manager to discover the node to route to
  - Forwards the request to the appropriate storage node
- Storage Tier
  - Data store (MySQL)
  - Local secondary index (Lucene)
- Cluster Manager
  - Responsible for data set partitioning
  - Manages storage nodes
- Relay Tier
  - Replicates data to consumers
Databus : Overview

Problem
Our databases (Oracle & Espresso) are used for R/W web-site traffic. However, various services (Search, Graph DB, Standardization, etc.) need the ability to:
- Read the data as it is changed in these OLTP stores
- Occasionally, scan the contents in order to rebuild their entire state

Solution - Databus
Databus provides a consistent, in-time-order stream of database changes that:
- Scales horizontally
- Protects the source database from high read load
Where Does LinkedIn Use Databus?
Databus : Usage @ LinkedIn

[Diagram: data change events from an Oracle or Espresso master fan out to the Search Index, Graph Index, Read Replicas, and Standardization services]

A user updates the company, title, & school on his profile. He also accepts a connection.
- The write is made to an Oracle or Espresso master and Databus replicates it:
  - The profile change is applied to the Standardization service
    - E.g. the many forms of "IBM" are canonicalized for search-friendliness and recommendation-friendliness
  - The profile change is applied to the Search Index service
    - Recruiters can find you immediately by new keywords
  - The connection change is applied to the Graph Index service
    - The user can now start receiving feed updates from his new connections immediately
Databus : Architecture

[Diagram: the Relay captures changes from the DB into an in-memory event window; the Bootstrap service picks up the on-line changes from the Relay and persists them in its own DB]

Databus consists of 2 services:
- Relay Service
  - Sharded
  - Maintains an in-memory buffer per shard
  - Each shard polls Oracle and then deserializes transactions into Avro
- Bootstrap Service
  - Picks up online changes as they appear in the Relay
  - Supports 2 types of operations from clients:
    - If a client falls behind and needs records older than what the relay has, Bootstrap can send consolidated deltas!
    - If a new client comes online, or if an existing client fell too far behind, Bootstrap can send a consistent snapshot
[Diagram: Relay and Bootstrap feed Consumer 1..Consumer n through the Databus client library; Bootstrap can serve a consistent snapshot at time U plus the subsequent on-line changes]

Guarantees
- Transactions
- In-commit-order delivery: commits are replicated in order
- Durability: you can replay the change stream at any time in the future
- Reliability: 0% data loss
- Low latency: if your consumers can keep up with the relay, sub-second response time
Cool Features - Server-side (i.e. relay-side & bootstrap-side) filters
- Problem
  - Say that your consuming service is sharded 100 ways
    - E.g. member search indexes sharded by member_id % 100: index_0, index_1, ..., index_99
  - However, you have a single member Databus stream
  - How do you avoid having every shard read data it is not interested in?
- Solution (a sketch follows)
  - Easy, Databus already understands the notion of server-side filters
  - It will only send updates to your consumer instance for the shard it is interested in
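Databus has since been open-sourced, but the sketch below is not its real client API; the interfaces and names are invented purely to illustrate the consumption model described above: a consumer subscribes to a source with a mod-partition filter, so the relay/bootstrap only ships events for its shard.

```java
// Hypothetical interfaces illustrating the model; NOT the open-source Databus client API.
interface ChangeEvent {
    long key();            // e.g. member_id
    String source();       // e.g. "member_profiles"
    byte[] avroPayload();  // the Avro-serialized change
}

interface ChangeConsumer {
    void onEvent(ChangeEvent event);
}

interface ChangeStreamClient {
    // Server-side filter: only events where key % totalShards == shardId are shipped,
    // so this consumer never even receives other shards' data.
    void subscribe(String source, int shardId, int totalShards, ChangeConsumer consumer);
}

class SearchIndexShardConsumer implements ChangeConsumer {
    private final int shardId;

    SearchIndexShardConsumer(int shardId) { this.shardId = shardId; }

    @Override
    public void onEvent(ChangeEvent event) {
        // Apply the profile change to this shard's local index (index_<shardId>).
        System.out.println("shard " + shardId + " indexing member " + event.key());
    }
}
// Usage (hypothetical): client.subscribe("member_profiles", 42, 100, new SearchIndexShardConsumer(42));
```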
Kafka : Overview

Problem
We have Databus to stream changes that were committed to a database. How do we capture and stream high-volume data if we relax the requirement that the data needs long-term durability? In other words, the data can have limited retention.

Challenges
- Needs to handle a large volume of events
- Needs to be highly available, scalable, and low-latency
- Needs to provide limited durability guarantees (e.g. data retained for a week)

Solution - Kafka
Kafka is a messaging system that supports topics. Consumers can subscribe to topics and read all data within the retention window. Consumers are then notified of new messages as they appear!
Kafka : Usage @ LinkedIn

Kafka is used at LinkedIn for a variety of business-critical needs. Examples (a minimal consumer sketch follows the list):
- End-user activity tracking (a.k.a. web tracking)
  - Emails opened
  - Logins
  - Pages seen
  - Executed searches
  - Social gestures: likes, shares, comments
- Data center operational metrics
  - Network & system metrics, such as TCP metrics (connection resets, message resends, etc.)
  - System metrics (IOPS, CPU, load average, etc.)
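A minimal consumer sketch against the 0.8-era high-level consumer API, written from memory and hedged accordingly; the ZooKeeper connect string, group id, and "page-views" topic are placeholders, not LinkedIn's actual names.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class PageViewConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181"); // placeholder hosts
        props.put("group.id", "page-view-aggregator");                // hypothetical consumer group

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Ask for one stream for the (hypothetical) "page-views" topic.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            connector.createMessageStreams(Collections.singletonMap("page-views", 1));

        // Iterating blocks until new messages arrive within the retention window.
        ConsumerIterator<byte[], byte[]> it = streams.get("page-views").get(0).iterator();
        while (it.hasNext()) {
            MessageAndMetadata<byte[], byte[]> record = it.next();
            System.out.println("received " + record.message().length + " bytes");
        }
    }
}
```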
Kafka : Architecture

[Diagram: the web tier pushes events to the broker tier (Topic 1..Topic N; sequential writes, sendfile); consumers pull events through the Kafka client library iterators; ZooKeeper handles message-id management and topic/partition ownership; throughput on the order of 100-200 MB/sec]

Features
- Pub/Sub
- Batch send/receive
- End-to-end compression
- System decoupling

Guarantees
- At-least-once delivery
- Very high throughput
- Low latency (0.8)
- Durability (for a time period)
- Horizontally scalable
Kafka : Scale at LinkedIn

- 28 billion unique messages written per day
- Peak writes/sec: 460k
- Peak reads/sec: 2.3m
- # topics: 693
Kafka : Improvements in 0.8 - Low Latency Features

- Kafka has always been designed for high throughput, but E2E latency could have been as high as 30 seconds

Feature 1 : Long-polling
- For high-throughput requests, a consumer's request for data will always be fulfilled
- For low-throughput requests, a consumer's request will likely return 0 bytes, causing the consumer to back off and wait. What happens if data arrives on the broker in the meantime?
- As of 0.8, a consumer can park a request on the broker for up to m milliseconds
- If data arrives during this period, it is instantly returned to the consumer (a sketch of the idea follows)
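A generic illustration of the long-polling idea, not Kafka's actual broker code: a fetch parks until either new data arrives past the requested offset or the maximum wait elapses.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class LongPollLogSketch {

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition dataArrived = lock.newCondition();
    private long endOffset = 0; // next offset to be written

    /** Producer path: append a message and wake any parked fetches. */
    public void append(byte[] message) {
        lock.lock();
        try {
            // ... write the message to the log ...
            endOffset++;
            dataArrived.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Consumer path: return once data beyond fetchOffset exists, or after maxWaitMs. */
    public long fetch(long fetchOffset, long maxWaitMs) throws InterruptedException {
        lock.lock();
        try {
            long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
            while (endOffset <= fetchOffset) {
                long remaining = deadlineNanos - System.nanoTime();
                if (remaining <= 0) {
                    return endOffset; // nothing new: empty response after maxWaitMs
                }
                dataArrived.awaitNanos(remaining); // park the request on the broker
            }
            return endOffset; // data arrived while parked: respond immediately
        } finally {
            lock.unlock();
        }
    }
}
```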
Feature 2 : New Commit Protocol
- In the past, data was not visible to a consumer until it was flushed to disk on the broker
- In 0.8, replicas and a new commit protocol have been introduced. As long as data has been replicated to the memory of all replicas, even if it has not been flushed to disk on any one of them, it is considered committed and becomes visible to consumers
Acknowledgments

- Jay Kreps (Kafka)
- Neha Narkhede (Kafka)
- Kishore Gopalakrishna (Helix)
- Bob Shulman (Espresso)
- Cuong Tran (Performance & Scalability)
- Diego "Mono" Buthay (Search Infrastructure)