Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy
Flipkart Website Architecture
Mistakes & Learnings
Siddhartha Reddy, Architect, Flipkart
June 2007
November 2007
December 2012
www.flipkart.com
• Started in 2007
• Current architecture from mid-2010
• Evolution of the architecture presented as: Issue[1] → RCA[2] → Actions → Learnings
[1] Issue: Website is “slow”
[2] RCA = Root Cause Analysis
INFANCY (2007 – MID-2010)
Surviving & reacting to the environment
Website is “slow”!
RCA
• Why?
  – MySQL queries taking too long
• Why?
  – Too many queries
  – Many slow queries
  – Queries locking tables
• Why?
  – Capacity
• Hmm…
Fixing it
• Get beefier servers (the obvious)
• Separate master_db, slave_db
  – Writes go to master_db
  – Reads from slave_db
  – Critical reads from master_db
[Diagram: before, a single MySQL server handles both reads and writes; after, writes go to the MySQL master, which replicates to a MySQL slave that serves reads.]
Learning from it
• Scale out database reads by distributing load across systems
• Isolate database writes from reads
  – Writes are (usually) more critical
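
The master_db / slave_db split above can be sketched as a small routing rule. This is a minimal illustration, not Flipkart's code: the class and method names are hypothetical, and the stated reason for sending critical reads to the master (a replicated slave can lag behind) is an assumption, since the slides do not spell it out.

```python
class ReadWriteRouter:
    """Pick a database for each query (master_db / slave_db as in the slides)."""

    def __init__(self, master, slave):
        self.master = master   # handle for master_db (any object)
        self.slave = slave     # handle for slave_db (any object)

    def connection_for(self, is_write, critical=False):
        # Writes always go to the master. Critical reads do too, on the
        # assumption that the slave may be slightly behind the master.
        if is_write or critical:
            return self.master
        return self.slave
```

For example, `router.connection_for(is_write=False)` returns the slave handle, while the same call with `critical=True` returns the master.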
Website is “slow”! (Again)
RCA
• Why?
  – MySQL queries taking too long (on slave_db)
• Why?
  – Too many queries
  – Many slow queries
• Why?
  – Queries from analytics / reporting and other backend jobs
• Urm…
Fixing it
• Analytics / reporting DB (archival_db)
  – Use MyISAM (optimized for reads)
  – Additional indexes for quicker reporting
[Diagram: before, website writes go to the MySQL master, and both website reads and analytics reads hit a single MySQL slave; after, the master replicates to two slaves, with Slave 1 serving website reads and Slave 2 serving analytics reads.]
Learning from it
• Isolate the databases used for serving website traffic from those used for analytics / reporting
• Isolate the systems used by the production website from those used for background processing
BABY (2010 – 2011)
Learning the basics
Website is “slow”!
RCA
• Why?
• How?
  – Instrumentation
RCA - 1
• Why?
  – Logging a lot
  – PHP processes blocking on writing logs
[Diagram: Request1, Request2 and Request3 are handled by Process1, Process2 and Process3, which share one log file; while Process3 is writing, Process1 and Process2 are waiting.]
RCA - 2
• Why?
  – Service Oriented Architecture (SOA)
  – Too many calls to remote services per request
    • Creating fresh connection for each call
    • All the calls are made in serial order
[Diagram: Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response]
RCA - 3
• Why?
  – Configurability
  – Fetch a lot of “config” from database for serving each request
[Diagram: Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 → Send response, with every fetch going to the Database]
RCA – 1,2,3
• Why?
  – Logging a lot
  – SOA
  – Configurability
• Why?
  – PHP’s process model
• Argh!
Fixing it
• fk-w3-agent
  – Simple Java “middleware” daemon
  – Deployed on each web server
  – PHP communicates to it through local socket
  – Hosts pluggable “handlers”
fk-w3-agent: LoggingHandler
[Diagram: before, Process1, Process2 and Process3 each write directly to the log file; after, they hand their log lines to fk-w3-agent, which writes to the log file (async / buffered).]
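
The “async / buffered” idea behind the LoggingHandler can be sketched as a queue drained by a single writer thread. This is an illustrative Python sketch only: the real handler is a Java daemon reached over a local socket, and the class name `BufferedLogger` and its `sink` parameter are hypothetical.

```python
import queue
import threading


class BufferedLogger:
    """Accept log lines without blocking callers; one thread does the writing."""

    _STOP = object()  # sentinel telling the writer thread to finish

    def __init__(self, sink):
        self._sink = sink                  # any callable taking one log line
        self._queue = queue.Queue()        # unbounded, so put() does not block
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def log(self, line):
        # Enqueue only; the caller never waits for the slow write itself.
        self._queue.put(line)

    def close(self):
        # Everything queued before the sentinel is written before we return.
        self._queue.put(self._STOP)
        self._writer.join()

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                return
            self._sink(item)
```

Because only one thread ever touches the sink, request handlers no longer contend with each other for the log file. The trade-off is that queued lines are lost if the process dies before they are written.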
fk-w3-agent: ServiceHandler(s)
[Diagram: before, Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response; after, Receive request → Call fk-w3-agent → Send response, with fk-w3-agent talking to Service1 and Service2.]
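
RCA - 2 names two problems: a fresh connection per call, and calls made in serial order. The slides do not say how the ServiceHandler addresses them, so the following is only a sketch of the second point, under the assumption that independent service calls are issued concurrently. The function name and pool size are hypothetical, and connection reuse is not shown.

```python
from concurrent.futures import ThreadPoolExecutor

# One long-lived pool shared across requests (hypothetical sizing).
_POOL = ThreadPoolExecutor(max_workers=8)


def call_services(calls):
    """Run every zero-argument callable in `calls` concurrently.

    `calls` maps a service name to a callable; returns name -> result.
    """
    # Submit everything first so the calls overlap, then collect results.
    futures = {name: _POOL.submit(fn) for name, fn in calls.items()}
    return {name: future.result() for name, future in futures.items()}
```

With this shape, the time for a request is bounded by its slowest call rather than by the sum of all calls.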
fk-w3-agent: ConfigHandler
[Diagram: before, Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 → Send response, with every fetch going to the Database; after, Receive request → Fetch all config from fk-w3-agent → Send response, with fk-w3-agent polling the Database and caching the result.]
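
“Poll and cache” can be sketched as an in-memory snapshot that is refreshed on a timer. This is a hedged Python illustration rather than the Java handler itself; `ConfigCache`, `load_all` and `poll_forever` are hypothetical names.

```python
import threading


class ConfigCache:
    """Keep a full copy of the config in memory, refreshed by polling."""

    def __init__(self, load_all):
        self._load_all = load_all      # callable returning a dict of all config
        self._lock = threading.Lock()
        self._snapshot = {}
        self.refresh()                 # load once so the first read is served

    def refresh(self):
        # One database round trip per poll, instead of several per request.
        fresh = dict(self._load_all())
        with self._lock:
            self._snapshot = fresh

    def get_all(self):
        # Served from memory; callers should treat the dict as read-only.
        with self._lock:
            return self._snapshot

    def poll_forever(self, interval_seconds, stop_event):
        # Intended to run in a background thread; set stop_event to end it.
        while not stop_event.wait(interval_seconds):
            self.refresh()
```

Requests then read config at memory speed, at the cost of seeing changes up to one polling interval late.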
Learning from it
• PHP: good for frontend and templating
  – Gives a lot of agility
  – Limiting process model
    • Hurdle for high performance
• Java: stability and performance
• Horses for courses
Website is “slow”! (Again)
RCA
• Why?
  – PHP processes taking up too much time
  – PHP processes taking up too much CPU
• Why?
  – Product info deserialization taking up time/CPU
  – View construction taking up time/CPU
Fixing it
• Caching!
• Cache fully constructed pages
  – For a few minutes
  – Only for highly trafficked pages (Homepage)
• Cache PHP serialized Product objects
  – ~20 million objects
  – Memcache
• Yeah! But…
  – Add caching => add complexity
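
Caching a fully constructed page “for a few minutes” amounts to a cache with a fixed time-to-live. The sketch below is an in-process illustration with hypothetical names (`PageCache`, `get_or_render`); the slides do not say where the rendered pages were actually stored.

```python
import time


class PageCache:
    """Cache rendered pages for a fixed TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests need not sleep
        self._entries = {}             # path -> (expires_at, html)

    def get_or_render(self, path, render):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]            # still fresh: skip view construction
        html = render()                # expired or missing: build the page
        self._entries[path] = (now + self._ttl, html)
        return html
```

A homepage handler would call something like `cache.get_or_render("/", build_homepage)` with a TTL of a few hundred seconds, where `build_homepage` is whatever constructs the page.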
Caching: Complications (1)
• “Caching fully constructed pages”
• But parts of pages still need to be dynamic
  – Example: Logged-in user’s name
• Impossible to do effective bucket testing
  – Or at least makes it prohibitively complex
Caching: Complications (2)
• “Caching PHP serialized Product objects”
• Without caching: getProductInfo() → Fetch from CMS
• With caching, cache hit: getProductInfo() → Fetch from Cache
• With caching, cache miss: getProductInfo() → Fetch from Cache → Fetch from CMS → Set in Cache
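
The hit and miss paths above are the usual cache-aside read. A minimal sketch, assuming the cache behaves like a dictionary and `fetch_from_cms` is a hypothetical callable standing in for the CMS lookup:

```python
def get_product_info(product_id, cache, fetch_from_cms):
    """Cache-aside read: try the cache, fall back to the CMS on a miss."""
    product = cache.get(product_id)
    if product is not None:
        return product                     # cache hit
    product = fetch_from_cms(product_id)   # cache miss: go to the source
    cache[product_id] = product            # set in cache for the next reader
    return product
```

The complication is visible in the code: a read now has two paths, and the miss path does strictly more work than the uncached version did.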
Caching: Complications (3)
• TTL: ∞ (i.e. no invalidation)
• Pro-actively repopulate products in the cache
  – Receive “notifications” about product updates
    • Notification Server: pushes notifications raised by CMS
• Use a persistent, distributed cache
  – Memcache => Membase, Couchbase
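
Proactive repopulation can be sketched as a notification handler that overwrites the cached product. The function name and the dictionary-style cache are assumptions for illustration; the slides do not describe the Notification Server's interface.

```python
def on_product_updated(product_id, cache, fetch_from_cms):
    """Handle one update notification by overwriting the cached product."""
    # With an infinite TTL nothing ever expires, so a changed product only
    # reaches readers when an update is pushed into the cache like this.
    cache[product_id] = fetch_from_cms(product_id)
```

Readers keep hitting the cache throughout; they see the old value until the notification is processed and the new value afterwards.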
Learning from it
• Caching is a powerful tool for performance optimization
• Caching adds complexities
  – Reduced by keeping cache close to data source
  – Think deeply about TTL, invalidation
• Use caching to go from “acceptable performance” to “awesome performance”
  – Don’t rely on it to get to “acceptable performance”
KID (2012)
Growing up
Website is “slow”!
RCA
• Why?
  – Search-service is slow (or Reviews-service is slow or Recommendations-service is slow)
• But why is the rest of the website slow?
  – Requests to the slow service are blocking processing threads
• Eh?!
Let’s do some math
• Let’s say
  – Mean (or median) response time: 100 ms
  – 8-core server
  – All requests are CPU bound
• Throughput: 80 requests per second (rps)
• Let’s also say
  – 95th percentile response time: 1000 ms
    • Call them “bad requests”
• 4 bad requests in a second
  – Throughput down to 44 rps
• 8 bad requests in a second?
  – Throughput down to 8 rps
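
The arithmetic behind these numbers can be written out. The sketch assumes, as the slide does, that each request holds one core for its whole duration and that a bad request (1000 ms) occupies its core for the entire one-second window; the function name is hypothetical.

```python
def throughput_rps(cores, normal_ms, bad_requests):
    """Requests completed in one second on a server with `cores` cores."""
    bad = min(bad_requests, cores)       # each bad request pins one core
    free_cores = cores - bad             # cores left for normal requests
    normal_per_core = 1000 // normal_ms  # e.g. 10 requests at 100 ms each
    return free_cores * normal_per_core + bad
```

With 8 cores and 100 ms requests this gives 8 × 10 = 80 rps with no bad requests, 4 × 10 + 4 = 44 rps with four, and 0 × 10 + 8 = 8 rps with eight.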
Fixing it
• Aggressive timeouts for all service calls
  – Isolate impact of a slow service
    • Only to pages that depend on it
• Very aggressive timeouts for non-critical services
  – Example: Recommendations
    • On a Product page, Search results page etc.
    • Not on My Recommendations page
• Load non-critical parts of pages through AJAX
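
A timeout with a fallback for a non-critical component can be sketched as follows. The helper name, pool size and timeout values are hypothetical, and a production version would also set timeouts on the underlying socket so that abandoned calls do not keep a worker busy.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=8)   # hypothetical sizing


def call_with_timeout(fn, timeout_seconds, fallback):
    """Return fn() if it finishes in time, otherwise the fallback value."""
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The page renders without this component. The slow call is abandoned,
        # though its worker thread stays occupied until the call returns.
        return fallback
```

A product page might fetch recommendations with a very short timeout and an empty-list fallback, while the My Recommendations page, where the data is the point, would use a longer one.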
Learning from it
• Isolate the impact of poorly performing services / systems
• Isolate the required from the good-to-have
Website is “slow”! (Again)
RCA
• Why?
  – Load average of web servers has spiked
• Why?
  – Requests per second has spiked
    • From 1000 rps to 1500 rps
• Why?
  – Large number of notifications of product information updates
Fixing it
• Separate the cluster that receives product info update notifications from the cluster that serves users
• Admission control: Don’t let a system receive more requests than it can handle
  – Throttling
• Batch the notifications
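
Throttling as a form of admission control is often done with a token bucket. The slides do not say which mechanism was used, so this is one possible sketch with hypothetical names; it is not thread-safe as written.

```python
import time


class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock            # injectable so tests need not sleep
        self._tokens = float(capacity)
        self._last = clock()

    def try_admit(self):
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False                   # caller should reject or defer the request
```

A notification endpoint would call `try_admit()` per incoming request and turn away the caller when it returns `False`, which keeps an over-enthusiastic client from pushing the system past what it can handle.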
Learning from it
• Isolate the systems serving internal requests from those serving production traffic
• Admission control to ensure that a system is isolated from the over-enthusiasm of a client
• Look at the granularity at which we’re working
TEENAGER
Increasing complexity
THANK YOU
Mistake?
• Sub-optimal decision
  – Not all information/scenarios considered
  – Insufficient information
  – Built for a different scenario
• Due to focus on “functional” aspects
• A mistake is a mistake
  – … in retrospect