Lessons from Highly Scalable Architectures at Social Networking Sites
1
Software Engineering in a Cloud World
Lessons from highly-scalable architectures
at social networking sites
Patrick Senti, [email protected]
2
Social Networking – Trends 2012
more users … higher share of time … for longer
Source: State of Media: The Social Media Report 2012, Nielsen, http://is.gd/LYHmnm
3
User Adoption Faster for New Entrants
Source: author's compilation of company data, press statements, technical blogs & presentations
[Chart: user growth in millions (logarithmic scale) vs. years since launch (0.5–8 years); Tumblr highlighted among the networks shown]
4
Staggering Volumes
Page views: 500 million/day
Reads: ~40k requests/second
Writes: ~1 million/second
New data: ~3 TB/day
Servers: 1000
Engineers: 20
Sources: http://is.gd/mpdOPN, http://is.gd/1vJ1il, http://is.gd/58X8ns, http://is.gd/LGexI6, http://is.gd/tZfNPA, http://is.gd/bcpCJc, http://is.gd/kXVEEF
Likes (counter): 2.7 billion/day
Photos: 300 million/day
Queries: 70'000/day
New data: 500 TB/day
Servers: “tens of thousands”
Engineers: ~1700

Tweets (peak): ~25'000/second
Tweets (avg): ~250 million/day (1000/second)
API calls: 6 billion/day (70'000/second)
New data: ~8 TB/day (80 MB/second)
Engineers: 500 (of 1000 total employees)

Page views: 2.3 billion/month
Growth rate: 50% (visitors, March 2012)
Machinery: 150 web servers, 90 caching servers, 70 database instances, 35 logging/internal
Data size: 410 TB (user data)
Employees: ~65 (NB: until end of 2011, 12)
5
Methodology
● Author's synthesis
● Information collected 2010–2012
● Mostly secondary research conducted on the internet
● Sources of information
● Public presentations at industry conferences
● Engineering blogs by social network companies
● Research reports
● Technology documentation
● Author's data analysis
● Threats to validity
● Subjective selection of information sources
● Non-systematic analysis and synthesis of the data gathered
6
Typical Scalability Approaches
● Load Balancing
● Static content on dedicated servers
● Caching
● Database Partitioning
● Replication (high availability)
● (How) Do these work at social-network scale?
7
Source: Aditya Agarwal, Facebook Architecture, QCon 2008, London
Functionality
- Type of blog
- User profile with personal data
- Users 'friend' each other
- Post public or private messages

Data center
- owned by Facebook
Software Architecture
8
Software architecture
- Ruby on Rails, Erlang
- since 2009: JVM, Scala
- MySQL
- Memcached
- Unicorn (Mongrel) web server

Functionality
- 140-character messages
- Users follow each other
- Posts can contain pictures, media links etc.

Data center
- dedicated data center (outsourced)

Source: Krikorian R., Twitter's Real Time Architecture, QCon NYC 2012
9
tumblr
Software architecture
- PHP, Ruby, Scala
- Redis, HBase, MySQL
- Memcache
- Thrift

Functionality
- Microblogging
- Users follow each other
- Dashboard similar to a Facebook page

Data center
- started at Rackspace
- co-located, dedicated

Source: Tumblr Architecture – 15 Billion Page Views a Month and Harder to Scale than Twitter, High Scalability blog
Source: tumblr.com
10
Data center
- Amazon EC2, EBS, S3

Functionality
- Photo sharing pinboards
- Categorize images, share with others
- mostly used by women (2012: 83%)

Software architecture
- Python
- Django
Source: pinterest.com
Source: Jackson B., Pinterest growth driven by Amazon cloud scalability, 04.2012, techworld.com
11
Software architecture
- Python, Django
- PostgreSQL
- Redis
- Nginx
- Node.js
- Android

Functionality
- Smartphone photo sharing
- Post to other social networks
- Send messages

Data center
- started with a single small-scale PC (up to 30+ million users)
- 100+ instances at Amazon (EC2, EBS; S3 for photos)

Employees
- 2010: 2 engineers; 2012: 5 engineers
- That's the total employee count
Source: Instagram, What Powers Instagram: Hundreds of Instances, Dozens of Technologies, Instagram Engineering Blog
Source: Wikipedia
12
Scalability Options
scale up – add #CPUs, RAM, disk per machine
● transparent scalability
● scale 'out of the box'
● complex hardware (high cost)
● specialised knowledge
● more complex software (multi-core)

scale out – add #machines
● simple hardware (low cost)
● scale by numbers
● difficult to implement
● difficult to maintain (myth?)
● more complex software (expensive licenses)

either way
- scale by parallelization
- partition for fault tolerance
- replicate for reliability

this means:
- decouple components
- asynchronous processing
- monitor to operate
13
Caching
● Goal: reduce response times for web site & data access
● Product: memcached (open source, initially developed 2003)
● Benefits: all accesses (read & write) are O(1)
14
memcached
[Diagram: client → load balancer → web servers; each web server selects a cache node by client-side hashing: server = hashf(key) % #servers. Nodes hold disjoint key ranges, e.g. Keys={1,2,3}, {4,5,6}, {7,8,9}, {10,11,12}]
Features
● Remote-accessible in-memory key/value cache
● Least Recently Used (LRU) eviction
● Shared-nothing, distributed architecture

Implementation
● memcached nodes map to key ranges (client-side hashing – no SPOF)
● Multi-threaded, event-based async network I/O (200'000 requests/s at Facebook)
● Single-node fault tolerance by consistent hashing scheme
Source: memcached.org
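To make the cache pattern concrete, here is a minimal cache-aside sketch in Python using the slide's modulo scheme. The node list, the in-process dicts and the db_lookup stub are illustrative stand-ins for real memcached connections and a real database, not any site's actual code.

import zlib

NODES = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211", "10.0.0.4:11211"]
cache = [dict() for _ in NODES]          # stand-ins for memcached connections

def node_for(key: str) -> int:
    # server = hashf(key) % #servers – client-side hashing, no SPOF
    return zlib.crc32(key.encode()) % len(NODES)

def db_lookup(user_id: str) -> dict:
    # Hypothetical stand-in for the (slow) database read
    return {"id": user_id, "name": "Test User"}

def get_profile(user_id: str) -> dict:
    key = f"profile:{user_id}"
    n = node_for(key)
    value = cache[n].get(key)            # 1) try the cache first
    if value is None:
        value = db_lookup(user_id)       # 2) on a miss, read the database
        cache[n][key] = value            # 3) populate the cache for next time
    return value

print(get_profile("42"))                 # first call misses; later calls hit the cache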
16
Consistent Hashing in a nutshell
server = min(s | s.location >= (hashf(key) % #locations))
Consistent hashing: buckets are located on a ring and each holds keys up to a pre-defined limit => at worst, only the keys of the failing node need to be re-mapped
Source: David Karger et al., Web Caching with Consistent Hashing, Computer Networks, Vol. 31, 1999
[Diagram: key-to-bucket assignment before and after a node failure, comparing consistent hashing with 'traditional' hashing]
'Traditional' hashing: buckets contain a pre-defined range => at worst the full cache must be rebuilt, and every node may be affected
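As a concrete illustration of the ring, here is a minimal consistent-hashing sketch in Python; the node names and virtual-node count are illustrative assumptions, not any site's actual implementation.

import bisect
import hashlib

def hashf(value: str) -> int:
    # Stable hash so every client maps keys the same way
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        self._ring = []                  # sorted list of (location, node)
        for node in nodes:
            self.add(node, replicas)

    def add(self, node, replicas=100):
        # Virtual nodes smooth the key distribution across physical nodes
        for i in range(replicas):
            bisect.insort(self._ring, (hashf(f"{node}:{i}"), node))

    def remove(self, node):
        self._ring = [(loc, n) for loc, n in self._ring if n != node]

    def lookup(self, key):
        # First node at or after the key's location, wrapping around the ring
        idx = bisect.bisect_left(self._ring, (hashf(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["cache1", "cache2", "cache3", "cache4"])
primary = ring.lookup("user:42")
ring.remove(primary)                     # only the failed node's keys re-map
assert ring.lookup("user:42") != primary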
17
Memcached Results
● Results at Twitter
● 100s of servers
● 20TB of data covering >30 services
● 2 trillion queries/day (>23 million queries/second)
● Modified memcached, released as “Twemcache”
● Key objectives
● High Availability
● Predictable Performance
● Dynamic adaptation to size (grow/shrink)
● Monitoring of cache effectiveness
Source: Chris Aniszczyk, Caching with Twemcache, 07.2012, Twitter Engineering Blog
18
Shard your data
● Shards
● horizontal partitions (e.g. by user, time, ...)
● distributed to multiple physical nodes => parallelized data access
● data typically denormalized
● similar data is replicated to all shards – e.g. static data
[Diagram: web server with db-client routing requests by node = hashf(userid) % #nodes; node1..node4 each hold a user-id range, e.g. Userids={A, …, F}, Userids={G, …, L}, …]
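A minimal sketch of the routing function in Python, with hypothetical shard hosts and in-process lists standing in for real MySQL connections; the modulo scheme matches node = hashf(userid) % #nodes above.

import zlib

SHARD_HOSTS = ["node1", "node2", "node3", "node4"]   # hypothetical hosts
shards = {host: [] for host in SHARD_HOSTS}          # stand-ins for DB connections

def shard_for(user_id: str) -> str:
    # Stable checksum (not Python's salted hash()) so routing is repeatable
    return SHARD_HOSTS[zlib.crc32(user_id.encode()) % len(SHARD_HOSTS)]

def save_post(user_id: str, body: str) -> None:
    # All of a user's denormalized rows land on the same shard
    shards[shard_for(user_id)].append({"user": user_id, "body": body})

def posts_of(user_id: str) -> list:
    # Typical per-user queries therefore touch exactly one node
    return [p for p in shards[shard_for(user_id)] if p["user"] == user_id]

save_post("alice", "hello world")
print(posts_of("alice"))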
19
Sharding Results
● Impressive results at Facebook
● 1800 MySQL servers
● 4 ms reads, 5 ms writes
● 60M queries/second (peak)
● Growth 20× (overall data, over two years)
● What works
● Shard by user – group similar data into the same shard
● Linking across shards – store cross-references in both shards (two-way access)
● Fault tolerance: single-instance failure only affects a subset of users
● Consistent hashing
● What doesn't
● Joins across shards – not efficiently possible
● Sharding by time – one shard keeps running “hot”
● Sharding by function – non-uniform distribution, hot spots, unique access patterns
● Fixed hashing – nodes become unbalanced, difficult to grow or shrink
Source: Facebook Tech Talks, MySQL & HBase, December 5, 2011
20
Managing shards
● Results at Tumblr
● 200 DB servers
● Grouped into 5 global pools / 58 shard pools
● 28 TB of data
● 100 billion rows
● No DBAs – 2 engineers keep this running at 50% of their time
● Jetpants – DB management toolkit
● Clone slaves efficiently
● Split shards into new shards
● Master promotions
● Command line to work with the topology
● Open sourced: https://github.com/tumblr/jetpants
Source: Elias E., Managing Large Sharded Topologies with Jetpants, 12.2012, Percona Live MySQL Conference
21
Asynchronous & Distributed Work
● Problem: do more work in less time
● Solution: distributed, asynchronous processing (MapReduce)
● Requirements
● Split a job into multiple pieces
● Distribute work
● Collect results
● Fault tolerance
● Technologies
● Message Queueing
● Gearman
● Hadoop / Pig
22
Asynchronous Work Example
● Instagram Push Notifications
● Image uploads
● All uploads go into a task-queue
● ~200 worker processes asynchronously process the images
● Gearman
● Open Source
● Framework to distribute work
● Load Balancing
● No SPOF
Source: gearman.org
Source: Instagram, What Powers Instagram: Hundreds of Instances, Dozens of Technologies, 2012, Instagram Engineering Blog
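The same queue/worker shape can be sketched with Python's standard library alone (standing in for Gearman); the worker count and task payload here are illustrative.

import queue
import threading

task_queue = queue.Queue()

def process_image(path: str) -> None:
    # Placeholder for resizing/thumbnailing an uploaded image
    print(f"processed {path}")

def worker() -> None:
    while True:
        path = task_queue.get()          # blocks until an upload arrives
        try:
            process_image(path)          # work happens off the request path
        finally:
            task_queue.task_done()

# ~200 worker processes at Instagram; a handful shows the pattern
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# The web tier only enqueues and returns to the user immediately
task_queue.put("/uploads/photo-123.jpg")
task_queue.join()                        # wait for the demo task to finish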
23
Apache Hadoop
● What it is
● Distributed MapReduce engine
● Fault tolerant
● Asynchronous job scheduling
● Scalable: e.g. a 4000-node cluster sorts 1 TB in 62 seconds
● Data storage
● HDFS – scalable to multiple PB
● Distributed storage
● Written in Java
● Data replicated among 3 nodes
● Block storage of 64MB/block
● No SPOF
● Apache Pig
● High-level query language
Sources: Apache Hadoop, Wikipedia, The Free Encyclopedia, accessed January 8, 2013; Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2012
24
Results
● NoSQL at Twitter
● Store 7TB of data/day
● HD speed: ~80MB/s => 24.3 hours
● Need to parallelize writes and reads
● Analysis using Pig
● Count all tweets
● 12 billion
● 5 minutes
Source: Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2012
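A minimal sketch of such a counting job as a Hadoop Streaming mapper/reducer pair in Python; this is illustrative only – the actual Twitter job was written in Pig, and the invocation shown in the comment is an assumption.

#!/usr/bin/env python3
import sys

def mapper():
    # One input line per tweet record; emit a single counter key
    for _ in sys.stdin:
        print("tweets\t1")

def reducer():
    # Hadoop sorts by key, so all 'tweets' counts arrive together
    total = 0
    for line in sys.stdin:
        _, count = line.split("\t")
        total += int(count)
    print(f"tweets\t{total}")

if __name__ == "__main__":
    # e.g. hadoop streaming -mapper 'count.py map' -reducer 'count.py reduce'
    mapper() if sys.argv[1:] == ["map"] else reducer()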
25
Simplified Queries
Source: Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2012
27
Service Oriented Architecture
“Onion-Style”

outer services
- public (e.g. REST)
- user interface
- typically scripted (Python, Ruby, JavaScript)

inner services
- private & highly efficient
- data access, calculation etc.
- workers to accomplish work in parallel
- mix of languages (Java, Scala, Python, C, ...)

firehose
- highly available, scalable service bus
- distribute services as needed
- typically asynchronous
28
Tumblr Firehose
Apache Kafka
- O(1) persistent message queue
- several 100K messages/s
- pub/sub interface

Apache ZooKeeper (cluster)
- distributed coordination
- highly available

Finagle
- asynchronous RPC system
- JVM-hosted languages (Java, Scala, ...)
- connection pools, failure detectors, failover, load balancing, back-pressure, ...

[Diagram: NewPost events flow through Kafka into Finagle-based services; HTTP clients consume the public API (JSON), internal services the Thrift API]

Results
- 4 × CPUs @ 72 GB RAM, 2 disks
- provide 1 week of streams (1 week of Tumblr posts)
- ~400k messages/second
Source: Blake M., Tumblr Firehose - The Gory Details, 2012, Tumblr Engineering Blog
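The pub/sub shape of the firehose can be sketched with the kafka-python package – an assumption for illustration, since Tumblr's actual stack is Kafka behind Finagle services; the broker address and topic name are made up, and a local broker is assumed to be running.

from kafka import KafkaConsumer, KafkaProducer

# Publish a post event into the firehose topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("posts", b'{"type": "NewPost", "id": 123}')
producer.flush()

# Each consumer sees the full stream, decoupled from the producers
consumer = KafkaConsumer("posts", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)                 # process or fan out the event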
29
SOA revisited – network efficiency
[Diagram: consumer and provider communicating across an interface]

consumer: 1. Serialize 2. Wait for response 3. Deserialize
provider: 1. Deserialize 2. Provide response 3. Serialize
CORBA, HTTP/JSON, WSDL/XML/SOAP, ...
efficient?
30
Apache thrift – optimized wire protocol
● What it is
● Human-readable interface definition language (non-XML)
● Cross-language service implementation
● Code-generation engine (C++, Java, Python, JavaScript, …)
● Binary wire protocol
● Benefits
● Low-overhead serialization/de-serialization
● Native language bindings (no XML parsing or XSD)
● Efficient protocol implementation
31
thrift example
Interface definition (IDL):

struct UserProfile {
  1: i32 uid,
  2: string name,
  3: string blurb
}

service UserStorage {
  void store(1: UserProfile user),
  UserProfile retrieve(1: i32 uid)
}

Client (Python):

# Thrift runtime plus the modules generated by `thrift --gen py`
# (generated module names depend on the IDL file, here user.thrift)
from thrift.transport import TSocket
from thrift.protocol import TBinaryProtocol
from user import UserStorage
from user.ttypes import UserProfile

# Make an object
up = UserProfile(uid=1, name="Test User", blurb="Thrift is great")

# Talk to a server via TCP sockets, binary protocol
transport = TSocket.TSocket("localhost", 9090)
transport.open()
protocol = TBinaryProtocol.TBinaryProtocol(transport)

# Use the service we already defined
service = UserStorage.Client(protocol)
service.store(up)
up2 = service.retrieve(1)

Service implementation (C++):

class UserStorageHandler : virtual public UserStorageIf {
 public:
  UserStorageHandler() {
    // Your initialization goes here
  }

  void store(const UserProfile& user) {
    // Your implementation goes here
    printf("store\n");
  }

  void retrieve(UserProfile& _return, const int32_t uid) {
    // Your implementation goes here
    printf("retrieve\n");
  }
};

// main ...
Source: thrift.apache.org
32
Serialization / Deserialization Performance
● Serialization time: Thrift −66%
● Deserialization time: Thrift −92%
● Message size: Thrift −19%

Benchmark
- CPU: Core i7, 2.7 GHz
- Serialization of a service message (media descriptor of a video)
Source: Author testing
33
redis: In-Memory DB
[Diagram: consumer → Redis masters sharded by key range (e.g. Keys={1,2,3}, {3,4,5}, {5,6,7}, {8,9,10}); each master asynchronously replicated to a slave]
● Problem: the speed of a cache combined with query semantics, persistence and the fault tolerance of a DB cluster
● Solution: redis.io – a distributed in-memory DB

Redis
● fast: O(1) access times – 100'000 writes/second, 80'000 reads/second
● fault-tolerant
● datatypes: strings, hashes, lists, sets, sorted sets
● complex queries: intersection, subset, sort, …
● more than just a DB: pub/sub channels
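A brief sketch of these features using the redis-py client (assumed installed, with a local Redis server running); key and channel names are illustrative.

import redis

r = redis.Redis(host="localhost", port=6379)

# O(1) reads and writes on simple keys
r.set("user:1:name", "Test User")
name = r.get("user:1:name")

# Richer datatypes: sets support server-side intersection queries
r.sadd("follows:alice", "bob", "carol")
r.sadd("follows:bob", "carol", "dave")
common = r.sinter("follows:alice", "follows:bob")    # {b'carol'}

# More than just a DB: pub/sub channels, e.g. for notifications
r.publish("notifications:1", "you have a new follower")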
35
redis results
● tumblr
● >7'500 notifications/second (well above MySQL's max. concurrent limit)
● <5 ms response-time requirement
● Redis: 30'000 requests/second
Source: Blake M., Staircar: Redis-powered notifications, 07.2011, Tumblr Engineering Blog
36
Automate everything & Monitor
● If just two engineers
● run 100+ servers
● maintain dozens of databases
● scale a system to 30+ million users
● … automation is like air to breathe …
● … monitoring is the lifeline

[Screenshot: Dashboard @ Twitter]
Source: Adams J., Scaling Twitter, 2010, Chirp Conference
37
Cell Architecture
● Cell Architecture
● Self-contained cells of data + logic
● Each cell itself made up of a cluster of nodes
● Cells provide internal failover
● Reliability
● Scalability
[Diagram: client → discovery service → cell (application server cluster + HBase metadata store); requests routed to cells by consistent hashing on user-id]
Source: Malik P., Scaling the Messages Application Back End, 04.2011, Facebook Engineering Notes
38
Summary
Scalability
● Cache
● Data sharding
● In-memory DB
● Efficient wire protocols

Flexibility
● SOA
● Decoupled
● Layered (outer, inner services)
● Asynchronous (firehose)
● Automation

Reliability
● Replication
● Cell architecture
39
Take Away for Application Development
● Scalability => Distribution
● Loosely coupled components (accessible via APIs, services)
● Efficiency at every level
● Shared nothing

● Reliability => Replication
● Automation
● Monitoring
● Fast provisioning of replicas

● Flexibility => Simplification
● Build for simple use
● Abstract to simplify (e.g. Pig/Hadoop, Redis/in-memory DB)
● API-everything
40
Paradigm Shift?
● New normal
● 100s of machines
● <5 engineers
● Distributed work load
● Horizontal scalability
● PBs of data
● Drivers
● Low barriers to entry – free or low-cost hosting
● Declining cost – CPU, storage, networking
● Web-scale ready open-source software
41
Q & A
Thank you
42
What we haven't covered
● CAP Theorem
● A/B Testing
● NoSQL Databases