Scalable Data Management@facebook Srinivas Narayanan 11/13/09.
Transcript of Scalable Data Management@facebook Srinivas Narayanan 11/13/09.
![Page 1: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/1.jpg)
Scalable Data Management@facebook
Srinivas Narayanan
11/13/09
![Page 2: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/2.jpg)
Scale
![Page 3: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/3.jpg)
#2 site on the Internet (time on site)
>200 billion monthly page views
Over 1 million developers in 180 countries
Over 300 million active users
More than 232 photos…
100 million search queries per day
> 3.9 trillion feed actions processed per day
2 billion pieces of content per week
6 billion minutes per day
![Page 4: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/4.jpg)
Growth Rate
2009
300M Active Users
![Page 5: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/5.jpg)
Social Networks
![Page 6: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/6.jpg)
The social graph links everything
![Page 7: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/7.jpg)
Scaling Social Networks
▪ Much harder than typical websites where...
▪ Typically 1-2% online: easy to cache the data
▪ Partitioning & scaling relatively easy
▪ What do you do when everything is interconnected?
![Page 8: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/8.jpg)
[Figure: a page of interconnected objects, each labeled "name, status, privacy, profile photo" or "name, status, privacy, video thumbnail"]
![Page 9: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/9.jpg)
System Architecture
![Page 10: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/10.jpg)
Architecture
Database (slow, persistent)
Load Balancer (assigns a web server)
Web Server (PHP assembles data)
Memcache (fast, simple)
![Page 11: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/11.jpg)
Memcache
▪ Simple in-memory hash table
▪ Supports get/set, delete, multiget, multiset
▪ Not a write-through cache
▪ Pros and Cons
▪ The Database Shield!
▪ Low latency, very high request rates
▪ Can be easy to corrupt, inefficient for very small items
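A minimal sketch of what "not a write-through cache" can mean for application code. The names `LookAsideCache`, `read`, and `write` are hypothetical, and the delete-on-write step is an assumption about one common look-aside pattern, not a description of Facebook's actual client code. The database is modeled as a plain dict.

```python
from typing import Dict, Optional


class LookAsideCache:
    """Toy stand-in for a memcache client: a simple in-memory hash table."""

    def __init__(self) -> None:
        self._table: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._table.get(key)

    def set(self, key: str, value: str) -> None:
        self._table[key] = value

    def delete(self, key: str) -> None:
        self._table.pop(key, None)


def read(cache: LookAsideCache, db: Dict[str, str], key: str) -> Optional[str]:
    """Try the cache first; on a miss, read the database and fill the cache."""
    value = cache.get(key)
    if value is None:
        value = db.get(key)
        if value is not None:
            cache.set(key, value)
    return value


def write(cache: LookAsideCache, db: Dict[str, str], key: str, value: str) -> None:
    """Write to the database, then invalidate (assumed pattern) the cached copy.

    The cache is never updated by the write path itself, which is what
    distinguishes this from a write-through cache.
    """
    db[key] = value
    cache.delete(key)
```

In this sketch the next `read` after a `write` misses the cache and repopulates it from the database, so the cache only ever shields reads.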
![Page 12: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/12.jpg)
Memcache Optimization
▪ Multithreading and efficient protocol code - 50k req/s
▪ Polling network drivers - 150k req/s
▪ Breaking up stats lock - 200k req/s
▪ Batching packet handling - 250k req/s
▪ Breaking up cache lock - future
![Page 13: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/13.jpg)
Network Incast
Many small get requests
Memcache Memcache Memcache Memcache
Switch
PHP Client
![Page 14: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/14.jpg)
Memcache Memcache Memcache Memcache
Switch
PHP Client
Many big data packets
Network Incast
![Page 15: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/15.jpg)
Memcache Memcache Memcache Memcache
Switch
PHP Client
Network Incast
![Page 16: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/16.jpg)
Memcache Memcache Memcache Memcache
Switch
PHP Client
Network Incast
![Page 17: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/17.jpg)
Memcache Clustering
Many small objects per server
Many small objects per server
Many servers per large object
Many servers per large object
![Page 18: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/18.jpg)
Memcache Clustering
Memcache
10 Objects
PHP Client
![Page 19: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/19.jpg)
Memcache
5 Objects
PHP Client
2 round trips total
1 round trip per server
5 Objects
Memcache
Memcache Clustering
![Page 20: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/20.jpg)
Memcache
3 Objects
PHP Client
3 round trips total
1 round trip per server
4 Objects
Memcache
Memcache
3 Objects
Memcache Clustering
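The clustering slides can be read as: a multiget costs one round trip per server that holds any of the requested keys, so spreading the same 10 objects over more servers means more round trips. A rough sketch of that grouping follows; the CRC32-modulo placement and the function names are assumptions for illustration, not the hashing scheme Facebook used.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def server_for(key: str, num_servers: int) -> int:
    """Pick a server for a key (assumed scheme: CRC32 modulo server count)."""
    return zlib.crc32(key.encode("utf-8")) % num_servers


def plan_multiget(keys: List[str], num_servers: int) -> Dict[int, List[str]]:
    """Group keys by owning server; each group is one multiget round trip."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        batches[server_for(key, num_servers)].append(key)
    return dict(batches)


if __name__ == "__main__":
    keys = [f"object:{i}" for i in range(10)]
    for servers in (1, 2, 3):
        plan = plan_multiget(keys, servers)
        print(f"{servers} server(s): {len(plan)} round trip(s)")
```

With one server the plan always has a single batch; with more servers the number of batches is at most the number of servers, and the per-batch size shrinks.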
![Page 21: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/21.jpg)
Memcache Pool Optimization
▪ Currently a manual process
▪ Replication for obvious hot data sets
▪ Interesting problem: Optimize the allocation based on access patterns
![Page 22: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/22.jpg)
General pool with wide fanout
Shard 1 Shard 2
Specialized Replica 2
Shard 1 Shard 2
Shard 1 Shard 2 Shard 3 Shard n
Specialized Replica 1
...
Vertical Partitioning of Object Types
![Page 23: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/23.jpg)
ScribeScribeScribe
ScribeScribeScribe
ScribeScribeScribe
Thousands of MySQL servers in two datacenters
MySQL has played a role from the beginning
![Page 24: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/24.jpg)
MySQL Usage
•Pretty solid transactional persistent store
•Logical migration of data is difficult
• Logical-Physical db mapping
•Rarely use advanced query features
• Performance
• Database resources are precious
• Web tier CPU is relatively cheap
• Distributed data - no joins!
•Sound administrative model
![Page 25: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/25.jpg)
MySQL is better because it is Open Source
We can enhance or extend the database
▪ ...as we see fit
▪ ...when we see fit
▪ Facebook extended MySQL to support distributed cache invalidation for memcache
INSERT table_foo (a,b,c) VALUES (1,2,3) MEMCACHE_DIRTY key1,key2,...
![Page 26: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/26.jpg)
Scaling across datacenters
West Coast
MySQL replication
SF Web
SF Memcache
SC Memcache
SC Web
SC MySQL
East Coast
VA MySQL
VA Web
VA Memcache
Memcache Proxy
Memcache Proxy
Memcache Proxy
![Page 27: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/27.jpg)
Other Interesting Issues
▪ Application level batching and parallelization
▪ Super hot data items
▪ Cache key versioning with continuous availability
![Page 28: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/28.jpg)
Photos
![Page 29: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/29.jpg)
Photos + Social Graph = Awesome!
![Page 30: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/30.jpg)
Photos: Scale
▪ 20 billion photos x4 = 80 billion
▪ Would wrap around the world more than 10 times!
▪ Over 40M new photos per day
▪ 600K photos / second
![Page 31: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/31.jpg)
Photos Scaling - The easy wins
▪ Upload tier - handles uploads, scales images, stores on NFS
▪ Serving tier: Images served from NFS via HTTP
▪ However...
▪ File systems are not good at supporting a large number of files
▪ Metadata too large to fit in memory, causing too many IOs for each file read
▪ Limited by I/O not storage density
▪ Easy wins
▪ CDN
▪ Cachr (http server + caching)
▪ NFS file handle cache
![Page 32: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/32.jpg)
Photos: Haystack
Overlay file system
Index in memory
One IO per read
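The three Haystack bullets can be illustrated with a toy store: photos are appended to one large file, and an in-memory index maps each photo id to its offset and size, so a read needs a single seek-and-read instead of several file system metadata lookups. This is a sketch of the idea only; the class name, the index layout, and the use of `io.BytesIO` in place of a real file are all assumptions.

```python
import io
from typing import Dict, Tuple


class TinyHaystack:
    """Toy overlay store: one big blob plus an in-memory (offset, size) index."""

    def __init__(self) -> None:
        self._store = io.BytesIO()                    # stands in for one large file
        self._index: Dict[int, Tuple[int, int]] = {}  # photo_id -> (offset, size)
        self.reads = 0                                # number of store reads issued

    def put(self, photo_id: int, data: bytes) -> None:
        """Append the photo bytes and remember where they landed."""
        offset = self._store.seek(0, io.SEEK_END)
        self._store.write(data)
        self._index[photo_id] = (offset, len(data))

    def get(self, photo_id: int) -> bytes:
        """Look up the location in memory, then do exactly one read."""
        offset, size = self._index[photo_id]
        self._store.seek(offset)
        self.reads += 1
        return self._store.read(size)
```

Because the index lookup is a dictionary access, the only I/O per `get` is the one read against the store.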
![Page 33: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/33.jpg)
Data Warehousing
![Page 34: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/34.jpg)
Data: How much?
▪ 200GB per day in March 2008
▪ 2+TB(compressed) raw data per day in April 2009
▪ 4+TB(compressed) raw data per day today
![Page 35: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/35.jpg)
The Data Age
▪ Free or low cost of user services
▪ Consumer behavior hard to predict
▪ Data and analysis are critical
▪ More data beats better algorithms
![Page 36: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/36.jpg)
Deficiencies of existing technologies
▪ Analysis/storage on proprietary systems too expensive
▪ Closed systems are hard to extend
![Page 37: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/37.jpg)
Hadoop & Hive
![Page 38: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/38.jpg)
Hadoop
▪ Superior availability/scalability/manageability despite lower single node performance
▪ Open system
▪ Scalable costs
▪ Cons: Programmability and Metadata
▪ Map-reduce hard to program (users know sql/bash/python/perl)
▪ Need to publish data in well known schemas
![Page 39: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/39.jpg)
Hive
▪ A system for managing and querying structured data built on top of Hadoop
▪ Components
▪ Map-Reduce for execution
▪ HDFS for storage
▪ Metadata in an RDBMS
![Page 40: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/40.jpg)
Hive: New Technology, Familiar Interface
hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
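For readers less familiar with awk, here is the same map and reduce logic as a small Python sketch. It assumes the first field of each `kv1` row is an integer key and that fields are separated by Ctrl-A (`\x01`), as in the awk mapper above; the function names and sample rows are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

FIELD_SEP = "\x01"  # Ctrl-A, matching the -F '\001' in the awk mapper


def mapper(lines: Iterable[str], threshold: int = 100) -> Iterator[str]:
    """Emit the key of every row whose key exceeds the threshold."""
    for line in lines:
        key = line.rstrip("\n").split(FIELD_SEP)[0]
        if int(key) > threshold:  # assumes integer keys
            yield key


def reducer(keys: Iterable[str]) -> Dict[str, int]:
    """Count occurrences per key (the role of `uniq -c` on sorted input)."""
    return dict(Counter(keys))


if __name__ == "__main__":
    rows = ["238\x01val_238", "86\x01val_86", "238\x01val_238", "311\x01val_311"]
    for key, count in sorted(reducer(mapper(rows)).items()):
        print(f"{key}\t{count}")
```

Either way, the single Hive query expresses the filter and the group-by that otherwise take two scripts and a streaming job invocation.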
![Page 41: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/41.jpg)
Hive: Sample Applications
▪ Reporting
▪ E.g.: Daily/Weekly aggregations of impression/click counts
▪ Measures of user engagement
▪ Ad hoc Analysis
▪ E.g.: how many group admins broken down by state/country
▪ Machine Learning (Assembling training data)
▪ Ad Optimization
▪ E.g.: User Engagement as a function of user attributes
▪ Lots More
![Page 42: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/42.jpg)
Hive: Server Infrastructure
▪ 4800 cores, Storage capacity of 5.5 PetaBytes, 12 TB per node
▪ Two level network topology
▪ 1 Gbit/sec from node to rack switch
▪ 4 Gbit/sec to top level rack switch
![Page 43: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/43.jpg)
Hive & Hadoop: Usage Stats
▪ 4 TB of compressed new data added per day
▪ 135TB of compressed data scanned per day
▪ 7500+ Hive jobs per day
▪ 80K compute hours per day
▪ 200 people run jobs on Hadoop/Hive
▪ Analysts (non-engineers) use Hadoop through Hive
▪ 95% of jobs are Hive Jobs
![Page 44: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/44.jpg)
Hive: Technical Overview
![Page 45: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/45.jpg)
Hive: Open and Extensible
▪ Query your own formats and types with your own Serializer/Deserializers
▪ Extend the SQL functionality through User Defined Functions
▪ Do any non-SQL transformations through TRANSFORM operator that sends data from Hive to any user program/script
![Page 46: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/46.jpg)
Hive: Smarter Execution Plans
▪ Map-side Joins
▪ Predicate Pushdown
▪ Partition Pruning
▪ Hash based Aggregations
▪ Parallel execution of operator trees
▪ Intelligent Scheduling
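As one example from the list above, a map-side join avoids the shuffle by loading the small table into memory on every mapper and probing it while streaming the large table. The sketch below shows only the general technique; the function name, the tuple layout, and the assumption that the small table has unique keys are illustrative and not taken from Hive's implementation.

```python
from typing import Dict, Iterable, Iterator, Tuple


def map_side_join(
    big_rows: Iterable[Tuple[str, str]],
    small_table: Dict[str, str],
) -> Iterator[Tuple[str, str, str]]:
    """Inner-join a streamed large input against an in-memory small table.

    big_rows yields (join_key, payload) pairs; small_table maps join_key to
    the value from the small side (assumed to have unique keys).
    """
    for key, payload in big_rows:
        match = small_table.get(key)
        if match is not None:
            yield key, payload, match


if __name__ == "__main__":
    clicks = [("u1", "ad7"), ("u2", "ad9"), ("u3", "ad7")]
    countries = {"u1": "US", "u3": "IN"}
    for row in map_side_join(clicks, countries):
        print(row)
```

Each mapper can run this independently on its own slice of the large input, which is why no reduce phase is needed for the join itself.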
![Page 47: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/47.jpg)
Hive: Possible Future Optimizations
▪ Pipelining?
▪ Finer operator control (controlling sorts)
▪ Cost based optimizations?
▪ HBase
![Page 48: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/48.jpg)
Spikes: The Username Launch
![Page 49: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/49.jpg)
System Design
▪ Database tier cannot handle the load
▪ Dedicated memcache tier for assigned usernames
▪ Miss => Available
▪ Avoid database hits altogether
▪ Blacklists: bucketize, local tier cache
▪
▪ timeout
![Page 50: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/50.jpg)
Username Memcache Tier
▪ Parallel pool in each data center
▪ Writes replicated to all nodes
▪ 8 nodes per pool
▪ Reads can go to any node (hashed by uid)
...UN0 UN1 UN7
PHP Client
Username Memcache
![Page 51: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/51.jpg)
Write Optimization
▪ Hashout store
▪ Distributed key-value store (MySQL backed)
▪ Lockless (optimistic) concurrency control
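The slide gives no detail on how the Hashout store implements lockless concurrency control, so the following is only a generic sketch of optimistic concurrency: each row carries a version, and a write succeeds only if the version is unchanged since it was read. `VersionedStore`, `write_if_version`, and `claim_username` are hypothetical names.

```python
from typing import Dict, Optional, Tuple


class VersionedStore:
    """Toy key-value store where every row carries a version number."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[int, str]] = {}

    def read(self, key: str) -> Tuple[int, Optional[str]]:
        """Return (version, value); a missing key reads as version 0, no value."""
        return self._rows.get(key, (0, None))

    def write_if_version(self, key: str, expected_version: int, value: str) -> bool:
        """Write only if nobody else has written since expected_version was read."""
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            return False  # lost the race; the caller decides whether to retry
        self._rows[key] = (current_version + 1, value)
        return True


def claim_username(store: VersionedStore, username: str, uid: str) -> bool:
    """Claim a username without taking a lock; fails if it is already owned."""
    version, owner = store.read(username)
    if owner is not None:
        return False
    return store.write_if_version(username, version, uid)
```

No lock is held between the read and the write; a concurrent claim simply causes the conditional write to fail.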
![Page 52: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/52.jpg)
Fault Tolerance
▪ Memcache nodes can go down
▪ Always check another node on miss
▪ Replay from a log file (scribe)
▪ Memcache sets are not guaranteed to succeed
▪ Self-correcting code: write again to mc if we detect it during db writes
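A small sketch tying together the username tier and its fault-tolerance rules: writes go to every node in the pool, reads are hashed by uid to one node, and a miss is confirmed against a second node before the name is treated as available. The class and method names are hypothetical, and picking the next node in sequence as the "other node" is an assumption; the slides do not say which node is checked.

```python
import zlib
from typing import Dict, List


class UsernamePool:
    """Toy replicated memcache pool for assigned usernames (8 nodes by default)."""

    def __init__(self, num_nodes: int = 8) -> None:
        self.nodes: List[Dict[str, str]] = [dict() for _ in range(num_nodes)]

    def node_for(self, uid: str) -> int:
        """Hash the reader's uid to choose which node serves the read."""
        return zlib.crc32(uid.encode("utf-8")) % len(self.nodes)

    def assign(self, username: str, owner_uid: str) -> None:
        """Replicate the write to every node in the pool."""
        for node in self.nodes:
            node[username] = owner_uid

    def wipe_node(self, index: int) -> None:
        """Simulate a node that went down and came back empty."""
        self.nodes[index].clear()

    def is_taken(self, username: str, reader_uid: str) -> bool:
        """A hit on either node means taken; a miss on both means available."""
        first = self.node_for(reader_uid)
        second = (first + 1) % len(self.nodes)  # assumed choice of fallback node
        return any(username in self.nodes[i] for i in (first, second))
```

Because "miss means available", a node that lost its contents would wrongly report taken names as free; checking a second replica on a miss narrows that window without touching the database.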
![Page 53: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/53.jpg)
Nuclear Options
▪ Newsfeed
▪ Reduce number of stories
▪ Turn off scrolling, highlights
▪ Profile
▪ Reduce number of stories
▪ Make info tab the default
▪ Chat
▪ Reduce buddy list refresh rate
▪ Turn it off!
![Page 54: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/54.jpg)
How much load?
▪ 200k in 3 min
▪ 1M in 1 hour
▪ 50M in first month
▪ Prepared for over 10x!
![Page 55: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/55.jpg)
Some interesting problems
![Page 56: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/56.jpg)
Some interesting problems
▪ Graph models and languages
▪ Low latency fast access
▪ Slightly more expressive queries
▪ Consistency, Staleness can be a bit loose
▪ Analysis over large data sets
▪ Privacy as part of the model
▪ Fat data pipes
▪ Push enormous volumes of data to several third party applications (E.g., entire newsfeed to search partners).
▪ Controllable QoS
![Page 57: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/57.jpg)
Some interesting problems (contd.)
▪ Search relevance
▪ Storage systems
▪ Middle tier (cache) optimization
▪ Application data access language
![Page 58: Scalable Data Management@facebook Srinivas Narayanan 11/13/09.](https://reader035.fdocuments.in/reader035/viewer/2022081514/55167863550346a2698b5a82/html5/thumbnails/58.jpg)
Questions?