![Page 1: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/1.jpg)
Data Storage Infrastructure at Facebook
Spring 2018 Cleveland State University
CIS 601 Presentation
Yi Dong
Instructor: Dr. Chung
![Page 2: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/2.jpg)
Outline
● Strategy of data storage, processing, and log collection
● Data flow from the source to the data warehouse
● Storage systems and optimization
● Data discovery and analysis
● Challenges in resource sharing
![Page 3: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/3.jpg)
Facebook’s Architecture
![Page 4: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/4.jpg)
Facebook’s Architecture
● PHP
● HipHop compiler
● Scribe
● Thrift
● Hadoop
● HBase
● Haystack
● Hive
● MySQL
● Memcached
![Page 5: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/5.jpg)
Part 1: Strategy for Data Storage, Processing, Log Collection
● Apache Hadoop
● Apache Hive
● Scribe
![Page 6: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/6.jpg)
Hadoop, Why?
● Scalability
– Able to process multi-petabyte datasets
● Fault Tolerance
– Node failure is expected every day
– Number of nodes is not constant
● High Availability
– Users can access data from the nearest node
● Cost Efficiency
– Open source
– Uses commodity hardware for the nodes in Hadoop clusters
– Eliminates dependency on any particular technology
![Page 7: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/7.jpg)
Hadoop Architecture
● HDFS (Hadoop Distributed File System)
● Map-Reduce Infrastructure
![Page 8: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/8.jpg)
Hive
● SQL-like analysis tool (HiveQL) on top of Hadoop
● Dramatically improves the productivity and usability of Hadoop
– With Hive, users without programming experience can use Hadoop for their work
– Without Hive, a basic Hadoop data manipulation such as GROUP BY can take more than 100 lines of Java/Python code
– Even worse, a programmer without database knowledge will likely use a sub-optimal algorithm, often a badly sub-optimal one
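To illustrate why GROUP BY is costly to hand-write, here is a minimal in-memory sketch of the map, shuffle, and reduce steps behind a grouped count. It is not Hadoop code: the record field `country`, the table name `page_views`, and all function names are hypothetical. A real Hadoop job would also need job configuration, serialization, and I/O handling, which is where most of the extra lines come from.

```python
from collections import defaultdict

# Rough HiveQL equivalent (hypothetical table and column names):
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;

def map_phase(records):
    """Emit a (key, 1) pair for every input record."""
    for record in records:
        yield record["country"], 1

def shuffle(pairs):
    """Collect all values that share a key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values of each key to get the per-group count."""
    return {key: sum(values) for key, values in groups.items()}

def group_by_count(records):
    """Run the three phases in sequence on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))
```

With Hive, the single query in the comment replaces all three hand-written phases, and the query planner chooses the execution strategy.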
![Page 9: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/9.jpg)
Hive Architecture
![Page 10: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/10.jpg)
Scribe – Scalable Logging System
● Distributed and scalable logging system
● Combined with HDFS
● Aggregate logs from thousands of web servers
![Page 11: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/11.jpg)
Part 2: Data Flow Architecture
● Two Sources of Data
– Web Server
    ● Log data
    ● Copy every 5-15 minutes
– Federated MySQL
    ● Information data
    ● Copy daily
● Two different clusters
– Production Hive-Hadoop cluster
– Ad-hoc Hive-Hadoop cluster
![Page 12: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/12.jpg)
Deal with Data Delivery Latency
● Even though log data is copied at 5-15 minute intervals, the loader only loads data into Hive native tables at the end of the day
● Solution at Facebook:
– Use Hive’s external table feature to create table metadata on the raw HDFS files
– After the data is loaded into the Hive native table at the end of the day, remove the raw HDFS files from the external table
– New solutions are needed to enable continuous loading of log data
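The external-table workaround can be sketched as a small helper that builds the HiveQL statement. `CREATE EXTERNAL TABLE ... LOCATION` is standard HiveQL, but the helper name, table name, columns, and HDFS path below are hypothetical, and a real statement would usually also declare the row format of the raw files.

```python
def external_table_ddl(table, columns, hdfs_dir):
    """Build a HiveQL statement that layers table metadata over raw HDFS files.

    columns is a list of (name, hive_type) pairs. No data is copied: Hive only
    records the schema and the directory in which the raw files live.
    """
    column_spec = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({column_spec}) "
        f"LOCATION '{hdfs_dir}'"
    )

# Hypothetical example: expose today's raw click logs for querying.
ddl = external_table_ddl(
    "raw_clicks_today",
    [("user_id", "BIGINT"), ("url", "STRING"), ("ts", "STRING")],
    "/logs/clicks/current",
)
```

Because the table is external, dropping it removes only the metadata and leaves the files in place, so the raw files can be cleaned up separately once the end-of-day load into the native table has finished.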
![Page 13: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/13.jpg)
Part 3: Storage Optimization
● All data needs to be compressed to save space
– Hadoop allows user-specified codecs; Facebook uses the gzip codec to get a compression factor of 6-7
● HDFS by default keeps 3 copies of data to prevent data loss
– Using erasure codes, with 2 copies of the data and 2 copies of the error correction code, this multiple can be brought down to 2.2
– Hadoop RAID is used on older data sets, while newer data sets stay replicated 3 ways
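The arithmetic behind these numbers can be sketched as follows. The slide does not say how 2 data copies plus 2 parity copies add up to 2.2; one reading consistent with that figure is that each parity copy is about a tenth of the data size, for example one parity block per 10-block stripe. That stripe length is an assumption here, as are the function names.

```python
def effective_replication(data_copies, parity_copies, stripe_length, parity_blocks=1):
    """Storage multiple per byte of (compressed) data.

    Assumes parity_blocks of parity are stored per stripe of stripe_length
    data blocks; the default of one parity block is an assumption.
    """
    parity_fraction = parity_blocks / stripe_length
    return data_copies + parity_copies * parity_fraction

def physical_bytes(logical_bytes, compression_factor, replication):
    """Disk space needed after compression and replication."""
    return logical_bytes / compression_factor * replication

# Plain 3-way replication versus 2 data copies + 2 parity copies (assumed stripe of 10).
plain = effective_replication(3, 0, stripe_length=10)
raided = effective_replication(2, 2, stripe_length=10)
```

Under these assumptions `plain` is 3 and `raided` is about 2.2. Combined with a gzip factor of 6, `physical_bytes(600, 6, plain)` gives 300 units of disk for 600 units of raw data.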
![Page 14: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/14.jpg)
Part 3: Storage Optimization
● Reduce the memory usage of the HDFS NameNode
– Trade off latency to reduce memory pressure
– Implement a file format that reduces the number of map tasks
● Data federation
– Distribute data based on time
    ● Queries spanning a time boundary will need more joins
– Distribute data based on application
    ● Some common data has to be replicated
![Page 15: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/15.jpg)
Part 4: Data Discovery and Analysis
● Hive
– Provides immense scalability to non-engineering users such as business analysts and product managers
● Data discovery
– Internal tools enable a wiki approach to metadata creation
– Tools extract lineage information from the query log
● Periodic Batch Jobs
– For such jobs, inter-job dependencies and the ability to schedule them are critical
![Page 16: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/16.jpg)
Part 5: Resource Sharing
● Support the co-existence of interactive jobs and batch jobs on the same Hadoop cluster
– Implement Hadoop Fair Share Scheduler
– Isolate ad-hoc queries and periodic batch queries
– Enhance the scheduler to make it more aware of system resource usage caused by poorly written ad-hoc queries
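The idea of fair sharing can be illustrated with a simplified max-min fairness calculation. This is not the Hadoop Fair Share Scheduler's actual implementation, only a sketch of the principle: small jobs get everything they ask for, and the remaining capacity is split evenly among the larger ones. The pool names and slot counts are hypothetical.

```python
def max_min_fair_share(total_slots, demands):
    """Split total_slots among pools so that no pool gets more than it demands
    and leftover capacity is shared equally among pools that still want more."""
    allocation = {pool: 0.0 for pool in demands}
    remaining = float(total_slots)
    unsatisfied = {pool for pool, demand in demands.items() if demand > 0}

    while unsatisfied and remaining > 1e-9:
        equal_share = remaining / len(unsatisfied)
        # Pools whose outstanding demand fits within an equal share are fully served.
        fits = {p for p in unsatisfied if demands[p] - allocation[p] <= equal_share}
        if not fits:
            # Every remaining pool wants more than an equal share: split evenly and stop.
            for pool in unsatisfied:
                allocation[pool] += equal_share
            remaining = 0.0
            break
        for pool in fits:
            need = demands[pool] - allocation[pool]
            allocation[pool] += need
            remaining -= need
        unsatisfied -= fits
    return allocation

# Hypothetical cluster: 100 slots shared by three pools.
shares = max_min_fair_share(100, {"adhoc": 80, "batch": 30, "small": 10})
```

Here the `small` and `batch` pools receive their full demand (10 and 30 slots), and the large ad-hoc pool is capped at the remaining 60 slots instead of starving the others.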
![Page 17: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/17.jpg)
Take Home Message
● For a data warehouse design
– What kind of data source, flow architecture
– What kind of storage architecture
– What kind of user, what kind of task
– How to make usage easier
– How to share the resource between jobs
![Page 18: Data Storage Infrastructure at Facebook - eecs.csuohio.edueecs.csuohio.edu/~sschung/CIS601/FaceBookBigDataInfrastructure_Yi.… · Data Storage Infrastructure at Facebook Spring 2018](https://reader031.fdocuments.in/reader031/viewer/2022030908/5b51346a7f8b9a056a8bcc4f/html5/thumbnails/18.jpg)
End
Thank you