Lecture 3 – Foundations for Big Data Systems and Programming - Computer...
CS 626 Large Scale Data Science
Jun Zhang
Originally prepared by Dr. Licong Cui
Department of Computer Science
University of Kentucky
January 28, 2020
Lecture 3 – Foundations for Big Data Systems and Programming
Review: Five P’s of Data Science
People
Purpose
Process
Platforms
Programmability
Review: Steps in the Data Science Process
Acquire Prepare Analyze Report Act
Big Data Engineering Computational Big Data Science
Basic Scalable Computing Concepts
Distributed File Systems
Scalable Computing over the Internet
Programming Models for Big Data
Traditional File System
64GB 256GB 512GB
1TB
Copy data to an external hard drive?
Buy a bigger disk?
Work Personal
Distribute data on multiple computers?
Store Data in Server?
Cluster of Machines
Distributed File System (DFS)
File system spreads over multiple, autonomous computers
Distributed File System
Rack
Data Replication
[Figure: data split into partitions 1 to 5, with copies of each partition stored on the nodes of a rack]
Fault Tolerance
[Figure: partitions 1 to 5 replicated across the nodes of a rack, illustrating fault tolerance]
High Concurrency
[Figure: Reader 1, Reader 2, and Reader 3 accessing replicated partitions 1 to 5 on a rack]
Distributed File System
Rack
Data Partitioning
Data Replication
Data Scalability
Fault Tolerance
High Concurrency
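The first two properties above, partitioning and replication, can be sketched in a few lines of Python. This is a minimal illustration, not how any particular distributed file system places data: the block size, the node names, and the round-robin placement rule are all assumptions made for the example.

```python
from typing import Dict, List


def partition(data: bytes, block_size: int) -> List[bytes]:
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (illustrative policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# Hypothetical data and node names, chosen only for the example.
data = b"abcdefghijklmnopqrstuvwxyz"
blocks = partition(data, block_size=6)
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on several nodes, a reader can be served by any holder of that block, which is what makes the fault tolerance and high concurrency listed above possible.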
Scalable Computing Over the Internet
Single compute node
Parallel Computer
Expensive
Commodity Cluster
Affordable
Less-specialized
Distributed computing over the Internet
Reduced computing cost
Architecture of a Commodity Cluster
Network
Rack
Distributed Computing
[Figure: several racks connected by a network]
• Enables data-parallelism
• Move computation to data
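A small Python sketch of "move computation to data": each node runs the same function on the partition it already holds and ships back only a small partial result. The partitions and node names are hypothetical, and a thread pool stands in for the separate machines of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions, one per node; in a real cluster each list
# would sit on a different machine's local disk.
partitions = {
    "node1": [3, 1, 4],
    "node2": [1, 5, 9],
    "node3": [2, 6, 5],
}


def local_sum(values):
    """Runs where the partition lives and returns only a small result."""
    return sum(values)


# The same function is applied to every partition in parallel (data-parallelism).
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(local_sum, partitions.values()))

# Only the partial sums cross the "network"; the raw data never moves.
total = sum(partial_sums)
print(partial_sums, total)
```

The design point is that three integers travel to the coordinator instead of nine, and the gap widens as partitions grow.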
Programming Models for Big Data
Runtime Libraries Programming Languages
Programming Model = abstractions
Distributed File System
Rack
Infrastructure:
Requirements for Big Data Programming Models
1. Support Big Data Operations
o Split large volumes of data
o Access data fast
o Distribute computations to nodes
Requirements for Big Data Programming Models (cont.)
2. Handle Fault Tolerance
o Replicate data partitions
o Recover files when needed
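One way to picture recovery is as a two-step routine: drop the replicas that lived on failed nodes, then copy each under-replicated partition to another live node. The sketch below assumes a hypothetical placement table and node names; real systems use more elaborate policies (for example, rack awareness), which are not modeled here.

```python
from typing import Dict, List, Set


def surviving_replicas(placement: Dict[int, List[str]], failed: Set[str]) -> Dict[int, List[str]]:
    """Remove replicas that were stored on failed nodes."""
    return {b: [n for n in holders if n not in failed] for b, holders in placement.items()}


def re_replicate(placement: Dict[int, List[str]], live_nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Top up each partition to `replication` copies using live nodes that lack it."""
    repaired = {}
    for block, holders in placement.items():
        if not holders:
            raise RuntimeError(f"partition {block} lost: no surviving replica")
        candidates = [n for n in live_nodes if n not in holders]
        missing = max(0, replication - len(holders))
        repaired[block] = holders + candidates[:missing]
    return repaired


# Hypothetical cluster state: three partitions, two replicas each.
placement = {
    1: ["node1", "node2"],
    2: ["node2", "node3"],
    3: ["node3", "node1"],
}
failed = {"node2"}
live = ["node1", "node3", "node4"]

after_failure = surviving_replicas(placement, failed)
repaired = re_replicate(after_failure, live, replication=2)
print(after_failure)
print(repaired)
```

As long as at least one replica of a partition survives, the data can be read and the lost copies rebuilt; a partition is only lost when every node holding it fails at once.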
Requirements for Big Data Programming Models (cont.)
3. Enable Adding More Racks
[Figure: data partitions 1 to 5 stored on a rack, with additional racks added to scale out]
Requirements for Big Data Programming Models (cont.)
4. Optimized for specific data types
Document Table
Key-value Graph
Multimedia Stream
Example – Suits of Cards
Key Programming Model
MapReduce
A programming model for Big Data
Many implementations
Programming Model = abstractions
Runtime Libraries Programming Languages
Support large data volumes
Provide fault tolerance
Enable scale out
MapReduce
What is MapReduce?
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of:
• a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name)
• a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)
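The student example can be written out as a single-machine Python sketch. It only imitates the shape of the model (map, group by key, reduce) and is not the API of Hadoop or any other MapReduce implementation; the student names and the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for illustration.

```python
from collections import defaultdict


def map_phase(students):
    """Emit a (first_name, 1) pair for every student."""
    for full_name in students:
        first_name = full_name.split()[0]
        yield first_name, 1


def shuffle(pairs):
    """Group values by key: one 'queue' per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues


def reduce_phase(queues):
    """Summarize each queue by counting its entries."""
    return {name: sum(values) for name, values in queues.items()}


# Hypothetical class roster.
students = ["Ana Lee", "Bo Chen", "Ana Diaz", "Cy Park", "Bo Ito"]
name_counts = reduce_phase(shuffle(map_phase(students)))
print(name_counts)  # {'Ana': 2, 'Bo': 2, 'Cy': 1}
```

In a real cluster the map calls run in parallel on the nodes holding the input, the grouping step moves pairs across the network, and each reducer handles a subset of the keys.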
Google Distributed System
Google File System Architecture
Google MapReduce
Hadoop Ecosystem - History
Yahoo! released Hadoop in 2005
More open-source projects
2004
Hadoop Ecosystem – Layer Diagram
What is Hadoop?
Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.
Goals/Requirements
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• High scalability and availability
• Use commodity hardware (cheap!)
• Fault-tolerance
• Move computation to data
Hadoop Architecture
Hadoop Architecture (cont.)
HDFS: Name Node, Data Node
MapReduce: Job Tracker, Task Tracker
Reminder: Downloading and Installing Hadoop
Download and Install VirtualBox: https://www.virtualbox.org/wiki/Downloads
Download and Install Cloudera QuickStart VM: https://www.cloudera.com/downloads/quickstart_vms/5-13.html
Launch the Cloudera VM
Reminder: Import Appliance in VirtualBox
Reminder: Amazon Web Services (AWS) Educate Sign Up
AWS Educate: https://aws.amazon.com/education/awseducate/