Big Data on AWS in Korea by Abhishek Sinha (Lunch and Learn)
Uploaded by: amazon-web-services-korea
![Page 2: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/2.jpg)
![Page 3: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/3.jpg)
An engineer’s definition
When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share it
![Page 4: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/4.jpg)
What does big data look like?
![Page 5: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/5.jpg)
Volume
Velocity
Variety
3Vs
![Page 6: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/6.jpg)
Where is this data coming from?
![Page 7: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/7.jpg)
Human generated
Tweet
Surf the internet
Buy and sell products
Upload images and videos
Play games
Check in at restaurants
Search for cafes
Find deals
Watch content online
Look for directions
Use social media
![Page 8: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/8.jpg)
Machine generated
Networks and security devices
Mobile phones
Cell phone towers
Smart grids
Smart meters
Telematics from cars
Sensors on machines
Videos from traffic and security cameras
![Page 9: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/9.jpg)
What are people using this for?
![Page 10: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/10.jpg)
Big Data Verticals and Use cases
Media/Advertising: Targeted Advertising; Image and Video Processing
Oil & Gas: Seismic Analysis
Retail: Recommendations; Transactions Analysis
Life Sciences: Genome Analysis
Financial Services: Monte Carlo Simulations; Risk Analysis
Security: Anti-virus; Fraud Detection; Image Recognition
Social Network/Gaming: User Demographics; Usage analysis; In-game metrics
![Page 11: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/11.jpg)
Why is big data hard?
![Page 14: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/14.jpg)
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Highly constrained
Lower cost, higher throughput
![Page 15: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/15.jpg)
Big gap in turning data into actionable information
![Page 16: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/16.jpg)
Amazon Web Services helps remove constraints
![Page 17: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/17.jpg)
Big Data + Cloud = Awesome Combination
Big data:
• Potentially massive datasets
• Iterative, experimental style of data manipulation and analysis
• Frequently not a steady-state workload; peaks and valleys
• Data is a combination of structured and unstructured data in many formats
AWS Cloud:
• Massive, virtually unlimited capacity
• Iterative, experimental style of infrastructure deployment/usage
• At its most efficient with highly variable workloads
• Tools for managing structured and unstructured data
![Page 18: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/18.jpg)
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
![Page 19: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/19.jpg)
![Page 20: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/20.jpg)
Data size
• Global reach
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data
![Page 21: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/21.jpg)
Stack
Application Stack
• Scala/Liftweb: API Machines, WWW Machines, Batch Jobs
• Scala Application code
• Mongo/Postgres/Flat Files: Databases, Logs
Data Stack
• Amazon S3: Database Dumps, Log Files
• Hadoop: Elastic Map Reduce
• Hive/Ruby/Mahout: Analytics Dashboard, Map Reduce Jobs
• mongoexport, postgres dump, Flume
![Page 22: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/22.jpg)
Stack – Front end Application
![Page 23: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/23.jpg)
Stack – Collection and Storage
![Page 24: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/24.jpg)
Stack – analysis and sharing
![Page 25: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/25.jpg)
Users Over Time
![Page 26: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/26.jpg)
“Who is using our service?”
![Page 27: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/27.jpg)
Identified early mobile usage
Invested heavily in mobile development
Finding signal in the noise of logs
![Page 28: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/28.jpg)
In January 2013:
9,432,061 unique mobile devices used the Yelp mobile app.
4 million+ calls. 5 million+ directions.
![Page 29: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/29.jpg)
Autocomplete Search
Recommendations
Automatic spelling corrections
![Page 30: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/30.jpg)
“What kind of movies do people like?”
![Page 31: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/31.jpg)
More than 25 Million Streaming Members
50 Billion Events Per Day
30 Million plays every day
2 billion hours of video in 3 months
4 million ratings per day
3 million searches
Device location, time, day, week, etc.
Social data
![Page 32: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/32.jpg)
![Page 33: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/33.jpg)
![Page 34: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/34.jpg)
10 TB of streaming data per day
![Page 35: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/35.jpg)
Data consumed in multiple ways
S3
EMR
Prod Cluster (EMR)
Recommendation Engine
Ad-hoc Analysis
Personalization
![Page 36: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/36.jpg)
AWS Import/Export
Corporate data center
Amazon Elastic MapReduce
Amazon Simple Storage Service (S3)
BI Users
Clickstream data from 500+ websites and VoD platform
![Page 37: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/37.jpg)
“Who buys video games?”
![Page 38: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/38.jpg)
Who is Razorfish
• Full service Digital Agency
• Developed an Ad-Serving Platform compatible with most browsers
• Clickstream analysis of data, current and historical trends, and segmentation of users
• Segmentation is used to serve ads and cross-sell
• 45 TB of log data
• Problems at scale
– Giant Datasets
– Building Infrastructure requires large continuous investment
– Build for peak holiday season
– Traditional Data stores are not scaling
![Page 39: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/39.jpg)
Per day:
3.5 billion records
13 TB of click stream logs
71 million unique cookies
![Page 40: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/40.jpg)
Previously in 2009
![Page 41: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/41.jpg)
Today
![Page 42: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/42.jpg)
Today
![Page 43: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/43.jpg)
![Page 44: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/44.jpg)
![Page 45: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/45.jpg)
![Page 46: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/46.jpg)
![Page 47: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/47.jpg)
![Page 48: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/48.jpg)
![Page 49: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/49.jpg)
![Page 50: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/50.jpg)
This happens in 8 hours every day
![Page 51: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/51.jpg)
Why AWS + EMR
• Perfect Clarity of Cost
• No upfront infrastructure investment
• No client processing contention
• Without EMR/Hadoop it takes 3 days; with EMR, 8 hours
– Scalability: 1 node x 100 hours = 100 nodes x 1 hour
• Meet SLA
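The scalability bullet above is simple node-hour arithmetic. The sketch below assumes a perfectly parallel job and a flat price per node-hour; `PRICE` is a made-up placeholder, not an AWS rate, and `cluster_cost` is a hypothetical helper.

```python
def cluster_cost(nodes, hours, price_per_node_hour):
    """Total cost of running `nodes` machines for `hours` hours each."""
    return nodes * hours * price_per_node_hour

# Hypothetical flat price per node-hour, for illustration only.
PRICE = 0.50

one_slow = cluster_cost(nodes=1, hours=100, price_per_node_hour=PRICE)
many_fast = cluster_cost(nodes=100, hours=1, price_per_node_hour=PRICE)

# Same number of node-hours, so the same cost; only wall-clock time differs.
print(one_slow, many_fast)
```

Under these assumptions the bill is identical, which is why a larger cluster can be used to shorten the job without paying more.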
![Page 52: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/52.jpg)
Playfish improves in-game experience for its users
through data mining
Challenge: Must understand player usage trends across 50M monthly users, multiple platforms, and tens of games, in the face of rapid growth. This both drives in-game improvements and defines which games to target next.
Solution: EMR provides Playfish the flexibility to experiment and rapidly ask new questions. All usage data is stored in S3, and analysts run ad-hoc Hive queries that can slice the data by time, game, and user.
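To illustrate what slicing usage data by time, game, and user means, here is a minimal in-memory sketch in Python. The event tuples, field names, and the `slice_by` helper are all hypothetical; a real deployment would express the same grouping as a Hive query over data in S3.

```python
from collections import Counter

# Hypothetical usage events: (day, game, user).
EVENTS = [
    ("2013-01-01", "game_a", "u1"),
    ("2013-01-01", "game_a", "u2"),
    ("2013-01-01", "game_b", "u1"),
    ("2013-01-02", "game_a", "u1"),
]

# Position of each dimension inside an event tuple.
FIELDS = {"day": 0, "game": 1, "user": 2}

def slice_by(events, *dims):
    """Count events grouped by the chosen dimensions."""
    idx = [FIELDS[d] for d in dims]
    return Counter(tuple(e[i] for i in idx) for e in events)

by_game = slice_by(EVENTS, "game")
by_day_game = slice_by(EVENTS, "day", "game")
print(by_game)
print(by_day_game)
```

Because the dimensions are passed as arguments, a new question (for example, events per user per day) needs no new code, which is the kind of flexibility the slide attributes to ad-hoc queries.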
![Page 53: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/53.jpg)
![Page 54: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/54.jpg)
![Page 55: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/55.jpg)
Data Driven Game Design
Data is being used to understand what gamers are doing inside the game (behavioral analysis)
- What features people like (rely on data instead of forum posts)
- What features are abandoned
- A/B testing
- Monetization – In Game Analytics
![Page 56: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/56.jpg)
Building a big data architecture
Design Patterns
![Page 57: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/57.jpg)
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
![Page 59: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/59.jpg)
1. Getting your Data into AWS
Corporate Data Center → Amazon S3
• Console Upload
• FTP
• AWS Import Export
• S3 API
• Direct Connect
• Storage Gateway
• 3rd Party Commercial Apps
• Tsunami UDP
![Page 60: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/60.jpg)
2. Write directly to a data source
Your application (Amazon EC2) → Amazon S3, DynamoDB, any other data store
![Page 61: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/61.jpg)
3. Queue, pre-process and then write to data source
Amazon Simple Queue Service (SQS) → Amazon S3, DynamoDB, any other data store
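The queue-then-write pattern can be sketched as follows. This is a local illustration only: Python's in-process `queue.Queue` stands in for Amazon SQS, a plain list stands in for the data store, and the event fields and the `preprocess` and `drain` helpers are hypothetical.

```python
import json
import queue

def preprocess(raw_event):
    """Parse one raw JSON event and keep only the fields we want to store."""
    event = json.loads(raw_event)
    return {"user": event["user"], "action": event["action"].lower()}

def drain(work_queue, data_store):
    """Pull every queued message, pre-process it, and append it to the store."""
    while True:
        try:
            raw = work_queue.get_nowait()
        except queue.Empty:
            break
        data_store.append(preprocess(raw))

# The application only enqueues; a separate worker drains the queue later.
work_queue = queue.Queue()
work_queue.put('{"user": "alice", "action": "CHECKIN", "debug": true}')
work_queue.put('{"user": "bob", "action": "Search"}')

data_store = []  # stand-in for S3, DynamoDB, or any other data store
drain(work_queue, data_store)
print(data_store)
```

Decoupling the producer from the writer lets the application absorb bursts without waiting on the data store.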
![Page 62: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/62.jpg)
Agency Customer: Video Analytics on AWS
Elastic Load Balancer
Edge Servers on EC2
Workers on EC2
Logs, Reports
HDFS Cluster
Amazon Simple Queue Service (SQS)
Amazon Simple Storage Service (S3)
Amazon Elastic MapReduce
![Page 63: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/63.jpg)
4. Aggregate and write to data source
Flume running on EC2 → Amazon S3, HDFS, any other data store
![Page 64: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/64.jpg)
What is Flume
• Collection and aggregation of streaming event data
– Typically used for log data, sensor data, GPS data, etc.
• Significant advantages over ad-hoc solutions
– Reliable, Scalable, Manageable, Customizable and High Performance
– Declarative, Dynamic Configuration
– Contextual Routing
– Feature rich
– Fully extensible
![Page 65: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/65.jpg)
Typical Aggregation Flow
[Client]+ → Agent → [Agent]* → Destination
Flume uses a multi-tier approach in which multiple agents can send data to another agent that acts as an aggregator. Each agent can receive data from a client or from another agent, and can send it on to another agent or to a sink.
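The multi-tier flow described above can be modelled in a few lines of Python. This is a conceptual sketch, not Flume's API: the `Agent` and `Sink` classes and the agent names are invented for illustration.

```python
class Sink:
    """Final destination; here it just collects events in a list."""

    def __init__(self):
        self.events = []

    def receive(self, event):
        self.events.append(event)


class Agent:
    """Receives events from clients or other agents and forwards them on."""

    def __init__(self, name, next_hop):
        self.name = name
        self.next_hop = next_hop  # another Agent or a Sink

    def receive(self, event):
        # Record the hop so the route taken is visible at the destination.
        path = event.get("path", []) + [self.name]
        self.next_hop.receive({**event, "path": path})


destination = Sink()
aggregator = Agent("aggregator", destination)
edge_a = Agent("edge-a", aggregator)
edge_b = Agent("edge-b", aggregator)

# Two clients send to different first-tier agents; both routes converge.
edge_a.receive({"msg": "GET /index"})
edge_b.receive({"msg": "GET /deals"})
print(destination.events)
```

Because an agent's next hop can be either another agent or a sink, tiers can be added or removed without changing the clients.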
![Page 66: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/66.jpg)
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3 as a “single source of truth”
S3
![Page 67: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/67.jpg)
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log Aggregation tools
Choose depending upon design
![Page 68: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/68.jpg)
Choice of storage systems (Structure and Volume)
[Chart: storage systems positioned by Structure (Low to High) and Size (Small to Large): S3, RDS, DynamoDB, NoSQL, EBS]
![Page 69: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/69.jpg)
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
![Page 70: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/70.jpg)
Hadoop based Analysis
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log Aggregation tools
Amazon EMR
![Page 71: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/71.jpg)
EMR is Hadoop in the Cloud
What is Amazon Elastic MapReduce (EMR)?
![Page 72: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/72.jpg)
A framework that:
• Splits data into pieces
• Lets processing occur
• Gathers the results
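The three steps above (split, process, gather) can be sketched on a single machine with Python's standard library. This is only an analogy: a thread pool stands in for a cluster, and the `split`, `process`, and `gather` names are illustrative rather than part of Hadoop or EMR.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_pieces):
    """Split a list into n_pieces roughly equal, contiguous chunks."""
    size = -(-len(data) // n_pieces)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def process(chunk):
    """Work done independently on each piece; here, a partial sum."""
    return sum(chunk)

def gather(partials):
    """Combine the partial results into the final answer."""
    return sum(partials)

data = list(range(1, 101))
pieces = split(data, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, pieces))
total = gather(partials)
print(total)
```

The key property is that `process` never needs to see more than its own piece, so the pieces can run anywhere.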
![Page 73: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/73.jpg)
distributed computing
![Page 74: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/74.jpg)
[Chart: Difficulty (y-axis) vs. Number of Machines (x-axis)]
![Page 75: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/75.jpg)
[Chart: Difficulty (y-axis) vs. Number of Machines (x-axis)]
![Page 76: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/76.jpg)
[Chart: Difficulty (y-axis) vs. Number of Machines (x-axis)]
![Page 77: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/77.jpg)
distributed computing is hard
![Page 78: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/78.jpg)
distributed computing requires god-like engineers
![Page 79: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/79.jpg)
Innovation #1:
![Page 80: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/80.jpg)
Hadoop is… The MapReduce computational paradigm
![Page 81: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/81.jpg)
Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System
![Page 82: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/82.jpg)
| Person | Start | End |
|---|---|---|
| Bob | 00:44:48 | 00:45:11 |
| Charlie | 02:16:02 | 02:16:18 |
| Charlie | 11:16:59 | 11:17:17 |
| Charlie | 11:17:24 | 11:17:38 |
| Bob | 11:23:10 | 11:23:25 |
| Alice | 16:26:46 | 16:26:54 |
| David | 17:20:28 | 17:20:45 |
| Alice | 18:16:53 | 18:17:00 |
| Charlie | 19:33:44 | 19:33:59 |
| Bob | 21:13:32 | 21:13:43 |
| David | 22:36:22 | 22:36:34 |
| Alice | 23:42:01 | 23:42:11 |
![Page 86: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/86.jpg)
| Person | Start | End | Duration |
|---|---|---|---|
| Bob | 00:44:48 | 00:45:11 | 23 |
| Charlie | 02:16:02 | 02:16:18 | 16 |
| Charlie | 11:16:59 | 11:17:17 | 18 |
| Charlie | 11:17:24 | 11:17:38 | 14 |
| Bob | 11:23:10 | 11:23:25 | 15 |
| Alice | 16:26:46 | 16:26:54 | 8 |
| David | 17:20:28 | 17:20:45 | 17 |
| Alice | 18:16:53 | 18:17:00 | 7 |
| Charlie | 19:33:44 | 19:33:59 | 15 |
| Bob | 21:13:32 | 21:13:43 | 11 |
| David | 22:36:22 | 22:36:34 | 12 |
| Alice | 23:42:01 | 23:42:11 | 10 |
![Page 87: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/87.jpg)
| Person | Duration |
|---|---|
| Bob | 23 |
| Charlie | 16 |
| Charlie | 18 |
| Charlie | 14 |
| Bob | 15 |
| Alice | 8 |
| David | 17 |
| Alice | 7 |
| Charlie | 15 |
| Bob | 11 |
| David | 12 |
| Alice | 10 |
![Page 88: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/88.jpg)
map: (Person, Start, End) records → (Person, Duration) records
![Page 90: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/90.jpg)
| Person | Duration |
|---|---|
| Alice | 8 |
| Alice | 7 |
| Alice | 10 |
| Bob | 23 |
| Bob | 15 |
| Bob | 11 |
| Charlie | 16 |
| Charlie | 18 |
| Charlie | 14 |
| Charlie | 15 |
| David | 12 |
| David | 17 |
![Page 95: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/95.jpg)
| Person | Total |
|---|---|
| Alice | 25 |
| Bob | 49 |
| Charlie | 63 |
| David | 29 |
![Page 96: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/96.jpg)
reduce: (Person, Duration) records → (Person, Total) records
![Page 99: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/99.jpg)
map: Works on one record. In this case it does “end time minus start time”, in parallel over all the records.
reduce: Groups together common records (e.g. “Alice”, “Bob”) and adds all the results.
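The map and reduce steps from the preceding slides can be reproduced on the same records in plain Python. This is a single-process sketch of the paradigm, not Hadoop code: `map_record` and `reduce_by_person` are illustrative names, and the durations are assumed to be in seconds.

```python
from collections import defaultdict
from datetime import datetime

# The (Person, Start, End) records from the slides.
RECORDS = [
    ("Bob", "00:44:48", "00:45:11"),
    ("Charlie", "02:16:02", "02:16:18"),
    ("Charlie", "11:16:59", "11:17:17"),
    ("Charlie", "11:17:24", "11:17:38"),
    ("Bob", "11:23:10", "11:23:25"),
    ("Alice", "16:26:46", "16:26:54"),
    ("David", "17:20:28", "17:20:45"),
    ("Alice", "18:16:53", "18:17:00"),
    ("Charlie", "19:33:44", "19:33:59"),
    ("Bob", "21:13:32", "21:13:43"),
    ("David", "22:36:22", "22:36:34"),
    ("Alice", "23:42:01", "23:42:11"),
]

TIME_FORMAT = "%H:%M:%S"

def map_record(record):
    """Map: works on one record and emits (person, end minus start)."""
    person, start, end = record
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return person, int(delta.total_seconds())

def reduce_by_person(pairs):
    """Reduce: group pairs by person and add up the durations."""
    totals = defaultdict(int)
    for person, duration in pairs:
        totals[person] += duration
    return dict(totals)

# Each map call is independent, so on a cluster these could run in parallel.
durations = [map_record(r) for r in RECORDS]
totals = reduce_by_person(durations)
print(totals)
```

The totals match the final slide of the walkthrough: Alice 25, Bob 49, Charlie 63, David 29.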
![Page 100: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/100.jpg)
Hadoop is… The MapReduce computational paradigm
![Page 101: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/101.jpg)
Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System
![Page 102: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/102.jpg)
distributed computing requires god-like engineers
![Page 103: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/103.jpg)
distributed computing (with Hadoop) requires ~~god-like~~ talented engineers
![Page 104: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/104.jpg)
Launch a Hadoop cluster from the CLI (
elastic-mapreduce --create --alive \
--instance-type m1.xlarge \
--num-instances 5
![Page 105: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/105.jpg)
The Hadoop Ecosystem
![Page 106: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/106.jpg)
EMR makes it easy to use Hive and Pig
Pig:
• High-level programming language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL
Hive:
• Data Warehouse for Hadoop
• SQL-like query language (HiveQL)
![Page 107: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/107.jpg)
EMR makes it easy to use other tools and applications
R:
• Language and software environment for statistical computing and graphics
• Open source
Mahout:
• Machine learning library
• Supports recommendation mining, clustering, classification, and frequent itemset mining
![Page 108: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/108.jpg)
Hive Schema on read
![Page 109: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/109.jpg)
Launch a Hive cluster from the CLI (step 1/1)
./elastic-mapreduce --create --alive \
--name "Test Hive" \
--hadoop-version 0.20 \
--num-instances 5 \
--instance-type m1.large \
--hive-interactive \
--hive-versions 0.7.1
![Page 110: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/110.jpg)
SQL Interface for working with data
Simple way to use Hadoop
Create Table statement references data location on S3
Language called HiveQL, similar to SQL
An example of a query could be: SELECT COUNT(1) FROM sometable;
Requires setting up a mapping to the input data
Uses SerDes to make different input formats queryable
Powerful data types (Array, Map, ...)
![Page 111: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/111.jpg)
Feature                 SQL                     HiveQL
Updates                 UPDATE, INSERT, DELETE  INSERT OVERWRITE TABLE
Transactions            Supported               Not supported
Indexes                 Supported               Not supported
Latency                 Sub-second              Minutes
Functions               Hundreds                Dozens
Multi-table inserts     Not supported           Supported
Create table as select  Not valid SQL-92        Supported
![Page 112: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/112.jpg)
./elastic-mapreduce --create \
--name "Hive job flow" \
--hive-script \
--args s3://myawsbucket/myquery.q \
--args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output
HiveQL to execute (the myquery.q script)
![Page 113: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/113.jpg)
./elastic-mapreduce \
--create \
--alive \
--name "Hive job flow" \
--num-instances 5 --instance-type m1.large \
--hive-interactive
Interactive hive session
![Page 114: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/114.jpg)
{
requestBeginTime: "19191901901",
requestEndTime: "19089012890",
browserCookie: "xFHJK21AS6HLASLHAS",
userCookie: "ajhlasH6JASLHbas8",
searchPhrase: "digital cameras",
adId: "jalhdahu789asashja",
impressionId: "hjakhlasuhiouasd897asdh",
referrer: "http://cooking.com/recipe?id=10231",
hostname: "ec2-12-12-12-12.ec2.amazonaws.com",
modelId: "asdjhklasd7812hjkasdhl",
processId: "12901",
threadId: "112121",
timers: { requestTime: "1910121", modelLookup: "1129101" },
counters: { heapSpace: "1010120912012" }
}
![Page 115: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/115.jpg)
{
requestBeginTime: "19191901901",
requestEndTime: "19089012890",
browserCookie: "xFHJK21AS6HLASLHAS",
userCookie: "ajhlasH6JASLHbas8",
adId: "jalhdahu789asashja",
impressionId: "hjakhlasuhiouasd897asdh",
clickId: "ashda8ah8asdp1uahipsd",
referrer: "http://recipes.com/",
directedTo: "http://cooking.com/"
}
![Page 116: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/116.jpg)
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='requestBeginTime,
adId, impressionId, referrer, userAgent,
userCookie, ip' )
LOCATION 's3://mybucketsource/tables/impressions' ;
![Page 117: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/117.jpg)
The column list is the table structure to create (this happens fast, as it is just a mapping to the source)
![Page 118: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/118.jpg)
The LOCATION clause points to the source data in S3
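To make the "schema on read" idea concrete, here is a rough Python sketch of what a JSON SerDe configured with a 'paths' list does conceptually: each stored log line is parsed only when it is read, and the listed keys are projected into columns. This is not the implementation of com.amazon.elasticmapreduce.JsonSerde; the `project` function, the sample record, and the choice to return None for a missing key are assumptions made for illustration.

```python
import json

# Column paths, in the order given in the CREATE EXTERNAL TABLE statement.
PATHS = [
    "requestBeginTime", "adId", "impressionId", "referrer",
    "userAgent", "userCookie", "ip",
]


def project(line, paths=PATHS):
    """Parse one JSON log line and return one value per column path.

    Keys absent from the record come back as None (an assumption here,
    standing in for a NULL column value).
    """
    record = json.loads(line)
    return tuple(record.get(path) for path in paths)


# A made-up log line in valid JSON, loosely modelled on the impression record.
sample_line = json.dumps({
    "requestBeginTime": "19191901901",
    "adId": "jalhdahu789asashja",
    "impressionId": "hjakhlasuhiouasd897asdh",
    "referrer": "http://cooking.com/recipe?id=10231",
    "userCookie": "ajhlasH6JASLHbas8",
    "hostname": "ec2-12-12-12-12.ec2.amazonaws.com",  # not in PATHS, ignored
})

row = project(sample_line)
print(row)
```

Nothing is converted or loaded when the table is defined; the stored files stay as they are and the mapping is applied per line at query time.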
![Page 119: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/119.jpg)
Hadoop lowers the cost of developing a distributed system.
![Page 120: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/120.jpg)
hive> select * from impressions limit 5;
Selecting from source data directly via Hadoop
![Page 121: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/121.jpg)
What about the cost of operating a distributed system?
![Page 122: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/122.jpg)
November traffic at amazon.com
![Page 123: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/123.jpg)
![Page 124: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/124.jpg)
November traffic at amazon.com (chart labels: 76%, 24%)
![Page 125: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/125.jpg)
Innovation #2:
![Page 126: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/126.jpg)
What is Amazon Elastic MapReduce (EMR)?
EMR is Hadoop in the Cloud
![Page 127: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/127.jpg)
![Page 128: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/128.jpg)
1 instance x 100 hours = 100 instances x 1 hour
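The equation above is about billed instance-hours. A small sketch of the arithmetic, assuming per-instance-hour billing and a job that parallelizes perfectly; the hourly rate below is a hypothetical placeholder, not a quoted AWS price.

```python
HOURLY_RATE = 0.50  # hypothetical price per instance-hour, in dollars


def job_cost(instances, hours, rate=HOURLY_RATE):
    """Cost of a job billed per instance-hour."""
    return instances * hours * rate


serial_cost = job_cost(instances=1, hours=100)
parallel_cost = job_cost(instances=100, hours=1)

print(f"1 instance x 100 hours:  ${serial_cost:.2f}")
print(f"100 instances x 1 hour:  ${parallel_cost:.2f}")
```

Under these assumptions both runs cost the same, but the second one delivers its result in one hour instead of one hundred.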
![Page 129: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/129.jpg)
How does EMR work?
• Put the data into S3
• Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
• Launch the cluster using the EMR console, CLI, SDK, or APIs
• Get the output from S3
• You can also store everything in HDFS
Diagram components: S3, EMR Cluster
![Page 130: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/130.jpg)
What can you run on EMR…
Diagram components: S3, EMR Cluster
![Page 131: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/131.jpg)
Resize Nodes
You can easily add and remove nodes from the EMR cluster
![Page 132: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/132.jpg)
Usage patterns (chart panels): On and Off, Fast Growth, Predictable peaks, Variable peaks
WASTE
![Page 133: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/133.jpg)
Usage patterns (chart panels): Fast Growth, On and Off, Predictable peaks, Variable peaks
![Page 134: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/134.jpg)
Your choice of tools on Hadoop/EMR
Diagram components: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR
![Page 135: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/135.jpg)
SQL based processing
Diagram components: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR (pre-processing framework), Amazon Redshift (petabyte-scale columnar data warehouse)
![Page 136: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/136.jpg)
Massively Parallel Columnar Data Warehouses
• Columnar Data stores
• MPP
– Parallel Ingest
– Parallel Query
– Scale Out
– Parallel Backup
![Page 137: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/137.jpg)
Columnar data stores
• Data alignment and block size in row stores vs. column stores
• Compression based on each column
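A small Python sketch of why per-column compression works well in a column store. It pivots the Person/Duration rows from the earlier slide into columns and applies run-length encoding to each. Run-length encoding is only one possible encoding, chosen here for illustration; the slides do not say which encodings a given warehouse uses, and the helper names are made up.

```python
from itertools import groupby

# Rows from the Person / Duration slide, as a row store would hold them.
ROWS = [
    ("Alice", 8), ("Alice", 7), ("Alice", 10),
    ("Bob", 23), ("Bob", 15), ("Bob", 11),
    ("Charlie", 16), ("Charlie", 18), ("Charlie", 14), ("Charlie", 15),
    ("David", 12), ("David", 17),
]


def to_columns(rows):
    """Pivot a list of row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(column):
    """Collapse runs of equal adjacent values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


persons, durations = to_columns(ROWS)

encoded_persons = run_length_encode(persons)
encoded_durations = run_length_encode(durations)

print(encoded_persons)         # 12 values collapse into 4 runs
print(len(encoded_durations))  # no repeats here, so nothing is saved
```

Because each column holds values of one type, often with repetition, an encoding can be picked per column: the person column shrinks considerably under run-length encoding, while the duration column would need a different scheme.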
![Page 138: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/138.jpg)
MPP data warehouse parallelizes and distributes everything
• Query
• Load
• Backup
• Restore
• Resize
Diagram labels: 10 GigE (HPC), Ingestion, Backup, Restore, JDBC/ODBC
![Page 139: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/139.jpg)
But Data-warehouses are
• Hard to manage
• Very expensive
• Difficult to scale
• Difficult to get performance
![Page 140: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/140.jpg)
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
![Page 141: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/141.jpg)
![Page 142: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/142.jpg)
Parallelize and Distribute Everything (MPP)
• Load
• Query
• Resize
• Backup
• Restore
Dramatically Reduce I/O
• Direct-attached storage
• Large data block sizes
• Column data store
• Data compression
• Zone maps
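Zone maps reduce I/O by remembering the minimum and maximum value stored in each block, so a range query can skip blocks that cannot contain a match. The following is a toy Python sketch of that idea under simplifying assumptions (a single in-memory column, fixed block size, made-up values); it does not describe Redshift's internal format.

```python
def build_zone_map(values, block_size):
    """Record (min, max) for each fixed-size block of a column."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(zone_map, low, high):
    """Return the indexes of blocks whose [min, max] overlaps [low, high]."""
    return [
        index
        for index, (block_min, block_max) in enumerate(zone_map)
        if block_max >= low and block_min <= high
    ]


# Made-up column values, stored in sorted order.
column = [1, 3, 5, 7, 20, 22, 25, 28, 40, 41, 45, 49]
zone_map = build_zone_map(column, block_size=4)

print(zone_map)
print(blocks_to_scan(zone_map, 21, 26))  # only the middle block is read
print(blocks_to_scan(zone_map, 8, 19))   # no block can match, nothing is read
```

The benefit depends on how the data is ordered: when values are sorted, as here, block ranges barely overlap and most blocks can be skipped.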
![Page 143: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/143.jpg)
![Page 144: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/144.jpg)
Protect Operations
• Redshift data is encrypted
• Continuously backed up to S3
• Automatic node recovery
• Transparent disk failure
Simplify Provisioning
• Create a cluster in minutes
• Automatic OS and software patching
• Scale up to 1.6PB with a few clicks and no downtime
![Page 145: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/145.jpg)
Start Small and Grow Big
Extra Large Node (XL)
• 3 spindles, 2TB, 15GiB RAM
• 2 virtual cores, 10GigE
• 1 node (2TB) or 2-32 node cluster (64TB)
8 Extra Large Node (8XL)
• 24 spindles, 16TB, 120GiB RAM
• 16 virtual cores, 10GigE
• 2-100 node cluster (1.6PB)
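The cluster capacities quoted above follow directly from storage per node times the maximum node count. A quick check in Python, using only the figures on this slide and assuming decimal units (1 PB = 1,000 TB); the dictionary layout is just for illustration.

```python
# Storage per node and maximum cluster size, as listed on the slide.
NODE_TYPES = {
    "XL": {"tb_per_node": 2, "max_nodes": 32},
    "8XL": {"tb_per_node": 16, "max_nodes": 100},
}


def max_capacity_tb(node_type):
    """Largest cluster capacity in TB for the given node type."""
    spec = NODE_TYPES[node_type]
    return spec["tb_per_node"] * spec["max_nodes"]


xl_capacity_tb = max_capacity_tb("XL")
xxl_capacity_pb = max_capacity_tb("8XL") / 1000  # decimal TB to PB

print(f"XL:  {xl_capacity_tb} TB")
print(f"8XL: {xxl_capacity_pb} PB")
```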
![Page 146: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/146.jpg)
Easy to provision and scale
No upfront costs, pay as you go
High performance at a low price
Open and flexible with support for popular BI tools
![Page 147: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/147.jpg)
Amazon Redshift is priced to let you analyze all your data

Pricing option      Price per hour (HS1.XL single node)  Effective hourly price per TB  Effective annual price per TB
On-Demand           $0.850                               $0.425                         $3,723
1 Year Reservation  $0.500                               $0.250                         $2,190
3 Year Reservation  $0.228                               $0.114                         $999

Simple Pricing
• Number of Nodes x Cost per Hour
• No charge for Leader Node
• No upfront costs
• Pay as you go
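The per-TB columns in the pricing table can be reproduced from the hourly node price. The sketch below assumes 2 TB of storage per HS1.XL node (the XL figure given on the node-size slide) and 8,760 hours in a year; the function name is made up for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760
NODE_STORAGE_TB = 2        # XL node storage, per the node-size slide

# Hourly price for a single HS1.XL node, from the pricing table.
HOURLY_NODE_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def price_per_tb(hourly_node_price, node_storage_tb=NODE_STORAGE_TB):
    """Return (effective hourly price per TB, effective annual price per TB)."""
    hourly_per_tb = hourly_node_price / node_storage_tb
    annual_per_tb = hourly_per_tb * HOURS_PER_YEAR
    return hourly_per_tb, annual_per_tb


for option, node_price in HOURLY_NODE_PRICE.items():
    hourly, annual = price_per_tb(node_price)
    print(f"{option}: ${hourly:.3f} per TB-hour, ${annual:,.0f} per TB-year")
```

For a multi-node cluster, the "Number of Nodes x Cost per Hour" rule scales the hourly node price by the node count; the per-TB figures stay the same.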
![Page 148: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/148.jpg)
Your choice of BI Tools on the cloud
Diagram components: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR (pre-processing framework), Amazon Redshift
![Page 149: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/149.jpg)
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
![Page 150: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/150.jpg)
Collaboration and Sharing insights
Diagram components: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift
![Page 151: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/151.jpg)
Sharing results and visualizations
Diagram components: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Web App Server, visualization tools
![Page 152: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/152.jpg)
Sharing results and visualizations and scale
Diagram components: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Web App Server, visualization tools
![Page 153: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/153.jpg)
Sharing results and visualizations
Diagram components: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Business Intelligence tools
![Page 154: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/154.jpg)
Geospatial Visualizations
Diagram components: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Business Intelligence tools, GIS tools on Hadoop, GIS tools, visualization tools
![Page 155: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/155.jpg)
Rinse and repeat every day or hour
![Page 156: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/156.jpg)
Rinse and Repeat
Diagram components: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, visualization tools, Business Intelligence tools, GIS tools on Hadoop, GIS tools, Amazon Data Pipeline
![Page 157: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/157.jpg)
The complete architecture
Diagram components: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, visualization tools, Business Intelligence tools, GIS tools on Hadoop, GIS tools, Amazon Data Pipeline
![Page 158: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/158.jpg)
How do you start ?
![Page 159: Big data on_aws in korea by abhishek sinha (lunch and learn)](https://reader033.fdocuments.in/reader033/viewer/2022052523/555c2395d8b42a0b418b4b30/html5/thumbnails/159.jpg)
Where do you start?
• Where is your data? (S3, SQL, NoSQL?)
– Are you collecting all your data?
– What is the format (structured or unstructured)?
– How much is this data going to grow?
• How do you want to process it?
– SQL (Hive) or scripts (Python/Ruby/Node.js) on Hadoop?
• How do you want to use this data?
– Visualization tools
• Do it yourself or engage an AWS partner?
• Write to me [email protected]