Analytics in the Cloud
Peter Sirota, GM Elastic MapReduce
Data-Driven Decision Making
Data is the new raw material for any business, on par with capital, people, and labor.
What is Big Data?
Terabytes of semi-structured log data that businesses want to mine to: find correlations / perform pattern matching
generate recommendations
calculate advanced statistics (e.g., TP99 latency)
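To make the statistics example concrete, here is a minimal Python sketch of computing a TP99 (99th-percentile) latency over a batch of parsed log values. The sample numbers and the simple nearest-rank method are illustrative assumptions, not part of the original deck.

    # Minimal sketch: TP99 (99th-percentile) latency via the nearest-rank method.
    # The latency values below are made-up samples parsed from log lines.
    latencies_ms = [12, 8, 230, 15, 9, 41, 7, 310, 18, 22]
    latencies_ms.sort()
    rank = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    print(f"TP99 latency: {latencies_ms[rank]} ms")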
Twitter “Firehose”: 50 million tweets per day
1,400% growth per year
How can advertisers drink from it?
Social graphs
Value increases with exponential growth in data connections
Big Data is full of valuable, unanswered questions!
Why is Big Data Hard (and Getting Harder)?
Today’s Data Warehouses: Need to consolidate data from multiple sources in multiple formats across multiple businesses
Unconstrained growth of this business-critical information
Today’s Users: Expect faster response times on fresher data
Sampling is not good enough, and history is important
Demand inexpensive experimentation with new data
Are becoming increasingly sophisticated data scientists
Current systems don’t scale (and weren’t meant to): Long lead times to provision more infrastructure
Specialized DB expertise required
Expensive and inelastic solutions
We need tools built specifically for Big Data!
What is this thing called Hadoop?
Dealing with Big Data requires two things: Distributed, scalable storage
Inexpensive, flexible analytics
Apache Hadoop is an open source software platform that addresses both of these needs
Includes a fault-tolerant, distributed storage system (HDFS) designed for commodity servers
Uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets (see the sketch after this list)
Key benefits: Affordable – cost per TB is a fraction of traditional options
Proven at scale – Numerous petabyte implementations in production; linear scalability
Flexible – Data can be stored with or without schema
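As a concrete illustration of the MapReduce idea, here is a minimal word-count pair written for Hadoop Streaming. The file names and the word-count task itself are illustrative choices, not from the deck.

    # mapper.py -- emits "word<TAB>1" for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- sums the counts for each word; Hadoop Streaming delivers
    # mapper output to the reducer sorted by key, so equal words arrive adjacent
    import sys
    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{total}")
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")

On a cluster this pair would typically be submitted through the hadoop-streaming JAR with its -mapper and -reducer options; locally it can be smoke-tested with a shell pipeline such as cat input.txt | python mapper.py | sort | python reducer.py.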
RDBMS vs. MapReduce/Hadoop
RDBMS: Predefined schema
Strategic data placement for query tuning
Exploits indexes for fast retrieval
SQL only
Doesn’t scale linearly
MapReduce/Hadoop: No schema required
Random data placement
Fast scan of the entire dataset
Uniform query performance
Scales linearly for reads and writes
Supports many languages, including SQL
Complementary technologies
Why Amazon Elastic MapReduce?
Managed Apache Hadoop Web Service: Monitors thousands of clusters per day
Use cases span from university students to the Fortune 50
Reduces complexity of Hadoop management: Handles node provisioning, customization, and shutdown (see the launch sketch after this feature list)
Tunes Hadoop to your hardware and network
Provides tools to debug and monitor your Hadoop clusters
Provides tight integration with AWS services: Improved performance working with S3
Automatic re-provisioning on node failure
Dynamic expanding/shrinking of cluster size
Spot integration
Elastic MapReduce Key Features
Simplified Cluster Configuration/Management: Resize running job flows
Support for EIP/IAM/Tagging
Workload-specific configurations
Bootstrap Actions
Enhanced Monitoring/Debugging: Free CloudWatch metrics / alarms
Hadoop Metrics in Console
Ganglia Support
Improved Performance: S3 multipart upload
Cluster Compute Instances
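A minimal sketch of launching such a cluster programmatically, shown here with the modern boto3 SDK for illustration (the original service predates boto3). The cluster name, instance types, release label, S3 bucket, and IAM role names are assumptions.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    response = emr.run_job_flow(
        Name="analytics-cluster",                       # hypothetical name
        ReleaseLabel="emr-6.15.0",                      # assumption: any current release
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
        LogUri="s3://my-bucket/emr-logs/",              # hypothetical bucket
        Instances={
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 4},
                # Spot integration: cheaper task nodes that can be reclaimed
                {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": 5},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",              # default EMR roles
        ServiceRole="EMR_DefaultRole",
    )
    print("Started cluster:", response["JobFlowId"])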
Analytics Use Cases
Targeted advertising / Clickstream analysis
Data warehousing applications
Bioinformatics (genome analysis)
Financial simulation (Monte Carlo simulation)
File processing (e.g., resizing JPEGs)
Web indexing
Data mining and BI
Apache Hive: Data Warehouse for Hadoop
Open source project started at Facebook
Turns data on Hadoop into a virtually limitless data warehouse
Provides data summarization, ad hoc querying and analysis
Enables SQL-like queries on structured and unstructured data; e.g., arbitrary field separators such as “,” in CSV file formats are possible (see the sketch below)
Inherits linear scalability of Hadoop
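For example, a Hive table can be declared directly over comma-delimited files. The sketch below drives the Hive CLI from Python; the table layout, columns, and S3 location are hypothetical.

    import subprocess

    # Declare an external Hive table over comma-separated log files in S3,
    # then run an ad hoc aggregation. Table and bucket names are made up.
    hiveql = """
    CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
      ip STRING, ts STRING, url STRING, status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/logs/';

    SELECT url, COUNT(*) AS hits
    FROM access_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;
    """
    subprocess.run(["hive", "-e", hiveql], check=True)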
AWS Data Warehousing Architecture
Elastic Data Warehouse
Customize cluster size to support varying resource needs (e.g. query support during the day versus batch processing overnight); a resize sketch follows the diagram note below
Reduce costs by increasing server utilization
Improve performance during high usage periods
[Diagram: a Data Warehouse cluster at steady state (9 instances) expands to 25 instances for batch processing, then shrinks back to 9 instances at steady state.]
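A sketch of the expand/shrink step via boto3, assuming an already-running cluster; the cluster and instance-group IDs are placeholders.

    import boto3

    emr = boto3.client("emr")
    # Grow the task group to 25 instances for the overnight batch window;
    # the same call with InstanceCount=9 shrinks it back for steady state.
    emr.modify_instance_groups(
        ClusterId="j-XXXXXXXXXXXXX",                    # placeholder cluster id
        InstanceGroups=[{
            "InstanceGroupId": "ig-XXXXXXXXXXXX",       # placeholder group id
            "InstanceCount": 25,
        }],
    )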
Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
Scenario #1 (On-Demand only): job flow runs 14 hours
Cost without Spot: 4 instances * 14 hrs * $0.50 = $28.00
Scenario #2 (5 Spot task instances added): job flow runs 7 hours
Cost with Spot: 4 instances * 7 hrs * $0.50 = $14.00, plus 5 instances * 7 hrs * $0.25 = $8.75; Total = $22.75
Time savings: 50%
Cost savings: ~19%
Other EMR + Spot use cases: Run the entire cluster on Spot for the biggest cost savings; reduce the cost of application testing
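The arithmetic behind the two scenarios, as a quick Python check (the prices and instance counts are the slide's own example figures):

    on_demand_price, spot_price = 0.50, 0.25                # $/instance-hour
    cost_1 = 4 * 14 * on_demand_price                       # $28.00
    cost_2 = 4 * 7 * on_demand_price + 5 * 7 * spot_price   # $14.00 + $8.75
    print(f"Scenario 1: ${cost_1:.2f}")                     # $28.00
    print(f"Scenario 2: ${cost_2:.2f}")                     # $22.75
    print(f"Savings: {1 - cost_2 / cost_1:.0%}")            # ~19%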
Monitoring Clusters with CloudWatch
Free CloudWatch Metrics and Alarms: Track Hadoop job progress
Alarm on degradations in cluster health
Monitor aggregate Elastic MapReduce usage
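As one concrete example, EMR publishes an IsIdle metric that can drive an alarm on wasted capacity. A minimal boto3 sketch follows; the alarm name, cluster id, and SNS topic ARN are placeholders.

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    # Alarm when the cluster has been idle for 15 minutes (three 5-minute periods)
    cloudwatch.put_metric_alarm(
        AlarmName="emr-cluster-idle",                   # placeholder name
        Namespace="AWS/ElasticMapReduce",
        MetricName="IsIdle",
        Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=1.0,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
    )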
Big Data Ecosystem And Tools
We have a rapidly growing ecosystem and will continue to integrate with a wide range of partners. Some examples:
Business Intelligence: MicroStrategy, Pentaho
Analytics: Datameer, Karmasphere, Quest
Open source: Ganglia, SQuirreL SQL
Resources
Amazon Elastic MapReduce
aws.amazon.com/elasticmapreduce
aws.amazon.com/articles/Elastic-MapReduce
forums.aws.amazon.com/forum.jspa?forumID=52