Analytics in the Cloud

Peter Sirota, GM Elastic MapReduce

Data-Driven Decision Making

Data is the new raw material for any business, on par with capital, people, and labor.

What is Big Data?

Terabytes of semi-structured log data in which businesses want to:

find correlations and perform pattern matching

generate recommendations

calculate advanced statistics (e.g., TP99)
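
As an illustration of the kind of statistic mentioned above, here is a minimal sketch of a TP99 calculation, assuming TP99 means the 99th-percentile value and using the nearest-rank method. The function name and the sample latencies are hypothetical and chosen only for this example.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the smallest value that covers p percent of the data.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical sample: request latencies of 1..100 ms.
latencies_ms = list(range(1, 101))
tp99 = percentile_nearest_rank(latencies_ms, 99)
print(f"TP99 = {tp99} ms")
```

On terabytes of logs the sort would be distributed or approximated rather than done in memory; the sketch only pins down what is being computed.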

Twitter “Firehose”

50 million tweets per day

1,400% growth per year

How can advertisers drink from it?

Social graphs

Value increases with exponential growth in data connections

Big Data is full of valuable, unanswered questions!

Why is Big Data Hard (and Getting Harder)?

Today’s Data Warehouses:

Need to consolidate from multiple data sources in multiple formats across multiple businesses

Unconstrained growth of this business-critical information

Today’s Users:

Expect faster response times on fresher data

Sampling is not good enough and history is important

Demand inexpensive experimentation with new data

Are becoming increasingly sophisticated data scientists

Current systems don’t scale (and weren’t meant to):

Long time to provision more infrastructure

Specialized DB expertise required

Expensive and inelastic solutions

We need tools built specifically for Big Data!

What is this thing called Hadoop?

Dealing with Big Data requires two things:

Distributed, scalable storage

Inexpensive, flexible analytics

Apache Hadoop is an open source software platform that addresses both of these needs:

Includes a fault-tolerant, distributed storage system (HDFS) developed for commodity servers

Uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets

Key benefits:

Affordable – Cost / TB is a fraction of traditional options

Proven at scale – Numerous petabyte implementations in production; linear scalability

Flexible – Data can be stored with or without schema
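
To make the MapReduce technique mentioned above more concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It is not Hadoop code: the function names and the sample log lines are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical log lines standing in for a large distributed data set.
logs = ["GET /index", "GET /about", "POST /index"]
counts = run_job(logs)
print(counts)
```

Because each map call sees one line and each reduce call sees one key, both steps can be spread across machines without coordination, which is where the scan-everything style of analysis comes from.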

RDBMS vs. MapReduce/Hadoop

RDBMS:

Predefined schema

Strategic data placement for query tuning

Exploits indexes for fast retrieval

SQL only

Doesn’t scale linearly

MapReduce/Hadoop:

No schema is required

Random data placement

Fast scan of the entire dataset

Uniform query performance

Scales linearly for reads and writes

Supports many languages, including SQL

Complementary technologies

Why Amazon Elastic MapReduce?

Managed Apache Hadoop Web Service:

Monitors thousands of clusters per day

Use cases range from university students to Fortune 50 companies

Reduces complexity of Hadoop management:

Handles node provisioning, customization, and shutdown

Tunes Hadoop to your hardware and network

Provides tools to debug and monitor your Hadoop clusters

Provides tight integration with AWS services:

Improved performance working with S3

Automatic re-provisioning on node failure

Dynamic expanding/shrinking of cluster size

Spot integration

Elastic MapReduce Key Features

Simplified Cluster Configuration/Management:

Resize running job flows

Support for EIP/IAM/Tagging

Workload-specific configurations

Bootstrap Actions

Enhanced Monitoring/Debugging:

Free CloudWatch Metrics / Alarms

Hadoop Metrics in Console

Ganglia Support

Improved Performance:

S3 Multipart Upload

Cluster Compute Instances

Analytics Use Cases

Targeted advertising / Clickstream analysis

Data warehousing applications

Bio-informatics (Genome analysis)

Financial simulation (Monte Carlo simulation)

File processing (e.g., resizing JPEGs)

Web indexing

Data mining and BI

Apache Hive: Data Warehouse for Hadoop

Open source project started at Facebook

Turns data on Hadoop into a virtually limitless data warehouse

Provides data summarization, ad hoc querying and analysis

Enables SQL-like queries on structured and unstructured data

E.g., arbitrary field separators are possible, such as “,” in CSV file formats

Inherits linear scalability of Hadoop
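
The field-separator point above can be sketched as a small Python helper that builds a HiveQL table definition for delimited text files. The helper name, the table, its columns, and the S3 location are all hypothetical; only the `ROW FORMAT DELIMITED FIELDS TERMINATED BY` clause is standard HiveQL. Treat it as an illustration, not as how any particular deployment defines its tables.

```python
def hive_delimited_table_ddl(table, columns, location, field_sep=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement for delimited text.

    columns is a list of (name, hive_type) pairs; field_sep is the
    character that separates fields in each line of the data files.
    """
    column_list = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {column_list}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{field_sep}'\n"
        f"LOCATION '{location}'"
    )

# Hypothetical clickstream table stored as CSV files.
ddl = hive_delimited_table_ddl(
    table="clicks",
    columns=[("user_id", "STRING"), ("url", "STRING"), ("ts", "BIGINT")],
    location="s3://example-bucket/clicks/",  # assumed, not from the slides
)
print(ddl)
```

Once such a table exists, a SQL-like query over it (for example a `GROUP BY url` count) is run by Hive over the underlying files.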

AWS Data Warehousing Architecture

Elastic Data Warehouse

Customize cluster size to support varying resource needs (e.g. query support during the day versus batch processing overnight)

Reduce costs by increasing server utilization

Improve performance during high usage periods

[Figure: Data Warehouse (Steady State) -> expand to 25 instances -> Data Warehouse (Batch Processing) -> shrink to 9 instances -> Data Warehouse (Steady State)]
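
A rough sketch of why resizing can reduce cost: compare the instance-hours of a cluster that stays at peak size all day with one that expands only for batch processing. The 25 and 9 instance counts come from the slide; the 8-hour batch window and the fixed-size baseline are assumptions made purely for illustration.

```python
def instance_hours(schedule):
    """Sum instance-hours over a list of (instance_count, hours) segments."""
    return sum(count * hours for count, hours in schedule)

BATCH_HOURS = 8  # assumption: length of the nightly batch window

# Baseline (assumed): a fixed cluster sized for the batch peak all day.
fixed_schedule = [(25, 24)]
# Elastic: 9 instances in steady state, 25 during batch processing.
elastic_schedule = [(9, 24 - BATCH_HOURS), (25, BATCH_HOURS)]

fixed_total = instance_hours(fixed_schedule)
elastic_total = instance_hours(elastic_schedule)
reduction = 1 - elastic_total / fixed_total

print(f"fixed: {fixed_total} instance-hours per day")
print(f"elastic: {elastic_total} instance-hours per day")
print(f"reduction: {reduction:.0%}")
```

The actual saving depends entirely on how long the peak lasts, so the percentage printed here should not be read as a result from the talk.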

Reducing Costs with Spot Instances

Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption

Scenario #1: Cost without Spot

Job flow duration: 14 hours

4 instances * 14 hrs * $0.50 = $28

Scenario #2: Cost with Spot

Job flow duration: 7 hours

4 instances * 7 hrs * $0.50 = $14

5 instances * 7 hrs * $0.25 = $8.75

Total = $22.75

Time Savings: 50%

Cost Savings: ~19%

Other EMR + Spot Use Cases

Run entire cluster on Spot for biggest cost savings

Reduce the cost of application testing
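
The arithmetic for the two scenarios can be checked with a short script that uses only the instance counts, durations, and hourly rates stated on the slide. It reads the $0.50 rate as the On-Demand price and the $0.25 rate as the Spot price, which is an interpretation; the function and variable names are hypothetical.

```python
def job_cost(fleets):
    """Total cost of a job flow given (instances, hours, hourly_rate) fleets."""
    return sum(count * hours * rate for count, hours, rate in fleets)

# Scenario #1: 4 instances at $0.50/hr for 14 hours.
on_demand_only = [(4, 14, 0.50)]
# Scenario #2: the same 4 instances plus 5 at $0.25/hr, all for 7 hours.
mixed = [(4, 7, 0.50), (5, 7, 0.25)]

cost_without_spot = job_cost(on_demand_only)
cost_with_spot = job_cost(mixed)
cost_savings = 1 - cost_with_spot / cost_without_spot
time_savings = 1 - 7 / 14

print(f"without Spot: ${cost_without_spot:.2f}")
print(f"with Spot: ${cost_with_spot:.2f}")
print(f"cost savings: {cost_savings:.1%}, time savings: {time_savings:.0%}")
```

Multiplying out the stated inputs gives $14.00 + $8.75 = $22.75 for the mixed scenario, a saving of 18.75% against the $28.00 On-Demand run.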

Monitoring Clusters with CloudWatch

Free CloudWatch Metrics and Alarms:

Track Hadoop job progress

Alarm on degradations in cluster health

Monitor aggregate Elastic MapReduce usage

Big Data Ecosystem And Tools

We have a rapidly growing ecosystem and will continue to integrate with a wide range of partners. Some examples:

Business Intelligence: MicroStrategy, Pentaho

Analytics: Datameer, Karmasphere, Quest

Open source: Ganglia, SQuirreL SQL