Analytics in the Cloud
Peter Sirota, GM Elastic MapReduce
Data-Driven Decision Making
Data is the new raw material for any business, on par with capital, people, and labor.
What is Big Data?
Terabytes of semi-structured log data that businesses want to mine to: find correlations / perform pattern matching
generate recommendations
calculate advanced statistics (e.g., TP99 latency)
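To make the statistics example concrete, here is a minimal Python sketch of computing a TP99 (99th-percentile) latency over a batch of parsed log values. The sample numbers and the simple nearest-rank method are illustrative assumptions, not part of the original deck.

    # Minimal sketch: TP99 (99th-percentile) latency via the nearest-rank method.
    # The latency values below are made-up samples parsed from log lines.
    latencies_ms = [12, 8, 230, 15, 9, 41, 7, 310, 18, 22]
    latencies_ms.sort()
    rank = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    print(f"TP99 latency: {latencies_ms[rank]} ms")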
Twitter “Firehose”: 50 million tweets per day
1,400% growth per year
How can advertisers drink from it?
Social graphs
Value increases with exponential growth in data connections
Big Data is full of valuable, unanswered questions!
Why is Big Data Hard (and Getting Harder)?
Today’s Data Warehouses: Need to consolidate data from multiple sources in multiple formats across multiple businesses
Unconstrained growth of this business-critical information
Today’s Users: Expect faster response times on fresher data
Sampling is not good enough, and history is important
Demand inexpensive experimentation with new data
Are becoming increasingly sophisticated data scientists
Current systems don’t scale (and weren’t meant to): Long lead times to provision more infrastructure
Specialized DB expertise required
Expensive and inelastic solutions
We need tools built specifically for Big Data!
What is this thing called Hadoop?
Dealing with Big Data requires two things: Distributed, scalable storage
Inexpensive, flexible analytics
Apache Hadoop is an open source software platform that addresses both of these needs
Includes a fault-tolerant, distributed storage system (HDFS) designed for commodity servers
Uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets (see the sketch after this list)
Key benefits: Affordable – cost per TB is a fraction of traditional options
Proven at scale – Numerous petabyte implementations in production; linear scalability
Flexible – Data can be stored with or without schema
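As a concrete illustration of the MapReduce idea, here is a minimal word-count pair written for Hadoop Streaming. The file names and the word-count task itself are illustrative choices, not from the deck.

    # mapper.py -- emits "word<TAB>1" for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- sums the counts for each word; Hadoop Streaming delivers
    # mapper output to the reducer sorted by key, so equal words arrive adjacent
    import sys
    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{total}")
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")

On a cluster this pair would typically be submitted through the hadoop-streaming JAR with its -mapper and -reducer options; locally it can be smoke-tested with a shell pipeline such as cat input.txt | python mapper.py | sort | python reducer.py.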
RDBMS vs. MapReduce/Hadoop
RDBMS: Predefined schema
Strategic data placement for query tuning
Exploits indexes for fast retrieval
SQL only
Doesn’t scale linearly
MapReduce/Hadoop: No schema required
Random data placement
Fast scan of the entire dataset
Uniform query performance
Scales linearly for reads and writes
Supports many languages, including SQL
Complementary technologies
Why Amazon Elastic MapReduce?
Managed Apache Hadoop Web Service: Monitors thousands of clusters per day
Use cases span from university students to the Fortune 50
Reduces complexity of Hadoop management: Handles node provisioning, customization, and shutdown (see the launch sketch after this feature list)
Tunes Hadoop to your hardware and network
Provides tools to debug and monitor your Hadoop clusters
Provides tight integration with AWS services: Improved performance working with S3
Automatic re-provisioning on node failure
Dynamic expanding/shrinking of cluster size
Spot integration
Elastic MapReduce Key Features
Simplified Cluster Configuration/Management: Resize running job flows
Support for EIP/IAM/Tagging
Workload-specific configurations
Bootstrap Actions
Enhanced Monitoring/Debugging: Free CloudWatch metrics / alarms
Hadoop Metrics in Console
Ganglia Support
Improved Performance: S3 multipart upload
Cluster Compute Instances
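A minimal sketch of launching such a cluster programmatically, shown here with the modern boto3 SDK for illustration (the original service predates boto3). The cluster name, instance types, release label, S3 bucket, and IAM role names are assumptions.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    response = emr.run_job_flow(
        Name="analytics-cluster",                       # hypothetical name
        ReleaseLabel="emr-6.15.0",                      # assumption: any current release
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
        LogUri="s3://my-bucket/emr-logs/",              # hypothetical bucket
        Instances={
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 4},
                # Spot integration: cheaper task nodes that can be reclaimed
                {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": 5},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",              # default EMR roles
        ServiceRole="EMR_DefaultRole",
    )
    print("Started cluster:", response["JobFlowId"])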
Analytics Use Cases
Targeted advertising / Clickstream analysis
Data warehousing applications
Bioinformatics (genome analysis)
Financial simulation (Monte Carlo simulation)
File processing (e.g., resizing JPEGs)
Web indexing
Data mining and BI
Apache Hive: Data Warehouse for Hadoop
Open source project started at Facebook
Turns data on Hadoop into a virtually limitless data warehouse
Provides data summarization, ad hoc querying and analysis
Enables SQL-like queries on structured and unstructured data; e.g., arbitrary field separators such as “,” in CSV file formats are possible (see the sketch below)
Inherits linear scalability of Hadoop
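For example, a Hive table can be declared directly over comma-delimited files. The sketch below drives the Hive CLI from Python; the table layout, columns, and S3 location are hypothetical.

    import subprocess

    # Declare an external Hive table over comma-separated log files in S3,
    # then run an ad hoc aggregation. Table and bucket names are made up.
    hiveql = """
    CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
      ip STRING, ts STRING, url STRING, status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/logs/';

    SELECT url, COUNT(*) AS hits
    FROM access_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;
    """
    subprocess.run(["hive", "-e", hiveql], check=True)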
AWS Data Warehousing Architecture
Elastic Data Warehouse
Customize cluster size to support varying resource needs (e.g. query support during the day versus batch processing overnight); a resize sketch follows the diagram note below
Reduce costs by increasing server utilization
Improve performance during high usage periods
[Diagram: a Data Warehouse cluster at steady state (9 instances) expands to 25 instances for batch processing, then shrinks back to 9 instances at steady state.]
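A sketch of the expand/shrink step via boto3, assuming an already-running cluster; the cluster and instance-group IDs are placeholders.

    import boto3

    emr = boto3.client("emr")
    # Grow the task group to 25 instances for the overnight batch window;
    # the same call with InstanceCount=9 shrinks it back for steady state.
    emr.modify_instance_groups(
        ClusterId="j-XXXXXXXXXXXXX",                    # placeholder cluster id
        InstanceGroups=[{
            "InstanceGroupId": "ig-XXXXXXXXXXXX",       # placeholder group id
            "InstanceCount": 25,
        }],
    )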
Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
Scenario #1 (On-Demand only): job flow runs 14 hours
Cost without Spot: 4 instances * 14 hrs * $0.50 = $28.00
Scenario #2 (5 Spot task instances added): job flow runs 7 hours
Cost with Spot: 4 instances * 7 hrs * $0.50 = $14.00, plus 5 instances * 7 hrs * $0.25 = $8.75; Total = $22.75
Time savings: 50%
Cost savings: ~19%
Other EMR + Spot use cases: Run the entire cluster on Spot for the biggest cost savings; reduce the cost of application testing
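The arithmetic behind the two scenarios, as a quick Python check (the prices and instance counts are the slide's own example figures):

    on_demand_price, spot_price = 0.50, 0.25                # $/instance-hour
    cost_1 = 4 * 14 * on_demand_price                       # $28.00
    cost_2 = 4 * 7 * on_demand_price + 5 * 7 * spot_price   # $14.00 + $8.75
    print(f"Scenario 1: ${cost_1:.2f}")                     # $28.00
    print(f"Scenario 2: ${cost_2:.2f}")                     # $22.75
    print(f"Savings: {1 - cost_2 / cost_1:.0%}")            # ~19%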
Monitoring Clusters with CloudWatch
Free CloudWatch Metrics and Alarms: Track Hadoop job progress
Alarm on degradations in cluster health
Monitor aggregate Elastic MapReduce usage
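As one concrete example, EMR publishes an IsIdle metric that can drive an alarm on wasted capacity. A minimal boto3 sketch follows; the alarm name, cluster id, and SNS topic ARN are placeholders.

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    # Alarm when the cluster has been idle for 15 minutes (three 5-minute periods)
    cloudwatch.put_metric_alarm(
        AlarmName="emr-cluster-idle",                   # placeholder name
        Namespace="AWS/ElasticMapReduce",
        MetricName="IsIdle",
        Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=1.0,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
    )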
Big Data Ecosystem And Tools
We have a rapidly growing ecosystem and will continue to integrate with a wide range of partners. Some examples:
Business Intelligence: MicroStrategy, Pentaho
Analytics: Datameer, Karmasphere, Quest
Open source: Ganglia, SQuirreL SQL
Resources
Amazon Elastic MapReduce
aws.amazon.com/elasticmapreduce
aws.amazon.com/articles/Elastic-MapReduce
forums.aws.amazon.com/forum.jspa?forumID=52