Big Data Use Cases and Solutions in the AWS Cloud

Post on 08-Sep-2014

description

The AWS cloud computing platform has disrupted big data. Managing big data applications used to be feasible only for well-funded research organizations and large corporations, but no longer. Hear from Ben Butler, Big Data Solutions Marketing Manager for AWS, how our customers are using big data services in the AWS cloud to innovate faster than ever before. AWS technology is not only available to everyone; it is also self-service and on-demand, with innovative technology, flexible pricing models, low cost, and no commitments. Learn from customer success stories as Ben shares real-world case studies describing the specific big data challenges being solved on AWS. We will conclude with a discussion of the tutorials, public datasets, test drives, and our grants program: all of the resources needed to get you started quickly.

Transcript of Big Data Use Cases and Solutions in the AWS Cloud

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Big Data Use Cases and Solutions in the AWS Cloud

Ben Butler, @bensbutler, Sr. Mgr., Big Data & HPC

July 10, 2014

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Big Data: Unconstrained data growth

• 95% of the 1.2 zettabytes of data in the digital universe is unstructured

• 70% of this is user-generated content

• Unstructured data growth is explosive, with compound annual growth rate (CAGR) estimated at 62%

Source: IDC

Scale shown: GB, TB, PB, EB, ZB

The amount of information generated during the first day of a baby’s life today is equivalent to 70 times the information contained in the Library of Congress
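To see how quickly the 62% CAGR figure compounds, here is a minimal sketch. It uses only the three numbers quoted on the slide and assumes, purely for illustration, that the growth rate stays constant from year to year; the function names are hypothetical.

```python
import math

TOTAL_ZB = 1.2             # size of the digital universe, from the slide
UNSTRUCTURED_SHARE = 0.95  # unstructured share, from the slide
CAGR = 0.62                # compound annual growth rate, from the slide


def projected_zb(years: float) -> float:
    """Unstructured volume (ZB) after `years` of compounding at CAGR.

    Assumes a constant growth rate, which is an extrapolation.
    """
    return TOTAL_ZB * UNSTRUCTURED_SHARE * (1 + CAGR) ** years


def years_to_multiply(factor: float) -> float:
    """Years needed for the volume to grow by `factor` at CAGR."""
    return math.log(factor) / math.log(1 + CAGR)


if __name__ == "__main__":
    for year in range(6):
        print(f"year {year}: {projected_zb(year):.2f} ZB")
    print(f"years to grow tenfold: {years_to_multiply(10):.1f}")
```

Under that constant-rate assumption the unstructured volume grows tenfold in a little under five years.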

Generation: lower cost, higher throughput

Collection & storage, analytics & computation, collaboration & sharing: highly constrained

Chart: Data volume gap between generated data and data available for analysis, 1990 to 2020

Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = Remove constraints

Accelerated: generation, collection & storage, analytics & computation, collaboration & sharing

Big Data: technologies and techniques for working productively with data, at any scale.

Big data and AWS Cloud computing

Big data: Variety, volume, and velocity requiring new tools

Cloud computing: Variety of compute, storage, and networking options

Big data: Potentially massive datasets

Cloud computing: Massive, virtually unlimited capacity

Big data: Iterative, experimental style of data manipulation and analysis

Cloud computing: Iterative, experimental style of infrastructure deployment/usage

Big data: Frequently not steady-state workload; peaks and valleys

Cloud computing: At its most efficient with highly variable workloads

Big data: Absolute performance not as critical as “time to results”; shared resources are a bottleneck

Cloud computing: Parallel compute projects allow each workgroup to have more autonomy, get faster results

One tool to rule them all

Use the right tools

Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon Redshift, Amazon Elastic MapReduce

Amazon S3

• Store anything

• Object storage

• Scalable

• 99.999999999% durability
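The durability figure for Amazon S3 is easier to grasp as an expected loss rate. A minimal sketch, assuming the 99.999999999% figure is read as an average annual per-object durability; that reading, the object count, and the function names are illustrative assumptions, not something the slide states.

```python
from decimal import Decimal

DURABILITY = Decimal("0.99999999999")     # eleven nines, from the slide
ANNUAL_LOSS_PROBABILITY = 1 - DURABILITY  # assumed per object, per year


def expected_annual_losses(object_count: int) -> Decimal:
    """Average number of objects lost per year for a given object count."""
    return ANNUAL_LOSS_PROBABILITY * object_count


def years_per_lost_object(object_count: int) -> Decimal:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    stored = 10_000_000  # hypothetical number of stored objects
    print(expected_annual_losses(stored))
    print(years_per_lost_object(stored))
```

Decimal arithmetic is used so that the eleven nines are not blurred by binary floating-point rounding.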

Amazon Kinesis

• Real-time processing

• High throughput; elastic

• Easy to use

• EMR, S3, Redshift, DynamoDB integrations

Amazon DynamoDB

• NoSQL database

• Seamless scalability

• Zero admin

• Single-digit millisecond latency

Amazon Redshift

• Relational data warehouse

• Massively parallel

• Petabyte scale

• Fully managed

• $1,000/TB/Year

Try Amazon Redshift with BI & ETL for Free!

aws.amazon.com/redshift/free-trial

2 months | 750 hours/month | dw2.large SSD instance

160GB of compressed storage per node

Try BI & ETL for free from nine partners at

aws.amazon.com/redshift/partners

Amazon Elastic MapReduce

• Hadoop/HDFS clusters

• Hive, Pig, Impala, HBase

• Easy to use; fully managed

• On-demand and spot pricing

• Tight integration with S3, DynamoDB, and Kinesis
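As an illustration of mixing on-demand and spot pricing in one cluster, the sketch below builds an instance-group description as plain Python data. The key names are intended to follow the `InstanceGroups` structure that boto3's EMR `run_job_flow` call accepts; boto3 post-dates this talk and is not imported here, and the helper name, instance type, counts, and bid price are all hypothetical.

```python
from typing import Any, Dict, List


def emr_instance_groups(core_count: int, task_count: int,
                        spot_bid_usd: str) -> List[Dict[str, Any]]:
    """Describe a cluster whose master and core nodes are on-demand
    (they hold HDFS state) while extra task nodes are bid for on the
    spot market, since they can be lost without losing data."""
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m1.xlarge", "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m1.xlarge", "InstanceCount": core_count},
        {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
         "BidPrice": spot_bid_usd,  # hypothetical bid, in USD per hour
         "InstanceType": "m1.xlarge", "InstanceCount": task_count},
    ]


if __name__ == "__main__":
    for group in emr_instance_groups(core_count=4, task_count=10,
                                     spot_bid_usd="0.10"):
        print(group["Name"], group["Market"], group["InstanceCount"])
```

Keeping the spot capacity in a separate task group is one common design choice; it is not described in the slides.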

ODBC and JDBC drivers now for Amazon EMR

Amazon EMR now ships with ODBC and JDBC drivers for Hive, Impala, and HBase

Easier to use popular BI tools like Microsoft Excel, Tableau, MicroStrategy, and QlikView

The right tools.

At the right scale.

At the right time.

Diagram (built up over several slides): data sources, AWS Data Pipeline, Amazon Kinesis, Amazon EMR, HDFS, Amazon S3, Amazon DynamoDB, Amazon Redshift, and Amazon RDS, labeled "Data management" and "Hadoop Ecosystem analytical tools"

AWS Data Pipeline

Free steak campaign

Disaster recovery

Web site & media sharing

Facebook app

Ground campaign

SAP & SharePoint

Marketing web site

Business line of sight

Consumer social app

IT operations

Mars exploration ops

Interactive TV apps

Media streaming

Consumer social app

Facebook page

Securities Trading Data Archiving

Financial markets analytics

Web and mobile apps

Big data analytics

Digital media

Ticket pricing optimization

Streaming webcasts

Mobile analytics

Consumer social app

Core IT and media

Customer Use Cases of Big Data

Dropcam is the biggest inbound video service on the Web

• More data uploaded per minute than YouTube

• Petabytes of data processed every month

• Billions of motion events detected

4 months to production

300% speed gain

$500k - $1M in CAPEX saved

500MM tweets/day = ~20.8MM tweets/hr

2 KB/tweet is ~12 MB/sec, need 6 shards, ~1 TB/day

$0.015/hour per shard, $0.028/million PUTs

Kinesis cost is $0.765/hour

Redshift cost is $0.850/hour (for a 2 TB dw1.xlarge)

Total: $1.615/hour
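The sizing and cost figures above can be re-derived from the slide's own inputs. A minimal sketch, using only the quoted numbers and hypothetical function names. The shard count is left as a parameter because, with the quoted prices, 12 shards come closer to the $0.765/hour Kinesis figure than the 6 shards stated, which suggests rounding or an unstated assumption on the slide.

```python
TWEETS_PER_DAY = 500_000_000
BYTES_PER_TWEET = 2_000          # "2k/tweet", read as 2,000 bytes (assumption)
SHARD_PRICE_PER_HOUR = 0.015     # USD, from the slide
PUT_PRICE_PER_MILLION = 0.028    # USD, from the slide
REDSHIFT_PRICE_PER_HOUR = 0.850  # USD, from the slide


def tweets_per_hour() -> float:
    return TWEETS_PER_DAY / 24


def ingest_mb_per_second() -> float:
    return TWEETS_PER_DAY * BYTES_PER_TWEET / 86_400 / 1e6


def ingest_tb_per_day() -> float:
    return TWEETS_PER_DAY * BYTES_PER_TWEET / 1e12


def kinesis_cost_per_hour(shards: int) -> float:
    """Shard-hours plus PUT charges, assuming one PUT per tweet."""
    put_cost = tweets_per_hour() / 1e6 * PUT_PRICE_PER_MILLION
    return shards * SHARD_PRICE_PER_HOUR + put_cost


if __name__ == "__main__":
    print(f"{tweets_per_hour() / 1e6:.1f}MM tweets/hr")
    print(f"{ingest_mb_per_second():.1f} MB/sec, {ingest_tb_per_day():.1f} TB/day")
    for shards in (6, 12):
        kinesis = kinesis_cost_per_hour(shards)
        total = kinesis + REDSHIFT_PRICE_PER_HOUR
        print(f"{shards} shards: Kinesis ${kinesis:.3f}/hr, total ${total:.3f}/hr")
```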

Cost & Scale

http://wefeel.csiro.au/#/

“THANKS TO AMAZON WEB SERVICES, WE CAN DELIGHT OUR PLAYERS WORLDWIDE.”

Sami Yliharju | Services Lead

The Climate Corporation - Weather Insurance for Farms

Challenge: Volatile weather is deadly to crops like grapes

Solution: Built a predictive model based on freely available data:

• 60 years of crop data,

• 14 TB of soil data, and

• 1M government Doppler radar points

50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.

• 150B soil observations

• 3M daily weather measurements

• 850K precision rainfall grids tracked

• 200 TB in Amazon S3

Foursquare…

• 33 million users

• 1.3 million businesses

…generates a lot of data

• 3.5 billion check-ins

• 15M+ venues

• Terabytes of log data

Uses EMR for

Evaluation of new features

Machine learning

Exploratory analysis

Daily customer usage reporting

Long-term trend analysis

Benefits of Amazon EMR

Ease of use: “We have decreased the processing time for urgent data-analysis”

Flexibility: To deal with changing requirements & dynamically expand reporting clusters

Costs: “We have reduced our analytics costs by over 50%”

Chart: Who is checking in? (by gender, female vs. male, and by age)

Chart: When do people go to a place? (Gorilla Coffee, Gray's Papaya, Amorino; Thursday to Sunday)

Chart: User sign-ups

Generation

Collection & storage: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Direct Connect, AWS Storage Gateway, AWS Import/Export, Amazon Kinesis, Amazon EMR

Analytics & computation: Amazon EC2, Amazon EMR, Amazon Kinesis

Collaboration & sharing: Amazon Redshift, Amazon DynamoDB, Amazon RDS, Amazon S3, Amazon EC2, Amazon EMR, Amazon CloudFront, AWS CloudFormation, AWS Data Pipeline

DataXu in the Cloud

Yekesa Kosuru, V.P Technology

July 10th 2014

What is DataXu?

• Digital Marketing Platform, Ad Tech Platform

• Real-time Multivariate Decision System

• 5th Fastest Growing Private Company in U.S. (Inc. 500)

• Optimize Digital Marketing Campaigns

– …put the right ad campaign in front of the right customer

– …find customers who left their site without converting

– …find more customers who are likely to convert

– …offer insight into who the respondents are, and why, when, and where they respond

• 950,000 times per second

Big Data, Little Decisions

The Evolution of Real-Time Decision Systems

Chart: decision impact (also proportional to risk) plotted against decision rate; volume ~ value

1. 1990’s – “Should we advertise on the Super Bowl? Should we run direct mail this qtr.?” Batch mode

2. 2000’s – “How often can we run a permission-based email mktg. campaign?” Rules-based alerts

3. 2010’s – Millions of decisions and actions taken, all in less than a blink of an eye

Real Time Bidding

1. User opens browser

2. Goes to sports site

3. Site auctions ads, e.g. Google

4. DataXu bids (others bid too)

5. DataXu wins bid

6. Ad shown, page loads

Quick Statistics

• 950K bid requests per second

• Billions of impressions per month, Petabyte of data

• 100 ms round trip response time

• 100+TB of warehouse data

• 3000+ Servers powering the platform
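Two of these statistics combine into a rough concurrency estimate via Little's law (requests in flight = arrival rate × time in system). A minimal sketch, assuming the 100 ms round trip is the time each bid request spends in the system; if the slide means it as a ceiling rather than an average, the result is an upper bound. The function names are hypothetical.

```python
BID_REQUESTS_PER_SECOND = 950_000  # from the slide
ROUND_TRIP_MS = 100                # from the slide
SECONDS_PER_DAY = 86_400


def requests_in_flight(rate_per_second: int, latency_ms: int) -> float:
    """Little's law: average concurrency = arrival rate * time in system."""
    return rate_per_second * latency_ms / 1000


def requests_per_day(rate_per_second: int) -> int:
    """Daily request volume if the quoted rate were sustained all day."""
    return rate_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    print(requests_in_flight(BID_REQUESTS_PER_SECOND, ROUND_TRIP_MS))
    print(requests_per_day(BID_REQUESTS_PER_SECOND))
```

The daily total assumes the quoted rate is sustained around the clock, which the slide does not claim.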

Why AWS

• Automation, API

• Costs, Pay As You Go

• Auto Scaling (elasticity – up and down)

• All Data in One Place (S3 foundational store)

• Improved Testability

• Security, Privacy

• Disaster Recovery and Business Continuity

DataXu Stack

Diagram components: Campaign Management; Business Intelligence; Data Mart; Interactive Queries; Batch Queries; Real Time Bidding System; Activity Logs; 1st Party; 3rd Party; Distributed Log Ingestion; S3/HDFS Warehouse; CDN; User Profiles; Campaign Metadata; ETL; Attribution; Machine Learning; Spend; Decision System; Audience Calculation; Uniques/Segment

Big Velocity: 950K TPS

Big Volume: Petabyte of Data

Big Variety: Data Providers

High Level Deployment

Diagram components: On Premise; SSL; Meta; Amazon S3; RTB System; Elastic Load Balancing; Route 53; EC2; Auto Scaling groups; Availability Zones; Volumes; AMI; Log Ingestion System; Machine Learning System; EMR; CloudWatch

Traditional Hadoop vs EMR

• Traditional Hadoop

– Anticipate and provision for peaks

– Can't de-couple storage and compute

– 75% of the cluster is idle

– Data Duplication/Multiple Clusters

• EMR to the rescue

• Monthly savings of 72% using EMR

S3 Provides Linearly Scalable Bandwidth

• Big volume workloads involve several datasets together and terabytes of data

• Aggregate bandwidth matters

• S3 scales pretty linearly

S3 Streaming Performance (m1.xlarge @ $0.34/hr)

100 VMs; 9.6GB/s; $34/hr

350 VMs; 28.7GB/s; $119/hr

34 secs per terabyte
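The benchmark figures above can be cross-checked with a few lines of arithmetic. A minimal sketch using only the slide's numbers and hypothetical function names, and assuming 1 TB = 1,000 GB and 1 GB = 1,000 MB:

```python
PRICE_PER_VM_HOUR = 0.34           # m1.xlarge, USD, from the slide
RUNS = [(100, 9.6), (350, 28.7)]   # (VM count, aggregate GB/s), from the slide


def hourly_cost(vms: int) -> float:
    return vms * PRICE_PER_VM_HOUR


def per_vm_mb_per_second(vms: int, aggregate_gb_per_s: float) -> float:
    return aggregate_gb_per_s * 1000 / vms


def seconds_per_terabyte(aggregate_gb_per_s: float) -> float:
    return 1000 / aggregate_gb_per_s


if __name__ == "__main__":
    for vms, gb_per_s in RUNS:
        print(f"{vms} VMs: ${hourly_cost(vms):.0f}/hr, "
              f"{per_vm_mb_per_second(vms, gb_per_s):.0f} MB/s per VM, "
              f"{seconds_per_terabyte(gb_per_s):.1f} s per TB")
```

Per-VM throughput drops only modestly between the two runs, which is the sense in which aggregate bandwidth scales close to linearly.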

Thank You

www.dataxu.com

Yekesa Kosuru, @ykosuru

ykosuru@dataxu.com

Getting Started with Big Data on AWS

AWS is here to help

• Solution Architects

• Professional Services

• Premium Support

• AWS Partner Network (APN)

aws.amazon.com/partners/competencies/big-data

Partner with an AWS Big Data expert

https://aws.amazon.com/architecture/

Processing large amounts of parallel data using a scalable cluster

AWS Architecture Diagrams

http://aws.amazon.com/marketplace

Big Data Case Studies

Learn from other AWS customers

aws.amazon.com/solutions/case-studies/big-data

AWS Marketplace

AWS Online Software Store

aws.amazon.com/marketplace

Shop the big data category

http://aws.amazon.com/marketplace

AWS Public Data Sets

Free access to big data sets

aws.amazon.com/publicdatasets

AWS Grants Program

AWS in Education

aws.amazon.com/grants

AWS Big Data Test Drives

APN Partner-provided labs

aws.amazon.com/testdrive/bigdata

https://aws.amazon.com/training

AWS Training & Events

Webinars, Bootcamps, and Self-Paced Labs

aws.amazon.com/events

Big Data on AWS

Course on Big Data

aws.amazon.com/training/course-descriptions/bigdata

reinvent.awsevents.com

aws.amazon.com/big-data

Thank you!

Ben Butler, @bensbutler, Sr. Mgr., Big Data

July 10, 2014 – http://aws.amazon.com/big-data