Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Scaling Your Analytics with Amazon Elastic MapReduce Peter Sirota, General Manager - Amazon Elastic MapReduce November 14, 2013


Big data technologies let you work with any velocity, volume, or variety of data in a highly productive environment. Join the General Manager of Amazon EMR, Peter Sirota, to learn how to scale your analytics, use Hadoop with Amazon EMR, write queries with Hive, develop real world data flows with Pig, and understand the operational needs of a production data platform.

Transcript of Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Page 1: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Scaling Your Analytics with Amazon Elastic MapReduce

Peter Sirota, General Manager - Amazon Elastic MapReduce

November 14, 2013

Page 2: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Agenda

• Amazon EMR: Hadoop in the cloud

• Hadoop Ecosystem on Amazon EMR

• Customer Use Cases

Page 3: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Hadoop is the right system for Big Data

• Scalable and fault tolerant

• Flexibility for multiple languages and data formats

• Open source

• Ecosystem of tools

• Batch and real-time analytics

Page 4: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Challenges with Hadoop

On Premise

• Manage HDFS, upgrades, and system administration

• Pay for expensive support contracts

• Select hardware in advance and stick with predictions

On Amazon EC2

• Difficult to integrate with AWS storage services

• Independently manage and monitor clusters

Page 5: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Amazon EMR is the easiest way to run Hadoop in the cloud

Page 6: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Why Amazon EMR?

• Managed services

• Easy to tune clusters and trim costs

• Support for multiple data stores

• Unique features and ecosystem support

Page 7: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Input data S3, DynamoDB, Redshift

Page 8: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Code

Input data S3, DynamoDB, Redshift

Elastic MapReduce

Page 9: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Elastic MapReduce

Code Name node

Input data S3, DynamoDB, Redshift

Page 10: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Elastic MapReduce

Code Name node

Input data

Elastic cluster

S3, DynamoDB, Redshift

S3/HDFS

Page 11: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Elastic MapReduce

Code Name node

Input data

S3/HDFS Queries + BI

Via JDBC, Pig, Hive

S3, DynamoDB, Redshift

Elastic cluster

Page 12: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Elastic MapReduce

Code Name node

Output

Input data

Queries + BI

Via JDBC, Pig, Hive

S3, DynamoDB, Redshift

Elastic cluster

S3/HDFS

Page 13: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Output

Input data S3, DynamoDB, Redshift

Page 14: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Elastic clusters

Customize size and type to reduce costs

Page 15: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Choose your instance types

Try out different configurations to find your optimal architecture

• CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge

• Memory: m1.large, m2.2xlarge, m2.4xlarge

• Disk: hs1.8xlarge

Page 16: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Long running or transient clusters

Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need

Page 17: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Resizable clusters

Easy to add and remove compute capacity on your cluster

10 hours

Page 18: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Resizable clusters

Easy to add and remove compute capacity on your cluster

6 hours

Page 19: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Resizable clusters

Easy to add and remove compute capacity on your cluster

Peak capacity

Page 20: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Resizable clusters

Easy to add and remove compute capacity on your cluster

Matched compute demands with cluster sizing

10 hours
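The resizing slides above contrast a 10-hour run with a 6-hour run. A minimal sketch of the arithmetic behind that idea follows, assuming the work divides evenly across nodes (real jobs rarely scale perfectly). The 60 node-hours figure and the node counts are hypothetical, chosen only so the results line up with the 10-hour and 6-hour labels on the slides.

```python
import math


def wall_clock_hours(node_hours_of_work: float, nodes: int) -> float:
    """Idealized run time when work splits evenly across nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_hours_of_work / nodes


def nodes_needed(node_hours_of_work: float, deadline_hours: float) -> int:
    """Smallest node count that meets the deadline under the same assumption."""
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    return math.ceil(node_hours_of_work / deadline_hours)


work = 60.0  # hypothetical amount of work, in node-hours
print(wall_clock_hours(work, 6))    # 10.0
print(wall_clock_hours(work, 10))   # 6.0
print(nodes_needed(work, 6.0))      # 10
```

Under this idealized model the total node-hours stay the same, so adding nodes shortens the run without increasing the amount of compute consumed.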

Page 21: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Use Spot and Reserved Instances

Minimize costs by supplementing on-demand pricing

Page 22: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Easy to use Spot Instances

Name-your-price supercomputing to minimize costs

Spot for task nodes: up to 90% off Amazon EC2 on-demand pricing

On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
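To make the core/task split above concrete, here is a rough sketch of the blended hourly cost of a cluster that runs core nodes on-demand and task nodes on Spot. The function name, node counts, price, and discount are all hypothetical; actual Spot prices fluctuate, and the slide only states a discount of "up to" 90%.

```python
def cluster_hourly_cost(
    core_nodes: int,
    task_nodes: int,
    on_demand_price: float,
    spot_discount: float,
) -> float:
    """Blended hourly cost: core nodes at on-demand price, task nodes on Spot.

    spot_discount is the assumed fraction off the on-demand price (0.0 to 1.0).
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0.0 and 1.0")
    spot_price = on_demand_price * (1.0 - spot_discount)
    return core_nodes * on_demand_price + task_nodes * spot_price


# Hypothetical cluster: 4 core nodes, 16 task nodes, 1.00 per node-hour.
all_on_demand = cluster_hourly_cost(4, 16, 1.0, 0.0)
with_spot = cluster_hourly_cost(4, 16, 1.0, 0.5)
print(all_on_demand, with_spot)
```

Keeping core nodes on-demand reflects the slide's layout; Spot capacity can be reclaimed, so it is applied here only to task nodes.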

Page 23: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

24/7 clusters on Reserved Instances

Minimize cost for consistent capacity

Reserved Instances for long running clusters: up to 65% off on-demand pricing

Page 24: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Your data, your choice

Easy to integrate Amazon EMR with your data stores

Page 25: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013
Page 26: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Using Amazon S3 and HDFS

Data Sources

Transient EMR cluster for batch map/reduce jobs for daily reports

Long running EMR cluster holding data in HDFS for Hive interactive queries

Weekly Report

Ad-hoc Query

Data aggregated and stored in Amazon S3

Page 27: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Use Amazon EMR with Amazon Redshift and Amazon S3

Data Sources

Daily data aggregated in Amazon S3

Amazon EMR cluster used to process data

Processed data loaded into Amazon Redshift data warehouse
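The last step of the flow above, loading processed data from Amazon S3 into Amazon Redshift, is typically done with Redshift's COPY command. The sketch below only renders such a statement as a string; the table name, S3 prefix, and IAM role ARN are placeholders, and the tab delimiter is an assumption about the output format of the EMR job.

```python
def redshift_copy_sql(
    table: str,
    s3_prefix: str,
    iam_role_arn: str,
    delimiter: str = "\\t",
) -> str:
    """Render a Redshift COPY statement that loads delimited files from S3.

    Inputs are interpolated as-is, so this is only suitable for trusted values.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"DELIMITER '{delimiter}';"
    )


# All identifiers below are hypothetical placeholders.
sql = redshift_copy_sql(
    "daily_agg",
    "s3://example-bucket/processed/",
    "arn:aws:iam::123456789012:role/ExampleRole",
)
print(sql)
```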

Page 28: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Use the Hadoop Ecosystem on Amazon EMR

Leverage a diverse set of tools to get the most out of your data

Page 29: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

• Databases

• Machine learning

• Metadata stores

• Exchange formats

• Diverse query languages

Hadoop 2.x

and much more...

Page 30: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Use Hive on Amazon EMR to interact with your data in HDFS and Amazon S3

• Data warehouse for Hadoop

• Integration with Amazon S3 for better performance reading and writing to Amazon S3

• SQL-like query language to make iterative queries easier

• Easy to scale in HDFS on a persistent Amazon EMR cluster
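One common way Hive works with data in Amazon S3 is through an external table whose LOCATION points at an S3 prefix. The sketch below renders such a HiveQL statement from Python; the table name, columns, and bucket path are hypothetical, and tab-delimited text files are an assumption about the data layout.

```python
from typing import Sequence, Tuple


def external_table_ddl(
    table: str,
    columns: Sequence[Tuple[str, str]],
    s3_location: str,
    delimiter: str = "\\t",
) -> str:
    """Render HiveQL for an external table backed by delimited files in S3."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}';"
    )


# Hypothetical clickstream table; names and path are placeholders.
ddl = external_table_ddl(
    "clicks",
    [("user_id", "STRING"), ("url", "STRING"), ("ts", "BIGINT")],
    "s3://example-bucket/clickstream/",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metadata and leaves the files in S3 in place.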

Page 31: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Use HBase on a persistent Amazon EMR cluster as a column-oriented scalable data store

• Billions of rows and millions of columns

• Backup to and restore from Amazon S3

• Flexible datatypes

• Modulate your HBase tables when adding new data to your system

Page 32: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Use ad-hoc queries on your cluster to drive insights in real-time

Spark / Shark

• In-memory MapReduce for faster queries

• Use HiveQL to interact with your data

Page 33: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Use ad-hoc queries on your cluster to drive insights in real-time

Spark / Shark

• In-memory MapReduce for faster queries

• Use HiveQL to interact with your data

Impala (coming soon!)

• Parallel database engine for Hadoop

• Use SQL to query data in HDFS on your cluster in real-time

Page 34: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

“Hadoop-as-a-Service [Amazon EMR] offers a better price-performance ratio [than bare-metal Hadoop].”

1. Elastic clusters and cost optimization

2. Rapid, tuned provisioning

3. Agility for experimentation

4. Easy integration with diverse datastores

Page 35: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Diverse set of partners to build on Amazon EMR

BI / Visualization Business Intelligence BI / Visualization BI / Visualization

Hadoop Distribution Data Transfer Encryption Data Transformation

Monitoring Performance Tuning Graphical IDE Graphical IDE

Available on AWS Marketplace Available as a distribution in Amazon Elastic MapReduce

ETL Tool

Page 36: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Thousands of customers

Page 37: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013


How Netflix scales Big Data Platform on Amazon EMR

Eva Tse, Director of Big Data Platform, Netflix

November 14, 2013

Page 38: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Hadoop ecosystem as our Data Analytics platform

in the cloud

Page 39: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

How did we get here?

Page 40: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013
Page 41: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013
Page 42: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

How do we scale?

Page 43: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013
Page 44: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Separate compute and storage layers

Page 45: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Amazon S3 as our DW

Page 46: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

S3

Source of

truth

Page 47: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

S3 S3mper-enabled

Source of

truth

Page 48: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Multiple clusters

Page 49: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

S3

Source of

truth

zone x zone y

Ad hoc SLA

Page 50: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

S3

Source of

truth

zone x zone y zone z

SLA Ad hoc

Bonus Bonus Bonus

Page 51: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Unified and global big data collection pipeline

Page 52: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Ursula

cloud apps

Suro

SLA

Source of

truth

S3

Events Pipeline

Aegisthus

Dimension Pipeline

Bonus

Adhoc

Page 53: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Innovate – services and tools

Page 54: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

CLIs Gateways

Sting

Page 55: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Putting into perspective …

• Billions of viewing hours of data

• ~3000 nodes clusters

• Hundred billion events / day

• Few petabytes DW on Amazon S3

• Thousands of jobs / day

Page 56: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Adhoc querying

Page 57: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Simple Reporting

Page 58: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

ETL

Page 59: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Analytics and statistical modeling

Page 60: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013
Page 61: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Open Connect

Page 62: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

What works for us? Scalability

Page 63: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

What works for us? Hadoop integration on Amazon EC2 / AWS

Page 64: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

What works for us? Let us focus on innovation and build a solution

Page 65: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

What works for us?

Tight engagement with Amazon EMR & Amazon EC2 teams for tactical issues and strategic roadmap

Page 66: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Next Steps …

• Heterogeneous node cluster

• Auto expand / shrink

• Richer monitoring infrastructure

Page 67: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

We strive to build the best-of-class big data platform in the cloud

Page 68: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013


Big Data at Channel 4 Amazon Elastic MapReduce for Competitive Advantage

Bob Harris – Channel 4 Television

14th November 2013

Page 69: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Channel 4 – Background

• Channel 4 is a public service, commercially funded, not-for-profit broadcaster.

• We have a remit to deliver innovative, experimental, distinctive, and diverse content across television, film, and digital media.

• We are funded predominantly by television advertising, competing with the other established UK commercial broadcasters, and increasingly with emerging Internet-based providers.

• Our content is available across our portfolio of around 10 core and time-shift channels, and our on-demand service 4oD is accessible across multiple devices and platforms.

Page 70: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Why Big Data at C4

Page 71: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Business Intelligence at C4

• Well established Business Intelligence capability

• Based on industry standard proprietary products

• Real-time data warehousing

• Comprehensive business reporting

• Excellent internal skills

• Good external skills availability

Page 72: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Big Data Technology at C4

• 2011 - Embarked on Big Data initiative

– Ran in-house and cloud-based PoCs

– Selected Amazon EMR

• 2012 - Ran Amazon EMR in parallel with conventional BI

– Hive deployed to Data Analysts

– Amazon EMR workflows deployed to production

• 2013 - Amazon EMR confirmed as primary Big Data platform

– Amazon EMR usage growing, focus on automation

– Experimenting with Mahout for Machine Learning

Page 73: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

What problems are we solving?

Single view of the viewer, recognising them across devices and serving relevant content

Personalising the viewer experience

Page 74: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

How are we doing this?

• Principal tasks…

– Audience segmentation

– Personalisation

– Recommendations

• What data do we process…

– Website clickstream logs

– 4oD activity and viewing history

– Over 9m registered users

– Majority of activity now from “logged-in” users

Page 75: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

High-Level Architecture

Page 76: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

High-Level Architecture

• Amazon EMR and existing BI technology are complementary

• Process billions of data rows in Amazon EMR, store millions of result rows in RDBMS

• No need to “rip and replace”, existing technology investment is protected

• Amazon EMR will continue to underpin major growth in data volumes and processing complexity

Page 77: Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Where Next?

• Continued growth in usage of Amazon EMR

• Migrate to Hadoop 2.x

• Adopt Amazon Redshift

• Improved integration between C4 and AWS

• Shift toward “near real-time” processing
