Big Data and Hadoop in the Cloud

Jose Papo, Amazon Evangelist (@josepapo)

Big Data and Hadoop in the Cloud - Presentation given at the Colombia 3.0 conference in Bogotá, Colombia

Transcript of Big Data and Hadoop in the Cloud

Page 1: Big Data and Hadoop in the Cloud

Jose Papo

Amazon Evangelist

@josepapo

Page 2: Big Data and Hadoop in the Cloud

HANDS-ON DEMOS AFTER THE BIG DATA SESSION

Page 3: Big Data and Hadoop in the Cloud

The Cloud is the driver of new technology trends

Page 4: Big Data and Hadoop in the Cloud

Accelerating the startup boom

Page 5: Big Data and Hadoop in the Cloud

Optimizing the corporate world

Page 6: Big Data and Hadoop in the Cloud

#1 ●○○○○

Page 7: Big Data and Hadoop in the Cloud

We are sincerely eager to

hear your feedback on this

presentation and on re:Invent.

Please fill out an evaluation

form when you have a

chance.

We are constantly producing more data

Page 8: Big Data and Hadoop in the Cloud

From all types of industries

Page 9: Big Data and Hadoop in the Cloud

Collect,

Store,

Organize,

Analyze &

Share

Page 10: Big Data and Hadoop in the Cloud

3Vs

Page 11: Big Data and Hadoop in the Cloud

27 TB per day Large Hadron Collider – CERN

Page 12: Big Data and Hadoop in the Cloud
Page 13: Big Data and Hadoop in the Cloud
Page 14: Big Data and Hadoop in the Cloud

The Role of Data is Changing

Page 15: Big Data and Hadoop in the Cloud

Until now, the questions you asked drove the data model

The new model is to collect as much data as possible: the “Data-First Philosophy”

Page 16: Big Data and Hadoop in the Cloud

Data is the new raw material for any business, on par with capital, people, and labor

Page 17: Big Data and Hadoop in the Cloud

Data

Actionable Information

Page 18: Big Data and Hadoop in the Cloud

Generated

data

Available for analysis

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 19: Big Data and Hadoop in the Cloud

Data Strategist

Page 20: Big Data and Hadoop in the Cloud

1.1M peak requests/sec

Page 21: Big Data and Hadoop in the Cloud

lunch hours last year?

Page 22: Big Data and Hadoop in the Cloud

select productId, count(*) from page_hits where hour in (12,13) group by productId order by count(*) desc

cat *-{12,13} | cut -f3 | sort | uniq -c > out

Hit <enter>?
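The query on this slide can also be sketched in Python on a small sample. This is only an illustration of the counting logic, not the presenter's code: the record layout (hour of day, product id) and the names `top_products_at_lunch` and `sample_hits` are assumptions made for the example.

```python
from collections import Counter


def top_products_at_lunch(page_hits, lunch_hours=(12, 13)):
    """Count hits per product during the given hours, most-hit first."""
    counts = Counter(
        product_id for hour, product_id in page_hits if hour in lunch_hours
    )
    # most_common() sorts by count, descending, like ORDER BY count(*) DESC
    return counts.most_common()


# Hypothetical sample records: (hour of day, product id)
sample_hits = [(12, "A"), (13, "B"), (12, "A"), (9, "C"), (13, "A"), (12, "B")]
print(top_products_at_lunch(sample_hits))
```

On one machine this works only while the data fits in memory, which is the limitation the following slides address.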

Page 23: Big Data and Hadoop in the Cloud

1PB = 10^15 (1,000,000,000,000,000) bytes

1 PB = 231 days at 50MB/s
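The 231-day figure can be checked with a few lines of arithmetic. The sketch below also shows what happens if the read is spread across several workers; the `workers` parameter and the assumption of perfect linear scaling are illustrative additions, not figures from the slide.

```python
PETABYTE_BYTES = 10**15
THROUGHPUT_BYTES_PER_SEC = 50 * 10**6  # 50 MB/s, as on the slide
SECONDS_PER_DAY = 86_400


def days_to_read(total_bytes, bytes_per_sec, workers=1):
    """Days needed to scan total_bytes, assuming ideal linear scaling."""
    return total_bytes / (bytes_per_sec * workers) / SECONDS_PER_DAY


single = days_to_read(PETABYTE_BYTES, THROUGHPUT_BYTES_PER_SEC)
parallel_hours = 24 * days_to_read(
    PETABYTE_BYTES, THROUGHPUT_BYTES_PER_SEC, workers=1000
)
print(f"1 worker: {single:.1f} days")
print(f"1000 workers (hypothetical): {parallel_hours:.1f} hours")
```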

Page 24: Big Data and Hadoop in the Cloud

Solution: Massively Parallel Processing

Page 25: Big Data and Hadoop in the Cloud

#2 ○●○○○

Page 26: Big Data and Hadoop in the Cloud
Page 27: Big Data and Hadoop in the Cloud

HDFS Reliable storage

MapReduce Data analysis

Page 28: Big Data and Hadoop in the Cloud

Very large log (e.g., TBs)

Page 29: Big Data and Hadoop in the Cloud

Very large log (e.g., TBs)

Lots of actions by John

Page 30: Big Data and Hadoop in the Cloud

Very large log (e.g., TBs)

Lots of actions by John

Split into small pieces

Page 31: Big Data and Hadoop in the Cloud

Very large log (e.g., TBs)

Lots of actions by John

Split into small pieces

Process in a Hadoop cluster

Page 32: Big Data and Hadoop in the Cloud

Very large log (e.g., TBs)

Lots of actions by John

Split into small pieces

Process in a Hadoop cluster

Aggregate the results

John’s history
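The split, process, and aggregate steps on this slide can be sketched in plain Python. This is a single-process illustration of the MapReduce idea, not Hadoop code: the log format ("user,action") and the function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def split_log(lines, chunk_size):
    """Split the log into small pieces."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk, user):
    """Map step: emit (action, 1) for every line that belongs to `user`."""
    pairs = []
    for line in chunk:
        who, action = line.split(",", 1)
        if who == user:
            pairs.append((action, 1))
    return pairs


def reduce_pairs(pairs):
    """Reduce step: sum the counts emitted for each action."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical log lines in "user,action" form
log = ["john,login", "mary,view", "john,view", "john,view", "mary,buy", "john,buy"]
chunks = split_log(log, 2)
mapped = chain.from_iterable(map_chunk(c, "john") for c in chunks)
history = reduce_pairs(mapped)
print(history)
```

In a real cluster each chunk would be mapped on a different worker node; here the chunks are processed one after another only to keep the example self-contained.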

Page 33: Big Data and Hadoop in the Cloud

Worker node: Input file → map → reduce → Output file

Page 34: Big Data and Hadoop in the Cloud

Three worker nodes, each running: Input file → map → reduce → Output file

Page 35: Big Data and Hadoop in the Cloud

How can we help John?

Very large log (e.g., TBs) → Actionable Insight

Page 36: Big Data and Hadoop in the Cloud

Deploying a Hadoop Cluster is Hard

Page 37: Big Data and Hadoop in the Cloud
Page 38: Big Data and Hadoop in the Cloud

#3 ♥

○○●○○

Page 39: Big Data and Hadoop in the Cloud

Page 40: Big Data and Hadoop in the Cloud

Elastic On Demand

Pay as you go

Focus on

YOUR

business

Page 41: Big Data and Hadoop in the Cloud

Elastic On Demand

Pay as you go

Focus on

YOUR

business

Page 42: Big Data and Hadoop in the Cloud

November

Page 43: Big Data and Hadoop in the Cloud

Provisioned capacity

November

Page 44: Big Data and Hadoop in the Cloud

76%

24%

Provisioned capacity

November

Page 45: Big Data and Hadoop in the Cloud

November

Page 46: Big Data and Hadoop in the Cloud

On and Off Fast Growth

Variable Peaks Predictable Peaks

Page 47: Big Data and Hadoop in the Cloud

On and Off Fast Growth

Predictable Peaks Variable Peaks

WASTE

CUSTOMER DISSATISFACTION

Page 48: Big Data and Hadoop in the Cloud

Fast Growth On and Off

Predictable peaks Variable peaks

Page 49: Big Data and Hadoop in the Cloud

#4 ○○○●○

Page 50: Big Data and Hadoop in the Cloud

EMR is Hadoop in the Cloud

Page 51: Big Data and Hadoop in the Cloud
Page 52: Big Data and Hadoop in the Cloud
Page 53: Big Data and Hadoop in the Cloud

Media/Advertising

Targeted Advertising

Image and Video

Processing

Oil & Gas

Seismic Analysis

Retail

Recommendations

Transactions Analysis

Life Sciences

Genome Analysis

Financial Services

Monte Carlo Simulations

Risk Analysis

Security

Anti-virus

Fraud Detection

Image Recognition

Social Network/Gaming

User Demographics

Usage analysis

In-game metrics

Page 54: Big Data and Hadoop in the Cloud

[Chart: axis scale from 0 to 6,000,000]

Page 55: Big Data and Hadoop in the Cloud

Versions

1.0.3

0.20.205

0.20

0.18

Distributions

Apache Hadoop

Page 56: Big Data and Hadoop in the Cloud

Job Flows

Custom JAR

Cascading

Streaming

Ruby, Perl, Python, PHP, R, Bash, C++

Page 57: Big Data and Hadoop in the Cloud

Data Warehouse for Hadoop

SQL-like query language

Hive

Page 58: Big Data and Hadoop in the Cloud

High-level programming

Ideal for data flow / ETL

Pig

Page 59: Big Data and Hadoop in the Cloud

Near real time key/value

store for structured data

HBase

Page 60: Big Data and Hadoop in the Cloud

Distributed monitoring

of cluster and nodes

Ganglia

Page 61: Big Data and Hadoop in the Cloud

Statistical computing

and graphics

Machine learning library

discover Value in Data

Page 62: Big Data and Hadoop in the Cloud

Unknown Unknowns

Page 63: Big Data and Hadoop in the Cloud

Elastic On Demand

Pay as you go

Focus on

YOUR

business

Page 64: Big Data and Hadoop in the Cloud

Undifferentiated

Heavy Lifting

Focus on

YOUR

business

Page 65: Big Data and Hadoop in the Cloud
Page 66: Big Data and Hadoop in the Cloud
Page 67: Big Data and Hadoop in the Cloud

elastic-mapreduce --create \
  --key-pair micro \
  --region eu-west-1 \
  --name MyJobFlow \
  --num-instances 5 \
  --instance-type m2.4xlarge \
  --alive \
  --log-uri s3n://mybucket/EMR/log

Instance type/count

Page 68: Big Data and Hadoop in the Cloud

elastic-mapreduce --create \
  --key-pair micro \
  --region eu-west-1 \
  --name MyJobFlow \
  --num-instances 5 \
  --instance-type m2.4xlarge \
  --alive \
  --pig-interactive --pig-versions latest \
  --hive-interactive --hive-versions latest \
  --hbase \
  --log-uri s3n://mybucket/EMR/log

Adding Hive, Pig and HBase to the job flow

Page 69: Big Data and Hadoop in the Cloud

Elastic On Demand

Pay as you go

Focus on

YOUR

business

Page 70: Big Data and Hadoop in the Cloud

1 instance for 1000 hours

=

1000 instances for 1 hour
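The equality on this slide is simple pay-per-use arithmetic, sketched below. The hourly price is an assumed value chosen only for illustration, and the sketch assumes billing is strictly proportional to instance-hours.

```python
def cluster_cost(instances, hours, price_per_instance_hour):
    """Pay-as-you-go cost: instances x hours x hourly price."""
    return instances * hours * price_per_instance_hour


PRICE = 0.50  # assumed on-demand price in dollars per instance-hour

one_slow = cluster_cost(1, 1000, PRICE)
many_fast = cluster_cost(1000, 1, PRICE)
print(one_slow, many_fast)
```

Both runs cost the same, but the second finishes in one hour instead of roughly six weeks.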

Page 71: Big Data and Hadoop in the Cloud
Page 72: Big Data and Hadoop in the Cloud

…to Thousands

Page 73: Big Data and Hadoop in the Cloud
Page 74: Big Data and Hadoop in the Cloud
Page 75: Big Data and Hadoop in the Cloud

Turn Off the Resources and Stop Paying

Page 76: Big Data and Hadoop in the Cloud

Elastic On Demand

Pay as you go

Focus on

YOUR

business

Page 77: Big Data and Hadoop in the Cloud

Source: IDC Whitepaper, sponsored by Amazon, “The Business Value of Amazon Web Services Accelerates Over Time.” July 2012

70% lower 5 year TCO per app

AWS

On-premises

$3.01M

$0.90M

50% reduction in analytics costs

Page 78: Big Data and Hadoop in the Cloud

Save more money by using Spot Instances

Page 79: Big Data and Hadoop in the Cloud

14 hrs

Without Spot 4 instances * 14 hrs * $0.50 = $28

EMR with Spot Instances

Page 80: Big Data and Hadoop in the Cloud

14 hrs

Without Spot 4 instances * 14 hrs * $0.50 = $28

EMR with Spot Instances

14 hrs

Page 81: Big Data and Hadoop in the Cloud

14 hrs

Without Spot 4 instances * 14 hrs * $0.50 = $28

7 hrs

EMR with Spot Instances

Page 82: Big Data and Hadoop in the Cloud

With Spot 4 instances * 7 hrs * $0.50 = $14 +

14 hrs

Without Spot 4 instances * 14 hrs * $0.50 = $28

EMR with Spot Instances

7 hrs

Page 83: Big Data and Hadoop in the Cloud

With Spot 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75

Total = $22.75

14 hrs

Without Spot 4 instances * 14 hrs * $0.50 = $28

EMR with Spot Instances

7 hrs

Page 84: Big Data and Hadoop in the Cloud

Time -50% Cost -19%

With Spot 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75

Total = $22.75

14 hrs

Without Spot 4 instances * 14 hrs * $0.50 = $28

EMR with Spot Instances

7 hrs
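The Spot comparison can be reproduced directly from the figures on the slide. A minimal sketch follows; the variable names are illustrative. With these figures the cost saving works out to 18.75%, or roughly 19%.

```python
ON_DEMAND_PRICE = 0.50  # dollars per instance-hour, from the slide
SPOT_PRICE = 0.25       # dollars per instance-hour, from the slide

# Without Spot: 4 on-demand instances for 14 hours
without_spot = 4 * 14 * ON_DEMAND_PRICE

# With Spot: 4 on-demand plus 5 Spot instances, all for 7 hours
with_spot = 4 * 7 * ON_DEMAND_PRICE + 5 * 7 * SPOT_PRICE

time_saving = 1 - 7 / 14
cost_saving = 1 - with_spot / without_spot

print(f"Without Spot: ${without_spot:.2f}")
print(f"With Spot:    ${with_spot:.2f}")
print(f"Time saving:  {time_saving:.0%}")
print(f"Cost saving:  {cost_saving:.2%}")
```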

Page 85: Big Data and Hadoop in the Cloud

#5 ○○○○●

Page 86: Big Data and Hadoop in the Cloud

“What kind of movies do people like?”

Page 87: Big Data and Hadoop in the Cloud

More than 25 Million Streaming Members

50 Billion Events Per Day

30 Million plays every day

2 billion hours of video in 3 months

4 million ratings per day

3 million searches

Device location, time, day, week, etc.

Social data

Page 88: Big Data and Hadoop in the Cloud

10 TB of streaming data per day

Page 89: Big Data and Hadoop in the Cloud

~1 PB of data stored in Amazon S3

S3

Page 90: Big Data and Hadoop in the Cloud

Wide range of processing languages used

EMR

Prod Cluster (EMR)

S3

Page 91: Big Data and Hadoop in the Cloud

Data consumed in multiple ways

S3

EMR

Prod Cluster (EMR)

Recommendation

Engine

Ad-hoc

Analysis Personalization

Page 92: Big Data and Hadoop in the Cloud

S3

Prod Cluster (EMR)

Query Cluster (EMR)

Additional EMR clusters

Page 93: Big Data and Hadoop in the Cloud
Page 94: Big Data and Hadoop in the Cloud

Durability

Page 95: Big Data and Hadoop in the Cloud

Versioning

Page 96: Big Data and Hadoop in the Cloud
Page 97: Big Data and Hadoop in the Cloud
Page 98: Big Data and Hadoop in the Cloud

Foursquare…

33 million users, 1.3 million businesses

…generates a lot of data: 3.5 billion check-ins, 15M+ venues, terabytes of log data

Page 99: Big Data and Hadoop in the Cloud

Uses EMR for:

Evaluation of new features

Machine learning

Exploratory analysis

Daily customer usage reporting

Long-term trend analysis

Page 100: Big Data and Hadoop in the Cloud

Benefits of EMR

Ease-of-Use: “We have decreased the processing time for urgent data-analysis”

Flexibility: To deal with changing requirements & dynamically expand reporting clusters

Costs: “We have reduced our analytics costs by over 50%”

Page 101: Big Data and Hadoop in the Cloud

Application Stack

Scala/Liftweb: API Machines, WWW Machines, Batch Jobs

Scala Application code

Mongo/Postgres/Flat Files: Databases, Logs

Data Stack

Amazon S3: Database Dumps, Log Files

Hadoop Elastic Map Reduce

Hive/Ruby/Mahout: Analytics Dashboard, Map Reduce Jobs

Data transfer: mongoexport, postgres dump, Flume

Page 102: Big Data and Hadoop in the Cloud

Page 103: Big Data and Hadoop in the Cloud

Page 104: Big Data and Hadoop in the Cloud

Page 105: Big Data and Hadoop in the Cloud

[Charts: distribution by Gender (Female, Male) and by Age (0 to 80); axis scale 0 to 0.6]

Page 106: Big Data and Hadoop in the Cloud

[Chart: Gorilla Coffee, Gray's Papaya, and Amorino across Thursday, Friday, Saturday, Sunday]

Page 107: Big Data and Hadoop in the Cloud
Page 108: Big Data and Hadoop in the Cloud
Page 109: Big Data and Hadoop in the Cloud
Page 110: Big Data and Hadoop in the Cloud
Page 111: Big Data and Hadoop in the Cloud

Python library

https://github.com/Yelp/mrjob

Page 112: Big Data and Hadoop in the Cloud

Log files

250 EMR clusters spun up and down every week

Page 113: Big Data and Hadoop in the Cloud

Common Crawl

1000 Genomes Project

Census Data

54 other datasets

http://aws.amazon.com/publicdatasets/

Page 114: Big Data and Hadoop in the Cloud

Challenge: Large amounts of computing resources needed for short periods of time; significant data storage costs

Solution: Clusters of hundreds of nodes on EMR, running 4-5 hours at a time. Leverages the 1000 Genomes Public Data Set on AWS: free access to ~200 TB of genomes for over 2,600 people from 26 populations around the world.

Page 115: Big Data and Hadoop in the Cloud

Challenge: Volatile weather is deadly to crops like grapes

Solution: Built a predictive model based on freely available data: 60 years of crop data, 14 TB of soil data, and 1M government Doppler radar points. 50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.

Page 116: Big Data and Hadoop in the Cloud

150B Soil Observations

3M Daily Weather Measurements

850K Precision Rainfall Grids Tracked

200 TB in Amazon S3

Page 117: Big Data and Hadoop in the Cloud

Big Data and AWS Cloud

Page 118: Big Data and Hadoop in the Cloud

Elastic and scalable + No upfront CapEx + Pay per use + On demand = Remove constraints

Page 119: Big Data and Hadoop in the Cloud

Remove constraints = More experimentation

Page 120: Big Data and Hadoop in the Cloud

More experimentation = More innovation

Page 121: Big Data and Hadoop in the Cloud

Focus on your business

Leave undifferentiated heavy lifting to us

Page 122: Big Data and Hadoop in the Cloud

THANK YOU!

slideshare.net/AmazonWebServicesLATAM

http://aws.amazon.com/es/big-data/

José Papo

AWS Tech Evangelist

@josepapo