Scientific Computing at Amazon: Disruptive Innovations in Distributed Computing

Dave Ward, Principal Product Manager
Adam Gray, Senior Product Manager

Innovation #1: Cloud

42

Building your own virtual, programmable datacenter

ec2-run-instances

On Demand

Global Infrastructure

Programmable

Elastic

Instance Types

Standard (m1)

High Memory (m2)

High CPU (c1)

High Performance

“Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds.”

Mike Driscoll – Meta Markets
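As a rough back-of-the-envelope reading of the quote, the sketch below converts its figures into rows per second, for the cluster and per instance. It assumes the work is spread evenly across all 40 instances, which the quote does not state.

```python
# Figures taken from the quote above.
ROWS = 1_000_000_000   # rows scanned, filtered, and aggregated
SECONDS = 0.950        # 950 milliseconds
INSTANCES = 40         # m2.2xlarge instances in the cluster


def rows_per_second(rows: int, seconds: float) -> float:
    """Aggregate throughput of the whole cluster."""
    return rows / seconds


cluster_rate = rows_per_second(ROWS, SECONDS)
# Assumption: load is evenly balanced across the instances.
per_instance_rate = cluster_rate / INSTANCES

print(f"cluster:      {cluster_rate:,.0f} rows/s")
print(f"per instance: {per_instance_rate:,.0f} rows/s")
```

Under that assumption, each instance handles on the order of tens of millions of rows per second.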

Cluster Computing

MPI

Bandwidth Intensive

Cluster Compute Instance

2 × Intel Xeon 5570
8 cores w/ HT
23 GB RAM
1.7 TB disk
HVM

cc1.4xlarge

linpack

231

November 2010

451

June 2011

New Cluster Compute Instances

2 × Intel Xeon
16 cores w/ HT
60.5 GB RAM
3.4 TB disk
HVM

cc2.8xlarge

linpack

42

November 2011

Innovation #2:

Lowering the cost of developing a distributed system

Case Study: Amazon’s Associates Program

Text Links

Enhanced Links

how much to pay each associate?

orders → c++ app → bi-hourly flat files → c++ app → daily aggregations → c++ app → to payments service…

“just one more Q4”

distributed computing

[Chart: Difficulty vs. Number of Machines]

distributed computing is hard

distributed computing requires god-like engineers

Hadoop is…

The MapReduce computational paradigm

… implemented as an Open-source, Scalable, Fault-tolerant, Distributed System

Person   Start     End       Duration
Bob      00:44:48  00:45:11  23
Charlie  02:16:02  02:16:18  16
Charlie  11:16:59  11:17:17  18
Charlie  11:17:24  11:17:38  14
Bob      11:23:10  11:23:25  15
Alice    16:26:46  16:26:54  8
David    17:20:28  17:20:45  17
Alice    18:16:53  18:17:00  7
Charlie  19:33:44  19:33:59  15
Bob      21:13:32  21:13:43  11
David    22:36:22  22:36:34  12
Alice    23:42:01  23:42:11  10

map

Person   Duration
Bob      23
Charlie  16
Charlie  18
Charlie  14
Bob      15
Alice    8
David    17
Alice    7
Charlie  15
Bob      11
David    12
Alice    10

Person   Duration   (grouped by person)
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

reduce

Person   Total
Alice    25
Bob      49
Charlie  63
David    29
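The call-duration example above can be written out as a small single-process sketch of the map, group, and reduce steps. This is plain Python for illustration, not Hadoop's API; the function names `map_record`, `shuffle`, and `reduce_durations` are made up for this sketch.

```python
from collections import defaultdict
from datetime import datetime

# Call records from the example: (person, start, end).
RECORDS = [
    ("Bob", "00:44:48", "00:45:11"),
    ("Charlie", "02:16:02", "02:16:18"),
    ("Charlie", "11:16:59", "11:17:17"),
    ("Charlie", "11:17:24", "11:17:38"),
    ("Bob", "11:23:10", "11:23:25"),
    ("Alice", "16:26:46", "16:26:54"),
    ("David", "17:20:28", "17:20:45"),
    ("Alice", "18:16:53", "18:17:00"),
    ("Charlie", "19:33:44", "19:33:59"),
    ("Bob", "21:13:32", "21:13:43"),
    ("David", "22:36:22", "22:36:34"),
    ("Alice", "23:42:01", "23:42:11"),
]

TIME_FORMAT = "%H:%M:%S"


def map_record(record):
    """map: turn one (person, start, end) record into (person, duration)."""
    person, start, end = record
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return person, int(delta.total_seconds())


def shuffle(pairs):
    """Group mapped values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_durations(person, durations):
    """reduce: collapse all durations for one person into a total."""
    return person, sum(durations)


mapped = [map_record(record) for record in RECORDS]
grouped = shuffle(mapped)
totals = dict(reduce_durations(p, d) for p, d in sorted(grouped.items()))
print(totals)  # {'Alice': 25, 'Bob': 49, 'Charlie': 63, 'David': 29}
```

Because each `map_record` call looks at a single record and each `reduce_durations` call looks at a single person, both steps can be spread across many machines without changing the code, which is the property the slides build on.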

distributed computing (with Hadoop) requires talented, rather than god-like, engineers

how much to pay each associate?

orders → c++ app → bi-hourly flat files → c++ app → daily aggregations → c++ app → to payments service…

Person   Total
Alice    25
Bob      49
Charlie  63
David    29

[Diagram: Orders, Filter, S3, Other Services; then Orders, Filter, S3, Hadoop Cluster]

[Chart: Difficulty vs. Number of Machines]

More data? Smarter engineers.

More data? Smarter engineers.

More data? More boxes.

Hadoop lowers the cost of developing a distributed system.

What about the cost of operating a distributed system?

November traffic at amazon.com

76%

24%

[Diagram: Orders, Filter, S3, Hadoop Cluster]

Amazon Elastic Compute Cloud “provides resizable compute capacity in the cloud.”

Amazon Elastic MapReduce = Amazon EC2 + Hadoop

[Diagram: Orders, Filter, S3, Hadoop Cluster; the Hadoop Cluster is replaced by an EMR Cluster, which is absent from the final frames]

Amazon EC2 lowers the cost of operating a distributed system.

Hadoop lowers the cost of developing a distributed system.

Amazon Elastic MapReduce changes the economics of data processing.

Managed Apache Hadoop Service

Removes MUCK from Big Data processing

Provides tight integration with AWS services

AMAZON ELASTIC MAPREDUCE

> elastic-mapreduce --create --instance-type m1.large \
    --instance-count 1000 --name "My Hadoop Cluster" \
    --jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar

What is big data?

[Chart: Number of datasets vs. Dataset size, with regions labeled "fits on a single machine", "Big Data", and "Extremely Big Data"]

[Chart: Difficulty vs. Dataset size, with labels "Extremely valuable" and "Marginally valuable"]

cheap experimentation

Innovation #3: Cloud

Public Data Sets

Lowering the cost of accessing data

Over 50 free data sets

Nearly 1 PB of free data

Stored at no cost to providers; also free access to consumers

• 1000 Genomes Project (110 TB)
• Common Crawl Corpus (60 TB)
• Sloan Digital Sky Survey (180 GB)
• United States Census (200 GB)
• Million Song Dataset (500 GB)
• Google Books Corpus (2.2 TB)
• Marvel Universe Social Graph (50 GB)

aws.amazon.com/datasets

Innovation #4: Creating a Market for Capacity

Finding Research Dollars for AWS

Educators

Up to $100 per Student in AWS Credits for intro courses

Researchers

Infrastructure Credits (EC2, S3, …)

4 Grant Review Cycles Per Year

February 10, 2012

Students

Student Organizations, Self Learning, Entrepreneurial Projects

aws.amazon.com/education

Stretching your Research Dollars (even further) on AWS

On-Demand

Reserved

Spot

Unused EC2 Capacity

Bid

July 2011

Interruption

July 2011

Manage Interruption
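The bid and interruption slides refer to the Spot model, in which an instance keeps running while the Spot price stays within the bid and is liable to be interrupted once the price rises above it. A minimal sketch of that rule follows. The hourly prices and the bid are made-up numbers, and `running_hours` and `count_interruptions` are hypothetical helpers, not part of any AWS API.

```python
def running_hours(prices, bid):
    """Return one flag per hour: True while the Spot price is within the bid."""
    return [price <= bid for price in prices]


def count_interruptions(running):
    """Count transitions from running to not running."""
    return sum(1 for prev, cur in zip(running, running[1:]) if prev and not cur)


# Made-up hourly Spot prices in dollars (illustration only, not AWS data).
HOURLY_PRICES = [0.20, 0.22, 0.31, 0.35, 0.24, 0.21, 0.40, 0.23]
BID = 0.25  # hypothetical maximum price the user is willing to pay

running = running_hours(HOURLY_PRICES, BID)
print(running)
print("interruptions:", count_interruptions(running))
```

A higher bid reduces the number of interruptions at the risk of paying more per hour, which is the trade-off the following slides address through interruption management.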

Grid Computing

MIT StarCluster

http://youtu.be/2Ym7epCYnSk

Harvard Medical School
Lab of Personalized Medicine

Temple University
Spot MPI

Elastic MapReduce

Scenario #1: Cost without Spot

Job Flow: allocate 4 instances
Duration: 14 hours
4 instances * 14 hrs * $0.50 = $28

Scenario #2: Cost with Spot

Job Flow: allocate 4 instances, add 5 Spot instances
Duration: 7 hours
4 instances * 7 hrs * $0.50 = $14
+ 5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75

Time Savings: 50%
Cost Savings: ~19%

Save Time and Money
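The two scenarios can be recomputed from the instance counts, durations, and hourly prices given on the slides, which is a useful check on the totals. The sketch below assumes simple per-instance-hour billing with no other charges; `job_cost` is a hypothetical helper.

```python
def job_cost(instances: int, hours: float, price_per_hour: float) -> float:
    """Cost of running `instances` machines for `hours` at a flat hourly price."""
    return instances * hours * price_per_hour


ON_DEMAND_PRICE = 0.50  # $/instance-hour, from the slides
SPOT_PRICE = 0.25       # $/instance-hour, from the slides

# Scenario #1: 4 On-Demand instances for 14 hours.
without_spot = job_cost(4, 14, ON_DEMAND_PRICE)

# Scenario #2: the same 4 instances plus 5 Spot instances, for 7 hours.
with_spot = job_cost(4, 7, ON_DEMAND_PRICE) + job_cost(5, 7, SPOT_PRICE)

time_savings = 1 - 7 / 14
cost_savings = 1 - with_spot / without_spot

print(f"without Spot: ${without_spot:.2f}")
print(f"with Spot:    ${with_spot:.2f}")
print(f"time savings: {time_savings:.0%}")
print(f"cost savings: {cost_savings:.1%}")
```

The comparison only holds if the job parallelizes well enough that adding 5 instances really halves the running time, as the slides assume.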

Queue-Based Architecture

Amazon EC2 Spot

Amazon EC2 On-Demand / Reserved

Queue

Applications

Checkpointing
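One way to picture the queue-plus-checkpointing pattern is a worker that pulls tasks from a queue, records progress as it goes, and puts an unfinished task back when it is interrupted so another worker can resume it. The sketch below is a single-process simulation under several assumptions: Python's in-memory `queue.Queue` stands in for a durable queue service, a dict stands in for durable checkpoint storage, and `step_budget` simulates the time a Spot instance gets before being reclaimed.

```python
import queue


def run_worker(tasks, checkpoints, results, step_budget):
    """Process tasks until the queue is empty or the step budget runs out.

    Returns True when all work is finished, False when the worker was
    interrupted and left an unfinished task on the queue.
    """
    while True:
        try:
            task_id, total_steps = tasks.get_nowait()
        except queue.Empty:
            return True
        done = checkpoints.get(task_id, 0)  # resume from the last checkpoint
        while done < total_steps:
            if step_budget == 0:
                # Simulated interruption: requeue so another worker resumes it.
                tasks.put((task_id, total_steps))
                return False
            done += 1
            step_budget -= 1
            checkpoints[task_id] = done  # checkpoint after every step
        results.append(task_id)


tasks = queue.Queue()
for item in [("a", 3), ("b", 4), ("c", 2)]:  # (task id, steps of work)
    tasks.put(item)

checkpoints, results = {}, []
workers_used = 0
finished = False
while not finished:
    # Each iteration models a fresh instance that runs for at most 4 steps.
    workers_used += 1
    finished = run_worker(tasks, checkpoints, results, step_budget=4)

print(results, workers_used, checkpoints)
```

Because progress is checkpointed after every step, an interruption costs no repeated work in this simulation. A real system would checkpoint less often and trade some recomputation for lower overhead.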

30,000+ Cores

95,078 Instance Hours

$1,279/hour

We are Hiring!

FT/Interns: amazon.com/careers
Experienced: aws.amazon.com/jobs