Scientific Computing With Amazon Web Services


Researchers from around the world are increasingly using AWS for a wide array of use cases. This presentation describes how AWS facilitates scientific collaboration and powers some of the world's largest scientific efforts, including real-world examples from NASA JPL, the European Space Agency (ESA) and CERN's CMS particle detector.

Transcript of Scientific Computing With Amazon Web Services

Scientific Computing on AWS: NASA/JPL, ESA and CERN

Jamie Kinney, Principal Solutions Architect, World Wide Public Sector
jkinney@amazon.com | @jamiekinney

How do researchers use AWS today?

Can you run HPC on AWS?

Should everything run on the cloud?

How does AWS facilitate scientific collaboration?

Amazon Web Services

AWS Global Infrastructure

Application Services

Networking

Deployment & Administration

Database

Storage

Compute

Amazon EC2


ec2-run-instances
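
ec2-run-instances is the original EC2 API tools command for launching instances. As a rough illustration of the same call made programmatically, here is a minimal sketch using boto3; the AMI ID, key pair and instance type are placeholders, not values from the talk.

```python
# Minimal sketch: launch a single EC2 instance programmatically with boto3.
# The AMI ID, key pair name, and instance type are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="m1.large",           # placeholder instance type
    KeyName="my-keypair",              # placeholder key pair
    MinCount=1,
    MaxCount=1,
)

print("Launched", response["Instances"][0]["InstanceId"])
```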


Programmable


Elastic


[Figure: capacity vs. demand. With self hosting, capacity sized to predicted demand produces waste when it exceeds actual demand and customer dissatisfaction when it falls short; a rigid capacity curve tracks actual demand poorly; elastic capacity follows actual demand.]

Go from one instance...

...to thousands.

Instance Types


Standard (m1)

High Memory (m2, m3)

High CPU (c1)

Cluster Compute

Intel Nehalem (cc1.4xlarge)

NVIDIA GPUs (cg1.4xlarge)

Intel Sandy Bridge E5-2670 (cc2.8xlarge)

2 TB of SSD, 120,000 IOPS (hi1.4xlarge)

Sandy Bridge, NUMA, 240 GB RAM (cr1.8xlarge)

48 TB of ephemeral storage (hs1.8xlarge)

Placement Groups

[Figure: EC2 instances in a placement group connected by a 10 GigE network with full bisection bandwidth.]
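
Cluster-compute instances get that full-bisection 10 GigE fabric when launched into a placement group. A minimal sketch with boto3; the group name, AMI and instance count are placeholders.

```python
# Sketch: create a cluster placement group and launch cluster-compute
# instances into it. Group name, AMI, and count are placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.create_placement_group(GroupName="hpc-cluster", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",     # placeholder HVM AMI
    InstanceType="cc2.8xlarge",          # cluster-compute type from the slides
    MinCount=8,
    MaxCount=8,
    Placement={"GroupName": "hpc-cluster"},
)
```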


What is Scientific Computing?


Use Cases

• Science-as-a-Service
• Large-scale HTC (100,000+ core clusters)
• Large-scale MapReduce (Hadoop/Spark/Shark) using EMR or EC2
• Small to medium-scale MPI clusters (hundreds of nodes)
• Many small MPI clusters working in parallel to explore parameter space
• GPGPU workloads
• Dev/test of MPI workloads prior to submitting to supercomputing centers
• Collaborative research environments
• On-demand academic training/lab environments

Large Input Data Sets


ESA Gaia Mission Overview

ESA’s Gaia is an ambitious mission to chart a three-dimensional map of the Milky Way Galaxy in order to reveal the composition, formation and evolution of our Galaxy.

Gaia will repeatedly analyze and record the positions and magnitudes of approximately one billion stars over the course of several years.

1 billion stars × 80 observations × 10 readouts ≈ 1 × 10^12 samples.

At 1 ms of processing time per sample, that is more than 30 years of serial processing.
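
A quick back-of-the-envelope check of those figures, using the slide's rounded sample count:

```python
# Check of the Gaia estimate above: ~1e12 samples at 1 ms of processing each.
samples = 1e12                       # ~1 billion stars x 80 observations x 10 readouts
seconds = samples * 1e-3             # 1 ms per sample
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years")          # ~31.7 years of serial processing
```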


Gaia Solution Overview

A fixed, self-hosted approach contrasted with AWS:

• Self-hosted: Purchase at the beginning of the mission for the anticipated high-water mark.
  AWS: Pay as you go. Launch what you need, as you need it, and turn instances off when you're done.

• Self-hosted: Purchase additional systems for redundancy.
  AWS: If an instance fails, turn it off and launch a replacement at no additional charge.

• Self-hosted: Large-scale data reprocessing is constrained to the available infrastructure; there is no way to accelerate jobs without additional CapEx.
  AWS: Need to reprocess the data within a few hours? Simply launch more instances; 100 machines running for 1 hour cost the same as 1 machine running for 100 hours.

• Self-hosted: Performance is constrained to the processors, disks and memory available at the time of procurement...for a multi-year mission.
  AWS: AWS frequently launches new instance types running the latest hardware. Simply restart your instances on a newer instance type and stop paying for less-capable infrastructure (a sketch follows this list).

• Self-hosted: Data transfer and security policies make it difficult to collaborate with researchers located elsewhere.
  AWS: Easily and securely collaborate with researchers around the world.
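
The point about restarting on a newer instance type maps to a short API sequence for a stopped, EBS-backed instance. A minimal sketch with boto3; the instance ID and target type are placeholders.

```python
# Sketch: move a stopped, EBS-backed instance to a newer instance type.
# Instance ID and target type are placeholders.
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"      # placeholder

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "cc2.8xlarge"},   # placeholder newer type
)

ec2.start_instances(InstanceIds=[instance_id])
```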


Many Iterations With Varying Parameters


Linear Algebra Calculations


MSL Distributed Operations

• JPL (Pasadena, CA)
• GDSCC: Goldstone Deep Space Communication Complex
• CDSCC: Canberra Deep Space Communication Complex
• MDSCC: Madrid Deep Space Communication Complex
• ARC: CheMin (Moffett Field, CA)
• MSSS: MARDI, MAHLI, MastCam (San Diego, CA)
• KSC (Kennedy Space Center)
• IKI: DAN (Moscow, Russia)
• INTA: REMS (Madrid, Spain)
• LANL: ChemCam (Los Alamos, NM)
• U. of Guelph: APXS (Guelph, Ontario)
• SwRI: RAD (Boulder, CO)
• GSFC: SAM (Greenbelt, MD)

Plus hundreds of other sites around the world for Co-Is and colleagues.

Data Locality Challenges

Scientist 1 retrieves data from L.A.

Scientist 1 returns data to L.A.

Scientist 2 retrieves data from L.A.

Scientist 2 returns data to L.A.


AWS Global Infrastructure

9 regions

25 availability zones

38 edge locations


AWS Public Data Sets

aws.amazon.com/datasets
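
Public data sets live in S3 and can typically be read without AWS credentials. A minimal sketch with boto3 using anonymous (unsigned) requests; the bucket name is a placeholder, real buckets are listed at the URL above.

```python
# Sketch: list objects in a public data set bucket using anonymous S3 access.
# The bucket name is a placeholder; see aws.amazon.com/datasets for real ones.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="example-public-dataset", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```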

Data Locality Challenges

Researcher in L.A. uploads data to the cloud

Scientist 1 uses cloud resources to process data

Scientist 2 retrieves data products from edge network

Scientist 2 uses cloud resources to process data

Global collaboration


On-Demand Pricing


Reserved Instances


Spot Instances

• Bid a maximum price of $X per hour

• If the current Spot price <= your bid, the instance starts

• If the current Spot price > your bid, the instance is terminated

• Customers pay the market rate, not their bid
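
A minimal sketch of placing a Spot request with boto3 that follows those rules; the price, count, AMI and instance type are placeholders.

```python
# Sketch: bid for Spot capacity. You set a maximum hourly price; instances run
# while the market price is at or below it and you are billed the market rate.
# Price, count, AMI, and instance type are placeholders.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.request_spot_instances(
    SpotPrice="0.25",                # placeholder maximum bid in USD/hour
    InstanceCount=100,
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",   # placeholder AMI
        "InstanceType": "c1.xlarge",          # placeholder instance type
    },
)

for req in resp["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```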


U. Wisc.: CMS Particle Detector

http://www.hep.wisc.edu/~dan/talks/EC2SpotForCMS.pdf


Integrated Architectures


Amazon VPC

AWS Direct Connect

[Figure: EC2 instances in an Amazon VPC, reached over AWS Direct Connect from locations including Los Angeles, Singapore, Japan, London, Sao Paulo, New York and Sydney.]

Secured Uplink Planning


[Diagram: Polyphony. In the JPL data center, a decider, file transfer workers and data processing workers interact with Amazon SWF, which queues decision tasks, file transfer tasks and data processing tasks. The decider creates EC2 instances, file chunks are uploaded to and downloaded from S3, and additional data processing workers run on EC2.]
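
Polyphony's workers poll SWF task lists for work. Below is a minimal sketch of an activity worker of that kind using boto3; the domain, task list and process_chunk function are illustrative placeholders, not JPL's actual code.

```python
# Minimal sketch of an Amazon SWF activity worker like Polyphony's
# data-processing workers: poll a task list, do the work, report the result.
# Domain, task list, and process_chunk() are placeholders.
import boto3

swf = boto3.client("swf")
DOMAIN = "polyphony-demo"          # placeholder SWF domain
TASK_LIST = "data-processing"      # placeholder task list


def process_chunk(task_input):
    # Placeholder for the real image-processing step.
    return f"processed:{task_input}"


while True:
    task = swf.poll_for_activity_task(
        domain=DOMAIN,
        taskList={"name": TASK_LIST},
    )
    if not task.get("taskToken"):
        continue  # long poll timed out with no work; poll again

    result = process_chunk(task.get("input", ""))
    swf.respond_activity_task_completed(
        taskToken=task["taskToken"],
        result=result,
    )
```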


SWF, EC2, S3, SimpleDB, CloudWatch, IAM, ELB

5 gigapixels in 5 minutes!

[Figure: tiers of computing available to a NASA researcher: NASA Ames for large, tightly-coupled MPI; a large pool of EC2 instances for large EP workloads, smaller-scale tightly-coupled MPI, dev/test and burst capacity; small-scale MPI and EP.]

Zero to Internet-Scale in One Week!


ELBs on Steroids


Route53


CloudFormation


CloudFront


Regions and AZs


Mars Science Laboratory - Live Video Streaming Architecture

[Diagram: Telestream Wirecast and Adobe Flash Media Servers in Availability Zones us-east-1a and us-west-1b; in each zone a CloudFormation stack provides an Elastic Load Balancer in front of Tier 1 and Tier 2 Nginx caches; CloudFront streaming for museum partners.]
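
Each cache tier in the diagram is launched from a CloudFormation stack. A minimal sketch of launching such a stack with boto3; the stack name, template URL and parameter names are placeholders, not JPL's actual template.

```python
# Sketch: launch one per-region cache-tier stack from a CloudFormation
# template. Template URL and parameter names are placeholders.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="msl-streaming-cache-us-east-1",
    TemplateURL="https://example-bucket.s3.amazonaws.com/cache-tier.template",  # placeholder
    Parameters=[
        {"ParameterKey": "CacheInstanceType", "ParameterValue": "c1.xlarge"},      # placeholder
        {"ParameterKey": "OriginDNSName", "ParameterValue": "origin.example.com"}, # placeholder
    ],
)
```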


Battle Testing JPL's Deployment: Benchmarking

Dynamic Traffic Scaling: US-East Cache Node Performance

[Charts: cache node throughput snapshots of 11.4 Gbps, 25.3 Gbps, 10.1 Gbps, 40.3 Gbps and 26.6 Gbps.]

Dynamic Traffic Scaling: Impact on US-East FMS Origin Servers

[Chart: the FMS origin servers saw only ~42 Mbps.]

CloudFront Behaviors: Using ELBs for Dynamic Content

AWS Academic Grants

aws.amazon.com/grants

Thank You
