Scientific Computing With Amazon Web Services
Scientific Computing on AWS: NASA/JPL, ESA and CERN
Jamie Kinney, Principal Solutions Architect, World Wide Public Sector
[email protected] | @jamiekinney
1
How do researchers use AWS today?
Can you run HPC on AWS?
Should everything run on the cloud?
How does AWS facilitate scientific collaboration?
2
Amazon Web Services
AWS Global Infrastructure
Application Services
Networking
Deployment & Administration
Database
Storage
Compute
3
Amazon EC2
4
ec2-run-instances
5
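As an illustrative sketch (not part of the original slides), the same launch can be done programmatically with boto, the Python SDK for AWS at the time; the region, AMI ID, key pair, and security group below are placeholders:

```python
import boto.ec2

# Connect to a region; credentials come from the environment or ~/.boto.
conn = boto.ec2.connect_to_region("us-east-1")

# Equivalent of `ec2-run-instances`: start one m1.large instance from an AMI.
reservation = conn.run_instances(
    "ami-12345678",             # placeholder AMI ID
    min_count=1,
    max_count=1,
    instance_type="m1.large",
    key_name="my-keypair",      # placeholder key pair
    security_groups=["default"],
)
print(reservation.instances[0].id)
```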
6
Programmable
7
8
9
Elastic
10
Self Hosting
[Chart: capacity is provisioned against predicted demand. When capacity exceeds actual demand the gap is waste; when actual demand exceeds capacity the gap is customer dissatisfaction. A rigid capacity line is contrasted with elastic capacity that tracks actual demand.]
11
Go from one instance...
12
To Thousands
13
Instance Types
14
Standard (m1)
High Memory (m2, m3)
High CPU (c1)
15
Intel Nehalem (cc1.4xlarge)
NVIDIA GPUs (cg1.4xlarge)
2 TB of SSD, 120,000 IOPS (hi1.4xlarge)
Intel Sandy Bridge E5-2670 (cc2.8xlarge)
Sandy Bridge, NUMA, 240 GB RAM (cr1.4xlarge)
48 TB of ephemeral storage (hs1.8xlarge)
Cluster Compute
16
17
Placement Groups
18
[Diagram: a placement group of EC2 instances on a 10 Gigabit Ethernet network with full-bisection bandwidth]
19
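A hedged sketch of how a cluster placement group might be used from boto; the group name, AMI ID, and key pair are placeholders, not values from the talk:

```python
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# Create a cluster placement group so the instances land on the same
# full-bisection 10 GigE network segment (name is a placeholder).
conn.create_placement_group("hpc-group", strategy="cluster")

# Launch cluster-compute instances into the group.
conn.run_instances(
    "ami-12345678",              # placeholder AMI ID
    min_count=8,
    max_count=8,
    instance_type="cc2.8xlarge",
    key_name="my-keypair",       # placeholder key pair
    placement_group="hpc-group",
)
```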
What is Scientific Computing?
20
Use Cases
• Science-as-a-Service
• Large-scale HTC (100,000+ core clusters)
• Large-scale MapReduce (Hadoop/Spark/Shark) using EMR or EC2 (see the EMR sketch below)
• Small to medium-scale MPI clusters (hundreds of nodes)
• Many small MPI clusters working in parallel to explore parameter space
• GPGPU workloads
• Dev/test of MPI workloads prior to submitting to supercomputing centers
• Collaborative research environments
• On-demand academic training/lab environments
21
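As referenced in the MapReduce item above, here is a minimal, hedged sketch of starting an Elastic MapReduce job flow with boto; the bucket paths, step definition, and cluster sizing are invented placeholders:

```python
import boto.emr
from boto.emr.step import StreamingStep

emr = boto.emr.connect_to_region("us-east-1")

# A simple Hadoop streaming step; all S3 paths are placeholders.
step = StreamingStep(
    name="word-count",
    mapper="s3n://example-bucket/wordcount/mapper.py",
    reducer="aggregate",
    input="s3n://example-bucket/input/",
    output="s3n://example-bucket/output/",
)

# Launch a 10-node cluster and run the step.
jobflow_id = emr.run_jobflow(
    name="demo-cluster",
    steps=[step],
    num_instances=10,
    master_instance_type="m1.large",
    slave_instance_type="m1.large",
)
print(jobflow_id)
```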
Large Input Data Sets
22
ESA Gaia Mission Overview
ESA’s Gaia is an ambitious mission to chart a three-dimensional map of the Milky Way Galaxy in order to reveal the composition, formation and evolution of our Galaxy.
Gaia will repeatedly analyze and record the positions and magnitudes of approximately one billion stars over the course of several years.
1 billion stars x 80 observations x 10 readouts = ~1 x 10^12 samples.
1 ms of processing time per sample = more than 30 years of processing
23
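A quick back-of-the-envelope check of those figures (not from the slides):

```python
# Rough sanity check of the Gaia numbers quoted on the slide.
stars = 10**9
samples = stars * 80 * 10                  # 8e11, on the order of 1e12 samples

seconds = 1e12 * 0.001                     # ~1e12 samples at 1 ms each
years = seconds / (3600 * 24 * 365.25)
print(samples, round(years, 1))            # -> 800000000000 31.7 (about 32 years serial)
```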
Gaia Solution Overview
• Traditional: purchase at the beginning of the mission for the anticipated high-water mark.
  AWS: pay as you go; launch what you need, as you need it, and turn instances off when you’re done.
• Traditional: purchase additional systems for redundancy.
  AWS: if an instance fails, turn it off and launch a replacement at no additional charge.
• Traditional: large-scale data reprocessing is constrained to available infrastructure; there is no way to accelerate jobs without additional CapEx.
  AWS: need to reprocess the data within a few hours? Simply launch more instances. 100 machines running for 1 hour cost the same as 1 machine running for 100 hours.
• Traditional: performance is constrained to the processors, disks, and memory available at time of procurement...for a multi-year mission.
  AWS: AWS frequently launches new instance types running the latest hardware. Simply restart your instances on a newer instance type and stop paying for less-capable infrastructure.
• Traditional: data transfer and security policies make it difficult to collaborate with researchers located elsewhere.
  AWS: easily and securely collaborate with researchers around the world.
24
Many Iterations With Varying Parameters
25
Linear Algebra Calculations
26
27
MSL Distributed Operations
[Map of mission operations and instrument team sites:]
• JPL (Pasadena, CA)
• CDSCC (Canberra Deep Space Communication Complex)
• MDSCC (Madrid Deep Space Communication Complex)
• GDSCC (Goldstone Deep Space Communication Complex)
• ARC: CheMin (Moffett Field, CA)
• MSSS: MARDI, MAHLI, MastCam (San Diego, CA)
• KSC
• IKI: DAN (Moscow, Russia)
• INTA: REMS (Madrid, Spain)
• LANL: ChemCam (Los Alamos, NM)
• U. of Guelph: APXS (Guelph, Ontario)
• SwRI: RAD (Boulder, CO)
• GSFC: SAM (Greenbelt, MD)
Plus hundreds of other sites around the world for Co-Is and colleagues.
28
Data Locality Challenges
Scientist 1 retrieves data from L.A.
Scientist 1 returns data to L.A.
Scientist 2 retrieves data from L.A.
Scientist 2 returns data to L.A.
29
AWS Global Infrastructure
9 regions
25 availability zones
38 edge locations
30
AWS Public Data Sets
AWS.amazon.com/datasets
31
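As an illustrative example (the bucket and key names are hypothetical, not an actual public data set), pulling an object from a public S3 bucket with boto might look like:

```python
import boto

# Anonymous access is enough for public-read data sets.
s3 = boto.connect_s3(anon=True)
bucket = s3.get_bucket("example-public-dataset", validate=False)

# Download one object to the local disk; key name is a placeholder.
key = bucket.get_key("samples/chunk-0001.dat")
key.get_contents_to_filename("chunk-0001.dat")
```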
Data Locality Challenges
Researcher in L.A. uploads data to the cloud
Scientist 1 uses cloud resources to process data
Scientist 2 retrieves data products from edge network
Scientist 2 uses cloud resources to process data
Global collaboration
32
33
On-Demand Pricing
34
Reserved Instances
35
Spot Instances
• Bid $X per hour
• If current price <= bid, instance starts
• If current price > bid, instance terminates
• Customers pay market rate, not bid
36
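A minimal boto sketch of the Spot model described above; the bid, AMI ID, and key pair are placeholders:

```python
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# Bid $0.05/hour for 100 c1.xlarge Spot instances.  They run while the
# market price stays at or below the bid and are billed at the market
# price, not the bid.
requests = conn.request_spot_instances(
    price="0.05",
    image_id="ami-12345678",     # placeholder AMI ID
    count=100,
    instance_type="c1.xlarge",
    key_name="my-keypair",       # placeholder key pair
)
print([r.id for r in requests])
```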
U. Wisc.: CMS Particle Detector
http://www.hep.wisc.edu/~dan/talks/EC2SpotForCMS.pdf
37
Integrated Architectures
38
Amazon VPC
AWS Direct Connect
[Diagram: EC2 instances inside an Amazon VPC, reachable over AWS Direct Connect locations in Los Angeles, Singapore, Japan, London, São Paulo, New York, and Sydney]
39
40
Secured Uplink Planning
41
[Architecture diagram: the JPL data center runs a decider, file transfer workers, and data processing workers. Polyphony uses Amazon SWF to hold decision tasks, file transfer tasks, and data processing tasks; the decider creates EC2 instances, file transfer workers upload and download file chunks to and from S3, and data processing workers on a fleet of EC2 instances consume the data processing tasks.]
42
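For flavor only, a minimal boto sketch of an SWF activity worker poll loop of the kind a file-transfer worker would run; the domain and task-list names are invented placeholders, not the actual Polyphony configuration:

```python
import boto.swf.layer1

swf = boto.swf.layer1.Layer1()

# Poll for one activity task on a task list (names are placeholders).
task = swf.poll_for_activity_task("polyphony-demo", "file-transfer-tasks")

if task.get("taskToken"):
    # ... transfer the file chunk to/from S3 here ...
    swf.respond_activity_task_completed(task["taskToken"], result="ok")
```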
SWF, EC2, S3, SimpleDB, CloudWatch, IAM, ELB
5 Giga-pixels in 5 minutes!
43
[Diagram: a large fleet of EC2 instances]
Ames
Large, tightly-coupled MPI
Large EP, smaller scale tightly-coupled MPI, dev/test, burst capacity
Small scale MPI and EP
NASA Researcher
44
45
46
Zero to Internet-Scale in One Week!
47
ELBs on Steroids
48
Route53
49
CloudFormation
50
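A hedged sketch of launching a CloudFormation stack from a template with boto; the stack name, template URL, and parameter values are placeholders:

```python
import boto.cloudformation

cfn = boto.cloudformation.connect_to_region("us-east-1")

# Launch one copy of a stack from a template stored in S3 (URL is a placeholder).
stack_id = cfn.create_stack(
    "msl-streaming-cache",
    template_url="https://s3.amazonaws.com/example-bucket/streaming-stack.template",
    parameters=[("InstanceType", "c1.xlarge")],
)
print(stack_id)
```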
CloudFront
51
Regions and AZs
52
Mars Science Laboratory - Live Video Streaming Architecture
[Architecture diagram: Telestream Wirecast as the live encoder; Adobe Flash Media Server origins in Availability Zones us-east-1a and us-west-1b; CloudFront streaming for museum partners; and two CloudFormation stacks, each with an Elastic Load Balancer fronting Tier 1 and Tier 2 Nginx caches.]
53
Battle Testing JPL’s Deployment: Benchmarking
54
Dynamic Traffic Scaling: US-East Cache Node Performance
11.4 Gbps
55
Dynamic Traffic Scaling: US-East Cache Node Performance
25.3 Gbps
56
Dynamic Traffic Scaling: US-East Cache Node Performance
10.1 Gbps
57
Dynamic Traffic Scaling: US-East Cache Node Performance
40.3 Gbps
58
Dynamic Traffic Scaling: US-East Cache Node Performance
26.6 Gbps
59
Only ~42 Mbps
Dynamic Traffic Scaling: Impact on US-East FMS Origin Servers
60
Only ~42 Mbps
Dynamic Traffic Scaling: Impact on US-East FMS Origin Servers
61
CloudFront Behaviors: Using ELBs for Dynamic Content
62
AWS Academic Grants
AWS.amazon.com/grants
63
Thank You
64