Scientific Computing at Amazon
Disruptive Innovations in Distributed Computing
Dave Ward, Principal Product Manager
Adam Gray, Senior Product Manager
“Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds.”
Mike Driscoll – Meta Markets
Hadoop is…
The MapReduce computational paradigm
… implemented as an Open-source, Scalable, Fault-tolerant, Distributed System
Person   Start     End       Duration (s)
Bob      00:44:48  00:45:11  23
Charlie  02:16:02  02:16:18  16
Charlie  11:16:59  11:17:17  18
Charlie  11:17:24  11:17:38  14
Bob      11:23:10  11:23:25  15
Alice    16:26:46  16:26:54  8
David    17:20:28  17:20:45  17
Alice    18:16:53  18:17:00  7
Charlie  19:33:44  19:33:59  15
Bob      21:13:32  21:13:43  11
David    22:36:22  22:36:34  12
Alice    23:42:01  23:42:11  10
map

Person   Duration
Bob      23
Charlie  16
Charlie  18
Charlie  14
Bob      15
Alice    8
David    17
Alice    7
Charlie  15
Bob      11
David    12
Alice    10
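The map step above can be sketched in Python as a function applied independently to each record. This is a minimal illustration and not Hadoop code: the name `map_call` and the tuple layout are assumptions, only three of the slide's rows are used, and it assumes no call crosses midnight.

```python
from datetime import datetime

# Three rows from the call log on the slide: (person, start, end).
CALLS = [
    ("Bob", "00:44:48", "00:45:11"),
    ("Charlie", "02:16:02", "02:16:18"),
    ("Alice", "18:16:53", "18:17:00"),
]

TIME_FORMAT = "%H:%M:%S"


def map_call(record):
    """Map one (person, start, end) record to a (person, duration_seconds) pair.

    Hypothetical helper; assumes start and end fall on the same day.
    """
    person, start, end = record
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return person, int(delta.total_seconds())


# Each record is mapped on its own, which is what lets the work be spread over many machines.
mapped = [map_call(record) for record in CALLS]
print(mapped)  # [('Bob', 23), ('Charlie', 16), ('Alice', 7)]
```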
Person   Duration
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

reduce

Person   Total
Alice    25
Bob      49
Charlie  63
David    29
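The sort-then-reduce step can be sketched the same way. The following is a single-process illustration under stated assumptions, not Hadoop's implementation: `shuffle` and `reduce_totals` are hypothetical names, and the input is the (Person, Duration) pairs from the slides.

```python
from itertools import groupby
from operator import itemgetter

# (person, duration_seconds) pairs as produced by the map step on the slides.
MAPPED = [
    ("Bob", 23), ("Charlie", 16), ("Charlie", 18), ("Charlie", 14),
    ("Bob", 15), ("Alice", 8), ("David", 17), ("Alice", 7),
    ("Charlie", 15), ("Bob", 11), ("David", 12), ("Alice", 10),
]


def shuffle(pairs):
    """Sort pairs by key so that all values for one person become adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_totals(sorted_pairs):
    """Sum the durations for each run of identical keys."""
    totals = {}
    for person, group in groupby(sorted_pairs, key=itemgetter(0)):
        totals[person] = sum(duration for _, duration in group)
    return totals


totals = reduce_totals(shuffle(MAPPED))
print(totals)  # {'Alice': 25, 'Bob': 49, 'Charlie': 63, 'David': 29}
```

Because each person's total depends only on that person's rows, different keys could be reduced on different machines once the sort has brought them together.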
orders → C++ app → bi-hourly flat files → C++ app → daily aggregations → C++ app → to payments service…

Person   Total
Alice    25
Bob      49
Charlie  63
David    29
Managed Apache Hadoop Service
Removes MUCK from Big Data processing
Provides tight integration with AWS services
AMAZON ELASTIC MAPREDUCE
> elastic-mapreduce --create --instance-type m1.large \
    --instance-count 1000 --name "My Hadoop Cluster" \
    --jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar
• 1000 Genomes Project (110 TB)
• Common Crawl Corpus (60 TB)
• Sloan Digital Sky Survey (180 GB)
• United States Census (200 GB)
• Million Song Dataset (500 GB)
• Google Books Corpus (2.2 TB)
• Marvel Universe Social Graph (50 GB)
Scenario #1: Cost without Spot
Job Flow: allocate 4 instances
Duration: 14 hours
4 instances * 14 hrs * $0.50 = $28

Scenario #2: Cost with Spot
Job Flow: add 5 Spot instances
Duration: 7 hours
4 instances * 7 hrs * $0.50 = $14
+ 5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75

Time Savings: 50%   Cost Savings: ~19%
Save Time and Money
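The two scenarios can be recomputed from the instance counts, durations, and hourly rates stated on the slide. The sketch below is illustrative only: `job_cost` is a hypothetical helper, and the $0.50 and $0.25 rates are the slide's example figures, not current prices. Note that the on-demand term in scenario #2, 4 * 7 * $0.50, evaluates to $14.

```python
ON_DEMAND_RATE = 0.50  # USD per instance-hour, taken from the slide
SPOT_RATE = 0.25       # USD per instance-hour, taken from the slide


def job_cost(groups, hours):
    """Total cost of a job flow.

    groups: list of (instance_count, hourly_rate_usd) pairs, all running for `hours`.
    """
    return sum(count * rate * hours for count, rate in groups)


# Scenario #1: 4 on-demand instances for 14 hours.
cost_without_spot = job_cost([(4, ON_DEMAND_RATE)], hours=14)

# Scenario #2: the same 4 instances plus 5 Spot instances, finishing in 7 hours.
cost_with_spot = job_cost([(4, ON_DEMAND_RATE), (5, SPOT_RATE)], hours=7)

time_saving = 1 - 7 / 14
cost_saving = 1 - cost_with_spot / cost_without_spot

print(f"without Spot: ${cost_without_spot:.2f}")  # without Spot: $28.00
print(f"with Spot:    ${cost_with_spot:.2f}")     # with Spot:    $22.75
print(f"time saving:  {time_saving:.0%}")         # time saving:  50%
print(f"cost saving:  {cost_saving:.2%}")         # cost saving:  18.75%
```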