2013 AWS Worldwide Public Sector Summit Washington, D.C.
Big Data in the Cloud: Accelerating Innovation in the Public Sector
Jamie Kinney│Principal Solutions Architect
[email protected] │ @jamiekinney
Technologies and techniques for
working productively with data,
at any scale
BIG DATA
The more data you collect
The more VALUE you can
derive from it
Bigger is Better!
YOU DON’T HAVE
A CHOICE…
27 TB per day Large Hadron Collider – CERN
GB TB
PB
Compute Storage Big Data
Unconstrained data growth
95% of the 1.2 zettabytes of data in the digital universe is unstructured
70% of this is user-generated content
Unstructured data growth is explosive, with an estimated compound annual growth rate (CAGR) of 62% from 2008 to 2012.
Source: IDC
ZB
EB
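To make the 62% CAGR figure concrete, the short sketch below compounds it over the stated period. It assumes that 2008 to 2012 means four annual compounding periods; the function name is illustrative and not from the original slides.

```python
def compound_growth_factor(cagr: float, years: int) -> float:
    """Total growth multiple after compounding `cagr` for `years` periods."""
    return (1.0 + cagr) ** years


# Assumption: 2008 to 2012 is treated as four annual compounding periods.
factor = compound_growth_factor(0.62, 4)
print(f"Growth multiple over 4 years at 62% CAGR: {factor:.2f}x")
```

Under that assumption, unstructured data grows to nearly seven times its starting volume over the period.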
Big Data Verticals
Media / Advertising: Targeted Advertising; Image and Video Processing
Oil & Gas: Seismic Analysis
Retail: Recommendations; Transaction Analysis
Life Sciences: Genome Analysis
Financial Services: Monte Carlo Simulations; Risk Analysis
Security: Anti-virus; Fraud Detection; Image Recognition
Social Network / Gaming: User Demographics; Usage Analysis; In-game Metrics
VOLUME
VELOCITY
VARIETY
COLLECT │ STORE │ ANALYZE │ SHARE
COLLECT │ STORE │ ANALYZE │ SHARE
AWS Import/Export
AWS Direct Connect
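Whether to ship storage devices or move data over a network link usually comes down to simple arithmetic. The sketch below estimates how many days a transfer takes for a given link speed, using the 27 TB per day figure from the Large Hadron Collider slide as input. The 80% link utilization and the three link speeds are illustrative assumptions, and `transfer_days` is a hypothetical helper, not an AWS API.

```python
def transfer_days(data_tb: float, link_mbps: float, utilization: float = 0.8) -> float:
    """Days needed to move `data_tb` decimal terabytes over a `link_mbps` link."""
    bits = data_tb * 1e12 * 8                       # terabytes -> bits
    effective_bps = link_mbps * 1e6 * utilization   # usable bits per second (assumed 80%)
    return bits / effective_bps / 86_400            # seconds -> days


# Illustrative link speeds; 27 TB is one day of data in the CERN example.
for mbps in (100, 1_000, 10_000):
    print(f"27 TB over {mbps} Mbps: {transfer_days(27, mbps):.2f} days")
```

If one day of data takes more than one day to send, the link cannot keep up, and shipping devices or a faster dedicated connection becomes the practical option.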
COLLECT │ STORE │ ANALYZE │ SHARE
AMAZON S3
[Chart: objects stored in Amazon S3, Q4 2006 to Q2 2013]
2 trillion objects in S3
1.1 million peak transactions per second
AMAZON DYNAMODB
AMAZON REDSHIFT
AMAZON RDS
HBase on AMAZON EMR
COLLECT │ STORE │ ANALYZE │ SHARE
AMAZON EC2
Instance Types
[Chart: EC2 instance types plotted by memory (GB, 1 to 256) against EC2 Compute Units (1 to 128)]
Instance families: Standard, 2nd Gen Standard, Micro, High-Memory, High-CPU, Cluster Compute, Cluster GPU, High I/O, High-Storage, Cluster High-Mem
hi1.4xlarge: 60.5 GB memory; 35 EC2 Compute Units; 2 x 1024 GB SSD instance storage; 64-bit platform
cc1.4xlarge: 23 GB memory; 33.5 EC2 Compute Units; 1690 GB instance storage; 64-bit platform
c1.xlarge: 7 GB memory; 20 EC2 Compute Units; 1690 GB instance storage; 64-bit platform
m1.small: 1.7 GB memory; 1 EC2 Compute Unit; 160 GB instance storage; 32-bit or 64-bit platform
m1.medium: 3.75 GB memory; 2 EC2 Compute Units; 410 GB instance storage; 32-bit or 64-bit platform
m1.large (EBS Optimizable): 7.5 GB memory; 4 EC2 Compute Units; 850 GB instance storage; 64-bit platform
m1.xlarge (EBS Optimizable): 15 GB memory; 8 EC2 Compute Units; 1690 GB instance storage; 64-bit platform
m2.xlarge: 17.1 GB memory; 6.5 EC2 Compute Units; 420 GB instance storage; 64-bit platform
m2.2xlarge: 34.2 GB memory; 13 EC2 Compute Units; 850 GB instance storage; 64-bit platform
m2.4xlarge (EBS Optimizable): 68.4 GB memory; 26 EC2 Compute Units; 1690 GB instance storage; 64-bit platform
t1.micro: 613 MB memory; up to 2 EC2 Compute Units; EBS storage only; 32-bit or 64-bit platform
c1.medium: 1.7 GB memory; 5 EC2 Compute Units; 350 GB instance storage; 32-bit or 64-bit platform
cg1.4xlarge: 22 GB memory; 33.5 EC2 Compute Units; 2 x NVIDIA Tesla “Fermi” M2050 GPUs; 1690 GB instance storage; 64-bit platform
cc2.8xlarge: 60.5 GB memory; 88 EC2 Compute Units; 3370 GB instance storage; 64-bit platform
m3.xlarge: 15 GB memory; 13 EC2 Compute Units
m3.2xlarge (EBS Optimizable): 30 GB memory; 26 EC2 Compute Units
hs1.8xlarge: 117 GB memory; 35 EC2 Compute Units; 24 x 2 TB instance storage; 64-bit platform
cr1.8xlarge: 244 GB memory; 88 EC2 Compute Units; 2 x 120 GB SSD instance storage; 64-bit platform
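With this many instance types, matching a workload to an instance is a small search problem. The sketch below picks the instance with the least memory that still meets a memory and compute minimum, using a subset of the figures listed above. Ranking by memory is an assumption made for illustration (the slides give no prices here), and `smallest_fit` is a hypothetical helper, not an AWS API.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    memory_gb: float
    ecu: float  # EC2 Compute Units


# Subset of the instance types listed above (2013 figures from the slide).
CATALOG = [
    InstanceType("m1.small", 1.7, 1),
    InstanceType("m1.large", 7.5, 4),
    InstanceType("m1.xlarge", 15, 8),
    InstanceType("m2.4xlarge", 68.4, 26),
    InstanceType("c1.xlarge", 7, 20),
    InstanceType("cc2.8xlarge", 60.5, 88),
    InstanceType("cr1.8xlarge", 244, 88),
]


def smallest_fit(min_memory_gb: float, min_ecu: float) -> Optional[InstanceType]:
    """Return the type with the least memory (then least ECU) meeting both minimums."""
    fits = [i for i in CATALOG if i.memory_gb >= min_memory_gb and i.ecu >= min_ecu]
    # Assumption: "smallest" is ranked by memory first; a real choice would weigh price.
    return min(fits, key=lambda i: (i.memory_gb, i.ecu), default=None)


print(smallest_fit(4, 16))    # a CPU-heavy job with modest memory needs
print(smallest_fit(32, 8))    # a memory-heavy job
print(smallest_fit(500, 1))   # nothing in the catalog is large enough
```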
GPU GRAPHICS PROCESSING UNIT
CLUSTER GPU QUADRUPLE EXTRA LARGE
2x Intel Xeon X5570, quad-core, Nehalem architecture
2x NVIDIA Tesla Fermi M2050 GPUs
22 GB of memory, 1.7 TB of storage
$0.35 / hour (Amazon EC2 Spot)
PARALLELIZATION
ON A SINGLE INSTANCE
COST: 4h x $2.1 = $8.4
RENDERING TIME: 4h
ON MULTIPLE INSTANCES
COST: 2 x 2h x $2.1 = $8.4
RENDERING TIME: 2h
What are Spot Instances?
[Diagram: a Region with two Availability Zones; unused capacity in each zone is sold as Spot Instances at discounts of 50%, 56%, 66%, 59%, 54%, and 63%]
ON MULTIPLE SPOT INSTANCES
COST: 4 x 1h x $0.35 = $1.4
RENDERING TIME: 1h
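The three rendering scenarios follow one formula: cost = instances x hours per instance x hourly price. The sketch below reproduces the slide arithmetic. It assumes the job parallelizes perfectly, so doubling the instance count halves the wall-clock time, and that billing is per instance-hour; `render_cost` is an illustrative helper name.

```python
def render_cost(instances: int, hours_each: float, price_per_hour: float) -> float:
    """Total cost when each of `instances` machines runs for `hours_each` hours."""
    return instances * hours_each * price_per_hour


# (instances, hours per instance, hourly price in USD), taken from the slides.
scenarios = {
    "single on-demand instance": (1, 4, 2.10),
    "two on-demand instances": (2, 2, 2.10),
    "four Spot instances": (4, 1, 0.35),
}

for label, (count, hours, price) in scenarios.items():
    cost = render_cost(count, hours, price)
    print(f"{label}: wall-clock {hours}h, cost ${cost:.2f}")
```

On-demand parallelization buys speed at no extra cost; Spot pricing then lowers the cost as well. Spot prices fluctuate and Spot Instances can be interrupted, so the $0.35 rate should be read as a point-in-time example.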
"Hadoop is a reliable storage and data analysis system"
HDFS MapReduce
Deploying a Hadoop cluster is hard
AMAZON EMR HADOOP + AWS
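For readers new to the MapReduce half of Hadoop, the sketch below runs the map, shuffle, and reduce steps of a word count inside a single Python process. It is a toy illustration of the programming model only; it does not use Hadoop, HDFS, or any Amazon EMR API, and all function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data in the cloud", "big data at any scale"])
print(counts)
```

In a real cluster the map and reduce functions run on many nodes in parallel and the framework handles the shuffle, which is where much of the deployment difficulty lies.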
[Chart: Amazon Elastic MapReduce clusters launched by customers, May 2010 to March 2013; vertical axis 0 to 6,000,000]
Amazon EMR: 5.5M clusters launched by customers since May 2010
Massive Scale
USE THE RIGHT TOOL FOR THE RIGHT JOB
RDBMS (Amazon RDS)
Affordable Storage/Compute
Structured or Not (Agility)
Resilient Auto Scalability
Interactive Reporting (<1sec)
Multistep Transactions
Lots of Updates/Deletes
Hadoop (Amazon EMR)
Data Warehouse (Steady State)
Expand to 25 instances
Data Warehouse (Batch Processing)
Shrink to 9 instances
Data Warehouse (Steady State)
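The benefit of resizing between 9 and 25 instances can be expressed in instance-hours. The sketch below compares an elastic cluster with one provisioned at peak size all day. The 4-hour daily batch window is a hypothetical value chosen for illustration, since the slides do not state one, and both function names are illustrative.

```python
def elastic_instance_hours(steady: int, peak: int, batch_hours: float) -> float:
    """Daily instance-hours when the cluster grows to `peak` only during the batch window."""
    return peak * batch_hours + steady * (24 - batch_hours)


def fixed_instance_hours(peak: int) -> float:
    """Daily instance-hours when the cluster stays at peak size around the clock."""
    return peak * 24


BATCH_HOURS = 4  # hypothetical batch window; not given in the slides

elastic = elastic_instance_hours(steady=9, peak=25, batch_hours=BATCH_HOURS)
fixed = fixed_instance_hours(peak=25)
saving = 1 - elastic / fixed

print(f"Elastic: {elastic:.0f} instance-hours/day")
print(f"Fixed at peak: {fixed:.0f} instance-hours/day")
print(f"Saving: {saving:.0%}")
```

The shorter the batch window relative to the day, the larger the gap between the two figures.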
COLLECT │ STORE │ ANALYZE │ SHARE
PUBLIC DATA SETS
http://aws.amazon.com/publicdatasets
COLLECT │ STORE │ ANALYZE │ SHARE
INNOVATE
« Want to increase innovation?
Lower the cost of failure »
Joi Ito
AWS LOWERS THE COST OF INNOVATION
Testing a new idea is cheap
Georgetown University
Next-generation sequencing and whole-genome analysis to identify the causes of premature birth
Solution Overview
Alignment, mapping, variant-calling
Downstream variant analytic pipelines
Hosted data portal including MongoDB
Genomic data storage (raw and processed)
Accessing the 1000 Genomes public data set
SEC MIDAS & Tradeworx
Real-time analysis of 20 billion messages/day
Reconstruct any market, any day in history
Solution Overview
Data Servers
Analytic Servers
Market reconstruction processing
Store historical stock ‘tick’ information
The Results
“For the growing team of quant types now employed at the SEC, MIDAS is becoming the world’s greatest data sandbox. And the staff is planning to use it to make the SEC a leader in its use of market data.”
Elisse B. Walter, Chairman of the SEC
“This basically propels the SEC from zero to 60 in one fell swoop, going from being way behind even the most basic market participant to being on par if not ahead of the vast majority of market participants, in terms of their system and analytical capabilities.”
Gregg E. Berman, Associate Director of the Office of Analytics and Research
Thank You