© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Bob Rogers, PhD, Chief Data Scientist for Big Data Solutions, Intel
Durga Nemani, System Architect, AOL Inc.
October 2015
Building Scalable Big Data Solutions
BDT210
Bob Rogers, PhD
Chief Data Scientist for Big Data Solutions
Intel
@scientistBob
About me
What does Big Data have to do with Intel?
Trusted Analytics Platform
Intel contributions to Apache Hadoop
Encryption: Intel® AES-NI
Use case:
Assemble an accurate patient problem list
Why?
• To improve patient outcomes
KPI
• False negatives in problem list
What does a patient look like to a data scientist?
My first enterprise data hub
Poll: What percent of the key clinical data do you think is missing from the problem list?
• 0-25%
• 25-50%
• 50-75%
• 75-100%
Poll result: more than 63% of the key clinical data is missing from the problem list.
Real patient example
• Coded Data
• Free Text
• Scanned Documents
• Other Data Silos
Missing information
What did we learn?
• Start with what you know
• Leverage existing technologies
• Use simple tools
• Measure your results
Powerful Big Data analytics reveal the truth about your…
…customers
…products
…ecosystem
…opportunities
Building Scalable Big Data Solutions
Durga Nemani – AOL Inc.
BACKGROUND & ARCHITECTURE
HYBRID
The Three Vs
• Volume: multiple terabytes per day
• Variety: delimited, Avro, JSON
• Velocity: hourly, batch
Workload Management
• A “one size fits all” model does not work.
• Specific infrastructure tuned to needs and requirements
• A variety of EMR clusters, sized to each dataset's needs
[Figure: workloads with significant diversity of needs; resources with lowest common denominator compared with resources for workloads with significant diversity of needs]
[Figure: S3 feeding multiple EMR clusters]
• AWS Services: EC2, EMR, S3
• Open Source Technologies: Apache Hive, Apache Pig, Apache Hadoop
• Open Source Data Formats: JSON, Avro, Parquet
UNIQUE FEATURES & ADVANTAGES
Separation of Compute and Storage
SEE, SPOT, SQUEEZE
• Just enough spot instances to finish the job in 59 minutes.
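The sizing idea in the bullet above can be sketched as simple arithmetic. This is a minimal illustration, not the speakers' actual tooling: the function name, the notion of "instance-minutes", and the sample workload are all assumptions, and it idealizes the job as splitting evenly across instances. One plausible reading of the 59-minute target is that it keeps each instance inside a single billed hour.

```python
import math


def instances_needed(total_instance_minutes: float, deadline_minutes: int = 59) -> int:
    """Smallest instance count that finishes the work within the deadline.

    Assumes the work divides evenly across instances (an idealization).
    """
    if total_instance_minutes <= 0:
        return 0
    return math.ceil(total_instance_minutes / deadline_minutes)


# Hypothetical job: 1,000 instance-minutes of work in total.
print(instances_needed(1000))
```

In practice the estimate would need headroom for cluster startup time and for data skew between tasks.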
Key Features
• Separation of compute and storage: Amazon S3 and Amazon EMR
• Transient clusters: no permanent cluster; different cluster sizes for different datasets
• Separation of duties: independent jobs for processing, extracting, loading, and monitoring
• Parallelism: process the smallest possible chunk of data in parallel to reduce dependencies
• Scalability: hundreds of Amazon EMR clusters in multiple regions and Availability Zones
• Cost optimized: all Spot instances, launched in the Availability Zone with the lowest Spot price
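The cost-optimization bullet, launching in the Availability Zone with the lowest Spot price, might look like the sketch below. It is an assumption-laden illustration: the record fields (`AvailabilityZone`, `SpotPrice`, `Timestamp`) are modeled on the EC2 Spot price history response as I understand it, the sample prices are invented, and a real version would fetch the records from the EC2 API instead of a hard-coded list.

```python
from typing import Iterable, Mapping


def cheapest_zone(records: Iterable[Mapping[str, str]]) -> str:
    """Return the zone whose most recent Spot price is lowest."""
    latest = {}
    for rec in records:
        zone = rec["AvailabilityZone"]
        # Keep only the newest price per zone (ISO timestamps sort as strings).
        if zone not in latest or rec["Timestamp"] > latest[zone]["Timestamp"]:
            latest[zone] = rec
    best = min(latest.values(), key=lambda r: float(r["SpotPrice"]))
    return best["AvailabilityZone"]


# Invented sample data for illustration only.
sample = [
    {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.0410", "Timestamp": "2015-10-08T12:00:00Z"},
    {"AvailabilityZone": "us-east-1b", "SpotPrice": "0.0350", "Timestamp": "2015-10-08T12:05:00Z"},
    {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.0300", "Timestamp": "2015-10-08T11:00:00Z"},
]
print(cheapest_zone(sample))
```

Comparing only the latest price per zone avoids picking a zone on the strength of a stale quote.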
DATA & INSIGHTS
CLOUD Facts
• Total compressed Amazon S3 data size: 150 TB
• Uncompressed raw data per day: 2-3 TB
• Amazon EMR clusters per day: 350
• Amazon S3 data retention period: 13-24 months
Restatement Use Case
• 150 terabytes raw
• 10 Availability Zones
• 550 EMR clusters
• 24,000 EC2 instances
AWS COST BREAKOUT
[Chart: cost split among EC2 Cost, EMR Fee, and S3 Cost; slices of 44%, 40%, and 16%]
** Storage cost recurs every month at $2.85 per 100 GB
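As a worked example of the storage footnote, the quoted rate of $2.85 per 100 GB can be applied to the 150 TB compressed S3 footprint from the CLOUD Facts slide. This is a back-of-the-envelope sketch: the function name is hypothetical, and treating 1 TB as 1,000 GB is an assumption (using 1,024 GB per TB gives a slightly higher figure).

```python
RATE_USD_PER_100_GB = 2.85  # monthly rate quoted on the slide


def monthly_storage_cost(terabytes: float, gb_per_tb: int = 1000) -> float:
    """Monthly storage cost in USD at the quoted rate (decimal TB assumed)."""
    return terabytes * gb_per_tb / 100 * RATE_USD_PER_100_GB


print(f"${monthly_storage_cost(150):,.2f} per month")  # prints $4,275.00 per month
```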
Best Practices & Suggestions
• Tag all resources
• Infrastructure as Code
• Command Line Interface
• JSON as configuration files
• AWS Identity and Access Management (IAM) roles and policies
• Use of application ID
• Enable CloudTrail
• S3 lifecycle management
• S3 versioning
• Separate code/data/logs buckets
• Keyless EMR clusters
• Hybrid model
• Enable debugging
• Create multiple CLI profiles
• Multi-factor authentication
• CloudWatch billing alarms
• EC2 Spot instances
• SNS notifications for failures
• Loosely coupled Apps
• Scale horizontally
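Two of the practices above, JSON as configuration files and tagging all resources, can be combined in a small sketch. Everything here is hypothetical: the config layout, the tag names, and the required-tag policy are invented for illustration and are not taken from the talk.

```python
import json

# Hypothetical tagging policy: every cluster config must carry these tags.
REQUIRED_TAGS = {"application_id", "owner", "environment"}

# Hypothetical cluster definition kept in a JSON configuration file.
CONFIG_TEXT = """
{
  "cluster_name": "hourly-batch",
  "instance_count": 17,
  "use_spot": true,
  "tags": {"application_id": "app-001", "owner": "data-team", "environment": "prod"}
}
"""


def load_cluster_config(text: str) -> dict:
    """Parse a JSON cluster config and reject it if required tags are missing."""
    config = json.loads(text)
    missing = REQUIRED_TAGS - set(config.get("tags", {}))
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    return config


config = load_cluster_config(CONFIG_TEXT)
print(config["cluster_name"], config["tags"]["application_id"])
```

Validating tags at load time means an untagged cluster is rejected before anything is launched, which keeps cost reports attributable.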
Next Steps
Database on cloud
• Database on AWS
• Options: Amazon RDS, Amazon Redshift, or others using Amazon EC2
Event-driven design
• Kick off code based on events
• Run downstream processes as soon as upstream completes
• Options: AWS Lambda, Amazon SQS, Amazon SWF, or AWS Data Pipeline
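The event-driven idea above can be sketched as a handler in the style of an AWS Lambda function reacting to S3 object-created notifications. This is an illustration under assumptions: the `_SUCCESS` marker convention, the `load` job name, and the bucket and key values are hypothetical, and the handler only returns the jobs it would start instead of launching anything.

```python
from urllib.parse import unquote_plus

MARKER = "_SUCCESS"  # assumed convention: upstream job writes this file last


def handler(event, context=None):
    """Return the downstream jobs to start for each completed upstream output."""
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(MARKER):
            prefix = key[: -len(MARKER)]
            jobs.append({"job": "load", "input": f"s3://{bucket}/{prefix}"})
    return jobs


# Hypothetical notification for one finished hourly partition.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "processed/2015-10-08-12/_SUCCESS"}}}
    ]
}
print(handler(sample_event))
```

Triggering on a completion marker lets the downstream step start as soon as the upstream output is whole, without polling on a schedule.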
Data analytics
• Implement massive parallel processing technologies
• Options: Spark, Impala or Presto
DevOps on cloud
• Rapidly and automatically deploy new code
• Continuous Integration/Continuous Deployment
• Options: AWS CodeDeploy, AWS CodeCommit, or AWS CodePipeline
Q & A
THANK YOU
Recommended session:
BDT208 - A Technical Introduction to Amazon Elastic MapReduce
Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B
Remember to complete your evaluations!