(BDT210) Building Scalable Big Data Solutions: Intel & AOL

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Bob Rogers, PhD Chief Data Scientist for Big Data Solutions, Intel Durga Nemani, System Architect AOL Inc. October 2015 Building Scalable Big Data Solutions BDT210

Transcript of (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Page 1: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Bob Rogers, PhD Chief Data Scientist for Big Data Solutions, Intel

Durga Nemani, System Architect AOL Inc.

October 2015

Building Scalable Big Data Solutions

BDT210

Page 2: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Building Scalable Big Data Solutions

October 2015

Bob Rogers, PhD

Chief Data Scientist for Big Data Solutions

Intel

Page 3: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

@scientistBob

About me

Page 4: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

What does Big Data have to do with Intel?

Trusted Analytics Platform

Page 5: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Intel contributions to Apache Hadoop

Encryption: Intel® AES-NI

Page 6: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Use case:

Assemble an accurate patient problem list

Why?

• To improve patient outcomes

KPI

• False negatives in problem list

Page 7: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

What does a patient look like to a data scientist?

Page 8: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

My first enterprise data hub

Page 9: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Poll: What percent of the key clinical data do you think is missing from the problem list?

• 0-25%

• 25-50%

• 50-75%

• 75-100%

Page 10: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Poll: What percent of the key clinical data do you think is missing from the problem list?

>63% missing

Page 11: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Real patient example

Coded Data

Free Text

Scanned Documents

Other

Data Silos

Page 12: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Missing information

Page 13: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

What did we learn?

• Start with what you know

• Leverage existing technologies

• Use simple tools

• Measure your results

Page 14: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Powerful Big Data analytics reveal the truth about your…

…customers

…products

…ecosystem

…opportunities

Page 15: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Thank you

[email protected]

@scientistBob

Page 16: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Building Scalable Big Data Solutions

Durga Nemani – AOL Inc.

Page 17: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

BACKGROUND & ARCHITECTURE

Page 18: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

HYBRID

Page 19: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

The Three Vs

• Volume: Multiple terabytes per day

• Variety: Delimited, Avro, JSON

• Velocity: Hourly, batch

Page 20: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Workload Management

• “One size fits all” model does not work.

• Specific infrastructure tuned to needs and requirements

• Variety of EMR clusters as per data need

Diagram labels:

• Workloads with significant diversity of needs

• Resources with lowest common denominator

• Resources for workloads with significant diversity of needs

Page 21: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

[Diagram: Amazon S3 and four Amazon EMR clusters]

Page 22: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

AWS Services: EC2, EMR, S3

Open Source Technologies: Apache Hive, Apache Pig, Apache Hadoop

Open Source Data Formats: JSON, Avro, Parquet

Page 23: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

UNIQUE FEATURES & ADVANTAGES

Page 24: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Separation of Compute and Storage

Page 25: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

SEE, SPOT, SQUEEZE

• Just enough spot instances to finish the job in 59 minutes.
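The sizing idea on this slide can be illustrated with a little arithmetic. The sketch below assumes that a job's total work can be expressed in instance-minutes and that it splits evenly across instances; the function name `instances_needed` and the `startup_minutes` parameter are hypothetical and not from the talk. The 59-minute target presumably reflects the hourly EC2 billing in effect in 2015.

```python
import math


def instances_needed(total_instance_minutes: float,
                     deadline_minutes: float = 59.0,
                     startup_minutes: float = 0.0) -> int:
    """Smallest instance count that finishes the work before the deadline.

    Assumes the work splits evenly across instances (an idealization).
    """
    usable = deadline_minutes - startup_minutes
    if usable <= 0:
        raise ValueError("startup time leaves no room before the deadline")
    return max(1, math.ceil(total_instance_minutes / usable))


# Hypothetical job: 590 instance-minutes of work.
print(instances_needed(590))                     # 590 / 59 -> 10
print(instances_needed(590, startup_minutes=9))  # 590 / 50 -> 12
```

In practice the even-split assumption rarely holds exactly, so some headroom above the computed count is likely needed.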

Page 26: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Key Features

• Separation of compute and storage: Amazon S3 and Amazon EMR

• Transient clusters: No permanent cluster. Different size clusters for different datasets

• Separation of duties: Independent jobs for processing, extracting, loading, and monitoring

• Parallelism: Process the smallest chunk of data possible in parallel to reduce dependencies

• Scalability: Hundreds of Amazon EMR clusters in multiple regions and Availability Zones

• Cost optimized: All Spot instances. Launch in the Availability Zone with the lowest Spot prices
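The cost-optimization point (launch where Spot is cheapest) reduces to picking the minimum over a set of per-zone prices. The sketch below is illustrative only: the function `cheapest_zone`, the zone names, and the prices are hypothetical, and a real job would read current prices from the EC2 Spot price history rather than a hard-coded dictionary.

```python
from typing import Mapping


def cheapest_zone(spot_prices: Mapping[str, float]) -> str:
    """Return the Availability Zone with the lowest quoted Spot price."""
    if not spot_prices:
        raise ValueError("no spot prices supplied")
    return min(spot_prices, key=lambda zone: spot_prices[zone])


# Hypothetical hourly prices in USD for one instance type.
prices = {"us-east-1a": 0.112, "us-east-1b": 0.087, "us-east-1c": 0.095}
print(cheapest_zone(prices))  # us-east-1b
```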

Page 27: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

DATA & INSIGHTS

Page 28: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

CLOUD Facts

• Total compressed Amazon S3 data size: 150 TB

• Uncompressed raw data per day: 2-3 TB

• Amazon EMR clusters per day: 350

• Amazon S3 data retention period: 13-24 months

Page 29: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Restatement Use Case

• 150 terabytes raw

• 10 Availability Zones

• 550 EMR clusters

• 24,000 EC2 instances

Page 30: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

AWS COST BREAKOUT

• Shares shown in the chart: 44%, 40%, 16%

• Cost categories: EC2 cost, EMR fee, S3 cost

** Storage cost recurs every month at $2.85 per 100 GB
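As a worked example of the storage footnote, the sketch below applies the quoted rate of $2.85 per 100 GB per month to the 150 TB total from the CLOUD Facts slide. It assumes decimal units (1 TB = 1,000 GB) and that the full 150 TB is billed at that single rate; both are assumptions, and the function name is hypothetical.

```python
PRICE_PER_100_GB = 2.85  # USD per month, from the slide footnote
GB_PER_TB = 1000         # assumption: decimal terabytes


def monthly_storage_cost(terabytes: float) -> float:
    """Monthly storage cost in USD at the flat rate quoted on the slide."""
    gigabytes = terabytes * GB_PER_TB
    return round(gigabytes / 100 * PRICE_PER_100_GB, 2)


# 150 TB -> 1,500 blocks of 100 GB -> about 4,275 USD per month
print(monthly_storage_cost(150))
```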

Page 31: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Best Practices & Suggestions

Page 32: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Tag all resources

Infrastructure as Code

Command Line Interface

JSON as configuration files

AWS Identity and Access Management (IAM) roles and policies

Use of application ID

Enable CloudTrail

S3 lifecycle management

S3 versioning

Separate code/data/logs buckets

Keyless EMR clusters

Hybrid model

Enable debugging

Create multiple CLI profiles

Multi-factor authentication

CloudWatch billing alarms

EC2 Spot instances

SNS notifications for failures

Loosely coupled Apps

Scale horizontally
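Two of these practices, "JSON as configuration files" and "Tag all resources", combine naturally: a launcher can refuse any cluster definition that lacks the agreed tags. The sketch below is one possible shape for that check. The configuration fields, the tag names in `REQUIRED_TAGS`, and the function `load_cluster_config` are all hypothetical and not taken from the talk.

```python
import json

REQUIRED_TAGS = {"application_id", "owner"}  # hypothetical tagging policy


def load_cluster_config(text: str) -> dict:
    """Parse a JSON cluster definition and reject it if tags are missing."""
    config = json.loads(text)
    missing = REQUIRED_TAGS - set(config.get("tags", {}))
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    return config


# Hypothetical configuration file contents.
example = """
{
  "name": "hourly-batch",
  "instance_count": 10,
  "tags": {"application_id": "app-001", "owner": "data-team"}
}
"""
config = load_cluster_config(example)
print(config["name"], config["tags"]["application_id"])
```

Keeping the definition in a JSON file means the same file can be versioned alongside the code and passed to command-line tooling.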

Page 33: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Next Steps

Page 34: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Database on cloud

• Database on AWS

• Options: Amazon RDS, Amazon Redshift, or others using Amazon EC2

Event-driven design

• Kick off code based on events

• Run downstream processes as soon as upstream completes

• Options: AWS Lambda, Amazon SQS, Amazon SWF, or AWS Data Pipeline
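The event-driven idea (run downstream work as soon as upstream completes) can be sketched without any AWS service. The in-memory `EventBus` below is a hypothetical stand-in for whichever service is chosen; the event names and dataset path are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class EventBus:
    """Minimal in-memory publish/subscribe dispatcher."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        # Call every handler registered for this event, in subscription order.
        for handler in list(self._handlers[event]):
            handler(payload)


bus = EventBus()
log: List[str] = []


def load_dataset(payload: dict) -> None:
    log.append(f"load {payload['dataset']}")
    bus.publish("load.completed", payload)  # hand off to the next stage


def notify(payload: dict) -> None:
    log.append(f"notify {payload['dataset']}")


bus.subscribe("processing.completed", load_dataset)
bus.subscribe("load.completed", notify)

# Upstream processing finishes; downstream steps run immediately.
bus.publish("processing.completed", {"dataset": "clicks/2015-10-08-13"})
print(log)
```

Each stage only knows the event it listens for, which keeps the jobs loosely coupled in the sense of the best-practices slide.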

Data analytics

• Implement massive parallel processing technologies

• Options: Spark, Impala or Presto

DevOps on cloud

• Rapidly and automatically deploy new code

• Continuous Integration/Continuous Deployment

• Options: AWS CodeDeploy, AWS CodeCommit, or AWS CodePipeline

Page 35: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Q & A

Page 36: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

THANK YOU

Recommended session:

BDT208 - A Technical Introduction to Amazon Elastic MapReduce

Thursday, Oct 8, 12:15 PM - 1:15 PM – Titian 2201B

Page 37: (BDT210) Building Scalable Big Data Solutions: Intel & AOL

Remember to complete your evaluations!