(BDT210) Building Scalable Big Data Solutions: Intel & AOL

Post on 23-Jan-2018

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Bob Rogers, PhD, Chief Data Scientist for Big Data Solutions, Intel

Durga Nemani, System Architect, AOL Inc.

October 2015

Building Scalable Big Data Solutions

BDT210

Bob Rogers, PhD

Chief Data Scientist for Big Data Solutions

Intel

@scientistBob

About me

What does Big Data have to do with Intel?

Trusted Analytics Platform

Intel contributions to Apache Hadoop

Encryption: Intel® AES-NI

Use case:

Assemble an accurate patient problem list

Why?

• To improve patient outcomes

KPI

• False negatives in problem list
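The KPI above can be made concrete with a small calculation. The following is a minimal sketch, assuming a "false negative" means a condition the patient actually has that does not appear on the problem list; the function name and the sample conditions are hypothetical and are not taken from the talk.

```python
def false_negative_rate(problem_list, confirmed_conditions):
    """Fraction of confirmed conditions that are absent from the problem list.

    Both arguments are iterables of condition names (hypothetical labels).
    Returns 0.0 when there are no confirmed conditions to miss.
    """
    confirmed = set(confirmed_conditions)
    if not confirmed:
        return 0.0
    missed = confirmed - set(problem_list)
    return len(missed) / len(confirmed)


# Hypothetical patient: four confirmed conditions, only one is listed.
listed = ["diabetes"]
confirmed = ["diabetes", "hypertension", "chronic kidney disease", "asthma"]
rate = false_negative_rate(listed, confirmed)
```

Tracking this rate before and after each change to the data pipeline gives a simple way to measure whether the problem list is becoming more complete.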

What does a patient look like to a data scientist?

My first enterprise data hub

Poll: What percent of the key clinical data do you think is missing from the problem list?

• 0-25%
• 25-50%
• 50-75%
• 75-100%

Result: >63% missing

Real patient example

Data Silos

• Coded Data
• Free Text
• Scanned Documents
• Other

Missing information

What did we learn?

• Start with what you know

• Leverage existing technologies

• Use simple tools

• Measure your results

Powerful Big Data analytics reveal the truth about your…

…customers

…products

…ecosystem

…opportunities

Thank you

bob.rogers@intel.com

@scientistBob

Building Scalable Big Data Solutions

Durga Nemani – AOL Inc.

BACKGROUND & ARCHITECTURE

HYBRID

The Three Vs

• Volume: multiple terabytes per day
• Variety: delimited, Avro, JSON
• Velocity: hourly, batch

Workload Management

• A “one size fits all” model does not work.
• Specific infrastructure tuned to needs and requirements
• Variety of EMR clusters as per data need

[Diagram: workloads with significant diversity of needs; resources with lowest common denominator versus resources for workloads with significant diversity of needs; S3 alongside four EMR clusters]

AWS Services: EC2, EMR, S3

Open Source Technologies: Apache Hive, Apache Pig, Apache Hadoop

Open Source Data Formats: JSON, Avro, Parquet

UNIQUE FEATURES & ADVANTAGES

Separation of Compute and Storage

SEE, SPOT, SQUEEZE

• Just enough spot instances to finish the job in 59 minutes.
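The sizing rule above can be sketched as simple arithmetic. The 59-minute target presumably reflects the fact that EC2 instances were billed by the full hour when this talk was given, so a job that finishes just under the hour uses exactly one billed hour per instance. The sketch below is a rough estimate only: it assumes the work scales perfectly linearly across instances, and the function name and the "instance-minutes" input are hypothetical.

```python
import math


def spot_instances_needed(instance_minutes: float, deadline_minutes: int = 59) -> int:
    """Smallest instance count that finishes the job within the deadline.

    instance_minutes: estimated total work, measured as the minutes a single
    instance would need (hypothetical input, e.g. from earlier runs).
    Assumes perfectly linear scaling, which real jobs only approximate.
    """
    if instance_minutes <= 0:
        return 0
    return math.ceil(instance_minutes / deadline_minutes)


# A job estimated at 600 single-instance minutes needs 11 instances,
# because 10 instances would take 60 minutes and spill into a second hour.
needed = spot_instances_needed(600)
```

In practice the estimate would be padded for cluster startup time and for stages of the job that do not parallelize.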

Key Features

• Separation of Compute and Storage: Amazon S3 and Amazon EMR

• Transient Clusters: No permanent cluster. Different size clusters for different datasets

• Separation of duties: Independent jobs for processing, extracting, loading and monitoring.

• Parallelism: Process the smallest chunk of data possible in parallel to reduce dependencies

• Scalability: Hundreds of Amazon EMR clusters in multiple regions and Availability Zones

• Cost optimized: All Spot instances. Launch in the Availability Zone with the lowest Spot prices.
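The "launch in the Availability Zone with the lowest Spot price" step can be sketched as follows. This is a minimal illustration, not AOL's implementation: the `SpotPrice` record, the zone names, and the prices are hypothetical, and in a real setup the records would come from the EC2 Spot price history API rather than a hard-coded list.

```python
from typing import Dict, Iterable, NamedTuple


class SpotPrice(NamedTuple):
    zone: str        # Availability Zone name
    timestamp: int   # observation time, seconds since epoch
    price: float     # USD per instance-hour


def latest_price_per_zone(history: Iterable[SpotPrice]) -> Dict[str, float]:
    """Keep only the most recent observed price for each zone."""
    latest: Dict[str, SpotPrice] = {}
    for record in history:
        current = latest.get(record.zone)
        if current is None or record.timestamp > current.timestamp:
            latest[record.zone] = record
    return {zone: record.price for zone, record in latest.items()}


def cheapest_zone(history: Iterable[SpotPrice]) -> str:
    """Return the zone whose most recent Spot price is lowest."""
    prices = latest_price_per_zone(history)
    if not prices:
        raise ValueError("no Spot price observations supplied")
    return min(prices, key=prices.get)


# Hypothetical observations for three zones.
history = [
    SpotPrice("us-east-1a", 1, 0.10),
    SpotPrice("us-east-1a", 2, 0.30),  # price rose after the first sample
    SpotPrice("us-east-1b", 1, 0.20),
    SpotPrice("us-east-1c", 2, 0.25),
]
target_zone = cheapest_zone(history)
```

Using the latest price per zone rather than the historical minimum avoids choosing a zone that was cheap earlier but has since spiked.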

DATA & INSIGHTS

CLOUD Facts

• Total compressed Amazon S3 data size: 150 TB
• Uncompressed raw data per day: 2-3 TB
• Amazon EMR clusters per day: 350
• Amazon S3 data retention period: 13-24 months

Restatement Use Case

• 150 terabytes raw
• 10 Availability Zones
• 550 EMR clusters
• 24,000 EC2 instances

AWS COST BREAKOUT

[Pie chart with slices of 44%, 40%, and 16%; legend: EC2 Cost, EMR Fee, S3 Cost]

** Storage cost recurs every month at $2.85 per 100 GB
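As a worked example of the storage footnote, the recurring monthly cost for the 150 TB of compressed S3 data quoted on the Cloud Facts slide can be estimated from the $2.85 per 100 GB rate. This is a back-of-the-envelope sketch: it assumes 1 TB = 1,000 GB and a single flat rate, which is a simplification of real tiered pricing; the function name is hypothetical.

```python
def monthly_storage_cost(size_tb: float,
                         rate_per_100_gb: float = 2.85,
                         gb_per_tb: int = 1000) -> float:
    """Estimated recurring monthly storage cost in USD.

    Uses the flat rate from the slide footnote; gb_per_tb is an assumption
    (decimal terabytes), since the slide does not say which unit it means.
    """
    size_gb = size_tb * gb_per_tb
    return round(size_gb / 100 * rate_per_100_gb, 2)


# 150 TB at $2.85 per 100 GB works out to roughly $4,275 per month.
estimate = monthly_storage_cost(150)
```

With binary terabytes (1,024 GB per TB) the same calculation gives a slightly higher figure, so the result should be read as an order-of-magnitude check rather than a bill.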

Best Practices & Suggestions

• Tag all resources
• Infrastructure as Code
• Command Line Interface
• JSON as configuration files
• AWS Identity and Access Management (IAM) roles and policies
• Use of application ID
• Enable CloudTrail
• S3 lifecycle management
• S3 versioning
• Separate code/data/logs buckets
• Keyless EMR clusters
• Hybrid model
• Enable debugging
• Create multiple CLI profiles
• Multi-factor authentication
• CloudWatch billing alarms
• EC2 Spot instances
• SNS notifications for failures
• Loosely coupled apps
• Scale horizontally
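Several of these practices (tagging every resource, using an application ID, keeping configuration in JSON files) fit together naturally. The sketch below shows one possible way to generate a cluster configuration file that refuses to be written without the required tags. The tag names, field names, and values are all hypothetical; the Key/Value tag layout only loosely mirrors the shape AWS tooling commonly uses and is not a real EMR API payload.

```python
import json

# Hypothetical tagging policy: every cluster must carry these tags.
REQUIRED_TAGS = ("application-id", "owner", "environment")


def build_cluster_config(name: str, instance_count: int, tags: dict) -> dict:
    """Build a cluster description, rejecting it if a required tag is missing."""
    missing = [tag for tag in REQUIRED_TAGS if tag not in tags]
    if missing:
        raise ValueError("missing required tags: " + ", ".join(missing))
    return {
        "Name": name,
        "InstanceCount": instance_count,
        "Tags": [{"Key": key, "Value": value} for key, value in sorted(tags.items())],
    }


def to_json(config: dict) -> str:
    """Serialize the configuration so it can be stored under version control."""
    return json.dumps(config, indent=2, sort_keys=True)


config = build_cluster_config(
    "hourly-batch",
    20,
    {"application-id": "app-001", "owner": "data-team", "environment": "prod"},
)
config_text = to_json(config)
```

Keeping the generated JSON in a repository makes each cluster definition reviewable and repeatable, which is the point of treating infrastructure as code.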

Next Steps

Database on cloud

• Database on AWS

• Options: Amazon RDS, Amazon Redshift, or others using Amazon EC2

Event-driven design

• Kick off code based on events

• Run downstream processes as soon as upstream completes

• Options: AWS Lambda, Amazon SQS, Amazon SWF or AWS Data Pipeline
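The event-driven idea above, where downstream work starts as soon as an upstream step reports completion, can be illustrated with a tiny in-process dispatcher. This is a conceptual sketch only: the `EventBus` class, the event names, and the handlers are hypothetical, and a production design would use a managed service such as those listed above rather than an in-memory object.

```python
from collections import defaultdict


class EventBus:
    """Minimal publish/subscribe dispatcher kept in memory."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        """Register a callable to run whenever event_name is published."""
        self._handlers[event_name].append(handler)

    def publish(self, event_name, payload):
        """Run every handler for the event and collect their return values."""
        return [handler(payload) for handler in self._handlers[event_name]]


bus = EventBus()
log = []


def load_dataset(payload):
    # Downstream step: runs as soon as processing reports completion.
    log.append("load " + payload["dataset"])
    return bus.publish("load.complete", payload)


def notify(payload):
    log.append("notify " + payload["dataset"])
    return "notified"


bus.subscribe("processing.complete", load_dataset)
bus.subscribe("load.complete", notify)

# The upstream job announces completion; the chain runs with no polling.
results = bus.publish("processing.complete", {"dataset": "hourly-logs"})
```

Each step knows only the event it reacts to, not which job produced it, which keeps the applications loosely coupled.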

Data analytics

• Implement massively parallel processing technologies

• Options: Spark, Impala or Presto

DevOps on cloud

• Rapidly and automatically deploy new code

• Continuous Integration/Continuous Deployment

• Options: AWS CodeDeploy, AWS CodeCommit, or AWS CodePipeline

Q & A

THANK YOU

Recommended session:

BDT208 - A Technical Introduction to Amazon Elastic MapReduce

Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B

Remember to complete your evaluations!