(BDT210) Building Scalable Big Data Solutions: Intel & AOL

Bob Rogers, PhD, Chief Data Scientist for Big Data Solutions, Intel

Durga Nemani, System Architect, AOL Inc.

October 2015

Building Scalable Big Data Solutions

Bob Rogers, PhD

Chief Data Scientist for Big Data Solutions

@scientistBob

About me

What does Big Data have to do with Intel?

Trusted Analytics Platform

Intel contributions to Apache Hadoop

Encryption: Intel® AES-NI

Use case:

Assemble an accurate patient problem list

• To improve patient outcomes

• False negatives in problem list

What does a patient look like to a data scientist?

My first enterprise data hub

Poll: What percent of the key clinical data do you think is missing from the problem list?

• 0-25%

• 25-50%

• 50-75%

• 75-100%

[Chart: share of key clinical data missing from the problem list]

Real patient example

Free Text

Scanned Document

Data Silos

Missing information

What did we learn?

• Start with what you know

• Leverage existing technologies

• Use simple tools

• Measure your results

Powerful Big Data analytics reveal the truth about your…

…customers

…products

…ecosystem

…opportunities

Thank you

bob.rogers@intel.com

@scientistBob

Building Scalable Big Data Solutions

Durga Nemani – AOL Inc.

BACKGROUND & ARCHITECTURE

HYBRID

The Three Vs

• Volume: Multiple terabytes per day

• Variety: Delimited, Avro, JSON

• Velocity: Hourly, batch

Workload Management

• A “one size fits all” model does not work.

• Specific infrastructure tuned to needs and requirements

• A variety of EMR clusters, sized to each data need

Workloads with significant diversity of needs

Resources with lowest common denominator

Resources for workloads with significant diversity of needs

AWS Services: EC2, EMR, S3

Open Source Technologies: Apache Hive, Apache Pig, Apache Hadoop

Open Source Data Formats: Avro, Parquet

UNIQUE FEATURES & ADVANTAGES

Separation of Compute and Storage

SEE, SPOT, SQUEEZE

• Just enough spot instances to finish the job in 59 minutes.
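The sizing rule above can be sketched as simple arithmetic. This is a minimal illustration, not AOL's actual sizing logic: it assumes the job scales linearly with instance count, and the function name and the 1,000 instance-minute workload are hypothetical. The 59-minute target presumably keeps each instance inside a single billed hour; that reading is an assumption, since the slide does not explain it.

```python
import math


def spot_instances_needed(total_instance_minutes: float, deadline_minutes: int = 59) -> int:
    """Return the smallest instance count that finishes within the deadline.

    Assumes perfect linear scaling: N instances do N times the work per minute.
    """
    if total_instance_minutes <= 0:
        return 0
    return math.ceil(total_instance_minutes / deadline_minutes)


# Hypothetical job that needs 1,000 instance-minutes of compute in total.
print(spot_instances_needed(1000))
```

Real jobs rarely scale perfectly, so a measured runtime from a previous run is a better input than an estimate.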

Key Features

• Separation of Compute and Storage: Amazon S3 and Amazon EMR

• Transient Clusters: No permanent cluster. Different size clusters for different datasets

• Separation of duties: Independent jobs for processing, extracting, loading, and monitoring

• Parallelism: Process the smallest chunk of data possible in parallel to reduce dependencies

• Scalability: Hundreds of Amazon EMR clusters in multiple regions and Availability Zones

• Cost optimized: All Spot instances. Launch in the Availability Zone with the lowest Spot prices
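The "launch in the Availability Zone with the lowest Spot price" rule from the list above can be sketched as follows. The zone names and prices are made-up sample data, and `cheapest_zone` is a hypothetical helper; in practice the prices would come from the EC2 Spot price history rather than a hard-coded dictionary.

```python
from typing import Dict


def cheapest_zone(spot_prices: Dict[str, float]) -> str:
    """Return the Availability Zone with the lowest current Spot price."""
    if not spot_prices:
        raise ValueError("no spot prices supplied")
    return min(spot_prices, key=spot_prices.get)


# Hypothetical prices in USD per instance-hour for one instance type.
sample_prices = {
    "us-east-1a": 0.082,
    "us-east-1b": 0.067,
    "us-east-1c": 0.091,
}

print(cheapest_zone(sample_prices))
```

Because the clusters are transient, this choice can be made again at every launch instead of being fixed once.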

DATA & INSIGHTS

CLOUD Facts

Total Compressed Amazon S3 Data Size: 150 TB

Uncompressed Raw Data/Day: 2-3 TB

Amazon EMR Clusters/Day

Amazon S3 Data Retention Period: 13-24 Months

24,000

Restatement Use Case

Terabytes raw

10 Availability Zones

550 EMR Clusters

EC2 Instances

AWS COST BREAKOUT

** Storage cost recurs every month at $2.85 per 100 GB

EC2 Cost

EMR Fee

S3 Cost
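As a rough worked example of the storage footnote, the quoted rate of $2.85 per 100 GB can be applied to the 150 TB of compressed Amazon S3 data reported in the cloud facts. The sketch assumes 1 TB = 1,000 GB and a flat rate with no tiering; both are simplifications, and the function name is hypothetical.

```python
RATE_USD_PER_100_GB = 2.85  # monthly storage rate quoted on the slide


def monthly_storage_cost_usd(size_gb: float) -> float:
    """Monthly storage cost at a flat rate, rounded to cents."""
    return round(size_gb / 100 * RATE_USD_PER_100_GB, 2)


# 150 TB of compressed data, assuming 1 TB = 1,000 GB.
total_gb = 150 * 1000
print(monthly_storage_cost_usd(total_gb))
```

Under these assumptions the recurring storage cost comes to about $4,275 per month.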

Best Practices & Suggestions

Tag all resources

Infrastructure as

Command Line Interface

JSON as configuration files

AWS Identity and Access Management (IAM) roles and policies

Use of application ID

Enable CloudTrail

S3 lifecycle management

S3 versioning

Separate code/data/logs buckets

Keyless EMR clusters

Hybrid model

Enable debugging

Create multiple CLI profiles

Multi-factor authentication

CloudWatch billing alarms

EC2 Spot instances

SNS notifications for failures

Loosely coupled Apps

Scale horizontally
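Two of the practices above, "JSON as configuration files" and "Tag all resources", combine naturally. The sketch below loads a cluster definition from JSON and rejects it if required tags are missing. The config layout, the tag keys, and the helper name are all hypothetical illustrations, not the format AOL or AWS uses.

```python
import json

# Hypothetical tag keys that every resource must carry.
REQUIRED_TAGS = {"application-id", "owner"}

# Hypothetical cluster definition kept as a JSON configuration file.
CONFIG_TEXT = """
{
  "name": "hourly-batch",
  "instance_count": 17,
  "tags": {"application-id": "app-123", "owner": "data-team"}
}
"""


def load_cluster_config(text: str) -> dict:
    """Parse a JSON cluster config and verify that all required tags are set."""
    config = json.loads(text)
    missing = REQUIRED_TAGS - set(config.get("tags", {}))
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    return config


config = load_cluster_config(CONFIG_TEXT)
print(config["name"], config["tags"]["application-id"])
```

Validating tags before launch keeps untagged resources from appearing in the bill, which also supports the cost breakout by application.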

Next Steps

Database on cloud

• Database on AWS

• Options: Amazon RDS, Amazon Redshift, or others using Amazon EC2

Event-driven design

• Kick off code based on events

• Run downstream processes as soon as upstream completes

• Options: AWS Lambda, Amazon SQS, Amazon SWF, or AWS Data Pipeline
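One way to picture the event-driven design is an AWS Lambda-style handler that reacts to an Amazon S3 object-created notification and starts the next step. This is a minimal sketch, not the speakers' design: `start_downstream` is a hypothetical stand-in for whatever launches the real job, and the sample event contains only the fields the code reads.

```python
from typing import List, Tuple


def objects_from_s3_event(event: dict) -> List[Tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 notification event."""
    return [
        (record["s3"]["bucket"]["name"], record["s3"]["object"]["key"])
        for record in event.get("Records", [])
    ]


def start_downstream(bucket: str, key: str) -> str:
    """Hypothetical stand-in for launching the downstream process."""
    return f"started downstream job for s3://{bucket}/{key}"


def handler(event: dict, context) -> List[str]:
    """Entry point: run one downstream job per newly arrived object."""
    return [start_downstream(b, k) for b, k in objects_from_s3_event(event)]


# Trimmed-down sample event with made-up bucket and key names.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "hourly/part-0000.avro"}}}
    ]
}

print(handler(sample_event, None))
```

With this shape, the downstream step starts when the upstream output lands rather than waiting for a scheduled time.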

Data analytics

• Implement massively parallel processing technologies

• Options: Spark, Impala or Presto

DevOps on cloud

• Rapidly and automatically deploy new code

• Continuous Integration/Continuous Deployment

• Options: AWS CodeDeploy, AWS CodeCommit, or AWS CodePipeline

THANK YOU

Recommended session:

BDT208 - A Technical Introduction to Amazon Elastic MapReduce

Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B

Remember to complete your evaluations!
