Managing Data Analytics in a Hybrid Cloud · Managing Data Analytics in a Hybrid Cloud Karan Singh...

Managing Data Analytics in a Hybrid Cloud

Karan Singh, Sr. Solution Architect, Storage & Hyper-Converged Business Unit

Daniel Gilfix, Technical Marketing Manager, Storage & Hyper-Converged Business Unit

AGENDA


● CUSTOMER PAIN

● COMMON APPROACHES

● SHARED DATA LAKES

● HOW IT WORKS AND WHERE

● SUMMARY AND NEXT STEPS

CUSTOMER PAIN


CUSTOMER PAIN POINTS

EXPLOSIVE GROWTH in data analytics teams and analytic tools.

MULTIPLE TEAMS COMPETING for use of the same big data resources.

CONGESTION in busy analytic clusters, causing frustration and missed SLAs.

HADOOP, SPARK, SPARKSQL, HIVE, MAPREDUCE, PRESTO, IMPALA, KAFKA, NIFI, ETC.


RESULTING IN CUSTOMER CHOICES

#1 Get a bigger cluster for many teams to share.

#2 Give each team its own dedicated cluster, each with a copy of PBs of data.

#3 Give teams the ability to spin up/spin down clusters that share a common data store.


#3 ON-DEMAND ANALYTIC CLUSTERS WITH A SHARED DATA LAKE

HIT SERVICE-LEVEL AGREEMENTS: Give teams their own compute clusters.

ELIMINATE IDLE RESOURCES: By right-sizing de-coupled compute and storage.

BUY 10s OF PBs INSTEAD OF 100s: Share data sets across clusters instead of duplicating them.

INCREASE AGILITY: With spin-up/spin-down clusters.
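
The "10s of PBs instead of 100s" point can be checked with a rough calculation. The sketch below is illustrative only: the data set size, team count, and the 4+2 erasure-coded layout for the shared store are assumptions, not figures from this deck. The 3x factor is the default HDFS replication.

```python
def raw_capacity_pb(dataset_pb: float, copies: int, overhead: float) -> float:
    """Raw disk needed to hold `copies` full copies of a data set.

    `overhead` is raw bytes stored per usable byte: 3.0 for 3-way
    replication, 1.5 for a 4+2 erasure-coded pool ((4 + 2) / 4).
    """
    return dataset_pb * copies * overhead


DATASET_PB = 5.0  # hypothetical usable data set size
TEAMS = 6         # hypothetical number of team clusters

# Choice #2: every team cluster keeps its own 3x-replicated HDFS copy.
duplicated = raw_capacity_pb(DATASET_PB, copies=TEAMS, overhead=3.0)

# Choice #3: one shared object store, assumed here to be erasure coded 4+2.
shared = raw_capacity_pb(DATASET_PB, copies=1, overhead=1.5)

print(f"duplicated per team: {duplicated:.1f} PB raw")  # 90.0 PB raw
print(f"shared data lake:    {shared:.1f} PB raw")      # 7.5 PB raw
```

With these made-up inputs the duplicated layout needs 90 PB of raw disk against 7.5 PB for the shared store. The gap grows with the number of teams, because only the duplicated layout multiplies by the team count.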


Red Hat data analytics infrastructure solution
Multi-tenant workload isolation with shared data context

BATCH JOBS (SLOW)

STREAMING ANALYTICS

INTERACTIVE ANALYTICS

OTHER ANALYTICS

BATCH JOBS (FAST)

DYNAMIC compute resources and clusters able to meet different SLAs

UNIFIED single object storage solution feeding analytics jobs

ELASTIC provisioning and release of compute resources required by various analytics jobs

BENEFITS - AGILITY AND $$$

● Faster answers through elastic provisioning via OSP on shared data sets
● Fewer roadblocks for empowered users in self-service data labs / clusters
● Private/public cloud versatility with S3A interface
● Reduced cost and risk from not duplicating and maintaining data sets
● CapEx relief by scaling storage independent from compute
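
The last benefit, scaling storage independently from compute, comes down to node-count arithmetic. The following is a minimal sketch with made-up demand and hardware figures (every constant is hypothetical). It shows how a coupled cluster must grow to whichever resource runs out first, which leaves the other one idle.

```python
import math


def nodes_needed(demand: float, per_node: float) -> int:
    """Smallest whole number of nodes that covers the demand."""
    return math.ceil(demand / per_node)


# Hypothetical workload and hardware figures.
CORES_NEEDED = 800
USABLE_TB_NEEDED = 4000
CORES_PER_NODE = 40
USABLE_TB_PER_NODE = 20

compute_nodes = nodes_needed(CORES_NEEDED, CORES_PER_NODE)
storage_nodes = nodes_needed(USABLE_TB_NEEDED, USABLE_TB_PER_NODE)

# Coupled (every node carries both CPU and disk): size for the larger need.
coupled_nodes = max(compute_nodes, storage_nodes)
idle_cores = coupled_nodes * CORES_PER_NODE - CORES_NEEDED

print(f"coupled: {coupled_nodes} nodes, {idle_cores} idle cores")
print(f"de-coupled: {compute_nodes} compute nodes, storage sized separately")
```

Here storage demand forces 200 coupled nodes even though 20 nodes would cover the compute need, so 7200 cores sit idle. In the de-coupled layout the compute tier stays at 20 nodes and the storage tier is sized on its own.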

HOW IT WORKS


GENERATION I: ANALYTICS
MONOLITHIC HADOOP STACKS

Analytics vendors provide single-purpose infrastructure

Analytics vendors provide analytics software

ANALYTICS + INFRASTRUCTURE




GENERATION II: ANALYTICS
ELASTIC COMPUTE AND SHARED STORAGE CLOUDS

Analytics vendors provide analytics software

Red Hat provides cloud infrastructure software

Provisioned Compute Pool via OpenStack and OpenShift platforms

Shared Datasets on Red Hat Ceph Storage


MULTIPLE ANALYTIC CLUSTERS SHARING DATA

INGEST, ETL, INTERACTIVE QUERY, BATCH QUERY & JOINS

ELASTIC COMPUTE RESOURCE POOL

Kafka compute instances

Hive/MapReduce compute instances

Presto compute instances

Spark compute instances

SHARED DATA LAKE

Platinum SLA

Gold SLA

Silver SLA

Bronze SLA
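
The spin-up/spin-down idea behind the elastic compute resource pool can be pictured with a toy model. This is only a sketch: the `ComputePool` class, the cluster names, and the instance counts are hypothetical and do not correspond to any OpenStack or OpenShift API. No data is modelled, because the data sets stay in the shared data lake and releasing a cluster therefore has nothing to copy or delete.

```python
from typing import Dict


class ComputePool:
    """Toy model of a fixed pool of compute instances shared by clusters."""

    def __init__(self, total_instances: int) -> None:
        self.total = total_instances
        self.clusters: Dict[str, int] = {}  # cluster name -> instances held

    @property
    def free(self) -> int:
        """Instances not currently held by any cluster."""
        return self.total - sum(self.clusters.values())

    def spin_up(self, name: str, instances: int) -> None:
        """Reserve instances for a new cluster, or fail if the pool is short."""
        if name in self.clusters:
            raise ValueError(f"cluster {name!r} already exists")
        if instances > self.free:
            raise RuntimeError(f"only {self.free} instances free")
        self.clusters[name] = instances

    def spin_down(self, name: str) -> int:
        """Release a cluster's instances back to the pool."""
        return self.clusters.pop(name)


pool = ComputePool(total_instances=100)
pool.spin_up("spark-batch", 60)
pool.spin_up("presto-interactive", 30)
free_while_busy = pool.free
released = pool.spin_down("spark-batch")
free_after_release = pool.free
print(free_while_busy, released, free_after_release)  # 10 60 70
```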


ANALYTIC WORKLOADS JOINING THE INFRA

[Diagram labels: storage silo; bare metal silo; virtualization infra; shared storage SAN; Red Hat private cloud infra; Red Hat private cloud object store; the rest of an enterprise’s apps; VMs today -> containers tomorrow]

MULTI TENANT WORKLOAD ISOLATION With Shared Data Context

[Diagram: three analytics clusters (HADOOP CLUSTER 1; 2, SPARK; 3, SPARK/PRESTO), each with its own compute, storage, and HDFS TMP, on BARE METAL RHEL, OPENSTACK VM, and OPENSHIFT CONTAINER platforms, connected over S3A/S3 to RED HAT CEPH STORAGE.]
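
S3A is the Hadoop connector that lets these clusters address an S3-compatible object store, and it is configured through `fs.s3a.*` Hadoop properties. The sketch below shows one way such settings could be assembled and handed to `spark-submit` through Spark's `spark.hadoop.` prefix. The endpoint URL and credentials are placeholders, and enabling path-style access is an assumption that is often, but not always, appropriate for non-AWS endpoints.

```python
from typing import Dict, List


def s3a_properties(endpoint: str, access_key: str, secret_key: str,
                   use_ssl: bool = True) -> Dict[str, str]:
    """Hadoop S3A properties for an S3-compatible endpoint."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Assumption: address buckets as endpoint/bucket, not bucket.endpoint.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


def spark_conf_args(props: Dict[str, str]) -> List[str]:
    """Render the properties as `--conf spark.hadoop.<key>=<value>` pairs."""
    args: List[str] = []
    for key, value in sorted(props.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# Placeholder endpoint and credentials, for illustration only.
props = s3a_properties("http://rgw.example.internal:8080",
                       "ACCESS_KEY_PLACEHOLDER", "SECRET_KEY_PLACEHOLDER",
                       use_ssl=False)
args = spark_conf_args(props)
print(" ".join(args))
```

Every cluster that receives the same endpoint and bucket sees the same `s3a://` paths, which is what lets separate compute clusters share one copy of the data.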


COMMON ARCHITECTURAL MODEL - PUBLIC OR PRIVATE CLOUD

PUBLIC CLOUD (AWS): Hadoop, Presto, and Spark on AWS EC2 PROVISIONING, with AWS S3 SHARED DATASETS

PRIVATE CLOUD (RHT): Hadoop, Presto, and Spark on RED HAT® OPENSTACK PLATFORM PROVISIONING, with RED HAT® CEPH S3 SHARED DATASETS


FEATURES AND BENEFITS

MULTIPLE ANALYTIC CLUSTERS
• Enable teams to meet their individual SLAs without competing for resources.

SHARED DATA SETS
• Eliminate duplicate storage costs for multiple HDFS cluster silos.
• Eliminate OpEx costs and complexity for maintaining multiple copies of datasets for multiple HDFS cluster silos.

FAST PROVISIONING OF ANALYTIC CLUSTERS
• Unlocks Agility
• Enables Speed to Capability

ADVANCED ANALYTICS ON CEPH


MODERN BIG DATA ANALYTICS PIPELINE
Simplified Example

DATA GENERATION

INGEST

DATA SCIENCE

MACHINE LEARNING

STREAM PROCESSING

TRANSFORM, MERGE, JOIN

DATA ANALYTICS


MODERN BIG DATA ANALYTICS PIPELINE
KEY TERMINOLOGY

DATA GENERATION

INGEST

DATA SCIENCE

MACHINE LEARNING

STREAM PROCESSING

TRANSFORM, MERGE, JOIN

DATA ANALYTICS

• Sensors
• Click-stream
• Transactions
• Call-detail records

• NiFi
• Kafka

• Presto
• Impala
• SparkSQL

• TensorFlow

• Kafka

• Hadoop
• Spark

• Spark
• Hadoop


TESTED WITH CEPH OBJECT STORE

DATA GENERATION

INGEST

DATA SCIENCE

MACHINE LEARNING

STREAM PROCESSING

TRANSFORM, MERGE, JOIN

DATA ANALYTICS

• TPC-DS data sets (structured)
• logsynth (semi-structured)

• bulk load
• MapReduce

• Impala
• Presto

• (not tested)

• SparkSQL
• Hive/MapReduce

• SparkSQL
• Hive/MapReduce

• (not tested)


TYPICAL SHARED DATA LAKE PROJECT STAGES

IDENTIFY
• Potential fit?

QUALIFY
• 1-2 day workshop
• ID questions needing evidence
• Prioritize questions by value
• Design POC architecture

POC OR PILOT
• Answer questions
• Empirical results
• RHT Solution Engineering
• RHT Consulting

DEPLOYMENT
• Phased roll-out
• Red Hat Consulting

SUMMARY AND NEXT STEPS


KEY TAKEAWAYS

PROBLEMS

MISSED SLAs: Large Spark/Hadoop shops suffering from missed SLAs due to cluster congestion.

EXCESSIVE CAPEX AND OPEX due to multi-cluster solutions without shared data.

HOW YOU KNOW IT’S YOU

Do you do big data analytics on-premises?

Do you have multi-PB data sets?

Do you have multiple Spark/Hadoop clusters?

Do these Spark/Hadoop clusters need to share data sets?

Do you also have non-Spark/Hadoop tools that need access to these data sets?


ONE CUSTOMER’S UNSOLICITED TESTIMONY

“We managed to deliver tremendous value to our organization”:

● Releasing lock on data: moving the HDFS to an open access object store and opening the data process to more processes and analysis.

● Releasing lock on compute: now we’re able to spin up and decommission compute power according to customer needs and utilize cloud benefits (including GPU incorporation in zero time and effort), without worrying about the data.

● Releasing lock on innovation: we can now allow anyone to try and build something new without the fear of messing things up (data or cluster wise). We’ve built an environment that can tolerate mistakes at all levels (process and data), and by doing so, our developers can be much more daring.”


CUSTOMER SATISFACTION

“I’m delighted to announce that it’s been a few weeks since we’ve launched our Cloudoop* offering to our customers, and it’s a huge success. The responses from our customers are very, very positive, and I’m quoting “Big big like!!!”

This shift from the traditional approach is revolutionizing the way we consume and process our data.”

---- Head of Cloud Infrastructure, government agency

(*Cloudoop is their Spark-as-a-service offering with an S3 backend: Spark by Cloudera and S3 by Ceph)


RESOURCES

Summary-level blogs:

● Breaking down data silos with Red Hat infrastructure
● Why would companies do this?
● Will mainstream analytics jobs run directly against a Ceph object store?
● How much slower will they run than natively on HDFS?

Architect-level blogs:

● What about locality?
● Anatomy of the S3A filesystem client
● To the cloud!
● Storing tables in Ceph object storage
● Comparing with HDFS—TestDFSIO
● Comparing with remote HDFS—Hive Testbench (SparkSQL)
● Comparing with local HDFS—Hive Testbench (SparkSQL)
● Comparing with remote HDFS—Hive Testbench (Impala)
● AI and machine learning workloads

https://redhatstorage.redhat.com/2018/09/12/analytics-infrastructure/

https://redhatstorage.redhat.com/2018/06/25/why-spark-on-ceph-part-1-of-3/

https://redhatstorage.redhat.com/2018/07/11/what-about-locality/

https://redhatstorage.redhat.com/2018/07/19/anatomy-of-the-s3a-filesystem-client/

https://redhatstorage.redhat.com/2018/07/25/to-the-cloud/

https://redhatstorage.redhat.com/2018/07/31/storing-tables-in-ceph-object-storage/


THANK YOU
