Managing Data Analytics in a Hybrid Cloud
Karan Singh, Sr. Solution Architect, Storage & Hyper-Converged Business Unit
Daniel Gilfix, Technical Marketing Manager, Storage & Hyper-Converged Business Unit
AGENDA
● CUSTOMER PAIN
● COMMON APPROACHES
● SHARED DATA LAKES
● HOW IT WORKS AND WHERE
● SUMMARY AND NEXT STEPS
CUSTOMER PAIN
CUSTOMER PAIN POINTS
EXPLOSIVE GROWTH in data analytics teams and analytic tools
MULTIPLE TEAMS COMPETING for use of the same big data resources.
CONGESTION in busy analytic clusters causing frustration and missed SLAs.
HADOOP
SPARKSQL
SPARK
HIVE
MAPREDUCE
PRESTO
IMPALA
KAFKA
NIFI
ETC.
RESULTING IN CUSTOMER CHOICES
#1 Get a bigger cluster for many teams to share.
#2 Give each team its own dedicated cluster, each with a copy of PBs of data.
#3 Give teams the ability to spin up and spin down clusters that share a common data store.
#3 ON-DEMAND ANALYTIC CLUSTERS WITH A SHARED DATA LAKE
HIT SERVICE-LEVEL AGREEMENTS
Give teams their own compute clusters.
ELIMINATE IDLE RESOURCES
By right-sizing de-coupled compute and storage.
BUY 10s OF PBs INSTEAD OF 100s
Share data sets across clusters instead of duplicating them.
INCREASE AGILITY
With spin-up/spin-down clusters.
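The storage claim on this slide can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing tool: the data set size, the cluster count, and the 4+2 erasure-coding profile for the shared object store are illustrative assumptions that do not come from the slides. The only fixed fact used is that HDFS replicates blocks three times by default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Illustrative inputs; every value here is an assumption."""
    dataset_pb: float          # usable size of the data set, in PB
    clusters: int              # analytic clusters that need the data
    hdfs_replication: int = 3  # HDFS default block replication factor
    ec_data_chunks: int = 4    # assumed erasure-code data chunks (k)
    ec_coding_chunks: int = 2  # assumed erasure-code coding chunks (m)


def duplicated_raw_pb(s: Scenario) -> float:
    """Raw capacity when every cluster keeps its own replicated HDFS copy."""
    return s.dataset_pb * s.hdfs_replication * s.clusters


def shared_raw_pb(s: Scenario) -> float:
    """Raw capacity when all clusters read one erasure-coded object store."""
    overhead = (s.ec_data_chunks + s.ec_coding_chunks) / s.ec_data_chunks
    return s.dataset_pb * overhead


if __name__ == "__main__":
    s = Scenario(dataset_pb=5.0, clusters=4)
    print(f"Duplicated HDFS copies: {duplicated_raw_pb(s):.1f} PB raw")
    print(f"Shared data lake:       {shared_raw_pb(s):.1f} PB raw")
```

Under these assumed inputs the duplicated approach needs 60 PB of raw capacity and the shared approach needs 7.5 PB. Real ratios depend on the protection scheme chosen and on how many clusters truly need a full copy.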
Red Hat data analytics infrastructure solution
Multi-tenant workload isolation with shared data context
BATCH JOBS (SLOW)
STREAMING ANALYTICS
INTERACTIVE ANALYTICS
OTHER ANALYTICS
BATCH JOBS (FAST)
DYNAMIC compute resources and clusters able to meet different SLAs
UNIFIED single object storage solution feeding analytics jobs
ELASTIC provisioning and release of compute resources required by various analytics jobs
BENEFITS - AGILITY AND $$$
● Faster answers through elastic provisioning via Red Hat OpenStack Platform (OSP) on shared data sets
● Fewer roadblocks for empowered users in self-service data labs / clusters
● Private/public cloud versatility with the S3A interface
● Reduced cost and risk from not duplicating and maintaining data sets
● CapEx relief by scaling storage independently of compute
HOW IT WORKS
GENERATION I: ANALYTICS
MONOLITHIC HADOOP STACKS
Analytics vendors provide single-purpose infrastructure
Analytics vendors provide analytics software
ANALYTICS + INFRASTRUCTURE
ANALYTICS + INFRASTRUCTURE
ANALYTICS + INFRASTRUCTURE
GENERATION II: ANALYTICS
ELASTIC COMPUTE AND SHARED STORAGE CLOUDS
Analytics vendors provide analytics software
Red Hat provides cloud infrastructure software
Provisioned Compute Pool via OpenStack and OpenShift platforms
Shared Datasets on Red Hat Ceph Storage
MULTIPLE ANALYTIC CLUSTERS SHARING DATA
INGEST
ETL
INTERACTIVE QUERY
BATCH QUERY & JOINS
ELASTIC COMPUTE RESOURCE POOL
Kafka compute instances
Hive/MapReduce compute instances
Presto compute instances
Spark compute instances
SHARED DATA LAKE
Platinum SLA
Gold SLA
Silver SLA
Bronze SLA
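The relationship on this slide, several clusters drawing from one elastic compute pool while pointing at the same data lake, can be modeled in a few lines. This is a toy sketch for illustration only: `ComputePool`, `spin_up`, `spin_down`, and the `s3a://shared-data-lake/` URI are hypothetical names, not part of OpenStack, OpenShift, or Ceph.

```python
from dataclasses import dataclass, field

# Hypothetical bucket URI; every cluster is handed the same location.
SHARED_LAKE_URI = "s3a://shared-data-lake/"


@dataclass
class ComputePool:
    """Toy model of an elastic compute pool (hypothetical, not a real API)."""
    total_instances: int
    clusters: dict = field(default_factory=dict)  # cluster name -> instances

    @property
    def free_instances(self) -> int:
        return self.total_instances - sum(self.clusters.values())

    def spin_up(self, name: str, instances: int) -> str:
        """Reserve instances for a cluster and return the shared data URI."""
        if name in self.clusters:
            raise ValueError(f"cluster {name!r} already exists")
        if instances > self.free_instances:
            raise RuntimeError("not enough free instances in the pool")
        self.clusters[name] = instances
        return SHARED_LAKE_URI

    def spin_down(self, name: str) -> int:
        """Release a cluster's instances; the data lake is untouched."""
        return self.clusters.pop(name)


if __name__ == "__main__":
    pool = ComputePool(total_instances=20)
    presto_uri = pool.spin_up("presto", 8)
    spark_uri = pool.spin_up("spark", 10)
    print(presto_uri == spark_uri, pool.free_instances)
    pool.spin_down("presto")
    print(pool.free_instances)
```

The point of the model is that spinning a cluster down only returns compute to the pool. No data has to be copied in or out, because the data never lived on the cluster.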
ANALYTIC WORKLOADS JOINING THE INFRA
storage silo
bare metal silo virtualization infra
shared storage SAN
Red Hat private cloud infra
Red Hat private cloud object store
The rest of an enterprise’s apps
The rest of an enterprise’s apps
VMs VMs today -> containers tomorrow
MULTI-TENANT WORKLOAD ISOLATION With Shared Data Context
[Diagram: three analytic clusters (Hadoop cluster 1, Spark cluster 2, Spark/Presto cluster 3) with compute separated from storage. Workers run on OpenStack VMs, OpenShift containers, or bare-metal RHEL, keep HDFS for temporary data only (HDFS TMP), and reach Red Hat Ceph Storage over S3A/S3.]
COMMON ARCHITECTURAL MODEL - PUBLIC OR PRIVATE CLOUD
PUBLIC CLOUD (AWS)
AWS EC2 PROVISIONING
AWS S3 SHARED DATASETS
Hadoop, Presto, Spark
PRIVATE CLOUD (RHT)
RED HAT® OPENSTACK PLATFORM PROVISIONING
RED HAT® CEPH S3 SHARED DATASETS
Hadoop, Presto, Spark
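Because both sides of this model expose an S3-compatible store, an analytics job can in principle be moved between them by changing connection settings only. The sketch below builds the Spark settings for each side. The `fs.s3a.*` names are standard Hadoop S3A connector properties and `spark.hadoop.` is Spark's prefix for passing Hadoop properties, but the Ceph gateway host `rgw.example.internal:8080`, the placeholder credentials, and the helper names are assumptions made for illustration.

```python
def s3a_spark_conf(endpoint, access_key, secret_key,
                   path_style=False, use_ssl=True):
    """Return Spark settings that point the S3A connector at one endpoint."""
    hadoop_props = {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Path-style addressing is commonly used with non-AWS gateways.
        "fs.s3a.path.style.access": str(path_style).lower(),
        "fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }
    return {f"spark.hadoop.{key}": value for key, value in hadoop_props.items()}


def differing_keys(conf_a, conf_b):
    """List the setting names whose values differ between two configs."""
    return sorted(k for k in conf_a if conf_a[k] != conf_b.get(k))


def to_spark_defaults(conf):
    """Render settings as 'key value' lines, as used in spark-defaults.conf."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


if __name__ == "__main__":
    # Placeholder credentials; real deployments should not hard-code keys.
    public = s3a_spark_conf("s3.amazonaws.com", "ACCESS", "SECRET")
    private = s3a_spark_conf("rgw.example.internal:8080", "ACCESS", "SECRET",
                             path_style=True, use_ssl=False)
    print(to_spark_defaults(private))
    print(differing_keys(public, private))
```

In this sketch only the endpoint, addressing style, and SSL flag differ between the two environments. The job code and its `s3a://` paths would stay the same.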
FEATURES AND BENEFITS
MULTIPLE ANALYTIC CLUSTERS
• Enable teams to meet their individual SLAs without competing for resources.
SHARED DATA SETS
• Eliminate duplicate storage costs for multiple HDFS cluster silos.
• Eliminate OpEx costs and complexity for maintaining multiple copies of datasets for multiple HDFS cluster silos.
FAST PROVISIONING OF ANALYTIC CLUSTERS
• Unlocks Agility
• Enables Speed to Capability
ADVANCED ANALYTICS ON CEPH
MODERN BIG DATA ANALYTICS PIPELINE: Simplified Example
DATA GENERATION
INGEST
DATA SCIENCE
MACHINE LEARNING
STREAM PROCESSING
TRANSFORM, MERGE, JOIN
DATA ANALYTICS
MODERN BIG DATA ANALYTICS PIPELINE: KEY TERMINOLOGY
DATA GENERATION
INGEST
DATA SCIENCE
MACHINE LEARNING
STREAM PROCESSING
TRANSFORM, MERGE, JOIN
DATA ANALYTICS
• Sensors • Click-stream • Transactions • Call-detail records
• NiFi • Kafka • Presto
• Impala • SparkSQL
• TensorFlow
• Kafka • Hadoop • Spark
• Spark • Hadoop
TESTED WITH CEPH OBJECT STORE
DATA GENERATION
INGEST
DATA SCIENCE
MACHINE LEARNING
STREAM PROCESSING
TRANSFORM, MERGE, JOIN
DATA ANALYTICS
• TPC-DS data sets (structured) • logsynth (semi-structured)
• bulk load • MapReduce • Impala
• Presto • (not tested)
• SparkSQL • Hive/MapReduce
• SparkSQL • Hive/MapReduce
• (not tested)
TYPICAL SHARED DATA LAKE PROJECT STAGES
IDENTIFY
• Potential fit?
QUALIFY
• 1-2 day workshop
• ID questions needing evidence
• Prioritize questions by value
• Design POC architecture
POC OR PILOT
• Answer questions
• Empirical results
• RHT Solution Engineering
• RHT Consulting
DEPLOYMENT
• Phased roll-out
• Red Hat Consulting
SUMMARY AND NEXT STEPS
KEY TAKEAWAYS
PROBLEMS
MISSED SLAs: Large Spark/Hadoop shops suffering from missed SLAs due to cluster congestion.
EXCESSIVE CAPEX AND OPEX: due to multi-cluster solutions without shared data.
HOW YOU KNOW IT’S YOU
Do you do big data analytics on-premises?
Do you have multi-PB data sets?
Do you have multiple Spark/Hadoop clusters?
Do these Spark/Hadoop clusters need to share data sets?
Do you also have non-Spark/Hadoop tools that need access to these data sets?
RED HAT CONFIDENTIAL
ONE CUSTOMER’S UNSOLICITED TESTIMONY
“We managed to deliver tremendous value to our organization”:
● Releasing lock on data: moving the HDFS to an open access object store and opening the data process to more processes and analysis.
● Releasing lock on compute: now we’re able to spin up and decommission compute power according to customer needs and utilize cloud benefits (including GPU incorporation in zero time and effort), without worrying about the data.
● Releasing lock on innovation: we can now allow anyone to try and build something new without the fear of messing things up (data or cluster wise). We’ve built an environment that can tolerate mistakes at all levels (process and data), and by doing so, our developers can be much more daring.
CUSTOMER SATISFACTION
“I’m delighted to announce that it’s been a few weeks since we’ve launched our Cloudoop* offering to our customers, and it’s a huge success. The responses from our customers are very, very positive, and I’m quoting: ‘Big big like!!!’
This shift from the traditional approach is revolutionizing the way we consume and process our data.”
- Head of Cloud Infrastructure, government agency
(*Cloudoop is their Spark-as-a-service offering with an S3 backend: Spark by Cloudera and S3 by Ceph)
RESOURCES
Summary-level blogs:
● Breaking down data silos with Red Hat infrastructure
● Why would companies do this?
● Will mainstream analytics jobs run directly against a Ceph object store?
● How much slower will they run than natively on HDFS?
Architect-level blogs:
● What about locality?
● Anatomy of the S3A filesystem client
● To the cloud!
● Storing tables in Ceph object storage
● Comparing with HDFS—TestDFSIO
● Comparing with remote HDFS—Hive Testbench (SparkSQL)
● Comparing with local HDFS—Hive Testbench (SparkSQL)
● Comparing with remote HDFS—Hive Testbench (Impala)
● AI and machine learning workloads
THANK YOU