Best Practices for Big Data Analytics in Hadoop

Zoltan Prekopcsak, VP Big Data
June 14, 2016

#1 Modern Platform to Turn Data into a Strategic Asset
©2016 RapidMiner, Inc. All rights reserved.

90% of Deployed Data Lakes are "USELESS"

"Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases."

• SKILLS GAP is a major adoption inhibitor, cited by 57%
• How to EXTRACT VALUE from Hadoop, cited by 49%

Different Approaches to Big Data Analytics

• Sampling
• Grid Computing
• Native Distributed Algorithms

Approach 1: Sampling

Sampling: Data Movement & Processing

• Data Movement
  § Pulls sample data from HDFS/Hive/Impala
• Data Processing
  § In the analytics tool

[Diagram: pieces of data are pulled out of Hadoop; the analytics tool performs the calculations]
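The sampling flow above can be sketched in a few lines. This is a minimal illustration, not how any particular analytics tool pulls its sample: it assumes the rows arrive as a stream of unknown length, and `fake_hadoop_rows` is a hypothetical stand-in for a read from HDFS/Hive/Impala. Reservoir sampling is used here because it gives every row the same chance of selection without knowing the row count in advance.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def fake_hadoop_rows(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for rows streamed out of HDFS/Hive/Impala."""
    for i in range(n):
        yield {"id": i, "value": i % 7}


# Only the sample reaches the analytics tool; the full data stays behind.
sample = reservoir_sample(fake_hadoop_rows(100_000), k=1_000)
```

All further calculations would then run on `sample` inside the analytics tool, which is why Hadoop acts only as a data repository in this approach.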

Sampling: Pros & Cons

• Pros
  + Simple and easy to start with
  + Usually works well for data exploration and early prototyping
  + Some ML models would not benefit from more data anyway
• Cons
  – Many ML models would benefit from more data
  – Cannot be used when large-scale data preparation is needed
  – Hadoop is used as a data repository only

Sampling: Best Practices

• When to use it
  + Only data exploration / data understanding
  + Early prototyping on prepared and clean data
  + Machine Learning modeling with very few and basic patterns (e.g. only a handful of columns and binary prediction target)
• When NOT to use it
  − Large number of columns in the data
  − Need to blend large data sets (e.g. large-scale joins)
  − Complex Machine Learning models
  − Looking for rare events
• Horror stories
  – Important decisions made based on biased samples
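The "looking for rare events" caveat can be made concrete with a little arithmetic. The sketch below assumes a simple random sample in which each row is independently a rare event with a fixed probability; the function names and the rates in the usage note are illustrative and do not come from the talk.

```python
def prob_sample_misses_all(event_rate: float, sample_size: int) -> float:
    """Probability that a sample contains zero rare events.

    Assumes each sampled row is independently a rare event with
    probability `event_rate` (a simplification for illustration).
    """
    if not 0.0 <= event_rate <= 1.0:
        raise ValueError("event_rate must be between 0 and 1")
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    return (1.0 - event_rate) ** sample_size


def expected_rare_rows(event_rate: float, sample_size: int) -> float:
    """Expected number of rare-event rows in the sample."""
    return event_rate * sample_size
```

With a hypothetical event rate of 1 in 100,000 and a sample of 10,000 rows, `prob_sample_misses_all(1e-5, 10_000)` is roughly 0.9, so about nine samples out of ten would contain no rare event at all and no model could learn the pattern from them.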

Different Approaches to Big Data Analytics

Data Visualization, Programming

Approach 2: Grid Computing

Approach 2: Grid Computing

• Data Movement
  • Only results are moved, data remains in Hadoop
• Data Processing
  • Custom single-node application running on multiple Hadoop nodes
• Pros & Cons
  + Hadoop is used for parallel processing in addition to being used as a data source
  – Only works if data subsets can be processed independently
  – Only as good as the single-node engine, no benefit from fast-evolving Hadoop innovations

[Diagram: the analytics tool sends the application to the Hadoop nodes; the app instances perform the calculations; results are returned]
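The grid computing pattern can be sketched locally as follows. This is only an illustration under stated assumptions: threads stand in for Hadoop nodes, the in-memory lists stand in for data subsets stored on those nodes, and `single_node_app` and `run_on_grid` are hypothetical names. The point is that the same single-node function runs on each subset independently and only the small per-subset results travel back.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def single_node_app(partition: List[float]) -> Tuple[int, float]:
    """The 'custom single-node application': summarizes one data subset."""
    return len(partition), sum(partition)


def run_on_grid(partitions: List[List[float]]) -> float:
    """Run the app on every subset and combine only the small results."""
    # Threads stand in for Hadoop nodes in this local sketch.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(single_node_app, partitions))
    total_count = sum(count for count, _ in results)
    total_sum = sum(subtotal for _, subtotal in results)
    if total_count == 0:
        raise ValueError("no rows to summarize")
    return total_sum / total_count


partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
overall_mean = run_on_grid(partitions)
```

A mean works here because counts and sums from independent subsets can be combined afterwards; tasks without that property do not fit this approach.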

Grid computing best practices

• When to use it
  + Task can be performed on smaller, independent data subsets
  + Compute-intensive data processing
• When NOT to use it
  – Data-intensive data processing
  – Complex Machine Learning models
  – Lots of interdependencies between data subsets
• Horror stories
  – Grid computing job called in huge loops to manage dependencies and intermediate results
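A small example shows why interdependencies between data subsets matter. The numbers and function names below are made up for illustration: a median computed per subset and then combined does not equal the median of the full data, so this task cannot be split into independent pieces without extra coordination.

```python
from statistics import median
from typing import List


def naive_grid_median(partitions: List[List[float]]) -> float:
    """Median of the per-subset medians (NOT the true median in general)."""
    return median(median(p) for p in partitions)


def global_median(partitions: List[List[float]]) -> float:
    """Median over all rows, which needs a view of the whole data set."""
    return median(x for p in partitions for x in p)


subsets = [[1, 2, 9], [3, 4, 5, 6, 7]]
approximate = naive_grid_median(subsets)
exact = global_median(subsets)
```

Here the per-subset medians are 2 and 5, giving 3.5, while the median of all eight values is 4.5. Getting the exact answer with independent jobs would require passing intermediate results between runs, which is the kind of looping the horror story describes.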

Approach 2: Grid Computing

Legacy single-machine analytics engines

Approach 3: Native Distributed Algorithms

Approach 3: Native distributed algorithms

• Data Movement
  • Only results are moved, data remains in Hadoop
• Data Processing
  • Executed by native Hadoop tools: Hive, Spark, H2O, Pig, MapReduce, etc.
• Pros & Cons
  + Holistic view of all data and patterns
  + Highly scalable distributed processing optimized for Hadoop
  − Limited set of algorithms available, very hard to develop new algorithms

[Diagram: the analytics tool pushes instructions to Hadoop; calculations run in Hadoop; results are returned]
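To give a feel for what a native distributed algorithm looks like, here is a single-process sketch of the map/shuffle/reduce model that MapReduce is built on. It is not Hadoop code and runs entirely in local memory; the function names and the word-count task are chosen only for illustration. On a real cluster the map and reduce steps would run on many nodes and the shuffle would move data between them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values that share a key into one result."""
    return key, sum(values)


def run_job(records: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases; only the final counts are returned."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["spark hive pig", "hive spark", "spark"])
```

The shuffle step is what separates this from grid computing: values from different input records meet at the same key, so the algorithm sees patterns that span data subsets.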

Native distributed algorithms best practices

• When to use it
  + Complex Machine Learning models needed
  + Lots of interdependencies inside the data (e.g. graph analytics)
  + Need to blend and cleanse large data sets (e.g. large-scale joins)
• When NOT to use it
  − Data is not that large
  − Sample would reveal all interesting patterns
• Horror stories
  – Complex ML model developed in 3 months defeated by a prototype model created in an afternoon

Approach 3: Native Distributed Algorithms

Hadoop ecosystem projects

Different Approaches to Big Data Analytics

Which one to use for a given use case?
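One way to read the three best-practice lists together is as a rough decision rule per task. The sketch below is an interpretation, not something the talk prescribes: the `Task` fields paraphrase the "when to use it" and "when NOT to use it" bullets, and the order in which the checks run is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Properties of one analytics task (field names are illustrative)."""
    data_is_large: bool
    many_columns: bool = False
    large_scale_joins: bool = False
    complex_ml_model: bool = False
    rare_events: bool = False
    independent_subsets: bool = False
    compute_intensive: bool = False


def suggest_approach(task: Task) -> str:
    """Suggest an approach for a single task; the precedence is assumed."""
    # Conditions the deck lists under "when NOT to use" sampling.
    sampling_blocked = (
        task.many_columns
        or task.large_scale_joins
        or task.complex_ml_model
        or task.rare_events
    )
    if not task.data_is_large or not sampling_blocked:
        return "sampling"
    # Grid computing needs independent subsets and suits compute-heavy work.
    grid_blocked = task.large_scale_joins or task.complex_ml_model
    if task.independent_subsets and task.compute_intensive and not grid_blocked:
        return "grid computing"
    return "native distributed algorithms"
```

For example, `suggest_approach(Task(data_is_large=True, large_scale_joins=True))` returns "native distributed algorithms". The rule applies to one task at a time, so different steps of the same project can land on different approaches.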

Typical projects need all three to succeed

• Sampling
• Grid Computing
• Native Distributed Algorithms

RapidMiner Predictive Analytics Platform

Single Analytics Platform to support all three

Predictive Analytics Reimagined
A Modern Data Science Platform to Turn Data Into a Strategic Asset

10 Milk Street, 11th Floor
Boston, MA 02108 USA
+1 617 401 7708

Web: rapidminer.com
Email: [email protected]
Twitter: @prekopcsak