Introduction to Data Science with Hadoop - DataEDGE 2018
Glynn Durham | Senior Instructor | glynn@cloudera.com | May 2017
© Cloudera, Inc. All rights reserved.
Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future
Data Science is…
• gathering data, potentially of many types and from many sources,
• wrangling that data into useful forms, and
• applying statistical programming and machine learning, to gain new information from the data.
Machine Learning and Data Volume
“It’s not who has the best algorithms who wins. It’s who has the most data.” [Banko and Brill, 2001]
Hadoop is…
• an open source software platform for
• acquiring, storing, and processing massive volumes of data,
• economically.
The Age of Machine Learning
Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future
The word “Hadoop” means
• a child’s toy or
• Hadoop Core or
• the Hadoop Ecosystem.
Hadoop Core
• A free open source software project
• Managed transparently online, at the Apache Software Foundation (ASF), apache.org
• The project was started in 2006, based on papers from Google, in 2003 and 2004
• Consists of:
  • HDFS (Hadoop Distributed File System), for storage
  • Hadoop MapReduce, for processing
  • YARN (Yet Another Resource Negotiator)
hadoop.apache.org
Hadoop Core main features: File storage and batch programming
HDFS Writes
HDFS Reads
General File Input/Output
HDFS Strengths and Weaknesses
• HDFS is good at:
  • storing enormous files
  • storing lots of data reliably
  • throughput on sequential writes
  • throughput on sequential reads of a file or part of a file
• HDFS is not good at:
  • high speed (low latency) random reads of parts of a file
• HDFS cannot:
  • update any part of a file once written*
* but you can always write a new file and/or delete, move, and rename files and directories
MapReduce: Programming with simple functions
MapReduce Example: Word Count
Count the number of occurrences of each word over a large amount of input data
• This is the ‘hello world’ of MapReduce programming

map(String input_key, String input_value)
  foreach word w in input_value:
    emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_vals)
  set count = 0
  foreach v in intermediate_vals:
    count += v
  emit(output_key, count)
Word Count, continued
Input to the Mapper:
(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

Output from the Mapper:
('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1),
('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
Word Count, continued
Intermediate data sent to the Reducer:
('aardvark', [1]) ('cat', [1]) ('mat', [1]) ('on', [1, 1]) ('sat', [1, 1]) ('sofa', [1]) ('the', [1, 1, 1, 1])

Final Reducer output:
('aardvark', 1) ('cat', 1) ('mat', 1) ('on', 2) ('sat', 2) ('sofa', 1) ('the', 4)
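The word count walk-through can be traced end to end in plain Python. This is only a single-process sketch: `run_word_count` and its in-memory grouping step are illustrative stand-ins for the framework's shuffle, not Hadoop APIs, and the sample records are the two lines from the slides.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    """Emit (word, 1) for every word in one input line."""
    for word in input_value.split():
        yield (word, 1)


def reduce_fn(output_key, intermediate_vals):
    """Sum the counts collected for one word."""
    return (output_key, sum(intermediate_vals))


def run_word_count(records):
    """Hypothetical driver: map, group by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for word, count in map_fn(key, value):
            grouped[word].append(count)
    # Keys reach the reducer in sorted order, as in the slides' example.
    return [reduce_fn(word, grouped[word]) for word in sorted(grouped)]


records = [
    (3414, 'the cat sat on the mat'),
    (3437, 'the aardvark sat on the sofa'),
]
result = run_word_count(records)
print(result)
```

In a real cluster the map and reduce calls run on many machines and the grouping step moves data across the network; only the two small functions are written by the programmer.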
So we just counted words. So what?
• Many problems conform to this pattern:
  • Web log analysis: map() emits an IP address for each web log event; reduce() counts occurrences for each IP address
  • Indexing: For each document, map() emits each term of interest paired with the document ID; reduce() collects and emits all document IDs for each term
  • PageRank algorithm:
    • Every web page (URL) on the Web gets an initial score.
    • map() divides a page’s score among all of its outlinks’ URLs; reduce() sums the received scores for each URL.
    • Iterate on this procedure.
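The PageRank bullets can be sketched the same way. The following is a simplified illustration that follows the slide literally: it omits the damping factor that practical PageRank implementations add, and it assumes every page has at least one outlink. The three-page graph and the `iterate` helper are made up for the example.

```python
from collections import defaultdict


def map_fn(url, score, outlinks):
    """Divide a page's score evenly among the URLs it links to."""
    share = score / len(outlinks)  # assumes at least one outlink
    for target in outlinks:
        yield (target, share)


def reduce_fn(url, received_scores):
    """Sum the score shares a URL received from pages linking to it."""
    return (url, sum(received_scores))


def iterate(scores, links):
    """One map/shuffle/reduce round over the whole link graph."""
    grouped = defaultdict(list)
    for url, outlinks in links.items():
        for target, share in map_fn(url, scores[url], outlinks):
            grouped[target].append(share)
    return dict(reduce_fn(url, grouped[url]) for url in links)


# Hypothetical link graph: page -> pages it links to.
links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
scores = {url: 1.0 for url in links}  # every page gets an initial score
for _ in range(20):                   # iterate on the procedure
    scores = iterate(scores, links)
print(scores)
```

Each call to `iterate` corresponds to one full MapReduce job, which is why iterative algorithms like this one turn into chains of jobs.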
MapReduce Chains
MapReduce at Scale
MapReduce Strengths and Weaknesses
• MapReduce is good at:
  • processing enormous volumes of data
  • scaling out as you add more machines
  • continuing to completion, even when some machines die
• MapReduce is not good at:
  • running any algorithm you can write in pseudocode
  • algorithms that require shared state overall*
• MapReduce cannot:
  • run in real time: MapReduce jobs are batch jobs
* but maybe you can get clever with your algorithm design
YARN, Yet Another Resource Negotiator
Sqoop: RDBMS to Hadoop and Back
• Uses MapReduce to run concurrent database queries that extract or insert data
sqoop.apache.org
Flume: Ingesting Ongoing Event Data
flume.apache.org
Kafka: General Data Streaming
kafka.apache.org
HBase: A NoSQL Database System
• A scalable key/value store
• Accommodates general binary data
• High volume, high performance access to individual items
• Random reads and writes
• Weaker query language than SQL (put, get, scan, delete)
• Lacks ACID-compliant transactions
hbase.apache.org
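To make the four operations concrete, here is a toy in-memory key/value store. It is not the HBase client API: `ToyKeyValueStore` and its method signatures are hypothetical, and the half-open scan range (start inclusive, stop exclusive) is an assumption chosen for the sketch. It only illustrates why keeping row keys sorted makes range scans cheap while individual reads and writes stay random-access.

```python
import bisect


class ToyKeyValueStore:
    """Hypothetical stand-in for a sorted key/value store; values are raw bytes."""

    def __init__(self):
        self._keys = []  # row keys kept in sorted order for range scans
        self._rows = {}  # row key -> value

    def put(self, key, value):
        """Insert or overwrite a single row."""
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        """Random read of one row; returns None if the key is absent."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

    def delete(self, key):
        """Remove one row if it exists."""
        if key in self._rows:
            del self._rows[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))


store = ToyKeyValueStore()
store.put(b'user#002', b'\x00\x01binary')
store.put(b'user#001', b'alice')
store.put(b'user#003', b'carol')
store.delete(b'user#003')
print(store.get(b'user#001'))
print(store.scan(b'user#001', b'user#003'))
```

There is no join, no aggregate, and no multi-row transaction here, which mirrors the trade-off the slide describes against SQL databases.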
Kudu: Scalable storage for structured data
kudu.apache.org
Hive: MapReduce (or Spark) as “SQL”
• Familiar language and programming paradigm
• Provides interface to many SQL-compliant tools
hive.apache.org
Pig: Another Language for MapReduce (or Spark)
pig.apache.org
Impala: High Speed Analytics in Hadoop
• Purpose-built for high speed analytic queries
• Does not use MapReduce or Spark
• Usually 5 to 30 times faster, sometimes 100 times faster!
incubator.apache.org/projects/impala.html
And More
• Serialization and efficient file storage: Avro and Parquet
• Workflow:Oozie
avro.apache.org parquet.apache.org
oozie.apache.org
And Even More…
• Security: Sentry and RecordService
• Machine Learning in MapReduce: Mahout
• And…
mahout.apache.org
sentry.apache.org recordservice.io
Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future
Spark: An Improvement on MapReduce
• Originally a research project at UC Berkeley RAD Lab (later the AMPLab), in 2009
• Addresses some fundamental pain points of MapReduce
• The Spark Streaming subproject of 2012 adds near real-time programming
  • using “micro-batches” as an adaptation of batch programming
  • a capability altogether lacking in Hadoop MapReduce
spark.apache.org
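The micro-batch idea can be illustrated without Spark. In this sketch an unbounded event stream is cut into small batches, and ordinary batch logic runs on each one. It is not the Spark Streaming API: `micro_batches` and `process_batch` are hypothetical names, and batches are cut by event count here purely for simplicity, whereas a streaming engine would typically cut them by time interval.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Cut a (possibly endless) event stream into small lists of events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Ordinary batch logic, applied to one micro-batch: count events by type."""
    return Counter(batch)


events = ['login', 'click', 'click', 'login', 'logout']
running_totals = Counter()
per_batch = []
for batch in micro_batches(events, batch_size=2):
    counts = process_batch(batch)
    per_batch.append(dict(counts))  # result available soon after the batch closes
    running_totals.update(counts)   # state carried across batches
print(per_batch)
print(dict(running_totals))
```

Results lag the events by at most one batch, which is why the slides say "near" real time rather than real time.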
Similarities of MapReduce and Spark
• Processes massive volumes of data with a scale-out, distributed framework
• The framework provides reliability, even in the face of machine failure
• Programming with stateless functions
• Relies on expensive shuffle to reorganize data for aggregation, joins, sorting
• Still lacks a shared state among all processes
• Can run under YARN to share processing resources
Improved API
• First-class APIs in Scala, Java, Python and R
• Data-flow programming paradigm (like Pig)
• Interactive shell: great for exploratory work
• Improved support for structured data and SQL-like processing
Processing Chains, Improved
[Figure: two chains of three functions each, annotated "Eliminate I/O", "Eliminate I/O", and "Reduce I/O"]
• Tasks, not new processes (JVMs)
• Enhanced caching in memory
Spark MLlib: Machine Learning in Spark
• Subproject of Spark
• Effectively replaces Mahout for machine learning in Hadoop clusters
• From spark.apache.org, the front page:
But just be clear what you mean by “Hadoop”!
Commercial Message #1
Big Ecosystem
oozie.apache.org
Complete Big Data Platform
• Cloudera Manager can
  • install, monitor, manage, upgrade a coherent bundle of these projects and more
• Cloudera Director can
  • easily configure and deploy this platform on cloud services from Amazon, Google, or Microsoft
• !!!
Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future
Machine Learning Algorithms
• Supervised Learning:
  • Start with correctly labeled records, and learn to estimate or predict labels for new records
  • Continuous labels: Regression
  • Discrete labels: Logistic Regression, Classifiers
• Unsupervised Learning:
  • Start with unlabeled records, try to tease patterns (labels) out of the data
  • There is not a single “correct” answer for labeling
  • Continuous labels: Collaborative Filters (Recommenders)
  • Discrete labels: Clustering
Linear Regression: Supervised Learning of a Continuous Label
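As a small illustration of learning a continuous label, the sketch below fits a straight line to labeled points with ordinary least squares for a single feature. The data and the `fit_line` helper are invented for the example; this is not how Spark MLlib is invoked, only the underlying idea at toy scale.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Training: records with known (correct) continuous labels.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# Scoring: predict the label for a new, unlabeled record.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```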
Logistic Regression: Supervised Learning of a Binary Label
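A binary label can be learned in a similar spirit. The sketch below trains a one-feature logistic regression with plain batch gradient descent. The data, learning rate, and epoch count are arbitrary choices for illustration, and `fit_logistic` and `predict` are hypothetical helpers rather than any library's API.

```python
import math


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the average log loss for p = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x):
    """Return label 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Training: records with known binary labels.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, labels)
print(predict(w, b, 1.0), predict(w, b, 3.5))
```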
Classifiers: Supervised Learning of Discrete Labels
Training: Cat
Training: Table
Scoring: ???
Collaborative Filters (Recommenders): Unsupervised Learning of Continuous Labels
Clustering: Unsupervised Learning of Discrete Labels
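Clustering can be illustrated with a bare-bones k-means loop, one common clustering algorithm. This sketch makes simplifying assumptions that should be read as such: the first k points serve as initial centers (real implementations usually pick them more carefully), the iteration count is fixed, and the six 2-D points are invented.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]


def kmeans(points, k, iterations=10):
    centers = [tuple(p) for p in points[:k]]  # assumption: naive initialization
    for _ in range(iterations):
        labels = assign(points, centers)
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = centroid(members)
    return centers, assign(points, centers)


# Unlabeled records: the algorithm invents the discrete labels 0 and 1.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.8, 1.1), (5.2, 4.9), (4.8, 5.1)]
centers, labels = kmeans(points, k=2)
print(labels)
```

The label numbers themselves carry no meaning; a different initialization could swap them, which is one sense in which there is no single correct answer.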
Spark MLlib: Machine Learning on Hadoop
Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future
Commercial Message #2
More DS Teams in the Organization
• Collaboration, repeatability within teams
• Differing security requirements
• Different preferred programming languages: Python, R, Scala
• Different software libraries: Pandas, H2O, etc.
• Even different versions of software
Cloudera Data Science Workbench
• Development in Python, Scala, or R
• Differing security requirements
Deep Learning
Deep Learning on Hadoop
• Deep Learning refers to a category of classifier algorithms, mostly invented in 2006.
• Spark MLlib does not have any direct implementation of DL.
• There are several additional projects that can fit DL onto Spark/Hadoop:
  • BigDL
  • Caffe
  • TensorFlow
  • DL4J
The Road (or Runway!) Ahead
• It is a truism that organizations today have valuable insights hidden in their data that are waiting to be uncovered.
• 90% of all data that will exist in 2020 has yet to be created.
• Open source is here to stay.
• Hadoop as a data science platform is evolving, and its use is growing exponentially.