Transcript of Machine Learning with R and Zeppelin on Oracle Big Data ...
Copyright © 2018, Oracle and/or its affiliates. All rights reserved.
Machine Learning with R and Zeppelin on Oracle Big Data Solutions
Marcos Arancibia, Product Manager, Data Science and Big Data
October 23, 2018
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Oracle Big Data Manager
Conceptual
[Architecture diagram. Components: Streaming Engine, Data Lake, Enterprise Data & Reporting, Discovery Lab. Flows: Input Events, Execution, Innovation, Discovery Output, Data, Structured Enterprise Data, Actionable Events, Actionable Metrics, Actionable Data Sets.]
Practical
[Architecture diagram. Components: Streaming Engine, Data Lake, Enterprise Data & Reporting, Discovery Lab, Notebooks/Analytic Services, Object Store, Hadoop/HDFS. Flows: Input Events, Execution, Innovation, Discovery Output, Data, Structured Enterprise Data, Actionable Events, Actionable Metrics, Actionable Data Sets.]
Big Data Manager
• Included with all Big Data offerings (BDA, BDCC and BDCS)
• Enables massive parallel copy of data leveraging Apache Spark
– HDFS <--> HDFS
– HDFS <--> Object Storage (Cloud)
– File diff-ing and checking after copies
• Build and manage pipelines
• Embedded Zeppelin Notebook
– Analyze data instantly
– Analyze at scale with Oracle R Advanced Analytics for Hadoop
Big Data Manager – Moving Data to Discovery Lab
See Live Demos of Big Data Manager:
Autonomous Hub – Big Data Cloud Service
Moscone South (Monday – Wednesday)
File browser enables self-service data movement, for example from HDFS to Object Storage
Importantly, this drag & drop or copy is turned into a Spark program, which is executed or scheduled
Machine Learning on Big Data
Your Discovery Lab in Today's Cloud World
Hadoop/HDFS
The easiest way to build out a lab is to leverage some known basics and clustering to run jobs in parallel
Pick your favorite notebook environment and start to code in there against your analytics libraries
Use libraries like R, TensorFlow and Caffe for your analytics and ML, in parallel if possible
Big Data Manager with R Interpreter
Ability to visualize several sources from Notebooks on BDA, BDCS and BDCC
Impala, HIVE sources
Spark DataFrames, HDFS
Announcing: ORAAH 2.8.0 for Spark 2.x
What is ORAAH (Oracle R Advanced Analytics for Hadoop)
• ORAAH is a set of R packages and Java libraries that provide:
– An R interface for manipulating data stored in a local File System, HDFS, HIVE, Impala or JDBC sources, and creating Distributed Model Matrices across a Cluster of Hadoop Nodes in preparation for ML.
– A general computation framework where users invoke parallel, distributed MapReduce jobs from R, writing custom mappers and reducers in R while also leveraging open source CRAN packages.
– Parallel and distributed Machine Learning algorithms that take advantage of all the nodes of a Hadoop cluster for scalable, high performance modeling on big data. Functions use the expressive R formula object optimized for Spark parallel execution.
– ORAAH's custom LM/GLM/MLP NN algorithms on Spark scale better and run faster than the open-source Spark MLlib functions, but ORAAH provides interfaces to MLlib as well.
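The mapper/reducer style of computation described above can be illustrated in plain R without a cluster. The following is a minimal local sketch in base R only; it is not ORAAH's API. The names `mapper`, `reducer` and `run_local_mapreduce`, and the toy flight records, are hypothetical and exist only for this illustration.

```r
# Local, single-process illustration of the map/reduce pattern in base R.
# All names here are hypothetical; this is not ORAAH's API.

# Mapper: emit one (key, value) pair per input record.
mapper <- function(record) {
  list(key = record$carrier, value = record$delay)
}

# Reducer: combine all values that share a key.
reducer <- function(key, values) {
  data.frame(carrier = key, mean_delay = mean(values), n = length(values))
}

run_local_mapreduce <- function(records, mapper, reducer) {
  pairs  <- lapply(records, mapper)                        # map phase
  keys   <- vapply(pairs, function(p) p$key, character(1))
  values <- vapply(pairs, function(p) p$value, numeric(1))
  groups <- split(values, keys)                            # shuffle: group values by key
  out    <- Map(reducer, names(groups), groups)            # reduce phase
  do.call(rbind, unname(out))
}

# Toy input records (made up for illustration).
flights <- list(
  list(carrier = "AA", delay = 10),
  list(carrier = "UA", delay = 30),
  list(carrier = "AA", delay = 20)
)

result <- run_local_mapreduce(flights, mapper, reducer)
print(result)
```

On a real cluster the map and reduce phases would run on different nodes and the grouping step would be a distributed shuffle; the sketch only shows how the two user-written functions divide the work.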
Where is ORAAH available?
• On premises:
– Part of the Oracle Big Data Connectors license for the Oracle Big Data Appliance, DIY Cloudera clusters and DIY Hortonworks clusters.
• On Oracle Cloud:
– Part of the Oracle Big Data Connectors license that is included with the Oracle Big Data Cloud Service and the Oracle Big Data Cloud at Customer
– Included as part of the Big Data Cloud (formerly known as Compute Edition)
ORAAH Benefits: Making Spark MLlib better for R users
ORAAH Formula parser can handle the full set of open-source R formula transformations, so it can be used with any Spark MLlib algorithm supported by ORAAH. Even in newer Spark releases (Oct 2018) SparkR fails to process a simple interaction between attributes.
Using Spark MLlib Logistic Regression model in SparkR fails:
R> model <- glm( Kyphosis ~ (Age + Number)^2, df, family = "binomial")
ERROR RBackendHandler: fitRModelFormula on org.apache.spark.ml.api.r.SparkRWrappers failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.IllegalArgumentException: Could not parse formula: Kyphosis ~ (Age + Number)^2
Using Spark MLlib Logistic Regression model via ORAAH…
R> model <- orch.ml.logistic( Kyphosis ~ (Age + Number)^2, data = data)
OBX Model Matrix: processed 1 factor variables, 0.050 sec
OBX Model Matrix: created MLlib LabeledPoint RDD (81 rows) 0.008 sec
OBX Machine Learning: MLlib Logistic Regression elapsed time 0.858 sec
R> model$coefficients
[1] -6.568918 0.027176503 1.022537535 -0.004490547
…produces exactly the same result as open-source R
glm( Kyphosis ~ (Age + Number)^2, data = kyphosis, family = "binomial")$coefficients
(Intercept) Age Number Age:Number
-6.568917860 0.027176503 1.022537536 -0.004490547
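What the `(Age + Number)^2` term asks a formula parser to do can be inspected with base R's `model.matrix()`. The sketch below uses a small made-up data frame (the values are illustrative, not the kyphosis data) and shows that the term expands to the two main effects plus their interaction, which matches the four coefficient names in the open-source R output above.

```r
# Made-up values for illustration; this is not the kyphosis data set.
df <- data.frame(Age = c(71, 158, 128, 2, 1), Number = c(3, 3, 4, 5, 4))

# (Age + Number)^2 expands to main effects plus all two-way interactions.
mm <- model.matrix(~ (Age + Number)^2, data = df)
print(colnames(mm))

# The interaction column is the element-wise product of the two attributes.
stopifnot(all(mm[, "Age:Number"] == df$Age * df$Number))

# The same design matrix columns result from spelling the terms out by hand.
mm_explicit <- model.matrix(~ Age + Number + Age:Number, data = df)
stopifnot(identical(colnames(mm), colnames(mm_explicit)))
```

A parser that rejects `^2` can therefore often be worked around by writing the interaction explicitly, at the cost of longer formulas.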
ORAAH and Python: Simple and clean code: building a Spark MLlib Random Forest model from HIVE source
Python user steps – 47 lines
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
https://github.com/apache/spark/blob/master/examples/src/main/python/ml/random_forest_classifier_example.py
https://github.com/apache/spark/blob/master/examples/src/main/python/ml/rformula_example.py
• Load Libraries
• Process Formula
• Establish Spark Session
• Copy data from HIVE
• Create 3rd copy of Data for vectors
• Build Model
• Single Vector of Predictions
ORAAH user steps – 14 lines
• Load Libraries
• Establish HIVE and Spark Session
• Build Model directly against HIVE (also HDFS, IMPALA, JDBC or Spark DF) data with full formula support
• Predictions exported with desired columns, no need to "glue back" original columns
http://www.oracle.com/technetwork/database/database-technologies/bdc/r-advanalytics-for-hadoop/documentation/index.html
Machine Learning Algorithms and Utilities in ORAAH 2.8.0
Spark 2.1+ Algorithms by Oracle, interfaces to Spark MLlib, plus HIVE, Impala and Spark DF interfaces
RED indicates new in release 2.8.0
Classification
• Extreme Learning Machines (Oracle's MPI/Spark-based)
• Hierarchical-ELM (Oracle's MPI/Spark-based)
• Multi-Layer Neural Nets (Oracle's Spark-based)
• Logistic Regression (Oracle's Spark-based)
• Gradient Boosted Trees (Spark MLlib)
• Logistic Regression (Spark MLlib)
• Decision Trees (Spark MLlib)
• Random Forest (Spark MLlib)
Regression
• Multi-Layer Neural Nets (Oracle's Spark-based)
• Linear Regression Model (Oracle's Spark-based)
• Gradient Boosted Trees (Spark MLlib)
• Linear Regression Model (Spark MLlib)
• Support Vector Machine (SVM) (Spark MLlib)
• LASSO (Spark MLlib)
• Ridge Regression (Spark MLlib)
• Random Forest (Spark MLlib)
• Decision Trees (Spark MLlib)
Clustering
• Hierarchical k-Means (Spark MLlib)
• Gaussian Mixture Models (Spark MLlib)
• Hierarchical k-Means (also available in Map-Red)
Feature Extraction & Creation
• Distributed Stochastic PCA (Oracle's MPI/Spark-based)
• Distributed Stochastic SVD (Oracle's MPI/Spark-based)
• Principal Component Analysis (Spark MLlib)
• Nonnegative Matrix Factorization (Map-Red)
• Low Rank Matrix Factorization (Map-Red)
Open Source R Algorithms
• Ability to run any R package via our hadoop.run function in Map-Reduce mode
Transparency Functions with IMPALA and HIVE
• Aggregations, Table Joins, summarization
• Variable Creation, Push & Pull data from IMPALA and HIVE
• Ability to push and pull data from Oracle Database
JDBC Driver interface - build Spark DataFrames for ORAAH
ORAAH: Up to 7x the performance of Spark MLlib for data that fits in memory, but also able to solve a 10 billion row model (which cannot fit entirely in memory)
All tests run on a 6-Node Big Data Appliance X7-2 with 256GB of RAM per Node
Formula: cancelled ~ distance + origin + dest + as.factor(month) + as.factor(year) + as.factor(dayofmonth) + as.factor(dayofweek) + as.factor(flightnum)
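Each `as.factor()` term in a formula like the one above is expanded into indicator columns when the model matrix is built, so a factor with k levels contributes k - 1 columns under R's default treatment contrasts. The base R sketch below shows this on a tiny made-up data frame; the column names mirror a few of the formula's variables, but the values are invented and the benchmark data itself is not reproduced here.

```r
# Toy flight records; values are made up for illustration.
flights <- data.frame(
  cancelled = c(0, 1, 0, 0, 1, 0),
  distance  = c(300, 1200, 800, 450, 2100, 950),
  origin    = c("SFO", "JFK", "SFO", "ORD", "JFK", "ORD"),
  month     = c(1, 1, 2, 3, 3, 2)
)

f  <- cancelled ~ distance + origin + as.factor(month)
mm <- model.matrix(f, data = flights)

# One intercept, one numeric column, and (levels - 1) indicator columns
# for each categorical term: origin has 3 levels, month has 3 levels.
print(colnames(mm))
print(dim(mm))
```

With high-cardinality terms such as a flight number, the number of columns grows with the number of distinct values, which is one reason model matrix construction is worth distributing.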
Oracle Big Data solutions come with optimized settings for large-scale Machine Learning training and scoring: Classification Examples of 1 billion rows
ORAAH's New Distributed SVD
Benchmark of ORAAH's (Spark+MPI) vs Spark MLlib: 6x faster + linear scale up
Singular Value Decomposition, 20k x 20k dense input (3.2Gb), 10 threads (12 actual cores, -Xmx40Gb)
Rank    D-SVD Spark+MPI    MLlib SVD +netlib-java
100     [not legible]      491
150     [not legible]      707
200     [not legible]      982
[Chart legend: Spark MLlib, ORAAH Spark+MPI]
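Two details of the benchmark setup can be checked with a few lines of base R: the 3.2 GB size quoted for a dense 20k x 20k input, and what a rank-k decomposition computes. The sketch below assumes 8-byte double-precision entries and assumes that "Rank" refers to the number of leading singular triplets requested; it runs `svd()` on a small random matrix on a single machine and says nothing about how the distributed implementations work internally.

```r
# Memory footprint of a dense 20k x 20k matrix of 8-byte doubles.
n <- 20000
bytes <- n * n * 8
cat(sprintf("%.1f GB\n", bytes / 1e9))

# Rank-k truncated SVD on a small random matrix (base R, single machine).
set.seed(42)
A <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
k <- 10
s <- svd(A, nu = k, nv = k)   # s$d still holds all 50 singular values
A_k <- s$u %*% diag(s$d[1:k]) %*% t(s$v)

# By the Eckart-Young theorem, the Frobenius-norm error of the rank-k
# approximation equals the root-sum-of-squares of the discarded singular values.
err <- norm(A - A_k, type = "F")
stopifnot(isTRUE(all.equal(err, sqrt(sum(s$d[-(1:k)]^2)))))
```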
New ORAAH Utility functions for Data Processing and Ingest
JDBC and Spark Data Frames
ORAAH 2.8.0 - New Spark DF data manipulation functions
orch.csv2df: Interprets a CSV file and loads it in memory into a Spark DF. Source files are on the local file system or HDFS
orch.describe: Generates a simple summary of the information in a Spark DF
orch.collect: Brings a Spark DF into R's local memory for further manipulation or result printing
ORAAH 2.8.0 - New Spark DF Scaling function
orch.scale: Scales a Spark DF using one of 15 different techniques. Critical for many ML algorithms sensitive to scale/outliers
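To make the idea of scaling concrete, here are two widely used techniques, standardization and min-max scaling, applied to a local data frame in base R. This is only an illustration of what scaling does to each column; it is not the implementation behind the Spark DF scaling function, the slide does not list which 15 techniques are offered, and the helper names `standardize` and `min_max` are hypothetical.

```r
# Two common scaling techniques on a local data frame (base R only).
# Helper names are hypothetical; this is not ORAAH's implementation.

# Standardization: zero mean, unit standard deviation.
standardize <- function(x) (x - mean(x)) / sd(x)

# Min-max scaling: map the observed range onto [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Made-up values; note the very different magnitudes of the two columns.
df <- data.frame(distance = c(300, 1200, 800, 450, 2100),
                 delay    = c(5, 0, 45, 12, 3))

z  <- as.data.frame(lapply(df, standardize))
mm <- as.data.frame(lapply(df, min_max))

print(round(z, 3))
print(round(mm, 3))
```

After either transformation the columns are on comparable scales, which matters for distance-based and gradient-based algorithms. Min-max scaling is the more sensitive of the two to a single extreme value, since that value defines the range.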
show: Spark DFs accept many rJava default functions, from show() to count()
New JDBC interface
• Creates a Spark DataFrame from a JDBC source that can be used on any of ORAAH's Spark-based ML algorithms.
Model performance sample for Allstate Prediction Challenge
Balanced sample out of the original 13.4 million records, 60%/40% split for Testing
ORAAH Spark+MPI ELM
• Almost as good as the best MLP
• 3x faster to build
ORAAH Spark GLM2
• Fastest to Build and Score
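The sampling setup mentioned on this slide, a balanced sample followed by a 60%/40% split, can be sketched in base R as follows. The slide does not say how the balanced sample was drawn or which share was used for training, so this sketch assumes random downsampling of the majority class and a 60% training / 40% test split. The data is synthetic and stands in for the real records.

```r
set.seed(123)

# Synthetic, imbalanced binary target; stands in for the real records.
n   <- 10000
dat <- data.frame(x = rnorm(n), y = rbinom(n, 1, 0.05))

# Balance: keep all positives and an equal-sized random sample of negatives.
pos <- which(dat$y == 1)
neg <- sample(which(dat$y == 0), length(pos))
balanced <- dat[c(pos, neg), ]

# 60% / 40% random split of the balanced sample (assumed: 60% train, 40% test).
idx     <- sample(nrow(balanced))
n_train <- floor(0.6 * nrow(balanced))
train   <- balanced[idx[seq_len(n_train)], ]
test    <- balanced[idx[-seq_len(n_train)], ]

cat("balanced rows:", nrow(balanced),
    " train:", nrow(train), " test:", nrow(test), "\n")
```

Balancing changes the class prior seen by the model, so predicted probabilities from a model trained this way generally need recalibration before being applied to the original, imbalanced population.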
Live Demo
Big Data Manager Notebooks – Working with Large Spark Clusters
Deep Learning – motivation for the future
Regression
Decision Trees
http://www.mlyearning.org/
cloudcustomerconnect.oracle.com