Copyright © 2016 Oracle and/or its affiliates. All rights reserved.

Scalable Machine Learning using Big Data SQL, Hadoop and Spark
Advanced Analytics at Scale
BIWA Summit 2016
Marcos Arancibia, Product Manager, Oracle Data Science

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Use Cases
1. Root Cause Analysis of semiconductor manufacturing that uses custom-built R algorithms running against data in the Database and against Hadoop to verify the potential of Big Data SQL.
2. Predicting airline flight cancellations using Logistic Regression directly against a Big Data Cluster.

Data Analytics Challenge
Separate data access interfaces…

Data Analytics Challenge
…which require separate Predictive Analytics interfaces.
[Diagram labels: NoSQL, OAA, ORAAH, R]

Oracle Big Data SQL
Expanding the reach of OAA with Predictive Analytics via SQL
[Diagram labels: NoSQL, OAA, ORAAH, R]

1. Root Cause Analysis - Requirements
Semiconductor Industry
• BISTel is an Oracle Partner in the USA and in Korea, and it is the leading provider of equipment engineering systems and services for the fabrication of semiconductor chips and flat panel displays.
• BISTel offers solutions and services that enable customers to achieve yield improvement and increase productivity.
• The requirement was to be able to use their original 8,000+ lines of R code, which express some of their custom algorithms, on large-scale data, using both Oracle Database data and data stored in Hadoop.

Traditional Manufacturing Yield Management Process
Traditionally:
• Identify yield loss patterns manually or through pre-defined pattern libraries
• Extract items/lots with similar patterns
• Use multiple 3rd party tools to do different analyses to find the cause down to the suspected tool/chamber level
• Ask equipment engineers to look at their tools for any possible cause
• Engineer searches through various systems to see if there is any correlation
• Report on possible causes
Result: Hours/Days to Find Root Cause
[Diagram labels: YMS, In-House, 3rd party YMS, SAS, SPSS, JMP, 3rd party BI tools, EES, Tool Logs, FDC, SPC, EPT, RMS, Report]

New Analytical Approach for Manufacturing
New Paradigm: Changing the way you analyze
• Problem with yield
• Run pattern detection and classification, root cause analysis together for enhanced accuracy and TTRCI
• Run in manual or auto mode
• Engineer views result and makes decision/report

ROI from Customer: Time to Root Cause Identification (TTRCI)
Customer | TTRCI for Traditional Method  | TTRCI using BISTel MA + IM
A        | 7 days                        | Within 4 hours
B        | 3 weeks                       | Within 4 hours
B        | Could not find within 4 weeks | Found within 3 hours

E.g.: Auto Pattern Recognition and Classification
Data
• Coordinate (e.g. x, y)
• Quality Measurement (e.g. thickness, defect)
• Yield data (e.g. good/bad, grade)
Steps
• Manipulate data and convert to map (e.g. circle, rectangle, etc.)
• Extrapolate and interpolate data to full map
• Dynamic pattern recognition and classification
• Root cause analysis of pattern

E.g.: Root Cause Analysis with Machine Learning
Root Cause Analysis using Advanced Data Mining
Data
• Effect data
• Cause candidate data
• Continuous/categorical
Flow: Select Effect, Select Cause Candidates, Ranked Result, Engineer Decision

Solution Architecture Overview
Overview of Analytics using Oracle R Enterprise (ORE)
[Architecture diagram labels: clients R Console, RStudio, eDataLyzer, SQL Developer or PL/SQL, connected via JDBC; EXADATA with Oracle R Enterprise and Oracle R Distribution; Big Data Appliance (BDA) with Oracle R Distribution; ORE Packages holding BISTel's algorithms (MA, Intellimine, TA)]

Interfaces of the usage in Production: Database
How ORE is used in Production Environment
[Architecture diagram labels: ETL, User Request, Auto Request, Notify Analysis Result, .Net Client, Java Analysis Server, JDBC, EXADATA, Various Databases for OLTP; on EXADATA, many parallel ExtProc processes, each running Intellimine]

Features of Oracle Advanced Analytics
From either an R Client or a SQL Client, OAA in-Database algorithms, R Engines and Open-Source R Packages can be accessed.
[Diagram: Oracle Database Server with Advanced Analytics Option]
• R Client (R): R Analytics through Oracle R Enterprise. ORE parallel algorithms: MLP Neural, Stepwise, LM, GLM, PCA. Access to open-source R packages.
• SQL Client (SQL Developer, other SQL apps): SQL Basic Statistics and Joins; Data Mining Predictive Analytics with 15 PL/SQL in-Database algorithms.

Features of Oracle Advanced Analytics
Oracle R Distribution x64 running on the Intel Platform can make use of Intel's MKL for additional performance, even with open-source R packages.

Benchmark                    | ORD with internal BLAS/LAPACK, 1 thread | ORD+MKL, 1 thread | ORD+MKL, 2 threads | ORD+MKL, 4 threads | ORD+MKL, 8 threads | Performance gain, ORD+MKL 4 threads | Performance gain, ORD+MKL 8 threads
Matrix Calculations          | 11.2  | 1.9   | 1.3   | 1.1  | 0.9  | 9.2x  | 11.4x
Matrix Functions             | 7.2   | 1.1   | 0.6   | 0.4  | 0.4  | 17.0x | 17.0x
Matrix Multiply              | 517.6 | 21.2  | 10.9  | 5.8  | 3.1  | 88.2x | 166.0x
Cholesky Factorization       | 25    | 3.9   | 2.1   | 1.3  | 0.8  | 18.2x | 29.4x
Singular Value Decomposition | 103.5 | 15.1  | 7.8   | 4.9  | 3.4  | 20.1x | 40.9x
Principal Component Analysis | 490.1 | 42.7  | 24.9  | 15.9 | 11.7 | 29.8x | 40.9x
Linear Discriminant Analysis | 419.8 | 120.9 | 110.8 | 94.1 | 88.0 | 3.5x  | 3.8x

This benchmark was executed on a 3-node cluster, with 24 cores at 3.07GHz per CPU and 47 GB RAM, using Linux 5.5. More details at https://blogs.oracle.com/R/entry/oracle_r_distribution_3_0
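The "performance gain" columns of the MKL benchmark are not defined on the slide. For the rows checked below, the printed values agree with (baseline time - MKL time) / MKL time. That definition is an inference from the numbers, not something the slide states, and the data frame and function names are made up for this sketch.

```r
# Timings copied from three rows of the benchmark table (units as on the slide).
bench <- data.frame(
  test     = c("Matrix Calculations", "Matrix Functions", "Matrix Multiply"),
  baseline = c(11.2, 7.2, 517.6),  # ORD with internal BLAS/LAPACK, 1 thread
  mkl4     = c(1.1, 0.4, 5.8),     # ORD + MKL, 4 threads
  mkl8     = c(0.9, 0.4, 3.1)      # ORD + MKL, 8 threads
)

# Assumed definition: time saved, expressed as a multiple of the MKL run time.
perf_gain <- function(baseline, accelerated) {
  (baseline - accelerated) / accelerated
}

bench$gain4 <- round(perf_gain(bench$baseline, bench$mkl4), 1)
bench$gain8 <- round(perf_gain(bench$baseline, bench$mkl8), 1)
print(bench)
```

Under this definition a gain of "9.2x" means the baseline took 9.2 run-times longer than the MKL run, so the plain ratio of the two times is one unit higher.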

Features of Oracle Advanced Analytics
R: Transparency through function overloading. E.g. in-database aggregation function
> aggdata <- aggregate(ONTIME_S$DEST,
+                      by = list(ONTIME_S$DEST),
+                      FUN = length)
Oracle Advanced Analytics ORE Client Packages (Transparency Layer) generate the Oracle SQL that runs as in-db stats against ONTIME_S:
select DEST, count(*)
from ONTIME_S
group by DEST
> class(aggdata)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> head(aggdata)
  Group.1    x
1     ABE  237
2     ABI   34
3     ABQ 1357
4     ABY   10
5     ACK    3
6     ACT   33
Oracle Distribution of R version 3.2.0 (--) -- "Full of Ingredients"
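To see what the overloaded aggregate() call computes, the same call shape can be run in plain R on a local data frame. This is a minimal sketch: `flights` is a hypothetical stand-in for a few rows of ONTIME_S, and nothing here touches a database or ORE.

```r
# Hypothetical stand-in for a few rows of ONTIME_S (destination airport codes).
flights <- data.frame(DEST = c("ABE", "ABQ", "ABE", "ACK", "ABQ", "ABE"),
                      stringsAsFactors = FALSE)

# Same call shape as on the slide: count rows per destination,
# i.e. the local equivalent of "select DEST, count(*) ... group by DEST".
aggdata <- aggregate(flights$DEST,
                     by = list(flights$DEST),
                     FUN = length)
print(aggdata)
```

The result has the same column names, Group.1 and x, as the ore.frame shown on the slide. With ORE the difference is that the counting happens inside the database rather than in the R session.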

Features of Oracle Advanced Analytics
R: Scalable Machine Learning Models. E.g. custom parallel distributed lm model
> options(ore.parallel=4)
> lm_mod <- ore.lm(ARRDELAY ~ DISTANCE + DEPDELAY,
+                  data=ONTIME_S)
[Diagram labels: Oracle Advanced Analytics ORE Client Packages (Transparency Layer); Parallel ORE Framework; ONTIME_S; four extproc processes, each with Oracle R Distribution, R Packages and the parallel ore.lm compute engine; steps numbered 1 to 4]
> summary(lm_mod)
Call:
ore.lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = ONTIME_S)

Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared: 0.7647, Adjusted R-squared: 0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF, p-value: < 2.2e-16
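One common way to make a linear regression parallel and distributed is to have each worker compute partial cross-products on its share of the rows and then combine them. The sketch below shows that idea in base R on synthetic data. It is not a description of how ore.lm is implemented, which the slides do not specify, and the data, the chunk count and all variable names are assumptions for illustration.

```r
set.seed(42)
n <- 1000
# Synthetic stand-in for ONTIME_S with the three columns used in the formula.
dat <- data.frame(DISTANCE = runif(n, 100, 3000),
                  DEPDELAY = rnorm(n, 10, 30))
dat$ARRDELAY <- 0.2 - 0.001 * dat$DISTANCE + 0.96 * dat$DEPDELAY + rnorm(n, 0, 15)

# Split the rows into 4 chunks, as if each chunk went to a separate worker.
chunks <- split(dat, rep(1:4, length.out = n))

# Each "worker" computes its partial X'X and X'y.
partials <- lapply(chunks, function(chunk) {
  X <- model.matrix(ARRDELAY ~ DISTANCE + DEPDELAY, data = chunk)
  list(xtx = crossprod(X), xty = crossprod(X, chunk$ARRDELAY))
})

# Combine the partial sums and solve the normal equations once.
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))
coef_chunked <- drop(solve(xtx, xty))

# Reference fit on the full data set.
coef_lm <- coef(lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat))
print(rbind(chunked = coef_chunked, lm = coef_lm))
```

Only the small cross-product matrices move between workers, which is why this pattern scales to tables that do not fit in one R session.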

Features of Oracle Advanced Analytics
Server execution of open-source R package: ore.tableApply()
> mod_biglm <- ore.tableApply(dat = ONTIME_S,   # Database table
+     function(dat) {
+       library(biglm)   # Load open-source package
+       biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
+     });
[Diagram labels: Oracle Advanced Analytics ORE Client Packages (Embedded R); Embedded R ORE Framework; ONTIME_S; one extproc process with Oracle R Distribution and open-source R packages; steps numbered 1 to 4]
> library(biglm)       # Load open-source package locally to interpret results
> summary(mod_biglm)   # Summary of the resulting Model
Large data regression model: biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
Sample size = 392805
               Coef    (95%     CI)     SE      p
(Intercept)  0.0638 -0.7418  0.8693 0.4028 0.8742
DISTANCE    -0.0014 -0.0021 -0.0006 0.0004 0.0002
DEPDELAY     1.0552  1.0373  1.0731 0.0090 0.0000

Features of Oracle Advanced Analytics
Server execution of open-source R package: ore.groupApply() for data parallelism
> options(ore.parallel=4)
> modList <- ore.groupApply(dat = ONTIME_S,   # Database table
+     INDEX = ONTIME_S$DEST,                  # group-by column
+     function(dat) {
+       library(biglm)   # Load open-source package
+       biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
+     });
> library(biglm)     # Load open-source package locally to interpret results
> summary(modList)   # Checks how many models we have in the model list
  Length    Class     Mode
     325 ore.list       S4
> summary(modList$BOS)   # Request the resulting Model for Boston Logan Airport
Large data regression model: biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
Sample size = 3928
               Coef    (95%     CI)     SE      p
(Intercept)  0.0638 -0.7418  0.8693 0.4028 0.8742
DISTANCE    -0.0014 -0.0021 -0.0006 0.0004 0.0002
DEPDELAY     1.0552  1.0373  1.0731 0.0090 0.0000
[Diagram labels: Oracle Advanced Analytics ORE Client Packages (Transparency Layer); Parallel ORE Framework; ONTIME_S; four extproc processes, each with Oracle R Distribution and open-source R packages; steps numbered 1 to 4]
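The group-by pattern behind ore.groupApply(), one model per value of an index column, can be tried locally with split() and lapply(). This sketch assumes the built-in mtcars data as a stand-in for the database table, with cyl playing the role of DEST and lm() replacing biglm(). It runs sequentially in one R session, whereas the slide's version runs the groups in parallel on the database server.

```r
# Built-in mtcars stands in for the database table; 'cyl' is the group-by column.
groups <- split(mtcars, mtcars$cyl)

# Fit one model per group; each call sees only the rows of its own group.
modList <- lapply(groups, function(dat) {
  lm(mpg ~ wt + hp, data = dat)
})

length(modList)        # one model per distinct value of cyl
coef(modList[["4"]])   # inspect the model for the 4-cylinder group
```

As on the slide, the result is a list indexed by group value, so a single group's model is retrieved by name.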

Features of Oracle Advanced Analytics
SQL interface rqEval for R scripts – can also generate XML string for graphic output
• R script output is often dynamic – not conforming to pre-defined structure
• R apps generate stats, new data, graphics
• Example
  – Plot 100 random numbers
  – Return a vector with values 1 to 10
  – Return the results as XML
Oracle PL/SQL (wrapping the R language script):
begin
  sys.rqScriptCreate('Example6',
 'function(){
    res <- 1:10
    plot( 1:100, rnorm(100), pch = 21,
          bg = "red", cex = 2 )
    res
  }');
end;
/
Oracle SQL:
select value
from table(rqEval(NULL,'XML','Example6'));

Performance on Auto Pattern Classification: EXADATA only
Performance of ORE for Auto Pattern Classification
[Bar chart: minutes (log scale) by database size (records)]
Run    | Database size | Records     | Approx. wafers (semiconductor) | Scale | Minutes
BISTEL | 2GB           | 57 million  | 30,000                         | 1x    | 17
OAA    | 2GB           | 57 million  | 30,000                         | 1x    | 8.8
OAA    | 20GB          | 570 million | 300,000                        | 10x   | 37.4
OAA    | 200GB         | 5.7 billion | 3 million                      | 100x  | 278
Tested on 2-node EXADATA X3

OAA with Big Data SQL: EXADATA + BDA
Using the in-Database algorithms, plus R Engine and Open-Source R Packages if desired
[Diagram labels: Oracle EXADATA with Advanced Analytics Option; R Client (R) with R Analytics through Oracle R Enterprise; SQL Client (SQL Developer, other SQL apps); Big Data SQL; Oracle BIG DATA APPLIANCE]

Performance on Micro Level Data Analytics: EXADATA + BDA
Performance of ORE for Micro Level Data Analytics
[Bar chart: algorithm execution time in seconds (log scale) by data size; legend: Exa, BDS (Exa+BDA); scale annotations: 1x, 5x, 50x, 250x]
Data size | Seconds
4G        | 42.0
20G       | 111.5
200G      | 855.0
10T       | 41,652
With data either as Database tables or as HDFS files on the BDA, the performance was the same, highlighting the throughput of Big Data SQL, and OAA's agility in running open-source R code against group-by problems.
Tested with EXADATA + Big Data SQL + OAA on a full rack EXADATA X5-2, via InfiniBand to a 9-node BDA X5-2. Degree of Parallelism (DoP) set to 288.

Conclusion: ROI using BISTel Analytics with OAA
ROI for Customer
• Shorten TTRCI (Time to Root Cause Identification): within 1 day, down from 2 weeks to 2 months
• Reduce Investment Cost: use of proven EXADATA and ORE with minimal data migration from existing Oracle DB
ROI for Solution Provider
• Shorten Time to Market
• Faster POC (Proof of Concept) at customer site with new ideas/apps: reduced by more than 30 days
• Dramatic reduction of development and test time: at least by 50%

2. Flight Cancellation prediction in USA airports
Airline Industry
• On-time arrival data for non-stop domestic flights by major air carriers can be found at the Bureau of Transportation Statistics website, and is free to download: http://www.transtats.bts.gov/Fields.asp?Table_ID=236
• Several benchmarks have been executed against this dataset, known as ONTIME. The original combined dataset had 123 million records and contained data from October 1987 to April 2008. It is available at many websites, including the American Statistical Association: http://stat-computing.org/dataexpo/2009/
• We augmented the original data with information until September 2014 to get 159 million unique records. For scalability testing, we appended the file onto itself to get 1 billion total records.

Oracle R Advanced Analytics for Hadoop
Advanced Analytics algorithms in a Hadoop Cluster: Map-Reduce and Spark based
• Regression and Classification: Linear Regression, Generalized Linear Model, Logistic Regression, Multi-Layer Neural Networks
• Attribute Importance: Principal Components Analysis
• Clustering: Hierarchical k-Means
• Feature Extraction: Nonnegative Matrix Factorization (NMF)
• Collaborative Filtering (LMF)
• Statistical Functions: Correlation, Covariance, Cross Tabulation, Summary statistics

Oracle R Advanced Analytics for Hadoop vs. RHadoop (RMR)
Best platform available to run Map-Reduce R jobs vs. Revolution Analytics' RHadoop
Performance on a 6-node BDA X3-2, 16 cores and 47 GB of total RAM assigned
Covariance computation on 100 GB HDFS / 200 columns input dataset
More info at https://blogs.oracle.com/R/entry/oraah_enabling_high_performance_r

Oracle R Advanced Analytics for Hadoop: Integration
Using ORAAH's Hadoop and HIVE Integration, plus R Engine and Open-Source R Packages
[Diagram: Hadoop Cluster with Oracle R Advanced Analytics for Hadoop; Oracle Database Server with Advanced Analytics option]
• R Client (R): R Analytics through Oracle R Advanced Analytics for Hadoop. ORAAH distributed algorithms: MLP Neural Nets*, Logistic Reg*, GLM, LM, PCA, k-Means, NMF, LMF. Open-source R packages via Map-Reduce.
• SQL Client (SQL Developer, other SQL apps): HQL Basic Statistics, Data Prep, Joins and View creation.
* Spark-Caching enabled

Invoke ORAAH transparent functions for HIVE: JOIN
ORAAH: Transparency functions against HIVE tables
Join the two tables by one common variable
> joined <- merge(tab_input, tab_input2, by="value")
The new table is a temporary HIVE table not seen in here
> ore.ls()
[1] "tab_input"  "tab_input2"
But, it's part of the local R objects
> ls()
[1] "joined"
> names(joined)
 [1] "value" "v1.x"  "v2.x"  "v3.x"  "v4.x"  "v5.x"  "v6.x"  "v7.x"  "v8.x"  "v9.x"
[11] "v10.x" "v11.x" "v12.x" "v13.x" "v14.x" "v15.x" "v16.x" "v17.x" "v18.x" "v19.x"
[21] "v20.x" "v21.x" "v1.y"  "v2.y"  "v3.y"  "v4.y"  "v5.y"  "v6.y"  "v7.y"  "v8.y"
[31] "v9.y"  "v10.y" "v11.y" "v12.y" "v13.y" "v14.y" "v15.y" "v16.y" "v17.y" "v18.y"
[41] "v19.y" "v20.y" "v21.y"
[Diagram labels: Oracle R Advanced Analytics for Hadoop Client Packages (HIVE Transparency Engine); HQL; HIVE Thrift Server; Metastore; HDFS Storage, /user/oracle/tab_input; steps numbered 1 to 4]

Invoke ORAAH transparent functions for HIVE: Create new Variables and Summary Statistics
ORAAH: Transparency functions against HIVE
We can create a new variable that is a combination of other columns in the joined table, using just R's plain syntax.
> joined$NEW_VARIABLE <- (joined$v1.x + joined$v1.y)/2
The new variable can be used in any type of computation.
Checking the content of the new Variable:
> summary(joined$NEW_VARIABLE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -14890  -12040  -10190  -10230   -8621   -2609
> head(joined$NEW_VARIABLE)
[1]  -9862.83  -9705.47  -9643.92  -8655.63 -11572.07 -11702.89
[Diagram labels: Oracle R Advanced Analytics for Hadoop Client Packages (HIVE Transparency Engine); HQL; HIVE Thrift Server; Metastore; HDFS Storage, /user/oracle/tab_input; steps numbered 1 to 4]
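The merge() and derived-column steps from the two HIVE transparency slides use ordinary R syntax, so their behaviour can be checked on small local data frames. In this sketch `tab_a` and `tab_b` are hypothetical stand-ins for tab_input and tab_input2, with only two value columns each. The .x and .y suffixes come from base R's merge() when both inputs share non-key column names.

```r
# Two small tables sharing the key column "value" and the column names v1, v2.
tab_a <- data.frame(value = c(1, 2, 3), v1 = c(10, 20, 30), v2 = c(0.1, 0.2, 0.3))
tab_b <- data.frame(value = c(2, 3, 4), v1 = c(200, 300, 400), v2 = c(2, 3, 4))

# Inner join on the common variable; clashing names get .x / .y suffixes.
joined <- merge(tab_a, tab_b, by = "value")
names(joined)

# New variable as a combination of columns from both sides of the join.
joined$NEW_VARIABLE <- (joined$v1.x + joined$v1.y) / 2
print(joined)
```

Only the keys present in both tables (2 and 3 here) survive the join, which is the default inner-join behaviour of merge().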

Invoke open-source lm model and collect graphical results in Map-Reduce using just R
ORAAH: Machine Learning models against HDFS data
dfs <- hdfs.put(iris, key='Species')
res <- NULL
dfs.res <- hadoop.run(dfs,
  mapper = function(key, vals) {
    keyval(key, vals)
  },
  reducer = function(key, vals) {
    dat <- do.call(rbind.data.frame, vals)
    orch.dlogv(colnames(dat))
    mod = lm(Petal.Length ~ Sepal.Length+Petal.Width, data=dat)
    fname <- paste("fit-",key,".png",sep="")
    png(fname)
    par(mfrow=c(2, 2), cex=0.6, mar=c(6, 6, 6, 4), mex=0.8)
    plot(mod,id.n=1, cex.caption=0.8, which=1:4)
    dev.off()
    hdfs.fdir <- "/user/pngfiles"
    hdfs.fname <- paste(hdfs.fdir,"/",fname, sep="")
    system(paste("hadoop fs -copyFromLocal", fname, hdfs.fdir))
    pred <- predict(mod, dat)
    keyval(NULL, orch.pack(pred, hdfs.fname))
  })
res <- hdfs.get(dfs.res)
finalres = list()
for(i in 1:nrow(res)) {
  finalres[[i]] <- orch.unpack(res[i,])
}
[Diagram labels: Oracle RAAH Client Packages (Map/Reduce Call); YARN: Hadoop Map Reduce Job; /user/oracle/iris; Mapper(s); Reducer(s); R Result Object Stored in HDFS; steps numbered 1 to 5]
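The data flow of the Map-Reduce job above (emit key/value pairs, group them by key, then fit one model per key) can be emulated in a single R session. The sketch below is only an illustration of that flow in base R: it uses no ORAAH functions, does not touch HDFS or write PNG files, and the names `pairs`, `shuffled` and `results` are made up for the example.

```r
# Map step: emit one (key, value) pair per row, keyed by Species.
pairs <- lapply(seq_len(nrow(iris)), function(i) {
  list(key = as.character(iris$Species[i]), val = iris[i, ])
})

# Shuffle step: collect all values that share a key.
keys <- vapply(pairs, function(p) p$key, character(1))
shuffled <- split(lapply(pairs, `[[`, "val"), keys)

# Reduce step: rebuild a data frame per key, fit lm, return the predictions.
results <- lapply(shuffled, function(vals) {
  dat <- do.call(rbind.data.frame, vals)
  mod <- lm(Petal.Length ~ Sepal.Length + Petal.Width, data = dat)
  predict(mod, dat)
})

lengths(results)   # number of predictions returned for each key
```

The reducer body mirrors the one on the slide (rbind the values, fit, predict), so the same function logic could be developed locally before being handed to a cluster.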

Invoke ORAAH custom parallel distributed model (Linear Regression)
ORAAH: Machine Learning models against HDFS data
> ontime <- hdfs.attach("/user/oracle/ontime_s")
> lm_mod <- orch.lm(ARRDELAY ~ DISTANCE + DEPDELAY,
+                   dfs.dat=ontime, nMappers = 4, nReducers = 2)
[Diagram labels: Oracle R Advanced Analytics for Hadoop Client Packages (Machine Learning algorithms module); YARN: Hadoop Map Reduce Job; /user/oracle/ontime_s; Mappers; Custom Java Algorithm Reducers; steps numbered 1 to 4]
> summary(lm_mod)
Call:
ore.lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = ONTIME_S)

Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared: 0.7647, Adjusted R-squared: 0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF, p-value: < 2.2e-16
Oracle Distribution of R version 3.1.1 (--) -- "Sock it to Me"

Invoke ORAAH custom parallel distributed GLM Model using Spark Caching
ORAAH: Machine Learning in Spark against HDFS data
> # Connects to Spark and reserves a dedicated Context
> spark.connect("yarn-client", memory="24g")
> # Creates a pointer to the HDFS file for use within R
> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")
> # Checks the size of dataset (rows and columns)
> formatC(hdfs.dim(ont1bi), digits=4, format='fg', big.mark = ",")
[1] "1,000,000,000" " 30"
> # Formula definition: Cancelled flights (0 or 1) based on other attributes
> form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) +
+                    F(DAYOFMONTH) + F(DAYOFWEEK)
> system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi))
ORCH GLM: processed 6 factor variables, 25.806 sec
ORCH GLM: created model matrix, 100128 partitions, 32.871 sec
ORCH GLM: iter 1, deviance 1.38433414089348300E+09, elapsed time 9.582 sec
ORCH GLM: iter 2, deviance 3.39315388583931150E+08, elapsed time 9.213 sec
ORCH GLM: iter 3, deviance 2.06855738812683250E+08, elapsed time 9.218 sec
ORCH GLM: iter 4, deviance 1.75868100359263200E+08, elapsed time 9.104 sec
ORCH GLM: iter 5, deviance 1.70023181759611580E+08, elapsed time 9.132 sec
ORCH GLM: iter 6, deviance 1.69476890425481350E+08, elapsed time 9.124 sec
ORCH GLM: iter 7, deviance 1.69467586045954760E+08, elapsed time 9.077 sec
ORCH GLM: iter 8, deviance 1.69467574351380850E+08, elapsed time 9.164 sec
   user  system elapsed
 84.107   5.606 143.591
[Diagram labels: Oracle R Advanced Analytics for Hadoop Client Packages (Spark-Based Machine Learning algorithms module); /user/oracle/ontime_1bi]
1. YARN: Spark Context Creation. Reserve Memory in a dedicated Context.
2. Spark Job Execution. Loads data from HDFS and runs ORAAH's in-Memory Custom Machine Learning algorithm.
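The ORCH GLM log above prints a deviance that falls at each iteration until it stabilises. A textbook way to fit a logistic regression with that behaviour is iteratively reweighted least squares (IRLS), sketched here in base R on synthetic data. This is not a description of ORAAH's Spark algorithm, whose internals the slides do not give, and the data, the variable names and the fixed count of 8 iterations are assumptions chosen to echo the log.

```r
set.seed(1)
n <- 2000
# Synthetic stand-in for the flight data: one numeric and one factor predictor.
distance <- runif(n, 0.1, 3)
month <- factor(sample(1:12, n, replace = TRUE))
eta <- -2 + 0.3 * distance + ifelse(month %in% c("12", "1", "2"), 0.8, 0)
cancelled <- rbinom(n, 1, plogis(eta))

X <- model.matrix(~ distance + month)   # factor expanded to indicator columns
beta <- rep(0, ncol(X))

for (iter in 1:8) {
  eta_hat <- drop(X %*% beta)
  mu <- plogis(eta_hat)                  # fitted probabilities
  w <- mu * (1 - mu)                     # IRLS weights
  z <- eta_hat + (cancelled - mu) / w    # working response
  # Weighted least-squares update: solve (X'WX) beta = X'Wz.
  beta <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
  mu_new <- plogis(drop(X %*% beta))
  dev <- -2 * sum(cancelled * log(mu_new) + (1 - cancelled) * log(1 - mu_new))
  cat(sprintf("iter %d, deviance %.6f\n", iter, dev))
}

# Reference fit with base R's glm().
fit <- glm(cancelled ~ distance + month, family = binomial())
```

Each iteration needs only the cross-products X'WX and X'Wz, which are sums over rows. That makes the update suitable for partitioned, cached data, since every pass re-reads the same rows.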

Performance against ORAAH's Map-Reduce GLM
ORAAH's Spark-based GLM against HDFS data
Performance on a 6-node BDA X3-2 with CDH 5.3.0, 24 cores and 96 GB of RAM per node; Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per node
More details at https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for

Oracle R Advanced Analytics for Hadoop – Spark vs. Map-Reduce
Neural Networks Performance – Linear Model, 845 weights
Performance on a 6-node BDA X3-2 with CDH 5.3.0, 24 cores and 96 GB of RAM per node; Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per node
More details at https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for

ORAAH on Spark: 1 billion records and large number of coefficients
Scalable GLM-Logistic Regression and Complex Nonlinear Deep Neural Networks
Performance on a 6-node BDA X3-2 with CDH 5.3.0, 24 cores and 96 GB of RAM per node; Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per node
More details at https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for

Performance measured on same Hardware and same HDFS input Dataset
ORAAH's Spark-based GLM vs. Spark MLlib GLM
[Chart annotations: 14.5x, 13.9x, 5x]

Oracle R Advanced Analytics for Hadoop
Efficient use of Spark Caching memory, even at minimum levels
Performance on 159 million records on an X4-2 Server, 40 threads, 128 GB of RAM, CDH 5.3.0; Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per node
• GLM: Logistic Regression model with 845 Coefficients
• Neural Networks: Model using 1 Layer of Neurons, linear activation function, 838 Coefficients

Roadmap
Architecture
[Diagram labels: SQL, R, GUI; Algorithms: common core, parallel, distributed; Hadoop, Relational, Cloud]

Oracle Advanced Analytics: On Premise or Cloud
100% Compatibility Enables Easy Coexistence and Migration
• Same Architecture
• Same Analytics
• Same Standards
Transparently move workloads and analytical methodologies between On-premise and public cloud
[Diagram labels: Oracle Cloud, On-Premise, CoExistence and Migration]

Data Science already available on the Oracle Cloud
Cloud-Based Advanced Analytics
Oracle Advanced Analytics option, including Oracle Data Mining and Oracle R Enterprise, on:
• Oracle EXADATA Cloud Service
• Oracle Database as a Service: included in High Performance and Extreme Performance services
Oracle R Advanced Analytics for Hadoop
• Included in the Big Data Cloud Service