Session 6 - Application areas in advanced statistical technologies research (R Language)
Cortana Intelligence Suite – from data to action

[Architecture diagram, bottom to top:]
• Data Sources: Apps, Sensors and devices, Data
• Information Management: Event Hubs, Data Catalog, Data Factory
• Big Data Stores: SQL Data Warehouse, Data Lake Store
• Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
• Intelligence: Cortana, Bot Framework, Cognitive Services
• Dashboards & Visualizations: Power BI
• Apps: Web, Mobile, Bots
• Action: People, Automated Systems
Microsoft + Hortonworks
Azure Data Lake Analytics and HDInsight

[Diagram: both services run on YARN over a shared Store layer (HDFS-compatible); analytics via U-SQL on Data Lake Analytics and via Hive and R Server on HDInsight]
• Store and analyze data of any kind and size
• Develop faster, debug and optimize smarter
• Interactively explore patterns in your data
• No learning curve
• Managed and supported
• Dynamically scales to match your business priorities
• Enterprise-grade security
• Built on YARN, designed for the cloud
The highest levels of security in a managed Cloud Hadoop solution
Authentication and identity management in a few clicks
Azure HDInsight is the first big data service to seamlessly integrate Azure Active Directory and Azure Active Directory Domain Services for enterprise-grade authentication and identity management. Securing a Hadoop cluster takes only a few clicks, and you can leverage your existing on-premises Active Directory deployment, which currently supports 1.3 billion daily authentications across 600 million user accounts. You can build sophisticated access control policies around users or security groups, supported by features such as multifactor authentication.
Authorization with central security policy administration and auditing
Azure HDInsight is the first managed cloud Hadoop service to include Apache Ranger, which provides a central policy and
management portal where administrators can author and maintain fine-grained access control policies over Hadoop data access,
components and services. In addition, you can now analyze detailed audit records in the familiar Apache Ranger user interface.
Encryption for data protection
Data processed by Azure HDInsight is stored in Azure Data Lake Store or Azure Storage, both of which offer server-side encryption as an option to secure data at rest. The encryption works transparently with HDInsight, with no extra configuration needed. For Azure Data Lake Store, enterprises can rely on service-managed encryption keys or manage their own keys in Azure Key Vault. Azure Key Vault protects your keys using hardware security modules and gives you the ability to revoke access to the keys at any time.
And more significant features
• Hive with LLAP (Live Long and Process), adding caching, vectorization, and other optimizations to Hive on Tez (potential sub-second queries, and much faster overall)
• Spark 2.0 with improved performance (vectorization), new Spark SQL syntax, and a Spark SQL–HBase connector
• Zeppelin Notebook integration in addition to Jupyter
• 3rd-party ISV support (previously Datameer, now adding vendors such as Cask and StreamSets)
Microsoft has been involved from the beginning in making Hive run faster, with contributions to Project Stinger and Tez that sped up Hive query performance 100x. We are now pleased to be the first cloud Hadoop solution to onboard LLAP (Live Long and Process) from the Stinger.next initiative, which promises sub-second querying on big data, 25x faster than existing Hive.
Microsoft R Open / Microsoft R Server
DevelopR / DeployR
The Microsoft R Server Platform

• ConnectR – High-speed & direct connectors. Available for:
  • High-performance XDF
  • SAS, SPSS, delimited & fixed-format text data files
  • Hadoop HDFS (text & XDF)
  • Teradata Database & Aster
  • EDWs and ADWs
  • ODBC
• ScaleR – Ready-to-use high-performance big data analytics:
  • Fully-parallelized analytics
  • Data prep & data distillation
  • Descriptive statistics & statistical tests
  • Range of predictive functions
  • User tools for distributing customized R algorithms across nodes
  • Wide data sets supported – thousands of variables
• DistributedR – Distributed computing framework:
  • Delivers cross-platform portability
• R+CRAN – Open source R interpreter:
  • R 3.1.2
  • Freely-available huge range of R algorithms
  • Algorithms callable by RevoR
  • Embeddable in R scripts
  • 100% compatible with existing R scripts, functions and packages
• RevoR – Performance-enhanced R interpreter:
  • Based on open source R
  • Adds high-performance math library to speed up linear algebra functions
ScaleR – Parallel + “Big Data”

• Stream data into RAM in blocks. “Big Data” can be any data size; we handle megabytes to gigabytes to terabytes.
• Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.
• Interim results are collected and combined analytically to produce the output on the entire data set.
• The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
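The block-streaming pattern described above can be sketched in a few lines of base R (an illustration of the idea only, not the actual ScaleR implementation):

```r
# Minimal sketch: process data in fixed-size blocks, keep per-block interim
# results, and combine them analytically into whole-data-set statistics.
set.seed(42)
x <- rnorm(100000, mean = 10, sd = 2)   # stand-in for a larger-than-RAM source

block_size <- 10000
n <- 0; s <- 0; ss <- 0                 # interim results: count, sum, sum of squares

for (start in seq(1, length(x), by = block_size)) {
  block <- x[start:min(start + block_size - 1, length(x))]  # "stream" one block
  n  <- n  + length(block)
  s  <- s  + sum(block)
  ss <- ss + sum(block^2)
}

# Combine the interim results analytically
mean_all <- s / n
var_all  <- (ss - n * mean_all^2) / (n - 1)
cat("mean:", mean_all, " variance:", var_all, "\n")
```

Because each block contributes only its count, sum, and sum of squares, the combination step is independent of where the block was processed, which is what lets the same statistics be computed across cores or cluster nodes.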
Two lines of code to do the same operation!
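The "two lines" claim refers to ScaleR fitting a parallel, out-of-memory-capable model from just a formula and a data source. The RevoScaleR call is shown as a comment (it requires Microsoft R Server); the base-R equivalent below runs anywhere, but only on data that fits in memory:

```r
# ScaleR version (assumes an airOnTimeData source as defined later in this deck):
#   model <- rxLogit(ARR_DEL15 ~ ORIGIN + DAY_OF_WEEK + DEP_TIME + DEST, data = airOnTimeData)
#   summary(model)

# Base-R equivalent on a built-in demo dataset (in-memory only):
model <- glm(am ~ hp + wt, data = mtcars, family = binomial())
summary(model)
```

The point of the comparison is that the formula interface is unchanged; only the fitting engine and the data source differ.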
ScaleR – Parallelized Algorithms & Functions

Data Preparation
• Data import – delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Models
• Multiple Linear Regression
• Generalized Linear Models (GLM) with exponential-family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie; standard link functions: cauchit, identity, log, logit, probit; user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models

Cluster Analysis
• K-Means

Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Variable Selection
• Stepwise Regression

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Combination
• rxDataStep
• rxExec

New
• PEMA-R API Custom Algorithms
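rxExec, listed above under Combination, runs an arbitrary user function in parallel across the nodes or cores of the current compute context. A minimal local analogue using the base `parallel` package (this illustrates the pattern, not the ScaleR implementation; the rxExec equivalent is shown as a comment):

```r
# Distribute a custom R function across workers and collect the results.
library(parallel)

square <- function(i) i^2

cl <- makeCluster(2)                   # two local workers stand in for cluster nodes
res <- parLapply(cl, 1:8, square)      # ScaleR analogue: rxExec(square, rxElemArg(1:8))
stopCluster(cl)

unlist(res)   # 1 4 9 16 25 36 49 64
```

In ScaleR the same call pattern scales from local cores to a Hadoop or Spark cluster by switching the compute context, without changing the function being distributed.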
Demo
Thank you
- Jürgen Ambrosi <[email protected]>
- Francesco Umiliaco <[email protected]>
- Riccardo Trubiani <[email protected]>
- Lorenzo Casucci <[email protected]>
See you at the next session!
Why Hadoop in the Cloud?
LocalDir <- "C:\\tmp\\AirOnTimeSmall"
inputDir <- file.path(LocalDir)
airlineColInfo <- list(DAY_OF_WEEK = list(type = "factor"),
                       ORIGIN = list(type = "factor"),
                       DEST = list(type = "factor"),
                       DEP_TIME = list(type = "integer"),
                       ARR_DEL15 = list(type = "logical"))
# get all the column names
varNames <- names(airlineColInfo)
# Define the text data source in the local file system
airOnTimeDataLocal <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames)
# formula to use
formula <- "ARR_DEL15 ~ ORIGIN + DAY_OF_WEEK + DEP_TIME + DEST"
# Set a local compute context
rxSetComputeContext("local")
# Run a logistic regression and time it
system.time(modelLocal <- rxLogit(formula, data = airOnTimeDataLocal))
# Display a summary
summary(modelLocal)
# Copy a local file to HDFS
rxHadoopMakeDir("/share")
rxHadoopCopyFromLocal(system.file("SampleData/AirlineDemoSmall.csv", package = "RevoScaleR"), "/share")
myNameNode <- "default"
myPort <- 0
# Location of the data
bigDataDirRoot <- "/home/testws4"
inputDir <- "AirOnTimeCSV2012"
airlineColInfo <- list(DAY_OF_WEEK = list(type = "factor"),
                       ORIGIN = list(type = "factor"),
                       DEST = list(type = "factor"),
                       DEP_TIME = list(type = "integer"),
                       ARR_DEL15 = list(type = "logical"))
# get all the column names
varNames <- names(airlineColInfo)
# HDFS file system object (implicit in the original slide, added here so the code runs)
hdfsFS <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)
# Define the text data source in HDFS
airOnTimeData <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames, fileSystem = hdfsFS)
# Define the text data source in the local file system
airOnTimeDataLocal <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames)
# formula to use
formula <- "ARR_DEL15 ~ ORIGIN + DAY_OF_WEEK + DEP_TIME + DEST"
# Use the local compute context (switch to an RxSpark context to run on the cluster)
rxSetComputeContext("local")
# Time the regression against the HDFS data source...
system.time(
  modelSpark <- rxLogit(formula, data = airOnTimeData)
)
# ...and against the local data source
system.time(
  modelSpark <- rxLogit(formula, data = airOnTimeDataLocal)
)
Scale Cluster
Why Microsoft Azure?
Azure Storage
HDInsight Built for Windows or Linux
HDInsight Supports Hive
Hadoop 2.0
HDInsight Supports HBase
[Diagram: Hadoop 2.0 cluster – a Name Node and Job Tracker coordinating Data Nodes with Task Trackers; for HBase, an HMaster coordinating Region Servers]
HDInsight Supports Mahout
HDInsight Supports Storm
• Stream processing
• Search and query
• Data analytics (Excel)
• Web/thick client dashboards
• Devices to take action
• RabbitMQ / ActiveMQ
Spark for Azure HDInsight
In Memory Processing on Multiple Workloads

• Single execution model for multiple tasks
• Up to 100x faster processing performance
• Developer friendly (Java, Python, Scala)
• BI tool of choice (Power BI, Tableau, Qlik, SAP)
• Notebook experience (Jupyter/iPython, Zeppelin)
R Server for HDInsight
• Familiarity of R (most popular language for data scientists)
• Scalability of Hadoop and Spark
• Up to 7x faster using Spark engine
• Train and run ML models on datasets of any size
• Cloud managed solution (easy setup, elastic, SLA)
HDInsight Allows You To Add Hadoop Projects
Microsoft Makes Hadoop Easier
Deep Visual Studio Integration
• Debug Hive jobs through Yarn logs or troubleshoot Storm topologies
• Visualize Hadoop clusters, tables, and storage
• Submit Hive queries, Storm topologies (C# or Java spouts/bolts)
• IntelliSense