Session 6 - Application areas in advanced statistical technologies research (R Language)
Cortana Intelligence Suite – from data to action

[Architecture diagram, bottom to top:]
• Data Sources: Apps, Sensors and devices, Data
• Information Management: Event Hubs, Data Catalog, Data Factory
• Big Data Stores: SQL Data Warehouse, Data Lake Store
• Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
• Intelligence: Cortana, Bot Framework, Cognitive Services
• Dashboards & Visualizations: Power BI
• Apps: Web, Mobile, Bots
• Action: People, Automated Systems
Microsoft + Hortonworks
Azure Data Lake Analytics and HDInsight

[Diagram: both services run on YARN over a shared Store layer (HDFS-compatible); analytics via U-SQL on Data Lake Analytics and via Hive and R Server on HDInsight]
• Store and analyze data of any kind and size
• Develop faster, debug and optimize smarter
• Interactively explore patterns in your data
• No learning curve
• Managed and supported
• Dynamically scales to match your business priorities
• Enterprise-grade security
• Built on YARN, designed for the cloud
The highest levels of security in a managed Cloud Hadoop solution
Authentication and identity management in a few clicks
Azure HDInsight is the first big data service to seamlessly integrate Azure Active Directory and Azure Active Directory Domain Services for enterprise-grade authentication and identity management. Securing a Hadoop cluster takes only a few clicks, and you can leverage your existing on-premises Active Directory deployment, which currently supports 1.3 billion daily authentications across 600 million user accounts. You can build sophisticated access control policies around users or security groups, supported by features such as multifactor authentication.
Authorization with central security policy administration and auditing
Azure HDInsight is the first managed cloud Hadoop service to include Apache Ranger, which provides a central policy and
management portal where administrators can author and maintain fine-grained access control policies over Hadoop data access,
components and services. In addition, you can now analyze detailed audit records in the familiar Apache Ranger user interface.
Encryption for data protection
Data processed by Azure HDInsight is stored in Azure Data Lake Store or Azure Storage, both of which offer server-side encryption as an option to secure data at rest. The encryption works transparently with HDInsight, with no extra configuration needed. For Azure Data Lake Store, enterprises can rely on service-managed encryption keys or manage their own keys in Azure Key Vault. Azure Key Vault protects your keys using hardware security modules and gives you the ability to revoke access to the keys at any time.
And more significant features
• Hive with LLAP (Live Long and Process), adding caching, vectorization, and other optimizations to Hive on Tez (potential sub-second queries, and much faster overall)
• Spark 2.0 with improved performance (vectorization), new Spark SQL syntax, and a Spark SQL–HBase connector
• Zeppelin Notebook integration in addition to Jupyter
• 3rd-party ISV support (previously Datameer, now adding vendors such as Cask and StreamSets)
Microsoft has been involved from the beginning in making Hive run faster, with contributions to Project Stinger and Tez that sped up Hive query performance 100x. We are now pleased to be the first cloud Hadoop solution to onboard LLAP (Live Long and Process) from the Stinger.next initiative, which promises sub-second querying on big data, 25x faster than existing Hive.
Microsoft R Open / Microsoft R Server
DevelopR / DeployR
The Microsoft R Server Platform

• ConnectR – High-speed & direct connectors. Available for:
  • High-performance XDF
  • SAS, SPSS, delimited & fixed-format text data files
  • Hadoop HDFS (text & XDF)
  • Teradata Database & Aster
  • EDWs and ADWs
  • ODBC
• ScaleR – Ready-to-use high-performance big data analytics:
  • Fully-parallelized analytics
  • Data prep & data distillation
  • Descriptive statistics & statistical tests
  • Range of predictive functions
  • User tools for distributing customized R algorithms across nodes
  • Wide data sets supported – thousands of variables
• DistributedR – Distributed computing framework:
  • Delivers cross-platform portability
• R+CRAN – Open source R interpreter:
  • R 3.1.2
  • Freely-available huge range of R algorithms
  • Algorithms callable by RevoR
  • Embeddable in R scripts
  • 100% compatible with existing R scripts, functions and packages
• RevoR – Performance-enhanced R interpreter:
  • Based on open source R
  • Adds high-performance math library to speed up linear algebra functions
ScaleR – Parallel + “Big Data”

• Stream data into RAM in blocks. “Big Data” can be any data size; we handle megabytes to gigabytes to terabytes.
• Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.
• Interim results are collected and combined analytically to produce the output on the entire data set.
• The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
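The block-streaming pattern described above can be sketched in a few lines of base R (an illustration of the idea only, not the actual ScaleR implementation):

```r
# Minimal sketch: process data in fixed-size blocks, keep per-block interim
# results, and combine them analytically into whole-data-set statistics.
set.seed(42)
x <- rnorm(100000, mean = 10, sd = 2)   # stand-in for a larger-than-RAM source

block_size <- 10000
n <- 0; s <- 0; ss <- 0                 # interim results: count, sum, sum of squares

for (start in seq(1, length(x), by = block_size)) {
  block <- x[start:min(start + block_size - 1, length(x))]  # "stream" one block
  n  <- n  + length(block)
  s  <- s  + sum(block)
  ss <- ss + sum(block^2)
}

# Combine the interim results analytically
mean_all <- s / n
var_all  <- (ss - n * mean_all^2) / (n - 1)
cat("mean:", mean_all, " variance:", var_all, "\n")
```

Because each block contributes only its count, sum, and sum of squares, the combination step is independent of where the block was processed, which is what lets the same statistics be computed across cores or cluster nodes.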
Two lines of code to do the same operation!
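The "two lines" claim refers to ScaleR fitting a parallel, out-of-memory-capable model from just a formula and a data source. The RevoScaleR call is shown as a comment (it requires Microsoft R Server); the base-R equivalent below runs anywhere, but only on data that fits in memory:

```r
# ScaleR version (assumes an airOnTimeData source as defined later in this deck):
#   model <- rxLogit(ARR_DEL15 ~ ORIGIN + DAY_OF_WEEK + DEP_TIME + DEST, data = airOnTimeData)
#   summary(model)

# Base-R equivalent on a built-in demo dataset (in-memory only):
model <- glm(am ~ hp + wt, data = mtcars, family = binomial())
summary(model)
```

The point of the comparison is that the formula interface is unchanged; only the fitting engine and the data source differ.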
ScaleR – Parallelized Algorithms & Functions

Data Preparation
• Data import – delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Models
• Multiple Linear Regression
• Generalized Linear Models (GLM) with exponential-family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie; standard link functions: cauchit, identity, log, logit, probit; user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models

Cluster Analysis
• K-Means

Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Variable Selection
• Stepwise Regression

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Combination
• rxDataStep
• rxExec

New
• PEMA-R API Custom Algorithms
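rxExec, listed above under Combination, runs an arbitrary user function in parallel across the nodes or cores of the current compute context. A minimal local analogue using the base `parallel` package (this illustrates the pattern, not the ScaleR implementation; the rxExec equivalent is shown as a comment):

```r
# Distribute a custom R function across workers and collect the results.
library(parallel)

square <- function(i) i^2

cl <- makeCluster(2)                   # two local workers stand in for cluster nodes
res <- parLapply(cl, 1:8, square)      # ScaleR analogue: rxExec(square, rxElemArg(1:8))
stopCluster(cl)

unlist(res)   # 1 4 9 16 25 36 49 64
```

In ScaleR the same call pattern scales from local cores to a Hadoop or Spark cluster by switching the compute context, without changing the function being distributed.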
Demo
Thank you
- Jürgen Ambrosi <[email protected]>
- Francesco Umiliaco <[email protected]>
- Riccardo Trubiani <[email protected]>
- Lorenzo Casucci <[email protected]>
See you at the next session!
Why Hadoop in the Cloud?
LocalDir <- "C:\\tmp\\AirOnTimeSmall"
inputDir <- file.path(LocalDir)
airlineColInfo <- list(DAY_OF_WEEK = list(type = "factor"),
                       ORIGIN = list(type = "factor"),
                       DEST = list(type = "factor"),
                       DEP_TIME = list(type = "integer"),
                       ARR_DEL15 = list(type = "logical"))
# get all the column names
varNames <- names(airlineColInfo)
# Define the text data source in the local file system
airOnTimeDataLocal <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames)
# formula to use
formula <- "ARR_DEL15 ~ ORIGIN + DAY_OF_WEEK + DEP_TIME + DEST"
# Set a local compute context
rxSetComputeContext("local")
# Run a logistic regression and time it
system.time(modelLocal <- rxLogit(formula, data = airOnTimeDataLocal))
# Display a summary
summary(modelLocal)
# Copy a local file to HDFS
rxHadoopMakeDir("/share")
rxHadoopCopyFromLocal(system.file("SampleData/AirlineDemoSmall.csv", package = "RevoScaleR"), "/share")
myNameNode <- "default"
myPort <- 0
# Location of the data
bigDataDirRoot <- "/home/testws4"
inputDir <- "AirOnTimeCSV2012"
airlineColInfo <- list(DAY_OF_WEEK = list(type = "factor"),
                       ORIGIN = list(type = "factor"),
                       DEST = list(type = "factor"),
                       DEP_TIME = list(type = "integer"),
                       ARR_DEL15 = list(type = "logical"))
# get all the column names
varNames <- names(airlineColInfo)
# HDFS file system object (implicit in the original slide, added here so the code runs)
hdfsFS <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)
# Define the text data source in HDFS
airOnTimeData <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames, fileSystem = hdfsFS)
# Define the text data source in the local file system
airOnTimeDataLocal <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames)
# formula to use
formula <- "ARR_DEL15 ~ ORIGIN + DAY_OF_WEEK + DEP_TIME + DEST"
# Use the local compute context (switch to an RxSpark context to run on the cluster)
rxSetComputeContext("local")
# Time the regression against the HDFS data source...
system.time(
  modelSpark <- rxLogit(formula, data = airOnTimeData)
)
# ...and against the local data source
system.time(
  modelSpark <- rxLogit(formula, data = airOnTimeDataLocal)
)
Scale Cluster
Why Microsoft Azure?
Azure Storage
HDInsight Built for Windows or Linux
HDInsight Supports Hive
Hadoop 2.0
HDInsight Supports HBase
[Diagram: Hadoop 2.0 cluster – a Name Node and Job Tracker coordinating Data Nodes with Task Trackers; for HBase, an HMaster coordinating Region Servers]
HDInsight Supports Mahout
HDInsight Supports Storm
• Stream processing
• Search and query
• Data analytics (Excel)
• Web/thick client dashboards
• Devices to take action
• RabbitMQ / ActiveMQ
Spark for Azure HDInsight
In Memory Processing on Multiple Workloads

• Single execution model for multiple tasks
• Up to 100x faster processing performance
• Developer friendly (Java, Python, Scala)
• BI tool of choice (Power BI, Tableau, Qlik, SAP)
• Notebook experience (Jupyter/iPython, Zeppelin)
R Server for HDInsight
• Familiarity of R (most popular language for data scientists)
• Scalability of Hadoop and Spark
• Up to 7x faster using Spark engine
• Train and run ML models on datasets of any size
• Cloud managed solution (easy setup, elastic, SLA)
HDInsight Allows You To Add Hadoop Projects
Microsoft Makes Hadoop Easier
Deep Visual Studio Integration
• Debug Hive jobs through Yarn logs or troubleshoot Storm topologies
• Visualize Hadoop clusters, tables, and storage
• Submit Hive queries, Storm topologies (C# or Java spouts/bolts)
• IntelliSense