Copyright © 2012, SAS Institute Inc. All rights reserved.

1 – DEEPER DIVE: VARIABLE SELECTION ROUTINES IN SAS ENTERPRISE MINER

DR IAIN BROWN, ANALYTICS & INNOVATION PRACTICE, SAS UK

23 MARCH, 2017

DATA EXPLORATION AND VISUALISATION

AGENDA

• SAS – 23rd March 2017 at 10:00am

• Deeper Dive: Variable Selection Routines in SAS Enterprise Miner

• The session looks at:

- Role of SAS Enterprise Miner

- Reasons for Variable Selection / Reduction

- Supervised vs Unsupervised

- Variable Selection vs Dimension Reduction

- Overview of Variable Selection / Reduction Nodes

- Variable Clustering and Principal Components Analysis Comparison

- Demonstration

ROLE OF SAS ENTERPRISE MINER

THE ANALYTICS LIFECYCLE

PREDICTIVE ANALYTICS AND DATA MINING

[Figure: the analytics lifecycle stages: IDENTIFY / FORMULATE PROBLEM, DATA PREPARATION, DATA EXPLORATION, TRANSFORM & SELECT, BUILD MODEL, VALIDATE MODEL, DEPLOY MODEL, EVALUATE / MONITOR RESULTS]

SAS® ENTERPRISE MINER™

• Modern, collaborative, easy-to-use data mining workbench

• Sophisticated set of data preparation and exploration tools

• Modern suite of modeling techniques and methods

• Interactive model comparison, testing and validation

• Automated scoring process delivers faster results

• Open, extensible design for ultimate flexibility

SAS® ENTERPRISE MINER™

MODEL DEVELOPMENT PROCESS

[Figure: the process stages Sample, Explore, Modify, Model, Assess, annotated with Feature Selection / Unsupervised Learning, Feature Creation, and Supervised Learning]

SAS® ENTERPRISE MINER™

MODEL DEVELOPMENT PROCESS

[Figure: further node groups: Utility, Apps., Time Series, HPDM, Credit Scoring]

SAS® ENTERPRISE MINER™

SEMMA IN ACTION – REPEATABLE PROCESS

REASONS FOR VARIABLE SELECTION/REDUCTION

VARIABLE SELECTION

WHY?

• Huge amounts of Data …

• A blessing or a curse?

• Problems with having too many variables

- Correlation

- Overfitting

- Sparseness

VARIABLE SELECTION

SUPERVISED VS UNSUPERVISED

• Supervised = variable reduction methods which use the target (dependent) variable for selection

• Unsupervised = variable reduction methods which ignore the target (dependent) variable in the selection process
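The distinction can be illustrated with a minimal Python sketch. It is not how any Enterprise Miner node is implemented: plain Pearson correlation stands in for the real selection criteria, and the function names, thresholds, and toy data are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def supervised_select(inputs, target, min_abs_corr=0.3):
    """Supervised: keep inputs that are correlated strongly enough with the target."""
    return [name for name, col in inputs.items()
            if abs(pearson(col, target)) >= min_abs_corr]

def unsupervised_select(inputs, max_abs_corr=0.9):
    """Unsupervised: ignore the target; drop an input that nearly duplicates a kept one."""
    kept = []
    for name, col in inputs.items():
        if all(abs(pearson(col, inputs[k])) < max_abs_corr for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: x2 is almost a copy of x1, x3 is unrelated to the target.
inputs = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 8.1],
    "x3": [4, 1, 3, 2, 2, 4, 3, 1],
}
target = [0, 0, 0, 1, 0, 1, 1, 1]

print(supervised_select(inputs, target))   # uses the target
print(unsupervised_select(inputs))         # never looks at the target
```

On this toy data the supervised rule keeps both x1 and x2 (redundant but predictive), while the unsupervised rule keeps x1 and x3 (non-redundant, but x3 carries no signal), which is why the two families are often combined.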

VARIABLE SELECTION

OUTPUT VARIABLES

• Some variable reduction methods use the original variables as inputs into subsequent models = Variable Selection

• Some variable reduction methods use combinations of the original variables as inputs into subsequent models = Dimension Reduction

OVERVIEW OF VARIABLE SELECTION / REDUCTION NODES

OVERVIEW CONCEPTS

• Variable selection vs variable combination

• Variable Selection

- Regression: Forward, Backward, Stepwise selection

- Tree based

- Variable Selection Node

- Interactive Grouping Node (IGN)

- LARS (LASSO): the LASSO method adds and deletes parameters based on a version of ordinary least squares in which the sum of the absolute regression coefficients is constrained.

• Variable Clustering:

- grouping correlated subsets of original variables;

- selecting variables with minimal resulting collinearity: a representative “best” variable from each cluster

• Principal Components:

- uncorrelated linear combinations of all input variables
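To make the LASSO bullet concrete, here is a minimal Python sketch. It solves the equivalent penalised form (least squares plus `lam` times the sum of absolute coefficients) by cyclic coordinate descent, which is a swapped-in technique chosen for brevity and not the LARS algorithm that the node is named after. The function names, the penalty values, and the simulated data are assumptions.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z towards zero by lam; anything inside [-lam, lam] becomes 0."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 (X standardized, y centred)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with variable j's current contribution added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return beta

# Simulated data (hypothetical): only the first of three inputs drives y.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=n)
y = y - y.mean()

for lam in (0.0, 0.5, 5.0):
    print(lam, np.round(lasso_coordinate_descent(X, y, lam), 3))
```

As the penalty grows, coefficients are shrunk and then set exactly to zero, which is what lets the method act as a variable selector: a moderate penalty keeps only the first input, and a large enough one removes every input.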

CONCEPTS

[Figure: four ways of reducing ten inputs, VAR01 to VAR10, with a TARGET]

• Input variables and TARGET: VAR01, VAR02, VAR03, VAR04, VAR05, VAR06, VAR07, VAR08, VAR09, VAR10, TARGET

• Variable Selection / IGN based on relationship with TARGET: VAR01, VAR02, VAR04, VAR07, VAR09, TARGET

• Cluster Scores based on Variable Clustering: VAR01 to VAR10 are combined into CLUS1, CLUS2, CLUS3

• Best Variables based on Variable Clustering: VAR02, VAR05, VAR09

• Principal Components: PC01 to PC10, each built from all of VAR01 to VAR10

VARIABLE SELECTION

VARIABLE SELECTION ROUTINES

• Variable Selection Node

- Relationship of independent variables to dependent target

- R-Square or Chi-square selection criteria

• Interactive Grouping Node

- Computes Weights of Evidence

- GINI and Information Values for variable selection

• Variable Clustering Node

- Identifies correlations and covariances between input variables

- Selects the best variable from each cluster, or the cluster component

• Principal Components Node

- Calculates eigenvalues and eigenvectors from the uncorrected covariance matrix, corrected covariance matrix, or the correlation matrix of input variables
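The Weights of Evidence and Information Value mentioned for the Interactive Grouping Node can be sketched in a few lines of Python. This assumes one common convention, WoE = ln(share of non-events / share of events) per group, with the Information Value summing (share of non-events minus share of events) times WoE; sign conventions differ between tools, and the function name and counts below are hypothetical.

```python
import math

def woe_iv(groups):
    """groups maps a label to (n_non_events, n_events).

    Returns ({label: weight_of_evidence}, information_value).
    Every group needs at least one event and one non-event,
    otherwise the logarithm is undefined.
    """
    total_non = sum(non for non, _ in groups.values())
    total_evt = sum(evt for _, evt in groups.values())
    woe, iv = {}, 0.0
    for label, (non, evt) in groups.items():
        p_non = non / total_non      # share of all non-events in this group
        p_evt = evt / total_evt      # share of all events in this group
        woe[label] = math.log(p_non / p_evt)
        iv += (p_non - p_evt) * woe[label]
    return woe, iv

# Hypothetical grouped counts for one input: (non-events, events) per group.
groups = {"low": (100, 50), "mid": (300, 30), "high": (600, 20)}
woe, iv = woe_iv(groups)
print({k: round(v, 3) for k, v in woe.items()}, round(iv, 3))
```

A group whose share of events equals its share of non-events gets a WoE of zero and adds nothing to the Information Value, so inputs whose groups all look like "mid" would rank low for selection.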

VARIABLE CLUSTERING AND PRINCIPAL COMPONENTS ANALYSIS COMPARISON

VARIABLE CLUSTERING

VARIABLE SELECTION

VARIABLE CLUSTERING: MAIN FEATURES

• The Variable Clustering node divides the input variables into hierarchical clusters.

• The main idea is to select one variable (or the cluster component) from each cluster as a cluster representative.

• The representative variables (or components) are used as input variables in successor nodes.

• The other input variables are rejected.

VARIABLE SELECTION

VARIABLE CLUSTERING: MAIN FEATURES

[Figure, built up over several slides: inputs X1 to X10; inputs are selected by cluster representation, expert opinion, or target correlation. An intermediate build shows X1, X4, X6, X8, X9, X10 selected; the final build shows X1, X3, X4, X6, X8, X9, X10.]

VARIABLE SELECTION

VARIABLE CLUSTERING: WHAT IS A CLUSTER COMPONENT?

• Each cluster can be described as a linear combination of the variables in the cluster.

• This is the first principal component of the cluster.

• In this context, it is called the cluster component.
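A minimal NumPy sketch of a cluster component follows: the first principal component of the standardized variables in one cluster. Picking the "best" variable as the one with the highest squared correlation with that component is a simplifying assumption made here for illustration; the function name and simulated data are hypothetical.

```python
import numpy as np

def cluster_component(X):
    """First principal component of the standardized columns of X.

    Returns (scores, weights, eigenvalue), where scores is the cluster
    component evaluated for every observation.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    weights = eigvecs[:, -1]                  # eigenvector of the largest one
    return Z @ weights, weights, eigvals[-1]

# Hypothetical cluster: three noisy measurements of the same underlying signal.
rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
X = np.column_stack([signal + 0.1 * rng.normal(size=n),
                     signal + 0.8 * rng.normal(size=n),
                     signal + 1.5 * rng.normal(size=n)])

scores, weights, top_eig = cluster_component(X)
r2 = [np.corrcoef(X[:, j], scores)[0, 1] ** 2 for j in range(X.shape[1])]
best = int(np.argmax(r2))
print("squared correlations with the component:", np.round(r2, 3))
print("representative variable index:", best)
```

Using the component keeps information from every variable in the cluster, while using the representative variable keeps an input that can still be interpreted directly.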

VARIABLE SELECTION

VARIABLE CLUSTERING: ALGORITHM

• The algorithm is divisive; at the start, all variables are in one single cluster.

• The following steps are repeated until convergence:

1. A cluster is chosen for splitting.

2. The chosen cluster is split into two clusters.

3. The variables are iteratively (re)assigned to the clusters.
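The three steps can be sketched in NumPy as below. This is a simplified illustration, not the node's actual procedure: the rule for choosing a cluster (largest second eigenvalue), the split rule (sign of the second eigenvector), and the stopping rule (second eigenvalue at most 1) are assumptions made for the sketch, and all names are hypothetical.

```python
import numpy as np

def component(Z, idx):
    """First principal component scores of columns idx of standardized Z."""
    if len(idx) == 1:
        return Z[:, idx[0]]
    sub = Z[:, idx]
    _, vecs = np.linalg.eigh(np.corrcoef(sub, rowvar=False))
    return sub @ vecs[:, -1]

def second_eig(corr, idx):
    """Second-largest eigenvalue of a cluster's correlation matrix."""
    if len(idx) < 2:
        return 0.0
    return float(np.linalg.eigvalsh(corr[np.ix_(idx, idx)])[-2])

def divisive_varclus(X, max_second_eig=1.0, max_passes=20):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = np.corrcoef(Z, rowvar=False)
    clusters = [list(range(X.shape[1]))]      # start: all variables together
    while True:
        # 1. Choose the cluster with the largest second eigenvalue.
        eig2 = [second_eig(corr, c) for c in clusters]
        k = int(np.argmax(eig2))
        if eig2[k] <= max_second_eig:
            break                             # no cluster is worth splitting
        # 2. Split it in two by the sign of its second eigenvector.
        idx = clusters[k]
        _, vecs = np.linalg.eigh(corr[np.ix_(idx, idx)])
        a = [v for v, w in zip(idx, vecs[:, -2]) if w >= 0]
        b = [v for v, w in zip(idx, vecs[:, -2]) if w < 0]
        if not a or not b:
            break
        clusters[k:k + 1] = [a, b]
        # 3. Reassign each variable to the component it correlates with most.
        for _ in range(max_passes):
            comps = [component(Z, c) for c in clusters]
            moved = False
            for v in range(X.shape[1]):
                r2 = [np.corrcoef(Z[:, v], s)[0, 1] ** 2 for s in comps]
                src = next(i for i, c in enumerate(clusters) if v in c)
                dst = int(np.argmax(r2))
                if dst != src and len(clusters[src]) > 1:
                    clusters[src].remove(v)
                    clusters[dst].append(v)
                    moved = True
            if not moved:
                break
    return [sorted(c) for c in clusters]

# Hypothetical data: variables 0-2 follow one factor, 3-5 a second, related one.
rng = np.random.default_rng(1)
n = 1000
f1 = rng.normal(size=n)
f2 = 0.3 * f1 + np.sqrt(1 - 0.3 ** 2) * rng.normal(size=n)
X = np.column_stack([f + 0.5 * rng.normal(size=n) for f in (f1, f1, f1, f2, f2, f2)])
print(divisive_varclus(X))
```

On this simulated data the single starting cluster is split once, and the two resulting clusters line up with the two underlying factors.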

VARIABLE SELECTION

VARIABLE CLUSTERING: LARGE DATASETS

• Computationally efficient if the data set has fewer than 100 variables and fewer than 100,000 observations.

• If you have more than 100 variables:

- Use two-stage variable clustering.

• If you have more than 100,000 observations:

- Sample the data.

VARIABLE SELECTION

VARIABLE CLUSTERING: METHODS FOR REDUCING PROCESSING TIME

• If the data set has more than 30 variables:

- If the number of clusters is known, specify the number of clusters.

- Set the Keep Hierarchies property to Yes.

- Set the Two Stage Clustering property to Yes.

VARIABLE SELECTION

VARIABLE CLUSTERING: TWO-STAGE

• This four-step approach is used to speed up variable clustering with more than 100 input variables.

• Global clusters are formed and variable clustering is performed on each global cluster.

VARIABLE SELECTION

VARIABLE CLUSTERING: PROS

• Reduction of collinearity

• Redundancy reduction with low information loss

• Identification of underlying data structure

• Interpretation of original input variables can be kept in successor nodes.

VARIABLE SELECTION

VARIABLE CLUSTERING: CONS

• One-stage clustering is not computationally efficient if there are more than 100 input variables.

• Node cannot be used on data with more than 100,000 observations.

• Method is not so well-known. You need to explain it.

• Levels of categorical variables can be located in different clusters.

PRINCIPAL COMPONENTS ANALYSIS

VARIABLE SELECTION

PRINCIPAL COMPONENTS ANALYSIS

• Principal components are constructed as mathematical transformations of the input variables.

• The first principal component is constructed in such a way that it captures as much of the variation in the set of input variables (the X-space) as possible.

• The second principal component is orthogonal to the first principal component.

• The second principal component captures as much as possible of the variation in the input data not captured by the first principal component.

• And so on ...
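The construction can be sketched with a NumPy eigendecomposition. This is a minimal illustration under the assumption that the components are taken from the correlation matrix (one of the three matrices the Principal Components node can use) or, optionally, the covariance matrix; the function name and simulated data are hypothetical.

```python
import numpy as np

def pca(X, use_correlation=True):
    """Return (eigenvalues, eigenvectors, scores), largest eigenvalue first."""
    Xc = X - X.mean(axis=0)
    if use_correlation:
        Xc = Xc / X.std(axis=0, ddof=1)       # standardize, so cov == correlation
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending order for symmetric input
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                     # column k holds principal component k
    return eigvals, eigvecs, scores

# Hypothetical correlated inputs.
rng = np.random.default_rng(0)
z = rng.normal(size=(300, 3))
X = z @ np.array([[1.0, 0.8, 0.2],
                  [0.0, 0.6, 0.5],
                  [0.0, 0.0, 0.3]])

eigvals, eigvecs, scores = pca(X)
print("variance captured by each component:", np.round(eigvals, 3))
print("share of total variance:", np.round(eigvals / eigvals.sum(), 3))
print("covariance of the scores:")
print(np.round(np.cov(scores, rowvar=False), 3))
```

The covariance matrix of the scores is diagonal, so the components are uncorrelated, and the eigenvalues on the diagonal decrease, so each component captures no more variance than the one before it.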

PCA: INPUT AND OUTPUT VARIABLES

• Input variables: $x_1, x_2, x_3$

• Principal component 1: $pc_1 = a_1 x_1 + b_1 x_2 + c_1 x_3$

• Principal component 2: $pc_2 = a_2 x_1 + b_2 x_2 + c_2 x_3$

• Principal component 3: $pc_3 = a_3 x_1 + b_3 x_2 + c_3 x_3$

VARIABLE SELECTION

PCA: SELECTION OF THE NUMBER OF PCs

• The number of principal components used as input variables for the successor modelling nodes can be selected using one of the following:

- Proportion of variance explained

- Scree plot

- Eigenvalue > 1
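Two of these rules are simple enough to sketch in plain Python. The eigenvalues below are hypothetical, chosen to sum to 6 as they would for a correlation-matrix PCA of six inputs, and the 85% cut-off is an arbitrary assumption; a scree plot would chart the same eigenvalues and look for the point where the drop between successive values levels off.

```python
def n_by_variance(eigvals, threshold=0.85):
    """Smallest number of components whose cumulative share reaches threshold."""
    total = sum(eigvals)
    running = 0.0
    for k, ev in enumerate(sorted(eigvals, reverse=True), start=1):
        running += ev
        if running / total >= threshold:
            return k
    return len(eigvals)

def n_by_eigenvalue(eigvals, cutoff=1.0):
    """Number of components whose eigenvalue exceeds the cutoff."""
    return sum(1 for ev in eigvals if ev > cutoff)

# Hypothetical eigenvalues from a correlation-matrix PCA of six inputs.
eigvals = [2.4, 1.5, 0.9, 0.6, 0.4, 0.2]

print("proportion of variance rule:", n_by_variance(eigvals))
print("eigenvalue > 1 rule:", n_by_eigenvalue(eigvals))
# Successive drops, the quantity a scree plot makes visible.
print("drops:", [round(a - b, 2) for a, b in zip(eigvals, eigvals[1:])])
```

Here the two rules disagree (four components versus two), which illustrates why the choice of the number of components is listed among the drawbacks of the method.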

VARIABLE SELECTION

PCA: PROS

• Constructed output variables are uncorrelated by construction.

• The selection order of the principal components is automatically determined.

• The principal components are constructed in such a way that the first principal component represents more of the variation in the data cloud than the second one, and so on.

• Often, only a very small number of principal components need to be kept to explain much of the variation in the data cloud.

VARIABLE SELECTION

PCA: CONS

• It is difficult or impossible to interpret the constructed principal components.

• It is difficult to know how many principal components should be selected as new input variables.

• All original input variables are still required, because the principal components are built from them.

• Misinterpretation of the coefficients of the linear combinations is common.

SAS® ENTERPRISE MINER™

DEMONSTRATION

• Variable selection and reduction techniques

VARIABLE SELECTION

SUMMARY

• Comprehensive variable selection and dimension reduction toolset

• A number of approaches to data and dimension reduction

• Importance of enhancing data prior to model development

• Leads to:

- Better model stability

- Longer model life-span

- Reduced complexity

www.SAS.com

QUESTIONS AND ANSWERS

[email protected]