
Copyright © 2012, SAS Institute Inc. All rights reserved.

1 – DEEPER DIVE: VARIABLE SELECTION ROUTINES IN SAS ENTERPRISE MINER

DR IAIN BROWN, ANALYTICS & INNOVATION PRACTICE, SAS UK

23 MARCH, 2017


DATA EXPLORATION AND VISUALISATION – AGENDA

• SAS – 23rd March 2017 at 10:00am

• Deeper Dive: Variable Selection Routines in SAS Enterprise Miner

• The session looks at:

- Role of SAS Enterprise Miner

- Reasons for Variable Selection / Reduction

- Supervised vs Unsupervised

- Variable Selection vs Dimension Reduction

- Overview of Variable Selection / Reduction Nodes

- Variable Clustering and Principal Components Analysis Comparison

- Demonstration


ROLE OF SAS ENTERPRISE MINER


THE ANALYTICS LIFECYCLE – PREDICTIVE ANALYTICS AND DATA MINING

• IDENTIFY / FORMULATE PROBLEM

• DATA PREPARATION

• DATA EXPLORATION

• TRANSFORM & SELECT

• BUILD MODEL

• VALIDATE MODEL

• DEPLOY MODEL

• EVALUATE / MONITOR RESULTS


SAS® ENTERPRISE MINER™

• Modern, collaborative, easy-to-use data mining workbench

• Sophisticated set of data preparation and exploration tools

• Modern suite of modeling techniques and methods

• Interactive model comparison, testing and validation

• Automated scoring process delivers faster results

• Open, extensible design for ultimate flexibility


SAS® ENTERPRISE MINER™ – MODEL DEVELOPMENT PROCESS

Sample → Explore → Modify → Model → Assess

• Feature Selection / Unsupervised Learning

• Feature Creation

• Supervised Learning


SAS® ENTERPRISE MINER™ – MODEL DEVELOPMENT PROCESS

• Utility

• Apps.

• Time Series

• HPDM

• Credit Scoring


SAS® ENTERPRISE MINER™ – SEMMA IN ACTION – REPEATABLE PROCESS


REASONS FOR VARIABLE SELECTION / REDUCTION


VARIABLE SELECTION – WHY?

• Huge amounts of Data …

• A blessing or a curse?

• Problems with having too many variables

• Correlation

• Overfitting

• Sparseness


VARIABLE SELECTION – SUPERVISED VS UNSUPERVISED

• Supervised = variable reduction methods which use the target (dependent) variable for selection

• Unsupervised = variable reduction methods which ignore the target (dependent) variable in the selection process
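
The distinction above can be illustrated with a minimal sketch. It is not how any Enterprise Miner node is implemented: the function names, the correlation threshold of 0.9, and the toy data are all assumptions made for illustration. The supervised routine ranks inputs by their absolute correlation with the target, while the unsupervised routine never looks at the target and only drops inputs that are strongly correlated with one already kept.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def supervised_select(inputs, target, k):
    """Supervised: keep the k inputs with the largest |correlation| with the target."""
    ranked = sorted(
        inputs, key=lambda name: abs(pearson(inputs[name], target)), reverse=True
    )
    return ranked[:k]


def unsupervised_select(inputs, max_abs_corr=0.9):
    """Unsupervised: keep an input only if it is not strongly correlated
    with an input that was already kept. The target is never used."""
    kept = []
    for name in inputs:
        if all(
            abs(pearson(inputs[name], inputs[other])) <= max_abs_corr
            for other in kept
        ):
            kept.append(name)
    return kept


# Hypothetical toy data: x2 is almost a multiple of x1, x3 is unrelated.
inputs = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]

supervised = supervised_select(inputs, target, k=2)
unsupervised = unsupervised_select(inputs)
```

On this toy data the supervised routine keeps the two inputs that track the target (x1 and x2, which are redundant with each other), whereas the unsupervised routine keeps x1 and x3 because it only removes redundancy among the inputs.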


VARIABLE SELECTION – OUTPUT VARIABLES

• Some variable reduction methods use the original variables as inputs into subsequent models = Variable Selection

• Some variable reduction methods use combinations of the original variables as inputs into subsequent models = Dimension Reduction


OVERVIEW OF VARIABLE SELECTION / REDUCTION NODES


OVERVIEW – CONCEPTS

• Variable selection vs variable combination

• Variable Selection

- Regression: Forward, Backward, Stepwise selection

- Tree based

- Variable Selection Node

- Interactive Grouping Node (IGN)

- LARS (LASSO): the LASSO method adds and deletes parameters based on a version of ordinary least squares in which the sum of the absolute regression coefficients is constrained.

• Variable Clustering:

- grouping correlated subsets of original variables;

- selecting variables with minimal resulting collinearity; representative “best” variable from each cluster

• Principal Components:

- uncorrelated linear combinations of all input variables
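
The LASSO constraint mentioned in the list above can be written out explicitly. The notation is an assumption made for illustration ($y_i$ is the target, $x_{ij}$ the inputs, $\beta_j$ the regression coefficients, and $t \ge 0$ a tuning bound); the slides themselves give no formula.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
\quad \text{subject to} \quad
  \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

As $t$ shrinks, more coefficients are driven exactly to zero, which is why the method acts as a variable selection routine rather than only as a shrinkage method.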


CONCEPTS

• Input variables and TARGET: VAR01 to VAR10, TARGET

• Variable Selection / IGN based on relationship with TARGET: VAR01, VAR02, VAR04, VAR07, VAR09, TARGET

• Cluster Scores based on Variable Clustering: VAR01 to VAR10 → CLUS1, CLUS2, CLUS3

• Best Variables based on Variable Clustering: VAR02, VAR05, VAR09

• Principal Components: VAR01 to VAR10 → PC01 … PC10 (each component is built from all ten input variables)


VARIABLE SELECTION – VARIABLE SELECTION ROUTINES

• Variable Selection Node

• Relationship of independent variables to dependent target

• R-Square or Chi-square selection criteria

• Interactive Grouping Node

• Computes Weights of Evidence

• GINI and Information Values for variable selection

• Variable Clustering Node

• Identify correlations and covariances between input variables

• Select Best variable from cluster or Cluster Component

• Principal Components Node

• Calculates eigenvalues and eigenvectors from the uncorrected covariance matrix, corrected covariance matrix, or the correlation matrix of input variables
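
The Weights of Evidence and Information Value mentioned for the Interactive Grouping Node can be sketched as follows. This is a textbook-style illustration, not the node's implementation: the function name `woe_iv`, the good/bad labelling, the sign convention (log of the good share over the bad share), and the example counts are assumptions, and the sketch requires every group to contain at least one good and one bad case.

```python
import math


def woe_iv(bins):
    """Compute Weights of Evidence per group and the Information Value.

    bins: list of (n_good, n_bad) counts, one tuple per group of a variable.
    Assumes every count is greater than zero.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes = []
    iv = 0.0
    for good, bad in bins:
        share_good = good / total_good  # share of all goods in this group
        share_bad = bad / total_bad     # share of all bads in this group
        woe = math.log(share_good / share_bad)
        woes.append(woe)
        iv += (share_good - share_bad) * woe
    return woes, iv


# Hypothetical counts for a variable grouped into two bins.
woes, iv = woe_iv([(80, 20), (20, 80)])
flat_woes, flat_iv = woe_iv([(50, 50), (50, 50)])
```

A variable whose groups all have the same good/bad mix gets an Information Value of zero, so ranking inputs by Information Value is one way to use the target for selection.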


VARIABLE CLUSTERING AND PRINCIPAL COMPONENTS ANALYSIS COMPARISON


VARIABLE CLUSTERING


VARIABLE SELECTION – VARIABLE CLUSTERING: MAIN FEATURES

• The Variable Clustering node divides the input variables into hierarchical clusters.

• The main idea is to select one variable (or the cluster component) from each cluster as a cluster representative.

• The representative variables (or components) are used as input variables in successor nodes.

• The other input variables are rejected.


[Figure sequence: variable clustering of the inputs X1 to X10. Inputs are selected by cluster representation, expert opinion, and target correlation. The selections shown are X1, X4, X6, X8, X9, X10 and then X1, X3, X4, X6, X8, X9, X10.]


VARIABLE SELECTION – VARIABLE CLUSTERING: WHAT IS A CLUSTER COMPONENT?

• Each cluster can be described as a linear combination of the variables in the cluster.

• This is the first principal component of the cluster.

• In this context, it is called the cluster component.
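
In symbols, and with notation that is assumed here rather than taken from the slides ($C_k$ is the set of variables in cluster $k$, $x_j$ a standardized input, $w_{kj}$ its weight), the cluster component $c_k$ can be written as:

```latex
c_k = \sum_{j \in C_k} w_{kj}\, x_j ,
\qquad
w_k = \arg\max_{\lVert w \rVert = 1}\;
  \operatorname{Var}\Bigl( \sum_{j \in C_k} w_j\, x_j \Bigr)
```

The weights are those of the first principal component computed from the variables of that cluster only, so variables outside $C_k$ contribute nothing to $c_k$.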


VARIABLE SELECTION – VARIABLE CLUSTERING: ALGORITHM

• The algorithm is divisive; at the start, all variables are in one single cluster.

• The following steps are repeated until convergence:

1. A cluster is chosen for splitting.

2. The chosen cluster is split into two clusters.

3. The variables are iteratively (re)assigned to the clusters.
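
Step 3 above, the iterative reassignment, can be sketched in a simplified form. The sketch makes several assumptions that are not in the slides: it starts from a split that is already given, it approximates each cluster component by the centroid (the average of the standardized variables) instead of the first principal component, and it assigns each variable to the cluster whose component it has the highest squared correlation with. All names and data are hypothetical.

```python
import math


def standardize(x):
    """Return x rescaled to mean 0 and (population) standard deviation 1."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]


def correlation(x, y):
    """Pearson correlation computed from standardized values."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


def centroid(columns):
    """Centroid component: element-wise mean of the standardized columns."""
    z = [standardize(c) for c in columns]
    return [sum(vals) / len(vals) for vals in zip(*z)]


def reassign(data, clusters, max_iter=20):
    """Repeatedly move each variable to the cluster whose centroid component
    it correlates with most strongly, until nothing changes."""
    for _ in range(max_iter):
        components = [centroid([data[name] for name in c]) for c in clusters]
        new = [[] for _ in clusters]
        for name in data:
            r2 = [correlation(data[name], comp) ** 2 for comp in components]
            new[r2.index(max(r2))].append(name)
        # Stop on convergence, or if a cluster would be emptied
        # (in that case the previous assignment is kept).
        if new == clusters or any(not c for c in new):
            break
        clusters = new
    return clusters


# Hypothetical data: x1 and x2 move together, as do x3 and x4.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.1, 2.0, 2.9, 4.2, 5.0],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "x4": [4.8, 1.2, 4.1, 2.2, 2.9],
}
result = reassign(data, [["x1", "x2", "x3"], ["x4"]])
```

Starting from a deliberately poor split, x3 moves to join x4 after one pass and the assignment then stays fixed. Choosing which cluster to split and how to split it (steps 1 and 2) is not covered by this sketch.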


VARIABLE SELECTION – VARIABLE CLUSTERING: LARGE DATASETS

• Computationally efficient if the data set has fewer than 100 variables and fewer than 100,000 observations.

• If you have more than 100 variables:

• Use two-stage variable clustering.

• If you have more than 100,000 observations:

• Sample the data.


VARIABLE SELECTION – VARIABLE CLUSTERING: METHODS FOR REDUCING PROCESSING TIME

• If the data set has more than 30 variables:

• If the number of clusters is known, specify the number of clusters.

• Set the Keep Hierarchies property to Yes.

• Set the Two Stage Clustering property to Yes.


VARIABLE SELECTION – VARIABLE CLUSTERING: TWO-STAGE

• This four-step approach is used to speed up variable clustering with more than 100 input variables.

• Global clusters are formed and variable clustering is performed on each global cluster.


VARIABLE SELECTION – VARIABLE CLUSTERING: PROS

• Reduction of collinearity

• Redundancy reduction with low information loss

• Identification of underlying data structure

• Interpretation of original input variables can be kept in successor nodes.


VARIABLE SELECTION – VARIABLE CLUSTERING: CONS

• One-stage clustering is not computationally efficient with more than 100 input variables.

• Node cannot be used on data with more than 100,000 observations.

• The method is not well known, so it needs to be explained.

• Levels of categorical variables can be located in different clusters.


PRINCIPAL COMPONENTS ANALYSIS


VARIABLE SELECTION – PRINCIPAL COMPONENTS ANALYSIS

• Principal components are constructed as mathematical transformations of the input variables.

• The first principal component is constructed in such a way that it captures as much of the variation in the set of input variables (the X-space) as possible.

• The second principal component is orthogonal to the first principal component.

• The second principal component captures as much as possible of the variation in the input data not captured by the first principal component.

• And so on ...
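
The construction described above can be sketched with power iteration and deflation on a small covariance matrix. This is a generic numerical illustration, not the method used by the Principal Components node; the function names, the arbitrary start vector, the iteration count, and the 2 × 2 example matrix are assumptions.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def mat_vec(a, v):
    return [dot(row, v) for row in a]


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def leading_component(a, iterations=500):
    """Power iteration: direction of largest variance of a symmetric,
    positive semi-definite matrix, with the variance it captures."""
    # Arbitrary start vector; assumed not orthogonal to the leading direction.
    v = normalize([float(i + 1) for i in range(len(a))])
    for _ in range(iterations):
        v = normalize(mat_vec(a, v))
    return dot(v, mat_vec(a, v)), v


def principal_components(cov, k):
    """Return k (variance, direction) pairs in decreasing order of variance."""
    a = [row[:] for row in cov]  # work on a copy
    components = []
    for _ in range(k):
        value, vector = leading_component(a)
        components.append((value, vector))
        # Deflation: remove the variation already captured, so the next
        # component is found in what is left and is orthogonal to this one.
        for i in range(len(a)):
            for j in range(len(a)):
                a[i][j] -= value * vector[i] * vector[j]
    return components


# Hypothetical covariance matrix of two positively correlated inputs.
cov = [[2.0, 1.0], [1.0, 2.0]]
(var1, dir1), (var2, dir2) = principal_components(cov, 2)
share_first = var1 / (var1 + var2)
```

For this matrix the first component captures three quarters of the total variation and the second, orthogonal component captures the remaining quarter.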


PCA: INPUT AND OUTPUT VARIABLES

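• Input variables: $x_1, x_2, x_3$

• Principal component 1: $pc_1 = a_1 x_1 + b_1 x_2 + c_1 x_3$

• Principal component 2: $pc_2 = a_2 x_1 + b_2 x_2 + c_2 x_3$

• Principal component 3: $pc_3 = a_3 x_1 + b_3 x_2 + c_3 x_3$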


VARIABLE SELECTION – PCA: SELECTION OF THE NUMBER OF PCs

• The number of principal components used as input variables for the successor modelling nodes can be selected using one of the following:

• Proportion of variance explained

• Scree plot

• Eigenvalue > 1
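
Two of the criteria above are easy to express in code; the scree plot is a visual judgement and is left out. The sketch assumes the eigenvalues come from a correlation matrix, so that they sum to the number of input variables. The function names, the 80% threshold, and the eigenvalues themselves are hypothetical.

```python
def n_by_variance(eigenvalues, threshold=0.8):
    """Smallest number of components whose cumulative share of the
    total variance reaches the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


def n_by_eigenvalue(eigenvalues, cutoff=1.0):
    """Number of components whose eigenvalue exceeds the cutoff."""
    return sum(1 for value in eigenvalues if value > cutoff)


# Hypothetical eigenvalues of a 6 x 6 correlation matrix (they sum to 6).
eigenvalues = [2.7, 1.5, 1.0, 0.4, 0.25, 0.15]

keep_by_variance = n_by_variance(eigenvalues)
keep_by_eigenvalue = n_by_eigenvalue(eigenvalues)
```

With these values the variance criterion keeps three components and the eigenvalue criterion keeps two, which shows that the rules need not agree.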


VARIABLE SELECTION – PCA: PROS

• Constructed output variables are uncorrelated by construction.

• The selection order of the principal components is automatically determined.

• The principal components are constructed in such a way that the first principal component represents more of the variation in the data cloud than the second one, and so on.

• Often, only a very small number of principal components need to be kept to explain much of the variation in the data cloud.


VARIABLE SELECTION – PCA: CONS

• It is difficult or impossible to interpret the constructed principal components.

• It is difficult to know how many principal components should be selected as new input variables.

• All original input variables are still used because they build the principal components.

• Misinterpretation of the coefficients of the linear combinations is common.


SAS® ENTERPRISE MINER™ – DEMONSTRATION

• Variable selection and reduction techniques


SUMMARY


VARIABLE SELECTION – SUMMARY

• Comprehensive variable selection and dimension reduction toolset

• Number of approaches to data and dimension reduction

• Importance of enhancing data prior to model development

• Leads to:

• Better model stability

• Longer model life-span

• Reduced complexity


QUESTIONS AND ANSWERS


http://www.sas.com/

