Discriminant Analysis, a Powerful Classification Technique in Predictive Modeling

George Fernandez University of Nevada – Reno

ABSTRACT

Discriminant analysis is one of the classical classification techniques used to discriminate a single categorical variable using multiple attributes. Discriminant analysis also assigns observations to one of the pre-defined groups based on the knowledge of the multi-attributes. When the distribution within each group is multivariate normal, a parametric method can be used to develop a discriminant function using a generalized squared distance measure. The classification criterion is derived based on either the individual within-group covariance matrices or the pooled covariance matrix, and it also takes into account the prior probabilities of the classes. Non-parametric discriminant methods are based on non-parametric group-specific probability densities. Either a kernel or the k-nearest-neighbor method can be used to generate a non-parametric density estimate in each group and to produce a classification criterion. The performance of a discriminant criterion can be evaluated by estimating the probabilities of mis-classification of new observations in the validation data. A user-friendly SAS application utilizing a SAS macro to perform discriminant analysis is presented here. Chemical diabetes data containing multiple attributes are used to demonstrate the features of discriminant analysis in discriminating the three clinical types of diabetes.

INTRODUCTION

Discriminant Analysis (DA), a multivariate statistical technique, is commonly used to build a predictive/descriptive model of group discrimination based on observed predictor variables and to classify each observation into one of the groups. Stepwise, canonical, and discriminant function analyses are the DA techniques commonly available in the SAS/STAT module (SAS Inst. Inc. 2004). In DA, multiple quantitative attributes are used to discriminate a single classification variable. DA differs from cluster analysis in that prior knowledge of class membership is required. The common objectives of DA are:

• to investigate differences between groups;
• to discriminate groups effectively;
• to identify important discriminating variables;
• to perform hypothesis testing on the differences between the expected groupings; and
• to classify new observations into pre-existing groups.

The main objective of this presentation is to demonstrate the features of the user-friendly SAS application DISCRIM (Fernandez, 2002) using the chemical diabetes data containing multiple attributes in discriminating the three clinical types of diabetes. Users can perform discriminant analysis on their own data by following the instructions included (Fernandez, 2002).

Diabetes data

The Diabet2 data (Reaven and Miller, 1979), containing the multi-attributes X1: relative weight, X2: fasting plasma glucose level, X3: test plasma glucose, X4: plasma insulin during test, and X5: steady-state plasma glucose level, together with Diabet1 (a multivariate normally distributed simulated dataset with the same mean vector and variance-covariance matrix as Diabet2), are used here to demonstrate the discriminant analysis features in classifying three types of diabetes: 1: Normal, 2: Overt diabetic, and 3: Chemical diabetic.

Parametric DA analysis using Diabet1 as the training data and Diabet2 as the validation data

DATA EXPLORATION

Examining the group discrimination based on simple scatter plots between any two discriminating


variables is the first step in data exploration. An example of simple two-dimensional scatter plots showing the discrimination of the three diabetes groups is presented in Figure 1.

These scatter plots are useful in examining the range of variation and the degree of linear association between any two attributes. The scatter plots presented in Figure 1 revealed strong correlations between X2: fasting plasma glucose level and X3: test plasma glucose, and between X3: test plasma glucose and X5: steady-state plasma glucose level. These attributes appeared to discriminate the three diabetic groups to a certain degree.
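This screening step can be sketched numerically as well as graphically. The fragment below is a Python/NumPy illustration (not the SAS macro itself) on simulated stand-ins for the attributes; the variable names and the generating model are hypothetical, chosen only to mimic a strongly related glucose pair alongside an unrelated attribute:

```python
import numpy as np

# Hypothetical stand-ins for the diabetes attributes: two strongly related
# glucose measures (like X2 and X3) plus an unrelated attribute (like X1).
rng = np.random.default_rng(1)
x2 = rng.normal(100, 15, 200)            # simulated fasting plasma glucose
x3 = 2.5 * x2 + rng.normal(0, 5, 200)    # simulated test plasma glucose, driven by x2
x1 = rng.normal(1.0, 0.1, 200)           # simulated relative weight, independent

data = np.column_stack([x1, x2, x3])
corr = np.corrcoef(data, rowvar=False)   # matrix of pairwise Pearson correlations
print(corr.round(2))
```

Attribute pairs with correlations near 1 (here the two glucose measures) are exactly the pairs whose scatter plots show a tight linear band.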

Figure 1 Group discrimination in simple scatter plots


Model Selection: Variable selection methods are especially useful when a large number of predictor variables must be screened. Information on the number of observations, the number of group levels, the discriminating variables, and the threshold significance level for eliminating non-significant predictor variables in the backward elimination method is presented in Table 1. Backward elimination starts with the full model, whose overall significance in discriminating the diabetes groups is highly significant based on the p-value for Wilks' lambda (<0.0001) and the p-value for the average squared canonical correlation (<0.0001). In step 1, the non-significant (p-value 0.28) steady-state plasma glucose (X5) is eliminated from the model, and the resulting 4-variable model is as good as the full model. The backward elimination method stops here, since no other variable can be dropped based on the 'p-value to stay' (0.15) criterion. In the backward elimination method, once a variable is removed from the discriminant model, it cannot be re-entered.

Table 1 Data exploration using SAS MACRO: DISCRIM - Backward elimination summary

Step  Removed  Label                        Partial R-Square  F Value  Pr > F  Wilks' Lambda  Pr < Lambda  ASCC   Pr > ASCC
0     -        -                            -                 -        -       0.080          <.0001       0.611  <.0001
1     X5       Steady state plasma glucose  0.0187            1.28     0.282   0.081          <.0001       0.605  <.0001

Note: Data: discrim1; Observations = 141; Variable(s) in the Analysis = 5; Class Levels = 3; Significance Level to Stay = 0.15. ASCC = Average Squared Canonical Correlation.
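The Wilks' lambda statistic driving these elimination decisions can be sketched directly. The following Python/NumPy fragment is a simplified stand-in for what PROC STEPDISC computes, on simulated grouped data (all names and data here are hypothetical): lambda = det(W)/det(T), where W is the pooled within-group and T the total SSCP matrix, with small values indicating strong separation.

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' lambda = det(W) / det(T): W is the pooled within-group SSCP
    matrix, T the total SSCP matrix; values near 0 mean strong separation."""
    grand = X.mean(axis=0)
    T = (X - grand).T @ (X - grand)
    W = np.zeros_like(T)
    for g in np.unique(y):
        c = X[y == g] - X[y == g].mean(axis=0)
        W += c.T @ c
    return np.linalg.det(W) / np.linalg.det(T)

# Simulated three-group data: column 0 separates the groups, column 1 is noise.
rng = np.random.default_rng(7)
y = np.repeat([0, 1, 2], 50)
X = np.column_stack([y * 3.0 + rng.normal(0, 1, 150),
                     rng.normal(0, 1, 150)])

full = wilks_lambda(X, y)               # lambda for the 2-variable model
drop_noise = wilks_lambda(X[:, [0]], y) # lambda after removing the noise column
print(full, drop_noise)
```

Dropping the uninformative column barely changes lambda, which is the backward-elimination rationale: a variable whose removal leaves lambda essentially unchanged is not contributing to the discrimination.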

The summary results of the stepwise selection method are presented in Table 2. The significance of the predictor variables in discriminating the three clinical diabetes groups is evaluated in a stepwise fashion. At each step, the significance of the already-entered predictor variables is evaluated based on the significance-for-staying (p-value: 0.15) criterion, and the significance of newly entering variables is evaluated based on the significance-for-entering (p-value: 0.05) criterion. The stepwise selection procedure stops when no variables can be removed or entered. The results of the stepwise selection method are in agreement with the backward elimination method, since both methods choose variables X1-X4 as the significant predictors.

Table 2 Data exploration using SAS MACRO: DISCRIM - Stepwise selection summary

Step  Entered  Removed  Label                       Partial R-Square  F Value  Pr > F  Wilks' Lambda  Pr < Lambda  ASCC
1     X3       -        Test plasma glucose         0.87              461.68   <.0001  0.130          <.0001       0.434
2     X2       -        Fasting plasma glucose      0.19              16.52    <.0001  0.104          <.0001       0.530
3     X1       -        Relative weight             0.16              13.82    <.0001  0.087          <.0001       0.584
4     X4       -        Plasma insulin during test  0.05              4.29     0.0156  0.081          <.0001       0.605

Note: Data: discrim1; Observations = 141; Variable(s) in the Analysis = 5; Class Levels = 3; Significance Level to Enter = 0.15; Significance Level to Stay = 0.15.

The summary results of the forward selection method (Data: discrim1) are presented in Table 3. The significance of the predictor variables in discriminating the three clinical diabetes groups is evaluated one at a time. At each step, the significance of entering variables is evaluated based on the significance for


entering (p-value: 0.15) criterion. The forward selection procedure stops when no variables can be entered by the entering p-value (0.15) criterion. In the forward selection method, once a variable is entered into the discriminant model, it cannot be removed. The results of the forward selection method are in agreement with both the backward elimination and the stepwise selection methods, since all three methods choose variables X1-X4 as the significant predictors.

Table 3 Data exploration using SAS MACRO: DISCRIM - Forward selection summary

Step  Entered  Label                       Partial R-Square  F Value  Pr > F  Wilks' Lambda  Pr < Lambda  ASCC   Pr > ASCC
1     X3       Test plasma glucose         0.8700            461.68   <.0001  0.130          <.0001       0.434  <.0001
2     X2       Fasting plasma glucose      0.1943            16.52    <.0001  0.104          <.0001       0.530  <.0001
3     X1       Relative weight             0.1689            13.82    <.0001  0.087          <.0001       0.584  <.0001
4     X4       Plasma insulin during test  0.0598            4.29     0.0156  0.081          <.0001       0.605  <.0001
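The forward-selection loop itself can be sketched as a greedy search on Wilks' lambda. This Python sketch uses simulated data, and a fixed improvement threshold stands in for the F-to-enter test that PROC STEPDISC actually applies, so it illustrates only the mechanics of the loop:

```python
import numpy as np

def wilks_lambda(X, y):
    # det(W)/det(T); smaller = stronger group separation
    grand = X.mean(axis=0)
    T = (X - grand).T @ (X - grand)
    W = np.zeros_like(T)
    for g in np.unique(y):
        c = X[y == g] - X[y == g].mean(axis=0)
        W += c.T @ c
    return np.linalg.det(W) / np.linalg.det(T)

def forward_select(X, y, min_gain=0.02):
    """Greedy forward selection: at each step enter the variable that most
    reduces Wilks' lambda; stop when the gain falls below min_gain.
    (A fixed gain threshold replaces the F test of the lambda ratio.)"""
    chosen, remaining, current = [], list(range(X.shape[1])), 1.0
    while remaining:
        lams = {j: wilks_lambda(X[:, chosen + [j]], y) for j in remaining}
        best = min(lams, key=lams.get)
        if current - lams[best] < min_gain:
            break
        chosen.append(best); remaining.remove(best); current = lams[best]
    return chosen

# Columns 1 and 2 carry group information; columns 0 and 3 are noise.
rng = np.random.default_rng(3)
y = np.repeat([0, 1, 2], 40)
signal = y[:, None] * [2.0, 1.0] + rng.normal(0, 1, (120, 2))
noise = rng.normal(0, 1, (120, 2))
X = np.hstack([noise[:, :1], signal, noise[:, 1:]])
print(forward_select(X, y))
```

The informative columns enter in order of their marginal lambda reduction, and the search stops before any noise column enters, mirroring the Table 3 behavior where X5 never entered.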

Checking for multivariate normality: The right choice between parametric and non-parametric discriminant analysis depends on the assumption of multivariate normality within each group. The diabetes data within each clinical group are assumed to have a multivariate normal distribution. This multivariate normality assumption can be checked by estimating multivariate skewness and kurtosis and testing their significance levels for each group level. The quantile-quantile (Q-Q) plot of expected and observed distributions (Khattree and Naik 1995) of multi-attribute residuals can be used to graphically examine multivariate normality for each group level. The estimated multivariate skewness and multivariate kurtosis (Figure 2) clearly support the hypothesis that these four multi-attributes have a joint multivariate normal distribution. A non-significant departure from the 45° reference line in the Q-Q plot (Figure 2) also supports this finding. Thus, parametric discriminant analysis can be considered the appropriate technique for discriminating the three clinical groups based on these four attributes (X1 to X4) for data discrim1.

Checking for the presence of multivariate outliers: Multivariate outliers can be detected in a plot of the differences (robust Mahalanobis distance - chi-squared quantile) vs. the chi-squared quantile values (Khattree and Naik 1995). No observations are identified as influential, since the differences between the robust Mahalanobis distances and the chi-squared quantile values are not larger than 2 and fall inside the critical region (Figure 3). This can be expected, since the training dataset used (Data: discrim1) is a multivariate normally distributed simulated dataset.
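The distance-versus-quantile idea can be sketched as follows. Note that this sketch uses classical (non-robust) mean and covariance estimates rather than the robust distances the macro plots, and it exploits the closed-form chi-square quantile available for 2 degrees of freedom; the data are simulated with one planted outlier:

```python
import math
import numpy as np

# Squared Mahalanobis distances are approximately chi-square(p) under
# multivariate normality; points far beyond the upper quantile are
# candidate multivariate outliers. For p = 2, the chi-square quantile
# has the closed form -2*ln(1 - q).
rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=100)
X[0] = [8.0, -8.0]                           # plant one gross outlier

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu)  # squared distances

cutoff = -2.0 * math.log(1 - 0.975)          # chi2(2) 97.5% quantile, about 7.38
outliers = np.where(d2 > cutoff)[0]
print(outliers)                              # index 0 is flagged
```

A robust version (as in the macro's plots) would replace `mu` and the covariance with high-breakdown estimates so that clustered outliers cannot mask one another.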


Figure 2 Checking for multivariate normality using Q-Q plots within each group

Figure 3 Checking for multivariate outliers and influential observations within each group


Canonical Discriminant Analysis (CDA):

Canonical DA is a dimension-reduction technique similar to principal component analysis. The main objective of CDA is to extract a set of linear combinations of the quantitative variables that best reveal the differences among the groups. Given a nominal group variable and several quantitative attributes, CDA extracts linear combinations of the quantitative variables (canonical variables) that capture between-class variation in much the same way that principal components summarize total variation (SAS Inst. Inc. 2004). Moreover, these canonical functions are independent or orthogonal; that is, their contributions to the discrimination between groups do not overlap. The univariate test statistics for discriminating the three levels of diabetes are presented in Table 4. The univariate ANOVA results indicate that highly significant group differences exist for all the predictor variables. The total, between-group, and within-group variability of the predictor variables is expressed in standard deviation terms. The R² statistic describes the amount of variability in each predictor variable accounted for by the group differences. The R²/(1-R²) column expresses the ratio between accounted and unaccounted variation in the univariate ANOVA model. By comparing the R² and R²/(1-R²) statistics for each significant predictor variable, we can conclude that test plasma glucose (X3) has the highest significant discriminative potential, while relative weight has the least discriminative power in differentiating the three clinical diabetes groups. The relatively large average R² weighted by variances (Table 4) indicates that the four predictor variables have high discriminatory power in classifying the three clinical diabetes groups.

Table 4 Canonical discriminant analysis using SAS MACRO: DISCRIM - Univariate test statistics

Variable  Label                       Total Std Dev  Pooled Std Dev  Between Std Dev  R-Square  R-Square/(1-RSq)  F Value  Pr > F
x1        Relative wt                 0.142          0.134           0.060            0.121     0.139             9.60     0.0001
x2        Fasting plasma glucose      50.313         23.286          54.535           0.788     3.735             257.78   <.0001
x3        Test plasma glucose         266.188        96.676          303.000          0.870     6.691             461.68   <.0001
x4        Plasma insulin during test  111.492        101.945         57.060           0.175     0.213             14.72    <.0001

Note: Data: discrim1; F statistics numerator df = 2, denominator df = 138; R² weighted by variance = 0.767.

In CDA, canonical variables that have the highest possible multiple correlations with the groups are extracted. The unstandardized coefficients used in computing the raw canonical variables are called the canonical coefficients or canonical weights. The standardized discriminant function coefficients indicate the partial contribution of each variable to the discriminant function(s), controlling for the other attributes entered in the equation. The total canonical structure loadings given in Table 5 indicate that the predictor variable test plasma glucose (X3) contributed heavily to the first canonical variable (CAN1). Fasting plasma glucose (X2) and test plasma glucose (X3) both contributed in a negative direction to the second canonical variable (CAN2).

These canonical variables are independent or orthogonal to each other; that is, their contributions to the discrimination between groups do not overlap. The maximal multiple correlation between the first canonical variable and the group variable is called the first canonical correlation. The second canonical correlation is obtained by finding the linear combination, uncorrelated with CAN1, that has the highest possible multiple correlation with the groups. In CDA, the process of


extracting canonical variables is repeated until the maximum number of canonical variables has been extracted, which equals the number of groups minus one or the number of variables in the analysis, whichever is smaller.
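The extraction step can be sketched as the eigen-decomposition of inv(E)H, with E the within-group and H the between-group SSCP matrices. This Python/NumPy fragment is an illustration on simulated data, not the macro's code; the group means are constructed to span a 2-dimensional subspace so that, as stated above, at most min(3-1, 4) = 2 canonical variables carry information:

```python
import numpy as np

def canonical_eigenvalues(X, y):
    """Eigenvalues of inv(E)H, where H is the between-group and E the
    within-group SSCP matrix; each eigenvalue lam gives a squared
    canonical correlation CanRsq = lam / (1 + lam)."""
    p = X.shape[1]
    grand = X.mean(axis=0)
    E, H = np.zeros((p, p)), np.zeros((p, p))
    for g in np.unique(y):
        Xg = X[y == g]
        c = Xg - Xg.mean(axis=0)
        E += c.T @ c                              # within-group SSCP
        d = (Xg.mean(axis=0) - grand)[:, None]
        H += len(Xg) * (d @ d.T)                  # between-group SSCP
    lam = np.linalg.eigvals(np.linalg.inv(E) @ H).real
    return np.sort(lam)[::-1]

rng = np.random.default_rng(2)
y = np.repeat([0, 1, 2], 50)
means = np.column_stack([y * 2.0, (y == 1) * 2.0,
                         np.zeros(150), np.zeros(150)])
X = means + rng.normal(0, 1, (150, 4))
lam = canonical_eigenvalues(X, y)
can_rsq = lam / (1 + lam)
print(can_rsq.round(3))   # only the first min(g-1, p) = 2 entries are nonzero
```

The third and fourth eigenvalues are zero (up to floating-point error) because H, built from three group means, can have rank at most two.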

The correlation between CAN1 and the clinical group is very high (>0.9), and about 87% of the variation in the first canonical variable can be attributed to the differences among the three clinical groups (Table 6). The first eigenvalue measures the variability in CAN1 and accounts for 93% of the variability among the three group members in the four predictor variables. The correlation between CAN2 and the clinical group is moderate (0.58), and about 33% of the variation in the second canonical variable can be attributed to the differences among the three clinical groups (Table 6). The second eigenvalue measures the variability in the second canonical variable and accounts for the remaining 7% of the variability among the three group members in the four predictor variables. Both canonical variables are statistically highly significant based on the Wilks' lambda test (Table 6). However, the statistical validity might be questionable if the multivariate normality or the equal variance-covariance assumptions are violated.
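The quantities in Table 6 are tied together by the identity CanRsq = lambda/(1 + lambda) and by the proportion column lambda_i / sum(lambda). A quick arithmetic check against the tabled figures (3-decimal rounding applies):

```python
# Consistency checks on the Table 6 figures.
lam1, lam2 = 7.130, 0.502           # eigenvalues of Inv(E)*H
can_rsq1 = lam1 / (1 + lam1)        # squared canonical correlation, CAN1
can_rsq2 = lam2 / (1 + lam2)        # squared canonical correlation, CAN2
prop1 = lam1 / (lam1 + lam2)        # proportion of between-group variation

print(round(can_rsq1, 3))           # 0.877, matching the table
print(round(can_rsq2, 3))           # 0.334
print(round(prop1, 3))              # 0.934
```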

Table 5 Canonical discriminant analysis using SAS MACRO: DISCRIM - Total canonical structure loadings

Variable  Label                       Can1    Can2
x1        Relative wt                 0.049   0.599
x2        Fasting plasma glucose      0.926   -0.327
x3        Test plasma glucose         0.994   -0.096
x4        Plasma insulin during test  -0.229  0.623

Table 6 Canonical discriminant analysis using SAS MACRO: DISCRIM - Canonical correlations

                                                        Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)
   Canonical    Adjusted     Approximate  Squared
   Correlation  Canonical    Standard     Canonical   Eigenvalue  Difference  Proportion  Cumulative
                Correlation  Error        Correlation
1  0.936        0.934        0.010        0.877       7.130       6.628       0.934       0.934
2  0.578        0.572        0.056        0.334       0.502       -           0.0658      1.000

Test of significance of the canonical correlations:

   Likelihood  Approximate  Numerator  Denominator
   Ratio       F Value      df         df           Pr > F
1  0.0818      84.21        8          270          <.0001
2  0.6655      22.78        3          136          <.0001


For each observation in the training dataset, we can compute standardized canonical variable scores. These standardized canonical variable scores and the structure loadings can be used in two-dimensional bi-plots to aid visual interpretation of the group differences. The inter-relationships among the four multi-attributes and the discrimination of the three groups are presented in Figure 4. The first canonical variable, which has the largest loadings on X2 and X3, discriminated the NORMAL (1), OVERT (2), and CHEMICAL diabetic groups effectively. CAN2, which has moderate-size loadings on X1 and X4, discriminated the NORMAL (1) and the OVERT groups; however, CAN2 is not effective in separating the CHEMICAL diabetic group. The narrow angle between the X2 and X3 variable vectors pointing in the same direction indicates that the plasma glucose variables are highly positively correlated. The correlations between X1 and X4 are moderate in size, and these vectors act in the opposite direction from the plasma glucose variables.

The two canonical variables extracted from the CDA effectively discriminated the three clinical diabetic groups. The difference between the normal and the chemical groups is distinct. The discrimination between the NORMAL and the OVERT groups is effective when both CAN1 and CAN2 are used simultaneously. Therefore, CDA can be considered an effective descriptive tool for discriminating groups based on continuous predictor variables. If the variance-covariance matrices of the groups are assumed to be equal and the predictor variables have joint multivariate normal distributions within each group, then the group differences can also be tested statistically.

Figure 4 Bi-plot display of multi-attributes and the group discrimination


Predictive Discriminant Analysis (PDA): PDA is a predictive classification technique that deals with a set of multi-attributes and one classification variable, the latter being a grouping variable with two or more levels. Predictive discriminant analysis is similar to multiple regression analysis except that PDA is used when the criterion variable is categorical and nominally scaled. As in multiple regression, in PDA a set of rules is formulated which consists of as many linear combinations of predictors as there are categories, or groups. PDA is commonly used for classifying observations into pre-defined groups based on knowledge of the quantitative attributes. When the distribution within each group is assumed to be multivariate normal, a parametric method can be used to develop a discriminant function using a measure of generalized squared distance. The discriminant function, also known as a classification criterion, is estimated by measuring the generalized squared distance (SAS Inst. Inc. 2004). The classification criterion can be derived based on either the individual within-group covariance matrices (a quadratic function) or the pooled covariance matrix (a linear function). This classification criterion also takes into account the prior probabilities of the discriminating groups. Each observation is classified into the group from which it has the smallest generalized squared distance. The posterior probability of an observation belonging to each class can also be estimated in PDA.
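A minimal sketch of this quadratic rule (Python/NumPy on simulated data; not the DISCRIM macro itself) computes, for each group k, the generalized squared distance d_k²(x) = (x - m_k)' S_k⁻¹ (x - m_k) + ln|S_k| - 2 ln(prior_k), assigns x to the nearest group, and takes posteriors proportional to exp(-0.5 d_k²):

```python
import numpy as np

def quad_discrim_fit(X, y, priors=None):
    """Per-group mean, covariance matrix, and prior for the quadratic rule."""
    groups = np.unique(y)
    if priors is None:
        priors = {g: np.mean(y == g) for g in groups}  # proportional priors
    return {g: (X[y == g].mean(axis=0),
                np.cov(X[y == g], rowvar=False),
                priors[g]) for g in groups}

def quad_discrim_predict(stats, x):
    # Generalized squared distance to each group.
    d2 = {}
    for g, (m, S, prior) in stats.items():
        diff = x - m
        d2[g] = (diff @ np.linalg.inv(S) @ diff
                 + np.log(np.linalg.det(S)) - 2 * np.log(prior))
    # Posterior probabilities are proportional to exp(-0.5 * d2).
    w = {g: np.exp(-0.5 * v) for g, v in d2.items()}
    total = sum(w.values())
    return min(d2, key=d2.get), {g: w[g] / total for g in w}

rng = np.random.default_rng(9)
y = np.repeat([1, 2, 3], 60)
means = {1: [0, 0], 2: [4, 0], 3: [0, 4]}
X = np.vstack([rng.multivariate_normal(means[g], np.eye(2), 60)
               for g in [1, 2, 3]])
stats = quad_discrim_fit(X, y)
label, post = quad_discrim_predict(stats, np.array([4.0, 0.0]))
print(label, round(post[label], 2))
```

Replacing each S_k with the pooled covariance matrix turns this quadratic criterion into the linear one described above.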

Table 7 Parametric discriminant function analysis using SAS MACRO: DISCRIM - Classification table and error count estimates by groups in cross-validation using quadratic discriminant functions

a) Training data

Number of observations (row percent) classified into group:

From group  1           2           3           Total
1           69 (95.83)  3 (4.17)    0 (0.00)    72 (100.00)
2           1 (2.78)    35 (97.22)  0 (0.00)    36 (100.00)
3           0 (0.00)    1 (3.03)    32 (96.97)  33 (100.00)
Total       70 (49.65)  39 (27.66)  32 (22.70)  141 (100.00)

Error count estimates for group:

                   1      2      3      Total
Error rate         0.041  0.027  0.030  0.035
Prior probability  0.510  0.255  0.234

b) Validation data

Number of observations (row percent) classified into group:

From group  1           2           3           Total
1           75 (98.68)  1 (1.32)    0 (0.00)    76 (100.00)
2           2 (5.56)    34 (94.44)  0 (0.00)    36 (100.00)
3           0 (0.00)    3 (9.09)    30 (90.91)  33 (100.00)
Total       77 (53.10)  38 (26.21)  30 (20.69)  145 (100.00)

Error count estimates for group:

                   1      2      3      Total
Error rate         0.013  0.055  0.090  0.042
Prior probability  0.510  0.255  0.234

Table 8 Parametric discriminant function analysis using SAS MACRO: DISCRIM - Classification table and posterior probability error rate estimates by groups in cross-validation using quadratic discriminant functions

a) Training data

Number of observations (average posterior probability) classified into group:

From group  1           2           3
1           69 (0.990)  3 (0.871)   0 (-)
2           1 (0.969)   35 (0.947)  0 (-)
3           0 (-)       1 (0.999)   32 (0.999)
Total       70 (0.989)  39 (0.942)  32 (0.999)

Posterior probability error rate estimates for group:

Estimate      1      2       3      Total
Stratified    0.037  -0.021  0.030  0.020
Unstratified  0.037  -0.021  0.030  0.020
Priors        0.510  0.255   0.234


The performance of a discriminant criterion in the classification of new observations in the validation data can be evaluated by estimating the probabilities of mis-classification, or error rates, in the SAS DISCRIM procedure. These error-rate estimates include error-count estimates and posterior probability error-rate estimates. When the input data set is a SAS data set, the error rate can also be estimated by cross validation. SAS uses two types of error-rate estimates to evaluate the derived classification criterion based on parameters estimated from the training sample: i) error-count estimates and ii) posterior probability error-rate estimates. The error-count estimate is calculated by applying the discriminant criterion derived from the training sample to a test set and then counting the number of mis-classified observations. The group-specific error-count estimate is the proportion of mis-classified observations in the group. If the test sample is independent of the training sample, the estimate is unbiased. However, it can have a large variance, especially if the test sample size is small (SAS Inst. Inc. 2004).

When no independent test sets are available, the same data set can be used both to calibrate and to evaluate the classification criterion. The resulting error-count estimate has an optimistic bias and is called an apparent error rate. To reduce the bias, the data can be split into two sets, one set for deriving the discriminant function and the other for estimating the error rate. Such a split-sample method has the unfortunate effect of reducing the effective sample size.

Another way to reduce the bias in estimating the classification error is cross validation (Lachenbruch and Mickey 1968). In cross validation, n-1 of the n training observations in the calibration sample are treated as a training set. The discriminant functions are determined from these n-1 observations and then applied to classify the one observation left out. This is done for each of the n training observations. The mis-classification rate for each group is the proportion of sample observations in that group that are mis-classified. This method achieves a nearly unbiased estimate, but with a relatively large variance. Classification results based on the parametric quadratic discriminant function, with error rates based on cross validation, are presented in Table 7.

Figure 5 Box-plot display of posterior probability estimates for three diabetes groups
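The leave-one-out scheme described above can be sketched as follows (Python/NumPy on simulated data; for brevity, a pooled-covariance linear rule with equal priors stands in for the quadratic criterion):

```python
import numpy as np

def loo_error_rates(X, y):
    """Leave-one-out cross validation: refit the rule on the n-1 remaining
    observations, classify the held-out one, and tally group-specific
    mis-classification rates."""
    groups = np.unique(y)
    wrong = {g: 0 for g in groups}
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        Xt, yt = X[keep], y[keep]
        # Pooled within-group covariance from the n-1 training rows.
        S = sum((Xt[yt == g] - Xt[yt == g].mean(axis=0)).T
                @ (Xt[yt == g] - Xt[yt == g].mean(axis=0)) for g in groups)
        S_inv = np.linalg.inv(S / (len(yt) - len(groups)))
        d2 = {g: (X[i] - Xt[yt == g].mean(axis=0))
                 @ S_inv @ (X[i] - Xt[yt == g].mean(axis=0)) for g in groups}
        if min(d2, key=d2.get) != y[i]:
            wrong[y[i]] += 1
    return {g: wrong[g] / np.sum(y == g) for g in groups}

rng = np.random.default_rng(4)
y = np.repeat([1, 2, 3], 30)
X = np.vstack([rng.multivariate_normal(m, np.eye(2), 30)
               for m in ([0, 0], [5, 0], [0, 5])])
print(loo_error_rates(X, y))
```

Because each held-out observation plays no part in fitting the rule that classifies it, the tallied rates avoid the optimistic bias of the apparent error rate, at the cost of the higher variance noted above.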

To reduce the variance in an error-count estimate, Glick (1978) suggested smoothed error-rate estimates. Instead of summing values that are either zero or one, as in error-count estimation, the smoothed estimator uses a continuum of values between zero and one in the terms that are summed. The resulting estimator has a smaller variance than the error-count estimate. The posterior probability error-rate estimates are smoothed error-rate estimates. The posterior probability estimate for each group is based on the posterior probabilities of the observations classified into that group. The posterior probability estimates provide good estimates of the error rate when the posterior probabilities are accurate. When a parametric classification criterion (linear or quadratic discriminant function) is derived from a non-normal population, the resulting posterior probability error-rate estimates may not be appropriate.
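The contrast between the two estimators can be shown on a handful of hypothetical posterior probabilities for observations assigned to one group (the numbers below are invented for illustration):

```python
import numpy as np

# Smoothed (posterior probability) error-rate sketch: instead of counting
# 0/1 mis-classifications, average the posterior probability of the group
# each observation was assigned to; the group error estimate is one minus
# that average.
assigned_posteriors = np.array([0.99, 0.95, 0.90, 0.97, 0.88])
error_count_style = np.mean(assigned_posteriors < 0.5)   # 0/1 counting
smoothed = 1 - assigned_posteriors.mean()                # continuum version
print(error_count_style, round(smoothed, 3))
```

The 0/1 count sees no errors at all, while the smoothed estimate registers the residual uncertainty in the assignments, which is what gives it its lower variance.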

The overall error rate is estimated as a weighted average of the individual group-specific error-rate estimates, where the prior probabilities are used as the weights. To reduce both the bias and the variance of the estimator, Hora and Wilcox (1982) compute the posterior probability estimates based on cross validation. The resulting estimates are intended to have both low variance, from using the posterior probability estimate, and low bias, from cross validation. They use Monte Carlo studies on two-group multivariate normal distributions to compare the cross-validation posterior probability estimates with three other estimators: the apparent error rate, the cross-validation estimator, and the posterior probability estimator. They conclude that the cross-validation posterior probability estimator has the lowest mean squared error in their simulations. Classification results based on the parametric quadratic discriminant function, with smoothed error rates based on cross validation, are presented in Table 8.

Non-parametric discriminant function analysis

When no distributional assumptions can be made within each group, or when the distribution is not assumed to be multivariate normal, non-parametric methods can be used to estimate the group-specific densities. Non-parametric discriminant methods are based on non-parametric estimates of group-specific probability densities. Either a kernel method or the k-nearest-neighbor method can be used to generate a non-parametric density estimate in each group and to produce a classification criterion. The kernel method in the SAS System uses uniform, normal, Epanechnikov, biweight, or triweight kernels in the density estimation (SAS Inst. Inc. 2004). Either Mahalanobis or Euclidean distance can be used to determine proximity in the SAS DISCRIM procedure (SAS Inst. Inc. 2004). When the k-nearest-neighbor method is used, the Mahalanobis distances are estimated based on the pooled covariance matrix, whereas in the kernel method the Mahalanobis distances are estimated based on either the individual within-group covariance matrices or the pooled covariance matrix.
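A k-nearest-neighbor classifier using Mahalanobis proximity based on the pooled covariance matrix can be sketched as follows (a Python/NumPy illustration of the idea behind the k-nearest-neighbor option, on simulated data; equal priors are assumed, and the group-membership estimate is simply the fraction of the k neighbors belonging to each group):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=5):
    """k-nearest-neighbor classification with Mahalanobis distance based
    on the pooled within-group covariance matrix (equal priors assumed)."""
    groups = np.unique(y_train)
    pooled = sum(
        (X_train[y_train == g] - X_train[y_train == g].mean(axis=0)).T
        @ (X_train[y_train == g] - X_train[y_train == g].mean(axis=0))
        for g in groups) / (len(y_train) - len(groups))
    S_inv = np.linalg.inv(pooled)
    diffs = X_train - x
    d2 = np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)  # squared distances
    votes = y_train[np.argsort(d2)[:k]]                 # k nearest labels
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                    # majority group

rng = np.random.default_rng(8)
y = np.repeat([1, 2, 3], 40)
X = np.vstack([rng.multivariate_normal(m, np.eye(2), 40)
               for m in ([0, 0], [6, 0], [0, 6])])
print(knn_classify(X, y, np.array([6.0, 0.5]), k=5))
```

A kernel-method variant would replace the neighbor count with a weighted sum of kernel values at the same distances, which is why the two methods share the same distance machinery.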

In non-parametric DA, given the estimated group-specific densities and their associated prior probabilities, the posterior probability estimates of group membership for each class can be evaluated. The classification of an observation vector x is based on the estimated group-specific densities from the calibration or training sample. From these estimated densities, the posterior probabilities of group membership at x are evaluated.
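The posterior evaluation is Bayes' rule applied to the estimated densities: the posterior for group g at x is proportional to the prior for g times the estimated density of group g at x. A minimal sketch, with hypothetical density values standing in for the kernel or k-NN estimates:

```python
import numpy as np

def posterior_probs(densities, priors):
    """Posterior group-membership probabilities at x:
    p(g | x) = prior_g * fhat_g(x) / sum_h prior_h * fhat_h(x)."""
    joint = np.asarray(densities) * np.asarray(priors)
    return joint / joint.sum()

# Hypothetical density estimates fhat_g(x) for three groups at one x:
post = posterior_probs([0.02, 0.10, 0.01], [0.524, 0.248, 0.227])
# x is assigned to the group with the largest posterior probability.
print(int(post.argmax()))  # -> 1 (the second group, 0-based index)
```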


Figure 6 Q-Q plot for detecting multivariate normality within each group

Figure 7 Checking for multivariate outliers and influential observations within each diabetes group


Non-parametric DA using diabet2 as the training data and diabet1 as the validation data

Checking for multivariate normality: The estimated multivariate skewness and multivariate kurtosis for each group (Figure 6) clearly support the hypothesis that these four attributes do not have a joint multivariate normal distribution. A significant departure from the 45° reference line in the Q-Q plot (Figure 6) also supports this finding. Thus, non-parametric discriminant analysis should be considered the appropriate technique for discriminating the three clinical groups based on these four attributes (X1 to X4).
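The multivariate skewness and kurtosis statistics referred to above are Mardia's measures; a minimal NumPy sketch of their computation (illustrative only, not the macro's IML code):

```python
import numpy as np

def mardia(X):
    """Mardia's multivariate skewness (b1p) and kurtosis (b2p).
    Under multivariate normality, b1p is near 0 and b2p is near p*(p+2)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))  # MLE covariance
    D = Xc @ S_inv @ Xc.T           # entries (xi - xbar)' S^-1 (xj - xbar)
    b1p = (D ** 3).sum() / n ** 2   # multivariate skewness
    b2p = (np.diag(D) ** 2).mean()  # multivariate kurtosis
    return b1p, b2p

# For truly multivariate-normal data with p = 4, b2p should be close to 24:
rng = np.random.default_rng(0)
b1p, b2p = mardia(rng.standard_normal((2000, 4)))
```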

Checking for the presence of multivariate outliers: Multivariate outliers can be detected in a plot of the difference (robust Mahalanobis distance - chi-squared quantile) versus the chi-squared quantile value. Three observations are identified as influential because the difference between the robust Mahalanobis distance and the chi-squared quantile is larger than 2 and falls outside the critical region (Figure 7).
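This check can be sketched as follows, using classical rather than robust Mahalanobis distances for brevity (illustrative Python, not the macro's IML implementation; the cutoff of 2 follows the text):

```python
import numpy as np
from scipy.stats import chi2

def flag_multivariate_outliers(X, cutoff=2.0):
    """Flag rows whose squared Mahalanobis distance exceeds the matching
    chi-squared quantile by more than `cutoff`. The DISCRIM macro uses
    robust distances; classical distances are used here for brevity."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', Xc, S_inv, Xc)     # squared distances
    order = np.argsort(d2)                            # sort for Q-Q pairing
    q = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)  # matching quantiles
    return order[(d2[order] - q) > cutoff]            # flagged row indices

# Hypothetical data with one gross outlier planted in row 0:
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
X[0] = [10.0, 10.0, 10.0]
flagged = flag_multivariate_outliers(X)
```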

Non-parametric DA: Compare the classification summaries and the mis-classification rates of the four non-parametric DFA methods and pick the one that gives the smallest classification error in cross-validation. Among the three NN-DFA variants, the classification results based on the k=2 nearest-neighbor non-parametric DFA gave the smallest classification error. The classification summary and the error rates for NN (k=2) based on cross-validation are presented in Table 9. When the k-nearest-neighbor method is used, the Mahalanobis distances are estimated based on the pooled covariance matrix. The mis-classification rates in groups 1, 2, and 3 are 1.3%, 0%, and 12.1%, respectively. The overall discrimination is quite satisfactory since the overall error rate is very low, at 3.4%. The posterior probability estimates based on cross-validation reduce both the bias and the variance of the classification function. The resulting overall error estimates are intended to have both the low variance of the posterior probability estimate and the low bias of cross-validation. Figure 8 illustrates the variation in the posterior probability estimates for the three diabetes groups.
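The cross-validation (leave-one-out) error estimation used above can be sketched generically; the toy nearest-mean classifier below is a hypothetical stand-in for the NN discriminant rule:

```python
import numpy as np

def nearest_mean(x, X, y):
    """Toy stand-in classifier: assign x to the group with the closest mean."""
    groups = np.unique(y)
    dists = [np.linalg.norm(x - X[y == g].mean(axis=0)) for g in groups]
    return groups[int(np.argmin(dists))]

def loo_cv_error(X, y, classify):
    """Leave-one-out cross-validation: classify each observation with a rule
    built from the remaining n-1 observations, then count the errors."""
    n = len(y)
    wrong = sum(classify(X[i], X[np.arange(n) != i], y[np.arange(n) != i]) != y[i]
                for i in range(n))
    return wrong / n

# Hypothetical, well-separated two-group data -> zero LOO-CV error:
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([1, 1, 1, 2, 2, 2])
print(loo_cv_error(X, y, nearest_mean))  # -> 0.0
```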

The DISCRIM macro also outputs a table of the ith group posterior probability estimates for all observations in the training dataset. These posterior probability values are very useful since they can be used in developing scorecards and in ranking the observations in the dataset. The posterior probability error-rate estimates for each group are based on the posterior probabilities of the observations classified into that same group. The posterior probability estimates provide good estimates of the error rate when the posterior probabilities are accurate.

If the classification error rate obtained for the validation data is small and similar to the classification error rate for the training data, then we can conclude that the derived classification function has good discriminative potential. Classification results for the validation dataset based on the NN (k=2) classification functions are presented in Table 9. The mis-classification rates in groups 1, 2, and 3 are 4.1%, 25%, and 15.1%, respectively. The overall discrimination in the validation dataset is only moderately good, since the weighted error rate is 11.8%. A total of 17 observations in the validation dataset are mis-classified. The mis-classification error rate estimated for the validation dataset is considerably higher than the error rate obtained from the training data, so we can conclude that the classification criterion derived using NN (k=2) performed poorly on the independent validation dataset. The presence of multivariate influential observations in the training dataset might be one of the contributing factors for this poor performance in validation. Using larger k values in NN-DFA might do a better job of classifying the validation dataset.


The classification summary using the KD (normal kernel, unequal bandwidth) non-parametric DFA and the error rates using cross-validation are presented in Table 10. The mis-classification rates in groups 1, 2, and 3 are 7.8%, 16.6%, and 9.0%, respectively. Thus the overall rate of correct discrimination is about 90%, since the overall error rate is about 10.3%; this is higher than the corresponding cross-validation error rate for the k=2 NN method. Figure 9 illustrates the variation in the posterior probability estimates for all three diabetes groups. The posterior probability error-rate estimates for each group are based on the posterior probabilities of the observations classified into that same group. The smoothed posterior probability error-rate estimates based on cross-validation are presented in Table 10. The overall error rates for the stratified and unstratified estimates are equal because the group proportions were used as the prior probability estimates. The overall discrimination is quite satisfactory since the overall error rate based on the smoothed posterior probability error rate is relatively low, at 4.7%.
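The normal-kernel classification rule with group-specific (unequal) bandwidths can be sketched as follows (an illustrative Python version of the idea, not the PROC DISCRIM implementation; the bandwidths play the role of PROC DISCRIM's R= option, and all data shown are hypothetical):

```python
import numpy as np

def normal_kernel_density(x, Xg, r):
    """Product-normal kernel density estimate for one group at point x,
    with bandwidth r."""
    n, p = Xg.shape
    u = (Xg - x) / r
    k = np.exp(-0.5 * (u ** 2).sum(axis=1)) / ((2 * np.pi) ** (p / 2) * r ** p)
    return k.mean()

def kernel_classify(x, X, y, bandwidths, priors):
    """Assign x to the group maximizing prior_g * fhat_g(x); each group may
    use its own bandwidth (the 'unequal bandwidth' option in the text)."""
    groups = np.unique(y)
    scores = [pr * normal_kernel_density(x, X[y == g], r)
              for g, r, pr in zip(groups, bandwidths, priors)]
    return groups[int(np.argmax(scores))]

# Hypothetical two-group data with a different bandwidth per group:
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([1, 1, 1, 2, 2, 2])
print(kernel_classify(np.array([0.2, 0.2]), X, y, [1.0, 2.0], [0.5, 0.5]))  # -> 1
```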

If the classification error rate obtained for the validation data is small and similar to the classification error rate for the training data, then we can conclude that the derived classification function has good discriminative potential. Classification results for the validation dataset based on the KD (normal kernel, unequal bandwidth) non-parametric DFA classification functions are presented in Table 10. The mis-classification rates in groups 1, 2, and 3 are 4.1%, 25%, and 15.1%, respectively. The overall discrimination in the validation dataset is only moderately good, since the weighted error rate is 11.8%. A total of 17 observations in the validation dataset are mis-classified. The mis-classification error rate estimated for the validation dataset is considerably higher than the error rate obtained from the training data, so we can conclude that the classification criterion derived using KD (normal kernel, unequal bandwidth) performed poorly on the independent validation dataset. The presence of multivariate influential observations in the training dataset might be one of the contributing factors for this poor performance in validation. Other density options might do a better job of classifying the validation dataset.

Figure 8 Box-plot display of posterior probability estimates for the three diabetes groups: nearest neighbor (k=2)


Table 9 Nearest neighbor (k=2) non-parametric discriminant function analysis using SAS macro DISCRIM: classification summary using cross-validation

a) Training data

Number of observations (and percent) classified into group

From group        1            2            3          Total
1            75 (98.68)    1 (1.32)     0 (0.00)    76 (100.00)
2             0 (0.00)    36 (100.00)   0 (0.00)    36 (100.00)
3             0 (0.00)     4 (12.12)   29 (87.88)   33 (100.00)
Total        75 (51.72)   41 (28.28)   29 (20.00)  145 (100.00)

Error count estimates for group
            1        2        3       Total
Rate      0.013    0.000    0.121    0.034
Priors    0.524    0.248    0.227

b) Validation data

Figure 9 Box-plot display of posterior probability estimates for the three diabetes groups: unequal bandwidth kernel density discriminant function


Number of observations (and percent) classified into group

From group        1            2            3          Total
1            69 (95.83)    3 (4.17)     0 (0.00)    72 (100.00)
2             9 (25.00)   27 (75.00)    0 (0.00)    36 (100.00)
3             2 (6.06)     3 (9.09)    28 (84.85)   33 (100.00)
Total        80 (56.74)   33 (23.40)   28 (19.86)  141 (100.00)

Error count estimates for group
                     1        2        3       Total
Error rate        0.0417   0.2500   0.1515   0.1184
Prior probability 0.5241   0.2483   0.2276

Table 10 Unequal bandwidth kernel density discriminant function analysis using SAS macro DISCRIM: classification summary using cross-validation

a) Training data

Number of observations (and percent) classified into group

From group        1            2            3          Total
1            70 (92.11)    5 (6.58)     1 (1.32)    76 (100.00)
2             2 (5.56)    30 (83.33)    4 (11.11)   36 (100.00)
3             0 (0.00)     3 (9.09)    30 (90.91)   33 (100.00)
Total        72 (49.66)   38 (26.21)   35 (24.14)  145 (100.00)

Error count estimates for group
                     1        2        3       Total
Error rate         0.078    0.166    0.090    0.103
Prior probability  0.524    0.248    0.227

b) Validation data


Number of observations (and percent) classified into group

From group        1            2            3          Total
1            69 (95.83)    3 (4.17)     0 (0.00)    72 (100.00)
2             9 (25.00)   27 (75.00)    0 (0.00)    36 (100.00)
3             2 (6.06)     3 (9.09)    28 (84.85)   33 (100.00)
Total        80 (56.74)   33 (23.40)   28 (19.86)  141 (100.00)

Error count estimates for group
                     1        2        3       Total
Error rate         0.041    0.250    0.151    0.118
Prior probability  0.524    0.248    0.227

User-friendly SAS macro application: DISCRIM

The DISCRIM macro is a powerful, user-friendly SAS application for performing complete discriminant analysis. A screen-shot of the DISCRIM application is presented in Figure 10. Options are available for obtaining various exploratory and diagnostic graphs and for performing different types of discriminant analyses. The SAS procedures STEPDISC and DISCRIM are the main tools used in the DISCRIM macro. In addition, the GPLOT and BOXPLOT procedures and IML modules are also utilized in the DISCRIM macro. The enhanced features implemented in the DISCRIM macro are:

1. Exploratory bivariate plots to check for group discrimination in simple scatter plots between pairs of predictor variables are generated.
2. Plots for checking multivariate normality and for influential observations within each group are also generated.
3. Test statistics and p-values for testing the equality of variance-covariance matrices across group levels are automatically produced.
4. In the case of CDA, box plots of the canonical discriminant functions by group and a biplot display of the canonical discriminant function scores of the observations and the structure loadings for the predictors are generated.
5. When fitting DFA, box plots of the ith-level posterior probability by group are produced.
6. Options are available for validating the discriminant model obtained from a training dataset using an independent validation dataset by comparing classification errors.
7. Options for saving the output tables and graphics in WORD, HTML, PDF, and TXT formats are available.


Software requirements for using the DISCRIM macro are:
1. SAS/BASE, SAS/STAT, and SAS/GRAPH must be licensed and installed at your site.
2. SAS/IML is required to check for multivariate normality.
3. SAS version 9.1.3 or above is recommended for full utilization.
4. An active Internet connection is required for downloading the DISCRIM macro from the book's website.

SUMMARY

A user-friendly SAS application developed by the author, which utilizes the advanced analytical and graphical features of SAS software to perform stepwise, canonical, and parametric and non-parametric discriminant function analysis together with data exploration, is presented here. Chemical diabetes data containing multiple attributes are used to demonstrate the features of discriminant analysis in discriminating the three clinical types of diabetes. Users can perform a complete discriminant analysis on their own data by using the DISCRIM macro application and following the instructions included in the book (Fernandez, 2002).

REFERENCES

1. Fernandez, G. (2002). Data Mining Using SAS Applications. Chapman and Hall, Florida.

Figure 10 Screen shot of DISCRIM macro call window


2. Gabriel, K. R. (1981). Bi-plot display of multivariate matrices for inspection of data and diagnosis. In V. Barnett (Ed.), Interpreting Multivariate Data. London: John Wiley & Sons.

3. Glick, N. (1978). Additive estimators for probabilities of correct classification. Pattern Recognition 10: 211-222.

4. Hora, S. C. and Wilcox, J. B. (1982). Estimation of error rates in several-population discriminant analyses. Journal of Marketing Research 19: 57-61.

5. Khattree, R. and Naik, D. N. (1995). Applied Multivariate Statistics with SAS Software. Cary, NC: SAS Institute Inc., Chapter 1.

6. Lachenbruch, P. A. and Mickey, M. A. (1968). Estimation of error rates in discriminant analysis. Technometrics 10: 1-10.

7. SAS Institute Inc. (2004). SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute Inc.

8. Reaven, G. M. and Miller, R. G. (1979). An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 16: 17-24.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author:
Name: George C. Fernandez, PhD
Enterprise: University of Nevada - Reno
Address: CRDA/088, Reno, NV 89557
Work phone: (775) 784-4206
Email: [email protected]
Web: http://www.ag.unr.edu/gf

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
