
Chapter 6

Data Mining


Introduction

• The increase in the use of data-mining techniques in business has been caused largely by three events:
  • The explosion in the amount of data being produced and electronically tracked
  • The ability to electronically warehouse these data
  • The affordability of computer power to analyze the data

Introduction

• Observation: set of recorded values of variables associated with a single entity.
• Data-mining approaches can be separated into two categories:
  • Supervised learning – for prediction and classification
  • Unsupervised learning – to detect patterns and relationships in the data

Introduction

• Whether we are using a supervised or unsupervised learning approach, the data-mining process comprises the following steps:
  • Data Sampling
  • Data Preparation
  • Model Construction
  • Model Assessment

Data Sampling


Data Sampling

• When dealing with large volumes of data, it is best practice to extract a representative sample for analysis.
• A sample is representative if the analyst can reach the same conclusions from it as from the entire population of data.
• The sample of data must be large enough to contain significant information, yet small enough to be manipulated quickly.
• Data-mining algorithms typically are more effective given more data.

Data Sampling

• When obtaining a representative sample, it is generally best to include as many variables as possible in the sample.

• After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest.


Data Preparation


Data Preparation

• The data in a data set are often said to be “dirty” and “raw” before they have been preprocessed.
• We need to put them into the form best suited for a data-mining algorithm – a data table, such as the one below.
• Data preparation makes heavy use of descriptive statistics and data visualization methods.

Name      Trait 1   Trait 2   Trait 3
Entity 1  Value     Value     Value
Entity 2  Value     Value
Entity 3  Value     Value     Value
Entity 4  Value     Value     Value

Data Preparation

• Treatment of Missing Data
  • The primary options for addressing missing data (a sketch follows this list):
    • Discard observations with any missing values
    • Discard any variable with missing values
    • Fill in missing entries with estimated values
    • Apply a data-mining algorithm (such as classification and regression trees) that can handle missing values
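A minimal pandas sketch of the first three options, using a hypothetical toy data table (the column names and values are illustrative, not from the chapter; option 4 depends on the chosen algorithm and is not shown):

```python
import numpy as np
import pandas as pd

# Hypothetical data table with one missing entry (np.nan)
df = pd.DataFrame({
    "income": [21.8, 65.5, np.nan, 73.7],
    "household_size": [4, 7, 3, 6],
})

drop_obs  = df.dropna()           # option 1: discard observations with any missing values
drop_vars = df.dropna(axis=1)     # option 2: discard any variable (column) with missing values
filled    = df.fillna(df.mean())  # option 3: fill in missing entries with an estimate (column means)
```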


Data Preparation

• Identification of Outliers and Erroneous Data

• Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data quality issues and outliers.

• Closer examination of outliers may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis.


Data Preparation

Identification of Outliers and Erroneous Data

• A conservative approach is to create two data sets, one with and one without outliers, and then construct a model on both data sets (a sketch follows below).

• If a model’s implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down the cause of the outliers.
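A minimal sketch of the two-data-set approach, assuming a pandas DataFrame and flagging outliers as values more than three standard deviations from the mean; the column name, data, and threshold are all illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
charges = np.append(rng.normal(12000, 3000, 50), 98765.43)  # 50 typical values plus one outlier
df = pd.DataFrame({"annual_charges": charges})

z = (df["annual_charges"] - df["annual_charges"].mean()) / df["annual_charges"].std()
with_outliers = df                    # data set 1: all observations
without_outliers = df[z.abs() <= 3]   # data set 2: outliers removed

# Fit the same model to both; if its implications differ, track down the cause of the outliers.
```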


Extended Large Credit Example

Account Number | Annual Income ($1000) | Household Size | Years of Post-High School Education | Hours Per Week Watching Television | Age | Gender | Exceeded Credit Limit in Past 12 Months? | Annual Charges ($)

23313578 | 21.8  | 4.0 | 5.0 | 29.0 | 47 | Female | No  | 10023.52
14168784 | 65.5  | 7.0 | 3.0 | 46.0 | 60 | Female | No  | 11248.76
498076   | 54.2  | 3.0 | 2.0 | 18.0 | 47 | Male   | No  | 6115.00
12286258 | 73.7  | 6.0 | 0.0 | 44.0 | 41 | Male   | No  | 9785.55
14458626 | 110.4 | 7.0 | 5.0 | 39.0 | 36 | Female | No  | 12633.98
6587932  | 22.1  | 8.0 | 3.0 | 39.0 | 52 | Female | No  | 3309.42
28612775 | 39.6  | 5.0 | 4.0 | 40.0 | 49 | Female | No  | 18927.68
23618891 | 90.6  | 8.0 | 5.0 | 27.0 | 46 | Female | No  | 15762.55
6620536  | 38.7  | 1.0 | 4.0 | 15.0 | 40 | Male   | No  | 0.00
25744803 | 60.5  | 3.0 | 1.0 | 3.0  | 60 | Male   | No  | 11294.90
26373139 | 104.3 | 4.0 | 5.0 | 58.0 | 53 | Female | Yes | 21484.52
18241851 | 44.5  | 3.0 | 5.0 | 19.0 | 30 | Male   | No  | 11023.75
29558287 | 67.1  | 6.0 | 1.0 | 33.0 | 35 | Male   | No  | 20523.11
14427717 | 72.3  | 3.0 | 3.0 | 27.0 | 33 | Female | No  | 17274.75
19206708 | 114.8 | 5.0 | 3.0 | 23.0 | 29 | Male   | No  | 25181.64
13775672 | 96.0  | 6.0 | 2.0 | 36.0 | 19 | Male   | No  | 11498.39
28207255 | 90.2  | 4.0 | 0.0 | 59.0 | 24 | Male   | No  | 16442.82
26411517 | 54.6  | 4.0 | 0.0 | 6.0  | 39 | Male   | No  | 4410.32
22411321 | 97.4  | 4.0 | 4.0 | 30.0 | 28 | Male   | Yes | 15200.75
29587727 | 45.9  | 7.0 | 3.0 | 46.0 | 33 | Female | No  | 16624.83
25774035 | 61.8  | 4.0 | 4.0 | 53.0 | 65 | Female | No  | 11055.97
202180   | 38.6  | 2.0 | 3.0 | 57.0 | 43 | Female | No  | 6595.49

Scatterplot Matrix in XLMiner

Data Preparation

• Variable Representation
  • In many data-mining applications, the number of variables for which data are recorded may be too large to analyze in full.
  • Dimension reduction: the process of removing variables from the analysis without losing any crucial information.
  • One way is to examine pairwise correlations to detect variables or groups of variables that may supply similar information (see the sketch below).
  • Such variables can be aggregated or removed to allow more parsimonious model development.
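A minimal pandas sketch of the pairwise-correlation screen; the data, column names, and 0.9 cutoff are illustrative assumptions, not from the chapter:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
income = rng.normal(60, 20, 100)
df = pd.DataFrame({
    "income": income,
    "charges": income * 150 + rng.normal(0, 500, 100),  # nearly redundant with income
    "tv_hours": rng.uniform(0, 60, 100),
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # count each pair once
pairs = upper.stack()
print(pairs[pairs > 0.9])   # candidate variables to aggregate or remove
```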


Data Preparation

• Variable Representation
  • A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider.
  • The treatment of categorical variables is particularly important.
  • Often data sets contain variables that, considered separately, are not particularly insightful but that, when combined as ratios, may represent important relationships (a sketch follows below).
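A minimal sketch of a ratio variable, reusing two columns from the credit example above; the derived column name is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "annual_charges": [10023.52, 11248.76, 6115.00],
    "household_size": [4.0, 7.0, 3.0],
})

# Charges per household member may carry more signal than either variable alone
df["charges_per_member"] = df["annual_charges"] / df["household_size"]
```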


Unsupervised Learning (skip)


Supervised Learning


Supervised Learning

The goal of a supervised learning technique is to develop a model that predicts a value for a continuous outcome or classifies a categorical outcome.

Partitioning Data

• We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into three partitions:
  • the training set,
  • the validation set, and
  • the test set

Supervised Learning

Partitioning Data

• Training set: consists of the data used to build the candidate models.
• Validation set: the data set on which promising candidate models are compared, to identify the model that is most accurate at predicting data that were not used to build it.
• Test set: the data set to which the final model is applied to estimate its effectiveness on data that have not been used to build or select the model (a splitting sketch follows below).
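A minimal three-way split using scikit-learn rather than XLMiner's partition dialog; the toy data, 50/30/20 proportions, and seed are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})  # toy data

train, rest = train_test_split(df, train_size=0.5, random_state=0)         # 50% training
validation, test = train_test_split(rest, train_size=0.6, random_state=0)  # 30% validation, 20% test
```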


Figure 6.10 - XLMiner Data Partition with Oversampling Dialog Box


Figure 6.11 - XLMiner Standard Data Partition Dialog Box


Supervised Learning

• Classification Accuracy

  • By counting the classification errors on a sufficiently large validation set and/or test set that is representative of the population, we can generate an accurate measure of the model's classification performance.
  • Classification confusion matrix: displays a model's correct and incorrect classifications.


Supervised Learning

• Overall error rate: the percentage of misclassified observations.
• Measures of classification accuracy are based on the classification confusion matrix.

Table 6.4 Classification Confusion Matrix

                   Predicted Class 1   Predicted Class 0
Actual Class 1     n11                 n10
Actual Class 0     n01                 n00


Supervised Learning

We define error rates with respect to the individual classes to account for asymmetric misclassification costs. With $n_{ij}$ denoting the count of observations of actual class $i$ predicted as class $j$ (Table 6.4):

Class 1 error rate = $n_{10} / (n_{11} + n_{10})$;  Class 0 error rate = $n_{01} / (n_{01} + n_{00})$

Cutoff value: the probability threshold above which an observation is classified as Class 1; varying it trades off the Class 1 error rate against the Class 0 error rate (a worked computation follows below).
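A minimal Python check of these formulas, plugging in the validation-set confusion matrix reported later in this chapter:

```python
# Validation-set counts from the classification-tree example below
n11, n10 = 124, 64   # actual Class 1: predicted 1 / predicted 0
n01, n00 = 50, 286   # actual Class 0: predicted 1 / predicted 0

class1_error = n10 / (n11 + n10)                       # 64/188  = 0.3404 (34.04%)
class0_error = n01 / (n01 + n00)                       # 50/336  = 0.1488 (14.88%)
overall_error = (n10 + n01) / (n11 + n10 + n01 + n00)  # 114/524 = 0.2176 (21.76%)
```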


Supervised Learning

• Prediction Accuracy
  • The measures of accuracy are functions of $e_i$, the error in estimating the outcome for observation $i$.
  • Average error = $\sum_{i=1}^{n} e_i / n$
  • Root mean squared error (RMSE) = $\sqrt{\sum_{i=1}^{n} e_i^2 / n}$ (a numeric sketch follows below)
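A minimal numpy sketch of both measures; the actual and predicted values are toy numbers:

```python
import numpy as np

actual = np.array([10.0, 12.0, 9.0, 15.0])
predicted = np.array([11.0, 11.5, 8.0, 16.0])

e = actual - predicted            # e_i, the estimation errors
average_error = e.mean()          # positive and negative errors can offset each other
rmse = np.sqrt((e ** 2).mean())   # squaring penalizes large errors more heavily
```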


Supervised Learning

• Classification and Regression Trees (CART)
  • Partition a data set of observations into increasingly smaller and more homogeneous subsets.
  • At each iteration of the CART method, a subset of observations is split into two new subsets based on the values of a single variable.
  • The result is a series of questions that successively narrows observations down into smaller and smaller groups of decreasing impurity (a sketch follows below).
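The chapter builds its trees in XLMiner; as a rough Python analogue, here is a minimal scikit-learn classification tree on toy data (columns, values, and parameters are illustrative; scikit-learn measures impurity with the Gini index by default):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy observations: two explanatory variables and a 0/1 outcome
X = pd.DataFrame({"age": [29, 2, 30, 25, 48, 60],
                  "fare": [211, 152, 152, 152, 8, 27]})
y = [1, 1, 0, 0, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # shallow tree guards against overfitting
tree.fit(X, y)          # each split uses a single variable to reduce impurity
print(tree.predict(X))  # predicted class for each observation
```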


Titanic Passengers

Classifying a categorical outcome with a classification tree

• Typical data on 1309 passengers – Missing Data

Passenger Class | Survived | Name | Sex | Age | Siblings and Spouses | Parents and Children | Fare | Home / Destination
3 | 0 | Allen, Miss. Elisabeth Walton | female | 29 | 0 | 0 | 211.33 | St Louis, MO
1 | 1 | Allison, Master. Hudson Trevor | male | (missing) | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON
1 | 0 | Allison, Miss. Helen Loraine | female | 2 | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON
1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30 | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON
1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25 | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON


Figure 6.20 - Construction Sequence of Branches in a Classification Tree

[Bar chart: counts of passengers who died (0) vs. survived (1) in each passenger class; vertical axis runs 0–600]

               Died   Survived   Grand Total
First Class     123        200           323
Second Class    158        119           277
Third Class     528        181           709
Grand Total     809        500          1309

Minimum Error Tree


Finding the best model

# Decision Nodes   % Error
        19         22.51908
        18         22.51908
        17         22.51908
        16         22.51908
        15         22.51908
        14         22.51908
        13         22.51908
        12         22.51908
        11         22.51908
        10         20.80153
         9         20.80153
         8         20.80153   <-- Minimum Error Tree (Std. Error 0.017731)
         7         21.18321
         6         21.56489
         5         21.56489
         4         21.56489
         3         21.56489
         2         21.75573
         1         21.75573   <-- Best Pruned Tree
         0         35.87786

The best pruned tree is the smallest tree whose error is within one standard error of the minimum error tree (20.80153% + 1.7731% ≈ 22.57%); the one-decision-node tree, at 21.75573%, is the smallest that qualifies.

Result of Classification Tree in XLMiner

Confusion Matrix for Training Data

                Predicted Class
Actual Class        1       0
     1            242      70
     0             81     392

Error Report
Class      # Cases   # Errors   % Error
1              312         70   26.92308
0              473         81   15.85624
Overall        785        151   20.25478


Training data performance

Performance

Success Class 1

Precision 0.749226

Recall (Sensitivity) 0.775641

Specificity 0.828753

F1-Score 0.762205
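All four metrics follow directly from the training confusion matrix above; a minimal Python check:

```python
n11, n10, n01, n00 = 242, 70, 81, 392   # training confusion matrix above

precision   = n11 / (n11 + n01)   # 242/323 ≈ 0.749226
recall      = n11 / (n11 + n10)   # 242/312 ≈ 0.775641 (sensitivity)
specificity = n00 / (n00 + n01)   # 392/473 ≈ 0.828753
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.762205
```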


Result of Classification Tree in XLMiner

Confusion Matrix for Validation Data

  Predicted Class

Actual Class 1 0

1 124 64

0 50 286

Error Report

Class # Cases # Errors % Error

1 188 64 34.04255

0 336 50 14.88095

Overall 524 114 21.75573


Validation data performance

Performance

Success Class 1

Precision              0.712644
Recall (Sensitivity)   0.659574
Specificity            0.85119
F1-Score               0.685083


Regression Trees


Supervised Learning

• Predicting a continuous outcome with a regression tree
  • A regression tree bases the impurity of a partition on the variance of the outcome values for the observations in the group.
  • After the final tree is constructed, the predicted outcome value for a new observation is the mean outcome value of the partition into which that observation falls (a sketch follows below).
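Again, the chapter works in XLMiner; a rough scikit-learn analogue on toy data (names, values, and parameters are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(20, 120, size=(200, 1))        # e.g., annual income ($1000)
y = X[:, 0] * 150 + rng.normal(0, 2000, 200)   # e.g., annual charges ($), with noise

reg = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)  # splits chosen to reduce variance
reg.fit(X, y)
print(reg.predict([[60.0]]))  # mean outcome of the partition this observation falls into
```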


Figure 6.30 - XLMiner steps for regression trees


Figure 6.31 - Full Regression Tree for Optiva Credit Union


Figure 6.32 - Regression Tree Pruning Log


Figure 6.33 - Best Pruned Regression Tree for Optiva Credit Union


Figure 6.34 - Best Pruned Tree Prediction of Test Data for Optiva Credit Union


Figure 6.35 - Prediction Error of Regression Trees


Supervised Learning

Logistic regression: attempts to classify a categorical outcome (y = 0 or 1) as a linear function of explanatory variables.


Supervised Learning

• Odds: a measure related to probability.
  • If an estimate of the probability of an event is $\hat{p}$, then the equivalent odds measure is $\hat{p} / (1 - \hat{p})$.
  • The odds metric ranges between zero and positive infinity, whereas a linear function can take any real value, which creates a fit problem.
  • We eliminate the fit problem by using the logit, $\ln(\hat{p} / (1 - \hat{p}))$, which also ranges over all real values.
  • Estimating the logit with a linear function results in the estimated logistic regression equation.


Supervised Learning

• Estimated Logistic Regression Equation

$\ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_q x_q$

Given a set of explanatory variables, a logistic regression algorithm determines the values of $b_0, b_1, \ldots, b_q$ that best estimate the log odds.

• Logistic Function

$\hat{p} = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_q x_q)}}$

(a fitting sketch follows below)
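A minimal scikit-learn analogue of fitting these coefficients (the chapter uses XLMiner; the toy data are illustrative, and note that scikit-learn applies L2 regularization by default, which shrinks the estimates):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))                # two explanatory variables
logit = 0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]  # true b0 + b1*x1 + b2*x2
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
b0, b = model.intercept_[0], model.coef_[0]  # estimated log-odds coefficients
p_hat = 1 / (1 + np.exp(-(b0 + X @ b)))      # the logistic function by hand
# equivalent to model.predict_proba(X)[:, 1]
```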


Figure 6.39 - XLMiner steps for logistic regression


Figure 6.40 - XLMiner logistic regression output


Figure 6.41 - XLMiner Steps for Refitting the Logistic Regression Model and Using It to Predict New Data


Figure 6.42 - Classification Error for Logistic Regression Model


Figure 6.43 - Classification of 30 New Customer Observations