Chapter 6 Data Mining 1. Introduction The increase in the use of data-mining techniques in business...


Chapter 6

Data Mining



Introduction

• The increase in the use of data-mining techniques in business has been caused largely by three events.

• The explosion in the amount of data being produced and electronically tracked

• The ability to electronically warehouse these data

• The affordability of computer power to analyze the data


Introduction

• Observation: set of recorded values of variables associated with a single entity.

• Data-mining approaches can be separated into two categories.

• Supervised learning – For prediction and classification.

• Unsupervised learning – To detect patterns and relationships in the data.


Introduction

• Whether we are using a supervised or unsupervised learning approach, the data-mining process comprises the following steps:

• Data Sampling

• Data Preparation

• Model Construction

• Model Assessment


Data Sampling

• When dealing with large volumes of data, it is best practice to extract a representative sample for analysis.

• A sample is representative if the analyst can draw the same conclusions from it as from the entire population of data.

• The sample of data must be large enough to contain significant information, yet small enough to be manipulated quickly.

• Data-mining algorithms typically are more effective given more data.
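
As a rough illustration of drawing a sample and checking it against the full data, the sketch below takes a simple random sample and compares one summary statistic. The population values, the sample size of 500, and the helper name `draw_sample` are all assumptions made up for this example; comparing a single mean is only one informal check of representativeness.

```python
import random


def draw_sample(records, sample_size, seed=0):
    """Return a simple random sample (without replacement) of the records."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(records, sample_size)


# Hypothetical population: 10,000 annual incomes in $1000s (generated, not real data).
_rng = random.Random(42)
population = [_rng.gauss(60, 15) for _ in range(10_000)]

sample = draw_sample(population, 500)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean: {pop_mean:.1f}, sample mean: {sample_mean:.1f}")
```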


Data Sampling

• When obtaining a representative sample, it is generally best to include as many variables as possible in the sample.

• After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest.


Data Preparation

• The data in a data set are often said to be “dirty” and “raw” before they have been preprocessed.

• We need to put them into a form that is best suited for a data-mining algorithm – Data Table.

• Data preparation makes heavy use of descriptive statistics and data visualization methods.

Name Trait 1 Trait 2 Trait 3

Entity 1 Value Value Value

Entity 2 Value Value

Entity 3 Value Value Value

Entity 4 Value Value Value


Data Preparation

• Treatment of Missing Data

• The primary options for addressing missing data are:

• To discard observations with any missing values

• To discard any variable with missing values

• To fill in missing entries with estimated values

• To apply a data-mining algorithm (such as classification and regression trees) that can handle missing values
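
The first three options can be sketched in a few lines of plain Python. The records below are illustrative values invented for this example, `None` is assumed to mark a missing entry, and mean imputation is just one possible way to estimate a missing value.

```python
from statistics import mean

# Illustrative records (not from the slides); None marks a missing entry.
rows = [
    {"income": 21.8, "age": 47, "household": 4},
    {"income": None, "age": 60, "household": 7},
    {"income": 54.2, "age": None, "household": 3},
    {"income": 73.7, "age": 41, "household": 6},
]

# Option 1: discard observations with any missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Option 2: discard any variable that has a missing value.
complete_vars = [k for k in rows[0] if all(r[k] is not None for r in rows)]
reduced_rows = [{k: r[k] for k in complete_vars} for r in rows]


# Option 3: fill in missing entries with an estimate
# (assumption: the mean of the observed values of that variable).
def impute_mean(records):
    filled = [dict(r) for r in records]
    for key in records[0]:
        observed = [r[key] for r in records if r[key] is not None]
        for r in filled:
            if r[key] is None:
                r[key] = mean(observed)
    return filled


imputed_rows = impute_mean(rows)
```

Option 1 keeps two of the four observations here, while option 2 keeps all four observations but only the `household` variable, which shows how quickly either discard rule can shrink a data set.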


Data Preparation

• Identification of Outliers and Erroneous Data

• Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data quality issues and outliers.

• Closer examination of outliers may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis.


Data Preparation

Identification of Outliers and Erroneous Data

• A conservative approach is to create two data sets, one with and one without outliers, and then construct a model on both data sets.

• If a model’s implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down the cause of the outliers.
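
A minimal sketch of the two-data-set approach follows. The data are hypothetical, the 1.5 × IQR rule used to flag outliers is an assumption (the slides do not prescribe a rule), and a least-squares line stands in for whatever model is actually being built.

```python
from statistics import quantiles

# Hypothetical observations; the last y value is a suspected outlier.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 12, 11, 13, 12, 14, 13, 95]


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Assumed outlier rule: outside 1.5 * IQR beyond the quartiles of y.
q1, _, q3 = quantiles(ys, n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
kept = [(x, y) for x, y in zip(xs, ys) if low <= y <= high]

slope_with, _ = fit_line(xs, ys)
slope_without, _ = fit_line([x for x, _ in kept], [y for _, y in kept])
print(f"slope with outliers: {slope_with:.2f}, without: {slope_without:.2f}")
```

The two slopes differ sharply in this example, which is the situation in which the cause of the outlier should be tracked down before trusting either model.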


Extended Large Credit Example

Account Number | Annual Income ($1000) | Household Size | Years of Post-High School Education | Hours Per Week Watching Television | Age | Gender | Exceeded Credit Limit in Past 12 Months? | Annual Charges ($)
23313578 | 21.8 | 4.0 | 5.0 | 29.0 | 47 | Female | No | 10023.52
14168784 | 65.5 | 7.0 | 3.0 | 46.0 | 60 | Female | No | 11248.76
498076 | 54.2 | 3.0 | 2.0 | 18.0 | 47 | Male | No | 6115.00
12286258 | 73.7 | 6.0 | 0.0 | 44.0 | 41 | Male | No | 9785.55
14458626 | 110.4 | 7.0 | 5.0 | 39.0 | 36 | Female | No | 12633.98
6587932 | 22.1 | 8.0 | 3.0 | 39.0 | 52 | Female | No | 3309.42
28612775 | 39.6 | 5.0 | 4.0 | 40.0 | 49 | Female | No | 18927.68
23618891 | 90.6 | 8.0 | 5.0 | 27.0 | 46 | Female | No | 15762.55
6620536 | 38.7 | 1.0 | 4.0 | 15.0 | 40 | Male | No | 0.00
25744803 | 60.5 | 3.0 | 1.0 | 3.0 | 60 | Male | No | 11294.90
26373139 | 104.3 | 4.0 | 5.0 | 58.0 | 53 | Female | Yes | 21484.52
18241851 | 44.5 | 3.0 | 5.0 | 19.0 | 30 | Male | No | 11023.75
29558287 | 67.1 | 6.0 | 1.0 | 33.0 | 35 | Male | No | 20523.11
14427717 | 72.3 | 3.0 | 3.0 | 27.0 | 33 | Female | No | 17274.75
19206708 | 114.8 | 5.0 | 3.0 | 23.0 | 29 | Male | No | 25181.64
13775672 | 96.0 | 6.0 | 2.0 | 36.0 | 19 | Male | No | 11498.39
28207255 | 90.2 | 4.0 | 0.0 | 59.0 | 24 | Male | No | 16442.82
26411517 | 54.6 | 4.0 | 0.0 | 6.0 | 39 | Male | No | 4410.32
22411321 | 97.4 | 4.0 | 4.0 | 30.0 | 28 | Male | Yes | 15200.75
29587727 | 45.9 | 7.0 | 3.0 | 46.0 | 33 | Female | No | 16624.83
25774035 | 61.8 | 4.0 | 4.0 | 53.0 | 65 | Female | No | 11055.97
202180 | 38.6 | 2.0 | 3.0 | 57.0 | 43 | Female | No | 6595.49


Scatterplot Matrix in XLMINER


Data Preparation

• Variable Representation

• In many data-mining applications, the number of variables for which data is recorded may be prohibitive to analyze.

• Dimension reduction: Process of removing variables from the analysis without losing any crucial information.

• One way is to examine pairwise correlations to detect variables or groups of variables that may supply similar information.

• Such variables can be aggregated or removed to allow more parsimonious model development.
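
One way to screen for variables that supply similar information is to compute every pairwise correlation and flag the strongest ones. In the sketch below the three variables and their values are hypothetical, and the 0.9 threshold is an arbitrary assumption rather than a recommended cutoff.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical variables; spending is nearly proportional to income.
data = {
    "income": [20, 35, 50, 65, 80, 95],
    "spending": [11, 18, 24, 33, 41, 47],
    "household_size": [4, 2, 5, 1, 3, 2],
}

THRESHOLD = 0.9  # assumed cutoff for "supplies similar information"

redundant_pairs = [
    (a, b)
    for a, b in combinations(data, 2)
    if abs(pearson(data[a], data[b])) > THRESHOLD
]
print(redundant_pairs)
```

A flagged pair is a candidate for aggregation or removal, not an automatic deletion; the analyst still decides which variable to keep.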


Data Preparation

• Variable Representation

• A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider.

• The treatment of categorical variables is particularly important.

• Often data sets contain variables that, considered separately, are not particularly insightful but that, when combined as ratios, may represent important relationships.


Unsupervised Learning (skip)


Supervised Learning

The goal of a supervised learning technique is to develop a model that predicts a value for a continuous outcome or classifies a categorical outcome.

Partitioning Data

We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into three partitions:

• the training set

• the validation set, and

• the test set


Supervised Learning

Partitioning Data

• Training set: Consists of the data used to build the candidate models.

• Validation set: The data set to which promising subsets of models are applied to identify which model is the most accurate at predicting when applied to data that were not used to build the model.

• Test set: The data set to which the final model should be applied to estimate this model’s effectiveness when applied to data that have not been used to build or select the model.
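
The three partitions can be produced by shuffling the observations once and slicing. This is only a sketch: the 50/30/20 proportions, the seed, and the function name `partition` are assumptions chosen for illustration, and the integers stand in for real observations.

```python
import random


def partition(records, train_frac=0.5, valid_frac=0.3, seed=0):
    """Shuffle once, then slice into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


# Stand-in for a data set of 1309 observations, identified by row index.
records = list(range(1309))
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Because the slices are disjoint, no observation used to build a model can leak into the set used to select it or the set used to assess it.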


Figure 6.10 - XLMiner data Partition with Oversampling Dialog Box


Figure 6.11 - XLMiner Standard Data Partition Dialog Box


Supervised Learning

• Classification Accuracy

• By counting the classification errors on a sufficiently large validation set and/or test set that is representative of the population, we will generate an accurate measure of the model’s classification performance.

• Classification confusion matrix: Displays a model’s correct and incorrect classifications.


Supervised Learning

• Overall Error Rate: percentage of misclassified observations

• Measures of classification accuracy are based on the classification confusion matrix.

Table 6.4 Classification Confusion Matrix


Supervised Learning

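We define error rates with respect to the individual classes to account for the asymmetric costs of misclassification. With $n_{ij}$ denoting the number of observations of actual Class $i$ that the model classifies as Class $j$:

Class 1 error rate = $\frac{n_{10}}{n_{11} + n_{10}}$ ; Class 0 error rate = $\frac{n_{01}}{n_{01} + n_{00}}$

Cutoff value: Probability value used to understand the tradeoff between Class 1 error rate and Class 0 error rate.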
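
The tradeoff controlled by the cutoff value can be seen by classifying the same observations at two different cutoffs. The actual classes and estimated probabilities below are hypothetical, and the rule "classify as Class 1 when the probability is at least the cutoff" is the assumed convention.

```python
def class_error_rates(actual, prob, cutoff):
    """Return (Class 1 error rate, Class 0 error rate) at a given cutoff.

    An observation is classified as Class 1 when its estimated
    probability of Class 1 is at least the cutoff (assumed convention).
    """
    n11 = n10 = n01 = n00 = 0  # n_ij: actual class i classified as class j
    for y, p in zip(actual, prob):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            n11 += 1
        elif y == 1:
            n10 += 1
        elif predicted == 1:
            n01 += 1
        else:
            n00 += 1
    return n10 / (n11 + n10), n01 / (n01 + n00)


# Hypothetical actual classes and estimated probabilities of Class 1.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
prob = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

for cutoff in (0.5, 0.25):
    err1, err0 = class_error_rates(actual, prob, cutoff)
    print(f"cutoff {cutoff}: Class 1 error {err1:.3f}, Class 0 error {err0:.3f}")
```

Lowering the cutoff catches more actual Class 1 observations at the price of misclassifying more Class 0 observations, so the choice depends on which error is costlier.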


Supervised Learning

• Prediction Accuracy

• The measures of accuracy are some function of the error in estimating an outcome for an observation i.

• Average error = $\sum_{i=1}^{n} e_i / n$

• Root mean squared error (RMSE) = $\sqrt{\sum_{i=1}^{n} e_i^2 / n}$

• $e_i$ = error in estimating an outcome for observation i.
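
Both measures are easy to compute directly. The actual and predicted values below are hypothetical, and the sign convention (error = actual minus predicted) is an assumption, since the slide does not state it.

```python
from math import sqrt

# Hypothetical actual and predicted outcome values.
actual = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 10.0, 9.0, 18.0]

# Assumed sign convention: error = actual - predicted.
errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

average_error = sum(errors) / n               # signed errors can cancel out
rmse = sqrt(sum(e ** 2 for e in errors) / n)  # squaring prevents cancellation

print(f"average error: {average_error:.3f}, RMSE: {rmse:.3f}")
```

The average error indicates bias (systematic over- or under-prediction), whereas RMSE reflects the typical size of an error regardless of its direction.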


Supervised Learning

• Classification and Regression Trees (CART)

• Partition a data set of observations into increasingly smaller and more homogeneous subsets.

• At each iteration of the CART method, a subset of observations is split into two new subsets based on the values of a single variable.

• Series of questions that successively narrow down observations into smaller and smaller groups of decreasing impurity.


Titanic Passengers

Classifying a categorical outcome with a classification tree

• Typical data on 1309 passengers – Missing Data

Passenger Class | Survived | Name | Sex | Age | Siblings and Spouses | Parents and Children | Fare | Home / Destination
3 | 0 | Allen, Miss. Elisabeth Walton | female | 29 | 0 | 0 | 211.33 | St Louis, MO
1 | 1 | Allison, Master. Hudson Trevor | male | | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON
1 | 0 | Allison, Miss. Helen Loraine | female | 2 | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON
1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30 | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON
1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25 | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON


Figure 6.20 - Construction Sequence of Branches in a Classification Tree

Passenger Class | Died | Survived | Grand Total
First Class | 123 | 200 | 323
Second Class | 158 | 119 | 277
Third Class | 528 | 181 | 709
Grand Total | 809 | 500 | 1309
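
Using the died/survived counts by passenger class above, the sketch below compares the impurity of the two possible ordered splits on passenger class. The Gini index is assumed as the impurity measure here; the slides do not name the measure that XLMiner uses, so treat the numbers as illustrative.

```python
def gini(died, survived):
    """Gini impurity of a group with the given class counts."""
    n = died + survived
    return 1 - (died / n) ** 2 - (survived / n) ** 2


def weighted_gini(groups):
    """Size-weighted average impurity of the groups produced by a split."""
    total = sum(d + s for d, s in groups)
    return sum((d + s) / total * gini(d, s) for d, s in groups)


# (died, survived) counts by passenger class, taken from the table above.
counts = {"First": (123, 200), "Second": (158, 119), "Third": (528, 181)}


def split_impurity(left_classes):
    """Impurity after splitting passengers into left_classes vs. the rest."""
    left = [counts[c] for c in counts if c in left_classes]
    right = [counts[c] for c in counts if c not in left_classes]
    merge = lambda gs: (sum(d for d, _ in gs), sum(s for _, s in gs))
    return weighted_gini([merge(left), merge(right)])


root_gini = gini(809, 500)               # impurity before any split
first_vs_rest = split_impurity({"First"})
third_vs_rest = split_impurity({"Third"})

print(f"no split: {root_gini:.4f}")
print(f"First vs. Second+Third: {first_vs_rest:.4f}")
print(f"Third vs. First+Second: {third_vs_rest:.4f}")
```

Both splits reduce impurity relative to the unsplit data; a CART-style method would evaluate such candidate splits for every variable and keep the one with the largest reduction.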


Minimum Error Tree


Finding the best model

# Decision Nodes % Error

19 22.51908

18 22.51908

17 22.51908

16 22.51908

15 22.51908

14 22.51908

13 22.51908

12 22.51908

11 22.51908

10 20.80153

9 20.80153

8 20.80153 <-- Min Error Tree Std. Error 0.017731

7 21.18321

6 21.56489

5 21.56489

4 21.56489

3 21.56489

2 21.75573

1 21.75573 <-- Best Pruned

0 35.87786
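
The two marked rows can be reproduced from the log under one assumption: that the "best pruned" tree is the smallest tree whose error is within one standard error of the minimum error. The slide does not state this rule, so the sketch below should be read as a plausible reconstruction rather than XLMiner's documented procedure.

```python
# Pruning log from the table above: number of decision nodes -> % error.
pruning_log = {
    19: 22.51908, 18: 22.51908, 17: 22.51908, 16: 22.51908, 15: 22.51908,
    14: 22.51908, 13: 22.51908, 12: 22.51908, 11: 22.51908, 10: 20.80153,
    9: 20.80153, 8: 20.80153, 7: 21.18321, 6: 21.56489, 5: 21.56489,
    4: 21.56489, 3: 21.56489, 2: 21.75573, 1: 21.75573, 0: 35.87786,
}

# The reported standard error is a proportion; convert it to percentage points.
std_error_pct = 0.017731 * 100

# Minimum error tree: the smallest tree that attains the lowest error.
min_error = min(pruning_log.values())
min_error_tree = min(n for n, e in pruning_log.items() if e == min_error)

# Assumed rule: best pruned tree = smallest tree within one std. error of the minimum.
threshold = min_error + std_error_pct
best_pruned_tree = min(n for n, e in pruning_log.items() if e <= threshold)

print(min_error_tree, best_pruned_tree)
```

Under this rule the one-node tree is preferred because its error is statistically close to the minimum while being far simpler than the eight-node tree.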


Result of Classification Tree in XLMINER

Confusion Matrix for Training Data

  Predicted Class

Actual Class 1 0

1 242 70

0 81 392

Error Report

Class # Cases # Errors % Error

1 312 70 22.43590

0 473 81 17.12474

Overall 785 151 19.23567


Training data performance

Performance

Success Class 1

Precision 0.749226

Recall (Sensitivity) 0.775641

Specificity 0.828753

F1-Score 0.762205


Result of Classification Tree in XLMINER

Confusion Matrix for Validation Data

  Predicted Class

Actual Class 1 0

1 124 64

0 50 286

Error Report

Class # Cases # Errors % Error

1 188 64 34.04255

0 336 50 14.88095

Overall 524 114 21.75573


Validation data performance

Performance

Success Class 1

Precision 0.712644

Recall (Sensitivity) 0.659574

Specificity 0.85119

F1-Score 0.685083
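
These four measures follow directly from the validation confusion matrix shown earlier (124, 64, 50, 286), with Class 1 as the success class. The sketch below recomputes them using the standard definitions; the variable names are chosen for this example.

```python
# Validation confusion matrix, with Class 1 as the success class.
n11, n10 = 124, 64   # actual Class 1: classified as 1, classified as 0
n01, n00 = 50, 286   # actual Class 0: classified as 1, classified as 0

precision = n11 / (n11 + n01)     # share of predicted Class 1 that is correct
recall = n11 / (n11 + n10)        # share of actual Class 1 identified (sensitivity)
specificity = n00 / (n00 + n01)   # share of actual Class 0 identified
f1_score = 2 * precision * recall / (precision + recall)

overall_error = (n10 + n01) / (n11 + n10 + n01 + n00)

print(f"precision {precision:.6f}, recall {recall:.6f}")
print(f"specificity {specificity:.6f}, F1 {f1_score:.6f}")
print(f"overall error {overall_error:.6f}")
```

Note that recall is one minus the Class 1 error rate and specificity is one minus the Class 0 error rate, so these measures carry the same information as the error report, viewed from the side of correct classifications.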


Regression Trees


Supervised Learning

• Predicting a continuous outcome with a regression tree

• A regression tree measures the impurity of a partition by the variance of the outcome values of the observations in the group.

• After a final tree is constructed, the predicted outcome value of a new observation is the mean outcome value of the partition to which the observation belongs.
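
A single split of a regression tree can be sketched as follows: try each candidate threshold on one variable, keep the one that minimizes the size-weighted variance of the outcome, and predict with the mean of the resulting group. The data and the names `best_split` and `predict` are hypothetical, and a real tree would repeat this search recursively over all variables.

```python
from statistics import mean, pvariance


def best_split(xs, ys):
    """Return (impurity, threshold, left_mean, right_mean) for the best split.

    Impurity is the size-weighted variance of y in the two groups
    x <= threshold and x > threshold.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest x would leave the right group empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        impurity = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(ys)
        if best is None or impurity < best[0]:
            best = (impurity, t, mean(left), mean(right))
    return best


# Hypothetical data: one explanatory variable and a continuous outcome.
xs = [20, 30, 40, 50, 60, 70]
ys = [5, 6, 5, 14, 15, 16]

impurity, threshold, left_mean, right_mean = best_split(xs, ys)


def predict(x):
    """Predict with the mean outcome of the partition the observation falls into."""
    return left_mean if x <= threshold else right_mean


print(threshold, predict(35), predict(65))
```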


Figure 6.30 - XLMiner steps for regression trees


Figure 6.31 - Full Regression Tree for Optiva Credit Union


Figure 6.32 - Regression Tree Pruning Log


Figure 6.33 - Best Pruned Regression tree for Optiva Credit Union


Figure 6.34 - Best Pruned Tree Prediction of Test Data for Optiva Credit Union


Figure 6.35 - Prediction Error of Regression Trees


Supervised Learning

Logistic regression: Attempts to classify a categorical outcome (y = 0 or 1) as a linear function of explanatory variables.


Supervised Learning

• Odds - measure related to probability

• If an estimate of the probability of an event is $\hat{p}$, then the equivalent odds measure is $\hat{p} / (1 - \hat{p})$.

• The odds metric ranges between zero and positive infinity.

• We eliminate the fit problem by using the logit, $\ln(\hat{p} / (1 - \hat{p}))$.

• Estimating the logit with a linear function results in the estimated logistic regression equation.


Supervised Learning

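• Estimated Logistic Regression Equation

$\ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = b_0 + b_1 x_1 + \cdots + b_q x_q$

Given a set of explanatory variables, a logistic regression algorithm determines values of $b_0, b_1, \ldots, b_q$ that best estimate the log odds.

• Logistic Function

$\hat{p} = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_q x_q)}}$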


Figure 6.39 - XLMiner steps for logistic regression


Figure 6.40 - XLMiner logistic regression output


Figure 6.41 - XLMiner steps for refitting Logistic Regression Model and using it to Predict new Data


Figure 6.42 - Classification Error for Logistic Regression Model


Figure 6.43 - Classification of 30 new Customer Observations