Post on 16-Jan-2016
Chapter 6
Data Mining
1
2
Introduction
• The increase in the use of data-mining techniques in business has been caused largely by three events.
• The explosion in the amount of data being produced and
electronically tracked
• The ability to electronically warehouse these data
• The affordability of computer power to analyze the data
3
Introduction
• Observation: set of recorded values of variables associated with a
single entity.
• Data-mining approaches can be separated into two categories.
• Supervised learning – For prediction and classification.
• Unsupervised learning – To detect patterns and relationships in the
data.
4
Introduction
• Whether we are using a supervised or unsupervised learning
approach, the data-mining process comprises the following steps:
• Data Sampling
• Data Preparation
• Model Construction
• Model Assessment
5
Data Sampling
6
Data Sampling
• When dealing with large volumes of data, it is best practice to extract a representative sample for analysis.
• A sample is representative if the analyst can draw the same conclusions from it as from the entire population of data.
• The sample of data must be large enough to contain significant information, yet small enough to be manipulated quickly.
• Data-mining algorithms typically are more effective given more data.
7
Data Sampling
• When obtaining a representative sample, it is generally best to include as many variables as possible in the sample.
• After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest.
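The sampling step above can be sketched in Python; the population values, sample size, and representativeness check below are hypothetical choices, not a prescribed procedure:

```python
import random
import statistics

random.seed(7)

# Hypothetical population: 100,000 annual-income values (in $1000s).
population = [random.gauss(60, 15) for _ in range(100_000)]

# Extract a simple random sample small enough to manipulate quickly.
sample = random.sample(population, 1_000)

# A quick representativeness check: the sample mean should be close to
# the population mean (a fuller check would also compare variances and
# the shapes of the distributions).
print(abs(statistics.mean(population) - statistics.mean(sample)) < 2.0)  # True
```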
8
Data Preparation
9
Data Preparation
• The data in a data set are often said to be “dirty” and “raw” before they have been preprocessed.
• We need to put them into the form best suited for a data-mining algorithm: a data table.
• Data preparation makes heavy use of the descriptive statistics and data visualization methods.
| Name | Trait 1 | Trait 2 | Trait 3 |
| Entity 1 | Value | Value | Value |
| Entity 2 | Value | Value | |
| Entity 3 | Value | Value | Value |
| Entity 4 | Value | Value | Value |
10
Data Preparation
• Treatment of Missing Data
• The primary options for addressing missing data
• To discard observations with any missing values
• To discard any variable with missing values
• To fill in missing entries with estimated values
• To apply a data-mining algorithm (such as classification and
regression trees) that can handle missing values
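The first and third options can be illustrated in Python; the records and variable names below are hypothetical:

```python
import statistics

# Hypothetical observations with one missing Income entry (None).
records = [
    {"Income": 21.8, "Age": 47},
    {"Income": None, "Age": 60},
    {"Income": 54.2, "Age": 47},
]

# Option 1: discard observations with any missing values.
complete = [r for r in records if None not in r.values()]

# Option 3: fill in the missing entry with an estimated value
# (here, the mean of the observed incomes).
observed = [r["Income"] for r in records if r["Income"] is not None]
mean_income = statistics.mean(observed)
imputed = [dict(r, Income=mean_income if r["Income"] is None else r["Income"])
           for r in records]

print(len(complete))  # 2
print(mean_income)    # 38.0
```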
11
Data Preparation
• Identification of Outliers and Erroneous Data
• Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data quality issues and outliers.
• Closer examination of outliers may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis.
12
Data Preparation
Identification of Outliers and Erroneous Data
• A conservative approach is to create two data sets, one with and one without outliers, and then construct a model on both data sets.
• If a model’s implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down the cause of the outliers.
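One simple way to flag candidate outliers is with z-scores; the charge values and the 2-standard-deviation threshold below are hypothetical choices:

```python
import statistics

# Hypothetical annual charges; the last value is a suspected outlier.
charges = [10023.52, 11248.76, 6115.00, 9785.55, 12633.98, 95000.00]

mean = statistics.mean(charges)
sd = statistics.pstdev(charges)

# Flag observations more than 2 population standard deviations from the
# mean; with such a small sample, a single extreme value inflates the
# standard deviation, so a stricter 3-sd rule would miss it here.
outliers = [x for x in charges if abs((x - mean) / sd) > 2]
print(outliers)  # [95000.0]
```

Flagged values then warrant closer examination to decide whether they are errors or legitimate but unusual observations.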
13
Extended Large Credit Example

| Account Number | Annual Income ($1000) | Household Size | Years of Post-High School Education | Hours Per Week Watching Television | Age | Gender | Exceeded Credit Limit in Past 12 Months? | Annual Charges ($) |
| 23313578 | 21.8 | 4.0 | 5.0 | 29.0 | 47 | Female | No | 10023.52 |
| 14168784 | 65.5 | 7.0 | 3.0 | 46.0 | 60 | Female | No | 11248.76 |
| 498076 | 54.2 | 3.0 | 2.0 | 18.0 | 47 | Male | No | 6115.00 |
| 12286258 | 73.7 | 6.0 | 0.0 | 44.0 | 41 | Male | No | 9785.55 |
| 14458626 | 110.4 | 7.0 | 5.0 | 39.0 | 36 | Female | No | 12633.98 |
| 6587932 | 22.1 | 8.0 | 3.0 | 39.0 | 52 | Female | No | 3309.42 |
| 28612775 | 39.6 | 5.0 | 4.0 | 40.0 | 49 | Female | No | 18927.68 |
| 23618891 | 90.6 | 8.0 | 5.0 | 27.0 | 46 | Female | No | 15762.55 |
| 6620536 | 38.7 | 1.0 | 4.0 | 15.0 | 40 | Male | No | 0.00 |
| 25744803 | 60.5 | 3.0 | 1.0 | 3.0 | 60 | Male | No | 11294.90 |
| 26373139 | 104.3 | 4.0 | 5.0 | 58.0 | 53 | Female | Yes | 21484.52 |
| 18241851 | 44.5 | 3.0 | 5.0 | 19.0 | 30 | Male | No | 11023.75 |
| 29558287 | 67.1 | 6.0 | 1.0 | 33.0 | 35 | Male | No | 20523.11 |
| 14427717 | 72.3 | 3.0 | 3.0 | 27.0 | 33 | Female | No | 17274.75 |
| 19206708 | 114.8 | 5.0 | 3.0 | 23.0 | 29 | Male | No | 25181.64 |
| 13775672 | 96.0 | 6.0 | 2.0 | 36.0 | 19 | Male | No | 11498.39 |
| 28207255 | 90.2 | 4.0 | 0.0 | 59.0 | 24 | Male | No | 16442.82 |
| 26411517 | 54.6 | 4.0 | 0.0 | 6.0 | 39 | Male | No | 4410.32 |
| 22411321 | 97.4 | 4.0 | 4.0 | 30.0 | 28 | Male | Yes | 15200.75 |
| 29587727 | 45.9 | 7.0 | 3.0 | 46.0 | 33 | Female | No | 16624.83 |
| 25774035 | 61.8 | 4.0 | 4.0 | 53.0 | 65 | Female | No | 11055.97 |
| 202180 | 38.6 | 2.0 | 3.0 | 57.0 | 43 | Female | No | 6595.49 |
14
Scatterplot Matrix in XLMiner
15
Data Preparation
• Variable Representation
• In many data-mining applications, the number of variables for
which data is recorded may be prohibitive to analyze.
• Dimension reduction: Process of removing variables from the
analysis without losing any crucial information.
• One way is to examine pairwise correlations to detect variables or groups of variables that may supply similar information.
• Such variables can be aggregated or removed to allow more parsimonious model development.
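The pairwise-correlation check can be sketched as follows; all variable values are hypothetical, and `dependents` is deliberately constructed to mirror `household_size`:

```python
import statistics

# Hypothetical variables: household size and number of dependents move
# together, so they likely supply similar information.
household_size = [4, 7, 3, 6, 7, 8, 5, 8]
dependents     = [3, 6, 2, 5, 6, 7, 4, 7]   # household_size - 1
income         = [21.8, 65.5, 54.2, 73.7, 110.4, 22.1, 39.6, 90.6]

def corr(x, y):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# A correlation near 1 suggests one of the pair can be dropped (or the
# pair aggregated) for a more parsimonious model.
print(round(corr(household_size, dependents), 2))  # 1.0
print(abs(corr(household_size, income)) < 0.9)     # True
```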
16
Data Preparation
• Variable Representation
• A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider.
• The treatment of categorical variables is particularly important.
• Often data sets contain variables that, considered separately, are not particularly insightful but that, when combined as ratios, may represent important relationships.
17
Unsupervised Learning (skip)
18
Supervised Learning
19
Supervised Learning
The goal of a supervised learning technique is to develop a model that predicts a value for a continuous outcome or classifies a categorical outcome.
Partitioning Data
• We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into three partitions:
• the training set,
• the validation set, and
• the test set.
20
Supervised Learning
Partitioning Data
• Training set: Consists of the data used to build the candidate models.
• Validation set: The data set to which a promising subset of candidate models is applied to identify which model is most accurate at predicting data that were not used to build the model.
• Test set: The data set to which the final model should be applied to estimate this model’s effectiveness when applied to data that have not been used to build or select the model.
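A minimal sketch of partitioning; the 1,000 observations and the 50/30/20 proportions below are one common hypothetical choice, not a rule:

```python
import random

random.seed(42)

# Hypothetical data set of 1,000 observation indices.
observations = list(range(1000))
random.shuffle(observations)

# 50% training, 30% validation, 20% test.
train      = observations[:500]
validation = observations[500:800]
test       = observations[800:]

# Every observation lands in exactly one partition.
print(len(train), len(validation), len(test))        # 500 300 200
print(len(set(train) | set(validation) | set(test))) # 1000
```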
21
Figure 6.10 - XLMiner Data Partition with Oversampling Dialog Box
22
Figure 6.11 - XLMiner Standard Data Partition Dialog Box
23
Supervised Learning
• Classification Accuracy
• By counting the classification errors on a sufficiently large validation set and/or test set that is representative of the population, we will generate an accurate measure of the model’s classification performance.
• Classification confusion matrix: Displays a model’s correct and
incorrect classifications.
24
Supervised Learning
• Overall Error Rate: percentage of misclassified observations
• Measures of classification accuracy are based on the classification confusion matrix.
Table 6.4 Classification Confusion Matrix
25
Supervised Learning
We define error rates with respect to the individual classes to account for asymmetric costs of misclassification:

Class 1 error rate = n10 / (n11 + n10); Class 0 error rate = n01 / (n01 + n00),

where nij is the number of observations with actual class i that the model classifies as class j.

Cutoff value: the probability threshold above which an observation is classified as Class 1; varying the cutoff trades off the Class 1 error rate against the Class 0 error rate.
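The class error rates and the effect of the cutoff can be illustrated as follows; the actual classes and predicted Class 1 probabilities below are hypothetical:

```python
# Hypothetical validation results: (actual class, predicted probability
# of Class 1) for ten observations.
results = [(1, 0.91), (1, 0.75), (1, 0.42), (0, 0.30),
           (0, 0.12), (0, 0.58), (1, 0.88), (0, 0.05),
           (0, 0.22), (1, 0.67)]

def class_error_rates(results, cutoff):
    """Build the confusion matrix counts n11, n10, n01, n00 for a given
    cutoff and return the Class 1 and Class 0 error rates."""
    n11 = n10 = n01 = n00 = 0
    for actual, prob in results:
        predicted = 1 if prob >= cutoff else 0
        if actual == 1 and predicted == 1:
            n11 += 1
        elif actual == 1 and predicted == 0:
            n10 += 1
        elif actual == 0 and predicted == 1:
            n01 += 1
        else:
            n00 += 1
    return n10 / (n11 + n10), n01 / (n01 + n00)

# Lowering the cutoff classifies more observations as Class 1, trading
# a lower Class 1 error rate for a higher Class 0 error rate.
print(class_error_rates(results, 0.5))  # (0.2, 0.2)
print(class_error_rates(results, 0.3))  # (0.0, 0.4)
```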
26
Supervised Learning
• Prediction Accuracy
• The measures of accuracy are functions of ei, the error in estimating the outcome for observation i.
• Average error = (Σ ei) / n
• Root mean squared error (RMSE) = sqrt((Σ ei^2) / n)
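Both measures can be computed directly; the prediction errors below are hypothetical:

```python
import math

# Hypothetical prediction errors e_i = actual - predicted.
errors = [2.0, -1.0, 0.5, -0.5, 3.0]
n = len(errors)

average_error = sum(errors) / n                    # signals systematic bias
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)  # penalizes large errors more

print(average_error)   # 0.8
print(round(rmse, 4))  # 1.7029
```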
27
Supervised Learning
• Classification and Regression Trees (CART)
• Partition a data set of observations into increasingly smaller and
more homogeneous subsets.
• At each iteration of the CART method, a subset of observations is
split into two new subsets based on the values of a single variable.
• Series of questions that successively narrow down observations
into smaller and smaller groups of decreasing impurity.
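The core CART step, splitting on a single variable to reduce impurity, can be sketched as follows. Gini impurity is one common impurity measure; the passenger-class and survival values are hypothetical:

```python
def gini(labels):
    """Gini impurity of a set of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Try each candidate split value of one variable and return the
    (split value, weighted impurity) pair with the lowest impurity."""
    best = None
    for s in sorted(set(values)):
        left  = [y for x, y in zip(values, labels) if x <= s]
        right = [y for x, y in zip(values, labels) if x > s]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or w < best[1]:
            best = (s, w)
    return best

# Hypothetical: passenger class vs. survival (1 = survived).
pclass   = [1, 1, 2, 2, 3, 3, 3, 3]
survived = [1, 1, 1, 0, 0, 0, 0, 1]
print(best_split(pclass, survived)[0])  # 1: splitting at class <= 1 is purest
```

Repeating this step on each resulting subset yields the successively narrower, less impure groups described above.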
28
Titanic Passengers
Classifying a categorical outcome with a classification tree
• Typical data on 1309 passengers – Missing Data (note the blank Age entry)

| Passenger Class | Survived | Name | Sex | Age | Siblings and Spouses | Parents and Children | Fare | Home / Destination |
| 3 | 0 | Allen, Miss. Elisabeth Walton | female | 29 | 0 | 0 | 211.33 | St Louis, MO |
| 1 | 1 | Allison, Master. Hudson Trevor | male | | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON |
| 1 | 0 | Allison, Miss. Helen Loraine | female | 2 | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON |
| 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30 | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON |
| 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25 | 1 | 2 | 151.55 | Montreal, PQ / Chesterville, ON |
29
Figure 6.20 - Construction Sequence of Branches in a Classification Tree
[Bar chart: counts of passengers who died (0) and survived (1) by passenger class]

|  | Died | Survived | Grand Total |
| First Class | 123 | 200 | 323 |
| Second Class | 158 | 119 | 277 |
| Third Class | 528 | 181 | 709 |
| Grand Total | 809 | 500 | 1309 |
30
Minimum Error Tree
31
Finding the best model
# Decision Nodes % Error
19 22.51908
18 22.51908
17 22.51908
16 22.51908
15 22.51908
14 22.51908
13 22.51908
12 22.51908
11 22.51908
10 20.80153
9 20.80153
8 20.80153 <-- Min Error Tree (Std. Error 0.017731)
7 21.18321
6 21.56489
5 21.56489
4 21.56489
3 21.56489
2 21.75573
1 21.75573 <-- Best Pruned
0 35.87786
32
Result of Classification Tree in XLMiner
Confusion Matrix for Training Data

| Actual Class | Predicted 1 | Predicted 0 |
| 1 | 242 | 70 |
| 0 | 81 | 392 |

Error Report
| Class | # Cases | # Errors | % Error |
| 1 | 312 | 70 | 26.92308 |
| 0 | 473 | 81 | 15.85624 |
| Overall | 785 | 151 | 20.25478 |
33
Training data performance
Performance
Success Class 1
Precision 0.749226
Recall (Sensitivity) 0.775641
Specificity 0.828753
F1-Score 0.762205
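These measures follow directly from the training confusion matrix (n11 = 242, n10 = 70, n01 = 81, n00 = 392), with Class 1 as the success class:

```python
# Recompute the training-set performance measures from the confusion
# matrix counts, using the nij notation (actual class i, predicted j).
n11, n10, n01, n00 = 242, 70, 81, 392

precision   = n11 / (n11 + n01)  # of those predicted 1, fraction actually 1
recall      = n11 / (n11 + n10)  # of the actual 1s, fraction found
specificity = n00 / (n00 + n01)  # of the actual 0s, fraction found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 6))    # 0.749226
print(round(recall, 6))       # 0.775641
print(round(specificity, 6))  # 0.828753
print(round(f1, 6))           # 0.762205
```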
34
Result of Classification Tree in XLMiner
Confusion Matrix for Validation Data

| Actual Class | Predicted 1 | Predicted 0 |
| 1 | 124 | 64 |
| 0 | 50 | 286 |

Error Report
| Class | # Cases | # Errors | % Error |
| 1 | 188 | 64 | 34.04255 |
| 0 | 336 | 50 | 14.88095 |
| Overall | 524 | 114 | 21.75573 |
35
Validation data performance
Performance
Success Class 1
Precision 0.712644
Recall (Sensitivity) 0.659574
Specificity 0.85119
F1-Score 0.685083
36
Regression Trees
37
Supervised Learning
• Predicting a continuous outcome with a regression tree
• A regression tree measures the impurity of a partition by the variance of the outcome values for the observations in the group.
• After a final tree is constructed, the predicted outcome value of an
observation is based on the mean outcome value of the partition
into which the new observation belongs.
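A minimal sketch of these two ideas, variance as impurity and the partition mean as the prediction; the income split point and annual-charge values below are hypothetical:

```python
import statistics

# Hypothetical partitions of annual charges after one split on income
# at $60,000 (in $1000s).
partition_low  = [3309.42, 6115.00, 6595.49]     # income <= 60
partition_high = [11248.76, 12633.98, 15762.55]  # income > 60

# Impurity = variance of the outcome; splitting reduces it in each group.
before     = statistics.pvariance(partition_low + partition_high)
after_low  = statistics.pvariance(partition_low)
after_high = statistics.pvariance(partition_high)
print(after_low < before and after_high < before)  # True

# A new observation with income $75,000 falls into the high partition,
# so its predicted annual charges are that partition's mean.
print(round(statistics.mean(partition_high), 2))  # 13215.1
```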
38
Figure 6.30 - XLMiner steps for regression trees
39
Figure 6.31 - Full Regression Tree for Optiva Credit Union
40
Figure 6.32 - Regression Tree Pruning Log
41
Figure 6.33 - Best Pruned Regression tree for Optiva Credit Union
42
Figure 6.34 - Best Pruned Tree Prediction of Test Data for Optiva Credit Union
43
Figure 6.35 - Prediction Error of Regression Trees
44
Supervised Learning
Logistic regression: Attempts to classify a categorical outcome (y = 0
or 1) as a linear function of explanatory variables.
45
Supervised Learning
46
Supervised Learning
• Odds - a measure related to probability
• If an estimate of the probability of an event is p̂, then the equivalent odds measure is p̂ / (1 − p̂).
• The odds metric ranges between zero and positive infinity.
• A linear function fit directly to a probability can produce estimates outside the interval [0, 1]; we eliminate this fit problem by using the logit, ln(p̂ / (1 − p̂)), which is unbounded.
• Estimating the logit with a linear function results in the estimated logistic regression equation.
47
Supervised Learning
• Estimated Logistic Regression Equation

ln(p̂ / (1 − p̂)) = b0 + b1x1 + b2x2 + ∙ ∙ ∙ + bqxq

• Given a set of explanatory variables, a logistic regression algorithm determines the values of b0, b1, . . . , bq that best estimate the log odds.
• Logistic Function

p̂ = 1 / (1 + e^−(b0 + b1x1 + b2x2 + ∙ ∙ ∙ + bqxq))
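These formulas can be checked numerically; the coefficients b0 = −2.0 and b1 = 0.03 and the single explanatory variable (annual income, in $1000s) are hypothetical:

```python
import math

# Hypothetical logistic regression coefficients.
b0, b1 = -2.0, 0.03

def logistic(x):
    """Estimated probability: p_hat = 1 / (1 + e^-(b0 + b1*x))."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

p = logistic(100)          # income of $100,000
print(round(p, 4))         # 0.7311
print(round(logit(p), 4))  # 1.0, i.e. the linear predictor b0 + b1*100
```

Applying the logit to the estimated probability recovers the linear predictor, which is exactly why estimating the logit with a linear function works.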
48
Figure 6.39 - XLMiner steps for logistic regression
49
Figure 6.40 - XLMiner logistic regression output
50
Figure 6.41 - XLMiner steps for refitting Logistic Regression Model and using it to Predict new Data
51
Figure 6.42 - Classification Error for Logistic Regression Model
52
Figure 6.43 - Classification of 30 new Customer Observations