Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.
-
date post
19-Dec-2015 -
Category
Documents
-
view
222 -
download
4
Transcript of Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.
![Page 1: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/1.jpg)
Data Mining: A Closer Look
Chapter 2
![Page 2: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/2.jpg)
2.1 Data Mining Strategies
![Page 3: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/3.jpg)
Figure 2.1 A hierarchy of data mining strategies
Data MiningStrategies
SupervisedLearning
Market BasketAnalysis
UnsupervisedClustering
PredictionEstimationClassification
![Page 4: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/4.jpg)
Classification
• Learning is supervised.
• The dependent variable is categorical.
• Well-defined classes.
• Current rather than future behavior.
![Page 5: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/5.jpg)
Estimation
• Learning is supervised.
• The dependent variable is numeric.
• Well-defined classes.
• Current rather than future behavior.
![Page 6: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/6.jpg)
Prediction
• The emphasis is on predicting future rather than current outcomes.
• The output attribute may be categorical or numeric.
![Page 7: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/7.jpg)
The Cardiology Patient Dataset
![Page 8: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/8.jpg)
Table 2.1 • Cardiology Patient Data Attribute Mixed Numeric Name Values Values Comments
Age Numeric Numeric Age in years
Sex Male, Female 1, 0 Patient gender
Chest Pain Type Angina, Abnormal Angina, 1–4 NoTang = Nonanginal NoTang, Asymptomatic pain
Blood Pressure Numeric Numeric Resting blood pressure upon hospital admission
Cholesterol Numeric Numeric Serum cholesterol
Fasting Blood True, False 1, 0 Is fasting blood sugar less Sugar < 120 than 120?
Resting ECG Normal, Abnormal, Hyp 0, 1, 2 Hyp = Left ventricular hypertrophy
Maximum Heart Numeric Numeric Maximum heart rate Rate achieved
Induced Angina? True, False 1, 0 Does the patient experience angina as a result of exercise?
Old Peak Numeric Numeric ST depression induced by exercise relative to rest
Slope Up, flat, down 1–3 Slope of the peak exercise ST segment
Number Colored 0, 1, 2, 3 0, 1, 2, 3 Number of major vessels Vessels colored by fluorosopy
Thal Normal fix, rev 3, 6, 7 Normal, fixed defect, reversible defect
Concept Class Healthy, Sick 1, 0 Angiographic disease status
![Page 9: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/9.jpg)
Table 2.2 • Most and Least Typical Instances from the Cardiology Domain
Attribute Most Typical Least Typical Most Typical Least TypicalName Healthy Class Healthy Class Sick Class Sick Class
Age 52 63 60 62Sex Male Male Male FemaleChest Pain Type NoTang Angina Asymptomatic AsymptomaticBlood Pressure 138 145 125 160Cholesterol 223 233 258 164Fasting Blood Sugar < 120 False True False FalseResting ECG Normal Hyp Hyp HypMaximum Heart Rate 169 150 141 145Induced Angina? False False True FalseOld Peak 0 2.3 2.8 6.2Slope Up Down Flat DownNumber of Colored Vessels 0 0 1 3Thal Normal Fix Rev Rev
![Page 10: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/10.jpg)
A Healthy Class Rule for the Cardiology Patient Dataset
IF 169 <= Maximum Heart Rate <=202
THEN Concept Class = Healthy
Rule accuracy: 85.07%
Rule coverage: 34.55%
![Page 11: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/11.jpg)
A Sick Class Rule for the Cardiology Patient Dataset
IF Thal = Rev & Chest Pain Type = Asymptomatic
THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%
![Page 12: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/12.jpg)
Unsupervised Clustering
• Determine if concepts can be found in the data.• Evaluate the likely performance of a supervised model.• Determine a best set of input attributes for supervised learning.• Detect Outliers.
![Page 13: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/13.jpg)
Market Basket Analysis
• Find interesting relationships among retail products.
• Uses association rule algorithms.
![Page 14: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/14.jpg)
2.2 Supervised Data Mining Techniques
![Page 15: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/15.jpg)
The Credit Card Promotion Database
![Page 16: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/16.jpg)
Table 2.3 • The Credit Card Promotion Database
Income Magazine Watch Life Insurance Credit CardRange ($) Promotion Promotion Promotion Insurance Sex Age
40–50K Yes No No No Male 4530–40K Yes Yes Yes No Female 4040–50K No No No No Male 4230–40K Yes Yes Yes Yes Male 4350–60K Yes No Yes No Female 3820–30K No No No No Female 5530–40K Yes No Yes Yes Male 3520–30K No Yes No No Male 2730–40K Yes No No No Male 4330–40K Yes Yes Yes No Female 4140–50K No Yes Yes No Female 4320–30K No Yes Yes No Male 2950–60K Yes Yes Yes No Female 3940–50K No Yes No No Male 5520–30K No No Yes Yes Female 19
![Page 17: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/17.jpg)
A Hypothesis for the Credit Card Promotion Database
A combination of one or more of the dataset attributes differentiate Acme Credit Card Company card holders who have taken advantage of the life insurance promotion and those card holders who have chosen not to participate in the promotional offer.
![Page 18: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/18.jpg)
A Production Rule for theCredit Card Promotion Database
IF Sex = Female & 19 <=Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 66.67%
![Page 19: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/19.jpg)
Production Rules
• Rule accuracy is a between-class measure.
• Rule coverage is a within-class measure.
![Page 20: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/20.jpg)
Neural Networks
![Page 21: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/21.jpg)
Figure 2.2 A multilayer fully connected neural network
InputLayer
OutputLayer
HiddenLayer
![Page 22: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/22.jpg)
Table 2.4 • Neural Network Training: Actual and Computed Output
Instance Number Life Insurance Promotion Computed Output
1 0 0.024
2 1 0.998
3 0 0.023
4 1 0.986
5 1 0.999
6 0 0.050
7 1 0.999
8 0 0.262
9 0 0.060
10 1 0.997
11 1 0.999
12 1 0.776
13 1 0.999
14 0 0.023
15 1 0.999
![Page 23: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/23.jpg)
Statistical Regression
Life insurance promotion =
0.5909 (credit card insurance) -
0.5455 (sex) + 0.7727
![Page 24: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/24.jpg)
2.3 Association Rules
![Page 25: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/25.jpg)
An Association Rule for the Credit Card Promotion Database
IF Sex = Female & Age = over40 &
Credit Card Insurance = No
THEN Life Insurance Promotion = Yes
![Page 26: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/26.jpg)
2.4 Clustering Techniques
![Page 27: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/27.jpg)
Figure 2.3 An unsupervised cluster of the credit card database
# Instances: 5Sex: Male => 3
Female => 2Age: 37.0Credit Card Insurance: Yes => 1
No => 4Life Insurance Promotion: Yes => 2
No => 3
Cluster 1
Cluster 2
Cluster 3
# Instances: 3Sex: Male => 3
Female => 0Age: 43.3Credit Card Insurance: Yes => 0
No => 3Life Insurance Promotion: Yes => 0
No => 3
# Instances: 7Sex: Male => 2
Female => 5Age: 39.9Credit Card Insurance: Yes => 2
No => 5Life Insurance Promotion: Yes => 7
No => 0
![Page 28: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/28.jpg)
2.5 Evaluating Performance
![Page 29: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/29.jpg)
Evaluating Supervised Learner Models
![Page 30: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/30.jpg)
Confusion Matrix
• A matrix used to summarize the results of a supervised classification.
• Entries along the main diagonal are correct classifications.
• Entries other than those on the main diagonal are classification errors.
![Page 31: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/31.jpg)
Table 2.5 • A Three-Class Confusion Matrix
Computed Decision
C1 C2 C3C1 C11 C12 C13
C2 C21 C22 C23
C3 C31 C32 C33
![Page 32: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/32.jpg)
Two-Class Error Analysis
![Page 33: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/33.jpg)
Table 2.6 • A Simple Confusion Matrix
Computed Computed
Accept Reject
Accept True FalseAccept Reject
Reject False TrueAccept Reject
![Page 34: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/34.jpg)
Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate
Model Computed Computed Model Computed ComputedA Accept Reject B Accept Reject
Accept 600 25 Accept 600 75Reject 75 300 Reject 25 300
![Page 35: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/35.jpg)
Evaluating Numeric Output
• Mean absolute error
• Mean squared error
• Root mean squared error
![Page 36: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/36.jpg)
Comparing Models by Measuring Lift
![Page 37: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/37.jpg)
Figure 2.4 Targeted vs. mass mailing
0
200
400
600
800
1000
1200
0 10 20 30 40 50 60 70 80 90 100
NumberResponding
% Sampled
![Page 38: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/38.jpg)
Computing Lift
)|(
)|(
PopulationCP
SampleCPLift
i
i
![Page 39: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/39.jpg)
Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model
No Computed Computed Ideal Computed ComputedModel Accept Reject Model Accept Reject
Accept 1,000 0 Accept 1,000 0Reject 99,000 0 Reject 0 99,000
![Page 40: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/40.jpg)
Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25
Model Computed Computed Model Computed ComputedX Accept Reject Y Accept Reject
Accept 540 460 Accept 450 550Reject 23,460 75,540 Reject 19,550 79,450
![Page 41: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.](https://reader038.fdocuments.in/reader038/viewer/2022102818/56649d265503460f949fdbdd/html5/thumbnails/41.jpg)
Unsupervised Model Evaluation