Chapter 2 Data Mining Processes and Knowledge Discovery Identify actionable results.
Chapter 2: Data Mining Processes and Knowledge Discovery
Identify actionable results
2-2
Contents
• Describes the Cross-Industry Standard Process for Data Mining (CRISP-DM), a set of phases that can be used in data mining studies
• Discusses each phase in detail
• Gives an example illustration
• Discusses a knowledge discovery process
2-3
CRISP-DM
Cross-Industry Standard Process for Data Mining
• One of the first comprehensive attempts at a standard process model for data mining
• Independent of industry sector and technology
2-4
CRISP-DM Phases
A systematic process to try to make sense of the massive amounts of data generated from daily operations.
1. Business (or problem) understanding
2. Data understanding
3. Data preparation
  • Transform and create the data set for modeling
4. Modeling
5. Evaluation
  • Check for good models; evaluate to ensure nothing is missing
6. Deployment
2-5
Business Understanding
• Solve a specific problem
  • Determine business objectives, assess the current situation, establish data mining goals, and develop a project plan.
• A clear definition helps
  • Measurable success criteria
• Convert business objectives to a set of data mining goals
  • What to achieve in technical terms, such as:
    • What types of customers are interested in each of our products?
    • What are typical profiles of customers …
2-6
Data Understanding
Initial data collection, data description, data exploration, and the verification of data quality.
Three issues considered in data selection:
1. Set up a concise and clear description of the problem. For example, a retail DM project may seek to identify the spending behaviors of female shoppers who purchase seasonal clothes.
2. Identify the relevant data for the problem description, such as demographic, credit card transactional, and financial data…
3. Select the variables that are relevant and important for the project.
2-7
Data Understanding (cont.)
Data types:
• Demographic data (income, education, age …)
• Socio-graphic data (hobby, club membership, …)
• Transactional data (sales records, credit card spending …)
• Quantitative data: measurable using numerical values
• Qualitative data: also known as categorical data; contains both nominal and ordinal data (see also page 22)
Related data can come from many sources:
• Internal
  • ERP (or MIS)
  • Data warehouse
• External
  • Government data
  • Commercial data
• Created
  • Research
2-8
Data Preparation
Once the available data sources are identified, the data need to be selected, cleaned, built into the desired form, and formatted.
• Clean data: fix formats, fill gaps, filter outliers and redundancies (see page 22)
• Unified numerical scales
  • Nominal data
    • Code (such as gender data: male and female)
  • Ordinal data
    • Nominal code or scale (excellent, fair, poor)
  • Cardinal data (categorical: A, B, C levels)
2-9
Types of Data
Type         Features     Synonyms
Numerical    Continuous   Range
             Integer      Range
Binary       Yes/No       Flag
Categorical  Finite       Set
Date/Time                 Range
String                    Typeless
Text                      String
• Range: numeric values (integer, real, or date/time)
• Set: data with multiple distinct values (numeric, string, or date/time)
• Typeless: for other types of data
2-10
Data Preparation (Cont.)
• Several statistical methods and visualization tools can be used to preprocess the selected data.
  • Statistics such as max, min, mean, and mode can be used to aggregate or smooth the data.
  • Scatter plots and box plots can be used to filter outliers.
  • More advanced techniques, such as regression analysis, cluster analysis, decision trees, or hierarchical analysis, may be applied in data preprocessing.
• In some cases, data preprocessing can take over 50% of the time of the entire data mining process.
  • Shortening data processing time can reduce much of the total computation time in data mining.
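The box-plot idea for filtering outliers can be sketched in a few lines. This is only an illustration, assuming the common 1.5 x IQR whisker rule (the slides do not name a specific rule); the function name `filter_outliers_iqr` and the sample amounts are hypothetical.

```python
import statistics


def filter_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whiskers."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical purchase amounts; 95 lies far outside the whiskers.
amounts = [10, 12, 11, 13, 12, 11, 95]
cleaned = filter_outliers_iqr(amounts)
print(cleaned)
```

The multiplier `k` is a tuning choice: a larger value keeps more extreme points, a smaller one filters more aggressively.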
2-11
Data Preparation: data transformation
Data transformation uses simple mathematical formulations or learning curves to convert the different measurements of the selected, cleaned data into a unified numerical scale for data analysis.
Data transformation can be used to:
1. Transform from numerical to numerical scales, to shrink or enlarge the given data. For example, (x - min)/(max - min) shrinks the data into the interval [0, 1].
2. Recode categorical data to numerical scales. Categorical data can be ordinal (less, moderate, strong) or nominal (red, yellow, blue, ...). For example, 1 = yes, 0 = no.
See page 24 for more details.
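The two transformations on this slide can be sketched as follows. The helper names `min_max_scale` and `recode`, the sample incomes, and the 1/2/3 codes for the ordinal scale are assumptions for illustration only.

```python
def min_max_scale(values):
    """Rescale numbers into [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def recode(values, mapping):
    """Replace each categorical label with its numeric code."""
    return [mapping[v] for v in values]


# Hypothetical incomes (in thousands).
incomes = [20, 35, 50, 80]
scaled = min_max_scale(incomes)

# Nominal yes/no answers recoded as 1/0, as on the slide.
flags = recode(["yes", "no", "yes"], {"yes": 1, "no": 0})

# Ordinal labels keep their order in the numeric codes (codes assumed here).
strength = recode(["less", "strong", "moderate"],
                  {"less": 1, "moderate": 2, "strong": 3})

print(scaled, flags, strength)
```

For nominal data such as colors, the numeric codes carry no order, so a model that treats them as magnitudes may be misled; that caveat does not apply to the ordinal case.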
2-12
Modeling
Data modeling is where the data mining software is used to generate results for various situations. Data visualization and cluster analysis are useful for initial analysis.
Depending on the data type:
1. If the task is to group data, discriminant analysis is applied.
2. If the purpose is estimation, regression is appropriate if the data are continuous (and logistic regression if they are not).
3. Neural networks could be applied for both tasks.
Data treatment:
• Training set for development of the model.
• Test set for testing the model that is built.
• Maybe others for refining the model.
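A training/test split can be sketched as below. This is a minimal illustration, assuming a random shuffle and a two-thirds training share; the function name `train_test_split`, the seed, and the placeholder case labels are hypothetical.

```python
import random


def train_test_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical case labels standing in for real records.
cases = [f"case_{i}" for i in range(1, 16)]
train, test = train_test_split(cases)
print(len(train), len(test))
```

A third slice for refining the model could be carved out the same way by cutting the shuffled list at two points instead of one.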
2-13
Data mining techniques
Techniques:
• Association: the relationship of a particular item in a data transaction to other items in the same transaction is used to predict patterns. See also page 25 for an example.
• Classification: the methods are intended for learning different functions that map each item of the selected data into one of a predefined set of classes. Two key research problems related to classification results are the evaluation of misclassification and prediction power (C4.5).
  • Mathematical modeling is often used to construct classification methods, such as binary decision trees (CART), neural networks (nonlinear), linear programming (boundary), and statistics.
See also pages 25 and 26 for more explanation.
2-14
Data mining techniques (Cont.)
• Clustering: takes ungrouped data and uses automatic techniques to put the data into groups.
  • Clustering is unsupervised and does not require a learning set. (Chapter 5)
• Prediction: related to regression techniques; discovers the relationship between the dependent and independent variables.
• Sequential patterns: seeks to find similar patterns in data transactions over a business period.
  • The mathematical models behind sequential patterns are logic rules, fuzzy logic, and so on.
• Similar time sequences: applied to discover sequences similar to a known sequence over both past and current business periods.
2-15
Evaluation
• Does the model meet business objectives?
• Are any important business objectives not addressed?
• Does the model make sense?
• Is the model actionable?
[Figure: PDCA cycle shown alongside CRISP-DM]
2-16
Deployment
• DM can be used to verify previously held hypotheses or for knowledge discovery.
• DM models can be applied to business purposes, including prediction or identification of key situations.
• Ongoing monitoring and maintenance
  • Evaluate performance against success criteria
  • Market reaction and competitor changes (remodel or fine-tune)
2-17
Example
• Training set for computer purchase
  • 16 records
  • 5 attributes
• Goal
  • Find a classifier for consumer behavior
2-18
Database (1st half)
Case Age Income Student Credit Gender Buy?
A1 31-40 High No Fair Male Yes
A2 >40 Medium No Fair Female Yes
A3 >40 Low Yes Fair Female Yes
A4 31-40 Low Yes Excellent Female Yes
A5 ≤30 Low Yes Fair Female Yes
A6 >40 Medium Yes Fair Male Yes
A7 ≤30 Medium Yes Excellent Male Yes
A8 31-40 Medium No Excellent Male Yes
2-19
Database (2nd half)
Case Age Income Student Credit Gender Buy?
A9 31-40 High Yes Fair Male Yes
A10 ≤30 High No Fair Male No
A11 ≤30 High No Excellent Female No
A12 >40 Low Yes Excellent Female No
A13 ≤30 Medium No Fair Male No
A14 >40 Medium No Excellent Female No
A15 ≤30 Unknown No Fair Male Yes
A16 >40 Medium No N/A Female No
2-20
Data Selection
• Gender has a weak relationship with purchase
  • Based on correlation
  • Drop gender
• Selected attribute set: {Age, Income, Student, Credit}
2-21
Data Preprocessing
• Income unknown in Case 15
• Credit not available in Case 16
• Drop these noisy cases
2-22
Data Transformation
Assign numerical values to each attribute:
• Age: ≤30 = 3, 31-40 = 2, >40 = 1
• Income: High = 3, Medium = 2, Low = 1
• Student: Yes = 2, No = 1
• Credit: Excellent = 2, Fair = 1
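The coding scheme above can be applied mechanically with lookup tables. The sketch below encodes three of the training cases (A1, A5, A12) as they appear in the database slides; the dictionary and function names are hypothetical.

```python
# Lookup tables taken from the coding on this slide.
AGE = {"≤30": 3, "31-40": 2, ">40": 1}
INCOME = {"High": 3, "Medium": 2, "Low": 1}
STUDENT = {"Yes": 2, "No": 1}
CREDIT = {"Excellent": 2, "Fair": 1}


def encode(age, income, student, credit):
    """Turn one record's four selected attributes into a numeric tuple."""
    return (AGE[age], INCOME[income], STUDENT[student], CREDIT[credit])


# Three cases from the training database, with Gender already dropped.
cases = {
    "A1": ("31-40", "High", "No", "Fair"),
    "A5": ("≤30", "Low", "Yes", "Fair"),
    "A12": (">40", "Low", "Yes", "Excellent"),
}
encoded = {name: encode(*attrs) for name, attrs in cases.items()}
print(encoded)
```

A record with a value outside the tables (such as the "Unknown" income in Case 15) raises a `KeyError`, which is consistent with dropping those cases before transformation.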
2-23
Data Mining
• Categorize output
  • Buys = C1
  • Doesn’t buy = C2
• Conduct analysis
  • Model says A8, A10 don’t buy; rest do
  • Of the actual yes, 7 correct and 1 not
  • Of the actual no, 2 correct
• Confusion matrix
2-24
Data Interpretation and Test Data Set
Test on independent data
Case Actual Model
B1 Yes Yes (1)
B2 Yes Yes (2)
B3 Yes Yes (3)
B4 Yes Yes (4)
B5 Yes Yes (5)
B6 Yes Yes (6)
B7 Yes Yes (7)
B8 (do not) No No
B9 No Yes
B10 (do not) No No
2-25
Confusion Matrix
             Model Buy   Model Not   Totals
Actual Buy   7           0           7
Actual Not   1           2           3
Totals       8           2           10
2-26
Measures
• Correct classification rate
  • 9/10 = 0.90
• Cost function
  • Cost of error:
    • Model says buy, actual no: $20
    • Model says no, actual buy: $200
  • 1 x $20 + 0 x $200 = $20
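Both measures can be recomputed from the ten test cases B1 to B10. The sketch below uses the actual and model outcomes from the test table and the two error costs from this slide; the variable names are hypothetical.

```python
from collections import Counter

# Actual and model outcomes for test cases B1..B10, in order.
actual = ["Yes"] * 7 + ["No", "No", "No"]
model = ["Yes"] * 7 + ["No", "Yes", "No"]

# Count each (actual, model) pair; this is the confusion matrix.
matrix = Counter(zip(actual, model))

correct = matrix[("Yes", "Yes")] + matrix[("No", "No")]
rate = correct / len(actual)

# Error costs from the slide.
COST_FALSE_BUY = 20    # model says buy, actual no
COST_MISSED_BUY = 200  # model says no, actual buy
cost = (matrix[("No", "Yes")] * COST_FALSE_BUY
        + matrix[("Yes", "No")] * COST_MISSED_BUY)

print(rate, cost)
```

Because a missed buyer costs ten times as much as a false buy, two models with the same classification rate can have very different total costs.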
2-27
Goals
• Avoid broad concepts:
  • Gain insight; discover meaningful patterns; learn interesting things
  • Can’t measure attainment
• Narrow and specify:
  • Identify customers likely to renew; reduce churn
  • Rank order by propensity (favor) to…
2-28
Goals
• Description: what is
  • understand
  • explain
  • discover knowledge
• Prescription: what should be done
  • classify
  • predict
2-29
Goal
• Method A: four rules, explains 70%
• Method B: fifty rules, explains 72%
• BEST?
  • Gain understanding: Method A better (minimum description length, MDL)
  • Reduce cost of mailing: Method B better
2-30
Measurement
• Accuracy
  • How well does the model describe the observed data?
  • Confidence levels: the proportion of the time between lower and upper limits
• Comprehensibility
  • Whole or parts?
2-31
Measuring Predictive
• Classification & prediction:
  • error rate = incorrect/total
  • requires that the evaluation set be representative
• Estimators
  • error = predicted - actual (summarized as MAD, MSE, MAPE)
  • variance = sum((predicted - actual)^2) / n
  • standard deviation = square root of variance
  • distance: how far off
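The three error summaries named on this slide can be computed as below. This is a sketch using the standard definitions (mean absolute deviation, mean squared error, mean absolute percentage error); the function name and the sample figures are hypothetical.

```python
import math


def error_measures(actual, predicted):
    """Return MAD, MSE, RMSE and MAPE for paired actual/predicted values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mad = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # square root of the mean squared error
    # MAPE is undefined when an actual value is zero.
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return {"MAD": mad, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Hypothetical actual and predicted values.
actual = [100, 200, 400]
predicted = [110, 190, 400]
measures = error_measures(actual, predicted)
print(measures)
```

MSE penalizes large misses more heavily than MAD, while MAPE expresses the error relative to the size of the actual value.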
2-32
Statistics
• Population: entire group studied
• Sample: subset from population
• Bias: difference between sample average and population average
  • mean, median, mode
  • distribution
  • significance
  • correlation, regression (hamming distance)
2-33
Classification Models
• LIFT = probability in class by sample divided by probability in class by population
  • If the population probability is 20% and the sample probability is 30%, LIFT = 0.3/0.2 = 1.5
• Best lift is not necessarily best: need sufficient sample size as confidence increases.
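The lift calculation is a single ratio, sketched here with the slide's own figures. The function names and the response counts in the second call are hypothetical.

```python
def lift(sample_rate, population_rate):
    """Lift = response rate in the selected sample / rate in the population."""
    return sample_rate / population_rate


def lift_from_counts(sample_hits, sample_size, population_hits, population_size):
    """Same ratio, starting from raw counts instead of rates."""
    return lift(sample_hits / sample_size, population_hits / population_size)


# The slide's example: 30% in the sample against 20% in the population.
example = lift(0.3, 0.2)

# Hypothetical counts giving the same rates.
from_counts = lift_from_counts(30, 100, 200, 1000)

print(round(example, 2), round(from_counts, 2))
```

A lift above 1 means the selected sample responds at a higher rate than the population as a whole; a small sample can show a high lift by chance, which is why sample size matters.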
2-34
Lift Chart
[Figure: lift chart. x-axis: % mailed (0 to 100); y-axis: % responded (0 to 100); series: mailed, responded]
2-35
Measuring Impact
• Ideal: $ (NPV) because of expenditure
• Mass mailing may be better
• Depends on:
  • fixed cost
  • cost per recipient
  • cost per respondent
  • value of positive response
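How the four factors interact can be sketched with a simple profit calculation. The formula below is an assumption (the slides list the factors but give no formula), and every figure is hypothetical; it only shows why a mass mailing can beat a targeted one when a positive response is valuable enough.

```python
def campaign_profit(recipients, respondents, fixed_cost,
                    cost_per_recipient, cost_per_respondent, value_per_response):
    """Assumed formula: response margin minus mailing cost minus fixed cost."""
    margin = respondents * (value_per_response - cost_per_respondent)
    return margin - recipients * cost_per_recipient - fixed_cost


# Hypothetical campaigns: mass mailing at a 2% response rate,
# targeted mailing at 3% (a lift of 1.5).
costs = dict(fixed_cost=1000, cost_per_recipient=1, cost_per_respondent=5)

# Low value per response: targeting wins.
mass_low = campaign_profit(100_000, 2_000, value_per_response=50, **costs)
targeted_low = campaign_profit(20_000, 600, value_per_response=50, **costs)

# High value per response: mass mailing wins despite the lower response rate.
mass_high = campaign_profit(100_000, 2_000, value_per_response=100, **costs)
targeted_high = campaign_profit(20_000, 600, value_per_response=100, **costs)

print(mass_low, targeted_low, mass_high, targeted_high)
```

A fuller treatment would discount the cash flows to a net present value; this sketch ignores timing.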
2-36
Bottom Line
• Return on investment
2-37
Example Application
• Telephone industry
• Problem: unpaid bills
• Data mining used to develop models to predict nonpayment as early as possible
• See page 27
2-38
Knowledge Discovery Process
1. Data Selection
  • Learning the application domain
  • Creating target data set
2. Data Preprocessing
  • Data cleaning & preprocessing
3. Data Transformation
  • Data reduction & projection
4. Data Mining
  • Choosing function
  • Choosing algorithms
  • Data mining
5. Data Interpretation
  • Interpretation
  • Using discovered knowledge
2-39
1: Business Understanding
• Predict which customers would be insolvent
  • In time for the firm to take preventive measures (and avert losing good customers)
• Hypothesis:
  • Insolvent customers would change calling habits & phone usage during a critical period before & immediately after termination of the billing period
2-40
2: Data Understanding
• Static customer information available in files
  • Bills, payments, usage
• Used data warehouse to gather & organize data
  • Coded to protect customer privacy
2-41
Creating Target Data Set
• Customer files
  • Customer information
  • Disconnects
  • Reconnections
• Time-dependent data
  • Bills
  • Payments
  • Usage
• 100,000 customers over 17-month period
  • Stratified (hierarchical) sampling to assure all groups appropriately represented
2-42
3: Data Preparation
• Filtered out incomplete data
• Deleted inexpensive calls
  • Reduced data volume about 50%
• Low number of fraudulent cases
• Cross-checked with phone disconnects
• Lagged data made synchronization necessary
2-43
Data Reduction & Projection
• Information grouped by account
• Customer data aggregated by 2-week periods
• Discriminant analysis on 23 categories
• Calculated average owed by category (significant)
• Identified extra charges (significant)
• Investigated payment by installments (not significant)
2-44
Choosing Data Mining Function
• Classes:
  • Most possibly solvent (99.3%)
  • Most possibly insolvent (0.7%)
• Costs of error widely different
• New data set created through stratified sampling
  • Retained all insolvent
  • Altered distribution to 90% solvent
  • Used 2,066 cases total
• Critical period identified
  • Last 15 two-week periods before service interruption
• Variables defined by counting measures in two-week periods
  • 46 variables as candidate discriminant factors
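Rebalancing a rare class by stratified sampling can be sketched as follows: every insolvent case is kept and solvent cases are sampled down until they make up the target share. This is an illustration only, not the study's actual procedure; the function name, seed, and toy population are hypothetical.

```python
import random


def rebalance(majority, minority, majority_share=0.9, seed=0):
    """Keep every minority case; sample majority cases to hit the target share."""
    # Number of majority cases needed so they form `majority_share` of the result.
    wanted = round(len(minority) * majority_share / (1 - majority_share))
    wanted = min(wanted, len(majority))  # cannot sample more than exist
    rng = random.Random(seed)            # seeded so the sample is repeatable
    return rng.sample(majority, wanted) + list(minority)


# Toy population: 2,000 solvent and 20 insolvent customers (hypothetical).
solvent = [("solvent", i) for i in range(2000)]
insolvent = [("insolvent", i) for i in range(20)]

balanced = rebalance(solvent, insolvent)
n_insolvent = sum(1 for label, _ in balanced if label == "insolvent")
print(len(balanced), n_insolvent)
```

Raising the minority share this way gives the classifier enough insolvent examples to learn from, at the price of a training distribution that no longer matches the real one.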
2-45
4: Modeling
• Discriminant Analysis
  • Linear model
  • SPSS: stepwise forward selection
• Decision Trees
  • Rule-based classifier, C5, C4.5
• Neural Networks
  • Nonlinear model
2-46
Data Mining
• Training set about 2/3
  • Rest test
• Discriminant analysis
  • Used 17 variables
  • Equal costs: 0.875 correct
  • Unequal costs: 0.930 correct
• Rule-based: 0.952 correct
• Neural network: 0.929 correct
2-47
5: Evaluation
• 1st objective: maximize accuracy of predicting insolvent customers
  • Decision tree classifier best
• 2nd objective: minimize error rate for solvent customers
  • Neural network model close to decision tree
• Used all 3 on a case-by-case basis
2-48
Coincidence Matrix: Combined Models
                  Model insolvent   Model solvent   Unclass   Totals
Actual insolvent  19                17              28        64
Actual solvent    1                 626             27        654
Totals            20                643             55        718
2-49
6: Implementation
• Every customer examined using all 3 algorithms
  • If all 3 agreed, used that classification
  • If disagreement, categorized as unclassified
• Correct on test data: 0.898
  • Only 1 actually solvent customer would have been disconnected
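The unanimous-vote rule, and the 0.898 figure, can be sketched as below. The function name and the example votes are hypothetical; the counts come from the coincidence matrix for the combined models.

```python
def combine(votes):
    """Return the shared label if all models agree, else 'unclassified'."""
    return votes[0] if len(set(votes)) == 1 else "unclassified"


# Hypothetical votes from the three models for two customers.
agreed = combine(["solvent", "solvent", "solvent"])
split = combine(["solvent", "insolvent", "solvent"])

# Counts from the coincidence matrix: correct insolvent, correct solvent, total.
correct_insolvent, correct_solvent, total = 19, 626, 718
rate = (correct_insolvent + correct_solvent) / total

print(agreed, split, round(rate, 3))
```

Under this rule an unclassified case counts as not correct, so the rate reflects only the customers on whom all three models agreed and were right.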