Post on 10-Apr-2018
8/8/2019 DataMining Overview
http://slidepdf.com/reader/full/datamining-overview 1/52
Data Mining Overview

Content
Case Data Mining - Supervised Learning
Case Data Mining - Unsupervised Learning
Definition
Applications
Techniques
Supervised Learning
Unsupervised Learning

Content (Cont)
DM as a Business Process
DM Methodology
References

Definition
Advanced methods for exploring and modeling relationships in large amounts of data (SAS)
"Process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques." (Gartner Group)

Definition (Cont)
Process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules
Since the middle of the 1900s, corporate data has increased by a factor of 100,000 owing to automated operations, opening up enormous opportunities to improve business decision making

Applications
Data Mining is useful when there is a large amount of data and something worth learning (i.e. the resulting knowledge is worth more money than it costs to discover)
Research
Process Improvement
Marketing
Customer Relationship Management (CRM)

Applications (Cont)
CRM (cont)
- Presenting a single image of the organization
- Keeping a single image of the customer
- Knowing the likes and dislikes of customers
- Anticipating their needs and exploiting them proactively
- Recognizing their displeasure and doing something before it is too late

Popular Applications (source: kdnuggets)

Techniques
Supervised Learning (Directed Knowledge Discovery)
Classification (e.g. assigning customers to predefined segments; discrete classes)
Estimation / Regression (e.g. value of real estate; continuous)
Prediction: Classification or Estimation for the future (e.g. which customers will close their account in 6 months)
Time Series Analysis

Techniques (Cont)
Unsupervised Learning (Undirected Knowledge Discovery)
Association Rules (Affinity Grouping): which things go together
Sequence Discovery: Association Rules based on time
Clustering: segmenting a diverse group into a number of similar groups / clusters
Dimension reduction
Summarization / Characterization / Generalization

Overview of Techniques - 1: Classification

Technique             Description
Logistic Regression   Predicts probability of success; gives subset selection of variables
Classification Tree   Gives a decision tree with rules of classification
Neural Network        Is very opaque but gives a higher level of accuracy in many situations
k-Nearest Neighbor    Groups cases into neighbors and assigns a class based on the majority of cases in a neighborhood
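The k-Nearest Neighbor row can be illustrated with a minimal sketch. The customer data, the feature meaning (monthly spend, visits) and the function name `knn_classify` are hypothetical, and plain Euclidean distance is an assumption; real tools usually scale the variables first.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority class among the k training cases nearest to query.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    by_distance = sorted(train, key=lambda case: math.dist(case[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical cases: (monthly spend, visits per month) -> customer segment
train = [((10, 1), "low"), ((12, 2), "low"), ((11, 1), "low"),
         ((50, 9), "high"), ((55, 8), "high"), ((60, 10), "high")]

print(knn_classify(train, (52, 9)))  # high
```

With k = 3 the three nearest cases to (52, 9) are all in the "high" segment, so the vote is unanimous; an odd k avoids ties in two-class problems.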

Illustrative Applications - Classification
Target Marketing
Attrition Prediction / Churn Analysis
Fraud Detection
Credit Scoring
Predicting, for every case, which class it belongs to (or its probability of success) based on the data of its predictor variables

Overview of Techniques - 2: Prediction

Technique                    Description
Multiple Linear Regression   Gives predicted values based on a regression model
Regression Tree              Gives a decision tree with rules of prediction
k-Nearest Neighbors          Groups cases into neighbors and assigns a value based on the average of the cases in a neighborhood
Neural Network
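As a rough illustration of prediction from a regression model, the sketch below fits a least-squares line with a single predictor, the simplest special case of multiple linear regression. The real-estate numbers and the name `fit_line` are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: floor area vs. price, constructed so that price = 2 * area + 10
area = [50, 80, 100, 120]
price = [110, 170, 210, 250]

intercept, slope = fit_line(area, price)
predicted = intercept + slope * 90  # predicted value for a new case
print(round(intercept, 6), round(slope, 6), round(predicted, 6))
```

With several predictors the same idea is solved as a system of normal equations, which statistical packages do internally.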

Illustrative Applications - Prediction
Forecasting sales
Predicting price fluctuations
Predicting profitability of business units
Predicting market value of assets
Predicting yield or consumption of critical inputs
Predicting, for every case, a value based on the data of its predictor variables

Overview of Techniques - 3: Clustering and Dimension Reduction

Technique                 Description
k-Means Clustering        For a given number of clusters (the k value), develops clusters based on minimum distance between the cluster centers and the cases in the cluster.
Hierarchical Clustering   Builds clusters through successive steps by grouping the least dissimilar cases, finally creating a single cluster. The user can choose the number of clusters corresponding to a distance measure.
Principal Components      Creates new variables, called Principal Components, that are uncorrelated and that explain the majority of the variability in the original data.
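The k-Means description can be sketched as follows. This is an illustrative toy, not a production implementation: the points are hypothetical, and seeding the centers with the first k points is a simplifying assumption (real tools use random or smarter initialization).

```python
import math

def k_means(points, k, iterations=10):
    """Alternate between assigning cases to the nearest center and re-centering."""
    centers = list(points[:k])  # assumption: naive initialization with the first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, clusters

# Hypothetical cases forming two obvious groups
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centers, clusters = k_means(points, k=2)
print(centers)
print(clusters)
```

Each case ends up in the cluster whose center is closest, which is the "minimum distance" criterion from the table.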

Dimension Reduction
When there are many dimensions (predictors), say 20, 30 or 50
Or when several predictors are correlated
Develop new variables that:
- Explain the major portion of variability in data, and
- Are uncorrelated
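For just two correlated predictors, the idea can be worked through by hand. The sketch below (hypothetical data, illustrative function name) computes the variances of the two principal components as the eigenvalues of the 2 x 2 covariance matrix, using the closed-form solution for a symmetric matrix, and reports the share of variability explained by the first component.

```python
import math

def principal_component_variances(xs, ys):
    """Return (variance of PC1, variance of PC2, share explained by PC1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]]
    middle = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    first, second = middle + spread, middle - spread
    return first, second, first / (first + second)

# Hypothetical predictors where ys is exactly twice xs (perfectly correlated)
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
first, second, share = principal_component_variances(xs, ys)
print(round(share, 6))  # 1.0
```

Because the two predictors here carry the same information, a single new variable explains all of the variability and the second can be dropped.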

Illustrative Applications - Clustering
Market segmentation
Product grouping based on customer preferences
Grouping of business units based on performance parameters
Grouping channel partners based on performance parameters
Grouping of homogeneous cases based on the data of predefined variables

Overview of Techniques - 4: Market Basket Analysis / Affinity

Technique           Description
Association Rules   Gives a prediction of combinations of events that will occur together, based on past occurrences

Illustrative Applications - Market Basket
Cross selling
Product placement in a store
Forecasting sales
Predicting events that occur together as antecedents and consequents, with a certain level of confidence and support (number of events)
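Support and confidence for a rule "antecedent -> consequent" can be computed directly from a list of baskets. The sketch below uses hypothetical baskets; support is expressed here as a fraction of all baskets, although it is sometimes reported as a raw count.

```python
def support_and_confidence(baskets, antecedent, consequent):
    """Support: share of baskets containing both sides of the rule.
    Confidence: share of baskets with the antecedent that also contain the consequent.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for b in baskets if antecedent <= set(b))
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= set(b))
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical market baskets
baskets = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread"}, {"milk"}]

support, confidence = support_and_confidence(baskets, {"bread"}, {"butter"})
print(support, round(confidence, 3))  # 0.5 0.667
```

Here bread and butter appear together in 2 of 4 baskets (support 0.5), and 2 of the 3 bread baskets also contain butter (confidence about 0.67).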

DM as Business Process
Identifying the business problem (and how business benefits will be measured)
- Planning a direct marketing campaign for a new product
- Understanding customer attrition
Mining data to transform data into actionable information
- Who is more likely to buy the product?
- Which customers are likely to leave? Are they worth keeping?

DM as Business Process (Cont)
Acting on the information
- Contacting more likely customers
- Offering special services to valuable customers likely to leave
Measuring the results
- Actual business benefits achieved, as defined earlier

DM Methodology
Why Methodology?
Avoid learning that is not true
Avoid learning that is true but not useful

Learning that is not true
Incorrect data
Data may not be relevant (business situation has changed)
Summarization of data may have destroyed important information (Fig 3.1 pg 47)
With a small volume of data, patterns may emerge by chance (e.g. when India does well in cricket, the Sensex goes up)
Model set may not reflect the relevant population (e.g. an "Issue of Credit" model built only on persons who were given credit; a poll conducted on the Web)

Learning that is true but not useful
Learning that is already known: people in areas with no cell coverage do not buy cell phones
Learning that cannot be used: product sales are related to weather (can you change the weather?). Bad credit history may be predictive of more insurance claims, but regulators may prohibit usage of such information

DM Methodology - 11 Steps
Step 1: Translate business problem into DM problem
State in specific terms (i.e. instead of "gaining insight into customer behavior", identify customers who are unlikely to renew their subscription)
Determine type of problem (Classification, Clustering, etc.)
Decide how results will be used
- Contact high risk / high value customers and try to lure them with an offer
- Forecast customer population in future months

DM Methodology - 11 Steps (cont)
Step 2: Select appropriate Data
Input variables
- Which ones?
  Ignore input columns with only one value
  Ignore input columns with a unique value for each row (e.g. customer name)
  Choose only one column out of two having high correlation (e.g. Age_Difference and Age_Ratio)
- What should it contain: examples of all possible outcomes
- Availability
  Ideally from the DW (data warehouse), if present, but may need to be supplemented
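The first two screening rules for input columns lend themselves to a quick check. In this sketch the rows, the column names and `screen_columns` are hypothetical; restricting the "unique per row" rule to text columns is an assumption, made so that continuous numeric columns (which are often unique per row) are not discarded.

```python
def screen_columns(rows):
    """Return the names of columns that pass the two basic screening rules."""
    keep = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        distinct = set(values)
        if len(distinct) == 1:
            continue  # only one value: carries no information
        is_text = all(isinstance(v, str) for v in values)
        if is_text and len(distinct) == len(rows):
            continue  # identifier-like text column, e.g. customer name
        keep.append(name)
    return keep

# Hypothetical model set
rows = [
    {"customer_name": "Asha", "country": "IN", "segment": "gold", "age": 31},
    {"customer_name": "Ravi", "country": "IN", "segment": "silver", "age": 45},
    {"customer_name": "Meera", "country": "IN", "segment": "gold", "age": 52},
    {"customer_name": "John", "country": "IN", "segment": "silver", "age": 38},
]

print(screen_columns(rows))  # ['segment', 'age']
```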

DM Methodology - 11 Steps (cont)
Step 2: Select appropriate Data (Cont)
Input variables (Cont)
- How many?
  Do not eliminate at this stage; this needs to be done later on
- How much data?
  The more the merrier
  Needs to be optimized w.r.t. the cost involved in processing, etc. (Rule: if doubling the size does not improve the result much, stop)
- How much history?
  Seasonality? (Consider seasonality. Data that is too old may not be relevant. Typically 2 - 3 years for CRM)

DM Methodology - 11 Steps (cont)
Step 3: Get to know Data
Data Type
Descriptive statistics
Validation (Why were so many customers born in 1911? Are they really that old?)

Data Type
Columns: Categorical vs Continuous
- Categorical: takes discrete values (# of children, Marital Status)
- Continuous: takes continuous values (Income)
Unordered vs Ordered Columns
- Unordered: (Marital Status, Sex)
- Ordered: Rank (e.g. "Low", "High")
- Ordered: Interval (e.g. Temperature)
- Ordered: True Numeric (e.g. Sales in Rs., Weight)

Descriptive Statistics
We can get a general idea about the way data are distributed

Alcohol
Mean                        13.00
Standard Error               0.06
Median                      13.05
Mode                        13.05
Standard Deviation           0.81
Sample Variance              0.65
Kurtosis                    -0.85
Skewness                    -0.05
Range                        3.80
Minimum                     11.03
Maximum                     14.83
Sum                       2314.11
Count                         178
Largest(1)                  14.83
Smallest(1)                 11.03
Confidence Level (95.0%)     0.12
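Most of these figures can be reproduced with Python's standard `statistics` module. The sketch below uses a small hypothetical sample (the wine data itself is not included in the slides) and then checks one relationship on the slide's own numbers: the standard error is the standard deviation divided by the square root of the count.

```python
import math
import statistics

def describe(values):
    """A subset of the descriptive statistics listed in the table."""
    sd = statistics.stdev(values)  # sample standard deviation
    return {
        "mean": statistics.mean(values),
        "standard_error": sd / math.sqrt(len(values)),
        "median": statistics.median(values),
        "standard_deviation": sd,
        "sample_variance": statistics.variance(values),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": len(values),
    }

# Hypothetical sample, not the wine data
stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(stats["mean"], stats["median"], stats["range"])  # 5 4.5 7

# Consistency check using the figures in the table: 0.81 / sqrt(178)
slide_standard_error = 0.81 / math.sqrt(178)
print(round(slide_standard_error, 2))  # 0.06
```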

Data Visualization
We can study data distribution using a histogram
[Figure: three histograms of alcohol content with cumulative % curves: "Histogram - All Types of Wines", "Histogram - Type A Wine", "Histogram - Type B Wine". X-axis: Bin - Alcohol Content (11.5, 12.5, 13.5, 14.5, More); left y-axis: Frequency; right y-axis: Cumulative %]
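The frequency and cumulative % series behind such a histogram can be computed as sketched below. The bin edges follow the axis labels of the figure (11.5, 12.5, 13.5, 14.5, More); the alcohol readings are hypothetical, and treating each edge as an inclusive upper bound is an assumption borrowed from common spreadsheet behavior.

```python
def histogram(values, upper_edges):
    """Count values per bin (value <= edge), with a final 'More' bin,
    and return the counts together with the cumulative percentages."""
    counts = [0] * (len(upper_edges) + 1)
    for v in values:
        for i, edge in enumerate(upper_edges):
            if v <= edge:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # larger than every edge: the 'More' bin
    cumulative, running = [], 0
    for c in counts:
        running += c
        cumulative.append(100.0 * running / len(values))
    return counts, cumulative

# Hypothetical alcohol readings
alcohol = [11.0, 11.8, 12.4, 12.9, 13.1, 13.6, 14.2, 14.8]
counts, cumulative = histogram(alcohol, [11.5, 12.5, 13.5, 14.5])
print(counts)      # [1, 2, 2, 2, 1]
print(cumulative)  # [12.5, 37.5, 62.5, 87.5, 100.0]
```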

Data Visualization (Cont)
Visual presentation of data (e.g. graphs like bar chart, X-Y plot of two variables, scatter chart, etc.)
Correlation between variables

Validation
Incorrect Values: Reasons
- Transcription error
- Laziness (forced entry for birth date: many were born on November 11, 1911!!)
- Programming error (value of the previous field gets entered in this field)
- Old code and new code coexist!
- Collected wrongly (time zone not considered)

Validation (Cont)
Incorrect Values: Reasons
- Stored incorrectly (numeric instead of character type)
"My data must be clean because no human being has touched it manually" .. One CEO
Result: 50% of the data was wrong, because no human being touched the system clocks on the computers!

DM Methodology - 11 Steps (cont)
Step 4: Create a model set
Sampling
- Proportionate (including multiple time frames)
- Over sampling
Partitioning
- Training
- Validation
- Test
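Partitioning the model set can be sketched as a shuffled three-way split. The 60/20/20 proportions, the fixed seed and the function name are illustrative assumptions, not values given in the slides.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle the rows and split them into training, validation and test sets.
    Whatever is left after training and validation becomes the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_validation],
            shuffled[n_train + n_validation:])

# Hypothetical model set of 100 case ids
training, validation_set, test_set = partition(range(100))
print(len(training), len(validation_set), len(test_set))  # 60 20 20
```

The training set is used to build the model, the validation set to tune or choose between models, and the test set to estimate performance on unseen data.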

DM Methodology - 11 Steps (cont)
Step 5: Fix Problems with Data
Correct Errors
Missing Values
Outliers

Missing Data
Reasons
- "Missing data" might itself be important information (e.g. not providing a TN (telephone number) may mean "do not bother me by calling"). Keep a flag

Missing Data
Reasons (Cont)
- Nature of problem (e.g. new customers do not have 12 months of history data). Build a separate model for those
- Sources not providing data (e.g. external vendor not able to provide certain data). Replace by another derived value / build a separate model
- Data was never collected

Missing Data
What to do?
- Do nothing
- Filter rows (introduces bias)
- Ignore column
- Predict new value
- Build separate model
- Modify operational systems to collect the data

Missing Data Correction
Delete record
Problems
- Too many rows thrown out
- Bias introduced (all persons not wanting to state "Salary" are left out)
Replace values with:
- Mode
- Mean (Local / Global)
- Median
- User specified value
Will replacement create problems?
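Replacement by mean, median or mode can be sketched with the standard library. The salary values and the function name `impute` are hypothetical; the function also returns a flag per row, following the earlier advice to keep a flag because missingness may itself be informative.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None with the mean, median or mode of the values that are present.
    Returns the filled list and a parallel list of was-missing flags."""
    present = [v for v in values if v is not None]
    fill_functions = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](present)
    was_missing = [v is None for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, was_missing

# Hypothetical salaries with two missing entries and one very large value
salary = [30, None, 50, 40, None, 200]

filled, was_missing = impute(salary, "median")
print(filled)                     # [30, 45.0, 50, 40, 45.0, 200]
print(impute(salary, "mean")[0])  # [30, 80, 50, 40, 80, 200]
```

The example also hints at why replacement can create problems: the single large salary pulls the mean to 80, well above most observed values, while the median stays at 45.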

Outliers
Outliers are cases that contain an unusually high or low data value in a variable.
Such records unduly influence the model. If they are not a natural occurrence, they should be removed
Treatment depends upon the algorithm chosen (Decision tree: no problem. Clustering: define a separate cluster. Some cases: remove / replace with Max / Min)
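The "replace with Max / Min" treatment amounts to capping values at chosen bounds. In this sketch the ages and the bounds of 18 and 90 are hypothetical domain choices; how the bounds are picked is outside what the slides specify.

```python
def cap_outliers(values, lower, upper):
    """Replace values below lower with lower and values above upper with upper.
    Also return flags marking which values were changed."""
    capped = [min(max(v, lower), upper) for v in values]
    changed = [v != c for v, c in zip(values, capped)]
    return capped, changed

# Hypothetical customer ages with two implausible entries
ages = [34, 29, 41, 107, 38, -1]
capped, changed = cap_outliers(ages, lower=18, upper=90)
print(capped)   # [34, 29, 41, 90, 38, 18]
print(changed)  # [False, False, False, True, False, True]
```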

DM Methodology - 11 Steps (cont)
Step 6: Transform Data
Normalization
Transforming
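The slides do not say which normalization is meant; two common choices, shown here as an illustrative sketch on made-up values, are min-max scaling to the range 0..1 and z-score standardization.

```python
import statistics

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Express each value as a number of sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical variable
values = [10, 20, 30]
print(min_max(values))  # [0.0, 0.5, 1.0]
print(z_score(values))  # [-1.0, 0.0, 1.0]
```

Either form puts variables measured in different units on a comparable scale, which matters for distance-based techniques such as k-Nearest Neighbor and clustering.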

Transformation
Derived Variables
Create derived variables that represent something in the real world (e.g. Passenger * Miles)

Transformation (Cont)
Extracting Information from a column / Transformation
Date: Holiday / Working Day (e.g. 26 Jan and 15 Aug are holidays)
Date: Festive Season / Normal Season
Time: Peak Hour / Off-peak Hour
Telephone Number: Landline / Mobile
Address: Single House / Multi-unit dwelling
Categorize continuous data (e.g. Income)
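Two of these transformations are sketched below: deriving Holiday / Working Day from a date, and categorizing a continuous income. The two holidays come from the slide; counting weekends as holidays and the income thresholds are assumptions made only for the example.

```python
from datetime import date

NATIONAL_HOLIDAYS = {(1, 26), (8, 15)}  # 26 Jan and 15 Aug, as (month, day)

def day_type(d):
    """Classify a date as 'Holiday' or 'Working Day'.
    Assumption: Saturdays and Sundays also count as holidays."""
    if (d.month, d.day) in NATIONAL_HOLIDAYS or d.weekday() >= 5:
        return "Holiday"
    return "Working Day"

def income_band(income):
    """Categorize a continuous income using hypothetical thresholds."""
    if income < 300_000:
        return "Low"
    if income < 1_000_000:
        return "Medium"
    return "High"

print(day_type(date(2019, 8, 15)))  # Holiday
print(day_type(date(2019, 8, 14)))  # Working Day
print(income_band(450_000))         # Medium
```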

DM Methodology - 11 Steps (cont)
Step 7: Build Model
Choose one or more techniques
Step 8: Assess Models
Some errors are more serious than others
Confusion Matrix
Lift
RMS (root mean square) error
Ratio of intra-cluster to inter-cluster distance
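Three of these assessment measures can be computed in a few lines. The churn labels below are hypothetical, and lift is taken here in one common sense: the response rate among the cases the model selects, divided by the overall response rate.

```python
import math
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual class, predicted class) combination."""
    return Counter(zip(actual, predicted))

def lift(actual, predicted, positive):
    """Rate of positives among cases predicted positive, relative to the overall rate."""
    selected = [a for a, p in zip(actual, predicted) if p == positive]
    rate_selected = selected.count(positive) / len(selected)
    rate_overall = list(actual).count(positive) / len(actual)
    return rate_selected / rate_overall

def rms_error(actual, predicted):
    """Root mean square error for numeric predictions."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical classification results
actual = ["churn", "stay", "stay", "churn", "stay", "stay", "stay", "churn"]
predicted = ["churn", "stay", "churn", "churn", "stay", "stay", "stay", "stay"]

matrix = confusion_matrix(actual, predicted)
print(matrix[("churn", "churn")], matrix[("churn", "stay")])  # 2 1
print(round(lift(actual, predicted, "churn"), 3))             # 1.778
print(round(rms_error([3, 5], [1, 5]), 3))                    # 1.414
```

The two off-diagonal cells of the confusion matrix (a missed churner versus a false alarm) usually carry different business costs, which is the point of "some errors are more serious than others".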

DM Methodology - 11 Steps (cont)
Step 9: Deploy Model
Choose one or more techniques
Step 10: Assess Results
Example:
What was the cost of the direct marketing campaign? (Including DM cost)
What were the benefits?
Step 11: Begin Again
Things change over time
Better way of handling

DM and KDD
KDD (Knowledge Discovery in Databases) and DM are used interchangeably.
Some prefer to differentiate. KDD consists of:
Selection: sourcing data
Preprocessing: correcting erroneous data, handling missing data
Transformation: transforming data to more usable formats
Data Mining: applying various algorithms
Presentation / Interpretation / Evaluation of data

SEMMA Methodology (SAS)
Sample: sample from data sets; partition into Training, Validation and Test datasets
Explore: explore the data set statistically and graphically
Modify: transform variables, impute missing values
Model: fit predictive models, e.g. regression, tree, collaborative filtering
Assess: compare models

Miscellaneous
Data Mining Issues
Human Interaction
Overfitting
Outliers
Interpretation of Results
Visualization of Results
Large Datasets (some algorithms do not scale; use sampling or parallel processing)

Miscellaneous (Cont)
Data Mining Issues (Cont)
High Dimensionality
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration of KDD in traditional DBMS systems
Applications

Miscellaneous (Cont)
Future
Data Mining Query Language (DMQL) based on SQL
DMQL should bring out
- Generalized Relation: obtained by generalizing data from input data
- Characteristic Rule: condition satisfied by almost all records in the target class
- Discriminant Rule: condition satisfied by the target class but not by other classes
- Classification Rule: used to classify data

References
1. Michael Berry, Gordon Linoff, "Mastering Data Mining", Wiley Publications (Ch 1, 3, 5, 6, 7)
2. Michael Berry, Gordon Linoff, "Data Mining Techniques", Wiley Publications (Ch 7 - Overview of Data Mining Techniques)
3. Margaret Dunham, "Data Mining - Introductory and Advanced Topics", Pearson Education (Ch 1, 2, 3)