DataMining Overview

8/8/2019 DataMining Overview


Content

Case Data Mining - Supervised Learning

Case Data Mining - Unsupervised Learning

Definition

Applications

Techniques

- Supervised Learning
- Unsupervised Learning

Content

DM as a Business Process

DM Methodology

References

Definition

Advanced methods for exploring and modeling relationships in large amounts of data (SAS)

"Process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques." (Gartner Group)

Definition (Cont)

Process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules

Since the middle of the 1900s, corporate data has increased by a factor of 100,000 because of automated operations, opening up enormous opportunities to improve business decision making

Applications

Data mining is useful when there is a large amount of data and something worth learning (i.e. the resulting knowledge is worth more money than it costs to discover)

Research

Process Improvement

Marketing

Customer Relationship Management (CRM)

Application (Cont)

CRM (cont)

- Presenting a single image of the organization
- Keeping a single image of the customer
- Knowing the likes and dislikes of customers
- Anticipating their needs and addressing them proactively
- Recognizing their displeasure and doing something before it is too late

Popular Applications (source: kdnuggets)

Techniques

Supervised Learning (Directed Knowledge Discovery)

Classification (e.g. assigning customers to predefined segments; discrete classes)

Estimation / Regression (e.g. value of real estate; continuous)

Prediction: Classification or Estimation for the future (e.g. which customers will close their account in 6 months)

Time Series Analysis

Techniques (Cont)

Unsupervised Learning (Undirected Knowledge Discovery)

Association Rules (Affinity Grouping): Which things go together

Sequence Discovery: Association Rules based on time

Clustering: Segmenting a diverse group into a number of similar groups / clusters

Dimension reduction

Summarization / Characterization / Generalization

Overview of Techniques - 1

Classification

Logistic Regression: Predicts probability of success; gives subset selection of variables

Classification Tree: Gives a decision tree with rules of classification

Neural Network: Is very opaque but gives a higher level of accuracy in many situations

k-Nearest Neighbor: Groups cases into neighbors and assigns a class based on the majority of cases in a neighborhood

Illustrative Applications - Classification

Target Marketing

Attrition Prediction/Churn Analysis

Fraud Detection

Credit Scoring

Predicting, for every case, the class it belongs to or its probability of success, based on the data in its predictor variables
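As an illustration of the majority-vote idea behind k-nearest-neighbor classification, here is a minimal sketch in plain Python. The customer data, the feature choice (age, monthly spend) and the function name `knn_classify` are hypothetical and not taken from the slides; a real application would also scale the features before measuring distance.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Return the majority label among the k training cases nearest to query.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical customers: (age, monthly spend) -> churn label
customers = [
    ((25, 40.0), "stays"),
    ((27, 42.0), "stays"),
    ((30, 45.0), "stays"),
    ((52, 10.0), "leaves"),
    ((55, 12.0), "leaves"),
    ((58, 8.0), "leaves"),
]

print(knn_classify(customers, (28, 41.0)))  # stays
```

With an odd k and two classes the vote cannot tie; with more classes a tie-breaking rule would be needed.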

Prediction

Multiple Linear Regression: Gives predicted values based on a regression model

Regression Tree: Gives a decision tree with rules of prediction

k-Nearest Neighbors: Groups cases into neighbors and assigns a value based on the average of the cases in a neighborhood

Neural Network

Illustrative Applications - Prediction

Forecasting sales

Predicting price fluctuations

Predicting profitability of business units

Predicting market value of assets

Predicting yield or consumption of critical inputs

Predicting, for every case, a value based on the data in its predictor variables
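To make "predicting a value from predictor variables" concrete, the sketch below fits a least-squares line with a single predictor. This is a simplification of the multiple linear regression named in the technique overview, and the advertising and sales figures and the name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising spend vs. sales (exactly sales = 1 + 2 * spend)
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(ad_spend, sales)
forecast = intercept + slope * 5.0  # predicted sales for a spend of 5.0
print(intercept, slope, forecast)
```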

Clustering and Dimension Reduction

k-Means Clustering: For a given number of clusters (the k value), develops clusters based on minimum distance between the cluster centers and the cases in the cluster.

Hierarchical Clustering: Builds clusters through successive steps by grouping the cases with the least dissimilarity, finally creating a single cluster. The user can choose the number of clusters corresponding to a distance measure.

Principal Components: Creates new variables, called Principal Components, that are uncorrelated and that explain the majority of the variability in the original data.
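A minimal sketch of k-means, assuming Euclidean distance and caller-supplied starting centers so that the run is deterministic. The points and the function name `k_means` are hypothetical; production implementations usually add random restarts and a convergence check.

```python
import math


def k_means(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D cases forming two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```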

Dimension Reduction

When there are many dimensions (predictors), say 20, 30 or 50

Or when several predictors are correlated

Develop new variables that:

- Explain the major portion of variability in data
- Are uncorrelated
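The two properties listed here can be checked on a small example. The sketch below handles only two predictors, where the principal axes have a closed form (a rotation of the centered data); it is an illustration under that assumption, not a general PCA routine, and the data and the name `pca_2d` are hypothetical.

```python
import math


def pca_2d(rows):
    """Principal components for two variables.

    Returns the component scores and the share of total variance
    explained by the first component.
    """
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    centered = [(x - mean_x, y - mean_y) for x, y in rows]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the first principal axis of the 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    scores = [(c * x + s * y, -s * x + c * y) for x, y in centered]
    var1 = sum(a * a for a, _ in scores) / (n - 1)
    var2 = sum(b * b for _, b in scores) / (n - 1)
    return scores, var1 / (var1 + var2)


# Two hypothetical, strongly correlated predictors
rows = [(2.0, 1.9), (4.0, 4.2), (6.0, 5.8), (8.0, 8.1), (10.0, 10.0)]
scores, share = pca_2d(rows)
print(f"first component explains {share:.1%} of the variance")
```

Because the two predictors move together, one new variable carries almost all of the variability, and the two components have (numerically) zero covariance.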

Illustrative Applications - Clustering

Market segmentation

Product grouping based on customer preferences

Grouping of business units based on performance parameters

Grouping channel partners based on performance parameters

Grouping of homogeneous cases based on predefined variables data

Market Basket Analysis / Affinity

Association Rules: Gives prediction of combinations of events that will occur together, based on past occurrences

Illustrative Applications - Market Basket

Cross selling

Product placement in a store

Forecasting sales

Predicting events that occur together as antecedents and consequents, with a certain level of confidence and support (number of events)
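Support and confidence for a single rule can be computed by simple counting. The sketch below uses the usual definitions (support: share of all baskets containing both sides of the rule; confidence: share of baskets with the antecedent that also contain the consequent). The baskets and the name `rule_metrics` are hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

support, confidence = rule_metrics(baskets, {"bread"}, {"butter"})
print(support, confidence)  # 0.6 0.75
```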

DM as Business Process

Identifying the business problem (and how business benefits will be measured)

- Planning a direct marketing campaign for a new product
- Understanding customer attrition

Mining data to transform it into actionable information

- Who is more likely to buy the product?
- Which customers are likely to leave? Are they worth keeping?

DM as Business Process (Cont)

Acting on the information

- Contacting the customers more likely to buy
- Offering special services to valuable customers likely to leave

Measuring the results

- Actual business benefits achieved, as defined earlier

DM Methodology

Why Methodology?

Avoid learning that is not true

Avoid learning that is true but not useful

Learning that is not true

Incorrect data

Data may not be relevant (business situation has changed)

Summarization of data may have destroyed important information (Fig 3.1 pg 47)

With a small volume of data, a pattern may emerge purely by chance (e.g. when India does well in cricket, the Sensex goes up)

Model set may not reflect the relevant population (e.g. an "Issue of Credit" model built only on persons who were given credit; a poll conducted on the Web)

Learning that is true but not useful

Learning things that are already known: People in areas with no cell coverage do not buy cell phones

Learning things that cannot be used: Product sales are related to weather (can you change the weather?). A bad credit history may be predictive of more insurance claims, but regulators may prohibit the use of such information

DM Methodology - 11 Steps

Step 1: Translate business problem into DM problem

State it in specific terms (i.e. instead of "Gaining insight into customer behavior", "Identify customers who are unlikely to renew their subscription")

Determine type of problem (Classification, Clustering, etc.)

Decide how results will be used

- Contact high risk / high value customers and try to lure them with an offer
- Forecast customer population in future months

DM Methodology - 11 Steps (cont)

Step 2: Select appropriate Data

Input variables

- Which ones?
  - Ignore input columns with only one value
  - Ignore input columns with a unique value for each row (e.g. customer name)
  - Choose only one column out of two having high correlation (e.g. Age_Difference and Age_Ratio)
- What should it contain? Examples of all possible outcomes
- Availability
  - Ideally from the DW (if present), but may need to be supplemented
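The three column-screening rules above can be sketched as a small filter. The table, the 0.95 correlation threshold and the name `select_columns` are assumptions made for illustration. Note one further assumption: the "unique value for each row" rule is applied only to text columns, since a continuous numeric column such as income is often unique per row without being an identifier.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def select_columns(table, max_corr=0.95):
    """Return the column names that survive the three screening rules."""
    kept = []
    for name, values in table.items():
        distinct = len(set(values))
        if distinct == 1:
            continue  # rule 1: only one value, carries no information
        is_text = isinstance(values[0], str)
        if is_text and distinct == len(values):
            continue  # rule 2: unique per row, behaves like an identifier
        if not is_text and any(
            not isinstance(table[k][0], str)
            and abs(pearson(values, table[k])) > max_corr
            for k in kept
        ):
            continue  # rule 3: highly correlated with a column already kept
        kept.append(name)
    return kept


# Hypothetical input table (column name -> values)
table = {
    "customer_name": ["Asha", "Ben", "Chen", "Dia", "Eli"],
    "country": ["IN", "IN", "IN", "IN", "IN"],
    "age_difference": [2, 5, 1, 8, 4],
    "age_ratio": [1.1, 1.25, 1.05, 1.4, 1.2],
    "segment": ["A", "B", "A", "B", "A"],
    "income": [30, 80, 45, 20, 60],
}

print(select_columns(table))  # ['age_difference', 'segment', 'income']
```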

DM Methodology - 11 Steps (cont)

Step 2: Select appropriate Data (Cont)

Input variables (Cont)

- How many?
  - Do not eliminate at this stage; this needs to be done later on
- How much data?
  - The more the merrier
  - Needs to be optimized w.r.t. the cost involved in processing, etc. (Rule: if doubling the size does not improve the result much, stop)
- How much history?
  - Consider seasonality. Data that is too old may not be relevant. Typically 2-3 years for CRM

DM Methodology - 11 Steps (cont)

Step 3: Get to know Data

Data Type

Descriptive statistics

Validation (Why were so many customers born in 1911? Are they really that old?)

Data Type

Columns: Categorical vs Continuous

- Categorical: Takes discrete values (# of children, Marital Status)
- Continuous: Takes continuous values (Income)

Unordered vs Ordered Columns

- Unordered: (Marital Status, Sex)
- Ordered: Rank (e.g. "Low", "High")
- Ordered: Interval (e.g. Temperature)
- Ordered: True Numeric (e.g. Sales in Rs., Weight)

Descriptive Statistics

We can get a general idea about the way data are distributed

Alcohol

Mean                        13.00
Standard Error               0.06
Median                      13.05
Mode                        13.05
Standard Deviation           0.81
Sample Variance              0.65
Kurtosis                    -0.85
Skewness                    -0.05
Range                        3.80
Minimum                     11.03
Maximum                     14.83
Sum                       2314.11
Count                         178
Largest(1)                  14.83
Smallest(1)                 11.03
Confidence Level(95.0%)      0.12
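Most of the figures in a summary like this one can be reproduced with Python's standard `statistics` module. The sketch below uses five made-up readings, not the wine data behind the slide, and the name `describe` is hypothetical. The standard error is the sample standard deviation divided by the square root of the count, which is consistent with the slide's values (0.81 / sqrt(178) is about 0.06).

```python
import math
import statistics


def describe(values):
    """Summary statistics similar to a spreadsheet 'Descriptive Statistics' report."""
    n = len(values)
    std_dev = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "mean": statistics.mean(values),
        "std_error": std_dev / math.sqrt(n),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std_dev": std_dev,
        "variance": statistics.variance(values),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }


# Hypothetical readings
readings = [12.0, 13.0, 13.0, 14.0, 13.0]
summary = describe(readings)
for name, value in summary.items():
    print(f"{name:<10} {value:8.2f}")
```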

Data Visualization

We can study data distribution using a Histogram

[Figure: Histogram - All Types of Wines, with two further histogram panels whose titles are illegible. X axis: Bin - Alcohol Content (11.5, 12.5, 13.5, 14.5, More); Y axis: Frequency; series: Frequency and Cumulative %]
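A frequency histogram with a cumulative-percentage column can be tabulated as follows. This is a sketch on made-up readings, using the spreadsheet convention that a value falls into the first bin whose upper edge it does not exceed, with a final "More" bin for everything above the last edge. The function name `histogram` and the data are hypothetical.

```python
from bisect import bisect_left


def histogram(values, upper_edges):
    """Count values per bin (value <= upper edge) and the running cumulative %."""
    counts = [0] * (len(upper_edges) + 1)  # last slot is the "More" bin
    for v in values:
        counts[bisect_left(upper_edges, v)] += 1
    cumulative, running = [], 0
    for c in counts:
        running += c
        cumulative.append(100.0 * running / len(values))
    return counts, cumulative


# Hypothetical alcohol-content readings
readings = [11.2, 12.1, 12.4, 12.9, 13.1, 13.3, 13.5, 13.8, 14.2, 14.7]
edges = [11.5, 12.5, 13.5, 14.5]

counts, cumulative = histogram(readings, edges)
labels = [str(e) for e in edges] + ["More"]
for label, count, cum in zip(labels, counts, cumulative):
    print(f"{label:>5} {count:>3} {'#' * count:<5} {cum:6.1f}%")
```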

Data Visualization

Visual presentation of data (e.g. graphs such as a bar chart, an X-Y plot of two variables, a scatter chart, etc.)

Correlation between data

Validation

Incorrect Values: Reasons

- Transcription error
- Laziness (the birthday is a forced entry, so many customers were "born" on November 11, 1911!)
- Programming error (value of the previous field gets entered in this field)
- Old code and new code coexist
- Collected wrongly (time zone not considered)

Validation

Incorrect Values: Reasons

- Stored incorrectly (numeric instead of character type)

"My data must be clean because no human being has touched it manually" (one CEO)

Result: 50% of the data was wrong, because no human being had touched the system clocks on the computers!

DM Methodology - 11 Steps (cont)

Step 4: Create a model set

Sampling

- Proportionate (including multiple time frames)
- Oversampling

Partitioning

- Training
- Validation
- Test
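Partitioning a model set into training, validation and test parts can be sketched as a shuffled split. The 60/20/20 proportions, the fixed seed and the name `partition` are assumptions chosen for illustration; the slides do not specify proportions.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows reproducibly and split them into three disjoint sets.

    Whatever is left after the training and validation shares becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


rows = list(range(100))  # stand-in for 100 hypothetical records
train_set, validation_set, test_set = partition(rows)
print(len(train_set), len(validation_set), len(test_set))  # 60 20 20
```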

Step 5: Fix Problems with Data

Correct Errors

Missing Values

Outliers

Missing Data

Reasons

- "Missing data" might itself be important information (e.g. not providing a telephone number (TN) may mean "do not bother me by calling"). Keep a flag

Missing Data

Reasons (Cont)

- Nature of the problem (e.g. new customers do not have 12 months of history data). Build a separate model for those
- Sources not providing data (e.g. an external vendor not able to provide certain data). Replace by another derived value / build a separate model
- Data was never collected

Missing Data

What to do?

- Do nothing
- Filter rows (introduces bias)
- Ignore column
- Predict new value
- Build separate model
- Modify operational systems to collect data

Missing Data Correction

Delete record

Problems

- Too many rows thrown out
- Bias introduced (all persons not wanting to state "Salary" are left out)

Replace values with:

- Mode
- Mean (Local / Global)
- Median
- User specified value

Will replacement create problems?
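Replacement combined with a "was missing" flag can be sketched as below, so that the fact that a value was absent is not lost after imputation. Missing entries are represented by `None`; the salary figures and the name `impute_with_flag` are hypothetical.

```python
from statistics import mean, median


def impute_with_flag(values, fill=median):
    """Replace None with a summary of the present values and keep a flag column."""
    present = [v for v in values if v is not None]
    replacement = fill(present)
    filled = [replacement if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing


# Hypothetical salary column with two missing entries
salaries = [30, None, 50, 40, None]

filled, was_missing = impute_with_flag(salaries)
print(filled)       # [30, 40, 50, 40, 40]
print(was_missing)  # [False, True, False, False, True]

filled_mean, _ = impute_with_flag(salaries, fill=mean)  # global mean instead
```

The flag column lets a later model learn whether "did not answer" is itself predictive.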

Outliers

Outliers are cases that contain an unusually high or low data value in a variable.

Such records unduly influence the model. If they are not a natural occurrence, they should be removed

Treatment depends upon the algorithm chosen (Decision tree: no problem. Clustering: define a separate cluster. Some cases: remove / replace with Max / Min)
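The "replace with Max / Min" treatment can be sketched as clipping values to an allowed range. How the range is chosen is not stated in the slides; the sketch below assumes the common interquartile-range rule (1.5 times the IQR beyond the quartiles), and the data and the name `clip_outliers` are hypothetical.

```python
import statistics


def clip_outliers(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (an assumed, conventional rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical column with one unusually high value
ages = [10, 12, 11, 13, 12, 11, 95]
clipped = clip_outliers(ages)
print(clipped)
```

Values inside the fences are returned unchanged; only the extreme value is pulled back to the upper fence.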

Step 6: Transform Data

Normalization

Transforming

Transformation

Derived Variables

Create derived variables that represent something in the real world (e.g. Passenger * Miles)

Transformation

Extracting Information from a column / Transformation

Date: Holiday / Working Day (e.g. 26 Jan and 15 Aug are holidays)

Date: Festive Season / Normal Season

Time: Peak Hour / Off-peak Hour

Telephone Number: Landline / Mobile

Address: Single House / Multi-unit dwelling

Categorize continuous data (e.g. Income)
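A few of these transformations, sketched in plain Python: min-max normalization, a derived variable, a holiday flag extracted from a date, and binning of a continuous income value. The holiday list holds only the two dates named on the slide, and the income cutoffs, labels and function names are hypothetical choices for illustration.

```python
from bisect import bisect_right
from datetime import date

# Fixed-date holidays named on the slide: 26 Jan and 15 Aug
HOLIDAYS = {(1, 26), (8, 15)}


def min_max(values):
    """Normalization: rescale values linearly to the range 0..1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def passenger_miles(passengers, miles):
    """Derived variable that represents something in the real world."""
    return passengers * miles


def day_type(d):
    """Extract Holiday / Working Day from a date (weekends ignored for brevity)."""
    return "Holiday" if (d.month, d.day) in HOLIDAYS else "Working Day"


def income_band(income, cutoffs=(20000, 50000), labels=("Low", "Medium", "High")):
    """Categorize a continuous value; the cutoffs are hypothetical."""
    return labels[bisect_right(cutoffs, income)]


print(min_max([10, 20, 30]))        # [0.0, 0.5, 1.0]
print(passenger_miles(120, 850))    # 102000
print(day_type(date(2019, 8, 15)))  # Holiday
print(income_band(35000))           # Medium
```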

Step 7: Build Model

Choose one or more techniques

Step 8: Assess Models

Some Errors are more serious than others

Confusion Matrix

Ratio of intra-cluster to inter-cluster distance
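A confusion matrix, and the point that some errors are more serious than others, can be sketched by counting (actual, predicted) pairs and weighting the error cells by cost. The labels and the cost figures (a missed fraud costing far more than a false alarm) are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


# Hypothetical model output on six cases
actual = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
predicted = ["fraud", "ok", "fraud", "ok", "ok", "ok"]

matrix = confusion_matrix(actual, predicted)
accuracy = sum(n for (a, p), n in matrix.items() if a == p) / len(actual)

# Hypothetical costs per error, keyed by (actual, predicted)
COST = {("fraud", "ok"): 100, ("ok", "fraud"): 5}
total_cost = sum(COST.get(cell, 0) * n for cell, n in matrix.items())

for (a, p), n in sorted(matrix.items()):
    print(f"actual={a:<5} predicted={p:<5} count={n}")
print(f"accuracy={accuracy:.2f} total_cost={total_cost}")
```

Two models with the same accuracy can have very different total cost, which is why the matrix is examined cell by cell.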

Step 9: Deploy Model

Choose one or more techniques

Step 10: Assess Results

Example:

What was the cost of the direct marketing campaign? (Including DM cost)

What were the benefits?

Step 11: Begin Again

Things change over time

Better way of handling

DM and KDD

KDD (Knowledge Discovery in Databases) and DM are often used interchangeably. Some prefer to differentiate. KDD consists of:

Selection: Sourcing data

Preprocessing: Correcting erroneous data, handling missing data

Transformation: Transforming data to more usable formats

Data Mining: Applying various algorithms

Presentation / Interpretation / Evaluation of data

SEMMA Methodology (SAS)

Sample: Sample from data sets; partition into Training, Validation and Test datasets

Explore: Explore the data set statistically and graphically

Modify: Transform variables, impute missing values

Model: Fit predictive models, e.g. regression, tree, collaborative filtering

Assess: Compare models

Miscellaneous

Data Mining Issues

Human Interaction

Over fitting

Outliers

Interpretation of Results

Visualization of Results

Large Datasets (some algorithms do not scale; use sampling or parallel processing)

Miscellaneous (Cont)

Data Mining Issues (Cont)

High Dimensionality

Multimedia Data

Missing Data

Irrelevant Data

Noisy data

Changing Data

Integration of KDD in traditional DBMS systems

Applications

Miscellaneous (Cont)

Future

Data Mining Query Language (DMQL) based on SQL

DMQL should bring out

- Generalized Relation: Obtained by generalizing data from input data
- Characteristic Rule: Condition satisfied by almost all records in the target class
- Discriminant Rule: Condition satisfied by the target class but not by other classes
- Classification Rule: Used to classify data

References

1. Michael Berry, Gordon Linoff, "Mastering Data Mining", Wiley Publications (Ch 1, 3, 5, 6, 7)

2. Michael Berry, Gordon Linoff, "Data Mining Techniques", Wiley Publications (Ch 7 - Overview of Data Mining Techniques)

3. Margaret Dunham, "Data Mining - Introductory and Advanced Topics", Pearson Education (Ch 1, 2, 3)
