Lecture 2: Data Mining

Data Mining: A Closer Look

Kritda Vavnilanon

Outline

• Objectives

• Data Mining Strategies

• Supervised Data Mining Techniques

• Association Rules

• Clustering Techniques

• Evaluating Performance

• Summary

• Questions

Objectives

• Determine an appropriate data mining strategy for a specific problem.

• Know about several data mining techniques and how each technique builds a generalized model to represent data.

• Understand how a confusion matrix is used to help evaluate supervised learner models.

• Understand basic techniques for evaluating supervised learner models with numeric output.

• Know how measuring lift can be used to compare the performance of several competing supervised learner models.

• Understand basic techniques for evaluating unsupervised learner models.

Data Mining Strategies


• Supervised learning builds models by using input attributes to predict output attribute values.

• Output attributes are also known as dependent variables, as their outcome depends on the values of one or more input attributes.

• Input attributes are referred to as independent variables.


• Classification is probably the best understood of all data mining strategies. Classification tasks have three common characteristics:

– Learning is supervised.

– The dependent variable is categorical.

– The emphasis is on building models able to assign new instances to one of a set of well-defined classes.


• Some example classification tasks include the following:

– Determine those characteristics that differentiate individuals who have suffered a heart attack from those who have not.

– Develop a profile of a “successful” person.

– Determine if a credit card purchase is fraudulent.

– Classify a car loan applicant as a good or a poor credit risk.

– Develop a profile to differentiate female and male stroke victims.


• Estimation determines a value for an unknown output attribute.

• However, unlike classification, the output attribute for an estimation problem is numeric.

– Estimate the number of minutes before a thunderstorm will reach a given location.

– Estimate the salary of an individual who owns a sports car.

– Estimate the likelihood that a credit card has been stolen.

– Estimate the length of a gamma ray burst.


• Prediction: It is not easy to differentiate prediction from classification or estimation.

• Unlike a classification or estimation model, a predictive model aims to determine future outcomes rather than current behavior.

• The output attribute(s) of a predictive model can be categorical or numeric.


• Determine whether a credit card customer is likely to take advantage of a special offer made available with their credit card billing.

• Predict next week’s closing price for the Dow Jones Industrial Average.

• Forecast which telephone subscribers are likely to change providers during the next three months.


• IF 169 <= Maximum Heart Rate <= 202 THEN Concept Class = Healthy

Rule accuracy: 85.07%

Rule coverage: 34.55%

• IF Thal = Rev and Chest Pain Type = Asymptomatic THEN Concept Class = Sick

Rule accuracy: 91.14%

Rule coverage: 52.17%
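
The slides report rule accuracy and rule coverage without defining them. The sketch below assumes the usual reading: accuracy is the fraction of instances satisfying the IF part that also belong to the THEN class, and coverage is the fraction of instances of the THEN class that satisfy the IF part. The patient records and the helper name `rule_accuracy_and_coverage` are hypothetical; they are not the heart patient dataset.

```python
def rule_accuracy_and_coverage(instances, antecedent, target_class):
    """Return (accuracy, coverage) of the rule IF antecedent THEN class = target_class.

    Assumed definitions (not stated in the slides):
      accuracy = matched instances in target_class / all matched instances
      coverage = matched instances in target_class / all instances in target_class
    """
    matched = [x for x in instances if antecedent(x)]
    in_class = [x for x in instances if x["class"] == target_class]
    correct = [x for x in matched if x["class"] == target_class]
    accuracy = len(correct) / len(matched) if matched else 0.0
    coverage = len(correct) / len(in_class) if in_class else 0.0
    return accuracy, coverage


# Hypothetical toy records, for illustration only.
patients = [
    {"max_heart_rate": 175, "class": "Healthy"},
    {"max_heart_rate": 180, "class": "Healthy"},
    {"max_heart_rate": 150, "class": "Healthy"},
    {"max_heart_rate": 190, "class": "Sick"},
    {"max_heart_rate": 120, "class": "Sick"},
]

accuracy, coverage = rule_accuracy_and_coverage(
    patients,
    antecedent=lambda p: 169 <= p["max_heart_rate"] <= 202,
    target_class="Healthy",
)
print(f"accuracy={accuracy:.2%}, coverage={coverage:.2%}")
```

On this toy data three patients satisfy the IF part and two of them are Healthy, while two of the three Healthy patients are matched, so both measures come out at two thirds.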


• WARNING 1: Have your maximum heart rate checked on a regular basis. If your maximum heart rate is low, you may be at risk of having a heart attack! (Predictive)

• WARNING 2: If you have a heart attack, expect your maximum heart rate to decrease. (Classification)


• With unsupervised clustering we are without a dependent variable to guide the learning process.

• Rather, the learning program builds a knowledge structure by using some measure of cluster quality to group instances into two or more classes.

• A primary goal of an unsupervised clustering strategy is to discover concept structures in data.


• Common uses of unsupervised clustering include:

– Determine if meaningful relationships in the form of concepts can be found in the data

– Evaluate the likely performance of a supervised learner model

– Determine a best set of input attributes for supervised learning

– Detect outliers


• Unsupervised clustering can also help detect any atypical instances present in the data.

• Atypical instances are referred to as outliers.

• Outliers can be of great importance and should be identified whenever possible.

• Statistical mining applications frequently remove outliers.

• With data mining, the outliers might be just those instances we are trying to identify.


• The purpose of market basket analysis is to find interesting relationships among retail products.

• The results of a market basket analysis help retailers design promotions, arrange shelf or catalog items, and develop cross-marketing strategies.

• Association rule algorithms are often used to apply a market basket analysis to a set of data.

Supervised Data Mining Techniques

• A data mining technique is used to apply a data mining strategy to a set of data.

• A specific data mining technique is defined by an algorithm and an associated knowledge structure such as a tree or a set of rules.

• Production Rules: Any decision tree can be translated into a set of production rules.

• However, we do not need an initial tree structure to generate production rules. RuleMaker, the production rule generator that comes with your iDA software, uses ratios together with mathematical set theory operations to create rules from spreadsheet data.

• Earlier in this chapter you saw two rules generated by RuleMaker for the heart patient dataset. Let’s apply RuleMaker to the credit card promotion data.


A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of a life insurance promotion from those card holders who have chosen not to participate in the promotional offer.


• IF Sex = Female & 19 <= Age <= 43 THEN Life Insurance Promotion = Yes

Rule Accuracy: 100.00%

Rule Coverage: 66.67%

• IF Sex = Male & Income Range = 40–50K THEN Life Insurance Promotion = No

• IF Credit Card Insurance = Yes

• IF Income Range = 30–40K & Watch Promotion = Yes





• A neural network is a set of interconnected nodes designed to imitate the functioning of the human brain.

• As the human brain contains billions of neurons and a typical neural network has fewer than one hundred nodes, the comparison is somewhat superficial.

• However, neural networks have been successfully applied to problems across several disciplines and for this reason are quite popular in the data mining community.


• Neural networks come in many shapes and forms and can be constructed for supervised learning as well as unsupervised clustering.

• In all cases the values input into a neural network must be numeric.

• The feed-forward network is a popular supervised learner model.


• Figure 2.2 shows a fully connected feed-forward neural network consisting of three layers.

• With a feed-forward network, the input attribute values for an individual instance enter at the input layer and pass forward through the network to the output layer.

• The output layer may contain one or several nodes.

• The output layer of the network shown in Fig. 2.2 contains two nodes.
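
As a rough illustration of a feed-forward pass, the sketch below pushes one instance through a small fully connected three-layer network with two output nodes. The layer sizes, the weight values, and the sigmoid activation are all assumptions made for the example; the slides do not give the weights or the node function used in Figure 2.2.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1). Assumed node function."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weights):
    """Compute one layer: weights[j][i] connects input i to node j."""
    return [
        sigmoid(sum(w * v for w, v in zip(node_weights, inputs)))
        for node_weights in weights
    ]


def feed_forward(inputs, hidden_weights, output_weights):
    """Pass numeric input values through the hidden layer to the output layer."""
    hidden = layer_output(inputs, hidden_weights)
    return layer_output(hidden, output_weights)


# Hypothetical weights: 3 input nodes -> 2 hidden nodes -> 2 output nodes.
hidden_weights = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
output_weights = [[0.1, -0.2], [0.5, 0.6]]

outputs = feed_forward([1.0, 0.4, 0.7], hidden_weights, output_weights)
print(outputs)  # a list of two values, one per output node
```

Because the output layer has two nodes, the result for each instance is a pair of values, each between 0 and 1 under the sigmoid assumption.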


• Therefore the output of the neural network will be an ordered pair of values.

• Neural networks operate in two phases.

• The first phase is called the learning phase. During network learning, the input values associated with each instance enter the network at the input layer. One input layer node exists for each input attribute contained in the data.


• The actual output value for each instance is computed and compared with the desired network output. Any error between the desired and computed output is propagated back through the network by changing connection-weight values.

• Training terminates after a certain number of iterations or when the network converges to a predetermined minimum error rate. During the second phase of operation, the network weights are fixed and the network is used to classify new instances.
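
The two phases can be sketched with a deliberately simplified model: a single sigmoid node with no hidden layer, so the error adjusts one set of connection weights rather than being propagated back through several layers. The learning rate, the stopping thresholds, and the logical-OR training data are hypothetical choices for illustration, not values from the slides.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_single_node(data, learning_rate=0.5, max_epochs=10000, min_rms=0.1):
    """Learning phase: adjust weights until the error is small or epochs run out."""
    num_inputs = len(data[0][0])
    weights = [0.0] * (num_inputs + 1)  # the last weight acts as a bias
    rms = float("inf")
    epochs = 0
    while epochs < max_epochs and rms > min_rms:
        epochs += 1
        squared_error = 0.0
        for inputs, desired in data:
            x = list(inputs) + [1.0]
            computed = sigmoid(sum(w * v for w, v in zip(weights, x)))
            error = desired - computed
            squared_error += error ** 2
            # Change each connection weight in proportion to the error.
            delta = learning_rate * error * computed * (1.0 - computed)
            weights = [w + delta * v for w, v in zip(weights, x)]
        rms = math.sqrt(squared_error / len(data))
    return weights, epochs, rms


def classify(weights, inputs):
    """Second phase: weights are fixed and used to classify a new instance."""
    x = list(inputs) + [1.0]
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x))) >= 0.5 else 0


# Hypothetical training data: the logical OR of two binary inputs.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, epochs, rms = train_single_node(or_data)
print(epochs, round(rms, 3), [classify(weights, x) for x, _ in or_data])
```

The loop stops on whichever condition is met first, the iteration limit or the minimum error, which mirrors the two termination criteria described above.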


• Statistical regression is a supervised learning technique that generalizes a set of numeric data by creating a mathematical equation relating one or more input attributes to a single numeric output attribute.

• A linear regression model is characterized by an output attribute whose value is determined by a linear sum of weighted input attribute values. Here is a linear regression equation for the data in Table 2.3:

life insurance promotion = 0.5909 × (credit card insurance) – 0.5455 × (sex) + 0.7727


• With both input attributes equal to 0: life insurance promotion = 0.5909(0) – 0.5455(0) + 0.7727 = 0.7727

• Because 0.7727 is close to 1, the individual is likely to take advantage of the promotional offer.

• Although regression can be nonlinear, the most popular use of regression is for linear modeling.

• Linear regression is appropriate provided the data can be accurately modeled with a straight line function.
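
The regression equation for Table 2.3 can be evaluated directly. The sketch below assumes both input attributes are coded as 0 or 1, which is what the worked example with two zeros suggests; the function name is hypothetical.

```python
def life_insurance_promotion(credit_card_insurance, sex):
    """Evaluate the linear regression equation given for Table 2.3.

    Both arguments are assumed to be coded numerically as 0 or 1.
    """
    return 0.5909 * credit_card_insurance - 0.5455 * sex + 0.7727


# Print the estimate for every combination of the two coded inputs.
for cci in (0, 1):
    for sex in (0, 1):
        score = life_insurance_promotion(cci, sex)
        print(f"credit card insurance={cci}, sex={sex}: {score:.4f}")
```

The output is a numeric score rather than a class label, so it still has to be interpreted, for example by treating values near 1 as a likely "yes".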

Association Rules

• As the name implies, association rule mining techniques are used to discover interesting associations between attributes contained in a database.

• Unlike traditional production rules, association rules can have one or several output attributes. Also, an output attribute for one rule can be an input attribute for another rule.

• Association rules are a popular technique for market basket analysis because all possible combinations of potentially interesting product groupings can be explored.

• For this reason a limited number of attributes are able to generate hundreds of association rules.
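
To see how quickly the number of candidate rules grows, the sketch below counts every way to place items on the IF side, the THEN side, or neither, keeping only assignments where both sides are non-empty. This is a simplification: it treats each product as simply present in the basket, whereas attribute-value pairs would produce even more candidates. The function name is hypothetical.

```python
from itertools import product


def count_candidate_rules(num_items):
    """Count rules with a non-empty IF side and a non-empty, disjoint THEN side."""
    count = 0
    for roles in product(("unused", "if", "then"), repeat=num_items):
        if "if" in roles and "then" in roles:
            count += 1
    return count


for n in range(2, 7):
    print(n, "items ->", count_candidate_rules(n), "candidate rules")
```

With six items there are already several hundred candidate rules, which is why practical association rule algorithms prune the search rather than report every combination.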

• IF Sex = Female & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = Yes

• IF Sex = Male & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = No

• IF Sex = Female & Age = over40 THEN Credit Card Insurance = No & Life Insurance Promotion = Yes

Clustering Techniques

• Several unsupervised clustering techniques can be identified.

• One common technique is to apply some measure of similarity to divide instances into disjoint partitions. The partitions are generalized by computing a group mean for each cluster or by listing a most typical subset of instances from each cluster.
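
A minimal sketch of the partitioning idea: each instance is assigned to the most similar of a set of cluster centers, and each partition is then summarized by its group mean. Squared Euclidean distance as the similarity measure, the starting centers, and the data points are assumptions for illustration; this is not the algorithm used by the iDA software.

```python
def squared_distance(a, b):
    """Smaller distance means greater similarity (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partition(instances, centers):
    """Assign each instance to its nearest center, giving disjoint partitions."""
    clusters = [[] for _ in centers]
    for instance in instances:
        nearest = min(
            range(len(centers)),
            key=lambda k: squared_distance(instance, centers[k]),
        )
        clusters[nearest].append(instance)
    return clusters


def group_mean(cluster):
    """Generalize a cluster by the attribute-wise mean of its instances."""
    n = len(cluster)
    return tuple(sum(values) / n for values in zip(*cluster))


# Hypothetical two-attribute instances and starting centers.
instances = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]

clusters = partition(instances, centers)
means = [group_mean(c) for c in clusters]
print(clusters)
print(means)
```

Repeating the assign-then-average step with the new means as centers is the basis of iterative methods such as K-Means.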


• A second approach is to partition data in a hierarchical fashion where each level of the hierarchy is a generalization of the data at some level of abstraction.

• One of the unsupervised clustering models that comes with your iDA software tool is a hierarchical clustering system.


By applying unsupervised clustering to the instances of the Acme Credit Card Company database, we will find a subset of input attributes that differentiate card holders who have taken advantage of the life insurance promotion from those card holders who have not accepted the promotional offer.


• IF Sex = Female & 43 >= Age >= 35 & Credit Card Insurance = No THEN Class = 3



Evaluating Performance

• Performance evaluation is the most critical of all the steps in the data mining process.

• We take a common-sense approach to evaluating supervised and unsupervised learner models.

• Three general questions:

– Will the benefits received from a data mining project more than offset the cost of the data mining process?

– How do we interpret the results of a data mining session?

– Can we use the results of a data mining process with confidence?


• Is there knowledge about projects similar to the proposed project? What are the success rates and costs of projects similar to the planned project?

• What is the current form of the data to be analyzed? Does the data exist or will it have to be collected? When a wealth of data exists and is not in a form amenable for data mining, the greatest project cost will fall under the category of data preparation. In fact, a larger question may be whether to develop a data warehouse for future data mining projects.


• Who will be responsible for the data mining project? How many current employees will be involved? Will outside consultants be hired?

• Is the necessary software currently available? If not, will the software be purchased or developed? If purchased or developed, how will the software be integrated into the current system?


• Evaluating Supervised Learner Models

• Classification correctness is best calculated by presenting previously unseen data in the form of a test set to the model being evaluated.

• Test set model accuracy can be summarized in a table known as a confusion matrix.


• A generic confusion matrix for the three-class case is shown in the table below (rows: actual class; columns: computed class).

• Values along the main diagonal give the total number of correct classifications for each class. For example, a value of 15 for C11 means that 15 class C1 test set instances were correctly classified.

• Values other than those on the main diagonal represent classification errors. To illustrate, suppose C12 has the value 4. This means that four class C1 instances were incorrectly classified as belonging to class C2.

        C1     C2     C3

C1      C11    C12    C13

C2      C21    C22    C23

C3      C31    C32    C33
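
A confusion matrix can be tallied from a test set as sketched below, with rows indexed by actual class and columns by computed class, matching the C12 example above. The class labels in the two lists are hypothetical test results, and the function names are chosen for this example only.

```python
def confusion_matrix(actual, computed, classes):
    """Build a matrix where entry [i][j] counts class i instances classified as j."""
    index = {label: position for position, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, c in zip(actual, computed):
        matrix[index[a]][index[c]] += 1
    return matrix


def accuracy(matrix):
    """Correct classifications lie on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical test set outcomes.
classes = ["C1", "C2", "C3"]
actual = ["C1", "C1", "C1", "C2", "C2", "C3"]
computed = ["C1", "C2", "C1", "C2", "C2", "C1"]

matrix = confusion_matrix(actual, computed, classes)
for label, row in zip(classes, matrix):
    print(label, row)
print("test set accuracy:", accuracy(matrix))
```

Here one C1 instance is misclassified as C2 and the single C3 instance is misclassified as C1, so four of the six test instances fall on the main diagonal.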


• Two-Class Error Analysis

• Evaluating Numeric Output

• The mean absolute error for a set of test data is computed by finding the average absolute difference between computed and desired outcome values.

• In a similar manner, the mean squared error is simply the average squared difference between computed and desired outcome.

• For the best test set accuracy, we want the smallest possible value for each measure.

• The root mean squared error (rms) is simply the square root of a mean squared error value. Rms is frequently used as a measure of test set accuracy with feed-forward neural networks.
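
The three numeric error measures follow directly from their definitions. The computed and desired values below are hypothetical test set outcomes used only to show the arithmetic.

```python
import math


def mean_absolute_error(computed, desired):
    """Average absolute difference between computed and desired values."""
    return sum(abs(c - d) for c, d in zip(computed, desired)) / len(desired)


def mean_squared_error(computed, desired):
    """Average squared difference between computed and desired values."""
    return sum((c - d) ** 2 for c, d in zip(computed, desired)) / len(desired)


def root_mean_squared_error(computed, desired):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(computed, desired))


# Hypothetical model outputs and the desired outcomes for three test instances.
computed = [2.0, 4.0, 7.0]
desired = [3.0, 4.0, 5.0]

print("mae:", mean_absolute_error(computed, desired))
print("mse:", mean_squared_error(computed, desired))
print("rms:", root_mean_squared_error(computed, desired))
```

Squaring penalizes the single error of 2 more heavily than the error of 1, which is why the mean squared error here is larger than the mean absolute error.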


• Comparing Models by Measuring Lift

Lift = P(Ci | Sample) / P(Ci | Population)

[Lift chart omitted; axis label: % Sampled]
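
As a worked example of the lift ratio, the sketch below estimates both probabilities from counts. The mailing figures are hypothetical: a population of 100,000 customers containing 1,000 responders, and a model-selected sample of 20,000 customers containing 540 responders.

```python
def lift(sample_class_count, sample_size, population_class_count, population_size):
    """Lift = P(Ci | Sample) / P(Ci | Population), estimated from counts."""
    p_sample = sample_class_count / sample_size
    p_population = population_class_count / population_size
    return p_sample / p_population


# Hypothetical counts: 2.7% response in the sample versus 1% in the population.
value = lift(
    sample_class_count=540,
    sample_size=20_000,
    population_class_count=1_000,
    population_size=100_000,
)
print(round(value, 2))
```

A lift above 1 means the model concentrates class Ci in the sample more than random selection would; a lift of 1 means the model does no better than sampling the population at random.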


• Unsupervised Model Evaluation

• Evaluating unsupervised data mining is, in general, a more difficult task than supervised evaluation. This is true because the goals of an unsupervised data mining session are frequently not as clear as the goals for supervised learning.

• A general technique employs supervised learning to evaluate an unsupervised clustering.


• All unsupervised clustering techniques compute some measure of cluster quality.

• A common technique is to calculate the summation of squared error differences between the instances of each cluster and their corresponding cluster center.

• Smaller values for sums of squared error differences indicate clusters of higher quality. However, for a detailed evaluation of unsupervised clustering, it is supervised learning that comes to the rescue.
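
The sum of squared error measure can be sketched as follows, taking each cluster center to be the attribute-wise mean of its instances and using squared Euclidean distance; both choices are assumptions, as are the two hypothetical clusterings being compared.

```python
def cluster_center(cluster):
    """Attribute-wise mean of the instances in one cluster."""
    n = len(cluster)
    return tuple(sum(values) / n for values in zip(*cluster))


def sum_squared_error(clusters):
    """Total squared distance between every instance and its cluster center."""
    total = 0.0
    for cluster in clusters:
        center = cluster_center(cluster)
        for instance in cluster:
            total += sum((x - c) ** 2 for x, c in zip(instance, center))
    return total


# Two hypothetical clusterings of the same four instances.
tight = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 10.0)]]
loose = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (9.0, 10.0)]]

print("tight:", sum_squared_error(tight))
print("loose:", sum_squared_error(loose))
```

The clustering that groups nearby instances together yields the much smaller total, which is the sense in which smaller values indicate higher-quality clusters.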


• Finally, a common misconception in the business world is that data mining can be accomplished simply by choosing the right tool, turning it loose on some data, and waiting for answers to problems.

• This approach is doomed to failure. Machines are still machines. It is the analysis of results provided by the human element that ultimately dictates the success or failure of a data mining project.

• A formal KDD process model such as the one described in Chapter 5 will help provide more complete answers to the questions posed at the beginning of this section.

Summary

• Data mining strategies include classification, estimation, prediction, unsupervised clustering, and market basket analysis.

• The output of a classification strategy is categorical.

• The output of an estimation strategy is numeric.

• A predictive strategy is used to design models for predicting future outcomes.

• Unsupervised clustering strategies are employed to discover hidden concept structures in data as well as to locate atypical data instances.

• The purpose of market basket analysis is to find interesting relationships among retail products.

• Data mining techniques are defined by an algorithm and a knowledge structure.

• Common features that distinguish the various techniques are whether learning is supervised or unsupervised and whether their output is categorical or numeric.

• Performance evaluation is probably the most critical of all the steps in the data mining process.

• Supervised model evaluation is often performed using a training/test set scenario.

• A marketing application measures the goodness of a model by its ability to lift response rate thresholds to levels well above those achieved by naïve (mass) mailing strategies.

• Unsupervised models support some measure of cluster quality that can be used for evaluative purposes. Supervised learning can also be employed to evaluate the quality of the clusters formed by an unsupervised model.

Questions
