Lecture2 Data Mining

51
Data Mining: A Closer Look Kritda Vavnilanon

Transcript of Lecture2 Data Mining

Page 1: Lecture2 Data Mining

Data Mining: A Closer Look

Kritda Vavnilanon

Page 2: Lecture2 Data Mining

Out-line

• Objectives

• Data Mining Strategies

• Supervised Data Mining Techniques

• Association Rules

• Clustering Techniques

• Evaluating Performance

• Summary

• Questions

Page 3: Lecture2 Data Mining

Objectives

• Determine an appropriate data mining

strategy for a specific problem.

• Know about several data mining

techniques and how each technique builds

a generalized model to represent data.

• Understand how a confusion matrix is

used to help evaluate supervised learner

models.

Page 4: Lecture2 Data Mining

Objectives

• Understand basic techniques for

evaluating supervised learner models with

numeric output.

• Know how measuring lift can be used to

compare the performance of several

competing supervised learner models.

• Understand basic techniques for

evaluating unsupervised learner models.

Page 5: Lecture2 Data Mining

Data Mining Strategies

Page 6: Lecture2 Data Mining

Data Mining Strategies

• Supervised learning builds models by using input attributes to predict output attribute values.

• Output attributes are also known as depen­dent­variables­as their outcome depends on the values of one or more input attributes.

• Input attributes are referred to as independent­variables.

Page 7: Lecture2 Data Mining

Data Mining Strategies

• Classification­is probably the best

understood of all data Classification tasks

have three common characteristics

– Learning is supervised.

– The dependent variable is categorical.

– The emphasis is on building models able to

assign new instances to one of a set of well-

defined classes.

Page 8: Lecture2 Data Mining

Data Mining Strategies

• Some example classification tasks include the

following:

– Determine those characteristics that differentiate

individuals who have suffered a heart attack from those

who have not.

– Develop a profile of a “successful” person.

– Determine if a credit card purchase is fraudulent.

– Classify a car loan applicant as a good or a poor credit

risk.

– Develop a profile to differentiate female and male stroke

victims.

Page 9: Lecture2 Data Mining

Data Mining Strategies

• Estimation­determines a value for an unknown output attribute.

• However, unlike classification, the output attribute for an estimation problem is numeric.– Estimate the number of minutes before a

thunderstorm will reach a given location.– Estimate the salary of an individual who owns a

sports car.– Estimate the likelihood that a credit card has been

stolen.– Estimate the length of a gamma ray burst.

Page 10: Lecture2 Data Mining

Data Mining Strategies

• Prediction: It is not easy to differentiate

prediction from classification or estimation.

• Un like a classification or estimation model,

the purpose of a predictive model is to

determine future outcome rather than current

behavior.

• The output attribute (s) of a predictive model

can be categorical or numeric.

Page 11: Lecture2 Data Mining

Data Mining Strategies

• Determine whether a credit card customer is

likely to take advantage of a special offer

made available with their credit card billing.

• Predict next week’s closing price for the Dow

Jones Industrial Average.

• Forecast which telephone subscribers are

likely to change providers during the next

three months.

Page 12: Lecture2 Data Mining

Data Mining Strategies

• IF­169 <= Maximum Heart Rate <= 202 THEN­

Concept Class = Healthy

Rule accuracy: 85.07%

Rule coverage: 34.55%

• IF­Thal = Rev and Chest Pain Type =

Asymptomatic THEN­Concept Class = Sick

Rule accuracy: 91.14%

Rule coverage: 52.17%

Page 13: Lecture2 Data Mining

Data Mining Strategies

• WARNING­1:­Have your maximum heart rate checked on a regular basis. If your maximum heart rate is low, you may be at risk of having a heart attack! (Predictive)

• WARNING­2:­If you have a heart attack, expect your maximum heart rate to decrease. (Classification)

Page 14: Lecture2 Data Mining

Data Mining Strategies

• With unsupervised clustering we are without a dependent variable to guide the learning process.

• Rather, the learning program builds a knowledge structure by using some measure of cluster quality to group instances into two or more classes.

• A primary goal of an unsupervised clustering strategy is to discover concept structures in data.

Page 15: Lecture2 Data Mining

Data Mining Strategies

• Common uses of unsupervised clustering

include:

– Determine if meaningful relationships in the form

of concepts can be found in the data

– Evaluate the likely performance of a supervised

learner model

– Determine a best set of input attributes for

supervised learning

– Detect outliers

Page 16: Lecture2 Data Mining

Data Mining Strategies

• Unsupervised clustering can also help detect any atypical instances present in the data.

• Atypical instances are referred to as outliers.­

• Outliers can be of great importance and should be identified whenever possible.

• Statistical mining applications frequently remove outliers.

• With data mining, the outliers might be just those instances we are trying to identify.

Page 17: Lecture2 Data Mining

Data Mining Strategies

• The purpose of market­basket­analysis­is to find interesting relationships among re tail products.

• The results of a market basket analysis help retailers design promotions, arrange shelf or catalog items, and develop cross-marketing strategies.

• Association rule algorithms are often used to apply a market basket analysis to a set of data.

Page 18: Lecture2 Data Mining

Supervised Data Mining Techniques

• A data­mining­technique­is used to apply a data mining strategy to a set of data.

• A specific data mining technique is defined by an algorithm and an associated knowledge structure such as a tree or a set of rules.

• Production­Rules: Any decision tree can be translated into a set of production rules.

• However, we do not need an initial tree structure to generate production rules. RuleMaker,­the production rule generator that comes with your iDA software, uses ratios together with mathematical set theory operations to create rules from spreadsheet data.

• Earlier in this chapter you saw two rules generated by RuleMaker for the heart patient dataset. Let’s apply RuleMaker to the credit card promotion data.

Page 19: Lecture2 Data Mining

Supervised Data Mining Techniques

A combination of one or more of the dataset

attributes differentiate between Acme Credit

Card Company card holders who have taken

advantage of a life insurance promotion and

those card holders who have chosen not to

participate in the promotional offer.

Page 20: Lecture2 Data Mining

Supervised Data Mining Techniques

• IF­Sex = Female & 19 <= Age <= 43

THEN­Life Insurance Promotion = Yes

Rule Accuracy: 100.00%

Rule Coverage: 66.67%

• IF­Sex = Male & Income Range = 40–50K

THEN­Life Insurance Promotion = No

Rule Accuracy: 100.00%

Rule Coverage: 50.00%

Page 21: Lecture2 Data Mining

Supervised Data Mining Techniques

• IF­Credit Card Insurance = Yes

THEN­Life Insurance Promotion = Yes

Rule Accuracy: 100.00%

Rule Coverage: 33.33%

• IF­Income Range = 30–40K & Watch Promotion = Yes

THEN­Life Insurance Promotion = Yes

Rule Accuracy: 100.00%

Rule Coverage: 33.33%

Page 22: Lecture2 Data Mining

Supervised Data Mining Techniques

• A neural­network­is a set of interconnected nodes designed to imitate the functioning of the human brain.

• As the human brain contains billions of neurons and a typicalneural network has fewer than one hundred nodes, the comparison is somewhat superficial.

• However, neural networks have been successfully applied to problems across several disciplines and for this reason are quite popular in the data mining community.

Page 23: Lecture2 Data Mining

Supervised Data Mining Techniques

• Neural networks come in many shapes and forms and can be constructed for supervised learning as well as unsupervised clustering.

• In all cases the values input into a neural network must be numeric.

• The feed-forward network is a popular supervisedlearner model.

Page 24: Lecture2 Data Mining

Supervised Data Mining Techniques

Page 25: Lecture2 Data Mining

Supervised Data Mining Techniques

• Figure 2.2 shows a fully connected feed-forward

neural network consisting of three layers.

• With a feed-forward network the input attribute

values for an individual instance enter at the input

layer and pass directly through the output layer of

the network structure.

• The output layer may contain one or several nodes.

• The out put layer of the network shown in Fig. 2.2

contains two nodes.

Page 26: Lecture2 Data Mining

Supervised Data Mining Techniques

• Therefore the output of the neural network will be an ordered pair of values.

• Neural networks operate in two phases.

• The first phase is called the learning phase. During network learning, the input values associated with each instance enter the net work at the input layer. One input layer node exists for each input attribute contained in the data.

Page 27: Lecture2 Data Mining

Supervised Data Mining Techniques

• The actual output value for each instance is computed and compared with the desired network output. Any error between the desired and computed output is propagated back through the network by changing connection-weight values.

• Training terminates after a certain number of iterations or when the network converges to a predetermined minimum error rate. During the second phase of operation, the network weights are fixed and the network is used to classify new instances.

Page 28: Lecture2 Data Mining

Supervised Data Mining Techniques

• Statistical­regression­is a supervised learning technique that generalizes a set of nu meric data by creating a mathematical equation relating one or more input attributes to a single numeric output attribute.

• A linear­regression­model is characterized by an output attribute whose value is determined by a linear sum of weighted input at tribute values. Here is a linear regression equation for the data in Table 2.3:

life insurance promotion = 0.5909 ~ (credit card insurance) – 0.5455 ~ (sex) + 0.7727

Page 29: Lecture2 Data Mining

Supervised Data Mining Techniques

• life insurance promotion = 0.5909(0) – 0.5455(0) + 0.7727 = 0.7727

• Individual is likely to take advantage of the promotional offer.

• Although regression can be nonlinear, the most popular use of regression is for linear modeling.

• Linear regression is appropriate provided the data can be accurately modeled with a straight line function.

Page 30: Lecture2 Data Mining

Association Rules

• As the name implies, association­rule­mining techniques are used to discover interesting associations between attributes contained in a database.

• Unlike traditional production rules, association rules can have one or several output attributes. Also, an output attribute for one rule can be an input attribute for another rule.

Page 31: Lecture2 Data Mining

Association Rules

• Association rules are a popular technique for

market basket analysis because all possible

combinations of potentially interesting

product groupings can be explored.

• For this reason a limited number of attributes

are able to generate hundreds of association

rules.

Page 32: Lecture2 Data Mining

Association Rules

• IF­Sex = Female & Age = over40 & Credit Card Insurance = No THEN­Life Insurance Promotion = Yes

• IF­Sex = Male & Age =over40 & Credit Card Insurance = No THEN­Life Insurance Promotion = No

• IF­Sex = Female & Age = over40­THEN­Credit Card Insurance = No & Life Insurance Promotion = Yes

Page 33: Lecture2 Data Mining

Clustering Techniques

• Several unsupervised clustering techniques

can be identified.

• One common technique is to apply some

measure of similarity to divide instances into

disjoint parti tions. The partitions are

generalized by computing a group mean for

each cluster or by listing a most typical subset

of instances from each cluster.

Page 34: Lecture2 Data Mining

Clustering Techniques

• A second approach is to partition data in a

hierarchical fashion where each level of the

hierarchy is a generalization of the data at

some level of abstraction.

• One of the unsupervised clustering models

that comes with your iDA software tool is a

hierarchical clustering system.

Page 35: Lecture2 Data Mining

Clustering Techniques

By applying unsupervised clustering to the

instances of the Acme Credit Card Company

database, we will find a subset of input

attributes that differentiate card holders who

have taken advantage of the life insurance

promotion from those cardhold ers who have

not accepted the promotional offer.

Page 36: Lecture2 Data Mining

Clustering Techniques

• IF­Sex = Female & 43 >= Age >= 35 & Credit

Card Insurance = No THEN­Class = 3

Rule Accuracy: 100.00%

Rule Coverage: 66.67%

Page 37: Lecture2 Data Mining

Evaluating Performance

• The most critical of all the steps in the data mining

process.

• A common sense approach to evaluating supervised

and unsupervised learner models.

• Three general questions:

– Will the benefits received from a data mining project more

than offset the cost of the data mining process?

– How do we interpret the results of a data mining session?

– Can we use the results of a data mining process with

confidence?

Page 38: Lecture2 Data Mining

Evaluating Performance

• Is there knowledge about projects similar to the

proposed project? What are the success rates and

costs of projects similar to the planned project?

• What is the current form of the data to be analyzed?

Does the data exist or will it have to be collected?

When a wealth of data exists and is not in a form

amenable for data mining, the greatest project cost

will fall under the category of data preparation. In

fact, a larger question may be whether to develop a

data warehouse for future data mining projects.

Page 39: Lecture2 Data Mining

Evaluating Performance

• Who will be responsible for the data mining

project? How many current employees will be

involved? Will outside consultants be hired?

• Is the necessary software currently available?

If not, will the software be pur chased or

developed? If purchased or developed, how

will the software be integrated into the current

system?

Page 40: Lecture2 Data Mining

Evaluating Performance

• Evaluating Supervised Learner Models

• Classification correctness is best calculated

by presenting previously unseen data in the

form of a test set to the model being

evaluated.

• Test set model accuracy can be summarized

in a table known as a confusion matrix.

Page 41: Lecture2 Data Mining

Evaluating Performance

• A generic confusion matrix for the three class case is shown in Table

• Values along the main diagonal give the total number of correct classifications for each class. For example, a value of 15 for C11 means that 15 class C1 test set instances were correctly classified.

• Values other than those on the main diagonal represent classi fication errors. To illustrate, suppose C12 has the value 4.This means that four class C1instances were incorrectly classified as belonging to class C2.

C1 C2 C3

C1 C11 C12 C13

C2 C21 C22 C23

C3 C31 C32 C33

Page 42: Lecture2 Data Mining

Evaluating Performance

• Two-Class Error Analysis• Evaluating Numeric Output• The mean­absolute­error for a set of test data is computed

by finding the average absolute difference between computed and desired outcome values.

• In a similar manner, the mean­squared­error is simply the average squared difference between computed and desired outcome.

• It is obvious that for a best test set accuracy we wish to obtain the smallest possible value for each measure.

• The root­mean­squared­error­(rms)­is simply the square root of a mean squared error value. Rms is frequently used as a measure of test set accuracy with feed-forward neural networks.

Page 43: Lecture2 Data Mining

Evaluating Performance

• Comparing Models by Measuring Lift

Lift = P(Ci | Sample)

P(Ci | Population)

Page 44: Lecture2 Data Mining

Evaluating Performance

% Sampled

Page 45: Lecture2 Data Mining

Evaluating Performance

• Unsupervised­Model­Evaluation

• Evaluating unsupervised data mining is, in general, a more difficult task than super vised evaluation. This is true because the goals of an unsupervised data mining session are frequently not as clear as the goals for supervised learning.

• A general technique that employs supervised learning to evaluate an unsupervised clustering.

Page 46: Lecture2 Data Mining

Evaluating Performance

• All unsupervised clustering techniques compute some measure of cluster quality.

• A common technique is to calculate the summation of squared error differences between the instances of each cluster and their corresponding cluster center.

• Smaller values for sums of squared error differences indicate clusters of higher quality. However, for a detailed evaluation of unsupervised clustering, it is supervised learning that comes to the rescue.

Page 47: Lecture2 Data Mining

Evaluating Performance

• Finally, a common misconception in the business world is

that data mining can be accomplished simply by choosing the

right tool, turning it loose on some data, and waiting for

answers to problems.

• This approach is doomed to failure. Machines are still

machines. It is the analysis of results provided by the human

element that ultimately dictates the success or failure of a

data mining project.

• A formal KDD process model such as the one described in

Chapter 5 will help provide more complete answers to the

questions posed at the beginning of this section.

Page 48: Lecture2 Data Mining

Summary

• Data mining strategies include classification, estimation, prediction, unsupervised clus tering, and market basket analysis.

• The output of a classification strategy is categorical.

• The output of an estimation strategy is numeric.• A predictive strategy is used to design models

for predicting future outcome.• Unsupervised clustering strategies are employed

to dis cover hidden concept structures in data as well as to locate atypical data instances.

• The purpose of market basket analysis is to find interesting relationships among retail products.

Page 49: Lecture2 Data Mining

Summary

• Data min ing techniques are defined by an algorithm and a knowledge structure.

• Common fea tures that distinguish the various techniques are whether learning is supervised or unsupervised and whether their output is categorical or numeric.

• Performance evaluation is probably the most critical of all the steps in the data mining process.

• Supervised model evaluation is often performed using a training/test set scenario.

Page 50: Lecture2 Data Mining

Summary

• A marketing application measures the goodness of a model by its ability to lift response rate thresholds to levels well above those achieved by naïve (mass) mailing strategies.

• Unsupervised models support some measure of cluster quality that can be used for evaluative purposes. Supervised learning can also be employed to evaluate the quality of the clusters formed by an unsupervised model.

Page 51: Lecture2 Data Mining

Questions