DATA MINING: CLASSIFICATION


Classification: Definition

Classification is a supervised learning technique.
It uses training sets that contain the correct answers (class label attributes).
A model is created by running the algorithm on the training data.
Test the model. If accuracy is low, regenerate the model after changing features or reconsidering samples.
Identify a class label for the incoming new data.


Applications:

Classifying credit card transactions as legitimate or fraudulent.
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
Categorizing news stories as finance, weather, entertainment, sports, etc.


Classification: A two-step process

Model construction: describing a set of predetermined classes.
» Each sample is assumed to belong to a predefined class, as determined by the class label attribute.
» The set of samples used for model construction is the training set.
» The model is represented as classification rules, decision trees, or mathematical formulae.


Model usage: classifying future or unknown objects.
» Estimate the accuracy of the model.
» The known label of each test sample is compared with the classified result from the model.
» The accuracy rate is the percentage of test set samples that are correctly classified by the model.
» The test set is independent of the training set.
» If the accuracy is acceptable, use the model to classify data samples whose class labels are not known.
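The accuracy rate described here can be computed directly from the known and predicted labels. The following is a minimal sketch in Python; the function name `accuracy_rate` and the four example labels are illustrative assumptions, not part of the original slides.

```python
def accuracy_rate(known_labels, predicted_labels):
    """Return the percentage of test samples whose predicted label matches the known label."""
    if len(known_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not known_labels:
        raise ValueError("test set must not be empty")
    correct = sum(k == p for k, p in zip(known_labels, predicted_labels))
    return 100.0 * correct / len(known_labels)


# Hypothetical test set of four samples; three are classified correctly.
known = ["yes", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no"]
print(accuracy_rate(known, predicted))  # 75.0
```

The test labels must come from samples that were not used for training; otherwise the figure says little about unseen data.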


Model Construction:

Training Data

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Training Data → Classification Algorithms → Classifier (Model)

IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’


Classification Process (2): Use the Model in Prediction

Classifier

Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      yes
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4)

Tenured?
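The prediction step can be traced in a few lines of Python. This is a sketch only: the function name `predict_tenured` is an assumption, while the rule and the testing rows are taken from the slides as transcribed.

```python
def predict_tenured(rank, years):
    """Rule from model construction: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, known tenured label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "yes"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare the known label of every test sample with the classifier's result.
correct = sum(
    predict_tenured(rank, years) == label for _, rank, years, label in testing_data
)
accuracy = 100.0 * correct / len(testing_data)
print(f"accuracy on testing data: {accuracy:.0f}%")

# Unseen data: (Jeff, Professor, 4) has no known label, so the model supplies one.
print("Jeff tenured?", predict_tenured("Professor", 4))
```

With the testing data as listed, all four samples agree with the rule, and Jeff is classified as tenured because his rank is Professor.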


Classification techniques:

Decision Tree based Methods
Rule-based Methods
Neural Networks
Bayesian Classification
Support Vector Machines


Algorithm for decision tree induction:

Basic algorithm:
» The tree is constructed in a top-down recursive divide-and-conquer manner.
» At start, all the training examples are at the root.
» Attributes are categorical (if continuous-valued, they are discretized in advance).
» Examples are partitioned recursively based on selected attributes.
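The basic algorithm can be sketched in Python as follows. The slides do not say how an attribute is selected at each step, so this sketch assumes information gain (the measure used by ID3); the function names and the dict-based row format are also assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Shannon entropy of the class labels in rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def partition(rows, attribute):
    """Group rows by their value for one categorical attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups


def information_gain(rows, attribute, target):
    """Entropy reduction from splitting on attribute (assumed selection measure)."""
    remainder = sum(
        len(subset) / len(rows) * entropy(subset, target)
        for subset in partition(rows, attribute).values()
    )
    return entropy(rows, target) - remainder


def build_tree(rows, attributes, target):
    """Top-down recursive divide-and-conquer induction over categorical attributes."""
    labels = Counter(row[target] for row in rows)
    majority = labels.most_common(1)[0][0]
    # Stop when the partition is pure or no attributes remain; return a leaf label.
    if len(labels) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {
        best: {
            value: build_tree(subset, remaining, target)
            for value, subset in partition(rows, best).items()
        }
    }


# Tiny hypothetical training set: all examples start at the root.
rows = [
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
]
print(build_tree(rows, ["credit", "student"], "buys"))
```

An internal node is a one-key dict mapping the chosen attribute to its branches, and a leaf is a plain class label. Other selection measures, such as the Gini index, would slot into `build_tree` in place of `information_gain`.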


Example of Decision Tree:

Training Dataset

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no


Output: A Decision Tree for “buys_computer”

age?
  <=30: student?
    no: buys_computer = no
    yes: buys_computer = yes
  31…40: buys_computer = yes
  >40: credit_rating?
    excellent: buys_computer = no
    fair: buys_computer = yes
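The decision tree for “buys_computer” can be read as nested conditions. The sketch below encodes it as a Python function (the name `buys_computer` and the tuple layout are assumptions) and checks it against the 14 training rows; income is listed for completeness but the tree never tests it.

```python
def buys_computer(age, student, credit_rating):
    """Follow the decision tree from the root (age) down to a leaf label."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        return "yes"
    if age == ">40":
        return "yes" if credit_rating == "fair" else "no"
    raise ValueError(f"unknown age group: {age!r}")


# Training dataset: (age, income, student, credit_rating, buys_computer).
training = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

# Count the training rows on which the tree disagrees with the recorded label.
mismatches = [
    row for row in training if buys_computer(row[0], row[2], row[3]) != row[4]
]
print(len(mismatches))  # 0
```

The tree reproduces every training label, which is expected for a fully grown tree on consistent data and says nothing yet about accuracy on unseen records.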


Advantages of decision tree based classification:

Inexpensive to construct.
Extremely fast at classifying unknown records.
Easy to interpret for small-sized trees.
Accuracy is comparable to other classification techniques for many simple data sets.


Enhancements to basic decision tree induction:

Allow for continuous-valued attributes
» Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
Handle missing attribute values
» Assign the most common value of the attribute
» Assign probability to each of the possible values
Attribute construction
» Create new attributes based on existing ones that are sparsely represented
» This reduces fragmentation, repetition, and replication
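Two of these enhancements are easy to illustrate. The sketch below maps a continuous value to an interval and fills missing values with the most common one. It is a simplification under stated assumptions: the cut points are fixed by hand rather than chosen dynamically during induction, and the names `discretize` and `fill_most_common` are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def discretize(value, cut_points, labels):
    """Map a continuous value to an interval label; cut_points are sorted upper bounds."""
    return labels[bisect_left(cut_points, value)]


def fill_most_common(rows, attribute, missing=None):
    """Replace missing values of one attribute with its most common observed value."""
    observed = Counter(row[attribute] for row in rows if row[attribute] is not missing)
    most_common = observed.most_common(1)[0][0]
    return [
        {**row, attribute: most_common} if row[attribute] is missing else row
        for row in rows
    ]


# Hypothetical cut points producing age groups like those in the example dataset.
age_cuts = [30, 40]
age_labels = ["<=30", "31…40", ">40"]
print(discretize(25, age_cuts, age_labels))  # <=30
print(discretize(40, age_cuts, age_labels))  # 31…40
print(discretize(41, age_cuts, age_labels))  # >40

# Hypothetical rows with one missing income value.
rows = [{"income": "high"}, {"income": None}, {"income": "high"}, {"income": "low"}]
print(fill_most_common(rows, "income")[1])  # {'income': 'high'}
```

Assigning a probability to each possible value, the other option listed for missing values, would instead pass the row down every branch with a fractional weight.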


Potential Problem: Overfitting

Overfitting occurs when the generated model does not apply to new incoming data. Typical causes:
» The training data is too small and does not cover many cases.
» The model rests on wrong assumptions.
Overfitting results in decision trees that are more complex than necessary.
Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
New ways of estimating errors are needed.


How to avoid overfitting:

Two ways to avoid overfitting are:
» Pre-pruning
» Post-pruning

Pre-pruning:
» Stop the algorithm before it becomes a fully grown tree.
» Stop if all instances belong to the same class.
» Stop if the number of instances is less than some user-specified threshold.
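The two pre-pruning stopping conditions can be written as a small check that a tree builder would call before splitting a node. This is a sketch: `should_stop`, `leaf_label`, and the `min_instances` parameter are hypothetical names, and labelling the resulting leaf with the majority class is an assumed convention.

```python
from collections import Counter


def should_stop(labels, min_instances):
    """Pre-pruning test for one node, given the class labels of the instances reaching it."""
    if len(set(labels)) <= 1:
        return True  # all instances belong to the same class
    if len(labels) < min_instances:
        return True  # fewer instances than the user-specified threshold
    return False


def leaf_label(labels):
    """Label a node that stops growing with the majority class of its instances."""
    return Counter(labels).most_common(1)[0][0]


print(should_stop(["yes", "yes"], min_instances=5))        # True: pure node
print(should_stop(["yes", "no", "no"], min_instances=5))   # True: below threshold
print(should_stop(["yes", "no"] * 3, min_instances=5))     # False: keep splitting
print(leaf_label(["yes", "no", "no"]))                     # no
```

A larger threshold gives a smaller tree, at the risk of stopping before a useful split is found.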


Post-pruning:
» Grow the decision tree to its entirety.
» Trim the nodes of the decision tree in a bottom-up fashion.
» If generalization error improves after trimming, replace the sub-tree by a leaf node.
» The class label of the leaf node is determined from the majority class of instances in the sub-tree.
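A bottom-up pruning pass might look like the sketch below. The slides do not say how generalization error is estimated, so this version assumes a separate validation set and counts misclassifications on it; the `Node` class and all function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    counts: Counter                    # class counts of training instances at this node
    attribute: Optional[str] = None    # None marks a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)

    def majority(self):
        return self.counts.most_common(1)[0][0]


def classify(node, row):
    """Walk down the tree; fall back to the majority class on an unseen value."""
    while node.attribute is not None:
        child = node.children.get(row[node.attribute])
        if child is None:
            break
        node = child
    return node.majority()


def error_count(root, rows, target):
    """Number of rows the tree misclassifies."""
    return sum(classify(root, row) != row[target] for row in rows)


def post_prune(node, root, validation_rows, target):
    """Bottom-up pruning: children first, then try replacing this sub-tree by a leaf."""
    if node.attribute is None:
        return
    for child in node.children.values():
        post_prune(child, root, validation_rows, target)
    before = error_count(root, validation_rows, target)
    saved = (node.attribute, node.children)
    node.attribute, node.children = None, {}  # tentatively turn the sub-tree into a leaf
    after = error_count(root, validation_rows, target)
    if after >= before:                       # no improvement: restore the sub-tree
        node.attribute, node.children = saved
```

Because each node keeps the class counts of the training instances that reached it, a pruned sub-tree automatically predicts the majority class of those instances. Calling `post_prune(root, root, validation_rows, "label")` prunes the whole tree in place.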


Bayesian Classification Algorithm:

Let X be a data sample whose class label is unknown.
Let H be a hypothesis that X belongs to class C.
For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data; it reflects the background knowledge).
P(X): probability that the sample data is observed.
P(X|H): probability of observing the sample X, given that the hypothesis holds.
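These four quantities are tied together by Bayes' theorem, which the definitions imply but the slide does not write out:

```latex
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
```

Because P(X) is the same for every class, comparing candidate classes only requires the numerator P(X|H) P(H).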


Training dataset for Bayesian Classification:

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)
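The classification of X can be worked through in Python. This sketch assumes the naive Bayes simplification, in which attributes are treated as independent given the class, so that P(X|C) is a product of per-attribute frequencies; it uses raw frequency estimates with no smoothing. The names `naive_bayes_scores`, `COLUMNS`, and `ROWS` are hypothetical.

```python
from collections import Counter

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def naive_bayes_scores(rows, sample, target="buys_computer"):
    """Return P(X|C) * P(C) per class C, assuming attributes are independent given C."""
    target_index = COLUMNS.index(target)
    class_counts = Counter(row[target_index] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        in_class = [row for row in rows if row[target_index] == label]
        score = count / len(rows)  # prior P(C)
        for name, value in sample.items():
            index = COLUMNS.index(name)
            matches = sum(row[index] == value for row in in_class)
            score *= matches / count  # conditional P(x_i | C)
        scores[label] = score
    return scores


sample = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
scores = naive_bayes_scores(ROWS, sample)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

For this sample the scores come out at roughly 0.028 for C1 (yes) and 0.007 for C2 (no), so X is assigned to buys_computer = ‘yes’.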


Advantages & Disadvantages of Bayesian Classification:

Advantages:
» Easy to implement.
» Good results are obtained in most cases.

Disadvantages:
» The independence assumption causes a loss of accuracy.
» In practice, dependencies exist among variables.
» E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.).
» Dependencies among these cannot be modeled by a naive Bayesian classifier.


Conclusion:

Training data is an important factor in building a model in supervised algorithms.
The classification results generated by the different algorithms (Naïve Bayes, Decision Tree, Neural Networks, …) are not considerably different from each other.
Different classification algorithms can take different amounts of time to train and build models.
Mechanical classification is faster.




Thank you!