DATA MINING : CLASSIFICATION
Classification : Definition
Classification is a supervised learning task. It uses a training set that has correct answers (class label attributes). A model is created by running the algorithm on the training data. Test the model; if accuracy is low, regenerate the model after changing features or reconsidering samples. Finally, identify a class label for each incoming new record.
Applications:
Classifying credit card transactions as legitimate or fraudulent.
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes.
Each sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of samples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formula.
Model usage: for classifying future or unknown objects.
Estimate accuracy of the model. The known label of test sample is compared with
the classified result from the model. Accuracy rate is the percentage of test set
samples that are correctly classified by the model.
Test set is independent of training set. If the accuracy is acceptable, use the model to
classify data samples whose class labels are not known.
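The accuracy estimate in step 2 can be sketched in a few lines of Python; the labels below are illustrative, not from a real dataset:

```python
# Minimal sketch of model usage, step 2: estimate accuracy on an
# independent test set by comparing known labels with predicted ones.
true_labels      = ["yes", "no", "yes", "yes", "no"]   # known test-set labels (illustrative)
predicted_labels = ["yes", "no", "no",  "yes", "no"]   # what the model returned (illustrative)

correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)  # percentage of test samples correctly classified
print(f"Accuracy: {accuracy:.0%}")     # 4 of 5 correct -> 80%
```

If this accuracy is acceptable, the model is then used on samples whose class labels are not known.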
Classification Process (1): Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm run on the training data produces the classifier (model), here a rule:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction
Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      yes
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The classifier is checked against the testing data, then applied to unseen data, e.g. (Jeff, Professor, 4) -> Tenured?
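The learned rule (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') can be applied to the unseen sample directly; a minimal sketch, with the rank lower-cased for comparison:

```python
# The classification rule learned in the model-construction step:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# The unseen sample from the slide: (Jeff, Professor, 4)
print(predict_tenured("Professor", 4))       # -> yes
# Cross-check against the testing data:
print(predict_tenured("Assistant Prof", 2))  # Tom -> no
print(predict_tenured("Associate Prof", 7))  # Merlisa -> yes
```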
Classification techniques:
Decision Tree based Methods
Rule-based Methods
Neural Networks
Bayesian Classification
Support Vector Machines
Algorithm for decision tree induction:
Basic algorithm: the tree is constructed in a top-down, recursive, divide-and-conquer manner.
At the start, all the training examples are at the root.
Attributes are categorical (if continuous-valued, they are discretized in advance).
Examples are partitioned recursively based on selected attributes.
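The slides do not say how the attribute for each split is selected; a common criterion is information gain, the reduction in entropy from partitioning on an attribute. A minimal sketch with a toy illustrative dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Entropy reduction from partitioning rows on one categorical attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Toy check: an attribute whose values perfectly separate the classes
rows   = [("a",), ("a",), ("b",), ("b",)]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, 0, labels))  # 1.0 bit: a perfect split
```

At each recursive step the algorithm would pick the attribute with the highest gain and partition the examples on its values.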
Example of a Decision Tree

Training dataset:

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Output: A decision tree for "buys_computer":

age?
  <=30  -> student?
             no  -> no
             yes -> yes
  31…40 -> yes
  >40   -> credit rating?
             excellent -> no
             fair      -> yes
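The tree above can be transcribed directly as a function (attribute values written in ASCII for convenience):

```python
# Direct transcription of the "buys_computer" decision tree:
# split on age, then on student (for <=30) or credit_rating (for >40).
def buys_computer(age, student, credit_rating):
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    elif age == "31...40":
        return "yes"
    else:  # age > 40
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))  # -> yes
```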
Advantages of decision tree based classification:
Inexpensive to construct.
Extremely fast at classifying unknown records.
Easy to interpret for small-sized trees.
Accuracy is comparable to other classification techniques for many simple data sets.
Enhancements to basic decision tree induction:
Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.
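The continuous-attribute enhancement amounts to a binning function; a minimal sketch, with thresholds chosen to match the age intervals used in the example dataset:

```python
# Discretize a continuous attribute (age) into the intervals the
# example dataset uses; the thresholds are taken from that dataset.
def discretize_age(age):
    if age <= 30:
        return "<=30"
    elif age <= 40:
        return "31...40"
    else:
        return ">40"

print(discretize_age(25), discretize_age(35), discretize_age(50))
```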
Potential Problem: Overfitting

Overfitting occurs when the generated model does not apply well to new incoming data. Causes:
» Training data that is too small, not covering enough cases.
» Wrong assumptions.

Overfitting results in decision trees that are more complex than necessary. Training error no longer provides a good estimate of how well the tree will perform on previously unseen records, so new ways of estimating errors are needed.
How to avoid overfitting:
Two ways to avoid overfitting are pre-pruning and post-pruning.

Pre-pruning:
Stop the algorithm before it grows a full tree.
Stop if all instances belong to the same class.
Stop if the number of instances is less than some user-specified threshold.
Post-pruning:
Grow the decision tree to its entirety.
Trim the nodes of the decision tree in a bottom-up fashion.
If generalization error improves after trimming, replace the sub-tree by a leaf node.
The class label of the leaf node is determined from the majority class of instances in the sub-tree.
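The post-pruning steps above can be sketched as follows. The dict-based tree representation and the validation set are illustrative assumptions, not from the slides; generalization error is estimated as the misclassification count on held-out samples:

```python
from collections import Counter

# A node is either a class label (leaf) or a one-entry dict
# {attribute: {value: subtree}}.
def classify(tree, x):
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[x[attr]]
    return tree

def error(tree, samples):
    """Number of validation samples the (sub)tree misclassifies."""
    return sum(classify(tree, x) != y for x, y in samples)

def prune(tree, samples):
    """Bottom-up: replace a sub-tree with a majority-class leaf
    whenever that does not increase error on the validation samples."""
    if not isinstance(tree, dict) or not samples:
        return tree
    attr, branches = next(iter(tree.items()))
    for value in branches:                     # prune children first (bottom-up)
        subset = [(x, y) for x, y in samples if x[attr] == value]
        branches[value] = prune(branches[value], subset)
    leaf = Counter(y for _, y in samples).most_common(1)[0][0]
    if error(leaf, samples) <= error(tree, samples):
        return leaf                            # simpler leaf generalizes at least as well
    return tree

# Toy check: the 'y' branch hurts on validation data, so the whole
# node collapses to the majority-class leaf 'yes'.
validation = [({"a": "x"}, "yes"), ({"a": "y"}, "yes")]
print(prune({"a": {"x": "yes", "y": "no"}}, validation))  # -> yes
```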
Bayesian Classification Algorithm:
Let X be a data sample whose class label is unknown.
Let H be a hypothesis that X belongs to class C.
For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data; it reflects the background knowledge).
P(X): probability that the sample data is observed.
P(X|H): probability of observing the sample X, given that the hypothesis holds.
By Bayes' theorem, P(H|X) = P(X|H) P(H) / P(X).
Training dataset for Bayesian Classification:

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
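Under the naive (class-conditional independence) assumption, the class of X is chosen by comparing P(C) times the product of P(x_i | C) for each class, with all probabilities counted from the 14 training rows above:

```python
# Naive Bayes on the buys_computer training data (age values in ASCII).
data = [
    ("<=30", "high", "no", "fair", "no"),       ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),   (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),       (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),      (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),  (">40", "medium", "no", "excellent", "no"),
]

def posterior(x, cls):
    """P(cls) * product of P(x_i | cls), from raw counts (no smoothing)."""
    rows = [r for r in data if r[-1] == cls]
    p = len(rows) / len(data)                      # prior, e.g. P(yes) = 9/14
    for i, value in enumerate(x):                  # conditionals P(x_i | cls)
        p *= sum(r[i] == value for r in rows) / len(rows)
    return p

x = ("<=30", "medium", "yes", "fair")              # the sample X from the slide
p_yes = posterior(x, "yes")                        # 9/14 * 2/9 * 4/9 * 6/9 * 6/9 ~ 0.0282
p_no  = posterior(x, "no")                         # 5/14 * 3/5 * 2/5 * 1/5 * 2/5 ~ 0.0069
print("yes" if p_yes > p_no else "no")             # -> yes
```

Since P(X) is the same for both classes, it can be ignored when comparing; X is classified as buys_computer = 'yes'.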
Advantages & Disadvantages of Bayesian Classification:

Advantages:
Easy to implement.
Good results obtained in most cases.

Disadvantages:
Due to the class-conditional independence assumption, there is loss of accuracy.
Practically, dependencies exist among variables. E.g., in hospitals, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and disease (lung cancer, diabetes, etc.) are interdependent. Dependencies among these cannot be modeled by a naive Bayesian classifier.
Conclusion:
Training data is an important factor in building a model in supervised algorithms.
The classification results generated by the different algorithms (Naïve Bayes, Decision Tree, Neural Networks, ...) are not considerably different from each other on many data sets.
Different classification algorithms can take different amounts of time to train and build models.
Mechanical classification is faster than manual classification.
Thank you !!!