DATA MINING : CLASSIFICATION
Classification : Definition
Classification is a supervised learning task. It uses a training set that has correct answers (class label attributes). A model is created by running the algorithm on the training data. Test the model; if accuracy is low, regenerate the model after changing features or reconsidering samples. Finally, identify a class label for each incoming new record.
Applications:
Classifying credit card transactions as legitimate or fraudulent.
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes.
Each sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of samples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formula.
Model usage: for classifying future or unknown objects.
Estimate accuracy of the model. The known label of test sample is compared with
the classified result from the model. Accuracy rate is the percentage of test set
samples that are correctly classified by the model.
Test set is independent of training set. If the accuracy is acceptable, use the model to
classify data samples whose class labels are not known.
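The accuracy estimate in step 2 can be sketched in a few lines of Python; the labels below are illustrative, not from a real dataset:

```python
# Minimal sketch of model usage, step 2: estimate accuracy on an
# independent test set by comparing known labels with predicted ones.
true_labels      = ["yes", "no", "yes", "yes", "no"]   # known test-set labels (illustrative)
predicted_labels = ["yes", "no", "no",  "yes", "no"]   # what the model returned (illustrative)

correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)  # percentage of test samples correctly classified
print(f"Accuracy: {accuracy:.0%}")     # 4 of 5 correct -> 80%
```

If this accuracy is acceptable, the model is then used on samples whose class labels are not known.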
Classification Process (1): Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm run on the training data produces the classifier (model), here a rule:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction
Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      yes
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The classifier is checked against the testing data, then applied to unseen data, e.g. (Jeff, Professor, 4) -> Tenured?
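The learned rule (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') can be applied to the unseen sample directly; a minimal sketch, with the rank lower-cased for comparison:

```python
# The classification rule learned in the model-construction step:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# The unseen sample from the slide: (Jeff, Professor, 4)
print(predict_tenured("Professor", 4))       # -> yes
# Cross-check against the testing data:
print(predict_tenured("Assistant Prof", 2))  # Tom -> no
print(predict_tenured("Associate Prof", 7))  # Merlisa -> yes
```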
Classification techniques:
Decision Tree based Methods
Rule-based Methods
Neural Networks
Bayesian Classification
Support Vector Machines
Algorithm for decision tree induction:
Basic algorithm: the tree is constructed in a top-down, recursive, divide-and-conquer manner.
At the start, all the training examples are at the root.
Attributes are categorical (if continuous-valued, they are discretized in advance).
Examples are partitioned recursively based on selected attributes.
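The slides do not say how the attribute for each split is selected; a common criterion is information gain, the reduction in entropy from partitioning on an attribute. A minimal sketch with a toy illustrative dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Entropy reduction from partitioning rows on one categorical attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Toy check: an attribute whose values perfectly separate the classes
rows   = [("a",), ("a",), ("b",), ("b",)]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, 0, labels))  # 1.0 bit: a perfect split
```

At each recursive step the algorithm would pick the attribute with the highest gain and partition the examples on its values.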
Example of a Decision Tree

Training dataset:

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Output: A decision tree for "buys_computer":

age?
  <=30  -> student?
             no  -> no
             yes -> yes
  31…40 -> yes
  >40   -> credit rating?
             excellent -> no
             fair      -> yes
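The tree above can be transcribed directly as a function (attribute values written in ASCII for convenience):

```python
# Direct transcription of the "buys_computer" decision tree:
# split on age, then on student (for <=30) or credit_rating (for >40).
def buys_computer(age, student, credit_rating):
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    elif age == "31...40":
        return "yes"
    else:  # age > 40
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))  # -> yes
```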
Advantages of decision tree based classification:
Inexpensive to construct.
Extremely fast at classifying unknown records.
Easy to interpret for small-sized trees.
Accuracy is comparable to other classification techniques for many simple data sets.
Enhancements to basic decision tree induction:
Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.
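The continuous-attribute enhancement amounts to a binning function; a minimal sketch, with thresholds chosen to match the age intervals used in the example dataset:

```python
# Discretize a continuous attribute (age) into the intervals the
# example dataset uses; the thresholds are taken from that dataset.
def discretize_age(age):
    if age <= 30:
        return "<=30"
    elif age <= 40:
        return "31...40"
    else:
        return ">40"

print(discretize_age(25), discretize_age(35), discretize_age(50))
```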
Potential Problem: Overfitting

Overfitting occurs when the generated model does not apply well to new incoming data. Causes:
» Training data that is too small, not covering enough cases.
» Wrong assumptions.

Overfitting results in decision trees that are more complex than necessary. Training error no longer provides a good estimate of how well the tree will perform on previously unseen records, so new ways of estimating errors are needed.
How to avoid overfitting:
Two ways to avoid overfitting are pre-pruning and post-pruning.

Pre-pruning:
Stop the algorithm before it grows a full tree.
Stop if all instances belong to the same class.
Stop if the number of instances is less than some user-specified threshold.
Post-pruning:
Grow the decision tree to its entirety.
Trim the nodes of the decision tree in a bottom-up fashion.
If generalization error improves after trimming, replace the sub-tree by a leaf node.
The class label of the leaf node is determined from the majority class of instances in the sub-tree.
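The post-pruning steps above can be sketched as follows. The dict-based tree representation and the validation set are illustrative assumptions, not from the slides; generalization error is estimated as the misclassification count on held-out samples:

```python
from collections import Counter

# A node is either a class label (leaf) or a one-entry dict
# {attribute: {value: subtree}}.
def classify(tree, x):
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[x[attr]]
    return tree

def error(tree, samples):
    """Number of validation samples the (sub)tree misclassifies."""
    return sum(classify(tree, x) != y for x, y in samples)

def prune(tree, samples):
    """Bottom-up: replace a sub-tree with a majority-class leaf
    whenever that does not increase error on the validation samples."""
    if not isinstance(tree, dict) or not samples:
        return tree
    attr, branches = next(iter(tree.items()))
    for value in branches:                     # prune children first (bottom-up)
        subset = [(x, y) for x, y in samples if x[attr] == value]
        branches[value] = prune(branches[value], subset)
    leaf = Counter(y for _, y in samples).most_common(1)[0][0]
    if error(leaf, samples) <= error(tree, samples):
        return leaf                            # simpler leaf generalizes at least as well
    return tree

# Toy check: the 'y' branch hurts on validation data, so the whole
# node collapses to the majority-class leaf 'yes'.
validation = [({"a": "x"}, "yes"), ({"a": "y"}, "yes")]
print(prune({"a": {"x": "yes", "y": "no"}}, validation))  # -> yes
```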
Bayesian Classification Algorithm:
Let X be a data sample whose class label is unknown.
Let H be a hypothesis that X belongs to class C.
For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data; it reflects the background knowledge).
P(X): probability that the sample data is observed.
P(X|H): probability of observing the sample X, given that the hypothesis holds.
By Bayes' theorem, P(H|X) = P(X|H) P(H) / P(X).
Training dataset for Bayesian Classification:

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
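Under the naive (class-conditional independence) assumption, the class of X is chosen by comparing P(C) times the product of P(x_i | C) for each class, with all probabilities counted from the 14 training rows above:

```python
# Naive Bayes on the buys_computer training data (age values in ASCII).
data = [
    ("<=30", "high", "no", "fair", "no"),       ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),   (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),       (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),      (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),  (">40", "medium", "no", "excellent", "no"),
]

def posterior(x, cls):
    """P(cls) * product of P(x_i | cls), from raw counts (no smoothing)."""
    rows = [r for r in data if r[-1] == cls]
    p = len(rows) / len(data)                      # prior, e.g. P(yes) = 9/14
    for i, value in enumerate(x):                  # conditionals P(x_i | cls)
        p *= sum(r[i] == value for r in rows) / len(rows)
    return p

x = ("<=30", "medium", "yes", "fair")              # the sample X from the slide
p_yes = posterior(x, "yes")                        # 9/14 * 2/9 * 4/9 * 6/9 * 6/9 ~ 0.0282
p_no  = posterior(x, "no")                         # 5/14 * 3/5 * 2/5 * 1/5 * 2/5 ~ 0.0069
print("yes" if p_yes > p_no else "no")             # -> yes
```

Since P(X) is the same for both classes, it can be ignored when comparing; X is classified as buys_computer = 'yes'.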
Advantages & Disadvantages of Bayesian Classification:

Advantages:
Easy to implement.
Good results obtained in most cases.

Disadvantages:
Due to the class-conditional independence assumption, there is loss of accuracy.
Practically, dependencies exist among variables. E.g., in hospitals, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and disease (lung cancer, diabetes, etc.) are interdependent. Dependencies among these cannot be modeled by a naive Bayesian classifier.
Conclusion:
Training data is an important factor in building a model in supervised algorithms.
The classification results generated by the different algorithms (Naïve Bayes, Decision Tree, Neural Networks, ...) are not considerably different from each other on many data sets.
Different classification algorithms can take different amounts of time to train and build models.
Mechanical classification is faster than manual classification.
Thank you !!!