Chapter 6: Implementations. Why are simple methods not good enough? Robustness: Numeric attributes,...

Chapter 6: Implementations

Why are simple methods not good enough?

• Robustness: Numeric attributes, missing values, and noisy data

Decision Trees

• Divide-and-conquer method

• Earlier discussion of this method worked for nominal values. How to deal with numeric attributes?

• How to calculate information?

Decision Trees and Numeric Attributes

• Generally, when a numeric attribute is used to split data, it is a binary split. So the same numeric attribute may be tested several times.

• As before, the attribute to split on is chosen by information gain: sort the attribute's values and pick the breakpoint that maximizes the gain.

• For example, suppose the values of a numeric attribute, sorted and labeled with their classes, are as follows. The information value at the breakpoint between 62 and 65 is computed as:

<45 (C1), 53 (C2), 55 (C2), 62 (C2), 65 (C1), 71 (C1), 75 (C1), 80 (C2)>

info([1,3],[3,1]) = 4/8*info([1,3]) + 4/8*info([3,1]) = info([1,3]) = -1/4*log2(1/4) - 3/4*log2(3/4) = 0.811 bits

• The information gain of this split is info([4,4]) - 0.811 = 1 - 0.811 = 0.189 bits.
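The breakpoint search can be sketched in a few lines of Python. This is a minimal illustration, not a full tree builder: it assumes base-2 logarithms and candidate breakpoints at the midpoints between adjacent sorted values, and the function names are hypothetical.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a list of class counts, e.g. [1, 3]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_info(values, labels, threshold):
    """Weighted average entropy of the subsets `value < threshold` and `value >= threshold`."""
    classes = sorted(set(labels))
    left = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    n = len(labels)
    info = 0.0
    for side in (left, right):
        if side:
            info += len(side) / n * entropy([side.count(c) for c in classes])
    return info

def best_breakpoint(values, labels):
    """Return (gain, threshold) for the midpoint breakpoint with the highest gain."""
    pairs = sorted(zip(values, labels))
    vals = [v for v, _ in pairs]
    labs = [l for _, l in pairs]
    base = entropy([labs.count(c) for c in set(labs)])
    # Assumption: only midpoints between adjacent distinct values are tried.
    candidates = [(a + b) / 2 for a, b in zip(vals, vals[1:]) if a != b]
    return max((base - split_info(vals, labs, t), t) for t in candidates)

values = [45, 53, 55, 62, 65, 71, 75, 80]
labels = ["C1", "C2", "C2", "C2", "C1", "C1", "C1", "C2"]
info = split_info(values, labels, 63.5)   # breakpoint between 62 and 65
gain, threshold = best_breakpoint(values, labels)
print(round(info, 3), round(gain, 3), threshold)
```

For the eight values above, the breakpoint between 62 and 65 is also the one with the highest gain.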

Decision Trees and Missing Values

• Treat "missing" as a separate category if it has some significance

• Alternatively, send an instance whose value for that attribute is missing down the most popular branch at the split point

• More sophisticated approach: Notionally split the instance into weighted pieces, one per branch, in proportion to the number of training instances that go down each branch. Whenever a piece reaches another node where its value is missing, its weight is split further. Every piece eventually reaches a leaf, and the leaf decisions are combined by summing their weights.
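A minimal sketch of the weighted approach, assuming a tree stored as nested dictionaries in which each internal node records how many training instances went down each branch. The tree layout, the toy counts, and the function name are all hypothetical.

```python
def class_distribution(node, instance, weight=1.0):
    """Return {class: weight} for an instance, splitting its weight at missing values."""
    if "label" in node:                        # leaf node
        return {node["label"]: weight}
    value = instance.get(node["attribute"])
    if value is not None:                      # value known: follow a single branch
        return class_distribution(node["branches"][value], instance, weight)
    # Value missing: send a fraction of the weight down every branch,
    # proportional to the number of training instances that took it.
    total = sum(node["counts"].values())
    result = {}
    for branch_value, child in node["branches"].items():
        share = weight * node["counts"][branch_value] / total
        for label, w in class_distribution(child, instance, share).items():
            result[label] = result.get(label, 0.0) + w
    return result

# Toy tree (hypothetical counts): one split on "outlook".
tree = {
    "attribute": "outlook",
    "counts": {"sunny": 5, "overcast": 4, "rainy": 5},
    "branches": {
        "sunny": {"label": "no"},
        "overcast": {"label": "yes"},
        "rainy": {"label": "yes"},
    },
}
dist = class_distribution(tree, {"outlook": None})
prediction = max(dist, key=dist.get)
```

Here the instance with a missing `outlook` ends up with weight 5/14 on "no" and 9/14 on "yes", so the summed decision is "yes".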

Decision Trees: Pruning

• Postpruning (or backward pruning) and prepruning (forward pruning)

• Most decision tree builders employ postpruning.

• Postpruning involves subtree replacement and subtree raising

• Subtree replacement: Select some subtrees and replace each with a single leaf node (see Figure 1.3a)

• Subtree raising: More complex and not always worthwhile, but the C4.5 scheme uses it (see Fig. 6.1). It is generally restricted to the most popular branch.

Decision Trees: Estimating Error

• To decide whether to apply subtree replacement or subtree raising, we need an estimate of the resulting error.

• Keep in mind that the training set is just a small subset of the entire universe of data, so the tree should not fit only the training data; the error estimate should take this into account

• Method 1: Reduced-error pruning: Hold back some of the training data and use it to estimate the error due to pruning. The drawback is that less data is left for building the tree

• Method 2: Error estimate based on the entire training data

Classification Rules

• Simple separate-and-conquer technique

• Problem: Rules tend to overfit the training data and do not generalize well to independent sets, particularly on noisy data

• Criteria for choosing tests (in a rule):

– Test 1, maximize correctness: p/t, where t is the total number of instances covered by the rule, of which p are positive.

– Test 2, information gain: p[log(p/t)-log(P/T)], where P is the number of positive instances and T the total number of instances before the new test was added.

– Test 1 places more importance on correctness than on coverage; Test 2 is also concerned with coverage

• Missing values: Best to treat a missing value as failing any test on that attribute; the instance may still be matched by other rules that test other attributes

• Numeric attributes: Sort the attribute values and use breakpoints to make rules
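The two criteria for choosing a test can be compared on toy numbers. The sketch below assumes base-2 logarithms, and the counts (10 positive instances out of 20 before the test is added) are hypothetical.

```python
from math import log2

def correctness(p, t):
    """Test 1: fraction of covered instances that are positive."""
    return p / t

def rule_info_gain(p, t, P, T):
    """Test 2: p * [log(p/t) - log(P/T)], here with base-2 logs (an assumption)."""
    return p * (log2(p / t) - log2(P / T))

P, T = 10, 20          # hypothetical counts before the test is added
narrow = (2, 2)        # covers 2 instances, both positive
broad = (8, 10)        # covers 10 instances, 8 positive

print(correctness(*narrow), correctness(*broad))
print(rule_info_gain(*narrow, P, T), rule_info_gain(*broad, P, T))
```

Test 1 prefers the narrow test (100% correct on 2 instances), while Test 2 prefers the broad one because it rewards coverage as well.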

Classification Rules: Generating Good Rules

• Objective: Instead of deriving rules that overfit to the training data, it is best to generate sensible rules that stand a better chance of performing well on new test instances.

• Coverage versus accuracy: Should we choose a rule that is true over 15/20 instances or the one that is 2/2 (that is 100% correct)?

• Split the training data set into a growing set and a pruning set

– Use the growing set to form rules

– Then, remove part of a rule and see its effect on the pruning set; if satisfied, remove that part of the rule

• Algorithm for forming rules by incremental reduced-error pruning

• Worth of a rule based on the pruning set: Suppose the rule gets p instances right out of the t instances it covers, and P of the T instances in total belong to the rule's class. With N = T-P the total number of negative instances and n = t-p the number of negative instances the rule covers, the rule correctly excludes N-n negative instances and correctly covers p positive ones. So [p+(N-n)]/T is taken as a metric.
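A small sketch of the worth metric [p+(N-n)]/T, taking N = T-P as the total number of negative instances and n = t-p as the negatives the rule covers (this reading of the definitions is an assumption). The pruning-set sizes are hypothetical.

```python
def rule_worth(p, t, P, T):
    """[p + (N - n)] / T with N = T - P (all negatives) and n = t - p (covered negatives)."""
    N = T - P
    n = t - p
    return (p + (N - n)) / T

# Hypothetical pruning set: 100 instances, 50 of them positive.
P, T = 50, 100
broad = rule_worth(p=15, t=20, P=P, T=T)   # rule that is right on 15 of 20
narrow = rule_worth(p=2, t=2, P=P, T=T)    # rule that is right on 2 of 2
print(broad, narrow)
```

On these numbers the 15/20 rule scores 0.60 and the 2/2 rule scores 0.52, so the metric favors the rule with broader coverage.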

Classification Rules: Global Optimization

• First generate rules based on incremental reduced-error pruning techniques

• Then a global optimization is performed to increase the accuracy of the rules by revising or replacing individual rules

• Postinduction optimization is shown to improve both the size and performance of the rule set

• But this process is often complex

• RIPPER is a build-and-optimize algorithm

Classification Rules: Using Partial Decision Trees

• Alternative approach to rule induction that avoids global optimization

• Combines divide-and-conquer of decision tree learning (p. 62) and separate-and-conquer for rule learning (p. 112)

– Separate-and-conquer: It builds a rule, removes the instances it covers, and continues creating rules recursively for the remaining instances until none are left.

• To make a single rule, a pruned decision tree is built for the current set of instances, the leaf with the largest coverage is made into a rule, and the tree is discarded

• A partial decision tree is an ordinary decision tree that contains branches to undefined subtrees.

• Entropy(p1,p2,p3,…,pn) = -p1 log p1 - p2 log p2 - … - pn log pn

• Info([a,b,c]) = Entropy(a/(a+b+c), b/(a+b+c), c/(a+b+c))

• See Fig. 6.6 for an illustration

• Figure 6.5 is the algorithm

• Once a partial tree has been built, a single rule is extracted from it.

– Each leaf corresponds to a possible rule, and we seek the best leaf among those subtrees that have been expanded into leaves: choose the leaf that covers the greatest number of instances

– If an attribute value is missing, the instance is assigned to each of the branches with a weight proportional to the number of training instances going down that branch

Extending Linear Models

• Basic techniques:

– Linear regression (for numeric prediction)

– Logistic regression (for linear classification): a linear model based on a transformed target variable

– Perceptron (for linear classification)

– Winnow (for linear classification), for data sets with binary attributes

• Basic problem: The boundaries between classes are not necessarily linear

• Support vector machines use linear models to implement nonlinear class boundaries

– Transform the input using a nonlinear mapping; map given instance to a new instance

– Use linear models in the new space

– Transform the boundaries back to the original space, where they are now nonlinear

– Ex: x = w1*a1^3 + w2*a1^2*a2 + w3*a1*a2^2 + w4*a2^3

• Here, a1 and a2 are the attributes; w1, w2, w3, and w4 are to be learned; x is the outcome

• Train one linear system for each class

• Assign an unknown instance to the class that gives the greatest output x---like multiresponse linear regression

• Problems: Higher computational complexity; danger of overfitting

• SVM solves both problems: Use maximum margin hyperplane---hyperplane that gives the greatest separation between the classes---it comes no closer to either than it has to.

– The maximum margin hyperplane is the perpendicular bisector of the shortest line connecting the convex hulls of the two classes (say yes and no)

– The instances that are closest to the maximum margin hyperplane are called support vectors; there is always at least one (if not more) support vector for each class

– Given the support vectors for the two classes (say yes and no), the maximum margin hyperplane can be easily constructed
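The cubic mapping in the example can be sketched as follows: the two attributes are mapped to four product features, a linear model is applied in the new space, and the class with the greatest output wins. The weights below are made up for illustration and are not learned; the function names are hypothetical.

```python
def cubic_features(a1, a2):
    """Map (a1, a2) to the four degree-3 products used in the example."""
    return [a1**3, a1**2 * a2, a1 * a2**2, a2**3]

def output(weights, a1, a2):
    """x = w1*a1^3 + w2*a1^2*a2 + w3*a1*a2^2 + w4*a2^3 (linear in the new space)."""
    return sum(w * f for w, f in zip(weights, cubic_features(a1, a2)))

def classify(class_weights, a1, a2):
    """Assign the instance to the class whose linear model gives the greatest output."""
    return max(class_weights, key=lambda c: output(class_weights[c], a1, a2))

# Hypothetical weights, one linear system per class.
class_weights = {"yes": [1.0, 0.0, 0.0, 0.0], "no": [0.0, 0.0, 0.0, 1.0]}
print(classify(class_weights, 2.0, 1.0), classify(class_weights, 1.0, 2.0))
```

With these weights the boundary between the classes is a1^3 = a2^3, which is linear in the mapped features even though each feature is a nonlinear function of the attributes.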

Support Vector Regression

• Basic regression---find a function that approximates the training data well by minimizing the prediction error (e.g., MSE)

• What is special about SVR: all deviations up to a user-specified parameter ε are simply discarded.

– Also, what is minimized is absolute error rather than MSE

• The value of ε controls how closely the function fits the training data: too small an ε leads to overfitting; too large an ε leads to meaningless predictions.

• See Fig. 6.9
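The role of ε can be illustrated with the error term alone. The sketch below computes an ε-insensitive absolute error on made-up predictions; it shows only how deviations up to ε are discarded and is not a full SVR training procedure.

```python
def epsilon_insensitive_error(y_true, y_pred, epsilon):
    """Sum of absolute deviations, counting only the part that exceeds epsilon."""
    return sum(max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred))

# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 2.5, 2.0]

tight = epsilon_insensitive_error(y_true, y_pred, epsilon=0.2)
loose = epsilon_insensitive_error(y_true, y_pred, epsilon=2.0)
print(tight, loose)
```

With ε = 0.2 the first deviation (0.1) is ignored and the other two contribute 0.3 and 0.8. With ε = 2.0 every deviation is ignored, which is the "too large an ε" case where the error no longer constrains the function.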

Instance-based Learning

• Basic scheme: Use nearest neighbor technique

– Tends to be slow for large training sets

– Performs badly with noisy data---the class of a single instance is based on its single nearest neighbor rather than on averaging

– No weights associated to different attributes---generally some may have larger effect than others

– Does not perform explicit generalization

• Reducing the number of exemplars

– Already seen instances that are used for classification are referred to as exemplars (or examples)

– Classify each new example using the exemplars already stored, and save only the ones that are misclassified; add exemplars only when necessary

– Problem: Noisy examples are likely to be misclassified and therefore stored as new exemplars

• Pruning noisy exemplars:

– For a given k, choose the k nearest neighbors and assign the majority class to the unknown instance

– Alternatively, monitor the performance of the stored exemplars: keep the ones that do well (match well) and discard the rest

– IB3 (Instance-Based learner version 3) uses a 5% confidence level for acceptance and 12.5% for rejection. The criterion for acceptance is more stringent than that for rejection, making it more difficult for an instance to be accepted

• Weighting attributes: Use w1, w2, …, wn as weights in computing the Euclidean distance metric for the n attributes. (see page 238)

– All attribute weights are updated after each training instance is classified and the most similar exemplar is used as the basis for updating.

– Suppose x is the training instance and y the most similar exemplar; then for each attribute i, |xi-yi| is a measure of the contribution of that attribute to the decision. The smaller the difference, the greater the contribution.

– See page 238 for details of changing the attribute weights

• Generalizing exemplars:

– Generalized exemplars are rectangular regions of instance space, called hyperrectangles

– When classifying a new instance, the distance is now measured from the instance to the hyperrectangle

– When a new exemplar is classified correctly, it is generalized simply by merging it with the nearest exemplar of the same class

• If the nearest exemplar is a single instance, a new hyperrectangle is created that covers both exemplars.

• Otherwise, the hyperrectangle is modified to cover the new one

– If the prediction is incorrect, the hyperrectangle’s boundaries are shrunk so it is separated from the instance that was misclassified
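Two of the ideas above, majority voting over the k nearest neighbors and attribute weights in the Euclidean distance, can be sketched together. The form of the weighted distance (each difference scaled by its weight before squaring), the data, and the function names are assumptions for illustration; the weight-update rule is not shown.

```python
from collections import Counter
from math import sqrt

def weighted_distance(x, y, weights):
    """Euclidean distance with each attribute difference scaled by its weight."""
    return sqrt(sum((w * (xi - yi)) ** 2 for w, xi, yi in zip(weights, x, y)))

def knn_classify(exemplars, query, k, weights):
    """Majority class among the k stored exemplars nearest to the query."""
    nearest = sorted(exemplars, key=lambda e: weighted_distance(e[0], query, weights))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical exemplars; the third one is a noisy "b" sitting among the "a"s.
exemplars = [
    ((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((1.1, 1.05), "b"),
    ((5.0, 5.0), "b"), ((5.2, 4.8), "b"),
]
query = (1.1, 1.0)
print(knn_classify(exemplars, query, k=1, weights=(1.0, 1.0)))
print(knn_classify(exemplars, query, k=3, weights=(1.0, 1.0)))
```

With k = 1 the noisy exemplar decides the class; with k = 3 the two genuine neighbors outvote it, which is why averaging over several neighbors helps with noisy data.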

Distance Functions for Generalized Exemplars

• Generalized exemplars are no longer points; instead they are hyperrectangles. So the distance of an instance from an exemplar is computed as follows.

– If the point lies within the hyperrectangle, the distance is zero.

– Otherwise, measure the distance from the point to the nearest point on the hyperrectangle boundary, or to the nearest instance that lies within the hyperrectangle.

– If hyperrectangles overlap, choose the most specific one, that is, the one that covers the smallest region of instance space.
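The first two cases can be sketched for axis-parallel hyperrectangles described by their lower and upper corners. This representation and the function name are assumptions, and the sketch uses the distance to the nearest boundary point rather than to the nearest contained instance.

```python
from math import sqrt

def distance_to_hyperrectangle(point, lower, upper):
    """Euclidean distance from a point to an axis-parallel hyperrectangle (0 if inside)."""
    total = 0.0
    for x, lo, hi in zip(point, lower, upper):
        if x < lo:
            d = lo - x          # point lies below the rectangle on this attribute
        elif x > hi:
            d = x - hi          # point lies above the rectangle on this attribute
        else:
            d = 0.0             # within the rectangle's range on this attribute
        total += d * d
    return sqrt(total)

lower, upper = (0.0, 0.0), (2.0, 2.0)
print(distance_to_hyperrectangle((1.0, 1.0), lower, upper))   # inside
print(distance_to_hyperrectangle((3.0, 1.0), lower, upper))   # outside, nearest to an edge
print(distance_to_hyperrectangle((5.0, 6.0), lower, upper))   # outside, nearest to a corner
```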

Numeric Prediction

• Model tree: Used for numeric prediction. Each leaf holds a linear model that predicts the class value of instances that reach it

• Regression tree: Each leaf node holds the average class value of all instances that reach that node; a special case of model trees

• Model trees:

– Each leaf contains a linear model based on some of the attribute values, which is used to yield a raw predicted value for a test instance

– The raw value can be smoothed by producing linear models for each internal node as well as for the leaves at the time the tree is built. Once a raw value has been obtained at the leaf node, it is filtered along the path back to the root

– Smoothing occurs at each internal node by combining the value passed up from below with the value predicted by the linear model for that node. Function: p’ = (np+kq)/(n+k), where p’ is the prediction passed up to the next higher node, p is the prediction passed up from the node below, q is the value predicted by the model at this node, n is the number of training instances that reach the node below, and k is a smoothing constant.

– Alternatively, the leaf node’s model can be modified to reflect the smoothing that takes place at the internal nodes.

• Building the tree: Similar to the information gain used with nominal attributes, the expected error reduction is the metric used to choose the attribute on which to split.

– SDR = standard deviation reduction = sd(T) - ∑ |Ti|/|T|*sd(Ti), where T is the set of instances that reach the node and the Ti are the subsets that result from splitting on a particular attribute. In other words, the idea is to choose the attribute that most reduces the variation in class values after the split.

– The splitting process terminates when the standard deviation of the class values at a node is a small fraction of the standard deviation of the original instance set.

• Pruning the tree:

– First, a linear model is calculated for each node of the unpruned tree.

– Only the attributes tested in the subtree below this node are used in the regression (assume that all attributes are numeric)

– Once a linear model is in place for each interior node, the tree is pruned back from the leaves as long as the expected estimated error decreases

• Nominal attributes are converted to binary variables that are treated as numeric. A nominal attribute with k possible values is replaced by k-1 synthetic binary attributes
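The splitting metric and the smoothing function from this slide can be written out directly. This is a sketch under two assumptions: sd is taken as the population standard deviation, and the class values and the smoothing constant k = 15 are made-up numbers.

```python
from statistics import pstdev

def sdr(parent, subsets):
    """SDR = sd(T) - sum(|Ti|/|T| * sd(Ti)), using population standard deviation."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)

def smooth(p, q, n, k):
    """p' = (n*p + k*q) / (n + k): blend the value from below with this node's prediction."""
    return (n * p + k * q) / (n + k)

# Hypothetical class values at a node, and two candidate splits of them.
values = [1, 1, 1, 9, 9, 9]
good_split = sdr(values, [[1, 1, 1], [9, 9, 9]])   # separates low from high values
poor_split = sdr(values, [[1, 9, 1], [9, 1, 9]])   # mixes them

# Prediction 10.0 from below (5 training instances), node model predicts 20.0.
smoothed = smooth(p=10.0, q=20.0, n=5, k=15)
print(good_split, round(poor_split, 3), smoothed)
```

The split that separates the low values from the high ones removes all of the spread (SDR = 4.0), while the mixed split removes very little, so the first attribute would be chosen. The smoothed prediction of 17.5 sits closer to the node's own model because few training instances reached the node below.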

Clustering

• Basic scheme: k-means clustering

• Choosing k?

– MDL: Minimum description length principle

• Occam’s razor: Other things being equal, simple things are better than complex ones.

• General theory + exceptions = what we learn

• MDL principle: The best theory for a body of data is the one that minimizes the size of the theory plus the amount of information necessary to specify the exceptions relative to the theory

– That is, the theory that minimizes the number of bits required to communicate the generalization, along with the examples from which it is made (i.e., the training set)
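For reference, the basic k-means scheme alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster. The sketch below is a minimal version with a deliberately naive initialization (the first k points) and a fixed iteration count, both assumptions made to keep it deterministic. It does not address the choice of k.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of numbers; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]   # assumption: seed with the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```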

Bayesian Networks

• Naïve Bayes classifier: For each class value, estimate the probability that a given instance belongs to that class.

• More advanced: Bayesian networks---a network of nodes, one for each attribute, connected by directed edges; a directed acyclic graph

• Fig 6.20 and Fig. 6.21
