Decision Tree Learning
Richard McAllister
February 4, 2008
1 Overview
2 Tree Construction
3 Case Study: Determinants of House Price
Decision Tree Learning
Widely used
Used for approximating discrete-valued functions
Robust to noisy data
Capable of learning disjunctive expressions
Algorithms
ID3
Assistant
C4.5
Uses information entropy / information gain
Inductive Bias
Preference for small trees over large
Occam's Razor: prefer the simplest hypothesis that fits the data.
Tree
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
Leftmost branch: 〈Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong〉
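The tree above (the standard PlayTennis example from Mitchell's textbook, which these slides follow) can be written directly as nested conditionals. A minimal sketch; the function name is ours:

```python
# A direct translation of the decision tree into nested conditionals.
# The function returns "Yes"/"No" for the PlayTennis concept.
def classify(outlook, humidity, wind):
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "No" if wind == "Strong" else "Yes"
    raise ValueError(f"unknown Outlook value: {outlook}")

# The leftmost branch: Sunny/High. Note that Temperature is never
# tested by this tree, so Hot plays no role in the outcome.
print(classify("Sunny", "High", "Strong"))  # -> No
```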
Disjunction of Conjunctions
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Appropriate Problems
Instances represented by attribute-value pairs
Target function has discrete output values (e.g. yes or no)
Disjunctive descriptions may be required
Training data may contain errors
Training data may contain missing attribute values
For Example...
Classifying medical patients by disease
Loan applications
Equipment malfunctions
Classification problems
ID3
Learns by constructing decision trees top-down: "What should be tested first?"
Greedy: the locally best attribute is chosen at each node.
Descendant nodes are then created for the possible values of the chosen attribute.
How do you choose what’s next?
Information Gain
Measures how well a given attribute separates the training examples according to their target classifications.
Information Gain
Entropy
Characterizes the (im)purity of an arbitrary collection of examples:

Entropy(S) ≡ −p⊕ log₂ p⊕ − p⊖ log₂ p⊖

S : the collection of positive and negative examples of some target concept
p⊕ : the proportion of positive examples in S
p⊖ : the proportion of negative examples in S

Information gain then measures the expected reduction in entropy caused by knowing the value of an attribute A.

Incidentally, the general form for c classes is

Entropy(S) ≡ ∑_{i=1}^{c} −p_i log₂ p_i
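A minimal two-class entropy sketch, using the 9-positive/5-negative counts of Mitchell's 14-example weather data that the slides' gain figures are computed from:

```python
import math

def entropy(pos, neg):
    """Entropy(S) for a two-class collection with `pos` positive and
    `neg` negative examples; 0 * log2(0) is taken as 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            result -= p * math.log2(p)
    return result

# 9 positive and 5 negative examples:
print(round(entropy(9, 5), 3))  # -> 0.94
```

Entropy is 0 for a pure collection and reaches its maximum of 1 when the two classes are equally represented.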
Information Gain (Cont’d)
Information Gain

Gain(S, A) ≡ Entropy(S) − ∑_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

S : the collection of positive and negative examples of some target concept
A : a particular attribute
S_v : the collection of positive and negative examples of the same target concept after A has partitioned S: S_v = {s ∈ S | A(s) = v}
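The definition above translates almost line-for-line into code. A sketch; the example records are hypothetical dicts mapping attribute names to values:

```python
import math
from collections import Counter

def entropy(labels):
    """General-form entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(examples, attribute, target):
    """Gain(S, A): entropy of S minus the weighted entropy of the
    partitions S_v induced by each value v of `attribute`."""
    labels = [e[target] for e in examples]
    remainder = 0.0
    for v in {e[attribute] for e in examples}:
        sv = [e[target] for e in examples if e[attribute] == v]
        remainder += len(sv) / len(examples) * entropy(sv)
    return entropy(labels) - remainder

# An attribute that separates the classes perfectly recovers all of
# the collection's entropy:
exs = [{"A": "x", "y": "+"}, {"A": "z", "y": "-"}]
print(gain(exs, "A", "y"))  # -> 1.0
```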
Use of Information Gain
Information gain is calculated for each attribute:

Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029

The attribute with the maximum gain is chosen at each level.
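The four gain figures above can be reproduced from Mitchell's 14-example PlayTennis training set (a self-contained sketch, assuming that dataset):

```python
import math
from collections import Counter

FIELDS = ("Outlook", "Temperature", "Humidity", "Wind", "PlayTennis")
DATA = [dict(zip(FIELDS, row)) for row in [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(examples, attribute):
    labels = [e["PlayTennis"] for e in examples]
    g = entropy(labels)
    for v in {e[attribute] for e in examples}:
        sv = [e["PlayTennis"] for e in examples if e[attribute] == v]
        g -= len(sv) / len(examples) * entropy(sv)
    return g

gains = {a: gain(DATA, a) for a in FIELDS[:-1]}
best = max(gains, key=gains.get)
print(best)  # -> Outlook  (maximum gain, so it becomes the root test)
```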
Continues until...
1. Every attribute has already been included along the tree's path.
2. Training examples associated with leaf nodes all have the same target attribute value (i.e. entropy is zero).
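Putting the pieces together, the top-down construction with these two stopping conditions can be sketched as a short recursion (the toy data and names are ours, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def id3(examples, attributes, target):
    """Recursive ID3 sketch: grow the tree top-down, choosing the
    highest-gain attribute at each node. A subtree is a dict
    {attribute: {value: subtree}}; a leaf is a class label."""
    labels = [e[target] for e in examples]
    # Stop when the node is pure or no attributes remain (the two
    # conditions above); label leaves with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        g = entropy(labels)
        for v in {e[a] for e in examples}:
            sv = [e[target] for e in examples if e[a] == v]
            g -= len(sv) / len(examples) * entropy(sv)
        return g

    best = max(attributes, key=gain)
    rest = [a for a in attributes if a != best]
    return {best: {v: id3([e for e in examples if e[best] == v], rest, target)
                   for v in {e[best] for e in examples}}}

# Toy data: attribute A determines y perfectly, B is irrelevant,
# so the learned tree tests only A.
data = [{"A": "x", "B": "p", "y": "+"}, {"A": "x", "B": "q", "y": "+"},
        {"A": "z", "B": "p", "y": "-"}, {"A": "z", "B": "q", "y": "-"}]
print(id3(data, ["A", "B"], "y"))
```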
Hypothesis Space Search in Decision Tree Learning
Search proceeds by hill-climbing.
The information gain measure guides the hill-climbing.
Capabilities and Limitations
ID3's hypothesis space is a complete space of finite, discrete-valued functions.
Maintains only one current hypothesis as it searches through the space of trees.
No backtracking: it can reach a local optimum that is not global.
Uses all training examples at each step in its statistics-based decisions.
Issues
Overfitting the Data
Incorporating Continuous-Valued Attributes
Alternative Measures for Selecting Attributes
Handling Training Examples with Missing Attribute Values
Handling Attributes with Differing Costs
Overfitting the Data
Overfitting: when some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances.
Addition of incorrect training data will cause ID3 to create a more complex tree.
Addition of noise can also cause overfitting.
Prevention:
Reduced-error pruning: replace a subtree with a single node, using the most common classification of the associated training examples to label it.
Rule post-pruning: convert the tree to rules, then prune each rule by removing preconditions whose removal improves the rule's estimated accuracy.
Incorporating Continuous-Valued Attributes
Discretizing: dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals.
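A common way to discretize (the one used by C4.5 for boolean splits): sort the examples by the continuous value, take as candidate thresholds the midpoints between adjacent examples whose labels differ, and keep the candidate with the highest information gain. A sketch; the temperature values follow Mitchell's example:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the boolean split `value < t` with the highest
    information gain. Candidate thresholds are midpoints between
    adjacent sorted values whose labels differ."""
    pairs = sorted(zip(values, labels))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1)
                  if pairs[i][1] != pairs[i + 1][1]]

    def gain(t):
        left = [l for v, l in pairs if v < t]
        right = [l for v, l in pairs if v >= t]
        return (entropy(labels)
                - len(left) / len(labels) * entropy(left)
                - len(right) / len(labels) * entropy(right))

    return max(candidates, key=gain)

temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, play))  # -> 54.0
```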
Alternative Measures for Selecting Attributes
SplitInformation(S, A): how broadly and uniformly the attribute A splits the data.
Distance-based measures
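SplitInformation is the entropy of S with respect to the attribute's values (rather than the target concept); dividing Gain by it yields the gain ratio, which penalizes attributes with many distinct values. A sketch with hypothetical data:

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)
    over the partitions S_i induced by A's values."""
    total = len(examples)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(e[attribute] for e in examples).values())

# An attribute splitting 14 examples into two equal halves has split
# information 1.0; one with a distinct value per example (think of a
# Date attribute) has log2(14) ~ 3.807, so GainRatio = Gain /
# SplitInformation penalizes it heavily.
half = [{"A": "x"}] * 7 + [{"A": "y"}] * 7
unique = [{"A": i} for i in range(14)]
print(round(split_information(half, "A"), 3))    # -> 1.0
print(round(split_information(unique, "A"), 3))  # -> 3.807
```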
Handling Training Examples with Missing Attribute Values
Assign the missing attribute the most common value among the training examples.
Assign a probability to each of the possible values.
Estimate the value of the missing attribute based on the observed frequencies of the values among the training examples.
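The first strategy is the simplest to sketch (function name and data are ours; `None` marks a missing value):

```python
from collections import Counter

def impute_most_common(examples, attribute):
    """Fill a missing attribute (None) with the most common
    observed value for that attribute."""
    observed = [e[attribute] for e in examples if e[attribute] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**e, attribute: mode if e[attribute] is None else e[attribute]}
            for e in examples]

data = [{"Humidity": "High"}, {"Humidity": "High"},
        {"Humidity": "Normal"}, {"Humidity": None}]
print(impute_most_common(data, "Humidity")[-1])  # -> {'Humidity': 'High'}
```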
Handling Attributes with Differing Costs
Introduce a cost term into the attribute selection measure.
Example:

Gain²(S, A) / Cost(A)
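As a minimal sketch (function name ours), the example measure trades gain against measurement cost, so a cheap low-gain test can outrank an expensive high-gain one:

```python
def cost_adjusted(gain, cost):
    """The slide's example measure: Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

# gain 0.2 at cost 1 beats gain 0.4 at cost 10:
print(cost_adjusted(0.2, 1.0) > cost_adjusted(0.4, 10.0))  # -> True
```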
Case Study: Determinants of House Price
The house purchasing decision depends on household preference or priority orderings for dwelling attributes or services; the attributes or services therefore probably have different importance in the determination of housing prices.
Decision Tree Application
CART Algorithm
Makes use of an impurity measure.
Impurity:
low value: indicates the predomination of a single class within a set of cases
high value: indicates an even distribution of classes
Gini Impurity Measure
I_G(i) = 1 − ∑_{j=1}^{m} f(i, j)² = ∑_{j≠k} f(i, j) f(i, k)

f(i, j) : the probability of getting value j in node i
Gini impurity is based on squared probabilities ofmembership for each target category in the node. Itreaches its minimum (zero) when all cases in the nodefall into a single target category.
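The measure is a one-liner in code (a sketch; the function name is ours):

```python
def gini(probabilities):
    """I_G(i) = 1 - sum_j f(i, j)^2 for the class probabilities
    f(i, j) observed in node i."""
    return 1.0 - sum(p * p for p in probabilities)

print(gini([1.0, 0.0]))  # -> 0.0  (pure node: minimum impurity)
print(gini([0.5, 0.5]))  # -> 0.5  (evenly mixed two-class node)
```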