Transcript of Lecture 12, CS5671: Decisions, Decisions



Decisions, Decisions
• Concepts
• Naïve Bayesian Classification
• Decision Trees
  – General Algorithm
  – Refinements
    • Accuracy
    • Scalability
  – Strengths and Limitations


Concepts
• Problem
  – Will there be a pop quiz today?
• Data
  – (Duration of delay) in entering class
  – Instructor’s bag (bulging/not bulging)
  – Instructor (has/does not have) wicked/impish smile on face
  – There (was/wasn’t) a quiz last class
• Naïve Bayesian
  – Calculate P(Pop Quiz) from the data with no regard to order of calculation
• Decision Tree
  – Evaluate the data in a particular branching sequence: If …. then … elsif …..


Naïve Bayesian
• Goal: To estimate P(M|D), aka the posterior
• “Naïve” assumption
  – All data have “free will”
  – All attributes have independent probability distributions (mutual information between every pair of attributes = 0)
• Prior = P(M). Binomial distribution with parameter p = P(Pop Quiz). Thus p = ?
• P(Pop Quiz|D) = P(D|Pop Quiz) P(Pop Quiz) / P(D) where, for i attributes constituting the data,
  – P(D|Pop Quiz) = ∏i P(Di|Pop Quiz)
  – P(D) = K (uniform assumption) OR P(D) = ∏i P(Di)
• Thus, either calculate the explicit P(Pop Quiz|D), OR do a maximum-likelihood comparison of P(Pop Quiz|D) and P(No Pop Quiz|D) (see the sketch below)
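As a concrete illustration of the two options above, here is a minimal Python sketch for the pop-quiz example. All attribute names, values, and probabilities are invented for illustration and are not taken from the lecture data.

```python
# Minimal naïve Bayes sketch for the pop-quiz example.
# All probabilities below are made-up illustrations, not lecture data.

# Prior p = P(Pop Quiz), e.g. estimated from past classes (assumed value).
P_QUIZ = 0.3
P_NO_QUIZ = 1.0 - P_QUIZ

# Per-attribute likelihoods P(Di | class), assumed for illustration.
LIKELIHOODS = {
    "quiz": {
        ("delay", "long"): 0.7,
        ("bag", "bulging"): 0.8,
        ("smile", "wicked"): 0.9,
        ("quiz_last_class", False): 0.6,
    },
    "no_quiz": {
        ("delay", "long"): 0.3,
        ("bag", "bulging"): 0.4,
        ("smile", "wicked"): 0.2,
        ("quiz_last_class", False): 0.5,
    },
}

def naive_bayes_score(observation, cls, prior):
    """P(class) * prod_i P(D_i | class), i.e. the numerator of the posterior."""
    score = prior
    for attr, value in observation.items():
        score *= LIKELIHOODS[cls][(attr, value)]
    return score

# Today's observed data D = {D_i}.
today = {"delay": "long", "bag": "bulging",
         "smile": "wicked", "quiz_last_class": False}

s_quiz = naive_bayes_score(today, "quiz", P_QUIZ)
s_no = naive_bayes_score(today, "no_quiz", P_NO_QUIZ)

# Option 1: explicit posterior, with P(D) obtained by normalization.
print("P(Pop Quiz | D) =", s_quiz / (s_quiz + s_no))

# Option 2: just compare the two unnormalized scores (P(D) cancels out).
print("Pop quiz today?", s_quiz > s_no)
```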


Decision Trees
• Directed graph for reaching a decision
• Decision =
  – Verdict
  – More generally, classification into one of several classes
  – If (..) OR (…) .. then Pop Quiz; If (..) OR (..) .. then no Pop Quiz
• Given i attributes/data about an instance, navigate the graph based on the values of {i} (see the sketch below)
• Based on minimizing uncertainty
  – Greedy approach: largest drops in uncertainty occur first
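A minimal sketch of what navigating such a tree looks like in code. The branching order and thresholds are invented purely for illustration; they are not the tree the lecture would learn.

```python
# Navigate a hand-built decision tree for the pop-quiz example:
# one attribute is tested per internal node, from the root downwards.

def pop_quiz_decision(delay_minutes, bag_bulging, wicked_smile, quiz_last_class):
    if quiz_last_class:                  # root node: assumed most informative attribute
        return "no pop quiz"
    elif bag_bulging and wicked_smile:   # next level of the tree
        return "pop quiz"
    elif delay_minutes > 5:              # invented threshold on the delay attribute
        return "pop quiz"
    else:
        return "no pop quiz"

print(pop_quiz_decision(delay_minutes=7, bag_bulging=True,
                        wicked_smile=True, quiz_last_class=False))
```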


Decision Trees - Training
• Given: Set of labeled data (Di = {ai}, Ck)
• Goal: To find the best classification tree
• Maximum uncertainty = log2(|classes|) = 1 bit for a 2-class problem
• Entropy for the given data: E(D) = − Σk=1..n P(Ck) log2 P(Ck), for n classes
• Conditional/Residual entropy: E(D|ai, Ck) = Σl { [|Dl| / |D|] · E(Dl) }, for the l subsets into which ai splits the data
• Reduction in uncertainty = gain of information
• Gain G(ai) = E(D) − E(D|ai, Ck) (computed in the sketch below)
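A minimal Python sketch of the entropy and gain formulas above, assuming the dataset is stored as a list of (attribute-dict, class-label) pairs; that representation is chosen here for illustration only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(D) = -sum_k P(C_k) * log2 P(C_k) over the class labels in D."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(dataset, attribute):
    """G(a_i) = E(D) - sum_l (|D_l|/|D|) * E(D_l), splitting D on `attribute`."""
    labels = [label for _, label in dataset]
    base = entropy(labels)
    # Partition D into subsets D_l, one per value of the attribute.
    subsets = {}
    for features, label in dataset:
        subsets.setdefault(features[attribute], []).append(label)
    residual = sum((len(sub) / len(dataset)) * entropy(sub)
                   for sub in subsets.values())
    return base - residual

# Tiny invented example: does "bag bulging" predict a pop quiz?
data = [({"bag": "bulging"}, "quiz"), ({"bag": "bulging"}, "quiz"),
        ({"bag": "flat"}, "no quiz"), ({"bag": "flat"}, "quiz")]
print(information_gain(data, "bag"))
```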


Decision Trees - Training
• Find ai | [G(ai) > G(aj) for all j ≠ i]
• Root = ai, with children = the subsets of data falling into each range of ai
• Iterate through the remaining list of attributes till all ai have been considered
• Label each subset with the majority class label (the whole procedure is sketched below)
• Optional, highly recommended, steps:
  – Prepruning and postpruning: avoid over-fitting
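A minimal recursive sketch of this greedy procedure (ID3-style), reusing entropy() and information_gain() from the previous sketch. Pruning is omitted, and the (attribute-dict, class-label) data layout is the same assumed one; this is an illustration, not the lecture's exact algorithm.

```python
from collections import Counter

def majority_label(dataset):
    return Counter(label for _, label in dataset).most_common(1)[0][0]

def build_tree(dataset, attributes):
    labels = [label for _, label in dataset]
    # Stop when the subset is pure or no attributes remain; label by majority.
    if len(set(labels)) == 1 or not attributes:
        return majority_label(dataset)
    # Greedy step: pick the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(dataset, a))
    tree = {"attribute": best, "children": {}}
    # One child per value (range) of the chosen attribute.
    values = {features[best] for features, _ in dataset}
    remaining = [a for a in attributes if a != best]
    for value in values:
        subset = [(f, l) for f, l in dataset if f[best] == value]
        tree["children"][value] = build_tree(subset, remaining)
    return tree

def classify(tree, features):
    """Navigate the learned tree for a new instance."""
    while isinstance(tree, dict):
        tree = tree["children"][features[tree["attribute"]]]
    return tree
```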


Decision Trees - Training
• Caveat with the previous approach:
  – Subsetting with a single or few data point(s) is highly favored
  – In other words, variables with higher resolution (that have a wider range of possibilities) are favored
• Gain ratio
  – Alternative to information gain as the criterion for choice of attributes
  – Compensates for the bias towards high scores for ai with high resolution (higher number of states for that attribute)
  – Gain ratio = G(ai) / E(ai)
• Gini
  – Recommended for attribute selection on large training sets, for scalability
  – Gini Index = 1 − Σk Pk², where Pk is the proportion of class k (both criteria are sketched below)
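A minimal sketch of both alternative criteria, again reusing entropy() and information_gain() and the assumed (attribute-dict, class-label) data layout from the earlier sketches.

```python
from collections import Counter

def split_entropy(dataset, attribute):
    """E(a_i): entropy of the split itself, i.e. of the attribute's value distribution."""
    values = [features[attribute] for features, _ in dataset]
    return entropy(values)

def gain_ratio(dataset, attribute):
    """Gain ratio = G(a_i) / E(a_i); penalizes attributes with many states."""
    denom = split_entropy(dataset, attribute)
    return information_gain(dataset, attribute) / denom if denom else 0.0

def gini_index(dataset):
    """Gini = 1 - sum_k P_k^2 over the class distribution of the dataset."""
    counts = Counter(label for _, label in dataset)
    total = len(dataset)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```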


Decision Trees - Limitations
• Rectangular (hypercuboidal) data-space partitioning assumption
• Not the best solution where the separating hyperline/plane is not orthogonal to the data dimensions
• Greedy strategy can easily lead to overfitting