
Transcript of chap18b

Page 1: chap18b

Categorical data

Page 2: chap18b

Decision Tree Classification

Page 3: chap18b

Which feature to split on?

Try to classify as many points as possible with each split (this is a good split).

Page 4: chap18b

Which feature to split on?

This is a bad split – no classifications obtained

Page 5: chap18b

Improving a good split

Page 6: chap18b

Decision Tree Algorithm Framework

If you have positive and negative examples, use a splitting criterion to decide on the best attribute to split on.
Each child is a new decision tree: call the algorithm again with the parent feature removed.
If all data points in a child node are the same class, classify the node as that class.
If no attributes are left, classify by majority rule.
If no data points are left, no such example was seen: classify as the majority class from the entire dataset.
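A minimal Python sketch of this framework (not from the slides): examples are assumed to be (feature-dict, label) pairs, and choose_attribute is only a placeholder for the splitting criterion introduced on the following slides.

```python
from collections import Counter

def choose_attribute(examples, attributes):
    # Placeholder splitting criterion: just take the first attribute.
    # ID3 (pages 7 and 14-17) would instead pick the attribute whose
    # split has the lowest weighted randomness (entropy).
    return attributes[0]

def learn_tree(examples, attributes, dataset_majority):
    """Recursive decision-tree learner following the framework above."""
    if not examples:
        # No data points left: no such example seen, so classify as the
        # majority class from the entire dataset.
        return dataset_majority
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        # All points in this node have the same class.
        return labels[0]
    if not attributes:
        # No attributes left: classify by majority rule.
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes)
    remaining = [a for a in attributes if a != best]   # parent feature removed
    children = {}
    for value in {feats[best] for feats, _ in examples}:
        subset = [(f, l) for f, l in examples if f[best] == value]
        # Each child is a new decision tree: call the algorithm again.
        children[value] = learn_tree(subset, remaining, dataset_majority)
    return {"split_on": best, "children": children}
```

Note that the empty-child case only arises if you branch on every possible attribute value rather than, as here, only on the values actually seen in the current node.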

Page 7: chap18b

Splitting Criterion: ID3 Algorithm

Some information theory (blackboard)

Page 8: chap18b

Issues with training and test sets

Do you know the correct classification for the test set?

If you do, why not include it in the training set to get a better classifier?

If you don’t, how can you measure the performance of your classifier?

Page 9: chap18b

Cross Validation

Tenfold cross-validation:
Ten iterations.
Pull a different tenth of the dataset out each time to act as a test set.
Train on the remaining training set.
Measure performance on the test set.

Leave-one-out cross-validation:
Similar, but leave only one point out each time, then count correct vs. incorrect.
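A rough Python sketch of the tenfold procedure described above; train_fn and predict_fn are hypothetical callables standing in for whatever classifier is being evaluated.

```python
import random

def k_fold_cross_validation(examples, train_fn, predict_fn, k=10, seed=0):
    """Tenfold cross-validation: each of the k iterations pulls out a
    different tenth of the shuffled dataset as the test set, trains on
    the rest, and measures accuracy on the held-out part."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]     # k roughly equal folds
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [x for j in range(k) if j != i for x in folds[j]]
        model = train_fn(train)
        correct = sum(predict_fn(model, feats) == label for feats, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```

Leave-one-out cross-validation is the same loop with k equal to the number of data points, so each test set is a single example counted as correct or incorrect.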

Page 10: chap18b

Noise and Overfitting

Can we always obtain a decision tree that is consistent with the data?
Do we always want a decision tree that is consistent with the data?
Example: predict Carleton students who become CEOs.
Features: state/country of origin, GPA letter, major, age, high school GPA, junior high GPA, ...
What happens with only a few features? What happens with many features?

Page 11: chap18b

Overfitting

Fitting a classifier “too closely” to the data: finding patterns that aren’t really there.
Prevented in decision trees by pruning.
When building trees, stop recursion on irrelevant attributes.
Do statistical tests at each node to determine whether to continue or not.
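The slide does not name a specific test; as one possibility, here is a chi-squared sketch (using SciPy) that asks whether a proposed split's class counts differ from what chance would produce, and stops the recursion when they do not. The class counts in the example call are an assumption consistent with the split sizes on pages 15-17.

```python
from scipy.stats import chi2_contingency

def split_is_significant(counts_per_child, alpha=0.05):
    """counts_per_child holds one [positives, negatives] pair per child of
    a proposed split. If the children's class distributions could plausibly
    arise by chance (p >= alpha), treat the attribute as irrelevant and stop
    recursing; otherwise keep building the subtree."""
    chi2, p_value, dof, expected = chi2_contingency(counts_per_child)
    return p_value < alpha

# Example: a three-way split of a 12-point dataset into children of
# sizes 2, 4, and 6 (as in the Patrons split on pages 15-17).
print(split_is_significant([[0, 2], [4, 0], [2, 4]]))
```

For this table the split comes out significant at the 0.05 level, so the recursion would continue.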

Page 12: chap18b

Examples of decision trees using Weka

Page 13: chap18b

Preventing overfitting by cross validation

Another technique to prevent overfitting (is this valid?): keep recursing on the decision tree as long as you continue to get improved accuracy on the test set.

Page 14: chap18b

Review of how to decide on which attribute to split

The dataset has two classes, P and N.
Relationship between information and randomness:
The more random a dataset is (points in both P and N), the more information is provided by the message “Your point is in class P (or N).”
The less random a dataset is, the less information is provided by the message “Your point is in class P (or N).”

Information of message = randomness of dataset = $-p_P \log_2 p_P - p_N \log_2 p_N$ (where $p_P$ and $p_N$ are the fractions of points in classes P and N)
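A small Python helper for this formula, using the usual convention that $0 \log_2 0 = 0$:

```python
from math import log2

def randomness(p_P, p_N):
    """Information of the message = randomness of the dataset:
    -p_P*log2(p_P) - p_N*log2(p_N), treating 0*log2(0) as 0."""
    total = 0.0
    for p in (p_P, p_N):
        if p > 0:
            total -= p * log2(p)
    return total

print(randomness(0.5, 0.5))   # maximally random: 1.0 bit
print(randomness(1.0, 0.0))   # no randomness: 0.0 bits
```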

Page 15: chap18b

How much randomness in split?

Patrons split, child by child (taking $0 \log_2 0 = 0$):

$-0 \log_2 0 - 1 \log_2 1 = 0$

$-1 \log_2 1 - 0 \log_2 0 = 0$

$-\tfrac{2}{6}\log_2\tfrac{2}{6} - \tfrac{4}{6}\log_2\tfrac{4}{6} = 0.9183$

Weighted average $= \tfrac{2}{12}(0) + \tfrac{4}{12}(0) + \tfrac{6}{12}(0.9183) = 0.4591$
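The same arithmetic as a runnable check. The child class counts below, (0, 2), (4, 0), and (2, 4), are assumed from the split sizes 2, 4, and 6 on the slide; which class each pure child contains does not change the numbers.

```python
from math import log2

def child_randomness(p, n):
    """Randomness of a child node holding p positive and n negative points."""
    total = 0.0
    for count in (p, n):
        if count > 0:
            frac = count / (p + n)
            total -= frac * log2(frac)
    return total

def weighted_randomness(children):
    """Weighted average randomness of a split; children is a list of
    (positives, negatives) pairs, one per child node."""
    total = sum(p + n for p, n in children)
    return sum((p + n) / total * child_randomness(p, n) for p, n in children)

# Patrons split: children of sizes 2, 4, and 6.
print(round(weighted_randomness([(0, 2), (4, 0), (2, 4)]), 4))        # 0.4591
# Type split (next slide): four children, each half positive, half negative.
print(round(weighted_randomness([(1, 1), (1, 1), (2, 2), (2, 2)]), 4))  # 1.0
```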

Page 16: chap18b

How much randomness in split?

Type split, child by child:

$-\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$

$-\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} = 1$

Weighted average $= 1$

Page 17: chap18b

Which split is better?

Patrons split: randomness = 0.4591
Type split: randomness = 1
Patrons has less randomness, so it is a better split.
Randomness is often referred to as entropy (similarities with thermodynamics).

Page 18: chap18b

Learning Logical Descriptions

Hypothesis:

$\forall x\;\; WillWait(x) \Leftrightarrow Patrons(x, Some)$
$\qquad \lor\; (Patrons(x, Full) \land Hungry(x) \land Type(x, French))$
$\qquad \lor\; (Patrons(x, Full) \land Hungry(x) \land Type(x, Thai) \land FriSat(x))$
$\qquad \lor\; (Patrons(x, Full) \land Hungry(x) \land Type(x, Burger))$

Page 19: chap18b

Learning Logical Descriptions

The goal is to learn a logical hypothesis consistent with the data.

Example of a hypothesis consistent with X1:

$\forall x\;\; WillWait(x) \Leftrightarrow Alternate(x) \land \lnot Bar(x) \land Est(x, 0\text{-}10)$

Is this consistent with X2?
X2 is a false negative for the hypothesis if the hypothesis says negative, but it should be positive.
X2 is a false positive for the hypothesis if the hypothesis says positive, but it should be negative.
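A tiny sketch of the false-positive / false-negative distinction, with the hypothesis encoded as a Python predicate over a feature dict. The feature values used for x2 below are made up for illustration and are not the actual X2 from the dataset.

```python
def check_against(hypothesis, features, actual_label):
    """Compare a hypothesis (a predicate over a feature dict) with one
    labelled example and report any disagreement."""
    predicted = hypothesis(features)
    if predicted == actual_label:
        return "consistent"
    if actual_label and not predicted:
        return "false negative"   # hypothesis says negative, should be positive
    return "false positive"       # hypothesis says positive, should be negative

# Encoding of the hypothesis above as a predicate.
h = lambda x: x["Alternate"] and not x["Bar"] and x["Est"] == "0-10"

# Illustrative example: a positive point that the hypothesis misses.
x2 = {"Alternate": True, "Bar": False, "Est": "30-60"}
print(check_against(h, x2, actual_label=True))   # false negative
```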

Page 20: chap18b

Current-best-hypothesis search

Start with an initial hypothesis and adjust it as you see examples.

Example: based on X1, arbitrarily start with

$H_1:\;\; \forall x\; WillWait(x) \Leftrightarrow Alternate(x)$

X2 should be -, but H1 says +. H1 is not restrictive enough, specialize it:

$H_2:\;\; \forall x\; WillWait(x) \Leftrightarrow Alternate(x) \land Patrons(x, Some)$

X3 should be +, but H2 says -. H2 is too restrictive, generalize:

Page 21: chap18b

Current-best-hypothesis search

$H_2:\;\; \forall x\; WillWait(x) \Leftrightarrow Alternate(x) \land Patrons(x, Some)$

$H_3:\;\; \forall x\; WillWait(x) \Leftrightarrow Patrons(x, Some)$

X4 should be +, but H3 says -. Must generalize:

$H_4:\;\; \forall x\; WillWait(x) \Leftrightarrow Patrons(x, Some) \lor (Patrons(x, Full) \land FriSat(x))$

What if you end up with an inconsistent hypothesis that you cannot modify to make work?
Back up the search and try a different route (tree on blackboard).
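A rough Python sketch of this loop, under the assumption that hypothetical helpers specializations(h, x) and generalizations(h, x) return candidate revised hypotheses (predicates over feature dicts). A full implementation would back up through earlier choices rather than give up.

```python
def current_best_hypothesis(examples, initial_h, specializations, generalizations):
    """Keep a single current hypothesis; specialize it when it produces a
    false positive, generalize it when it produces a false negative, and
    require every revision to stay consistent with all examples seen so far."""
    h = initial_h
    seen = []
    for features, label in examples:
        seen.append((features, label))
        if h(features) == label:
            continue
        # False positive -> hypothesis too general -> specialize.
        # False negative -> hypothesis too restrictive -> generalize.
        candidates = (specializations(h, features) if h(features)
                      else generalizations(h, features))
        for candidate in candidates:
            if all(candidate(f) == l for f, l in seen):
                h = candidate
                break
        else:
            # This is where the full algorithm backs up the search
            # and tries a different route through the tree of choices.
            raise RuntimeError("no consistent revision found; backtracking needed")
    return h
```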

Page 22: chap18b

Neural Networks

Moving on to Chapter 19: neural networks.