
Transcript of SEEM4630 2013-2014 Tutorial 2 Classification: Decision tree, Naïve Bayes & k-NN

Page 1: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

SEEM4630 2013-2014 Tutorial 2
Classification: Decision tree, Naïve Bayes & k-NN
Wentao TIAN, [email protected]

Page 2: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Classification: Definition

Given a collection of records (the training set), each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes: decision tree, Naïve Bayes, k-NN.

Goal: previously unseen records should be assigned a class as accurately as possible.

Page 3: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Decision Tree

Goal: construct a tree so that instances belonging to different classes are separated.

Basic algorithm (a greedy algorithm):
• Tree is constructed in a top-down recursive manner
• At start, all the training examples are at the root
• Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Examples are partitioned recursively based on the selected attributes

Page 4: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN


Attribute Selection Measure 1: Information Gain

Let pi be the probability that a tuple belongs to class Ci, estimated by |Ci,D|/|D|

Expected information (entropy) needed to classify a tuple in D:

Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
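To tie the greedy algorithm from the previous slide to these three formulas, here is a minimal Python sketch (not part of the original tutorial; the function names and the (attribute-dict, label) row format are illustrative assumptions). It computes Info(D), Info_A(D) and Gain(A) for categorical attributes and uses them to grow a tree top-down:

```python
from math import log2
from collections import Counter

def info(labels):
    """Info(D): entropy of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def info_a(rows, attr):
    """Info_A(D): weighted entropy after splitting rows (attribute-dict, label pairs) on attr."""
    total = len(rows)
    values = set(f[attr] for f, _ in rows)
    parts = [[l for f, l in rows if f[attr] == v] for v in values]
    return sum(len(p) / total * info(p) for p in parts)

def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([l for _, l in rows]) - info_a(rows, attr)

def build_tree(rows, attributes):
    """Greedy top-down induction: pick the highest-gain attribute, partition, recurse."""
    labels = [l for _, l in rows]
    if len(set(labels)) == 1 or not attributes:          # pure node, or no attribute left to test
        return Counter(labels).most_common(1)[0][0]      # return the majority class
    best = max(attributes, key=lambda a: gain(rows, a))
    rest = [a for a in attributes if a != best]
    return {best: {v: build_tree([(f, l) for f, l in rows if f[best] == v], rest)
                   for v in set(f[best] for f, _ in rows)}}
```

Run on the Play Tennis table from the following slides, build_tree should select Outlook at the root, then Humidity under Sunny and Wind under Rain, matching the tree derived step by step below.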

Page 5: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN


Attribute Selection Measure 2: Gain Ratio

Information gain measure is biased towards attributes with a large number of values.

C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain):

GainRatio(A) = Gain(A) / SplitInfo(A)

SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
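As a quick sketch (assumed helper names, not from the slides), SplitInfo and GainRatio can be computed directly from the sizes of the partitions an attribute produces; the example numbers reuse the Outlook split (partitions of size 5, 4 and 5, Gain = 0.25) from the worked example later in the tutorial:

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) for a split that produces partitions of the given sizes."""
    total = sum(partition_sizes)
    return -sum(n / total * log2(n / total) for n in partition_sizes if n > 0)

def gain_ratio(gain_a, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain_a / split_info(partition_sizes)

# Outlook splits the 14 Play Tennis records into partitions of size 5, 4 and 5,
# and Gain(Outlook) = 0.25 (computed in the tree-induction example below).
print(round(split_info([5, 4, 5]), 3))        # about 1.577
print(round(gain_ratio(0.25, [5, 4, 5]), 2))  # about 0.16
```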

Page 6: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN


Attribute Selection Measure 3: Gini Index

If a data set D contains examples from n classes, the gini index gini(D) is defined as

gini(D) = 1 - Σ_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D.

If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

gini_A(D) = (|D1| / |D|) × gini(D1) + (|D2| / |D|) × gini(D2)

Reduction in impurity:

Δgini(A) = gini(D) - gini_A(D)
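A minimal sketch of the Gini computations (illustrative code, not from the slides); the example reuses the Wind split counts [6+,2-] and [3+,3-] that appear in the tree-induction example below:

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2, where counts are the class counts in D."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """gini_A(D) for a binary split of D into D1 and D2 (class counts for each subset)."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# Wind splits the 14 Play Tennis records into Weak [6+,2-] and Strong [3+,3-].
reduction = gini([9, 5]) - gini_split([6, 2], [3, 3])
print(round(reduction, 3))   # delta gini(Wind), about 0.031
```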

Page 7: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Example (Play Tennis)

Outlook   Temperature  Humidity  Wind    Play Tennis
Sunny     >25          High      Weak    No
Sunny     >25          High      Strong  No
Overcast  >25          High      Weak    Yes
Rain      15-25        High      Weak    Yes
Rain      <15          Normal    Weak    Yes
Rain      <15          Normal    Strong  No
Overcast  <15          Normal    Strong  Yes
Sunny     15-25        High      Weak    No
Sunny     <15          Normal    Weak    Yes
Rain      15-25        Normal    Weak    Yes
Sunny     15-25        Normal    Strong  Yes
Overcast  15-25        High      Strong  Yes
Overcast  >25          Normal    Weak    Yes
Rain      15-25        High      Strong  No


Page 8: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN


Tree induction example

Entropy of data S:
Info(S) = -9/14(log2(9/14)) - 5/14(log2(5/14)) = 0.94

Split data by attribute Outlook:
S[9+, 5-] → Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]

Gain(Outlook) = 0.94 – 5/14[-2/5(log2(2/5))-3/5(log2(3/5))] – 4/14[-4/4(log2(4/4))-0/4(log2(0/4))] – 5/14[-3/5(log2(3/5))-2/5(log2(2/5))] = 0.94 – 0.69 = 0.25
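The numbers on this slide can be checked with a few lines of Python (a quick verification sketch, not part of the tutorial):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a node with pos positive and neg negative examples."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c > 0)

info_s = entropy(9, 5)                                    # 0.940
info_outlook = 5/14 * entropy(2, 3) + 4/14 * entropy(4, 0) + 5/14 * entropy(3, 2)
print(round(info_s, 2), round(info_s - info_outlook, 2))  # 0.94 0.25
```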

Page 9: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN


Tree induction example

Split data by attribute Temperature:
S[9+, 5-] → <15 [3+,1-], 15-25 [4+,2-], >25 [2+,2-]

Gain(Temperature) = 0.94 – 4/14[-3/4(log2(3/4))-1/4(log2(1/4))] – 6/14[-4/6(log2(4/6))-2/6(log2(2/6))] – 4/14[-2/4(log2(2/4))-2/4(log2(2/4))] = 0.94 – 0.91 = 0.03

Page 10: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Tree induction example

Split data by attribute Humidity:
S[9+, 5-] → High [3+,4-], Normal [6+,1-]
Gain(Humidity) = 0.94 – 7/14[-3/7(log2(3/7))-4/7(log2(4/7))] – 7/14[-6/7(log2(6/7))-1/7(log2(1/7))] = 0.94 – 0.79 = 0.15

Split data by attribute Wind:
S[9+, 5-] → Weak [6+,2-], Strong [3+,3-]
Gain(Wind) = 0.94 – 8/14[-6/8(log2(6/8))-2/8(log2(2/8))] – 6/14[-3/6(log2(3/6))-3/6(log2(3/6))] = 0.94 – 0.89 = 0.05

Page 11: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Tree induction example

Gain(Outlook) = 0.25
Gain(Temperature) = 0.03
Gain(Humidity) = 0.15
Gain(Wind) = 0.05

Outlook gives the largest information gain, so it is chosen as the root:

Outlook = Overcast → Yes
Outlook = Sunny → ?? (to be split further)
Outlook = Rain → ?? (to be split further)

Page 12: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Entropy of branch Sunny:
Info(Sunny) = -2/5(log2(2/5)) - 3/5(log2(3/5)) = 0.97

Split Sunny branch by attribute Temperature:
Sunny[2+,3-] → <15 [1+,0-], 15-25 [1+,1-], >25 [0+,2-]
Gain(Temperature) = 0.97 – 1/5[-1/1(log2(1/1))-0/1(log2(0/1))] – 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] – 2/5[-0/2(log2(0/2))-2/2(log2(2/2))] = 0.97 – 0.4 = 0.57

Split Sunny branch by attribute Humidity:
Sunny[2+,3-] → High [0+,3-], Normal [2+,0-]
Gain(Humidity) = 0.97 – 3/5[-0/3(log2(0/3))-3/3(log2(3/3))] – 2/5[-2/2(log2(2/2))-0/2(log2(0/2))] = 0.97 – 0 = 0.97

Split Sunny branch by attribute Wind:
Sunny[2+,3-] → Weak [1+,2-], Strong [1+,1-]
Gain(Wind) = 0.97 – 3/5[-1/3(log2(1/3))-2/3(log2(2/3))] – 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] = 0.97 – 0.95 = 0.02

Page 13: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Tree induction example

Humidity gives the largest gain on the Sunny branch, so the tree so far is:

Outlook = Overcast → Yes
Outlook = Sunny → Humidity (High → No, Normal → Yes)
Outlook = Rain → ?? (to be split further)

Page 14: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Entropy of branch Rain:
Info(Rain) = -3/5(log2(3/5)) - 2/5(log2(2/5)) = 0.97

Split Rain branch by attribute Temperature:
Rain[3+,2-] → <15 [1+,1-], 15-25 [2+,1-], >25 [0+,0-]
Gain(Temperature) = 0.97 – 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] – 3/5[-2/3(log2(2/3))-1/3(log2(1/3))] – 0 = 0.97 – 0.95 = 0.02 (the >25 partition is empty and contributes nothing)

Split Rain branch by attribute Humidity:
Rain[3+,2-] → High [1+,1-], Normal [2+,1-]
Gain(Humidity) = 0.97 – 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] – 3/5[-2/3(log2(2/3))-1/3(log2(1/3))] = 0.97 – 0.95 = 0.02

Split Rain branch by attribute Wind:
Rain[3+,2-] → Weak [3+,0-], Strong [0+,2-]
Gain(Wind) = 0.97 – 3/5[-3/3(log2(3/3))-0/3(log2(0/3))] – 2/5[-0/2(log2(0/2))-2/2(log2(2/2))] = 0.97 – 0 = 0.97

Page 15: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Wind gives the largest gain on the Rain branch. The final decision tree:

Outlook = Overcast → Yes
Outlook = Sunny → Humidity (High → No, Normal → Yes)
Outlook = Rain → Wind (Strong → No, Weak → Yes)
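For reference, the finished tree can be written down as a small nested dictionary and used to classify new records; this encoding and the predict helper are an illustrative sketch, not code from the tutorial:

```python
# The final tree from the slide, written as a nested dict, plus a tiny classifier that walks it.
tree = {"Outlook": {
    "Overcast": "Yes",
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def predict(node, record):
    """Follow attribute tests until a leaf (class label) is reached."""
    while isinstance(node, dict):
        attr = next(iter(node))
        node = node[attr][record[attr]]
    return node

print(predict(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # No
```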

Page 16: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN


Bayesian Classification

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities P(Ci | x1, x2, ..., xn), where xi is the value of attribute Ai.

Choose the class label that has the highest posterior probability.

Foundation: Bayes' theorem.

P(Ci | x1, x2, ..., xn) = P(x1, x2, ..., xn | Ci) × P(Ci) / P(x1, x2, ..., xn)

posterior probability = likelihood × prior probability / evidence

Model: compute P(x1, x2, ..., xn | Ci) and P(Ci) from data.

Page 17: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Naïve Bayes Classifier

Problem: the joint probability P(x1, x2, ..., xn | Ci) is difficult to estimate.

Naïve Bayes assumption: attributes are conditionally independent given the class:

P(x1, x2, ..., xn | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)

Therefore:

P(Ci | x1, x2, ..., xn) = [ Π_{j=1}^{n} P(xj | Ci) ] × P(Ci) / P(x1, x2, ..., xn)

Page 18: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Example: Naïve Bayes Classifier

A  B  C
m  b  t
m  s  t
g  q  t
h  s  t
g  q  t
g  q  f
g  s  f
h  b  f
h  q  f
m  b  f

P(C=t) = 1/2, P(C=f) = 1/2
P(A=m|C=t) = 2/5, P(A=m|C=f) = 1/5
P(B=q|C=t) = 2/5, P(B=q|C=f) = 2/5

Test Record: A=m, B=q, C=?

Page 19: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Example: Naïve Bayes Classifier

For C = t:
P(A=m|C=t) × P(B=q|C=t) × P(C=t) = 2/5 × 2/5 × 1/2 = 2/25
P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)

For C = f:
P(A=m|C=f) × P(B=q|C=f) × P(C=f) = 1/5 × 2/5 × 1/2 = 1/25
P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)

The posterior for C = t is higher, so the conclusion is: A=m, B=q → C = t.
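The same calculation as a short Python sketch (illustrative, not from the slides): it estimates the priors and conditional probabilities from the table above and scores the test record A=m, B=q under the independence assumption:

```python
from collections import Counter

# Training records (A, B, C) from the table above.
data = [("m", "b", "t"), ("m", "s", "t"), ("g", "q", "t"), ("h", "s", "t"), ("g", "q", "t"),
        ("g", "q", "f"), ("g", "s", "f"), ("h", "b", "f"), ("h", "q", "f"), ("m", "b", "f")]

priors = Counter(row[2] for row in data)       # class counts: {'t': 5, 'f': 5}

def score(a, b, cls):
    """Unnormalised posterior P(A=a|C=cls) * P(B=b|C=cls) * P(C=cls)."""
    rows = [r for r in data if r[2] == cls]
    p_a = sum(r[0] == a for r in rows) / len(rows)
    p_b = sum(r[1] == b for r in rows) / len(rows)
    return p_a * p_b * priors[cls] / len(data)

for cls in ("t", "f"):
    print(cls, score("m", "q", cls))           # t: 0.08 (= 2/25), f: 0.04 (= 1/25) -> predict C = t
```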

Page 20: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Nearest Neighbor Classification

Input:
• A set of stored records
• k: the number of nearest neighbors

Procedure:
• Compute the distance: d(p, q) = sqrt( Σ_i (p_i - q_i)^2 )
• Identify the k nearest neighbors
• Determine the class label of the unknown record based on the class labels of its nearest neighbors (i.e. by taking a majority vote)

Page 21: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Nearest Neighbor Classification: A Discrete Example

Input: 8 training instances (k = 1 and k = 3)
P1 (4, 2) Orange
P2 (0.5, 2.5) Orange
P3 (2.5, 2.5) Orange
P4 (3, 3.5) Orange
P5 (5.5, 3.5) Orange
P6 (2, 4) Black
P7 (4, 5) Black
P8 (2.5, 5.5) Black

New instance: Pn (4, 4) = ?

Calculate the distances:
d(P1, Pn) = sqrt((4-4)^2 + (4-2)^2) = 2
d(P2, Pn) = 3.80
d(P3, Pn) = 2.12
d(P4, Pn) = 1.12
d(P5, Pn) = 1.58
d(P6, Pn) = 2
d(P7, Pn) = 1
d(P8, Pn) = 2.12
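A short Python sketch (illustrative, not from the slides) that reproduces these distances and the k = 1 / k = 3 majority votes shown on the next slide:

```python
from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

points = {"P1": ((4, 2), "Orange"), "P2": ((0.5, 2.5), "Orange"), "P3": ((2.5, 2.5), "Orange"),
          "P4": ((3, 3.5), "Orange"), "P5": ((5.5, 3.5), "Orange"), "P6": ((2, 4), "Black"),
          "P7": ((4, 5), "Black"), "P8": ((2.5, 5.5), "Black")}
pn = (4, 4)

# Sort the training points by distance to the new instance Pn.
ranked = sorted(points.items(), key=lambda kv: dist(kv[1][0], pn))

for k in (1, 3):
    neighbours = ranked[:k]
    vote = Counter(label for _, (_, label) in neighbours).most_common(1)[0][0]
    print(k, [name for name, _ in neighbours], vote)
# k = 1 -> ['P7'] Black;  k = 3 -> ['P7', 'P4', 'P5'] Orange
```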

Page 22: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Nearest Neighbor Classification

k = 1: the nearest neighbor of Pn is P7 (Black), so Pn is classified as Black.
k = 3: the three nearest neighbors of Pn are P7 (Black), P4 (Orange) and P5 (Orange), so by majority vote Pn is classified as Orange.

[Figure: scatter plots of P1-P8 and Pn showing the k = 1 and k = 3 neighborhoods.]

Page 23: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Nearest Neighbor Classification…

Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
• Each attribute should fall in the same range
• Min-Max normalization

Example:
• Two data records: a = (1, 1000), b = (0.5, 1)
• dis(a, b) = ?
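A small sketch of the effect (illustrative, not from the slides; here min-max normalization is computed over just these two records, which is only meant to show how scaling changes the distance):

```python
from math import dist

a, b = (1, 1000), (0.5, 1)
print(round(dist(a, b), 2))               # 999.0 -- dominated by the second attribute

# Min-max normalize each attribute to [0, 1] over the (tiny) data set {a, b}.
lo = [min(x, y) for x, y in zip(a, b)]
hi = [max(x, y) for x, y in zip(a, b)]

def normalize(p):
    return tuple((v - l) / (h - l) for v, l, h in zip(p, lo, hi))

print(normalize(a), normalize(b))                    # (1.0, 1.0) (0.0, 0.0)
print(round(dist(normalize(a), normalize(b)), 2))    # 1.41 -- attributes now contribute equally
```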

Page 24: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Classification: Lazy & Eager Learning

Two types of learning methodologies:

Lazy Learning
• Instance-based learning (k-NN)

Eager Learning
• Decision tree and Bayesian classification
• ANN & SVM

[Figure: the example points P1-P8 and the new instance Pn.]

Page 25: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Differences Between Lazy & Eager Learning

Lazy Learning
a. Does not require model building
b. Less time training but more time predicting
c. Effectively uses a richer hypothesis space, since it uses many local linear functions to form an implicit global approximation to the target function

Eager Learning
a. Requires model building
b. More time training but less time predicting
c. Must commit to a single hypothesis that covers the entire instance space

Page 26: SEEM4630  2013-2014 Tutorial  2  Classification : Decision tree, Naïve Bayes &  k-NN

Thank you & Questions?
