Data Mining
Chapter 4. Algorithms: The Basic Methods
(1R, Statistical Modeling, Decision Trees)
Inferring rudimentary rules
1R (1-rule)
Simple rules often achieve surprisingly high accuracy
Testing a single attribute and branching
accordingly
Pseudocode for 1R (fig 4.1)
Using 1R in the weather data
Rule sets
• Outlook: sunny → no, overcast → yes, rainy → yes
– Play when it is overcast or rainy but not when it is sunny
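A minimal Python sketch of the 1R idea (cf. the pseudocode of Fig. 4.1), assuming instances are given as (attribute-value dict, class) pairs; function and variable names here are illustrative, not from the book:

```python
from collections import Counter, defaultdict

def one_r(instances, attributes):
    """1R: for each attribute build one rule per value (predict the majority
    class) and keep the attribute whose rules make the fewest training errors."""
    best = None
    for a in attributes:
        counts = defaultdict(Counter)              # attribute value -> class counts
        for attrs, cls in instances:
            counts[attrs[a]][cls] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best                                    # (attribute, {value: class}, errors)

# e.g. on the outlook attribute of the weather data this yields
# {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'}
```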
Inferring rudimentary rules
Numeric attributes
Sort the training examples according to the
values of the numeric attribute
Place breakpoints wherever the class changes: yes | no | … | no
Choose breakpoints halfway between the
examples on either side
• 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84
• The breakpoint at 72 falls between two examples that share the value 72 but have different classes (no, yes), so it is moved halfway to the next distinct value: (72 + 75) / 2 = 73.5
Inferring rudimentary rules
Require a minimum number of examples of the majority class in each partition
• Suppose the minimum is set at three:
  yes no yes yes | yes …
  (the first partition closes once its majority class, yes, has three examples)
Whenever adjacent partitions have the same
majority class, they can be merged together.
• Temperature
– ≤ 77.5 → yes
– > 77.5 → no
– 5 errors
Inferring rudimentary rules
• Humidity
– ≤ 82.5 → yes
– > 82.5 and ≤ 95.5 → no
– > 95.5 → yes
– 3 errors: the best “1-rule”
* Humidity (sorted; breakpoints at 67.5, 72.5, 82.5, 85.5, 88, 95.5)
65 70 70 70 75 80 80 85 86 90 90 91 95 96
Y  N  Y  Y  Y  Y  Y  N  Y  N  Y  N  N  Y
* Temperature (sorted; breakpoints at 64.5, 66.5, 70.5, 73.5, 77.5, 80.5, 84)
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N
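A small sketch of the breakpoint-placement step, using the sorted temperature column above; the merging step (a minimum of three majority-class examples per partition) is described in the text and not repeated here:

```python
def breakpoints(pairs):
    """Candidate breakpoints halfway between neighbouring values wherever
    the class changes in the sorted (value, class) sequence."""
    pairs = sorted(pairs)
    return [(v1 + v2) / 2
            for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]) if c1 != c2]

temperature = [(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"), (70, "yes"),
               (71, "no"), (72, "no"), (72, "yes"), (75, "yes"), (75, "yes"),
               (80, "no"), (81, "yes"), (83, "yes"), (85, "no")]
print(breakpoints(temperature))
# [64.5, 66.5, 70.5, 72.0, 77.5, 80.5, 84.0]
# The break at 72.0 splits two examples with the same value, so it is moved
# up to (72 + 75) / 2 = 73.5; merging partitions until each majority class
# has at least three examples then leaves the single break at 77.5.
```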
Inferring rudimentary rules
• 1R in the weather domain
Statistical Modeling
Use all attributes
• assumed equally important
• assumed independent
Table 4.2: observed probabilities
New example in Table 4.3
• likelihood of yes, likelihood of no
• “normalization”: divide each likelihood by their sum
Naïve Bayes: “naïve” because of the independence assumption
Statistical Modeling
Probability notation
Prior (unconditional) probability: P(A)
• e.g., P(cavity) = 0.1
Conditional probability: P(A|B) (all we know is B)
• e.g., P(cavity | toothache) = 0.8
P(A|B) = P(A∧B) / P(B)
P(B|A) = P(A∧B) / P(A)
∴ P(A∧B) = P(A|B)·P(B) = P(B|A)·P(A)
Statistical Modeling
Bayes’ Rule
If A₁, A₂, …, Aₙ is a partition of a sample space, then the posterior probability of the event Aᵢ conditional on an event B can be obtained from the probabilities P(Aᵢ) and P(B|Aᵢ) using the formula

P(Aᵢ|B) = P(Aᵢ)·P(B|Aᵢ) / Σⱼ P(Aⱼ)·P(B|Aⱼ)   (sum over j = 1, …, n)
Statistical Modeling
A new day
Likelihood of yes, Likelihood of no
Probability of yes, Probability of no
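A sketch of the likelihood and normalization arithmetic for the new day (outlook = sunny, temperature = cool, humidity = high, windy = true); the fractions are the relative frequencies that Table 4.2 gives for the standard weather data and are assumed here, since the table itself is not reproduced in this transcript:

```python
from math import prod

# assumed per-class frequencies for sunny, cool, high, true, plus the class prior
frac_yes = [2/9, 3/9, 3/9, 3/9, 9/14]
frac_no  = [3/5, 1/5, 4/5, 3/5, 5/14]

like_yes = prod(frac_yes)                 # likelihood of yes ≈ 0.0053
like_no  = prod(frac_no)                  # likelihood of no  ≈ 0.0206

# "normalization": divide each likelihood by their sum
p_yes = like_yes / (like_yes + like_no)   # ≈ 0.205
p_no  = like_no  / (like_yes + like_no)   # ≈ 0.795
print(round(p_yes, 3), round(p_no, 3))
```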
Statistical Modeling
Numeric values
Normal or Gaussian probability distribution
Table 4.4
Statistical Modeling
Probability density function for a Normal distribution
e.g.) when temperature = 66,
what’s the value of the p.d.f. for yes and no?
f(x) = 1 / (σ·√(2π)) · e^(−(x − μ)² / (2σ²))
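A sketch of the density computation for temperature = 66; the per-class mean and standard deviation come from Table 4.4 and are assumed here (≈ 73 and 6.2 for yes, ≈ 74.6 and 7.9 for no), since the table is not reproduced in this transcript:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(gaussian_pdf(66, 73.0, 6.2))   # density for class yes, ≈ 0.034
print(gaussian_pdf(66, 74.6, 7.9))   # density for class no,  ≈ 0.028
```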
Statistical Modeling
A new day including numeric attributes
Likelihood of yes, Likelihood of no
Probability of yes, Probability of no
Document classification
Bayesian models for document classification
Each instance: a document
Instance’s class: the document’s topic
• presence or absence of each word as a Boolean attribute, or
• a document as a bag of words (repeated elements counted)
Document classification
Multinomial Naïve Bayes
Assumption: the probability is independent of
the word’s context and position in the document.
Suppose n1, n2, . . . , nk are the numbers of times each of the k words occurs in the document,
and P1, P2, . . . , Pk are the probabilities of obtaining each word when sampling from
all the documents in category H. Assume that the probability is independent of
the word’s context and position in the document. These assumptions lead to a
multinomial distribution for document probabilities. For this distribution, the
probability of a document E given its class H—in other words, the formula for
computing the probability Pr[E|H] in Bayes’s rule—is

Pr[E|H] ≈ N! × Π_{i=1..k} ( Pᵢ^nᵢ / nᵢ! )
where N = n1 + n2 + . . . + nk is the number of words in the document. The
reason for the factorials is to account for the fact that the ordering of the
occurrences of each word is immaterial according to the bag-of-words model.
Pi is estimated by computing the relative frequency of word i in the text of all
training documents pertaining to category H.
In reality there should be a further term that gives the probability that the
model for category H generates a document whose length is the same as the
length of E (that is why we use the symbol ≈ instead of =), but it is common
to assume that this is the same for all classes and hence can be dropped.
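A direct transcription of the formula into Python; the function name and argument layout are illustrative, and the yellow/blue example below can be checked with it:

```python
from math import factorial

def multinomial_doc_prob(word_counts, word_probs):
    """Pr[E|H] ≈ N! * Π_i (P_i ** n_i / n_i!), where n_i is the count of
    word i in the document and P_i its probability under class H."""
    prob = factorial(sum(word_counts))              # N!
    for n_i, p_i in zip(word_counts, word_probs):
        prob *= p_i ** n_i / factorial(n_i)
    return prob
```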
Document classification
Suppose there are only two words: ‘yellow’ and ‘blue’.
A document class H
P[yellow|H]=0.75, P[blue|H]=0.25
A document E: blue yellow blue with a length of
N=3 words.
What are the possible bags of three words?
Then, what are their probabilities under the document class H?
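With N = 3 and only two words, the possible bags and their probabilities under H can be enumerated directly (a sketch; comb(3, k) is the two-word special case of the multinomial coefficient N!/Πᵢnᵢ!):

```python
from math import comb

# P[yellow|H] = 0.75, P[blue|H] = 0.25
for n_yellow in range(3, -1, -1):
    n_blue = 3 - n_yellow
    p = comb(3, n_yellow) * 0.75 ** n_yellow * 0.25 ** n_blue
    print(n_yellow, "yellow,", n_blue, "blue:", round(p, 4))
# 3 yellow, 0 blue: 0.4219
# 2 yellow, 1 blue: 0.4219
# 1 yellow, 2 blue: 0.1406
# 0 yellow, 3 blue: 0.0156
```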
Document classification
Suppose there are only two words: ‘yellow’ and ‘blue’.
A second document class H′
P[yellow|H′] = 0.10, P[blue|H′] = 0.90
A document E: blue yellow blue with a length of
N=3 words.
Decide the class of the document E.
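Plugging the numbers into the multinomial formula for E = "blue yellow blue" (n_yellow = 1, n_blue = 2, N = 3); equal priors for H and H′ are assumed here, since the slide does not state them:

```python
from math import factorial

def pr_doc(p_yellow, p_blue):
    # Pr[E|class] = 3! * (p_yellow**1 / 1!) * (p_blue**2 / 2!)
    return factorial(3) * (p_yellow ** 1 / factorial(1)) * (p_blue ** 2 / factorial(2))

pr_H      = pr_doc(0.75, 0.25)   # ≈ 0.141
pr_Hprime = pr_doc(0.10, 0.90)   # = 0.243
# Assuming equal priors, E is assigned to H′ because 0.243 > 0.141.
```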
Constructing decision trees
Decision tree construction
Select an attribute to split on, then repeat the process recursively for each branch
Constructing decision trees
Decision trees: information and bits
Entropy
Remainder
Gain
Basic idea: test the most important attribute first
Try to reach the correct classification with a small number of tests
If all paths in the tree are short, the tree as a whole is small
Minimizing the depth of the tree
Constructing decision trees
Information theory for attribute selection
Information: information content in bits
How many bits needed to classify an example?
I(p(v₁), …, p(vₙ)) = Σᵢ₌₁ⁿ −p(vᵢ)·log₂ p(vᵢ)
I(1/2, 1/2) = 1 bit; I(1, 0) = I(0, 1) = 0 bits
After split on attribute A with v values
• Remainder(A): bits of information still needed to classify an example after the split
• Gain(A) : the difference between the original information requirement and the
new requirement
• Choose the attribute with the largest gain.
Remainder(A) = Σᵢ₌₁ᵛ (pᵢ + nᵢ)/(p + n) · I( pᵢ/(pᵢ + nᵢ), nᵢ/(pᵢ + nᵢ) )

Gain(A) = I( p/(p + n), n/(p + n) ) − Remainder(A)
Constructing decision trees
Nominal attributes
Example - Gain(outlook), Gain(windy)
- I(p/(p+n), n/(p+n)) = I(9/14, 5/14) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.940 bits
- Gain(outlook) = I(9/14, 5/14) − [ (5/14)·I(2/5, 3/5) + (4/14)·I(4/4, 0/4) + (5/14)·I(3/5, 2/5) ]
               = 0.940 − 0.693 = 0.247 bits
- Gain(windy) = ?
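A sketch of I(·), Remainder(A), and Gain(A) as defined above, checked against the worked Gain(outlook) value; the branch class counts (2/3, 4/0, 3/2) are those of the outlook example:

```python
from math import log2

def info(*probs):
    """I(p1, ..., pn) = Σ −p_i · log2(p_i), skipping zero terms."""
    return sum(-p * log2(p) for p in probs if p > 0)

def gain(total_counts, branch_counts):
    """Information gain of splitting the class counts `total_counts`
    (e.g. (9, 5)) into the per-branch counts in `branch_counts`."""
    total = sum(total_counts)
    remainder = sum(sum(b) / total * info(*(c / sum(b) for c in b))
                    for b in branch_counts)
    return info(*(c / total for c in total_counts)) - remainder

print(round(gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))   # Gain(outlook) ≈ 0.247
```

The same function answers Gain(windy) once the per-branch yes/no counts for windy are read off the weather data.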
Constructing decision trees
Numeric attributes
A two-way or binary split
Example – Gain(temperature) on the subset 72, 75, 81, 85 with classes no, yes, yes, no
Candidate breakpoints: 73.5, 78, 83

(1) breakpoint 73.5:
I(2/4, 2/4) − [ (1/4)·I(0/1, 1/1) + (3/4)·I(2/3, 1/3) ]
  = 1 − (3/4)·( −(2/3)·log₂(2/3) − (1/3)·log₂(1/3) )
  = 1 − 0.750·(0.390 + 0.528) = 1 − 0.689 = 0.311

(2) breakpoint 78:
I(2/4, 2/4) − [ (2/4)·I(1/2, 1/2) + (2/4)·I(1/2, 1/2) ] = 1 − 1 = 0

(3) breakpoint 83:
I(2/4, 2/4) − [ (3/4)·I(2/3, 1/3) + (1/4)·I(0/1, 1/1) ] = 1 − 0.689 = 0.311
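The same gain computation applied to the three candidate breakpoints of the temperature subset above (a sketch; the class labels "yes"/"no" are assumed):

```python
from math import log2

def info2(yes, no):
    """Two-class information I(yes/total, no/total), skipping zero terms."""
    total = yes + no
    return sum(-c / total * log2(c / total) for c in (yes, no) if c > 0)

pairs = [(72, "no"), (75, "yes"), (81, "yes"), (85, "no")]
for bp in (73.5, 78, 83):
    left  = [c for v, c in pairs if v <= bp]
    right = [c for v, c in pairs if v > bp]
    remainder = sum(len(side) / len(pairs) * info2(side.count("yes"), side.count("no"))
                    for side in (left, right))
    print(bp, round(info2(2, 2) - remainder, 3))
# 73.5 -> 0.311, 78 -> 0.0, 83 -> 0.311
```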
Example of a decision tree