Data Mining - Center for Intelligent Systems
cis.catholic.ac.kr/sunoh/Courses/DMining/DM04_1.pdf

Transcript of "Data Mining, Chapter 4. Algorithms: The Basic Methods (1R, Statistical Modeling, Decision Trees)" (27 pages)

Page 1:

Data Mining
Chapter 4. Algorithms: The Basic Methods
(1R, Statistical Modeling, Decision Trees)

Page 2:

Inferring rudimentary rules

1R (1-rule)

Simple rules often achieve surprisingly high accuracy

Testing a single attribute and branching accordingly

Pseudocode for 1R (Fig. 4.1); a sketch of the idea is given below

Using 1R in the weather data

Rule sets

• Outlook: Sunny → No, Overcast → Yes, Rainy → Yes
– Play when it is overcast or rainy, but not when it is sunny
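Figure 4.1 itself is not reproduced in this transcript, so the following is a minimal Python sketch of the 1R idea: for each attribute, predict the majority class of every attribute value, count the resulting errors, and keep the attribute whose rule set errs least. The function name, the data layout, and the five-row toy data set are illustrative assumptions, not the book's pseudocode.

```python
# Minimal 1R sketch: one rule per attribute value, keep the best attribute.
from collections import Counter

def one_r(instances, attributes, class_attr):
    """Return (best_attribute, rules, error_count) for a 1R classifier."""
    best = None
    for attr in attributes:
        rules, errors = {}, 0
        for value in set(row[attr] for row in instances):
            # Count classes among instances having this attribute value.
            classes = Counter(row[class_attr] for row in instances if row[attr] == value)
            majority_class, majority_count = classes.most_common(1)[0]
            rules[value] = majority_class
            errors += sum(classes.values()) - majority_count
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Illustrative toy data (a few weather-style rows, not the full data set):
data = [
    {"outlook": "sunny",    "windy": "false", "play": "no"},
    {"outlook": "sunny",    "windy": "true",  "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "true",  "play": "no"},
]
print(one_r(data, ["outlook", "windy"], "play"))
# best attribute on this toy data: outlook, with 1 error
```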


Page 3:

Inferring rudimentary rules


Page 4:

Inferring rudimentary rules

Numeric attributes

Sort the training examples according to the values of the numeric attribute

Place breakpoints wherever the class changes: Yes | No | … | No

Choose breakpoints halfway between the examples on either side

• 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84

• The value 72 occurs in examples of both classes (no | yes), so that breakpoint is moved up halfway to the next example: (72 + 75) / 2 = 73.5

Page 5:

Inferring rudimentary rules

A minimum number of examples of the majority class in each partition

• Suppose the minimum is set at three.

  yes  no  yes  yes | yes …
  (1)      (2)  (3)

Whenever adjacent partitions have the same majority class, they can be merged together.

• Temperature
– ≤ 77.5 → yes
– > 77.5 → no
– 5 errors
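To make the partition-and-merge step concrete, here is a rough Python sketch applied to the temperature column shown on the next slide. It grows each partition until its majority class has at least three examples, never splits between two equal values, merges neighbouring partitions with the same majority class, and places break points halfway between partitions. The helper names, the tie-breaking (the class seen first wins a tie), and the handling of the leftover final partition are simplifying assumptions of this sketch rather than the textbook's exact procedure.

```python
# Rough sketch of 1R's discretization of a numeric attribute.
from collections import Counter

def majority(part):
    # Most frequent class in a partition (ties go to the class seen first).
    return Counter(cls for _, cls in part).most_common(1)[0][0]

def one_r_numeric(values, classes, min_majority=3):
    pairs = sorted(zip(values, classes))
    partitions, current = [], []
    for i, (value, cls) in enumerate(pairs):
        current.append((value, cls))
        top = Counter(c for _, c in current).most_common(1)[0][1]
        next_value = pairs[i + 1][0] if i + 1 < len(pairs) else None
        # Close the partition once the majority class is large enough,
        # but never split between two examples with the same value.
        if top >= min_majority and next_value != value:
            partitions.append(current)
            current = []
    if current:                        # whatever is left forms the last partition
        partitions.append(current)

    # Merge neighbours that predict the same class.
    merged = []
    for part in partitions:
        if merged and majority(merged[-1]) == majority(part):
            merged[-1].extend(part)
        else:
            merged.append(part)

    # Break points lie halfway between the last value of one partition
    # and the first value of the next; the final rule is a catch-all.
    rules = [((left[-1][0] + right[0][0]) / 2, majority(left))
             for left, right in zip(merged, merged[1:])]
    rules.append((None, majority(merged[-1])))
    return rules

# Temperature column of the weather data (values and classes as on the next slide):
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'yes',
         'yes', 'yes', 'no', 'yes', 'yes', 'no']
print(one_r_numeric(temps, play))
# -> [(77.5, 'yes'), (None, 'no')] with this tie-breaking,
#    i.e. temperature ≤ 77.5 → yes, > 77.5 → no
```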

Page 6:

Inferring rudimentary rules

• Humidity

– ≤ 82.5 → yes

– > 82.5 and ≤ 95.5 → no

– > 95.5 → yes

– 3 errors: the best "1-rule"

* Humidity (sorted values and classes)
  65  70  70  70  75  80  80  85  86  90  90  91  95  96
  Y   N   Y   Y   Y   Y   Y   N   Y   N   Y   N   N   Y
  break points: 67.5, 72.5, 82.5, 85.5, 88, 95.5

* Temperature (sorted values and classes)
  64  65  68  69  70  71  72  72  75  75  80  81  83  85
  Y   N   Y   Y   Y   N   N   Y   Y   Y   N   Y   Y   N
  break points: 64.5, 66.5, 70.5, 73.5, 77.5, 80.5, 84

Page 7:

Inferring rudimentary rules

• 1R in the weather domain

Page 8:

Statistical Modeling

Using all attributes
• Assumed equally important
• Assumed independent of one another

Table 4.2: the observed probabilities

New example in Table 4.3
• Likelihood of yes, likelihood of no
• "Normalization" to obtain the probabilities

Naïve Bayes: "naïve" because of the independence assumption


Page 9:

Statistical Modeling


Page 10:

Statistical Modeling

Probability notation

Prior probability: P(A) (unconditional)

• e.g.) P(cavity) = 0.1

Conditional probability: P(A|B) (all we know is B)

• e.g.) P(cavity | toothache) = 0.8

P(A|B) = P(A∧B) / P(B)

P(B|A) = P(A∧B) / P(A)

∴ P(A∧B) = P(A|B)·P(B) = P(B|A)·P(A)


Page 11:

Statistical Modeling

Bayes’ Rule

If A1, A2, …, An is a partition of a sample space, then the posterior probability of the event Ai conditional on an event B can be obtained from the probabilities P(Ai) and P(B|Ai) using the formula

P(Ai|B) = P(Ai)·P(B|Ai) / Σ_{j=1..n} P(Aj)·P(B|Aj)
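As a quick numerical check of the formula, here is a tiny sketch with made-up priors P(Ai) and likelihoods P(B|Ai); the numbers are purely illustrative.

```python
# Bayes' rule over a partition A_1..A_n with illustrative (made-up) numbers.
priors      = [0.3, 0.5, 0.2]        # P(A_i)
likelihoods = [0.9, 0.2, 0.4]        # P(B | A_i)
evidence = sum(p * l for p, l in zip(priors, likelihoods))       # the denominator
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]
print(posteriors)                    # P(A_i | B); the values sum to 1
```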

Page 12:

Statistical Modeling

A new day

Likelihood of yes, Likelihood of no

Probability of yes, Probability of no
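A minimal sketch of this computation for the new day of Table 4.3 (outlook = sunny, temperature = cool, humidity = high, windy = true), assuming the conditional probabilities that Table 4.2 yields for the standard weather data; the dictionary layout and the function name are illustrative.

```python
# Naive Bayes with nominal attributes: multiply the per-attribute conditional
# probabilities by the class prior, then normalize the two likelihoods.
# The probabilities below are assumed from the usual weather data (cf. Table 4.2).
cond = {
    'yes': {'outlook=sunny': 2/9, 'temperature=cool': 3/9,
            'humidity=high': 3/9, 'windy=true': 3/9},
    'no':  {'outlook=sunny': 3/5, 'temperature=cool': 1/5,
            'humidity=high': 4/5, 'windy=true': 3/5},
}
prior = {'yes': 9/14, 'no': 5/14}

def likelihood(cls, evidence):
    p = prior[cls]
    for attribute_value in evidence:
        p *= cond[cls][attribute_value]
    return p

new_day = ['outlook=sunny', 'temperature=cool', 'humidity=high', 'windy=true']
lik = {cls: likelihood(cls, new_day) for cls in ('yes', 'no')}
total = sum(lik.values())                                    # "normalization"
print({cls: round(v, 4) for cls, v in lik.items()})          # likelihoods ≈ 0.0053 (yes), 0.0206 (no)
print({cls: round(v / total, 3) for cls, v in lik.items()})  # probabilities ≈ 0.205 (yes), 0.795 (no)
```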


Page 13:

Statistical Modeling

Numeric values

Normal or Gaussian probability distribution

Table 4.4


Page 14:

Statistical Modeling

Probability density function for a Normal distribution

e.g.) when temperature = 66, what is the value of the p.d.f. for yes and for no?

f(x) = (1 / (√(2π)·σ)) · e^(−(x − μ)² / (2σ²))
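A small sketch of this density evaluation at temperature = 66. The per-class mean and standard deviation used below (73 and 6.2 for yes, 74.6 and 7.9 for no) are the values usually derived from the weather data for temperature (cf. Table 4.4); treat them as assumptions of this example.

```python
import math

def normal_pdf(x, mu, sigma):
    # Probability density of N(mu, sigma^2) at x.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Assumed per-class mean / standard deviation of temperature (cf. Table 4.4):
print(normal_pdf(66, 73.0, 6.2))   # f(temperature = 66 | yes) ≈ 0.034
print(normal_pdf(66, 74.6, 7.9))   # f(temperature = 66 | no)  ≈ 0.028
```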

Page 15:

Statistical Modeling

A new day including numeric attributes

Likelihood of yes, Likelihood of no

Probability of yes, Probability of no


Page 16:

Document classification

Bayesian models for document classification

Each instance: a document

Instance’s class: the document’s topic

Document classification

Presence or absence of each word as a Boolean attribute

Document: a bag (with repeated elements) of words


Page 17:

Document classification

Multinomial Naïve Bayes

Assumption: a word's probability is independent of its context and position in the document.


Suppose n1, n2, …, nk is the number of times word i occurs in the document, and P1, P2, …, Pk is the probability of obtaining word i when sampling from all the documents in category H. Assume that the probability is independent of the word's context and position in the document. These assumptions lead to a multinomial distribution for document probabilities. For this distribution, the probability of a document E given its class H (in other words, the formula for computing the probability Pr[E|H] in Bayes's rule) is

Pr[E|H] ≈ N! × ∏_{i=1..k} (Pi^ni / ni!)

Page 18:

Document classification


where N = n1 + n2 + … + nk is the number of words in the document. The reason for the factorials is to account for the fact that the ordering of the occurrences of each word is immaterial according to the bag-of-words model. Pi is estimated by computing the relative frequency of word i in the text of all training documents pertaining to category H.

In reality there should be a further term that gives the probability that the model for category H generates a document whose length is the same as the length of E (that is why we use the symbol ≈ instead of =), but it is common to assume that this is the same for all classes and hence can be dropped.

Page 19:

Document classification

Suppose there are two words, 'yellow' and 'blue'.

A document class H:
• P[yellow|H] = 0.75, P[blue|H] = 0.25

A document E: "blue yellow blue", with a length of N = 3 words.

What are the possible bags of three words?

Then, what are the probabilities of them for the document class H?


Page 20:

Document classification

Suppose there are two words, 'yellow' and 'blue'.

A document class H':
• P[yellow|H'] = 0.10, P[blue|H'] = 0.90

A document E: "blue yellow blue", with a length of N = 3 words.

Decide the class of the document E.
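A short sketch that evaluates the multinomial formula of the previous slides for E = "blue yellow blue" under both classes and then picks the more likely one. Equal prior probabilities for H and H' are assumed here since the slides give none, and the function name bag_probability is illustrative.

```python
# Multinomial Naive Bayes likelihood of a bag of words: Pr[E|H] ≈ N! × Π (P_i^n_i / n_i!)
from math import factorial, prod
from collections import Counter

def bag_probability(word_probs, words):
    counts = Counter(words)
    n = sum(counts.values())
    return factorial(n) * prod(word_probs[w] ** k / factorial(k) for w, k in counts.items())

doc = ['blue', 'yellow', 'blue']          # the document E
H  = {'yellow': 0.75, 'blue': 0.25}
H2 = {'yellow': 0.10, 'blue': 0.90}       # the class H' of this slide

# Under H, the four possible bags of three words have probabilities
# 27/64 (3 yellow), 27/64 (2 yellow + 1 blue), 9/64 (1 yellow + 2 blue), 1/64 (3 blue).
print(bag_probability(H,  doc))   # 9/64 ≈ 0.141
print(bag_probability(H2, doc))   # ≈ 0.243

# With equal priors (an assumption here), choose the class with the larger likelihood:
print('H' if bag_probability(H, doc) > bag_probability(H2, doc) else "H'")   # -> H'
```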


Page 21:

Constructing decision trees

Decision tree
• The splitting process is repeated recursively for each branch


Page 22:

Constructing decision trees

Decision tree: information and bits

Entropy

Remainder

Gain

Basic idea: test the most important attribute first

Trying to get to the correct classification with a small number of tests

If all paths in the tree are short, the tree will be small.

Minimizing the depth of the tree


Page 23:

Constructing decision trees

Information theory for attribute selection

Information: information content in bits

How many bits are needed to classify an example?

I(p(v1), …, p(vn)) = Σ_{i=1..n} −p(vi)·log2 p(vi)

I(1/2, 1/2)?  I(1, 0) or I(0, 1)?

After a split on attribute A with v values:

• Remainder(A): the bits of information still needed to classify the example after the split

  Remainder(A) = Σ_{i=1..v} (pi + ni)/(p + n) · I(pi/(pi + ni), ni/(pi + ni))

• Gain(A): the difference between the original information requirement and the new requirement

  Gain(A) = I(p/(p + n), n/(p + n)) − Remainder(A)

• Choose the attribute with the largest gain.

Page 24:

Constructing decision trees

Nominal attributes

Example: Gain(outlook), Gain(windy)

I(p/(p + n), n/(p + n)) = I(9/14, 5/14) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940 bits

Gain(outlook) = I(9/14, 5/14) − [ (5/14)·I(2/5, 3/5) + (4/14)·I(4/4, 0/4) + (5/14)·I(3/5, 2/5) ]
              = 0.940 − 0.693 = 0.247 bits

Gain(windy) = ?
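A compact sketch that reproduces the Gain(outlook) figure above and fills in Gain(windy). The per-value (yes, no) counts are the ones Table 4.2 gives for the weather data and are an assumption of this example; info and gain are illustrative helper names.

```python
from math import log2

def info(*counts):
    """Entropy I(...) of a class distribution given as raw counts, in bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(splits):
    """Information gain of a split, given [(yes_count, no_count), ...] per value."""
    p = sum(y for y, _ in splits)
    n = sum(m for _, m in splits)
    remainder = sum((y + m) / (p + n) * info(y, m) for y, m in splits)
    return info(p, n) - remainder

# Assumed (yes, no) counts per attribute value in the weather data (cf. Table 4.2):
outlook = [(2, 3), (4, 0), (3, 2)]       # sunny, overcast, rainy
windy   = [(6, 2), (3, 3)]               # false, true

print(round(gain(outlook), 3))   # 0.247 bits
print(round(gain(windy), 3))     # 0.048 bits
```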

Page 25:

Constructing decision trees

Numeric attributes

A two-way or binary split

Example: Gain(temperature)

   72    75    81    85
   No    Yes   Yes   No

candidate break points: 73.5, 78, 83

(1) break point 73.5:
  I(2/4, 2/4) − [ (1/4)·I(0/1, 1/1) + (3/4)·I(2/3, 1/3) ]
    = 1 − (3/4)·I(2/3, 1/3)
    = 1 − (3/4)·(−(2/3)·log2(2/3) − (1/3)·log2(1/3))
    = 1 − 0.750·(0.390 + 0.528) = 1 − 0.689 = 0.311

(2) break point 78:
  I(2/4, 2/4) − [ (2/4)·I(1/2, 1/2) + (2/4)·I(1/2, 1/2) ] = 1 − 1 = 0

(3) break point 83:
  I(2/4, 2/4) − [ (3/4)·I(2/3, 1/3) + (1/4)·I(0/1, 1/1) ] = 0.311
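Restating the info/gain helpers from the previous sketch so this snippet runs on its own, the three break-point gains can be reproduced from the (yes, no) counts on each side of the split; those counts are read directly off the four examples above.

```python
from math import log2

def info(*counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(splits):
    p = sum(y for y, _ in splits)
    n = sum(m for _, m in splits)
    return info(p, n) - sum((y + m) / (p + n) * info(y, m) for y, m in splits)

# (yes, no) counts on each side of the candidate break points 73.5, 78, 83:
print(round(gain([(0, 1), (2, 1)]), 3))   # break point 73.5 -> 0.311
print(round(gain([(1, 1), (1, 1)]), 3))   # break point 78   -> 0.0
print(round(gain([(2, 1), (0, 1)]), 3))   # break point 83   -> 0.311
```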

Page 26:

Constructing decision trees


Example of a decision tree

Page 27:


http://cis.catholic.ac.kr/sunoh