Data Mining
Chapter 4. Algorithms: The Basic Methods
(1R, Statistical Modeling, Decision Trees)
Inferring rudimentary rules
1R (1-rule)
Simple rules often achieve surprisingly high accuracy
Testing a single attribute and branching
accordingly
Pseudocode for 1R (fig 4.1)
Using 1R in the weather data
Rule sets
• Outlook: sunny → no, overcast → yes, rainy → yes
– Play when it is overcast or rainy but not when it is sunny
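A minimal Python sketch of the 1R idea (cf. the pseudocode of Fig. 4.1), assuming instances are given as (attribute-value dict, class) pairs; function and variable names here are illustrative, not from the book:

```python
from collections import Counter, defaultdict

def one_r(instances, attributes):
    """1R: for each attribute build one rule per value (predict the majority
    class) and keep the attribute whose rules make the fewest training errors."""
    best = None
    for a in attributes:
        counts = defaultdict(Counter)              # attribute value -> class counts
        for attrs, cls in instances:
            counts[attrs[a]][cls] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best                                    # (attribute, {value: class}, errors)

# e.g. on the outlook attribute of the weather data this yields
# {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'}
```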
Inferring rudimentary rules
Numeric attributes
Sort the training examples according to the
values of the numeric attribute
Place breakpoints wherever the class changes: yes | no | … | no
Choose breakpoints halfway between the
examples on either side
• 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84
• The breakpoint at 72 falls between two examples that share the value 72 but have different classes (no, yes), so it is moved halfway to the next distinct value: (72 + 75) / 2 = 73.5
Inferring rudimentary rules
Require a minimum number of examples of the majority class in each partition
• Suppose the minimum is set at three:
  yes no yes yes | yes …
  (the first partition closes once its majority class, yes, has three examples)
Whenever adjacent partitions have the same
majority class, they can be merged together.
• Temperature
– ≤ 77.5 → yes
– > 77.5 → no
– 5 errors
Inferring rudimentary rules
• Humidity
– ≤ 82.5 → yes
– > 82.5 and ≤ 95.5 → no
– > 95.5 → yes
– 3 errors: the best “1-rule”
* Humidity (sorted; breakpoints at 67.5, 72.5, 82.5, 85.5, 88, 95.5)
65 70 70 70 75 80 80 85 86 90 90 91 95 96
Y  N  Y  Y  Y  Y  Y  N  Y  N  Y  N  N  Y
* Temperature (sorted; breakpoints at 64.5, 66.5, 70.5, 73.5, 77.5, 80.5, 84)
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N
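A small sketch of the breakpoint-placement step, using the sorted temperature column above; the merging step (a minimum of three majority-class examples per partition) is described in the text and not repeated here:

```python
def breakpoints(pairs):
    """Candidate breakpoints halfway between neighbouring values wherever
    the class changes in the sorted (value, class) sequence."""
    pairs = sorted(pairs)
    return [(v1 + v2) / 2
            for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]) if c1 != c2]

temperature = [(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"), (70, "yes"),
               (71, "no"), (72, "no"), (72, "yes"), (75, "yes"), (75, "yes"),
               (80, "no"), (81, "yes"), (83, "yes"), (85, "no")]
print(breakpoints(temperature))
# [64.5, 66.5, 70.5, 72.0, 77.5, 80.5, 84.0]
# The break at 72.0 splits two examples with the same value, so it is moved
# up to (72 + 75) / 2 = 73.5; merging partitions until each majority class
# has at least three examples then leaves the single break at 77.5.
```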
Inferring rudimentary rules
• 1R in the weather domain
Statistical Modeling
Use all attributes
• assumed equally important
• assumed independent
Table 4.2: observed probabilities
New example in Table 4.3
• likelihood of yes, likelihood of no
• “normalization”: divide each likelihood by their sum
Naïve Bayes: “naïve” because of the independence assumption
Statistical Modeling
Probability notation
Prior (unconditional) probability: P(A)
• e.g., P(cavity) = 0.1
Conditional probability: P(A|B) (all we know is B)
• e.g., P(cavity | toothache) = 0.8
P(A|B) = P(A∧B) / P(B)
P(B|A) = P(A∧B) / P(A)
∴ P(A∧B) = P(A|B)·P(B) = P(B|A)·P(A)
Statistical Modeling
Bayes’ Rule
If A₁, A₂, …, Aₙ is a partition of a sample space, then the posterior probability of the event Aᵢ conditional on an event B can be obtained from the probabilities P(Aᵢ) and P(B|Aᵢ) using the formula

P(Aᵢ|B) = P(Aᵢ)·P(B|Aᵢ) / Σⱼ P(Aⱼ)·P(B|Aⱼ)   (sum over j = 1, …, n)
Statistical Modeling
A new day
Likelihood of yes, Likelihood of no
Probability of yes, Probability of no
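A sketch of the likelihood and normalization arithmetic for the new day (outlook = sunny, temperature = cool, humidity = high, windy = true); the fractions are the relative frequencies that Table 4.2 gives for the standard weather data and are assumed here, since the table itself is not reproduced in this transcript:

```python
from math import prod

# assumed per-class frequencies for sunny, cool, high, true, plus the class prior
frac_yes = [2/9, 3/9, 3/9, 3/9, 9/14]
frac_no  = [3/5, 1/5, 4/5, 3/5, 5/14]

like_yes = prod(frac_yes)                 # likelihood of yes ≈ 0.0053
like_no  = prod(frac_no)                  # likelihood of no  ≈ 0.0206

# "normalization": divide each likelihood by their sum
p_yes = like_yes / (like_yes + like_no)   # ≈ 0.205
p_no  = like_no  / (like_yes + like_no)   # ≈ 0.795
print(round(p_yes, 3), round(p_no, 3))
```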
Statistical Modeling
Numeric values
Normal or Gaussian probability distribution
Table 4.4
Statistical Modeling
Probability density function for a Normal distribution
e.g.) when temperature = 66,
what’s the value of the p.d.f. for yes and no?
f(x) = 1 / (σ·√(2π)) · e^(−(x − μ)² / (2σ²))
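A sketch of the density computation for temperature = 66; the per-class mean and standard deviation come from Table 4.4 and are assumed here (≈ 73 and 6.2 for yes, ≈ 74.6 and 7.9 for no), since the table is not reproduced in this transcript:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(gaussian_pdf(66, 73.0, 6.2))   # density for class yes, ≈ 0.034
print(gaussian_pdf(66, 74.6, 7.9))   # density for class no,  ≈ 0.028
```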
Statistical Modeling
A new day including numeric attributes
Likelihood of yes, Likelihood of no
Probability of yes, Probability of no
Document classification
Bayesian models for document classification
Each instance: a document
Instance’s class: the document’s topic
• presence or absence of each word as a Boolean attribute, or
• a document as a bag of words (repeated elements counted)
Document classification
Multinomial Naïve Bayes
Assumption: the probability is independent of
the word’s context and position in the document.
Suppose n1, n2, . . . , nk are the numbers of times each of the k words occurs in the document,
and P1, P2, . . . , Pk are the probabilities of obtaining each word when sampling from
all the documents in category H. Assume that the probability is independent of
the word’s context and position in the document. These assumptions lead to a
multinomial distribution for document probabilities. For this distribution, the
probability of a document E given its class H—in other words, the formula for
computing the probability Pr[E|H] in Bayes’s rule—is

Pr[E|H] ≈ N! × Π_{i=1..k} ( Pᵢ^nᵢ / nᵢ! )
where N = n1 + n2 + . . . + nk is the number of words in the document. The
reason for the factorials is to account for the fact that the ordering of the
occurrences of each word is immaterial according to the bag-of-words model.
Pi is estimated by computing the relative frequency of word i in the text of all
training documents pertaining to category H.
In reality there should be a further term that gives the probability that the
model for category H generates a document whose length is the same as the
length of E (that is why we use the symbol ≈ instead of =), but it is common
to assume that this is the same for all classes and hence can be dropped.
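A direct transcription of the formula into Python; the function name and argument layout are illustrative, and the yellow/blue example below can be checked with it:

```python
from math import factorial

def multinomial_doc_prob(word_counts, word_probs):
    """Pr[E|H] ≈ N! * Π_i (P_i ** n_i / n_i!), where n_i is the count of
    word i in the document and P_i its probability under class H."""
    prob = factorial(sum(word_counts))              # N!
    for n_i, p_i in zip(word_counts, word_probs):
        prob *= p_i ** n_i / factorial(n_i)
    return prob
```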
Document classification
Suppose there are only two words: ‘yellow’ and ‘blue’.
A document class H
P[yellow|H]=0.75, P[blue|H]=0.25
A document E: blue yellow blue with a length of
N=3 words.
What are the possible bags of three words?
Then, what are their probabilities under the document class H?
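With N = 3 and only two words, the possible bags and their probabilities under H can be enumerated directly (a sketch; comb(3, k) is the two-word special case of the multinomial coefficient N!/Πᵢnᵢ!):

```python
from math import comb

# P[yellow|H] = 0.75, P[blue|H] = 0.25
for n_yellow in range(3, -1, -1):
    n_blue = 3 - n_yellow
    p = comb(3, n_yellow) * 0.75 ** n_yellow * 0.25 ** n_blue
    print(n_yellow, "yellow,", n_blue, "blue:", round(p, 4))
# 3 yellow, 0 blue: 0.4219
# 2 yellow, 1 blue: 0.4219
# 1 yellow, 2 blue: 0.1406
# 0 yellow, 3 blue: 0.0156
```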
Document classification
Suppose there are only two words: ‘yellow’ and ‘blue’.
A second document class H′
P[yellow|H′] = 0.10, P[blue|H′] = 0.90
A document E: blue yellow blue with a length of
N=3 words.
Decide the class of the document E.
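Plugging the numbers into the multinomial formula for E = "blue yellow blue" (n_yellow = 1, n_blue = 2, N = 3); equal priors for H and H′ are assumed here, since the slide does not state them:

```python
from math import factorial

def pr_doc(p_yellow, p_blue):
    # Pr[E|class] = 3! * (p_yellow**1 / 1!) * (p_blue**2 / 2!)
    return factorial(3) * (p_yellow ** 1 / factorial(1)) * (p_blue ** 2 / factorial(2))

pr_H      = pr_doc(0.75, 0.25)   # ≈ 0.141
pr_Hprime = pr_doc(0.10, 0.90)   # = 0.243
# Assuming equal priors, E is assigned to H′ because 0.243 > 0.141.
```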
Constructing decision trees
Decision tree construction
Select an attribute to split on, then repeat the process recursively for each branch
Constructing decision trees
Decision trees: information and bits
Entropy
Remainder
Gain
Basic idea: test the most important attribute first
Try to reach the correct classification with a small number of tests
If all paths in the tree are short, the tree as a whole is small
Minimizing the depth of the tree
Constructing decision trees
Information theory for attribute selection
Information: information content in bits
How many bits needed to classify an example?
I(p(v₁), …, p(vₙ)) = Σᵢ₌₁ⁿ −p(vᵢ)·log₂ p(vᵢ)
I(1/2, 1/2) = 1 bit; I(1, 0) = I(0, 1) = 0 bits
After split on attribute A with v values
• Remainder(A): bits of information still needed to classify an example after the split
• Gain(A) : the difference between the original information requirement and the
new requirement
• Choose the attribute with the largest gain.
Remainder(A) = Σᵢ₌₁ᵛ (pᵢ + nᵢ)/(p + n) · I( pᵢ/(pᵢ + nᵢ), nᵢ/(pᵢ + nᵢ) )

Gain(A) = I( p/(p + n), n/(p + n) ) − Remainder(A)
Constructing decision trees
Nominal attributes
Example - Gain(outlook), Gain(windy)
- I(p/(p+n), n/(p+n)) = I(9/14, 5/14) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.940 bits
- Gain(outlook) = I(9/14, 5/14) − [ (5/14)·I(2/5, 3/5) + (4/14)·I(4/4, 0/4) + (5/14)·I(3/5, 2/5) ]
               = 0.940 − 0.693 = 0.247 bits
- Gain(windy) = ?
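A sketch of I(·), Remainder(A), and Gain(A) as defined above, checked against the worked Gain(outlook) value; the branch class counts (2/3, 4/0, 3/2) are those of the outlook example:

```python
from math import log2

def info(*probs):
    """I(p1, ..., pn) = Σ −p_i · log2(p_i), skipping zero terms."""
    return sum(-p * log2(p) for p in probs if p > 0)

def gain(total_counts, branch_counts):
    """Information gain of splitting the class counts `total_counts`
    (e.g. (9, 5)) into the per-branch counts in `branch_counts`."""
    total = sum(total_counts)
    remainder = sum(sum(b) / total * info(*(c / sum(b) for c in b))
                    for b in branch_counts)
    return info(*(c / total for c in total_counts)) - remainder

print(round(gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))   # Gain(outlook) ≈ 0.247
```

The same function answers Gain(windy) once the per-branch yes/no counts for windy are read off the weather data.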
Constructing decision trees
Numeric attributes
A two-way or binary split
Example – Gain(temperature) on the subset 72, 75, 81, 85 with classes no, yes, yes, no
Candidate breakpoints: 73.5, 78, 83

(1) breakpoint 73.5:
I(2/4, 2/4) − [ (1/4)·I(0/1, 1/1) + (3/4)·I(2/3, 1/3) ]
  = 1 − (3/4)·( −(2/3)·log₂(2/3) − (1/3)·log₂(1/3) )
  = 1 − 0.750·(0.390 + 0.528) = 1 − 0.689 = 0.311

(2) breakpoint 78:
I(2/4, 2/4) − [ (2/4)·I(1/2, 1/2) + (2/4)·I(1/2, 1/2) ] = 1 − 1 = 0

(3) breakpoint 83:
I(2/4, 2/4) − [ (3/4)·I(2/3, 1/3) + (1/4)·I(0/1, 1/1) ] = 1 − 0.689 = 0.311
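The same gain computation applied to the three candidate breakpoints of the temperature subset above (a sketch; the class labels "yes"/"no" are assumed):

```python
from math import log2

def info2(yes, no):
    """Two-class information I(yes/total, no/total), skipping zero terms."""
    total = yes + no
    return sum(-c / total * log2(c / total) for c in (yes, no) if c > 0)

pairs = [(72, "no"), (75, "yes"), (81, "yes"), (85, "no")]
for bp in (73.5, 78, 83):
    left  = [c for v, c in pairs if v <= bp]
    right = [c for v, c in pairs if v > bp]
    remainder = sum(len(side) / len(pairs) * info2(side.count("yes"), side.count("no"))
                    for side in (left, right))
    print(bp, round(info2(2, 2) - remainder, 3))
# 73.5 -> 0.311, 78 -> 0.0, 83 -> 0.311
```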
Example of a decision tree