
CS145: INTRODUCTION TO DATA MINING
Instructor: Yizhou Sun ([email protected])
December 7, 2017
Text Data: Naïve Bayes

Methods to be Learnt
2

Task                    | Vector Data                                              | Set Data           | Sequence Data   | Text Data
Classification          | Logistic Regression; Decision Tree; kNN; SVM; NN         |                    |                 | Naïve Bayes for Text
Clustering              | k-means; hierarchical clustering; DBSCAN; Mixture Models |                    |                 | PLSA
Prediction              | Linear Regression; GLM*                                  |                    |                 |
Frequent Pattern Mining |                                                          | Apriori; FP-growth | GSP; PrefixSpan |
Similarity Search       |                                                          |                    | DTW             |

Naïve Bayes for Text
•Text Data
•Revisit of Multinomial Distribution
•Multinomial Naïve Bayes
•Summary
3

Text Data
•Word/term
•Document: a sequence of words
•Corpus: a collection of documents
4

Text Classification Applications
•Spam detection
•Sentiment analysis
5
From: [email protected]
Subject: Loan Offer
Do you need a personal or business loan urgent that can be process within 2 to 3 working days? Have you been frustrated so many times by your banks and other loan firm and you don't know what to do? Here comes the Good news Deutsche Bank Financial Business and Home Loan is here to offer you any kind of loan you need at an affordable interest rate of 3% If you are interested let us know.

Represent a Document
•Most common way: Bag-of-Words
• Ignore the order of words
• Keep the count
6
[Figure: document-term count matrix for documents c1–c5 and m1–m4]
Vector space model
For document 𝑑, 𝒙𝑑 = (𝑥𝑑1, 𝑥𝑑2, …, 𝑥𝑑𝑁), where 𝑥𝑑𝑛 is the count of the n-th vocabulary word in document 𝑑
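As a concrete illustration (not from the slides), a minimal bag-of-words vectorizer might look like the sketch below; the vocabulary and whitespace tokenization are simplistic assumptions:

```python
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Map a document to a count vector x_d over a fixed vocabulary ordering."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["chinese", "beijing", "shanghai", "macao", "tokyo", "japan"]
x_d = bag_of_words("Chinese Chinese Chinese Tokyo Japan", vocab)
# x_d == [3, 0, 0, 0, 1, 1]: word order is discarded, only counts remain
```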

More Details
• Represent the doc as a vector where each entry corresponds to a different word and the number at that entry corresponds to how many times that word was present in the document (or some function of it)
• The number of distinct words is huge
• Select and use a smaller set of words that are of interest
• E.g., uninteresting words: ‘and’, ‘the’, ‘at’, ‘is’, etc. These are called stop words
• Stemming: remove endings. E.g., ‘learn’, ‘learning’, ‘learnable’, ‘learned’ could be substituted by the single stem ‘learn’
• Other simplifications can also be invented and used
• The set of different remaining words is called the dictionary or vocabulary. Fix an ordering of the terms in the dictionary so that you can operate on them by their index.
• Can be extended to bigrams, trigrams, and so on
7
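The preprocessing steps above (stop-word removal, stemming, n-gram extension) can be sketched as follows; the stop-word list and the suffix-stripping rule are toy assumptions, not a real stemmer:

```python
STOP_WORDS = {"and", "the", "at", "is", "a", "of"}  # toy stop-word list

def preprocess(doc):
    """Lowercase, drop stop words, and crudely strip common endings (toy stemmer)."""
    tokens = [t for t in doc.lower().split() if t not in STOP_WORDS]
    stems = []
    for t in tokens:
        for suffix in ("ing", "able", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

def bigrams(tokens):
    """Extend unigrams to bigrams: pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))

toks = preprocess("learning at the school")
# stop words "at", "the" removed; "learning" stemmed to "learn"
```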

Limitations of Vector Space Model
•High dimensionality
•Sparseness: most of the entries are zero
•Shallow representation: the vector representation does not capture semantic relations between words
8

Naïve Bayes for Text
•Text Data
•Revisit of Multinomial Distribution
•Multinomial Naïve Bayes
•Summary
9

Bernoulli and Categorical Distribution
•Bernoulli distribution
• Discrete distribution that takes two values {0, 1}
• 𝑃(𝑋 = 1) = 𝑝 and 𝑃(𝑋 = 0) = 1 − 𝑝
• E.g., toss a coin with head and tail
•Categorical distribution
• Discrete distribution that takes more than two values, i.e., 𝑥 ∈ {1, …, 𝐾}
• Also called generalized Bernoulli distribution, or multinoulli distribution
• 𝑃(𝑋 = 𝑘) = 𝑝𝑘 and Σ𝑘 𝑝𝑘 = 1
• E.g., roll a die: each outcome 1–6 has probability 1/6
10

Binomial and Multinomial Distribution
• Binomial distribution
• Number of successes (i.e., total number of 1’s) in n independent Bernoulli trials with success probability 𝑝
• 𝑥: number of successes
• 𝑃(𝑋 = 𝑥) = C(𝑛, 𝑥) 𝑝^𝑥 (1 − 𝑝)^(𝑛−𝑥)
•Multinomial distribution (multivariate random variable)
• Repeat n independent trials of a categorical distribution
• Let 𝑥𝑘 be the number of times value 𝑘 has been observed; note Σ𝑘 𝑥𝑘 = 𝑛
• 𝑃(𝑋1 = 𝑥1, 𝑋2 = 𝑥2, …, 𝑋𝐾 = 𝑥𝐾) = 𝑛! / (𝑥1! 𝑥2! … 𝑥𝐾!) × Π𝑘 𝑝𝑘^𝑥𝑘
11
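Both pmfs can be evaluated directly from the formulas above; a small sketch using only the standard library:

```python
from math import comb, factorial, prod

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def multinomial_pmf(xs, ps):
    """P(X_1 = x_1, ..., X_K = x_K) = n!/(x_1!...x_K!) * prod_k p_k^{x_k}."""
    n = sum(xs)
    coeff = factorial(n) // prod(factorial(x) for x in xs)
    return coeff * prod(p**x for x, p in zip(xs, ps))

# Sanity check: with K = 2 the multinomial reduces to the binomial
p_multi = multinomial_pmf([3, 7], [0.4, 0.6])
p_binom = binomial_pmf(3, 10, 0.4)
```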

Naïve Bayes for Text
•Text Data
•Revisit of Multinomial Distribution
•Multinomial Naïve Bayes
•Summary
12

Bayes’ Theorem: Basics
• Let X be a data sample (“evidence”)
• Let h be a hypothesis that X belongs to class C
• P(h) (prior probability): the probability of hypothesis h
• E.g., the probability of the “spam” class
• P(X|h) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., the probability of an email given that it’s spam
• P(X): marginal probability that sample data is observed
• 𝑃(𝑋) = Σℎ 𝑃(𝑋|ℎ) 𝑃(ℎ)
• P(h|X) (posterior probability): the probability that the hypothesis holds given the observed data sample X
• Bayes’ Theorem: P(h|X) = P(X|h) P(h) / P(X)
13
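A tiny numeric sketch of the theorem, in the spirit of the spam example; the prior and likelihoods here are made-up illustrative numbers, not estimates from data:

```python
# Hypothetical spam filter: P(spam) = 0.3, and the word "loan" appears
# in 60% of spam emails but only 5% of non-spam emails.
p_spam = 0.3
p_loan_given_spam = 0.60
p_loan_given_ham = 0.05

# Marginal: P(loan) = sum_h P(loan | h) P(h)
p_loan = p_loan_given_spam * p_spam + p_loan_given_ham * (1 - p_spam)

# Posterior via Bayes' theorem: P(spam | loan) = P(loan | spam) P(spam) / P(loan)
p_spam_given_loan = p_loan_given_spam * p_spam / p_loan
# seeing "loan" raises the spam probability from 0.30 to about 0.84
```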

Classification: Choosing Hypotheses
• Maximum Likelihood (maximize the likelihood):
  h_ML = argmax_{h∈H} P(X|h)
• Maximum A Posteriori (maximize the posterior):
  h_MAP = argmax_{h∈H} P(h|X) = argmax_{h∈H} P(X|h) P(h)
• Useful observation: it does not depend on the denominator P(X)
14

Classification by Maximum A Posteriori
• Let D be a training set of tuples and their associated class labels, where each tuple is represented by a p-dimensional attribute vector x = (x1, x2, …, xp)
• Suppose there are m classes, y ∈ {1, 2, …, m}
• Classification is to derive the maximum posteriori, i.e., the maximal P(y = j|x)
• This can be derived from Bayes’ theorem:
  𝑝(𝑦 = 𝑗|𝒙) = 𝑝(𝒙|𝑦 = 𝑗) 𝑝(𝑦 = 𝑗) / 𝑝(𝒙)
• Since p(x) is constant for all classes, only 𝑝(𝒙|𝑦) 𝑝(𝑦) needs to be maximized
15

Now Come to Text Setting
•A document is represented as a bag of words
• 𝒙𝑑 = (𝑥𝑑1, 𝑥𝑑2, …, 𝑥𝑑𝑁), where 𝑥𝑑𝑛 is the count of the n-th vocabulary word in document 𝑑
•Model 𝑝(𝒙𝑑|𝑦) for class 𝑦
• Follows a multinomial distribution with parameter vector 𝜷𝑦 = (𝛽𝑦1, 𝛽𝑦2, …, 𝛽𝑦𝑁), i.e.,
• 𝑝(𝒙𝑑|𝑦) = (Σ𝑛 𝑥𝑑𝑛)! / (𝑥𝑑1! 𝑥𝑑2! … 𝑥𝑑𝑁!) × Π𝑛 𝛽𝑦𝑛^𝑥𝑑𝑛
•Model 𝑝(𝑦 = 𝑗)
• Follows a categorical distribution with parameter vector 𝝅 = (𝜋1, 𝜋2, …, 𝜋𝑚), i.e., 𝑝(𝑦 = 𝑗) = 𝜋𝑗
16

Classification Process Assuming Parameters are Given
• Find 𝑦 that maximizes 𝑝(𝑦|𝒙𝑑), which is equivalent to maximizing 𝑝(𝒙𝑑, 𝑦):
  𝑦* = argmax_𝑦 𝑝(𝒙𝑑, 𝑦)
     = argmax_𝑦 𝑝(𝒙𝑑|𝑦) 𝑝(𝑦)
     = argmax_𝑦 (Σ𝑛 𝑥𝑑𝑛)! / (𝑥𝑑1! 𝑥𝑑2! … 𝑥𝑑𝑁!) × Π𝑛 𝛽𝑦𝑛^𝑥𝑑𝑛 × 𝜋𝑦
     = argmax_𝑦 Π𝑛 𝛽𝑦𝑛^𝑥𝑑𝑛 × 𝜋𝑦
     = argmax_𝑦 Σ𝑛 𝑥𝑑𝑛 log 𝛽𝑦𝑛 + log 𝜋𝑦
17
Note: the multinomial coefficient (Σ𝑛 𝑥𝑑𝑛)! / (𝑥𝑑1! … 𝑥𝑑𝑁!) is constant for every class (denoted 𝒄𝒅), so it can be dropped.
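The final log-space rule is what one would actually implement, since a product of many small 𝛽’s underflows. A minimal sketch assuming the parameters 𝛽 and 𝜋 are already given (the two-class, three-word parameters below are made up for illustration):

```python
from math import log

def classify(x, beta, pi):
    """Return argmax_y of  sum_n x_n * log(beta[y][n]) + log(pi[y])."""
    def score(y):
        return sum(x_n * log(b_n) for x_n, b_n in zip(x, beta[y])) + log(pi[y])
    return max(beta, key=score)

# Hypothetical parameters for classes "c" and "j" over a 3-word vocabulary
beta = {"c": [0.7, 0.2, 0.1], "j": [0.2, 0.3, 0.5]}
pi = {"c": 0.75, "j": 0.25}
label = classify([3, 0, 1], beta, pi)
```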

Parameter Estimation via MLE
• Given a corpus and a label for each document
• 𝐷 = {(𝒙𝑑, 𝑦𝑑)}
• Find the MLE estimators for Θ = (𝜷1, 𝜷2, …, 𝜷𝑚, 𝝅)
• The log-likelihood function for the training dataset:
  log 𝐿 = log Π𝑑 𝑝(𝒙𝑑, 𝑦𝑑|Θ) = Σ𝑑 log 𝑝(𝒙𝑑, 𝑦𝑑|Θ)
        = Σ𝑑 log 𝑝(𝒙𝑑|𝑦𝑑) 𝑝(𝑦𝑑)
        = Σ𝑑 (Σ𝑛 𝑥𝑑𝑛 log 𝛽𝑦𝑑𝑛 + log 𝜋𝑦𝑑 + log 𝑐𝑑)
• log 𝑐𝑑 does not involve the parameters, so it can be dropped for optimization purposes
• The optimization problem:
  max_Θ log 𝐿
  s.t. 𝜋𝑗 ≥ 0 and Σ𝑗 𝜋𝑗 = 1
       𝛽𝑗𝑛 ≥ 0 and Σ𝑛 𝛽𝑗𝑛 = 1 for all 𝑗
18

Solve the Optimization Problem
•Use the Lagrange multiplier method
• Solution:
• 𝛽̂𝑗𝑛 = Σ_{𝑑: 𝑦𝑑=𝑗} 𝑥𝑑𝑛 / Σ_{𝑑: 𝑦𝑑=𝑗} Σ_{𝑛′} 𝑥𝑑𝑛′
  • Σ_{𝑑: 𝑦𝑑=𝑗} 𝑥𝑑𝑛: total count of word n in class j
  • Σ_{𝑑: 𝑦𝑑=𝑗} Σ_{𝑛′} 𝑥𝑑𝑛′: total count of all words in class j
• 𝜋̂𝑗 = Σ𝑑 1(𝑦𝑑 = 𝑗) / |𝐷|
  • 1(𝑦𝑑 = 𝑗) is the indicator function, which equals 1 if 𝑦𝑑 = 𝑗 holds
  • |𝐷|: total number of documents
19
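The closed-form estimators above are just counting; a sketch on a hypothetical labeled corpus of count vectors:

```python
from collections import defaultdict

def mle_estimates(docs):
    """docs: list of (x, y) pairs, where x is a bag-of-words count vector
    and y a class label. Returns the closed-form MLE estimates (no smoothing)."""
    word_counts = {}               # per-class summed word counts
    doc_counts = defaultdict(int)  # per-class document counts
    for x, y in docs:
        doc_counts[y] += 1
        counts = word_counts.setdefault(y, [0] * len(x))
        for n, x_n in enumerate(x):
            counts[n] += x_n
    beta_hat = {y: [c / sum(cs) for c in cs] for y, cs in word_counts.items()}
    pi_hat = {y: k / len(docs) for y, k in doc_counts.items()}
    return beta_hat, pi_hat

# Hypothetical corpus: two classes, two-word vocabulary
docs = [([2, 1], "a"), ([0, 3], "a"), ([4, 0], "b")]
beta_hat, pi_hat = mle_estimates(docs)
# class "a": word totals (2, 4) over 6 words -> beta_a = (1/3, 2/3); pi_a = 2/3
```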

Smoothing
• What if some word n does not appear in some class j in the training dataset?
• Then 𝛽̂𝑗𝑛 = Σ_{𝑑: 𝑦𝑑=𝑗} 𝑥𝑑𝑛 / Σ_{𝑑: 𝑦𝑑=𝑗} Σ_{𝑛′} 𝑥𝑑𝑛′ = 0
• ⇒ 𝑝(𝒙𝑑|𝑦 = 𝑗) ∝ Π𝑛 𝛽𝑗𝑛^𝑥𝑑𝑛 = 0 for any document containing word n
• But other words may strongly indicate that the document belongs to class j
• Solution: add-1 smoothing, or Laplacian smoothing
• 𝛽̂𝑗𝑛 = (Σ_{𝑑: 𝑦𝑑=𝑗} 𝑥𝑑𝑛 + 1) / (Σ_{𝑑: 𝑦𝑑=𝑗} Σ_{𝑛′} 𝑥𝑑𝑛′ + 𝑁)
• 𝑁: total number of words in the vocabulary
• Check: does Σ𝑛 𝛽̂𝑗𝑛 = 1 still hold? (Yes: the N added 1’s in the numerators exactly match the N added to the shared denominator.)
20
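The check at the end of the slide can also be verified numerically; a small sketch with made-up class word counts where one word never occurs:

```python
def smoothed_beta(class_word_counts):
    """Add-1 (Laplacian) smoothing: beta_jn = (count_n + 1) / (total + N)."""
    N = len(class_word_counts)
    total = sum(class_word_counts)
    return [(c + 1) / (total + N) for c in class_word_counts]

# Hypothetical class where word 3 has zero count:
beta = smoothed_beta([5, 2, 0, 1])
# no entry is zero, and the smoothed estimates still sum to 1
```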

Example
• Data: (the data table was lost in transcription; from the learned parameters below, class c has 3 training documents with word counts Chinese: 5, Beijing: 1, Shanghai: 1, Macao: 1, i.e., 8 words total, and class j has 1 training document with word counts Chinese: 1, Tokyo: 1, Japan: 1, i.e., 3 words total)
• Vocabulary:
  Index: 1        2        3         4      5      6
  Word:  Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
• Learned parameters (with smoothing):
  Class c:
    𝛽̂_c1 = (5 + 1)/(8 + 6) = 3/7
    𝛽̂_c2 = (1 + 1)/(8 + 6) = 1/7
    𝛽̂_c3 = (1 + 1)/(8 + 6) = 1/7
    𝛽̂_c4 = (1 + 1)/(8 + 6) = 1/7
    𝛽̂_c5 = (0 + 1)/(8 + 6) = 1/14
    𝛽̂_c6 = (0 + 1)/(8 + 6) = 1/14
  Class j:
    𝛽̂_j1 = (1 + 1)/(3 + 6) = 2/9
    𝛽̂_j2 = (0 + 1)/(3 + 6) = 1/9
    𝛽̂_j3 = (0 + 1)/(3 + 6) = 1/9
    𝛽̂_j4 = (0 + 1)/(3 + 6) = 1/9
    𝛽̂_j5 = (1 + 1)/(3 + 6) = 2/9
    𝛽̂_j6 = (1 + 1)/(3 + 6) = 2/9
  Priors: 𝜋̂_c = 3/4, 𝜋̂_j = 1/4
21

Example (Continued)
•Classification stage
•For the test document d = 5 (word counts Chinese: 3, Tokyo: 1, Japan: 1), compute:
  𝑝(𝑦 = 𝑐|𝒙5) ∝ 𝜋̂𝑐 × Π𝑛 𝛽̂𝑐𝑛^𝑥5𝑛 = 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
  𝑝(𝑦 = 𝑗|𝒙5) ∝ 𝜋̂𝑗 × Π𝑛 𝛽̂𝑗𝑛^𝑥5𝑛 = 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001
•Conclusion: 𝒙5 should be classified into class c
22
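The numbers on these two slides can be reproduced end to end; the per-class word counts below are read off the learned parameters (e.g., (5 + 1)/(8 + 6) for Chinese in class c):

```python
from math import exp, log

# Vocabulary order: Chinese, Beijing, Shanghai, Macao, Tokyo, Japan
counts = {"c": [5, 1, 1, 1, 0, 0], "j": [1, 0, 0, 0, 1, 1]}  # per-class word counts
pi = {"c": 3 / 4, "j": 1 / 4}
N = 6  # vocabulary size

# Add-1 smoothed multinomial parameters
beta = {y: [(c + 1) / (sum(cs) + N) for c in cs] for y, cs in counts.items()}

# Test document d = 5: Chinese x3, Tokyo, Japan
x5 = [3, 0, 0, 0, 1, 1]

def score(y):
    """Unnormalized posterior pi_y * prod_n beta_yn^x5n, computed in log space."""
    return exp(sum(x_n * log(b) for x_n, b in zip(x5, beta[y])) + log(pi[y]))

# score("c") ≈ 0.0003 > score("j") ≈ 0.0001, so d = 5 is classified as c
```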

A More General Naïve Bayes Framework
• Let D be a training set of tuples and their associated class labels, where each tuple is represented by a p-dimensional attribute vector x = (x1, x2, …, xp)
• Suppose there are m classes, y ∈ {1, 2, …, m}
• Goal: find the y that maximizes 𝑃(𝑦|𝒙) = 𝑃(𝑦, 𝒙)/𝑃(𝒙) ∝ 𝑃(𝒙|𝑦) 𝑃(𝑦)
• A simplifying assumption: attributes are conditionally independent given the class (class conditional independence):
• 𝑝(𝒙|𝑦) = Π𝑘 𝑝(𝑥𝑘|𝑦)
• 𝑝(𝑥𝑘|𝑦) can follow any distribution, e.g., Gaussian, Bernoulli, categorical, …
23

Generative Model vs. Discriminative Model
•Generative model
•Models the joint probability 𝑝(𝒙, 𝑦)
•E.g., naïve Bayes
•Discriminative model
•Models the conditional probability 𝑝(𝑦|𝒙)
•E.g., logistic regression
24

Naïve Bayes for Text
•Text Data
•Revisit of Multinomial Distribution
•Multinomial Naïve Bayes
•Summary
25

Summary
•Text data
•Bag of words representation
•Naïve Bayes for Text
•Multinomial naïve Bayes
26