INF5830 – 2015 FALL NATURAL LANGUAGE PROCESSING
Jan Tore Lønning, Lecture 6, 21.9
Today

- Word Sense Disambiguation (WSD)
- Classification
- Naive Bayes classifiers
  - Bernoulli
  - Multinomial
- Smoothing
- Evaluation
Word Sense Disambiguation
Word Senses
WordNet: Noun
- S: (n) bass (the lowest part of the musical range)
- S: (n) bass, bass part (the lowest part in polyphonic music)
- S: (n) bass, basso (an adult male singer with the lowest voice)
- S: (n) sea bass, bass (the lean flesh of a saltwater fish of the family Serranidae)
- S: (n) freshwater bass, bass (any of various North American freshwater fish with lean flesh (especially of the genus Micropterus))
- S: (n) bass, bass voice, basso (the lowest adult male singing voice)
- S: (n) bass (the member with the lowest range of a family of musical instruments)
- S: (n) bass (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)
Word Sense Disambiguation
Why?
- Dialogue system: fish vs. musical instrument
- Translation: "Han bestilte en salt-bakt bass." ('He ordered a salt-baked bass.')
- Search: relevant hits only
- Speech synthesis: (Norwegian) tapet, urte

How?
- Corpus-based methods
  - Unsupervised: clustering
  - Supervised: classification
- Dictionaries and thesauri
Corpus-based

Four corpus occurrences of the ambiguous Norwegian verb "dosere" ('to lecture' vs. 'to dose out'):

- Professor Emeritus Dag Østerberg doserer i klassisk byteori, og Knut Halvorsen fra Oslo Teknopol i synergieffektene mellom byer og forskningsmiljøer. ('... lectures on classical urban theory ...')
- Her sitter jeg på en bar, tenkte Harry, og hører på en transvestitt som doserer om australsk politikk. ('... listening to a transvestite lecturing on Australian politics.')
- Han doserer teen etter tempoet i fortellingen hennes, ser ikke ned på den for å forvisse seg om at koppen er på rett kjøl, ser ufravendt på henne, henger ved hennes lepper - teen kommer av seg selv. ('He doses out the tea to the tempo of her story ...')
- Som doserer morfinen og holder en knivskarp balanse mellom bevissthet og smerteterskel. ('Who doses the morphine and keeps a razor-sharp balance between consciousness and pain threshold.')

How to disambiguate? Use the context.
Classification
Classification
Supervised: given classes and examples
Unsupervised: construct the classes

Mostly Machine Learning (ML), but not necessarily. Many ML problems can be described as classification.

Supervised examples:
- Spam detection
- Word sense disambiguation
- Genre of text
- Language classification
- Sentiment analysis: positive vs. negative
- Topic modelling
A variety of classifiers

- k-Nearest Neighbors
- Rocchio
- Decision Trees
- Naive Bayes
- Maximum entropy (logistic regression)
- Support Vector Machines (INF4490)

(Several of these are covered in INF4820.)
Classification
Supervised classification

Given:
- a well-defined set of objects, O
- a given set of classes, S = {s1, s2, ..., sk}

Goal: a classifier γ, i.e. a mapping from O to S. For supervised training one needs a set of pairs from O×S.

Task                      | O                     | S
Word sense disambiguation | Occurrences of "bass" | sense1, ..., sense8
Spam classification       | E-mails               | spam, no-spam
Language classification   | Pieces of text        | Arabic, Chinese, English, Norwegian, ...
Features

To represent the objects in O, extract a set of features.

Be explicit about:
- which features are used
- for each feature:
  - its type: categorical or numeric (discrete/continuous)
  - its value space (cf. the first lecture)

Object: person. Features: height, weight, hair color, eye color, ...
Object: email. Features: length, sender, contained words, language, ...
Supervised classification

- A given set of classes, S = {s1, s2, ..., sk}
- A well-defined class of objects, O
- Some features f1, f2, ..., fn
- For each feature, a set of possible values V1, V2, ..., Vn
- The set of feature vectors: V = V1 × V2 × ... × Vn
- Each object in O is represented by some member of V, written (v1, v2, ..., vn) or (f1=v1, f2=v2, ..., fn=vn)

A classifier γ can then be considered a mapping from V to S.

Examples

Language classifier:
- S = {English, Norwegian, ...}
- O is the set of strings of letters
- f1 is the last letter of o; V1 = {a, b, c, ..., å}
- f2 is the last two letters of o; V2 is all two-letter combinations
- f3 is the length of o; V3 = {1, 2, 3, 4, ...}

Word sense disambiguation:
- S = {fish, music}
- O: all occurrences of "bass"
- fi = fwi: whether the word wi occurs in the same sentence as "bass", where w1 = fishing, w2 = big, ..., w11 = guitar, w12 = band
- V1 = V2 = ... = V12 = {1, 0}
- Example: o = (0,0,0,1,0,0,0,0,0,0,1,0), i.e. o = (ffishing=0, ..., fguitar=1, fband=0)
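The binary WSD features above can be sketched in a few lines of Python. The slide only names four of the twelve cue words (fishing, big, guitar, band), so this sketch uses just that subset as a hypothetical cue list.

```python
# Minimal sketch of the binary WSD feature extraction described above.
# The full cue-word list w1..w12 is not given, so this uses a small
# hypothetical subset for illustration.
CUE_WORDS = ["fishing", "big", "guitar", "band"]

def features(sentence):
    """Binary feature vector: does each cue word occur in the
    same sentence as 'bass'?"""
    tokens = set(sentence.lower().split())
    return tuple(1 if w in tokens else 0 for w in CUE_WORDS)

print(features("He caught a big bass while fishing"))  # (1, 1, 0, 0)
print(features("The bass player joined the band"))     # (0, 0, 0, 1)
```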
Maximum Likelihood Estimation
Maximum Likelihood Estimation
We have a data set {x1, x2, ..., xn}. We assume they belong to a population with a distribution where we know the general form, but some parameters θ = {p1, p2, ..., pk} are unknown.

We want to choose the parameters to maximize the likelihood of the observations. When the {x1, x2, ..., xn} are independently drawn:

\[ P(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) \]
Examples
Example 1: the sample mean x̄ is an MLE for μ for a normal distribution.

Example 2: for a categorical variable X, the relative frequency Count(a)/N is an MLE for P(X=a).

That these are MLEs requires proofs, which we will not consider.

MLEs may be too attuned to the specific sample to serve as a model for the population, as we will see. Remedy: smoothing!
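Example 2 can be computed directly: the MLE for a categorical distribution is just the relative frequency of each value in the sample. The toy sample below is invented for illustration.

```python
from collections import Counter

# MLE for a categorical variable: relative frequency Count(a)/N,
# illustrated on a small made-up sample of sense labels.
sample = ["fish", "music", "fish", "fish", "music", "fish"]

counts = Counter(sample)
N = len(sample)
mle = {a: c / N for a, c in counts.items()}

print(mle["fish"])        # 4/6 ≈ 0.667
print(sum(mle.values()))  # the estimates sum to 1.0
```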
Naive Bayes
Naive Bayes: Decision

Given an object \((f_1 = v_1, f_2 = v_2, \ldots, f_n = v_n)\), consider for each class \(s_m\) the probability

\[ P(s_m \mid f_1 = v_1, f_2 = v_2, \ldots, f_n = v_n) \]

Choose the class with the largest value, in symbols

\[ \hat{s} = \operatorname*{argmax}_{s_m \in S} P(s_m \mid f_1 = v_1, \ldots, f_n = v_n) \]

i.e. choose the most likely class given the observed feature values.
Naive Bayes: Model

Bayes' formula:

\[ P(s_m \mid f_1 = v_1, \ldots, f_n = v_n) = \frac{P(f_1 = v_1, \ldots, f_n = v_n \mid s_m)\, P(s_m)}{P(f_1 = v_1, \ldots, f_n = v_n)} \]

Sparse data: we may not even have seen \((f_1 = v_1, \ldots, f_n = v_n)\).

Assume (wrongly) independence:

\[ P(f_1 = v_1, \ldots, f_n = v_n \mid s_m) \approx \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \]

Putting it together:

\[ \operatorname*{argmax}_{s_m \in S} P(s_m \mid f_1 = v_1, \ldots, f_n = v_n) \approx \operatorname*{argmax}_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \]
Naive Bayes: Calculation

For the calculations, avoid underflow: use logarithms.

\[ \operatorname*{argmax}_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \]
\[ = \operatorname*{argmax}_{s_m \in S} \log\Big( P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \Big) \]
\[ = \operatorname*{argmax}_{s_m \in S} \Big( \log P(s_m) + \sum_{i=1}^{n} \log P(f_i = v_i \mid s_m) \Big) \]
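The log-space decision rule can be sketched as follows. The priors and conditional probabilities are made-up numbers purely for illustration; in practice they come from training, as shown on the next slides.

```python
import math

# Log-space Naive Bayes decision rule:
#   argmax_s [ log P(s) + sum_i log P(f_i = v_i | s) ]
# The probabilities below are invented for illustration.
priors = {"fish": 0.5, "music": 0.5}
# cond[s] lists P(f_i = v_i | s) for the observed feature values
cond = {"fish": [0.8, 0.6, 0.1], "music": [0.1, 0.3, 0.7]}

def classify(priors, cond):
    scores = {
        s: math.log(priors[s]) + sum(math.log(p) for p in cond[s])
        for s in priors
    }
    return max(scores, key=scores.get)

print(classify(priors, cond))  # fish
```

Summing logarithms instead of multiplying raw probabilities keeps long products of small numbers from underflowing to 0.0.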
Naive Bayes, Training 1

Maximum likelihood:

\[ \hat{P}(s_m) = \frac{C(s_m, o)}{C(o)} \]

where C(sm, o) is the number of occurrences of objects o in class sm, and C(o) is the total number of objects.

Observe what we are doing in statistical terms: we want to estimate the true probability P(sm) from a set of observations. This is similar to estimating properties (parameters) of a population from a sample.
Naive Bayes (Bernoulli): Training 2

Maximum likelihood:

\[ \hat{P}(f_i = v_i \mid s_m) = \frac{C(f_i = v_i, s_m)}{C(s_m)} \]

where C(fi=vi, sm) is the number of occurrences of objects o where the object o belongs to class sm and the feature fi takes the value vi, and C(sm) is the number of occurrences belonging to class sm.
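The two training formulas above amount to simple counting. A minimal sketch, using a tiny invented data set of labelled feature vectors:

```python
from collections import Counter

# MLE training for Bernoulli NB, following the two count formulas:
#   P̂(s_m)             = C(s_m, o) / C(o)
#   P̂(f_i = v_i | s_m) = C(f_i = v_i, s_m) / C(s_m)
# The labelled data set is invented for illustration.
data = [
    ((1, 0), "fish"),   # feature vectors (f1, f2) with class labels
    ((1, 1), "fish"),
    ((0, 1), "music"),
    ((0, 1), "music"),
]

class_counts = Counter(label for _, label in data)
prior = {s: c / len(data) for s, c in class_counts.items()}

def cond_prob(i, v, s):
    """P̂(f_i = v | s): fraction of class-s objects with f_i = v."""
    match = sum(1 for vec, label in data if label == s and vec[i] == v)
    return match / class_counts[s]

print(prior["fish"])            # 0.5
print(cond_prob(0, 1, "fish"))  # 1.0
print(cond_prob(0, 1, "music")) # 0.0
```

Note the last estimate: an unseen feature/class combination gets probability 0, which is exactly what smoothing will fix later.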
The two models

Bernoulli, the standard form of NB:
- NLTK book, Sec. 6.1, 6.2, 6.5
- Jurafsky and Martin, 2nd ed., Sec. 20.2 (WSD)

Multinomial model, for text classification:
- related to n-gram models
- Jurafsky and Martin, 3rd ed., Sec. 7.1 (sentiment analysis)

Both models: Manning, Raghavan, Schütze, Introduction to Information Retrieval, Sec. 13.0–13.3
Multinomial text classification

- Build a language model for each class
- Score the document according to the different classes
- Choose the class with the best score
Multinomial NB: Decision

In the multinomial model, fi refers to position i in the text, and vi refers to the word occurring in this position.

We model the probability of the full text given the class sm:

\[ \operatorname*{argmax}_{s_m \in S} P(s_m \mid f_1 = v_1, \ldots, f_n = v_n) \approx \operatorname*{argmax}_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \]

Then we make a simplifying assumption: we assume a word to be equally likely in all positions:

\[ \operatorname*{argmax}_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) = \operatorname*{argmax}_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(v_i \mid s_m) \]
Multinomial NB: Training

\[ \hat{P}(s_m) = \frac{C(s_m, o)}{C(o)} \]

where C(sm, o) is the number of occurrences of objects o in class sm.

\[ \hat{P}(w_i \mid s_m) = \frac{C(w_i, s_m)}{\sum_j C(w_j, s_m)} \]

where C(wi, sm) is the number of occurrences of word wi in all texts in class sm, and \(\sum_j C(w_j, s_m)\) is the total number of words in all texts in class sm.

Bernoulli counts the number of objects/texts where wi occurs; multinomial counts the number of occurrences of wi.
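The multinomial word-probability estimate can be sketched like this. The two tiny "corpora" are invented for illustration.

```python
from collections import Counter

# MLE training for multinomial NB, following the formula:
#   P̂(w_i | s_m) = C(w_i, s_m) / sum_j C(w_j, s_m)
# The two toy corpora are invented for illustration.
texts = {
    "fish": ["caught a bass while fishing", "fried the bass"],
    "music": ["the bass player played bass"],
}

word_counts = {s: Counter(w for t in docs for w in t.split())
               for s, docs in texts.items()}

def word_prob(w, s):
    """P̂(w | s): occurrences of w over total words in class s."""
    total = sum(word_counts[s].values())
    return word_counts[s][w] / total

print(word_prob("bass", "music"))  # 2/5 = 0.4
print(word_prob("bass", "fish"))   # 2/8 = 0.25
```

Note how "bass" is counted twice in the single music text: the multinomial model counts occurrences, where Bernoulli would only register presence.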
Comparison

Bernoulli:
- registers whether a term is present or not
- considers both the present terms and the absent terms
- suitable for various tasks

Multinomial:
- counts how many times a term is present
- considers only the present terms, ignores absent terms
- tailor-made for text classification
Smoothing
Naive Bayes: Calculation

When using maximum likelihood estimation,

\[ \hat{P}(f_i = v_i \mid s_m) = \frac{C(f_i = v_i, s_m)}{C(s_m)} \]

may become 0. Then the whole product

\[ P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \]

becomes 0. Goal: avoid 0-probabilities.
Laplace Smoothing

Also called add-one smoothing: just add one to all the counts! Very simple.

MLE estimate: \( \hat{P}(w_i) = \frac{c_i}{N} \)

Laplace estimate: \( \hat{P}(w_i) = \frac{c_i + 1}{N + V} \)

Variant, add-k: \( \hat{P}(w_i) = \frac{c_i + k}{N + kV} \)

where \(c_i\) is the count of \(w_i\), N is the total count, and V is the number of possible outcomes (e.g. the vocabulary size).

NLTK, 6.3; FSNLP, 8.1; IIR, 13.6
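A minimal sketch of add-one/add-k smoothing on word counts (the counts and vocabulary are invented for illustration):

```python
from collections import Counter

# Laplace (add-one) and add-k smoothing of word probabilities:
#   MLE:     c_i / N
#   Laplace: (c_i + 1) / (N + V)
#   add-k:   (c_i + k) / (N + k*V)
# where N is the total count and V the vocabulary size.
counts = Counter({"bass": 3, "guitar": 1})
vocab = ["bass", "guitar", "fishing"]  # "fishing" is unseen

N = sum(counts.values())  # 4
V = len(vocab)            # 3

def laplace(w, k=1.0):
    return (counts[w] + k) / (N + k * V)

print(laplace("fishing"))  # (0+1)/(4+3) ≈ 0.143 — no longer zero
print(laplace("bass"))     # (3+1)/(4+3) ≈ 0.571
print(sum(laplace(w) for w in vocab))  # sums to ≈ 1.0
```

The unseen word "fishing" now gets a small non-zero probability, which is exactly what the previous slide asked for.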
Evaluation
Data
Evaluation measures

                    Is in C
                    Yes   No
Classifier  Yes     tp    fp
            No      fn    tn

Accuracy: (tp+tn)/N
Precision: tp/(tp+fp)
Recall: tp/(tp+fn)
F-score: the harmonic mean of precision and recall
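The four measures follow directly from the confusion-matrix cells. A quick sketch, with made-up counts:

```python
# Accuracy, precision, recall and F-score from the confusion-matrix
# cells above, on invented counts.
tp, fp, fn, tn = 40, 10, 20, 30
N = tp + fp + fn + tn

accuracy = (tp + tn) / N
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy)           # 0.7
print(precision)          # 0.8
print(recall)             # ≈ 0.667
print(round(f_score, 3))  # 0.727
```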
Estimating the accuracy of a classifier

Suppose we evaluate on 100 randomly chosen items and get 80 of them correct. What is the true accuracy with 95% confidence?

This is like flipping an (unfair) coin 100 times: 100 Bernoulli trials. The sample standard deviation of one trial is

\[ s = \sqrt{p(1-p)} = \sqrt{0.8 \times (1 - 0.8)} = 0.4 \]

and the standard deviation of the number of successes in 100 trials is \( s\sqrt{n} = 0.4 \times \sqrt{100} = 4 \).
Estimating the accuracy

The estimate for the true accuracy using the z-distribution is

\[ [\bar{x} - z\,s\sqrt{n},\ \bar{x} + z\,s\sqrt{n}] = [80 - 1.96 \times 4,\ 80 + 1.96 \times 4] = [72.16,\ 87.84] \]

The t-distribution with n−1 = 99 degrees of freedom yields nearly the same result.

Using the binomial distribution directly:

stats.binom.ppf([0.025, 0.975], 100, 0.8)
Out[12]: array([ 72., 88.])
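The normal-approximation interval above can be reproduced in a few lines of plain Python (no SciPy needed for this part):

```python
import math

# Normal-approximation 95% confidence interval for the accuracy
# estimate above: 80 correct out of 100 trials.
n, correct = 100, 80
p_hat = correct / n
s = math.sqrt(p_hat * (1 - p_hat))  # std. dev. of one Bernoulli trial: 0.4
se_count = s * math.sqrt(n)         # std. dev. of the count: 4.0
z = 1.96                            # 95% quantile of the normal distribution

low, high = correct - z * se_count, correct + z * se_count
print(round(low, 2), round(high, 2))  # 72.16 87.84
```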
Moral

We can only conclude that the accuracy is between .72 and .88 (with 95% confidence). We cannot conclude that it is .80.

Be careful when drawing conclusions from your evaluation. Larger test sets are better!