INF5830 – 2015 FALL NATURAL LANGUAGE PROCESSING
Jan Tore Lønning, Lecture 6, 21.9
Today

- Word Sense Disambiguation (WSD)
- Classification
- Naive Bayes classifiers
  - Bernoulli
  - Multinomial
- Smoothing
- Evaluation
Word Sense Disambiguation
Word Senses
WordNet: Noun
- S: (n) bass (the lowest part of the musical range)
- S: (n) bass, bass part (the lowest part in polyphonic music)
- S: (n) bass, basso (an adult male singer with the lowest voice)
- S: (n) sea bass, bass (the lean flesh of a saltwater fish of the family Serranidae)
- S: (n) freshwater bass, bass (any of various North American freshwater fish with lean flesh (especially of the genus Micropterus))
- S: (n) bass, bass voice, basso (the lowest adult male singing voice)
- S: (n) bass (the member with the lowest range of a family of musical instruments)
- S: (n) bass (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)
Word Sense Disambiguation
Why?
- Dialogue system: fish vs. musical instrument
- Translation: "Han bestilte en salt-bakt bass." ('He ordered a salt-baked bass.')
- Search: relevant hits only
- Speech synthesis: (Norwegian) tapet, urte

How?
- Corpus-based methods
  - Unsupervised: clustering
  - Supervised: classification
- Dictionaries and thesauri
Corpus-based

Four corpus occurrences of the ambiguous Norwegian verb "dosere" ('to lecture' vs. 'to dose out'):

- Professor Emeritus Dag Østerberg doserer i klassisk byteori, og Knut Halvorsen fra Oslo Teknopol i synergieffektene mellom byer og forskningsmiljøer. ('... lectures on classical urban theory ...')
- Her sitter jeg på en bar, tenkte Harry, og hører på en transvestitt som doserer om australsk politikk. ('... listening to a transvestite lecturing on Australian politics.')
- Han doserer teen etter tempoet i fortellingen hennes, ser ikke ned på den for å forvisse seg om at koppen er på rett kjøl, ser ufravendt på henne, henger ved hennes lepper - teen kommer av seg selv. ('He doses out the tea to the tempo of her story ...')
- Som doserer morfinen og holder en knivskarp balanse mellom bevissthet og smerteterskel. ('Who doses the morphine and keeps a razor-sharp balance between consciousness and pain threshold.')

How to disambiguate? Use the context.
Classification
Classification
Supervised: given classes and examples
Unsupervised: construct the classes

Mostly Machine Learning (ML), but not necessarily. Many ML problems can be described as classification.

Supervised examples:
- Spam detection
- Word sense disambiguation
- Genre of text
- Language classification
- Sentiment analysis: positive vs. negative
- Topic modelling
A variety of classifiers

- k-Nearest Neighbors
- Rocchio
- Decision Trees
- Naive Bayes
- Maximum entropy (logistic regression)
- Support Vector Machines (INF4490)

(Several of these are covered in INF4820.)
Classification
Supervised classification

Given:
- a well-defined set of objects, O
- a given set of classes, S = {s1, s2, ..., sk}

Goal: a classifier γ, i.e. a mapping from O to S. For supervised training one needs a set of pairs from O×S.

Task                      | O                     | S
Word sense disambiguation | Occurrences of "bass" | sense1, ..., sense8
Spam classification       | E-mails               | spam, no-spam
Language classification   | Pieces of text        | Arabic, Chinese, English, Norwegian, ...
Features

To represent the objects in O, extract a set of features.

Be explicit about:
- which features are used
- for each feature:
  - its type: categorical or numeric (discrete/continuous)
  - its value space (cf. the first lecture)

Object: person. Features: height, weight, hair color, eye color, ...
Object: email. Features: length, sender, contained words, language, ...
Supervised classification

- A given set of classes, S = {s1, s2, ..., sk}
- A well-defined class of objects, O
- Some features f1, f2, ..., fn
- For each feature, a set of possible values V1, V2, ..., Vn
- The set of feature vectors: V = V1 × V2 × ... × Vn
- Each object in O is represented by some member of V, written (v1, v2, ..., vn) or (f1=v1, f2=v2, ..., fn=vn)

A classifier γ can then be considered a mapping from V to S.

Examples

Language classifier:
- S = {English, Norwegian, ...}
- O is the set of strings of letters
- f1 is the last letter of o; V1 = {a, b, c, ..., å}
- f2 is the last two letters of o; V2 is all two-letter combinations
- f3 is the length of o; V3 = {1, 2, 3, 4, ...}

Word sense disambiguation:
- S = {fish, music}
- O: all occurrences of "bass"
- fi = fwi: whether the word wi occurs in the same sentence as "bass", where w1 = fishing, w2 = big, ..., w11 = guitar, w12 = band
- V1 = V2 = ... = V12 = {1, 0}
- Example: o = (0,0,0,1,0,0,0,0,0,0,1,0), i.e. o = (ffishing=0, ..., fguitar=1, fband=0)
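The binary WSD features above can be sketched in a few lines of Python. The slide only names four of the twelve cue words (fishing, big, guitar, band), so this sketch uses just that subset as a hypothetical cue list.

```python
# Minimal sketch of the binary WSD feature extraction described above.
# The full cue-word list w1..w12 is not given, so this uses a small
# hypothetical subset for illustration.
CUE_WORDS = ["fishing", "big", "guitar", "band"]

def features(sentence):
    """Binary feature vector: does each cue word occur in the
    same sentence as 'bass'?"""
    tokens = set(sentence.lower().split())
    return tuple(1 if w in tokens else 0 for w in CUE_WORDS)

print(features("He caught a big bass while fishing"))  # (1, 1, 0, 0)
print(features("The bass player joined the band"))     # (0, 0, 0, 1)
```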
Maximum Likelihood Estimation
Maximum Likelihood Estimation
We have a data set {x1, x2, ..., xn}. We assume they belong to a population with a distribution where we know the general form, but some parameters θ = {p1, p2, ..., pk} are unknown.

We want to choose the parameters to maximize the likelihood of the observations. When the {x1, x2, ..., xn} are independently drawn:

\[ P(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) \]
Examples
Example 1: the sample mean x̄ is an MLE for μ for a normal distribution.

Example 2: for a categorical variable X, the relative frequency Count(a)/N is an MLE for P(X=a).

That these are MLEs requires proofs, which we will not consider.

MLEs may be too attuned to the specific sample to serve as a model for the population, as we will see. Remedy: smoothing!
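Example 2 can be computed directly: the MLE for a categorical distribution is just the relative frequency of each value in the sample. The toy sample below is invented for illustration.

```python
from collections import Counter

# MLE for a categorical variable: relative frequency Count(a)/N,
# illustrated on a small made-up sample of sense labels.
sample = ["fish", "music", "fish", "fish", "music", "fish"]

counts = Counter(sample)
N = len(sample)
mle = {a: c / N for a, c in counts.items()}

print(mle["fish"])        # 4/6 ≈ 0.667
print(sum(mle.values()))  # the estimates sum to 1.0
```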
Naive Bayes
Naive Bayes: Decision

Given an object \((f_1 = v_1, f_2 = v_2, \ldots, f_n = v_n)\), consider for each class \(s_m\) the probability

\[ P(s_m \mid f_1 = v_1, f_2 = v_2, \ldots, f_n = v_n) \]

Choose the class with the largest value, in symbols

\[ \hat{s} = \operatorname*{argmax}_{s_m \in S} P(s_m \mid f_1 = v_1, \ldots, f_n = v_n) \]

i.e. choose the most likely class given the observed feature values.
Naive Bayes: Model

Bayes' formula:

\[ P(s_m \mid f_1 = v_1, \ldots, f_n = v_n) = \frac{P(f_1 = v_1, \ldots, f_n = v_n \mid s_m)\, P(s_m)}{P(f_1 = v_1, \ldots, f_n = v_n)} \]

Sparse data: we may not even have seen \((f_1 = v_1, \ldots, f_n = v_n)\).

Assume (wrongly) independence:

\[ P(f_1 = v_1, \ldots, f_n = v_n \mid s_m) \approx \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \]

Putting it together:

\[ \operatorname*{argmax}_{s_m \in S} P(s_m \mid f_1 = v_1, \ldots, f_n = v_n) \approx \operatorname*{argmax}_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \]
Naive Bayes: Calculation

For the calculations, avoid underflow: use logarithms.

\[ \operatorname*{argmax}_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \]
\[ = \operatorname*{argmax}_{s_m \in S} \log\Big( P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \Big) \]
\[ = \operatorname*{argmax}_{s_m \in S} \Big( \log P(s_m) + \sum_{i=1}^{n} \log P(f_i = v_i \mid s_m) \Big) \]
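The log-space decision rule can be sketched as follows. The priors and conditional probabilities are made-up numbers purely for illustration; in practice they come from training, as shown on the next slides.

```python
import math

# Log-space Naive Bayes decision rule:
#   argmax_s [ log P(s) + sum_i log P(f_i = v_i | s) ]
# The probabilities below are invented for illustration.
priors = {"fish": 0.5, "music": 0.5}
# cond[s] lists P(f_i = v_i | s) for the observed feature values
cond = {"fish": [0.8, 0.6, 0.1], "music": [0.1, 0.3, 0.7]}

def classify(priors, cond):
    scores = {
        s: math.log(priors[s]) + sum(math.log(p) for p in cond[s])
        for s in priors
    }
    return max(scores, key=scores.get)

print(classify(priors, cond))  # fish
```

Summing logarithms instead of multiplying raw probabilities keeps long products of small numbers from underflowing to 0.0.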
Naive Bayes, Training 1

Maximum likelihood:

\[ \hat{P}(s_m) = \frac{C(s_m, o)}{C(o)} \]

where C(sm, o) is the number of occurrences of objects o in class sm, and C(o) is the total number of objects.

Observe what we are doing in statistical terms: we want to estimate the true probability P(sm) from a set of observations. This is similar to estimating properties (parameters) of a population from a sample.
Naive Bayes (Bernoulli): Training 2

Maximum likelihood:

\[ \hat{P}(f_i = v_i \mid s_m) = \frac{C(f_i = v_i, s_m)}{C(s_m)} \]

where C(fi=vi, sm) is the number of occurrences of objects o where the object o belongs to class sm and the feature fi takes the value vi, and C(sm) is the number of occurrences belonging to class sm.
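The two training formulas above amount to simple counting. A minimal sketch, using a tiny invented data set of labelled feature vectors:

```python
from collections import Counter

# MLE training for Bernoulli NB, following the two count formulas:
#   P̂(s_m)             = C(s_m, o) / C(o)
#   P̂(f_i = v_i | s_m) = C(f_i = v_i, s_m) / C(s_m)
# The labelled data set is invented for illustration.
data = [
    ((1, 0), "fish"),   # feature vectors (f1, f2) with class labels
    ((1, 1), "fish"),
    ((0, 1), "music"),
    ((0, 1), "music"),
]

class_counts = Counter(label for _, label in data)
prior = {s: c / len(data) for s, c in class_counts.items()}

def cond_prob(i, v, s):
    """P̂(f_i = v | s): fraction of class-s objects with f_i = v."""
    match = sum(1 for vec, label in data if label == s and vec[i] == v)
    return match / class_counts[s]

print(prior["fish"])            # 0.5
print(cond_prob(0, 1, "fish"))  # 1.0
print(cond_prob(0, 1, "music")) # 0.0
```

Note the last estimate: an unseen feature/class combination gets probability 0, which is exactly what smoothing will fix later.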
The two models

Bernoulli, the standard form of NB:
- NLTK book, Sec. 6.1, 6.2, 6.5
- Jurafsky and Martin, 2nd ed., Sec. 20.2 (WSD)

Multinomial model, for text classification:
- related to n-gram models
- Jurafsky and Martin, 3rd ed., Sec. 7.1 (sentiment analysis)

Both models: Manning, Raghavan, Schütze, Introduction to Information Retrieval, Sec. 13.0–13.3
Multinomial text classification

- Build a language model for each class
- Score the document according to the different classes
- Choose the class with the best score
Multinomial NB: Decision

In the multinomial model, fi refers to position i in the text, and vi refers to the word occurring in this position.

We model the probability of the full text given the class sm:

\[ \operatorname*{argmax}_{s_m \in S} P(s_m \mid f_1 = v_1, \ldots, f_n = v_n) \approx \operatorname*{argmax}_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \]

Then we make a simplifying assumption: we assume a word to be equally likely in all positions:

\[ \operatorname*{argmax}_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) = \operatorname*{argmax}_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(v_i \mid s_m) \]
Multinomial NB: Training

\[ \hat{P}(s_m) = \frac{C(s_m, o)}{C(o)} \]

where C(sm, o) is the number of occurrences of objects o in class sm.

\[ \hat{P}(w_i \mid s_m) = \frac{C(w_i, s_m)}{\sum_j C(w_j, s_m)} \]

where C(wi, sm) is the number of occurrences of word wi in all texts in class sm, and \(\sum_j C(w_j, s_m)\) is the total number of words in all texts in class sm.

Bernoulli counts the number of objects/texts where wi occurs; multinomial counts the number of occurrences of wi.
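The multinomial word-probability estimate can be sketched like this. The two tiny "corpora" are invented for illustration.

```python
from collections import Counter

# MLE training for multinomial NB, following the formula:
#   P̂(w_i | s_m) = C(w_i, s_m) / sum_j C(w_j, s_m)
# The two toy corpora are invented for illustration.
texts = {
    "fish": ["caught a bass while fishing", "fried the bass"],
    "music": ["the bass player played bass"],
}

word_counts = {s: Counter(w for t in docs for w in t.split())
               for s, docs in texts.items()}

def word_prob(w, s):
    """P̂(w | s): occurrences of w over total words in class s."""
    total = sum(word_counts[s].values())
    return word_counts[s][w] / total

print(word_prob("bass", "music"))  # 2/5 = 0.4
print(word_prob("bass", "fish"))   # 2/8 = 0.25
```

Note how "bass" is counted twice in the single music text: the multinomial model counts occurrences, where Bernoulli would only register presence.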
Comparison

Bernoulli:
- registers whether a term is present or not
- considers both the present terms and the absent terms
- suitable for various tasks

Multinomial:
- counts how many times a term is present
- considers only the present terms, ignores absent terms
- tailor-made for text classification
Smoothing
Naive Bayes: Calculation

When using maximum likelihood estimation,

\[ \hat{P}(f_i = v_i \mid s_m) = \frac{C(f_i = v_i, s_m)}{C(s_m)} \]

may become 0. Then the whole product

\[ P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \]

becomes 0. Goal: avoid 0-probabilities.
Laplace Smoothing

Also called add-one smoothing: just add one to all the counts! Very simple.

MLE estimate: \( \hat{P}(w_i) = \frac{c_i}{N} \)

Laplace estimate: \( \hat{P}(w_i) = \frac{c_i + 1}{N + V} \)

Variant, add-k: \( \hat{P}(w_i) = \frac{c_i + k}{N + kV} \)

where \(c_i\) is the count of \(w_i\), N is the total count, and V is the number of possible outcomes (e.g. the vocabulary size).

NLTK, 6.3; FSNLP, 8.1; IIR, 13.6
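A minimal sketch of add-one/add-k smoothing on word counts (the counts and vocabulary are invented for illustration):

```python
from collections import Counter

# Laplace (add-one) and add-k smoothing of word probabilities:
#   MLE:     c_i / N
#   Laplace: (c_i + 1) / (N + V)
#   add-k:   (c_i + k) / (N + k*V)
# where N is the total count and V the vocabulary size.
counts = Counter({"bass": 3, "guitar": 1})
vocab = ["bass", "guitar", "fishing"]  # "fishing" is unseen

N = sum(counts.values())  # 4
V = len(vocab)            # 3

def laplace(w, k=1.0):
    return (counts[w] + k) / (N + k * V)

print(laplace("fishing"))  # (0+1)/(4+3) ≈ 0.143 — no longer zero
print(laplace("bass"))     # (3+1)/(4+3) ≈ 0.571
print(sum(laplace(w) for w in vocab))  # sums to ≈ 1.0
```

The unseen word "fishing" now gets a small non-zero probability, which is exactly what the previous slide asked for.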
Evaluation
Data
Evaluation measures

                    Is in C
                    Yes   No
Classifier  Yes     tp    fp
            No      fn    tn

Accuracy: (tp+tn)/N
Precision: tp/(tp+fp)
Recall: tp/(tp+fn)
F-score: the harmonic mean of precision and recall
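The four measures follow directly from the confusion-matrix cells. A quick sketch, with made-up counts:

```python
# Accuracy, precision, recall and F-score from the confusion-matrix
# cells above, on invented counts.
tp, fp, fn, tn = 40, 10, 20, 30
N = tp + fp + fn + tn

accuracy = (tp + tn) / N
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy)           # 0.7
print(precision)          # 0.8
print(recall)             # ≈ 0.667
print(round(f_score, 3))  # 0.727
```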
Estimating the accuracy of a classifier

Suppose we evaluate on 100 randomly chosen items and get 80 of them correct. What is the true accuracy with 95% confidence?

This is like flipping an (unfair) coin 100 times: 100 Bernoulli trials. The sample standard deviation of one trial is

\[ s = \sqrt{p(1-p)} = \sqrt{0.8 \times (1 - 0.8)} = 0.4 \]

and the standard deviation of the number of successes in 100 trials is \( s\sqrt{n} = 0.4 \times \sqrt{100} = 4 \).
Estimating the accuracy

The estimate for the true accuracy using the z-distribution is

\[ [\bar{x} - z\,s\sqrt{n},\ \bar{x} + z\,s\sqrt{n}] = [80 - 1.96 \times 4,\ 80 + 1.96 \times 4] = [72.16,\ 87.84] \]

The t-distribution with n−1 = 99 degrees of freedom yields nearly the same result.

Using the binomial distribution directly:

stats.binom.ppf([0.025, 0.975], 100, 0.8)
Out[12]: array([ 72., 88.])
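The normal-approximation interval above can be reproduced in a few lines of plain Python (no SciPy needed for this part):

```python
import math

# Normal-approximation 95% confidence interval for the accuracy
# estimate above: 80 correct out of 100 trials.
n, correct = 100, 80
p_hat = correct / n
s = math.sqrt(p_hat * (1 - p_hat))  # std. dev. of one Bernoulli trial: 0.4
se_count = s * math.sqrt(n)         # std. dev. of the count: 4.0
z = 1.96                            # 95% quantile of the normal distribution

low, high = correct - z * se_count, correct + z * se_count
print(round(low, 2), round(high, 2))  # 72.16 87.84
```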
Moral

We can only conclude that the accuracy is between .72 and .88 (with 95% confidence). We cannot conclude that it is .80.

Be careful when drawing conclusions from your evaluation. Larger test sets are better!