INF5820 – 2011 fall · INF5830 – 2015 FALL NATURAL LANGUAGE PROCESSING
Jan Tore Lønning, Lecture 6, 21.9

Transcript of lecture slides

Page 1: Title slide

Page 2:

Today

- Word Sense Disambiguation (WSD)
- Classification
- Naive Bayes classifiers: Bernoulli, Multinomial
- Smoothing
- Evaluation

Page 3:

Word Sense Disambiguation

Page 4:

Word Senses

WordNet entries for the noun "bass":
- S: (n) bass (the lowest part of the musical range)
- S: (n) bass, bass part (the lowest part in polyphonic music)
- S: (n) bass, basso (an adult male singer with the lowest voice)
- S: (n) sea bass, bass (the lean flesh of a saltwater fish of the family Serranidae)
- S: (n) freshwater bass, bass (any of various North American freshwater fish with lean flesh (especially of the genus Micropterus))
- S: (n) bass, bass voice, basso (the lowest adult male singing voice)
- S: (n) bass (the member with the lowest range of a family of musical instruments)
- S: (n) bass (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)

Page 5:

Word Sense Disambiguation

Why?
- Dialogue systems: fish vs. musical instrument
- Translation: Han bestilte en salt-bakt bass ("He ordered a salt-baked bass") – which sense of bass?
- Search: relevant hits only
- Speech synthesis: (Norwegian) tapet, urte – the pronunciation depends on the intended sense

How?
- Corpus-based methods: unsupervised (clustering) or supervised (classification)
- Dictionaries and thesauri

Page 6:

Corpus-based

All four sentences contain a form of the Norwegian verb dosere, which is ambiguous between 'lecture, expound' and 'dose, measure out':

Professor Emeritus Dag Østerberg doserer i klassisk byteori, og Knut Halvorsen fra Oslo Teknopol i synergieffektene mellom byer og forskningsmiljøer.
("Professor Emeritus Dag Østerberg lectures on classical urban theory, and Knut Halvorsen from Oslo Teknopol on the synergy effects between cities and research communities.")

Her sitter jeg på en bar, tenkte Harry, og hører på en transvestitt som doserer om australsk politikk.
("Here I sit in a bar, thought Harry, listening to a transvestite holding forth on Australian politics.")

Han doserer teen etter tempoet i fortellingen hennes, ser ikke ned på den for å forvisse seg om at koppen er på rett kjøl, ser ufravendt på henne, henger ved hennes lepper – teen kommer av seg selv.
("He doses out the tea to the tempo of her story, does not look down to make sure the cup is level, looks steadily at her, hangs on her lips – the tea comes by itself.")

Som doserer morfinen og holder en knivskarp balanse mellom bevissthet og smerteterskel.
("Who doses the morphine and keeps a razor-sharp balance between consciousness and pain threshold.")

How to disambiguate? Use the context.

Page 7:

Classification

Page 8:

Classification

- Supervised: the classes are given, together with labeled examples
- Unsupervised: construct the classes

Mostly Machine Learning (ML), but not necessarily. Many ML problems can be described as classification.

Supervised examples: spam detection, word sense disambiguation, genre of text, language classification, sentiment analysis (positive vs. negative).

Unsupervised example: topic modelling.

Page 9:

A variety of classifiers:
- k-Nearest Neighbors
- Rocchio
- Decision Trees
- Naive Bayes
- Maximum entropy (logistic regression)
- Support Vector Machines

(Several of these are covered in INF4820; see also INF4490.)

Page 10:

Classification

Page 11:

Supervised classification

Given:
- a well-defined set of objects, O
- a given set of classes, S = {s1, s2, …, sk}

Goal: a classifier γ, a mapping from O to S. For supervised training one needs a set of pairs from O×S.

Task                        | O                       | S
Word sense disambiguation   | Occurrences of "bass"   | sense1, …, sense8
Spam classification         | E-mails                 | spam, no-spam
Language classification     | Pieces of text          | Arabic, Chinese, English, Norwegian, …

Page 12:

Features

To represent the objects in O, extract a set of features. Be explicit about which features you use and, for each feature:
- its type: categorical, or numeric (discrete/continuous)
- its value space

Cf. the first lecture.

Object: person. Features: height, weight, hair color, eye color, …
Object: email. Features: length, sender, contained words, language, …

Page 13:

Supervised classification

Given:
- a set of classes, S = {s1, s2, …, sk}
- a well-defined class of objects, O
- some features f1, f2, …, fn
- for each feature, a set of possible values V1, V2, …, Vn

The set of feature vectors is V = V1 × V2 × … × Vn. Each object in O is represented by some member of V, written (v1, v2, …, vn) or (f1=v1, f2=v2, …, fn=vn). A classifier, γ, can then be considered a mapping from V to S.

Page 14:

Examples

Language classifier:
- C = {English, Norwegian, …}
- O is the set of strings of letters
- f1 is the last letter of o; V1 = {a, b, c, …, å}
- f2 is the last two letters of o; V2 is all two-letter combinations
- f3 is the length of o; V3 is 1, 2, 3, 4, …

Word sense disambiguation:
- C = {fish, music}
- O: all occurrences of "bass"
- fi = fwi: whether the word wi occurs in the same sentence as "bass", where w1 = fishing, w2 = big, …, w11 = guitar, w12 = band
- V1 = V2 = … = V12 = {0, 1}
- Example: o = (0,0,0,1,0,0,0,0,0,0,1,0), i.e. o = (ffishing=0, …, fguitar=1, fband=0)
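The word-sense feature vectors above can be sketched as follows; the cue-word list here is an invented stand-in for the lecture's w1, …, w12, not its actual word list:

```python
# Binary "word occurs in the same sentence as the target" features,
# as in the WSD example above. The cue words are a made-up stand-in.
CUE_WORDS = ["fishing", "big", "boat", "river", "play", "sound",
             "string", "deep", "sing", "low", "guitar", "band"]

def features(sentence):
    """Map a sentence containing 'bass' to a 0/1 feature vector."""
    tokens = set(sentence.lower().split())
    return tuple(1 if w in tokens else 0 for w in CUE_WORDS)

o = features("he played bass guitar in a band")
print(o)  # 1 only at the positions of 'guitar' and 'band'
```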

Page 15:

Maximum Likelihood Estimation

Page 16:

Maximum Likelihood Estimation

We have a data set {x1, x2, …, xn}. We assume it belongs to a population with a distribution where we know the general form, but where some parameters θ = {p1, p2, …, pk} are unknown. We want to choose the parameters that maximize the likelihood of the observations. When the x1, x2, …, xn are independently drawn:

$$P(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$$

Page 17:

Examples

Example 1: the sample mean x̄ is an MLE for µ for a normal distribution.

Example 2: for a categorical variable X, the relative frequency Count(a)/N is an MLE for P(X=a).

That these are MLEs needs proofs, which we will not consider.

MLEs may be too attuned to the specific sample to serve as a model for the population, as we will see. Remedy: smoothing!
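Example 2 is just relative-frequency counting; a minimal sketch:

```python
from collections import Counter

def mle(sample):
    """Relative frequencies: the MLE of P(X=a) for a categorical X."""
    counts = Counter(sample)
    n = len(sample)
    return {a: c / n for a, c in counts.items()}

print(mle(["fish", "music", "fish", "fish"]))  # {'fish': 0.75, 'music': 0.25}
```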

Page 18:

Naive Bayes

Page 19:

Naive Bayes: Decision

Given an object with observed features f1 = v1, f2 = v2, …, fn = vn, consider for each class sm:

$$P(s_m \mid f_1 = v_1, f_2 = v_2, \ldots, f_n = v_n)$$

Choose the class with the largest value; in symbols:

$$\arg\max_{s_m \in S} P(s_m \mid f_1 = v_1, f_2 = v_2, \ldots, f_n = v_n)$$

i.e. choose the class for which the observations are most likely.

Page 20:

Naive Bayes: Model

Bayes' formula:

$$P(s_m \mid f_1{=}v_1, \ldots, f_n{=}v_n) = \frac{P(f_1{=}v_1, \ldots, f_n{=}v_n \mid s_m)\, P(s_m)}{P(f_1{=}v_1, \ldots, f_n{=}v_n)}$$

Sparse data: we may not even have seen the full feature combination f1 = v1, …, fn = vn in the training data.

Assume (wrongly) independence of the features given the class:

$$P(f_1{=}v_1, \ldots, f_n{=}v_n \mid s_m) \approx \prod_{i=1}^{n} P(f_i = v_i \mid s_m)$$

Putting it together:

$$\arg\max_{s_m \in S} P(s_m \mid f_1{=}v_1, \ldots, f_n{=}v_n) \approx \arg\max_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m)$$

Page 21:

Naive Bayes: Calculation

To avoid underflow in the calculations, use logarithms. Since log is monotone increasing, the argmax is unchanged:

$$\arg\max_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m)
= \arg\max_{s_m \in S} \log\left( P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) \right)
= \arg\max_{s_m \in S} \left( \log P(s_m) + \sum_{i=1}^{n} \log P(f_i = v_i \mid s_m) \right)$$
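A minimal sketch of this log-space decision rule, with invented toy probabilities (two classes, two binary features; not estimates from any real corpus):

```python
import math

def nb_decide(priors, cond, observed):
    """argmax over classes of log P(s) + sum_i log P(f_i = v_i | s)."""
    def score(s):
        return math.log(priors[s]) + sum(
            math.log(cond[s][i][v]) for i, v in enumerate(observed))
    return max(priors, key=score)

# Toy model, hand-set for illustration.
priors = {"fish": 0.5, "music": 0.5}
cond = {
    "fish":  [{0: 0.2, 1: 0.8}, {0: 0.9, 1: 0.1}],   # P(f_i = v | fish)
    "music": [{0: 0.7, 1: 0.3}, {0: 0.4, 1: 0.6}],   # P(f_i = v | music)
}
print(nb_decide(priors, cond, (1, 0)))  # fish
```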

Page 22:

Naive Bayes, Training 1

Maximum likelihood estimate of the prior:

$$\hat{P}(s_m) = \frac{C(s_m, o)}{C(o)}$$

where C(sm, o) is the number of occurrences of objects o in class sm, and C(o) is the total number of objects.

Observe what we are doing in statistical terms: we want to estimate the true probability P(sm) from a set of observations. This is similar to estimating properties (parameters) of a population from a sample.

Page 23:

Naive Bayes (Bernoulli): Training 2

Maximum likelihood estimate of the conditional probabilities:

$$\hat{P}(f_i = v_i \mid s_m) = \frac{C(f_i = v_i, s_m)}{C(s_m)}$$

where C(fi = vi, sm) is the number of occurrences of objects o where o belongs to class sm and the feature fi takes the value vi, and C(sm) is the number of occurrences belonging to class sm.
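These counts can be computed directly from labeled feature vectors; a minimal sketch with invented data:

```python
from collections import Counter, defaultdict

def train_bernoulli(data):
    """MLE training: data is a list of (feature_vector, class) pairs.
    Returns priors P-hat(s) and conditionals P-hat(f_i = v | s)."""
    n = len(data)
    class_counts = Counter(s for _, s in data)
    feat_counts = defaultdict(Counter)   # (class, i) -> Counter over values
    for vec, s in data:
        for i, v in enumerate(vec):
            feat_counts[(s, i)][v] += 1
    priors = {s: c / n for s, c in class_counts.items()}
    cond = {(s, i, v): c / class_counts[s]
            for (s, i), ctr in feat_counts.items() for v, c in ctr.items()}
    return priors, cond

data = [((1, 0), "fish"), ((1, 1), "fish"), ((0, 1), "music"), ((0, 1), "music")]
priors, cond = train_bernoulli(data)
print(priors["fish"])        # 0.5
print(cond[("fish", 0, 1)])  # 1.0: f_0 = 1 in both 'fish' examples
```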

Page 24:

The two models

Bernoulli – the standard form of NB:
- NLTK book, Sec. 6.1, 6.2, 6.5
- Jurafsky and Martin, 2nd ed., Sec. 20.2 (WSD)

Multinomial model – for text classification, related to n-gram models:
- Jurafsky and Martin, 3rd ed., Sec. 7.1 (sentiment analysis)

Both:
- Manning, Raghavan, Schütze, Introduction to Information Retrieval, Sec. 13.0–13.3

Page 25:

Multinomial text classification

- Build a language model for each class
- Score the document according to the different classes
- Choose the class with the best score

Page 26:

Multinomial NB: Decision

In the multinomial model, fi refers to position i in the text, and vi refers to the word occurring in that position. We model the probability of the full text given the class sm:

$$\arg\max_{s_m \in S} P(s_m \mid f_1{=}v_1, \ldots, f_n{=}v_n) \approx \arg\max_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m)$$

Then we make a simplifying assumption: a word is equally likely in all positions:

$$\arg\max_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m) = \arg\max_{s_m \in S} P(s_m) \prod_{i=1}^{n} P(v_i \mid s_m)$$

Page 27:

Multinomial NB: Training

$$\hat{P}(s_m) = \frac{C(s_m, o)}{C(o)}$$

where C(sm, o) is the number of occurrences of objects o in class sm.

$$\hat{P}(w_i \mid s_m) = \frac{C(w_i, s_m)}{\sum_j C(w_j, s_m)}$$

where C(wi, sm) is the number of occurrences of word wi in all texts in class sm, and Σj C(wj, sm) is the total number of words in all texts in class sm.

Bernoulli counts the number of objects/texts where wi occurs; multinomial counts the number of occurrences of wi.
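A minimal sketch of these multinomial counts, with invented toy documents:

```python
from collections import Counter

def train_multinomial(docs):
    """docs: list of (tokens, class). Returns P-hat(s) and P-hat(w|s) (MLE, no smoothing)."""
    n = len(docs)
    class_counts = Counter(s for _, s in docs)
    word_counts = {s: Counter() for s in class_counts}
    for tokens, s in docs:
        word_counts[s].update(tokens)
    priors = {s: c / n for s, c in class_counts.items()}
    cond = {s: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for s, ctr in word_counts.items()}
    return priors, cond

docs = [(["big", "bass", "fishing"], "fish"),
        (["bass", "guitar", "band", "bass"], "music")]
priors, cond = train_multinomial(docs)
print(cond["music"]["bass"])  # 2 occurrences out of 4 words -> 0.5
```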

Page 28:

Comparison

Bernoulli:
- Registers whether a term is present or not
- Considers both the present terms and the absent terms
- Suitable for various tasks

Multinomial:
- Counts how many times a term is present
- Considers only the present terms; ignores absent terms
- Tailor-made for text classification

Page 29:

Smoothing

Page 30:

Naive Bayes: Calculation

When using maximum likelihood estimation,

$$\hat{P}(f_i = v_i \mid s_m) = \frac{C(f_i = v_i, s_m)}{C(s_m)}$$

may become 0. Then the whole product

$$P(s_m) \prod_{i=1}^{n} P(f_i = v_i \mid s_m)$$

becomes 0. Goal: avoid 0-probabilities.

Page 31:

Laplace Smoothing

Also called add-one smoothing: just add one to all the counts! Very simple.

MLE estimate: $\hat{P}(w_i) = \frac{c_i}{N}$

Laplace estimate: $\hat{P}(w_i) = \frac{c_i + 1}{N + V}$  (V = size of the vocabulary)

Variant, add-k: $\hat{P}(w_i) = \frac{c_i + k}{N + kV}$
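A minimal sketch of add-k smoothing over a toy sample (vocabulary and data invented):

```python
from collections import Counter

def laplace(sample, vocab, k=1):
    """Add-k smoothed estimates: (c_i + k) / (N + k*|V|)."""
    counts = Counter(sample)
    n = len(sample)
    return {w: (counts[w] + k) / (n + k * len(vocab)) for w in vocab}

est = laplace(["a", "a", "b"], vocab=["a", "b", "c"])
print(est["c"])  # (0 + 1) / (3 + 3): unseen, but no longer zero
```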

Page 32:

Reading: NLTK, Sec. 6.3; FSNLP, Sec. 8.1; IIR, Sec. 13.6

Evaluation

Page 33:

Data

Page 34:

Evaluation measures

                    Is in C: Yes | Is in C: No
Classifier: Yes     |    tp     |    fp
Classifier: No      |    fn     |    tn

- Accuracy: (tp+tn)/N
- Precision: tp/(tp+fp)
- Recall: tp/(tp+fn)
- F-score: the harmonic mean of precision and recall, 2PR/(P+R)
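The measures follow directly from the four cells of the table; a minimal sketch with invented counts:

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, recall and (balanced) F-score from a 2x2 table."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return {"accuracy": (tp + tn) / n, "precision": precision,
            "recall": recall, "f_score": f_score}

print(evaluate(tp=8, fp=2, fn=4, tn=6))
# accuracy 0.7, precision 0.8, recall ~0.667, F ~0.727
```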

Page 35:

Estimating the accuracy of a classifier

Suppose we evaluate on 100 randomly chosen items and get 80 of them correct. What is the true accuracy, with 95% confidence?

This is like flipping an (unfair) coin 100 times: 100 Bernoulli trials. The sample standard deviation of one trial is

s = √(p(1−p)) = √(0.8(1−0.8)) = 0.4

The standard deviation of the number of successes in 100 trials is s·√n = 0.4 · 10 = 4.

Page 36:

Estimating the accuracy

The estimate for the true number correct using the z-distribution is

[x̄ − z·s√n, x̄ + z·s√n] = [80 − 1.96 × 4, 80 + 1.96 × 4] = [72.16, 87.84]

The t-distribution with n−1 = 99 degrees of freedom yields nearly the same result.

Using the binomial distribution directly:

stats.binom.ppf([0.025, 0.975], 100, 0.8)
→ array([ 72., 88.])
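The calculation on this slide can be reproduced as follows; this is a pure-Python sketch of the normal approximation, with the slide's exact scipy call quoted in a comment:

```python
import math

# 80 correct out of 100 randomly chosen test items.
n, correct = 100, 80
p = correct / n
s = math.sqrt(p * (1 - p))      # std. dev. of one Bernoulli trial: 0.4
sd_count = s * math.sqrt(n)     # std. dev. of the number correct: 4.0
lo, hi = correct - 1.96 * sd_count, correct + 1.96 * sd_count
print(round(lo, 2), round(hi, 2))   # 72.16 87.84

# The exact binomial interval shown on the slide:
#   from scipy import stats
#   stats.binom.ppf([0.025, 0.975], 100, 0.8)  ->  array([72., 88.])
```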

Page 37:

Moral

We can only conclude that the accuracy is between .72 and .88 (with 95% confidence). We cannot conclude that it is .80. Be careful in drawing conclusions from your evaluation. Larger test sets are better!