
Alea Jacta Est!
(Or how the Reverend Thomas Bayes can liven up a boring Sunday afternoon (and without drugs!))

Edgar Gonzàlez i Pellicer

TALP Research Center

17 April 2009


Probability

A Formula

$p(X, Y) = p(X \mid Y) \cdot p(Y)$


Data, Models and Generation

[Diagram: boxes Data and Models, joined by an arrow labelled Generation]


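Simple Data, Simple Models

$X = \{x\}$
$\Theta = \{\vartheta\}$

[Figure: the two outcomes, $x = 1$ and $x = 0$]

$p(x = 1) = \frac{1}{2}, \quad p(x = 0) = \frac{1}{2}$

Simple Data, Simple Models

$X = \{x_1 \ldots x_n\}$
$\Theta = \{\vartheta\}$

[Figure: the two outcomes, $x_i = 1$ and $x_i = 0$]

$p(x_i = 1) = \frac{1}{2}, \quad p(x_i = 0) = \frac{1}{2}$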


Parameters

$X = \{x_1 \ldots x_n\}$
$\Theta = \{\vartheta\}$

[Figure: the two outcomes, $x_i = 1$ and $x_i = 0$]

$p(x_i = 1; \Theta) = \vartheta, \quad p(x_i = 0; \Theta) = 1 - \vartheta$


Estimation

[Diagram: Data, Models and Parameters, with arrows labelled Generation and Estimation]


Simple Models, Simple Estimation

$\vartheta = \frac{2}{5}$

$p(x_i = 1; \Theta) = \frac{2}{5}, \quad p(x_i = 0; \Theta) = \frac{3}{5}$

Simple Models, Simple Estimation

KNOWLEDGE


Frequentist Probability

Probability is the limit of the frequency

$p(x_i = 1; \Theta) = \lim_{n \to \infty} \frac{\lVert \{x_i \in X_n \mid x_i = 1\} \rVert}{\lVert X_n \rVert}$

Maximum likelihood estimation

$\hat{\Theta} = \arg\max_{\Theta} p(X; \Theta)$
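As a minimal sketch of maximum likelihood estimation for the coin model, the snippet below computes the estimate in closed form and checks it against a grid search over the likelihood. The sample of two ones and three zeros is an assumption, chosen only so that the result matches the value 2/5 used on the slides; the function names are illustrative.

```python
from fractions import Fraction


def bernoulli_mle(xs):
    """Maximum likelihood estimate of theta = p(x = 1) for 0/1 data."""
    return Fraction(sum(xs), len(xs))


def likelihood(theta, xs):
    """p(X; theta) for independent 0/1 observations."""
    ones = sum(xs)
    zeros = len(xs) - ones
    return theta ** ones * (1 - theta) ** zeros


# Assumed sample: two ones and three zeros.
data = [1, 0, 0, 1, 0]
theta_hat = bernoulli_mle(data)

# Check on a grid that no other theta gives a higher likelihood.
grid = [Fraction(k, 100) for k in range(101)]
best_on_grid = max(grid, key=lambda t: likelihood(t, data))
print(theta_hat, best_on_grid)
```

Exact fractions are used so the estimate can be compared with the slide values without rounding.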


Prior Knowledge

[Plots: candidate prior densities over $\vartheta \in [0, 1]$, one per slide]

$p(\Theta) = 1$
$p(\Theta) = 6 \cdot \vartheta \cdot (1 - \vartheta)$
$p(\Theta) = 30 \cdot \vartheta^2 \cdot (1 - \vartheta)^2$
$p(\Theta) = 12 \cdot \vartheta \cdot (1 - \vartheta)^2$
$p(\Theta) = 12 \cdot \vartheta^2 \cdot (1 - \vartheta)$


Bayesian Probability

Probability expresses a belief
It is updated by means of evidence

Maximum a posteriori estimation

$p(X; \Theta) = p(X \mid \Theta)$

$\hat{\Theta} = \arg\max_{\Theta} p(\Theta \mid X) = \arg\max_{\Theta} \frac{p(X \mid \Theta) \cdot p(\Theta)}{p(X)} = \arg\max_{\Theta} p(X \mid \Theta) \cdot p(\Theta)$


Hyperparameters

[Diagram: Data, Models, Parameters and Hyperparameters, with arrows labelled Generation and Estimation]


Simple Model, Simple Estimation (but Bayesian)

[Plots: each prior density over $\vartheta \in [0, 1]$ with the resulting estimate, one per slide]

$p(\Theta) = 6 \cdot \vartheta \cdot (1 - \vartheta)$: $\vartheta = \frac{3}{7}$, so $p(x_i = 1; \Theta) = \frac{3}{7}$ and $p(x_i = 0; \Theta) = \frac{4}{7}$
$p(\Theta) = 30 \cdot \vartheta^2 \cdot (1 - \vartheta)^2$: $\vartheta = \frac{4}{9}$, so $p(x_i = 1; \Theta) = \frac{4}{9}$ and $p(x_i = 0; \Theta) = \frac{5}{9}$
$p(\Theta) = 12 \cdot \vartheta \cdot (1 - \vartheta)^2$: $\vartheta = \frac{3}{8}$, so $p(x_i = 1; \Theta) = \frac{3}{8}$ and $p(x_i = 0; \Theta) = \frac{5}{8}$
$p(\Theta) = 1$: $\vartheta = \frac{2}{5}$, so $p(x_i = 1; \Theta) = \frac{2}{5}$ and $p(x_i = 0; \Theta) = \frac{3}{5}$
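The estimates above can be reproduced with a short sketch. It assumes the data are two ones and three zeros (consistent with the maximum likelihood value 2/5) and treats each prior as proportional to $\vartheta^{a-1} (1 - \vartheta)^{b-1}$, i.e. a Beta(a, b) density; the Beta naming and the function name are mine, not the slides'. The posterior is then proportional to $\vartheta^{a-1+h} (1 - \vartheta)^{b-1+t}$, whose maximum has a closed form.

```python
from fractions import Fraction


def map_estimate(heads, tails, a, b):
    """Posterior mode for a prior proportional to
    theta**(a-1) * (1-theta)**(b-1), after seeing `heads` ones
    and `tails` zeros. Assumes a >= 1 and b >= 1."""
    return Fraction(a - 1 + heads, a + b - 2 + heads + tails)


heads, tails = 2, 3  # assumed sample, as in the maximum likelihood case

# (a, b) exponents for the priors that appear on the slides.
priors = {
    "1": (1, 1),
    "theta (1 - theta)": (2, 2),
    "theta^2 (1 - theta)^2": (3, 3),
    "theta (1 - theta)^2": (2, 3),
}
estimates = {name: map_estimate(heads, tails, a, b)
             for name, (a, b) in priors.items()}
for name, value in estimates.items():
    print(name, value)
```

The normalising constant of the prior does not affect the position of the maximum, which is why only the exponents are needed here.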


[Figure: the estimates $\frac{2}{5}$, $\frac{3}{7}$ and $\frac{4}{9}$ marked on a $\vartheta$ axis running from 0.35 to 0.538, next to small plots of the prior densities]


Inference

[Diagram: Data and Models, with arrows labelled Generation, Estimation and Inference]


Is the Coin Fair?

$H_0: \vartheta = \vartheta_0 = 0.5$
$H_1: \vartheta \neq \vartheta_0 = 0.5$

$B_{01} = \frac{p(H_0 \mid X)}{p(H_1 \mid X)} = \frac{p(X \mid H_0) \cdot p(H_0)}{p(X \mid H_1) \cdot p(H_1)}$


$B_{01} = \frac{p(X \mid H_0) \cdot p(H_0)}{p(X \mid H_1) \cdot p(H_1)} = \frac{\vartheta_0^2 \cdot (1 - \vartheta_0)^3 \cdot p(\vartheta_0)}{\int_0^1 \vartheta^2 \cdot (1 - \vartheta)^3 \cdot p(\vartheta) \, d\vartheta}$

[Plots: four prior densities over $\vartheta \in [0, 1]$, labelled in order with $B_{01} = \frac{15}{8}$, $B_{01} = \frac{315}{128}$, $B_{01} = \frac{35}{16}$ and $B_{01} = \frac{63}{32}$]
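The four values can be checked with exact arithmetic. The sketch below assumes a prior proportional to $\vartheta^i (1 - \vartheta)^j$ and data of two ones and three zeros; the pairing of each value with a particular prior is my own computation, since the extracted slides do not state it. It relies on the standard identity $\int_0^1 t^p (1 - t)^q \, dt = \frac{p! \, q!}{(p + q + 1)!}$ for non-negative integers.

```python
from fractions import Fraction
from math import factorial


def beta_integral(p, q):
    """Integral over [0, 1] of t**p * (1 - t)**q, for integers p, q >= 0."""
    return Fraction(factorial(p) * factorial(q), factorial(p + q + 1))


def bayes_factor(heads, tails, i, j, theta0=Fraction(1, 2)):
    """B01 for a prior proportional to t**i * (1 - t)**j.

    The prior's normalising constant multiplies both the numerator
    and the denominator, so it cancels and is left out."""
    numerator = theta0 ** (heads + i) * (1 - theta0) ** (tails + j)
    denominator = beta_integral(heads + i, tails + j)
    return numerator / denominator


heads, tails = 2, 3  # assumed sample
factors = {
    "uniform": bayes_factor(heads, tails, 0, 0),
    "theta (1 - theta)": bayes_factor(heads, tails, 1, 1),
    "theta^2 (1 - theta)^2": bayes_factor(heads, tails, 2, 2),
    "theta (1 - theta)^2": bayes_factor(heads, tails, 1, 2),
}
for name, value in factors.items():
    print(name, value)
```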


[Figure: the values $\frac{15}{8}$, $\frac{63}{32}$, $\frac{35}{16}$ and $\frac{315}{128}$ marked on a $B_{01}$ axis running from 1.75 to 2.75, next to small plots of the prior densities]


Knowledge, Beliefs, Assumptions...

The Bayesian approach makes the assumptions being made explicit.
It was proposed by Bayes and Laplace in the 18th century
And criticised by Fisher in the 20th century
Subjectivity
But, nevertheless...
No inference can be made without making assumptions [MacKay, 2003]
Revival with Machine Learning approaches


The Importance of the Prior

The prior thus represents our assumptions
We can choose among
Subjective priors
Informative priors
Non-informative priors
Conjugate priors


Probabilistic Models for NLP: Naïve Bayes

Documents

$X = \{x_1 \ldots x_n\}$
$x_i = \{x_{i1} \ldots x_{i l_i}\}$

$p(X) = \prod_i p(x_i)$

$p(x_i) = p(l_i) \cdot p(x_{i1} \ldots x_{i l_i}) = p(l_i) \cdot \prod_j p(x_{ij})$

Categorical/multinomial distribution



Documents

[Figure: a bag of words: is, ., the, cat, eats, dog, beach, sun, house, fish]



Documents and Classes

$X' = \{x'_1 \ldots x'_n\}$
$x'_i = (y_i, x_i = \{x_{i1} \ldots x_{i l_i}\})$

$p(X') = \prod_i p(x'_i)$

$p(x'_i) = p(y_i) \cdot p(x_i \mid y_i)$
$= p(y_i) \cdot p(l_i \mid y_i) \cdot p(x_{i1} \ldots x_{i l_i} \mid y_i)$
$= p(y_i) \cdot p(l_i) \cdot \prod_j p(x_{ij} \mid y_i)$

Naïve Bayes assumption



Documents and Classes

[Figure: the same bag of words (is, ., the, cat, eats, dog, beach, sun, house, fish) repeated six times]



Classification

[Diagram: Data, Models and Parameters, with arrows labelled Generation and Estimation, plus a second Data box with an arrow labelled Classification]



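Classification

$x_N = \{x_{N1} \ldots x_{N l_N}\}$

$p(y_N = y \mid x_N) = \frac{p(x_N \mid y) \cdot p(y)}{p(x_N)}$

$= \frac{p(x_N \mid y) \cdot p(y)}{\sum_{y'} p(x_N \mid y') \cdot p(y')}$

$= \frac{p(l_N \mid y) \cdot p(x_{N1} \ldots x_{N l_N} \mid y) \cdot p(y)}{\sum_{y'} p(l_N \mid y') \cdot p(x_{N1} \ldots x_{N l_N} \mid y') \cdot p(y')}$

$= \frac{\prod_j p(x_{Nj} \mid y) \cdot p(y)}{\sum_{y'} \prod_j p(x_{Nj} \mid y') \cdot p(y')}$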
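The final expression translates almost directly into code. The following is a minimal sketch, not the author's implementation: the toy corpus, the class names `ham` and `spam`, and the add-alpha smoothing are all assumptions introduced for illustration. Probabilities are combined in log space to avoid underflow on long documents.

```python
from collections import Counter, defaultdict
from math import exp, log


def train(labelled_docs, alpha=1.0):
    """Estimate p(y) and p(word | y) from (label, tokens) pairs.
    alpha is add-alpha smoothing, an assumption not on the slides."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_counts.values())
    prior = {y: c / total_docs for y, c in class_counts.items()}
    cond = {}
    for y in class_counts:
        denom = sum(word_counts[y].values()) + alpha * len(vocab)
        cond[y] = {w: (word_counts[y][w] + alpha) / denom for w in vocab}
    return prior, cond


def posterior(tokens, prior, cond):
    """p(y | x) = prod_j p(x_j | y) p(y) / sum_y' prod_j p(x_j | y') p(y').
    Words outside the training vocabulary are skipped."""
    log_joint = {
        y: log(prior[y]) + sum(log(cond[y][w]) for w in tokens if w in cond[y])
        for y in prior
    }
    top = max(log_joint.values())
    unnorm = {y: exp(v - top) for y, v in log_joint.items()}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}


# Hypothetical toy corpus built from words that appear on the slides.
corpus = [
    ("ham", "the cat eats fish .".split()),
    ("ham", "the dog eats fish .".split()),
    ("spam", "free diploma free money .".split()),
    ("spam", "you enlarge money pill .".split()),
]
prior, cond = train(corpus)
result = posterior("free money".split(), prior, cond)
print(max(result, key=result.get), result)
```

The document length term $p(l_N)$ is omitted because, under the assumption on the slide, it does not depend on the class and cancels in the ratio.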



Spam Detection

[Figure: two bags of words. First: is, ., the, cat, eats, dog, beach, sun, house, fish. Second: ., is, the, free, you, diploma, enlarge, penis, pill, money]



Applications

Text classification [Nigam et al., 2000]
Spam detection [Sahami et al., 1998]
Detection of criminal characteristics [Bache et al., 2008]
Word Sense Disambiguation [Gale et al., 1992]


Probabilistic Models for NLP: Expectation-Maximization

And What Happens If We Hide the Class?

$X = \{x_1 \ldots x_n\}$
$x_i \to x'_i = (y_i, x_i)$

But we know a probability for the $x'_i$

$p(x'_i) = p(y_i) \cdot \prod_j p(x_{ij} \mid y_i)$



And What Happens If We Hide the Class?

Clustering problem

$\hat{\Theta} = \begin{cases} \arg\max_{\Theta} \sum_Y p(X, Y \mid \Theta) \\ \arg\max_{\Theta} \sum_Y p(\Theta, Y \mid X) \end{cases}$

$\hat{Y} = \arg\max_Y p(Y \mid X, \hat{\Theta})$

Maximising these formulas explicitly is hard
Expectation-Maximization algorithm



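Expectation-Maximization

[Diagram: Initialization gives $\Theta_0$; Expectation gives $E_0$; Maximization gives $\Theta_1$; Expectation gives $E_1$; Maximization gives $\Theta_2$; ...; $E_{r-1}$; Maximization gives $\Theta_r$; Expectation gives $E_r$]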



Expectation-Maximization

$E_s = p(Y \mid X, \Theta_s)$

$\Theta_{s+1} = \begin{cases} \arg\max_{\Theta} \sum_Y E_s(Y) \cdot \log p(X, Y \mid \Theta) \\ \arg\max_{\Theta} \sum_Y E_s(Y) \cdot \log p(\Theta \mid X, Y) \end{cases}$

In the case of a set of classes, finding $\Theta_s$ at each step usually amounts to solving a classification problem in which each $x_i$ belongs to each class $y$ in a fraction $E_s(y, x_i)$.
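The fractional-membership view can be sketched for a mixture of Naïve Bayes classes. This is an illustrative sketch under several assumptions: the six two-word documents, the slightly asymmetric initial word distributions, the fixed iteration count and the small smoothing constant `alpha` are all mine. The E step computes $p(y \mid x_i, \Theta_s)$ for every document; the M step re-estimates the parameters from counts weighted by those fractions.

```python
from math import prod


def em_mixture(docs, vocab, k, init_cond, iterations=20, alpha=0.01):
    """EM for a k-class mixture with p(x_i | y) = prod_j p(x_ij | y).

    init_cond[y][w] is an initial guess of p(w | y); alpha is a small
    smoothing constant (an assumption) that keeps probabilities non-zero."""
    prior = [1.0 / k] * k
    cond = [dict(row) for row in init_cond]
    resp = []
    for _ in range(iterations):
        # E step: resp[i][y] = p(y | x_i, Theta_s).
        resp = []
        for doc in docs:
            joint = [prior[y] * prod(cond[y][w] for w in doc) for y in range(k)]
            z = sum(joint)
            resp.append([v / z for v in joint])
        # M step: re-estimate the parameters from fractional counts.
        prior = [sum(r[y] for r in resp) / len(docs) for y in range(k)]
        for y in range(k):
            counts = {w: alpha for w in vocab}
            for doc, r in zip(docs, resp):
                for w in doc:
                    counts[w] += r[y]
            total = sum(counts.values())
            cond[y] = {w: c / total for w, c in counts.items()}
    return prior, cond, resp


# Hypothetical unlabelled documents: three about animals, three spam-like.
docs = [["cat", "fish"], ["dog", "fish"], ["cat", "dog"],
        ["free", "money"], ["money", "pill"], ["free", "pill"]]
vocab = ["cat", "fish", "dog", "free", "money", "pill"]
# Nearly uniform start; the small asymmetry breaks the tie between classes.
init_cond = [
    {w: (0.30 if w == "cat" else 0.14) for w in vocab},
    {w: (0.30 if w == "free" else 0.14) for w in vocab},
]
prior, cond, resp = em_mixture(docs, vocab, 2, init_cond)
for doc, r in zip(docs, resp):
    print(doc, [round(v, 3) for v in r])
```

With a perfectly symmetric initialization the two classes would stay identical, which is why the starting point matters in practice.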



Applications

Text classification [Nigam et al., 2000]
Incorporation of unlabelled data
Unsupervised relation detection



Flexibility

Incorporation of a noise class

$p(x_i \mid y_i) = \begin{cases} \prod_j p(x_{ij} \mid y_i) & i < k \\ W^{-j} & i = k \end{cases}$

Detection of irrelevant features [Law et al., 2002]

$p(x_i \mid y_i) = \prod_j \left( \rho_j \cdot p(x_{ij} \mid y_i, r_j) + (1 - \rho_j) \cdot p(x_{ij} \mid \neg r_j) \right)$



The Eternal Dilemma

How many clusters?

The Eternal Dilemma

Model selection problem
Use of Bayes factors
Prior over the models
Criteria coming from other sources
Akaike Information Criterion
Bayesian Information Criterion
Minimum Message Length
Minimum Description Length


Probabilistic Models for NLP: Markov Models

Documents

$X = \{x_1 \ldots x_n\}$
$x_i = \{x_{i1} \ldots x_{i l_i}\}$

$p(X) = \prod_i p(x_i)$

$p(x_i) = p(l_i) \cdot p(x_{i1} \ldots x_{i l_i}) = p(l_i) \cdot \prod_j p(x_{ij})$



Sequences

$p(x_1 \ldots x_l) = p(x_1) \cdot \prod_{j>1} p(x_j \mid x_1 \ldots x_{j-1})$
$= p(x_1) \cdot \prod_{j>1} p(x_j \mid x_{j-k} \ldots x_{j-1})$
$= p(x'_1) \cdot \prod_{j>1} p(x'_j \mid x'_{j-1})$

Markov model

$p(x_1)$
$p(x_j \mid x_{j-1})$
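A first-order Markov model needs only the two tables named above. The sketch below scores a sequence with them; the initial and transition probabilities are hypothetical values chosen for illustration and are not read off the figure in the talk.

```python
def sequence_probability(tokens, initial, transition):
    """p(x_1 ... x_l) = p(x_1) * prod_{j>1} p(x_j | x_{j-1}).

    Unseen start symbols or transitions get probability 0."""
    if not tokens:
        return 1.0  # empty product
    p = initial.get(tokens[0], 0.0)
    for prev, cur in zip(tokens, tokens[1:]):
        p *= transition.get(prev, {}).get(cur, 0.0)
    return p


# Hypothetical parameters over a tiny Catalan vocabulary
# (el = the, gat = cat, gos = dog, menja = eats, peix = fish).
initial = {"el": 1.0}
transition = {
    "el": {"gat": 0.9, "gos": 0.1},
    "gat": {"menja": 1.0},
    "gos": {"menja": 1.0},
    "menja": {"peix": 0.8, ".": 0.2},
    "peix": {".": 1.0},
}
sentence = "el gat menja peix .".split()
p_sentence = sequence_probability(sentence, initial, transition)
print(p_sentence)
```

A higher-order model with context $x_{j-k} \ldots x_{j-1}$ fits the same code if each state is taken to be the tuple of the last $k$ words, which is the $x'_j$ rewriting on the slide.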



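Markov Models

[Figure: Markov chain over the words el, gat, gos, menja, peix and "." (the, cat, dog, eats, fish), with transition probabilities 0.9, 0.1, 0.5, 0.5, 0.2, 0.8, 0.2, 0.8, 0.5, 0.5 and 1.0 on the arcs]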



Applications

n-gram models

One More Turn of the Screw

[Diagrams, built up over several slides: the sequence $x_1 \ldots x_l$; then a second sequence $y_1 \ldots y_l$ is added, with an arrow from each $y_j$ down to $x_j$]



Hidden Markov Models

$p(x_1, y_1 \ldots x_l, y_l) = p(y_1) \cdot p(x_1 \mid y_1) \cdot \prod_{j>1} p(y_j \mid y_{j-1}) \cdot p(x_j \mid y_j)$

$p(x_1 \ldots x_l) = \sum_{y_1 \ldots y_l} p(x_1, y_1 \ldots x_l, y_l)$




[Figure: hidden Markov model with states DT, NN, VB and ".", each emitting from a bag of words (is, ., the, cat, eats, dog, beach, sun, house, fish), with transition probabilities 0.9, 0.1, 1.0, 0.8, 0.2, 0.5 and 0.5 on the arcs]




Three canonical problems

1 Probability of a sequence
$p(x_1 \ldots x_l)$
Forward algorithm

2 Most probable state sequence
$\arg\max_{y_1 \ldots y_l} p(y_1 \ldots y_l \mid x_1 \ldots x_l)$
Viterbi algorithm

3 Parameter estimation
$\arg\max_{\Theta} p(X \mid \Theta)$
Baum-Welch algorithm (Expectation-Maximization)
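The first two problems can be sketched on a toy model. Everything about the model below (three tags, four words, all probabilities) is hypothetical and chosen only to make the example small; the two functions follow the usual dynamic-programming formulation of the Forward and Viterbi algorithms.

```python
def forward(obs, states, start, trans, emit):
    """p(x_1 ... x_l), summing over all state sequences without
    enumerating them: one pass, O(l * |states|^2)."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        alpha = {s: emit[s][x] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())


def viterbi(obs, states, start, trans, emit):
    """Most probable state sequence and its joint probability with obs."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for x in obs[1:]:
        step = {}
        for s in states:
            score, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda pair: pair[0],
            )
            step[s] = (score * emit[s][x], path + [s])
        best = step
    score, path = max(best.values(), key=lambda pair: pair[0])
    return path, score


# Hypothetical toy tagger.
states = ["DT", "NN", "VB"]
start = {"DT": 0.8, "NN": 0.2, "VB": 0.0}
trans = {
    "DT": {"DT": 0.0, "NN": 1.0, "VB": 0.0},
    "NN": {"DT": 0.0, "NN": 0.2, "VB": 0.8},
    "VB": {"DT": 0.5, "NN": 0.5, "VB": 0.0},
}
emit = {
    "DT": {"the": 1.0, "cat": 0.0, "eats": 0.0, "fish": 0.0},
    "NN": {"the": 0.0, "cat": 0.4, "eats": 0.1, "fish": 0.5},
    "VB": {"the": 0.0, "cat": 0.0, "eats": 0.9, "fish": 0.1},
}
obs = "the cat eats fish".split()
print(forward(obs, states, start, trans, emit))
print(viterbi(obs, states, start, trans, emit))
```

Maximising the joint $p(x, y)$ gives the same sequence as maximising $p(y \mid x)$, because $p(x)$ is constant for a fixed observation. The third problem, Baum-Welch, combines a forward and a backward pass with the fractional-count re-estimation of Expectation-Maximization and is not shown here.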



Applications

Part-of-Speech Tagging [Charniak et al., 1993]
Named entity recognition [Malouf, 2002]
Information Extraction [Freitag and McCallum, 1999]


Probabilistic Models for NLP: Maximum Entropy Models

Generative Models

$p(y \mid x) = \frac{p(y) \cdot p(x \mid y)}{p(x)}$

$= \frac{p(y) \cdot p(x \mid y)}{\sum_{y'} p(y') \cdot p(x \mid y')}$

$= \frac{p(x, y)}{\sum_{y'} p(x, y')}$



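Constraints

Which $p(y \mid x)$ do we choose?
It must satisfy constraints:

$(y_i, x_i) = (y_i, \{x_1 \ldots x_l\})$
$(y_i, x_i) \to \{f_1(y_i, x_i) \ldots f_s(y_i, x_i)\}$

$f_j(y_i, x_i) = \begin{cases} 1 & y_i = \varphi \wedge x_{ij} = \chi \\ 0 & y_i \neq \varphi \vee x_{ij} \neq \chi \end{cases}$

$\sum_{(y_i, x_i) \in X'} f_j(y_i, x_i) = \sum_{x \in X, y \in Y} p(y, x) \cdot f_j(y, x)$

$\tilde{E}(f_j) = E(f_j)$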



Constraints

It must be a probability distribution

$p(y \mid x) \geq 0$
$\sum_y p(y \mid x) = 1$

It must have Maximum Entropy



Entropy

$H(x) = -\sum_{x \in X} p(x) \cdot \log p(x)$

Entropy

$H(y \mid x) = -\sum_{x \in X, y \in Y} p(y, x) \cdot \log p(y \mid x)$
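Both definitions are easy to evaluate on small tables. The sketch below uses base-2 logarithms, which is an assumption since the slides do not fix the base, and takes distributions as plain dictionaries; the example distributions are hypothetical.

```python
from math import log2


def entropy(p):
    """H(x) = -sum_x p(x) log p(x), in bits.
    Zero-probability outcomes contribute nothing."""
    return -sum(v * log2(v) for v in p.values() if v > 0)


def conditional_entropy(joint):
    """H(y | x) = -sum_{x,y} p(y, x) log p(y | x),
    where joint[(y, x)] = p(y, x)."""
    marginal_x = {}
    for (y, x), v in joint.items():
        marginal_x[x] = marginal_x.get(x, 0.0) + v
    return -sum(v * log2(v / marginal_x[x])
                for (y, x), v in joint.items() if v > 0)


# Hypothetical distributions.
fair_coin = {"heads": 0.5, "tails": 0.5}
# The word determines the class completely: no uncertainty left.
determined = {("spam", "free"): 0.5, ("ham", "cat"): 0.5}
# The word says nothing about the class: one full bit left.
independent = {("spam", "free"): 0.25, ("ham", "free"): 0.25,
               ("spam", "cat"): 0.25, ("ham", "cat"): 0.25}
print(entropy(fair_coin))
print(conditional_entropy(determined), conditional_entropy(independent))
```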



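Maximum Entropy

Using Lagrange multipliers we arrive at the function

$\Psi(\Lambda) = H(y \mid x) + \sum_j \lambda_j \left( E(f_j) - \tilde{E}(f_j) \right) + \lambda' \left( \sum_y p(y \mid x) - 1 \right)$

And, making approximations, we deduce that the models must have the form

$p(y \mid x) = \frac{1}{Z(x)} \cdot \exp \sum_j \lambda_j \cdot f_j(y, x)$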
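The resulting form is a normalised exponential of weighted feature sums. As a small sketch of how it is evaluated, the code below uses two hypothetical indicator features of the kind defined on the constraints slide and hand-picked weights; no parameter estimation is attempted.

```python
from math import exp


def maxent_conditional(x, labels, features, weights):
    """p(y | x) = exp(sum_j lambda_j f_j(y, x)) / Z(x),
    with Z(x) summing the same exponential over all labels."""
    scores = {
        y: exp(sum(w * f(y, x) for f, w in zip(features, weights)))
        for y in labels
    }
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}


# Hypothetical indicator features: a class paired with a word.
features = [
    lambda y, x: 1.0 if y == "spam" and "free" in x else 0.0,
    lambda y, x: 1.0 if y == "ham" and "cat" in x else 0.0,
]
weights = [2.0, 1.0]  # hand-picked lambda values, for illustration only

dist = maxent_conditional({"free", "money"}, ["spam", "ham"], features, weights)
print(dist)
```

Because $Z(x)$ only sums over labels for a fixed $x$, the model never needs $p(x)$, which is what makes it discriminative rather than generative.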



Maximum Entropy

Parameter estimation using numerical methods
Iterative Scaling
A regularization term is often added

$\Psi(\Lambda) - \sum_j \frac{\lambda_j^2}{2 \sigma^2}$



Maximum Entropy/Likelihood

$p(x, y) = p(y \mid x) \cdot p(x)$
$p(x, y; \Theta_{yx}, \Theta_x) = p(y \mid x; \Theta_{yx}) \cdot p(x; \Theta_x)$

$(\hat{\Theta}_{yx}, \hat{\Theta}_x) = \arg\max_{\Theta_{yx}, \Theta_x} \prod_{(y,x) \in X'} p(y, x; \Theta_{yx}, \Theta_x)$
$= \arg\max_{\Theta_{yx}, \Theta_x} \sum_{(y,x) \in X'} \log p(y, x; \Theta_{yx}, \Theta_x)$
$= \arg\max_{\Theta_{yx}, \Theta_x} \sum_{(y,x) \in X'} \log p(y \mid x; \Theta_{yx}) + \sum_{(y,x) \in X'} \log p(x; \Theta_x)$

$\hat{\Theta}_{yx} = \arg\max_{\Theta_{yx}} \sum_{(y,x) \in X'} \log p(y \mid x; \Theta_{yx})$
$\hat{\Theta}_x = \arg\max_{\Theta_x} \sum_{(y,x) \in X'} \log p(x; \Theta_x)$




$\sum_{(y,x) \in X'} \log p(y \mid x; \Theta_{yx}) = \sum_{x \in X, y \in Y} \tilde{p}(y, x) \cdot \log p(y \mid x; \Theta_{yx})$

Which, for the proposed models, is equivalent to $\Psi(\Lambda)$

Maximum Entropy = Maximum Likelihood

The regularization term is equivalent to adding a Gaussian prior $\Lambda \sim \mathcal{N}(0, \sigma^2 I)$

Maximum Entropy = Maximum A Posteriori



Applications

Part-of-Speech Tagging [Ratnaparkhi, 1996]
Machine Translation [Berger et al., 1996]


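Probabilistic Models for NLP: Conditional Random Fields

Let's Recap

Naive Bayes

$p(y, x) = p(y) \cdot p(x_1 \mid y) \ldots p(x_l \mid y)$
$p(y, x) = \frac{1}{1} \cdot \Psi_0 \cdot \Psi_1 \ldots \Psi_l$

Hidden Markov Models

$p(y, x) = p(y_1) \cdot p(x_1 \mid y_1) \cdot p(y_2 \mid y_1) \ldots p(x_l \mid y_l)$
$p(y, x) = \frac{1}{1} \cdot \Psi_1 \cdot \Psi_2 \cdot \Psi_3 \ldots \Psi_{2l}$

Maximum Entropy Models

$p(y \mid x) = \frac{1}{Z(x)} \cdot \exp(\lambda_1 \cdot f_1(y, x)) \ldots \exp(\lambda_l \cdot f_l(y, x))$
$p(y \mid x) = \frac{1}{Z(x)} \cdot \Psi_1 \ldots \Psi_l$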



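Graphical Models

[Figure: Naive Bayes: $y$ linked to $x_1, x_2 \ldots x_l$ through factors $\Psi_0, \Psi_1, \Psi_2 \ldots \Psi_l$. Maximum Entropy: $y$ linked to $x$ through factors $\Psi_1, \Psi_2 \ldots \Psi_l$. Shown with and without the factors.]

[Figure: chain $y_1, y_2, y_3 \ldots y_l$ with observations $x_1, x_2, x_3 \ldots x_l$ and factors $\Psi_1 \ldots \Psi_{2l}$. Shown with and without the factors.]

[Figure: Linear-Chain CRFs: chain $y_1, y_2, y_3 \ldots y_l$ over a single $x$]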



Linear-Chain CRFs

$\vec{x} = \{x_1 \ldots x_l\}$
$\vec{y} = \{y_1 \ldots y_l\}$

$p(\vec{y} \mid \vec{x}) = \frac{1}{Z(\vec{x})} \cdot \exp \sum_{i,j} \lambda_i \cdot f_i(y_{j-1}, y_j, \vec{x}, j)$

Extension of Hidden Markov Models
Extension of Maximum Entropy Models
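To make the formula concrete, the sketch below evaluates it by brute force on a two-word input. It is illustrative only: the features, weights and labels are hypothetical, the start symbol `<s>` standing in for $y_0$ is an assumption, and $Z(\vec{x})$ is computed by enumerating every label sequence, which is exponential in the sequence length. Practical implementations use a forward-style recursion instead.

```python
from itertools import product
from math import exp


def crf_score(ys, xs, features, weights, start="<s>"):
    """exp(sum_{i,j} lambda_i f_i(y_{j-1}, y_j, x, j)) for one label sequence.
    `start` plays the role of y_0 (an assumption of this sketch)."""
    total = 0.0
    prev = start
    for j, y in enumerate(ys):
        total += sum(w * f(prev, y, xs, j) for f, w in zip(features, weights))
        prev = y
    return exp(total)


def crf_probability(ys, xs, labels, features, weights):
    """p(y | x), with Z(x) obtained by enumerating all label sequences.
    Only suitable for toy inputs."""
    z = sum(crf_score(cand, xs, features, weights)
            for cand in product(labels, repeat=len(xs)))
    return crf_score(tuple(ys), xs, features, weights) / z


# Hypothetical features: one looks at the observation, one at the transition.
features = [
    lambda prev, y, xs, j: 1.0 if xs[j] == "the" and y == "DT" else 0.0,
    lambda prev, y, xs, j: 1.0 if prev == "DT" and y == "NN" else 0.0,
]
weights = [2.0, 1.5]
labels = ["DT", "NN"]
xs = ["the", "cat"]

table = {ys: crf_probability(ys, xs, labels, features, weights)
         for ys in product(labels, repeat=len(xs))}
for ys, p in table.items():
    print(ys, round(p, 4))
```

Each feature may inspect the whole input $\vec{x}$ at any position, which is the freedom a Hidden Markov Model's emission probabilities do not offer.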



Linear-Chain CRFs

They inherit from Hidden Markov Models
Sequential model
Viterbi algorithm
Forward-Backward algorithm

They inherit from Maximum Entropy Models
Discriminative model
Parameter estimation by numerical optimization
Need for regularization



Applications

Part-of-Speech Tagging [Lafferty et al., 2001]
Shallow Parsing [Sha and Pereira, 2003]



Linear-Chain CRFs

[Figure: Linear-Chain CRF: chain $y_1, y_2, y_3 \ldots y_l$ over a single $x$]

Skip-Chain CRFs

[Figure: Skip-Chain CRF: the labels $y_1, y_2, y_3 \ldots y_l$ over a single $x$]



Skip-Chain CRFs

Relaxation of the restrictions on the features
Non-local dependencies
Increase in algorithmic complexity
Approximate inference
[Sutton and McCallum, 2007]



The Road Goes Ever On And On

Bayesian Networks

Maximum-Entropy Markov Models

Hierarchical Markov Models

CRFs Generals

. . .


The End

Thank you!


Bibliography

R. Bache, F. Crestani, D. Canter, and D. Youngs. A language modelling approach to linking criminal styles with offender characteristics. In Natural Language for Information Systems (NLDB), 2008.

A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 1996.

E. Charniak, C. Hendrickson, N. Jacobson, and M. Perkowitz. Equations for part-of-speech tagging. In National Conference on Artificial Intelligence, 1993.

D. Freitag and A. McCallum. Information extraction with HMMs and shrinkage. In COLING-ACL, 1999.

W. Gale, K. Church, and D. Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26(5/6), 1992.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.

M. Law, A. Jain, and M. Figueiredo. Feature selection in mixture-based clustering. In Neural Information Processing Systems (NIPS), 2002.

R. Malouf. Markov models for language-independent named entity recognition. In Conference on Natural Language Learning (CoNLL), 2002.

C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

D. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 2000.

A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Empirical Methods in Natural Language Processing (EMNLP), 1996.

M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e-mail. In AAAI Workshop on Learning for Text Categorization, 1998.

F. Sha and F. Pereira. Shallow parsing with conditional random fields. In HLT-NAACL, 2003.

C. Sutton and A. McCallum. Introduction to Statistical Relational Learning, chapter An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2007.

