01/*,( 2+,-34(5675(...Conditional Probabilities • A simple relation between joint and conditional...

!"#$%&'()*+,-(#./01/*,(2+,-34(5675(

89:3(;3<=30*>34(((

"=+?3.(1?1@-3?(A4*0(!14=*.(B93.-4+,(1,?(C1,(D=3+,(

E*94(F4.-(G*,.9=/,H(I*J(•  K(J+==+*,1+43(A4*0(-L3(.9J94J.(*A("31<=3(1.:.(>*9(1(M93./*,'(– N3(.1>.'(O(L1P3(-L90J-1G:Q(+A(O(R+@(+-Q(SL1-T.(-L3(@4*J1J+=+->(+-(S+==(A1==(S+-L(-L3(,1+=(9@U(

–  E*9(.1>'()=31.3(R+@(+-(1(A3S(/03.'(

–  E*9(.1>'(VL3(@4*J1J+=+->(+.'(•  )WNX(Y(Z[$(

– !"#$%&$'#()&***#–  E*9(.1>'(\3G19.3](

Random Variables

•  A random variable is some aspect of the world about which we (may) have uncertainty –  R = Is it raining? –  D = How long will it take to drive to work? –  L = Where am I?

•  We denote random variables with capital letters

•  Random variables have domains –  R in {true, false} (sometimes write as {+r, ¬r}) –  D in [0, !) –  L in possible locations, maybe {(0,0), (0,1), !}

Probability Distributions •  Discrete random variables have distributions

•  A discrete distribution is a TABLE of probabilities of values •  A probability (lower case value) is a single number

•  Must have:

T P warm 0.5 cold 0.5

W P sun 0.6 rain 0.1 fog 0.3

meteor 0.0

Joint Distributions •  A joint distribution over a set of random variables:

specifies a real number for each assignment (or outcome):

–  Size of distribution if n variables with domain sizes d?

–  Must obey:

•  For all but the smallest distributions, impractical to write out

T W P hot sun 0.4 hot rain 0.1 cold sun 0.2 cold rain 0.3

Marginal Distributions •  Marginal distributions are sub-tables which eliminate variables •  Marginalization (summing out): Combine collapsed rows by adding


T P hot 0.5 cold 0.5

W P sun 0.6 rain 0.4

Conditional Probabilities •  A simple relation between joint and conditional probabilities

–  In fact, this is taken as the definition of a conditional probability


Conditional Distributions •  Conditional distributions are probability distributions over

some variables given fixed values of others




Conditional Distributions Joint Distribution

The Product Rule •  Sometimes have conditional distributions but want the joint

•  Example:

R P sun 0.8 rain 0.2

D W P wet sun 0.1 dry sun 0.9 wet rain 0.7 dry rain 0.3

D W P wet sun 0.08 dry sun 0.72 wet rain 0.14 dry rain 0.06

Bayes’ Rule •  Two ways to factor a joint distribution over two variables:

•  Dividing, we get:

•  Why is this at all helpful? –  Lets us build one conditional from its reverse –  Often one conditional is tricky but the other one is simple –  Foundation of many practical systems (e.g. ASR, MT)

•  In the running for most important ML equation!

That’s my rule!

VL90J-1G:(^(\+,*0+1=(C+.-4+J9/*,(

•  )WN31?.X(Y("Q(()WV1+=.X(Y(7_"#

(•  `=+@.(143(!"!"#"'((

–  O,?3@3,?3,-(3P3,-.(–  O?3,/G1==>(?+.-4+J9-3?(1GG*4?+,H(-*(\+,*0+1=(?+.-4+J9/*,(

•  "3M93,G3($(*A($N(N31?.(1,?($V(V1+=.(((

!

D={xi | i=1…n}, P(D | ! ) = !iP(xi | ! )

a1b+090(8+:3=+L**?(#./01/*,(•  +%,%'#cJ.34P3?(.3-(D(*A($N(N31?.(1,?($V(V1+=.(((•  !&-.,)"$/$'(\+,*0+1=(?+.-4+J9/*,((•  0"%12/23'#F,?+,H(" +.(1,(*@/0+d1/*,(@4*J=30(

– 2L1-T.(-L3(*JI3G/P3(A9,G/*,U(

•  405'#!L**.3("(-*(01b+0+d3(@4*J1J+=+->(*A($%

E*94(F4.-(@14103-34(=314,+,H(1=H*4+-L0(

•  "3-(?34+P1/P3(-*(d34*Q(1,?(.*=P3e(

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d

dθ[ln θαH (1− θ)αT ] =

d

dθ[αH ln θ + αT ln(1− θ)]

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d

dθln θαH (1− θ)αT

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0

1

\9-Q(L*S(01,>(R+@.(?*(O(,33?U(

•  \+==+*,1+43(.1>.'(O(R+@@3?(Z(L31?.(1,?(5(-1+=.f(•  E*9(.1>'("(Y(Z[$Q(O(G1,(@4*P3(+-e(•  N3(.1>.'(2L1-(+A(O(R+@@3?(Z6(L31?.(1,?(56(-1+=.U(•  E*9(.1>'("103(1,.S34Q(O(G1,(@4*P3(+-e(

•  !"#$%&$'#()%,6$#7"8"1*#•  E*9(.1>'(g00](VL3(0*43(-L3(0344+34UUU(•  N3(.1>.'(O.(-L+.(SL>(O(10(@1>+,H(>*9(-L3(J+H(J9G:.UUU(

N

P

rob.

of M

ista

ke

Exponential Decay!

K(J*9,?((WA4*0(N*3h?+,HT.(+,3M91=+->X(

•  `*4(N(=!H+!TQ(1,?((•  83-("

*(J3(-L3(-493(@14103-34Q(A*4(1,>(%>0'(

)K!(8314,+,H(•  )K!'()4*J1J=>(K@@4*b+01-3(!*443G-(•  \+==+*,1+43(.1>.'(O(S1,-(-*(:,*S(-L3(-L90J-1G:("Q(S+-L+,(%(Y(6f7Q(S+-L(@4*J1J+=+->(1-(=31.-(7_&(Y(6fi$f((

•  N*S(01,>([email protected](c4Q(L*S(J+H(?*(O(.3-(N U(

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0

δ ≥ 2e−2N�2 ≥ P (mistake)

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

1

Interesting! Lets look at some numbers! •  % = 0.1, &=0.05

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

N ≥ ln(2/0.05)2× 0.12

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

N ≥ ln(2/0.05)2× 0.12 ≈ 3.8

0.02= 190

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

N ≥ ln(2/0.05)2× 0.12 ≈ 3.8

0.02= 190

1

2L1-(+A(O(L1P3(@4+*4(J3=+3A.U((•  \+==+*,1+43(.1>.'(21+-Q(O(:,*S(-L1-(-L3(-L90J-1G:(+.(jG=*.3k(-*($6_$6f(2L1-(G1,(>*9(?*(A*4(03(,*SU(

•  9.:#$%&'#;#<%2#="%12#/,#,)"#>%&"$/%2#?%&@#•  l1-L34(-L1,(3./01/,H(1(.+,H=3("Q(S3(*J-1+,(1(?+.-4+J9/*,(*P34(@*..+J=3(P1=93.(*A("#

In the beginning After observations Observe flips

e.g.: {tails, tails}

\1>3.+1,(8314,+,H(•  g.3(\1>3.(49=3'(

•  Or equivalently: •  Also, for uniform priors:

Prior

Normalization

Data Likelihood

Posterior

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

N ≥ ln(2/0.05)2× 0.12 ≈ 3.8

0.02= 190

P (θ) ∝ 1

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

N ≥ ln(2/0.05)2× 0.12 ≈ 3.8

0.02= 190

P (θ) ∝ 1

P (θ | D) ∝ P (D | θ)

1

! reduces to MLE objective

\1>3.+1,(8314,+,H(A*4(VL90J-1G:.(

8+:3=+L**?(A9,G/*,(+.(\+,*0+1='(

•  2L1-(1J*9-(@4+*4U(–  [email protected],-(3b@34-(:,*S=3?H3(–  "+0@=3(@*.-34+*4(A*40(

•  !*,I9H1-3(@4+*4.'(–  !=*.3?_A*40([email protected],-1/*,(*A(@*.-34+*4(–  A.1#>/2.B/%=C#<.2D:3%,"#-1/.1#/$#>",%#E/$,1/7:F.2#

\3-1(@4+*4(?+.-4+J9/*,(^()W"X(

•  8+:3=+L**?(A9,G/*,'(•  )*.-34+*4'(

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

N ≥ ln(2/0.05)2× 0.12 ≈ 3.8

0.02= 190

P (θ) ∝ 1

P (θ | D) ∝ P (D | θ)

P (θ | D) ∝ θαH (1− θ)αT θβH−1(1− θ)βT−1

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

N ≥ ln(2/0.05)2× 0.12 ≈ 3.8

0.02= 190

P (θ) ∝ 1

P (θ | D) ∝ P (D | θ)

P (θ | D) ∝ θαH (1− θ)αT θβH−1(1− θ)βT−1

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

N ≥ ln(2/0.05)2× 0.12 ≈ 3.8

0.02= 190

P (θ) ∝ 1

P (θ | D) ∝ P (D | θ)

P (θ | D) ∝ θαH (1− θ)αT θβH−1(1− θ)βT−1 = θαH+βH−1(1− θ)αT+βt+1

1

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

N ≥ ln(2/0.05)2× 0.12 ≈ 3.8

0.02= 190

P (θ) ∝ 1

P (θ | D) ∝ P (D | θ)

P (θ | D) ∝ θαH (1−θ)αT θβH−1(1−θ)βT−1 = θαH+βH−1(1−θ)αT+βt+1 = Beta(αH+βH , αT+βT )

1

)*.-34+*4(?+.-4+J9/*,(•  )4+*4'(•  C1-1'($N(L31?.(1,?($V(-1+=.(

•  )*.-34+*4(?+.-4+J9/*,'((

\1>3.+1,()*.-34+*4(O,A343,G3(•  )*.-34+*4(?+.-4+J9/*,'((

•  \1>3.+1,(+,A343,G3'(– m*(=*,H34(.+,H=3(@14103-34(–  `*4(1,>(.@3G+FG(f,%-L3(A9,G/*,(*A(+,-343.-(–  !*0@9-3(-L3(3b@3G-3?(P1=93(*A(f (

(–  O,-3H41=(+.(*n3,(L14?(-*(G*0@9-3(

aK)'(a1b+090(1(@*.-34+*4+(1@@4*b+01/*,(

((•  K.(0*43(?1-1(+.(*J.34P3?Q(\3-1(+.(0*43(G34-1+,(•  MAP: use most likely parameter to approximate

the expectation

aK)(A*4(\3-1(?+.-4+J9/*,(

(•  aK)'(9.3(0*.-(=+:3=>(@14103-34'(

Brief Article

The Author

January 11, 2012

θ̂ = argmaxθ

lnP (D | θ)

ln θαH

d

dθlnP (D | θ) =

d


d


= αH

d

dθln θ + αT

d

dθln(1− θ) =

αH

θ− αT

1− θ= 0


ln δ ≥ ln 2− 2N�2

N ≥ ln(2/δ)

2�2

N ≥ ln(2/0.05)2× 0.12 ≈ 3.8

0.02= 190

P (θ) ∝ 1

P (θ | D) ∝ P (D | θ)

P (θ | D) ∝ θαH (1−θ)αT θβH−1(1−θ)βT−1 = θαH+βH−1(1−θ)αT+βt+1 = Beta(αH+βH , αT+βT )

αH + βH − 1αH + βH + αT + βT − 2

1•  Beta prior equivalent to extra thumbtack flips •  As N " 1, prior is “forgotten” •  But, for small sample size, prior is important!

2L1-(1J*9-(G*,/,9*9.(P14+1J=3.U(

•  \+==+*,1+43(.1>.'(OA(O(10(031.94+,H(1(G*,/,9*9.(P14+1J=3Q(SL1-(G1,(>*9(?*(A*4(03U(

•  9.:#$%&'#0",#B"#,"==#&.:#%7.:,#G%:$$/%2$@#

"*03(@4*@34/3.(*A(B19..+1,.(•  Ko,3(-41,.A*401/*,(W09=/@=>+,H(J>(.G1=14(1,?(1??+,H(1(G*,.-1,-X(143(B19..+1,(–  p(q(&WµQ'5X(–  E(Y(1p(r(J(!(E(q(&W1µrJQ15'5X(

•  "90(*A(B19..+1,.(+.(B19..+1,(–  p(q(&WµpQ'5

pX(–  E(q(&WµEQ'5

EX(–  ;(Y(prE( !(;(q(&WµprµEQ('5

pr'5EX(

•  #1.>(-*(?+h343,/1-3Q(1.(S3(S+==(.33(.**,e(

8314,+,H(1(B19..+1,(•  !*==3G-(1(J9,GL(*A(?1-1(

– N*@3A9==>Q(+f+f?f(.10@=3.(– 3fHfQ(3b10(.G*43.(

•  8314,(@14103-34.(– a31,'(µ – s14+1,G3'("

xi i =

5H%B#I<.1"#

6( t$(

7( i$(

5( 766(

Z( 75(

]( ](ii( ti(

a8#(A*4(B19..+1,'(•  )4*Jf(*A(+f+f?f(.10@=3.($Yub7Q]Qbmv'(

•  Log-likelihood of data:

µMLE , σMLE = argmaxµ,σ

P (D | µ,σ)

2

E*94(.3G*,?(=314,+,H(1=H*4+-L0'(a8#(A*4(031,(*A(1(B19..+1,(

•  2L1-T.(a8#(A*4(031,U(


P (D | µ,σ)

= −N�

i=1

(xi − µ)

σ2= 0

2


P (D | µ,σ)

= −N�

i=1

(xi − µ)

σ2= 0

2


P (D | µ,σ)

= −N�

i=1

(xi − µ)

σ2= 0

= −N�

i=1

xi +Nµ = 0

= −N

σ+

N�

i=1

(xi − µ)2

σ3= 0

2

a8#(A*4(P14+1,G3(•  KH1+,Q(.3-(?34+P1/P3(-*(d34*'(


P (D | µ,σ)

= −N�

i=1

(xi − µ)

σ2= 0

2


P (D | µ,σ)

= −N�

i=1

(xi − µ)

σ2= 0

= −N

σ+

N�

i=1

(xi − µ)2

σ3= 0

2

8314,+,H(B19..+1,(@14103-34.(•  a8#'(

((•  \V2f(a8#(A*4(-L3(P14+1,G3(*A(1(B19..+1,(+.(7/%$"E#

– #b@3G-3?(43.9=-(*A(3./01/*,(+.(2.,#-493(@14103-34e((– g,J+1.3?(P14+1,G3(3./01-*4'(

\1>3.+1,(=314,+,H(*A(B19..+1,(@14103-34.(

•  !*,I9H1-3(@4+*4.(– a31,'(B19..+1,(@4+*4(– s14+1,G3'(2+.L14-(C+.-4+J9/*,(

•  )4+*4(A*4(031,'(

01/*,( 2+,-34(5675(...Conditional Probabilities • A simple relation between joint and conditional...

Documents

Transcript of 01/*,( 2+,-34(5675(...Conditional Probabilities • A simple relation between joint and conditional...