01/*,( 2+,-34(5675(...Conditional Probabilities • A simple relation between joint and conditional...
Transcript of 01/*,( 2+,-34(5675(...Conditional Probabilities • A simple relation between joint and conditional...
!"#$%&'()*+,-(#./01/*,(2+,-34(5675(
89:3(;3<=30*>34(((
"=+?3.(1?1@-3?(A4*0(!14=*.(B93.-4+,(1,?(C1,(D=3+,(
E*94(F4.-(G*,.9=/,H(I*J(• K(J+==+*,1+43(A4*0(-L3(.9J94J.(*A("31<=3(1.:.(>*9(1(M93./*,'(– N3(.1>.'(O(L1P3(-L90J-1G:Q(+A(O(R+@(+-Q(SL1-T.(-L3(@4*J1J+=+->(+-(S+==(A1==(S+-L(-L3(,1+=(9@U(
– E*9(.1>'()=31.3(R+@(+-(1(A3S(/03.'(
– E*9(.1>'(VL3(@4*J1J+=+->(+.'(• )WNX(Y(Z[$(
– !"#$%&$'#()&***#– E*9(.1>'(\3G19.3](
Random Variables
• A random variable is some aspect of the world about which we (may) have uncertainty – R = Is it raining? – D = How long will it take to drive to work? – L = Where am I?
• We denote random variables with capital letters
• Random variables have domains – R in {true, false} (sometimes write as {+r, ¬r}) – D in [0, !) – L in possible locations, maybe {(0,0), (0,1), !}
Probability Distributions • Discrete random variables have distributions
• A discrete distribution is a TABLE of probabilities of values • A probability (lower case value) is a single number
• Must have:
T P warm 0.5 cold 0.5
W P sun 0.6 rain 0.1 fog 0.3
meteor 0.0
Joint Distributions • A joint distribution over a set of random variables:
specifies a real number for each assignment (or outcome):
– Size of distribution if n variables with domain sizes d?
– Must obey:
• For all but the smallest distributions, impractical to write out
T W P hot sun 0.4 hot rain 0.1 cold sun 0.2 cold rain 0.3
Marginal Distributions • Marginal distributions are sub-tables which eliminate variables • Marginalization (summing out): Combine collapsed rows by adding
T W P hot sun 0.4 hot rain 0.1 cold sun 0.2 cold rain 0.3
T P hot 0.5 cold 0.5
W P sun 0.6 rain 0.4
Conditional Probabilities • A simple relation between joint and conditional probabilities
– In fact, this is taken as the definition of a conditional probability
T W P hot sun 0.4 hot rain 0.1 cold sun 0.2 cold rain 0.3
Conditional Distributions • Conditional distributions are probability distributions over
some variables given fixed values of others
T W P hot sun 0.4 hot rain 0.1 cold sun 0.2 cold rain 0.3
W P sun 0.8 rain 0.2
W P sun 0.4 rain 0.6
Conditional Distributions Joint Distribution
The Product Rule • Sometimes have conditional distributions but want the joint
• Example:
R P sun 0.8 rain 0.2
D W P wet sun 0.1 dry sun 0.9 wet rain 0.7 dry rain 0.3
D W P wet sun 0.08 dry sun 0.72 wet rain 0.14 dry rain 0.06
Bayes’ Rule • Two ways to factor a joint distribution over two variables:
• Dividing, we get:
• Why is this at all helpful? – Lets us build one conditional from its reverse – Often one conditional is tricky but the other one is simple – Foundation of many practical systems (e.g. ASR, MT)
• In the running for most important ML equation!
That’s my rule!
VL90J-1G:(^(\+,*0+1=(C+.-4+J9/*,(
• )WN31?.X(Y("Q(()WV1+=.X(Y(7_"#
(• `=+@.(143(!"!"#"'((
– O,?3@3,?3,-(3P3,-.(– O?3,/G1==>(?+.-4+J9-3?(1GG*4?+,H(-*(\+,*0+1=(?+.-4+J9/*,(
• "3M93,G3($(*A($N(N31?.(1,?($V(V1+=.(((
!
D={xi | i=1…n}, P(D | ! ) = !iP(xi | ! )
a1b+090(8+:3=+L**?(#./01/*,(• +%,%'#cJ.34P3?(.3-(D(*A($N(N31?.(1,?($V(V1+=.(((• !&-.,)"$/$'(\+,*0+1=(?+.-4+J9/*,((• 0"%12/23'#F,?+,H(" +.(1,(*@/0+d1/*,(@4*J=30(
– 2L1-T.(-L3(*JI3G/P3(A9,G/*,U(
• 405'#!L**.3("(-*(01b+0+d3(@4*J1J+=+->(*A($%
E*94(F4.-(@14103-34(=314,+,H(1=H*4+-L0(
• "3-(?34+P1/P3(-*(d34*Q(1,?(.*=P3e(
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθln θαH (1− θ)αT
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
1
\9-Q(L*S(01,>(R+@.(?*(O(,33?U(
• \+==+*,1+43(.1>.'(O(R+@@3?(Z(L31?.(1,?(5(-1+=.f(• E*9(.1>'("(Y(Z[$Q(O(G1,(@4*P3(+-e(• N3(.1>.'(2L1-(+A(O(R+@@3?(Z6(L31?.(1,?(56(-1+=.U(• E*9(.1>'("103(1,.S34Q(O(G1,(@4*P3(+-e(
• !"#$%&$'#()%,6$#7"8"1*#• E*9(.1>'(g00](VL3(0*43(-L3(0344+34UUU(• N3(.1>.'(O.(-L+.(SL>(O(10(@1>+,H(>*9(-L3(J+H(J9G:.UUU(
N
P
rob.
of M
ista
ke
Exponential Decay!
K(J*9,?((WA4*0(N*3h?+,HT.(+,3M91=+->X(
• `*4(N(=!H+!TQ(1,?((• 83-("
*(J3(-L3(-493(@14103-34Q(A*4(1,>(%>0'(
)K!(8314,+,H(• )K!'()4*J1J=>(K@@4*b+01-3(!*443G-(• \+==+*,1+43(.1>.'(O(S1,-(-*(:,*S(-L3(-L90J-1G:("Q(S+-L+,(%(Y(6f7Q(S+-L(@4*J1J+=+->(1-(=31.-(7_&(Y(6fi$f((
• N*S(01,>([email protected](c4Q(L*S(J+H(?*(O(.3-(N U(
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
1
Interesting! Lets look at some numbers! • % = 0.1, &=0.05
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
N ≥ ln(2/0.05)2× 0.12
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
N ≥ ln(2/0.05)2× 0.12 ≈ 3.8
0.02= 190
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
N ≥ ln(2/0.05)2× 0.12 ≈ 3.8
0.02= 190
1
2L1-(+A(O(L1P3(@4+*4(J3=+3A.U((• \+==+*,1+43(.1>.'(21+-Q(O(:,*S(-L1-(-L3(-L90J-1G:(+.(jG=*.3k(-*($6_$6f(2L1-(G1,(>*9(?*(A*4(03(,*SU(
• 9.:#$%&'#;#<%2#="%12#/,#,)"#>%&"$/%2#?%&@#• l1-L34(-L1,(3./01/,H(1(.+,H=3("Q(S3(*J-1+,(1(?+.-4+J9/*,(*P34(@*..+J=3(P1=93.(*A("#
In the beginning After observations Observe flips
e.g.: {tails, tails}
\1>3.+1,(8314,+,H(• g.3(\1>3.(49=3'(
• Or equivalently: • Also, for uniform priors:
Prior
Normalization
Data Likelihood
Posterior
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
N ≥ ln(2/0.05)2× 0.12 ≈ 3.8
0.02= 190
P (θ) ∝ 1
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
N ≥ ln(2/0.05)2× 0.12 ≈ 3.8
0.02= 190
P (θ) ∝ 1
P (θ | D) ∝ P (D | θ)
1
! reduces to MLE objective
\1>3.+1,(8314,+,H(A*4(VL90J-1G:.(
8+:3=+L**?(A9,G/*,(+.(\+,*0+1='(
• 2L1-(1J*9-(@4+*4U(– [email protected],-(3b@34-(:,*S=3?H3(– "+0@=3(@*.-34+*4(A*40(
• !*,I9H1-3(@4+*4.'(– !=*.3?_A*40([email protected],-1/*,(*A(@*.-34+*4(– A.1#>/2.B/%=C#<.2D:3%,"#-1/.1#/$#>",%#E/$,1/7:F.2#
\3-1(@4+*4(?+.-4+J9/*,(^()W"X(
• 8+:3=+L**?(A9,G/*,'(• )*.-34+*4'(
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
N ≥ ln(2/0.05)2× 0.12 ≈ 3.8
0.02= 190
P (θ) ∝ 1
P (θ | D) ∝ P (D | θ)
P (θ | D) ∝ θαH (1− θ)αT θβH−1(1− θ)βT−1
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
N ≥ ln(2/0.05)2× 0.12 ≈ 3.8
0.02= 190
P (θ) ∝ 1
P (θ | D) ∝ P (D | θ)
P (θ | D) ∝ θαH (1− θ)αT θβH−1(1− θ)βT−1
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
N ≥ ln(2/0.05)2× 0.12 ≈ 3.8
0.02= 190
P (θ) ∝ 1
P (θ | D) ∝ P (D | θ)
P (θ | D) ∝ θαH (1− θ)αT θβH−1(1− θ)βT−1 = θαH+βH−1(1− θ)αT+βt+1
1
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
N ≥ ln(2/0.05)2× 0.12 ≈ 3.8
0.02= 190
P (θ) ∝ 1
P (θ | D) ∝ P (D | θ)
P (θ | D) ∝ θαH (1−θ)αT θβH−1(1−θ)βT−1 = θαH+βH−1(1−θ)αT+βt+1 = Beta(αH+βH , αT+βT )
1
)*.-34+*4(?+.-4+J9/*,(• )4+*4'(• C1-1'($N(L31?.(1,?($V(-1+=.(
• )*.-34+*4(?+.-4+J9/*,'((
\1>3.+1,()*.-34+*4(O,A343,G3(• )*.-34+*4(?+.-4+J9/*,'((
• \1>3.+1,(+,A343,G3'(– m*(=*,H34(.+,H=3(@14103-34(– `*4(1,>(.@3G+FG(f,%-L3(A9,G/*,(*A(+,-343.-(– !*0@9-3(-L3(3b@3G-3?(P1=93(*A(f (
(– O,-3H41=(+.(*n3,(L14?(-*(G*0@9-3(
aK)'(a1b+090(1(@*.-34+*4+(1@@4*b+01/*,(
((• K.(0*43(?1-1(+.(*J.34P3?Q(\3-1(+.(0*43(G34-1+,(• MAP: use most likely parameter to approximate
the expectation
aK)(A*4(\3-1(?+.-4+J9/*,(
(• aK)'(9.3(0*.-(=+:3=>(@14103-34'(
Brief Article
The Author
January 11, 2012
θ̂ = argmaxθ
lnP (D | θ)
ln θαH
d
dθlnP (D | θ) =
d
dθ[ln θαH (1− θ)αT ] =
d
dθ[αH ln θ + αT ln(1− θ)]
= αH
d
dθln θ + αT
d
dθln(1− θ) =
αH
θ− αT
1− θ= 0
δ ≥ 2e−2N�2 ≥ P (mistake)
ln δ ≥ ln 2− 2N�2
N ≥ ln(2/δ)
2�2
N ≥ ln(2/0.05)2× 0.12 ≈ 3.8
0.02= 190
P (θ) ∝ 1
P (θ | D) ∝ P (D | θ)
P (θ | D) ∝ θαH (1−θ)αT θβH−1(1−θ)βT−1 = θαH+βH−1(1−θ)αT+βt+1 = Beta(αH+βH , αT+βT )
αH + βH − 1αH + βH + αT + βT − 2
1• Beta prior equivalent to extra thumbtack flips • As N " 1, prior is “forgotten” • But, for small sample size, prior is important!
2L1-(1J*9-(G*,/,9*9.(P14+1J=3.U(
• \+==+*,1+43(.1>.'(OA(O(10(031.94+,H(1(G*,/,9*9.(P14+1J=3Q(SL1-(G1,(>*9(?*(A*4(03U(
• 9.:#$%&'#0",#B"#,"==#&.:#%7.:,#G%:$$/%2$@#
"*03(@4*@34/3.(*A(B19..+1,.(• Ko,3(-41,.A*401/*,(W09=/@=>+,H(J>(.G1=14(1,?(1??+,H(1(G*,.-1,-X(143(B19..+1,(– p(q(&WµQ'5X(– E(Y(1p(r(J(!(E(q(&W1µrJQ15'5X(
• "90(*A(B19..+1,.(+.(B19..+1,(– p(q(&WµpQ'5
pX(– E(q(&WµEQ'5
EX(– ;(Y(prE( !(;(q(&WµprµEQ('5
pr'5EX(
• #1.>(-*(?+h343,/1-3Q(1.(S3(S+==(.33(.**,e(
8314,+,H(1(B19..+1,(• !*==3G-(1(J9,GL(*A(?1-1(
– N*@3A9==>Q(+f+f?f(.10@=3.(– 3fHfQ(3b10(.G*43.(
• 8314,(@14103-34.(– a31,'(µ – s14+1,G3'("
xi i =
5H%B#I<.1"#
6( t$(
7( i$(
5( 766(
Z( 75(
]( ](ii( ti(
a8#(A*4(B19..+1,'(• )4*Jf(*A(+f+f?f(.10@=3.($Yub7Q]Qbmv'(
• Log-likelihood of data:
µMLE , σMLE = argmaxµ,σ
P (D | µ,σ)
2
E*94(.3G*,?(=314,+,H(1=H*4+-L0'(a8#(A*4(031,(*A(1(B19..+1,(
• 2L1-T.(a8#(A*4(031,U(
µMLE , σMLE = argmaxµ,σ
P (D | µ,σ)
= −N�
i=1
(xi − µ)
σ2= 0
2
µMLE , σMLE = argmaxµ,σ
P (D | µ,σ)
= −N�
i=1
(xi − µ)
σ2= 0
2
µMLE , σMLE = argmaxµ,σ
P (D | µ,σ)
= −N�
i=1
(xi − µ)
σ2= 0
= −N�
i=1
xi +Nµ = 0
= −N
σ+
N�
i=1
(xi − µ)2
σ3= 0
2
a8#(A*4(P14+1,G3(• KH1+,Q(.3-(?34+P1/P3(-*(d34*'(
µMLE , σMLE = argmaxµ,σ
P (D | µ,σ)
= −N�
i=1
(xi − µ)
σ2= 0
2
µMLE , σMLE = argmaxµ,σ
P (D | µ,σ)
= −N�
i=1
(xi − µ)
σ2= 0
= −N
σ+
N�
i=1
(xi − µ)2
σ3= 0
2
8314,+,H(B19..+1,(@14103-34.(• a8#'(
((• \V2f(a8#(A*4(-L3(P14+1,G3(*A(1(B19..+1,(+.(7/%$"E#
– #b@3G-3?(43.9=-(*A(3./01/*,(+.(2.,#-493(@14103-34e((– g,J+1.3?(P14+1,G3(3./01-*4'(
\1>3.+1,(=314,+,H(*A(B19..+1,(@14103-34.(
• !*,I9H1-3(@4+*4.(– a31,'(B19..+1,(@4+*4(– s14+1,G3'(2+.L14-(C+.-4+J9/*,(
• )4+*4(A*4(031,'(