CS B351: LEARNING PROBABILISTIC MODELS
MOTIVATION
Past lectures have studied how to infer characteristics of a distribution, given a fully-specified Bayes net
Next few lectures: where does the Bayes net come from?
[Diagram: a small Bayes net with Strength and Opponent Strength as parents of Win?]
[Diagram: a richer Bayes net over Win?, Offense strength, Opp. Off. Strength, Defense strength, Opp. Def. Strength, Pass yds, Rush yds, Rush yds allowed, and Score allowed]
[Diagram: the same network, extended with additional nodes: Strength of schedule, At Home?, Injuries?, and Opp. injuries?]
AGENDA
Learning probability distributions from example data
Influence of structure on performance
Maximum likelihood estimation (MLE)
Bayesian estimation
PROBABILISTIC ESTIMATION PROBLEM
Our setting: given a set of examples drawn from the target distribution, where each example is complete (fully observable)
Goal: produce some representation of a belief state so that we can perform inference and make predictions
DENSITY ESTIMATION
Given a dataset D = {d[1], …, d[M]} drawn from an underlying distribution P*
Find a distribution that matches P* as "closely" as possible
High-level issues:
Usually there is not enough data to get an accurate picture of P*, which forces us to approximate
Even if we did have P*, how do we define "closeness" (both theoretically and in practice)?
How do we maximize "closeness"?
WHAT CLASS OF PROBABILITY MODELS?
For small discrete distributions, just use a tabular representation; very efficient learning techniques exist
For large discrete distributions or continuous ones, the choice of probability model is crucial. Increasing complexity =>
Can represent complex distributions more accurately
Needs more data to learn well (risk of overfitting)
Is more expensive to learn and to perform inference with
TWO LEARNING PROBLEMS
Parameter learning: what entries should be put into the model's probability tables?
Structure learning:
Which variables should be represented / transformed for inclusion in the model?
What direct / indirect relationships between variables should be modeled?
This is the more "high-level" problem
Once a structure is chosen, a set of (unestimated) parameters emerges; these then need to be estimated using parameter learning
LEARNING COIN FLIPS
Cherry and lime candies are in an opaque bag
Observe that c out of N draws are cherries (data)
LEARNING COIN FLIPS
Observe that c out of N draws are cherries (data)
Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag (or it might not, depending on the draw!)
"Intuitive" parameter estimate: the empirical distribution, P(cherry) = c/N (this will be justified more thoroughly later)
STRUCTURE LEARNING EXAMPLE: HISTOGRAM BUCKET SIZES
Histograms are used to estimate distributions of continuous values, or of large numbers of discrete values… but how fine should the buckets be?
[Figure: four histograms of the same dataset over the range 0–200, binned at four different bucket sizes from fine to coarse]
STRUCTURE LEARNING: INDEPENDENCE RELATIONSHIPS
Compare the full table P(A,B,C,D) vs. the factored model P(A)P(B)P(C)P(D)
Case 1: 15 free parameters (16 entries minus the sum-to-1 constraint)
One entry per joint assignment: P(A,B,C,D) = p1, P(A,B,C,¬D) = p2, …, and the 16th entry is 1 − p1 − … − p15
Case 2: 4 free parameters
P(A) = p1, P(¬A) = 1 − p1, …, P(D) = p4, P(¬D) = 1 − p4
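To make the counting concrete, here is a minimal Python sketch (not from the slides) that computes both free-parameter counts for n binary variables:

```python
# A minimal sketch: free-parameter counts for n binary variables.
def full_joint_params(n):
    """Full table over n binary variables: 2^n entries minus the sum-to-1 constraint."""
    return 2**n - 1

def independent_params(n):
    """Fully factored model P(X1)...P(Xn): one free parameter per variable."""
    return n

for n in [4, 10, 20]:
    print(n, full_joint_params(n), independent_params(n))
```

For n = 4 this prints 15 vs. 4, matching the two cases above; the gap grows exponentially with n.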
STRUCTURE LEARNING: INDEPENDENCE RELATIONSHIPS
Compare the full table P(A,B,C,D) vs. the factored model P(A)P(B)P(C)P(D)
P(A,B,C,D) would be able to fit ALL relationships in the data
P(A)P(B)P(C)P(D) inherently does not have the capability to accurately model correlations, such as A ≈ B
This leads to biased estimates: it will overestimate or underestimate the true probabilities
[Figure: side-by-side bar charts of the original joint distribution P(X,Y) and the distribution learned using the independence assumption P(X)P(Y), for X, Y ∈ {1, 2, 3}]
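The following minimal Python sketch shows the same effect; the 3x3 target distribution is hypothetical (not the one plotted on the slide). A full joint table can recover a correlated distribution from samples, while a product of marginals is systematically off:

```python
# A minimal sketch: fit a full joint table vs. a product-of-marginals model
# to the same samples and compare against the (hypothetical) truth.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated target distribution over X, Y in {0, 1, 2}.
P_true = np.array([[0.25, 0.02, 0.03],
                   [0.02, 0.30, 0.03],
                   [0.03, 0.03, 0.29]])
samples = rng.choice(9, size=1000, p=P_true.ravel())
xs, ys = samples // 3, samples % 3

# Full joint: empirical 3x3 table (8 free parameters).
joint = np.zeros((3, 3))
np.add.at(joint, (xs, ys), 1)
joint /= joint.sum()

# Independence assumption: product of empirical marginals (4 free parameters).
px = np.bincount(xs, minlength=3) / len(xs)
py = np.bincount(ys, minlength=3) / len(ys)
indep = np.outer(px, py)

print("max error, full joint:  ", np.abs(joint - P_true).max())
print("max error, independent: ", np.abs(indep - P_true).max())  # systematically biased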
STRUCTURE LEARNING: EXPRESSIVE POWER
Making more independence assumptions always makes a probabilistic model less expressive
If the independence relationships assumed by structure model A are a superset of those in structure B, then B can express any probability distribution that A can
[Diagram: three candidate structures over X, Y, and Z, ranging from more independence assumptions (fewer edges) to fewer (more edges)]
[Diagram: two alternative network structures over a class variable C and features F1, F2, …, Fk. Which one should we choose?]
ARCS DO NOT NECESSARILY ENCODE CAUSALITY!
[Diagram: two three-node chains, A → B → C and C → B → A]
Two BNs that can encode the same joint probability distribution
READING OFF INDEPENDENCE RELATIONSHIPS
Given B, does the value of A affect the probability of C? Is P(C|B,A) = P(C|B)?
No, it does not: C's parent (B) is given, so C is independent of its non-descendants (A), and hence P(C|B,A) = P(C|B)
Independence is symmetric: C ⊥ A | B => A ⊥ C | B
[Diagram: the chain A → B → C]
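A minimal numeric check of this claim, using hypothetical CPT values for the chain A → B → C:

```python
# A minimal sketch (hypothetical CPT values): verify numerically that in the
# chain A -> B -> C, P(C|A,B) = P(C|B), i.e. C is independent of A given B.
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: 0.9, False: 0.2}   # P(B=true | A)
P_C_given_B = {True: 0.8, False: 0.1}   # P(C=true | B)

def joint(a, b, c):
    pa = P_A[a]
    pb = P_B_given_A[a] if b else 1 - P_B_given_A[a]
    pc = P_C_given_B[b] if c else 1 - P_C_given_B[b]
    return pa * pb * pc

for a in (True, False):
    for b in (True, False):
        # P(C=true | A=a, B=b), computed from the joint...
        p_cab = joint(a, b, True) / (joint(a, b, True) + joint(a, b, False))
        # ...always equals the CPT entry P(C=true | B=b), regardless of a.
        print(a, b, round(p_cab, 6), P_C_given_B[b])
```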
LEARNING IN THE FACE OF NOISY DATA
Example: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT
[Diagram: Model 1 treats X and Y as unconnected nodes; Model 2 adds an arc X → Y]
Parameters estimated via the empirical distribution ("intuitive fit"):
Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
Model 2's errors are likely to be larger!
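A minimal sketch recovering those estimates from the raw dataset (plain Python, no external dependencies):

```python
# A minimal sketch: recover the slide's estimates from the 20-flip dataset.
from collections import Counter

data = [("H","H")]*3 + [("H","T")]*6 + [("T","H")]*5 + [("T","T")]*6
counts = Counter(data)
N = len(data)

# Model 1: X and Y independent.
p_x = sum(v for (x, _), v in counts.items() if x == "H") / N   # 9/20
p_y = sum(v for (_, y), v in counts.items() if y == "H") / N   # 8/20

# Model 2: X -> Y, so P(Y|X) is estimated separately for each value of X.
n_xh = sum(v for (x, _), v in counts.items() if x == "H")      # 9
p_y_given_xh = counts[("H","H")] / n_xh                        # 3/9
p_y_given_xt = counts[("T","H")] / (N - n_xh)                  # 5/11

print(p_x, p_y, p_y_given_xh, p_y_given_xt)
```

Note how Model 2 splits the 20 examples into two groups of 9 and 11; each conditional parameter is estimated from less data, which is the data fragmentation the next slides discuss.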
STRUCTURE LEARNING: FIT VS COMPLEXITY
Must trade off fit to the data vs. complexity of the model
Complex models:
Have more parameters to learn
Are more expressive
Suffer more data fragmentation, i.e., greater sensitivity to noise
Typical approaches explore multiple structures while optimizing the trade-off between fit and complexity
This requires a way of measuring "complexity" (e.g., number of edges, number of parameters) and "fit"
FURTHER READING ON STRUCTURE LEARNING
Structure learning with statistical independence testing
Score-based methods (e.g., the Bayesian Information Criterion)
Bayesian methods with structure priors
Cross-validated model selection (more on this later)
STATISTICAL PARAMETER LEARNING
LEARNING COIN FLIPS
Observe that c out of N draws are cherries (data)
Let the unknown fraction of cherries be q (hypothesis)
The probability of drawing a cherry is q
Assumption: draws are independent and identically distributed (i.i.d.)
LEARNING COIN FLIPS
The probability of drawing a cherry is q
Assumption: draws are independent and identically distributed (i.i.d.)
The probability of drawing 2 cherries is q·q = q^2
The probability of drawing 2 limes is (1−q)^2
The probability of drawing 1 cherry and 1 lime is q·(1−q)
LIKELIHOOD FUNCTION
Likelihood of data d = {d1, …, dN} given q:
P(d|q) = ∏j P(dj|q) = q^c (1−q)^(N−c)
The first equality uses the i.i.d. assumption; the second gathers the c cherry terms together, then the N−c lime terms
MAXIMUM LIKELIHOOD
Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1−q)^(N−c)
[Plots: the likelihood function P(data|q) over q ∈ [0, 1] for datasets of 1/1, 2/2, 2/3, 2/4, 2/5, 10/20, and 50/100 cherry draws; the vertical scale shrinks rapidly as N grows, down to roughly 9×10⁻³¹ for 50/100]
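These curves are easy to reproduce. A minimal Python sketch (assuming numpy and matplotlib are available); unlike the slides' separate plots, each curve is rescaled to a common peak height so the shapes can be compared:

```python
# A minimal sketch of the likelihood curves P(d|q) = q^c (1-q)^(N-c).
import numpy as np
import matplotlib.pyplot as plt

q = np.linspace(0.0, 1.0, 201)
cases = [(1, 1), (2, 2), (2, 3), (2, 4), (2, 5), (10, 20), (50, 100)]

for c, N in cases:
    likelihood = q**c * (1 - q)**(N - c)
    likelihood /= likelihood.max()     # rescale: raw values shrink fast with N
    plt.plot(q, likelihood, label=f"{c}/{N} cherry")

plt.xlabel("q")
plt.ylabel("P(data | q), rescaled")
plt.legend()
plt.show()
```

The sharpening of the peak as N grows is exactly the effect the next slide points out.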
MAXIMUM LIKELIHOOD
The peaks of the likelihood function seem to hover around the observed fraction of cherries…
The sharpness of the peak indicates some notion of certainty…
[Plot: the likelihood curve for 50/100 cherry draws, sharply peaked at q = 0.5]
MAXIMUM LIKELIHOOD
P(d|q) is the likelihood function
The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)
MAXIMUM LIKELIHOOD
l(q) = log P(d|q) = log [ q^c (1−q)^(N−c) ]
= log q^c + log (1−q)^(N−c)
= c log q + (N−c) log (1−q)
Setting dl/dq(q) = 0 gives the maximum likelihood estimate
dl/dq(q) = c/q − (N−c)/(1−q)
At the MLE, c/q − (N−c)/(1−q) = 0
=> q = c/N
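For completeness, the algebra of that last step written out:

$$\frac{c}{q} - \frac{N-c}{1-q} = 0 \;\Rightarrow\; c(1-q) = (N-c)\,q \;\Rightarrow\; c = Nq \;\Rightarrow\; \hat{q}_{\mathrm{MLE}} = \frac{c}{N}$$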
![Page 46: CS B 351 L EARNING P ROBABILISTIC M ODELS. M OTIVATION Past lectures have studied how to infer characteristics of a distribution, given a fully- specified.](https://reader031.fdocuments.in/reader031/viewer/2022032605/56649e865503460f94b89433/html5/thumbnails/46.jpg)
OTHER MLE RESULTS
Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (a histogram)
Continuous Gaussian distributions:
Mean = average of the data
Standard deviation = standard deviation of the data
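A minimal Python sketch of both results (assuming numpy; note that the Gaussian MLE for the standard deviation divides by N, i.e. ddof=0, rather than the unbiased N−1):

```python
# A minimal sketch: MLE for a categorical distribution and a Gaussian.
import numpy as np

# Categorical: the MLE is just the normalized histogram of counts.
draws = np.array([0, 2, 1, 0, 0, 2, 1, 0])   # values in {0, 1, 2}
p_mle = np.bincount(draws, minlength=3) / len(draws)

# Gaussian: MLE mean is the sample average; the MLE standard deviation
# divides by N (ddof=0), not by N-1.
x = np.random.default_rng(0).normal(5.0, 2.0, size=1000)
mu_mle = x.mean()
sigma_mle = x.std(ddof=0)
print(p_mle, mu_mle, sigma_mle)
```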
![Page 47: CS B 351 L EARNING P ROBABILISTIC M ODELS. M OTIVATION Past lectures have studied how to infer characteristics of a distribution, given a fully- specified.](https://reader031.fdocuments.in/reader031/viewer/2022032605/56649e865503460f94b89433/html5/thumbnails/47.jpg)
AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
P(q|d) = (1/Z) P(d|q) P(q) is the posterior: the distribution over hypotheses given the data
P(d|q) is the likelihood
P(q) is the hypothesis prior
[Diagram: plate-style model with parameter q as the parent of the observations d[1], d[2], …, d[M]]
![Page 48: CS B 351 L EARNING P ROBABILISTIC M ODELS. M OTIVATION Past lectures have studied how to infer characteristics of a distribution, given a fully- specified.](https://reader031.fdocuments.in/reader031/viewer/2022032605/56649e865503460f94b89433/html5/thumbnails/48.jpg)
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
Assume P(q) is uniform; then P(q|d) = (1/Z) P(d|q) = (1/Z) q^c (1−q)^(N−c)
What is P(Y|d), where Y is the next draw?
[Diagram: q as the parent of the observations d[1], d[2], …, d[M] and of the next draw Y]
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
=> Z = c! (N−c)! / (N+1)!
=> P(Y|d) = (1/Z) · (c+1)! (N−c)! / (N+2)! = (c+1) / (N+2)
Can think of this as a "correction" of the empirical estimate using "virtual counts"
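Where those factorials come from: for integer exponents, the Beta integral is $\int_0^1 q^m (1-q)^n \, dq = \frac{m!\,n!}{(m+n+1)!}$, so

$$Z = \int_0^1 q^c (1-q)^{N-c}\,dq = \frac{c!\,(N-c)!}{(N+1)!}, \qquad
P(Y \mid d) = \frac{1}{Z}\int_0^1 q \cdot q^{c}(1-q)^{N-c}\,dq
= \frac{(N+1)!}{c!\,(N-c)!}\cdot\frac{(c+1)!\,(N-c)!}{(N+2)!}
= \frac{c+1}{N+2}$$

This is Laplace's rule of succession.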
NONUNIFORM PRIORS
P(q|d) ∝ P(d|q) P(q) = q^c (1−q)^(N−c) P(q)
The prior P(q) encodes, for each value of q, how strongly we believe in that hypothesis
[Plot: an example prior density P(q) over q ∈ [0, 1]]
BETA DISTRIBUTION
Beta_{a,b}(q) = γ q^(a−1) (1−q)^(b−1)
a, b are hyperparameters, > 0
γ is a normalization constant
a = b = 1 gives the uniform distribution
POSTERIOR WITH BETA PRIOR
Posterior: q^c (1−q)^(N−c) P(q) = γ q^(c+a−1) (1−q)^(N−c+b−1) = Beta_{a+c, b+N−c}(q)
Prediction = posterior mean: E[q] = (c+a) / (N+a+b)
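A minimal sketch of this Beta-Bernoulli update (assuming scipy; the prior hyperparameters and counts are made-up values):

```python
# A minimal sketch of the Beta-Bernoulli update, using scipy's Beta distribution.
from scipy.stats import beta

a, b = 2.0, 2.0          # hypothetical prior hyperparameters
c, N = 7, 10             # observed 7 cherries out of 10 draws

posterior = beta(a + c, b + N - c)      # Beta_{a+c, b+N-c}(q)
prediction = (c + a) / (N + a + b)      # posterior mean E[q]
print(posterior.mean(), prediction)     # both 9/14, about 0.643
```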
POSTERIOR WITH BETA PRIOR
What does this mean?
The prior specifies a "virtual count" of a−1 heads and b−1 tails
See heads, increment a; see tails, increment b
The effect of the prior diminishes with more data
CHOOSING A PRIOR
Part of the design process; must be chosen according to your intuition
Uninformed belief: a = b = 1; strong belief => a, b high
EXTENSIONS OF BETA PRIORS
Parameters of multi-valued (categorical) distributions, e.g. histograms: the Dirichlet prior
The mathematical derivation is more complex, but in practice it still takes the form of "virtual counts"
[Figure: four histogram estimates of the same data over the range 0–200, with virtual counts of 0, 1, 5, and 10]
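A minimal sketch of the categorical case (assuming numpy; the number of categories and the data are made up). The Dirichlet hyperparameters act exactly as virtual counts in the posterior-mean estimate:

```python
# A minimal sketch: Dirichlet "virtual counts" for a categorical distribution.
import numpy as np

K = 5                                   # hypothetical number of categories
alpha = np.full(K, 5.0)                 # prior: 5 virtual counts per value
counts = np.array([12, 0, 3, 7, 1])     # observed counts

# Posterior-mean estimate: observed counts plus virtual counts, normalized.
p_bayes = (counts + alpha) / (counts.sum() + alpha.sum())
p_mle = counts / counts.sum()           # for comparison: no smoothing
print(p_mle)     # note the hard zero for the unseen value
print(p_bayes)   # virtual counts smooth it away
```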
RECAP
Learning probabilistic models
Parameter vs. structure learning
Single-parameter learning via coin flips
Maximum likelihood
Bayesian learning with a Beta prior
MAXIMUM LIKELIHOOD FOR BN
For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values
[Diagram: Earthquake and Burglar as parents of Alarm]
Data: N = 1000 examples; E occurs 500 times, B occurs 200 times => P(E) = 0.5, P(B) = 0.2
Alarm counts: A|E,B: 19/20; A|¬E,B: 188/200; A|E,¬B: 170/500; A|¬E,¬B: 1/380

| E | B | P(A\|E,B) |
|---|---|-----------|
| T | T | 0.95 |
| F | T | 0.95 |
| T | F | 0.34 |
| F | F | 0.003 |
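A minimal sketch of this procedure (plain Python; the dataset below is hypothetical, chosen to reproduce two of the fractions above):

```python
# A minimal sketch: ML estimation of a CPT P(A | E, B) from complete data,
# by counting within each parent assignment and normalizing.
from collections import defaultdict

def fit_cpt(data):
    """data: iterable of (e, b, a) booleans, one row per example."""
    counts = defaultdict(lambda: [0, 0])   # (e, b) -> [# a=False, # a=True]
    for e, b, a in data:
        counts[(e, b)][a] += 1
    # ML entry = fraction of A=true among examples matching each parent value.
    return {pa: n[1] / (n[0] + n[1]) for pa, n in counts.items()}

# Hypothetical dataset: 19/20 alarms when E and B, 170/500 when E only.
example = ([(True, True, True)] * 19 + [(True, True, False)] * 1
           + [(True, False, True)] * 170 + [(True, False, False)] * 330
           + [(False, False, False)] * 480)
print(fit_cpt(example))
# {(True, True): 0.95, (True, False): 0.34, (False, False): 0.0}
```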
FITTING CPTS
Each ML entry P(xi | pa(Xi)) is given by examining the counts of (xi, pa(Xi)) in D and normalizing across the rows of the CPT
Note that for large k = |Pa(Xi)|, very few data points will share the values of pa(Xi): on the order of O(|D|/2^k), and some parent values may be even rarer
Large domains |Val(Xi)| can also be a problem
This is data fragmentation