An Asymptotic Analysis of Estimators:
Generative, Discriminative, Pseudolikelihood
ICML 2008, Helsinki, Finland
July 6, 2008
Percy Liang, Michael I. Jordan
UC Berkeley
Goal: structured prediction

x ⇒ y = (y_1, …, y_ℓ)

We focus on probabilistic models of x and y.

Many approaches:
• Discriminative (logistic regression, conditional random fields)
• Generative (Naive Bayes, Bayesian networks, HMMs)
• Pseudolikelihood [Besag, 1975]
• Composite likelihood [Lindsay, 1988]
• Multi-conditional learning [McCallum et al., 2006]
• Piecewise training [Sutton & McCallum, 2005]
• Variational relaxations [Wainwright, 2006]
• Agreement-based learning [Liang et al., 2008]

... how to choose among these approaches?

Our work:
• Put first three in a unified composite likelihood framework
• Compare their statistical properties theoretically
Existing intuitions:

• Discriminative: lower bias; Generative: lower variance [Ng & Jordan, 2002; Bouchard & Triggs, 2004]

• Pseudolikelihood: slower statistical convergence [Besag, 1975]

Our general result:

Derive the (excess) risk of composite likelihood estimators.

Specific conclusions:

If the model is well-specified:
Risk(generative) < Risk(discriminative) < Risk(pseudolikelihood)

If the model is misspecified:
Risk(discriminative) < Risk(pseudolikelihood), Risk(generative)
Model-based estimators and neighborhoods

Generative:
θ_g = argmax_θ E[log p_θ(x, y)]
neighborhood: r(x, y) = {(∗, ∗)}

Discriminative:
θ_d = argmax_θ E[log p_θ(y | x)] = argmax_θ E[log p_θ(x, y) − log p_θ(x)]
neighborhood: r(x, y) = {(x, ∗)}

More generally:
θ = argmax_θ E[log p_θ(x, y) − log p_θ(r(x, y))]
[diagram: the point (x, y) inside its neighborhood r(x, y)]

r(x, y) is the subset of input-output space we want to contrast (x, y) against.
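To make the neighborhood view concrete, here is a minimal Python sketch (ours, not from the talk): both objectives are the same contrast of log p_θ(x, y) against a log-sum over a neighborhood, and only the neighborhood function changes. The binary toy space, the features in `phi`, and the data are hypothetical choices.

```python
import numpy as np

# Toy input-output space: x and y each binary; the features in phi and the
# data below are hypothetical choices, used only to illustrate the contrast.
def phi(x, y):
    return np.array([x, y, x * y], dtype=float)

ALL_PAIRS = [(xp, yp) for xp in (0, 1) for yp in (0, 1)]

def log_score(theta, x, y):
    return phi(x, y) @ theta

def neighborhood_objective(theta, data, r):
    """Average of log p_theta(x, y) - log p_theta(r(x, y)): the score of
    (x, y) contrasted against the log-sum of scores over its neighborhood."""
    total = 0.0
    for x, y in data:
        scores = np.array([log_score(theta, xp, yp) for xp, yp in r(x, y)])
        total += log_score(theta, x, y) - np.logaddexp.reduce(scores)
    return total / len(data)

# Generative: contrast against everything, r(x, y) = {(*, *)}.
r_gen = lambda x, y: ALL_PAIRS
# Discriminative: contrast against outputs for the same input, r(x, y) = {(x, *)}.
r_dis = lambda x, y: [(x, yp) for yp in (0, 1)]

data = [(0, 0), (0, 1), (1, 1), (1, 1)]
theta = np.array([0.5, -0.2, 1.0])
print("generative objective:    ", neighborhood_objective(theta, data, r_gen))
print("discriminative objective:", neighborhood_objective(theta, data, r_dis))
```

Swapping `r_gen` for `r_dis` is the entire difference between the two estimators in this view.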
Composite likelihood estimators

Discriminative pseudolikelihood:
[graphical model: labels y_1, …, y_j, …, y_ℓ, each connected to the input x]

θ_p = argmax_θ Σ_{j=1}^{ℓ} E[log p_θ(y_j | x, y\{y_j})]
    = argmax_θ Σ_{j=1}^{ℓ} E[log p_θ(x, y) − log p_θ(x, y\{y_j})]
where r_j(x, y) = (x, y\{y_j}) is the j-th neighborhood.

General composite likelihood:
θ = argmax_θ Σ_j w_j E[log p_θ(x, y) − log p_θ(r_j(x, y))]
[diagram: the point (x, y) inside overlapping neighborhoods r_1(x, y) and r_2(x, y)]
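A matching sketch for the pseudolikelihood objective, under the same assumptions (toy binary variables, hypothetical features): each component j contrasts (x, y) against the neighborhood r_j(x, y) that perturbs only the coordinate y_j.

```python
import numpy as np

# Toy structured model: x binary, y = (y1, y2) binary; features are hypothetical.
def phi(x, y):
    y1, y2 = y
    return np.array([x * y1, x * y2, y1 * y2], dtype=float)

def log_score(theta, x, y):
    return phi(x, y) @ theta

def r_j(x, y, j):
    """Pseudolikelihood neighborhood r_j(x, y): all settings of coordinate y_j,
    with x and the remaining coordinates of y held fixed."""
    out = []
    for v in (0, 1):
        yp = list(y)
        yp[j] = v
        out.append((x, tuple(yp)))
    return out

def pseudolikelihood(theta, data):
    """Sum over j of the average of log p_theta(x, y) - log p_theta(r_j(x, y)),
    i.e. the average of log p_theta(y_j | x, y \\ {y_j})."""
    total = 0.0
    for x, y in data:
        for j in range(len(y)):
            scores = np.array([log_score(theta, xp, yp) for xp, yp in r_j(x, y, j)])
            total += log_score(theta, x, y) - np.logaddexp.reduce(scores)
    return total / len(data)

data = [(1, (0, 1)), (0, (1, 1)), (1, (1, 0))]
theta = np.array([0.3, -0.1, 0.7])
print("pseudolikelihood objective:", pseudolikelihood(theta, data))
```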
Review of exponential families

log p_θ(x, y | r(x, y)) = φ(x, y)·θ − log Σ_{(x′,y′) ∈ r(x,y)} exp{φ(x′, y′)ᵀθ}
(φ: features; θ: parameters; the second term is the log-partition function)

Moment-generating properties:
Mean:     ∇ log p_θ(x, y | r(x, y)) = φ − E_θ[φ | r]
Variance: ∇² log p_θ(x, y | r(x, y)) = −var_θ[φ | r]

Derivatives are useful for asymptotic Taylor expansions.
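These two identities are easy to check numerically. Below is a small sketch (our toy example, with a hypothetical four-point neighborhood) that compares the analytic mean and variance against finite differences of log p_θ.

```python
import numpy as np

# Hypothetical neighborhood r: four points with features in R^2.
PHIS = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])

def log_p(theta, i):
    """log p_theta(point i | r) for an exponential family over the neighborhood."""
    scores = PHIS @ theta
    return scores[i] - np.logaddexp.reduce(scores)

def moments(theta):
    """E_theta[phi | r] and var_theta[phi | r]."""
    scores = PHIS @ theta
    p = np.exp(scores - np.logaddexp.reduce(scores))
    mean = p @ PHIS
    cov = (PHIS - mean).T @ ((PHIS - mean) * p[:, None])
    return mean, cov

theta, i, eps = np.array([0.4, -0.3]), 2, 1e-6
mean, cov = moments(theta)

# Mean property: grad log p_theta = phi - E_theta[phi | r].
grad_fd = np.array([(log_p(theta + eps * e, i) - log_p(theta - eps * e, i)) / (2 * eps)
                    for e in np.eye(2)])
print("phi - E[phi|r]:", PHIS[i] - mean)
print("finite diff   :", grad_fd)

# Variance property: the Hessian is -var_theta[phi | r];
# equivalently, the Jacobian of the mean map is dE[phi]/dtheta = var[phi|r].
jac_fd = np.array([(moments(theta + eps * e)[0] - moments(theta - eps * e)[0]) / (2 * eps)
                   for e in np.eye(2)])
print("var[phi|r]:\n", cov)
print("finite diff of the mean map:\n", jac_fd)
```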
Sketch of arguments for comparing estimators

Intuition:
Grow r ⇒ model more about the data ⇒ the data tells us more about the parameters.

For exponential families:
[plot: the mean features E_θ[φ] as a function of the parameters θ; noise in the observed mean features maps back through the curve to noise in the estimated parameters]
slope = variance of φ, def= sensitivity

Sensitivity ↑ ⇒ Risk ↓

Generative vs. discriminative sensitivity:
var(φ) = E var(φ | X) + var E(φ | X) ⪰ E var(φ | X)   (law of total variance)
⇒ Risk(generative) < Risk(discriminative)
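The inequality in the last step is the law of total variance, which the following sketch checks by simulation; the joint distribution and the feature are hypothetical, and the point is only that E var(φ | X), the discriminative sensitivity, can never exceed var(φ), the generative one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution: x uniform on {0, 1, 2}, y uniform on {0, 1},
# and a scalar feature of (x, y); any feature illustrates the decomposition.
n = 200_000
x = rng.integers(0, 3, size=n)
y = rng.integers(0, 2, size=n)
feat = 1.5 * x + 0.8 * y + 0.3 * x * y

total = feat.var()                                                      # var(phi)
within = sum((x == v).mean() * feat[x == v].var() for v in range(3))    # E var(phi|X)
between = sum((x == v).mean() * (feat[x == v].mean() - feat.mean()) ** 2
              for v in range(3))                                        # var E(phi|X)

print("var(phi)      [generative sensitivity]    :", total)
print("E var(phi|X)  [discriminative sensitivity]:", within)
print("var E(phi|X)  [the gap]                   :", between)
print("check: within + between                   :", within + between)
```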
Overview of asymptotic analysis

How accurately can we estimate the parameters?
ParameterError = O(Σ/√n)
Σ: asymptotic variance of the parameters
n: number of training examples

How fast can we drive the excess risk (expected log-loss) to 0?
In general, get the normal rate:
Risk = O(Σ/√n)
But if some condition is satisfied, get the fast rate:
Risk = O(Σ/n)

Issues:
• O(n^(−1/2)) or O(n^(−1))?
• Compare Σ

Agenda:
1. Well-specified, one component
2. Well-specified, multiple components
3. Misspecified
Well-specified case

Risk = O(Σ/n) for all consistent estimators.
Thus, sufficient to just compare the Σs of different estimators...

Estimator:
θ = argmax_θ E[log p_θ(x, y) − log p_θ(r(x, y))]

Asymptotic variance:
Σ = Γ⁻¹, where Γ = E var(φ | r) is the sensitivity.

Proof:
By Taylor expansion and the moment-generating properties.
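As a sanity check on Σ = Γ⁻¹, the sketch below fits the generative (full-neighborhood) estimator many times on data drawn from a made-up four-point exponential family and compares the scaled empirical variance of θ̂ with the inverse sensitivity; the support, features, and θ* are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 4-point exponential family with made-up features; the generative estimator
# normalizes over the whole support, so Gamma = var(phi) under theta*.
PHIS = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
theta_star = np.array([0.5, -0.3])

def probs(theta):
    s = PHIS @ theta
    p = np.exp(s - s.max())
    return p / p.sum()

def mean_cov(theta):
    p = probs(theta)
    mean = p @ PHIS
    cov = (PHIS - mean).T @ ((PHIS - mean) * p[:, None])
    return mean, cov

def mle(sample_mean, theta0, iters=25):
    """Newton's method for moment matching: solve E_theta[phi] = sample mean."""
    theta = theta0.copy()
    for _ in range(iters):
        mean, cov = mean_cov(theta)
        theta = theta + np.linalg.solve(cov, sample_mean - mean)
    return theta

n, trials = 2000, 2000
_, gamma = mean_cov(theta_star)   # sensitivity Gamma = var(phi | r) under theta*
p_star = probs(theta_star)

estimates = np.array([
    mle(PHIS[rng.choice(len(PHIS), size=n, p=p_star)].mean(axis=0), theta_star)
    for _ in range(trials)
])

print("n * empirical var(theta_hat):\n", n * np.cov(estimates.T))
print("Gamma^{-1}:\n", np.linalg.inv(gamma))
```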
Well-specified case: comparing two estimators

Two estimators:
θ_j = argmax_θ E[log p_θ(x, y) − log p_θ(r_j(x, y))]   for j = 1, 2

Comparison theorem:
If the model is well-specified and r_1(x, y) ⊃ r_2(x, y), then
Risk(θ_1) ≤ Risk(θ_2)

Proof:
Σ_j = (E var(φ | r_j))⁻¹, so Σ_1 ⪯ Σ_2, and Risk = O(Σ_j/n).

Modeling more reduces error (when the model is well-specified).
Multiple components

Asymptotic variance:
Σ = Γ⁻¹ + Γ⁻¹ C_c Γ⁻¹
Γ = Σ_j w_j E var(φ | r_j) is the sensitivity
C_c ⪰ 0: correction due to multiple components

Comparison theorem:
If the model is well-specified,
θ_1 uses one component r_1, θ_2 uses multiple components {r_{2,j}}, and
r_1(x, y) ⊃ r_{2,j}(x, y) for all components j,
then
Risk(θ_1) ≤ Risk(θ_2)

Note: does not apply if θ_1 has more than one component.
Misspecified case

Result:
For any estimator in general, get the normal rate:
Risk = O(Σ/√n)
But for the discriminative estimator, get the fast rate:
Risk = O(Σ/n)

Corollary:
Risk(discriminative) < Risk(pseudolikelihood), Risk(generative)

Key desirable property: training criterion = test criterion.
Verifying the error rates empirically

Setup:
Learn a small model (labels y_1, y_2 with input x) from n training examples.
Estimate the (excess) risk from 10,000 trials.

Well-specified (generate data from the model itself):
[plot: x-axis n = 20K to 100K; y-axis n·var(Risk); curves for Generative, Discriminative, Pseudolikelihood]
All: O(n⁻¹)

Misspecified (generate data from a different model):
[plot: x-axis n = 20K to 100K; y-axis n·var(Risk); same three curves]
Fully discriminative: O(n⁻¹); others: O(n^(−1/2))
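A reduced version of this experiment is easy to reproduce. The sketch below estimates the excess risk (KL to the truth) of the maximum-likelihood estimator on a made-up four-point exponential family at several n; since the model is well-specified, n·risk should stay roughly flat, matching the O(n⁻¹) panel. Reproducing the misspecified O(n^(−1/2)) panel would additionally need a data distribution outside the family and an estimator whose training criterion differs from the test log-loss.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same made-up 4-point exponential family as before; well-specified case only.
PHIS = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
theta_star = np.array([0.8, -0.5])

def probs(theta):
    s = PHIS @ theta
    p = np.exp(s - s.max())
    return p / p.sum()

def mle(sample_mean, theta0, iters=25):
    """Newton's method for moment matching: solve E_theta[phi] = sample mean."""
    theta = theta0.copy()
    for _ in range(iters):
        p = probs(theta)
        mean = p @ PHIS
        cov = (PHIS - mean).T @ ((PHIS - mean) * p[:, None])
        theta = theta + np.linalg.solve(cov, sample_mean - mean)
    return theta

p_star = probs(theta_star)

def excess_risk(theta):
    """KL(p* || p_theta): expected log-loss above the optimum."""
    return float(p_star @ (np.log(p_star) - np.log(probs(theta))))

for n in (500, 2000, 8000):
    risks = [excess_risk(mle(PHIS[rng.choice(len(PHIS), size=n, p=p_star)].mean(axis=0),
                             theta_star))
             for _ in range(500)]
    r = float(np.mean(risks))
    print(f"n={n:5d}  excess risk={r:.2e}  n*risk={n * r:.3f}  sqrt(n)*risk={np.sqrt(n) * r:.3f}")
```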
Application: part-of-speech tagging

Task:
y: Det Noun Verb Det Noun
x: The cat ate a fish

Data: Wall Street Journal news articles (40K sentences)

Synthetic data (well-specified):
[bar chart: test error for Gen., Dis., Pseudo.; y-axis from 4.0 to 12.0]

Real data (misspecified):
[bar chart: test error for Gen., Dis., Pseudo.; y-axis from 2.0 to 6.0]
Summary

Unifying composite likelihood framework for generative, discriminative, and pseudolikelihood estimators.

Asymptotic statistics: a powerful tool for comparing estimators.

General conclusions:
• Well-specified case: modeling more of the data reduces error
• Desirable: training criterion = test criterion