Algorithmic Aspects of Machine Learning, Lecture 4a · Karl Pearson (1894) and the Naples Crabs (figure due to Peter Macdonald)
Disentangling Gaussians
Ankur Moitra, MIT
November 6th, 2014 — Dean’s Breakfast
Algorithmic Aspects of Machine Learning © 2015 by Ankur Moitra. Note: These are unpolished, incomplete course notes. Developed for educational use at MIT and for publication through MIT OpenCourseWare.
The Gaussian Distribution
The Gaussian distribution is defined as ($\mu$ = mean, $\sigma^2$ = variance):
$$\mathcal{N}(\mu, \sigma^2, x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Central Limit Theorem: The sum of independent random variables $X_1, X_2, \ldots, X_s$ converges in distribution to a Gaussian:

$$\frac{1}{\sqrt{s}} \sum_{i=1}^{s} X_i \to_d \mathcal{N}(\mu, \sigma^2)$$
This distribution is ubiquitous — e.g. used to model height, velocities in an ideal gas, annual rainfall, ...
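As a quick numeric illustration (not from the original slides; the choice of Uniform(0, 1) summands and all constants below are mine), here is a minimal numpy sketch of the CLT: standardized sums of i.i.d. uniforms behave like a standard Gaussian.

```python
import numpy as np

# Minimal CLT sketch: standardized sums of s i.i.d. Uniform(0,1) variables
# (mean 1/2, variance 1/12) should look approximately Gaussian.
rng = np.random.default_rng(0)
s, trials = 100, 50_000

x = rng.uniform(0.0, 1.0, size=(trials, s))
sums = (x.sum(axis=1) - s * 0.5) / np.sqrt(s / 12.0)  # center and scale

print(sums.mean(), sums.var())      # ~0.0 and ~1.0
print(np.mean(np.abs(sums) < 2))    # ~0.95, as for a standard Gaussian
```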
Karl Pearson (1894) and the Naples Crabs
(figure due to Peter Macdonald)
[Figure: histogram of the crab measurements, x-axis from 0.58 to 0.70, y-axis counts from 0 to 20]
Image courtesy of Peter D. M. Macdonald. Used with permission.
Gaussian Mixture Models
$F(x) = w_1 F_1(x) + (1 - w_1) F_2(x)$, where $F_i(x) = \mathcal{N}(\mu_i, \sigma_i^2, x)$

In particular, with probability $w_1$ output a sample from $F_1$, otherwise output a sample from $F_2$

Five unknowns: $w_1, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2$
Question
Given enough random samples from $F$, can we learn these parameters (approximately)?
Pearson invented the method of moments to attack this problem...
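To make the generative process concrete, here is a minimal sketch of the two-step sampling procedure just described (flip a $w_1$-biased coin, then sample from the chosen component); the parameter values are arbitrary illustrations.

```python
import numpy as np

def sample_mixture(w1, mu1, var1, mu2, var2, n, rng):
    """Draw n samples from F = w1*N(mu1, var1) + (1 - w1)*N(mu2, var2)."""
    pick1 = rng.random(n) < w1                      # with probability w1, use F1
    mu = np.where(pick1, mu1, mu2)
    sd = np.where(pick1, np.sqrt(var1), np.sqrt(var2))
    return rng.normal(mu, sd)

rng = np.random.default_rng(1)
samples = sample_mixture(0.4, -1.0, 0.5, 2.0, 1.5, 10_000, rng)  # illustrative values
print(samples.mean())   # near 0.4*(-1.0) + 0.6*2.0 = 0.8
```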
Pearson’s Sixth Moment Test
Claim
$\mathbb{E}_{x \leftarrow F(x)}[x^r]$ is a polynomial in $\theta = (w_1, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$

In particular:

$\mathbb{E}_{x \leftarrow F(x)}[x] = w_1 \mu_1 + (1 - w_1)\mu_2$

$\mathbb{E}_{x \leftarrow F(x)}[x^2] = w_1(\mu_1^2 + \sigma_1^2) + (1 - w_1)(\mu_2^2 + \sigma_2^2)$

$\mathbb{E}_{x \leftarrow F(x)}[x^3] = w_1(\mu_1^3 + 3\mu_1\sigma_1^2) + (1 - w_1)(\mu_2^3 + 3\mu_2\sigma_2^2)$
...
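These identities are easy to sanity-check by Monte Carlo; the sketch below (my own, with arbitrary parameter values) compares the empirical moments of a mixture against the first three polynomial formulas.

```python
import numpy as np

# Sanity check: empirical moments of a mixture vs. the formulas above.
rng = np.random.default_rng(2)
w1, mu1, var1, mu2, var2 = 0.3, 0.0, 1.0, 3.0, 2.0   # illustrative values

n = 1_000_000
pick1 = rng.random(n) < w1
x = rng.normal(np.where(pick1, mu1, mu2),
               np.sqrt(np.where(pick1, var1, var2)))

m1 = w1 * mu1 + (1 - w1) * mu2
m2 = w1 * (mu1**2 + var1) + (1 - w1) * (mu2**2 + var2)
m3 = w1 * (mu1**3 + 3 * mu1 * var1) + (1 - w1) * (mu2**3 + 3 * mu2 * var2)

for r, m in enumerate([m1, m2, m3], start=1):
    print(r, (x**r).mean(), m)   # empirical vs. formula: should agree closely
```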
Pearson’s Sixth Moment Test
Claim
$\mathbb{E}_{x \leftarrow F(x)}[x^r]$ is a polynomial in $\theta = (w_1, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$
Let $\mathbb{E}_{x \leftarrow F(x)}[x^r] = M_r(\theta)$

Gather samples $S$

Set $\widetilde{M}_r = \frac{1}{|S|} \sum_{i \in S} x_i^r$ for $r = 1, 2, \ldots, 6$

Compute the simultaneous roots of $\{M_r(\theta) = \widetilde{M}_r\}_{r=1,2,\ldots,5}$, and select the root $\hat\theta$ that is closest in sixth moment
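Forming the empirical moments $\widetilde{M}_r$ is the easy half of this recipe; a minimal sketch follows. The simultaneous root-finding step is elided here (a brute-force stand-in for it appears with the univariate algorithm near the end).

```python
import numpy as np

def empirical_moments(samples, max_r=6):
    """Plug-in estimates M~_r = (1/|S|) * sum_i x_i^r for r = 1..max_r."""
    samples = np.asarray(samples, dtype=float)
    return np.array([(samples**r).mean() for r in range(1, max_r + 1)])

# e.g. empirical_moments(np.random.default_rng(0).normal(0, 1, 100_000))
# is close to the standard Gaussian raw moments (0, 1, 0, 3, 0, 15).
```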
Provable Guarantees?
In Contributions to the Mathematical Theory of Evolution (attributed to George Darwin):

“Given the probable error of every ordinate of a frequency curve, what are the probable errors of the elements of the two normal curves into which it may be dissected?”

Are the parameters of a mixture of two Gaussians uniquely determined by its moments?
Are these polynomial equations robust to errors?
A View from Theoretical Computer Science
Suppose our goal is to provably learn the parameters of each component within an additive $\varepsilon$:
Goal
Output a mixture $\hat F = \hat w_1 \hat F_1 + \hat w_2 \hat F_2$ so that there is a permutation $\pi : \{1, 2\} \to \{1, 2\}$ and for $i \in \{1, 2\}$:

$$|w_i - \hat w_{\pi(i)}|,\ |\mu_i - \hat\mu_{\pi(i)}|,\ |\sigma_i^2 - \hat\sigma_{\pi(i)}^2| \le \varepsilon$$
Is there an algorithm that takes $\mathrm{poly}(1/\varepsilon)$ samples and runs in time $\mathrm{poly}(1/\varepsilon)$?
A Conceptual History
Pearson (1894): Method of Moments (no guarantees)
Fisher (1912-1922): Maximum Likelihood Estimator (MLE)
consistent and efficient, usually computationally hard
Teicher (1961): Identifiability through tails

requires many samples
Dempster, Laird, Rubin (1977): Expectation-Maximization (EM)
gets stuck in local maxima
Dasgupta (1999) and many others: Clustering
assumes almost non-overlapping components
Identifiability through the Tails
Approach: Find the parameters of the component with largest variance (it dominates the behavior of $F(x)$ at infinity); subtract it off and continue
A Conceptual History
Pearson (1894): Method of Moments (no guarantees)
Fisher (1912-1922): Maximum Likelihood Estimator (MLE)
consistent and efficient, usually computationally hard
Teicher (1961): Identifiability through tails
requires many samples
Dempster, Laird, Rubin (1977): Expectation-Maximization (EM)
gets stuck in local maxima
Dasgupta (1999) and many others: Clustering

assumes almost non-overlapping components
Clustering Well-separated Mixtures
Approach: Cluster samples based on which component generated them; output the empirical mean and variance of each cluster
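For intuition, here is a toy one-dimensional illustration of the cluster-then-estimate idea (my own simplification: a crude threshold split, which only makes sense when the components barely overlap; real clustering algorithms are far more careful).

```python
import numpy as np

# Toy sketch for a well-separated 1-d mixture: split at a threshold between
# the two modes, then report each side's empirical weight, mean, and variance.
rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-5, 1, 5_000), rng.normal(5, 1, 5_000)])

threshold = x.mean()   # crude split; adequate only under large separation
for cluster in (x[x < threshold], x[x >= threshold]):
    print(len(cluster) / len(x), cluster.mean(), cluster.var())  # w, mu, sigma^2
```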
A Conceptual History
Pearson (1894): Method of Moments (no guarantees)
Fisher (1912-1922): Maximum Likelihood Estimator (MLE)
consistent and efficient, usually computationally hard
Teicher (1961): Identifiability through tails
requires many samples
Dempster, Laird, Rubin (1977): Expectation-Maximization (EM)
gets stuck in local maxima
Dasgupta (1999) and many others: Clustering
assumes almost non-overlapping components
In summary, these approaches are heuristic, computationally intractable, or make a separation assumption about the mixture
Question
What if the components overlap almost entirely?
[Kalai, Moitra, Valiant] (studies n-dimensional version too):
Reduce to the one-dimensional case
Analyze Pearson’s sixth moment test (with noisy estimates)
Our Results
Suppose $w_1 \in [\varepsilon^{10}, 1 - \varepsilon^{10}]$ and $\int |F_1(x) - F_2(x)|\, dx \ge \varepsilon^{10}$
Theorem (Kalai, Moitra, Valiant)
There is an algorithm that (with probability at least $1 - \delta$) learns the parameters of $F$ within an additive $\varepsilon$, and the running time and number of samples needed are $\mathrm{poly}(\frac{1}{\varepsilon}, \log \frac{1}{\delta})$.
Previously, the best known bounds on the running time/sample complexity were exponential

See also [Moitra, Valiant] and [Belkin, Sinha] for mixtures of $k$ Gaussians
Analyzing the Method of Moments
Let’s start with an easier question:
Question
What if we are given the first six moments of the mixture, exactly?
Does this uniquely determine the parameters of the mixture?
(up to a relabeling of the components)
Question
Do any two different mixtures $F$ and $\hat F$ differ on at least one of the first six moments?
Method of Moments
Claim
One of the first six moments of $F$, $\hat F$ is different!
Proof: here $f(x)$ is the difference of the two mixture densities and $p(x) = \sum_{r=1}^{6} p_r x^r$ is a polynomial chosen to agree in sign with $f(x)$ everywhere (possible because $f$ has at most six zero crossings). Then

$$0 < \left| \int_x p(x) f(x)\, dx \right| = \left| \int_x \sum_{r=1}^{6} p_r x^r f(x)\, dx \right| \le \sum_{r=1}^{6} |p_r| \left| \int_x x^r f(x)\, dx \right| = \sum_{r=1}^{6} |p_r|\, \left| M_r(F) - M_r(\hat F) \right|$$

So $\exists\, r \in \{1, 2, \ldots, 6\}$ such that $|M_r(F) - M_r(\hat F)| > 0$
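A quick numeric companion to this claim (my own sketch, with arbitrary parameters): two distinct mixtures that share the same mean are already separated by their second moments.

```python
import numpy as np

def gaussian_raw_moments(mu, var, max_r=6):
    # E[X^r] for X ~ N(mu, var), via E[X^r] = mu*E[X^(r-1)] + (r-1)*var*E[X^(r-2)]
    m = [1.0, mu]
    for r in range(2, max_r + 1):
        m.append(mu * m[r - 1] + (r - 1) * var * m[r - 2])
    return np.array(m[1:])

def mixture_moments(w1, mu1, var1, mu2, var2):
    return (w1 * gaussian_raw_moments(mu1, var1)
            + (1 - w1) * gaussian_raw_moments(mu2, var2))

# Two different mixtures, both with mean zero:
M_F    = mixture_moments(0.5, -1.0, 1.0, 1.0, 1.0)
M_Fhat = mixture_moments(0.5, -2.0, 1.0, 2.0, 1.0)
print(np.abs(M_F - M_Fhat))   # r = 2 already separates them (2.0 vs. 5.0)
```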
Our goal is to prove the following:
Proposition

If $f(x) = \sum_{i=1}^{k} \alpha_i \mathcal{N}(\mu_i, \sigma_i^2, x)$ is not identically zero, $f(x)$ has at most $2k - 2$ zero crossings ($\alpha_i$'s can be negative).
We will do it through properties of the heat equation
The Heat Equation
Question
If the initial heat distribution on a one-dimensional infinite rod (with thermal diffusivity $\kappa$) is $f(x) = f(x, 0)$, what is the heat distribution $f(x, t)$ at time $t$?
There is a probabilistic interpretation (σ2 = 2κt):
$$f(x, t) = \mathbb{E}_{z \leftarrow \mathcal{N}(0, \sigma^2)}[f(x + z, 0)]$$
Alternatively, this is called a convolution:

$$f(x, t) = \int_{z=-\infty}^{\infty} f(x + z)\, \mathcal{N}(0, \sigma^2, z)\, dz = f(x) * \mathcal{N}(0, \sigma^2)$$
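Numerically, this is easy to simulate on a grid; a minimal discretized sketch (my own, with an arbitrary initial condition) shows the Gaussian smoothing out the initial profile while conserving total heat.

```python
import numpy as np

# Sketch: evolve f(x, 0) by convolving with N(0, sigma^2), sigma^2 = 2*kappa*t.
dx = 0.01
x = np.arange(-10, 10, dx)
f0 = (np.abs(x) < 1).astype(float)         # initial condition: a hot segment

sigma = 0.5                                 # choose sigma^2 = 2*kappa*t
kernel = np.exp(-(x**2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
ft = np.convolve(f0, kernel, mode="same") * dx   # f(x, t) = f * N(0, sigma^2)

print(f0.sum() * dx, ft.sum() * dx)         # total heat is conserved (~2.0 both)
```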
The Key Facts
Theorem (Hummel, Gidas)
Suppose $f(x) : \mathbb{R} \to \mathbb{R}$ is analytic and has $N$ zeros. Then

$$f(x) * \mathcal{N}(0, \sigma^2, x)$$

has at most $N$ zeros (for any $\sigma^2 > 0$).
Convolving by a Gaussian does not increase # of zero crossings
Fact
$$\mathcal{N}(\mu_1, \sigma_1^2, x) * \mathcal{N}(\mu_2, \sigma_2^2, x) = \mathcal{N}(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2,\ x)$$
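This fact can be checked by sampling, since the density of a sum of independent random variables is the convolution of their densities; a quick sketch with arbitrary parameters:

```python
import numpy as np

# Check N(mu1, var1) * N(mu2, var2) = N(mu1 + mu2, var1 + var2) by sampling:
rng = np.random.default_rng(3)
mu1, var1, mu2, var2 = 1.0, 0.5, -2.0, 2.0   # illustrative values

n = 1_000_000
z = rng.normal(mu1, np.sqrt(var1), n) + rng.normal(mu2, np.sqrt(var2), n)
print(z.mean(), z.var())   # ~ mu1 + mu2 = -1.0 and ~ var1 + var2 = 2.5
```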
Recall, our goal is to prove the following:
Proposition

If $f(x) = \sum_{i=1}^{k} \alpha_i \mathcal{N}(\mu_i, \sigma_i^2, x)$ is not identically zero, $f(x)$ has at most $2k - 2$ zero crossings ($\alpha_i$'s can be negative).
We will prove it by induction:
Start with $k = 3$ (at most 4 zero crossings); let's prove it for $k = 4$ (at most 6 zero crossings)
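Before the proof, a numeric plausibility check of the proposition (my own sketch): count sign changes of a random linear combination of Gaussian densities on a fine grid and compare against the $2k - 2$ bound. (A grid count can miss crossings, so this is only a sanity check, not a proof.)

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def count_sign_changes(alphas, mus, variances, grid):
    """Count sign changes of f(x) = sum_i alpha_i * N(mu_i, var_i, x) on a grid."""
    f = sum(a * gaussian_pdf(grid, m, v)
            for a, m, v in zip(alphas, mus, variances))
    signs = np.sign(f[np.abs(f) > 1e-12])   # ignore numerically-zero tails
    return int(np.sum(signs[1:] != signs[:-1]))

rng = np.random.default_rng(7)
k = 4
grid = np.linspace(-15, 15, 200_001)
n_crossings = count_sign_changes(rng.normal(0, 1, k), rng.normal(0, 3, k),
                                 rng.uniform(0.5, 2.0, k), grid)
print(n_crossings, "<= 2k - 2 =", 2 * k - 2)
```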
Bounding the Number of Zero Crossings
Hence, the exact values of the first six moments determine the mixture parameters!
Let $\Theta = \{\text{valid parameters}\}$ (in particular $w_i \in [0, 1]$, $\sigma_i \ge 0$)
Claim
Let $\theta$ be the true parameters; then the only solutions to

$$\left\{ \hat\theta \in \Theta \;\middle|\; M_r(\hat\theta) = M_r(\theta) \text{ for } r = 1, 2, \ldots, 6 \right\}$$

are $(w_1, \mu_1, \sigma_1, \mu_2, \sigma_2)$ and the relabeling $(1 - w_1, \mu_2, \sigma_2, \mu_1, \sigma_1)$
Are these equations stable when we are given noisy estimates?
A Univariate Learning Algorithm
Our algorithm:
Take enough samples $S$ so that $\widetilde{M}_r = \frac{1}{|S|} \sum_{i \in S} x_i^r$ is w.h.p. close to $M_r(\theta)$ for $r = 1, 2, \ldots, 6$ (within an additive $\varepsilon^{C_2}$)

Compute $\hat\theta$ such that $M_r(\hat\theta)$ is close to $\widetilde{M}_r$ for $r = 1, 2, \ldots, 6$ (within an additive $\varepsilon^{C_2}$)
And $\hat\theta$ must be close to $\theta$, because solutions to this system of polynomial equations are stable
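Here is a heavily hedged end-to-end sketch of this recipe. A careful implementation would solve the polynomial system; as a stand-in, this version brute-forces a parameter grid and keeps the candidate whose theoretical moments (computed via Stein's recursion, a standard Gaussian identity) best match the empirical ones. All function names, grid choices, and constants are mine, purely for illustration.

```python
import itertools
import numpy as np

def gaussian_raw_moments(mu, var, max_r=6):
    """Raw moments E[X^r] for X ~ N(mu, var), r = 1..max_r, via Stein's
    recursion: E[X^r] = mu*E[X^(r-1)] + (r-1)*var*E[X^(r-2)]."""
    m = [1.0, mu]
    for r in range(2, max_r + 1):
        m.append(mu * m[r - 1] + (r - 1) * var * m[r - 2])
    return np.array(m[1:])

def mixture_moments(theta, max_r=6):
    w1, mu1, var1, mu2, var2 = theta
    return (w1 * gaussian_raw_moments(mu1, var1, max_r)
            + (1 - w1) * gaussian_raw_moments(mu2, var2, max_r))

def fit_by_grid(samples, weights, mus, variances):
    """Toy stand-in for the root-finding step: return the grid point whose
    first six theoretical moments best match the empirical ones."""
    emp = np.array([(samples**r).mean() for r in range(1, 7)])
    candidates = itertools.product(weights, mus, variances, mus, variances)
    return min(candidates,
               key=lambda th: np.abs(mixture_moments(th) - emp).max())

# Illustrative use: recover rough parameters of a synthetic mixture.
rng = np.random.default_rng(6)
pick = rng.random(200_000) < 0.3
x = rng.normal(np.where(pick, -2.0, 1.0), np.where(pick, 1.0, np.sqrt(0.5)))
theta_hat = fit_by_grid(x, np.linspace(0.1, 0.9, 9),
                        np.linspace(-3, 3, 13), np.linspace(0.25, 2.0, 8))
print(theta_hat)   # near (0.3, -2.0, 1.0, 1.0, 0.5), up to grid resolution
```

The grid search runs in time exponential in the number of parameters if done naively; it stands in only for the "compute $\hat\theta$" step, and the stability claim above is what guarantees the best-matching candidate is actually close to the truth.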
Summary and Discussion
Here we gave the first efficient algorithms for learning mixtures of Gaussians with provably minimal assumptions
Key words: method of moments, polynomials, heat equation
Computational intractability is everywhere in machine learning/statistics

Currently, most approaches are heuristic and have no provable guarantees

Can we design new algorithms for some of the fundamental problems in these fields?
My Work
Natural Algorithms
Is Learning Computationally Easy?
MIT OpenCourseWare
http://ocw.mit.edu
18.409 Algorithmic Aspects of Machine Learning
Spring 2015
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.