Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs
Affiliation: Kyoto University
Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin
Date: Nov 04, 2011
Terminologies
For understanding distributions
Terminologies
• Schur complement: relates the blocks of a partitioned matrix to the blocks of its inverse.
• Completing the square: converting a quadratic of form ax² + bx + c to a(x + b/(2a))² + const, either to match quadratic components against a standard Gaussian and identify unknowns, or to solve the quadratic.
• Robbins-Monro algorithm: iterative root finding for an unobserved regression function M(x) expressed as a mean, i.e. E[N(x)] = M(x).
Terminologies (cont.) [Stochastic approximation, Wikipedia, 2011]
• Condition on the step sizes a_n: a_n ≥ 0, Σ a_n = ∞, and Σ a_n² < ∞.
• Trace Tr(W) is the sum of the diagonal elements.
• Degree of freedom: dimension of a subspace. Here it refers to a hyperparameter.
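The Robbins-Monro iteration above can be sketched as follows; the function name, the step sizes a_n = 1/n, and the toy regression function are illustrative assumptions, not from the slides:

```python
import random

def robbins_monro(noisy_f, x0, target=0.0, n_iters=5000):
    """Robbins-Monro root finding for M(x) = target, where only noisy
    observations N(x) with E[N(x)] = M(x) are available. The step sizes
    a_n = 1/n satisfy the conditions sum a_n = inf and sum a_n^2 < inf."""
    x = x0
    for n in range(1, n_iters + 1):
        x -= (1.0 / n) * (noisy_f(x) - target)
    return x

# Toy regression function (illustrative): M(x) = 2x - 1, root at x = 0.5,
# observed through additive Gaussian noise.
random.seed(0)
root = robbins_monro(lambda x: 2 * x - 1 + random.gauss(0.0, 0.1), x0=0.0)
```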
Distributions
Gaussian distributions and motives
Conditional Gaussian Distribution
• Derivation of the conditional mean and covariance:
– noting the Schur complement of the partitioned matrix.
• Linear Gaussian model: observations are a weighted sum of underlying latent variables. The conditional mean is linear in x_b; the conditional covariance is independent of x_b.
Assume y = x_a, x = x_b.
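The conditional mean and covariance via the Schur complement can be sketched as below (a minimal NumPy sketch; the function name and the 2-D example are illustrative):

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx_a, idx_b, x_b):
    """p(x_a | x_b) for a joint Gaussian N(mu, Sigma) partitioned into blocks a, b.
    Conditional mean:       mu_a + Sigma_ab Sigma_bb^{-1} (x_b - mu_b)
    Conditional covariance: Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba
    (the Schur complement of Sigma_bb)."""
    mu = np.asarray(mu, float)
    Sigma = np.asarray(Sigma, float)
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    K = S_ab @ np.linalg.inv(S_bb)
    mean = mu[idx_a] + K @ (np.asarray(x_b, float) - mu[idx_b])
    cov = S_aa - K @ S_ab.T
    return mean, cov

# 2-D example: observe x_b = 1.0, condition the first coordinate on the second.
mean, cov = conditional_gaussian([0.0, 0.0], [[2.0, 1.0], [1.0, 1.0]],
                                 [0], [1], x_b=[1.0])
# mean = [1.0], cov = [[1.0]]
```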
Marginal Gaussian Distribution
• The goal is again to identify the mean and covariance by ‘completing the square’.
• Solving the marginalization integral, noting the Schur complement, and comparing components gives the marginal p(x_a) = N(x_a | μ_a, Σ_aa).
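For reference, the standard results for a partitioned Gaussian x = (x_a, x_b) with mean (μ_a, μ_b) and covariance blocks Σ_aa, Σ_ab, Σ_bb (as in Bishop, Ch. 2) are:

$$
\begin{aligned}
p(x_a \mid x_b) &= \mathcal{N}\!\big(x_a \,\big|\; \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\big), \\
p(x_a) &= \mathcal{N}(x_a \mid \mu_a,\; \Sigma_{aa}).
\end{aligned}
$$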
Bayesian relationship with Gaussian distr. (quick view)
• Consider a multivariate Gaussian over (x, y).
– Thus, according to the Bayes equation, p(y|x) = p(x,y)/p(x).
• The conditional Gaussian must have an exponent equal to the difference between the exponents of p(x,y) and p(x).
Bayesian relationship with Gaussian distr.
• Starting from the prior p(x) and the likelihood p(y|x):
• mean and variance for the joint Gaussian distribution p(x,y);
• mean and variance for p(x|y).
Here p(x) can be seen as the prior, p(y|x) as the likelihood, and p(x|y) as the posterior.
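The prior/likelihood/posterior relationship for a linear Gaussian model, with prior p(x) = N(x | μ, Λ⁻¹) and likelihood p(y|x) = N(y | Ax + b, L⁻¹), can be sketched as follows (a minimal sketch; the function name and the scalar example are illustrative):

```python
import numpy as np

def linear_gaussian_posterior(mu, Lam, A, b, L, y):
    """Posterior p(x|y) for prior p(x) = N(x | mu, Lam^-1) and likelihood
    p(y|x) = N(y | A x + b, L^-1):
    Sigma = (Lam + A^T L A)^-1,  mean = Sigma (A^T L (y - b) + Lam mu)."""
    Sigma = np.linalg.inv(Lam + A.T @ L @ A)
    mean = Sigma @ (A.T @ L @ (y - b) + Lam @ mu)
    return mean, Sigma

# Scalar example: prior N(0, 1), observe y = x + noise (unit precision), y = 2.
m, S = linear_gaussian_posterior(np.array([0.0]), np.array([[1.0]]),
                                 np.array([[1.0]]), np.array([0.0]),
                                 np.array([[1.0]]), np.array([2.0]))
# m = [1.0], S = [[0.5]]
```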
Bayesian relationship with Gaussian distr., sequential est.
• Estimate the mean from (N−1)+1 observations: μ_ML^(N) = μ_ML^(N−1) + (1/N)(x_N − μ_ML^(N−1)).
• The Robbins-Monro algorithm has the same form, and can be used to find the mean from maximum likelihood.
– i.e. solve for μ_ML by Robbins-Monro.
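The sequential estimate can be sketched as follows (function name illustrative); after all N points it equals the batch mean:

```python
def sequential_mean(xs):
    """Sequential (online) ML estimate of the mean:
    mu_N = mu_{N-1} + (1/N) * (x_N - mu_{N-1})."""
    mu = 0.0
    for n, x in enumerate(xs, start=1):
        mu += (x - mu) / n
    return mu

result = sequential_mean([1.0, 2.0, 3.0, 4.0])  # equals the batch mean, 2.5
```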
Bayesian relationship with Univariate Gaussian distr.
• The conjugate prior for the precision (inverse variance) of a univariate Gaussian is the gamma distribution.
• The conjugate prior for the mean and precision of a univariate Gaussian jointly is the Gaussian-gamma distribution.
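Assuming a known mean, the conjugate gamma update for the precision can be sketched as below (parameter convention: Gam(λ | a, b) with rate b; the function name is illustrative):

```python
def gamma_posterior(a0, b0, xs, mu):
    """Conjugate update for the precision lambda of N(mu, 1/lambda), mu known:
    prior Gam(lambda | a0, b0)  ->  posterior
    Gam(lambda | a0 + N/2, b0 + 0.5 * sum((x_n - mu)^2))."""
    n = len(xs)
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * sum((x - mu) ** 2 for x in xs)
    return a_n, b_n

a_n, b_n = gamma_posterior(1.0, 1.0, [0.0, 2.0], mu=1.0)  # (2.0, 2.0)
```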
Bayesian relationship with Multivariate Gaussian distr.
• The conjugate prior for the precision (inverse covariance) matrix of a multivariate Gaussian is the Wishart distribution.
• The conjugate prior for the mean and precision of a multivariate Gaussian jointly is the Gaussian-Wishart distribution.
Distributions
Variations of the Gaussian distribution
Student’s t-distr
• Used in analysis of variance to decide whether an effect is real and statistically significant, via the t-distribution with n−1 degrees of freedom.
• If the X_i are normal random variables, the statistic below follows this distribution.
– The t-distribution has a lower peak and longer tails (admitting more outliers, hence robust) than the Gaussian distribution.
• Obtained by integrating over an infinite number of univariate Gaussians with the same mean but different precisions.
$$ t_{n-1} = \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}, \qquad S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2 $$
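The scale-mixture construction can be sketched by sampling a precision from a gamma distribution and then sampling a Gaussian with that precision (a minimal sketch; the default parameter values are illustrative):

```python
import math
import random

def sample_student_t(mu=0.0, lam=1.0, nu=3.0):
    """Student's t as an infinite mixture of Gaussians sharing a mean but
    differing in precision: eta ~ Gam(shape=nu/2, rate=nu/2), then
    x ~ N(mu, 1/(eta * lam))."""
    eta = random.gammavariate(nu / 2.0, 2.0 / nu)  # gammavariate takes shape, scale
    return random.gauss(mu, 1.0 / math.sqrt(eta * lam))

random.seed(0)
xs = [sample_student_t() for _ in range(20000)]
```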
Student’s t-distr (cont.)
• For a multivariate Gaussian, the corresponding multivariate t-distribution is obtained in the same way.
– Δ² is the squared Mahalanobis distance.
• Mean E[x] = μ (for ν > 1); covariance cov[x] = ν/(ν−2) Λ⁻¹ (for ν > 2).
Gaussian with periodic variables
• To avoid the mean being dependent on the choice of origin, use polar coordinates
– and solve for θ.
• The von Mises distribution, a special case of the von Mises-Fisher distribution on the N-dimensional sphere, is the stationary distribution of a drift process on the circle.
Gaussian with periodic variables (cont.)
• Starting from a Gaussian in Cartesian coordinates and conditioning onto the unit circle in polar coordinates,
– this becomes the von Mises distribution p(θ | θ₀, m) = exp{m cos(θ − θ₀)} / (2π I₀(m)), with
• mean θ₀
• precision (concentration) m
Gaussian with periodic variables: mean and variance
• Maximizing the log likelihood gives
– the mean θ₀^ML = atan2(Σ_n sin θ_n, Σ_n cos θ_n),
– and the concentration ‘m’ as the solution of A(m) = I₁(m)/I₀(m) set equal to the mean resultant length,
• by noting that I₀′(m) = I₁(m).
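A sketch of the maximum-likelihood fit, assuming SciPy is available for the modified Bessel functions (using the exponentially scaled `ive` so the ratio I₁/I₀ stays finite for large m; the function name and the sampled example are illustrative):

```python
import numpy as np
from scipy.special import ive
from scipy.optimize import brentq

def fit_von_mises(theta):
    """ML fit of the von Mises distribution.
    Mean direction: theta0 = atan2(mean sin, mean cos).
    Concentration:  solve A(m) = I1(m)/I0(m) = rbar (mean resultant length).
    I1(m)/I0(m) = ive(1, m)/ive(0, m) since the e^{-m} scaling cancels."""
    s, c = np.mean(np.sin(theta)), np.mean(np.cos(theta))
    theta0 = float(np.arctan2(s, c))
    rbar = float(np.hypot(s, c))
    m = brentq(lambda m: ive(1, m) / ive(0, m) - rbar, 1e-6, 1e3)
    return theta0, m

# Fit samples drawn from a von Mises with mean 0.5 and concentration 2.0.
rng = np.random.default_rng(0)
theta = rng.vonmises(0.5, 2.0, size=20000)
theta0, m = fit_von_mises(theta)
```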
Mixture of Gaussians
• In Part 1 we already saw one limitation of the Gaussian: it is unimodal.
– Solution: a linear combination (superposition) of Gaussians.
• The mixing coefficients sum to 1.
• The posteriors p(k|x) here are known as ‘responsibilities’.
– Log likelihood: ln p(X) = Σ_n ln { Σ_k π_k N(x_n | μ_k, Σ_k) }.
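The responsibilities for a 1-D mixture can be sketched as below (function and variable names illustrative):

```python
import numpy as np

def responsibilities(x, pis, mus, sigmas):
    """Responsibilities gamma_k = pi_k N(x | mu_k, sigma_k^2) /
    sum_j pi_j N(x | mu_j, sigma_j^2) for a 1-D Gaussian mixture."""
    pis, mus, sigmas = (np.asarray(a, float) for a in (pis, mus, sigmas))
    dens = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    w = pis * dens
    return w / w.sum()

# Two equally weighted components; the one centred at 0 takes nearly all
# responsibility for a point at x = 0.
g = responsibilities(0.0, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
```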
Exponential family
• Natural form: p(x|η) = h(x) g(η) exp{ηᵀu(x)}
– normalized by g(η).
• 1) Bernoulli
– becomes a member with natural parameter η = ln(μ/(1−μ)), so μ = σ(η), the logistic sigmoid.
• 2) Multinomial
– becomes a member with natural parameters η_k = ln μ_k.
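The Bernoulli natural-parameter map and its inverse, as a minimal sketch:

```python
import math

def logit(mu):
    """Natural parameter of the Bernoulli: eta = ln(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def sigmoid(eta):
    """Inverse map back to the mean: mu = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

roundtrip = sigmoid(logit(0.3))  # recovers 0.3
```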
Exponential family (cont.)
• 3) Univariate Gaussian
– becomes a member with η = (μ/σ², −1/(2σ²))ᵀ and u(x) = (x, x²)ᵀ.
• Solving for the natural parameter:
– becomes −∇ ln g(η) = E[u(x)];
– from maximum likelihood, −∇ ln g(η_ML) = (1/N) Σ_n u(x_n), so only the sufficient statistics Σ_n u(x_n) need be stored.
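Fitting a univariate Gaussian from its sufficient statistics Σx and Σx² can be sketched as (function name illustrative):

```python
def gaussian_from_sufficient_stats(xs):
    """ML fit of a univariate Gaussian using only the exponential-family
    sufficient statistics sum(x) and sum(x^2)."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x * x for x in xs)
    mu = s1 / n
    var = s2 / n - mu ** 2  # E[x^2] - E[x]^2
    return mu, var

mu, var = gaussian_from_sufficient_stats([1.0, 2.0, 3.0])  # (2.0, 2/3)
```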
Parameters of Distributions
And interesting methodologies
Uninformative priors
• “Subjective Bayesian”: avoid incorrect assumptions by using an uninformative (e.g. uniform) prior.
– Improper prior: the prior need not integrate to 1 for the posterior to be proper, as per the Bayes equation.
• 1) Location parameter: a constant prior p(μ) gives translation invariance in p(x|μ) = f(x − μ).
• 2) Scale parameter: a prior p(σ) ∝ 1/σ gives scale invariance in p(x|σ) = (1/σ) f(x/σ).
Nonparametric methods
• Instead of assuming a form for the distribution, use nonparametric methods.
• 1) Histogram with constant bin width
– good for sequential data;
– problems: discontinuities at bin edges, and the number of bins increases exponentially with dimensionality.
• 2) Kernel estimators: sum of Parzen windows
– of ‘N’ observations, ‘K’ fall in a region R of volume V,
– which becomes the density estimate p(x) ≃ K/(N V).
Nonparametric method: Kernel estimators
• 2) Kernel estimators: fix V, determine K.
– The kernel function counts points falling in R.
– h > 0 is a fixed bandwidth parameter controlling smoothing.
– Parzen estimator: the kernel k(u) can be chosen freely (e.g. a Gaussian).
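A minimal 1-D Parzen estimator with a Gaussian kernel (function and variable names illustrative):

```python
import math

def parzen_kde(x, data, h):
    """1-D Parzen (kernel density) estimate with a Gaussian kernel:
    p(x) = (1/N) sum_n (1/(h*sqrt(2*pi))) exp(-(x - x_n)^2 / (2 h^2))."""
    n = len(data)
    norm = n * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-((x - xn) / h) ** 2 / 2.0) for xn in data) / norm

data = [0.0, 0.5, -0.5]
p_near = parzen_kde(0.0, data, h=1.0)   # high density near the data
p_far = parzen_kde(10.0, data, h=1.0)   # negligible density far away
```

Smaller h gives a spikier estimate; larger h oversmooths — the usual bias-variance trade-off.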
Nonparametric method: Nearest-neighbor
• 3) Nearest neighbor: this time fix K and use the data to grow V. Prior: p(C_k) = N_k/N.
– As with the kernel estimator, the training set is stored as the knowledge base.
– ‘k’ is the number of neighbors; a larger ‘k’ gives a smoother, less complex boundary with fewer regions.
– For classifying N points, of which N_k fall in class C_k, maximize the posterior p(C_k|x) from the Bayes equation.
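Majority-vote classification can be sketched as below (1-D points for brevity; the names and data are illustrative):

```python
from collections import Counter

def knn_classify(x, data, k):
    """Classify x by majority vote among its k nearest neighbors.
    `data` is a list of (point, label) pairs; points are 1-D here."""
    neighbors = sorted(data, key=lambda pl: abs(pl[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [(0.0, 'a'), (0.2, 'a'), (0.4, 'a'), (5.0, 'b'), (5.1, 'b')]
label = knn_classify(0.3, data, k=3)  # 'a'
```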
Nonparametric method: Nearest-neighbor (cont.)
• 3) Nearest neighbor: assign a new point to class C_k by majority vote of its k nearest neighbors.
– For k=1 and n→∞, the error is bounded by twice the Bayes error rate.
[k-nearest neighbor algorithm, wiki., 2011]
Ch.2 Basic Graph Concepts
From David Barber’s book
Directed and undirected graphs
Representations of Graphs
• Singly connected (tree): only one path from A to B.
• Spanning tree of an undirected graph: a singly connected subgraph covering all vertices.
• Numerical graph representations:
• Edge list: e.g. a list of vertex pairs (i, j).
• Adjacency matrix A: for N vertices, an N×N matrix with A_ij = 1 if there is an edge from i to j. For an undirected graph this is symmetric.
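Converting an edge list to an adjacency matrix can be sketched as (function name and example graph illustrative):

```python
def edge_list_to_adjacency(n, edges, directed=False):
    """Build an n x n adjacency matrix from an edge list of (i, j) pairs
    (0-indexed). A[i][j] = 1 iff there is an edge from i to j; undirected
    graphs get both A[i][j] and A[j][i], making A symmetric."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        if not directed:
            A[j][i] = 1
    return A

A = edge_list_to_adjacency(4, [(0, 1), (1, 2), (2, 3), (1, 3)])
```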
Representations of Graphs (cont.)
• Directed graph: if the vertices are labeled in ancestral order (parents before children), the adjacency matrix is strictly upper triangular,
– provided there are no edges from a vertex to itself.
• An undirected graph with K maximal cliques has an N × K clique matrix, where each column C_k expresses which vertices form a clique.
• Example: 2 cliques, vertices {1,2,3} and {2,3,4}.
Incidence Matrix
• Adjacency matrix A and incidence matrix Z_inc.
• Maximal clique incidence matrix Z.
• Property: Z_inc Z_incᵀ agrees with A off the diagonal, while its diagonal holds the vertex degrees.
• Note: Z_inc columns denote edges, and rows denote vertices.
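The property can be checked numerically: for the unsigned incidence matrix, Z_inc Z_incᵀ reproduces the adjacency matrix off the diagonal, with vertex degrees on the diagonal (a NumPy sketch; names and example graph illustrative):

```python
import numpy as np

def incidence_matrix(n, edges):
    """Unsigned incidence matrix Z: rows index vertices, columns index edges;
    Z[v, e] = 1 iff vertex v is an endpoint of edge e."""
    Z = np.zeros((n, len(edges)), dtype=int)
    for e, (i, j) in enumerate(edges):
        Z[i, e] = Z[j, e] = 1
    return Z

edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
Z = incidence_matrix(4, edges)
M = Z @ Z.T  # off-diagonal entries: adjacency; diagonal entries: degrees
```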
Additional Information
• Graphs and equations excerpted from [Pattern Recognition and Machine Learning, Bishop C.M.], pages 84-127.
• Graphs and equations excerpted from [Bayesian Reasoning and Machine Learning, David Barber], pages 19-23.
• Slides uploaded to the Google group; please use with attribution.