Distribution Mixtures of Product Components
Part I: EM Algorithm & Modifications
Jiří Grim
Institute of Information Theory and Automation
Academy of Sciences of the Czech Republic
January 2017
Available at: http://www.utia.cas.cz/people/grim
Cf. J. Grim: "Approximation of Unknown Multivariate Probability Distributions by Using Mixtures of Product Components: A Tutorial." International Journal of Pattern Recognition and Artificial Intelligence, (2017). DOI: 10.1142/S0218001417500288
Outline

1 METHOD OF MIXTURES: Approximation of Unknown Probability Distributions; Example: mixture of Gaussian densities

2 GENERAL VERSION OF EM ALGORITHM: General EM Iteration Scheme; Monotonic Property of EM Algorithm; Computational Properties of EM Algorithm; Historical Comments

3 PRODUCT MIXTURES: Distribution Mixtures with Product Components; Implementation of EM Algorithm

4 MODIFICATIONS OF PRODUCT MIXTURE MODEL: Structural Mixture Model; EM Algorithm for Incomplete Data; Modification of EM Algorithm for Weighted Data; Sequential Decision Scheme

5 SURVEY: Computational Properties of Product Mixtures; Literature Related to Mixtures
Method of Distribution Mixtures
Information source:

training data S: independent observations of a random vector, identically distributed (i.i.d.) according to an unknown probability distribution P^*(x):

S = \{x^{(1)}, x^{(2)}, \dots, x^{(K)}\}, \quad x^{(k)} = (x_1^{(k)}, x_2^{(k)}, \dots, x_N^{(k)}) \in X

Principle of the method of mixtures:

approximation of an unknown multidimensional, multimodal distribution P^*(x) by means of a linear combination of component distributions F(x|m):

P(x) = \sum_{m\in\mathcal{M}} w_m F(x|m), \quad \sum_{m\in\mathcal{M}} w_m = 1, \quad \sum_{x\in X} F(x|m) = 1 \;\left(= \int_X F(x|m)\, dx\right)
Application examples:
pattern recognition, image analysis, prediction problems, texture modeling, statistical models, classification of text documents, ...
Mixtures as a “Semiparametric” Model
parametric approach: e.g. assuming a multivariate normal density

P(x) = \frac{1}{\sqrt{(2\pi)^N \det A}} \exp\left\{-\frac{1}{2}(x-c)^T A^{-1}(x-c)\right\}, \quad x \in X

mean: c = \frac{1}{|S|}\sum_{x\in S} x, \quad covariance matrix: A = \frac{1}{|S|}\sum_{x\in S} (x-c)(x-c)^T

nonparametric approach: general kernel estimate (Parzen, 1962)

P(x) = \frac{1}{|S|}\sum_{y\in S} \prod_{n\in\mathcal{N}} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left\{-\frac{(x_n-y_n)^2}{2\sigma_n^2}\right\}, \quad x \in X

problem: optimal smoothing (choice of the smoothing parameters \sigma_n)
Mixtures as a Compromise: Semiparametric Multimodal Model
less restrictive than parametric models
almost as general as the nonparametric model, but without smoothing parameters
efficient estimation of parameters by EM algorithm
Example - EM algorithm for mixtures of Gaussian densities
computation of parameter estimates from data S = \{x^{(1)}, \dots, x^{(K)}\}:

F(x|c_m, A_m) = \frac{1}{\sqrt{(2\pi)^N \det A_m}} \exp\left\{-\frac{1}{2}(x-c_m)^T A_m^{-1}(x-c_m)\right\}, \quad x \in \mathbb{R}^N

L = \frac{1}{|S|}\sum_{x\in S} \log P(x) = \frac{1}{|S|}\sum_{x\in S} \log\left[\sum_{m\in\mathcal{M}} F(x|c_m, A_m)\, w_m\right]

Iteration equations (to maximize the log-likelihood function L):

E-step: q(m|x) = \frac{w_m F(x|c_m, A_m)}{\sum_{j=1}^{M} w_j F(x|c_j, A_j)}, \quad x \in S, \; m = 1, 2, \dots, M

M-step: w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x), \quad c'_m = \frac{1}{\sum_{x\in S} q(m|x)} \sum_{x\in S} x\, q(m|x),

A'_m = \frac{1}{\sum_{x\in S} q(m|x)} \sum_{x\in S} q(m|x)\,(x-c'_m)(x-c'_m)^T
Remark: The number of components has to be given.
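A minimal NumPy sketch of these iteration equations follows, added to this transcript for illustration (it is not the lecture's own implementation); the random initialization and the small regularization term reg are assumptions of mine, added for numerical safety:

import numpy as np

def em_gaussian_mixture(X, M, n_iter=100, seed=0, reg=1e-6):
    # X: (K, N) data matrix S; M: given number of components
    rng = np.random.default_rng(seed)
    K, N = X.shape
    w = np.full(M, 1.0 / M)                        # weights w_m
    c = X[rng.choice(K, M, replace=False)].copy()  # means c_m drawn from the data
    A = np.tile(np.cov(X.T) + reg * np.eye(N), (M, 1, 1))  # covariance matrices A_m
    for _ in range(n_iter):
        # E-step: conditional component weights q(m|x), x in S
        logp = np.empty((K, M))
        for m in range(M):
            d = X - c[m]
            _, logdet = np.linalg.slogdet(A[m])
            quad = np.einsum('kn,nj,kj->k', d, np.linalg.inv(A[m]), d)
            logp[:, m] = np.log(w[m]) - 0.5 * (N * np.log(2 * np.pi) + logdet + quad)
        logp -= logp.max(axis=1, keepdims=True)    # guard against underflow
        q = np.exp(logp)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood estimates
        qs = q.sum(axis=0)                         # sum over S of q(m|x)
        w = qs / K                                 # w'_m
        c = (q.T @ X) / qs[:, None]                # c'_m
        for m in range(M):
            d = X - c[m]
            A[m] = (q[:, m, None] * d).T @ d / qs[m] + reg * np.eye(N)  # A'_m
    return w, c, A

For instance, em_gaussian_mixture(data, M=7) corresponds to the two-dimensional example with seven components discussed on the following slides.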
Example: reconstruction of a Gaussian mixture from data
dimension of data: N = 2, number of mixture components: M = 7
Random sampling from a Gaussian mixture (M=7)
6000 data points (test of the correct implementation of EM algorithm)
Example of the mixture estimate (M=28)
number of mixture components M = 28 (≠ 7) (comparison: kernel estimate)
Original mixture of Gaussian densities (M=7)
dimension of data N = 2, number of mixture components M = 7
General Version of EM Algorithm
EM algorithm: to maximize the log-likelihood function

L = \frac{1}{|S|}\sum_{x\in S} \log P(x) = \frac{1}{|S|}\sum_{x\in S} \log\left[\sum_{m\in\mathcal{M}} w_m F(x|m)\right]

Iteration equations (m = 1, 2, \dots, M, x \in S, S = \{x^{(1)}, \dots, x^{(K)}\}):

E-step: q(m|x) = \frac{w_m F(x|m)}{\sum_{j=1}^{M} w_j F(x|j)}, \quad w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x)

M-step: F'(\cdot|m) = \arg\max_{F(\cdot|m)} \frac{1}{\sum_{x\in S} q(m|x)} \sum_{x\in S} q(m|x) \log F(x|m)

for product components F(x|m) = \prod_{n\in\mathcal{N}} f_n(x_n|m), \mathcal{N} = \{1, 2, \dots, N\}:

\Rightarrow f'_n(\cdot|m) = \arg\max_{f_n(\cdot|m)} \frac{1}{\sum_{x\in S} q(m|x)} \sum_{x\in S} q(m|x) \log f_n(x_n|m), \quad n \in \mathcal{N}

Remark: In the M-step, an inequality (a mere improvement) is sufficient instead of the maximum ⇒ generalized EM (GEM) algorithm.
Explicit Solution of the M-Step (Grim, 1982)

Let F(x|b), x \in X, be a probability density function and let b^* be the maximum-likelihood estimate of the parameter b:

b^* = \arg\max_b L(b) = \arg\max_b \frac{1}{|S|}\sum_{x\in S} \log F(x|b)

Further, let b^* be an additive function of the data vectors x \in S:

b^* = \frac{1}{|S|}\sum_{x\in S} a(x).

Denoting by \gamma(x) = N(x)/|S| the relative frequency of x in S, we can write:

L(b) = \sum_{x\in\bar{X}} \gamma(x) \log F(x|b), \quad \bar{X} = \{x \in X : \gamma(x) > 0\}, \quad \sum_{x\in\bar{X}} \gamma(x) = 1

b^* = \sum_{x\in\bar{X}} \gamma(x)\, a(x) = \arg\max_b \sum_{x\in\bar{X}} \gamma(x) \log F(x|b)

Consequence: A weighted likelihood function is maximized by the weighted analogue of the related maximum-likelihood estimate. (Example: the Gaussian mixture, cf. appendix A4.)
Monotonic Property of EM Algorithm (Schlesinger, 1968)
The sequence of log-likelihood values \{L^{(t)}\}_{t=0}^{\infty} is non-decreasing:

L^{(t+1)} - L^{(t)} \ge 0, \quad t = 0, 1, 2, \dots

and, if bounded above, converges to a local or global maximum (or a saddle point) of the log-likelihood function:

\lim_{t\to\infty} L^{(t)} = L^* < \infty.

The existence of a finite limit L^* < \infty implies the related necessary conditions:

\lim_{t\to\infty} (L^{(t+1)} - L^{(t)}) = 0 \;\Rightarrow\; \lim_{t\to\infty} |w^{(t+1)}(m) - w^{(t)}(m)| = 0, \; m \in \mathcal{M}, \quad \lim_{t\to\infty} \|q^{(t+1)}(\cdot|x) - q^{(t)}(\cdot|x)\| = 0

Remark: The convergence of the sequence \{L^{(t)}\}_{t=0}^{\infty} does not imply the convergence of the corresponding parameter estimates!
Proof of the Monotonic Property of EM Algorithm
Lemma

The Kullback-Leibler information divergence I(q(\cdot|x)\,\|\,q'(\cdot|x)) is non-negative for any two distributions q(\cdot|x), q'(\cdot|x), and it is zero if and only if the two distributions are identical.

\Rightarrow \frac{1}{|S|}\sum_{x\in S} I(q(\cdot|x)\,\|\,q'(\cdot|x)) = \frac{1}{|S|}\sum_{x\in S}\left[\sum_{m\in\mathcal{M}} q(m|x) \log\frac{q(m|x)}{q'(m|x)}\right] \ge 0

Substituting for q(m|x), q'(m|x) from the E-step implies the inequality:

\frac{1}{|S|}\sum_{x\in S}\sum_{m\in\mathcal{M}} q(m|x)\log\frac{P'(x)}{P(x)} - \frac{1}{|S|}\sum_{x\in S}\sum_{m\in\mathcal{M}} q(m|x)\log\left[\frac{w'_m F'(x|m)}{w_m F(x|m)}\right] \ge 0

where the first term is equal to the increment of the criterion L:

\frac{1}{|S|}\sum_{x\in S}\sum_{m\in\mathcal{M}} q(m|x)\log\frac{P'(x)}{P(x)} = \frac{1}{|S|}\sum_{x\in S}\log\frac{P'(x)}{P(x)} = L' - L.
Proof of the Monotonic Property of EM Algorithm
Making the substitution from the last equation we obtain:

(*) \quad L' - L \ge \sum_{m\in\mathcal{M}}\left[\frac{1}{|S|}\sum_{x\in S} q(m|x)\right]\log\frac{w'_m}{w_m} + \sum_{m\in\mathcal{M}}\frac{1}{|S|}\sum_{x\in S} q(m|x)\log\frac{F'(x|m)}{F(x|m)}

and, using the substitution from the M-step

(**) \quad w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x), \quad m = 1, 2, \dots, M,

we can write the inequality:

(***) \quad \sum_{m\in\mathcal{M}}\left[\frac{1}{|S|}\sum_{x\in S} q(m|x)\right]\log\frac{w'_m}{w_m} = \sum_{m\in\mathcal{M}} w'_m \log\frac{w'_m}{w_m} \ge 0.

Consequently, the first sum on the right-hand side of inequality (*) is non-negative.

Remark: The definition (**) of the weights w'_m maximizes the first sum in (***).
Proof of the Monotonic Property of EM Algorithm
In view of the M-step definition, the function F'(\cdot|m) maximizes the left-hand side, i.e. we can write:

\sum_{m\in\mathcal{M}}\frac{1}{|S|}\sum_{x\in S} q(m|x)\log F'(x|m) \ge \sum_{m\in\mathcal{M}}\frac{1}{|S|}\sum_{x\in S} q(m|x)\log F(x|m).

The last inequality can be rewritten in the form

\sum_{m\in\mathcal{M}}\frac{1}{|S|}\sum_{x\in S} q(m|x)\log\frac{F'(x|m)}{F(x|m)} \ge 0,

i.e. the increment of the log-likelihood function L is non-negative:

L' - L \ge \sum_{m\in\mathcal{M}} w'_m\log\frac{w'_m}{w_m} + \sum_{m\in\mathcal{M}}\frac{1}{|S|}\sum_{x\in S} q(m|x)\log\frac{F'(x|m)}{F(x|m)} \ge 0 \;\Rightarrow\; L' \ge L.

(An alternative proof is given in appendix A8.)

Remark: No statistical interpretation is needed in the proof!
Mixture Identification × Approximating by Mixtures
Problem of mixture identification (e.g. cluster analysis)

GOAL: to identify the true number of components and to estimate the true mixture parameters

the estimated mixture must be identifiable

PROBLEM: the log-likelihood function nearly always has local maxima (especially in the case of small data sets in high-dimensional spaces)

⇒ the resulting local maximum depends on the starting point

PROBLEM: the mixture estimate is strongly influenced by the chosen number of components and by the initial parameters

Problem of approximating unknown probability distributions

GOAL: precise approximation of the unknown probability distribution by using mixture distributions

the approximating mixture need not be identifiable

the exact number of components is irrelevant

the approximating mixture can be initialized randomly
Computational properties of EM Algorithm
real-life approximation problems ⇒ large data sets + large number of components:

in the case of large mixtures (M ≈ 10^1 - 10^2) the low-weight components may be neglected (⇒ the exact number of components is irrelevant)

the existence of local log-likelihood maxima of large mixtures is less relevant because the related maximum values are comparable

⇒ the influence of the initial parameters is less relevant; the mixtures can be initialized randomly

the EM iterations can be stopped, e.g., by a relative-increment threshold, because of its limited influence on the achieved log-likelihood value

a reasonable stopping rule may decrease the risk of overfitting (excessive adaptation to the training data)

the EM algorithm is applicable to weighted data

Remark: The computational properties are data-dependent and therefore not generally valid.
From the History of the Mixture Estimation Problem
Computation of maximum-likelihood estimates of mixture parameters by setting partial derivatives to zero cannot be solved analytically. SOLUTION?

First paper: Pearson (1894): "Contributions to the mathematical theory of evolution. I. Dissection of frequency curves." Philosophical Transactions of the Royal Society of London 185, 71-110. Subject: a mixture of two univariate Gaussian densities estimated by the method of moments. (About 80 papers followed in the years 1895-1965.)

Efficient estimation of mixtures was enabled only by computers:

Hasselblad (1966), Day (1969), Wolfe (1970): derived a simple iteration scheme by algebraic rearrangement of the likelihood equations (at present known as the EM algorithm) which was convergent and easily applicable to large mixtures in multidimensional spaces.

Hosmer (1973): "Iterative m.-l. estimates were proposed by Hasselblad and subsequently have been looked at by Day, Hosmer and Wolfe."

Peters and Walker (1978): "... we have observed in experiments that the convergence is monotone, i.e. that the likelihood function is actually increased in each iteration, but we have been unable to prove it."
From the History of the Mixture Estimation Problem
the first proof of the monotonic property of EM algorithm:
Schlesinger M.I. (1968): "Relation between learning and self learning in pattern recognition", Kibernetika (Kiev), No. 2, 81-88.
Ajvazjan et al. (1974, in Russian): cite Schlesinger (1968)
Isaenko & Urbach (1976, in Russian): cite Schlesinger (1968)
From the History of the Mixture Estimation Problem
the standard reference to EM algorithm:
Dempster et al. (1977): "Maximum likelihood from incomplete data via the EM algorithm." J. Roy. Statist. Soc. B, Vol. 39, pp. 1-38.

Dempster et al. introduced the name "EM algorithm" and described its wide application possibilities (main subject: the problem of incomplete data).

Google Scholar (2017): 48,500 citations of the above paper ("all-time top 10" in statistics); the term "EM algorithm" used in 340,000 papers; the terms "EM algorithm & mixture" used in 103,000 papers.
From the History of the Mixture Estimation Problem
erroneous proofs of the convergence of parameter estimates (these do not concern the monotonic property of EM algorithm):
Boyles R.A. (1983): “On the convergence of the EM algorithm.” J.Roy. Statist. Soc., B, Vol. 45, pp. 47-50.
Wu C.F.J. (1983): “On the convergence properties of the EMalgorithm.” Ann. Statist., Vol. 11, pp. 95-103.
Monographs on Mixtures:
Titterington et al. (1985): Statistical analysis of finite mixturedistributions, John Wiley & Sons: Chichester, New York.
McLachlan and Peel (2000): Finite Mixture Models, John Wiley &Sons, New York, Toronto.
PRODUCT MIXTURES
mixtures of product components (conditional independence model):
P(x) = \sum_{m\in\mathcal{M}} w_m \prod_{n\in\mathcal{N}} f_n(x_n|m), \quad x \in X

Examples: Gaussian mixtures with diagonal covariance matrices (real variables); mixtures of multivariate Bernoulli distributions (binary variables)
ADVANTAGES:
do not imply the assumption of independence of variables
⇒ do not imply the “naive Bayes” assumption
the mixture parameters can be efficiently estimated by EM algorithm
any discrete distribution can be expressed as a product mixture (see appendix A12)
Gaussian product mixtures approach the asymptotic accuracy of non-parametric Parzen estimates for M ≫ 1
no risk of ill-conditioned covariance matrices in Gaussian components
marginal distributions: by omitting superfluous terms in the products
any conditional distributions easily computed
product mixtures support the subspace (structural) modification
EM Estimation of Gaussian Product Mixtures
COMPONENTS: Gaussian densities with diagonal covariance matrices
F(x|\mu_m, \sigma_m) = \prod_{n\in\mathcal{N}} \frac{1}{\sqrt{2\pi}\,\sigma_{mn}} \exp\left\{-\frac{(x_n - \mu_{mn})^2}{2\sigma_{mn}^2}\right\}, \quad x \in X

L = \frac{1}{|S|}\sum_{x\in S}\log P(x) = \frac{1}{|S|}\sum_{x\in S}\log\left[\sum_{m\in\mathcal{M}} w_m F(x|\mu_m, \sigma_m)\right]

EM iteration equations (m \in \mathcal{M}, n \in \mathcal{N}; norming of the variables is unnecessary, cf. the invariance under linear transforms in the appendix):

q(m|x) = \frac{w_m F(x|\mu_m, \sigma_m)}{\sum_{j=1}^{M} w_j F(x|\mu_j, \sigma_j)}, \quad x \in S,

w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x), \quad \mu'_{mn} = \frac{1}{w'_m|S|}\sum_{x\in S} x_n\, q(m|x),

(\sigma'_{mn})^2 = \frac{1}{w'_m|S|}\sum_{x\in S}(x_n - \mu'_{mn})^2 q(m|x) = \frac{1}{w'_m|S|}\sum_{x\in S} x_n^2\, q(m|x) - (\mu'_{mn})^2
no matrix inversion ⇒ no risk of ill-conditioned matrices
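A minimal sketch of the diagonal M-step above, added to this transcript for illustration (the variance floor var_floor is my own assumption, guarding against degenerate components):

import numpy as np

def m_step_diagonal(X, q, var_floor=1e-8):
    # X: (K, N) data matrix; q: (K, M) conditional weights q(m|x) from the E-step
    K = X.shape[0]
    qs = q.sum(axis=0)                        # sum over S of q(m|x) = w'_m * |S|
    w = qs / K                                # new weights w'_m
    mu = (q.T @ X) / qs[:, None]              # mu'_{mn}: weighted means
    ex2 = (q.T @ X**2) / qs[:, None]          # weighted second moments
    var = np.maximum(ex2 - mu**2, var_floor)  # (sigma'_{mn})^2 = E[x_n^2] - mu^2
    return w, mu, var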
EM Estimation of Discrete Product Mixtures
COMPONENTS: products of univariate discrete distributions
F(x|m) = \prod_{n\in\mathcal{N}} f_n(x_n|m), \quad x = (x_1, \dots, x_N) \in X, \; x_n \in X_n, \; |X_n| < \infty

L = \frac{1}{|S|}\sum_{x\in S}\log P(x) = \frac{1}{|S|}\sum_{x\in S}\log\left[\sum_{m\in\mathcal{M}} w_m \prod_{n\in\mathcal{N}} f_n(x_n|m)\right]

EM iteration equations (x \in S, S = \{x^{(1)}, \dots, x^{(K)}\}):

q(m|x) = \frac{w_m F(x|m)}{\sum_{j=1}^{M} w_j F(x|j)}, \quad w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x),

f'_n(\xi|m) = \frac{1}{w'_m|S|}\sum_{x\in S} \delta(\xi, x_n)\, q(m|x)

Remark 1: A discrete product mixture is not identifiable (⇒ a problem in cluster analysis × an advantage in approximation); see appendix A7.
Remark 2: Any discrete distribution is representable as a product mixture; see appendix A12.
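A minimal sketch of the discrete M-step above, added to this transcript for illustration (integer-coded data are my own convention): f'_n(ξ|m) is a weighted relative frequency of the value ξ.

import numpy as np

def m_step_discrete(X, q, n_values):
    # X: (K, N) integer-coded data, X[k, n] in {0, ..., n_values[n]-1}
    # q: (K, M) conditional weights q(m|x) from the E-step
    K, N = X.shape
    M = q.shape[1]
    qs = q.sum(axis=0)                        # w'_m * |S|
    w = qs / K
    f = [np.zeros((M, v)) for v in n_values]
    for n in range(N):
        for xi in range(n_values[n]):
            mask = (X[:, n] == xi)            # delta(xi, x_n) selects matching rows
            f[n][:, xi] = q[mask].sum(axis=0) / qs
    return w, f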
EM Estimation of Multivariate Bernoulli Mixtures
COMPONENTS: products of univariate Bernoulli distributions
binary data: numerals on a binary raster, results of biochemical tests, ...

x = (x_1, x_2, \dots, x_N) \in X, \quad x_n \in \{0, 1\}, \quad X = \{0, 1\}^N

F(x|m) = F(x|\theta_m) = \prod_{n\in\mathcal{N}} f_n(x_n|\theta_{mn}) = \prod_{n\in\mathcal{N}} \theta_{mn}^{x_n}(1-\theta_{mn})^{1-x_n}

L = \frac{1}{|S|}\sum_{x\in S}\log\left[\sum_{m\in\mathcal{M}} w_m F(x|\theta_m)\right], \quad S = \{x^{(1)}, \dots, x^{(K)}\}

EM iteration equations:

q(m|x) = \frac{w_m F(x|\theta_m)}{\sum_{j=1}^{M} w_j F(x|\theta_j)}, \quad x \in S, \; m = 1, 2, \dots, M

w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x), \quad \theta'_{mn} = \frac{1}{w'_m|S|}\sum_{x\in S} x_n\, q(m|x)

Remark: The product of a large number of parameters \theta_{mn} may underflow.
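A minimal sketch of the whole EM iteration for a multivariate Bernoulli mixture, added to this transcript for illustration (the E-step works in the log domain to avoid the underflow noted in the remark; the clipping constant eps and the random initialization are my own assumptions):

import numpy as np

def em_bernoulli_mixture(X, M, n_iter=50, seed=0, eps=1e-6):
    rng = np.random.default_rng(seed)
    K, N = X.shape                             # X: (K, N) binary data matrix
    w = np.full(M, 1.0 / M)
    theta = rng.uniform(0.25, 0.75, size=(M, N))
    for _ in range(n_iter):
        # E-step: log[w_m F(x|theta_m)] = log w_m + sum_n log f_n(x_n|theta_mn)
        logp = (np.log(w)[None, :]
                + X @ np.log(theta).T
                + (1 - X) @ np.log(1 - theta).T)   # shape (K, M)
        logp -= logp.max(axis=1, keepdims=True)    # safe norming (cf. next slides)
        q = np.exp(logp)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: weighted relative frequencies
        qs = q.sum(axis=0)
        w = qs / K
        theta = np.clip((q.T @ X) / qs[:, None], eps, 1 - eps)
    return w, theta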
Implementation Comments on EM Algorithm
implementation of EM algorithm as a data cycle (for |S| ≫ 1): accumulate

\sum_{x\in S} q(m|x) \to w'_m, \quad \sum_{x\in S} x_n\, q(m|x) \to \mu'_{mn}, \; \theta'_{mn}

basic condition to verify a correct implementation: L' \ge L

relative increment threshold \varepsilon to stop the iterations: (L' - L)/L < \varepsilon, \; \varepsilon \approx 10^{-3} - 10^{-5}

\varepsilon is useful to avoid "overpeaking" in the final stages of convergence

EM algorithm suppresses the weights of "superfluous" components (a large number of low-weight components ⇒ too many components M)

global information about the overlap of components:

q_{max}(x) = \max_{m\in\mathcal{M}} q(m|x), \quad \bar{q}_{max} = \frac{1}{|S|}\sum_{x\in S} q_{max}(x)

in multi-dimensional spaces (N ≫ 1) the criterion \bar{q}_{max} is usually high (≈ 0.85 - 0.99) ⇒ the overlap of components is small

Remark: A correct implementation of EM algorithm can be reliably verified by re-identification of mixture parameters from large artificial data sets.
Implementation of EM Algorithm in High Dimensions
PROBLEM: numerical instability of the E-step

the components F(x|m) may "underflow" at dimensions N ≈ 30 - 40
⇒ the "lost" values cannot be "recovered" by norming in the equation for q(m|x)
⇒ inaccurate evaluation of the conditional weights q(m|x)

SOLUTION:

\log[F(x|m)\, w_m] = \log w_m + \sum_{n\in\mathcal{N}} \log f_n(x_n|m)

maximum component: \log C(x) = \max_m \log[F(x|m)\, w_m]

NORMING of F(x|m) and P(x) for the evaluation of q(m|x):

\exp\left\{-\log C(x) + \log w_m + \sum_{n\in\mathcal{N}} \log f_n(x_n|m)\right\} = C(x)^{-1} F(x|m)\, w_m

q(m|x) = \frac{C(x)^{-1} F(x|m)\, w_m}{\sum_{j=1}^{M} C(x)^{-1} F(x|j)\, w_j} = \frac{F(x|m)\, w_m}{\sum_{j=1}^{M} F(x|j)\, w_j}
Examples of C-pseudocode: Bernoulli mixture, Gaussian mixture (see the appendix slides A13, A14).
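A minimal sketch of this norming, added to this transcript for illustration:

import numpy as np

def conditional_weights(log_wF):
    # log_wF: (K, M) array of log[w_m F(x|m)] for each data point x
    logC = log_wF.max(axis=1, keepdims=True)   # log C(x) = max_m log[w_m F(x|m)]
    q = np.exp(log_wF - logC)                  # C(x)^{-1} w_m F(x|m), safe to exponentiate
    return q / q.sum(axis=1, keepdims=True)    # the factor C(x)^{-1} cancels in the norming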
Structural Mixture Model (Grim et al. 1986, 1999, 2002)
binary structural parameters: \phi_m = (\phi_{m1}, \dots, \phi_{mN}) \in \{0, 1\}^N

F(x|m) = \prod_{n\in\mathcal{N}} f_n(x_n|m)^{\phi_{mn}} f_n(x_n|0)^{1-\phi_{mn}},

f_n(x_n|0): fixed "background" distributions, usually f_n(x_n|0) = P^*_n(x_n); \phi_{mn} = 0 ⇒ f_n(x_n|m) is replaced by f_n(x_n|0)

P(x) = \sum_{m\in\mathcal{M}} F(x|m)\, w_m = F(x|0) \sum_{m\in\mathcal{M}} G(x|m, \phi_m)\, w_m,

G(x|m, \phi_m) = \prod_{n\in\mathcal{N}} \left[\frac{f_n(x_n|m)}{f_n(x_n|0)}\right]^{\phi_{mn}}, \quad F(x|0) = \prod_{n\in\mathcal{N}} f_n(x_n|0) > 0

the "background" distribution F(x|0) cancels in the Bayes formula:

p(\omega|x) = \frac{P(x|\omega)\, p(\omega)}{P(x)} = \frac{\sum_{m\in\mathcal{M}_\omega} G(x|m, \phi_m)\, w_m}{\sum_{j\in\mathcal{M}} G(x|j, \phi_j)\, w_j} \approx \sum_{m\in\mathcal{M}_\omega} G(x|m, \phi_m)\, w_m

MOTIVATION: local, component-specific feature selection; "dimensionless" computation; structural neural networks.
Structural Modification of EM Algorithm
structural optimization can be included in EM algorithm:

L = \frac{1}{|S|}\sum_{x\in S}\log\left[\sum_{m\in\mathcal{M}} F(x|0)\, G(x|m, \phi_m)\, w_m\right]

EM iteration equations (m \in \mathcal{M}, n \in \mathcal{N}, x \in S):

q(m|x) = \frac{G(x|m, \phi_m)\, w_m}{\sum_{j\in\mathcal{M}} G(x|j, \phi_j)\, w_j}, \quad w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x),

f'_n(\cdot|m) = \arg\max_{f_n(\cdot|m)} \sum_{x\in S} \frac{q(m|x)}{w'_m|S|}\log f_n(x_n|m)

structural optimization: \phi'_{mn} = 1 for a fixed number R of the largest values of the criterion \gamma'_{mn}:

\gamma'_{mn} = \frac{1}{|S|}\sum_{x\in S} q(m|x)\log\left[\frac{f'_n(x_n|m)}{f_n(x_n|0)}\right]

(for the proof of the monotonic property see appendix A6)

Remark: The background distribution F(x|0) can be included in the optimization too (Grim, 1999).
Structural EM Algorithm - Discrete Mixture
f_n(x_n|m), x_n \in X_n, n \in \mathcal{N}: discrete probability distributions

L = \frac{1}{|S|}\sum_{x\in S}\log\left[\sum_{m\in\mathcal{M}} G(x|m, \phi_m)\, w_m\right], \quad G(x|m, \phi_m) = \prod_{n\in\mathcal{N}}\left[\frac{f_n(x_n|m)}{f_n(x_n|0)}\right]^{\phi_{mn}}

EM iteration equations (m \in \mathcal{M}, n \in \mathcal{N}, x \in S):

q(m|x) = \frac{G(x|m, \phi_m)\, w_m}{\sum_{j\in\mathcal{M}} G(x|j, \phi_j)\, w_j}, \quad w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x)

f'_n(\xi|m) = \sum_{x\in S} \frac{\delta(\xi, x_n)\, q(m|x)}{w'_m|S|}

structural optimization: \phi'_{mn} = 1 for the R largest values of \gamma'_{mn}:

\gamma'_{mn} = \frac{1}{|S|}\sum_{x\in S} q(m|x)\log\left[\frac{f'_n(x_n|m)}{f_n(x_n|0)}\right] = w'_m \sum_{\xi\in X_n} f'_n(\xi|m)\log\frac{f'_n(\xi|m)}{f_n(\xi|0)}

(the detailed rearrangement is given in the appendix)

Remark: The last sum is the Kullback-Leibler information divergence I(f'_n(\cdot|m)\,\|\,f_n(\cdot|0)).
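A minimal sketch of the structural optimization step above, added to this transcript for illustration (selecting the R largest values per component is my reading of the criterion):

import numpy as np

def structural_selection(w, f, f0, R):
    # w: (M,) weights w'_m; f: list of N arrays (M, |X_n|) with f'_n(.|m);
    # f0: list of N arrays (|X_n|,) with the background f_n(.|0)
    M, N = len(w), len(f)
    gamma = np.zeros((M, N))
    for n in range(N):
        # gamma'_{mn} = w'_m * KL(f'_n(.|m) || f_n(.|0))
        gamma[:, n] = w * np.sum(f[n] * np.log(f[n] / f0[n][None, :]), axis=1)
    phi = np.zeros((M, N), dtype=int)
    for m in range(M):
        phi[m, np.argsort(gamma[m])[-R:]] = 1  # keep the R most informative variables
    return gamma, phi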
Structural EM Algorithm - Gaussian Mixture
Gaussian densities: f_n(x_n|\mu_{mn}, \sigma_{mn}) = \frac{1}{\sqrt{2\pi}\,\sigma_{mn}}\exp\left\{-\frac{(x_n-\mu_{mn})^2}{2\sigma_{mn}^2}\right\}

L = \frac{1}{|S|}\sum_{x\in S}\log\left[\sum_{m\in\mathcal{M}} w_m \prod_{n\in\mathcal{N}}\left(\frac{f_n(x_n|\mu_{mn}, \sigma_{mn})}{f_n(x_n|\mu_{0n}, \sigma_{0n})}\right)^{\phi_{mn}}\right]

EM iteration equations (m \in \mathcal{M}, n \in \mathcal{N}, x \in S):

q(m|x) = \frac{G(x|m, \phi_m)\, w_m}{\sum_{j\in\mathcal{M}} G(x|j, \phi_j)\, w_j}, \quad w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x),

\mu'_{mn} = \frac{1}{w'_m|S|}\sum_{x\in S} x_n\, q(m|x), \quad (\sigma'_{mn})^2 = \frac{1}{w'_m|S|}\sum_{x\in S} x_n^2\, q(m|x) - (\mu'_{mn})^2,

structural optimization: \phi'_{mn} = 1 for the R largest values of \gamma'_{mn}:

\gamma'_{mn} = \frac{w'_m}{2}\left[\frac{(\mu'_{mn}-\mu_{0n})^2}{\sigma_{0n}^2} + \frac{(\sigma'_{mn})^2}{\sigma_{0n}^2} - \log\frac{(\sigma'_{mn})^2}{\sigma_{0n}^2} - 1\right] = w'_m\, I(f'_n(\cdot|m), f_n(\cdot|0))

Remark: \gamma'_{mn} is the (weighted) Kullback-Leibler information divergence; the derivation is given in the appendix.
Properties of Structural Mixture Model
STRUCTURAL MIXTURES ≈ statistically correct subspace approach:
PRINCIPLE: the less informative univariate distributions f_n(x_n|m) are replaced by fixed "background" distributions f_n(x_n|0)

reduces the number of mixture parameters (and components) ⇒ reduces the risk of overpeaking

suppresses the influence of unreliable (less informative) variables

the EM algorithm performs feature selection for each component independently (it is not necessary to exclude variables globally)

Bayesian decision-making based on structural mixtures is dimension-independent (Grim, 2016)

the structural optimization implied by EM algorithm is controlled by the Kullback-Leibler information divergence

avoids the biologically unnatural connection of probabilistic neurons with all input variables (Grim et al., 2000)

enables the structural optimization of probabilistic neural networks by EM algorithm (Grim, 2007)
Modification of EM Algorithm for Incomplete Data
INCOMPLETE DATA: x = (x_1, -, x_3, x_4, -, -, x_7, \dots, x_N) \in X

\mathcal{N}(x) = \{n \in \mathcal{N} : the variable x_n is defined in x\}, \; x \in X; \quad S_n = \{x \in S : n \in \mathcal{N}(x)\}: vectors x \in S with the variable x_n defined

Assumption: components in product form ⇒ the marginals are easily available (cf. appendix A3)

L = \frac{1}{|S|}\sum_{x\in S}\log\left[\sum_{m\in\mathcal{M}} w_m F(x|m)\right], \quad F(x|m) = \prod_{n\in\mathcal{N}(x)} f_n(x_n|m)

EM iteration equations (m \in \mathcal{M}, n \in \mathcal{N}, x \in S):

q(m|x) = \frac{w_m F(x|m)}{\sum_{j=1}^{M} w_j F(x|j)}, \quad w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x)

f'_n(\cdot|m) = \arg\max_{f_n(\cdot|m)} \frac{1}{\sum_{x\in S_n} q(m|x)} \sum_{x\in S_n} q(m|x)\log f_n(x_n|m)

Remark: The likelihood criterion depends on the available values only.
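A minimal sketch of one incomplete-data EM step for a diagonal Gaussian product mixture, added to this transcript for illustration (encoding missing values as NaN is my own convention):

import numpy as np

def em_step_incomplete(X, w, mu, var):
    # X: (K, N) data with NaN marking undefined variables; w, mu, var: parameters
    K, N = X.shape
    M = len(w)
    obs = ~np.isnan(X)                          # obs[k, n] <=> n in N(x^(k))
    Xf = np.where(obs, X, 0.0)                  # zero-filled copy for the algebra
    # E-step: log F(x|m) sums log f_n(x_n|m) over the defined variables only
    logp = np.full((K, M), np.log(w))
    for m in range(M):
        ll = -0.5 * (np.log(2 * np.pi * var[m]) + (Xf - mu[m])**2 / var[m])
        logp[:, m] += np.where(obs, ll, 0.0).sum(axis=1)
    logp -= logp.max(axis=1, keepdims=True)
    q = np.exp(logp)
    q /= q.sum(axis=1, keepdims=True)
    # M-step: each f_n(.|m) is estimated from S_n (rows where x_n is defined)
    w_new = q.mean(axis=0)
    denom = (q[:, :, None] * obs[:, None, :]).sum(axis=0)  # sum over S_n of q(m|x)
    mu_new = (q.T @ Xf) / denom
    var_new = np.maximum((q.T @ Xf**2) / denom - mu_new**2, 1e-8)
    return w_new, mu_new, var_new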
Modification of EM algorithm for Weighted Data
NOTATION: \gamma(x) > 0: relative frequency of x in S, \quad \sum_{x\in\bar{X}} \gamma(x) = 1

L = \frac{1}{|S|}\sum_{x\in S}\log\left[\sum_{m\in\mathcal{M}} w_m F(x|m)\right] = \sum_{x\in\bar{X}} \gamma(x)\log\left[\sum_{m\in\mathcal{M}} w_m F(x|m)\right]

\bar{X} = \{x \in X : \gamma(x) > 0\}: the sum can be confined to x \in \bar{X}

"weighted" EM iteration equations (m \in \mathcal{M}, n \in \mathcal{N}, x \in \bar{X}):

q(m|x) = \frac{w_m F(x|m)}{\sum_{j\in\mathcal{M}} w_j F(x|j)}, \quad F(x|m) = \prod_{n\in\mathcal{N}} f_n(x_n|m)

w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x) = \sum_{x\in\bar{X}} \gamma(x)\, q(m|x)

F'(\cdot|m) = \arg\max_{F(\cdot|m)} \sum_{x\in\bar{X}} \frac{\gamma(x)\, q(m|x)}{w'_m}\log F(x|m)

Applications: relevance of data, aggregation of data, discrete data weighted by table values: \gamma(x) = P^*(x), x \in X
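A minimal sketch of the weighted updates for the Bernoulli case, added to this transcript for illustration: gamma holds one weight per distinct data vector, so repeated or aggregated observations enter the estimate only through gamma(x).

import numpy as np

def weighted_m_step_bernoulli(X, q, gamma):
    # X: (K, N) distinct binary vectors; q: (K, M) E-step weights q(m|x);
    # gamma: (K,) relative frequencies with gamma.sum() == 1
    gq = gamma[:, None] * q                 # gamma(x) * q(m|x)
    w = gq.sum(axis=0)                      # w'_m = sum_x gamma(x) q(m|x)
    theta = (gq.T @ X) / w[:, None]         # weighted relative frequencies
    return w, theta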
Sequential Decision Scheme (Grim 1986, 2014)
INFORMATION-CONTROLLED SEQUENTIAL DECISION-MAKING

Given the observations x_D = (x_{j_1}, \dots, x_{j_l}) \in X_D, D = \{j_1, \dots, j_l\} \subset \mathcal{N}, we have to choose the next most informative variable x_n, n \notin D, maximizing the conditional information I_{x_D}(X_n, \Omega) about the classes \Omega = \{\omega_1, \dots, \omega_K\}.

SOLUTION: explicit evaluation of the criterion I_{x_D}(X_n, \Omega):

I_{x_D}(X_n, \Omega) = H_{x_D}(X_n) - H_{x_D}(X_n|\Omega), \quad n^* = \arg\max_{n\notin D} I_{x_D}(X_n, \Omega)

H_{x_D}(X_n) = \sum_{x_n\in X_n} -P_{n|D}(x_n|x_D)\log P_{n|D}(x_n|x_D), \quad P_{n|D}(x_n|x_D) = \frac{P_{nD}(x_n, x_D)}{P_D(x_D)}

H_{x_D}(X_n|\Omega) = \sum_{\omega\in\Omega} p(\omega|x_D) \sum_{x_n\in X_n} -P_{n|D\omega}(x_n|x_D, \omega)\log P_{n|D\omega}(x_n|x_D, \omega),

P_{n|D\omega}(x_n|x_D, \omega) = P_{nD|\omega}(x_n, x_D|\omega)/P_{D|\omega}(x_D|\omega) = \sum_{m\in\mathcal{M}_\omega} W_m(x_D, \omega)\, f_n(x_n|m),

P_{nD|\omega}(x_n, x_D|\omega) = \sum_{m\in\mathcal{M}} w_m f_n(x_n|m, \omega)\prod_{i\in D} f_i(x_i|m, \omega)
Feature Selection: the Most Informative Subspace
special case of the sequential decision scheme:

INFORMATION CRITERION for the optimal feature subset

ASSUMPTION: class-conditional product mixtures P(x|\omega), \omega \in \Omega

I(X_D, \Omega) = H(X_D) - H(X_D|\Omega), \quad D^* = \arg\max_{D\subset\mathcal{N}} I(X_D, \Omega)

P_{D|\omega}(x_D|\omega) = \sum_{m\in\mathcal{M}_\omega} w_m \prod_{n\in D} f_n(x_n|m), \quad x_D \in X_D,

H(X_D) = \sum_{x_D\in X_D} -P_D(x_D)\log P_D(x_D), \quad D = \{j_1, \dots, j_k\} \subset \mathcal{N}, \; |D| = k

H(X_D|\Omega) = \sum_{\omega\in\Omega} p(\omega) \sum_{x_D\in X_D} -P_{D|\omega}(x_D|\omega)\log P_{D|\omega}(x_D|\omega)

optimal subset D \subset \mathcal{N}: complete search, approximate methods

APPLICATION: informative feature selection for pattern recognition
PROPERTIES OF PRODUCT MIXTURES
SURVEY: computational properties of product mixtures
efficient estimation of multivariate distribution mixtures (!)

suitable to approximate multimodal, real-life probability distributions

with an increasing number of components, Gaussian mixtures approach the asymptotic accuracy of Parzen (kernel) estimates

unlike Parzen estimates, product mixtures are optimally "smoothed" by the efficient EM algorithm

directly available marginal probability distributions (!)

the mixture parameters can be estimated from incomplete data

product components enable information-controlled sequential decision-making in multi-dimensional spaces

product mixtures can be interpreted as probabilistic neural networks

enable the structural optimization of probabilistic neural networks

provide an information criterion for the optimal feature subset
A1: Asymptotic Properties of Parzen Estimates
Theorem (Parzen, 1962; Cacoullos, 1966)
Let S_K be a sequence of K independent observations of an N-dimensional random vector distributed with the probability density function P^*(x). The non-parametric density estimate P(x) with the smoothing parameter \sigma_K

P(x) = \frac{1}{K}\sum_{y\in S_K}\prod_{n\in\mathcal{N}} \frac{1}{\sqrt{2\pi}\,\sigma_K}\exp\left\{-\frac{(x_n-y_n)^2}{2\sigma_K^2}\right\}

is asymptotically unbiased at each continuity point of P^*(x), i.e. it holds that

\lim_{K\to\infty} E_{S_K}\{P(x)\} = P^*(x)

if \lim_{K\to\infty} \sigma_K = 0. In addition, if \lim_{K\to\infty} K\sigma_K^N = \infty, then the unbiased estimate P(x) is asymptotically consistent in the quadratic-mean sense:

\lim_{K\to\infty} E_{S_K}\{[P^*(x) - P(x)]^2\} = 0.
A2: Optimal Smoothing of Parzen (Kernel) Estimates
Parzen estimate with Gaussian kernel:

P(x) = \frac{1}{|S|}\sum_{y\in S} f(x|y, \sigma) = \frac{1}{|S|}\sum_{y\in S}\left[\prod_{n\in\mathcal{N}} \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\left\{-\frac{(x_n-y_n)^2}{2\sigma_n^2}\right\}\right]

optimization by the cross-validation (leave-one-out) method, i.e. maximization of the modified log-likelihood function by EM algorithm:

L(\sigma) = \sum_{x\in S}\log\left[\frac{1}{|S|-1}\sum_{y\in S,\, y\neq x} \prod_{n\in\mathcal{N}} \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\left\{-\frac{(x_n-y_n)^2}{2\sigma_n^2}\right\}\right]

q(y|x) = \frac{f(x|y, \sigma)}{\sum_{u\in S,\, u\neq x} f(x|u, \sigma)}, \quad y \in S

(\sigma'_n)^2 = \frac{1}{|S|}\sum_{x\in S}\sum_{y\in S,\, y\neq x} (x_n-y_n)^2\, q(y|x)

Remark: Optimal smoothing is crucial in high-dimensional spaces!
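A minimal sketch of one leave-one-out smoothing update, added to this transcript for illustration (the O(K^2 N) pairwise computation is the naive approach):

import numpy as np

def update_sigma(X, sigma):
    # X: (K, N) data; sigma: (N,) current kernel widths sigma_n
    K, N = X.shape
    d2 = (X[:, None, :] - X[None, :, :])**2     # (x_n - y_n)^2, shape (K, K, N)
    loglik = -0.5 * (d2 / sigma**2 + np.log(2 * np.pi * sigma**2)).sum(axis=2)
    np.fill_diagonal(loglik, -np.inf)           # leave-one-out: exclude y = x
    loglik -= loglik.max(axis=1, keepdims=True)
    q = np.exp(loglik)
    q /= q.sum(axis=1, keepdims=True)           # q(y|x) over y != x
    var = (q[:, :, None] * d2).sum(axis=(0, 1)) / K   # (sigma'_n)^2
    return np.sqrt(var)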
“Under-smoothed” Kernel Estimate
“Over-smoothed” Kernel Estimate
Optimally Smoothed Kernel Estimate
(general Gaussian kernel)
A3: Marginal Distributions of a Product Mixture
easily obtained by omitting superfluous terms in the products:

P(x) = \sum_{m\in\mathcal{M}} w_m F(x|m) = \sum_{m\in\mathcal{M}} w_m \prod_{n\in\mathcal{N}} f_n(x_n|m), \quad x = (x_1, \dots, x_N) \in X

\sum_{x_i\in X_i} P(x) = \sum_{m=1}^{M} w_m \left(\sum_{x_i\in X_i} f_i(x_i|m)\right) \prod_{n\in\mathcal{N}\setminus\{i\}} f_n(x_n|m) = \sum_{m=1}^{M} w_m \prod_{n\in\mathcal{N}\setminus\{i\}} f_n(x_n|m)

x_C = (x_{i_1}, x_{i_2}, \dots, x_{i_k}) \in X_C, \quad X_C = X_{i_1} \times \dots \times X_{i_k}, \quad C = \{i_1, \dots, i_k\} \subset \mathcal{N}:

P_C(x_C) = \sum_{m\in\mathcal{M}} w_m F_C(x_C|m), \quad F_C(x_C|m) = \prod_{n\in C} f_n(x_n|m)

P_{n|C}(x_n|x_C) = \frac{P_{nC}(x_n, x_C)}{P_C(x_C)} = \sum_{m\in\mathcal{M}} \frac{w_m F_C(x_C|m)}{P_C(x_C)}\, f_n(x_n|m)

P_{n|C}(x_n|x_C) = \sum_{m\in\mathcal{M}} W_m(x_C)\, f_n(x_n|m), \quad W_m(x_C) = \frac{w_m F_C(x_C|m)}{P_C(x_C)}
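A minimal sketch of these formulas for a discrete product mixture, added to this transcript for illustration:

import numpy as np

def conditional(w, f, C, xC, n):
    # w: (M,) weights; f: list of N arrays (M, |X_n|) with f_n(.|m);
    # C: list of observed indices; xC: observed values; n: target index, n not in C
    FC = np.ones(len(w))
    for i, xi in zip(C, xC):
        FC *= f[i][:, xi]                  # F_C(x_C|m): keep only factors with n in C
    W = w * FC
    W /= W.sum()                           # W_m(x_C) = w_m F_C(x_C|m) / P_C(x_C)
    return W @ f[n]                        # P_{n|C}(.|x_C): a mixture of f_n(.|m)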
A4: Solution of the M-Step - Gaussian Mixture
Gaussian mixture with a general covariance matrix:

F(x|c_m, A_m) = \frac{1}{\sqrt{(2\pi)^N \det A_m}}\exp\left\{-\frac{1}{2}(x-c_m)^T A_m^{-1}(x-c_m)\right\}, \quad P(x) = \sum_{m\in\mathcal{M}} w_m F(x|c_m, A_m)

implicit form of the M-step:

(c'_m, A'_m) = \arg\max_{(c_m, A_m)} \sum_{x\in S} \gamma(x)\log F(x|c_m, A_m)

explicit solution:

c'_m = \sum_{x\in S} \gamma(x)\, x, \quad \gamma(x) = \frac{q(m|x)}{\sum_{y\in S} q(m|y)}

A'_m = \sum_{x\in S} \gamma(x)(x-c'_m)(x-c'_m)^T = \sum_{x\in S} \gamma(x)\, x x^T - c'_m (c'_m)^T
A5: Solution of the M-Step - Discrete Product Mixture
f'_n(\cdot|m) = \arg\max_{f_n(\cdot|m)} \sum_{x\in S} \frac{q(m|x)}{w'_m|S|}\log f_n(x_n|m), \quad n \in \mathcal{N}, \; m \in \mathcal{M}

Using \sum_{\xi\in X_n} \delta(\xi, x_n) = 1, x_n \in X_n, we can write:

f'_n(\cdot|m) = \arg\max_{f_n(\cdot|m)} \sum_{x\in S} \left(\sum_{\xi\in X_n} \delta(\xi, x_n)\right)\frac{q(m|x)}{w'_m|S|}\log f_n(x_n|m)

f'_n(\cdot|m) = \arg\max_{f_n(\cdot|m)} \sum_{\xi\in X_n}\sum_{x\in S} \frac{\delta(\xi, x_n)\, q(m|x)}{w'_m|S|}\log f_n(\xi|m)

f'_n(\cdot|m) = \arg\max_{f_n(\cdot|m)} \sum_{\xi\in X_n}\left(\sum_{x\in S} \frac{\delta(\xi, x_n)\, q(m|x)}{w'_m|S|}\right)\log f_n(\xi|m)

\Rightarrow \quad f'_n(\xi|m) = \sum_{x\in S} \frac{\delta(\xi, x_n)\, q(m|x)}{w'_m|S|}

(cf. the consequence of the Kullback-Leibler lemma in appendix A11)
A5: Invariance of EM Algorithm Under Linear Transform
EM estimate of a Gaussian mixture is invariant under linear transforms.

Let the parameters w_m, \mu_{mn}, \sigma_{mn}, m \in \mathcal{M}, n \in \mathcal{N}, of a Gaussian product mixture define a stationary point of EM algorithm, i.e. let them satisfy the EM iteration equations. Further, let y = T(x) be a linear transform of the vectors x \in X and of the mixture parameters:

y_n = a_n x_n + b_n, \; x \in S; \quad \tilde{w}_m = w_m, \quad \tilde{\mu}_{mn} = a_n \mu_{mn} + b_n, \quad \tilde{\sigma}_{mn} = a_n \sigma_{mn}.

Then the transformed parameters \tilde{w}_m, \tilde{\mu}_{mn}, \tilde{\sigma}_{mn}, m \in \mathcal{M}, n \in \mathcal{N}, also define a stationary point of EM algorithm in the transformed space Y.

Proof: The following equations can be verified by the related substitutions:

F(y|\tilde{\mu}_m, \tilde{\sigma}_m) = \frac{1}{\prod_{n\in\mathcal{N}} a_n} F(x|\mu_m, \sigma_m), \quad P(y) = \frac{1}{\prod_{n\in\mathcal{N}} a_n} P(x)

\tilde{\mu}_{mn} = \frac{1}{\tilde{w}_m|S|}\sum_{y\in S} y_n\, q(m|y), \quad (\tilde{\sigma}_{mn})^2 = \frac{1}{\tilde{w}_m|S|}\sum_{y\in S} (y_n - \tilde{\mu}_{mn})^2 q(m|y),

q(m|y) = q(m|x), \quad y = T(x), \; x \in S, \; m \in \mathcal{M}.
A6: Monotonic Property of Structural EM Algorithm
A structural mixture is a special case of the product mixture model, i.e.

w'_m = \frac{1}{|S|}\sum_{x\in S} q(m|x), \quad f'_n(\cdot|m) = \arg\max_{f_n(\cdot|m)} \sum_{x\in S} \frac{q(m|x)}{w'_m|S|}\log f_n(x_n|m)

It is necessary to prove that the monotonic property holds for the optimized structural parameters \phi_{mn}. We use the inequality

L' - L \ge \frac{1}{|S|}\sum_{x\in S}\sum_{m\in\mathcal{M}} q(m|x)\log\left[\frac{F'(x|m)}{F(x|m)}\right]

and, substituting for F'(x|m), F(x|m), we obtain:

L' - L \ge \frac{1}{|S|}\sum_{x\in S}\sum_{m\in\mathcal{M}} q(m|x)\log\left[\frac{G'(x|m, \phi'_m)}{G(x|m, \phi_m)}\right]

L' - L \ge \frac{1}{|S|}\sum_{x\in S}\sum_{m\in\mathcal{M}}\sum_{n\in\mathcal{N}} q(m|x)\log\frac{\left[f'_n(x_n|m)/f_n(x_n|0)\right]^{\phi'_{mn}}}{\left[f_n(x_n|m)/f_n(x_n|0)\right]^{\phi_{mn}}}
Monotonic Property of Structural EM Algorithm
The last inequality can be rewritten in the form:

(*) \quad L' - L \ge \sum_{m\in\mathcal{M}}\sum_{n\in\mathcal{N}} (\phi'_{mn} - \phi_{mn})\,\gamma'_{mn} + \sum_{m\in\mathcal{M}}\sum_{n\in\mathcal{N}} \frac{\phi_{mn}}{|S|}\sum_{x\in S} q(m|x)\log\frac{f'_n(x_n|m)}{f_n(x_n|m)}

where \gamma'_{mn} is the structural optimization criterion:

\gamma'_{mn} = \frac{1}{|S|}\sum_{x\in S} q(m|x)\log\frac{f'_n(x_n|m)}{f_n(x_n|0)}, \quad n \in \mathcal{N}, \; m \in \mathcal{M}

In view of the above definition of f'_n(\cdot|m), we can write for arbitrary f_n(\cdot|m):

\frac{1}{|S|}\sum_{x\in S} q(m|x)\log f'_n(x_n|m) \ge \frac{1}{|S|}\sum_{x\in S} q(m|x)\log f_n(x_n|m)

Therefore the last sum in inequality (*) is non-negative and, for the same reason, \gamma'_{mn} \ge 0 for all n \in \mathcal{N}, m \in \mathcal{M}.

By setting \phi'_{mn} = 1 for the R highest values of \gamma'_{mn}, we obtain

L' - L \ge \sum_{m\in\mathcal{M}}\sum_{n\in\mathcal{N}} (\phi'_{mn} - \phi_{mn})\,\gamma'_{mn} \ge 0, \quad q.e.d.
Interpretation of Structural Criterion - Discrete Mixture
f_n(x_n|m), x_n \in X_n, n \in \mathcal{N}: discrete probability distributions

\gamma'_{mn} = \frac{1}{|S|}\sum_{x\in S} q(m|x)\log\frac{f'_n(x_n|m)}{f_n(x_n|0)}, \quad n \in \mathcal{N}, \; m \in \mathcal{M}

Using \sum_{\xi\in X_n} \delta(\xi, x_n) = 1, x_n \in X_n:

\gamma'_{mn} = \frac{1}{|S|}\sum_{x\in S} q(m|x)\left[\sum_{\xi\in X_n} \delta(\xi, x_n)\right]\log\frac{f'_n(x_n|m)}{f_n(x_n|0)}

\gamma'_{mn} = \frac{1}{|S|}\sum_{\xi\in X_n}\left[\sum_{x\in S} \delta(\xi, x_n)\, q(m|x)\right]\log\frac{f'_n(\xi|m)}{f_n(\xi|0)}

\gamma'_{mn} = w'_m \sum_{\xi\in X_n} f'_n(\xi|m)\log\frac{f'_n(\xi|m)}{f_n(\xi|0)} = w'_m\, I(f'_n(\cdot|m), f_n(\cdot|0))

\gamma'_{mn} ≈ Kullback-Leibler information divergence
Interpretation of Structural Criterion - Gaussian Mixture
Gaussian densities: f_n(x_n|\mu_{mn}, \sigma_{mn}) = \frac{1}{\sqrt{2\pi}\,\sigma_{mn}}\exp\left\{-\frac{(x_n-\mu_{mn})^2}{2\sigma_{mn}^2}\right\}

\gamma'_{mn} = \frac{1}{|S|}\sum_{x\in S} q(m|x)\log\frac{f_n(x_n|\mu'_{mn}, \sigma'_{mn})}{f_n(x_n|\mu_{0n}, \sigma_{0n})}, \quad n \in \mathcal{N}, \; m \in \mathcal{M},

\gamma'_{mn} = w'_m \sum_{x\in S}\frac{q(m|x)}{w'_m|S|}\left[-\log\frac{\sigma'_{mn}}{\sigma_{0n}} - \frac{(x_n-\mu'_{mn})^2}{2(\sigma'_{mn})^2} + \frac{(x_n-\mu_{0n})^2}{2(\sigma_{0n})^2}\right],

\gamma'_{mn} = \frac{w'_m}{2}\left[\frac{(\mu'_{mn}-\mu_{0n})^2}{\sigma_{0n}^2} + \frac{(\sigma'_{mn})^2}{\sigma_{0n}^2} - 1 - \log\frac{(\sigma'_{mn})^2}{\sigma_{0n}^2}\right] =

(as is easily verified)

= w'_m \int_{X_n} f_n(x_n|\mu'_{mn}, \sigma'_{mn})\log\frac{f_n(x_n|\mu'_{mn}, \sigma'_{mn})}{f_n(x_n|\mu_{0n}, \sigma_{0n})}\, dx_n = w'_m\, I(f'_n(\cdot|m), f_n(\cdot|0))

\Rightarrow \gamma'_{mn} ≈ "continuous" Kullback-Leibler information divergence
A7: Non-Identifiability of Discrete Product Mixtures
Definition of identifiability of mixtures (Teicher, 1963)

The class of mixtures \mathcal{P} = \{P(x, \theta) : \theta \in \Theta\} is identifiable if the parameters \theta, \theta' \in \Theta of any two equivalent mixtures

P(x, \theta) = P(x, \theta'), \quad \forall x \in X,

may differ only by the order of components.

Theorem (Grim, 2001; cf. Teicher, 1963, 1968; Gyllenberg et al., 1994)

An arbitrary discrete product mixture (x_n \in X_n, |X_n| < \infty)

P(x) = \sum_{m\in\mathcal{M}} w_m F(x|m) = \sum_{m\in\mathcal{M}} w_m \prod_{n\in\mathcal{N}} f_n(x_n|m)

has infinitely many equivalent forms with different parameters if at least one of the univariate component distributions f_i(x_i|m) is nonsingular, i.e. satisfies the condition

0 < f_i(x_i|m) < 1 for some x_i \in X_i.
Proof: Non-Identifiability of Discrete Product Mixtures
Proof: Let 0 < f_i(x_i|m) < 1 for some i \in \mathcal{N}, x_i \in X_i and m \in \mathcal{M}. Then, for any 0 < \alpha < 1, \beta = 1 - \alpha, we can construct two different probability distributions f'_i(\cdot|m), f''_i(\cdot|m) in such a way that the distribution f_i(\cdot|m) represents an interior point of the segment \langle f'_i(\cdot|m), f''_i(\cdot|m)\rangle in the |X_i|-dimensional space, in the sense of the following condition:

(*) \quad f_i(\xi|m) = \alpha f'_i(\xi|m) + \beta f''_i(\xi|m), \quad \xi \in X_i.

Consequently, the nonsingular probability distribution f_i(\cdot|m) can be expressed as a convex combination of two distributions f'_i(\cdot|m), f''_i(\cdot|m) in infinitely many ways. Using the above substitution (*) we can write

(**) \quad w_m F(x|m) = w'_m F'(x|m) + w''_m F''(x|m),

where

w'_m = \alpha w_m, \quad w''_m = \beta w_m, \quad (w'_m + w''_m = w_m),

F'(x|m) = f'_i(x_i|m) \prod_{n\in\mathcal{N},\, n\neq i} f_n(x_n|m), \quad F''(x|m) = f''_i(x_i|m) \prod_{n\in\mathcal{N},\, n\neq i} f_n(x_n|m).

Finally, substituting (**) for w_m F(x|m), we obtain a non-trivially different equivalent of the original distribution P(x), q.e.d.
A8: Alternative Proof of the EM Monotonic Property
The Kullback-Leibler information divergence is non-negative, i.e.

I(q(\cdot|x), q'(\cdot|x)) = \sum_{m\in\mathcal{M}} q(m|x)\log\frac{q(m|x)}{q'(m|x)} \ge 0.

The following proof follows the original idea of Schlesinger. Using the notation

L = \frac{1}{|S|}\sum_{x\in S}\log\left[\sum_{m\in\mathcal{M}} w_m F(x|m)\right], \quad q(m|x) = \frac{w_m F(x|m)}{\sum_{j=1}^{M} w_j F(x|j)},

we can express the log-likelihood functions L and L' equivalently by means of the conditional weights q(m|x), q'(m|x):

L = \frac{1}{|S|}\sum_{x\in S}\left\{\sum_{m\in\mathcal{M}} q(m|x)\log[w_m F(x|m)] - \sum_{m\in\mathcal{M}} q(m|x)\log q(m|x)\right\}

L' = \frac{1}{|S|}\sum_{x\in S}\left\{\sum_{m\in\mathcal{M}} q(m|x)\log[w'_m F'(x|m)] - \sum_{m\in\mathcal{M}} q(m|x)\log q'(m|x)\right\}
Alternative Proof of the EM Monotonic Property
Using the above equations, we can express the increment L' - L as follows:

L' - L = \frac{1}{|S|}\sum_{x\in S}\left\{\sum_{m\in\mathcal{M}} q(m|x)\log\left[\frac{w'_m F'(x|m)}{w_m F(x|m)}\right] + \sum_{m\in\mathcal{M}} q(m|x)\log\frac{q(m|x)}{q'(m|x)}\right\}

where the second sum on the right-hand side is the non-negative Kullback-Leibler divergence:

L' - L = \frac{1}{|S|}\sum_{x\in S}\left\{\sum_{m\in\mathcal{M}} q(m|x)\log\left[\frac{w'_m F'(x|m)}{w_m F(x|m)}\right] + I(q(\cdot|x), q'(\cdot|x))\right\}

and therefore we can write the inequalities:

L' - L \ge \frac{1}{|S|}\sum_{x\in S}\sum_{m\in\mathcal{M}} q(m|x)\log\left[\frac{w'_m F'(x|m)}{w_m F(x|m)}\right]

L' - L \ge \sum_{m\in\mathcal{M}}\left[\frac{1}{|S|}\sum_{x\in S} q(m|x)\right]\log\frac{w'_m}{w_m} + \frac{1}{|S|}\sum_{m\in\mathcal{M}}\sum_{x\in S} q(m|x)\log\frac{F'(x|m)}{F(x|m)}
Alternative Proof of the EM Monotonic Property
Substituting for w'_m from the M-step we obtain the inequality

\sum_{m\in\mathcal{M}}\left[\frac{1}{|S|}\sum_{x\in S} q(m|x)\right]\log\frac{w'_m}{w_m} = \sum_{m\in\mathcal{M}} w'_m\log\frac{w'_m}{w_m} \ge 0.

Further, in view of the M-step definition

F'(\cdot|m) = \arg\max_{F(\cdot|m)} \sum_{x\in S}\frac{q(m|x)}{w'_m|S|}\log F(x|m),

we can write for any component F(x|m) the inequality:

(*) \quad \sum_{x\in S} q(m|x)\log F'(x|m) \ge \sum_{x\in S} q(m|x)\log F(x|m), \quad m \in \mathcal{M}.

The monotonic property of EM algorithm follows from the above inequalities:

L' - L \ge \sum_{m\in\mathcal{M}} w'_m\log\frac{w'_m}{w_m} + \frac{1}{|S|}\sum_{m\in\mathcal{M}}\sum_{x\in S} q(m|x)\log\frac{F'(x|m)}{F(x|m)} \ge 0

Remark: The M-step definition is redundantly strong; the new parameters need to satisfy only the inequalities (*) ⇒ GEM algorithm.
A9: Monotonic Property of EM Algorithm - Implications
A non-decreasing sequence \{L^{(t)}\}_{t=0}^{\infty} bounded above has a finite limit L^* < \infty, and therefore the following necessary condition is satisfied:

\lim_{t\to\infty} L^{(t)} = L^* < \infty \;\Rightarrow\; \lim_{t\to\infty}(L^{(t+1)} - L^{(t)}) = 0

Analogous conditions hold for the sequences \{w^{(t)}(m)\}_{t=0}^{\infty} and \{q^{(t)}(\cdot|x)\}_{t=0}^{\infty}, m \in \mathcal{M}, too:

\lim_{t\to\infty} \|w^{(t+1)}(m) - w^{(t)}(m)\| = 0, \quad \lim_{t\to\infty} \|q^{(t+1)}(m|x) - q^{(t)}(m|x)\| = 0.

The last limits follow from the inequality

L^{(t+1)} - L^{(t)} \ge I(w^{(t+1)}(\cdot)\,\|\,w^{(t)}(\cdot)) + \frac{1}{|S|}\sum_{x\in S} I(q^{(t)}(\cdot|x)\,\|\,q^{(t+1)}(\cdot|x))

and from the following general inequality (cf. Kullback, 1966):

\sum_{x\in X} P^*(x)\log\frac{P^*(x)}{P(x)} \ge \frac{1}{4}\left(\sum_{x\in X} |P^*(x) - P(x)|\right)^2 \ge \frac{1}{4}\|P^*(\cdot) - P(\cdot)\|^2
A10: M.-L. Estimates versus Approximation Problems
Lemma

The maximum-likelihood estimate asymptotically minimizes an upper bound of the Euclidean distance between the true discrete distribution P^*(\cdot) and its approximating estimate P(\cdot).

Proof: Asymptotically, for |S| \to \infty, we can write

\lim_{|S|\to\infty} \frac{1}{|S|}\sum_{x\in S}\log P(x) = \lim_{|S|\to\infty} \sum_{x\in\bar{X}} \gamma(x)\log P(x) = \sum_{x\in X} P^*(x)\log P(x),

where \gamma(x) \ge 0 is the relative frequency of the discrete vector x in the i.i.d. sequence S and P^* is the true probability distribution. The assertion follows from the inequality (cf. Kullback, 1966):

\sum_{x\in X} P^*(x)\log\frac{P^*(x)}{P(x)} \ge \frac{1}{4}\left(\sum_{x\in X} |P^*(x) - P(x)|\right)^2 \ge \frac{1}{4}\|P^*(\cdot) - P(\cdot)\|^2

Remark: The m.-l. estimate P(\cdot) is thus justified as an approximation of P^*(\cdot).
A11: Kullback-Leibler Divergence is Non-Negative
Theorem (cf. e.g. Vajda, 1992)

Any two discrete probability distributions \{q_1, \dots, q_M\}, \{q'_1, \dots, q'_M\} satisfy the following inequality:

I(q\,\|\,q') = \sum_{m\in\mathcal{M}} q_m\log\frac{q_m}{q'_m} \ge 0,

where equality holds only if q'_m = q_m for all m \in \mathcal{M}.

Proof: Without any loss of generality we can assume q_m > 0 for all m \in \mathcal{M} (since 0\log 0 = 0 asymptotically). By Jensen's inequality we have:

\sum_{m\in\mathcal{M}} q_m\log\frac{q'_m}{q_m} \le \log\left(\sum_{m\in\mathcal{M}} q_m\frac{q'_m}{q_m}\right) = \log\left(\sum_{m\in\mathcal{M}} q'_m\right) = \log 1 = 0,

where the equality occurs only if q'_1/q_1 = \dots = q'_M/q_M, q.e.d.

Consequence: The following left-hand sum is maximized by q' = q:

\sum_{m\in\mathcal{M}} q_m\log q'_m \le \sum_{m\in\mathcal{M}} q_m\log q_m
A12: Universality of Discrete Product Mixtures
Lemma (see e.g. Grim, 2006)

Let the table values p^{(k)}, k = 1, \dots, K, K = |X|, define a probability distribution P(x) on a discrete space X:

P(x^{(k)}) = p^{(k)}, \quad x^{(k)} \in X, \; k = 1, \dots, K, \quad X = \bigcup_{k=1}^{K}\{x^{(k)}\}

Then the discrete probability distribution P(x) can be expressed as a product distribution mixture by using \delta-functions in the product components:

P(x) = \sum_{k=1}^{K} w_k F(x|k) = \sum_{k=1}^{K} p^{(k)} \prod_{n\in\mathcal{N}} \delta(x_n, x_n^{(k)}), \quad x \in X.

Proof: The products of \delta-functions in the components uniquely define the points x^{(k)} \in X corresponding to the respective probabilistic table values p^{(k)}:

F(x|k) = \prod_{n\in\mathcal{N}} \delta(x_n, x_n^{(k)}), \quad w_k = p^{(k)}, \; k = 1, \dots, K.

Remark: The proof has only formal meaning; the mixture approximation based on EM algorithm is numerically more efficient.
A13: EM algorithm for Multivariate Bernoulli Mixtures
example of EM algorithm: multivariate Bernoulli mixture (the C-pseudocode listing shown on the original slide is not reproduced in this transcript)
A14: EM algorithm for Gaussian Product Mixtures
example of EM algorithm: multivariate Gaussian product mixture (the C-pseudocode listing shown on the original slide is not reproduced in this transcript)
Remark: a possible solution of the "underflow" problem.
Prof. M.I. Schlesinger with his wife
At Karlštejn Castle during his visit to Prague in 1995.
Literature 1/12
Ajvazjan S.A., Bezhaeva Z.I., Staroverov O.V. (1974): Classification of Multivariate Observations (in Russian). Moscow: Statistika.

Boyles R.A. (1983): On the convergence of the EM algorithm. J. Roy. Statist. Soc. B, Vol. 45, pp. 47-50.

Cacoullos T. (1966): Estimation of a multivariate density. Ann. Inst. Stat. Math., Vol. 18, pp. 179-190.

Carreira-Perpiñán M.Á., Renals S. (2000): Practical identifiability of finite mixtures of multivariate Bernoulli distributions. Neural Computation, Vol. 12, pp. 141-152.

Day N.E. (1969): Estimating the components of a mixture of normal distributions. Biometrika, Vol. 56, pp. 463-474.

Dempster A.P., Laird N.M. and Rubin D.B. (1977): Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, Vol. 39, pp. 1-38.
Literature 2/12
Duda R.O., Hart P.E. (1973): Pattern Classification and Scene Analysis. New York: Wiley-Interscience.

Everitt B.S. and Hand D.J. (1981): Finite Mixture Distributions. Chapman & Hall: London, 1981.

Grim J. (1982): On numerical evaluation of maximum-likelihood estimates for finite mixtures of distributions. Kybernetika, Vol. 18, No. 3, pp. 173-190. http://dml.cz/dmlcz/124132

Grim J. (1982): Design and optimization of multilevel homogeneous structures for multivariate pattern recognition. In: Fourth FORMATOR Symposium 1982, Academia, Prague 1982, pp. 233-240.

Grim J. (1984): On structural approximating multivariate discrete probability distributions. Kybernetika, Vol. 20, No. 1, pp. 1-17. http://dml.cz/dmlcz/125676

Grim J. (1986): Multivariate statistical pattern recognition with nonreduced dimensionality. Kybernetika, Vol. 22, pp. 142-157. http://dml.cz/dmlcz/125022
Literature 3/12
Grim J. (1986): Sequential decision-making in pattern recognition based on the method of independent subspaces. In: Proceedings of the DIANA II Conference on Discriminant Analysis (Ed. F. Zitek), Mathematical Institute of the AS CR, Prague 1986, pp. 139-149.

Grim J. (1994): Knowledge representation and uncertainty processing in the probabilistic expert system PES. International Journal of General Systems, Vol. 22, No. 2, pp. 103-111.

Grim J. (1992): A dialog presentation of census results by means of the probabilistic expert system PES. In: Proceedings of the Eleventh European Meeting on Cybernetics and Systems Research (Ed. R. Trappl), Vienna, April 1992, World Scientific, Singapore 1992, pp. 997-1005. (Paper Award)

Grim J. and Bocek P. (1995): Statistical model of Prague households for interactive presentation of census data. In: SoftStat'95. Advances in Statistical Software 5, pp. 271-278, Lucius & Lucius: Stuttgart, 1996.

Grim J. (1996): Maximum likelihood design of layered neural networks. In: Proceedings of the 13th International Conference on Pattern Recognition IV (pp. 85-89), Los Alamitos: IEEE Computer Society Press.
Literature 4/12
Grim J. (1996a): Design of multilayer neural networks by information preserving transforms. In: E. Pessa, M.P. Penna, A. Montesanto (Eds.), Proceedings of the Third European Congress on System Science (pp. 977-982), Roma: Edizioni Kappa.

Grim J. (1998): A sequential modification of EM algorithm. In: Studies in Classification, Data Analysis and Knowledge Organization, Gaul W., Locarek-Junge H. (Eds.), pp. 163-170, Springer, 1999.

Grim J., Somol P., Novovicova J., Pudil P. and Ferri F. (1998b): Initializing normal mixtures of densities. In: Proc. 14th Int. Conf. on Pattern Recognition ICPR'98, A.K. Jain, S. Venkatesh, B.C. Lovell (Eds.), pp. 886-890, IEEE Computer Society: Los Alamitos, California, 1998.

Grim J. (1999): Information approach to structural optimization of probabilistic neural networks. In: Proceedings of the 4th System Science European Congress, L. Ferrer et al. (Eds.), pp. 527-540, Valencia: Sociedad Espanola de Sistemas Generales, 1999.

Grim J. (2000): Self-organizing maps and probabilistic neural networks. Neural Network World, Vol. 10, No. 3, pp. 407-415. (Paper Award)
Literature 5/12
Grim J., Kittler J., Pudil P. and Somol P. (2000): Combining multiple classifiers in probabilistic neural networks. In: Multiple Classifier Systems, Eds. Kittler J., Roli F., Springer, 2000, pp. 157-166.

Grim J., Pudil P. and Somol P. (2000): Recognition of handwritten numerals by structural probabilistic neural networks. In: Proceedings of the Second ICSC Symposium on Neural Computation, Berlin, 2000 (Bothe H., Rojas R., Eds.), ICSC, Wetaskiwin, 2000, pp. 528-534. (Paper Award)

Grim J., Kittler J., Pudil P. and Somol P. (2001): Information analysis of multiple classifier fusion. In: Multiple Classifier Systems 2001, Kittler J., Roli F. (Eds.), Lecture Notes in Computer Science, Vol. 2096, Springer-Verlag, Berlin, Heidelberg, New York 2001, pp. 168-177.

Grim J., Bocek P. and Pudil P. (2001): Safe dissemination of census results by means of interactive probabilistic models. In: Proceedings of the ETK-NTTS 2001 Conference (Hersonissos (Crete), June 18-22, 2001), Vol. 2, pp. 849-856, European Communities 2001.
Literature 6/12
Grim J. (2001): Latent Structure Analysis for Categorical Data. Research Report No. 2019, ÚTIA AV ČR, Praha 2001, 13 pp.

Grim J., Kittler J., Pudil P. and Somol P. (2002): Multiple classifier fusion in probabilistic neural networks. Pattern Analysis & Applications, Vol. 5, No. 7, pp. 221-233.

Grim J. and Haindl M. (2003): Texture modelling by discrete distribution mixtures. Computational Statistics and Data Analysis, Vol. 41, No. 3-4, pp. 603-615.

Grim J., Just P. and Pudil P. (2003): Strictly modular probabilistic neural networks for pattern recognition. Neural Network World, Vol. 13, No. 6, pp. 599-615.

Grim J., Somol P., Pudil P. and Just P. (2003): Probabilistic neural network playing a simple game. In: Artificial Neural Networks in Pattern Recognition (Marinai S., Gori M., Eds.), University of Florence, Florence 2003, pp. 132-138.
Literature 7/12
Grim J., Hora J. and Pudil P. (2004): Interaktivní reprodukce výsledků sčítání lidu se zaručenou ochranou anonymity dat [Interactive reproduction of census results with guaranteed protection of data anonymity]. Statistika, Vol. 84, No. 5, pp. 400-414.

Grim J., Haindl M., Somol P., Pudil P. and Kudo M. (2004): A Gaussian mixture-based colour texture model. In: Proc. of the 17th International Conference on Pattern Recognition, IEEE, Los Alamitos 2004, pp. 177-180.

Grim J., Somol P., Haindl M. and Pudil P. (2005): A statistical approach to local evaluation of a single texture image. In: Proceedings of the 16th Annual Symposium PRASA 2005 (Nicolls F., Ed.), University of Cape Town, 2005, pp. 171-176.

Grim J., Haindl M., Pudil P. and Kudo M. (2005): A hybrid BTF model based on Gaussian mixtures. In: Texture 2005. Proceedings of the 4th International Workshop on Texture Analysis (Chantler M., Drbohlav O., Eds.), IEEE, Los Alamitos 2005, pp. 95-100.

Grim J. (2006): EM cluster analysis for categorical data. In: Structural, Syntactic and Statistical Pattern Recognition (Yeung D.Y., Kwok J.T., Fred A., Eds.), LNCS 4109, Springer, Berlin 2006, pp. 640-648.
Literature 8/12
Grim J. (2007): Neuromorphic features of probabilistic neural networks. Kybernetika, Vol. 43, No. 5, pp. 697-712. http://dml.cz/dmlcz/135807

Grim J. and Hora J. (2008): Iterative principles of recognition in probabilistic neural networks. Neural Networks, Special Issue, Vol. 21, No. 6, pp. 838-846. (Paper Award)

Grim J., Novovicova J. and Somol P. (2008): Structural Poisson mixtures for classification of documents. In: Proceedings of the 19th International Conference on Pattern Recognition, Tampa (Florida), US, pp. 1324-1327.

Grim J., Somol P., Haindl M. and Danes J. (2009): Computer-aided evaluation of screening mammograms based on local texture models. IEEE Trans. on Image Processing, Vol. 18, No. 4, pp. 765-773. (Paper Award)

Grim J., Hora J., Bocek P., Somol P. and Pudil P. (2010): Statistical model of the 2001 Czech census for interactive presentation. Journal of Official Statistics, Vol. 26, No. 4, pp. 673-694. (Paper Award)
Literature 9/12
Grim J., Somol P. and Pudil P. (2010): Digital image forgery detection by local statistical models. In: Proc. 2010 Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Los Alamitos, IEEE Computer Society, Echizen I. et al. (Eds.), pp. 579-582.

Grim J. (2011): Preprocessing of screening mammograms based on local statistical models. In: Proceedings of the 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies, ISABEL 2011, Barcelona, ACM, pp. 1-5.

Grim J. (2014): Sequential pattern recognition by maximum conditional informativity. Pattern Recognition Letters, Vol. 45C, pp. 39-45. http://dx.doi.org/10.1016/j.patrec.2014.02.024 (Paper Award)

Grim J. (2017): Approximation of unknown multivariate probability distributions by using mixtures of product components: a tutorial. International Journal of Pattern Recognition and Artificial Intelligence, to appear.

Gyllenberg M., Koski T., Reilink E. and Verlaan M. (1994): Non-uniqueness in probabilistic numerical identification of bacteria. Journal of Applied Probability, Vol. 31, pp. 542-548.
Literature 10/12
Hasselblad V. (1966): Estimation of parameters for a mixture of normal distributions. Technometrics, Vol. 8, pp. 431-444.

Hasselblad V. (1969): Estimation of finite mixtures of distributions from the exponential family. Journal of Amer. Statist. Assoc., Vol. 64, pp. 1459-1471.

Isaenko O.K. and Urbakh K.I. (1976): Decomposition of probability distribution mixtures into their components (in Russian). In: Theory of Probability, Mathematical Statistics and Theoretical Cybernetics, Vol. 13, Moscow: VINITI.

Kullback S. (1966): An information-theoretic derivation of certain limit relations for a stationary Markov chain. SIAM J. Control, Vol. 4, No. 3, pp. 454-459.

McLachlan G.J. and Krishnan T. (1997): The EM Algorithm and Extensions, John Wiley & Sons, New York.

McLachlan G.J. and Peel D. (2000): Finite Mixture Models, John Wiley & Sons, New York, Toronto.
Literature 11/12
Meng X.L. and Van Dyk D. (1997): The EM Algorithm—an Old Folk-song Sung to a Fast New Tune. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 59, No. 3, pp. 511-567.

Parzen E. (1962): On estimation of a probability density function and its mode. Annals of Mathematical Statistics, Vol. 33, pp. 1065-1076.

Pearson K. (1894): Contributions to the mathematical theory of evolution. I. Dissection of frequency curves. Philosophical Transactions of the Royal Society of London, 185, 71-110.

Peters B.C. and Walker H.F. (1978): An iterative procedure for obtaining maximum-likelihood estimates of the parameters for a mixture of normal distributions. SIAM Journal Appl. Math., Vol. 35, No. 2, pp. 362-378.

Teicher H. (1963): Identifiability of finite mixtures. Ann. Math. Statist., Vol. 34, pp. 1265-1269.

Teicher H. (1968): Identifiability of mixtures of product measures. Ann. Math. Statist., Vol. 39, pp. 1300-1302.
Literature 12/12
Schlesinger M.I. (1968): Relation between learning and self learning in pattern recognition (in Russian). Kibernetika (Kiev), No. 2, pp. 81-88.

Titterington D.M., Smith A.F.M. and Makov U.E. (1985): Statistical Analysis of Finite Mixture Distributions, John Wiley & Sons: Chichester, New York.

Vajda I. and Grim J. (1998): About the maximum information and maximum likelihood principles in neural networks. Kybernetika, Vol. 34, No. 4, pp. 485-494.

Wolfe J.H. (1970): Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research, Vol. 5, pp. 329-350.

Wu C.F.J. (1983): On the convergence properties of the EM algorithm. Ann. Statist., Vol. 11, pp. 95-103.

Xu L. and Jordan M.I. (1996): On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, Vol. 8, pp. 129-151.
Paper Award
Eleventh European Meeting on Cybernetics and Systems Research, Vienna, April 1992
Paper Award
Second ICSC Symposium on Neural Computation, Berlin, 2000
Paper Award
IEEE Transactions on Image Processing, 18(4): 765-773, 2009
Paper Award
Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIH-MSP, Darmstadt, 2010
Paper Award
Pattern Recognition Letters, Vol. 45C, pp. 39-45, 2014
Paper Award
Neural Networks, 21(6): 838-846, 2008
Paper Award
Neural Network World, Vol. 10, No. 3, pp. 407-415, 2000