Slide 1: Parameter Estimation
Shyh-Kang Jeng
Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia, National Taiwan University
Slide 2: Typical Classification Problem
- Rarely know the complete probabilistic structure of the problem
- Have vague, general knowledge
- Have a number of design samples or training data as representatives of the patterns for classification
- Find some way to use this information to design or train the classifier
Slide 3: Estimating Probabilities
- Not difficult to estimate prior probabilities
- Hard to estimate class-conditional densities
  – The number of available samples always seems too small
  – Serious when the dimensionality is large
Slide 4: Estimating Parameters
- Many problems permit us to parameterize the conditional densities
- This simplifies the problem from estimating an unknown function to estimating the parameters
  – e.g., the mean vector and covariance matrix for a multivariate normal distribution
Slide 5: Maximum-Likelihood Estimation
- View the parameters as quantities whose values are fixed but unknown
- The best estimate is the one that maximizes the probability of obtaining the samples actually observed
- Nearly always has good convergence properties as the number of samples increases
- Often simpler than alternative methods
Slide 6: I.I.D. Random Variables
- Separate the data into D_1, ..., D_c
- Samples in D_j are drawn independently according to p(x|ω_j)
- Such samples are independent and identically distributed (i.i.d.) random variables
- Let p(x|ω_j) have a known parametric form, determined uniquely by a parameter vector θ_j, i.e., p(x|ω_j) = p(x|ω_j, θ_j)
Slide 7: Simplification Assumptions
- Samples in D_i give no information about θ_j if i is not equal to j
- Can work with each class separately
- Have c separate problems of the same form:
  – Use a set D of i.i.d. samples from p(x|θ) to estimate the unknown parameter vector θ
Slide 8: Maximum-Likelihood Estimate
Let D contain n i.i.d. samples x_1, ..., x_n. The likelihood of θ with respect to D is
  p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
The maximum-likelihood estimate \hat{\theta} is the value of θ that maximizes p(D \mid \theta).
Slide 9: Maximum-Likelihood Estimation
[figure]
Slide 10: A Note
- The likelihood p(D|θ), as a function of θ, is not a probability density function of θ
- Its area over the θ-domain has no significance
- The likelihood p(D|θ) can be regarded as the probability of D for a given θ
Slide 11: Analytical Approach
Log-likelihood function, with θ = (θ_1, ..., θ_p)^t:
  l(\theta) = \ln p(D \mid \theta), \qquad \hat{\theta} = \arg\max_\theta l(\theta)
  l(\theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta), \qquad \nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln p(x_k \mid \theta)
Necessary condition for \hat{\theta}:
  \nabla_\theta l = 0
Slide 12: MAP Estimators
- Maximum a posteriori (MAP) estimator: find \hat{\theta} that maximizes l(\theta) + \ln p(\theta)
  – p(θ): prior probability of different parameter values
- A maximum-likelihood (ML) estimator is a MAP estimator for the uniform prior
Slide 13: Gaussian Case: Unknown μ
  \ln p(x_k \mid \mu) = -\frac{1}{2}\ln\left[(2\pi)^d |\Sigma|\right] - \frac{1}{2}(x_k-\mu)^t \Sigma^{-1}(x_k-\mu)
  \nabla_\mu \ln p(x_k \mid \mu) = \Sigma^{-1}(x_k-\mu)
Setting \sum_{k=1}^{n} \Sigma^{-1}(x_k-\hat{\mu}) = 0 gives
  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k
Slide 14: Univariate Gaussian Case: Unknown μ and σ²
Let θ = (θ_1, θ_2)^t = (μ, σ²)^t:
  \ln p(x_k \mid \theta) = -\frac{1}{2}\ln(2\pi\theta_2) - \frac{1}{2\theta_2}(x_k-\theta_1)^2
Setting \nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln p(x_k \mid \theta) = 0 gives
  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2
Slide 15: Multivariate Gaussian Case: Unknown μ and Σ
  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k
  \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n} (x_k-\hat{\mu})(x_k-\hat{\mu})^t
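As a quick illustration, a minimal NumPy sketch of these two estimates (the names ml_gaussian, X, mu_hat, and Sigma_hat are mine, not from the slides):

  import numpy as np

  def ml_gaussian(X):
      """ML estimates for a multivariate Gaussian from an (n, d) sample matrix X."""
      n = X.shape[0]
      mu_hat = X.mean(axis=0)         # (1/n) sum_k x_k
      Xc = X - mu_hat                 # centered samples
      Sigma_hat = (Xc.T @ Xc) / n     # (1/n) sum_k (x_k - mu)(x_k - mu)^t
      return mu_hat, Sigma_hat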
Slide 16: Bias, Absolutely Unbiased, and Asymptotically Unbiased
The ML estimator for σ² is a biased estimator:
  E\left[\hat{\sigma}^2\right] = E\left[\frac{1}{n}\sum_{k=1}^{n}(x_k-\bar{x})^2\right] = \frac{n-1}{n}\sigma^2 \neq \sigma^2
An (absolutely) unbiased estimator for the covariance matrix:
  C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k-\hat{\mu})(x_k-\hat{\mu})^t
The ML estimator of Σ is asymptotically unbiased:
  \hat{\Sigma} = \frac{n-1}{n}\, C
Slide 17: Model Error
- For a reliable model, the ML classifier can give excellent results
- If the model is wrong, the ML classifier cannot get the best results, even within the assumed set of models
Slide 18: Bayesian Estimation (Bayesian Learning)
- Answers obtained are in general nearly identical to those from maximum likelihood
- Basic conceptual difference:
  – The parameter vector θ is a random variable
  – Use the training data to convert a distribution on this variable into a posterior probability density
Slide 19: Central Problem
Assume prior probabilities are easy to find. Given the sample D,
  P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)}
Let D be separated into D_1, ..., D_c. Samples in D_i do not affect p(x|ω_j, D) if i ≠ j, and P(ω_i|D) = P(ω_i), so
  P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\, P(\omega_j)}
Each class can be treated independently. Central problem of Bayesian learning: use a set D of samples drawn independently according to the fixed but unknown p(x) to determine p(x|D).
Slide 20: Parameter Distribution
- Assume p(x) has a known parametric form with parameter vector θ of unknown value
- Thus p(x|θ) is completely known
- Information about θ prior to observing the samples is contained in a known prior density p(θ)
- Observations convert p(θ) to p(θ|D), which should be sharply peaked about the true value of θ
Slide 21: Parameter Distribution
  p(x \mid D) = \int p(x, \theta \mid D)\, d\theta
  p(x, \theta \mid D) = p(x \mid \theta, D)\, p(\theta \mid D) = p(x \mid \theta)\, p(\theta \mid D)
  p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
If p(θ|D) peaks very sharply about some \hat{\theta}, then p(x \mid D) \approx p(x \mid \hat{\theta}).
Slide 22: Univariate Gaussian Case: p(μ|D)
  p(x \mid \mu) \sim N(\mu, \sigma^2), with μ the only unknown
Assume p(\mu) \sim N(\mu_0, \sigma_0^2), where μ_0, σ_0², and σ² are known (μ_0: best guess of μ; σ_0²: uncertainty about this guess). With D = {x_1, ..., x_n},
  p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)
  = \alpha' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]
  = \alpha'' \exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right]
Slide 23: Reproducing Density
  p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)  [reproducing density; c.f. p(μ): conjugate prior]
  \mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}
where \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k.
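A small sketch of this posterior update in Python, assuming a known σ² and a N(μ_0, σ_0²) prior (the function and variable names are illustrative):

  import numpy as np

  def posterior_params(x, mu0, sigma0_sq, sigma_sq):
      """Posterior N(mu_n, sigma_n^2) for the mean of a univariate Gaussian."""
      x = np.asarray(x, dtype=float)
      n = x.size
      mu_hat_n = x.mean()
      denom = n * sigma0_sq + sigma_sq
      mu_n = (n * sigma0_sq / denom) * mu_hat_n + (sigma_sq / denom) * mu0
      sigma_n_sq = sigma0_sq * sigma_sq / denom
      return mu_n, sigma_n_sq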
Slide 24: Bayesian Learning
[figure]
Slide 25: Dogmatism
- μ_n is a linear combination of \hat{\mu}_n and μ_0, and always lies somewhere between them
- The relative balance between prior knowledge and empirical data is set by the ratio of σ² to σ_0² (dogmatism)
- When the dogmatism is finite, μ_n will converge to \hat{\mu}_n, no matter what μ_0 and σ_0² are
Slide 26: Univariate Gaussian Case: p(x|D)
  p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu
  = \frac{1}{2\pi\sigma\sigma_n} \exp\left[-\frac{1}{2}\,\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}\right] f(\sigma, \sigma_n)
where
  f(\sigma, \sigma_n) = \int \exp\left[-\frac{1}{2}\,\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}\left(\mu - \frac{\sigma_n^2 x + \sigma^2 \mu_n}{\sigma^2+\sigma_n^2}\right)^2\right] d\mu
so
  p(x \mid D) \sim N(\mu_n, \sigma^2+\sigma_n^2)
Slide 27: Multivariate Gaussian Case
  p(x \mid \mu) \sim N(\mu, \Sigma), \qquad p(\mu) \sim N(\mu_0, \Sigma_0)
  p(\mu \mid D) = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu) = \alpha' \exp\left[-\frac{1}{2}(\mu-\mu_n)^t \Sigma_n^{-1} (\mu-\mu_n)\right]
  \mu_n = \Sigma_0\left(\Sigma_0+\frac{1}{n}\Sigma\right)^{-1}\hat{\mu}_n + \frac{1}{n}\Sigma\left(\Sigma_0+\frac{1}{n}\Sigma\right)^{-1}\mu_0
  \Sigma_n = \Sigma_0\left(\Sigma_0+\frac{1}{n}\Sigma\right)^{-1}\frac{1}{n}\Sigma, \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k
Slide 28: Multivariate Gaussian Case
  p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \sim N(\mu_n, \Sigma+\Sigma_n)
or, by letting x = y + μ with p(y) \sim N(0, \Sigma) and p(\mu \mid D) \sim N(\mu_n, \Sigma_n), so that p(x \mid D) \sim N(\mu_n, \Sigma+\Sigma_n).
Useful matrix identity:
  A(A+B)^{-1}B = B(A+B)^{-1}A = \left(A^{-1}+B^{-1}\right)^{-1}
Slide 29: Multivariate Bayesian Learning
[figure]
Slide 30: General Bayesian Estimation
  p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
  p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}
  p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
Slide 31: Recursive Bayesian Learning
With D^n = {x_1, ..., x_n}:
  p(D^n \mid \theta) = p(x_n \mid \theta)\, p(D^{n-1} \mid \theta)
  p(\theta \mid D^n) = \frac{p(D^n \mid \theta)\, p(\theta)}{\int p(D^n \mid \theta)\, p(\theta)\, d\theta}
  = \frac{p(x_n \mid \theta)\, p(D^{n-1} \mid \theta)\, p(\theta)}{\int p(x_n \mid \theta)\, p(D^{n-1} \mid \theta)\, p(\theta)\, d\theta}
  = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}
with p(\theta \mid D^0) = p(\theta).
Slide 32: Example 1: Recursive Bayes Learning
  p(x \mid \theta) \sim U(0, \theta): \quad p(x \mid \theta) = 1/\theta for 0 \le x \le \theta, 0 otherwise
  p(\theta) \sim U(0, 10), \qquad D = \{4, 7, 2, 8\}
  p(\theta \mid D^0) = p(\theta) \sim U(0, 10)
  p(\theta \mid D^1) \propto p(x_1{=}4 \mid \theta)\, p(\theta \mid D^0) = 1/\theta for 4 \le \theta \le 10, 0 otherwise
  p(\theta \mid D^2) \propto p(x_2{=}7 \mid \theta)\, p(\theta \mid D^1) = 1/\theta^2 for 7 \le \theta \le 10, 0 otherwise
  ...
  p(\theta \mid D^n) \propto 1/\theta^n for \max_k x_k \le \theta \le 10
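A rough numerical version of this recursion (grid-based, with unnormalized likelihoods; the grid resolution and names are my choices):

  import numpy as np

  # Recursive Bayes update for p(x|theta) ~ U(0, theta), prior p(theta) ~ U(0, 10).
  # After n samples the posterior is proportional to 1/theta^n on [max_k x_k, 10].
  theta = np.linspace(0.01, 10.0, 1000)
  post = np.ones_like(theta)                        # p(theta|D^0) ~ U(0, 10), unnormalized
  for x in [4, 7, 2, 8]:
      like = np.where(theta >= x, 1.0 / theta, 0.0) # p(x|theta)
      post = like * post
      post /= np.trapz(post, theta)                 # normalize p(theta|D^n)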
Slide 33: Example 1: Recursive Bayes Learning
[figure]
Slide 34: Example 1: Bayes vs. ML
[figure]
Slide 35: Identifiability
- p(x|θ) is identifiable:
  – The sequence of posterior densities p(θ|D^n) converges to a delta function
  – Only one θ causes p(x|θ) to fit the data
- On some occasions, more than one value of θ may yield the same p(x|θ):
  – p(θ|D^n) will peak near all θ that explain the data
  – The ambiguity is erased in the integration for p(x|D^n), which converges to p(x) whether or not p(x|θ) is identifiable
Slide 36: ML vs. Bayes Methods
- Computational complexity
- Interpretability
- Confidence in prior information
  – Form of the underlying distribution p(x|θ)
- Results differ when p(θ|D) is broad or asymmetric around the estimated θ
  – Bayes methods would exploit such information, whereas ML would not
Slide 37: Classification Errors
- Bayes or indistinguishability error
- Model error
- Estimation error
  – Parameters are estimated from a finite sample
  – Vanishes in the limit of infinite training data (ML and Bayes would then have the same total classification error)
Slide 38: Invariance and Non-informative Priors
- Guidance in creating priors
- Invariance
  – Translation invariance
  – Scale invariance
- Non-informative with respect to an invariance
  – Much better than accommodating arbitrary transformations in a MAP estimator
  – Of great use in Bayesian estimation
Slide 39: Gibbs Algorithm
  p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
Pick a θ_0 according to p(θ|D) and let p(x \mid D) \approx p(x \mid \theta_0)  [Gibbs algorithm]
Given weak assumptions, the misclassification error is at most twice the expected error of the Bayes optimal classifier.
Slide 40: Sufficient Statistics
- Statistic: any function of the samples
- Sufficient statistic s of samples D:
  – s contains all information relevant to estimating some parameter θ
  – Definition: p(D|s, θ) is independent of θ
  – If θ can be regarded as a random variable:
      p(\theta \mid s, D) = \frac{p(D \mid s, \theta)\, p(\theta \mid s)}{p(D \mid s)} = p(\theta \mid s)
Slide 41: Factorization Theorem
A statistic s is sufficient for θ if and only if P(D|θ) can be written as the product
  P(D \mid \theta) = g(s, \theta)\, h(D)
for some functions g(·,·) and h(·).
Slide 42: Example: Multivariate Gaussian
  p(x \mid \mu) \sim N(\mu, \Sigma)
  p(D \mid \mu) = \prod_{k=1}^{n} \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x_k-\mu)^t\Sigma^{-1}(x_k-\mu)\right]
  = \frac{1}{(2\pi)^{nd/2}|\Sigma|^{n/2}} \exp\left[-\frac{n}{2}\left(\mu^t\Sigma^{-1}\mu - 2\mu^t\Sigma^{-1}\left(\frac{1}{n}\sum_{k=1}^{n} x_k\right)\right)\right] \exp\left[-\frac{1}{2}\sum_{k=1}^{n} x_k^t\Sigma^{-1}x_k\right]
Thus s = \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k is sufficient for μ.
Slide 43: Proof of Factorization Theorem: The "Only If" Part
Suppose s is sufficient for θ, so P(D|s, θ) is independent of θ:
  P(D \mid \theta) = P(D, s \mid \theta) = P(D \mid s, \theta)\, P(s \mid \theta)
Take g(s, θ) = P(s|θ) and h(D) = P(D|s).
Slide 44: Proof of Factorization Theorem: The "If" Part
Let \bar{D} range over the data sets with s(\bar{D}) = s. Then
  P(s \mid \theta) = \sum_{\bar{D}:\, s(\bar{D})=s} P(\bar{D} \mid \theta) = g(s, \theta) \sum_{\bar{D}} h(\bar{D})
  P(D \mid s, \theta) = \frac{P(D, s \mid \theta)}{P(s \mid \theta)} = \frac{P(D \mid \theta)}{P(s \mid \theta)} = \frac{g(s,\theta)\, h(D)}{g(s,\theta)\sum_{\bar{D}} h(\bar{D})} = \frac{h(D)}{\sum_{\bar{D}} h(\bar{D})}
which is independent of θ, so s is sufficient for θ.
Slide 45: Kernel Density
- The factoring of P(D|θ) into g(s, θ)h(D) is not unique
  – If f(s) is any function, g'(s, θ) = f(s)g(s, θ) and h'(D) = h(D)/f(s) are equivalent factors
- The ambiguity is removed by defining the kernel density, invariant to such scaling:
  \bar{g}(s, \theta) = \frac{g(s, \theta)}{\int g(s, \theta')\, d\theta'}
Slide 46: Example: Multivariate Gaussian
  p(x \mid \mu) \sim N(\mu, \Sigma)
  p(D \mid \mu) = g(\hat{\mu}_n, \mu)\, h(D), \qquad s = \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k
  g(\hat{\mu}_n, \mu) = \exp\left[-\frac{n}{2}\left(\mu^t\Sigma^{-1}\mu - 2\mu^t\Sigma^{-1}\hat{\mu}_n\right)\right]
Kernel density:
  \bar{g}(\hat{\mu}_n, \mu) = \frac{1}{(2\pi)^{d/2}\left|\frac{1}{n}\Sigma\right|^{1/2}} \exp\left[-\frac{1}{2}(\mu-\hat{\mu}_n)^t\left(\frac{1}{n}\Sigma\right)^{-1}(\mu-\hat{\mu}_n)\right]
Slide 47: Kernel Density and Parameter Estimation
- Maximum likelihood: maximization of g(s, θ)
- Bayesian:
  p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta} = \frac{g(s, \theta)\, p(\theta)}{\int g(s, \theta')\, p(\theta')\, d\theta'}
  – If the prior knowledge of θ is vague, p(θ) tends to be uniform, and p(θ|D) is approximately the same as the kernel density
  – If p(x|θ) is identifiable, g(s, θ) peaks sharply at some value, and p(θ) is continuous as well as non-zero there, p(θ|D) approaches the kernel density
Slide 48: Sufficient Statistics for the Exponential Family
  p(x \mid \theta) = \alpha(x) \exp\left[a(\theta) + b(\theta)^t c(x)\right]
  p(D \mid \theta) = \prod_{k=1}^{n} \alpha(x_k)\, \exp\left[n\,a(\theta) + b(\theta)^t \sum_{k=1}^{n} c(x_k)\right] = g(s, \theta)\, h(D)
with
  s = \frac{1}{n}\sum_{k=1}^{n} c(x_k), \qquad g(s, \theta) = \exp\left[n\left(a(\theta) + b(\theta)^t s\right)\right], \qquad h(D) = \prod_{k=1}^{n} \alpha(x_k)
Slide 49: Error Rate and Dimensionality
Consider the two-class multivariate normal case p(x \mid \omega_j) \sim N(\mu_j, \Sigma), j = 1, 2. With equal prior probabilities, the Bayes error rate is
  P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad r^2 = (\mu_1-\mu_2)^t \Sigma^{-1} (\mu_1-\mu_2)
Suppose the features are statistically independent (the conditionally independent case), \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2):
  r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1}-\mu_{i2}}{\sigma_i}\right)^2
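Since the integral is a Gaussian tail, P(e) can be evaluated with the complementary error function; a minimal sketch (the helper name bayes_error is mine):

  from math import erfc, sqrt

  def bayes_error(r):
      """Bayes error for two equiprobable Gaussians at Mahalanobis distance r:
      (1/sqrt(2*pi)) * integral from r/2 to infinity of exp(-u^2/2) du."""
      return 0.5 * erfc(r / (2.0 * sqrt(2.0)))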
Slide 50: Accuracy and Dimensionality
[figure]
Slide 51: Effects of Additional Features
- In practice, beyond a certain point, inclusion of additional features leads to worse rather than better performance
- Sources of difficulty:
  – Wrong models
  – The number of design or training samples is finite, and thus the distributions are not estimated accurately
Slide 52: Computational Complexity for Maximum-Likelihood Estimation
  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k : O(nd)
  \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n} (x_k-\hat{\mu})(x_k-\hat{\mu})^t : O(nd^2)
  inverse of a d×d matrix: O(d^3)
  determinant of a d×d matrix: O(d^3)
Discriminant function:
  g(x) = -\frac{1}{2}(x-\hat{\mu})^t\hat{\Sigma}^{-1}(x-\hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)
Overall: O(nd) + O(nd^2) + O(d^3) + O(n) + O(1) = O(d^2 n) for n > d.
Slide 53: Computational Complexity for Classification
Given x:
- Compute the separation vector (x - \hat{\mu}): O(d)
- Multiply by the inverse covariance matrix: O(d^2)
- Decision \max_i g_i(x): O(c)
- Total for classification: O(d^2), simpler than learning
Slide 54: Approaches for Inadequate Samples
- Reduce the dimensionality
  – Redesign the feature extractor
  – Select an appropriate subset of features
  – Combine the existing features
- Pool the available data by assuming all classes share the same covariance matrix
- Look for a better estimate for Σ
  – Use the Bayesian estimate with a diagonal Σ_0
  – Threshold the sample covariance matrix
  – Assume statistical independence
Slide 55: Shrinkage (Regularized Discriminant Analysis)
i is an index on the categories in question; Σ is estimated by assuming the same covariance matrix for all classes.
"Shrink" the individual covariance matrices to the common one:
  \Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i \Sigma_i + \alpha\, n \Sigma}{(1-\alpha)\, n_i + \alpha\, n}, \qquad 0 < \alpha < 1
or "shrink" the common covariance toward the identity matrix:
  \Sigma(\beta) = (1-\beta)\Sigma + \beta I, \qquad 0 < \beta < 1
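Both shrinkage formulas translate directly to NumPy; a sketch in the slide's notation (the function names are illustrative):

  import numpy as np

  def shrink_covariance(Sigma_i, Sigma, n_i, n, alpha):
      """Shrink an individual covariance estimate toward the pooled one, 0 < alpha < 1."""
      return ((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma) / ((1 - alpha) * n_i + alpha * n)

  def shrink_to_identity(Sigma, beta):
      """Shrink the (pooled) covariance toward the identity matrix, 0 < beta < 1."""
      return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])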
Slide 56: Concept of Overfitting
[figure]
Slide 57: Best Representative Point
Given x_1, ..., x_n, find x_0 such that
  J_0(x_0) = \sum_{k=1}^{n} \|x_k - x_0\|^2
is minimized. With m = \frac{1}{n}\sum_{k=1}^{n} x_k,
  J_0(x_0) = \sum_{k=1}^{n} \|(x_0-m)-(x_k-m)\|^2 = n\|x_0-m\|^2 + \sum_{k=1}^{n} \|x_k-m\|^2
so x_0 = m minimizes J_0(x_0).
Slide 58: Projection Along a Line
[figure]
Slide 59: Best Projection to a Line Through the Sample Mean
Line: x = m + a e, with \|e\| = 1.
Represent x_k by m + a_k e. To minimize the squared error,
  J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n} \|(m + a_k e) - x_k\|^2
  = \sum_{k=1}^{n} a_k^2 \|e\|^2 - 2\sum_{k=1}^{n} a_k\, e^t(x_k-m) + \sum_{k=1}^{n} \|x_k-m\|^2
setting \partial J_1 / \partial a_k = 0 gives
  a_k = e^t(x_k - m)
Slide 60: Best Representative Direction
Find e to minimize
  J_1(e) = -\sum_{k=1}^{n} \left(e^t(x_k-m)\right)^2 + \sum_{k=1}^{n} \|x_k-m\|^2 = -e^t S e + \sum_{k=1}^{n} \|x_k-m\|^2
with the scatter matrix
  S = \sum_{k=1}^{n} (x_k-m)(x_k-m)^t
Maximize e^t S e subject to \|e\| = 1. Lagrange method: maximize u = e^t S e - \lambda(e^t e - 1):
  \nabla_e u = 2Se - 2\lambda e = 0 \Rightarrow Se = \lambda e
Slide 61: Principal Component Analysis (PCA)
Projection space:
  x = m + \sum_{i=1}^{d'} a_i e_i
Find e_i, i = 1, \ldots, d', to minimize
  J_{d'} = \sum_{k=1}^{n} \left\|\left(m + \sum_{i=1}^{d'} a_{ki} e_i\right) - x_k\right\|^2
e_1, \ldots, e_{d'} are the eigenvectors of S having the d' largest eigenvalues.
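A compact PCA sketch along these lines, via an eigendecomposition of the scatter matrix (the names pca, d_prime, E, and A are mine):

  import numpy as np

  def pca(X, d_prime):
      """Project the rows of X onto the d' scatter-matrix eigenvectors
      with the largest eigenvalues."""
      m = X.mean(axis=0)
      Xc = X - m
      S = Xc.T @ Xc                      # scatter matrix
      evals, evecs = np.linalg.eigh(S)   # eigh: S is symmetric
      order = np.argsort(evals)[::-1]    # largest eigenvalues first
      E = evecs[:, order[:d_prime]]      # principal directions e_1, ..., e_d'
      A = Xc @ E                         # coefficients a_ki = e_i^t (x_k - m)
      return m, E, A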
Slide 62: Concept of Fisher Linear Discriminant
[figure]
Slide 63: Fisher Linear Discriminant Analysis
Find w to get maximal separation of the projections
  y = w^t x
Sample means:
  m_i = \frac{1}{n_i}\sum_{x \in D_i} x, \quad i = 1, 2, \qquad \tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = w^t m_i
Within-class scatter:
  \tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2
To maximize
  J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}
Slide 64: Fisher Linear Discriminant Analysis
  S_i = \sum_{x \in D_i} (x-m_i)(x-m_i)^t, \qquad S_W = S_1 + S_2
  \tilde{s}_i^2 = w^t S_i w, \qquad \tilde{s}_1^2 + \tilde{s}_2^2 = w^t S_W w
  (\tilde{m}_1 - \tilde{m}_2)^2 = \left(w^t(m_1-m_2)\right)^2 = w^t S_B w, \qquad S_B = (m_1-m_2)(m_1-m_2)^t
Slide 65: Fisher Linear Discriminant Analysis
  J(w) = \frac{w^t S_B w}{w^t S_W w}  (generalized Rayleigh quotient)
J(w) is maximized when
  S_B w = \lambda S_W w  (generalized eigenvalue problem)
S_B w = (m_1-m_2)(m_1-m_2)^t w is always in the direction of (m_1-m_2), so
  w = S_W^{-1}(m_1-m_2)  [ignoring scales]
Slide 66: Fisher Linear Discriminant Analysis for Multivariate Normal
Assume the same covariance matrix Σ; the optimal decision boundary is
  w^t x + w_0 = 0, \qquad w = \Sigma^{-1}(\mu_1-\mu_2)
With estimation for \mu_1, \mu_2, and \Sigma:
  w = S_W^{-1}(m_1-m_2)  [solution to Fisher linear discriminant analysis]
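A minimal sketch of the Fisher solution w = S_W^{-1}(m_1 - m_2) for two sample matrices (names are illustrative; the final normalization is optional, since scale is ignored):

  import numpy as np

  def fisher_lda(X1, X2):
      """Fisher direction for two classes given as (n_i, d) row-sample matrices."""
      m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
      S1 = (X1 - m1).T @ (X1 - m1)
      S2 = (X2 - m2).T @ (X2 - m2)
      Sw = S1 + S2                        # within-class scatter
      w = np.linalg.solve(Sw, m1 - m2)    # solve S_W w = (m1 - m2)
      return w / np.linalg.norm(w)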
Slide 67: Concept of Multidimensional Discriminant Analysis
[figure]
Slide 68: Multiple Discriminant Analysis
Consider the c-class problem: projection from the d-dimensional space to a (c-1)-dimensional subspace:
  y_i = w_i^t x, \quad i = 1, \ldots, c-1, \qquad y = W^t x
  m_i = \frac{1}{n_i}\sum_{x \in D_i} x, \qquad \tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = W^t m_i
  S_W = \sum_{i=1}^{c} S_i, \qquad S_i = \sum_{x \in D_i} (x-m_i)(x-m_i)^t
  \tilde{S}_W = \sum_{i=1}^{c}\sum_{y \in Y_i} (y-\tilde{m}_i)(y-\tilde{m}_i)^t = W^t S_W W
Slide 69: Multiple Discriminant Analysis
Total mean and total scatter:
  m = \frac{1}{n}\sum_{x} x = \frac{1}{n}\sum_{i=1}^{c} n_i m_i, \qquad S_T = \sum_{x} (x-m)(x-m)^t
  S_T = \sum_{i=1}^{c}\sum_{x \in D_i} (x-m_i)(x-m_i)^t + \sum_{i=1}^{c} n_i (m_i-m)(m_i-m)^t = S_W + S_B
  S_B = \sum_{i=1}^{c} n_i (m_i-m)(m_i-m)^t
Slide 70: Multiple Discriminant Analysis
- Seek a transformation W to maximize the ratio of the between-class scatter to the within-class scatter
- A simple scalar measure of scatter is the determinant of the scatter matrix (equivalent to the product of the variances in the principal directions)
Letting \tilde{S}_B = W^t S_B W:
  J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^t S_B W|}{|W^t S_W W|}
Slide 71: Multiple Discriminant Analysis
- Columns of the optimal W satisfy
  S_B w_i = \lambda_i S_W w_i
  and are the generalized eigenvectors related to the largest eigenvalues
- The optimal W is not unique, since it can be multiplied by rotation or scaling matrices, etc.
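A sketch of this construction using SciPy's generalized symmetric eigensolver (it assumes S_W is nonsingular; all names are mine):

  import numpy as np
  from scipy.linalg import eigh

  def mda(class_samples):
      """Columns of W solve S_B w = lambda S_W w; keep the top c-1 eigenvectors."""
      c = len(class_samples)
      means = [X.mean(axis=0) for X in class_samples]
      ns = [X.shape[0] for X in class_samples]
      m = sum(n_i * m_i for n_i, m_i in zip(ns, means)) / sum(ns)
      Sw = sum((X - mi).T @ (X - mi) for X, mi in zip(class_samples, means))
      Sb = sum(n_i * np.outer(mi - m, mi - m) for n_i, mi in zip(ns, means))
      evals, evecs = eigh(Sb, Sw)          # generalized symmetric eigenproblem
      W = evecs[:, np.argsort(evals)[::-1][: c - 1]]
      return W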
Slide 72: Expectation-Maximization (EM)
- Finding the maximum-likelihood estimate of the parameters of an underlying distribution
  – from a given data set when the data is incomplete or has missing values
- Two main applications:
  – When the data indeed has missing values
  – When optimizing the likelihood function is analytically intractable, but the likelihood function can be simplified by assuming the existence of (and values for) additional but missing (or hidden) parameters
Slide 73: Expectation-Maximization (EM)
- Full sample D = {x_1, ..., x_n}, with x_k = {x_kg, x_kb}
- Separate the individual features into good (D_g) and bad/missing (D_b) parts; D is the union of D_g and D_b
- Form the function
  Q(\theta; \theta^i) = E_{D_b}\left[\ln p(D_g, D_b; \theta) \mid D_g; \theta^i\right]
Slide 74: Expectation-Maximization (EM)
begin initialize θ^0, T, i ← 0
  do i ← i + 1
    E step: compute Q(θ; θ^i)
    M step: θ^{i+1} ← arg max_θ Q(θ; θ^i)
  until Q(θ^{i+1}; θ^i) - Q(θ^i; θ^{i-1}) ≤ T
  return θ̂ ← θ^{i+1}
end
Slide 75: Expectation-Maximization (EM)
[figure]
Slide 76: Example: 2D Model
  D = \{x_1, x_2, x_3, x_4\} = \left\{\binom{0}{2}, \binom{1}{0}, \binom{2}{2}, \binom{*}{4}\right\}
where * is a missing feature value, so D_b = x_{41}.
Assume a 2D Gaussian model with diagonal covariance matrix:
  \theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)^t, \qquad \theta^0 = (0, 0, 1, 1)^t
Slide 77: Example: 2D Model
  Q(\theta; \theta^0) = E_{x_{41}}\left[\ln p(x_g, x_b; \theta) \mid \theta^0; D_g\right]
  = \int_{-\infty}^{\infty} \left[\sum_{k=1}^{3} \ln p(x_k \mid \theta) + \ln p\left(\binom{x_{41}}{4} \Big| \theta\right)\right] p(x_{41} \mid \theta^0; x_{42}{=}4)\, dx_{41}
where
  p(x_{41} \mid \theta^0; x_{42}{=}4) = \frac{1}{K}\, p\left(\binom{x_{41}}{4} \Big| \theta^0\right), \qquad K = \int p\left(\binom{x'_{41}}{4} \Big| \theta^0\right) dx'_{41}
Slide 78: Example: 2D Model
With θ^0 = (0, 0, 1, 1)^t the conditional density of x_{41} is a standard normal, so
  Q(\theta; \theta^0) = \sum_{k=1}^{3} \ln p(x_k \mid \theta) + \int_{-\infty}^{\infty} \ln p\left(\binom{x_{41}}{4} \Big| \theta\right) \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}x_{41}^2\right] dx_{41}
Maximizing Q(θ; θ^0) gives
  \theta^1 = (0.75,\ 2.0,\ 0.938,\ 2.0)^t
Slide 79: Example: 2D Model
After 3 iterations, the algorithm converges at
  \mu = \binom{1.0}{2.0}, \qquad \Sigma = \begin{pmatrix} 0.667 & 0 \\ 0 & 2.0 \end{pmatrix}
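A small script reproducing this example. The E-step expectations for the missing x_{41} follow from the diagonal (independent-feature) model, and the update below is my restatement of the M-step, not code from the slides:

  import numpy as np

  # Known entries: x1=(0,2), x2=(1,0), x3=(2,2); x4=(?, 4), first feature missing.
  known1 = np.array([0.0, 1.0, 2.0])     # first feature of x1..x3
  feat2 = np.array([2.0, 0.0, 2.0, 4.0]) # second feature, fully observed
  mu1, s1 = 0.0, 1.0                     # theta^0 = (0, 0, 1, 1)^t for feature 1

  for i in range(20):
      e_x41 = mu1                        # E-step: E[x41 | x42=4] = mu1 (features independent)
      e_x41_sq = s1 + mu1 ** 2           # E[x41^2] = sigma1^2 + mu1^2
      mu1_new = (known1.sum() + e_x41) / 4.0   # M-step
      s1 = (((known1 - mu1_new) ** 2).sum()
            + e_x41_sq - 2 * mu1_new * e_x41 + mu1_new ** 2) / 4.0
      mu1 = mu1_new

  mu2, s2 = feat2.mean(), feat2.var()    # feature 2 observed: plain ML estimates
  print(mu1, s1, mu2, s2)                # -> approx. 1.0, 0.667, 2.0, 2.0, as on the slide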
Slide 80: Generalized Expectation-Maximization (GEM)
- Instead of maximizing Q(θ; θ^i), we find some θ^{i+1} such that Q(θ^{i+1}; θ^i) > Q(θ^i; θ^i), which is also guaranteed to converge
- Convergence will not be as rapid
- Offers great freedom to choose computationally simpler steps
  – e.g., using the maximum-likelihood value of the unknown values, if it leads to a greater likelihood
Slide 81: Hidden Markov Model (HMM)
- Used for problems of making a series of decisions
  – e.g., speech or gesture recognition
- Problem states at time t are influenced directly by the state at t-1
- More reference:
  – L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993, Chapter 6.
Slide 82: First Order Markov Models
Sequence of states ω^T = {ω(1), ω(2), ..., ω(T)}
  e.g., \omega^6 = \{\omega_1, \omega_3, \omega_2, \omega_2, \omega_1, \omega_3\}
  P(\omega^6 \mid \theta) = a_{13}\, a_{32}\, a_{22}\, a_{21}\, a_{13}
Slide 83: First Order Hidden Markov Models
Sequence of visible states V^T = {v(1), v(2), ..., v(T)}
  e.g., V^6 = \{v_4, v_1, v_1, v_4, v_2, v_3\}
  P(v(t) = v_k \mid \omega(t) = \omega_j) = b_{jk}
Slide 84: Hidden Markov Model Probabilities
- Transition probability: a_{ij} = P(\omega_j(t+1) \mid \omega_i(t))
- Probability of emission of a visible state: b_{jk} = P(v_k(t) \mid \omega_j(t))
- Final or absorbing state ω_0: a_{00} = 1
- Normalization: \sum_j a_{ij} = 1, \qquad \sum_k b_{jk} = 1
Slide 85: Hidden Markov Model Computation
- Evaluation problem
  – Given a_ij and b_jk, determine P(V^T|θ)
- Decoding problem
  – Given V^T, determine the most likely sequence of hidden states that led to V^T
- Learning problem
  – Given training observations of visible symbols and the coarse structure, but not the probabilities, determine a_ij and b_jk
Slide 86: Evaluation
  P(V^T) = \sum_{r=1}^{r_{max}} P(V^T \mid \omega_r^T)\, P(\omega_r^T), \qquad r_{max} = c^T
  P(\omega_r^T) = \prod_{t=1}^{T} P(\omega(t) \mid \omega(t-1))
  P(V^T \mid \omega_r^T) = \prod_{t=1}^{T} P(v(t) \mid \omega(t))
  P(V^T) = \sum_{r=1}^{r_{max}} \prod_{t=1}^{T} P(v(t) \mid \omega(t))\, P(\omega(t) \mid \omega(t-1))
Slide 87: HMM Forward
Define α_j(t) = P(ω_j(t), V^t), the probability of being in state ω_j at step t having generated the first t visible symbols:
  \alpha_j(0) = 1 for the initial state, 0 otherwise
  \alpha_j(t) = \left[\sum_{i=1}^{c} \alpha_i(t-1)\, a_{ij}\right] b_{jk\,v(t)}
  P(V^T) = \alpha_0(T) for the final (absorbing) state ω_0
Slide 88: HMM Forward and Trellis
[figure]
Slide 89: HMM Forward
initialize t ← 0, a_ij, b_jk, visible sequence V^T, α_j(0)
for t ← t + 1
  α_j(t) ← [Σ_{i=1}^{c} α_i(t-1) a_ij] b_jk v(t)
until t = T
return P(V^T) ← α_0(T) for the final state
end
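A direct NumPy transcription of this forward pass. It assumes state 0 is the absorbing final state and that the initial state index is given; these indexing conventions are mine, not from the slides:

  import numpy as np

  def hmm_forward(a, b, visible, init_state=1):
      """alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b[j, v(t)];
      returns P(V^T) = alpha_0(T) for the absorbing state 0."""
      c = a.shape[0]
      alpha = np.zeros(c)
      alpha[init_state] = 1.0        # alpha_j(0): 1 for the initial state
      for v in visible:              # symbol indices v(1), ..., v(T)
          alpha = (alpha @ a) * b[:, v]
      return alpha[0]

With the a and b matrices of slide 93, a call like hmm_forward(a, b, [1, 3, 2, 0]) evaluates one visible sequence.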
Slide 90: HMM Backward
Define β_i(t) as the probability that the model, in state ω_i at step t, generates the remainder of the visible sequence v(t+1), ..., v(T):
  \beta_i(T) = 1 for the final (absorbing) state, 0 otherwise
  \beta_i(t) = \sum_{j=1}^{c} \beta_j(t+1)\, a_{ij}\, b_{jk\,v(t+1)}
  P(V^T) = \beta_{init}(0) for the known initial state
Slide 91: HMM Backward
initialize ω(T), t ← T, a_ij, b_jk, visible sequence V^T, β_j(T)
for t ← t - 1
  β_i(t) ← Σ_{j=1}^{c} β_j(t+1) a_ij b_jk v(t+1)
until t = 0
return P(V^T) ← β_i(0) for the initial state
end
Slide 92: Example 3: Hidden Markov Model
[figure]
Slide 93: Example 3: Hidden Markov Model
Transition probabilities:
  a_{ij} =
  [ 1    0    0    0
    0.2  0.3  0.1  0.4
    0.2  0.5  0.2  0.1
    0.8  0.1  0.0  0.1 ]
Emission probabilities:
  b_{jk} =
  [ 1    0    0    0    0
    0    0.3  0.4  0.1  0.2
    0    0.1  0.1  0.7  0.1
    0    0.5  0.2  0.1  0.2 ]
Slide 94: Example 3: Hidden Markov Model
[figure]
Slide 95: Left-to-Right Models for Speech
  P(\omega^T \mid V^T) = \frac{P(V^T \mid \omega^T)\, P(\omega^T)}{P(V^T)}
Slide 96: HMM Decoding
[figure]
Slide 97: Problem of Local Optimization
- This decoding algorithm depends only on the single previous time step, not on the full sequence
- It does not guarantee that the path is indeed allowable
Slide 98: HMM Decoding
initialize Path ← {}, t ← 0
for t ← t + 1
  j ← 0
  for j ← j + 1
    α_j(t) ← b_jk v(t) Σ_{i=1}^{c} α_i(t-1) a_ij
  until j = c
  j' ← arg max_j α_j(t)
  append ω_{j'} to Path
until t = T
return Path
end
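The same recursion with the arg max recorded at each step; a sketch under the same indexing assumptions as the forward code above:

  import numpy as np

  def hmm_decode(a, b, visible, init_state=1):
      """Greedy decoding as on the slide: run the full alpha recursion
      but record only arg max_j alpha_j(t) at each step."""
      c = a.shape[0]
      alpha = np.zeros(c)
      alpha[init_state] = 1.0
      path = []
      for v in visible:
          alpha = (alpha @ a) * b[:, v]
          path.append(int(np.argmax(alpha)))  # omega_{j'} with j' = arg max_j alpha_j(t)
      return path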
Slide 99: Example 4: HMM Decoding
[figure]
Slide 100: Forward-Backward Algorithm
- Determines the model parameters a_ij and b_jk from an ensemble of training samples
- An instance of a generalized expectation-maximization algorithm
- There is no known method for finding the optimal or most likely set of parameters from the data
Slide 101: Probability of Transition
  \gamma_{ij}(t) = P(\omega_i(t-1), \omega_j(t) \mid V^T, \theta)
  = \frac{P(\omega_i(t-1), \omega_j(t), V^T \mid \theta)}{P(V^T \mid \theta)}
  = \frac{\alpha_i(t-1)\, a_{ij}\, b_{jk\,v(t)}\, \beta_j(t)}{P(V^T \mid \theta)}
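A sketch combining the forward and backward passes to form γ_ij(t); the (T+1, c) array layout and the assumption that ω_0 is the absorbing final state are my conventions:

  import numpy as np

  def transition_posteriors(a, b, visible, init_state=1):
      """gamma_ij(t) = alpha_i(t-1) a_ij b[j, v(t)] beta_j(t) / P(V^T)."""
      c, T = a.shape[0], len(visible)
      alpha = np.zeros((T + 1, c)); alpha[0, init_state] = 1.0
      for t in range(1, T + 1):                      # forward pass
          alpha[t] = (alpha[t - 1] @ a) * b[:, visible[t - 1]]
      beta = np.zeros((T + 1, c)); beta[T, 0] = 1.0  # absorbing state omega_0
      for t in range(T - 1, -1, -1):                 # backward pass
          beta[t] = a @ (b[:, visible[t]] * beta[t + 1])
      p_v = alpha[T, 0]                              # P(V^T), sequence ends in omega_0
      gamma = np.zeros((T, c, c))
      for t in range(1, T + 1):
          gamma[t - 1] = (np.outer(alpha[t - 1], b[:, visible[t - 1]] * beta[t]) * a) / p_v
      return gamma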
Slide 102: Improved Estimate for a_ij
Expected number of transitions between state ω_i(t-1) and ω_j(t) at any time in the sequence:
  \sum_{t=1}^{T} \gamma_{ij}(t)
Total expected number of any transitions from ω_i:
  \sum_{t=1}^{T} \sum_{k} \gamma_{ik}(t)
Estimate of a_ij:
  \hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{k} \gamma_{ik}(t)}
Slide 103: Improved Estimate for b_jk
  \hat{b}_{jk} = \frac{\sum_{t=1,\, v(t)=v_k}^{T} \sum_{l} \gamma_{jl}(t)}{\sum_{t=1}^{T} \sum_{l} \gamma_{jl}(t)}
Slide 104: Forward-Backward Algorithm (Baum-Welch Algorithm)
begin initialize a_ij, b_jk, training sequence V^T, convergence criterion θ, z ← 0
  do z ← z + 1
    compute â(z) from a(z-1) and b(z-1)
    compute b̂(z) from a(z-1) and b(z-1)
    a_ij(z) ← â_ij(z-1)
    b_jk(z) ← b̂_jk(z-1)
  until max_{i,j,k} [|a_ij(z) - a_ij(z-1)|, |b_jk(z) - b_jk(z-1)|] < θ
  return a_ij ← a_ij(z); b_jk ← b_jk(z)
end