
Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher


Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)

Bayesian Estimation (BE)
Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General Estimation
Problems of Dimensionality
Computational Complexity
Component Analysis and Discriminants
Hidden Markov Models


Bayesian Estimation (Bayesian learning applied to pattern classification problems)

In MLE, θ was assumed to be fixed
In BE, θ is a random variable
The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
Goal: compute P(ωi | x, D)
Given the sample D, Bayes formula can be written:

P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)}
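To make this formula concrete, here is a minimal Python sketch (not from the original slides) that turns class-conditional densities p(x | ωi, D) and priors P(ωi | D) into posterior class probabilities; the Gaussian class-conditionals and all numbers below are hypothetical choices for illustration.

import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def class_posteriors(x, conditionals, priors):
    """P(w_i | x, D) = p(x | w_i, D) P(w_i | D) / sum_j p(x | w_j, D) P(w_j | D)."""
    joint = [p(x) * prior for p, prior in zip(conditionals, priors)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class example with Gaussian class-conditional densities.
conditionals = [lambda x: gaussian_pdf(x, 0.0, 1.0),
                lambda x: gaussian_pdf(x, 2.0, 1.0)]
print(class_posteriors(1.2, conditionals, priors=[0.5, 0.5]))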


To demonstrate the preceding equation, use:

p(x, \omega_i \mid D) = p(x \mid \omega_i, D)\, P(\omega_i \mid D) \quad \text{(from def. of conditional prob.)}

p(x \mid D) = \sum_{j=1}^{c} p(x, \omega_j \mid D) \quad \text{(from law of total prob.)}

P(\omega_i) = P(\omega_i \mid D) \quad \text{(the training sample provides this!)}

We assume that samples in different classes are independent. Thus:

P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j)}


Bayesian Parameter Estimation: Gaussian Case

Goal: Estimate θ using the a-posteriori density P(θ | D)

The univariate case: p(μ | D)
μ is the only unknown parameter (μ0 and σ0 are known!)

p(x \mid \mu) \sim N(\mu, \sigma^2), \qquad p(\mu) \sim N(\mu_0, \sigma_0^2)


Using Bayes formula:

p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)

But we know (from eq. 29, page 93) that

p(x_k \mid \mu) \sim N(\mu, \sigma^2) \quad \text{and} \quad p(\mu) \sim N(\mu_0, \sigma_0^2)

Plugging in their Gaussian expressions and extracting out factors not depending on μ yields:

p(\mu \mid D) = \alpha' \exp\left\{-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right\}


Observation: p(μ|D) is an exponential of a quadratic

It is again normal! It is called a reproducing density

p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)

p(\mu \mid D) = \alpha' \exp\left\{-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right\}


Identifying coefficients in the top equation with those of the generic Gaussian yields expressions for μn and σn²:

p(\mu \mid D) = \alpha' \exp\left\{-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right\}

p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]

\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad \text{and} \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}

where \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k is the sample mean.


Solving for μn and σn² yields:

\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 \qquad \text{and} \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}

From these equations we see that as n increases:
the variance σn² decreases monotonically
the estimate p(μ | D) becomes more and more peaked
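As an illustration of these update formulas (a sketch added here, not part of the original slides; the data values and the parameters μ0, σ0², σ² are arbitrary assumptions):

def posterior_params(xs, mu0, sigma0_sq, sigma_sq):
    """Posterior mean mu_n and variance sigma_n^2 of mu for a univariate Gaussian
    with known variance sigma_sq and prior mu ~ N(mu0, sigma0_sq)."""
    n = len(xs)
    mu_hat = sum(xs) / n  # sample mean
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq / denom) * mu_hat + (sigma_sq / denom) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    return mu_n, sigma_n_sq

# Hypothetical data: as n grows, sigma_n_sq shrinks and p(mu | D) becomes more peaked.
print(posterior_params([2.1, 1.8, 2.4, 2.0], mu0=0.0, sigma0_sq=1.0, sigma_sq=0.25))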


The univariate case: p(x | D)

p(μ | D) has been computed (in the preceding discussion); p(x | D) remains to be computed:

p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \quad \text{is Gaussian:} \quad p(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)

It provides the desired class-conditional density p(x | Dj, ωj). We know σ² and how to compute μn and σn². Therefore, using p(x | Dj, ωj) together with P(ωj) and the Bayes formula, we obtain the Bayesian classification rule:

\max_{\omega_j}\big[P(\omega_j \mid x, D)\big] \equiv \max_{\omega_j}\big[p(x \mid \omega_j, D_j)\, P(\omega_j)\big]
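Continuing the same illustration (again not from the slides; all class parameters and priors below are made-up), the predictive density p(x | D) is a Gaussian with mean μn and variance σ² + σn², and the classification rule picks the class maximizing p(x | ωj, Dj) P(ωj):

import math

def predictive_density(x, mu_n, sigma_n_sq, sigma_sq):
    """p(x | D) ~ N(mu_n, sigma^2 + sigma_n^2)."""
    var = sigma_sq + sigma_n_sq
    return math.exp(-0.5 * (x - mu_n) ** 2 / var) / math.sqrt(2 * math.pi * var)

def bayes_classify(x, class_params, priors):
    """Return the index j maximizing p(x | w_j, D_j) * P(w_j)."""
    scores = [predictive_density(x, mu_n, s_n_sq, s_sq) * prior
              for (mu_n, s_n_sq, s_sq), prior in zip(class_params, priors)]
    return max(range(len(scores)), key=lambda j: scores[j])

# Hypothetical two classes, each described by (mu_n, sigma_n^2, sigma^2).
print(bayes_classify(0.7, [(0.0, 0.05, 0.25), (2.0, 0.05, 0.25)], priors=[0.5, 0.5]))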


3.5 Bayesian Parameter Estimation: General Theory

The computation of p(x | D) can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:

The form of p(x | θ) is assumed known, but the value of θ is not known exactly
Our knowledge about θ is assumed to be contained in a known prior density p(θ)
The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to the unknown density p(x)


The basic problem is: “Compute the posterior density p(θ | D)”, then “Derive p(x | D)”, where

p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta

Using Bayes formula, we have:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}

And by the independence assumption:

p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
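A minimal numerical sketch of this general recipe (illustrative only, not from the slides): discretize θ on a grid, form p(θ | D) as normalized prior × likelihood, then average p(x | θ) over that posterior to obtain p(x | D). The exponential form of p(x | θ), the flat prior, and the grid are assumptions chosen for the example.

import math

def bayes_general(xs, thetas, prior, p_x_given_theta):
    """Grid approximation of p(theta | D) and p(x | D) = sum_theta p(x | theta) p(theta | D)."""
    # p(D | theta) = prod_k p(x_k | theta), by the independence assumption
    likelihood = [math.prod(p_x_given_theta(x, t) for x in xs) for t in thetas]
    joint = [l * prior(t) for l, t in zip(likelihood, thetas)]
    z = sum(joint)
    posterior = [j / z for j in joint]  # p(theta | D) on the grid
    def p_x_given_D(x):
        return sum(p_x_given_theta(x, t) * w for t, w in zip(thetas, posterior))
    return posterior, p_x_given_D

# Hypothetical example: exponential density p(x | theta) = theta * exp(-theta * x), flat prior.
thetas = [0.1 * i for i in range(1, 51)]
posterior, p_x_given_D = bayes_general(
    [0.8, 1.3, 0.4], thetas,
    prior=lambda t: 1.0,
    p_x_given_theta=lambda x, t: t * math.exp(-t * x))
print(p_x_given_D(1.0))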


Convergence (from notes)


Problems of Dimensionality

Problems involving 50 or 100 features are common (usually binary valued)
Note: microarray data might entail ~20,000 real-valued features
Classification accuracy depends on:
dimensionality
amount of training data
discrete vs. continuous features


Case of two-class multivariate normal densities with the same covariance

P(x | ωj) ~ N(μj, Σ), j = 1, 2
Statistically independent features
If the priors are equal, then:

P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du \quad \text{(Bayes error)}

where

r^2 = (\mu_1 - \mu_2)^t\, \Sigma^{-1} (\mu_1 - \mu_2)

is the squared Mahalanobis distance, and \lim_{r \to \infty} P(\text{error}) = 0.
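As a quick illustrative check of this formula (not part of the slides; the means and covariance below are made-up numbers), the Bayes error equals the Gaussian tail probability at r/2, which can be written with the complementary error function:

import math

def bayes_error(mu1, mu2, cov_inv):
    """P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du,
    where r is the Mahalanobis distance between the two means."""
    d = [a - b for a, b in zip(mu1, mu2)]
    r_sq = sum(d[i] * cov_inv[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))
    r = math.sqrt(r_sq)
    return 0.5 * math.erfc(r / (2 * math.sqrt(2)))  # Gaussian upper tail at r/2

# Hypothetical 2-d example with identity covariance (r is then the Euclidean distance).
print(bayes_error([0.0, 0.0], [2.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]))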


If features are conditionally independent, then:

\Sigma = \text{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)

r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2

Do we remember what conditional independence is?
Example for binary features:
Let pi = Pr[xi = 1 | ω1]; then, for a binary feature vector x, P(x | ω1) factors into the product \prod_{i=1}^{d} p_i^{x_i}(1 - p_i)^{1 - x_i} over the individual pi.
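A tiny sketch of the diagonal-covariance case (illustrative; the means and standard deviations are arbitrary). Each additional feature whose means differ adds a positive term, so r grows and, by the error formula above, the Bayes error shrinks:

def mahalanobis_sq_diag(mu1, mu2, sigmas):
    """r^2 = sum_i ((mu_i1 - mu_i2) / sigma_i)^2 for conditionally independent features."""
    return sum(((a - b) / s) ** 2 for a, b, s in zip(mu1, mu2, sigmas))

print(mahalanobis_sq_diag([0.0, 1.0, 3.0], [1.0, 1.0, 2.0], [1.0, 0.5, 2.0]))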


Most useful features are the ones for which the difference between the means is large relative to the standard deviation

Doesn’t require independence

Adding independent features helps increase r and thus reduce the error

Caution: adding features increases cost & complexity of feature extractor and classifier

It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance:

we have the wrong model!
we don’t have enough training data to support the additional dimensions


Computational Complexity

Our design methodology is affected by the computational difficulty

“big oh” notation: f(x) = O(h(x)), read “big oh of h(x)”
If:

\exists\, (c_0, x_0) \in \mathbb{R}^2 \ \text{such that}\ f(x) \le c_0\, h(x) \ \text{for all}\ x > x_0

(an upper bound on f(x) grows no worse than h(x) for sufficiently large x!)

Example: f(x) = 2 + 3x + 4x², g(x) = x², so f(x) = O(x²)


“big oh” is not unique!
f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴)

“big theta” notation: f(x) = Θ(h(x))
If:

\exists\, (x_0, c_1, c_2) \in \mathbb{R}^3 \ \text{such that}\ \forall x > x_0: \ 0 \le c_1\, h(x) \le f(x) \le c_2\, h(x)

f(x) = Θ(x²) but f(x) ≠ Θ(x³)


Complexity of the ML Estimation

Gaussian priors in d dimensions, with n training samples for each of c classes

For each category, we have to compute the discriminant function:

g(x) = -\frac{1}{2}(x - \hat{\mu})^t\, \hat{\Sigma}^{-1} (x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)

Estimating the sample mean costs O(dn), estimating the sample covariance costs O(d²n) (the dominant term), the constant term is O(1), and estimating the prior P(ω) is O(n).

Total = O(d²n); total for c classes = O(cd²n) ≅ O(d²n)

Costs increase when d and n are large!
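A small numpy sketch of this discriminant (an illustration, not the book's code; the random training data and prior are assumptions), showing where the O(dn) and O(d²n) estimation costs arise:

import numpy as np

def ml_discriminant(X, prior):
    """Build g(x) for one class from its n x d training matrix X using ML estimates."""
    n, d = X.shape
    mu = X.mean(axis=0)                          # sample mean: O(dn)
    sigma = np.cov(X, rowvar=False, bias=True)   # sample covariance: O(d^2 n), dominant
    sigma_inv = np.linalg.inv(sigma)
    log_det = np.linalg.slogdet(sigma)[1]
    def g(x):
        diff = x - mu
        return (-0.5 * diff @ sigma_inv @ diff
                - 0.5 * d * np.log(2 * np.pi)
                - 0.5 * log_det
                + np.log(prior))
    return g

# Hypothetical 2-d training data for one class with prior 0.5.
rng = np.random.default_rng(0)
g1 = ml_discriminant(rng.normal(size=(100, 2)), prior=0.5)
print(g1(np.array([0.1, -0.2])))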


Overfitting

Dimensionality of model vs. size of training data

Issue: not enough data to support the model
Possible solutions:
Reduce model dimensionality
Make (possibly incorrect) assumptions to better estimate Σ


Overfitting: estimating a better Σ

use data pooled from all classes (normalization issues)
use the pseudo-Bayesian form λΣ0 + (1 − λ)Σn
“doctor” Σ by thresholding entries (reduces chance correlations)
assume statistical independence: zero all off-diagonal elements


Shrinkage

Shrinkage: weighted combination of common and individual covariances

\Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i \Sigma_i + \alpha\, n \Sigma}{(1-\alpha)\, n_i + \alpha\, n}, \qquad 0 < \alpha < 1

We can also shrink the estimated common covariance toward the identity matrix:

\Sigma(\beta) = (1-\beta)\, \Sigma + \beta\, I, \qquad 0 < \beta < 1
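A short numpy sketch of both shrinkage estimators (illustrative only; α, β, the sample counts and the matrices below are arbitrary):

import numpy as np

def shrink_to_common(sigma_i, n_i, sigma, n, alpha):
    """Sigma_i(alpha): shrink an individual covariance toward the common covariance."""
    return (((1 - alpha) * n_i * sigma_i + alpha * n * sigma)
            / ((1 - alpha) * n_i + alpha * n))

def shrink_to_identity(sigma, beta):
    """Sigma(beta): shrink the common covariance toward the identity matrix."""
    return (1 - beta) * sigma + beta * np.eye(sigma.shape[0])

sigma_i = np.array([[2.0, 0.3], [0.3, 1.0]])
sigma = np.array([[1.5, 0.1], [0.1, 1.2]])
print(shrink_to_common(sigma_i, n_i=20, sigma=sigma, n=100, alpha=0.3))
print(shrink_to_identity(sigma, beta=0.1))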


Component Analysis and Discriminants

Combine features in order to reduce the dimension of the feature space
Linear combinations are simple to compute and tractable
Project high-dimensional data onto a lower-dimensional space
Two classical approaches for finding an “optimal” linear transformation:
PCA (Principal Component Analysis): “projection that best represents the data in a least-squares sense” (see the sketch below)
MDA (Multiple Discriminant Analysis): “projection that best separates the data in a least-squares sense”
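A minimal numpy sketch of the PCA projection mentioned above (illustrative only; the random data and the choice of k are assumptions):

import numpy as np

def pca_project(X, k):
    """Project n x d data onto the k eigenvectors of the covariance matrix
    with the largest eigenvalues (the projection that best represents the
    data in a least-squares sense)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # k leading eigenvectors
    return X_centered @ top                           # n x k projected data

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
print(pca_project(X, k=2).shape)  # (200, 2)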


PCA (from notes)


Hidden Markov Models: Markov Chains

Goal: make a sequence of decisions

Processes that unfold in time: states at time t are influenced by the state at time t − 1

Applications: speech recognition, gesture recognition, part-of-speech tagging, and DNA sequencing

Any temporal process without memory
ωT = {ω(1), ω(2), ω(3), …, ω(T)} — a sequence of states
We might have ω6 = {ω1, ω4, ω2, ω2, ω1, ω4}

The system can revisit a state at different steps, and not every state needs to be visited


First-order Markov models

The production of any sequence is described by the transition probabilities

P(ωj(t + 1) | ωi(t)) = aij



θ = (aij, ωT)

P(ωT | θ) = a14 · a42 · a22 · a21 · a14 · P(ω(1) = ωi)
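A tiny Python sketch of this product-of-transition-probabilities computation (illustrative only; the transition matrix and initial probabilities are made-up numbers):

def sequence_probability(states, A, initial):
    """P(omega^T | theta) = P(omega(1)) * prod_t a_{omega(t), omega(t+1)}
    for a first-order Markov chain with transition matrix A (0-indexed states)."""
    p = initial[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s][s_next]
    return p

# Hypothetical 4-state chain; the sequence (w1, w4, w2, w2, w1, w4) as 0-indexed states.
A = [[0.10, 0.20, 0.30, 0.40],
     [0.30, 0.30, 0.20, 0.20],
     [0.20, 0.40, 0.20, 0.20],
     [0.25, 0.25, 0.25, 0.25]]
initial = [0.4, 0.2, 0.2, 0.2]
print(sequence_probability([0, 3, 1, 1, 0, 3], A, initial))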

Example: speech recognition

“Production of spoken words”
Production of the word “pattern”, represented by phonemes: /p/ /a/ /tt/ /er/ /n/ // (// = silent state)

Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to the silent state
