Lecture 4: Estimation, Part 1


Copyright by Ahmad Hidayatno, ST, MT (Teknik Elektro, Undip Semarang)

Transcript of Lecture 4: Estimation, Part 1

  • Estimation, Part I. Outline: Introduction; Maximum-Likelihood Estimation; Example of a Specific Case; The Gaussian Case: unknown μ and σ; Bias; Appendix: ML Problem Statement; Bayesian Estimation (BE); Bayesian Parameter Estimation: Gaussian Case; Bayesian Parameter Estimation: General Theory

  • Introduction. Data availability in a Bayesian framework: we could design an optimal classifier if we knew the priors P(ω_i) and the class-conditional densities P(x | ω_i). Unfortunately, we rarely have this complete information! We therefore design the classifier from a training sample. Prior estimation poses no problem, but samples are often too small for class-conditional estimation (large dimension of the feature space!).


  • A priori information about the problem: normality of P(x | ω_i), i.e. P(x | ω_i) ~ N(μ_i, Σ_i), characterized by 2 parameters. Estimation techniques: Maximum-Likelihood (ML) and Bayesian estimation. The results are nearly identical, but the approaches are different.

  • Parameters in ML estimation are fixed but unknown! The best parameters are obtained by maximizing the probability of obtaining the samples observed. Bayesian methods view the parameters as random variables having some known distribution.

    In either approach, we use P(ω_i | x) for our classification rule!


  • Maximum-Likelihood Estimation. Has good convergence properties as the sample size increases; simpler than any other alternative technique. General principle: assume we have c classes and P(x | ω_j) ~ N(μ_j, Σ_j). We write P(x | ω_j) ≡ P(x | ω_j, θ_j), where θ_j consists of the parameters (μ_j, Σ_j).

  • Use the information provided by the training samples to estimate θ = (θ_1, θ_2, ..., θ_c), where each θ_i (i = 1, 2, ..., c) is associated with one category. Suppose that D contains n samples, x_1, x_2, ..., x_n.

    The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ). It is the value of θ that best agrees with the actually observed training sample.



  • Optimal estimation. Let θ = (θ_1, θ_2, ..., θ_p)^t and let ∇_θ = (∂/∂θ_1, ..., ∂/∂θ_p)^t be the gradient operator.

    We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ).

    New problem statement: determine the θ̂ that maximizes the log-likelihood, θ̂ = arg max_θ l(θ).


  • Set of necessary conditions for an optimum is:

    ∇_θ l = 0

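As an illustration of this recipe (not part of the original slides), here is a minimal Python sketch under the assumption of a univariate Gaussian with known σ and an invented sample: it evaluates l(θ) = ln P(D | θ) = ∑_k ln P(x_k | θ) on a grid and picks the maximizer, which lands on the sample mean, the point where ∇_θ l = 0.

```python
import numpy as np

def log_likelihood(D, mu, sigma=1.0):
    """l(mu) = sum_k ln P(x_k | mu) for a univariate Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (D - mu)**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=100)   # hypothetical training sample

# Grid search for the theta (here: mu) that maximizes the log-likelihood.
grid = np.linspace(-5.0, 5.0, 2001)
l_values = np.array([log_likelihood(D, mu) for mu in grid])
mu_hat = grid[np.argmax(l_values)]

print(mu_hat, D.mean())   # the maximizer agrees with the sample mean (where grad l = 0)
```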

  • Example of a specific case: unknown μ. P(x_k | μ) ~ N(μ, Σ) (samples are drawn from a multivariate normal population).

    θ = μ, therefore: ln P(x_k | μ) = -(1/2) ln[(2π)^d |Σ|] - (1/2)(x_k - μ)^t Σ^{-1} (x_k - μ) and ∇_μ ln P(x_k | μ) = Σ^{-1} (x_k - μ). The ML estimate for μ must satisfy: ∑_{k=1}^{n} Σ^{-1} (x_k - μ̂) = 0.


  • Multiplying by Σ and rearranging, we obtain: μ̂ = (1/n) ∑_{k=1}^{n} x_k

    This is just the arithmetic average of the training samples! Conclusion: if P(x_k | ω_j) (j = 1, 2, ..., c) is supposed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ_1, θ_2, ..., θ_c)^t and perform an optimal classification!

  • ML Estimation, Gaussian case: unknown μ and σ. Here θ = (θ_1, θ_2) = (μ, σ²), and for a single point l = ln P(x_k | θ) = -(1/2) ln(2π θ_2) - (1/(2θ_2))(x_k - θ_1)², so ∇_θ l = ((1/θ_2)(x_k - θ_1), -1/(2θ_2) + (x_k - θ_1)²/(2θ_2²))^t = 0.

  • Summation over the n samples gives the conditions: (1) ∑_{k=1}^{n} (1/θ̂_2)(x_k - θ̂_1) = 0 and (2) -∑_{k=1}^{n} 1/θ̂_2 + ∑_{k=1}^{n} (x_k - θ̂_1)²/θ̂_2² = 0.

    Combining (1) and (2), one obtains: μ̂ = (1/n) ∑_{k=1}^{n} x_k and σ̂² = (1/n) ∑_{k=1}^{n} (x_k - μ̂)².

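A small sketch of the resulting closed-form estimates (my own illustration, with invented data): μ̂ is the sample mean and σ̂² is the average squared deviation around it, both dividing by n.

```python
import numpy as np

def ml_gaussian_1d(D):
    """ML estimates for a univariate Gaussian: mu_hat = mean, sigma2_hat = (1/n) sum (x_k - mu_hat)^2."""
    n = len(D)
    mu_hat = np.sum(D) / n
    sigma2_hat = np.sum((D - mu_hat) ** 2) / n   # note: divides by n, not n-1
    return mu_hat, sigma2_hat

rng = np.random.default_rng(1)
D = rng.normal(loc=0.0, scale=2.0, size=50)      # made-up sample
print(ml_gaussian_1d(D))
```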

  • Bias. The ML estimate for σ² is biased: E[(1/n) ∑_{i=1}^{n} (x_i - x̄)²] = ((n-1)/n) σ² ≠ σ².

    An elementary unbiased estimator for Σ is the sample covariance matrix: C = (1/(n-1)) ∑_{k=1}^{n} (x_k - μ̂)(x_k - μ̂)^t.

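To see the bias numerically, a quick simulation sketch (the sample size, number of trials, and true variance are assumptions chosen for illustration): averaging the 1/n estimator over many samples gives roughly ((n-1)/n)σ², while the 1/(n-1) estimator averages to σ².

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, true_var = 5, 100_000, 4.0

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
biased   = samples.var(axis=1, ddof=0)   # ML estimate: divide by n
unbiased = samples.var(axis=1, ddof=1)   # elementary unbiased estimate: divide by n-1

print(biased.mean())     # close to (n-1)/n * 4.0 = 3.2
print(unbiased.mean())   # close to 4.0
```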

  • Appendix: ML Problem Statement

    Let D = {x_1, x_2, ..., x_n}. Then P(x_1, ..., x_n | θ) = ∏_{k=1}^{n} P(x_k | θ); |D| = n. Our goal is to determine θ̂ (the value of θ that makes this sample the most representative!).

  • (Figure: the |D| = n samples x_1, x_2, ..., x_n grouped into the data sets D_1, ..., D_k, ..., D_c, each governed by its own density P(x_j | θ_1), ..., P(x_j | θ_k) of the form N(μ_j, Σ_j).)

  • θ = (θ_1, θ_2, ..., θ_c)

    Problem: find θ̂ such that: P(D | θ̂) = max_θ P(D | θ) = max_θ ∏_{k=1}^{n} P(x_k | θ).


  • Bayesian Estimation (Bayesian learning applied to pattern classification problems). In MLE, θ was assumed fixed; in BE, θ is a random variable. The computation of the posterior probabilities P(ω_i | x) lies at the heart of Bayesian classification. Goal: compute P(ω_i | x, D). Given the sample D, Bayes formula can be written: P(ω_i | x, D) = P(x | ω_i, D) P(ω_i) / ∑_{j=1}^{c} P(x | ω_j, D) P(ω_j).

    To demonstrate the preceding equation, use: P(x, D | ω_i) = P(x | ω_i, D) P(D | ω_i), P(x | D) = ∑_j P(x, ω_j | D), and P(ω_i | D) = P(ω_i) (the training samples tell us nothing new about the priors).

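A minimal sketch of how these posteriors drive classification (not from the slides; the two univariate Gaussian class models and priors are invented): Bayes formula turns the class-conditional densities P(x | ω_i, D) and priors P(ω_i) into posteriors P(ω_i | x, D), and we pick the class with the largest one.

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mu, sigma):
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

# Hypothetical class-conditional densities P(x | w_i, D) and priors P(w_i).
classes = {"w1": dict(mu=0.0, sigma=1.0, prior=0.6),
           "w2": dict(mu=3.0, sigma=1.0, prior=0.4)}

def posteriors(x):
    joint = {w: gaussian_pdf(x, c["mu"], c["sigma"]) * c["prior"] for w, c in classes.items()}
    evidence = sum(joint.values())                      # denominator of Bayes formula
    return {w: v / evidence for w, v in joint.items()}  # P(w_i | x, D)

print(posteriors(1.2))   # classify x by the largest posterior
```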

  • Bayesian Parameter Estimation: Gaussian Case. Goal: estimate θ using the a-posteriori density P(θ | D). The univariate case: P(μ | D), where μ is the only unknown parameter; P(x | μ) ~ N(μ, σ²) and P(μ) ~ N(μ_0, σ_0²).

    (μ_0 and σ_0 are known!)

  • Reproducing density: P(μ | D) = α ∏_{k=1}^{n} P(x_k | μ) P(μ)   (1), and because the normal prior reproduces itself, P(μ | D) ~ N(μ_n, σ_n²)   (2).

    Identifying (1) and (2) yields: μ_n = (n σ_0² / (n σ_0² + σ²)) μ̂_n + (σ² / (n σ_0² + σ²)) μ_0 and σ_n² = σ_0² σ² / (n σ_0² + σ²), where μ̂_n = (1/n) ∑_{k=1}^{n} x_k is the sample mean.
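A sketch of this update as code (the wrapper name and the prior/data values are my assumptions): given a known σ, a prior N(μ_0, σ_0²), and the sample, it returns the posterior parameters μ_n and σ_n².

```python
import numpy as np

def posterior_mu(D, sigma, mu0, sigma0):
    """Posterior N(mu_n, sigma_n^2) over the unknown Gaussian mean mu."""
    n = len(D)
    mu_hat_n = D.mean()                                   # sample mean
    mu_n = (n * sigma0**2 * mu_hat_n + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
    sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    return mu_n, sigma_n2

rng = np.random.default_rng(3)
D = rng.normal(loc=1.5, scale=1.0, size=20)   # hypothetical sample, known sigma = 1
print(posterior_mu(D, sigma=1.0, mu0=0.0, sigma0=2.0))
```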


  • The univariate case, P(x | D). P(μ | D) has been computed; P(x | D) remains to be computed: P(x | D) = ∫ P(x | μ) P(μ | D) dμ.

    It provides: P(x | D) ~ N(μ_n, σ² + σ_n²).

    (This is the desired class-conditional density P(x | D_j, ω_j).) Therefore, using P(x | D_j, ω_j) together with P(ω_j) and Bayes formula, we obtain the Bayesian classification rule: choose the ω_j that maximizes P(ω_j | x, D) ∝ P(x | ω_j, D_j) P(ω_j).
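Continuing the sketch above (the numeric values are illustrative only), the predictive density P(x | D) is just a normal density evaluated with mean μ_n and variance σ² + σ_n²:

```python
from math import sqrt, pi, exp

def predictive_density(x, mu_n, sigma2, sigma_n2):
    """P(x | D) ~ N(mu_n, sigma^2 + sigma_n^2)."""
    var = sigma2 + sigma_n2
    return exp(-(x - mu_n)**2 / (2 * var)) / sqrt(2 * pi * var)

# e.g. with mu_n = 1.4, sigma^2 = 1.0, sigma_n^2 = 0.05 (illustrative numbers):
print(predictive_density(2.0, mu_n=1.4, sigma2=1.0, sigma_n2=0.05))
```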

  • Bayesian Parameter Estimation: General Theory. The P(x | D) computation can be applied to any situation in which the unknown density can be parametrized. The basic assumptions are: the form of P(x | θ) is assumed known, but the value of θ is not known exactly; our knowledge about θ is assumed to be contained in a known prior density P(θ); the rest of our knowledge about θ is contained in a set D of n samples x_1, x_2, ..., x_n drawn independently according to P(x).


  • The basic problem is: compute the posterior density P(θ | D), then derive P(x | D) = ∫ P(x | θ) P(θ | D) dθ. Using Bayes formula, we have: P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ.

    And by the independence assumption: P(D | θ) = ∏_{k=1}^{n} P(x_k | θ).
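A brute-force sketch of this general recipe for a one-dimensional θ (the grid, prior, and data below are assumptions made for illustration): build P(θ | D) ∝ ∏_k P(x_k | θ) P(θ), normalize it numerically, then integrate P(x | θ) P(θ | D) over θ to obtain P(x | D).

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(4)
D = rng.normal(loc=1.0, scale=1.0, size=30)     # hypothetical sample, known sigma = 1

theta = np.linspace(-5.0, 5.0, 4001)            # grid over the unknown parameter (here: the mean)
dtheta = theta[1] - theta[0]
prior = norm_pdf(theta, 0.0, 2.0)                                    # P(theta)
lik = np.prod(norm_pdf(D[:, None], theta[None, :], 1.0), axis=0)     # P(D | theta) = prod_k P(x_k | theta)

post = lik * prior
post /= np.sum(post) * dtheta                   # P(theta | D), normalized numerically

def p_x_given_D(x):
    # P(x | D) = integral over theta of P(x | theta) P(theta | D)
    return np.sum(norm_pdf(x, theta, 1.0) * post) * dtheta

print(p_x_given_D(1.2))
```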

  • Problems of Dimensionality. Consider problems involving 50 or 100 features (binary valued). Classification accuracy depends upon the dimensionality and the amount of training data. Case of two classes, multivariate normal with the same covariance: the Bayes error is P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^{-u²/2} du, where r² = (μ_1 - μ_2)^t Σ^{-1} (μ_1 - μ_2) is the squared Mahalanobis distance between the class means.

  • If the features are independent, then Σ = diag(σ_1², ..., σ_d²) and r² = ∑_{i=1}^{d} ((μ_i1 - μ_i2) / σ_i)².

    The most useful features are the ones for which the difference between the means is large relative to the standard deviation. It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
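A small sketch of this feature-usefulness criterion (the per-feature means and standard deviations below are made up): with independent features each one contributes ((μ_i1 - μ_i2)/σ_i)² to r², so ranking features by that term ranks them by usefulness.

```python
import numpy as np

# Hypothetical per-feature class means and (shared) standard deviations.
mu1   = np.array([0.0, 1.0, 0.2, 3.0])
mu2   = np.array([0.5, 1.1, 2.2, 3.1])
sigma = np.array([1.0, 0.5, 1.0, 2.0])

contrib = ((mu1 - mu2) / sigma) ** 2     # per-feature contribution to r^2
r2 = contrib.sum()                       # squared Mahalanobis distance (independent case)

print(r2)
print(np.argsort(contrib)[::-1])         # features ranked from most to least useful
```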


  • Computational Complexity. Our design methodology is affected by the computational difficulty. Big-oh notation: f(x) = O(h(x)), read "big oh of h(x)", if there exist constants c and x_0 such that |f(x)| ≤ c |h(x)| for all x > x_0.

    (An upper bound: f(x) grows no worse than h(x) for sufficiently large x!) Example: f(x) = 2 + 3x + 4x² and g(x) = x² give f(x) = O(x²).

  • Big oh is not unique! f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴). Big-theta notation: f(x) = Θ(h(x)) if there exist constants c_1, c_2, and x_0 such that c_1 h(x) ≤ f(x) ≤ c_2 h(x) for all x > x_0.

    For the example above, f(x) = Θ(x²) but f(x) ≠ Θ(x³).
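A quick numeric check of these claims (illustration only): f(x)/x² settles near a positive constant, so f(x) = Θ(x²), while f(x)/x³ shrinks toward 0, so x³ is only a loose upper bound.

```python
def f(x):
    return 2 + 3 * x + 4 * x**2

for x in (10, 100, 1000, 10_000):
    print(x, f(x) / x**2, f(x) / x**3)
# f(x)/x^2 approaches the constant 4 (so f = Theta(x^2));
# f(x)/x^3 shrinks toward 0 (x^3 is an upper bound, not a tight one).
```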

  • Complexity of the ML Estimation. Gaussian priors in d dimensions, with n training samples for each of c classes. For each category, we have to compute the discriminant function g(x) = -(1/2)(x - μ̂)^t Σ̂^{-1} (x - μ̂) - (d/2) ln 2π - (1/2) ln |Σ̂| + ln P(ω).

    Total = O(d²·n). Total for c classes = O(c·d²·n) ≈ O(d²·n). The cost increases when d and n are large!
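A sketch of where these costs come from (the function names and data are my assumptions): estimating the d × d covariance Σ̂ from n samples is the O(d²·n) step for each class, and evaluating the discriminant g(x) then reduces to a quadratic form in the d-dimensional difference vector.

```python
import numpy as np

def train_class(X):
    """Estimate (mu_hat, Sigma_hat) from an n x d sample matrix X.
    The covariance product below costs O(d^2 * n), the dominant training term."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / len(X)
    return mu, Sigma

def discriminant(x, mu, Sigma, prior):
    """g(x) = -1/2 (x-mu)^t Sigma^-1 (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(w).
    (Sigma^-1 is applied via solve here; with it precomputed, the quadratic form alone is O(d^2).)"""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))          # hypothetical training set for one class (n=200, d=3)
mu, Sigma = train_class(X)
print(discriminant(np.zeros(3), mu, Sigma, prior=0.5))
```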