Chapter 20 Classification and Estimation
• 20.2 Classification
– 20.2.1 Feature selection
• Good features have four characteristics:
– Discrimination. Features should take on significantly different values for objects belonging to different classes.
– Reliability. Features should take on similar values for all objects of the same class.
– Independence. The various features used should be uncorrelated with each other.
– Small numbers. The number of features should be small because the complexity of a pattern recognition system increases rapidly with the dimensionality of the system.
• Classifier design
– Classifier design consists of establishing the logical structure of the classifier and the mathematical basis of the classification rule.
• Classifier training
– A group of known objects is used to train a classifier to determine its threshold values.
– The training set is a collection of objects from each class that have been previously identified by some accurate method.
– Training rule: minimize an error function or a cost function.
– Pitfalls: an unrepresentative training set; a biased training set.
– 20.2.4 Measurement of performance
• The accuracy of a classifier can be estimated directly by classifying a known test set of objects.
• An alternative is to use a test set of known objects to estimate the PDFs of the features for objects belonging to each group.
• Using a different test set from the training set is a better approach to evaluate a classifier.
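The difference between accuracy measured on the training set and on a separate test set can be illustrated with a minimal Python sketch. The one-feature threshold classifier, the function names and the toy data are assumptions made for illustration only; they do not come from the slides.

```python
def train_threshold(samples):
    """Hypothetical training rule: threshold halfway between the two class means."""
    class0 = [x for x, label in samples if label == 0]
    class1 = [x for x, label in samples if label == 1]
    mean0 = sum(class0) / len(class0)
    mean1 = sum(class1) / len(class1)
    return (mean0 + mean1) / 2.0


def accuracy(threshold, samples):
    """Fraction of samples whose predicted class matches the known class."""
    correct = sum(1 for x, label in samples if (1 if x > threshold else 0) == label)
    return correct / len(samples)


# Toy data: (feature value, known class). Class 1 tends to have larger values.
training_set = [(1.0, 0), (1.5, 0), (2.0, 0), (4.0, 1), (4.5, 1), (5.0, 1)]
test_set = [(1.2, 0), (2.9, 0), (3.4, 0), (3.6, 1), (4.8, 1)]

threshold = train_threshold(training_set)
train_accuracy = accuracy(threshold, training_set)  # optimistic estimate
test_accuracy = accuracy(threshold, test_set)       # estimate on unseen objects
print(threshold, train_accuracy, test_accuracy)
```

In this toy case the training-set accuracy is perfect while the test-set accuracy is lower, which is why the separate test set gives the more realistic estimate.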
20.3 Feature selection
• Feature selection is the process of eliminating some features and combining others that are related, until the feature set becomes manageable and performance is still adequate.
• The brute-force approach to feature selection.
• Consider a training set containing objects from M different classes. Let $N_j$ be the number of objects from class j, and let $x_{ij}$ and $y_{ij}$ be the two features measured for the ith object in class j. The mean value of each feature within class j is
$$\hat{\mu}_{xj} = \frac{1}{N_j}\sum_{i=1}^{N_j} x_{ij}, \qquad \hat{\mu}_{yj} = \frac{1}{N_j}\sum_{i=1}^{N_j} y_{ij}$$
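• 20.3.1 Feature variance
– All objects within the same class should take on similar values. The variance of each feature within class j is
$$\hat{\sigma}_{xj}^{2} = \frac{1}{N_j}\sum_{i=1}^{N_j} \left(x_{ij} - \hat{\mu}_{xj}\right)^{2}, \qquad \hat{\sigma}_{yj}^{2} = \frac{1}{N_j}\sum_{i=1}^{N_j} \left(y_{ij} - \hat{\mu}_{yj}\right)^{2}$$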
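A minimal Python sketch of the per-class mean and variance estimates follows. It assumes the estimators divide by the class size N_j; the function names and the toy feature values are hypothetical.

```python
def class_mean(values):
    """Sample mean of one feature over the objects of a single class."""
    return sum(values) / len(values)


def class_variance(values):
    """Variance of one feature within a class, dividing by N_j (not N_j - 1)."""
    mu = class_mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


# Toy measurements of feature x, grouped by class label j.
x_by_class = {1: [2.0, 4.0, 6.0], 2: [10.0, 12.0, 14.0, 16.0]}

means = {j: class_mean(v) for j, v in x_by_class.items()}
variances = {j: class_variance(v) for j, v in x_by_class.items()}
print(means, variances)
```

A small within-class variance relative to the spread between class means is what the reliability criterion asks for.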
• 20.3.2 Feature correlation
– The correlation of the features x and y in class j is
$$\hat{\sigma}_{xyj} = \frac{\dfrac{1}{N_j}\sum_{i=1}^{N_j} \left(x_{ij} - \hat{\mu}_{xj}\right)\left(y_{ij} - \hat{\mu}_{yj}\right)}{\hat{\sigma}_{xj}\,\hat{\sigma}_{yj}}$$
– A value of zero indicates that the two features are uncorrelated, while a value near 1 implies a high degree of correlation.
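The within-class correlation can be sketched in Python as the covariance divided by the product of the two standard deviations. The function name and the toy data are assumptions for illustration.

```python
import math


def correlation(xs, ys):
    """Correlation of two features measured on the same objects of one class."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    sigma_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    covariance = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    return covariance / (sigma_x * sigma_y)


x = [1.0, 2.0, 3.0, 4.0]
y_redundant = [2.0, 4.0, 6.0, 8.0]       # proportional to x: carries no new information
y_independent = [1.0, -1.0, -1.0, 1.0]   # constructed to be uncorrelated with x

r_redundant = correlation(x, y_redundant)
r_independent = correlation(x, y_independent)
print(r_redundant, r_independent)
```

A feature pair like the first one is a candidate for combining or discarding, since the second feature adds nothing the first does not already provide.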
• 20.3.3 Class separation distance
– The variance-normalized distance between two classes is
$$\hat{D}_{xjk} = \frac{\left|\hat{\mu}_{xj} - \hat{\mu}_{xk}\right|}{\sqrt{\hat{\sigma}_{xj}^{2} + \hat{\sigma}_{xk}^{2}}}$$
where the two classes are j and k.
– The greater the distance is, the better the feature is.
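As a rough illustration, the sketch below ranks two features by their class separation distance. It assumes the distance is the absolute difference of the class means divided by the square root of the summed class variances; the function name and the data are hypothetical.

```python
import math


def separation_distance(values_j, values_k):
    """Variance-normalized distance between classes j and k for one feature."""
    mu_j = sum(values_j) / len(values_j)
    mu_k = sum(values_k) / len(values_k)
    var_j = sum((v - mu_j) ** 2 for v in values_j) / len(values_j)
    var_k = sum((v - mu_k) ** 2 for v in values_k) / len(values_k)
    return abs(mu_j - mu_k) / math.sqrt(var_j + var_k)


# Feature x: class means far apart, small spread within each class.
x_class_j, x_class_k = [1.0, 2.0, 3.0], [7.0, 8.0, 9.0]
# Feature y: class means close together, large spread within each class.
y_class_j, y_class_k = [1.0, 5.0, 9.0], [3.0, 7.0, 11.0]

d_x = separation_distance(x_class_j, x_class_k)
d_y = separation_distance(y_class_j, y_class_k)
print(d_x, d_y)
```

Here feature x yields the larger distance, so it would be preferred over feature y for separating the two classes.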
• 20.3.4 Dimension reduction
– Many features can be combined to form a smaller number of features.
– Linear combination. Two features x and y can produce a new feature z by
$$z = ax + by$$
Because the overall scale of z does not affect classification, this can be reduced to
$$z = x\cos\theta + y\sin\theta$$
– This is a projection of the (x, y) plane onto the line z.
[Figure: Class 1 and Class 2 plotted in the (x, y) plane and projected onto the z axis]
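One way to pick the projection angle is to try a range of angles and keep the one that best separates the classes along z. The sketch below does this with a coarse grid search; the search strategy, the separation measure (difference of means over the root of summed variances), the names and the toy points are all assumptions, not a method given in the slides.

```python
import math


def project(points, theta):
    """Map each (x, y) pair to z = x*cos(theta) + y*sin(theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return [x * c + y * s for x, y in points]


def separation_distance(a, b):
    """Variance-normalized distance between two sets of scalar values."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / len(a)
    var_b = sum((v - mean_b) ** 2 for v in b) / len(b)
    return abs(mean_a - mean_b) / math.sqrt(var_a + var_b)


def distance_at(deg, class1, class2):
    theta = math.radians(deg)
    return separation_distance(project(class1, theta), project(class2, theta))


# Two toy classes offset along the diagonal of the (x, y) plane.
class1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class2 = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0), (4.0, 4.0)]

# Coarse grid search over projection angles in 5-degree steps.
best_deg = max(range(0, 180, 5), key=lambda d: distance_at(d, class1, class2))
best_distance = distance_at(best_deg, class1, class2)
print(best_deg, best_distance)
```

For these points the best angle is the diagonal direction, where the single feature z separates the classes better than either x or y alone.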
20.4 Statistical Classification
• 20.4.1 Statistical decision theory
– An approach that performs classification by statistical methods. The PDFs of the features are assumed to be known.
– The PDFs of a feature may be estimated by measuring a large number of objects, and plotting a histogram of the feature.
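A histogram-based PDF estimate can be sketched as follows: count the samples in each bin, then divide by the total count times the bin width so the estimate integrates to one. The function name, the bin settings and the measurements are hypothetical.

```python
def histogram_pdf(samples, low, high, bins):
    """Estimate a PDF on [low, high) as normalized histogram heights."""
    width = (high - low) / bins
    counts = [0] * bins
    for s in samples:
        if low <= s < high:
            # Clamp the index to guard against floating-point rounding at the top edge.
            index = min(int((s - low) / width), bins - 1)
            counts[index] += 1
    total = sum(counts)
    return [c / (total * width) for c in counts]


# Toy feature measurements for the objects of one class.
samples = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9, 1.1, 1.3, 1.4, 1.8]
pdf = histogram_pdf(samples, low=0.0, high=2.0, bins=4)
area = sum(height * 0.5 for height in pdf)  # bin width is (2.0 - 0.0) / 4 = 0.5
print(pdf, area)
```

With only a few samples the estimate is coarse; a larger number of measured objects allows narrower bins and a smoother estimate.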
– 20.4.1.1 A priori probabilities
• The a priori probabilities represent our knowledge about an object before it has been measured.
• The conditional probability $P(E_1 \mid E_2)$ is the probability of the event $E_1$, given that the event $E_2$ occurs.
• 20.4.1.2 Bayes’ theorem
– The a posteriori probability is the conditional probability $P(C_i \mid x)$, that is, the probability that the object belongs to class $C_i$ given that the feature value $x$ occurs.
– Bayes’ theorem (two classes):
$$P(C_i \mid x) = \frac{p(x \mid C_i)\,P(C_i)}{p(x)} = \frac{p(x \mid C_i)\,P(C_i)}{\sum_{j=1}^{2} p(x \mid C_j)\,P(C_j)}$$
• Bayes’ theorem may be used for pattern classification. For example, when there are only two classes, an object is assigned to class 1 if
$$P(C_1 \mid x) > P(C_2 \mid x)$$
• This is equivalent to
$$p(x \mid C_1)\,P(C_1) > p(x \mid C_2)\,P(C_2)$$
• The classifier defined by this decision rule is called a maximum-likelihood classifier.
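The two-class rule can be sketched in Python by computing the a posteriori probabilities from Bayes’ theorem and assigning the object to class 1 when its posterior is larger. The normal class-conditional PDFs, their parameters, the priors and the function names are assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mean, std):
    """Normal density, used here as an assumed class-conditional PDF p(x | C_i)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def posteriors(x, priors, params):
    """Bayes' theorem: P(C_i | x) = p(x | C_i) P(C_i) / sum_j p(x | C_j) P(C_j)."""
    joint = [normal_pdf(x, mean, std) * prior for (mean, std), prior in zip(params, priors)]
    evidence = sum(joint)
    return [value / evidence for value in joint]


def classify(x, priors, params):
    """Assign to class 1 if P(C_1 | x) > P(C_2 | x), otherwise to class 2."""
    post = posteriors(x, priors, params)
    return 1 if post[0] > post[1] else 2


params = [(0.0, 1.0), (2.0, 1.0)]  # assumed (mean, std) of feature x in classes 1 and 2
equal_priors = [0.5, 0.5]
skewed_priors = [0.9, 0.1]

print(classify(0.5, equal_priors, params))   # closer to the class 1 mean
print(classify(1.5, equal_priors, params))   # closer to the class 2 mean
print(classify(1.5, skewed_priors, params))  # a strong prior for class 1 changes the decision
```

The last call shows the role of the a priori probabilities: the same measurement is assigned to a different class once class 1 is known to be much more common.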
– If there is more than one feature, so that the feature vector is $[x_1, x_2, \ldots, x_n]^T$, and there are m classes, then Bayes’ theorem is
$$P(C_i \mid x_1, x_2, \ldots, x_n) = \frac{p(x_1, x_2, \ldots, x_n \mid C_i)\,P(C_i)}{\sum_{k=1}^{m} p(x_1, x_2, \ldots, x_n \mid C_k)\,P(C_k)}$$
• Bayes’ risk. The conditional risk is
$$R(C_i \mid x_1, x_2, \ldots, x_n) = \sum_{j=1}^{m} l_{ij}\,P(C_j \mid x_1, x_2, \ldots, x_n)$$
where $l_{ij}$ is the cost (loss) of assigning an object to class i when it really belongs in class j.
– Bayes’ decision rule. Each object should be assigned to the class that produces the minimum conditional risk. The resulting overall risk, the Bayes’ risk, is
$$R = \int R_m(x_1, x_2, \ldots, x_n)\,p(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n$$
where $R_m(x_1, x_2, \ldots, x_n)$ is the minimum conditional risk for that feature vector.
– Parametric and nonparametric classifiers
• If the functional form of the conditional PDFs is known, but some parameters are unknown, the classifier is called parametric.
• If the functional form of some or all of the conditional PDFs is unknown, the classifier is called nonparametric.
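Returning to the conditional risk and Bayes’ decision rule, a minimal Python sketch is given below. The loss matrix, the posterior values and the function names are hypothetical; class indices are 0-based in the code.

```python
def conditional_risks(loss, posteriors):
    """R(C_i | x) = sum_j l_ij * P(C_j | x), one value per candidate class i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]


def bayes_decision(loss, posteriors):
    """Index of the class with the minimum conditional risk."""
    risks = conditional_risks(loss, posteriors)
    return risks.index(min(risks))


# loss[i][j]: cost of assigning an object to class i when it really belongs to class j.
# Here, mistaking a class-1 object for class 0 is ten times as costly as the reverse.
loss = [[0.0, 10.0],
        [1.0, 0.0]]
posteriors = [0.8, 0.2]  # assumed a posteriori probabilities for one measured object

risks = conditional_risks(loss, posteriors)
decision = bayes_decision(loss, posteriors)
print(risks, decision)
```

Although class 0 has the larger posterior probability, the asymmetric losses make class 1 the minimum-risk assignment in this example.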
20.4.3 Parameter estimation and classifier training
• The process of estimating the conditional PDFs or their parameters is referred to as training the classifier.
• Supervised and unsupervised training
– Supervised training. The classes to which the objects in the training set belong are known.
– Unsupervised training. The conditional PDFs are estimated using samples whose class is unknown.
• Maximum-likelihood estimation
– The maximum-likelihood estimation approach assumes that the parameters to be estimated are fixed but unknown.
– The maximum-likelihood estimate of a parameter is the value that makes the occurrence of the observed training set most likely.
– The maximum-likelihood estimates of the mean and standard deviation of a normal distribution are the sample mean and sample standard deviation, respectively.
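A short Python sketch can check this numerically: it computes the sample mean and the sample standard deviation (dividing by N) and compares the log-likelihood at those values with the log-likelihood at nearby parameter values. The function names and the sample values are hypothetical.

```python
import math


def ml_normal_estimates(samples):
    """Maximum-likelihood mean and standard deviation for a normal distribution."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / n)  # divides by N
    return mean, std


def log_likelihood(samples, mean, std):
    """Log of the probability density of the whole training set under N(mean, std^2)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * std ** 2) - (s - mean) ** 2 / (2.0 * std ** 2)
        for s in samples
    )


samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
ml_mean, ml_std = ml_normal_estimates(samples)
best = log_likelihood(samples, ml_mean, ml_std)

# Perturbing either parameter lowers the likelihood of the observed samples.
print(ml_mean, ml_std)
print(best > log_likelihood(samples, ml_mean + 0.5, ml_std))
print(best > log_likelihood(samples, ml_mean, ml_std + 0.5))
```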
20.4.3.3 Bayesian Estimation
– Bayesian estimation treats the unknown parameter as a random variable with a known a priori PDF before any samples are taken.
– After the training set has been measured, Bayes’ theorem is used to update the a priori PDF, which results in an a posteriori PDF of the unknown parameter value.
– The desired outcome is an a posteriori PDF with a single narrow peak centered on the true value of the parameter.
• An example of Bayesian estimation
– Estimate the mean of a normal distribution with known variance. The a priori PDF of the unknown mean is $p(\mu)$.
– The functional form of the PDF given the unknown mean is assumed to be $p(x \mid \mu)$; that is, given a value for $\mu$, we know $p(x)$.
– Suppose $X$ represents the set of sample values obtained by measuring the training set.
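– Bayes’ theorem gives the a posteriori PDF
$$p(\mu \mid X) = \frac{p(X \mid \mu)\,p(\mu)}{\int p(X \mid \mu)\,p(\mu)\,d\mu}$$
– What we really want is
$$p(x \mid X) = \int p(x, \mu \mid X)\,d\mu = \int p(x \mid \mu)\,p(\mu \mid X)\,d\mu$$
– For example, if $p(\mu \mid X)$ has a single sharp peak at $\mu_0$, it can be approximated as an impulse
$$p(\mu \mid X) \approx \delta(\mu - \mu_0)$$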
– Then
$$p(x \mid X) \approx \int p(x \mid \mu)\,\delta(\mu - \mu_0)\,d\mu = p(x \mid \mu_0)$$
This means that $\mu_0$ is the best estimate of the unknown mean.
– If $p(\mu \mid X)$ has a relatively broad peak, then $p(x \mid X)$ becomes a weighted average of many PDFs.
– For a large training set, both maximum-likelihood and Bayesian estimation place the estimate of the unknown mean at the sample mean.
• Steps of Bayesian estimation
– 1. Assume an a priori PDF for the unknown parameters.
– 2. Collect sample values from the population by measuring the training set.
– 3. Use Bayes’ theorem to refine the a priori PDF into the a posteriori PDF.
– 4. Form the joint density of x and the unknown parameter and integrate out the latter to leave the desired estimate of the PDF.
– If we have strong ideas about the probable values of the unknown parameter, we may assume a narrow a priori PDF; otherwise, we should assume a relatively broad one.
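The effect of the prior width can be sketched for the example of a normal distribution with known variance. The sketch assumes a normal a priori PDF for the unknown mean (a choice the slides do not state), for which the a posteriori PDF is also normal and has a closed form; the function name, parameter names and data are hypothetical.

```python
import math


def bayes_normal_mean(samples, sigma, prior_mean, prior_std):
    """Posterior of the mean of N(mu, sigma^2), assuming a normal prior on mu.

    Returns (posterior mean, posterior std, predictive std of a new sample x).
    """
    n = len(samples)
    sample_mean = sum(samples) / n
    prior_var = prior_std ** 2
    noise_var = sigma ** 2
    # Step 3: Bayes' theorem turns the normal prior into a normal posterior.
    post_var = prior_var * noise_var / (n * prior_var + noise_var)
    post_mean = (n * prior_var * sample_mean + noise_var * prior_mean) / (n * prior_var + noise_var)
    # Step 4: integrating out mu gives a normal p(x | X) with this standard deviation.
    predictive_std = math.sqrt(noise_var + post_var)
    return post_mean, math.sqrt(post_var), predictive_std


samples = [4.8, 5.1, 5.3, 4.9, 5.4]  # step 2: measured training set, sample mean 5.1

# Step 1 with a narrow prior (strong belief that the mean is near 0)...
narrow_mean, narrow_std, _ = bayes_normal_mean(samples, sigma=1.0, prior_mean=0.0, prior_std=0.1)
# ...and with a broad prior (little prior knowledge).
broad_mean, broad_std, broad_pred = bayes_normal_mean(samples, sigma=1.0, prior_mean=0.0, prior_std=10.0)

print(narrow_mean, narrow_std)
print(broad_mean, broad_std, broad_pred)
```

With the narrow prior, five samples barely move the estimate away from the prior belief, whereas with the broad prior the estimate lands close to the sample mean. A narrow prior is therefore only appropriate when the prior knowledge is trustworthy.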
20.5 Neural Networks
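• Layered feedforward neural networks
– A single processing element forms a weighted sum of its inputs and passes it through an activation function:
$$O = g[\mathbf{X}^{T}\mathbf{W}] = g\left[\sum_{i=1}^{N} x_i w_i\right] = g[S]$$
where the activation function $g[\cdot]$ is usually a sigmoidal function.
[Figure: a processing element with inputs $x_1, x_2, \ldots, x_n$, weights $w_1, w_2, \ldots, w_n$ and output $y$]
– In a layered network, unit k computes
$$S_k = \sum_{j} w_{kj}\,x_j, \qquad x_k = g(S_k)$$
[Figure: unit k with inputs $x_1, x_2, \ldots, x_N$ and weights $w_{k1}, w_{k2}, \ldots, w_{kN}$]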
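The forward pass of such a network can be sketched in a few lines of Python: each unit forms the weighted sum of its inputs and applies the activation function, and the outputs of one layer become the inputs of the next. The logistic function is used here as one common sigmoidal choice; the weights, layer sizes and function names are assumptions, and no training is shown.

```python
import math


def sigmoid(s):
    """Logistic function, one common sigmoidal activation (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-s))


def unit_output(inputs, weights):
    """S_k = sum_j w_kj * x_j, followed by x_k = g(S_k)."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(s)


def layer_output(inputs, weight_rows):
    """Outputs of all units in one layer; each row holds the weights of one unit."""
    return [unit_output(inputs, row) for row in weight_rows]


inputs = [1.0, 2.0]
hidden_weights = [[0.5, -0.25],  # unit 1: S = 0.5 - 0.5 = 0
                  [1.0, 1.0]]    # unit 2: S = 1 + 2 = 3
output_weights = [[2.0, -2.0]]   # single output unit fed by the two hidden units

hidden = layer_output(inputs, hidden_weights)
output = layer_output(hidden, output_weights)
print(hidden, output)
```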