Page 1

Institute of Information Theory and Automation

Introduction to Pattern Recognition

Jan Flusser

[email protected]

Pages 2–4

Pattern Recognition

• Recognition (classification) = assigning a pattern/object to one of pre-defined classes

• Statistical (feature-based) PR – the pattern is described by features (an n-D vector in a metric space)

• Syntactic (structural) PR – the pattern is described by its structure; formal language theory (class = language, pattern = word)

Pages 5–6

Pattern Recognition

• Supervised PR

– a training set is available for each class

• Unsupervised PR (clustering)

– no training set is available; the number of classes may not be known

Page 7: Institute of Information Theory and Automation Introduction to Pattern Recognition Jan Flusser flusser@utia.cas.cz.

PR system - Training stage

Definition of the

features

Selection of the

training set

Computing the features

of the training set

Classification rule

setup

Page 8

Desirable properties of the features

• Invariance

• Discriminability

• Robustness

• Efficiency, independence, completeness

Page 9

Desirable properties of the training set

• It should contain typical representatives of each class including intra-class variations

• Reliable and large enough

• Should be selected by domain experts

Page 10

Classification rule setup

Equivalent to a partitioning of the feature space

Independent of the particular application

Page 11

PR system – Recognition stage

Image acquisition → Preprocessing → Object detection → Computing the features → Classification → Class label

Page 12

An example – Fish classification

Page 13

The features: Length, width, brightness

Page 14

2-D feature space

Pages 15–17

Empirical observation

• For a given training set, we can have several classifiers (several partitionings of the feature space)

• The training samples are not always classified correctly

• We should avoid overtraining of the classifier

Page 18

Formal definition of the classifier

• Each class is characterized by its discriminant function g_i(x)

• Classification = maximization of g_i(x): assign x to class i iff g_i(x) ≥ g_j(x) for all j ≠ i

• The discriminant functions define decision boundaries in the feature space
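As a concrete illustration, here is a minimal Python sketch of classification by maximizing discriminant functions; the two linear discriminants and their coefficients are invented for the example and are not from the lecture.

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant function g_i(x) is largest."""
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))

# Two illustrative linear discriminants g_i(x) = w_i . x + b_i
g = [lambda x: np.dot([1.0, 0.5], x) - 1.0,
     lambda x: np.dot([-0.2, 1.0], x) + 0.3]
print(classify(np.array([2.0, 1.0]), g))   # index of the winning class
```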

Page 19

Minimum distance (NN) classifier

• Discriminant function g_i(x) = 1 / dist(x, m_i), where m_i is the training sample representing class i

• Various definitions of dist(x, m_i) can be used

• One-element training set per class
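A minimal sketch of the minimum-distance rule; the prototypes and the Euclidean distance below are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

# One labelled prototype per class (one-element training set per class).
prototypes = np.array([[0.0, 0.0],    # class 0
                       [3.0, 3.0]])   # class 1

def nn_classify(x, prototypes):
    """Minimum-distance rule: maximizing g_i(x) = 1/dist(x, m_i) is the same
    as picking the nearest prototype (Euclidean distance used here)."""
    d = np.linalg.norm(prototypes - x, axis=1)
    return int(np.argmin(d))

print(nn_classify(np.array([2.5, 2.0]), prototypes))   # -> 1
```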

Page 20

Voronoi polygons

Page 21

Minimum distance (NN) classifier (cont.)

• The decision regions are the Voronoi polygons of the training samples

• The NN classifier may not be linear

• The NN classifier is sensitive to outliers → k-NN classifier

Page 22

k-NN classifier

Find the nearest training points one by one until k samples belonging to one class are reached; assign x to that class.
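A small Python sketch of this rule under the stated reading (keep taking nearest neighbours until some class has collected k of them); the toy data and the value of k are illustrative.

```python
import numpy as np

def knn_classify(x, X_train, y_train, k):
    """Visit training points in order of increasing distance to x and stop as
    soon as some class has collected k of them; x gets that class label."""
    order = np.argsort(np.linalg.norm(X_train - x, axis=1))
    counts = {}
    for idx in order:
        c = int(y_train[idx])
        counts[c] = counts.get(c, 0) + 1
        if counts[c] == k:
            return c
    return int(y_train[order[0]])   # fallback for k larger than any class

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([4.0, 4.5]), X, y, k=2))   # -> 1
```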

Page 23

Linear classifier

The discriminant functions g_i(x) are linear, so the decision boundaries are hyperplanes.

Page 24

Bayesian classifier

• Assumption: the feature values are random variables

• Statistical classifier – the decision is probabilistic

• Based on the Bayes rule

Page 25

The Bayes rule

P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x)

where P(ω_i | x) is the a posteriori probability, p(x | ω_i) the class-conditional probability, P(ω_i) the a priori probability, and p(x) the total probability.

Page 26

Bayesian classifier

Main idea: maximize the posterior probability P(ω_i | x)

Since it is hard to do directly, we rather maximize p(x | ω_i) P(ω_i)

In case of equal priors, we maximize only p(x | ω_i)
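A minimal Python sketch of this decision rule; the two 1-D Gaussian class-conditional densities and the prior values are invented for the example.

```python
import numpy as np

def bayes_classify(x, class_pdfs, priors):
    """Pick the class maximizing p(x | w_i) * P(w_i); the common factor p(x)
    can be ignored because it does not depend on the class."""
    scores = [pdf(x) * prior for pdf, prior in zip(class_pdfs, priors)]
    return int(np.argmax(scores))

def gauss(mu, sigma):
    """1-D Gaussian density used as an illustrative class-conditional pdf."""
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

pdfs = [gauss(0.0, 1.0), gauss(3.0, 1.0)]
print(bayes_classify(1.0, pdfs, priors=[0.7, 0.3]))   # -> 0
```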

Page 27

Equivalent formulation in terms of discriminant functions, e.g. g_i(x) = p(x | ω_i) P(ω_i) or its logarithm.

Page 28

How to estimate p(x | ω_i) and P(ω_i)?

The class-conditional densities p(x | ω_i):

• Parametric estimate (assuming the pdf is of a known form, e.g. Gaussian)

• Non-parametric estimate (the pdf is unknown or too complex)

The priors P(ω_i):

• From case studies performed before (OCR, speech recognition)

• From the occurrence in the training set

• Assumption of equal priors

Page 29

Parametric estimate of Gaussian

Page 30

d-dimensional Gaussian pdf

p(x) = exp( -1/2 (x − μ)^T Σ^{-1} (x − μ) ) / ( (2π)^{d/2} |Σ|^{1/2} )
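A small NumPy sketch that evaluates this density and obtains the maximum-likelihood parameter estimates from a training set; the sample data below are synthetic.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """d-dimensional Gaussian density
    p(x) = exp(-1/2 (x-mu)^T Sigma^-1 (x-mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)

def ml_estimate(samples):
    """Maximum-likelihood estimates of the mean and covariance matrix."""
    mu = samples.mean(axis=0)
    Sigma = np.cov(samples, rowvar=False, bias=True)   # ML (biased) estimate
    return mu, Sigma

X = np.random.default_rng(0).normal(size=(200, 2)) + np.array([1.0, 2.0])
mu_hat, Sigma_hat = ml_estimate(X)
print(gaussian_pdf(np.array([1.0, 2.0]), mu_hat, Sigma_hat))
```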

Page 31

The role of covariance matrix

Page 32

Two-class Gaussian case in 2D

Classification = comparison of two Gaussians

Page 33

Two-class Gaussian case – Equal cov. mat.

Linear decision boundary

Page 34

Equal priors

With equal priors and a common covariance matrix, maximizing g_i(x) is equivalent to minimizing the Mahalanobis distance, i.e. classification by minimum Mahalanobis distance.

If the covariance matrix is diagonal with equal variances, we get the “standard” minimum (Euclidean) distance rule.
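A minimal sketch of the minimum-Mahalanobis-distance rule; the class means and the shared covariance matrix below are illustrative values.

```python
import numpy as np

def mahalanobis_classify(x, means, Sigma):
    """Equal priors, common covariance: pick the class with the smallest
    Mahalanobis distance d_i(x) = (x - mu_i)^T Sigma^-1 (x - mu_i)."""
    Sigma_inv = np.linalg.inv(Sigma)
    d = [float((x - m) @ Sigma_inv @ (x - m)) for m in means]
    return int(np.argmin(d))

means = [np.array([0.0, 0.0]), np.array([4.0, 1.0])]
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(mahalanobis_classify(np.array([3.0, 0.5]), means, Sigma))   # -> 1
```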

Page 35

Non-equal priors

Linear decision boundary still preserved

Page 36

General Gaussian case in 2D

Decision boundary is a hyperquadric

Page 37

General Gaussian case in 3D

Decision boundary is a hyperquadric

Page 38

More classes, Gaussian case in 2D

Page 39

What to do if the classes are not normally distributed?

• Gaussian mixtures

• Parametric estimation of some other pdf

• Non-parametric estimation

Page 40

Non-parametric estimation – Parzen window

Page 41

The role of the window size

• Small window → overtraining

• Large window → data smoothing

• Continuous data → the size does not matter

(a small numerical sketch follows)
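A minimal Parzen-window sketch in 1-D with a Gaussian window; the data, the window widths and the evaluation point are illustrative and merely show how a too-small and a too-large h behave.

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Parzen-window estimate of a 1-D pdf with a Gaussian window of width h:
    p(x) ~ (1/n) * sum_j K((x - x_j) / h) / h."""
    u = (x - samples) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return float(kernel.sum() / (len(samples) * h))

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=100)
for h in (0.05, 0.5, 2.0):          # too small / reasonable / too large window
    print(h, parzen_estimate(0.0, data, h))
```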

Page 42

[Figure: Parzen window estimates for sample sizes n = 1, 10, 100, ∞]

Page 43

The role of the window size

[Figure: estimate with a small window vs. a large window]

Page 44

Applications of the Bayesian classifier in multispectral remote sensing

• Objects = pixels

• Features = pixel values in the spectral bands (from 4 to several hundred)

• Training set – selected manually by means of thematic maps (GIS) and on-site observation

• Number of classes – typically from 2 to 16

Page 45

Satellite MS image

Page 46

Other classification methods in RS

• Context-based classifiers

• Shape and textural features

• Post-classification filtering

• Spectral pixel unmixing

Page 47

Non-metric classifiers

• Typically for “YES – NO” features

• The feature metric is not explicitly defined

• Decision trees

Page 48

General decision tree

Page 49

Binary decision tree

Any decision tree can be replaced by a binary tree

Page 50

Real-valued features

• Node decisions are in the form of inequalities

• Training = setting their parameters

• Simple inequalities → stepwise decision boundary (a small sketch follows)
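A tiny hand-set decision tree on two real-valued features, just to illustrate threshold inequalities at the nodes; the thresholds and class names are invented.

```python
def tree_classify(x):
    """Each internal node tests one inequality on one feature, so the
    resulting decision boundary is axis-parallel and stepwise."""
    if x[0] < 2.0:            # node 1: inequality on feature 0
        return "class A"
    elif x[1] < 1.5:          # node 2: inequality on feature 1
        return "class B"
    return "class A"

print(tree_classify([1.0, 3.0]))   # class A
print(tree_classify([3.0, 1.0]))   # class B
```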

Page 51

Stepwise decision boundary

Page 52

Real-valued features

The tree structure and the form of the inequalities influence both performance and speed.

Page 55

Classification performance

• How to evaluate the performance of the classifiers?

– evaluation on the training set (optimistic error estimate)

– evaluation on the test set (pessimistic error estimate)

Page 56

Classification performance

• How to increase the performance?

– other features

– more features (dangerous – curse of dimensionality!)

– other (larger, better) training sets

– other parametric model

– other classifier

– combining different classifiers

Page 57

Combining (fusing) classifiers

The object is described by C feature vectors x_1, …, x_C (one per classifier).

Bayes rule: assign the class maximizing the joint posterior P(ω_i | x_1, …, x_C).

There are several possibilities how to do that.

Page 58

Product rule

Assumption: conditional independence of the x_j

Assign the class maximizing P(ω_i) ∏_j p(x_j | ω_i)
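A minimal sketch of product-rule fusion; the per-classifier likelihood values and the priors below are made up for the example.

```python
import numpy as np

def product_rule(likelihoods_per_classifier, priors):
    """Product-rule fusion: with conditionally independent feature vectors x_j,
    pick the class maximizing P(w_i) * prod_j p(x_j | w_i).
    likelihoods_per_classifier[j][i] holds p(x_j | w_i) from classifier j."""
    scores = np.asarray(priors, dtype=float)
    for likelihoods in likelihoods_per_classifier:
        scores = scores * np.asarray(likelihoods, dtype=float)
    return int(np.argmax(scores))

# Two classifiers, two classes
print(product_rule([[0.6, 0.2], [0.3, 0.5]], priors=[0.5, 0.5]))   # -> 0
```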

Page 59

Max-max and max-min rules

Assumption: equal priors

• Max-max rule: assign the class maximizing max_j P(ω_i | x_j)

• Max-min rule: assign the class maximizing min_j P(ω_i | x_j)

Page 60

Majority vote

• The most straightforward fusion method

• Can be used for all types of classifiers
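A one-function sketch of majority voting over the labels returned by the individual classifiers; the labels are illustrative.

```python
from collections import Counter

def majority_vote(labels):
    """Each classifier contributes one class label; the most frequent label
    wins (ties are broken by the label encountered first)."""
    return Counter(labels).most_common(1)[0][0]

print(majority_vote(["cat", "dog", "cat"]))   # -> cat
```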

Page 61

Unsupervised Classification (Cluster analysis)

A training set is not available; the number of classes may not be known a priori.

Page 62

What are clusters?

Intuitive meaning

- compact, well-separated subsets

Formal definition

- any partition of the data into disjoint subsets

Page 63

What are clusters?

Page 64

How to compare different clusterings?

The variance measure J should be minimized:

J = Σ_i Σ_{x ∈ C_i} ||x − m_i||²,   where m_i is the centroid of cluster C_i

Drawback – only clusterings with the same number of clusters N can be compared. The global minimum J = 0 is reached in the degenerate case (every point forms its own cluster).
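A short NumPy sketch of this criterion; the points and labels are toy values.

```python
import numpy as np

def clustering_criterion(points, labels):
    """Within-cluster variance J: sum of squared distances of the members of
    each cluster to that cluster's centroid."""
    J = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        J += ((members - members.mean(axis=0)) ** 2).sum()
    return J

X = np.array([[0, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
print(clustering_criterion(X, np.array([0, 0, 1, 1])))   # compact clusters -> small J
```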

Page 65

Clustering techniques

• Iterative methods

- typically if N is given

• Hierarchical methods

- typically if N is unknown

• Other methods

- sequential, graph-based, branch & bound,

fuzzy, genetic, etc.

Page 66

Sequential clustering

• N may be unknown

• Very fast but not very good

• Each point is considered only once

Idea: a new point is either added to an existing cluster or it forms a new cluster. The decision is based on the user-defined distance threshold.
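A single-pass sketch of this idea; the distance threshold, the centroid-based cluster distance and the toy points are illustrative choices.

```python
import numpy as np

def sequential_clustering(points, threshold):
    """Each point either joins the nearest existing cluster (if closer than
    `threshold` to its centroid) or starts a new cluster; one pass only."""
    centroids, members = [], []
    for x in points:
        if centroids:
            d = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(d))
            if d[j] < threshold:
                members[j].append(x)
                centroids[j] = np.mean(members[j], axis=0)
                continue
        centroids.append(np.asarray(x, dtype=float))
        members.append([x])
    return members

data = np.array([[0, 0], [0.5, 0], [5, 5], [5.2, 4.9], [0.2, 0.3]], dtype=float)
print([len(m) for m in sequential_clustering(data, threshold=2.0)])   # -> [3, 2]
```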

Page 67

Sequential clustering

Drawbacks:

- Dependence on the distance threshold

- Dependence on the order of data points

Page 68

Iterative clustering methods

• N-means clustering

• Iterative minimization of J

• ISODATA

Iterative Self-Organizing DATa Analysis

Page 69

N-means clustering

1. Select N initial cluster centroids.

Page 70

N-means clustering

2. Classify every point x according to minimum distance.

Page 71

N-means clustering

3. Recalculate the cluster centroids.

Page 72

N-means clustering

4. If the centroids did not change then STOP else GOTO 2.
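The four steps above, assembled into a short NumPy sketch; the initialization by randomly chosen data points, the Euclidean distance and the toy data are illustrative (empty clusters are not handled here).

```python
import numpy as np

def n_means(points, N, max_iter=100, seed=0):
    """1) pick N initial centroids, 2) assign each point to the nearest
    centroid, 3) recompute the centroids, 4) stop when they do not change."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=N, replace=False)]
    for _ in range(max_iter):
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(N)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, _ = n_means(X, N=2)
print(labels)   # two compact groups of three points each
```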

Page 73

N-means clustering

Drawbacks

- The result depends on the initialization.

- J is not minimized

- The results are sometimes “intuitively wrong”.

Page 74

N-means clustering – An example

Two features, four points, two clusters (N = 2)

Different initializations → different clusterings

Page 75

Initial centroids

N-means clustering – An example

Page 76

Initial centroids

N-means clustering – An example

Page 77

Iterative minimization of J

1. Let’s have an initial clustering (by N-means)

2. For every point x do the following:

Move x from its current cluster to another cluster, such that the decrease of J is maximized.

3. If no point has moved, then STOP, else GOTO 2.
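A greedy sketch of this refinement; the criterion J is the within-cluster variance introduced above, and the starting labels and data are toy values.

```python
import numpy as np

def J(points, labels):
    """Within-cluster variance, the criterion to be minimized."""
    return sum(((points[labels == c] - points[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

def refine(points, labels, N):
    """Move single points to whichever cluster decreases J the most; repeat
    until no point wants to move."""
    moved = True
    while moved:
        moved = False
        for i in range(len(points)):
            current = J(points, labels)
            trials = []
            for c in range(N):
                cand = labels.copy()
                cand[i] = c
                trials.append((J(points, cand), c))
            best_J, best_c = min(trials)
            if best_c != labels[i] and best_J < current:
                labels[i] = best_c
                moved = True
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9]], dtype=float)
print(refine(X, np.array([0, 0, 1, 1, 1]), N=2))   # -> [0 0 0 1 1]
```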

Page 78

Example of “wrong” result

Page 79

ISODATA

Iterative clustering, N may vary.

Sophisticated method, a part of many statistical software systems.

Postprocessing after each iteration

- Clusters with few elements are cancelled

- Clusters with big variance are divided

- Other merging and splitting strategies can be implemented

Page 80

Hierarchical clustering methods

• Agglomerative clustering

• Divisive clustering

Pages 81–83

Basic agglomerative clustering

1. Each point = one cluster

2. Find the two “nearest” or “most similar” clusters and merge them together

3. Repeat step 2 until the stop constraint is reached

Page 84

Basic agglomerative clustering

Particular implementations of this method differ from each other by

- The STOP constraints

- The distance/similarity measures used

Page 85

Simple between-cluster distance measures

• d(A,B) = d(m_1, m_2)   (distance of the cluster centroids)

• d(A,B) = min d(a,b), a ∈ A, b ∈ B   (single linkage)

• d(A,B) = max d(a,b), a ∈ A, b ∈ B   (complete linkage)
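A compact sketch of the basic agglomerative procedure with a pluggable between-cluster distance; `linkage=min` gives the min-distance measure above, `linkage=max` the max-distance one. The stop constraint here is a target number of clusters, and the data are toy points.

```python
import numpy as np

def agglomerative(points, n_clusters, linkage=min):
    """Start with one cluster per point and repeatedly merge the two clusters
    with the smallest between-cluster distance until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(np.linalg.norm(points[i] - points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)
print(agglomerative(X, n_clusters=3))   # -> [[0, 1], [2, 3], [4]]
```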

Page 86

Other between-cluster distance measures

• d(A,B) = Hausdorff distance H(A,B)

• d(A,B) = J(A ∪ B) − J(A,B)   (the increase of the variance criterion caused by merging A and B)

Page 87

Agglomerative clustering – representation by a clustering tree (dendrogram)

Page 88

How many clusters are there?

2 or 4 ?

Clustering is a very subjective task

Page 89

How many clusters are there?

• Difficult to answer even for humans

• “Clustering tendency”

• Hierarchical methods – N can be estimated from the complete dendrogram

• The methods minimizing a cost function – N can be estimated from the “knees” in the J–N graph (a small sketch follows)
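A tiny sketch of reading the knee off a J–N curve; the J values below are invented (with the knee placed at N = 3), and the detector simply takes the largest second difference.

```python
import numpy as np

# Illustrative J values for N = 1..6 clusters (knee at N = 3)
J_values = np.array([120.0, 55.0, 20.0, 17.0, 15.0, 14.0])

# The knee is where the decrease of J slows down the most, i.e. where the
# second difference of the J-N curve is largest.
second_diff = np.diff(J_values, n=2)      # entry k corresponds to N = k + 2
knee_N = int(np.argmax(second_diff)) + 2
print(knee_N)                             # -> 3
```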

Page 90

Life time of the clusters

Optimal number of clusters = 4

Page 91

Optimal number of clusters

Page 92

Applications of clustering in image processing

• Segmentation – clustering in the color space

• Preliminary classification of multispectral images

• Clustering in parametric space – RANSAC, image registration and matching

Numerous applications exist outside the image processing area.

Page 93

References

Duda, Hart, Stork: Pattern Classification, 2nd ed., Wiley Interscience, 2001

Theodoridis, Koutroumbas: Pattern Recognition, 2nd ed., Elsevier, 2003

http://www.utia.cas.cz/user_data/zitova/prednasky/