Digital Image Processing: Lectures 27 & 28
Transcript of Digital Image Processing Lectures 27 & 28
Topics: Segmentation & Feature Extraction, Feature Selection, Pattern Classification, Unsupervised Cluster Discovery
M.R. Azimi, Professor
Department of Electrical and Computer Engineering, Colorado State University
M.R. Azimi Digital Image Processing
Area 5: Segmentation & Feature Extraction
Segmentation: Detect and isolate objects of interest (targets) from the background.
Feature Extraction: Extract salient features of the detected objects for the purpose of classification and recognition.
Figure 1: Block Diagram of a Pattern Classification System.
Segmentation can be done using one of the following classes of approaches:
Histogram-Based
Template Matching
Region Growing
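As a concrete sketch of the histogram-based class, the following NumPy-only function picks a global threshold by maximizing the between-class variance (Otsu's method, a standard histogram-based technique; the function name and implementation are ours, not from the lecture):

```python
import numpy as np

def otsu_threshold(img):
    """Histogram-based segmentation: choose the threshold that maximizes
    the between-class variance of the gray-level histogram, then binarize."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()                    # gray-level probabilities
    omega = np.cumsum(p)                     # class-0 probability up to t
    mu = np.cumsum(p * np.arange(256))       # cumulative first moment
    mu_t = mu[-1]                            # global mean gray level
    # between-class variance for every candidate threshold t
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    t = int(np.nanargmax(sigma_b))
    return t, (img > t).astype(np.uint8)     # threshold and binary mask
```

On a bimodal image (object on background) the threshold lands between the two histogram modes, separating target from background.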
Feature Extraction & Selection
The most crucial step in any pattern classification system. Goals are:
1. Extract a salient and representative set of features with high discriminatory ability.
2. Dimensionality reduction.
3. Decorrelation, if possible.
Category of Methods:
Energy-based: KL transform, statistical-based
Contour-based: Fourier Descriptor, Hough transform
Shape-dependent: Moment invariants, Zernike moments
Texture-based: WT, Gabor filter, Gray-level Co-occurrence Matrix (GLCM), statistical-based
Fourier Descriptor (FD)
FD extracts contour shape-dependent features. Steps are:
1. Define a closed contour with M points (see Fig. 2).
2. Map the contour points to the real and imaginary parts of a complex-valued function, p(n) = x_n + j y_n, n ∈ [0, M−1]. Clearly, p(n) = p(n + lM), i.e. periodic with period M.
3. Take the DFT of the complex-valued function p(n) to generate the FD coefficients, i.e. P(k) = DFT{p(n)}, k ∈ [0, M−1]. Thus,

P(k) = Σ_{n=0}^{M−1} p(n) e^{−j2πnk/M}, k ∈ [0, M−1]

Figure 2: Contour with M points.
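The three steps above can be sketched directly with NumPy's FFT. (The normalization by |P(1)| and the dropping of P(0) are common conventions we add for illustration; they are not part of the steps listed.)

```python
import numpy as np

def fourier_descriptors(xs, ys):
    """FD of an M-point closed contour, following the steps in the text:
    form p(n) = x_n + j*y_n, then take its DFT to get P(k)."""
    p = np.asarray(xs, dtype=float) + 1j * np.asarray(ys, dtype=float)
    P = np.fft.fft(p)                     # P(k), k in [0, M-1]
    # |P(0)| depends only on contour position; dropping it and dividing
    # by |P(1)| makes the features insensitive to translation and scale.
    return np.abs(P[1:]) / np.abs(P[1])
```

For a perfectly circular contour the energy sits entirely in the lowest-order coefficient, matching Property 1 below: irregular contours spread energy into the high-order coefficients.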
M.R. Azimi Digital Image Processing
Segmentation & Feature Extraction Feature Selection Pattern Classification Unsupervised Cluster Discovery
Properties:
1. FDs of regular circular contours are primarily concentrated on the low-order coefficients, whereas irregular contours have FDs that span the high-order coefficients.
2. FD is NOT good for small objects; a large number of points is necessary in order to get good FDs.
3. Due to the periodicity of the closed contours and FDs, the features are translation and rotation independent.

The next three figures show some examples of binary silhouettes and their corresponding FD plots for three targets, as well as two data sets with and without noise. As can be seen, one can discriminate different objects using their FD coefficients based upon the Euclidean distances (see tables).
Figure 3: Target silhouettes and corresponding FD coefficients.
Moment Invariants

Moment invariants (MIs) provide a set of nonlinear features dependent on normalized 2nd- and 3rd-order central moments. MIs provide seven features invariant to rotation, scaling and translation, ideal for pattern recognition.
Let x(m, n) be the detected object. The (p+q)th-order regular and central moments are:

µ_{p,q} = Σ_m Σ_n m^p n^q x(m, n)

ξ_{p,q} = Σ_m Σ_n (m − m̄)^p (n − n̄)^q x(m, n)

where m̄ ≜ µ10/µ00 and n̄ ≜ µ01/µ00. The central moments of order p + q ≤ 3 are:

ξ00 = µ00    ξ11 = µ11 − n̄µ10
ξ10 = 0      ξ20 = µ20 − m̄µ10
ξ01 = 0      ξ02 = µ02 − n̄µ01
ξ30 = µ30 − 3m̄µ20 + 2µ10m̄²
ξ03 = µ03 − 3n̄µ02 + 2µ01n̄²
ξ12 = µ12 − 2n̄µ11 − m̄µ02 + 2n̄²µ10
ξ21 = µ21 − 2m̄µ11 − n̄µ20 + 2m̄²µ01

The normalized central moments are:

η_{p,q} = ξ_{p,q} / ξ00^γ,  γ = (p + q)/2 + 1,  p + q = 2, 3, ...

Then the seven invariant features are computed using

φ1 = η20 + η02
φ2 = (η20 − η02)² + 4η11²
φ3 = (η30 − 3η12)² + (3η21 − η03)²
φ4 = (η30 + η12)² + (η21 + η03)²
φ5 = (η30 − 3η12)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²]
     + (3η21 − η03)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]
φ6 = (η20 − η02)[(η30 + η12)² − (η21 + η03)²]
     + 4η11(η30 + η12)(η21 + η03)
φ7 = (3η21 − η30)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²]
     + (3η12 − η30)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]
Remarks:
1. The symmetric form of these features makes them independent of rotation and translation.
2. Everything about the shape of the object is represented by these seven features.
3. Two measures known as "Spread" and "Slenderness" can be defined in terms of the 2nd-order moments as SP = µ20 + µ02 = φ1 and SL = √((µ20 − µ02)² + 4µ11²) = √φ2, respectively. These may be used as features in some simple shape discrimination problems.
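The seven invariants can be computed exactly from the equations above; here is a NumPy sketch (the helper names `mu` and `eta` are ours):

```python
import numpy as np

def hu_invariants(x):
    """Seven moment invariants of a 2-D object x(m, n), built from the
    normalized central moments eta_{p,q} as defined in the text."""
    m_idx, n_idx = np.indices(x.shape)

    def mu(p, q):                       # regular moment mu_{p,q}
        return np.sum(m_idx**p * n_idx**q * x)

    mbar, nbar = mu(1, 0) / mu(0, 0), mu(0, 1) / mu(0, 0)

    def eta(p, q):                      # normalized central moment eta_{p,q}
        xi = np.sum((m_idx - mbar)**p * (n_idx - nbar)**q * x)
        return xi / mu(0, 0) ** ((p + q) / 2 + 1)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    e30, e03, e21, e12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    phi1 = e20 + e02
    phi2 = (e20 - e02)**2 + 4 * e11**2
    phi3 = (e30 - 3*e12)**2 + (3*e21 - e03)**2
    phi4 = (e30 + e12)**2 + (e21 + e03)**2
    phi5 = ((e30 - 3*e12)*(e30 + e12)*((e30 + e12)**2 - 3*(e21 + e03)**2)
            + (3*e21 - e03)*(e21 + e03)*(3*(e30 + e12)**2 - (e21 + e03)**2))
    phi6 = ((e20 - e02)*((e30 + e12)**2 - (e21 + e03)**2)
            + 4*e11*(e30 + e12)*(e21 + e03))
    phi7 = ((3*e21 - e30)*(e30 + e12)*((e30 + e12)**2 - 3*(e21 + e03)**2)
            + (3*e12 - e30)*(e21 + e03)*(3*(e30 + e12)**2 - (e21 + e03)**2))
    return np.array([phi1, phi2, phi3, phi4, phi5, phi6, phi7])
```

Translating the object inside the image leaves all seven values unchanged, since the central moments subtract out the centroid.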
Feature Selection
Bellman’s Curse of Dimensionality
Classification performance will NOT necessarily improve as more features are added. More features ⇒ more parameters to be estimated ⇒ increased estimation error when using finite training samples. The trained classifier will be so fine-tuned to the training data that it will lack generalization ability on novel data.
Fisher Discriminant Function (DF)
Goal: Extract a lower-dimensional feature subspace containing the most discriminatory features and remove the ones that may have detrimental effects.
Assume a two-class problem. For every feature x_i in the feature vector x, the Fisher DF is computed over all the training samples using

D_{x_i}(C1, C2) = |µ_{C1}(x_i) − µ_{C2}(x_i)|² / (σ²_{C1}(x_i) + σ²_{C2}(x_i))

where µ_{Cj}(x_i) and σ²_{Cj}(x_i) are the mean and variance of class Cj for the ith feature x_i.
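The Fisher DF for a single feature is a few lines of code (the function name `fisher_score` is ours): a well-separated feature scores high, an overlapping one scores near zero.

```python
import numpy as np

def fisher_score(x1, x2):
    """Fisher discriminant for one feature: squared separation of the
    class means divided by the sum of the class variances."""
    return (x1.mean() - x2.mean())**2 / (x1.var() + x2.var())
```

In practice this score is computed for every candidate feature over the training set and only the top-scoring features are retained.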
The process is repeated for every feature and only those that have high discriminatory ability are selected. The next figure depicts the feature space distribution for a 2-D feature vector case.
Figure 4: Feature Selection using Fisher DF.
Area 6: Pattern Classification
Goal:
Assign a pattern (picture, fingerprint, characters, speech, EKG, etc.) into one of the M prescribed (known) classes.
A classifier maps the feature (pattern) space into the classification decision (label) space, i.e. it performs a mapping x ∈ R^N → i_o, i_o ∈ [1, M], or i_o = f(x), where x is the N-dimensional feature vector and i_o = k represents the kth class. The function f(·) specifies the relation between the classifier inputs and outputs, or the “decision rule”.
Decision Regions & Surfaces: The decision rule divides the feature (pattern) space into M disjoint regions, R_k, k ∈ [1, M], known as “Decision Regions”, that are separated by “Decision Surfaces”, S_{i,j}.
Discriminant Functions: Assuming that the classifier is already designed,the classification decision for an unknown pattern x is made bycomparing M scalar functions g1(x), g2(x), . . . gM (x), known as“discriminant functions” (DF’s).
Figure 5: Decision regions and surfaces.
A pattern x belongs to class j (Cj) iff

g_j(x) > g_k(x), ∀k ∈ [1, M], k ≠ j ⟺ x ∈ Cj

i.e. selecting the class with the largest DF. The decision surface S_{k,l} separating two contiguous decision regions R_k, R_l satisfies

g_k(x) − g_l(x) = 0 ⇒ S_{k,l}

There are several types of classifiers that can be built depending on the type of the DFs generated.
Once the type of the DF is selected, the learning algorithm results in a solution for the unknown parameters of the DFs. Among typical DFs are:
Linear Classifier
Bayes Classifier
K-means clustering
Vector Quantization
Neural Network (supervised vs. unsupervised)
1. Linear Classifier: A linear classifier (linear DF) constitutes a hyperplane in N-dimensional feature space. Thus, the DF is

g_i(x) = w_i^T x + w_{i,N+1} = Σ_{j=1}^{N} w_{i,j} x_j + w_{i,N+1},  ∀i ∈ [1, M]

Important Remark:
A minimum distance classifier is a linear classifier. Assume that each class is represented by its “prototype” pattern (mean or centroid of each group of patterns) c_i, i ∈ [1, M].
2. Minimum Distance Classifier: Makes its decision on pattern x based upon the smallest Euclidean distance to a particular prototype pattern, i.e.

d(x, c_j) < d(x, c_k), ∀k ∈ [1, M], k ≠ j ⟺ x ∈ Cj

where d(x, c_j) = ‖x − c_j‖². Rewrite

‖x − c_j‖² = (x − c_j)^T (x − c_j) = x^T x − 2 c_j^T x + c_j^T c_j

Since the term x^T x is common to all M expressions, it suffices to examine c_j^T x − (1/2) c_j^T c_j. That is,

x ∈ Cj ⟺ c_j^T x − (1/2) c_j^T c_j > c_k^T x − (1/2) c_k^T c_k, ∀k ∈ [1, M], k ≠ j

This implies that g_j(x) = c_j^T x − (1/2) c_j^T c_j.
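The derivation above can be sketched as follows: evaluate g_j(x) = c_j^T x − (1/2) c_j^T c_j for every prototype and pick the largest (the function name is ours):

```python
import numpy as np

def min_distance_classify(x, prototypes):
    """Minimum-distance classifier in its linear form:
    g_j(x) = c_j^T x - (1/2) c_j^T c_j; return the class with the largest DF."""
    C = np.asarray(prototypes)                 # one prototype c_j per row
    g = C @ x - 0.5 * np.sum(C * C, axis=1)    # all M discriminant functions
    return int(np.argmax(g))
```

Note that no distances are actually computed: only inner products with the prototypes, which is exactly what makes the classifier linear.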
Now, comparing with the linear DF g_j(x) = w_j^T x + w_{j,N+1}, we have w_j = c_j and w_{j,N+1} = −(1/2) c_j^T c_j. Thus, a linear (minimum distance) classifier can easily be built using the prototype patterns.
Note: The linear classifier uses a deterministic DF.
3. Bayes Classification:
The Bayes decision is based upon minimizing the loss in making wrong decisions. The decision rule follows

x ∈ Cj or Rj ⟺ p(Rj|x) > p(Rk|x), ∀k ∈ [1, M], k ≠ j

where p(Rj|x) is the a posteriori class-conditional PDF. Using Bayes rule,

p(Rj|x) = p(x|Rj) P(Rj) / p(x)

Since the denominator is common to all classes, the decision rule can be modified to

x ∈ Cj ⟺ p(x|Rj) P(Rj) > p(x|Rk) P(Rk), ∀k ∈ [1, M], k ≠ j
Thus, in Bayes classification the DF is g_i(x) = p(x|Ri) P(Ri), i.e. not deterministic. Alternatively, we can use g_i(x) = ln[p(x|Ri) P(Ri)]. Note that p(x|Ri) and P(Ri) can be computed from the “training data”.

Bayes Classifier for Gaussian Cases
Suppose the distribution of patterns in each decision region Ri can be represented by a multi-variate Gaussian, i.e.

p(x|Ri) = (1 / ((2π)^{N/2} Det(Ri)^{1/2})) e^{−(1/2)(x − µ_i)^T Ri^{−1} (x − µ_i)}

where µ_i and Ri represent the mean vector and covariance matrix for the ith class, computed from the training data in each class. Then, assuming that Ri = σ_x² I (i.e. features are uncorrelated), we have

g_i(x) = ln[p(x|Ri)] + ln[P(Ri)]
       = −(N/2) ln(2πσ_x²) − (1/(2σ_x²)) (x − µ_i)^T (x − µ_i) + ln[P(Ri)]
       = −(1/(2σ_x²)) (x^T x − 2µ_i^T x + µ_i^T µ_i) + ln[P(Ri)] + const.

Since the term x^T x is common to all the expressions, it can be ignored.
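A sketch of the resulting DF for the uncorrelated Gaussian case (we keep the common ‖x − µ_i‖² form for readability, since the x^T x term cancels in the argmax anyway; names are ours):

```python
import numpy as np

def bayes_gaussian_classify(x, means, priors, sigma2):
    """Bayes DF for the uncorrelated Gaussian case (R_i = sigma^2 I):
    g_i(x) = -(1/(2 sigma^2)) ||x - mu_i||^2 + ln P(R_i),
    after dropping the terms common to all classes."""
    g = [-np.sum((x - mu)**2) / (2 * sigma2) + np.log(P)
         for mu, P in zip(means, priors)]
    return int(np.argmax(g))               # class with the largest DF
```

With equal priors this reduces to the minimum-distance classifier, which is the content of Remark 1 below; unequal priors shift the decision surface toward the less likely class.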
Remarks:
1. In the Gaussian case, the Bayes classifier becomes a linear one with w_i = µ_i/σ_x².
2. Bayes minimizes the loss in misclassification independent of the overlap between the distributions.
Figure 6: Bayes classification in 1-D.
Figure 7: Bayes classification in 2-D.
Cluster Discovery Networks

These systems use:

No a priori knowledge about the distribution of the unlabeled data.
The network learns the underlying distribution (statistical properties) of the data and forms clusters of the data accordingly.
The number of clusters, K, must be determined based upon some prior knowledge or expectations.
They perform some type of vector quantization (VQ).
Training and testing involve a winner-take-all scheme.

Let S = {x_1, x_2, · · · , x_Q}, where x_k is N-D, be the training set of patterns with unknown labels. The goal is to cluster them into K clusters depending on their underlying distribution.
1. Initialization: The code-book vectors (weights) are first initialized using, e.g., a uniform distribution on the unit hypersphere or the “convex combination” method w_i(0) = (1/√N)[1 1 · · · 1]^T, i ∈ [1, K].

2. Winner-Take-All Learning: During the unsupervised training, the winner selection involves finding the kth cluster for which k = argmin_{i∈[1,K]} ‖w_i − x‖ or k = argmax_{i∈[1,K]} w_i^T x.
The winner then updates (promoting the winner) its code-book vector using

w_k(l + 1) = w_k(l) + α(l)(x_l − w_k(l))

while the losers do not update their code-book vectors, i.e. w_j(l + 1) = w_j(l), ∀j ∈ [1, K], j ≠ k.

3. Cluster Selection (testing): After training is completed (i.e. the code-book vectors w_i are established), if for an unknown sample x the mth cluster is the winner, i.e. y_m = f(w_m^T x) = max_i y_i, then x ∈ cluster m.
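The steps above can be sketched as follows. As illustrative deviations from the text, we initialize the code-book vectors from random training samples rather than the convex-combination method, and use a simple linearly decaying α(l):

```python
import numpy as np

def winner_take_all(X, K, epochs=50, alpha0=0.9, seed=0):
    """Winner-take-all VQ training: only the winning code-book vector
    moves toward each presented sample; the losers stay put."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), K, replace=False)].astype(float)  # init from data
    for ep in range(epochs):
        alpha = alpha0 * (1 - ep / epochs)       # linearly decaying step size
        for x in X[rng.permutation(len(X))]:     # present samples in random order
            k = np.argmin(np.linalg.norm(W - x, axis=1))  # winner selection
            W[k] += alpha * (x - W[k])           # promote the winner only
    return W
```

At convergence the rows of W sit near the cluster centroids, consistent with Remark 1 below.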
Important Remarks:

1. At convergence, the weight vectors represent the centroids of the clusters, i.e. w_i = µ_i.
2. 0 < α(l) < 1 is the step size in learning. For the first 1000 steps, we choose α(l) ≈ 0.99; thereafter use a monotonically decreasing function, e.g.,
   Linear: α(l) = α_0 (1 − (l − l_p)/l_q), α_0 = 0.9, where l_p is the epoch at which decay starts and l_q is the time constant.
   Exponential: α(l) = α_0 e^{−(l − l_p)/l_q}
3. If the trained system is to be used for classification, the clusters must be calibrated (labeling is required after cluster formation, using some known patterns).
4. Works well for linearly separable clusters. It does not work well when the clusters overlap (see the figure shown). One remedy is to use a large number of clusters to partition the space further.
Figure 8: Unsupervised clustering fails.