Machine vision
2
Outline
• Pattern Recognition (weeks 4 - 7)
- Model-based approach: template matching (covered in week #3)
- Statistical approaches
Introduction
Supervised: minimum distance, minimum risk, kNN, Fisher’s discriminant
Unsupervised: clustering algorithms
Applications
- Biometrics
3
Recommended Reference Books
1. R.C. Gonzalez & R. E. Woods, “Digital image processing”, Addison-Wesley, 2010.
2. R. O. Duda, P. E. Hart and D. G. Stork, “Pattern Classification”, 2nd ed., John Wiley & Sons, Inc., 2001. [Duda01]
3. S. T. Bow, “Pattern Recognition and Image Preprocessing”, M. Dekker, 2002. (Library call no. TK7882.P3B784P)
4
Overview of image analysis
[Figure: block diagram of the image analysis pipeline. Blocks: image acquisition; image processing (e.g. segmentation); feature extraction; representation & description; classification, recognition & interpretation; knowledge base. The blocks are grouped into low-level, intermediate-level and high-level processing.]
5
Design Cycle
[Duda01]
Iterative process
No universal approach
Training – use data to determine the optimum classifier, i.e. the one providing the optimal decision surface
Evaluation of error for improvement
6
What is a feature?
A feature is a characteristic primitive or attribute of image patterns (e.g. the pixel value of an image point, the distribution of pixel values within an image region, shape characteristics of patterns, etc.).
A pattern may be characterized by one or more features.
Examples of features used to characterize regions/objects
Features used to represent image regions (objects) are:
area (A): counting the no. of pixels within a region. This gives a measure of the size of the region.
perimeter (P): counting the no. of pixels on the boundary of an image region.
roundness (R): a measure of circularity (R = 1 for a circle):
$$R = \frac{P^2}{4\pi A}$$
A lot more on features in lecture set 1.
7
Statistical features:
• Statistical measures commonly used to represent features and the variations of feature values.
• Two basic types of statistical measures are:
First-order statistics - e.g. mean, variance & standard deviation of the feature distribution (assumed normal distribution).
Second-order statistics - features obtained from the co-occurrence matrix.
• By analyzing feature values, shapes/patterns can be classified and recognized.
• Statistical features can be any extractable measurements.
21
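Types of decision function:
Minimum Distance Classifier (MDC)
Example (Bean – red/grn; big/small):
$\omega_1$: $\mathbf{m}_1 = (x_1^{(1)}, x_2^{(1)})^T$
$\omega_2$: $\mathbf{m}_2 = (x_1^{(2)}, x_2^{(2)})^T$
$\omega_3$: $\mathbf{m}_3 = (x_1^{(3)}, x_2^{(3)})^T$
$\omega_4$: $\mathbf{m}_4 = (x_1^{(4)}, x_2^{(4)})^T$
$\|\mathbf{x}-\mathbf{m}_3\| < \|\mathbf{x}-\mathbf{m}_1\|$, $\|\mathbf{x}-\mathbf{m}_3\| < \|\mathbf{x}-\mathbf{m}_2\|$, $\|\mathbf{x}-\mathbf{m}_3\| < \|\mathbf{x}-\mathbf{m}_4\|$
x is closest to $\mathbf{m}_3$. Therefore, x belongs to $\omega_3$.
[Figure: the prototypes m1, m2, m3, m4 and the unknown pattern x in the (x1, x2) feature plane.]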
22
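• Each pattern class is represented by a prototype:
$$\mathbf{m}_j = \frac{1}{N_j}\sum_{\mathbf{x}\in\omega_j}\mathbf{x}, \quad j = 1, 2, \dots, M$$
where $\mathbf{m}_j$ is the mean vector for class $\omega_j$ and $N_j$ is the no. of pattern vectors from class $\omega_j$.
• Given an unknown pattern vector x, find the closest prototype by finding the minimum Euclidean distance:
$$D_j(\mathbf{x}) = \|\mathbf{x}-\mathbf{m}_j\|, \quad j = 1, 2, \dots, M$$
where $\|\mathbf{a}\| = (\mathbf{a}^T\mathbf{a})^{1/2}$ is the Euclidean norm of vector a.
$$\mathbf{x}\in\omega_i \ \text{if} \ D_i(\mathbf{x}) < D_j(\mathbf{x}), \quad j = 1, 2, \dots, M;\ j \neq i$$
[Figure: the four prototypes m1, m2, m3, m4 in the feature plane.]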
23
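• Recognition is achieved by finding
$$D_i(\mathbf{x}) = \min_j\{D_j(\mathbf{x})\}, \quad j = 1, 2, \dots, M$$
• Decision function for MDC can be defined as
$$d_j(\mathbf{x}) = -D_j(\mathbf{x})$$
• Recognition is then achieved by finding
$$d_i(\mathbf{x}) = \max_j\{d_j(\mathbf{x})\}, \quad j = 1, 2, \dots, M$$
• By expanding $D_j(\mathbf{x})$ (in prev. slide) and removing the common terms, $d_j(\mathbf{x})$ is simplified to
$$d_j(\mathbf{x}) = \mathbf{x}^T\mathbf{m}_j - \tfrac{1}{2}\mathbf{m}_j^T\mathbf{m}_j$$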
24
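How to simplify the decision function of MDC?
$$D_j^2(\mathbf{x}) = (\mathbf{x}-\mathbf{m}_j)^T(\mathbf{x}-\mathbf{m}_j)$$
Because $D_j(\mathbf{x}) \geq 0$, minimizing $D_j$ is the same as minimizing $D_j^2$, so the decision function can be taken as
$$d_j(\mathbf{x}) = -(\mathbf{x}-\mathbf{m}_j)^T(\mathbf{x}-\mathbf{m}_j), \quad j = 1, \dots, M$$
$$d_j(\mathbf{x}) = -(\mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{m}_j - \mathbf{m}_j^T\mathbf{x} + \mathbf{m}_j^T\mathbf{m}_j)$$
Since $\mathbf{m}_j^T\mathbf{x} = \mathbf{x}^T\mathbf{m}_j$,
$$d_j(\mathbf{x}) = -(\mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{m}_j + \mathbf{m}_j^T\mathbf{m}_j)$$
$\mathbf{x}^T\mathbf{x}$ is independent of j, i.e. a common term.
When $\max\{d_j(\mathbf{x})\}$ is sought, $\mathbf{x}^T\mathbf{x}$ can be dropped.
Therefore, dividing by 2, we have
$$d_j(\mathbf{x}) = \mathbf{x}^T\mathbf{m}_j - \tfrac{1}{2}\mathbf{m}_j^T\mathbf{m}_j$$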
25
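versicolor $\omega_1$: $\mathbf{m}_1 = (4.3, 1.3)^T$
setosa $\omega_2$: $\mathbf{m}_2 = (1.5, 0.3)^T$
$$d_1(\mathbf{x}) = \mathbf{x}^T\mathbf{m}_1 - \tfrac{1}{2}\mathbf{m}_1^T\mathbf{m}_1 = 4.3x_1 + 1.3x_2 - 10.1$$
$$d_2(\mathbf{x}) = \mathbf{x}^T\mathbf{m}_2 - \tfrac{1}{2}\mathbf{m}_2^T\mathbf{m}_2 = 1.5x_1 + 0.3x_2 - 1.17$$
Rewrite in matrix form
$$\begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} 4.3 & 1.3 & -10.1 \\ 1.5 & 0.3 & -1.17 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ 1 \end{bmatrix}$$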
26
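If a petal has length 2 cm and width 0.5 cm, i.e. x = (2, 0.5)^T:
$$\begin{bmatrix} 4.3 & 1.3 & -10.1 \\ 1.5 & 0.3 & -1.17 \end{bmatrix} \begin{bmatrix} 2.0 \\ 0.5 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.85 \\ 1.98 \end{bmatrix}$$
Since $d_2 > d_1$, it is Iris setosa.
If a petal has length 5 cm and width 1.5 cm, i.e. x = (5, 1.5)^T:
$$\begin{bmatrix} 4.3 & 1.3 & -10.1 \\ 1.5 & 0.3 & -1.17 \end{bmatrix} \begin{bmatrix} 5.0 \\ 1.5 \\ 1 \end{bmatrix} = \begin{bmatrix} 13.35 \\ 6.78 \end{bmatrix}$$
Since $d_1 > d_2$, it is Iris versicolor.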
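The MDC decision function can be checked numerically. The following is a minimal Python sketch, assuming the two Iris prototypes quoted above; the function names `mdc_scores` and `mdc_classify` are illustrative and not part of the lecture material.

```python
# Minimum distance classifier: d_j(x) = x^T m_j - 0.5 * m_j^T m_j,
# and x is assigned to the class with the largest d_j(x).

def mdc_scores(x, prototypes):
    """Return d_j(x) for every prototype m_j."""
    scores = []
    for m in prototypes:
        xm = sum(xi * mi for xi, mi in zip(x, m))   # x^T m_j
        mm = sum(mi * mi for mi in m)               # m_j^T m_j
        scores.append(xm - 0.5 * mm)
    return scores

def mdc_classify(x, prototypes):
    """Index of the class with the largest decision function value."""
    scores = mdc_scores(x, prototypes)
    return max(range(len(scores)), key=lambda j: scores[j])

PROTOTYPES = [(4.3, 1.3), (1.5, 0.3)]          # m1 (versicolor), m2 (setosa)
NAMES = ["Iris versicolor", "Iris setosa"]

for petal in [(2.0, 0.5), (5.0, 1.5)]:
    print(petal, "->", NAMES[mdc_classify(petal, PROTOTYPES)])
```

The sketch keeps the unrounded constant 0.5 m1^T m1 = 10.09, so d1 differs from the slide values (which use 10.1) in the second decimal place; the decisions are the same.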
27
Decision boundary
- it is the line that separates two classes.
Property of the decision boundary in MDC
the decision boundary is a perpendicular bisector of the line joining two class prototypes.
At the decision boundary (surface), d1(x) = d2(x).
For Iris classification,
d12(x) = d1(x) - d2(x) = 2.8x1 + 1.0x2 - 8.9 = 0
$\mathbf{x}\in\omega_1$ if d12(x) > 0
$\mathbf{x}\in\omega_2$ if d12(x) < 0
28
Decision boundary can be used for recognition
Definition of decision function:
Let the decision function be a line or surface in the pattern space separating the 2 classes.
Recognition is achieved by testing the sign of the decision function when presented with a pattern vector.
Example:
x = (2, 0.5)^T; d12((2, 0.5)^T) = -2.8, so $\mathbf{x}\in\omega_2$ (Iris setosa)
x = (5, 1.5)^T; d12((5, 1.5)^T) = 6.6, so $\mathbf{x}\in\omega_1$ (Iris versicolor)
29
Example:
Classify Iris versicolor and Iris setosa by petal length, x1, and petal width, x2.
30
Use of Probability Theory in Recognition
• Probability theory is widely used in statistical pattern recognition.
• In particular, Bayes probability theory is used in formulating the optimal classifier (recognizer).
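• Feature distributions are used to form probability distributions that are known a priori (estimated beforehand from training data), and they are represented as
$$p(\mathbf{x}|\omega_j), \quad \text{for } j = 1, 2, \dots, M.$$
• The above is also known as the likelihood function of a class $\omega_j$ object. It should not be confused with the prior probability $p(\omega_j)$ of the class itself.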
31
Optimal Statistical Classifier - The Bayes Classifier
Recognition problem: given x, find $\omega_i$.
Let the probability of deciding that x belongs to $\omega_i$ be $p(\omega_i|\mathbf{x})$ (given x, the probability that it belongs to $\omega_i$).
If x should belong to $\omega_k$ but is wrongly classified to $\omega_j$, a loss $L_{kj}$ is incurred. The conditional average risk (or conditional average loss) of assigning x to $\omega_j$ is defined as
$$r_j(\mathbf{x}) = \sum_{k=1}^{M} L_{kj}\, p(\omega_k|\mathbf{x})$$
where $L_{kj}$ is the loss incurred due to misclassification and $r_j(\mathbf{x})$ is the conditional average risk.
32
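• $p(\omega_k|\mathbf{x})$ is a posterior probability and is generally unknown.
From Bayes’ rule,
$$p(\omega_k|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_k)\,p(\omega_k)}{p(\mathbf{x})}$$
$$r_j(\mathbf{x}) = \frac{1}{p(\mathbf{x})}\sum_{k=1}^{M} L_{kj}\, p(\mathbf{x}|\omega_k)\,p(\omega_k)$$
$p(\mathbf{x})^{-1}$ is positive, independent of j, and is a common term, so $r_j$ can be reduced to
$$r_j(\mathbf{x}) = \sum_{k=1}^{M} L_{kj}\, p(\mathbf{x}|\omega_k)\,p(\omega_k)$$
$p(\mathbf{x}|\omega_k)$: probability density function of the patterns in each class;
$p(\omega_k)$: probability of occurrence of each class.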
33
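• $p(\mathbf{x}|\omega_k)$ and $p(\omega_k)$ are both a priori quantities that can be estimated from the training set.
• Recognition is achieved by
$\mathbf{x}\in\omega_i$ if $r_i(\mathbf{x}) = \min_j\{r_j(\mathbf{x})\}$ for j = 1, 2, ..., M.
Decision function for Bayes classifier is
$$d_j(\mathbf{x}) = -\sum_{k=1}^{M} L_{kj}\, p(\mathbf{x}|\omega_k)\,p(\omega_k)$$
$\mathbf{x}\in\omega_i$ if $d_i(\mathbf{x}) = \max_j\{d_j(\mathbf{x})\}$ for j = 1, 2, ..., M.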
34
Example illustrating the concept of Bayes classifier
For a 2-class problem, show how a feature x coming from class $\omega_2$ can be correctly classified using the Bayes classifier. Assume that patterns from both classes are equally likely to be seen. For correct classification, there should be no penalty. For incorrect classification, the penalty is $L_{ij} = 1$, $i \neq j$.
From above, we know: M = 2, $p(\omega_1) = p(\omega_2)$, and $x \in \omega_2$. Also, $L_{11} = L_{22} = 0$, $L_{12} = L_{21} = 1$.
$r_1 = L_{11}\,p(x|\omega_1)\,p(\omega_1) + L_{21}\,p(x|\omega_2)\,p(\omega_2)$
$r_2 = L_{12}\,p(x|\omega_1)\,p(\omega_1) + L_{22}\,p(x|\omega_2)\,p(\omega_2)$
35
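Since $p(\omega_1) = p(\omega_2)$, and $L_{11} = L_{22} = 0$, $L_{12} = L_{21} = 1$, the common factor $p(\omega_1) = p(\omega_2)$ can be dropped, giving
$r_1 = p(x|\omega_2)$, $r_2 = p(x|\omega_1)$
$d_1 = -p(x|\omega_2)$, $d_2 = -p(x|\omega_1)$
If $x \in \omega_2$: $p(x|\omega_2) > p(x|\omega_1) \Rightarrow r_1 > r_2$
If $x \in \omega_2$: $-p(x|\omega_2) < -p(x|\omega_1) \Rightarrow d_1 < d_2$
Either way (minimum risk $r_2$ or maximum decision function $d_2$), x is assigned to $\omega_2$.
[Figure: the class-conditional densities $p(x|\omega_2)$ and $p(x|\omega_1)$.]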
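The minimum-risk rule can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the likelihood values are hypothetical numbers chosen so that p(x|ω2) > p(x|ω1), and the function names are assumptions.

```python
# Minimum-risk Bayes decision with the common factor 1/p(x) dropped:
# r_j(x) = sum_k L[k][j] * p(x|w_k) * p(w_k); choose the class with minimum r_j.

def conditional_risks(likelihoods, priors, loss):
    """Return [r_1, ..., r_M] for one observation x."""
    M = len(priors)
    return [sum(loss[k][j] * likelihoods[k] * priors[k] for k in range(M))
            for j in range(M)]

def bayes_decide(likelihoods, priors, loss):
    """Index (0-based) of the class with minimum conditional risk."""
    risks = conditional_risks(likelihoods, priors, loss)
    return min(range(len(risks)), key=lambda j: risks[j])

ZERO_ONE_LOSS = [[0, 1],
                 [1, 0]]           # L11 = L22 = 0, L12 = L21 = 1
likelihoods = [0.1, 0.4]           # hypothetical p(x|w1), p(x|w2)
priors = [0.5, 0.5]                # equally likely classes

risks = conditional_risks(likelihoods, priors, ZERO_ONE_LOSS)
decision = bayes_decide(likelihoods, priors, ZERO_ONE_LOSS)
print("r1, r2 =", risks, "-> class", decision + 1)
```

With these numbers r1 = 0.2 and r2 = 0.05, so the observation is assigned to class 2, matching the argument above.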
36
Bayes classifier with 0-1 loss function.
• In the penalty approach,
$L_{ij} = 0$ for correct decision, i.e. i = j.
$L_{ij} = 1$ for incorrect decision, i.e. i ≠ j.
• In the reward approach,
$\delta_{ij} = 1$ for correct decision, i.e. i = j.
$\delta_{ij} = 0$ for incorrect decision, i.e. i ≠ j.
Let $L_{ij} = 1 - \delta_{ij}$, with $\delta_{ij} = 1$ when i = j and $\delta_{ij} = 0$ when i ≠ j.
From 3 slides back (penalty approach):
$$d_j(\mathbf{x}) = -\sum_{k=1}^{M} (1 - \delta_{kj})\, p(\mathbf{x}|\omega_k)\,p(\omega_k)$$
37
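The summation can be separated into 2 parts:
$$d_j(\mathbf{x}) = -\sum_{k=1}^{M} p(\mathbf{x}|\omega_k)\,p(\omega_k) + \sum_{k=1}^{M} \delta_{kj}\, p(\mathbf{x}|\omega_k)\,p(\omega_k)$$
But,
$$p(\mathbf{x}) = \sum_{k=1}^{M} p(\mathbf{x}|\omega_k)\,p(\omega_k)$$
so
$$d_j(\mathbf{x}) = -p(\mathbf{x}) + \sum_{k=1}^{M} \delta_{kj}\, p(\mathbf{x}|\omega_k)\,p(\omega_k)$$
Since $-p(\mathbf{x})$ is constant (the same for every j) and $\delta_{kj} = 0$ except when k = j,
$$d_j(\mathbf{x}) = p(\mathbf{x}|\omega_j)\,p(\omega_j), \quad j = 1, 2, \dots, M$$
$\mathbf{x}\in\omega_i$ if $d_i(\mathbf{x}) = \max_j\{d_j(\mathbf{x})\}$, j = 1, 2, ..., M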
38
A Bayes Classifier with 0-1 loss
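[Figure: block diagram. The input x goes to a block "Evaluation of $p(\mathbf{x}|\omega_j)$, j = 1, 2, …, M"; its outputs $p(\mathbf{x}|\omega_1)$, $p(\mathbf{x}|\omega_2)$, …, $p(\mathbf{x}|\omega_M)$ are combined with the priors $p(\omega_1)$, $p(\omega_2)$, …, $p(\omega_M)$ and passed to a maximum selector, which gives the decision.]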
39
Bayes classifier with 0-1 loss for Gaussian patterns
A pattern class with a Gaussian distribution of its feature values can be represented by the Gaussian probability density function
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2\right]$$
where $m = E\{x\}$,
$\sigma^2 = E\{(x-m)^2\}$,
and E{.} denotes the expected value.
40
A 2-class example of Gaussian patterns
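Suppose only one feature is used to classify two patterns. Let x be the feature; the Bayes decision function is
$$d_j(x) = p(x|\omega_j)\,p(\omega_j), \quad j = 1, 2$$
Therefore,
$$d_j(x) = \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left[-\frac{(x-m_j)^2}{2\sigma_j^2}\right] p(\omega_j), \quad j = 1, 2$$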
41
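At the decision boundary $x_0$, $d_1(x_0) = d_2(x_0)$.
If $p(\omega_2) = p(\omega_1)$, then $p(x_0|\omega_1) = p(x_0|\omega_2)$.
Taking the class $\omega_1$ density to lie to the right of the class $\omega_2$ density ($m_1 > m_2$):
If $p(\omega_1) > p(\omega_2)$, $x_0$ will be shifted to the left.
If $p(\omega_1) < p(\omega_2)$, $x_0$ will be shifted to the right.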
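The shift of the boundary with the priors can be checked numerically. The sketch below is a hedged illustration under two assumptions that are not in the slides: both classes share the same σ (which gives a closed-form x0), and the means are the hypothetical values m1 = 4, m2 = 1 so that class 1 lies to the right.

```python
import math

def gaussian_pdf(x, m, sigma):
    """1-D Gaussian probability density function."""
    return math.exp(-0.5 * ((x - m) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def decide(x, means, sigmas, priors):
    """Bayes decision with 0-1 loss: maximise d_j(x) = p(x|w_j) p(w_j)."""
    scores = [gaussian_pdf(x, m, s) * p for m, s, p in zip(means, sigmas, priors)]
    return max(range(len(scores)), key=lambda j: scores[j])

def boundary_equal_sigma(m1, m2, sigma, p1, p2):
    """x0 solving d1(x0) = d2(x0) when both classes share the same sigma."""
    return 0.5 * (m1 + m2) + sigma ** 2 * math.log(p2 / p1) / (m1 - m2)

M1, M2, SIGMA = 4.0, 1.0, 1.0      # hypothetical values, class 1 on the right

for p1 in (0.5, 0.8, 0.2):
    x0 = boundary_equal_sigma(M1, M2, SIGMA, p1, 1.0 - p1)
    print("p(w1) = %.1f -> x0 = %.3f" % (p1, x0))
```

With equal priors x0 sits at the midpoint of the two means; raising p(ω1) moves it towards m2 (left), and lowering p(ω1) moves it towards m1 (right).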
42
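For n-dimensional feature vectors, the Gaussian distribution is represented by
$$p(\mathbf{x}|\omega_j) = \frac{1}{(2\pi)^{n/2}|\mathbf{C}_j|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\mathbf{m}_j)^T\mathbf{C}_j^{-1}(\mathbf{x}-\mathbf{m}_j)\right]$$
where
$\mathbf{m}_j = E_j\{\mathbf{x}\}$ is the mean vector of class $\omega_j$,
$\mathbf{C}_j = E_j\{(\mathbf{x}-\mathbf{m}_j)(\mathbf{x}-\mathbf{m}_j)^T\}$ is the covariance matrix of class $\omega_j$,
$|\mathbf{C}_j|$ is the determinant of $\mathbf{C}_j$,
n is the dimension of the feature vector.
Approximating the expected value, E{•}, with its average value:
$$\mathbf{m}_j = \frac{1}{N_j}\sum_{\mathbf{x}\in\omega_j}\mathbf{x}$$
$$\mathbf{C}_j = \frac{1}{N_j}\sum_{\mathbf{x}\in\omega_j}\mathbf{x}\mathbf{x}^T - \mathbf{m}_j\mathbf{m}_j^T$$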
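The two sample estimates can be computed directly from a list of pattern vectors. The following is a small sketch using plain Python lists; the four sample points are hypothetical and the function names are illustrative.

```python
def mean_vector(samples):
    """m = (1/N) * sum of x over the samples of one class."""
    N, n = len(samples), len(samples[0])
    return [sum(x[i] for x in samples) / N for i in range(n)]

def covariance_matrix(samples):
    """C = (1/N) * sum of x x^T  -  m m^T."""
    N, n = len(samples), len(samples[0])
    m = mean_vector(samples)
    return [[sum(x[i] * x[j] for x in samples) / N - m[i] * m[j]
             for j in range(n)]
            for i in range(n)]

samples = [(0, 0), (2, 0), (0, 2), (2, 2)]   # hypothetical class samples
print(mean_vector(samples))
print(covariance_matrix(samples))
```

For this symmetric set of points the mean is (1, 1) and the covariance matrix is the identity, i.e. the two features are uncorrelated with unit variance.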
43
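The Bayes decision function for Gaussian patterns is:
$$d_j(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\mathbf{C}_j|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\mathbf{m}_j)^T\mathbf{C}_j^{-1}(\mathbf{x}-\mathbf{m}_j)\right]p(\omega_j)$$
Taking the logarithm, $d'_j(\mathbf{x}) = \ln d_j(\mathbf{x})$:
$$d'_j(\mathbf{x}) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{C}_j| - \frac{1}{2}(\mathbf{x}-\mathbf{m}_j)^T\mathbf{C}_j^{-1}(\mathbf{x}-\mathbf{m}_j) + \ln p(\omega_j)$$
Removing the constant terms, $d'_j(\mathbf{x})$ can be redefined as
$$d'_j(\mathbf{x}) = \ln p(\omega_j) - \frac{1}{2}\ln|\mathbf{C}_j| - \frac{1}{2}(\mathbf{x}-\mathbf{m}_j)^T\mathbf{C}_j^{-1}(\mathbf{x}-\mathbf{m}_j)$$
This is the general form of the Bayes classifier for Gaussian pattern classes with 0-1 loss function.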
44
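The Covariance matrix $\mathbf{C}_j$
$\mathbf{C}_j$ is a symmetric square matrix of n by n.
$$\mathbf{C}_j = E_j\{(\mathbf{x}-\mathbf{m}_j)(\mathbf{x}-\mathbf{m}_j)^T\}$$
$$\mathbf{C}_j = E_j\left\{\begin{bmatrix} x_1 - m_1 \\ x_2 - m_2 \\ \vdots \\ x_n - m_n \end{bmatrix}\begin{bmatrix} x_1 - m_1 & x_2 - m_2 & \cdots & x_n - m_n \end{bmatrix}\right\}$$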
45
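$$\mathbf{C}_j = \begin{bmatrix} E\{(x_1-m_1)(x_1-m_1)\} & E\{(x_1-m_1)(x_2-m_2)\} & \cdots & E\{(x_1-m_1)(x_n-m_n)\} \\ E\{(x_2-m_2)(x_1-m_1)\} & E\{(x_2-m_2)(x_2-m_2)\} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ E\{(x_n-m_n)(x_1-m_1)\} & \cdots & \cdots & E\{(x_n-m_n)(x_n-m_n)\} \end{bmatrix}_j$$
$$\mathbf{C}_j = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & \cdots & \cdots & c_{nn} \end{bmatrix}_j$$
• $c_{ii}$ is the variance
• $c_{ij}$ is the covariance; i ≠ j.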
46
• The covariance matrix can be estimated by expanding the original definition and simplifying by using averages to:
$$\mathbf{C}_j = E_j[\mathbf{x}\mathbf{x}^T] - \mathbf{m}_j\mathbf{m}_j^T \approx \frac{1}{N_j}\sum_{\mathbf{x}\in\omega_j}\mathbf{x}\mathbf{x}^T - \mathbf{m}_j\mathbf{m}_j^T$$
• For features that are statistically independent (and hence uncorrelated), $\mathbf{C}_j$ is a diagonal matrix ($c_{ij} = 0$ for i ≠ j).
If all covariance matrices are equal and are identity matrices and $P(\omega_j) = 1/M$ for all j = 1, 2, …, M, then:
$$d_j(\mathbf{x}) = \mathbf{x}^T\mathbf{m}_j - \tfrac{1}{2}\mathbf{m}_j^T\mathbf{m}_j$$
= MDC (Minimum Distance Classifier)
Thus, MDC is optimal in the Bayes sense if:
1. Pattern classes are Gaussian
2. All covariance matrices are equal to the identity matrix
3. All classes are equally likely to occur
47
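Special cases of the covariance matrix (2D examples):
(a) assume m = 0, C = I
[Figure: distribution centred at (0,0) in the (x1, x2) plane.]
(b) assume m = [1,0]^T, $c_{22} > c_{11}$,
$$\mathbf{C} = \begin{bmatrix} c_{11} & 0 \\ 0 & c_{22} \end{bmatrix}$$
[Figure: distribution centred at (1,0) in the (x1, x2) plane.]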
48
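(c) assume m = [0,1]^T,
$$\mathbf{C} = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}$$
[Figure: distribution centred at (0,1) in the (x1, x2) plane.]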
Random Vector Functional Link Network (RVFL)
EE6222
Outline
Introduction
SLFN: Single Layer Feedforward Network
From error back-propagation (back-prop) to randomized learning
Notations of RVFL
Learning process in RVFL
Theoretical part of RVFL
Classification ability of RVFL
Property of RVFL variants
Structure of an SLFN (Single Layer Feedforward Network)
Structure
Details
• Bias: x × β + b = [x, 1] × β (with b absorbed into β as an extra row), i.e. extending the input features by adding an additional feature which equals 1.
• For C classes, we can have C 0-1 output neurons. When we present an input observation belonging to the cth class, we expect the cth neuron to give an output near one while all others give outputs near zero. Circles in the hidden and output layers have non-linear activation functions. Inputs to activation functions are weighted sums.
Learning with neural network
Learning the mapping ti = f(xi), where i is the index for the data. The pipeline can be formalized as follows:
• Input: Input consists of a set of N data [xi, yi], where xi are the features and yi is the actual target given with the dataset. We refer to this data as the training set used to learn network parameters, etc.
• Learning: Use the training set to learn network structure and parameters. We refer to this step as training a classifier.
• Classification: In the end, we evaluate the quality of the classifier by asking it to predict target t for input data that it has never seen before. We will then compare the true target y of the data to the t predicted by the model. Prediction t is selected by a max operator among the C outputs.
Approximation ability of neural network
Theoretical proof about the approximation ability of standard multilayer feedforward networks can be found in [1]. With the following conditions:
• Arbitrary squashing functions.
• Sufficiently many hidden units are available.
a standard multilayer feedforward network can approximate virtually any function to any desired degree of accuracy.
[1] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks are universal approximators”. In: Neural Networks 2.5 (1989), pp. 359–366.
Why randomized neural network
Some weaknesses of back-prop
• The error surface can have multiple local minima.
• Slow convergence due to iteratively adjusting the weights on every link. These weights are first randomly generated.
• Sensitivity to the learning rate setting.
Alternative for back-prop
• In [2], the authors show that the parameters in hidden layers can be randomly and properly generated without learning.
• In [3], the parameters in the enhancement nodes can be randomly generated within a proper range.
[2] Wouter F Schmidt, Martin Kraaijveld, Robert PW Duin, et al. “Feedforward neural networks with random weights”. In: 1992 11th IAPR International Conference on Pattern Recognition. IEEE. 1992, pp. 1–4.
[3] Yoh-Han Pao, Gwang-Hoon Park, and Dejan J Sobajic. “Learning and generalization characteristics of the random vector functional-link net”. In: Neurocomputing 6.2 (1994), pp. 163–180.
Structure of RVFL [4]
Illustrations
Details
• Parameters β (indicated in blue) in the enhancement nodes are randomly generated in a proper range and kept fixed.
• Original features (from the input layer) are concatenated with the enhanced features (from the hidden layer) and referred to as d.
• Learning aims at solving ti = di ∗ W, i = 1, 2, ..., N, where N is the number of training samples, W (indicated in red and gray) are the weights in the output layer, and ti, di represent the output values and the combined features, respectively.
[4] Yoh-Han Pao, Gwang-Hoon Park, and Dejan J Sobajic. “Learning and generalization characteristics of the random vector functional-link net”. In: Neurocomputing 6.2 (1994), pp. 163–180.
Details of RVFL
Notations
• X = [x1, x2, ..., xN]′: input data (N samples and m features).
• β = [β1, β2, ..., βm]′: the weights for the enhancement nodes (m × k, where k is the number of enhancement nodes).
• b = [b1, b2, ..., bk]: the biases for the enhancement nodes.
• H = h(X ∗ β + ones(N, 1) ∗ b): feature matrix after the enhancement nodes, where h is the activation function. No activation in output nodes.
Commonly used activation functions in hidden layer
activation function: formulation (s is the input, y is the output)
sigmoid: $y = 1/(1 + e^{-s})$
sine: $y = \sin(s)$
hardlim: $y = (\mathrm{sign}(s) + 1)/2$
tribas: $y = \max(1 - |s|, 0)$
radbas: $y = \exp(-s^2)$
sign: $y = \mathrm{sign}(s)$
relu: $y = \max(0, s)$
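The table above translates directly into code. A minimal Python sketch follows, written straight from the listed formulas; the dictionary name `ACTIVATIONS` is illustrative.

```python
import math

def sign(s):
    """Return -1, 0 or +1 according to the sign of s."""
    return (s > 0) - (s < 0)

# Each entry maps an activation name to y = f(s), as listed in the table.
ACTIVATIONS = {
    "sigmoid": lambda s: 1.0 / (1.0 + math.exp(-s)),
    "sine":    math.sin,
    "hardlim": lambda s: (sign(s) + 1) / 2,
    "tribas":  lambda s: max(1.0 - abs(s), 0.0),
    "radbas":  lambda s: math.exp(-s * s),
    "sign":    sign,
    "relu":    lambda s: max(0.0, s),
}

for name, f in ACTIVATIONS.items():
    print(name, [round(f(s), 3) for s in (-2.0, 0.0, 0.5)])
```

Note that with the formula as written, hardlim gives 0.5 at exactly s = 0; some toolboxes define the value at zero differently.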
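Details of RVFL
Notations
• $$H = \begin{bmatrix} h(x_1) \\ h(x_2) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_k(x_1) \\ h_1(x_2) & \cdots & h_k(x_2) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_k(x_N) \end{bmatrix}$$
• $$D = [H, X] = \begin{bmatrix} d(x_1) \\ d(x_2) \\ \vdots \\ d(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_k(x_1) & x_{11} & \cdots & x_{1m} \\ h_1(x_2) & \cdots & h_k(x_2) & x_{21} & \cdots & x_{2m} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_k(x_N) & x_{N1} & \cdots & x_{Nm} \end{bmatrix}$$
• Target y can be class labels for a C-class classification problem (1, 2, ..., C).
• For classification, y is usually transformed into a 0-1 coding of the class label. For example, if we have 3 classes,
$$y = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
i.e. we have 3 neurons; the 1st neuron goes to 1 for the 1st class while the others stay at zero, and so on.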
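The 0-1 coding of class labels, and the max operator used to turn network outputs back into a label, can be sketched as follows. Function names are illustrative, and labels are assumed to run from 1 to C as in the text.

```python
def one_hot(labels, num_classes):
    """Map class labels 1..C to rows of a 0-1 target matrix."""
    return [[1 if c == y else 0 for c in range(1, num_classes + 1)]
            for y in labels]

def decode(outputs):
    """Pick, for every row of network outputs, the class with the largest output."""
    return [max(range(len(row)), key=lambda c: row[c]) + 1 for row in outputs]

print(one_hot([1, 2, 3], 3))
print(decode([[0.1, 0.7, 0.2], [0.9, 0.0, 0.3]]))
```

The first call reproduces the 3-class example above; the second shows how outputs that are only near 0 or 1 are still mapped to a single class.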
Learning
Define the problem
Once the random parameters β and b are generated in a proper range, the learning of RVFL aims at solving the following problem:
ti = di ∗ W, i = 1, 2, ..., N,
where N is the number of training samples, W are the weights in the output layer, and ti, di represent the output and the combined features.
Solutions
• Closed form:
– In primal space: $W = (\lambda I + D^T D)^{-1} D^T Y$ (1)
– In dual space: $W = D^T(\lambda I + D D^T)^{-1} Y$ (2)
where λ is the regularization parameter; when λ → 0, the method converges to the Moore-Penrose pseudoinverse solution. Y = [y1, y2, ..., yN]′.
• Iterative methods: tune the weights based on the gradient of the error function in each step. In the assignment, you’ll use the above equations.
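The primal-space solution, Eq. (1), can be sketched with the standard library alone. Everything below is an illustrative assumption rather than the course's reference implementation: the sigmoid activation, the uniform range [-1, 1] for the random β and b, the toy data set, and the small hand-written linear solver.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def rvfl_features(X, beta, b):
    """D = [H, X] with H = h(X * beta + b), h = sigmoid."""
    H = [[sigmoid(v + bj) for v, bj in zip(row, b)] for row in matmul(X, beta)]
    return [h + list(x) for h, x in zip(H, X)]

def rvfl_train(X, Y, k, lam, rng):
    """Random, fixed beta and b; output weights W = (lam*I + D^T D)^-1 D^T Y."""
    m = len(X[0])
    beta = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(m)]  # assumed range
    b = [rng.uniform(-1, 1) for _ in range(k)]                         # assumed range
    D = rvfl_features(X, beta, b)
    Dt = transpose(D)
    A = matmul(Dt, D)
    for i in range(len(A)):
        A[i][i] += lam
    return beta, b, solve(A, matmul(Dt, Y))

def rvfl_predict(X, beta, b, W):
    return matmul(rvfl_features(X, beta, b), W)

# Toy 2-class problem (hypothetical data), targets in 0-1 coding.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
Y = [[1, 0]] * 3 + [[0, 1]] * 3
beta, b, W = rvfl_train(X, Y, k=3, lam=0.01, rng=random.Random(0))
outputs = rvfl_predict(X, beta, b, W)
print([max(range(2), key=lambda c: row[c]) + 1 for row in outputs])
```

The primal form inverts a (k + m) × (k + m) matrix, so it is the natural choice when the number of samples N is larger than the number of combined features; the dual form, Eq. (2), inverts an N × N matrix instead.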
Approximation ability of RVFL
Theoretical justification for RVFL is presented in [5].
Method
• It formulates a limit-integral representation of the function to be approximated.
• It evaluates the integral with the Monte-Carlo method.
Conclusion
• RVFL is a universal approximator for continuous functions on bounded finite-dimensional sets.
• RVFL is an efficient universal approximator, with the approximation error converging to zero at a rate of order $O(C/\sqrt{n})$, where n is the number of basis functions and C is independent of n.
[5] Boris Igelnik and Yoh-Han Pao. “Stochastic choice of basis functions in adaptive function approximation and the functional-link net”. In: IEEE Transactions on Neural Networks 6.6 (1995), pp. 1320–1329.
Classification ability of RVFL
Evaluation
The following issues can be investigated by using 121 UCI datasets, as done by [6]:
• Effect of direct links from the input layer to the output layer.
• Effect of the bias in the output neuron.
• Effect of scaling the random features before feeding them into the activation function.
• Performance of 6 commonly used activation functions: sigmoid, radbas, sine, sign, hardlim, tribas.
• Performance of the Moore-Penrose pseudoinverse and ridge regression (or regularized least-squares solutions) for the computation of the output weights.
[6] Manuel Fernandez-Delgado et al. “Do we need hundreds of classifiers to solve real world classification problems?” In: The Journal of Machine Learning Research 15.1 (2014), pp. 3133–3181.
67
k-Nearest Neighbour
• We are concerned with the following problem: we wish to label some observed pattern x with some class category $\omega$.
• Two possible situations with respect to x and $\omega$ may occur:
– We may have complete statistical knowledge of the distribution of observation x and category $\omega$. In this case, a standard Bayes analysis yields an optimal decision procedure.
– We may have no knowledge of the distribution of observation x and category $\omega$ aside from that provided by pre-classified samples. In this case, a decision to classify x into category $\omega$ will depend only on a collection of correctly classified samples.
• The nearest neighbor rule is concerned with the latter case. Such problems are classified in the domain of non-parametric statistics. No optimal classification procedure exists with respect to all underlying statistics under such conditions.
68
The NN Rule
• Recall that the only knowledge we have is a set of n points $x_i$ correctly classified into categories $\omega_i$: $(x_i, \omega_i)$, i = 1, 2, ..., n. By intuition, it is reasonable to assume that observations that are close together (for some appropriate metric) will have the same classification.
• Thus, when classifying an unknown sample x, it seems appropriate to weight the evidence of the nearby $x_i$'s heavily.
• One simple non-parametric decision procedure of this form is the nearest neighbor rule or NN-rule. This rule classifies x in the category of its nearest neighbor. More precisely, we call
$x_m \in \{x_1, x_2, x_3, \dots, x_n\}$ a nearest neighbor to x if $\min_i d(x_i, x) = d(x_m, x)$, where i = 1, 2, ..., n.
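The NN rule takes only a few lines of code. The following is a minimal Python sketch, assuming the Euclidean distance as the metric d; the labelled samples are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_neighbour(x, samples):
    """samples: list of (point, label) pairs; return the label of the point closest to x."""
    point, label = min(samples, key=lambda s: euclidean(s[0], x))
    return label

# Hypothetical pre-classified samples (x_i, w_i)
SAMPLES = [((0, 0), "A"), ((1, 0), "A"), ((5, 5), "B"), ((6, 5), "B")]

print(nearest_neighbour((4, 4), SAMPLES))
print(nearest_neighbour((0.4, 0.2), SAMPLES))
```

A k-nearest-neighbour variant would instead collect the k closest samples and take a majority vote over their labels.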
73
Fisher’s Linear Discriminant Analysis
76
$$\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in H_i}\mathbf{x}$$
87
Also refer to the book by Robert Schalkoff, “Pattern Recognition …”, from page 89 onwards, for more details on LDA (library call no. Q327.S297).
88
Supervised Training Vs Unsupervised Training
Supervised training
• A supervisor teaches the system how to classify a known set of patterns with class labels before using it.
• Needs a priori class label information.
Unsupervised training
• Does not depend on a priori class label information.
• Used when a priori class label information is not available or proper training sets are not available.
89
Unsupervised training by clustering
To seek regions such that every pattern falls into one of these regions and no pattern falls in two or more regions, i.e.
$$S_1 \cup S_2 \cup S_3 \cup \dots \cup S_K = S; \quad S_i \cap S_j = \emptyset \quad \forall\, i \neq j$$
* Algorithms classify objects into clusters by natural association according to some similarity measures, e.g. Euclidean distance.
* Clustering algorithms are based either on heuristics or optimization principles.
90
Clustering with a known number of classes
Requires knowledge about the number of classes, e.g. K classes (or clusters).
The K-means algorithm is based on the minimization of the sum of squared distances
$$J_j = \sum_{\mathbf{x}\in S_j(k)} \|\mathbf{x}-\mathbf{z}_j\|^2, \quad j = 1, 2, \dots, K$$
where $S_j(k)$ is the cluster domain for cluster centre $\mathbf{z}_j$ at the kth iteration.
91
K-Means Algorithm
1. Arbitrarily choose K samples as initial cluster centres, i.e. z1(k), z2(k), z3(k), ..., zK(k), with iteration k = 1.
2. At the kth iterative step, distribute the pattern samples x among the K clusters according to the following rule:
$$\mathbf{x}\in S_j(k) \ \text{if} \ \|\mathbf{x}-\mathbf{z}_j(k)\| < \|\mathbf{x}-\mathbf{z}_i(k)\| \ \text{for } i = 1, 2, \dots, K;\ i \neq j$$
92
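3. Compute $\mathbf{z}_j(k+1)$ and $J_j$ for j = 1, 2, ..., K:
$$\mathbf{z}_j(k+1) = \frac{1}{N_j}\sum_{\mathbf{x}\in S_j(k)}\mathbf{x}$$
$$J_j = \sum_{\mathbf{x}\in S_j(k)} \|\mathbf{x}-\mathbf{z}_j(k+1)\|^2$$
where $N_j$ is the number of samples in $S_j(k)$.
4. If $\mathbf{z}_j(k+1) = \mathbf{z}_j(k)$ for j = 1, 2, ..., K, stop.
Otherwise go to step 2.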
93
Advantages:
• Simple and efficient.
• Minimum computations are required.
Drawbacks:
• K must be decided.
• Individual clusters themselves should be tight, and clusters should be widely separated from one another.
• Results usually depend on the order of presentation of training samples and the initial cluster centers.
94
Example
Apply K-means algorithm to the following data for K = 2:
x1 = (0,0), x2 = (0,1), x3 = (1,0), x4 = (1,1), x5 = (2,1),
x6 = (1,2), x7 = (2,2), x8 = (3,2), x9 = (6,6), x10 = (7,6),
x11 = (8,6), x12 = (6,7), x13 = (7,7), x14 = (8,7), x15 = (9,7),
x16 = (7,8), x17 = (8,8), x18 = (9,8), x19 = (8,9), x20 = (9,9)
95
[Figure: scatter plot of the 20 samples in the (x1, x2) plane, both axes from 0 to 10.]
Homework – make use of this dataset with class labels to try out Fisher’s Discriminant method.
96
Example
Iteration 1:
Step 1. Let K = 2, z1(1) = x1 = (0,0)^T, z2(1) = x2 = (0,1)^T. (These starting points are selected randomly.)
Step 2. Calculate distances of all patterns from z1 and z2. The patterns are assigned to S1(1) and S2(1) as
S1(1) = {x1, x3}; S2(1) = {x2, x4, x5, ..., x20}
97
Step 3. Update the cluster centers:
z1(2) = (0.5 0.0)^T; z2(2) = (5.61 5.39)^T
Step 4. zj(1) and zj(2) are different, go back to step 2.
Iteration 2:
Step 2. Calculate distances of all patterns from z1 and z2. The patterns are assigned to S1(2) and S2(2) as
S1(2) = {x1, x2, ..., x8}; S2(2) = {x9, x10, ..., x20}
98
Step 3. Update the cluster centers: z1(3) = (1.25 1.13)^T; z2(3) = (7.67 7.33)^T
Step 4. zj(3) and zj(2) are different, go to step 2.
Iteration 3:
Step 2. Calculate distances of all patterns from z1 and z2. No change in S1 and S2.
99
Step 3.
• Update the cluster centers:
• z1(4) = (1.25 1.13)^T; z2(4) = (7.67 7.33)^T
Step 4.
• zj(3) and zj(4) are the same, stop.
• Final centers:
• z1(4) = (1.25 1.13)^T; z2(4) = (7.67 7.33)^T
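The worked example can be reproduced with a short script. This is a minimal Python sketch of the four K-means steps, using the 20 samples and the starting centres x1 and x2 from the example; the function names and the `max_iter` safeguard are illustrative additions.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(samples, centres, max_iter=100):
    """Steps 2-4: assign samples to the nearest centre, recompute the centres
    as cluster means, and stop when the centres no longer change."""
    centres = [tuple(c) for c in centres]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centres]
        for x in samples:
            j = min(range(len(centres)), key=lambda j: squared_distance(x, centres[j]))
            clusters[j].append(x)
        new_centres = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centres[j]
                       for j, cl in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

SAMPLES = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (1, 2), (2, 2), (3, 2),
           (6, 6), (7, 6), (8, 6), (6, 7), (7, 7), (8, 7), (9, 7),
           (7, 8), (8, 8), (9, 8), (8, 9), (9, 9)]

# Step 1: start from x1 and x2, as in the example.
centres, clusters = k_means(SAMPLES, [SAMPLES[0], SAMPLES[1]])
print(centres)
print([len(c) for c in clusters])
```

The script converges to the same two clusters as the hand calculation: x1 to x8 around (1.25, 1.125) and x9 to x20 around (7.67, 7.33).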
100
Batchelor and Wilkins’ algorithm
This algorithm forms clusters without requiring the specification of the number of clusters, K. Instead, a parameter must be specified, namely the proportion p.
The algorithm:
1. Arbitrarily, take the first sample as the first cluster centre, i.e. z1 = x1.
2. Choose the farthest pattern sample from z1 to be z2.
101
3. Compute the distance from each remaining pattern sample to z1 and z2.
4. Save the minimum distance for each pair of these computations.
5. Select the maximum of these minimum distances.
6. If the maximum distance is greater than a fraction p of d(z1, z2), assign the corresponding sample to a new cluster z3. Otherwise, stop.
102
7. Compute the distance from each remaining pattern sample to each of the 3 clusters and save the minimum distance of every group of 3 distances.
8. Then select the maximum of these minimum distances.
9. If the maximum distance is greater than a fraction p of the typical previous maximum distance, assign the corresponding sample to a new cluster z4. Otherwise, stop.
103
10. Repeat steps 7 to 9 until no new cluster is formed.
11. Assign each sample to its nearest cluster.
104
Example
Apply Batchelor and Wilkins’ algorithm to find the cluster groups for the following samples:
x1 = (0,0), x2 = (3,8), x3 = (2,2), x4 = (1,1),
x5 = (5,3), x6 = (4,8), x7 = (6,3), x8 = (5,4),
x9 = (6,4), x10 = (7,5)
By using the algorithm, the result will be
S1 = {x1, x3, x4}; S2 = {x2, x6};
S3 = {x5, x7, x8, x9, x10}
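The example can be reproduced with a short script. The sketch below follows steps 1 to 11, with one loudly labelled assumption: the "typical previous maximum distance" in step 9 is taken to be the mean of the maximum distances found so far (starting with d(z1, z2)). The slides do not define it, so this choice, the value p = 0.5, and all function names are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def batchelor_wilkins(samples, p):
    """Return (indices of the cluster centres, clusters as lists of sample indices)."""
    n = len(samples)
    def dist(i, j):
        return euclidean(samples[i], samples[j])
    centres = [0]                                        # step 1: z1 = x1
    second = max(range(n), key=lambda i: dist(i, 0))     # step 2: farthest from z1
    centres.append(second)
    max_dists = [dist(0, second)]
    while True:
        rest = [i for i in range(n) if i not in centres]
        if not rest:
            break
        # steps 3-5 / 7-8: minimum distance to any centre, then the maximum of those
        min_d = {i: min(dist(i, c) for c in centres) for i in rest}
        candidate = max(rest, key=lambda i: min_d[i])
        # ASSUMPTION: "typical" distance = mean of the previous maximum distances
        typical = sum(max_dists) / len(max_dists)
        if min_d[candidate] <= p * typical:              # steps 6 / 9: stop
            break
        centres.append(candidate)
        max_dists.append(min_d[candidate])
    clusters = [[] for _ in centres]                     # step 11
    for i in range(n):
        j = min(range(len(centres)), key=lambda j: dist(i, centres[j]))
        clusters[j].append(i)
    return centres, clusters

SAMPLES = [(0, 0), (3, 8), (2, 2), (1, 1), (5, 3),
           (4, 8), (6, 3), (5, 4), (6, 4), (7, 5)]

centres, clusters = batchelor_wilkins(SAMPLES, p=0.5)
print("centres:", ["x%d" % (i + 1) for i in centres])
print("clusters:", [["x%d" % (i + 1) for i in c] for c in clusters])
```

Under this assumption and p = 0.5 the centres are x1, x6 and x7, and the clusters match the result quoted above. Smaller values of p let more samples pass the test in steps 6 and 9, so more clusters are formed.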
105
[Figure: scatter plot of the ten samples in the (x1, x2) plane; x1 axis from 0 to 8, x2 axis from 0 to 9.]
Q: Apply Batchelor and Wilkins’ algorithm to the above training data. Select the proportion parameter as 0.5. Start with x1 = (0,0).
Q: What effect do you think the proportion parameter has on the final clustering?
106
The advantages of this heuristic algorithm are as follows:
i) It determines K,
ii) it self-organises, and
iii) it is efficient for small-to-medium numbers of classes.
107
The disadvantages are :
i) the proportion p must be given,
ii) the ordering of the samples influences the clustering,
iii) the value of p influences the clustering,
iv) the separation is linear,
v) the cluster centers are not recomputed to betterrepresent the classes,
vi) there is no measure of how good the clustering is,
vii) an outlier can upset the whole process.
108
In practice, it is suggested that the process be run for several values of p. For each run, the mean-squared distance of the samples from their cluster center is computed for each cluster k (k = 1, 2, …, K) and then summed over all clusters to obtain the total mean-squared error. The value of p that provides the lowest total mean-squared error is taken as the best one. The final clusters should be averaged to obtain new centers, and all sample vectors reassigned via minimum distance to the new centers.