
Pruning a classifier based on a Self-Organizing Map

using Boolean function formalization

Victor Jose Lobo, Portuguese Naval Academy, Escola Naval, Alfeite, 2800 ALMADA, PORTUGAL,

[email protected]

Roman Swiniarski, Dep. Mathematical and Computer

Sciences, San Diego State University, 5500 Campanile Drive, San Diego CA 92182-7720, USA

[email protected]

Fernando Moura-Pires, Dep. Informatics, New University of Lisbon, Quinta da Torre, 2825

MONTE DA CAPARICA, PORTUGAL, [email protected]

Abstract

An algorithm is presented to minimize the number of neurons needed for a classifier based on Kohonen's Self-Organizing Maps (SOM), or on any other "code-book type" (or "prototype-based") classifier such as Kohonen's Learning Vector Quantization (LVQ), K-means or nearest neighbor. The neuron minimization problem is formalized as a problem of simplification of Boolean functions, and a geometric interpretation of this simplification is provided. A step-by-step example with an illustrative classification problem is given.

Keywords: Pruning, SOM, classification, neural nets

1 - Introduction

Classifying multivariate data with a Gaussian distribution is relatively straightforward, and the quadratic Bayesian classifier, minimizing the probability of error, can easily be obtained [1] [2]. Unfortunately, in real-life situations, the premise of a Gaussian distribution will often not be met. Even if it is, a very large number of input features will still make an ideal classifier difficult to obtain, for three main reasons:

- Numerical instability of the covariance matrix
- Possible singularity of the covariance matrix
- Very large memory requirements for computations

As is widely known [1], the covariance matrix C becomes numerically unstable when the dimensionality is very large. It implies calculating n(n+1)/2 numbers to construct the n x n C matrix, where n is the dimension of the input space. To further aggravate the problem, the determinant of the covariance matrix then has to be computed, causing a serious propagation of numerical errors.

If the number of samples available is small (number of training patterns less than the number of features), then the covariance matrix will be singular, and a Gaussian based maximum likelihood classifier impossible. Such a situation arises often, as for example in [3], where 210 data patterns, each containing 2048 features, were to be classified.

Finally, the sheer amount of memory required to calculate these matrices (one for each class) can be computationally too demanding for most applications.

If the multivariate patterns do not have a Gaussian distribution (as is the general case), and particularly if a region (in the input space) occupied by a class is concave or multimodal, then the classical parametric techniques can provide very poor results.

In these cases, both classical statistics (with K-means) and artificial intelligence (with neural networks) use techniques that involve calculating a number of "reference" patterns and then calculating the distance from these patterns to the ones to be classified. The input space is thus broken into a Voronoi tessellation [4], according to some criterion (many times minimizing the quantization error, or sum of the squared distances to the reference patterns). This is the case with K-means, LVQ, etc. We shall call these methods generically "prototype-based methods" (some authors call them "code-book based methods" [4]), and these are the ones that can be improved by our method. For the sake of simplicity, we shall use neural network terminology and call each "mean", code-book value or prototype a neuron.

One of the big problems with prototype-based methods is that they require the designer to decide how many neurons to use for each class. In a 2-dimensional problem, the number of neurons can easily be estimated with a simple visualization of the data. However, in a high dimensional space, the problem gets considerably more difficult. One common solution is to over-dimension the number of neurons, thus avoiding errors, but increasing computation time for classification. On the other hand, the constructive methods [8] are computationally very intensive, because they generally involve re-computing solutions for different numbers of neurons.




We shall now present a new method to prune an over-dimensioned code-book based classifier. For reasons that will be explained later, we shall select the SOM map [4] as the basis for simplification.

2 - Algorithm for pruning a SOM based classifier

The rationale behind the proposed algorithm for pruning a SOM based classifier is quite simple, and can be explained as follows. An input pattern is classified by the neuron that is its nearest neighbor, from the point of view of the Euclidean distance between the input pattern vector and the neuron's weights. Most of the time, the second, third or further nearest neighbors would still classify the pattern correctly. Thus, the neuron that was originally the nearest neighbor can be removed, and the classifier will still work properly. The only two problems that arise are how to find out which neurons can still classify any given pattern correctly, and how to choose which neurons to discard from the sets of redundant neurons. Let us now define the algorithm for pruning a SOM based classifier with some formalism, and then proceed to explain it.
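To make this rationale concrete, here is a minimal sketch of nearest-prototype classification and of the "second-best neuron" observation. It is written in Python with NumPy rather than the Matlab used later in the paper, and the tiny prototype set is hypothetical, purely for illustration.

```python
import numpy as np

# Hypothetical toy prototype set: one weight vector and one class label per neuron.
neuron_weights = np.array([[0.3, 0.4], [0.8, 0.7], [2.5, 2.6], [2.9, 2.1]])
neuron_labels = np.array(["A", "A", "B", "B"])

def nearest_prototype(x, weights, labels):
    """Classify x with the label of its nearest neuron (Euclidean distance)."""
    distances = np.linalg.norm(weights - x, axis=1)
    return labels[np.argmin(distances)]

x = np.array([0.5, 0.5])                       # a pattern that belongs to class "A"
order = np.argsort(np.linalg.norm(neuron_weights - x, axis=1))

print(nearest_prototype(x, neuron_weights, neuron_labels))   # "A": the winning neuron
# If the winner were removed, the second-best neuron would still give the right
# class, so that winner is redundant for this particular pattern.
print(neuron_labels[order[1]])                                # also "A"
```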

The algorithm that has been developed can be briefly explained as follows:

Given:

i) The training set, consisting of labeled data patterns
ii) A SOM architecture with a sufficiently large (overestimated) number of neurons per class.

Do the following:

1. Train the SOM map, using the traditional method
   1.1 Calibrate the map [4]
   1.2 Select only the neurons Si which have labels (prune the "empty" neurons)
2. For each class in the training set, do the following:
   2.1 For each pattern, construct a set Qj of the best matching neurons Si that still have the same class as the pattern.
   2.2 If any set Qj has only one element, choose that Si as a classifier neuron. Otherwise, select the most frequently occurring neuron Si as a classifier neuron.
   2.3 Remove all the sets Qj that contain the Si selected in 2.2.
   2.4 Repeat 2.2 and 2.3 until there are no Qj sets left.

The neurons selected in this way will classify the test set with at least as much accuracy as the map obtained in step 1, and will be smaller in number, thus forming a simpler classifier.

Let us now explain the algorithm step by step.

2.1 - Building a labeled SOM

Based on the training set, a Kohonen Self-Organizing Map (SOM) [4] is built, deliberately overestimating the number of neurons needed. Although there is still some heuristic involved in the choice of the original number of neurons, its importance is greatly diminished by the further processing that is to be done. The SOM is then trained and calibrated (assigning labels or classes to the output neurons) using Kohonen's rule [4] until we reach the desired misclassification error on the given data set. Usually, for classifiable data sets, the misclassification error can actually be zero, but if it is not, then the final classifier obtained with the proposed algorithm is guaranteed to have a misclassification error not greater than that of the original map, for the same data set. For simplicity's sake, we shall consider that this SOM classifier, whatever its misclassification error might be, is an "accurate" classifier, and use that nomenclature from now onwards.

After the SOM building phase, we have as a result a number of labeled neurons, each representing the mean of a certain group of topologically near patterns [4]. The trained SOM may also have some "dead" neurons, which are useful for novelty detection but useless for "normal", or "positive", classification, and can thus be ignored in the application of the presented method.
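As a sketch of this calibration and dead-neuron pruning step, assume the SOM weights have already been trained by some other code; the majority-vote labeling below is one common way of calibrating a map and is given only as an illustration, not as Kohonen's exact procedure. All variable names are hypothetical.

```python
import numpy as np
from collections import Counter

def calibrate_som(weights, X, y):
    """Label each neuron by majority vote of the training patterns it wins.

    weights : (n_neurons, n_features) trained SOM weight vectors
    X       : (n_patterns, n_features) training patterns
    y       : (n_patterns,) class labels
    Returns {neuron_index: label}; neurons that never win any pattern stay
    unlabeled ("dead") and are simply absent from the returned dict.
    """
    wins = {}
    for x, label in zip(X, y):
        bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        wins.setdefault(bmu, []).append(label)
    return {i: Counter(labels).most_common(1)[0][0] for i, labels in wins.items()}

# Usage: keep only the labeled neurons for the classification stage.
# labels = calibrate_som(trained_weights, X_train, y_train)
# live_neurons = sorted(labels)      # the "empty" (dead) neurons are pruned here
```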

The resulting set of neurons (the map will now have "holes" due to the removal of the dead neurons) will generally contain several neurons for each class. So as to permit a simpler classification, we shall now attempt to reduce this number of neurons, while maintaining the same classification accuracy.

Let us assume that a SOM classifier has been trained and calibrated with 3 classes. This SOM might look like the one presented in Figure 1. In that map, for example, whenever neuron (1,1), (1,2), (2,1) or (2,2) wins, the input pattern belongs to class A.

Figure 1 - An example of a trained and calibrated SOM



2.2 - Constructing best match sets for each class (Q sets)

In a normal "code-book based" classification, the distance from the pattern being classified to each of the neurons is calculated, and the closest neuron is selected as the winner. The pattern is then considered to belong to the same class as the winner. We may now ask ourselves what would happen if the winning neuron were removed. Would the pattern still be correctly classified? That, of course, depends on which label the second best matching neuron has. We can thus "remove" neurons until the best matching neuron belongs to another class (thus originating a classification error).

If, for each pattern in the training set, we compute the distance to all neurons in the map, arrange the neurons in ascending order of distance, and then select neurons up to (but not including) the first that has a label different from that of the training pattern, we obtain the best match set that we called the Q set for that pattern.

The Q set for each training set pattern contains all neurons that are closer to that pattern than any neuron with a different (incorrect) label. Thus, if the final classifier contains any of these neurons, it will classify that pattern correctly.

If a non-zero misclassification rate was admitted when constructing the SOM map, then some of the Q sets will be empty. Those sets shall be ignored, since they belong to patterns that were incorrectly classified in the first place.
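A minimal sketch of this Q set construction, assuming the calibrated map is given as an array of neuron weights with a parallel array of neuron labels (the function and variable names are hypothetical; in the paper the sets are built one class at a time, so X and y would be restricted to the patterns of a single class):

```python
import numpy as np

def build_q_sets(weights, neuron_labels, X, y):
    """Return, for each training pattern, the set of neuron indices that are
    closer to it than any neuron carrying a different label (its Q set)."""
    q_sets = []
    for x, label in zip(X, y):
        order = np.argsort(np.linalg.norm(weights - x, axis=1))
        q = set()
        for i in order:                    # neurons in ascending order of distance
            if neuron_labels[i] != label:  # stop at the first "wrong" label
                break
            q.add(int(i))
        if q:                              # an empty Q set means the pattern is
            q_sets.append(q)               # misclassified, so it is ignored
    return q_sets
```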

Let us provide an example of constructing the Q sets. Suppose that for the SOM of Figure 1 we have 5 training patterns of class A: x1 to x5. Let us now construct an ordered set of the best matching neurons for each of these patterns (assume the presented sets are correct), and then truncate that set when the first neuron with a "wrong" label is encountered. The hypothetical results are presented in Table 1.

Pattern   All best matching neurons (in order)   Q set
x1        A2, A1, A4, B3, B1, A3, B4, ...        A2, A1, A4
x2        A2, A4, B3, A3, A1, B1, ...            A2, A4
x3        A4, A1, B3, B1, C2, A3, ...            A4, A1
x4        A1, A3, B3, A2, C1, A4, ...            A1, A3
x5        A3, C2, C1, B3, A1, A2, ...            A3

Table 1 - Construction of the Q sets

2.3 - Selecting classifier neurons and eliminating Q sets

If a Q set has only one element, it means that if the neuron contained in that set is not present in the final classifier (and all other neurons are), then the best match for this pattern would be a neuron with an incorrect label.

Thus, this neuron is indispensable, and should be selected for the final classifier.

Once a neuron has been selected for the final classifier, any pattern containing that neuron in its Q set will be correctly classified. This is obvious, since the selected neuron will be closer (in the input pattern space) to the pattern than any neuron with a different label. So either it will be the winner, or another neuron, also with the correct label but closer to the input pattern, will be the winner. Thus, we can ignore that pattern and its Q set.

When we are left only with Q sets containing more than one neuron, we want to select a classifier neuron that can classify as many patterns as possible. Thus we select the most frequently occurring neuron in the Q sets, because this one will classify more patterns than any other. For the same reason as above, we can eliminate the Q sets where that neuron occurs. We will later discuss other approaches to selecting the neurons in some optimal way.

Finally we shall be left with no more Q sets, and a set of selected neurons. Those neurons are fewer in number than (or, in the worst case, equal to) the neurons contained in the original SOM map, and thus form a simpler classifier.

Let us provide an example of selecting classifier neurons and eliminating Q sets. In the example of Table 1, set Q5 has only one element (A3), thus A3 is selected. After selecting A3 we may eliminate all sets where it occurs, that is Q4. The sets Q1, Q2, Q3 all have more than one element, so we choose the most frequently occurring neuron, which is A4. We have no more Q sets, so {A3, A4} can classify class A with the same accuracy as the original 4 neurons.
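The selection loop of steps 2.2 to 2.4 can be written as the following greedy procedure; the short demonstration at the end reruns the Table 1 example and recovers {A3, A4}. This is a sketch of the rule as described above, not the authors' program.

```python
from collections import Counter

def select_neurons(q_sets):
    """Greedy selection (steps 2.2-2.4): prefer the neuron of a single-element
    Q set, otherwise the most frequently occurring neuron, then discard every
    Q set already covered by the chosen neuron."""
    remaining = [set(q) for q in q_sets if q]     # drop empty (misclassified) sets
    chosen = []
    while remaining:
        singletons = [q for q in remaining if len(q) == 1]
        if singletons:
            neuron = next(iter(singletons[0]))
        else:
            counts = Counter(n for q in remaining for n in q)
            neuron = counts.most_common(1)[0][0]
        chosen.append(neuron)
        remaining = [q for q in remaining if neuron not in q]
    return chosen

# Rerunning the Table 1 example for class A:
q_table1 = [{"A2", "A1", "A4"}, {"A2", "A4"}, {"A4", "A1"}, {"A1", "A3"}, {"A3"}]
print(select_neurons(q_table1))   # ['A3', 'A4'], the result obtained in the text
```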

3 - Discussion

Two interpretations of the proposed algorithm are presented. The first is a geometrical interpretation in the input space, which provides an insight into why this algorithm makes sense. The second is a much more powerful interpretation, because it provides a method of simplifying any code-book based classifier while doing all the manipulations in a binary Boolean output classification space. This can lead to a faster and more powerful simplification.

Although we will not discuss the problem in this article, the proposed algorithm bears a striking resemblance to Davis and Putnam's method for refutation proofs in theorem proving [5]. In fact, if we consider the existence of a neuron to be a ground clause, and the classifier to be a program in logic, then the rules used by the proposed algorithm are the same as Davis and Putnam's rules. However, in the proposed algorithm, those rules must be applied in a certain order, whereas in theorem proving, as the final objective is different, no such order exists.

3.1 - A geometric interpretation in the input space

The original SOM (or any other code-book type classifier) provides only a set of candidates for the final classifier network. A SOM is preferred because it will select more neurons for a "noisier" class, so those classes will have more "degrees of freedom". This way we can avoid the rather tricky decision of assigning a given number of neurons to a class, as is required by LVQ [4]. However, a K-means classification algorithm could be used with similar results.

The neurons will partition the input space into a Voronoi tessellation, and each class will occupy a few of these areas. Most code-book methods try to minimize the quantization error, or sum of distances, or number of training patterns per neuron, and not the classification error directly. Thus, there will usually be "interior" neurons that are irrelevant for classification purposes, since the "next best" matches for the patterns in that region will be bordering neurons that have the same class. The same applies to neurons in outward border areas.

When we eliminate neurons in steps 2.2 and 2.3 of the proposed algorithm, we are in fact enlarging the tessellation region of the remaining neurons, so as to include the regions of the pruned neurons. By selecting for the Q sets only the neurons that have the same class label, we guarantee that we are not committing any classification errors: any well-labeled neuron that is closer than a wrong-labeled neuron will correctly classify the pattern in a competition with the "wrong-labeled" ones. The border between the classes will actually shift during the pruning process, but the area covered by that shift will not have any training patterns that were previously classified correctly.
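The property described above can be checked directly: no training pattern that was correctly classified by the full map may be lost after pruning. A quick sketch of such a check, with hypothetical variable names and assuming the NumPy arrays of the earlier sketches:

```python
import numpy as np

def nearest_label(x, weights, labels):
    """Label of the nearest neuron to x (Euclidean distance)."""
    return labels[int(np.argmin(np.linalg.norm(weights - x, axis=1)))]

def pruning_is_safe(weights, labels, kept_indices, X, y):
    """True if every training pattern correctly classified by the full map is
    still correctly classified by the pruned map (only neurons in kept_indices)."""
    kept_w, kept_l = weights[kept_indices], labels[kept_indices]
    for x, true_label in zip(X, y):
        if nearest_label(x, weights, labels) == true_label:
            if nearest_label(x, kept_w, kept_l) != true_label:
                return False
    return True
```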

3.2 - A Boolean function interpretation

The data manipulation of phase 2 of the proposed algorithm (selecting neurons from the Q sets) can be viewed as finding the largest implicant of a Boolean function, as we shall now prove.

In step 2.1 of the algorithm, we are constructing sets of neurons (the Q sets) that have the property of being nearer to the given training pattern than any neuron not labeled with the correct class. Thus, that pattern will be correctly classified if any of those neurons exists. Let us now define the following Boolean variables that shall be used to describe the algorithm:

Let Pi take the value 1 if the i-th neuron is present, and 0 if it is not. Here, i can vary from 1 to the number of neurons labeled with the class being analyzed. Let Qj take the value 1 if the j-th training pattern is correctly classified, and 0 if it is not. Here, j can vary from 1 to the number of training patterns with the label of the class being analyzed. Let R take the value 1 if the classifier can correctly classify all training patterns of a given class (or, more precisely, all those that were correctly classified by the original SOM classifier), and 0 if it cannot.

As previously stated, pattern j will be correctly classified if any of the neurons of its Q set is present:

Q_j = P_a \lor P_b \lor \dots \lor P_z

Using a common Boolean algebra notation where “+” represents the “or” operation, and “.” the “and” operation, we have:

Q_j = P_a + P_b + \dots + P_z, or

Q_j = \sum_{P_i \in Q_j} P_i    (1)

where \sum stands for a series of Boolean "or" operations.

Also, a classifier will correctly classify a class c if it can classify all its n_c patterns from the training set: pattern 1 and 2 and ... and n_c:

R = Q_1 \land Q_2 \land \dots \land Q_{n_c}, or

R = \prod_{j=1}^{n_c} Q_j    (2)

where n_c is the number of training patterns for the class, and \prod stands for a series of Boolean "and" operations.

Joining (1) and (2), we obtain:

R = \prod_{j=1}^{n_c} \sum_{P_i \in Q_j} P_i    (3)

This last equation is a simple product of sums. In the field of Boolean functions, it is commonly known as a product of min-implicants (it would be a product of min-terms, or second canonical form [6], if we expanded the sums to contain all free variables). We can now expand the sums and obtain a sum of products, of the form

R = \sum_{k=1}^{K} \prod_{P_i \in M_k} P_i    (4)

1913

Page 5: Pruning a classifier based on a Self-Organizing Map using Boolean

where K is the number of implicants, and M_k the sets of variables included in those implicants. To be more explicit, Eq. (4) will look like R = P_a P_b P_c + P_a P_d + ... This should be read as "R will classify the given class accurately if neurons a and b and c exist, or if neurons a and d exist, or ...". If we want the simplest classifier, we just select the shortest term of the sum.

At this stage, someone more familiar with Boolean functions may think the problem is solved, because there are several methods for simplifying Boolean functions, namely the Quine-McCluskey method [7]. This, however, is not quite true: when we simplify a Boolean function, we have to include all minterms, so as to represent that function correctly. For this classification problem, we do not need to include all minterms. Actually, any minterm is the trivial solution for the problem, since it includes all neurons that are labeled with the class. What we need is the largest implicant possible (the one with the fewest variables), which may not even be an essential implicant!

The elimination process used by the presented algorithm provides a maximal implicant, but not necessarily the largest one. It does, however, provide a fast and efficient way of finding one good implicant, and is thus useful.
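For small maps, the largest implicant (the product term with the fewest variables) can even be found exactly by brute force, since R is true as soon as every Q set contains at least one of the kept neurons. The sketch below uses that set formulation; it is exponential in the number of neurons, so it is only an illustration of the idea and not the paper's method.

```python
from itertools import combinations

def shortest_implicant(q_sets):
    """Smallest set of neurons that intersects every Q set, i.e. the largest
    implicant of R (the product term with the fewest variables).
    Exhaustive search: only practical for a small number of neurons."""
    neurons = sorted(set().union(*q_sets))
    for size in range(1, len(neurons) + 1):
        for subset in combinations(neurons, size):
            if all(q & set(subset) for q in q_sets):
                return set(subset)
    return set(neurons)

# On the Table 1 example this finds a 2-neuron classifier, {'A3', 'A4'}:
print(shortest_implicant([{"A2", "A1", "A4"}, {"A2", "A4"},
                          {"A4", "A1"}, {"A1", "A3"}, {"A3"}]))
```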

4 - Numerical results

A Matlab program was written to show the advantages of the presented method. With this program, a training set is generated, containing two classes of patterns with any desired distribution. The program allows the design and training of a SOM map, calibrates it, and forms the Q sets using the training patterns. The program outputs the Q sets thus obtained; these have to be simplified manually. The program was run several times for different randomly generated data, with similar results. We will now describe the numerical results obtained in one such experiment.

Let us consider a two-class (c1, c2) classification problem, with two-dimensional pattern vectors (x ∈ R^2):

x = [x1, x2]^T,  xi ∈ R,  i = 1, 2

Let us assume that the training set T = {x^1, x^2, ..., x^20} was formed from an a priori gathered pattern set containing ten patterns from class c1, Tc1 = {x^{1,1}, x^{1,2}, x^{1,3}, ..., x^{1,10}}, and ten patterns from class c2, Tc2 = {x^{2,1}, x^{2,2}, x^{2,3}, ..., x^{2,10}}, where T = Tc1 ∪ Tc2.

Class c1 has a uniform distribution in the unit square centered at (0.5, 0.5). Class c2 also has a uniform distribution, but in a larger L-shaped area with its vertex at (3, 3), equal-length "arms" of unit width stretching to the coordinate axes (see Figure 2), so class c2 cannot be described by only one center.

Figure 2 - Distributions of the classes in the input space
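The exact sampling procedure used by the authors' Matlab program is not given, so the following is only one plausible reading of the description above, sketched in Python with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_c1(n):
    """Uniform samples in the unit square centered at (0.5, 0.5)."""
    return rng.uniform(0.0, 1.0, size=(n, 2))

def sample_c2(n):
    """Uniform samples in an L-shaped region with its vertex at (3, 3):
    the square [0, 3] x [0, 3] minus the [0, 2) x [0, 2) corner,
    which leaves two arms of unit width reaching the coordinate axes."""
    points = []
    while len(points) < n:
        p = rng.uniform(0.0, 3.0, size=2)
        if p[0] >= 2.0 or p[1] >= 2.0:     # keep only points inside the L
            points.append(p)
    return np.array(points)

X = np.vstack([sample_c1(10), sample_c2(10)])   # the 20-pattern training set T
y = np.array([1] * 10 + [2] * 10)               # class labels: 1 for c1, 2 for c2
```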

The program then built a SOM map with 3x4 neurons, and displayed the resulting network, in input space coordinates, with the original data patterns (Figure 3).


Figure 3 - The resulting SOM map represented in the input space. The black connected dots represent the trained neurons, in their weight coordinates. The remaining points represent the original patterns (“+” for class 1, “0” for class 2).

A mere visual inspection of Figure 3 reveals that we could represent class c1 with only one neuron (instead of 5 neurons), and class c2 could probably be defined with two.

The program then goes on to calibrate these neurons with the training set, and we obtain the following three sets:

Neurons labeled with class c1 = {1, 2, 5, 6, 9}
Neurons labeled with class c2 = {4, 8, 11, 12}
Unlabeled neurons = {3, 7, 10}




Figure 4 - The resulting SOM map represented in output space

Running a separate program, named QS BUILD, we obtain the Q sets for each training pattern:

Class c1: Q1, ..., Q10 (each of these ten sets contains all five neurons labeled with c1, i.e. {1, 2, 5, 6, 9}).

Class c2: Q11 = {8, 4, 12, 11}, Q14 = {4, 8}, Q16 = {11, 12}, Q17 = {11, 12}, Q18 = {12, 8, 11, 4}, Q19 = {12, 11, 8, 4}; neuron 12 also occurs in Q12, Q13, Q15 and Q20.

For class c1, all patterns contain all 5 labeled neurons in their Q sets, so any one of them might be chosen as a classifier neuron. Let us then choose neuron number 1, for no special reason other than it being the first we came across.

We could also use a Karnaugh map:

Neurons 1, 2, 5 ->   000  001  011  010  110  111  101  100
Neurons 6, 9 = 00:    0    1    1    1    1    1    1    1
Neurons 6, 9 = 01:    1    1    1    1    1    1    1    1
Neurons 6, 9 = 11:    1    1    1    1    1    1    1    1
Neurons 6, 9 = 10:    1    1    1    1    1    1    1    1

(A similar map can be drawn for class c2, with neurons 11 and 12 against neurons 4 and 8.)

From this map it is obvious that any neuron would be sufficient for classification. It is interesting that the upper left corner of the Karnaugh map will always be 0, since a true value in that position would mean that we would not need any neuron to classify the class, which is absurd whenever the class is not empty.

For class c2, there is no Q set with only one element, so we have to find the most frequently occurring neuron. In this case it is neuron number 12 (with 9 occurrences). Thus we select neuron 12, and we may now eliminate patterns 11, 12, 13, 15, 16, 17, 18, 19 and 20. We are then left with only pattern 14, whose Q set is Q14 = {4, 8}. We may choose either of these neurons, so let us take neuron 4.

We are now left with only 3 neurons in a minimum distance classifier: neuron 1 (with weight coordinates w1 = 0.34, w2 = 0.23) for class c1, and neurons 12 and 4 (with weight coordinates w1 = 2.32, w2 = 2.01, and w1 = 1.11, w2 = 2.42, respectively) for class c2.


5 - Conclusions

The main practical advantage of this method is that it provides a computationally efficient way of finding a relatively small but reliable prototype-based classifier.

We are currently working on extending this method so as to minimize the number of neurons simultaneously for all classes (taking into account the effect that minimizing one class has on the others). We further intend to apply an algorithm to find the largest implicant in a large Boolean space in an efficient way, and to use prototype-generating methods other than SOM.

Acknowledgments

This work was supported by grants from FLAD and INVOTAN.

References

[1] Duda, Richard O.; Hart, Peter E.; "Pattern Classification and Scene Analysis", John Wiley & Sons, 1973.

[2] Fukunaga, Keinosuke; "Introduction to Statistical Pattern Recognition", Academic Press, 1990.

[3] Lobo, Victor; Moura-Pires, Fernando; "Ship noise classification using Kohonen Networks"; Proc. of the Engineering Applications of Neural Networks - EANN'95, Helsinki, 1995.


[4] Kohonen, Teuvo; "Self-Organizing Maps", Springer-Verlag, 1995.

[5] Chang, Chin-Liang; Lee, Richard Char-Tung; "Symbolic Logic and Mechanical Theorem Proving", Academic Press, 1973.

[6] Taub, Herbert; "Digital Circuits and Microprocessors", McGraw-Hill, 1988.

[7] Fletcher, Peter; Hoyle, Hughes; Patty, C. Wayne; "Foundations of Discrete Mathematics", PWS-Kent, 1991.

[8] Fritzke, Bernd; "Let it Grow - Self-organizing Feature Maps With Problem Dependent Cell Structure"; Proc. of the ICANN-91, Helsinki, Elsevier Science Publ., 1991.

