
Neurocomputing 47 (2002) 1–20
www.elsevier.com/locate/neucom

Rule extraction from local cluster neural nets

Robert Andrews∗, Shlomo Geva

Faculty of Information Technology, Queensland University of Technology, GPO Box 2434, Brisbane, Q 4001, Australia

Abstract

This paper describes RULEX, a technique for providing an explanation component for local cluster (LC) neural networks. RULEX extracts symbolic rules from the weights of a trained LC net. LC nets are a special class of multilayer perceptrons that use sigmoid functions to generate localised functions. LC nets are well suited to both function approximation and discrete classification tasks. The restricted LC net is constrained in such a way that the local functions are 'axis parallel', thus facilitating rule extraction. This paper presents results for the LC net on a wide variety of benchmark problems and shows that RULEX produces comprehensible, accurate rules that exhibit a high degree of fidelity with the LC network from which they were extracted. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Rule extraction; Local response networks; Knowledge extraction

1. Introduction

In [8] Geva et al. describe the local cluster (LC) network, a sigmoidal perceptron with 2 hidden layers where the connections are restricted in such a way that clusters of sigmoids form local response functions similar to radial basis functions (RBFs). They give a construction and training method for LC networks and show that these networks (i) exceed the function representation capability of generalised Gaussian networks, and (ii) are suitable for discrete classification. They also describe a restricted version of the LC network and state that this version of the network is suitable for rule extraction without, however, describing how this is possible.

∗ Corresponding author. E-mail addresses: [email protected] (R. Andrews), [email protected] (S. Geva).

0925-2312/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0925-2312(01)00577-X


Local function networks are attractive for rule extraction for two reasons. Firstly, it is conceptually easy to see how the weights of a local response unit can be converted to a symbolic rule. Local function units are hyper-ellipsoids in input space and can be described in terms of a reference vector that represents the centre of the hyper-ellipsoid and a set of radii that determine the effective range of the hyper-ellipsoid in each input dimension. The rule derived from the local function unit is formed by the conjunct of these effective ranges in each dimension. Rules extracted from each local function unit are thus propositional and of the form

if ∀ 1 ≤ i ≤ n: x_i ∈ [x_{i lower}, x_{i upper}] then pattern belongs to the target class,  (1)

where [x_{i lower}, x_{i upper}] represents the effective range in the i-th input dimension.

Secondly, because each local function unit can be described by the conjunct of ranges of values in each input dimension, it is easy to add units to the network during training such that the added unit has a meaning that is directly related to the problem domain. In networks that employ incremental learning schemes a new unit is added when there is no significant improvement in the global error. The unit is chosen such that its reference vector, i.e., the centre of the unit, is one of the as yet unclassified points in the training set. Thus the premise of the rule that describes the new unit is the conjunction of the attribute values of the data point, with the rule consequent being the class to which the point belongs. A minimal sketch of this rule representation is given below.
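The following minimal Python sketch (the class and helper names are ours, purely illustrative) shows how a rule of the form (1) can be represented and tested against a pattern:

# Sketch of the rule form in Eq. (1): a conjunction of closed intervals,
# one per input dimension, with a class label as consequent.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Rule:
    ranges: List[Tuple[float, float]]  # [x_i_lower, x_i_upper] per dimension i
    target_class: int                  # rule consequent

    def covers(self, x: List[float]) -> bool:
        # The rule fires only when every antecedent interval is satisfied.
        return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, self.ranges))

# Example: "if x1 in [0.2, 0.6] and x2 in [0.1, 0.5] then class 1".
rule = Rule(ranges=[(0.2, 0.6), (0.1, 0.5)], target_class=1)
print(rule.covers([0.4, 0.3]))  # True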

In recent years there has been a proliferation of methods for extracting rules from trained artificial neural networks (see [2,18] for surveys of the field). While there are many methods for extracting rules from specialised networks, the majority of techniques focus on extracting rules from MLPs. There are a small number of published techniques for extracting rules from local basis function networks. Tresp et al. [20] describe a method for extracting rules from Gaussian RBF units. Berthold and Huber [3,4] describe a method for extracting rules from a specialised local function network, the RecBF network. Abe and Lan [1] describe a recursive method for constructing hyper-boxes and extracting fuzzy rules from the same. Duch et al. [7] describe a method for extraction, optimisation and application of sets of fuzzy rules from 'soft trapezoidal' membership functions which are formed using a method similar to that described in this paper.

In this paper we briefly describe the restricted LC network and introduce the RULEX algorithm for extracting symbolic rules from the weights of a trained, restricted LC neural net. The remainder of this paper is organised as follows. Section 2 describes the restricted LC net. Section 3 looks briefly at the general rule extraction task, describes the ADT taxonomy [2,18] for classifying rule extraction techniques and introduces the RULEX algorithm. Section 4 presents comparative results for the LC, nearest neighbour and C5 techniques on some benchmark problems. Section 5 presents an assessment of RULEX in terms of the rule quality criteria of the ADT taxonomy. The paper closes with Section 6 where we put our main findings into perspective.


Fig. 1.

2. The restricted local cluster network

Geva et al. [8] show that a region of local response is formed by the difference of two appropriately parameterised, parallel, displaced sigmoids:

l(w, r; x) = l⁺(w, r; x) − l⁻(w, r; x) = σ(k_1, wᵀ(x − r) + 1) − σ(k_1, wᵀ(x − r) − 1).  (2)

The 'ridge' function l in Eq. (2) above is a function that is almost zero everywhere except in the region between the steepest parts of the two logistic sigmoid functions.¹ The parameter r is a reference vector; the width of the ridge is given by the reciprocal of |w|; and the value of k_1 determines the shape of the ridge, which can vary from a rectangular impulse for large values of k_1 to a broad bell shape for small values of k_1 (see Fig. 1). Adding n ridge functions l with different orientations but a common centre produces a function f that peaks at the centre where the ridges intersect, but with the component ridges radiating on all sides of the centre (see Fig. 2). To make the function local these component ridges must be 'cut off' without introducing discontinuities into the derivatives of the local function (see Fig. 3).

¹ Here σ(k, h) = 1/(1 + e^{−kh}).
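As a concrete illustration, a direct numerical transcription of the ridge function might look as follows (a sketch under the paper's definitions; the function and variable names are ours):

import numpy as np

def sigmoid(k: float, h: float) -> float:
    # Logistic sigmoid sigma(k, h) = 1 / (1 + exp(-k h)) from footnote 1.
    return 1.0 / (1.0 + np.exp(-k * h))

def ridge(w: np.ndarray, r: np.ndarray, x: np.ndarray, k1: float) -> float:
    # Eq. (2): difference of two parallel, displaced sigmoids. The result is
    # near zero except between their steepest parts; the ridge width is
    # roughly the reciprocal of |w|, and k1 controls the ridge shape.
    h = float(w @ (x - r))
    return sigmoid(k1, h + 1.0) - sigmoid(k1, h - 1.0)

def ridge_sum(W: np.ndarray, r: np.ndarray, x: np.ndarray, k1: float) -> float:
    # n ridges with different orientations (rows of W) sharing centre r.
    return sum(ridge(w_i, r, x, k1) for w_i in W)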


Fig. 2.

Fig. 3.

The function

f(w, r; x) = Σ_{i=1}^{n} l(w_i, r; x)  (3)

is the sum of the n ridge functions, and the function

L(w, r; x) = σ_0(k_2, f(w, r; x) − d)  (4)

eliminates the unwanted regions of the radiating ridge functions when d is selected to ensure that the maximum value of the function f, located at x = r, coincides


with the centre of the linear region of the output sigmoid σ_0.² The parameter k_2 determines the steepness of the output sigmoid σ_0.

Geva et al. [8] show that a target function y*(x) can be approximated by a function y(x) which is a linear combination of m local cluster functions with centres r_μ distributed over the domain of the function. The expression

y(x) = Σ_{μ=1}^{m} α_μ L(w_μ, r_μ; x)  (5)

then describes the generalised LC network, where α_μ is the output weight associated with each of the individual local cluster functions L. (Network output is simply the weighted sum of the outputs of the local clusters.)

In the restricted version of the network the weight matrix w is diagonal:

w_i = (0, …, w_i, …, 0), i = 1, …, n,  (6)

which simplifies the functions l and f as follows:

l(w_i, r; x) = σ(k_1, w_i(x_i − r_i) + 1) − σ(k_1, w_i(x_i − r_i) − 1),  (7)

f(w, r; x) = Σ_{i=1}^{n} l(w_i, r_i, x_i).  (8)

One further restriction is applied to the LC network in order to facilitate rule extraction, viz., the output weight, α_μ, of each local cluster is held constant.³ This measure prevents local clusters 'overlapping' in input space, thus allowing each local cluster to be individually decompiled into a rule. The final form of the restricted LC network for binary classification tasks is given by

y(x) = Σ_{μ=1}^{m} 2 L(w_μ, r_μ; x).  (9)

For multiclass problems several such networks can be combined, one network per class, with the output class being the maximum of the activations of the individual networks.

The LC network is trained using gradient descent on an error surface. The training equations are given in Geva et al. [8] and need not be reproduced here. A minimal sketch of the restricted forward pass of Eqs. (7)-(9) appears below.

² The value of d is given as d = n(1/(1 + e^{−k_1}) − 1/(1 + e^{k_1})), where n is the input dimensionality.
³ The maximum value of L is 0.5. Hence for classification problems where the target values are {0, 1} it is appropriate to set α_μ = 2.
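For concreteness, the restricted network's forward pass (Eqs. (7)-(9), with d from footnote 2 and the output weight fixed at 2) might be sketched as follows; all names are ours, not the authors':

import numpy as np

def restricted_ridge(wi: float, ri: float, xi: float, k1: float) -> float:
    # Eq. (7): axis-parallel ridge in dimension i (diagonal weight matrix).
    s = lambda h: 1.0 / (1.0 + np.exp(-k1 * h))
    return s(wi * (xi - ri) + 1.0) - s(wi * (xi - ri) - 1.0)

def local_cluster(w, r, x, k1: float, k2: float) -> float:
    # Eq. (8): sum of the n axis-parallel ridges with common centre r.
    f = sum(restricted_ridge(wi, ri, xi, k1) for wi, ri, xi in zip(w, r, x))
    # Footnote 2: d places the peak of f (at x = r) on the linear region of
    # the output sigmoid, cutting off the radiating ridges (Eq. (4)).
    d = len(w) * (1.0 / (1.0 + np.exp(-k1)) - 1.0 / (1.0 + np.exp(k1)))
    return 1.0 / (1.0 + np.exp(-k2 * (f - d)))

def lc_output(clusters, x, k1: float, k2: float) -> float:
    # Eq. (9): binary-classification output, every output weight held at 2.
    return sum(2.0 * local_cluster(w, r, x, k1, k2) for w, r in clusters)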

3. The general rule extraction task

Until recently neural networks were very much a 'black box' technology, i.e. a trained neural network accepts input and produces output without providing a


mechanism whereby a user of the network can explain/verify that the output is 'correct' within the province of the problem domain. Rule extraction provides such an explanation/verification mechanism by providing a symbolic link between network inputs and outputs.

The rule extraction process may be illustrated with the following simple example. A data set consisting of 2 real valued attributes plus a class label is constructed and a LC network is trained on the data, the task being to learn the classification. Fig. 4a below is a plot of the data set. The horizontal axes represent input values while the vertical axis shows the target class, {0, 1}, of the data points. Fig. 4b is a contour plot of the data set. In this simple example it is clear that the data belonging to class 1 occurs in 2 distinct clusters.

Fig. 5a is a plot of inputs against the outputs of the trained network. Fig. 5b is a contour plot of network outputs with the contour line set to the network classification threshold.

Analysis of the trained network showed that 2 local clusters were required to learn the problem and that each local cluster covered a disjoint region of input space. From Fig. 5b above it is clear that the behaviour of the network can be explained by approximating the hyper-ellipsoid decision boundaries of the local clusters with hyper-rectangular rules of the form given in Eq. (1). Fig. 6 below shows the rules extracted from the local clusters.

This simple and artificial problem serves to illustrate the rule extraction process in general and rule extraction from local cluster networks in particular. The example also shows how rule extraction provides an explanation/verification facility for artificial neural networks.

3.1. A taxonomy for classifying rule extraction techniques

Andrews et al. [2] describe the ADT taxonomy for describing rule extraction techniques. This taxonomy was refined in Tickle et al. [18] to better cater for the profusion of published techniques for eliciting knowledge from trained neural networks. The taxonomy consists of five primary classification criteria, viz.

(a) the expressive power (or, alternately, the rule format) of the extracted rules;
(b) the quality of the extracted rules;
(c) the translucency of the view taken within the rule extraction technique of the underlying neural network;
(d) the complexity of the rule extraction algorithm;
(e) the portability of the rule extraction technique across various neural network architectures (i.e., the extent to which the underlying neural network incorporates specialised training regimes).

The expressive power of the rules describes the format of the extracted rules. Currently there exist rule/knowledge extraction techniques that extract rules in various formats including propositional rules [5,14,15], fuzzy rules [12,13], scientific laws [16], finite state automata [9], decision trees [6], and m-of-n rules [19].


Fig. 4.

The rule quality criterion is assessed via four characteristics, viz.,

(a) rule accuracy, the extent to which the rule set is able to classify a set of previously unseen examples from the problem domain;
(b) rule fidelity, the extent to which the extracted rules mimic the behaviour of the network from which they were extracted;


Fig. 5.

(c) rule consistency, the extent to which, under differing runs of the rule extraction algorithm, rule sets are generated which produce the same classifications of unseen examples;
(d) rule comprehensibility, the size of the extracted rule set in terms of the number of rules and number of antecedents per rule. (A sketch of how the first two measures can be computed appears after this list.)
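Assuming the interval-based Rule sketch given in Section 1 (our illustrative code, not the authors' implementation), accuracy and fidelity could be computed along these lines:

def rule_set_predict(rules, x, default_class=0):
    # First matching rule wins; patterns no rule covers get a default class.
    for rule in rules:
        if rule.covers(x):
            return rule.target_class
    return default_class

def rule_accuracy(rules, labelled_data):
    # (a) accuracy: agreement with the true labels of unseen examples,
    # where labelled_data is a list of (pattern, target) pairs.
    return sum(rule_set_predict(rules, x) == t
               for x, t in labelled_data) / len(labelled_data)

def rule_fidelity(rules, network_predict, inputs):
    # (b) fidelity: agreement with the trained network's own classifications.
    return sum(rule_set_predict(rules, x) == network_predict(x)
               for x in inputs) / len(inputs)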


Fig. 6.

The translucency criterion categorises a rule extraction technique according to the granularity of the neural network assumed by the rule extraction technique. Andrews et al. [2] use three key identifiers to mark reference points along a continuum of granularity from decompositional (rules are extracted at the level of individual hidden and output layer units) to pedagogical (the network is treated as a 'black box'; extracted rules describe global relationships between inputs and outputs; no analysis of the detailed characteristics of the neural network itself is undertaken).

The algorithmic complexity of the rule extraction technique provides a useful measure of the efficiency of the process. It should be noted, however, that few authors in the surveys [2,18] reported or commented on this issue.

The portability criterion assesses ANN rule-extraction techniques in terms of the extent to which a given technique could be applied across a range of ANN architectures and training regimes. Currently, there is a preponderance of techniques that might be termed specific purpose techniques, i.e. those where the rule extraction technique has been designed specifically to work with a particular ANN architecture. A rule extraction algorithm that is tightly coupled to a specific neural network architecture has limited use unless the architecture can be shown to be applicable to a broad cross section of problem domains.

3.2. The RULEX technique

RULEX is a decompositional technique that extracts propositional rules of the form given in (1) above. As such the imperative is to be able to determine [x_{i lower}, x_{i upper}] for each input dimension i of each local cluster. This section describes how these values can be determined.

Eq. (7) can be rewritten as

l(w_i, r; x) = σ(k_i, x_i − r_i + b_i) − σ(k_i, x_i − r_i − b_i),  (10)

where k_i = k_1 w_i and b_i = 1/w_i. Here k_i represents the shape of an individual ridge and b_i represents the width of the individual ridge.

From Eq. (4) we see that the output of a local cluster unit is determined by the sum of the activations of all its component ridges. Therefore, the minimum possible activation of an individual ridge, the i-th ridge say, in a local cluster unit that has activation barely greater than its classification threshold will occur when all ridges other than the i-th ridge have maximum activation.


We define the functions min(·) and max(·) as the minimum and maximum values, respectively, of their function arguments. Then

min(l(w_i, r_i, x_i)) = max(l(w_b, r_b, x_b)) − ln(1/θ_T − 1)/k_2,  (11)

where θ_T is the activation threshold of the local cluster and max(l(w_b, r_b, x_b)) is the maximum possible activation for any ridge function in the local cluster. As k_2 and θ_T are constants, and max(l(w_b, r_b, x_b)) can be calculated, the value of the minimum activation of the i-th ridge, min(l(w_i, r_i, x_i)), can be calculated in a straightforward manner. See Appendix A for the derivation of Eq. (11).

Let φ = min(l(w_i, r_i, x_i)), m = e^{−(x_i − r_i)k_i}, and n = e^{−b_i k_i}. From Eqs. (10) and (11) we have

φ = 1/(1 + mn) − 1/(1 + m/n).  (12)

Let p = (1 − φ)e^{b_i k_i} and q = (φ + 1)e^{−b_i k_i}. Solving Eq. (12) for m and backsubstituting for m and n gives

x_i = r_i − ln((p − q ± √(p² + q² − 2(φ² + 1)))/(2φ)) / k_i.  (13)

See Appendix B for the derivation of Eq. (13). Thus for the i-th ridge function the extremities of the active range, [x_{i lower}, x_{i upper}], are given by the expressions

x_{i lower} = r_i − Δ_{lower}/k_i,  (14)

x_{i upper} = r_i + Δ_{upper}/k_i,  (15)

where Δ_{lower} is the negative root of the ln(·) expression in Eq. (13) above and Δ_{upper} is the positive root. A numeric sketch of this computation follows.
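The sketch below (our naming; it assumes θ_T is attainable so the square root is real) follows Eqs. (10)-(15) directly, returning both ends of the active range by evaluating the two roots and ordering them:

import math

def ridge_active_range(ri: float, wi: float, k1: float, k2: float,
                       theta_T: float):
    # Reparameterisation after Eq. (10): k_i = k1 * w_i, b_i = 1 / w_i.
    ki, bi = k1 * wi, 1.0 / wi
    # Maximum possible ridge activation, attained at x_i = r_i.
    max_l = 1.0 / (1.0 + math.exp(-ki * bi)) - 1.0 / (1.0 + math.exp(ki * bi))
    # Eq. (11): minimum activation the i-th ridge must still contribute.
    phi = max_l - math.log(1.0 / theta_T - 1.0) / k2
    # Eq. (13): the two roots of the quadratic in m = exp(-(x_i - r_i) k_i).
    p = (1.0 - phi) * math.exp(bi * ki)
    q = (phi + 1.0) * math.exp(-bi * ki)
    disc = math.sqrt(p * p + q * q - 2.0 * (phi * phi + 1.0))
    xs = [ri - math.log((p - q + s * disc) / (2.0 * phi)) / ki
          for s in (1.0, -1.0)]
    # Eqs. (14)-(15): the roots delimit the active range around r_i.
    return min(xs), max(xs)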

3.3. Simplification of extracted rules

One of the main purposes of rule extraction from neural networks is to provide an explanation facility for the decisions made by the network. As such it is clearly important that the extracted rules be as comprehensible as possible. The directly extracted rule set may contain:

(a) redundant rules;
(b) individual rules with redundant antecedent condition(s); and
(c) pairs of rules where antecedent conditions can be combined.

Rule b is redundant and may be removed from the rule set if there exists a more general rule a such that

∀ 1 ≤ i ≤ n: [x_{i lower}^b, x_{i upper}^b] ⊆ [x_{i lower}^a, x_{i upper}^a].


A rule is also redundant and may be removed from the rule set if

∃ 1 ≤ i ≤ n: [i_lower, i_upper] ∩ [x_{i lower}, x_{i upper}] = ∅,

where [i_lower, i_upper] represents the entire range of values in the i-th input dimension. An antecedent condition is redundant and may be removed from a rule if

∃ 1 ≤ i ≤ n: [i_lower, i_upper] ⊆ [x_{i lower}, x_{i upper}].

Rules a and b may be merged on the antecedent for input dimension j if

∀ 1 ≤ i ≤ n: (i ≠ j) ∧ ([x_{i lower}^a, x_{i upper}^a] = [x_{i lower}^b, x_{i upper}^b]).

RULEX implements facilities for simplifying the directly extracted rule set in order to improve the comprehensibility of the rule set. The simplification is achieved without compromising the accuracy of the rule set. These tests translate directly into interval checks, as sketched below.
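A sketch of the three tests using the Rule representation from Section 1 (helper names are ours):

def subsumes(general, specific):
    # A rule is redundant if a more general rule of the same class encloses
    # each of its antecedent intervals.
    return (general.target_class == specific.target_class
            and all(g_lo <= s_lo and s_hi <= g_hi
                    for (s_lo, s_hi), (g_lo, g_hi)
                    in zip(specific.ranges, general.ranges)))

def redundant_antecedents(rule, domain):
    # domain[i] = (lower, upper) bounds of input dimension i. An antecedent
    # whose interval contains the whole dimension constrains nothing.
    return [i for i, ((d_lo, d_hi), (lo, hi))
            in enumerate(zip(domain, rule.ranges))
            if lo <= d_lo and d_hi <= hi]

def mergeable_on(rule_a, rule_b, j):
    # Rules a and b merge on dimension j when they agree on every other
    # antecedent; the merged rule takes the union of the two j-intervals.
    return all(i == j or ra == rb
               for i, (ra, rb) in enumerate(zip(rule_a.ranges, rule_b.ranges)))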

4. Comparative results

The restricted LC network has been applied to a variety of datasets available from the machine learning repository at Carnegie Mellon University. These datasets were selected to show the general applicability of the network. The datasets contain missing values, noisy data, continuous and discrete valued attributes, a mixture of high and low dimensionality, and a variety of binary and multi-class classification tasks. Table 1 below summarises the problem domains used in this study.

A variety of methods including linear regression, cross validation nearest neighbour (XVNN), and C5 were chosen to provide comparative results for the restricted LC network. XVNN is a technique whereby 90% of the data is used as a 'codebook' for classifying the remaining 10% of the data, as sketched below.
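A hypothetical minimal implementation of XVNN as just described (our names; single nearest neighbour, Euclidean distance assumed):

import numpy as np

def xvnn_error(X: np.ndarray, y: np.ndarray, folds: int = 10,
               seed: int = 0) -> float:
    # Each fold (10% of the data) is classified by its single nearest
    # neighbour in the remaining 90% 'codebook'.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    errors = 0
    for fold in np.array_split(order, folds):
        codebook = np.setdiff1d(order, fold)
        for i in fold:
            dists = np.linalg.norm(X[codebook] - X[i], axis=1)
            errors += int(y[codebook[np.argmin(dists)]] != y[i])
    return errors / len(X)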

Table 1
Summary of problem domains used in the study

Domain                        Cases  Classes  Attributes  Continuous  Discrete  Missing
Annealing processes             898     6        38           √          √        √
Auto insurance                  205     6        25           √          √        √
Breast cancer (Wisconsin)       699     2         9           ×          √        √
Horse colic                     368     2        22           √          √        √
Credit screening (Australia)    690     2        15           √          √        √
Pima diabetes                   768     2         8           √          ×        ×
Glass identification            214     6         9           √          ×        ×
Heart disease (Cleveland)       303     2        13           √          √        √
Heart disease (Hungarian)       294     2        13           √          √        √
Hepatitis prognosis             155     2        19           √          √        √
Iris classification             150     3         4           √          ×        ×
Labor negotiations               57     2        16           √          √        √
Sick euthyroid                 3772     2        29           √          √        √
Sonar classification            208     2        60           √          ×        ×


Table 2
Summary of results for selected problem domains and methods

Domain                         NN    Linear machine  C5 boost   LC    RULEX
Annealing processes            9.9        3.2          4.2      2.5    16.2
Auto insurance                16.0       35.5         16.0     15.6    27.0
Breast cancer (Wisconsin)      4.7        4.2          3.6      3.2     5.7
Horse colic                   18.9       16.4         15.7     13.5    14.1
Credit screening (Australia)  17.3       13.8         11.4     12.9    15.7
Pima diabetes                 29.3       22.4         24.2     22.6    27.4
Glass identification          29.5       37.6         28.9     36.7    34.4
Heart disease (Cleveland)     24.3       17.3         19.8     15.8    19.8
Heart disease (Hungarian)     22.8       16.6         22.0     14.6    18.7
Hepatitis prognosis           18.7       14.7         18.1     16.2    21.3
Iris classification            4.7        6.7          5.3      4.7     6.0
Labor negotiations            22.0       12.0         18.3      8.3    12.3
Sick euthyroid                 3.8        6.1          1.0      7.8     7.6
Sonar classification          13.9       22.5         17.8     15.4    21.5

Linear regression was used to obtain a base line for comparison. The nearest neighbour method was chosen because it is a simple form of local function classifier. C5 was chosen because it is widely used in machine learning as a benchmarking tool. Further, C5 is an example of an 'axis parallel' classifier and as such it provides an ideal comparison for the LC network. Ten fold cross validation results are shown in Table 2 above. Figures quoted are average percentage error rates.

These results show that in all of the study domains, except the sick euthyroid domain, the restricted LC network produces results that are comparable to those obtained by C5. Further, the results also show that RULEX, even though its primary purpose is explanation not classification, is able to extract accurate rules from the trained network, i.e. rules that provide a high degree of accuracy when used to classify previously unseen examples. As can be seen from the table above, the classification accuracy of RULEX is generally slightly worse than that of the LC network. This is most likely due to the combined effects of

(i) the LC network solution benefiting from a degree of interaction between local clusters. Even though the restricted LC network has been constructed to minimise such interaction there are certain conditions under which local functions 'cooperate' in classifying data. Cooperation occurs when ridges of two local clusters 'overlap' in input space in such a way that the activation of individual local clusters in the region of overlap is less than the classification threshold, but the network output, i.e., the sum of the activations of the local clusters, is greater than the classification threshold. When the local clusters are decompiled into rules the region of overlap is not captured in the extracted rule set. This can result in a number of false negative classifications made by the rule set.


(ii) the RULEX algorithm approximating the hyper-ellipsoidal local clusters with hyper-rectangular rules. By putting a hyper-rectangle around a hyper-ellipsoid, a region of input space not covered by the hyper-ellipsoid is covered by the hyper-rectangle. In problem domains where data values are closely packed this can lead to an increase in the number of false positive classifications made by the rule set.

5. RULEX and the ADT taxonomy

This section places the RULEX algorithm into the classification framework of the ADT taxonomy presented in Section 3.

(a) Rule format
From (1) it can be seen that RULEX extracts propositional rules. In the directly extracted rule set each rule contains an antecedent condition for each input dimension as well as a rule consequent which describes the output class covered by the rule. As mentioned in Section 3.3, RULEX provides a rule simplification process which removes redundant rules and antecedent conditions from the directly extracted rules. The reduced rule set contains rules that consist of only those antecedents that are actually used by the trained LC network in discriminating between input patterns.

(b) Rule quality
As stated previously, the prime function of rule extraction algorithms such as RULEX is to provide an explanation facility for the trained network. The rule quality criteria provide insight into the degree of trust that can be placed in the explanation. Rule quality is assessed according to the accuracy, fidelity, consistency and comprehensibility of the extracted rules. Table 3 below presents data that allows a quantitative measure to be applied to each of these criteria.

(i) Accuracy. Despite the mechanism employed to avoid local cluster units 'overlapping' during network training (see Eq. (9)) it is clear that there is some degree of interaction between local cluster units. (The larger the values of the parameters k_1 and k_2 the less the interaction between units, but the slower the network training.) This effect becomes more apparent in problem domains with high dimension input space and in network solutions involving large numbers of local cluster units. Further, RULEX approximates the hyper-ellipsoidal local cluster functions of the LC network with hyper-rectangles. It is therefore not surprising that the classification accuracy of the extracted rules is less than that of the underlying network. It should be noted, however, that while the accuracy figures quoted for RULEX are worse than the LC network they are comparable to those obtained from C5.

(ii) Fidelity. Fidelity is closely related to accuracy, and the factors that affect accuracy, viz. interaction between units and approximation of hyper-ellipsoids by hyper-rectangles, also affect the fidelity of the rule sets.


Table 3
Rule quality assessment

Domain                        LC error  RULEX error  Local clusters  Rules  Antecedents per rule  Fidelity
Annealing processes             2.5%      16.2%           16           16           20             85.9%
Auto insurance                 15.6%      27.0%           60           57           13             86.5%
Breast cancer (Wisconsin)       3.2%       5.7%            5            5           24             97.3%
Horse colic                    13.5%      14.1%            5            2.5          8             99.3%
Credit screening (Australia)   12.9%      15.7%            2            2            5             96.8%
Pima diabetes                  22.6%      27.4%            5            5            5             93.9%
Glass identification           36.7%      34.4%           22           19            6             96.6%
Heart disease (Cleveland)      15.8%      19.8%            4            3            5             95.2%
Heart disease (Hungarian)      14.6%      18.7%            3            2            5             95.2%
Hepatitis prognosis            16.2%      21.3%            6            4            8             93.9%
Iris classification             4.7%       6.0%            3            3            3             98.6%
Labor negotiations              8.3%      12.3%            2            2            7             95.6%
Sick euthyroid                  7.8%       7.6%            4            4            5             99.8%
Sonar classification           15.4%      21.5%            4            3            8             92.7%

In general, the rule sets extracted by RULEX display an extremely high degree of fidelity with the LC networks from which they were drawn.

(iii) Consistency. Rule extraction algorithms that generate rules by querying the trained neural network with patterns drawn randomly from the problem domain [6,17] have the potential to generate a variety of different rule sets from any given training run of the neural network. Such algorithms have the potential for low consistency. RULEX on the other hand is a deterministic algorithm that always generates the same rule set from any given training run of the LC network. Hence RULEX always exhibits 100% consistency.

(iv) Comprehensibility. In general, comprehensibility is inversely related to the number of rules and to the number of antecedents per rule. The LC network is based on a greedy, covering algorithm. Hence its solutions are achieved with relatively small numbers of training iterations and are typically compact, i.e. the trained network contains only a small number of local cluster units. Given that RULEX converts each local cluster unit into a single rule, the extracted rule set contains, at most, the same number of rules as there are local cluster units in the trained network. The rule simplification procedures built into RULEX potentially reduce the size of the rule set and ensure that only significant antecedent conditions are included in the final rule set. This leads to extracted rules with as high comprehensibility as possible.

(c) Translucency
RULEX is distinctly decompositional in that rules are extracted at the level of the hidden layer units. Each local cluster unit is treated in isolation, with the local cluster weights being converted directly into a rule.


(d) Algorithmic complexity
Golea [10,11] showed that, in many cases, the computational complexity of extracting rules from trained ANNs and the complexity of extracting the rules directly from the data are both NP-hard. Hence the combination of ANN learning and ANN rule-extraction potentially involves significant additional computational cost over direct rule-learning techniques. Table 4 in Appendix C gives an outline of the RULEX algorithm. Table 5 in Appendix C expands the individual modules of the algorithm. From these descriptions it is clear that the majority of the modules are linear in the number of local clusters (or rules) and the number of input dimensions, O(lc × n). The modules associated with rule simplification are, at worst, polynomial in the number of rules, O(lc²). RULEX is therefore computationally efficient and has some significant advantages over rule extraction algorithms that rely on a (potentially exponential) 'search and test' strategy [14,19]. Thus the use of RULEX to include an explanation facility adds little in the way of overhead to the neural network learning phase.

(e) Portability
RULEX is non-portable, having been specifically designed to work with local cluster (LC) neural networks. This means that it cannot be used as a general purpose device for providing an explanation component for existing, trained neural networks. However, as has been shown in the results presented in Section 4, the LC network is applicable to a broad range of problem domains (including continuous valued and discrete valued domains, and domains which include missing values). Hence RULEX is also potentially applicable to a broad variety of problem domains.

6. Conclusion

This paper has described the restricted form of the LC local cluster neural network and the associated RULEX algorithm that can be used to provide an explanation facility for the trained network. Results were given for the LC network which show that the network is applicable across a broad spectrum of problem domains and produces results that are at least comparable to, and in many cases better than, C5, an accepted benchmark standard in machine learning, on the problem domains studied. The RULEX algorithm has been evaluated in terms of the guidelines laid out for rule extraction techniques. RULEX is a decompositional technique capable of extracting accurate and comprehensible propositional rules. Further, rule sets produced by RULEX show high fidelity with the network from which they were extracted. RULEX has been shown to be computationally efficient. Analysis of RULEX reveals only two drawbacks. Firstly, the technique is not portable, having been designed specifically to work with trained LC networks. Secondly, the technique is not immune from the so-called 'curse of dimensionality', i.e. in higher dimension problems accuracy and fidelity suffer. This is due to (i) approximating hyper-ellipsoid local functions with hyper-rectangular rules, and (ii) the increased likelihood of interaction between local clusters in the trained network.


Acknowledgements

The authors would like to thank the two anonymous reviewers for their helpful and constructive comments.

Appendix A. Derivation of minimum ridge activation

Let L = l(w_i, r_i, x_i). Then

θ_T = σ(k_2, (n − 1)max(L) + min(L) − d).  (A.1)

Now d = n max(L), so

θ_T = σ(k_2, min(L) − max(L)),  (A.2)

θ_T = 1/(1 + e^{−k_2(min(L) − max(L))}),  (A.3)

1/θ_T − 1 = e^{−k_2 min(L)} / e^{−k_2 max(L)},  (A.4)

(1/θ_T − 1) e^{−k_2 max(L)} = e^{−k_2 min(L)},  (A.5)

ln(1/θ_T − 1) − k_2 max(L) = −k_2 min(L),  (A.6)

min(L) = max(L) − ln(1/θ_T − 1)/k_2.  (A.7)
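A quick numeric sanity check of (A.7), with arbitrary illustrative values: compute θ_T forward via (A.2)-(A.3), then recover min(L) via (A.7).

import math

k2, max_L, min_L = 4.0, 0.46, 0.12
theta_T = 1.0 / (1.0 + math.exp(-k2 * (min_L - max_L)))   # (A.2)-(A.3) forward
recovered = max_L - math.log(1.0 / theta_T - 1.0) / k2    # (A.7) inverse
assert abs(recovered - min_L) < 1e-12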

Appendix B. Derivation of range of activation for the i-th ridge function

Let φ = min(l(w_i, r_i, x_i)), m = e^{−(x_i − r_i)k_i}, and n = e^{−b_i k_i}. From Eqs. (10) and (11) we have

φ = σ(k_i, x_i − r_i + b_i) − σ(k_i, x_i − r_i − b_i),  (B.1)

φ = 1/(1 + mn) − 1/(1 + m/n),  (B.2)

φ = ((1 + m/n) − (1 + mn)) / ((1 + mn)(1 + m/n)),  (B.3)

φ(1 + mn)(1 + m/n) = (1 + m/n) − (1 + mn),  (B.4)

[φ(1 + m/n) + 1](1 + mn) = (1 + m/n),  (B.5)

(φ + φm/n + 1)(1 + mn) = (1 + m/n),  (B.6)

φm² + (φ + 1)mn + (φ − 1)m/n + φ = 0,  (B.7)

φm² + [(φ + 1)n + (φ − 1)/n]m + φ = 0.  (B.8)

Let a = φ, b = (φ + 1)n + (φ − 1)/n, and c = φ. Solving for m gives roots at

m = ((1 − φ)/n − (φ + 1)n ± √((φ + 1)²n² + 2(φ + 1)(φ − 1) + ((φ − 1)/n)² − 4φ²)) / (2φ),  (B.9)

m = ((1 − φ)/n − (φ + 1)n ± √(((φ − 1)/n)² + (φ + 1)²n² − 2(φ² + 1))) / (2φ).  (B.10)

Now n = e^{−b_i k_i}. Substituting for n into (B.10) gives

m = ((1 − φ)e^{b_i k_i} − (φ + 1)e^{−b_i k_i} ± √((φ − 1)²e^{2b_i k_i} + (φ + 1)²e^{−2b_i k_i} − 2(φ² + 1))) / (2φ).  (B.11)

Let p = (1 − φ)e^{b_i k_i} and q = (φ + 1)e^{−b_i k_i}. Then

m = (p − q ± √(p² + q² − 2(φ² + 1))) / (2φ).  (B.12)

Now m = e^{−(x_i − r_i)k_i}. This gives the following as an expression for x_i:

x_i = r_i − ln(m)/k_i.  (B.13)
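A round-trip check of (B.12)-(B.13) with arbitrary illustrative ridge parameters: evaluate the ridge at some x via Eq. (10), then recover x from the activation φ.

import math

k1, wi, ri, x = 5.0, 2.0, 0.3, 0.55
ki, bi = k1 * wi, 1.0 / wi
sig = lambda k, h: 1.0 / (1.0 + math.exp(-k * h))
phi = sig(ki, x - ri + bi) - sig(ki, x - ri - bi)           # Eq. (10)
p = (1.0 - phi) * math.exp(bi * ki)
q = (phi + 1.0) * math.exp(-bi * ki)
disc = math.sqrt(p * p + q * q - 2.0 * (phi * phi + 1.0))
roots = [(p - q + disc) / (2.0 * phi), (p - q - disc) / (2.0 * phi)]
xs = [ri - math.log(m) / ki for m in roots]                 # (B.13)
assert any(abs(x_r - x) < 1e-9 for x_r in xs)               # one root recovers x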

Appendix C. Algorithmic complexity of RULEX (Tables 4 and 5)

Table 4
The RULEX algorithm

rulex() {
    create_data_structures();
    create_domain_description();
    for each local cluster
        for each ridge function
            calculate_ridge_limits();
    while redundancies remain
        remove_redundant_rules();
        remove_redundant_antecedents();
        merge_antecedents();
    endwhile;
    feed_forward_test_set();
    display_rule_set();
} // end rulex


Table 5
Modules of the RULEX algorithm

remove_redundant_rules() {
    OKtoremove = false;
    for each rule a
        for each other rule b
            for each dimension i
                if ([x_i_lower^b, x_i_upper^b] ⊆ [x_i_lower^a, x_i_upper^a]) OR
                   ([i_lower, i_upper] ∩ [x_i_lower, x_i_upper] = ∅)
                then OKtoremove = true;
    if OKtoremove
    then remove_rule_b();
} // end remove_redundant_rules

remove_redundant_antecedents() {
    for each rule
        for each dimension i
            if [i_lower, i_upper] ⊆ [x_i_lower, x_i_upper]
            then remove_antecedent_i();
} // end remove_redundant_antecedents

merge_antecedents() {
    OKtomerge = true;
    for each rule a {
        for each other rule b
            for each dimension i
                for each other dimension j
                    if NOT ([x_i_lower^a, x_i_upper^a] = [x_j_lower^b, x_j_upper^b])
                    then OKtomerge = false;
        if OKtomerge
        then {
            [x_i_lower^a, x_i_upper^a] = [x_i_lower^a, x_i_upper^a] ∪ [x_j_lower^b, x_j_upper^b];
            remove_rule_b();
        }
    }
} // end merge_antecedents

feed_forward_test_set() {
    errors = 0;
    correct = 0;
    for each pattern in the test set
        for each rule {
            classified = true;
            for each dimension i
                if test_i ∉ [x_i_lower, x_i_upper]
                then classified = false;
            if classified ∧ test_target = rule_class_label
            then ++correct;
            else ++errors;
        }
} // end feed_forward_test_set


Table 5 (continued)

display_rule_set() {
    for each rule {
        for each dimension i
            write([x_i_lower, x_i_upper]);
        write(rule_class_label);
    }
} // end display_rule_set

References

[1] S. Abe, M.S. Lan, A method for fuzzy rules extraction directly from numerical data and its application to pattern classification, IEEE Trans. Fuzzy Systems 3 (1) (1995) 18–28.

[2] R. Andrews, A.B. Tickle, J. Diederich, A survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge Based Systems 8 (1995) 373–389.

[3] M. Berthold, K. Huber, From radial to rectangular basis functions: a new approach for rule learning from large datasets, Technical Report 15-95, University of Karlsruhe, 1995.

[4] M. Berthold, K. Huber, Building precise classifiers with automatic rule extraction, Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, 1995, Vol. 3, 1263–1268.

[5] G.A. Carpenter, A.W. Tan, Rule extraction: from neural architecture to symbolic representation, Connection Sci. 7 (1) (1995) 3–27.

[6] M. Craven, Extracting comprehensible models from trained neural networks, Ph.D. Thesis, University of Wisconsin, Madison, Wisconsin, 1996.

[7] W. Duch, R. Adamczak, K. Grabczewski, Neural optimisation of linguistic variables and membership functions, Proceedings of the 6th International Conference on Neural Information Processing ICONIP '99, Perth, Australia, 1999, Vol. II, 616–621.

[8] S. Geva, K. Malmstrom, J. Sitte, Local cluster neural net: architecture, training and applications, Neurocomputing 20 (1998) 35–56.

[9] L. Giles, C. Omlin, Rule revision with recurrent networks, IEEE Trans. Knowledge Data Eng. 8 (1) (1996) 183–197.

[10] M. Golea, On the complexity of rule extraction from neural networks and network querying, Proceedings of the Rule Extraction From Trained Artificial Neural Networks Workshop, Society For the Study of Artificial Intelligence and Simulation of Behavior Workshop Series '96, University of Sussex, Brighton, UK, 1996, 51–59.

[11] M. Golea, On the complexity of extracting simple rules from trained neural nets (1997), to appear.

[12] Y. Hayashi, A neural expert system with automated extraction of fuzzy If-Then rules and its application to medical diagnosis, Adv. Neural Inform. Process. Systems 3 (1990) 578–584.

[13] S. Horikawa, T. Furuhashi, Y. Uchikawa, On fuzzy modeling using fuzzy neural networks with the back-propagation algorithm, IEEE Trans. Neural Networks 3 (5) (1992) 801–806.

[14] R. Krishnan, A systematic method for decompositional rule extraction from neural networks, Proceedings of the NIPS '97 Rule Extraction From Trained Artificial Neural Networks Workshop, Queensland University of Technology, 1996, 38–45.

[15] F. Maire, A partial order for the m-of-n rule extraction algorithm, IEEE Trans. Neural Networks 8 (6) (1997) 1542–1544.

[16] K. Saito, R. Nakano, Law discovery using neural networks, Proceedings of the NIPS '96 Rule Extraction From Trained Artificial Neural Networks Workshop, Queensland University of Technology, 1996, 62–69.

[17] S. Thrun, Extracting provably correct rules from artificial neural networks, Technical Report IAI-TR-93-5, Institut für Informatik III, Universität Bonn, Germany, 1994.


[18] A.B. Tickle, R. Andrews, M. Golea, J. Diederich, The truth will come to light: directions and challenges in extracting the knowledge embedded within trained artificial neural networks, IEEE Trans. Neural Networks 9 (6) (1998) 1057–1068.

[19] G. Towell, J. Shavlik, The extraction of refined rules from knowledge based neural networks, Mach. Learning 13 (1) (1993) 71–101.

[20] V. Tresp, J. Hollatz, S. Ahmad, Network structuring and training using rule-based knowledge, Adv. Neural Inform. Process. Systems 6 (1993) 871–878.

Robert Andrews received his B.Ed. and M.I.T. degrees from Queensland University of Technology in 1985 and 1995, respectively, and is currently in the final stages of his Ph.D. He is presently a Lecturer in the School of Information Systems at QUT. He has had practical experience in research projects involving the application of rule extraction techniques to herd improvement in the dairy cattle industry and the analysis of financial data. His interests include neurocomputing, pattern recognition, data mining and theory refinement.

Shlomo Geva received his B.Sc. degree in Chemistry and Physics from the Hebrew University in 1982. He received his M.Sc. and Ph.D. degrees from the Queensland University of Technology in 1988 and 1992, respectively. He is currently a Senior Lecturer in the School of Computing Science at QUT. His interests include neurocomputing, data mining, pattern recognition, speech and image processing.