Identification of critical genes in microarray experiments by a Neuro-Fuzzy approach

Computational Biology and Chemistry 30 (2006) 372–381

Chin-Fu Chen a,∗,1, Xin Feng b,1, Jack Szeto b

a Department of Genetics and Biochemistry, 100 Jordan Hall, Clemson University, Clemson, South Carolina 29634, USA
b Department of Electrical and Computer Engineering, Marquette University, Milwaukee, Wisconsin 53233, USA

Received 27 April 2006; accepted 8 August 2006

∗ Corresponding author. Tel.: +1 864 656 0748; fax: +1 864 656 4293. E-mail addresses: [email protected] (C.-F. Chen), [email protected] (X. Feng).
1 These authors contributed equally to this work.

1476-9271/$ – see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2006.08.004

Abstract

Gene expression profiling by microarray technology is usually difficult to interpret into a simpler pattern. One approach to resolve the complexity of gene expression profiles is the application of artificial neural networks (ANNs). A potential difficulty in this strategy, however, is that the non-linear nature of ANN makes it essentially a ‘black-box’ computation process. Addition of a fuzzy logic approach is useful because it can complement ANN by explicitly specifying membership function during computation. We employed a hybrid approach of neural network and fuzzy logic to further analyze a published microarray study of gene responses to eight bacteria in human macrophages. The original analysis by hierarchical clustering found common gene responses to all bacteria but did not address individual responses. Our method allowed exploration of the gene response of the host to individual bacterium. We implemented a two-layer, feed-forward neural network containing the principle of ‘competitive learning’ (i.e. ‘winner-take-all’). The weights of the trained neural network were fed into a fuzzy logic inference system. A new measurement, called the impact rating (IR), was also introduced to explore the degree of importance of each gene. To assess the reliability of the IR value, a bootstrap re-sampling method was applied to the dataset and a confidence level for each IR was obtained. Our approach has successfully uncovered the unique features of host response to individual bacterium. Further, application of gene ontology (GO) annotation to the genes of high IR values in each response has suggested new biological pathways for individual host–pathogen interactions.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Gene expression; Microarray; Gene ontology; Artificial networks; Fuzzy logic

1. Introduction

A gene expression microarray provides information about the expression of tens of thousands of genes in a single experiment and has become a major tool for the study of gene expression in response to various stimuli (Brown and Botstein, 1999; Slonim, 2002; Morley et al., 2004; Ernst et al., 2005). For example, microarray technology has enabled researchers to gain new insights into host defense in response to infection (Bryant et al., 2004; van der Pouw Kraan et al., 2004). The complexity of information, however, also makes interpretation of results extremely challenging.

The most commonly used method of microarray analysis is hierarchical clustering. In this technique, relationships among genes are represented by a tree-‘dendrogram’ whose branch lengths reflect the degree of similarity in gene expression (Eisen et al., 1998). Cluster analysis can usually suggest a common pattern shared by a sub-group. For example, it can be used to operationally define a ‘core’ immune gene response to various pathogens (Jenner and Young, 2005). A weakness of cluster analysis is that the characteristics of an individual response are hidden within a sub-group of the dendrograms. As a result, cluster analysis does not address the critical question of what genes represent a unique response to a given stimulus.

One approach to overcome this shortcoming of hierarchical clustering is to use fuzzy logic to evaluate the individual contribution of each member in a cluster. With this technique, useful and important members within a cluster become explicit (‘transparent’). In fuzzy logic, a set of parallel rules is created following generation of a profile for each variable (Lin and Lee, 1996). Each rule is in the form “IF Input 1 is X1 AND Input 2 is X2, THEN output 1 is Y1...,” where X and Y are qualitative fuzzy labels representing the strength or certainty of the


particular input or output. The parallel rules are then merged using fuzzy reasoning to produce a set of quantitative outputs (a process called ‘defuzzification’).

Fig. 1. Work flowchart of the ANN-Fuzzy algorithm.

The primary use of fuzzy logic is to resolve the linguistic uncertainty in the decision-making process. A weakness is that it does not tackle numerical inter-relationships such as the level of gene expression in microarrays. Further, fuzzy logic does not address the issue of background noise, such as experimental and sampling errors, which may be contained within the microarray data. In contrast, artificial neural networks (ANN) are able to resist noise in the data, and thus have been useful for analysis and classification of microarray data clustering (Khan et al., 2001; O’Neill and Song, 2003). The ‘damping’ effect on noise by ANN can be attributed both to the linear combination of weight function for the inputs and to the non-linear logistic sigmoid function for the outputs (Gurney, 1997).

We propose that microarray analysis can benefit from a combination of artificial neural networks and fuzzy logic, called the Neuro-Fuzzy model (Lin and Lee, 1996; Pal and Mitra, 1999; Kapetanovic et al., 2004). The fuzzy systems deal with explicit linguistic uncertainty and knowledge (understood and explainable); the artificial neural networks process implicit numerical data (machine learning). Thus, by combining quantitative measurement and qualitative (i.e. linguistic) concepts, the Neuro-Fuzzy model (NFM) utilizes the strengths of the two systems. Similar NFM approaches have been previously applied to protein motif extraction (Chang and Halgamuge, 2002), predicting bladder cancer outcome (Catto et al., 2003), and classification of cancer tissue (Futschik et al., 2003). Our NFM approach (called ANN-Fuzzy hereafter) was initially developed by Feng and co-workers (Clinkenbeard and Feng, 1992) and contains a two-step procedure. Our overall approach is depicted as a flowchart in Fig. 1. At the first step, a two-layer, feed-forward neural network is used to map (transform) gene expression values into pattern vectors (Fig. 2). At the second step, a fuzzy logic inference is applied to evaluate the weights of the trained neural networks (Fig. 3). In addition, a new measurement called the impact rating (IR) is used to explore the importance of each gene in the placement of a gene expression vector into a specific ‘host bacterial response’ cluster (see Section 2 for details).

Furthermore, we studied the relationship between the sensitivity of the IR and its dependence on the structure of the data. The dependency of the method on its data structure is sometimes referred to as the problem of ‘over-fitting’ or ‘over-training’ (Dayhoff and DeLeo, 2001; Hastie et al., 2001; Kapetanovic et al., 2004). In this situation, the parameters of the method are adjusted to optimally fit the original data in a way that may not apply to the larger class of data. Over-training in an ANN-based approach can be evaluated by construction of a confidence or prediction interval using the bootstrap (Dayhoff and DeLeo, 2001). The bootstrap (sampling with replacement) is a computer-based method for assigning measures of accuracy to statistical estimates (Efron and Tibshirani, 1998) (see Section 2 for details).

Fig. 2. Architecture of a competitive learning ANN.

We applied our approach to a published microarray study of mRNA changes in human macrophages upon exposure to different bacteria (Nau et al., 2002). The original data analysis was conducted using hierarchical clustering. To gain insights into what biological processes may contribute to the host responses to bacteria, we utilized gene ontology (GO) annotation (Ashburner et al., 2000; Harris et al., 2004) (see Section 2 for details).

2. Material and methods

2.1. The microarray data set for macrophage response to bacteria

The microarray data set was obtained from Nau et al. (2002). These authors studied the transcriptional responses of macrophages to eight different bacteria at four to five different time points, using 6800 genes on the Affymetrix microarray. The eight bacteria consisted of three classes with different cellular components and pathogenesis: Gram-negative bacteria [E. coli, enterohemorrhagic E. coli O157:H7 (EHEC), Salmonella typhi, and S. typhimurium], Gram-positive bacteria (S. aureus and L. monocytogenes), and mycobacteria (M. tuberculosis and M. bovis bacillus Calmette–Guérin). We opted to study in depth the 977 genes that Nau et al. selected as the most important genes for the host–pathogen interaction (Nau et al., 2002). The gene expression values were represented as log2-ratios (experiment/control) and ranged in general from negative four to positive six.

Fig. 3. Three fuzzy membership functions of weight spreads.

2.2. Overview of ANN-Fuzzy algorithm

We consider the eight ‘host bacterial responses’ as eight pattern vectors, each composed of 977 gene components. In our studies, the ‘host bacterial response’ vectors are considered as inputs to the two-layer ANN with 977 input neurons and eight output neurons. We then trained the neural network with the unsupervised competitive learning method (Lin and Lee, 1996; Gurney, 1997; Han and Kamber, 2001) such that the contrast between the clusters is enhanced. After training, the input vectors are clustered into eight ‘host bacterial response’ categories. The trained weights of the ANN represent the mean values of the ‘host bacterial response’ patterns. At the second step, i.e. the fuzzy inference step, each of the input values of the weight vector is evaluated for its ‘importance’ by the magnitude and the deviation from the median, and is formulated through the calculation of impact rating (IR). All gene responses are then evaluated by calculating the IR according to their contribution, either positive or negative, to the membership in a specific ‘host bacterial response’ cluster. This serves as the definition for that cluster. The entire algorithm is shown in a step-by-step procedure in Fig. 1.

2.3. Architecture of artificial neural networks

Let m be the number of gene expressions, n the number of ‘host bacterial response’ clusters. The input layer of the ANN will then consist of m processing elements (neurons), which buffer the m-dimensional gene vector $x = [x_1, x_2, \ldots, x_m]^T$. The output layer of the ANN, also called the Kohonen layer, has n binary neurons that are fully connected to the neurons in the input layer (Fig. 2). The connecting weights can be represented by the n × m weight matrix:

\[
W = \begin{pmatrix}
w_{11} & w_{21} & \cdots & w_{n1} \\
w_{12} & w_{22} & \cdots & w_{n2} \\
\vdots & \vdots & \cdots & \vdots \\
w_{1m} & w_{2m} & \cdots & w_{nm}
\end{pmatrix}
\]

where the entry of W is denoted by $w_{ji}$, and where j = 1, ..., n, i = 1, ..., m. Furthermore, W can be represented as either the array of m column vectors representing the weights connected to the m output neurons in the Kohonen layer:

\[
W(i) = [\, w(1),\ w(2),\ \ldots,\ w(i),\ \ldots,\ w(m) \,],
\]

or the array of n row vectors representing the weights connected to n input neurons:

\[
W_j = \begin{pmatrix} w(1) \\ \vdots \\ w(j) \\ \vdots \\ w(n) \end{pmatrix}
\]

For convenience we denote W(i) as the m column weight vectors connecting to the output neurons, and $W_j$ as the n row vectors connecting to the input neurons. The training is achieved by the competitive learning algorithm (Bose and Liang, 1996; Lin and Lee, 1996; Gurney, 1997). In competitive learning, the output unit that provides the highest activation to a given input pattern is declared the ‘winner’ and is moved closer to the input pattern, whereas the rest of the neurons are left unchanged. Competitive learning is thus a two-phase process (Bose and Liang, 1996): (1) the first competition phase in which a winner is selected and (2) the second reward phase in which the winner is rewarded with an update of its weights. This strategy is also called winner-take-all since only the winning neuron is updated.

The weight adjusting criterion in the training process is the inner product of the m-dimensional input gene expression vector x and the weight vector w(j), i.e. $y_j = w(j)^T x$, which represents the degree of “closeness” between w(j) and x. Once x is fed into the ANN, a competition takes place for all $y_j$, j = 1, ..., n, to evaluate which weight vector is the closest to the input vector x. In the training process, the weight vector of the winning neuron in the Kohonen layer is adjusted according to the competitive learning law, i.e. the weight vector w(j) connected to the winning neuron will be updated by

\[
w(j)_{new} = w(j)_{old} + \alpha \,\bigl(x - w(j)_{old}\bigr)
\]

where α is the learning rate, 0 < α < 1, while other weight vectors remain unchanged. This algorithm moves the weight vector from the old weight vector toward the most recent input vector x. With continuing training, the weight vectors $W_j$ connecting to the n output neurons eventually converge to the centers of the n clusters. Thus, in a well-trained ANN, the weight vectors $W_j$, j = 1, ..., n, associated with each cluster will point to the center (or median) of the input vectors for all samples placed in that cluster. This is critical for interpreting the meaning of those clusters.
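The competition and reward phases described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Matlab code: the function names, the list-of-lists weight layout, and the fixed per-epoch presentation order are assumptions made for the example, and weight initialisation is left to the caller because the text does not specify it.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def competitive_step(weights, x, alpha):
    """One winner-take-all update; mutates `weights`, returns the winner index.

    weights: list of n weight vectors w(j), each of length m (assumed layout)
    x:       input pattern of length m
    alpha:   learning rate, 0 < alpha < 1
    """
    # Competition phase: the neuron with the largest y_j = w(j)^T x wins.
    activations = [inner(w, x) for w in weights]
    winner = max(range(len(weights)), key=activations.__getitem__)
    # Reward phase: w_new = w_old + alpha * (x - w_old), for the winner only.
    weights[winner] = [w + alpha * (xi - w) for w, xi in zip(weights[winner], x)]
    return winner


def train(weights, patterns, alpha=0.1, epochs=500):
    """Present every pattern once per epoch (the presentation order is an assumption)."""
    for _ in range(epochs):
        for x in patterns:
            competitive_step(weights, x, alpha)
    return weights
```

With two toy patterns, `train([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]], alpha=0.5, epochs=50)` moves each weight vector onto the pattern it wins, which mirrors the convergence to cluster centers described in the text.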

2.4. Fuzzy logic inference of ANN weights

At the second step of the algorithm, each of the weight vectors $W_j$, j = 1, ..., n, is fed into a fuzzy inference system which evaluates the ‘importance’ (i.e. impact) of each gene expression vector by the new parameter, the impact rating IR(j, i), where j = 1, ..., n and i = 1, ..., m. Specifically, the inputs to the fuzzy inference system are the weight values associated with a given gene for all ‘host bacterial response’ clusters. For instance, in the case of evaluating the impacts of 977 genes to 8 different ‘host bacterial responses’, n = 8 and m = 977, and the impacts of gene i to bacterium j, IR(j, i), j = 1, ..., 8, i = 1, ..., 977, will be determined.

We apply three fuzzy sets (i.e. weight magnitude, weight spread and impact) to model each weight value so as to infer the relative impact of that gene for different ‘host bacterial response’ clusters (Fig. 3). All weights must be normalized within [−1, 1].

2.4.1. Weight magnitude

Genes for which at least one of the ‘host bacterial response’ clusters has a weight with a large absolute value have the best potential to affect cluster memberships. The weight value associated with gene i, i = 1, ..., m, is

\[
a_i = \sum_{j=1}^{n} |w_{j,i}|
\]

Since all weight values are normalized, we define the normalized crisp value as:

\[
a_N(i) = \frac{a_i}{a_{max}}
\]

where $a_{max} = \max\{a_1, \ldots, a_m\}$, and $a_N(i)$ is within [0, 1]. Thus, the membership function for the large weight magnitude fuzzy set, $\mu_{Large}(a_N)$, can be defined as

\[
\mu_{Large}(a_N) = \begin{cases} a_N, & 0 \le a_N \le 1 \\ 0, & \text{otherwise} \end{cases}
\]
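As a small worked illustration of the weight magnitude step, the following Python sketch computes the membership value for every gene from an n × m weight table. The function name and the nested-list layout (`weights[j][i]` for cluster j, gene i) are assumptions for the example, as is the handling of the all-zero case, which the text does not discuss.

```python
def weight_magnitude_membership(weights):
    """Return mu_Large(a_N(i)) for every gene i.

    weights: n x m nested list with weights[j][i] = w_{j,i}, assumed to be
             already normalised to [-1, 1] as the text requires.
    """
    n_genes = len(weights[0])
    # a_i = sum over clusters j of |w_{j,i}|
    a = [sum(abs(row[i]) for row in weights) for i in range(n_genes)]
    a_max = max(a)
    if a_max == 0:
        # Degenerate case (all weights zero); not covered by the text.
        return [0.0] * n_genes
    # a_N(i) = a_i / a_max lies in [0, 1], where mu_Large(a_N) = a_N.
    return [a_i / a_max for a_i in a]
```

For example, two clusters and two genes with weights `[[0.5, -0.1], [-0.5, 0.1]]` give a = [1.0, 0.2], so the first gene has full membership in the large-magnitude set and the second only 0.2.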

2.4.2. Weight spread

Genes with the weights that diverge significantly among ‘host bacterial response’ clusters also have the best potential to affect cluster memberships. The following fuzzy rules define weight spread fuzzy sets:

(1) If all n components in the n × 1 vector w(i) are about equal, then gene i makes a similar contribution to all n clusters, and therefore does little to differentiate the clusters.

(2) If |w(j, i)| ≫ |w(l, i)| for all i = 1, ..., m and j ≠ l, then gene i is more likely to be placed into cluster j than into the other clusters.

(3) If |w(j, i)| ≪ |w(l, i)| for all i = 1, ..., m and j ≠ l, then gene i is less likely to be placed into cluster j than into the other clusters.

The difference between w(j, i) and the other weights associated with gene i is denoted by $b_{ji}$, calculated independently for each weight as:

\[
b_{ji} = w(j, i) - \mathrm{median}(w(i))
\]

After comparing all values of $b_{ji}$ for j = 1, ..., n and i = 1, ..., m, we denote the largest absolute value by $b_{max}$, which is used to normalize the input into [−1.0, 1.0]:

\[
b_N(j, i) = \frac{b_{ji}}{b_{max}}
\]

The membership functions of three fuzzy sets can be greater than median, less than median, and equal to median. They are denoted by $\mu_{GT}(b_N)$, $\mu_{LT}(b_N)$, and $\mu_{EQ}(b_N)$, respectively:

\[
\mu_{GT}(b_N) = \begin{cases} b_N, & 0 \le b_N \le 1 \\ 0, & \text{otherwise} \end{cases}
\]

\[
\mu_{LT}(b_N) = \begin{cases} -b_N, & -1 \le b_N \le 0 \\ 0, & \text{otherwise} \end{cases}
\]

\[
\mu_{EQ}(b_N) = \begin{cases} 2 b_N + 1, & -0.5 \le b_N \le 0 \\ 1 - 2 b_N, & 0 \le b_N \le 0.5 \\ 0, & \text{otherwise} \end{cases}
\]
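The spread normalisation and the three membership functions translate directly into code. The sketch below is illustrative only: the function names and the nested-list weight layout (`weights[j][i]`) are assumptions, and the guard for the case where every deviation is zero is an addition not discussed in the text.

```python
from statistics import median


def normalised_spread(weights):
    """b_N(j, i) = (w(j, i) - median over clusters of w(., i)) / b_max."""
    n, m = len(weights), len(weights[0])
    med = [median(weights[j][i] for j in range(n)) for i in range(m)]
    b = [[weights[j][i] - med[i] for i in range(m)] for j in range(n)]
    b_max = max(abs(v) for row in b for v in row)
    if b_max == 0:
        return b  # every weight equals its gene's median (case not in the text)
    return [[v / b_max for v in row] for row in b]


def mu_gt(b_n):
    """'Greater than median' membership."""
    return b_n if 0.0 <= b_n <= 1.0 else 0.0


def mu_lt(b_n):
    """'Less than median' membership."""
    return -b_n if -1.0 <= b_n < 0.0 else 0.0


def mu_eq(b_n):
    """'Equal to median' membership: a triangle peaking at b_n = 0."""
    if -0.5 <= b_n <= 0.0:
        return 2.0 * b_n + 1.0
    if 0.0 < b_n <= 0.5:
        return 1.0 - 2.0 * b_n
    return 0.0
```

For a single gene with weights 1.0, 0.0 and −1.0 in three clusters, the median is 0.0 and the normalised spreads are 1.0, 0.0 and −1.0, which map to full membership in the greater-than, equal-to and less-than sets, respectively.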

2.4.3. Impact rating

Our algorithm utilizes a new measurement, called impact rating, to explore the degree of importance of each gene in distinguishing different experimental conditions (clusters). The selection of important genes using the ANN-Fuzzy approach effectively leads to a great reduction of the number of gene expressions during data analysis. In other words, only those genes with significant positive IR values will be selected for further study.

Our rationale for using the IR is quite intuitive: not every individual member contributes equally to the clustering. The impact rating represents the degree or likelihood to which a specific gene is to be categorized into a certain pattern of clusters. For each gene, the IR value may be positive, negative, or negligible. A positive IR value of a specific gene suggests that the gene contributes positively in placing the gene expression vector into a cluster. Conversely, a negative IR value of a specific gene indicates that the gene contributes negatively in placing the gene expression vector into a cluster. Genes with negligible impact are generally discarded, or “tossed,” because they contribute little in placing the gene expression vector into any of the clusters. Our algorithm produces a spectrum of IRs ranging from highly negative to highly positive. The negative IRs can be useful in discriminating one cluster from another, but our current interest is in finding the unique features of responses to each bacterium. Thus, we only focus on selecting the highly positive IR values as a method for ranking the importance of genes.

The following rules infer the impact of a gene from the weight spread and weight magnitude membership functions:

The inference is implemented by the common min–max operation found in fuzzy logic references (Lin and Lee, 1996):

\[
\mu_{Pos} = \min(\mu_{GT}, \mu_{Large})
\]

\[
\mu_{Neg} = \min(\mu_{LT}, \mu_{Large})
\]

\[
\mu_{Toss} = \max(\mu_{EQ}, 1 - \mu_{Large})
\]

The corresponding fuzzy membership functions $\mu_{Pos}$, $\mu_{Neg}$, and $\mu_{Toss}$ are defined as:

\[
\mu_{Pos}(\mathrm{impact}) = \begin{cases} \mathrm{impact}, & 0 \le \mathrm{impact} \le 1 \\ 0, & \text{otherwise} \end{cases}
\]

\[
\mu_{Neg}(\mathrm{impact}) = \begin{cases} -\mathrm{impact}, & -1 \le \mathrm{impact} \le 0 \\ 0, & \text{otherwise} \end{cases}
\]

\[
\mu_{Toss}(\mathrm{impact}) = \begin{cases} 2 \cdot \mathrm{impact} + 1, & -0.5 \le \mathrm{impact} \le 0 \\ 1 - 2 \cdot \mathrm{impact}, & 0 \le \mathrm{impact} \le 0.5 \\ 0, & \text{otherwise} \end{cases}
\]

The method of centroid defuzzification (Lin and Lee, 1996) is applied to determine the value of IR(j, i):

\[
IR(j, i) = \frac{\mu_{Pos}(j, i)\, d_{Pos}(j, i) + \mu_{Neg}(j, i)\, d_{Neg}(j, i) + \mu_{Toss}(j, i)\, d_{Toss}(j, i)}{\mu_{Pos}(j, i) + \mu_{Neg}(j, i) + \mu_{Toss}(j, i)}
\]

where i is the gene index, j the host bacterial response cluster index, and $d_{Pos}(j, i)$, $d_{Neg}(j, i)$, and $d_{Toss}(j, i)$ are values of the corresponding fuzzy sets. Note that IR values range from −0.67 to 0.67 after defuzzification, due to the location of centers of constituent triangle membership functions (Lin and Lee, 1996).
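To make the min–max inference and the centroid formula concrete, here is a hedged Python sketch for a single (cluster, gene) pair. The text only describes $d_{Pos}$, $d_{Neg}$ and $d_{Toss}$ as "values of the corresponding fuzzy sets"; the sketch assumes they are the centroids of the three output triangles (+2/3, −2/3 and 0), which is consistent with the stated IR range of −0.67 to 0.67 but is not confirmed by the source. The function name and the zero-denominator guard are likewise illustrative.

```python
# Assumed crisp centres of the output sets: the centroids of the Pos, Neg and
# Toss triangles. These three numbers are an illustrative assumption.
D_POS, D_NEG, D_TOSS = 2.0 / 3.0, -2.0 / 3.0, 0.0


def impact_rating(a_n, b_n):
    """IR for one (cluster, gene) pair.

    a_n: normalised weight magnitude a_N(i), expected in [0, 1]
    b_n: normalised weight spread b_N(j, i), expected in [-1, 1]
    """
    # Input memberships (Sections 2.4.1 and 2.4.2).
    mu_large = a_n if 0.0 <= a_n <= 1.0 else 0.0
    mu_gt = b_n if 0.0 <= b_n <= 1.0 else 0.0
    mu_lt = -b_n if -1.0 <= b_n < 0.0 else 0.0
    mu_eq = max(0.0, 1.0 - 2.0 * abs(b_n))  # triangle on [-0.5, 0.5]
    # Min-max inference.
    mu_pos = min(mu_gt, mu_large)
    mu_neg = min(mu_lt, mu_large)
    mu_toss = max(mu_eq, 1.0 - mu_large)
    total = mu_pos + mu_neg + mu_toss
    if total == 0.0:
        return 0.0  # only reachable for out-of-range inputs
    # Weighted average of the assumed centres.
    return (mu_pos * D_POS + mu_neg * D_NEG + mu_toss * D_TOSS) / total
```

Under these assumptions, a gene with maximal magnitude and maximal positive spread (`impact_rating(1.0, 1.0)`) scores 2/3, a gene sitting at the median (`impact_rating(1.0, 0.0)`) scores 0, and a gene with zero magnitude is always tossed.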

2.5. Bootstrapping of Impact ratings

Previous studies have shown that for the ANN approach, the best assessment of parameters is through the confidence level (Dayhoff and DeLeo, 2001). Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original data. The aim is to obtain robust estimates of standard errors and confidence intervals of a parameter (Efron and Tibshirani, 1998) and to provide assessment of parameter uncertainty (Davidson and Hinkley, 1997). Our goal is thus to evaluate the original IR values by calculating a confidence or prediction interval of IRs from the bootstrapping. By doing so, we can measure how much the original computed IR varies from the mean of the bootstrapped IR population. A narrow confidence level (range) for the IR suggests that the original IR is reliable and also implies that a computed IR is not solely dependent on a particular data structure (is not ‘over-fitted’).

As depicted in Fig. 4, we randomly re-sampled the gene list with replacement for each individual experiment and recalculated the IR in each re-sampling process. That is, each bootstrap will produce 977 genes with replication (by chance) across the gene ‘row’ for all bacterial responses, i.e. a new 977 × 8 matrix. Each resulting new matrix was fed into our two-step ANN-Fuzzy procedure and an IR was calculated. Each new IR represents a possible variation due to the change of gene expression in the dataset. All the IRs calculated from re-sampled data thus can be constructed as a possible distribution of the original IR. A Perl computer script was written for carrying out the procedures. The above sampling procedure was repeated 1000 times and the confidence level was obtained from the distribution of the 1000 bootstrapped IRs.
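The re-sampling loop can be sketched as follows. This is not the authors' Perl script: `statistic` is a hypothetical stand-in for the full two-step ANN-Fuzzy pipeline, the function names are invented for the example, and the 95% limits use the normal quantile 1.96 as an approximation to the t quantile for about 1000 replicates.

```python
import random
import statistics


def bootstrap_rows(matrix, rng):
    """Resample gene rows with replacement, keeping the original row count."""
    return [list(rng.choice(matrix)) for _ in range(len(matrix))]


def bootstrap_interval(matrix, statistic, n_boot=1000, seed=0, z=1.96):
    """Mean and approximate 95% limits of the mean of `statistic` over replicates.

    statistic: callable taking a resampled matrix and returning one number
               (a hypothetical placeholder for the IR of one gene and cluster).
    z:         1.96 is a normal approximation to the t quantile (assumption).
    """
    rng = random.Random(seed)
    values = [statistic(bootstrap_rows(matrix, rng)) for _ in range(n_boot)]
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5
    return mean, mean - z * sem, mean + z * sem
```

A call such as `bootstrap_interval(expression_matrix, my_ir_function)` would return a triple analogous to the Mean IR, lower-limit and upper-limit columns reported for each gene; both argument names here are placeholders.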

2.6. Gene annotation by Gene ontology

The gene ontology database (Ashburner et al., 2000; Harris et al., 2004) is a large collaborative public database of controlled biological vocabularies (i.e. ontologies) that describe gene products based on their functionalities in the cell. Three ontologies are defined in this database: (i) biological processes, (ii) cellular components, and (iii) molecular functions. Gene ontology terms are connected into nodes of a network, such that the connections between parents and children form a hierarchical data structure (Ashburner et al., 2000). The layers of the GO tree, often referred to as the ‘levels’ of the GO terms, are listed as 1–6, with higher numbers indicating more specific ‘deeper’ terms. Deeper terms in the GO hierarchy are more precise and the number of genes with annotations decreases at deeper GO levels (Al-Shahrour et al., 2004).
Fig. 4. Procedure for bootstrapping (re-sampling with replacement) the microarray gene list for calculation of the impact rating (IR). The top table is the original microarray gene expression data and the bottom table is the example of one run of bootstrap. Note that the same gene can appear more than once and some genes may not be selected during the sampling process.

To interpret the annotation of genes by GO terms, it is helpful to ask whether a collection of GO terms is over- or under-represented by certain GO categories, which may suggest an enrichment of certain underlying biological pathways. This can be addressed by determining whether a collection of GO terms for a gene list is statistically different from a random sampling by chance (Tavazoie et al., 1999; Doniger et al., 2003; Al-Shahrour et al., 2004). In general, over- or under-representation of GO terms can be evaluated through a Fisher Exact Probability Test. We utilized the on-line program FatiGO (Al-Shahrour et al., 2004) to produce: (1) percentage and number of genes appearing in a GO category; (2) p-values from the Fisher’s Exact Test for each GO term associated with the gene. To determine whether GO terms are over- or under-represented, we compared the annotations of GO terms for the gene list of interest (e.g. genes with high IR values) to those of the 6800 Affymetrix genes used in the original study by Nau et al. (2002).

2.7. Software implementation

We implemented the new ANN-Fuzzy algorithm using the Matlab environment. There are two major computations: the training of the ANN and the calculation of IRs, i.e., the implementation of the fuzzy inference system. Since there are only three fuzzy rules in the fuzzy rule base, the calculation of the IR is almost instantaneous, so the main computing time of the algorithm is determined by the training of the ANN. In our experiments, all the trainings are completed within 500 epochs, which takes about 60 s using the Matlab package on a 2.4 GHz PC. The package was developed at the Advanced Computing Technology (ACT) Laboratory at Marquette University and will be available upon request.
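For readers who want to reproduce the over- or under-representation test of Section 2.6 outside FatiGO, a two-sided Fisher exact test on a 2 × 2 table can be written with the standard library alone. The sketch below is a generic implementation, not FatiGO's code; the table layout in the docstring is an assumption about how a gene list would be compared with the reference set.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]].

    Illustrative layout (an assumption): a = genes in the list annotated with
    the GO term, b = genes in the list without it, and c, d = the same two
    counts for the reference gene set.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def ways(k):
        # Number of tables with k annotated genes in row 1, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = ways(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one;
    # integer counts avoid floating-point ties.
    extreme = sum(ways(k) for k in range(lo, hi + 1) if ways(k) <= observed)
    return extreme / total
```

A small p-value for a GO term would indicate that its frequency in the gene list differs from what random sampling of the reference set would produce; correction for testing many terms is outside the scope of this sketch.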

3. Results

For each time point, the size of the ANN is 977 × 8: i.e. there are 977 neurons in the input layer, representing the 977 genes and 8 neurons in the output layer. This represents the number of ‘host bacterial responses’. The ANNs were fully trained after 500 training epochs using the competitive learning method. The results separated all eight ‘host bacterial response’ clusters correctly in accordance with their ‘host bacterial response’ types at all gene expression data points. No misclassifications occurred. We will discuss representative results for two groups of gene expression data, the 1-h and 6-h responses. Complete results are available at the author’s web site http://www.chenlab.clemson.edu/data.

Page 7: Identification of critical genes in microarray experiments by a Neuro-Fuzzy approach

378 C.-F. Chen et al. / Computational Biology and Chemistry 30 (2006) 372–381

F1

3

ccgf

eoiopFgi(aut5n

a

F1

Table 1Host response at 1-h to E. coli

Gene name IR Mean IRa L1 IRb L2 IRc

SKIV2L 0.5739 0.5734 0.5719 0.5759PTGS2 0.5390 0.5389 0.5374 0.5406COVA1 0.4817 0.4812 0.4794 0.4841GRO2 0.4793 0.4794 0.4782 0.4804HG3141 0.4720 0.4720 0.4692 0.4749HSPA6 0.4690 0.4689 0.4677 0.4703D17357 0.4622 0.4620 0.4600 0.4645EPS15 0.4621 0.4618 0.4609 0.4633ETS2 0.4527 0.4521 0.4503 0.4550ASNS 0.4497 0.4497 0.4484 0.4509

The 10 genes with the highest impact ratings (IR) are shown with means andconfidence limits (L) of IRs obtained from bootstrap permutation.

a

mtv(rp

3

iagitfiq

ig. 5. Sorted impact rating (IR) values of 977 host gene responses to E. coli at-h.

.1. The IR distribution

The key output of our algorithm is the IR, which numeri-ally indicates the importance of each individual gene within theluster. The interpretation of an IR value is straightforward. Theenes with the greatest positive IR rating are the most importantor that specific ‘host bacterial response’.

Sorting genes based on their IR values produces a profile for each host bacterial response cluster. Judging by the steep slope of the tail region, the distribution of the IRs for each cluster is not uniform. Importantly, the slope of the curve increases or decreases steeply around the tails, suggesting that a small percentage of gene responses are more important than others. For example, for the 1-h response to E. coli, only about a dozen genes are critical (with IR > 0.4) in determining the membership in clusters (Fig. 5). Similar results were found for M. tuberculosis (Fig. 6). A general trend is that the shape of the IR profile is always highly curved at the tail regions and that each curved tail usually contains fewer than 30 genes (data not shown). Based on the common shape in the tail region, we reasoned that the top 50 genes of high positive IRs contain most of the information needed for the cluster.
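The ranking step described above can be illustrated with a minimal Python sketch. It assumes the IR values are already available as a gene-to-value mapping; the function name `top_genes_by_ir` and its parameters are hypothetical. The first four IR values are copied from Table 1, while `GENE_X` and `GENE_Y` are made-up placeholders added only to show the cutoff at work.

```python
from typing import Dict, List, Tuple


def top_genes_by_ir(
    ir: Dict[str, float], cutoff: float = 0.4, max_genes: int = 50
) -> List[Tuple[str, float]]:
    """Return (gene, IR) pairs with IR > cutoff, sorted from highest to lowest IR."""
    ranked = sorted(ir.items(), key=lambda item: item[1], reverse=True)
    # Keep only the steep upper tail, capped at max_genes entries.
    return [(gene, value) for gene, value in ranked if value > cutoff][:max_genes]


# IR values for the four real gene names come from Table 1;
# GENE_X and GENE_Y are hypothetical placeholders, not data from the study.
example_ir = {
    "GRO2": 0.4793,
    "GENE_X": 0.12,
    "SKIV2L": 0.5739,
    "GENE_Y": -0.30,
    "COVA1": 0.4817,
    "PTGS2": 0.5390,
}
selected = top_genes_by_ir(example_ir)
```

The cutoff of 0.4 mirrors the threshold quoted for the 1-h response to E. coli, and the cap of 50 mirrors the size of the gene lists used for annotation; both are arguments so that other clusters can use different values.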

Fig. 6. Sorted impact rating (IR) values of 977 host gene responses to EHEC at 1-h.

Footnotes to Table 1:
a Mean IR: the arithmetic mean of 1000 IRs from bootstrapping.
b L1 IR: 95% t-distribution lower limit of Mean IR.
c L2 IR: 95% t-distribution upper limit of Mean IR.

We then calculated IRs from each bootstrapped re-sampling and used them to generate a probability distribution of IRs. The means of the IRs from the bootstrapping were extremely close to the original IRs. Further, the confidence interval of each IR was very narrow, suggesting that the estimates of IRs are reliable (see Table 1 for the 10 genes with the highest IRs for the 1-h host response to E. coli, and Table 2 for the 1-h host response to EHEC, pathogenic E. coli O157:H7).
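One way to read the bootstrap summary (a mean over 1000 replicates with 95% limits on that mean) is sketched below in Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `bootstrap_mean_ci` is hypothetical, the statistic passed in is a stand-in (the real statistic would be the IR of one gene, which requires the trained network), and the normal quantile is used as a large-sample approximation to the t quantile named in the table footnotes.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def bootstrap_mean_ci(
    data: Sequence[float],
    statistic: Callable[[Sequence[float]], float],
    n_boot: int = 1000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) of a statistic over bootstrap resamples.

    The limits form a 95% interval for the mean of the replicates,
    mean +/- q * s / sqrt(n_boot), where q is the normal quantile used
    here as an approximation to the t quantile for large n_boot.
    """
    rng = random.Random(seed)
    replicates = [
        statistic(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_boot)
    ]
    mean = statistics.mean(replicates)
    std_err = statistics.stdev(replicates) / n_boot ** 0.5
    q = statistics.NormalDist().inv_cdf(0.975)
    return mean, mean - q * std_err, mean + q * std_err


# Hypothetical stand-in data and statistic, for illustration only.
expression = [0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 0.7, 1.3]
mean_stat, lower, upper = bootstrap_mean_ci(expression, statistics.mean)
```

Because the interval describes the mean of many replicates, its width shrinks with the square root of the number of resamples, which is consistent with the very narrow limits reported in Tables 1 and 2.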

3.2. Macrophage gene response to bacteria at 1 h

When we examined the 50 genes with the highest impact rating for host response to E. coli, 38 of the 50 genes had GO annotations. For the 1-h time point, the list of human macrophage genes responding to E. coli is greatly enriched for genes involved in ‘response to pest, pathogen or parasite’ (25%) and ‘response to external biotic stimulus’ (25%) (Table 3). The response profile for the pathogenic strain EHEC (E. coli O157:H7) at 1 h is quite different. There is generally no significant enrichment for specific biologic processes, although a small number of genes are significantly involved in ‘melanocyte differentiation’ (3%), and, unexpectedly, ‘mating’ (3%) (Table 4).
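The GO tables report significance from a Fisher exact test. As a rough illustration of that test on a 2x2 table (gene list versus background, annotated with a GO term or not), the following Python sketch computes a two-sided p-value from the hypergeometric distribution. The function name and all counts in the usage line are hypothetical and are not taken from the study; any additional processing that FatiGO applies, such as multiple-testing adjustment, is not reproduced here.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows: gene list vs. background; columns: annotated with the GO term or not.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x annotated genes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-9)
    )


# Hypothetical counts: 10 of 38 list genes vs. 300 of 6000 background genes
# carry the GO term. These numbers are invented for illustration.
p_value = fisher_exact_two_sided(10, 28, 300, 5700)
```

Exact integer binomial coefficients from `math.comb` avoid the rounding problems of factorial-based formulas for a background of several thousand genes.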

Table 2
Host response at 1-h to EHEC (pathogenic E. coli O157:H7)

Gene name   IR       Mean IR^a   L1 IR^b   L2 IR^c
ITPR3       0.6521   0.6519      0.6505    0.6536
IFITM1      0.5682   0.5682      0.5669    0.5694
ENPP2       0.5553   0.5550      0.5538    0.5567
KIAA0316    0.5494   0.5491      0.5478    0.5511
IRF4        0.5302   0.5298      0.5287    0.5317
LOC51035    0.4985   0.4981      0.4968    0.5003
KIF2        0.4957   0.4954      0.4942    0.4971
ANXA3       0.4834   0.4833      0.4816    0.4853
EDN1        0.4761   0.4761      0.4746    0.4776
LAMB3       0.4674   0.4673      0.4660    0.4687

The 10 genes with the highest impact ratings (IR) are shown with means and confidence limits (L) of IRs obtained from bootstrap permutation.
a Mean IR: the arithmetic mean of 1000 IRs from bootstrapping.
b L1 IR: 95% t-distribution lower limit of Mean IR.
c L2 IR: 95% t-distribution upper limit of Mean IR.


Table 3
GO annotation at level 5 for the 50 genes with highest impact rating (IR) for 1 h response to E. coli, obtained from FatiGO software (Al-Shahrour et al., 2004)

a Percentage of genes in a gene list annotated by the specific GO term. Black color denotes the genes from the top 50 genes, grey color denotes the 6800 Affymetrix genes initially used in Nau et al. (2002).
b Determined by a Fisher exact test.

Table 4
GO annotation at level 5 for the 50 genes with highest impact rating (IR) for 1 h response to E. coli, obtained from FatiGO software (Al-Shahrour et al., 2004)

a Percentage of genes in a gene list annotated by the specific GO term. Black color denotes the genes from the top 50 genes, grey color denotes the 6800 Affymetrix genes initially used in Nau et al. (2002).
b Determined by a Fisher exact test.

The gene responses of macrophages to another pair of closely related bacteria, Salmonella typhi and Salmonella typhimurium, are also different. For example, at the first hour, the macrophage genes responding to S. typhi are enriched for ‘cellular catabolism’ (25%), ‘generation of precursor metabolites and energy’ (21%), ‘macromolecule catabolism’ (21%), ‘carbohydrate metabolism’ (18%), and several metabolism-related GO terms (Table 5). In contrast, the genes responding to the closely related S. typhimurium are involved in different metabolic pathways, including ‘cellular macromolecule metabolism’ (43%), ‘macromolecule biosynthesis’ (15%), ‘RNA metabolism’ (12%), ‘vitamin metabolism’ (6%), and ‘cytokinesis’ (6%) (table not shown). These were not observed in the original clustering by Nau et al. (2002).

Table 5
GO annotation at level 5 for the 50 genes with highest impact rating (IR) for 1 h response to S. typhi, obtained from FatiGO software (Al-Shahrour et al., 2004)

a Percentage of genes in a gene list annotated by the specific GO term. Black color denotes the genes from the top 50 genes, grey color denotes the 6800 Affymetrix genes initially used in Nau et al. (2002).
b Determined by a Fisher exact test.

3.3. Macrophage gene response to bacteria at 6 h

For the 6-h time point, the macrophage genes responding to E. coli are somewhat similar to those at the 1-h time point and are also enriched with ‘response to pest, pathogen or parasite’ and ‘response to external biotic stimulus’ genes. Response to E. coli unique to the 6-h time point includes genes for ‘defense response’ (34%), ‘response to pest, pathogen or parasite’ (30%), ‘response to external biotic stimulus’ (30%), ‘negative regulation of cellular physiological process’ (21%), ‘regulation of cell proliferation’ (17%), ‘response to wounding’ (17%), and ‘regulation of programmed cell death’ (13%) (table not shown).

Response to the pathogenic strain EHEC at 6 h is quite different from the 1-h response and is enriched with genes for ‘carbohydrate metabolism’ (16%), ‘alcohol metabolism’ (13%), and ‘positive regulation of cell adhesion’ (3%) (not shown).

Response to the pathogen M. tuberculosis at the first hour includes genes involved in ‘programmed cell death’ (26%), and genes also seen in response to E. coli, i.e. ‘defense response’ (33%), ‘response to pest, pathogen or parasite’ (23%), ‘response to external biotic stimulus’ (23%), and ‘response to wounding’ (20%) (table not shown). In contrast, at 6 h, the processes are less diverse and confined to fewer categories including ‘defense response’ (33%), ‘programmed cell death’ (26%), ‘response to pest, pathogen or parasite’ (23%), and ‘regulation of programmed cell death’ (20%) (not shown).

Fig. 7. Venn diagram of the overlapped gene list. Upper top circle: total unique genes with highest IRs at 1-h (263 genes). Lower right circle: total unique genes with highest IRs at 6-h (256 genes). Lower left circle: core-198, ‘activation program’ of Nau et al. (2002).

3.4. Comparison of genes identified by hierarchical cluster and ANN-Fuzzy

In the original paper of Nau et al. (2002), the authors used hierarchical clustering and identified a set of 198 genes that are considered the core immune responses, or ‘shared activation program’ for all bacteria. This gene list contains receptors, signaling molecules, transcription factors, as well as adhesion molecules, tissue remodeling, enzymes, and anti-apoptotic molecules. A comparison of our gene list at the 1- and 6-h time points with the 198 core gene list shows that less than 27% (53 out of 198) in the first hour time point and 29% (57 out of 198) for the 6-h time point are the same as the ‘core’ set (Fig. 7).
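The overlap figures quoted above are simple set arithmetic. The short Python sketch below reproduces the two percentages from the reported counts and shows how such an overlap could be computed from gene lists. The helper name `overlap_with_core` and the toy gene identifiers (`g1` to `g5`) are hypothetical; only the counts 53, 57 and 198 come from the text.

```python
from typing import Set, Tuple


def overlap_with_core(selected: Set[str], core: Set[str]) -> Tuple[int, float]:
    """Return the number of shared genes and that number as a fraction of the core set."""
    shared = selected & core
    return len(shared), len(shared) / len(core)


# Percentages recomputed from the counts reported in the text.
pct_1h = 100 * 53 / 198  # about 26.8%
pct_6h = 100 * 57 / 198  # about 28.8%

# Toy example with placeholder gene identifiers (not real data).
toy_core = {"g1", "g2", "g3", "g4"}
toy_selected = {"g3", "g4", "g5"}
n_shared, fraction = overlap_with_core(toy_selected, toy_core)
```

Expressing the overlap relative to the 198-gene core, rather than relative to the 263 or 256 high-IR genes, matches the denominators used in the comparison above.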

4. Discussion and conclusions

4.1. Classification by hierarchical clustering versus ANN-Fuzzy

As stated, the most commonly used method for microarray analysis is hierarchical clustering and the main purpose of cluster analysis is to identify the similarity among subsets of data. However, very often researchers are more interested in identification of genes differentially expressed under different experimental conditions. Using our ANN-Fuzzy approach, the characteristics of individual microarray experiments can be analyzed and compared. Furthermore, the impact rating can be used as a guideline to compare different experiments.

Our gene lists (identified by the high IR values) are on average only about 30% identical to the original core ‘shared activation program’ genes identified through the hierarchical cluster method of Nau et al. (2002) (Fig. 7). We believe this is due to the selective nature of our ANN-Fuzzy approach. In hierarchical clustering each gene is treated equally. In our approach, the process of competitive learning selects for the most important genes and final clusters are dominated by a small number of genes with high impact ratings.

4.2. Insights into the human macrophage–bacterium interaction

Our results suggest that the gene responses of human macrophages to bacteria are extremely complex. For example, after 1- and 6-h incubations, M. tuberculosis primarily elicits expression of genes involved in ‘programmed cell death’. However, it also elicits genes involved in ‘defense response’, ‘response to pest, pathogen or parasite’, ‘response to external biotic stimulus’, and ‘response to wounding’. Of note, the induction of ‘programmed cell death’ genes is consistent with previous experimental studies by several groups showing that M. tuberculosis infection involves apoptotic pathways in human macrophages (Fairbairn, 2004; Koul et al., 2004).

Our method can be used to compare host response to closely related bacteria. Sometimes two similar bacteria, in spite of evolutionary or serological resemblance, do not invoke similar responses from the human host. Indeed, our results suggest that closely related species, such as E. coli and EHEC, or S. typhi and S. typhimurium, may be different in their ability to stimulate host gene expression.

4.3. Accuracy and robustness of the ANN-Fuzzy algorithm

We have used the bootstrap method to evaluate the statistical accuracy of the IR. We have observed that in general the means of IRs from the bootstrap method are very close to the original IRs and that the confidence limits are very narrow. These observations hence suggest that the IR for most of the individual genes is not only accurate but also robust against random perturbation. The future challenge is how to expand this approach to more complex data and to apply different training methods for estimation of errors. In this research we only utilize the positive IRs for their contributions to the clustering, and more studies will be required to explore the biological significance of the negative IRs, possibly by studying their roles in excluding one cluster from the other.

Acknowledgements

We are grateful to Dr. Zena Indik at University of Pennsylvania School of Medicine and Dr. J.-J. Chou at the National Taiwan University for critical reading of the manuscript and to Dr. D.E. Stevenson, Dr. Brandon Moore, and Dr. James Morris at Clemson University for valuable suggestions. We thank Chun-Huai Cheng for helping with figures and tables.

References

Al-Shahrour, F., Diaz-Uriarte, R., Dopazo, J., 2004. FatiGO: a web tool for finding significant associations of gene ontology terms with groups of genes. Bioinformatics 20, 578–580.

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., et al., 2000. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat. Genet. 25, 25–29.

Bose, N.K., Liang, P., 1996. Neural Network Fundamentals with Graphs, Algorithms, and Applications: McGraw-Hill Electrical and Computer Engineering Series. McGraw-Hill Inc., New York.

Brown, P.O., Botstein, D., 1999. Exploring the new world of the genome with DNA microarrays. Nat. Genet. 21, 33–37.

Bryant, P.A., Venter, D., Robins-Browne, R., Curtis, N., 2004. Chips with everything: DNA microarrays in infectious diseases. Lancet Infect. Dis. 4, 100–111.

Catto, J.W., Linkens, D.A., Abbod, M.F., Chen, M., Burton, J.L., Feeley, K.M., Hamdy, F.C., 2003. Artificial intelligence in predicting bladder cancer outcome: a comparison of neuro-fuzzy modeling and artificial neural networks. Clin. Cancer Res. 9, 4172–4177.

Chang, B.C., Halgamuge, S.K., 2002. Protein motif extraction with neuro-fuzzy optimization. Bioinformatics 18, 1084–1090.

Clinkenbeard, R.A., Feng, X., 1992. An unsupervised learning and fuzzy logic approach for software category identification and capacity planning. In: Proceedings of 1992 International Joint Conference on Neural Networks (IJCNN), vol. 2, pp. 1223–1228.

Davidson, A.C., Hinkley, D.V., 1997. Bootstrap Methods and their Applications: Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.

Dayhoff, J.E., DeLeo, J.M., 2001. Artificial neural networks: opening the black box. Cancer 91, 1615–1635.

Doniger, S.W., Salomonis, N., Dahlquist, K.D., Vranizan, K., Lawlor, S.C., Conklin, B.R., 2003. MAPPFinder: using gene ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol. 4, R7.

Efron, B., Tibshirani, R.J., 1998. An Introduction to the Bootstrap. CRC Press LLC, Boca Raton.

Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D., 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95, 14863–14868.

Ernst, J., Nau, G.J., Bar-Joseph, Z., 2005. Clustering short time series gene expression data. Bioinformatics 21 (Suppl. 1), i159–i168.

Fairbairn, I.P., 2004. Macrophage apoptosis in host immunity to mycobacterial infections. Biochem. Soc. Trans. 32, 496–498.

Futschik, M.E., Reeve, A., Kasabov, N., 2003. Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue. Artif. Intell. Med. 28, 165–189.

Gurney, K., 1997. An Introduction to Neural Networks. UCL Press, London.

Han, J., Kamber, M., 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco.

Harris, M.A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., et al., 2004. The gene ontology (GO) database and informatics resource. Nucleic Acids Res. 32, D258–D261.

Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.

Jenner, R.G., Young, R.A., 2005. Insights into host responses against pathogens from transcriptional profiling. Nat. Rev. Microbiol. 3, 281–294.

Kapetanovic, I.M., Rosenfeld, S., Izmirlian, G., 2004. Overview of commonly used bioinformatics methods and their applications. Ann. N.Y. Acad. Sci. 1020, 10–21.

Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., et al., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7, 673–679.

Koul, A., Herget, T., Klebl, B., Ullrich, A., 2004. Interplay between mycobacteria and host signalling pathways. Nat. Rev. Microbiol. 2, 189–202.

Lin, C.T., Lee, C.S.G., 1996. Neural Fuzzy Systems. Prentice Hall, Englewood Cliffs.

Morley, M., Molony, C.M., Weber, T.M., Devlin, J.L., Ewens, K.G., Spielman, R.S., Cheung, V.G., 2004. Genetic analysis of genome-wide variation in human gene expression. Nature 430, 743–747.

Nau, G.J., Richmond, J.F., Schlesinger, A., Jennings, E.G., Lander, E.S., Young, R.A., 2002. Human macrophage activation programs induced by bacterial pathogens. Proc. Natl. Acad. Sci. U.S.A. 99, 1503–1508.

O’Neill, M.C., Song, L., 2003. Neural network analysis of lymphoma microarray data: prognosis and diagnosis near-perfect. BMC Bioinform. 4, 13.

Pal, S.K., Mitra, S., 1999. Neuro-Fuzzy Pattern Recognition. John Wiley & Sons, New York.

Slonim, D.K., 2002. From patterns to pathways: gene expression data analysis comes of age. Nat. Genet. 32 (Suppl.), 502–508.

Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M., 1999. Systematic determination of genetic network architecture. Nat. Genet. 22, 281–285.

van der Pouw Kraan, T.C., Kasperkovitz, P.V., Verbeet, N., Verweij, C.L., 2004. Genomics in the immune system. Clin. Immunol. 111, 175–185.