Studying the Protein Folding Problem by Means of a New Data Mining Approach

1


by Huy N.A. Pham and Evangelos Triantaphyllou

Department of Computer Science, Louisiana State University, 298 Coates Hall, Baton Rouge, LA 70803

Email: [email protected] and [email protected]

ICDM 2005 Workshop on Temporal Data Mining: Algorithms, Theory and Applications

November 27-30, 2005, Houston, TX

This research was done under the LBRN program (www.lbrn.lsu.edu)

2

Brief introduction

The structure prediction problem for proteins plays

an important role in understanding the protein

folding process.

This is an NP-hard problem.

This research proposes a novel classification

approach based on a new data mining technique.

This technique tries to balance the overfitting and

overgeneralization properties of the derived models.

3

Outline

Introduction to classification and the protein folding problem

Classification methods

The overfitting and overgeneralization problem

The Binary Expansion Algorithm (BEA)

Experimental evaluation

Summary

4

Introduction to Classification

We are given a collection of records that constitute the training set. Each record contains a set of attributes and the class that it belongs to.

We are asked to find a model that describes the records of each class as a function of the values of their attributes.

The goal is to use this model to classify new records whose class is unknown.

Typical applications:
• Credit approval
• Target marketing
• Medical diagnosis
• Treatment effectiveness analysis

5

Introduction to the protein folding problem

At least two distinct, though related, tasks can

be stated:

Structure Prediction Problem (Protein

Folding Problem): given a protein amino acid

sequence, determine its 3D folded shape.

Pathway Prediction Problem: given a protein

amino acid sequence and its 3D structure,

determine the time-ordered sequence of

folding events.

6

Introduction to the protein folding problem - Cont'd

Protein folding is the problem of finding the 3D

structure of a protein from its amino acid sequence.

There are 20 different types of amino acids (labelled

with their initials as: A, C, G, ...) => A protein is a

sequence of amino acids (e.g. AGGCT... ).

The folding problem is to find how this amino acid

chain (1D structure) folds into its 3D structure.

=> Classification problem.

7

Introduction to the protein folding problem - Cont'd

A protein is classified into one of four structural classes [Levitt and Chothia, 1976] according to its secondary structure components:

all-α (α-helix)

all-β (β-strand)

α/β

α+β

8

Outline

Introduction to classification and the protein folding problem
Classification methods
The overfitting and overgeneralization problem
The Binary Expansion Algorithm (BEA)
Experimental evaluation
Summary

9


Classification methods

Decision trees: a flow-chart-like tree structure. An internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions. An unknown sample is classified by following the tree from the root to a leaf.

Bayesian classification: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.

Genetic algorithms: based on an analogy to biological evolution.

10

Classification methods - Cont'd

Fuzzy set approaches: use values between 0.0 and 1.0 to represent the degree of membership. Attribute values are converted to fuzzy values, and truth values are computed for each predicted category.

Rough set approaches: approximately or “roughly” define equivalence classes.

K-Nearest Neighbor algorithms: classify a sample from its K nearest neighbors, by majority vote over their classes or, for numeric targets, by the mean of their values.
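As an illustration of the K-Nearest Neighbor idea, the following is a minimal sketch in Python. It assumes Euclidean distance and a simple majority vote; the function name `knn_classify` and the toy data are hypothetical and do not come from the slides.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Return the majority class among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    # Sort training records by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # Vote among the labels of the k closest records.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up 2D training records for illustration only.
train = [((0.0, 0.0), "neg"), ((0.1, 0.2), "neg"), ((1.0, 1.0), "pos"),
         ((0.9, 1.1), "pos"), ((1.2, 0.8), "pos")]
print(knn_classify(train, (1.0, 0.9), k=3))  # pos
```

A larger K smooths the decision at the cost of blurring class boundaries, which is one simple way to see the overfitting and overgeneralization trade-off.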

11

Classification methods - Cont'd

Neural Networks (NNs): a problem-solving paradigm modeled after the physiological functioning of the human brain. The firing of a synapse is modeled by input, output, and threshold functions. The network “learns” from problems to which answers are known and produces answers to entirely new problems of the same type.

Support Vector Machines (SVMs): data that are non-separable in N dimensions have a higher chance of being separable if mapped into a space of higher dimension. A linear hyperplane is used to partition the high-dimensional feature space.

12

Outline

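Introduction to classification and the protein folding problem
Classification methods
The overfitting and overgeneralization problem
The Binary Expansion Algorithm (BEA)
Experimental evaluation
Summary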

13

Overfitting and overgeneralization in Classification

Classification algorithms have produced classification and prediction systems that are sometimes highly accurate and sometimes much less accurate, for no apparent reason.

A growing belief is that the root of this problem is the overfitting and overgeneralization behavior of such systems.

Overfitting means that the extracted model describes the behavior of known data very well but does poorly on new data points.

Overgeneralization occurs when the system uses the available data and then attempts to analyze vast amounts of data that it has not yet seen.

For example:
• The generated decision tree may overfit the training data.
• The SVM method may overgeneralize the training data.

=> Develop an algorithm that balances overfitting and overgeneralization.

14

A multi-class prediction method

One-vs-Others method (Dubchak et al 1999, Brown et al 2000)

Partition the K classes into a two-class problem: one class contains proteins in one “true” class, and the “others” class combines all the other classes.

A two-class classifier is trained for this two-class problem.

Then partition the K classes into another two-class problem: one class contains another original class, and the “others” class contains the rest.

Another two-class classifier is trained. This procedure is repeated for each of the K classes,

leading to K two-class trained classifiers.
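The partitioning step of the One-vs-Others method can be sketched as follows. This is only an illustration of how K binary labelings are derived from one multi-class labeling; the function name `one_vs_others_labels` and the sample labels are hypothetical, and training the K two-class classifiers is left out.

```python
def one_vs_others_labels(labels):
    """Build K binary label lists, one per class, from multi-class labels."""
    classes = sorted(set(labels))
    # For each class, mark its members +1 and every other class -1.
    return {c: [1 if y == c else -1 for y in labels] for c in classes}

# Made-up labels using the four structural classes as names.
labels = ["all-alpha", "all-beta", "alpha/beta", "alpha+beta", "all-beta"]
problems = one_vs_others_labels(labels)
for cls, binary in problems.items():
    print(cls, binary)
```

Each entry of `problems` is then the target vector for one two-class classifier, so four structural classes yield four trained classifiers.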

15

Outline

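Introduction to classification and the protein folding problem
Classification methods
The overfitting and overgeneralization problem
The Binary Expansion Algorithm (BEA)
Experimental evaluation
Summary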

16

Some basic concepts

A clause: a description of a small area of the state space covering examples of a given class.

Homogenous Clause (HC): an area that covers a set of examples of a given class, together with unclassified examples, uniformly.

Any clause of a given class may be partitioned into a set of smaller homogenous clauses.

Example: B, A1, and A2 are homogenous clauses, while A is a non-homogenous clause. A can be partitioned into two smaller homogenous clauses, A1 and A2. The example is a 2D representation; higher-dimensional cases can be treated similarly.

=> Unclassified examples covered by clause B can more accurately be assumed to belong to the same class than those in the original clause A.

[Figure: clauses A (partitioned into A1 and A2) and B in 2D.]

17

Some basic concepts - Cont'd

Whether a clause is homogenous can be decided from the standard deviation of its cell densities.

A hyper-grid with cells of side length h is superimposed on the clause. If all cells have the same density, then it is a (perfectly) homogenous clause.

The density of a cell [Richard, 2001]:

\[ p(x) = \frac{1}{n h^{D}} \sum_{i=1}^{n} \prod_{m=1}^{D} \varphi\left( \frac{x_m - x_m^{i}}{h} \right) \]

where n = #(examples in the cell), D = #(dimensions), x^{i} is the i-th example, and \varphi = a kernel function.

[Figure: clause A superimposed on a hyper-grid; the density of every cell can be computed => standard deviation = 0.]

[Figure: clause B superimposed on a hyper-grid; the density of every cell can be computed => standard deviation > 0.]

Determine the homogenous values of clauses A and B.
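The homogeneity test can be illustrated with a rough sketch. It makes a simplifying assumption: each cell's density is taken as a plain count divided by the cell volume h^D, a histogram stand-in for the kernel estimate on the slide. The function name `cell_density_std` and the sample points are hypothetical.

```python
import math
from collections import Counter
from itertools import product
from statistics import pstdev

def cell_density_std(points, h):
    """Standard deviation of per-cell densities on a grid of side h.

    Assumption: density is a plain count divided by the cell volume h**D,
    a histogram stand-in for the kernel estimate on the slide.
    """
    dims = len(points[0])
    # Map every point to the integer index of the grid cell containing it.
    cells = Counter(tuple(math.floor(x / h) for x in p) for p in points)
    # Enumerate all cells in the bounding box, including empty ones.
    ranges = [range(min(c[d] for c in cells), max(c[d] for c in cells) + 1)
              for d in range(dims)]
    densities = [cells.get(idx, 0) / h ** dims for idx in product(*ranges)]
    return pstdev(densities)

# One point per cell on a 4x4 grid: perfectly homogenous.
uniform = [(x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
# Ten points in one corner cell and one in the opposite corner: not homogenous.
clumped = [(0.5, 0.5)] * 10 + [(3.5, 3.5)]
print(cell_density_std(uniform, h=1.0))      # 0.0
print(cell_density_std(clumped, h=1.0) > 0)  # True
```

A standard deviation of zero corresponds to the perfectly homogenous case; in practice a small tolerance would presumably be needed, and its value is not given in the slides.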

18

Some basic concepts - Cont'd

The density expresses how many classified examples exist in a given clause of the state space.

The density of a homogenous clause is the number of examples of a given class per unit area.

[Figure: homogenous clauses A and B, each with a unit area R_Unit marked; the density of homogenous clause A > the density of homogenous clause B.]

19

BEA

Main idea of the algorithm:

Input: positive and negative examples
Output: a suitable classification

Find positive and negative homogenous clauses using any clustering algorithm.

Sort the homogenous clauses based on their densities.

For each homogenous clause, one or more new areas are created:

If its density > a threshold, then expand it by:

\[ r_F = r_C + \frac{r_G - r_C}{2D} \]

where r_F, r_C, and r_G are the radii of F (the expanded area), C (the original area), and G (the enveloping area), and D is the density of C. Some noisy examples are accepted.

Else reduce it into smaller homogenous clauses.

Use the expanded homogenous clauses for the new testing data.

Stopping conditions for expansion:

r_F ≤ D * r_C
#(noisy points) ≤ (D * n) / 100, where n = #(examples in C)

[Figure: a positive clause C inside its enveloping area G, surrounded by negative examples; D = 6.]
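The expansion rule and its two stopping conditions can be written out as a small sketch. It assumes that the radius of the enveloping area G is already known and that both conditions must hold for an expanded area to be accepted; the slides do not spell out either point. The function names and the numbers are hypothetical.

```python
def expanded_radius(r_c, r_g, density):
    """One expansion step: F's radius from C's radius, G's radius, density D."""
    return r_c + (r_g - r_c) / (2 * density)

def may_expand(r_f, r_c, density, noisy_points, n_examples):
    """Check both stopping conditions from the slide for an expanded area F."""
    within_radius = r_f <= density * r_c
    within_noise = noisy_points <= (density * n_examples) / 100
    return within_radius and within_noise

# Made-up values: C has radius 2.0, G has radius 5.0, and density D = 6.
r_f = expanded_radius(r_c=2.0, r_g=5.0, density=6)
print(r_f)  # 2.25
print(may_expand(r_f, r_c=2.0, density=6, noisy_points=1, n_examples=50))  # True
```

Because the step is divided by 2D, a denser clause grows in smaller increments toward G, while its noise allowance (D * n)/100 is larger.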

20

BEA - Cont'd

Main algorithm:

Input: positive and negative examples
Output: a suitable classification

Step 1: Find positive and negative clauses using the k-means clustering-based approach with the Euclidean distance.

Step 2: Find positive and negative homogenous clauses from the positive and negative clauses, respectively.

Step 3: Sort the positive and negative homogenous clauses by density.

Step 4: FOR each homogenous clause C DO
  If (its density > a threshold = (max – min)/2 of the densities) then
    - Expand C using its density D.
    - Accept (D*n)/100 noisy examples, where n = #(its examples).
  Else
    Reduce C into smaller homogenous clauses by considering each cell of its hyper-grid as a new homogenous clause.
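Steps 3 and 4 can be sketched as a small planning function. It takes the threshold exactly as written on the slide, (max – min)/2 over all densities, and assumes a descending sort, since the slide does not state the direction. The function name `plan_clauses` and the densities are hypothetical, and the actual expansion and reduction are not implemented here.

```python
def plan_clauses(densities):
    """Decide, per homogenous clause, whether to expand or reduce it.

    densities maps a clause name to its density. The threshold follows the
    slide as written: (max - min) / 2 over all densities.
    """
    threshold = (max(densities.values()) - min(densities.values())) / 2
    # Step 3: process clauses in order of decreasing density (an assumption;
    # the slide does not state the sort direction).
    ordered = sorted(densities, key=densities.get, reverse=True)
    # Step 4: expand dense clauses, reduce sparse ones.
    return [(name, "expand" if densities[name] > threshold else "reduce")
            for name in ordered]

# Made-up densities for three homogenous clauses.
print(plan_clauses({"C1": 6.0, "C2": 1.5, "C3": 3.0}))
```

With these values the threshold is 2.25, so C1 and C3 would be expanded and C2 would be split into the cells of its hyper-grid.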

21

BEA - Cont'd

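Example: BEA in 2D

[Figure: positive examples are grouped into clauses, the clauses are split into homogenous clauses, and the homogenous clauses are expanded into extended HCs.]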

22

Correctness of improvement

Definition: e is improved by e’, written e > e’, if for all contexts C such that C[e] and C[e’] are closed, whenever C[e] converges in n steps, C[e’] also converges in k steps with k ≤ n [Sands, 1998].

BEA:

Use the k-means clustering-based approach to find positive and negative sets.

Let e denote the results obtained from the k-means clustering-based approach and e’ denote the results obtained from BEA. Certainly C[e] and C[e’] are closed. Moreover, C[e’] can accept more examples, since all homogenous clauses are expanded from e.

Accept noisy examples.

Hence e is improved by e’, or e is refined to e’.

23

Outline

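Introduction to classification and the protein folding problem
Classification methods
The overfitting and overgeneralization problem
The Binary Expansion Algorithm (BEA)
Experimental evaluation
Summary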

24

Accuracy measures for multi-class classification

The accuracy of two-class problems involves calculating true positive rates and false positive rates.

The accuracy, Q, of multi-class problems can be determined from the true class rates [Rost & Sander, 1993; Baldi et al., 2000] by:

\[ Q = \sum_{i=1}^{k} w_i q_i \]

where q_i = c_i/n_i, n_i = #(examples in the i-th class), and c_i = #(correctly classified examples in the i-th class);

w_i = n_i/N, where N = total number of examples over all classes.
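A short sketch of the accuracy measure Q follows, taking N as the total number of examples across all classes (the reading under which the weights w_i sum to 1). The function name `overall_accuracy` and the counts are hypothetical and are not taken from the experiments.

```python
def overall_accuracy(correct, totals):
    """Q = sum_i w_i * q_i with q_i = c_i / n_i and w_i = n_i / N."""
    n_all = sum(totals)
    return sum((n / n_all) * (c / n) for c, n in zip(correct, totals))

# Hypothetical counts for a three-class problem (not from the slides).
q = overall_accuracy(correct=[45, 30, 5], totals=[50, 40, 10])
print(round(q, 4))  # 0.8
```

With these weights the n_i terms cancel, so Q equals the total number of correctly classified examples divided by N.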

25

Experiments

Assess the algorithm for two-class problems. Source: http://www.csie.ntu.edu.tw/~cjlin/methods/guide/data/

Training set          Testing set                             # Atts   C.J. Lin’s SVMs
Train_1 (3089 exps)   Test_1 (4000 exps)                      4        96.9%
Train_2 (391 exps)    Train_2 (391 exps), cross validation    20       85.2%
Train_3 (1243 exps)   Test_3 (41 exps)                        21       87.8%

BEA:

Training set   Testing set   #Fail Positive           #Fail Negative           Q
Train_1        Test_1        9 (2000 positive exps)   22 (200 negative exps)   99.25%
Train_2        Train_2       0                        0                        100%
Train_3        Test_3        0 (41 positive exps)     0 (0 negative exps)      100%

26

Experiments - Cont'd

[Bar chart: classification accuracy (axis from 75.0% to 100.0%) of C.J. Lin's SVMs and BEA on Train_1 & Test_1, Train_2 with cross validation, and Train_3 & Test_3.]

The BEA provides 15.5% improvement in the classification accuracy vs. C.J.Lin’s SVMs.

27


Assess the algorithm for two-class problems. Source: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary

        Data   #atts   #exps   Q
Train   w1a    300     2477
Test    w2a    300     3470    85.97%
        w3a    300     4912    85.40%
        w4a    300     7366    85.08%
        w5a    300     9888    84.64%
        w6a    300     17188   84.18%

        Data   #exps   Q
Train   w4a    7366
Test    w1a    2477    85.79%
        w2a    3470    86.57%
        w3a    4912    86.16%
        w5a    9888    85.41%
        w6a    17188   84.83%

        Data   #atts   #exps   Q
Train   a3a    122     3185
Test    a4a    122     4781    90.17%
        a5a    122     6414    86.47%
        a6a    122     11220   82.17%
        a7a    122     16100   79.99%

        Data   #exps   Q
Train   a7a    16100
Test    a3a    3185    94.98%
        a4a    4781    94.92%
        a5a    6414    94.92%
        a6a    11220   96.95%

28


A test bed of the algorithm for the protein folding problem. Source of data sets: http://www.nersc.gov/~cding/protein by Ding and Dubchak, 2001.

Data types             #atts   # training exps   # testing exps
A.A.Composition (C)    21      605               385
Secondary struc. (S)   22      605               385
Polarity (P)           22      605               385
Polarizability (Po)    22      605               385
Hydrophobicity (H)     22      605               385
Volume (V)             22      605               385

Six parameter datasets extracted from protein sequences.
Use the One-vs-Others method for the four-class problem.
Use the Independent Test method in experiments.
BEA represents a protein as an n-dimensional vector corresponding to the composition of the n amino acids in the protein.
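To make the composition representation concrete, here is a minimal sketch that turns a sequence into a vector of amino acid fractions. It assumes the 20 standard one-letter codes and plain relative frequencies; how the 21 attributes of the A.A.Composition dataset are actually laid out is not described in the slides. The name `composition_vector` and the sample sequence are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def composition_vector(sequence):
    """Fraction of each of the 20 amino acids in a protein sequence."""
    sequence = sequence.upper()
    counted = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counted)
    # Residues outside the 20 standard codes are ignored in this sketch.
    return [c / total for c in counted] if total else [0.0] * len(AMINO_ACIDS)

vec = composition_vector("AGGCA")  # made-up sequence
print(len(vec), vec[AMINO_ACIDS.index("G")])  # 20 0.4
```

Every protein maps to a vector of the same length regardless of sequence length, which is what lets a distance-based method such as k-means operate on it.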

29


The average results obtained from [Ding and Dubchak, 2001] and [Zerrin, 2004] for the dataset with 27 classes:

Q1: The average accuracy of the SVMs with the independent test method in [Ding and Dubchak, 2001, Table 6, p11].
Q2: The average accuracy of the Neural Networks with the independent test method in [Ding and Dubchak, 2001, Table 6, p11].
Q3: The average accuracy of the SVMs-AAC method in [Zerrin, 2004].
Q4: The average accuracy of the SVMs-TrioAAC method in [Zerrin, 2004].

Data types         Q1      Q2      Q3       Q4
A.A.Composition    43.5%   20.5%   71.44%   66.66%
Secondary struc.   43.2%   36.8%
Hydrophobicity     45.2%   40.6%
Polarity           43.2%   41.1%
Volume             44.8%   41.2%
Polarizability     44.9%   41.8%

30


Results obtained from BEA for the 4-class problem:

Data types         all-α    all-β    α/β      α+β      Q5
A.A.Composition    87.27%   74.81%   71.43%   91.95%   81.37%
Secondary struc.   87.23%   72.21%   66.75%   91.17%   79.34%
Hydrophobicity     86.75%   74.55%   71.17%   91.69%   81.04%
Polarity           87.27%   73.51%   70.13%   91.95%   80.72%
Volume             87.01%   74.29%   71.43%   91.95%   81.17%
Polarizability     86.75%   74.29%   70.13%   91.95%   80.78%
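The Q5 column appears to be consistent with the unweighted mean of the four per-class accuracies in each row. The slides do not state this, so the following check is only an observation on the reported numbers; the helper `mean` and the dictionary layout are illustrative.

```python
# Per-class BEA accuracies (%) and reported Q5, copied from the table above.
results = {
    "A.A.Composition":  ([87.27, 74.81, 71.43, 91.95], 81.37),
    "Secondary struc.": ([87.23, 72.21, 66.75, 91.17], 79.34),
    "Hydrophobicity":   ([86.75, 74.55, 71.17, 91.69], 81.04),
    "Polarity":         ([87.27, 73.51, 70.13, 91.95], 80.72),
    "Volume":           ([87.01, 74.29, 71.43, 91.95], 81.17),
    "Polarizability":   ([86.75, 74.29, 70.13, 91.95], 80.78),
}

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

for name, (per_class, q5) in results.items():
    # Compare the computed mean with the reported, two-decimal Q5 value.
    print(f"{name:17s} mean={mean(per_class):.3f} reported={q5}")
```

Each computed mean agrees with the reported Q5 to within rounding at two decimals.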

31


[Bar chart: accuracy (axis from 0.0% to 90.0%) per data type (C, S, P, Po, H, V) for Ding's Neural Networks (27-class), Ding's SVMs (27-class), SVMs-TrioAAC (27-class), SVMs-AAC (27-class), and BEA (4-class).]

The BEA provides:

• About a 10% improvement in classification accuracy over the SVMs-AAC method for the Amino Acid Composition data type.

• Approximately a 36% improvement over Ding’s SVMs.

32

Summary

This research was done to:
• Enhance our understanding of the performance of a new data mining algorithm.
• Propose a new approach based on balancing overfitting and overgeneralization properties to enhance the performance of data mining algorithms.
• Make a contribution in a hot area in pure bioinformatics by achieving highly accurate results in predicting protein folding properties.

Future work will focus on:
• Testing the BEA with other applications.
• Improving the performance of the approach by:
  - Improving the accuracy of the algorithm by finding a suitable density for homogenous clauses.
  - Decreasing the execution time by using parallel computing techniques.
• Studying a multi-class classification algorithm.

33

References

Zerrin Isik et al., “Protein Structural Class Determination Using Support Vector Machines”, Lecture Notes in Computer Science (ISCIS 2004), vol. 3280, pp. 82, Oct. 2004. http://people.sabanciuniv.edu/~berrin/methods/fold-classification-iscis04.pdf

A.C. Tan et al., “Multi-Class Protein Fold Classification Using a New Ensemble Machine Learning Approach”, Genome Informatics, 14: 206–217, 2003. http://www.brc.dcs.gla.ac.uk/~actan/methods/actanGIW03.pdf

Chris H.Q. Ding et al., “Multi-class protein fold recognition using Support Vector Machines and Neural Networks”, Bioinformatics, 17: 349–358, 2001. http://www.kernel-machines.org/methods/upload_4192_bioinfo.ps

D. Sands, “Improvement theory and its applications”, in A. D. Gordon and A. M. Pitts, editors, Higher Order Operational Techniques in Semantics, Publications of the Newton Institute, pp. 275–306. Cambridge University Press, 1998.

34

Thank you!

Any questions?
