
Protein Super family Classification using Artificial Neural Networks

THESIS SUBMITTED IN PARTIAL FULFILLMENT FOR

THE DEGREE OF

Bachelor of Technology

Computer Science and Engineering

Vulisetty Anuja Swetha (Roll no: 108CS057)

Under the supervision of

Prof. S. K. Rath

Department of Computer Science and Engineering

National Institute of Technology, Rourkela

Rourkela-769 008, Orissa, India

Department of Computer Science and Engineering National Institute of Technology Rourkela-769008, India www.nitrkl.ac.in

CERTIFICATE

This is to certify that the thesis entitled ‘Protein super family classification using Artificial neural networks’ submitted by Vulisetty Anuja Swetha, in partial fulfilment of the requirements for the award of Bachelor of Technology Degree in Computer Science and Engineering at the National Institute of Technology, Rourkela is an authentic work carried out by them under my supervision and guidance. To the best of my knowledge, the matter embodied in the thesis has not been submitted to any other university / institute for the award of any Degree or Diploma.

Date: (Dr. S. K. Rath)

ACKNOWLEDGEMENT

I express my sincere gratitude to Prof. S. K. Rath for his motivation during the course of the project, which served as a spur to keep the work on schedule. I also convey my heartfelt thanks to Swati Vipsita for her constant support and timely suggestions.

I convey my regards to all faculty members of the Department of Computer Science and Engineering, NIT Rourkela, for their valuable guidance and advice at appropriate times.

Finally, I would like to thank my friends for their help and assistance all through this project.

V. Anuja Swetha

Roll: 108CS057

Abstract

Classification, or supervised learning, is one of the major data mining processes. Pattern recognition involves assigning a label to a given input value, and protein classification is a pattern recognition problem. The classification of protein sequences is an important tool for annotating newly discovered proteins with structural and functional properties. Protein super family classification is used in drug discovery, prediction of molecular functions and medical diagnosis. Many techniques can be applied to classification tasks, such as statistical techniques, decision trees, support vector machines and neural networks. In this work, a feed-forward neural network approach is used. Neural networks have been chosen for the protein sequence classification task for two reasons: the features extracted from protein sequences are distributed in a high-dimensional space and have complex characteristics, which makes them difficult to model satisfactorily with parameterized approaches; and the rules produced by decision tree techniques are complex and difficult to understand because the features are extracted from long character strings.

In this work, a comparative study is made of training a feed-forward neural network with three algorithms: the Back propagation algorithm, the Levenberg-Marquardt algorithm, and the Back propagation algorithm with a genetic algorithm as optimiser. The efficiency of the three algorithms is measured in terms of convergence rate and accuracy.

Keywords: ANN (Artificial neural network), Back propagation algorithm, Levenberg-Marquardt algorithm, Genetic algorithm.

Contents

1 INTRODUCTION
2 LITERATURE REVIEW
3 BASIC CONCEPTS OF PROTEIN CLASSIFICATION
    3.1 Data mining
    3.2 Details of protein classification
    3.3 Artificial Neural Networks
        3.3.1 Neural Networks properties
        3.3.2 Neural Network Characteristics
            3.3.2.1 Neural network Architecture
            3.3.2.2 Learning Mechanisms
            3.3.2.3 Activation Functions
4 CLASSIFICATION USING FEED FORWARD NETWORKS
    4.1 Back propagation Algorithm
    4.2 Levenberg-Marquardt Algorithm
    4.3 Back propagation neural network hybridized with Genetic algorithm
        4.3.1 Introduction
        4.3.2 Cycle of Genetic algorithm
        4.3.3 Convergence of genetic algorithm
        4.3.4 Working of GA-BPN
5 EXPERIMENTAL DETAILS
6 RESULTS AND DISCUSSION
7 CONCLUSION
REFERENCES

LIST OF ABBREVIATIONS

KDD - Knowledge Discovery in Databases
MAST - Motif Alignment and Search Tool
HMM - Hidden Markov Model
GA - Genetic Algorithm
BPNN - Back Propagation Neural Network
LMNN - Levenberg-Marquardt Neural Network
GA-BPNN - Back Propagation Neural Network hybridized with Genetic Algorithm
MSE - Mean Square Error

LIST OF TABLES

5.1 BPNN accuracy table for protein data
5.2 BPNN accuracy table for cancer data
5.3 LMNN accuracy table for protein data
5.4 LMNN accuracy table for cancer data
5.5 GA-BPNN accuracy table for protein data
5.6 GA-BPNN accuracy table for cancer data
5.7 Table for Algorithm and corresponding maximum accuracy for protein data
5.8 Table for Algorithm and corresponding maximum accuracy for cancer data

LIST OF FIGURES

3.1 Single-layer feed forward network
3.2 Multi-layer feed forward network
3.3 Recurrent network
3.4 Cross-over and Mutation mechanism
3.5 Genetic-algorithm cycle
3.6 Schematic representation of the generation of fitness-function
5.1 BPNN accuracy graph for protein data
5.2 BPNN accuracy graph for cancer data
5.3 MSE vs. No. of epochs graph for protein data with α = 0.5 and η = 0.7
5.4 MSE vs. No. of epochs graph for protein data with α = 0.7 and η = 0.5
5.5 MSE vs. No. of epochs graph for cancer data with α = 0.5 and η = 0.7
5.6 MSE vs. No. of epochs graph for cancer data with α = 0.7 and η = 0.7
5.7 MSE vs. No. of epochs graph for protein data with α = 0.7 and η = 0.5
5.8 LMNN accuracy graph for protein data
5.9 LMNN accuracy graph for cancer data
5.10 Avg. fitness vs. No. of generations graph for protein data
5.11 Avg. fitness vs. No. of generations graph for cancer data

Chapter 1

Introduction

Bioinformatics is the application of computer science and information technology to the fields of biology and medicine. It produces new knowledge as well as the computational tools to create such knowledge. The analysis and interpretation of biological sequence data is a fundamental task in bioinformatics, and classification and prediction techniques are one way to deal with this task.

Data mining is a growing field of computer science. It is the process of finding new patterns in huge databases. Many algorithms have been proposed for analysing data; they attempt to fit a model to the data. Knowledge discovery in databases (KDD) is sometimes used as another term for data mining. KDD analyses the data and extracts the required information and patterns from it. Data mining uses algorithms to extract the useful information and patterns from the data.

Neural networks are simplified models of the biological neuron system. They are massively parallel distributed processing systems made up of highly interconnected processing elements that have the capacity to learn, and thereby to acquire knowledge and make it available for use. Various mechanisms exist to enable a neural network to acquire knowledge [1]. In the training stage, neural networks extract the features of the input data. In the recognition stage, the network distinguishes the pattern of the input data by these features, and the result of recognition is greatly influenced by the hidden layer [2]. Neural-network learning can be specified as a function approximation problem in which the goal is to learn an unknown function (or a good approximation of it) from a set of input-output pairs [3].
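
The function-approximation view of learning can be written compactly. The notation below is an assumption introduced here for illustration and is not taken from [3]: x_i is a feature vector, y_i its target output, f the unknown function, \hat{f} the network with adjustable weights w, and the mean square error is used as one common choice of criterion.

```latex
\[
\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}, \qquad
\mathbf{y}_i \approx f(\mathbf{x}_i), \qquad
\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \; \frac{1}{N} \sum_{i=1}^{N}
\bigl\lVert \mathbf{y}_i - \hat{f}(\mathbf{x}_i; \mathbf{w}) \bigr\rVert^{2}
\]
```

Under this reading, training searches for weights that make the network output agree with the known outputs on the training pairs, in the hope that the fitted function also approximates f on unseen inputs.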

There is a need to develop an intelligent system that can classify an incoming protein into a particular super family.

Neural networks have been selected as an effective tool for the protein sequence classification task because:

1) The features that are extracted from protein sequences are distributed in a high-dimensional space with complex characteristics, which are difficult to model satisfactorily using parameterized approaches.

2) The rules produced by decision tree techniques are complex and difficult to understand because the features are extracted from long character strings [3].

Chapter 2

Literature Review

The popular BLAST tool (Altschul et al., 1990) represents the simplest nearest-neighbour approach and exploits pairwise local alignments to measure sequence similarity. Another type of direct modelling method is based on hidden Markov models (HMMs) (Durbin et al., 1998; Karplus et al., 1998). After an HMM has been constructed for each family, protein queries can be scored against all established HMMs by calculating the log-likelihood of each model for the unknown sequence and then selecting the class label of the most likely model. The Motif Alignment and Search Tool (MAST) (Bailey and Gribskov, 1998) is based on the combination of multiple motif-based statistical score values. In [17], a neural network classification method was developed as an alternative approach to the search and organization problem of protein sequence databases. The neural networks used are three-layered, feed-forward back-propagation networks. The protein sequences are encoded into neural input vectors by a hashing method that counts occurrences of n-gram words. A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures the semantics of n-gram words, improved the generalization capability of the network. The sensitivity is close to 90% overall and approaches 100% for large superfamilies. In [19], a comparative study of the back propagation algorithm, probabilistic neural networks and radial basis function neural networks is carried out on the iris and protein data sets, and the advantages and limitations of each network are discussed.
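
The n-gram encoding mentioned for [17] can be illustrated with a minimal Python sketch. It is not the hashing scheme of [17]: it assumes a plain vocabulary of all n-grams over the 20 standard one-letter amino acid codes and indexes counts directly, and it omits the SVD compression step. The function names and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_counts(sequence, n=2):
    """Count overlapping n-gram words in a protein sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_vector(sequence, n=2):
    """Return a fixed-length count vector, one entry per possible n-gram.

    Assumption: direct indexing over the full vocabulary (20**n entries)
    stands in for the hashing method described in the literature.
    """
    counts = ngram_counts(sequence, n)
    vocabulary = ("".join(word) for word in product(AMINO_ACIDS, repeat=n))
    return [counts.get(word, 0) for word in vocabulary]


example = "MKVLAAGMKV"  # hypothetical sequence fragment
vector = ngram_vector(example, n=2)
```

For n = 2 the vector has 400 entries, most of them zero for a short sequence, which is the sparsity that motivates a compression step such as SVD.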

Chapter 3

Basic concepts of protein classification

3.1 Data mining

Data mining is often defined as extracting hidden information from a database. Data mining access to a database differs from traditional access in several ways:

Query: The query might not be accurately stated or well formed. The data miner might not even be exactly sure of what he or she wants to see.

Data: The data accessed is usually a different version from that of the original operational database. The data have to be cleansed and modified to better support the mining process.

Output: The output of a data mining query is probably not a subset of the database. Instead, it is the output of some analysis of the contents of the database.

Data mining algorithms can be characterised as consisting of three parts:

Model: The purpose of the algorithm is to fit a model to the data.

Preference: Some criterion must be used to prefer one model over another.

Search: All algorithms require some technique to search the data.

Each model created can be either predictive or descriptive in nature. A predictive model makes a prediction about values of data using known results from different data. Predictive data mining tasks include classification, regression, time series analysis and prediction. A descriptive model identifies patterns or relationships in data. Clustering, summarization, association rules and sequence discovery are usually viewed as descriptive in nature.

3.2 Details of protein classification

Proteins are biochemical compounds consisting of one or more polypeptides, each a linear chain of amino acids, folded into a globular or fibrous form. The amino acids in a polypeptide are joined together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues. The sequence of amino acids in a protein is defined by the sequence of a gene, which is encoded in the genetic code. The genetic code specifies 20 standard amino acids. During or shortly after synthesis, the residues in a protein are often chemically modified by post-translational modification, which may change the physical and chemical properties, folding, stability, activity and, ultimately, the function of the protein. Proteins carry out specific functions, and they can combine to form stable complexes. Two proteins are classified into the same super family if they possess similar amino acid sequences and may therefore be functionally and structurally related. Traditionally, two protein sequences are classified into the same class if they have high similarity, i.e. have most of their features in common.

The aim of protein super family classification is to predict the super family to which an input protein belongs. More generally, protein classification aims at predicting the function or the structure of new proteins.

Feature extraction technique

The goal of feature extraction is to characterize an object to be recognized by measurements whose values are very similar for objects in the same category and very different for objects in different categories. This leads to the idea of seeking distinguishing features that are invariant to irrelevant transformations of the input [4]. After extraction, some features may be irrelevant or redundant with respect to the target concept, so they need to be removed in order to reduce complexity. The features extracted from the protein sequences are as follows: isoelectric point, molecular weight, atomic composition and the length of the amino acid sequence. These extracted features serve as the input to the neural network that is constructed to predict the super family to which the input protein belongs.

Atomic composition: The number of carbon, hydrogen, nitrogen, oxygen and sulphur atoms in the sequence.

Molecular weight: The mass of a molecule of a substance, on a scale on which the atomic weight of carbon-12 is exactly 12. In practice it is calculated by summing the atomic weights of the atoms making up the substance’s molecular formula.
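
The atomic composition and molecular weight features can be sketched together in Python. This is an illustrative sketch, not the extraction code used in this work. It assumes an unmodified linear chain written in one-letter codes, removes one water molecule per peptide bond, ignores disulphide bridges and post-translational modifications, and uses rounded standard atomic weights. The names FORMULAS, atomic_composition and molecular_weight are hypothetical.

```python
# (C, H, N, O, S) atom counts of the 20 standard free amino acids.
FORMULAS = {
    "A": (3, 7, 1, 2, 0),  "R": (6, 14, 4, 2, 0), "N": (4, 8, 2, 3, 0),
    "D": (4, 7, 1, 4, 0),  "C": (3, 7, 1, 2, 1),  "E": (5, 9, 1, 4, 0),
    "Q": (5, 10, 2, 3, 0), "G": (2, 5, 1, 2, 0),  "H": (6, 9, 3, 2, 0),
    "I": (6, 13, 1, 2, 0), "L": (6, 13, 1, 2, 0), "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1), "F": (9, 11, 1, 2, 0), "P": (5, 9, 1, 2, 0),
    "S": (3, 7, 1, 3, 0),  "T": (4, 9, 1, 3, 0),  "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0), "V": (5, 11, 1, 2, 0),
}

# Rounded standard atomic weights (assumed precision, for illustration).
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def atomic_composition(sequence):
    """Return the number of C, H, N, O and S atoms in a linear peptide."""
    total = dict.fromkeys("CHNOS", 0)
    for residue in sequence:
        for element, count in zip("CHNOS", FORMULAS[residue]):
            total[element] += count
    # Each peptide bond releases one water molecule (H2O).
    bonds = max(len(sequence) - 1, 0)
    total["H"] -= 2 * bonds
    total["O"] -= bonds
    return total


def molecular_weight(sequence):
    """Sum the atomic weights over the atomic composition."""
    composition = atomic_composition(sequence)
    return sum(ATOMIC_WEIGHTS[element] * n for element, n in composition.items())


composition = atomic_composition("GG")  # glycylglycine, C4H8N2O3
weight = molecular_weight("GG")
```

The dipeptide glycylglycine serves as a small check: two glycine molecules (C2H5NO2 each) minus one water give C4H8N2O3.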

Isoelectric point: The isoelectric point (pI) is the pH at which a particular molecule or surface carries no net electrical charge. The net charge on a molecule is affected by the pH of its surrounding environment, and the molecule can become more positively or negatively charged through the gain or loss of protons (H+), respectively. At a pH below their pI, proteins carry a net positive charge, and above their pI they carry a net negative charge. Proteins can thus be separated according to their isoelectric point (overall charge).
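
One common way to estimate the pI from a sequence is to model each ionizable group with the Henderson-Hasselbalch relation and search for the pH at which the net charge is zero. The sketch below assumes this approach; it is not necessarily the method used to obtain the feature in this work. The pKa values are illustrative assumptions (published pKa tables differ), and the function names are hypothetical.

```python
# Illustrative pKa values; published tables differ, so treat them as assumptions.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}            # side chains that gain H+
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}   # side chains that lose H+
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Estimated net charge of a peptide at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))       # protonated N-terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))      # deprotonated C-terminus
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tol=1e-4):
    """Bisection search for the pH where the net charge crosses zero.

    The net charge decreases monotonically with pH, so bisection applies.
    """
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive: the pI lies at a higher pH
        else:
            high = mid  # negative or zero: the pI lies at a lower pH
    return (low + high) / 2.0


example = "MKVLAAGDE"  # hypothetical sequence fragment
pi = isoelectric_point(example)
```

The sketch reproduces the behaviour described above: the estimated charge is positive below the computed pI and negative above it.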

Length of amino acid sequence: A protein sequence is built from twenty standard amino acids. Summing the individual frequencies of these amino acids in a sequence gives the length of the amino acid sequence.
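
The relation between the individual amino acid frequencies and the sequence length can be shown in a few lines of Python. The example sequence is hypothetical and the sketch assumes one-letter amino acid codes.

```python
from collections import Counter


def amino_acid_frequencies(sequence):
    """Count how often each amino acid occurs in the sequence."""
    return Counter(sequence)


example = "MKVLAAGMKV"  # hypothetical sequence fragment
frequencies = amino_acid_frequencies(example)

# The sequence length is the sum of the individual frequencies.
length = sum(frequencies.values())
```

The frequency table and the length can both be passed on as inputs when the feature vector for the network is assembled.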

3.3 Artificial neural networks

Neural networks are simplified models of a biological neuron system, are massively parallel