TOWARDS A PROTEOMIC ONTOLOGY: A SURVEY OF PROTEOMIC ALGORITHMS AND BIOLOGICAL DATABASES

MARIO CANNATARO, PIETRO HIRAM GUZZI, TOMMASO MAZZA, AND PIERANGELO VELTRI

1. Introduction

1.1. Biological Background. Proteins are extremely important biological macromolecules. They are composed of a sequence of elementary components called amino acids, of which about twenty different kinds occur in nature. Every protein is expressed by a particular region of DNA. Protein synthesis originates in the cell nucleus with the splitting of the DNA and the transcription of a single strand of mRNA. The process then moves to the ribosomes, where the information contained in the nucleotide strand serves as a grammar for the construction of the protein sequence. The chemical structure of a single amino acid includes a carbon atom called C(α) bound to a carboxyl group, an amino group, a hydrogen atom and a residue that identifies each amino acid. Two amino acids are joined by a peptide bond, and in this way they form a linear chain. The sequence backbone is the sequence of C(α) atoms. Each amino acid is coded by three nucleotides. Although there are four nucleotides

• Adenine
• Thymine
• Cytosine
• Guanine

and 64 possible combinations exist, only 20 amino acids are known, so several combinations identify the same amino acid. The process that leads to the formation of a protein is called protein synthesis and is mediated by nucleic acids. After a protein has been synthesized, many chemical reactions start. Above all, the protein assumes a unique spatial conformation: this is called protein folding. Some reactions modify the backbone: these are called post-translational modifications. The proteome is extremely dynamic and shows spatial and temporal variations. The overall sequence of amino acids identifies each protein. The study of proteins has introduced different levels of abstraction, which correspond to the following structures:

• Primary
• Secondary
• Tertiary
• Quaternary

The primary structure is the ordered sequence of amino acids that constitutes the skeleton of the protein. The mapping between a single protein and its primary structure is one-to-one. The secondary structure is the spatial conformation of the skeleton. The tertiary structure is the overall spatial conformation, while the quaternary structure is determined by several polypeptide chains. A more detailed explanation of these structures is presented in the following sections.
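The count of 64 combinations mentioned above follows from choosing three nucleotides in order from an alphabet of four. A minimal sketch in Python, where the function name `all_codons` is a hypothetical helper introduced only for illustration:

```python
from itertools import product

# The four nucleotides listed above: Adenine, Thymine, Cytosine, Guanine.
NUCLEOTIDES = "ATCG"


def all_codons(alphabet=NUCLEOTIDES):
    """Return every ordered triplet of nucleotides over the alphabet."""
    return ["".join(triplet) for triplet in product(alphabet, repeat=3)]


codons = all_codons()
print(len(codons))  # number of triplets, 4 ** 3
```

Since 64 triplets code for only 20 amino acids, the code is necessarily redundant: on average more than three triplets map to the same amino acid.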

1.2. Computational Background. Scientific research in proteomics, genomics and computational biology is amassing a large amount of data. Although these data are modeled in a very simple way, their sheer volume and the plethora of additional information stored with them complicate the scenario. Three primary areas of work exist:

• creation of databases,
• development of new algorithms,
• implementation of software tools to access and manage data.

Each of these areas has its own issues, but we focus on their common characteristics:

• large amount of data,
• heterogeneity of formats,
• need to interface with different fields.

As a consequence, computational methods are data intensive and require high computational resources. For this reason research focuses on parallel implementations of these methods, and many approaches are grid based.

2. From Genomics to Proteomics

The Human Genome Project aimed to map the entire human genome. This goal was achieved, but the resulting understanding of gene expression was not exhaustive. The proteome, the entire set of proteins expressed by a genomic region, is more complex than the genome: it is dynamic, whereas the genome is static. Proteins in fact undergo many post-translational modifications that alter the peptide backbone. Moreover, it has been shown experimentally that a single region encodes a set of proteins. This has shifted the focus of research towards the study of proteins.

3. Biological Databanks

3.1. Introduction. Improvements in technology have produced a large amount of data, which needs to be stored efficiently. Many biological databanks have been implemented, and several databank classifications have been introduced. The first classification is based on the contents of the databases. The second focuses on their implementation. The third considers the data retrieval facilities and query interfaces of the databases [19].

3.2. Data Modeling. Biological molecules are modeled simply as sequences of digital symbols. In particular, proteins are modeled by their sequence of amino acids (primary sequence). Amino acids are identified in two ways, following the IUPAC nomenclature:

• with a one-letter code,
• with a three-letter code.

Figure 1. Swiss Prot line codes

The first is adopted in the Swiss-Prot databank and is useful in all algorithms that operate on the primary sequence. The second is used in PDB and is easier to read. This databank also stores the structural data of proteins, which are encoded as sequences of 3-D coordinates. The scenario is more complicated for proteomic experimental data. These are sets of mass/charge ratios produced by mass spectrometers. A standardization initiative (HUPO) is under way and an XML draft has been proposed. Databanks often store a large amount of data together with sequences and coordinates; these data are called annotations. There are many kinds of annotation, covering post-translational modifications, the research that identified the protein, and functional information.
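The two IUPAC encodings described in this section can be related by a simple lookup table. The sketch below uses the standard one-letter and three-letter codes of the 20 amino acids; the function names `to_three_letter` and `to_one_letter` are hypothetical helpers, not part of any databank software.

```python
# Standard IUPAC one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
# Reverse table, keyed in upper case because PDB files write residues as "ALA", "LYS", ...
THREE_TO_ONE = {three.upper(): one for one, three in ONE_TO_THREE.items()}


def to_three_letter(sequence):
    """Convert a one-letter primary sequence into a list of three-letter codes."""
    return [ONE_TO_THREE[residue] for residue in sequence.upper()]


def to_one_letter(residues):
    """Convert a list of three-letter codes into a one-letter string."""
    return "".join(THREE_TO_ONE[residue.upper()] for residue in residues)


print(to_three_letter("MKV"))
print(to_one_letter(["MET", "LYS", "VAL"]))
```

The one-letter form is the compact string on which sequence algorithms operate, while the three-letter form is easier for a human reader.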

3.3. Sequence Databases: Swiss-Prot. Swiss-Prot [5] is a database of annotated protein sequences. It was created in 1986 and has been maintained collaboratively since 1987 by the group of Amos Bairoch, first at the Department of Medical Biochemistry of the University of Geneva and now at the Swiss Institute of Bioinformatics (SIB), and by the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)). Swiss-Prot stores two classes of data: core data and annotations. The core data are the protein sequences. The annotation consists of the description of the following items:

• Function(s) of the protein
• Post-translational modification(s)
• Domains and sites
• Disease(s) associated with deficiencies
• Secondary structure

Each Swiss-Prot entry is composed of a set of lines, each identified by a two-letter code. Swiss-Prot is updated weekly and is freely distributed as a flat file. Any researcher can submit data to the databank; they are included once verified by the Swiss-Prot committee. Figure 1 summarizes the line codes used.
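Because every line of an entry starts with a two-letter code, a flat-file entry can be grouped by code with very little logic. The following is a minimal sketch, not the official parser: the sample entry, its accession `X00000` and the function name `parse_entry` are hypothetical, and the layout (code in the first two columns, content from the sixth column, `//` as terminator) is assumed from the usual Swiss-Prot flat-file convention.

```python
from collections import defaultdict


def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip empty lines
        if line.startswith("//"):
            break                         # entry terminator
        code, content = line[:2], line[5:].rstrip()
        if code.strip():
            fields[code].append(content)
        else:
            # Lines with a blank code continue the sequence block;
            # the spaces that group residues in tens are dropped.
            fields["SQ"].append(content.replace(" ", ""))
    return dict(fields)


# Hypothetical entry, invented for illustration only.
SAMPLE = """\
ID   EXAMPLE_HUMAN   Reviewed;   12 AA.
AC   X00000;
DE   Hypothetical example protein.
SQ   SEQUENCE   12 AA;
     MKVLAAGIVG LL
//
"""

entry = parse_entry(SAMPLE)
print(entry["AC"], entry["SQ"][-1])
```

A dictionary keyed by line code keeps the core data (`SQ`) and the annotations (`DE` and similar) separate, which mirrors the two classes of data described above.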

3.4. Structural Databanks: Protein Data Bank (PDB). The Protein Data Bank [12] was established at Brookhaven National Laboratories in 1971. In October 1998, the management of the database became the responsibility of the Research Collaboratory for Structural Bioinformatics (RCSB). PDB is a worldwide archive of structural data of biological macromolecules. These data are generated by crystallography and NMR experiments. Although it is distributed as flat files, it is based on a relational model. Each PDB entry is stored in its own flat file. The format of an entry is defined in the Guide to Authors v2.1 (draft) provided by the PDB. The database currently contains entries in at least four distinct formats: v1.0, v2.0, v2.1 and v2.2. There is an underlying ontology of about 1700 terms that define the macromolecular structure and the crystallographic experiment. This ontology is called the macromolecular Crystallographic Information File (mmCIF) dictionary. There are three distinct query interfaces:

• Status Query,
• Search Lite,
• Search Field.

Search Lite, introduced in February 1999, has a single text field in which keywords can be entered. Search Field, released in May 1999, is a customizable query form. It allows searches based on author citation, sequences (via the FASTA algorithm), dates and chemical formulas. Several interfaces present information about the results. The Query Result Browser interface makes it possible to browse more detailed information and to download the set of files that store the structures found. The final curated files are stored as ASCII files in PDB and mmCIF formats in the FTP archive. PDB files in XML format are currently being tested. Data are acquired from the research community by submission.

3.5. Databases of Structural Classifications: CATH and SCOP. SCOP [2] (Structural Classification of Proteins) stores information about structural similarity and evolutionary relationships. This database was created using both automated methods and manual intervention. Some of the methods used are:

• a Hidden Markov Model [4],
• 3D Search [30].

The second method is based on geometric hashing, which is discussed in a following section.

CATH stores a hierarchical classification of PDB structures. It stores all the structures obtained with NMR. Those obtained with crystallography are stored if their resolution is better than 3 Angstroms. The SIFT program automatically reads the PDB databank and selects the structures. The hierarchy has four levels:

Figure 2. New data acquisition of PDB

• (C) Class
• (A) Architecture
• (T) Topology
• (H) Homology

Class is automatically assigned. Three major classes exist:

• mainly α
• mainly β
• α + β

Figure 3. CATH structure

3.6. HSSP. HSSP stores homologous sequences and supports homology modelling. The alignment of a protein against all stored sequences constitutes the major step in populating the database. The alignment results are then processed using biological considerations. For each known structure HSSP contains aligned sequences, secondary structure, variations and a sequence profile. Tertiary structures are not stored explicitly; they are implied by the other information. The growth of the database is considerable. HSSP can be seen as an example in which the results of alignment algorithms populate a database, and these data generate further knowledge through other algorithms.

3.7. Domain Databases: PROSITE. PROSITE is a databank that stores protein families and domains. It is based on the observation that the enormous number of proteins can be grouped into a small number of groups. The classification criterion is sequence similarity. The proteins, or protein domains, classified in the same group share function and evolutionary origin. PROSITE currently contains patterns and profiles for about a thousand different families. Each of these carries structural and functional information.

3.8. Derived Databases: InterPro. InterPro is an archive of documentation for protein families. This databank aims to integrate the information stored in other biological databanks, each of which has a different application domain. In this way it is possible to share the functionality of the various biological databanks. It is implemented as a relational database (ORACLE) and access is based on Java Servlets. It is distributed in XML, stored in flat-file format. Figure 5 shows the InterPro entries divided by source databank.

Figure 4. Biological Data Bank

Figure 5. Interpro Entries

Each entry has a unique code, IPRXXXXXX, in which XXXXXX is a unique identifier.
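The fixed shape of the entry code makes it easy to validate. A small sketch, assuming that the XXXXXX part consists of exactly six decimal digits (the text above only states that it is a unique identifier); `is_interpro_id` is a hypothetical helper name.

```python
import re

# Assumption: "IPR" followed by exactly six digits, e.g. IPR000001.
INTERPRO_ID = re.compile(r"IPR\d{6}")


def is_interpro_id(code):
    """Return True when the whole string has the IPRXXXXXX shape."""
    return INTERPRO_ID.fullmatch(code) is not None


print(is_interpro_id("IPR000001"))
print(is_interpro_id("PS00001"))
```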

3.9. Interfaces to Databanks: SRS and Entrez. SRS [32] (Sequence Retrieval System) is a system created by Thure Etzold at EMBL. SRS builds indices for many biological databases, including the literature abstracts in MedLine. Each site running SRS has a different subset of the databases installed. Any other databank that has SRS indices can interact with such a server. SRS currently provides access to 500 different databases. It offers a form-based interface and an advanced query language; with these two methods it is possible to execute complex queries. SRS can also enrich the results with supporting information. Entrez [1] is a text-based search engine. It is used at NCBI for many biological databanks, including PubMed, nucleotide sequences, protein structures and taxonomies. Unlike SRS, the Entrez software cannot be downloaded, and personal databanks cannot be installed in Entrez.

4. Protein Identification

Figure 6. Protein Identification

Biological sequences encode much information. All the algorithms try to decode what is implicitly stored. The amount of information is very large, so each method discovers only partial knowledge. The algorithms used in proteomic research are classified by the structures that they explain. Proteomics studies try to understand the function and the phylogenetic evolution of proteins by analyzing the various protein structures. The first problem is to identify a protein expressed in a mass spectrum. A mass spectrum is a set of mass/charge ratios obtained from a mass spectrometry experiment. The first step of this process produces an unordered sequence of recognized peptides. The second step tries to identify the protein whose primary structure matches these peptides. There are two methods to determine the sequence of proteins. The first correlates known proteins (from a sequence database) with the measured MS spectrum [13]; the quality of the match is evaluated using scoring formulas. The second is a de novo interpretation of the data that is capable of sequencing unknown (novel) proteins and their modifications [34]. PeptIdent is a tool freely available on the ExPASy server (Expasy.org). It identifies proteins starting from mass fingerprinting data. PeptIdent simulates the enzymatic digestion of all the proteins and computes the theoretical masses of the fragments produced. The results are stored in a table. The technique employed is simple: it starts from the submission of an experimental spectrum and performs a comparison. Finally, the possible matches are presented as output, each scored according to specific criteria. The tool also considers post-translational modifications, limited to those explicitly annotated in the databank. Glycosylation is excluded. The results include links to other software tools (FindMod etc.).
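The digestion-and-comparison idea behind mass fingerprinting can be sketched in a few lines. This is an illustration under stated assumptions, not the PeptIdent implementation: it assumes trypsin as the enzyme with the simplified rule "cleave after K or R unless followed by P", uses standard monoisotopic residue masses, ignores modifications and missed cleavages, and all function names are hypothetical.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added per free peptide


def digest(sequence):
    """Simplified tryptic digestion: cut after K or R, but not before P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Theoretical monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def matched_peptides(observed_masses, protein, tolerance=0.5):
    """Peptides of the protein whose theoretical mass is near an observed mass."""
    hits = []
    for peptide in digest(protein):
        mass = peptide_mass(peptide)
        if any(abs(mass - m) <= tolerance for m in observed_masses):
            hits.append(peptide)
    return hits


print(digest("AKRPGK"))
print(round(peptide_mass("AK"), 3))
```

A database search would repeat `matched_peptides` for every stored protein and rank the proteins by the number (or a score) of matched peptides.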

Figure 7. Peptident home page

5. Primary structure analysis

5.1. Pairwise Alignment. The aim of the methods in this section is to understand the phylogenetic evolution of proteins by analyzing sequence similarities. The common representation used in this task describes peptides as sequences of amino acids. If each amino acid is represented by a letter of the Latin alphabet, a protein becomes a string.

A fundamental prerequisite is a biologically relevant criterion to assess sequence similarity. Given an alphabet A and two sequences S1 and S2, we must define a set of possible operations on sequences. The simplest operation is replacement: a character s[j] is replaced by another character of the alphabet. This operation does not change the sequence length and corresponds to a biological mutation. We define two further operations: insertion and deletion. The first is the insertion of a set of characters into a string, e.g. ABBC → ABFGHFBC. Deletion is the opposite process. All the operations are evaluated by a scoring function that summarizes the biological interpretation of the three operations. PAM, BLOSUM and other kinds of matrices [15] are used as the starting point for the scoring function. Scoring matrices can influence the outcome of the analysis, since they implicitly represent a theory of evolution. PAM (Point Accepted Mutation) matrices are derived from alignments of closely related sequences; one PAM unit corresponds to one accepted point mutation per 100 amino acids between sequences that retain the same function. BLOSUM matrices consider only conserved sequence blocks. A BLOSUM 60 matrix is computed from blocks in which sequences sharing more than 60% identity are clustered together, while a PAM 60 matrix corresponds to 60 accepted point mutations per 100 amino acids. Two amino acids can be compared in many ways, and each comparison can be translated into a particular distance matrix. We can define, for example, a hydrophobicity matrix and many other kinds of matrices. Each of these is used when a particular analysis is performed. Fortunately, dynamic programming helps to find the similarities with polynomial space and time complexity. The basic idea is that an optimal alignment of length k can be derived from an alignment of length k-1. Let us consider two sequences x(i) and z(j), (i, j > 0). An optimal alignment of length k can be calculated as an extension of the alignment of length k-1 with:

• a gap,
• a deletion,
• a replacement or a match.

The cost function associated with the three cases determines the correct choice. The task is performed in two steps: building the distance matrix and evaluating the alignment score. The pairwise alignment is represented by a path in the matrix. This path starts in the lower-right position and traces back to the upper-left position. The traceback is the implementation of the recursion described above. Each step of the path is evaluated with the selected scoring function. In the matrix a match is represented by a step along the diagonal (from C(i,j) to C(i-1,j-1)), while a gap is a horizontal or vertical move. The elements of the distance matrix are evaluated recursively, starting from a particular substitution matrix (PAM, BLOSUM or another). The Needleman-Wunsch algorithm [23] was the first algorithm developed. It finds the global similarities between two peptides.

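Let x = x_1 … x_n and z = z_1 … z_m be two sequences, and let F(i, j) be the best alignment score of the prefixes x_1 … x_i and z_1 … z_j. If F(i - 1, j - 1), F(i - 1, j) and F(i, j - 1) are known, F(i, j) can be evaluated in three ways:

• alignment of x_i to z_j,
• alignment of x_i to a gap,
• alignment of z_j to a gap.

Mathematically, the algorithm implements the following formula, where s(x_i, z_j) is the substitution score and d is the gap penalty:

\[
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, z_j), \\
F(i-1, j) - d, \\
F(i, j-1) - d.
\end{cases}
\]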
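The global recurrence can be turned into code almost literally. The sketch below is a minimal illustration, assuming a simple match/mismatch score in place of a PAM or BLOSUM matrix and a linear gap penalty d; the parameter values are arbitrary choices, not values taken from the survey.

```python
def needleman_wunsch(x, z, match=1, mismatch=-1, d=2):
    """Global alignment of x and z; returns (score, aligned_x, aligned_z)."""
    n, m = len(x), len(z)
    # F[i][j] is the best score for the prefixes x[:i] and z[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i          # x[:i] aligned entirely to gaps
    for j in range(1, m + 1):
        F[0][j] = -d * j          # z[:j] aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == z[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # x_i aligned to z_j
                          F[i - 1][j] - d,       # x_i aligned to a gap
                          F[i][j - 1] - d)       # z_j aligned to a gap

    # Traceback from the lower-right cell to the upper-left cell.
    ax, az = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == z[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); az.append(z[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); az.append("-"); i -= 1
        else:
            ax.append("-"); az.append(z[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(az))


print(needleman_wunsch("GATTACA", "GATTACA"))
print(needleman_wunsch("AC", "A"))
```

Replacing the match/mismatch test with a lookup in a substitution matrix gives the biologically scored variant discussed in this section.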

Often we are interested in determining local similarities, a task also known as the partial matching problem. Some subsequences of proteins play a specific role in biological interactions, and researchers are particularly interested in studying them. The Smith-Waterman algorithm [31] finds local similarities through a modification of the scheme discussed above.

The ability to determine local similarity is due to the recurrence formula employed. The deletion of a prefix or of a suffix incurs no cost. The first row and the first column are initialized with 0. Negative scores are discarded by resetting them to 0, and the traceback stops when a cell with score 0 is reached. This algorithm implements:

\[
F(i, j) = \max
\begin{cases}
0, \\
F(i-1, j-1) + s(x_i, z_j), \\
F(i-1, j) - d, \\
F(i, j-1) - d.
\end{cases}
\]
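The local variant differs from the global one only in the extra 0 in the maximum, in the initialization, and in where the traceback starts and stops. A minimal sketch under the same assumptions as before (match/mismatch scoring instead of a substitution matrix, linear gap penalty, arbitrary parameter values):

```python
def smith_waterman(x, z, match=2, mismatch=-1, d=2):
    """Local alignment of x and z; returns (score, aligned_x, aligned_z)."""
    n, m = len(x), len(z)
    # First row and first column stay at 0: skipping a prefix is free.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == z[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback starts at the best cell and stops at the first 0.
    ax, az = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == z[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); az.append(z[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); az.append("-"); i -= 1
        else:
            ax.append("-"); az.append(z[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(az))


print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

In the example only the shared core is reported, while the unrelated flanking residues are ignored at no cost.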

An improvement currently exists, known as Gotoh's improvement, which has linear space complexity and quadratic time complexity. Both algorithms are implemented in the Java programming language in the gnu.bioinformatics package and are freely downloadable. The Needleman-Wunsch and Smith-Waterman algorithms both build a matrix with O(n*m) space complexity and determine a path with O(n+m) time complexity.

5.2. Heuristic methods. A large group of heuristic methods offers the best performance. They largely coincide with those employed in genomic searches [26]. The FASTA sequence comparison programs all require similar information: the name of a query sequence file, a library file, and the ktup parameter. All of the programs accept arguments on the command line, or they will prompt for the file names and the ktup value. The FASTA programs know about three kinds of sequence files:

• plain sequence files,
• standard library files,
• blocked ASCII formats such as the GenBank flat-file format and the EMBL flat-file format.

The most famous heuristic method is BLAST [29], with its variations. These methods employ non-exact alignment techniques and need a final verification step, whose purpose is to appraise the alignments statistically and discard those of scarce biological importance. The BLAST algorithm uses a word-based heuristic to approximate a simplification of the Smith-Waterman algorithm known as the maximal segment pairs (MSP) algorithm. Maximal segment pair alignments do not allow gaps, and they have the very valuable property that their statistics are well understood. Thus, a significance probability can readily be computed for a maximal segment pair alignment. Recent advances in maximal segment pair statistics allow several independent segment alignments to be used in evaluating the significance probability. The BLAST algorithm is less sensitive than Smith-Waterman. For proteins, the BLAST word-based heuristic is more sensitive than the FASTA one. Conversely, FASTA is more sensitive than BLAST for nucleic acid sequences and should be used instead of BLAST in that case. The BLAST word-based heuristic uses a default word size of three for proteins and eleven for nucleic acid sequences. The tables below illustrate how the BLAST heuristic differs from the FASTA heuristic using a word size of two applied to a short protein query sequence; a word size of two is used in the example to keep it to a manageable size. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST has been used to uncover several new and interesting members of the BRCT superfamily. It is freely downloadable for Linux and Windows, and some web servers provide a web version of the program.
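The word-based seeding shared by these heuristics can be illustrated with exact word matches. This sketch is a simplification: it looks up identical words of length k (the FASTA ktup idea) and counts hits per diagonal, whereas BLAST for proteins also accepts similar "neighborhood" words scored with a substitution matrix, which is omitted here. All function names are hypothetical.

```python
from collections import defaultdict


def word_index(sequence, k):
    """Map every word of length k to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def word_hits(query, subject, k=2):
    """Return (query_pos, subject_pos) pairs of identical words of length k."""
    index = word_index(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits


def best_diagonal(hits):
    """Return (diagonal, hit_count) for the diagonal j - i with most hits."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


hits = word_hits("ACDE", "XXACDE", k=2)
print(hits)
print(best_diagonal(hits))
```

Hits that fall on the same diagonal are candidates for extension into a gap-free segment pair, which is why the diagonal with the most hits is the natural place to start.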

Figure 8. Blast Output

5.3. Statistical Evaluation of Alignment. It is important to evaluate the statistical relevance of a calculated alignment. Statistical relevance measures the probability that the score obtained with random sequences exceeds the mean of the distribution of the scores obtained with real sequences. Unfortunately the exact distribution of these values is unknown, so some approximation is made. One method generates many sequences by shuffling the original sequences. These sequences are then aligned. The hypothesis is that the resulting series of scores has a normal distribution, whose mean and standard deviation are calculated. If we consider alignment without insertions and deletions, an important theoretical result is available: the distribution of scores for local alignment has been studied (Karlin and Altschul, 1990) and is the Gumbel distribution. By contrast, the theory for local alignment with gaps has not been developed, but empirical observation suggests that an identical distribution can be used.
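The shuffling method described in this section can be sketched as follows. The code is an illustration under stated assumptions: scores come from a simple local alignment with match/mismatch scoring, only the second sequence is shuffled, the normal approximation is used (a z-score) rather than a Gumbel fit, and the names, the number of trials and the seed are arbitrary.

```python
import random
import statistics


def local_score(x, z, match=1, mismatch=-1, d=2):
    """Best local alignment score, computed one row at a time."""
    prev = [0] * (len(z) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        cur = [0]
        for j in range(1, len(z) + 1):
            s = match if x[i - 1] == z[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        best = max(best, max(cur))
        prev = cur
    return best


def shuffle_z_score(x, z, trials=200, seed=0):
    """Compare the observed score with scores of shuffled versions of z."""
    rng = random.Random(seed)
    observed = local_score(x, z)
    letters = list(z)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)      # same composition, random order
        scores.append(local_score(x, "".join(letters)))
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    z_score = (observed - mean) / std if std > 0 else float("inf")
    return observed, mean, std, z_score


seq = "ACDEFGHIKLMNPQRSTVWY"
print(shuffle_z_score(seq, seq, trials=50))
```

A large z-score indicates that the observed score is unlikely to arise from sequences with the same composition but no shared order.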

5.4. Multiple Alignment. The need to align more than two proteins at a time has found applications in bioinformatics. One field in which multiple alignments can generate useful information is the evolutionary study of proteins through the analysis of the phylogenetic tree. The study of primary sequences can reveal functional links not found with the techniques presented so far. This task is called Multiple Alignment.

Figure 9. Multiple Alignment

Multiple Alignment can be achieved with dynamic programming techniques, but it is computationally more complex than pairwise alignment. A simple demonstration can be given with a geometric argument. In pairwise alignment the algorithms explore a matrix; in the worst case the longest path between the elements C(n,m) and C(1,1) runs along the two sides of the matrix, with time complexity O(n+m) (n and m are the lengths of the sides). In multiple alignment the complexity grows exponentially with the number of sequences [16]. For this reason geometric considerations have been introduced to reduce the complexity; one application is the Carrillo-Lipman method [20]. A common way to perform multiple alignments is progressive alignment, which proceeds by building pairwise alignments. The basic idea is to align first the sequences with the greatest degree of similarity, established with evolutionary criteria. This order is justified by the consideration that the most similar pairs of sequences are the most likely to have derived, more recently, from a common ancestor. The Feng-Doolittle algorithm [9] applies the ideas just described. In the first step all pairwise alignments are calculated and the scores are converted into distances; in the second step an incremental clustering algorithm builds a guide tree; finally the tree is visited until all the sequences have been aligned. The Feng-Doolittle algorithm is implemented, for instance, in the programs FITCH and KITSCH and in the program PileUp (which uses UPGMA as its clustering algorithm). The time complexity is Θ(ClusteringComplexity) + O(TreeNodes*AlignmentCost). The Thompson-Higgins-Gibson algorithm [THO81] uses the information obtained during the iterations as the starting point for the following alignments. In this way it is able, for instance, to penalize some gaps more than others. CLUSTALW is a program available for many platforms. It is based on progressive alignment: it takes a series of sequences and calculates the alignment of each pair. On the basis of these comparisons a distance matrix is built, which stores the distance between every pair of sequences. This matrix is the basis for the construction of a phylogenetic tree. The program is freely downloadable. It generally has a prompt-based user interface. It accepts as input sequence strings in different formats:

• NBRF/PIR
• EMBL/SwissProt
• Pearson (Fasta)
• GCG/MSF
• GDE
• Clustal

Its output is the resulting alignment.
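The first two steps of progressive alignment, a distance matrix followed by a clustering that yields a guide tree, can be sketched compactly. The example is a simplification under stated assumptions: distances are the fraction of differing positions between equal-length toy sequences (a real pipeline would derive them from pairwise alignment scores), clustering is plain average linkage (UPGMA), and the sequence names and helper functions are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(1 for p, q in zip(a, b) if p != q) / len(a)


def distance_matrix(seqs):
    """Map each unordered pair of sequence names to its distance."""
    return {frozenset((a, b)): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}


def upgma(names, dist):
    """Build a guide tree (nested tuples) by average-linkage clustering."""
    # Each cluster is (tree, list_of_leaf_names).
    clusters = [(name, [name]) for name in names]

    def average(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters first.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


seqs = {"s1": "ACDEFG", "s2": "ACDEFH", "s3": "ACDKLM", "s4": "WCDKLY"}
tree = upgma(list(seqs), distance_matrix(seqs))
print(tree)
```

The nested tuples give the order in which a progressive method would align the sequences: the innermost pairs first, then the groups they form.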

6. Secondary structure prediction

The algorithms in this section analyze protein secondary structure. The secondary structure summarizes the possible spatial conformations of protein subsequences. Three possible conformations exist: alpha helices, beta sheets and random coils. The first two were predicted by Linus Pauling on the basis of energetic considerations. Which of the three structures forms depends on the values of the bond angles between the various peptides. Methods that determine the entire spatial conformation of proteins have a low throughput. The research community needs an unambiguous and reliable tool to predict the secondary structure from knowledge of the peptide sequence alone. Anfinsen [7] demonstrated that knowledge of the primary sequence is necessary and sufficient to predict the secondary structure of bovine ribonuclease. DSSP [28] is a method developed by Kabsch and Sander in 1983. It takes as input the spatial positions of the atoms of the main chain and determines the secondary structure by analyzing the bond angles in the Ramachandran plot. DSSP is a reference point for the other methods. The reliability of a prediction is measured in two ways. The Q3 index measures the percentage of residues whose structure is correctly predicted. The SOV (segment overlap) measures the overlap between the predicted structure and the real one. It is important to point out that each result produced by the following methods must subsequently be validated by experimental NMR results. Currently three classes of prediction methods exist:

• ab initio calculation of the minimum-energy conformation;
• statistical analysis of known structures;
• use of neural networks.

The first group is based on the theoretical consideration that the native fold is an energy minimum. It has been demonstrated that determining the energy minimum is an NP-complete problem [18]. For this reason the algorithms belonging to this class narrow the search space using biological considerations.

An example of second group is the methods of Chou and Fasan[14] methods. Theidea is that each amminoacid has a particular tendency (i.e. K) to participate in aparticular secodary structure (i.e. to random coil). It is given by the relationshipamong the fraction of residual (K) that they are found in that secondary structureand the fraction of residues (K) that him they find in one any of the three struc-tures (to, b, coil). This technique has a Q3 of 0,7 and it has a dependency with thesequence length. GOR method has been developed in 1976 by Garnier, Osguthorpeand Robson. It is based on the consideration that the secondary conformation isthe result of an equilibrium. Each amminoacid is influenced by nearest neighborsand the tendency to participate to a particular structure is modified. This methodhas been reviewed to improve performances. Actually is used GORIV version.[3]The third group has best performances and is actually used.The different belong-ing techniques to this class differentiate him for net topology,for the constructionof the training set and on the rules of education. Qjan and Sejnowsky[33] in 1988developed a prediction method. Their net has been trained with 106 PDB struc-tures; 15 of these was the test set, the others the training set. In training set existsthe presence of three structures has been balanced corresponding to biological fre-quency. The net has three layers: three output cells, 40 cells intermediary, and13 input cells. The input layer is a window of 13 amminoacid whose the net tryto predict the conformation of central residue. PHD (Profile from Heidelberg) isanother prediction method. It receives in input a set of aligned sequences and nota single sequence. It is based upon three layers:

• Sequence to Structure
• Structure to Structure
• Jury Decision

The first layer is a neural network that classifies the central residue of the multiple alignment. The second formalizes the consideration that the input sequences are correlated. The last layer improves performance by combining the results of 12 different networks. JPRED is a consensus-based method. It is freely available and is implemented on a web server. It uses a neural network called Jnet. It receives as input either a multiple alignment or a single sequence; in the latter case a multiple alignment is computed beforehand.
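To make the statistical idea behind the Chou and Fasman method concrete, the following minimal Python sketch computes per-residue tendencies from a set of annotated sequences, together with the Q3 accuracy used to compare predictors. It assumes the usual normalization, in which the frequency of a residue in a structure is divided by the overall fraction of residues in that structure; the function names and the one-letter state labels (H, E, C) are illustrative choices, not part of the original methods.

```python
from collections import Counter

STATES = ("H", "E", "C")  # assumed labels: helix, strand, coil


def propensities(sequences, structures):
    """Return {(residue, state): propensity} from annotated sequences.

    Each sequence is paired with a string of the same length giving
    the secondary structure state of every residue.
    """
    res_total = Counter()    # occurrences of each residue type
    res_state = Counter()    # occurrences of each (residue, state) pair
    state_total = Counter()  # occurrences of each state
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure must have equal length")
        for aa, state in zip(seq, ss):
            res_total[aa] += 1
            res_state[aa, state] += 1
            state_total[state] += 1
    n = sum(res_total.values())
    table = {}
    for aa in res_total:
        for state in STATES:
            if state_total[state] == 0:
                continue  # state never observed, propensity undefined
            # frequency of this residue in the state ...
            freq = res_state[aa, state] / res_total[aa]
            # ... normalized by the overall fraction of residues in the state
            table[aa, state] = freq / (state_total[state] / n)
    return table


def q3(predicted, observed):
    """Fraction of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)
```

A propensity above 1 indicates that the residue occurs in the given structure more often than an average residue does; a real predictor would then scan the sequence for runs of residues with high propensity.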


Figure 10. JPRED Home Page

7. Tertiary structure analysis

The algorithms on tertiary structure are classified into the following groups: comparison of structures and fold prediction. The first group is justified by two biological considerations. First, proteins interact through contact between their spatial structures. Second, spatial structures are mostly preserved during phylogenetic evolution. Fold prediction is justified by the low throughput of crystallography and NMR. All the algorithms in the next section receive as input the spatial coordinates of the C(α) atoms of proteins. These data are maintained in the PDB databank. Some alternative representations (e.g. 2D curves) have been introduced in particular algorithms.

7.1. Alignment of tertiary structure. A classical formulation of the structure alignment problem is based on the RMSD (root mean square deviation) measure. Given two sets P and Q, each containing 3D points, we need to find a rotation R and a translation T that minimize the RMSD of the distances between corresponding residues. This problem has been solved with dynamic programming algorithms. An example is SSAP, developed by Taylor and Orengo, which uses a technique called double dynamic programming. The steps are as follows:

• define a local invariant structural environment for each residue;
• for each pair of residues, compute their similarity/distance;
• enter each computed distance in a dynamic programming matrix;
• find the optimal path in this matrix.
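The steps above can be sketched in Python as follows. This is a deliberately simplified, single-level dynamic programming illustration and not the double dynamic programming of SSAP: the environment of a residue is assumed to be its distances to a few neighbors along the chain, the similarity function 1/(1+d) and the gap penalty are arbitrary choices, and all function names are hypothetical.

```python
import math


def environment(coords, i, offsets=(-2, -1, 1, 2)):
    """Distances from residue i to nearby residues along the chain.

    Distances do not change under rotation or translation, so the
    environment is a local invariant. Missing neighbors count as 0.0.
    """
    env = []
    for off in offsets:
        j = i + off
        env.append(math.dist(coords[i], coords[j]) if 0 <= j < len(coords) else 0.0)
    return env


def residue_similarity(env_a, env_b):
    """Similarity in (0, 1]; 1 means identical environments."""
    d = sum(abs(x - y) for x, y in zip(env_a, env_b))
    return 1.0 / (1.0 + d)


def align(coords_a, coords_b, gap=0.3):
    """Global alignment of two C-alpha traces; returns (score, residue pairs)."""
    n, m = len(coords_a), len(coords_b)
    envs_a = [environment(coords_a, i) for i in range(n)]
    envs_b = [environment(coords_b, j) for j in range(m)]
    # score[i][j]: best score aligning the first i residues of A
    # with the first j residues of B
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = residue_similarity(envs_a[i - 1], envs_b[j - 1])
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] - gap,
                              score[i][j - 1] - gap)
    # trace the optimal path back through the matrix
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = residue_similarity(envs_a[i - 1], envs_b[j - 1])
        if math.isclose(score[i][j], score[i - 1][j - 1] + s):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(score[i][j], score[i - 1][j] - gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs
```

Because the environments are built from distances only, a rotated and translated copy of a structure aligns with the original residue by residue, without any explicit superposition.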

An alternative approach is based on contact maps. A much simpler measure of protein similarity is the overlap of the contact maps of the two proteins. The contact map of a protein is a matrix where C(i,j) = 1 if the distance between residues i and j is lower than a threshold level, and 0 otherwise:

\[
C_{i,j} =
\begin{cases}
1 & \text{if } i, j \text{ are in contact,} \\
0 & \text{otherwise.}
\end{cases}
\]
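The contact map definition above, and the overlap of two contact maps, can be sketched in Python as follows. The 8 Å default threshold, the exclusion of self-contacts on the diagonal, and the representation of the residue correspondence as a list of index pairs are assumptions made for illustration only.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary matrix: entry (i, j) is 1 if residues i and j are closer than threshold.

    The diagonal is set to 0, since a residue is not counted as
    being in contact with itself (an illustrative choice).
    """
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
             for j in range(n)]
            for i in range(n)]


def contact_overlap(map_a, map_b, pairing):
    """Count contacts shared by two proteins under a residue correspondence.

    pairing is a list of (index in A, index in B) tuples. A contact is
    shared when both of its residues are paired and the paired residues
    are also in contact in the other protein.
    """
    shared = 0
    for x in range(len(pairing)):
        for y in range(x + 1, len(pairing)):
            (ia, ib), (ja, jb) = pairing[x], pairing[y]
            if map_a[ia][ja] and map_b[ib][jb]:
                shared += 1
    return shared
```

Finding the pairing that maximizes this count is the hard part of the problem; the sketch only evaluates the overlap for a pairing that is already given.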


Contact overlap Q, in contrast to RMSD, is focused on contacts between residues and hence completely ignores residues that are far from each other. A different approach, which takes into account the flexibility of protein models, is described in [22]. The Flex-Prot algorithm receives as input two protein molecules A and B, each represented by the sequence of the 3-D coordinates of its C(α) atoms. It finds the largest flexible alignment by decomposing the two molecules into a minimal number of rigid fragment pairs having similar 3-D structure.

Figure 11. Flex-Prot (from Prof. Haim Wolfson's slides)

An alternative approach is represented by the techniques of geometric hashing [24]. This paradigm was originally developed for object recognition problems in computer vision. It introduces a novel indexing approach based on transformation-invariant representations. The algorithm is suitable for quick scanning of structural databases and detects recurring structural motifs that are not known a priori. The motifs need not be identical. The algorithm uses amino acid (or nucleotide) sequences, atomic labels and their 3-D coordinates. A prototype version of the algorithm has been implemented and applied successfully to the detection of substructures in proteins.

Another approach is implemented in the DALI (Distance mAtrix aLIgnment) server [6]. It receives as input the coordinates of a protein in PDB format. These data are compared with all the proteins stored in the Brookhaven databank. The server sends by e-mail a multiple structural alignment and the neighbouring structures. This server uses RMSD as scoring function and models the proteins as rigid structures.

There is also a method of structure comparison based on a combinatorial approach [25]. This method does not consider the global alignment directly but first searches for local similarities. These similarities are called AFPs (Aligned Fragment Pairs): pairs of fragments, one from each structure, that have a certain similarity. These pairs are based on local geometry. The algorithm has the following steps:

(1) All AFPs are determined;
(2) Alignment paths of the structures are built by combinatorial techniques;
(3) The paths are extended or rejected, and the optimal alignment is built.


The method receives as input two PDB files and returns an optimal alignment.
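The fragment-based procedure can be illustrated with the following Python sketch. It is a simplification under stated assumptions, not the method of [25]: fragments have a fixed length, two fragments form an AFP when their internal C(α) distance matrices differ on average by less than a threshold, and the path is built by a simple longest-chain search rather than by incremental extension and rejection. The function names and default parameters are hypothetical.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between the points of one structure."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def find_afps(coords_a, coords_b, length=4, max_diff=1.0):
    """Return start indices (i, j) of fragment pairs with similar local geometry."""
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    afps = []
    for i in range(len(coords_a) - length + 1):
        for j in range(len(coords_b) - length + 1):
            # compare the internal distances of the two fragments
            diffs = [abs(da[i + k][i + l] - db[j + k][j + l])
                     for k in range(length) for l in range(k + 1, length)]
            if sum(diffs) / len(diffs) <= max_diff:
                afps.append((i, j))
    return afps


def longest_path(afps, length=4):
    """Longest chain of non-overlapping AFPs, increasing in both structures."""
    if not afps:
        return []
    afps = sorted(afps)
    best = [1] * len(afps)   # best[x]: longest chain ending at afps[x]
    prev = [-1] * len(afps)  # predecessor of afps[x] in that chain
    for x, (i, j) in enumerate(afps):
        for y in range(x):
            pi, pj = afps[y]
            if pi + length <= i and pj + length <= j and best[y] + 1 > best[x]:
                best[x] = best[y] + 1
                prev[x] = y
    x = max(range(len(afps)), key=best.__getitem__)
    path = []
    while x != -1:
        path.append(afps[x])
        x = prev[x]
    return path[::-1]
```

In practice the two coordinate lists would be read from the C(α) records of the two input PDB files; parsing is omitted here to keep the sketch short.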

7.2. Fold Prediction. This method aims to predict protein folding by simulation based on energetic considerations. Each protein has a native conformation that minimizes the free energy. Moreover, it has been demonstrated that each peptide has a unique fold, determined by its primary structure [10]. When a perturbation modifies the native conformation of a protein, an opposing process starts that tends to cancel the modification [21]. Simply stated, the folding problem is a process that transforms information from the primary sequence into a set of 3D coordinates.

The direct approach to protein folding, based on modeled atomic forces and approximations from classical mechanics, seeks to find the folded conformation having minimum free energy. This is an exact method, but it is computationally hard. Although its complexity is O(n^2), where n is the number of atoms, n is of the order of 10^6 and the computational time is very long. A simple biological consideration can help to solve this problem. In fact, although a lot of different proteins exist, we can classify them into around 1000 to 10000 classes. In this approach, the information about known structures is reused as a template. The template spatial positions generally include only the backbone atoms. An alignment of an amino acid sequence to the set of positions in one such core template is chosen. The three-dimensional coordinates of each amino acid are determined by minimum-energy considerations starting from the template model. In this approach, estimation of the complete structure requires assigning positions to the atoms that are not in the backbone.
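The quadratic cost mentioned above comes from evaluating an interaction for every pair of atoms. The short Python sketch below counts those pairs and evaluates one pairwise energy; the Lennard-Jones form is used only as a familiar stand-in for a modeled atomic force and is an assumption of this example, as are the function names and parameters.

```python
import math
from itertools import combinations


def pair_count(n_atoms):
    """Number of distinct atom pairs: n(n-1)/2, which grows as O(n^2)."""
    return n_atoms * (n_atoms - 1) // 2


def lennard_jones_energy(coords, epsilon=1.0, sigma=1.0):
    """Sum of an illustrative pairwise potential over all atom pairs."""
    energy = 0.0
    for p, q in combinations(coords, 2):
        r = math.dist(p, q)
        sr6 = (sigma / r) ** 6
        energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return energy
```

For n = 10^6 atoms, pair_count gives 499999500000 pairs for a single energy evaluation, and a minimization requires many such evaluations.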

7.3. Fold Prediction by threading. Threading is a method based on a probabilistic model and on considerations about structure similarity. In fact, it is reasonable that proteins with a high degree of sequence similarity have a similar fold. In this model the native conformation is the state with the highest probability. A threading method first performs sequence-structure alignments for a number of examples of protein folds, using realistic knowledge-based potentials. If a template structure is similar to the native one, then the optimal threading alignment is correct and robust. In contrast, if the template structure is only moderately close to the native structure, then the reliability of the alignment is reduced. In these cases the success of the threading alignment may depend on the quality of the potentials used. Threading aims at finding a conformation similar to the native one. The fundamental difference between threading and folding is that in threading the native conformation is not present in the space of possible conformations and only an approximate conformation can be found.
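The following Python sketch illustrates the sequence-structure alignment at the core of threading. It is a toy under explicit assumptions: each template position is described only as buried ("B") or on the surface ("S"), the potential simply rewards hydrophobic residues in buried sites and the others on the surface, and the alignment is ungapped. Real knowledge-based potentials are derived from statistics over known structures; the names and values here are hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")  # illustrative set of hydrophobic residues


def residue_energy(aa, env):
    """Toy potential: -1 for a favorable residue/site match, +1 otherwise."""
    favorable = (aa in HYDROPHOBIC) == (env == "B")
    return -1.0 if favorable else 1.0


def thread(sequence, template_envs):
    """Best ungapped placement of a sequence on a template.

    Returns (energy, offset) for the lowest-energy placement.
    """
    if len(sequence) > len(template_envs):
        raise ValueError("sequence is longer than the template")
    best = None
    for offset in range(len(template_envs) - len(sequence) + 1):
        energy = sum(residue_energy(aa, template_envs[offset + k])
                     for k, aa in enumerate(sequence))
        if best is None or energy < best[0]:
            best = (energy, offset)
    return best


def best_template(sequence, templates):
    """Thread the sequence on every template; return (name, energy, offset)."""
    scored = []
    for name, envs in templates.items():
        if len(envs) >= len(sequence):
            energy, offset = thread(sequence, envs)
            scored.append((energy, name, offset))
    if not scored:
        raise ValueError("no template is long enough for the sequence")
    energy, name, offset = min(scored)
    return name, energy, offset
```

The result depends entirely on the potential: with a poor potential, a moderately similar template can score better than the correct one, which is the sensitivity described in the text.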

7.4. Homology Modeling. Homology modeling is a tertiary structure prediction method that uses data stored in databases. This approach is based on the consideration that two proteins with a similar evolution have a similar structure. Experimental observations confirm this idea, but only for the backbone atoms. When a homology model is built, a template protein is chosen. This protein has a similar evolution to the protein of unknown structure. The similarity between the two proteins is an a priori measure of reliability: a greater similarity produces a more precise model. The steps of this process are as follows:

• Choice of a protein of known structure,
• Alignment between the two sequences (target and template protein),
• Choice of preserved and changed regions,
• Building of side chains,
• Refinement of the model.

The first step follows the classical article of Lesk and Chothia [8]. The two researchers demonstrated that there is a relation between sequence similarity and structural similarity. Recent methods have refined this concept and employ local similarities. They choose more than one template protein, each of which has a local similarity with the target. The second step uses the sequence alignment algorithms described in the preceding sections, with some modifications. In this task we need to maximize similarity, not sequence identity. This consideration is of crucial importance, because a mistake in this step causes intolerable errors in modeling [11]. In this step some algorithms require human guidance to choose between multiple possibilities. Often the algorithms come back to this point when the model built shows some incongruence with experimental observations. The third step can be fully automated (e.g. CAFASP) or human-driven. The fourth step is often performed with optimization techniques such as Monte Carlo methods [17].
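The distinction between identity and similarity in the alignment step can be illustrated with a small Python sketch. It assumes that the two sequences are already aligned with "-" as the gap symbol, and it uses a hypothetical grouping of amino acids by physico-chemical character; actual tools would rely on a substitution matrix instead.

```python
# Hypothetical groups of amino acids regarded as similar (illustration only)
SIMILAR_GROUPS = ("AVLIM", "FWY", "KRH", "DE", "STNQ")


def are_similar(a, b):
    """True if the residues are identical or belong to the same group."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Fractions of identical and of similar residues over ungapped columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0, 0.0
    identical = sum(a == b for a, b in columns)
    similar = sum(are_similar(a, b) for a, b in columns)
    return identical / len(columns), similar / len(columns)
```

Two alignments of the same target and template can have equal identity but different similarity, so ranking them by similarity can lead to a different choice of preserved regions.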

Figure 12. Homology Modeling steps

Swiss-Model [27] and WHAT IF are two specialized servers. Swiss-Model is a fully automated server for homology modelling, accessible via the Internet on the ExPASy server. Full accessibility was the main idea behind the project. The current version is 3.5. It was built in 1993 by Peitsch, Schwede and Guex. Its results have populated a database which stores homologies between proteins.


WHAT IF is a program that can be used for structure prediction. It requires registration.

References

[1] Entrez On line Documentation.
[2] T. Hubbard, A.G. Murzin, S.E. Brenner and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., pages 536–540, 1995.
[3] Garnier J, Gibrat JF, Robson B. GOR secondary structure prediction method version IV. Methods in Enzymology, 266:540–553, 1996.
[4] J G Gough, C Chothia, C. K. Karplus, C. Barrett and R Hughey. Optimal hidden Markov models for all sequences of known structure. Currents in Computational Molecular Biology, pages 512–522, 2000.
[5] Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M, Boeckmann B, Bairoch A. The Swiss-Prot knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 31:365–370, 2003.
[6] Holm L, Sander C. Searching protein structure databases has come of age. Proteins, 19:165–173, 1994.
[7] Anfinsen CB. Principles that govern the folding of protein chains. Science, 181:223–230, 1973.
[8] Lesk A, Chothia C. The relation between the divergence of sequence and structure in proteins. EMBO Journal, 5:823–826, 1986.
[9] D F Feng, R F Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol, 25:351–360, 1987.
[10] Shakhnovich E. Proteins with selected sequences fold to their unique native conformation. Phys. Rev. Letters, 72:3907–3910, 1994.
[11] Samudrala R and Moult. Handling context-sensitivity in protein structures using graph theory: bona fide prediction. Proteins Suppl 1, pages 43–49, 1997.
[12] H. Berman et al. The protein data bank. Nucleic Acids Research, 28(1):235–242, 2000.
[13] J Eng et al. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal American Society Mass Spectrometry, 5(5):976–989, 1994.
[14] Chou PY, Fasman G. Conformational parameters for amino acids in helical, beta sheets, and random coil regions calculated from proteins. Biochemistry, 13:211–222, 1974.
[15] David Wheeler. Weight Matrices for Sequence Similarity Scoring.
[16] George Fuellen. A gentle guide to multiple alignment. 1996.
[17] Sander C, Holm L. Fast and simple Monte Carlo algorithm for side chain optimization in proteins: application to model building by homology. Proteins: Structure, Function and Genetics, 14:175–182, 1992.
[18] Ngo TJ, Marks J. Computational complexity of a problem in molecular structure prediction. Prot. Eng, 5:313–321, 1992.
[19] Peer Kroger. Molecular Biology Data: Database Overview Modelling Issues. PhD thesis, Ludwig Maximilian Universitat Munchen, 2001.
[20] H Carrillo, Lipmann. The multiple sequence alignment problem in biology. J Appl Math, 48:1073–1082, 1998.
[21] Sali A, Shakhnovich E, Karplus M. How does a protein fold? Nature, 369:248–251, 1994.
[22] R. Nussinov, H. Wolfson, M. Shatsky, Z.Y. Fligelman. Alignment of flexible protein structures. Proc. of the 8th International Conference on Intelligent Systems for Molecular Biology, pages 329–343, August 2000.
[23] Needleman-Wunsch. A general method applicable to the search for similarity in the amino acid sequences of two proteins. J Mol Biol, 48:444–453, 1970.
[24] R. Nussinov and H.J. Wolfson. Efficient detection of three-dimensional structural motifs in biological macromolecules by computer vision techniques. Proc. Natl. Acad. Sci. U.S.A., 88:10495–10499, 1991.
[25] Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension. Protein Engineering, 11:739–747, 1998.
[26] W. R. Pearson. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology, 183:63–98, 1990.
[27] Guex N, Schwede T, Kopp J and Peitsch MC. SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Research, 31:3381–3385, 2003.
[28] Kabsch, Sander. DSSP: Dictionary of Secondary Structure Prediction. 1970.
[29] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 17:3389–3402, 1997.
[30] A P Singh and D.L. Brutlag. Protein structure alignment: A comparison of methods. Bioinformatics, 2000.
[31] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.
[32] P. Argos, T. Etzold, A. Ulyanow. SRS: Information retrieval system for molecular biology data banks. Methods in Enzymology, pages 114–128, 1996.
[33] Qian N, Sejnowski TJ. J Mol Biol, page 865, 1988.
[34] V Dancik, T Addona, K R Clauser, J E Vath and P A Pevzner. De novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology, 6:327–342, 1999.