Protein Secondary Structure Prediction PSSP

Anita Wasilewska, Computer Science Department, SUNY at Stony Brook, Stony Brook, NY, USA
Victor Robles Forcada, Department of Computer Architecture and Technology, Technical University of Madrid, Madrid, Spain
Pedro Larranaga Mugica, Computer Science Department, University of the Basque Country, San Sebastian, Spain


Page 1: Protein Secondary Structure Prediction  PSSP


Page 2: Protein Secondary Structure Prediction  PSSP

Overview

• Introduction to proteins
• Four levels of protein structure: symbolic model
• Proteomic databases
• Classification and classifiers
• PSSP datasets
• Protein Secondary Structure Prediction
  - The Window: role and symbolic model
  - Use of evolutionary information
• PSSP Bayesian Network Metaclassifier
• Future Research

Page 3: Protein Secondary Structure Prediction  PSSP

Introduction to proteins

[Figure: protein structure, PDB entry 1TIM]

Page 4: Protein Secondary Structure Prediction  PSSP

Proteins

• Protein: from the Greek word PROTEUO which means "to be first (in rank or influence)"

• Why are proteins important to us:
  - Proteins make up about 15% of the mass of the average person
  - Enzymes – act as biological catalysts
  - Storage and transport – haemoglobin
  - Antibodies
  - Hormones – insulin

Page 5: Protein Secondary Structure Prediction  PSSP

Proteins

• Why are proteins important to us (continued):
  - Ligaments and arteries (mainly formed by the protein elastin)
  - Muscle – proteins in the muscle respond to nerve impulses by changing the packing of their molecules
  - Hair, nails and skin: the protein keratin is the main component

• And more ……

Page 6: Protein Secondary Structure Prediction  PSSP

Proteins Research Benefits

• Medicine – design of drugs which inhibit specific enzyme targets for therapeutic purposes (engineering of insulin)

• Agriculture – treatment of plant diseases and modification of the growth and development of crops

• Industry – Synthesis of enzymes to carry out industrial processes on a mass scale

Page 7: Protein Secondary Structure Prediction  PSSP

Four levels of protein structure

Page 8: Protein Secondary Structure Prediction  PSSP

Four Levels of Protein Structure: AMINO ACIDS

• AMINO ACIDS: there are 20 amino acids:

• Alanine (A), Cysteine (C), Aspartic Acid (D), Glutamic Acid (E), Phenylalanine (F), Glycine (G), Histidine (H), Isoleucine (I), Lysine (K), Leucine (L), Methionine (M), AsparagiNe (N), Proline (P), Glutamine (Q), ARginine (R), Serine (S), Threonine (T), Valine (V), Tryptophan (W), TYrosine (Y)

• AMINO ACID SYMBOLS: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y

Page 9: Protein Secondary Structure Prediction  PSSP

Primary Structure: Symbolic Definition

• A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} – set of symbols denoting all amino acids

• A* – set of all finite sequences formed out of elements of A, called protein sequences

• Elements of A* are denoted by x, y, z, …, i.e. we write x ∈ A*, y ∈ A*, z ∈ A*, etc.

• PROTEIN PRIMARY STRUCTURE: any x ∈ A* is also called a protein sequence or protein sub-unit

• PROTEINS (primary structure level): P ⊆ A*
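The symbolic definition can be made concrete with a short membership test. This is only an illustrative sketch: it assumes A consists of the one-letter codes of the 20 amino acids named on the amino acids slide, and the names AMINO_ACIDS and is_protein_sequence are hypothetical, not part of the presentation.

```python
# Assumed alphabet A: one-letter codes of the 20 amino acids.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_protein_sequence(x: str) -> bool:
    """Return True if x is in A*, i.e. every symbol of x belongs to A.

    The empty string is accepted because A* contains the empty sequence.
    """
    return all(symbol in AMINO_ACIDS for symbol in x)
```

For example, is_protein_sequence("ARNVSTVVLA") holds, while a string containing a letter outside the alphabet is rejected.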

Page 10: Protein Secondary Structure Prediction  PSSP

Protein Secondary Structure SS

• Secondary structure is the term protein chemists give to the arrangement of the peptide backbone in space. It is produced by hydrogen bonding between amino acids

• The assignment of the SS categories to the experimentally determined three-dimensional (3D) structure of proteins is a non-trivial process and is typically performed by the widely used DSSP program

• PROTEIN SECONDARY STRUCTURE consists of: a protein sequence and its hydrogen bonding patterns, called SS categories

Page 11: Protein Secondary Structure Prediction  PSSP

Secondary Structure

8 different categories (DSSP):
• H: α-helix
• G: 3₁₀-helix
• I: π-helix (extremely rare)
• E: β-strand
• B: β-bridge
• T: turn
• S: bend
• L: the rest

Page 12: Protein Secondary Structure Prediction  PSSP

Protein Secondary Structure

• Databases of protein sequences are expanding rapidly due to the genome sequencing projects, and the gap between the number of determined protein structures (PSS – protein secondary structures) and the number of known protein sequences in public protein data banks (PDB) is growing bigger.

• PSSP (Protein Secondary Structure Prediction) research is trying to bridge this gap.

Page 13: Protein Secondary Structure Prediction  PSSP

Three secondary structure states

• Prediction methods are normally trained and assessed for only 3 states (residues): H (helix), E (strands) and L (coil)

• There are many published 8-to-3 state reduction methods

• Standard reduction methods are defined by the programs DSSP (Dictionary of SS of Proteins), STRIDE, and DEFINE

• The predictive accuracy of different SSP (Secondary Structure Prediction) programs depends on the choice of the reduction method

Page 14: Protein Secondary Structure Prediction  PSSP

Three SS states: Reduction methods

• Method 1, used by the DSSP program:
  - H (helix) = {G (3₁₀-helix), H (α-helix)}, E (strands) = {E (β-strand), B (β-bridge)}, L (coil) – all the rest
  - In short: E, B => E; G, H => H; Rest => C
  - We use this method; it is the most difficult to predict and is used in CASP (international contests for PSSP programs)

• Method 2, used by the STRIDE program:
  - H (helix) as in Method 1, E (strands) = {E (β-strand), b (isolated β-bridge)}, L (coil) – all the rest

Page 15: Protein Secondary Structure Prediction  PSSP

Three SS states: Reduction methods

• Method 3, used by the DEFINE program:
  - H (helix) as in Method 1, E (strands) = {E (β-strand)}, L (coil) – all the rest

• Some other methods:
  - Method B: E => E; H => H; Rest => L; EE and HHHH => L
  - Method C: GGGHHHH => HHHHHHH; B, GGG => L; H => H; E => E
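As an illustration of an 8-to-3 state reduction, Method 1 (E, B => E; G, H => H; Rest => C) can be sketched as a lookup table. The names METHOD_1 and reduce_to_three_states are hypothetical, and the input is assumed to be a string of one-letter DSSP categories.

```python
# Method 1: G and H map to H, E and B map to E, every other category to C (coil).
METHOD_1 = {"H": "H", "G": "H", "E": "E", "B": "E"}


def reduce_to_three_states(dssp: str) -> str:
    """Map a string of 8-state DSSP categories to the 3 states H, E, C."""
    return "".join(METHOD_1.get(category, "C") for category in dssp)
```

Methods 2 and 3 differ only in which categories are sent to E, so under the same assumptions they would need a different table rather than different code.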

Page 16: Protein Secondary Structure Prediction  PSSP

Example of typical PSS Data

• Protein data is gathered in the form of protein sub-units (sequences), each assigned a sequence of SS categories H, E, L as observed empirically in its 3-dimensional structure. The SS categories are assigned by the DSSP program

• Example:
Sequence:    KELVLALYDYQEKSPREVTHKKGDILTLLNSTNKDWWKYEYNDRQGFVP
Observed SS: HHHHHLLLLEEEHHHLLLEEEEEELLLHHHHHHHHLLLEEEEEELLLHHH

Page 17: Protein Secondary Structure Prediction  PSSP

Protein Secondary Structure (PSS):Symbolic Definition

• Given A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} – set of symbols denoting amino acids and a protein sequence (sub-unit) x ∈ A*

• Let S = {H, E, L} be the set of symbols of 3 states (residues): H (helix), E (strands) and L (coil), and S* be the set of all finite sequences of elements of S.

• We denote elements of S* by e, o, with indices if necessary, i.e. we write e ∈ S*, e1, e2 ∈ S*, etc.

Page 18: Protein Secondary Structure Prediction  PSSP

Protein Secondary Structure (PSS):Symbolic Definition

• Any PARTIAL, ONE-TO-ONE FUNCTION f : A* → S*, i.e. f ⊆ A* × S*, is called a protein secondary structure (PSS) identification function

• An element (x, e) ∈ f is called a protein secondary structure (of the protein sequence x)

• The element e ∈ S* (of (x, e) ∈ f) is called a secondary structure sequence, or for short a secondary structure.

Page 19: Protein Secondary Structure Prediction  PSSP

Protein Secondary Structure (PSS):Symbolic Definition

• Following the standard way of writing sequence-structure pairs in PSS research, we represent the pair (x, e), i.e. the protein secondary structure, in a vertical form $\binom{x}{e}$, and hence the PSS identification function f is viewed as the set of all secondary structures it identifies, i.e.

• $f = \{ \binom{x}{e} : x \in A^* \wedge x \in \mathrm{Dom} f \wedge e = f(x) \}$

Page 20: Protein Secondary Structure Prediction  PSSP

PSS and PSS Identification Function :Examples

• Any Data Set (DS) used in PSS Prediction defines its own identification function fDS empirically and by the DSSP program, and we identify DS with fDS and write

DS = { $\binom{x}{e}$ : fDS(x) = e }

• For example: if DS is such that a protein sequence ARNVSTVVLA has the observed SS sequence HHHEEECCCHH, we put

fDS(ARNVSTVVLA) = HHHEEECCCHH and write

$\binom{\text{ARNVSTVVLA}}{\text{HHHEEECCCHH}}$ ∈ DS

Page 21: Protein Secondary Structure Prediction  PSSP

Tertiary Structure

• The tertiary structure of a protein is the arrangement in space of all its atoms

• The overall 3D shape of a protein molecule is a compromise, where the structure has the best balance of attractive and repulsive forces between different regions of the molecule

• For a given protein we can experimentally determine its tertiary structure by X-rays or NMR

• Given the tertiary structure of a protein we can know its secondary structure with the DSSP program

Page 22: Protein Secondary Structure Prediction  PSSP

Protein Sequence Tertiary Structure: Symbolic Definition

• Let x ∈ A* denote a protein sequence,
• S = (x, e) ∈ f be the secondary structure of x; the element x = (S, tx) is a tertiary structure of x, where tx is the tertiary folding function of the sequence x

Page 23: Protein Secondary Structure Prediction  PSSP

Quaternary Structure

• Many globular proteins are made up of several polypeptide chains called sub-units, stuck to each other by a variety of attractive forces but rarely by covalent bonds. Protein chemists describe this as quaternary structure.

Page 24: Protein Secondary Structure Prediction  PSSP

Protein Quaternary Structure

• Quaternary structure is a pair (Q, fQ)
• where Q is a multiset (to be defined) (for example Q = [α, α, β, β] in haemoglobin) and
• fQ is the quaternary folding function

Page 25: Protein Secondary Structure Prediction  PSSP

Protein: Symbolic Definition

PROTEIN P = { protein P sequences, their secondary structures, their tertiary structure, their quaternary structure}

PROTEIN P = {x1…xn , (x1,e1)(x2,e2)..(xn,en),

x1=((x1,e1), tx1)… xn= ((xn,en), txn),

( [x1,… xn] , fx1,… xn) } where xi is protein P ith sub-unit, txi is xi’s tertiary folding

function and fx1,… xn is protein P quaternary folding function

Page 26: Protein Secondary Structure Prediction  PSSP

Protein: Symbolic Definition

• In PSSP research we deal with protein sub-units (sequences) Xi, not with the whole sequence of sub-units X1,X2, …Xn

• We write PXk when we refer only to the sub-unit Xk of the protein P

• We write P(xi, ei) when we refer to the sub-unit xi of the protein P and its secondary structure

• We write P xi when we refer to the sub-unit xi of the protein P and its secondary structure

and its tertiary structure

P |xi = { xi , (xi,ei), xi=((xi,ei), fxi),

( [x1,… xn] , fx1,… xn) }

Page 27: Protein Secondary Structure Prediction  PSSP

Protein sub-units

Given a protein P = {x1…xn , (x1,e1)(x2,e2)..(xn,en),

x1=((x1,e1), tx1)… xn= ((xn,en), txn), ( [x1,… xn] ,

fx1,… xn) }• Pxi = { xi }• P(xi, ei) = { xi, (xi, ei) }• Pxi = { xi, (xi, ei), xi = ((xi,ei), txi) }

Page 28: Protein Secondary Structure Prediction  PSSP

Example: haemoglobin

• Haemoglobin = { x, y, (x, ex), (y, ey), α = ((x, ex), tx), β = ((y, ey), ty), ([α, α, β, β], fα,β) }

where x, y ∈ A* are called haemoglobin sub-units

Page 29: Protein Secondary Structure Prediction  PSSP

Proteomic Databases

• The most important proteomic databases are:

  - Swiss-Prot + TrEMBL
  - PIR-PSD
  - PIR-NREF
  - PDB

Page 30: Protein Secondary Structure Prediction  PSSP

Swiss-Prot + TrEMBL

• Web site: http://us.expasy.org/sprot/
• Swiss-Prot is a protein sequence database with a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. 124464 entries

• TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all sequence entries not yet integrated in Swiss-Prot. 828210 entries

Page 31: Protein Secondary Structure Prediction  PSSP

PIR-PSD

• Web site: http://pir.georgetown.edu/
• PIR-PSD: Protein Information Resource - Protein Sequence Database
• Founded in 1960 by Margaret Dayhoff
• Comprehensive and annotated protein sequence database in the public domain. 283308 entries

Page 32: Protein Secondary Structure Prediction  PSSP

PIR-NREF

• Web site: http://pir.georgetown.edu/
• PIR-NREF: PIR Non-Redundant Reference Protein Database
• It contains all sequences in PIR-PSD, SwissProt, TrEMBL, RefSeq, GenPept, and PDB.
• 1,186,271 entries
• The most used database for finding protein profiles with the PSI-BLAST program.
• Mandatory in Protein Secondary Structure Prediction (PSSP) research

Page 33: Protein Secondary Structure Prediction  PSSP

PDB: Protein Data Bank

• Web site: http://www.rcsb.org/pdb/
• PDB contains 3-D biological macromolecular structure data
• 22-April-2003 => 20747 structures
• How do we use PDB?
• All PSSP datasets start with some PDB sequences with known secondary structures. Then, with the DSSP program, we get the secondary structure and its reduction to three categories and use it as learning data for our algorithms

Page 34: Protein Secondary Structure Prediction  PSSP

Protein Secondary Structure Prediction

• Techniques for the prediction of protein secondary structure provide information that is useful both
  - in ab initio structure prediction and
  - as an additional constraint for fold-recognition algorithms.
• Knowledge of secondary structure alone can help the design of site-directed or deletion mutants that will not destroy the native protein structure.

• For all these applications it is essential that the secondary structure prediction be accurate, or at least that the reliability for each residue can be assessed.

Page 35: Protein Secondary Structure Prediction  PSSP

Protein Secondary Structure Prediction

• If a protein sequence shows clear similarity to a protein of known three-dimensional structure, then the most accurate method of predicting the secondary structure is to align the sequences by standard dynamic programming algorithms, as homology modelling is much more accurate than secondary structure prediction for high levels of sequence identity.

• Secondary structure prediction methods are of most use when sequence similarity to a protein of known structure is undetectable.

• It is important that there is no detectable sequence similarity between sequences used to train and test secondary structure prediction methods.

Page 36: Protein Secondary Structure Prediction  PSSP

Classification and Classifiers

• Given a database table DB with a special attribute C, called a class attribute (or decision attribute). The values C1, C2, ..., Cn of the class attribute are called class labels.

• Example:

A1 A2 A3 A4 C

1 1 m g c1

0 1 v g c2

1 0 m b c1

Page 37: Protein Secondary Structure Prediction  PSSP

Classification and Classifiers

• The attribute C partitions the records in the DB, i.e. divides the records into disjoint subsets defined by the values of the attribute C, called classes, or in short CLASSIFIES the records. It means we use the attribute C and its values to divide the set R of records of DB into n disjoint classes:

C1 = {r ∈ DB: C = c1} ...... Cn = {r ∈ DB: C = cn}

• Example (from our table): C1 = {(1,1,m,g), (1,0,m,b)} = {r1, r3}, C2 = {(0,1,v,g)} = {r2}
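The partition induced by the class attribute can be reproduced on the three-record example table. This is a minimal sketch; the function name partition_by_class and the tuple layout (attributes A1..A4 followed by C) are assumptions made for illustration.

```python
from collections import defaultdict

# The example table: attributes A1..A4 followed by the class attribute C.
records = [
    (1, 1, "m", "g", "c1"),
    (0, 1, "v", "g", "c2"),
    (1, 0, "m", "b", "c1"),
]


def partition_by_class(rows):
    """Group records into disjoint classes keyed by the last column (C)."""
    classes = defaultdict(list)
    for *attributes, label in rows:
        classes[label].append(tuple(attributes))
    return dict(classes)


classes = partition_by_class(records)
```

Here classes["c1"] holds the first and third records and classes["c2"] the second, matching C1 and C2 in the example.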

Page 38: Protein Secondary Structure Prediction  PSSP

Classification and Classifiers

• An algorithm (model, method) is called a classification algorithm if it uses the data and its classification to build a set of patterns: discriminant and/or characterization rules or other pattern descriptions. Those patterns are structured in such a way that we can use them to classify unknown sets of objects, i.e. unknown records.

• For that reason (because of the goal) the classification algorithm is often called, in short, a classifier.

• The name classifier implies more than just a classification algorithm. A classifier is the final product of a data set and a classification algorithm.

Page 39: Protein Secondary Structure Prediction  PSSP

Classification and Classifiers

• Building a classifier consists of two phases: training and testing.
• In both phases we use data (a training data set and a test data set disjoint from it) for which the class labels are known for ALL of the records.

• We use the training data set to create patterns (rules, trees, or to train a Neural or Bayesian network).

• We evaluate the created patterns with the use of test data, whose classification is known.

• The measure of a trained classifier's accuracy is called predictive accuracy.

• The classifier is built, i.e. we terminate the process, once it has been trained and tested and its predictive accuracy is at an acceptable level.

Page 40: Protein Secondary Structure Prediction  PSSP

Classifiers Predictive Accuracy

• PREDICTIVE ACCURACY of a classifier is the percentage of correctly classified records in the test data set.

• Predictive accuracy depends heavily on the choice of the test and training data.

• There are many methods of choosing test and training sets and hence of evaluating the predictive accuracy. This is a separate field of research.

Page 41: Protein Secondary Structure Prediction  PSSP

Predictive Accuracy Evaluation

The main methods of predictive accuracy evaluations are:

• Re-substitution (N ; N)
• Holdout (2N/3 ; N/3)
• x-fold cross-validation (N-N/x ; N/x)
• Leave-one-out (N-1 ; 1)

where N is the number of instances in the dataset and each pair gives (training set size ; test set size)

• The process of building and evaluating a classifier is also called supervised learning or, more recently when dealing with large databases, a classification method in Data Mining.
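The training/test sizes of the evaluation methods listed above can be illustrated by generating index splits. This is a sketch under simplifying assumptions: instances are identified by the indices 0..N-1, no shuffling is done (a real evaluation would normally randomise the order first), and the function names are hypothetical.

```python
def holdout(n):
    """Holdout: the first 2N/3 instances train, the remaining N/3 test."""
    cut = (2 * n) // 3
    return list(range(cut)), list(range(cut, n))


def cross_validation(n, folds):
    """x-fold cross-validation: every fold serves as the test set exactly once."""
    indices = list(range(n))
    for f in range(folds):
        test = indices[f::folds]          # every folds-th index, offset f
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def leave_one_out(n):
    """Leave-one-out is cross-validation with as many folds as instances."""
    return cross_validation(n, n)
```

Re-substitution needs no split at all, since the same N instances are used for both training and testing.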

Page 42: Protein Secondary Structure Prediction  PSSP

Classification Models: Different Classifiers

• Decision Trees (ID3, C4.5)
• Neural Networks
• Rough Sets
• Bayesian Networks
• Genetic Algorithms
• Most of the best classifiers for PSSP are based on the Neural Network model.
• We have built a PSSP classifier and a meta-classifier based on the Bayesian Networks model and plan to use the Rough Sets model as an alternative design of our meta-classifier.

Page 43: Protein Secondary Structure Prediction  PSSP

PSSP Datasets

• Historic RS126 dataset. Contains 126 sub-units with known secondary structure selected by Rost and Sander. It is no longer used today

• CB513 dataset. Contains 513 sub-units with known secondary structure selected by Cuff and Barton in 1999. Widely used in PSSP research

• HS1771 dataset. Created by Hobohm and Scharf. In March 2002 it contained 1771 sub-units

• The datasets in this family are non-redundant PDB (Protein Data Bank) subsets. Sub-units in a dataset never have a sequence identity greater than 25%.

• Many authors have their own “secret” datasets

Page 44: Protein Secondary Structure Prediction  PSSP

PSSP Algorithms

• There are three generations of PSSP algorithms
• First Generation: based on statistical information about single amino acids
• Second Generation: based on windows (segments) of amino acids. Typically a window contains 11-21 amino acids

• Third Generation: based on the use of windows on evolutionary information

Page 45: Protein Secondary Structure Prediction  PSSP

PSSP: First Generation

• First generation PSSP systems are based on statistical information about a single amino acid

• The most relevant algorithms:
  - Chou-Fasman, 1974
  - GOR, 1978

• Both algorithms claimed 74-78% predictive accuracy, but when tested on better constructed datasets they were shown to have a predictive accuracy of ~50% (Nishikawa, 1983)

Page 46: Protein Secondary Structure Prediction  PSSP

PSSP: Second Generation

• Based on the information contained in a window of amino acids (11-21 aa.)
• Most systems use algorithms based on:
  - Statistical information
  - Physico-chemical properties
  - Sequence patterns
  - Multi-layered neural networks
  - Graph theory
  - Multivariate statistics
  - Expert rules
  - Nearest-neighbour algorithms
  - No Bayesian networks

Page 47: Protein Secondary Structure Prediction  PSSP

PSSP: Second Generation

• Main problems:
  - Prediction accuracy <70%
  - Prediction accuracy for β-strand 28-48%
  - Predicted chains are usually too short, which makes the predictions difficult to use

Page 48: Protein Secondary Structure Prediction  PSSP

PSSP: Third Generation

• PHD: first algorithm in this generation (1994)
• Evolutionary information improves the prediction accuracy to 72%

• Use of evolutionary information:
1. Scan a database of known sequences with alignment methods to find similar sequences

2. Filter the resulting list with a threshold to identify the most significant sequences

3. Build amino acid exchange profiles based on the probable homologs (most significant sequences)

4. The profiles are used in the prediction, i.e. in building the classifier
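Step 3 can be illustrated with a simplified profile builder. The sketch below assumes the homologs found in steps 1 and 2 are already aligned to equal length with '-' marking gaps, and it computes plain per-column residue frequencies; tools such as PSI-BLAST produce more elaborate position-specific scores, so this is not their method. The name build_profile is hypothetical.

```python
from collections import Counter


def build_profile(aligned):
    """Return, for each alignment column, the relative frequency of each residue.

    `aligned` is a list of equal-length strings; '-' marks a gap and is ignored.
    """
    length = len(aligned[0])
    if any(len(sequence) != length for sequence in aligned):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(residue for residue in column if residue != "-")
        total = sum(counts.values())
        # An all-gap column yields an empty distribution.
        profile.append({r: n / total for r, n in counts.items()} if total else {})
    return profile
```

Each window position then carries a distribution over amino acids instead of a single letter, which is the kind of input the third-generation classifiers use.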

Page 49: Protein Secondary Structure Prediction  PSSP

PSSP: Third Generation

• Many of the second generation algorithms have been updated to third generation

• The most important algorithms of today:
  - Predator: Nearest-neighbour
  - PSI-Pred: Neural networks
  - SSPro: Neural networks
  - SAM-T02: Homologs (Hidden Markov Models)
  - PHD: Neural networks

• Due to the improvement of protein information in databases i.e. better evolutionary information, today’s predictive accuracy is ~80%

• It is believed that maximum reachable accuracy is 88%

Page 50: Protein Secondary Structure Prediction  PSSP

PSSP Data Preparation

• Public Protein Data Sets used in PSSP research contain protein secondary structure sequences. In order to use classification algorithms we must transform secondary structure sequences into classification data tables.

• Records in the classification data tables are called (learning) instances in the PSSP literature.

• The mechanism used in this transformation process is called a window.

• A window algorithm takes a secondary structure as input and returns a classification table: a set of instances for the classification algorithm.

Page 51: Protein Secondary Structure Prediction  PSSP

Window

• Consider a secondary structure (x, e).
• (x, e) = (x1x2…xn, e1e2…en)
• A window of length k chooses a subsequence of length k of x1x2…xn, and an element ei from e1e2…en, corresponding to a special position in the window, usually the middle

• The window moves along the sequences x = x1x2…xn and e = e1e2…en simultaneously, starting at the beginning and moving to the right one letter at a time at each step of the process.

Page 52: Protein Secondary Structure Prediction  PSSP

Window: Sequence to Structure

• Such a window is called a sequence-to-structure window. We will call it a window for short.

• The process terminates when the window or its middle position reaches the end of the sequence x.

• The pair (subsequence, element of e) is often written in the form
• subsequence → H, E or C, and is called an instance, or a rule.

Page 53: Protein Secondary Structure Prediction  PSSP

Example: Window

• Consider a secondary structure (x, e) and a window of length 5 with the special position in the middle

• First position of the window (it covers A R N S T; the middle letter is N):

• x = A R N S T V V S T A A ….
• e = H H H H C C C E E E

• The window returns the instance:
• A R N S T → H

Page 54: Protein Secondary Structure Prediction  PSSP

Example: Window

• Second position of the window (it covers R N S T V; the middle letter is S):

• x = A R N S T V V S T A A ….
• e = H H H H C C C E E E

• The window returns the instance:
• R N S T V → H
• Next instances are:
• N S T V V → C
• S T V V S → C
• T V V S T → C
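The sliding of the window in this example can be sketched as a small generator. It is illustrative only: the name window_instances is hypothetical, the window length is assumed odd with the special position in the middle, and only the first ten letters of x are used so that x and e have the same length.

```python
def window_instances(x, e, k=5):
    """Yield (subsequence, label) pairs for a window of odd length k.

    The label is the secondary-structure symbol aligned with the middle
    position of the window.
    """
    middle = k // 2
    last_start = min(len(x), len(e)) - k
    for start in range(last_start + 1):
        yield x[start:start + k], e[start + middle]


x = "ARNSTVVSTA"
e = "HHHHCCCEEE"
instances = list(window_instances(x, e))
```

The first five instances produced are the ones listed on the slides: ARNST with H, RNSTV with H, then NSTVV, STVVS and TVVST with C.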

Page 55: Protein Secondary Structure Prediction  PSSP

Symbolic Notation

• Let f be a protein secondary structure (PSS) identification function:

• f : A* → S*, i.e. f ⊆ A* × S*
• Let x = x1 x2 … xn, e = e1 e2 … en and f(x) = e; we define
• f(x1 x2 … xn)|{xi} = ei, i.e.
• f(x)|{xi} = ei

Page 56: Protein Secondary Structure Prediction  PSSP

Example:Semantics of Instances

Let
• x = A R N S T V V S T A A ….
• e = H H H H C C C E E E

and assume that the window returns the instance:

A R N S T → H

• The semantics of the instance is:

f(x)|{N} = H,

where f is the identification function, N is preceded by A R and followed by S T, and the window has length 5

Page 57: Protein Secondary Structure Prediction  PSSP

Classification Data Base (Table)

• We build the classification table with attributes being the positions p1, p2, p3, p4, p5, …, pn in the window, where n is the length of the window. The corresponding values of the attributes are the elements of the subsequence at the given positions.
• The classification attribute is S, with values in the set {H, E, C} assigned by the window operation (instance, rule).

• The classification table for our example (first few records) is the following.

Page 58: Protein Secondary Structure Prediction  PSSP

Classification Table (Example)

• x = A R N S T V V S T A A ….
• e = H H H H C C C E E E

p1 p2 p3 p4 p5 S
A  R  N  S  T  H
R  N  S  T  V  H
N  S  T  V  V  C
S  T  V  V  S  C

The semantics of a record r = r(p1, p2, p3, p4, p5, S) is:

f(x)|{Vp3} = Vs

where Va denotes the value of the attribute a.

Page 59: Protein Secondary Structure Prediction  PSSP

Missing Values

• Missing values: if we want to “cover” the assignment of elements of S corresponding to all of the elements of the sequence x, we have to position the window with its middle position at the first element of x.

• In this case our classification table is (for our x and e), with empty cells marked "-":

p1 p2 p3 p4 p5 S
-  -  A  R  N  H
-  A  R  N  S  H
A  R  N  S  T  H
R  N  S  T  V  H

• x = A R N S T V V S T A A

• e = H H H H C C C E E E
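One common way to obtain an instance for every residue, including the first and last ones, is to pad the sequence so that the window can be centred anywhere. The sketch below assumes a pad symbol "-" for the missing values and an odd window length; both the symbol and the name padded_window_instances are illustrative choices, not taken from the presentation.

```python
def padded_window_instances(x, e, k=5, pad="-"):
    """Return one (window, label) pair per residue, padding both ends of x.

    With k // 2 pad symbols on each side, the window centred on residue i
    starts at index i of the padded string.
    """
    half = k // 2
    padded = pad * half + x + pad * half
    return [(padded[i:i + k], e[i]) for i in range(min(len(x), len(e)))]


rows = padded_window_instances("ARNSTVVSTA", "HHHHCCCEEE")
```

The first four rows correspond to the table above: the windows centred on A, R, N and S, all labelled H.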

Page 60: Protein Secondary Structure Prediction  PSSP

Size of classification datasets (tables)

• The window mechanism produces very large datasets

• For example, a window of size 13 applied to the CB513 dataset of 513 protein sub-units produces about 70,000 records (instances)

Page 61: Protein Secondary Structure Prediction  PSSP

Window

• The window has the following parameters:
• PARAMETER 1: i ∈ N+ (natural numbers > 0), the starting point of the window as it moves along the sequence x = x1 x2 … xn. The value i = 1 means that the window starts at x1, i = 5 means that the window starts at x5, etc.

• PARAMETER 2: k ∈ N+ denotes the size (length) of the window.

• For example:
  - the PHD system of Rost and Sander (1994) uses two window sizes: 13 and 17.
  - the BRNN (Bidirectional Recurrent Neural Networks) of Pollastri, Rost, Baldi and Przybylski (2002) uses variable window sizes: 7, 9, 5, with additional windows (located on the “wheels”) of sizes 3, 4 or 2

Page 62: Protein Secondary Structure Prediction  PSSP

Window

• PARAMETER 3: p ∈ {1, 2, …, k}

• where p is a special position of the window that returns the classification attribute values from S = {H, E, C} and

• k is the size (length) of the window

• PSSP PROBLEM: find the optimal size k and the optimal special position p for the best prediction accuracy

Page 63: Protein Secondary Structure Prediction  PSSP

Window: Symbolic Definition

• WINDOW ARGUMENTS: window parameters and a secondary structure (x, e)

• WINDOW VALUE: (subsequence of x, element of e)

• OPERATION (sequence-to-structure window)

W is a partial function

W: N+ × N+ × {1, …, k} × (A* × S*) → A* × S

W(i, k, p, (x, e)) = (xi x(i+1) … x(i+k-1), f(x)|{x(i+p-1)})

where (x, e) = (x1x2…xn, e1e2…en)
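The window operation W can be written directly as a function. This sketch makes two assumptions that should be read as such: indices are 1-based as in the symbolic definition, and the label is taken at position i+p-1 so that p = 1 denotes the first position of the window, which is the reading consistent with the worked example (window A R N S T, label of N for p = 3). Returning None models the points where the partial function is undefined.

```python
def W(i, k, p, x, e):
    """Sequence-to-structure window: 1-based start i, length k, special position p.

    Returns (subsequence, label), or None where W is undefined because the
    window does not fit inside x or the label position is not covered by e.
    """
    if i < 1 or not (1 <= p <= k):
        return None
    if i + k - 1 > len(x) or i + p - 1 > len(e):
        return None
    subsequence = x[i - 1:i - 1 + k]
    label = e[i + p - 2]  # symbol e_(i+p-1), converted to 0-based indexing
    return subsequence, label
```

For instance, W(1, 5, 3, "ARNSTVVSTA", "HHHHCCCEEE") gives the pair ("ARNST", "H").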

Page 64: Protein Secondary Structure Prediction  PSSP

Sequence Alignment

• We perform sequence alignment to find out whether two sequences are homologs

• Homologs: sequences with the same 3D structure and function

• Main aspects:
1. Alignment classes: gapped vs. ungapped, global vs. local
2. Scoring systems: substitution matrices
3. Alignment algorithms:
   1. Dynamic programming: Needleman-Wunsch, Smith-Waterman
   2. Heuristics: BLAST, FASTA
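As an illustration of the dynamic programming approach, the following sketch computes the optimal global (Needleman-Wunsch) alignment score of two sequences. The scoring values (match = 1, mismatch = -1, linear gap penalty = -2) are placeholder assumptions; a real protein alignment would take its match and mismatch scores from a substitution matrix.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b by dynamic programming.

    Only two rows of the score matrix are kept, since each row depends
    on the previous one alone.
    """
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]
```

Smith-Waterman differs mainly in clamping every cell at zero and taking the maximum over the whole matrix, which turns the global score into a local one.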

Page 65: Protein Secondary Structure Prediction  PSSP

Example (ungapped alignment)

(a) Homologs
HBA_HUMAN   GSAQVKGHGKKVADALV
            G+ +VK+ HGKKV A+
HBB_HUMAN   GNPKVKAHGKKVLGAF

(b) Homologs
HBA_HUMAN   GSAQVKGHGKKVADAL
            ++ ++++H+ KV +
LGB2_LUPLU  NNPELQAHAGKVFKLV

(c) Not homologs
HBA_HUMAN   GSAQVKGHGKKVADAL
            GS+ + G + +D L
F11G11.2    GSGYLVGDSLTFVDLL

Page 66: Protein Secondary Structure Prediction  PSSP

Metaclassifier: 9 Servers contacted

Page 67: Protein Secondary Structure Prediction  PSSP

Metaclassifier: 6 Servers selected

NAME      LOCATION                      PREDICTION METHOD   Q3 (CB513)  RESULTS
Predator  Institut Pasteur, Paris       Nearest neighbour   80.0        e-mail
PSIpred   Univ College London           Neural network      79.9        e-mail
Sspro     Univ California, Irvine       Neural network      79.1        e-mail
SAM-T02   Univ California, Santa Cruz   Homologs            78.1        e-mail
PHD Exp   Columbia Univ, New York       Neural network      77.6        e-mail
Prof      Univ Wales                    Neural network      77.1        e-mail
Jpred     Univ Dundee, Scotland         Consensus           73.4        e-mail
SOPM      Institute of Biology, Lyon    Homologs            66.8        web
GOR       Univ Southampton, UK          Information theory  55.4        web

Page 68: Protein Secondary Structure Prediction  PSSP

Creating the metaclassifier dataset

Page 69: Protein Secondary Structure Prediction  PSSP

Obtaining protein profile with PSI-BLAST

We use windows on sequence profiles to obtain learning instances

Page 70: Protein Secondary Structure Prediction  PSSP

Metaclassifier

• We have used for our metaclassifier a naïve Bayes model

• The learning process has been done with HS1771 and CB513 datasets

• The metaclassifier uses evolutionary information and combines the predictions from six classifiers. It gives a 2-3% improvement in accuracy over the best single method and 15-20% over the worst.
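The idea of combining base predictions with a naïve Bayes model can be sketched as follows. This is not the authors' implementation: it assumes the only features are the H/E/C predictions of the base classifiers for a residue (the evolutionary information mentioned above is left out), uses Laplace smoothing as an added assumption, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict

STATES = "HEC"


def train_naive_bayes(rows, labels):
    """rows[n][j] is base classifier j's prediction for residue n; labels[n] is the observed state."""
    class_counts = Counter(labels)
    # feature_counts[j][observed][predicted] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for j, predicted in enumerate(row):
            feature_counts[j][label][predicted] += 1
    return class_counts, feature_counts


def predict(model, row):
    """Return the state maximising P(state) * prod_j P(prediction_j | state)."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_state, best_score = None, -math.inf
    for state in STATES:
        # Laplace smoothing (+1) keeps unseen combinations from zeroing the product.
        score = math.log((class_counts[state] + 1) / (total + len(STATES)))
        for j, predicted in enumerate(row):
            count = feature_counts[j][state][predicted]
            score += math.log((count + 1) / (class_counts[state] + len(STATES)))
        if score > best_score:
            best_state, best_score = state, score
    return best_state
```

With six base classifiers each row would have six entries; the independence assumption of naïve Bayes means each classifier contributes one factor per state.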

Page 71: Protein Secondary Structure Prediction  PSSP

Measures for secondary structure prediction accuracy

• http://cubic.bioc.columbia.edu/eva/doc/measure_sec.html (for more information)

• Q3: three-state prediction accuracy (percentage of correctly classified residues)

• Qi%obs: how many of the residues observed in state i were correctly predicted?
• Qi%prd: how many of the residues predicted in state i were correctly predicted?

• Sov: per-segment accuracy
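The per-residue measures can be computed from an observed and a predicted state string of equal length. The sketch below follows the verbal definitions given above; the function names are hypothetical, and the per-state functions assume the state occurs at least once (otherwise the percentage is undefined). Sov is omitted because its segment-overlap definition is more involved.

```python
def q3(observed, predicted):
    """Percentage of residues whose predicted state equals the observed state."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def q_obs(observed, predicted, state):
    """Of the residues observed in `state`, the percentage predicted as `state`."""
    hits = sum(o == p == state for o, p in zip(observed, predicted))
    return 100.0 * hits / observed.count(state)


def q_prd(observed, predicted, state):
    """Of the residues predicted as `state`, the percentage observed in `state`."""
    hits = sum(o == p == state for o, p in zip(observed, predicted))
    return 100.0 * hits / predicted.count(state)
```

In classification terms, Qi%obs is the per-state recall and Qi%prd the per-state precision.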

Page 72: Protein Secondary Structure Prediction  PSSP

Experimental results (CB513)

Metaclassifier I: learning with HS1771 and validated with CB513
Metaclassifier II: learning with CB513 and validated with CB513 (leave one out)

Server             Q3      Qh%obs  Qe%obs  Ql%obs  Qh%prd  Qe%prd  Ql%prd  Ch     Ce     Cl     SOVh    SOVe    SOVl    SOV
Metaclassifier I   81.74   87.62   79.02   78.69   84.46   77.04   82.03   0.79   0.72   0.65   85.06   78.04   68.02   74.59
Metaclassifier II  81.75   87.65   79.04   78.68   84.45   77.08   82.03   0.79   0.72   0.65   85.26   78.08   68.05   74.6
PSIPRED V2.4       79.946  83.521  70.313  82.16   84.666  80.145  76.365  0.758  0.684  0.627  78.977  69.806  73.419  76.48
PHD Expert         77.609  78.921  73.32   78.82   85.581  72.001  74.842  0.734  0.646  0.587  73.355  69.83   71.265  74.98
Prof               77.127  74.192  74.227  81.026  88.645  71.144  73.104  0.726  0.644  0.582  68.578  69.828  72.243  73.743
GOR                55.357  56.205  47.526  58.819  55.91   43.023  62.552  0.328  0.281  0.328  51.451  51.869  50.885  51.595
SOPM               66.844  69.784  57.238  69.557  69.293  58.805  68.978  0.534  0.459  0.461  65.305  62.695  62.394  64.345
SSPro              79.066  82.723  66.898  82.557  85.9    78.428  74.529  0.762  0.652  0.609  76.217  68.107  72.27   74.393
JPred              73.368  65.21   56.182  89.074  89.409  77.011  65.401  0.67   0.578  0.54   64.647  60.998  69.161  68.033

Page 73: Protein Secondary Structure Prediction  PSSP

Experimental results (HS1771)

Server             Q3      Qh%obs  Qe%obs  Ql%obs  Qh%prd  Qe%prd  Ql%prd  Ch     Ce     Cl     SOVh    SOVe    SOVl    SOV
Metaclassifier I   79.97   85.88   76.57   77.0    83.97   74.02   79.78   0.77   0.68   0.62   82.16   76.01   67.2    72.54
Metaclassifier II  79.98   85.91   76.61   76.98   83.95   74.02   79.81   0.77   0.68   0.62   82.17   76.08   67.16   72.53
PSIPRED V2.4       78.899  83.4    69.18   80.289  83.339  78.191  75.722  0.743  0.665  0.61   77.863  71.533  71.649  74.622
PHD Expert         77.372  80.012  72.872  77.554  84.726  71.576  74.805  0.732  0.641  0.581  72.811  72.918  70.22   73.442
Prof               74.574  71.56   71.011  78.952  86.503  67.267  70.93   0.689  0.598  0.545  65.324  69.955  69.429  69.742
GOR                54.314  56.666  45.278  57.131  54.43   42.343  61.488  0.306  0.269  0.312  52.154  54.405  49.6    50.213
SOPM               65.663  70.345  54.847  67.484  67.404  57.523  68.285  0.513  0.439  0.445  64.377  64.281  60.809  62.522
SSPro              78.042  82.646  66.262  80.438  83.836  76.833  74.167  0.742  0.639  0.593  74.702  70.086  69.809  71.779
JPred              71.682  63.565  54.575  87.516  87.037  73.131  64.399  0.639  0.545  0.52   62.78   63.234  67.818  65.908
SAM-T02            77.543  84.391  74.965  73.204  81.412  72.163  77.137  0.733  0.657  0.577  75.536  73.561  68.154  73.048
Predator           66.289  65.421  45.834  77.826  71.002  64.425  63.897  0.519  0.439  0.45   63.642  54.731  61.947  61.075

Metaclassifier I: learning with HS1771 and validated with HS1771 (leave one out)
Metaclassifier II: learning with CB513 and validated with HS1771

Page 74: Protein Secondary Structure Prediction  PSSP

FUTURE research

• In-depth analysis of the results already obtained
• Improve the metaclassifier (we have only a naïve Bayes model at this stage)
• Dependence of results on the choice of classifiers: we plan to use a Rough Set model classifier to find an optimal subset of classifiers
• Study the dependence of results on the choice of testing and training sets