Protein Prediction

7/30/2019 Protein Prediction

1/100

Protein Secondary Structure PredictionPSSP


2/100

Proteins

Protein: from the Greek word PROTEUO which means "tobe first (in rank or influence)"

Why are proteins important to us:

Proteins make up about 15% of the mass of the average person

and maintain the structural integrity of the cell.

Enzyme acts as a biological catalyst

Storage and transportHaemoglobin

Antibodies

Hormones Insulin


3/100

Introduction to proteins

Peptide

Bond


4/100

Four levels of protein

structure


5/100

Conformational Parameters of secondary structure of

a protein

Dihedral Angles/torsion angles/rotation angles

=phi= N-C

=psi= C-C

= omega=C-N


6/100

These values can be calculated?

Ramachandran Plot: Founded by G.N.Ramachandran.

Green region indicates the

stericially permitted & values except Gly and Pro.

Yellow circles represent the

conformational angles of several

secondary structures..-helix,parallel & anti parallel -sheet


7/100

Helices


8/100

Helices

H: - helix

G: 310 helix

I: - helix (extremely rare)


9/100

Hydrogen bondingi-i+3th i-i+4th i-i+5th


10/100

Secondary Structure

8 different categories (DSSP): H: - helix

G: 310 helix

I: - helix (extremely rare)

E: - strand

B: - bridge

T: - turn

S: bend

L: the rest

(pitch 5.4 A0)

1.5 A0


11/100

Three secondary structure states

Prediction methods are normally trained andassessed for only 3 states (residues):

H (helix), E (strands) and L (coil)

There are many published 8-to-3 states reduction

methods Standard reduction methods are defined by

programsDSSP (Dictionary of SS of Proteins),STRIDE, and DEFINE

Improvement of predictive accuracy of different SSP(Secondary Structure Prediction) programs depends

on the choice of the reduction method


12/100

Protein Secondary Structure Prediction

Techniques for the prediction of protein secondarystructure provide information that is useful both in

ab initio structure prediction and

as an additional constraint for fold-recognition

algorithms. Knowledge of secondary structure alone can help the

design of site-directed or deletion mutants that will

not destroy the native protein structure.

For all these applications it is essential that thesecondary structure prediction be accurate, or at

least that, the reliability for each residue can be

assessed.


13/100

Protein Secondary Structure Prediction If a protein sequence shows clear similarity to a

protein of known three dimensional structure, then

the most accurate method of predicting the

secondary structure is to align the sequences by

standard dynamic programming algorithms, as the

homology modelling is much more accurate thansecondary structure prediction for high levels of

sequence identity.

Secondary structure prediction methods are of most

use when sequence similarity to a protein of knownstructure is undetectable.

It is important that there is no detectable sequence

similarity between sequences used to train and test

secondary structure prediction methods.


14/100

Protein Secondary Structure

Secondary Structure

Regular

Secondary

Structure

( -helices, -

sheets)

Irregular

Secondary

Structure

(Tight turns,

Random coils,

bulges)


15/100

PSSP Algorithms

There are three generations in PSSP

algorithms

Early/First Generation: based onstatistical/rule basedinformation of single

aminoacids

Second Generation: based on windows(segments) of aminoacids. Typically a window

containes 11-21 aminoacids

Third Generation: based on the use of

windows on evolutionary information


16/100

PSSP: First Generation

First generation PSSP systems are based on

statistical information on a single aminoacid

The most relevant algorithms:

Chow-Fasman, 1974 (Statistics based)

GOR, 1978 (Rule based)

Both algorithms claimed 74-78% of predictive

accuracy, but tested with better constructed datasetswere proved to have the predictive accuracy ~50%

(Nishikawa, 1983)


17/100

PSSP: Second Generation

Based on the information contained in awindow of aminoacids (11-21 aa.)

The most systems use algorithms based on: Statistical information

Physico-chemical properties

Sequence patterns

Multi-layered neural networks

Graph-theory Multivariante statistics

Expert rules

Nearest-neighbour algorithms

No Bayesian networks


18/100

PSSP: Second Generation

Main problems:

Prediction accuracy


19/100

PSSP: Third Generation

PHD: First algorithm in this generation (1994) Evolutionary information improves the prediction accuracy to

72%

Use of evolutionary information:

1. Scan a database with known sequences with alignmentmethods for finding similar sequences

2. Filter the previous list with a threshold to identify themost significant sequences

3. Build aminoacid exchangeprofiles based on theprobable homologs (most significant sequences)

4. The profiles are used in the prediction, i.e. in buildingthe classifier


20/100

PSSP: Third Generation

Many of the second generation algorithms have been

updated tothird generation The most important algorithms of today

Predator: Nearest-neighbour

PSI-Pred: Neural networks

SSPro: Neural networks

SAM-T02: Homologs (Hidden Markov Models)

PHD: Neural networks

Due to the improvement of protein information indatabases i.e. better evolutionary information, todays

predictive accuracy is ~80%

It is believed that maximum reachable accuracy is

88%


21/100

First Generation PSSP

Two classical methods that use previously

determined propensities:

Chou-Fasman

Garnier-Osguthorpe-Robson


22/100

Chou-Fasman method

Uses table of conformational parameters

(propensities) determined primarily frommeasurements of secondary structure.

Frequency of amino acid X observed in element Y

Frequency of element Y in database


23/100


24/100


25/100

Designations:H = Strong Former,

h = Former,

I = Weak Former,

i = Indifferent,

B = Strong Breaker,

b = Breaker;

P = Conformational

Parameter


26/100

The Chou-Fasman method

If you were asked to determine whether an amino

acid in a protein of interest is part of a -helix or -

sheet, you might think to look in a protein database

and see which secondary structures amino acids insimilar contexts belonged to.

The Chou-Fasman method (1974) is a combination of

such statistics-based methods and rule-basedmethods.


27/100

Steps of the Chou-Fasman algorithm:

1. Calculate propensities from a set of solvedstructures. For all 20 amino acids i,calculate these

propensities by:

)(

)/(

)(

)/(

)(

)/(

iP

TurniP

iP

BetaiP

iP

HelixiP

Propensities > 1 mean that the residue type I is likely to be found in theCorresponding secondary structure type.


28/100

Amino Acid -Helix -Sheet TurnAla 1.29 0.90 0.78

Cys 1.11 0.74 0.80

Leu 1.30 1.02 0.59

Met 1.47 0.97 0.39

Glu 1.44 0.75 1.00

Gln 1.27 0.80 0.97

His 1.22 1.08 0.69Lys 1.23 0.77 0.96

Val 0.91 1.49 0.47

Ile 0.97 1.45 0.51

Phe 1.07 1.32 0.58

Tyr 0.72 1.25 1.05

Trp 0.99 1.14 0.75

Thr 0.82 1.21 1.03

Gly 0.56 0.92 1.64

Ser 0.82 0.95 1.33

Asp 1.04 0.72 1.41

Asn 0.90 0.76 1.23

Pro 0.52 0.64 1.91

Arg 0.96 0.99 0.88

Chou and Fasman

Favors

-Helix

Favors

-strand

Favors

turn


29/100

Chou and Fasman

Predicting helices:- find nucleation site: 4 out of 6 contiguous residues with P()>1

- extension: extend helix in both directions until a set of 4 contiguous

residues has an average P() < 1 (breaker)

- if average P() over whole region is >1, it is predicted to be helical

Predicting strands:

- find nucleation site: 3 out of 5 contiguous residues with P()>1

- extension: extend strand in both directions until a set of 4 contiguousresidues has an average P() < 1 (breaker)

- if average P() over whole region is >1, it is predicted to be a strand


30/100

2. Once the propensities are calculated, each aminoacid is categorized using the propensities as one of:

Each amino acid is also categorized as one of:

helix-former,

helix-breaker, or

helix-indifferent.

(That is, helix-formers have high helical propensities, helix-breakershave low helical propensities, and helix-indifferent haveintermediate propensities.)

sheet-former,

sheet-breaker, or

sheet-indifferent.

For example, it was found (as expected) that glycine and prolines

are helix-breakers.


31/100

3.Find nucleation sites.

These are short subsequences with a high-concentration ofhelix-formers (or sheet-formers).

These sites are found with some heuristic rule (e.g. a sequenceof 6 amino acids with at least 4 helix-formers, and no helix-breakers").

4. Extend the nucleation sites, adding residues at theends, maintaining an average propensity greater thansome threshold.

5. Step 4 may create overlaps; Finally, we deal withthese overlaps using some heuristic rules.

Th GOR th d


32/100

The GOR method(Garnier, Osguthorpe, Robson)

Position-dependent propensities for helix, sheet or turn is calculated for each amino

acid. For each position j in the sequence, eight residues on either side areconsidered.

A helix propensity table contains information about propensity for residues at 17

positions when the conformation of residue j is helical. The helix propensity tables

have 20 x 17 entries.

Build similar tables for strands and turns.

GOR simplification:

The predicted state of AAj is calculated as the sum of the position-dependent

propensities of all residues around AAj.

GOR can be used at : http://abs.cit.nih.gov/gor/(current version is GOR IV)

j
http://abs.cit.nih.gov/gor/http://abs.cit.nih.gov/gor/


33/100

Suppose aj is the amino acid that we are trying tocategorize.

GOR looks at the residues

Intuitively, it assigns a structure based on probabilities it has

calculated from protein databases. These probabilities are of the

form


34/100

Accuracy

Both Chou and Fasman and GOR have been

assessed and their accuracy is estimated to

be Q3=60-65%.

(initially, higher scores were reported, but theexperiments set to measure Q3 were flawed,

as the test cases included proteins used toderive the propensities!)


35/100

Nearest Neighbour Method

This method depends on the spatial structure of

central residues in each window having kowledge of

its neighbours.

This methods is hence called as Memory/Homology

based method.

Steps


36/100

Steps Computes MS Algorithm

Computes the distance between homologous

sections

Example : NNSSP

Training Data

Test data

New data

NN Tree(Binary search tree)

Validation

Prediction

Actual NN

tree with data

Modelling scheme of the nerest-neighbour method

SDV Hyperplanes

DS manipulation


37/100

Neural Networks

Single sequence methods - train network using sets

of known proteins of certain types (all alpha, all beta,

alpha+beta) then use to predict for query sequence

NNPREDICT (>70% accuracy)


38/100

NEURAL NETWORKS


39/100

Inspired by the brain

Traditional computers struggle to

recognize and generalize patterns

of the past for future actions

Brain as an information processing

system contains 10 billion nerve

cells or neurons and each neuron isconnected to other neuron through

about 10,000 synapses


40/100

Brain

Interconnected networkof that

collect, process anddisseminate electricalsignals via

Neurons

Synapses

Neural Network

Interconnected network of

units (or nodes) that

collect, process anddisseminate values via

links

Nodes

Links


41/100

Scheme for modeling Neural Network:

Building a random network Training the net on a training set


42/100

Building a random network

random selection of the type of node

random selection of the parameters of the node

random selection of the number of the inputs

connecting the inputs and outputs until the netis larger

running the training set over the net

selecting the proper outputremoval of all nodes which do not contribute to

the output


43/100

Training the network

The general idea behind this is to run the net on the

training set, and every time it gives a right answer

leave it as it is and or strengthen the path. Every time

it makes a mistake penalize it. This can be done inseveral ways to incrementally modify the net:

alter the parameters

add/delete connections

add/delete the nodes with their connections


44/100

T i l th d l d t t i f d f d


45/100

Typical methodology used to train a feed-forwardnetwork for secondary structure prediction is based onQian and Sejnowski, 1988

Alanine: 100000000

Helix : 100000000


46/100

Different Network Topologies

Single layer feed-forward networks

Input layer projecting into the output layer

Input Output

layer layer

Single layer

network


47/100


Multi-layer feed-forward networks

One or more hidden layers. Input projects only

from previous layers onto a layer.

Input Hidden Output

layer layer layer

2-layer or

1-hidden layer

fully connected

network


48/100


Recurrent networks

A network with feedback, where some of its

inputs are connected to some of its outputs(discrete time).

Input Output

layer layer

Recurrent

network


49/100

Features

A typical training set consists of 100 non-homologous protein chains (15,000 training patterns)

A net with an input window of 17, five hidden nodes

in a single hidden layer and three outputs will have

357 input nodes and 1,808 weights.

Predictions are made on a winner-takes-all basis


50/100

PHD (Profile network from HeiDelberg)

A program with several cascading neural networks. The method employs a two-layered feed-forward

neural network

Sequence to structure (First layer)

Structure to Structure (Second layer)

> 70%

Window length is 13, and for every position in the window

frequencies for 20 aa is calculated (20x13)

Based on this the OUTPUT is a probability of the three

possible classes (H,E and C)


51/100

Evaluating Prediction Efficiency

Jackknief test:

Percentage of correctly classified residues:

Correlation coefficient for each target class:


52/100

Prediction of Transmembrane helices

Two Ways:

One is solely based on the construction principles of

proteins associated with physico-chemical properties

of amino acids. No concept of training is involved.

The other is to collect data sets with known

structures, extract features and use machine learning

algorithms for predictions.


53/100

Outline

1. Importance of Transmembrane Proteins

2. General Topologies

3. Methods (and challenges) for Structural

Studies of TM Proteins


54/100

Eukaryotic cells have many membranes


55/100

Transmembrane Proteins

v Cellular roles include:Communication between cells

Communications between organelles and cytosolIon transport, Nutrient transport

Links to extracellular matrixReceptors for viruses

Connections for cytoskeletonv Over 25% of proteins in complete genomes.

v Key roles in diabetes, hypertension, depression,arthritis, cancer, and many other common diseases.v Targets for over 75% of pharmaceuticals.


56/100

Transmembrane Proteins

v Cellular roles include:Communication between cells

Communications between organelles and cytosolIon transport, Nutrient transportLinks to extracellular matrix

Receptors for virusesConnections for cytoskeleton

v Over 25% of proteins in complete genomes.v Key roles in diabetes, hypertension, depression,

arthritis, cancer, and many other common diseases.v Targets for over 75% of pharmaceuticals.

However, very few TM protein structures have been solved!

Bi l i l M b Li id Bil


57/100

Biological Membrane = Lipid Bilayer

Approximately 30 thickHydrophobic core + Hydrophilic or charged headgroups

Mixture of lipids that vary in type of head groups, lengths of acyl chains,number of double bonds

(Some membranes also contain cholesterol)

M b Bil ith P t i


58/100

Membrane Bilayer with Proteins

In order to be stable in this environment, a polypeptide chain needs to

(1) contain a lot of amino acids with hydrophobic sidechains, and

(2) fold up to satisfy backbone H-bond propensity - How?


59/100

Structure Solution #1: Hydrophobic alpha-

helix

Satisfies polypeptide

backbone hydrogen

bonding Hydrophobic

sidechains face

outward into lipids


60/100

Examples of Helix Bundle TM Proteins

PDB = 1QHJ PDB = 1RRC

Single helix or helical bundles (> 90% of TM proteins)Examples: Human growth hormone receptor, Insulin receptor

ATP binding cassette family - CFTRMultidrug resistance proteins

7TM receptors - G protein-linked receptors


61/100

Structure solution #2

Beta-barrel

Beta sheet satisfies

backbone hydrogen

bonds between

strands

Wrap sheet around

into barrel shape

Sidechains on theoutside of the barrel

are hydrophobic

E l f B t B l TM P t i


62/100

Examples of Beta Barrel TM Proteins

PDB = 1EK9 PDB = 2POR

Beta barrels - in outer membrane of gram negative bacteria,and some nonconstitutive membrane acting toxins

Examples: Porins

General Topologies of TM Proteins


63/100

General Topologies of TM Proteins

Single helix or helical bundles and Beta barrelsBoth topologies result in

hydrophobic surfaces facing acyl chains of lipidsPart protruding from membrane can be a very short sequence (a few

amino acids), a loop, or large, independently folding domains


64/100

Presence of Hydrophobic TM

Domain can result in:Low levels of expressionDifficulties in solubilization

Difficulties in crystallization

Attempting crystallization and structure solution of

transmembrane proteins is considered difficult and

risky.

Diffi lt d i k b t till ibl


65/100

Difficult and risky, but still possible:TM Proteins of Known Structure

Great summary and resource:http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html

Bacteriorhodopsin, RhodopsinPhotosynthetic reaction centers

PorinsLight harvesting complexes

Potassium channelsChloride channelsAquaporin

TransportersEtc.

**Although few in number, each of these structureshave been important for addressing key functions.***


66/100

Protein Folding Problem

How does a one-dimensional amino

acid sequence determine a specific

three-dimensional structure?

Or

How can we read the sequence andpredict that structure?


67/100

General Idea

We know what an alpha-helix or a beta strand

looks like, so

(1) figure out which parts of the sequence arehelices and which parts are strands

(2) figure out how they pack together

For soluble proteins, neither is well predicted.

But for transmembrane proteins ...

TM P t i St t P di ti St #1


68/100

TM Protein Structure Prediction, Step #1For alpha-helical transmembrane proteins, hydropathy plot analysisprovides a fairly accurate method to predict which amino acids form

membrane-spanning helices

We can model the structure of anindividual alpha helix fairlyaccurately.


69/100

TM Protein Structure Prediction, Step #2

How do the helices pack in the membrane?There are several labs studying known protein structures to identify factorsinvolved in determining how transmembrane helices pack together(specificity of interaction and packing motifs)

Hydrogen bonds

HydrophobiciityAmino acids known to face the lumen of a channelMultiple sequence alignments

Helix packing sequence motifs, etc.These kinds of information are then combined with protein docking and

energy minimization programs to predict how the helices pack together.

It is quite possible that studies of helical transmembrane proteins couldlead to key information about the protein folding problem - how to predictprotein structure from amino acid sequence


70/100

collect data sets with known

Build Model

Align using clustralW/hmmalign

Hmmbuild

Hmmsearch

Calculate threshold

Score distribution and Training data (NN)Validation

Prediction using Machine learning techniques


71/100

Summary Transmembrane Proteins play many important processes in

cellular processes in both health and disease

Two general type of tertiary structure are found to cross themembranes: beta-barrels and alpha-helices

Structural Studies of TM Proteins are impeded by difficulties in

overexpression, purification and crystallization However, the few dozen structures that have been determined

have provided key information about channels (gating,selectivity, etc.), energetics, transport, and othertransmembrane processes

Analysis of helical transmembrane protein structures may leadto accurate predictions of protein structure from amino acidsequence for this type of protein


72/100

Prediction of protein conformations

from protein sequences

(3D Prediction)

Protein Conformations


73/100

73

Predict protein 3D structure from (amino acid) sequence

Sequence secondary structure 3D structure

function


74/100

74


75/100

75

Protein 3D Structure Detection

X-ray Crys

NMR

Expensive

Slow


76/100

76

Protein Structure

Protein 3D structure biological function

Lock & key model of enzyme function (docking)

Folding problem

protein sequence 3D structure

Structure prediction and alignment

Protein design, drug design, etc

The holy grailof bioinformatics


77/100

77

The Prediction Problem

Can we predict the final 3D protein structure knowing

only its amino acid sequence?

Studied for 4 Decades

Primary Motivation for Bioinformatics

Based on this 1-to-1 Mapping of Sequence to

Structure

Still very much an OPEN PROBLEM


78/100

78


79/100

79

Predicting Protein Structure

Goal Find best fit of sequence to 3D structure

Comparative (homology) modeling

Construct 3D model from alignment to protein sequences

with known structure Threading (fold recognition)

Pick best fit to sequences of known 2D / 3D structures (folds)

Ab initio / de novo methods

Attempt to calculate 3D structure from scratch Molecular dynamics

Energy minimization

Lattice models


80/100

80

PSP: Goals

Accurate 3D structures. But not there yet.

Good guesses

Working models for researchers

Understand the FOLDING PROCESS

Get into the Black Box

Only hope for some proteins

25% wont crystallize, too big for NMR Best hope for novel protein engineering

Drug design, etc.


81/100

81

PSP: Major Hurdles

Energetics We dont know all the forces involved in detail

Too computationally expensive BY FAR!

Conformational search impossibly large

100 a.a. protein, 2 moving dihedrals, 2 possible positions for

each diheral: 2200

conformations!

Levinthals Paradox

Longer than time of universe to search

Proteins fold in a couple of seconds??

Multiple-minima problem


82/100

82

Tertiary Structure Prediction

Major Techniques

Comparative Modeling

Homology Modeling

Threading

Template-Free Modeling

De novo/ab initioMethods

Physics-Based

Knowledge-Based

Homology Modeling


83/100

83

Homology Modeling


84/100

84

Steps

Template selection

Target template alignment

Model building Evaluation

Repeated until a satisfactory

model structure is achieved


85/100

85

Threading

a library of protein folds (templates)

a scoring function to measure the fitness of a

sequence -> structure alignment

a search technique for finding the best alignmentbetween a fixed sequence and structure

a means of choosing the best fold from among the

best scoring alignments of a sequence to all possible

folds


86/100

86

ab initioMethods

The ab initio approach (Figure 6.25) ignores


87/100

87

pp ( g ) g

sequence homology and attempts to predict the

folded state from fundamental energetics orphysicochemical properties associated with the

constituent residues. This involves modelling

physicochemical parameters in terms of forcefields that direct the folding. These constraints will

reflect the energetics associated with charge,

hydrophobicity and polarity with the aim being tofind a single structure oflow energy.

How to define the energy of a PROTEIN?


88/100

88

How to define the energy of a PROTEIN?

How to find the conformations for which the energy is

minimum?

This approach is based on the thermodynamic argument that

the native structure of a protein is the global minimum in

the free energy profile.

Generally the results are expressed as rmsd (root mean

square deviations), reflecting the difference in positions

between corresponding atoms in the experimental and

calculated (predicted) structures.


89/100

89

What make global minimum?

1) Semi imperical potential function

Which is calculated as the sum of all the possible pair

wise interactions between the atoms in the molecule

Eg: AMBER force field

Evaluates the N(N-1)/2 atom-atom interaction which

requires computational time N square.

Most of the computational methods look for the

Global minimum??

SEMI IMPERICAL POTENTIAL FUNCTION


90/100

90

structure structure

E

N

ER

G

Y

E

N

E

R

G

Y

SCHEMATIC DIAGRAM OF DIFFERENT TYPES OF

ENERGY FUNCTIONS

Global Minimum Global Minimum

N(N-1)/2 Atom-atom Interaction

Eg : AMBER Force Fields


91/100

To over come this reduce the resolution at which the

potential function is calculated.

Instead of atom-atom potential The United atompotential would be an approximation.

This approximation is also called as a Pseudo atom


92/100

92

2) UNRES force field (Residues with Solvent)

UNRES was originally designed and parameterizedto locate native-like structures of proteins as thelowest in potential energy by unrestricted globaloptimization.

Propensities of amino acids calculated.

the backbone is represented as a sequence of -carbon (C) atoms linked by virtual bonds designatedas dC, with united peptide groups (ps) in theircenters.


93/100

93

Molecular Dynamics


94/100

94

Computation of dynamics or motion of a ptn.

1) The physical forces which influence the folding

process are well represented by the semi empirical

force fields.

2) The atoms of the ptn move independently uponinduction of the force fields..

Calculated by Newtons laws of motion:-

F=ma (calculated for each fematoseconds)..


95/100

Motion influenced by temperature??


96/100

96

Stimulated annealing:-

Higher temperatures the motion will be greater in ashorter period of time.

1) Initial global minimum of the atom is calculated first.

2) Temperature is raised to 3000K for a fewFemtoseconds and then gradually lowered to roomtemperature 300K

The idea is to even if you start to predict the protein inwith wrong structure (at 3000K) final corrections willlead to conformation.

Molecular dynamics Optimization tool


97/100

97

energy

Conformational space

Trajectory

Oth h i l ti


98/100

98

Other such simulations are ..

MONTE CARLO SIMULATIONS

CONFORMATIONAL SPACE ANNEALING

ROSETTA ALGORITHM

CASP (Critical assessment of structure

prediction)

LEVINTHAL PARODOX etc

ROSETTA ALGORITHM


99/100

Break target sequence into fragments of 9 amino acids

Create profile , X, for target

Create profile, S, for similar PDB sequences

Align profiles X, S to get best match fragment

Use fragments as starting point for optimisation, using:

- hydrophobic burial

- polar side-chain interactions

- hydrogen bonding between beta-strands

- hard sphere repulsion (van der Waals)

Create 1000 structures, and

Choose cluster centre as the

best prediction


100/100

Protein Prediction

Documents

Transcript of Protein Prediction