Protein folding


Page 11: Protein folding

Structural Biology

Page 12: Protein folding

Outline

How genetics encodes structure.

What makes a protein fold.

Role of biological function on preserving a fold.

Comparing two structures for similarities.

Page 13: Protein folding

Genetic information and proteins

3D information is encoded into (1D) sequences.

STKKKPLTQEQLEDARRLKA IYEKKKNELGLSQESVADKM GMGQSGVGALFNGINALNAY NAALLAKILKVSVEEFSPSIAREIYEMYEA

Protein structure of the λ repressor (cI) from phage Lambda, PDB: 1LMB

?

Page 14: Protein folding

Genetic information and proteins

The encoding can only be indirect

Because there is nothing in the DNA

that tells each amino acid where to go.

Page 15: Protein folding

Genetic information and proteins

However, a few types of physical interactions dominate the process of protein folding.

Page 16: Protein folding

Amino acids

Components

Main chain

Side chains

Side chains

Responsible for the “name”.

Can be clustered based on:

- chemical properties

- structure

This ultimately determines the evolutionary interchangeability.

Page 17: Protein folding

Protein folding

Van der Waals forces

The electron clouds around the nuclei are more

stable if they can lightly interact with other electron clouds.

Makes atoms sticky relative to each other.

Page 18: Protein folding

Protein folding

Electrostatic forces

Long range interactions.

Pull/Push over longer distances.

Page 19: Protein folding

Protein folding

Hydrogen bonds

Electrostatic. Short range, not flexible

Can be seen as the velcro holding proteins

together.

Page 20: Protein folding

Protein folding

Hydrophobic interactions

Water molecules in the liquid pack so as to minimize their energy.

This implies that water molecules are, more often than not, hydrogen-bonded to their neighbors.

Page 21: Protein folding

Protein folding

Hydrophobic interactions

If you introduce a droplet of oil in solution, many hydrogen bonds will have to be broken at the interface, at an energy cost.

This is why hydrophobic and hydrophilic groups look like they are avoiding

each other.

Page 22: Protein folding

Protein folding

During folding,

The polypeptide has to follow a strict sequence of events in order to find the correct conformation in a timely fashion.

Page 23: Protein folding

Protein folding

Secondary Structures

Stable because of local H-bonds.

Form larger blocks with fewer degrees of freedom.

Page 24: Protein folding

Protein folding

Geometry plays a very important role.

Because there are only a few angles that can

change along the backbone, there is a

limited number of ways a protein can

fold onto itself.

Page 25: Protein folding

Protein structures are organized in a Hierarchical fashion

Secondary structures - Geometry

Dihedral angle

Because most main-chain atoms are constrained in a planar amide (peptide) bond, the entire trajectory of the chain can be defined by a pair of angles for each amino acid: $(\phi, \psi)$.

This can be represented with a “Ramachandran plot”, from which it is obvious that some kind of clustering is going on.
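A minimal sketch of how one backbone dihedral angle can be computed from four consecutive atom positions. For the first angle of the pair (conventionally φ) the atoms would be C(i-1), N, Cα, C; for the second (ψ) they would be N, Cα, C, N(i+1). The function name and the plain-tuple coordinate format are assumptions made for illustration, and the sign follows a commonly used atan2 formulation.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle, in degrees, defined by four points in 3D."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)    # central bond: the rotation axis
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal to the plane (p0, p1, p2)
    n2 = _cross(b1, b2)  # normal to the plane (p1, p2, p3)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

Computing this pair of angles for every residue of a structure gives the points that would be plotted on a Ramachandran plot.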

Page 26: Protein folding

Protein structures are organized in a Hierarchical fashion

Secondary structures – The alpha helix

The hydrogen bond

Again, a helix is an ideal setup to place our “velcro” H-bonds always at the right place.

Periodicity

To the delight of statisticians and computer scientists.

Page 27: Protein folding

Protein structures are organized in a Hierarchical fashion

Secondary structures – The beta strand (beta sheets)

Another periodic pattern

Responsible for superstructure rigidity and some truly amazing patterns.

Page 28: Protein folding

Protein structures are organized in a Hierarchical fashion

Secondary structures – The myth of “random” coil.

Random structures in proteins are extremely rare.

Many use the expression anyway to refer to the “rest” of the protein.

Other minor secondary structures

Turns, loops, bridges. These do not have the critical periodicity found in α and β structures.

Page 29: Protein folding

Protein structures are organized in a Hierarchical fashion

Tertiary structures – The reason to care about secondary structures.

Secondary structures are building blocks

Detecting and predicting secondary structures is a key process in structural biology.

Other uses

Visualization, classification…

Page 30: Protein folding

Protein Diversity

The current release of PDB contains 28,000

structure entries.

26,000 are proteins

There are an estimated 600-8000 possible unique protein folds.

http://www.jacquesdeshaies.com/expositions/virtual/new-virtual/uppsala-invit.html

Page 31: Protein folding

PDB

Overview

Repository of structures

Proteins, nucleotides, complexes, mutants

Quality improves over time

Data validation tools are getting better. More redundant structures are available for cross-reference.

Page 32: Protein folding

Small number of folds

Does this mean that all proteins come from a small set of ancestor molecules?

Perhaps, but not necessarily.

Page 33: Protein folding

Protein folding

Process of folding

Modeling the process of folding

Evolution vs. folding

Impact of function on protein evolution

Page 34: Protein folding

Process

Local Interactions

Secondary Structure Elements (SSE)

Assembly of SSE

Equilibrium Structure

Page 35: Protein folding

Protein folding

http://www.blueprint.org/proteinfolding/trades/details/trades_movies.html

Page 36: Protein folding

Protein folding

Important thing to note

It is possible that residues that are not doing anything in the folded protein were

actually critically important to get the peptide folded in the

first place.

Page 37: Protein folding

Protein folding

Simulation studies are demonstrating that the most common protein folds are those that can withstand the most sequence variation over time without affecting their topologies. The prion protein is a poster-child example of the opposite.

Page 38: Protein folding

Protein Evolution

Evolutionary meaning

Most common folds are those able to

withstand point mutations the best.

These are known as designable folds.

Page 39: Protein folding

Protein folding

Marginal stability

The most stable folds are not necessarily those with the lowest energy, but those that maximally penalize switching to an alternative conformation.

Page 40: Protein folding

Protein Evolution

Marginal stability

Evolutionary implication(s)

There is thus selective pressure on residues in a protein not only to maintain important interactions, but also to make sure that some interactions NEVER happen.

Page 41: Protein folding

Summary

Proteins fold into energetically stable conformations.

For one chain, there are a large number of possible conformations, however.

The biological conformation is selected during folding: not necessarily the “best” conformation.

Page 42: Protein folding

Role of biology on structures

A few examples using mapping of rate of evolution.

The fitness of a protein is ultimately its biological function, not its structure.

We’ll have a look at their structural requirements.

Page 43: Protein folding

Fast Slow

Maximum-Likelihood Site-Rates are Biologically Relevant

Rhodopsin-like G-protein receptors

Pfam (dataset 1Tml_7) 69 taxa

Page 44: Protein folding

Maximum-Likelihood Site-Rates are Biologically Relevant

Tubulin

34 taxa 33 taxa

The constraints imposed by co-evolution far outweigh the

structural constraints.

Fast Slow

Page 45: Protein folding

Phylogenetic mapping of structures

Predicting rates of evolution

This experiment was conducted to see if we could predict the rate of evolution in

the enzyme Enolase.

Page 46: Protein folding

Phylogenetic mapping of structures

Predicting rates of evolution

The most important factor to predict

evolutionary constraints was the

presence of the active site.

Evolutionarily constrained by the active site.

Page 47: Protein folding

Summary

Structures are rigid templates to provide some biological function.

It takes a lot of structure to position a few atoms in an enzyme.

Page 48: Protein folding

Structural Homology

Because 1 structure is made of thousands of coherent interactions:

The probability to see a new structure emerge from a random sequence is close to 0.

Therefore: similar structures are likely to be homologous.

Page 49: Protein folding

Use of structural similarity in evolutionary studies

Homology can be detected via sequence identity

Structures drift at a much slower rate. In fact, do they drift at all?

Structural similarity can be used to detect homology, although there is evidence that convergence is much more common in structure than in sequence.

Page 50: Protein folding

Structural Convergence

There are only so many ways to fold a dozen secondary structure elements.

Some folds are much more likely to evolve because they are more robust to mutations.

Designability

Page 51: Protein folding

Protein Similarity

VAST

Aligns secondary structure elements only.

Considers the geometric transformation that brings as many helices and strands together as possible.

CE

Breaks down each structure into peptides of 8 residues. Finds the best match against a reference protein. The final alignment is the transformation that aligns as many continuous residues as possible.

Page 52: Protein folding

Comparing and aligning structures

Expanding into detection methods

What about remote, yet significant, similarities?

Example on the right

There is a significant similarity between a single domain in two distinct proteins (yellow and orange).

Are they homologous?

Page 53: Protein folding

Comparing and aligning structures

Difficulties in aligning structures.

In some cases, the order of the elements that superimpose has been shuffled by circular permutation.

There are many cases of structurally similar proteins with no more than a random degree of identity at the sequence level.

Page 54: Protein folding

Comparing and aligning structures

VAST (Vector Alignment Search Tool)

Probably the most used service for protein structure alignment, since it runs off the NCBI web site and has already been run on every available structure.

1 – Given two proteins A and B.

2 – Given that each structure has a collection of secondary structure elements (SSE), where H is a helix and S is a strand:

$SSE_A = \{H_1, S_2, H_3, \ldots, H_n\}$

$SSE_B = \{H_1, H_2, S_3, \ldots, S_m\}$

Page 55: Protein folding

Comparing and aligning structures

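VAST (Vector Alignment Search Tool)

3 – Find the rotation and translation to apply to each helix/strand in A to align it with each element in B.

These transformations can be summarized by a matrix, where entry $A_iB_j$ is the transformation that aligns element $i$ of A with element $j$ of B:

$\begin{pmatrix} A_1B_1 & A_2B_1 & \cdots & A_nB_1 \\ A_1B_2 & A_2B_2 & \cdots & A_nB_2 \\ \vdots & \vdots & \ddots & \vdots \\ A_1B_m & A_2B_m & \cdots & A_nB_m \end{pmatrix}$

$SSE_A = \{H_1, S_2, H_3, \ldots, H_n\}$

$SSE_B = \{H_1, H_2, S_3, \ldots, S_m\}$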

Page 56: Protein folding

Comparing and aligning structures

VAST (Vector Alignment Search Tool)

4 – If two structures are identical, each helix/strand will be part of a pair with a common transformation, which would just be the transformation that aligns the whole proteins.

In remotely similar structures, not all helices/strands will have a match.

The best set of rotation/translation will be the one that is shared by the largest number of secondary structure pairs.

Page 57: Protein folding

Comparing and aligning structures

VAST (Vector Alignment Search Tool)

4 – Sharing has to be defined a bit more formally: two transformations $i$ and $j$ are considered identical when they differ by less than $\alpha$, where $\alpha$ is some kind of tolerance cutoff.

Page 58: Protein folding

Comparing and aligning structures

VAST (Vector Alignment Search Tool)

4a – Every time we have a “match”, we draw a link between the two transformations $i$ and $j$. The result is a graph with connections only between similar sets of rotation/translation.

[Figure: graph whose nodes are the individual transformations.]

Page 59: Protein folding

Comparing and aligning structures

VAST (Vector Alignment Search Tool)

5 – Once the problem is abstracted into a “graph”, it is possible to use the computational bag of tricks to figure out which set of connected matrices forms the largest group. The average rotation/translation in this group would best superimpose proteins A and B.

[Figure: graph whose nodes are the individual transformations.]
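The graph step can be illustrated with a small sketch. This is not the actual VAST implementation: it assumes each candidate transformation is summarized by a short tuple of numbers (a hypothetical parametrization), links two transformations when every component differs by less than a tolerance `alpha`, and returns the largest connected group. The naive averaging at the end is only reasonable when the grouped transformations are already very close to each other.

```python
def largest_consistent_group(transforms, alpha):
    """Indices of the largest connected group of mutually similar transforms."""
    n = len(transforms)

    def similar(a, b):
        # Two transforms "match" when all components agree within alpha.
        return max(abs(x - y) for x, y in zip(a, b)) < alpha

    # Build the graph: one node per transform, one edge per match.
    neighbors = {
        i: [j for j in range(n)
            if j != i and similar(transforms[i], transforms[j])]
        for i in range(n)
    }

    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if len(group) > len(best):
            best = sorted(group)
    return best


def average_transform(transforms, group):
    """Component-wise mean of the transforms in the selected group."""
    width = len(transforms[0])
    return tuple(sum(transforms[i][k] for i in group) / len(group)
                 for k in range(width))
```

A connected component is the simplest notion of "group"; a stricter variant would require every pair in the group to match (a clique), at a higher computational cost.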

Page 60: Protein folding

Comparing and aligning structures

VAST (Vector Alignment Search Tool)

7 – The alignment is performed irrespective of the sequence order of the structural elements. This is good because it can catch circularly permuted proteins, but it also increases the chance of finding a match by accident.

Page 61: Protein folding

Comparing and aligning structures

VAST (Vector Alignment Search Tool)

8 – Statistical validation. This is a very important step since there is only a limited number of ways a small number of SSEs can interact. Thus, sampling a large database of random structures would still return a distribution of “hits”.

This is second-hand information:

The p-value is the probability of observing a similar score by chance, multiplied by the number of possible alternative substructures within the comparison.

The default cutoff is 0.05, which should be regarded as a noise-reduction cutoff, not a bulletproof jacket.

Page 62: Protein folding

Comparing and aligning structures

CE (Combinatorial extension)

CE does not use secondary structure elements as its basic aligning unit. Instead, it seeks the optimal path amongst all possible n-mers between two query proteins.

1 – Given two proteins A and B of length nA and nB, CE will search for the longest continuous path P of aligned fragment pairs (AFPs) of length m.

Page 63: Protein folding

Comparing and aligning structures

CE (Combinatorial extension)

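4 – Some distance metric has to be defined to score AFP alignments. Below, $d^A$ and $d^B$ are inter-residue distances within proteins A and B, and $p_i$, $p_j$ are the starting positions of AFPs $i$ and $j$ in the corresponding protein.

Each residue is counted once:

$D_{ij} = \frac{1}{m} \sum_{k=0}^{m-1} \left| d^A_{p_i+k,\,p_j+k} - d^B_{p_i+k,\,p_j+k} \right|$

Each residue is counted against all:

$D_{ij} = \frac{1}{m^2} \sum_{k=0}^{m-1} \sum_{l=0}^{m-1} \left| d^A_{p_i+k,\,p_j+l} - d^B_{p_i+k,\,p_j+l} \right|$

Using RMSD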
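The two distance-based scores can be sketched as follows, assuming each protein is given as a list of (x, y, z) coordinates (for example Cα atoms) and each AFP is described by its starting position in A and in B. The function and parameter names are hypothetical, and this is an illustration of the two summation schemes rather than the CE program itself.

```python
import math

def distance_matrix(coords):
    """All-against-all Euclidean distances within one structure."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return [[dist(p, q) for q in coords] for p in coords]

def afp_distance_once(dA, dB, i_a, j_a, i_b, j_b, m):
    """Each residue is counted once: compare matching offsets k only."""
    total = sum(abs(dA[i_a + k][j_a + k] - dB[i_b + k][j_b + k])
                for k in range(m))
    return total / m

def afp_distance_all(dA, dB, i_a, j_a, i_b, j_b, m):
    """Each residue is counted against all: compare every offset pair (k, l)."""
    total = sum(abs(dA[i_a + k][j_a + l] - dB[i_b + k][j_b + l])
                for k in range(m) for l in range(m))
    return total / (m * m)
```

Both scores are zero when the two pairs of fragments have the same internal geometry, and grow as the geometries diverge; the second form costs m times more to evaluate.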

Page 64: Protein folding

Comparing and aligning structures

CE (Combinatorial extension)

4 – Pathfinding

Restricting the value of G substantially decreases the size of the search space.

1 – Select all possible next AFPs under a certain (self) threshold.

2 – Consider the path to choose the best next AFP.

3 – Choose whether to pursue the extension or not.

Page 65: Protein folding

Comparing and aligning structures

CE (Combinatorial extension)

4 – The statistics use a z-score, which compares paths of similar length and score to a random sampling from a reference database.

z-score of 3.5 -> p-value of 10^-3

So, given about 2000 different protein folds, such a threshold would imply two fortuitous hits. Visual inspection must be done, and a more restrictive threshold should be used.
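The arithmetic behind the "two fortuitous hits" can be written out explicitly. This is a minimal sketch; the function names are made up for illustration, and the stricter per-comparison threshold shown is a simple Bonferroni-style correction, which is one possible choice and not something the CE authors are claimed to use.

```python
def expected_chance_hits(p_value, n_comparisons):
    """Expected number of hits passing the threshold purely by chance."""
    return p_value * n_comparisons

def per_comparison_threshold(overall_alpha, n_comparisons):
    """Bonferroni-style threshold that keeps the overall error near overall_alpha."""
    return overall_alpha / n_comparisons

# About 2000 folds scanned at a p-value threshold of 1e-3.
hits = expected_chance_hits(1e-3, 2000)
stricter = per_comparison_threshold(0.05, 2000)
```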

Page 66: Protein folding

Comparing and aligning structures

CE (Combinatorial extension)

Structural similarity between Acetylcholinesterase and Calmodulin found using CE (Tsigelny et al, Prot Sci, 2000, 9:180)

Page 67: Protein folding

SCOP database

http://scop.berkeley.edu/

Seen as the gold standard for protein structure classification.

Query for structures given a protein sequence

Browse protein architecture organized in a hierarchical fashion.

Keyword search for structures.

Fold: common topology of secondary structures

Superfamilies: probable common evolutionary origins, low sequence identity

Families: common evolutionary origins

Domains: individual domains

Page 68: Protein folding

CATH database

http://www.biochem.ucl.ac.uk/bsm/cath/

Involves manual inspection and classification, especially at more abstract levels such as the architecture level.

Class (secondary structure composition)

Architecture (what would be known as fold in SCOP)

Topology (what would be known as superfamily in SCOP)

Sequence level

Page 69: Protein folding

Summary

Aligning protein structures can detect homologous relationships that are deeper than those found by sequence alignment, because structures are more stable over time.

VAST abstracts proteins into secondary structure elements (SSE) and finds the set of rotation/translation that maximizes the number of paired SSEs.

CE looks for the best alignment frame to superimpose one protein onto another.

Statistics are important because it is likely that small unrelated structures will resemble each other.

Page 70: Protein folding

Summary

A distribution of random protein scores can be generated by aligning unrelated proteins in the databases. An alignment score must be significantly larger than the scores expected from this distribution.

This type of analysis is used to classify protein folds and infer relationship between structural evolution and biological activity.

Try to find structural neighbors of the protein 1AZT while browsing the NCBI website ( www.ncbi.nlm.nih.gov ).

Page 71: Protein folding

Molecular Modeling

Lecture 4

Page 72: Protein folding

Why model proteins?

Example of applications

Modeling the binding site of the anticodon on eRF3

Modeling substrate binding in the active site of Mandelate racemase.

Solving X-ray and NMR structures.

The theory behind the calculation

Parametrizing protein models

Molecular mechanics as an optimization problem

Molecular mechanics as a time simulation

Conceptual clash between protein folding and molecular mechanics.

Page 73: Protein folding

Why model proteins?

Anticodon binding site on eRF3: 2 possibilities.

From phylogenetic information, a few residues were identified as players.

Use molecular mechanics to “see” whether the surface of the protein can accommodate an anticodon.

Page 74: Protein folding

Why model proteins?

Modeling a weird substrate into an active site.

Mandelate racemase can bind a substrate with two rings! Is there room for this in the wild type active site?

The answer is yes, although a bit counter-intuitive.

Page 75: Protein folding

How are structures viewed?

Pre-computer days

Sir John Kendrew and his model of myoglobin, 1958

Page 76: Protein folding

How are structures resolved?

X-ray diffraction

Principle

Create a lattice of protein in a crystal.

Collect thousands of diffraction patterns over all rotational degrees of freedom.

Determine the phase between the layers in the lattice.

Compile into a 3D volume based on the density of reflective material (electrons in this case).

Thread a model into the density map, then optimize the geometry using the density map as an additional criterion.

Page 77: Protein folding

How are structures resolved?

NMR spectroscopy

Principle

Use magnetic fields and “radio” frequency photons to detect shifts in nuclear states.

Assign shifts to a model along the chain.

Correlate the mutual effects of elements on each other to come up with a list of constraints (typically distances).

Optimize the trajectory of the modeled chain, given this list of constraints.


Page 79: Protein folding

Physical simulation in Molecular Modeling

Jensen, F., 1999, Introduction to Computational Chemistry, Wiley, Chichester, UK

Why is it useful to you?

Modeling is often used by experimental biochemists and is a staple in structural biology.

The complexity of the simulation is far beyond the complexity of the interface. This conveys a false sense that the “default” settings will do fine.

Page 80: Protein folding

Physical simulation in Molecular Modeling

Limitations

True atoms and bonds are probabilistic constructs. The computation of the resulting geometries is a very involved process for which the analytical equations are not fully worked out.

Luckily, the observable behavior is much more predictable and thus can be modeled under a limiting set of assumptions.

Page 81: Protein folding

Physical simulation in Molecular Modeling

Assumptions

Newtonian physics is used to simulate molecules under a set of restrictions which for proteins would be:

1. In solution (or vacuum).

2. Near room temperature.

3. Chemically inert.

Page 82: Protein folding

Physical simulation in Molecular Modeling

Abstraction

Each atom has a fixed geometry constrained by a somewhat arbitrary energy scoring scheme.

The problem thus boils down to finding the best set of coordinates for all atoms to minimize the energy.

There is no absolute correspondence between this scoring scheme and experimentally measurable energy values.

Page 83: Protein folding

Molecular Modeling in Bioinformatics

Modeling

Although only a small subset of all possible atoms end up in biological molecules, each atom has a set of different states in which it can exist. These states are referred to as types in molecular modeling.

Page 84: Protein folding

Molecular Modeling in Bioinformatics

Energy function

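The energy function is used to evaluate the energy and to calculate the derivatives used to optimize a structure.

$E_{FF} = E_{str} + E_{bend} + E_{tors} + E_{VdW} + E_{el} + E_{cross}$

Computational cost of each term, in order: $O(N)$, $O(N)$, $O(N)$, $O(N^2)$, $O(N^2)$, $O(N^2)$.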

Page 85: Protein folding

Molecular Modeling in Bioinformatics

Computational efficiency and limitations of the model

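The energy function is used to evaluate the energy and to calculate the derivatives used to optimize a structure. The stretch term for a bond between atoms A and B can take several forms, with $\Delta R^{AB} = R^{AB} - R_0^{AB}$:

$E_{str}^{P2}(\Delta R^{AB}) = k_2^{AB} (\Delta R^{AB})^2$

$E_{str}^{P4}(\Delta R^{AB}) = k_2^{AB} (\Delta R^{AB})^2 + k_3^{AB} (\Delta R^{AB})^3 + k_4^{AB} (\Delta R^{AB})^4$

$E_{str}^{Morse}(\Delta R^{AB}) = D \left[ 1 - e^{-\alpha \Delta R^{AB}} \right]^2$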
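A small sketch comparing three common forms of the bond-stretch energy: a harmonic (second-order) term, a fourth-order polynomial, and a Morse function. The parameter values in the example are arbitrary illustrations, not taken from any force field; the only relation used is that a Morse well of depth D and width parameter alpha has the same curvature at the minimum as a harmonic term with k2 = D * alpha**2.

```python
import math

def stretch_harmonic(r, r0, k2):
    """Second-order term: cheap, but rises without bound on stretching."""
    return k2 * (r - r0) ** 2

def stretch_quartic(r, r0, k2, k3, k4):
    """Fourth-order polynomial: better around the minimum, more parameters."""
    dr = r - r0
    return k2 * dr ** 2 + k3 * dr ** 3 + k4 * dr ** 4

def stretch_morse(r, r0, depth, alpha):
    """Morse form: levels off at the dissociation energy `depth`."""
    return depth * (1.0 - math.exp(-alpha * (r - r0))) ** 2

# Illustrative parameters only.
R0, DEPTH, ALPHA = 1.5, 100.0, 2.0
K2 = DEPTH * ALPHA ** 2  # matches the Morse curvature at r = R0
```

Close to the equilibrium length the two models agree, which is why the cheaper polynomial is usually good enough for structures near a minimum; far from it only the Morse form stays physically bounded.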

Page 86: Protein folding

Molecular Modeling in Bioinformatics

Parameterization nightmare

Can someone come up with all these numbers?

Generalization

How robust is the simulation in a range of conditions.

Computational cost

The longer it takes to perform a single task, the fewer iterations will be computed in the same amount of time.

Page 87: Protein folding

Molecular Modeling in Bioinformatics

Parameterization nightmare

Can someone come up with all these numbers?

For the MM2 force field (71 atom types):

Term        Params (est.)   Determined
E(VdW)      142             142
E(str)      900             290
E(bnd)      27000           824
E(tors)     1215000         2466
E(cross)    10^7 to 10^8    ?

Page 88: Protein folding

Molecular Modeling in Bioinformatics

Generalization

How robust is the simulation over a range of conditions?

In the example to the left, the Exp-6 model causes nuclear fusion at unrealistic distances.

Such unrealistic distances will be encountered in Monte Carlo, genetic algorithm and simulated annealing experiments.

Page 89: Protein folding

Molecular Modeling in Bioinformatics

Lennard-Jones

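Is actually a computational stunt: there is no need to compute $R$ itself, only $R^n$ where $n$ is an even power, which avoids the square root in the distance calculation.

$E_{LJ}(R) = \varepsilon \left[ \left( \frac{R_o}{R} \right)^{12} - \left( \frac{R_o}{R} \right)^{6} \right]$

$R_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$

$E_{Exp-6}(R) = A e^{-BR} - \frac{C}{R^6}$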
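The "stunt" can be shown in a few lines: a 12-6 energy of the form eps * ((R0/R)^12 - (R0/R)^6) only needs the squared distance, so the square root is never taken. The Exp-6 form is included to show why it misbehaves at very short range. Parameter names and values are illustrative assumptions, not force-field constants.

```python
import math

def lennard_jones(ri, rj, r0, eps):
    """12-6 energy computed from the squared distance only (no sqrt)."""
    d2 = sum((a - b) ** 2 for a, b in zip(ri, rj))
    s6 = (r0 * r0 / d2) ** 3          # (R0/R)^6 from R^2
    return eps * (s6 * s6 - s6)       # (R0/R)^12 - (R0/R)^6

def exp6(r, a, b, c):
    """Exp-6 energy: the -C/R^6 term wins as R goes to zero."""
    return a * math.exp(-b * r) - c / r ** 6
```

With the 12-6 form the repulsive wall keeps growing as atoms approach, whereas the Exp-6 energy eventually plunges towards minus infinity, which is the "fusion" artifact a stochastic search can stumble into.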

Page 90: Protein folding

Molecular Modeling in Bioinformatics

Lennard-Jones

In practice, Lennard-Jones is optimized to reproduce validated results (and works out satisfactorily).

Page 91: Protein folding

Molecular Modeling in Bioinformatics

Electrostatic Models

… are really ugly.

Why does this matter?

Electrostatic interaction energies decay with 1/distance, which makes them the longest-ranged interactions.

Examples

Coulomb’s Law

$E_{el}(R^{AB}) = \frac{Q^A Q^B}{\varepsilon R^{AB}}$

Page 92: Protein folding

Molecular Modeling in Bioinformatics

Electrostatic Models

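… are really ugly.

Why does this matter?

Electrostatic interaction energies decay with 1/distance, which makes them the longest-ranged interactions.

Examples

Dipolar moment interactions, where $\mu^A$ and $\mu^B$ are the dipole moments, $\chi$ is the angle between them, and $\alpha_A$, $\alpha_B$ are the angles between each dipole and the line joining them:

$E_{el}(R^{AB}) = \frac{\mu^A \mu^B}{\varepsilon (R^{AB})^3} \left( \cos\chi - 3 \cos\alpha_A \cos\alpha_B \right)$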

Page 93: Protein folding

Molecular Modeling in Bioinformatics

Computational cost of non-bonded energy (VdW, El)

~99.88% of computation in protein-sized models. Most of this is very small and does not contribute to the total energy significantly.

Computational tricks

Cutoff -> blending function -> neighbor list*

*must be updated, which costs O(N^2)

Validation

1. Reproduces Geometries

2. Reproduces Relative energies.
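A minimal sketch of the cutoff, blending-function and neighbor-list tricks. Building the list is still an all-pairs loop, but it can be reused for several steps, so the expensive non-bonded terms are evaluated only for pairs that are close. The cubic blending function is one simple assumption among many possible smooth switches, and all names are hypothetical.

```python
def build_neighbor_list(coords, cutoff):
    """All pairs (i, j), i < j, closer than `cutoff`. Costs O(N^2) to build."""
    cutoff2 = cutoff * cutoff
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            if d2 <= cutoff2:
                pairs.append((i, j))
    return pairs

def blend(r, r_on, r_off):
    """Smoothly scale an energy term from 1 (r <= r_on) to 0 (r >= r_off)."""
    if r <= r_on:
        return 1.0
    if r >= r_off:
        return 0.0
    t = (r - r_on) / (r_off - r_on)
    return 1.0 - t * t * (3.0 - 2.0 * t)
```

The blending function avoids the energy jump a hard cutoff would create, which matters because the optimizers rely on derivatives of the energy.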

Page 94: Protein folding

E is not G

ΔG = ΔH − TΔS

Real energies are temperature-dependent.

Entropic contribution cannot be calculated from a snapshot.

Page 95: Protein folding

Principle of optimization

You start with a protein for which you know all coordinates.

Evaluate the energy

Find a better structure, usually with small changes

Repeat until no better structure can be found.

This task is usually NEVER straightforward, unless the system is made of a small number of atoms.

Page 96: Protein folding

Molecular Modeling in Bioinformatics

Optimization (local minima)

Straightforward, although computationally expensive.

1 – A clear equation.

2 – A defined set of variables.

3 – “Only” three dimensions to worry about.

Steepest Descent (robust, fast)

Conjugate Gradient (improved convergence properties)

Newton-Raphson (saddle points)

Pseudo-NR (progressive Hessian estimate)
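The evaluate, improve, repeat loop can be sketched with the simplest of these methods, steepest descent. This is a generic illustration on an arbitrary energy function, with an adaptive step size chosen for simplicity; it is not how any particular modeling package implements its minimizer, and the names are hypothetical.

```python
import math

def steepest_descent(energy, gradient, x, step=0.01, tol=1e-6, max_iter=10000):
    """Follow the negative gradient until it (nearly) vanishes.

    energy(x) returns a float, gradient(x) a list of partial derivatives.
    Returns the final coordinates and their energy.
    """
    e = energy(x)
    for _ in range(max_iter):
        g = gradient(x)
        if math.sqrt(sum(c * c for c in g)) < tol:
            break                       # converged: a local minimum
        trial = [xi - step * gi for xi, gi in zip(x, g)]
        e_trial = energy(trial)
        if e_trial < e:
            x, e = trial, e_trial       # accept the move, try a bolder step
            step *= 1.2
        else:
            step *= 0.5                 # overshoot: reject and be more careful
    return x, e

# Toy "energy surface" with a single minimum at (1, -2).
def toy_energy(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def toy_gradient(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]
```

On a real protein the same loop runs over thousands of coordinates and only ever finds the minimum nearest to the starting structure.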

Page 97: Protein folding

Molecular Modeling in Bioinformatics

Optimization (global minimum)

In a simple circular system with 17 main-chain atoms, there are 262 distinct conformations within 3 kcal/mol of the global minimum (out of ~1.6E13 conformers).

Proteins are 1-2 orders of magnitude larger.

Stochastic methods (Monte Carlo)

Molecular dynamics

Simulated annealing

Genetic algorithms

Statistical mechanics

Page 98: Protein folding

Molecular Modeling in Bioinformatics

Time dependent methods (Molecular Dynamics)

Make use of classical mechanics equations such as:

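$F = ma$

$r_{i+1} = r_i + v_i \Delta t + \frac{1}{2} a_i \Delta t^2 + \frac{1}{6} b_i \Delta t^3 + \ldots$

$r_{i-1} = r_i - v_i \Delta t + \frac{1}{2} a_i \Delta t^2 - \frac{1}{6} b_i \Delta t^3 + \ldots$

Adding the two expansions gives:

$r_{i+1} = 2 r_i - r_{i-1} + a_i \Delta t^2$

Verlet algorithm

Numerical solution to Newton’s equations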

Page 99: Protein folding

Molecular Modeling in Bioinformatics

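$r_{i+1} = r_i + v_i \Delta t + \frac{1}{2} a_i \Delta t^2 + \frac{1}{6} b_i \Delta t^3 + \ldots$

$r_{i-1} = r_i - v_i \Delta t + \frac{1}{2} a_i \Delta t^2 - \frac{1}{6} b_i \Delta t^3 + \ldots$

$r_{i+1} = 2 r_i - r_{i-1} + a_i \Delta t^2$

Verlet algorithm

Numerical solution to Newton’s equations

Problems with this method

No explicit use of velocity (which is needed to calculate the total energy):

$E_{Tot} = \sum_{i=1}^{N} \frac{1}{2} m_i v_i^2 + U(r)$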
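A minimal one-dimensional sketch of the Verlet update, r(i+1) = 2 r(i) - r(i-1) + a(i) dt^2, applied to a unit-mass harmonic oscillator as a stand-in for a real force field. Because the positions never use velocities, the velocities needed for the total energy are recovered afterwards with a central difference; that recovery step is a common workaround and an assumption of this sketch.

```python
import math

def verlet(x0, v0, accel, dt, n_steps):
    """Positions at t = 0, dt, ..., n_steps * dt using the Verlet update."""
    # The first step has no r(i-1), so bootstrap it with a Taylor expansion.
    xs = [x0, x0 + v0 * dt + 0.5 * accel(x0) * dt * dt]
    for _ in range(n_steps - 1):
        xs.append(2.0 * xs[-1] - xs[-2] + accel(xs[-1]) * dt * dt)
    return xs

def central_velocities(xs, dt):
    """v(i) ~ (r(i+1) - r(i-1)) / (2 dt), defined for interior points only."""
    return [(xs[i + 1] - xs[i - 1]) / (2.0 * dt) for i in range(1, len(xs) - 1)]

def spring_accel(x):
    """Unit mass on a unit spring: a = -x, so the exact solution is cos(t)."""
    return -x

positions = verlet(1.0, 0.0, spring_accel, 0.01, 628)
velocities = central_velocities(positions, 0.01)
energies = [0.5 * v * v + 0.5 * x * x
            for v, x in zip(velocities, positions[1:-1])]
```

The total energy stays close to its starting value of 0.5 over the whole trajectory, which is the usual sanity check for an integrator and time step.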

Page 100: Protein folding

Molecular Modeling in Bioinformatics

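$r_{i+1} = r_i + v_i \Delta t + \frac{1}{2} a_i \Delta t^2 + \frac{1}{6} b_i \Delta t^3 + \ldots$

$r_{i-1} = r_i - v_i \Delta t + \frac{1}{2} a_i \Delta t^2 - \frac{1}{6} b_i \Delta t^3 + \ldots$

$r_{i+1} = r_i + v_{i+1/2} \Delta t$

$v_{i+1/2} = v_{i-1/2} + a_i \Delta t$

Leapfrog algorithm

Numerical solution to Newton’s equations

Timestep

Reasonable: femtoseconds ($10^{-15}$ s)

Scope of simulation (ideal): milliseconds ($10^{-3}$ s)

Scope of simulation (practical): microseconds ($10^{-6}$ s)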
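A minimal one-dimensional sketch of the leapfrog scheme, in which velocities live at half steps and positions at full steps, again using a unit-mass harmonic oscillator as a toy system. The last lines work out how many femtosecond steps the simulation scopes quoted above would require. Function names are illustrative.

```python
import math

def leapfrog(x0, v0, accel, dt, n_steps):
    """Positions at t = 0, dt, ..., n_steps * dt using the leapfrog update."""
    # Shift the starting velocity back by half a step: v(-1/2).
    v_half = v0 - 0.5 * accel(x0) * dt
    x = x0
    traj = [x]
    for _ in range(n_steps):
        v_half = v_half + accel(x) * dt   # v(i+1/2) = v(i-1/2) + a(i) dt
        x = x + v_half * dt               # r(i+1)   = r(i) + v(i+1/2) dt
        traj.append(x)
    return traj

def oscillator_accel(x):
    """Unit mass on a unit spring: a = -x, exact solution cos(t)."""
    return -x

trajectory = leapfrog(1.0, 0.0, oscillator_accel, 0.01, 628)

# Number of 1 fs steps needed to cover each simulation scope.
steps_for_microsecond = round(1e-6 / 1e-15)
steps_for_millisecond = round(1e-3 / 1e-15)
```

A billion force evaluations for a microsecond, and a thousand times more for a millisecond, is what makes the gap between the practical and the ideal scope so hard to close.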

Page 101: Protein folding

Molecular Modeling in Bioinformatics

Simulated annealing

Robustness vs. initial solution

Variable contribution of the objective function.

Broader sampling.

Both help to explore around a minimum.

[Figure: the net movement F combines a potential term U and a kinetic term K through a blending function.]

Page 102: Protein folding

Protein folding from Scratch

Must be restrained to a limited scope

Two miniproteins: TC5b and TC3b

Both have reference structures for validation.

Sequences

NLYIQWLKDGGPSSGRPPPS (TC5b; 304 atoms)

NLFIEWLKNGGPSSGAPPPS (TC3b; 289 atoms)

Software: AMBER 6.0

Model: AMBER

Solvation: Generalized Born/solvent-accessible surface area

This means that the water molecules are not explicitly defined in the simulation and the effect of the solvent is treated as a macroscopic property.

Page 103: Protein folding

Protein folding from Scratch

Must be restrained to a limited scope

Understanding folding and design: Replica-exchange simulation of “Trp-Cage” miniproteins.

Pitera, JW., Swope, W. 2003. Proc. Natl. Acad. Sci. USA, 100: 7587-7592

Page 104: Protein folding

Protein folding from Scratch

Algorithm

Initialization

Input: A protein sequence

Output: A starting structure for the main simulation.

1: Thread each character of the input sequence onto a corresponding 3D model (extended).

2: Minimize with 5000 steps of steepest descent.

3: for i = 1 to 50000 do

Simulate with molecular dynamics.

if i % 1000 == 0 then readjust the temperature to 298 K.

4: Return the equilibrated model.

The equilibration is required to prevent strong “jerking” motion in the first iterations of a simulation.

Page 105: Protein folding

Protein folding from Scratch Algorithm

Simulation (simulated annealing variant)

Input: P, An equilibrated protein model

Output: A collection of coordinate snapshots (trajectory) for analysis.

1: T = a list of 23 simulation temperatures from 250K to 603K.

2: E = {} , an empty list of experiments

3: for i = 1 to |T| do

4: Pi = Copy P

5: Set the temperature of Pi to Ti

6: Add Pi to E

7: for i = 1 to 4,000,000 (4 ns) do

8: Simulate using MD |in parallel|

9: if i % 250 == 0 then take a Snapshot of coordinates.

10: if i % 5000 == 0 then

11: Swap temperatures between processes (Metropolis-style probabilities)

12: Adjust each experiment in E to its new simulation temperature

13: Discard all but the snapshots taken in the last nanosecond of simulation.

14: Pool all 23 experiments for analysis.
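The temperature swap in step 11 can be sketched with the standard replica-exchange Metropolis criterion, which accepts a swap between two replicas with probability min(1, exp[(1/kT_i - 1/kT_j)(E_i - E_j)]). This is the textbook form, assumed here rather than taken from the paper's code; energies are assumed to be in kcal/mol and the function names are hypothetical.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def swap_probability(e_i, t_i, e_j, t_j):
    """Metropolis acceptance probability for exchanging two temperatures."""
    delta = (1.0 / (K_B * t_i) - 1.0 / (K_B * t_j)) * (e_i - e_j)
    # min(1, exp(delta)) written so that exp() can never overflow.
    return 1.0 if delta >= 0.0 else math.exp(delta)

def maybe_swap(temps, energies, i, j, rng=random):
    """Swap the temperatures of replicas i and j in place if accepted."""
    p = swap_probability(energies[i], temps[i], energies[j], temps[j])
    if rng.random() < p:
        temps[i], temps[j] = temps[j], temps[i]
        return True
    return False
```

A swap that hands the colder temperature to the lower-energy structure is always accepted; the reverse is accepted only occasionally, which is what lets structures trapped at low temperature escape by visiting higher ones.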

Page 106: Protein folding

Protein folding from Scratch

Computational cost

Ridiculously small protein, no initial good guess.

19 days on 23 IBM POWER3 SP2 processors at 200 MHz (R6000 series)

Which, on the campus here, approaches the mean time between power outages!

Page 107: Protein folding

Protein folding from Scratch

Validation

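The root mean square deviation (RMSD), where $r_i$ is the position of atom $i$, $r_i^{ref}$ its position in the reference structure, and $w_i$ its weight:

$RMSD = \sqrt{\frac{\sum_{i=1}^{n} w_i \left( r_i - r_i^{ref} \right)^2}{\sum_{i=1}^{n} w_i}}$

Which is a suitable distance metric for related structures.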
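A minimal sketch of a weighted RMSD between a structure and a reference, written as the square root of the weighted mean squared displacement. It assumes the two coordinate lists are already superimposed and list the same atoms in the same order; the optimal superposition itself is a separate step not shown here, and the function name is illustrative.

```python
import math

def rmsd(coords, ref, weights=None):
    """Weighted root mean square deviation between two equal-length structures."""
    if len(coords) != len(ref):
        raise ValueError("RMSD needs a one-to-one atom correspondence")
    if weights is None:
        weights = [1.0] * len(coords)   # unweighted: every atom counts equally
    weighted_sq = sum(
        w * sum((a - b) ** 2 for a, b in zip(p, q))
        for w, p, q in zip(weights, coords, ref)
    )
    return math.sqrt(weighted_sq / sum(weights))
```

Weights can be used, for instance, to count only backbone atoms by giving every other atom a weight of zero.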

Page 108: Protein folding

Comparing and aligning structures

Why?

There is a need for a distance metric to compare similar protein structures.

Simulation analysis.

Similarity quantification.

Pattern detection.

RMSD

Works well for closely related structures.

$$\mathrm{RMSD} = \sqrt{\frac{\sum_{i=1}^{n} w_i \,\lVert \mathrm{atom}_i - \mathrm{ref}_i \rVert^2}{\sum_{i=1}^{n} w_i}}$$

Page 109: Protein folding

Comparing and aligning structures

RMSD

Works well for closely related structures.

$$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \lVert \mathrm{atom}_i - \mathrm{ref}_i \rVert^2}$$

Absolutely requires some kind of pairwise equivalence between the two compared entities.

Page 110: Protein folding

Comparing and aligning structures

RMSD

Sequence identity falls quickly.

Hard to separate weak hits from purely random proteins.

$$\mathrm{RMSD} = \sqrt{\frac{\sum_{i=1}^{n} w_i \,\lVert \mathrm{atom}_i - \mathrm{ref}_i \rVert^2}{\sum_{i=1}^{n} w_i}}$$

Page 111: Protein folding

Protein folding from Scratch

Validation

The root mean square deviation RMSD

$$\mathrm{RMSD} = \sqrt{\frac{\sum_{i=1}^{n} w_i \,\lVert \mathrm{atom}_i - \mathrm{ref}_i \rVert^2}{\sum_{i=1}^{n} w_i}}$$

≤ 2.0 Å RMSD from any of 38 experimental structures

≤ 2.0 Å RMSD from the average low-temperature structure.

Page 112: Protein folding

Protein folding from Scratch

Impact

Impact of this paper

Makes good use of parallelism to conduct a heuristic search.

Sampling-based method.

Promising because in many cases the folding of a large protein can be approximated by the folding of its components.

(Remember, domains are independent units in most cases.)

Page 113: Protein folding

Building a large machine for molecular modeling

IBM Blue Gene project

Architecture

64K FPU

20K FPU (protein folding)

FPU 64-bit @ 700 MHz (low cost, low heat)

64 compute nodes (256 MB) per I/O node (512 MB)

MPI library

3D torus network for fast neighbor-to-neighbor communication.

Page 114: Protein folding

High-performance achievement in MD: NAMD

Open source

University of Illinois, Dept. of theoretical physics

http://www.ks.uiuc.edu/Research/namd/

Benchmark system

(their big one)

Page 115: Protein folding

High-performance achievement in MD: NAMD

Open source

There is no need to use this system to study protein folding.

Instead, MD was used in this case to study the conversion of torque into energy that can be stored in a molecular battery: the ATP molecule.

Page 116: Protein folding

Overview

Protein folding and parallel computing.

Homology modeling and statistical mechanics.

Secondary structure prediction and artificial intelligence.

Page 117: Protein folding

Spectrum of strategies

Physics Knowledge

Quantum mechanics Molecular Mechanics Statistical Mechanics Homology Modeling

Page 118: Protein folding

Parallel computing and Molecular dynamics

Folding a protein from an extended conformation is a difficult problem because energy barriers must be crossed.

The following slides describe how barrier crossing can be achieved using a technique called parallelization.

Page 119: Protein folding

Parallel computing

It takes 1500 days to complete a thesis for one student

If the student is helped by someone, the work may go 2X as fast: 750 days.

What if 1500 students are working on the same thesis?

Overhead

Communication

Load balancing

Page 120: Protein folding

Parallel computing

Factors that complicate parallelization:

Some work has to be executed sequentially.

Communicating the task and the results takes an increasingly large share of the time as the tasks become small.

Each individual process has to wait for the slowest one to finish, leading to a loss of efficiency.
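
The first factor can be made concrete with Amdahl's law, which is not named on the slides but is the standard way to bound the speedup when part of the work is sequential. The sketch below applies it to the thesis analogy; the 10% sequential fraction is an arbitrary assumption for illustration, and communication and load-balancing costs are ignored.

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when a fraction of the work cannot be split."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

def thesis_days(total_days, serial_fraction, students):
    """Days needed when `students` share the parallel part of the work."""
    return total_days / amdahl_speedup(serial_fraction, students)

# Perfectly divisible work: two students halve the time.
print(thesis_days(1500, 0.0, 2))
# With an assumed 10% of strictly sequential work, 1500 students
# cannot bring the thesis below 150 days.
print(thesis_days(1500, 0.1, 1500))
```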

Page 121: Protein folding

Time scale in protein folding

On the order of microseconds to milliseconds

This is not achievable by modern computers.

~10 000 days for 1 experiment (~28 years)

folding@home

Hundreds of millions of computers are idle at any time

Why not use their unspent cycles?

Create a “screen saver”

Page 122: Protein folding

Crossing energy barrier

Most of the time is spent waiting for the thermal motion to topple a structure over a barrier.

Principle of Ensemble dynamics

M CPUs should take M times less time to go over a barrier.

$$f(t) = 1 - \exp(-kt)$$

k = 1/10,000 ns⁻¹, M = 10,000, t = 30 ns

M · f(t) ≈ 30 folding events
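
The arithmetic behind the estimate of about 30 events can be checked with a few lines of Python. This sketch assumes first-order kinetics, f(t) = 1 - exp(-kt), for a single trajectory and M independent trajectories; the function names are illustrative.

```python
import math

def folding_fraction(k, t):
    """Probability that one trajectory has crossed the barrier by time t."""
    return 1.0 - math.exp(-k * t)

def expected_events(k, t, m):
    """Expected number of trajectories, out of m, that crossed by time t."""
    return m * folding_fraction(k, t)

# k in 1/ns, t in ns, values taken from the slide.
print(expected_events(k=1 / 10_000, t=30, m=10_000))
```

Because kt is small here, f(t) is close to kt = 0.003, so the expected count is close to M·k·t = 30.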

Page 123: Protein folding

Ensemble Dynamics

Start M dynamic calculations with the same initial structure.

Once one thread finds a barrier and goes over it, copy the state of this thread into all the other replica processes.

The communication overhead is negligible if the crossings are rare events, which is true in this case.

Page 124: Protein folding

Ensemble Dynamics

Detecting a barrier

A barrier crossing will be noticeable as a large variance in energy over the duration of the simulation.
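
One way to turn this criterion into code is to compute the energy variance over consecutive windows of the trajectory and flag the windows where it exceeds a threshold. This is only a sketch under stated assumptions: the windowing scheme, the window length and the threshold are hypothetical choices, not the detection procedure used by the authors.

```python
from statistics import pvariance

def high_variance_windows(energies, window, threshold):
    """Start indices of non-overlapping windows whose energy variance
    exceeds `threshold` (candidate barrier crossings)."""
    hits = []
    for start in range(0, len(energies) - window + 1, window):
        if pvariance(energies[start:start + window]) > threshold:
            hits.append(start)
    return hits
```

For example, a trace that fluctuates around one value, jumps, and then fluctuates around another is flagged only in the window containing the jump.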

Page 125: Protein folding

Ensemble Dynamics

Calculation details

We simulated folding and unfolding at 300 K and pH 7.0, using the OPLS parameter set with the Generalized Born implicit solvent model.

Time step: 2 fs

Long-range interactions truncated with a 16 Å cutoff.

Page 126: Protein folding

What are they doing with this technique?

Page 127: Protein folding

A more complex system

Note how most of the interactions in the partially folded protein are non-native.

This means that in order to resume folding, these must be broken.

The villin headpiece is one of the fastest-folding peptides known! What about simulating anything else?

Page 128: Protein folding

Energy Landscape

It is clear in this figure that there are:

1. One folding pathway

2. One intermediate

3. Two energy barriers

Page 129: Protein folding
Page 130: Protein folding

Statistical Mechanics

Practical definition for our purpose:

Statistical mechanics can be used to create predictive models in the absence of theoretical models.

For example: interaction between amino-acids.

Page 131: Protein folding

Statistical mechanics

Atom-level simulations are expensive, and empirical.

Statistical mechanics bridges frequencies of observations with physical forces for chemical systems.

The resulting model is thus used to assess the “energy” of a trial conformation and can be used as an objective function to optimize a solution.

This technique is increasingly used in bioinformatics since the information in the database can be seen as a collection of observations at equilibrium.

Page 132: Protein folding

Statistical mechanics

In other words, if it can’t be seen in the database, the energy of an observation must be high. If it’s common, the energy must be low.

Remember, everything is possible; the probability of an observation is related to its relative energy.

$$E_i = -RT \ln f_i$$

Page 133: Protein folding

How does this tie in to bioinformatics?

There is a direct relationship between the energy of a state in a system at equilibrium and the probability of observing this state.

$$E_i = -RT \ln f_i$$

$$E_i = -RT \ln \frac{n_i}{\sum_i n_i}$$
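
As a small worked example, observed counts can be converted into energies with the relation E_i = -RT ln(n_i / sum_i n_i). The sketch below assumes R in kcal/(mol·K) and a temperature of 298 K; the function name and the dictionary layout are illustrative.

```python
import math

R = 0.0019872041  # gas constant in kcal/(mol*K)

def state_energies(counts, temperature=298.0):
    """Map {state: count} to {state: E} with E = -RT ln(n_i / sum n_i).

    States with a zero count are skipped, since log(0) is undefined.
    """
    total = sum(counts.values())
    return {state: -R * temperature * math.log(n / total)
            for state, n in counts.items() if n > 0}

energies = state_energies({"common": 90, "rare": 10})
# The rarely observed state receives the higher energy.
print(energies["rare"] > energies["common"])
```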

Page 134: Protein folding

What are “states” in protein structures?

There is a lot of freedom in defining states for protein structures. Here is one example:

[Figure: contact map, with sequence position on both axes.]

Contacts

In this plot, if two positions of the 1D sequence are in physical contact, the pair is marked as an orange pixel.

It is thus possible to harvest from a collection of structures a matrix of observed contacts.
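
Harvesting contacts from one structure can be sketched as follows. The 8 Å cutoff between representative atoms (for example Cα) and the exclusion of immediate sequence neighbours are common conventions but are assumptions here, as are the function and parameter names.

```python
import math

def contact_map(coords, cutoff=8.0, min_separation=2):
    """Symmetric boolean matrix: True where residues i and j are in contact.

    coords: one (x, y, z) tuple per residue, in sequence order.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        # Skip pairs closer than `min_separation` along the chain.
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap
```

Summing such maps over a collection of structures, indexed by the amino-acid types at positions i and j, gives the matrix of observed contacts.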

Page 135: Protein folding

What are states in protein structures?

There is a lot of freedom in defining states for protein structures. Here is one example:

[Figure: contact map, with sequence position on both axes.]

Contacts

In this case the energy for any given pair would be:

$$E_{\mathrm{Pair}(a,b)} = -RT \ln \frac{n(a,b)}{\sum_i n(a,i)}$$

Page 136: Protein folding

What are states in protein structures?

There is a lot of freedom in defining states for protein structures. Here is one example:

[Figure: contact map, with sequence position on both axes.]

For this value to be valid, there is an assumption of equilibrium.

Equilibrium:

The sampling would not change significantly over time.

Page 137: Protein folding

What are states in protein structures?

There is a lot of freedom in defining states for protein structures. Here is one example:

[Figure: contact map, with sequence position on both axes.]

Pitfall

To be accurate for rare observations, the total number of observations should be infinitely large and derived from sequences and structures at equilibrium.

Practically, there should be enough instances of the rarest entry to avoid large errors in the estimate (log(0)).
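
A common practical guard against log(0) is additive smoothing: a small pseudocount is added to every cell before taking the logarithm. This is a swapped-in technique, not something stated on the slides. The sketch below applies it to the pair energy E(a,b) = -RT ln(n(a,b) / sum_i n(a,i)); the function name, the pseudocount of 1 and the two-letter toy alphabet are assumptions for illustration.

```python
import math

R = 0.0019872041  # gas constant in kcal/(mol*K)

def pair_energies(contact_counts, alphabet, temperature=298.0, pseudocount=1.0):
    """E(a,b) = -RT ln(n(a,b) / sum_i n(a,i)), with a pseudocount per cell.

    contact_counts: {(a, b): observed number of a-b contacts}.
    """
    energies = {}
    for a in alphabet:
        row = {b: contact_counts.get((a, b), 0) + pseudocount for b in alphabet}
        row_total = sum(row.values())
        for b in alphabet:
            energies[(a, b)] = -R * temperature * math.log(row[b] / row_total)
    return energies

# Toy counts: the K-K contact was never observed, yet its energy stays finite.
toy = pair_energies({("L", "L"): 8, ("L", "K"): 2, ("K", "L"): 2}, "LK")
```

The pseudocount biases the estimate for rare pairs, so it limits the damage of sparse data rather than replacing a larger reference collection.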

Page 138: Protein folding

What are states in protein structures?

There is a lot of freedom in defining states for protein structures. Here is one example:

[Figure: contact map, with sequence position on both axes.]

Miyazawa-Jernigan Matrix

Such a matrix has been generated:

Miyazawa, S., Protein Eng. 1993 Apr;6(3):267-78

This is particularly useful for threading sequences into known structures for structure prediction purposes.

Page 139: Protein folding

What are states in protein structures?

The implementation of a distance-based energy term is trickier… but boils down to the same thing.

Knowledge-based force-field

Need to store the tuple in 4D matrices:

{ (a,b), r, k }

r: distance in Euclidean space

k: distance in sequence space

[Figure: chain of residues x1 to x9, showing the spatial distance r and the sequence separation k = 6.]

Page 140: Protein folding

What are states in protein structures?

The energy will be calculated with respect to all parameters considered.

Knowledge-based force-field

[Figure: chain of residues x1 to x9, showing the spatial distance r and the sequence separation k = 6.]

$$E_{\mathrm{Pair}(a,b,r,k)} = -RT \ln \frac{n_r^k(a,b)}{\sum_i n_r^k(a,i)}$$

Page 141: Protein folding

What are states in protein structures?

There are implementations of this technique, such as PROSAII

http://www.came.sbg.ac.at/Services/prosa.html

Knowledge-based force-field

[Figure: chain of residues x1 to x9, showing the spatial distance r and the sequence separation k = 6.]

$$E_{\mathrm{Pair}(a,b,r,k)} = -RT \ln \frac{n_r^k(a,b)}{\sum_i n_r^k(a,i)}$$

Page 142: Protein folding

What are “states” in protein structures?

The exposure of each site to the exterior is an important factor. This is often quantified as the solvent-accessible surface area (ASA).

Knowledge-based force-field

Need to store the 2D tuple

{ a, ASA }

$$E_{a,\mathrm{ASA}} = -RT \ln \frac{n(a,\mathrm{ASA})}{\sum_{i \in \{\mathrm{ASA\ discretization}\}} n(a,i)}$$

Page 143: Protein folding

What are states in protein structures?

Ultimately, the energy of seeing a given sequence adopt a given structure can be computed as follows:

Knowledge-based force-field

$$E_{\mathrm{Tot}} = E_{\mathrm{Pairs}} + E_{\mathrm{Solv}} + E_{\mathrm{other}}$$

Caveats

The finer the parameterization, the larger the reference collection of (appropriate) structures in the database must be in order to observe every possibility many times.

Choosing the minimum set of terms that fully defines a structure is a design-level decision.

Page 144: Protein folding

An example

Real life example of using Knowledge-based methods.

This enzyme is called Enolase. It is a key enzyme in the sugar breakdown metabolism.

If there are important terms that are forgotten, the energy values may be inadequate.

Page 145: Protein folding

An example

Real life example of using Knowledge-based methods.

The function and the composition are very tightly related.

Red: negatively charged

Blue: positively charged

Tan: hydrophobic

These are the active site residues.

Page 146: Protein folding

An example

Real life example of using Knowledge-based methods.

The critical region in this protein has radically different properties from those expected in an average protein. The knowledge-based system does not account for these properties, and thus the positions shown in white were poorly estimated.

The way this assessment was done quantitatively goes well beyond the scope of this course.

Page 147: Protein folding

Another example

Cubic lattice simulation

The dimensionality of the protein folding problem can be reduced by simplifying the geometric properties of the system.

Knowledge-based energy evaluation can be used as an objective function that is relevant to the physical world, without the need to fully define a system with the 6 degrees of freedom.
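
A minimal sketch of such an objective function on a cubic lattice is given below. A toy hydrophobic-polar (HP) score stands in for a knowledge-based contact matrix; that substitution, and all the names used, are assumptions for illustration.

```python
def hp_energy(a, b):
    """Toy pair score: -1 for a hydrophobic-hydrophobic contact, else 0."""
    return -1.0 if a == "H" and b == "H" else 0.0

def lattice_energy(path, sequence, pair_energy=hp_energy):
    """Sum pair energies over non-bonded residues on adjacent lattice sites.

    path: one integer (x, y, z) site per residue, in chain order.
    """
    if len(set(path)) != len(path):
        raise ValueError("chain overlaps itself")
    site = {pos: i for i, pos in enumerate(path)}
    total = 0.0
    for (x, y, z), i in site.items():
        # Look only in the +x, +y, +z directions so each pair is counted once.
        for neighbour in ((x + 1, y, z), (x, y + 1, z), (x, y, z + 1)):
            j = site.get(neighbour)
            if j is not None and abs(i - j) > 1:
                total += pair_energy(sequence[i], sequence[j])
    return total

# A four-residue chain folded into a square brings its two ends into contact.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(lattice_energy(square, "HPPH"))
```

A search procedure would then propose new self-avoiding paths and keep those that lower this energy.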

Page 148: Protein folding

Spectrum of strategies

Physics Knowledge

Quantum mechanics Molecular Mechanics Statistical Mechanics Homology Modeling

Page 149: Protein folding

Homology Modeling

Homology

Related by a common ancestor.

Sequence identity among homologous structures can be as low as 15%.

Why make models?

There is a good chance that the structural efforts will never catch up with the sequencing projects.

How?

Figure out the most probable 3D structure, given a (1D) sequence and a 3D template from a related protein.

Page 150: Protein folding

Homology Modeling

Assumption

•Regions of alignable sequence share homologous structures

•Loop regions (non-conserved residues) allow insertions and deletions without disrupting the overall structure of a protein.

[Flowchart: Query sequence; Sequence similarity to solved structure?; PSI-BLAST / profile MSA; Secondary structure prediction; Fold prediction; Homology modeling; Model validation]

Page 151: Protein folding

Homology Modeling

Aligning a sequence and a structure

MSA (multiple sequence alignment) between the query and the sequence of the target structure.

Profile MSA – The query and an MSA of proteins homologous to the target structure.

Threading.

Page 152: Protein folding

Homology Modeling

Principle of threading

“Pull” a sequence through a structure such that the alignment corresponds to the frame with the best energy score.
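
The idea can be sketched as a gapless scan: the sequence is slid along the structure one position at a time, and each frame is scored. In this sketch the structure is reduced to a string of per-site environments ("B" for buried, "E" for exposed) and the score is a toy burial term; both are hypothetical simplifications, and real threading programs also allow gaps.

```python
HYDROPHOBIC = set("AVLIMFWC")

def toy_score(residue, environment):
    """-1 when a hydrophobic residue is buried or a polar one is exposed, +1 otherwise."""
    buried = environment == "B"
    return -1.0 if (residue in HYDROPHOBIC) == buried else 1.0

def thread(sequence, environments, score=toy_score):
    """Slide `sequence` along `environments` without gaps.

    Returns (best_offset, best_energy), or None if the sequence is longer
    than the structure.
    """
    best = None
    for offset in range(len(environments) - len(sequence) + 1):
        energy = sum(score(res, environments[offset + i])
                     for i, res in enumerate(sequence))
        if best is None or energy < best[1]:
            best = (offset, energy)
    return best

print(thread("LLKK", "EEBBEE"))
```

In a knowledge-based setting, `score` would be replaced by pair and solvation terms derived from database counts.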

Page 153: Protein folding

Homology Modeling

Energy evaluation for threading

Statistical mechanics is ideal in this case because physical models would require extensive simulation time to figure out the precise atomic conformation.

Page 154: Protein folding

Homology Modeling

Threading to detect correct alignments

The application GenTHREADER uses threading to perform protein fold recognition from genomic sequences.

Page 155: Protein folding

Homology Modeling

General Principle

1. Align to the sequence of a known structure.

2. Change the structure of the side-chains to match the query sequence according to the sequence alignment.

3. Model loops and variable regions.

4. Minimize energy / conformational search.

5. Check models for inconsistencies.

Feasibility

> 40% sequence identity is preferable.

25% - 40%: “Twilight Zone”

< 25%: Insufficient similarity in most cases.

May work only for one domain out of the whole protein.

Page 156: Protein folding

Neural Network

Anatomy of a NN:

[Figure: input parameters, weights, output parameters]

Page 157: Protein folding

Neural Network

Before a NN can be used, it must be trained:

Training compares the output of a NN with a known answer; the weight of each “arrow” is changed to minimize the error.
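
The training loop can be illustrated with the smallest possible network, a single sigmoid unit fitted by gradient descent on the squared error. This is a sketch only: the learning rate, the number of epochs and the logical-OR training set are arbitrary assumptions, and real predictors use multi-layer networks.

```python
import math

def predict(weights, inputs):
    """Output of one sigmoid unit; the last weight is the bias."""
    z = weights[-1] + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, epochs=2000, rate=0.5):
    """Adjust the weights to reduce the error between output and known answer."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for inputs, target in samples:
            out = predict(weights, inputs)
            # Gradient of 0.5 * (out - target)^2 with respect to the pre-activation.
            grad = (out - target) * out * (1.0 - out)
            for k, x in enumerate(inputs):
                weights[k] -= rate * grad * x
            weights[-1] -= rate * grad
    return weights

# Known answers for a toy problem (logical OR).
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train_neuron(samples)
```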

Page 158: Protein folding

Secondary Structure prediction

Three Generations of methods

Generation    Method   Approach                                    Accuracy (ACC)
1 (’60-’70)   GOR1     Single-character statistical information   ~57%
2 (’80)       GOR3     Local interactions                          ~63%
3 (’90+)      PHD      Homologous protein sequences                ~72%

Page 159: Protein folding

Secondary Structure prediction

1st Generation

Making use of compiled frequencies of the different characters for three possible classes:

Helix (H)

Strand (S)

Coil (-)

SDFDKILVSTYSPPQARILIVM

-----SSSSSSS----HHHHHH

Page 160: Protein folding

Secondary Structure prediction

2nd Generation

Making use of compiled frequencies of the different characters for three possible classes.

Considering the periodicity and neighbors.

Sliding window analyses

SDFDKILVSTYSPPQARILIVM

-----SSSSSSS----HHHHHH
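
A sliding-window prediction can be sketched as follows. The propensity table is made up for illustration only (three residues, one favoured class each); real methods use frequencies compiled from known structures. With `window=1` the function reduces to the single-residue approach of the first generation.

```python
def predict_windowed(sequence, propensity, window=5):
    """Assign H, S or - to each residue from scores averaged over a centred window."""
    half = window // 2
    states = "-HS"  # coil first, so that ties default to coil
    prediction = []
    for i in range(len(sequence)):
        lo, hi = max(0, i - half), min(len(sequence), i + half + 1)
        totals = {
            s: sum(propensity.get(res, {}).get(s, 0.0)
                   for res in sequence[lo:hi]) / (hi - lo)
            for s in states
        }
        prediction.append(max(states, key=lambda s: totals[s]))
    return "".join(prediction)

# Hypothetical propensities, not measured values.
TOY = {"A": {"H": 1.0}, "V": {"S": 1.0}, "G": {"-": 1.0}}
print(predict_windowed("AAAAAGGGGGVVVVV", TOY, window=3))
```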

Page 161: Protein folding

Secondary Structure prediction

3rd Generation

S D F ... M

0.1 0.01 0.0 ... 0.0

0.0 0.98 0.1 ... 0.09

... ... ... ... ...

0.02 0.0 0.05 ... 0.7

Frequency vectors obtained from multiple sequence alignments, also known as profiles.

These MSAs can be generated using BLAST or PSI-BLAST.
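
Building such frequency vectors from an alignment can be sketched as below. The sketch assumes that the aligned sequences have equal length and that gap characters are ignored when computing frequencies; the function name and the toy alignment are illustrative.

```python
from collections import Counter

def profile(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Per-column residue frequencies for a list of aligned sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must share one length")
    columns = []
    for column in zip(*alignment):
        # Count only recognised residues; gaps and unknowns are skipped.
        counts = Counter(res for res in column if res in alphabet)
        total = sum(counts.values())
        columns.append({res: (counts[res] / total if total else 0.0)
                        for res in alphabet})
    return columns

toy_profile = profile(["SDF", "SDY", "SEF", "-DF"])
print(toy_profile[1]["D"])
```

Each column of the result is one input vector for the predictor, in place of a single character.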

Page 162: Protein folding

Secondary Structure prediction

Best done using Neural Networks (or HMM… )

3rd Generation

S D F ... M

0.1 0.01 0.0 ... 0.0

0.0 0.98 0.1 ... 0.09

... ... ... ... ...

0.02 0.0 0.05 ... 0.7

H H - … S

The NN output of the profiles gets scanned by a few distinct NNs using a sliding window strategy.

Assignment on the basis of “winner takes all”.

Page 163: Protein folding

Secondary Structure prediction

Alignments grow, secondary structure prediction improves

Przybylski, Rost. 2002. Proteins, 46:197-205

Conclusions

•Using MSA (multiple sequence alignment) significantly improves the predictions (0.72 -> 0.75)

•The larger the database used, the better. However, there is a point where the information content saturates.

•Psi-BLAST vs BLAST: BLAST may be better in some cases.

•Refining the alignment did not help.

Page 164: Protein folding

Secondary Structure prediction

Bidirectional Dynamics for protein secondary structure prediction

Baldi et al., 2000, in Sequence learning, pp. 80-114

IOHMM model

Memory evaluated experimentally at about 15 characters

Page 165: Protein folding

Secondary Structure prediction

Bidirectional Dynamics for protein secondary structure prediction

Baldi et al., 2000, in Sequence learning, pp. 80-114

Recurrent Neural Network implementation

Page 166: Protein folding

Overview

Protein folding and parallel computing.

Current simulations work for modest-sized systems.

Homology modeling and statistical mechanics.

There is a clear advantage in using the information that we already have to solve new problems.

Secondary structure prediction and artificial intelligence.

Machine learning is appropriate to capture the trends leading to prediction.