Xin Gao PhD student 2006.11.6. Outline Traditional Protein Structure Prediction Introduction ...

Refinement: A Crucial Step to Approach Accurate Predictions

Xin Gao

PhD student

2006.11.6

Outline

• Traditional Protein Structure Prediction
  Introduction
  Methods Review
  Experimental Results

• Refinement
  Motivation
  Methods Review
  Proposed Research Plan

Traditional Protein Structure Prediction — Introduction

• WHY do we study protein structure prediction problem?

• WHAT determines protein structures?

• HOW can we know protein structures?


• WHY?
  One of the most significant “grand challenges” in science.
  A key problem in proteomics, the next step in understanding life processes after the successful completion of the Human Genome Project.
  A necessary step in studying protein functions.
  May improve, or even revolutionize, human medicine and health care.


• WHAT?

Inference of Structure from Sequence

Observation: The structure of a protein is uniquely determined by its amino acid sequence, according to both energy and kinematics (exceptions exist).

Inference of Function from Structure

Observations:

1) Proteins perform functions through their structures.

2) Proteins in the same fold usually have similar functions.

3) Proteins with novel, not yet observed folds have rarely been discovered in recent years.


• HOW?

Experimental Methods
  X-ray Crystallography
  Nuclear Magnetic Resonance Spectroscopy (NMR)
  Drawback: costly and time consuming.

Computational Methods
  Have been studied for three decades.
  Great progress has been made.




Traditional Protein Structure Prediction — Methods Review

• Basic hypothesis: Anfinsen’s (1973) thermodynamic hypothesis

Proteins are not assembled into their native structures by a biological process; rather, folding is a purely physical process that depends only on the specific amino acid sequence of the protein.

Anfinsen’s hypothesis implies that in principle protein structure can be predicted if a model of the free energy is available, and if the global minimum of this function can be identified.


• Computational Methods
  Ab Initio Methods
  Comparative Modeling Methods
  Fold-recognition Methods
  Consensus-based Methods
  Other Methods


Ab Initio Methods (Template-free Modeling)

1) Basic Idea:

Following Anfinsen’s (1973) thermodynamic hypothesis, such methods attempt to identify the structure with the minimum free energy using first principles alone: energy and kinematics.



2) Major Steps:
  Choose a first-principles-based energy function.
  Apply an algorithm to generate all possible conformations.
  Use a search strategy to find the conformation that minimizes the energy function.
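The three steps can be illustrated on a toy problem. The sketch below is not the method of any server named in these slides. It assumes the 2D hydrophobic-polar (HP) lattice model as a stand-in energy function, enumerates every self-avoiding conformation of a short chain, and picks one with minimum energy. The names `trace`, `energy`, `conformations`, and `fold` are hypothetical.

```python
from itertools import product

# Unit moves on a 2D square lattice (assumed toy conformation space).
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def trace(moves):
    """Convert a move string to lattice coordinates; None if the walk self-intersects."""
    pos = [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        nxt = (pos[-1][0] + dx, pos[-1][1] + dy)
        if nxt in pos:
            return None
        pos.append(nxt)
    return pos


def energy(seq, pos):
    """Toy HP energy: -1 per pair of H residues that touch on the lattice
    without being neighbors along the chain."""
    e = 0
    for i in range(len(seq)):
        for j in range(i + 2, len(seq)):
            if seq[i] == "H" and seq[j] == "H":
                if abs(pos[i][0] - pos[j][0]) + abs(pos[i][1] - pos[j][1]) == 1:
                    e -= 1
    return e


def conformations(n):
    """Step 2: generate all self-avoiding conformations of an n-residue chain."""
    for moves in product("UDLR", repeat=n - 1):
        pos = trace(moves)
        if pos is not None:
            yield pos


def fold(seq):
    """Step 3: exhaustive search for a minimum-energy conformation."""
    return min(((energy(seq, p), p) for p in conformations(len(seq))),
               key=lambda t: t[0])
```

The number of conformations grows exponentially with chain length, which is the "computationally demanding" limitation in miniature: exhaustive search like this is only feasible for very short chains.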



3) Advantages:
  Do not depend on any template database.
  Can be used when other methods fail.
  Can complement other methods, e.g., for loop modeling.

4) Limitations:
  Computationally demanding.

5) Famous Servers:
  Folding@Home (a distributed computing project in which people throughout the world download and run software to pool their computing power).

6) Current Development:
  Becoming more and more important for dealing with hard targets or hard parts of targets; hybrid servers that combine ab initio with other methods are preferred.


Comparative Modeling Methods

1) Basic Idea:

Aim to predict the structure of a target protein when a clear evolutionary relationship between the target and a protein of known structure can be easily detected from the sequence.

Based on the observation that when two proteins have more than 30% sequence identity, their structures are very similar.



2) Major Steps:
  Choose a template database and a scoring matrix or profile.
  Do sequence-sequence alignment against each template in the database, and select the best-aligned one.
  Refine side chains and regions of low sequence identity.
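The alignment and template selection step can be sketched as follows. This is a minimal illustration, not the scoring used by any server in these slides: it assumes a flat match/mismatch score and a linear gap penalty in place of a real substitution matrix or profile, and the names `global_align`, `percent_identity`, and `select_template` are hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(al_a, al_b):
    """Identical residues as a percentage of aligned (gap-free) columns.
    Other definitions of sequence identity exist; this one is an assumption."""
    cols = [(x, y) for x, y in zip(al_a, al_b) if x != "-" and y != "-"]
    return 100.0 * sum(x == y for x, y in cols) / len(cols) if cols else 0.0


def select_template(target, templates):
    """Return the name of the template with the highest alignment score."""
    return max(templates, key=lambda name: global_align(target, templates[name])[0])
```

With a real substitution matrix and affine gap penalties the recurrence keeps the same shape; only the scoring terms change.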



3) Advantages:
  If there is indeed a homologous template in the database, the prediction can be very accurate, usually with RMSD < 4 Å.
  Predictions can be made very quickly.

4) Limitations:
  Database dependent.
  Can only generate good predictions for easy targets, which have homologous templates in the database.

5) Famous Servers: SAM-T02, FFAS03.

6) Current Development:
  Structure-dependent scores and gap penalties, and profile-profile alignment techniques, are being used to deal with targets that are only distantly homologous to templates.


Fold Recognition Methods

1) Basic Idea:

Aim to predict the structure of a target protein even if no sequence similarity can be detected.

Based on the notion that structure is evolutionarily more conserved than sequence.



2) Major Steps:
  Choose a reasonable structure database and an energy function.
  Do sequence-structure alignment against each template in the database, and select the best-aligned one.
  Refine side chains and non-aligned regions.



3) Advantages:
  Can detect distant homology.
  Can predict protein structures even if they have no sequence similarity or are evolutionarily unrelated.

4) Limitations:
  Database dependent.
  The predictions generated are usually of medium resolution.

5) Famous Servers: RAPTOR, SPARK, PROSPECTOR, FUGUE.

6) Current Development:
  Different profile extraction methods are being tested.
  Fragment assembly and mini-threading techniques are used to improve accuracy.


Consensus Based Methods

1) Basic Idea:

Based on the observation that different servers usually generate good predictions for different targets. Why not combine their strengths?



2) Major Kinds:

Selection-only consensus methods: Try to choose the best predictions from the input prediction set. Cannot do better on a target than the best input server.

Hybrid consensus methods: Try to combine different regions extracted from different input predictions to construct a new and hopefully better prediction.
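A selection-only method can be sketched in a few lines. The version below is an illustration only, not the algorithm of ACE, Pcons, or any other server named here. It assumes a precomputed symmetric matrix of pairwise structural similarities between the input models (for example, a GDT-like score in [0, 1]) and picks the model that the others support most. The name `select_consensus` is hypothetical.

```python
def select_consensus(similarity):
    """Return the index of the model with the highest mean similarity
    to all other models in the input set (needs at least two models)."""
    n = len(similarity)
    if n < 2:
        raise ValueError("need at least two input models")

    def support(i):
        return sum(similarity[i][j] for j in range(n) if j != i) / (n - 1)

    return max(range(n), key=support)
```

Because the output is always one of the inputs, such a method cannot beat the best input model, and a poor model that several correlated servers agree on can still win the vote.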



3) Famous Servers: ACE, Pcons, Pmodeller, 3D-SHOTGUN.

4) Current Development: Poor-quality predictions are sometimes supported by many servers and are then selected. New techniques are being used to remove the correlation among input servers to overcome this problem.


Other Methods

Combine different methods. Fragment assembly is usually used. Famous servers include ROSETTA and TOUCHSTONE.



Traditional Protein Structure Prediction — Experimental Results

Critical Assessment of Protein Structure Prediction (CASP)

• Began in 1994 (CASP1)

• Held every two years

• The most objective assessment in the field.

• In CASP7 (May-Aug, 2006), 98 automated servers and 204 human expert servers were registered.

Traditional Protein Structure Prediction — Experimental Results

Best servers:

TASSER, ROSETTA, RAPTOR-ACE, RAPTOR, PModeller, SPARK.

Observations:

1) Consensus servers usually outperform individual servers.

2) There are more and more hybrid servers.

3) Most servers can generate good predictions or at least good regions for many targets. Thus, refinement is urgently needed.

Refinement — Motivation

• What is refinement?

Goal: To make predictions more accurate.

There is no formal definition. My definition:

Given a set of reasonably good predictions, construct a prediction that is closer to the native structure.

Reasonably good:

The whole structure is close to native, or there are good regions in the structure that are close to the corresponding regions in the native structure.

Close to native:

One of the most controversial problems in the field. No measure is considered perfect.

Here, it means that the RMSD or GDT score is better than some threshold.


• Why is refinement possible?

Data are taken from SBC evaluation, on 2006.10.30, of 86 targets.

http://www.pdc.kth.se/~bjornw/casp7/targets/results/

Server Name   Sum GDT of Top1   Average GDT of Top1   Sum GDT of Best of Top5   Average GDT of Best of Top5
TASSER        52.17             0.607                 54.19                     0.630
ROBETTA       48.86             0.568                 51.61                     0.600
RAPTOR        47.31             0.550                 50.27                     0.585
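The averages in the table are the sums divided by the 86 targets, and the difference between the two averages is the headroom a perfect selection among each server's own top 5 models would recover. A small check of that arithmetic, using only the numbers quoted above (the names `N_TARGETS`, `SUMS`, and `averages` are hypothetical):

```python
N_TARGETS = 86

# (sum GDT of Top1, sum GDT of best of Top5), copied from the table.
SUMS = {
    "TASSER": (52.17, 54.19),
    "ROBETTA": (48.86, 51.61),
    "RAPTOR": (47.31, 50.27),
}


def averages(top1_sum, best5_sum, n=N_TARGETS):
    """Per-target average GDT for Top1 and for the best of Top5, rounded to 3 places."""
    return round(top1_sum / n, 3), round(best5_sum / n, 3)


for name, (top1, best5) in SUMS.items():
    avg1, avg5 = averages(top1, best5)
    print(f"{name}: top1={avg1:.3f} best-of-top5={avg5:.3f} headroom={avg5 - avg1:.3f}")
```

Under these numbers the gap is roughly 0.02 to 0.035 GDT per target for each server, before any recombination of regions across models is considered.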


Quick notes about GDT:

1) GDT: Global Distance Test (Zemla et al.).

2) Defined as the average, over four distance thresholds (1, 2, 4, and 8 Å), of the fraction of the target sequence covered by the substructure that fits within each threshold.

3) Weakness: Since the GDT score focuses only on the size of the substructures, the detailed match information of models and native structures is partially missed.
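The definition can be made concrete with a simplified calculation. The sketch below assumes the model and native Cα coordinates are already superimposed and listed in the same residue order. The full GDT procedure searches over superpositions to maximize the coverage at each threshold, so this fixed-superposition version can only underestimate it. The name `gdt_ts` is hypothetical.

```python
import math


def gdt_ts(model, native, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """Average fraction of residues whose model position lies within each
    distance threshold (in angstroms) of the native position."""
    if len(model) != len(native) or not native:
        raise ValueError("model and native must be non-empty and equally long")
    dists = [math.dist(p, q) for p, q in zip(model, native)]
    fractions = [sum(d <= t for d in dists) / len(dists) for t in thresholds]
    return sum(fractions) / len(fractions)
```

The weakness noted in item 3 is visible here: only the count of residues under each threshold enters the score, so two models with the same counts but very different local geometry score the same.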

Refinement — Motivation

Some Instances:

For T0198 of CASP6, RAPTOR predicted two good regions, but their relative orientation is wrong, so the model got a low score.

[Figure: T0198 predicted by RAPTOR; T0198 native structure]

Taken from Zhang Yang’s online evaluation server.

http://zhang.bioinformatics.ku.edu/TM-score/




Refinement — Methods Review

• Two Major Categories of Methods

Partial Structure Refinement

Whole Structure Refinement
  • Ab Initio Methods
  • Template-Based Methods
  • Consensus-Based Methods


• Partial Structure Refinement

Based on the assumption that the backbone structures of core regions are good. Aims to refine the other regions.

Loop modeling methods:
  LOOPY (Honig Lab; an ab initio method generates initial conformations and a random tweak method closes them)

Side chain packing methods:
  SCWRL (Dunbrack Lab, graph theory)
  SCATD (Jinbo Xu, tree decomposition)


• Whole Structure Refinement

Ab Initio Methods:

Basic Idea: Assume the structure is roughly good and only needs to be “shaken” a little to reach a conformation with lower energy.

Server: RAPTORESS (Xin Gao et al., integer linear programming based backbone refinement)

Template-Based Methods:

Basic Idea: Extract information from a set of specially chosen templates, and refine the structure according to that information.

Servers: MODELLER (Andrej Sali; optimizes a probability density function for each of the restraint features of the model); SEGMOD (Michael Levitt; segment match modeling using a database of known protein X-ray structures).

Consensus-Based Methods:

Basic Idea: Given an input prediction set in which each structure contains some close-to-native regions, try to combine them into a hybrid structure that is closer to native.

Servers: TASSER (Zhang Yang; hyperbolic Monte Carlo sampling to assemble continuous template fragments); POPULUS (Marc Offman et al.; “move-set” based genetic algorithm to reshuffle and repack structural components).

Refinement — Methods Review

• TASSER (Threading/ASSEmbly/Refinement)

Steps:

1) Thread the sequence through a representative template library (35% pairwise sequence identity cutoff) using PROSPECTOR.

2) Split the target sequence into threading-aligned and unaligned regions. Parallel hyperbolic Monte Carlo sampling is used to assemble full-length protein models by rearranging the continuous aligned fragments (building blocks) excised from the threading templates. During assembly, building blocks are kept rigid and off-lattice to retain their geometric accuracy; unaligned regions are modeled on a cubic lattice by an ab initio procedure.

Performance: Ranked number one in CASP7, much better than any other server, even the consensus servers.


• POPULUS

Move set:

X = single crossover

XX = double crossover

C = coil mutation

H = helix mutation

CCD = Cyclic Coordinate Descent Algorithm

[Figure: move set; protein mutation]
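The two crossover moves can be sketched on a simplified representation. This is not the POPULUS implementation: it assumes each parent structure is a plain list of per-residue values (for example, backbone torsion angles) of equal length, and the names `single_crossover` and `double_crossover` are hypothetical.

```python
def single_crossover(parent_a, parent_b, point):
    """X move: take residues before `point` from parent A and the rest from parent B."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return parent_a[:point] + parent_b[point:]


def double_crossover(parent_a, parent_b, start, end):
    """XX move: replace the segment [start, end) of parent A with that of parent B."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return parent_a[:start] + parent_b[start:end] + parent_a[end:]


# Usage on two toy 5-residue parents.
a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
child_x = single_crossover(a, b, 2)     # [1, 2, 30, 40, 50]
child_xx = double_crossover(a, b, 1, 3)  # [1, 20, 30, 4, 5]
```

In three-dimensional coordinates, splicing segments from different structures generally leaves a mismatch at the junctions, which is where a chain-closure step becomes necessary.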


Flowchart:

CASP6 submitted models

Ratio: 2:1:1:1

Energy-based scoring scheme, top 25

D(Ave, Best) < 0.0001, Sum(Cur) = Sum(Previous), D(Si, Sj) < 0.04, N(rounds) > 20

Top 20 structures returned


Performance:

1) The POPULUS server did not participate in CASP7.

2) The move set is shown to be good:
  Using (GDT + MaxSub + TM-score)/3 as the scoring function, assuming the native structure is already known, the average is around 80%.

3) The energy-based scoring function is shown to be good:
  Aim: score the native structure as the global minimum.
  With native structures as input, the output structures stay close to the input (20/23 cases, RMSD < 0.6 Å).
  Score(native) < Score(lowest output) (21/23 cases).




Refinement — Proposed Research Plan

• Basic Idea:
  By assembling good and long fragments, the traditional search space can be greatly reduced.
  An efficient energy function is used to direct the assembly process.

• Subproblems:

1) How to find good fragments?

2) What is the assembly process?

3) What is the energy function?


How to find good fragments?

The main task of my project.

Basic Idea: Develop a confidence score which can evaluate the confidence for an aligned region.

Goal: Try to increase the sensitivity (recognize good regions as good) as much as possible, while keeping a high specificity (recognize bad regions as bad).
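The two quantities in the goal can be computed as follows once each region has a true label (good or bad, judged against the native structure) and a confidence score. This is a minimal sketch; the threshold rule and the names `classify` and `sensitivity_specificity` are assumptions, since the slides do not yet fix how the confidence score is turned into a decision.

```python
def classify(scores, threshold):
    """Call a region 'good' when its confidence score reaches the threshold."""
    return [s >= threshold for s in scores]


def sensitivity_specificity(truth, predicted):
    """truth/predicted: equal-length lists of booleans (True = good region).
    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    fn = sum(t and not p for t, p in zip(truth, predicted))
    tn = sum(not t and not p for t, p in zip(truth, predicted))
    fp = sum(not t and p for t, p in zip(truth, predicted))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one good and one bad region")
    return tp / (tp + fn), tn / (tn + fp)
```

Lowering the threshold raises sensitivity at the cost of specificity, so the goal amounts to choosing a score whose good and bad distributions overlap as little as possible.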


Preliminary Experiments:

1) Statistical Alignment Coverage (RAPTOR results for CASP7)

$$\text{Coverage} = \frac{\#\,\text{aligned residues}}{\#\,\text{residues in sequence}} \times 100\%$$

Analysis:

There is a huge gap between how much of the target the alignment covers and how much of the target the good regions cover. So we have to filter out the aligned parts that are bad regions, which make up about half of the alignment.

Server Name   Top1 Coverage   Top1 Ave GDT   Best Top5 Coverage   Best Top5 Ave GDT
RAPTOR        90.17%          55.0%          92.0%                58.5%



2) Measure Local Quality Directly by CHARMM Energy Function

Server Name   Ave # Regions per Target   Difference Between Best Energy and Worst
RAPTOR        6.7                        <2%

$$E = E_{\text{bond}} + E_{\text{angle}} + E_{\text{dihedral angle}} + E_{\text{improper torsion angle}} + E_{\text{van der Waals}} + E_{\text{electrostatic}}$$

Measured on aligned regions that are supported by all of the top 5 models and are at least 10 amino acids long.


Analysis:

(a) This energy function is not good enough to separate locally good regions from locally bad ones, because it works better on the whole protein structure than on its parts, and because in aligned regions these terms are usually similar and very close to their standard values.

(b) Tolerating a small number of gaps within a relatively long region may reduce the effect of alignment errors, and would yield fewer but longer regions.
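The region extraction used in this experiment, together with the gap tolerance suggested in (b), can be sketched as follows. The representation is an assumption: each of the top 5 models is reduced to a boolean mask over target residues (True means the residue is aligned), and a region is "supported" where every mask is True. The name `supported_regions` and its parameters are hypothetical.

```python
def supported_regions(masks, min_len=10, max_gap=0):
    """Return half-open (start, end) residue ranges that all models support.

    Runs separated by at most `max_gap` unsupported residues are merged,
    and only merged runs of at least `min_len` residues are kept.
    """
    n = len(masks[0])
    common = [all(m[i] for m in masks) for i in range(n)]

    # Collect maximal runs of commonly supported residues.
    runs, start = [], None
    for i, ok in enumerate(common + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            runs.append((start, i))
            start = None

    # Merge runs whose separating gap is short enough.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    return [(s, e) for s, e in merged if e - s >= min_len]
```

With `max_gap=0` this reproduces the strict criterion of the experiment; raising it trades a few possibly misaligned residues for fewer, longer regions.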



3) How much of the target can the good regions cover, and where are they located? (In progress)

Server Name   Total Coverage of Good Regions in Top5   Relative Locations of Good Regions
RAPTOR        ?                                        ?



4) Different Kinds of Information That Influence Local Quality

(a) Local alignment quality: alignment score for the region.

(b) Consensus information: how much other servers support this region.

(c) Local energy: not enough to determine region quality, but will be helpful.

(d) Server related information: average coverage, different servers will prefer different alignments.

(e) Target information: target length, target categories (easy/medium/hard).

(f) Region information: region length, region position.


Possible Strategies:

(a) Neural Network: take all kinds of information as input value, train a neural network to predict how good an input region is.

(b) SVM: extract all kinds of information as features, describe a region as a vector of these features, train an SVM to classify whether an input region is good or not.

(c) Linear Programming: suppose the confidence score is the linear combination of all kinds of information, try to optimize the weights and maximize the confidence gap between good and bad regions.
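Strategy (c) can be written down as a small linear program. The formulation below is one possible reading, not a method fixed by these slides. It assumes each region $r$ is described by a feature vector $x_r \in \mathbb{R}^d$, that $G$ and $B$ are the training sets of good and bad regions, and that the confidence score is $w \cdot x_r$. The symbols $w$, $\delta$, $G$, and $B$ are introduced here for illustration.

```latex
\begin{aligned}
\max_{w,\;\delta} \quad & \delta \\
\text{subject to} \quad & w \cdot x_g - w \cdot x_b \;\ge\; \delta
    && \text{for all } g \in G,\; b \in B, \\
& -1 \le w_k \le 1 && k = 1, \dots, d .
\end{aligned}
```

The box constraint on $w$ keeps the problem bounded, since otherwise any positive gap could be scaled up indefinitely. If the optimal $\delta$ is not positive, no linear combination of these features ranks every good region above every bad one, and slack variables would have to be added to tolerate some violations.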


Importance:

1) Protein motif discovery or functional site and active site study.

2) In contrast to traditional prediction quality criteria, this can be used by researchers as a blind prediction quality criterion, that is, one that does not require knowing the native structure.

3) Can be used to improve the accuracy of consensus based, and fragment assembly based methods.


What is the assembly process?

Future work.

Idea: Try some possible algorithms that assemble the selected good fragments with high priority, the 9-mer fragments in the candidate set (Shuaicheng’s work) with medium priority, and 3-mers or single residues with low priority.

Goal: For well-aligned targets, an exact search strategy can be used because the search space is small; for other targets, the search space will still be significantly reduced, and a heuristic algorithm can be used.


What is the energy function?

The vital problem in all the methods described, and the most important problem at the current stage.

The work will be meaningless with a wrong objective!

A universal energy function:

Combine all existing energy functions and optimize their terms so that close-to-native structures always have lower energy than decoy structures. (Joint work with Shuaicheng Li, Dongbo Bu)
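The stated requirement, that close-to-native structures always have lower energy than decoys, gives a direct way to evaluate any candidate set of weights. The sketch below assumes the combined energy is a weighted sum of existing energy terms, which is one natural reading of "combining" but is not spelled out in the slides. The names `combined_energy` and `ranking_violations` are hypothetical.

```python
def combined_energy(terms, weights):
    """Weighted sum of individual energy terms for one structure."""
    return sum(w * t for w, t in zip(weights, terms))


def ranking_violations(near_native, decoys, weights):
    """Count (near-native, decoy) pairs in which the near-native structure
    does NOT have strictly lower combined energy than the decoy."""
    return sum(
        combined_energy(n, weights) >= combined_energy(d, weights)
        for n in near_native
        for d in decoys
    )
```

A weight vector with zero violations on a training set satisfies the requirement there; the optimization problem is then to find such weights, or the weights with the fewest violations when none exist.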

Summary

Previous studies have shown that refinement is an indispensable step in solving the protein structure prediction problem.

Refinement can be done based on the current methods and the current PDB database.

CASP can provide an objective evaluation.

Thank you

Questions & Comments
