Protein Structural Prediction
![Page 1: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/1.jpg)
Protein Structural Prediction
![Page 2: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/2.jpg)
Structure Determines Function
What determines structure?
• Energy
• Kinematics
How can we determine structure?
• Experimental methods
• Computational predictions
The Protein Folding Problem
![Page 3: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/3.jpg)
Protein Structure Prediction
• ab initio: use just first principles (energy, geometry, and kinematics)
• Homology: find the best match to a database of sequences with known 3D structure
• Threading
• Meta-servers and other methods
![Page 4: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/4.jpg)
Threading
• Threading is the golden mean between homology-based prediction and molecular modeling (?)
MTYKLILN …. NGVDGEWTYTE
Main difference between homology-based prediction and threading:
Threading uses the template structure to compute the energy function during alignment
![Page 5: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/5.jpg)
Threading – Overview
• Build a structural template database
• Define a sequence–structure energy function
• Apply a threading algorithm to query sequence
• Perform local refinement of secondary structure
• Report best resulting structural model
![Page 6: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/6.jpg)
Threading Search Space
Protein sequence X: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Protein structure Y
![Page 7: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/7.jpg)
Threading – Template Database
• FSSP, SCOP, CATH
• Remove pairs of proteins with highly similar structures
  • Efficiency
  • Statistical skew in favor of large families
![Page 8: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/8.jpg)
Threading – Energy Function
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
• $E_m$: how often a residue mutates to the template residue
• $E_s$: how well a residue fits a structural environment
• $E_p$: how preferable it is to put two particular residues nearby
• $E_g$: alignment gap penalty
• $E_{ss}$: compatibility with local secondary structure prediction
Total energy:
$$E = w_m E_m + w_s E_s + w_p E_p + w_g E_g + w_{ss} E_{ss}$$
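The weighted sum on this slide can be sketched in a few lines. This is only an illustration: the function name `total_energy`, the dictionary keys, and any numbers passed in are assumptions made for the example, not values or code from RAPTOR or any other threading program.

```python
# Term labels follow the slide: m (mutation), s (singleton fitness),
# p (pairwise), g (gap), ss (secondary structure).
TERMS = ("m", "s", "p", "g", "ss")


def total_energy(energies, weights):
    """Return the weighted sum of the five energy terms.

    `energies` and `weights` are dicts keyed by the labels in TERMS.
    """
    missing = [t for t in TERMS if t not in energies or t not in weights]
    if missing:
        raise KeyError(f"missing energy terms or weights: {missing}")
    return sum(weights[t] * energies[t] for t in TERMS)
```

In practice the weights would be tuned on training alignments; here they are simply inputs.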
![Page 9: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/9.jpg)
Threading – Formulation
[Figure: template cores C1 … C4 with alignment positions t1 … t4 and loops λ0 … λ4; contact graph between residues x, y, z, u, v of cores Ci and Cj]
• Contact graph captures amino acid interactions
• Cores represent important local structure units
• No gaps within each core
![Page 10: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/10.jpg)
Threading – Formulation
CMG = (V, E)
![Page 11: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/11.jpg)
Threading – Formulation
From Lathrop & Smith
![Page 12: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/12.jpg)
Threading Search Space
Protein sequence X: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Protein structure Y
How Hard is Threading?
CORES
![Page 13: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/13.jpg)
How Hard is Threading?
• At least as hard as MAX-CUT
MAX-CUT: Given graph G = (V, E), find a cut (S, T) of V with maximum number of edges between S and T.
The Bad News: APX-complete even when each node has at most B edges (where B>2)
[Figure: example graph on vertices 1 … 7]
![Page 14: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/14.jpg)
Reduction of MAX-CUT to Threading
• |V| cores, each core i has length 1 and corresponds to vi
• Let Ep(0,1) = 1: every edge labeled 0-1 or 1-0 gets a score of 1
• Then, size of cut = threading score
[Figure: example graph on vertices 1 … 7, with one core per vertex v1 … v7]
Sequence consists of |V| 01-pairs: 01 01 01 01 01 01 01 (one pair per core v1 … v7)
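The reduction can be checked on a small graph by brute force. The sketch below assumes 0-indexed vertices, places core i on either the 0 or the 1 of the i-th 01-pair (which keeps the cores in order), and scores an alignment with Ep(0,1) = 1. The function names and the test graphs are illustrative choices, since the edges of the graph in the slide's figure are not recoverable.

```python
from itertools import product


def threading_score(edges, positions, sequence):
    """Sum Ep over contact edges: 1 when the two aligned symbols differ, else 0."""
    return sum(1 for u, v in edges if sequence[positions[u]] != sequence[positions[v]])


def max_cut_via_threading(n_vertices, edges):
    """Try every in-order alignment of the length-1 cores; return (best score, labels)."""
    sequence = "01" * n_vertices  # |V| 01-pairs
    best_score, best_labels = -1, None
    for labels in product((0, 1), repeat=n_vertices):
        # Core i sits on the 0 or the 1 of pair i, so positions strictly increase.
        positions = [2 * i + label for i, label in enumerate(labels)]
        score = threading_score(edges, positions, sequence)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_score, best_labels
```

The labels of the best threading give the two sides of the cut, and the threading score equals the number of cut edges. The enumeration is exponential in |V|, which is consistent with the hardness result rather than a way around it.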
![Page 15: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/15.jpg)
Threading with Branch & Bound
• Set of solutions can be partitioned into subsets (branch)
• Upper limit on a subset’s solution can be computed fast (bound)
Branch & Bound
1. Select the subset with the best possible bound
2. Subdivide it, and compute a bound for each subset
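A minimal best-first branch and bound for a toy threading problem might look as follows. Several things here are assumptions made for the sketch and not taken from the slides: cores have length 1, energies are minimized, `single[i][k]` is the cost of placing core i at position k, `pair_energy(i, k, j, l)` is a caller-supplied pairwise cost, and `pair_min` is a value known to be no larger than any pairwise cost. The bound simply adds the cheapest possible value for everything not yet placed.

```python
import heapq
from itertools import combinations


def thread_branch_and_bound(single, pair_energy, pair_min):
    """Return (energy, positions) of a minimum-energy in-order placement."""
    n_cores, n_pos = len(single), len(single[0])
    min_single = [min(row) for row in single]
    total_pairs = n_cores * (n_cores - 1) // 2

    def exact(placed):
        energy = sum(single[i][k] for i, k in enumerate(placed))
        energy += sum(pair_energy(i, placed[i], j, placed[j])
                      for i, j in combinations(range(len(placed)), 2))
        return energy

    def lower_bound(placed):
        m = len(placed)
        open_pairs = total_pairs - m * (m - 1) // 2
        return exact(placed) + sum(min_single[m:]) + open_pairs * pair_min

    heap = [(lower_bound(()), ())]
    while heap:
        # Step 1: select the subset (partial placement) with the best bound.
        bound, placed = heapq.heappop(heap)
        if len(placed) == n_cores:
            return bound, placed  # bound is exact for a complete placement
        # Step 2: subdivide by placing the next core, and bound each child.
        first = placed[-1] + 1 if placed else 0
        last = n_pos - (n_cores - len(placed))  # leave room for later cores
        for k in range(first, last + 1):
            child = placed + (k,)
            heapq.heappush(heap, (lower_bound(child), child))
    return None


# Toy instance (made-up numbers): 3 cores, 4 positions, adjacent cores rewarded.
TOY_SINGLE = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8]]


def toy_pair(i, k, j, l):
    return -1 if abs(l - k) == 1 else 0
```

Because the bound never overestimates, the first complete placement popped from the heap is optimal. A tighter bound would prune more subsets but cost more to compute.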
![Page 16: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/16.jpg)
Threading with Branch & Bound
• The key to this algorithm is the tradeoff in the lower bound: efficient to compute vs. tight
![Page 17: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/17.jpg)
Threading with Integer Programming
Linear Program
maximize z = 6x + 5y (linear function)
subject to (linear constraints)
3x + y ≤ 11
-x + 2y ≤ 5
x, y ≥ 0
Integer Program
the same program, with x, y ∈ {0, 1} (integral constraints, nonlinear)
RAPTOR: integer programming-based threading, perhaps the best protein threading system
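The difference between the two programs can be seen by solving this small example directly. The sketch below is not how an LP solver works: it relies on the fact that a bounded two-variable LP attains its optimum at a vertex of the feasible region, so it enumerates intersections of constraint boundaries with exact fractions. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

# Each constraint is (a, b, c), meaning a*x + b*y <= c.
CONSTRAINTS = [(3, 1, 11), (-1, 2, 5), (-1, 0, 0), (0, -1, 0)]


def objective(x, y):
    return 6 * x + 5 * y


def feasible(x, y):
    return all(a * x + b * y <= c for a, b, c in CONSTRAINTS)


def lp_optimum():
    """Best feasible vertex of the LP, as (z, x, y) with exact fractions."""
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(CONSTRAINTS, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundaries never intersect
        x = Fraction(c1 * b2 - c2 * b1, det)  # Cramer's rule
        y = Fraction(a1 * c2 - a2 * c1, det)
        if feasible(x, y) and (best is None or objective(x, y) > best[0]):
            best = (objective(x, y), x, y)
    return best


def ip_optimum():
    """Best feasible 0/1 point, as (z, x, y)."""
    return max((objective(x, y), x, y)
               for x, y in product((0, 1), repeat=2) if feasible(x, y))
```

Here the LP optimum is fractional (x = 17/7, y = 26/7), while the 0/1 program is solved by x = y = 1. Restricting variables to integers is what makes the general problem hard.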
![Page 18: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/18.jpg)
Threading with Integer Programming
x(i,k) denotes that core i is aligned to sequence position k
y(i,k,j,l) denotes that core i is aligned to position k and core j is aligned to position l
D(i): the set of all positions to which core i can be aligned
R(i, j, k): the set of possible alignment positions of core j, given that core i aligns to position k
core_i = (head_i, tail_i), with length_i = tail_i – head_i + 1
![Page 19: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/19.jpg)
Threading with Integer Programming
Minimize
$$E = w_m E_m + w_s E_s + w_p E_p + w_g E_g + w_{ss} E_{ss}$$
subject to
Cores are aligned in order:
$$x(i,k) + x(i+1,l) \le 1, \quad \forall\, l \notin R(i, i+1, k)$$
Each y variable is 1 if and only if its two x variables are 1, so x and y represent exactly the same threading:
$$\sum_{l \in R(i,j,k)} y(i,k,j,l) = x(i,k)$$
$$\sum_{k \in R(j,i,l)} y(i,k,j,l) = x(j,l)$$
$$y(i,k,j,l) \ge x(i,k) + x(j,l) - 1$$
Each core has only one alignment position:
$$\sum_{k \in D(i)} x(i,k) = 1$$
Integrality:
$$x(i,k),\ y(i,k,j,l) \in \{0, 1\}$$
![Page 20: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/20.jpg)
Energy Function is Linear
• Sequence substitution score
• Fitness of aa in each position (example, hydrophobicity)
• Agreement with secondary structure prediction
• Pairwise interaction between two cores
• Gap between two successive cores
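One way to read the claim that the energy is linear, written as a sketch in the x(i,k), y(i,k,j,l) notation defined earlier: the per-core terms (substitution, fitness, secondary structure) depend only on where a single core is placed, and the pairwise and gap terms depend only on where two cores are placed. The coefficient names c and d are illustrative assumptions, not notation from the slides.

$$E = \sum_{i} \sum_{k \in D(i)} c(i,k)\, x(i,k) \;+\; \sum_{i<j} \sum_{k \in D(i)} \sum_{l \in R(i,j,k)} d(i,k,j,l)\, y(i,k,j,l)$$

Here c(i,k) would collect the weighted single-core contributions of aligning core i to position k, and d(i,k,j,l) the weighted pairwise and gap contributions of aligning core i to k and core j to l. Since all coefficients are constants computed before optimization, E is a linear function of the variables.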
![Page 21: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/21.jpg)
LP Relaxation and (again) Branch & Bound
1. Relax the integral constraint to x(i,k), y(i,k,j,l) ≥ 0
2. Solve the LP using a standard method (RAPTOR uses IBM's OSL)
3. If the resulting solution is integral, done
4. Else, select one non-integral variable (heuristically) and generate two subproblems by setting it to 0 and to 1; solve these with Branch & Bound
In practice, in RAPTOR only 1% of the instances in the test database required step 4; almost all solutions are integral!
![Page 22: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/22.jpg)
CAFASP
GOAL
The goal of CAFASP is to evaluate the performance of fully automatic structure prediction servers available to the community. In contrast to the normal CASP procedure, CAFASP aims to answer the question of how well servers do without any intervention of experts, i.e. how well ANY user using only automated methods can predict protein structure. CAFASP assesses the performance of methods without the
user intervention allowed in CASP.
![Page 23: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/23.jpg)
Performance Evaluation in CAFASP3
| Servers (54 in total) | Sum MaxSub Score | # correct (30 FR targets) |
|---|---|---|
| 3ds5, robetta | 5.17-5.25 | 15-17 |
| pmod, 3ds3, pmode3 | 4.21-4.36 | 13-14 |
| RAPTOR | 3.98 | 13 |
| shgu | 3.93 | 13 |
| 3dsn | 3.64-3.90 | 12-13 |
| pcons3 | 3.75 | 12 |
| fugu3, orf_c | 3.38-3.67 | 11-12 |
| … | … | … |
| pdbblast | 0.00 | 0 |
(http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released in December, 2002.)
Servers with name in italic are meta servers
MaxSub score ranges from 0 to 1
Therefore, maximum total score is 30
![Page 24: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/24.jpg)
One structure where RAPTOR did best
Red: true structure
Blue: correct part of prediction
Green: wrong part of prediction
• Target size: 144
• Superimposable size within 5 Å: 118
• RMSD: 1.9
![Page 25: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/25.jpg)
Some more results by other programs
![Page 26: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/26.jpg)
Some more results by other programs
![Page 27: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/27.jpg)
Some more results by other programs
![Page 28: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/28.jpg)
Structural Motifs
beta helix
beta barrel
beta trefoil
![Page 29: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/29.jpg)
Structural Motif Recognition
• Secondary Structure Prediction Find the helices, sheets, loops in a protein sequence
• Given an amino acid residue sequence, does it fold as a coiled coil? β-helix? β-barrel? Zinc finger?
• Intermediate goals towards folding
• Useful information about the function of a protein
• More amenable to sequence analysis than full fold prediction
![Page 30: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/30.jpg)
Structural Motif Recognition
1. Collect a database of known motifs and corresponding amino acid subsequences
2. Devise a method/model to “match” a new sequence to existing motif database
3. Verify computationally on a test set (divide database into training and testing subsets)
4. Verify in lab
![Page 31: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/31.jpg)
Structural Motif Recognition Methods
• Alignment
• Neural Nets
• Hidden Markov Models
• Threading
• Profile-based Methods
• Other Statistical Methods
![Page 32: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/32.jpg)
Predicting Coiled Coils
![Page 33: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/33.jpg)
Predicting Coiled Coils
• NewCoils: multiply the residue probabilities (frequencies) at each coiled-coil position
![Page 34: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/34.jpg)
Predicting Coiled Coils
• PairCoil: multiply pairwise probs of spatially neighboring positions
• Use a sliding window of length 28
• Perfect score separation between true and false examples (false = non-coiled-coil helices)
Berger et al. PNAS 1995
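The single-position idea (multiply per-position residue probabilities over a sliding window) can be sketched as below. This is not the NewCoils or PairCoil implementation: the frequency table `freq`, the `floor` value for unseen residues, and the function names are placeholders for illustration, and the pairwise extension is omitted. Logs are summed instead of multiplying probabilities, which is equivalent and numerically safer.

```python
import math

HEPTAD = "abcdefg"  # the seven coiled-coil positions


def window_log_score(window, register, freq, floor=1e-3):
    """Log of the product of residue probabilities, given the heptad register
    (index into HEPTAD) assigned to the first residue of the window."""
    total = 0.0
    for offset, residue in enumerate(window):
        position = HEPTAD[(register + offset) % 7]
        total += math.log(freq.get((position, residue), floor))
    return total


def best_window_scores(sequence, freq, window=28):
    """For each window start, the best log score over the 7 possible registers."""
    scores = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        scores.append(max(window_log_score(segment, r, freq) for r in range(7)))
    return scores
```

A real predictor would take `freq` from a database of known coiled coils and compare each window score against the score distribution of non-coiled-coil helices.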
![Page 35: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/35.jpg)
Predicting β-helices
• Helix composed of three parallel β-sheets
• Very few solved structures, very different from one another
• Absent in eukaryotes! Probably evolved subsequent to prok/euk split
![Page 36: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/36.jpg)
Predicting β-helices
• Only available program: BetaWrap
1. The rungs subproblem: given the location of a T2 turn of one rung, find the location of the T2 turn of the next rung
  • Distribution of turn lengths
  • Bonus/penalty for stacked pairs in the parallel β-strands
  • Discard if highly charged residues in the inward-pointing positions of the strand
2. From a rung to multiple rungs
  • Find multiple initial B2-T2-B3 rungs
  • Use a sequence template based on hydrophobicity to find many candidate rungs
  • Find the "optimal wrap" by DP + heuristic score, based on 5 consistent rungs
3. Completing the parse
  • Find B1 strands by locally optimizing their location
![Page 37: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/37.jpg)
Predicting β-helices
• BetaWrap gives scores that separate true from false β-helices
Bradley et al. PNAS 2001
![Page 38: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/38.jpg)
Predicting β-trefoils
http://betawrappro.csail.mit.edu/
Similar idea – use a combination of domain-specific expert knowledge with statistics
WRAP-AND-PACK
WRAP: Search for antiparallel strands to “wrap” a cap
PACK: Place the side chains in the interior of the wrapped strands
![Page 39: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/39.jpg)
Predicting Secondary Structure
• Given amino acid sequence, classify positions into helices, strands, or loops
• In general, harder than protein motif identification
• Best methods rely on Neural Networks
Similarly good separation can be achieved by SVMs
PSIPRED
1. Given a sequence x, generate a profile using PSI-BLAST
2. Pass the profile to a pre-trained NN
3. Output classification: helix / strand / loop
![Page 40: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/40.jpg)
PSIPRED
Profile M
Training & Testing
• Start with a database of determined folds (< 1.87 Å)
• Remove redundancy: any pair of proteins with high similarity (found by PSI-BLAST), leaving 187 proteins
• 3-fold cross validation
~76% classification accuracy
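The evaluation described above can be sketched as a per-residue accuracy plus a 3-fold split. The helper names are illustrative, and the interleaved fold assignment is an assumption; the slides do not say how the 187 proteins were divided.

```python
def q3_accuracy(predicted, observed):
    """Fraction of positions where the predicted H/E/C label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not predicted:
        raise ValueError("empty prediction")
    return sum(p == o for p, o in zip(predicted, observed)) / len(predicted)


def k_fold_splits(items, k=3):
    """Yield (train, test) lists; each item appears in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]
```

Each fold is held out once while the network is trained on the other two, and the reported accuracy is aggregated over the held-out predictions.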
![Page 41: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/41.jpg)
PSIPRED server
PSIPRED PREDICTION RESULTS
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 9888788777656877765688766579
Pred: CCCCCCCCCCCCCCCCCCCCCCCCCCCC
AA: PEPTIDEPEPTIDEPEPTIDEPEPTIDE
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 988888600148777777885001487777778842003789
Pred: CCCCCCCCEECCCCCCCCCCCCEECCCCCCCCCCCCCCCCCC
AA: PPEEPPTTIIDDEEPPEEPPTTIIDDEEPPEEPPTTIIDDEE
PSIPRED PREDICTION RESULTS
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 998888872100111210012112359
Pred: CCCCCCCCCCCHHHHHHHHCCCCCCCC
AA: PTYPTYPTXXXXXXXXXXXXTEETEET
PSIPRED PREDICTION RESULTS
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 91025687432236422336410232027743223653334679
Pred: CCCCCCCCCCCCCCCCCCCCCCCEEEECCCCCCCCCCCCCCCCC
AA: THISISAPRXTEINSEQXENCETHISISAPRXTEINSEQXENCE
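Output in the layout shown above can be read back with a short parser. This sketch assumes only what is visible in the examples: data lines start with `Conf:`, `Pred:`, or `AA:` followed by a single unbroken string, while legend lines with the same prefixes contain spaces. The function name is hypothetical and not part of PSIPRED.

```python
def parse_psipred_hformat(text):
    """Collect the Conf, Pred, and AA strings from PSIPRED-style output text."""
    fields = {"Conf": [], "Pred": [], "AA": []}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Skip legend lines such as "Conf: Confidence (0=low, 9=high)".
        if sep and key in fields and value and " " not in value:
            fields[key].append(value)
    # Long predictions are printed in several blocks; join them in order.
    return {key: "".join(chunks) for key, chunks in fields.items()}
```

The three returned strings should have equal length, which is a useful sanity check before further processing.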
![Page 42: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/42.jpg)
TRILOGY: Sequence–Structure Patterns
• Identify short sequence–structure patterns of 3 amino acids
  1. Pseq: R1 x(a–b) R2 x(c–d) R3
  2. Pstr: 3 Cα–Cα distances, and 3 Cα–Cβ vectors
• Find statistically significant ones (hypergeometric distribution)
  • Correct for multiple trials
• These patterns may have structural or functional importance
• Start with short patterns of 3 amino acids
  • Residue groups: {V, I, L, M}, {F, Y, W}, {D, E}, {K, R, H}, {N, Q}, {S, T}, {A, G, S}
• Extend to longer patterns
Bradley et al. PNAS 99:8500-8505, 2002
![Page 43: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/43.jpg)
TRILOGY
![Page 44: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/44.jpg)
TRILOGY: Extension
Glue together two 3-aa patterns that overlap in 2 amino acids
$$\text{P-score} = \sum_{i=M_{pat}}^{\min(M_{seq},\, M_{str})} \binom{M_{seq}}{i} \binom{T - M_{seq}}{M_{str} - i} \binom{T}{M_{str}}^{-1}$$
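The P-score is a hypergeometric tail probability and can be evaluated exactly with integer arithmetic. In the sketch below the reading of the symbols is an assumption, since the slide does not define them: T is taken as the total number of candidate residue triplets, Mseq and Mstr as the numbers matching the sequence pattern and the structure pattern, and Mpat as the number matching both.

```python
from fractions import Fraction
from math import comb


def p_score(T, M_seq, M_str, M_pat):
    """Probability of at least M_pat joint matches under a hypergeometric null."""
    upper = min(M_seq, M_str)
    numerator = sum(comb(M_seq, i) * comb(T - M_seq, M_str - i)
                    for i in range(M_pat, upper + 1))
    return Fraction(numerator, comb(T, M_str))
```

For example, `p_score(10, 4, 3, 3)` is 1/30: of the 120 ways to choose 3 items out of 10, only 4 fall entirely inside a fixed set of 4. A small P-score means the sequence and structure patterns co-occur more often than chance would explain, before the correction for multiple trials.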
![Page 45: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/45.jpg)
TRILOGY: Longer Patterns
Type-II turn between unpaired strands
NAD/RAD binding motif found in several folds
β-α-β unit found in three proteins with the TIM-barrel fold
Helix-hairpin-helix DNA-binding motif
Four Cysteines forming 4 S-S disulfide bonds
A fold with repeated aligned β-sheets
Three strands of an anti-parallel β-sheet
A β-hairpin connected with a crossover to a third β-strand
![Page 46: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/46.jpg)
Small Libraries of Structural Fragments for Representing Protein Structures
![Page 47: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/47.jpg)
Fragment Libraries For Structure Modeling
[Figure: known structures, fragment library, protein sequence, predicted structure]
![Page 48: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/48.jpg)
Small Libraries of Protein Fragments
Kolodny, Koehl, Guibas, Levitt, JMB 2002
Goal: a small "alphabet" of protein structural fragments that can be used to represent any structure
1. Generate fragments from known proteins
2. Cluster fragments to identify common structural motifs
3. Test library accuracy on proteins not in the initial set
![Page 49: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/49.jpg)
Small Libraries of Protein Fragments
Dataset: 200 unique protein domains with the most reliable and distinct structures from SCOP
• 36,397 residues
• Divide each protein domain into consecutive fragments, beginning at a random initial position
Library: four sets of backbone fragments
• 4-, 5-, 6-, and 7-residue-long fragments
• Cluster the resulting small structures into k clusters using cRMS, applying k-means clustering with simulated annealing
  • Cluster with k-means
  • Iteratively break and join clusters with simulated annealing to optimize the total variance Σ(x – μ)²
![Page 50: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/50.jpg)
Evaluating the Quality of a Library
• Test set of 145 highly reliable protein structures (Park & Levitt)
• Protein structures broken into set of overlapping fragments of length f
• Find for each protein fragment the most similar fragment in the library (cRMS)
Local Fit: Average cRMS value over all fragments in all proteins in the test set
Global Fit: Find the "best" composition of the structure out of overlapping fragments
• Complexity is O(|Library|^N)
• Greedy approach extends the C best structures so far from position 1 to N
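The greedy approach amounts to a beam search that keeps the C best partial structures. The sketch below abstracts away the geometry: `library` is any sequence of fragment labels, and `cost(partial)` is a caller-supplied function scoring a partial build (for real structures this would be something like the cRMS of the built prefix against the target). The function name and this interface are assumptions for illustration, not the procedure of Kolodny et al.

```python
import heapq


def greedy_global_fit(n_positions, library, cost, keep=4):
    """Build a chain of n_positions fragments, keeping the `keep` best partials.

    Returns (cost, fragments) for the best complete chain found.
    """
    beam = [(0, ())]
    for _ in range(n_positions):
        # Extend every kept partial structure by every library fragment.
        candidates = [(cost(partial + (frag,)), partial + (frag,))
                      for _, partial in beam
                      for frag in library]
        # Keep only the C = keep cheapest extensions (sorted, best first).
        beam = heapq.nsmallest(keep, candidates)
    return beam[0]
```

This evaluates about N · C · |Library| candidates instead of the |Library|^N of exhaustive search, at the price of possibly missing the global optimum.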
![Page 51: Protein Structural Prediction](https://reader030.fdocuments.in/reader030/viewer/2022012914/568151c8550346895dbfffd8/html5/thumbnails/51.jpg)
Results
C =