STAR: Recombination site prediction
Predicting structural disruption caused by crossover: a machine learning approach
Denis C. Bauer
Talk CIBCB 2005
Outline
• Introduction to Protein Design
• Theory of SCHEMA
• Our Approach
• Results
• Summary
Protein
• Biological Functions
– Proteins are fundamental components of all living cells
• Messenger Function (e.g. Hormones)
• Catalytic Function (e.g. Enzymes)
• Regulatory Function (e.g. Antibodies)
• Protein Design for Industry and Medicine
– Better adjusted
– New function
Introduction
Protein Structure
• Primary Structure
• Secondary Structure
• Tertiary Structure
• Quaternary Structure
Pictures from: Principles of BIOCHEMISTRY, Horton, Moran, Ochs, Rawn, Scrimgeours
Protein Design
• Creating new amino acid sequences
– Huge sequence space: 20^100 possible amino acid sequences
– Not every possible sequence is stable
Solution: using sequences which already exist
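The size of the sequence space can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 20 standard amino acids and a chain of 100 residues, which is how the exponent on the slide is read here.

```python
# Count the distinct amino acid sequences of a given length.
ALPHABET_SIZE = 20   # the 20 standard amino acids
CHAIN_LENGTH = 100   # assumed chain length, read from the slide's exponent


def sequence_space_size(length: int, alphabet: int = ALPHABET_SIZE) -> int:
    """Return alphabet**length, the number of possible sequences."""
    return alphabet ** length


size = sequence_space_size(CHAIN_LENGTH)
print(f"{size:.3e}")     # 1.268e+130
print(len(str(size)))    # 131 decimal digits
```

Even at one sequence per nanosecond, enumerating a space of this size is out of reach, which is why the slide turns to sequences that already exist.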
[Figure: example amino acid sequences in three-letter code (Gly, Ala, Glu, Thr, Pro, Val, Asp, with gaps) and in one-letter code (KEMHQPLTFGELENLPLLNTDKPVQALM)]
Benefit of Recombination
Problem: how to identify recombination sites?
[Figure: parent sequences (e.g. MKIPDELGLIFKFEAPGRVTRALSSQ… and KEMHQPLTFGELENLPLLNTDKPVQAL) recombined into hybrid sequences. Labels: "more resistant to heat", "higher performance", "Mayfly", "lives where it's hot"]
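Recombination at a single site can be sketched as follows. The two parents are ten-residue prefixes of sequences shown on the slides; treating them as aligned position by position is an assumption made only for illustration, and the function name `crossover` is hypothetical.

```python
def crossover(parent_a: str, parent_b: str, site: int) -> str:
    """Single-point recombination of two aligned parents.

    Residues before `site` come from parent_a, the rest from parent_b.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned to the same length")
    if not 0 <= site <= len(parent_a):
        raise ValueError("site out of range")
    return parent_a[:site] + parent_b[site:]


# Toy parents: prefixes of slide sequences, assumed to be aligned.
parent_a = "MKIPDELGLI"
parent_b = "KEMHQPLTFG"

hybrid = crossover(parent_a, parent_b, 4)
print(hybrid)  # MKIPQPLTFG
```

The open question posed by the talk is which value of `site` yields a hybrid that still folds and functions.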
SCHEMA
• Research group of Prof. Frances Arnold
• Idea: recombine at positions where the fewest interactions are disrupted
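A minimal sketch of the disruption count, in the spirit of SCHEMA as described in Voigt et al. (2002): a contact between positions i and j counts as broken when the hybrid's residue pair at (i, j) occurs in none of the parents. The contact list would normally be derived from the 3D structure; the contacts and sequences below are toy values, and `schema_disruption` is a hypothetical name.

```python
from typing import Iterable, Sequence, Tuple


def schema_disruption(
    hybrid: str,
    parents: Sequence[str],
    contacts: Iterable[Tuple[int, int]],
) -> int:
    """Count contacts whose residue pair in the hybrid appears in no parent."""
    broken = 0
    for i, j in contacts:
        pair = (hybrid[i], hybrid[j])
        # The contact is preserved if any parent has the same pair at (i, j).
        if not any((p[i], p[j]) == pair for p in parents):
            broken += 1
    return broken


# Toy aligned parents and a toy contact list (assumptions, not slide data).
parents = ["MKIPDE", "MEMHQP"]
contacts = [(0, 1), (1, 4), (2, 5), (3, 4), (0, 5)]

hybrid = parents[0][:3] + parents[1][3:]   # crossover after position 3
print(hybrid)                                          # MKIHQP
print(schema_disruption(hybrid, parents, contacts))    # 2
```

Contact (0, 5) is not counted as broken even though it spans the crossover, because position 0 holds the same residue in both parents.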
[Figure: SCHEMA profile]
Limitations
• 3D structure necessary
– Problem: hard to derive for some proteins
• time consuming
• expensive
Solution: removing the dependence on the 3D structure
Our approach
[Figure: SCHEMA score against residue position for 1A31A; x-axis: residues (0 to 500), y-axis: SCHEMA score (0 to 0.7)]
Alternative to SCHEMA
• SCHEMA: 3D structure information → SCHEMA algorithm → SCHEMA score
• Alternative: sequence → predicting → SCHEMA score
Benefit: all proteins can be processed
[Figure: SCHEMA score against residue position for 1A31A; x-axis: residues (0 to 500), y-axis: SCHEMA score (0 to 0.7)]
Predicting the SCHEMA profile
Sequence → predictive model → predicted SCHEMA score
Predictive models:
• Feed-forward neural network
• Bidirectional recurrent network
• Support vector regression
* Bodén, M., Yuan, Z. and Bailey, T. L. Prediction of protein continuum secondary structure with probabilistic models. Submitted.
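The slides do not say how the sequence is encoded for the predictive models. One common choice, assumed here purely for illustration, is a sliding window of one-hot residue codes, giving one feature vector per position. Each vector would be paired with the SCHEMA score of the central residue as the regression target.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue: str) -> List[int]:
    """20-dimensional indicator vector for one residue."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def window_features(sequence: str, half_width: int = 2) -> List[List[int]]:
    """One feature vector per residue, built from its sequence neighbourhood.

    Window positions that fall outside the sequence are zero-padded.
    """
    padding = [0] * len(AMINO_ACIDS)
    features = []
    for center in range(len(sequence)):
        row: List[int] = []
        for pos in range(center - half_width, center + half_width + 1):
            if 0 <= pos < len(sequence):
                row.extend(one_hot(sequence[pos]))
            else:
                row.extend(padding)
        features.append(row)
    return features


X = window_features("KEMHQPLTFG", half_width=2)
print(len(X), len(X[0]))  # 10 100
```

The window width is a free parameter in this sketch. A feed-forward network or support vector regression would take rows of `X` directly, whereas a recurrent network would typically read the per-residue codes in order.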
Results
Method      r     devA
FFNN        0.86  0.57
BRNN        0.88  0.52
SVR (eps)   0.82  0.63
SVR (nu)    0.83  0.62
Table 1. Results for all approaches. r = correlation coefficient (ideally 1); devA = root mean square error (RMSE) normalized by the standard deviation (ideally 0).
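The two measures in Table 1 can be computed as sketched below. The caption does not state which standard deviation is used for normalization; this sketch assumes the population standard deviation of the observed scores. The data are toy values, not results from the talk.

```python
import math
from typing import Sequence


def pearson_r(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient between observed and predicted scores."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def dev_a(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE divided by the standard deviation of the observed scores.

    Assumption: population standard deviation of the observed values.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / n)
    return rmse / std_o


observed = [0.10, 0.30, 0.50, 0.40, 0.20]    # toy SCHEMA scores
predicted = [0.15, 0.25, 0.45, 0.45, 0.25]   # toy model output

print(round(pearson_r(observed, predicted), 3))
print(round(dev_a(observed, predicted), 3))
```

Under this normalization, a model that always predicts the mean observed score has devA = 1, so values below 1 indicate an improvement over that baseline.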
[Figure: score against sequence position for three proteins, 1A4U-A, 1BKP-B and 1AJZ_; x-axis: sequence position (0 to 300), y-axis: score (0 to 0.8)]
Refinements
• Input features: contact numbers (predicted), solvent accessibility score
• Ensemble of ML models
[Figure: ML models producing the predicted SCHEMA score for 1A31A; x-axis: residues (0 to 500), y-axis: SCHEMA score (0 to 0.7). Correlation coefficients (CC) shown: 0.88, 0.88, 0.6, 0.88]
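The Refinements slide shows an ensemble of ML models but does not say how the members are combined. A minimal sketch, assuming a plain per-residue average of the members' predicted scores (the function name and the numbers are hypothetical):

```python
from typing import List, Sequence


def ensemble_average(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue scores across several models.

    `predictions[m][k]` is model m's predicted score for residue k.
    """
    if not predictions:
        raise ValueError("need at least one model")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must score the same residues")
    return [sum(column) / len(predictions) for column in zip(*predictions)]


# Toy outputs of three models for a three-residue sequence.
model_outputs = [
    [0.2, 0.4, 0.6],
    [0.3, 0.5, 0.4],
    [0.1, 0.3, 0.5],
]
print(ensemble_average(model_outputs))
```

Weighted averaging or a second-stage model would be equally plausible readings of the slide.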
However…
• Only a limited number of connections are considered
• Broken connections are reconnected after recombination
Summary
• Design proteins with recombination rather than from scratch
– Identify recombination site
– Idea: finding the sites where the fewest interactions are disrupted (SCHEMA)
• Predicting SCHEMA score to overcome the limitation
• SCHEMA too limited to be the only means for recombination site prediction
• Future work
– All interactions
– Actual recombination process
Acknowledgments
• Supervisors Dr. Mikael Bodén and Dr. Ricarda Thier
• Dr. Zheng Yuan
• Prof. Frances Arnold’s research group
Thank you
Ref:
Voigt, C. A., Martinez, C., Wang, Z.-G., Mayo, S. L. and Arnold, F. H. Protein building blocks preserved by recombination. Nat Struct Biol. 2002 Jul;9(7):553-558.
Meyer, M. M., Silberg, J. J., Voigt, C. A., Endelman, J. B., Mayo, S. L., Wang, Z.-G. and Arnold, F. H. Library analysis of SCHEMA-guided protein recombination. Protein Sci. 2003 Aug;12(8):1686-1693.
Bodén, M., Yuan, Z. and Bailey, T. L. Prediction of protein continuum secondary structure with probabilistic models. Submitted.
PDB 1zg4
Recombination Site Identification
• Recombination vs mutagenesis or design from scratch
– Higher fraction of functional proteins
– Higher diversity, so a higher chance to find a better hybrid
• Requirement
– Identify recombination site
– Identify which segments are useful
– Identify beneficial segment combinations
• Existing methods
– SCHEMA (hybrid evaluation: avoid breaking connections)
– FamClash (hybrid evaluation: avoid changing properties of residue pairs)
– STAR (site suggestion according to structural compactness)
• Known methods too limited to be a good means for recombination site prediction
http://www.che.caltech.edu/groups/fha/
Possible approaches
• Identify a new measure for evaluating hybrids (derived from datasets of biologically produced hybrids)
• Include more information in the decision process
– Sequence/Structure (SCHEMA)
– Chemical features (FamClash)
– Predicting important residues for structure and/or function
– Predicting enzyme function from protein sequence
– Substitution tolerance
– Hydrophobic patterning
– Surface clefts or binding sites
– Solvent accessibility
– Domains/motifs of parents