Bioinformatics: Introduction and Methods
Bioinformatics: Introduction and Methods Le Zhang
Computer Science Department, Southwest University
Functional prediction of genetic variants
Le Zhang, Ph. D. Computer Science Department Southwest University
Unit 1: Overview of the problem
Do you think Angelina made the right decision to remove her breasts?
Angelina Jolie has a genetic mutation in BRCA1.
How can we predict the likelihood of her getting breast cancer given this mutation?
• P(breast cancer | her mutation)
• P(breast cancer free | her mutation)
The dawning of the age of personalized medicine
Next‐generation sequencing can sequence one person’s whole genome for ~$3000.
Personal genomes hold promise for a future of personalized medicine.
Where did your genetic variations come from?
• Somatic mutations
• De novo mutations
• Inherited from parents
Annapurna Poduri et al. Somatic Mutation, Genomic Variation, and Neurological Disease. Science, 5 July 2013: 341
Types of genetic variations in a human genome
• Chromosomal aneuploidy
• Structural Variations (SVs)
• Copy Number Variations (CNVs)
• Short insertion/deletions (indels)
• Single Nucleotide Variations (SNVs)
Nomenclature: Mutation vs. polymorphism vs. variation vs. variant
Structural Variation (SV) and Copy Number Variation (CNV)
• Insertion
• Deletion
• Inversion
• Translocation
• CNV
Indel – short insertion/deletion
• Within intergenic/intronic regions
• Within coding regions: frameshifting or non‐frameshifting
SNV – Single Nucleotide Variation
There are about 3 million SNVs in one person’s genome, equivalent to a frequency of ~1/1000.
SNVs within coding regions
• Stop gain (nonsense)
• Stop loss
• Non‐synonymous (missense)
• Synonymous (silent)
• Affect splicing
[Figure: examples of a missense mutation and a nonsense mutation.]
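The classes above can be illustrated with a small sketch that compares the reference and variant codons under the standard genetic code. The function name `classify_coding_snv` is hypothetical, and splicing effects are not modeled, since they cannot be read off a single codon.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_coding_snv(ref_codon, alt_codon):
    """Classify a single-nucleotide change inside a codon (hypothetical helper)."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    differences = sum(a != b for a, b in zip(ref_codon, alt_codon))
    if len(ref_codon) != 3 or len(alt_codon) != 3 or differences != 1:
        raise ValueError("expected two codons differing at exactly one base")
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gain"
    if ref_aa == "*":
        return "stop loss"
    return "missense"


for ref, alt in [("GAG", "GTG"), ("TGG", "TGA"), ("TAA", "CAA"), ("CTT", "CTC")]:
    print(ref, "->", alt, ":", classify_coding_snv(ref, alt))
```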
Missense (nonsynonymous) SNVs
Missense SNVs change the amino acid.
Protein‐coding regions account for only ~2% of the genome, yet missense SNVs make up >50% of all mutations known to be involved in human inherited diseases.
BRCA1 vs. breast cancer
In 1990, DNA linkage studies on large families identified BRCA1 as the first gene associated with breast cancer.
BRCA1:
• Located on chromosome 17
• 80,818 bp in length
• 23 exons
• Encodes a protein of 1,863 amino acids
• A tumor suppressor gene that repairs damaged DNA and regulates cell growth and cell death
Approximately 5‐10% of breast cancers and 14% of ovarian cancers arise from a BRCA1 or BRCA2 genetic mutation.
However, not all missense SNVs cause a phenotype change. Some are pathogenic, but many are neutral.
A total of 238 known missense variations in BRCA1:
163 are present only in patients
62 are present only in healthy persons
13 in both patients and healthy persons
On average, a healthy individual has:

Class                        Number
SNP                          3,019,909
Indel                        361,669
Deletions                    15,893
Duplications                 407
Mobile element insertions    4,775

Within protein‐coding regions:

Class                                   Number
Synonymous SNPs                         60,157
Non‐synonymous SNPs                     68,300
Small in‐frame indels                   714
Small frameshift indels                 954
Stop losses                             77
Stop‐introducing SNPs                   1,057
Genes disrupted by large deletions      147
Total genes containing LOF variants     2,304
HGMD ‘damaging mutation’ SNPs           671
Still an unsolved problem with lots of active on‐going research!
• What features differentiate disease‐causing variants from neutral ones?
• How can we predict whether a variation is disease‐causing?
Unit 2: Databases of genetic variations
dbSNP
http://www.ncbi.nlm.nih.gov/SNP/
Created in September 1998 by the NCBI (National Center for Biotechnology Information) in collaboration with the NHGRI (National Human Genome Research Institute)
Its goal is to act as a single database that contains all identified genetic variation
dbSNP
New information obtained by dbSNP becomes available to the public periodically in a series of “builds”.
Contains a range of molecular variation:
• SNPs
• Indels
• Multinucleotide polymorphisms
• Microsatellite markers
• Short tandem repeats
• Heterozygous sequences
As of dbSNP build 138, it consists of variants from 131 organisms. For Homo sapiens:

Number of submissions (ss)        232,952,851
Number of RefSNP clusters (rs)    62,676,337
Validated rs                      44,278,189
Number of rs in gene              27,608,151
Number of ss with genotype        73,909,251
Number of ss with frequency       35,997,830
dbSNP – Data increase
[Figure: growth of Homo sapiens records from dbSNP build 125 (2005) to build 138 (2013). x-axis: year, 2005 to 2012; y-axis: count, 0 to 250,000,000; series: number of submissions (ss), number of RefSNP clusters (rs), number of rs in gene.]
dbSNP- Record
1000 Genomes
http://www.1000genomes.org/
The 1000 Genomes Project, launched in January 2008, is an international research effort to establish by far the most detailed catalogue of human genetic variation.
• Pilot: In 2010, the project finished its pilot phase.
• Phase I: In October 2012, the sequencing of 1092 genomes was announced in a Nature publication.
Sequencing technology used: Illumina, SOLiD, 454

Phase I          Whole genome                              Whole exome
Strategy         Low coverage whole genome sequencing      Deep sequencing of whole exome
Coverage         2‐6X                                      50‐100X
Sample number    1,092                                     1,039
OMIM: Online Mendelian Inheritance in Man
http://www.omim.org/
• A database that catalogues all the known diseases with a genetic component and links them to the relevant genes in the human genome
• Contains information on all known Mendelian disorders and over 12,000 genes
• OMIM was initiated in the early 1960s by Dr. Victor A. McKusick as a catalog of Mendelian traits and disorders, published as a book entitled Mendelian Inheritance in Man (MIM)
• 12 book editions of MIM were published between 1966 and 1998
• The online version, OMIM, was created in 1985 and made generally available on the internet starting in 1987.
OMIM Entry Statistics
OMIM
Human Gene Mutation Database (HGMD)
• A comprehensive collection of germline mutations in nuclear genes that underlie, or are associated with, human inherited disease.
• By 2013, the database contained over 141,000 different variants detected in over 5,700 different genes.
• Two versions:
  Professional: requires a yearly subscription
  Public: freely available but permanently 3 years out of date, and does not contain any of the additional annotations or extra features present in HGMD Professional
• Created by biologist David N. Cooper and mathematician Michael Krawczak in 1996.
• Originally established for the scientific study of mutational mechanisms in human genes causing inherited disease, but has since acquired a much broader utility as a central unified repository for germ‐line disease‐related functional variation.
• All HGMD mutation data are manually curated from the scientific literature.
HGMD
HGMD 2013.2
HGMD http://www.hgmd.cf.ac.uk/ac/index.php
Locus specific databases (LSDBs)
• Collect all known variants of each disease‐related gene in a specific database
• Annotate them with complete and accurate information on genetic mutations
• Most LSDBs are built on LOVD (Leiden Open Variation Database), a database framework for storing variant information
http://www.lovd.nl/3.0/home
LSDBs
Unit 3: Conservation-based and Rule-based methods: SIFT & PolyPhen
Questions:
• What features differentiate disease‐causing variants from neutral ones?
• How can we predict whether a variation is disease‐causing?
Phenotypical/functional “effects” of human genetic variations
• Disease vs. normal
• Deleterious vs. neutral
• Personal trait differences (e.g., height)
Observations, not “truth”
Statistical and stochastic, not deterministic
• Animal model phenotypic changes
• Cellular phenotypic changes
• Protein function changes
• Protein structure changes
• Protein sequence changes
• Nonsense mutations are usually considered deleterious.
  • Even though this is not always the case…
• Known deleterious mutations are enriched in nonsynonymous mutations.
  • ~50% of known mutations of Mendelian disorders are nonsynonymous mutations
  • Ascertainment bias?
• Synonymous mutations, intronic mutations, and intergenic mutations are under‐studied.
  • According to GWAS studies, 88% of trait‐associated variants of weak effect are non‐coding.
• Most research so far has focused on nonsynonymous mutations.
1999: Earliest attempt based on BLOSUM substitution matrix
• Assumption: if the substitution score between a variant residue and the wild type residue is positive, then the variant is neutral. If the substitution score is negative, then the variant is deleterious.
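The sign rule above can be sketched as a lookup in a substitution matrix. To stay self-contained, the block hard-codes only a small excerpt of BLOSUM62 scores; the function name and the handling of a zero score (which the assumption above does not cover) are illustrative choices, not part of the 1999 method.

```python
# Small excerpt of BLOSUM62 (the matrix is symmetric, so each pair is stored once).
BLOSUM62_EXCERPT = {
    ("I", "V"): 3,
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("F", "Y"): 3,
    ("D", "L"): -4,
    ("G", "W"): -2,
    ("C", "R"): -3,
    ("L", "P"): -3,
}


def predict_by_substitution_score(wild_type, variant, matrix=BLOSUM62_EXCERPT):
    """Apply the sign rule: positive score -> neutral, negative -> deleterious."""
    key = tuple(sorted((wild_type, variant)))
    score = matrix[key]  # raises KeyError for pairs outside the excerpt
    if score > 0:
        return "neutral"
    if score < 0:
        return "deleterious"
    return "undetermined"  # assumption: the rule says nothing about a score of 0


print(predict_by_substitution_score("V", "I"))
print(predict_by_substitution_score("L", "P"))
```

A full implementation would load the complete matrix rather than an excerpt.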
More successful methods
• Conservation‐based (e.g., SIFT)
• Rule‐based (e.g., PolyPhen)
• Classifier‐based (e.g., PolyPhen2, SAPRED)
Sort Intolerant From Tolerant substitutions (SIFT)
Published in 2001 by Pauline C. Ng and Steven Henikoff
The first tool for predicting deleterious amino acid substitutions
Website: http://sift.jcvi.org/
SIFT bets on evolution
Important positions (such as active sites) tend to be conserved in the protein family across species.
• Mutations at well‐conserved positions tend to be deleterious.
Some positions have a high degree of diversity across species.
• Mutations at these positions tend to be neutral.
SIFT is a multistep procedure
Given a protein sequence:
Step 1. Search for similar sequences
Sequence search database: SWISS‐PROT
PSI‐BLAST is run for four iterations to collect a pool of sequences similar to the query
Step 2. Choose closely related sequences that are likely to share similar function
The PSI‐BLAST results are grouped together if they are >90% identical in the aligned regions
Step 3. Obtain the multiple alignment of these chosen sequences
Step 4. Calculate normalized probabilities for all possible substitutions at each position at the alignment
If the SIFT score is less than 0.05, the SNV is considered to be deleterious. Otherwise, it is considered neutral.
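Step 4 and the 0.05 cutoff can be illustrated with a toy version of the scoring. This sketch scores a single alignment column by dividing each residue's count by the count of the most frequent residue; it is a simplification under stated assumptions, since the real SIFT procedure also uses sequence weighting and pseudocounts that are omitted here. The function names and the optional `pseudocount` parameter are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_probabilities(column, pseudocount=0.0):
    """Scale residue counts in one alignment column so the commonest residue scores 1."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    weights = {aa: counts.get(aa, 0) + pseudocount for aa in AMINO_ACIDS}
    top = max(weights.values())
    if top == 0:
        raise ValueError("alignment column contains no amino acids")
    return {aa: weight / top for aa, weight in weights.items()}


def predict_substitution(column, variant, cutoff=0.05):
    """Call a substitution deleterious when its normalized score is below the cutoff."""
    score = normalized_probabilities(column)[variant]
    return "deleterious" if score < cutoff else "neutral"


conserved_column = "L" * 20        # toy column: fully conserved position
variable_column = "AAAAASSSST"     # toy column: diverse position
print(predict_substitution(conserved_column, "P"))
print(predict_substitution(variable_column, "S"))
```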
Prediction results
Score cutoff: 0.05
Accuracy of SIFT
• False negative rate: 31%
• False positive rate: 20%
• Coverage: 60%
Test outcome vs. truth ("gold standard"):

                          Truth positive            Truth negative
Test outcome positive     True Positive (hit)       False Positive (false alarm)
Test outcome negative     False Negative (miss)     True Negative (correct rejection)

Derived measures:
• Positive predictive value (PPV) = Precision = TP/(TP+FP)
• Negative predictive value (NPV) = TN/(TN+FN)
• Sensitivity = Recall = TP/(TP+FN)
• Specificity = TN/(TN+FP)
• Accuracy = (TP+TN)/total
• False negative rate (β) = Type II error = 1 - sensitivity = FN/(TP+FN)
• False positive rate (α) = Type I error = 1 - specificity = FP/(TN+FP)
• False discovery rate (FDR) = 1 - precision = FP/(TP+FP)
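The measures above follow directly from the four cells of the table. A minimal sketch, with a hypothetical function name and hypothetical counts chosen only so that the false negative and false positive rates match the 31% and 20% quoted for SIFT:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the derived measures from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "false_negative_rate": fn / (tp + fn),
        "false_positive_rate": fp / (tn + fp),
        "false_discovery_rate": fp / (tp + fp),
    }


# Hypothetical counts: 100 truly deleterious and 100 truly neutral variants.
metrics = confusion_metrics(tp=69, fp=20, fn=31, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```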
Polymorphism Phenotyping (PolyPhen): a rule‐based method
• Amino acid variants may impact folding, interaction sites, solubility or stability of the protein.
• Changes in protein structure may affect protein function, which may lead to phenotype change.
• PolyPhen predicts the impact of amino acid allelic variants based on multi‐sequence alignment AND protein 3D structure features.
PolyPhen
1. Multi‐sequence alignment of homologous sequences
2. Structure‐based characterization of the substitution site
   • DISULFIDE, THIOLEST or THIOETH bond, BINDING site, ACTIVE site, etc.
   • Whether the variant is located in transmembrane regions
   • Whether the variant is located in coiled coil regions
   • Whether the variant is located in signal peptide regions
3. Get the protein 3D structure, or use homology modeling to predict its structure
4. Calculate the 3D structure features of the substitution site
   • Secondary structure
   • Solvent accessible surface area
   • Φ‐Ψ dihedral angles
   • Normalized B‐factor for the residue
   • Loss of hydrogen bond
   • Contacts with critical sites, ligands or other polypeptide chains
PolyPhen uses empirically derived rules to predict whether an nsSNP is damaging or benign.
PolyPhen: pros and cons
Pros
• Improved prediction accuracy when the protein 3D structure is available
Cons
• If the 3D structure is not available, it can only depend on the MSA.
• The rules are empirical.
PolyPhen2
• An improved version of PolyPhen, released in 2010
• http://genetics.bwh.harvard.edu/pph2/
• Uses more predictive features
• Based on Naïve Bayes machine learning
Improved performance compared with PolyPhen
Unit 4: Classifier-based methods: SAPRED
Formulate as a supervised classification problem
1. Training data: SAPs labeled as disease‐associated (+) or polymorphism (‐)
2. Compute structural attributes and sequence attributes
3. Attribute evaluation and subset selection (60 attributes in 10 groups)
4. Build an SVM classifier on the training data
5. Apply the classifier to newly identified SAPs
Single Amino acid Polymorphisms disease‐association Predictor (SAPRED)
Currently SAPRED supports two types of predictions:
• One is based on both the structural and the sequence information.
• The other relies on the sequence information only.
The former aims at higher prediction accuracy and more attributes with putative biological insights, while the latter can work with more queries whose structural models are not available.
PDB: get the protein 3D structure
http://www.rcsb.org/pdb/home/home.do
Homology Modeling
http://swissmodel.expasy.org/
Biologically-Intuitive Attributes
Residue frequencies, conservation score,
Solvent accessibilities and Cβ density, secondary structure...
New attributes:
Structural neighbor profile
Nearby functional sites
Disordered regions
Hydrogen bonds change
β-aggregation
HLA family
Residue frequencies in MSA
LacI 5-38
Structural neighbor profile
Definition:
A 20-D vector: take the Cα of the SAP residue as the center and draw a sphere with a specific radius. The residues inside the sphere are counted to get the number for each of the 20 kinds of residues. Each number is a component of the vector.
$$N_{R,a_i} = \sum_{j=1}^{L} X_j$$
where $X_j = 1$ if residue $j$ is of type $a_i$ and $r_{j,c} < R$; otherwise, $X_j = 0$.
R: radius
L: protein length
a_i: a specific residue type
r_{j,c}: distance between residue j and the center residue c
Structural neighbor profile
The center is H128, radius is 10 Angstroms. Neighbors are:
42-47: LLICTY
50-52: AGT
55: I
59: V
106-110: LKTHL
112: T
125-127: KFL
129-131: VAR
176-177: HV
180-181: WW
184: K
188-194: QILFLFY
197: I
208: V
211: F

Structural neighbor profile: vector
a.a.  A  C  D  E  F  G  H  I  K  L
N     2  1  0  0  4  1  2  4  3  7
a.a.  M  N  P  Q  R  S  T  V  W  Y
N     0  0  0  1  1  0  4  4  2  2
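The profile can be sketched in a few lines: count residue types whose Cα lies within the radius of the center residue. The function names and the toy coordinates are hypothetical; skipping the center residue itself is an assumption made to match the H128 example, whose neighbor list does not include residue 128.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_residue_types(neighbors):
    """Turn a sequence of one-letter residues into the 20-component count vector."""
    return [neighbors.count(aa) for aa in AMINO_ACIDS]


def neighbor_profile(residues, ca_coords, center_index, radius):
    """Count residues whose C-alpha lies strictly within `radius` of the center residue."""
    center = ca_coords[center_index]
    neighbors = [
        aa
        for j, (aa, xyz) in enumerate(zip(residues, ca_coords))
        if j != center_index and math.dist(center, xyz) < radius
    ]
    return count_residue_types(neighbors)


# Toy structure (hypothetical coordinates in Angstroms); the center is residue 0.
toy_residues = "HLLKW"
toy_coords = [(0, 0, 0), (3, 0, 0), (0, 9.9, 0), (10, 0, 0), (0, 0, 20)]
toy_profile = neighbor_profile(toy_residues, toy_coords, center_index=0, radius=10.0)

# Neighbor list of H128 from the example above, concatenated in order.
h128_neighbors = "LLICTY" "AGT" "I" "V" "LKTHL" "T" "KFL" "VAR" "HV" "WW" "K" "QILFLFY" "I" "V" "F"
h128_profile = count_residue_types(h128_neighbors)
print(dict(zip(AMINO_ACIDS, h128_profile)))
```

Counting the listed neighbors this way reproduces the vector shown in the table above.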
Structural neighbor profile
[Figure: predictive power of different structural neighbor profiles. x-axis: radius (Å), 0 to 20; y-axis: overall accuracy, 0.66 to 0.78; series: wildtype profile, variant profile, profile difference.]
Different radii have different predictive power.
We selected 13 Angstroms as the optimal value of the radius.
Nearby functional sites
• Functional sites such as ACT_SITE and METAL, annotated in Swiss-Prot, carry intuitive biological meaning.
• SAPs exactly on these sites would disturb protein function heavily, but such SAPs have only low coverage in the dataset.
• We proposed that SAPs in the vicinity of functional sites are also more likely than others to affect protein function, which enlarges the coverage of these attributes in the dataset.
Disordered Region
Of 122 SAPs in disordered regions, 114 (93%) are disease-associated.
From: http://ist.temple.edu/disprot/index.php
Hydrogen bond change

Changed hydrogen bonds    Disease    Polymorphism    Ratio
-6                        1          0               1/0
-5                        12         1               12
-4                        44         2               22
-3                        114        16              7.25
-2                        230        55              4.18
-1                        403        213             1.89
0                         1142       716             1.59
1                         224        142             1.58
2                         68         36              1.89
3                         11         4               2.75
4                         0          2               0
5                         0          2               0
Other attributes
• Of 52 SAPs in transmembrane regions, 49 (94%) are disease-associated.
• Of 194 SAPs that altered β-aggregation properties, 169 (87%) are disease-associated.
• Of 435 SAPs from HLA families, all except one are “polymorphism”.
SVM classifier
• SVM: support vector machine
• Separates the transformed data with a hyperplane in a high‐dimensional space
• Kernel function: Radial Basis Function (RBF)
• Grid search to select proper values of the parameters
Support Vector Machine (SVM) Classifier -- Grid-search for parameters
log2C = 1; log2g = -7
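A grid search over (C, g) can be sketched without any SVM library by treating the cross-validated score as a black box. Everything below is an assumption for illustration: the exponent ranges are a commonly used coarse grid rather than the one used for SAPRED, and `toy_score` is a hypothetical stand-in for a real cross-validation accuracy, constructed to peak at log2C = 1 and log2g = -7.

```python
import math


def grid_search(score_fn, log2c_values, log2g_values):
    """Return (best_score, log2C, log2g) over all pairs of exponents."""
    best = None
    for log2c in log2c_values:
        for log2g in log2g_values:
            score = score_fn(2.0 ** log2c, 2.0 ** log2g)
            if best is None or score > best[0]:
                best = (score, log2c, log2g)
    return best


def toy_score(c, gamma):
    """Hypothetical stand-in for cross-validated accuracy of an RBF SVM."""
    return -((math.log2(c) - 1) ** 2 + (math.log2(gamma) + 7) ** 2)


best_score, best_log2c, best_log2g = grid_search(
    toy_score,
    log2c_values=range(-5, 16, 2),   # assumed grid for C
    log2g_values=range(-15, 4, 2),   # assumed grid for g
)
print("log2C =", best_log2c, "log2g =", best_log2g)
```

In practice `score_fn` would train the classifier on four folds and evaluate on the fifth, averaged over the five folds.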
Five-fold cross-validation
Part     Total proteins    Total SAP    Deleterious SAP    Neutral SAP
1        105               686          449                237
2        104               688          450                238
3        105               688          450                238
4        105               688          450                238
5        103               688          450                238
Total    522               3438         2249               1189
SAP status                 Predicted as disease-association (+)    Predicted as polymorphism (-)
Disease-association (+)    TP                                      FN
Polymorphism (-)           FP                                      TN

Accuracy: ACC and MCC
Overall accuracy:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
Matthews correlation coefficient:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TN + FN)(TN + FP)(TP + FN)(TP + FP)}}$$
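Both measures are straightforward to compute from the four counts. A minimal sketch with hypothetical function names; returning 0 when the MCC denominator vanishes is a common convention and an assumption here, not something stated on the slide.

```python
import math


def overall_accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_correlation(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TN+FN)(TN+FP)(TP+FN)(TP+FP))."""
    denominator = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for illustration only.
print(overall_accuracy(tp=40, tn=30, fp=20, fn=10))
print(matthews_correlation(tp=40, tn=30, fp=20, fn=10))
```

Unlike ACC, MCC stays near 0 for a classifier that ignores its input, even when the two classes are unbalanced.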
Predictive power
SAPRED web server
http://sapred.cbi.pku.edu.cn/
Run SAPRED
Results
Explanation of Results: Structural attributes
Explanation of Results: sequence attributes
Results using SAPRED_Seq
ACC=81.5% MCC=0.577
Unit 5: Support Vector Machine (SVM)
Machine learning model
[Figure: training data (variables Var1, Var2, Var3, ..., VarN) are used to build a model, which then makes predictions on new data.]
Methods: SVM, HMM, Bayesian, decision tree, neural network, random forest, ensemble learning, ……
Peking University
Classification
Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in.
Introduction
SVM is a supervised learning model that analyzes data and recognizes patterns, and is used for classification and regression analysis. It selects a small number of critical boundary instances, called support vectors, from each class and builds a linear discriminant function that separates them as widely as possible. SVMs can efficiently perform non‐linear classification using what is called the kernel trick, implicitly mapping their inputs into high‐dimensional feature spaces.
Consider a two‐class, linearly separable classification problem.
Many decision boundaries! Are all decision boundaries equally good?
What is a good Decision Boundary?
Decision Boundary
Intuitively, the best hyperplane is the one that represents the largest separation, or margin, between the two classes, since the larger the margin is, the lower the generalization error of the classifier will be.
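Support Vector
The instances that are closest to the maximum‐margin hyperplane (the ones with the minimum distance to it) are called support vectors.
SVM - mathematics
A data point is denoted by $x$, which is an $n$-dimensional vector, and $y$ is 1 or ‐1 to represent the two different classes. The hyperplane is
$$w^T x + b = 0$$
So the classification function is
$$f(x) = w^T x + b$$
and
$$y = \begin{cases} 1, & f(x) > 0 \\ -1, & f(x) < 0 \end{cases}$$
SVM - mathematics
The confidence of a classification can be measured by $|f(x)|$, and whether the classification is right can be determined by the consistency of the signs of $f(x)$ and $y$. In fact, for a correctly classified point, $|f(x)| = y f(x)$. So the functional margin is:
$$\hat{\gamma} = y (w^T x + b) = y f(x)$$
The functional margin of a hyperplane is measured by
$$\hat{\gamma} = \min_{i} \hat{\gamma}_i, \quad i = 1, 2, \ldots, m$$
where $m$ is the number of training points.
SVM - mathematics
However, the functional margin can be scaled even if the hyperplane remains the same, for example when $w$ and $b$ are changed into $2w$ and $2b$.
A more intuitive measurement can be obtained using the distance from the point to the hyperplane, which is called the geometrical margin:
$$\tilde{\gamma} = \frac{y (w^T x + b)}{\|w\|} = \frac{y f(x)}{\|w\|} = \frac{\hat{\gamma}}{\|w\|}$$
In this maximum margin classifier, we want to maximize $\tilde{\gamma}$. Because the functional margin is scalable, we can assume $\hat{\gamma} = 1$ without influencing the optimal result.
SVM - mathematics
So the objective function is
$$\max \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, m$$
which is equivalent to
$$\min \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, m$$
This is an optimization model with constraints, and it can be solved by Quadratic Programming.
SVM - mathematics
We can also solve this with Lagrange multipliers $\alpha_i \ge 0$:
$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \left( y_i (w^T x_i + b) - 1 \right)$$
$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$$
SVM - mathematics
Finally the classification function can be rewritten as
$$f(x) = \sum_{i=1}^{m} \alpha_i y_i \langle x_i, x \rangle + b$$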
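The difference between the functional and the geometrical margin can be checked numerically. The sketch below uses a hypothetical hyperplane and four toy points (none of them from the slides) and shows that scaling w and b by 2 doubles the functional margin while leaving the geometrical margin unchanged.

```python
import math


def decision_value(w, b, x):
    """f(x) = w^T x + b for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def functional_margin(w, b, points):
    """Smallest y * f(x) over all labeled points (x, y)."""
    return min(y * decision_value(w, b, x) for x, y in points)


def geometric_margin(w, b, points):
    """Functional margin divided by the Euclidean norm of w."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return functional_margin(w, b, points) / norm


# Toy, linearly separable data: (point, label).
points = [((1.0, 1.0), -1), ((0.0, 0.0), -1), ((2.0, 2.0), 1), ((3.0, 3.0), 1)]
w, b = (1.0, 1.0), -3.0      # hypothetical separating hyperplane x1 + x2 - 3 = 0
w2, b2 = (2.0, 2.0), -6.0    # the same hyperplane, scaled by 2

print(functional_margin(w, b, points), functional_margin(w2, b2, points))
print(geometric_margin(w, b, points), geometric_margin(w2, b2, points))
```

This invariance is why the functional margin can be fixed to 1 in the optimization problem without changing the optimal hyperplane.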
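SVM - kernel
The linear learning machine has very limited ability in practice because of the complexity of the real world, which needs a more flexible hypothesis space. We can use a function $\phi$ to map $x$ to a higher‐dimensional space, in which the points can become linearly separable.
So the classification function can be extended as
$$f(x) = \sum_{i} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$$
Here we get the kernel function:
$$K(x, z) = \langle \phi(x), \phi(z) \rangle$$
kernel
Take the points in the picture for example: the two classes can be separated by a circle,
$$a_1 X_1 + a_2 X_1^2 + a_3 X_2 + a_4 X_2^2 + a_5 X_1 X_2 + a_6 = 0$$
Then we can construct a 5‐dimensional space, where
$$Z_1 = X_1, \quad Z_2 = X_1^2, \quad Z_3 = X_2, \quad Z_4 = X_2^2, \quad Z_5 = X_1 X_2$$
So the hyperplane in the new feature space is
$$\sum_{i=1}^{5} a_i Z_i + a_6 = 0$$
kernel
Linear kernel: $K(x_1, x_2) = \langle x_1, x_2 \rangle$
Polynomial kernel: $K(x_1, x_2) = (\langle x_1, x_2 \rangle + R)^d$
Gauss kernel: $K(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|^2}{2 \sigma^2} \right)$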
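The kernel idea can be verified on a small case: for 2-D inputs, the degree-2 polynomial kernel with R = 0 equals an ordinary inner product after an explicit feature map. The function names, default parameters, and the particular map `phi` are illustrative assumptions, not taken from the slides.

```python
import math


def dot(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, r=1.0, d=2):
    return (dot(x, z) + r) ** d


def gaussian_kernel(x, z, sigma=1.0):
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel with r = 0 (2-D input)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, 4.0)
implicit = polynomial_kernel(x, z, r=0.0, d=2)   # computed in the original 2-D space
explicit = dot(phi(x), phi(z))                   # computed in the 3-D feature space
print(implicit, explicit, gaussian_kernel(x, z))
```

The kernel gives the feature-space inner product without ever constructing the mapped vectors, which is what makes high-dimensional feature spaces affordable.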
Kernel function
Gauss kernel
SVM - example
Linear kernel
Applications
SVM has been used successfully in many real‐world problems:
• Bioinformatics (mutation classification, cancer classification)
• Text (and hypertext) categorization
• Image classification – different types of sub‐problems
• Hand‐written character recognition
Pros and Cons
• With support vectors, the maximum‐margin hyperplane is relatively stable.
• SVMs often produce very accurate classifiers because subtle and complex decision boundaries can be obtained.
• However, compared with other methods, even the fastest training algorithms for support vector machines are slow when applied in the nonlinear setting.
Unit 6: Comparative Protein Structure Modeling of Genes And Genomes
Catalogue
• What is comparative protein structure modeling?
• Why could we do comparative modeling?
• Why is comparative modeling important?
• How to do comparative modeling?
  Fold assignment and template selection
  Target – template alignment
  Model building
  Model evaluation
• The application of comparative modeling
• Comparative modeling in structural genomics
1. What Is Comparative Protein Structure Modeling?
• Comparative protein structure modeling predicts the three‐dimensional structure for a given protein sequence of unknown structure (the target) on the basis of sequence similarity to proteins of known structure (the templates).
2. Why Could We Do Comparative Modeling?
• Small changes in the protein sequence usually result in small changes in its 3D structure. If similarity between two proteins is detectable at the sequence level, structural similarity can usually be assumed.
• The number of unique structural folds that proteins adopt is limited, and the number of experimentally determined new structures is increasing exponentially.
3. Why Is Comparative Modeling Important?
• It is an efficient way to obtain useful information about the proteins of interest.
• Designing mutants to test hypotheses about a protein’s function
• Identifying active and binding sites
• Identifying, designing and improving ligands for a given binding site
• Modeling substrate specificity
• Predicting antigenic epitopes
• Simulating protein–protein docking
• Inferring function from a calculated electrostatic potential around the protein
• Facilitating molecular replacement in X‐ray structure determination
• Refining models based on NMR constraints
• Testing and improving a sequence‐structure alignment
• Confirming a remote structural relationship
• Rationalizing known experimental observations
4. How To Do Comparative Modeling?
• Fold assignment and template selection
• Target – template alignment
• Model Building
• Model evaluation
4.1 Fold Assignment And Template Selection
• Three main classes of protein comparison methods:
1. Comparing the target sequence with each of the database sequences independently. Programs: BLAST, FASTA, etc.
2. Using multiple sequence comparisons to improve the sensitivity of the search. Program: PSI‐BLAST, etc.
   *Especially useful when the sequence identity is below 25%.
3. Threading or 3D template matching methods.
   *Especially useful when there are no sequences clearly related to the modeling target.
• Template selection criteria:
  Higher sequence similarity
  The family of proteins
  The quality of the template structure
  Solvent, pH, ligands…
• Potential problems: Distantly related proteins used as templates (i.e., less than 25% sequence identity) may produce an unreliable model.
• The databases and programs you may use in this step:
Table footnotes: (a) S, server; P, program. (b) Some of the sites are mirrored on additional computers. (c) Vendors: MolSoft Inc., San Diego; Molecular Simulations Inc., San Diego; Tripos Inc., St Louis; ProCeryon Biosciences Inc., New York.
4.2 Target – Template Alignment
• Once templates have been selected, a specialized method should be used to align the target sequence with the template structures. Program: CLUSTAL, etc.
• The alignment becomes difficult in the “twilight zone” of less than 30% sequence identity. (Only 20% of the residues are likely to be correctly aligned when two proteins share 30% sequence identity.)
• BLOSUM62 is built from alignment blocks clustered at 62% sequence identity; BLOSUM45 and BLOSUM80 are the analogous matrices at ~45% and ~80%.
• In difficult cases, it is frequently beneficial to rely on multiple structure and sequence information. The information from structures helps to avoid gaps in secondary structure elements, in buried regions, or between two residues that are far apart in space.
• Potential problems: Although the aforementioned methods can be used, misalignment may occur, especially when the target‐template sequence identity decreases below 30%.
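The 30% threshold refers to sequence identity computed from a target-template alignment. A minimal sketch, assuming identity is measured over alignment columns where neither sequence has a gap (other denominators are also in use); the function names and the toy alignment are hypothetical.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def in_twilight_zone(identity, threshold=30.0):
    """True when the identity falls below the 30% mark quoted above."""
    return identity < threshold


identity = percent_identity("ACDE-FGH", "ACDQWFGA")  # toy alignment
print(round(identity, 1), in_twilight_zone(identity))
```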
• Programs and World Wide Web servers you may use in this step:
4.3 Model Building
• Three classes of methods can be used to construct a 3D model:
1. Modeling by Assembly of Rigid Bodies
Assemble a model from a small number of rigid bodies obtained from aligned protein structures.
2. Modeling by Segment Matching or Coordinate Reconstruction
Use a subset of atomic positions from template structures as “guiding” positions, then identify and assemble short, all‐atom segments that fit these guiding positions.
3. Modeling by Satisfaction of Spatial Restraints
Generate many constraints or restraints on the structure of the target sequence, using its alignment to related protein structures as a guide.
4.3 Model Building
• Programs and World Wide Web servers you may use in this step:
4.3.1 Loop Modeling
• Loops often determine the functional specificity of a given protein framework. They contribute to active and binding sites.
• Loop modeling can be seen as a mini protein folding problem, but loops are generally too short to provide sufficient information about their local fold.
• Three methods:
1) Ab initio methods
2) Database search techniques
3) Both
4.3.2 Sidechain Modeling
• Side chain conformations are predicted from similar structures and from steric or energetic considerations.
• Disulfide bridges are modeled using structural information from proteins in general and from equivalent disulfide bridges in related structures.
• Two effects on sidechain conformation:
1) The coupling between the main chain and side chains
2) The continuous nature of the distributions of side‐chain dihedral angles
• Three different side‐chain prediction methods:
1) The packing of backbone‐dependent rotamers
2) The self‐consistent mean‐field approach to positioning rotamers based on their van der Waals interactions
3) The segment‐matching method of Levitt
4.3.3 Potential Problems
• According to a recent survey that analyzed the accuracy of 3 modeling methods, they can correctly predict only approximately 50% of χ1 angles and 35% of both χ1 and χ2 angles.
• Segments of the target sequence that have no equivalent region in the template structure (i.e., insertions or loops) are the most difficult regions to model, especially when the insertion is more than 9 residues long.
• In some correctly aligned segments of a model, the template is locally different (<3 Å) from the target, resulting in errors in that region.
• As the sequences diverge, the packing of side chains in the protein core may change.
4.4 Model Evaluation
• Typical errors in comparative models :
1. Errors in side‐chain packing
2. Distortions and shifts in correctly aligned regions
3. Errors in regions without a template
4. Errors due to misalignments
5. Incorrect template
The criteria of evaluation
Having the correct fold or not
The target‐template sequences similarity
Distributions of many spatial features
The environment
Having good stereochemistry or not
4.4 Model Evaluation
1) Having the correct fold or not
A model will have the correct fold if the correct template is picked and if that template is aligned at least approximately correctly with the target sequence.
The fold of a model can be assessed by a high sequence similarity with the closest template, an energy‐based Z‐score, or by conservation of the key functional or structural residues in the target sequence.
2) The target‐template sequence similarity
Sequence identity above 30% is a relatively good predictor of the expected accuracy.
Average model accuracy as a function of the template‐target sequence similarity
4.4 Model Evaluation
[Figure: sample models (solid line) superposed on the corresponding actual structures (dotted line).]
EDN: human eosinophil neurotoxin, a ribonuclease with 3 α-helices and 2 three-stranded antiparallel β-sheets arranged in a single domain.
CRABPI: mouse cellular retinoic acid binding protein I, a single domain protein composed of interacting α‐helices packed at the edge of two orthogonal, 4‐ and 6‐stranded antiparallel β‐sheets. For the CRABPI model, 90% of Cα atoms superpose within 3.5 Å of their counterparts in the X‐ray structure; the rms error is 1.31 Å.
NM23H2: human nucleoside diphosphate kinase, a single domain protein consisting of a central 4-stranded antiparallel β-sheet surrounded by 8 α-helices. For the NM23H2 model, all but one Cα atom superpose within 3.5 Å of the X-ray structure; the rms difference is 0.41 Å.
4.4 Model Evaluation
Average model accuracy as a function of the template‐target sequence similarity
Percentage structure overlap is defined as the fraction of equivalent residues. Two residues are equivalent when their Cα atoms are within 3.5 Å of each other upon rigid‐body, least‐squares superposition of the two structures.
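The overlap definition can be sketched as follows. The code assumes the two Cα coordinate lists are already superposed and listed in equivalent order; the least-squares superposition itself is not implemented here. Function names and toy coordinates are hypothetical.

```python
import math


def structure_overlap(model_ca, reference_ca, cutoff=3.5):
    """Percentage of residue pairs whose C-alpha atoms lie within `cutoff` Angstroms."""
    if len(model_ca) != len(reference_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    equivalent = sum(
        math.dist(a, b) <= cutoff for a, b in zip(model_ca, reference_ca)
    )
    return 100.0 * equivalent / len(model_ca)


def rms_error(model_ca, reference_ca):
    """Root-mean-square distance between paired C-alpha atoms."""
    squared = [math.dist(a, b) ** 2 for a, b in zip(model_ca, reference_ca)]
    return math.sqrt(sum(squared) / len(squared))


# Toy, pre-superposed coordinates in Angstroms (not taken from any real structure).
model = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0, 5, 0)]
xray = [(0, 0, 1), (1, 0, 0), (10, 0, 4), (0, 5, 0)]
print(structure_overlap(model, xray), round(rms_error(model, xray), 2))
```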
3) The environment
Example: some calcium‐binding proteins undergo large conformational changes when bound to calcium. If a calcium‐free template is used to model the calcium‐bound state of the target, it is likely that the model will be incorrect.
4) Having good stereochemistry or not
Including bond lengths, bond angles, peptide bond and side‐chain ring planarities, chirality, main‐chain and side‐chain torsion angles, and clashes between nonbonded pairs of atoms.
5) Distributions of many spatial features
Such features include packing, formation of a hydrophobic core, residue and atomic solvent accessibilities, spatial distribution of charged groups, distribution of atom‐atom distances, atomic volumes, and main‐chain hydrogen bonding.
4.4 Model Evaluation
• There are also methods for testing 3D models that implicitly take into account many of the criteria listed above. These methods are based on 3D profiles and statistical potentials of mean force.
• A physics‐based approach to deriving energy functions has been tested for use in protein structure evaluation (1999).
4.4 Model Evaluation
• Programs and World Wide Web servers you may use in this step:
5. The Application of Comparative Modeling
• Three levels of model accuracy and some of the corresponding applications
Three levels:
• Low accuracy: <30% sequence identity; less than 50% of Cα atoms within 3.5 Å of their correct positions
• Middle accuracy: 30‐50% sequence identity; 85% of Cα atoms within 3.5 Å of their correct positions
• High accuracy: >50% sequence identity; approaches that of low resolution X‐ray structures or medium resolution NMR structures
rw (van der Waals radius) of C atom = 1.70 Å
5. The Application of Comparative Modeling
• Application 1: low accuracy models
• <30% sequence identity, having the correct fold
• Less than 50% of their Cα atoms within 3.5 Å of their correct positions
• Use: To confirm or reject a match between remotely related proteins
5. The Application of Comparative Modeling
• Application 2: middle accuracy models
• 30‐50% sequence identity
• 85% of their Cα atoms within 3.5 Å of their correct positions
• Use: Refinement of the functional prediction based on sequence; construction of site‐directed mutants with altered or destroyed binding capacity; other problems...
5. The Application of Comparative Modeling
• Application 3: high accuracy models
• >50% sequence identity
• The average accuracy of these models approaches that of low resolution X‐ray structures (3 Å resolution) or medium resolution NMR structures (10 distance restraints per residue)
• Use: For docking of small ligands or whole proteins onto a given protein.
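The three accuracy levels and their typical uses can be collected into a small lookup. This is only a restatement of the slides as code: the function name and the handling of the exact boundary values (30% and 50% are both treated as middle accuracy) are assumptions, since the slides do not say which level the boundaries belong to.

```python
def accuracy_level(percent_identity):
    """Map target-template sequence identity (in percent) to an expected accuracy level."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent_identity must be between 0 and 100")
    if percent_identity < 30:
        return "low"
    if percent_identity <= 50:
        return "middle"
    return "high"

# Typical uses, paraphrased from the slides above.
TYPICAL_USE = {
    "low": "confirm or reject a match between remotely related proteins",
    "middle": "refine sequence-based functional predictions; design site-directed mutants",
    "high": "dock small ligands or whole proteins onto the protein",
}

def suggested_use(percent_identity):
    """Return the typical application for a model built at this identity."""
    return TYPICAL_USE[accuracy_level(percent_identity)]
```

For example, `suggested_use(42)` returns the middle accuracy entry, because 42% identity falls in the 30‐50% range.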
6. Comparative modeling in structural genomics
• The aim of structural genomics is to determine or accurately predict the 3D structure of all the proteins encoded in the genomes.
• This aim will be achieved by a focused, large‐scale determination of protein structures by X‐ray crystallography and NMR spectroscopy, combined efficiently with accurate protein structure modeling techniques.
6. Comparative modeling in structural genomics
• For comparative modeling to contribute to structural genomics, automation of all the steps in the modeling process is essential.
• The automation of large‐scale comparative modeling involves assembling a software pipeline that consists of modules for fold assignment, template selection, target–template alignment, model generation, and model evaluation.
• Two examples of large‐scale comparative modeling for complete genomes:
SWISS‐MODEL: The sequences encoded in the E. coli genome have been used to build models for 10–15% of the proteins using the SWISS‐MODEL web server.
MODPIPE: MODPIPE produced models for five prokaryotic and eukaryotic genomes. This calculation resulted in models for substantial segments of 17.2%, 18.1%, 19.2%, 20.4%, and 15.7% of all proteins in the genomes of Saccharomyces cerevisiae (6218 proteins in the genome), Escherichia coli (4290 proteins), Mycoplasma genitalium (468 proteins), Caenorhabditis elegans (7299 proteins, incomplete), and Methanococcus jannaschii (1735 proteins), respectively.
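The module chain described above (fold assignment, template selection, target‐template alignment, model generation, model evaluation) can be sketched as a pipeline of pluggable stages. This is a hypothetical skeleton, not the design of SWISS‐MODEL or MODPIPE: the class name, the stage signatures, and the string placeholders for alignments and models are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

@dataclass
class ModelingPipeline:
    """Hypothetical automated pipeline; each stage is supplied as a function."""
    assign_fold: Callable[[str], List[str]]                 # target -> candidate templates
    select_template: Callable[[str, List[str]], Optional[str]]
    align: Callable[[str, str], str]                        # target, template -> alignment
    build_model: Callable[[str], str]                       # alignment -> model
    evaluate: Callable[[str], bool]                         # model -> accepted or not

    def model_one(self, target: str) -> Optional[str]:
        """Run all stages for one target; return None if no acceptable model results."""
        candidates = self.assign_fold(target)
        template = self.select_template(target, candidates)
        if template is None:
            return None
        alignment = self.align(target, template)
        model = self.build_model(alignment)
        return model if self.evaluate(model) else None

def genome_coverage(pipeline: ModelingPipeline, sequences: Sequence[str]) -> float:
    """Percentage of sequences for which the pipeline yields an accepted model."""
    if not sequences:
        raise ValueError("sequences must not be empty")
    modeled = sum(1 for seq in sequences if pipeline.model_one(seq) is not None)
    return 100.0 * modeled / len(sequences)
```

Coverage figures like those quoted for MODPIPE are percentages of this kind. Read in the other direction, 17.2% of the 6218 S. cerevisiae proteins corresponds to roughly 1,070 proteins with a model for a substantial segment.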
6. Comparative modeling in structural genomics
• Large‐scale comparative modeling will extend opportunities to tackle a myriad of problems by providing many protein models for many genomes.
Protein evolution
Drug design
A facile comparison of ligand binding requirements and substitutions in and around important residues ......
A specific example: the selection of a target protein for drug development
7. Conclusion
• Over the past few years, there has been a gradual increase in both the accuracy of comparative models and the fraction of protein sequences that can be modeled with useful accuracy.
• Further advances are necessary in recognizing weak sequence–structure similarities, aligning sequences with structures, modeling of rigid body shifts, distortions, loops and side chains, as well as detecting errors in a model.
• It is currently possible to model with useful accuracy significant parts of approximately one third of all known protein sequences.
• A major new challenge for comparative modeling is its integration with the torrents of data from genome sequencing projects as well as from functional and structural genomics.
References
• Martí‐Renom M A, Stuart A C, Fiser A, et al. Comparative protein structure modeling of genes and genomes[J]. Annual review of biophysics and biomolecular structure, 2000, 29(1): 291‐325.
• Šali A, Potterton L, Yuan F, et al. Evaluation of comparative protein modeling by MODELLER[J]. Proteins: Structure, Function, and Bioinformatics, 1995, 23(3): 318‐326.
• Fiser A, Do R K G, Šali A. Modeling of loops in protein structures[J]. Protein science, 2000, 9(9): 1753‐1773.
• Sánchez R, Šali A. Comparative protein structure modeling in genomics[J]. Journal of Computational Physics, 1999, 151(1): 388‐401.
Thank you