Predictive Methods Using Protein Sequences
Unit 23
BIOL221T: Advanced Bioinformatics for Biotechnology
Irene Gabashvili, PhD
Introduction
• Each protein starts its life as a shapeless string of amino acids (more exactly, residues)
• Primary > Secondary > 3D Structure
• Function depends on 3D structure
• 3D structure can be "guessed" from sequence, but more info is needed: folding environment, chaperonins, etc.
• Partial structural predictions can also be helpful
Amino Acid versus Residue
[Figure: structural formulas of a free amino acid, H2N-CH(R)-COOH, and of a residue as it appears within a chain, -NH-CH(R)-CO-]
--- next lecture: all on structures ---
From previous lecture:
• Often, knowing the sequence of the first 6 amino acids is enough to identify the protein
• "Terminal sequence identification" approach: a "label" ("backpack") is chemically attached to the end.
label-AA1-AA2-…-AAn
label-AA1
label-AA1-AA2-AA3
…
Riptide Algorithm (D. Carter et al.)
Terminal Amino Acid Sequence Prediction, e.g. MQIFVK
[Figure: Riptide Sequencing Algorithm. Mass spectra plotted as counts vs. m/z (amu), m/z axis 200 to 1200, counts axis 0 to 6000. Recoverable peak labels: label, label+M, label+MQ, label+MQI, label+MQIF, label+MQIFV, label+MQIFVK, with the mass of F (massF) marked; fragment ions b1+, a2+, a3+, b4+, b5+ for a labeled peptide P(L/I)SDQ...; a panel of occurrence counts vs. mass/charge.]
Mass spec data from an unknown protein (e.g. the protein shown is ubiquitin, whose amino acid sequence starts with MQIFVK…). See e.g. >gi|37571|emb|CAA44911.1| ubiquitin [Homo sapiens]
Simple algorithm
do x = 0 to 2
  do y = 0 to 2
    do z = 0 to 2
      Pxyz = MS(mx) + MS(mx+my) + MS(mx+my+mz)
Calculates a score value for each of the 20^3 (8,000) three-residue amino acid sequences in a nested loop fashion. Suppose there are only 3 amino acids (0, 1 and 2) with masses m0, m1 and m2 (3^3 = 27 ordered sequences) and no attached label. For the x-y-z sequence, the scoring function rewards the presence of the likely molecular fragments x, x-y and x-y-z.
The sequence generating the highest scoring Pxyz is reported as the most likely sequence for the unknown protein.
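The nested-loop scoring can be sketched in Python as follows. This is only an illustration: the residue masses and the spectrum in `MASSES` and `SPECTRUM` are made-up toy values, and `ms()` stands in for the MS() lookup as a plain dictionary read, which is an assumption about how the spectrum is stored.

```python
from itertools import product

# Hypothetical masses for the 3-letter toy alphabet (not real amino acid masses).
MASSES = {0: 10, 1: 13, 2: 17}

# Hypothetical spectrum: occurrence counts keyed by fragment mass.
SPECTRUM = {10: 5, 13: 50, 23: 40, 27: 3, 40: 60}


def ms(mass):
    """MS(m): counts observed at this mass, 0 when there is no peak."""
    return SPECTRUM.get(mass, 0)


def score(seq):
    """Pxyz = MS(mx) + MS(mx+my) + MS(mx+my+mz), i.e. a sum over prefix masses."""
    total, prefix_mass = 0, 0
    for residue in seq:
        prefix_mass += MASSES[residue]
        total += ms(prefix_mass)
    return total


def best_sequence(length=3):
    """Score every ordered sequence of the given length and return the best one."""
    return max(product(sorted(MASSES), repeat=length), key=score)


print(best_sequence(), score(best_sequence()))
```

With a real 20-letter alphabet the same loop visits 20^3 sequences for three positions, and the count grows by a factor of 20 with every extra position.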
Redundant lookups
[Figure: score tree for the 3-letter alphabet. Level 1: P0 = MS(m0), P1 = MS(m1), P2 = MS(m2). Level 2: P00 ... P22, each adding one lookup +MS(mx+my). Level 3: P000 ... P222, each adding one lookup +MS(mx+my+mz). A MAX over the 27 leaves gives the Highest Scoring Sequence.]
All highlighted MS() calls are equal due to the commutativity of addition, e.g. MS(m0+m1+m2) = MS(m0+m2+m1) = MS(m1+m0+m2).
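To get a feel for how much of the work is redundant, the short sketch below counts the full-length MS() lookups made by the nested loops against the number of distinct residue combinations behind them. The function name `lookup_counts` is made up for this illustration.

```python
from itertools import product


def lookup_counts(n_letters, length=3):
    """Return (lookups made for full-length sums, distinct residue combinations)."""
    sequences = list(product(range(n_letters), repeat=length))
    # Sorting a sequence discards the order, so permutations of the same
    # residues collapse to one key, just as their mass sums are equal.
    distinct = {tuple(sorted(seq)) for seq in sequences}
    return len(sequences), len(distinct)


print(lookup_counts(3))
print(lookup_counts(20))
```

For the 3-letter toy alphabet this gives 27 lookups for only 10 combinations. Two different combinations could still happen to share a mass, so the number of combinations is an upper bound on the lookups that are really needed.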
Need combinations
[Figure: the score tree with permutations of the same residues grouped together. P012, P021, P102, P120, P201 and P210 all end in the same lookup +MS(m0+m1+m2) and feed a single MAX node, C012.]
Cxyz = MAX(Pxyz, Pxzy, Pyxz, Pyzx, Pzxy, Pzyx)
Highest Scoring Combination
MS() calls collapsed from 6 to 1.
Riptide "combination space" sequencing
[Figure: combination-space score lattice for the 3-letter alphabet.
Level 1: C0 = MS(m0), C1 = MS(m1), C2 = MS(m2).
Level 2: C00, C01, C02, C11, C12, C22, each adding one lookup +MS(mx+my); mixed nodes such as C01 first take the MAX over their parents.
Level 3: C000, C001, C002, C011, C012, C022, C111, C112, C122, C222, each adding one lookup +MS(mx+my+mz) to the MAX over its parents.
A final MAX gives the Highest Scoring Combination.]
Highest scoring ordered sequence is easily derived from combination scores.
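One way to read the combination-space idea is as a small dynamic program: the score of a combination is one MS() lookup for its total mass plus the best score among the combinations obtained by removing one residue. The sketch below follows that reading. It is not the published Riptide implementation, and the masses, spectrum and function names are toy assumptions.

```python
from itertools import combinations_with_replacement

MASSES = {0: 10, 1: 13, 2: 17}                      # hypothetical toy masses
SPECTRUM = {10: 5, 13: 50, 23: 40, 27: 3, 40: 60}   # hypothetical toy spectrum


def ms(mass):
    """MS(m): counts observed at this mass, 0 when there is no peak."""
    return SPECTRUM.get(mass, 0)


def combination_scores(length=3):
    """Score each sorted residue combination with a single MS() lookup."""
    scores = {(): 0}
    parent = {}  # combination -> (best smaller combination, residue added last)
    for size in range(1, length + 1):
        for combo in combinations_with_replacement(sorted(MASSES), size):
            best_rest, best_removed = None, None
            for residue in sorted(set(combo)):
                rest = list(combo)
                rest.remove(residue)          # drop one copy; stays sorted
                rest = tuple(rest)
                if best_rest is None or scores[rest] > scores[best_rest]:
                    best_rest, best_removed = rest, residue
            total_mass = sum(MASSES[r] for r in combo)
            scores[combo] = scores[best_rest] + ms(total_mass)
            parent[combo] = (best_rest, best_removed)
    return scores, parent


def best_ordered_sequence(length=3):
    """Pick the best combination, then walk the parent links to recover the order."""
    scores, parent = combination_scores(length)
    best = max((c for c in scores if len(c) == length), key=scores.get)
    sequence, combo = [], best
    while combo:
        combo, removed = parent[combo]
        sequence.append(removed)
    return tuple(reversed(sequence)), scores[best]


print(best_ordered_sequence())
```

Each parent link records which residue was appended last, so following the links from the winning combination back to the empty one yields the ordered sequence without scoring every permutation.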
Crash course on biostatistics
• Statistics: analyzing data sets in terms of the relationships between the individual points
• Variance & Standard Deviation; Covariance
• Machine Learning approaches (supervised & unsupervised)
• Clustering vs Classification, PCA
• P-values & E-values, Scores via False positives, negatives
PCA
• Principal components analysis (PCA) is a technique that can be used to simplify a dataset
• It is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.
• PCA can be used for reducing dimensionality by eliminating the later principal components.
• Applications: face recognition, pattern finding
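As a rough illustration of "the greatest variance comes to lie on the first axis", the sketch below finds the first principal component of a small data set in plain Python. It uses power iteration on the covariance matrix, which is one simple way to get the leading eigenvector and is chosen here only to avoid external libraries; the function names and the sample points are made up.

```python
import math


def covariance_matrix(rows):
    """Return (sample covariance matrix, mean-centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def first_principal_component(cov, iterations=100):
    """Power iteration: repeatedly apply cov and renormalize.

    Assumes the start vector (all ones) is not orthogonal to the answer
    and that the data has non-zero variance.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance


def project(centered, axis):
    """Coordinates of each centered point along one axis (dimension reduction)."""
    return [sum(c[j] * axis[j] for j in range(len(axis))) for c in centered]


points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data on a line
cov, centered = covariance_matrix(points)
axis, variance = first_principal_component(cov)
print(axis, variance, project(centered, axis))
```

Because the toy points lie exactly on a line, a single component carries all of the variance and the 2D data reduces to one coordinate per point.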
What is Cluster Analysis?
• Cluster: a collection of data objects
  • Similar to the objects in the same cluster (intraclass similarity)
  • Dissimilar to the objects in other clusters (interclass dissimilarity)
• Cluster analysis
  • Statistical method for grouping a set of data objects into clusters
  • A good clustering method produces high quality clusters with high intraclass similarity and low interclass similarity
• Clustering is unsupervised classification
• Can be used as a stand-alone tool or as a preprocessing step for other algorithms
Group objects according to their similarity
Cluster: a set of objects that are similar to each other and separated from the other objects.
Example: green/red data points were generated from two different normal distributions
K-Means Clustering
The meaning of 'K-means'
• Why it is called 'K-means' clustering: K points are used to represent the clustering result; each point corresponds to the centre (mean) of a cluster
• Each point is assigned to the cluster with the closest center point
• The number K must be specified
Basic algorithm
K-means clustering (K=2)
1. Arbitrarily choose K objects as initial cluster centers
2. Assign each object to the most similar center
3. Update the cluster means
4. Reassign the objects and update the cluster means again, repeating until the assignments no longer change
[Figure: five scatter plots on 0-10 axes showing the assignments and cluster means after each step]
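The assign/update loop can be sketched in plain Python as below. The data points, the choice of the first two points as initial centers, and the function name `kmeans` are assumptions made for the example; Euclidean distance is used as the similarity measure.

```python
import math


def kmeans(points, initial_centers, max_iterations=100):
    """K-means with K = len(initial_centers); returns (centers, assignment)."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iterations):
        # Assign each point to the closest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:   # nothing moved: converged
            break
        assignment = new_assignment
        # Update each center to the mean of its current members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:                    # an empty cluster keeps its old center
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]   # toy data
centers, assignment = kmeans(points, initial_centers=points[:2])
print(centers, assignment)
```

The result depends on the initial centers, so in practice the procedure is often run several times from different starting points.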
Hierarchical clustering
Protocol
1. Calculate pairwise distance matrix
2. Find the two most similar genes or clusters
3. Merge the two selected clusters to produce a new cluster
4. Calculate pairwise distance matrix involving the new cluster
5. Repeat steps 2-4 until all objects are in one cluster
6. The clustering sequence is represented by a hierarchical tree (dendrogram).
EXAMPLE
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0
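The protocol can be tried on the example matrix with the sketch below. The slides do not say how the distance to a merged cluster is defined, so single linkage (the minimum distance between any two members) is assumed here; average or complete linkage would give different merge heights. Instead of rebuilding the matrix after each merge (step 4), the sketch recomputes cluster distances from the original matrix, which gives the same result for single linkage.

```python
LABELS = ["A", "B", "C", "D", "E"]
DIST = [
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
]


def single_linkage(labels, dist):
    """Return the merge sequence as (cluster_1, cluster_2, distance) tuples."""
    index = {name: i for i, name in enumerate(labels)}
    clusters = [(name,) for name in labels]

    def distance(c1, c2):
        # Assumed linkage rule: closest pair of members.
        return min(dist[index[a]][index[b]] for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        # Step 2: find the two most similar clusters (ties: first pair found).
        d, i, j = min(
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Step 3: merge them into a new cluster.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


for left, right, height in single_linkage(LABELS, DIST):
    print("".join(left), "+", "".join(right), "at distance", height)
```

The list of merges with their distances is exactly the information a dendrogram draws: which clusters join, and at what height.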
(M)ANOVA
• The analysis of variance technique in One-Way Analysis of Variance (ANOVA) takes a set of grouped data and determines whether the mean of a variable differs significantly between groups
• Often there are multiple variables and you are interested in determining whether the entire set of means is different from one group to the next
  • There is a multivariate version of analysis of variance that can address that problem (MANOVA)
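For the one-way case, the test statistic is the ratio of between-group to within-group variance. A minimal sketch, with made-up sample values and a made-up function name, is shown below; turning F into a p-value needs the F distribution, which is left out here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # toy measurements, three groups
print(one_way_anova_f(groups))
```

A large F means the group means are spread out more than the scatter inside the groups would explain.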
What is a Biomarker?
Biomarkers are biological molecules that are indicators of physiologic state and also change during a disease process. The utility of a biomarker lies in its ability to provide an early indication of a disease, to monitor disease progression, to provide ease of detection, and to provide a factor measurable across populations.
NCI: (Srinivas and Srivastava, Cancer Biomarker Research Group, review article, Vol. 8, 1160-69, 2002)
What Types of Biomarkers Exist?
• Genomic
  • DNA (e.g., BRCA-I gene mutations)
  • RNA (gene expression, up/down regulation)
• Proteomic
  • Peptides (e.g., PIF)
  • Proteins (e.g., HER2/neu, PSA, CA-125)
• Metabonomic
  • Small molecules, metabolites (e.g., glucose, cholesterol, cortisol)
MS-based
Bioinformatics tools can predict:
• Secondary Structure
• 3D Structure
• Interaction Sites
• Solvent Accessibility
• Transmembrane Segments
• Subcellular Localization
• Function
What Can Be Predicted?
• O-Glycosylation Sites
• Phosphorylation Sites
• Protease Cut Sites
• Nuclear Targeting Sites
• Mitochondrial Target Sites
• Chloroplast Target Sites
• Signal Sequences
• Signal Sequence Cleavage
• Peroxisome Target Sites
• ER Targeting Sites
• Transmembrane Sites
• Tyrosine Sulfation Sites
• GPInositol Anchor Sites
• PEST Sites
• Coiled-Coil Sites
• T-Cell/MHC Epitopes
• Protein Lifetime
• And a lot more…
Sequence Feature Servers
• T-Cell Epitope Prediction (no longer exists): http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm
• B cell epitope prediction from 3D structures: http://www.cbs.dtu.dk/services/DiscoTope/
• Predictions of promiscuous MHCI-restricted epitopes: http://immunax.dfci.harvard.edu/PEPVAC/
• O-Glycosylation Prediction: http://www.cbs.dtu.dk/services/NetOGlyc/
• Phosphorylation Prediction: http://www.cbs.dtu.dk/services/NetPhos/
Secondary Structure
• PHDsec: http://www.predictprotein.org
• Workbench: http://workbench.sdsc.edu/
• NGBW: www.ngbw.org
• PROFsec: http://cubic.bioc.columbia.edu/predictprotein/
• PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/)
• Jpred (http://www.compbio.dundee.ac.uk/~www-jpred/)
History of secondary structure prediction:
• The 1st generation: physicochemical principles, expert rules, and statistics (1970s, 50% accuracy)
• The 2nd generation methods: sliding window that walked through the entire sequence (1980s into the 1990s, ~60% accuracy).
• The 3rd generation methods use multiple sequence alignments and take advantage of the evolutionary information (~75% accuracy).
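The "sliding window" idea of the 2nd generation can be illustrated with a generic window average over a per-residue scale. The sketch below is not a secondary structure predictor: it swaps in the Kyte-Doolittle hydropathy scale simply because it is a well-known per-residue table, and the window width and function name are arbitrary choices for the example.

```python
# Kyte-Doolittle hydropathy values, used here only as an example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_average(sequence, scale, window=7):
    """Average the scale over a window centered on each residue.

    Windows are truncated at the sequence ends rather than padded.
    """
    half = window // 2
    values = [scale[aa] for aa in sequence]
    profile = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        profile.append(sum(values[lo:hi]) / (hi - lo))
    return profile


profile = sliding_window_average("MQIFVK", KYTE_DOOLITTLE, window=3)
print([round(v, 2) for v in profile])
```

A window-based predictor would replace the plain average with a rule or a trained model that maps each window to a state such as helix, strand or coil.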
Tutorials/Description:
• PredictProtein: sequence analysis, prediction of protein function and structure
• The PredictProtein Server. Nucleic Acids Research 32(Web Server issue):W321-W326.
Interaction sites
• http://cubic.bioc.columbia.edu/services/
• See also: http://bioinformatics.ca/links_directory/narweb2007/ (same for 2006-2003)
• http://gemdock.life.nctu.edu.tw/3D-partner/vers1/index.php (predicts interaction partners)
• http://ef-site.hgc.jp/eF-seek/index.jsp
Solvent Accessibility
• PHDacc (http://www.predictprotein.org/)
• PROFacc (http://cubic.bioc.columbia.edu/predictprotein/)
• Jpred (http://www.compbio.dundee.ac.uk/~www-jpred/)
Transmembrane Segments
• TopPred (http://bioweb.pasteur.fr/seqanal/interfaces/toppred.html)
• TMHMM (http://www.cbs.dtu.dk/services/TMHMM/)
• Membrane Helix Prediction: http://www.cbs.dtu.dk/services/TMHMM-2.0/
Subcellular Localization
• PSORT: http://psort.ims.u-tokyo.ac.jp/
• TargetP: http://www.cbs.dtu.dk/services/TargetP/
• http://cubic.bioc.columbia.edu/db/LOC3d/index.html